Overview
Chunker provides intelligent text chunking using the Chonkie library. It preserves semantic boundaries and creates appropriately sized chunks for embedding and retrieval.
Constructor
Parameters
Language code for the text (e.g., "en", "es", "fr", "de")
Example
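The original example was not preserved on this page. As a hedged sketch (the class name and the `language` keyword come from the parameter description above; the body is a stub, since the real class delegates to Chonkie):

```python
class Chunker:
    """Minimal stand-in for the wrapper described above.

    The real class delegates to Chonkie's markdown recipe; this stub only
    records the language code to illustrate the constructor interface.
    """
    def __init__(self, language: str = "en"):
        self.language = language

chunker = Chunker(language="en")   # chunker for English text
print(chunker.language)
```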
Methods
chunk
Split text into semantic chunks.
Parameters
The text to split into chunks
Returns
List of Chunk objects with text and metadata
Example
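The original code sample was not preserved. The calling pattern can be sketched as follows; a namedtuple stands in for the real Chunk class, and naive blank-line splitting stands in for Chonkie's recipe, so only the interface (not the chunk boundaries) matches the real method:

```python
from collections import namedtuple

# Same attribute names as documented in "Chunk Object" below
Chunk = namedtuple("Chunk", "text token_count start_index end_index")

def chunk(text: str) -> list[Chunk]:
    # Stand-in for Chunker.chunk: split on blank lines, track positions.
    out, pos = [], 0
    for part in text.split("\n\n"):
        start = text.index(part, pos)
        out.append(Chunk(part, len(part.split()), start, start + len(part)))
        pos = start + len(part)
    return out

chunks = chunk("First paragraph.\n\nSecond paragraph here.")
print(len(chunks))       # 2
print(chunks[0].text)    # First paragraph.
```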
Chunk Object
Each chunk returned by the chunk method has the following attributes:
Attributes
The text content of the chunk
Number of tokens in the chunk (approximate)
Starting character position in the original text
Ending character position in the original text
Example
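The original example was not preserved. A sketch of the attributes listed above (the Chunk dataclass here is a hypothetical stand-in mirroring those four fields) shows the key invariant: the start and end positions index the chunk back into the original text.

```python
from dataclasses import dataclass

@dataclass
class Chunk:  # mirrors the attributes documented above
    text: str
    token_count: int
    start_index: int
    end_index: int

original = "Alpha bravo.\n\nCharlie delta echo."
c = Chunk(text="Charlie delta echo.", token_count=3,
          start_index=14, end_index=33)

# start_index/end_index locate the chunk in the original text
assert original[c.start_index:c.end_index] == c.text
print(c.token_count)  # 3 (approximate)
```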
Chunking Strategy
The Chunker uses Chonkie's markdown recipe, which:
- Respects semantic boundaries: Headers, paragraphs, lists
- Maintains context: Keeps related content together
- Optimizes token count: Creates chunks suitable for embedding models
- Preserves structure: Maintains markdown formatting
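To illustrate what "respecting semantic boundaries" means, here is a toy header-aware splitter. This is not Chonkie's actual recipe (which also handles token budgets, lists, and code blocks); it only shows the idea of breaking at structural boundaries rather than at arbitrary character offsets:

```python
import re

def split_at_headers(md: str) -> list[str]:
    # Toy illustration: start a new piece at each markdown header,
    # keeping a header together with the content that follows it.
    pieces, current = [], []
    for line in md.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            pieces.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        pieces.append("\n".join(current))
    return pieces

doc = "# Intro\nHello.\n## Details\nMore text."
print(split_at_headers(doc))
```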
Default Behavior
Complete Example
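The original complete example was not preserved. The sketch below is self-contained: the `Chunker` stand-in fakes Chonkie's behavior with naive blank-line splitting, so only the interface and the Chunk attributes, not the actual chunk boundaries, match the real class.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    token_count: int
    start_index: int
    end_index: int

class Chunker:
    """Interface stand-in; the real class delegates to Chonkie."""
    def __init__(self, language: str = "en"):
        self.language = language

    def chunk(self, text: str) -> list[Chunk]:
        chunks, pos = [], 0
        for part in text.split("\n\n"):
            start = text.index(part, pos)
            chunks.append(Chunk(part, len(part.split()),
                                start, start + len(part)))
            pos = start + len(part)
        return chunks

document = """# Title

Intro paragraph with a few words.

## Section

Body text for the section."""

chunker = Chunker(language="en")
for i, c in enumerate(chunker.chunk(document)):
    print(f"chunk {i}: tokens={c.token_count} "
          f"span=({c.start_index},{c.end_index})")
```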
Integration Example
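The original pipeline example was not preserved. The toy sketch below shows the shape of an index-then-retrieve pipeline; the chunking, "embedding," and similarity functions are all hypothetical stand-ins (a real pipeline would use Chunker plus an actual embedding model and vector store):

```python
def chunk_text(text: str) -> list[str]:
    # Stand-in for Chunker.chunk: naive blank-line splitting.
    return [p for p in text.split("\n\n") if p.strip()]

def embed(text: str) -> set[str]:
    # Toy "embedding": a bag of lowercase words, NOT a real vector model.
    return {w.strip(".,!?") for w in text.lower().split()}

def similarity(a: set[str], b: set[str]) -> float:
    # Jaccard overlap as a stand-in for cosine similarity.
    return len(a & b) / (len(a | b) or 1)

# Index step: chunk the document and "embed" each chunk
doc = "Cats purr when happy.\n\nDogs bark at strangers.\n\nFish swim in tanks."
index = [(c, embed(c)) for c in chunk_text(doc)]

# Query step: embed the query and return the best-matching chunk
query = embed("why do dogs bark")
best = max(index, key=lambda item: similarity(query, item[1]))
print(best[0])
```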
Using Chunker in a complete pipeline means chunking each document, embedding the chunks, and storing the vectors for retrieval.
Best Practices
Chunk Size
The default chunk size (~512 tokens) works well for most embedding models:
- OpenAI text-embedding-3-small: 8191 tokens max
- Most content fits well in 512-token chunks
- Provides good balance of context and retrieval precision
Language-Specific Chunking
Use the appropriate language code:
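The original snippet here was not preserved. A minimal sketch, using the `language` keyword assumed from the constructor description above:

```python
class Chunker:  # minimal stand-in; only the language code matters here
    def __init__(self, language: str = "en"):
        self.language = language

# One chunker per language, matching the codes listed in Parameters
for code in ("en", "es", "fr", "de"):
    print(Chunker(language=code).language)
```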
Markdown Documents
The chunker works best with markdown-formatted text:
- Respects headers and sections
- Preserves lists and code blocks
- Maintains document structure
Token Counting
Token counts are approximate: the exact count depends on the tokenizer used by your embedding model, so treat the reported values as estimates.
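A common rule of thumb for estimating tokens in English text (an illustration, not the chunker's actual counting method) is roughly four characters per token:

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic for English text: about 4 characters per token.
    # The true count depends on your embedding model's tokenizer, so
    # treat this, like the chunker's token_count, as an estimate only.
    return max(1, len(text) // 4)

sample = "Token counts reported by the chunker are estimates."
print(approx_tokens(sample))
```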
Troubleshooting
Chunks Too Large
If chunks are too large for your embedding model, consider pre-processing the text or splitting long sections before chunking.
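One pre-processing approach (a sketch, not part of the documented API) is to re-split any chunk whose estimated size exceeds the model limit:

```python
def split_oversized(text: str, max_tokens: int) -> list[str]:
    # Greedy re-split: pack whole words into pieces of at most
    # max_tokens words (word count as a crude token estimate).
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

print(split_oversized("one two three four five six seven", max_tokens=3))
```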
Chunks Too Small
If chunks are too small and lack context:
- Ensure markdown formatting is present
- Check that headers and paragraphs are properly formatted
- Consider the document structure
Memory Issues
For very large documents, process the text in sections and stream chunks as they are produced rather than collecting them all in memory.
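One way to do this (a sketch, not a documented feature) is a generator that chunks section by section, so the full list of chunks is never materialized:

```python
from typing import Iterable, Iterator

def iter_chunks(sections: Iterable[str]) -> Iterator[str]:
    # Lazily chunk each section (naive blank-line splitting stands in
    # for the real chunker) and yield chunks one at a time, keeping
    # only the current section in memory.
    for section in sections:
        for part in section.split("\n\n"):
            if part.strip():
                yield part

for chunk in iter_chunks(["A.\n\nB.", "C."]):
    print(chunk)
```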
