Overview
Text chunking is a critical step in RAG systems. The `Chunker` class uses Chonkie, a smart text-chunking library, to split documents into optimal chunks that:
- Preserve semantic meaning
- Fit within embedding model token limits
- Maintain context and relationships
- Optimize for retrieval quality
Basic Usage
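A minimal sketch of direct usage (the `Chunker` import path and constructor arguments are assumptions about Mini RAG's layout; check the source for the exact signatures):

```python
# Hypothetical Mini RAG API; the import path and arguments are assumptions.
from mini_rag.chunker import Chunker

chunker = Chunker(chunk_size=512)

with open("docs/guide.md") as f:
    text = f.read()

chunks = chunker.chunk(text)
print(f"Produced {len(chunks)} chunks")
```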
Chunk Structure
Each chunk is an object with:
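Based on Chonkie's `Chunk` dataclass, these are the fields you can expect (verify the attribute names against your installed Chonkie version):

```python
chunk = chunks[0]  # a chunk from the Basic Usage example above

chunk.text         # the chunk's text content
chunk.start_index  # character offset where the chunk starts in the source text
chunk.end_index    # character offset where the chunk ends
chunk.token_count  # number of tokens in the chunk
```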
Chunking Strategies
Chonkie provides multiple chunking strategies. By default, Mini RAG uses the Markdown recipe, which is optimized for most text types.
Markdown Recipe (Default)
The Markdown recipe (see the sketch after this list):
- Respects document structure (headings, paragraphs)
- Maintains context boundaries
- Handles code blocks and lists appropriately
- Works well for technical documentation
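A sketch of using the Markdown recipe directly through Chonkie's `RecursiveChunker` (the `from_recipe` call is Chonkie's recipe loader; the sample text is illustrative):

```python
from chonkie import RecursiveChunker

markdown_text = "# Title\n\nSome intro text.\n\n## Section\n\n- item one\n- item two\n"

# The recipe supplies splitting rules for headings, paragraphs,
# code blocks, and lists.
chunker = RecursiveChunker.from_recipe("markdown", lang="en")

for chunk in chunker.chunk(markdown_text):
    print(chunk.token_count, repr(chunk.text[:60]))
```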
Custom Chunk Size
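If the default size does not fit your content, a sketch of setting it explicitly with Chonkie's `TokenChunker` (whether Mini RAG's `Chunker` forwards these parameters is an assumption):

```python
from chonkie import TokenChunker

# Smaller chunks favor retrieval precision; larger chunks keep more context.
chunker = TokenChunker(chunk_size=256, chunk_overlap=32)
chunks = chunker.chunk("Your document text goes here...")
```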
Integration with AgenticRAG
When using `AgenticRAG`, chunking is handled automatically:
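A sketch of the automatic path (the `AgenticRAG` constructor and the `index` and `query` methods are assumptions about Mini RAG's API):

```python
# Hypothetical Mini RAG API; method names are assumptions.
from mini_rag import AgenticRAG

rag = AgenticRAG()

# Documents are chunked internally before embedding and indexing.
rag.index(documents=["docs/guide.md", "docs/api.md"])
answer = rag.query("How do I configure chunk size?")
```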
Why Chunking Matters
Token Limits
Embedding models have token limits (e.g., 8192 for text-embedding-3-small)
Retrieval Precision
Smaller chunks improve precision by matching specific passages
Context Preservation
Proper chunking maintains semantic relationships
Answer Quality
Better chunks lead to more relevant context for the LLM
Chunking Strategies Comparison
| Strategy | Best For | Pros | Cons |
|---|---|---|---|
| Markdown | General text, docs | Respects structure | May create variable sizes |
| Semantic | Long documents | Preserves meaning | More computation |
| Fixed Size | Uniform processing | Predictable sizes | May break context |
| Sentence | Short answers | High precision | May lose context |
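Each row maps to a Chonkie chunker class. A sketch of instantiating them (constructor defaults vary by Chonkie version, and `SemanticChunker` downloads an embedding model on first use):

```python
from chonkie import RecursiveChunker, SemanticChunker, SentenceChunker, TokenChunker

markdown = RecursiveChunker.from_recipe("markdown", lang="en")  # structure-aware
semantic = SemanticChunker(chunk_size=512)                      # meaning-preserving
fixed = TokenChunker(chunk_size=512, chunk_overlap=64)          # uniform sizes
sentence = SentenceChunker(chunk_size=256)                      # sentence-level
```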
Advanced Usage
Custom Preprocessing
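One way to clean documents before chunking, as a sketch (the normalization steps here are assumptions about what your corpus needs):

```python
import re

from chonkie import RecursiveChunker

def preprocess(text: str) -> str:
    """Normalize whitespace so formatting noise doesn't skew chunk boundaries."""
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse excessive blank lines
    return text.strip()

chunker = RecursiveChunker.from_recipe("markdown", lang="en")
chunks = chunker.chunk(preprocess("# Title\n\n\n\nBody   with   messy   spacing.\n"))
```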
Chunk Metadata
When indexing with `AgenticRAG`, you can add metadata to chunks:
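A sketch of attaching metadata at index time (the `metadata` parameter is an assumption about Mini RAG's API):

```python
# Hypothetical Mini RAG API; the metadata parameter is an assumption.
from mini_rag import AgenticRAG

rag = AgenticRAG()
rag.index(
    documents=["docs/guide.md"],
    metadata={"source": "docs/guide.md", "version": "1.2"},
)
```

Inspecting Chunks
To see how a document was split, run the chunker directly and inspect the results:

```python
from chonkie import RecursiveChunker

chunker = RecursiveChunker.from_recipe("markdown", lang="en")
with open("docs/guide.md") as f:
    for i, chunk in enumerate(chunker.chunk(f.read())):
        print(f"chunk {i}: {chunk.token_count} tokens, "
              f"chars {chunk.start_index}-{chunk.end_index}")
```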
Best Practices
Chunk Size Selection
Optimal chunk size depends on your use case:
- Question Answering: 256-512 tokens (precise answers)
- Document Summary: 512-1024 tokens (more context)
- Code Search: 256-512 tokens (function/class level)
- Long Documents: 512-1024 tokens (more context)
Overlap Strategy
Use overlap to maintain context across boundaries (see the sketch after this list):
- 10-20% overlap for general text
- 20-30% overlap for technical content
- More overlap = more chunks but better recall
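For example, roughly 15% overlap on 512-token chunks, sketched with Chonkie's `TokenChunker`:

```python
from chonkie import TokenChunker

# ~15% overlap: adjacent chunks share about 77 tokens of context.
chunker = TokenChunker(chunk_size=512, chunk_overlap=77)
```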
Document Structure
Preserve document structure when possible:
- Keep headings with their content
- Don’t split tables or code blocks
- Maintain paragraph integrity
- The Markdown recipe does this automatically
Token Counting
Be aware of tokenization differences (see the sketch after this list):
- Different models use different tokenizers
- Chunk size is approximate
- Test with your specific embedding model
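A sketch of checking chunks against your embedding model's own tokenizer, assuming OpenAI embeddings and the `tiktoken` package:

```python
import tiktoken
from chonkie import TokenChunker

enc = tiktoken.encoding_for_model("text-embedding-3-small")
chunker = TokenChunker(chunk_size=512, chunk_overlap=64)

with open("docs/guide.md") as f:
    for chunk in chunker.chunk(f.read()):
        n = len(enc.encode(chunk.text))
        # Chunk sizes are approximate; verify against the real tokenizer.
        assert n <= 8192, f"chunk exceeds the embedding model limit ({n} tokens)"
```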
Common Patterns
Pattern 1: Standard Document Processing
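A sketch of the standard flow, where `AgenticRAG` handles chunking internally (API names are assumptions, as above):

```python
# Hypothetical Mini RAG API; chunking happens inside index().
from mini_rag import AgenticRAG

rag = AgenticRAG()
rag.index(documents=["docs/guide.md"])
print(rag.query("What are the token limits?"))
```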
Pattern 2: Custom Chunking with Manual Pipeline
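A sketch of chunking manually with Chonkie and handing the pieces to your own embedding and storage code (the `embed` and `store` calls are placeholders):

```python
from chonkie import RecursiveChunker

chunker = RecursiveChunker.from_recipe("markdown", lang="en")
with open("docs/guide.md") as f:
    chunks = chunker.chunk(f.read())

texts = [chunk.text for chunk in chunks]
# embeddings = embed(texts)   # your embedding call goes here
# store(texts, embeddings)    # your vector-store insert goes here
```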
Pattern 3: Batch Processing
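A sketch of chunking several documents in one call, assuming Chonkie's `chunk_batch` helper (fall back to a loop if your version lacks it):

```python
from chonkie import RecursiveChunker

chunker = RecursiveChunker.from_recipe("markdown", lang="en")

paths = ["docs/guide.md", "docs/api.md", "docs/faq.md"]
docs = [open(p).read() for p in paths]

# One batch call instead of a per-document loop.
for path, doc_chunks in zip(paths, chunker.chunk_batch(docs)):
    print(path, len(doc_chunks), "chunks")
```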
Performance Tips
Batch Operations
Process multiple documents together for better performance
Cache Results
Cache chunked results if reprocessing same documents
Monitor Token Count
Keep chunks under embedding model limits
Optimize Structure
Clean document structure before chunking
Troubleshooting
Chunks too large
Solution: Use a smaller target chunk size:
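For example, with Chonkie's `TokenChunker` (adjust if Mini RAG exposes a chunk-size setting of its own):

```python
from chonkie import TokenChunker

# Drop to a smaller target size so chunks stay well under model limits.
chunker = TokenChunker(chunk_size=256, chunk_overlap=32)
```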
Chunks breaking context
Solution: Increase overlap or use semantic chunking:
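Two options, sketched with Chonkie (`SemanticChunker` downloads a small embedding model on first use):

```python
from chonkie import SemanticChunker, TokenChunker

# Option 1: more overlap carries context across boundaries.
overlapping = TokenChunker(chunk_size=512, chunk_overlap=128)

# Option 2: split at semantic boundaries instead of fixed token counts.
semantic = SemanticChunker(chunk_size=512)
```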
Performance issues
Solution: The default Markdown recipe is optimized for performance. If you need faster processing:
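One option is plain token chunking, which skips structure analysis and trades boundary quality for speed (a sketch; measure before switching):

```python
from chonkie import TokenChunker

# Fastest path: fixed-size token chunks, no structural parsing.
chunker = TokenChunker(chunk_size=512, chunk_overlap=0)
```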
