Overview

Chunker provides intelligent text chunking using the Chonkie library. It preserves semantic boundaries and produces optimally sized chunks for embedding and retrieval.

Constructor

from mini.chunker import Chunker

chunker = Chunker(lang="en")

Parameters

lang
str
default:"en"
Language code for the text (e.g., “en”, “es”, “fr”, “de”)

Example

from mini.chunker import Chunker

# English text
chunker_en = Chunker(lang="en")

# Spanish text
chunker_es = Chunker(lang="es")

Methods

chunk

Split text into semantic chunks.

def chunk(self, text: str) -> List[Chunk]

Parameters

text
str
required
The text to split into chunks

Returns

chunks
List[Chunk]
List of Chunk objects with text and metadata

Example

from mini.chunker import Chunker

chunker = Chunker()

text = """
# Introduction

This is a long document that needs to be split into manageable chunks.

## Section 1

Content for section 1...

## Section 2

Content for section 2...
"""

chunks = chunker.chunk(text)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:")
    print(f"  Text: {chunk.text[:100]}...")
    print(f"  Tokens: {chunk.token_count}")

Chunk Object

Each chunk returned by the chunk method has the following attributes:

Attributes

text
str
The text content of the chunk
token_count
int
Number of tokens in the chunk (approximate)
start_index
int
Starting character position in the original text
end_index
int
Ending character position in the original text

Example

chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Text: {chunk.text}")
    print(f"Tokens: {chunk.token_count}")
    print(f"Position: {chunk.start_index}-{chunk.end_index}")
    print()
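
Since start_index and end_index are character positions in the original text, each chunk can be mapped back to its source span. A minimal check (this assumes the chunker preserves the text verbatim, so the slice matches chunk.text):

chunks = chunker.chunk(text)

for chunk in chunks:
    # Recover the span of the original document this chunk came from
    span = text[chunk.start_index:chunk.end_index]
    # Should equal chunk.text if the chunker does not normalize whitespace
    print(span == chunk.text)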

Chunking Strategy

The Chunker uses Chonkie’s markdown recipe, which:
  • Respects semantic boundaries: Headers, paragraphs, lists
  • Maintains context: Keeps related content together
  • Optimizes token count: Creates chunks suitable for embedding models
  • Preserves structure: Maintains markdown formatting
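
You can observe these boundaries directly by printing where each chunk starts. This is plain inspection code, not part of the API; with markdown input, chunks typically begin at a header or paragraph break:

chunks = chunker.chunk(text)

for chunk in chunks:
    lines = chunk.text.strip().splitlines()
    first_line = lines[0] if lines else "<empty>"
    # Show each chunk's starting offset and opening line
    print(f"{chunk.start_index:>6}  {first_line}")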

Default Behavior

chunker = Chunker()

# Uses markdown recipe with:
# - Target chunk size: ~512 tokens
# - Respects markdown structure
# - Preserves headers and context

Complete Example

from mini.loader import DocumentLoader
from mini.chunker import Chunker

# Load document
loader = DocumentLoader()
text = loader.load("document.pdf")

print(f"Document length: {len(text)} characters")

# Chunk text
chunker = Chunker()
chunks = chunker.chunk(text)

print(f"Created {len(chunks)} chunks")

# Analyze chunks
total_tokens = sum(chunk.token_count for chunk in chunks)
avg_tokens = total_tokens / len(chunks)

print(f"Total tokens: {total_tokens}")
print(f"Average tokens per chunk: {avg_tokens:.1f}")

# Show first few chunks
for i, chunk in enumerate(chunks[:3]):
    print(f"\nChunk {i+1} ({chunk.token_count} tokens):")
    print(chunk.text[:200] + "...")

Integration Example

Using Chunker in a complete pipeline:

from mini import AgenticRAG, EmbeddingModel, VectorStore
from mini.loader import DocumentLoader
from mini.chunker import Chunker

# Shared components (VectorStore/EmbeddingModel configuration omitted for brevity)
embedding_model = EmbeddingModel()
vector_store = VectorStore()

# Using AgenticRAG (automatic chunking)
rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model
)

# Chunking happens automatically
rag.index_document("document.pdf")

# Manual chunking for custom pipelines
loader = DocumentLoader()
chunker = Chunker()

text = loader.load("document.pdf")
chunks = chunker.chunk(text)
embeddings = embedding_model.embed_chunks([c.text for c in chunks])

# Store in vector store
vector_store.insert(
    embeddings=embeddings,
    texts=[c.text for c in chunks],
    metadata=[{"chunk_id": i} for i in range(len(chunks))]
)

Best Practices

The default chunk size (~512 tokens) works well for most embedding models:
  • OpenAI text-embedding-3-small: 8191 tokens max
  • Most content fits well in 512-token chunks
  • Provides good balance of context and retrieval precision
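
For example, a quick pre-flight check using the chunker's approximate counts can confirm that every chunk fits before embedding (8191 is the input limit of text-embedding-3-small):

MODEL_LIMIT = 8191  # input limit of text-embedding-3-small

oversized = [c for c in chunks if c.token_count > MODEL_LIMIT]
if oversized:
    print(f"{len(oversized)} chunk(s) exceed the model limit")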

Use the appropriate language code:

# English
chunker_en = Chunker(lang="en")

# Spanish
chunker_es = Chunker(lang="es")

# German
chunker_de = Chunker(lang="de")

The chunker works best with markdown-formatted text:
  • Respects headers and sections
  • Preserves lists and code blocks
  • Maintains document structure

Token counts are approximate:

# Get accurate token count if needed
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
accurate_count = len(enc.encode(chunk.text))

Troubleshooting

If chunks are too large for your embedding model:

# Check token counts
for chunk in chunks:
    if chunk.token_count > 8000:
        print(f"Warning: Large chunk ({chunk.token_count} tokens)")

Consider pre-processing the text or splitting long sections.
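
One simple approach is to split any oversized chunk on paragraph boundaries before embedding. A rough sketch (split_oversized and the max_tokens threshold are illustrative, not part of the library):

def split_oversized(chunks, max_tokens=8000):
    """Split chunks above max_tokens on blank-line (paragraph) boundaries."""
    texts = []
    for chunk in chunks:
        if chunk.token_count <= max_tokens:
            texts.append(chunk.text)
        else:
            # Naive fallback: emit each paragraph separately; a very long
            # paragraph would still need further splitting (e.g. by sentence)
            texts.extend(p for p in chunk.text.split("\n\n") if p.strip())
    return texts

texts = split_oversized(chunks)
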
If chunks are too small and lack context:
  • Ensure markdown formatting is present
  • Check that headers and paragraphs are properly formatted
  • Consider the overall document structure; as a fallback, you can merge undersized chunks, as sketched below
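
If the document structure cannot be improved, a fallback is to merge undersized chunks into their neighbors before embedding. A minimal sketch (merge_small and min_tokens are illustrative, not library APIs):

def merge_small(chunks, min_tokens=100):
    """Greedily fold chunks below min_tokens into the preceding text."""
    merged = []
    for chunk in chunks:
        if merged and chunk.token_count < min_tokens:
            # Join a tiny chunk onto its predecessor to keep context together
            merged[-1] = merged[-1] + "\n\n" + chunk.text
        else:
            merged.append(chunk.text)
    return merged

texts = merge_small(chunks)
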
For very large documents:

# Process in segments
segment_size = 100000  # characters
all_chunks = []

for i in range(0, len(text), segment_size):
    segment = text[i:i + segment_size]
    # Note: segments are cut at fixed character offsets, so a boundary may
    # fall mid-sentence, and start_index/end_index are relative to the
    # segment; add i to recover positions in the full document
    chunks = chunker.chunk(segment)
    all_chunks.extend(chunks)

See Also