Overview
Reranking improves retrieval quality by re-scoring and reordering retrieved chunks based on their relevance to the query. Mini RAG supports multiple reranking strategies.

Why Reranking?
Initial retrieval uses embedding similarity (cosine/dot product), which may not capture all aspects of relevance. Reranking uses more sophisticated models to:

- Improve precision: Keep only the most relevant chunks
- Better relevance scoring: More accurate than embedding similarity alone
- Reduce noise: Filter out marginally relevant results
Available Rerankers
- LLM-based: Uses your configured LLM
- Cohere: Specialized reranking API
- Sentence Transformer: Local cross-encoder models
- None: Disable reranking
Comparison
| Reranker | Quality | Speed | Cost | Privacy | Setup |
|---|---|---|---|---|---|
| Cohere | ⭐⭐⭐⭐⭐ | ⚡⚡⚡ | 💰💰 | Cloud | Easy |
| LLM-based | ⭐⭐⭐⭐ | ⚡⚡ | 💰💰 | Cloud | Easy |
| Sentence Transformer | ⭐⭐⭐ | ⚡⚡⚡⚡ | Free | Local | Moderate |
| None | ⭐⭐ | ⚡⚡⚡⚡⚡ | Free | - | None |
How Reranking Works
Pipeline:

1. Retrieve top_k chunks (e.g., 10) using embedding similarity
2. Rerank these chunks with a more sophisticated model
3. Keep rerank_top_k best chunks (e.g., 3)
4. Generate answer using only the top chunks
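The pipeline above can be sketched in plain Python. This is an illustration, not Mini RAG's actual code; score_fn stands in for any reranker (an LLM judge, Cohere's API, or a local cross-encoder), and the toy term-overlap scorer exists only to make the example self-contained:

```python
def rerank(query, chunks, score_fn, rerank_top_k):
    """Re-score retrieved chunks and keep the best rerank_top_k."""
    scored = [(score_fn(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:rerank_top_k]]

# Toy scorer: count query-term overlap. A real reranker would use a model.
def overlap_score(query, chunk):
    return len(set(query.lower().split()) & set(chunk.lower().split()))

chunks = [
    "Reranking improves retrieval quality.",
    "Bananas are rich in potassium.",
    "Retrieval quality depends on embeddings.",
]
top = rerank("retrieval quality", chunks, overlap_score, rerank_top_k=2)
# top == [chunks[2], chunks[0]]; the off-topic chunk is filtered out
```

Only the surviving chunks are passed to the LLM for answer generation.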
Basic Usage
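A usage sketch under assumed names: the MiniRAG class, its constructor parameters, and the query method below are illustrative, not confirmed API — check your installed version:

```python
# Hypothetical Mini RAG usage -- names are illustrative.
rag = MiniRAG(
    reranker="cohere",   # assumed reranker identifier
    top_k=10,            # candidates from initial retrieval
    rerank_top_k=3,      # chunks kept after reranking
)
answer = rag.query("How do I rotate my API keys?")
```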
Choosing a Reranker
Use Cohere When:
- Quality is paramount
- You have budget for API calls
- Processing English text
- Want best-in-class reranking
Use LLM-based When:
- Already using an LLM for generation
- Want simple setup (no extra API)
- Have good LLM access
- Budget for LLM tokens
Use Sentence Transformer When:
- Need local/private deployment
- Have GPU available
- Want to avoid API costs
- Processing many queries
Use None When:
- Speed is critical
- Initial retrieval is good enough
- Processing simple queries
- Minimizing latency/cost
Configuration Examples
Production (Cohere)
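A hypothetical configuration sketch — the constructor and parameter names are illustrative, and this assumes a Cohere API key is available in your environment:

```python
# Hypothetical Mini RAG config -- parameter names are illustrative.
rag = MiniRAG(
    reranker="cohere",                   # assumed reranker identifier
    rerank_model="rerank-english-v3.0",  # a Cohere rerank model; verify availability
    top_k=10,
    rerank_top_k=3,
)
```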
Development (LLM)
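A hypothetical sketch — parameter names are illustrative; the idea is to reuse the generation LLM as the reranking judge, so no extra API is needed:

```python
# Hypothetical Mini RAG config -- parameter names are illustrative.
rag = MiniRAG(
    reranker="llm",   # reuse the configured generation LLM as a judge
    top_k=10,
    rerank_top_k=3,
)
```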
Local (Sentence Transformer)
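A hypothetical sketch — parameter names are illustrative; the model ID is a real, small cross-encoder from the sentence-transformers model hub that runs fully locally:

```python
# Hypothetical Mini RAG config -- parameter names are illustrative.
rag = MiniRAG(
    reranker="sentence_transformer",
    rerank_model="cross-encoder/ms-marco-MiniLM-L-6-v2",  # small local cross-encoder
    top_k=10,
    rerank_top_k=3,
)
```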
Disabled
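A hypothetical sketch — parameter names are illustrative; with reranking disabled, the top_k chunks are used in their original retrieval order:

```python
# Hypothetical Mini RAG config -- parameter names are illustrative.
rag = MiniRAG(reranker=None)  # skip reranking entirely
```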
Performance Impact
Latency
- Cohere: +50-100ms
- LLM-based: +500-1000ms (depends on LLM)
- Sentence Transformer: +100-200ms (GPU), +500ms (CPU)
- None: 0ms
Cost
- Cohere: ~$1 per 1000 searches (1000 docs)
- LLM-based: Depends on LLM pricing and chunk size
- Sentence Transformer: Free (after model download)
- None: Free
Quality Improvement
Typical improvements in retrieval quality:

- Cohere: +20-40% better precision
- LLM-based: +15-30% better precision
- Sentence Transformer: +10-25% better precision
- None: Baseline
Best Practices
Tune top_k and rerank_top_k
Cast a wide net, then narrow down:
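One way to express this, with illustrative parameter names (not confirmed API): retrieve generously so the reranker has enough candidates, then keep only a few for generation:

```python
# Hypothetical Mini RAG config -- parameter names are illustrative.
rag = MiniRAG(
    top_k=20,        # wide net: plenty of candidates for the reranker
    rerank_top_k=3,  # narrow down: only the best chunks reach the LLM
)
```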
Combine with Query Rewriting
Reranking works best with query rewriting:
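A hypothetical sketch of combining the two — the rewrite_query flag is an assumed name for Mini RAG's query-rewriting option, not confirmed API:

```python
# Hypothetical Mini RAG config -- parameter names are illustrative.
rag = MiniRAG(
    rewrite_query=True,  # assumed flag: normalize the query before retrieval
    reranker="cohere",
    top_k=15,
    rerank_top_k=3,
)
```

Rewriting improves what the retriever finds; reranking improves what survives to the LLM.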
Match Reranker to Use Case
- Customer-facing: Cohere for best quality
- Internal tools: LLM or Sentence Transformer
- High-volume: Sentence Transformer (local)
- Prototyping: LLM (simple setup)
