## Overview

Re-ranking is the process of re-scoring and re-ordering retrieved chunks to improve relevance. After initial retrieval (which may return 10-20 chunks), re-ranking selects the most relevant 3-5 chunks for answer generation.

Why re-rank?

- Embedding-based retrieval is fast but may miss nuances
- Re-rankers use more sophisticated models to assess relevance
- Better chunks mean better answers from the LLM
## Re-ranking Strategies

Mini RAG supports multiple re-ranking methods:

- **LLM-based**: uses your LLM to score relevance (default)
- **Cohere API**: specialized re-ranking models via Cohere
- **Local models**: open-source cross-encoders running locally
## Quick Start
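A minimal sketch to get started. The `MiniRAG` class, its import path, and the `query()` method are illustrative assumptions; only `RerankerConfig` and the `type`/`top_k`/`rerank_top_k` fields are named on this page.

```python
# Minimal sketch; MiniRAG, the import path, and query() are assumptions.
from mini_rag import MiniRAG, RerankerConfig

rag = MiniRAG(
    reranker=RerankerConfig(
        type="llm",       # default strategy: score relevance with your LLM
        top_k=20,         # candidates returned by initial retrieval
        rerank_top_k=5,   # chunks kept for answer generation
    )
)

print(rag.query("What does re-ranking do?"))
```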
## Strategy 1: LLM-Based Re-ranking

Uses your configured LLM to score chunk relevance.

### Configuration
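A configuration sketch, assuming `RerankerConfig` exposes the fields used throughout this page; treat field names other than `type` as illustrative.

```python
from mini_rag import RerankerConfig  # import path assumed

config = RerankerConfig(
    type="llm",        # re-rank with the already-configured LLM (default)
    top_k=20,          # candidates to fetch before re-ranking
    rerank_top_k=5,    # best chunks passed to the LLM for generation
)
```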
### Pros & Cons

**Pros**

- No additional API needed
- Uses existing LLM
- Good quality
- Simple setup

**Cons**

- Slower than dedicated rerankers
- More expensive per query
- Limited by LLM context
## Strategy 2: Cohere Re-rank API

Uses Cohere's specialized re-ranking models.

### Configuration
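A sketch assuming `RerankerConfig` accepts `model` and `api_key` fields; only `type="cohere"` and the `COHERE_API_KEY` variable are confirmed by this page (the key may also be picked up from the environment automatically).

```python
import os

from mini_rag import RerankerConfig  # import path assumed

config = RerankerConfig(
    type="cohere",                          # confirmed by the Setup steps below
    model="rerank-english-v3.0",            # see Available Models below
    api_key=os.environ["COHERE_API_KEY"],   # loaded from your .env file
)
```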
### Available Models

| Model | Languages | Best For |
|---|---|---|
| `rerank-english-v3.0` | English | English content (best quality) |
| `rerank-multilingual-v3.0` | 100+ languages | International content |
### Setup

1. **Get an API key.** Sign up at cohere.com and get your API key.
2. **Add it to your environment.** Add `COHERE_API_KEY` to your `.env` file.
3. **Configure.** Use `type="cohere"` in `RerankerConfig`.

### Pros & Cons
**Pros**

- Very fast
- High quality
- Specialized for re-ranking
- Cost-effective

**Cons**

- Requires an API key
- External dependency
- API limits apply
## Strategy 3: Local Cross-Encoders

Uses open-source sentence-transformer models locally.

### Configuration
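A sketch: the `type="local"` value and the `model` field are assumptions, while the model names come from the table below.

```python
from mini_rag import RerankerConfig  # import path assumed

config = RerankerConfig(
    type="local",                                   # assumed identifier
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",   # see Available Models below
)
```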
### Available Models

| Model | Size | Quality | Speed |
|---|---|---|---|
| `cross-encoder/ms-marco-TinyBERT-L-2-v2` | Tiny | Good | Fast |
| `cross-encoder/ms-marco-MiniLM-L-6-v2` | Small | Better | Medium |
| `cross-encoder/ms-marco-MiniLM-L-12-v2` | Medium | Best | Slower |
### Pros & Cons

**Pros**

- No API costs
- Data privacy (runs locally)
- No rate limits
- Open source

**Cons**

- Requires local compute
- GPU recommended
- Model download needed
- Slower than Cohere
## Strategy 4: Custom Re-ranker

Provide your own re-ranker instance:
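Mini RAG's expected re-ranker interface is not documented on this page; the sketch below assumes a duck-typed object with a `rerank(query, chunks, top_k)` method and uses the real `sentence-transformers` `CrossEncoder` API for scoring.

```python
from sentence_transformers import CrossEncoder


class CrossEncoderReranker:
    """Custom re-ranker: scores (query, chunk) pairs with a cross-encoder."""

    def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(self, query, chunks, top_k=5):
        # Score every (query, chunk) pair, then keep the top_k best.
        scores = self.model.predict([(query, chunk) for chunk in chunks])
        ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
        return [chunk for chunk, _score in ranked[:top_k]]


# Wiring is assumed; Mini RAG's actual hook may differ:
# rag = MiniRAG(reranker=CrossEncoderReranker())
```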
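## Disabling Re-ranking

If latency or cost matters more than precision, re-ranking can be turned off. The exact switch is an assumption; it may be `reranker=None` or a dedicated config value:

```python
from mini_rag import MiniRAG  # import path assumed

# Assumption: passing None skips re-ranking, so initial retrieval results
# go straight to the LLM.
rag = MiniRAG(reranker=None)
```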
## Comparison

### Performance

**Speed comparison:**
| Method | Speed | Cost | Quality |
|---|---|---|---|
| No reranking | Fastest | Free | Baseline |
| Cohere | Fast | Low ($0.002/1K docs) | Excellent |
| Local Cross-Encoder | Medium | Free | Very Good |
| LLM-based | Slow | Medium | Good |
**Quality comparison:** reading the Quality column above, Cohere rates excellent, local cross-encoders very good, LLM-based scoring good, and no re-ranking is the baseline.
## Best Practices

### Choose the Right Strategy
Selection guide:
- Cohere: Best balance for production (fast + high quality)
- LLM: Simple setup, good for prototyping
- Local: Data privacy requirements, no API costs
- None: Speed is critical, budget is tight
### Top-K Configuration

Balance retrieval and re-ranking (see the sketch after this list):

- Higher `top_k`: more candidates, better recall
- Lower `rerank_top_k`: only the best chunks reach the LLM
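A sketch using the `top_k`/`rerank_top_k` names from this page; the other details are illustrative.

```python
from mini_rag import RerankerConfig  # import path assumed

config = RerankerConfig(
    type="cohere",
    top_k=20,         # cast a wide net at retrieval time (recall)
    rerank_top_k=5,   # pass only the strongest chunks to the LLM (precision)
)
```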
### Combine with Hybrid Search

Optimal pipeline (sketched after this list):

1. Hybrid search retrieves 20 diverse chunks
2. The re-ranker selects the 5 most relevant
3. The LLM generates the answer from those top 5
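An illustrative end-to-end flow continuing the earlier sketches; the `retrieve`/`generate` method names and the `mode="hybrid"` flag are assumptions, not Mini RAG's documented API.

```python
# Method names and the mode flag are assumed; rag and reranker come from
# the earlier sketches.
question = "How do I rotate API keys?"  # illustrative query

candidates = rag.retrieve(question, mode="hybrid", top_k=20)   # 1. wide, diverse recall
best_chunks = reranker.rerank(question, candidates, top_k=5)   # 2. precision re-rank
answer = rag.generate(question, context=best_chunks)           # 3. grounded generation
```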
### Monitor Performance
Use observability to track:
- Reranking latency
- Score distributions
- Cost per query
- Quality improvements
## Cost Analysis
### Cohere Rerank Pricing
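As a rough estimate using the rate from the comparison table above ($0.002 per 1K documents): re-ranking 20 candidate chunks per query costs about $0.00004, or roughly $0.04 per 1,000 queries and about $40 per million. Check Cohere's current pricing, since rates change.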
### LLM-Based Cost
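There is no separate re-ranking fee; instead, each query adds an extra LLM call that must read the text of all 10-20 candidate chunks, so cost scales with chunk length and your provider's per-token pricing. This is why the comparison table rates it medium cost and slow.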
### Local Cross-Encoder
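No per-query fees: the only costs are a one-time model download (~100MB) and local compute. A GPU is recommended for throughput, though the smallest model in the table above is generally usable on CPU.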
## Troubleshooting
### Cohere API errors

**Solution:** Check your API key. Ensure it is set in your `.env` file and is valid; a quick way to verify from Python:
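```python
import os

# Prints True when the key is visible to the current process.
print(bool(os.environ.get("COHERE_API_KEY")))
```

If this prints `False`, make sure your `.env` file is loaded before Mini RAG initializes.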
### Local model not loading

**Solution:** Install `sentence-transformers`:
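```bash
pip install sentence-transformers
```

The first run downloads the model (~100MB).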
### Slow reranking

**Solutions:**

- Switch to Cohere (fastest)
- Reduce `top_k` (fewer chunks to rerank)
- Use a smaller local model
- Disable re-ranking if speed is critical
### Poor reranking quality

**Solutions:**

- Try Cohere (usually the best quality)
- Increase `top_k` (more candidates)
- Ensure good initial retrieval
- Check that chunks are well-formed
