Overview
The Sentence Transformer reranker uses local cross-encoder models for reranking. Because it runs entirely on your own infrastructure, your data stays private and there are no per-query API costs.

Setup
Install Dependencies
Sentence Transformers is included with Mini RAG, so no separate installation is required. If you need it in another environment, install it with `pip install sentence-transformers`.

Configuration
Basic Usage
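A minimal sketch using the sentence-transformers CrossEncoder that backs this reranker (Mini RAG's own configuration keys may differ; the query and document strings are placeholders):

```python
from sentence_transformers import CrossEncoder

# Load the default cross-encoder (downloaded from Hugging Face on first use)
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Score each (query, document) pair; higher scores mean more relevant
scores = model.predict([
    ("what is reranking?", "Reranking reorders retrieved documents by relevance."),
    ("what is reranking?", "The stock market closed higher today."),
])
```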
With GPU
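Assuming CUDA is available, pass `device="cuda"` when constructing the model (a sketch; this is the underlying CrossEncoder parameter, not necessarily Mini RAG's config key):

```python
from sentence_transformers import CrossEncoder

# Run inference on the GPU
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device="cuda")
```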
With CPU
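To pin inference to the CPU, a minimal sketch:

```python
from sentence_transformers import CrossEncoder

# Force CPU inference (slower, but works on any host)
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device="cpu")
```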
Available Models
cross-encoder/ms-marco-MiniLM-L-6-v2
Size: Small (~80MB)
Quality: Good
Speed: Fast
Best for: General purpose, balanced performance
cross-encoder/ms-marco-MiniLM-L-12-v2
Size: Medium (~130MB)
Quality: Better
Speed: Moderate
Best for: Higher quality needs
cross-encoder/ms-marco-TinyBERT-L-2-v2
Size: Tiny (~40MB)
Quality: Basic
Speed: Very Fast
Best for: Resource-constrained environments
BAAI/bge-reranker-base
Size: Medium (~280MB)
Quality: High
Speed: Moderate
Best for: Multilingual support
BAAI/bge-reranker-large
Size: Large (~560MB)
Quality: Highest
Speed: Slower
Best for: Maximum quality
Direct Usage
Use the reranker directly:
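A minimal sketch with the sentence-transformers CrossEncoder (the model name matches the default listed above; inputs are placeholders):

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# predict() takes (query, document) pairs and returns one relevance score per pair
scores = model.predict([
    ("how do cross-encoders work?", "Cross-encoders score query-document pairs jointly."),
    ("how do cross-encoders work?", "Bananas are rich in potassium."),
])
print(scores)
```

Complete Example

An end-to-end sketch that reranks a small candidate list (queries and documents are placeholder data):

```python
from sentence_transformers import CrossEncoder

query = "how do rerankers improve retrieval?"
documents = [
    "Rerankers rescore retrieved candidates with a cross-encoder.",
    "Bi-encoders embed queries and documents independently.",
    "Bananas are rich in potassium.",
]

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = model.predict([(query, doc) for doc in documents])

# Sort candidates by descending relevance score
ranked = sorted(zip(documents, scores), key=lambda item: item[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.3f}  {doc}")
```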
Performance
Speed Comparison
With 10 documents to rerank:

| Model | GPU (ms) | CPU (ms) |
|---|---|---|
| TinyBERT-L-2 | 20-30 | 100-150 |
| MiniLM-L-6 | 30-50 | 150-250 |
| MiniLM-L-12 | 50-80 | 300-500 |
| bge-reranker-base | 60-100 | 400-600 |
| bge-reranker-large | 100-150 | 800-1200 |
Memory Usage
| Model | GPU Memory | RAM |
|---|---|---|
| TinyBERT-L-2 | ~200MB | ~500MB |
| MiniLM-L-6 | ~500MB | ~1GB |
| MiniLM-L-12 | ~800MB | ~1.5GB |
| bge-reranker-base | ~1.5GB | ~2.5GB |
| bge-reranker-large | ~2.5GB | ~4GB |
Best Practices
Choose the Right Model
- Prototyping: MiniLM-L-6 (good balance)
- Production: MiniLM-L-12 or bge-reranker-base
- High volume: TinyBERT-L-2 (fastest)
- Best quality: bge-reranker-large (if you have resources)
Use GPU When Possible
A GPU typically provides a 5-10x speedup:
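A sketch that picks the GPU when one is available and falls back to CPU otherwise:

```python
import torch
from sentence_transformers import CrossEncoder

# Prefer CUDA when present; otherwise run on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device=device)
```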
Cache the Model
The model is loaded once and reused:
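For example, via a cached factory (a sketch using functools.lru_cache; the helper name is illustrative, not part of Mini RAG's API):

```python
from functools import lru_cache

from sentence_transformers import CrossEncoder

@lru_cache(maxsize=1)
def get_reranker(name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2") -> CrossEncoder:
    # The first call loads the model; later calls reuse the cached instance
    return CrossEncoder(name)
```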
Batch Processing
Process multiple queries to amortize model loading:
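A sketch that scores all (query, document) pairs in a single predict() call, so the model is loaded once and the pairs are batched together (the inputs are hypothetical):

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Each query has its own candidate documents
batches = {
    "query one": ["doc a", "doc b"],
    "query two": ["doc c", "doc d"],
}

# Flatten everything into one list of pairs and score it in batches
pairs = [(q, d) for q, docs in batches.items() for d in docs]
scores = model.predict(pairs, batch_size=32)
```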
Advantages
✅ Privacy: Runs entirely on your infrastructure
✅ No API Costs: Free after model download
✅ No Rate Limits: Process as many queries as you want
✅ Low Latency: Fast with GPU
✅ Offline: Works without internet
Limitations
❌ Initial Download: Models need to be downloaded first
❌ Resource Requirements: Needs GPU for best performance
❌ Quality: Slightly lower than Cohere
❌ Maintenance: You manage the infrastructure
Troubleshooting
Model Download Fails
Models are downloaded from Hugging Face:
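If downloads fail behind a proxy or on an air-gapped host, one option is to pre-fetch the weights on a machine with access, using huggingface_hub (a dependency of sentence-transformers):

```python
from huggingface_hub import snapshot_download

# Pre-download the model into the local Hugging Face cache
snapshot_download("cross-encoder/ms-marco-MiniLM-L-6-v2")
```

The cache location can be moved by setting the HF_HOME environment variable before running.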
CUDA Out of Memory
Use a smaller model or CPU:
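Both options, shown as sketches:

```python
from sentence_transformers import CrossEncoder

# Option 1: a smaller model that fits in less GPU memory
model = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2")

# Option 2: keep the same model but run it on the CPU
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device="cpu")
```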
Slow Performance
Ensure GPU is being used:
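A quick check that CUDA is visible to PyTorch, plus an explicit device pin so the model cannot silently fall back to CPU:

```python
import torch
from sentence_transformers import CrossEncoder

print(torch.cuda.is_available())  # should print True on a GPU host

# Pin the model to the GPU explicitly
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device="cuda")
```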
