Overview

Reranking improves retrieval quality by re-scoring and reordering retrieved chunks based on their relevance to the query. Mini RAG supports multiple reranking strategies.

Why Reranking?

Initial retrieval uses embedding similarity (cosine/dot product), which may not capture all aspects of relevance. Reranking uses more sophisticated models to:
  • Improve precision: Keep only the most relevant chunks
  • Score relevance more accurately: Judge each chunk against the query directly, rather than by vector proximity alone
  • Reduce noise: Filter out marginally relevant results
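For context, the embedding-similarity score that initial retrieval relies on reduces each comparison to a single vector operation, e.g. cosine similarity, which is fast but blind to how the query and chunk relate word-by-word:

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0
```

Rerankers instead look at the query and chunk together, which is slower but captures relevance signals a single similarity score misses.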

Available Rerankers

Comparison

| Reranker | Quality | Speed | Cost | Privacy | Setup |
|---|---|---|---|---|---|
| Cohere | ⭐⭐⭐⭐⭐ | ⚡⚡⚡ | 💰💰 | Cloud | Easy |
| LLM-based | ⭐⭐⭐⭐ | ⚡⚡ | 💰💰 | Cloud | Easy |
| Sentence Transformer | ⭐⭐⭐ | ⚡⚡⚡⚡ | Free | Local | Moderate |
| None | ⭐⭐ | ⚡⚡⚡⚡⚡ | Free | - | None |

How Reranking Works

Pipeline:
  1. Retrieve top_k chunks (e.g., 10) using embedding similarity
  2. Rerank these chunks with a more sophisticated model
  3. Keep rerank_top_k best chunks (e.g., 3)
  4. Generate answer using only the top chunks
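The four steps above can be sketched end-to-end with dummy components. The word-overlap retriever and phrase-match reranker below are illustrative stand-ins, not Mini RAG internals; only the control flow mirrors the pipeline:

```python
def retrieve(query, docs, top_k):
    # Stage 1: cheap similarity stand-in (count of shared words).
    scored = [(len(set(query.split()) & set(d.split())), d) for d in docs]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [d for _, d in scored[:top_k]]

def rerank(query, docs, rerank_top_k):
    # Stage 2: a "more sophisticated" stand-in (exact phrase match wins).
    scored = sorted(docs, key=lambda d: query in d, reverse=True)
    return scored[:rerank_top_k]

docs = [
    "reranking improves retrieval quality",
    "embedding similarity is cheap",
    "unrelated text about cooking",
]
candidates = retrieve("retrieval quality", docs, top_k=2)
best = rerank("retrieval quality", candidates, rerank_top_k=1)
print(best)  # → ['reranking improves retrieval quality']
```

Step 4 would then pass `best` as context to the generation LLM.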

Basic Usage

from mini import AgenticRAG, RetrievalConfig, RerankerConfig

rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    retrieval_config=RetrievalConfig(
        top_k=10,           # Retrieve 10 chunks
        rerank_top_k=3,     # Keep top 3 after reranking
        use_reranking=True  # Enable reranking
    ),
    reranker_config=RerankerConfig(
        type="cohere"  # or "llm", "sentence-transformer", "none"
    )
)

Choosing a Reranker

Use Cohere When:

  • Quality is paramount
  • You have budget for API calls
  • Processing English text
  • Want best-in-class reranking

Use LLM-based When:

  • Already using an LLM for generation
  • Want simple setup (no extra API)
  • Have good LLM access
  • Budget for LLM tokens

Use Sentence Transformer When:

  • Need local/private deployment
  • Have GPU available
  • Want to avoid API costs
  • Processing many queries

Use None When:

  • Speed is critical
  • Initial retrieval is good enough
  • Processing simple queries
  • Minimizing latency/cost
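The guidance above can be condensed into a small helper. This is a sketch of the decision table, not part of Mini RAG's API; the `choose_reranker` name and its parameters are hypothetical:

```python
def choose_reranker(priority, *, local_only=False, has_gpu=False):
    """Map requirements to a RerankerConfig `type` string.

    `priority` is one of "quality", "simplicity", "cost", "latency".
    Illustrative only; adapt the rules to your own constraints.
    """
    if local_only:
        # Private deployments: local cross-encoder if a GPU is available,
        # otherwise skip reranking to keep latency acceptable.
        return "sentence-transformer" if has_gpu else "none"
    if priority == "quality":
        return "cohere"
    if priority == "simplicity":
        return "llm"
    if priority == "cost":
        return "sentence-transformer"
    return "none"  # latency-critical: rely on initial retrieval

print(choose_reranker("quality"))  # → cohere
print(choose_reranker("cost", local_only=True, has_gpu=True))  # → sentence-transformer
```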

Configuration Examples

Production (Cohere)

import os

reranker_config = RerankerConfig(
    type="cohere",
    kwargs={
        "api_key": os.getenv("COHERE_API_KEY"),
        "model": "rerank-english-v3.0"
    }
)

Development (LLM)

reranker_config = RerankerConfig(
    type="llm"
)

Local (Sentence Transformer)

reranker_config = RerankerConfig(
    type="sentence-transformer",
    kwargs={
        "model_name": "cross-encoder/ms-marco-MiniLM-L-6-v2",
        "device": "cuda"
    }
)

Disabled

reranker_config = RerankerConfig(
    type="none"
)

Performance Impact

Latency

  • Cohere: +50-100ms
  • LLM-based: +500-1000ms (depends on LLM)
  • Sentence Transformer: +100-200ms (GPU), +500ms (CPU)
  • None: 0ms

Cost

  • Cohere: ~$1 per 1000 searches (1000 docs)
  • LLM-based: Depends on LLM pricing and chunk size
  • Sentence Transformer: Free (after model download)
  • None: Free
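As a back-of-the-envelope check, the Cohere figure above scales linearly with query volume. A tiny estimator, using the ~$1 per 1,000 searches figure as an assumption rather than official pricing:

```python
def cohere_rerank_cost(num_searches, usd_per_1000=1.0):
    # Rough estimate from the ~$1 per 1,000 searches figure above;
    # verify against Cohere's current pricing before budgeting.
    return num_searches / 1000 * usd_per_1000

print(cohere_rerank_cost(50_000))  # → 50.0 (USD for 50k searches)
```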

Quality Improvement

Typical improvements in retrieval quality:
  • Cohere: +20-40% better precision
  • LLM-based: +15-30% better precision
  • Sentence Transformer: +10-25% better precision
  • None: Baseline

Best Practices

Cast a wide net, then narrow down:

# Good: Retrieve more, keep less
RetrievalConfig(top_k=15, rerank_top_k=3)

# Less effective: Retrieve few, keep most
RetrievalConfig(top_k=5, rerank_top_k=4)

Reranking works best combined with query rewriting:

RetrievalConfig(
    use_query_rewriting=True,  # Generate variations
    use_reranking=True         # Rerank results
)

Match the reranker to the deployment:
  • Customer-facing: Cohere for best quality
  • Internal tools: LLM or Sentence Transformer
  • High-volume: Sentence Transformer (local)
  • Prototyping: LLM (simple setup)

Custom Reranker

Implement your own reranker:
from mini.reranker import BaseReranker, RerankResult
from typing import List, Optional

class CustomReranker(BaseReranker):
    def score(self, query: str, document: str) -> float:
        # Example scoring: fraction of query words that appear in the
        # document. Replace with your own model or heuristic.
        query_words = set(query.lower().split())
        doc_words = set(document.lower().split())
        return len(query_words & doc_words) / max(len(query_words), 1)

    def rerank(
        self,
        query: str,
        documents: List[str],
        top_k: Optional[int] = None
    ) -> List[RerankResult]:
        # Score every document against the query
        scores = [self.score(query, doc) for doc in documents]

        results = [
            RerankResult(index=i, score=score, document=doc)
            for i, (doc, score) in enumerate(zip(documents, scores))
        ]

        # Sort by score, best first
        results.sort(key=lambda x: x.score, reverse=True)

        return results[:top_k] if top_k is not None else results

# Use custom reranker
custom_reranker = CustomReranker()
reranker_config = RerankerConfig(custom_reranker=custom_reranker)

Next Steps