
Overview

Re-ranking is the process of re-scoring and re-ordering retrieved chunks to improve relevance. After initial retrieval (which may return 10-20 chunks), re-ranking selects the most relevant 3-5 chunks for answer generation.

Why re-rank?
  • Embedding-based retrieval is fast but may miss nuances
  • Re-rankers use more sophisticated models to assess relevance
  • Better chunks = better answers from the LLM
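
Conceptually, the flow is "retrieve broadly, then re-score with a stronger signal". The sketch below is purely illustrative: the word-overlap scorer is a toy stand-in for a real re-ranking model, and none of the function names are part of Mini RAG's API.

# Illustrative two-stage flow: broad retrieval, then re-scoring.
# The word-overlap scorer is a toy stand-in for a real re-ranker.
def toy_rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    query_words = set(query.lower().split())

    def score(chunk: str) -> float:
        chunk_words = set(chunk.lower().split())
        return len(query_words & chunk_words) / max(len(query_words), 1)

    return sorted(chunks, key=score, reverse=True)[:top_k]

retrieved = [
    "Railways budget: $50M allocated for new lines",
    "Weather was mild throughout the quarter",
    "Transportation spending rose overall",
]
print(toy_rerank("What is the budget for railways?", retrieved, top_k=2))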

Re-ranking Strategies

Mini RAG supports multiple re-ranking methods:

  • LLM-based: uses your LLM to score relevance (default)
  • Cohere API: specialized re-ranking models via Cohere
  • Local Models: open-source cross-encoders running locally

Quick Start

from mini import AgenticRAG, RetrievalConfig, RerankerConfig

rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    retrieval_config=RetrievalConfig(
        top_k=10,           # Retrieve 10 chunks
        rerank_top_k=3,     # Keep top 3 after reranking
        use_reranking=True  # Enable reranking
    ),
    reranker_config=RerankerConfig(
        type="llm"  # Default: LLM-based reranking
    )
)

response = rag.query("What are the key findings?")

Strategy 1: LLM-Based Re-ranking

Uses your configured LLM to score chunk relevance.

Configuration

from mini import RerankerConfig

reranker_config = RerankerConfig(
    type="llm"  # Use LLM for reranking
)

rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    reranker_config=reranker_config
)
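
Under the hood, an LLM-based re-ranker asks the model to judge each chunk's relevance to the query. Mini RAG's actual prompt and scoring logic are not shown here; the sketch below is illustrative only, and call_llm is a placeholder for whatever completion client you use.

# Conceptual sketch of LLM-based scoring (illustrative, not Mini RAG's internals).
# call_llm is any function that takes a prompt string and returns the model's reply.
def llm_rerank(query, chunks, call_llm, top_k=3):
    scored = []
    for chunk in chunks:
        prompt = (
            "Rate how relevant the passage is to the query on a scale of 0-10. "
            "Reply with a single number.\n"
            f"Query: {query}\nPassage: {chunk}"
        )
        try:
            score = float(call_llm(prompt).strip())
        except ValueError:
            score = 0.0  # unparsable reply counts as irrelevant
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]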

Pros & Cons

Pros

  • No additional API needed
  • Uses existing LLM
  • Good quality
  • Simple setup

Cons

  • Slower than dedicated rerankers
  • More expensive per query
  • Limited by LLM context

Strategy 2: Cohere Re-rank API

Uses Cohere’s specialized re-ranking models.

Configuration

import os
from mini import RerankerConfig

reranker_config = RerankerConfig(
    type="cohere",
    kwargs={
        "api_key": os.getenv("COHERE_API_KEY"),
        "model": "rerank-english-v3.0"  # or "rerank-multilingual-v3.0"
    }
)

rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    reranker_config=reranker_config
)

Available Models

Model                    | Languages      | Best For
rerank-english-v3.0      | English        | English content (best quality)
rerank-multilingual-v3.0 | 100+ languages | International content
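
For reference, this is roughly what the underlying call looks like if you use the Cohere Python SDK directly. Mini RAG handles this for you, and the exact client class and response shape vary between SDK versions, so treat this as a sketch:

import os
import cohere

co = cohere.Client(os.getenv("COHERE_API_KEY"))
results = co.rerank(
    model="rerank-english-v3.0",
    query="What is the budget for railways?",
    documents=[
        "Railways budget: $50M allocated for new lines",
        "Weather was mild throughout the quarter",
    ],
    top_n=2,
)
for item in results.results:
    print(item.index, item.relevance_score)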

Setup

  1. Get API Key: sign up at cohere.com and get your API key
  2. Add to Environment: add COHERE_API_KEY to your .env file
  3. Configure: use type="cohere" in RerankerConfig
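
If you keep the key in a .env file, it still has to be loaded into the process environment. One common way to do that (an assumption here, not necessarily something Mini RAG does for you) is the python-dotenv package:

import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from a local .env file into os.environ
assert os.getenv("COHERE_API_KEY"), "COHERE_API_KEY is not set"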

Pros & Cons

Pros

  • Very fast
  • High quality
  • Specialized for reranking
  • Cost-effective

Cons

  • Requires API key
  • External dependency
  • API limits apply

Strategy 3: Local Cross-Encoders

Uses open-source sentence-transformer models locally.

Configuration

from mini import RerankerConfig

reranker_config = RerankerConfig(
    type="sentence-transformer",
    kwargs={
        "model_name": "cross-encoder/ms-marco-MiniLM-L-6-v2",
        "device": "cuda"  # or "cpu"
    }
)

rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    reranker_config=reranker_config
)

Available Models

Model                                  | Size   | Quality | Speed
cross-encoder/ms-marco-TinyBERT-L-2-v2 | Tiny   | Good    | Fast
cross-encoder/ms-marco-MiniLM-L-6-v2   | Small  | Better  | Medium
cross-encoder/ms-marco-MiniLM-L-12-v2  | Medium | Best    | Slower
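
Outside of Mini RAG, a cross-encoder scores each (query, passage) pair jointly, which is what makes it more accurate than comparing precomputed embeddings. A minimal standalone example with sentence-transformers:

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [
    ("What is the budget for railways?", "Railways budget: $50M allocated for new lines"),
    ("What is the budget for railways?", "Weather was mild throughout the quarter"),
]
scores = model.predict(pairs)  # one relevance score per pair; higher = more relevant
print(scores)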

Pros & Cons

Pros

  • No API costs
  • Data privacy (runs locally)
  • No rate limits
  • Open source

Cons

  • Requires local compute
  • GPU recommended
  • Model download needed
  • Slower than Cohere

Strategy 4: Custom Re-ranker

Provide your own re-ranker instance:

import os

from mini import RerankerConfig
from mini.reranker import CohereReranker

# Create custom reranker with specific settings
custom_reranker = CohereReranker(
    api_key=os.getenv("COHERE_API_KEY"),
    model="rerank-multilingual-v3.0",
    max_chunks_per_doc=10
)

# Use custom instance
reranker_config = RerankerConfig(
    custom_reranker=custom_reranker
)

rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    reranker_config=reranker_config
)

Disabling Re-ranking

from mini import RetrievalConfig

rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    retrieval_config=RetrievalConfig(
        use_reranking=False  # Disable reranking
    )
)

Comparison

Performance

# Test different rerankers
rerankers = [
    ("LLM", "llm", {}),
    ("Cohere", "cohere", {"model": "rerank-english-v3.0"}),
    ("Local", "sentence-transformer", {"model_name": "cross-encoder/ms-marco-MiniLM-L-6-v2"})
]

query = "What are the key findings?"

for name, reranker_type, kwargs in rerankers:
    rag = AgenticRAG(
        vector_store=vector_store,
        embedding_model=embedding_model,
        reranker_config=RerankerConfig(type=reranker_type, kwargs=kwargs)
    )
    )
    
    response = rag.query(query)
    print(f"{name}: {response.answer[:150]}...")

Speed Comparison

Method              | Speed   | Cost                 | Quality
No reranking        | Fastest | Free                 | Baseline
Cohere              | Fast    | Low ($0.002/1K docs) | Excellent
Local Cross-Encoder | Medium  | Free                 | Very Good
LLM-based           | Slow    | Medium               | Good
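
To see where your own setup lands in this table, you can time queries directly. A simple sketch reusing a rag instance from the comparison script above:

import time

start = time.perf_counter()
response = rag.query("What are the key findings?")
elapsed = time.perf_counter() - start
print(f"Query latency: {elapsed:.2f}s")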

Quality Comparison

Query: "What is the budget for railways?"

No reranking: 
├─ Chunk 1: [0.85] Railways infrastructure...
├─ Chunk 2: [0.84] Budget allocation total...
└─ Chunk 3: [0.82] Transportation spending...

With Cohere reranking:
├─ Chunk 1: [0.95] Railways budget: $50M allocated...
├─ Chunk 2: [0.89] Infrastructure railways improvements...
└─ Chunk 3: [0.75] Transportation budget overview...

Best Practices

Selection guide:
  • Cohere: Best balance for production (fast + high quality)
  • LLM: Simple setup, good for prototyping
  • Local: Data privacy requirements, no API costs
  • None: Speed is critical, budget is tight

Balance retrieval and reranking:

retrieval_config = RetrievalConfig(
    top_k=15,        # Cast a wide net
    rerank_top_k=5   # Keep best 5
)

  • Higher top_k: More candidates, better recall
  • Lower rerank_top_k: Only best chunks for LLM

Use observability to track:
  • Reranking latency
  • Score distributions
  • Cost per query
  • Quality improvements

observability_config = ObservabilityConfig(enabled=True)

Cost Analysis

Cohere Rerank Pricing

$0.002 per 1,000 documents

Example costs:
- 100 queries × 10 chunks = 1,000 docs = $0.002
- 1,000 queries × 10 chunks = 10,000 docs = $0.02
- 10,000 queries × 10 chunks = 100,000 docs = $0.20

LLM-based Cost

Depends on your LLM pricing

With GPT-4o-mini:
- 10 chunks × 200 tokens each = 2,000 tokens input
- $0.15 per 1M input tokens
- Cost per query: ~$0.0003

100,000 queries = ~$30

Local Cross-Encoder

Zero API costs

One-time costs:
- GPU (optional): $500-2000
- Compute time: Minimal

Ongoing costs:
- Electricity: ~$0.10/day with GPU
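
The same arithmetic as a quick script, using the prices quoted above (the prices are the assumptions from this page; check current provider pricing):

queries = 10_000
chunks_per_query = 10

# Cohere: $0.002 per 1,000 documents reranked
cohere_cost = queries * chunks_per_query / 1_000 * 0.002

# LLM-based with GPT-4o-mini: ~200 input tokens per chunk at $0.15 per 1M input tokens
llm_cost = queries * chunks_per_query * 200 / 1_000_000 * 0.15

print(f"Cohere:    ${cohere_cost:.2f}")   # $0.20
print(f"LLM-based: ${llm_cost:.2f}")      # $3.00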

Troubleshooting

Cohere API key errors

Solution: Check your API key:

echo $COHERE_API_KEY

Ensure it's set in your .env file and valid.

Local model fails to load

Solution: Install sentence-transformers:

pip install sentence-transformers

First run downloads the model (~100MB).

Reranking is too slow

Solutions:
  1. Switch to Cohere (fastest)
  2. Reduce top_k (fewer chunks to rerank)
  3. Use smaller local model
  4. Disable reranking if speed is critical

Reranking doesn't improve quality

Solutions:
  1. Try Cohere (usually best quality)
  2. Increase top_k (more candidates)
  3. Ensure good initial retrieval
  4. Check if chunks are well-formed

Advanced Usage

Dynamic Reranker Selection

def get_reranker_config(query_type):
    if query_type == "technical":
        return RerankerConfig(type="cohere")
    elif query_type == "simple":
        return RerankerConfig(type="llm")
    else:
        return RerankerConfig(type="sentence-transformer")

# Use different rerankers based on query
config = get_reranker_config("technical")
rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    reranker_config=config
)

Accessing Reranked Scores

response = rag.query("What are the findings?")

for chunk in response.retrieved_chunks:
    print(f"Original score: {chunk.score:.4f}")
    print(f"Reranked score: {chunk.reranked_score:.4f}")
    print(f"Improvement: {chunk.reranked_score - chunk.score:.4f}")

Next Steps