
Overview

Re-ranking is the process of re-scoring and re-ordering retrieved chunks to improve relevance. After initial retrieval (which may return 10-20 chunks), re-ranking selects the most relevant 3-5 chunks for answer generation.

Why re-rank?
  • Embedding-based retrieval is fast but may miss nuances
  • Re-rankers use more sophisticated models to assess relevance
  • Better chunks = better answers from the LLM
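
Conceptually, the flow is "retrieve broadly, then re-score with a stronger signal". The sketch below is purely illustrative: the word-overlap scorer is a toy stand-in for a real re-ranking model, and none of the function names are part of Mini RAG's API.

# Illustrative two-stage flow: broad retrieval, then re-scoring.
# The word-overlap scorer is a toy stand-in for a real re-ranker.
def toy_rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    query_words = set(query.lower().split())

    def score(chunk: str) -> float:
        chunk_words = set(chunk.lower().split())
        return len(query_words & chunk_words) / max(len(query_words), 1)

    return sorted(chunks, key=score, reverse=True)[:top_k]

retrieved = [
    "Railways budget: $50M allocated for new lines",
    "Weather was mild throughout the quarter",
    "Transportation spending rose overall",
]
print(toy_rerank("What is the budget for railways?", retrieved, top_k=2))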

Re-ranking Strategies

Mini RAG supports multiple re-ranking methods:

  • LLM-based: uses your LLM to score relevance (default)
  • Cohere API: specialized re-ranking models via Cohere
  • Local Models: open-source cross-encoders running locally

Quick Start

from mini import AgenticRAG, RetrievalConfig, RerankerConfig

rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    retrieval_config=RetrievalConfig(
        top_k=10,           # Retrieve 10 chunks
        rerank_top_k=3,     # Keep top 3 after reranking
        use_reranking=True  # Enable reranking
    ),
    reranker_config=RerankerConfig(
        type="llm"  # Default: LLM-based reranking
    )
)

response = rag.query("What are the key findings?")

Strategy 1: LLM-Based Re-ranking

Uses your configured LLM to score chunk relevance.

Configuration

from mini import RerankerConfig

reranker_config = RerankerConfig(
    type="llm"  # Use LLM for reranking
)

rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    reranker_config=reranker_config
)
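
Under the hood, an LLM-based re-ranker asks the model to judge each chunk's relevance to the query. Mini RAG's actual prompt and scoring logic are not shown here; the sketch below is illustrative only, and call_llm is a placeholder for whatever completion client you use.

# Conceptual sketch of LLM-based scoring (illustrative, not Mini RAG's internals).
# call_llm is any function that takes a prompt string and returns the model's reply.
def llm_rerank(query, chunks, call_llm, top_k=3):
    scored = []
    for chunk in chunks:
        prompt = (
            "Rate how relevant the passage is to the query on a scale of 0-10. "
            "Reply with a single number.\n"
            f"Query: {query}\nPassage: {chunk}"
        )
        try:
            score = float(call_llm(prompt).strip())
        except ValueError:
            score = 0.0  # unparsable reply counts as irrelevant
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]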

Pros & Cons

Pros

  • No additional API needed
  • Uses existing LLM
  • Good quality
  • Simple setup

Cons

  • Slower than dedicated rerankers
  • More expensive per query
  • Limited by LLM context

Strategy 2: Cohere Re-rank API

Uses Cohere’s specialized re-ranking models.

Configuration

import os
from mini import RerankerConfig

reranker_config = RerankerConfig(
    type="cohere",
    kwargs={
        "api_key": os.getenv("COHERE_API_KEY"),
        "model": "rerank-english-v3.0"  # or "rerank-multilingual-v3.0"
    }
)

rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    reranker_config=reranker_config
)

Available Models

Model                    | Languages      | Best For
rerank-english-v3.0      | English        | English content (best quality)
rerank-multilingual-v3.0 | 100+ languages | International content
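
For reference, this is roughly what the underlying call looks like if you use the Cohere Python SDK directly. Mini RAG handles this for you, and the exact client class and response shape vary between SDK versions, so treat this as a sketch:

import os
import cohere

co = cohere.Client(os.getenv("COHERE_API_KEY"))
results = co.rerank(
    model="rerank-english-v3.0",
    query="What is the budget for railways?",
    documents=[
        "Railways budget: $50M allocated for new lines",
        "Weather was mild throughout the quarter",
    ],
    top_n=2,
)
for item in results.results:
    print(item.index, item.relevance_score)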

Setup

  1. Get API Key: sign up at cohere.com and get your API key
  2. Add to Environment: add COHERE_API_KEY to your .env file
  3. Configure: use type="cohere" in RerankerConfig
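
If you keep the key in a .env file, it still has to be loaded into the process environment. One common way to do that (an assumption here, not necessarily something Mini RAG does for you) is the python-dotenv package:

import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from a local .env file into os.environ
assert os.getenv("COHERE_API_KEY"), "COHERE_API_KEY is not set"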

Pros & Cons

Pros

  • Very fast
  • High quality
  • Specialized for reranking
  • Cost-effective

Cons

  • Requires API key
  • External dependency
  • API limits apply

Strategy 3: Local Cross-Encoders

Uses open-source sentence-transformer models locally.

Configuration

from mini import RerankerConfig

reranker_config = RerankerConfig(
    type="sentence-transformer",
    kwargs={
        "model_name": "cross-encoder/ms-marco-MiniLM-L-6-v2",
        "device": "cuda"  # or "cpu"
    }
)

rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    reranker_config=reranker_config
)

Available Models

Model                                  | Size   | Quality | Speed
cross-encoder/ms-marco-TinyBERT-L-2-v2 | Tiny   | Good    | Fast
cross-encoder/ms-marco-MiniLM-L-6-v2   | Small  | Better  | Medium
cross-encoder/ms-marco-MiniLM-L-12-v2  | Medium | Best    | Slower
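
Outside of Mini RAG, a cross-encoder scores each (query, passage) pair jointly, which is what makes it more accurate than comparing precomputed embeddings. A minimal standalone example with sentence-transformers:

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [
    ("What is the budget for railways?", "Railways budget: $50M allocated for new lines"),
    ("What is the budget for railways?", "Weather was mild throughout the quarter"),
]
scores = model.predict(pairs)  # one relevance score per pair; higher = more relevant
print(scores)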

Pros & Cons

Pros

  • No API costs
  • Data privacy (runs locally)
  • No rate limits
  • Open source

Cons

  • Requires local compute
  • GPU recommended
  • Model download needed
  • Slower than Cohere

Strategy 4: Custom Re-ranker

Provide your own re-ranker instance:

import os

from mini import RerankerConfig
from mini.reranker import CohereReranker

# Create custom reranker with specific settings
custom_reranker = CohereReranker(
    api_key=os.getenv("COHERE_API_KEY"),
    model="rerank-multilingual-v3.0",
    max_chunks_per_doc=10
)

# Use custom instance
reranker_config = RerankerConfig(
    custom_reranker=custom_reranker
)

rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    reranker_config=reranker_config
)

Disabling Re-ranking

from mini import RetrievalConfig

rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    retrieval_config=RetrievalConfig(
        use_reranking=False  # Disable reranking
    )
)

Comparison

Performance

# Test different rerankers
rerankers = [
    ("LLM", "llm", {}),
    ("Cohere", "cohere", {"model": "rerank-english-v3.0"}),
    ("Local", "sentence-transformer", {"model_name": "cross-encoder/ms-marco-MiniLM-L-6-v2"})
]

query = "What are the key findings?"

for name, reranker_type, kwargs in rerankers:
    rag = AgenticRAG(
        vector_store=vector_store,
        embedding_model=embedding_model,
        reranker_config=RerankerConfig(type=reranker_type, kwargs=kwargs)
    )
    )
    
    response = rag.query(query)
    print(f"{name}: {response.answer[:150]}...")

Speed Comparison

Method              | Speed   | Cost                 | Quality
No reranking        | Fastest | Free                 | Baseline
Cohere              | Fast    | Low ($0.002/1K docs) | Excellent
Local Cross-Encoder | Medium  | Free                 | Very Good
LLM-based           | Slow    | Medium               | Good
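
To see where your own setup lands in this table, you can time queries directly. A simple sketch reusing a rag instance from the comparison script above:

import time

start = time.perf_counter()
response = rag.query("What are the key findings?")
elapsed = time.perf_counter() - start
print(f"Query latency: {elapsed:.2f}s")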

Quality Comparison

Query: "What is the budget for railways?"

No reranking: 
├─ Chunk 1: [0.85] Railways infrastructure...
├─ Chunk 2: [0.84] Budget allocation total...
└─ Chunk 3: [0.82] Transportation spending...

With Cohere reranking:
├─ Chunk 1: [0.95] Railways budget: $50M allocated...
├─ Chunk 2: [0.89] Infrastructure railways improvements...
└─ Chunk 3: [0.75] Transportation budget overview...

Best Practices

Selection guide:
  • Cohere: Best balance for production (fast + high quality)
  • LLM: Simple setup, good for prototyping
  • Local: Data privacy requirements, no API costs
  • None: Speed is critical, budget is tight

Balance retrieval and reranking:

retrieval_config = RetrievalConfig(
    top_k=15,        # Cast a wide net
    rerank_top_k=5   # Keep best 5
)

  • Higher top_k: More candidates, better recall
  • Lower rerank_top_k: Only best chunks for LLM

Use observability to track:
  • Reranking latency
  • Score distributions
  • Cost per query
  • Quality improvements

observability_config = ObservabilityConfig(enabled=True)

Cost Analysis

Cohere Rerank Pricing

$0.002 per 1,000 documents

Example costs:
- 100 queries × 10 chunks = 1,000 docs = $0.002
- 1,000 queries × 10 chunks = 10,000 docs = $0.02
- 10,000 queries × 10 chunks = 100,000 docs = $0.20

LLM-based Cost

Depends on your LLM pricing

With GPT-4o-mini:
- 10 chunks × 200 tokens each = 2,000 tokens input
- $0.15 per 1M input tokens
- Cost per query: ~$0.0003

100,000 queries = ~$30

Local Cross-Encoder

Zero API costs

One-time costs:
- GPU (optional): $500-2000
- Compute time: Minimal

Ongoing costs:
- Electricity: ~$0.10/day with GPU
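
The same arithmetic as a quick script, using the prices quoted above (the prices are the assumptions from this page; check current provider pricing):

queries = 10_000
chunks_per_query = 10

# Cohere: $0.002 per 1,000 documents reranked
cohere_cost = queries * chunks_per_query / 1_000 * 0.002

# LLM-based with GPT-4o-mini: ~200 input tokens per chunk at $0.15 per 1M input tokens
llm_cost = queries * chunks_per_query * 200 / 1_000_000 * 0.15

print(f"Cohere:    ${cohere_cost:.2f}")   # $0.20
print(f"LLM-based: ${llm_cost:.2f}")      # $3.00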

Troubleshooting

Cohere API key errors

Solution: Check your API key:

echo $COHERE_API_KEY

Ensure it's set in your .env file and valid.

Local model fails to load

Solution: Install sentence-transformers:

pip install sentence-transformers

First run downloads the model (~100MB).

Reranking is too slow

Solutions:
  1. Switch to Cohere (fastest)
  2. Reduce top_k (fewer chunks to rerank)
  3. Use smaller local model
  4. Disable reranking if speed is critical

Reranking doesn't improve quality

Solutions:
  1. Try Cohere (usually best quality)
  2. Increase top_k (more candidates)
  3. Ensure good initial retrieval
  4. Check if chunks are well-formed

Advanced Usage

Dynamic Reranker Selection

def get_reranker_config(query_type):
    if query_type == "technical":
        return RerankerConfig(type="cohere")
    elif query_type == "simple":
        return RerankerConfig(type="llm")
    else:
        return RerankerConfig(type="sentence-transformer")

# Use different rerankers based on query
config = get_reranker_config("technical")
rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    reranker_config=config
)

Accessing Reranked Scores

response = rag.query("What are the findings?")

for chunk in response.retrieved_chunks:
    print(f"Original score: {chunk.score:.4f}")
    print(f"Reranked score: {chunk.reranked_score:.4f}")
    print(f"Improvement: {chunk.reranked_score - chunk.score:.4f}")

Next Steps