
Overview

LLM Reranker uses your configured language model to score and rerank retrieved chunks. It’s simple to set up and doesn’t require additional APIs.

Configuration

Basic Usage

from mini import AgenticRAG, RerankerConfig

rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    reranker_config=RerankerConfig(
        type="llm"  # Uses your configured LLM
    )
)
This is the default reranker, so you can also omit it:
# These are equivalent
rag1 = AgenticRAG(vector_store, embedding_model)
rag2 = AgenticRAG(
    vector_store,
    embedding_model,
    reranker_config=RerankerConfig(type="llm")
)

How It Works

The LLM reranker (see the sketch after this list):
  1. Receives the query and retrieved chunks
  2. Asks the LLM to score each chunk’s relevance (0-10)
  3. Reranks chunks by score
  4. Returns the top chunks
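
The snippet below is a minimal sketch of this loop, not Mini RAG's internal implementation: the prompt wording, the score_chunk and rerank helpers, and the regex score parsing are illustrative assumptions built on an OpenAI-compatible client.

# Illustrative sketch of LLM-based scoring (not the library's exact internals)
import re

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def score_chunk(query: str, chunk: str) -> float:
    prompt = (
        f'Given the query: "{query}"\n\n'
        f"Score the relevance of this document on a scale of 0-10:\n"
        f'"{chunk[:500]}"\n\n'  # long chunks are truncated to save tokens
        f"Score:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.3,
        messages=[{"role": "user", "content": prompt}],
    )
    reply = response.choices[0].message.content
    match = re.search(r"\d+(?:\.\d+)?", reply)  # pull the first number from the reply
    return float(match.group()) if match else 0.0

def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    scored = [(score_chunk(query, c), c) for c in chunks]  # steps 1-2: score each chunk
    scored.sort(key=lambda pair: pair[0], reverse=True)    # step 3: rerank by score
    return scored[:top_k]                                  # step 4: return the top chunks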

Prompt Example

Given the query: "What is machine learning?"

Score the relevance of this document on a scale of 0-10:
"Machine learning is a subset of artificial intelligence..."

Score: [LLM provides score]

Direct Usage

from mini.reranker import LLMReranker
from openai import OpenAI

# Initialize with OpenAI client
client = OpenAI(api_key="sk-...")
reranker = LLMReranker(
    client=client,
    model="gpt-4o-mini",
    temperature=0.3
)

# Rerank documents
query = "What is machine learning?"
documents = [
    "Machine learning is a subset of AI...",
    "Python is a programming language...",
    "Deep learning uses neural networks..."
]

results = reranker.rerank(query, documents, top_k=2)

for result in results:
    print(f"Score: {result.score:.3f}")
    print(f"Document: {result.document[:100]}...")

Complete Example

import os
from mini import (
    AgenticRAG,
    LLMConfig,
    RetrievalConfig,
    RerankerConfig,
    EmbeddingModel,
    VectorStore
)

# Initialize RAG with LLM reranking
rag = AgenticRAG(
    vector_store=VectorStore(
        uri=os.getenv("MILVUS_URI"),
        token=os.getenv("MILVUS_TOKEN"),
        collection_name="documents",
        dimension=1536
    ),
    embedding_model=EmbeddingModel(),
    llm_config=LLMConfig(
        model="gpt-4o-mini",
        temperature=0.7
    ),
    retrieval_config=RetrievalConfig(
        top_k=10,
        rerank_top_k=3,
        use_reranking=True
    ),
    reranker_config=RerankerConfig(
        type="llm"
    )
)

# Index and query
rag.index_document("document.pdf")
response = rag.query("What is the main topic?")

print(response.answer)

Configuration Options

The LLM reranker can be configured through LLMConfig:
from mini import LLMConfig, RerankerConfig

rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    llm_config=LLMConfig(
        model="gpt-4o-mini",
        temperature=0.3,  # Lower for more consistent scoring
        timeout=60.0,
        max_retries=3
    ),
    reranker_config=RerankerConfig(type="llm")
)

Performance

Speed

  • Fast LLMs (gpt-3.5-turbo, gpt-4o-mini): 500-1000ms for 10 chunks
  • Slower LLMs (gpt-4): 1000-2000ms for 10 chunks
  • Local LLMs: Varies widely (500-5000ms)
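
These numbers vary with the model, network latency, and chunk length. To measure latency in your own setup, you can time a call to the reranker from the Direct Usage example above (reusing its reranker, query, and documents variables):
import time

start = time.perf_counter()
results = reranker.rerank(query, documents, top_k=3)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Reranked {len(documents)} chunks in {elapsed_ms:.0f} ms")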

Cost

Cost depends on your model's pricing and the number of chunks reranked:
# Example: GPT-4o-mini
# - 10 chunks × ~200 tokens each = ~2000 tokens input
# - ~100 tokens output for scores
# - Total: ~2100 tokens per reranking operation
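
To put a rough number on this, the back-of-the-envelope estimate below can be scripted; the per-1K-token prices are placeholders and should be replaced with your provider's current rates:
# Rough per-call cost estimate (placeholder prices; check your provider's pricing page)
num_chunks = 10
tokens_per_chunk = 200        # approximate prompt tokens per chunk
output_tokens = 100           # approximate tokens spent on scores
input_tokens = num_chunks * tokens_per_chunk

price_per_1k_input = 0.00015  # USD per 1K input tokens (placeholder)
price_per_1k_output = 0.0006  # USD per 1K output tokens (placeholder)

cost = (input_tokens / 1000) * price_per_1k_input + (output_tokens / 1000) * price_per_1k_output
print(f"~{input_tokens + output_tokens} tokens, roughly ${cost:.5f} per reranking call")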

Quality

  • GPT-4: Excellent (comparable to Cohere)
  • GPT-4o-mini: Very Good
  • GPT-3.5-turbo: Good
  • Local models: Varies

Best Practices

Use a lower temperature for more consistent scoring:
LLMConfig(
    model="gpt-4o-mini",
    temperature=0.3  # Lower = more consistent
)
Use faster models for reranking:
# Fast reranking
llm_config = LLMConfig(model="gpt-4o-mini")

# Can still use GPT-4 for answer generation
# (Mini RAG handles this automatically)
Long chunks are automatically truncated to save tokens:
# LLMReranker truncates to ~500 chars by default
reranker = LLMReranker(
    client=client,
    truncate_length=500  # Adjust if needed
)
Reranking more chunks consumes more LLM tokens, so keep top_k modest:
# Efficient
RetrievalConfig(top_k=10, rerank_top_k=3)

# More expensive
RetrievalConfig(top_k=20, rerank_top_k=5)

Advantages

  • Simple Setup: No additional APIs needed
  • Uses Existing LLM: Leverages your configured model
  • Good Quality: Especially with GPT-4/4o
  • Flexible: Works with any OpenAI-compatible API

Limitations

  • Token Cost: Uses LLM tokens for each reranking operation
  • Latency: Slower than specialized rerankers
  • Consistency: Scoring can vary between runs
  • Not Optimized: A general LLM rather than a purpose-built reranking model

When to Use

Use LLM reranker when:
  • ✅ You’re already using a good LLM (GPT-4, GPT-4o-mini)
  • ✅ You want simple setup with no extra APIs
  • ✅ You don’t need the absolute fastest reranking
  • ✅ Token cost is acceptable
Consider alternatives when:
  • ❌ You need maximum quality → Use Cohere
  • ❌ You need maximum speed → Use Sentence Transformer with GPU
  • ❌ You want to minimize LLM costs → Use Sentence Transformer
  • ❌ You need local/private → Use Sentence Transformer

Comparison with Other Rerankers

| Feature | LLM            | Cohere     | Sentence Transformer |
|---------|----------------|------------|----------------------|
| Quality | ⭐⭐⭐⭐       | ⭐⭐⭐⭐⭐   | ⭐⭐⭐               |
| Speed   | ⚡⚡           | ⚡⚡⚡      | ⚡⚡⚡⚡             |
| Setup   | ✅ Easy        | ✅ Easy    | ⚠️ Moderate          |
| Cost    | 💰💰 LLM tokens | 💰💰 API   | 💰 Free              |
| Privacy | ☁️ Cloud        | ☁️ Cloud   | 🔒 Local             |

Troubleshooting

If scores vary between runs, lower the temperature:
LLMConfig(temperature=0.1)
If reranking is too slow, use a faster model:
LLMConfig(model="gpt-3.5-turbo")  # Faster
or reduce the number of chunks to rerank:
RetrievalConfig(top_k=5)  # Fewer chunks to rerank
If token costs are too high, consider alternatives:
  • Sentence Transformer (local, free)
  • Reduce top_k to rerank fewer chunks
  • Use LLM reranking selectively

See Also