Overview

The Sentence Transformer Reranker uses local cross-encoder models to rerank retrieved documents. Because it runs entirely on your own infrastructure, it keeps data private and avoids per-query API costs.

Setup

Install Dependencies

Sentence Transformers is included with Mini RAG:
uv add mini-rag
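
To confirm the dependency resolves correctly, you can import it directly. This is just a sanity check against the sentence-transformers package itself, not part of the Mini RAG API:
import sentence_transformers
from sentence_transformers import CrossEncoder

print(sentence_transformers.__version__)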

Configuration

Basic Usage

from mini import AgenticRAG, RerankerConfig

rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    reranker_config=RerankerConfig(
        type="sentence-transformer",
        kwargs={
            "model_name": "cross-encoder/ms-marco-MiniLM-L-6-v2"
        }
    )
)

With GPU

reranker_config = RerankerConfig(
    type="sentence-transformer",
    kwargs={
        "model_name": "cross-encoder/ms-marco-MiniLM-L-6-v2",
        "device": "cuda"  # Use GPU
    }
)

With CPU

reranker_config = RerankerConfig(
    type="sentence-transformer",
    kwargs={
        "model_name": "cross-encoder/ms-marco-MiniLM-L-6-v2",
        "device": "cpu"  # Use CPU (slower)
    }
)

Available Models

Model                Size              Quality   Speed       Best for
MiniLM-L-6           Small (~80MB)     Good      Fast        General purpose, balanced performance
MiniLM-L-12          Medium (~130MB)   Better    Moderate    Higher quality needs
TinyBERT-L-2         Tiny (~40MB)      Basic     Very Fast   Resource-constrained environments
bge-reranker-base    Medium (~280MB)   High      Moderate    Multilingual support
bge-reranker-large   Large (~560MB)    Highest   Slower      Maximum quality
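
To switch models, pass the model's Hugging Face id as model_name. As an example (assuming the bge reranker is referenced by its Hugging Face id, BAAI/bge-reranker-base), a multilingual configuration could look like:
reranker_config = RerankerConfig(
    type="sentence-transformer",
    kwargs={
        "model_name": "BAAI/bge-reranker-base",  # multilingual cross-encoder
        "device": "cuda"
    }
)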

Direct Usage

Use the reranker directly:
from mini.reranker import SentenceTransformerReranker

# Initialize
reranker = SentenceTransformerReranker(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
    device="cuda"
)

# Rerank documents
query = "What is machine learning?"
documents = [
    "Machine learning is a subset of AI...",
    "Python is a programming language...",
    "Deep learning uses neural networks..."
]

results = reranker.rerank(query, documents, top_k=2)

for result in results:
    print(f"Score: {result.score:.3f}")
    print(f"Document: {result.document[:100]}...")

Complete Example

import os
from mini import (
    AgenticRAG,
    LLMConfig,
    RetrievalConfig,
    RerankerConfig,
    EmbeddingModel,
    VectorStore
)

# Initialize RAG with local reranking
rag = AgenticRAG(
    vector_store=VectorStore(
        uri=os.getenv("MILVUS_URI"),
        token=os.getenv("MILVUS_TOKEN"),
        collection_name="documents",
        dimension=1536
    ),
    embedding_model=EmbeddingModel(),
    llm_config=LLMConfig(model="gpt-4o-mini"),
    retrieval_config=RetrievalConfig(
        top_k=10,
        rerank_top_k=3,
        use_reranking=True
    ),
    reranker_config=RerankerConfig(
        type="sentence-transformer",
        kwargs={
            "model_name": "cross-encoder/ms-marco-MiniLM-L-6-v2",
            "device": "cuda"
        }
    )
)

# Index and query
rag.index_document("document.pdf")
response = rag.query("What is the main topic?")

print(response.answer)

Performance

Speed Comparison

With 10 documents to rerank:
Model                GPU (ms)    CPU (ms)
TinyBERT-L-2         20-30       100-150
MiniLM-L-6           30-50       150-250
MiniLM-L-12          50-80       300-500
bge-reranker-base    60-100      400-600
bge-reranker-large   100-150     800-1200
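
These numbers depend on hardware, document length, and batch size, so it is worth measuring on your own setup. A minimal timing sketch using only sentence-transformers and the standard library (set device="cpu" if no GPU is available):
import time
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device="cuda")
pairs = [("what is machine learning?", f"candidate document {i}") for i in range(10)]

model.predict(pairs)  # warm-up: triggers CUDA init and first-run overhead
start = time.perf_counter()
model.predict(pairs)
print(f"Rerank latency for 10 documents: {(time.perf_counter() - start) * 1000:.1f} ms")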

Memory Usage

Model                GPU Memory   RAM
TinyBERT-L-2         ~200MB       ~500MB
MiniLM-L-6           ~500MB       ~1GB
MiniLM-L-12          ~800MB       ~1.5GB
bge-reranker-base    ~1.5GB       ~2.5GB
bge-reranker-large   ~2.5GB       ~4GB
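
Actual usage varies with the PyTorch build and driver. To check GPU memory on your own machine, load the model and inspect torch's allocator stats (a rough sketch, not a precise measure of total footprint):
import torch
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device="cuda")
model.predict([("warm-up query", "warm-up document")])  # allocate weights and buffers
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**2:.0f} MB")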

Best Practices

  • Prototyping: MiniLM-L-6 (good balance)
  • Production: MiniLM-L-12 or bge-reranker-base
  • High volume: TinyBERT-L-2 (fastest)
  • Best quality: bge-reranker-large (if you have resources)

A GPU provides a 5-10x speedup:
import torch

# Check GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"

reranker_config = RerankerConfig(
    type="sentence-transformer",
    kwargs={"device": device}
)

The model is loaded once and reused:
# Model loads on first use
rag = AgenticRAG(..., reranker_config=config)

# Subsequent queries reuse the model
response1 = rag.query("question 1")
response2 = rag.query("question 2")

Process multiple queries to amortize model loading:
questions = [...]
for question in questions:
    response = rag.query(question)

Advantages

Privacy: Runs entirely on your infrastructure
No API Costs: Free after model download
No Rate Limits: Process as many queries as you want
Low Latency: Fast with GPU
Offline: Works without internet

Limitations

Initial Download: Models need to be downloaded first
Resource Requirements: Needs GPU for best performance
Quality: Slightly lower than Cohere
Maintenance: You manage the infrastructure
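
Because the weights must be downloaded once, you can prefetch them at build or deploy time so production machines never hit the network at query time. A sketch using huggingface_hub (the download library sentence-transformers relies on); afterwards, setting HF_HUB_OFFLINE=1 forces loading from the local cache:
from huggingface_hub import snapshot_download

# Pull the reranker weights into the local Hugging Face cache ahead of time
snapshot_download("cross-encoder/ms-marco-MiniLM-L-6-v2")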

Troubleshooting

If the model fails to load, verify that it can be downloaded from Hugging Face:
# Check connection
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

If you run out of GPU memory, use a smaller model or fall back to CPU:
# Smaller model
reranker_config = RerankerConfig(
    type="sentence-transformer",
    kwargs={
        "model_name": "cross-encoder/ms-marco-TinyBERT-L-2-v2",
        "device": "cuda"
    }
)

# Or use CPU
reranker_config = RerankerConfig(
    type="sentence-transformer",
    kwargs={"device": "cpu"}
)

If reranking is slow, ensure the GPU is actually being used:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA device: {torch.cuda.get_device_name(0)}")
