Overview

The Sentence Transformer Reranker uses local cross-encoder models to rerank retrieved documents. Because it runs entirely on your own infrastructure, it keeps data private and avoids per-query API costs.

Setup

Install Dependencies

Sentence Transformers is included with Mini RAG:
uv add mini-rag
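
To confirm the dependency resolves correctly, you can import it directly. This is just a sanity check against the sentence-transformers package itself, not part of the Mini RAG API:
import sentence_transformers
from sentence_transformers import CrossEncoder

print(sentence_transformers.__version__)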

Configuration

Basic Usage

from mini import AgenticRAG, RerankerConfig

rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    reranker_config=RerankerConfig(
        type="sentence-transformer",
        kwargs={
            "model_name": "cross-encoder/ms-marco-MiniLM-L-6-v2"
        }
    )
)

With GPU

reranker_config = RerankerConfig(
    type="sentence-transformer",
    kwargs={
        "model_name": "cross-encoder/ms-marco-MiniLM-L-6-v2",
        "device": "cuda"  # Use GPU
    }
)

With CPU

reranker_config = RerankerConfig(
    type="sentence-transformer",
    kwargs={
        "model_name": "cross-encoder/ms-marco-MiniLM-L-6-v2",
        "device": "cpu"  # Use CPU (slower)
    }
)

Available Models

Model                Size              Quality   Speed       Best for
MiniLM-L-6           Small (~80MB)     Good      Fast        General purpose, balanced performance
MiniLM-L-12          Medium (~130MB)   Better    Moderate    Higher quality needs
TinyBERT-L-2         Tiny (~40MB)      Basic     Very Fast   Resource-constrained environments
bge-reranker-base    Medium (~280MB)   High      Moderate    Multilingual support
bge-reranker-large   Large (~560MB)    Highest   Slower      Maximum quality
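
To switch models, pass the model's Hugging Face id as model_name. As an example (assuming the bge reranker is referenced by its Hugging Face id, BAAI/bge-reranker-base), a multilingual configuration could look like:
reranker_config = RerankerConfig(
    type="sentence-transformer",
    kwargs={
        "model_name": "BAAI/bge-reranker-base",  # multilingual cross-encoder
        "device": "cuda"
    }
)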

Direct Usage

Use the reranker directly:
from mini.reranker import SentenceTransformerReranker

# Initialize
reranker = SentenceTransformerReranker(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
    device="cuda"
)

# Rerank documents
query = "What is machine learning?"
documents = [
    "Machine learning is a subset of AI...",
    "Python is a programming language...",
    "Deep learning uses neural networks..."
]

results = reranker.rerank(query, documents, top_k=2)

for result in results:
    print(f"Score: {result.score:.3f}")
    print(f"Document: {result.document[:100]}...")

Complete Example

import os
from mini import (
    AgenticRAG,
    LLMConfig,
    RetrievalConfig,
    RerankerConfig,
    EmbeddingModel,
    VectorStore
)

# Initialize RAG with local reranking
rag = AgenticRAG(
    vector_store=VectorStore(
        uri=os.getenv("MILVUS_URI"),
        token=os.getenv("MILVUS_TOKEN"),
        collection_name="documents",
        dimension=1536
    ),
    embedding_model=EmbeddingModel(),
    llm_config=LLMConfig(model="gpt-4o-mini"),
    retrieval_config=RetrievalConfig(
        top_k=10,
        rerank_top_k=3,
        use_reranking=True
    ),
    reranker_config=RerankerConfig(
        type="sentence-transformer",
        kwargs={
            "model_name": "cross-encoder/ms-marco-MiniLM-L-6-v2",
            "device": "cuda"
        }
    )
)

# Index and query
rag.index_document("document.pdf")
response = rag.query("What is the main topic?")

print(response.answer)

Performance

Speed Comparison

With 10 documents to rerank:
Model                GPU (ms)    CPU (ms)
TinyBERT-L-2         20-30       100-150
MiniLM-L-6           30-50       150-250
MiniLM-L-12          50-80       300-500
bge-reranker-base    60-100      400-600
bge-reranker-large   100-150     800-1200
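
These numbers depend on hardware, document length, and batch size, so it is worth measuring on your own setup. A minimal timing sketch using only sentence-transformers and the standard library (set device="cpu" if no GPU is available):
import time
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device="cuda")
pairs = [("what is machine learning?", f"candidate document {i}") for i in range(10)]

model.predict(pairs)  # warm-up: triggers CUDA init and first-run overhead
start = time.perf_counter()
model.predict(pairs)
print(f"Rerank latency for 10 documents: {(time.perf_counter() - start) * 1000:.1f} ms")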

Memory Usage

Model                GPU Memory   RAM
TinyBERT-L-2         ~200MB       ~500MB
MiniLM-L-6           ~500MB       ~1GB
MiniLM-L-12          ~800MB       ~1.5GB
bge-reranker-base    ~1.5GB       ~2.5GB
bge-reranker-large   ~2.5GB       ~4GB
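
Actual usage varies with the PyTorch build and driver. To check GPU memory on your own machine, load the model and inspect torch's allocator stats (a rough sketch, not a precise measure of total footprint):
import torch
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device="cuda")
model.predict([("warm-up query", "warm-up document")])  # allocate weights and buffers
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**2:.0f} MB")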

Best Practices

  • Prototyping: MiniLM-L-6 (good balance)
  • Production: MiniLM-L-12 or bge-reranker-base
  • High volume: TinyBERT-L-2 (fastest)
  • Best quality: bge-reranker-large (if you have resources)

A GPU provides a 5-10x speedup:
import torch

# Check GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"

reranker_config = RerankerConfig(
    type="sentence-transformer",
    kwargs={"device": device}
)

The model is loaded once and reused:
# Model loads on first use
rag = AgenticRAG(..., reranker_config=config)

# Subsequent queries reuse the model
response1 = rag.query("question 1")
response2 = rag.query("question 2")

Process multiple queries to amortize model loading:
questions = [...]
for question in questions:
    response = rag.query(question)

Advantages

Privacy: Runs entirely on your infrastructure
No API Costs: Free after model download
No Rate Limits: Process as many queries as you want
Low Latency: Fast with GPU
Offline: Works without internet

Limitations

Initial Download: Models need to be downloaded first
Resource Requirements: Needs GPU for best performance
Quality: Slightly lower than Cohere
Maintenance: You manage the infrastructure
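
Because the weights must be downloaded once, you can prefetch them at build or deploy time so production machines never hit the network at query time. A sketch using huggingface_hub (the download library sentence-transformers relies on); afterwards, setting HF_HUB_OFFLINE=1 forces loading from the local cache:
from huggingface_hub import snapshot_download

# Pull the reranker weights into the local Hugging Face cache ahead of time
snapshot_download("cross-encoder/ms-marco-MiniLM-L-6-v2")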

Troubleshooting

If the model fails to load, verify that it can be downloaded from Hugging Face:
# Check connection
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

If you run out of GPU memory, use a smaller model or fall back to CPU:
# Smaller model
reranker_config = RerankerConfig(
    type="sentence-transformer",
    kwargs={
        "model_name": "cross-encoder/ms-marco-TinyBERT-L-2-v2",
        "device": "cuda"
    }
)

# Or use CPU
reranker_config = RerankerConfig(
    type="sentence-transformer",
    kwargs={"device": "cpu"}
)

If reranking is slow, ensure the GPU is actually being used:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA device: {torch.cuda.get_device_name(0)}")
