
Overview

Query rewriting is an agentic feature that automatically generates multiple variations of your query to improve retrieval coverage. Different phrasings can retrieve different relevant chunks, leading to more comprehensive answers.

How It Works

When enabled, Mini RAG:
  1. Takes your original query
  2. Generates 2-3 variations using an LLM
  3. Embeds all queries (original + variations)
  4. Searches with each embedding
  5. Combines and deduplicates results
  6. Re-ranks if enabled
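Step 2 can be sketched as a prompt-and-parse helper. This is an illustrative sketch, not the library's internal API: `build_rewrite_prompt` and `parse_variations` are hypothetical names, and the real rewriter's prompt and parsing may differ.

```python
# Sketch of step 2: ask an LLM for numbered variations, parse them into a list.
import re

def build_rewrite_prompt(query: str, n: int = 3) -> str:
    """Build a prompt asking the LLM for n alternative phrasings."""
    return (
        f"Rewrite the following search query in {n} different ways, "
        f"one per numbered line:\n\nQuery: {query}"
    )

def parse_variations(llm_output: str) -> list[str]:
    """Extract numbered lines like '1. ...' from the LLM's response."""
    variations = []
    for line in llm_output.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            variations.append(match.group(1).strip())
    return variations

# Example with a canned LLM response:
output = "1. How much funding was allocated?\n2. Budget allocation details"
print(parse_variations(output))
# → ['How much funding was allocated?', 'Budget allocation details']
```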
Example:

Original: "What is the budget?"

Variations:
1. "How much funding was allocated?"
2. "Budget allocation details"
3. "Financial resources assigned"

→ Search with all 4 queries
→ Combine results
→ Return top chunks
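The combine-and-deduplicate step (steps 4-5 above) can be sketched as follows. This assumes each search returns `(chunk_id, text, score)` tuples; the library's actual result types may differ.

```python
# Minimal sketch of steps 4-5: merge results from all query variations,
# keeping the best score seen for each chunk, then sort by score.
def combine_results(result_lists):
    """Deduplicate by chunk_id, keeping the highest-scoring copy."""
    best = {}
    for results in result_lists:
        for chunk_id, text, score in results:
            if chunk_id not in best or score > best[chunk_id][2]:
                best[chunk_id] = (chunk_id, text, score)
    # Highest-scoring chunks first
    return sorted(best.values(), key=lambda r: r[2], reverse=True)

original = [("c1", "Budget overview", 0.92), ("c2", "Q4 summary", 0.85)]
variation = [("c2", "Q4 summary", 0.88), ("c3", "Funding plan", 0.81)]
merged = combine_results([original, variation])
print([chunk_id for chunk_id, _, _ in merged])  # → ['c1', 'c2', 'c3']
```

Note that the duplicate `c2` appears once, with its higher score (0.88) retained.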

Quick Start

from mini import AgenticRAG, RetrievalConfig

rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    retrieval_config=RetrievalConfig(
        use_query_rewriting=True  # Enable query rewriting
    )
)

response = rag.query("What is the budget?")

# View query variations
print(f"Original: {response.original_query}")
print(f"Variations: {response.rewritten_queries}")

When to Use

Use Query Rewriting When

  • Queries may be ambiguous
  • Multiple phrasings possible
  • Need comprehensive coverage
  • Domain has varied terminology
  • Users ask questions naturally

Skip Query Rewriting When

  • Queries are very specific
  • Speed is critical
  • Cost is a concern
  • Simple factual lookups
  • Technical exact matches

Configuration

Enable/Disable

from mini import RetrievalConfig

# Enable (default for AgenticRAG)
retrieval_config = RetrievalConfig(
    use_query_rewriting=True
)

# Disable
retrieval_config = RetrievalConfig(
    use_query_rewriting=False
)

With Other Features

# Combine with hybrid search and reranking
retrieval_config = RetrievalConfig(
    top_k=15,
    rerank_top_k=5,
    use_query_rewriting=True,   # Generate variations
    use_hybrid_search=True,      # Search with semantic + BM25
    use_reranking=True           # Rerank combined results
)

rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    retrieval_config=retrieval_config
)

Examples

Example 1: Natural Questions

query = "How can I reset my password?"

response = rag.query(query)

print("Original:", response.original_query)
# "How can I reset my password?"

print("Variations:", response.rewritten_queries)
# ["What is the process to reset my password?",
#  "Steps for password reset",
#  "Password recovery procedure"]

Example 2: Technical Terms

query = "What is the JWT authentication flow?"

response = rag.query(query)

print("Variations:", response.rewritten_queries)
# ["How does JSON Web Token authentication work?",
#  "JWT auth process explained",
#  "Authentication flow using JWTs"]

Example 3: Ambiguous Queries

query = "What is the budget?"

response = rag.query(query)

print("Variations:", response.rewritten_queries)
# ["How much funding was allocated?",
#  "Budget allocation details",
#  "Financial resources assigned",
#  "Total budget amount"]

Benefits

Find more relevant content: Different phrasings retrieve different chunks, improving coverage:
  • Original: Finds chunks with “budget”
  • Variation 1: Finds chunks with “funding allocation”
  • Variation 2: Finds chunks with “financial resources”
  • Combined: More comprehensive results
Clarify vague queries:
Query: "What happened last quarter?"

Variations:
- "Q4 2023 results and performance"
- "Last quarter financial summary"
- "Recent quarterly achievements"
Each variation can target different aspects of the query.
Match domain terminology: The LLM can rephrase queries using domain-specific terms:
Query: "How do I fix the broken feature?"

Variations:
- "Troubleshooting steps for the issue"
- "Debug the malfunctioning component"
- "Resolve the defect"
Handle conversational queries:
Query: "Can you tell me about the new policy?"

Variations:
- "What is the new policy?"
- "New policy details"
- "Policy changes and updates"

Performance Impact

Speed

Configuration     Time     Impact
No rewriting      100ms    Baseline
With rewriting    250ms    +150ms (2-3 extra queries)
Components:
  • LLM query rewriting: ~50ms
  • Additional embeddings: ~30ms
  • Extra searches: ~70ms per variation

Cost

Query rewriting uses your LLM to generate variations:
With GPT-4o-mini:
- Input: ~100 tokens (original query + prompt)
- Output: ~50 tokens (2-3 variations)
- Cost per query: ~$0.000015

100,000 queries = ~$1.50

Quality

Typical improvements with query rewriting:
  • Recall: +15-30% (finds more relevant chunks)
  • Answer Quality: +10-20% (more comprehensive context)
  • Edge Cases: +30-50% (handles ambiguous queries better)

Best Practices

Choose based on use case:
# High quality (slower)
retrieval_config = RetrievalConfig(
    use_query_rewriting=True,
    top_k=15
)

# High speed (faster)
retrieval_config = RetrievalConfig(
    use_query_rewriting=False,
    top_k=5
)
Optimal pipeline:
  1. Query rewriting generates variations
  2. Each variation retrieves chunks
  3. Combine and deduplicate results
  4. Rerank for best quality
retrieval_config = RetrievalConfig(
    use_query_rewriting=True,
    use_reranking=True,
    top_k=15,
    rerank_top_k=5
)
Check what’s being generated:
response = rag.query("What is the policy?")

print("Query variations:")
for i, variation in enumerate(response.rewritten_queries, 1):
    print(f"{i}. {variation}")

# Review to ensure quality variations
Compare with and without:
# Test both configurations
rag_with = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    retrieval_config=RetrievalConfig(use_query_rewriting=True)
)

rag_without = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    retrieval_config=RetrievalConfig(use_query_rewriting=False)
)

# Compare results on your queries

Comparison

Without Query Rewriting

query = "What is the authentication flow?"

# Only searches with original embedding
results = [
    "JWT authentication implementation",
    "Login process overview",
    "User authentication system"
]

With Query Rewriting

query = "What is the authentication flow?"

# Searches with multiple embeddings
variations = [
    "What is the authentication flow?",
    "How does user authentication work?",
    "Authentication process steps",
    "Login and auth sequence"
]

results = [
    "JWT authentication implementation",
    "OAuth 2.0 flow diagram",
    "Login process overview",
    "Session management and auth",
    "User authentication system",
    "Token-based authentication guide"
]
# More comprehensive results!

Advanced Usage

Inspect Query Variations

response = rag.query("How do I deploy the application?")

# Original query
print(f"Original: {response.original_query}")

# Generated variations
print("\nVariations:")
for i, variation in enumerate(response.rewritten_queries, 1):
    print(f"  {i}. {variation}")

# Metadata includes rewriting info
print(f"\nMetadata: {response.metadata}")

Custom Query Preprocessing

# Preprocess queries before rewriting
def preprocess_query(query: str) -> str:
    # Remove filler words, fix typos, etc.
    query = query.lower()
    query = query.replace("can you tell me", "")
    query = query.replace("please explain", "what is")
    return query.strip()

# Use with RAG
original_query = "Can you tell me about the budget?"
cleaned_query = preprocess_query(original_query)
response = rag.query(cleaned_query)

Combine with Metadata Filtering

# Query rewriting + metadata filtering
response = rag.query(
    query="What are the new features?",
    # Note: metadata filtering would be added to the search call
)

# The rewritten variations are used with filters

Troubleshooting

Variations are low quality

Solution: The variations are generated by your LLM. Try:
  1. Use a better LLM model
  2. Provide more context in queries
  3. Check LLM configuration
llm_config = LLMConfig(
    model="gpt-4o-mini",  # or "gpt-4"
    temperature=0.7
)
Queries are too slow

Solution: Disable query rewriting or optimize:
# Option 1: Disable
retrieval_config = RetrievalConfig(
    use_query_rewriting=False
)

# Option 2: Use faster LLM
llm_config = LLMConfig(
    model="gpt-4o-mini"  # Faster than gpt-4
)
Costs are too high

Solution: Query rewriting adds LLM calls. To reduce cost:
  1. Disable for simple queries
  2. Use cheaper LLM model
  3. Cache common queries
# Selective usage: is_complex_query is a user-defined heuristic
retrieval_config.use_query_rewriting = is_complex_query(query)
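Caching common queries (option 3 above) can be as simple as memoizing the rewrite step. A minimal sketch using the standard library's `functools.lru_cache`; `rewrite_query` here is a stand-in for your actual LLM-backed rewriter, not a Mini RAG API:

```python
# Sketch: cache rewrites so repeated queries skip the LLM call entirely.
from functools import lru_cache

@lru_cache(maxsize=1024)
def rewrite_query(query: str) -> tuple[str, ...]:
    # Placeholder for the real LLM call; returns a tuple because
    # cached values should be immutable/hashable.
    return (f"{query} (rephrased)",)

rewrite_query("What is the budget?")   # LLM call (simulated)
rewrite_query("What is the budget?")   # Served from cache, no LLM cost
print(rewrite_query.cache_info().hits)  # → 1
```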

When to Disable

Consider disabling query rewriting when:
  • Simple factual queries: “What is X?” where X is specific
  • Speed is critical: Real-time systems with tight latency requirements
  • Cost constraints: High query volume with budget limits
  • Technical queries: Queries with specific technical terms
  • Exact match needs: Looking for specific keywords or phrases
# For simple, fast lookups
rag = AgenticRAG(
    vector_store=vector_store,
    embedding_model=embedding_model,
    retrieval_config=RetrievalConfig(
        use_query_rewriting=False,
        top_k=5
    )
)

Next Steps