RAG Agents
Retrieval-Augmented Generation combines the reasoning capabilities of language models with the precision of database retrieval. The agent doesn't just generate—it grounds its responses in your actual documents. This guide covers the complete RAG pipeline in technical depth, from query understanding through generation and evaluation.
The Complete RAG Pipeline
A RAG system augments the LLM's context with retrieved information. Instead of relying solely on training data, the model receives relevant documents alongside the user's query. The pipeline consists of discrete stages, each with its own optimization surface.
Query Understanding and Preprocessing
Raw user queries are rarely optimal for retrieval. Query understanding transforms the input into forms that maximize retrieval effectiveness. This stage determines the ceiling for the entire pipeline—even perfect retrieval cannot recover from a misunderstood query.
Intent Classification
Determine what type of information the user seeks. Factual queries ("What is X?") require different retrieval strategies than procedural queries ("How do I X?") or comparative queries ("What's the difference between X and Y?").
FACTUAL → narrow retrieval, prioritize high-precision matches
PROCEDURAL → retrieve step-by-step content, maintain order
COMPARATIVE → retrieve multiple entities, ensure coverage
EXPLORATORY → broader retrieval, diverse sources
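The routing above can be sketched as a toy keyword heuristic. The regex patterns and labels here are illustrative assumptions; a production system would typically use an LLM prompt or a trained classifier instead:

```python
import re

# Toy intent router. Patterns and categories are illustrative assumptions.
PATTERNS = [
    ("PROCEDURAL", re.compile(r"^how (do|can|should) i\b|step[- ]by[- ]step", re.I)),
    ("COMPARATIVE", re.compile(r"\b(vs\.?|versus|difference between|compare)\b", re.I)),
    ("FACTUAL", re.compile(r"^(what|who|when|where) (is|are|was|were)\b", re.I)),
]

def classify_intent(query: str) -> str:
    for label, pattern in PATTERNS:
        if pattern.search(query):
            return label
    return "EXPLORATORY"  # default: broad, diverse retrieval
```

The first matching pattern wins, so comparative phrasing ("what is the difference between...") routes to COMPARATIVE rather than FACTUAL.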
Entity Extraction
Identify named entities, technical terms, and domain-specific concepts in the query. These become high-priority matching targets and can be used for metadata filtering before vector search.
Entities: PostgreSQL (database), MVCC (concept), transactions (concept)
Filter: category=databases OR topic=concurrency
Query Decomposition
Complex queries often contain multiple information needs. Decomposition breaks these into atomic sub-queries that can be independently retrieved and then synthesized. This is especially important for multi-hop reasoning.
Sub-query 1: "Q3 revenue figures"
Sub-query 2: "Competitor Q3 revenue data"
Sub-query 3: "Revenue variance analysis methodology"
Coreference Resolution
In conversational RAG, queries often reference previous context. "What about their pricing?" requires resolving "their" to the entity discussed earlier. Without this, the retrieval query is semantically incomplete and will likely fail.
Retrieval Strategies: Dense, Sparse, and Hybrid
Retrieval is the core of RAG. Three fundamental approaches exist, each with distinct strengths. Production systems typically combine them.
Sparse Retrieval: BM25 and TF-IDF
Sparse retrieval represents documents and queries as high-dimensional vectors where most values are zero—only terms present in the text have non-zero weights. This approach excels at exact lexical matching.
TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF weights terms by how frequently they appear in a document (TF) and how rare they are across the corpus (IDF). Common words like "the" get low weights; rare domain terms get high weights.
IDF(t) = log(total_docs / docs_containing(t))
TF-IDF(t, d) = TF(t, d) × IDF(t)
// Example: "PostgreSQL" in a database doc
TF = 5/200 = 0.025
IDF = log(10000/50) = 5.3
TF-IDF = 0.025 × 5.3 = 0.133
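The worked example above can be checked with a minimal sketch. This uses the natural log, matching the formula here; real implementations vary in log base and smoothing:

```python
import math

def tf(term_count: int, doc_length: int) -> float:
    # Raw term frequency, normalized by document length
    return term_count / doc_length

def idf(total_docs: int, docs_containing: int) -> float:
    # Natural log, as in the formula above
    return math.log(total_docs / docs_containing)

def tf_idf(term_count: int, doc_length: int,
           total_docs: int, docs_containing: int) -> float:
    return tf(term_count, doc_length) * idf(total_docs, docs_containing)

# "PostgreSQL" appears 5 times in a 200-term document,
# in 50 of 10,000 corpus documents
score = tf_idf(5, 200, 10_000, 50)  # ≈ 0.13
```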
BM25 (Best Matching 25)
BM25 extends TF-IDF with saturation and document length normalization. Term frequency saturates—additional occurrences matter less. Long documents are penalized to avoid bias toward verbose text.
score(D, Q) = Σ IDF(qi) × [f(qi, D) × (k1 + 1)]
              / [f(qi, D) + k1 × (1 - b + b × |D|/avgdl)]
// Parameters
k1 = 1.2 to 2.0 // term frequency saturation
b = 0.75 // document length normalization
avgdl // average document length in corpus
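A minimal sketch of the per-term score, assuming the standard Okapi BM25 form and picking k1 = 1.5 from the range above:

```python
def bm25_term_score(tf: float, doc_len: float, avgdl: float,
                    idf: float, k1: float = 1.5, b: float = 0.75) -> float:
    """Score one query term against one document (formula above).

    tf: term frequency in the document
    doc_len / avgdl: document length vs corpus average
    idf: precomputed inverse document frequency for the term
    """
    length_norm = k1 * (1 - b + b * doc_len / avgdl)
    return idf * tf * (k1 + 1) / (tf + length_norm)
```

Two properties fall out directly: the score saturates (going from 1 to 10 occurrences yields far less than a 10x score), and longer-than-average documents score lower for the same term frequency.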
BM25 remains the gold standard for keyword search. It handles out-of-vocabulary terms gracefully and requires no training. For domain-specific jargon, product names, and exact phrase matching, sparse retrieval often outperforms dense methods.
Dense Retrieval: Semantic Embeddings
Dense retrieval represents text as continuous vectors where every dimension is non-zero. These embeddings capture semantic meaning—"car" and "automobile" have similar vectors despite sharing no characters. The embedding model learns these relationships from training data.
query_embedding = embed(query) // → float[768]
// Similarity search in vector space
for doc_embedding in corpus:
similarity = cosine(query_embedding, doc_embedding)
if similarity > threshold:
results.append(doc)
// Cosine similarity
cosine(a, b) = (a · b) / (||a|| × ||b||)
Range: -1 to 1, where 1 = identical direction
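Cosine similarity is a few lines of plain Python (shown without numpy for clarity; production code would use vectorized operations):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # (a · b) / (||a|| × ||b||), as defined above
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```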
Dense Strengths
- Semantic matching across paraphrases
- Handles synonyms and related concepts
- Works across languages (multilingual models)
- Captures conceptual similarity
Dense Weaknesses
- Struggles with rare terms and neologisms
- Product names, codes may be mishandled
- Requires model fine-tuning for domains
- Exact phrase matching is imprecise
Hybrid Retrieval: Combining Sparse and Dense
Hybrid retrieval runs both sparse and dense searches, then fuses the results. This captures the strengths of both approaches—exact keyword matching from BM25 and semantic understanding from dense embeddings.
RRF_score(doc) = Σ 1 / (k + rank_i(doc))
// k is a constant (typically 60) that damps the dominance of top-ranked docs
Example:
Doc A: BM25 rank=1, Dense rank=5
RRF = 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318
Doc B: BM25 rank=10, Dense rank=2
RRF = 1/(60+10) + 1/(60+2) = 0.0143 + 0.0161 = 0.0304
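A minimal RRF implementation over ranked lists of document IDs, which reproduces the example numbers above:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Fuse multiple ranked ID lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores
```

Feeding in a BM25 list with doc A at rank 1 and B at rank 10, and a dense list with B at rank 2 and A at rank 5, yields 0.0318 for A and 0.0304 for B, matching the worked example.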
Alternative Fusion Methods
- Weighted Linear Combination: Normalize scores to [0,1], then compute α×sparse + (1-α)×dense. Requires tuning α per domain.
- Learned Fusion: Train a small model to predict relevance from sparse and dense features. Most accurate but requires labeled data.
- Cascade: Use sparse retrieval for initial candidates, then rerank with dense scoring. Reduces computation for large corpora.
Approximate Nearest Neighbor Algorithms
Exact nearest neighbor search requires comparing the query vector against every document vector—O(n) per query. At scale, this is prohibitive. Approximate Nearest Neighbor (ANN) algorithms trade perfect recall for orders-of-magnitude speedup.
HNSW (Hierarchical Navigable Small World)
HNSW builds a multi-layer graph where each node connects to nearby nodes. Upper layers have sparse, long-range connections for fast traversal; lower layers have dense, local connections for precision. Search starts at top layers and descends.
Entry point
    ↓
Layer 2: [Node_12] ──── [Node_156] ──── [Node_847]
              ↓               ↓
Layer 1: [12]─[45]─[67]─[156]─[189]─[234]─[847]─[901]
              ↓               ↓               ↓
Layer 0: All nodes with local neighborhood connections
Key parameters:
M = max connections per node (higher = better recall, more memory)
efConstruction = search depth during build (higher = better graph)
efSearch = search depth at query time (higher = better recall)
HNSW offers excellent query latency (sub-millisecond) and high recall (> 0.95). Memory usage is significant—expect 1-2KB per vector on top of the vectors themselves. It's the default choice for most production systems.
IVF (Inverted File Index)
IVF clusters vectors using k-means, then builds an inverted index mapping centroids to their member vectors. At query time, only clusters near the query centroid are searched.
centroids = kmeans(vectors, nlist=1024)
for vec in vectors:
cluster = nearest_centroid(vec)
index[cluster].append(vec)
// Query: search nprobe closest clusters
query_clusters = top_k_centroids(query, nprobe=16)
candidates = union(index[c] for c in query_clusters)
results = brute_force_search(query, candidates)
Tradeoff: nprobe ↑ = better recall, slower search
IVF is memory-efficient and works well when combined with Product Quantization. It requires training (k-means) before use, unlike HNSW which builds incrementally.
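The pseudocode above can be made concrete with a toy IVF sketch. The centroids are assumed given (the k-means training step is omitted), and brute-force distance replaces any quantization:

```python
from collections import defaultdict

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign each vector to its nearest centroid cluster."""
    index = defaultdict(list)
    for vec in vectors:
        cluster = min(range(len(centroids)),
                      key=lambda c: sq_dist(vec, centroids[c]))
        index[cluster].append(vec)
    return index

def ivf_search(query, centroids, index, nprobe=2, k=1):
    # Search only the nprobe clusters closest to the query
    probe = sorted(range(len(centroids)),
                   key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [v for c in probe for v in index[c]]
    return sorted(candidates, key=lambda v: sq_dist(query, v))[:k]
```

With nprobe equal to the number of clusters this degrades to exact search; lowering nprobe trades recall for speed, as noted above.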
PQ (Product Quantization)
PQ compresses vectors by splitting them into subvectors and replacing each with a centroid ID. A 768-dim float32 vector (3KB) can compress to 96 bytes with acceptable accuracy loss.
// Split into M subvectors
subvectors = split(vector, M=96) // 96 subvecs of 8 dims each
// Each subvector → 1 byte centroid ID (256 centroids)
compressed = [centroid_id(sv) for sv in subvectors]
Result: 96 bytes (32x compression)
// Distance computed via lookup tables
distance ≈ sum(lookup_table[i][compressed[i]] for i in range(M))
PQ is often combined with IVF (IVF-PQ) for large-scale deployments. The compression introduces some accuracy loss, typically 5-10% recall reduction.
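A toy PQ sketch, assuming pre-trained codebooks. Real systems learn the codebooks via k-means and use 256 centroids per subspace so each code fits in one byte:

```python
def nearest(codebook, subvec):
    # Index of the closest centroid in this subspace
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], subvec)))

def pq_encode(vector, codebooks):
    """Split into len(codebooks) subvectors; store one centroid ID each."""
    d = len(vector) // len(codebooks)
    return [nearest(cb, vector[i * d:(i + 1) * d])
            for i, cb in enumerate(codebooks)]

def pq_distance(query, code, codebooks):
    # Asymmetric distance: exact query subvectors vs quantized doc subvectors
    d = len(query) // len(codebooks)
    total = 0.0
    for i, cb in enumerate(codebooks):
        sub_q = query[i * d:(i + 1) * d]
        total += sum((a - b) ** 2 for a, b in zip(sub_q, cb[code[i]]))
    return total
```

In practice the per-subspace distances are precomputed once per query into lookup tables, so scoring a document is M table lookups rather than M distance computations.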
ANN Algorithm Comparison
| Algorithm | Memory | Build Time | Query Latency | Best For |
|---|---|---|---|---|
| HNSW | High | Medium | <1ms | Low latency, high recall |
| IVF | Low | Fast | 1-10ms | Memory-constrained |
| IVF-PQ | Very Low | Slow | 1-10ms | Billion-scale |
| Flat (Exact) | Low | None | O(n) | <100k vectors, perfect recall |
Vector Database Internals
Vector databases combine ANN indexes with storage, filtering, and operational features. Understanding their internals helps with capacity planning and performance tuning.
Index Types
- In-memory index: Entire index in RAM. Fastest queries, limited by memory. Typical for <10M vectors.
- Memory-mapped index: Index on disk, paged into memory on demand. Handles larger datasets, some latency variance.
- On-disk index: Purpose-built disk structures. Supports very large scale but with higher latency.
Metadata Filtering
Pre-filtering narrows the search space before vector similarity. Post-filtering applies filters after ANN search. The choice affects both performance and result quality.
// Pre-filtering
candidates = filter(category="technical")
results = ann_search(query, candidates) // smaller search space
// Post-filtering
results = ann_search(query, all_vectors)
results = filter(results, category="technical") // may return <k results
Pre-filter when: filter is selective (<10% of corpus)
Post-filter when: filter is broad or ANN index doesn't support it
Query Expansion Techniques
Query expansion transforms a single query into multiple queries that collectively cover more of the semantic space. This compensates for vocabulary mismatch between user queries and document content.
Multi-Query Expansion
Generate multiple rephrasings of the user query, retrieve for each, and merge results. Covers different phrasings that might match different documents.
Generated queries:
1. "database connection timeout configuration"
2. "handling DB connection pool exhaustion"
3. "troubleshooting database connectivity issues"
4. "connection timeout retry strategies"
// Retrieve for each, merge with deduplication
all_results = unique(union(retrieve(q) for q in queries))
Step-Back Prompting
Generate a more abstract version of the query to retrieve foundational information. Useful for questions that require background knowledge before addressing specifics.
Step-back query: "How does React detect state changes?"
// Retrieval strategy
context_general = retrieve(step_back_query)
context_specific = retrieve(original_query)
final_context = context_general + context_specific
HyDE (Hypothetical Document Embeddings)
Instead of embedding the query directly, generate a hypothetical answer document and embed that. The hypothesis exists in the same semantic space as actual answers, often yielding better matches.
hypothetical_doc = llm.generate(
"Write a paragraph answering: " + query
)
// Hypothetical document (hallucinated but semantically useful)
"Memory leaks in Python typically occur when objects
maintain references preventing garbage collection.
Common causes include circular references, global
variables holding large objects, and unclosed resources..."
// Embed the hypothesis, not the question
search_vector = embed(hypothetical_doc)
results = retrieve(search_vector)
HyDE adds latency (LLM call before retrieval) but can significantly improve recall for complex or ambiguous queries. The hypothesis need not be factually correct—it just needs to be semantically similar to real answers.
Reranking Models and Cross-Encoders
Initial retrieval optimizes for recall—casting a wide net. Reranking optimizes for precision—sorting the candidates by actual relevance. This two-stage approach lets you use expensive models only on a small candidate set.
Bi-Encoders vs Cross-Encoders
// Bi-encoder: query and doc encoded independently
q_emb = encode(query)
d_emb = encode(doc) // precomputed
score = cosine(q_emb, d_emb)
// Cross-encoder: full cross-attention between
// query and doc tokens
score = cross_encoder(query, doc)
// Stage 1: Broad retrieval (bi-encoder)
candidates = vector_search(query, top_k=100) // ~5ms
// Stage 2: Precise reranking (cross-encoder)
scores = [cross_encoder.score(query, doc) for doc in candidates]
reranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
final_results = reranked[:10] // ~50ms for 100 candidates
Total latency: ~55ms for high-quality results
Local Reranking Models
- bge-reranker: BAAI's family of rerankers. Available in small (30M), base (278M), and large (560M) sizes.
- ms-marco-MiniLM: Lightweight cross-encoder trained on MS MARCO. Good balance of speed and accuracy.
- ColBERT: Late-interaction model. Computes token-level embeddings for fine-grained matching with reasonable speed.
Context Window Management
LLMs have fixed context windows (4k to 128k+ tokens). Efficient context management ensures the most relevant information fits within limits while maintaining coherence.
Truncation Strategies
- Rank-based: Include top-k results by relevance score. Simple but may miss diverse information.
- Token-budget: Fill context up to a token limit. Include as many chunks as fit, prioritized by score.
- Diversity-aware: Use MMR (Maximal Marginal Relevance) to balance relevance and diversity. Avoids redundant chunks.
- Hierarchical: Summarize low-scoring chunks to include more information at lower fidelity.
Context Ordering
Position matters. Research shows LLMs attend more strongly to the beginning and end of context (the "lost in the middle" problem). Strategic ordering improves answer quality.
Relevance-first: Most relevant at top
Relevance-last: Most relevant at bottom (near query)
Bookend: Most relevant at top and bottom, less relevant in middle
Chronological: For time-sensitive content, preserve temporal order
Maximal Marginal Relevance (MMR)
MMR iteratively selects documents that are both relevant to the query and different from already-selected documents. This reduces redundancy.
MMR(d) = λ × sim(d, query) − (1 − λ) × max over s in selected of sim(d, s)
// λ controls relevance vs diversity tradeoff
λ = 1.0 → pure relevance (no diversity)
λ = 0.5 → balanced
λ = 0.0 → pure diversity (ignores relevance)
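The selection loop is short. This sketch assumes similarities are precomputed: query_sims[i] is sim(doc_i, query) and doc_sims[i][j] is sim(doc_i, doc_j):

```python
def mmr_select(query_sims, doc_sims, k, lam=0.5):
    """Greedily pick k documents balancing relevance and diversity."""
    selected: list[int] = []
    remaining = set(range(len(query_sims)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            # Penalize similarity to anything already selected
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lam * query_sims[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-duplicate relevant documents and one distinct document, λ = 1.0 picks both duplicates while λ = 0.5 swaps the second duplicate for the distinct one.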
Prompt Construction for RAG
The prompt template determines how the LLM interprets and uses the retrieved context. Well-structured prompts reduce hallucination and improve answer quality.
You are a helpful assistant. Answer questions using ONLY the provided
context. If the context doesn't contain the answer, say so explicitly.
Do not make up information.
context:
[Document 1: {source: policies/refund.md}]
Enterprise customers may request refunds within 90 days...
[Document 2: {source: faq/billing.md}]
All refund requests require manager approval for amounts...
user:
{user_query}
assistant:
Based on the provided context...
Prompt Engineering Patterns
- Source attribution: Include source identifiers in context so the LLM can cite them. "According to [Document 2]..."
- Uncertainty instructions: Explicitly instruct the model to acknowledge when information is missing or uncertain.
- Format specification: Define output structure for consistent parsing. JSON, markdown, or structured prose.
- Chain-of-thought: "First, identify relevant information. Then, synthesize. Finally, answer with citations."
Citation and Attribution
Citations serve two purposes: they let users verify claims, and they provide accountability for the system. Implementing reliable citation requires both prompt engineering and post-processing.
"...refunds are available within 90 days [1] for enterprise
customers, subject to manager approval [2]."
Structured output:
{
"answer": "Refunds are available within 90 days...",
"citations": [
{"claim": "90 days", "source": "doc_1", "excerpt": "..."}
]
}
Post-hoc attribution:
// After generation, match claims to sources
for claim in extract_claims(answer):
best_match = find_supporting_source(claim, context)
if similarity(claim, best_match) > threshold:
citations.append((claim, best_match))
Citation Verification
LLMs can hallucinate citations. Verify that cited text actually appears in the source document. This can be done via string matching, semantic similarity, or an NLI (Natural Language Inference) model that checks if the source entails the claim.
Hallucination Detection and Mitigation
Hallucination in RAG occurs when the model generates information not supported by the retrieved context. Detection and mitigation are critical for trustworthy systems.
Detection Methods
- NLI-based: Check if context entails each claim
- Self-consistency: Generate multiple answers, check agreement
- Token probability: Low confidence often correlates with hallucination
- Fact verification: Cross-check against a knowledge base
- Semantic overlap: Measure similarity between answer and context
Mitigation Strategies
- Strong grounding prompts: "Only use provided context"
- Extractive preference: Quote rather than paraphrase
- Confidence thresholds: Decline to answer when uncertain
- Human-in-the-loop for high-stakes answers
- Fine-tuning on grounded response data
for claim in claims:
for context_chunk in retrieved_context:
result = nli_model(premise=context_chunk, hypothesis=claim)
// result in ['entailment', 'neutral', 'contradiction']
if result == 'entailment':
claim.supported = True
break
unsupported = [c for c in claims if not c.supported]
if unsupported:
flag_for_review(answer, unsupported)
Evaluation Metrics
RAG evaluation requires measuring both retrieval quality and generation quality. The pipeline nature means errors compound—poor retrieval limits generation quality.
Retrieval Metrics
Precision@k = relevant_in_top_k / k
// What fraction of retrieved docs are relevant?
Recall@k = relevant_retrieved / total_relevant
// What fraction of relevant docs were retrieved?
MRR (Mean Reciprocal Rank) = 1/rank_of_first_relevant
// How high does the first relevant doc rank?
NDCG (Normalized Discounted Cumulative Gain)
// Position-weighted relevance score
DCG = Σ rel_i / log2(i + 1)
NDCG = DCG / ideal_DCG
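The DCG and NDCG formulas translate directly; positions are 1-based, as above:

```python
import math

def dcg(relevances: list[float]) -> float:
    # Σ rel_i / log2(i + 1), with i the 1-based rank position
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances: list[float]) -> float:
    # Normalize by the DCG of the ideal (descending) ordering
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfectly ordered result list scores 1.0; swapping the top two results with relevances 3 and 2 drops NDCG to about 0.91, showing the position discounting at work.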
Generation Metrics
Context Relevance
// Is the retrieved context relevant to the question?
score = llm_judge("Is this context relevant to: {query}")
Answer Faithfulness
// Is the answer supported by the context?
claims = extract_claims(answer)
faithfulness = supported_claims / total_claims
Answer Relevance
// Does the answer address the question?
score = semantic_similarity(answer, question)
RAGAS Framework
RAGAS provides automated evaluation using LLMs as judges. It computes four core metrics (faithfulness, answer relevancy, context precision, and context recall) that together cover the full RAG pipeline.
Advanced RAG Patterns
Beyond basic retrieve-then-generate, advanced patterns add reasoning, self-correction, and adaptive behavior. These transform static RAG pipelines into dynamic agents.
CRAG (Corrective RAG)
CRAG evaluates retrieval quality before generation. If retrieved documents are irrelevant, it triggers web search or query reformulation rather than proceeding with poor context.
relevance = evaluate_relevance(query, retrieved)
if relevance == "CORRECT":
context = refine(retrieved) // extract key info
elif relevance == "AMBIGUOUS":
context = retrieved + web_search(query) // augment
elif relevance == "INCORRECT":
context = web_search(query) // replace entirely
answer = generate(query, context)
Self-RAG
Self-RAG trains the LLM to generate special tokens that control retrieval and self-critique. The model decides when to retrieve, evaluates retrieved content, and assesses its own responses.
[Retrieve]: yes/no - should I retrieve?
[ISREL]: relevant/irrelevant - is retrieved doc relevant?
[ISSUP]: fully/partially/no - is my response supported?
[ISUSE]: 1-5 - how useful is this response?
// Generation with self-reflection
"Based on the context [ISREL:relevant], the refund policy
allows returns within 90 days [ISSUP:fully_supported].
[ISUSE:5]"
Adaptive RAG
Adaptive RAG classifies queries and routes them to different pipelines based on complexity. Simple factual queries skip expensive processing; complex multi-hop queries get full treatment.
if complexity == "SIMPLE":
// Direct retrieval + generation
answer = basic_rag(query)
elif complexity == "MODERATE":
// Add query expansion + reranking
answer = enhanced_rag(query)
elif complexity == "COMPLEX":
// Multi-step reasoning, iterative retrieval
answer = agentic_rag(query)
Graph RAG
Graph RAG builds a knowledge graph from documents, then uses graph traversal alongside vector search. Enables multi-hop reasoning and relationship-based queries.
entities = extract_entities(chunks)
relations = extract_relations(chunks)
graph = build_graph(entities, relations)
// Query: hybrid vector + graph
vector_results = vector_search(query)
seed_entities = extract_entities(query)
graph_context = traverse(graph, seed_entities, depth=2)
context = merge(vector_results, graph_context)
answer = generate(query, context)
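The traverse step above can be sketched as a breadth-first walk bounded by hop depth. The adjacency-list graph shape is a simplifying assumption; real graph stores also carry edge labels and weights:

```python
from collections import deque

def traverse(graph: dict[str, list[str]], seeds: list[str],
             depth: int = 2) -> set[str]:
    """BFS from seed entities up to `depth` hops; returns reached entities."""
    visited = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue  # hop budget exhausted for this branch
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, d + 1))
    return visited
```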
Multi-Modal RAG
Multi-modal RAG extends retrieval to images, tables, charts, and other non-text content. This is essential for technical documentation, research papers, and enterprise knowledge bases with visual content.
Image Retrieval
Two approaches: (1) Generate text descriptions/captions for images, embed those. (2) Use multi-modal embeddings (CLIP, SigLIP) that encode images and text into the same vector space.
caption = vision_model.describe(image)
embedding = text_embed(caption)
// Approach 2: Multi-modal embeddings
image_embedding = clip.encode_image(image)
query_embedding = clip.encode_text(query)
// Both in same 512-dim space
Table Retrieval
Tables contain structured information that flat text embeddings miss. Options include: table-aware embeddings, serialization (markdown/JSON), or dedicated table retrieval.
markdown = table.to_markdown() // |col1|col2|
json = table.to_json() // [{row}...]
linearized = f"Table {title}: Column {col} contains..."
// Include column summaries for retrieval
summary = f"Table about {topic} with columns: {cols}"
embedding = embed(summary + markdown[:500])
Production Considerations
Moving RAG from prototype to production requires addressing latency, cost, and reliability. These operational concerns often dominate the engineering effort.
Caching Strategies
- Embedding cache: Cache query embeddings keyed by query text. Saves embedding model calls.
- Semantic cache: If a new query is semantically similar to a cached query, return cached results. Use embedding similarity.
- Response cache: Cache full responses for exact query matches. Include cache invalidation on document updates.
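A minimal semantic-cache sketch. The threshold value and the linear scan over entries are simplifying assumptions; a real cache would use an ANN index over the cached query embeddings:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

class SemanticCache:
    """Return a cached response when a new query's embedding is close
    enough to a previously seen query's embedding."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query_emb):
        for emb, response in self.entries:
            if cosine(query_emb, emb) >= self.threshold:
                return response
        return None  # cache miss

    def put(self, query_emb, response):
        self.entries.append((query_emb, response))
```

Remember to invalidate or version cached responses when the underlying documents change, as noted for the response cache above.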
Batching
Batch embedding requests and reranking calls to maximize throughput. Trade latency for efficiency when processing multiple queries or documents.
// Inefficient: one call per item
embeddings = [embed(q) for q in queries] // n API calls
// Efficient: batched
embeddings = embed_batch(queries) // 1 API call
Typical batch sizes:
Embedding: 32-128 texts
Reranking: 100-500 pairs
Streaming
Stream LLM responses to reduce perceived latency. Users see tokens as they're generated rather than waiting for the full response.
context = await retrieve(query) // blocking
prompt = construct_prompt(query, context)
for token in llm.stream(prompt):
yield token // send immediately
Time-to-first-token matters more than total latency
for user experience
Monitoring and Observability
- Latency breakdown: Track time spent in each pipeline stage (embedding, retrieval, reranking, generation).
- Retrieval quality: Log retrieved documents and relevance scores. Sample and review periodically.
- User feedback: Thumbs up/down, explicit corrections, click-through on citations.
- Error rates: Track retrieval failures, LLM errors, timeout rates, and hallucination detection triggers.
Local RAG Architecture
A complete local RAG agent requires several components working together. Running entirely on-device eliminates API costs, network latency, and data privacy concerns.
Embedding Model
Local models like nomic-embed-text, bge-small, or all-MiniLM. 50-100ms per query on CPU, <10ms on GPU. Size range: 30MB to 500MB.
Vector Database
ChromaDB (embedded SQLite), FAISS (library), or Qdrant (embedded mode). All support HNSW indexes and run fully offline.
Reranker (Optional)
bge-reranker-small (30MB) or ms-marco-MiniLM. Adds 50-100ms latency but significantly improves precision for complex queries.
Language Model
Llama, Mistral, Phi, or Qwen via Ollama or llama.cpp. 7B models require 4-8GB RAM (quantized), 13B+ models need 16GB+. GPU strongly recommended.
Orchestration Layer
Agent framework managing the retrieval-generation loop. Coordinates query processing, context assembly, prompt construction, and response streaming.
From RAG to Agent
Basic RAG is a pipeline: query → retrieve → generate. A RAG agent adds autonomy: the ability to decide when and what to retrieve, iterate on results, and combine information from multiple queries.
Basic RAG
- Single retrieval per query
- Fixed number of results
- No query refinement
- Stateless between requests
RAG Agent
- Multiple retrieval rounds
- Adaptive result count
- Query decomposition and refinement
- Memory across interactions
Getting Started with OnyxLab
OnyxLab's systems implement these patterns with sensible defaults. The goal is production-ready RAG without the infrastructure complexity—local-first, offline-capable, and zero recurring costs. We handle the orchestration, caching, and optimization so you can focus on your documents and use cases. Explore our Systems to see these concepts in action.