RAG Agents
Retrieval-Augmented Generation combines the reasoning capabilities of language models with the precision of database retrieval. The agent doesn't just generate—it grounds its responses in your actual documents. This guide covers the complete RAG pipeline in technical depth, from query understanding through generation and evaluation.
The Complete RAG Pipeline
A RAG system augments the LLM's context with retrieved information. Instead of relying solely on training data, the model receives relevant documents alongside the user's query. The pipeline consists of discrete stages, each with its own optimization surface.
Query Understanding and Preprocessing
Raw user queries are rarely optimal for retrieval. Query understanding transforms the input into forms that maximize retrieval effectiveness. This stage determines the ceiling for the entire pipeline—even perfect retrieval cannot recover from a misunderstood query.
Intent Classification
Determine what type of information the user seeks. Factual queries ("What is X?") require different retrieval strategies than procedural queries ("How do I X?") or comparative queries ("What's the difference between X and Y?").
FACTUAL → narrow retrieval, prioritize high-precision matches
PROCEDURAL → retrieve step-by-step content, maintain order
COMPARATIVE → retrieve multiple entities, ensure coverage
EXPLORATORY → broader retrieval, diverse sources
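The routing above can be sketched as a toy keyword heuristic. The regex patterns and labels here are illustrative assumptions; a production system would typically use an LLM prompt or a trained classifier instead:

```python
import re

# Toy intent router. Patterns and categories are illustrative assumptions.
PATTERNS = [
    ("PROCEDURAL", re.compile(r"^how (do|can|should) i\b|step[- ]by[- ]step", re.I)),
    ("COMPARATIVE", re.compile(r"\b(vs\.?|versus|difference between|compare)\b", re.I)),
    ("FACTUAL", re.compile(r"^(what|who|when|where) (is|are|was|were)\b", re.I)),
]

def classify_intent(query: str) -> str:
    for label, pattern in PATTERNS:
        if pattern.search(query):
            return label
    return "EXPLORATORY"  # default: broad, diverse retrieval
```

The first matching pattern wins, so comparative phrasing ("what is the difference between...") routes to COMPARATIVE rather than FACTUAL.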
Entity Extraction
Identify named entities, technical terms, and domain-specific concepts in the query. These become high-priority matching targets and can be used for metadata filtering before vector search.
Entities: PostgreSQL (database), MVCC (concept), transactions (concept)
Filter: category=databases OR topic=concurrency
Query Decomposition
Complex queries often contain multiple information needs. Decomposition breaks these into atomic sub-queries that can be independently retrieved and then synthesized. This is especially important for multi-hop reasoning.
Sub-query 1: "Q3 revenue figures"
Sub-query 2: "Competitor Q3 revenue data"
Sub-query 3: "Revenue variance analysis methodology"
Coreference Resolution
In conversational RAG, queries often reference previous context. "What about their pricing?" requires resolving "their" to the entity discussed earlier. Without this, the retrieval query is semantically incomplete and will likely fail.
Retrieval Strategies: Dense, Sparse, and Hybrid
Retrieval is the core of RAG. Three fundamental approaches exist, each with distinct strengths. Production systems typically combine them.
Sparse Retrieval: BM25 and TF-IDF
Sparse retrieval represents documents and queries as high-dimensional vectors where most values are zero—only terms present in the text have non-zero weights. This approach excels at exact lexical matching.
TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF weights terms by how frequently they appear in a document (TF) and how rare they are across the corpus (IDF). Common words like "the" get low weights; rare domain terms get high weights.
IDF(t) = log(total_docs / docs_containing(t))
TF-IDF(t, d) = TF(t, d) × IDF(t)
// Example: "PostgreSQL" in a database doc
TF = 5/200 = 0.025
IDF = log(10000/50) = 5.3
TF-IDF = 0.025 × 5.3 = 0.133
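The worked example above can be checked with a minimal sketch. This uses the natural log, matching the formula here; real implementations vary in log base and smoothing:

```python
import math

def tf(term_count: int, doc_length: int) -> float:
    # Raw term frequency, normalized by document length
    return term_count / doc_length

def idf(total_docs: int, docs_containing: int) -> float:
    # Natural log, as in the formula above
    return math.log(total_docs / docs_containing)

def tf_idf(term_count: int, doc_length: int,
           total_docs: int, docs_containing: int) -> float:
    return tf(term_count, doc_length) * idf(total_docs, docs_containing)

# "PostgreSQL" appears 5 times in a 200-term document,
# in 50 of 10,000 corpus documents
score = tf_idf(5, 200, 10_000, 50)  # ≈ 0.13
```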
BM25 (Best Matching 25)
BM25 extends TF-IDF with saturation and document length normalization. Term frequency saturates—additional occurrences matter less. Long documents are penalized to avoid bias toward verbose text.
score(D, Q) = Σ IDF(qi) × [f(qi, D) × (k1 + 1)]
              / [f(qi, D) + k1 × (1 - b + b × |D|/avgdl)]
// Parameters
k1 = 1.2 to 2.0 // term frequency saturation
b = 0.75 // document length normalization
avgdl // average document length in corpus
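A minimal sketch of the per-term score, assuming the standard Okapi BM25 form and picking k1 = 1.5 from the range above:

```python
def bm25_term_score(tf: float, doc_len: float, avgdl: float,
                    idf: float, k1: float = 1.5, b: float = 0.75) -> float:
    """Score one query term against one document (formula above).

    tf: term frequency in the document
    doc_len / avgdl: document length vs corpus average
    idf: precomputed inverse document frequency for the term
    """
    length_norm = k1 * (1 - b + b * doc_len / avgdl)
    return idf * tf * (k1 + 1) / (tf + length_norm)
```

Two properties fall out directly: the score saturates (going from 1 to 10 occurrences yields far less than a 10x score), and longer-than-average documents score lower for the same term frequency.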
BM25 remains the gold standard for keyword search. It handles out-of-vocabulary terms gracefully and requires no training. For domain-specific jargon, product names, and exact phrase matching, sparse retrieval often outperforms dense methods.
Dense Retrieval: Semantic Embeddings
Dense retrieval represents text as continuous vectors where every dimension is non-zero. These embeddings capture semantic meaning—"car" and "automobile" have similar vectors despite sharing no characters. The embedding model learns these relationships from training data.
query_embedding = embed(query) // → float[768]
// Similarity search in vector space
for doc_embedding in corpus:
similarity = cosine(query_embedding, doc_embedding)
if similarity > threshold:
results.append(doc)
// Cosine similarity
cosine(a, b) = (a · b) / (||a|| × ||b||)
Range: -1 to 1, where 1 = identical direction
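Cosine similarity is a few lines of plain Python (shown without numpy for clarity; production code would use vectorized operations):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # (a · b) / (||a|| × ||b||), as defined above
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```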
Dense Strengths
- Semantic matching across paraphrases
- Handles synonyms and related concepts
- Works across languages (multilingual models)
- Captures conceptual similarity
Dense Weaknesses
- Struggles with rare terms and neologisms
- Product names, codes may be mishandled
- Requires model fine-tuning for domains
- Exact phrase matching is imprecise
Hybrid Retrieval: Combining Sparse and Dense
Hybrid retrieval runs both sparse and dense searches, then fuses the results. This captures the strengths of both approaches—exact keyword matching from BM25 and semantic understanding from dense embeddings.
RRF_score(doc) = Σ 1 / (k + rank_i(doc))
// k is a constant (typically 60) that damps the dominance of top-ranked docs
Example:
Doc A: BM25 rank=1, Dense rank=5
RRF = 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318
Doc B: BM25 rank=10, Dense rank=2
RRF = 1/(60+10) + 1/(60+2) = 0.0143 + 0.0161 = 0.0304
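A minimal RRF implementation over ranked lists of document IDs, which reproduces the example numbers above:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Fuse multiple ranked ID lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores
```

Feeding in a BM25 list with doc A at rank 1 and B at rank 10, and a dense list with B at rank 2 and A at rank 5, yields 0.0318 for A and 0.0304 for B, matching the worked example.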
Alternative Fusion Methods
- Weighted Linear Combination: Normalize scores to [0,1], then compute α×sparse + (1-α)×dense. Requires tuning α per domain.
- Learned Fusion: Train a small model to predict relevance from sparse and dense features. Most accurate but requires labeled data.
- Cascade: Use sparse retrieval for initial candidates, then rerank with dense scoring. Reduces computation for large corpora.
Approximate Nearest Neighbor Algorithms
Exact nearest neighbor search requires comparing the query vector against every document vector—O(n) per query. At scale, this is prohibitive. Approximate Nearest Neighbor (ANN) algorithms trade perfect recall for orders-of-magnitude speedup.
HNSW (Hierarchical Navigable Small World)
HNSW builds a multi-layer graph where each node connects to nearby nodes. Upper layers have sparse, long-range connections for fast traversal; lower layers have dense, local connections for precision. Search starts at top layers and descends.
Entry point
    ↓
Layer 2: [Node_12] ──── [Node_156] ──── [Node_847]
              ↓               ↓
Layer 1: [12]─[45]─[67]─[156]─[189]─[234]─[847]─[901]
              ↓               ↓               ↓
Layer 0: All nodes with local neighborhood connections
Key parameters:
M = max connections per node (higher = better recall, more memory)
efConstruction = search depth during build (higher = better graph)
efSearch = search depth at query time (higher = better recall)
HNSW offers excellent query latency (sub-millisecond) and high recall (> 0.95). Memory usage is significant—expect 1-2KB per vector on top of the vectors themselves. It's the default choice for most production systems.
IVF (Inverted File Index)
IVF clusters vectors using k-means, then builds an inverted index mapping centroids to their member vectors. At query time, only clusters near the query centroid are searched.
centroids = kmeans(vectors, nlist=1024)
for vec in vectors:
cluster = nearest_centroid(vec)
index[cluster].append(vec)
// Query: search nprobe closest clusters
query_clusters = top_k_centroids(query, nprobe=16)
candidates = union(index[c] for c in query_clusters)
results = brute_force_search(query, candidates)
Tradeoff: nprobe ↑ = better recall, slower search
IVF is memory-efficient and works well when combined with Product Quantization. It requires training (k-means) before use, unlike HNSW which builds incrementally.
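The pseudocode above can be made concrete with a toy IVF sketch. The centroids are assumed given (the k-means training step is omitted), and brute-force distance replaces any quantization:

```python
from collections import defaultdict

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign each vector to its nearest centroid cluster."""
    index = defaultdict(list)
    for vec in vectors:
        cluster = min(range(len(centroids)),
                      key=lambda c: sq_dist(vec, centroids[c]))
        index[cluster].append(vec)
    return index

def ivf_search(query, centroids, index, nprobe=2, k=1):
    # Search only the nprobe clusters closest to the query
    probe = sorted(range(len(centroids)),
                   key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [v for c in probe for v in index[c]]
    return sorted(candidates, key=lambda v: sq_dist(query, v))[:k]
```

With nprobe equal to the number of clusters this degrades to exact search; lowering nprobe trades recall for speed, as noted above.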
PQ (Product Quantization)
PQ compresses vectors by splitting them into subvectors and replacing each with a centroid ID. A 768-dim float32 vector (3KB) can compress to 96 bytes with acceptable accuracy loss.
// Split into M subvectors
subvectors = split(vector, M=96) // 96 subvecs of 8 dims each
// Each subvector → 1 byte centroid ID (256 centroids)
compressed = [centroid_id(sv) for sv in subvectors]
Result: 96 bytes (32x compression)
// Distance computed via lookup tables
distance ≈ sum(lookup_table[i][compressed[i]] for i in range(M))
PQ is often combined with IVF (IVF-PQ) for large-scale deployments. The compression introduces some accuracy loss, typically 5-10% recall reduction.
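A toy PQ sketch, assuming pre-trained codebooks. Real systems learn the codebooks via k-means and use 256 centroids per subspace so each code fits in one byte:

```python
def nearest(codebook, subvec):
    # Index of the closest centroid in this subspace
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], subvec)))

def pq_encode(vector, codebooks):
    """Split into len(codebooks) subvectors; store one centroid ID each."""
    d = len(vector) // len(codebooks)
    return [nearest(cb, vector[i * d:(i + 1) * d])
            for i, cb in enumerate(codebooks)]

def pq_distance(query, code, codebooks):
    # Asymmetric distance: exact query subvectors vs quantized doc subvectors
    d = len(query) // len(codebooks)
    total = 0.0
    for i, cb in enumerate(codebooks):
        sub_q = query[i * d:(i + 1) * d]
        total += sum((a - b) ** 2 for a, b in zip(sub_q, cb[code[i]]))
    return total
```

In practice the per-subspace distances are precomputed once per query into lookup tables, so scoring a document is M table lookups rather than M distance computations.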
ANN Algorithm Comparison
| Algorithm | Memory | Build Time | Query Latency | Best For |
|---|---|---|---|---|
| HNSW | High | Medium | <1ms | Low latency, high recall |
| IVF | Low | Fast | 1-10ms | Memory-constrained |
| IVF-PQ | Very Low | Slow | 1-10ms | Billion-scale |
| Flat (Exact) | Low | None | O(n) | <100k vectors, perfect recall |
Vector Database Internals
Vector databases combine ANN indexes with storage, filtering, and operational features. Understanding their internals helps with capacity planning and performance tuning.
Index Types
- In-memory index: Entire index in RAM. Fastest queries, limited by memory. Typical for <10M vectors.
- Memory-mapped index: Index on disk, paged into memory on demand. Handles larger datasets, some latency variance.
- On-disk index: Purpose-built disk structures. Supports very large scale but with higher latency.
Metadata Filtering
Pre-filtering narrows the search space before vector similarity. Post-filtering applies filters after ANN search. The choice affects both performance and result quality.
// Pre-filtering
candidates = filter(category="technical")
results = ann_search(query, candidates) // smaller search space
// Post-filtering
results = ann_search(query, all_vectors)
results = filter(results, category="technical") // may return <k results
Pre-filter when: filter is selective (<10% of corpus)
Post-filter when: filter is broad or ANN index doesn't support it
Query Expansion Techniques
Query expansion transforms a single query into multiple queries that collectively cover more of the semantic space. This compensates for vocabulary mismatch between user queries and document content.
Multi-Query Expansion
Generate multiple rephrasings of the user query, retrieve for each, and merge results. Covers different phrasings that might match different documents.
Generated queries:
1. "database connection timeout configuration"
2. "handling DB connection pool exhaustion"
3. "troubleshooting database connectivity issues"
4. "connection timeout retry strategies"
// Retrieve for each, merge with deduplication
all_results = unique(union(retrieve(q) for q in queries))
Step-Back Prompting
Generate a more abstract version of the query to retrieve foundational information. Useful for questions that require background knowledge before addressing specifics.
Step-back query: "How does React detect state changes?"
// Retrieval strategy
context_general = retrieve(step_back_query)
context_specific = retrieve(original_query)
final_context = context_general + context_specific
HyDE (Hypothetical Document Embeddings)
Instead of embedding the query directly, generate a hypothetical answer document and embed that. The hypothesis exists in the same semantic space as actual answers, often yielding better matches.
hypothetical_doc = llm.generate(
"Write a paragraph answering: " + query
)
// Hypothetical document (hallucinated but semantically useful)
"Memory leaks in Python typically occur when objects
maintain references preventing garbage collection.
Common causes include circular references, global
variables holding large objects, and unclosed resources..."
// Embed the hypothesis, not the question
search_vector = embed(hypothetical_doc)
results = retrieve(search_vector)
HyDE adds latency (LLM call before retrieval) but can significantly improve recall for complex or ambiguous queries. The hypothesis need not be factually correct—it just needs to be semantically similar to real answers.
Reranking Models and Cross-Encoders
Initial retrieval optimizes for recall—casting a wide net. Reranking optimizes for precision—sorting the candidates by actual relevance. This two-stage approach lets you use expensive models only on a small candidate set.
Bi-Encoders vs Cross-Encoders
// Bi-encoder: query and doc encoded independently
q_emb = encode(query)
d_emb = encode(doc) // precomputed
score = cosine(q_emb, d_emb)
// Cross-encoder: full cross-attention between
// query and doc tokens
score = cross_encoder(query, doc)
// Stage 1: Broad retrieval (bi-encoder)
candidates = vector_search(query, top_k=100) // ~5ms
// Stage 2: Precise reranking (cross-encoder)
scores = [cross_encoder.score(query, doc) for doc in candidates]
reranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
final_results = reranked[:10] // ~50ms for 100 candidates
Total latency: ~55ms for high-quality results
Local Reranking Models
- bge-reranker: BAAI's family of rerankers. Available in small (30M), base (278M), and large (560M) sizes.
- ms-marco-MiniLM: Lightweight cross-encoder trained on MS MARCO. Good balance of speed and accuracy.
- ColBERT: Late-interaction model. Computes token-level embeddings for fine-grained matching with reasonable speed.
Context Window Management
LLMs have fixed context windows (4k to 128k+ tokens). Efficient context management ensures the most relevant information fits within limits while maintaining coherence.
Truncation Strategies
- Rank-based: Include top-k results by relevance score. Simple but may miss diverse information.
- Token-budget: Fill context up to a token limit. Include as many chunks as fit, prioritized by score.
- Diversity-aware: Use MMR (Maximal Marginal Relevance) to balance relevance and diversity. Avoids redundant chunks.
- Hierarchical: Summarize low-scoring chunks to include more information at lower fidelity.
Context Ordering
Position matters. Research shows LLMs attend more strongly to the beginning and end of context (the "lost in the middle" problem). Strategic ordering improves answer quality.
Relevance-first: Most relevant at top
Relevance-last: Most relevant at bottom (near query)
Bookend: Most relevant at top and bottom, less relevant in middle
Chronological: For time-sensitive content, preserve temporal order
Maximal Marginal Relevance (MMR)
MMR iteratively selects documents that are both relevant to the query and different from already-selected documents. This reduces redundancy.
MMR(d) = λ × sim(d, query) − (1 − λ) × max over s in selected of sim(d, s)
// λ controls relevance vs diversity tradeoff
λ = 1.0 → pure relevance (no diversity)
λ = 0.5 → balanced
λ = 0.0 → pure diversity (ignores relevance)
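The selection loop is short. This sketch assumes similarities are precomputed: query_sims[i] is sim(doc_i, query) and doc_sims[i][j] is sim(doc_i, doc_j):

```python
def mmr_select(query_sims, doc_sims, k, lam=0.5):
    """Greedily pick k documents balancing relevance and diversity."""
    selected: list[int] = []
    remaining = set(range(len(query_sims)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            # Penalize similarity to anything already selected
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lam * query_sims[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-duplicate relevant documents and one distinct document, λ = 1.0 picks both duplicates while λ = 0.5 swaps the second duplicate for the distinct one.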
Prompt Construction for RAG
The prompt template determines how the LLM interprets and uses the retrieved context. Well-structured prompts reduce hallucination and improve answer quality.
You are a helpful assistant. Answer questions using ONLY the provided
context. If the context doesn't contain the answer, say so explicitly.
Do not make up information.
context:
[Document 1: {source: policies/refund.md}]
Enterprise customers may request refunds within 90 days...
[Document 2: {source: faq/billing.md}]
All refund requests require manager approval for amounts...
user:
{user_query}
assistant:
Based on the provided context...
Prompt Engineering Patterns
- Source attribution: Include source identifiers in context so the LLM can cite them. "According to [Document 2]..."
- Uncertainty instructions: Explicitly instruct the model to acknowledge when information is missing or uncertain.
- Format specification: Define output structure for consistent parsing. JSON, markdown, or structured prose.
- Chain-of-thought: "First, identify relevant information. Then, synthesize. Finally, answer with citations."
Citation and Attribution
Citations serve two purposes: they let users verify claims, and they provide accountability for the system. Implementing reliable citation requires both prompt engineering and post-processing.
"...refunds are available within 90 days [1] for enterprise
customers, subject to manager approval [2]."
Structured output:
{
"answer": "Refunds are available within 90 days...",
"citations": [
{"claim": "90 days", "source": "doc_1", "excerpt": "..."}
]
}
Post-hoc attribution:
// After generation, match claims to sources
for claim in extract_claims(answer):
best_match = find_supporting_source(claim, context)
if similarity(claim, best_match) > threshold:
citations.append((claim, best_match))
Citation Verification
LLMs can hallucinate citations. Verify that cited text actually appears in the source document. This can be done via string matching, semantic similarity, or an NLI (Natural Language Inference) model that checks if the source entails the claim.
Hallucination Detection and Mitigation
Hallucination in RAG occurs when the model generates information not supported by the retrieved context. Detection and mitigation are critical for trustworthy systems.
Detection Methods
- NLI-based: Check if context entails each claim
- Self-consistency: Generate multiple answers, check agreement
- Token probability: Low confidence often correlates with hallucination
- Fact verification: Cross-check against a knowledge base
- Semantic overlap: Measure similarity between answer and context
Mitigation Strategies
- Strong grounding prompts: "Only use provided context"
- Extractive preference: Quote rather than paraphrase
- Confidence thresholds: Decline to answer when uncertain
- Human-in-the-loop for high-stakes answers
- Fine-tuning on grounded response data
for claim in claims:
for context_chunk in retrieved_context:
result = nli_model(premise=context_chunk, hypothesis=claim)
// result in ['entailment', 'neutral', 'contradiction']
if result == 'entailment':
claim.supported = True
break
unsupported = [c for c in claims if not c.supported]
if unsupported:
flag_for_review(answer, unsupported)
Evaluation Metrics
RAG evaluation requires measuring both retrieval quality and generation quality. The pipeline nature means errors compound—poor retrieval limits generation quality.
Retrieval Metrics
Precision@k = relevant_in_top_k / k
// What fraction of retrieved docs are relevant?
Recall@k = relevant_retrieved / total_relevant
// What fraction of relevant docs were retrieved?
MRR (Mean Reciprocal Rank) = 1/rank_of_first_relevant
// How high does the first relevant doc rank?
NDCG (Normalized Discounted Cumulative Gain)
// Position-weighted relevance score
DCG = Σ rel_i / log2(i + 1)
NDCG = DCG / ideal_DCG
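The DCG and NDCG formulas translate directly; positions are 1-based, as above:

```python
import math

def dcg(relevances: list[float]) -> float:
    # Σ rel_i / log2(i + 1), with i the 1-based rank position
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances: list[float]) -> float:
    # Normalize by the DCG of the ideal (descending) ordering
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfectly ordered result list scores 1.0; swapping the top two results with relevances 3 and 2 drops NDCG to about 0.91, showing the position discounting at work.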
Generation Metrics
Context Relevance
// Is the retrieved context relevant to the question?
score = llm_judge("Is this context relevant to: {query}")
Answer Faithfulness
// Is the answer supported by the context?
claims = extract_claims(answer)
faithfulness = supported_claims / total_claims
Answer Relevance
// Does the answer address the question?
score = semantic_similarity(answer, question)
RAGAS Framework
RAGAS provides automated evaluation using LLMs as judges. It computes four core metrics (faithfulness, answer relevancy, context precision, and context recall) that together cover the full RAG pipeline.
Advanced RAG Patterns
Beyond basic retrieve-then-generate, advanced patterns add reasoning, self-correction, and adaptive behavior. These transform static RAG pipelines into dynamic agents.
CRAG (Corrective RAG)
CRAG evaluates retrieval quality before generation. If retrieved documents are irrelevant, it triggers web search or query reformulation rather than proceeding with poor context.
relevance = evaluate_relevance(query, retrieved)
if relevance == "CORRECT":
context = refine(retrieved) // extract key info
elif relevance == "AMBIGUOUS":
context = retrieved + web_search(query) // augment
elif relevance == "INCORRECT":
context = web_search(query) // replace entirely
answer = generate(query, context)
Self-RAG
Self-RAG trains the LLM to generate special tokens that control retrieval and self-critique. The model decides when to retrieve, evaluates retrieved content, and assesses its own responses.
[Retrieve]: yes/no - should I retrieve?
[ISREL]: relevant/irrelevant - is retrieved doc relevant?
[ISSUP]: fully/partially/no - is my response supported?
[ISUSE]: 1-5 - how useful is this response?
// Generation with self-reflection
"Based on the context [ISREL:relevant], the refund policy
allows returns within 90 days [ISSUP:fully_supported].
[ISUSE:5]"
Adaptive RAG
Adaptive RAG classifies queries and routes them to different pipelines based on complexity. Simple factual queries skip expensive processing; complex multi-hop queries get full treatment.
if complexity == "SIMPLE":
// Direct retrieval + generation
answer = basic_rag(query)
elif complexity == "MODERATE":
// Add query expansion + reranking
answer = enhanced_rag(query)
elif complexity == "COMPLEX":
// Multi-step reasoning, iterative retrieval
answer = agentic_rag(query)
Graph RAG
Graph RAG builds a knowledge graph from documents, then uses graph traversal alongside vector search. Enables multi-hop reasoning and relationship-based queries.
entities = extract_entities(chunks)
relations = extract_relations(chunks)
graph = build_graph(entities, relations)
// Query: hybrid vector + graph
vector_results = vector_search(query)
seed_entities = extract_entities(query)
graph_context = traverse(graph, seed_entities, depth=2)
context = merge(vector_results, graph_context)
answer = generate(query, context)
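The traverse step above can be sketched as a breadth-first walk bounded by hop depth. The adjacency-list graph shape is a simplifying assumption; real graph stores also carry edge labels and weights:

```python
from collections import deque

def traverse(graph: dict[str, list[str]], seeds: list[str],
             depth: int = 2) -> set[str]:
    """BFS from seed entities up to `depth` hops; returns reached entities."""
    visited = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue  # hop budget exhausted for this branch
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, d + 1))
    return visited
```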
Multi-Modal RAG
Multi-modal RAG extends retrieval to images, tables, charts, and other non-text content. This is essential for technical documentation, research papers, and enterprise knowledge bases with visual content.
Image Retrieval
Two approaches: (1) Generate text descriptions/captions for images, embed those. (2) Use multi-modal embeddings (CLIP, SigLIP) that encode images and text into the same vector space.
caption = vision_model.describe(image)
embedding = text_embed(caption)
// Approach 2: Multi-modal embeddings
image_embedding = clip.encode_image(image)
query_embedding = clip.encode_text(query)
// Both in same 512-dim space
Table Retrieval
Tables contain structured information that flat text embeddings miss. Options include: table-aware embeddings, serialization (markdown/JSON), or dedicated table retrieval.
markdown = table.to_markdown() // |col1|col2|
json = table.to_json() // [{row}...]
linearized = f"Table {title}: Column {col} contains..."
// Include column summaries for retrieval
summary = f"Table about {topic} with columns: {cols}"
embedding = embed(summary + markdown[:500])
Production Considerations
Moving RAG from prototype to production requires addressing latency, cost, and reliability. These operational concerns often dominate the engineering effort.
Caching Strategies
- Embedding cache: Cache query embeddings keyed by query text. Saves embedding model calls.
- Semantic cache: If a new query is semantically similar to a cached query, return cached results. Use embedding similarity.
- Response cache: Cache full responses for exact query matches. Include cache invalidation on document updates.
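A minimal semantic-cache sketch. The threshold value and the linear scan over entries are simplifying assumptions; a real cache would use an ANN index over the cached query embeddings:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

class SemanticCache:
    """Return a cached response when a new query's embedding is close
    enough to a previously seen query's embedding."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query_emb):
        for emb, response in self.entries:
            if cosine(query_emb, emb) >= self.threshold:
                return response
        return None  # cache miss

    def put(self, query_emb, response):
        self.entries.append((query_emb, response))
```

Remember to invalidate or version cached responses when the underlying documents change, as noted for the response cache above.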
Batching
Batch embedding requests and reranking calls to maximize throughput. Trade latency for efficiency when processing multiple queries or documents.
// Inefficient: one call per item
embeddings = [embed(q) for q in queries] // n API calls
// Efficient: batched
embeddings = embed_batch(queries) // 1 API call
Typical batch sizes:
Embedding: 32-128 texts
Reranking: 100-500 pairs
Streaming
Stream LLM responses to reduce perceived latency. Users see tokens as they're generated rather than waiting for the full response.
context = await retrieve(query) // blocking
prompt = construct_prompt(query, context)
for token in llm.stream(prompt):
yield token // send immediately
Time-to-first-token matters more than total latency
for user experience
Monitoring and Observability
- Latency breakdown: Track time spent in each pipeline stage (embedding, retrieval, reranking, generation).
- Retrieval quality: Log retrieved documents and relevance scores. Sample and review periodically.
- User feedback: Thumbs up/down, explicit corrections, click-through on citations.
- Error rates: Track retrieval failures, LLM errors, timeout rates, and hallucination detection triggers.
Local RAG Architecture
A complete local RAG agent requires several components working together. Running entirely on-device eliminates API costs, network latency, and data privacy concerns.
Embedding Model
Local models like nomic-embed-text, bge-small, or all-MiniLM. 50-100ms per query on CPU, <10ms on GPU. Size range: 30MB to 500MB.
Vector Database
ChromaDB (embedded SQLite), FAISS (library), or Qdrant (embedded mode). All support HNSW indexes and run fully offline.
Reranker (Optional)
bge-reranker-small (30MB) or ms-marco-MiniLM. Adds 50-100ms latency but significantly improves precision for complex queries.
Language Model
Llama, Mistral, Phi, or Qwen via Ollama or llama.cpp. 7B models require 4-8GB RAM (quantized), 13B+ models need 16GB+. GPU strongly recommended.
Orchestration Layer
Agent framework managing the retrieval-generation loop. Coordinates query processing, context assembly, prompt construction, and response streaming.
From RAG to Agent
Basic RAG is a pipeline: query → retrieve → generate. A RAG agent adds autonomy: the ability to decide when and what to retrieve, iterate on results, and combine information from multiple queries.
Basic RAG
- Single retrieval per query
- Fixed number of results
- No query refinement
- Stateless between requests
RAG Agent
- Multiple retrieval rounds
- Adaptive result count
- Query decomposition and refinement
- Memory across interactions
Getting Started with OnyxLab
OnyxLab's systems implement these patterns with sensible defaults. The goal is production-ready RAG without the infrastructure complexity—local-first, offline-capable, and zero recurring costs. We handle the orchestration, caching, and optimization so you can focus on your documents and use cases. Explore our Systems to see these concepts in action.