Before an agent can reason about your documents, those documents must be processed into a format the system can use. This pipeline—loading, chunking, embedding—is where most RAG implementations succeed or fail. The difference between a mediocre system and a production-quality one often comes down to decisions made at this layer.
This guide covers the data ingestion pipeline in depth. We'll examine the tradeoffs at each stage, provide concrete recommendations, and show you how to evaluate whether your pipeline is working. Expect to spend significant time here—data quality is the foundation everything else rests on.
The Data Pipeline
Raw documents can't be fed directly to a language model. The context window is limited, and even if it weren't, retrieval precision would suffer. Instead, we process documents through a pipeline that optimizes for semantic search. Each stage introduces potential failure modes that compound downstream.
Raw Documents (PDF, HTML, DOCX) → Extraction (Text + Structure) → Chunking (Semantic Units) → Embedding (Vector Space) → Storage (Vector DB)
Document Loading: Harder Than It Looks
Extracting text from documents sounds trivial until you actually try it. PDFs are the canonical nightmare—they're a page description language, not a document format. The "text" you see is actually a collection of positioned glyphs that may or may not appear in reading order. A two-column layout might interleave columns. Headers and footers might appear mid-content. Tables are especially treacherous.
PDF Parsing Challenges
Text Extraction Order
PDF stores text as positioned elements, not logical sequences. Multi-column layouts, sidebars, and callout boxes can produce jumbled output. The visual order rarely matches the internal order.
Scanned Documents
Scanned PDFs contain images, not text. OCR is required, but OCR quality varies wildly based on scan quality, font, and language. Handwriting is particularly challenging and often unusable.
Table Extraction
Tables in PDFs have no semantic structure—they're just positioned text. Inferring row/column relationships requires heuristics that frequently fail on complex tables, merged cells, or nested structures.
Encoding Issues
Some PDFs use custom font encodings or ligatures that produce garbage when extracted. "fi" might become a single glyph, currency symbols might disappear, and non-Latin scripts may extract as mojibake.
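One common mojibake pattern is UTF-8 bytes that were decoded with a legacy codepage. When you can guess the wrong codec (cp1252 is assumed here), reversing the bad decode recovers the text with only the standard library; this is a sketch, not a general-purpose repair tool.

```python
def fix_mojibake(text: str) -> str:
    """Attempt to reverse UTF-8 text that was mis-decoded as cp1252."""
    try:
        # Re-encode with the (guessed) wrong codec, then decode correctly
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # Not the pattern we guessed; leave unchanged

print(fix_mojibake("â€œquotedâ€¦"))  # → “quoted…
```

Text that doesn't round-trip (including already-correct text) passes through unchanged, so the function is safe to apply broadly.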
Text Extraction Libraries
The Python ecosystem offers several PDF extraction options, each with distinct tradeoffs. No library handles every case—production systems often need multiple extractors with fallback logic.
# Library comparison for PDF extraction
Library              Speed    Table Support  OCR  Layout     Best For
─────────────────────────────────────────────────────────────────────────────
PyMuPDF              Fast     Basic          No   Poor       Simple text PDFs
pdfplumber           Medium   Good           No   Good       Documents with tables
PyPDF2               Fast     None           No   Poor       Basic extraction, metadata
pdf2image+Tesseract  Slow     Via OCR        Yes  Good       Scanned documents
Unstructured         Medium   Good           Yes  Excellent  Mixed document types
Amazon Textract      Slow     Excellent      Yes  Excellent  High-value documents
Azure Doc Intel      Slow     Excellent      Yes  Excellent  Enterprise workflows
PyMuPDF (fitz)
The fastest option for native PDFs. Handles most simple documents well, but struggles with complex layouts. No built-in table detection or OCR. Use for high-volume processing where documents are known to be well-structured.
import fitz  # PyMuPDF

def extract_with_pymupdf(pdf_path: str) -> str:
    doc = fitz.open(pdf_path)
    text_blocks = []
    for page in doc:
        # Get text blocks sorted into reading order
        blocks = page.get_text("blocks", sort=True)
        for block in blocks:
            if block[6] == 0:  # Text block, not image
                text_blocks.append(block[4])
    return "\n\n".join(text_blocks)
pdfplumber
Excellent for documents with tables. Provides fine-grained control over extraction with access to individual characters and their positions. Slower than PyMuPDF but produces better results for structured documents.
import pdfplumber

def extract_with_pdfplumber(pdf_path: str) -> dict:
    results = {"text": [], "tables": []}
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Extract tables first
            for table in page.extract_tables():
                results["tables"].append(table)
            # Extract page text; tolerances control how characters merge
            # into words. Note: table regions are not filtered out here,
            # so table content may appear in both lists.
            text = page.extract_text(x_tolerance=3, y_tolerance=3)
            results["text"].append(text)
    return results
Unstructured
A higher-level library that handles multiple document types with consistent output. Includes document element classification (Title, NarrativeText, Table, etc.) which is valuable for downstream processing. Heavier dependency footprint but worth it for production systems handling diverse document types.
from unstructured.partition.auto import partition

def extract_with_unstructured(file_path: str) -> list:
    elements = partition(
        filename=file_path,
        strategy="hi_res",  # Use vision model for layout
        include_page_breaks=True
    )
    structured_output = []
    for element in elements:
        structured_output.append({
            "type": type(element).__name__,
            "text": str(element),
            "metadata": element.metadata.to_dict()
        })
    return structured_output
HTML Cleaning
HTML extraction presents different challenges. The text is accessible, but buried in navigation, ads, boilerplate, and scripts. Good HTML extraction requires identifying the main content and stripping everything else.
# HTML content extraction with trafilatura
import trafilatura

def extract_article_content(html: str) -> dict:
    """Extract main content, discarding boilerplate."""
    # trafilatura uses multiple heuristics to find the main content
    extracted = trafilatura.extract(
        html,
        include_comments=False,
        include_tables=True,
        no_fallback=False,
        favor_precision=True,  # Prefer clean output over completeness
        output_format="txt"
    )
    # Also extract metadata
    metadata = trafilatura.extract_metadata(html)
    return {
        "content": extracted,
        "title": metadata.title if metadata else None,
        "author": metadata.author if metadata else None,
        "date": metadata.date if metadata else None
    }
OCR Considerations
Scanned documents require OCR, which introduces its own error modes. Even with modern OCR engines, expect 1-5% character error rate on clean scans, and much higher on degraded documents. These errors compound during retrieval—a misspelled term won't match the query.
OCR Quality Factors
Scan resolution: 300 DPI minimum, 600 DPI for small fonts
Chunking Strategies
The chunking strategy has an outsized impact on retrieval quality. There's no universal best approach—optimal chunking depends on document structure, query patterns, and your embedding model's characteristics. What works for technical documentation will fail for legal contracts.
Fixed-Size Chunking
The simplest approach: split text at fixed token or character intervals. Predictable chunk sizes make capacity planning easy, but the approach is semantically naive—it will happily split mid-sentence or separate a question from its answer.
# Fixed-size chunking with overlap
def fixed_size_chunk(
    text: str,
    chunk_size: int = 512,
    overlap: int = 50
) -> list[str]:
    """Split text into fixed-size chunks with overlap."""
    tokens = tokenize(text)  # Use your tokenizer
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunks.append(detokenize(tokens[start:end]))
        start = end - overlap  # Overlap with previous chunk
    return chunks
Advantages
Predictable chunk sizes for capacity planning
Simple to implement and debug
Works with any text, no parsing required
Fast processing, minimal overhead
Disadvantages
Breaks semantic coherence
Splits sentences mid-thought
Separates related content
Overlap wastes storage/compute
Recursive Character Splitting
A smarter approach that tries to split at natural boundaries. The algorithm attempts to split at paragraph breaks first, then sentences, then words, falling back to characters only if necessary. This preserves more semantic coherence while still respecting size limits.
# Recursive splitting with separator hierarchy
SEPARATORS = [
    "\n\n",  # Paragraph breaks (try first)
    "\n",    # Line breaks
    ". ",    # Sentence endings
    "? ",
    "! ",
    "; ",    # Clause boundaries
    ", ",    # Phrase boundaries
    " ",     # Word boundaries
    ""       # Character level (last resort)
]

def recursive_split(
    text: str,
    chunk_size: int = 512,
    separators: list[str] = SEPARATORS
) -> list[str]:
    """Split text recursively at natural boundaries."""
    if len(text) <= chunk_size:
        return [text]
    # Try each separator in order (skip "", handled by the fallback below)
    for sep in separators:
        if sep and sep in text:
            splits = text.split(sep)
            chunks = []
            current = ""
            for split in splits:
                candidate = current + sep + split if current else split
                if len(candidate) <= chunk_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    if len(split) > chunk_size:
                        # Recursively handle oversized segments using the
                        # remaining, finer-grained separators
                        chunks.extend(recursive_split(
                            split, chunk_size, separators[separators.index(sep) + 1:]
                        ))
                        current = ""  # Segment fully emitted; start fresh
                    else:
                        current = split
            if current:
                chunks.append(current)
            return chunks
    # Fallback to character split
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
Semantic Chunking
Uses embeddings to identify topic boundaries. The algorithm embeds sentences or paragraphs, then measures similarity between adjacent segments. Low similarity indicates a topic shift—a natural place to split. This produces highly coherent chunks but at significant computational cost.
# Semantic chunking via embedding similarity
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunk(
    sentences: list[str],
    model: SentenceTransformer,
    similarity_threshold: float = 0.5,
    min_chunk_size: int = 2,
    max_chunk_size: int = 10
) -> list[str]:
    """Split at semantic boundaries detected via embedding similarity."""
    if not sentences:
        return []
    # Embed all sentences; normalize so dot product equals cosine similarity
    embeddings = model.encode(sentences, normalize_embeddings=True)
    # Similarity between each pair of adjacent sentences
    similarities = [
        float(np.dot(embeddings[i], embeddings[i + 1]))
        for i in range(len(embeddings) - 1)
    ]
    # Find split points where similarity drops
    chunks = []
    current_chunk = [sentences[0]]
    for i, sim in enumerate(similarities):
        if (sim < similarity_threshold and
                len(current_chunk) >= min_chunk_size):
            # Low similarity = topic shift
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i + 1]]
        elif len(current_chunk) >= max_chunk_size:
            # Force split at max size
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i + 1]]
        else:
            current_chunk.append(sentences[i + 1])
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
Document-Aware Chunking
Leverages document structure (headings, sections, lists) for natural split points. This requires good document parsing but produces excellent results—each chunk corresponds to a logical unit of information. Particularly effective for technical documentation, legal documents, and any content with clear hierarchical structure.
# Document-aware chunking using structure
from dataclasses import dataclass

@dataclass
class Section:
    heading: str
    level: int
    content: str
    subsections: list["Section"]

def chunk_by_structure(
    sections: list[Section],
    max_chunk_size: int = 1000,
    include_heading: bool = True
) -> list[dict]:
    """Chunk document respecting section boundaries."""
    chunks = []
    for section in sections:
        # Build section content with heading context
        section_text = ""
        if include_heading and section.heading:
            section_text = f"## {section.heading}\n\n"
        section_text += section.content
        if len(section_text) <= max_chunk_size:
            # Section fits in one chunk
            chunks.append({
                "content": section_text,
                "heading": section.heading,
                "level": section.level
            })
        else:
            # Split section content, preserving heading in each chunk
            sub_chunks = recursive_split(section.content, max_chunk_size)
            for i, sub_chunk in enumerate(sub_chunks):
                suffix = " (continued)" if i > 0 else ""
                prefix = f"## {section.heading}{suffix}\n\n"
                chunks.append({
                    "content": prefix + sub_chunk if include_heading else sub_chunk,
                    "heading": section.heading,
                    "level": section.level,
                    "part": i + 1
                })
        # Recursively process subsections
        chunks.extend(chunk_by_structure(
            section.subsections, max_chunk_size, include_heading
        ))
    return chunks
Sentence-Window Chunking
A hybrid approach that embeds individual sentences but retrieves surrounding context. Each sentence is embedded separately for precise matching, but when retrieved, the system expands to include neighboring sentences. This gives you precise retrieval with rich context.
How Sentence-Window Works
Split document into sentences
Embed each sentence individually
Store sentence with references to neighbors (window)
At retrieval time, match against sentence embeddings
Expand matched sentences to include window (e.g., 2 sentences before/after)
Return expanded context to LLM
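The distinctive step is the window expansion. This sketch implements only that part; embedding and matching are elided, and `matched_indices` stands in for whatever index positions your vector search returns.

```python
def expand_with_window(
    sentences: list[str],
    matched_indices: list[int],
    window: int = 2,
) -> list[str]:
    """For each matched sentence, return it plus `window` neighbors each side."""
    expanded = []
    for idx in matched_indices:
        start = max(0, idx - window)               # Clamp at document start
        end = min(len(sentences), idx + window + 1)  # Clamp at document end
        expanded.append(" ".join(sentences[start:end]))
    return expanded

sents = ["S0.", "S1.", "S2.", "S3.", "S4.", "S5."]
# Suppose the embedding match landed on sentence 3:
print(expand_with_window(sents, [3], window=1))  # → ['S2. S3. S4.']
```

The clamping means sentences near document boundaries simply get smaller windows rather than raising errors.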
Chunk Size Optimization
Finding the right chunk size is empirical. Too small and you lose context; too large and you dilute relevance. The optimal size depends on your embedding model, query patterns, and document characteristics.
Chunk Size Tradeoffs
Small (128-256 tokens)
+ High precision
+ Fast retrieval
- Loses context
- More chunks to store
Medium (512-768 tokens)
Common default
Balanced tradeoff
Works for most cases
Large (1024+ tokens)
+ Rich context
+ Fewer chunks
- Lower precision
- Slower retrieval
Empirical Chunk Size Selection
The best way to find optimal chunk size is experimentation with your actual data and queries. Create a test set of questions with known answers, then measure retrieval quality at different chunk sizes.
def evaluate_chunk_size(
    documents: list[str],
    test_queries: list[dict],  # {"query": str, "expected_doc": str}
    chunk_sizes: list[int] = [256, 512, 768, 1024]
) -> dict:
    """Evaluate retrieval quality at different chunk sizes."""
    scores = {}
    for size in chunk_sizes:
        # Build index at this chunk size
        chunks = [chunk_document(doc, size) for doc in documents]
        index = build_index(chunks)
        # Measure retrieval quality
        hits_at_1 = 0
        hits_at_5 = 0
        mrr = 0.0
        for test in test_queries:
            retrieved = index.search(test["query"], k=5)
            for rank, result in enumerate(retrieved):
                if result.source_doc == test["expected_doc"]:
                    if rank == 0:
                        hits_at_1 += 1
                    hits_at_5 += 1
                    mrr += 1.0 / (rank + 1)
                    break
        n = len(test_queries)
        scores[size] = {
            "hits@1": hits_at_1 / n,
            "hits@5": hits_at_5 / n,
            "mrr": mrr / n
        }
    return scores
Overlap Strategies
Overlap between chunks preserves context at boundaries. Without overlap, information that spans chunk boundaries becomes unretrievable—the first half is in one chunk, the second half in another, and neither chunk alone contains the complete answer.
No Overlap
0%
Minimal storage, but loses boundary context. Only appropriate when documents have very clear section boundaries.
Moderate Overlap
10-20%
Standard choice. Preserves most boundary context with reasonable storage overhead. Start here and adjust based on testing.
High Overlap
30-50%
Maximum boundary preservation but significant storage cost. Consider for critical applications or very short documents.
The Boundary Problem
Without overlap:
  Chunk 1: "...the system uses AES-256"
  Chunk 2: "encryption with a 32-byte key..."
Neither chunk contains "AES-256 encryption"

With 20% overlap:
  Chunk 1: "...the system uses AES-256 encryption with"
  Chunk 2: "...uses AES-256 encryption with a 32-byte key..."
Both chunks now contain the key phrase
Hierarchical Chunking
Hierarchical chunking creates parent-child relationships between chunks at different granularities. Small chunks enable precise retrieval; parent chunks provide context. This is particularly powerful for long documents where local precision and global context both matter.
At retrieval time, match against leaf chunks (paragraphs) for precision, then expand to parent chunks for context. This gives you the best of both worlds: precise matching with rich context.
def hierarchical_retrieve(query: str, k: int = 5) -> list[dict]:
    """Retrieve with parent context expansion."""
    # Match against paragraph-level embeddings
    matches = vector_search(query, level=2, k=k)
    results = []
    for match in matches:
        # Fetch parent section for context
        parent = get_chunk(match.parent_id)
        results.append({
            "matched_chunk": match.content,
            "context": parent.content,
            "document_id": parent.parent_id,
            "score": match.score
        })
    return results
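The build side is a matter of emitting chunk records at two granularities with parent pointers. A minimal sketch, where the `level` values match the retrieval convention above and the ID scheme is illustrative:

```python
import uuid

def build_hierarchy(doc_id: str, sections: list[dict]) -> list[dict]:
    """sections: [{"heading": str, "paragraphs": [str, ...]}, ...]"""
    chunks = []
    for section in sections:
        section_id = f"sec-{uuid.uuid4().hex[:8]}"
        # Section-level chunk: returned as context at retrieval time
        chunks.append({
            "id": section_id,
            "parent_id": doc_id,
            "level": 1,
            "content": "\n\n".join(section["paragraphs"]),
        })
        # Paragraph-level chunks: embedded and matched for precision
        for para in section["paragraphs"]:
            chunks.append({
                "id": f"par-{uuid.uuid4().hex[:8]}",
                "parent_id": section_id,
                "level": 2,
                "content": para,
            })
    return chunks

records = build_hierarchy("doc-1", [{"heading": "Intro", "paragraphs": ["p1", "p2"]}])
```

Only level-2 records need embeddings; level-1 records are fetched by ID, so they cost storage but no extra vector-index space.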
Metadata Schemas
Each chunk should carry rich metadata. This enables filtered retrieval, provides context for the LLM, and supports debugging and analytics. A good metadata schema is an investment that pays dividends throughout the system.
Filtered Retrieval
Search only recent documents: modified_at > 2024-01-01
Restrict to department: department = "engineering"
Find code examples: has_code = true
Specific document type: source.filename LIKE "%.pdf"
Context for LLM
Include section hierarchy in prompts
Show document date for temporal reasoning
Indicate content type for interpretation
Provide source for attribution
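Pulling the filtering and LLM-context needs above into one place, a schema might look like the following sketch. Field names are illustrative, not a standard:

```python
from __future__ import annotations
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ChunkMetadata:
    source_filename: str           # e.g. "handbook.pdf"; enables LIKE filters
    document_id: str
    section_path: list[str]        # Heading hierarchy, shown to the LLM
    page: int | None = None
    department: str | None = None  # Access / scope filtering
    modified_at: datetime | None = None
    has_code: bool = False
    content_type: str = "prose"    # "prose" | "table" | "code"

meta = ChunkMetadata(
    source_filename="handbook.pdf",
    document_id="doc-42",
    section_path=["Security", "Encryption"],
)
```

Keep filterable fields flat and scalar; nested structures are harder to index in most vector stores.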
Document Preprocessing
Raw extracted text often needs cleaning before chunking. Preprocessing improves both retrieval quality and LLM comprehension. The goal is to normalize text while preserving meaningful structure.
Text Cleaning
# Common text cleaning operations
import re
import unicodedata
from collections import Counter

def clean_text(text: str) -> str:
    """Clean and normalize extracted text."""
    # Unicode normalization (also expands ligature glyphs like U+FB01)
    text = unicodedata.normalize("NFKC", text)
    # Replace any remaining ligature glyphs from PDF extraction
    text = text.replace("\ufb01", "fi").replace("\ufb02", "fl")
    # Normalize curly quotes to straight quotes
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    # Remove page headers/footers (often repeated patterns)
    text = remove_repeated_headers(text)
    # Normalize whitespace
    text = re.sub(r"[ \t]+", " ", text)      # Multiple spaces to single
    text = re.sub(r"\n{3,}", "\n\n", text)   # Multiple newlines to double
    # Remove hyphenation at line breaks
    text = re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)
    # Normalize bullet points and list markers
    text = re.sub(r"^[•●○▪▸►]\s*", "- ", text, flags=re.MULTILINE)
    # Remove invisible/control characters
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]", "", text)
    return text.strip()

def remove_repeated_headers(text: str, threshold: int = 3) -> str:
    """Remove headers/footers that repeat across pages."""
    lines = text.split("\n")
    line_counts = Counter(lines)
    # Lines appearing more than threshold times are likely headers/footers
    repeated = {line for line, count in line_counts.items()
                if count >= threshold and len(line.strip()) > 0}
    return "\n".join(line for line in lines if line not in repeated)
Deduplication
Document corpora often contain duplicates—exact copies, near-duplicates, or documents with substantial overlap. Deduplication prevents the same content from dominating search results and reduces storage costs.
# Deduplication strategies
from datasketch import MinHash, MinHashLSH
import hashlib

def exact_dedup(documents: list[str]) -> list[str]:
    """Remove exact duplicates via hashing."""
    seen = set()
    unique = []
    for doc in documents:
        doc_hash = hashlib.sha256(doc.encode()).hexdigest()
        if doc_hash not in seen:
            seen.add(doc_hash)
            unique.append(doc)
    return unique

def near_dedup_minhash(
    documents: list[str],
    threshold: float = 0.8,
    num_perm: int = 128
) -> list[str]:
    """Remove near-duplicates using MinHash LSH."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    unique_indices = []
    for i, doc in enumerate(documents):
        # Create MinHash signature from the document's words
        mh = MinHash(num_perm=num_perm)
        for word in doc.split():
            mh.update(word.encode())
        # Keep the document only if nothing similar is already indexed
        if not lsh.query(mh):
            lsh.insert(i, mh)
            unique_indices.append(i)
    return [documents[i] for i in unique_indices]
Handling Different Document Types
Different content types require different processing strategies. A one-size-fits-all approach will produce poor results for specialized content like code, tables, or mixed media documents.
Code
Code has different semantic units than prose. Functions, classes, and methods are natural chunk boundaries. Preserve indentation and syntax structure. Consider including docstrings and comments with the code they document.
# Code-aware chunking strategy
- Chunk at function/class boundaries
- Keep function with its docstring
- Preserve import statements as context
- Maintain indentation (meaningful in Python)
- Consider AST-based splitting for precision
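A minimal sketch of the AST-based option in the list above, for Python source: top-level functions and classes become chunks, with the module's imports prepended as shared context.

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """Split Python source at top-level function/class boundaries."""
    tree = ast.parse(source)
    # Collect import statements to carry as context in every chunk
    imports = [ast.get_source_segment(source, node)
               for node in tree.body
               if isinstance(node, (ast.Import, ast.ImportFrom))]
    header = "\n".join(i for i in imports if i)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)  # Includes docstring
            chunks.append(f"{header}\n\n{segment}" if header else segment)
    return chunks

sample = 'import os\n\ndef f(x):\n    """doc"""\n    return os.sep + x\n'
code_chunks = chunk_python_source(sample)
```

Decorated definitions lose their decorators here (`get_source_segment` starts at the `def` line); a production splitter would use `node.decorator_list` positions as well.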
Tables
Tables are challenging because their meaning depends on structure. Options include: converting to markdown, serializing row-by-row with headers, or generating natural language descriptions. The right choice depends on how users will query the data.
# Table handling strategies
1. Markdown: | Header | Value | - preserves structure
2. Row descriptions: "Product X costs $100 with 50 units in stock"
3. Multiple representations: store both structured and prose
4. Metadata flagging: mark chunks as tables for special handling
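Strategy 2 above is simple to sketch: serialize each row as one sentence keyed by the header, which works well when users ask about individual records.

```python
def table_to_row_texts(table: list[list[str]]) -> list[str]:
    """table[0] is the header row; each later row becomes one chunkable text."""
    header, *rows = table
    return [
        "; ".join(f"{col}: {val}" for col, val in zip(header, row))
        for row in rows
    ]

table = [
    ["Product", "Price", "Stock"],
    ["Widget", "$100", "50"],
]
print(table_to_row_texts(table))  # → ['Product: Widget; Price: $100; Stock: 50']
```

The tradeoff: row texts embed well for record lookups but lose cross-row structure, so aggregate questions ("which product is cheapest?") still need the full table.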
Images and Diagrams
Images require either OCR (for text-heavy images), vision model descriptions, or multimodal embeddings. For diagrams and charts, extracting a textual description is often more useful than OCR. CLIP-style embeddings can enable image search alongside text.
# Image handling approaches
1. OCR for text-heavy images (screenshots, scanned docs)
2. Vision model captions for photos/diagrams
3. Alt-text extraction from HTML
4. Multimodal embeddings (CLIP) for image search
5. Combine: store image + generated description
Incremental Ingestion
Real-world systems need to handle updates—new documents, modified documents, deleted documents. Reprocessing the entire corpus on every change doesn't scale. Incremental ingestion updates only what changed.
# Incremental ingestion with change detection
from dataclasses import dataclass
from datetime import datetime
import hashlib

@dataclass
class DocumentState:
    document_id: str
    content_hash: str
    last_processed: datetime
    chunk_ids: list[str]

class IncrementalIngester:
    def __init__(self, state_store, vector_store):
        self.state_store = state_store
        self.vector_store = vector_store

    def process_document(self, doc_id: str, content: str) -> bool:
        """Process document if changed, return True if updated."""
        content_hash = hashlib.sha256(content.encode()).hexdigest()
        existing = self.state_store.get(doc_id)
        if existing and existing.content_hash == content_hash:
            # Document unchanged, skip processing
            return False
        # Document is new or modified
        if existing:
            # Delete old chunks
            self.vector_store.delete(existing.chunk_ids)
        # Process and store new chunks (chunk_document: your chunking pipeline)
        chunks = self.chunk_document(content)
        chunk_ids = self.vector_store.upsert(chunks)
        # Update state
        self.state_store.put(DocumentState(
            document_id=doc_id,
            content_hash=content_hash,
            last_processed=datetime.now(),
            chunk_ids=chunk_ids
        ))
        return True

    def delete_document(self, doc_id: str):
        """Remove document and its chunks."""
        existing = self.state_store.get(doc_id)
        if existing:
            self.vector_store.delete(existing.chunk_ids)
            self.state_store.delete(doc_id)
Update Strategies
Full replacement: Delete all chunks, reprocess entire document. Simple but wasteful for small changes.
Diff-based: Compare old and new content, update only changed sections. Complex but efficient.
Append-only: Never delete, just add new versions with timestamps. Query filters to latest. Good for audit trails.
Lazy invalidation: Mark old chunks as stale, reprocess on next access. Reduces peak load.
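The append-only strategy is simple enough to sketch end to end. Here the storage layer is a plain dict for illustration; in practice the version field would live in chunk metadata and the "latest" filter would run in your vector store's query.

```python
versions: dict[str, list[dict]] = {}  # doc_id -> list of versioned records

def append_version(doc_id: str, content: str) -> None:
    """Never delete: each update appends a new record with a version number."""
    history = versions.setdefault(doc_id, [])
    history.append({"version": len(history) + 1, "content": content})

def latest(doc_id: str):
    """Query-time filter: return only the newest version."""
    history = versions.get(doc_id)
    if not history:
        return None
    return max(history, key=lambda v: v["version"])["content"]

append_version("doc-1", "v1 text")
append_version("doc-1", "v2 text")
```

Old versions stay queryable for audit purposes; the cost is unbounded storage growth unless you add a retention policy.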
Data Quality Metrics
How do you know if your pipeline is working? Without metrics, you're flying blind. Establish quality indicators at each pipeline stage and monitor them continuously.
Extraction Quality
Character error rate (for OCR)
Table extraction accuracy
Encoding error frequency
Empty/failed extraction rate
Chunk Quality
Chunk size distribution
Semantic coherence scores
Truncated sentence frequency
Empty/near-empty chunk rate
Retrieval Quality
Precision@k on test queries
Mean reciprocal rank (MRR)
NDCG for ranked results
Human relevance judgments
System Health
Processing latency by document type
Embedding generation throughput
Storage growth rate
Error/retry rates
# Automated quality checks
class QualityChecker:
    def __init__(self, thresholds: dict):
        self.thresholds = thresholds

    def check_chunks(self, chunks: list[dict]) -> dict:
        """Run quality checks on processed chunks."""
        issues = []
        sizes = [len(c["content"]) for c in chunks]
        avg_size = sum(sizes) / len(sizes) if sizes else 0
        for i, chunk in enumerate(chunks):
            content = chunk["content"]
            size = len(content)
            # Too small
            if size < self.thresholds["min_chunk_size"]:
                issues.append({"type": "undersized_chunk",
                               "chunk_index": i, "size": size})
            # Too large
            if size > self.thresholds["max_chunk_size"]:
                issues.append({"type": "oversized_chunk",
                               "chunk_index": i, "size": size})
            # Possible truncation: doesn't end in sentence-final punctuation
            stripped = content.rstrip()
            if stripped and stripped[-1] not in ".!?:;\"')":
                issues.append({"type": "possible_truncation", "chunk_index": i})
            # High repetition (possible extraction error)
            if self.has_high_repetition(content):
                issues.append({"type": "repetitive_content", "chunk_index": i})
        return {
            "total_chunks": len(chunks),
            "issues": issues,
            "issue_rate": len(issues) / len(chunks) if chunks else 0.0,
            "avg_chunk_size": avg_size
        }

    def has_high_repetition(self, content: str, threshold: float = 0.5) -> bool:
        """Crude check: flag content where a single line dominates."""
        lines = [l for l in content.split("\n") if l.strip()]
        if len(lines) < 4:
            return False
        most_common = max(lines.count(l) for l in set(lines))
        return most_common / len(lines) > threshold
Storage Formats and Persistence
Processed chunks need persistent storage. The storage format affects query performance, portability, and operational complexity. For local-first systems, embedded databases are often preferable to standalone servers.
# Storage options comparison
Storage       Type      Persistence   Scalability    Best For
────────────────────────────────────────────────────────────────────────
FAISS         Embedded  File export   10M+ vectors   High-performance search
ChromaDB      Embedded  SQLite        1M vectors     Rapid prototyping
LanceDB       Embedded  Native files  10M+ vectors   Local-first, multimodal
SQLite + ext  Embedded  Single file   100K vectors   Simple deployments
Qdrant        Server    Native        100M+ vectors  Production scale
Milvus        Server    Native        1B+ vectors    Enterprise scale
Pinecone      Managed   Cloud         Unlimited      Fully managed
Local-First Storage with LanceDB
For OnyxLab's local-first philosophy, embedded databases that persist to local files are ideal. LanceDB offers vector search with native file persistence and no server process.
import lancedb

# Create or connect to local database
db = lancedb.connect("./data/vectors")

# Create table with schema inferred from the first record
table = db.create_table(
    "documents",
    data=[{
        "id": "chunk_1",
        "content": "Example content...",
        "embedding": [0.1, 0.2, ...],  # 1536 dimensions (elided)
        "metadata": {"source": "doc.pdf", "page": 1}
    }],
    mode="overwrite"
)

# Vector search with metadata filtering
results = (
    table.search([0.1, 0.2, ...])  # Query vector (elided)
    .where("metadata.source = 'doc.pdf'")
    .limit(10)
    .to_list()
)

# Data persists to ./data/vectors/documents.lance
Chunk Storage Schema
Store chunks with their embeddings, metadata, and relationships in a format that supports both vector search and metadata filtering.
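One possible record layout matching those needs: content for the LLM, an embedding for vector search, flat scalar fields for filtering, and parent/document IDs for relationship traversal. Field names and sizes are illustrative.

```python
chunk_record = {
    "id": "chunk_00042",
    "document_id": "doc_007",
    "parent_id": "sec_003",        # For hierarchical context expansion
    "content": "The system uses AES-256 encryption...",
    "embedding": [0.0] * 1536,     # Vector from your embedding model
    "source_filename": "security.pdf",
    "page": 12,
    "content_type": "prose",
    "modified_at": "2024-03-01T00:00:00Z",
}
```

Keeping filter fields at the top level (rather than nested under a metadata key) makes them directly indexable in most stores.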
Key Takeaways
1. Document extraction is harder than it looks. Test your extractors on real data and build fallback chains.
2. Chunk size is empirical. Start with 512 tokens and measure retrieval quality, then adjust.
3. Use overlap (10-20%) to preserve boundary context. The storage cost is worth it.
4. Rich metadata enables filtering and provides context. Design your schema early.
5. Different content types need different strategies. Don't force code into prose pipelines.
6. Build incremental ingestion from the start. Full reprocessing doesn't scale.
7. Measure quality at every stage. What you don't measure, you can't improve.
The quality of your data pipeline directly determines the quality of your agent's responses. Time spent here pays dividends throughout the system. Get this right, and retrieval becomes reliable. Get it wrong, and no amount of prompt engineering will save you.