Before an agent can reason about your documents, those documents must be processed into a format the system can use. This pipeline—loading, chunking, embedding—is where most RAG implementations succeed or fail. The difference between a mediocre system and a production-quality one often comes down to decisions made at this layer.
This guide covers the data ingestion pipeline in depth. We'll examine the tradeoffs at each stage, provide concrete recommendations, and show you how to evaluate whether your pipeline is working. Expect to spend significant time here—data quality is the foundation everything else rests on.
The Data Pipeline
Raw documents can't be fed directly to a language model. The context window is limited, and even if it weren't, retrieval precision would suffer. Instead, we process documents through a pipeline that optimizes for semantic search. Each stage introduces potential failure modes that compound downstream.
Raw Documents (PDF, HTML, DOCX) → Extraction (Text + Structure) → Chunking (Semantic Units) → Embedding (Vector Space) → Storage (Vector DB)
Document Loading: Harder Than It Looks
Extracting text from documents sounds trivial until you actually try it. PDFs are the canonical nightmare—they're a page description language, not a document format. The "text" you see is actually a collection of positioned glyphs that may or may not appear in reading order. A two-column layout might interleave columns. Headers and footers might appear mid-content. Tables are especially treacherous.
PDF Parsing Challenges
Text Extraction Order
PDF stores text as positioned elements, not logical sequences. Multi-column layouts, sidebars, and callout boxes can produce jumbled output. The visual order rarely matches the internal order.
Scanned Documents
Scanned PDFs contain images, not text. OCR is required, but OCR quality varies wildly based on scan quality, font, and language. Handwriting is particularly challenging and often unusable.
Table Extraction
Tables in PDFs have no semantic structure—they're just positioned text. Inferring row/column relationships requires heuristics that frequently fail on complex tables, merged cells, or nested structures.
Encoding Issues
Some PDFs use custom font encodings or ligatures that produce garbage when extracted. "fi" might become a single glyph, currency symbols might disappear, and non-Latin scripts may extract as mojibake.
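One common mojibake pattern is UTF-8 bytes that were decoded with a legacy codepage. When you can guess the wrong codec (cp1252 is assumed here), reversing the bad decode recovers the text with only the standard library; this is a sketch, not a general-purpose repair tool.

```python
def fix_mojibake(text: str) -> str:
    """Attempt to reverse UTF-8 text that was mis-decoded as cp1252."""
    try:
        # Re-encode with the (guessed) wrong codec, then decode correctly
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # Not the pattern we guessed; leave unchanged

print(fix_mojibake("â€œquotedâ€¦"))  # → “quoted…
```

Text that doesn't round-trip (including already-correct text) passes through unchanged, so the function is safe to apply broadly.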
Text Extraction Libraries
The Python ecosystem offers several PDF extraction options, each with distinct tradeoffs. No library handles every case—production systems often need multiple extractors with fallback logic.
# Library comparison for PDF extraction
Library              Speed    Table Support  OCR  Layout     Best For
─────────────────────────────────────────────────────────────────────────────
PyMuPDF              Fast     Basic          No   Poor       Simple text PDFs
pdfplumber           Medium   Good           No   Good       Documents with tables
PyPDF2               Fast     None           No   Poor       Basic extraction, metadata
pdf2image+Tesseract  Slow     Via OCR        Yes  Good       Scanned documents
Unstructured         Medium   Good           Yes  Excellent  Mixed document types
Amazon Textract      Slow     Excellent      Yes  Excellent  High-value documents
Azure Doc Intel      Slow     Excellent      Yes  Excellent  Enterprise workflows
PyMuPDF (fitz)
The fastest option for native PDFs. Handles most simple documents well, but struggles with complex layouts. No built-in table detection or OCR. Use for high-volume processing where documents are known to be well-structured.
import fitz  # PyMuPDF

def extract_with_pymupdf(pdf_path: str) -> str:
    doc = fitz.open(pdf_path)
    text_blocks = []
    for page in doc:
        # Get text blocks sorted into reading order
        blocks = page.get_text("blocks", sort=True)
        for block in blocks:
            if block[6] == 0:  # Text block, not image
                text_blocks.append(block[4])
    return "\n\n".join(text_blocks)
pdfplumber
Excellent for documents with tables. Provides fine-grained control over extraction with access to individual characters and their positions. Slower than PyMuPDF but produces better results for structured documents.
import pdfplumber

def extract_with_pdfplumber(pdf_path: str) -> dict:
    results = {"text": [], "tables": []}
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Extract tables first
            for table in page.extract_tables():
                results["tables"].append(table)
            # Extract page text; tolerances control how characters merge
            # into words. Note: table regions are not filtered out here,
            # so table content may appear in both lists.
            text = page.extract_text(x_tolerance=3, y_tolerance=3)
            results["text"].append(text)
    return results
Unstructured
A higher-level library that handles multiple document types with consistent output. Includes document element classification (Title, NarrativeText, Table, etc.) which is valuable for downstream processing. Heavier dependency footprint but worth it for production systems handling diverse document types.
from unstructured.partition.auto import partition

def extract_with_unstructured(file_path: str) -> list:
    elements = partition(
        filename=file_path,
        strategy="hi_res",  # Use vision model for layout
        include_page_breaks=True
    )
    structured_output = []
    for element in elements:
        structured_output.append({
            "type": type(element).__name__,
            "text": str(element),
            "metadata": element.metadata.to_dict()
        })
    return structured_output
HTML Cleaning
HTML extraction presents different challenges. The text is accessible, but buried in navigation, ads, boilerplate, and scripts. Good HTML extraction requires identifying the main content and stripping everything else.
# HTML content extraction with trafilatura
import trafilatura

def extract_article_content(html: str) -> dict:
    """Extract main content, discarding boilerplate."""
    # trafilatura uses multiple heuristics to find the main content
    extracted = trafilatura.extract(
        html,
        include_comments=False,
        include_tables=True,
        no_fallback=False,
        favor_precision=True,  # Prefer clean output over completeness
        output_format="txt"
    )
    # Also extract metadata
    metadata = trafilatura.extract_metadata(html)
    return {
        "content": extracted,
        "title": metadata.title if metadata else None,
        "author": metadata.author if metadata else None,
        "date": metadata.date if metadata else None
    }
OCR Considerations
Scanned documents require OCR, which introduces its own error modes. Even with modern OCR engines, expect 1-5% character error rate on clean scans, and much higher on degraded documents. These errors compound during retrieval—a misspelled term won't match the query.
OCR Quality Factors
Scan resolution: 300 DPI minimum, 600 DPI for small fonts
Chunking Strategies
The chunking strategy has an outsized impact on retrieval quality. There's no universal best approach—optimal chunking depends on document structure, query patterns, and your embedding model's characteristics. What works for technical documentation will fail for legal contracts.
Fixed-Size Chunking
The simplest approach: split text at fixed token or character intervals. Predictable chunk sizes make capacity planning easy, but the approach is semantically naive—it will happily split mid-sentence or separate a question from its answer.
# Fixed-size chunking with overlap
def fixed_size_chunk(
    text: str,
    chunk_size: int = 512,
    overlap: int = 50
) -> list[str]:
    """Split text into fixed-size chunks with overlap."""
    tokens = tokenize(text)  # Use your tokenizer
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunks.append(detokenize(tokens[start:end]))
        start = end - overlap  # Overlap with previous chunk
    return chunks
Advantages
Predictable chunk sizes for capacity planning
Simple to implement and debug
Works with any text, no parsing required
Fast processing, minimal overhead
Disadvantages
Breaks semantic coherence
Splits sentences mid-thought
Separates related content
Overlap wastes storage/compute
Recursive Character Splitting
A smarter approach that tries to split at natural boundaries. The algorithm attempts to split at paragraph breaks first, then sentences, then words, falling back to characters only if necessary. This preserves more semantic coherence while still respecting size limits.
# Recursive splitting with separator hierarchy
SEPARATORS = [
    "\n\n",  # Paragraph breaks (try first)
    "\n",    # Line breaks
    ". ",    # Sentence endings
    "? ",
    "! ",
    "; ",    # Clause boundaries
    ", ",    # Phrase boundaries
    " ",     # Word boundaries
    ""       # Character level (last resort)
]

def recursive_split(
    text: str,
    chunk_size: int = 512,
    separators: list[str] = SEPARATORS
) -> list[str]:
    """Split text recursively at natural boundaries."""
    if len(text) <= chunk_size:
        return [text]
    # Try each separator in order (skip "", handled by the fallback below)
    for sep in separators:
        if sep and sep in text:
            splits = text.split(sep)
            chunks = []
            current = ""
            for split in splits:
                candidate = current + sep + split if current else split
                if len(candidate) <= chunk_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    if len(split) > chunk_size:
                        # Recursively handle oversized segments using the
                        # remaining, finer-grained separators
                        chunks.extend(recursive_split(
                            split, chunk_size, separators[separators.index(sep) + 1:]
                        ))
                        current = ""  # Segment fully emitted; start fresh
                    else:
                        current = split
            if current:
                chunks.append(current)
            return chunks
    # Fallback to character split
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
Semantic Chunking
Uses embeddings to identify topic boundaries. The algorithm embeds sentences or paragraphs, then measures similarity between adjacent segments. Low similarity indicates a topic shift—a natural place to split. This produces highly coherent chunks but at significant computational cost.
# Semantic chunking via embedding similarity
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunk(
    sentences: list[str],
    model: SentenceTransformer,
    similarity_threshold: float = 0.5,
    min_chunk_size: int = 2,
    max_chunk_size: int = 10
) -> list[str]:
    """Split at semantic boundaries detected via embedding similarity."""
    if not sentences:
        return []
    # Embed all sentences; normalize so dot product equals cosine similarity
    embeddings = model.encode(sentences, normalize_embeddings=True)
    # Similarity between each pair of adjacent sentences
    similarities = [
        float(np.dot(embeddings[i], embeddings[i + 1]))
        for i in range(len(embeddings) - 1)
    ]
    # Find split points where similarity drops
    chunks = []
    current_chunk = [sentences[0]]
    for i, sim in enumerate(similarities):
        if (sim < similarity_threshold and
                len(current_chunk) >= min_chunk_size):
            # Low similarity = topic shift
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i + 1]]
        elif len(current_chunk) >= max_chunk_size:
            # Force split at max size
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i + 1]]
        else:
            current_chunk.append(sentences[i + 1])
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
Document-Aware Chunking
Leverages document structure (headings, sections, lists) for natural split points. This requires good document parsing but produces excellent results—each chunk corresponds to a logical unit of information. Particularly effective for technical documentation, legal documents, and any content with clear hierarchical structure.
# Document-aware chunking using structure
from dataclasses import dataclass

@dataclass
class Section:
    heading: str
    level: int
    content: str
    subsections: list["Section"]

def chunk_by_structure(
    sections: list[Section],
    max_chunk_size: int = 1000,
    include_heading: bool = True
) -> list[dict]:
    """Chunk document respecting section boundaries."""
    chunks = []
    for section in sections:
        # Build section content with heading context
        section_text = ""
        if include_heading and section.heading:
            section_text = f"## {section.heading}\n\n"
        section_text += section.content
        if len(section_text) <= max_chunk_size:
            # Section fits in one chunk
            chunks.append({
                "content": section_text,
                "heading": section.heading,
                "level": section.level
            })
        else:
            # Split section content, preserving heading in each chunk
            sub_chunks = recursive_split(section.content, max_chunk_size)
            for i, sub_chunk in enumerate(sub_chunks):
                suffix = " (continued)" if i > 0 else ""
                prefix = f"## {section.heading}{suffix}\n\n"
                chunks.append({
                    "content": prefix + sub_chunk if include_heading else sub_chunk,
                    "heading": section.heading,
                    "level": section.level,
                    "part": i + 1
                })
        # Recursively process subsections
        chunks.extend(chunk_by_structure(
            section.subsections, max_chunk_size, include_heading
        ))
    return chunks
Sentence-Window Chunking
A hybrid approach that embeds individual sentences but retrieves surrounding context. Each sentence is embedded separately for precise matching, but when retrieved, the system expands to include neighboring sentences. This gives you precise retrieval with rich context.
How Sentence-Window Works
Split document into sentences
Embed each sentence individually
Store sentence with references to neighbors (window)
At retrieval time, match against sentence embeddings
Expand matched sentences to include window (e.g., 2 sentences before/after)
Return expanded context to LLM
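The distinctive step is the window expansion. This sketch implements only that part; embedding and matching are elided, and `matched_indices` stands in for whatever index positions your vector search returns.

```python
def expand_with_window(
    sentences: list[str],
    matched_indices: list[int],
    window: int = 2,
) -> list[str]:
    """For each matched sentence, return it plus `window` neighbors each side."""
    expanded = []
    for idx in matched_indices:
        start = max(0, idx - window)               # Clamp at document start
        end = min(len(sentences), idx + window + 1)  # Clamp at document end
        expanded.append(" ".join(sentences[start:end]))
    return expanded

sents = ["S0.", "S1.", "S2.", "S3.", "S4.", "S5."]
# Suppose the embedding match landed on sentence 3:
print(expand_with_window(sents, [3], window=1))  # → ['S2. S3. S4.']
```

The clamping means sentences near document boundaries simply get smaller windows rather than raising errors.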
Chunk Size Optimization
Finding the right chunk size is empirical. Too small and you lose context; too large and you dilute relevance. The optimal size depends on your embedding model, query patterns, and document characteristics.
Chunk Size Tradeoffs
Small (128-256 tokens)
+ High precision
+ Fast retrieval
- Loses context
- More chunks to store
Medium (512-768 tokens)
Common default
Balanced tradeoff
Works for most cases
Large (1024+ tokens)
+ Rich context
+ Fewer chunks
- Lower precision
- Slower retrieval
Empirical Chunk Size Selection
The best way to find optimal chunk size is experimentation with your actual data and queries. Create a test set of questions with known answers, then measure retrieval quality at different chunk sizes.
def evaluate_chunk_size(
    documents: list[str],
    test_queries: list[dict],  # {"query": str, "expected_doc": str}
    chunk_sizes: list[int] = [256, 512, 768, 1024]
) -> dict:
    """Evaluate retrieval quality at different chunk sizes."""
    scores = {}
    for size in chunk_sizes:
        # Build index at this chunk size
        chunks = [chunk_document(doc, size) for doc in documents]
        index = build_index(chunks)
        # Measure retrieval quality
        hits_at_1 = 0
        hits_at_5 = 0
        mrr = 0.0
        for test in test_queries:
            retrieved = index.search(test["query"], k=5)
            for rank, result in enumerate(retrieved):
                if result.source_doc == test["expected_doc"]:
                    if rank == 0:
                        hits_at_1 += 1
                    hits_at_5 += 1
                    mrr += 1.0 / (rank + 1)
                    break
        n = len(test_queries)
        scores[size] = {
            "hits@1": hits_at_1 / n,
            "hits@5": hits_at_5 / n,
            "mrr": mrr / n
        }
    return scores
Overlap Strategies
Overlap between chunks preserves context at boundaries. Without overlap, information that spans chunk boundaries becomes unretrievable—the first half is in one chunk, the second half in another, and neither chunk alone contains the complete answer.
No Overlap
0%
Minimal storage, but loses boundary context. Only appropriate when documents have very clear section boundaries.
Moderate Overlap
10-20%
Standard choice. Preserves most boundary context with reasonable storage overhead. Start here and adjust based on testing.
High Overlap
30-50%
Maximum boundary preservation but significant storage cost. Consider for critical applications or very short documents.
The Boundary Problem
Without overlap:
  Chunk 1: "...the system uses AES-256"
  Chunk 2: "encryption with a 32-byte key..."
Neither chunk contains "AES-256 encryption"

With 20% overlap:
  Chunk 1: "...the system uses AES-256 encryption with"
  Chunk 2: "...uses AES-256 encryption with a 32-byte key..."
Both chunks now contain the key phrase
Hierarchical Chunking
Hierarchical chunking creates parent-child relationships between chunks at different granularities. Small chunks enable precise retrieval; parent chunks provide context. This is particularly powerful for long documents where local precision and global context both matter.
At retrieval time, match against leaf chunks (paragraphs) for precision, then expand to parent chunks for context. This gives you the best of both worlds: precise matching with rich context.
def hierarchical_retrieve(query: str, k: int = 5) -> list[dict]:
    """Retrieve with parent context expansion."""
    # Match against paragraph-level embeddings
    matches = vector_search(query, level=2, k=k)
    results = []
    for match in matches:
        # Fetch parent section for context
        parent = get_chunk(match.parent_id)
        results.append({
            "matched_chunk": match.content,
            "context": parent.content,
            "document_id": parent.parent_id,
            "score": match.score
        })
    return results
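The build side is a matter of emitting chunk records at two granularities with parent pointers. A minimal sketch, where the `level` values match the retrieval convention above and the ID scheme is illustrative:

```python
import uuid

def build_hierarchy(doc_id: str, sections: list[dict]) -> list[dict]:
    """sections: [{"heading": str, "paragraphs": [str, ...]}, ...]"""
    chunks = []
    for section in sections:
        section_id = f"sec-{uuid.uuid4().hex[:8]}"
        # Section-level chunk: returned as context at retrieval time
        chunks.append({
            "id": section_id,
            "parent_id": doc_id,
            "level": 1,
            "content": "\n\n".join(section["paragraphs"]),
        })
        # Paragraph-level chunks: embedded and matched for precision
        for para in section["paragraphs"]:
            chunks.append({
                "id": f"par-{uuid.uuid4().hex[:8]}",
                "parent_id": section_id,
                "level": 2,
                "content": para,
            })
    return chunks

records = build_hierarchy("doc-1", [{"heading": "Intro", "paragraphs": ["p1", "p2"]}])
```

Only level-2 records need embeddings; level-1 records are fetched by ID, so they cost storage but no extra vector-index space.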
Metadata Schemas
Each chunk should carry rich metadata. This enables filtered retrieval, provides context for the LLM, and supports debugging and analytics. A good metadata schema is an investment that pays dividends throughout the system.
Filtered Retrieval
Search only recent documents: modified_at > 2024-01-01
Restrict to department: department = "engineering"
Find code examples: has_code = true
Specific document type: source.filename LIKE "%.pdf"
Context for LLM
Include section hierarchy in prompts
Show document date for temporal reasoning
Indicate content type for interpretation
Provide source for attribution
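Pulling the filtering and LLM-context needs above into one place, a schema might look like the following sketch. Field names are illustrative, not a standard:

```python
from __future__ import annotations
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ChunkMetadata:
    source_filename: str           # e.g. "handbook.pdf"; enables LIKE filters
    document_id: str
    section_path: list[str]        # Heading hierarchy, shown to the LLM
    page: int | None = None
    department: str | None = None  # Access / scope filtering
    modified_at: datetime | None = None
    has_code: bool = False
    content_type: str = "prose"    # "prose" | "table" | "code"

meta = ChunkMetadata(
    source_filename="handbook.pdf",
    document_id="doc-42",
    section_path=["Security", "Encryption"],
)
```

Keep filterable fields flat and scalar; nested structures are harder to index in most vector stores.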
Document Preprocessing
Raw extracted text often needs cleaning before chunking. Preprocessing improves both retrieval quality and LLM comprehension. The goal is to normalize text while preserving meaningful structure.
Text Cleaning
# Common text cleaning operations
import re
import unicodedata
from collections import Counter

def clean_text(text: str) -> str:
    """Clean and normalize extracted text."""
    # Unicode normalization (also expands ligature glyphs like U+FB01)
    text = unicodedata.normalize("NFKC", text)
    # Replace any remaining ligature glyphs from PDF extraction
    text = text.replace("\ufb01", "fi").replace("\ufb02", "fl")
    # Normalize curly quotes to straight quotes
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    # Remove page headers/footers (often repeated patterns)
    text = remove_repeated_headers(text)
    # Normalize whitespace
    text = re.sub(r"[ \t]+", " ", text)      # Multiple spaces to single
    text = re.sub(r"\n{3,}", "\n\n", text)   # Multiple newlines to double
    # Remove hyphenation at line breaks
    text = re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)
    # Normalize bullet points and list markers
    text = re.sub(r"^[•●○▪▸►]\s*", "- ", text, flags=re.MULTILINE)
    # Remove invisible/control characters
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]", "", text)
    return text.strip()

def remove_repeated_headers(text: str, threshold: int = 3) -> str:
    """Remove headers/footers that repeat across pages."""
    lines = text.split("\n")
    line_counts = Counter(lines)
    # Lines appearing more than threshold times are likely headers/footers
    repeated = {line for line, count in line_counts.items()
                if count >= threshold and len(line.strip()) > 0}
    return "\n".join(line for line in lines if line not in repeated)
Deduplication
Document corpora often contain duplicates—exact copies, near-duplicates, or documents with substantial overlap. Deduplication prevents the same content from dominating search results and reduces storage costs.
# Deduplication strategies
from datasketch import MinHash, MinHashLSH
import hashlib

def exact_dedup(documents: list[str]) -> list[str]:
    """Remove exact duplicates via hashing."""
    seen = set()
    unique = []
    for doc in documents:
        doc_hash = hashlib.sha256(doc.encode()).hexdigest()
        if doc_hash not in seen:
            seen.add(doc_hash)
            unique.append(doc)
    return unique

def near_dedup_minhash(
    documents: list[str],
    threshold: float = 0.8,
    num_perm: int = 128
) -> list[str]:
    """Remove near-duplicates using MinHash LSH."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    unique_indices = []
    for i, doc in enumerate(documents):
        # Create MinHash signature from the document's words
        mh = MinHash(num_perm=num_perm)
        for word in doc.split():
            mh.update(word.encode())
        # Keep the document only if nothing similar is already indexed
        if not lsh.query(mh):
            lsh.insert(i, mh)
            unique_indices.append(i)
    return [documents[i] for i in unique_indices]
Handling Different Document Types
Different content types require different processing strategies. A one-size-fits-all approach will produce poor results for specialized content like code, tables, or mixed media documents.
Code
Code has different semantic units than prose. Functions, classes, and methods are natural chunk boundaries. Preserve indentation and syntax structure. Consider including docstrings and comments with the code they document.
# Code-aware chunking strategy
- Chunk at function/class boundaries
- Keep function with its docstring
- Preserve import statements as context
- Maintain indentation (meaningful in Python)
- Consider AST-based splitting for precision
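A minimal sketch of the AST-based option in the list above, for Python source: top-level functions and classes become chunks, with the module's imports prepended as shared context.

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """Split Python source at top-level function/class boundaries."""
    tree = ast.parse(source)
    # Collect import statements to carry as context in every chunk
    imports = [ast.get_source_segment(source, node)
               for node in tree.body
               if isinstance(node, (ast.Import, ast.ImportFrom))]
    header = "\n".join(i for i in imports if i)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)  # Includes docstring
            chunks.append(f"{header}\n\n{segment}" if header else segment)
    return chunks

sample = 'import os\n\ndef f(x):\n    """doc"""\n    return os.sep + x\n'
code_chunks = chunk_python_source(sample)
```

Decorated definitions lose their decorators here (`get_source_segment` starts at the `def` line); a production splitter would use `node.decorator_list` positions as well.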
Tables
Tables are challenging because their meaning depends on structure. Options include: converting to markdown, serializing row-by-row with headers, or generating natural language descriptions. The right choice depends on how users will query the data.
# Table handling strategies
1. Markdown: | Header | Value | - preserves structure
2. Row descriptions: "Product X costs $100 with 50 units in stock"
3. Multiple representations: store both structured and prose
4. Metadata flagging: mark chunks as tables for special handling
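Strategy 2 above is simple to sketch: serialize each row as one sentence keyed by the header, which works well when users ask about individual records.

```python
def table_to_row_texts(table: list[list[str]]) -> list[str]:
    """table[0] is the header row; each later row becomes one chunkable text."""
    header, *rows = table
    return [
        "; ".join(f"{col}: {val}" for col, val in zip(header, row))
        for row in rows
    ]

table = [
    ["Product", "Price", "Stock"],
    ["Widget", "$100", "50"],
]
print(table_to_row_texts(table))  # → ['Product: Widget; Price: $100; Stock: 50']
```

The tradeoff: row texts embed well for record lookups but lose cross-row structure, so aggregate questions ("which product is cheapest?") still need the full table.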
Images and Diagrams
Images require either OCR (for text-heavy images), vision model descriptions, or multimodal embeddings. For diagrams and charts, extracting a textual description is often more useful than OCR. CLIP-style embeddings can enable image search alongside text.
# Image handling approaches
1. OCR for text-heavy images (screenshots, scanned docs)
2. Vision model captions for photos/diagrams
3. Alt-text extraction from HTML
4. Multimodal embeddings (CLIP) for image search
5. Combine: store image + generated description
Incremental Ingestion
Real-world systems need to handle updates—new documents, modified documents, deleted documents. Reprocessing the entire corpus on every change doesn't scale. Incremental ingestion updates only what changed.
# Incremental ingestion with change detection
from dataclasses import dataclass
from datetime import datetime
import hashlib

@dataclass
class DocumentState:
    document_id: str
    content_hash: str
    last_processed: datetime
    chunk_ids: list[str]

class IncrementalIngester:
    def __init__(self, state_store, vector_store):
        self.state_store = state_store
        self.vector_store = vector_store

    def process_document(self, doc_id: str, content: str) -> bool:
        """Process document if changed, return True if updated."""
        content_hash = hashlib.sha256(content.encode()).hexdigest()
        existing = self.state_store.get(doc_id)
        if existing and existing.content_hash == content_hash:
            # Document unchanged, skip processing
            return False
        # Document is new or modified
        if existing:
            # Delete old chunks
            self.vector_store.delete(existing.chunk_ids)
        # Process and store new chunks (chunk_document: your chunking pipeline)
        chunks = self.chunk_document(content)
        chunk_ids = self.vector_store.upsert(chunks)
        # Update state
        self.state_store.put(DocumentState(
            document_id=doc_id,
            content_hash=content_hash,
            last_processed=datetime.now(),
            chunk_ids=chunk_ids
        ))
        return True

    def delete_document(self, doc_id: str):
        """Remove document and its chunks."""
        existing = self.state_store.get(doc_id)
        if existing:
            self.vector_store.delete(existing.chunk_ids)
            self.state_store.delete(doc_id)
Update Strategies
Full replacement: Delete all chunks, reprocess entire document. Simple but wasteful for small changes.
Diff-based: Compare old and new content, update only changed sections. Complex but efficient.
Append-only: Never delete, just add new versions with timestamps. Query filters to latest. Good for audit trails.
Lazy invalidation: Mark old chunks as stale, reprocess on next access. Reduces peak load.
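The append-only strategy is simple enough to sketch end to end. Here the storage layer is a plain dict for illustration; in practice the version field would live in chunk metadata and the "latest" filter would run in your vector store's query.

```python
versions: dict[str, list[dict]] = {}  # doc_id -> list of versioned records

def append_version(doc_id: str, content: str) -> None:
    """Never delete: each update appends a new record with a version number."""
    history = versions.setdefault(doc_id, [])
    history.append({"version": len(history) + 1, "content": content})

def latest(doc_id: str):
    """Query-time filter: return only the newest version."""
    history = versions.get(doc_id)
    if not history:
        return None
    return max(history, key=lambda v: v["version"])["content"]

append_version("doc-1", "v1 text")
append_version("doc-1", "v2 text")
```

Old versions stay queryable for audit purposes; the cost is unbounded storage growth unless you add a retention policy.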
Data Quality Metrics
How do you know if your pipeline is working? Without metrics, you're flying blind. Establish quality indicators at each pipeline stage and monitor them continuously.
Extraction Quality
Character error rate (for OCR)
Table extraction accuracy
Encoding error frequency
Empty/failed extraction rate
Chunk Quality
Chunk size distribution
Semantic coherence scores
Truncated sentence frequency
Empty/near-empty chunk rate
Retrieval Quality
Precision@k on test queries
Mean reciprocal rank (MRR)
NDCG for ranked results
Human relevance judgments
System Health
Processing latency by document type
Embedding generation throughput
Storage growth rate
Error/retry rates
# Automated quality checks
class QualityChecker:
    def __init__(self, thresholds: dict):
        self.thresholds = thresholds

    def check_chunks(self, chunks: list[dict]) -> dict:
        """Run quality checks on processed chunks."""
        issues = []
        sizes = [len(c["content"]) for c in chunks]
        avg_size = sum(sizes) / len(sizes) if sizes else 0
        for i, chunk in enumerate(chunks):
            content = chunk["content"]
            size = len(content)
            # Too small
            if size < self.thresholds["min_chunk_size"]:
                issues.append({"type": "undersized_chunk",
                               "chunk_index": i, "size": size})
            # Too large
            if size > self.thresholds["max_chunk_size"]:
                issues.append({"type": "oversized_chunk",
                               "chunk_index": i, "size": size})
            # Possible truncation: doesn't end in sentence-final punctuation
            stripped = content.rstrip()
            if stripped and stripped[-1] not in ".!?:;\"')":
                issues.append({"type": "possible_truncation", "chunk_index": i})
            # High repetition (possible extraction error)
            if self.has_high_repetition(content):
                issues.append({"type": "repetitive_content", "chunk_index": i})
        return {
            "total_chunks": len(chunks),
            "issues": issues,
            "issue_rate": len(issues) / len(chunks) if chunks else 0.0,
            "avg_chunk_size": avg_size
        }

    def has_high_repetition(self, content: str, threshold: float = 0.5) -> bool:
        """Crude check: flag content where a single line dominates."""
        lines = [l for l in content.split("\n") if l.strip()]
        if len(lines) < 4:
            return False
        most_common = max(lines.count(l) for l in set(lines))
        return most_common / len(lines) > threshold
Storage Formats and Persistence
Processed chunks need persistent storage. The storage format affects query performance, portability, and operational complexity. For local-first systems, embedded databases are often preferable to standalone servers.
# Storage options comparison
Storage       Type      Persistence   Scalability    Best For
────────────────────────────────────────────────────────────────────────
FAISS         Embedded  File export   10M+ vectors   High-performance search
ChromaDB      Embedded  SQLite        1M vectors     Rapid prototyping
LanceDB       Embedded  Native files  10M+ vectors   Local-first, multimodal
SQLite + ext  Embedded  Single file   100K vectors   Simple deployments
Qdrant        Server    Native        100M+ vectors  Production scale
Milvus        Server    Native        1B+ vectors    Enterprise scale
Pinecone      Managed   Cloud         Unlimited      Fully managed
Local-First Storage with LanceDB
For OnyxLab's local-first philosophy, embedded databases that persist to local files are ideal. LanceDB offers vector search with native file persistence and no server process.
import lancedb

# Create or connect to local database
db = lancedb.connect("./data/vectors")

# Create table with schema inferred from the first record
table = db.create_table(
    "documents",
    data=[{
        "id": "chunk_1",
        "content": "Example content...",
        "embedding": [0.1, 0.2, ...],  # 1536 dimensions (elided)
        "metadata": {"source": "doc.pdf", "page": 1}
    }],
    mode="overwrite"
)

# Vector search with metadata filtering
results = (
    table.search([0.1, 0.2, ...])  # Query vector (elided)
    .where("metadata.source = 'doc.pdf'")
    .limit(10)
    .to_list()
)

# Data persists to ./data/vectors/documents.lance
Chunk Storage Schema
Store chunks with their embeddings, metadata, and relationships in a format that supports both vector search and metadata filtering.
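One possible record layout matching those needs: content for the LLM, an embedding for vector search, flat scalar fields for filtering, and parent/document IDs for relationship traversal. Field names and sizes are illustrative.

```python
chunk_record = {
    "id": "chunk_00042",
    "document_id": "doc_007",
    "parent_id": "sec_003",        # For hierarchical context expansion
    "content": "The system uses AES-256 encryption...",
    "embedding": [0.0] * 1536,     # Vector from your embedding model
    "source_filename": "security.pdf",
    "page": 12,
    "content_type": "prose",
    "modified_at": "2024-03-01T00:00:00Z",
}
```

Keeping filter fields at the top level (rather than nested under a metadata key) makes them directly indexable in most stores.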
Key Takeaways
1. Document extraction is harder than it looks. Test your extractors on real data and build fallback chains.
2. Chunk size is empirical. Start with 512 tokens and measure retrieval quality, then adjust.
3. Use overlap (10-20%) to preserve boundary context. The storage cost is worth it.
4. Rich metadata enables filtering and provides context. Design your schema early.
5. Different content types need different strategies. Don't force code into prose pipelines.
6. Build incremental ingestion from the start. Full reprocessing doesn't scale.
7. Measure quality at every stage. What you don't measure, you can't improve.
The quality of your data pipeline directly determines the quality of your agent's responses. Time spent here pays dividends throughout the system. Get this right, and retrieval becomes reliable. Get it wrong, and no amount of prompt engineering will save you.