
Embeddings

Embeddings are the mathematical foundation of semantic search. They transform text into vectors where meaning becomes measurable—similar concepts cluster together in high-dimensional space. Understanding how embeddings work at a deep level is essential for building effective retrieval systems.

How Embedding Models Work

Modern embedding models are built on transformer encoder architectures. Unlike decoder-only models (GPT, Llama) that generate text autoregressively, encoder models process the entire input simultaneously and produce contextual representations for each token.

Transformer Encoder Architecture

The transformer encoder consists of stacked self-attention layers. Each layer allows every token to attend to every other token, building increasingly abstract representations. The key components:

// Simplified transformer encoder flow
Input: "The cat sat on the mat"
↓ Tokenization
Tokens: [CLS] the cat sat on the mat [SEP]
↓ Token + Position Embeddings
Initial embeddings: [768-dim] × 8 tokens
↓ Self-Attention Layer 1
↓ Feed-Forward Layer 1
↓ ... (6-12 layers typically)
↓ Self-Attention Layer N
↓ Feed-Forward Layer N
Final hidden states: [768-dim] × 8 tokens
↓ Pooling Strategy
Sentence embedding: [768-dim] × 1

The self-attention mechanism computes attention scores between all token pairs, allowing the model to understand context. For a sequence of n tokens, this requires O(n²) operations, which is why most embedding models have sequence length limits (512-8192 tokens).

Pooling Strategies

After the transformer produces per-token representations, we need to collapse them into a single vector. The pooling strategy significantly affects embedding quality:

[CLS] Token Pooling

Uses the final hidden state of the special [CLS] token as the sentence embedding. This token is trained to aggregate information from the entire sequence. Simple but effective for models trained with this objective.

embedding = hidden_states[0] # [CLS] is first token

Mean Pooling

Averages all token embeddings (excluding padding). Often outperforms [CLS] pooling because it incorporates information from every position equally. Most common in modern embedding models.

embedding = sum(hidden_states * attention_mask) / sum(attention_mask)
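
Mean pooling is simple to implement directly. A minimal NumPy sketch (shapes and names are illustrative; in practice the pooling runs on the model's hidden states):

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean pooling over non-padding tokens.

    hidden_states: (seq_len, dim); attention_mask: (seq_len,) of 0/1.
    """
    mask = attention_mask[:, None].astype(hidden_states.dtype)  # (seq_len, 1)
    summed = (hidden_states * mask).sum(axis=0)  # sum over real tokens only
    count = mask.sum()                           # number of real tokens
    return summed / count

# Toy example: 2 real tokens plus 1 padding token that must be ignored
hidden = np.array([[1.0, 2.0], [3.0, 4.0], [99.0, 99.0]])
mask = np.array([1, 1, 0])
mean_pool(hidden, mask)  # → [2.0, 3.0]
```

Without the mask, the padding row would pollute the average, which is why the attention mask appears in the formula above.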

Max Pooling

Takes the maximum value across tokens for each dimension. Captures the strongest signal for each feature. Less common but useful for certain retrieval tasks.

embedding = max(hidden_states, dim=0)

Last Token Pooling

Uses the final non-padding token. Common in decoder-based embedding models (e5-mistral, GritLM): with causal attention, only the last token has attended to the entire sequence, so its hidden state serves as the sequence summary.

embedding = hidden_states[last_non_pad_idx]

Contrastive Learning: The Training Objective

Embedding models are not trained to predict tokens. They are trained with contrastive learning objectives that teach the model to produce similar vectors for semantically similar text and dissimilar vectors for unrelated text.

InfoNCE Loss

The most common training objective is InfoNCE (a noise-contrastive estimation loss), which frames training as a classification problem: given a query, identify the correct positive among a set of negatives.

// InfoNCE Loss formulation
L = -log(
exp(sim(q, p⁺) / τ)
─────────────────────────────────────
exp(sim(q, p⁺) / τ) + Σᵢ exp(sim(q, pᵢ⁻) / τ)
)
where:
q = query embedding
p⁺ = positive (similar) embedding
pᵢ⁻ = negative (dissimilar) embeddings
τ = temperature parameter (typically 0.01-0.1)
sim = similarity function (usually dot product)

The temperature τ controls how "peaky" the softmax distribution is. Lower temperatures create harder distinctions between positives and negatives, leading to more discriminative embeddings but potentially unstable training.
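
The loss above can be sketched in NumPy for a single query (illustrative only; real training computes this over batches with autograd, and `info_nce` is a name chosen here):

```python
import numpy as np

def info_nce(q, pos, negs, tau=0.05):
    """InfoNCE loss for one query with dot-product similarity.

    q: (d,) query embedding; pos: (d,) positive; negs: (k, d) negatives.
    """
    sims = np.concatenate([[q @ pos], negs @ q]) / tau  # positive is index 0
    sims -= sims.max()                                  # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()           # softmax over candidates
    return -np.log(probs[0])                            # -log p(positive)

q = np.array([1.0, 0.0])
pos = np.array([0.9, 0.1])
easy_negs = np.array([[0.0, 1.0], [-1.0, 0.0]])
loss_easy = info_nce(q, pos, easy_negs)           # near zero: negatives are easy
loss_hard = info_nce(q, pos, np.array([[0.95, 0.05]]))  # larger: negative is hard
```

The hard-negative case illustrates the point made below: near-miss negatives produce a much stronger training signal than random ones.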

Hard Negatives

The quality of negatives dramatically affects embedding quality. Easy negatives (random documents) provide weak training signal. Hard negatives (documents that look similar but have different meaning) force the model to learn subtle distinctions.

Hard Negative Mining Strategies

  • BM25 negatives: Retrieve lexically similar but semantically different documents
  • In-batch negatives: Use other positives in the batch as negatives (memory efficient)
  • Cross-encoder negatives: Use a cross-encoder to find challenging pairs
  • Synthetic negatives: Generate near-miss negatives using LLMs

Training Data Sources

Modern embedding models are trained on diverse datasets:

Web pairs: (query, clicked document) from search logs
Q&A pairs: (question, answer) from forums, Stack Overflow
NLI pairs: (premise, entailment) from natural language inference
Paraphrase pairs: Different phrasings of the same meaning
Title-body pairs: (article title, article content)
Synthetic pairs: LLM-generated query-document pairs

The Geometry of Embedding Space

Embedding spaces have rich geometric structure that emerges from training. Understanding this geometry helps explain both the power and limitations of embedding-based retrieval.

Semantic Clusters

Related concepts cluster together in embedding space. Documents about "machine learning" occupy a different region than documents about "Renaissance art." Within the ML cluster, subclusters form for specific topics: neural networks, decision trees, reinforcement learning.

// Cluster structure (conceptual)
Technology ──┬── Machine Learning ───┬── Neural Networks
             │                       ├── Decision Trees
             │                       └── Reinforcement Learning
             ├── Web Development ────┬── React
             │                       ├── Vue
             │                       └── Backend
             └── Systems ────────────┬── Databases
                                     └── Networking

Linear Relationships and Analogies

One of the most remarkable properties of embedding spaces is that semantic relationships often correspond to linear directions. The classic example from word embeddings:

// Vector arithmetic captures analogies
king - man + woman ≈ queen
paris - france + germany ≈ berlin
walked - walk + swim ≈ swam
// This works because gender, nationality, and tense
// are encoded as consistent directions in the space

For sentence embeddings, linear relationships are less clean but still present. The direction from "positive sentiment" to "negative sentiment" is roughly consistent across topics. This enables techniques like linear probing for classification.
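
The analogy arithmetic can be demonstrated with toy vectors. These are hand-constructed so the "gender" direction is exact (real word vectors only approximate this behavior):

```python
import numpy as np

# Toy 2-d word vectors: axis 0 ≈ "royalty", axis 1 ≈ "gender"
vecs = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
}

def nearest(v, vocab, exclude=()):
    """Vocabulary word with the highest cosine similarity to v."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude), key=lambda w: cos(v, vocab[w]))

analogy = vecs["king"] - vecs["man"] + vecs["woman"]      # → [1, -1]
nearest(analogy, vecs, exclude={"king", "man", "woman"})  # → "queen"
```

The standard evaluation excludes the three input words, since the nearest neighbor of the raw result is often one of them.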

Anisotropy: The Cone Problem

Embedding spaces often exhibit anisotropy—embeddings cluster in a narrow cone rather than uniformly filling the space. This means even unrelated texts can have surprisingly high cosine similarity (0.4-0.6 baseline instead of 0).

Implications of Anisotropy

  • Similarity thresholds: A 0.7 cosine similarity might only indicate weak relatedness
  • Relative ranking: Focus on relative similarity rather than absolute values
  • Whitening: Transform embeddings to have unit covariance (can improve retrieval)
  • Model choice: Some models are trained to reduce anisotropy explicitly
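
The whitening transform mentioned above can be sketched with NumPy (a minimal version computed from the sample covariance; production systems would fit it once on a representative corpus and reuse it):

```python
import numpy as np

def whiten(embeddings: np.ndarray) -> np.ndarray:
    """ZCA-style whitening: center, then rescale so the transformed
    embeddings have (approximately) identity covariance."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = np.cov(centered, rowvar=False)
    u, s, _ = np.linalg.svd(cov)                       # cov = u diag(s) uᵀ
    W = u @ np.diag(1.0 / np.sqrt(s + 1e-9)) @ u.T     # inverse square root of cov
    return centered @ W

rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8))  # anisotropic data
white = whiten(emb)
np.allclose(np.cov(white, rowvar=False), np.eye(8), atol=0.1)  # → True
```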

Dimensionality and Information Density

Embedding dimensions represent a tradeoff between expressiveness, storage, and computational cost. Understanding this tradeoff is essential for system design.

Storage Costs

// Storage requirements (float32)
384 dimensions × 4 bytes = 1.5 KB per embedding
768 dimensions × 4 bytes = 3.0 KB per embedding
1024 dimensions × 4 bytes = 4.0 KB per embedding
1536 dimensions × 4 bytes = 6.0 KB per embedding
// For 1 million documents:
384d: 1.5 GB
768d: 3.0 GB
1536d: 6.0 GB

Information Density

Higher dimensions do not always mean better embeddings. What matters is information density—how much useful semantic information is packed into each dimension. A well-trained 384d model can outperform a poorly-trained 1536d model.

Research suggests that most embedding models have significant redundancy. The top 64-128 principal components often capture 90%+ of the variance. This observation motivates dimensionality reduction techniques and Matryoshka embeddings.
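
The redundancy claim is easy to check on your own embeddings with a PCA sketch (synthetic data here; `pca_reduce` is a name chosen for illustration):

```python
import numpy as np

def pca_reduce(embeddings: np.ndarray, k: int):
    """Project embeddings onto their top-k principal components.
    Returns the reduced vectors and the fraction of variance retained."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s[:k] ** 2).sum() / (s ** 2).sum()
    return centered @ vt[:k].T, explained

rng = np.random.default_rng(0)
# Synthetic "redundant" embeddings: 768 dims but only 32 effective directions
latent = rng.normal(size=(1000, 32))
emb = latent @ rng.normal(size=(32, 768))
reduced, explained = pca_reduce(emb, k=64)
reduced.shape  # → (1000, 64)
# explained ≈ 1.0: the top 64 components capture essentially all the variance
```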

Curse of Dimensionality

In very high dimensions, counterintuitive things happen. Points become approximately equidistant—the ratio of farthest to nearest neighbor approaches 1. This affects:

  • Nearest neighbor search: Approximate methods become necessary
  • Similarity thresholds: Absolute similarity values become less meaningful
  • Clustering: Density-based clustering struggles in high dimensions

Distance Metrics Deep Dive

The choice of distance metric affects retrieval quality and computational cost. Understanding the mathematics helps you choose correctly.

Cosine Similarity

// Cosine similarity formula
cos(θ) = (A · B) / (||A|| × ||B||)
= Σᵢ(Aᵢ × Bᵢ) / (√Σᵢ(Aᵢ²) × √Σᵢ(Bᵢ²))
Range: [-1, 1]
1 = identical direction
0 = orthogonal (unrelated)
-1 = opposite direction

Cosine similarity measures the angle between vectors, ignoring magnitude. This is ideal for text because document length affects magnitude but not meaning. A long article and its one-sentence summary should have high similarity despite different magnitudes.

Euclidean (L2) Distance

// Euclidean distance formula
d(A, B) = √Σᵢ(Aᵢ - Bᵢ)²
Range: [0, ∞)
0 = identical points
∞ = infinitely far apart
// For normalized vectors: d² = 2(1 - cos(θ))
// Euclidean and cosine become equivalent!

Euclidean distance measures straight-line distance. It is sensitive to magnitude, which can be problematic for text. However, for normalized embeddings (unit length), Euclidean distance is monotonically related to cosine similarity—they produce identical rankings.

Dot Product (Inner Product)

// Dot product formula
A · B = Σᵢ(Aᵢ × Bᵢ)
Range: (-∞, ∞)
// For normalized vectors: A · B = cos(θ)
// Dot product equals cosine similarity!

Dot product is the fastest to compute—just element-wise multiplication and sum, no square roots. For normalized embeddings, dot product equals cosine similarity. This is why production systems often normalize embeddings at index time and use dot product at query time.
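
The equivalences stated above are easy to verify numerically:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Normalize both vectors to unit length
an = a / np.linalg.norm(a)
bn = b / np.linalg.norm(b)

cos = an @ bn                    # dot product == cosine for unit vectors
l2_sq = np.sum((an - bn) ** 2)   # squared Euclidean distance

np.isclose(l2_sq, 2 * (1 - cos))  # → True: d² = 2(1 − cos θ)
```

Because all three metrics are monotonically related on the unit hypersphere, they produce identical rankings; only the raw scores differ.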

Manhattan (L1) Distance

// Manhattan distance formula
d(A, B) = Σᵢ|Aᵢ - Bᵢ|
Range: [0, ∞)
Also called: taxicab distance, city block distance

Manhattan distance sums absolute differences. It is more robust to outliers than Euclidean distance because it does not square the differences. Less common for embeddings but useful in some specialized applications.

Metric Selection Guide

  • Normalized embeddings, speed critical: Dot Product
  • Non-normalized embeddings: Cosine Similarity
  • Using HNSW or IVF indexes: L2 or Dot Product
  • General recommendation: Normalize + Dot Product

Normalization and Why It Matters

Normalization transforms embeddings to unit length (L2 norm = 1). This seemingly simple operation has profound implications for retrieval systems.

// L2 normalization
norm = √Σᵢ(vᵢ²)
v_normalized = v / norm
// After normalization: ||v_normalized|| = 1
// Vector lies on the unit hypersphere

Benefits of Normalization

Metric Equivalence

For normalized vectors, dot product = cosine similarity, and L2 distance is a monotonic transformation of both. You can use the fastest metric (dot product) without sacrificing accuracy.

Index Efficiency

Many vector indexes (HNSW, IVF) are optimized for specific metrics. Normalizing allows consistent use of inner product indexes across all your data.

Bounded Similarity

Normalized embeddings have similarity in [-1, 1], making threshold-based filtering predictable. Raw embeddings can have unbounded dot products.

Length Invariance

Document length affects raw embedding magnitude. Normalization removes this bias, ensuring a one-sentence query can match a long document about the same topic.

// Normalize embeddings before indexing (Python)
import numpy as np
def normalize(embeddings: np.ndarray) -> np.ndarray:
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
return embeddings / norms
# Many embedding models output normalized vectors by default
# Check your model's documentation

Quantization of Embeddings

Full-precision embeddings (float32) consume significant storage and memory. Quantization reduces precision to shrink memory footprint and accelerate similarity computation, with controllable quality tradeoffs.

Scalar Quantization (int8)

// int8 quantization
float32: 4 bytes per dimension
int8: 1 byte per dimension
Compression: 4x
// Quantization process:
v_min, v_max = min(v), max(v)
v_quantized = round((v - v_min) / (v_max - v_min) * 255) - 128

int8 quantization maps each float to one of 256 integer values. This typically degrades retrieval quality by 1-3% while providing 4x storage reduction. The quantization can be per-vector (better quality) or global (faster computation).
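
A minimal per-vector int8 quantizer in NumPy (a sketch of the scheme above; vector databases implement this internally with more careful calibration):

```python
import numpy as np

def quantize_int8(v: np.ndarray):
    """Per-vector scalar quantization: float32 -> int8 plus (v_min, scale)."""
    v_min, v_max = float(v.min()), float(v.max())
    scale = (v_max - v_min) / 255.0
    q = np.round((v - v_min) / scale).astype(np.int32) - 128  # map to [-128, 127]
    return q.astype(np.int8), v_min, scale

def dequantize_int8(q: np.ndarray, v_min: float, scale: float) -> np.ndarray:
    return (q.astype(np.float32) + 128) * scale + v_min

rng = np.random.default_rng(0)
v = rng.normal(size=384).astype(np.float32)
q, v_min, scale = quantize_int8(v)
v_hat = dequantize_int8(q, v_min, scale)
# Reconstruction error is bounded by one quantization step
np.abs(v - v_hat).max() < scale  # → True
```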

Binary Quantization

// Binary quantization
float32: 32 bits per dimension
binary: 1 bit per dimension
Compression: 32x
// Quantization process:
v_binary = (v > 0) ? 1 : 0 // Simple sign-based
// Distance computation (Hamming):
distance = popcount(a XOR b) // Count differing bits (lower = more similar)
// Uses single CPU instruction on modern processors

Binary quantization is extreme—each dimension becomes a single bit. This enables hardware-accelerated Hamming distance using POPCNT instructions. Quality loss is substantial (10-20% typically), making it best suited for initial candidate retrieval followed by re-ranking with full-precision embeddings.
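
Sign-based binarization and Hamming distance can be sketched in NumPy (this uses `unpackbits` as a portable popcount; production code relies on hardware POPCNT):

```python
import numpy as np

def binarize(v: np.ndarray) -> np.ndarray:
    """Sign-based binary quantization, packed 8 dimensions per byte."""
    return np.packbits(v > 0)

def hamming(a_bits: np.ndarray, b_bits: np.ndarray) -> int:
    """Hamming distance between packed bit vectors (lower = more similar)."""
    xor = np.bitwise_xor(a_bits, b_bits)
    return int(np.unpackbits(xor).sum())  # portable popcount

rng = np.random.default_rng(0)
a = rng.normal(size=384)
b = a + rng.normal(scale=0.1, size=384)  # near-duplicate of a
c = rng.normal(size=384)                 # unrelated vector

hamming(binarize(a), binarize(b)) < hamming(binarize(a), binarize(c))  # → True
```

A 384-dimension vector packs into 48 bytes, which is what makes the 32x compression and cache-friendly scanning possible.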

Product Quantization (PQ)

Product quantization divides the vector into subspaces (e.g., 8 groups of 48 dimensions for a 384d vector), then quantizes each subspace independently to one of 256 centroids. This provides adjustable compression ratios and often outperforms scalar quantization.

Quantization Comparison

Method    | Compression   | Quality Loss | Speed Gain
float32   | 1x (baseline) | 0%           | 1x
float16   | 2x            | <0.5%        | 1.5-2x
int8      | 4x            | 1-3%         | 2-4x
PQ (8x48) | 8-16x         | 2-5%         | 3-6x
Binary    | 32x           | 10-20%       | 10-30x

Local Embedding Models Comparison

Several high-quality embedding models run efficiently on consumer hardware. Each has distinct characteristics suited to different use cases.

Model                      | Dims | Max Tokens | Size   | MTEB Avg
all-MiniLM-L6-v2           | 384  | 512        | ~80MB  | 56.3
bge-small-en-v1.5          | 384  | 512        | ~130MB | 62.2
bge-base-en-v1.5           | 768  | 512        | ~440MB | 64.2
bge-large-en-v1.5          | 1024 | 512        | ~1.3GB | 64.6
nomic-embed-text-v1.5      | 768  | 8192       | ~550MB | 62.3
e5-small-v2                | 384  | 512        | ~130MB | 59.9
e5-base-v2                 | 768  | 512        | ~440MB | 61.5
e5-large-v2                | 1024 | 512        | ~1.3GB | 62.3
gte-small                  | 384  | 512        | ~70MB  | 61.4
gte-base                   | 768  | 512        | ~220MB | 63.1
gte-large                  | 1024 | 512        | ~670MB | 65.4
jina-embeddings-v2-base-en | 768  | 8192       | ~550MB | 60.4
mxbai-embed-large-v1       | 1024 | 512        | ~670MB | 64.7

Model Recommendations by Use Case

Maximum Speed / Minimum Resources

all-MiniLM-L6-v2 or gte-small. Sub-100MB models that run well even on CPUs. Good for edge devices, real-time applications, or when embedding is not the bottleneck.

Best Quality / General Purpose

bge-large-en-v1.5 or gte-large. Top MTEB scores among open models. Require ~1GB RAM and benefit from GPU acceleration.

Long Documents

nomic-embed-text-v1.5 or jina-embeddings-v2. 8192 token context allows embedding entire documents without chunking. Essential for applications where chunk boundaries would break semantics.

Quality/Speed Balance

bge-base-en-v1.5 or gte-base. 768 dimensions with good MTEB scores. Half the size of large models with 90%+ of the quality.

Query Prefixes

Some models require specific prefixes for queries vs documents. This is a training detail that improves asymmetric retrieval (short query, long document):

// Query prefixes by model
BGE: "Represent this sentence for searching..."
E5: "query: " for queries, "passage: " for docs
Nomic: "search_query: " and "search_document: "
GTE: No prefix required
MiniLM: No prefix required
// Using wrong prefixes degrades retrieval quality
// Always check model documentation

Benchmarks Explained: MTEB and BEIR

Understanding benchmarks helps you interpret model quality claims and choose models appropriate for your use case.

MTEB (Massive Text Embedding Benchmark)

MTEB evaluates embeddings across 58 datasets spanning 8 task categories. The overall score is an average across all tasks, which may not reflect performance on your specific use case.

MTEB Task Categories

  • Retrieval: Find relevant documents for a query (most relevant for RAG)
  • Classification: Assign labels to text
  • Clustering: Group similar documents
  • Pair Classification: Determine if two texts are similar
  • Reranking: Order documents by relevance
  • STS: Semantic textual similarity scoring
  • Summarization: Evaluate summary quality
  • BitextMining: Find translation pairs

For RAG applications, focus on the Retrieval and Reranking scores rather than the overall average. A model optimized for classification may underperform on retrieval despite a high average MTEB score.

BEIR (Benchmarking Information Retrieval)

BEIR focuses specifically on zero-shot retrieval across 18 diverse datasets. It is the standard benchmark for evaluating retrieval models on out-of-domain data.

// BEIR dataset domains (selection)
MS MARCO - Web search queries
Natural Questions - Wikipedia Q&A
HotpotQA - Multi-hop reasoning
FEVER - Fact verification
SciFact - Scientific claims
NFCorpus - Medical/nutrition
ArguAna - Argument retrieval
Touché-2020 - Controversial topics
CQADupStack - StackExchange Q&A
Quora - Duplicate questions
DBPedia - Entity retrieval
SCIDOCS - Scientific papers
FiQA - Financial Q&A
TREC-COVID - COVID-19 research
Climate-FEVER - Climate claims

Interpreting Benchmark Scores

Key Metrics

  • nDCG@10: Normalized Discounted Cumulative Gain at rank 10. Measures ranking quality with position weighting. Higher is better. Range 0-1.
  • Recall@k: Fraction of relevant documents found in top k results. Critical for RAG where you retrieve top-k chunks for the LLM.
  • MRR: Mean Reciprocal Rank. Average of 1/rank of first relevant result. Emphasizes finding at least one good result quickly.
  • MAP: Mean Average Precision. Considers precision at each relevant document. Good for understanding overall ranking quality.
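
Recall@k and MRR are simple enough to implement for sanity checks on your own evaluation set (a sketch; for full benchmark evaluation use an established library such as pytrec_eval):

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of relevant documents appearing in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(ranked_lists: list, relevant_sets: list) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

recall_at_k(["d3", "d1", "d7"], {"d1", "d2"}, k=3)   # → 0.5 (1 of 2 relevant found)
mrr([["d3", "d1"], ["d2", "d9"]], [{"d1"}, {"d2"}])  # → 0.75 (mean of 1/2 and 1/1)
```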

Benchmark limitations to keep in mind:

  • Scores reflect average performance across datasets; your domain may differ significantly
  • Short queries dominate benchmarks; long-form queries may behave differently
  • Benchmarks test zero-shot; fine-tuned models can dramatically outperform
  • English benchmarks dominate; multilingual performance varies
  • Dataset contamination can inflate scores (training on test data)

Multilingual and Cross-lingual Embeddings

Multilingual embedding models map text from multiple languages into a shared vector space. This enables powerful cross-lingual capabilities without machine translation.

Cross-lingual Retrieval

In a well-aligned multilingual embedding space, a query in one language can retrieve relevant documents in any language:

// Cross-lingual retrieval example
Query (English): "How does photosynthesis work?"
→ Retrieves relevant docs in English, German, French, Japanese...
Query (German): "Wie funktioniert die Photosynthese?"
→ Same documents retrieved with similar ranking
// The embedding space is language-agnostic
// Semantic similarity transcends language boundaries

Notable Multilingual Models

Model                         | Languages | Dims | Notes
multilingual-e5-large         | 100+      | 1024 | Best multilingual quality
bge-m3                        | 100+      | 1024 | Dense + sparse + multi-vector
paraphrase-multilingual-mpnet | 50+       | 768  | Efficient, sentence-focused
LaBSE                         | 109       | 768  | Bitext mining focused

Challenges with Multilingual Embeddings

  • Capacity dilution: The same model size must represent 100+ languages instead of one. Per-language quality is lower than monolingual models.
  • Low-resource languages: Quality varies dramatically. English, Chinese, German are well-represented; many languages have poor embeddings.
  • Script differences: Languages with unique scripts (Thai, Arabic, Korean) may cluster differently than Latin-script languages.
  • Domain shift: Training data is often web-crawled, underrepresenting specialized domains in non-English languages.

Matryoshka Embeddings: Adaptive Dimensions

Matryoshka Representation Learning (MRL) trains embeddings where truncated prefixes remain useful. You can use the first 64, 128, 256, or all dimensions depending on your quality/efficiency requirements.

// Matryoshka embedding usage
full_embedding = model.encode(text) # e.g., 768 dims
embedding_64 = full_embedding[:64] # ~90% of full quality
embedding_128 = full_embedding[:128] # ~95% of full quality
embedding_256 = full_embedding[:256] # ~98% of full quality
// Storage savings with minimal quality loss
// 768→256 dims = 3x compression, ~2% quality drop

How MRL Training Works

During training, the loss function is computed at multiple dimension checkpoints. This forces the model to encode the most important information in the earlier dimensions:

// MRL training loss (simplified)
L_total = Σₖ wₖ × L(embedding[:dₖ])
where dₖ ∈ {64, 128, 256, 512, 768}
and wₖ are dimension-specific weights

Use Cases for Adaptive Dimensions

Tiered Retrieval

Use 64-128 dimensions for fast initial candidate retrieval, then re-rank top-k with full embeddings. Dramatic speedup with minimal recall loss.
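
The tiered pattern can be sketched with brute-force NumPy search (names like `tiered_search` and `shortlist` are illustrative; a real system would use an ANN index for the coarse stage):

```python
import numpy as np

def tiered_search(query, docs, coarse_dims=64, shortlist=100, k=10):
    """Two-stage search: coarse scores on a truncated Matryoshka prefix,
    then exact re-scoring of the shortlist with full embeddings.
    Assumes docs and query are unit-normalized at full dimension."""
    def norm(x, d):
        v = x[..., :d]
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    # Stage 1: cheap coarse pass on the first coarse_dims dimensions
    coarse = norm(docs, coarse_dims) @ norm(query, coarse_dims)
    candidates = np.argsort(-coarse)[:shortlist]
    # Stage 2: exact full-dimension scores, only for the shortlist
    exact = docs[candidates] @ query
    return candidates[np.argsort(-exact)[:k]]

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 768))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[42] + rng.normal(scale=0.01, size=768)  # near-duplicate of doc 42
query /= np.linalg.norm(query)
tiered_search(query, docs)[0]  # → 42
```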

Adaptive Storage

Store short dimensions for older/less-accessed documents, full dimensions for frequently queried content. Balance storage cost with quality.

Edge Deployment

Use minimal dimensions on edge devices for local search, full dimensions on server for comprehensive queries. Same model, different tradeoffs.

Matryoshka-Enabled Models

Not all models support dimension truncation. These models are trained with MRL:

  • nomic-embed-text-v1.5: Supports 64, 128, 256, 512, 768
  • mxbai-embed-large-v1: Full MRL support
  • snowflake-arctic-embed: Designed for adaptive dimensions
  • jina-embeddings-v3: Flexible dimension truncation

Fine-tuning Embeddings for Domain-Specific Use

Off-the-shelf embedding models are trained on general web data. For specialized domains (legal, medical, scientific), fine-tuning on domain data can yield significant improvements.

When to Fine-tune

  • Domain vocabulary: Specialized terminology that general models misunderstand (legal terms, chemical compounds, medical abbreviations)
  • Different similarity notion: What counts as "similar" differs from web text (code similarity, legal precedent matching)
  • Quality plateau: Exhausted other optimizations (chunking, prompting, retrieval strategy) but still underperforming
  • Sufficient data: Have thousands of query-document pairs or can generate them synthetically

Fine-tuning Approaches

Contrastive Fine-tuning

Continue training with domain-specific positive pairs. The standard approach using the same InfoNCE loss as original training, but on your data.

sentence_transformers.losses.MultipleNegativesRankingLoss

Synthetic Data Generation

Use an LLM to generate queries for your documents. Given a document, prompt the LLM: "Generate 5 questions that this document would answer." Creates training pairs at scale.

(generated_query, original_document) pairs

Hard Negative Mining

Use BM25 or the base model to find challenging negatives. Fine-tune on triplets: (query, positive, hard_negative) with margin loss.

sentence_transformers.losses.TripletLoss

Distillation from Cross-encoder

Train a cross-encoder on your data (slow but accurate), then distill its knowledge into the bi-encoder embedding model. Best quality but most complex.

sentence_transformers.losses.MarginMSELoss

Fine-tuning Best Practices

// Fine-tuning recipe (sentence-transformers)
from sentence_transformers import SentenceTransformer, losses
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
# Key hyperparameters:
learning_rate = 2e-5  # Start small, tune carefully
batch_size = 32       # Larger = more negatives = better
epochs = 2            # 1-3 epochs typical; more risks overfitting
warmup_ratio = 0.1    # Gradual learning rate increase
# Critical: use in-batch negatives
train_loss = losses.MultipleNegativesRankingLoss(model)

Common pitfalls to avoid:

  • Training too long: embeddings overfit quickly, 1-3 epochs is usually sufficient
  • Learning rate too high: destroys pretrained knowledge, start at 1e-5 to 2e-5
  • Batch size too small: fewer in-batch negatives, weaker training signal
  • No evaluation set: impossible to detect overfitting without held-out data
  • Ignoring prefixes: if the base model uses prefixes, maintain them during fine-tuning

Batch Processing and Optimization

Embedding large document collections requires careful optimization. The difference between naive and optimized approaches can be 10-100x in throughput.

Batching Strategy

// Batching impact on throughput
// Naive: one text at a time
for text in texts:
    embedding = model.encode(text)  # ~50 texts/sec on GPU
// Better: batch processing
embeddings = model.encode(texts, batch_size=32)  # ~500 texts/sec
// Optimal: sorted batching
# Sort by length, batch similar lengths together
# Minimizes padding waste → ~800 texts/sec
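
The sorted-batching idea is a few lines of plain Python (a sketch; `sorted_batches` is a name chosen here, and the yielded indices let you restore the original order afterwards):

```python
def sorted_batches(texts: list, batch_size: int = 32):
    """Yield (indices, batch) pairs of similar-length texts so that
    padding waste within each batch is minimized."""
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield idx, [texts[i] for i in idx]

texts = ["hi", "a much longer sentence about embeddings", "ok", "short one"]
batches = list(sorted_batches(texts, batch_size=2))
batches[0][1]  # → ["hi", "ok"] (the two shortest texts batched together)
```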

GPU Optimization

Mixed Precision (FP16)

Use half-precision inference for 2x memory reduction and faster computation. Quality impact is negligible for most models.

model.half() # Convert to FP16

Multi-GPU Parallelism

Distribute batches across multiple GPUs for linear speedup. Use data parallelism for embedding (model on each GPU, split data).

model = SentenceTransformer(..., device="cuda")
model.start_multi_process_pool() # sentence-transformers

ONNX Runtime

Convert models to ONNX for optimized inference. Can provide 2-3x speedup on CPU and additional gains on GPU through graph optimization.

optimum-cli export onnx --model BAAI/bge-base-en-v1.5 ./onnx/

CPU Optimization

For CPU-only deployment, additional optimizations are critical:

// CPU optimization techniques
# Set thread count (OMP, MKL)
export OMP_NUM_THREADS=8
export MKL_NUM_THREADS=8
# Use quantized models
from optimum.onnxruntime import ORTModelForFeatureExtraction
model = ORTModelForFeatureExtraction.from_pretrained(
    "model_path",
    provider="CPUExecutionProvider",
)
# INT8 quantization for additional speedup
# 2-4x faster with ~1-3% quality loss

Streaming and Incremental Processing

For real-time systems, process documents as they arrive rather than in large batches:

// Streaming embedding pipeline
from typing import AsyncIterator

async def embed_stream(documents: AsyncIterator[str], batch_size: int = 32):
    # embed_batch: user-supplied async function that embeds a list of texts
    buffer = []
    async for doc in documents:
        buffer.append(doc)
        if len(buffer) >= batch_size:
            for emb in await embed_batch(buffer):
                yield emb
            buffer = []
    if buffer:  # Flush remaining
        for emb in await embed_batch(buffer):
            yield emb

Caching Strategies

Embedding computation is expensive. Effective caching avoids redundant work and dramatically improves system performance.

Content-based Caching

// Hash-based embedding cache
import hashlib
import numpy as np

def get_embedding(text: str, cache: dict) -> np.ndarray:
    # Create deterministic hash of input
    key = hashlib.sha256(text.encode()).hexdigest()
    if key in cache:
        return cache[key]
    embedding = model.encode(text)
    cache[key] = embedding
    return embedding

Persistent Embedding Store

For large-scale systems, persist embeddings to disk with proper indexing:

Storage Options

  • SQLite + numpy blobs: Simple, single-file, good for <10M embeddings
  • LMDB: Fast key-value store, memory-mapped, excellent for read-heavy workloads
  • Parquet files: Columnar format, good for analytics and batch operations
  • Vector databases: ChromaDB, Qdrant, Milvus—integrated storage and search

Cache Invalidation

Embeddings must be regenerated when:

  • Source document changes (content hash changes)
  • Embedding model changes (model version tracking)
  • Chunking strategy changes (preprocessing version tracking)
  • Model fine-tuning occurs (new model checkpoint)

// Versioned cache key
cache_key = f"{model_version}-{chunking_version}-{content_hash}"
# Example:
"bge-base-v1.5-chunk-v2-a1b2c3d4e5f6..."
# When model_version changes, all cached embeddings
# for that model become stale automatically

Query Embedding Caching

For frequently repeated queries, cache query embeddings separately from document embeddings:

  • LRU cache: Keep recent query embeddings in memory. Use functools.lru_cache or custom implementation.
  • Query normalization: Lowercase, strip whitespace, canonicalize before hashing to improve cache hit rate.
  • Semantic deduplication: Cluster similar queries, use a single representative embedding. Advanced but effective for high-volume systems.

// Query caching with normalization
from functools import lru_cache

def normalize_query(q: str) -> str:
    return q.lower().strip()

@lru_cache(maxsize=10000)
def _embed_normalized(normalized: str) -> tuple:
    return tuple(model.encode(normalized))  # hashable for cache

def get_query_embedding(query: str) -> tuple:
    # Normalize before the cache lookup so "Foo " and "foo" share one entry
    return _embed_normalized(normalize_query(query))

Summary and Recommendations

Embeddings are the foundation of semantic retrieval. Key decisions for your system:

Decision Framework

  1. Start with a proven base model: bge-base-en-v1.5 or gte-base provide good quality/speed balance. Upgrade to large variants only if quality is insufficient.
  2. Normalize embeddings at index time: Enables fast dot product search equivalent to cosine similarity.
  3. Implement proper caching: Content-hash-based caching with model version tracking prevents redundant computation.
  4. Measure on your data: Benchmark scores are indicative but not definitive. Create an evaluation set from your domain.
  5. Consider fine-tuning last: Exhaust chunking, prompting, and retrieval strategy optimizations before fine-tuning.

For local-first systems like those built with OnyxLab, embedding choice directly impacts user experience. A well-chosen embedding model running locally provides instant, private semantic search without network dependencies.