Embeddings
Embeddings are the mathematical foundation of semantic search. They transform text into vectors where meaning becomes measurable—similar concepts cluster together in high-dimensional space. Understanding how embeddings work at a deep level is essential for building effective retrieval systems.
How Embedding Models Work
Modern embedding models are built on transformer encoder architectures. Unlike decoder-only models (GPT, Llama) that generate text autoregressively, encoder models process the entire input simultaneously and produce contextual representations for each token.
Transformer Encoder Architecture
The transformer encoder consists of stacked self-attention layers. Each layer allows every token to attend to every other token, building increasingly abstract representations.
The self-attention mechanism computes attention scores between all token pairs, allowing the model to understand context. For a sequence of n tokens, this requires O(n²) operations, which is why most embedding models have sequence length limits (512-8192 tokens).
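A minimal single-head sketch of this computation in NumPy. The projection matrices and shapes here are illustrative, not tied to any particular model; note that the (n, n) score matrix is exactly where the O(n²) cost comes from.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention.

    x: (n, d) token embeddings; wq/wk/wv: (d, d) projection matrices.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])        # (n, n) pairwise scores: the O(n^2) term
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True) # softmax over each row
    return weights @ v                            # (n, d) contextualized token vectors
```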
Pooling Strategies
After the transformer produces per-token representations, we need to collapse them into a single vector. The pooling strategy significantly affects embedding quality:
[CLS] Token Pooling
Uses the final hidden state of the special [CLS] token as the sentence embedding. This token is trained to aggregate information from the entire sequence. Simple but effective for models trained with this objective.
```python
embedding = hidden_states[0]  # [CLS] is first token
```
Mean Pooling
Averages all token embeddings (excluding padding). Often outperforms [CLS] pooling because it incorporates information from every position equally. Most common in modern embedding models.
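A mask-aware mean pooling sketch in plain NumPy. The array shapes are assumed (per-token hidden states and a 0/1 attention mask), not tied to a specific model:

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions.

    hidden_states: (seq_len, dim) per-token encoder outputs.
    attention_mask: (seq_len,) with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, None].astype(hidden_states.dtype)  # (seq_len, 1)
    summed = (hidden_states * mask).sum(axis=0)                 # zero out padding
    count = mask.sum()                                          # number of real tokens
    return summed / np.maximum(count, 1e-9)
```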
```python
embedding = sum(hidden_states * attention_mask) / sum(attention_mask)
```
Max Pooling
Takes the maximum value across tokens for each dimension. Captures the strongest signal for each feature. Less common but useful for certain retrieval tasks.
```python
embedding = max(hidden_states, dim=0)
```
Last Token Pooling
Uses the final non-padding token. Common in decoder-based embeddings (e5-mistral, GritLM) where the last token aggregates bidirectional context through causal attention modifications.
```python
embedding = hidden_states[last_non_pad_idx]
```
Contrastive Learning: The Training Objective
Embedding models are not trained to predict tokens. They are trained with contrastive learning objectives that teach the model to produce similar vectors for semantically similar text and dissimilar vectors for unrelated text.
InfoNCE Loss
The most common training objective is InfoNCE, a noise-contrastive estimation loss that treats embedding as a classification problem: given a query, identify the correct positive among a set of negatives.
The temperature τ controls how "peaky" the softmax distribution is. Lower temperatures create harder distinctions between positives and negatives, leading to more discriminative embeddings but potentially unstable training.
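A sketch of InfoNCE with in-batch negatives in plain NumPy, assuming L2-normalized embeddings where document i is the positive for query i and all other rows in the batch serve as negatives:

```python
import numpy as np

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """InfoNCE with in-batch negatives.

    query_emb, doc_emb: (batch, dim) L2-normalized embeddings;
    doc_emb[i] is the positive for query_emb[i].
    """
    logits = query_emb @ doc_emb.T / temperature   # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # correct pairs sit on the diagonal
```

Lowering `temperature` sharpens the softmax, which is exactly the "peakiness" tradeoff described above.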
Hard Negatives
The quality of negatives dramatically affects embedding quality. Easy negatives (random documents) provide weak training signal. Hard negatives (documents that look similar but have different meaning) force the model to learn subtle distinctions.
Hard Negative Mining Strategies
- BM25 negatives: Retrieve lexically similar but semantically different documents
- In-batch negatives: Use other positives in the batch as negatives (memory efficient)
- Cross-encoder negatives: Use a cross-encoder to find challenging pairs
- Synthetic negatives: Generate near-miss negatives using LLMs
Training Data Sources
Modern embedding models are trained on diverse paired data: web-mined title-body and question-answer pairs, citation and hyperlink relationships, and increasingly LLM-generated synthetic pairs.
The Geometry of Embedding Space
Embedding spaces have rich geometric structure that emerges from training. Understanding this geometry helps explain both the power and limitations of embedding-based retrieval.
Semantic Clusters
Related concepts cluster together in embedding space. Documents about "machine learning" occupy a different region than documents about "Renaissance art." Within the ML cluster, subclusters form for specific topics: neural networks, decision trees, reinforcement learning.
Linear Relationships and Analogies
One of the most remarkable properties of embedding spaces is that semantic relationships often correspond to linear directions. The classic example from word embeddings:
For sentence embeddings, linear relationships are less clean but still present. The direction from "positive sentiment" to "negative sentiment" is roughly consistent across topics. This enables techniques like linear probing for classification.
Anisotropy: The Cone Problem
Embedding spaces often exhibit anisotropy—embeddings cluster in a narrow cone rather than uniformly filling the space. This means even unrelated texts can have surprisingly high cosine similarity (0.4-0.6 baseline instead of 0).
Implications of Anisotropy
- Similarity thresholds: A 0.7 cosine similarity might only indicate weak relatedness
- Relative ranking: Focus on relative similarity rather than absolute values
- Whitening: Transform embeddings to have unit covariance (can improve retrieval)
- Model choice: Some models are trained to reduce anisotropy explicitly
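Whitening, mentioned above, can be sketched in plain NumPy: center the embeddings, then apply an SVD-derived transform so the covariance becomes the identity. The epsilon guard is an implementation choice:

```python
import numpy as np

def whiten(embeddings: np.ndarray, eps: float = 1e-9):
    """Whitening transform: center, then rotate/scale to unit covariance."""
    mu = embeddings.mean(axis=0)
    cov = np.cov(embeddings - mu, rowvar=False)
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s + eps))   # maps the covariance to the identity
    return (embeddings - mu) @ w, mu, w
```

At query time, apply the same `mu` and `w` to the query embedding before computing similarity.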
Dimensionality and Information Density
Embedding dimensions represent a tradeoff between expressiveness, storage, and computational cost. Understanding this tradeoff is essential for system design.
Storage Costs
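To make the storage side of the tradeoff concrete, a back-of-the-envelope calculator (raw matrix size only; real indexes add graph and metadata overhead):

```python
def storage_gb(n_vectors: int, dims: int, bytes_per_dim: int = 4) -> float:
    """Raw storage for a flat embedding matrix (float32 = 4 bytes/dim)."""
    return n_vectors * dims * bytes_per_dim / 1e9

# 10M vectors at 768 dims in float32: ~30.7 GB.
# The same vectors quantized to int8 (1 byte/dim): ~7.7 GB.
```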
Information Density
Higher dimensions do not always mean better embeddings. What matters is information density—how much useful semantic information is packed into each dimension. A well-trained 384d model can outperform a poorly-trained 1536d model.
Research suggests that most embedding models have significant redundancy. The top 64-128 principal components often capture 90%+ of the variance. This observation motivates dimensionality reduction techniques and Matryoshka embeddings.
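One way to check this redundancy on your own embeddings, in plain NumPy (the 90%+ figure is data- and model-dependent):

```python
import numpy as np

def top_k_variance_ratio(embeddings: np.ndarray, k: int) -> float:
    """Fraction of total variance captured by the top-k principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)  # singular values
    var = s ** 2                                   # component variances (unnormalized)
    return var[:k].sum() / var.sum()
```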
Curse of Dimensionality
In very high dimensions, counterintuitive things happen. Points become approximately equidistant—the ratio of farthest to nearest neighbor approaches 1. This affects:
- Nearest neighbor search: Approximate methods become necessary
- Similarity thresholds: Absolute similarity values become less meaningful
- Clustering: Density-based clustering struggles in high dimensions
Distance Metrics Deep Dive
The choice of distance metric affects retrieval quality and computational cost. Understanding the mathematics helps you choose correctly.
Cosine Similarity
Cosine similarity measures the angle between vectors, ignoring magnitude. This is ideal for text because document length affects magnitude but not meaning. A long article and its one-sentence summary should have high similarity despite different magnitudes.
Euclidean (L2) Distance
Euclidean distance measures straight-line distance. It is sensitive to magnitude, which can be problematic for text. However, for normalized embeddings (unit length), Euclidean distance is monotonically related to cosine similarity—they produce identical rankings.
Dot Product (Inner Product)
Dot product is the fastest to compute—just element-wise multiplication and sum, no square roots. For normalized embeddings, dot product equals cosine similarity. This is why production systems often normalize embeddings at index time and use dot product at query time.
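The equivalences from the last two sections (dot product equals cosine for unit vectors, and squared L2 distance is 2 - 2*dot) can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # normalize to unit length

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot = a @ b
l2_sq = np.sum((a - b) ** 2)

assert np.isclose(dot, cosine)         # identical for unit vectors
assert np.isclose(l2_sq, 2 - 2 * dot)  # L2 is a monotonic function of dot
```

Because the mapping from dot product to L2 distance is monotonic, all three metrics produce identical rankings on normalized vectors.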
Manhattan (L1) Distance
Manhattan distance sums absolute differences. It is more robust to outliers than Euclidean distance because it does not square the differences. Less common for embeddings but useful in some specialized applications.
Metric Selection Guide
In practice: normalize embeddings at index time and use dot product (equivalent to cosine similarity and fastest to compute); use cosine directly when normalization cannot be guaranteed; reserve Euclidean and Manhattan distance for the specialized cases above.
Normalization and Why It Matters
Normalization transforms embeddings to unit length (L2 norm = 1). This seemingly simple operation has profound implications for retrieval systems.
Benefits of Normalization
Metric Equivalence
For normalized vectors, dot product = cosine similarity, and L2 distance is a monotonic transformation of both. You can use the fastest metric (dot product) without sacrificing accuracy.
Index Efficiency
Many vector indexes (HNSW, IVF) are optimized for specific metrics. Normalizing allows consistent use of inner product indexes across all your data.
Bounded Similarity
Normalized embeddings have similarity in [-1, 1], making threshold-based filtering predictable. Raw embeddings can have unbounded dot products.
Length Invariance
Document length affects raw embedding magnitude. Normalization removes this bias, ensuring a one-sentence query can match a long document about the same topic.
Quantization of Embeddings
Full-precision embeddings (float32) consume significant storage and memory. Quantization reduces precision to shrink memory footprint and accelerate similarity computation, with controllable quality tradeoffs.
Scalar Quantization (int8)
int8 quantization maps each float to one of 256 integer values. This typically degrades retrieval quality by 1-3% while providing 4x storage reduction. The quantization can be per-vector (better quality) or global (faster computation).
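A sketch of per-vector symmetric int8 quantization in NumPy (the 127 scale factor and the epsilon guard are implementation choices):

```python
import numpy as np

def quantize_int8(vec: np.ndarray):
    """Per-vector symmetric int8 quantization: one scale stored per vector."""
    scale = max(np.abs(vec).max() / 127.0, 1e-12)  # guard against zero vectors
    q = np.round(vec / scale).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

The round-trip error per dimension is bounded by half the scale, which is why quality loss stays small.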
Binary Quantization
Binary quantization is extreme—each dimension becomes a single bit. This enables hardware-accelerated Hamming distance using POPCNT instructions. Quality loss is substantial (10-20% typically), making it best suited for initial candidate retrieval followed by re-ranking with full-precision embeddings.
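A sketch of binary quantization and Hamming distance in NumPy. Here `np.unpackbits` stands in for the hardware POPCNT path a production system would use:

```python
import numpy as np

def binarize(vecs: np.ndarray) -> np.ndarray:
    """1 bit per dimension: the sign of each component, packed into uint8."""
    return np.packbits(vecs > 0, axis=-1)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two packed binary codes."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())
```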
Product Quantization (PQ)
Product quantization divides the vector into subspaces (e.g., 8 groups of 48 dimensions for a 384d vector), then quantizes each subspace independently to one of 256 centroids. This provides adjustable compression ratios and often outperforms scalar quantization.
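A sketch of PQ encoding and decoding, assuming the codebooks have already been trained (typically via k-means on each subspace); the subspace count and codebook size here are illustrative:

```python
import numpy as np

def pq_encode(vec: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """codebooks: (m, 256, sub_dim), one 256-centroid codebook per subspace.

    Returns m uint8 codes, one nearest-centroid index per subspace.
    """
    m, _, sub = codebooks.shape
    parts = vec.reshape(m, sub)
    codes = np.empty(m, dtype=np.uint8)
    for i in range(m):
        dists = np.linalg.norm(codebooks[i] - parts[i], axis=1)
        codes[i] = np.argmin(dists)
    return codes

def pq_decode(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Reconstruct an approximate vector by concatenating chosen centroids."""
    return np.concatenate([codebooks[i][c] for i, c in enumerate(codes)])
```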
Quantization Comparison
| Method | Compression | Quality Loss | Speed Gain |
|---|---|---|---|
| float32 | 1x (baseline) | 0% | 1x |
| float16 | 2x | <0.5% | 1.5-2x |
| int8 | 4x | 1-3% | 2-4x |
| PQ (8x48) | 8-16x | 2-5% | 3-6x |
| Binary | 32x | 10-20% | 10-30x |
Local Embedding Models Comparison
Several high-quality embedding models run efficiently on consumer hardware. Each has distinct characteristics suited to different use cases.
| Model | Dims | Max Tokens | Size | MTEB Avg |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 512 | ~80MB | 56.3 |
| bge-small-en-v1.5 | 384 | 512 | ~130MB | 62.2 |
| bge-base-en-v1.5 | 768 | 512 | ~440MB | 64.2 |
| bge-large-en-v1.5 | 1024 | 512 | ~1.3GB | 64.6 |
| nomic-embed-text-v1.5 | 768 | 8192 | ~550MB | 62.3 |
| e5-small-v2 | 384 | 512 | ~130MB | 59.9 |
| e5-base-v2 | 768 | 512 | ~440MB | 61.5 |
| e5-large-v2 | 1024 | 512 | ~1.3GB | 62.3 |
| gte-small | 384 | 512 | ~70MB | 61.4 |
| gte-base | 768 | 512 | ~220MB | 63.1 |
| gte-large | 1024 | 512 | ~670MB | 65.4 |
| jina-embeddings-v2-base-en | 768 | 8192 | ~550MB | 60.4 |
| mxbai-embed-large-v1 | 1024 | 512 | ~670MB | 64.7 |
Model Recommendations by Use Case
Maximum Speed / Minimum Resources
all-MiniLM-L6-v2 or gte-small. Sub-100MB models that run well even on CPUs. Good for edge devices, real-time applications, or when embedding is not the bottleneck.
Best Quality / General Purpose
bge-large-en-v1.5 or gte-large. Top MTEB scores among open models. Require ~1GB RAM and benefit from GPU acceleration.
Long Documents
nomic-embed-text-v1.5 or jina-embeddings-v2. 8192 token context allows embedding entire documents without chunking. Essential for applications where chunk boundaries would break semantics.
Quality/Speed Balance
bge-base-en-v1.5 or gte-base. 768 dimensions with good MTEB scores. Half the size of large models with 90%+ of the quality.
Query Prefixes
Some models require specific prefixes for queries vs documents. This is a training detail that improves asymmetric retrieval (short query, long document).
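For example, the e5 family expects "query: " and "passage: " prefixes, while bge English models prefix queries only; always verify against the specific model card. A sketch:

```python
# Prefix conventions differ per model family; check the model card.
E5_QUERY, E5_PASSAGE = "query: ", "passage: "
BGE_QUERY = "Represent this sentence for searching relevant passages: "

def prepare_e5(texts, is_query):
    """e5 models prefix both sides of the retrieval pair."""
    prefix = E5_QUERY if is_query else E5_PASSAGE
    return [prefix + t for t in texts]

def prepare_bge_query(query):
    """bge-*-en models prefix queries only; documents are embedded as-is."""
    return BGE_QUERY + query
```

Omitting a required prefix silently degrades retrieval quality, so bake prefix handling into your embedding pipeline rather than leaving it to callers.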
Benchmarks Explained: MTEB and BEIR
Understanding benchmarks helps you interpret model quality claims and choose models appropriate for your use case.
MTEB (Massive Text Embedding Benchmark)
MTEB evaluates embeddings across 58 datasets spanning 8 task categories. The overall score is an average across all tasks, which may not reflect performance on your specific use case.
MTEB Task Categories
- Retrieval: Find relevant documents for a query (most relevant for RAG)
- Classification: Assign labels to text
- Clustering: Group similar documents
- Pair Classification: Determine if two texts are similar
- Reranking: Order documents by relevance
- STS: Semantic textual similarity scoring
- Summarization: Evaluate summary quality
- BitextMining: Find translation pairs
For RAG applications, focus on the Retrieval and Reranking scores rather than the overall average. A model optimized for classification may underperform on retrieval despite a high average MTEB score.
BEIR (Benchmarking Information Retrieval)
BEIR focuses specifically on zero-shot retrieval across 18 diverse datasets. It is the standard benchmark for evaluating retrieval models on out-of-domain data.
Interpreting Benchmark Scores
Key Metrics
- nDCG@10: Normalized Discounted Cumulative Gain at rank 10. Measures ranking quality with position weighting. Higher is better. Range 0-1.
- Recall@k: Fraction of relevant documents found in top k results. Critical for RAG where you retrieve top-k chunks for the LLM.
- MRR: Mean Reciprocal Rank. Average of 1/rank of first relevant result. Emphasizes finding at least one good result quickly.
- MAP: Mean Average Precision. Considers precision at each relevant document. Good for understanding overall ranking quality.
Benchmark limitations to keep in mind:
- Scores reflect average performance across datasets; your domain may differ significantly
- Short queries dominate benchmarks; long-form queries may behave differently
- Benchmarks test zero-shot; fine-tuned models can dramatically outperform
- English benchmarks dominate; multilingual performance varies
- Dataset contamination can inflate scores (training on test data)
Multilingual and Cross-lingual Embeddings
Multilingual embedding models map text from multiple languages into a shared vector space. This enables powerful cross-lingual capabilities without machine translation.
Cross-lingual Retrieval
In a well-aligned multilingual embedding space, a query in one language can retrieve relevant documents in any other language, with no translation step.
Notable Multilingual Models
| Model | Languages | Dims | Notes |
|---|---|---|---|
| multilingual-e5-large | 100+ | 1024 | Best multilingual quality |
| bge-m3 | 100+ | 1024 | Dense + sparse + multi-vector |
| paraphrase-multilingual-mpnet | 50+ | 768 | Efficient, sentence-focused |
| LaBSE | 109 | 768 | Bitext mining focused |
Challenges with Multilingual Embeddings
- Capacity dilution: The same model size must represent 100+ languages instead of one. Per-language quality is lower than monolingual models.
- Low-resource languages: Quality varies dramatically. English, Chinese, German are well-represented; many languages have poor embeddings.
- Script differences: Languages with unique scripts (Thai, Arabic, Korean) may cluster differently than Latin-script languages.
- Domain shift: Training data is often web-crawled, underrepresenting specialized domains in non-English languages.
Matryoshka Embeddings: Adaptive Dimensions
Matryoshka Representation Learning (MRL) trains embeddings where truncated prefixes remain useful. You can use the first 64, 128, 256, or all dimensions depending on your quality/efficiency requirements.
How MRL Training Works
During training, the loss function is computed at multiple dimension checkpoints. This forces the model to encode the most important information in the earlier dimensions.
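A sketch of the idea: the same contrastive loss evaluated at several prefix lengths, with each prefix re-normalized so early dimensions must stand on their own. The dimension checkpoints and temperature are illustrative:

```python
import numpy as np

def matryoshka_loss(q, d, dims=(64, 128, 256, 768), temperature=0.05):
    """Average a contrastive loss over truncated prefixes of the embeddings.

    q, d: (batch, dim) raw embeddings; d[i] is the positive for q[i].
    """
    total = 0.0
    for k in dims:
        qk = q[:, :k] / np.linalg.norm(q[:, :k], axis=1, keepdims=True)
        dk = d[:, :k] / np.linalg.norm(d[:, :k], axis=1, keepdims=True)
        logits = qk @ dk.T / temperature
        logits -= logits.max(axis=1, keepdims=True)       # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        total += -np.mean(np.diag(log_probs))             # InfoNCE per checkpoint
    return total / len(dims)
```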
Use Cases for Adaptive Dimensions
Tiered Retrieval
Use 64-128 dimensions for fast initial candidate retrieval, then re-rank top-k with full embeddings. Dramatic speedup with minimal recall loss.
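A sketch of the two-stage search described above (the prefix length and candidate counts are illustrative):

```python
import numpy as np

def tiered_search(query, docs, k_coarse=100, k_final=10, prefix=128):
    """Stage 1: rank by truncated, re-normalized prefix dimensions.
    Stage 2: re-rank the survivors with the full embedding.

    query: (dim,), docs: (n, dim). Returns indices of the top results.
    """
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    coarse = norm(docs[:, :prefix]) @ norm(query[:prefix])  # cheap pass
    candidates = np.argsort(-coarse)[:k_coarse]
    full = norm(docs[candidates]) @ norm(query)             # full-precision pass
    return candidates[np.argsort(-full)[:k_final]]
```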
Adaptive Storage
Store short dimensions for older/less-accessed documents, full dimensions for frequently queried content. Balance storage cost with quality.
Edge Deployment
Use minimal dimensions on edge devices for local search, full dimensions on server for comprehensive queries. Same model, different tradeoffs.
Matryoshka-Enabled Models
Not all models support dimension truncation. These models are trained with MRL:
- nomic-embed-text-v1.5: Supports 64, 128, 256, 512, 768
- mxbai-embed-large-v1: Full MRL support
- snowflake-arctic-embed: Designed for adaptive dimensions
- jina-embeddings-v3: Flexible dimension truncation
Fine-tuning Embeddings for Domain-Specific Use
Off-the-shelf embedding models are trained on general web data. For specialized domains (legal, medical, scientific), fine-tuning on domain data can yield significant improvements.
When to Fine-tune
- Domain vocabulary: Specialized terminology that general models misunderstand (legal terms, chemical compounds, medical abbreviations)
- Different similarity notion: What counts as "similar" differs from web text (code similarity, legal precedent matching)
- Quality plateau: Exhausted other optimizations (chunking, prompting, retrieval strategy) but still underperforming
- Sufficient data: Have thousands of query-document pairs or can generate them synthetically
Fine-tuning Approaches
Contrastive Fine-tuning
Continue training with domain-specific positive pairs. The standard approach using the same InfoNCE loss as original training, but on your data.
```python
sentence_transformers.losses.MultipleNegativesRankingLoss
```
Synthetic Data Generation
Use an LLM to generate queries for your documents. Given a document, prompt the LLM: "Generate 5 questions that this document would answer." Creates training pairs at scale.
```
(generated_query, original_document) pairs
```
Hard Negative Mining
Use BM25 or the base model to find challenging negatives. Fine-tune on triplets: (query, positive, hard_negative) with margin loss.
```python
sentence_transformers.losses.TripletLoss
```
Distillation from Cross-encoder
Train a cross-encoder on your data (slow but accurate), then distill its knowledge into the bi-encoder embedding model. Best quality but most complex.
```python
sentence_transformers.losses.MarginMSELoss
```
Fine-tuning Best Practices
Common pitfalls to avoid:
- Training too long: embedding models overfit quickly; 1-3 epochs is usually sufficient
- Learning rate too high: destroys pretrained knowledge; start at 1e-5 to 2e-5
- Batch size too small: fewer in-batch negatives, weaker training signal
- No evaluation set: impossible to detect overfitting without held-out data
- Ignoring prefixes: if the base model uses prefixes, maintain them during fine-tuning
Batch Processing and Optimization
Embedding large document collections requires careful optimization. The difference between naive and optimized approaches can be 10-100x in throughput.
Batching Strategy
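One common strategy is length-sorted batching: sort documents by length so each batch pads to a similar length, then restore the original order after embedding. A sketch, where `length_fn` and `embed_batch` are stand-ins for a real token counter and model:

```python
def make_batches(texts, batch_size=64, length_fn=len):
    """Return batches of original indices, sorted by text length."""
    order = sorted(range(len(texts)), key=lambda i: length_fn(texts[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def embed_all(texts, embed_batch, batch_size=64):
    """embed_batch: callable mapping a list of texts to a list of vectors."""
    results = [None] * len(texts)
    for batch in make_batches(texts, batch_size):
        vecs = embed_batch([texts[i] for i in batch])
        for i, v in zip(batch, vecs):
            results[i] = v                 # scatter back to original positions
    return results
```

With uniform batches, padding waste scales with the longest document in the whole collection; with length-sorted batches it scales with the longest document per batch.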
GPU Optimization
Mixed Precision (FP16)
Use half-precision inference for 2x memory reduction and faster computation. Quality impact is negligible for most models.
```python
model.half()  # Convert to FP16
```
Multi-GPU Parallelism
Distribute batches across multiple GPUs for linear speedup. Use data parallelism for embedding (model on each GPU, split data).
```python
model = SentenceTransformer(..., device="cuda")
model.start_multi_process_pool()  # sentence-transformers
```
ONNX Runtime
Convert models to ONNX for optimized inference. Can provide 2-3x speedup on CPU and additional gains on GPU through graph optimization.
```shell
optimum-cli export onnx --model BAAI/bge-base-en-v1.5 ./onnx/
```
CPU Optimization
For CPU-only deployment, additional optimizations such as ONNX export and int8 quantization are critical.
Streaming and Incremental Processing
For real-time systems, process documents as they arrive rather than in large batches.
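A minimal sketch of grouping an incoming stream into micro-batches (a production system would also flush on a time deadline so a lone document is not delayed indefinitely):

```python
def micro_batches(stream, max_batch=32):
    """Group an incoming iterator of texts into small batches so the
    embedder is invoked promptly instead of waiting for a large batch."""
    batch = []
    for text in stream:
        batch.append(text)
        if len(batch) >= max_batch:
            yield batch
            batch = []
    if batch:          # flush the final partial batch
        yield batch
```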
Caching Strategies
Embedding computation is expensive. Effective caching avoids redundant work and dramatically improves system performance.
Content-based Caching
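A sketch of content-based caching: key on a hash of the text, model name, and a preprocessing version so a change to any of them invalidates the entry. The in-memory dict is a stand-in for a persistent store such as SQLite or LMDB:

```python
import hashlib
import json

def cache_key(text: str, model_name: str, preprocess_version: str = "v1") -> str:
    """Content + model + preprocessing version; changing any regenerates."""
    payload = json.dumps([model_name, preprocess_version, text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class EmbeddingCache:
    def __init__(self, embed_fn, model_name):
        self._store = {}          # swap for SQLite/LMDB in production
        self._embed = embed_fn
        self._model = model_name

    def get(self, text):
        key = cache_key(text, self._model)
        if key not in self._store:
            self._store[key] = self._embed(text)  # compute only on miss
        return self._store[key]
```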
Persistent Embedding Store
For large-scale systems, persist embeddings to disk with proper indexing.
Storage Options
- SQLite + numpy blobs: Simple, single-file, good for <10M embeddings
- LMDB: Fast key-value store, memory-mapped, excellent for read-heavy workloads
- Parquet files: Columnar format, good for analytics and batch operations
- Vector databases: ChromaDB, Qdrant, Milvus—integrated storage and search
Cache Invalidation
Embeddings must be regenerated when:
- Source document changes (content hash changes)
- Embedding model changes (model version tracking)
- Chunking strategy changes (preprocessing version tracking)
- Model fine-tuning occurs (new model checkpoint)
Query Embedding Caching
For frequently repeated queries, cache query embeddings separately from document embeddings:
- LRU cache: Keep recent query embeddings in memory. Use functools.lru_cache or custom implementation.
- Query normalization: Lowercase, strip whitespace, canonicalize before hashing to improve cache hit rate.
- Semantic deduplication: Cluster similar queries, use a single representative embedding. Advanced but effective for high-volume systems.
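The first two points above can be sketched with `functools.lru_cache` plus query normalization (the function names and cache size here are illustrative):

```python
from functools import lru_cache

def normalize_query(q: str) -> str:
    """Canonicalize before caching to improve hit rate."""
    return " ".join(q.lower().split())

def make_cached_embedder(embed_fn, maxsize=4096):
    """embed_fn: str -> tuple of floats (hashable, so lru_cache can hold it)."""
    @lru_cache(maxsize=maxsize)
    def cached(q: str):
        return embed_fn(q)
    return lambda q: cached(normalize_query(q))
```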
Summary and Recommendations
Embeddings are the foundation of semantic retrieval. Key decisions for your system:
Decision Framework
- Start with a proven base model: bge-base-en-v1.5 or gte-base provide good quality/speed balance. Upgrade to large variants only if quality is insufficient.
- Normalize embeddings at index time: Enables fast dot product search equivalent to cosine similarity.
- Implement proper caching: Content-hash-based caching with model version tracking prevents redundant computation.
- Measure on your data: Benchmark scores are indicative but not definitive. Create an evaluation set from your domain.
- Consider fine-tuning last: Exhaust chunking, prompting, and retrieval strategy optimizations before fine-tuning.
For local-first systems like those built with OnyxLab, embedding choice directly impacts user experience. A well-chosen embedding model running locally provides instant, private semantic search without network dependencies.