Embeddings
Embeddings are the mathematical foundation of semantic search. They transform text into vectors where meaning becomes measurable—similar concepts cluster together in high-dimensional space. Understanding how embeddings work at a deep level is essential for building effective retrieval systems.
How Embedding Models Work
Modern embedding models are built on transformer encoder architectures. Unlike decoder-only models (GPT, Llama) that generate text autoregressively, encoder models process the entire input simultaneously and produce contextual representations for each token.
Transformer Encoder Architecture
The transformer encoder consists of stacked self-attention layers. Each layer allows every token to attend to every other token, building increasingly abstract representations.
The self-attention mechanism computes attention scores between all token pairs, allowing the model to understand context. For a sequence of n tokens, this requires O(n²) operations, which is why most embedding models have sequence length limits (512-8192 tokens).
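A minimal single-head sketch of this computation in NumPy. The projection matrices and shapes here are illustrative, not tied to any particular model; note that the (n, n) score matrix is exactly where the O(n²) cost comes from.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention.

    x: (n, d) token embeddings; wq/wk/wv: (d, d) projection matrices.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])        # (n, n) pairwise scores: the O(n^2) term
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True) # softmax over each row
    return weights @ v                            # (n, d) contextualized token vectors
```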
Pooling Strategies
After the transformer produces per-token representations, we need to collapse them into a single vector. The pooling strategy significantly affects embedding quality:
[CLS] Token Pooling
Uses the final hidden state of the special [CLS] token as the sentence embedding. This token is trained to aggregate information from the entire sequence. Simple but effective for models trained with this objective.
```python
embedding = hidden_states[0]  # [CLS] is first token
```
Mean Pooling
Averages all token embeddings (excluding padding). Often outperforms [CLS] pooling because it incorporates information from every position equally. Most common in modern embedding models.
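A mask-aware mean pooling sketch in plain NumPy. The array shapes are assumed (per-token hidden states and a 0/1 attention mask), not tied to a specific model:

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions.

    hidden_states: (seq_len, dim) per-token encoder outputs.
    attention_mask: (seq_len,) with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, None].astype(hidden_states.dtype)  # (seq_len, 1)
    summed = (hidden_states * mask).sum(axis=0)                 # zero out padding
    count = mask.sum()                                          # number of real tokens
    return summed / np.maximum(count, 1e-9)
```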
```python
embedding = sum(hidden_states * attention_mask) / sum(attention_mask)
```
Max Pooling
Takes the maximum value across tokens for each dimension. Captures the strongest signal for each feature. Less common but useful for certain retrieval tasks.
```python
embedding = max(hidden_states, dim=0)
```
Last Token Pooling
Uses the final non-padding token. Common in decoder-based embeddings (e5-mistral, GritLM) where the last token aggregates bidirectional context through causal attention modifications.
```python
embedding = hidden_states[last_non_pad_idx]
```
Contrastive Learning: The Training Objective
Embedding models are not trained to predict tokens. They are trained with contrastive learning objectives that teach the model to produce similar vectors for semantically similar text and dissimilar vectors for unrelated text.
InfoNCE Loss
The most common training objective is InfoNCE, a noise-contrastive estimation loss that treats embedding as a classification problem: given a query, identify the correct positive among a set of negatives.
The temperature τ controls how "peaky" the softmax distribution is. Lower temperatures create harder distinctions between positives and negatives, leading to more discriminative embeddings but potentially unstable training.
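A sketch of InfoNCE with in-batch negatives in plain NumPy, assuming L2-normalized embeddings where document i is the positive for query i and all other rows in the batch serve as negatives:

```python
import numpy as np

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """InfoNCE with in-batch negatives.

    query_emb, doc_emb: (batch, dim) L2-normalized embeddings;
    doc_emb[i] is the positive for query_emb[i].
    """
    logits = query_emb @ doc_emb.T / temperature   # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # correct pairs sit on the diagonal
```

Lowering `temperature` sharpens the softmax, which is exactly the "peakiness" tradeoff described above.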
Hard Negatives
The quality of negatives dramatically affects embedding quality. Easy negatives (random documents) provide weak training signal. Hard negatives (documents that look similar but have different meaning) force the model to learn subtle distinctions.
Hard Negative Mining Strategies
- BM25 negatives: Retrieve lexically similar but semantically different documents
- In-batch negatives: Use other positives in the batch as negatives (memory efficient)
- Cross-encoder negatives: Use a cross-encoder to find challenging pairs
- Synthetic negatives: Generate near-miss negatives using LLMs
Training Data Sources
Modern embedding models are trained on diverse paired data: web-mined title-body and question-answer pairs, citation and hyperlink relationships, and increasingly LLM-generated synthetic pairs.
The Geometry of Embedding Space
Embedding spaces have rich geometric structure that emerges from training. Understanding this geometry helps explain both the power and limitations of embedding-based retrieval.
Semantic Clusters
Related concepts cluster together in embedding space. Documents about "machine learning" occupy a different region than documents about "Renaissance art." Within the ML cluster, subclusters form for specific topics: neural networks, decision trees, reinforcement learning.
Linear Relationships and Analogies
One of the most remarkable properties of embedding spaces is that semantic relationships often correspond to linear directions. The classic example from word embeddings:
For sentence embeddings, linear relationships are less clean but still present. The direction from "positive sentiment" to "negative sentiment" is roughly consistent across topics. This enables techniques like linear probing for classification.
Anisotropy: The Cone Problem
Embedding spaces often exhibit anisotropy—embeddings cluster in a narrow cone rather than uniformly filling the space. This means even unrelated texts can have surprisingly high cosine similarity (0.4-0.6 baseline instead of 0).
Implications of Anisotropy
- Similarity thresholds: A 0.7 cosine similarity might only indicate weak relatedness
- Relative ranking: Focus on relative similarity rather than absolute values
- Whitening: Transform embeddings to have unit covariance (can improve retrieval)
- Model choice: Some models are trained to reduce anisotropy explicitly
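Whitening, mentioned above, can be sketched in plain NumPy: center the embeddings, then apply an SVD-derived transform so the covariance becomes the identity. The epsilon guard is an implementation choice:

```python
import numpy as np

def whiten(embeddings: np.ndarray, eps: float = 1e-9):
    """Whitening transform: center, then rotate/scale to unit covariance."""
    mu = embeddings.mean(axis=0)
    cov = np.cov(embeddings - mu, rowvar=False)
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s + eps))   # maps the covariance to the identity
    return (embeddings - mu) @ w, mu, w
```

At query time, apply the same `mu` and `w` to the query embedding before computing similarity.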
Dimensionality and Information Density
Embedding dimensions represent a tradeoff between expressiveness, storage, and computational cost. Understanding this tradeoff is essential for system design.
Storage Costs
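To make the storage side of the tradeoff concrete, a back-of-the-envelope calculator (raw matrix size only; real indexes add graph and metadata overhead):

```python
def storage_gb(n_vectors: int, dims: int, bytes_per_dim: int = 4) -> float:
    """Raw storage for a flat embedding matrix (float32 = 4 bytes/dim)."""
    return n_vectors * dims * bytes_per_dim / 1e9

# 10M vectors at 768 dims in float32: ~30.7 GB.
# The same vectors quantized to int8 (1 byte/dim): ~7.7 GB.
```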
Information Density
Higher dimensions do not always mean better embeddings. What matters is information density—how much useful semantic information is packed into each dimension. A well-trained 384d model can outperform a poorly-trained 1536d model.
Research suggests that most embedding models have significant redundancy. The top 64-128 principal components often capture 90%+ of the variance. This observation motivates dimensionality reduction techniques and Matryoshka embeddings.
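One way to check this redundancy on your own embeddings, in plain NumPy (the 90%+ figure is data- and model-dependent):

```python
import numpy as np

def top_k_variance_ratio(embeddings: np.ndarray, k: int) -> float:
    """Fraction of total variance captured by the top-k principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)  # singular values
    var = s ** 2                                   # component variances (unnormalized)
    return var[:k].sum() / var.sum()
```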
Curse of Dimensionality
In very high dimensions, counterintuitive things happen. Points become approximately equidistant—the ratio of farthest to nearest neighbor approaches 1. This affects:
- Nearest neighbor search: Approximate methods become necessary
- Similarity thresholds: Absolute similarity values become less meaningful
- Clustering: Density-based clustering struggles in high dimensions
Distance Metrics Deep Dive
The choice of distance metric affects retrieval quality and computational cost. Understanding the mathematics helps you choose correctly.
Cosine Similarity
Cosine similarity measures the angle between vectors, ignoring magnitude. This is ideal for text because document length affects magnitude but not meaning. A long article and its one-sentence summary should have high similarity despite different magnitudes.
Euclidean (L2) Distance
Euclidean distance measures straight-line distance. It is sensitive to magnitude, which can be problematic for text. However, for normalized embeddings (unit length), Euclidean distance is monotonically related to cosine similarity—they produce identical rankings.
Dot Product (Inner Product)
Dot product is the fastest to compute—just element-wise multiplication and sum, no square roots. For normalized embeddings, dot product equals cosine similarity. This is why production systems often normalize embeddings at index time and use dot product at query time.
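The equivalences from the last two sections (dot product equals cosine for unit vectors, and squared L2 distance is 2 - 2*dot) can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # normalize to unit length

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot = a @ b
l2_sq = np.sum((a - b) ** 2)

assert np.isclose(dot, cosine)         # identical for unit vectors
assert np.isclose(l2_sq, 2 - 2 * dot)  # L2 is a monotonic function of dot
```

Because the mapping from dot product to L2 distance is monotonic, all three metrics produce identical rankings on normalized vectors.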
Manhattan (L1) Distance
Manhattan distance sums absolute differences. It is more robust to outliers than Euclidean distance because it does not square the differences. Less common for embeddings but useful in some specialized applications.
Metric Selection Guide
In practice: normalize embeddings at index time and use dot product (equivalent to cosine similarity and fastest to compute); use cosine directly when normalization cannot be guaranteed; reserve Euclidean and Manhattan distance for the specialized cases above.
Normalization and Why It Matters
Normalization transforms embeddings to unit length (L2 norm = 1). This seemingly simple operation has profound implications for retrieval systems.
Benefits of Normalization
Metric Equivalence
For normalized vectors, dot product = cosine similarity, and L2 distance is a monotonic transformation of both. You can use the fastest metric (dot product) without sacrificing accuracy.
Index Efficiency
Many vector indexes (HNSW, IVF) are optimized for specific metrics. Normalizing allows consistent use of inner product indexes across all your data.
Bounded Similarity
Normalized embeddings have similarity in [-1, 1], making threshold-based filtering predictable. Raw embeddings can have unbounded dot products.
Length Invariance
Document length affects raw embedding magnitude. Normalization removes this bias, ensuring a one-sentence query can match a long document about the same topic.
Quantization of Embeddings
Full-precision embeddings (float32) consume significant storage and memory. Quantization reduces precision to shrink memory footprint and accelerate similarity computation, with controllable quality tradeoffs.
Scalar Quantization (int8)
int8 quantization maps each float to one of 256 integer values. This typically degrades retrieval quality by 1-3% while providing 4x storage reduction. The quantization can be per-vector (better quality) or global (faster computation).
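A sketch of per-vector symmetric int8 quantization in NumPy (the 127 scale factor and the epsilon guard are implementation choices):

```python
import numpy as np

def quantize_int8(vec: np.ndarray):
    """Per-vector symmetric int8 quantization: one scale stored per vector."""
    scale = max(np.abs(vec).max() / 127.0, 1e-12)  # guard against zero vectors
    q = np.round(vec / scale).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

The round-trip error per dimension is bounded by half the scale, which is why quality loss stays small.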
Binary Quantization
Binary quantization is extreme—each dimension becomes a single bit. This enables hardware-accelerated Hamming distance using POPCNT instructions. Quality loss is substantial (10-20% typically), making it best suited for initial candidate retrieval followed by re-ranking with full-precision embeddings.
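A sketch of binary quantization and Hamming distance in NumPy. Here `np.unpackbits` stands in for the hardware POPCNT path a production system would use:

```python
import numpy as np

def binarize(vecs: np.ndarray) -> np.ndarray:
    """1 bit per dimension: the sign of each component, packed into uint8."""
    return np.packbits(vecs > 0, axis=-1)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two packed binary codes."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())
```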
Product Quantization (PQ)
Product quantization divides the vector into subspaces (e.g., 8 groups of 48 dimensions for a 384d vector), then quantizes each subspace independently to one of 256 centroids. This provides adjustable compression ratios and often outperforms scalar quantization.
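A sketch of PQ encoding and decoding, assuming the codebooks have already been trained (typically via k-means on each subspace); the subspace count and codebook size here are illustrative:

```python
import numpy as np

def pq_encode(vec: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """codebooks: (m, 256, sub_dim), one 256-centroid codebook per subspace.

    Returns m uint8 codes, one nearest-centroid index per subspace.
    """
    m, _, sub = codebooks.shape
    parts = vec.reshape(m, sub)
    codes = np.empty(m, dtype=np.uint8)
    for i in range(m):
        dists = np.linalg.norm(codebooks[i] - parts[i], axis=1)
        codes[i] = np.argmin(dists)
    return codes

def pq_decode(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Reconstruct an approximate vector by concatenating chosen centroids."""
    return np.concatenate([codebooks[i][c] for i, c in enumerate(codes)])
```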
Quantization Comparison
| Method | Compression | Quality Loss | Speed Gain |
|---|---|---|---|
| float32 | 1x (baseline) | 0% | 1x |
| float16 | 2x | <0.5% | 1.5-2x |
| int8 | 4x | 1-3% | 2-4x |
| PQ (8x48) | 8-16x | 2-5% | 3-6x |
| Binary | 32x | 10-20% | 10-30x |
Local Embedding Models Comparison
Several high-quality embedding models run efficiently on consumer hardware. Each has distinct characteristics suited to different use cases.
| Model | Dims | Max Tokens | Size | MTEB Avg |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 512 | ~80MB | 56.3 |
| bge-small-en-v1.5 | 384 | 512 | ~130MB | 62.2 |
| bge-base-en-v1.5 | 768 | 512 | ~440MB | 64.2 |
| bge-large-en-v1.5 | 1024 | 512 | ~1.3GB | 64.6 |
| nomic-embed-text-v1.5 | 768 | 8192 | ~550MB | 62.3 |
| e5-small-v2 | 384 | 512 | ~130MB | 59.9 |
| e5-base-v2 | 768 | 512 | ~440MB | 61.5 |
| e5-large-v2 | 1024 | 512 | ~1.3GB | 62.3 |
| gte-small | 384 | 512 | ~70MB | 61.4 |
| gte-base | 768 | 512 | ~220MB | 63.1 |
| gte-large | 1024 | 512 | ~670MB | 65.4 |
| jina-embeddings-v2-base-en | 768 | 8192 | ~550MB | 60.4 |
| mxbai-embed-large-v1 | 1024 | 512 | ~670MB | 64.7 |
Model Recommendations by Use Case
Maximum Speed / Minimum Resources
all-MiniLM-L6-v2 or gte-small. Sub-100MB models that run well even on CPUs. Good for edge devices, real-time applications, or when embedding is not the bottleneck.
Best Quality / General Purpose
bge-large-en-v1.5 or gte-large. Top MTEB scores among open models. Require ~1GB RAM and benefit from GPU acceleration.
Long Documents
nomic-embed-text-v1.5 or jina-embeddings-v2. 8192 token context allows embedding entire documents without chunking. Essential for applications where chunk boundaries would break semantics.
Quality/Speed Balance
bge-base-en-v1.5 or gte-base. 768 dimensions with good MTEB scores. Half the size of large models with 90%+ of the quality.
Query Prefixes
Some models require specific prefixes for queries vs documents. This is a training detail that improves asymmetric retrieval (short query, long document).
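For example, the e5 family expects "query: " and "passage: " prefixes, while bge English models prefix queries only; always verify against the specific model card. A sketch:

```python
# Prefix conventions differ per model family; check the model card.
E5_QUERY, E5_PASSAGE = "query: ", "passage: "
BGE_QUERY = "Represent this sentence for searching relevant passages: "

def prepare_e5(texts, is_query):
    """e5 models prefix both sides of the retrieval pair."""
    prefix = E5_QUERY if is_query else E5_PASSAGE
    return [prefix + t for t in texts]

def prepare_bge_query(query):
    """bge-*-en models prefix queries only; documents are embedded as-is."""
    return BGE_QUERY + query
```

Omitting a required prefix silently degrades retrieval quality, so bake prefix handling into your embedding pipeline rather than leaving it to callers.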
Benchmarks Explained: MTEB and BEIR
Understanding benchmarks helps you interpret model quality claims and choose models appropriate for your use case.
MTEB (Massive Text Embedding Benchmark)
MTEB evaluates embeddings across 58 datasets spanning 8 task categories. The overall score is an average across all tasks, which may not reflect performance on your specific use case.
MTEB Task Categories
- Retrieval: Find relevant documents for a query (most relevant for RAG)
- Classification: Assign labels to text
- Clustering: Group similar documents
- Pair Classification: Determine if two texts are similar
- Reranking: Order documents by relevance
- STS: Semantic textual similarity scoring
- Summarization: Evaluate summary quality
- BitextMining: Find translation pairs
For RAG applications, focus on the Retrieval and Reranking scores rather than the overall average. A model optimized for classification may underperform on retrieval despite a high average MTEB score.
BEIR (Benchmarking Information Retrieval)
BEIR focuses specifically on zero-shot retrieval across 18 diverse datasets. It is the standard benchmark for evaluating retrieval models on out-of-domain data.
Interpreting Benchmark Scores
Key Metrics
- nDCG@10: Normalized Discounted Cumulative Gain at rank 10. Measures ranking quality with position weighting. Higher is better. Range 0-1.
- Recall@k: Fraction of relevant documents found in top k results. Critical for RAG where you retrieve top-k chunks for the LLM.
- MRR: Mean Reciprocal Rank. Average of 1/rank of first relevant result. Emphasizes finding at least one good result quickly.
- MAP: Mean Average Precision. Considers precision at each relevant document. Good for understanding overall ranking quality.
Benchmark limitations to keep in mind:
- Scores reflect average performance across datasets; your domain may differ significantly
- Short queries dominate benchmarks; long-form queries may behave differently
- Benchmarks test zero-shot; fine-tuned models can dramatically outperform
- English benchmarks dominate; multilingual performance varies
- Dataset contamination can inflate scores (training on test data)
Multilingual and Cross-lingual Embeddings
Multilingual embedding models map text from multiple languages into a shared vector space. This enables powerful cross-lingual capabilities without machine translation.
Cross-lingual Retrieval
In a well-aligned multilingual embedding space, a query in one language can retrieve relevant documents in any other language, with no translation step.
Notable Multilingual Models
| Model | Languages | Dims | Notes |
|---|---|---|---|
| multilingual-e5-large | 100+ | 1024 | Best multilingual quality |
| bge-m3 | 100+ | 1024 | Dense + sparse + multi-vector |
| paraphrase-multilingual-mpnet | 50+ | 768 | Efficient, sentence-focused |
| LaBSE | 109 | 768 | Bitext mining focused |
Challenges with Multilingual Embeddings
- Capacity dilution: The same model size must represent 100+ languages instead of one. Per-language quality is lower than monolingual models.
- Low-resource languages: Quality varies dramatically. English, Chinese, German are well-represented; many languages have poor embeddings.
- Script differences: Languages with unique scripts (Thai, Arabic, Korean) may cluster differently than Latin-script languages.
- Domain shift: Training data is often web-crawled, underrepresenting specialized domains in non-English languages.
Matryoshka Embeddings: Adaptive Dimensions
Matryoshka Representation Learning (MRL) trains embeddings where truncated prefixes remain useful. You can use the first 64, 128, 256, or all dimensions depending on your quality/efficiency requirements.
How MRL Training Works
During training, the loss function is computed at multiple dimension checkpoints. This forces the model to encode the most important information in the earlier dimensions.
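A sketch of the idea: the same contrastive loss evaluated at several prefix lengths, with each prefix re-normalized so early dimensions must stand on their own. The dimension checkpoints and temperature are illustrative:

```python
import numpy as np

def matryoshka_loss(q, d, dims=(64, 128, 256, 768), temperature=0.05):
    """Average a contrastive loss over truncated prefixes of the embeddings.

    q, d: (batch, dim) raw embeddings; d[i] is the positive for q[i].
    """
    total = 0.0
    for k in dims:
        qk = q[:, :k] / np.linalg.norm(q[:, :k], axis=1, keepdims=True)
        dk = d[:, :k] / np.linalg.norm(d[:, :k], axis=1, keepdims=True)
        logits = qk @ dk.T / temperature
        logits -= logits.max(axis=1, keepdims=True)       # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        total += -np.mean(np.diag(log_probs))             # InfoNCE per checkpoint
    return total / len(dims)
```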
Use Cases for Adaptive Dimensions
Tiered Retrieval
Use 64-128 dimensions for fast initial candidate retrieval, then re-rank top-k with full embeddings. Dramatic speedup with minimal recall loss.
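A sketch of the two-stage search described above (the prefix length and candidate counts are illustrative):

```python
import numpy as np

def tiered_search(query, docs, k_coarse=100, k_final=10, prefix=128):
    """Stage 1: rank by truncated, re-normalized prefix dimensions.
    Stage 2: re-rank the survivors with the full embedding.

    query: (dim,), docs: (n, dim). Returns indices of the top results.
    """
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    coarse = norm(docs[:, :prefix]) @ norm(query[:prefix])  # cheap pass
    candidates = np.argsort(-coarse)[:k_coarse]
    full = norm(docs[candidates]) @ norm(query)             # full-precision pass
    return candidates[np.argsort(-full)[:k_final]]
```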
Adaptive Storage
Store short dimensions for older/less-accessed documents, full dimensions for frequently queried content. Balance storage cost with quality.
Edge Deployment
Use minimal dimensions on edge devices for local search, full dimensions on server for comprehensive queries. Same model, different tradeoffs.
Matryoshka-Enabled Models
Not all models support dimension truncation. These models are trained with MRL:
- nomic-embed-text-v1.5: Supports 64, 128, 256, 512, 768
- mxbai-embed-large-v1: Full MRL support
- snowflake-arctic-embed: Designed for adaptive dimensions
- jina-embeddings-v3: Flexible dimension truncation
Fine-tuning Embeddings for Domain-Specific Use
Off-the-shelf embedding models are trained on general web data. For specialized domains (legal, medical, scientific), fine-tuning on domain data can yield significant improvements.
When to Fine-tune
- Domain vocabulary: Specialized terminology that general models misunderstand (legal terms, chemical compounds, medical abbreviations)
- Different similarity notion: What counts as "similar" differs from web text (code similarity, legal precedent matching)
- Quality plateau: Exhausted other optimizations (chunking, prompting, retrieval strategy) but still underperforming
- Sufficient data: Have thousands of query-document pairs or can generate them synthetically
Fine-tuning Approaches
Contrastive Fine-tuning
Continue training with domain-specific positive pairs. The standard approach using the same InfoNCE loss as original training, but on your data.
```python
sentence_transformers.losses.MultipleNegativesRankingLoss
```
Synthetic Data Generation
Use an LLM to generate queries for your documents. Given a document, prompt the LLM: "Generate 5 questions that this document would answer." Creates training pairs at scale.
```
(generated_query, original_document) pairs
```
Hard Negative Mining
Use BM25 or the base model to find challenging negatives. Fine-tune on triplets: (query, positive, hard_negative) with margin loss.
```python
sentence_transformers.losses.TripletLoss
```
Distillation from Cross-encoder
Train a cross-encoder on your data (slow but accurate), then distill its knowledge into the bi-encoder embedding model. Best quality but most complex.
```python
sentence_transformers.losses.MarginMSELoss
```
Fine-tuning Best Practices
Common pitfalls to avoid:
- Training too long: embedding models overfit quickly; 1-3 epochs is usually sufficient
- Learning rate too high: destroys pretrained knowledge; start at 1e-5 to 2e-5
- Batch size too small: fewer in-batch negatives, weaker training signal
- No evaluation set: impossible to detect overfitting without held-out data
- Ignoring prefixes: if the base model uses prefixes, maintain them during fine-tuning
Batch Processing and Optimization
Embedding large document collections requires careful optimization. The difference between naive and optimized approaches can be 10-100x in throughput.
Batching Strategy
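One common strategy is length-sorted batching: sort documents by length so each batch pads to a similar length, then restore the original order after embedding. A sketch, where `length_fn` and `embed_batch` are stand-ins for a real token counter and model:

```python
def make_batches(texts, batch_size=64, length_fn=len):
    """Return batches of original indices, sorted by text length."""
    order = sorted(range(len(texts)), key=lambda i: length_fn(texts[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def embed_all(texts, embed_batch, batch_size=64):
    """embed_batch: callable mapping a list of texts to a list of vectors."""
    results = [None] * len(texts)
    for batch in make_batches(texts, batch_size):
        vecs = embed_batch([texts[i] for i in batch])
        for i, v in zip(batch, vecs):
            results[i] = v                 # scatter back to original positions
    return results
```

With uniform batches, padding waste scales with the longest document in the whole collection; with length-sorted batches it scales with the longest document per batch.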
GPU Optimization
Mixed Precision (FP16)
Use half-precision inference for 2x memory reduction and faster computation. Quality impact is negligible for most models.
```python
model.half()  # Convert to FP16
```
Multi-GPU Parallelism
Distribute batches across multiple GPUs for linear speedup. Use data parallelism for embedding (model on each GPU, split data).
```python
model = SentenceTransformer(..., device="cuda")
model.start_multi_process_pool()  # sentence-transformers
```
ONNX Runtime
Convert models to ONNX for optimized inference. Can provide 2-3x speedup on CPU and additional gains on GPU through graph optimization.
```shell
optimum-cli export onnx --model BAAI/bge-base-en-v1.5 ./onnx/
```
CPU Optimization
For CPU-only deployment, additional optimizations such as ONNX export and int8 quantization are critical.
Streaming and Incremental Processing
For real-time systems, process documents as they arrive rather than in large batches.
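A minimal sketch of grouping an incoming stream into micro-batches (a production system would also flush on a time deadline so a lone document is not delayed indefinitely):

```python
def micro_batches(stream, max_batch=32):
    """Group an incoming iterator of texts into small batches so the
    embedder is invoked promptly instead of waiting for a large batch."""
    batch = []
    for text in stream:
        batch.append(text)
        if len(batch) >= max_batch:
            yield batch
            batch = []
    if batch:          # flush the final partial batch
        yield batch
```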
Caching Strategies
Embedding computation is expensive. Effective caching avoids redundant work and dramatically improves system performance.
Content-based Caching
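A sketch of content-based caching: key on a hash of the text, model name, and a preprocessing version so a change to any of them invalidates the entry. The in-memory dict is a stand-in for a persistent store such as SQLite or LMDB:

```python
import hashlib
import json

def cache_key(text: str, model_name: str, preprocess_version: str = "v1") -> str:
    """Content + model + preprocessing version; changing any regenerates."""
    payload = json.dumps([model_name, preprocess_version, text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class EmbeddingCache:
    def __init__(self, embed_fn, model_name):
        self._store = {}          # swap for SQLite/LMDB in production
        self._embed = embed_fn
        self._model = model_name

    def get(self, text):
        key = cache_key(text, self._model)
        if key not in self._store:
            self._store[key] = self._embed(text)  # compute only on miss
        return self._store[key]
```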
Persistent Embedding Store
For large-scale systems, persist embeddings to disk with proper indexing.
Storage Options
- SQLite + numpy blobs: Simple, single-file, good for <10M embeddings
- LMDB: Fast key-value store, memory-mapped, excellent for read-heavy workloads
- Parquet files: Columnar format, good for analytics and batch operations
- Vector databases: ChromaDB, Qdrant, Milvus—integrated storage and search
Cache Invalidation
Embeddings must be regenerated when:
- Source document changes (content hash changes)
- Embedding model changes (model version tracking)
- Chunking strategy changes (preprocessing version tracking)
- Model fine-tuning occurs (new model checkpoint)
Query Embedding Caching
For frequently repeated queries, cache query embeddings separately from document embeddings:
- LRU cache: Keep recent query embeddings in memory. Use functools.lru_cache or custom implementation.
- Query normalization: Lowercase, strip whitespace, canonicalize before hashing to improve cache hit rate.
- Semantic deduplication: Cluster similar queries, use a single representative embedding. Advanced but effective for high-volume systems.
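The first two points above can be sketched with `functools.lru_cache` plus query normalization (the function names and cache size here are illustrative):

```python
from functools import lru_cache

def normalize_query(q: str) -> str:
    """Canonicalize before caching to improve hit rate."""
    return " ".join(q.lower().split())

def make_cached_embedder(embed_fn, maxsize=4096):
    """embed_fn: str -> tuple of floats (hashable, so lru_cache can hold it)."""
    @lru_cache(maxsize=maxsize)
    def cached(q: str):
        return embed_fn(q)
    return lambda q: cached(normalize_query(q))
```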
Summary and Recommendations
Embeddings are the foundation of semantic retrieval. Key decisions for your system:
Decision Framework
- Start with a proven base model: bge-base-en-v1.5 or gte-base provide good quality/speed balance. Upgrade to large variants only if quality is insufficient.
- Normalize embeddings at index time: Enables fast dot product search equivalent to cosine similarity.
- Implement proper caching: Content-hash-based caching with model version tracking prevents redundant computation.
- Measure on your data: Benchmark scores are indicative but not definitive. Create an evaluation set from your domain.
- Consider fine-tuning last: Exhaust chunking, prompting, and retrieval strategy optimizations before fine-tuning.
For local-first systems like those built with OnyxLab, embedding choice directly impacts user experience. A well-chosen embedding model running locally provides instant, private semantic search without network dependencies.