Introduction
OnyxLab builds autonomous systems that run entirely on your hardware. No cloud dependencies, no API costs, no data leaving your network. This documentation covers the fundamentals of local AI deployment, the architectural decisions that make it possible, and the philosophy of owning the tools you depend on.
What You'll Learn
- The distinction between language models and autonomous agents
- How retrieval-augmented generation works at a systems level
- Vector embeddings and semantic search fundamentals
- Building production-grade local inference pipelines
- Quantization strategies and hardware optimization
- The open-source inference stack: Ollama, llama.cpp, vLLM, and beyond
Why Local-First Matters
The current AI landscape is dominated by centralized inference. You send your data to someone else's servers, pay per token, and hope their API stays online. This model works for many use cases, but it comes with structural limitations that become apparent at scale or in sensitive contexts.
Privacy by Architecture
When inference runs locally, privacy isn't a policy decision—it's a physical constraint. Your prompts, documents, and outputs never traverse a network boundary you don't control. There's no trust required because there's no third party involved.
Predictable Costs
API pricing scales linearly with usage. Local inference has high upfront costs (hardware) but near-zero marginal costs. For sustained workloads, the crossover point comes faster than most assume—often within months for moderate usage.
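The crossover arithmetic is easy to sketch. The figures below (hardware price, per-token API rate, power cost) are illustrative assumptions, not quoted prices; plug in your own numbers.

```python
# Back-of-envelope break-even: months until cumulative API spend
# exceeds the cost of local hardware. All figures are illustrative.
def breakeven_months(hardware_cost, api_cost_per_1m_tokens,
                     tokens_per_month, power_cost_per_month=30.0):
    """Months until local hardware pays for itself versus API usage."""
    api_monthly = api_cost_per_1m_tokens * tokens_per_month / 1_000_000
    savings = api_monthly - power_cost_per_month
    if savings <= 0:
        return float("inf")  # at this volume, local never pays off
    return hardware_cost / savings

# e.g. a $1,600 GPU vs. a hypothetical $10 per million tokens at 50M tokens/month
months = breakeven_months(1600, 10.0, 50_000_000)
print(f"break-even after ~{months:.1f} months")
```

At sustained moderate volume the break-even lands within a few months; at low volume the API side wins, which is the honest version of "it depends on your workload."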
Latency Control
Network round-trips add 50-200ms minimum. Local inference eliminates this entirely. For real-time applications—voice interfaces, coding assistants, robotics—this difference is the gap between usable and unusable.
Data Sovereignty
Regulatory requirements (GDPR, HIPAA, SOC2) often prohibit sending certain data to third parties. Local inference sidesteps compliance complexity by keeping data within your existing security perimeter.
There's also a philosophical dimension. When your AI infrastructure depends entirely on external providers, you're subject to their pricing changes, their content policies, their uptime, and their business decisions. Local-first means you control the stack. You can modify, audit, and extend every component. The system belongs to you.
The State of AI Infrastructure
We're in an unusual moment. The underlying models are increasingly open—Llama, Mistral, Qwen, DeepSeek, and others release weights that rival proprietary systems. The inference tooling is maturing rapidly. Yet most production deployments still route through a handful of API providers.
The Centralization Problem
As of 2024, over 90% of commercial AI inference runs through five providers. This creates single points of failure, pricing power concentration, and systemic risk. When OpenAI has an outage, thousands of applications break simultaneously. When they change their terms of service, entire business models become uncertain overnight.
The open-source ecosystem offers an alternative path. The same architectures that power GPT-4 and Claude are now reproducible with publicly available weights and training methods. The gap between open and closed models shrinks with each release. More importantly, for most practical applications, the current generation of open models is already sufficient.
What's missing isn't capability—it's infrastructure. Most developers have never set up a local inference server. They don't know how to evaluate quantization tradeoffs or optimize batch processing. The knowledge exists but it's scattered across papers, Discord servers, and GitHub issues. This documentation aims to consolidate it.
What "Offline by Default" Means
"Offline by default" isn't marketing language—it's an architectural constraint. Every component we build must function without network access. This shapes design decisions at every level:
This constraint eliminates entire categories of failure modes. There's no API key to expire, no rate limit to hit, no network partition to handle. But it also means being deliberate about what you include. Every dependency must work offline. Every model must fit in available memory. Every feature must degrade gracefully when resources are constrained.
Offline Capability Checklist
- All model weights stored locally with integrity verification
- No runtime license checks or activation requirements
- Embedding and retrieval pipelines use local models only
- Vector storage in local databases (not hosted services)
- Document parsing without cloud APIs
- Zero telemetry or phone-home behavior in any component
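The first checklist item, integrity verification, needs nothing beyond the standard library. A minimal sketch: stream the weight file through SHA-256 and compare against a checksum you recorded at download time (the function names here are illustrative, not part of any tool's API).

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so multi-GB weights never need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_weights(path, expected_hex):
    """Compare against a checksum recorded when the model was downloaded."""
    return sha256_of(path) == expected_hex
```

Record the digest once, at download time over a connection you trust; after that, verification is a purely local operation.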
Hardware Requirements
Local inference is constrained by memory more than compute. The model must fit in VRAM (for GPU inference) or RAM (for CPU inference). This is where quantization becomes essential—trading precision for size.
| Model Size | Full Precision (FP16) | Q8 (8-bit) | Q4 (4-bit) | Use Case |
|---|---|---|---|---|
| 7B parameters | 14 GB | 7 GB | 4 GB | Consumer GPUs, real-time chat |
| 13B parameters | 26 GB | 13 GB | 7 GB | Mid-range GPUs, coding tasks |
| 34B parameters | 68 GB | 34 GB | 17 GB | Workstation GPUs, complex reasoning |
| 70B parameters | 140 GB | 70 GB | 35 GB | Multi-GPU setups, production systems |
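The table values follow from one formula: parameter count times bits per weight, divided by eight. A sketch, with an assumed ~20% overhead factor for KV cache and buffers (real overhead varies with context length):

```python
def est_model_gb(params_billion, bits_per_weight, overhead=1.2):
    """Rough memory footprint in GB: weights plus an assumed ~20% for
    KV cache and runtime buffers. The overhead factor is a ballpark."""
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * overhead

for params in (7, 13, 34, 70):
    print(f"{params}B @ Q4: ~{est_model_gb(params, 4):.1f} GB")
```

With `overhead=1.0` this reproduces the weights-only numbers in the table (7B at 16-bit is 14 GB, 70B at 4-bit is 35 GB); the default overhead explains why an "8 GB" GPU is the practical floor for a 4 GB quantized model.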
Recommended Configurations
Entry Level (8GB VRAM)
RTX 3070/4060 or equivalent. Runs 7B models at Q4 quantization with good performance. Suitable for personal use, development, and testing. Expect 20-40 tokens/second depending on context length.
Professional (16-24GB VRAM)
RTX 3090/4080/4090 or equivalent. Runs 13B models comfortably, 34B with aggressive quantization. Handles longer contexts and faster generation. Good for small teams or demanding applications.
Production (48GB+ VRAM)
A100/H100 or multi-GPU consumer setups. Runs 70B+ models, handles concurrent requests, supports production workloads. Consider vLLM for efficient batching at this scale.
CPU Inference: When It Makes Sense
GPU inference is faster, but CPU inference is more accessible. Modern frameworks like llama.cpp have excellent CPU performance, especially on Apple Silicon with unified memory. For development, testing, or low-throughput applications, CPU inference is often sufficient.
Quantization Tradeoffs
Quantization reduces model size by representing weights with fewer bits. The tradeoff is quality degradation, though modern quantization methods keep the loss small.
Q8 (8-bit)
Minimal quality loss, typically indistinguishable from full precision. Use when memory allows. Good default for production systems where quality matters.
Q6_K
Excellent quality-to-size ratio. Slight measurable degradation on benchmarks, rarely noticeable in practice. Good balance for most applications.
Q4_K_M
The sweet spot for consumer hardware. Roughly 4x smaller than full precision with acceptable quality degradation. Most commonly used quantization level.
Q2_K / Q3_K
Aggressive quantization for memory-constrained environments. Noticeable quality degradation, especially on complex reasoning tasks. Use only when necessary.
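Choosing among these levels usually reduces to "the highest-quality quant that fits." A sketch of that selection, using approximate effective bits-per-weight figures for GGUF quant levels (ballpark values, not exact format specifications):

```python
# Approximate effective bits per weight for common GGUF quant levels.
# These are ballpark figures; actual file sizes vary by model.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q6_K": 6.6, "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 2.6}

def largest_quant_that_fits(params_billion, vram_gb, headroom=0.9):
    """Pick the highest-quality quant whose weights fit within a VRAM budget,
    leaving `headroom` fraction free for KV cache and buffers."""
    budget = vram_gb * headroom
    for name, bpw in sorted(BITS_PER_WEIGHT.items(), key=lambda kv: -kv[1]):
        if params_billion * bpw / 8 <= budget:
            return name
    return None  # nothing fits; use a smaller model
```

On an 8 GB card this picks a mid-range quant for a 7B model and an aggressive one for 13B, which mirrors the guidance above: drop to Q2/Q3 only when the memory math forces it.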
The Open-Source Inference Stack
A robust ecosystem of open-source tools makes local inference practical. Each serves different needs, and understanding when to use which is essential for building reliable systems.
Ollama
The easiest path to local inference. Ollama packages models with their dependencies, handles quantization automatically, and exposes an OpenAI-compatible API. Ideal for development, prototyping, and simple deployments.
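Because the API is OpenAI-compatible, talking to a local Ollama server needs nothing beyond the standard library. A minimal sketch, assuming the default port (11434) and a model name you have pulled locally (`llama3` here is a placeholder):

```python
import json
from urllib import request

def chat_request(prompt, model="llama3", base_url="http://localhost:11434"):
    """Build a request against Ollama's OpenAI-compatible chat endpoint.
    The port is Ollama's default; the model name is a placeholder."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def chat(prompt, **kw):
    """Send the request and return the assistant's reply (needs a running server)."""
    with request.urlopen(chat_request(prompt, **kw)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The same request shape works against any OpenAI-compatible server, which is what makes Ollama a low-friction on-ramp: code written against it ports to vLLM or TGI by changing `base_url`.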
llama.cpp
The foundation layer. Pure C/C++ implementation of Llama inference, optimized for both CPU and GPU execution. Most other tools (including Ollama) build on llama.cpp. Use directly when you need maximum control or custom integrations.
vLLM
High-performance inference engine designed for throughput. Implements PagedAttention for efficient memory management and continuous batching for concurrent requests. The choice for production deployments serving multiple users.
Text Generation Inference (TGI)
Hugging Face's production inference solution. Rust-based, highly optimized, with built-in support for quantization, speculative decoding, and multi-GPU deployment. Good alternative to vLLM with different performance characteristics.
ExLlamaV2
Specialized for EXL2 quantization format, which offers better quality-per-bit than GGUF at the cost of GPU-only inference. Fastest option for single-GPU consumer hardware when using compatible quantizations.
Who This Documentation Is For
This documentation targets developers and engineers who want to move beyond API-dependent AI systems. You might be:
Building Internal Tools
You need AI capabilities for internal applications but can't send proprietary data to third-party APIs. Legal, finance, healthcare, or any domain with confidentiality requirements.
Reducing Infrastructure Costs
Your API bill is growing and you want to understand whether local inference makes economic sense for your workload. Spoiler: it often does, especially for sustained usage.
Developing AI Products
You're building products that embed AI capabilities and want to avoid per-user API costs or dependency on external providers. Edge deployment, desktop applications, embedded systems.
Learning the Stack
You want to understand how modern AI systems actually work, beyond the abstraction of an API call. Inference pipelines, embeddings, retrieval systems—the full architecture.
We assume basic programming competence and familiarity with command-line interfaces. You don't need machine learning experience—we'll explain the relevant concepts as they arise. Some sections go deep into hardware and optimization; skip those if they're not relevant to your use case.
Philosophy: Owning Your Tools
There's an old principle in software: don't build on rented land. When your core infrastructure depends on services you don't control, you're always one pricing change, one policy update, one acquisition away from disruption.
This doesn't mean avoiding cloud services entirely—that would be impractical. It means being deliberate about what you outsource and what you own. For AI specifically, the calculus is shifting. The models are increasingly open. The tooling is mature. The hardware is accessible. The main barrier is knowledge, and that's what we're addressing.
What Ownership Means in Practice
- Auditability: You can inspect every component, from model weights to inference code. No black boxes.
- Modifiability: You can fine-tune models, adjust prompts, change architectures without waiting for provider updates.
- Portability: Your system runs anywhere you have compatible hardware. No vendor lock-in, no platform risk.
- Longevity: Open-source tools don't get deprecated. Models you download today will work in a decade.
The goal isn't ideological purity—it's practical resilience. By understanding how to run AI locally, you have options. You can choose cloud when it makes sense and local when it doesn't. You're not dependent on any single provider's decisions. That optionality has real value.
What You'll Build
This documentation isn't purely theoretical. Each section builds toward practical capabilities.
Prerequisites
Command Line
Comfortable running commands in a terminal. Basic shell scripting is helpful but not required.
Python or TypeScript
Most examples use Python. TypeScript alternatives provided where relevant. Other languages work via HTTP APIs.
Hardware Access
A machine with 8GB+ RAM minimum. GPU with 8GB+ VRAM recommended but not required for getting started.