OnyxLab | AI/ML Engineering

OnyxLab builds autonomous systems that run entirely on your hardware. No cloud dependencies, no API costs, no data leaving your network. This documentation covers the fundamentals of local AI deployment, the architectural decisions that make it possible, and the philosophy of owning the tools you depend on.

What You'll Learn

The distinction between language models and autonomous agents
How retrieval-augmented generation works at a systems level
Vector embeddings and semantic search fundamentals
Building production-grade local inference pipelines
Quantization strategies and hardware optimization
The open-source inference stack: Ollama, llama.cpp, vLLM, and beyond

Why Local-First Matters

The current AI landscape is dominated by centralized inference. You send your data to someone else's servers, pay per token, and hope their API stays online. This model works for many use cases, but it comes with structural limitations that become apparent at scale or in sensitive contexts.

Privacy by Architecture

When inference runs locally, privacy isn't a policy decision—it's a physical constraint. Your prompts, documents, and outputs never traverse a network boundary you don't control. There's no trust required because there's no third party involved.

Predictable Costs

API pricing scales linearly with usage. Local inference has high upfront costs (hardware) but near-zero marginal costs. For sustained workloads, the crossover point comes faster than most assume—often within months for moderate usage.

Latency Control

Network round-trips add 50-200ms minimum. Local inference eliminates this entirely. For real-time applications—voice interfaces, coding assistants, robotics—this difference is the gap between usable and unusable.

Data Sovereignty

Regulatory requirements (GDPR, HIPAA, SOC2) often prohibit sending certain data to third parties. Local inference sidesteps compliance complexity by keeping data within your existing security perimeter.

There's also a philosophical dimension. When your AI infrastructure depends entirely on external providers, you're subject to their pricing changes, their content policies, their uptime, and their business decisions. Local-first means you control the stack. You can modify, audit, and extend every component. The system belongs to you.

The State of AI Infrastructure

We're in an unusual moment. The underlying models are increasingly open—Llama, Mistral, Qwen, DeepSeek, and others release weights that rival proprietary systems. The inference tooling is maturing rapidly. Yet most production deployments still route through a handful of API providers.

The Centralization Problem

As of 2024, over 90% of commercial AI inference runs through five providers. This creates single points of failure, pricing power concentration, and systemic risk. When OpenAI has an outage, thousands of applications break simultaneously. When they change their terms of service, entire business models become uncertain overnight.

The open-source ecosystem offers an alternative path. The same architectures that power GPT-4 and Claude are now reproducible with publicly available weights and training methods. The gap between open and closed models shrinks with each release. More importantly, for most practical applications, the current generation of open models is already sufficient.

What's missing isn't capability—it's infrastructure. Most developers have never set up a local inference server. They don't know how to evaluate quantization tradeoffs or optimize batch processing. The knowledge exists but it's scattered across papers, Discord servers, and GitHub issues. This documentation aims to consolidate it.

What "Offline by Default" Means

"Offline by default" isn't marketing language—it's an architectural constraint. Every component we build must function without network access. This shapes design decisions at every level:

# A typical cloud-dependent pipeline:

user_input -> network -> API endpoint -> inference -> network -> response

# An offline-first pipeline:

user_input -> local inference server -> response

# Dependencies that must be local:

- Model weights (downloaded once, stored locally)

- Tokenizer files (bundled with model)

- Embedding models (for RAG pipelines)

- Vector database (SQLite, LanceDB, or similar)

- Document processing (no cloud OCR or parsing)

This constraint eliminates entire categories of failure modes. There's no API key to expire, no rate limit to hit, no network partition to handle. But it also means being deliberate about what you include. Every dependency must work offline. Every model must fit in available memory. Every feature must degrade gracefully when resources are constrained.

Offline Capability Checklist

All model weights stored locally with integrity verification
No runtime license checks or activation requirements
Embedding and retrieval pipelines use local models only
Vector storage in local databases (not hosted services)
Document parsing without cloud APIs
Zero telemetry or phone-home behavior in any component

Hardware Requirements

Local inference is constrained by memory more than compute. The model must fit in VRAM (for GPU inference) or RAM (for CPU inference). This is where quantization becomes essential—trading precision for size.

Model Size	Full Precision	Q8 (8-bit)	Q4 (4-bit)	Use Case
7B parameters	14 GB	7 GB	4 GB	Consumer GPUs, real-time chat
13B parameters	26 GB	13 GB	7 GB	Mid-range GPUs, coding tasks
34B parameters	68 GB	34 GB	17 GB	Workstation GPUs, complex reasoning
70B parameters	140 GB	70 GB	35 GB	Multi-GPU setups, production systems

Recommended Configurations

Entry Level (8GB VRAM)

RTX 3070/4060 or equivalent. Runs 7B models at Q4 quantization with good performance. Suitable for personal use, development, and testing. Expect 20-40 tokens/second depending on context length.

Professional (16-24GB VRAM)

RTX 3090/4080/4090 or equivalent. Runs 13B models comfortably, 34B with aggressive quantization. Handles longer contexts and faster generation. Good for small teams or demanding applications.

Production (48GB+ VRAM)

A100/H100 or multi-GPU consumer setups. Runs 70B+ models, handles concurrent requests, supports production workloads. Consider vLLM for efficient batching at this scale.

CPU Inference: When It Makes Sense

GPU inference is faster, but CPU inference is more accessible. Modern frameworks like llama.cpp have excellent CPU performance, especially on Apple Silicon with unified memory. For development, testing, or low-throughput applications, CPU inference is often sufficient.

# Approximate CPU inference speeds (Apple M2 Pro, 32GB RAM):

Llama 3 8B Q4_K_M: ~15-20 tokens/sec

Mistral 7B Q4_K_M: ~18-22 tokens/sec

Phi-3 3.8B Q4_K_M: ~35-45 tokens/sec

# For comparison, GPU inference (RTX 4090):

Llama 3 8B Q4_K_M: ~80-120 tokens/sec

Mistral 7B Q4_K_M: ~90-130 tokens/sec

Quantization Tradeoffs

Quantization reduces model size by representing weights with fewer bits. The tradeoff is quality degradation, though modern quantization methods minimize this significantly.

Q8 (8-bit)

Minimal quality loss, typically indistinguishable from full precision. Use when memory allows. Good default for production systems where quality matters.

Q6_K

Excellent quality-to-size ratio. Slight measurable degradation on benchmarks, rarely noticeable in practice. Good balance for most applications.

Q4_K_M

The sweet spot for consumer hardware. Roughly 4x smaller than full precision with acceptable quality degradation. Most commonly used quantization level.

Q2_K / Q3_K

Aggressive quantization for memory-constrained environments. Noticeable quality degradation, especially on complex reasoning tasks. Use only when necessary.

The Open-Source Inference Stack

A robust ecosystem of open-source tools makes local inference practical. Each serves different needs, and understanding when to use which is essential for building reliable systems.

Ollama

The easiest path to local inference. Ollama packages models with their dependencies, handles quantization automatically, and exposes an OpenAI-compatible API. Ideal for development, prototyping, and simple deployments.

# Install and run in minutes

ollama run llama3:8b

# OpenAI-compatible API endpoint

curl http://localhost:11434/v1/chat/completions

llama.cpp

The foundation layer. Pure C/C++ implementation of Llama inference, optimized for both CPU and GPU execution. Most other tools (including Ollama) build on llama.cpp. Use directly when you need maximum control or custom integrations.

# Direct llama.cpp usage

./llama-server -m model.gguf -c 4096 --host 0.0.0.0 --port 8080

vLLM

High-performance inference engine designed for throughput. Implements PagedAttention for efficient memory management and continuous batching for concurrent requests. The choice for production deployments serving multiple users.

# vLLM server with continuous batching

python -m vllm.entrypoints.openai.api_server \

--model mistralai/Mistral-7B-Instruct-v0.2 \

--tensor-parallel-size 2

Text Generation Inference (TGI)

Hugging Face's production inference solution. Rust-based, highly optimized, with built-in support for quantization, speculative decoding, and multi-GPU deployment. Good alternative to vLLM with different performance characteristics.

ExLlamaV2

Specialized for EXL2 quantization format, which offers better quality-per-bit than GGUF at the cost of GPU-only inference. Fastest option for single-GPU consumer hardware when using compatible quantizations.

Who This Documentation Is For

This documentation targets developers and engineers who want to move beyond API-dependent AI systems. You might be:

Building Internal Tools

You need AI capabilities for internal applications but can't send proprietary data to third-party APIs. Legal, finance, healthcare, or any domain with confidentiality requirements.

Reducing Infrastructure Costs

Your API bill is growing and you want to understand whether local inference makes economic sense for your workload. Spoiler: it often does, especially for sustained usage.

Developing AI Products

You're building products that embed AI capabilities and want to avoid per-user API costs or dependency on external providers. Edge deployment, desktop applications, embedded systems.

Learning the Stack

You want to understand how modern AI systems actually work, beyond the abstraction of an API call. Inference pipelines, embeddings, retrieval systems—the full architecture.

We assume basic programming competence and familiarity with command-line interfaces. You don't need machine learning experience—we'll explain the relevant concepts as they arise. Some sections go deep into hardware and optimization; skip those if they're not relevant to your use case.

Philosophy: Owning Your Tools

There's an old principle in software: don't build on rented land. When your core infrastructure depends on services you don't control, you're always one pricing change, one policy update, one acquisition away from disruption.

This doesn't mean avoiding cloud services entirely—that would be impractical. It means being deliberate about what you outsource and what you own. For AI specifically, the calculus is shifting. The models are increasingly open. The tooling is mature. The hardware is accessible. The main barrier is knowledge, and that's what we're addressing.

What Ownership Means in Practice

Auditability: You can inspect every component, from model weights to inference code. No black boxes.
Modifiability: You can fine-tune models, adjust prompts, change architectures without waiting for provider updates.
Portability: Your system runs anywhere you have compatible hardware. No vendor lock-in, no platform risk.
Longevity: Open-source tools don't get deprecated. Models you download today will work in a decade.

The goal isn't ideological purity—it's practical resilience. By understanding how to run AI locally, you have options. You can choose cloud when it makes sense and local when it doesn't. You're not dependent on any single provider's decisions. That optionality has real value.

What You'll Build

This documentation isn't purely theoretical. Each section builds toward practical capabilities:

# By the end of this documentation, you'll be able to:

01.Deploy a local inference server and connect applications to it

02.Build RAG systems with local embeddings and vector storage

03.Create agent loops that use tools and maintain state

04.Optimize inference for your specific hardware configuration

05.Evaluate models systematically for your use case

06.Design systems that work offline without degradation

Prerequisites

This documentation assumes familiarity with basic programming concepts and command-line interfaces. You don't need prior machine learning experience—we'll cover the relevant concepts as they arise.

Command Line

Comfortable running commands in a terminal. Basic shell scripting is helpful but not required.

Python or TypeScript

Most examples use Python. TypeScript alternatives provided where relevant. Other languages work via HTTP APIs.

Hardware Access

A machine with 8GB+ RAM minimum. GPU with 8GB+ VRAM recommended but not required for getting started.

Next: LLM vs Agents

Introduction