Getting Started

LLM vs Agents

The terms "LLM" and "AI agent" are often used interchangeably, but they describe fundamentally different systems. Understanding this distinction at a technical level is essential for building effective autonomous applications and making informed architectural decisions. This guide provides a deep dive into how each works, their computational characteristics, and when to use which approach.

How Transformers Actually Work

Modern LLMs are built on the transformer architecture, introduced in the 2017 paper "Attention Is All You Need." Understanding this architecture demystifies much of what LLMs can and cannot do.

Tokenization: Text to Numbers

Before any processing occurs, text must be converted to numerical representations. Tokenizers split input text into subword units—not characters, not words, but something in between. The sentence "Understanding transformers deeply" might become tokens like ["Under", "standing", " transform", "ers", " deeply"]. Each token maps to an integer ID and then to a learned embedding vector, typically 4096-8192 dimensions for modern models.

// Tokenization example
Input: "The model predicts tokens"
Tokens: ["The", " model", " predicts", " tokens"]
IDs: [464, 2746, 52960, 16326]
Embeddings: [vector_4096d, vector_4096d, vector_4096d, vector_4096d]

Different tokenizers produce different token counts for the same text. This matters because context windows are measured in tokens, not characters. GPT-4's tokenizer averages roughly 4 characters per token for English, but this varies significantly by language and content type. Code tends to tokenize less efficiently than prose.
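The subword splitting above can be illustrated with a toy greedy longest-match tokenizer. The vocabulary here is invented for the example—real tokenizers (e.g. BPE-based ones) learn tens of thousands of subword pieces from data—but the matching logic is the same in spirit:

```python
# Toy greedy longest-match tokenizer over a tiny, hand-picked vocabulary
VOCAB = {"Under": 0, "standing": 1, " transform": 2, "ers": 3, " deeply": 4}

def tokenize(text, vocab):
    tokens = []
    i = 0
    while i < len(text):
        # Take the longest vocabulary piece that matches at position i
        match = None
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece, i):
                match = piece
                break
        if match is None:
            raise ValueError(f"no token for {text[i:]!r}")
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("Understanding transformers deeply", VOCAB))
# -> ['Under', 'standing', ' transform', 'ers', ' deeply']
```

Real tokenizers also handle the "no match" case gracefully by falling back to byte-level tokens, which is why they can encode any input.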

The Attention Mechanism

The core innovation of transformers is self-attention. For each token position, the model computes attention weights over all other positions, determining how much each token should "attend to" every other token. This is computed through three learned projections: Query (Q), Key (K), and Value (V).

# Simplified attention computation (runnable with NumPy)
import numpy as np

def softmax(x):
    # Subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Q, K, V are projections of the input embeddings, shape (n, d_k)
    d_k = K.shape[-1]  # dimension of the key vectors

    scores = Q @ K.T / np.sqrt(d_k)  # (n, n) pairwise scores
    weights = softmax(scores)        # each row sums to 1
    output = weights @ V

    return output

The computational cost of attention scales quadratically with sequence length—O(n^2) where n is the number of tokens. This is why context windows have practical limits. A 128k context window requires computing 128,000 x 128,000 attention scores per layer. Modern models use techniques like grouped-query attention (GQA) and sliding window attention to reduce this cost, but the fundamental scaling remains a constraint.
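A quick back-of-the-envelope calculation makes the quadratic growth concrete:

```python
# Attention scores per layer grow as n^2 with context length n
for n in (4_000, 32_000, 128_000):
    print(f"{n:>7} tokens -> {n * n / 1e9:.2f} billion attention scores per layer")
```

Going from a 4k to a 128k context multiplies the context length by 32 but the per-layer score count by over 1,000.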

Context Windows: The Memory Boundary

A context window defines the maximum number of tokens a model can process in a single forward pass. Everything the model "knows" about your conversation must fit within this window. When you exceed it, tokens are typically truncated from the beginning—the model literally forgets the start of the conversation.

4K Context

~3,000 words. Suitable for single document Q&A or short conversations. Common in older or smaller models.

32K Context

~24,000 words. Handles longer documents and extended conversations. Standard for many current models.

128K+ Context

~100,000 words. Entire codebases or book-length documents. Significant memory and compute requirements.
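Because truncation from the beginning is the default failure mode, applications often trim history proactively. A minimal sketch using the rough 4-characters-per-token heuristic mentioned earlier (`estimate_tokens` and `fit_to_window` are illustrative names, not any library's API):

```python
def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English prose
    return max(1, len(text) // 4)

def fit_to_window(messages, max_tokens):
    # Drop the oldest messages first until the estimated total fits
    kept = list(messages)
    while kept and sum(estimate_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)
    return kept
```

Production systems use the model's actual tokenizer for counting, since the 4-characters heuristic drifts badly for code and non-English text.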

Autoregressive Generation: One Token at a Time

LLMs generate text autoregressively—one token at a time, where each new token depends on all previous tokens. This isn't a limitation of implementation; it's fundamental to how these models work. The model outputs a probability distribution over its entire vocabulary (often 32,000-100,000+ tokens), and one token is selected from this distribution to continue the sequence.

# Autoregressive generation loop
def generate(prompt, max_tokens):
    tokens = tokenize(prompt)

    for _ in range(max_tokens):
        # Full forward pass through all transformer layers
        logits = model.forward(tokens)

        # Get probability distribution for next token
        next_token_probs = softmax(logits[-1])

        # Sample from distribution (details in next section)
        next_token = sample(next_token_probs)

        # Stop before appending the end-of-sequence marker,
        # so it doesn't appear in the detokenized output
        if next_token == END_OF_SEQUENCE:
            break

        tokens.append(next_token)

    return detokenize(tokens)

This has profound implications for performance. Generating 1,000 tokens requires 1,000 sequential forward passes through the model. While techniques like KV-caching avoid recomputing attention for previous tokens, generation speed is fundamentally bound by sequential token production. This is why "tokens per second" is a key performance metric for local inference.

Sampling: Temperature, Top-p, and Top-k

The raw output of a transformer is a vector of logits—unnormalized scores for each token in the vocabulary. Converting these to a selected token involves several configurable steps that dramatically affect output characteristics.

Temperature Scaling

Temperature divides the logits before softmax normalization. Lower temperatures sharpen the distribution, making high-probability tokens more likely. Higher temperatures flatten it, giving lower-probability tokens more chance.

# Temperature effect on probability distribution
logits = [2.0, 1.0, 0.5, 0.1]  # Raw model output

# Temperature = 1.0 (default)
probs_t1 = softmax(logits / 1.0)   # [0.57, 0.21, 0.13, 0.09]

# Temperature = 0.3 (more deterministic)
probs_t03 = softmax(logits / 0.3)  # [0.96, 0.03, 0.01, 0.00]

# Temperature = 1.5 (more random)
probs_t15 = softmax(logits / 1.5)  # [0.46, 0.24, 0.17, 0.13]

Low Temperature (0.1-0.5)

More deterministic, focused outputs. Best for factual tasks, code generation, and structured data extraction where consistency matters.

High Temperature (0.8-1.2)

More varied, creative outputs. Better for brainstorming, creative writing, and exploration. Risk of incoherence increases significantly above 1.0.

Top-k Sampling

Top-k sampling restricts token selection to the k highest-probability tokens, setting all others to zero probability before sampling. This prevents the model from selecting extremely unlikely tokens while preserving some randomness among the top candidates.

# Top-k = 5: Only consider 5 most likely tokens
probs = [0.3, 0.25, 0.15, 0.1, 0.08, 0.05, 0.03, 0.02, 0.02, ...]
top_k_probs = [0.3, 0.25, 0.15, 0.1, 0.08, 0, 0, 0, 0, ...]
# Renormalize: [0.34, 0.28, 0.17, 0.11, 0.09]

Top-p (Nucleus) Sampling

Top-p sampling dynamically selects the smallest set of tokens whose cumulative probability exceeds p. This adapts to the model's confidence—when the model is certain, fewer tokens are considered; when uncertain, more tokens remain viable.

# Top-p = 0.9: Include tokens until cumulative prob > 0.9
probs = [0.4, 0.3, 0.15, 0.08, 0.04, 0.02, 0.01, ...]
cumsum = [0.4, 0.7, 0.85, 0.93, ...]  # Stop at 0.93 > 0.9
nucleus = [0.4, 0.3, 0.15, 0.08]  # 4 tokens selected
# Renormalize: [0.43, 0.32, 0.16, 0.09]

In practice, top-p is often preferred over top-k because it adapts to the distribution shape. A typical production setting uses temperature 0.7-0.8 with top-p 0.9-0.95. These parameters interact—using both temperature and top-p/top-k sampling changes the effective behavior of each.
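The three mechanisms compose into a single sampling step: scale by temperature, normalize, filter by top-k and/or top-p, renormalize, then draw. A self-contained Python sketch (function and parameter names are illustrative, not any particular library's API):

```python
import math
import random

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    # Temperature scaling (temperature must be > 0; near 0 approaches greedy)
    scaled = [l / temperature for l in logits]

    # Softmax with max-subtraction for numerical stability
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidate indices sorted by probability, descending
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cum = [], 0.0
        for i in order:
            kept.append(i)
            cum += probs[i]
            if cum >= top_p:  # smallest set whose cumulative prob reaches p
                break
        order = kept

    # Renormalize over the surviving candidates and draw one token index
    mass = sum(probs[i] for i in order)
    weights = [probs[i] / mass for i in order]
    return random.choices(order, weights=weights, k=1)[0]
```

With `top_k=1` this degenerates to greedy decoding; with a small `top_p` and a confident distribution, only the top token survives the nucleus cut.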

What "Stateless" Really Means

LLMs are fundamentally stateless between requests. The model weights are frozen after training—they don't update based on your conversation. What appears as "memory" within a conversation is actually the entire conversation history being passed as input tokens on every single request.

Implications of Statelessness

  • Cost Scaling: Every message includes the full conversation history, so per-request token costs grow linearly with conversation length (and cumulative costs roughly quadratically).
  • No Learning: The model cannot learn new information from your interactions. Corrections in one session don't persist to the next.
  • Context Dependency: Behavior is entirely determined by the input context. Same prompt, same parameters = same probability distribution.
  • Session Boundaries: Each API call or inference request is independent. The model has no concept of "previous sessions."

// What actually happens in a "conversation"
// Turn 1
request_1 = {
  messages: [
    {role: "user", content: "What is Rust?"}
  ]
}

// Turn 2 - ENTIRE history is sent again
request_2 = {
  messages: [
    {role: "user", content: "What is Rust?"},
    {role: "assistant", content: "Rust is a systems programming language..."},
    {role: "user", content: "How does ownership work?"}
  ]
}

// Turn 3 - History keeps growing
request_3 = {
  messages: [
    {role: "user", content: "What is Rust?"},
    {role: "assistant", content: "Rust is a systems programming language..."},
    {role: "user", content: "How does ownership work?"},
    {role: "assistant", content: "Ownership is Rust's approach to memory..."},
    {role: "user", content: "Show me an example"}
  ]
}

Language Models: The Complete Picture

With the technical foundation established, we can now characterize what an LLM actually is: a powerful but bounded text-prediction system.

LLM Characteristics

  • Pure Function: Given the same input tokens and sampling parameters, the probability distribution over outputs is deterministic.
  • Fixed Knowledge: Knowledge is frozen at training cutoff. No access to current information without external retrieval.
  • No Execution: LLMs generate text that describes code or actions—they don't execute anything themselves.
  • Bounded Context: Everything must fit in the context window. No external memory access.
  • Single Modality (typically): Input text, output text. Multi-modal models extend this but maintain the same computational model.

Agents: From Text Generation to Action

An agent is an architecture that wraps an LLM with additional systems to enable autonomous action. The LLM becomes the "reasoning engine" within a larger system that can perceive, plan, act, and learn from feedback. This is a fundamental shift from text generation to task completion.

Core Agent Components

  • LLM Core: The reasoning engine that interprets situations and decides actions.
  • Tool Interface: Structured APIs the LLM can invoke to affect the external world.
  • Memory System: Persistent storage that survives beyond context windows and sessions.
  • Execution Loop: Orchestration logic that manages the observe-think-act cycle.
  • State Management: Tracking of goals, plans, and intermediate results across steps.

Agent Architectures in Detail

Several architectural patterns have emerged for building agents, each with different tradeoffs in capability, reliability, and computational cost.

ReAct: Reasoning and Acting

ReAct (Reason + Act) interleaves reasoning traces with actions. The model explicitly "thinks out loud" before each action, making its decision process interpretable. This pattern emerged from the observation that chain-of-thought prompting improves reasoning quality, and that making reasoning visible helps with debugging.

// ReAct execution pattern
Question: What is the population of the capital of France?

Thought: I need to find the capital of France first, then look up its population.
Action: search("capital of France")
Observation: Paris is the capital of France.

Thought: Now I know the capital is Paris. I need to find its population.
Action: search("population of Paris")
Observation: The population of Paris is approximately 2.1 million in the city proper.

Thought: I have the answer. Paris is the capital with 2.1 million people.
Action: finish("The capital of France is Paris, with a population of about 2.1 million")
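A trace like the one above can be driven by a small loop that parses the model's Action line, runs the named tool, and appends the observation to the transcript. The sketch below assumes an `llm(transcript)` callable and a `tools` dict; the `Action: name("arg")` format mirrors the trace above and is an illustrative convention, not a standard:

```python
import re

def react_loop(llm, tools, question, max_steps=10):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)  # model emits 'Thought: ...\nAction: tool("arg")'
        transcript += step + "\n"
        match = re.search(r'Action:\s*(\w+)\("([^"]*)"\)', step)
        if match is None:
            break  # model failed to produce a parseable action
        name, arg = match.group(1), match.group(2)
        if name == "finish":
            return arg
        observation = tools[name](arg)
        transcript += f"Observation: {observation}\n"
    return None
```

The `max_steps` cap matters: without it, a model that never emits `finish` loops (and bills) forever.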

Plan-and-Execute

This architecture separates planning from execution. A "planner" LLM creates a high-level plan, and an "executor" LLM (or the same LLM in a different mode) carries out each step. Plans can be revised based on execution results. This reduces the cognitive load on each individual LLM call.

// Plan-and-Execute pattern
// Step 1: Planning phase
Planner Input: "Create a report on Q3 sales by region"
Planner Output: [
  "1. Query database for Q3 sales data",
  "2. Group results by region",
  "3. Calculate summary statistics per region",
  "4. Generate visualization",
  "5. Write narrative summary",
  "6. Compile into report format"
]

// Step 2: Execution phase (per step)
Executor Input: "Execute step 1: Query database for Q3 sales data"
Executor Output: {action: "sql_query", query: "SELECT * FROM sales WHERE quarter='Q3'"}

// Step 3: Observation and replanning if needed
Observation: Query returned 0 rows - table name is 'transactions' not 'sales'
Replanner: [Update step 1 to use correct table name, continue from step 1]
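The replanning behavior above can be sketched as a small control loop. `planner` and `executor` here are assumed callables (in practice, LLM calls); the step and outcome shapes are illustrative:

```python
def plan_and_execute(planner, executor, goal, max_revisions=2):
    plan = planner(goal)  # list of step descriptions
    results, i = [], 0
    while i < len(plan):
        outcome = executor(plan[i])  # {"ok": bool, "result": ...} or {"ok": False, "error": ...}
        if outcome["ok"]:
            results.append(outcome["result"])
            i += 1
        elif max_revisions > 0:
            # Feed the failure back to the planner and restart with a revised plan
            max_revisions -= 1
            plan = planner(goal, failed_step=plan[i], error=outcome["error"])
            results, i = [], 0
        else:
            raise RuntimeError(f"step failed after replanning: {outcome['error']}")
    return results
```

Capping revisions prevents the planner and executor from ping-ponging indefinitely on an unfixable step.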

MRKL: Modular Reasoning, Knowledge, and Language

MRKL systems combine an LLM router with specialized expert modules. The LLM determines which module(s) to invoke for each subtask, then synthesizes their outputs. Modules can be other LLMs, traditional ML models, symbolic systems, or databases. This architecture acknowledges that LLMs aren't optimal for all tasks—calculators beat LLMs at math, search engines beat them at retrieval.

// MRKL architecture
modules = {
  "calculator": CalculatorModule(),      // Symbolic math
  "search": SearchModule(),               // Information retrieval
  "code_exec": PythonREPL(),              // Code execution
  "database": SQLModule(),                // Structured data
  "llm": LLMModule()                      // General reasoning
}

// Router decides which module handles each part
query = "What's 15% of the revenue from last quarter?"

router_decision = [
  ("database", "SELECT revenue FROM quarterly_reports WHERE quarter='last'"),
  ("calculator", "0.15 * {database_result}")
]

Toolformer: Learned Tool Use

Toolformer represents a different approach: instead of prompting an existing model to use tools, the model is fine-tuned on examples of tool use. This embeds tool-use capabilities directly into the model weights, potentially improving reliability. The model learns when to insert API calls and how to incorporate results naturally.

// Toolformer-style generation (tool calls embedded in text)
Input: "The Eiffel Tower is [QA("height of Eiffel Tower")->330 meters] tall
        and was completed in [QA("Eiffel Tower completion year")->1889]."

// Model learns to:
// 1. Recognize when external information is needed
// 2. Insert appropriate API calls
// 3. Continue generation using API responses

The Tool-Use Paradigm

Tool use is what transforms an LLM from a text generator into an agent capable of affecting the world. The model outputs structured commands that an execution layer interprets and runs. This requires careful design of both the tool interface and the execution environment.

Function Calling: Structured Output for Tool Invocation

Modern LLMs support "function calling"—the ability to output structured JSON that specifies a function name and arguments. This is more reliable than parsing free-form text for tool invocations. The model is trained (or prompted) with function schemas and learns to produce valid calls.

// Function calling example
// Tool definitions provided to the model
tools = [
  {
    name: "search_web",
    description: "Search the web for current information",
    parameters: {
      type: "object",
      properties: {
        query: {type: "string", description: "Search query"},
        num_results: {type: "integer", default: 5}
      },
      required: ["query"]
    }
  },
  {
    name: "read_file",
    description: "Read contents of a file",
    parameters: {
      type: "object",
      properties: {
        path: {type: "string", description: "File path"}
      },
      required: ["path"]
    }
  }
]

// Model output (structured, not free-form text)
{
  "function_call": {
    "name": "search_web",
    "arguments": {"query": "latest rust 1.75 features", "num_results": 3}
  }
}
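On the application side, a dispatcher validates the model's structured output against the tool schema before executing anything. A minimal sketch (the `tools`/`registry` split and error shape are illustrative choices, not a standard API):

```python
def dispatch(call, tools, registry):
    # 'tools' holds the schemas shown to the model;
    # 'registry' maps tool names to the actual implementations
    name = call["function_call"]["name"]
    args = call["function_call"]["arguments"]
    schema = next(t for t in tools if t["name"] == name)

    # Reject calls missing required parameters instead of failing downstream
    missing = [p for p in schema["parameters"].get("required", []) if p not in args]
    if missing:
        # Return a structured error the model can recover from
        return {"error": f"missing required parameters: {missing}"}

    return registry[name](**args)
```

Validating before execution turns malformed model output into a recoverable observation rather than a runtime crash.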

Tool Design Principles

Atomic Operations

Each tool should do one thing. Compound tools reduce flexibility and make error handling harder. Prefer "read_file" + "parse_json" over "read_and_parse_json_file".

Clear Semantics

Tool names and descriptions must be unambiguous. The LLM's only understanding of a tool comes from its schema—make it precise.

Structured Outputs

Tool results should be structured (JSON, not prose) so the LLM can reliably extract information. Include relevant metadata.

Error Information

When tools fail, return structured error information the LLM can act on. "File not found: /path/to/file" enables recovery.
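These principles combine in even the simplest tool. A sketch of a file-reading tool that returns structured data on both success and failure (the `ok`/`error` field names are an illustrative convention):

```python
def read_file(path):
    # Success and failure both return structured data the LLM can parse
    try:
        with open(path) as f:
            return {"ok": True, "content": f.read()}
    except FileNotFoundError:
        return {"ok": False, "error": "file_not_found", "path": path}
```

Because the error names the missing path, the agent can recover—by listing the directory, correcting a typo, or asking the user—instead of halting on an exception.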

How Agents Maintain State

Since LLMs are stateless, agents must implement external memory systems to persist information across interactions. Different memory types serve different purposes, mirroring how human memory works.

Conversation Memory (Short-term)

The simplest form: maintain a buffer of recent conversation turns. This is typically what's passed directly to the LLM as context. Strategies include sliding windows (keep last N messages), token-based limits, or summary-based compression where older content is summarized.

// Conversation memory with summarization
class ConversationMemory:
    def __init__(self, max_tokens=4000):
        self.max_tokens = max_tokens
        self.messages = []
        self.summary = ""

    def add(self, message):
        self.messages.append(message)

        # If approaching the limit, fold older messages into the summary
        if self.token_count() > self.max_tokens:
            old_messages = self.messages[:-5]  # Keep last 5 intact
            self.summary = summarize(self.summary, old_messages)
            self.messages = self.messages[-5:]

    def get_context(self):
        if self.summary:
            return ([{"role": "system", "content": f"Previous context: {self.summary}"}]
                    + self.messages)
        return self.messages

Episodic Memory (Long-term Experiences)

Episodic memory stores specific past events and experiences. Each episode captures what happened, when, and what the outcome was. This enables the agent to recall relevant past experiences when facing similar situations. Typically implemented as a vector database with temporal metadata.

// Episodic memory structure
episode = {
  id: "ep_20240115_001",
  timestamp: 1705334400,
  trigger: "User asked to debug authentication issue",
  actions: [
    "Searched codebase for auth-related files",
    "Found bug in token validation",
    "Applied fix and tested"
  ],
  outcome: "success",
  learnings: ["Always check token expiry logic first for auth issues"],
  embedding: vector_768d  // For similarity search
}

// Retrieval: find similar past experiences
relevant_episodes = vector_search(
  query_embedding=embed(current_situation),
  filter={"outcome": "success"},
  top_k=3
)
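The retrieval call above reduces to cosine similarity between the query embedding and each stored episode's embedding. A pure-Python sketch (real systems use a vector database or NumPy; the field names match the episode structure above, and the `outcome` filter is a simplification of the generic `filter` argument):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def vector_search(query_embedding, episodes, top_k=3, outcome=None):
    # Apply the metadata filter first, then rank by similarity
    candidates = [e for e in episodes if outcome is None or e["outcome"] == outcome]
    ranked = sorted(candidates,
                    key=lambda e: cosine(query_embedding, e["embedding"]),
                    reverse=True)
    return ranked[:top_k]
```

Filtering before ranking is the usual order: it keeps failed episodes from crowding out relevant successes in the top-k cut.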

Semantic Memory (Knowledge Base)

Semantic memory stores general knowledge and facts, independent of when they were learned. This includes project documentation, user preferences, domain knowledge, and extracted facts. Unlike episodic memory, semantic memory is organized by meaning rather than temporal sequence.

// Semantic memory as knowledge graph + vector store
knowledge_base = {
  entities: [
    {type: "project", name: "auth-service", language: "rust", ...},
    {type: "api", name: "/api/login", method: "POST", ...},
    {type: "preference", user: "alice", setting: "verbose_output", value: true}
  ],

  relations: [
    {from: "auth-service", relation: "exposes", to: "/api/login"},
    {from: "alice", relation: "owns", to: "auth-service"}
  ],

  facts: [  // Vector-indexed for retrieval
    {content: "Auth tokens expire after 24 hours", embedding: [...]},
    {content: "Rate limit is 100 requests per minute", embedding: [...]}
  ]
}

Working Memory (Active Context)

Working memory holds information actively being used for the current task. This includes the current goal, intermediate results, and scratchpad for reasoning. Unlike conversation memory, working memory is structured around the task rather than the dialogue.

// Working memory for a multi-step task
working_memory = {
  goal: "Generate quarterly report for Q3 2024",

  plan: [
    {step: 1, action: "fetch_data", status: "complete", result: "..."},
    {step: 2, action: "analyze", status: "in_progress"},
    {step: 3, action: "visualize", status: "pending"},
    {step: 4, action: "write_summary", status: "pending"}
  ],

  scratchpad: {
    total_revenue: 1250000,
    regions_processed: ["NA", "EU"],
    pending_questions: ["Should APAC be included?"]
  },

  context_files: ["/data/q3_sales.csv", "/templates/report.md"]
}

Execution Loops and Computational Cost

Agent execution follows an observe-think-act loop that repeats until the task is complete or a termination condition is met. Each iteration involves at least one LLM inference call, making computational cost a critical consideration.

// Standard agent execution loop
async function agentLoop(goal: string, maxIterations: number = 10) {
  let state = initializeState(goal);
  let iteration = 0;

  while (iteration < maxIterations) {
    iteration++;

    // 1. Observe: Gather current context
    const observation = await gatherContext(state);

    // 2. Think: LLM decides next action (INFERENCE CALL)
    const decision = await llm.generate({
      system: AGENT_SYSTEM_PROMPT,
      messages: formatMessages(state, observation),
      tools: AVAILABLE_TOOLS
    });

    // 3. Check for completion
    if (decision.action === "finish") {
      return {success: true, result: decision.result, iterations: iteration};
    }

    // 4. Act: Execute the chosen tool
    const result = await executeTool(decision.action, decision.arguments);

    // 5. Update state with result
    state = updateState(state, decision, result);

    // 6. Optional: Reflect on progress (ANOTHER INFERENCE CALL)
    if (iteration % 3 === 0) {
      state = await reflectOnProgress(state);
    }
  }

  return {success: false, error: "Max iterations exceeded", state};
}

Cost Analysis

A single user request to an agent might trigger many LLM calls. Consider a task that requires 8 iterations, with reflection every 3 steps:

Component             Calls   Tokens (est.)
Main reasoning loop   8       8 x 2000 = 16,000
Reflection calls      2       2 x 1500 = 3,000
Memory retrieval      8       8 x 500 = 4,000
Total                 18      ~23,000 tokens

This cost multiplication is why agent efficiency matters. For local deployment, each LLM call consumes GPU memory and compute time. Poorly designed agents that loop excessively or maintain excessive context can become impractical to run.

Optimization: Early Exit

Detect when the agent is looping or making no progress. Implement confidence thresholds and similarity checks on consecutive actions.

Optimization: Context Pruning

Only include relevant history in each call. Use summarization and selective retrieval rather than full context every time.

Optimization: Parallel Tools

When multiple independent tool calls are needed, execute them in parallel rather than sequentially. Reduces wall-clock time.

Optimization: Caching

Cache tool results and LLM responses for identical inputs. Many agent tasks involve repeated similar queries.
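A minimal caching wrapper keys on the tool name plus its arguments. This sketch assumes tools take JSON-serializable keyword arguments; the wrapper shape is illustrative:

```python
import json

def cached(tool_fn, cache):
    # Key on the tool name plus its keyword arguments
    def wrapper(**kwargs):
        key = (tool_fn.__name__, json.dumps(kwargs, sort_keys=True))
        if key not in cache:
            cache[key] = tool_fn(**kwargs)
        return cache[key]
    return wrapper
```

Note the caveat: caching is only safe for tools that are effectively pure for the task's duration (a web search, a file read during one run), not for tools with side effects like writes or notifications.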

When to Use Raw LLM vs Agent

Not every task requires an agent. Adding the agent layer introduces complexity, latency, and potential failure modes. Use the right tool for the job.

Use a Raw LLM When:

  1. The task is purely generative: Writing, summarization, translation, code completion. No external data or actions needed.
  2. All context fits in one prompt: The information needed is small enough to include directly. No retrieval required.
  3. Single-turn interaction: User asks, model responds, done. No iteration or verification needed.
  4. Latency is critical: Raw LLM calls are faster than agent loops. For real-time applications, simpler is better.
  5. You control the integration: The application code handles any necessary tool use or multi-step logic.

// Good use case for raw LLM: code explanation
const explanation = await llm.generate({
  system: "You are a code explanation assistant.",
  messages: [{
    role: "user",
    content: `Explain this function:

${codeSnippet}`
  }]
});

// No tools, no iteration, no external data needed
return explanation;

Use an Agent When:

  1. The task requires external actions: Database queries, API calls, file operations, code execution. The LLM must affect the world.
  2. Information must be gathered dynamically: The model doesn't know upfront what data it needs—discovery is part of the task.
  3. Multi-step reasoning with verification: Complex tasks where the model should check its work and iterate on failures.
  4. State must persist across sessions: Long-running tasks, user preferences, learned behaviors that should carry over.
  5. The action space is complex: Many possible tools, conditional logic, or domain-specific operations the LLM should orchestrate.

// Good use case for agent: research task
const result = await agent.run({
  goal: "Find the top 3 competitors to Acme Corp and summarize their pricing",
  tools: [webSearch, readPage, extractData, writeReport],
  maxIterations: 15
});

// Agent will:
// 1. Search for "Acme Corp competitors"
// 2. Read several pages to identify competitors
// 3. Search for each competitor's pricing page
// 4. Extract pricing information
// 5. Synthesize findings into summary
// Each step may require multiple sub-steps and verification

Decision Matrix

Factor                          Raw LLM      Agent
External tools needed           No           Yes
Dynamic information gathering   No           Yes
Multi-step with feedback        No           Yes
Cross-session persistence       No           Yes
Latency sensitive               Preferred    Slower
Cost sensitive                  Lower        Higher (N calls)
Failure handling                External     Self-correcting

Real-World Examples

Example 1: Code Review (Raw LLM)

A developer submits a code diff for review. The entire diff fits in context, and the task is purely analytical—no external data or actions needed.

// Raw LLM approach - simple and efficient
async function reviewCode(diff: string): Promise<Review> {
  const response = await llm.generate({
    system: `You are a senior code reviewer. Analyze the diff for:
      - Bugs or logic errors
      - Security issues
      - Performance concerns
      - Style violations
      Provide specific, actionable feedback.`,
    messages: [{
      role: "user",
      content: `Review this diff:

${diff}`
    }],
    temperature: 0.3  // Low temperature for consistent, focused output
  });

  return parseReview(response);
}

// Single LLM call, ~2000 tokens, <2 seconds
// No agent overhead needed

Example 2: Bug Investigation (Agent)

A user reports: "The login page is broken." The agent must investigate by examining logs, checking code, testing hypotheses, and proposing fixes.

// Agent approach - necessary for investigation
const investigation = await agent.run({
  goal: "Investigate and fix the broken login page",

  tools: {
    read_logs: (filter) => fetchLogs("auth-service", filter),
    search_code: (query) => grepCodebase(query),
    read_file: (path) => readFile(path),
    run_test: (testName) => executeTest(testName),
    git_blame: (file, line) => getBlameInfo(file, line),
    propose_fix: (file, changes) => createPatch(file, changes)
  },

  maxIterations: 20
});

// Typical execution trace:
// 1. read_logs({level: "error", service: "auth", last: "1h"})
//    -> Found: "TypeError: Cannot read property 'token' of undefined"
// 2. search_code("property 'token'")
//    -> Found references in auth/validate.ts, auth/session.ts
// 3. read_file("auth/validate.ts")
//    -> Identified null check missing on line 47
// 4. git_blame("auth/validate.ts", 47)
//    -> Changed 2 days ago in commit abc123
// 5. run_test("auth.validate.test")
//    -> Confirmed: test fails with null session
// 6. propose_fix("auth/validate.ts", {line: 47, change: "add null check"})

// 12 LLM calls, ~15,000 tokens, ~30 seconds
// Could not be done without agent capabilities

Example 3: Data Pipeline (Hybrid)

Sometimes the optimal approach combines both patterns. Use agents for complex orchestration, raw LLM calls for individual processing steps.

// Hybrid approach - agent orchestrates, raw LLM processes
async function analyzeCustomerFeedback(feedbackBatch: Feedback[]) {
  // Agent handles orchestration and decision-making
  const agent = createAgent({
    goal: "Analyze customer feedback and generate insights report",

    tools: {
      // This tool uses RAW LLM for each item - no agent overhead
      classify_feedback: async (items: Feedback[]) => {
        return Promise.all(items.map(item =>
          llm.generate({
            messages: [{role: "user", content: `Classify: ${item.text}`}],
            temperature: 0.1  // Deterministic classification
          })
        ));
      },

      // Raw LLM for summarization
      summarize_category: async (category: string, items: Feedback[]) => {
        return llm.generate({
          messages: [{
            role: "user",
            content: `Summarize these ${category} feedback items: ${JSON.stringify(items)}`
          }]
        });
      },

      // Actual tool actions
      save_to_db: (report) => database.insert("reports", report),
      notify_team: (channel, message) => slack.post(channel, message)
    }
  });

  return agent.run();
}

// Agent makes ~5 orchestration calls
// Raw LLM makes ~100 classification calls (in parallel)
// Best of both worlds: intelligent orchestration + efficient processing

Local Deployment Considerations

For OnyxLab's local-first philosophy, the distinction between LLMs and agents takes on additional dimensions. Running locally means managing your own compute resources and dealing with constraints that cloud deployments abstract away.

Memory Constraints

Local GPUs have fixed VRAM. Agent loops that maintain large contexts can exhaust memory. Design for context efficiency.

Inference Speed

Local inference is slower than cloud. Agent loops multiply this latency. Consider user experience for interactive agents.

Model Selection

Smaller local models may need more agent iterations to match larger model performance. Balance model size against loop efficiency.

Offline Capability

Agents with local tools work offline. Those depending on external APIs don't. Design tool sets for your connectivity requirements.

OnyxLab's tooling is designed to address these constraints—providing efficient agent runtimes optimized for local hardware, memory-conscious context management, and tool frameworks that work fully offline while remaining extensible for cloud integration when available.

Summary

  • LLMs are stateless text predictors. They process tokens through transformer layers using attention mechanisms. Context windows bound their "memory." Sampling parameters control output randomness.
  • Agents are systems built around LLMs. They add tool use, memory, and execution loops. Architectures like ReAct, Plan-and-Execute, and MRKL provide different tradeoffs.
  • Memory systems enable persistence. Conversation, episodic, semantic, and working memory each serve different purposes. Proper memory design is essential for capable agents.
  • Execution loops multiply costs. Each iteration requires LLM inference. Optimization through caching, parallelism, and early exit matters—especially for local deployment.
  • Choose the right abstraction. Use raw LLMs for simple, single-turn generation. Use agents when tasks require external actions, dynamic discovery, or persistent state.