Understanding Agents
An AI agent is a system that perceives its environment, reasons about goals, and takes action to achieve them. The language model provides the reasoning capability; the agent framework provides everything else. This document provides a comprehensive technical deep-dive into agent architectures, patterns, and implementation considerations for local-first deployment.
Core Components: A Deep Dive
Every agent architecture, regardless of implementation, contains four fundamental components. The sophistication of each component determines the agent's capabilities. Understanding these components at a granular level is essential for building robust local agents.
Reasoning Engine
The language model that interprets input, plans actions, and generates responses. This is where understanding and decision-making occur. For local deployment, this is typically a quantized model running on consumer hardware—Llama, Mistral, Qwen, or similar open-weight models.
The reasoning engine's effectiveness depends on several factors:
- Model capability: Larger models generally reason better, but quantization and architecture matter. A well-tuned 7B model can outperform a poorly prompted 70B model.
- Context window: Determines how much information the model can consider at once. Critical for complex multi-step tasks. Local models typically range from 4K to 128K tokens.
- Instruction following: The model must reliably follow structured output formats. Fine-tuned chat models are generally better than base models for agent tasks.
- Inference speed: Agent loops require multiple LLM calls. At 10 tokens/second on CPU, a task requiring 50 calls becomes painfully slow. GPU acceleration is strongly recommended.
{
  "model": "llama-3.1-8b-instruct",
  "quantization": "Q4_K_M",
  "context_length": 8192,
  "temperature": 0.1,   // Low temp for consistent tool selection
  "gpu_layers": 35,     // Offload to GPU for speed
  "num_predict": 1024   // Max tokens per generation
}
Tool Interface
The mechanisms through which the agent affects its environment. Tools can be anything: database queries, API calls, file operations, code execution, web browsing. The agent decides which tool to use based on the current goal and available options.
Tool design is one of the most underestimated aspects of agent development. Well-designed tools dramatically reduce agent errors. Key principles:
- Atomic operations: Each tool should do one thing well. Combine simple tools rather than building complex multi-function tools.
- Clear naming: Tool names should be unambiguous. search_documents is better than search.
- Rich descriptions: The LLM only knows what you tell it. Describe when to use each tool and what it returns.
- Validated inputs: Use strict schemas. Reject invalid parameters early with helpful error messages.
- Structured outputs: Return consistent formats. JSON is preferable to free-form text.
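These principles can be made concrete in code. The sketch below uses a hypothetical `Tool` wrapper (the name, fields, and `execute` behavior are illustrative, not a specific framework's API) that enforces atomic operations, early validation, and structured output:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Tool:
    """Hypothetical tool wrapper: atomic operation, strict inputs, structured output."""
    name: str
    description: str
    parameters: dict           # Simplified spec: param name -> {"required": bool, ...}
    fn: Callable[..., dict]    # Always returns a dict, never free-form text

    def execute(self, **kwargs: Any) -> dict:
        # Reject invalid parameters early with helpful error messages
        required = [p for p, spec in self.parameters.items() if spec.get("required")]
        missing = [p for p in required if p not in kwargs]
        if missing:
            return {"error": f"Missing required parameters: {missing}"}
        unknown = [p for p in kwargs if p not in self.parameters]
        if unknown:
            return {"error": f"Unknown parameters: {unknown}. Expected: {list(self.parameters)}"}
        return self.fn(**kwargs)

search_documents = Tool(
    name="search_documents",  # unambiguous, better than "search"
    description="Searches the local document store for relevant passages.",
    parameters={"query": {"type": "string", "required": True}},
    fn=lambda query: {"results": [], "query": query},  # stub backend for illustration
)
```

Returning errors as structured data rather than raising lets the agent see and reason about the failure in its next step.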
Memory System
Short-term memory (conversation context) and long-term memory (persistent storage) allow the agent to maintain state across interactions. Without memory, each interaction starts from zero. Vector databases enable semantic memory retrieval: the agent can recall relevant information based on meaning rather than exact match.
Memory architecture choices significantly impact agent behavior and performance. We explore these in detail in the Memory Architectures section below.
Execution Loop
The orchestration layer that ties everything together. It manages the cycle of perception, reasoning, action, and observation. Common patterns include ReAct (Reasoning + Acting), Plan-and-Execute, and Tree of Thoughts. The loop continues until the goal is achieved or a termination condition is met.
The execution loop handles critical concerns beyond simple iteration:
- Error recovery: What happens when a tool fails? The loop must decide whether to retry, try an alternative, or ask for help.
- Loop detection: Agents can get stuck in cycles. The loop must detect and break repetitive patterns.
- Resource limits: Maximum iterations, token budgets, and time limits prevent runaway execution.
- State management: Maintaining consistent state across async tool calls and potential failures.
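The resource-limit concern in particular is easy to sketch. The skeleton below is illustrative, not a complete loop: `step_fn` stands in for one perceive-reason-act cycle and is assumed to return a `(done, tokens_used)` pair.

```python
import time

def run_agent(step_fn, max_iterations=15, max_seconds=120.0, token_budget=20_000):
    """Minimal execution-loop skeleton with hard resource limits (sketch).

    step_fn() performs one perceive-reason-act cycle and returns
    (done, tokens_used) -- a stand-in for a real agent step.
    """
    start, tokens = time.monotonic(), 0
    for i in range(max_iterations):
        if time.monotonic() - start > max_seconds:
            return f"stopped: time limit after {i} steps"
        if tokens >= token_budget:
            return f"stopped: token budget after {i} steps"
        done, used = step_fn()
        tokens += used
        if done:
            return f"done in {i + 1} steps ({tokens} tokens)"
    return "stopped: max iterations reached"
```

All three limits are checked every iteration, so a runaway agent is stopped by whichever budget it exhausts first.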
How Tool Calling Actually Works
Tool calling is the mechanism by which language models interact with external systems. Understanding the internals demystifies agent behavior and helps you debug issues when tools are called incorrectly or not at all.
JSON Schema Tool Definitions
Tools are defined using JSON Schema, which describes the tool's name, purpose, and expected parameters. This schema is injected into the system prompt or provided via a dedicated API field (for models with native tool support).
{
  "name": "search_documents",
  "description": "Searches the local document store for relevant passages. Use when the user asks about information that might be in their documents. Returns a list of matching passages with source citations.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "The search query. Be specific and use relevant keywords."
      },
      "max_results": {
        "type": "integer",
        "description": "Maximum number of results to return. Default 5.",
        "default": 5
      },
      "filters": {
        "type": "object",
        "properties": {
          "date_after": {
            "type": "string",
            "format": "date",
            "description": "Only return documents created after this date"
          },
          "file_types": {
            "type": "array",
            "items": { "type": "string" },
            "description": "Filter by file extension, e.g., ['pdf', 'md']"
          }
        }
      }
    },
    "required": ["query"]
  }
}
The Tool Calling Flow
When the agent needs to call a tool, the following sequence occurs:
1. The model generates output containing a tool call (natively structured, or wrapped in an agreed-upon format).
2. The framework extracts and parses the call from the model output.
3. The call is validated against the tool's schema: known name, required parameters, correct types.
4. The tool executes and returns a result.
5. The result is appended to the context as an observation, and the model is invoked again to continue.
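A single round of this dispatch can be sketched as follows. This is illustrative only: it assumes the `<tool_call>` tag convention used later in this document, and `tools` is a plain dict mapping names to callables.

```python
import json

def tool_call_round(llm_output: str, tools: dict) -> str:
    """One round of the tool-calling flow (sketch)."""
    # 1. Model output may or may not contain a tool call
    if "<tool_call>" not in llm_output:
        return llm_output  # Plain answer, no dispatch needed
    # 2. Framework extracts and parses the call
    payload = llm_output.split("<tool_call>")[1].split("</tool_call>")[0]
    call = json.loads(payload)
    # 3. Validate the tool name before executing
    if call["name"] not in tools:
        return f"Error: unknown tool '{call['name']}'. Available: {list(tools)}"
    # 4. Execute and return the result as the next observation
    result = tools[call["name"]](**call.get("parameters", {}))
    return f"Observation: {json.dumps(result)}"
```

A real framework would add the robust parsing and validation described in the sections below; this shows only the shape of the loop body.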
Native vs Prompted Tool Calling
Native Tool Calling
Some models (GPT-4, Claude, Llama 3.1+) have been specifically trained to output tool calls in a structured format. The API accepts tool definitions and returns parsed tool calls directly.
Pros: More reliable parsing, better parameter handling, fewer hallucinated tool names. Cons: Requires model support, less flexibility in output format.
Prompted Tool Calling
For models without native support, tools are described in the system prompt with explicit instructions on output format. The agent framework must parse the response and extract tool calls.
Pros: Works with any model, highly customizable format. Cons: More prone to parsing errors, requires robust regex/parsing logic.
You have access to the following tools:

1. search_documents(query: str, max_results: int = 5)
   - Searches your document store for relevant passages
   - Use when users ask about their documents

2. calculator(expression: str)
   - Evaluates mathematical expressions
   - Use for any calculations

When you need to use a tool, respond with:
<tool_call>
{"name": "tool_name", "parameters": {"param1": "value1"}}
</tool_call>

Wait for the result before continuing.
Parsing and Validation
Robust parsing is essential for reliable agents. LLMs do not always produce perfectly formatted JSON, especially smaller local models. Your parsing layer should handle:
- Malformed JSON: Missing quotes, trailing commas, unescaped characters. Use lenient parsers or repair strategies.
- Partial tool calls: Model output was truncated. Detect and request completion or retry.
- Unknown tools: Model hallucinated a tool name. Return an error message describing available tools.
- Missing parameters: Required fields are absent. Return validation error with specific guidance.
- Wrong types: String instead of integer. Attempt type coercion where safe, error otherwise.
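The parsing sketch that follows calls a validate_tool_call helper. A minimal version covering unknown tools, missing parameters, and safe type coercion might look like this, assuming a simplified schema format (tool name mapped to expected Python types) rather than full JSON Schema:

```python
def validate_tool_call(data: dict, schemas: dict) -> dict:
    """Validate a parsed tool call (sketch; `schemas` maps tool name -> {param: type})."""
    name = data.get("name")
    if name not in schemas:
        # Unknown tool: describe what IS available so the model can recover
        raise ValueError(f"Unknown tool '{name}'. Available tools: {list(schemas)}")
    params = data.get("parameters", {})
    validated = {}
    for param, expected in schemas[name].items():
        if param not in params:
            raise ValueError(f"Missing required parameter '{param}' for {name}")
        value = params[param]
        if not isinstance(value, expected):
            try:
                value = expected(value)  # Safe coercion, e.g. "5" -> 5
            except (TypeError, ValueError):
                raise ValueError(f"'{param}' must be {expected.__name__}")
        validated[param] = value
    return {"name": name, "parameters": validated}
```

The error messages double as recovery hints: when fed back to the model as observations, they tell it exactly what to fix.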
import re
import json_repair  # Lenient third-party JSON parser (pip install json-repair)

def parse_tool_call(response: str) -> ToolCall | None:
    # Try multiple extraction patterns, most specific first
    patterns = [
        r'<tool_call>\s*({.*?})\s*</tool_call>',
        r'```json\s*({.*?})\s*```',
        r'(\{\s*"name":\s*"[^"]+",\s*"parameters".*?\})'
    ]
    for pattern in patterns:
        match = re.search(pattern, response, re.DOTALL)
        if match:
            try:
                data = json_repair.loads(match.group(1))  # Lenient parser repairs minor damage
                return validate_tool_call(data)
            except Exception:
                continue
    return None  # No valid tool call found
Memory Architectures
Memory gives agents the ability to maintain context, learn from interactions, and recall relevant information. Different memory types serve different purposes, and most production agents use a combination. Local deployment has specific memory considerations—you control where data lives and how it persists.
Buffer Memory (Conversation History)
The simplest form: store the full conversation in context. Every message sent and received is included in subsequent prompts. Works well for short conversations but quickly consumes the context window.
class BufferMemory:
    def __init__(self, max_messages: int = 50):
        self.messages = []
        self.max_messages = max_messages

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]

    def get_context(self) -> list[dict]:
        return self.messages.copy()
Use when: Short conversations, simple Q&A, demos. Avoid when: Long sessions, token-constrained environments.
Summary Memory
Instead of storing raw messages, periodically summarize the conversation and store the summary. This compresses information, preserving key points while reducing token usage. Requires an LLM call to generate summaries.
class SummaryMemory:
    def __init__(self, llm, summarize_threshold: int = 10):
        self.llm = llm
        self.summary = ""
        self.recent_messages = []
        self.threshold = summarize_threshold

    def add(self, role: str, content: str):
        self.recent_messages.append({"role": role, "content": content})
        if len(self.recent_messages) >= self.threshold:
            self._update_summary()

    def _update_summary(self):
        prompt = f"""Current summary: {self.summary}

New messages:
{self._format_messages()}

Write an updated summary that incorporates the new information."""
        self.summary = self.llm.complete(prompt)
        self.recent_messages = []

    def get_context(self) -> str:
        return f"Summary: {self.summary}\n\nRecent: {self._format_messages()}"
Use when: Long conversations, limited context. Trade-off: Loses detail, adds latency from summarization calls.
Vector Memory (Semantic Retrieval)
Store messages and facts as embeddings in a vector database. Retrieve relevant memories based on semantic similarity to the current query. Enables recall across long time horizons and large knowledge bases.
from datetime import datetime
from uuid import uuid4

class VectorMemory:
    def __init__(self, embed_model, vector_store):
        self.embed = embed_model
        self.store = vector_store

    def add(self, content: str, metadata: dict = None):
        embedding = self.embed.encode(content)
        self.store.upsert({
            "id": str(uuid4()),
            "embedding": embedding,
            "content": content,
            "metadata": metadata or {},
            "timestamp": datetime.now().isoformat()
        })

    def retrieve(self, query: str, k: int = 5) -> list[str]:
        query_embedding = self.embed.encode(query)
        results = self.store.search(query_embedding, top_k=k)
        return [r["content"] for r in results]

    def get_context(self, query: str) -> str:
        memories = self.retrieve(query)
        return "Relevant memories:\n" + "\n".join(f"- {m}" for m in memories)
Use when: Large knowledge bases, long-running agents, information retrieval. Requires: Embedding model, vector database (ChromaDB, Qdrant, etc.).
Entity Memory
Extract and track entities (people, places, concepts) and their properties across conversations. Maintains a structured knowledge graph that can be queried and updated. Useful for agents that need to track evolving information about specific subjects.
import json

class EntityMemory:
    def __init__(self, llm):
        self.llm = llm
        self.entities = {}  # entity_name -> {properties}

    def extract_entities(self, text: str) -> list[dict]:
        prompt = f"""Extract entities and their properties from this text:
"{text}"
Return JSON: [{{"name": "...", "type": "...", "properties": {{}}}}]"""
        return json.loads(self.llm.complete(prompt))

    def update(self, text: str):
        entities = self.extract_entities(text)
        for entity in entities:
            name = entity["name"]
            if name in self.entities:
                self.entities[name]["properties"].update(entity["properties"])
            else:
                self.entities[name] = entity

    def get_context(self, relevant_entities: list[str]) -> str:
        context = []
        for name in relevant_entities:
            if name in self.entities:
                context.append(f"{name}: {json.dumps(self.entities[name])}")
        return "Known entities:\n" + "\n".join(context)
Use when: Tracking people, projects, evolving facts. Trade-off: Extraction adds latency and can be error-prone.
Combining Memory Types
Production agents often layer multiple memory types. A common pattern:
class HybridMemory:
    def __init__(self, llm, embed_model, store):
        self.buffer = BufferMemory(max_messages=10)     # Recent context
        self.summary = SummaryMemory(llm)               # Compressed history
        self.vector = VectorMemory(embed_model, store)  # Semantic retrieval
        self.entities = EntityMemory(llm)               # Structured facts

    def get_context(self, query: str) -> str:
        return f"""
## Conversation Summary
{self.summary.get_context()}

## Relevant Memories
{self.vector.get_context(query)}

## Known Entities
{self.entities.get_context(self._extract_mentioned_entities(query))}

## Recent Messages
{self.buffer.get_context()}
"""
The ReAct Pattern in Detail
ReAct (Reasoning + Acting) is the most widely used agent pattern. The model explicitly verbalizes its reasoning before each action, creating a chain of Thought-Action-Observation steps. This pattern emerged from the ReAct paper (Yao et al., 2022) and has become the foundation for most agent frameworks.
The ReAct Loop Structure
Question: What was the revenue growth for Acme Corp last quarter?

Thought: I need to find financial information about Acme Corp. I should search
the documents for their quarterly report or revenue data.
Action: search_documents
Action Input: {"query": "Acme Corp revenue quarterly report Q3 2024"}
Observation: Found 3 results:
1. "Q3 2024 Report: Acme Corp reported revenue of $4.2M, up from $3.8M in Q2..."
2. "Financial Summary: Year-over-year growth of 15%..."

Thought: I found the revenue figures. Q3 was $4.2M and Q2 was $3.8M.
Let me calculate the quarter-over-quarter growth.
Action: calculator
Action Input: {"expression": "((4.2 - 3.8) / 3.8) * 100"}
Observation: 10.526315789473683

Thought: The quarter-over-quarter revenue growth was approximately 10.5%.
I have enough information to answer the question.
Action: final_answer
Action Input: {"answer": "Acme Corp's revenue grew by 10.5% last quarter,
from $3.8M in Q2 to $4.2M in Q3 2024."}
ReAct System Prompt Template
SYSTEM_PROMPT = """You are a helpful assistant with access to tools.

Available tools:
{tool_descriptions}

To use a tool, you MUST use this exact format:

Thought: [Your reasoning about what to do next]
Action: [tool_name]
Action Input: [JSON parameters]

After receiving an Observation, continue with another Thought.

When you have enough information to answer, use:

Thought: [Summary of what you learned]
Action: final_answer
Action Input: {{"answer": "[Your complete response]"}}

Important:
- Always think before acting
- Use tools to verify information, don't guess
- If a tool fails, try a different approach
- Be concise in your reasoning
"""
Implementing ReAct
class ReActAgent:
    def __init__(self, llm, tools: list[Tool], max_iterations: int = 10):
        self.llm = llm
        self.tools = {t.name: t for t in tools}
        self.max_iterations = max_iterations

    def run(self, query: str) -> str:
        prompt = self._build_prompt(query)
        trajectory = []
        for i in range(self.max_iterations):
            # Generate thought and action
            response = self.llm.complete(prompt)
            trajectory.append(response)
            # Parse action
            action = self._parse_action(response)
            if action is None:
                prompt += f"\n{response}\nThought: I need to use a proper action format."
                continue
            # Check for final answer
            if action["name"] == "final_answer":
                return action["input"]["answer"]
            # Execute tool
            if action["name"] not in self.tools:
                observation = f"Error: Unknown tool '{action['name']}'. Available: {list(self.tools.keys())}"
            else:
                try:
                    observation = self.tools[action["name"]].execute(**action["input"])
                except Exception as e:
                    observation = f"Error: {str(e)}"
            # Append to prompt
            prompt += f"\n{response}\nObservation: {observation}\n"
        return "Max iterations reached without finding an answer."

    def _parse_action(self, response: str) -> dict | None:
        # Extract Action and Action Input from response
        action_match = re.search(r'Action:\s*(.+)', response)
        input_match = re.search(r'Action Input:\s*(.+)', response, re.DOTALL)
        if action_match and input_match:
            try:
                return {
                    "name": action_match.group(1).strip(),
                    "input": json.loads(input_match.group(1).strip())
                }
            except json.JSONDecodeError:
                return None  # Malformed parameters; caller re-prompts for correct format
        return None
ReAct Advantages and Limitations
Advantages
- Interpretable: You can follow the agent's reasoning step by step
- Flexible: Can adapt to unexpected situations by reasoning about them
- Self-correcting: Can recognize and recover from errors through reflection
- Simple to implement: Minimal infrastructure required
Limitations
- Greedy: Makes locally optimal decisions without global planning
- Token intensive: Repeated reasoning consumes many tokens
- Slow: Multiple LLM calls per task add latency
- Can loop: May repeat the same actions without making progress
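The token cost compounds because each LLM call resends the entire trajectory so far, so prompt tokens grow roughly quadratically with step count. A back-of-envelope estimate (the per-step and base-prompt sizes below are assumed, illustrative defaults):

```python
def react_prompt_tokens(steps: int, tokens_per_step: int = 150, base: int = 800) -> int:
    """Approximate cumulative prompt tokens for a ReAct run.

    Each of the `steps` LLM calls resends the base prompt plus all prior
    Thought/Action/Observation steps (tokens_per_step each).
    """
    return sum(base + i * tokens_per_step for i in range(steps))
```

At these assumed sizes, a 10-step run costs 14,750 prompt tokens, not 10 x 800, which is why summarizing intermediate steps matters for long tasks.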
Plan-and-Execute vs ReAct
Plan-and-Execute separates planning from execution. First, create a complete plan; then execute each step. This addresses ReAct's greedy local decisions with upfront global planning.
The Plan-and-Execute Pattern
Task: "Summarize the key points from all PDFs in my research folder
and create a comparison table."

=== PLANNING PHASE ===
Plan:
1. List all PDF files in the research folder
2. For each PDF, extract the main content
3. Identify key points from each document
4. Compare the key points across documents
5. Format findings as a comparison table
6. Return the final table to the user

=== EXECUTION PHASE ===
Executing Step 1: list_files(path="/research", filter="*.pdf")
Result: ["paper_a.pdf", "paper_b.pdf", "paper_c.pdf"]

Executing Step 2: extract_pdf_text(file="paper_a.pdf")
Result: "Abstract: This paper examines..."
[continues for each PDF]

Executing Step 3: analyze_content(text="...", task="extract key points")
Result: ["Key point 1", "Key point 2", ...]
[continues for each document]
...

=== REPLANNING (if needed) ===
Step 4 failed - documents have different structures.
Revised plan: Group documents by type before comparing.
Implementation Considerations
class PlanAndExecuteAgent:
    def __init__(self, planner_llm, executor_llm, tools):
        self.planner = planner_llm    # Can be larger/smarter model
        self.executor = executor_llm  # Can be smaller/faster model
        self.tools = tools

    def run(self, task: str) -> str:
        # Phase 1: Generate plan
        plan = self._generate_plan(task)
        results = []
        # Phase 2: Execute each step
        i = 0
        while i < len(plan):
            try:
                result = self._execute_step(plan[i], results)
                results.append({"step": plan[i], "result": result})
                i += 1
            except Exception as e:
                # Phase 3: Replan if a step fails. The revised plan replaces
                # the remaining steps and execution restarts on it.
                plan = self._replan(task, results, plan[i:], str(e))
                i = 0
        # Phase 4: Synthesize final answer
        return self._synthesize(task, results)

    def _generate_plan(self, task: str) -> list[str]:
        prompt = f"""Create a step-by-step plan to accomplish this task:
{task}

Available tools: {[t.name for t in self.tools]}

Return a numbered list of concrete steps. Each step should use exactly one tool.
"""
        response = self.planner.complete(prompt)
        return self._parse_plan(response)

    def _execute_step(self, step: str, prior_results: list) -> str:
        prompt = f"""Execute this step: {step}

Prior results: {json.dumps(prior_results, indent=2)}

Determine which tool to use and what parameters to pass."""
        # Execute using ReAct-like single-step execution
        return self._single_step_react(prompt)
When to Use Each Pattern
| Criterion | ReAct | Plan-and-Execute |
|---|---|---|
| Task complexity | Simple to moderate | Complex, multi-step |
| Predictability | Unpredictable tasks | Predictable workflow |
| Error recovery | Implicit, per-step | Explicit replanning |
| Token efficiency | Lower (repeated context) | Higher (plan once) |
| Latency | Varies widely | More predictable |
| Model requirements | Single model | Can use two models |
Multi-Agent Systems and Orchestration
Multi-agent systems use multiple specialized agents that collaborate on complex tasks. Each agent has a specific role, expertise, or perspective. Orchestration patterns determine how agents communicate and coordinate.
Hierarchical Orchestration
A supervisor agent decomposes tasks and delegates to specialist agents. The supervisor aggregates results and handles coordination. This is the most common pattern.
Supervisor: "Write a blog post about local AI"
├─→ Researcher: Gathers facts and sources
├─→ Writer: Creates initial draft
├─→ Editor: Refines and improves
└─→ Fact-checker: Verifies claims
Supervisor: Combines and returns final post
Peer-to-Peer Collaboration
Agents communicate directly with each other without a central coordinator. Each agent can request help from or provide information to others. More flexible but harder to debug.
Coder: "I need the API specification"
└─→ Architect: "Here's the OpenAPI spec..."
Coder: "What's the authentication method?"
└─→ Security: "Use JWT with these claims..."
Coder: "Tests are failing on edge case"
└─→ Tester: "Here's a fix for that case..."
Debate and Consensus
Multiple agents independently work on the same problem and then debate their solutions. Useful for reducing errors and getting diverse perspectives on ambiguous questions.
Question: "Should we use PostgreSQL or MongoDB?"

Agent A (Relational): "PostgreSQL because ACID compliance..."
Agent B (Document): "MongoDB for flexible schemas..."
Agent C (Pragmatist): "PostgreSQL with JSONB columns..."

Moderator: "Agent C's hybrid approach addresses both concerns.
Final recommendation: PostgreSQL with JSONB."
Multi-Agent Implementation
class MultiAgentOrchestrator:
    def __init__(self, agents: dict[str, Agent]):
        self.agents = agents
        self.message_queue = []
        self.shared_memory = {}

    def run_hierarchical(self, task: str, supervisor: str = "supervisor") -> str:
        # Supervisor creates the plan
        plan = self.agents[supervisor].plan(task)
        results = {}
        for step in plan:
            agent_name = step["assigned_to"]
            subtask = step["task"]
            context = {
                "prior_results": results,
                "shared_memory": self.shared_memory
            }
            result = self.agents[agent_name].execute(subtask, context)
            results[step["id"]] = result
            # Update shared memory
            self._update_shared_memory(agent_name, result)
        # Supervisor synthesizes final answer
        return self.agents[supervisor].synthesize(task, results)

    def run_debate(self, question: str, debaters: list[str], rounds: int = 2) -> str:
        positions = {}
        # Initial positions
        for agent_name in debaters:
            positions[agent_name] = self.agents[agent_name].answer(question)
        # Debate rounds
        for round_num in range(rounds):
            for agent_name in debaters:
                other_positions = {k: v for k, v in positions.items() if k != agent_name}
                positions[agent_name] = self.agents[agent_name].respond(
                    question,
                    own_position=positions[agent_name],
                    other_positions=other_positions
                )
        # Reach consensus
        return self.agents["moderator"].synthesize_consensus(question, positions)
Multi-Agent Trade-offs
Multi-agent systems add significant complexity. Consider these trade-offs:
- Coordination overhead: Agents must understand each other. Misunderstandings between agents are common and hard to debug.
- Latency multiplication: Each agent call is an LLM inference. A 5-agent system with 3 turns each means 15+ LLM calls minimum.
- Context fragmentation: Each agent has partial context. Important information can be lost in handoffs.
- Debugging difficulty: Tracing issues across agent boundaries requires comprehensive logging.
- Local resource constraints: Running multiple agents on local hardware requires careful resource management.
Agent Failure Modes and Mitigations
Agents fail in predictable ways. Understanding these failure modes helps you build more robust systems. The mitigations are often more important than the initial implementation.
Infinite Loops
The agent repeats the same actions without making progress. Often caused by ambiguous goals, tool errors that the agent doesn't recognize, or context that doesn't accumulate properly.
Mitigations:
- Set hard iteration limits (typically 10-20 for most tasks)
- Track action history and detect repetition patterns
- Inject "you seem to be stuck" prompts after repeated similar actions
- Force the agent to try a different approach after N failures
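The second mitigation, tracking action history for repetition, can be sketched in a few lines. This assumes the loop records each step as a (tool_name, canonical_input) tuple; the window and threshold values are illustrative defaults:

```python
from collections import Counter

def detect_repetition(history: list[tuple], window: int = 6, threshold: int = 3) -> bool:
    """True when any (tool, input) pair recurs `threshold`+ times in the recent window.

    history holds (tool_name, canonical_input) tuples recorded by the loop.
    """
    counts = Counter(history[-window:])
    return any(n >= threshold for n in counts.values())
```

When this fires, the loop can inject a "you seem to be stuck, try a different approach" message rather than executing the repeated call again.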
Hallucinated Tool Calls
The agent invents tools that don't exist or uses incorrect parameters. More common with smaller models and prompted (vs native) tool calling.
Mitigations:
- Return clear error messages listing available tools
- Use few-shot examples of correct tool usage in the system prompt
- Validate all parameters against schemas before execution
- Consider constrained decoding to force valid tool names
Context Window Overflow
Long-running agents accumulate context until they exceed the model's limit. This causes either errors or silent truncation of important early context.
Mitigations:
- Track token usage continuously
- Summarize intermediate steps when context grows large
- Use sliding window over conversation history
- Store detailed results in memory, pass only summaries in context
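A token-aware sliding window combines the first two mitigations. The sketch below is illustrative: `count_tokens` is any tokenizer-backed counter, and it keeps the most recent messages that fit the budget, dropping older ones first:

```python
def sliding_window(messages: list[dict], count_tokens, budget: int) -> list[dict]:
    """Keep the most recent messages that fit within `budget` tokens (sketch)."""
    kept, used = [], 0
    for msg in reversed(messages):           # Walk backward from the newest message
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break                            # Everything older is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))              # Restore chronological order
```

A production version would also pin "anchor" messages (the system prompt, the original task) so they survive the window, as described in the context-management section below.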
Premature Termination
The agent declares success before completing the task. Often happens when the agent misunderstands the goal or gives up after minor difficulties.
Mitigations:
- Include clear success criteria in the prompt
- Add a verification step before final answer
- Ask the agent to explain why the task is complete
- Use a separate model to validate completion
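The verification step is cheap to add. A sketch, using the same minimal `llm.complete` interface as the other examples here (the prompt wording and COMPLETE/INCOMPLETE convention are assumptions, not a standard):

```python
VERIFY_PROMPT = """Task: {task}
Proposed final answer: {answer}

Does the answer fully accomplish the task? Reply with exactly COMPLETE or
INCOMPLETE, followed by a one-line reason."""

def verify_completion(llm, task: str, answer: str) -> bool:
    """Ask a (possibly separate) model to check the answer before accepting it."""
    verdict = llm.complete(VERIFY_PROMPT.format(task=task, answer=answer))
    return verdict.strip().upper().startswith("COMPLETE")
```

If verification fails, feed the reason back to the agent as an observation and let it continue, rather than returning the incomplete answer to the user.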
Cascading Tool Failures
One tool failure causes a cascade of subsequent failures because the agent doesn't properly handle the error and continues with invalid data.
Mitigations:
- Return structured error responses with recovery suggestions
- Distinguish between retryable and fatal errors
- Give the agent explicit error-handling tools
- Checkpoint state before risky operations
Observability and Debugging
Agents are notoriously difficult to debug. Their behavior emerges from the interaction of prompts, model responses, tool outputs, and orchestration logic. Comprehensive observability is essential for development and production.
Essential Logging
Every agent run should capture enough information to fully reconstruct what happened:
from dataclasses import dataclass
from datetime import datetime
from typing import Literal

@dataclass
class TraceStep:
    step_number: int
    step_type: Literal["llm_call", "tool_call", "memory_access"]
    input: str
    output: str
    tokens_used: int
    duration_ms: int
    metadata: dict  # model params, tool name, etc.

@dataclass
class AgentTrace:
    run_id: str
    timestamp: datetime
    user_query: str
    steps: list[TraceStep]
    final_output: str
    total_tokens: int
    total_duration_ms: int
    status: Literal["success", "error", "timeout"]
Debugging Strategies
Trace Replay
Save complete traces and replay them with modified prompts or tools. Allows testing fixes without re-running the expensive parts.
Step-Through Execution
Add breakpoints where you can inspect state and optionally override the next action. Essential for understanding complex failures.
Prompt Diffing
Compare prompts between successful and failed runs. Often reveals subtle context differences that change agent behavior.
Temperature Sweeps
Run the same query at different temperatures. If behavior varies wildly, the prompt may be ambiguous or the task may be too hard for the model.
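A sweep like this is easy to automate. The sketch below assumes an `llm.complete(prompt, temperature=...)` interface (as used elsewhere in this document) and reports how many distinct answers each temperature produces:

```python
def temperature_sweep(llm, query: str, temps=(0.0, 0.3, 0.7, 1.0), runs: int = 3) -> dict:
    """Run the same query at several temperatures; report answer diversity (sketch)."""
    report = {}
    for t in temps:
        answers = [llm.complete(query, temperature=t) for _ in range(runs)]
        report[t] = len(set(answers))  # 1 = stable, `runs` = maximally varied
    return report
```

High diversity at low temperatures is the red flag: it suggests the prompt is ambiguous or the task exceeds the model's capability.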
Production Monitoring
In production, track these metrics to catch issues before they become critical:
# Key agent metrics
metrics = {
    # Performance
    "avg_steps_per_task": gauge(),
    "avg_tokens_per_task": gauge(),
    "avg_duration_ms": gauge(),
    "p95_duration_ms": gauge(),
    # Reliability
    "success_rate": gauge(),
    "timeout_rate": gauge(),
    "loop_detection_rate": gauge(),
    # Tool usage
    "tool_call_count": counter(labels=["tool_name"]),
    "tool_error_rate": gauge(labels=["tool_name"]),
    # Resource usage
    "context_window_utilization": histogram(),
    "memory_retrieved_per_query": histogram(),
}
Token Budgeting and Context Management
Context windows are finite and tokens are the currency of agent computation. Effective context management is the difference between an agent that handles complex tasks and one that fails on anything beyond simple queries.
Context Window Allocation
For a model with an 8K context window, a typical allocation:
CONTEXT_BUDGET = 8192

# Fixed allocations
SYSTEM_PROMPT = 800   # Tools, instructions, persona
SAFETY_MARGIN = 200   # Buffer for tokenization variance

# Dynamic allocations (remaining: 7192 tokens, before the query is counted)
def allocate_context(query_tokens: int) -> dict:
    remaining = CONTEXT_BUDGET - SYSTEM_PROMPT - SAFETY_MARGIN - query_tokens
    return {
        "memory": int(remaining * 0.3),         # ~2150 tokens
        "conversation": int(remaining * 0.3),   # ~2150 tokens
        "tool_results": int(remaining * 0.25),  # ~1800 tokens
        "generation": int(remaining * 0.15),    # ~1080 tokens (output)
    }
Context Management Strategies
- Aggressive summarization: Summarize tool outputs immediately. A 10,000-character search result can often be reduced to 200 characters of relevant information.
- Sliding window with anchors: Keep the system prompt and most recent N messages, but also preserve "anchor" messages that contain critical context.
- Hierarchical context: Store full details in external memory. Pass only summaries in context. The agent can request full details when needed.
- Lazy loading: Don't retrieve all relevant memories upfront. Fetch additional context only when the agent explicitly requests it.
- Token counting before calls: Always count tokens before sending to the model. Truncate or summarize if you'll exceed the limit.
class ContextManager:
    def __init__(self, max_tokens: int, tokenizer):
        self.max_tokens = max_tokens
        self.tokenizer = tokenizer
        self.allocations = self._default_allocations()

    def fit_to_budget(self, components: dict[str, str]) -> dict[str, str]:
        """Ensure all components fit within budget."""
        result = {}
        remaining = self.max_tokens
        # Fixed components first (system prompt)
        for name in ["system_prompt"]:
            tokens = self._count(components[name])
            if tokens > self.allocations[name]:
                raise ValueError(f"{name} exceeds allocation")
            result[name] = components[name]
            remaining -= tokens
        # Dynamic components with truncation
        for name in ["memory", "conversation", "tool_results"]:
            content = components.get(name, "")
            budget = min(self.allocations[name], remaining - 500)
            result[name] = self._truncate(content, budget)
            remaining -= self._count(result[name])
        return result

    def _truncate(self, text: str, max_tokens: int) -> str:
        tokens = self.tokenizer.encode(text)
        if len(tokens) <= max_tokens:
            return text
        # Keep beginning and end, mark truncation
        half = (max_tokens - 10) // 2
        return (self.tokenizer.decode(tokens[:half])
                + "\n...[truncated]...\n"
                + self.tokenizer.decode(tokens[-half:]))
Prompt Engineering for Agents
Agent prompts differ from standard LLM prompts. They must define behavior patterns, tool usage conventions, and output formats. Small changes in agent prompts can have large effects on reliability.
System Prompt Structure
AGENT_SYSTEM_PROMPT = """
# Identity and Role
You are a research assistant with access to a local document store and web search.
Your goal is to provide accurate, well-sourced answers to questions.

# Core Behaviors
- Always verify claims using tools before stating them as fact
- Cite sources for all factual statements
- Acknowledge uncertainty when information is incomplete
- Ask clarifying questions when the request is ambiguous

# Tool Usage Guidelines
You have access to these tools:

1. search_documents(query: str) -> list[Document]
   Use when: User asks about their personal documents
   Returns: List of relevant passages with source info

2. web_search(query: str) -> list[Result]
   Use when: User asks about current events or general knowledge
   Returns: Search results with titles, snippets, and URLs

3. calculator(expression: str) -> float
   Use when: Any mathematical calculation is needed
   Returns: The computed result

# Output Format
When using tools, follow this exact format:

Thought: [Your reasoning]
Action: [tool_name]
Action Input: {"param": "value"}

After receiving a result:

Observation: [Tool result will appear here]
Thought: [Your analysis of the result]

When ready to answer:

Thought: [Summary of findings]
Final Answer: [Your response to the user]

# Error Handling
If a tool returns an error:
1. Acknowledge the error
2. Try an alternative approach
3. If no alternative exists, explain the limitation

# Examples
[Include 2-3 complete examples of ideal behavior]
"""
Few-Shot Examples
Few-shot examples dramatically improve agent reliability, especially for local models. Include complete trajectories showing correct behavior:
# Example 1: Document Search
User: What did the Q3 report say about customer churn?
Thought: The user wants information from a specific document (Q3 report).
I should search their documents for churn-related information.
Action: search_documents
Action Input: {"query": "Q3 report customer churn rate"}
Observation: [1 result] "Q3 2024 Report, page 12: Customer churn decreased
to 2.3% from 2.8% in Q2, driven by improvements in onboarding..."
Thought: I found the relevant information. The Q3 report shows churn
decreased from 2.8% to 2.3%.
Final Answer: According to your Q3 2024 report, customer churn decreased
to 2.3%, down from 2.8% in Q2. The report attributes this improvement to
changes in the onboarding process.
---
# Example 2: Calculation Required
User: If we grew 15% in Q3 and 20% in Q4, what's our total growth?
Thought: I need to calculate compound growth. Two consecutive growth
periods compound multiplicatively, not additively.
Action: calculator
Action Input: {"expression": "((1.15 * 1.20) - 1) * 100"}
Observation: 38.0
Thought: The compound growth is 38%, not 35% (which would be simple addition).
Final Answer: Your total growth across Q3 and Q4 is 38%. This is compound
growth: 1.15 * 1.20 = 1.38, representing a 38% increase from the starting point.

Prompt Engineering Tips
- Be explicit about format: Models often need exact format specifications. "Use this exact format" works better than "use a format like this."
- Define edge cases: What should the agent do when a tool fails? When it's uncertain? When the query is ambiguous? Define these explicitly.
- Limit scope: Agents work better with constrained domains. "You are a research assistant" is better than "You can do anything."
- Include anti-patterns: Show what NOT to do. "Never guess when you can search" is clearer than "try to be accurate."
- Version your prompts: Track prompt changes in version control. Small changes can have large behavioral effects.
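The last tip can be made concrete with a few lines of standard-library Python. This is an illustrative sketch (the `prompt_fingerprint` helper and the metadata fields are hypothetical, not from any framework): fingerprint the active prompt and store the hash with every agent trace, so behavioral regressions can be traced back to the exact prompt version that produced them.

```python
import hashlib

def prompt_fingerprint(prompt: str) -> str:
    """Short, stable hash of a prompt; store it with every agent trace."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]

# Illustrative prompt text standing in for the full system prompt
AGENT_SYSTEM_PROMPT = "You are a research assistant with access to tools..."

# Recorded alongside each run, so a change in behavior can be matched
# to the prompt revision that caused it.
run_metadata = {
    "prompt_version": prompt_fingerprint(AGENT_SYSTEM_PROMPT),
    "model": "llama-3.1-8b-instruct",
}
```

Committing the prompt file itself to version control gives the history; the fingerprint ties each logged trajectory to one point in that history.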
Local Agent Frameworks
Several frameworks support local agent development. Each has different trade-offs in terms of flexibility, complexity, and local-first support.
LangChain / LangGraph
The most popular framework with extensive documentation and community. LangGraph adds explicit state machines for more complex agent flows. Good local model support through integrations with Ollama, llama.cpp, and vLLM.
Strengths: Large ecosystem, many integrations, active development.
Weaknesses: Abstractions can be leaky, debugging can be difficult, API changes frequently.
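Whichever framework you pick, its local-model integrations mostly wrap an inference server's HTTP API. As a rough sketch of what such an integration reduces to, assuming an Ollama server on its default `localhost:11434` (the helper names here are illustrative; only the endpoint and payload shape follow Ollama's chat API):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default chat endpoint

def build_chat_request(model: str, messages: list) -> urllib.request.Request:
    """Build the POST request a local-model chat integration sends."""
    payload = json.dumps({"model": model, "messages": messages, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(model: str, messages: list) -> str:
    """Send one chat turn; requires a running Ollama server."""
    req = build_chat_request(model, messages)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

Knowing this layer exists makes framework debugging easier: when an abstraction misbehaves, you can always drop down and inspect the raw request.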
LlamaIndex
Originally focused on RAG, now includes agent capabilities. Strong emphasis on data connectors and retrieval. Good for agents that primarily work with documents.
Strengths: Excellent RAG support, many data connectors, good local model support.
Weaknesses: Agent features less mature than LangChain, smaller community.
Haystack
Pipeline-based framework with strong production focus. Good for building structured workflows. Native support for local models through various backends.
Strengths: Production-ready, pipeline paradigm is intuitive, good evaluation tools.
Weaknesses: Less flexible for complex agent patterns, smaller ecosystem.
Custom Implementation
Building your own agent loop gives maximum control but requires more effort. Consider this when framework abstractions get in the way or you have specific requirements.
Strengths: Full control, no framework overhead, can optimize for your specific use case.
Weaknesses: More code to maintain, no ecosystem benefits, need to solve common problems yourself.
Minimal Custom Agent
A complete but minimal agent implementation in ~100 lines, suitable as a starting point for custom development:
from dataclasses import dataclass
from typing import Callable
import json
import re

@dataclass
class Tool:
    name: str
    description: str
    parameters: dict
    function: Callable

class SimpleAgent:
    def __init__(self, llm, tools: list[Tool], max_iterations: int = 10):
        self.llm = llm
        self.tools = {t.name: t for t in tools}
        self.max_iterations = max_iterations

    def _build_system_prompt(self) -> str:
        tool_desc = "\n".join(
            f"- {t.name}: {t.description}\n  Parameters: {json.dumps(t.parameters)}"
            for t in self.tools.values()
        )
        return f"""You are a helpful assistant with access to tools.

Tools:
{tool_desc}

To use a tool, write:
Thought: [reasoning]
Action: [tool_name]
Action Input: {{"param": "value"}}

When done:
Thought: [summary]
Final Answer: [response]"""

    def run(self, query: str) -> str:
        messages = [
            {"role": "system", "content": self._build_system_prompt()},
            {"role": "user", "content": query},
        ]
        for _ in range(self.max_iterations):
            response = self.llm.chat(messages)
            messages.append({"role": "assistant", "content": response})
            # Check for final answer
            if "Final Answer:" in response:
                return response.split("Final Answer:")[-1].strip()
            # Parse and execute tool call
            action_match = re.search(r"Action:\s*(.+)", response)
            input_match = re.search(r"Action Input:\s*({.+})", response, re.DOTALL)
            if action_match and input_match:
                tool_name = action_match.group(1).strip()
                try:
                    params = json.loads(input_match.group(1))
                    if tool_name in self.tools:
                        result = self.tools[tool_name].function(**params)
                        observation = f"Observation: {result}"
                    else:
                        observation = f"Observation: Error - Unknown tool '{tool_name}'"
                except Exception as e:
                    observation = f"Observation: Error - {str(e)}"
                messages.append({"role": "user", "content": observation})
            else:
                messages.append({
                    "role": "user",
                    "content": "Please use the correct format: Action: tool_name\nAction Input: {params}",
                })
        return "Max iterations reached without final answer."

Summary
Building reliable agents requires understanding each component deeply: the reasoning engine's capabilities and limitations, tool design principles, memory architecture trade-offs, and orchestration patterns. Local deployment adds constraints (compute, memory) but also advantages (privacy, cost, control).
Start with the simplest architecture that might work—often a basic ReAct agent with a few well-designed tools. Add complexity only when you hit specific limitations. Invest heavily in observability from day one. And remember that prompt engineering and tool design often matter more than framework choice.
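On the observability point: even a hand-rolled agent benefits from a structured trace from day one. A minimal sketch, standard library only (the `TraceLogger` class is illustrative, not from any framework), that records every LLM call, tool call, and error as one JSON line for later inspection:

```python
import json
import time

class TraceLogger:
    """Append each agent event (LLM call, tool call, error) as one JSON line."""

    def __init__(self, path: str):
        self.path = path

    def log(self, event: str, **fields) -> dict:
        record = {"ts": round(time.time(), 3), "event": event, **fields}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        return record

# Usage inside an agent loop (illustrative):
# trace = TraceLogger("run_trace.jsonl")
# trace.log("llm_call", prompt_tokens=512, latency_s=1.8)
# trace.log("tool_call", tool="calculator", input={"expression": "1.15 * 1.20"})
```

A flat JSONL trace like this is enough to answer the questions that matter most when an agent misbehaves: which tool was called, with what input, and what the model saw next.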