Understanding Agents
An AI agent is a system that perceives its environment, reasons about goals, and takes action to achieve them. The language model provides the reasoning capability; the agent framework provides everything else. This document provides a comprehensive technical deep-dive into agent architectures, patterns, and implementation considerations for local-first deployment.
Core Components: A Deep Dive
Every agent architecture, regardless of implementation, contains four fundamental components. The sophistication of each component determines the agent's capabilities. Understanding these components at a granular level is essential for building robust local agents.
Reasoning Engine
The language model that interprets input, plans actions, and generates responses. This is where understanding and decision-making occur. For local deployment, this is typically a quantized model running on consumer hardware—Llama, Mistral, Qwen, or similar open-weight models.
The reasoning engine's effectiveness depends on several factors:
- Model capability: Larger models generally reason better, but quantization and architecture matter. A well-tuned 7B model can outperform a poorly prompted 70B model.
- Context window: Determines how much information the model can consider at once. Critical for complex multi-step tasks. Local models typically range from 4K to 128K tokens.
- Instruction following: The model must reliably follow structured output formats. Fine-tuned chat models are generally better than base models for agent tasks.
- Inference speed: Agent loops require multiple LLM calls. At 10 tokens/second on CPU, a task requiring 50 calls becomes painfully slow. GPU acceleration is strongly recommended.
{
  "model": "llama-3.1-8b-instruct",
  "quantization": "Q4_K_M",
  "context_length": 8192,
  "temperature": 0.1,   // Low temp for consistent tool selection
  "gpu_layers": 35,     // Offload to GPU for speed
  "num_predict": 1024   // Max tokens per generation
}
Tool Interface
The mechanisms through which the agent affects its environment. Tools can be anything: database queries, API calls, file operations, code execution, web browsing. The agent decides which tool to use based on the current goal and available options.
Tool design is one of the most underestimated aspects of agent development. Well-designed tools dramatically reduce agent errors. Key principles:
- Atomic operations: Each tool should do one thing well. Combine simple tools rather than building complex multi-function tools.
- Clear naming: Tool names should be unambiguous. search_documents is better than search.
- Rich descriptions: The LLM only knows what you tell it. Describe when to use each tool and what it returns.
- Validated inputs: Use strict schemas. Reject invalid parameters early with helpful error messages.
- Structured outputs: Return consistent formats. JSON is preferable to free-form text.
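These principles can be made concrete in code. The sketch below uses a hypothetical `Tool` wrapper (the name, fields, and `execute` behavior are illustrative, not a specific framework's API) that enforces atomic operations, early validation, and structured output:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Tool:
    """Hypothetical tool wrapper: atomic operation, strict inputs, structured output."""
    name: str
    description: str
    parameters: dict           # Simplified spec: param name -> {"required": bool, ...}
    fn: Callable[..., dict]    # Always returns a dict, never free-form text

    def execute(self, **kwargs: Any) -> dict:
        # Reject invalid parameters early with helpful error messages
        required = [p for p, spec in self.parameters.items() if spec.get("required")]
        missing = [p for p in required if p not in kwargs]
        if missing:
            return {"error": f"Missing required parameters: {missing}"}
        unknown = [p for p in kwargs if p not in self.parameters]
        if unknown:
            return {"error": f"Unknown parameters: {unknown}. Expected: {list(self.parameters)}"}
        return self.fn(**kwargs)

search_documents = Tool(
    name="search_documents",  # unambiguous, better than "search"
    description="Searches the local document store for relevant passages.",
    parameters={"query": {"type": "string", "required": True}},
    fn=lambda query: {"results": [], "query": query},  # stub backend for illustration
)
```

Returning errors as structured data rather than raising lets the agent see and reason about the failure in its next step.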
Memory System
Short-term memory (conversation context) and long-term memory (persistent storage) allow the agent to maintain state across interactions. Without memory, each interaction starts from zero. Vector databases enable semantic memory retrieval: the agent can recall relevant information based on meaning rather than exact match.
Memory architecture choices significantly impact agent behavior and performance. We explore these in detail in the Memory Architectures section below.
Execution Loop
The orchestration layer that ties everything together. It manages the cycle of perception, reasoning, action, and observation. Common patterns include ReAct (Reasoning + Acting), Plan-and-Execute, and Tree of Thoughts. The loop continues until the goal is achieved or a termination condition is met.
The execution loop handles critical concerns beyond simple iteration:
- Error recovery: What happens when a tool fails? The loop must decide whether to retry, try an alternative, or ask for help.
- Loop detection: Agents can get stuck in cycles. The loop must detect and break repetitive patterns.
- Resource limits: Maximum iterations, token budgets, and time limits prevent runaway execution.
- State management: Maintaining consistent state across async tool calls and potential failures.
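The resource-limit concern in particular is easy to sketch. The skeleton below is illustrative, not a complete loop: `step_fn` stands in for one perceive-reason-act cycle and is assumed to return a `(done, tokens_used)` pair.

```python
import time

def run_agent(step_fn, max_iterations=15, max_seconds=120.0, token_budget=20_000):
    """Minimal execution-loop skeleton with hard resource limits (sketch).

    step_fn() performs one perceive-reason-act cycle and returns
    (done, tokens_used) -- a stand-in for a real agent step.
    """
    start, tokens = time.monotonic(), 0
    for i in range(max_iterations):
        if time.monotonic() - start > max_seconds:
            return f"stopped: time limit after {i} steps"
        if tokens >= token_budget:
            return f"stopped: token budget after {i} steps"
        done, used = step_fn()
        tokens += used
        if done:
            return f"done in {i + 1} steps ({tokens} tokens)"
    return "stopped: max iterations reached"
```

All three limits are checked every iteration, so a runaway agent is stopped by whichever budget it exhausts first.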
How Tool Calling Actually Works
Tool calling is the mechanism by which language models interact with external systems. Understanding the internals demystifies agent behavior and helps you debug issues when tools are called incorrectly or not at all.
JSON Schema Tool Definitions
Tools are defined using JSON Schema, which describes the tool's name, purpose, and expected parameters. This schema is injected into the system prompt or provided via a dedicated API field (for models with native tool support).
{
  "name": "search_documents",
  "description": "Searches the local document store for relevant passages. Use when the user asks about information that might be in their documents. Returns a list of matching passages with source citations.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "The search query. Be specific and use relevant keywords."
      },
      "max_results": {
        "type": "integer",
        "description": "Maximum number of results to return. Default 5.",
        "default": 5
      },
      "filters": {
        "type": "object",
        "properties": {
          "date_after": {
            "type": "string",
            "format": "date",
            "description": "Only return documents created after this date"
          },
          "file_types": {
            "type": "array",
            "items": { "type": "string" },
            "description": "Filter by file extension, e.g., ['pdf', 'md']"
          }
        }
      }
    },
    "required": ["query"]
  }
}
The Tool Calling Flow
When the agent needs to call a tool, the following sequence occurs:
1. The model generates output containing a tool call (natively structured, or wrapped in an agreed-upon format).
2. The framework extracts and parses the call from the model output.
3. The call is validated against the tool's schema: known name, required parameters, correct types.
4. The tool executes and returns a result.
5. The result is appended to the context as an observation, and the model is invoked again to continue.
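A single round of this dispatch can be sketched as follows. This is illustrative only: it assumes the `<tool_call>` tag convention used later in this document, and `tools` is a plain dict mapping names to callables.

```python
import json

def tool_call_round(llm_output: str, tools: dict) -> str:
    """One round of the tool-calling flow (sketch)."""
    # 1. Model output may or may not contain a tool call
    if "<tool_call>" not in llm_output:
        return llm_output  # Plain answer, no dispatch needed
    # 2. Framework extracts and parses the call
    payload = llm_output.split("<tool_call>")[1].split("</tool_call>")[0]
    call = json.loads(payload)
    # 3. Validate the tool name before executing
    if call["name"] not in tools:
        return f"Error: unknown tool '{call['name']}'. Available: {list(tools)}"
    # 4. Execute and return the result as the next observation
    result = tools[call["name"]](**call.get("parameters", {}))
    return f"Observation: {json.dumps(result)}"
```

A real framework would add the robust parsing and validation described in the sections below; this shows only the shape of the loop body.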
Native vs Prompted Tool Calling
Native Tool Calling
Some models (GPT-4, Claude, Llama 3.1+) have been specifically trained to output tool calls in a structured format. The API accepts tool definitions and returns parsed tool calls directly.
Pros: More reliable parsing, better parameter handling, fewer hallucinated tool names. Cons: Requires model support, less flexibility in output format.
Prompted Tool Calling
For models without native support, tools are described in the system prompt with explicit instructions on output format. The agent framework must parse the response and extract tool calls.
Pros: Works with any model, highly customizable format. Cons: More prone to parsing errors, requires robust regex/parsing logic.
You have access to the following tools:

1. search_documents(query: str, max_results: int = 5)
   - Searches your document store for relevant passages
   - Use when users ask about their documents

2. calculator(expression: str)
   - Evaluates mathematical expressions
   - Use for any calculations

When you need to use a tool, respond with:
<tool_call>
{"name": "tool_name", "parameters": {"param1": "value1"}}
</tool_call>

Wait for the result before continuing.
Parsing and Validation
Robust parsing is essential for reliable agents. LLMs do not always produce perfectly formatted JSON, especially smaller local models. Your parsing layer should handle:
- Malformed JSON: Missing quotes, trailing commas, unescaped characters. Use lenient parsers or repair strategies.
- Partial tool calls: Model output was truncated. Detect and request completion or retry.
- Unknown tools: Model hallucinated a tool name. Return an error message describing available tools.
- Missing parameters: Required fields are absent. Return validation error with specific guidance.
- Wrong types: String instead of integer. Attempt type coercion where safe, error otherwise.
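The parsing sketch that follows calls a validate_tool_call helper. A minimal version covering unknown tools, missing parameters, and safe type coercion might look like this, assuming a simplified schema format (tool name mapped to expected Python types) rather than full JSON Schema:

```python
def validate_tool_call(data: dict, schemas: dict) -> dict:
    """Validate a parsed tool call (sketch; `schemas` maps tool name -> {param: type})."""
    name = data.get("name")
    if name not in schemas:
        # Unknown tool: describe what IS available so the model can recover
        raise ValueError(f"Unknown tool '{name}'. Available tools: {list(schemas)}")
    params = data.get("parameters", {})
    validated = {}
    for param, expected in schemas[name].items():
        if param not in params:
            raise ValueError(f"Missing required parameter '{param}' for {name}")
        value = params[param]
        if not isinstance(value, expected):
            try:
                value = expected(value)  # Safe coercion, e.g. "5" -> 5
            except (TypeError, ValueError):
                raise ValueError(f"'{param}' must be {expected.__name__}")
        validated[param] = value
    return {"name": name, "parameters": validated}
```

The error messages double as recovery hints: when fed back to the model as observations, they tell it exactly what to fix.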
import re
import json_repair  # Lenient third-party JSON parser (pip install json-repair)

def parse_tool_call(response: str) -> ToolCall | None:
    # Try multiple extraction patterns, most specific first
    patterns = [
        r'<tool_call>\s*({.*?})\s*</tool_call>',
        r'```json\s*({.*?})\s*```',
        r'(\{\s*"name":\s*"[^"]+",\s*"parameters".*?\})'
    ]
    for pattern in patterns:
        match = re.search(pattern, response, re.DOTALL)
        if match:
            try:
                data = json_repair.loads(match.group(1))  # Lenient parser repairs minor damage
                return validate_tool_call(data)
            except Exception:
                continue
    return None  # No valid tool call found
Memory Architectures
Memory gives agents the ability to maintain context, learn from interactions, and recall relevant information. Different memory types serve different purposes, and most production agents use a combination. Local deployment has specific memory considerations—you control where data lives and how it persists.
Buffer Memory (Conversation History)
The simplest form: store the full conversation in context. Every message sent and received is included in subsequent prompts. Works well for short conversations but quickly consumes the context window.
class BufferMemory:
    def __init__(self, max_messages: int = 50):
        self.messages = []
        self.max_messages = max_messages

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]

    def get_context(self) -> list[dict]:
        return self.messages.copy()
Use when: Short conversations, simple Q&A, demos. Avoid when: Long sessions, token-constrained environments.
Summary Memory
Instead of storing raw messages, periodically summarize the conversation and store the summary. This compresses information, preserving key points while reducing token usage. Requires an LLM call to generate summaries.
class SummaryMemory:
    def __init__(self, llm, summarize_threshold: int = 10):
        self.llm = llm
        self.summary = ""
        self.recent_messages = []
        self.threshold = summarize_threshold

    def add(self, role: str, content: str):
        self.recent_messages.append({"role": role, "content": content})
        if len(self.recent_messages) >= self.threshold:
            self._update_summary()

    def _update_summary(self):
        prompt = f"""Current summary: {self.summary}

New messages:
{self._format_messages()}

Write an updated summary that incorporates the new information."""
        self.summary = self.llm.complete(prompt)
        self.recent_messages = []

    def get_context(self) -> str:
        return f"Summary: {self.summary}\n\nRecent: {self._format_messages()}"
Use when: Long conversations, limited context. Trade-off: Loses detail, adds latency from summarization calls.
Vector Memory (Semantic Retrieval)
Store messages and facts as embeddings in a vector database. Retrieve relevant memories based on semantic similarity to the current query. Enables recall across long time horizons and large knowledge bases.
from datetime import datetime
from uuid import uuid4

class VectorMemory:
    def __init__(self, embed_model, vector_store):
        self.embed = embed_model
        self.store = vector_store

    def add(self, content: str, metadata: dict = None):
        embedding = self.embed.encode(content)
        self.store.upsert({
            "id": str(uuid4()),
            "embedding": embedding,
            "content": content,
            "metadata": metadata or {},
            "timestamp": datetime.now().isoformat()
        })

    def retrieve(self, query: str, k: int = 5) -> list[str]:
        query_embedding = self.embed.encode(query)
        results = self.store.search(query_embedding, top_k=k)
        return [r["content"] for r in results]

    def get_context(self, query: str) -> str:
        memories = self.retrieve(query)
        return "Relevant memories:\n" + "\n".join(f"- {m}" for m in memories)
Use when: Large knowledge bases, long-running agents, information retrieval. Requires: Embedding model, vector database (ChromaDB, Qdrant, etc.).
Entity Memory
Extract and track entities (people, places, concepts) and their properties across conversations. Maintains a structured knowledge graph that can be queried and updated. Useful for agents that need to track evolving information about specific subjects.
import json

class EntityMemory:
    def __init__(self, llm):
        self.llm = llm
        self.entities = {}  # entity_name -> {properties}

    def extract_entities(self, text: str) -> list[dict]:
        prompt = f"""Extract entities and their properties from this text:
"{text}"
Return JSON: [{{"name": "...", "type": "...", "properties": {{}}}}]"""
        return json.loads(self.llm.complete(prompt))

    def update(self, text: str):
        entities = self.extract_entities(text)
        for entity in entities:
            name = entity["name"]
            if name in self.entities:
                self.entities[name]["properties"].update(entity["properties"])
            else:
                self.entities[name] = entity

    def get_context(self, relevant_entities: list[str]) -> str:
        context = []
        for name in relevant_entities:
            if name in self.entities:
                context.append(f"{name}: {json.dumps(self.entities[name])}")
        return "Known entities:\n" + "\n".join(context)
Use when: Tracking people, projects, evolving facts. Trade-off: Extraction adds latency and can be error-prone.
Combining Memory Types
Production agents often layer multiple memory types. A common pattern:
class HybridMemory:
    def __init__(self, llm, embed_model, store):
        self.buffer = BufferMemory(max_messages=10)     # Recent context
        self.summary = SummaryMemory(llm)               # Compressed history
        self.vector = VectorMemory(embed_model, store)  # Semantic retrieval
        self.entities = EntityMemory(llm)               # Structured facts

    def get_context(self, query: str) -> str:
        return f"""
## Conversation Summary
{self.summary.get_context()}

## Relevant Memories
{self.vector.get_context(query)}

## Known Entities
{self.entities.get_context(self._extract_mentioned_entities(query))}

## Recent Messages
{self.buffer.get_context()}
"""
The ReAct Pattern in Detail
ReAct (Reasoning + Acting) is the most widely used agent pattern. The model explicitly verbalizes its reasoning before each action, creating a chain of Thought-Action-Observation steps. This pattern emerged from the ReAct paper (Yao et al., 2022) and has become the foundation for most agent frameworks.
The ReAct Loop Structure
Question: What was the revenue growth for Acme Corp last quarter?

Thought: I need to find financial information about Acme Corp. I should search
the documents for their quarterly report or revenue data.
Action: search_documents
Action Input: {"query": "Acme Corp revenue quarterly report Q3 2024"}
Observation: Found 3 results:
1. "Q3 2024 Report: Acme Corp reported revenue of $4.2M, up from $3.8M in Q2..."
2. "Financial Summary: Year-over-year growth of 15%..."

Thought: I found the revenue figures. Q3 was $4.2M and Q2 was $3.8M.
Let me calculate the quarter-over-quarter growth.
Action: calculator
Action Input: {"expression": "((4.2 - 3.8) / 3.8) * 100"}
Observation: 10.526315789473683

Thought: The quarter-over-quarter revenue growth was approximately 10.5%.
I have enough information to answer the question.
Action: final_answer
Action Input: {"answer": "Acme Corp's revenue grew by 10.5% last quarter,
from $3.8M in Q2 to $4.2M in Q3 2024."}
ReAct System Prompt Template
SYSTEM_PROMPT = """You are a helpful assistant with access to tools.

Available tools:
{tool_descriptions}

To use a tool, you MUST use this exact format:

Thought: [Your reasoning about what to do next]
Action: [tool_name]
Action Input: [JSON parameters]

After receiving an Observation, continue with another Thought.

When you have enough information to answer, use:

Thought: [Summary of what you learned]
Action: final_answer
Action Input: {{"answer": "[Your complete response]"}}

Important:
- Always think before acting
- Use tools to verify information, don't guess
- If a tool fails, try a different approach
- Be concise in your reasoning
"""
Implementing ReAct
class ReActAgent:
    def __init__(self, llm, tools: list[Tool], max_iterations: int = 10):
        self.llm = llm
        self.tools = {t.name: t for t in tools}
        self.max_iterations = max_iterations

    def run(self, query: str) -> str:
        prompt = self._build_prompt(query)
        trajectory = []
        for i in range(self.max_iterations):
            # Generate thought and action
            response = self.llm.complete(prompt)
            trajectory.append(response)
            # Parse action
            action = self._parse_action(response)
            if action is None:
                prompt += f"\n{response}\nThought: I need to use a proper action format."
                continue
            # Check for final answer
            if action["name"] == "final_answer":
                return action["input"]["answer"]
            # Execute tool
            if action["name"] not in self.tools:
                observation = f"Error: Unknown tool '{action['name']}'. Available: {list(self.tools.keys())}"
            else:
                try:
                    observation = self.tools[action["name"]].execute(**action["input"])
                except Exception as e:
                    observation = f"Error: {str(e)}"
            # Append to prompt
            prompt += f"\n{response}\nObservation: {observation}\n"
        return "Max iterations reached without finding an answer."

    def _parse_action(self, response: str) -> dict | None:
        # Extract Action and Action Input from response
        action_match = re.search(r'Action:\s*(.+)', response)
        input_match = re.search(r'Action Input:\s*(.+)', response, re.DOTALL)
        if action_match and input_match:
            try:
                return {
                    "name": action_match.group(1).strip(),
                    "input": json.loads(input_match.group(1).strip())
                }
            except json.JSONDecodeError:
                return None  # Malformed parameters; caller re-prompts for correct format
        return None
ReAct Advantages and Limitations
Advantages
- Interpretable: You can follow the agent's reasoning step by step
- Flexible: Can adapt to unexpected situations by reasoning about them
- Self-correcting: Can recognize and recover from errors through reflection
- Simple to implement: Minimal infrastructure required
Limitations
- Greedy: Makes locally optimal decisions without global planning
- Token intensive: Repeated reasoning consumes many tokens
- Slow: Multiple LLM calls per task add latency
- Can loop: May repeat the same actions without making progress
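The token cost compounds because each LLM call resends the entire trajectory so far, so prompt tokens grow roughly quadratically with step count. A back-of-envelope estimate (the per-step and base-prompt sizes below are assumed, illustrative defaults):

```python
def react_prompt_tokens(steps: int, tokens_per_step: int = 150, base: int = 800) -> int:
    """Approximate cumulative prompt tokens for a ReAct run.

    Each of the `steps` LLM calls resends the base prompt plus all prior
    Thought/Action/Observation steps (tokens_per_step each).
    """
    return sum(base + i * tokens_per_step for i in range(steps))
```

At these assumed sizes, a 10-step run costs 14,750 prompt tokens, not 10 x 800, which is why summarizing intermediate steps matters for long tasks.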
Plan-and-Execute vs ReAct
Plan-and-Execute separates planning from execution. First, create a complete plan; then execute each step. This addresses ReAct's greedy local decisions with upfront global planning.
The Plan-and-Execute Pattern
Task: "Summarize the key points from all PDFs in my research folder
and create a comparison table."

=== PLANNING PHASE ===
Plan:
1. List all PDF files in the research folder
2. For each PDF, extract the main content
3. Identify key points from each document
4. Compare the key points across documents
5. Format findings as a comparison table
6. Return the final table to the user

=== EXECUTION PHASE ===
Executing Step 1: list_files(path="/research", filter="*.pdf")
Result: ["paper_a.pdf", "paper_b.pdf", "paper_c.pdf"]

Executing Step 2: extract_pdf_text(file="paper_a.pdf")
Result: "Abstract: This paper examines..."
[continues for each PDF]

Executing Step 3: analyze_content(text="...", task="extract key points")
Result: ["Key point 1", "Key point 2", ...]
[continues for each document]
...

=== REPLANNING (if needed) ===
Step 4 failed - documents have different structures.
Revised plan: Group documents by type before comparing.
Implementation Considerations
class PlanAndExecuteAgent:
    def __init__(self, planner_llm, executor_llm, tools):
        self.planner = planner_llm    # Can be larger/smarter model
        self.executor = executor_llm  # Can be smaller/faster model
        self.tools = tools

    def run(self, task: str) -> str:
        # Phase 1: Generate plan
        plan = self._generate_plan(task)
        results = []
        # Phase 2: Execute each step
        i = 0
        while i < len(plan):
            try:
                result = self._execute_step(plan[i], results)
                results.append({"step": plan[i], "result": result})
                i += 1
            except Exception as e:
                # Phase 3: Replan if a step fails. The revised plan replaces
                # the remaining steps and execution restarts on it.
                plan = self._replan(task, results, plan[i:], str(e))
                i = 0
        # Phase 4: Synthesize final answer
        return self._synthesize(task, results)

    def _generate_plan(self, task: str) -> list[str]:
        prompt = f"""Create a step-by-step plan to accomplish this task:
{task}

Available tools: {[t.name for t in self.tools]}

Return a numbered list of concrete steps. Each step should use exactly one tool.
"""
        response = self.planner.complete(prompt)
        return self._parse_plan(response)

    def _execute_step(self, step: str, prior_results: list) -> str:
        prompt = f"""Execute this step: {step}

Prior results: {json.dumps(prior_results, indent=2)}

Determine which tool to use and what parameters to pass."""
        # Execute using ReAct-like single-step execution
        return self._single_step_react(prompt)
When to Use Each Pattern
| Criterion | ReAct | Plan-and-Execute |
|---|---|---|
| Task complexity | Simple to moderate | Complex, multi-step |
| Predictability | Unpredictable tasks | Predictable workflow |
| Error recovery | Implicit, per-step | Explicit replanning |
| Token efficiency | Lower (repeated context) | Higher (plan once) |
| Latency | Varies widely | More predictable |
| Model requirements | Single model | Can use two models |
Multi-Agent Systems and Orchestration
Multi-agent systems use multiple specialized agents that collaborate on complex tasks. Each agent has a specific role, expertise, or perspective. Orchestration patterns determine how agents communicate and coordinate.
Hierarchical Orchestration
A supervisor agent decomposes tasks and delegates to specialist agents. The supervisor aggregates results and handles coordination. This is the most common pattern.
Supervisor: "Write a blog post about local AI"
├─→ Researcher: Gathers facts and sources
├─→ Writer: Creates initial draft
├─→ Editor: Refines and improves
└─→ Fact-checker: Verifies claims
Supervisor: Combines and returns final post
Peer-to-Peer Collaboration
Agents communicate directly with each other without a central coordinator. Each agent can request help from or provide information to others. More flexible but harder to debug.
Coder: "I need the API specification"
└─→ Architect: "Here's the OpenAPI spec..."
Coder: "What's the authentication method?"
└─→ Security: "Use JWT with these claims..."
Coder: "Tests are failing on edge case"
└─→ Tester: "Here's a fix for that case..."
Debate and Consensus
Multiple agents independently work on the same problem and then debate their solutions. Useful for reducing errors and getting diverse perspectives on ambiguous questions.
Question: "Should we use PostgreSQL or MongoDB?"

Agent A (Relational): "PostgreSQL because ACID compliance..."
Agent B (Document): "MongoDB for flexible schemas..."
Agent C (Pragmatist): "PostgreSQL with JSONB columns..."

Moderator: "Agent C's hybrid approach addresses both concerns.
Final recommendation: PostgreSQL with JSONB."
Multi-Agent Implementation
class MultiAgentOrchestrator:
    def __init__(self, agents: dict[str, Agent]):
        self.agents = agents
        self.message_queue = []
        self.shared_memory = {}

    def run_hierarchical(self, task: str, supervisor: str = "supervisor") -> str:
        # Supervisor creates the plan
        plan = self.agents[supervisor].plan(task)
        results = {}
        for step in plan:
            agent_name = step["assigned_to"]
            subtask = step["task"]
            context = {
                "prior_results": results,
                "shared_memory": self.shared_memory
            }
            result = self.agents[agent_name].execute(subtask, context)
            results[step["id"]] = result
            # Update shared memory
            self._update_shared_memory(agent_name, result)
        # Supervisor synthesizes final answer
        return self.agents[supervisor].synthesize(task, results)

    def run_debate(self, question: str, debaters: list[str], rounds: int = 2) -> str:
        positions = {}
        # Initial positions
        for agent_name in debaters:
            positions[agent_name] = self.agents[agent_name].answer(question)
        # Debate rounds
        for round_num in range(rounds):
            for agent_name in debaters:
                other_positions = {k: v for k, v in positions.items() if k != agent_name}
                positions[agent_name] = self.agents[agent_name].respond(
                    question,
                    own_position=positions[agent_name],
                    other_positions=other_positions
                )
        # Reach consensus
        return self.agents["moderator"].synthesize_consensus(question, positions)
Multi-Agent Trade-offs
Multi-agent systems add significant complexity. Consider these trade-offs:
- Coordination overhead: Agents must understand each other. Misunderstandings between agents are common and hard to debug.
- Latency multiplication: Each agent call is an LLM inference. A 5-agent system with 3 turns each means 15+ LLM calls minimum.
- Context fragmentation: Each agent has partial context. Important information can be lost in handoffs.
- Debugging difficulty: Tracing issues across agent boundaries requires comprehensive logging.
- Local resource constraints: Running multiple agents on local hardware requires careful resource management.
Agent Failure Modes and Mitigations
Agents fail in predictable ways. Understanding these failure modes helps you build more robust systems. The mitigations are often more important than the initial implementation.
Infinite Loops
The agent repeats the same actions without making progress. Often caused by ambiguous goals, tool errors that the agent doesn't recognize, or context that doesn't accumulate properly.
Mitigations:
- Set hard iteration limits (typically 10-20 for most tasks)
- Track action history and detect repetition patterns
- Inject "you seem to be stuck" prompts after repeated similar actions
- Force the agent to try a different approach after N failures
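The second mitigation, tracking action history for repetition, can be sketched in a few lines. This assumes the loop records each step as a (tool_name, canonical_input) tuple; the window and threshold values are illustrative defaults:

```python
from collections import Counter

def detect_repetition(history: list[tuple], window: int = 6, threshold: int = 3) -> bool:
    """True when any (tool, input) pair recurs `threshold`+ times in the recent window.

    history holds (tool_name, canonical_input) tuples recorded by the loop.
    """
    counts = Counter(history[-window:])
    return any(n >= threshold for n in counts.values())
```

When this fires, the loop can inject a "you seem to be stuck, try a different approach" message rather than executing the repeated call again.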
Hallucinated Tool Calls
The agent invents tools that don't exist or uses incorrect parameters. More common with smaller models and prompted (vs native) tool calling.
Mitigations:
- Return clear error messages listing available tools
- Use few-shot examples of correct tool usage in the system prompt
- Validate all parameters against schemas before execution
- Consider constrained decoding to force valid tool names
Context Window Overflow
Long-running agents accumulate context until they exceed the model's limit. This causes either errors or silent truncation of important early context.
Mitigations:
- Track token usage continuously
- Summarize intermediate steps when context grows large
- Use sliding window over conversation history
- Store detailed results in memory, pass only summaries in context
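A token-aware sliding window combines the first two mitigations. The sketch below is illustrative: `count_tokens` is any tokenizer-backed counter, and it keeps the most recent messages that fit the budget, dropping older ones first:

```python
def sliding_window(messages: list[dict], count_tokens, budget: int) -> list[dict]:
    """Keep the most recent messages that fit within `budget` tokens (sketch)."""
    kept, used = [], 0
    for msg in reversed(messages):           # Walk backward from the newest message
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break                            # Everything older is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))              # Restore chronological order
```

A production version would also pin "anchor" messages (the system prompt, the original task) so they survive the window, as described in the context-management section below.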
Premature Termination
The agent declares success before completing the task. Often happens when the agent misunderstands the goal or gives up after minor difficulties.
Mitigations:
- Include clear success criteria in the prompt
- Add a verification step before final answer
- Ask the agent to explain why the task is complete
- Use a separate model to validate completion
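The verification step is cheap to add. A sketch, using the same minimal `llm.complete` interface as the other examples here (the prompt wording and COMPLETE/INCOMPLETE convention are assumptions, not a standard):

```python
VERIFY_PROMPT = """Task: {task}
Proposed final answer: {answer}

Does the answer fully accomplish the task? Reply with exactly COMPLETE or
INCOMPLETE, followed by a one-line reason."""

def verify_completion(llm, task: str, answer: str) -> bool:
    """Ask a (possibly separate) model to check the answer before accepting it."""
    verdict = llm.complete(VERIFY_PROMPT.format(task=task, answer=answer))
    return verdict.strip().upper().startswith("COMPLETE")
```

If verification fails, feed the reason back to the agent as an observation and let it continue, rather than returning the incomplete answer to the user.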
Cascading Tool Failures
One tool failure causes a cascade of subsequent failures because the agent doesn't properly handle the error and continues with invalid data.
Mitigations:
- Return structured error responses with recovery suggestions
- Distinguish between retryable and fatal errors
- Give the agent explicit error-handling tools
- Checkpoint state before risky operations
Observability and Debugging
Agents are notoriously difficult to debug. Their behavior emerges from the interaction of prompts, model responses, tool outputs, and orchestration logic. Comprehensive observability is essential for development and production.
Essential Logging
Every agent run should capture enough information to fully reconstruct what happened:
from dataclasses import dataclass
from datetime import datetime
from typing import Literal

@dataclass
class TraceStep:
    step_number: int
    step_type: Literal["llm_call", "tool_call", "memory_access"]
    input: str
    output: str
    tokens_used: int
    duration_ms: int
    metadata: dict  # model params, tool name, etc.

@dataclass
class AgentTrace:
    run_id: str
    timestamp: datetime
    user_query: str
    steps: list[TraceStep]
    final_output: str
    total_tokens: int
    total_duration_ms: int
    status: Literal["success", "error", "timeout"]
Debugging Strategies
Trace Replay
Save complete traces and replay them with modified prompts or tools. Allows testing fixes without re-running the expensive parts.
Step-Through Execution
Add breakpoints where you can inspect state and optionally override the next action. Essential for understanding complex failures.
Prompt Diffing
Compare prompts between successful and failed runs. Often reveals subtle context differences that change agent behavior.
Temperature Sweeps
Run the same query at different temperatures. If behavior varies wildly, the prompt may be ambiguous or the task may be too hard for the model.
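A sweep like this is easy to automate. The sketch below assumes an `llm.complete(prompt, temperature=...)` interface (as used elsewhere in this document) and reports how many distinct answers each temperature produces:

```python
def temperature_sweep(llm, query: str, temps=(0.0, 0.3, 0.7, 1.0), runs: int = 3) -> dict:
    """Run the same query at several temperatures; report answer diversity (sketch)."""
    report = {}
    for t in temps:
        answers = [llm.complete(query, temperature=t) for _ in range(runs)]
        report[t] = len(set(answers))  # 1 = stable, `runs` = maximally varied
    return report
```

High diversity at low temperatures is the red flag: it suggests the prompt is ambiguous or the task exceeds the model's capability.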
Production Monitoring
In production, track these metrics to catch issues before they become critical:
# Key agent metrics
metrics = {
    # Performance
    "avg_steps_per_task": gauge(),
    "avg_tokens_per_task": gauge(),
    "avg_duration_ms": gauge(),
    "p95_duration_ms": gauge(),
    # Reliability
    "success_rate": gauge(),
    "timeout_rate": gauge(),
    "loop_detection_rate": gauge(),
    # Tool usage
    "tool_call_count": counter(labels=["tool_name"]),
    "tool_error_rate": gauge(labels=["tool_name"]),
    # Resource usage
    "context_window_utilization": histogram(),
    "memory_retrieved_per_query": histogram(),
}
Token Budgeting and Context Management
Context windows are finite and tokens are the currency of agent computation. Effective context management is the difference between an agent that handles complex tasks and one that fails on anything beyond simple queries.
Context Window Allocation
For a model with an 8K context window, a typical allocation:
CONTEXT_BUDGET = 8192

# Fixed allocations
SYSTEM_PROMPT = 800   # Tools, instructions, persona
SAFETY_MARGIN = 200   # Buffer for tokenization variance

# Dynamic allocations (remaining: 7192 tokens, before the query is counted)
def allocate_context(query_tokens: int) -> dict:
    remaining = CONTEXT_BUDGET - SYSTEM_PROMPT - SAFETY_MARGIN - query_tokens
    return {
        "memory": int(remaining * 0.3),         # ~2150 tokens
        "conversation": int(remaining * 0.3),   # ~2150 tokens
        "tool_results": int(remaining * 0.25),  # ~1800 tokens
        "generation": int(remaining * 0.15),    # ~1080 tokens (output)
    }
Context Management Strategies
- Aggressive summarization: Summarize tool outputs immediately. A 10,000-character search result can often be reduced to 200 characters of relevant information.
- Sliding window with anchors: Keep the system prompt and most recent N messages, but also preserve "anchor" messages that contain critical context.
- Hierarchical context: Store full details in external memory. Pass only summaries in context. The agent can request full details when needed.
- Lazy loading: Don't retrieve all relevant memories upfront. Fetch additional context only when the agent explicitly requests it.
- Token counting before calls: Always count tokens before sending to the model. Truncate or summarize if you'll exceed the limit.
class ContextManager:
    def __init__(self, max_tokens: int, tokenizer):
        self.max_tokens = max_tokens
        self.tokenizer = tokenizer
        self.allocations = self._default_allocations()

    def fit_to_budget(self, components: dict[str, str]) -> dict[str, str]:
        """Ensure all components fit within budget."""
        result = {}
        remaining = self.max_tokens
        # Fixed components first (system prompt)
        for name in ["system_prompt"]:
            tokens = self._count(components[name])
            if tokens > self.allocations[name]:
                raise ValueError(f"{name} exceeds allocation")
            result[name] = components[name]
            remaining -= tokens
        # Dynamic components with truncation
        for name in ["memory", "conversation", "tool_results"]:
            content = components.get(name, "")
            budget = min(self.allocations[name], remaining - 500)
            result[name] = self._truncate(content, budget)
            remaining -= self._count(result[name])
        return result

    def _truncate(self, text: str, max_tokens: int) -> str:
        tokens = self.tokenizer.encode(text)
        if len(tokens) <= max_tokens:
            return text
        # Keep beginning and end, mark truncation
        half = (max_tokens - 10) // 2
        return (self.tokenizer.decode(tokens[:half])
                + "\n...[truncated]...\n"
                + self.tokenizer.decode(tokens[-half:]))
Prompt Engineering for Agents
Agent prompts differ from standard LLM prompts. They must define behavior patterns, tool usage conventions, and output formats. Small changes in agent prompts can have large effects on reliability.
System Prompt Structure
AGENT_SYSTEM_PROMPT = """
# Identity and Role
You are a research assistant with access to a local document store and web search.
Your goal is to provide accurate, well-sourced answers to questions.

# Core Behaviors
- Always verify claims using tools before stating them as fact
- Cite sources for all factual statements
- Acknowledge uncertainty when information is incomplete
- Ask clarifying questions when the request is ambiguous

# Tool Usage Guidelines
You have access to these tools:

1. search_documents(query: str) -> list[Document]
   Use when: User asks about their personal documents
   Returns: List of relevant passages with source info

2. web_search(query: str) -> list[Result]
   Use when: User asks about current events or general knowledge
   Returns: Search results with titles, snippets, and URLs

3. calculator(expression: str) -> float
   Use when: Any mathematical calculation is needed
   Returns: The computed result

# Output Format
When using tools, follow this exact format:

Thought: [Your reasoning]
Action: [tool_name]
Action Input: {"param": "value"}

After receiving a result:

Observation: [Tool result will appear here]
Thought: [Your analysis of the result]

When ready to answer:

Thought: [Summary of findings]
Final Answer: [Your response to the user]

# Error Handling
If a tool returns an error:
1. Acknowledge the error
2. Try an alternative approach
3. If no alternative exists, explain the limitation

# Examples
[Include 2-3 complete examples of ideal behavior]
"""
Few-Shot Examples
Few-shot examples dramatically improve agent reliability, especially for local models. Include complete trajectories showing correct behavior:
# Example 1: Document Search
User: What did the Q3 report say about customer churn?
Thought: The user wants information from a specific document (Q3 report).
I should search their documents for churn-related information.
Action: search_documents
Action Input: {"query": "Q3 report customer churn rate"}
Observation: [1 result] "Q3 2024 Report, page 12: Customer churn decreased
to 2.3% from 2.8% in Q2, driven by improvements in onboarding..."
Thought: I found the relevant information. The Q3 report shows churn
decreased from 2.8% to 2.3%.
Final Answer: According to your Q3 2024 report, customer churn decreased
to 2.3%, down from 2.8% in Q2. The report attributes this improvement to
changes in the onboarding process.
---
# Example 2: Calculation Required
User: If we grew 15% in Q3 and 20% in Q4, what's our total growth?
Thought: I need to calculate compound growth. Two consecutive growth
periods compound multiplicatively, not additively.
Action: calculator
Action Input: {"expression": "((1.15 * 1.20) - 1) * 100"}
Observation: 38.0
Thought: The compound growth is 38%, not 35% (which would be simple addition).
Final Answer: Your total growth across Q3 and Q4 is 38%. This is compound
growth: 1.15 * 1.20 = 1.38, representing a 38% increase from the starting point.

Prompt Engineering Tips
- Be explicit about format: Models often need exact format specifications. "Use this exact format" works better than "use a format like this."
- Define edge cases: What should the agent do when a tool fails? When it's uncertain? When the query is ambiguous? Define these explicitly.
- Limit scope: Agents work better with constrained domains. "You are a research assistant" is better than "You can do anything."
- Include anti-patterns: Show what NOT to do. "Never guess when you can search" is clearer than "try to be accurate."
- Version your prompts: Track prompt changes in version control. Small changes can have large behavioral effects.
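The last tip can be made concrete with a few lines of standard-library Python. This is an illustrative sketch (the `prompt_fingerprint` helper and the metadata fields are hypothetical, not from any framework): fingerprint the active prompt and store the hash with every agent trace, so behavioral regressions can be traced back to the exact prompt version that produced them.

```python
import hashlib

def prompt_fingerprint(prompt: str) -> str:
    """Short, stable hash of a prompt; store it with every agent trace."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]

# Illustrative prompt text standing in for the full system prompt
AGENT_SYSTEM_PROMPT = "You are a research assistant with access to tools..."

# Recorded alongside each run, so a change in behavior can be matched
# to the prompt revision that caused it.
run_metadata = {
    "prompt_version": prompt_fingerprint(AGENT_SYSTEM_PROMPT),
    "model": "llama-3.1-8b-instruct",
}
```

Committing the prompt file itself to version control gives the history; the fingerprint ties each logged trajectory to one point in that history.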
Local Agent Frameworks
Several frameworks support local agent development. Each has different trade-offs in terms of flexibility, complexity, and local-first support.
LangChain / LangGraph
The most popular framework with extensive documentation and community. LangGraph adds explicit state machines for more complex agent flows. Good local model support through integrations with Ollama, llama.cpp, and vLLM.
Strengths: Large ecosystem, many integrations, active development.
Weaknesses: Abstractions can be leaky, debugging can be difficult, API changes frequently.
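Whichever framework you pick, its local-model integrations mostly wrap an inference server's HTTP API. As a rough sketch of what such an integration reduces to, assuming an Ollama server on its default `localhost:11434` (the helper names here are illustrative; only the endpoint and payload shape follow Ollama's chat API):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default chat endpoint

def build_chat_request(model: str, messages: list) -> urllib.request.Request:
    """Build the POST request a local-model chat integration sends."""
    payload = json.dumps({"model": model, "messages": messages, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(model: str, messages: list) -> str:
    """Send one chat turn; requires a running Ollama server."""
    req = build_chat_request(model, messages)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

Knowing this layer exists makes framework debugging easier: when an abstraction misbehaves, you can always drop down and inspect the raw request.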
LlamaIndex
Originally focused on RAG, now includes agent capabilities. Strong emphasis on data connectors and retrieval. Good for agents that primarily work with documents.
Strengths: Excellent RAG support, many data connectors, good local model support.
Weaknesses: Agent features less mature than LangChain, smaller community.
Haystack
Pipeline-based framework with strong production focus. Good for building structured workflows. Native support for local models through various backends.
Strengths: Production-ready, pipeline paradigm is intuitive, good evaluation tools.
Weaknesses: Less flexible for complex agent patterns, smaller ecosystem.
Custom Implementation
Building your own agent loop gives maximum control but requires more effort. Consider this when framework abstractions get in the way or you have specific requirements.
Strengths: Full control, no framework overhead, can optimize for your specific use case.
Weaknesses: More code to maintain, no ecosystem benefits, need to solve common problems yourself.
Minimal Custom Agent
A complete but minimal agent implementation in ~100 lines, suitable as a starting point for custom development:
from dataclasses import dataclass
from typing import Callable
import json
import re

@dataclass
class Tool:
    name: str
    description: str
    parameters: dict
    function: Callable

class SimpleAgent:
    def __init__(self, llm, tools: list[Tool], max_iterations: int = 10):
        self.llm = llm
        self.tools = {t.name: t for t in tools}
        self.max_iterations = max_iterations

    def _build_system_prompt(self) -> str:
        tool_desc = "\n".join(
            f"- {t.name}: {t.description}\n  Parameters: {json.dumps(t.parameters)}"
            for t in self.tools.values()
        )
        return f"""You are a helpful assistant with access to tools.

Tools:
{tool_desc}

To use a tool, write:
Thought: [reasoning]
Action: [tool_name]
Action Input: {{"param": "value"}}

When done:
Thought: [summary]
Final Answer: [response]"""

    def run(self, query: str) -> str:
        messages = [
            {"role": "system", "content": self._build_system_prompt()},
            {"role": "user", "content": query},
        ]
        for _ in range(self.max_iterations):
            response = self.llm.chat(messages)
            messages.append({"role": "assistant", "content": response})
            # Check for final answer
            if "Final Answer:" in response:
                return response.split("Final Answer:")[-1].strip()
            # Parse and execute tool call
            action_match = re.search(r"Action:\s*(.+)", response)
            input_match = re.search(r"Action Input:\s*({.+})", response, re.DOTALL)
            if action_match and input_match:
                tool_name = action_match.group(1).strip()
                try:
                    params = json.loads(input_match.group(1))
                    if tool_name in self.tools:
                        result = self.tools[tool_name].function(**params)
                        observation = f"Observation: {result}"
                    else:
                        observation = f"Observation: Error - Unknown tool '{tool_name}'"
                except Exception as e:
                    observation = f"Observation: Error - {str(e)}"
                messages.append({"role": "user", "content": observation})
            else:
                messages.append({
                    "role": "user",
                    "content": "Please use the correct format: Action: tool_name\nAction Input: {params}",
                })
        return "Max iterations reached without final answer."

Summary
Building reliable agents requires understanding each component deeply: the reasoning engine's capabilities and limitations, tool design principles, memory architecture trade-offs, and orchestration patterns. Local deployment adds constraints (compute, memory) but also advantages (privacy, cost, control).
Start with the simplest architecture that might work—often a basic ReAct agent with a few well-designed tools. Add complexity only when you hit specific limitations. Invest heavily in observability from day one. And remember that prompt engineering and tool design often matter more than framework choice.
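On the observability point: even a hand-rolled agent benefits from a structured trace from day one. A minimal sketch, standard library only (the `TraceLogger` class is illustrative, not from any framework), that records every LLM call, tool call, and error as one JSON line for later inspection:

```python
import json
import time

class TraceLogger:
    """Append each agent event (LLM call, tool call, error) as one JSON line."""

    def __init__(self, path: str):
        self.path = path

    def log(self, event: str, **fields) -> dict:
        record = {"ts": round(time.time(), 3), "event": event, **fields}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        return record

# Usage inside an agent loop (illustrative):
# trace = TraceLogger("run_trace.jsonl")
# trace.log("llm_call", prompt_tokens=512, latency_s=1.8)
# trace.log("tool_call", tool="calculator", input={"expression": "1.15 * 1.20"})
```

A flat JSONL trace like this is enough to answer the questions that matter most when an agent misbehaves: which tool was called, with what input, and what the model saw next.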