Build an LLM Playground — Part 3: Build an "Ask-the-Web" Agent with Tool Calling
The third entry in the learn-by-doing AI engineer series. We cover AI agents from the ground up — workflows, tool calling, MCP, multi-step reasoning patterns like ReACT and Reflexion, multi-agent systems, and evaluation — then build a Perplexity-style web search agent.
Series: The AI Engineer Learning Path
This is Part 3 of a hands-on series designed to take you from zero to working AI engineer. Every post follows a learn-by-doing philosophy — we explain the theory, then you build something real.
| Part | Topic | Status |
|---|---|---|
| 1 | Build an LLM Playground | Complete |
| 2 | Customer Support Chatbot with RAG & Prompt Engineering | Complete |
| 3 | "Ask-the-Web" Agent with Tool Calling (this post) | Current |
| 4 | Deep Research with Reasoning Models | Available |
| 5 | Multi-modal Generation Agent | Available |
In Part 1, we learned how LLMs work. In Part 2, we built a RAG chatbot that answers questions from documents. Now we're taking a leap: building an AI system that can reason, plan, use tools, and take actions in the real world.
By the end of this post, you'll understand the full spectrum of agentic AI — from simple prompt chains to autonomous multi-step agents — and you'll build a Perplexity-style "Ask-the-Web" agent that searches the internet, synthesizes information, and provides cited answers.
Why Agents?
A chatbot answers questions. An agent takes actions.
Think about the difference between asking "What's the weather in Tokyo?" and asking "Book me a flight to Tokyo next week, find a hotel near Shibuya, and check the weather so I know what to pack." The first is a lookup. The second requires planning, tool use, decision-making, and multi-step execution.
Agents bridge this gap. They turn LLMs from conversational interfaces into systems that can interact with the world — searching the web, calling APIs, executing code, reading files, and coordinating complex workflows.
Part I: Agents Overview
Agents vs. Agentic Systems vs. LLMs
These terms get used loosely. Let's be precise:
| Concept | What It Is | Example |
|---|---|---|
| LLM | A model that generates text given a prompt. No memory, no tools, no autonomy. It produces one response and stops. | GPT-4 answering "What is gravity?" |
| Agentic system | An LLM wrapped in a loop with access to tools and some degree of autonomy. The system can take multiple steps to accomplish a goal. | A chatbot that searches a knowledge base before answering |
| Agent | A highly autonomous agentic system that can plan, execute, observe results, and adapt its strategy. It decides what to do next based on what happened. | An AI research assistant that formulates search queries, reads papers, synthesizes findings, and iterates until it has a complete answer |
The key distinction is autonomy. An LLM does exactly what you ask once. An agentic system follows a predefined pattern (retrieve, then generate). An agent decides its own approach and adapts.
Agency Levels
Not every system needs full agent autonomy. In fact, simpler is usually better. Here's a spectrum:
| Level | Description | Autonomy | Example |
|---|---|---|---|
| Level 0: Direct LLM call | Single prompt → single response | None | "Translate this sentence to French" |
| Level 1: Workflow | Predefined sequence of LLM calls. The developer controls the flow. | Low | Prompt chain: summarize → translate → format |
| Level 2: Router | LLM decides which path to take from a fixed set of options | Low-Medium | Classify a customer query, then route to the right handler |
| Level 3: Tool-using LLM | LLM decides when and how to call tools, but within a single turn | Medium | Search the web, then answer the question |
| Level 4: Multi-step agent | LLM operates in a loop — observe, think, act, repeat — until the task is done | High | ReACT agent that researches a topic across multiple searches |
| Level 5: Multi-agent system | Multiple agents collaborating, delegating, and coordinating | Very High | A team of agents: researcher, writer, and editor working together |
Practical advice: Start at the lowest level that solves your problem. Most production AI features are Level 1-3. Full agents (Level 4-5) are powerful but harder to control, debug, and make reliable.
Part II: Workflows
Workflows are the most reliable form of agentic systems. The developer defines the control flow — the LLM handles the language processing at each step, but doesn't decide what to do next.
Prompt Chaining
Run a sequence of LLM calls where each call's output feeds into the next call's input. Each step handles one focused task.
Input → [Step 1: Extract key facts] → [Step 2: Research each fact] → [Step 3: Write summary] → Output
Example: Research report generator
# Step 1: Extract key questions from the user's topic
questions = llm(f"Given the topic '{topic}', generate 5 specific research questions.")

# Step 2: Search each question and extract key facts
facts = []
for question in questions:
    search_results = web_search(question)
    facts.append(llm(f"Extract key facts from these results:\n{search_results}"))

# Step 3: Synthesize into a report
report = llm(f"Write a research report based on these facts:\n{facts}")
When to use: Tasks that are naturally sequential, where each step has clear inputs and outputs. The most common pattern in production.
| Advantage | Disadvantage |
|---|---|
| Easy to debug — inspect each step's output | Rigid — can't adapt to unexpected results |
| Easy to test — unit test each step | Latency compounds — N steps = N LLM calls |
| Easy to improve — swap out individual steps | Error propagation — early mistakes cascade |
Routing
An LLM classifies the input and routes it to the appropriate handler. The LLM acts as a decision maker but doesn't execute the downstream logic.
User Input → [LLM Classifier] → Route A: Technical support handler
→ Route B: Billing handler
→ Route C: General inquiry handler
Example: Support ticket router
def route_ticket(message: str) -> str:
    category = llm(
        f"""Classify this support message into exactly one category:
- technical: API errors, integration issues, bugs
- billing: charges, invoices, refunds, plans
- account: login issues, settings, permissions
- general: everything else
Message: {message}
Category:"""
    )
    handlers = {
        "technical": handle_technical,
        "billing": handle_billing,
        "account": handle_account,
        "general": handle_general,
    }
    # Fall back to the general handler if the model returns an unexpected label
    return handlers.get(category.strip(), handle_general)(message)
When to use: When different input types require fundamentally different handling. Common in customer support, content moderation, and task dispatching.
Parallelization
Run multiple LLM calls simultaneously and combine the results. Two main patterns:
Sectioning: Split a task into independent sub-tasks, run them in parallel, combine results.
┌→ [Analyze sentiment] ─┐
User Review ────────┼→ [Extract key features] ├→ [Combine into report]
└→ [Check for policy issues]─┘
Voting: Run the same task multiple times and aggregate results for higher accuracy.
┌→ [LLM call 1: "toxic"] ─┐
User Message ───────┼→ [LLM call 2: "toxic"] ├→ Majority vote: "toxic"
└→ [LLM call 3: "not toxic"] ─┘
| Pattern | When to Use | Example |
|---|---|---|
| Sectioning | Task has independent sub-tasks that don't depend on each other | Analyze a document for sentiment, entities, and key themes simultaneously |
| Voting | High-stakes decisions where accuracy matters more than speed or cost | Content moderation, medical triage classification |
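The voting pattern above can be sketched with Python's `concurrent.futures`. Here `classify` is a deterministic stub standing in for a real LLM moderation call (an assumption for illustration, not part of the original):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def classify(message: str) -> str:
    """Stand-in for an LLM moderation call; a real version would hit a model API."""
    return "toxic" if "stupid" in message.lower() else "not toxic"

def moderate_by_vote(message: str, n_votes: int = 3) -> str:
    """Run the same classification n_votes times in parallel, then majority-vote."""
    with ThreadPoolExecutor(max_workers=n_votes) as pool:
        votes = list(pool.map(classify, [message] * n_votes))
    # Counter.most_common(1) returns the label with the most votes
    return Counter(votes).most_common(1)[0][0]
```

With a deterministic stub all votes agree; with a real sampled LLM, the majority vote smooths out run-to-run variance, which is the point of the pattern.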
Reflection
The LLM reviews and critiques its own output, then improves it. This creates a self-improving loop without external feedback.
Input → [Generate] → [Critique] → [Revise] → Output
↑ │
└────────────────────────┘ (repeat N times)
Example: Code generation with self-review
# Generate initial code
code = llm(f"Write a Python function that {task_description}")

# Self-review loop
for i in range(3):
    critique = llm(f"""Review this code for bugs, edge cases, and improvements:
{code}
List specific issues, or say "no issues".""")
    if "no issues" in critique.lower():
        break
    code = llm(f"""Fix the following issues in this code:
Issues: {critique}
Code:
{code}""")
When to use: Tasks where quality can be objectively assessed — code generation, writing, translation, data extraction. Not useful when the model can't reliably judge its own output.
Orchestrator-Worker
An orchestrator LLM breaks down a complex task and delegates sub-tasks to worker LLMs. The orchestrator manages the overall plan and synthesizes results.
┌→ [Worker 1: Research pricing]
[Orchestrator] → Plan ─────┼→ [Worker 2: Research features]
↑ └→ [Worker 3: Research reviews]
│
└──── [Synthesize results into final report]
Example:
import json

def orchestrator(task: str) -> str:
    # Orchestrator creates a plan
    plan = llm(f"""Break this task into 3-5 independent sub-tasks:
Task: {task}
Return as a JSON array of sub-task descriptions.""")
    sub_tasks = json.loads(plan)
    # Workers execute in parallel
    results = parallel_map(
        lambda sub_task: llm(f"Complete this sub-task thoroughly:\n{sub_task}"),
        sub_tasks
    )
    # Orchestrator synthesizes
    return llm(f"""Synthesize these sub-task results into a final answer:
Task: {task}
Results: {json.dumps(results)}""")
When to use: Complex tasks where the sub-tasks aren't known in advance and may vary based on the input. More flexible than prompt chaining but also more complex and expensive.
Part III: Tools
Tools are what make agents capable of interacting with the real world. Without tools, an LLM can only generate text. With tools, it can search the web, query databases, execute code, send emails, and more.
Tool Calling
Tool calling (also called function calling) is a structured way for an LLM to request that external functions be executed. The model doesn't execute the tool itself — it outputs a structured request, your application executes it, and the result is fed back to the model.
User: "What's the weather in Tokyo?"
↓
LLM thinks: "I need to use the weather tool"
↓
LLM outputs: {"tool": "get_weather", "args": {"city": "Tokyo"}}
↓
Your app: executes get_weather("Tokyo") → {"temp": 22, "condition": "sunny"}
↓
LLM receives result, generates: "It's currently 22°C and sunny in Tokyo."
The tool calling flow in detail:
┌────────┐ ┌─────┐ ┌──────────┐ ┌─────┐ ┌────────┐
│ User │────→│ LLM │────→│ Tool Call │────→│ App │────→│ Result │
│Message │ │ │ │ Request │ │ │ │ │
└────────┘ └─────┘ └──────────┘ └─────┘ └───┬────┘
│
┌─────┐ ┌──────────┐ │
│ LLM │←────│ Tool │←────────────────────┘
│ │ │ Result │
└──┬──┘ └──────────┘
│
▼
Final Response
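Stripped of any provider SDK, the round trip above is just structured message passing. A minimal sketch of the app side — the dict shape of `model_output` is illustrative, not any provider's exact wire format:

```python
import json

def get_weather(city: str) -> dict:
    """Stub tool; a real version would call a weather API."""
    return {"temp": 22, "condition": "sunny"}

# Registry mapping tool names to the functions that implement them
TOOLS = {"get_weather": get_weather}

def handle_tool_call(tool_call: dict) -> str:
    """Dispatch the model's structured request and serialize the result."""
    func = TOOLS[tool_call["tool"]]
    return json.dumps(func(**tool_call["args"]))

# The model emits a structured request instead of prose...
model_output = {"tool": "get_weather", "args": {"city": "Tokyo"}}
# ...the app executes it; the result goes back to the model as a new message.
tool_result = handle_tool_call(model_output)
```

The key design point: the model never executes anything. It only proposes a call; your application owns execution, so it can validate, authorize, and sandbox.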
Tool Formatting
Different providers use different formats for defining tools. Here's how the major ones work:
OpenAI format:
{
    "type": "function",
    "function": {
        "name": "search_web",
        "description": "Search the web for current information on a topic",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query"
                },
                "num_results": {
                    "type": "integer",
                    "description": "Number of results to return (default: 5)"
                }
            },
            "required": ["query"]
        }
    }
}
Anthropic format:
{
    "name": "search_web",
    "description": "Search the web for current information on a topic",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "The search query"
            },
            "num_results": {
                "type": "integer",
                "description": "Number of results to return (default: 5)"
            }
        },
        "required": ["query"]
    }
}
Best practices for tool definitions:
| Practice | Why It Matters |
|---|---|
| Write clear descriptions | The model uses descriptions to decide when and how to use the tool. Vague descriptions = wrong tool calls. |
| Include parameter descriptions | Don't just name parameters — explain what they do, valid ranges, and defaults. |
| Use specific names | search_knowledge_base is better than search. Specificity helps the model pick the right tool. |
| Limit the tool set | More tools = more confusion. Provide only the tools relevant to the current task. |
| Include examples in descriptions | "Search query, e.g., 'CloudAPI authentication error 401'" helps the model format inputs correctly. |
Tool Execution
Your application is responsible for executing tool calls. This means you control:
| Concern | What You Control |
|---|---|
| Validation | Check that the model's arguments are valid before executing |
| Authorization | Ensure the user has permission to use this tool |
| Rate limiting | Prevent runaway agents from making thousands of API calls |
| Error handling | Return clear error messages so the model can adapt |
| Timeouts | Kill long-running tool calls to prevent hangs |
| Sandboxing | For code execution tools, run in isolated environments |
Execution pattern:
import json

def execute_tool(tool_name: str, args: dict) -> str:
    """Execute a tool call from the LLM with safety checks."""
    # 1. Validate tool exists
    if tool_name not in AVAILABLE_TOOLS:
        return f"Error: Unknown tool '{tool_name}'"
    # 2. Validate arguments
    tool = AVAILABLE_TOOLS[tool_name]
    validation_error = tool.validate_args(args)
    if validation_error:
        return f"Error: {validation_error}"
    # 3. Check permissions
    if not user_has_permission(current_user, tool_name):
        return f"Error: User does not have permission to use '{tool_name}'"
    # 4. Execute with timeout
    try:
        result = tool.execute(args, timeout=30)
        return json.dumps(result)
    except TimeoutError:
        return "Error: Tool execution timed out after 30 seconds"
    except Exception as e:
        return f"Error: {str(e)}"
MCP (Model Context Protocol)
MCP is an open protocol (created by Anthropic) that standardizes how LLMs connect to external tools and data sources. Think of it as USB-C for AI tools — a universal interface so any tool can work with any model.
Why MCP matters:
Before MCP, every tool integration was custom. Want your agent to use GitHub? Write a GitHub integration. Slack? Write another. Every tool × every model = an explosion of custom code.
MCP standardizes this:
Without MCP:
App → Custom GitHub integration
App → Custom Slack integration
App → Custom DB integration
(N tools × M apps = N×M integrations)
With MCP:
App → MCP Client → MCP Server (GitHub)
App → MCP Client → MCP Server (Slack)
App → MCP Client → MCP Server (DB)
(N tools + M apps = N+M integrations)
MCP architecture:
┌────────────────────────────────┐
│ MCP Host │
│ (Your AI application) │
│ │
│ ┌────────────────────────┐ │
│ │ MCP Client │ │
│ │ (Protocol handler) │ │
│ └──────────┬─────────────┘ │
└─────────────┼──────────────────┘
│ (JSON-RPC over stdio/SSE)
│
┌──────────▼──────────────┐
│ MCP Server │
│ (Tool provider) │
│ │
│ Exposes: │
│ - Tools (functions) │
│ - Resources (data) │
│ - Prompts (templates) │
└─────────────────────────┘
MCP capabilities:
| Capability | Description | Example |
|---|---|---|
| Tools | Functions the model can call | search_web, read_file, query_database |
| Resources | Data sources the model can read | Files, database records, API responses |
| Prompts | Reusable prompt templates | "Summarize this document", "Review this code" |
Example MCP server (Python):
from mcp.server import Server
from mcp.types import Tool, TextContent

server = Server("web-search")

@server.tool()
async def search_web(query: str, num_results: int = 5) -> list[TextContent]:
    """Search the web for current information on a topic."""
    results = await perform_web_search(query, num_results)
    return [TextContent(type="text", text=format_results(results))]

@server.tool()
async def fetch_page(url: str) -> list[TextContent]:
    """Fetch and extract the main content from a web page."""
    content = await fetch_and_extract(url)
    return [TextContent(type="text", text=content)]
Why this matters for your "Ask-the-Web" agent: MCP lets you build your web search tools as a reusable server that any MCP-compatible application can use — not just your specific agent.
Part IV: Multi-Step Agents
Workflows are developer-controlled. Multi-step agents are model-controlled. The agent decides what to do next based on what it observes.
Planning Autonomy
The core question with agents is: how much autonomy should the model have?
| Approach | Planning | Execution | When to Use |
|---|---|---|---|
| Fixed plan | Developer defines the steps | Model executes each step | Predictable tasks with known workflows |
| LLM-generated plan | Model creates a plan, human approves it, then it executes | Model follows its own approved plan | Complex tasks where you want oversight |
| Fully autonomous | Model plans and executes in a loop, adapting as it goes | Model decides everything | Exploratory tasks where the path isn't known in advance |
ReACT (Reasoning + Acting)
ReACT is the foundational agent pattern. The model alternates between thinking (reasoning about what to do) and acting (using tools), then observing the results.
Loop:
1. Thought: "I need to find out X. I'll search for Y."
2. Action: search_web("Y")
3. Observation: [search results]
4. Thought: "The results show Z, but I also need to know W."
5. Action: search_web("W")
6. Observation: [more results]
7. Thought: "Now I have enough information to answer."
8. Final Answer: [synthesized response]
ReACT implementation:
def react_agent(question: str, tools: list, max_steps: int = 10) -> str:
    messages = [
        {"role": "system", "content": """You are a research agent. For each step:
1. Think about what you need to know and what tool to use.
2. Use a tool to gather information.
3. Observe the result.
4. Repeat until you can answer the question.
When you have enough information, provide your final answer."""},
        {"role": "user", "content": question}
    ]
    for step in range(max_steps):
        response = llm(messages, tools=tools)
        # Check if the model wants to use a tool
        if response.tool_calls:
            # Record the assistant turn (with its tool calls) before the results
            messages.append({"role": "assistant", "content": response.content,
                             "tool_calls": response.tool_calls})
            for tool_call in response.tool_calls:
                result = execute_tool(tool_call.name, tool_call.args)
                messages.append({"role": "tool", "content": result})
        else:
            # No tool call = model is ready to give a final answer
            return response.content
    return "Agent reached maximum steps without a final answer."
Why ReACT works:
- The explicit "Thought" step forces the model to reason before acting
- Each observation grounds the next decision in real data
- The loop naturally handles multi-step tasks
- The model can adapt its plan based on what it learns
Reflexion
Reflexion adds a self-reflection step to the agent loop. After completing a task (or failing), the agent reflects on what went well and what didn't, then uses that reflection to improve on the next attempt.
Attempt 1:
ReACT loop → Answer → Evaluate → "My answer was wrong because I didn't consider X"
↓
Attempt 2:
ReACT loop (with reflection context) → Better Answer → Evaluate → "Correct!"
When Reflexion helps:
- Tasks where the agent can evaluate its own output (code that must pass tests, math with verifiable answers)
- When the first attempt often fails but the model can learn from the failure
- Research tasks where the initial search strategy was too narrow
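The Reflexion loop can be sketched as a thin wrapper around any base agent. The `solve` and `evaluate` callables here are assumed stand-ins: in a real system, `solve` would be a ReACT-style agent that receives past reflections in its prompt, and `evaluate` might run tests or an LLM judge:

```python
def reflexion_agent(task, solve, evaluate, max_attempts=3):
    """Retry a task, feeding a reflection on each failure into the next attempt.

    solve(task, reflections) -> candidate answer
    evaluate(task, answer)   -> (passed: bool, feedback: str)
    """
    reflections = []
    answer = None
    for attempt in range(max_attempts):
        answer = solve(task, reflections)
        passed, feedback = evaluate(task, answer)
        if passed:
            return answer
        # In the real pattern, an LLM writes this reflection from the feedback
        reflections.append(f"Attempt {attempt + 1} failed: {feedback}")
    return answer  # best effort after max_attempts
```

The accumulated `reflections` list is the pattern's whole trick: it is the only state that crosses attempt boundaries, turning failures into context for the next try.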
ReWOO (Reasoning Without Observation)
ReWOO separates planning from execution. The agent creates a complete plan upfront, then executes all steps, then synthesizes. This reduces the number of LLM calls.
Standard ReACT: Think → Act → Observe → Think → Act → Observe → Think → Answer
(7 LLM calls)
ReWOO: Plan (all steps at once) → Execute all → Synthesize
(2 LLM calls)
def rewoo_agent(question: str) -> str:
    # Step 1: Plan all steps at once
    plan = llm(f"""Create a plan to answer this question: {question}
For each step, specify which tool to use and what arguments to pass.
Format: Step N: tool_name(args) - purpose""")
    # Step 2: Execute all steps
    results = {}
    for step in parse_plan(plan):
        results[step.id] = execute_tool(step.tool, step.args)
    # Step 3: Synthesize
    return llm(f"""Given these results, answer the question: {question}
Results: {json.dumps(results)}""")
| Aspect | ReACT | ReWOO |
|---|---|---|
| LLM calls | Many (one per step) | Few (plan + synthesize) |
| Adaptability | High — can change plan mid-execution | Low — plan is fixed |
| Latency | Higher (sequential LLM calls) | Lower (parallel tool execution possible) |
| When to use | Exploratory tasks, unknown number of steps | Well-defined tasks, latency-sensitive |
Tree Search for Agents
For complex reasoning tasks, a single linear chain of thought may not find the best solution. Tree search explores multiple reasoning paths and selects the most promising one.
Tree of Thought (ToT):
[Initial Question]
/ | \
[Approach A] [Approach B] [Approach C]
/ \ | / \
[A1] [A2] [B1] [C1] [C2]
↓ ↓
[Evaluate] [Evaluate]
↓ ↓
Score: 0.9 Score: 0.7
↓
[Best path → Final Answer]
How it works:
- Generate multiple possible next steps (branching)
- Evaluate each branch with a heuristic or LLM judge
- Expand the most promising branches
- Prune unpromising branches
- Continue until a satisfactory solution is found
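The generate–evaluate–prune loop above amounts to a beam search over reasoning paths. A minimal sketch, where `propose` and `score` are assumed stand-ins for LLM calls (propose candidate next thoughts; judge how promising a path is):

```python
def tree_of_thought(question, propose, score, depth=2, beam_width=2):
    """Beam search over reasoning paths.

    propose(path) -> list of candidate next thoughts
    score(path)   -> float, higher is more promising
    """
    beam = [[question]]  # each path is a list of thoughts, rooted at the question
    for _ in range(depth):
        # Branch: extend every surviving path with each proposed next step
        candidates = [path + [step] for path in beam for step in propose(path)]
        if not candidates:
            break
        # Prune: keep only the most promising branches
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
    return max(beam, key=score)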
Monte Carlo Tree Search (MCTS) for agents:
| Component | In Games (AlphaGo) | In Agents |
|---|---|---|
| State | Board position | Current reasoning + gathered information |
| Action | Place a stone | Choose a reasoning step or tool call |
| Reward | Win/lose | Answer correctness (verified or LLM-judged) |
| Rollout | Random play to end | Complete the reasoning chain to get an answer |
When to use tree search:
- Mathematical reasoning where multiple approaches exist
- Planning tasks with many possible strategies
- Tasks where you can verify correctness (code, math, logic puzzles)
- When accuracy matters more than speed
Part V: Multi-Agent Systems
Sometimes one agent isn't enough. Multi-agent systems use multiple specialized agents that collaborate, debate, or delegate to accomplish complex tasks.
Why Multiple Agents?
| Reason | Description | Example |
|---|---|---|
| Specialization | Different agents with different expertise | Researcher agent + writer agent + editor agent |
| Parallelism | Multiple agents working simultaneously | Three agents researching different aspects of a topic |
| Debate/verification | Agents check each other's work | One agent generates code, another reviews it |
| Separation of concerns | Each agent has a focused scope and toolset | A planning agent that delegates to execution agents |
Challenges of Multi-Agent Systems
| Challenge | Description | Mitigation |
|---|---|---|
| Coordination overhead | Agents need to communicate, which adds latency and cost | Clear protocols, minimal message passing |
| Error propagation | One agent's mistake cascades to others | Validation between agent handoffs |
| Infinite loops | Agents pass tasks back and forth forever | Step limits, loop detection, human-in-the-loop checkpoints |
| Context management | Each agent has limited context; sharing state is hard | Shared memory store, structured handoff messages |
| Debugging | Hard to trace why the system produced a specific output | Comprehensive logging of all agent interactions |
| Cost | Multiple agents = multiple LLM calls per user request | Budget limits, efficient agent design |
Use Cases for Multi-Agent Systems
| Use Case | Agent Architecture | How It Works |
|---|---|---|
| Software development | Planner → Coder → Reviewer → Tester | Planner breaks down the task, coder implements, reviewer checks quality, tester validates |
| Research synthesis | Coordinator → multiple Researchers → Synthesizer | Coordinator assigns sub-topics, researchers investigate in parallel, synthesizer combines |
| Content pipeline | Researcher → Writer → Editor → Fact-checker | Each agent specializes in one stage of content creation |
| Customer support escalation | Tier 1 bot → Specialist agents → Human escalation | Simple queries handled by Tier 1, complex ones routed to domain-specific agents |
| Debate / red team | Proposer → Critic → Judge | One agent proposes an answer, another critiques it, a judge decides |
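The debate / red-team row can be sketched as three roles wired in sequence. The three callables are assumed stand-ins for LLM calls with different system prompts (proposer, critic, judge):

```python
def debate(question, propose, critique, judge, rounds=2):
    """Proposer answers, critic attacks, judge decides when to stop.

    propose(question, criticism) -> answer (revised if criticism is given)
    critique(question, answer)   -> criticism string
    judge(question, answer, criticism) -> True if the answer should be accepted
    """
    answer = propose(question, criticism=None)
    for _ in range(rounds):
        criticism = critique(question, answer)
        if judge(question, answer, criticism):  # judge accepts the answer
            return answer
        answer = propose(question, criticism)   # revise using the critique
    return answer
```

Capping `rounds` matters: without it, a stubborn proposer and critic can loop forever, which is exactly the "infinite loops" failure mode from the challenges table.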
A2A Protocol (Agent-to-Agent)
Just as MCP standardizes tool communication, the A2A protocol (introduced by Google) standardizes how agents communicate with each other.
Core concepts:
┌──────────┐ Agent Card (discovery) ┌──────────┐
│ Agent A │ ─────────────────────────────→ │ Agent B │
│ (Client) │ │ (Server) │
│ │ ←── Task (request/response) ──→ │ │
│ │ │ │
│ │ ←── Artifacts (results) ────── │ │
└──────────┘ └──────────┘
| Concept | Description |
|---|---|
| Agent Card | A JSON document describing what an agent can do — its capabilities, skills, and endpoint. Used for discovery. |
| Task | A unit of work sent from one agent to another. Has a lifecycle: submitted → working → completed/failed. |
| Artifact | The output of a task — files, text, structured data. |
| Message | Communication within a task — instructions, status updates, questions. |
Why A2A matters: It enables interoperability between agents built by different teams, companies, or frameworks. Your research agent could delegate to a third-party data analysis agent without custom integration code.
Part VI: Evaluation of Agents
Evaluating agents is harder than evaluating LLMs because agents take actions, make decisions, and produce results through multi-step processes.
What to Evaluate
| Dimension | What It Measures | How to Measure |
|---|---|---|
| Task completion | Did the agent accomplish the goal? | Binary success/failure on a test suite |
| Answer quality | Is the final output correct and useful? | LLM-as-judge, human evaluation, ground-truth comparison |
| Efficiency | How many steps/tokens/tool calls did it take? | Count steps, measure tokens, track cost |
| Tool use accuracy | Did the agent pick the right tools with correct arguments? | Compare against expected tool call sequences |
| Reasoning quality | Were the agent's intermediate thoughts logical? | Evaluate thought traces, check for reasoning errors |
| Robustness | Does the agent handle edge cases and errors gracefully? | Adversarial test cases, error injection |
| Safety | Does the agent avoid harmful actions? | Red-team testing, sandboxed execution |
| Latency | How long does the full agent loop take? | End-to-end timing |
Evaluation Approaches
Benchmark-based evaluation:
| Benchmark | What It Tests | Domain |
|---|---|---|
| SWE-bench | Resolve real GitHub issues by writing code patches | Software engineering |
| WebArena | Complete real-world tasks on live websites | Web navigation |
| GAIA | General AI Assistant tasks requiring tool use and reasoning | General assistant |
| AgentBench | Multi-domain agent tasks (OS, DB, web, code) | Cross-domain |
| ToolBench | Tool selection and use across 16,000+ real APIs | Tool use |
| HotPotQA | Multi-hop question answering requiring multiple evidence sources | Research |
Trajectory-based evaluation:
Don't just evaluate the final answer — evaluate the entire trajectory:
Score each step:
Step 1: Chose correct tool? ✅ Arguments correct? ✅ Result useful? ✅
Step 2: Chose correct tool? ✅ Arguments correct? ❌ Result useful? ❌
Step 3: Recovered from error? ✅ Adapted strategy? ✅
Final answer correct? ✅
Trajectory score: 5/7 steps correct, recovered from error, correct final answer
Cost-quality trade-off:
The best agent isn't always the one with the best answers — it's the one that balances quality against cost and latency.
Agent A: 95% accuracy, avg 12 steps, $0.50/query, 45s latency
Agent B: 90% accuracy, avg 4 steps, $0.08/query, 12s latency
For most production use cases, Agent B is better.
Building Your Evaluation Pipeline
def evaluate_agent(agent, test_cases: list[dict]) -> dict:
    results = []
    for test in test_cases:
        # Run agent
        trace = agent.run(test["question"], return_trace=True)
        results.append({
            "question": test["question"],
            "expected": test["expected_answer"],
            "actual": trace.final_answer,
            "correct": judge_correctness(trace.final_answer, test["expected_answer"]),
            "steps": len(trace.steps),
            "tool_calls": len(trace.tool_calls),
            "tokens_used": trace.total_tokens,
            "latency_ms": trace.duration_ms,
            "cost": trace.estimated_cost,
        })
    return {
        "accuracy": sum(r["correct"] for r in results) / len(results),
        "avg_steps": sum(r["steps"] for r in results) / len(results),
        "avg_cost": sum(r["cost"] for r in results) / len(results),
        "avg_latency": sum(r["latency_ms"] for r in results) / len(results),
    }
Part VII: Build Your "Ask-the-Web" Agent
Now let's build it — a Perplexity-style research agent that searches the web, reads pages, and synthesizes cited answers.
Architecture
┌──────────────────────────────────────────────────────────┐
│ Ask-the-Web Agent │
│ │
│ User Question │
│ ↓ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ ReACT Loop │ │
│ │ │ │
│ │ Thought: "I need to search for X" │ │
│ │ ↓ │ │
│ │ Action: search_web("X") │ │
│ │ ↓ │ │
│ │ Observation: [10 search results with snippets] │ │
│ │ ↓ │ │
│ │ Thought: "Result 3 looks relevant, let me read"│ │
│ │ ↓ │ │
│ │ Action: fetch_page("https://...") │ │
│ │ ↓ │ │
│ │ Observation: [full page content] │ │
│ │ ↓ │ │
│ │ Thought: "I need one more perspective on Y" │ │
│ │ ↓ │ │
│ │ Action: search_web("Y different angle") │ │
│ │ ↓ │ │
│ │ ... (continue until sufficient information) │ │
│ │ ↓ │ │
│ │ Final Answer (with citations) │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ Tools: │
│ - search_web(query) → search results │
│ - fetch_page(url) → page content │
│ - calculate(expression) → numerical result │
└──────────────────────────────────────────────────────────┘
Implementation
from anthropic import Anthropic
client = Anthropic()
# Define tools
tools = [
    {
        "name": "search_web",
        "description": "Search the web for current information. Returns a list of results with titles, URLs, and snippets. Use specific, detailed queries for best results.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search query, e.g. 'latest advances in quantum computing 2026'"
                }
            },
            "required": ["query"]
        }
    },
    {
        "name": "fetch_page",
        "description": "Fetch the full content of a web page. Use this to read articles, documentation, or any URL found in search results.",
        "input_schema": {
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "description": "The URL to fetch"
                }
            },
            "required": ["url"]
        }
    }
]
SYSTEM_PROMPT = """You are an expert research agent similar to Perplexity AI. Your job is to
answer questions by searching the web, reading relevant pages, and synthesizing
comprehensive, well-cited answers.
Follow this process:
1. Think about what information you need to answer the question.
2. Search the web with specific, targeted queries.
3. Read the most relevant pages to get detailed information.
4. If needed, do follow-up searches to fill gaps or verify claims.
5. Synthesize a comprehensive answer with inline citations.
Rules:
- Always cite your sources using [1], [2], etc. with a sources list at the end.
- If sources conflict, note the disagreement.
- If you can't find reliable information, say so clearly.
- Prefer recent, authoritative sources.
- Be thorough but concise — cover the key points without unnecessary detail."""
def ask_the_web(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    # ReACT loop
    for step in range(15):  # max 15 steps
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            system=SYSTEM_PROMPT,
            tools=tools,
            messages=messages,
        )
        # Collect all content blocks
        messages.append({"role": "assistant", "content": response.content})
        # Check if the model wants to use tools
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })
            messages.append({"role": "user", "content": tool_results})
        else:
            # No tool use = final answer
            return extract_text(response.content)
    return "Agent reached maximum steps. Partial answer may be available."

def execute_tool(name: str, args: dict) -> str:
    if name == "search_web":
        return search_web(args["query"])
    elif name == "fetch_page":
        return fetch_page(args["url"])
    else:
        return f"Unknown tool: {name}"
Adding Streaming for Real-Time Output
Users shouldn't stare at a blank screen while the agent works. Stream the agent's thinking and progress:
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
def ask_the_web_streaming(question: str):
    """Stream the agent's process — show thinking, tool use, and final answer."""
    messages = [{"role": "user", "content": question}]
    for step in range(15):
        print(f"\n--- Step {step + 1} ---")
        with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            system=SYSTEM_PROMPT,
            tools=tools,
            messages=messages,
        ) as stream:
            # Print text deltas as they arrive, instead of waiting for
            # the full response
            for text in stream.text_stream:
                print(text, end="", flush=True)
            print()
            response = stream.get_final_message()
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    print(f"  🔧 {block.name}({json.dumps(block.input)})")
                    result = execute_tool(block.name, block.input)
                    print(f"  ← Got {len(result)} chars")
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
            messages.append({"role": "user", "content": tool_results})
        else:
            break  # no tool use — the final answer already streamed above
Features to Add
Phase 1: Core Agent
- Web search + page fetching tools
- ReACT loop with tool calling
- Cited answers with source list
Phase 2: Enhanced Search
- Query rewriting for better search results
- Multiple search attempts with different queries
- Source credibility scoring
Phase 3: User Experience
- Streaming output showing the agent's progress
- Follow-up questions (conversational)
- Source preview cards with titles and snippets
Phase 4: Advanced
- Parallel search (run multiple queries simultaneously)
- Fact-checking via cross-referencing sources
- Caching frequent queries
- MCP server for reusable web search tools
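As a sketch of the Phase 4 caching idea — a minimal in-memory TTL cache wrapped around any search function (names and the TTL value are illustrative):

```python
import time
from typing import Callable

_cache: dict[str, tuple[float, str]] = {}
CACHE_TTL = 15 * 60  # seconds before a cached result goes stale

def cached_search(query: str, search_fn: Callable[[str], str]) -> str:
    """Wrap a search function with a simple in-memory TTL cache."""
    now = time.time()
    hit = _cache.get(query)
    if hit is not None and now - hit[0] < CACHE_TTL:
        return hit[1]               # fresh enough — skip the network call
    result = search_fn(query)       # miss or stale — hit the real tool
    _cache[query] = (now, result)
    return result
```

For production you'd likely swap the dict for Redis or similar, but the interface stays the same: identical queries within the TTL window never touch the network twice.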
Key Learning Outcomes
Building this agent teaches you:
- Tool calling mechanics — how to define, format, and execute tools with an LLM
- The ReACT pattern — the foundational loop for autonomous agents
- Prompt engineering for agents — how to guide agent behavior through system prompts
- Multi-step reasoning — how an agent decides what to search, when to dig deeper, and when to stop
- Citation and grounding — how to ensure outputs are traceable to sources
- Error handling in agent loops — what happens when a search returns nothing or a page fails to load
- Streaming for agents — giving users visibility into the agent's process
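The error-handling point above can be sketched as a wrapper around tool execution — a minimal version, with illustrative names and messages, that converts failures into text instead of raising:

```python
from typing import Callable

def safe_execute(fn: Callable[[dict], str], args: dict) -> str:
    """Run one tool call, turning failures into text the model can act on."""
    try:
        result = fn(args)
    except Exception as exc:  # timeouts, DNS failures, parse errors, ...
        return f"Tool error: {exc}. Try a different query or URL."
    if not result.strip():
        return "The tool returned no results. Try rephrasing the query."
    return result
```

Returning the error as a tool result rather than crashing the loop lets the model see what went wrong and recover — retry with a different query, fetch another URL, or tell the user it couldn't find the answer.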
What You Should Know After Reading This
If you've read this post carefully, you should be able to answer these questions:
- What's the difference between an LLM, an agentic system, and an agent?
- When would you use a prompt chain vs. a ReACT agent?
- What are the five workflow patterns and when would you use each?
- How does tool calling work — what does the model output and what does your application do?
- What is MCP and what problem does it solve?
- How does the ReACT pattern work? What are Thought, Action, and Observation?
- What is Reflexion and when does it help?
- What are the key challenges of multi-agent systems?
- How do you evaluate an agent beyond just answer correctness?
- What is the A2A protocol and why does it matter?
If you can't answer all of them yet, re-read the relevant section. Understanding agent architectures is essential for building the next generation of AI applications.
Further Reading
For those who want to go deeper on any topic covered here:
- "ReAct: Synergizing Reasoning and Acting in Language Models" (Yao et al., 2022) — The original ReACT paper
- "Reflexion: Language Agents with Verbal Reinforcement Learning" (Shinn et al., 2023) — The Reflexion paper
- "ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models" (Xu et al., 2023) — The ReWOO paper
- "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (Yao et al., 2023) — Tree search for LLMs
- "Toolformer: Language Models Can Teach Themselves to Use Tools" (Schick et al., 2023) — Self-taught tool use
- "Building Effective Agents" (Anthropic, 2024) — Anthropic's practical guide to agent design
- "The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling" (Masterman et al., 2024)
- Model Context Protocol specification — https://modelcontextprotocol.io
- LangGraph documentation — Framework for building stateful agent workflows
- CrewAI documentation — Framework for multi-agent orchestration
Next in the Series
Part 4: Deep Research with Reasoning Models — We cover reasoning and thinking LLMs (o1, DeepSeek-R1, Claude extended thinking), inference-time scaling techniques (Chain-of-Thought, self-consistency, Tree of Thoughts, search against a verifier), training-time techniques (STaR, RL with verifiers, reward modeling, Meta-CoT), and build a deep research agent that combines web search with multi-step structured reasoning.
Stay tuned.