Build an LLM Playground — Part 3: Build an "Ask-the-Web" Agent with Tool Calling
The third entry in the learn-by-doing AI engineer series. We cover AI agents from the ground up — workflows, tool calling, MCP, multi-step reasoning patterns like ReACT and Reflexion, multi-agent systems, and evaluation — then build a Perplexity-style web search agent.
Series: The AI Engineer Learning Path
This is Part 3 of a hands-on series designed to take you from zero to working AI engineer. Every post follows a learn-by-doing philosophy — we explain the theory, then you build something real.
| Part | Topic | Status |
|---|---|---|
| 1 | Build an LLM Playground | Complete |
| 2 | Customer Support Chatbot with RAG & Prompt Engineering | Complete |
| 3 | "Ask-the-Web" Agent with Tool Calling (this post) | Current |
| 4 | Deep Research with Reasoning Models | Available |
| 5 | Multi-modal Generation Agent | Available |
In Part 1, we learned how LLMs work. In Part 2, we built a RAG chatbot that answers questions from documents. Now we're taking a leap: building an AI system that can reason, plan, use tools, and take actions in the real world.
By the end of this post, you'll understand the full spectrum of agentic AI — from simple prompt chains to autonomous multi-step agents — and you'll build a Perplexity-style "Ask-the-Web" agent that searches the internet, synthesizes information, and provides cited answers.
Why Agents?
A chatbot answers questions. An agent takes actions.
Think about the difference between asking "What's the weather in Tokyo?" and asking "Book me a flight to Tokyo next week, find a hotel near Shibuya, and check the weather so I know what to pack." The first is a lookup. The second requires planning, tool use, decision-making, and multi-step execution.
Agents bridge this gap. They turn LLMs from conversational interfaces into systems that can interact with the world — searching the web, calling APIs, executing code, reading files, and coordinating complex workflows.
Part I: Agents Overview
Agents vs. Agentic Systems vs. LLMs
These terms get used loosely. Let's be precise:
| Concept | What It Is | Example |
|---|---|---|
| LLM | A model that generates text given a prompt. No memory, no tools, no autonomy. It produces one response and stops. | GPT-4 answering "What is gravity?" |
| Agentic system | An LLM wrapped in a loop with access to tools and some degree of autonomy. The system can take multiple steps to accomplish a goal. | A chatbot that searches a knowledge base before answering |
| Agent | A highly autonomous agentic system that can plan, execute, observe results, and adapt its strategy. It decides what to do next based on what happened. | An AI research assistant that formulates search queries, reads papers, synthesizes findings, and iterates until it has a complete answer |
The key distinction is autonomy. An LLM does exactly what you ask once. An agentic system follows a predefined pattern (retrieve, then generate). An agent decides its own approach and adapts.
Agency Levels
Not every system needs full agent autonomy. In fact, simpler is usually better. Here's a spectrum:
| Level | Description | Autonomy | Example |
|---|---|---|---|
| Level 0: Direct LLM call | Single prompt → single response | None | "Translate this sentence to French" |
| Level 1: Workflow | Predefined sequence of LLM calls. The developer controls the flow. | Low | Prompt chain: summarize → translate → format |
| Level 2: Router | LLM decides which path to take from a fixed set of options | Low-Medium | Classify a customer query, then route to the right handler |
| Level 3: Tool-using LLM | LLM decides when and how to call tools, but within a single turn | Medium | Search the web, then answer the question |
| Level 4: Multi-step agent | LLM operates in a loop — observe, think, act, repeat — until the task is done | High | ReACT agent that researches a topic across multiple searches |
| Level 5: Multi-agent system | Multiple agents collaborating, delegating, and coordinating | Very High | A team of agents: researcher, writer, and editor working together |
Practical advice: Start at the lowest level that solves your problem. Most production AI features are Level 1-3. Full agents (Level 4-5) are powerful but harder to control, debug, and make reliable.
Part II: Workflows
Workflows are the most reliable form of agentic systems. The developer defines the control flow — the LLM handles the language processing at each step, but doesn't decide what to do next.
Prompt Chaining
Run a sequence of LLM calls where each call's output feeds into the next call's input. Each step handles one focused task.
Input → [Step 1: Extract key facts] → [Step 2: Research each fact] → [Step 3: Write summary] → Output
Example: Research report generator
# Step 1: Extract key questions from the user's topic
questions = llm(f"Given the topic '{topic}', generate 5 specific research questions.")

# Step 2: Search each question and extract key facts
facts = []
for question in questions:
    search_results = web_search(question)
    facts.append(llm(f"Extract key facts from these results:\n{search_results}"))

# Step 3: Synthesize into a report
report = llm(f"Write a research report based on these facts:\n{facts}")
When to use: Tasks that are naturally sequential, where each step has clear inputs and outputs. The most common pattern in production.
| Advantage | Disadvantage |
|---|---|
| Easy to debug — inspect each step's output | Rigid — can't adapt to unexpected results |
| Easy to test — unit test each step | Latency compounds — N steps = N LLM calls |
| Easy to improve — swap out individual steps | Error propagation — early mistakes cascade |
Routing
An LLM classifies the input and routes it to the appropriate handler. The LLM acts as a decision maker but doesn't execute the downstream logic.
User Input → [LLM Classifier] → Route A: Technical support handler
→ Route B: Billing handler
→ Route C: General inquiry handler
Example: Support ticket router
def route_ticket(message: str) -> str:
    category = llm(
        f"""Classify this support message into exactly one category:
- technical: API errors, integration issues, bugs
- billing: charges, invoices, refunds, plans
- account: login issues, settings, permissions
- general: everything else
Message: {message}
Category:"""
    )
    handlers = {
        "technical": handle_technical,
        "billing": handle_billing,
        "account": handle_account,
        "general": handle_general,
    }
    # Fall back to the general handler if the model returns an unexpected label
    return handlers.get(category.strip(), handle_general)(message)
When to use: When different input types require fundamentally different handling. Common in customer support, content moderation, and task dispatching.
Parallelization
Run multiple LLM calls simultaneously and combine the results. Two main patterns:
Sectioning: Split a task into independent sub-tasks, run them in parallel, combine results.
┌→ [Analyze sentiment] ─┐
User Review ────────┼→ [Extract key features] ├→ [Combine into report]
└→ [Check for policy issues]─┘
Voting: Run the same task multiple times and aggregate results for higher accuracy.
┌→ [LLM call 1: "toxic"] ─┐
User Message ───────┼→ [LLM call 2: "toxic"] ├→ Majority vote: "toxic"
└→ [LLM call 3: "not toxic"] ─┘
| Pattern | When to Use | Example |
|---|---|---|
| Sectioning | Task has independent sub-tasks that don't depend on each other | Analyze a document for sentiment, entities, and key themes simultaneously |
| Voting | High-stakes decisions where accuracy matters more than speed or cost | Content moderation, medical triage classification |
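The voting pattern above can be sketched with Python's `concurrent.futures`. Here `classify` is a deterministic stub standing in for a real LLM moderation call (an assumption for illustration, not part of the original):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def classify(message: str) -> str:
    """Stand-in for an LLM moderation call; a real version would hit a model API."""
    return "toxic" if "stupid" in message.lower() else "not toxic"

def moderate_by_vote(message: str, n_votes: int = 3) -> str:
    """Run the same classification n_votes times in parallel, then majority-vote."""
    with ThreadPoolExecutor(max_workers=n_votes) as pool:
        votes = list(pool.map(classify, [message] * n_votes))
    # Counter.most_common(1) returns the label with the most votes
    return Counter(votes).most_common(1)[0][0]
```

With a deterministic stub all votes agree; with a real sampled LLM, the majority vote smooths out run-to-run variance, which is the point of the pattern.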
Reflection
The LLM reviews and critiques its own output, then improves it. This creates a self-improving loop without external feedback.
Input → [Generate] → [Critique] → [Revise] → Output
↑ │
└────────────────────────┘ (repeat N times)
Example: Code generation with self-review
# Generate initial code
code = llm(f"Write a Python function that {task_description}")

# Self-review loop
for i in range(3):
    critique = llm(f"""Review this code for bugs, edge cases, and improvements:
{code}
List specific issues, or say "no issues".""")
    if "no issues" in critique.lower():
        break
    code = llm(f"""Fix the following issues in this code:
Issues: {critique}
Code:
{code}""")
When to use: Tasks where quality can be objectively assessed — code generation, writing, translation, data extraction. Not useful when the model can't reliably judge its own output.
Orchestrator-Worker
An orchestrator LLM breaks down a complex task and delegates sub-tasks to worker LLMs. The orchestrator manages the overall plan and synthesizes results.
┌→ [Worker 1: Research pricing]
[Orchestrator] → Plan ─────┼→ [Worker 2: Research features]
↑ └→ [Worker 3: Research reviews]
│
└──── [Synthesize results into final report]
Example:
import json

def orchestrator(task: str) -> str:
    # Orchestrator creates a plan
    plan = llm(f"""Break this task into 3-5 independent sub-tasks:
Task: {task}
Return as a JSON array of sub-task descriptions.""")
    sub_tasks = json.loads(plan)
    # Workers execute in parallel
    results = parallel_map(
        lambda sub_task: llm(f"Complete this sub-task thoroughly:\n{sub_task}"),
        sub_tasks
    )
    # Orchestrator synthesizes
    return llm(f"""Synthesize these sub-task results into a final answer:
Task: {task}
Results: {json.dumps(results)}""")
When to use: Complex tasks where the sub-tasks aren't known in advance and may vary based on the input. More flexible than prompt chaining but also more complex and expensive.
Part III: Tools
Tools are what make agents capable of interacting with the real world. Without tools, an LLM can only generate text. With tools, it can search the web, query databases, execute code, send emails, and more.
Tool Calling
Tool calling (also called function calling) is a structured way for an LLM to request that external functions be executed. The model doesn't execute the tool itself — it outputs a structured request, your application executes it, and the result is fed back to the model.
User: "What's the weather in Tokyo?"
↓
LLM thinks: "I need to use the weather tool"
↓
LLM outputs: {"tool": "get_weather", "args": {"city": "Tokyo"}}
↓
Your app: executes get_weather("Tokyo") → {"temp": 22, "condition": "sunny"}
↓
LLM receives result, generates: "It's currently 22°C and sunny in Tokyo."
The tool calling flow in detail:
┌────────┐ ┌─────┐ ┌──────────┐ ┌─────┐ ┌────────┐
│ User │────→│ LLM │────→│ Tool Call │────→│ App │────→│ Result │
│Message │ │ │ │ Request │ │ │ │ │
└────────┘ └─────┘ └──────────┘ └─────┘ └───┬────┘
│
┌─────┐ ┌──────────┐ │
│ LLM │←────│ Tool │←────────────────────┘
│ │ │ Result │
└──┬──┘ └──────────┘
│
▼
Final Response
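Stripped of any provider SDK, the round trip above is just structured message passing. A minimal sketch of the app side — the dict shape of `model_output` is illustrative, not any provider's exact wire format:

```python
import json

def get_weather(city: str) -> dict:
    """Stub tool; a real version would call a weather API."""
    return {"temp": 22, "condition": "sunny"}

# Registry mapping tool names to the functions that implement them
TOOLS = {"get_weather": get_weather}

def handle_tool_call(tool_call: dict) -> str:
    """Dispatch the model's structured request and serialize the result."""
    func = TOOLS[tool_call["tool"]]
    return json.dumps(func(**tool_call["args"]))

# The model emits a structured request instead of prose...
model_output = {"tool": "get_weather", "args": {"city": "Tokyo"}}
# ...the app executes it; the result goes back to the model as a new message.
tool_result = handle_tool_call(model_output)
```

The key design point: the model never executes anything. It only proposes a call; your application owns execution, so it can validate, authorize, and sandbox.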
Tool Formatting
Different providers use different formats for defining tools. Here's how the major ones work:
OpenAI format:
{
    "type": "function",
    "function": {
        "name": "search_web",
        "description": "Search the web for current information on a topic",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query"
                },
                "num_results": {
                    "type": "integer",
                    "description": "Number of results to return (default: 5)"
                }
            },
            "required": ["query"]
        }
    }
}
Anthropic format:
{
    "name": "search_web",
    "description": "Search the web for current information on a topic",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "The search query"
            },
            "num_results": {
                "type": "integer",
                "description": "Number of results to return (default: 5)"
            }
        },
        "required": ["query"]
    }
}
Best practices for tool definitions:
| Practice | Why It Matters |
|---|---|
| Write clear descriptions | The model uses descriptions to decide when and how to use the tool. Vague descriptions = wrong tool calls. |
| Include parameter descriptions | Don't just name parameters — explain what they do, valid ranges, and defaults. |
| Use specific names | search_knowledge_base is better than search. Specificity helps the model pick the right tool. |
| Limit the tool set | More tools = more confusion. Provide only the tools relevant to the current task. |
| Include examples in descriptions | "Search query, e.g., 'CloudAPI authentication error 401'" helps the model format inputs correctly. |
Tool Execution
Your application is responsible for executing tool calls. This means you control:
| Concern | What You Control |
|---|---|
| Validation | Check that the model's arguments are valid before executing |
| Authorization | Ensure the user has permission to use this tool |
| Rate limiting | Prevent runaway agents from making thousands of API calls |
| Error handling | Return clear error messages so the model can adapt |
| Timeouts | Kill long-running tool calls to prevent hangs |
| Sandboxing | For code execution tools, run in isolated environments |
Execution pattern:
import json

def execute_tool(tool_name: str, args: dict) -> str:
    """Execute a tool call from the LLM with safety checks."""
    # 1. Validate tool exists
    if tool_name not in AVAILABLE_TOOLS:
        return f"Error: Unknown tool '{tool_name}'"
    # 2. Validate arguments
    tool = AVAILABLE_TOOLS[tool_name]
    validation_error = tool.validate_args(args)
    if validation_error:
        return f"Error: {validation_error}"
    # 3. Check permissions
    if not user_has_permission(current_user, tool_name):
        return f"Error: User does not have permission to use '{tool_name}'"
    # 4. Execute with timeout
    try:
        result = tool.execute(args, timeout=30)
        return json.dumps(result)
    except TimeoutError:
        return "Error: Tool execution timed out after 30 seconds"
    except Exception as e:
        return f"Error: {str(e)}"
MCP (Model Context Protocol)
MCP is an open protocol (created by Anthropic) that standardizes how LLMs connect to external tools and data sources. Think of it as USB-C for AI tools — a universal interface so any tool can work with any model.
Why MCP matters:
Before MCP, every tool integration was custom. Want your agent to use GitHub? Write a GitHub integration. Slack? Write another. Every tool × every model = an explosion of custom code.
MCP standardizes this:
Without MCP:
App → Custom GitHub integration
App → Custom Slack integration
App → Custom DB integration
(N tools × M apps = N×M integrations)
With MCP:
App → MCP Client → MCP Server (GitHub)
App → MCP Client → MCP Server (Slack)
App → MCP Client → MCP Server (DB)
(N tools + M apps = N+M integrations)
MCP architecture:
┌────────────────────────────────┐
│ MCP Host │
│ (Your AI application) │
│ │
│ ┌────────────────────────┐ │
│ │ MCP Client │ │
│ │ (Protocol handler) │ │
│ └──────────┬─────────────┘ │
└─────────────┼──────────────────┘
│ (JSON-RPC over stdio/SSE)
│
┌──────────▼──────────────┐
│ MCP Server │
│ (Tool provider) │
│ │
│ Exposes: │
│ - Tools (functions) │
│ - Resources (data) │
│ - Prompts (templates) │
└─────────────────────────┘
MCP capabilities:
| Capability | Description | Example |
|---|---|---|
| Tools | Functions the model can call | search_web, read_file, query_database |
| Resources | Data sources the model can read | Files, database records, API responses |
| Prompts | Reusable prompt templates | "Summarize this document", "Review this code" |
Example MCP server (Python):
from mcp.server import Server
from mcp.types import Tool, TextContent

server = Server("web-search")

@server.tool()
async def search_web(query: str, num_results: int = 5) -> list[TextContent]:
    """Search the web for current information on a topic."""
    results = await perform_web_search(query, num_results)
    return [TextContent(type="text", text=format_results(results))]

@server.tool()
async def fetch_page(url: str) -> list[TextContent]:
    """Fetch and extract the main content from a web page."""
    content = await fetch_and_extract(url)
    return [TextContent(type="text", text=content)]
Why this matters for your "Ask-the-Web" agent: MCP lets you build your web search tools as a reusable server that any MCP-compatible application can use — not just your specific agent.
Part IV: Multi-Step Agents
Workflows are developer-controlled. Multi-step agents are model-controlled. The agent decides what to do next based on what it observes.
Planning Autonomy
The core question with agents is: how much autonomy should the model have?
| Approach | Planning | Execution | When to Use |
|---|---|---|---|
| Fixed plan | Developer defines the steps | Model executes each step | Predictable tasks with known workflows |
| LLM-generated plan | Model creates a plan, human approves it, then it executes | Model follows its own approved plan | Complex tasks where you want oversight |
| Fully autonomous | Model plans and executes in a loop, adapting as it goes | Model decides everything | Exploratory tasks where the path isn't known in advance |
ReACT (Reasoning + Acting)
ReACT is the foundational agent pattern. The model alternates between thinking (reasoning about what to do) and acting (using tools), then observing the results.
Loop:
1. Thought: "I need to find out X. I'll search for Y."
2. Action: search_web("Y")
3. Observation: [search results]
4. Thought: "The results show Z, but I also need to know W."
5. Action: search_web("W")
6. Observation: [more results]
7. Thought: "Now I have enough information to answer."
8. Final Answer: [synthesized response]
ReACT implementation:
def react_agent(question: str, tools: list, max_steps: int = 10) -> str:
    messages = [
        {"role": "system", "content": """You are a research agent. For each step:
1. Think about what you need to know and what tool to use.
2. Use a tool to gather information.
3. Observe the result.
4. Repeat until you can answer the question.
When you have enough information, provide your final answer."""},
        {"role": "user", "content": question}
    ]
    for step in range(max_steps):
        response = llm(messages, tools=tools)
        # Check if the model wants to use a tool
        if response.tool_calls:
            # Record the assistant turn (with its tool calls) before the results
            messages.append({"role": "assistant", "content": response.content,
                             "tool_calls": response.tool_calls})
            for tool_call in response.tool_calls:
                result = execute_tool(tool_call.name, tool_call.args)
                messages.append({"role": "tool", "content": result})
        else:
            # No tool call = model is ready to give a final answer
            return response.content
    return "Agent reached maximum steps without a final answer."
Why ReACT works:
- The explicit "Thought" step forces the model to reason before acting
- Each observation grounds the next decision in real data
- The loop naturally handles multi-step tasks
- The model can adapt its plan based on what it learns
Reflexion
Reflexion adds a self-reflection step to the agent loop. After completing a task (or failing), the agent reflects on what went well and what didn't, then uses that reflection to improve on the next attempt.
Attempt 1:
ReACT loop → Answer → Evaluate → "My answer was wrong because I didn't consider X"
↓
Attempt 2:
ReACT loop (with reflection context) → Better Answer → Evaluate → "Correct!"
When Reflexion helps:
- Tasks where the agent can evaluate its own output (code that must pass tests, math with verifiable answers)
- When the first attempt often fails but the model can learn from the failure
- Research tasks where the initial search strategy was too narrow
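The Reflexion loop can be sketched as a thin wrapper around any base agent. The `solve` and `evaluate` callables here are assumed stand-ins: in a real system, `solve` would be a ReACT-style agent that receives past reflections in its prompt, and `evaluate` might run tests or an LLM judge:

```python
def reflexion_agent(task, solve, evaluate, max_attempts=3):
    """Retry a task, feeding a reflection on each failure into the next attempt.

    solve(task, reflections) -> candidate answer
    evaluate(task, answer)   -> (passed: bool, feedback: str)
    """
    reflections = []
    answer = None
    for attempt in range(max_attempts):
        answer = solve(task, reflections)
        passed, feedback = evaluate(task, answer)
        if passed:
            return answer
        # In the real pattern, an LLM writes this reflection from the feedback
        reflections.append(f"Attempt {attempt + 1} failed: {feedback}")
    return answer  # best effort after max_attempts
```

The accumulated `reflections` list is the pattern's whole trick: it is the only state that crosses attempt boundaries, turning failures into context for the next try.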
ReWOO (Reasoning Without Observation)
ReWOO separates planning from execution. The agent creates a complete plan upfront, then executes all steps, then synthesizes. This reduces the number of LLM calls.
Standard ReACT: Think → Act → Observe → Think → Act → Observe → Think → Answer
(7 LLM calls)
ReWOO: Plan (all steps at once) → Execute all → Synthesize
(2 LLM calls)
def rewoo_agent(question: str) -> str:
    # Step 1: Plan all steps at once
    plan = llm(f"""Create a plan to answer this question: {question}
For each step, specify which tool to use and what arguments to pass.
Format: Step N: tool_name(args) - purpose""")
    # Step 2: Execute all steps
    results = {}
    for step in parse_plan(plan):
        results[step.id] = execute_tool(step.tool, step.args)
    # Step 3: Synthesize
    return llm(f"""Given these results, answer the question: {question}
Results: {json.dumps(results)}""")
| Aspect | ReACT | ReWOO |
|---|---|---|
| LLM calls | Many (one per step) | Few (plan + synthesize) |
| Adaptability | High — can change plan mid-execution | Low — plan is fixed |
| Latency | Higher (sequential LLM calls) | Lower (parallel tool execution possible) |
| When to use | Exploratory tasks, unknown number of steps | Well-defined tasks, latency-sensitive |
Tree Search for Agents
For complex reasoning tasks, a single linear chain of thought may not find the best solution. Tree search explores multiple reasoning paths and selects the most promising one.
Tree of Thought (ToT):
[Initial Question]
/ | \
[Approach A] [Approach B] [Approach C]
/ \ | / \
[A1] [A2] [B1] [C1] [C2]
↓ ↓
[Evaluate] [Evaluate]
↓ ↓
Score: 0.9 Score: 0.7
↓
[Best path → Final Answer]
How it works:
- Generate multiple possible next steps (branching)
- Evaluate each branch with a heuristic or LLM judge
- Expand the most promising branches
- Prune unpromising branches
- Continue until a satisfactory solution is found
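The generate–evaluate–prune loop above amounts to a beam search over reasoning paths. A minimal sketch, where `propose` and `score` are assumed stand-ins for LLM calls (propose candidate next thoughts; judge how promising a path is):

```python
def tree_of_thought(question, propose, score, depth=2, beam_width=2):
    """Beam search over reasoning paths.

    propose(path) -> list of candidate next thoughts
    score(path)   -> float, higher is more promising
    """
    beam = [[question]]  # each path is a list of thoughts, rooted at the question
    for _ in range(depth):
        # Branch: extend every surviving path with each proposed next step
        candidates = [path + [step] for path in beam for step in propose(path)]
        if not candidates:
            break
        # Prune: keep only the most promising branches
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
    return max(beam, key=score)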
Monte Carlo Tree Search (MCTS) for agents:
| Component | In Games (AlphaGo) | In Agents |
|---|---|---|
| State | Board position | Current reasoning + gathered information |
| Action | Place a stone | Choose a reasoning step or tool call |
| Reward | Win/lose | Answer correctness (verified or LLM-judged) |
| Rollout | Random play to end | Complete the reasoning chain to get an answer |
When to use tree search:
- Mathematical reasoning where multiple approaches exist
- Planning tasks with many possible strategies
- Tasks where you can verify correctness (code, math, logic puzzles)
- When accuracy matters more than speed
Part V: Multi-Agent Systems
Sometimes one agent isn't enough. Multi-agent systems use multiple specialized agents that collaborate, debate, or delegate to accomplish complex tasks.
Why Multiple Agents?
| Reason | Description | Example |
|---|---|---|
| Specialization | Different agents with different expertise | Researcher agent + writer agent + editor agent |
| Parallelism | Multiple agents working simultaneously | Three agents researching different aspects of a topic |
| Debate/verification | Agents check each other's work | One agent generates code, another reviews it |
| Separation of concerns | Each agent has a focused scope and toolset | A planning agent that delegates to execution agents |
Challenges of Multi-Agent Systems
| Challenge | Description | Mitigation |
|---|---|---|
| Coordination overhead | Agents need to communicate, which adds latency and cost | Clear protocols, minimal message passing |
| Error propagation | One agent's mistake cascades to others | Validation between agent handoffs |
| Infinite loops | Agents pass tasks back and forth forever | Step limits, loop detection, human-in-the-loop checkpoints |
| Context management | Each agent has limited context; sharing state is hard | Shared memory store, structured handoff messages |
| Debugging | Hard to trace why the system produced a specific output | Comprehensive logging of all agent interactions |
| Cost | Multiple agents = multiple LLM calls per user request | Budget limits, efficient agent design |
Use Cases for Multi-Agent Systems
| Use Case | Agent Architecture | How It Works |
|---|---|---|
| Software development | Planner → Coder → Reviewer → Tester | Planner breaks down the task, coder implements, reviewer checks quality, tester validates |
| Research synthesis | Coordinator → multiple Researchers → Synthesizer | Coordinator assigns sub-topics, researchers investigate in parallel, synthesizer combines |
| Content pipeline | Researcher → Writer → Editor → Fact-checker | Each agent specializes in one stage of content creation |
| Customer support escalation | Tier 1 bot → Specialist agents → Human escalation | Simple queries handled by Tier 1, complex ones routed to domain-specific agents |
| Debate / red team | Proposer → Critic → Judge | One agent proposes an answer, another critiques it, a judge decides |
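The debate / red-team row can be sketched as three roles wired in sequence. The three callables are assumed stand-ins for LLM calls with different system prompts (proposer, critic, judge):

```python
def debate(question, propose, critique, judge, rounds=2):
    """Proposer answers, critic attacks, judge decides when to stop.

    propose(question, criticism) -> answer (revised if criticism is given)
    critique(question, answer)   -> criticism string
    judge(question, answer, criticism) -> True if the answer should be accepted
    """
    answer = propose(question, criticism=None)
    for _ in range(rounds):
        criticism = critique(question, answer)
        if judge(question, answer, criticism):  # judge accepts the answer
            return answer
        answer = propose(question, criticism)   # revise using the critique
    return answer
```

Capping `rounds` matters: without it, a stubborn proposer and critic can loop forever, which is exactly the "infinite loops" failure mode from the challenges table.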
A2A Protocol (Agent-to-Agent)
Just as MCP standardizes tool communication, the A2A protocol (introduced by Google) standardizes how agents communicate with each other.
Core concepts:
┌──────────┐ Agent Card (discovery) ┌──────────┐
│ Agent A │ ─────────────────────────────→ │ Agent B │
│ (Client) │ │ (Server) │
│ │ ←── Task (request/response) ──→ │ │
│ │ │ │
│ │ ←── Artifacts (results) ────── │ │
└──────────┘ └──────────┘
| Concept | Description |
|---|---|
| Agent Card | A JSON document describing what an agent can do — its capabilities, skills, and endpoint. Used for discovery. |
| Task | A unit of work sent from one agent to another. Has a lifecycle: submitted → working → completed/failed. |
| Artifact | The output of a task — files, text, structured data. |
| Message | Communication within a task — instructions, status updates, questions. |
Why A2A matters: It enables interoperability between agents built by different teams, companies, or frameworks. Your research agent could delegate to a third-party data analysis agent without custom integration code.
Part VI: Evaluation of Agents
Evaluating agents is harder than evaluating LLMs because agents take actions, make decisions, and produce results through multi-step processes.
What to Evaluate
| Dimension | What It Measures | How to Measure |
|---|---|---|
| Task completion | Did the agent accomplish the goal? | Binary success/failure on a test suite |
| Answer quality | Is the final output correct and useful? | LLM-as-judge, human evaluation, ground-truth comparison |
| Efficiency | How many steps/tokens/tool calls did it take? | Count steps, measure tokens, track cost |
| Tool use accuracy | Did the agent pick the right tools with correct arguments? | Compare against expected tool call sequences |
| Reasoning quality | Were the agent's intermediate thoughts logical? | Evaluate thought traces, check for reasoning errors |
| Robustness | Does the agent handle edge cases and errors gracefully? | Adversarial test cases, error injection |
| Safety | Does the agent avoid harmful actions? | Red-team testing, sandboxed execution |
| Latency | How long does the full agent loop take? | End-to-end timing |
Evaluation Approaches
Benchmark-based evaluation:
| Benchmark | What It Tests | Domain |
|---|---|---|
| SWE-bench | Resolve real GitHub issues by writing code patches | Software engineering |
| WebArena | Complete real-world tasks on live websites | Web navigation |
| GAIA | General AI Assistant tasks requiring tool use and reasoning | General assistant |
| AgentBench | Multi-domain agent tasks (OS, DB, web, code) | Cross-domain |
| ToolBench | Tool selection and use across 16,000+ real APIs | Tool use |
| HotPotQA | Multi-hop question answering requiring multiple evidence sources | Research |
Trajectory-based evaluation:
Don't just evaluate the final answer — evaluate the entire trajectory:
Score each step:
Step 1: Chose correct tool? ✅ Arguments correct? ✅ Result useful? ✅
Step 2: Chose correct tool? ✅ Arguments correct? ❌ Result useful? ❌
Step 3: Recovered from error? ✅ Adapted strategy? ✅
Final answer correct? ✅
Trajectory score: 5/7 steps correct, recovered from error, correct final answer
Cost-quality trade-off:
The best agent isn't always the one with the best answers — it's the one that balances quality against cost and latency.
Agent A: 95% accuracy, avg 12 steps, $0.50/query, 45s latency
Agent B: 90% accuracy, avg 4 steps, $0.08/query, 12s latency
For most production use cases, Agent B is better.
Building Your Evaluation Pipeline
def evaluate_agent(agent, test_cases: list[dict]) -> dict:
    results = []
    for test in test_cases:
        # Run agent
        trace = agent.run(test["question"], return_trace=True)
        results.append({
            "question": test["question"],
            "expected": test["expected_answer"],
            "actual": trace.final_answer,
            "correct": judge_correctness(trace.final_answer, test["expected_answer"]),
            "steps": len(trace.steps),
            "tool_calls": len(trace.tool_calls),
            "tokens_used": trace.total_tokens,
            "latency_ms": trace.duration_ms,
            "cost": trace.estimated_cost,
        })
    return {
        "accuracy": sum(r["correct"] for r in results) / len(results),
        "avg_steps": sum(r["steps"] for r in results) / len(results),
        "avg_cost": sum(r["cost"] for r in results) / len(results),
        "avg_latency": sum(r["latency_ms"] for r in results) / len(results),
    }
Part VII: Build Your "Ask-the-Web" Agent
Now let's build it — a Perplexity-style research agent that searches the web, reads pages, and synthesizes cited answers.
Architecture
┌──────────────────────────────────────────────────────────┐
│ Ask-the-Web Agent │
│ │
│ User Question │
│ ↓ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ ReACT Loop │ │
│ │ │ │
│ │ Thought: "I need to search for X" │ │
│ │ ↓ │ │
│ │ Action: search_web("X") │ │
│ │ ↓ │ │
│ │ Observation: [10 search results with snippets] │ │
│ │ ↓ │ │
│ │ Thought: "Result 3 looks relevant, let me read"│ │
│ │ ↓ │ │
│ │ Action: fetch_page("https://...") │ │
│ │ ↓ │ │
│ │ Observation: [full page content] │ │
│ │ ↓ │ │
│ │ Thought: "I need one more perspective on Y" │ │
│ │ ↓ │ │
│ │ Action: search_web("Y different angle") │ │
│ │ ↓ │ │
│ │ ... (continue until sufficient information) │ │
│ │ ↓ │ │
│ │ Final Answer (with citations) │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ Tools: │
│ - search_web(query) → search results │
│ - fetch_page(url) → page content │
│ - calculate(expression) → numerical result │
└──────────────────────────────────────────────────────────┘
Implementation
from anthropic import Anthropic
client = Anthropic()
# Define tools
tools = [
    {
        "name": "search_web",
        "description": "Search the web for current information. Returns a list of results with titles, URLs, and snippets. Use specific, detailed queries for best results.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search query, e.g. 'latest advances in quantum computing 2026'"
                }
            },
            "required": ["query"]
        }
    },
    {
        "name": "fetch_page",
        "description": "Fetch the full content of a web page. Use this to read articles, documentation, or any URL found in search results.",
        "input_schema": {
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "description": "The URL to fetch"
                }
            },
            "required": ["url"]
        }
    }
]
SYSTEM_PROMPT = """You are an expert research agent similar to Perplexity AI. Your job is to
answer questions by searching the web, reading relevant pages, and synthesizing
comprehensive, well-cited answers.
Follow this process:
1. Think about what information you need to answer the question.
2. Search the web with specific, targeted queries.
3. Read the most relevant pages to get detailed information.
4. If needed, do follow-up searches to fill gaps or verify claims.
5. Synthesize a comprehensive answer with inline citations.
Rules:
- Always cite your sources using [1], [2], etc. with a sources list at the end.
- If sources conflict, note the disagreement.
- If you can't find reliable information, say so clearly.
- Prefer recent, authoritative sources.
- Be thorough but concise — cover the key points without unnecessary detail."""
def ask_the_web(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    # ReACT loop
    for step in range(15):  # max 15 steps
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            system=SYSTEM_PROMPT,
            tools=tools,
            messages=messages,
        )
        # Collect all content blocks
        messages.append({"role": "assistant", "content": response.content})
        # Check if the model wants to use tools
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })
            messages.append({"role": "user", "content": tool_results})
        else:
            # No tool use = final answer
            return extract_text(response.content)
    return "Agent reached maximum steps. Partial answer may be available."

def execute_tool(name: str, args: dict) -> str:
    if name == "search_web":
        return search_web(args["query"])
    elif name == "fetch_page":
        return fetch_page(args["url"])
    else:
        return f"Unknown tool: {name}"
Adding Streaming for Real-Time Output
Users shouldn't stare at a blank screen while the agent works. Stream the agent's thinking and progress:
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
def ask_the_web_streaming(question: str):
    """Stream the agent's process — show thinking, tool use, and final answer."""
    messages = [{"role": "user", "content": question}]
    for step in range(15):
        print(f"\n--- Step {step + 1} ---")
        with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            system=SYSTEM_PROMPT,
            tools=tools,
            messages=messages,
        ) as stream:
            # Print text deltas as they arrive, instead of waiting for
            # the full response
            for text in stream.text_stream:
                print(text, end="", flush=True)
            print()
            response = stream.get_final_message()
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    print(f"  🔧 {block.name}({json.dumps(block.input)})")
                    result = execute_tool(block.name, block.input)
                    print(f"  ← Got {len(result)} chars")
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
            messages.append({"role": "user", "content": tool_results})
        else:
            break  # no tool use — the final answer already streamed above
Features to Add
Phase 1: Core Agent
- Web search + page fetching tools
- ReACT loop with tool calling
- Cited answers with source list
Phase 2: Enhanced Search
- Query rewriting for better search results
- Multiple search attempts with different queries
- Source credibility scoring
Phase 3: User Experience
- Streaming output showing the agent's progress
- Follow-up questions (conversational)
- Source preview cards with titles and snippets
Phase 4: Advanced
- Parallel search (run multiple queries simultaneously)
- Fact-checking via cross-referencing sources
- Caching frequent queries
- MCP server for reusable web search tools
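As a sketch of the Phase 4 caching idea — a minimal in-memory TTL cache wrapped around any search function (names and the TTL value are illustrative):

```python
import time
from typing import Callable

_cache: dict[str, tuple[float, str]] = {}
CACHE_TTL = 15 * 60  # seconds before a cached result goes stale

def cached_search(query: str, search_fn: Callable[[str], str]) -> str:
    """Wrap a search function with a simple in-memory TTL cache."""
    now = time.time()
    hit = _cache.get(query)
    if hit is not None and now - hit[0] < CACHE_TTL:
        return hit[1]               # fresh enough — skip the network call
    result = search_fn(query)       # miss or stale — hit the real tool
    _cache[query] = (now, result)
    return result
```

For production you'd likely swap the dict for Redis or similar, but the interface stays the same: identical queries within the TTL window never touch the network twice.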
Key Learning Outcomes
Building this agent teaches you:
- Tool calling mechanics — how to define, format, and execute tools with an LLM
- The ReACT pattern — the foundational loop for autonomous agents
- Prompt engineering for agents — how to guide agent behavior through system prompts
- Multi-step reasoning — how an agent decides what to search, when to dig deeper, and when to stop
- Citation and grounding — how to ensure outputs are traceable to sources
- Error handling in agent loops — what happens when a search returns nothing or a page fails to load
- Streaming for agents — giving users visibility into the agent's process
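The error-handling point above can be sketched as a wrapper around tool execution — a minimal version, with illustrative names and messages, that converts failures into text instead of raising:

```python
from typing import Callable

def safe_execute(fn: Callable[[dict], str], args: dict) -> str:
    """Run one tool call, turning failures into text the model can act on."""
    try:
        result = fn(args)
    except Exception as exc:  # timeouts, DNS failures, parse errors, ...
        return f"Tool error: {exc}. Try a different query or URL."
    if not result.strip():
        return "The tool returned no results. Try rephrasing the query."
    return result
```

Returning the error as a tool result rather than crashing the loop lets the model see what went wrong and recover — retry with a different query, fetch another URL, or tell the user it couldn't find the answer.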
What You Should Know After Reading This
If you've read this post carefully, you should be able to answer these questions:
- What's the difference between an LLM, an agentic system, and an agent?
- When would you use a prompt chain vs. a ReACT agent?
- What are the five workflow patterns and when would you use each?
- How does tool calling work — what does the model output and what does your application do?
- What is MCP and what problem does it solve?
- How does the ReACT pattern work? What are Thought, Action, and Observation?
- What is Reflexion and when does it help?
- What are the key challenges of multi-agent systems?
- How do you evaluate an agent beyond just answer correctness?
- What is the A2A protocol and why does it matter?
If you can't answer all of them yet, re-read the relevant section. Understanding agent architectures is essential for building the next generation of AI applications.
Further Reading
For those who want to go deeper on any topic covered here:
- "ReAct: Synergizing Reasoning and Acting in Language Models" (Yao et al., 2022) — The original ReACT paper
- "Reflexion: Language Agents with Verbal Reinforcement Learning" (Shinn et al., 2023) — The Reflexion paper
- "ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models" (Xu et al., 2023) — The ReWOO paper
- "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (Yao et al., 2023) — Tree search for LLMs
- "Toolformer: Language Models Can Teach Themselves to Use Tools" (Schick et al., 2023) — Self-taught tool use
- "Building Effective Agents" (Anthropic, 2024) — Anthropic's practical guide to agent design
- "The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling" (Masterman et al., 2024)
- Model Context Protocol specification — https://modelcontextprotocol.io
- LangGraph documentation — Framework for building stateful agent workflows
- CrewAI documentation — Framework for multi-agent orchestration
Next in the Series
Part 4: Deep Research with Reasoning Models — We cover reasoning and thinking LLMs (o1, DeepSeek-R1, Claude extended thinking), inference-time scaling techniques (Chain-of-Thought, self-consistency, Tree of Thoughts, search against a verifier), training-time techniques (STaR, RL with verifiers, reward modeling, Meta-CoT), and build a deep research agent that combines web search with multi-step structured reasoning.
Stay tuned.