29 min read

Build an LLM Playground — Part 3: Build an "Ask-the-Web" Agent with Tool Calling

The third entry in the learn-by-doing AI engineer series. We cover AI agents from the ground up — workflows, tool calling, MCP, multi-step reasoning patterns like ReACT and Reflexion, multi-agent systems, and evaluation — then build a Perplexity-style web search agent.

Tags: ai, llm, agents, tool-calling, mcp, react, multi-agent, tutorial, series

Series: The AI Engineer Learning Path

This is Part 3 of a hands-on series designed to take you from zero to working AI engineer. Every post follows a learn-by-doing philosophy — we explain the theory, then you build something real.

| Part | Topic | Status |
|---|---|---|
| 1 | Build an LLM Playground | Complete |
| 2 | Customer Support Chatbot with RAG & Prompt Engineering | Complete |
| 3 | "Ask-the-Web" Agent with Tool Calling (this post) | Current |
| 4 | Deep Research with Reasoning Models | Available |
| 5 | Multi-modal Generation Agent | Available |

In Part 1, we learned how LLMs work. In Part 2, we built a RAG chatbot that answers questions from documents. Now we're taking a leap: building an AI system that can reason, plan, use tools, and take actions in the real world.

By the end of this post, you'll understand the full spectrum of agentic AI — from simple prompt chains to autonomous multi-step agents — and you'll build a Perplexity-style "Ask-the-Web" agent that searches the internet, synthesizes information, and provides cited answers.


Why Agents?

A chatbot answers questions. An agent takes actions.

Think about the difference between asking "What's the weather in Tokyo?" and asking "Book me a flight to Tokyo next week, find a hotel near Shibuya, and check the weather so I know what to pack." The first is a lookup. The second requires planning, tool use, decision-making, and multi-step execution.

Agents bridge this gap. They turn LLMs from conversational interfaces into systems that can interact with the world — searching the web, calling APIs, executing code, reading files, and coordinating complex workflows.


Part I: Agents Overview

Agents vs. Agentic Systems vs. LLMs

These terms get used loosely. Let's be precise:

| Concept | What It Is | Example |
|---|---|---|
| LLM | A model that generates text given a prompt. No memory, no tools, no autonomy. It produces one response and stops. | GPT-4 answering "What is gravity?" |
| Agentic system | An LLM wrapped in a loop with access to tools and some degree of autonomy. The system can take multiple steps to accomplish a goal. | A chatbot that searches a knowledge base before answering |
| Agent | A highly autonomous agentic system that can plan, execute, observe results, and adapt its strategy. It decides what to do next based on what happened. | An AI research assistant that formulates search queries, reads papers, synthesizes findings, and iterates until it has a complete answer |

The key distinction is autonomy. An LLM does exactly what you ask once. An agentic system follows a predefined pattern (retrieve, then generate). An agent decides its own approach and adapts.

Agency Levels

Not every system needs full agent autonomy. In fact, simpler is usually better. Here's a spectrum:

| Level | Description | Autonomy | Example |
|---|---|---|---|
| Level 0: Direct LLM call | Single prompt → single response | None | "Translate this sentence to French" |
| Level 1: Workflow | Predefined sequence of LLM calls. The developer controls the flow. | Low | Prompt chain: summarize → translate → format |
| Level 2: Router | LLM decides which path to take from a fixed set of options | Low-Medium | Classify a customer query, then route to the right handler |
| Level 3: Tool-using LLM | LLM decides when and how to call tools, but within a single turn | Medium | Search the web, then answer the question |
| Level 4: Multi-step agent | LLM operates in a loop — observe, think, act, repeat — until the task is done | High | ReACT agent that researches a topic across multiple searches |
| Level 5: Multi-agent system | Multiple agents collaborating, delegating, and coordinating | Very High | A team of agents: researcher, writer, and editor working together |

Practical advice: Start at the lowest level that solves your problem. Most production AI features are Level 1-3. Full agents (Level 4-5) are powerful but harder to control, debug, and make reliable.


Part II: Workflows

Workflows are the most reliable form of agentic systems. The developer defines the control flow — the LLM handles the language processing at each step, but doesn't decide what to do next.

Prompt Chaining

Run a sequence of LLM calls where each call's output feeds into the next call's input. Each step handles one focused task.

Input → [Step 1: Extract key facts] → [Step 2: Research each fact] → [Step 3: Write summary] → Output

Example: Research report generator

# Step 1: Extract key questions from the user's topic
questions = llm(f"Given the topic '{topic}', generate 5 specific research questions.")

# Step 2: For each question, search the web and extract key facts
facts = []
for question in questions:
    search_results = web_search(question)
    facts.append(llm(f"Extract key facts from these results:\n{search_results}"))

# Step 3: Synthesize into a report
report = llm(f"Write a research report based on these facts:\n{facts}")

When to use: Tasks that are naturally sequential, where each step has clear inputs and outputs. The most common pattern in production.

| Advantage | Disadvantage |
|---|---|
| Easy to debug — inspect each step's output | Rigid — can't adapt to unexpected results |
| Easy to test — unit test each step | Latency compounds — N steps = N LLM calls |
| Easy to improve — swap out individual steps | Error propagation — early mistakes cascade |

Routing

An LLM classifies the input and routes it to the appropriate handler. The LLM acts as a decision maker but doesn't execute the downstream logic.

User Input → [LLM Classifier] → Route A: Technical support handler
                                → Route B: Billing handler
                                → Route C: General inquiry handler

Example: Support ticket router

def route_ticket(message: str) -> str:
    category = llm(
        f"""Classify this support message into exactly one category:
        - technical: API errors, integration issues, bugs
        - billing: charges, invoices, refunds, plans
        - account: login issues, settings, permissions
        - general: everything else
 
        Message: {message}
        Category:"""
    )
 
    handlers = {
        "technical": handle_technical,
        "billing": handle_billing,
        "account": handle_account,
        "general": handle_general,
    }
 
    # Normalize the label and fall back to the general handler on a bad classification
    return handlers.get(category.strip().lower(), handle_general)(message)

When to use: When different input types require fundamentally different handling. Common in customer support, content moderation, and task dispatching.

Parallelization

Run multiple LLM calls simultaneously and combine the results. Two main patterns:

Sectioning: Split a task into independent sub-tasks, run them in parallel, combine results.

                    ┌→ [Analyze sentiment]     ─┐
User Review ────────┼→ [Extract key features]   ├→ [Combine into report]
                    └→ [Check for policy issues]─┘

Voting: Run the same task multiple times and aggregate results for higher accuracy.

                    ┌→ [LLM call 1: "toxic"]     ─┐
User Message ───────┼→ [LLM call 2: "toxic"]      ├→ Majority vote: "toxic"
                    └→ [LLM call 3: "not toxic"]  ─┘

| Pattern | When to Use | Example |
|---|---|---|
| Sectioning | Task has independent sub-tasks that don't depend on each other | Analyze a document for sentiment, entities, and key themes simultaneously |
| Voting | High-stakes decisions where accuracy matters more than speed or cost | Content moderation, medical triage classification |
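
Both patterns can be sketched in a few lines. The `llm` below is a stub with canned responses purely for illustration; a real implementation would call your model provider:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def llm(prompt: str) -> str:
    """Stub standing in for a real LLM call (canned responses for illustration)."""
    if "sentiment" in prompt:
        return "positive"
    if "features" in prompt:
        return "battery life, screen"
    return "no policy issues"

def sectioning(review: str) -> dict:
    """Run independent sub-tasks concurrently, then combine the results."""
    prompts = {
        "sentiment": f"Analyze sentiment: {review}",
        "features": f"Extract key features: {review}",
        "policy": f"Check for policy issues: {review}",
    }
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(llm, p) for name, p in prompts.items()}
        return {name: f.result() for name, f in futures.items()}

def vote(classify, message: str, n: int = 3) -> str:
    """Run the same classification n times and take the majority label."""
    labels = [classify(message) for _ in range(n)]
    return Counter(labels).most_common(1)[0][0]

report = sectioning("Great phone, love the battery")
label = vote(lambda msg: "toxic", "some message")
```

Sectioning hides the latency of the slowest sub-task rather than summing all of them, which is the main practical win over a sequential chain.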

Reflection

The LLM reviews and critiques its own output, then improves it. This creates a self-improving loop without external feedback.

Input → [Generate] → [Critique] → [Revise] → Output
              ↑                        │
              └────────────────────────┘ (repeat N times)

Example: Code generation with self-review

# Generate initial code
code = llm(f"Write a Python function that {task_description}")

# Self-review loop
for i in range(3):
    critique = llm(f"""Review this code for bugs, edge cases, and improvements:
    ```python
    {code}
    ```
    List specific issues.""")

    if "no issues" in critique.lower():
        break

    code = llm(f"""Fix the following issues in this code:
    Issues: {critique}
    Code:
    ```python
    {code}
    ```""")

When to use: Tasks where quality can be objectively assessed — code generation, writing, translation, data extraction. Not useful when the model can't reliably judge its own output.

Orchestrator-Worker

An orchestrator LLM breaks down a complex task and delegates sub-tasks to worker LLMs. The orchestrator manages the overall plan and synthesizes results.

                           ┌→ [Worker 1: Research pricing]
[Orchestrator] → Plan ─────┼→ [Worker 2: Research features]
       ↑                   └→ [Worker 3: Research reviews]
       │
       └──── [Synthesize results into final report]

Example:

import json

def orchestrator(task: str) -> str:
    # Orchestrator creates a plan
    plan = llm(f"""Break this task into 3-5 independent sub-tasks:
    Task: {task}
    Return as a JSON array of sub-task descriptions.""")

    sub_tasks = json.loads(plan)

    # Workers execute in parallel (parallel_map = any concurrent map helper,
    # e.g. ThreadPoolExecutor.map)
    results = parallel_map(
        lambda sub_task: llm(f"Complete this sub-task thoroughly:\n{sub_task}"),
        sub_tasks
    )

    # Orchestrator synthesizes
    return llm(f"""Synthesize these sub-task results into a final answer:
    Task: {task}
    Results: {json.dumps(results)}""")

When to use: Complex tasks where the sub-tasks aren't known in advance and may vary based on the input. More flexible than prompt chaining but also more complex and expensive.


Part III: Tools

Tools are what make agents capable of interacting with the real world. Without tools, an LLM can only generate text. With tools, it can search the web, query databases, execute code, send emails, and more.

Tool Calling

Tool calling (also called function calling) is a structured way for an LLM to request that external functions be executed. The model doesn't execute the tool itself — it outputs a structured request, your application executes it, and the result is fed back to the model.

User: "What's the weather in Tokyo?"
    ↓
LLM thinks: "I need to use the weather tool"
    ↓
LLM outputs: {"tool": "get_weather", "args": {"city": "Tokyo"}}
    ↓
Your app: executes get_weather("Tokyo") → {"temp": 22, "condition": "sunny"}
    ↓
LLM receives result, generates: "It's currently 22°C and sunny in Tokyo."

The tool calling flow in detail:

┌────────┐     ┌─────┐     ┌──────────┐     ┌─────┐     ┌────────┐
│  User  │────→│ LLM │────→│ Tool Call │────→│ App │────→│ Result │
│Message │     │     │     │ Request   │     │     │     │        │
└────────┘     └─────┘     └──────────┘     └─────┘     └───┬────┘
                                                            │
               ┌─────┐     ┌──────────┐                     │
               │ LLM │←────│ Tool     │←────────────────────┘
               │     │     │ Result   │
               └──┬──┘     └──────────┘
                  │
                  ▼
            Final Response
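
The round trip above can be sketched end to end. `get_weather` is a stub standing in for a real weather API, and the JSON shapes mirror the diagram; real providers wrap the tool-call request in their own message format:

```python
import json

def get_weather(city: str) -> dict:
    """Stub weather tool; a real version would call a weather API."""
    return {"temp": 22, "condition": "sunny"}

# Registry of tools your application is willing to execute
TOOLS = {"get_weather": get_weather}

def handle_tool_call(model_output: str) -> str:
    """Execute a structured tool request emitted by the model."""
    request = json.loads(model_output)   # {"tool": ..., "args": {...}}
    fn = TOOLS[request["tool"]]
    result = fn(**request["args"])       # your app runs the tool, not the model
    return json.dumps(result)            # fed back to the model as an observation

# Suppose the model emitted this structured request:
observation = handle_tool_call('{"tool": "get_weather", "args": {"city": "Tokyo"}}')
```

The observation string is appended to the conversation, and the model turns it into the final natural-language answer.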

Tool Formatting

Different providers use different formats for defining tools. Here's how the major ones work:

OpenAI format:

{
  "type": "function",
  "function": {
    "name": "search_web",
    "description": "Search the web for current information on a topic",
    "parameters": {
      "type": "object",
      "properties": {
        "query": {
          "type": "string",
          "description": "The search query"
        },
        "num_results": {
          "type": "integer",
          "description": "Number of results to return (default: 5)"
        }
      },
      "required": ["query"]
    }
  }
}

Anthropic format:

{
  "name": "search_web",
  "description": "Search the web for current information on a topic",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "The search query"
      },
      "num_results": {
        "type": "integer",
        "description": "Number of results to return (default: 5)"
      }
    },
    "required": ["query"]
  }
}

Best practices for tool definitions:

| Practice | Why It Matters |
|---|---|
| Write clear descriptions | The model uses descriptions to decide when and how to use the tool. Vague descriptions = wrong tool calls. |
| Include parameter descriptions | Don't just name parameters — explain what they do, valid ranges, and defaults. |
| Use specific names | search_knowledge_base is better than search. Specificity helps the model pick the right tool. |
| Limit the tool set | More tools = more confusion. Provide only the tools relevant to the current task. |
| Include examples in descriptions | "Search query, e.g., 'CloudAPI authentication error 401'" helps the model format inputs correctly. |
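
One way to keep tool definitions in sync with the code that implements them is to derive the schema from the function itself. A minimal sketch producing Anthropic-style definitions (per-parameter descriptions would still need to be written by hand):

```python
import inspect

# Map Python annotations to JSON Schema types
TYPE_MAP = {str: "string", int: "integer", float: "number", bool: "boolean"}

def to_tool_schema(fn) -> dict:
    """Build a tool definition from a function's signature and docstring."""
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        properties[name] = {"type": TYPE_MAP.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)   # no default value = required parameter
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "input_schema": {"type": "object", "properties": properties,
                         "required": required},
    }

def search_web(query: str, num_results: int = 5):
    """Search the web for current information on a topic."""

schema = to_tool_schema(search_web)
```

This keeps names and required/optional status from drifting, at the cost of losing the rich per-parameter descriptions the table above recommends.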

Tool Execution

Your application is responsible for executing tool calls. This means you control:

| Concern | What You Control |
|---|---|
| Validation | Check that the model's arguments are valid before executing |
| Authorization | Ensure the user has permission to use this tool |
| Rate limiting | Prevent runaway agents from making thousands of API calls |
| Error handling | Return clear error messages so the model can adapt |
| Timeouts | Kill long-running tool calls to prevent hangs |
| Sandboxing | For code execution tools, run in isolated environments |

Execution pattern:

def execute_tool(tool_name: str, args: dict) -> str:
    """Execute a tool call from the LLM with safety checks."""
 
    # 1. Validate tool exists
    if tool_name not in AVAILABLE_TOOLS:
        return f"Error: Unknown tool '{tool_name}'"
 
    # 2. Validate arguments
    tool = AVAILABLE_TOOLS[tool_name]
    validation_error = tool.validate_args(args)
    if validation_error:
        return f"Error: {validation_error}"
 
    # 3. Check permissions
    if not user_has_permission(current_user, tool_name):
        return f"Error: User does not have permission to use '{tool_name}'"
 
    # 4. Execute with timeout
    try:
        result = tool.execute(args, timeout=30)
        return json.dumps(result)
    except TimeoutError:
        return "Error: Tool execution timed out after 30 seconds"
    except Exception as e:
        return f"Error: {str(e)}"

MCP (Model Context Protocol)

MCP is an open protocol (created by Anthropic) that standardizes how LLMs connect to external tools and data sources. Think of it as USB-C for AI tools — a universal interface so any tool can work with any model.

Why MCP matters:

Before MCP, every tool integration was custom. Want your agent to use GitHub? Write a GitHub integration. Slack? Write another. Every tool × every model = an explosion of custom code.

MCP standardizes this:

Without MCP:
  App → Custom GitHub integration
  App → Custom Slack integration
  App → Custom DB integration
  (N tools × M apps = N×M integrations)

With MCP:
  App → MCP Client → MCP Server (GitHub)
  App → MCP Client → MCP Server (Slack)
  App → MCP Client → MCP Server (DB)
  (N tools + M apps = N+M integrations)

MCP architecture:

┌────────────────────────────────┐
│         MCP Host               │
│  (Your AI application)         │
│                                │
│  ┌────────────────────────┐    │
│  │      MCP Client        │    │
│  │  (Protocol handler)    │    │
│  └──────────┬─────────────┘    │
└─────────────┼──────────────────┘
              │ (JSON-RPC over stdio/SSE)
              │
   ┌──────────▼──────────────┐
   │      MCP Server         │
   │  (Tool provider)        │
   │                         │
   │  Exposes:               │
   │  - Tools (functions)    │
   │  - Resources (data)     │
   │  - Prompts (templates)  │
   └─────────────────────────┘

MCP capabilities:

| Capability | Description | Example |
|---|---|---|
| Tools | Functions the model can call | search_web, read_file, query_database |
| Resources | Data sources the model can read | Files, database records, API responses |
| Prompts | Reusable prompt templates | "Summarize this document", "Review this code" |

Example MCP server (Python):

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("web-search")

@mcp.tool()
async def search_web(query: str, num_results: int = 5) -> str:
    """Search the web for current information on a topic."""
    results = await perform_web_search(query, num_results)
    return format_results(results)

@mcp.tool()
async def fetch_page(url: str) -> str:
    """Fetch and extract the main content from a web page."""
    return await fetch_and_extract(url)

Why this matters for your "Ask-the-Web" agent: MCP lets you build your web search tools as a reusable server that any MCP-compatible application can use — not just your specific agent.


Part IV: Multi-Step Agents

Workflows are developer-controlled. Multi-step agents are model-controlled. The agent decides what to do next based on what it observes.

Planning Autonomy

The core question with agents is: how much autonomy should the model have?

| Approach | Planning | Execution | When to Use |
|---|---|---|---|
| Fixed plan | Developer defines the steps | Model executes each step | Predictable tasks with known workflows |
| LLM-generated plan | Model creates a plan; a human approves it | Model follows its own approved plan | Complex tasks where you want oversight |
| Fully autonomous | Model plans and executes in a loop, adapting as it goes | Model decides everything | Exploratory tasks where the path isn't known in advance |

ReACT (Reasoning + Acting)

ReACT is the foundational agent pattern. The model alternates between thinking (reasoning about what to do) and acting (using tools), then observing the results.

Loop:
  1. Thought: "I need to find out X. I'll search for Y."
  2. Action: search_web("Y")
  3. Observation: [search results]
  4. Thought: "The results show Z, but I also need to know W."
  5. Action: search_web("W")
  6. Observation: [more results]
  7. Thought: "Now I have enough information to answer."
  8. Final Answer: [synthesized response]

ReACT implementation:

def react_agent(question: str, tools: list, max_steps: int = 10) -> str:
    messages = [
        {"role": "system", "content": """You are a research agent. For each step:
1. Think about what you need to know and what tool to use.
2. Use a tool to gather information.
3. Observe the result.
4. Repeat until you can answer the question.
 
When you have enough information, provide your final answer."""},
        {"role": "user", "content": question}
    ]
 
    for step in range(max_steps):
        response = llm(messages, tools=tools)

        # Check if the model wants to use a tool
        if response.tool_calls:
            # Keep the assistant's tool-call turn in the history so the model
            # can see its own request alongside the results
            messages.append({"role": "assistant", "content": response.content,
                             "tool_calls": response.tool_calls})
            for tool_call in response.tool_calls:
                result = execute_tool(tool_call.name, tool_call.args)
                messages.append({"role": "tool", "content": result})
        else:
            # No tool call = model is ready to give a final answer
            return response.content

    return "Agent reached maximum steps without a final answer."

Why ReACT works:

  • The explicit "Thought" step forces the model to reason before acting
  • Each observation grounds the next decision in real data
  • The loop naturally handles multi-step tasks
  • The model can adapt its plan based on what it learns

Reflexion

Reflexion adds a self-reflection step to the agent loop. After completing a task (or failing), the agent reflects on what went well and what didn't, then uses that reflection to improve on the next attempt.

Attempt 1:
  ReACT loop → Answer → Evaluate → "My answer was wrong because I didn't consider X"
                                          ↓
Attempt 2:
  ReACT loop (with reflection context) → Better Answer → Evaluate → "Correct!"

When Reflexion helps:

  • Tasks where the agent can evaluate its own output (code that must pass tests, math with verifiable answers)
  • When the first attempt often fails but the model can learn from the failure
  • Research tasks where the initial search strategy was too narrow
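
The attempt-evaluate-reflect loop can be sketched as a skeleton where the ReACT pass and the evaluator are injected as callables (the underscore-prefixed demo stubs are illustrative, not a real evaluator):

```python
def reflexion_agent(question: str, attempt, evaluate, max_attempts: int = 3) -> str:
    """Retry loop: attempt -> evaluate -> reflect -> retry with reflections in context.

    attempt(question, reflections) runs one ReACT pass; evaluate(answer)
    returns (ok, feedback). Both are supplied by the caller.
    """
    reflections = []
    answer = ""
    for _ in range(max_attempts):
        answer = attempt(question, reflections)
        ok, feedback = evaluate(answer)
        if ok:
            return answer
        # Store a verbal self-reflection to steer the next attempt
        reflections.append(f"Previous attempt failed: {feedback}")
    return answer

# Demo: the first attempt fails, the reflection steers the second one
def _attempt(question, reflections):
    return "right" if reflections else "wrong"

def _evaluate(answer):
    return (answer == "right", "didn't consider X")

answer = reflexion_agent("q", _attempt, _evaluate)
```

In practice `attempt` would prepend the accumulated reflections to the system prompt, and `evaluate` would run tests, check a verifiable answer, or ask an LLM judge.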

ReWOO (Reasoning Without Observation)

ReWOO separates planning from execution. The agent creates a complete plan upfront, then executes all steps, then synthesizes. This reduces the number of LLM calls.

Standard ReACT:  Think → Act → Observe → Think → Act → Observe → Think → Answer
                 (7 LLM calls)

ReWOO:           Plan (all steps at once) → Execute all → Synthesize
                 (2 LLM calls)

import json

def rewoo_agent(question: str) -> str:
    # Step 1: Plan all steps at once
    plan = llm(f"""Create a plan to answer this question: {question}
    For each step, specify which tool to use and what arguments to pass.
    Format: Step N: tool_name(args) - purpose""")
 
    # Step 2: Execute all steps
    results = {}
    for step in parse_plan(plan):
        results[step.id] = execute_tool(step.tool, step.args)
 
    # Step 3: Synthesize
    return llm(f"""Given these results, answer the question: {question}
    Results: {json.dumps(results)}""")

| Aspect | ReACT | ReWOO |
|---|---|---|
| LLM calls | Many (one per step) | Few (plan + synthesize) |
| Adaptability | High — can change plan mid-execution | Low — plan is fixed |
| Latency | Higher (sequential LLM calls) | Lower (parallel tool execution possible) |
| When to use | Exploratory tasks, unknown number of steps | Well-defined tasks, latency-sensitive |
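
The `parse_plan` helper in the ReWOO sketch is left undefined. A minimal regex-based version, assuming the single-argument `Step N: tool_name(args) - purpose` format requested in the planning prompt (real plans would need sturdier parsing, e.g. asking the model for JSON):

```python
import re
from dataclasses import dataclass

@dataclass
class PlanStep:
    id: int
    tool: str
    args: dict

def parse_plan(plan: str) -> list[PlanStep]:
    """Parse lines shaped like 'Step 1: search_web(quantum computing) - find basics'."""
    pattern = re.compile(r"Step (\d+):\s*(\w+)\((.*?)\)")
    steps = []
    for match in pattern.finditer(plan):
        n, tool, raw_args = match.groups()
        # Simplification: treat everything in the parentheses as a single argument
        steps.append(PlanStep(id=int(n), tool=tool, args={"query": raw_args}))
    return steps

steps = parse_plan(
    "Step 1: search_web(quantum computing basics) - get background\n"
    "Step 2: fetch_page(https://example.com/qc) - read details"
)
```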

Tree Search for Agents

For complex reasoning tasks, a single linear chain of thought may not find the best solution. Tree search explores multiple reasoning paths and selects the most promising one.

Tree of Thought (ToT):

                            [Initial Question]
                           /        |         \
                   [Approach A]  [Approach B]  [Approach C]
                    /     \         |            /     \
               [A1]     [A2]     [B1]       [C1]    [C2]
                          ↓                   ↓
                     [Evaluate]          [Evaluate]
                          ↓                   ↓
                     Score: 0.9          Score: 0.7
                          ↓
                    [Best path → Final Answer]

How it works:

  1. Generate multiple possible next steps (branching)
  2. Evaluate each branch with a heuristic or LLM judge
  3. Expand the most promising branches
  4. Prune unpromising branches
  5. Continue until a satisfactory solution is found
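
The expand-evaluate-prune loop above reduces to a beam search. A toy sketch where `propose` (generate next thoughts) and `score` (a heuristic or LLM judge) are supplied by the caller; the demo callables are purely illustrative:

```python
def tree_search(question, propose, score, beam_width: int = 2, depth: int = 2):
    """Toy beam search over reasoning paths (a path is a list of thoughts)."""
    beams = [[question]]
    for _ in range(depth):
        # 1. Branch: generate candidate next steps for every surviving path
        candidates = [path + [t] for path in beams for t in propose(path)]
        # 2-4. Evaluate and prune: keep only the top-scoring branches
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]
    # 5. Return the best complete path
    return max(beams, key=score)

# Demo: propose two thoughts each step, score = count of "a" thoughts on the path
best = tree_search("q", lambda path: ["a", "b"], lambda path: path.count("a"))
```

A real Tree of Thought would use one LLM call to propose candidate thoughts and another (or a verifier) to score each partial path.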

Monte Carlo Tree Search (MCTS) for agents:

| Component | In Games (AlphaGo) | In Agents |
|---|---|---|
| State | Board position | Current reasoning + gathered information |
| Action | Place a stone | Choose a reasoning step or tool call |
| Reward | Win/lose | Answer correctness (verified or LLM-judged) |
| Rollout | Random play to end | Complete the reasoning chain to get an answer |

When to use tree search:

  • Mathematical reasoning where multiple approaches exist
  • Planning tasks with many possible strategies
  • Tasks where you can verify correctness (code, math, logic puzzles)
  • When accuracy matters more than speed

Part V: Multi-Agent Systems

Sometimes one agent isn't enough. Multi-agent systems use multiple specialized agents that collaborate, debate, or delegate to accomplish complex tasks.

Why Multiple Agents?

| Reason | Description | Example |
|---|---|---|
| Specialization | Different agents with different expertise | Researcher agent + writer agent + editor agent |
| Parallelism | Multiple agents working simultaneously | Three agents researching different aspects of a topic |
| Debate/verification | Agents check each other's work | One agent generates code, another reviews it |
| Separation of concerns | Each agent has a focused scope and toolset | A planning agent that delegates to execution agents |

Challenges of Multi-Agent Systems

| Challenge | Description | Mitigation |
|---|---|---|
| Coordination overhead | Agents need to communicate, which adds latency and cost | Clear protocols, minimal message passing |
| Error propagation | One agent's mistake cascades to others | Validation between agent handoffs |
| Infinite loops | Agents pass tasks back and forth forever | Step limits, loop detection, human-in-the-loop checkpoints |
| Context management | Each agent has limited context; sharing state is hard | Shared memory store, structured handoff messages |
| Debugging | Hard to trace why the system produced a specific output | Comprehensive logging of all agent interactions |
| Cost | Multiple agents = multiple LLM calls per user request | Budget limits, efficient agent design |

Use Cases for Multi-Agent Systems

| Use Case | Agent Architecture | How It Works |
|---|---|---|
| Software development | Planner → Coder → Reviewer → Tester | Planner breaks down the task, coder implements, reviewer checks quality, tester validates |
| Research synthesis | Coordinator → multiple Researchers → Synthesizer | Coordinator assigns sub-topics, researchers investigate in parallel, synthesizer combines |
| Content pipeline | Researcher → Writer → Editor → Fact-checker | Each agent specializes in one stage of content creation |
| Customer support escalation | Tier 1 bot → Specialist agents → Human escalation | Simple queries handled by Tier 1, complex ones routed to domain-specific agents |
| Debate / red team | Proposer → Critic → Judge | One agent proposes an answer, another critiques it, a judge decides |

A2A Protocol (Agent-to-Agent)

Just as MCP standardizes tool communication, the A2A protocol (introduced by Google) standardizes how agents communicate with each other.

Core concepts:

┌───────────┐     Agent Card (discovery)     ┌───────────┐
│  Agent A  │ ────────────────────────────→  │  Agent B  │
│ (Client)  │                                │ (Server)  │
│           │ ←── Task (request/response) ─→ │           │
│           │                                │           │
│           │ ←── Artifacts (results) ─────  │           │
└───────────┘                                └───────────┘

| Concept | Description |
|---|---|
| Agent Card | A JSON document describing what an agent can do — its capabilities, skills, and endpoint. Used for discovery. |
| Task | A unit of work sent from one agent to another. Has a lifecycle: submitted → working → completed/failed. |
| Artifact | The output of a task — files, text, structured data. |
| Message | Communication within a task — instructions, status updates, questions. |
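
An Agent Card is just a JSON document. A sketch of its general shape, written here as a Python dict; the field names are approximate and the URL is a placeholder, so consult the A2A specification for the exact schema:

```python
# Approximate shape of an Agent Card (illustrative field names and values)
agent_card = {
    "name": "research-agent",
    "description": "Formulates search queries, reads sources, synthesizes cited answers",
    "url": "https://example.com/a2a",   # endpoint other agents would call
    "skills": [
        {"id": "web-research",
         "description": "Multi-step web research with citations"},
    ],
}
```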

Why A2A matters: It enables interoperability between agents built by different teams, companies, or frameworks. Your research agent could delegate to a third-party data analysis agent without custom integration code.


Part VI: Evaluation of Agents

Evaluating agents is harder than evaluating LLMs because agents take actions, make decisions, and produce results through multi-step processes.

What to Evaluate

| Dimension | What It Measures | How to Measure |
|---|---|---|
| Task completion | Did the agent accomplish the goal? | Binary success/failure on a test suite |
| Answer quality | Is the final output correct and useful? | LLM-as-judge, human evaluation, ground-truth comparison |
| Efficiency | How many steps/tokens/tool calls did it take? | Count steps, measure tokens, track cost |
| Tool use accuracy | Did the agent pick the right tools with correct arguments? | Compare against expected tool call sequences |
| Reasoning quality | Were the agent's intermediate thoughts logical? | Evaluate thought traces, check for reasoning errors |
| Robustness | Does the agent handle edge cases and errors gracefully? | Adversarial test cases, error injection |
| Safety | Does the agent avoid harmful actions? | Red-team testing, sandboxed execution |
| Latency | How long does the full agent loop take? | End-to-end timing |

Evaluation Approaches

Benchmark-based evaluation:

| Benchmark | What It Tests | Domain |
|---|---|---|
| SWE-bench | Resolve real GitHub issues by writing code patches | Software engineering |
| WebArena | Complete real-world tasks on live websites | Web navigation |
| GAIA | General AI Assistant tasks requiring tool use and reasoning | General assistant |
| AgentBench | Multi-domain agent tasks (OS, DB, web, code) | Cross-domain |
| ToolBench | Tool selection and use across 16,000+ real APIs | Tool use |
| HotPotQA | Multi-hop question answering requiring multiple evidence sources | Research |

Trajectory-based evaluation:

Don't just evaluate the final answer — evaluate the entire trajectory:

Score each step:
  Step 1: Chose correct tool? ✅  Arguments correct? ✅  Result useful? ✅
  Step 2: Chose correct tool? ✅  Arguments correct? ❌  Result useful? ❌
  Step 3: Recovered from error? ✅  Adapted strategy? ✅
  Final answer correct? ✅

Trajectory score: 5/7 steps correct, recovered from error, correct final answer

Cost-quality trade-off:

The best agent isn't always the one with the best answers — it's the one that balances quality against cost and latency.

Agent A: 95% accuracy, avg 12 steps, $0.50/query, 45s latency
Agent B: 90% accuracy, avg 4 steps, $0.08/query, 12s latency

For most production use cases, Agent B is better.
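
One way to make that judgment explicit is to collapse the metrics into a single score. The weights below are purely illustrative; pick them to match how much quality your use case will trade for cost and speed:

```python
def agent_score(accuracy: float, cost: float, latency_s: float,
                w_acc: float = 1.0, w_cost: float = 0.5, w_lat: float = 0.01) -> float:
    """Higher is better; weights encode the quality/cost/latency trade-off."""
    return w_acc * accuracy - w_cost * cost - w_lat * latency_s

a = agent_score(0.95, 0.50, 45)   # Agent A
b = agent_score(0.90, 0.08, 12)   # Agent B
```

With these weights Agent B scores higher, matching the intuition above; a use case where accuracy dominates would raise `w_acc` and could flip the ranking.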

Building Your Evaluation Pipeline

def evaluate_agent(agent, test_cases: list[dict]) -> dict:
    results = []
    for test in test_cases:
        # Run agent
        trace = agent.run(test["question"], return_trace=True)
 
        results.append({
            "question": test["question"],
            "expected": test["expected_answer"],
            "actual": trace.final_answer,
            "correct": judge_correctness(trace.final_answer, test["expected_answer"]),
            "steps": len(trace.steps),
            "tool_calls": len(trace.tool_calls),
            "tokens_used": trace.total_tokens,
            "latency_ms": trace.duration_ms,
            "cost": trace.estimated_cost,
        })
 
    return {
        "accuracy": sum(r["correct"] for r in results) / len(results),
        "avg_steps": sum(r["steps"] for r in results) / len(results),
        "avg_cost": sum(r["cost"] for r in results) / len(results),
        "avg_latency": sum(r["latency_ms"] for r in results) / len(results),
    }

Part VII: Build Your "Ask-the-Web" Agent

Now let's build it — a Perplexity-style research agent that searches the web, reads pages, and synthesizes cited answers.

Architecture

┌──────────────────────────────────────────────────────────┐
│                    Ask-the-Web Agent                     │
│                                                          │
│  User Question                                           │
│       ↓                                                  │
│  ┌─────────────────────────────────────────────────┐     │
│  │  ReACT Loop                                     │     │
│  │                                                 │     │
│  │  Thought: "I need to search for X"              │     │
│  │      ↓                                          │     │
│  │  Action: search_web("X")                        │     │
│  │      ↓                                          │     │
│  │  Observation: [10 search results with snippets] │     │
│  │      ↓                                          │     │
│  │  Thought: "Result 3 looks relevant, let me read"│     │
│  │      ↓                                          │     │
│  │  Action: fetch_page("https://...")              │     │
│  │      ↓                                          │     │
│  │  Observation: [full page content]               │     │
│  │      ↓                                          │     │
│  │  Thought: "I need one more perspective on Y"    │     │
│  │      ↓                                          │     │
│  │  Action: search_web("Y different angle")        │     │
│  │      ↓                                          │     │
│  │  ... (continue until sufficient information)    │     │
│  │      ↓                                          │     │
│  │  Final Answer (with citations)                  │     │
│  └─────────────────────────────────────────────────┘     │
│                                                          │
│  Tools:                                                  │
│  - search_web(query) → search results                    │
│  - fetch_page(url) → page content                        │
│  - calculate(expression) → numerical result              │
└──────────────────────────────────────────────────────────┘

Implementation

from anthropic import Anthropic
 
client = Anthropic()
 
# Define tools
tools = [
    {
        "name": "search_web",
        "description": "Search the web for current information. Returns a list of results with titles, URLs, and snippets. Use specific, detailed queries for best results.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search query, e.g. 'latest advances in quantum computing 2026'"
                }
            },
            "required": ["query"]
        }
    },
    {
        "name": "fetch_page",
        "description": "Fetch the full content of a web page. Use this to read articles, documentation, or any URL found in search results.",
        "input_schema": {
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "description": "The URL to fetch"
                }
            },
            "required": ["url"]
        }
    }
]
 
SYSTEM_PROMPT = """You are an expert research agent similar to Perplexity AI. Your job is to
answer questions by searching the web, reading relevant pages, and synthesizing
comprehensive, well-cited answers.
 
Follow this process:
1. Think about what information you need to answer the question.
2. Search the web with specific, targeted queries.
3. Read the most relevant pages to get detailed information.
4. If needed, do follow-up searches to fill gaps or verify claims.
5. Synthesize a comprehensive answer with inline citations.
 
Rules:
- Always cite your sources using [1], [2], etc. with a sources list at the end.
- If sources conflict, note the disagreement.
- If you can't find reliable information, say so clearly.
- Prefer recent, authoritative sources.
- Be thorough but concise — cover the key points without unnecessary detail."""
 
 
def ask_the_web(question: str) -> str:
    messages = [{"role": "user", "content": question}]
 
    # ReACT loop
    for step in range(15):  # max 15 steps
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            system=SYSTEM_PROMPT,
            tools=tools,
            messages=messages,
        )
 
        # Collect all content blocks
        messages.append({"role": "assistant", "content": response.content})
 
        # Check if the model wants to use tools
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })
            messages.append({"role": "user", "content": tool_results})
        else:
            # No tool use = final answer
            return extract_text(response.content)
 
    return "Agent reached maximum steps. Partial answer may be available."
 
 
def execute_tool(name: str, args: dict) -> str:
    if name == "search_web":
        return search_web(args["query"])
    elif name == "fetch_page":
        return fetch_page(args["url"])
    else:
        return f"Unknown tool: {name}"
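
The loop above calls three helpers that aren't defined yet: `search_web`, `fetch_page`, and `extract_text`. Minimal sketches follow. The search endpoint (`SEARCH_API_URL`), its query parameters, and its JSON response shape are placeholders — swap in your provider of choice (Brave, Tavily, SerpAPI, etc.):

```python
import json
import os
import urllib.parse
import urllib.request

SEARCH_API_URL = "https://api.example-search.com/v1/search"  # placeholder endpoint
SEARCH_API_KEY = os.environ.get("SEARCH_API_KEY", "")


def format_results(results: list[dict]) -> str:
    """Render search hits as numbered, citation-friendly text for the model."""
    return "\n\n".join(
        f"[{i + 1}] {r['title']}\n{r['url']}\n{r['snippet']}"
        for i, r in enumerate(results)
    )


def search_web(query: str, count: int = 10) -> str:
    """Call a search API and return formatted results.
    The endpoint and response shape are assumptions — adapt to your provider."""
    url = f"{SEARCH_API_URL}?q={urllib.parse.quote(query)}&count={count}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {SEARCH_API_KEY}"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        data = json.loads(resp.read())
    return format_results(data.get("results", []))


def fetch_page(url: str, max_chars: int = 20_000) -> str:
    """Fetch a page's raw content, truncated to keep the context window sane.
    In production you'd strip the HTML down to readable text first."""
    req = urllib.request.Request(url, headers={"User-Agent": "ask-the-web/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")[:max_chars]
    except Exception as e:
        return f"Error fetching {url}: {e}"


def extract_text(content) -> str:
    """Concatenate the text blocks from a final response."""
    return "\n".join(b.text for b in content if getattr(b, "type", None) == "text")
```

Note that `fetch_page` returns errors as strings rather than raising — the agent loop stays alive, and the model can read the error and decide to try a different URL.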

Adding Streaming for Real-Time Output

Users shouldn't stare at a blank screen while the agent works. Stream the agent's thinking and progress:

import json  # for json.dumps below; client, tools, and SYSTEM_PROMPT are defined above
 
def ask_the_web_streaming(question: str):
    """Stream the agent's process — show thinking, tool use, and final answer."""
    messages = [{"role": "user", "content": question}]
 
    for step in range(15):
        print(f"\n--- Step {step + 1} ---")
 
        with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            system=SYSTEM_PROMPT,
            tools=tools,
            messages=messages,
        ) as stream:
            # Print text deltas as they arrive, then collect the full message
            for text in stream.text_stream:
                print(text, end="", flush=True)
            print()
            response = stream.get_final_message()
 
        messages.append({"role": "assistant", "content": response.content})
 
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    print(f"  🔧 {block.name}({json.dumps(block.input)})")
                    result = execute_tool(block.name, block.input)
                    print(f"  ← Got {len(result)} chars")
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })
            messages.append({"role": "user", "content": tool_results})
        else:
            break

Features to Add

Phase 1: Core Agent

  1. Web search + page fetching tools
  2. ReACT loop with tool calling
  3. Cited answers with source list

Phase 2: Enhanced Search

  4. Query rewriting for better search results
  5. Multiple search attempts with different queries
  6. Source credibility scoring

Phase 3: User Experience

  7. Streaming output showing the agent's progress
  8. Follow-up questions (conversational)
  9. Source preview cards with titles and snippets

Phase 4: Advanced

  10. Parallel search (issue multiple queries simultaneously)
  11. Fact-checking via cross-referencing sources
  12. Caching frequent queries
  13. MCP server for reusable web search tools
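
Phase 4's parallel search is mostly a thread-pool wrapper around whatever search function you already have. A minimal sketch (`parallel_search` is a hypothetical helper, not part of any SDK):

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_search(queries: list[str], search_fn, max_workers: int = 4) -> dict[str, str]:
    """Run several search queries concurrently and pair each query with its
    result. HTTP calls are I/O-bound, so threads overlap them nicely."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(search_fn, queries))
    return dict(zip(queries, results))
```

Caching (item 12) can be layered onto the same search function — `functools.lru_cache` handles exact-match repeat queries with one decorator.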

Key Learning Outcomes

Building this agent teaches you:

  • Tool calling mechanics — how to define, format, and execute tools with an LLM
  • The ReACT pattern — the foundational loop for autonomous agents
  • Prompt engineering for agents — how to guide agent behavior through system prompts
  • Multi-step reasoning — how an agent decides what to search, when to dig deeper, and when to stop
  • Citation and grounding — how to ensure outputs are traceable to sources
  • Error handling in agent loops — what happens when a search returns nothing or a page fails to load
  • Streaming for agents — giving users visibility into the agent's process

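On that last error-handling point: the simplest robust policy is to catch tool failures and hand them back to the model as plain text, so the loop never crashes and the model can retry, rephrase, or give up. A sketch (`safe_execute` is a hypothetical wrapper around any tool function):

```python
import time


def safe_execute(tool_fn, args: dict, retries: int = 2, backoff: float = 1.0) -> str:
    """Run a tool call without ever raising: empty results and exceptions both
    come back as text the model can read and react to."""
    for attempt in range(retries + 1):
        try:
            result = tool_fn(**args)
            if not result or not result.strip():
                return "The tool returned no results. Try a different query or URL."
            return result
        except Exception as e:
            if attempt == retries:
                return f"Tool failed after {attempt + 1} attempts: {e}"
            time.sleep(backoff * (attempt + 1))  # simple linear backoff before retrying
```

Returning errors as observations, rather than raising, is what lets a ReACT agent recover mid-run — a failed fetch just becomes another Observation for the next Thought.
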
What You Should Know After Reading This

If you've read this post carefully, you should be able to answer these questions:

  1. What's the difference between an LLM, an agentic system, and an agent?
  2. When would you use a prompt chain vs. a ReACT agent?
  3. What are the five workflow patterns and when would you use each?
  4. How does tool calling work — what does the model output and what does your application do?
  5. What is MCP and what problem does it solve?
  6. How does the ReACT pattern work? What are Thought, Action, and Observation?
  7. What is Reflexion and when does it help?
  8. What are the key challenges of multi-agent systems?
  9. How do you evaluate an agent beyond just answer correctness?
  10. What is the A2A protocol and why does it matter?

If you can't answer all of them yet, re-read the relevant section. Understanding agent architectures is essential for building the next generation of AI applications.


Further Reading

For those who want to go deeper on any topic covered here:

  • "ReAct: Synergizing Reasoning and Acting in Language Models" (Yao et al., 2022) — The original ReACT paper
  • "Reflexion: Language Agents with Verbal Reinforcement Learning" (Shinn et al., 2023) — The Reflexion paper
  • "ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models" (Xu et al., 2023) — The ReWOO paper
  • "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (Yao et al., 2023) — Tree search for LLMs
  • "Toolformer: Language Models Can Teach Themselves to Use Tools" (Schick et al., 2023) — Self-taught tool use
  • "Building Effective Agents" (Anthropic, 2024) — Anthropic's practical guide to agent design
  • "The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling" (Masterman et al., 2024)
  • Model Context Protocol specification (https://modelcontextprotocol.io)
  • LangGraph documentation — Framework for building stateful agent workflows
  • CrewAI documentation — Framework for multi-agent orchestration

Next in the Series

Part 4: Deep Research with Reasoning Models — We cover reasoning and thinking LLMs (o1, DeepSeek-R1, Claude extended thinking), inference-time scaling techniques (Chain-of-Thought, self-consistency, Tree of Thoughts, search against a verifier), training-time techniques (STaR, RL with verifiers, reward modeling, Meta-CoT), and build a deep research agent that combines web search with multi-step structured reasoning.

Stay tuned.
