
Build an LLM Playground — Part 4: Build "Deep Research" with Web Search and Reasoning Models

The fourth entry in the learn-by-doing AI engineer series. We cover reasoning and thinking LLMs, inference-time scaling techniques (CoT, self-consistency, Tree of Thoughts), training-time techniques (STaR, RLHF with verifiers, reward modeling), and build a deep research agent that combines web search with multi-step reasoning.

Tags: ai, llm, reasoning, chain-of-thought, deep-research, tree-of-thoughts, reinforcement-learning, tutorial, series

Series: The AI Engineer Learning Path

This is Part 4 of a hands-on series designed to take you from zero to working AI engineer. Every post follows a learn-by-doing philosophy — we explain the theory, then you build something real.

| Part | Topic | Status |
|------|-------|--------|
| 1 | Build an LLM Playground | Complete |
| 2 | Customer Support Chatbot with RAG & Prompt Engineering | Complete |
| 3 | "Ask-the-Web" Agent with Tool Calling | Complete |
| 4 | Deep Research with Reasoning Models (this post) | Current |
| 5 | Multi-modal Generation Agent | Available |

In Part 3, we built an agent that can search the web and synthesize answers. But that agent treats reasoning as a single forward pass — it thinks once and answers. Real research requires deliberate, multi-step reasoning: forming hypotheses, searching for evidence, evaluating conflicting information, revising conclusions, and knowing when you're confident enough to stop.

This post is about the science and engineering of making LLMs think harder. We'll cover reasoning models, inference-time scaling, training-time techniques, and then build a "Deep Research" agent that combines web search with structured multi-step reasoning.


Why Reasoning Matters

Standard LLMs are pattern matchers. They produce the most likely next token given the context. This works remarkably well for most tasks, but it fails on problems that require:

  • Multi-step logical deduction — "If A implies B and B implies C, does A imply C?"
  • Planning under constraints — "How do I schedule 5 tasks with dependencies in the shortest time?"
  • Self-correction — "Wait, that calculation was wrong. Let me redo it."
  • Deliberate exploration — "There are three possible approaches. Let me evaluate each before committing."

The core insight of reasoning models is that you can trade compute at inference time for better answers. Instead of generating one answer in one pass, you let the model think longer, explore multiple paths, and verify its own work.

This is called inference-time compute scaling — and it's one of the most important ideas in modern AI.

Standard LLM:
  Question → [Single forward pass] → Answer
  Fast, cheap, often wrong on hard problems

Reasoning LLM:
  Question → [Think step 1] → [Think step 2] → ... → [Think step N] → [Verify] → Answer
  Slower, more expensive, dramatically better on hard problems

Part I: Reasoning and Thinking LLMs

What Are Reasoning Models?

Reasoning models are LLMs that have been specifically trained or prompted to "think before answering." Instead of producing a final answer immediately, they generate an extended chain of reasoning — often called a "thinking trace" or "internal monologue" — before arriving at a conclusion.

The key distinction:

| Aspect | Standard LLM | Reasoning LLM |
|--------|--------------|---------------|
| Output | Direct answer | Extended thinking + answer |
| Compute | Fixed (one forward pass per token) | Variable (more thinking for harder problems) |
| Error handling | Errors compound silently | Can catch and correct its own mistakes |
| Transparency | Black box, no insight into reasoning | Visible chain of thought you can inspect |
| Cost | Fewer tokens per query | More tokens per query, but better accuracy |

Overview of Reasoning Models

OpenAI's "o" Family

OpenAI's o1, o1-mini, o3, and o3-mini models were the first major commercial reasoning models. They use a technique where the model generates a hidden "chain of thought" before producing the final answer.

How they work:

User: "How many r's are in the word strawberry?"

Standard GPT-4:
  → "There are 2 r's in strawberry."  (wrong)

o1:
  → [Internal reasoning]:
     "Let me spell it out: s-t-r-a-w-b-e-r-r-y"
     "Now let me count each 'r': position 3 is 'r', position 9 is 'r', position 10 is 'r'"
     "Wait, let me recount: s(1) t(2) r(3) a(4) w(5) b(6) e(7) r(8) r(9) y(10)"
     "Positions 3, 8, 9 contain 'r'"
     "That's 3 r's"
  → "There are 3 r's in strawberry."  (correct)

Key characteristics:

| Feature | Detail |
|---------|--------|
| Hidden reasoning | The thinking trace is not shown to the user (only a summary) |
| Adaptive compute | Harder problems get more thinking tokens automatically |
| Trained with RL | Uses reinforcement learning to improve reasoning quality |
| Cost structure | You pay for both thinking tokens and output tokens |

DeepSeek-R1

DeepSeek-R1 is an open-weight reasoning model that made waves by achieving reasoning performance competitive with o1 while being openly available.

What makes R1 interesting:

  1. Open weights — You can download, inspect, and fine-tune the model
  2. Transparent reasoning — The full thinking trace is visible (unlike o1's hidden reasoning)
  3. Trained with RL — Uses Group Relative Policy Optimization (GRPO) to learn reasoning
  4. Emergent behaviors — The model spontaneously learned to self-correct, explore alternatives, and verify its work during RL training — these behaviors were not explicitly programmed

R1's training pipeline:

Step 1: Cold-start SFT
  Start with DeepSeek-V3 base model
  Fine-tune on a small set of high-quality reasoning examples
  This gives the model the "format" of reasoning

Step 2: Reasoning-focused RL
  Train with GRPO on math and coding tasks
  Reward = correctness of final answer (verified by running code or checking math)
  The model learns that longer, more careful reasoning leads to correct answers

Step 3: Rejection sampling + SFT
  Generate many reasoning traces, keep only the ones that led to correct answers
  Fine-tune on this curated dataset

Step 4: Final RL round
  Another round of RL to polish reasoning quality and helpfulness
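Step 3 is the part of this pipeline you can most easily reproduce yourself. Below is a minimal sketch of the rejection-sampling filter, with a hypothetical `generate_traces` stub standing in for LLM sampling and exact-match comparison standing in for the verifier (a real pipeline would run code or check math):

```python
import random

def generate_traces(question: str, n: int = 8) -> list[tuple[str, str]]:
    """Stand-in for sampling n reasoning traces from the model.
    A real pipeline calls the LLM at temperature > 0; here we fake
    traces that land on the right answer about half the time."""
    return [("...reasoning...", "42" if random.random() < 0.5 else "17")
            for _ in range(n)]

def rejection_sample(dataset: list[tuple[str, str]],
                     n_per_question: int = 8) -> list[dict]:
    """Keep only traces whose final answer matches the reference."""
    kept = []
    for question, reference in dataset:
        for reasoning, answer in generate_traces(question, n_per_question):
            if answer == reference:  # the "verifier" here is exact match
                kept.append({"question": question,
                             "reasoning": reasoning,
                             "answer": answer})
    return kept

random.seed(0)
sft_data = rejection_sample([("What is 6 * 7?", "42")])
# Every kept example is correct by construction; this filtered set is
# what Step 3 fine-tunes on.
assert all(ex["answer"] == "42" for ex in sft_data)
```

The point of the filter is that even a mediocre model produces some correct traces, and fine-tuning only on those shifts the model toward the reasoning styles that worked.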

Emergent reasoning behaviors in R1:

During RL training, R1 spontaneously developed several sophisticated reasoning patterns:

| Behavior | Example |
|----------|---------|
| Self-verification | "Let me double-check this calculation... 7 x 8 = 56, yes that's correct" |
| Backtracking | "Wait, this approach isn't working. Let me try a different method." |
| Decomposition | "This problem has three parts. Let me handle each one separately." |
| Reflection | "Hmm, my first answer seems too simple for this problem. Let me think more carefully." |

These behaviors emerged naturally from the reward signal — the model discovered that careful thinking leads to more correct answers, and more correct answers lead to higher rewards.

Claude's Extended Thinking

Anthropic's Claude models support "extended thinking" — a mode where the model generates a detailed reasoning trace before its final answer.

import anthropic
 
client = anthropic.Anthropic()
 
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    temperature=1,  # required for extended thinking
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # max tokens for thinking
    },
    messages=[{
        "role": "user",
        "content": "How many prime numbers are between 1 and 100?"
    }]
)
 
# The response has two parts: thinking and text
for block in response.content:
    if block.type == "thinking":
        print("=== THINKING ===")
        print(block.thinking)
    elif block.type == "text":
        print("=== ANSWER ===")
        print(block.text)

Key features of extended thinking:

| Feature | Detail |
|---------|--------|
| Visible reasoning | You can inspect the full thinking trace |
| Budget control | Set budget_tokens to control how much thinking the model does |
| Streaming | Thinking tokens stream in real-time |
| Tool use compatible | Works with tool calling; the model can think between tool calls |

Comparing Reasoning Models

| Model | Reasoning Visible? | Open Weights? | Best At | Cost |
|-------|--------------------|---------------|---------|------|
| o1/o3 | Summary only | No | Math, coding, science | High (hidden thinking tokens) |
| o3-mini/o4-mini | Summary only | No | Good balance of speed and reasoning | Medium |
| DeepSeek-R1 | Full trace | Yes | Math, coding, open-ended reasoning | Low (self-hosted) or medium (API) |
| Claude (extended thinking) | Full trace | No | Analysis, writing, coding, research | Medium-High (budget controllable) |
| Gemini 2.5 Pro | Summary only | No | Long-context reasoning, multimodal | Medium-High |

Part II: Inference-Time Techniques

Inference-time techniques are methods you apply when using a model (not during training) to improve its reasoning. These work with any LLM — you don't need a special reasoning model.

The core idea: spend more compute at inference time to get better answers.

Inference-Time Scaling

The traditional way to make LLMs better is to train bigger models on more data (training-time scaling). But there's another dimension: inference-time scaling — letting the model use more compute per question.

Training-time scaling:
  Better model = more parameters + more training data + more training compute
  (Decided months before the model is used)

Inference-time scaling:
  Better answer = more tokens of reasoning + multiple attempts + verification
  (Decided at the moment you ask the question)

Why this matters: Training-time scaling has diminishing returns and enormous costs. Inference-time scaling lets you allocate compute where it matters — hard questions get more thinking, easy questions get answered quickly.

                    Answer Quality
                         ▲
                         │           ┌─── Inference-time scaling
                         │          ╱     (more thinking per question)
                         │        ╱
                         │      ╱
                         │    ╱  ┌─── Training-time scaling
                         │  ╱  ╱     (bigger model)
                         │╱  ╱
                         │ ╱
                         │╱
                         └──────────────────────→ Compute

The exciting finding from recent research: for many tasks, spending 10x more compute at inference time (through better reasoning strategies) can match or exceed a model that's 10x bigger.

Chain-of-Thought (CoT) Prompting

Chain-of-Thought is the simplest and most widely used inference-time technique. You prompt the model to show its reasoning step-by-step before giving the final answer.

Zero-shot CoT — Just add "Let's think step by step":

# Without CoT
response = llm("What is 17 * 24?")
# Model might jump to an answer and get it wrong
 
# With CoT
response = llm("""What is 17 * 24?
 
Let's think step by step.""")
# Model breaks it down:
# "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408"

Few-shot CoT — Provide examples of step-by-step reasoning:

prompt = """
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. He bought 2 cans of 3 balls each.
That's 2 * 3 = 6 new balls. 5 + 6 = 11. The answer is 11.
 
Q: The cafeteria had 23 apples. If they used 20 to make lunch and
bought 6 more, how many apples do they have?
A: The cafeteria started with 23 apples. They used 20, so they had
23 - 20 = 3. Then they bought 6 more, so 3 + 6 = 9. The answer is 9.
 
Q: {user_question}
A: Let's think step by step."""

Why CoT works:

| Reason | Explanation |
|--------|-------------|
| Working memory | The model can offload intermediate results into the text, avoiding the need to hold everything "in its head" |
| Error visibility | When reasoning is explicit, errors become visible and the model can catch them |
| Decomposition | Complex problems are broken into simpler sub-problems |
| Trained distribution | During training, the model saw many examples of step-by-step reasoning (textbooks, tutorials, Stack Overflow) |

When CoT helps vs. doesn't:

| Helps | Doesn't Help |
|-------|--------------|
| Multi-step math problems | Simple factual recall ("What's the capital of France?") |
| Logic puzzles | Tasks the model can already do well in one step |
| Code generation with complex requirements | Creative writing (reasoning doesn't improve creativity) |
| Any task requiring more than 2-3 mental steps | Tasks where the model lacks the underlying knowledge |

Implementing CoT with the API

import anthropic
 
client = anthropic.Anthropic()
 
def solve_with_cot(question: str) -> dict:
    """Solve a problem using Chain-of-Thought prompting."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        system="""You are a careful problem solver. For every question:
1. Break the problem into clear steps
2. Work through each step explicitly
3. Show all calculations
4. State your final answer clearly
 
Format your response as:
## Reasoning
[step-by-step work]
 
## Answer
[final answer]""",
        messages=[{
            "role": "user",
            "content": question
        }]
    )
 
    text = response.content[0].text
 
    # Parse reasoning and answer
    parts = text.split("## Answer")
    reasoning = parts[0].replace("## Reasoning", "").strip() if len(parts) > 1 else text
    answer = parts[1].strip() if len(parts) > 1 else text
 
    return {
        "reasoning": reasoning,
        "answer": answer,
        "tokens_used": response.usage.input_tokens + response.usage.output_tokens
    }
 
# Example
result = solve_with_cot(
    "A store sells apples for $1.50 each and oranges for $2.00 each. "
    "If Sarah buys 3 apples and some oranges, and spends exactly $13.50, "
    "how many oranges did she buy?"
)
print(result["reasoning"])
print(f"\nAnswer: {result['answer']}")

Self-Consistency

Self-consistency is a simple but powerful idea: sample multiple reasoning chains, then take the majority vote on the final answer.

Different reasoning paths may reach different conclusions. By sampling many paths and picking the most common answer, you filter out reasoning errors.

Question: "What is the probability of rolling at least one six in four dice rolls?"

Chain 1: "P(no six) = (5/6)^4 = 625/1296. P(at least one) = 1 - 625/1296 = 671/1296 ≈ 0.518"
Chain 2: "P(at least one) = 1 - P(none) = 1 - (5/6)^4 = 1 - 0.482 = 0.518"
Chain 3: "P(six on one die) = 1/6, four rolls... 4 * 1/6 = 4/6 ≈ 0.667"  (wrong reasoning)
Chain 4: "1 - (5/6)^4 = 1 - 0.482 = 0.518"
Chain 5: "1 - (5/6)^4 ≈ 0.518"

Majority answer: 0.518 (4 out of 5 chains agree)

Implementation:

import anthropic
from collections import Counter
 
client = anthropic.Anthropic()
 
def self_consistency(question: str, n_samples: int = 5, temperature: float = 0.7) -> dict:
    """
    Generate multiple reasoning chains and take majority vote.
    Higher temperature = more diverse reasoning paths.
    """
    answers = []
    chains = []
 
    for i in range(n_samples):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            temperature=temperature,
            messages=[{
                "role": "user",
                "content": f"""{question}
 
Think step by step, then give your final answer on the last line
in the format: ANSWER: [your answer]"""
            }]
        )
 
        text = response.content[0].text
        chains.append(text)
 
        # Extract answer from last line
        for line in reversed(text.split("\n")):
            if "ANSWER:" in line:
                answer = line.split("ANSWER:")[-1].strip()
                answers.append(answer)
                break
 
    # Majority vote
    vote_counts = Counter(answers)
    best_answer = vote_counts.most_common(1)[0] if vote_counts else ("No consensus", 0)
 
    return {
        "answer": best_answer[0],
        "confidence": best_answer[1] / len(answers) if answers else 0,
        "vote_distribution": dict(vote_counts),
        "n_chains": n_samples,
        "chains": chains
    }
 
# Example
result = self_consistency(
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
    "How much does the ball cost?",
    n_samples=7
)
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']:.0%}")
print(f"Vote distribution: {result['vote_distribution']}")

Why self-consistency works:

  • Correct reasoning paths tend to converge on the same answer
  • Wrong reasoning paths tend to produce different wrong answers (they scatter)
  • The majority vote amplifies the signal of correct reasoning

Key parameters:

| Parameter | Effect |
|-----------|--------|
| n_samples | More samples = higher accuracy, higher cost. 5-10 is usually enough. |
| temperature | Higher = more diverse chains. 0.5-0.8 works well. Too low = all chains are identical. |
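How much does voting actually buy you? A back-of-envelope model: treat the chains as independent, each correct with probability p, with wrong chains scattering across different answers. The majority is then correct with a binomial tail probability. This is an idealized sketch (real chains are correlated, which dampens the gain), but it shows the amplification:

```python
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent chains is
    correct, assuming each chain is right with probability p and wrong
    chains scatter across different answers (idealized)."""
    k_needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_needed, n + 1))

print(majority_vote_accuracy(0.6, 1))            # 0.6 (a single chain)
print(round(majority_vote_accuracy(0.6, 5), 3))  # 0.683
print(round(majority_vote_accuracy(0.6, 15), 3))
```

A 60%-accurate chain becomes roughly a 68%-accurate system with 5 votes, and accuracy keeps climbing (with diminishing returns) as n grows, which is why 5-10 samples is usually the sweet spot.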

Sequential Revision

Sequential revision has the model iteratively improve its answer through multiple rounds of self-critique and refinement.

Round 1: Generate initial answer
Round 2: Critique the answer — find errors, gaps, weaknesses
Round 3: Revise the answer based on the critique
Round 4: Critique again — are the issues fixed? Any new ones?
Round 5: Final revision

Implementation:

import anthropic
 
client = anthropic.Anthropic()
 
def sequential_revision(question: str, max_rounds: int = 3) -> dict:
    """
    Iteratively improve an answer through self-critique and revision.
    """
    # Round 1: Initial answer
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"Answer this question thoroughly:\n\n{question}"
        }]
    )
    current_answer = response.content[0].text
    history = [{"round": 0, "type": "initial", "content": current_answer}]
 
    for round_num in range(1, max_rounds + 1):
        # Critique
        critique_response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"""Here is a question and an answer. Critically evaluate the answer.
Identify specific errors, gaps, unsupported claims, or areas for improvement.
Be harsh but constructive. If the answer is already excellent, say "NO ISSUES FOUND".
 
Question: {question}
 
Answer: {current_answer}
 
Critique:"""
            }]
        )
        critique = critique_response.content[0].text
        history.append({"round": round_num, "type": "critique", "content": critique})
 
        # Check if no issues found
        if "NO ISSUES FOUND" in critique.upper():
            break
 
        # Revise
        revision_response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            messages=[{
                "role": "user",
                "content": f"""Revise this answer based on the critique below.
Fix all identified issues while preserving what was already good.
 
Question: {question}
 
Original answer: {current_answer}
 
Critique: {critique}
 
Revised answer:"""
            }]
        )
        current_answer = revision_response.content[0].text
        history.append({"round": round_num, "type": "revision", "content": current_answer})
 
    return {
        "final_answer": current_answer,
        "rounds": len([h for h in history if h["type"] == "revision"]),
        "history": history
    }
 
# Example
result = sequential_revision(
    "Explain the CAP theorem in distributed systems and give a real-world example "
    "of a system that prioritizes each of the three pairs (CP, AP, CA)."
)
print(f"Final answer (after {result['rounds']} revisions):")
print(result["final_answer"])

When to use sequential revision:

| Good For | Not Good For |
|----------|--------------|
| Open-ended explanations | Simple factual questions |
| Code review and improvement | Tasks where the first answer is usually right |
| Essay writing and refinement | Time-sensitive applications |
| Analysis that needs to be thorough | Tasks where the model can't evaluate quality |

Tree of Thoughts (ToT)

Tree of Thoughts extends Chain-of-Thought from a single chain to a tree of reasoning paths. The model explores multiple approaches, evaluates each one, and prunes unpromising branches.

                        [Problem]
                       /    |    \
              [Approach A] [Approach B] [Approach C]
              Score: 0.8   Score: 0.3   Score: 0.7
               /     \                    |
          [A→step2] [A→step2']         [C→step2]
          Score: 0.9 Score: 0.4        Score: 0.6
             |
          [A→step3]
          Score: 0.95
             |
          [Final Answer]

Implementation:

import anthropic
import json
 
client = anthropic.Anthropic()
 
def tree_of_thoughts(
    problem: str,
    n_branches: int = 3,
    max_depth: int = 3,
    beam_width: int = 2
) -> dict:
    """
    Explore multiple reasoning paths using Tree of Thoughts.
 
    Args:
        problem: The problem to solve
        n_branches: Number of branches to generate at each step
        max_depth: Maximum depth of the reasoning tree
        beam_width: Number of top branches to keep at each level (beam search)
    """
 
    def generate_thoughts(problem: str, context: str, n: int) -> list:
        """Generate n possible next reasoning steps."""
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            messages=[{
                "role": "user",
                "content": f"""Problem: {problem}
 
Reasoning so far: {context if context else "None — this is the first step."}
 
Generate exactly {n} different possible next steps in the reasoning.
Each should take a DIFFERENT approach or consider a DIFFERENT angle.
 
Return as a JSON array of strings, each being one reasoning step.
Example: ["Step: First approach...", "Step: Alternative approach...", "Step: Third angle..."]"""
            }]
        )
        try:
            return json.loads(response.content[0].text)
        except json.JSONDecodeError:
            return [response.content[0].text]
 
    def evaluate_thought(problem: str, reasoning_path: str) -> float:
        """Evaluate how promising a reasoning path is (0-1)."""
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": f"""Problem: {problem}
 
Reasoning path so far:
{reasoning_path}
 
Rate how promising this reasoning path is for solving the problem.
Consider: Is the logic sound? Is it making progress? Is it heading toward a correct answer?
 
Respond with ONLY a number between 0.0 and 1.0."""
            }]
        )
        try:
            return float(response.content[0].text.strip())
        except ValueError:
            return 0.5
 
    # Initialize with root branches
    current_paths = [{"path": "", "score": 1.0}]
 
    for depth in range(max_depth):
        all_candidates = []
 
        for node in current_paths:
            # Generate possible next steps
            thoughts = generate_thoughts(problem, node["path"], n_branches)
 
            for thought in thoughts:
                new_path = f"{node['path']}\n{thought}" if node["path"] else thought
                score = evaluate_thought(problem, new_path)
                all_candidates.append({"path": new_path, "score": score})
 
        # Keep top beam_width candidates (beam search)
        all_candidates.sort(key=lambda x: x["score"], reverse=True)
        current_paths = all_candidates[:beam_width]
 
        print(f"Depth {depth + 1}: {len(all_candidates)} candidates → kept top {beam_width}")
        for p in current_paths:
            print(f"  Score: {p['score']:.2f}")
 
    # Generate final answer from the best path
    best_path = current_paths[0]
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"""Problem: {problem}
 
Best reasoning path:
{best_path['path']}
 
Based on this reasoning, provide the final, complete answer to the problem."""
        }]
    )
 
    return {
        "answer": response.content[0].text,
        "best_path": best_path["path"],
        "path_score": best_path["score"],
        "explored_paths": sum(len(generate_thoughts(problem, "", n_branches)) for _ in range(max_depth))
    }
 
# Example
result = tree_of_thoughts(
    "Design a system to detect fraudulent transactions in real-time. "
    "Consider latency, accuracy, and false positive rate.",
    n_branches=3,
    max_depth=3,
    beam_width=2
)
print(f"Answer (path score: {result['path_score']:.2f}):")
print(result["answer"])

ToT vs CoT vs Self-Consistency:

| Technique | Paths Explored | Selection Method | LLM Calls | Best For |
|-----------|----------------|------------------|-----------|----------|
| CoT | 1 (single chain) | None; take what you get | 1 | Simple reasoning tasks |
| Self-Consistency | N parallel chains | Majority vote on final answer | N | Math, logic, factual questions with verifiable answers |
| Tree of Thoughts | Up to N^D (branching tree) | Evaluation + pruning at each step | Many | Complex problems requiring exploration of different strategies |
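One way to make the "Many" in that last row concrete is to count the LLM calls the beam-search implementation above makes: one generation call per surviving node, one evaluation call per child, plus the final answer call. A small sketch mirroring that loop:

```python
def tot_llm_calls(n_branches: int, max_depth: int, beam_width: int) -> int:
    """Count LLM calls in a beam-search ToT loop: one generate call per
    frontier node, one evaluate call per generated child, plus a final
    answer-generation call."""
    calls = 0
    frontier = 1  # start from the root
    for _ in range(max_depth):
        calls += frontier                  # generate_thoughts per frontier node
        children = frontier * n_branches
        calls += children                  # evaluate_thought per child
        frontier = min(children, beam_width)
    return calls + 1  # final answer call

# The example's defaults: 3 branches, depth 3, beam width 2
print(tot_llm_calls(3, 3, 2))  # 21 calls
# An unpruned tree would have 3 + 9 + 27 = 39 nodes to expand and score.
```

Beam search is what keeps ToT affordable: the frontier stays capped at beam_width, so cost grows linearly with depth instead of exponentially.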

Search Against a Verifier

The most powerful inference-time technique: generate many candidate solutions, then use a verifier to pick the best one.

This works when you have a way to check whether an answer is correct — a unit test for code, a math checker, a constraint validator, etc.

Generate 50 candidate solutions
    ↓
Run each through a verifier
    ↓
Pick the one that passes (or scores highest)

Implementation for code generation:

import anthropic
import subprocess
import tempfile
 
client = anthropic.Anthropic()
 
def search_against_verifier(
    problem: str,
    test_cases: list[dict],
    n_candidates: int = 10,
    temperature: float = 0.8
) -> dict:
    """
    Generate multiple code solutions and verify each against test cases.
    Return the first solution that passes all tests.
    """
    results = []
 
    for i in range(n_candidates):
        # Generate a candidate solution
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            temperature=temperature,
            messages=[{
                "role": "user",
                "content": f"""Solve this problem in Python. Return ONLY the function, no explanation.
 
{problem}"""
            }]
        )
 
        code = response.content[0].text
        # Strip markdown code fences if present (```python or bare ```)
        if "```python" in code:
            code = code.split("```python")[1].split("```")[0]
        elif "```" in code:
            code = code.split("```")[1].split("```")[0]
 
        # Run against test cases
        passed = 0
        total = len(test_cases)
 
        for test in test_cases:
            test_code = f"""{code}
 
# Test
result = {test['call']}
expected = {test['expected']}
assert result == expected, f"Got {{result}}, expected {{expected}}"
print("PASS")
"""
            try:
                with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
                    f.write(test_code)
                    f.flush()
                    proc = subprocess.run(
                        ['python', f.name],
                        capture_output=True, text=True, timeout=5
                    )
                    if proc.returncode == 0 and "PASS" in proc.stdout:
                        passed += 1
            except Exception:  # covers timeouts, crashes, and failed asserts
                pass
 
        results.append({
            "candidate": i + 1,
            "code": code,
            "passed": passed,
            "total": total,
            "all_passed": passed == total
        })
 
        # Early exit if we find a perfect solution
        if passed == total:
            print(f"  Candidate {i + 1}: PASSED all {total} tests")
            return {
                "solution": code,
                "candidates_tried": i + 1,
                "results": results
            }
        else:
            print(f"  Candidate {i + 1}: {passed}/{total} tests passed")
 
    # Return best solution if none passed all tests
    best = max(results, key=lambda r: r["passed"])
    return {
        "solution": best["code"],
        "candidates_tried": n_candidates,
        "best_score": f"{best['passed']}/{best['total']}",
        "results": results
    }
 
# Example
result = search_against_verifier(
    problem="""Write a function `longest_palindrome(s: str) -> str` that returns
the longest palindromic substring in s. If there are multiple with the same length,
return the first one found.""",
    test_cases=[
        {"call": "longest_palindrome('babad')", "expected": "'bab'"},
        {"call": "longest_palindrome('cbbd')", "expected": "'bb'"},
        {"call": "longest_palindrome('a')", "expected": "'a'"},
        {"call": "longest_palindrome('racecar')", "expected": "'racecar'"},
        {"call": "longest_palindrome('')", "expected": "''"},
    ],
    n_candidates=10
)
print(f"\nFound solution after {result['candidates_tried']} candidates")
print(result["solution"])

Why this is so powerful:

The pass@k metric shows the probability that at least one of k generated samples is correct. For many coding tasks:

| Metric | Pass Rate |
|--------|-----------|
| pass@1 (single attempt) | ~50% |
| pass@10 (best of 10) | ~85% |
| pass@100 (best of 100) | ~95% |

With a reliable verifier, you can dramatically boost accuracy just by generating more candidates.
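If you measure pass@k yourself, use the standard unbiased estimator introduced with the HumanEval benchmark rather than literally running k-sized batches: given n samples of which c passed, it computes the probability that at least one of k samples drawn without replacement is correct.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n samples of which c are
    correct, passes. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures: any draw of k must contain a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(100, 50, 1))             # 0.5
print(round(pass_at_k(100, 50, 10), 3))  # 0.999
```

Note how steeply pass@k climbs with k even at a 50% per-sample pass rate, which is exactly the effect the verifier-search loop above exploits.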


Part III: Training-Time Techniques

Training-time techniques change how the model is trained to improve its reasoning capability. These are what reasoning model creators (OpenAI, DeepSeek, etc.) do before you ever use the model.

Understanding these helps you:

  1. Know why reasoning models work the way they do
  2. Fine-tune your own reasoning models
  3. Make informed decisions about which models to use

SFT on Reasoning Data (STaR)

STaR (Self-Taught Reasoner) is a technique where a model learns to reason by training on its own successful reasoning traces.

The STaR loop:

Step 1: Give the model a question
Step 2: Ask it to generate a reasoning chain + answer
Step 3: Check if the answer is correct
Step 4: If correct → add this (question, reasoning, answer) to the training set
        If wrong → give the model the correct answer and ask it to generate
                     a reasoning chain that arrives at that answer (rationalization)
Step 5: Fine-tune the model on the collected correct reasoning traces
Step 6: Repeat from Step 1 with the improved model

# Pseudocode for STaR training loop
def star_training(model, questions, correct_answers, held_out_questions, num_iterations=5):
    for iteration in range(num_iterations):
        training_data = []
 
        for question, correct_answer in zip(questions, correct_answers):
            # Generate reasoning + answer
            reasoning, predicted_answer = model.generate_with_reasoning(question)
 
            if predicted_answer == correct_answer:
                # Direct: model got it right naturally
                training_data.append({
                    "question": question,
                    "reasoning": reasoning,
                    "answer": correct_answer,
                    "type": "direct"
                })
            else:
                # Rationalization: hint the correct answer and get reasoning
                hint_reasoning, _ = model.generate_with_reasoning(
                    question,
                    hint=f"The correct answer is {correct_answer}. Show your reasoning."
                )
                training_data.append({
                    "question": question,
                    "reasoning": hint_reasoning,
                    "answer": correct_answer,
                    "type": "rationalized"
                })
 
        # Fine-tune model on collected reasoning traces
        model = fine_tune(model, training_data)
        accuracy = evaluate(model, held_out_questions)
        print(f"Iteration {iteration + 1}: accuracy = {accuracy:.2%}")
 
    return model

Why STaR works:

| Aspect | Explanation |
|--------|-------------|
| Bootstrapping | The model starts with weak reasoning but gets training data from its own successes |
| Rationalization | When the model gets the wrong answer, you give it the answer and ask for reasoning; this creates training data even for hard problems |
| Self-improvement | Each iteration produces a better model, which generates better reasoning traces for the next iteration |

Reinforcement Learning with a Verifier

This is the technique behind o1 and DeepSeek-R1. Instead of supervised fine-tuning on correct examples, you use reinforcement learning where the reward comes from a verifier.

Traditional SFT:
  "Here are correct reasoning traces. Learn to produce text like this."

RL with Verifier:
  "Here's a problem. Try to solve it. I'll tell you if you got the right answer.
   Figure out how to reason in a way that produces correct answers."

How it works:

                    ┌─────────────────────────────────────┐
                    │                                     │
                    ▼                                     │
  Problem → [LLM Policy] → Reasoning + Answer → [Verifier] → Reward
                    ▲                                     │
                    │         ┌───────────────────────────┘
                    │         │
                    └─────────┘
                   Update policy to maximize reward

The training loop:

# Pseudocode for RL-based reasoning training
def train_reasoning_with_rl(policy_model, problems, verifier):
    """
    policy_model: The LLM we're training
    problems: Math/code problems with known correct answers
    verifier: Can check if an answer is correct (test runner, math checker)
    """
    for batch in sample_batches(problems):
        for problem in batch:
            # Generate multiple reasoning traces (exploration)
            K = 8  # number of sampled traces per problem
            traces = []
            for _ in range(K):
                reasoning, answer = policy_model.generate(problem, temperature=0.8)
                is_correct = verifier.check(problem, answer)
                reward = 1.0 if is_correct else 0.0
                traces.append((reasoning, answer, reward))
 
            # Compute advantage: how much better was each trace than average?
            avg_reward = mean([t[2] for t in traces])
            advantages = [(t[0], t[1], t[2] - avg_reward) for t in traces]
 
            # Update policy: increase probability of high-reward traces,
            # decrease probability of low-reward traces
            policy_model.update(advantages)  # e.g., GRPO, PPO

Key components:

| Component | Role | Example |
|---|---|---|
| Policy | The LLM being trained to reason | DeepSeek-V3 base model |
| Verifier | Checks answer correctness | Python test runner for code, symbolic math checker |
| Reward | Signal that guides learning | +1 for correct, 0 for incorrect |
| Exploration | Generating diverse reasoning traces | High temperature sampling |
| Policy update | Adjusting model weights based on rewards | GRPO, PPO, REINFORCE |

Why RL produces better reasoning than SFT:

SFT tells the model "reason like this." RL tells the model "find any reasoning strategy that produces correct answers." The model discovers its own reasoning patterns — which can be more diverse and robust than any human-written examples.
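The group-relative advantage computation in the pseudocode above is worth seeing with concrete numbers (the rewards here are hypothetical):

```python
# Group-relative advantages: each trace is scored against the group mean,
# so the update pushes probability toward traces that beat the model's
# own average on this problem.
def group_advantages(rewards):
    mean_r = sum(rewards) / len(rewards)
    return [r - mean_r for r in rewards]

# Four sampled traces for one problem: two correct (+1), two incorrect (0)
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # [0.5, -0.5, -0.5, 0.5]
```

Note that if every trace in the group fails (or every trace succeeds), all advantages are zero and the problem contributes no gradient signal; this is why difficulty-matched problem sets matter for RL training.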

Reward Modeling (ORM and PRM)

A verifier that only checks the final answer is limited. Reward models evaluate the quality of reasoning at a more granular level.

Outcome Reward Model (ORM):

Evaluates the entire reasoning trace as a whole. "Given this complete reasoning chain, how likely is the final answer to be correct?"

Reasoning chain → [ORM] → Score: 0.87
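At inference time, an ORM is typically used for best-of-N selection: generate several complete chains, score each as a whole, keep the highest. A minimal sketch, with hard-coded scores standing in for a trained ORM:

```python
# Best-of-N with an ORM: score each complete chain, return the one the
# reward model trusts most. The scores here are hypothetical stand-ins
# for a trained outcome reward model.
def best_of_n(chains, orm_score):
    return max(chains, key=orm_score)

scores = {"chain_a": 0.42, "chain_b": 0.87, "chain_c": 0.31}
best = best_of_n(list(scores), scores.get)
print(best)  # chain_b
```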

Process Reward Model (PRM):

Evaluates each step of the reasoning. "Is this particular step correct and useful?"

Step 1: "17 * 24 = 17 * 20 + 17 * 4"  → [PRM] → Score: 0.95 (correct decomposition)
Step 2: "17 * 20 = 340"                → [PRM] → Score: 0.98 (correct)
Step 3: "17 * 4 = 72"                  → [PRM] → Score: 0.15 (WRONG! 17*4=68)
Step 4: "340 + 72 = 412"               → [PRM] → Score: 0.90 (arithmetic is right, but input is wrong)

PRM can catch the error at step 3 and guide the model to fix it before it propagates.
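A simple way to combine per-step PRM scores into a chain-level score is to multiply them, so a single bad step drags down the whole chain. Using the (hypothetical) scores above:

```python
# A chain is only as strong as its weakest steps: multiplying per-step
# PRM scores makes one bad step (0.15) tank the whole chain's score.
def chain_score(step_scores):
    score = 1.0
    for s in step_scores:
        score *= s
    return score

good_chain = chain_score([0.95, 0.98, 0.95, 0.90])   # ~0.80
buggy_chain = chain_score([0.95, 0.98, 0.15, 0.90])  # ~0.13
print(buggy_chain < 0.2 < good_chain)  # True
```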

Comparison:

| Aspect | ORM | PRM |
|---|---|---|
| Granularity | Entire chain | Step by step |
| Training data | Easier (just need final answer correctness) | Harder (need per-step annotations) |
| Error detection | Can only say "the chain is probably wrong" | Can pinpoint exactly which step is wrong |
| Use at inference | Rank complete solutions | Guide search: prune bad branches early |
| Compute | One evaluation per chain | One evaluation per step |

Using a PRM for guided search:

# Pseudocode: Use PRM to guide step-by-step generation
def prm_guided_generation(problem, model, prm, n_candidates=5,
                          beam_width=3, max_steps=10):
    """
    At each step, generate multiple continuations, score them with PRM,
    and keep only the best ones (beam search with PRM scoring).
    """
    beams = [{"steps": [], "score": 1.0}]
 
    for step_num in range(max_steps):
        all_candidates = []
 
        for beam in beams:
            # Generate N possible next steps
            next_steps = model.generate_next_steps(
                problem, beam["steps"], n=n_candidates
            )
 
            for step in next_steps:
                # Score this step with PRM
                step_score = prm.score_step(problem, beam["steps"] + [step])
                all_candidates.append({
                    "steps": beam["steps"] + [step],
                    "score": beam["score"] * step_score
                })
 
        # Keep top beams
        all_candidates.sort(key=lambda x: x["score"], reverse=True)
        beams = all_candidates[:beam_width]
 
        # Check if any beam has reached a final answer
        for beam in beams:
            if is_final_answer(beam["steps"][-1]):
                return beam
 
    return beams[0]  # Return best beam

Self-Refinement

Self-refinement is a training-time approach where the model is trained to improve its own outputs iteratively. Unlike sequential revision (which is inference-time), self-refinement bakes this capability into the model's weights.

Training process:

1. Generate initial response to a question
2. Generate a critique of that response
3. Generate a revised response
4. If the revised response is better (checked by a verifier or human),
   train the model on the full (response → critique → revision) trajectory
5. Repeat until the model naturally produces high-quality self-critiques

The goal: A model that, when it generates a wrong answer, can reliably identify what's wrong and fix it — without external prompting.

# Pseudocode: Self-refinement training data generation
def generate_refinement_training_data(model, problems, verifier):
    training_examples = []
 
    for problem in problems:
        # Initial attempt
        initial_response = model.generate(problem)
        initial_correct = verifier.check(problem, initial_response)
 
        # Self-critique
        critique = model.generate(
            f"Critique this solution:\n{problem}\n{initial_response}"
        )
 
        # Revision
        revised_response = model.generate(
            f"Revise based on critique:\n{problem}\n{initial_response}\nCritique: {critique}"
        )
        revised_correct = verifier.check(problem, revised_response)
 
        # Only keep examples where refinement actually improved the answer
        if not initial_correct and revised_correct:
            training_examples.append({
                "problem": problem,
                "initial": initial_response,
                "critique": critique,
                "revision": revised_response,
                "label": "improvement"
            })
 
    return training_examples

Internalizing Search (Meta-CoT)

The latest frontier in reasoning research: instead of running explicit search algorithms at inference time (like Tree of Thoughts or beam search), train the model to internalize the search process.

The idea: When you use Tree of Thoughts, you're running an external algorithm that makes multiple LLM calls. But what if the model could do all that exploration, evaluation, and backtracking in a single forward pass? That's internalizing search.

How Meta-CoT works:

External search (Tree of Thoughts):
  LLM call 1: Generate branch A
  LLM call 2: Generate branch B
  LLM call 3: Generate branch C
  LLM call 4: Evaluate branches
  LLM call 5: Expand best branch
  ... (many LLM calls)

Internalized search (Meta-CoT):
  Single generation:
  "Let me consider approach A... [explores A]... this leads to a contradiction.
   Let me try approach B... [explores B]... this seems promising but hits a wall at step 3.
   Combining ideas from A and B... [hybrid approach]... yes, this works.
   Final answer: ..."

Training Meta-CoT:

Step 1: Collect search traces
  Run Tree of Thoughts / MCTS on hard problems
  Record the full search trace: all branches explored, evaluations, backtracking

Step 2: Linearize the search trace
  Convert the tree structure into a linear text sequence:
  "Exploring approach A → evaluating (score 0.3, unpromising) → backtracking →
   Exploring approach B → evaluating (score 0.8, promising) → deepening →
   B step 2 → evaluating (score 0.9) → final answer: ..."

Step 3: Train the model on these linearized search traces
  The model learns to generate text that mimics the search process

Step 4: At inference, the model generates its own internal search
  It naturally explores, evaluates, backtracks, and converges — all in one generation
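Step 2 is the interesting part; here is a toy sketch of linearizing a search tree into one training string (the tree structure and score threshold are invented for illustration):

```python
# Linearize a search tree into a single training string: walk the tree
# depth-first, emitting each node's thought and evaluation, with explicit
# "Backtracking"/"Deepening" markers so the model can learn when a branch
# should be abandoned versus pursued.
def linearize(node):
    parts = [f"Exploring: {node['thought']} (score {node['score']})"]
    for child in node.get("children", []):
        parts.append(linearize(child))
        parts.append("Backtracking" if child["score"] < 0.5 else "Deepening")
    return " -> ".join(parts)

tree = {
    "thought": "root", "score": 1.0,
    "children": [
        {"thought": "approach A", "score": 0.3},  # dead end
        {"thought": "approach B", "score": 0.8},  # promising
    ],
}
print(linearize(tree))
```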

Why this matters:

| Approach | LLM Calls | Latency | Quality |
|---|---|---|---|
| Standard CoT | 1 | Low | Good |
| Tree of Thoughts | 10-50 | Very High | Very Good |
| Meta-CoT | 1 (but longer output) | Medium | Very Good |

Meta-CoT gets the quality benefits of search with the efficiency of a single generation.


Part IV: Build a "Deep Research" Agent

Now let's combine everything. We'll build a Deep Research agent that uses structured reasoning, web search, and iterative refinement to produce comprehensive research reports on complex topics.

This is different from the Part 3 "Ask-the-Web" agent in a key way: it reasons about what it knows and doesn't know, plans its research strategy, evaluates evidence quality, and revises its conclusions.

Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                      Deep Research Agent                              │
│                                                                      │
│  User Question                                                       │
│       ↓                                                              │
│  ┌─────────────────────────────────────────────────────────────┐     │
│  │  Phase 1: Question Analysis                                  │     │
│  │  - Decompose into sub-questions                             │     │
│  │  - Identify what types of sources are needed                │     │
│  │  - Create a research plan                                   │     │
│  └──────────────────────┬──────────────────────────────────────┘     │
│                         ↓                                            │
│  ┌─────────────────────────────────────────────────────────────┐     │
│  │  Phase 2: Research Loop (per sub-question)                  │     │
│  │  - Search the web with targeted queries                     │     │
│  │  - Read and extract key findings from pages                 │     │
│  │  - Evaluate source credibility                              │     │
│  │  - Note contradictions and gaps                             │     │
│  └──────────────────────┬──────────────────────────────────────┘     │
│                         ↓                                            │
│  ┌─────────────────────────────────────────────────────────────┐     │
│  │  Phase 3: Synthesis with Reasoning                          │     │
│  │  - Extended thinking to reason over all evidence            │     │
│  │  - Resolve contradictions                                   │     │
│  │  - Identify confidence levels                               │     │
│  └──────────────────────┬──────────────────────────────────────┘     │
│                         ↓                                            │
│  ┌─────────────────────────────────────────────────────────────┐     │
│  │  Phase 4: Self-Critique and Revision                        │     │
│  │  - Review the draft for gaps and errors                     │     │
│  │  - Do follow-up searches if needed                          │     │
│  │  - Produce final report with citations                      │     │
│  └─────────────────────────────────────────────────────────────┘     │
│                         ↓                                            │
│  Final Report with citations, confidence levels, and source list     │
└──────────────────────────────────────────────────────────────────────┘

Implementation

import anthropic
import json
 
client = anthropic.Anthropic()
 
# --- Tool definitions ---
tools = [
    {
        "name": "search_web",
        "description": "Search the web for current information. Use specific, targeted queries.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query"
                }
            },
            "required": ["query"]
        }
    },
    {
        "name": "fetch_page",
        "description": "Fetch and read the full content of a web page.",
        "input_schema": {
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "description": "The URL to fetch"
                }
            },
            "required": ["url"]
        }
    }
]
 
 
def execute_tool(name: str, args: dict) -> str:
    """Execute a tool call. Replace with real implementations."""
    if name == "search_web":
        return search_web(args["query"])  # Your search implementation
    elif name == "fetch_page":
        return fetch_page(args["url"])    # Your fetch implementation
    return f"Unknown tool: {name}"
 
 
def run_agent_loop(system_prompt: str, user_message: str, max_steps: int = 15) -> str:
    """Run a ReACT-style agent loop with tool calling."""
    messages = [{"role": "user", "content": user_message}]
 
    for step in range(max_steps):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            system=system_prompt,
            tools=tools,
            messages=messages,
        )
 
        messages.append({"role": "assistant", "content": response.content})
 
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })
            messages.append({"role": "user", "content": tool_results})
        else:
            # Extract text from response
            return "".join(
                block.text for block in response.content if hasattr(block, "text")
            )
 
    return "Agent reached maximum steps."
 
 
# --- Phase 1: Question Analysis ---
def analyze_question(question: str) -> dict:
    """Decompose the question into sub-questions and create a research plan."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=8000,  # must exceed the thinking budget_tokens below
        temperature=1,
        thinking={"type": "enabled", "budget_tokens": 4000},
        messages=[{
            "role": "user",
            "content": f"""Analyze this research question and create a research plan.
 
Question: {question}
 
Return a JSON object with:
1. "sub_questions": Array of 3-6 specific sub-questions to investigate
2. "source_types": What types of sources would be most valuable
   (academic papers, news articles, documentation, expert blogs, etc.)
3. "search_queries": Array of 5-8 specific search queries to run
4. "known_context": What you already know about this topic (brief)
5. "key_uncertainties": What you're most uncertain about
 
Return ONLY the JSON object."""
        }]
    )
 
    # Extract text content (skip thinking blocks)
    text = ""
    for block in response.content:
        if hasattr(block, "text"):
            text = block.text
            break
 
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return {
            "sub_questions": [question],
            "search_queries": [question],
            "source_types": ["general"],
            "known_context": "",
            "key_uncertainties": ["Unable to parse analysis"]
        }
 
 
# --- Phase 2: Research Loop ---
def research_sub_question(sub_question: str, search_queries: list) -> str:
    """Research a specific sub-question using web search."""
    system_prompt = f"""You are a research assistant investigating this specific question:
"{sub_question}"
 
Your goal:
1. Search the web using the provided queries (and create new ones if needed)
2. Read the most relevant pages
3. Extract key findings, noting the source for each fact
4. Note any contradictions between sources
5. When you have enough information, provide a structured summary
 
Format your final output as:
## Key Findings
[Numbered list of findings with source attribution]
 
## Source Quality Assessment
[Brief assessment of how reliable your sources are]
 
## Contradictions or Uncertainties
[Any conflicting information or gaps]
 
## Sources
[Numbered list of URLs used]"""
 
    query_text = "\n".join(f"- {q}" for q in search_queries[:3])
    user_message = f"""Research this question: {sub_question}
 
Start with these search queries:
{query_text}
 
Search, read relevant pages, and provide a comprehensive summary."""
 
    return run_agent_loop(system_prompt, user_message, max_steps=12)
 
 
# --- Phase 3: Synthesis with Reasoning ---
def synthesize_research(question: str, research_findings: list) -> str:
    """Use extended thinking to synthesize all research into a coherent analysis."""
    findings_text = "\n\n---\n\n".join(
        f"### Sub-question {i+1}\n{finding}"
        for i, finding in enumerate(research_findings)
    )
 
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=16000,  # must exceed the thinking budget_tokens below
        temperature=1,
        thinking={"type": "enabled", "budget_tokens": 8000},
        messages=[{
            "role": "user",
            "content": f"""You are writing a comprehensive research report.
 
Original question: {question}
 
Here are the research findings from investigating different aspects of this question:
 
{findings_text}
 
Write a comprehensive, well-structured research report that:
1. Synthesizes findings from all sub-questions into a coherent narrative
2. Resolves contradictions (explain which sources are more credible and why)
3. Clearly states confidence levels (high/medium/low) for each major claim
4. Includes inline citations [1], [2], etc.
5. Has a "Limitations and Gaps" section noting what you couldn't find or verify
6. Ends with a consolidated sources list
 
The report should be thorough but readable — like a research briefing for a smart
non-expert who needs to make decisions based on this information."""
        }]
    )
 
    # Extract text (skip thinking)
    for block in response.content:
        if hasattr(block, "text"):
            return block.text
    return "Synthesis failed."
 
 
# --- Phase 4: Self-Critique and Revision ---
def critique_and_revise(question: str, report: str) -> str:
    """Review the report for gaps and errors, then revise."""
    # Step 1: Critique
    critique_response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=8000,  # must exceed the thinking budget_tokens below
        temperature=1,
        thinking={"type": "enabled", "budget_tokens": 4000},
        messages=[{
            "role": "user",
            "content": f"""Critically review this research report. Be thorough and harsh.
 
Original question: {question}
 
Report:
{report}
 
Evaluate:
1. Are there logical errors or unsupported claims?
2. Are important perspectives or counterarguments missing?
3. Are the confidence levels appropriate?
4. Is any information likely outdated or incorrect?
5. Are there follow-up questions that should have been investigated?
 
If the report is excellent, respond with "REPORT APPROVED".
Otherwise, list specific issues that need to be fixed."""
        }]
    )
 
    critique_text = ""
    for block in critique_response.content:
        if hasattr(block, "text"):
            critique_text = block.text
            break
 
    if "REPORT APPROVED" in critique_text.upper():
        return report
 
    # Step 2: Revise based on critique
    revision_response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=8000,
        temperature=1,
        thinking={"type": "enabled", "budget_tokens": 4000},
        messages=[{
            "role": "user",
            "content": f"""Revise this research report based on the critique below.
Fix all identified issues. If the critique mentions missing information that requires
additional research, note it in the Limitations section rather than making things up.
 
Original question: {question}
 
Report:
{report}
 
Critique:
{critique_text}
 
Provide the complete revised report:"""
        }]
    )
 
    for block in revision_response.content:
        if hasattr(block, "text"):
            return block.text
    return report
 
 
# --- Main: Deep Research Pipeline ---
def deep_research(question: str) -> str:
    """
    Run the full deep research pipeline:
    1. Analyze the question and plan research
    2. Research each sub-question with web search
    3. Synthesize findings with extended reasoning
    4. Self-critique and revise
    """
    print(f"Deep Research: {question}\n")
 
    # Phase 1: Analyze
    print("Phase 1: Analyzing question...")
    plan = analyze_question(question)
    print(f"  Sub-questions: {len(plan['sub_questions'])}")
    print(f"  Search queries: {len(plan['search_queries'])}")
 
    # Phase 2: Research each sub-question
    print("\nPhase 2: Researching...")
    findings = []
    for i, sub_q in enumerate(plan["sub_questions"]):
        print(f"  Researching sub-question {i+1}: {sub_q[:80]}...")
        # Pick relevant search queries for this sub-question
        relevant_queries = plan["search_queries"][
            i * 2 : (i + 1) * 2
        ] or [sub_q]
        finding = research_sub_question(sub_q, relevant_queries)
        findings.append(finding)
 
    # Phase 3: Synthesize
    print("\nPhase 3: Synthesizing research...")
    report = synthesize_research(question, findings)
 
    # Phase 4: Critique and revise
    print("\nPhase 4: Self-critique and revision...")
    final_report = critique_and_revise(question, report)
 
    print("\nDone!")
    return final_report
 
 
# --- Run it ---
if __name__ == "__main__":
    report = deep_research(
        "What are the current best practices for building reasoning-capable AI agents, "
        "and how do inference-time scaling techniques compare to training-time approaches "
        "in terms of cost, quality, and practical applicability?"
    )
    print("\n" + "=" * 80)
    print(report)

Adding Self-Consistency to the Research Pipeline

For critical research tasks, you can add self-consistency by running the entire pipeline multiple times and comparing the results:

def deep_research_with_consistency(question: str, n_runs: int = 3) -> str:
    """
    Run deep research multiple times and synthesize the most consistent findings.
    """
    reports = []
    for i in range(n_runs):
        print(f"\n{'='*40} Run {i+1}/{n_runs} {'='*40}")
        report = deep_research(question)
        reports.append(report)
 
    # Synthesize across runs
    reports_text = "\n\n---\n\n".join(
        f"### Report {i+1}\n{report}" for i, report in enumerate(reports)
    )
 
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=8000,
        temperature=1,
        thinking={"type": "enabled", "budget_tokens": 8000},
        messages=[{
            "role": "user",
            "content": f"""You ran a deep research pipeline {n_runs} times on the same question.
Each run independently searched the web, analyzed sources, and produced a report.
 
Question: {question}
 
Here are all {n_runs} reports:
 
{reports_text}
 
Synthesize these into a single, definitive report:
1. Claims that appear in all/most reports are HIGH CONFIDENCE
2. Claims that appear in only one report need verification — mark as LOW CONFIDENCE
3. Contradictions between reports should be explicitly noted
4. Combine the best citations from all reports
 
Produce the final consolidated research report:"""
        }]
    )
 
    for block in response.content:
        if hasattr(block, "text"):
            return block.text
    return reports[0]
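If you extract claims from each report into comparable strings first (a simplifying assumption; real claim matching is fuzzier and usually needs an LLM judge), the frequency-based confidence heuristic in that prompt can be sketched directly:

```python
from collections import Counter

# Tally how many independent runs surfaced each claim: claims seen in a
# majority of runs get HIGH confidence, the rest get LOW.
def confidence_labels(claims_per_run):
    n_runs = len(claims_per_run)
    counts = Counter(c for run in claims_per_run for c in set(run))
    return {
        claim: "HIGH" if count > n_runs / 2 else "LOW"
        for claim, count in counts.items()
    }

runs = [
    ["X reduces latency", "Y is deprecated"],
    ["X reduces latency"],
    ["X reduces latency", "Z is experimental"],
]
print(confidence_labels(runs)["X reduces latency"])  # HIGH
```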

Key Design Decisions

| Decision | Our Choice | Why |
|---|---|---|
| Reasoning model | Claude with extended thinking | Visible reasoning, budget control, tool-use compatible |
| Search strategy | Plan-based (decompose first, then search) | Better coverage than ad-hoc searching |
| Verification | Self-critique + optional multi-run consistency | Catches errors without external verifier |
| Depth control | Configurable sub-questions and search per sub-question | Balance thoroughness vs cost |
| Citation style | Inline citations with source list | Traceable, verifiable claims |

Part V: Putting It All Together — When to Use What

Inference-Time Technique Decision Tree

Is the task simple (single-step, factual)?
  → YES: Standard LLM call. No special technique needed.
  → NO: Continue...

Does the task have a verifiable answer (code, math, constraints)?
  → YES: Search against a verifier. Generate N candidates, verify, pick best.
  → NO: Continue...

Is there one clear approach, or multiple possible approaches?
  → ONE APPROACH: Chain-of-Thought prompting.
  → MULTIPLE: Continue...

Do you need to explore different strategies?
  → YES: Tree of Thoughts (or Meta-CoT if available).
  → NO: Continue...

Is the answer open-ended (no single correct answer)?
  → YES: Sequential revision (generate → critique → improve).
  → NO: Self-consistency (sample multiple chains, majority vote).
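The same decision tree can be encoded as a function; the boolean parameter names are just illustrative:

```python
# The inference-time technique decision tree as code: each branch mirrors
# one question in the tree above, checked in the same order.
def choose_technique(simple, verifiable, multiple_approaches,
                     needs_exploration, open_ended):
    if simple:
        return "standard call"
    if verifiable:
        return "search against a verifier"
    if not multiple_approaches:
        return "chain-of-thought"
    if needs_exploration:
        return "tree of thoughts"
    return "sequential revision" if open_ended else "self-consistency"

# A hard planning problem: not simple, not verifiable, many approaches,
# needs exploration
print(choose_technique(False, False, True, True, False))  # tree of thoughts
```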

Technique Comparison Summary

| Technique | LLM Calls | Best For | Quality Boost | Cost |
|---|---|---|---|---|
| Standard | 1 | Simple tasks | Baseline | $ |
| CoT prompting | 1 | Math, logic, multi-step | +15-30% on reasoning tasks | $ |
| Self-consistency | N (5-10) | Tasks with verifiable answers | +10-20% over single CoT | $$$ |
| Sequential revision | 2-6 | Open-ended analysis, writing | Highly variable | $$ |
| Tree of Thoughts | 10-50+ | Complex strategy/planning | +20-40% on hard problems | $$$$ |
| Search + verifier | N (10-100) | Code, math, constraint satisfaction | +30-50% (pass@k) | $$$$ |
| Extended thinking | 1 (more tokens) | Deep analysis, complex reasoning | +20-40% | $$ |
| Deep research pipeline | Many | Multi-faceted research questions | Comprehensive coverage | $$$$$ |

Training-Time Technique Summary

| Technique | What It Does | Key Requirement | Who Uses It |
|---|---|---|---|
| SFT on reasoning | Train on correct reasoning examples | High-quality reasoning traces | Fine-tuners, researchers |
| STaR | Self-generate and filter reasoning data | A way to verify answers | Researchers |
| RL with verifier | Learn reasoning through trial and error | Reliable verifier + RL infrastructure | OpenAI (o1), DeepSeek (R1) |
| ORM | Score complete reasoning chains | Labeled chain-level data | Used in best-of-N selection |
| PRM | Score individual reasoning steps | Step-level annotations | Used in guided search, MCTS |
| Self-refinement | Learn to critique and improve own output | Verifier to assess improvement | Researchers |
| Meta-CoT | Internalize search into generation | Search traces for training | Cutting-edge research |

What You Should Know After Reading This

  1. What is inference-time scaling and why does it matter?
  2. How do reasoning models (o1, DeepSeek-R1, Claude extended thinking) differ from standard LLMs?
  3. What is Chain-of-Thought prompting and when does it help?
  4. How does self-consistency improve accuracy through majority voting?
  5. What is Tree of Thoughts and how does it extend CoT to tree search?
  6. How does "search against a verifier" work for code and math problems?
  7. What is STaR and how does it bootstrap reasoning capability?
  8. How does RL with a verifier train reasoning models like o1 and R1?
  9. What is the difference between ORM and PRM reward models?
  10. What is Meta-CoT and why is internalizing search important?
  11. How would you design a deep research agent that combines search, reasoning, and self-critique?

Further Reading

  • "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (Wei et al., 2022) — The paper that started it all
  • "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (Wang et al., 2022) — Majority voting over CoT chains
  • "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (Yao et al., 2023) — Extending CoT to tree search
  • "STaR: Bootstrapping Reasoning With Reasoning" (Zelikman et al., 2022) — Self-taught reasoning
  • "Let's Verify Step by Step" (Lightman et al., 2023) — Process reward models for math
  • "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (DeepSeek, 2025) — Open-weight reasoning model
  • "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters" (Snell et al., 2024) — The case for inference-time scaling
  • "Meta Chain-of-Thought: Learning to Think in One Generation" (Meta, 2025) — Internalizing search
  • "Reflexion: Language Agents with Verbal Reinforcement Learning" (Shinn et al., 2023)
  • "Training Verifiers to Solve Math Word Problems" (Cobbe et al., 2021) — ORM for math

Next in the Series

Part 5: Multi-modal Generation Agent — We cover the full landscape of visual generation — VAEs, GANs, auto-regressive models, and diffusion models. Then we go deep on text-to-image (data preparation, U-Net vs DiT architectures, diffusion training, sampling, and evaluation) and text-to-video (3D VAE compression, video DiT with factored attention, large-scale training challenges), and build a multi-modal generation agent.

Stay tuned.
