Build an LLM Playground — Part 4: Build "Deep Research" with Web Search and Reasoning Models
The fourth entry in the learn-by-doing AI engineer series. We cover reasoning and thinking LLMs, inference-time scaling techniques (CoT, self-consistency, Tree of Thoughts), training-time techniques (STaR, RLHF with verifiers, reward modeling), and build a deep research agent that combines web search with multi-step reasoning.
Series: The AI Engineer Learning Path
This is Part 4 of a hands-on series designed to take you from zero to working AI engineer. Every post follows a learn-by-doing philosophy — we explain the theory, then you build something real.
| Part | Topic | Status |
|---|---|---|
| 1 | Build an LLM Playground | Complete |
| 2 | Customer Support Chatbot with RAG & Prompt Engineering | Complete |
| 3 | "Ask-the-Web" Agent with Tool Calling | Complete |
| 4 | Deep Research with Reasoning Models (this post) | Current |
| 5 | Multi-modal Generation Agent | Available |
In Part 3, we built an agent that can search the web and synthesize answers. But that agent treats reasoning as a single forward pass — it thinks once and answers. Real research requires deliberate, multi-step reasoning: forming hypotheses, searching for evidence, evaluating conflicting information, revising conclusions, and knowing when you're confident enough to stop.
This post is about the science and engineering of making LLMs think harder. We'll cover reasoning models, inference-time scaling, training-time techniques, and then build a "Deep Research" agent that combines web search with structured multi-step reasoning.
Why Reasoning Matters
Standard LLMs are pattern matchers. They produce the most likely next token given the context. This works remarkably well for most tasks, but it fails on problems that require:
- Multi-step logical deduction — "If A implies B and B implies C, does A imply C?"
- Planning under constraints — "How do I schedule 5 tasks with dependencies in the shortest time?"
- Self-correction — "Wait, that calculation was wrong. Let me redo it."
- Deliberate exploration — "There are three possible approaches. Let me evaluate each before committing."
The core insight of reasoning models is that you can trade compute at inference time for better answers. Instead of generating one answer in one pass, you let the model think longer, explore multiple paths, and verify its own work.
This is called inference-time compute scaling — and it's one of the most important ideas in modern AI.
Standard LLM:
Question → [Single forward pass] → Answer
Fast, cheap, often wrong on hard problems
Reasoning LLM:
Question → [Think step 1] → [Think step 2] → ... → [Think step N] → [Verify] → Answer
Slower, more expensive, dramatically better on hard problems
Part I: Reasoning and Thinking LLMs
What Are Reasoning Models?
Reasoning models are LLMs that have been specifically trained or prompted to "think before answering." Instead of producing a final answer immediately, they generate an extended chain of reasoning — often called a "thinking trace" or "internal monologue" — before arriving at a conclusion.
The key distinction:
| Aspect | Standard LLM | Reasoning LLM |
|---|---|---|
| Output | Direct answer | Extended thinking + answer |
| Compute | Fixed (one forward pass per token) | Variable (more thinking for harder problems) |
| Error handling | Errors compound silently | Can catch and correct its own mistakes |
| Transparency | Black box — no insight into reasoning | Visible chain of thought you can inspect |
| Cost | Lower tokens per query | Higher tokens per query, but better accuracy |
Overview of Reasoning Models
OpenAI's "o" Family
OpenAI's o1 was the first major commercial reasoning model, followed by o1-mini, o3, and o3-mini. These models generate a hidden "chain of thought" before producing the final answer.
How they work:
User: "How many r's are in the word strawberry?"
Standard GPT-4:
→ "There are 2 r's in strawberry." (wrong)
o1:
→ [Internal reasoning]:
"Let me spell it out: s-t-r-a-w-b-e-r-r-y"
"Now let me count each 'r': position 3 is 'r', position 9 is 'r', position 10 is 'r'"
"Wait, let me recount: s(1) t(2) r(3) a(4) w(5) b(6) e(7) r(8) r(9) y(10)"
"Positions 3, 8, 9 contain 'r'"
"That's 3 r's"
→ "There are 3 r's in strawberry." (correct)
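The count the reasoning trace arrives at is easy to verify outside the model:

```python
# Verify the conclusion of the reasoning trace above
word = "strawberry"
print(word.count("r"))  # → 3
print([i + 1 for i, ch in enumerate(word) if ch == "r"])  # → [3, 8, 9]
```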
Key characteristics:
| Feature | Detail |
|---|---|
| Hidden reasoning | The thinking trace is not shown to the user (only a summary) |
| Adaptive compute | Harder problems get more thinking tokens automatically |
| Trained with RL | Uses reinforcement learning to improve reasoning quality |
| Cost structure | You pay for both thinking tokens and output tokens |
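Since thinking tokens are billed like output tokens, a rough per-query cost estimate is straightforward. This is a minimal sketch; the prices below are illustrative placeholders, not any provider's actual rates:

```python
def reasoning_cost_usd(input_tokens: int, thinking_tokens: int, output_tokens: int,
                       price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Estimate query cost when hidden thinking tokens are billed at the output rate."""
    billed_output = thinking_tokens + output_tokens  # thinking counts as output
    return (input_tokens * price_in_per_mtok + billed_output * price_out_per_mtok) / 1_000_000

# Placeholder prices: $15/MTok input, $60/MTok output
print(reasoning_cost_usd(500, 8000, 400, 15.0, 60.0))  # → 0.5115
```

Note how the 8,000 hidden thinking tokens dominate the cost — this is why adaptive compute matters.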
DeepSeek-R1
DeepSeek-R1 is an open-weight reasoning model that made waves by achieving reasoning performance competitive with o1 while being openly available.
What makes R1 interesting:
- Open weights — You can download, inspect, and fine-tune the model
- Transparent reasoning — The full thinking trace is visible (unlike o1's hidden reasoning)
- Trained with RL — Uses Group Relative Policy Optimization (GRPO) to learn reasoning
- Emergent behaviors — The model spontaneously learned to self-correct, explore alternatives, and verify its work during RL training — these behaviors were not explicitly programmed
R1's training pipeline:
Step 1: Cold-start SFT
Start with DeepSeek-V3 base model
Fine-tune on a small set of high-quality reasoning examples
This gives the model the "format" of reasoning
Step 2: Reasoning-focused RL
Train with GRPO on math and coding tasks
Reward = correctness of final answer (verified by running code or checking math)
The model learns that longer, more careful reasoning leads to correct answers
Step 3: Rejection sampling + SFT
Generate many reasoning traces, keep only the ones that led to correct answers
Fine-tune on this curated dataset
Step 4: Final RL round
Another round of RL to polish reasoning quality and helpfulness
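Step 3 (rejection sampling) is simple enough to sketch. `generate` and `verify` below are hypothetical stand-ins for the model sampler and the answer checker:

```python
import random

def rejection_sample_traces(problems, generate, verify, k=4):
    """Keep only the (problem, trace, answer) tuples whose final answer verifies."""
    kept = []
    for problem, gold in problems:
        for _ in range(k):  # sample k reasoning traces per problem
            trace, answer = generate(problem)
            if verify(answer, gold):  # e.g. run the code, check the math
                kept.append({"problem": problem, "trace": trace, "answer": answer})
    return kept

# Toy demo: a stub "model" that answers correctly only some of the time
random.seed(0)
gen = lambda p: ("step-by-step trace...", random.choice([p * 2, p * 2 + 1]))
data = rejection_sample_traces([(3, 6), (5, 10)], gen, lambda a, g: a == g)
print(all(d["answer"] == d["problem"] * 2 for d in data))  # → True
```

The curated `data` is what Step 3 fine-tunes on: only traces that ended in a verified answer survive.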
Emergent reasoning behaviors in R1:
During RL training, R1 spontaneously developed several sophisticated reasoning patterns:
| Behavior | Example |
|---|---|
| Self-verification | "Let me double-check this calculation... 7 x 8 = 56, yes that's correct" |
| Backtracking | "Wait, this approach isn't working. Let me try a different method." |
| Decomposition | "This problem has three parts. Let me handle each one separately." |
| Reflection | "Hmm, my first answer seems too simple for this problem. Let me think more carefully." |
These behaviors emerged naturally from the reward signal — the model discovered that careful thinking leads to more correct answers, and more correct answers lead to higher rewards.
Claude's Extended Thinking
Anthropic's Claude models support "extended thinking" — a mode where the model generates a detailed reasoning trace before its final answer.
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=16000,
temperature=1, # required for extended thinking
thinking={
"type": "enabled",
"budget_tokens": 10000 # max tokens for thinking
},
messages=[{
"role": "user",
"content": "How many prime numbers are between 1 and 100?"
}]
)
# The response has two parts: thinking and text
for block in response.content:
if block.type == "thinking":
print("=== THINKING ===")
print(block.thinking)
elif block.type == "text":
print("=== ANSWER ===")
print(block.text)

Key features of extended thinking:
| Feature | Detail |
|---|---|
| Visible reasoning | You can inspect the full thinking trace |
| Budget control | Set budget_tokens to control how much thinking the model does |
| Streaming | Thinking tokens stream in real-time |
| Tool use compatible | Works with tool calling — the model can think between tool calls |
Comparing Reasoning Models
| Model | Reasoning Visible? | Open Weights? | Best At | Cost |
|---|---|---|---|---|
| o1/o3 | Summary only | No | Math, coding, science | High (hidden thinking tokens) |
| o3-mini/o4-mini | Summary only | No | Good balance of speed and reasoning | Medium |
| DeepSeek-R1 | Full trace | Yes | Math, coding, open-ended reasoning | Low (self-hosted) or medium (API) |
| Claude (extended thinking) | Full trace | No | Analysis, writing, coding, research | Medium-High (budget controllable) |
| Gemini 2.5 Pro | Summary | No | Long-context reasoning, multimodal | Medium-High |
Part II: Inference-Time Techniques
Inference-time techniques are methods you apply when using a model (not during training) to improve its reasoning. These work with any LLM — you don't need a special reasoning model.
The core idea: spend more compute at inference time to get better answers.
Inference-Time Scaling
The traditional way to make LLMs better is to train bigger models on more data (training-time scaling). But there's another dimension: inference-time scaling — letting the model use more compute per question.
Training-time scaling:
Better model = more parameters + more training data + more training compute
(Decided months before the model is used)
Inference-time scaling:
Better answer = more tokens of reasoning + multiple attempts + verification
(Decided at the moment you ask the question)
Why this matters: Training-time scaling has diminishing returns and enormous costs. Inference-time scaling lets you allocate compute where it matters — hard questions get more thinking, easy questions get answered quickly.
Answer Quality
▲
│ ┌─── Inference-time scaling
│ ╱ (more thinking per question)
│ ╱
│ ╱
│ ╱ ┌─── Training-time scaling
│ ╱ ╱ (bigger model)
│╱ ╱
│ ╱
│╱
└──────────────────────→ Compute
The exciting finding from recent research: for many tasks, spending 10x more compute at inference time (through better reasoning strategies) can match or exceed a model that's 10x bigger.
Chain-of-Thought (CoT) Prompting
Chain-of-Thought is the simplest and most widely used inference-time technique. You prompt the model to show its reasoning step-by-step before giving the final answer.
Zero-shot CoT — Just add "Let's think step by step":
# Without CoT
response = llm("What is 17 * 24?")
# Model might jump to an answer and get it wrong
# With CoT
response = llm("""What is 17 * 24?
Let's think step by step.""")
# Model breaks it down:
# "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408"

Few-shot CoT — Provide examples of step-by-step reasoning:
prompt = """
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. He bought 2 cans of 3 balls each.
That's 2 * 3 = 6 new balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and
bought 6 more, how many apples do they have?
A: The cafeteria started with 23 apples. They used 20, so they had
23 - 20 = 3. Then they bought 6 more, so 3 + 6 = 9. The answer is 9.
Q: {user_question}
A: Let's think step by step."""

Why CoT works:
| Reason | Explanation |
|---|---|
| Working memory | The model can offload intermediate results into the text, avoiding the need to hold everything "in its head" |
| Error visibility | When reasoning is explicit, errors become visible and the model can catch them |
| Decomposition | Complex problems are broken into simpler sub-problems |
| Trained distribution | During training, the model saw many examples of step-by-step reasoning (textbooks, tutorials, Stack Overflow) |
When CoT helps vs. doesn't:
| Helps | Doesn't Help |
|---|---|
| Multi-step math problems | Simple factual recall ("What's the capital of France?") |
| Logic puzzles | Tasks the model can already do well in one step |
| Code generation with complex requirements | Creative writing (reasoning doesn't improve creativity) |
| Any task requiring more than 2-3 mental steps | Tasks where the model lacks the underlying knowledge |
Implementing CoT with the API
import anthropic
client = anthropic.Anthropic()
def solve_with_cot(question: str) -> dict:
"""Solve a problem using Chain-of-Thought prompting."""
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=4096,
system="""You are a careful problem solver. For every question:
1. Break the problem into clear steps
2. Work through each step explicitly
3. Show all calculations
4. State your final answer clearly
Format your response as:
## Reasoning
[step-by-step work]
## Answer
[final answer]""",
messages=[{
"role": "user",
"content": question
}]
)
text = response.content[0].text
# Parse reasoning and answer
parts = text.split("## Answer")
reasoning = parts[0].replace("## Reasoning", "").strip() if len(parts) > 1 else text
answer = parts[1].strip() if len(parts) > 1 else text
return {
"reasoning": reasoning,
"answer": answer,
"tokens_used": response.usage.input_tokens + response.usage.output_tokens
}
# Example
result = solve_with_cot(
"A store sells apples for $1.50 each and oranges for $2.00 each. "
"If Sarah buys 3 apples and some oranges, and spends exactly $13.50, "
"how many oranges did she buy?"
)
print(result["reasoning"])
print(f"\nAnswer: {result['answer']}")

Self-Consistency
Self-consistency is a simple but powerful idea: sample multiple reasoning chains, then take the majority vote on the final answer.
Different reasoning paths may reach different conclusions. By sampling many paths and picking the most common answer, you filter out reasoning errors.
Question: "What is the probability of rolling at least one six in four dice rolls?"
Chain 1: "P(no six) = (5/6)^4 = 625/1296. P(at least one) = 1 - 625/1296 = 671/1296 ≈ 0.518"
Chain 2: "P(at least one) = 1 - P(none) = 1 - (5/6)^4 = 1 - 0.482 = 0.518"
Chain 3: "P(six on one die) = 1/6, four rolls... 4 * 1/6 = 4/6 ≈ 0.667" (wrong reasoning)
Chain 4: "1 - (5/6)^4 = 1 - 0.482 = 0.518"
Chain 5: "1 - (5/6)^4 ≈ 0.518"
Majority answer: 0.518 (4 out of 5 chains agree)
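The consensus answer can be checked exactly:

```python
from fractions import Fraction

p_no_six = Fraction(5, 6) ** 4          # probability of zero sixes in four rolls
p_at_least_one = 1 - p_no_six
print(p_at_least_one)                   # → 671/1296
print(round(float(p_at_least_one), 3))  # → 0.518
```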
Implementation:
import anthropic
from collections import Counter
client = anthropic.Anthropic()
def self_consistency(question: str, n_samples: int = 5, temperature: float = 0.7) -> dict:
"""
Generate multiple reasoning chains and take majority vote.
Higher temperature = more diverse reasoning paths.
"""
answers = []
chains = []
for i in range(n_samples):
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=2048,
temperature=temperature,
messages=[{
"role": "user",
"content": f"""{question}
Think step by step, then give your final answer on the last line
in the format: ANSWER: [your answer]"""
}]
)
text = response.content[0].text
chains.append(text)
# Extract answer from last line
for line in reversed(text.split("\n")):
if "ANSWER:" in line:
answer = line.split("ANSWER:")[-1].strip()
answers.append(answer)
break
# Majority vote
vote_counts = Counter(answers)
best_answer = vote_counts.most_common(1)[0] if vote_counts else ("No consensus", 0)
return {
"answer": best_answer[0],
"confidence": best_answer[1] / len(answers) if answers else 0,
"vote_distribution": dict(vote_counts),
"n_chains": n_samples,
"chains": chains
}
# Example
result = self_consistency(
"A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
"How much does the ball cost?",
n_samples=7
)
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']:.0%}")
print(f"Vote distribution: {result['vote_distribution']}")

Why self-consistency works:
- Correct reasoning paths tend to converge on the same answer
- Wrong reasoning paths tend to produce different wrong answers (they scatter)
- The majority vote amplifies the signal of correct reasoning
Key parameters:
| Parameter | Effect |
|---|---|
| n_samples | More samples = higher accuracy, higher cost. 5-10 is usually enough. |
| temperature | Higher = more diverse chains. 0.5-0.8 works well. Too low = all chains are identical. |
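One practical gotcha: the vote splits if the same answer is formatted differently ("$0.05" vs "0.05" vs "5 cents"). A minimal normalizer sketch — real pipelines usually need task-specific rules, and `normalize_answer` here is a hypothetical helper, not part of the implementation above:

```python
import re
from collections import Counter

def normalize_answer(raw: str) -> str:
    """Canonicalize an extracted answer so equivalent forms vote together."""
    s = raw.strip().lower().rstrip(".")
    s = s.replace("$", "").replace(",", "").replace("approximately", "").strip()
    m = re.search(r"-?\d+(?:\.\d+)?", s)
    if m:  # round numbers so e.g. 0.05 and 0.050 agree
        return str(round(float(m.group()), 2))
    return s

votes = Counter(normalize_answer(a) for a in ["$0.05", "0.05", "0.050", "5 cents"])
print(votes.most_common(1)[0])  # → ('0.05', 3)
```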
Sequential Revision
Sequential revision has the model iteratively improve its answer through multiple rounds of self-critique and refinement.
Round 1: Generate initial answer
Round 2: Critique the answer — find errors, gaps, weaknesses
Round 3: Revise the answer based on the critique
Round 4: Critique again — are the issues fixed? Any new ones?
Round 5: Final revision
Implementation:
import anthropic
client = anthropic.Anthropic()
def sequential_revision(question: str, max_rounds: int = 3) -> dict:
"""
Iteratively improve an answer through self-critique and revision.
"""
# Round 1: Initial answer
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=2048,
messages=[{
"role": "user",
"content": f"Answer this question thoroughly:\n\n{question}"
}]
)
current_answer = response.content[0].text
history = [{"round": 0, "type": "initial", "content": current_answer}]
for round_num in range(1, max_rounds + 1):
# Critique
critique_response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
messages=[{
"role": "user",
"content": f"""Here is a question and an answer. Critically evaluate the answer.
Identify specific errors, gaps, unsupported claims, or areas for improvement.
Be harsh but constructive. If the answer is already excellent, say "NO ISSUES FOUND".
Question: {question}
Answer: {current_answer}
Critique:"""
}]
)
critique = critique_response.content[0].text
history.append({"round": round_num, "type": "critique", "content": critique})
# Check if no issues found
if "NO ISSUES FOUND" in critique.upper():
break
# Revise
revision_response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=2048,
messages=[{
"role": "user",
"content": f"""Revise this answer based on the critique below.
Fix all identified issues while preserving what was already good.
Question: {question}
Original answer: {current_answer}
Critique: {critique}
Revised answer:"""
}]
)
current_answer = revision_response.content[0].text
history.append({"round": round_num, "type": "revision", "content": current_answer})
return {
"final_answer": current_answer,
"rounds": len([h for h in history if h["type"] == "revision"]),
"history": history
}
# Example
result = sequential_revision(
"Explain the CAP theorem in distributed systems and give a real-world example "
"of a system that prioritizes each of the three pairs (CP, AP, CA)."
)
print(f"Final answer (after {result['rounds']} revisions):")
print(result["final_answer"])

When to use sequential revision:
| Good For | Not Good For |
|---|---|
| Open-ended explanations | Simple factual questions |
| Code review and improvement | Tasks where the first answer is usually right |
| Essay writing and refinement | Time-sensitive applications |
| Analysis that needs to be thorough | Tasks where the model can't evaluate quality |
Tree of Thoughts (ToT)
Tree of Thoughts extends Chain-of-Thought from a single chain to a tree of reasoning paths. The model explores multiple approaches, evaluates each one, and prunes unpromising branches.
[Problem]
/ | \
[Approach A] [Approach B] [Approach C]
Score: 0.8 Score: 0.3 Score: 0.7
/ \ |
[A→step2] [A→step2'] [C→step2]
Score: 0.9 Score: 0.4 Score: 0.6
|
[A→step3]
Score: 0.95
|
[Final Answer]
Implementation:
import anthropic
import json
client = anthropic.Anthropic()
def tree_of_thoughts(
problem: str,
n_branches: int = 3,
max_depth: int = 3,
beam_width: int = 2
) -> dict:
"""
Explore multiple reasoning paths using Tree of Thoughts.
Args:
problem: The problem to solve
n_branches: Number of branches to generate at each step
max_depth: Maximum depth of the reasoning tree
beam_width: Number of top branches to keep at each level (beam search)
"""
def generate_thoughts(problem: str, context: str, n: int) -> list:
"""Generate n possible next reasoning steps."""
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=2048,
messages=[{
"role": "user",
"content": f"""Problem: {problem}
Reasoning so far: {context if context else "None — this is the first step."}
Generate exactly {n} different possible next steps in the reasoning.
Each should take a DIFFERENT approach or consider a DIFFERENT angle.
Return as a JSON array of strings, each being one reasoning step.
Example: ["Step: First approach...", "Step: Alternative approach...", "Step: Third angle..."]"""
}]
)
try:
return json.loads(response.content[0].text)
except json.JSONDecodeError:
return [response.content[0].text]
def evaluate_thought(problem: str, reasoning_path: str) -> float:
"""Evaluate how promising a reasoning path is (0-1)."""
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=256,
messages=[{
"role": "user",
"content": f"""Problem: {problem}
Reasoning path so far:
{reasoning_path}
Rate how promising this reasoning path is for solving the problem.
Consider: Is the logic sound? Is it making progress? Is it heading toward a correct answer?
Respond with ONLY a number between 0.0 and 1.0."""
}]
)
try:
return float(response.content[0].text.strip())
except ValueError:
return 0.5
# Initialize with root branches
current_paths = [{"path": "", "score": 1.0}]
for depth in range(max_depth):
all_candidates = []
for node in current_paths:
# Generate possible next steps
thoughts = generate_thoughts(problem, node["path"], n_branches)
for thought in thoughts:
new_path = f"{node['path']}\n{thought}" if node["path"] else thought
score = evaluate_thought(problem, new_path)
all_candidates.append({"path": new_path, "score": score})
# Keep top beam_width candidates (beam search)
all_candidates.sort(key=lambda x: x["score"], reverse=True)
current_paths = all_candidates[:beam_width]
print(f"Depth {depth + 1}: {len(all_candidates)} candidates → kept top {beam_width}")
for p in current_paths:
print(f" Score: {p['score']:.2f}")
# Generate final answer from the best path
best_path = current_paths[0]
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=2048,
messages=[{
"role": "user",
"content": f"""Problem: {problem}
Best reasoning path:
{best_path['path']}
Based on this reasoning, provide the final, complete answer to the problem."""
}]
)
return {
"answer": response.content[0].text,
"best_path": best_path["path"],
"path_score": best_path["score"],
"explored_paths": n_branches * (1 + (max_depth - 1) * beam_width)  # candidates generated, without re-calling the LLM just to count
}
# Example
result = tree_of_thoughts(
"Design a system to detect fraudulent transactions in real-time. "
"Consider latency, accuracy, and false positive rate.",
n_branches=3,
max_depth=3,
beam_width=2
)
print(f"Answer (path score: {result['path_score']:.2f}):")
print(result["answer"])

ToT vs CoT vs Self-Consistency:
| Technique | Paths Explored | Selection Method | LLM Calls | Best For |
|---|---|---|---|---|
| CoT | 1 (single chain) | None — take what you get | 1 | Simple reasoning tasks |
| Self-Consistency | N parallel chains | Majority vote on final answer | N | Math, logic, factual questions with verifiable answers |
| Tree of Thoughts | N^D (branching tree) | Evaluation + pruning at each step | Many | Complex problems requiring exploration of different strategies |
Search Against a Verifier
The most powerful inference-time technique: generate many candidate solutions, then use a verifier to pick the best one.
This works when you have a way to check whether an answer is correct — a unit test for code, a math checker, a constraint validator, etc.
Generate 50 candidate solutions
↓
Run each through a verifier
↓
Pick the one that passes (or scores highest)
Implementation for code generation:
import anthropic
import subprocess
import tempfile
client = anthropic.Anthropic()
def search_against_verifier(
problem: str,
test_cases: list[dict],
n_candidates: int = 10,
temperature: float = 0.8
) -> dict:
"""
Generate multiple code solutions and verify each against test cases.
Return the first solution that passes all tests.
"""
results = []
for i in range(n_candidates):
# Generate a candidate solution
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=2048,
temperature=temperature,
messages=[{
"role": "user",
"content": f"""Solve this problem in Python. Return ONLY the function, no explanation.
{problem}"""
}]
)
code = response.content[0].text
# Strip markdown code fences if present
if "```python" in code:
code = code.split("```python")[1].split("```")[0]
# Run against test cases
passed = 0
total = len(test_cases)
for test in test_cases:
test_code = f"""{code}
# Test
result = {test['call']}
expected = {test['expected']}
assert result == expected, f"Got {{result}}, expected {{expected}}"
print("PASS")
"""
try:
with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
f.write(test_code)
f.flush()
proc = subprocess.run(
['python', f.name],
capture_output=True, text=True, timeout=5
)
if proc.returncode == 0 and "PASS" in proc.stdout:
passed += 1
except Exception:  # includes subprocess.TimeoutExpired
pass
results.append({
"candidate": i + 1,
"code": code,
"passed": passed,
"total": total,
"all_passed": passed == total
})
# Early exit if we find a perfect solution
if passed == total:
print(f" Candidate {i + 1}: PASSED all {total} tests")
return {
"solution": code,
"candidates_tried": i + 1,
"results": results
}
else:
print(f" Candidate {i + 1}: {passed}/{total} tests passed")
# Return best solution if none passed all tests
best = max(results, key=lambda r: r["passed"])
return {
"solution": best["code"],
"candidates_tried": n_candidates,
"best_score": f"{best['passed']}/{best['total']}",
"results": results
}
# Example
result = search_against_verifier(
problem="""Write a function `longest_palindrome(s: str) -> str` that returns
the longest palindromic substring in s. If there are multiple with the same length,
return the first one found.""",
test_cases=[
{"call": "longest_palindrome('babad')", "expected": "'bab'"},
{"call": "longest_palindrome('cbbd')", "expected": "'bb'"},
{"call": "longest_palindrome('a')", "expected": "'a'"},
{"call": "longest_palindrome('racecar')", "expected": "'racecar'"},
{"call": "longest_palindrome('')", "expected": "''"},
],
n_candidates=10
)
print(f"\nFound solution after {result['candidates_tried']} candidates")
print(result["solution"])

Why this is so powerful:
The pass@k metric shows the probability that at least one of k generated samples is correct. For many coding tasks:
| Metric | Pass Rate |
|---|---|
| pass@1 (single attempt) | ~50% |
| pass@10 (best of 10) | ~85% |
| pass@100 (best of 100) | ~95% |
With a reliable verifier, you can dramatically boost accuracy just by generating more candidates.
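The table's numbers are illustrative, but the standard unbiased pass@k estimator (popularized by the HumanEval evaluation) is worth knowing: with n samples of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n generated samples, c of which were correct."""
    if n - c < k:  # fewer than k failures: every size-k draw contains a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 5, 1))            # → 0.5
print(round(pass_at_k(10, 5, 3), 3))  # → 0.917
```

Note how quickly the curve bends upward: even a 50% pass@1 model clears 90% by k=3, which is exactly what makes generate-and-verify so effective.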
Part III: Training-Time Techniques
Training-time techniques change how the model is trained to improve its reasoning capability. These are what reasoning model creators (OpenAI, DeepSeek, etc.) do before you ever use the model.
Understanding these helps you:
- Know why reasoning models work the way they do
- Fine-tune your own reasoning models
- Make informed decisions about which models to use
SFT on Reasoning Data (STaR)
STaR (Self-Taught Reasoner) is a technique where a model learns to reason by training on its own successful reasoning traces.
The STaR loop:
Step 1: Give the model a question
Step 2: Ask it to generate a reasoning chain + answer
Step 3: Check if the answer is correct
Step 4: If correct → add this (question, reasoning, answer) to the training set
If wrong → give the model the correct answer and ask it to generate
a reasoning chain that arrives at that answer (rationalization)
Step 5: Fine-tune the model on the collected correct reasoning traces
Step 6: Repeat from Step 1 with the improved model
# Pseudocode for STaR training loop
def star_training(model, questions, correct_answers, num_iterations=5):
for iteration in range(num_iterations):
training_data = []
for question, correct_answer in zip(questions, correct_answers):
# Generate reasoning + answer
reasoning, predicted_answer = model.generate_with_reasoning(question)
if predicted_answer == correct_answer:
# Direct: model got it right naturally
training_data.append({
"question": question,
"reasoning": reasoning,
"answer": correct_answer,
"type": "direct"
})
else:
# Rationalization: hint the correct answer and get reasoning
hint_reasoning, _ = model.generate_with_reasoning(
question,
hint=f"The correct answer is {correct_answer}. Show your reasoning."
)
training_data.append({
"question": question,
"reasoning": hint_reasoning,
"answer": correct_answer,
"type": "rationalized"
})
# Fine-tune model on collected reasoning traces
model = fine_tune(model, training_data)
accuracy = evaluate(model, held_out_questions)
print(f"Iteration {iteration + 1}: accuracy = {accuracy:.2%}")
return model

Why STaR works:
| Aspect | Explanation |
|---|---|
| Bootstrapping | The model starts with weak reasoning but gets training data from its own successes |
| Rationalization | When the model gets the wrong answer, you give it the answer and ask for reasoning — this creates training data even for hard problems |
| Self-improvement | Each iteration produces a better model, which generates better reasoning traces for the next iteration |
Reinforcement Learning with a Verifier
This is the technique behind o1 and DeepSeek-R1. Instead of supervised fine-tuning on correct examples, you use reinforcement learning where the reward comes from a verifier.
Traditional SFT:
"Here are correct reasoning traces. Learn to produce text like this."
RL with Verifier:
"Here's a problem. Try to solve it. I'll tell you if you got the right answer.
Figure out how to reason in a way that produces correct answers."
How it works:
┌─────────────────────────────────────┐
│ │
▼ │
Problem → [LLM Policy] → Reasoning + Answer → [Verifier] → Reward
▲ │
│ ┌───────────────────────────┘
│ │
└─────────┘
Update policy to maximize reward
The training loop:
# Pseudocode for RL-based reasoning training
def train_reasoning_with_rl(policy_model, problems, verifier):
"""
policy_model: The LLM we're training
problems: Math/code problems with known correct answers
verifier: Can check if an answer is correct (test runner, math checker)
"""
for batch in sample_batches(problems):
for problem in batch:
# Generate multiple reasoning traces (exploration)
traces = []
for _ in range(K):
reasoning, answer = policy_model.generate(problem, temperature=0.8)
is_correct = verifier.check(problem, answer)
reward = 1.0 if is_correct else 0.0
traces.append((reasoning, answer, reward))
# Compute advantage: how much better was each trace than average?
avg_reward = mean([t[2] for t in traces])
advantages = [(t[0], t[1], t[2] - avg_reward) for t in traces]
# Update policy: increase probability of high-reward traces,
# decrease probability of low-reward traces
policy_model.update(advantages)  # e.g., GRPO, PPO

Key components:
| Component | Role | Example |
|---|---|---|
| Policy | The LLM being trained to reason | DeepSeek-V3 base model |
| Verifier | Checks answer correctness | Python test runner for code, symbolic math checker |
| Reward | Signal that guides learning | +1 for correct, 0 for incorrect |
| Exploration | Generating diverse reasoning traces | High temperature sampling |
| Policy update | Adjusting model weights based on rewards | GRPO, PPO, REINFORCE |
Why RL produces better reasoning than SFT:
SFT tells the model "reason like this." RL tells the model "find any reasoning strategy that produces correct answers." The model discovers its own reasoning patterns — which can be more diverse and robust than any human-written examples.
Reward Modeling (ORM and PRM)
A verifier that only checks the final answer is limited. Reward models evaluate the quality of reasoning at a more granular level.
Outcome Reward Model (ORM):
Evaluates the entire reasoning trace as a whole. "Given this complete reasoning chain, how likely is the final answer to be correct?"
Reasoning chain → [ORM] → Score: 0.87
Process Reward Model (PRM):
Evaluates each step of the reasoning. "Is this particular step correct and useful?"
Step 1: "17 * 24 = 17 * 20 + 17 * 4" → [PRM] → Score: 0.95 (correct decomposition)
Step 2: "17 * 20 = 340" → [PRM] → Score: 0.98 (correct)
Step 3: "17 * 4 = 72" → [PRM] → Score: 0.15 (WRONG! 17*4=68)
Step 4: "340 + 72 = 412" → [PRM] → Score: 0.90 (arithmetic is right, but input is wrong)
PRM can catch the error at step 3 and guide the model to fix it before it propagates.
Comparison:
| Aspect | ORM | PRM |
|---|---|---|
| Granularity | Entire chain | Step by step |
| Training data | Easier (just need final answer correctness) | Harder (need per-step annotations) |
| Error detection | Can only say "the chain is probably wrong" | Can pinpoint exactly which step is wrong |
| Use at inference | Rank complete solutions | Guide search: prune bad branches early |
| Compute | One evaluation per chain | One evaluation per step |
Using a PRM for guided search:
# Pseudocode: Use PRM to guide step-by-step generation
def prm_guided_generation(problem, prm, n_candidates=5):
"""
At each step, generate multiple continuations, score them with PRM,
and keep only the best ones (beam search with PRM scoring).
"""
beams = [{"steps": [], "score": 1.0}]
for step_num in range(max_steps):
all_candidates = []
for beam in beams:
# Generate N possible next steps
next_steps = model.generate_next_steps(
problem, beam["steps"], n=n_candidates
)
for step in next_steps:
# Score this step with PRM
step_score = prm.score_step(problem, beam["steps"] + [step])
all_candidates.append({
"steps": beam["steps"] + [step],
"score": beam["score"] * step_score
})
# Keep top beams
all_candidates.sort(key=lambda x: x["score"], reverse=True)
beams = all_candidates[:beam_width]
# Check if any beam has reached a final answer
for beam in beams:
if is_final_answer(beam["steps"][-1]):
return beam
return beams[0]  # Return best beam

Self-Refinement
Self-refinement is a training-time approach where the model is trained to improve its own outputs iteratively. Unlike sequential revision (which is inference-time), self-refinement bakes this capability into the model's weights.
Training process:
1. Generate initial response to a question
2. Generate a critique of that response
3. Generate a revised response
4. If the revised response is better (checked by a verifier or human),
train the model on the full (response → critique → revision) trajectory
5. Repeat until the model naturally produces high-quality self-critiques
The goal: A model that, when it generates a wrong answer, can reliably identify what's wrong and fix it — without external prompting.
# Pseudocode: Self-refinement training data generation
def generate_refinement_training_data(model, problems, verifier):
training_examples = []
for problem in problems:
# Initial attempt
initial_response = model.generate(problem)
initial_correct = verifier.check(problem, initial_response)
# Self-critique
critique = model.generate(
f"Critique this solution:\n{problem}\n{initial_response}"
)
# Revision
revised_response = model.generate(
f"Revise based on critique:\n{problem}\n{initial_response}\nCritique: {critique}"
)
revised_correct = verifier.check(problem, revised_response)
# Only keep examples where refinement actually improved the answer
if not initial_correct and revised_correct:
training_examples.append({
"problem": problem,
"initial": initial_response,
"critique": critique,
"revision": revised_response,
"label": "improvement"
})
return training_examples

Internalizing Search (Meta-CoT)
The latest frontier in reasoning research: instead of running explicit search algorithms at inference time (like Tree of Thoughts or beam search), train the model to internalize the search process.
The idea: When you use Tree of Thoughts, you're running an external algorithm that makes multiple LLM calls. But what if the model could do all that exploration, evaluation, and backtracking in a single forward pass? That's internalizing search.
How Meta-CoT works:
External search (Tree of Thoughts):
LLM call 1: Generate branch A
LLM call 2: Generate branch B
LLM call 3: Generate branch C
LLM call 4: Evaluate branches
LLM call 5: Expand best branch
... (many LLM calls)
Internalized search (Meta-CoT):
Single generation:
"Let me consider approach A... [explores A]... this leads to a contradiction.
Let me try approach B... [explores B]... this seems promising but hits a wall at step 3.
Combining ideas from A and B... [hybrid approach]... yes, this works.
Final answer: ..."
Training Meta-CoT:
Step 1: Collect search traces
Run Tree of Thoughts / MCTS on hard problems
Record the full search trace: all branches explored, evaluations, backtracking
Step 2: Linearize the search trace
Convert the tree structure into a linear text sequence:
"Exploring approach A → evaluating (score 0.3, unpromising) → backtracking →
Exploring approach B → evaluating (score 0.8, promising) → deepening →
B step 2 → evaluating (score 0.9) → final answer: ..."
Step 3: Train the model on these linearized search traces
The model learns to generate text that mimics the search process
Step 4: At inference, the model generates its own internal search
It naturally explores, evaluates, backtracks, and converges — all in one generation
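Step 2 above (linearizing the search trace) can be sketched as a depth-first flattening of the explored tree. The node schema here (`thought`, `score`, `children`) is purely illustrative:

```python
def linearize(node):
    """Flatten an explored search tree into one training sequence,
    visiting weaker branches first so backtracking appears explicitly."""
    parts = [f"Exploring {node['thought']} (score {node['score']})"]
    children = sorted(node.get("children", []), key=lambda c: c["score"])
    for i, child in enumerate(children):
        parts.append(linearize(child))
        if i < len(children) - 1:
            parts.append("backtracking")
    return " -> ".join(parts)

trace_tree = {
    "thought": "problem", "score": 1.0,
    "children": [
        {"thought": "approach A", "score": 0.3},
        {"thought": "approach B", "score": 0.8,
         "children": [{"thought": "B step 2", "score": 0.9}]},
    ],
}
training_text = linearize(trace_tree)
# "Exploring problem (score 1.0) -> Exploring approach A (score 0.3)
#  -> backtracking -> Exploring approach B (score 0.8)
#  -> Exploring B step 2 (score 0.9)"
```

Training on sequences like `training_text` teaches the model to produce exploration, evaluation, and backtracking as ordinary generated text.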
Why this matters:
| Approach | LLM Calls | Latency | Quality |
|---|---|---|---|
| Standard CoT | 1 | Low | Good |
| Tree of Thoughts | 10-50 | Very High | Very Good |
| Meta-CoT | 1 (but longer output) | Medium | Very Good |
Meta-CoT gets the quality benefits of search with the efficiency of a single generation.
Part IV: Build a "Deep Research" Agent
Now let's combine everything. We'll build a Deep Research agent that uses structured reasoning, web search, and iterative refinement to produce comprehensive research reports on complex topics.
This is different from the Part 3 "Ask-the-Web" agent in a key way: it reasons about what it knows and doesn't know, plans its research strategy, evaluates evidence quality, and revises its conclusions.
Architecture
┌──────────────────────────────────────────────────────────────────────┐
│ Deep Research Agent │
│ │
│ User Question │
│ ↓ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Phase 1: Question Analysis │ │
│ │ - Decompose into sub-questions │ │
│ │ - Identify what types of sources are needed │ │
│ │ - Create a research plan │ │
│ └──────────────────────┬──────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Phase 2: Research Loop (per sub-question) │ │
│ │ - Search the web with targeted queries │ │
│ │ - Read and extract key findings from pages │ │
│ │ - Evaluate source credibility │ │
│ │ - Note contradictions and gaps │ │
│ └──────────────────────┬──────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Phase 3: Synthesis with Reasoning │ │
│ │ - Extended thinking to reason over all evidence │ │
│ │ - Resolve contradictions │ │
│ │ - Identify confidence levels │ │
│ └──────────────────────┬──────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Phase 4: Self-Critique and Revision │ │
│ │ - Review the draft for gaps and errors │ │
│ │ - Do follow-up searches if needed │ │
│ │ - Produce final report with citations │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ Final Report with citations, confidence levels, and source list │
└──────────────────────────────────────────────────────────────────────┘
Implementation
import anthropic
import json
from datetime import datetime
client = anthropic.Anthropic()
# --- Tool definitions ---
tools = [
{
"name": "search_web",
"description": "Search the web for current information. Use specific, targeted queries.",
"input_schema": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "The search query"
}
},
"required": ["query"]
}
},
{
"name": "fetch_page",
"description": "Fetch and read the full content of a web page.",
"input_schema": {
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "The URL to fetch"
}
},
"required": ["url"]
}
}
]
def execute_tool(name: str, args: dict) -> str:
"""Execute a tool call. Replace with real implementations."""
if name == "search_web":
return search_web(args["query"]) # Your search implementation
elif name == "fetch_page":
return fetch_page(args["url"]) # Your fetch implementation
return f"Unknown tool: {name}"
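The two tool stubs above need real implementations. Here is a minimal, standard-library-only sketch of `fetch_page`; `search_web` is provider-specific (Brave, Tavily, SerpAPI, and similar all differ), so it is left to your chosen search API:

```python
import re
import urllib.request

def strip_html(html: str) -> str:
    """Crude text extraction: drop <script>/<style> bodies, then all tags,
    then collapse whitespace."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

def fetch_page(url: str, max_chars: int = 8000) -> str:
    """Download a page and return plain text, truncated to bound context size."""
    req = urllib.request.Request(url, headers={"User-Agent": "deep-research/0.1"})
    with urllib.request.urlopen(req, timeout=15) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return strip_html(html)[:max_chars]
```

Truncating to `max_chars` is a blunt but effective guard against blowing the context window on a single long page; a production version would extract the main article body instead.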
def run_agent_loop(system_prompt: str, user_message: str, max_steps: int = 15) -> str:
"""Run a ReAct-style agent loop with tool calling."""
messages = [{"role": "user", "content": user_message}]
for step in range(max_steps):
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=4096,
system=system_prompt,
tools=tools,
messages=messages,
)
messages.append({"role": "assistant", "content": response.content})
if response.stop_reason == "tool_use":
tool_results = []
for block in response.content:
if block.type == "tool_use":
result = execute_tool(block.name, block.input)
tool_results.append({
"type": "tool_result",
"tool_use_id": block.id,
"content": result
})
messages.append({"role": "user", "content": tool_results})
else:
# Extract text from response
return "".join(
block.text for block in response.content if hasattr(block, "text")
)
return "Agent reached maximum steps."
# --- Phase 1: Question Analysis ---
def analyze_question(question: str) -> dict:
"""Decompose the question into sub-questions and create a research plan."""
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=2048,
temperature=1,
thinking={"type": "enabled", "budget_tokens": 4000},
messages=[{
"role": "user",
"content": f"""Analyze this research question and create a research plan.
Question: {question}
Return a JSON object with:
1. "sub_questions": Array of 3-6 specific sub-questions to investigate
2. "source_types": What types of sources would be most valuable
(academic papers, news articles, documentation, expert blogs, etc.)
3. "search_queries": Array of 5-8 specific search queries to run
4. "known_context": What you already know about this topic (brief)
5. "key_uncertainties": What you're most uncertain about
Return ONLY the JSON object."""
}]
)
# Extract text content (skip thinking blocks)
text = ""
for block in response.content:
if hasattr(block, "text"):
text = block.text
break
try:
return json.loads(text)
except json.JSONDecodeError:
return {
"sub_questions": [question],
"search_queries": [question],
"source_types": ["general"],
"known_context": "",
"key_uncertainties": ["Unable to parse analysis"]
}
# --- Phase 2: Research Loop ---
def research_sub_question(sub_question: str, search_queries: list) -> str:
"""Research a specific sub-question using web search."""
system_prompt = f"""You are a research assistant investigating this specific question:
"{sub_question}"
Your goal:
1. Search the web using the provided queries (and create new ones if needed)
2. Read the most relevant pages
3. Extract key findings, noting the source for each fact
4. Note any contradictions between sources
5. When you have enough information, provide a structured summary
Format your final output as:
## Key Findings
[Numbered list of findings with source attribution]
## Source Quality Assessment
[Brief assessment of how reliable your sources are]
## Contradictions or Uncertainties
[Any conflicting information or gaps]
## Sources
[Numbered list of URLs used]"""
query_text = "\n".join(f"- {q}" for q in search_queries[:3])
user_message = f"""Research this question: {sub_question}
Start with these search queries:
{query_text}
Search, read relevant pages, and provide a comprehensive summary."""
return run_agent_loop(system_prompt, user_message, max_steps=12)
# --- Phase 3: Synthesis with Reasoning ---
def synthesize_research(question: str, research_findings: list) -> str:
"""Use extended thinking to synthesize all research into a coherent analysis."""
findings_text = "\n\n---\n\n".join(
f"### Sub-question {i+1}\n{finding}"
for i, finding in enumerate(research_findings)
)
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=8000,
temperature=1,
thinking={"type": "enabled", "budget_tokens": 8000},
messages=[{
"role": "user",
"content": f"""You are writing a comprehensive research report.
Original question: {question}
Here are the research findings from investigating different aspects of this question:
{findings_text}
Write a comprehensive, well-structured research report that:
1. Synthesizes findings from all sub-questions into a coherent narrative
2. Resolves contradictions (explain which sources are more credible and why)
3. Clearly states confidence levels (high/medium/low) for each major claim
4. Includes inline citations [1], [2], etc.
5. Has a "Limitations and Gaps" section noting what you couldn't find or verify
6. Ends with a consolidated sources list
The report should be thorough but readable — like a research briefing for a smart
non-expert who needs to make decisions based on this information."""
}]
)
# Extract text (skip thinking)
for block in response.content:
if hasattr(block, "text"):
return block.text
return "Synthesis failed."
# --- Phase 4: Self-Critique and Revision ---
def critique_and_revise(question: str, report: str) -> str:
"""Review the report for gaps and errors, then revise."""
# Step 1: Critique
critique_response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=2048,
temperature=1,
thinking={"type": "enabled", "budget_tokens": 4000},
messages=[{
"role": "user",
"content": f"""Critically review this research report. Be thorough and harsh.
Original question: {question}
Report:
{report}
Evaluate:
1. Are there logical errors or unsupported claims?
2. Are important perspectives or counterarguments missing?
3. Are the confidence levels appropriate?
4. Is any information likely outdated or incorrect?
5. Are there follow-up questions that should have been investigated?
If the report is excellent, respond with "REPORT APPROVED".
Otherwise, list specific issues that need to be fixed."""
}]
)
critique_text = ""
for block in critique_response.content:
if hasattr(block, "text"):
critique_text = block.text
break
if "REPORT APPROVED" in critique_text.upper():
return report
# Step 2: Revise based on critique
revision_response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=8000,
temperature=1,
thinking={"type": "enabled", "budget_tokens": 4000},
messages=[{
"role": "user",
"content": f"""Revise this research report based on the critique below.
Fix all identified issues. If the critique mentions missing information that requires
additional research, note it in the Limitations section rather than making things up.
Original question: {question}
Report:
{report}
Critique:
{critique_text}
Provide the complete revised report:"""
}]
)
for block in revision_response.content:
if hasattr(block, "text"):
return block.text
return report
# --- Main: Deep Research Pipeline ---
def deep_research(question: str) -> str:
"""
Run the full deep research pipeline:
1. Analyze the question and plan research
2. Research each sub-question with web search
3. Synthesize findings with extended reasoning
4. Self-critique and revise
"""
print(f"Deep Research: {question}\n")
# Phase 1: Analyze
print("Phase 1: Analyzing question...")
plan = analyze_question(question)
print(f" Sub-questions: {len(plan['sub_questions'])}")
print(f" Search queries: {len(plan['search_queries'])}")
# Phase 2: Research each sub-question
print("\nPhase 2: Researching...")
findings = []
for i, sub_q in enumerate(plan["sub_questions"]):
print(f" Researching sub-question {i+1}: {sub_q[:80]}...")
# Pick relevant search queries for this sub-question
relevant_queries = plan["search_queries"][
i * 2 : (i + 1) * 2
] or [sub_q]
finding = research_sub_question(sub_q, relevant_queries)
findings.append(finding)
# Phase 3: Synthesize
print("\nPhase 3: Synthesizing research...")
report = synthesize_research(question, findings)
# Phase 4: Critique and revise
print("\nPhase 4: Self-critique and revision...")
final_report = critique_and_revise(question, report)
print("\nDone!")
return final_report
# --- Run it ---
if __name__ == "__main__":
report = deep_research(
"What are the current best practices for building reasoning-capable AI agents, "
"and how do inference-time scaling techniques compare to training-time approaches "
"in terms of cost, quality, and practical applicability?"
)
print("\n" + "=" * 80)
print(report)

Adding Self-Consistency to the Research Pipeline
For critical research tasks, you can add self-consistency by running the entire pipeline multiple times and comparing the results:
def deep_research_with_consistency(question: str, n_runs: int = 3) -> str:
"""
Run deep research multiple times and synthesize the most consistent findings.
"""
reports = []
for i in range(n_runs):
print(f"\n{'='*40} Run {i+1}/{n_runs} {'='*40}")
report = deep_research(question)
reports.append(report)
# Synthesize across runs
reports_text = "\n\n---\n\n".join(
f"### Report {i+1}\n{report}" for i, report in enumerate(reports)
)
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=8000,
temperature=1,
thinking={"type": "enabled", "budget_tokens": 8000},
messages=[{
"role": "user",
"content": f"""You ran a deep research pipeline {n_runs} times on the same question.
Each run independently searched the web, analyzed sources, and produced a report.
Question: {question}
Here are all {n_runs} reports:
{reports_text}
Synthesize these into a single, definitive report:
1. Claims that appear in all/most reports are HIGH CONFIDENCE
2. Claims that appear in only one report need verification — mark as LOW CONFIDENCE
3. Contradictions between reports should be explicitly noted
4. Combine the best citations from all reports
Produce the final consolidated research report:"""
}]
)
for block in response.content:
if hasattr(block, "text"):
return block.text
return reports[0]

Key Design Decisions
| Decision | Our Choice | Why |
|---|---|---|
| Reasoning model | Claude with extended thinking | Visible reasoning, budget control, tool-use compatible |
| Search strategy | Plan-based (decompose first, then search) | Better coverage than ad-hoc searching |
| Verification | Self-critique + optional multi-run consistency | Catches errors without external verifier |
| Depth control | Configurable sub-questions and search per sub-question | Balance thoroughness vs cost |
| Citation style | Inline citations with source list | Traceable, verifiable claims |
Part V: Putting It All Together — When to Use What
Inference-Time Technique Decision Tree
Is the task simple (single-step, factual)?
→ YES: Standard LLM call. No special technique needed.
→ NO: Continue...
Does the task have a verifiable answer (code, math, constraints)?
→ YES: Search against a verifier. Generate N candidates, verify, pick best.
→ NO: Continue...
Is there one clear approach, or multiple possible approaches?
→ ONE APPROACH: Chain-of-Thought prompting.
→ MULTIPLE: Continue...
Do you need to explore different strategies?
→ YES: Tree of Thoughts (or Meta-CoT if available).
→ NO: Continue...
Is the answer open-ended (no single correct answer)?
→ YES: Sequential revision (generate → critique → improve).
→ NO: Self-consistency (sample multiple chains, majority vote).
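The decision tree above can be collapsed into a small dispatch function. The boolean parameters mirror the questions in order, and the returned strings are just labels:

```python
def pick_technique(is_simple: bool, has_verifier: bool,
                   multiple_approaches: bool, needs_exploration: bool,
                   open_ended: bool) -> str:
    """Walk the decision tree top to bottom; return the first match."""
    if is_simple:
        return "standard call"
    if has_verifier:
        return "search against a verifier"
    if not multiple_approaches:
        return "chain-of-thought"
    if needs_exploration:
        return "tree of thoughts"
    if open_ended:
        return "sequential revision"
    return "self-consistency"

# A coding task: not simple, but unit tests give a verifier.
technique = pick_technique(False, True, True, False, False)
# -> "search against a verifier"
```

Encoding the tree this way makes the precedence explicit: a reliable verifier beats every other consideration, because verified search is the cheapest path to a large quality gain.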
Technique Comparison Summary
| Technique | LLM Calls | Best For | Quality Boost | Cost |
|---|---|---|---|---|
| Standard | 1 | Simple tasks | Baseline | $ |
| CoT prompting | 1 | Math, logic, multi-step | +15-30% on reasoning tasks | $ |
| Self-consistency | N (5-10) | Tasks with a single extractable answer | +10-20% over single CoT | $$$ |
| Sequential revision | 2-6 | Open-ended analysis, writing | Highly variable | $$ |
| Tree of Thoughts | 10-50+ | Complex strategy/planning | +20-40% on hard problems | $$$$ |
| Search + verifier | N (10-100) | Code, math, constraint satisfaction | +30-50% (pass@k) | $$$$ |
| Extended thinking | 1 (more tokens) | Deep analysis, complex reasoning | +20-40% | $$ |
| Deep research pipeline | Many | Multi-faceted research questions | Comprehensive coverage | $$$$$ |
Training-Time Technique Summary
| Technique | What It Does | Key Requirement | Who Uses It |
|---|---|---|---|
| SFT on reasoning | Train on correct reasoning examples | High-quality reasoning traces | Fine-tuners, researchers |
| STaR | Self-generate and filter reasoning data | A way to verify answers | Researchers |
| RL with verifier | Learn reasoning through trial and error | Reliable verifier + RL infrastructure | OpenAI (o1), DeepSeek (R1) |
| ORM | Score complete reasoning chains | Labeled chain-level data | Used in best-of-N selection |
| PRM | Score individual reasoning steps | Step-level annotations | Used in guided search, MCTS |
| Self-refinement | Learn to critique and improve own output | Verifier to assess improvement | Researchers |
| Meta-CoT | Internalize search into generation | Search traces for training | Cutting-edge research |
What You Should Know After Reading This
- What is inference-time scaling and why does it matter?
- How do reasoning models (o1, DeepSeek-R1, Claude extended thinking) differ from standard LLMs?
- What is Chain-of-Thought prompting and when does it help?
- How does self-consistency improve accuracy through majority voting?
- What is Tree of Thoughts and how does it extend CoT to tree search?
- How does "search against a verifier" work for code and math problems?
- What is STaR and how does it bootstrap reasoning capability?
- How does RL with a verifier train reasoning models like o1 and R1?
- What is the difference between ORM and PRM reward models?
- What is Meta-CoT and why is internalizing search important?
- How would you design a deep research agent that combines search, reasoning, and self-critique?
Further Reading
- "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (Wei et al., 2022) — The paper that started it all
- "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (Wang et al., 2022) — Majority voting over CoT chains
- "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (Yao et al., 2023) — Extending CoT to tree search
- "STaR: Bootstrapping Reasoning With Reasoning" (Zelikman et al., 2022) — Self-taught reasoning
- "Let's Verify Step by Step" (Lightman et al., 2023) — Process reward models for math
- "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (DeepSeek, 2025) — Open-weight reasoning model
- "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters" (Snell et al., 2024) — The case for inference-time scaling
- "Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought" (Xiang et al., 2025) — Internalizing search
- "Reflexion: Language Agents with Verbal Reinforcement Learning" (Shinn et al., 2023)
- "Training Verifiers to Solve Math Word Problems" (Cobbe et al., 2021) — ORM for math
Next in the Series
Part 5: Multi-modal Generation Agent — We cover the full landscape of visual generation — VAEs, GANs, auto-regressive models, and diffusion models. Then we go deep on text-to-image (data preparation, U-Net vs DiT architectures, diffusion training, sampling, and evaluation) and text-to-video (3D VAE compression, video DiT with factored attention, large-scale training challenges), and build a multi-modal generation agent.
Stay tuned.