Build an LLM Playground — Part 1: How Large Language Models Actually Work
The first entry in a learn-by-doing series to become an AI engineer. We break down every stage of how LLMs are built — from raw data to chatbot — so you can build your own playground with real understanding.
Series: The AI Engineer Learning Path
This is Part 1 of a hands-on series designed to take you from zero to working AI engineer. Every post follows a learn-by-doing philosophy — we explain the theory, then you build something real.
| Part | Topic | Status |
|---|---|---|
| 1 | Build an LLM Playground (this post) | Current |
| 2 | Customer Support Chatbot with RAG & Prompt Engineering | Available |
| 3 | "Ask-the-Web" Agent with Tool Calling | Available |
| 4 | Deep Research with Reasoning Models | Available |
| 5 | Multi-modal Generation Agent | Available |
By the end of this post, you'll understand every stage of how a large language model is built — from raw internet data to the chatbot you interact with daily. More importantly, you'll have a mental model that makes every other AI concept click into place.
Why Start Here?
If you want to be an AI engineer, you don't need to train GPT-5 from scratch. But you do need to understand what happens under the hood. Without that understanding, you'll be copy-pasting API calls without knowing why things break, why outputs are weird, or how to fix them.
This post covers the full lifecycle of an LLM:
- Pre-Training — How raw data becomes a model that can predict text
- Post-Training — How that model is refined to follow instructions and be helpful
- Evaluation — How we measure whether the model is actually good
- Chatbot Design — How the full system around the model works
- Build the Playground — Practical guidance for building your own LLM playground
Let's go.
Part I: Pre-Training — Teaching a Model to Predict Language
Pre-training is where the model learns language itself. It reads billions of pages of text and learns to predict the next word. That's the core idea — everything else is details. But the details matter enormously.
1. Data Collection
LLMs are trained on massive text datasets — hundreds of billions to trillions of tokens. Where does all that text come from?
Sources of Training Data
| Source | Description | Scale |
|---|---|---|
| Common Crawl | Monthly snapshots of the public internet. The largest open web corpus available. Raw dumps are petabytes of HTML. | ~250 billion pages |
| Manual/Targeted Crawling | Custom web scrapers targeting specific high-quality domains — Wikipedia, Stack Overflow, GitHub, arXiv, textbooks, legal filings, patent databases. | Varies by source |
| Books Corpora | Digitized books (BookCorpus, Project Gutenberg, Books3). Long-form, high-quality prose. | Millions of books |
| Code Repositories | GitHub public repositories, filtered by license. Critical for code-capable models (Codex, StarCoder, Code Llama). | Hundreds of millions of files |
| Academic Papers | arXiv, Semantic Scholar, PubMed. Essential for scientific reasoning. | Tens of millions of papers |
| Conversational Data | Reddit, forums, Q&A sites. Teaches dialogue patterns and informal language. | Billions of posts |
The Data Scale Problem
GPT-3 was trained on ~300 billion tokens. LLaMA 2 used 2 trillion tokens. Modern frontier models likely use 10+ trillion tokens. At this scale, data quality and deduplication aren't nice-to-haves — they're the difference between a good model and a bad one.
Key insight: The quality of your data matters more than the quantity. A model trained on 1 trillion clean tokens will outperform one trained on 5 trillion noisy tokens.
2. Data Cleaning
Raw web data is a mess. It contains duplicates, spam, porn, malware, boilerplate HTML, navigation menus, cookie banners, and pages that are 90% ads. Cleaning this data is one of the most important — and most underrated — parts of building an LLM.
The Cleaning Pipeline
Raw HTML → Text Extraction → Language Filtering → Quality Filtering →
Deduplication → PII Removal → Toxic Content Filtering → Final Dataset
Key Data Cleaning Projects
Understanding these projects gives you insight into what "clean data" actually means in practice:
RefinedWeb (Falcon)
- Built by the Technology Innovation Institute for the Falcon models
- Started from Common Crawl and applied aggressive filtering
- Used trafilatura for text extraction from HTML (much better than simple tag stripping)
- Applied URL-based filtering, language identification, and document-level quality heuristics
- Heavy deduplication using MinHash LSH (fuzzy matching that catches near-duplicates, not just exact copies)
- Result: 5 trillion tokens of high-quality web text
- Key finding: with enough cleaning, web-only data can match curated datasets
Dolma (OLMo / AI2)
- Built by the Allen Institute for AI for the OLMo family of models
- Fully open and documented — you can see every filtering decision
- Mixed sources: Common Crawl, Wikipedia, Project Gutenberg, Semantic Scholar, GitHub, Reddit
- Uses a pipeline of taggers that annotate text with quality signals (language, toxicity, duplication, etc.) and then filters based on those tags
- Explicitly documents trade-offs — for example, aggressive deduplication improves quality but reduces diversity
- Result: 3 trillion tokens with full provenance
FineWeb (HuggingFace)
- Built by HuggingFace as a fully open, reproducible web dataset
- 15 trillion tokens from 96 Common Crawl snapshots (2013-2024)
- Key innovation: developed custom quality classifiers trained on educational content
- FineWeb-Edu subset: filtered to only educational content, resulting in significant benchmark improvements despite being much smaller
- Every step is documented and reproducible — a model for open data practices
What Gets Filtered Out
| Filter | What It Catches | Why It Matters |
|---|---|---|
| Language detection | Non-target language text | Training an English model on Chinese text wastes compute |
| Deduplication | Repeated pages, boilerplate, scraped content farms | Duplicates cause the model to memorize rather than generalize |
| Quality heuristics | Short pages, high symbol-to-word ratio, abnormally high-perplexity text | Removes spam, auto-generated content, and gibberish |
| URL filtering | Known spam domains, adult content sites | Removes obviously low-quality sources |
| PII removal | Email addresses, phone numbers, SSNs | Legal and ethical requirement |
| Toxicity filtering | Hate speech, violence, explicit content | Reduces harmful model outputs |
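To make the filters above concrete, here is a toy document-level quality filter. The thresholds, the symbol-ratio rule, and the repeated-line boilerplate check are illustrative assumptions, not values taken from any real pipeline:

```python
import re

def passes_quality_heuristics(text: str,
                              min_words: int = 50,
                              max_symbol_ratio: float = 0.1) -> bool:
    """Toy document-level quality filter in the spirit of the table above.

    All thresholds are made up for illustration.
    """
    words = text.split()
    if len(words) < min_words:                 # too short to be useful prose
        return False
    # ratio of non-alphanumeric, non-space characters to total characters
    symbols = len(re.findall(r"[^\w\s]", text))
    if symbols / max(len(text), 1) > max_symbol_ratio:
        return False
    # crude boilerplate check: a page that mostly repeats the same lines
    lines = [l for l in text.splitlines() if l.strip()]
    if lines and len(set(lines)) / len(lines) < 0.5:
        return False
    return True

print(passes_quality_heuristics("word " * 100))      # True: plain prose
print(passes_quality_heuristics("$$$ %% !! @@ ##"))  # False: short and symbol-heavy
```

Real pipelines (RefinedWeb, Dolma, FineWeb) use dozens of such signals, tuned empirically and often combined with trained classifiers.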
Hands-on exercise: Download a small Common Crawl WARC file and try extracting clean text from it. You'll immediately understand why data cleaning is a multi-billion-dollar problem.
3. Tokenization
Computers don't understand words. They understand numbers. Tokenization is the process of converting text into a sequence of integer IDs that the model can process.
Why Not Just Use Characters or Words?
| Approach | Problem |
|---|---|
| Character-level | Sequences become extremely long. "artificial intelligence" = 23 characters. The model needs to learn spelling from scratch. Very slow to train. |
| Word-level | Vocabulary explodes. Every misspelling, conjugation, and compound word needs its own entry. Out-of-vocabulary words become a constant problem. |
| Subword | The sweet spot. Common words stay whole ("the", "is"), rare words get split into meaningful pieces ("un" + "believ" + "able"). Fixed vocabulary, handles any input. |
Byte-Pair Encoding (BPE)
BPE is the most widely used tokenization algorithm (used by GPT, LLaMA, and most modern LLMs). Here's how it works:
Training the tokenizer:
1. Start with a vocabulary of individual bytes (256 entries)
2. Scan the training corpus and find the most frequent pair of adjacent tokens
3. Merge that pair into a new token and add it to the vocabulary
4. Repeat steps 2-3 until you reach your desired vocabulary size (typically 32K-128K tokens)
Example of BPE in action:
Input: "lowest lower newest"
Step 0: ['l','o','w','e','s','t',' ','l','o','w','e','r',' ','n','e','w','e','s','t']
Step 1: merge (e,s) → es: ['l','o','w','es','t',' ','l','o','w','e','r',' ','n','e','w','es','t']
Step 2: merge (es,t) → est: ['l','o','w','est',' ','l','o','w','e','r',' ','n','e','w','est']
Step 3: merge (l,o) → lo: ['lo','w','est',' ','lo','w','e','r',' ','n','e','w','est']
Step 4: merge (lo,w) → low: ['low','est',' ','low','e','r',' ','n','e','w','est']
...and so on
Tokenization at inference time:
Once trained, the tokenizer applies the learned merges in order to encode any new text.
"lowest" → [low, est] → [4521, 382]
"highest" → [high, est] → [9301, 382]
"cat" → [cat] → [2163]
"Pokémon" → [Pok, é, mon] → [51, 8948, 1711]
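The training and encoding steps above can be sketched in a few lines of Python. This toy version works per word on characters rather than raw bytes, and breaks frequency ties arbitrarily, so its merge order can differ from the worked example:

```python
from collections import Counter

def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with the merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(words, num_merges):
    """Learn a merge list from a toy corpus (steps 1-4 above, per word)."""
    vocab = [list(w) for w in words]           # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in vocab:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent adjacent pair
        merges.append(best)
        vocab = [merge_pair(toks, best) for toks in vocab]
    return merges

def encode(word, merges):
    """Tokenize new text by replaying the learned merges in order."""
    tokens = list(word)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens

merges = train_bpe(["lowest", "lower", "newest"], num_merges=4)
print(merges[0])                 # ('w', 'e') is the most frequent pair here
print(encode("lowest", merges))  # ['lowe', 'st']
```

Production tokenizers add byte-level fallback, pre-tokenization rules, and much faster merge lookups, but the core loop is exactly this.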
Other Tokenization Methods
| Method | Used By | Key Difference |
|---|---|---|
| BPE | GPT-2/3/4, LLaMA, Mistral | Merges most frequent pairs. Industry standard. |
| WordPiece | BERT, DistilBERT | Similar to BPE but uses likelihood instead of frequency for merges. |
| Unigram | T5, ALBERT, SentencePiece | Starts with a large vocabulary and prunes down. Can output multiple tokenizations with probabilities. |
| SentencePiece | LLaMA, T5, many multilingual models | Language-agnostic. Treats the input as a raw byte stream — no need for pre-tokenization rules. |
Why Tokenization Matters for AI Engineers
- Token limits are not word limits. "I don't know" is 3 words but might be 3-5 tokens depending on the tokenizer. When an API says "128K context," that's tokens, not words.
- Cost is per token. API pricing is based on token count. Efficient prompts = lower cost.
- Different models use different tokenizers. You can't assume token counts are portable across models.
- Tokenization artifacts. Some models struggle with simple arithmetic because numbers get tokenized inconsistently ("380" might be [3, 80] or [380] depending on context).
Hands-on exercise: Use OpenAI's tiktoken or HuggingFace's tokenizers library to tokenize the same sentence with different model tokenizers. Compare the results — you'll see surprisingly large differences.
4. Architecture: Neural Networks and Transformers
This is the core of the model. We'll go from first principles to the architectures you'll work with daily.
Neural Networks in 60 Seconds
A neural network is a function that maps inputs to outputs through layers of weighted connections.
Input → [Layer 1] → [Layer 2] → ... → [Layer N] → Output
Each layer takes a vector of numbers, multiplies by a weight matrix, adds a bias, and applies a non-linear activation function. Training adjusts the weights to minimize a loss function (the difference between predicted and actual outputs).
For language models, the input is a sequence of token embeddings (vectors that represent tokens) and the output is a probability distribution over the vocabulary for the next token.
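A minimal sketch of that input-to-distribution path, with random weights standing in for a trained model (the vocabulary and dimension sizes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8

# Learned lookup table: one d_model-dimensional vector per token ID
embedding = rng.normal(size=(vocab_size, d_model))
# Output projection ("unembedding"): hidden vector -> vocabulary logits
W_out = rng.normal(size=(d_model, vocab_size))

token_ids = np.array([3, 1, 4, 1, 5])       # input sequence of token IDs
hidden = embedding[token_ids]               # (seq_len, d_model) embeddings

# A real model would apply many Transformer blocks to `hidden` here.
logits = hidden[-1] @ W_out                 # logits for the NEXT token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                        # softmax over the vocabulary

print(probs.shape)                          # (10,): one probability per token
```

Everything a Transformer adds happens between the embedding lookup and the final projection; the input and output contracts stay exactly this.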
The Transformer Architecture
The Transformer (Vaswani et al., 2017, "Attention Is All You Need") is the architecture behind every modern LLM. Here's why it matters and how it works.
The key innovation: Self-Attention
Before Transformers, models processed text sequentially (RNNs, LSTMs). Word 50 had to wait for words 1-49 to be processed first. This was slow and made it hard to capture long-range dependencies.
Self-attention lets every token attend to every other token in parallel. The model can directly connect "it" to "the dog" even if they're 200 tokens apart.
How self-attention works:
For each token, the model computes three vectors from the token's embedding:
- Query (Q) — "What am I looking for?"
- Key (K) — "What do I contain?"
- Value (V) — "What information do I provide?"
The attention score between two tokens is the dot product of one token's Query with another's Key, scaled and softmaxed. The output is a weighted sum of the Values.
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Multi-Head Attention runs multiple attention operations in parallel (e.g., 32 or 64 heads), each learning to attend to different types of relationships — one head might learn syntax, another semantics, another coreference.
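The attention formula above can be written directly in NumPy. This sketch uses random weights, toy sizes, and adds the causal mask that decoder-only GPT-style models use (covered below):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V, causal."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq, seq) similarity matrix
    # causal mask: token i may only attend to tokens j <= i
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax per row
    return weights @ V, weights

rng = np.random.default_rng(0)
seq, d = 4, 8
x = rng.normal(size=(seq, d))                 # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, w = attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)   # (4, 8)
print(w[0])        # first token can only attend to itself: [1. 0. 0. 0.]
```

Multi-head attention simply runs this function h times with smaller per-head dimensions and concatenates the outputs.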
Full Transformer block:
Input
↓
Multi-Head Self-Attention + Residual Connection + Layer Norm
↓
Feed-Forward Network (2 linear layers with activation) + Residual Connection + Layer Norm
↓
Output
A GPT-style model stacks 32-96+ of these blocks. More blocks = more capacity = more parameters.
The GPT Family
GPT (Generative Pre-trained Transformer) models are decoder-only Transformers. They use causal (masked) self-attention — each token can only attend to tokens before it, not after. This makes them natural text generators: they predict one token at a time, left to right.
| Model | Parameters | Training Data | Context Length | Key Innovation |
|---|---|---|---|---|
| GPT-2 (2019) | 1.5B | WebText (40GB) | 1,024 tokens | Showed scaling works. Released with "too dangerous" controversy. |
| GPT-3 (2020) | 175B | 300B tokens | 2,048 tokens | Few-shot learning via prompting. No fine-tuning needed for many tasks. |
| GPT-3.5 (2022) | ~175B | + RLHF training | 4,096 tokens | InstructGPT + ChatGPT. First model to feel "useful" to the public. |
| GPT-4 (2023) | Undisclosed (rumored MoE) | Undisclosed | 8K / 32K / 128K tokens | Multimodal (vision), dramatically better reasoning. |
| GPT-4o (2024) | Undisclosed | Undisclosed | 128K tokens | Natively multimodal (text, vision, audio), faster, cheaper. |
The LLaMA Family (Open-Weight Models)
Meta's LLaMA family democratized large language models by releasing model weights to the research community.
| Model | Parameters | Training Data | Key Innovation |
|---|---|---|---|
| LLaMA (2023) | 7B, 13B, 33B, 65B | 1.4T tokens | Showed smaller models trained on more data beat larger models. |
| LLaMA 2 (2023) | 7B, 13B, 70B | 2T tokens | Open commercial license. Grouped-Query Attention (GQA) for faster inference. |
| LLaMA 3 (2024) | 8B, 70B | 15T tokens | Massive data scaling. Larger vocabulary (128K tokens). |
| LLaMA 3.1 (2024) | 8B, 70B, 405B | 15T+ tokens | 128K context. Tool use. The 405B model competes with GPT-4 class models. |
Architectural improvements in LLaMA vs. original GPT:
- RMSNorm instead of LayerNorm (simpler, equally effective)
- Rotary Position Embeddings (RoPE) instead of learned position embeddings (better extrapolation to longer sequences)
- SwiGLU activation instead of ReLU in the feed-forward layers (better performance)
- Grouped-Query Attention (GQA) — shares Key/Value heads across multiple Query heads, reducing memory during inference without hurting quality
Other Notable Architectures
| Model Family | Creator | Key Feature |
|---|---|---|
| Mistral / Mixtral | Mistral AI | Sliding window attention + Mixture of Experts (MoE). Mixtral 8x7B uses 8 expert FFNs and routes each token to 2 of them — only 13B active parameters with 47B total. |
| Claude | Anthropic | Constitutional AI training. Strong reasoning. Details undisclosed. |
| Gemini | Google DeepMind | Natively multimodal from the ground up (not a bolted-on vision encoder). |
| DeepSeek | DeepSeek | Open-weight MoE models. DeepSeek-V2 introduced Multi-head Latent Attention (MLA) for extremely efficient KV cache. |
| Phi | Microsoft | Small models (1.3B-14B) trained on high-quality "textbook" data. Shows that data quality can compensate for parameter count. |
| Qwen | Alibaba | Strong multilingual performance, especially for Chinese + English. Competitive with LLaMA at equivalent sizes. |
5. Text Generation: How LLMs Actually Produce Output
The model outputs a probability distribution over its vocabulary for the next token. But how do you turn probabilities into text? This is the decoding strategy, and it has a huge impact on output quality.
Greedy Search
Pick the highest-probability token at every step.
P("the") = 0.4, P("a") = 0.3, P("my") = 0.2, ...
→ Pick "the"
Pros: Fast, deterministic. Cons: Repetitive, boring, often gets stuck in loops ("the the the..."). Misses better sequences where a lower-probability early token leads to higher overall probability.
Beam Search
Maintain the top-k sequences (beams) at each step and pick the highest-scoring complete sequence.
Beam 1: "The cat sat on" (score: -2.3)
Beam 2: "A dog ran to" (score: -2.5)
Beam 3: "The cat ran on" (score: -2.7)
→ Continue expanding all three, prune to top k
Pros: Finds higher-probability sequences than greedy. Good for translation and summarization. Cons: Still tends toward generic, safe outputs. Computationally expensive. Not great for creative or conversational text.
Temperature Sampling
Scale the logits (raw model outputs) by a temperature value before applying softmax. Then sample from the resulting distribution.
temperature = 0.0 → Greedy (always pick the top token)
temperature = 0.7 → Mild randomness (good default for most tasks)
temperature = 1.0 → Sample directly from the model's distribution
temperature = 1.5 → Very random, creative, potentially incoherent
Lower temperature = more focused, deterministic, repetitive. Higher temperature = more creative, diverse, potentially nonsensical.
Top-k Sampling
Only consider the top-k most probable tokens. Redistribute their probabilities and sample.
k = 50: Consider the top 50 tokens at each step
k = 10: More focused
k = 1: Greedy search
Problem: A fixed k doesn't adapt. Sometimes the model is very confident (the top 3 tokens cover 95% of the probability mass, so k=50 lets low-quality tokens into the pool). Sometimes the model is uncertain (the top 50 tokens cover only 60%, so k=50 may cut off good options).
Top-p (Nucleus) Sampling
Instead of a fixed count, include the smallest set of tokens whose cumulative probability exceeds p.
p = 0.9: Include tokens until their probabilities sum to 0.9
p = 0.5: More focused
p = 1.0: Consider all tokens (reduces to plain temperature sampling)
This adapts to the model's confidence. When the model is confident, only a few tokens are considered. When it's uncertain, more tokens are included.
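The three controls can be combined in one sampling function: temperature scaling first, then a top-k cut, then a top-p cut. This is a sketch; real implementations differ in tie-breaking and the order of renormalization:

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9, rng=None):
    """Sample a token ID from next-token logits (sketch, not production code)."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:                       # convention: greedy decoding
        return int(np.argmax(logits))
    logits = logits / temperature              # temperature scaling
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # token IDs, most probable first
    probs = probs[order]
    probs[top_k:] = 0                          # top-k: drop everything past rank k
    cutoff = np.searchsorted(np.cumsum(probs), top_p) + 1
    probs[cutoff:] = 0                         # top-p: keep the smallest nucleus >= p
    probs /= probs.sum()                       # renormalize the survivors
    return int(order[rng.choice(len(probs), p=probs)])

print(sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=0.0))  # 0 (greedy)
```

With temperature > 0 the result varies run to run, which is exactly the point: the same prompt can yield different continuations.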
In Practice: Combining Strategies
Most production systems combine temperature + top-p:
# Typical chat configuration
response = client.chat.completions.create(
model="gpt-4",
messages=messages,
temperature=0.7,
top_p=0.9,
)

| Use Case | Temperature | Top-p | Why |
|---|---|---|---|
| Code generation | 0.0-0.2 | 1.0 | Correctness matters. Low randomness. |
| Chat / conversation | 0.7 | 0.9 | Natural, varied, but coherent. |
| Creative writing | 0.9-1.2 | 0.95 | More surprising word choices. |
| Factual Q&A | 0.0-0.3 | 1.0 | Accuracy over creativity. |
Repetition Penalty and Other Controls
- Repetition penalty: Reduces the probability of tokens that have already appeared. Prevents "the the the" loops.
- Frequency penalty: Penalizes tokens proportionally to how often they've appeared. Encourages vocabulary diversity.
- Presence penalty: Penalizes any token that has appeared at all (binary). Encourages topic diversity.
- Stop sequences: Halt generation when specific strings are produced (e.g., "\n\nHuman:" in a chatbot).
- Max tokens: Hard cap on output length.
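A sketch of how frequency and presence penalties might be applied to next-token logits before sampling. The exact formula varies by implementation, and the penalty values here are made up:

```python
import numpy as np

def apply_penalties(logits, generated_ids,
                    frequency_penalty=0.5, presence_penalty=0.3):
    """Subtract penalties from the logits of already-generated tokens (sketch)."""
    logits = np.asarray(logits, dtype=float).copy()
    ids, counts = np.unique(generated_ids, return_counts=True)
    # frequency penalty scales with how often a token already appeared;
    # presence penalty is a flat cost for having appeared at all
    logits[ids] -= frequency_penalty * counts + presence_penalty
    return logits

logits = np.array([1.0, 1.0, 1.0, 1.0])
penalized = apply_penalties(logits, generated_ids=[2, 2, 3])
print(penalized)   # token 2 penalized twice over, token 3 once
```

After this adjustment, sampling proceeds as usual, so repeated tokens become progressively less likely instead of being forbidden outright.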
Part II: Post-Training — Making the Model Useful
Pre-training gives you a model that can predict text. But a raw pre-trained model is like a brilliant student who's read every book in the library but has never had a conversation. It will complete your prompt, but it won't answer your question.
Post-training bridges this gap. It transforms a text predictor into an assistant.
1. Supervised Fine-Tuning (SFT)
SFT is conceptually simple: show the model examples of good behavior and train it to mimic them.
Training data format:
{
"messages": [
{"role": "user", "content": "Explain quantum entanglement simply."},
{"role": "assistant", "content": "Imagine you have two coins that are magically linked..."}
]
}

You collect thousands to hundreds of thousands of these (prompt, ideal response) pairs. The model is trained to maximize the probability of the ideal response given the prompt.
Where do the examples come from?
| Source | Description | Quality |
|---|---|---|
| Human annotators | Paid contractors write ideal responses. Expensive but high quality. | Highest |
| Distillation | Use a stronger model (GPT-4) to generate training data for a smaller model. | High |
| Open datasets | OpenAssistant, Dolly, ShareGPT, UltraChat. Free but variable quality. | Variable |
| Synthetic generation | Use the model itself + filtering to generate training data. Self-play. | Medium-High |
What SFT teaches:
- Follow instructions ("Write a poem about..." → actually writes a poem)
- Adopt a helpful persona (answers questions rather than continuing the prompt)
- Format outputs properly (markdown, code blocks, numbered lists)
- Refuse harmful requests (though this is crude without RL)
Limitations of SFT:
SFT alone produces a model that imitates the training examples. It doesn't learn why some responses are better than others. It can't generalize the concept of "helpfulness" beyond the specific examples it's seen. This is where reinforcement learning comes in.
2. Reinforcement Learning and RLHF
Reinforcement Learning from Human Feedback (RLHF) teaches the model to optimize for human preferences rather than just imitating examples.
The RLHF Pipeline
Step 1: Train a Reward Model (RM)
Human annotators rank model outputs from best to worst
→ Train a model to predict these rankings (the reward model)
Step 2: Optimize the LLM using RL
The LLM generates responses
→ The reward model scores them
→ The LLM is updated to produce higher-scoring responses
→ A KL penalty prevents the model from drifting too far from the SFT baseline
Step 1: Reward Models
A reward model takes a (prompt, response) pair and outputs a scalar score representing quality.
Training data: Human annotators are shown the same prompt with 2-4 different model responses. They rank them from best to worst. The reward model is trained on these comparisons.
Prompt: "What is the capital of France?"
Response A: "The capital of France is Paris." (Rank 1 - best)
Response B: "Paris is a city in Europe." (Rank 2)
Response C: "France is a country." (Rank 3 - worst)
The reward model learns to assign: score(A) > score(B) > score(C)
What the reward model captures:
- Helpfulness (did it answer the question?)
- Harmlessness (did it avoid dangerous content?)
- Honesty (did it avoid making things up?)
- Formatting quality, tone, detail level
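Ranked comparisons like the example above are typically turned into a pairwise (Bradley-Terry-style) loss: the reward model is pushed to score the preferred response above the rejected one. A sketch with made-up scores:

```python
import numpy as np

def pairwise_ranking_loss(score_chosen, score_rejected):
    """Bradley-Terry-style loss: -log(sigmoid(score_chosen - score_rejected)).

    Low when the reward model already ranks the chosen response higher.
    """
    margin = score_chosen - score_rejected
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))

# Hypothetical scores the reward model assigns to responses A > B > C
s_a, s_b, s_c = 2.1, 0.4, -1.3
loss = pairwise_ranking_loss(s_a, s_b) + pairwise_ranking_loss(s_b, s_c)
print(loss)   # small, since the model's scores already agree with the ranking
```

Training minimizes this loss over many human-labeled comparison pairs, which is how a scalar "quality" score emerges from purely relative judgments.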
Step 2: Policy Optimization with PPO
Proximal Policy Optimization (PPO) is the most common RL algorithm used for RLHF. Here's the intuition:
- Generate: The LLM (called the "policy") generates a response to a prompt
- Score: The reward model scores the response
- Update: Adjust the LLM's weights to increase the probability of high-scoring responses
- Constrain: A KL divergence penalty prevents the model from changing too much in a single step (which would cause instability or "reward hacking")
Objective = E[reward(response)] - β * KL(policy || reference_policy)
The β term is crucial — without it, the model quickly learns to exploit quirks of the reward model rather than genuinely improving.
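A sketch of that objective for a single sampled response, using a simple per-token estimate of the KL term; all numbers are hypothetical:

```python
import numpy as np

def rlhf_objective(reward, logp_policy, logp_reference, beta=0.1):
    """KL-penalized RLHF objective for one sampled response (sketch).

    The KL term is estimated from per-token log-probs of the response
    under the current policy and the frozen SFT reference model.
    """
    kl_estimate = float(np.sum(np.asarray(logp_policy) - np.asarray(logp_reference)))
    return reward - beta * kl_estimate

# Hypothetical numbers: a high-reward response that drifted from the reference
logp_policy    = [-0.2, -0.1, -0.3]
logp_reference = [-1.0, -0.9, -1.1]
print(rlhf_objective(reward=2.0, logp_policy=logp_policy,
                     logp_reference=logp_reference, beta=0.1))
```

Raising β trades reward for staying close to the SFT model; setting it to zero invites reward hacking, as the text notes.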
Verifiable Tasks and Process Reward Models
A newer trend moves away from pure human-preference reward models toward verifiable rewards — tasks where the answer can be checked automatically.
| Approach | How It Works | Example |
|---|---|---|
| Outcome Reward Models (ORM) | Score the final answer only. Binary: right or wrong. | Math: is 2+2=4? Correct! |
| Process Reward Models (PRM) | Score each reasoning step individually. | "Step 1: correct. Step 2: correct. Step 3: wrong." |
| Verifiable tasks | Use tasks with known answers as training signal. No human annotation needed. | Code that passes test cases, math with known solutions. |
Why this matters: Human preference annotation is expensive, slow, and subjective. Verifiable tasks provide unlimited, objective training signal. DeepSeek-R1 and OpenAI's o1/o3 models heavily use this approach for reasoning.
Alternatives to PPO
| Method | Description | Advantage |
|---|---|---|
| DPO (Direct Preference Optimization) | Skips the reward model entirely. Directly optimizes the LLM using preference pairs. Much simpler pipeline. | No reward model needed. Fewer hyperparameters. Stable training. |
| REINFORCE | Classic policy gradient. Simpler than PPO but higher variance. | Simplicity. |
| GRPO (Group Relative Policy Optimization) | Used by DeepSeek. Groups responses and uses relative ranking within the group as the reward signal. | No separate reward model. Works well for reasoning tasks. |
| KTO (Kahneman-Tversky Optimization) | Uses binary feedback (good/bad) instead of ranked comparisons. Inspired by prospect theory. | Easier to collect binary feedback than rankings. |
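As an illustration of why DPO's pipeline is simpler, its loss for one preference pair needs only log-probabilities from the policy and the frozen reference model, with no reward model in sight. The numbers below are hypothetical:

```python
import numpy as np

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair (sketch).

    logp_* are summed log-probs of the chosen (w) and rejected (l) responses
    under the policy being trained and the frozen reference model.
    """
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))  # -log(sigmoid(margin))

# Case 1: the policy already prefers the chosen response -> low loss
loss_good = dpo_loss(logp_w_policy=-5.0, logp_l_policy=-9.0,
                     logp_w_ref=-6.0, logp_l_ref=-6.0)
# Case 2: the policy prefers the rejected response -> high loss
loss_bad = dpo_loss(logp_w_policy=-9.0, logp_l_policy=-5.0,
                    logp_w_ref=-6.0, logp_l_ref=-6.0)
print(loss_good < loss_bad)   # True
```

Minimizing this directly pushes the policy toward chosen responses and away from rejected ones, relative to the reference model, which is what the reward model + PPO combination achieves far more elaborately.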
3. The Full Post-Training Pipeline in Practice
Modern post-training is multi-stage:
Pre-trained Model
↓
SFT on instruction-following data
↓
RLHF/DPO on human preferences (helpfulness)
↓
Safety training (refusals, harmlessness)
↓
Specialized RL on verifiable tasks (math, code, reasoning)
↓
Final model
Each stage builds on the previous one. Skip SFT, and RLHF doesn't work well; skip RLHF, and the model follows instructions but stays unrefined. The order matters.
Part III: Evaluation — How Do You Know If Your Model Is Good?
Building a model is one thing. Knowing whether it's actually good is harder than it sounds.
1. Traditional NLP Metrics
These come from the pre-LLM era but are still used for specific tasks:
| Metric | What It Measures | Used For | Limitation |
|---|---|---|---|
| Perplexity | How surprised the model is by the test data. Lower = better. | Language modeling quality | Doesn't measure usefulness or factuality |
| BLEU | N-gram overlap between generated text and reference text | Translation, summarization | A correct paraphrase can score 0. Doesn't capture meaning. |
| ROUGE | Recall-oriented n-gram overlap | Summarization | Same problems as BLEU |
| F1 Score | Precision/recall balance for extracted answers | Question answering, NER | Only works for tasks with clear correct answers |
| Exact Match | Binary — did the model produce the exact correct answer? | QA, classification | Too strict. "Paris" and "The answer is Paris" both fail. |
The fundamental problem: These metrics measure surface-level text similarity, not whether the response is actually helpful, accurate, or well-written. This is why benchmarks and human evaluation exist.
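Perplexity is the one metric in the table you can compute in a single line, given the per-token log-probabilities a model assigns to held-out text. The log-probs below are hypothetical:

```python
import numpy as np

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token). Lower is better."""
    return float(np.exp(-np.mean(token_logprobs)))

# Hypothetical per-token log-probs on a held-out sentence
confident = [-0.1, -0.2, -0.1, -0.3]    # model found the text predictable
surprised = [-2.0, -3.1, -2.5, -2.8]    # model found the text surprising
print(perplexity(confident))   # ~1.19
print(perplexity(surprised))   # ~13.46
```

Note the caveat from the table: a model can have excellent perplexity while still being unhelpful or untruthful, which is exactly why the benchmarks below exist.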
2. Task-Specific Benchmarks
Benchmarks provide standardized tasks with known correct answers. Here are the ones that matter:
Reasoning and Knowledge
| Benchmark | What It Tests | Format | Why It Matters |
|---|---|---|---|
| MMLU | Massive Multitask Language Understanding. 57 subjects from elementary to professional level. | Multiple choice | The most-cited general knowledge benchmark. Covers STEM, humanities, social sciences, and more. |
| ARC | AI2 Reasoning Challenge. Grade-school science questions. | Multiple choice | Tests scientific reasoning. ARC-Challenge subset is genuinely hard. |
| HellaSwag | Sentence completion requiring commonsense reasoning. | Multiple choice | Tests whether the model understands how everyday situations unfold. |
| Winogrande | Pronoun resolution requiring world knowledge. | Binary choice | "The trophy didn't fit in the suitcase because it was too big." What was too big? |
| TruthfulQA | Questions where common misconceptions lead to wrong answers. | Open-ended + multiple choice | Tests whether the model gives truthful answers vs. popular-but-wrong ones. |
| BoolQ | Yes/no questions based on a passage. | Boolean | Tests reading comprehension. |
Math and Code
| Benchmark | What It Tests | Format |
|---|---|---|
| GSM8K | Grade-school math word problems requiring multi-step reasoning. | Open-ended (numerical answer) |
| MATH | Competition-level mathematics (AMC, AIME difficulty). | Open-ended |
| HumanEval | Python function completion. 164 problems with test cases. | Code generation |
| MBPP | Mostly Basic Python Problems. Simpler than HumanEval. | Code generation |
| SWE-bench | Real GitHub issues. The model must write a patch that resolves the issue and passes tests. | Code patch |
Conversation and Instruction Following
| Benchmark | What It Tests | Format |
|---|---|---|
| MT-Bench | Multi-turn conversation quality. 80 questions across 8 categories. | Open-ended, scored by GPT-4 |
| AlpacaEval | Instruction following quality. Compared against a reference model. | Open-ended, LLM-as-judge |
| IFEval | Instruction following with verifiable constraints ("write exactly 3 paragraphs," "use no commas"). | Open-ended with automated checks |
Safety
| Benchmark | What It Tests |
|---|---|
| BBQ | Bias Benchmark for QA — tests for social biases |
| ToxiGen | Toxic content generation across demographics |
| RealToxicityPrompts | How often the model generates toxic continuations |
| XSTest | Whether safety filters over-trigger on benign prompts |
3. Human Evaluation and Leaderboards
Benchmarks have a fundamental limitation: they can be gamed. A model can be trained specifically to score well on MMLU without being generally capable. This is why human evaluation matters.
Chatbot Arena (LMSYS)
The gold standard for LLM evaluation. Real users have conversations with two anonymous models side-by-side and vote for the better response. Results are aggregated into an Elo rating system (like chess).
Why it's important:
- Real users, real tasks, real preferences
- Models are anonymous — no brand bias
- Elo ratings are continuously updated with new votes
- Widely considered the most reliable LLM ranking
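The Elo mechanism behind those rankings is a simple update after each head-to-head vote. This sketch uses the standard chess constants (K-factor 32, scale 400); the arena's actual implementation differs in details:

```python
def elo_update(rating_a, rating_b, a_won, k=32):
    """One Elo update after a single A-vs-B vote (chess-style constants)."""
    # expected score of A given the current rating gap
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# An upset: the lower-rated model wins, so it gains the larger share of points
a, b = elo_update(1200, 1400, a_won=True)
print(round(a), round(b))   # 1224 1376
```

Because expected scores depend on the rating gap, beating a strong model moves you far more than beating a weak one, which is what makes the aggregate ranking informative.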
Human Evaluation Practices
| Method | Description | When to Use |
|---|---|---|
| Side-by-side comparison | Show two model outputs, ask which is better | Ranking models against each other |
| Likert scale rating | Rate individual outputs on a 1-5 scale for specific criteria | Measuring specific qualities (helpfulness, accuracy, tone) |
| Red teaming | Humans actively try to make the model fail or produce harmful outputs | Safety evaluation before deployment |
| Task completion | Measure whether humans can accomplish real tasks using the model | End-to-end usefulness evaluation |
LLM-as-Judge
Using a strong model (e.g., GPT-4, Claude) to evaluate outputs from other models. Faster and cheaper than human evaluation, but introduces the evaluated model's biases.
Common patterns:
- Position bias: tends to prefer the first response shown
- Verbosity bias: tends to prefer longer responses
- Self-preference: models tend to rate their own outputs higher
Mitigation: Run evaluations in both orders and average. Use specific rubrics. Combine with human eval for calibration.
Part IV: Chatbot Design — The Full System
The model is just one component. A production chatbot is a system with multiple layers.
System Architecture
┌─────────────────────────────────────┐
│ User Interface │
│ (Web app, API, mobile, CLI) │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ Application Layer │
│ - Conversation management │
│ - System prompt injection │
│ - Tool/function calling router │
│ - Rate limiting & auth │
│ - Content filtering (input) │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ Model Layer │
│ - LLM inference (local or API) │
│ - Decoding parameters │
│ - Context window management │
│ - Streaming response │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ Post-Processing │
│ - Output filtering (safety) │
│ - Citation extraction │
│ - Format validation │
│ - Tool call execution │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ Memory & Context │
│ - Conversation history storage │
│ - RAG retrieval (next post!) │
│ - Long-term memory │
│ - User preferences │
└─────────────────────────────────────┘
Key Design Decisions
System Prompts
The system prompt defines the model's persona, capabilities, and constraints. It's the most important piece of prompt engineering in a chatbot.
You are a helpful customer support agent for Acme Corp.
You can help with: billing, account issues, product questions.
You cannot: process refunds directly, access payment info, make promises about future features.
Always be polite. If unsure, say so and offer to escalate to a human agent.
Best practices:
- Be specific about what the model should and shouldn't do
- Include examples of ideal responses
- Define the tone and personality
- Specify how to handle edge cases (unknown questions, off-topic requests)
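In the architecture above, "system prompt injection" lives in the application layer: the UI stores only user/assistant turns, and the backend prepends the system prompt on every request. A minimal sketch (the helper name is illustrative):

```typescript
type Role = "system" | "user" | "assistant";
interface Message { role: Role; content: string }

// The system prompt always goes first and is never part of stored history,
// so you can change it between requests without rewriting the conversation.
function buildRequestMessages(systemPrompt: string, history: Message[]): Message[] {
  return [{ role: "system", content: systemPrompt }, ...history];
}
```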
Conversation History Management
LLMs have finite context windows. Long conversations must be managed:
| Strategy | How It Works | Trade-off |
|---|---|---|
| Truncation | Drop the oldest messages when the context is full | Simple but loses important early context |
| Summarization | Periodically summarize older messages into a compact form | Preserves key info but lossy |
| Sliding window | Keep the system prompt + last N messages | Predictable memory use, but everything before the window is forgotten |
| RAG on history | Embed and retrieve relevant past messages | Best retention but adds complexity and latency |
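The sliding-window strategy from the table fits in a few lines. This is a sketch with a hypothetical helper, not a library API — it keeps every system message plus the last N conversational turns:

```typescript
interface Msg { role: "system" | "user" | "assistant"; content: string }

// Keep system messages (the persona must survive) plus the last N turns.
function slidingWindow(messages: Msg[], keepLast: number): Msg[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  return [...system, ...rest.slice(-keepLast)];
}
```

A production version would count tokens rather than messages, since message lengths vary wildly.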
Tool Use / Function Calling
Modern chatbots aren't just text generators — they can take actions:
{
"type": "function",
"function": {
"name": "search_knowledge_base",
"description": "Search the company knowledge base for relevant articles",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Search query"}
}
}
}
}

The model decides when to call a tool, what arguments to pass, and then incorporates the result into its response. This is how chatbots search the web, query databases, send emails, and interact with external systems.
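That decide-execute-incorporate cycle is a loop in the application layer. The sketch below uses simplified stand-ins for the model call and tool registry (real provider APIs return richer objects), but the control flow is the core idea:

```typescript
// The model either answers or requests a tool call.
type ModelTurn =
  | { type: "answer"; text: string }
  | { type: "tool_call"; name: string; args: Record<string, string> };

function chatWithTools(
  callModel: (history: string[]) => ModelTurn,
  tools: Record<string, (args: Record<string, string>) => string>,
  userMessage: string
): string {
  const history = [userMessage];
  for (let i = 0; i < 5; i++) {                      // cap iterations to avoid infinite loops
    const turn = callModel(history);
    if (turn.type === "answer") return turn.text;    // final response for the user
    const result = tools[turn.name](turn.args);      // execute the requested tool
    history.push(`tool:${turn.name} -> ${result}`);  // feed the result back to the model
  }
  throw new Error("too many tool calls");
}
```

The iteration cap matters: a confused model can request tools forever, and you pay for every round trip.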
Streaming
Users expect to see text appear word-by-word, not wait 10 seconds for a complete response. Streaming via Server-Sent Events (SSE) is standard:
data: {"choices":[{"delta":{"content":"The"}}]}
data: {"choices":[{"delta":{"content":" capital"}}]}
data: {"choices":[{"delta":{"content":" of"}}]}
data: {"choices":[{"delta":{"content":" France"}}]}
...
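On the client side, each `data:` line is parsed and its `delta.content` appended to the visible message. A minimal parser for the event shape shown above (real clients also handle the terminating `[DONE]` sentinel and malformed lines):

```typescript
// Extract the text delta from one OpenAI-style SSE line, or null if none.
function parseSSELine(line: string): string | null {
  if (!line.startsWith("data: ")) return null;
  const payload = line.slice("data: ".length);
  if (payload === "[DONE]") return null;          // end-of-stream sentinel
  const json = JSON.parse(payload);
  return json.choices?.[0]?.delta?.content ?? null;
}
```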
Safety Layers
Production chatbots use multiple safety layers:
- Input filtering — Block or flag harmful prompts before they reach the model
- System prompt guardrails — Instructions in the system prompt about what to refuse
- Output filtering — Scan generated text for harmful content before showing it to the user
- Rate limiting — Prevent abuse through request limits
- Human escalation — Route difficult or sensitive conversations to human agents
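The input-filtering layer can start as something as crude as a denylist check that runs before the prompt ever reaches the model. This is purely illustrative — production systems use trained classifier models, not regexes — but it shows where the check sits in the pipeline:

```typescript
// Hypothetical denylist: flag obvious prompt-injection attempts.
const BLOCKED_PATTERNS = [/ignore (all )?previous instructions/i, /reveal your system prompt/i];

function inputAllowed(prompt: string): boolean {
  return !BLOCKED_PATTERNS.some((p) => p.test(prompt));
}
```

Blocked inputs are typically logged and answered with a canned refusal rather than silently dropped.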
Part V: Build Your Own LLM Playground
Now that you understand how everything works, let's talk about what you should actually build.
What Is an LLM Playground?
An LLM playground is a web interface where you can:
- Send prompts to different LLM providers (OpenAI, Anthropic, open-source models)
- Adjust generation parameters (temperature, top-p, max tokens)
- Compare outputs from different models side by side
- Experiment with system prompts
- View token counts and costs
- Save and share conversations
Architecture for Your Playground
┌────────────────────────────────────────────┐
│ Frontend (React/Next.js) │
│ - Chat interface with streaming │
│ - Parameter controls (sliders, dropdowns) │
│ - Model selector │
│ - Token counter │
│ - Conversation history │
└─────────────────┬──────────────────────────┘
│
┌─────────────────▼──────────────────────────┐
│ Backend (Node.js / Python) │
│ - Unified API router for multiple LLMs │
│ - API key management │
│ - Request/response logging │
│ - Cost tracking │
└─────────────────┬──────────────────────────┘
│
       ┌───────────┼───────────┐
       ▼           ▼           ▼
  ┌─────────┐ ┌─────────┐ ┌─────────┐
  │ OpenAI  │ │Anthropic│ │ Ollama  │
  │   API   │ │   API   │ │ (local) │
  └─────────┘ └─────────┘ └─────────┘
Features to Implement (in order)
Phase 1: Core Chat
1. Basic chat interface with a single model (start with OpenAI or Anthropic)
2. Streaming responses using Server-Sent Events
3. System prompt input field
4. Temperature and max token controls
Phase 2: Multi-Model
5. Add a second provider (e.g., Anthropic if you started with OpenAI)
6. Model selector dropdown
7. Side-by-side comparison mode
Phase 3: Power Features
8. Token counter and cost estimator
9. Conversation history with save/load
10. Preset system prompts (creative writer, code reviewer, tutor, etc.)
11. Local model support via Ollama (run LLaMA, Mistral, etc. locally)
Phase 4: Advanced
12. Function/tool calling playground
13. Logprobs visualization (see the model's confidence for each token)
14. Prompt templates with variables
15. Export conversations as JSON/Markdown
Getting Started: Minimal Viable Playground
Here's the simplest possible starting point — a streaming chat with parameter controls:
// Core: unified model interface
interface LLMProvider {
name: string;
chat(params: ChatParams): AsyncIterable<string>;
}
interface ChatParams {
model: string;
messages: Message[];
temperature: number;
topP: number;
maxTokens: number;
systemPrompt?: string;
}

Key learning outcomes from building this:
- How streaming APIs work (SSE, chunked transfer encoding)
- How different providers' APIs differ (and how to abstract over them)
- How parameters like temperature and top-p actually affect output (you'll see it live)
- How system prompts shape model behavior
- How token counting and context window management work in practice
- How to handle errors, rate limits, and API quirks
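A good first step is a mock provider that satisfies the `LLMProvider` interface from the snippet above, so you can build and test the UI before wiring up a real API key. The echo behavior here is purely illustrative:

```typescript
interface Message { role: "system" | "user" | "assistant"; content: string }
interface ChatParams {
  model: string;
  messages: Message[];
  temperature: number;
  topP: number;
  maxTokens: number;
  systemPrompt?: string;
}
interface LLMProvider {
  name: string;
  chat(params: ChatParams): AsyncIterable<string>;
}

// Mock provider: streams the last user message back word by word,
// exercising the same AsyncIterable contract a real provider would.
const mockProvider: LLMProvider = {
  name: "mock",
  async *chat(params: ChatParams) {
    const last = params.messages[params.messages.length - 1];
    for (const word of last.content.split(" ")) yield word + " ";
  },
};
```

Because the frontend only sees the `AsyncIterable<string>` contract, swapping the mock for a real OpenAI or Anthropic adapter later requires no UI changes.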
What You Should Know After Reading This
If you've read this post carefully, you should be able to answer these questions:
- What is BPE and why do LLMs use it instead of word-level tokenization?
- What is self-attention and why was it a breakthrough over RNNs?
- What's the difference between GPT-style (decoder-only) and BERT-style (encoder-only) architectures?
- What is the difference between SFT and RLHF? Why do you need both?
- What is a reward model and how is it trained?
- What's the difference between temperature, top-k, and top-p sampling?
- Why is Chatbot Arena considered more reliable than benchmarks like MMLU?
- What are the main components of a production chatbot system beyond just the LLM?
- What role does data cleaning play, and what's the difference between RefinedWeb, Dolma, and FineWeb?
- What is DPO and why is it becoming popular as an alternative to PPO-based RLHF?
If you can't answer all of them yet, re-read the relevant section. These are the foundations everything else builds on.
Further Reading
For those who want to go deeper on any topic covered here:
- "Attention Is All You Need" (Vaswani et al., 2017) — The original Transformer paper
- "Language Models are Few-Shot Learners" (Brown et al., 2020) — The GPT-3 paper
- "Training language models to follow instructions with human feedback" (Ouyang et al., 2022) — The InstructGPT/RLHF paper
- "LLaMA: Open and Efficient Foundation Language Models" (Touvron et al., 2023) — The original LLaMA paper
- "Direct Preference Optimization" (Rafailov et al., 2023) — The DPO paper
- "The RefinedWeb Dataset for Falcon LLM" (Penedo et al., 2023) — Deep dive into web data cleaning
- "Dolma: An Open Corpus of Three Trillion Tokens" (Soldaini et al., 2024) — AI2's open data documentation
- "FineWeb: decanting the web for the finest text data" (Penedo et al., 2024) — HuggingFace's data pipeline
- Andrej Karpathy's "Let's build GPT from scratch" — Best video walkthrough of Transformer internals
- Chip Huyen's "Designing Machine Learning Systems" — Essential reading for ML in production
Next in the Series
Part 2: Customer Support Chatbot with RAGs & Prompt Engineering — We build a system that gives your LLM access to external knowledge. You'll learn about embeddings, vector databases, chunking strategies, prompt engineering patterns, and how to build a RAG pipeline for a customer support chatbot that actually works.