Build an LLM Playground — Part 1: How Large Language Models Actually Work
The first entry in a learn-by-doing series to become an AI engineer. We break down every stage of how LLMs are built — from raw data to chatbot — so you can build your own playground with real understanding.
Series: The AI Engineer Learning Path
This is Part 1 of a hands-on series designed to take you from zero to working AI engineer. Every post follows a learn-by-doing philosophy — we explain the theory, then you build something real.
| Part | Topic | Status |
|---|---|---|
| 1 | Build an LLM Playground (this post) | Current |
| 2 | Customer Support Chatbot with RAG & Prompt Engineering | Available |
| 3 | "Ask-the-Web" Agent with Tool Calling | Available |
| 4 | Deep Research with Reasoning Models | Available |
| 5 | Multi-modal Generation Agent | Available |
By the end of this post, you'll understand every stage of how a large language model is built — from raw internet data to the chatbot you interact with daily. More importantly, you'll have a mental model that makes every other AI concept click into place.
Why Start Here?
If you want to be an AI engineer, you don't need to train GPT-5 from scratch. But you do need to understand what happens under the hood. Without that understanding, you'll be copy-pasting API calls without knowing why things break, why outputs are weird, or how to fix them.
This post covers the full lifecycle of an LLM:
- Pre-Training — How raw data becomes a model that can predict text
- Post-Training — How that model is refined to follow instructions and be helpful
- Evaluation — How we measure whether the model is actually good
- Chatbot Design — How the full system around the model works
- Build the Playground — Practical guidance for building your own LLM playground
Let's go.
Part I: Pre-Training — Teaching a Model to Predict Language
Pre-training is where the model learns language itself. It reads billions of pages of text and learns to predict the next word. That's the core idea — everything else is details. But the details matter enormously.
1. Data Collection
LLMs are trained on massive text datasets — hundreds of billions to trillions of tokens. Where does all that text come from?
Sources of Training Data
| Source | Description | Scale |
|---|---|---|
| Common Crawl | Monthly snapshots of the public internet. The largest open web corpus available. Raw dumps are petabytes of HTML. | ~250 billion pages |
| Manual/Targeted Crawling | Custom web scrapers targeting specific high-quality domains — Wikipedia, Stack Overflow, GitHub, arXiv, textbooks, legal filings, patent databases. | Varies by source |
| Books Corpora | Digitized books (BookCorpus, Project Gutenberg, Books3). Long-form, high-quality prose. | Millions of books |
| Code Repositories | GitHub public repositories, filtered by license. Critical for code-capable models (Codex, StarCoder, Code Llama). | Hundreds of millions of files |
| Academic Papers | arXiv, Semantic Scholar, PubMed. Essential for scientific reasoning. | Tens of millions of papers |
| Conversational Data | Reddit, forums, Q&A sites. Teaches dialogue patterns and informal language. | Billions of posts |
The Data Scale Problem
GPT-3 was trained on ~300 billion tokens. LLaMA 2 used 2 trillion tokens. Modern frontier models likely use 10+ trillion tokens. At this scale, data quality and deduplication aren't nice-to-haves — they're the difference between a good model and a bad one.
Key insight: The quality of your data matters more than the quantity. A model trained on 1 trillion clean tokens will outperform one trained on 5 trillion noisy tokens.
2. Data Cleaning
Raw web data is a mess. It contains duplicates, spam, porn, malware, boilerplate HTML, navigation menus, cookie banners, and pages that are 90% ads. Cleaning this data is one of the most important — and most underrated — parts of building an LLM.
The Cleaning Pipeline
Raw HTML → Text Extraction → Language Filtering → Quality Filtering →
Deduplication → PII Removal → Toxic Content Filtering → Final Dataset
Key Data Cleaning Projects
Understanding these projects gives you insight into what "clean data" actually means in practice:
RefinedWeb (Falcon)
- Built by the Technology Innovation Institute for the Falcon models
- Started from Common Crawl and applied aggressive filtering
- Used trafilatura for text extraction from HTML (much better than simple tag stripping)
- Applied URL-based filtering, language identification, and document-level quality heuristics
- Heavy deduplication using MinHash LSH (fuzzy matching that catches near-duplicates, not just exact copies)
- Result: 5 trillion tokens of high-quality web text
- Key finding: with enough cleaning, web-only data can match curated datasets
Dolma (OLMo / AI2)
- Built by the Allen Institute for AI for the OLMo family of models
- Fully open and documented — you can see every filtering decision
- Mixed sources: Common Crawl, Wikipedia, Project Gutenberg, Semantic Scholar, GitHub, Reddit
- Uses a pipeline of taggers that annotate text with quality signals (language, toxicity, duplication, etc.) and then filters based on those tags
- Explicitly documents trade-offs — for example, aggressive deduplication improves quality but reduces diversity
- Result: 3 trillion tokens with full provenance
FineWeb (HuggingFace)
- Built by HuggingFace as a fully open, reproducible web dataset
- 15 trillion tokens from 96 Common Crawl snapshots (2013-2024)
- Key innovation: developed custom quality classifiers trained on educational content
- FineWeb-Edu subset: filtered to only educational content, resulting in significant benchmark improvements despite being much smaller
- Every step is documented and reproducible — a model for open data practices
What Gets Filtered Out
| Filter | What It Catches | Why It Matters |
|---|---|---|
| Language detection | Non-target language text | Training an English model on Chinese text wastes compute |
| Deduplication | Repeated pages, boilerplate, scraped content farms | Duplicates cause the model to memorize rather than generalize |
| Quality heuristics | Short pages, high symbol-to-word ratio, abnormally high-perplexity text | Removes spam, auto-generated content, and gibberish |
| URL filtering | Known spam domains, adult content sites | Removes obviously low-quality sources |
| PII removal | Email addresses, phone numbers, SSNs | Legal and ethical requirement |
| Toxicity filtering | Hate speech, violence, explicit content | Reduces harmful model outputs |
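To make the filters above concrete, here is a toy document-level quality filter. The thresholds, the symbol-ratio rule, and the repeated-line boilerplate check are illustrative assumptions, not values taken from any real pipeline:

```python
import re

def passes_quality_heuristics(text: str,
                              min_words: int = 50,
                              max_symbol_ratio: float = 0.1) -> bool:
    """Toy document-level quality filter in the spirit of the table above.

    All thresholds are made up for illustration.
    """
    words = text.split()
    if len(words) < min_words:                 # too short to be useful prose
        return False
    # ratio of non-alphanumeric, non-space characters to total characters
    symbols = len(re.findall(r"[^\w\s]", text))
    if symbols / max(len(text), 1) > max_symbol_ratio:
        return False
    # crude boilerplate check: a page that mostly repeats the same lines
    lines = [l for l in text.splitlines() if l.strip()]
    if lines and len(set(lines)) / len(lines) < 0.5:
        return False
    return True

print(passes_quality_heuristics("word " * 100))      # True: plain prose
print(passes_quality_heuristics("$$$ %% !! @@ ##"))  # False: short and symbol-heavy
```

Real pipelines (RefinedWeb, Dolma, FineWeb) use dozens of such signals, tuned empirically and often combined with trained classifiers.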
Hands-on exercise: Download a small Common Crawl WARC file and try extracting clean text from it. You'll immediately understand why data cleaning is a multi-billion-dollar problem.
3. Tokenization
Computers don't understand words. They understand numbers. Tokenization is the process of converting text into a sequence of integer IDs that the model can process.
Why Not Just Use Characters or Words?
| Approach | Problem |
|---|---|
| Character-level | Sequences become extremely long. "artificial intelligence" = 23 characters. The model needs to learn spelling from scratch. Very slow to train. |
| Word-level | Vocabulary explodes. Every misspelling, conjugation, and compound word needs its own entry. Out-of-vocabulary words become a constant problem. |
| Subword | The sweet spot. Common words stay whole ("the", "is"), rare words get split into meaningful pieces ("un" + "believ" + "able"). Fixed vocabulary, handles any input. |
Byte-Pair Encoding (BPE)
BPE is the most widely used tokenization algorithm (used by GPT, LLaMA, and most modern LLMs). Here's how it works:
Training the tokenizer:
1. Start with a vocabulary of individual bytes (256 entries)
2. Scan the training corpus and find the most frequent pair of adjacent tokens
3. Merge that pair into a new token and add it to the vocabulary
4. Repeat steps 2-3 until you reach your desired vocabulary size (typically 32K-128K tokens)
Example of BPE in action:
Input: "lowest lower newest"
Step 0: ['l','o','w','e','s','t',' ','l','o','w','e','r',' ','n','e','w','e','s','t']
Step 1: merge (e,s) → es: ['l','o','w','es','t',' ','l','o','w','e','r',' ','n','e','w','es','t']
Step 2: merge (es,t) → est: ['l','o','w','est',' ','l','o','w','e','r',' ','n','e','w','est']
Step 3: merge (l,o) → lo: ['lo','w','est',' ','lo','w','e','r',' ','n','e','w','est']
Step 4: merge (lo,w) → low: ['low','est',' ','low','e','r',' ','n','e','w','est']
...and so on
Tokenization at inference time:
Once trained, the tokenizer applies the learned merges in order to encode any new text.
"lowest" → [low, est] → [4521, 382]
"highest" → [high, est] → [9301, 382]
"cat" → [cat] → [2163]
"Pokémon" → [Pok, é, mon] → [51, 8948, 1711]
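The training and encoding steps above can be sketched in a few lines of Python. This toy version works per word on characters rather than raw bytes, and breaks frequency ties arbitrarily, so its merge order can differ from the worked example:

```python
from collections import Counter

def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with the merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(words, num_merges):
    """Learn a merge list from a toy corpus (steps 1-4 above, per word)."""
    vocab = [list(w) for w in words]           # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in vocab:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent adjacent pair
        merges.append(best)
        vocab = [merge_pair(toks, best) for toks in vocab]
    return merges

def encode(word, merges):
    """Tokenize new text by replaying the learned merges in order."""
    tokens = list(word)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens

merges = train_bpe(["lowest", "lower", "newest"], num_merges=4)
print(merges[0])                 # ('w', 'e') is the most frequent pair here
print(encode("lowest", merges))  # ['lowe', 'st']
```

Production tokenizers add byte-level fallback, pre-tokenization rules, and much faster merge lookups, but the core loop is exactly this.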
Other Tokenization Methods
| Method | Used By | Key Difference |
|---|---|---|
| BPE | GPT-2/3/4, LLaMA, Mistral | Merges most frequent pairs. Industry standard. |
| WordPiece | BERT, DistilBERT | Similar to BPE but uses likelihood instead of frequency for merges. |
| Unigram | T5, ALBERT, SentencePiece | Starts with a large vocabulary and prunes down. Can output multiple tokenizations with probabilities. |
| SentencePiece | LLaMA, T5, many multilingual models | Language-agnostic. Treats the input as a raw byte stream — no need for pre-tokenization rules. |
Why Tokenization Matters for AI Engineers
- Token limits are not word limits. "I don't know" is 3 words but might be 3-5 tokens depending on the tokenizer. When an API says "128K context," that's tokens, not words.
- Cost is per token. API pricing is based on token count. Efficient prompts = lower cost.
- Different models use different tokenizers. You can't assume token counts are portable across models.
- Tokenization artifacts. Some models struggle with simple arithmetic because numbers get tokenized inconsistently ("380" might be [3, 80] or [380] depending on context).
Hands-on exercise: Use OpenAI's tiktoken or HuggingFace's tokenizers library to tokenize the same sentence with different model tokenizers. Compare the results — you'll see surprisingly large differences.
4. Architecture: Neural Networks and Transformers
This is the core of the model. We'll go from first principles to the architectures you'll work with daily.
Neural Networks in 60 Seconds
A neural network is a function that maps inputs to outputs through layers of weighted connections.
Input → [Layer 1] → [Layer 2] → ... → [Layer N] → Output
Each layer takes a vector of numbers, multiplies by a weight matrix, adds a bias, and applies a non-linear activation function. Training adjusts the weights to minimize a loss function (the difference between predicted and actual outputs).
For language models, the input is a sequence of token embeddings (vectors that represent tokens) and the output is a probability distribution over the vocabulary for the next token.
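A minimal sketch of that input-to-distribution path, with random weights standing in for a trained model (the vocabulary and dimension sizes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8

# Learned lookup table: one d_model-dimensional vector per token ID
embedding = rng.normal(size=(vocab_size, d_model))
# Output projection ("unembedding"): hidden vector -> vocabulary logits
W_out = rng.normal(size=(d_model, vocab_size))

token_ids = np.array([3, 1, 4, 1, 5])       # input sequence of token IDs
hidden = embedding[token_ids]               # (seq_len, d_model) embeddings

# A real model would apply many Transformer blocks to `hidden` here.
logits = hidden[-1] @ W_out                 # logits for the NEXT token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                        # softmax over the vocabulary

print(probs.shape)                          # (10,): one probability per token
```

Everything a Transformer adds happens between the embedding lookup and the final projection; the input and output contracts stay exactly this.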
The Transformer Architecture
The Transformer (Vaswani et al., 2017, "Attention Is All You Need") is the architecture behind every modern LLM. Here's why it matters and how it works.
The key innovation: Self-Attention
Before Transformers, models processed text sequentially (RNNs, LSTMs). Word 50 had to wait for words 1-49 to be processed first. This was slow and made it hard to capture long-range dependencies.
Self-attention lets every token attend to every other token in parallel. The model can directly connect "it" to "the dog" even if they're 200 tokens apart.
How self-attention works:
For each token, the model computes three vectors from the token's embedding:
- Query (Q) — "What am I looking for?"
- Key (K) — "What do I contain?"
- Value (V) — "What information do I provide?"
The attention score between two tokens is the dot product of one token's Query with another's Key, scaled and softmaxed. The output is a weighted sum of the Values.
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Multi-Head Attention runs multiple attention operations in parallel (e.g., 32 or 64 heads), each learning to attend to different types of relationships — one head might learn syntax, another semantics, another coreference.
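The attention formula above can be written directly in NumPy. This sketch uses random weights, toy sizes, and adds the causal mask that decoder-only GPT-style models use (covered below):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V, causal."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq, seq) similarity matrix
    # causal mask: token i may only attend to tokens j <= i
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax per row
    return weights @ V, weights

rng = np.random.default_rng(0)
seq, d = 4, 8
x = rng.normal(size=(seq, d))                 # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, w = attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)   # (4, 8)
print(w[0])        # first token can only attend to itself: [1. 0. 0. 0.]
```

Multi-head attention simply runs this function h times with smaller per-head dimensions and concatenates the outputs.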
Full Transformer block:
Input
↓
Multi-Head Self-Attention + Residual Connection + Layer Norm
↓
Feed-Forward Network (2 linear layers with activation) + Residual Connection + Layer Norm
↓
Output
A GPT-style model stacks 32-96+ of these blocks. More blocks = more capacity = more parameters.
The GPT Family
GPT (Generative Pre-trained Transformer) models are decoder-only Transformers. They use causal (masked) self-attention — each token can only attend to tokens before it, not after. This makes them natural text generators: they predict one token at a time, left to right.
| Model | Parameters | Training Data | Context Length | Key Innovation |
|---|---|---|---|---|
| GPT-2 (2019) | 1.5B | WebText (40GB) | 1,024 tokens | Showed scaling works. Released with "too dangerous" controversy. |
| GPT-3 (2020) | 175B | 300B tokens | 2,048 tokens | Few-shot learning via prompting. No fine-tuning needed for many tasks. |
| GPT-3.5 (2022) | ~175B | + RLHF training | 4,096 tokens | InstructGPT + ChatGPT. First model to feel "useful" to the public. |
| GPT-4 (2023) | Undisclosed (rumored MoE) | Undisclosed | 8K / 32K / 128K tokens | Multimodal (vision), dramatically better reasoning. |
| GPT-4o (2024) | Undisclosed | Undisclosed | 128K tokens | Natively multimodal (text, vision, audio), faster, cheaper. |
The LLaMA Family (Open-Weight Models)
Meta's LLaMA family democratized large language models by releasing model weights to the research community.
| Model | Parameters | Training Data | Key Innovation |
|---|---|---|---|
| LLaMA (2023) | 7B, 13B, 33B, 65B | 1.4T tokens | Showed smaller models trained on more data beat larger models. |
| LLaMA 2 (2023) | 7B, 13B, 70B | 2T tokens | Open commercial license. Grouped-Query Attention (GQA) for faster inference. |
| LLaMA 3 (2024) | 8B, 70B | 15T tokens | Massive data scaling. Larger vocabulary (128K tokens). |
| LLaMA 3.1 (2024) | 8B, 70B, 405B | 15T+ tokens | 128K context. Tool use. The 405B model competes with GPT-4 class models. |
Architectural improvements in LLaMA vs. original GPT:
- RMSNorm instead of LayerNorm (simpler, equally effective)
- Rotary Position Embeddings (RoPE) instead of learned position embeddings (better extrapolation to longer sequences)
- SwiGLU activation instead of ReLU in the feed-forward layers (better performance)
- Grouped-Query Attention (GQA) — shares Key/Value heads across multiple Query heads, reducing memory during inference without hurting quality
Other Notable Architectures
| Model Family | Creator | Key Feature |
|---|---|---|
| Mistral / Mixtral | Mistral AI | Sliding window attention + Mixture of Experts (MoE). Mixtral 8x7B uses 8 expert FFNs and routes each token to 2 of them — only 13B active parameters with 47B total. |
| Claude | Anthropic | Constitutional AI training. Strong reasoning. Details undisclosed. |
| Gemini | Google DeepMind | Natively multimodal from the ground up (not a bolted-on vision encoder). |
| DeepSeek | DeepSeek | Open-weight MoE models. DeepSeek-V2 introduced Multi-head Latent Attention (MLA) for extremely efficient KV cache. |
| Phi | Microsoft | Small models (1.3B-14B) trained on high-quality "textbook" data. Shows that data quality can compensate for parameter count. |
| Qwen | Alibaba | Strong multilingual performance, especially for Chinese + English. Competitive with LLaMA at equivalent sizes. |
5. Text Generation: How LLMs Actually Produce Output
The model outputs a probability distribution over its vocabulary for the next token. But how do you turn probabilities into text? This is the decoding strategy, and it has a huge impact on output quality.
Greedy Search
Pick the highest-probability token at every step.
P("the") = 0.4, P("a") = 0.3, P("my") = 0.2, ...
→ Pick "the"
Pros: Fast, deterministic. Cons: Repetitive, boring, often gets stuck in loops ("the the the..."). Misses better sequences where a lower-probability early token leads to higher overall probability.
Beam Search
Maintain the top-k sequences (beams) at each step and pick the highest-scoring complete sequence.
Beam 1: "The cat sat on" (score: -2.3)
Beam 2: "A dog ran to" (score: -2.5)
Beam 3: "The cat ran on" (score: -2.7)
→ Continue expanding all three, prune to top k
Pros: Finds higher-probability sequences than greedy. Good for translation and summarization. Cons: Still tends toward generic, safe outputs. Computationally expensive. Not great for creative or conversational text.
Temperature Sampling
Scale the logits (raw model outputs) by a temperature value before applying softmax. Then sample from the resulting distribution.
temperature = 0.0 → Greedy (always pick the top token)
temperature = 0.7 → Mild randomness (good default for most tasks)
temperature = 1.0 → Sample directly from the model's distribution
temperature = 1.5 → Very random, creative, potentially incoherent
Lower temperature = more focused, deterministic, repetitive. Higher temperature = more creative, diverse, potentially nonsensical.
Top-k Sampling
Only consider the top-k most probable tokens. Redistribute their probabilities and sample.
k = 50: Consider the top 50 tokens at each step
k = 10: More focused
k = 1: Greedy search
Problem: A fixed k doesn't adapt. Sometimes the model is very confident (the top 3 tokens cover 95% of the probability mass, so k=50 lets low-quality tokens into the pool). Sometimes the model is uncertain (the top 50 tokens cover only 60%, so k=50 may cut off good options).
Top-p (Nucleus) Sampling
Instead of a fixed count, include the smallest set of tokens whose cumulative probability exceeds p.
p = 0.9: Include tokens until their probabilities sum to 0.9
p = 0.5: More focused
p = 1.0: Consider all tokens (reduces to plain temperature sampling)
This adapts to the model's confidence. When the model is confident, only a few tokens are considered. When it's uncertain, more tokens are included.
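The three controls can be combined in one sampling function: temperature scaling first, then a top-k cut, then a top-p cut. This is a sketch; real implementations differ in tie-breaking and the order of renormalization:

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9, rng=None):
    """Sample a token ID from next-token logits (sketch, not production code)."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:                       # convention: greedy decoding
        return int(np.argmax(logits))
    logits = logits / temperature              # temperature scaling
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # token IDs, most probable first
    probs = probs[order]
    probs[top_k:] = 0                          # top-k: drop everything past rank k
    cutoff = np.searchsorted(np.cumsum(probs), top_p) + 1
    probs[cutoff:] = 0                         # top-p: keep the smallest nucleus >= p
    probs /= probs.sum()                       # renormalize the survivors
    return int(order[rng.choice(len(probs), p=probs)])

print(sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=0.0))  # 0 (greedy)
```

With temperature > 0 the result varies run to run, which is exactly the point: the same prompt can yield different continuations.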
In Practice: Combining Strategies
Most production systems combine temperature + top-p:
# Typical chat configuration
response = client.chat.completions.create(
model="gpt-4",
messages=messages,
temperature=0.7,
top_p=0.9,
)

| Use Case | Temperature | Top-p | Why |
|---|---|---|---|
| Code generation | 0.0-0.2 | 1.0 | Correctness matters. Low randomness. |
| Chat / conversation | 0.7 | 0.9 | Natural, varied, but coherent. |
| Creative writing | 0.9-1.2 | 0.95 | More surprising word choices. |
| Factual Q&A | 0.0-0.3 | 1.0 | Accuracy over creativity. |
Repetition Penalty and Other Controls
- Repetition penalty: Reduces the probability of tokens that have already appeared. Prevents "the the the" loops.
- Frequency penalty: Penalizes tokens proportionally to how often they've appeared. Encourages vocabulary diversity.
- Presence penalty: Penalizes any token that has appeared at all (binary). Encourages topic diversity.
- Stop sequences: Halt generation when specific strings are produced (e.g., "\n\nHuman:" in a chatbot).
- Max tokens: Hard cap on output length.
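A sketch of how frequency and presence penalties might be applied to next-token logits before sampling. The exact formula varies by implementation, and the penalty values here are made up:

```python
import numpy as np

def apply_penalties(logits, generated_ids,
                    frequency_penalty=0.5, presence_penalty=0.3):
    """Subtract penalties from the logits of already-generated tokens (sketch)."""
    logits = np.asarray(logits, dtype=float).copy()
    ids, counts = np.unique(generated_ids, return_counts=True)
    # frequency penalty scales with how often a token already appeared;
    # presence penalty is a flat cost for having appeared at all
    logits[ids] -= frequency_penalty * counts + presence_penalty
    return logits

logits = np.array([1.0, 1.0, 1.0, 1.0])
penalized = apply_penalties(logits, generated_ids=[2, 2, 3])
print(penalized)   # token 2 penalized twice over, token 3 once
```

After this adjustment, sampling proceeds as usual, so repeated tokens become progressively less likely instead of being forbidden outright.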
Part II: Post-Training — Making the Model Useful
Pre-training gives you a model that can predict text. But a raw pre-trained model is like a brilliant student who's read every book in the library but has never had a conversation. It will complete your prompt, but it won't answer your question.
Post-training bridges this gap. It transforms a text predictor into an assistant.
1. Supervised Fine-Tuning (SFT)
SFT is conceptually simple: show the model examples of good behavior and train it to mimic them.
Training data format:
{
"messages": [
{"role": "user", "content": "Explain quantum entanglement simply."},
{"role": "assistant", "content": "Imagine you have two coins that are magically linked..."}
]
}

You collect thousands to hundreds of thousands of these (prompt, ideal response) pairs. The model is trained to maximize the probability of the ideal response given the prompt.
Where do the examples come from?
| Source | Description | Quality |
|---|---|---|
| Human annotators | Paid contractors write ideal responses. Expensive but high quality. | Highest |
| Distillation | Use a stronger model (GPT-4) to generate training data for a smaller model. | High |
| Open datasets | OpenAssistant, Dolly, ShareGPT, UltraChat. Free but variable quality. | Variable |
| Synthetic generation | Use the model itself + filtering to generate training data. Self-play. | Medium-High |
What SFT teaches:
- Follow instructions ("Write a poem about..." → actually writes a poem)
- Adopt a helpful persona (answers questions rather than continuing the prompt)
- Format outputs properly (markdown, code blocks, numbered lists)
- Refuse harmful requests (though this is crude without RL)
Limitations of SFT:
SFT alone produces a model that imitates the training examples. It doesn't learn why some responses are better than others. It can't generalize the concept of "helpfulness" beyond the specific examples it's seen. This is where reinforcement learning comes in.
2. Reinforcement Learning and RLHF
Reinforcement Learning from Human Feedback (RLHF) teaches the model to optimize for human preferences rather than just imitating examples.
The RLHF Pipeline
Step 1: Train a Reward Model (RM)
Human annotators rank model outputs from best to worst
→ Train a model to predict these rankings (the reward model)
Step 2: Optimize the LLM using RL
The LLM generates responses
→ The reward model scores them
→ The LLM is updated to produce higher-scoring responses
→ A KL penalty prevents the model from drifting too far from the SFT baseline
Step 1: Reward Models
A reward model takes a (prompt, response) pair and outputs a scalar score representing quality.
Training data: Human annotators are shown the same prompt with 2-4 different model responses. They rank them from best to worst. The reward model is trained on these comparisons.
Prompt: "What is the capital of France?"
Response A: "The capital of France is Paris." (Rank 1 - best)
Response B: "Paris is a city in Europe." (Rank 2)
Response C: "France is a country." (Rank 3 - worst)
The reward model learns to assign: score(A) > score(B) > score(C)
What the reward model captures:
- Helpfulness (did it answer the question?)
- Harmlessness (did it avoid dangerous content?)
- Honesty (did it avoid making things up?)
- Formatting quality, tone, detail level
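Ranked comparisons like the example above are typically turned into a pairwise (Bradley-Terry-style) loss: the reward model is pushed to score the preferred response above the rejected one. A sketch with made-up scores:

```python
import numpy as np

def pairwise_ranking_loss(score_chosen, score_rejected):
    """Bradley-Terry-style loss: -log(sigmoid(score_chosen - score_rejected)).

    Low when the reward model already ranks the chosen response higher.
    """
    margin = score_chosen - score_rejected
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))

# Hypothetical scores the reward model assigns to responses A > B > C
s_a, s_b, s_c = 2.1, 0.4, -1.3
loss = pairwise_ranking_loss(s_a, s_b) + pairwise_ranking_loss(s_b, s_c)
print(loss)   # small, since the model's scores already agree with the ranking
```

Training minimizes this loss over many human-labeled comparison pairs, which is how a scalar "quality" score emerges from purely relative judgments.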
Step 2: Policy Optimization with PPO
Proximal Policy Optimization (PPO) is the most common RL algorithm used for RLHF. Here's the intuition:
- Generate: The LLM (called the "policy") generates a response to a prompt
- Score: The reward model scores the response
- Update: Adjust the LLM's weights to increase the probability of high-scoring responses
- Constrain: A KL divergence penalty prevents the model from changing too much in a single step (which would cause instability or "reward hacking")
Objective = E[reward(response)] - β * KL(policy || reference_policy)
The β term is crucial — without it, the model quickly learns to exploit quirks of the reward model rather than genuinely improving.
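A sketch of that objective for a single sampled response, using a simple per-token estimate of the KL term; all numbers are hypothetical:

```python
import numpy as np

def rlhf_objective(reward, logp_policy, logp_reference, beta=0.1):
    """KL-penalized RLHF objective for one sampled response (sketch).

    The KL term is estimated from per-token log-probs of the response
    under the current policy and the frozen SFT reference model.
    """
    kl_estimate = float(np.sum(np.asarray(logp_policy) - np.asarray(logp_reference)))
    return reward - beta * kl_estimate

# Hypothetical numbers: a high-reward response that drifted from the reference
logp_policy    = [-0.2, -0.1, -0.3]
logp_reference = [-1.0, -0.9, -1.1]
print(rlhf_objective(reward=2.0, logp_policy=logp_policy,
                     logp_reference=logp_reference, beta=0.1))
```

Raising β trades reward for staying close to the SFT model; setting it to zero invites reward hacking, as the text notes.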
Verifiable Tasks and Process Reward Models
A newer trend moves away from pure human-preference reward models toward verifiable rewards — tasks where the answer can be checked automatically.
| Approach | How It Works | Example |
|---|---|---|
| Outcome Reward Models (ORM) | Score the final answer only. Binary: right or wrong. | Math: is 2+2=4? Correct! |
| Process Reward Models (PRM) | Score each reasoning step individually. | "Step 1: correct. Step 2: correct. Step 3: wrong." |
| Verifiable tasks | Use tasks with known answers as training signal. No human annotation needed. | Code that passes test cases, math with known solutions. |
Why this matters: Human preference annotation is expensive, slow, and subjective. Verifiable tasks provide unlimited, objective training signal. DeepSeek-R1 and OpenAI's o1/o3 models heavily use this approach for reasoning.
Alternatives to PPO
| Method | Description | Advantage |
|---|---|---|
| DPO (Direct Preference Optimization) | Skips the reward model entirely. Directly optimizes the LLM using preference pairs. Much simpler pipeline. | No reward model needed. Fewer hyperparameters. Stable training. |
| REINFORCE | Classic policy gradient. Simpler than PPO but higher variance. | Simplicity. |
| GRPO (Group Relative Policy Optimization) | Used by DeepSeek. Groups responses and uses relative ranking within the group as the reward signal. | No separate reward model. Works well for reasoning tasks. |
| KTO (Kahneman-Tversky Optimization) | Uses binary feedback (good/bad) instead of ranked comparisons. Inspired by prospect theory. | Easier to collect binary feedback than rankings. |
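As an illustration of why DPO's pipeline is simpler, its loss for one preference pair needs only log-probabilities from the policy and the frozen reference model, with no reward model in sight. The numbers below are hypothetical:

```python
import numpy as np

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair (sketch).

    logp_* are summed log-probs of the chosen (w) and rejected (l) responses
    under the policy being trained and the frozen reference model.
    """
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))  # -log(sigmoid(margin))

# Case 1: the policy already prefers the chosen response -> low loss
loss_good = dpo_loss(logp_w_policy=-5.0, logp_l_policy=-9.0,
                     logp_w_ref=-6.0, logp_l_ref=-6.0)
# Case 2: the policy prefers the rejected response -> high loss
loss_bad = dpo_loss(logp_w_policy=-9.0, logp_l_policy=-5.0,
                    logp_w_ref=-6.0, logp_l_ref=-6.0)
print(loss_good < loss_bad)   # True
```

Minimizing this directly pushes the policy toward chosen responses and away from rejected ones, relative to the reference model, which is what the reward model + PPO combination achieves far more elaborately.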
3. The Full Post-Training Pipeline in Practice
Modern post-training is multi-stage:
Pre-trained Model
↓
SFT on instruction-following data
↓
RLHF/DPO on human preferences (helpfulness)
↓
Safety training (refusals, harmlessness)
↓
Specialized RL on verifiable tasks (math, code, reasoning)
↓
Final model
Each stage builds on the previous one. Skip SFT, and RLHF doesn't work well; skip RLHF, and the model follows instructions but stays unrefined. The order matters.
Part III: Evaluation — How Do You Know If Your Model Is Good?
Building a model is one thing. Knowing whether it's actually good is harder than it sounds.
1. Traditional NLP Metrics
These come from the pre-LLM era but are still used for specific tasks:
| Metric | What It Measures | Used For | Limitation |
|---|---|---|---|
| Perplexity | How surprised the model is by the test data. Lower = better. | Language modeling quality | Doesn't measure usefulness or factuality |
| BLEU | N-gram overlap between generated text and reference text | Translation, summarization | A correct paraphrase can score 0. Doesn't capture meaning. |
| ROUGE | Recall-oriented n-gram overlap | Summarization | Same problems as BLEU |
| F1 Score | Precision/recall balance for extracted answers | Question answering, NER | Only works for tasks with clear correct answers |
| Exact Match | Binary — did the model produce the exact correct answer? | QA, classification | Too strict. "Paris" and "The answer is Paris" both fail. |
The fundamental problem: These metrics measure surface-level text similarity, not whether the response is actually helpful, accurate, or well-written. This is why benchmarks and human evaluation exist.
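Perplexity is the one metric in the table you can compute in a single line, given the per-token log-probabilities a model assigns to held-out text. The log-probs below are hypothetical:

```python
import numpy as np

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token). Lower is better."""
    return float(np.exp(-np.mean(token_logprobs)))

# Hypothetical per-token log-probs on a held-out sentence
confident = [-0.1, -0.2, -0.1, -0.3]    # model found the text predictable
surprised = [-2.0, -3.1, -2.5, -2.8]    # model found the text surprising
print(perplexity(confident))   # ~1.19
print(perplexity(surprised))   # ~13.46
```

Note the caveat from the table: a model can have excellent perplexity while still being unhelpful or untruthful, which is exactly why the benchmarks below exist.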
2. Task-Specific Benchmarks
Benchmarks provide standardized tasks with known correct answers. Here are the ones that matter:
Reasoning and Knowledge
| Benchmark | What It Tests | Format | Why It Matters |
|---|---|---|---|
| MMLU | Massive Multitask Language Understanding. 57 subjects from elementary to professional level. | Multiple choice | The most-cited general knowledge benchmark. Covers STEM, humanities, social sciences, and more. |
| ARC | AI2 Reasoning Challenge. Grade-school science questions. | Multiple choice | Tests scientific reasoning. ARC-Challenge subset is genuinely hard. |
| HellaSwag | Sentence completion requiring commonsense reasoning. | Multiple choice | Tests whether the model understands how everyday situations unfold. |
| Winogrande | Pronoun resolution requiring world knowledge. | Binary choice | "The trophy didn't fit in the suitcase because it was too big." What was too big? |
| TruthfulQA | Questions where common misconceptions lead to wrong answers. | Open-ended + multiple choice | Tests whether the model gives truthful answers vs. popular-but-wrong ones. |
| BoolQ | Yes/no questions based on a passage. | Boolean | Tests reading comprehension. |
Math and Code
| Benchmark | What It Tests | Format |
|---|---|---|
| GSM8K | Grade-school math word problems requiring multi-step reasoning. | Open-ended (numerical answer) |
| MATH | Competition-level mathematics (AMC, AIME difficulty). | Open-ended |
| HumanEval | Python function completion. 164 problems with test cases. | Code generation |
| MBPP | Mostly Basic Python Problems. Simpler than HumanEval. | Code generation |
| SWE-bench | Real GitHub issues. The model must write a patch that resolves the issue and passes tests. | Code patch |
Conversation and Instruction Following
| Benchmark | What It Tests | Format |
|---|---|---|
| MT-Bench | Multi-turn conversation quality. 80 questions across 8 categories. | Open-ended, scored by GPT-4 |
| AlpacaEval | Instruction following quality. Compared against a reference model. | Open-ended, LLM-as-judge |
| IFEval | Instruction following with verifiable constraints ("write exactly 3 paragraphs," "use no commas"). | Open-ended with automated checks |
Safety
| Benchmark | What It Tests |
|---|---|
| BBQ | Bias Benchmark for QA — tests for social biases |
| ToxiGen | Toxic content generation across demographics |
| RealToxicityPrompts | How often the model generates toxic continuations |
| XSTest | Whether safety filters over-trigger on benign prompts |
3. Human Evaluation and Leaderboards
Benchmarks have a fundamental limitation: they can be gamed. A model can be trained specifically to score well on MMLU without being generally capable. This is why human evaluation matters.
Chatbot Arena (LMSYS)
The gold standard for LLM evaluation. Real users have conversations with two anonymous models side-by-side and vote for the better response. Results are aggregated into an Elo rating system (like chess).
Why it's important:
- Real users, real tasks, real preferences
- Models are anonymous — no brand bias
- Elo ratings are continuously updated with new votes
- Widely considered the most reliable LLM ranking
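The Elo mechanism behind those rankings is a simple update after each head-to-head vote. This sketch uses the standard chess constants (K-factor 32, scale 400); the arena's actual implementation differs in details:

```python
def elo_update(rating_a, rating_b, a_won, k=32):
    """One Elo update after a single A-vs-B vote (chess-style constants)."""
    # expected score of A given the current rating gap
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# An upset: the lower-rated model wins, so it gains the larger share of points
a, b = elo_update(1200, 1400, a_won=True)
print(round(a), round(b))   # 1224 1376
```

Because expected scores depend on the rating gap, beating a strong model moves you far more than beating a weak one, which is what makes the aggregate ranking informative.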
Human Evaluation Practices
| Method | Description | When to Use |
|---|---|---|
| Side-by-side comparison | Show two model outputs, ask which is better | Ranking models against each other |
| Likert scale rating | Rate individual outputs on a 1-5 scale for specific criteria | Measuring specific qualities (helpfulness, accuracy, tone) |
| Red teaming | Humans actively try to make the model fail or produce harmful outputs | Safety evaluation before deployment |
| Task completion | Measure whether humans can accomplish real tasks using the model | End-to-end usefulness evaluation |
LLM-as-Judge
Using a strong model (e.g., GPT-4, Claude) to evaluate outputs from other models. Faster and cheaper than human evaluation, but introduces the evaluated model's biases.
Common patterns:
- Position bias: tends to prefer the first response shown
- Verbosity bias: tends to prefer longer responses
- Self-preference: models tend to rate their own outputs higher
Mitigation: Run evaluations in both orders and average. Use specific rubrics. Combine with human eval for calibration.
Part IV: Chatbot Design — The Full System
The model is just one component. A production chatbot is a system with multiple layers.
System Architecture
┌─────────────────────────────────────┐
│ User Interface │
│ (Web app, API, mobile, CLI) │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ Application Layer │
│ - Conversation management │
│ - System prompt injection │
│ - Tool/function calling router │
│ - Rate limiting & auth │
│ - Content filtering (input) │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ Model Layer │
│ - LLM inference (local or API) │
│ - Decoding parameters │
│ - Context window management │
│ - Streaming response │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ Post-Processing │
│ - Output filtering (safety) │
│ - Citation extraction │
│ - Format validation │
│ - Tool call execution │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ Memory & Context │
│ - Conversation history storage │
│ - RAG retrieval (next post!) │
│ - Long-term memory │
│ - User preferences │
└─────────────────────────────────────┘
Key Design Decisions
System Prompts
The system prompt defines the model's persona, capabilities, and constraints. It's the most important piece of prompt engineering in a chatbot.
You are a helpful customer support agent for Acme Corp.
You can help with: billing, account issues, product questions.
You cannot: process refunds directly, access payment info, make promises about future features.
Always be polite. If unsure, say so and offer to escalate to a human agent.
Best practices:
- Be specific about what the model should and shouldn't do
- Include examples of ideal responses
- Define the tone and personality
- Specify how to handle edge cases (unknown questions, off-topic requests)
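In the architecture above, "system prompt injection" lives in the application layer: the UI stores only user/assistant turns, and the backend prepends the system prompt on every request. A minimal sketch (the helper name is illustrative):

```typescript
type Role = "system" | "user" | "assistant";
interface Message { role: Role; content: string }

// The system prompt always goes first and is never part of stored history,
// so you can change it between requests without rewriting the conversation.
function buildRequestMessages(systemPrompt: string, history: Message[]): Message[] {
  return [{ role: "system", content: systemPrompt }, ...history];
}
```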
Conversation History Management
LLMs have finite context windows. Long conversations must be managed:
| Strategy | How It Works | Trade-off |
|---|---|---|
| Truncation | Drop the oldest messages when the context is full | Simple but loses important early context |
| Summarization | Periodically summarize older messages into a compact form | Preserves key info but lossy |
| Sliding window | Keep the system prompt + last N messages | Predictable memory use, but everything before the window is forgotten |
| RAG on history | Embed and retrieve relevant past messages | Best retention but adds complexity and latency |
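The sliding-window strategy from the table fits in a few lines. This is a sketch with a hypothetical helper, not a library API — it keeps every system message plus the last N conversational turns:

```typescript
interface Msg { role: "system" | "user" | "assistant"; content: string }

// Keep system messages (the persona must survive) plus the last N turns.
function slidingWindow(messages: Msg[], keepLast: number): Msg[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  return [...system, ...rest.slice(-keepLast)];
}
```

A production version would count tokens rather than messages, since message lengths vary wildly.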
Tool Use / Function Calling
Modern chatbots aren't just text generators — they can take actions:
{
"type": "function",
"function": {
"name": "search_knowledge_base",
"description": "Search the company knowledge base for relevant articles",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Search query"}
}
}
}
}

The model decides when to call a tool, what arguments to pass, and then incorporates the result into its response. This is how chatbots search the web, query databases, send emails, and interact with external systems.
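That decide-execute-incorporate cycle is a loop in the application layer. The sketch below uses simplified stand-ins for the model call and tool registry (real provider APIs return richer objects), but the control flow is the core idea:

```typescript
// The model either answers or requests a tool call.
type ModelTurn =
  | { type: "answer"; text: string }
  | { type: "tool_call"; name: string; args: Record<string, string> };

function chatWithTools(
  callModel: (history: string[]) => ModelTurn,
  tools: Record<string, (args: Record<string, string>) => string>,
  userMessage: string
): string {
  const history = [userMessage];
  for (let i = 0; i < 5; i++) {                      // cap iterations to avoid infinite loops
    const turn = callModel(history);
    if (turn.type === "answer") return turn.text;    // final response for the user
    const result = tools[turn.name](turn.args);      // execute the requested tool
    history.push(`tool:${turn.name} -> ${result}`);  // feed the result back to the model
  }
  throw new Error("too many tool calls");
}
```

The iteration cap matters: a confused model can request tools forever, and you pay for every round trip.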
Streaming
Users expect to see text appear word-by-word, not wait 10 seconds for a complete response. Streaming via Server-Sent Events (SSE) is standard:
data: {"choices":[{"delta":{"content":"The"}}]}
data: {"choices":[{"delta":{"content":" capital"}}]}
data: {"choices":[{"delta":{"content":" of"}}]}
data: {"choices":[{"delta":{"content":" France"}}]}
...
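On the client side, each `data:` line is parsed and its `delta.content` appended to the visible message. A minimal parser for the event shape shown above (real clients also handle the terminating `[DONE]` sentinel and malformed lines):

```typescript
// Extract the text delta from one OpenAI-style SSE line, or null if none.
function parseSSELine(line: string): string | null {
  if (!line.startsWith("data: ")) return null;
  const payload = line.slice("data: ".length);
  if (payload === "[DONE]") return null;          // end-of-stream sentinel
  const json = JSON.parse(payload);
  return json.choices?.[0]?.delta?.content ?? null;
}
```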
Safety Layers
Production chatbots use multiple safety layers:
- Input filtering — Block or flag harmful prompts before they reach the model
- System prompt guardrails — Instructions in the system prompt about what to refuse
- Output filtering — Scan generated text for harmful content before showing it to the user
- Rate limiting — Prevent abuse through request limits
- Human escalation — Route difficult or sensitive conversations to human agents
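The input-filtering layer can start as something as crude as a denylist check that runs before the prompt ever reaches the model. This is purely illustrative — production systems use trained classifier models, not regexes — but it shows where the check sits in the pipeline:

```typescript
// Hypothetical denylist: flag obvious prompt-injection attempts.
const BLOCKED_PATTERNS = [/ignore (all )?previous instructions/i, /reveal your system prompt/i];

function inputAllowed(prompt: string): boolean {
  return !BLOCKED_PATTERNS.some((p) => p.test(prompt));
}
```

Blocked inputs are typically logged and answered with a canned refusal rather than silently dropped.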
Part V: Build Your Own LLM Playground
Now that you understand how everything works, let's talk about what you should actually build.
What Is an LLM Playground?
An LLM playground is a web interface where you can:
- Send prompts to different LLM providers (OpenAI, Anthropic, open-source models)
- Adjust generation parameters (temperature, top-p, max tokens)
- Compare outputs from different models side by side
- Experiment with system prompts
- View token counts and costs
- Save and share conversations
Architecture for Your Playground
┌────────────────────────────────────────────┐
│ Frontend (React/Next.js) │
│ - Chat interface with streaming │
│ - Parameter controls (sliders, dropdowns) │
│ - Model selector │
│ - Token counter │
│ - Conversation history │
└─────────────────┬──────────────────────────┘
│
┌─────────────────▼──────────────────────────┐
│ Backend (Node.js / Python) │
│ - Unified API router for multiple LLMs │
│ - API key management │
│ - Request/response logging │
│ - Cost tracking │
└─────────────────┬──────────────────────────┘
│
       ┌───────────┼───────────┐
       ▼           ▼           ▼
  ┌─────────┐ ┌─────────┐ ┌─────────┐
  │ OpenAI  │ │Anthropic│ │ Ollama  │
  │   API   │ │   API   │ │ (local) │
  └─────────┘ └─────────┘ └─────────┘
Features to Implement (in order)
Phase 1: Core Chat
1. Basic chat interface with a single model (start with OpenAI or Anthropic)
2. Streaming responses using Server-Sent Events
3. System prompt input field
4. Temperature and max token controls
Phase 2: Multi-Model
5. Add a second provider (e.g., Anthropic if you started with OpenAI)
6. Model selector dropdown
7. Side-by-side comparison mode
Phase 3: Power Features
8. Token counter and cost estimator
9. Conversation history with save/load
10. Preset system prompts (creative writer, code reviewer, tutor, etc.)
11. Local model support via Ollama (run LLaMA, Mistral, etc. locally)
Phase 4: Advanced
12. Function/tool calling playground
13. Logprobs visualization (see the model's confidence for each token)
14. Prompt templates with variables
15. Export conversations as JSON/Markdown
Getting Started: Minimal Viable Playground
Here's the simplest possible starting point — a streaming chat with parameter controls:
// Core: unified model interface
interface LLMProvider {
name: string;
chat(params: ChatParams): AsyncIterable<string>;
}
interface ChatParams {
model: string;
messages: Message[];
temperature: number;
topP: number;
maxTokens: number;
systemPrompt?: string;
}

Key learning outcomes from building this:
- How streaming APIs work (SSE, chunked transfer encoding)
- How different providers' APIs differ (and how to abstract over them)
- How parameters like temperature and top-p actually affect output (you'll see it live)
- How system prompts shape model behavior
- How token counting and context window management work in practice
- How to handle errors, rate limits, and API quirks
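A good first step is a mock provider that satisfies the `LLMProvider` interface from the snippet above, so you can build and test the UI before wiring up a real API key. The echo behavior here is purely illustrative:

```typescript
interface Message { role: "system" | "user" | "assistant"; content: string }
interface ChatParams {
  model: string;
  messages: Message[];
  temperature: number;
  topP: number;
  maxTokens: number;
  systemPrompt?: string;
}
interface LLMProvider {
  name: string;
  chat(params: ChatParams): AsyncIterable<string>;
}

// Mock provider: streams the last user message back word by word,
// exercising the same AsyncIterable contract a real provider would.
const mockProvider: LLMProvider = {
  name: "mock",
  async *chat(params: ChatParams) {
    const last = params.messages[params.messages.length - 1];
    for (const word of last.content.split(" ")) yield word + " ";
  },
};
```

Because the frontend only sees the `AsyncIterable<string>` contract, swapping the mock for a real OpenAI or Anthropic adapter later requires no UI changes.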
What You Should Know After Reading This
If you've read this post carefully, you should be able to answer these questions:
- What is BPE and why do LLMs use it instead of word-level tokenization?
- What is self-attention and why was it a breakthrough over RNNs?
- What's the difference between GPT-style (decoder-only) and BERT-style (encoder-only) architectures?
- What is the difference between SFT and RLHF? Why do you need both?
- What is a reward model and how is it trained?
- What's the difference between temperature, top-k, and top-p sampling?
- Why is Chatbot Arena considered more reliable than benchmarks like MMLU?
- What are the main components of a production chatbot system beyond just the LLM?
- What role does data cleaning play, and what's the difference between RefinedWeb, Dolma, and FineWeb?
- What is DPO and why is it becoming popular as an alternative to PPO-based RLHF?
If you can't answer all of them yet, re-read the relevant section. These are the foundations everything else builds on.
Further Reading
For those who want to go deeper on any topic covered here:
- "Attention Is All You Need" (Vaswani et al., 2017) — The original Transformer paper
- "Language Models are Few-Shot Learners" (Brown et al., 2020) — The GPT-3 paper
- "Training language models to follow instructions with human feedback" (Ouyang et al., 2022) — The InstructGPT/RLHF paper
- "LLaMA: Open and Efficient Foundation Language Models" (Touvron et al., 2023) — The original LLaMA paper
- "Direct Preference Optimization" (Rafailov et al., 2023) — The DPO paper
- "The RefinedWeb Dataset for Falcon LLM" (Penedo et al., 2023) — Deep dive into web data cleaning
- "Dolma: An Open Corpus of Three Trillion Tokens" (Soldaini et al., 2024) — AI2's open data documentation
- "FineWeb: decanting the web for the finest text data" (Penedo et al., 2024) — HuggingFace's data pipeline
- Andrej Karpathy's "Let's build GPT from scratch" — Best video walkthrough of Transformer internals
- Chip Huyen's "Designing Machine Learning Systems" — Essential reading for ML in production
Next in the Series
Part 2: Customer Support Chatbot with RAGs & Prompt Engineering — We build a system that gives your LLM access to external knowledge. You'll learn about embeddings, vector databases, chunking strategies, prompt engineering patterns, and how to build a RAG pipeline for a customer support chatbot that actually works.