
Build an LLM Playground — Part 1: How Large Language Models Actually Work

The first entry in a learn-by-doing series to become an AI engineer. We break down every stage of how LLMs are built — from raw data to chatbot — so you can build your own playground with real understanding.

Tags: ai · llm · machine-learning · transformers · deep-learning · tutorial · series

Series: The AI Engineer Learning Path

This is Part 1 of a hands-on series designed to take you from zero to working AI engineer. Every post follows a learn-by-doing philosophy — we explain the theory, then you build something real.

Part | Topic | Status
1 | Build an LLM Playground (this post) | Current
2 | Customer Support Chatbot with RAG & Prompt Engineering | Available
3 | "Ask-the-Web" Agent with Tool Calling | Available
4 | Deep Research with Reasoning Models | Available
5 | Multi-modal Generation Agent | Available

By the end of this post, you'll understand every stage of how a large language model is built — from raw internet data to the chatbot you interact with daily. More importantly, you'll have a mental model that makes every other AI concept click into place.


Why Start Here?

If you want to be an AI engineer, you don't need to train GPT-5 from scratch. But you do need to understand what happens under the hood. Without that understanding, you'll be copy-pasting API calls without knowing why things break, why outputs are weird, or how to fix them.

This post covers the full lifecycle of an LLM:

  1. Pre-Training — How raw data becomes a model that can predict text
  2. Post-Training — How that model is refined to follow instructions and be helpful
  3. Evaluation — How we measure whether the model is actually good
  4. Chatbot Design — How the full system around the model works
  5. Build the Playground — Practical guidance for building your own LLM playground

Let's go.


Part I: Pre-Training — Teaching a Model to Predict Language

Pre-training is where the model learns language itself. It reads billions of pages of text and learns to predict the next word. That's the core idea — everything else is details. But the details matter enormously.

1. Data Collection

LLMs are trained on massive text datasets — hundreds of billions to trillions of tokens. Where does all that text come from?

Sources of Training Data

Source | Description | Scale
Common Crawl | Monthly snapshots of the public internet. The largest open web corpus available. Raw dumps are petabytes of HTML. | ~250 billion pages
Manual/Targeted Crawling | Custom web scrapers targeting specific high-quality domains — Wikipedia, Stack Overflow, GitHub, arXiv, textbooks, legal filings, patent databases. | Varies by source
Books Corpora | Digitized books (BookCorpus, Project Gutenberg, Books3). Long-form, high-quality prose. | Millions of books
Code Repositories | GitHub public repositories, filtered by license. Critical for code-capable models (Codex, StarCoder, Code Llama). | Hundreds of millions of files
Academic Papers | arXiv, Semantic Scholar, PubMed. Essential for scientific reasoning. | Tens of millions of papers
Conversational Data | Reddit, forums, Q&A sites. Teaches dialogue patterns and informal language. | Billions of posts

The Data Scale Problem

GPT-3 was trained on ~300 billion tokens. LLaMA 2 used 2 trillion tokens. Modern frontier models likely use 10+ trillion tokens. At this scale, data quality and deduplication aren't nice-to-haves — they're the difference between a good model and a bad one.

Key insight: The quality of your data matters more than the quantity. A model trained on 1 trillion clean tokens will outperform one trained on 5 trillion noisy tokens.


2. Data Cleaning

Raw web data is a mess. It contains duplicates, spam, porn, malware, boilerplate HTML, navigation menus, cookie banners, and pages that are 90% ads. Cleaning this data is one of the most important — and most underrated — parts of building an LLM.

The Cleaning Pipeline

Raw HTML → Text Extraction → Language Filtering → Quality Filtering →
Deduplication → PII Removal → Toxic Content Filtering → Final Dataset

Key Data Cleaning Projects

Understanding these projects gives you insight into what "clean data" actually means in practice:

RefinedWeb (Falcon)

  • Built by the Technology Innovation Institute for the Falcon models
  • Started from Common Crawl and applied aggressive filtering
  • Used trafilatura for text extraction from HTML (much better than simple tag stripping)
  • Applied URL-based filtering, language identification, and document-level quality heuristics
  • Heavy deduplication using MinHash LSH (fuzzy matching that catches near-duplicates, not just exact copies)
  • Result: 5 trillion tokens of high-quality web text
  • Key finding: with enough cleaning, web-only data can match curated datasets

Dolma (OLMo / AI2)

  • Built by the Allen Institute for AI for the OLMo family of models
  • Fully open and documented — you can see every filtering decision
  • Mixed sources: Common Crawl, Wikipedia, Project Gutenberg, Semantic Scholar, GitHub, Reddit
  • Uses a pipeline of taggers that annotate text with quality signals (language, toxicity, duplication, etc.) and then filters based on those tags
  • Explicitly documents trade-offs — for example, aggressive deduplication improves quality but reduces diversity
  • Result: 3 trillion tokens with full provenance

FineWeb (HuggingFace)

  • Built by HuggingFace as a fully open, reproducible web dataset
  • 15 trillion tokens from 96 Common Crawl snapshots (2013-2024)
  • Key innovation: developed custom quality classifiers trained on educational content
  • FineWeb-Edu subset: filtered to only educational content, resulting in significant benchmark improvements despite being much smaller
  • Every step is documented and reproducible — a model for open data practices

What Gets Filtered Out

Filter | What It Catches | Why It Matters
Language detection | Non-target language text | Training an English model on Chinese text wastes compute
Deduplication | Repeated pages, boilerplate, scraped content farms | Duplicates cause the model to memorize rather than generalize
Quality heuristics | Short pages, high symbol-to-word ratio, abnormally high-perplexity text | Removes spam, auto-generated content, and gibberish
URL filtering | Known spam domains, adult content sites | Removes obviously low-quality sources
PII removal | Email addresses, phone numbers, SSNs | Legal and ethical requirement
Toxicity filtering | Hate speech, violence, explicit content | Reduces harmful model outputs

Hands-on exercise: Download a small Common Crawl WARC file and try extracting clean text from it. You'll immediately understand why data cleaning is a multi-billion-dollar problem.
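The heuristics in the table above are easy to prototype. Here's a toy document-level quality filter; the rules and thresholds are illustrative assumptions, not taken from any production pipeline:

```python
def passes_quality_filter(text, min_words=20, max_symbol_ratio=0.3):
    """Toy quality heuristics in the spirit of the table above:
    reject very short pages and pages dominated by non-alphanumeric symbols.
    Thresholds are made up for illustration."""
    words = text.split()
    # Heuristic 1: very short documents are usually navigation or spam.
    if len(words) < min_words:
        return False
    # Heuristic 2: a high symbol-to-character ratio suggests markup or spam.
    symbols = sum(1 for c in text if not (c.isalnum() or c.isspace()))
    if symbols / max(len(text), 1) > max_symbol_ratio:
        return False
    return True

print(passes_quality_filter("Buy now!!! $$$ >>> click <<<"))     # False: too short
print(passes_quality_filter("A real paragraph of prose " * 10))  # True
```

Real pipelines layer dozens of such signals (plus classifiers and deduplication), but the principle is the same: cheap per-document checks applied at enormous scale.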


3. Tokenization

Computers don't understand words. They understand numbers. Tokenization is the process of converting text into a sequence of integer IDs that the model can process.

Why Not Just Use Characters or Words?

Approach | Problem
Character-level | Sequences become extremely long. "artificial intelligence" = 23 characters. The model needs to learn spelling from scratch. Very slow to train.
Word-level | Vocabulary explodes. Every misspelling, conjugation, and compound word needs its own entry. Out-of-vocabulary words become a constant problem.
Subword | The sweet spot. Common words stay whole ("the", "is"), rare words get split into meaningful pieces ("un" + "believ" + "able"). Fixed vocabulary, handles any input.

Byte-Pair Encoding (BPE)

BPE is the most widely used tokenization algorithm (used by GPT, LLaMA, and most modern LLMs). Here's how it works:

Training the tokenizer:

  1. Start with a vocabulary of individual bytes (256 entries)
  2. Scan the training corpus and find the most frequent pair of adjacent tokens
  3. Merge that pair into a new token and add it to the vocabulary
  4. Repeat steps 2-3 until you reach your desired vocabulary size (typically 32K-128K tokens)

Example of BPE in action:

Input:  "lowest lower newest"

Step 0: ['l','o','w','e','s','t',' ','l','o','w','e','r',' ','n','e','w','e','s','t']
Step 1: merge (e,s) → es:  ['l','o','w','es','t',' ','l','o','w','e','r',' ','n','e','w','es','t']
Step 2: merge (es,t) → est: ['l','o','w','est',' ','l','o','w','e','r',' ','n','e','w','est']
Step 3: merge (l,o) → lo:  ['lo','w','est',' ','lo','w','e','r',' ','n','e','w','est']
Step 4: merge (lo,w) → low: ['low','est',' ','low','e','r',' ','n','e','w','est']
...and so on
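The training loop above fits in a few lines of Python. This toy version represents the corpus word-by-word rather than as one character stream (as in the walkthrough above), so the exact merge order differs, but the algorithm is the same:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merges from a whitespace-separated corpus (toy sketch)."""
    # Represent each word as a tuple of single-character tokens.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge to every word in the corpus.
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges, words

merges, words = train_bpe("lowest lower newest", 4)
print(merges)  # first merge here is ('w', 'e')
```

Production tokenizers (tiktoken, HuggingFace tokenizers) do the same thing with byte-level vocabularies, pre-tokenization rules, and heavy optimization.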

Tokenization at inference time:

Once trained, the tokenizer applies the learned merges in order to encode any new text.

"lowest"  → [low, est]      → [4521, 382]
"highest" → [high, est]     → [9301, 382]
"cat"     → [cat]           → [2163]
"Pokémon" → [Pok, é, mon]   → [51, 8948, 1711]

Other Tokenization Methods

Method | Used By | Key Difference
BPE | GPT-2/3/4, LLaMA, Mistral | Merges most frequent pairs. Industry standard.
WordPiece | BERT, DistilBERT | Similar to BPE but uses likelihood instead of frequency for merges.
Unigram | T5, ALBERT | Starts with a large vocabulary and prunes down. Can output multiple tokenizations with probabilities.
SentencePiece | LLaMA, T5, many multilingual models | Language-agnostic framework (implements BPE and Unigram). Treats the input as a raw stream — no need for pre-tokenization rules.

Why Tokenization Matters for AI Engineers

  • Token limits are not word limits. "I don't know" is 4 words but might be 3-5 tokens depending on the tokenizer. When an API says "128K context," that's tokens, not words.
  • Cost is per token. API pricing is based on token count. Efficient prompts = lower cost.
  • Different models use different tokenizers. You can't assume token counts are portable across models.
  • Tokenization artifacts. Some models struggle with simple arithmetic because numbers get tokenized inconsistently ("380" might be [3, 80] or [380] depending on context).

Hands-on exercise: Use OpenAI's tiktoken or HuggingFace's tokenizers library to tokenize the same sentence with different model tokenizers. Compare the results — you'll see surprisingly large differences.


4. Architecture: Neural Networks and Transformers

This is the core of the model. We'll go from first principles to the architectures you'll work with daily.

Neural Networks in 60 Seconds

A neural network is a function that maps inputs to outputs through layers of weighted connections.

Input → [Layer 1] → [Layer 2] → ... → [Layer N] → Output

Each layer takes a vector of numbers, multiplies by a weight matrix, adds a bias, and applies a non-linear activation function. Training adjusts the weights to minimize a loss function (the difference between predicted and actual outputs).

For language models, the input is a sequence of token embeddings (vectors that represent tokens) and the output is a probability distribution over the vocabulary for the next token.

The Transformer Architecture

The Transformer (Vaswani et al., 2017, "Attention Is All You Need") is the architecture behind every modern LLM. Here's why it matters and how it works.

The key innovation: Self-Attention

Before Transformers, models processed text sequentially (RNNs, LSTMs). Word 50 had to wait for words 1-49 to be processed first. This was slow and made it hard to capture long-range dependencies.

Self-attention lets every token attend to every other token in parallel. The model can directly connect "it" to "the dog" even if they're 200 tokens apart.

How self-attention works:

For each token, the model computes three vectors from the token's embedding:

  • Query (Q) — "What am I looking for?"
  • Key (K) — "What do I contain?"
  • Value (V) — "What information do I provide?"

The attention score between two tokens is the dot product of one token's Query with another's Key, scaled and softmaxed. The output is a weighted sum of the Values.

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Multi-Head Attention runs multiple attention operations in parallel (e.g., 32 or 64 heads), each learning to attend to different types of relationships — one head might learn syntax, another semantics, another coreference.
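The attention formula can be made concrete with a dependency-free sketch of a single head (no batching, no causal mask, and no learned Q/K/V projections; the vectors are given directly):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors (one head)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Score each key against this query: q . k / sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, V)) for i in range(len(V[0]))])
    return out

# Two queries attending over three key/value pairs, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(attention(Q, K, V))
```

Each output row is a convex combination of the value vectors, weighted by how well the query matched each key — that's the whole mechanism.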

Full Transformer block:

Input
  ↓
Multi-Head Self-Attention + Residual Connection + Layer Norm
  ↓
Feed-Forward Network (2 linear layers with activation) + Residual Connection + Layer Norm
  ↓
Output

A GPT-style model stacks 32-96+ of these blocks. More blocks = more capacity = more parameters.

The GPT Family

GPT (Generative Pre-trained Transformer) models are decoder-only Transformers. They use causal (masked) self-attention — each token can only attend to tokens before it, not after. This makes them natural text generators: they predict one token at a time, left to right.

Model | Parameters | Training Data | Context Length | Key Innovation
GPT-2 (2019) | 1.5B | WebText (40GB) | 1,024 tokens | Showed scaling works. Released with "too dangerous" controversy.
GPT-3 (2020) | 175B | 300B tokens | 2,048 tokens | Few-shot learning via prompting. No fine-tuning needed for many tasks.
GPT-3.5 (2022) | ~175B | GPT-3 data + RLHF training | 4,096 tokens | InstructGPT + ChatGPT. First model to feel "useful" to the public.
GPT-4 (2023) | Undisclosed (rumored MoE) | Undisclosed | 8K / 32K / 128K tokens | Multimodal (vision), dramatically better reasoning.
GPT-4o (2024) | Undisclosed | Undisclosed | 128K tokens | Natively multimodal (text, vision, audio), faster, cheaper.

The LLaMA Family (Open-Weight Models)

Meta's LLaMA family democratized large language models by releasing model weights to the research community.

Model | Parameters | Training Data | Key Innovation
LLaMA (2023) | 7B, 13B, 33B, 65B | 1.4T tokens | Showed smaller models trained on more data beat larger models.
LLaMA 2 (2023) | 7B, 13B, 70B | 2T tokens | Open commercial license. Grouped-Query Attention (GQA) for faster inference.
LLaMA 3 (2024) | 8B, 70B | 15T tokens | Massive data scaling. Larger vocabulary (128K tokens).
LLaMA 3.1 (2024) | 8B, 70B, 405B | 15T+ tokens | 128K context. Tool use. The 405B model competes with GPT-4 class models.

Architectural improvements in LLaMA vs. original GPT:

  • RMSNorm instead of LayerNorm (simpler, equally effective)
  • Rotary Position Embeddings (RoPE) instead of learned position embeddings (better extrapolation to longer sequences)
  • SwiGLU activation instead of ReLU in the feed-forward layers (better performance)
  • Grouped-Query Attention (GQA) — shares Key/Value heads across multiple Query heads, reducing memory during inference without hurting quality

Other Notable Architectures

Model Family | Creator | Key Feature
Mistral / Mixtral | Mistral AI | Sliding window attention + Mixture of Experts (MoE). Mixtral 8x7B uses 8 expert FFNs and routes each token to 2 of them — only 13B active parameters with 47B total.
Claude | Anthropic | Constitutional AI training. Strong reasoning. Details undisclosed.
Gemini | Google DeepMind | Natively multimodal from the ground up (not a bolted-on vision encoder).
DeepSeek | DeepSeek | Open-weight MoE models. DeepSeek-V2 introduced Multi-head Latent Attention (MLA) for extremely efficient KV cache.
Phi | Microsoft | Small models (1.3B-14B) trained on high-quality "textbook" data. Shows that data quality can compensate for parameter count.
Qwen | Alibaba | Strong multilingual performance, especially for Chinese + English. Competitive with LLaMA at equivalent sizes.

5. Text Generation: How LLMs Actually Produce Output

The model outputs a probability distribution over its vocabulary for the next token. But how do you turn probabilities into text? This is the decoding strategy, and it has a huge impact on output quality.

Pick the highest-probability token at every step.

P("the") = 0.4, P("a") = 0.3, P("my") = 0.2, ...
→ Pick "the"

Pros: Fast, deterministic. Cons: Repetitive, boring, often gets stuck in loops ("the the the..."). Misses better sequences where a lower-probability early token leads to higher overall probability.

Beam Search

Maintain the top-k sequences (beams) at each step and pick the highest-scoring complete sequence.

Beam 1: "The cat sat on" (score: -2.3)
Beam 2: "A dog ran to"  (score: -2.5)
Beam 3: "The cat ran on" (score: -2.7)
→ Continue expanding all three, prune to top k

Pros: Finds higher-probability sequences than greedy. Good for translation and summarization. Cons: Still tends toward generic, safe outputs. Computationally expensive. Not great for creative or conversational text.

Temperature Sampling

Scale the logits (raw model outputs) by a temperature value before applying softmax. Then sample from the resulting distribution.

temperature = 0.0 → Greedy (always pick the top token)
temperature = 0.7 → Mild randomness (good default for most tasks)
temperature = 1.0 → Sample directly from the model's distribution
temperature = 1.5 → Very random, creative, potentially incoherent

Lower temperature = more focused, deterministic, repetitive. Higher temperature = more creative, diverse, potentially nonsensical.
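A minimal sketch of temperature sampling over raw logits, using only the standard library:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Sample a token index from raw logits scaled by temperature (toy sketch)."""
    if temperature == 0.0:
        # Temperature 0 degenerates to greedy decoding: always the top token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    # Stable softmax over the scaled logits.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one index from the categorical distribution.
    r, cum = rng.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.1]
print(sample_with_temperature(logits, 0.0))  # greedy: index 0
```

Dividing logits by a temperature below 1 sharpens the distribution toward the top token; above 1 flattens it, making unlikely tokens more probable.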

Top-k Sampling

Only consider the top-k most probable tokens. Redistribute their probabilities and sample.

k = 50: Consider the top 50 tokens at each step
k = 10: More focused
k = 1:  Greedy search

Problem: A fixed k doesn't adapt. Sometimes the model is very confident (top 3 tokens cover 95% of probability — k=50 wastes compute on junk tokens). Sometimes the model is uncertain (top 50 tokens only cover 60% — k=50 might still miss good options).

Top-p (Nucleus) Sampling

Instead of a fixed count, include the smallest set of tokens whose cumulative probability exceeds p.

p = 0.9: Include tokens until their probabilities sum to 0.9
p = 0.5: More focused
p = 1.0: Consider all tokens (temperature sampling only)

This adapts to the model's confidence. When the model is confident, only a few tokens are considered. When it's uncertain, more tokens are included.
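Both filters are a few lines each. A sketch operating on an already-normalized probability list:

```python
def top_k_filter(probs, k):
    """Keep the k highest-probability tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

probs = [0.5, 0.3, 0.1, 0.05, 0.05]
print(top_k_filter(probs, 2))    # token 0 gets 0.5/0.8, token 1 gets 0.3/0.8
print(top_p_filter(probs, 0.9))  # keeps tokens 0, 1, 2
```

Note how top-p adapts: with a peakier distribution it would keep fewer tokens for the same p, while top-k always keeps exactly k.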

In Practice: Combining Strategies

Most production systems combine temperature + top-p:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Typical chat configuration
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    temperature=0.7,
    top_p=0.9,
)

Use Case | Temperature | Top-p | Why
Code generation | 0.0-0.2 | 1.0 | Correctness matters. Low randomness.
Chat / conversation | 0.7 | 0.9 | Natural, varied, but coherent.
Creative writing | 0.9-1.2 | 0.95 | More surprising word choices.
Factual Q&A | 0.0-0.3 | 1.0 | Accuracy over creativity.

Repetition Penalty and Other Controls

  • Repetition penalty: Reduces the probability of tokens that have already appeared. Prevents "the the the" loops.
  • Frequency penalty: Penalizes tokens proportionally to how often they've appeared. Encourages vocabulary diversity.
  • Presence penalty: Penalizes any token that has appeared at all (binary). Encourages topic diversity.
  • Stop sequences: Halt generation when specific strings are produced (e.g., "\n\nHuman:" in a chatbot).
  • Max tokens: Hard cap on output length.

Part II: Post-Training — Making the Model Useful

Pre-training gives you a model that can predict text. But a raw pre-trained model is like a brilliant student who's read every book in the library but has never had a conversation. It will complete your prompt, but it won't answer your question.

Post-training bridges this gap. It transforms a text predictor into an assistant.

1. Supervised Fine-Tuning (SFT)

SFT is conceptually simple: show the model examples of good behavior and train it to mimic them.

Training data format:

{
  "messages": [
    {"role": "user", "content": "Explain quantum entanglement simply."},
    {"role": "assistant", "content": "Imagine you have two coins that are magically linked..."}
  ]
}

You collect thousands to hundreds of thousands of these (prompt, ideal response) pairs. The model is trained to maximize the probability of the ideal response given the prompt.
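Concretely, each chat gets flattened into token IDs with the prompt portion masked out of the loss, so the model is only trained on the assistant's reply. This is a sketch: the chat template is a made-up stand-in, and the -100 masking value follows a common PyTorch/HuggingFace convention; real pipelines use the model's own template.

```python
def build_sft_example(messages, tokenize):
    """Flatten a chat into (input_ids, labels); non-assistant tokens are
    masked with -100 so the loss ignores them (a common convention)."""
    input_ids, labels = [], []
    for msg in messages:
        # Hypothetical chat template: "<role>\ncontent\n".
        ids = tokenize(f"<{msg['role']}>\n{msg['content']}\n")
        input_ids.extend(ids)
        if msg["role"] == "assistant":
            labels.extend(ids)                    # learn to produce the response
        else:
            labels.extend([-100] * len(ids))      # ignore prompt tokens in the loss
    return input_ids, labels

# Stand-in tokenizer for the sketch: one "token" per character.
tokenize = lambda text: [ord(c) for c in text]
messages = [
    {"role": "user", "content": "Explain quantum entanglement simply."},
    {"role": "assistant", "content": "Imagine two magically linked coins..."},
]
input_ids, labels = build_sft_example(messages, tokenize)
print(sum(l != -100 for l in labels), "supervised tokens out of", len(labels))
```

The masking is the key detail: without it, the model would also be trained to reproduce user prompts, which wastes capacity and can teach odd behavior.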

Where do the examples come from?

SourceDescriptionQuality
Human annotatorsPaid contractors write ideal responses. Expensive but high quality.Highest
DistillationUse a stronger model (GPT-4) to generate training data for a smaller model.High
Open datasetsOpenAssistant, Dolly, ShareGPT, UltraChat. Free but variable quality.Variable
Synthetic generationUse the model itself + filtering to generate training data. Self-play.Medium-High

What SFT teaches:

  • Follow instructions ("Write a poem about..." → actually writes a poem)
  • Adopt a helpful persona (answers questions rather than continuing the prompt)
  • Format outputs properly (markdown, code blocks, numbered lists)
  • Refuse harmful requests (though this is crude without RL)

Limitations of SFT:

SFT alone produces a model that imitates the training examples. It doesn't learn why some responses are better than others. It can't generalize the concept of "helpfulness" beyond the specific examples it's seen. This is where reinforcement learning comes in.


2. Reinforcement Learning and RLHF

Reinforcement Learning from Human Feedback (RLHF) teaches the model to optimize for human preferences rather than just imitating examples.

The RLHF Pipeline

Step 1: Train a Reward Model (RM)
  Human annotators rank model outputs from best to worst
  → Train a model to predict these rankings (the reward model)

Step 2: Optimize the LLM using RL
  The LLM generates responses
  → The reward model scores them
  → The LLM is updated to produce higher-scoring responses
  → A KL penalty prevents the model from drifting too far from the SFT baseline

Step 1: Reward Models

A reward model takes a (prompt, response) pair and outputs a scalar score representing quality.

Training data: Human annotators are shown the same prompt with 2-4 different model responses. They rank them from best to worst. The reward model is trained on these comparisons.

Prompt: "What is the capital of France?"
Response A: "The capital of France is Paris." (Rank 1 - best)
Response B: "Paris is a city in Europe." (Rank 2)
Response C: "France is a country." (Rank 3 - worst)

The reward model learns to assign: score(A) > score(B) > score(C)

What the reward model captures:

  • Helpfulness (did it answer the question?)
  • Harmlessness (did it avoid dangerous content?)
  • Honesty (did it avoid making things up?)
  • Formatting quality, tone, detail level
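Training on rankings typically reduces to a pairwise (Bradley–Terry style) objective on score differences: the loss is small when the reward model already scores the chosen response above the rejected one. A minimal sketch:

```python
import math

def pairwise_rm_loss(score_chosen, score_rejected):
    """Bradley-Terry style pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Small when the reward model ranks the chosen response higher."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Correct ranking (chosen scored higher) gives a small loss;
# an inverted ranking gives a large one.
print(pairwise_rm_loss(2.0, -1.0))  # ~0.049
print(pairwise_rm_loss(-1.0, 2.0))  # ~3.049
```

A ranking of 3 or 4 responses is decomposed into all of its pairwise comparisons and trained with the same loss.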

Step 2: Policy Optimization with PPO

Proximal Policy Optimization (PPO) is the most common RL algorithm used for RLHF. Here's the intuition:

  1. Generate: The LLM (called the "policy") generates a response to a prompt
  2. Score: The reward model scores the response
  3. Update: Adjust the LLM's weights to increase the probability of high-scoring responses
  4. Constrain: A KL divergence penalty prevents the model from changing too much in a single step (which would cause instability or "reward hacking")
Objective = E[reward(response)] - β * KL(policy || reference_policy)

The β term is crucial — without it, the model quickly learns to exploit quirks of the reward model rather than genuinely improving.
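A toy numeric version of that objective, approximating the KL term by the mean log-probability ratio of the sampled tokens (one common estimator; real implementations use per-token estimates inside the PPO update):

```python
def kl_penalized_reward(reward, policy_logprobs, reference_logprobs, beta=0.1):
    """Per-sequence RLHF score: reward minus beta times a KL estimate.
    The KL is approximated by the mean log-ratio over the sampled tokens."""
    kl = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs)) / len(policy_logprobs)
    return reward - beta * kl

# The policy assigns higher log-probs than the reference (it has drifted),
# so part of the reward is given back as a KL penalty.
score = kl_penalized_reward(
    reward=1.0,
    policy_logprobs=[-0.1, -0.2, -0.1],
    reference_logprobs=[-0.5, -0.6, -0.4],
)
print(score)  # ~0.963
```

When β is too small the policy "reward hacks"; when it's too large the policy barely moves from the SFT model. Tuning it is a central RLHF knob.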

Verifiable Tasks and Process Reward Models

A newer trend moves away from pure human-preference reward models toward verifiable rewards — tasks where the answer can be checked automatically.

Approach | How It Works | Example
Outcome Reward Models (ORM) | Score the final answer only. Binary: right or wrong. | Math: is 2+2=4? Correct!
Process Reward Models (PRM) | Score each reasoning step individually. | "Step 1: correct. Step 2: correct. Step 3: wrong."
Verifiable tasks | Use tasks with known answers as training signal. No human annotation needed. | Code that passes test cases, math with known solutions.

Why this matters: Human preference annotation is expensive, slow, and subjective. Verifiable tasks provide unlimited, objective training signal. DeepSeek-R1 and OpenAI's o1/o3 models heavily use this approach for reasoning.

Alternatives to PPO

Method | Description | Advantage
DPO (Direct Preference Optimization) | Skips the reward model entirely. Directly optimizes the LLM using preference pairs. Much simpler pipeline. | No reward model needed. Fewer hyperparameters. Stable training.
REINFORCE | Classic policy gradient. Simpler than PPO but higher variance. | Simplicity.
GRPO (Group Relative Policy Optimization) | Used by DeepSeek. Groups responses and uses relative ranking within the group as the reward signal. | No separate reward model. Works well for reasoning tasks.
KTO (Kahneman-Tversky Optimization) | Uses binary feedback (good/bad) instead of ranked comparisons. Inspired by prospect theory. | Easier to collect binary feedback than rankings.
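The DPO loss itself is compact enough to sketch directly. Inputs are total sequence log-probabilities under the trained policy and the frozen reference model; beta=0.1 is an illustrative value:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss from sequence log-probs under the policy (pi_*) and the
    frozen reference model (ref_*): -log sigmoid(beta * log-ratio margin)."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already prefers the chosen response relative to the reference:
# the margin is positive, so the loss drops below log(2) (the neutral value).
print(dpo_loss(-5.0, -9.0, -6.0, -7.0))  # ~0.554
```

Notice there is no reward model and no sampling loop: the preference pair plus two forward passes (policy and reference) are enough, which is why DPO pipelines are so much simpler than PPO ones.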

3. The Full Post-Training Pipeline in Practice

Modern post-training is multi-stage:

Pre-trained Model
  ↓
SFT on instruction-following data
  ↓
RLHF/DPO on human preferences (helpfulness)
  ↓
Safety training (refusals, harmlessness)
  ↓
Specialized RL on verifiable tasks (math, code, reasoning)
  ↓
Final model

Each stage builds on the previous one. Skip SFT, and RLHF doesn't work well; skip RLHF, and the model follows instructions but isn't refined. The order matters.


Part III: Evaluation — How Do You Know If Your Model Is Good?

Building a model is one thing. Knowing whether it's actually good is harder than it sounds.

1. Traditional NLP Metrics

These come from the pre-LLM era but are still used for specific tasks:

Metric | What It Measures | Used For | Limitation
Perplexity | How surprised the model is by the test data. Lower = better. | Language modeling quality | Doesn't measure usefulness or factuality
BLEU | N-gram overlap between generated text and reference text | Translation, summarization | A correct paraphrase can score 0. Doesn't capture meaning.
ROUGE | Recall-oriented n-gram overlap | Summarization | Same problems as BLEU
F1 Score | Precision/recall balance for extracted answers | Question answering, NER | Only works for tasks with clear correct answers
Exact Match | Binary — did the model produce the exact correct answer? | QA, classification | Too strict. "Paris" and "The answer is Paris" both fail.

The fundamental problem: These metrics measure surface-level text similarity, not whether the response is actually helpful, accurate, or well-written. This is why benchmarks and human evaluation exist.
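Perplexity, at least, is trivial to compute once you have per-token log-probabilities from a model. A sketch:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean log-probability the model
    assigned to each token. Lower means the model was less surprised."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

confident = [-0.1, -0.2, -0.1, -0.3]  # model assigned high probability
uncertain = [-2.0, -3.5, -1.8, -2.7]  # model was frequently surprised
print(perplexity(confident))  # ~1.19
print(perplexity(uncertain))  # ~12.18
```

A perplexity of 12 roughly means the model was, on average, as uncertain as if it were choosing uniformly among 12 tokens at each step.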

2. Task-Specific Benchmarks

Benchmarks provide standardized tasks with known correct answers. Here are the ones that matter:

Reasoning and Knowledge

Benchmark | What It Tests | Format | Why It Matters
MMLU | Massive Multitask Language Understanding. 57 subjects from elementary to professional level. | Multiple choice | The most-cited general knowledge benchmark. Covers STEM, humanities, social sciences, and more.
ARC | AI2 Reasoning Challenge. Grade-school science questions. | Multiple choice | Tests scientific reasoning. ARC-Challenge subset is genuinely hard.
HellaSwag | Sentence completion requiring commonsense reasoning. | Multiple choice | Tests whether the model understands how everyday situations unfold.
Winogrande | Pronoun resolution requiring world knowledge. | Binary choice | "The trophy didn't fit in the suitcase because it was too big." What was too big?
TruthfulQA | Questions where common misconceptions lead to wrong answers. | Open-ended + multiple choice | Tests whether the model gives truthful answers vs. popular-but-wrong ones.
BoolQ | Yes/no questions based on a passage. | Boolean | Tests reading comprehension.

Math and Code

Benchmark | What It Tests | Format
GSM8K | Grade-school math word problems requiring multi-step reasoning. | Open-ended (numerical answer)
MATH | Competition-level mathematics (AMC, AIME difficulty). | Open-ended
HumanEval | Python function completion. 164 problems with test cases. | Code generation
MBPP | Mostly Basic Python Problems. Simpler than HumanEval. | Code generation
SWE-bench | Real GitHub issues. The model must write a patch that resolves the issue and passes tests. | Code patch

Conversation and Instruction Following

Benchmark | What It Tests | Format
MT-Bench | Multi-turn conversation quality. 80 questions across 8 categories. | Open-ended, scored by GPT-4
AlpacaEval | Instruction following quality. Compared against a reference model. | Open-ended, LLM-as-judge
IFEval | Instruction following with verifiable constraints ("write exactly 3 paragraphs," "use no commas"). | Open-ended with automated checks

Safety

Benchmark | What It Tests
BBQ | Bias Benchmark for QA — tests for social biases
ToxiGen | Toxic content generation across demographics
RealToxicityPrompts | How often the model generates toxic continuations
XSTest | Whether safety filters over-trigger on benign prompts

3. Human Evaluation and Leaderboards

Benchmarks have a fundamental limitation: they can be gamed. A model can be trained specifically to score well on MMLU without being generally capable. This is why human evaluation matters.

Chatbot Arena (LMSYS)

The gold standard for LLM evaluation. Real users have conversations with two anonymous models side-by-side and vote for the better response. Results are aggregated into an Elo rating system (like chess).

Why it's important:

  • Real users, real tasks, real preferences
  • Models are anonymous — no brand bias
  • Elo ratings are continuously updated with new votes
  • Widely considered the most reliable LLM ranking
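An Elo-style update after a single head-to-head vote can be sketched as follows (Arena's published rankings have moved to a more elaborate Bradley–Terry fit, but the intuition is the same):

```python
def elo_update(rating_a, rating_b, a_won, k=32):
    """One Elo update after a head-to-head vote.
    Returns the new (rating_a, rating_b); k controls update size."""
    # Expected score for A given the current rating gap.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    # Zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta

# An upset: the lower-rated model wins and gains more than k/2 points.
print(elo_update(1200, 1300, a_won=True))
```

Beating a higher-rated model moves the ratings a lot; beating a much weaker one barely moves them, which is what makes the leaderboard stable under millions of votes.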

Human Evaluation Practices

Method | Description | When to Use
Side-by-side comparison | Show two model outputs, ask which is better | Ranking models against each other
Likert scale rating | Rate individual outputs on a 1-5 scale for specific criteria | Measuring specific qualities (helpfulness, accuracy, tone)
Red teaming | Humans actively try to make the model fail or produce harmful outputs | Safety evaluation before deployment
Task completion | Measure whether humans can accomplish real tasks using the model | End-to-end usefulness evaluation

LLM-as-Judge

Using a strong model (e.g., GPT-4, Claude) to evaluate outputs from other models. Faster and cheaper than human evaluation, but it introduces the judge model's own biases.

Common patterns:

  • Position bias: tends to prefer the first response shown
  • Verbosity bias: tends to prefer longer responses
  • Self-preference: models tend to rate their own outputs higher

Mitigation: Run evaluations in both orders and average. Use specific rubrics. Combine with human eval for calibration.


Part IV: Chatbot Design — The Full System

The model is just one component. A production chatbot is a system with multiple layers.

System Architecture

┌─────────────────────────────────────┐
│           User Interface            │
│  (Web app, API, mobile, CLI)        │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│        Application Layer            │
│  - Conversation management          │
│  - System prompt injection          │
│  - Tool/function calling router     │
│  - Rate limiting & auth             │
│  - Content filtering (input)        │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│          Model Layer                │
│  - LLM inference (local or API)     │
│  - Decoding parameters              │
│  - Context window management        │
│  - Streaming response               │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│        Post-Processing              │
│  - Output filtering (safety)        │
│  - Citation extraction              │
│  - Format validation                │
│  - Tool call execution              │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│       Memory & Context              │
│  - Conversation history storage     │
│  - RAG retrieval (next post!)       │
│  - Long-term memory                 │
│  - User preferences                 │
└─────────────────────────────────────┘

Key Design Decisions

System Prompts

The system prompt defines the model's persona, capabilities, and constraints. It's the most important piece of prompt engineering in a chatbot.

You are a helpful customer support agent for Acme Corp.
You can help with: billing, account issues, product questions.
You cannot: process refunds directly, access payment info, make promises about future features.
Always be polite. If unsure, say so and offer to escalate to a human agent.

Best practices:

  • Be specific about what the model should and shouldn't do
  • Include examples of ideal responses
  • Define the tone and personality
  • Specify how to handle edge cases (unknown questions, off-topic requests)

Conversation History Management

LLMs have finite context windows. Long conversations must be managed:

| Strategy | How It Works | Trade-off |
| --- | --- | --- |
| Truncation | Drop the oldest messages when the context is full | Simple but loses important early context |
| Summarization | Periodically summarize older messages into a compact form | Preserves key info but lossy |
| Sliding window | Keep the system prompt + last N messages | Predictable behavior, loses mid-conversation context |
| RAG on history | Embed and retrieve relevant past messages | Best retention but adds complexity and latency |
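Of these, the sliding window is the easiest to implement. A minimal sketch, assuming a simple OpenAI-style message shape:

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Sliding window: always keep the system prompt(s), plus the last N
// non-system messages. Everything in between is dropped.
function slidingWindow(
  history: ChatMessage[],
  maxMessages: number
): ChatMessage[] {
  const system = history.filter((m) => m.role === "system");
  const rest = history.filter((m) => m.role !== "system");
  return [...system, ...rest.slice(-maxMessages)];
}
```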

Tool Use / Function Calling

Modern chatbots aren't just text generators — they can take actions:

{
  "type": "function",
  "function": {
    "name": "search_knowledge_base",
    "description": "Search the company knowledge base for relevant articles",
    "parameters": {
      "type": "object",
      "properties": {
        "query": {"type": "string", "description": "Search query"}
      }
    }
  }
}

The model decides when to call a tool, what arguments to pass, and then incorporates the result into its response. This is how chatbots search the web, query databases, send emails, and interact with external systems.
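On the application side, the glue code is a dispatcher that maps the model's chosen tool name and JSON arguments onto a real function. A toy sketch — the `search_knowledge_base` body here is a stand-in, not a real search:

```typescript
// Each tool takes parsed JSON arguments and returns a result string
type Tool = (args: Record<string, unknown>) => string;

const tools: Record<string, Tool> = {
  // Stand-in body — a real implementation would query a search index
  search_knowledge_base: (args) => `Top articles for: ${args["query"]}`,
};

// The model emits a tool name plus a JSON argument string; we execute the
// matching function. The result would then be appended to the conversation
// for a follow-up model turn.
function executeToolCall(name: string, argsJson: string): string {
  const tool = tools[name];
  if (!tool) throw new Error(`Unknown tool: ${name}`);
  return tool(JSON.parse(argsJson));
}
```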

Streaming

Users expect to see text appear word-by-word, not wait 10 seconds for a complete response. Streaming via Server-Sent Events (SSE) is standard:

data: {"choices":[{"delta":{"content":"The"}}]}
data: {"choices":[{"delta":{"content":" capital"}}]}
data: {"choices":[{"delta":{"content":" of"}}]}
data: {"choices":[{"delta":{"content":" France"}}]}
...
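On the client side, reassembling the response means parsing each `data:` line and concatenating the deltas. A sketch for the OpenAI-style chunk format shown above:

```typescript
// Reassemble streamed text from OpenAI-style SSE lines
function extractText(sseLines: string[]): string {
  let text = "";
  for (const line of sseLines) {
    if (!line.startsWith("data: ")) continue; // skip blanks and comments
    const payload = line.slice("data: ".length);
    if (payload === "[DONE]") break; // end-of-stream sentinel
    const chunk = JSON.parse(payload);
    text += chunk.choices?.[0]?.delta?.content ?? "";
  }
  return text;
}
```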

Safety Layers

Production chatbots use multiple safety layers:

  1. Input filtering — Block or flag harmful prompts before they reach the model
  2. System prompt guardrails — Instructions in the system prompt about what to refuse
  3. Output filtering — Scan generated text for harmful content before showing it to the user
  4. Rate limiting — Prevent abuse through request limits
  5. Human escalation — Route difficult or sensitive conversations to human agents
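Layer 1 can start as a simple pattern check in front of the model, though production systems typically use trained classifiers rather than keyword lists. A toy sketch with illustrative patterns only:

```typescript
// Toy input filter: flag prompts matching a blocklist before they reach
// the model. These patterns are illustrative, not a real safety policy.
const BLOCKLIST: RegExp[] = [
  /ignore (all )?previous instructions/i, // common prompt-injection phrasing
  /credit card number/i,
];

function screenInput(prompt: string): "allow" | "flag" {
  return BLOCKLIST.some((re) => re.test(prompt)) ? "flag" : "allow";
}
```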

Part V: Build Your Own LLM Playground

Now that you understand how everything works, let's talk about what you should actually build.

What Is an LLM Playground?

An LLM playground is a web interface where you can:

  • Send prompts to different LLM providers (OpenAI, Anthropic, open-source models)
  • Adjust generation parameters (temperature, top-p, max tokens)
  • Compare outputs from different models side by side
  • Experiment with system prompts
  • View token counts and costs
  • Save and share conversations
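The token-count and cost feature is just arithmetic over per-token prices. A sketch — the rates below are hypothetical placeholders, so look up each provider's current pricing before relying on the numbers:

```typescript
// Hypothetical per-1K-token prices in USD — NOT real provider rates
const PRICES_PER_1K = { input: 0.001, output: 0.002 };

function estimateCostUSD(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1000) * PRICES_PER_1K.input +
    (outputTokens / 1000) * PRICES_PER_1K.output
  );
}
```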

Architecture for Your Playground

┌────────────────────────────────────────────┐
│              Frontend (React/Next.js)       │
│  - Chat interface with streaming            │
│  - Parameter controls (sliders, dropdowns)  │
│  - Model selector                           │
│  - Token counter                            │
│  - Conversation history                     │
└─────────────────┬──────────────────────────┘
                  │
┌─────────────────▼──────────────────────────┐
│            Backend (Node.js / Python)       │
│  - Unified API router for multiple LLMs     │
│  - API key management                       │
│  - Request/response logging                 │
│  - Cost tracking                            │
└─────────────────┬──────────────────────────┘
                  │
        ┌─────────┼──────────┐
        ▼         ▼          ▼
   ┌────────┐ ┌───────────┐ ┌──────────┐
   │ OpenAI │ │ Anthropic │ │ Ollama   │
   │  API   │ │    API    │ │ (local)  │
   └────────┘ └───────────┘ └──────────┘

Features to Implement (in order)

Phase 1: Core Chat

  1. Basic chat interface with a single model (start with OpenAI or Anthropic)
  2. Streaming responses using Server-Sent Events
  3. System prompt input field
  4. Temperature and max token controls

Phase 2: Multi-Model

  5. Add a second provider (e.g., Anthropic if you started with OpenAI)
  6. Model selector dropdown
  7. Side-by-side comparison mode

Phase 3: Power Features

  8. Token counter and cost estimator
  9. Conversation history with save/load
  10. Preset system prompts (creative writer, code reviewer, tutor, etc.)
  11. Add local model support via Ollama (run LLaMA, Mistral, etc. locally)

Phase 4: Advanced

  12. Function/tool calling playground
  13. Logprobs visualization (see the model's confidence for each token)
  14. Prompt templates with variables
  15. Export conversations as JSON/Markdown

Getting Started: Minimal Viable Playground

Here's the simplest possible starting point — a streaming chat with parameter controls:

// Core: unified model interface

// A single chat turn (OpenAI-style roles)
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

// Every provider (OpenAI, Anthropic, Ollama, ...) implements this,
// yielding text chunks as they arrive from the stream
interface LLMProvider {
  name: string;
  chat(params: ChatParams): AsyncIterable<string>;
}

interface ChatParams {
  model: string;
  messages: Message[];
  temperature: number;
  topP: number;
  maxTokens: number;
  systemPrompt?: string;
}

Key learning outcomes from building this:

  • How streaming APIs work (SSE, chunked transfer encoding)
  • How different providers' APIs differ (and how to abstract over them)
  • How parameters like temperature and top-p actually affect output (you'll see it live)
  • How system prompts shape model behavior
  • How token counting and context window management work in practice
  • How to handle errors, rate limits, and API quirks

What You Should Know After Reading This

If you've read this post carefully, you should be able to answer these questions:

  1. What is BPE and why do LLMs use it instead of word-level tokenization?
  2. What is self-attention and why was it a breakthrough over RNNs?
  3. What's the difference between GPT-style (decoder-only) and BERT-style (encoder-only) architectures?
  4. What is the difference between SFT and RLHF? Why do you need both?
  5. What is a reward model and how is it trained?
  6. What's the difference between temperature, top-k, and top-p sampling?
  7. Why is Chatbot Arena considered more reliable than benchmarks like MMLU?
  8. What are the main components of a production chatbot system beyond just the LLM?
  9. What role does data cleaning play, and what's the difference between RefinedWeb, Dolma, and FineWeb?
  10. What is DPO and why is it becoming popular as an alternative to PPO-based RLHF?

If you can't answer all of them yet, re-read the relevant section. These are the foundations everything else builds on.


Further Reading

For those who want to go deeper on any topic covered here:

  • "Attention Is All You Need" (Vaswani et al., 2017) — The original Transformer paper
  • "Language Models are Few-Shot Learners" (Brown et al., 2020) — The GPT-3 paper
  • "Training language models to follow instructions with human feedback" (Ouyang et al., 2022) — The InstructGPT/RLHF paper
  • "LLaMA: Open and Efficient Foundation Language Models" (Touvron et al., 2023) — The original LLaMA paper
  • "Direct Preference Optimization" (Rafailov et al., 2023) — The DPO paper
  • "The RefinedWeb Dataset for Falcon LLM" (Penedo et al., 2023) — Deep dive into web data cleaning
  • "Dolma: An Open Corpus of Three Trillion Tokens" (Soldaini et al., 2024) — AI2's open data documentation
  • "FineWeb: decanting the web for the finest text data" (Penedo et al., 2024) — HuggingFace's data pipeline
  • Andrej Karpathy's "Let's build GPT from scratch" — Best video walkthrough of Transformer internals
  • Chip Huyen's "Designing Machine Learning Systems" — Essential reading for ML in production

Next in the Series

Part 2: Customer Support Chatbot with RAGs & Prompt Engineering — We build a system that gives your LLM access to external knowledge. You'll learn about embeddings, vector databases, chunking strategies, prompt engineering patterns, and how to build a RAG pipeline for a customer support chatbot that actually works.
