The Complete ML & AI Learning Roadmap — With Code
A hands-on, code-first guide to learning machine learning and AI. Every concept comes with runnable Python code you can copy-paste and execute. Start from scratch, build real things, and prepare for ML interviews.
This guide is different from most ML roadmaps. Instead of just listing topics, every concept comes with runnable Python code you can copy-paste into a Jupyter notebook and execute right now.
The philosophy: understand by building. We start with the simplest possible version of each concept, then expand from there.
Prerequisites: Basic Python knowledge. That's it. We'll cover the math as we go.
Setup: Create a fresh Python environment and install the essentials:
# Create environment
python -m venv ml-env
source ml-env/bin/activate # On Windows: ml-env\Scripts\activate
# Install core libraries
pip install numpy pandas matplotlib scikit-learn jupyter
# Install deep learning (later phases)
pip install torch torchvision
# Install LLM tools (later phases)
pip install transformers datasets openai anthropic
Start a Jupyter notebook:
jupyter notebook
Let's go.
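Optional sanity check (my addition, not part of the original setup): confirm the core libraries import before going further.

```python
# Verify the core scientific stack is importable and report versions.
import importlib

for name in ["numpy", "pandas", "matplotlib", "sklearn"]:
    try:
        mod = importlib.import_module(name)
        print(f"{name}: {getattr(mod, '__version__', 'unknown')}")
    except ImportError:
        print(f"{name}: NOT INSTALLED — rerun the pip install step")
```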
Phase 0: The Math, Through Code
You don't need to study math for months before writing ML code. Instead, we'll learn the math by implementing it. Each concept below is something you'll actually use.
Vectors and Dot Products
A vector is just a list of numbers. In ML, everything is a vector — a data point, a model's weights, an image pixel row.
import numpy as np
# A vector is a 1D array of numbers
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Vector addition — element by element
print(a + b) # [5, 7, 9]
# Scalar multiplication — multiply every element
print(3 * a) # [3, 6, 9]
# Dot product — multiply element by element, then sum
# This is the MOST IMPORTANT operation in ML
dot = np.dot(a, b) # 1*4 + 2*5 + 3*6 = 32
print(dot) # 32
# WHY the dot product matters:
# It measures SIMILARITY between two vectors.
# If two vectors point in the same direction → large positive dot product
# If perpendicular → dot product is 0
# If opposite → large negative dot product
# Example: are these movie preferences similar?
alice = np.array([5, 1, 4, 0, 2]) # [action, romance, sci-fi, horror, comedy]
bob = np.array([4, 0, 5, 1, 1])
carol = np.array([0, 5, 0, 0, 5])
print(f"Alice-Bob similarity: {np.dot(alice, bob)}") # 42 — similar tastes!
print(f"Alice-Carol similarity: {np.dot(alice, carol)}") # 15 — different tastes
Matrix Multiplication
Every layer of a neural network is a matrix multiplication. If you understand this, you understand what a neural network does at each layer.
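One note before the walk-through below (my addition, with made-up numbers): the walk-through runs a single input vector through the layer, but in practice frameworks stack many inputs as rows of a matrix. The same dimension rule applies, and one matrix multiply processes the whole batch.

```python
import numpy as np

# Four inputs, 3 features each — one row per example
X = np.array([
    [170, 70, 25],
    [160, 55, 30],
    [180, 90, 40],
    [175, 80, 22],
])  # shape (4, 3)

# A 2-neuron layer: each row of W is one neuron's weights
W = np.array([
    [0.1, 0.3, -0.2],
    [-0.5, 0.2, 0.4],
])  # shape (2, 3)

# Batched layer: (4, 3) @ (3, 2) → (4, 2), one output row per input row
out = X @ W.T
print(out.shape)  # (4, 2)
print(out[0])     # [33.0, -61.0] — first example, same math as the single-vector case
```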
import numpy as np
# A matrix is a 2D array
# Think of it as: each ROW is a different operation
# applied to the input vector
# Input: 3 features (e.g., height, weight, age)
x = np.array([170, 70, 25])
# Weight matrix: 2 neurons, each looking at 3 inputs
# Each ROW is one neuron's weights
W = np.array([
[0.1, 0.3, -0.2], # neuron 1
[-0.5, 0.2, 0.4], # neuron 2
])
# Matrix multiply: each neuron computes a dot product with the input
output = W @ x # same as np.dot(W, x)
print(output) # [170*0.1 + 70*0.3 + 25*(-0.2), 170*(-0.5) + 70*0.2 + 25*0.4]
# [33.0, -61.0]
# That's it. That's what a neural network layer does.
# Input vector → multiply by weight matrix → output vector
# The magic is in LEARNING the right weights.
# Dimension rules:
# If W is (2, 3) and x is (3,), output is (2,)
# If W is (10, 5) and x is (5,), output is (10,)
# General: (m, n) @ (n,) → (m,)
# The inner dimensions must match!
print(f"W shape: {W.shape}, x shape: {x.shape}, output shape: {output.shape}")
Derivatives and Gradient Descent
Gradient descent is how every ML model learns. The intuition: you're standing on a hilly landscape and want to find the lowest point. The gradient tells you which direction is "downhill."
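One caveat worth seeing early (my addition): the learning rate controls the step size, and too large a value makes each step overshoot the minimum so the iterates diverge. A minimal sketch with the same f(x) = (x − 3)² used in the walk-through below:

```python
# f(x) = (x - 3)^2 has gradient f'(x) = 2(x - 3).
# With learning_rate > 1, each update multiplies the error (x - 3)
# by (1 - 2 * learning_rate), whose magnitude exceeds 1 → divergence.
def gradient(x):
    return 2 * (x - 3)

x = 10.0
learning_rate = 1.1  # deliberately too large
for step in range(5):
    x = x - learning_rate * gradient(x)
    print(f"step {step}: x = {x:.1f}")
# x swings further from 3 each step: -5.4, 13.1, -9.1, 17.5, -14.4
```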
import numpy as np
import matplotlib.pyplot as plt
# Let's learn gradient descent by finding the minimum of a simple function
# f(x) = (x - 3)^2
# The minimum is at x=3 (obviously), but let's find it with gradient descent
# The derivative (gradient) of f(x) = (x-3)^2 is f'(x) = 2(x-3)
def f(x):
return (x - 3) ** 2
def gradient(x):
return 2 * (x - 3)
# Start at a random position
x = 10.0
learning_rate = 0.1
history = [x]
# Gradient descent: take steps downhill
for step in range(20):
grad = gradient(x) # Which direction is downhill?
x = x - learning_rate * grad # Take a step in that direction
history.append(x)
print(f"Step {step+1:2d}: x = {x:.4f}, f(x) = {f(x):.6f}, gradient = {grad:.4f}")
# Visualize
xs = np.linspace(-1, 11, 100)
plt.figure(figsize=(10, 5))
plt.plot(xs, f(xs), 'b-', label='f(x) = (x-3)²')
plt.plot(history, [f(h) for h in history], 'ro-', markersize=5, label='Gradient descent path')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Gradient Descent Finding the Minimum')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
print(f"\nFinal x: {x:.6f} (true minimum: 3.0)")
Probability Basics
ML is about making predictions under uncertainty. Every classifier outputs probabilities.
import numpy as np
# SOFTMAX: converts raw numbers into probabilities
# This is how neural networks output probabilities
def softmax(x):
"""Convert a vector of numbers into probabilities that sum to 1."""
exp_x = np.exp(x - np.max(x)) # subtract max for numerical stability
return exp_x / exp_x.sum()
# Raw model outputs (called "logits")
logits = np.array([2.0, 1.0, 0.1])
# Convert to probabilities
probs = softmax(logits)
print(f"Logits: {logits}")
print(f"Probabilities: {probs}")
print(f"Sum: {probs.sum():.4f}") # Always sums to 1.0
print(f"Prediction: class {np.argmax(probs)} with {probs.max():.1%} confidence")
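A related knob you'll meet again when calling LLM APIs (my addition): dividing the logits by a "temperature" before softmax. Low temperature sharpens the distribution toward the top class; high temperature flattens it toward uniform.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))  # subtract max for numerical stability
    return exp_x / exp_x.sum()

logits = np.array([2.0, 1.0, 0.1])
for temperature in [0.5, 1.0, 2.0]:
    probs = softmax(logits / temperature)
    print(f"T={temperature}: {probs.round(3)}")
# Lower T → the top class gets more probability mass
# Higher T → probabilities move toward uniform
```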
# CROSS-ENTROPY LOSS: measures how wrong our predictions are
# Lower = better. 0 = perfect.
def cross_entropy_loss(predicted_probs, true_class):
"""How wrong is our prediction? Lower is better."""
return -np.log(predicted_probs[true_class])
# If the true answer is class 0:
loss = cross_entropy_loss(probs, true_class=0)
print(f"\nTrue class: 0, predicted prob: {probs[0]:.4f}, loss: {loss:.4f}")
# If we predicted poorly (true class is 2, but we gave it low probability):
loss_bad = cross_entropy_loss(probs, true_class=2)
print(f"True class: 2, predicted prob: {probs[2]:.4f}, loss: {loss_bad:.4f}") # Higher loss = worse prediction
Phase 1: Build Your First ML Models
Linear Regression from Scratch
Before using sklearn, let's build linear regression ourselves. This teaches you exactly what's happening under the hood.
import numpy as np
import matplotlib.pyplot as plt
# Generate some fake data: house price = 50 * size + 100 + noise
np.random.seed(42)
X = np.random.uniform(1, 10, 100) # house size (100s of sq ft)
y = 50 * X + 100 + np.random.normal(0, 30, 100) # price ($1000s)
# LINEAR REGRESSION FROM SCRATCH
# Model: y = w * x + b
# Goal: find w and b that minimize the error
w = 0.0 # weight (slope)
b = 0.0 # bias (intercept)
learning_rate = 0.01
n = len(X)
losses = []
for epoch in range(100):
# Forward pass: make predictions
y_pred = w * X + b
# Compute loss (Mean Squared Error)
loss = np.mean((y_pred - y) ** 2)
losses.append(loss)
# Compute gradients (derivatives of loss w.r.t. w and b)
dw = (2/n) * np.sum((y_pred - y) * X)
db = (2/n) * np.sum(y_pred - y)
# Update weights (gradient descent)
w -= learning_rate * dw
b -= learning_rate * db
if epoch % 20 == 0:
print(f"Epoch {epoch:3d}: loss={loss:.2f}, w={w:.2f}, b={b:.2f}")
print(f"\nLearned: y = {w:.2f} * x + {b:.2f}")
print(f"True: y = 50.00 * x + 100.00")
# Plot results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: data + fitted line
axes[0].scatter(X, y, alpha=0.5, s=20, label='Data')
x_line = np.linspace(0, 11, 100)
axes[0].plot(x_line, w * x_line + b, 'r-', linewidth=2, label=f'y = {w:.1f}x + {b:.1f}')
axes[0].set_xlabel('House Size')
axes[0].set_ylabel('Price')
axes[0].set_title('Linear Regression')
axes[0].legend()
# Right: loss over time
axes[1].plot(losses)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Loss (MSE)')
axes[1].set_title('Training Loss')
plt.tight_layout()
plt.show()
The Same Thing with Scikit-Learn (3 Lines)
Now that you understand what's happening, here's the library version:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
# Same data as before
np.random.seed(42)
X = np.random.uniform(1, 10, 100).reshape(-1, 1)
y = 50 * X.ravel() + 100 + np.random.normal(0, 30, 100)
# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit model
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Weight: {model.coef_[0]:.2f}, Bias: {model.intercept_:.2f}")
print(f"Test MSE: {mse:.2f}")
Logistic Regression: Your First Classifier
Logistic regression is the "hello world" of classification. It's also the building block of neural networks.
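Before reaching for sklearn, it helps to see the mechanics: logistic regression is just a linear model whose output is squashed through a sigmoid into a probability. A minimal sketch (the weights here are made up, not learned):

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into (0, 1) — interpretable as a probability
    return 1 / (1 + np.exp(-z))

# A hypothetical "learned" model: P(class=1) = sigmoid(w · x + b)
w = np.array([1.5, -2.0])
b = 0.3

x = np.array([0.8, 0.4])
z = w @ x + b   # linear part, exactly like linear regression
p = sigmoid(z)  # squash into a probability
print(f"z = {z:.2f}, P(class=1) = {p:.3f}")
print("Predicted class:", int(p > 0.5))
```

Training finds w and b by gradient descent on cross-entropy loss, which sklearn handles below.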
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Generate a binary classification dataset
X, y = make_classification(
n_samples=500, n_features=2, n_redundant=0,
n_informative=2, random_state=42, n_clusters_per_class=1
)
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)
# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(f"\nClassification Report:\n{classification_report(y_test, y_pred)}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")
# Show probabilities for first 5 predictions
print("\nFirst 5 predictions with probabilities:")
for i in range(5):
print(f" True: {y_test[i]}, Predicted: {y_pred[i]}, "
f"P(class=0): {y_proba[i][0]:.3f}, P(class=1): {y_proba[i][1]:.3f}")
# Visualize decision boundary
fig, ax = plt.subplots(figsize=(8, 6))
xx, yy = np.meshgrid(np.linspace(X[:, 0].min()-1, X[:, 0].max()+1, 200),
np.linspace(X[:, 1].min()-1, X[:, 1].max()+1, 200))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap='RdBu', edgecolors='black', s=50)
ax.set_title('Logistic Regression Decision Boundary')
plt.show()
Decision Trees and Random Forests
Trees are the most intuitive ML algorithm. A random forest is just many trees voting together.
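To make "many trees voting" concrete, here's a hand-rolled sketch of the idea (my simplification of what RandomForestClassifier automates): train a few trees on bootstrap resamples, then take a majority vote.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

rng = np.random.default_rng(42)
trees = []
for _ in range(5):
    # Bootstrap sample: draw n rows WITH replacement
    idx = rng.integers(0, len(X), size=len(X))
    t = DecisionTreeClassifier(max_depth=3, random_state=0)
    t.fit(X[idx], y[idx])
    trees.append(t)

# Each tree votes; the majority wins
votes = np.stack([t.predict(X) for t in trees])  # shape (5, 150)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print(f"Ensemble training accuracy: {(majority == y).mean():.2%}")
```

A real random forest also samples a random subset of features at each split, which decorrelates the trees further.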
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
# Load the classic Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Single Decision Tree
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)
# Print the tree — this is why trees are interpretable
print("Decision Tree Rules:")
print(export_text(tree, feature_names=iris.feature_names))
# Random Forest — 100 trees voting together
rf = RandomForestClassifier(n_estimators=100, random_state=42)
# Cross-validation: test on data the model hasn't seen
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"Random Forest Accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
# Feature importance — which features matter most?
rf.fit(X, y)
for name, importance in sorted(zip(iris.feature_names, rf.feature_importances_),
key=lambda x: x[1], reverse=True):
print(f" {name}: {importance:.3f}")
XGBoost: The Kaggle Winner
XGBoost/LightGBM wins almost every tabular data competition. Here's how to use it:
pip install xgboost lightgbm
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Load California housing dataset
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42
)
# Train XGBoost
model = xgb.XGBRegressor(
n_estimators=200,
max_depth=6,
learning_rate=0.1,
random_state=42,
verbosity=0
)
model.fit(X_train, y_train,
eval_set=[(X_test, y_test)],
verbose=False)
# Evaluate
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")
# Feature importance
print("\nTop 5 Features:")
for name, imp in sorted(zip(data.feature_names, model.feature_importances_),
key=lambda x: x[1], reverse=True)[:5]:
print(f" {name}: {imp:.3f}")
Evaluation Metrics: Know These Cold
import numpy as np
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
roc_auc_score, confusion_matrix
)
# Simulated predictions for a medical test
# 0 = healthy, 1 = disease
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0])
# Confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"True Positives: {tp} (correctly identified disease)")
print(f"True Negatives: {tn} (correctly identified healthy)")
print(f"False Positives: {fp} (healthy person flagged as sick)")
print(f"False Negatives: {fn} (sick person missed!)")
# Metrics
print(f"\nAccuracy: {accuracy_score(y_true, y_pred):.2%} — overall correctness")
print(f"Precision: {precision_score(y_true, y_pred):.2%} — of those we flagged, how many were right?")
print(f"Recall: {recall_score(y_true, y_pred):.2%} — of all sick people, how many did we catch?")
print(f"F1 Score: {f1_score(y_true, y_pred):.2%} — harmonic mean of precision & recall")
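The imports above also pull in roc_auc_score, which never gets used. Unlike the other metrics it needs probability scores rather than hard labels — a quick illustration with made-up scores (my numbers, for demonstration only):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
# Hypothetical model scores: P(class=1) for each example
y_score = np.array([0.1, 0.4, 0.8, 0.35, 0.9, 0.2, 0.7, 0.6])

# AUC = probability that a random positive is ranked above a random negative
auc = roc_auc_score(y_true, y_score)
print(f"AUC: {auc:.3f}")  # 0.875 here; 1.0 = perfect ranking, 0.5 = random
```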
# WHEN TO USE WHICH:
# - Accuracy: only when classes are balanced
# - Precision: when false positives are expensive (spam filter — don't want to block real emails)
# - Recall: when false negatives are dangerous (cancer screening — don't miss sick patients)
# - F1: when you need to balance precision and recall
# - AUC-ROC: when you need a threshold-independent metric
Phase 2: Deep Learning — Neural Networks from Scratch
Build a Neural Network with Just NumPy
This is the most important exercise in this entire guide. If you understand this code, you understand deep learning.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
# Generate a non-linear dataset (two interleaving half circles)
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Reshape y to be (n, 1)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
# --- Activation functions ---
def sigmoid(z):
return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
def sigmoid_derivative(a):
return a * (1 - a)
def relu(z):
return np.maximum(0, z)
def relu_derivative(z):
return (z > 0).astype(float)
# --- Initialize the network ---
np.random.seed(42)
input_size = 2
hidden_size = 16
output_size = 1
learning_rate = 0.1
# He initialization for weights
W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2 / input_size)
b1 = np.zeros((1, hidden_size))
W2 = np.random.randn(hidden_size, output_size) * np.sqrt(2 / hidden_size)
b2 = np.zeros((1, output_size))
# --- Training loop ---
losses = []
for epoch in range(1000):
# === FORWARD PASS ===
# Layer 1: input → hidden
z1 = X_train @ W1 + b1 # linear transformation
a1 = relu(z1) # activation
# Layer 2: hidden → output
z2 = a1 @ W2 + b2 # linear transformation
a2 = sigmoid(z2) # sigmoid for binary classification
# === COMPUTE LOSS (binary cross-entropy) ===
eps = 1e-8 # avoid log(0)
loss = -np.mean(y_train * np.log(a2 + eps) + (1 - y_train) * np.log(1 - a2 + eps))
losses.append(loss)
# === BACKWARD PASS (backpropagation) ===
m = X_train.shape[0]
# Output layer gradients
dz2 = a2 - y_train # derivative of loss w.r.t. z2
dW2 = (1/m) * a1.T @ dz2 # gradient for W2
db2 = (1/m) * np.sum(dz2, axis=0, keepdims=True)
# Hidden layer gradients (chain rule!)
da1 = dz2 @ W2.T # how much each hidden neuron contributed to error
dz1 = da1 * relu_derivative(z1) # apply activation derivative
dW1 = (1/m) * X_train.T @ dz1 # gradient for W1
db1 = (1/m) * np.sum(dz1, axis=0, keepdims=True)
# === UPDATE WEIGHTS ===
W2 -= learning_rate * dW2
b2 -= learning_rate * db2
W1 -= learning_rate * dW1
b1 -= learning_rate * db1
if epoch % 200 == 0:
print(f"Epoch {epoch:4d}: loss = {loss:.4f}")
# === EVALUATE ===
# Forward pass on test data
z1 = X_test @ W1 + b1
a1 = relu(z1)
z2 = a1 @ W2 + b2
a2 = sigmoid(z2)
predictions = (a2 > 0.5).astype(int)
accuracy = np.mean(predictions == y_test)
print(f"\nTest accuracy: {accuracy:.2%}")
The Same Network in PyTorch (Industry Standard)
Now that you understand the mechanics, here's the PyTorch way:
import torch
import torch.nn as nn
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
# Data
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Convert to PyTorch tensors
X_train_t = torch.FloatTensor(X_train)
y_train_t = torch.FloatTensor(y_train).unsqueeze(1)
X_test_t = torch.FloatTensor(X_test)
y_test_t = torch.FloatTensor(y_test).unsqueeze(1)
# Define the model
class SimpleNet(nn.Module):
def __init__(self):
super().__init__()
self.layer1 = nn.Linear(2, 16)
self.layer2 = nn.Linear(16, 1)
self.relu = nn.ReLU()
self.sigmoid = nn.Sigmoid()
def forward(self, x):
x = self.relu(self.layer1(x))
x = self.sigmoid(self.layer2(x))
return x
model = SimpleNet()
criterion = nn.BCELoss() # Binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Train
for epoch in range(500):
# Forward pass
y_pred = model(X_train_t)
loss = criterion(y_pred, y_train_t)
# Backward pass + update (PyTorch does this for you!)
optimizer.zero_grad()
loss.backward()
optimizer.step()
if epoch % 100 == 0:
print(f"Epoch {epoch}: loss = {loss.item():.4f}")
# Evaluate
with torch.no_grad():
y_pred = model(X_test_t)
accuracy = ((y_pred > 0.5).float() == y_test_t).float().mean()
print(f"\nTest accuracy: {accuracy:.2%}")
Image Classification with a CNN
The classic "hello world" of deep learning — classifying handwritten digits:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
# Load MNIST
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST('./data', train=False, transform=transform)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=1000)
# CNN model
class CNN(nn.Module):
def __init__(self):
super().__init__()
self.conv1 = nn.Conv2d(1, 32, 3, padding=1) # 1 input channel, 32 filters, 3x3 kernel
self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
self.pool = nn.MaxPool2d(2) # 2x2 max pooling
self.fc1 = nn.Linear(64 * 7 * 7, 128) # flatten → fully connected
self.fc2 = nn.Linear(128, 10) # 10 digit classes
self.relu = nn.ReLU()
self.dropout = nn.Dropout(0.25)
def forward(self, x):
x = self.pool(self.relu(self.conv1(x))) # 28x28 → 14x14
x = self.pool(self.relu(self.conv2(x))) # 14x14 → 7x7
x = x.view(-1, 64 * 7 * 7) # flatten
x = self.dropout(self.relu(self.fc1(x)))
x = self.fc2(x) # raw logits (no softmax — CrossEntropyLoss handles it)
return x
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = CNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Train
for epoch in range(5):
model.train()
total_loss = 0
for batch_X, batch_y in train_loader:
batch_X, batch_y = batch_X.to(device), batch_y.to(device)
optimizer.zero_grad()
output = model(batch_X)
loss = criterion(output, batch_y)
loss.backward()
optimizer.step()
total_loss += loss.item()
# Evaluate
model.eval()
correct = 0
total = 0
with torch.no_grad():
for batch_X, batch_y in test_loader:
batch_X, batch_y = batch_X.to(device), batch_y.to(device)
output = model(batch_X)
_, predicted = torch.max(output, 1)
total += batch_y.size(0)
correct += (predicted == batch_y).sum().item()
print(f"Epoch {epoch+1}: loss={total_loss/len(train_loader):.4f}, "
f"test accuracy={correct/total:.2%}")
Self-Attention from Scratch
This is the core of Transformers, GPT, BERT, and every modern LLM. Understanding this code means understanding modern AI.
import torch
import torch.nn.functional as F
import math
# Self-Attention from scratch
# The idea: each token looks at every other token and decides
# what information to gather
def self_attention(Q, K, V):
"""
Q (query): "What am I looking for?" — shape (seq_len, d_k)
K (key): "What do I contain?" — shape (seq_len, d_k)
V (value): "What info do I give?" — shape (seq_len, d_v)
"""
d_k = Q.shape[-1]
# Step 1: Compute attention scores (how much should token i attend to token j?)
scores = Q @ K.T / math.sqrt(d_k) # scale to prevent softmax saturation
print("Attention scores (before softmax):")
print(scores.numpy().round(2))
# Step 2: Convert to probabilities
attention_weights = F.softmax(scores, dim=-1)
print("\nAttention weights (after softmax):")
print(attention_weights.numpy().round(3))
# Step 3: Weighted sum of values
output = attention_weights @ V
return output, attention_weights
# Example: 4 tokens, each embedded as a 3D vector
torch.manual_seed(42)
seq_len = 4
d_model = 3
# Pretend these are word embeddings for: ["The", "cat", "sat", "down"]
embeddings = torch.randn(seq_len, d_model)
# In a real Transformer, Q/K/V come from learned linear projections
# For simplicity, we'll use the embeddings directly
W_q = torch.randn(d_model, d_model) * 0.5
W_k = torch.randn(d_model, d_model) * 0.5
W_v = torch.randn(d_model, d_model) * 0.5
Q = embeddings @ W_q
K = embeddings @ W_k
V = embeddings @ W_v
output, weights = self_attention(Q, K, V)
print(f"\nOutput shape: {output.shape}") # (4, 3) — each token now has context from all others
print(f"\nEach row of attention weights shows how much each token attends to the others.")
print(f"Row 0 (token 'The'): {weights[0].numpy().round(3)}")
Phase 3: Working with LLMs
Call the OpenAI API
pip install openai
from openai import OpenAI
client = OpenAI() # reads OPENAI_API_KEY from environment
# Basic completion
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain gradient descent in one paragraph."}
],
temperature=0.7,
max_tokens=200
)
print(response.choices[0].message.content)
print(f"\nTokens used: {response.usage.total_tokens}")
print(f"Cost: ~${response.usage.total_tokens * 0.00015 / 1000:.6f}")
Call the Anthropic API
pip install anthropic
import anthropic
client = anthropic.Anthropic() # reads ANTHROPIC_API_KEY from environment
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=200,
system="You are a helpful AI tutor. Be concise.",
messages=[
{"role": "user", "content": "Explain backpropagation in one paragraph."}
]
)
print(response.content[0].text)
print(f"\nInput tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
Prompt Engineering: Techniques That Work
from openai import OpenAI
client = OpenAI()
def ask(prompt, system="You are a helpful assistant.", temperature=0.7):
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system},
{"role": "user", "content": prompt}
],
temperature=temperature
)
return response.choices[0].message.content
# --- ZERO-SHOT ---
print("=== Zero-Shot ===")
print(ask("Classify this review as positive or negative: 'The food was terrible and the service was slow.'"))
# --- FEW-SHOT ---
print("\n=== Few-Shot ===")
print(ask("""Classify these reviews:
Review: "Amazing experience, will come back!" → Positive
Review: "Worst meal I've ever had." → Negative
Review: "Decent food but overpriced." → ?"""))
# --- CHAIN OF THOUGHT ---
print("\n=== Chain of Thought ===")
print(ask("""A store sells apples for $2 each. They have a buy-3-get-1-free deal.
How much do 7 apples cost?
Think step by step."""))
# --- STRUCTURED OUTPUT ---
print("\n=== Structured JSON Output ===")
print(ask(
"""Extract the entities from this text and return as JSON:
"Apple CEO Tim Cook announced the iPhone 16 at their Cupertino headquarters on September 9, 2024."
Return format: {"people": [...], "organizations": [...], "products": [...], "locations": [...], "dates": [...]}""",
temperature=0
))
Build a RAG System from Scratch
This is the most practical LLM skill. RAG lets you give an LLM access to your own data.
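Before the full pipeline, the core retrieval idea fits in a few lines of plain NumPy (an illustrative sketch with toy data — real systems use learned embeddings, not word counts): turn every text into a vector, then rank documents by cosine similarity to the query.

```python
import numpy as np

docs = [
    "pytorch is a deep learning framework",
    "numpy handles arrays and linear algebra",
    "git tracks versions of source code",
]

# Toy "embedding": bag-of-words count vector over a shared vocabulary
vocab = sorted({w for d in docs for w in d.split()})

def embed(text):
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

doc_vecs = [embed(d) for d in docs]
query = "which framework for deep learning"
q = embed(query)

# Rank documents by similarity to the query — this is the "R" in RAG
ranked = sorted(zip(docs, [cosine(q, v) for v in doc_vecs]),
                key=lambda p: p[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.3f}  {doc}")
```

ChromaDB below does exactly this, but with real neural embeddings and an index that scales past a handful of documents.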
pip install chromadb openai
import chromadb
from openai import OpenAI
client = OpenAI()
# --- Step 1: Create a vector database and add documents ---
chroma = chromadb.Client()
collection = chroma.create_collection("my_docs")
# Your knowledge base (could be docs, wiki, help articles, etc.)
documents = [
"Python was created by Guido van Rossum and first released in 1991.",
"PyTorch was developed by Meta AI and is the most popular deep learning framework.",
"NumPy is the fundamental package for scientific computing in Python.",
"Scikit-learn provides simple and efficient tools for data mining and analysis.",
"Jupyter notebooks allow interactive computing with code, text, and visualizations.",
"FastAPI is a modern web framework for building APIs with Python.",
"Docker containers package applications with their dependencies for consistent deployment.",
"Git is a distributed version control system created by Linus Torvalds.",
]
# Add documents to the vector DB (ChromaDB auto-generates embeddings)
collection.add(
documents=documents,
ids=[f"doc_{i}" for i in range(len(documents))]
)
print(f"Added {len(documents)} documents to the vector database.")
# --- Step 2: Query the database (semantic search) ---
def search(query, n_results=3):
results = collection.query(query_texts=[query], n_results=n_results)
return results['documents'][0]
# Test search
query = "What framework should I use for deep learning?"
relevant_docs = search(query)
print(f"\nQuery: {query}")
print(f"Retrieved documents:")
for i, doc in enumerate(relevant_docs):
print(f" {i+1}. {doc}")
# --- Step 3: Generate answer using retrieved context ---
def rag_answer(question):
# Retrieve relevant documents
context_docs = search(question, n_results=3)
context = "\n".join(f"- {doc}" for doc in context_docs)
# Build the prompt with context
prompt = f"""Answer the question based ONLY on the following context.
If the context doesn't contain the answer, say "I don't have information about that."
Context:
{context}
Question: {question}
Answer:"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0
)
return response.choices[0].message.content, context_docs
# Test RAG
questions = [
"What's the best framework for deep learning?",
"Who created Python?",
"How do I deploy applications consistently?",
"What is the capital of France?", # Not in our knowledge base!
]
for q in questions:
answer, sources = rag_answer(q)
print(f"\nQ: {q}")
print(f"A: {answer}")
Fine-Tune a Model with Hugging Face
pip install transformers datasets peft accelerate bitsandbytes
from datasets import load_dataset
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
TrainingArguments,
Trainer
)
import numpy as np
# Load a small dataset — movie review sentiment
dataset = load_dataset("rotten_tomatoes")
print(f"Train: {len(dataset['train'])}, Test: {len(dataset['test'])}")
print(f"Example: {dataset['train'][0]}")
# Load a pre-trained model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
# Tokenize the dataset
def tokenize(batch):
return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=128)
tokenized = dataset.map(tokenize, batched=True)
# Define training arguments
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=32,
per_device_eval_batch_size=64,
eval_strategy="epoch",
logging_steps=100,
learning_rate=2e-5,
weight_decay=0.01,
save_strategy="no",
)
# Metric
def compute_metrics(eval_pred):
logits, labels = eval_pred
predictions = np.argmax(logits, axis=-1)
accuracy = (predictions == labels).mean()
return {"accuracy": accuracy}
# Train!
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized["train"],
eval_dataset=tokenized["test"],
compute_metrics=compute_metrics,
)
trainer.train()
# Evaluate
results = trainer.evaluate()
print(f"\nTest accuracy: {results['eval_accuracy']:.2%}")
# Use the model
def predict_sentiment(text):
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
outputs = model(**inputs)
probs = outputs.logits.softmax(dim=-1)
label = "Positive" if probs[0][1] > 0.5 else "Negative"
confidence = probs[0].max().item()
return label, confidence
# Test
for text in ["This movie was absolutely fantastic!", "Terrible waste of time."]:
label, conf = predict_sentiment(text)
print(f"'{text}' → {label} ({conf:.1%})")
Phase 4: Implement a Paper — Transformer from Scratch
This is the exercise that separates people who understand deep learning from people who copy-paste. We implement the core of "Attention Is All You Need."
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class MultiHeadAttention(nn.Module):
def __init__(self, d_model, n_heads):
super().__init__()
assert d_model % n_heads == 0
self.d_model = d_model
self.n_heads = n_heads
self.d_k = d_model // n_heads
self.W_q = nn.Linear(d_model, d_model)
self.W_k = nn.Linear(d_model, d_model)
self.W_v = nn.Linear(d_model, d_model)
self.W_o = nn.Linear(d_model, d_model)
def forward(self, x, mask=None):
batch_size, seq_len, _ = x.shape
# Project to Q, K, V
Q = self.W_q(x).view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)
K = self.W_k(x).view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)
V = self.W_v(x).view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)
# Scaled dot-product attention
scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
if mask is not None:
scores = scores.masked_fill(mask == 0, float('-inf'))
attention = F.softmax(scores, dim=-1)
context = attention @ V
# Concatenate heads and project
context = context.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
return self.W_o(context)
class FeedForward(nn.Module):
def __init__(self, d_model, d_ff):
super().__init__()
self.linear1 = nn.Linear(d_model, d_ff)
self.linear2 = nn.Linear(d_ff, d_model)
def forward(self, x):
return self.linear2(F.relu(self.linear1(x)))
class TransformerBlock(nn.Module):
def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
super().__init__()
self.attention = MultiHeadAttention(d_model, n_heads)
self.ff = FeedForward(d_model, d_ff)
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
self.dropout = nn.Dropout(dropout)
def forward(self, x, mask=None):
# Pre-norm: layer norm, then self-attention, then residual connection
attn_out = self.attention(self.norm1(x), mask)
x = x + self.dropout(attn_out)
# Pre-norm: layer norm, then feed-forward, then residual connection
ff_out = self.ff(self.norm2(x))
x = x + self.dropout(ff_out)
return x
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]
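To check what that table actually contains, here is the same sinusoidal encoding built in plain NumPy (small toy sizes, no PyTorch needed), verifying that every position gets a distinct pattern:

```python
import numpy as np

d_model, max_len = 128, 50
position = np.arange(max_len)[:, None].astype(float)
div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))

pe = np.zeros((max_len, d_model))
pe[:, 0::2] = np.sin(position * div_term)  # even dims: sine at varying frequencies
pe[:, 1::2] = np.cos(position * div_term)  # odd dims: cosine at varying frequencies

print(pe.shape)                    # (50, 128)
print(np.allclose(pe[0], pe[1]))   # False — positions 0 and 1 are distinguishable
```

Position 0 is always the fixed pattern [0, 1, 0, 1, ...] (sin 0 and cos 0), and each later position rotates those sinusoids at different rates, which is what lets attention tell positions apart.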
class MiniGPT(nn.Module):
    """A tiny GPT-style language model."""
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=4, d_ff=512, max_len=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads, d_ff) for _ in range(n_layers)
        ])
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        seq_len = x.size(1)
        # Causal mask: each token can only attend to previous tokens
        mask = torch.tril(torch.ones(seq_len, seq_len, device=x.device)).unsqueeze(0).unsqueeze(0)
        x = self.embedding(x)
        x = self.pos_encoding(x)
        for block in self.blocks:
            x = block(x, mask)
        x = self.norm(x)
        logits = self.head(x)
        return logits
# Test it
vocab_size = 1000
model = MiniGPT(vocab_size)
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {total_params:,}")
# Forward pass with random input
x = torch.randint(0, vocab_size, (2, 32)) # batch of 2, sequence length 32
logits = model(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {logits.shape}")  # (2, 32, 1000) — one logit per vocab token at each position
# Next token prediction
next_token_probs = F.softmax(logits[0, -1, :], dim=-1)
next_token = torch.argmax(next_token_probs)
print(f"Predicted next token: {next_token.item()} with prob {next_token_probs[next_token]:.4f}")
Phase 5: Interview Preparation — Practice Problems
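A natural warm-up before the problems below: extend single-step next-token prediction into a full autoregressive generation loop, a common follow-up question after implementing a decoder-only model. A minimal standalone sketch — TinyLM is a hypothetical stand-in so the snippet runs on its own; the same loop works with any model mapping (batch, seq) token ids to (batch, seq, vocab) logits, including the MiniGPT above:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 1000

class TinyLM(torch.nn.Module):
    """Stand-in language model: embedding + linear head, untrained."""
    def __init__(self, vocab_size, d_model=32):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, d_model)
        self.head = torch.nn.Linear(d_model, vocab_size)
    def forward(self, x):
        return self.head(self.emb(x))  # (batch, seq, vocab) logits

@torch.no_grad()
def generate(model, prompt, max_new_tokens=10, temperature=1.0):
    x = prompt.clone()
    for _ in range(max_new_tokens):
        logits = model(x)[:, -1, :] / temperature       # logits for the last position
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # sample one token
        x = torch.cat([x, next_token], dim=1)           # append and repeat
    return x

model = TinyLM(vocab_size)
prompt = torch.randint(0, vocab_size, (1, 5))
out = generate(model, prompt, max_new_tokens=10)
print(out.shape)  # torch.Size([1, 15]) — prompt plus 10 generated tokens
```

Swapping `torch.multinomial` for `torch.argmax` gives greedy decoding; lowering `temperature` sharpens the distribution toward the greedy choice.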
ML Coding: Implement K-Means from Scratch
This is a classic interview question.
import numpy as np
import matplotlib.pyplot as plt
def kmeans(X, k, max_iters=100):
    """K-Means clustering from scratch."""
    n_samples = X.shape[0]
    # Step 1: Initialize centroids randomly
    indices = np.random.choice(n_samples, k, replace=False)
    centroids = X[indices].copy()
    for iteration in range(max_iters):
        # Step 2: Assign each point to the nearest centroid
        distances = np.sqrt(((X[:, np.newaxis] - centroids) ** 2).sum(axis=2))
        labels = np.argmin(distances, axis=1)
        # Step 3: Update centroids to the mean of their assigned points
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Check for convergence
        if np.allclose(centroids, new_centroids):
            print(f"Converged at iteration {iteration}")
            break
        centroids = new_centroids
    return labels, centroids
# Generate data
np.random.seed(42)
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
# Run our K-Means
labels, centroids = kmeans(X, k=4)
# Visualize
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30, alpha=0.7)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200, edgecolors='black')
plt.title('K-Means Clustering (from scratch)')
plt.show()
ML System Design: Recommendation System
Walk through this out loud as if you're in an interview:
"""
Interview Question: Design a movie recommendation system.
Step 1: Clarify requirements
- Personalized recommendations for each user
- Should handle new users (cold start)
- Needs to serve millions of users with low latency
Step 2: Approach — Collaborative Filtering
Two users who liked the same movies will likely enjoy similar movies.
"""
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# User-movie rating matrix (0 = not rated)
# Rows = users, Columns = movies
ratings = np.array([
    [5, 4, 0, 0, 1],  # User 0: likes action/adventure
    [4, 5, 0, 1, 0],  # User 1: likes action/adventure
    [0, 0, 5, 4, 0],  # User 2: likes romance/drama
    [0, 1, 4, 5, 0],  # User 3: likes romance/drama
    [3, 3, 3, 3, 3],  # User 4: likes everything equally
])
movies = ["Action Hero", "Space Wars", "Love Story", "Drama Queen", "Horror Night"]
# Compute user-user similarity
user_similarity = cosine_similarity(ratings)
print("User similarity matrix:")
for i, row in enumerate(user_similarity):
    print(f"  User {i}: {row.round(2)}")
# Predict: what would User 0 rate "Love Story" (movie 2)?
def predict_rating(user_id, movie_id, ratings, similarity):
    # Find users who rated this movie
    rated_mask = ratings[:, movie_id] > 0
    if not rated_mask.any():
        return 0
    # Weighted average of other users' ratings, weighted by similarity
    sim_scores = similarity[user_id][rated_mask]
    user_ratings = ratings[rated_mask, movie_id]
    if np.abs(sim_scores).sum() == 0:  # avoid division by zero
        return 0
    return np.dot(sim_scores, user_ratings) / np.abs(sim_scores).sum()
# Recommend movies for User 0
print("\nRecommendations for User 0:")
for movie_id, movie_name in enumerate(movies):
    if ratings[0, movie_id] == 0:  # Only recommend unrated movies
        predicted = predict_rating(0, movie_id, ratings, user_similarity)
        print(f"  {movie_name}: predicted rating = {predicted:.2f}")
Where to Go Next
Now that you have the foundations with working code, here are your next steps:
| What to Do | Where |
|---|---|
| Build more projects | Kaggle competitions — start with "Getting Started" ones |
| Watch Karpathy's series | "Neural Networks: Zero to Hero" on YouTube |
| Read the Transformer paper | "Attention Is All You Need" — now you can follow the math |
| Build a full RAG app | Expand the RAG example above with a real UI |
| Study system design | Designing Machine Learning Systems by Chip Huyen |
| Practice interviews | NeetCode for coding, the system design framework above for ML design |
The key insight: you learn ML by building things, not by watching courses. Courses give you structure. Building gives you understanding. Do both, but always be building.