
The Complete ML & AI Learning Roadmap — With Code

A hands-on, code-first guide to learning machine learning and AI. Every concept comes with runnable Python code you can copy-paste and execute. Start from scratch, build real things, and prepare for ML interviews.


This guide is different from most ML roadmaps. Instead of just listing topics, every concept comes with runnable Python code you can copy-paste into a Jupyter notebook and execute right now.

The philosophy: understand by building. We start with the simplest possible version of each concept, then expand from there.

Prerequisites: Basic Python knowledge. That's it. We'll cover the math as we go.

Setup: Create a fresh Python environment and install the essentials:

# Create environment
python -m venv ml-env
source ml-env/bin/activate  # On Windows: ml-env\Scripts\activate
 
# Install core libraries
pip install numpy pandas matplotlib scikit-learn jupyter
 
# Install deep learning (later phases)
pip install torch torchvision
 
# Install LLM tools (later phases)
pip install transformers datasets openai anthropic

Start a Jupyter notebook:

jupyter notebook

Let's go.


Phase 0: The Math, Through Code

You don't need to study math for months before writing ML code. Instead, we'll learn the math by implementing it. Each concept below is something you'll actually use.

Vectors and Dot Products

A vector is just a list of numbers. In ML, everything is a vector — a data point, a model's weights, an image pixel row.

import numpy as np
 
# A vector is a 1D array of numbers
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
 
# Vector addition — element by element
print(a + b)  # [5, 7, 9]
 
# Scalar multiplication — multiply every element
print(3 * a)  # [3, 6, 9]
 
# Dot product — multiply element by element, then sum
# This is the MOST IMPORTANT operation in ML
dot = np.dot(a, b)  # 1*4 + 2*5 + 3*6 = 32
print(dot)  # 32
 
# WHY the dot product matters:
# It measures SIMILARITY between two vectors.
# If two vectors point in the same direction → large positive dot product
# If perpendicular → dot product is 0
# If opposite → large negative dot product
 
# Example: are these movie preferences similar?
alice = np.array([5, 1, 4, 0, 2])  # [action, romance, sci-fi, horror, comedy]
bob   = np.array([4, 0, 5, 1, 1])
carol = np.array([0, 5, 0, 0, 5])
 
print(f"Alice-Bob similarity:   {np.dot(alice, bob)}")   # 42 — similar tastes!
print(f"Alice-Carol similarity: {np.dot(alice, carol)}")  # 15 — different tastes
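One caveat: the raw dot product grows with vector magnitude, so a user who rates everything 5 looks "similar" to everyone. Dividing by the vectors' lengths gives cosine similarity, which measures direction only. A small sketch reusing the same preference vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product of the unit vectors: 1 = same direction, 0 = perpendicular, -1 = opposite."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

alice = np.array([5, 1, 4, 0, 2])
bob   = np.array([4, 0, 5, 1, 1])
carol = np.array([0, 5, 0, 0, 5])

print(f"Alice-Bob:   {cosine_similarity(alice, bob):.3f}")    # ~0.944 — nearly parallel
print(f"Alice-Carol: {cosine_similarity(alice, carol):.3f}")  # ~0.313 — mostly different
```

Cosine similarity is the standard way to compare embeddings, which is exactly what the vector database in the RAG section does under the hood.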

Matrix Multiplication

Every layer of a neural network is a matrix multiplication. If you understand this, you understand what a neural network does at each layer.

import numpy as np
 
# A matrix is a 2D array
# Think of it as: each ROW is a different operation
# applied to the input vector
 
# Input: 3 features (e.g., height, weight, age)
x = np.array([170, 70, 25])
 
# Weight matrix: 2 neurons, each looking at 3 inputs
# Each ROW is one neuron's weights
W = np.array([
    [0.1, 0.3, -0.2],   # neuron 1
    [-0.5, 0.2, 0.4],   # neuron 2
])
 
# Matrix multiply: each neuron computes a dot product with the input
output = W @ x  # same as np.dot(W, x)
print(output)  # [170*0.1 + 70*0.3 + 25*(-0.2), 170*(-0.5) + 70*0.2 + 25*0.4]
               # [33.0, -61.0]
 
# That's it. That's what a neural network layer does.
# Input vector → multiply by weight matrix → output vector
# The magic is in LEARNING the right weights.
 
# Dimension rules:
# If W is (2, 3) and x is (3,), output is (2,)
# If W is (10, 5) and x is (5,), output is (10,)
# General: (m, n) @ (n,) → (m,)
# The inner dimensions must match!
 
print(f"W shape: {W.shape}, x shape: {x.shape}, output shape: {output.shape}")
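In practice you never feed one example at a time: stack a whole batch of inputs as the rows of a matrix and do a single matrix multiply. With rows as examples, the weight matrix above gets transposed so the inner dimensions line up (a sketch):

```python
import numpy as np

# A batch of 4 people, each with 3 features (height, weight, age)
X = np.array([
    [170, 70, 25],
    [160, 55, 30],
    [180, 90, 40],
    [175, 65, 22],
])

# Same 2-neuron weight matrix as before
W = np.array([
    [0.1, 0.3, -0.2],   # neuron 1
    [-0.5, 0.2, 0.4],   # neuron 2
])

# (4, 3) @ (3, 2) -> (4, 2): one row of neuron outputs per example
out = X @ W.T
print(out.shape)  # (4, 2)
print(out[0])     # first row matches the single-example result: 33 and -61
```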

Derivatives and Gradient Descent

Gradient descent is how every ML model learns. The intuition: you're standing on a hilly landscape and want to find the lowest point. The gradient tells you which direction is "downhill."

import numpy as np
import matplotlib.pyplot as plt
 
# Let's learn gradient descent by finding the minimum of a simple function
# f(x) = (x - 3)^2
# The minimum is at x=3 (obviously), but let's find it with gradient descent
 
# The derivative (gradient) of f(x) = (x-3)^2 is f'(x) = 2(x-3)
def f(x):
    return (x - 3) ** 2
 
def gradient(x):
    return 2 * (x - 3)
 
# Start at a random position
x = 10.0
learning_rate = 0.1
history = [x]
 
# Gradient descent: take steps downhill
for step in range(20):
    grad = gradient(x)           # Which direction is downhill?
    x = x - learning_rate * grad  # Take a step in that direction
    history.append(x)
    print(f"Step {step+1:2d}: x = {x:.4f}, f(x) = {f(x):.6f}, gradient = {grad:.4f}")
 
# Visualize
xs = np.linspace(-1, 11, 100)
plt.figure(figsize=(10, 5))
plt.plot(xs, f(xs), 'b-', label='f(x) = (x-3)²')
plt.plot(history, [f(h) for h in history], 'ro-', markersize=5, label='Gradient descent path')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Gradient Descent Finding the Minimum')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
 
print(f"\nFinal x: {x:.6f} (true minimum: 3.0)")
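The learning rate is the knob that makes or breaks this loop. For f(x) = (x-3)² the update is x ← x - lr·2(x-3), so each step multiplies the distance to the minimum by (1 - 2·lr): it shrinks when lr < 1, flips sign but still shrinks when lr is a bit under 1, and blows up when lr > 1. A quick sketch:

```python
def step(x, lr):
    return x - lr * 2 * (x - 3)   # same gradient as above: f'(x) = 2(x - 3)

for lr in [0.1, 0.5, 0.9, 1.1]:
    x = 10.0
    for _ in range(20):
        x = step(x, lr)
    print(f"lr={lr}: x after 20 steps = {x:.4f}")
```

With lr=0.5 it lands on the minimum in a single step, with lr=0.9 it oscillates but converges, and with lr=1.1 it diverges. Real loss surfaces aren't this clean, which is why you tune the learning rate empirically.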

Probability Basics

ML is about making predictions under uncertainty. Most classifiers express that uncertainty as probabilities.

import numpy as np
 
# SOFTMAX: converts raw numbers into probabilities
# This is how neural networks output probabilities
 
def softmax(x):
    """Convert a vector of numbers into probabilities that sum to 1."""
    exp_x = np.exp(x - np.max(x))  # subtract max for numerical stability
    return exp_x / exp_x.sum()
 
# Raw model outputs (called "logits")
logits = np.array([2.0, 1.0, 0.1])
 
# Convert to probabilities
probs = softmax(logits)
print(f"Logits: {logits}")
print(f"Probabilities: {probs}")
print(f"Sum: {probs.sum():.4f}")  # Always sums to 1.0
print(f"Prediction: class {np.argmax(probs)} with {probs.max():.1%} confidence")
 
# CROSS-ENTROPY LOSS: measures how wrong our predictions are
# Lower = better. 0 = perfect.
 
def cross_entropy_loss(predicted_probs, true_class):
    """How wrong is our prediction? Lower is better."""
    return -np.log(predicted_probs[true_class])
 
# If the true answer is class 0:
loss = cross_entropy_loss(probs, true_class=0)
print(f"\nTrue class: 0, predicted prob: {probs[0]:.4f}, loss: {loss:.4f}")
 
# If we predicted poorly (true class is 2, but we gave it low probability):
loss_bad = cross_entropy_loss(probs, true_class=2)
print(f"True class: 2, predicted prob: {probs[2]:.4f}, loss: {loss_bad:.4f}")  # Higher loss = worse prediction
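One fact worth verifying numerically, because it reappears in the backpropagation code later: for softmax followed by cross-entropy, the gradient of the loss with respect to the logits is simply probs - one_hot. A finite-difference check (a sketch; the eps and tolerance are arbitrary choices):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def loss(logits, true_class):
    return -np.log(softmax(logits)[true_class])

logits = np.array([2.0, 1.0, 0.1])
true_class = 0

# Analytic gradient: probs - one_hot
one_hot = np.eye(3)[true_class]
analytic = softmax(logits) - one_hot

# Numerical gradient: nudge each logit up/down and measure the change in loss
eps = 1e-6
numerical = np.zeros(3)
for i in range(3):
    bump = np.zeros(3)
    bump[i] = eps
    numerical[i] = (loss(logits + bump, true_class) - loss(logits - bump, true_class)) / (2 * eps)

print("analytic: ", analytic.round(5))
print("numerical:", numerical.round(5))  # the two should agree closely
```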

Phase 1: Build Your First ML Models

Linear Regression from Scratch

Before using sklearn, let's build linear regression ourselves. This teaches you exactly what's happening under the hood.

import numpy as np
import matplotlib.pyplot as plt
 
# Generate some fake data: house price = 50 * size + 100 + noise
np.random.seed(42)
X = np.random.uniform(1, 10, 100)  # house size (100s of sq ft)
y = 50 * X + 100 + np.random.normal(0, 30, 100)  # price ($1000s)
 
# LINEAR REGRESSION FROM SCRATCH
# Model: y = w * x + b
# Goal: find w and b that minimize the error
 
w = 0.0  # weight (slope)
b = 0.0  # bias (intercept)
learning_rate = 0.01
n = len(X)
 
losses = []
 
for epoch in range(100):
    # Forward pass: make predictions
    y_pred = w * X + b
 
    # Compute loss (Mean Squared Error)
    loss = np.mean((y_pred - y) ** 2)
    losses.append(loss)
 
    # Compute gradients (derivatives of loss w.r.t. w and b)
    dw = (2/n) * np.sum((y_pred - y) * X)
    db = (2/n) * np.sum(y_pred - y)
 
    # Update weights (gradient descent)
    w -= learning_rate * dw
    b -= learning_rate * db
 
    if epoch % 20 == 0:
        print(f"Epoch {epoch:3d}: loss={loss:.2f}, w={w:.2f}, b={b:.2f}")
 
print(f"\nLearned: y = {w:.2f} * x + {b:.2f}")
print(f"True:    y = 50.00 * x + 100.00")
 
# Plot results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
 
# Left: data + fitted line
axes[0].scatter(X, y, alpha=0.5, s=20, label='Data')
x_line = np.linspace(0, 11, 100)
axes[0].plot(x_line, w * x_line + b, 'r-', linewidth=2, label=f'y = {w:.1f}x + {b:.1f}')
axes[0].set_xlabel('House Size')
axes[0].set_ylabel('Price')
axes[0].set_title('Linear Regression')
axes[0].legend()
 
# Right: loss over time
axes[1].plot(losses)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Loss (MSE)')
axes[1].set_title('Training Loss')
plt.tight_layout()
plt.show()

The Same Thing with Scikit-Learn (3 Lines)

Now that you understand what's happening, here's the library version:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
 
# Same data as before
np.random.seed(42)
X = np.random.uniform(1, 10, 100).reshape(-1, 1)
y = 50 * X.ravel() + 100 + np.random.normal(0, 30, 100)
 
# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
 
# Fit model
model = LinearRegression()
model.fit(X_train, y_train)
 
# Evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
 
print(f"Weight: {model.coef_[0]:.2f}, Bias: {model.intercept_:.2f}")
print(f"Test MSE: {mse:.2f}")
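Linear regression also has a closed-form solution, no gradient descent required: append a column of ones for the bias and solve the least-squares system (the normal equation w = (XᵀX)⁻¹Xᵀy; in code, np.linalg.lstsq is the numerically safer route than an explicit inverse). A sketch on the same fake data:

```python
import numpy as np

np.random.seed(42)
X = np.random.uniform(1, 10, 100)
y = 50 * X + 100 + np.random.normal(0, 30, 100)

# Design matrix: [x, 1] per row, so the second coefficient is the bias
A = np.column_stack([X, np.ones_like(X)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"Closed form: y = {w:.2f} * x + {b:.2f}")  # close to the true 50 and 100
```

This is what sklearn's LinearRegression computes internally; gradient descent only becomes necessary when the model is too big or non-linear for a closed form.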

Logistic Regression: Your First Classifier

Logistic regression is the "hello world" of classification. It's also the building block of neural networks.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
 
# Generate a binary classification dataset
X, y = make_classification(
    n_samples=500, n_features=2, n_redundant=0,
    n_informative=2, random_state=42, n_clusters_per_class=1
)
 
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
 
# Train
model = LogisticRegression()
model.fit(X_train, y_train)
 
# Predict
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)
 
# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(f"\nClassification Report:\n{classification_report(y_test, y_pred)}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")
 
# Show probabilities for first 5 predictions
print("\nFirst 5 predictions with probabilities:")
for i in range(5):
    print(f"  True: {y_test[i]}, Predicted: {y_pred[i]}, "
          f"P(class=0): {y_proba[i][0]:.3f}, P(class=1): {y_proba[i][1]:.3f}")
 
# Visualize decision boundary
fig, ax = plt.subplots(figsize=(8, 6))
xx, yy = np.meshgrid(np.linspace(X[:, 0].min()-1, X[:, 0].max()+1, 200),
                      np.linspace(X[:, 1].min()-1, X[:, 1].max()+1, 200))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap='RdBu', edgecolors='black', s=50)
ax.set_title('Logistic Regression Decision Boundary')
plt.show()
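Under the hood, logistic regression is the linear model from before squashed through a sigmoid: P(y=1 | x) = σ(w·x + b). You can reproduce sklearn's predict_proba by hand from the fitted coefficients (a sketch for the binary case):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, random_state=42)
model = LogisticRegression().fit(X, y)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Manual probability: sigmoid(w . x + b), one score per example
z = X @ model.coef_.ravel() + model.intercept_[0]
manual = sigmoid(z)

sklearn_probs = model.predict_proba(X)[:, 1]
print(np.allclose(manual, sklearn_probs))  # True — same numbers
```

That sigmoid-of-a-linear-function unit is exactly one "neuron"; stacking many of them is where neural networks begin.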

Decision Trees and Random Forests

Decision trees are the most intuitive ML algorithm: readable if-then rules learned from data. A random forest is just many trees voting together.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
 
# Load the classic Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
 
# Single Decision Tree
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)
 
# Print the tree — this is why trees are interpretable
print("Decision Tree Rules:")
print(export_text(tree, feature_names=iris.feature_names))
 
# Random Forest — 100 trees voting together
rf = RandomForestClassifier(n_estimators=100, random_state=42)
 
# Cross-validation: test on data the model hasn't seen
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"Random Forest Accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
 
# Feature importance — which features matter most?
rf.fit(X, y)
for name, importance in sorted(zip(iris.feature_names, rf.feature_importances_),
                                key=lambda x: x[1], reverse=True):
    print(f"  {name}: {importance:.3f}")

XGBoost: The Kaggle Winner

Gradient-boosted trees (XGBoost, LightGBM) win most tabular data competitions. Here's how to use XGBoost:

# pip install xgboost lightgbm
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
 
# Load California housing dataset
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
 
# Train XGBoost
model = xgb.XGBRegressor(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    random_state=42,
    verbosity=0
)
model.fit(X_train, y_train,
          eval_set=[(X_test, y_test)],
          verbose=False)
 
# Evaluate
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")
 
# Feature importance
print("\nTop 5 Features:")
for name, imp in sorted(zip(data.feature_names, model.feature_importances_),
                         key=lambda x: x[1], reverse=True)[:5]:
    print(f"  {name}: {imp:.3f}")

Evaluation Metrics: Know These Cold

import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix
)
 
# Simulated predictions for a medical test
# 0 = healthy, 1 = disease
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0])
 
# Confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"True Positives:  {tp}  (correctly identified disease)")
print(f"True Negatives:  {tn}  (correctly identified healthy)")
print(f"False Positives: {fp}  (healthy person flagged as sick)")
print(f"False Negatives: {fn}  (sick person missed!)")
 
# Metrics
print(f"\nAccuracy:  {accuracy_score(y_true, y_pred):.2%}  — overall correctness")
print(f"Precision: {precision_score(y_true, y_pred):.2%}  — of those we flagged, how many were right?")
print(f"Recall:    {recall_score(y_true, y_pred):.2%}  — of all sick people, how many did we catch?")
print(f"F1 Score:  {f1_score(y_true, y_pred):.2%}  — harmonic mean of precision & recall")
 
# WHEN TO USE WHICH:
# - Accuracy: only when classes are balanced
# - Precision: when false positives are expensive (spam filter — don't want to block real emails)
# - Recall: when false negatives are dangerous (cancer screening — don't miss sick patients)
# - F1: when you need to balance precision and recall
# - AUC-ROC: when you need a threshold-independent metric
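AUC-ROC deserves its own example because it needs scores, not hard 0/1 predictions: it is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A minimal sketch with hand-picked scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true  = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.3, 0.8, 0.6, 0.4, 0.2, 0.9, 0.7])  # model probabilities

auc = roc_auc_score(y_true, y_score)
print(f"AUC: {auc:.3f}")  # 0.875 — 14 of the 16 positive/negative pairs are ranked correctly
```

Note that roc_auc_score takes the scores (e.g. predict_proba output), never the thresholded predictions; passing hard labels silently computes a much cruder number.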

Phase 2: Deep Learning — Neural Networks from Scratch

Build a Neural Network with Just NumPy

This is the most important exercise in this entire guide. If you understand this code, you understand deep learning.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
 
# Generate a non-linear dataset (two interleaving half circles)
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
 
# Reshape y to be (n, 1)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
 
# --- Activation functions ---
def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
 
def sigmoid_derivative(a):
    return a * (1 - a)
 
def relu(z):
    return np.maximum(0, z)
 
def relu_derivative(z):
    return (z > 0).astype(float)
 
# --- Initialize the network ---
np.random.seed(42)
input_size = 2
hidden_size = 16
output_size = 1
learning_rate = 0.1
 
# He initialization for weights
W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2 / input_size)
b1 = np.zeros((1, hidden_size))
W2 = np.random.randn(hidden_size, output_size) * np.sqrt(2 / hidden_size)
b2 = np.zeros((1, output_size))
 
# --- Training loop ---
losses = []
for epoch in range(1000):
    # === FORWARD PASS ===
    # Layer 1: input → hidden
    z1 = X_train @ W1 + b1        # linear transformation
    a1 = relu(z1)                   # activation
 
    # Layer 2: hidden → output
    z2 = a1 @ W2 + b2             # linear transformation
    a2 = sigmoid(z2)               # sigmoid for binary classification
 
    # === COMPUTE LOSS (binary cross-entropy) ===
    eps = 1e-8  # avoid log(0)
    loss = -np.mean(y_train * np.log(a2 + eps) + (1 - y_train) * np.log(1 - a2 + eps))
    losses.append(loss)
 
    # === BACKWARD PASS (backpropagation) ===
    m = X_train.shape[0]
 
    # Output layer gradients
    dz2 = a2 - y_train                     # derivative of loss w.r.t. z2
    dW2 = (1/m) * a1.T @ dz2               # gradient for W2
    db2 = (1/m) * np.sum(dz2, axis=0, keepdims=True)
 
    # Hidden layer gradients (chain rule!)
    da1 = dz2 @ W2.T                       # how much each hidden neuron contributed to error
    dz1 = da1 * relu_derivative(z1)         # apply activation derivative
    dW1 = (1/m) * X_train.T @ dz1          # gradient for W1
    db1 = (1/m) * np.sum(dz1, axis=0, keepdims=True)
 
    # === UPDATE WEIGHTS ===
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
 
    if epoch % 200 == 0:
        print(f"Epoch {epoch:4d}: loss = {loss:.4f}")
 
# === EVALUATE ===
# Forward pass on test data
z1 = X_test @ W1 + b1
a1 = relu(z1)
z2 = a1 @ W2 + b2
a2 = sigmoid(z2)
 
predictions = (a2 > 0.5).astype(int)
accuracy = np.mean(predictions == y_test)
print(f"\nTest accuracy: {accuracy:.2%}")
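When you write backprop by hand, the standard debugging tool is gradient checking: compare your analytic gradient against a finite-difference estimate of the loss. A sketch on a single-weight sigmoid unit (same sigmoid and cross-entropy as above; the input values here are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, x, y):
    a = sigmoid(w * x)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

x, y, w = 2.0, 1.0, 0.5

# Analytic gradient: dL/dw = (a - y) * x — the same (prediction - target) form as dz2 above
a = sigmoid(w * x)
analytic = (a - y) * x

# Numerical gradient via central differences
eps = 1e-6
numerical = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)

print(f"analytic:  {analytic:.8f}")
print(f"numerical: {numerical:.8f}")  # should match to ~6 decimal places
```

If the two disagree, your backward pass has a bug. PyTorch makes this mostly unnecessary (autograd computes exact gradients), but it is invaluable for from-scratch code like the network above.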

The Same Network in PyTorch (Industry Standard)

Now that you understand the mechanics, here's the PyTorch way:

import torch
import torch.nn as nn
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
 
# Data
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
 
# Convert to PyTorch tensors
X_train_t = torch.FloatTensor(X_train)
y_train_t = torch.FloatTensor(y_train).unsqueeze(1)
X_test_t = torch.FloatTensor(X_test)
y_test_t = torch.FloatTensor(y_test).unsqueeze(1)
 
# Define the model
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 16)
        self.layer2 = nn.Linear(16, 1)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()
 
    def forward(self, x):
        x = self.relu(self.layer1(x))
        x = self.sigmoid(self.layer2(x))
        return x
 
model = SimpleNet()
criterion = nn.BCELoss()          # Binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
 
# Train
for epoch in range(500):
    # Forward pass
    y_pred = model(X_train_t)
    loss = criterion(y_pred, y_train_t)
 
    # Backward pass + update (PyTorch does this for you!)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
 
    if epoch % 100 == 0:
        print(f"Epoch {epoch}: loss = {loss.item():.4f}")
 
# Evaluate
with torch.no_grad():
    y_pred = model(X_test_t)
    accuracy = ((y_pred > 0.5).float() == y_test_t).float().mean()
    print(f"\nTest accuracy: {accuracy:.2%}")

Image Classification with a CNN

The classic "hello world" of deep learning — classifying handwritten digits:

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
 
# Load MNIST
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
 
train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST('./data', train=False, transform=transform)
 
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=1000)
 
# CNN model
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)   # 1 input channel, 32 filters, 3x3 kernel
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.pool = nn.MaxPool2d(2)                     # 2x2 max pooling
        self.fc1 = nn.Linear(64 * 7 * 7, 128)          # flatten → fully connected
        self.fc2 = nn.Linear(128, 10)                   # 10 digit classes
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.25)
 
    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))   # 28x28 → 14x14
        x = self.pool(self.relu(self.conv2(x)))   # 14x14 → 7x7
        x = x.view(-1, 64 * 7 * 7)               # flatten
        x = self.dropout(self.relu(self.fc1(x)))
        x = self.fc2(x)                           # raw logits (no softmax — CrossEntropyLoss handles it)
        return x
 
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = CNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
 
# Train
for epoch in range(5):
    model.train()
    total_loss = 0
    for batch_X, batch_y in train_loader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        optimizer.zero_grad()
        output = model(batch_X)
        loss = criterion(output, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
 
    # Evaluate
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_X, batch_y in test_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            output = model(batch_X)
            _, predicted = torch.max(output, 1)
            total += batch_y.size(0)
            correct += (predicted == batch_y).sum().item()
 
    print(f"Epoch {epoch+1}: loss={total_loss/len(train_loader):.4f}, "
          f"test accuracy={correct/total:.2%}")

Self-Attention from Scratch

This is the core of Transformers, GPT, BERT, and every modern LLM. Understanding this code means understanding modern AI.

import torch
import torch.nn.functional as F
import math
 
# Self-Attention from scratch
# The idea: each token looks at every other token and decides
# what information to gather
 
def self_attention(Q, K, V):
    """
    Q (query): "What am I looking for?" — shape (seq_len, d_k)
    K (key):   "What do I contain?"     — shape (seq_len, d_k)
    V (value): "What info do I give?"   — shape (seq_len, d_v)
    """
    d_k = Q.shape[-1]
 
    # Step 1: Compute attention scores (how much should token i attend to token j?)
    scores = Q @ K.T / math.sqrt(d_k)  # scale to prevent softmax saturation
    print("Attention scores (before softmax):")
    print(scores.numpy().round(2))
 
    # Step 2: Convert to probabilities
    attention_weights = F.softmax(scores, dim=-1)
    print("\nAttention weights (after softmax):")
    print(attention_weights.numpy().round(3))
 
    # Step 3: Weighted sum of values
    output = attention_weights @ V
    return output, attention_weights
 
# Example: 4 tokens, each embedded as a 3D vector
torch.manual_seed(42)
seq_len = 4
d_model = 3
 
# Pretend these are word embeddings for: ["The", "cat", "sat", "down"]
embeddings = torch.randn(seq_len, d_model)
 
# In a real Transformer, Q/K/V come from LEARNED linear projections.
# Here we use fixed random projection matrices to keep the example simple.
W_q = torch.randn(d_model, d_model) * 0.5
W_k = torch.randn(d_model, d_model) * 0.5
W_v = torch.randn(d_model, d_model) * 0.5
 
Q = embeddings @ W_q
K = embeddings @ W_k
V = embeddings @ W_v
 
output, weights = self_attention(Q, K, V)
print(f"\nOutput shape: {output.shape}")  # (4, 3) — each token now has context from all others
print(f"\nEach row of attention weights shows how much each token attends to the others.")
print(f"Row 0 (token 'The'): {weights[0].numpy().round(3)}")
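GPT-style decoders use causal (masked) self-attention: token i may only attend to positions ≤ i, so the model can't peek at the future while training. Here's a sketch in plain NumPy, with random vectors standing in for the projected Q/K/V:

```python
import numpy as np

def causal_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)

    # Mask the future: positions j > i get -inf, so softmax gives them weight 0
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
    scores[future] = -np.inf

    # Row-wise softmax (subtract the row max for numerical stability)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 3))  # three (4, 3) matrices: 4 tokens, 3 dims
out, w = causal_attention(Q, K, V)
print(w.round(3))  # upper triangle is all zeros; each row still sums to 1
```

Note the first token can only attend to itself (its row is [1, 0, 0, 0]), while the last token sees the whole sequence. In the PyTorch Transformer below, the same effect comes from the `masked_fill` call.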

Phase 3: Working with LLMs

Call the OpenAI API

# pip install openai
from openai import OpenAI
 
client = OpenAI()  # reads OPENAI_API_KEY from environment
 
# Basic completion
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in one paragraph."}
    ],
    temperature=0.7,
    max_tokens=200
)
 
print(response.choices[0].message.content)
print(f"\nTokens used: {response.usage.total_tokens}")
print(f"Cost: ~${response.usage.total_tokens * 0.00015 / 1000:.6f}")

Call the Anthropic API

# pip install anthropic
import anthropic
 
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from environment
 
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=200,
    system="You are a helpful AI tutor. Be concise.",
    messages=[
        {"role": "user", "content": "Explain backpropagation in one paragraph."}
    ]
)
 
print(response.content[0].text)
print(f"\nInput tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")

Prompt Engineering: Techniques That Work

from openai import OpenAI
client = OpenAI()
 
def ask(prompt, system="You are a helpful assistant.", temperature=0.7):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature
    )
    return response.choices[0].message.content
 
# --- ZERO-SHOT ---
print("=== Zero-Shot ===")
print(ask("Classify this review as positive or negative: 'The food was terrible and the service was slow.'"))
 
# --- FEW-SHOT ---
print("\n=== Few-Shot ===")
print(ask("""Classify these reviews:
Review: "Amazing experience, will come back!" → Positive
Review: "Worst meal I've ever had." → Negative
Review: "Decent food but overpriced." → ?"""))
 
# --- CHAIN OF THOUGHT ---
print("\n=== Chain of Thought ===")
print(ask("""A store sells apples for $2 each. They have a buy-3-get-1-free deal.
How much do 7 apples cost?
 
Think step by step."""))
 
# --- STRUCTURED OUTPUT ---
print("\n=== Structured JSON Output ===")
print(ask(
    """Extract the entities from this text and return as JSON:
    "Apple CEO Tim Cook announced the iPhone 16 at their Cupertino headquarters on September 9, 2024."
 
    Return format: {"people": [...], "organizations": [...], "products": [...], "locations": [...], "dates": [...]}""",
    temperature=0
))

Build a RAG System from Scratch

This is the most practical LLM skill. RAG lets you give an LLM access to your own data.

# pip install chromadb openai
import chromadb
from openai import OpenAI
 
client = OpenAI()
 
# --- Step 1: Create a vector database and add documents ---
chroma = chromadb.Client()
collection = chroma.create_collection("my_docs")
 
# Your knowledge base (could be docs, wiki, help articles, etc.)
documents = [
    "Python was created by Guido van Rossum and first released in 1991.",
    "PyTorch was developed by Meta AI and is the most popular deep learning framework.",
    "NumPy is the fundamental package for scientific computing in Python.",
    "Scikit-learn provides simple and efficient tools for data mining and analysis.",
    "Jupyter notebooks allow interactive computing with code, text, and visualizations.",
    "FastAPI is a modern web framework for building APIs with Python.",
    "Docker containers package applications with their dependencies for consistent deployment.",
    "Git is a distributed version control system created by Linus Torvalds.",
]
 
# Add documents to the vector DB (ChromaDB auto-generates embeddings)
collection.add(
    documents=documents,
    ids=[f"doc_{i}" for i in range(len(documents))]
)
 
print(f"Added {len(documents)} documents to the vector database.")
 
# --- Step 2: Query the database (semantic search) ---
def search(query, n_results=3):
    results = collection.query(query_texts=[query], n_results=n_results)
    return results['documents'][0]
 
# Test search
query = "What framework should I use for deep learning?"
relevant_docs = search(query)
print(f"\nQuery: {query}")
print(f"Retrieved documents:")
for i, doc in enumerate(relevant_docs):
    print(f"  {i+1}. {doc}")
 
# --- Step 3: Generate answer using retrieved context ---
def rag_answer(question):
    # Retrieve relevant documents
    context_docs = search(question, n_results=3)
    context = "\n".join(f"- {doc}" for doc in context_docs)
 
    # Build the prompt with context
    prompt = f"""Answer the question based ONLY on the following context.
If the context doesn't contain the answer, say "I don't have information about that."
 
Context:
{context}
 
Question: {question}
Answer:"""
 
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return response.choices[0].message.content, context_docs
 
# Test RAG
questions = [
    "What's the best framework for deep learning?",
    "Who created Python?",
    "How do I deploy applications consistently?",
    "What is the capital of France?",  # Not in our knowledge base!
]
 
for q in questions:
    answer, sources = rag_answer(q)
    print(f"\nQ: {q}")
    print(f"A: {answer}")

Fine-Tune a Model with Hugging Face

# pip install transformers datasets peft accelerate bitsandbytes
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer
)
import numpy as np
 
# Load a small dataset — movie review sentiment
dataset = load_dataset("rotten_tomatoes")
print(f"Train: {len(dataset['train'])}, Test: {len(dataset['test'])}")
print(f"Example: {dataset['train'][0]}")
 
# Load a pre-trained model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
 
# Tokenize the dataset
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=128)
 
tokenized = dataset.map(tokenize, batched=True)
 
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    eval_strategy="epoch",
    logging_steps=100,
    learning_rate=2e-5,
    weight_decay=0.01,
    save_strategy="no",
)
 
# Metric
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = (predictions == labels).mean()
    return {"accuracy": accuracy}
 
# Train!
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)
 
trainer.train()
 
# Evaluate
results = trainer.evaluate()
print(f"\nTest accuracy: {results['eval_accuracy']:.2%}")
 
# Use the model
def predict_sentiment(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    outputs = model(**inputs)
    probs = outputs.logits.softmax(dim=-1)
    label = "Positive" if probs[0][1] > 0.5 else "Negative"
    confidence = probs[0].max().item()
    return label, confidence
 
# Test
for text in ["This movie was absolutely fantastic!", "Terrible waste of time."]:
    label, conf = predict_sentiment(text)
    print(f"'{text}' → {label} ({conf:.1%})")

Phase 4: Implement a Paper — Transformer from Scratch

This is the exercise that separates people who understand deep learning from people who copy-paste. We implement the core of "Attention Is All You Need."

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
 
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
 
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
 
    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.shape
 
        # Project to Q, K, V
        Q = self.W_q(x).view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)
 
        # Scaled dot-product attention
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
 
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
 
        attention = F.softmax(scores, dim=-1)
        context = attention @ V
 
        # Concatenate heads and project
        context = context.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
        return self.W_o(context)
 
 
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
 
    def forward(self, x):
        return self.linear2(F.relu(self.linear1(x)))
 
 
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, n_heads)
        self.ff = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
 
    def forward(self, x, mask=None):
        # Pre-norm: layer norm first, then self-attention, then residual connection
        attn_out = self.attention(self.norm1(x), mask)
        x = x + self.dropout(attn_out)
 
        # Pre-norm: layer norm first, then feed-forward, then residual connection
        ff_out = self.ff(self.norm2(x))
        x = x + self.dropout(ff_out)
        return x
 
 
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))
 
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]
 
 
class MiniGPT(nn.Module):
    """A tiny GPT-style language model."""
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=4, d_ff=512, max_len=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads, d_ff) for _ in range(n_layers)
        ])
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)
 
    def forward(self, x):
        seq_len = x.size(1)
        # Causal mask: each token can only attend to previous tokens
        mask = torch.tril(torch.ones(seq_len, seq_len, device=x.device)).unsqueeze(0).unsqueeze(0)
 
        x = self.embedding(x)
        x = self.pos_encoding(x)
        for block in self.blocks:
            x = block(x, mask)
        x = self.norm(x)
        logits = self.head(x)
        return logits
 
# Test it
vocab_size = 1000
model = MiniGPT(vocab_size)
 
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {total_params:,}")
 
# Forward pass with random input
x = torch.randint(0, vocab_size, (2, 32))  # batch of 2, sequence length 32
logits = model(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {logits.shape}")  # (2, 32, 1000) — logits over the vocab at each position
 
# Next token prediction
next_token_probs = F.softmax(logits[0, -1, :], dim=-1)
next_token = torch.argmax(next_token_probs)
print(f"Predicted next token: {next_token.item()} with prob {next_token_probs[next_token]:.4f}")
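The argmax above is one step of greedy decoding; repeating it gives autoregressive generation. Here is a minimal sketch that works with any model returning (batch, seq, vocab) logits like MiniGPT above. A tiny stand-in model is used so the snippet runs on its own (it's untrained, so the generated tokens are meaningless):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 1000

class StubLM(nn.Module):
    """Stand-in for MiniGPT: maps (batch, seq) token ids to (batch, seq, vocab) logits."""
    def __init__(self, vocab_size, d_model=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        return self.head(self.emb(x))

@torch.no_grad()
def generate(model, prompt, max_new_tokens=10):
    """Greedy decoding: append the argmax next token, then repeat."""
    model.eval()
    x = prompt.clone()
    for _ in range(max_new_tokens):
        logits = model(x)                                    # (batch, seq, vocab)
        next_tok = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        x = torch.cat([x, next_tok], dim=1)
    return x

model = StubLM(vocab_size)
prompt = torch.randint(0, vocab_size, (1, 5))
out = generate(model, prompt, max_new_tokens=10)
print(out.shape)  # (1, 15): 5 prompt tokens + 10 generated
```

Swapping argmax for sampling from the softmax (with a temperature) gives the stochastic generation real LLMs use.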

Phase 5: Interview Preparation — Practice Problems

ML Coding: Implement K-Means from Scratch

This is a classic interview question.

import numpy as np
import matplotlib.pyplot as plt
 
def kmeans(X, k, max_iters=100):
    """K-Means clustering from scratch."""
    n_samples = X.shape[0]
 
    # Step 1: Initialize centroids randomly
    indices = np.random.choice(n_samples, k, replace=False)
    centroids = X[indices].copy()
 
    for iteration in range(max_iters):
        # Step 2: Assign each point to the nearest centroid
        distances = np.sqrt(((X[:, np.newaxis] - centroids) ** 2).sum(axis=2))
        labels = np.argmin(distances, axis=1)
 
        # Step 3: Update centroids to the mean of their assigned points
        # (keep the previous centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if (labels == i).any() else centroids[i]
            for i in range(k)
        ])
 
        # Check for convergence
        if np.allclose(centroids, new_centroids):
            print(f"Converged at iteration {iteration}")
            break
        centroids = new_centroids
 
    return labels, centroids
 
# Generate data
np.random.seed(42)
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
 
# Run our K-Means
labels, centroids = kmeans(X, k=4)
 
# Visualize
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30, alpha=0.7)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200, edgecolors='black')
plt.title('K-Means Clustering (from scratch)')
plt.show()
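A common interview follow-up: how do you check that a clustering is any good? When ground-truth labels exist (as with make_blobs), the adjusted Rand index is a standard choice. As a sanity check, scikit-learn's reference KMeans on the same blobs should score close to 1.0:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Same well-separated blobs as above
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
ari = adjusted_rand_score(y_true, km.labels_)
print(f"Adjusted Rand index vs. ground truth: {ari:.3f}")
```

You can run the same check on the labels from the from-scratch implementation; on these blobs both should land near 1.0. Without ground truth, inertia (the elbow method) or silhouette score are the usual substitutes.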

ML System Design: Recommendation System

Walk through this out loud as if you're in an interview:

"""
Interview Question: Design a movie recommendation system.
 
Step 1: Clarify requirements
- Personalized recommendations for each user
- Should handle new users (cold start)
- Needs to serve millions of users with low latency
 
Step 2: Approach — Collaborative Filtering
Two users who liked the same movies will likely enjoy similar movies.
"""
 
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
 
# User-movie rating matrix (0 = not rated)
# Rows = users, Columns = movies
ratings = np.array([
    [5, 4, 0, 0, 1],  # User 0: likes action/adventure
    [4, 5, 0, 1, 0],  # User 1: likes action/adventure
    [0, 0, 5, 4, 0],  # User 2: likes romance/drama
    [0, 1, 4, 5, 0],  # User 3: likes romance/drama
    [3, 3, 3, 3, 3],  # User 4: likes everything equally
])
movies = ["Action Hero", "Space Wars", "Love Story", "Drama Queen", "Horror Night"]
 
# Compute user-user similarity
user_similarity = cosine_similarity(ratings)
print("User similarity matrix:")
for i, row in enumerate(user_similarity):
    print(f"  User {i}: {row.round(2)}")
 
# Predict: what would User 0 rate "Love Story" (movie 2)?
def predict_rating(user_id, movie_id, ratings, similarity):
    # Find users who rated this movie
    rated_mask = ratings[:, movie_id] > 0
    if not rated_mask.any():
        return 0
 
    # Weighted average of other users' ratings, weighted by similarity
    sim_scores = similarity[user_id][rated_mask]
    user_ratings = ratings[rated_mask, movie_id]
 
    if sim_scores.sum() == 0:
        return 0
    return np.dot(sim_scores, user_ratings) / np.abs(sim_scores).sum()
 
# Recommend movies for User 0
print("\nRecommendations for User 0:")
for movie_id, movie_name in enumerate(movies):
    if ratings[0, movie_id] == 0:  # Only recommend unrated movies
        predicted = predict_rating(0, movie_id, ratings, user_similarity)
        print(f"  {movie_name}: predicted rating = {predicted:.2f}")
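Step 1 above calls out the cold-start problem, which the similarity-based code can't handle: a brand-new user has no ratings, so every similarity is zero. A common fallback is popularity-based recommendation. Here is a minimal sketch on the same toy matrix (the damping-by-count scoring is one arbitrary choice among many):

```python
import numpy as np

# Same toy user-movie rating matrix as above (0 = not rated)
ratings = np.array([
    [5, 4, 0, 0, 1],
    [4, 5, 0, 1, 0],
    [0, 0, 5, 4, 0],
    [0, 1, 4, 5, 0],
    [3, 3, 3, 3, 3],
])
movies = ["Action Hero", "Space Wars", "Love Story", "Drama Queen", "Horror Night"]

def popularity_fallback(ratings, top_n=3):
    """Cold start: a new user has no ratings, so similarity is useless.
    Fall back to globally popular movies: mean of nonzero ratings,
    damped by how many users actually rated each movie."""
    rated = ratings > 0
    counts = rated.sum(axis=0)
    means = np.where(counts > 0, ratings.sum(axis=0) / np.maximum(counts, 1), 0)
    scores = means * np.log1p(counts)   # favor well-rated AND often-rated movies
    return np.argsort(scores)[::-1][:top_n]

print("Cold-start recommendations for a brand-new user:")
for idx in popularity_fallback(ratings):
    print(f"  {movies[idx]}")
```

In production this blends into a hybrid: popularity for new users, collaborative filtering once they have a few ratings.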

Where to Go Next

Now that you have the foundations with working code, here are your next steps:

| What to do | Where |
| --- | --- |
| Build more projects | Kaggle competitions — start with the "Getting Started" ones |
| Watch Karpathy's series | "Neural Networks: Zero to Hero" on YouTube |
| Read the Transformer paper | "Attention Is All You Need" — now you can follow the math |
| Build a full RAG app | Expand the RAG example above with a real UI |
| Study system design | Designing Machine Learning Systems by Chip Huyen |
| Practice interviews | NeetCode for coding, the system design framework above for ML design |

The key insight: you learn ML by building things, not by watching courses. Courses give you structure. Building gives you understanding. Do both, but always be building.

You might also like