Week 1: Overview of Large Language Models

From the fundamentals of AI and machine learning to the frontier of modern LLMs — build a rock-solid mental model of the entire landscape.

Difficulty: Beginner

1. What is AI / ML / Deep Learning?

1.1 The Hierarchy

Think of three concentric circles. The outermost circle is Artificial Intelligence (AI), the middle circle is Machine Learning (ML), and the innermost circle is Deep Learning (DL).

  • AI (Artificial Intelligence) — The broadest concept: any technique that enables a computer to mimic human-like intelligence. This includes rule-based expert systems, search algorithms (like those that powered the chess computer Deep Blue), constraint solvers, and more. If a program can play chess, translate languages, or drive a car, it falls under AI.
  • ML (Machine Learning) — A subset of AI where the system learns from data rather than being explicitly programmed with rules. Instead of writing if temperature > 30: say("hot"), you feed the system thousands of examples and let it discover the patterns. Algorithms include linear regression, decision trees, SVMs, random forests, and neural networks.
  • DL (Deep Learning) — A subset of ML that uses deep neural networks (networks with many layers) to learn hierarchical representations. "Deep" refers to the depth of layers, not the depth of understanding. Deep learning powers image recognition, speech synthesis, and, crucially, modern language models.
Real-World Analogy: Imagine teaching someone to cook.
  • AI approach (rule-based): You hand them a detailed recipe with exact measurements and steps. They follow the recipe precisely.
  • ML approach: You give them 1000 photos of well-cooked vs. burnt dishes and let them figure out what "good cooking" looks like.
  • DL approach: You give them millions of cooking videos and a neural network figures out everything from ingredient identification to cooking techniques to plating aesthetics — all by itself.
AI / ML / Deep Learning Hierarchy

graph TD
    AI["Artificial Intelligence<br/>Rule-based systems, search, planning"]
    ML["Machine Learning<br/>Learns from data: trees, SVM, regression"]
    DL["Deep Learning<br/>Deep neural networks with many layers"]
    LLM["Large Language Models<br/>GPT, Claude, Llama"]
    AI --> ML
    ML --> DL
    DL --> LLM
    style AI fill:#e8f4f8,stroke:#2c3e50,stroke-width:2px
    style ML fill:#d5e8d4,stroke:#2c3e50,stroke-width:2px
    style DL fill:#fff2cc,stroke:#2c3e50,stroke-width:2px
    style LLM fill:#f8cecc,stroke:#2c3e50,stroke-width:2px

1.2 Traditional Programming vs. the ML Paradigm Shift

This is one of the most important conceptual shifts in computer science:

| Aspect | Traditional Programming | Machine Learning |
| --- | --- | --- |
| Input | Data + Rules | Data + Expected Outputs |
| Output | Answers | Rules (learned model) |
| Example | if email contains "free money" → spam | Show 100k emails labeled spam/not-spam; model learns patterns |
| Maintenance | Manually update rules as spammers evolve | Retrain on new data; model adapts automatically |
| Scalability | Rules become unmanageable in complex domains | Scales gracefully with more data |
# Traditional Programming: Spam Detector
def is_spam_traditional(email_text):
    """Rule-based spam detection - brittle and hard to maintain."""
    spam_keywords = ["free money", "click here", "limited offer",
                     "congratulations", "you won", "act now"]
    email_lower = email_text.lower()
    for keyword in spam_keywords:
        if keyword in email_lower:
            return True
    return False

# Machine Learning: Spam Detector
# Instead of writing rules, we LEARN them from data
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def train_spam_detector(emails, labels):
    """ML-based spam detection - learns patterns from data."""
    vectorizer = TfidfVectorizer(max_features=5000)
    X = vectorizer.fit_transform(emails)

    model = MultinomialNB()
    model.fit(X, labels)

    return model, vectorizer

def predict_spam_ml(model, vectorizer, email_text):
    """Use the trained model to predict spam."""
    X = vectorizer.transform([email_text])
    prediction = model.predict(X)
    probability = model.predict_proba(X)
    return prediction[0], probability[0]

# The ML version can detect NEW spam patterns it has never explicitly
# been told about, as long as they share statistical features with
# known spam. This is the power of learning from data.

2. Neural Networks from Scratch

2.1 What is a Neuron?

A biological neuron receives electrical signals through its dendrites, processes them in the cell body, and if the combined signal exceeds a threshold, it fires an output signal through its axon to connected neurons.

An artificial neuron mirrors this:

  • Inputs (x1, x2, ..., xn) → like dendrites receiving signals
  • Weights (w1, w2, ..., wn) → like the strength of synaptic connections
  • Bias (b) → like the neuron's intrinsic excitability threshold
  • Activation function (f) → like the decision to fire or not
  • Output: y = f(Σ(wi · xi) + b)
Analogy: Imagine you're deciding whether to go to a party. Your inputs are: friends going (x1=1), weather is nice (x2=1), you have homework (x3=1). The weights represent how much each factor matters to you: friends (w1=0.8), weather (w2=0.3), homework (w3=-0.6). The bias might be your general inclination to go out (b=0.1). If the weighted sum exceeds your threshold, you go!
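Plugging the analogy's numbers into the neuron formula shows the mechanics (a toy sketch; the weights and bias are just the made-up values above):

```python
import numpy as np

# Party-decision neuron from the analogy above
x = np.array([1.0, 1.0, 1.0])    # friends going, nice weather, homework
w = np.array([0.8, 0.3, -0.6])   # how much each factor matters
b = 0.1                          # general inclination to go out

z = np.dot(w, x) + b             # weighted sum: 0.8 + 0.3 - 0.6 + 0.1 = 0.6
go_to_party = z > 0              # step-function activation: fire if above threshold
print(f"z = {z:.2f}, go = {bool(go_to_party)}")  # z = 0.60, go = True
```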

2.2 Mathematical Model of a Neuron

The complete mathematical formulation:

Step 1 — Linear Transformation (Weighted Sum):

z = w1x1 + w2x2 + ... + wnxn + b = wᵀx + b

Step 2 — Non-linear Activation:

y = f(z)

where f is an activation function (ReLU, Sigmoid, etc.)

import numpy as np

class Neuron:
    """A single artificial neuron."""

    def __init__(self, n_inputs):
        # Initialize weights randomly (small values near zero)
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0.0

    def forward(self, x):
        """Compute the neuron's output."""
        # Step 1: Weighted sum (linear transformation)
        z = np.dot(self.weights, x) + self.bias

        # Step 2: Activation function (using sigmoid here)
        y = 1 / (1 + np.exp(-z))

        return y

# Example
neuron = Neuron(n_inputs=3)
x = np.array([1.0, 0.5, -0.3])  # Input features
output = neuron.forward(x)
print(f"Neuron output: {output:.4f}")
# Output will be a value between 0 and 1 (because sigmoid)

2.3 Activation Functions

Activation functions introduce non-linearity into the network. Without them, stacking layers would be equivalent to a single linear transformation, no matter how many layers you use. They are the key to a neural network's power.
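This claim is easy to verify in NumPy: two stacked linear layers with no activation in between collapse into a single matrix multiply, so the "deep" network is no more expressive than a one-layer one.

```python
import numpy as np

np.random.seed(0)
W1 = np.random.randn(4, 3)   # first linear layer:  3 inputs -> 4 units
W2 = np.random.randn(2, 4)   # second linear layer: 4 units  -> 2 outputs
x = np.random.randn(3)

# Two linear layers applied in sequence (no activation in between)...
two_layers = W2 @ (W1 @ x)

# ...equal one linear layer whose matrix is the product W2 @ W1
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```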

ReLU (Rectified Linear Unit)

ReLU(x) = max(0, x)

Derivative: 1 if x > 0, else 0

  • Pros: Computationally efficient, mitigates vanishing gradient problem, sparse activation
  • Cons: "Dying ReLU" problem — if a neuron's pre-activation stays negative, its output and its gradient are both zero, so the neuron can stop learning permanently
  • Used in: Hidden layers of most modern networks
def relu(x):
    """ReLU activation: returns x if positive, 0 otherwise."""
    return np.maximum(0, x)

def relu_derivative(x):
    """Derivative of ReLU: 1 if x > 0, else 0."""
    return (x > 0).astype(float)

# Visual representation:
# Input:  [-2, -1, 0, 1, 2, 3]
# Output: [ 0,  0, 0, 1, 2, 3]
# It simply "clips" negative values to zero.

Sigmoid

σ(x) = 1 / (1 + e⁻ˣ)

Output range: (0, 1)

Derivative: σ(x) · (1 - σ(x))

  • Pros: Outputs interpretable as probabilities, smooth gradient
  • Cons: Vanishing gradients for very large/small inputs, not zero-centered
  • Used in: Binary classification output layers, gates in LSTMs
def sigmoid(x):
    """Sigmoid activation: squashes input to (0, 1)."""
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative of sigmoid."""
    s = sigmoid(x)
    return s * (1 - s)

# Visual representation:
# Input:  [-5,   -2,    0,   2,    5  ]
# Output: [0.007, 0.12, 0.5, 0.88, 0.993]
# Large negative -> ~0, large positive -> ~1, 0 -> 0.5

Tanh (Hyperbolic Tangent)

tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)

Output range: (-1, 1)

Derivative: 1 − tanh²(x)

  • Pros: Zero-centered (helps with gradient updates), stronger gradients than sigmoid
  • Cons: Still suffers from vanishing gradients at extremes
  • Used in: RNN hidden states, some normalization layers
def tanh(x):
    """Tanh activation: squashes input to (-1, 1)."""
    return np.tanh(x)

def tanh_derivative(x):
    """Derivative of tanh."""
    return 1 - np.tanh(x) ** 2

# Visual representation:
# Input:  [-5,    -2,     0,   2,     5   ]
# Output: [-0.999, -0.96, 0.0, 0.96,  0.999]
# Like sigmoid but centered at 0 and ranges from -1 to 1

Softmax

softmax(xi) = exp(xi) / Σj exp(xj)

Output: probability distribution (all outputs sum to 1)

  • Used in: Multi-class classification output layers, attention mechanisms
  • Key property: Converts raw scores (logits) into probabilities
def softmax(x):
    """Softmax: converts logits to probabilities.

    We subtract max(x) for numerical stability to prevent
    overflow when computing exp() of large numbers.
    """
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum()

# Example: classifying an image as cat/dog/bird
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(f"Cat: {probs[0]:.3f}, Dog: {probs[1]:.3f}, Bird: {probs[2]:.3f}")
# Output: Cat: 0.659, Dog: 0.242, Bird: 0.099
# The highest logit (cat=2.0) gets the highest probability
# Complete comparison of all activation functions
import numpy as np

# 101 points -> exact 0.1 spacing, so x[30] = -2, x[50] = 0, x[70] = 2
x = np.linspace(-5, 5, 101)

# All activations side by side
activations = {
    'ReLU':      np.maximum(0, x),
    'Sigmoid':   1 / (1 + np.exp(-x)),
    'Tanh':      np.tanh(x),
    'LeakyReLU': np.where(x > 0, x, 0.01 * x),
    'GELU':      x * 0.5 * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3))),
}

# GELU (Gaussian Error Linear Unit) - used in GPT-2, BERT, modern transformers
# It's a smoother version of ReLU that allows small negative values
# GELU(x) = x * Phi(x), where Phi is the CDF of the standard normal distribution
# Approximation: GELU(x) ~ 0.5x(1 + tanh(sqrt(2/pi)(x + 0.044715x^3)))

for name, values in activations.items():
    print(f"{name:12s} at x=-2: {values[30]:.4f}, x=0: {values[50]:.4f}, x=2: {values[70]:.4f}")

2.4 Forward Propagation (Step by Step)

Forward propagation is the process of passing input through the network layer by layer to produce an output. Let's trace through a concrete numerical example.

Example Network: 2 inputs → 2 hidden neurons (ReLU) → 1 output neuron (Sigmoid)
"""
FORWARD PROPAGATION - Complete Numerical Example
=================================================

Network Architecture:
  Input Layer:  2 neurons (x1, x2)
  Hidden Layer: 2 neurons (h1, h2) with ReLU activation
  Output Layer: 1 neuron (o1) with Sigmoid activation

Given values:
  Inputs: x1 = 0.5, x2 = 0.8

  Hidden layer weights:
    w1 = 0.4 (x1 -> h1),  w2 = 0.3 (x2 -> h1),  b1 = 0.1
    w3 = -0.2 (x1 -> h2), w4 = 0.6 (x2 -> h2),  b2 = -0.1

  Output layer weights:
    w5 = 0.7 (h1 -> o1),  w6 = 0.5 (h2 -> o1),  b3 = 0.2
"""

import numpy as np

# Inputs
x = np.array([0.5, 0.8])

# Hidden layer parameters
W_hidden = np.array([
    [0.4, 0.3],    # weights for h1
    [-0.2, 0.6]    # weights for h2
])
b_hidden = np.array([0.1, -0.1])

# Output layer parameters
W_output = np.array([
    [0.7, 0.5]     # weights for o1
])
b_output = np.array([0.2])

# ---- STEP 1: Hidden Layer Linear Transformation ----
z_hidden = np.dot(W_hidden, x) + b_hidden
# h1: z1 = (0.4 * 0.5) + (0.3 * 0.8) + 0.1 = 0.2 + 0.24 + 0.1 = 0.54
# h2: z2 = (-0.2 * 0.5) + (0.6 * 0.8) + (-0.1) = -0.1 + 0.48 - 0.1 = 0.28
print(f"Hidden layer linear output (z): {z_hidden}")
# Output: [0.54, 0.28]

# ---- STEP 2: Hidden Layer Activation (ReLU) ----
a_hidden = np.maximum(0, z_hidden)  # ReLU
# h1: ReLU(0.54) = 0.54  (positive, so unchanged)
# h2: ReLU(0.28) = 0.28  (positive, so unchanged)
print(f"Hidden layer activation (a):    {a_hidden}")
# Output: [0.54, 0.28]

# ---- STEP 3: Output Layer Linear Transformation ----
z_output = np.dot(W_output, a_hidden) + b_output
# o1: z = (0.7 * 0.54) + (0.5 * 0.28) + 0.2 = 0.378 + 0.14 + 0.2 = 0.718
print(f"Output layer linear output (z): {z_output}")
# Output: [0.718]

# ---- STEP 4: Output Layer Activation (Sigmoid) ----
y_pred = 1 / (1 + np.exp(-z_output))
# o1: sigmoid(0.718) = 1 / (1 + e^(-0.718)) = 1 / (1 + 0.4877) = 0.6723
print(f"Network output (prediction):    {y_pred}")
# Output: [0.6723]

# If this is a binary classifier:
# - Output > 0.5 -> Class 1 (positive)
# - Output <= 0.5 -> Class 0 (negative)
# Our prediction of 0.6723 -> Class 1

2.5 Loss Functions

A loss function (also called cost function or objective function) measures how far the model's predictions are from the true values. The goal of training is to minimize the loss.

Mean Squared Error (MSE)

MSE = (1/n) Σi=1..n (yi − ŷi)²

Where yi is the true value and ŷi is the predicted value.

  • Used for: Regression problems (predicting continuous values)
  • Intuition: Penalizes larger errors more heavily (quadratic penalty)
  • Why squared: squaring makes every error positive and keeps the loss smoothly differentiable, unlike absolute error

Binary Cross-Entropy (BCE)

BCE = −(1/n) Σi=1..n [yi log(ŷi) + (1 − yi) log(1 − ŷi)]

  • Used for: Binary classification
  • Intuition: Heavily penalizes confident wrong predictions

Categorical Cross-Entropy

CCE = −Σc=1..C yc log(ŷc)

Where C is the number of classes.

  • Used for: Multi-class classification, language model training (next token prediction)
  • Key insight: This is the loss function used to train LLMs!
import numpy as np

# ---- Mean Squared Error ----
def mse_loss(y_true, y_pred):
    """Mean Squared Error for regression."""
    return np.mean((y_true - y_pred) ** 2)

def mse_derivative(y_true, y_pred):
    """Derivative of MSE with respect to y_pred."""
    return 2 * (y_pred - y_true) / len(y_true)

# Example: predicting house prices
y_true = np.array([300000, 450000, 200000])
y_pred = np.array([310000, 420000, 215000])
print(f"MSE Loss: {mse_loss(y_true, y_pred):,.0f}")
# Output: MSE Loss: 408,333,333

# ---- Binary Cross-Entropy ----
def binary_cross_entropy(y_true, y_pred):
    """Binary cross-entropy for binary classification.

    We clip predictions to avoid log(0) which is undefined.
    """
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Example: spam detection
y_true = np.array([1, 0, 1, 1])  # Actual labels
y_pred = np.array([0.9, 0.1, 0.8, 0.7])  # Model predictions
print(f"BCE Loss: {binary_cross_entropy(y_true, y_pred):.4f}")
# Output: BCE Loss: 0.1976

# What happens with a TERRIBLE prediction?
y_pred_bad = np.array([0.1, 0.9, 0.2, 0.3])  # Everything wrong
print(f"BCE Loss (bad): {binary_cross_entropy(y_true, y_pred_bad):.4f}")
# Output: BCE Loss (bad): 1.8546 (much higher = worse!)

# ---- Categorical Cross-Entropy ----
def categorical_cross_entropy(y_true_onehot, y_pred_probs):
    """Categorical cross-entropy for multi-class classification.

    This is the loss function used to train language models!
    When predicting the next token, y_true is the correct token
    (one-hot encoded) and y_pred is the model's probability
    distribution over the entire vocabulary.
    """
    epsilon = 1e-15
    y_pred_probs = np.clip(y_pred_probs, epsilon, 1.0)
    return -np.sum(y_true_onehot * np.log(y_pred_probs))

# Example: classifying an image as [cat, dog, bird]
y_true = np.array([1, 0, 0])  # True label: cat (one-hot)
y_pred = np.array([0.7, 0.2, 0.1])  # Model says 70% cat
print(f"CCE Loss: {categorical_cross_entropy(y_true, y_pred):.4f}")
# Output: CCE Loss: 0.3567

# For language models: predict next word from vocabulary of 50000 words
# y_true = one-hot vector with 1 at position of correct word
# y_pred = softmax output giving probability for each word
Neural Network Forward and Backward Pass

graph LR
    subgraph FP["Forward Pass"]
        I["Input X"] --> H1["Hidden Layer 1<br/>z = Wx + b"]
        H1 --> A1["Activation<br/>a = ReLU(z)"]
        A1 --> H2["Hidden Layer 2<br/>z = Wa + b"]
        H2 --> A2["Activation<br/>a = ReLU(z)"]
        A2 --> O["Output<br/>ŷ"]
    end
    O --> L["Loss Function<br/>L = loss(y, ŷ)"]
    subgraph BP["Backward Pass"]
        L --> G3["dL/dW2<br/>Gradient"]
        G3 --> G2["dL/dW1<br/>Gradient"]
        G2 --> U["Update Weights<br/>W = W − lr × grad"]
    end
    style I fill:#e8f4f8,stroke:#333
    style O fill:#d5e8d4,stroke:#333
    style L fill:#f8cecc,stroke:#333
    style U fill:#fff2cc,stroke:#333

2.6 Backpropagation

Backpropagation is the algorithm that computes how much each weight contributed to the error, so we can update them to reduce the loss. It uses the chain rule of calculus to propagate gradients backward through the network.

Chain Rule Intuition: If you change the temperature of an oven (x), that changes how cooked the food is (y), which changes how happy the customer is (z). The chain rule tells us: dz/dx = dz/dy · dy/dx. We can compute the effect of oven temperature on customer happiness by multiplying intermediate effects.
"""
BACKPROPAGATION - Complete Worked Example
==========================================

Using our forward propagation example:
  Network: 2 inputs -> 2 hidden (ReLU) -> 1 output (Sigmoid)
  Prediction: 0.6723
  True label: 1.0 (positive class)
  Loss function: Binary Cross-Entropy

We need to compute: dLoss/dw for EVERY weight in the network,
then update: w_new = w_old - learning_rate * dLoss/dw
"""

import numpy as np

# Forward pass values (from previous example)
x = np.array([0.5, 0.8])
z_hidden = np.array([0.54, 0.28])
a_hidden = np.array([0.54, 0.28])  # After ReLU
z_output = np.array([0.718])
y_pred = np.array([0.6723])  # After Sigmoid
y_true = np.array([1.0])

# ---- STEP 1: Compute output layer gradient ----
# For sigmoid + BCE, the gradient simplifies beautifully:
# dL/dz_output = y_pred - y_true
dL_dz_output = y_pred - y_true  # = 0.6723 - 1.0 = -0.3277
print(f"Output gradient: {dL_dz_output}")

# ---- STEP 2: Compute gradients for output weights ----
# dL/dw5 = dL/dz_output * dz_output/dw5 = dL/dz_output * a_h1
# dL/dw6 = dL/dz_output * dz_output/dw6 = dL/dz_output * a_h2
dL_dW_output = dL_dz_output * a_hidden.reshape(1, -1).T
# w5 gradient: -0.3277 * 0.54 = -0.1770
# w6 gradient: -0.3277 * 0.28 = -0.0918
print(f"Output weight gradients: {dL_dW_output.flatten()}")

# dL/db3 = dL/dz_output = -0.3277
dL_db_output = dL_dz_output
print(f"Output bias gradient: {dL_db_output}")

# ---- STEP 3: Propagate gradient to hidden layer ----
W_output = np.array([[0.7, 0.5]])
# dL/da_hidden = W_output^T * dL/dz_output
dL_da_hidden = W_output.T.dot(dL_dz_output.reshape(-1, 1)).flatten()
print(f"Hidden activation gradient: {dL_da_hidden}")

# ---- STEP 4: Gradient through ReLU ----
# dReLU/dz = 1 if z > 0, else 0
relu_grad = (z_hidden > 0).astype(float)  # [1.0, 1.0] (both positive)
dL_dz_hidden = dL_da_hidden * relu_grad
print(f"Hidden layer gradient: {dL_dz_hidden}")

# ---- STEP 5: Compute gradients for hidden weights ----
dL_dW_hidden = np.outer(dL_dz_hidden, x)
dL_db_hidden = dL_dz_hidden
print(f"Hidden weight gradients:\n{dL_dW_hidden}")
print(f"Hidden bias gradients: {dL_db_hidden}")

# ---- STEP 6: Update all weights ----
learning_rate = 0.01

W_hidden = np.array([[0.4, 0.3], [-0.2, 0.6]])
b_hidden_vals = np.array([0.1, -0.1])
W_output_vals = np.array([[0.7, 0.5]])
b_output_val = np.array([0.2])

W_hidden_new = W_hidden - learning_rate * dL_dW_hidden
b_hidden_new = b_hidden_vals - learning_rate * dL_db_hidden
W_output_new = W_output_vals - learning_rate * dL_dW_output.T
b_output_new = b_output_val - learning_rate * dL_db_output

print(f"\nUpdated weights:")
print(f"W_hidden: {W_hidden} -> {W_hidden_new}")
print(f"W_output: {W_output_vals} -> {W_output_new}")
print(f"\nThe weights shifted slightly to reduce the loss!")
print(f"Repeat this process thousands of times = training!")
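A standard way to sanity-check hand-derived gradients is a numerical gradient check. This sketch recomputes the example's loss as a function of w5 alone (all other values held at the worked example's numbers) and compares a central finite difference against the analytic gradient:

```python
import numpy as np

def loss_given_w5(w5):
    """BCE loss of the worked example, viewed as a function of w5 only.

    Hidden activations (0.54, 0.28), w6 = 0.5, b3 = 0.2, and the true
    label y = 1 are held fixed at their values from the example above.
    """
    z = w5 * 0.54 + 0.5 * 0.28 + 0.2   # output pre-activation
    y_pred = 1 / (1 + np.exp(-z))      # sigmoid
    return -np.log(y_pred)             # BCE with y_true = 1

# Central finite difference: (L(w+eps) - L(w-eps)) / (2*eps)
eps = 1e-6
numeric = (loss_given_w5(0.7 + eps) - loss_given_w5(0.7 - eps)) / (2 * eps)

# Analytic gradient from Step 2: (y_pred - y_true) * a_h1
analytic = (1 / (1 + np.exp(-0.718)) - 1.0) * 0.54

print(f"numeric: {numeric:.4f}, analytic: {analytic:.4f}")  # both ≈ -0.1770
```

If the two numbers disagree, there is a bug in the backprop derivation; this check is cheap insurance when implementing gradients by hand.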

2.7 Learning Rate, Epochs, and Batch Size

  • Learning Rate (η): How big a step to take when updating weights. Too large → overshoot the minimum, training diverges. Too small → training takes forever, may get stuck in local minima. Typical values: 0.001 to 0.01 for simple networks, 1e-4 to 3e-4 for transformers.
  • Epoch: One complete pass through the entire training dataset. Training for 10 epochs means the model sees every example 10 times. Too few epochs → underfitting. Too many → overfitting.
  • Batch Size: Number of training examples processed together before updating weights.
    • Batch size = 1: Stochastic Gradient Descent (SGD) — noisy but fast updates
    • Batch size = N (full dataset): Batch GD — stable but slow, high memory
    • Batch size = 32-512: Mini-batch GD — the practical sweet spot
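The learning-rate trade-off is easy to see on a toy problem: gradient descent on f(w) = (w − 3)², whose gradient is 2(w − 3) and whose minimum is w = 3 (the three rates below are illustrative values):

```python
def minimize(lr, steps=50):
    """Gradient descent on f(w) = (w - 3)^2, starting from w = 0."""
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - 3)   # df/dw
        w -= lr * grad
    return w

for lr in [0.001, 0.1, 1.1]:
    print(f"lr={lr}: w -> {minimize(lr):.3f}")
# lr=0.001: crawls toward 3, still far away after 50 steps (too small)
# lr=0.1:   converges to ~3.000 (well chosen)
# lr=1.1:   overshoots farther every step and diverges (too large)
```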

2.8 PRACTICAL: Neural Network from Scratch (MNIST)

Here is a complete, working neural network built from scratch using only NumPy. No TensorFlow, no PyTorch. This network classifies handwritten digits (0-9) from the MNIST dataset.

"""
NEURAL NETWORK FROM SCRATCH - MNIST Digit Classification
=========================================================
Architecture: 784 inputs -> 128 hidden (ReLU) -> 64 hidden (ReLU) -> 10 output (Softmax)
Loss: Categorical Cross-Entropy
Optimizer: Mini-batch Gradient Descent with momentum

No frameworks - just NumPy!
"""

import numpy as np

# -------------------------------------------------------------------
# STEP 1: Load MNIST Data
# -------------------------------------------------------------------
# You can download MNIST from many sources. Here we'll use sklearn
# as a convenient loader, but the neural network itself is pure NumPy.
from sklearn.datasets import fetch_openml

print("Loading MNIST dataset...")
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist.data, mnist.target.astype(int)

# Normalize pixel values to [0, 1]
X = X / 255.0

# One-hot encode labels
def one_hot_encode(labels, num_classes=10):
    """Convert integer labels to one-hot vectors."""
    n = len(labels)
    one_hot = np.zeros((n, num_classes))
    one_hot[np.arange(n), labels] = 1
    return one_hot

Y = one_hot_encode(y)

# Split into train/test
X_train, X_test = X[:60000], X[60000:]
Y_train, Y_test = Y[:60000], Y[60000:]
y_test = y[60000:]  # Keep integer labels for accuracy calculation

print(f"Training set: {X_train.shape}")  # (60000, 784)
print(f"Test set: {X_test.shape}")       # (10000, 784)

# -------------------------------------------------------------------
# STEP 2: Define Activation Functions
# -------------------------------------------------------------------
def relu(z):
    """ReLU activation function."""
    return np.maximum(0, z)

def relu_derivative(z):
    """Derivative of ReLU."""
    return (z > 0).astype(float)

def softmax(z):
    """Softmax activation - converts logits to probabilities.

    Subtract max for numerical stability (prevents overflow).
    """
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

# -------------------------------------------------------------------
# STEP 3: Initialize Network Parameters
# -------------------------------------------------------------------
def initialize_parameters(layer_dims):
    """Initialize weights using He initialization and biases to zero.

    He initialization: W ~ N(0, sqrt(2/n_in))
    This is specifically designed for ReLU networks and helps
    prevent vanishing/exploding gradients.

    Args:
        layer_dims: list of layer sizes, e.g., [784, 128, 64, 10]

    Returns:
        parameters: dict with W1, b1, W2, b2, etc.
    """
    parameters = {}
    n_layers = len(layer_dims)

    for l in range(1, n_layers):
        # He initialization: scale by sqrt(2 / fan_in)
        parameters[f'W{l}'] = np.random.randn(
            layer_dims[l], layer_dims[l-1]
        ) * np.sqrt(2.0 / layer_dims[l-1])

        parameters[f'b{l}'] = np.zeros((layer_dims[l], 1))

        print(f"Layer {l}: W{l} shape = {parameters[f'W{l}'].shape}, "
              f"b{l} shape = {parameters[f'b{l}'].shape}")

    return parameters

# Network architecture
layer_dims = [784, 128, 64, 10]
params = initialize_parameters(layer_dims)
# Layer 1: W1 shape = (128, 784), b1 shape = (128, 1)  -> 100,352 weights
# Layer 2: W2 shape = (64, 128),  b2 shape = (64, 1)   -> 8,192 weights
# Layer 3: W3 shape = (10, 64),   b3 shape = (10, 1)   -> 640 weights
# Total: 109,184 parameters (+ 202 biases = 109,386 total)

# -------------------------------------------------------------------
# STEP 4: Forward Propagation
# -------------------------------------------------------------------
def forward_propagation(X, parameters, n_layers):
    """Forward pass through the network.

    Args:
        X: input data (batch_size, 784) - we transpose to (784, batch_size)
        parameters: dict with weights and biases
        n_layers: number of layers (excluding input)

    Returns:
        A_final: output predictions (10, batch_size)
        cache: intermediate values needed for backpropagation
    """
    cache = {'A0': X.T}  # Transpose: (784, batch_size)

    A = X.T
    for l in range(1, n_layers):
        Z = parameters[f'W{l}'] @ A + parameters[f'b{l}']

        if l == n_layers - 1:
            # Last layer: softmax
            A = softmax(Z.T).T  # softmax expects (batch, classes)
        else:
            # Hidden layers: ReLU
            A = relu(Z)

        cache[f'Z{l}'] = Z
        cache[f'A{l}'] = A

    return A, cache

# -------------------------------------------------------------------
# STEP 5: Compute Loss
# -------------------------------------------------------------------
def compute_loss(A_final, Y):
    """Compute categorical cross-entropy loss.

    Loss = -(1/m) * sum(Y * log(A))
    """
    m = Y.shape[0]
    # A_final is (10, m), Y.T is (10, m)
    epsilon = 1e-15
    A_clipped = np.clip(A_final, epsilon, 1 - epsilon)
    loss = -np.sum(Y.T * np.log(A_clipped)) / m
    return loss

# -------------------------------------------------------------------
# STEP 6: Backward Propagation
# -------------------------------------------------------------------
def backward_propagation(parameters, cache, Y, n_layers):
    """Compute gradients using backpropagation.

    For softmax + cross-entropy, the output gradient simplifies to:
    dZ_final = A_final - Y (elegant mathematical simplification!)
    """
    m = Y.shape[0]
    grads = {}

    # Output layer gradient (softmax + cross-entropy derivative)
    dZ = cache[f'A{n_layers-1}'] - Y.T  # (10, m)

    for l in range(n_layers - 1, 0, -1):
        # Gradient for weights and biases
        grads[f'dW{l}'] = (1/m) * dZ @ cache[f'A{l-1}'].T
        grads[f'db{l}'] = (1/m) * np.sum(dZ, axis=1, keepdims=True)

        if l > 1:
            # Propagate gradient to previous layer
            dA = parameters[f'W{l}'].T @ dZ
            dZ = dA * relu_derivative(cache[f'Z{l-1}'])

    return grads

# -------------------------------------------------------------------
# STEP 7: Update Parameters
# -------------------------------------------------------------------
def update_parameters(parameters, grads, learning_rate, n_layers):
    """Update parameters using gradient descent."""
    for l in range(1, n_layers):
        parameters[f'W{l}'] -= learning_rate * grads[f'dW{l}']
        parameters[f'b{l}'] -= learning_rate * grads[f'db{l}']
    return parameters

# -------------------------------------------------------------------
# STEP 8: Training Loop
# -------------------------------------------------------------------
def train(X_train, Y_train, layer_dims, epochs=20, batch_size=128,
          learning_rate=0.1):
    """Train the neural network."""
    n_layers = len(layer_dims)
    parameters = initialize_parameters(layer_dims)
    m = X_train.shape[0]

    losses = []

    for epoch in range(epochs):
        # Shuffle training data
        permutation = np.random.permutation(m)
        X_shuffled = X_train[permutation]
        Y_shuffled = Y_train[permutation]

        epoch_loss = 0
        n_batches = m // batch_size

        for i in range(n_batches):
            # Get mini-batch
            start = i * batch_size
            end = start + batch_size
            X_batch = X_shuffled[start:end]
            Y_batch = Y_shuffled[start:end]

            # Forward pass
            A_final, cache = forward_propagation(
                X_batch, parameters, n_layers
            )

            # Compute loss
            batch_loss = compute_loss(A_final, Y_batch)
            epoch_loss += batch_loss

            # Backward pass
            grads = backward_propagation(
                parameters, cache, Y_batch, n_layers
            )

            # Update parameters
            parameters = update_parameters(
                parameters, grads, learning_rate, n_layers
            )

        avg_loss = epoch_loss / n_batches
        losses.append(avg_loss)

        # Evaluate accuracy on test set every 5 epochs
        if (epoch + 1) % 5 == 0 or epoch == 0:
            A_test, _ = forward_propagation(X_test, parameters, n_layers)
            predictions = np.argmax(A_test, axis=0)
            accuracy = np.mean(predictions == y_test)
            print(f"Epoch {epoch+1}/{epochs} | Loss: {avg_loss:.4f} | "
                  f"Test Accuracy: {accuracy*100:.2f}%")

    return parameters, losses

# -------------------------------------------------------------------
# STEP 9: Train the Network!
# -------------------------------------------------------------------
print("\n" + "="*60)
print("TRAINING NEURAL NETWORK FROM SCRATCH")
print("="*60 + "\n")

trained_params, losses = train(
    X_train, Y_train,
    layer_dims=[784, 128, 64, 10],
    epochs=20,
    batch_size=128,
    learning_rate=0.1
)

# Expected output (approximate):
# Epoch 1/20  | Loss: 0.5123 | Test Accuracy: 89.42%
# Epoch 5/20  | Loss: 0.1847 | Test Accuracy: 95.13%
# Epoch 10/20 | Loss: 0.1102 | Test Accuracy: 96.51%
# Epoch 15/20 | Loss: 0.0731 | Test Accuracy: 97.02%
# Epoch 20/20 | Loss: 0.0498 | Test Accuracy: 97.31%

# ~97% accuracy with just NumPy! No frameworks needed.
# Modern frameworks like PyTorch do all of this automatically,
# but understanding the internals is crucial for AI engineering.

3. Types of Learning

3.1 Supervised Learning

The model learns from labeled data — input-output pairs where the correct answer is provided during training.

Classification

Predict a discrete category/class.

  • Binary: spam/not-spam, cat/dog, positive/negative sentiment
  • Multi-class: digit recognition (0-9), image classification (1000 ImageNet classes)
  • Multi-label: a movie can be both "comedy" AND "romance"

Regression

Predict a continuous value.

  • House price prediction
  • Stock price forecasting
  • Temperature prediction
# Supervised Learning Examples

# Classification: Sentiment Analysis
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "This movie was amazing!", "Terrible waste of time",
    "Loved every minute", "Boring and predictable",
    "Best film of the year", "I want my money back"
]
labels = [1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression()
model.fit(X, labels)

# Test on new data
new_text = vectorizer.transform(["This was wonderful!"])
print(f"Prediction: {'Positive' if model.predict(new_text)[0] else 'Negative'}")

# Regression: House Price
from sklearn.linear_model import LinearRegression
import numpy as np

# Features: [sq_ft, bedrooms, bathrooms]
X = np.array([[1500, 3, 2], [2000, 4, 3], [1200, 2, 1], [3000, 5, 4]])
y = np.array([300000, 450000, 200000, 650000])

model = LinearRegression()
model.fit(X, y)
print(f"Predicted price for 1800sqft, 3bed, 2bath: ${model.predict([[1800, 3, 2]])[0]:,.0f}")

3.2 Self-Supervised Learning

The model creates its own labels from the data itself. This is how modern LLMs learn! No human labeling is needed.

Masked Language Modeling (MLM) — BERT's approach

Randomly mask 15% of tokens in a sentence and train the model to predict them.

# How BERT learns (Masked Language Modeling)

# Original:  "The cat sat on the mat"
# Masked:    "The [MASK] sat on the [MASK]"
# Target:    predict "cat" at position 2, "mat" at position 6

# The model learns:
# - "cat" is a likely word between "The" and "sat" (grammar + semantics)
# - "mat" makes sense after "on the" (common phrases)
# - Through billions of examples, it learns language structure

# This is BIDIRECTIONAL - the model can see both left and right context
# "The [MASK] sat" -> can use "The" AND "sat" to predict "cat"
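The masking step can be sketched in a few lines of plain Python. This is a toy illustration: real BERT pre-training works on subword tokens and, of the selected positions, replaces 80% with [MASK], 10% with a random token, and leaves 10% unchanged.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace ~mask_prob of tokens with [MASK]; return masked tokens and targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # the model must predict the original token at this position
        else:
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)   # some tokens replaced by [MASK]
print(targets)  # {position: original_token} pairs the model is trained to recover
```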

Next Token Prediction — GPT's approach

Given a sequence of tokens, predict the next one. This is how GPT, Llama, Claude, and most modern LLMs are trained.

# How GPT learns (Next Token Prediction / Causal Language Modeling)

# Input:  "The cat sat on"
# Target: "the"

# Input:  "The cat sat on the"
# Target: "mat"

# For a single sentence "The cat sat on the mat", GPT gets 5 training signals:
# "The"           -> predict "cat"
# "The cat"       -> predict "sat"
# "The cat sat"   -> predict "on"
# "The cat sat on"-> predict "the"
# "The cat sat on the" -> predict "mat"

# This is UNIDIRECTIONAL (causal) - model can only see PREVIOUS tokens
# Can't peek at the future! This is why it's called "autoregressive"

# Why this works so well:
# - Infinite free training data (every text on the internet)
# - Predicting the next word requires understanding:
#   * Grammar and syntax
#   * Facts and knowledge
#   * Reasoning and logic
#   * Common sense
#   * Context and pragmatics
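The sliding-window construction of training signals above is easy to make concrete. A toy version using whitespace tokens (real models use subword tokenizers):

```python
def next_token_pairs(text):
    """Build (context, target) training pairs for causal language modeling."""
    tokens = text.split()  # stand-in for a real subword tokenizer
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# A 6-token sentence yields 5 training signals
for context, target in next_token_pairs("The cat sat on the mat"):
    print(f"{' '.join(context):<22} -> predict {target!r}")
```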

3.3 Contrastive Learning

Train the model to bring similar examples close together and push dissimilar examples apart in embedding space.

SimCLR (Simple Contrastive Learning of Representations)

  • Take an image, create two augmented versions (crop, flip, color shift)
  • These two versions are "positive pairs" (should be close in embedding space)
  • All other images in the batch are "negative pairs" (should be far apart)
  • The model learns visual features without any labels

CLIP (Contrastive Language-Image Pre-training)

  • Train on 400 million (image, text caption) pairs from the internet
  • Learn to match images with their correct text descriptions
  • Result: a model that understands both images AND text in the same space
  • This enables zero-shot image classification: "Is this image more similar to the text 'a cat' or 'a dog'?"
# Contrastive Learning Intuition

import numpy as np

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """
    Simplified contrastive loss (InfoNCE).

    Args:
        anchor: embedding of the anchor example
        positive: embedding of a similar example
        negatives: embeddings of dissimilar examples (list)
        temperature: controls how "sharp" the distribution is

    The loss encourages:
    - High similarity between anchor and positive
    - Low similarity between anchor and negatives
    """
    # Cosine similarity
    def cosine_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Similarity with positive pair
    pos_sim = cosine_sim(anchor, positive) / temperature

    # Similarities with negative pairs
    neg_sims = [cosine_sim(anchor, neg) / temperature for neg in negatives]

    # InfoNCE loss: -log(exp(pos_sim) / (exp(pos_sim) + sum(exp(neg_sims))))
    all_sims = [pos_sim] + neg_sims
    log_sum_exp = np.log(sum(np.exp(s) for s in all_sims))
    loss = -pos_sim + log_sum_exp

    return loss

# Example
anchor = np.random.randn(128)  # Embedding of a cat image
positive = anchor + np.random.randn(128) * 0.1  # Augmented cat (similar)
negatives = [np.random.randn(128) for _ in range(5)]  # Random images

loss = contrastive_loss(anchor, positive, negatives)
print(f"Contrastive loss: {loss:.4f}")
# Lower loss = model better at distinguishing similar from dissimilar
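CLIP's zero-shot classification reduces to the same cosine similarity used above: embed the image and each candidate caption, then pick the caption with the highest similarity. A minimal sketch with made-up embeddings — real CLIP embeddings would come from its trained image and text encoders:

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """Pick the label whose text embedding is most similar to the image embedding."""
    def cosine_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = [cosine_sim(image_emb, t) for t in text_embs]
    return labels[int(np.argmax(sims))], sims

rng = np.random.default_rng(0)
cat_text = rng.standard_normal(128)                 # pretend embedding of "a cat"
dog_text = rng.standard_normal(128)                 # pretend embedding of "a dog"
image = cat_text + 0.1 * rng.standard_normal(128)   # image constructed near the cat caption

label, sims = zero_shot_classify(image, [cat_text, dog_text], ["a cat", "a dog"])
print(label)  # "a cat" — the image embedding was built close to the cat caption
```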

3.4 Reinforcement Learning

An agent learns by interacting with an environment, taking actions, and receiving rewards. The goal is to learn a policy that maximizes cumulative reward.

  • Agent: The learner/decision-maker (e.g., a game-playing AI)
  • Environment: The world the agent interacts with (e.g., a chess board)
  • State: Current situation (e.g., positions of all pieces)
  • Action: What the agent can do (e.g., move a piece)
  • Reward: Feedback signal (+1 for winning, -1 for losing, 0 otherwise)
  • Policy: Strategy mapping states to actions
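These components fit together in even the simplest RL algorithm. Below is a minimal tabular Q-learning sketch on a toy one-dimensional corridor (states 0..4, actions left/right, reward +1 for reaching state 4) — a deliberately tiny environment to ground the vocabulary, not a production RL setup:

```python
import random

def train_corridor(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning: the agent walks states 0..4 and is rewarded at state 4."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(5)]  # Q[state][0] = left, Q[state][1] = right

    for _ in range(episodes):
        state = 0
        while state != 4:  # episode ends at the goal state
            # epsilon-greedy policy: explore on ties or with probability epsilon
            explore = rng.random() < epsilon or Q[state][0] == Q[state][1]
            a = rng.randrange(2) if explore else (0 if Q[state][0] > Q[state][1] else 1)
            next_state = max(state - 1, 0) if a == 0 else min(state + 1, 4)
            reward = 1.0 if next_state == 4 else 0.0
            # Q-learning update: nudge Q toward reward + discounted best future value
            Q[state][a] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][a])
            state = next_state
    return Q

Q = train_corridor()
policy = ["left" if Q[s][0] > Q[s][1] else "right" for s in range(4)]
print(policy)  # the learned policy chooses "right" in every non-goal state
```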
Connection to LLMs — RLHF: Reinforcement Learning from Human Feedback (RLHF) is used to align LLMs with human preferences. After pre-training (self-supervised), the model is fine-tuned using RL:
  1. Generate multiple responses to a prompt
  2. Human rankers rate which response is best
  3. Train a "reward model" to predict human preferences
  4. Use PPO (Proximal Policy Optimization) to fine-tune the LLM to maximize the reward model's score
This is what makes ChatGPT helpful, harmless, and honest — and is a key part of how Claude is trained. DeepSeek R1 notably used RL more extensively for reasoning.
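Step 3 — training the reward model — commonly uses a Bradley-Terry style preference loss: the reward assigned to the human-preferred response should exceed the reward of the rejected one. A minimal numpy sketch of just the loss (in practice the reward model is a full LLM with a scalar output head; here the rewards are plain numbers):

```python
import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).
    Small when the reward model scores the preferred response higher."""
    margin = reward_chosen - reward_rejected
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

print(preference_loss(2.0, -1.0))  # small loss: model agrees with the human ranking
print(preference_loss(-1.0, 2.0))  # large loss: model disagrees with the ranking
```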

4. Large Language Models (LLMs)

4.1 What Makes a Model "Large"?

The "large" in Large Language Models refers primarily to the number of parameters (weights and biases). A parameter is a learnable number in the model. More parameters generally means more capacity to learn complex patterns.
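As a back-of-envelope check, a decoder-only Transformer's parameter count is roughly 12·L·d² for the layers (attention plus feed-forward) plus embedding parameters, where L is the layer count and d the hidden size. Plugging in GPT-2 XL's published configuration (48 layers, d=1600, ~50k vocabulary) approximately recovers its 1.5B figure:

```python
def approx_params(n_layers, d_model, vocab_size, n_positions=1024):
    """Rough decoder-only Transformer parameter count.
    Per layer: ~4*d^2 (attention Q/K/V/output) + ~8*d^2 (FFN with 4x expansion) = 12*d^2.
    Plus token and position embeddings. Ignores biases and layer norms (small)."""
    layer_params = 12 * n_layers * d_model ** 2
    embedding_params = (vocab_size + n_positions) * d_model
    return layer_params + embedding_params

gpt2_xl = approx_params(n_layers=48, d_model=1600, vocab_size=50257)
print(f"GPT-2 XL estimate: {gpt2_xl / 1e9:.2f}B parameters")  # ~1.56B, close to the published 1.5B
```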

| Model | Parameters | Year | Architecture | Notable Feature |
|---|---|---|---|---|
| GPT-1 | 117M | 2018 | Decoder-only Transformer | First GPT; proved pre-training works |
| BERT-Base | 110M | 2018 | Encoder-only Transformer | Bidirectional; revolutionized NLP benchmarks |
| GPT-2 | 1.5B | 2019 | Decoder-only Transformer | "Too dangerous to release" (they did eventually) |
| GPT-3 | 175B | 2020 | Decoder-only Transformer | In-context learning, few-shot prompting |
| PaLM | 540B | 2022 | Decoder-only Transformer | Google; chain-of-thought reasoning |
| Llama 2 | 7B-70B | 2023 | Decoder-only Transformer | Open-source; democratized LLMs |
| GPT-4 | ~1.8T (MoE, rumored) | 2023 | Mixture of Experts | Multimodal; major quality leap |
| Llama 3.1 | 8B-405B | 2024 | Decoder-only Transformer | Open-source; competitive with GPT-4 |
| Claude 3.5 Sonnet | Undisclosed | 2024 | Undisclosed | Strong coding; computer use capability |
| DeepSeek V3 | 671B (MoE, 37B active) | 2024 | Mixture of Experts | Efficient MoE; strong for cost |
| DeepSeek R1 | 671B (MoE) | 2025 | Mixture of Experts + RL | Reasoning model; extensive RL training |
| Claude 4 (Opus/Sonnet) | Undisclosed | 2025 | Undisclosed | Advanced reasoning; agentic capabilities |
| GPT-5 | Undisclosed | 2025 | Undisclosed | Unified reasoning model |

4.2 Timeline of LLMs

The evolution of language models has been breathtaking:

  • 2017: "Attention Is All You Need" paper introduces the Transformer architecture
  • 2018: GPT-1 (117M params) proves unsupervised pre-training + supervised fine-tuning works. BERT shows bidirectional pre-training.
  • 2019: GPT-2 (1.5B params) generates remarkably coherent text. OpenAI initially withholds release.
  • 2020: GPT-3 (175B params) demonstrates in-context learning — it can perform tasks from just a few examples in the prompt, without any fine-tuning.
  • 2022: ChatGPT launches (GPT-3.5-turbo). Goes viral. AI enters mainstream consciousness. PaLM, Chinchilla scaling laws published.
  • 2023: GPT-4 (multimodal), Claude 2, Llama 2 (open-source revolution). Era of "AI alignment" becomes prominent.
  • 2024: Claude 3.5 Sonnet, Llama 3.1 (405B), DeepSeek V3, Gemini 1.5 Pro (1M context). Focus shifts to efficiency, reasoning, and agents.
  • 2025: DeepSeek R1 (reasoning via RL), Claude 4 family (Opus 4, Sonnet 4 with strong agentic capabilities), GPT-5, open-source models approach frontier quality. Reasoning models and agentic AI become dominant themes.
  • 2026 (current): Focus on efficient inference, multi-modal agents, tool use, and integration into production systems.

4.3 How LLMs Work at a High Level

At their core, LLMs are next-token prediction machines:

  1. Tokenize the input text into tokens (subword units)
  2. Embed each token as a high-dimensional vector
  3. Pass through many Transformer layers (attention + feed-forward networks)
  4. Output a probability distribution over the entire vocabulary for the next token
  5. Sample or select the next token from this distribution
  6. Repeat — append the generated token and generate the next one
# Simplified LLM Generation Process (pseudocode)

def generate_text(model, prompt, max_tokens=100):
    """
    How an LLM generates text, step by step.
    This is pseudocode to illustrate the process.
    """
    tokens = tokenize(prompt)  # "Hello world" -> [15496, 995]

    for _ in range(max_tokens):
        # 1. Convert tokens to embeddings (learned vector representations)
        embeddings = model.embed(tokens)  # Shape: (seq_len, d_model)

        # 2. Pass through transformer layers
        # Each layer has: Multi-Head Attention + Feed-Forward Network
        hidden = embeddings
        for layer in model.transformer_layers:
            hidden = layer(hidden)  # Complex pattern recognition

        # 3. Project to vocabulary size and get probabilities
        logits = model.output_projection(hidden[-1])  # (vocab_size,)
        probs = softmax(logits / temperature)

        # 4. Sample next token
        next_token = sample(probs)  # e.g., token 318 = " I"

        # 5. Append and continue
        tokens.append(next_token)

        # 6. Stop if we generate an end-of-sequence token
        if next_token == EOS_TOKEN:
            break

    return detokenize(tokens)

# The magic is in the transformer layers.
# After training on trillions of tokens, these layers encode:
# - Grammar and syntax rules
# - World knowledge and facts
# - Reasoning patterns
# - Stylistic understanding
# - And much more...

# Example with a real API (OpenAI):
from openai import OpenAI

client = OpenAI()  # Uses OPENAI_API_KEY env variable

response = client.chat.completions.create(
    model="gpt-4o",  # or "gpt-4o-mini" for cost savings
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in one sentence."}
    ],
    temperature=0.7,  # Controls randomness (0=deterministic, 2=very random)
    max_tokens=100
)

print(response.choices[0].message.content)

# Example with Anthropic Claude:
import anthropic

client = anthropic.Anthropic()  # Uses ANTHROPIC_API_KEY env variable

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain quantum computing in one sentence."}
    ]
)

print(response.content[0].text)
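The temperature parameter passed to both APIs corresponds to the softmax(logits / temperature) step in the pseudocode above. A small self-contained demo of its effect on a toy distribution over four candidate tokens:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Lower temperature sharpens the distribution; higher temperature flattens it."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5, 0.1]  # toy scores for 4 candidate tokens
for t in (0.2, 0.7, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {np.round(probs, 3)}")
# At T=0.2 nearly all probability mass sits on the top token (near-greedy decoding);
# at T=2.0 the distribution flattens and sampling becomes much more random.
```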
LLM Training Pipeline Overview

graph LR
    D["Raw Text Data<br/>Books, Web, Code"] --> T["Tokenization<br/>Text to Token IDs"]
    T --> E["Embedding<br/>IDs to Vectors"]
    E --> TR["Transformer Layers<br/>Attention + FFN"]
    TR --> P["Predict Next Token<br/>Probability Distribution"]
    P --> L["Compute Loss<br/>Cross-Entropy"]
    L --> B["Backpropagation<br/>Update Weights"]
    B --> TR
    style D fill:#e8f4f8,stroke:#333
    style T fill:#d5e8d4,stroke:#333
    style E fill:#fff2cc,stroke:#333
    style TR fill:#dae8fc,stroke:#333
    style P fill:#f8cecc,stroke:#333
    style L fill:#e1d5e7,stroke:#333
    style B fill:#fff2cc,stroke:#333

4.4 Emergent Abilities

One of the most fascinating aspects of LLMs is emergence: abilities that appear suddenly as models get larger, without being explicitly trained for.

  • In-context learning: GPT-3 could learn new tasks from just a few examples in the prompt, without updating any weights. This was not present in GPT-2.
  • Chain-of-thought reasoning: Larger models can "think step by step" when prompted to do so, dramatically improving accuracy on math and logic problems.
  • Code generation: Models trained primarily on text also learned to write code, even though code was a small fraction of training data.
  • Translation: Models can translate between languages they were not explicitly trained to translate between.
  • Theory of mind: Larger models show signs of understanding that different people have different beliefs and knowledge (though this is debated).
The Emergence Debate (2024-2026): Recent research has questioned whether emergent abilities are truly "sudden." Some researchers argue that with the right evaluation metrics, improvements are more gradual and predictable. The debate continues, but the practical reality is clear: larger models can do things smaller models simply cannot.

5. Scaling Laws

5.1 Chinchilla Scaling Laws

In 2022, DeepMind published the "Chinchilla" paper, which changed how the industry thinks about training LLMs. The key finding:

Chinchilla's Optimal Training Rule:

For a compute-optimal model, the number of training tokens should scale linearly with model parameters:

D_optimal ≈ 20 × N

Where D = number of training tokens, N = number of parameters.

A 10B parameter model should be trained on ~200B tokens.

This was revolutionary because GPT-3 (175B params) was trained on only 300B tokens — massively undertrained by Chinchilla's standards (it should have seen ~3.5T tokens). DeepMind's Chinchilla (70B params, 1.4T tokens) matched GPT-3's performance with 4x fewer parameters.

5.2 The Compute-Parameter-Data Relationship

The three pillars of LLM training form a tradeoff triangle:

  • Model Size (N): More parameters = more capacity to learn patterns
  • Training Data (D): More data = more patterns to learn from
  • Compute (C): C ≈ 6 × N × D (FLOPs for training)
# Scaling Law Calculations

def estimate_training_compute(n_params, n_tokens):
    """
    Estimate total training FLOPs.

    Rule of thumb: C ≈ 6 * N * D
    Where:
        C = compute in FLOPs
        N = number of parameters
        D = number of training tokens

    The factor of 6 comes from:
        - 2 FLOPs per parameter per token (forward pass: multiply + add)
        - 2 FLOPs per parameter per token (backward pass, ~2x forward)
        - Total: ~6 FLOPs per parameter per token
    """
    return 6 * n_params * n_tokens

def chinchilla_optimal_tokens(n_params):
    """Chinchilla-optimal number of training tokens."""
    return 20 * n_params

def chinchilla_optimal_params(compute_budget):
    """Given a compute budget, find optimal model size and data."""
    # C = 6 * N * D, and D = 20 * N
    # C = 6 * N * 20 * N = 120 * N^2
    # N = sqrt(C / 120)
    import math
    n_params = math.sqrt(compute_budget / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Examples
models = {
    "GPT-2":     (1.5e9,   40e9),
    "GPT-3":     (175e9,   300e9),
    "Chinchilla":(70e9,    1.4e12),
    "Llama 2 7B":(7e9,     2e12),
    "Llama 3.1 8B":(8e9,   15e12),
    "Llama 3.1 405B":(405e9, 15e12),
    "DeepSeek V3": (671e9, 14.8e12),
}

print(f"{'Model':<20} {'Params':>12} {'Tokens':>14} {'Compute (FLOPs)':>18} {'Chinchilla Optimal?':>22}")
print("-" * 90)
for name, (params, tokens) in models.items():
    compute = estimate_training_compute(params, tokens)
    optimal_tokens = chinchilla_optimal_tokens(params)
    status = "At/above optimal" if tokens >= optimal_tokens else "Under-trained"
    print(f"{name:<20} {params:>12.1e} {tokens:>14.1e} {compute:>18.1e} {status:>22}")

# Key insight: Modern models (Llama 3.1, DeepSeek V3) train on FAR MORE
# tokens than Chinchilla suggests. Why?
#
# The "over-training" trend (2024-2026):
# - Inference cost matters more than training cost
# - A smaller model trained on more data can match a larger model
# - Llama 3.1 8B trained on 15T tokens (~1875 tokens/param, ~94x the Chinchilla ratio of 20)
# - This makes the model cheaper to deploy at inference time
# - Training is a one-time cost; inference runs millions of times

5.3 Why Scaling Works (and Its Limits)

Why scaling works: Neural networks are universal function approximators. Given enough parameters and data, they can learn increasingly complex patterns. Language modeling requires understanding grammar, facts, reasoning, common sense — more parameters allow the model to store and compose more of these patterns.

The 2025-2026 scaling debate: There's growing discourse about whether pure scaling of pre-training has hit diminishing returns:

  • Data wall: We're running out of high-quality training text on the internet. Models are being trained on synthetic data generated by other LLMs, raising quality concerns.
  • Inference-time compute: Instead of bigger models, researchers are exploring "thinking longer" at inference time. Models like DeepSeek R1 and OpenAI's o1/o3 use chain-of-thought reasoning during inference, trading compute at test time for better answers.
  • Efficiency innovations: Mixture-of-Experts (MoE) models like DeepSeek V3 have 671B total parameters but only activate 37B per token, getting large-model quality at small-model inference cost.
  • Post-training matters more: RLHF, DPO, and other alignment techniques can dramatically improve a model's usefulness without increasing parameters.

5.4 When to Use Small vs. Large Models

| Consideration | Small Models (1B-8B) | Large Models (70B+) / API Models |
|---|---|---|
| Latency | Fast (can run on a single GPU or even CPU) | Slower (needs multiple GPUs or an API call) |
| Cost | Low (self-hosted) | Higher (GPU cluster or API pricing) |
| Quality | Good for specific tasks when fine-tuned | Better general reasoning and instruction following |
| Privacy | Data stays on your servers | API models: data goes to the provider |
| Use Cases | Classification, extraction, simple generation, edge deployment | Complex reasoning, creative writing, code generation, agentic workflows |
| Customization | Full fine-tuning possible | Often limited to prompting or API fine-tuning |
# Practical Decision Framework for Model Selection

def recommend_model(task):
    """
    A practical guide to choosing the right model size.

    In production AI engineering (2025-2026), the key insight is:
    use the SMALLEST model that meets your quality requirements.
    """
    recommendations = {
        # Simple classification tasks
        "sentiment_analysis": {
            "model": "Fine-tuned Llama 3.1 8B or BERT",
            "reason": "Classification is well-solved by small models",
            "cost": "~$0.001 per 1000 requests (self-hosted)"
        },

        # Text extraction
        "entity_extraction": {
            "model": "Fine-tuned Llama 3.1 8B or GPT-4o-mini",
            "reason": "Structured extraction works well with small models + fine-tuning",
            "cost": "~$0.15 per 1M input tokens (GPT-4o-mini)"
        },

        # Code generation
        "code_generation": {
            "model": "Claude Sonnet 4 or GPT-4o",
            "reason": "Complex code needs strong reasoning; large models excel",
            "cost": "~$3 per 1M input tokens (Claude Sonnet 4)"
        },

        # Complex reasoning
        "multi_step_reasoning": {
            "model": "Claude Opus 4, GPT-4o, or DeepSeek R1",
            "reason": "Reasoning is where large models truly shine",
            "cost": "~$15 per 1M input tokens (Claude Opus 4)"
        },

        # High-volume, simple tasks
        "high_volume_simple": {
            "model": "GPT-4o-mini or self-hosted Llama 3.1 8B",
            "reason": "Cost-efficiency at scale is paramount",
            "cost": "~$0.15 per 1M input tokens"
        },

        # Privacy-critical
        "privacy_critical": {
            "model": "Self-hosted Llama 3.1 70B or Mistral Large",
            "reason": "Data never leaves your infrastructure",
            "cost": "GPU hosting costs (~$2-5/hour for 70B on 2xA100)"
        },
    }

    return recommendations.get(task, "Evaluate on your specific use case")

# The AI Engineer's Rule of Thumb:
# 1. Start with the cheapest model (GPT-4o-mini / Claude Haiku)
# 2. Evaluate quality on YOUR specific task
# 3. Only upgrade to a larger model if quality is insufficient
# 4. Consider fine-tuning a small model before jumping to a large one
# 5. Use large models for evaluation/labeling to improve small models

Week 1 Summary

Key Takeaways

  1. AI ⊃ ML ⊃ DL: Deep learning is a subset of machine learning, which is a subset of artificial intelligence.
  2. Neural networks are composed of neurons that compute weighted sums, apply activation functions, and learn through backpropagation.
  3. Self-supervised learning (especially next-token prediction) is how modern LLMs learn from raw text without human labels.
  4. LLMs are next-token prediction machines that have learned an incredible amount of knowledge and reasoning ability from training on internet text.
  5. Scale matters but there are limits, leading to innovations in efficiency (MoE), reasoning (inference-time compute), and alignment (RLHF/DPO).

Exercises

  1. Implement the MNIST neural network from scratch and experiment with different learning rates (0.001, 0.01, 0.1, 1.0). What happens?
  2. Add a third hidden layer to the network. Does accuracy improve?
  3. Replace ReLU with Sigmoid in the hidden layers. How does training speed change?
  4. Calculate how many FLOPs it would take to train a 13B parameter model on 2T tokens.
  5. Use the OpenAI or Anthropic API to compare responses from different model sizes on the same prompt.

Further Reading