1. What is AI / ML / Deep Learning?
1.1 The Hierarchy
Think of three concentric circles. The outermost circle is Artificial Intelligence (AI), the middle circle is Machine Learning (ML), and the innermost circle is Deep Learning (DL).
- AI (Artificial Intelligence) — The broadest concept: any technique that enables a computer to mimic human-like intelligence. This includes rule-based expert systems (like the chess engine Deep Blue), search algorithms, constraint solvers, and more. If a program can play chess, translate languages, or drive a car, it falls under AI.
- ML (Machine Learning) — A subset of AI where the system learns from data rather than being explicitly programmed with rules. Instead of writing if temperature > 30: say("hot"), you feed the system thousands of examples and let it discover the patterns. Algorithms include linear regression, decision trees, SVMs, random forests, and neural networks.
- DL (Deep Learning) — A subset of ML that uses deep neural networks (networks with many layers) to learn hierarchical representations. "Deep" refers to the depth of layers, not the depth of understanding. Deep learning powers image recognition, speech synthesis, and, crucially, modern language models.
An analogy: imagine teaching someone to cook.
- AI approach (rule-based): You hand them a detailed recipe with exact measurements and steps. They follow the recipe precisely.
- ML approach: You give them 1000 photos of well-cooked vs. burnt dishes and let them figure out what "good cooking" looks like.
- DL approach: You give them millions of cooking videos and a neural network figures out everything from ingredient identification to cooking techniques to plating aesthetics — all by itself.
```mermaid
flowchart TD
    AI["Artificial Intelligence
    Rule-based systems, search, planning"] --> ML["Machine Learning
    Learns from data: trees, SVM, regression"]
    ML --> DL["Deep Learning
    Deep neural networks with many layers"]
    DL --> LLM["Large Language Models
    GPT, Claude, Llama"]
    style AI fill:#e8f4f8,stroke:#2c3e50,stroke-width:2px
    style ML fill:#d5e8d4,stroke:#2c3e50,stroke-width:2px
    style DL fill:#fff2cc,stroke:#2c3e50,stroke-width:2px
    style LLM fill:#f8cecc,stroke:#2c3e50,stroke-width:2px
```
1.2 Traditional Programming vs. the ML Paradigm Shift
This is one of the most important conceptual shifts in computer science:
| Aspect | Traditional Programming | Machine Learning |
|---|---|---|
| Input | Data + Rules | Data + Expected Outputs |
| Output | Answers | Rules (learned model) |
| Example | if email contains "free money" → spam | Show 100k emails labeled spam/not-spam; model learns patterns |
| Maintenance | Manually update rules as spammers evolve | Retrain on new data; model adapts automatically |
| Scalability | Rules become unmanageable in complex domains | Scales gracefully with more data |
# Traditional Programming: Spam Detector
def is_spam_traditional(email_text):
"""Rule-based spam detection - brittle and hard to maintain."""
spam_keywords = ["free money", "click here", "limited offer",
"congratulations", "you won", "act now"]
email_lower = email_text.lower()
for keyword in spam_keywords:
if keyword in email_lower:
return True
return False
# Machine Learning: Spam Detector
# Instead of writing rules, we LEARN them from data
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
def train_spam_detector(emails, labels):
"""ML-based spam detection - learns patterns from data."""
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(emails)
model = MultinomialNB()
model.fit(X, labels)
return model, vectorizer
def predict_spam_ml(model, vectorizer, email_text):
"""Use the trained model to predict spam."""
X = vectorizer.transform([email_text])
prediction = model.predict(X)
probability = model.predict_proba(X)
return prediction[0], probability[0]
# The ML version can detect NEW spam patterns it has never explicitly
# been told about, as long as they share statistical features with
# known spam. This is the power of learning from data.
2. Neural Networks from Scratch
2.1 What is a Neuron?
A biological neuron receives electrical signals through its dendrites, processes them in the cell body, and if the combined signal exceeds a threshold, it fires an output signal through its axon to connected neurons.
An artificial neuron mirrors this:
- Inputs (x1, x2, ..., xn) → like dendrites receiving signals
- Weights (w1, w2, ..., wn) → like the strength of synaptic connections
- Bias (b) → like the neuron's intrinsic excitability threshold
- Activation function (f) → like the decision to fire or not
- Output: y = f(Σ(wi · xi) + b)
2.2 Mathematical Model of a Neuron
The complete mathematical formulation:
Step 1 — Linear Transformation (Weighted Sum):
z = w1·x1 + w2·x2 + ... + wn·xn + b = wᵀx + b
Step 2 — Non-linear Activation:
y = f(z)
where f is an activation function (ReLU, Sigmoid, etc.)
import numpy as np
class Neuron:
"""A single artificial neuron."""
def __init__(self, n_inputs):
# Initialize weights randomly (small values near zero)
self.weights = np.random.randn(n_inputs) * 0.01
self.bias = 0.0
def forward(self, x):
"""Compute the neuron's output."""
# Step 1: Weighted sum (linear transformation)
z = np.dot(self.weights, x) + self.bias
# Step 2: Activation function (using sigmoid here)
y = 1 / (1 + np.exp(-z))
return y
# Example
neuron = Neuron(n_inputs=3)
x = np.array([1.0, 0.5, -0.3]) # Input features
output = neuron.forward(x)
print(f"Neuron output: {output:.4f}")
# Output will be a value between 0 and 1 (because sigmoid)
2.3 Activation Functions
Activation functions introduce non-linearity into the network. Without them, stacking layers would be equivalent to a single linear transformation, no matter how many layers you use. They are the key to a neural network's power.
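The claim that stacked linear layers collapse into a single linear map is easy to verify numerically. A small sketch with two arbitrary weight matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # "layer 1" weights
W2 = rng.standard_normal((2, 4))   # "layer 2" weights
x = rng.standard_normal(3)

# Two stacked linear layers...
two_layers = W2 @ (W1 @ x)
# ...are exactly equivalent to one linear layer with weights W2 @ W1
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))  # True

# Insert a non-linearity between them and the equivalence breaks
with_relu = W2 @ np.maximum(0, W1 @ x)
print(np.allclose(with_relu, one_layer))   # False (in general)
```

This is why depth without activation functions buys nothing: the composition is still linear.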
ReLU (Rectified Linear Unit)
ReLU(x) = max(0, x)
Derivative: 1 if x > 0, else 0
- Pros: Computationally efficient, mitigates vanishing gradient problem, sparse activation
- Cons: "Dying ReLU" problem (neurons can become permanently inactive if they always output 0)
- Used in: Hidden layers of most modern networks
def relu(x):
"""ReLU activation: returns x if positive, 0 otherwise."""
return np.maximum(0, x)
def relu_derivative(x):
"""Derivative of ReLU: 1 if x > 0, else 0."""
return (x > 0).astype(float)
# Visual representation:
# Input: [-2, -1, 0, 1, 2, 3]
# Output: [ 0, 0, 0, 1, 2, 3]
# It simply "clips" negative values to zero.
Sigmoid
σ(x) = 1 / (1 + e^(-x))
Output range: (0, 1)
Derivative: σ(x) · (1 - σ(x))
- Pros: Outputs interpretable as probabilities, smooth gradient
- Cons: Vanishing gradients for very large/small inputs, not zero-centered
- Used in: Binary classification output layers, gates in LSTMs
def sigmoid(x):
"""Sigmoid activation: squashes input to (0, 1)."""
return 1 / (1 + np.exp(-x))
def sigmoid_derivative(x):
"""Derivative of sigmoid."""
s = sigmoid(x)
return s * (1 - s)
# Visual representation:
# Input: [-5, -2, 0, 2, 5 ]
# Output: [0.007, 0.12, 0.5, 0.88, 0.993]
# Large negative -> ~0, large positive -> ~1, 0 -> 0.5
Tanh (Hyperbolic Tangent)
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Output range: (-1, 1)
Derivative: 1 - tanh²(x)
- Pros: Zero-centered (helps with gradient updates), stronger gradients than sigmoid
- Cons: Still suffers from vanishing gradients at extremes
- Used in: RNN hidden states, some normalization layers
def tanh(x):
"""Tanh activation: squashes input to (-1, 1)."""
return np.tanh(x)
def tanh_derivative(x):
"""Derivative of tanh."""
return 1 - np.tanh(x) ** 2
# Visual representation:
# Input: [-5, -2, 0, 2, 5 ]
# Output: [-0.999, -0.96, 0.0, 0.96, 0.999]
# Like sigmoid but centered at 0 and ranges from -1 to 1
Softmax
softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
Output: probability distribution (all outputs sum to 1)
- Used in: Multi-class classification output layers, attention mechanisms
- Key property: Converts raw scores (logits) into probabilities
def softmax(x):
"""Softmax: converts logits to probabilities.
We subtract max(x) for numerical stability to prevent
overflow when computing exp() of large numbers.
"""
exp_x = np.exp(x - np.max(x))
return exp_x / exp_x.sum()
# Example: classifying an image as cat/dog/bird
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(f"Cat: {probs[0]:.3f}, Dog: {probs[1]:.3f}, Bird: {probs[2]:.3f}")
# Output: Cat: 0.659, Dog: 0.242, Bird: 0.099
# The highest logit (cat=2.0) gets the highest probability
# Complete comparison of all activation functions
import numpy as np
x = np.linspace(-5, 5, 100)
# All activations side by side
activations = {
'ReLU': np.maximum(0, x),
'Sigmoid': 1 / (1 + np.exp(-x)),
'Tanh': np.tanh(x),
'LeakyReLU': np.where(x > 0, x, 0.01 * x),
'GELU': x * 0.5 * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3))),
}
# GELU (Gaussian Error Linear Unit) - used in GPT-2, BERT, modern transformers
# It's a smoother version of ReLU that allows small negative values
# GELU(x) = x * Phi(x), where Phi is the CDF of standard normal distribution
# Approximation: GELU(x) ~ 0.5x(1 + tanh(sqrt(2/pi)(x + 0.044715x^3)))
# Sample each activation at the grid points nearest x = -2, 0, 2
for name, values in activations.items():
    i_neg, i_zero, i_pos = (np.abs(x - v).argmin() for v in (-2, 0, 2))
    print(f"{name:12s} at x≈-2: {values[i_neg]:.4f}, x≈0: {values[i_zero]:.4f}, x≈2: {values[i_pos]:.4f}")
2.4 Forward Propagation (Step by Step)
Forward propagation is the process of passing input through the network layer by layer to produce an output. Let's trace through a concrete numerical example.
"""
FORWARD PROPAGATION - Complete Numerical Example
=================================================
Network Architecture:
Input Layer: 2 neurons (x1, x2)
Hidden Layer: 2 neurons (h1, h2) with ReLU activation
Output Layer: 1 neuron (o1) with Sigmoid activation
Given values:
Inputs: x1 = 0.5, x2 = 0.8
Hidden layer weights:
w1 = 0.4 (x1 -> h1), w2 = 0.3 (x2 -> h1), b1 = 0.1
w3 = -0.2 (x1 -> h2), w4 = 0.6 (x2 -> h2), b2 = -0.1
Output layer weights:
w5 = 0.7 (h1 -> o1), w6 = 0.5 (h2 -> o1), b3 = 0.2
"""
import numpy as np
# Inputs
x = np.array([0.5, 0.8])
# Hidden layer parameters
W_hidden = np.array([
[0.4, 0.3], # weights for h1
[-0.2, 0.6] # weights for h2
])
b_hidden = np.array([0.1, -0.1])
# Output layer parameters
W_output = np.array([
[0.7, 0.5] # weights for o1
])
b_output = np.array([0.2])
# ---- STEP 1: Hidden Layer Linear Transformation ----
z_hidden = np.dot(W_hidden, x) + b_hidden
# h1: z1 = (0.4 * 0.5) + (0.3 * 0.8) + 0.1 = 0.2 + 0.24 + 0.1 = 0.54
# h2: z2 = (-0.2 * 0.5) + (0.6 * 0.8) + (-0.1) = -0.1 + 0.48 - 0.1 = 0.28
print(f"Hidden layer linear output (z): {z_hidden}")
# Output: [0.54, 0.28]
# ---- STEP 2: Hidden Layer Activation (ReLU) ----
a_hidden = np.maximum(0, z_hidden) # ReLU
# h1: ReLU(0.54) = 0.54 (positive, so unchanged)
# h2: ReLU(0.28) = 0.28 (positive, so unchanged)
print(f"Hidden layer activation (a): {a_hidden}")
# Output: [0.54, 0.28]
# ---- STEP 3: Output Layer Linear Transformation ----
z_output = np.dot(W_output, a_hidden) + b_output
# o1: z = (0.7 * 0.54) + (0.5 * 0.28) + 0.2 = 0.378 + 0.14 + 0.2 = 0.718
print(f"Output layer linear output (z): {z_output}")
# Output: [0.718]
# ---- STEP 4: Output Layer Activation (Sigmoid) ----
y_pred = 1 / (1 + np.exp(-z_output))
# o1: sigmoid(0.718) = 1 / (1 + e^(-0.718)) = 1 / (1 + 0.4877) = 0.6723
print(f"Network output (prediction): {y_pred}")
# Output: [0.6723]
# If this is a binary classifier:
# - Output > 0.5 -> Class 1 (positive)
# - Output <= 0.5 -> Class 0 (negative)
# Our prediction of 0.6723 -> Class 1
2.5 Loss Functions
A loss function (also called cost function or objective function) measures how far the model's predictions are from the true values. The goal of training is to minimize the loss.
Mean Squared Error (MSE)
MSE = (1/n) Σ_{i=1}^{n} (y_i - ŷ_i)²
Where y_i is the true value and ŷ_i is the predicted value.
- Used for: Regression problems (predicting continuous values)
- Intuition: Penalizes larger errors more heavily (quadratic penalty)
- Derivation: Squaring ensures all errors are positive and differentiable
Binary Cross-Entropy (BCE)
BCE = -(1/n) Σ_{i=1}^{n} [y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i)]
- Used for: Binary classification
- Intuition: Heavily penalizes confident wrong predictions
Categorical Cross-Entropy
CCE = -Σ_{c=1}^{C} y_c log(ŷ_c)
Where C is the number of classes.
- Used for: Multi-class classification, language model training (next token prediction)
- Key insight: This is the loss function used to train LLMs!
import numpy as np
# ---- Mean Squared Error ----
def mse_loss(y_true, y_pred):
"""Mean Squared Error for regression."""
return np.mean((y_true - y_pred) ** 2)
def mse_derivative(y_true, y_pred):
"""Derivative of MSE with respect to y_pred."""
return 2 * (y_pred - y_true) / len(y_true)
# Example: predicting house prices
y_true = np.array([300000, 450000, 200000])
y_pred = np.array([310000, 420000, 215000])
print(f"MSE Loss: {mse_loss(y_true, y_pred):,.0f}")
# Output: MSE Loss: 408,333,333
# ---- Binary Cross-Entropy ----
def binary_cross_entropy(y_true, y_pred):
"""Binary cross-entropy for binary classification.
We clip predictions to avoid log(0) which is undefined.
"""
epsilon = 1e-15
y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
# Example: spam detection
y_true = np.array([1, 0, 1, 1]) # Actual labels
y_pred = np.array([0.9, 0.1, 0.8, 0.7]) # Model predictions
print(f"BCE Loss: {binary_cross_entropy(y_true, y_pred):.4f}")
# Output: BCE Loss: 0.1976
# What happens with a TERRIBLE prediction?
y_pred_bad = np.array([0.1, 0.9, 0.2, 0.3]) # Everything wrong
print(f"BCE Loss (bad): {binary_cross_entropy(y_true, y_pred_bad):.4f}")
# Output: BCE Loss (bad): 1.8546 (much higher = worse!)
# ---- Categorical Cross-Entropy ----
def categorical_cross_entropy(y_true_onehot, y_pred_probs):
"""Categorical cross-entropy for multi-class classification.
This is the loss function used to train language models!
When predicting the next token, y_true is the correct token
(one-hot encoded) and y_pred is the model's probability
distribution over the entire vocabulary.
"""
epsilon = 1e-15
y_pred_probs = np.clip(y_pred_probs, epsilon, 1.0)
return -np.sum(y_true_onehot * np.log(y_pred_probs))
# Example: classifying an image as [cat, dog, bird]
y_true = np.array([1, 0, 0]) # True label: cat (one-hot)
y_pred = np.array([0.7, 0.2, 0.1]) # Model says 70% cat
print(f"CCE Loss: {categorical_cross_entropy(y_true, y_pred):.4f}")
# Output: CCE Loss: 0.3567
# For language models: predict next word from vocabulary of 50000 words
# y_true = one-hot vector with 1 at position of correct word
# y_pred = softmax output giving probability for each word
```mermaid
flowchart TD
    subgraph Forward Pass
        I["Input x"] --> H1["Hidden Layer 1
        z = Wx + b"]
        H1 --> A1["Activation
        a = ReLU(z)"]
        A1 --> H2["Hidden Layer 2
        z = Wa + b"]
        H2 --> A2["Activation
        a = ReLU(z)"]
        A2 --> O["Output
        ŷ"]
    end
    O --> L["Loss Function
    L = loss(y, ŷ)"]
    subgraph Backward Pass
        L --> G3["dL/dW2
        Gradient"]
        G3 --> G2["dL/dW1
        Gradient"]
        G2 --> U["Update Weights
        W = W - lr × grad"]
    end
    style I fill:#e8f4f8,stroke:#333
    style O fill:#d5e8d4,stroke:#333
    style L fill:#f8cecc,stroke:#333
    style U fill:#fff2cc,stroke:#333
```
2.6 Backpropagation
Backpropagation is the algorithm that computes how much each weight contributed to the error, so we can update them to reduce the loss. It uses the chain rule of calculus to propagate gradients backward through the network.
"""
BACKPROPAGATION - Complete Worked Example
==========================================
Using our forward propagation example:
Network: 2 inputs -> 2 hidden (ReLU) -> 1 output (Sigmoid)
Prediction: 0.6723
True label: 1.0 (positive class)
Loss function: Binary Cross-Entropy
We need to compute: dLoss/dw for EVERY weight in the network,
then update: w_new = w_old - learning_rate * dLoss/dw
"""
import numpy as np
# Forward pass values (from previous example)
x = np.array([0.5, 0.8])
z_hidden = np.array([0.54, 0.28])
a_hidden = np.array([0.54, 0.28]) # After ReLU
z_output = np.array([0.718])
y_pred = np.array([0.6723]) # After Sigmoid
y_true = np.array([1.0])
# ---- STEP 1: Compute output layer gradient ----
# For sigmoid + BCE, the gradient simplifies beautifully:
# dL/dz_output = y_pred - y_true
dL_dz_output = y_pred - y_true # = 0.6723 - 1.0 = -0.3277
print(f"Output gradient: {dL_dz_output}")
# ---- STEP 2: Compute gradients for output weights ----
# dL/dw5 = dL/dz_output * dz_output/dw5 = dL/dz_output * a_h1
# dL/dw6 = dL/dz_output * dz_output/dw6 = dL/dz_output * a_h2
dL_dW_output = dL_dz_output * a_hidden.reshape(1, -1).T
# w5 gradient: -0.3277 * 0.54 = -0.1770
# w6 gradient: -0.3277 * 0.28 = -0.0918
print(f"Output weight gradients: {dL_dW_output.flatten()}")
# dL/db3 = dL/dz_output = -0.3277
dL_db_output = dL_dz_output
print(f"Output bias gradient: {dL_db_output}")
# ---- STEP 3: Propagate gradient to hidden layer ----
W_output = np.array([[0.7, 0.5]])
# dL/da_hidden = W_output^T * dL/dz_output
dL_da_hidden = W_output.T.dot(dL_dz_output.reshape(-1, 1)).flatten()
print(f"Hidden activation gradient: {dL_da_hidden}")
# ---- STEP 4: Gradient through ReLU ----
# dReLU/dz = 1 if z > 0, else 0
relu_grad = (z_hidden > 0).astype(float) # [1.0, 1.0] (both positive)
dL_dz_hidden = dL_da_hidden * relu_grad
print(f"Hidden layer gradient: {dL_dz_hidden}")
# ---- STEP 5: Compute gradients for hidden weights ----
dL_dW_hidden = np.outer(dL_dz_hidden, x)
dL_db_hidden = dL_dz_hidden
print(f"Hidden weight gradients:\n{dL_dW_hidden}")
print(f"Hidden bias gradients: {dL_db_hidden}")
# ---- STEP 6: Update all weights ----
learning_rate = 0.01
W_hidden = np.array([[0.4, 0.3], [-0.2, 0.6]])
b_hidden_vals = np.array([0.1, -0.1])
W_output_vals = np.array([[0.7, 0.5]])
b_output_val = np.array([0.2])
W_hidden_new = W_hidden - learning_rate * dL_dW_hidden
b_hidden_new = b_hidden_vals - learning_rate * dL_db_hidden
W_output_new = W_output_vals - learning_rate * dL_dW_output.T
b_output_new = b_output_val - learning_rate * dL_db_output
print(f"\nUpdated weights:")
print(f"W_hidden: {W_hidden} -> {W_hidden_new}")
print(f"W_output: {W_output_vals} -> {W_output_new}")
print(f"\nThe weights shifted slightly to reduce the loss!")
print(f"Repeat this process thousands of times = training!")
2.7 Learning Rate, Epochs, and Batch Size
- Learning Rate (η): How big a step to take when updating weights. Too large → overshoot the minimum, training diverges. Too small → training takes forever, may get stuck in local minima. Typical values: 0.001 to 0.01 for simple networks, 1e-4 to 3e-4 for transformers.
- Epoch: One complete pass through the entire training dataset. Training for 10 epochs means the model sees every example 10 times. Too few epochs → underfitting. Too many → overfitting.
- Batch Size: Number of training examples processed together before updating weights.
- Batch size = 1: Stochastic Gradient Descent (SGD) — noisy but fast updates
- Batch size = N (full dataset): Batch GD — stable but slow, high memory
- Batch size = 32-512: Mini-batch GD — the practical sweet spot
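The effect of the learning rate is easiest to see on a toy problem. A minimal sketch minimizing f(w) = w² (gradient 2w) with plain gradient descent:

```python
def gradient_descent(lr, steps=20, w0=5.0):
    """Minimize f(w) = w**2 (gradient = 2w) starting from w0."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w  # w_new = w - lr * f'(w)
    return w

for lr in (0.01, 0.1, 0.5, 1.1):
    print(f"lr={lr:<5} -> w after 20 steps: {gradient_descent(lr):.4f}")
# Too small (0.01): w barely moves toward the minimum at 0.
# Moderate (0.1): converges nicely. For this quadratic, 0.5 reaches 0 in one step.
# Too large (1.1): each step overshoots and |w| grows - training diverges.
```

The same intuition carries over to neural networks, except the loss surface is high-dimensional and non-convex, so a good learning rate must usually be found empirically.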
2.8 PRACTICAL: Neural Network from Scratch (MNIST)
Here is a complete, working neural network built from scratch using only NumPy. No TensorFlow, no PyTorch. This network classifies handwritten digits (0-9) from the MNIST dataset.
"""
NEURAL NETWORK FROM SCRATCH - MNIST Digit Classification
=========================================================
Architecture: 784 inputs -> 128 hidden (ReLU) -> 64 hidden (ReLU) -> 10 output (Softmax)
Loss: Categorical Cross-Entropy
Optimizer: Mini-batch Gradient Descent with momentum
No frameworks - just NumPy!
"""
import numpy as np
# -------------------------------------------------------------------
# STEP 1: Load MNIST Data
# -------------------------------------------------------------------
# You can download MNIST from many sources. Here we'll use sklearn
# as a convenient loader, but the neural network itself is pure NumPy.
from sklearn.datasets import fetch_openml
print("Loading MNIST dataset...")
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist.data, mnist.target.astype(int)
# Normalize pixel values to [0, 1]
X = X / 255.0
# One-hot encode labels
def one_hot_encode(labels, num_classes=10):
"""Convert integer labels to one-hot vectors."""
n = len(labels)
one_hot = np.zeros((n, num_classes))
one_hot[np.arange(n), labels] = 1
return one_hot
Y = one_hot_encode(y)
# Split into train/test
X_train, X_test = X[:60000], X[60000:]
Y_train, Y_test = Y[:60000], Y[60000:]
y_test = y[60000:] # Keep integer labels for accuracy calculation
print(f"Training set: {X_train.shape}") # (60000, 784)
print(f"Test set: {X_test.shape}") # (10000, 784)
# -------------------------------------------------------------------
# STEP 2: Define Activation Functions
# -------------------------------------------------------------------
def relu(z):
"""ReLU activation function."""
return np.maximum(0, z)
def relu_derivative(z):
"""Derivative of ReLU."""
return (z > 0).astype(float)
def softmax(z):
"""Softmax activation - converts logits to probabilities.
Subtract max for numerical stability (prevents overflow).
"""
exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
return exp_z / np.sum(exp_z, axis=1, keepdims=True)
# -------------------------------------------------------------------
# STEP 3: Initialize Network Parameters
# -------------------------------------------------------------------
def initialize_parameters(layer_dims):
"""Initialize weights using He initialization and biases to zero.
He initialization: W ~ N(0, sqrt(2/n_in))
This is specifically designed for ReLU networks and helps
prevent vanishing/exploding gradients.
Args:
layer_dims: list of layer sizes, e.g., [784, 128, 64, 10]
Returns:
parameters: dict with W1, b1, W2, b2, etc.
"""
parameters = {}
n_layers = len(layer_dims)
for l in range(1, n_layers):
# He initialization: scale by sqrt(2 / fan_in)
parameters[f'W{l}'] = np.random.randn(
layer_dims[l], layer_dims[l-1]
) * np.sqrt(2.0 / layer_dims[l-1])
parameters[f'b{l}'] = np.zeros((layer_dims[l], 1))
print(f"Layer {l}: W{l} shape = {parameters[f'W{l}'].shape}, "
f"b{l} shape = {parameters[f'b{l}'].shape}")
return parameters
# Network architecture
layer_dims = [784, 128, 64, 10]
params = initialize_parameters(layer_dims)
# Layer 1: W1 shape = (128, 784), b1 shape = (128, 1) -> 100,352 weights
# Layer 2: W2 shape = (64, 128), b2 shape = (64, 1) -> 8,192 weights
# Layer 3: W3 shape = (10, 64), b3 shape = (10, 1) -> 640 weights
# Total: 109,184 weights (+ 202 biases = 109,386 parameters)
# -------------------------------------------------------------------
# STEP 4: Forward Propagation
# -------------------------------------------------------------------
def forward_propagation(X, parameters, n_layers):
"""Forward pass through the network.
Args:
X: input data (batch_size, 784) - we transpose to (784, batch_size)
parameters: dict with weights and biases
n_layers: number of layers (excluding input)
Returns:
A_final: output predictions (10, batch_size)
cache: intermediate values needed for backpropagation
"""
cache = {'A0': X.T} # Transpose: (784, batch_size)
A = X.T
for l in range(1, n_layers):
Z = parameters[f'W{l}'] @ A + parameters[f'b{l}']
if l == n_layers - 1:
# Last layer: softmax
A = softmax(Z.T).T # softmax expects (batch, classes)
else:
# Hidden layers: ReLU
A = relu(Z)
cache[f'Z{l}'] = Z
cache[f'A{l}'] = A
return A, cache
# -------------------------------------------------------------------
# STEP 5: Compute Loss
# -------------------------------------------------------------------
def compute_loss(A_final, Y):
"""Compute categorical cross-entropy loss.
Loss = -(1/m) * sum(Y * log(A))
"""
m = Y.shape[0]
# A_final is (10, m), Y.T is (10, m)
epsilon = 1e-15
A_clipped = np.clip(A_final, epsilon, 1 - epsilon)
loss = -np.sum(Y.T * np.log(A_clipped)) / m
return loss
# -------------------------------------------------------------------
# STEP 6: Backward Propagation
# -------------------------------------------------------------------
def backward_propagation(parameters, cache, Y, n_layers):
"""Compute gradients using backpropagation.
For softmax + cross-entropy, the output gradient simplifies to:
dZ_final = A_final - Y (elegant mathematical simplification!)
"""
m = Y.shape[0]
grads = {}
# Output layer gradient (softmax + cross-entropy derivative)
dZ = cache[f'A{n_layers-1}'] - Y.T # (10, m)
for l in range(n_layers - 1, 0, -1):
# Gradient for weights and biases
grads[f'dW{l}'] = (1/m) * dZ @ cache[f'A{l-1}'].T
grads[f'db{l}'] = (1/m) * np.sum(dZ, axis=1, keepdims=True)
if l > 1:
# Propagate gradient to previous layer
dA = parameters[f'W{l}'].T @ dZ
dZ = dA * relu_derivative(cache[f'Z{l-1}'])
return grads
# -------------------------------------------------------------------
# STEP 7: Update Parameters
# -------------------------------------------------------------------
def update_parameters(parameters, grads, learning_rate, n_layers):
"""Update parameters using gradient descent."""
for l in range(1, n_layers):
parameters[f'W{l}'] -= learning_rate * grads[f'dW{l}']
parameters[f'b{l}'] -= learning_rate * grads[f'db{l}']
return parameters
# -------------------------------------------------------------------
# STEP 8: Training Loop
# -------------------------------------------------------------------
def train(X_train, Y_train, layer_dims, epochs=20, batch_size=128,
learning_rate=0.1):
"""Train the neural network."""
n_layers = len(layer_dims)
parameters = initialize_parameters(layer_dims)
m = X_train.shape[0]
losses = []
for epoch in range(epochs):
# Shuffle training data
permutation = np.random.permutation(m)
X_shuffled = X_train[permutation]
Y_shuffled = Y_train[permutation]
epoch_loss = 0
n_batches = m // batch_size
for i in range(n_batches):
# Get mini-batch
start = i * batch_size
end = start + batch_size
X_batch = X_shuffled[start:end]
Y_batch = Y_shuffled[start:end]
# Forward pass
A_final, cache = forward_propagation(
X_batch, parameters, n_layers
)
# Compute loss
batch_loss = compute_loss(A_final, Y_batch)
epoch_loss += batch_loss
# Backward pass
grads = backward_propagation(
parameters, cache, Y_batch, n_layers
)
# Update parameters
parameters = update_parameters(
parameters, grads, learning_rate, n_layers
)
avg_loss = epoch_loss / n_batches
losses.append(avg_loss)
# Evaluate accuracy on test set every 5 epochs
if (epoch + 1) % 5 == 0 or epoch == 0:
A_test, _ = forward_propagation(X_test, parameters, n_layers)
predictions = np.argmax(A_test, axis=0)
accuracy = np.mean(predictions == y_test)
print(f"Epoch {epoch+1}/{epochs} | Loss: {avg_loss:.4f} | "
f"Test Accuracy: {accuracy*100:.2f}%")
return parameters, losses
# -------------------------------------------------------------------
# STEP 9: Train the Network!
# -------------------------------------------------------------------
print("\n" + "="*60)
print("TRAINING NEURAL NETWORK FROM SCRATCH")
print("="*60 + "\n")
trained_params, losses = train(
X_train, Y_train,
layer_dims=[784, 128, 64, 10],
epochs=20,
batch_size=128,
learning_rate=0.1
)
# Expected output (approximate):
# Epoch 1/20 | Loss: 0.5123 | Test Accuracy: 89.42%
# Epoch 5/20 | Loss: 0.1847 | Test Accuracy: 95.13%
# Epoch 10/20 | Loss: 0.1102 | Test Accuracy: 96.51%
# Epoch 15/20 | Loss: 0.0731 | Test Accuracy: 97.02%
# Epoch 20/20 | Loss: 0.0498 | Test Accuracy: 97.31%
# ~97% accuracy with just NumPy! No frameworks needed.
# Modern frameworks like PyTorch do all of this automatically,
# but understanding the internals is crucial for AI engineering.
3. Types of Learning
3.1 Supervised Learning
The model learns from labeled data — input-output pairs where the correct answer is provided during training.
Classification
Predict a discrete category/class.
- Binary: spam/not-spam, cat/dog, positive/negative sentiment
- Multi-class: digit recognition (0-9), image classification (1000 ImageNet classes)
- Multi-label: a movie can be both "comedy" AND "romance"
Regression
Predict a continuous value.
- House price prediction
- Stock price forecasting
- Temperature prediction
# Supervised Learning Examples
# Classification: Sentiment Analysis
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
texts = [
"This movie was amazing!", "Terrible waste of time",
"Loved every minute", "Boring and predictable",
"Best film of the year", "I want my money back"
]
labels = [1, 0, 1, 0, 1, 0] # 1=positive, 0=negative
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression()
model.fit(X, labels)
# Test on new data
new_text = vectorizer.transform(["This was wonderful!"])
print(f"Prediction: {'Positive' if model.predict(new_text)[0] else 'Negative'}")
# Regression: House Price
from sklearn.linear_model import LinearRegression
import numpy as np
# Features: [sq_ft, bedrooms, bathrooms]
X = np.array([[1500, 3, 2], [2000, 4, 3], [1200, 2, 1], [3000, 5, 4]])
y = np.array([300000, 450000, 200000, 650000])
model = LinearRegression()
model.fit(X, y)
print(f"Predicted price for 1800sqft, 3bed, 2bath: ${model.predict([[1800, 3, 2]])[0]:,.0f}")
3.2 Self-Supervised Learning
The model creates its own labels from the data itself. This is how modern LLMs learn! No human labeling is needed.
Masked Language Modeling (MLM) — BERT's approach
Randomly mask 15% of tokens in a sentence and train the model to predict them.
# How BERT learns (Masked Language Modeling)
# Original: "The cat sat on the mat"
# Masked: "The [MASK] sat on the [MASK]"
# Target: predict "cat" at position 2, "mat" at position 6
# The model learns:
# - "cat" is a likely word between "The" and "sat" (grammar + semantics)
# - "mat" makes sense after "on the" (common phrases)
# - Through billions of examples, it learns language structure
# This is BIDIRECTIONAL - the model can see both left and right context
# "The [MASK] sat" -> can use "The" AND "sat" to predict "cat"
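A toy sketch of how MLM training pairs can be constructed. Real BERT operates on subword tokens and uses an 80/10/10 mask/replace/keep scheme; this simplified version just masks whole words with a given probability:

```python
import random

def make_mlm_example(tokens, mask_prob=0.15, mask_token="[MASK]", seed=42):
    """Randomly hide tokens; the hidden originals become the prediction targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok          # the model must recover this token
        else:
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = make_mlm_example(tokens)
print(masked)
print(targets)  # {position: original token} pairs the model is trained on
```

Every sentence in the training corpus yields such examples for free, which is what makes this "self-supervised".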
Next Token Prediction — GPT's approach
Given a sequence of tokens, predict the next one. This is how GPT, Llama, Claude, and most modern LLMs are trained.
# How GPT learns (Next Token Prediction / Causal Language Modeling)
# Input: "The cat sat on"
# Target: "the"
# Input: "The cat sat on the"
# Target: "mat"
# For a single sentence "The cat sat on the mat", GPT gets 5 training signals:
# "The" -> predict "cat"
# "The cat" -> predict "sat"
# "The cat sat" -> predict "on"
# "The cat sat on"-> predict "the"
# "The cat sat on the" -> predict "mat"
# This is UNIDIRECTIONAL (causal) - model can only see PREVIOUS tokens
# Can't peek at the future! This is why it's called "autoregressive"
# Why this works so well:
# - Infinite free training data (every text on the internet)
# - Predicting the next word requires understanding:
# * Grammar and syntax
# * Facts and knowledge
# * Reasoning and logic
# * Common sense
# * Context and pragmatics
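The fan-out of training signals described above is easy to generate in code. A sketch that produces every (context, target) pair from a single sentence:

```python
def next_token_pairs(tokens):
    """Each prefix of the sequence predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "The cat sat on the mat".split()
for context, target in next_token_pairs(tokens):
    print(f"{' '.join(context):25s} -> {target}")
# A 6-token sentence yields 5 (context, target) training pairs
```

In a real LLM the same idea is applied to every position of every document in the corpus, with the loss computed on all positions in parallel.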
3.3 Contrastive Learning
Train the model to bring similar examples close together and push dissimilar examples apart in embedding space.
SimCLR (Simple Contrastive Learning of Representations)
- Take an image, create two augmented versions (crop, flip, color shift)
- These two versions are "positive pairs" (should be close in embedding space)
- All other images in the batch are "negative pairs" (should be far apart)
- The model learns visual features without any labels
CLIP (Contrastive Language-Image Pre-training)
- Train on 400 million (image, text caption) pairs from the internet
- Learn to match images with their correct text descriptions
- Result: a model that understands both images AND text in the same space
- This enables zero-shot image classification: "Is this image more similar to the text 'a cat' or 'a dog'?"
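The zero-shot classification idea can be sketched with mock embeddings. In a real CLIP model these vectors would come from the trained image and text encoders; the random vectors below are placeholders that stand in for them:

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels, temperature=0.07):
    """Pick the label whose text embedding is most similar to the image embedding."""
    def unit(v):
        return v / np.linalg.norm(v)
    image_emb = unit(image_emb)
    sims = np.array([unit(t) @ image_emb for t in text_embs])  # cosine similarities
    logits = sims / temperature
    probs = np.exp(logits - logits.max())  # stable softmax over the labels
    probs /= probs.sum()
    return labels[int(np.argmax(probs))], probs

# Mock embeddings standing in for encoder outputs
rng = np.random.default_rng(0)
cat_text = rng.standard_normal(64)
dog_text = rng.standard_normal(64)
image = cat_text + 0.1 * rng.standard_normal(64)  # an image "near" the cat caption

label, probs = zero_shot_classify(image, [cat_text, dog_text], ["a cat", "a dog"])
print(label, probs)
```

Because classification reduces to comparing embeddings, new classes can be added at inference time just by writing new text prompts, with no retraining.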
# Contrastive Learning Intuition
import numpy as np
def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """
    Simplified contrastive loss (InfoNCE).

    Args:
        anchor: embedding of the anchor example
        positive: embedding of a similar example
        negatives: embeddings of dissimilar examples (list)
        temperature: controls how "sharp" the distribution is

    The loss encourages:
        - High similarity between anchor and positive
        - Low similarity between anchor and negatives
    """
    # Cosine similarity
    def cosine_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Similarity with positive pair
    pos_sim = cosine_sim(anchor, positive) / temperature

    # Similarities with negative pairs
    neg_sims = [cosine_sim(anchor, neg) / temperature for neg in negatives]

    # InfoNCE loss: -log(exp(pos_sim) / (exp(pos_sim) + sum(exp(neg_sims))))
    all_sims = [pos_sim] + neg_sims
    log_sum_exp = np.log(sum(np.exp(s) for s in all_sims))
    loss = -pos_sim + log_sum_exp
    return loss
# Example
anchor = np.random.randn(128) # Embedding of a cat image
positive = anchor + np.random.randn(128) * 0.1 # Augmented cat (similar)
negatives = [np.random.randn(128) for _ in range(5)] # Random images
loss = contrastive_loss(anchor, positive, negatives)
print(f"Contrastive loss: {loss:.4f}")
# Lower loss = model better at distinguishing similar from dissimilar
3.4 Reinforcement Learning
An agent learns by interacting with an environment, taking actions, and receiving rewards. The goal is to learn a policy that maximizes cumulative reward.
- Agent: The learner/decision-maker (e.g., a game-playing AI)
- Environment: The world the agent interacts with (e.g., a chess board)
- State: Current situation (e.g., positions of all pieces)
- Action: What the agent can do (e.g., move a piece)
- Reward: Feedback signal (+1 for winning, -1 for losing, 0 otherwise)
- Policy: Strategy mapping states to actions
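These pieces fit together in a learning loop. A minimal tabular Q-learning example on a toy environment (a 5-state corridor invented for illustration, not from the text) shows agent, state, action, reward, and policy in a few lines:

```python
import numpy as np

# Tabular Q-learning on a 5-state corridor.
# States 0..4; stepping into state 4 yields reward +1 and ends the episode.
n_states, n_actions = 5, 2       # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9          # learning rate, discount factor
rng = np.random.default_rng(0)

for episode in range(300):
    state = 0
    for _ in range(100):                       # cap episode length
        action = int(rng.integers(n_actions))  # random exploration (off-policy)
        next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
        reward = 1.0 if next_state == 4 else 0.0
        # Q-learning update: move Q toward reward + discounted best future value
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        if next_state == 4:
            break
        state = next_state

policy = [int(np.argmax(Q[s])) for s in range(4)]
print(policy)  # the learned policy moves right in every non-terminal state
```

The policy here is just "argmax over Q-values per state"; deep RL replaces the table with a neural network.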
RLHF (Reinforcement Learning from Human Feedback) applies this loop to align LLMs with human preferences:
- Generate multiple responses to a prompt
- Human rankers rate which response is best
- Train a "reward model" to predict human preferences
- Use PPO (Proximal Policy Optimization) to fine-tune the LLM to maximize the reward model's score
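The reward model mentioned above is typically trained with a pairwise preference loss (a Bradley-Terry objective). A minimal sketch, with illustrative function names and scores:

```python
import numpy as np

# Pairwise preference loss for a reward model (Bradley-Terry style).
# r_chosen / r_rejected are the reward model's scalar scores for the
# human-preferred and human-rejected responses.
def preference_loss(r_chosen, r_rejected):
    # -log sigmoid(r_chosen - r_rejected):
    # small when the preferred response scores higher, large when it doesn't
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

print(round(float(preference_loss(2.0, 0.0)), 3))  # correct ranking -> small loss
print(round(float(preference_loss(0.0, 2.0)), 3))  # wrong ranking -> large loss
```

Minimizing this loss over many human-ranked pairs teaches the reward model to score responses the way human raters do.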
4. Large Language Models (LLMs)
4.1 What Makes a Model "Large"?
The "large" in Large Language Models refers primarily to the number of parameters (weights and biases). A parameter is a learnable number in the model. More parameters generally means more capacity to learn complex patterns.
| Model | Parameters | Year | Architecture | Notable Feature |
|---|---|---|---|---|
| GPT-1 | 117M | 2018 | Decoder-only Transformer | First GPT; proved pre-training works |
| BERT-Base | 110M | 2018 | Encoder-only Transformer | Bidirectional; revolutionized NLP benchmarks |
| GPT-2 | 1.5B | 2019 | Decoder-only Transformer | "Too dangerous to release" (they did eventually) |
| GPT-3 | 175B | 2020 | Decoder-only Transformer | In-context learning, few-shot prompting |
| PaLM | 540B | 2022 | Decoder-only Transformer | Google; chain-of-thought reasoning |
| Llama 2 | 7B-70B | 2023 | Decoder-only Transformer | Open-source; democratized LLMs |
| GPT-4 | ~1.8T (MoE, rumored) | 2023 | Mixture of Experts | Multimodal; major quality leap |
| Llama 3.1 | 8B-405B | 2024 | Decoder-only Transformer | Open-source; competitive with GPT-4 |
| Claude 3.5 Sonnet | Undisclosed | 2024 | Undisclosed | Strong coding; computer use capability |
| DeepSeek V3 | 671B (MoE, 37B active) | 2024 | Mixture of Experts | Efficient MoE; strong for cost |
| DeepSeek R1 | 671B (MoE) | 2025 | Mixture of Experts + RL | Reasoning model; extensive RL training |
| Claude 4 (Opus/Sonnet) | Undisclosed | 2025 | Undisclosed | Advanced reasoning; agentic capabilities |
| GPT-5 | Undisclosed | 2025 | Undisclosed | Unified reasoning model |
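As a rough sanity check on these parameter counts, a dense decoder-only Transformer's size can be estimated with a common back-of-envelope formula: about 12·d_model² parameters per layer plus the token-embedding matrix. The function below is illustrative; exact counts vary with architecture details (biases, layer norms, tied embeddings):

```python
def approx_transformer_params(d_model, n_layers, vocab_size):
    # Per layer: ~4*d^2 for attention (Q, K, V, output projections)
    #          + ~8*d^2 for the feed-forward network (4x hidden expansion)
    per_layer = 12 * d_model ** 2
    embedding = vocab_size * d_model
    return n_layers * per_layer + embedding

# GPT-2 XL-like config: 48 layers, d_model = 1600, vocab ~50k
# lands close to the advertised 1.5B parameters
print(f"{approx_transformer_params(1600, 48, 50257):.2e}")
```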
4.2 Timeline of LLMs
The evolution of language models has been breathtaking:
- 2017: "Attention Is All You Need" paper introduces the Transformer architecture
- 2018: GPT-1 (117M params) proves unsupervised pre-training + supervised fine-tuning works. BERT shows bidirectional pre-training.
- 2019: GPT-2 (1.5B params) generates remarkably coherent text. OpenAI initially withholds release.
- 2020: GPT-3 (175B params) demonstrates in-context learning — it can perform tasks from just a few examples in the prompt, without any fine-tuning.
- 2022: ChatGPT launches (GPT-3.5-turbo). Goes viral. AI enters mainstream consciousness. PaLM, Chinchilla scaling laws published.
- 2023: GPT-4 (multimodal), Claude 2, Llama 2 (open-source revolution). Era of "AI alignment" becomes prominent.
- 2024: Claude 3.5 Sonnet, Llama 3.1 (405B), DeepSeek V3, Gemini 1.5 Pro (1M context). Focus shifts to efficiency, reasoning, and agents.
- 2025: DeepSeek R1 (reasoning via RL), Claude 4 family (Opus 4, Sonnet 4 with strong agentic capabilities), GPT-5, open-source models approach frontier quality. Reasoning models and agentic AI become dominant themes.
- 2026 (current): Focus on efficient inference, multi-modal agents, tool use, and integration into production systems.
4.3 How LLMs Work at a High Level
At their core, LLMs are next-token prediction machines:
- Tokenize the input text into tokens (subword units)
- Embed each token as a high-dimensional vector
- Pass through many Transformer layers (attention + feed-forward networks)
- Output a probability distribution over the entire vocabulary for the next token
- Sample or select the next token from this distribution
- Repeat — append the generated token and generate the next one
# Simplified LLM Generation Process (pseudocode)
def generate_text(model, prompt, max_tokens=100, temperature=1.0):
    """
    How an LLM generates text, step by step.
    This is pseudocode to illustrate the process.
    """
    tokens = tokenize(prompt)  # "Hello world" -> [15496, 995]
    for _ in range(max_tokens):
        # 1. Convert tokens to embeddings (learned vector representations)
        embeddings = model.embed(tokens)  # Shape: (seq_len, d_model)

        # 2. Pass through transformer layers
        #    Each layer has: Multi-Head Attention + Feed-Forward Network
        hidden = embeddings
        for layer in model.transformer_layers:
            hidden = layer(hidden)  # Complex pattern recognition

        # 3. Project to vocabulary size and get probabilities
        logits = model.output_projection(hidden[-1])  # (vocab_size,)
        probs = softmax(logits / temperature)

        # 4. Sample next token
        next_token = sample(probs)  # e.g., token 318 = " I"

        # 5. Append and continue
        tokens.append(next_token)

        # 6. Stop if we generate an end-of-sequence token
        if next_token == EOS_TOKEN:
            break

    return detokenize(tokens)
# The magic is in the transformer layers.
# After training on trillions of tokens, these layers encode:
# - Grammar and syntax rules
# - World knowledge and facts
# - Reasoning patterns
# - Stylistic understanding
# - And much more...
# Example with a real API (OpenAI):
from openai import OpenAI
client = OpenAI() # Uses OPENAI_API_KEY env variable
response = client.chat.completions.create(
    model="gpt-4o",  # or "gpt-4o-mini" for cost savings
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in one sentence."}
    ],
    temperature=0.7,  # Controls randomness (0=deterministic, 2=very random)
    max_tokens=100
)
print(response.choices[0].message.content)
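The `temperature` parameter above is easy to demystify with a few lines: it divides the logits before the softmax, so low values sharpen the distribution and high values flatten it. The logits here are toy numbers, not real model outputs:

```python
import numpy as np

# How temperature reshapes a next-token distribution (toy logits).
logits = np.array([2.0, 1.0, 0.1])

def softmax_with_temperature(logits, t):
    z = logits / t
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

for t in (0.2, 0.7, 2.0):
    print(t, np.round(softmax_with_temperature(logits, t), 3))
# Low temperature concentrates probability on the top token;
# high temperature spreads it across the vocabulary.
```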
# Example with Anthropic Claude:
import anthropic
client = anthropic.Anthropic() # Uses ANTHROPIC_API_KEY env variable
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain quantum computing in one sentence."}
    ]
)
print(response.content[0].text)
Books, Web, Code"] --> T["Tokenization
Text to Token IDs"] T --> E["Embedding
IDs to Vectors"] E --> TR["Transformer Layers
Attention + FFN"] TR --> P["Predict Next Token
Probability Distribution"] P --> L["Compute Loss
Cross-Entropy"] L --> B["Backpropagation
Update Weights"] B --> TR style D fill:#e8f4f8,stroke:#333 style T fill:#d5e8d4,stroke:#333 style E fill:#fff2cc,stroke:#333 style TR fill:#dae8fc,stroke:#333 style P fill:#f8cecc,stroke:#333 style L fill:#e1d5e7,stroke:#333 style B fill:#fff2cc,stroke:#333
4.4 Emergent Abilities
One of the most fascinating aspects of LLMs is emergence: abilities that appear suddenly as models get larger, without being explicitly trained for.
- In-context learning: GPT-3 could learn new tasks from just a few examples in the prompt, without updating any weights. This was not present in GPT-2.
- Chain-of-thought reasoning: Larger models can "think step by step" when prompted to do so, dramatically improving accuracy on math and logic problems.
- Code generation: Models trained primarily on text also learned to write code, even though code was a small fraction of training data.
- Translation: Models can translate between languages they were not explicitly trained to translate between.
- Theory of mind: Larger models show signs of understanding that different people have different beliefs and knowledge (though this is debated).
5. Scaling Laws
5.1 Chinchilla Scaling Laws
In 2022, DeepMind published the "Chinchilla" paper, which changed how the industry thinks about training LLMs. The key finding:
Chinchilla's Optimal Training Rule:
For a compute-optimal model, the number of training tokens should scale linearly with model parameters:
D_optimal ≈ 20 × N
Where D = number of training tokens, N = number of parameters.
A 10B parameter model should be trained on ~200B tokens.
This was revolutionary because GPT-3 (175B params) was trained on only 300B tokens — massively undertrained by Chinchilla's standards (it should have seen ~3.5T tokens). DeepMind's Chinchilla (70B params, 1.4T tokens) outperformed GPT-3 with 2.5x fewer parameters, and beat the 4x-larger Gopher (280B params).
5.2 The Compute-Parameter-Data Relationship
The three pillars of LLM training form a tradeoff triangle:
- Model Size (N): More parameters = more capacity to learn patterns
- Training Data (D): More data = more patterns to learn from
- Compute (C): C ≈ 6 × N × D (FLOPs for training)
# Scaling Law Calculations
def estimate_training_compute(n_params, n_tokens):
    """
    Estimate total training FLOPs.

    Rule of thumb: C ≈ 6 * N * D
    Where:
        C = compute in FLOPs
        N = number of parameters
        D = number of training tokens

    The factor of 6 comes from:
        - 2 FLOPs per parameter per token (forward pass: multiply + add)
        - 4 FLOPs per parameter per token (backward pass, ~2x the forward pass)
        - Total: ~6 FLOPs per parameter per token
    """
    return 6 * n_params * n_tokens

def chinchilla_optimal_tokens(n_params):
    """Chinchilla-optimal number of training tokens."""
    return 20 * n_params

def chinchilla_optimal_params(compute_budget):
    """Given a compute budget, find optimal model size and data."""
    # C = 6 * N * D, and D = 20 * N
    # C = 6 * N * 20 * N = 120 * N^2
    # N = sqrt(C / 120)
    import math
    n_params = math.sqrt(compute_budget / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Examples
models = {
    "GPT-2": (1.5e9, 40e9),
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "Llama 2 7B": (7e9, 2e12),
    "Llama 3.1 8B": (8e9, 15e12),
    "Llama 3.1 405B": (405e9, 15e12),
    "DeepSeek V3": (671e9, 14.8e12),
}

print(f"{'Model':<20} {'Params':>12} {'Tokens':>14} {'Compute (FLOPs)':>18} {'Chinchilla Optimal?':>22}")
print("-" * 90)
for name, (params, tokens) in models.items():
    compute = estimate_training_compute(params, tokens)
    optimal_tokens = chinchilla_optimal_tokens(params)
    is_optimal = "At/above optimal ✓" if tokens >= optimal_tokens * 0.8 else "Under-trained"
    print(f"{name:<20} {params:>12.1e} {tokens:>14.1e} {compute:>18.1e} {is_optimal:>22}")
# Key insight: Modern models (Llama 3.1, DeepSeek V3) train on FAR MORE
# tokens than Chinchilla suggests. Why?
#
# The "over-training" trend (2024-2026):
# - Inference cost matters more than training cost
# - A smaller model trained on more data can match a larger model
# - Llama 3.1 8B trained on 15T tokens (~1,875 tokens per parameter, about 94x the Chinchilla-optimal ratio of 20!)
# - This makes the model cheaper to deploy at inference time
# - Training is a one-time cost; inference runs millions of times
5.3 Why Scaling Works (and Its Limits)
Why scaling works: Neural networks are universal function approximators. Given enough parameters and data, they can learn increasingly complex patterns. Language modeling requires understanding grammar, facts, reasoning, common sense — more parameters allow the model to store and compose more of these patterns.
The 2025-2026 scaling debate: There's growing discourse about whether pure scaling of pre-training has hit diminishing returns:
- Data wall: We're running out of high-quality training text on the internet. Models are being trained on synthetic data generated by other LLMs, raising quality concerns.
- Inference-time compute: Instead of bigger models, researchers are exploring "thinking longer" at inference time. Models like DeepSeek R1 and OpenAI's o1/o3 use chain-of-thought reasoning during inference, trading compute at test time for better answers.
- Efficiency innovations: Mixture-of-Experts (MoE) models like DeepSeek V3 have 671B total parameters but only activate 37B per token, getting large-model quality at small-model inference cost.
- Post-training matters more: RLHF, DPO, and other alignment techniques can dramatically improve a model's usefulness without increasing parameters.
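The MoE efficiency claim is easy to see in a toy router: each token only runs through its top-k experts, so per-token compute scales with k rather than with the total expert count. All shapes and numbers below are made up for illustration:

```python
import numpy as np

# Toy Mixture-of-Experts layer: route each token to its top-k experts only.
rng = np.random.default_rng(0)
d_model, n_experts, k = 8, 4, 2
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
gate = rng.standard_normal((d_model, n_experts))

def moe_forward(x):
    scores = x @ gate                # router logits, one per expert
    top = np.argsort(scores)[-k:]    # indices of the k highest-scoring experts
    weights = np.exp(scores[top])
    weights /= weights.sum()         # softmax over the selected experts only
    # Only k of the n_experts weight matrices are ever multiplied
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_forward(rng.standard_normal(d_model))
print(y.shape)  # same output dimensionality; only 2 of 4 experts ran
```

Real MoE layers route inside the Transformer's feed-forward blocks and add load-balancing losses, but the parameter-count vs. active-compute split is exactly this.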
5.4 When to Use Small vs. Large Models
| Consideration | Small Models (1B-8B) | Large Models (70B+) / API Models |
|---|---|---|
| Latency | Fast (can run on single GPU or even CPU) | Slower (needs multiple GPUs or API call) |
| Cost | Low (self-hosted) | Higher (GPU cluster or API pricing) |
| Quality | Good for specific tasks when fine-tuned | Better general reasoning and instruction following |
| Privacy | Data stays on your servers | API models: data goes to provider |
| Use Cases | Classification, extraction, simple generation, edge deployment | Complex reasoning, creative writing, code generation, agentic workflows |
| Customization | Full fine-tuning possible | Often limited to prompting or API fine-tuning |
# Practical Decision Framework for Model Selection
def recommend_model(task):
    """
    A practical guide to choosing the right model size.

    In production AI engineering (2025-2026), the key insight is:
    use the SMALLEST model that meets your quality requirements.
    """
    recommendations = {
        # Simple classification tasks
        "sentiment_analysis": {
            "model": "Fine-tuned Llama 3.1 8B or BERT",
            "reason": "Classification is well-solved by small models",
            "cost": "~$0.001 per 1000 requests (self-hosted)",
        },
        # Text extraction
        "entity_extraction": {
            "model": "Fine-tuned Llama 3.1 8B or GPT-4o-mini",
            "reason": "Structured extraction works well with small models + fine-tuning",
            "cost": "~$0.15 per 1M input tokens (GPT-4o-mini)",
        },
        # Code generation
        "code_generation": {
            "model": "Claude Sonnet 4 or GPT-4o",
            "reason": "Complex code needs strong reasoning; large models excel",
            "cost": "~$3 per 1M input tokens (Claude Sonnet 4)",
        },
        # Complex reasoning
        "multi_step_reasoning": {
            "model": "Claude Opus 4, GPT-4o, or DeepSeek R1",
            "reason": "Reasoning is where large models truly shine",
            "cost": "~$15 per 1M input tokens (Claude Opus 4)",
        },
        # High-volume, simple tasks
        "high_volume_simple": {
            "model": "GPT-4o-mini or self-hosted Llama 3.1 8B",
            "reason": "Cost-efficiency at scale is paramount",
            "cost": "~$0.15 per 1M input tokens",
        },
        # Privacy-critical
        "privacy_critical": {
            "model": "Self-hosted Llama 3.1 70B or Mistral Large",
            "reason": "Data never leaves your infrastructure",
            "cost": "GPU hosting costs (~$2-5/hour for 70B on 2xA100)",
        },
    }
    return recommendations.get(task, "Evaluate on your specific use case")
# The AI Engineer's Rule of Thumb:
# 1. Start with the cheapest model (GPT-4o-mini / Claude Haiku)
# 2. Evaluate quality on YOUR specific task
# 3. Only upgrade to a larger model if quality is insufficient
# 4. Consider fine-tuning a small model before jumping to a large one
# 5. Use large models for evaluation/labeling to improve small models
Week 1 Summary
Key Takeaways
- AI ⊃ ML ⊃ DL: Deep learning is a subset of machine learning, which is a subset of artificial intelligence.
- Neural networks are composed of neurons that compute weighted sums, apply activation functions, and learn through backpropagation.
- Self-supervised learning (especially next-token prediction) is how modern LLMs learn from raw text without human labels.
- LLMs are next-token prediction machines that have learned an incredible amount of knowledge and reasoning ability from training on internet text.
- Scale matters but there are limits, leading to innovations in efficiency (MoE), reasoning (inference-time compute), and alignment (RLHF/DPO).
Exercises
- Implement the MNIST neural network from scratch and experiment with different learning rates (0.001, 0.01, 0.1, 1.0). What happens?
- Add a third hidden layer to the network. Does accuracy improve?
- Replace ReLU with Sigmoid in the hidden layers. How does training speed change?
- Calculate how many FLOPs it would take to train a 13B parameter model on 2T tokens.
- Use the OpenAI or Anthropic API to compare responses from different model sizes on the same prompt.