Learning Objectives
Understand the Math
Derive and understand the forward/reverse diffusion processes, noise schedules, and the denoising loss function.
Implement from Scratch
Build a working diffusion model in PyTorch: noise schedule, U-Net, training loop, and sampling.
Master Stable Diffusion
Understand the full Stable Diffusion pipeline: VAE, CLIP, U-Net, schedulers, and CFG.
Apply Advanced Techniques
Use ControlNet, LoRA, inpainting, and understand the latest architectures (Flux, DiT).
1. Introduction to Diffusion Models
1.1 The Core Idea
Diffusion models are a class of generative models that generate data by learning to reverse a gradual noising process. The intuition is beautifully simple:
The Two Processes
FORWARD PROCESS (Fixed, no learning):
Gradually add Gaussian noise to data over T timesteps until it becomes pure noise.
Clean Image (x_0) --> Slightly Noisy (x_1) --> ... --> More Noisy (x_t) --> ... --> Pure Noise (x_T)
Each step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * epsilon
where epsilon ~ N(0, I)
REVERSE PROCESS (Learned by a neural network):
Starting from pure noise, gradually denoise to recover a clean image.
Pure Noise (x_T) --> Less Noisy (x_{T-1}) --> ... --> Cleaner (x_t) --> ... --> Clean Image (x_0)
Each step: x_{t-1} = f_theta(x_t, t) [neural network predicts how to denoise]
The key insight is that while the forward process is trivial (just add noise), the reverse process requires learning a complex function. If we can train a neural network to accurately reverse one small noise step, we can chain many such steps together to generate realistic data from pure noise.
1.2 Why Diffusion Models Overtook GANs
| Aspect | GANs | Diffusion Models |
|---|---|---|
| Training stability | Notoriously unstable (adversarial dynamics) | Stable (simple MSE regression loss) |
| Mode coverage | Mode collapse is a major issue | Full distribution coverage by design |
| Image quality | High (for faces) | Higher (especially for complex scenes) |
| Diversity | Limited by mode collapse | Excellent diversity |
| Text conditioning | Requires architectural tricks | Natural via cross-attention |
| Controllability | Limited | Excellent (CFG, ControlNet, etc.) |
| Sampling speed | Fast (single forward pass) | Slow (many denoising steps) |
| Likelihood | Not available | Available (via variational bound) |
1.3 Historical Context
Timeline of Diffusion Model Breakthroughs
- 2015: Sohl-Dickstein et al. introduce the diffusion framework (deep unsupervised learning using nonequilibrium thermodynamics)
- 2020 (June): DDPM (Ho et al.) makes diffusion practical with simplified training and high-quality image generation
- 2020 (Oct): DDIM (Song et al.) enables faster sampling with fewer steps (deterministic)
- 2021: Guided diffusion (Dhariwal & Nichol) shows diffusion beats GANs on ImageNet; Classifier-Free Guidance invented
- 2021: GLIDE introduces text-to-image diffusion (DALL-E 1, released the same year, used an autoregressive transformer rather than diffusion)
- 2022 (Apr): DALL-E 2 uses CLIP + diffusion prior + diffusion decoder
- 2022 (Aug): Stable Diffusion released open-source (Latent Diffusion Model)
- 2022: Midjourney launches with stunning artistic quality
- 2023: SDXL, DALL-E 3 (integrated with ChatGPT), Midjourney v5
- 2024: Stable Diffusion 3 (flow matching, DiT), Flux and Flux 1.1 (Black Forest Labs), Sora (video diffusion)
2. The Math of Diffusion
Understanding the mathematics of diffusion models is essential for implementing them correctly and debugging issues. We will build up from first principles, explaining every formula with intuition.
2.1 The Forward Process (Adding Noise)
The forward process is a Markov chain that gradually adds Gaussian noise to data over T timesteps. Starting from a clean data point x_0, we define:
Single Forward Step:
q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
At each timestep t, we scale the previous sample by sqrt(1 - beta_t) (making it slightly smaller) and add Gaussian noise with variance beta_t. The noise schedule beta_1, ..., beta_T controls how quickly the signal is destroyed.
In code, a single forward step is:
import torch

def forward_step(x_prev, beta_t):
    """One step of the forward diffusion process (beta_t: scalar tensor)."""
    noise = torch.randn_like(x_prev)
    x_t = torch.sqrt(1 - beta_t) * x_prev + torch.sqrt(beta_t) * noise
    return x_t
The Noise Schedule
The values beta_1, ..., beta_T determine how much noise is added at each step:
import torch
import numpy as np
def linear_noise_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
"""
Linear noise schedule (used in original DDPM).
beta increases linearly from beta_start to beta_end.
"""
return torch.linspace(beta_start, beta_end, T)
def cosine_noise_schedule(T=1000, s=0.008):
"""
Cosine noise schedule (Nichol & Dhariwal, 2021).
Produces a smoother, more gradual noising process.
Better results than linear schedule in practice.
"""
steps = torch.arange(T + 1, dtype=torch.float64)
f_t = torch.cos(((steps / T) + s) / (1 + s) * (np.pi / 2)) ** 2
alphas_cumprod = f_t / f_t[0]
betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
return torch.clamp(betas, 0.0001, 0.999).float()
# Visualize both schedules
betas_linear = linear_noise_schedule()
betas_cosine = cosine_noise_schedule()
print(f"Linear schedule: beta_1={betas_linear[0]:.6f}, beta_T={betas_linear[-1]:.6f}")
print(f"Cosine schedule: beta_1={betas_cosine[0]:.6f}, beta_T={betas_cosine[-1]:.6f}")
The Reparameterization Trick: Jump to Any Timestep
A crucial property: we can compute x_t directly from x_0 without iterating through all intermediate steps. Define:
Key Quantities:
alpha_t = 1 - beta_t
alpha_bar_t = product(alpha_1, alpha_2, ..., alpha_t) [cumulative product]
Direct Sampling at Timestep t:
q(x_t | x_0) = N(x_t; sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)
Reparameterized Form:
x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon, where epsilon ~ N(0, I)
This means x_t is a weighted combination of the original clean image x_0 and random noise epsilon. As t increases, alpha_bar_t decreases (approaching 0), so x_t becomes more noise and less signal. At t=T, x_T is approximately pure noise.
def precompute_schedule(betas):
"""
Precompute all quantities needed for training and sampling.
"""
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
alphas_cumprod_prev = torch.cat([torch.tensor([1.0]), alphas_cumprod[:-1]])
# For forward process q(x_t | x_0)
sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)
# For reverse process p(x_{t-1} | x_t)
sqrt_recip_alphas = torch.sqrt(1.0 / alphas)
# Posterior variance (for reverse process)
posterior_variance = betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod)
return {
"betas": betas,
"alphas": alphas,
"alphas_cumprod": alphas_cumprod,
"sqrt_alphas_cumprod": sqrt_alphas_cumprod,
"sqrt_one_minus_alphas_cumprod": sqrt_one_minus_alphas_cumprod,
"sqrt_recip_alphas": sqrt_recip_alphas,
"posterior_variance": posterior_variance,
}
def forward_diffusion(x_0, t, schedule):
"""
Apply forward diffusion: compute x_t from x_0 directly.
Args:
x_0: clean images [B, C, H, W]
t: timestep indices [B] (integer values 0 to T-1)
schedule: precomputed schedule dict
Returns:
x_t: noisy images [B, C, H, W]
noise: the noise that was added [B, C, H, W]
"""
noise = torch.randn_like(x_0)
# Gather the schedule values for each sample's timestep
sqrt_alpha_bar = schedule["sqrt_alphas_cumprod"][t] # [B]
sqrt_one_minus_alpha_bar = schedule["sqrt_one_minus_alphas_cumprod"][t] # [B]
# Reshape for broadcasting: [B] -> [B, 1, 1, 1]
sqrt_alpha_bar = sqrt_alpha_bar[:, None, None, None]
sqrt_one_minus_alpha_bar = sqrt_one_minus_alpha_bar[:, None, None, None]
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * noise
return x_t, noise
# Example: visualize the forward process
betas = linear_noise_schedule(T=1000)
schedule = precompute_schedule(betas)
# Show how an image gets noisier
import matplotlib.pyplot as plt
from torchvision import datasets, transforms
# Load a sample image
# dataset = datasets.MNIST(root="./data", train=True, download=True,
# transform=transforms.ToTensor())
# x_0 = dataset[0][0].unsqueeze(0) # [1, 1, 28, 28]
# fig, axes = plt.subplots(1, 6, figsize=(15, 3))
# timesteps = [0, 50, 200, 500, 750, 999]
# for ax, t_val in zip(axes, timesteps):
# t = torch.tensor([t_val])
# x_t, _ = forward_diffusion(x_0, t, schedule)
# ax.imshow(x_t[0, 0].numpy(), cmap="gray")
# ax.set_title(f"t = {t_val}")
# ax.axis("off")
# plt.suptitle("Forward Diffusion Process")
# plt.tight_layout()
# plt.show()
2.2 The Reverse Process (Denoising)
The reverse process is what we actually want to learn. Starting from pure noise x_T, we want to iteratively denoise to recover a clean sample x_0.
Reverse Step (parametrized by theta):
p_theta(x_{t-1} | x_t) = N(x_{t-1}; mu_theta(x_t, t), sigma_t^2 * I)
The model learns to predict the mean mu_theta of the reverse distribution. The variance sigma_t^2 is typically fixed to beta_t or the posterior variance.
There are three equivalent ways to parametrize what the model predicts:
Three Prediction Targets
- Predict the noise epsilon (most common): The model epsilon_theta(x_t, t) predicts the noise that was added. The mean is then:
  mu_theta = (1/sqrt(alpha_t)) * (x_t - (beta_t / sqrt(1 - alpha_bar_t)) * epsilon_theta(x_t, t))
- Predict the clean image x_0: The model predicts the denoised image directly. Useful for some applications.
- Predict the velocity v (used in v-prediction): v = sqrt(alpha_bar_t) * epsilon - sqrt(1 - alpha_bar_t) * x_0. Used in newer models for better training dynamics.
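These three targets are interchangeable because each can be recovered from the others via the reparameterized forward process. A small sketch of the conversions (the helper names are ours, not from any library):

```python
import torch

def eps_to_x0(x_t, eps, alpha_bar_t):
    # Invert x_t = sqrt(ab)*x_0 + sqrt(1-ab)*eps to solve for x_0
    return (x_t - torch.sqrt(1 - alpha_bar_t) * eps) / torch.sqrt(alpha_bar_t)

def x0_eps_to_v(x0, eps, alpha_bar_t):
    # v-prediction target: v = sqrt(ab)*eps - sqrt(1-ab)*x_0
    return torch.sqrt(alpha_bar_t) * eps - torch.sqrt(1 - alpha_bar_t) * x0

def v_to_x0(x_t, v, alpha_bar_t):
    # Recover x_0 from a v-prediction: x_0 = sqrt(ab)*x_t - sqrt(1-ab)*v
    return torch.sqrt(alpha_bar_t) * x_t - torch.sqrt(1 - alpha_bar_t) * v
```

Substituting the forward-process definition of x_t into `v_to_x0` and expanding shows the alpha_bar terms cancel exactly, which is why a v-prediction model can still report a denoised image at every step.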
2.3 The Denoising Objective
The training loss is remarkably simple. We train the model to predict the noise that was added:
Simplified Diffusion Loss:
L = E_{t, x_0, epsilon} [ || epsilon - epsilon_theta(x_t, t) ||^2 ]
1. Sample a clean image x_0 from the training data
2. Sample a random timestep t uniformly from {1, ..., T}
3. Sample random noise epsilon ~ N(0, I)
4. Compute x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
5. The model predicts the noise: epsilon_hat = epsilon_theta(x_t, t)
6. Loss = MSE(epsilon, epsilon_hat)
That is it. The model learns to denoise by predicting what noise was added.
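The six steps above map almost line-for-line onto code. A minimal sketch, assuming `model(x_t, t)` predicts noise and `schedule` holds the precomputed square-root terms as in `precompute_schedule`:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x_0, schedule, T=1000):
    """One training step of the simplified DDPM objective."""
    B = x_0.shape[0]
    t = torch.randint(0, T, (B,), device=x_0.device)            # step 2: random timesteps
    eps = torch.randn_like(x_0)                                  # step 3: random noise
    a = schedule["sqrt_alphas_cumprod"][t][:, None, None, None]
    b = schedule["sqrt_one_minus_alphas_cumprod"][t][:, None, None, None]
    x_t = a * x_0 + b * eps                                      # step 4: forward diffusion
    eps_hat = model(x_t, t)                                      # step 5: predict the noise
    return F.mse_loss(eps_hat, eps)                              # step 6: MSE loss
```

Section 3.4 implements exactly this loop with a real U-Net; the sketch here just isolates the objective.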
2.4 DDPM vs DDIM
DDPM (Denoising Diffusion Probabilistic Models)
The original formulation by Ho et al. (2020):
- Stochastic sampling: each reverse step adds noise (controlled randomness)
- Requires many steps (typically T=1000) for good quality
- Each run produces different outputs even with same starting noise
- Sampling formula:
x_{t-1} = mu_theta(x_t, t) + sigma_t * z, where z ~ N(0, I)
DDIM (Denoising Diffusion Implicit Models)
Song et al. (2020) showed that the same trained model can be sampled differently:
- Deterministic sampling: same starting noise always produces the same output
- Can skip steps: sample with 50 or even 20 steps instead of 1000
- Enables interpolation in latent space
- Same model, different sampling procedure (no retraining needed)
- Sampling formula:
x_{t-1} = sqrt(alpha_bar_{t-1}) * predicted_x_0 + sqrt(1 - alpha_bar_{t-1}) * direction_pointing_to_x_t
def ddpm_sample_step(model, x_t, t, schedule):
    """
    One step of DDPM (stochastic) sampling.
    t is a Python int (the same timestep for the whole batch).
    """
betas = schedule["betas"]
sqrt_recip_alphas = schedule["sqrt_recip_alphas"]
sqrt_one_minus_alphas_cumprod = schedule["sqrt_one_minus_alphas_cumprod"]
posterior_variance = schedule["posterior_variance"]
# Predict noise
predicted_noise = model(x_t, t)
# Compute mean
mean = sqrt_recip_alphas[t] * (
x_t - (betas[t] / sqrt_one_minus_alphas_cumprod[t]) * predicted_noise
)
if t > 0:
noise = torch.randn_like(x_t)
x_prev = mean + torch.sqrt(posterior_variance[t]) * noise
else:
x_prev = mean # No noise at final step
return x_prev
def ddim_sample_step(model, x_t, t, t_prev, schedule, eta=0.0):
"""
One step of DDIM sampling.
Args:
eta: controls stochasticity. eta=0 is deterministic, eta=1 is DDPM.
"""
alphas_cumprod = schedule["alphas_cumprod"]
alpha_bar_t = alphas_cumprod[t]
alpha_bar_t_prev = alphas_cumprod[t_prev] if t_prev >= 0 else torch.tensor(1.0)
# Predict noise
predicted_noise = model(x_t, t)
# Predict x_0
predicted_x0 = (x_t - torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_bar_t)
predicted_x0 = torch.clamp(predicted_x0, -1, 1) # Clip for stability
# Compute variance
sigma = eta * torch.sqrt(
(1 - alpha_bar_t_prev) / (1 - alpha_bar_t) * (1 - alpha_bar_t / alpha_bar_t_prev)
)
# Direction pointing to x_t
direction = torch.sqrt(1 - alpha_bar_t_prev - sigma**2) * predicted_noise
# Compute x_{t-1}
x_prev = torch.sqrt(alpha_bar_t_prev) * predicted_x0 + direction
if eta > 0 and t > 0:
noise = torch.randn_like(x_t)
x_prev = x_prev + sigma * noise
return x_prev
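To see how step-skipping works end to end, here is a minimal DDIM loop (deterministic, eta=0) that strides over the full schedule. It inlines the same update as `ddim_sample_step` rather than calling it, so the snippet runs on its own; only `alphas_cumprod` from section 2.1 is assumed:

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, alphas_cumprod, num_steps=50, device="cpu"):
    """Deterministic DDIM sampler over a strided timestep subsequence.
    `model(x_t, t)` is assumed to predict noise."""
    T = len(alphas_cumprod)
    # Evenly spaced subsequence, e.g. 999, 979, ..., 0 for num_steps=50
    timesteps = torch.linspace(T - 1, 0, num_steps).long().tolist()
    x = torch.randn(shape, device=device)  # x_T ~ N(0, I)
    for i, t in enumerate(timesteps):
        t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else -1
        ab_t = alphas_cumprod[t]
        ab_prev = alphas_cumprod[t_prev] if t_prev >= 0 else torch.tensor(1.0)
        t_batch = torch.full((shape[0],), t, dtype=torch.long, device=device)
        eps = model(x, t_batch)
        # Predict x_0, then move to the previous (possibly distant) timestep
        x0_pred = (x - torch.sqrt(1 - ab_t) * eps) / torch.sqrt(ab_t)
        x0_pred = x0_pred.clamp(-1, 1)
        x = torch.sqrt(ab_prev) * x0_pred + torch.sqrt(1 - ab_prev) * eps
    return x
```

Because each update jumps directly between two entries of `alphas_cumprod`, the same trained model works with 1000, 50, or 20 steps; only quality changes.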
3. Diffusion Architecture Components
3.1 U-Net Architecture
The U-Net is the backbone of most diffusion models. Originally designed for biomedical image segmentation (Ronneberger et al., 2015), it was adapted for diffusion by DDPM. Its encoder-decoder structure with skip connections is ideal for the denoising task.
U-Net Structure
Input (x_t + time embedding)
|
[Encoder / Downsampling Path]
|
Conv Block (64) ----skip connection----> Concat
| |
Downsample (stride 2) |
| |
Conv Block (128) ----skip connection----> Concat
| |
Downsample (stride 2) |
| |
Conv Block (256) ----skip connection----> Concat
| |
Downsample (stride 2) |
| |
[Bottleneck] |
Conv Block (512) + Self-Attention |
| |
[Decoder / Upsampling Path] |
| |
Upsample (2x) |
| |
Conv Block (256) <----- skip connection ----+
|
Upsample (2x)
|
Conv Block (128) <----- skip connection ----+
|
Upsample (2x)
|
Conv Block (64) <----- skip connection ----+
|
Output Conv -> Predicted noise (epsilon)
Key Components of the Diffusion U-Net
- Skip connections: Concatenate encoder features with decoder features at matching resolutions. This preserves fine-grained spatial information that would be lost through downsampling.
- Time embedding: The model must know which timestep it is denoising. The timestep t is embedded using sinusoidal position encoding (like in Transformers), then projected through an MLP. This time embedding is added to every residual block.
- Self-attention layers: Added at lower resolutions (e.g., 16x16, 8x8) to capture global dependencies. Not used at full resolution due to quadratic cost.
- Cross-attention layers: For conditional generation (e.g., text-to-image). The text embedding is injected via cross-attention at multiple resolutions.
- Group normalization: Used instead of batch norm for stability with small batch sizes.
3.2 Time Embedding
import torch
import torch.nn as nn
import math
class SinusoidalTimeEmbedding(nn.Module):
"""
Sinusoidal time step embedding, similar to positional encoding in Transformers.
Maps integer timestep t to a high-dimensional vector.
"""
def __init__(self, dim):
super().__init__()
self.dim = dim
def forward(self, t):
"""
Args:
t: [B] tensor of integer timesteps
Returns:
[B, dim] time embeddings
"""
device = t.device
half_dim = self.dim // 2
embeddings = math.log(10000) / (half_dim - 1)
embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
embeddings = t[:, None].float() * embeddings[None, :]
embeddings = torch.cat([torch.sin(embeddings), torch.cos(embeddings)], dim=-1)
return embeddings
class TimeMLPEmbedding(nn.Module):
"""
Full time embedding: sinusoidal encoding -> MLP projection.
"""
def __init__(self, time_dim, embed_dim):
super().__init__()
self.sinusoidal = SinusoidalTimeEmbedding(time_dim)
self.mlp = nn.Sequential(
nn.Linear(time_dim, embed_dim),
nn.SiLU(),
nn.Linear(embed_dim, embed_dim),
)
def forward(self, t):
return self.mlp(self.sinusoidal(t))
3.3 Cross-Attention for Conditioning
For text-to-image generation, the text prompt needs to influence the denoising process. This is done via cross-attention, where the image features are queries and the text features are keys and values:
class CrossAttention(nn.Module):
"""
Cross-attention layer for text conditioning.
Q = image features (what we're generating)
K, V = text features (what we're conditioning on)
"""
def __init__(self, query_dim, context_dim, num_heads=8, head_dim=64):
super().__init__()
inner_dim = num_heads * head_dim
self.num_heads = num_heads
self.head_dim = head_dim
self.scale = head_dim ** -0.5
self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
self.to_out = nn.Linear(inner_dim, query_dim)
def forward(self, x, context):
"""
Args:
x: image features [B, HW, query_dim]
context: text features [B, seq_len, context_dim]
Returns:
conditioned features [B, HW, query_dim]
"""
B, N, _ = x.shape
h = self.num_heads
q = self.to_q(x).reshape(B, N, h, self.head_dim).permute(0, 2, 1, 3)
k = self.to_k(context).reshape(B, -1, h, self.head_dim).permute(0, 2, 1, 3)
v = self.to_v(context).reshape(B, -1, h, self.head_dim).permute(0, 2, 1, 3)
attn = (q @ k.transpose(-2, -1)) * self.scale
attn = attn.softmax(dim=-1)
out = (attn @ v).permute(0, 2, 1, 3).reshape(B, N, -1)
return self.to_out(out)
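As a sanity check on the shapes, the same computation can be sketched with PyTorch's built-in `nn.MultiheadAttention`, which supports separate key/value dimensions via `kdim`/`vdim` (equivalent in structure to the module above, though its projection layout differs):

```python
import torch
import torch.nn as nn

# Image tokens attend to text tokens: Q from the image, K/V from the text.
B, HW, seq_len = 2, 64, 77          # batch, flattened 8x8 feature map, CLIP tokens
query_dim, context_dim = 320, 768   # typical SD 1.x feature / text-embedding dims

attn = nn.MultiheadAttention(embed_dim=query_dim, num_heads=8,
                             kdim=context_dim, vdim=context_dim,
                             batch_first=True)
img_tokens = torch.randn(B, HW, query_dim)
text_tokens = torch.randn(B, seq_len, context_dim)
out, _ = attn(query=img_tokens, key=text_tokens, value=text_tokens)
print(out.shape)  # torch.Size([2, 64, 320])
```

The output has the image token shape, not the text shape: cross-attention reweights image features using text, it does not change their geometry.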
3.4 PRACTICAL: Implement a Simple Diffusion Model in PyTorch
Let us build a complete, working diffusion model from scratch. We will train it on MNIST to generate handwritten digits.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.utils import save_image, make_grid
import os
# ====================
# Hyperparameters
# ====================
T = 1000 # Number of diffusion timesteps
IMG_SIZE = 28
IMG_CHANNELS = 1
BATCH_SIZE = 128
EPOCHS = 30
LR = 2e-4
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# ====================
# Noise Schedule
# ====================
def get_schedule(T):
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
return {
"betas": betas.to(DEVICE),
"alphas": alphas.to(DEVICE),
"alphas_cumprod": alphas_cumprod.to(DEVICE),
"sqrt_alphas_cumprod": torch.sqrt(alphas_cumprod).to(DEVICE),
"sqrt_one_minus_alphas_cumprod": torch.sqrt(1 - alphas_cumprod).to(DEVICE),
"sqrt_recip_alphas": torch.sqrt(1.0 / alphas).to(DEVICE),
"posterior_variance": (betas * (1 - torch.cat([torch.tensor([1.0]), alphas_cumprod[:-1]])) / (1 - alphas_cumprod)).to(DEVICE),
}
schedule = get_schedule(T)
# ====================
# Simple U-Net
# ====================
class ResBlock(nn.Module):
"""Residual block with time embedding."""
def __init__(self, in_ch, out_ch, time_dim):
super().__init__()
self.conv1 = nn.Sequential(
nn.GroupNorm(8, in_ch),
nn.SiLU(),
nn.Conv2d(in_ch, out_ch, 3, padding=1),
)
self.time_proj = nn.Sequential(
nn.SiLU(),
nn.Linear(time_dim, out_ch),
)
self.conv2 = nn.Sequential(
nn.GroupNorm(8, out_ch),
nn.SiLU(),
nn.Conv2d(out_ch, out_ch, 3, padding=1),
)
self.shortcut = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
def forward(self, x, t_emb):
h = self.conv1(x)
h = h + self.time_proj(t_emb)[:, :, None, None]
h = self.conv2(h)
return h + self.shortcut(x)
class SimpleUNet(nn.Module):
"""
A simplified U-Net for diffusion on 28x28 images.
"""
def __init__(self, in_channels=1, base_channels=64, time_dim=256):
super().__init__()
# Time embedding
self.time_embed = nn.Sequential(
SinusoidalTimeEmbedding(time_dim),
nn.Linear(time_dim, time_dim),
nn.SiLU(),
nn.Linear(time_dim, time_dim),
)
# Encoder (downsampling)
self.enc1 = ResBlock(in_channels, base_channels, time_dim) # 28x28
self.down1 = nn.Conv2d(base_channels, base_channels, 3, stride=2, padding=1) # 14x14
self.enc2 = ResBlock(base_channels, base_channels * 2, time_dim) # 14x14
self.down2 = nn.Conv2d(base_channels * 2, base_channels * 2, 3, stride=2, padding=1) # 7x7
# Bottleneck
self.bottleneck = ResBlock(base_channels * 2, base_channels * 2, time_dim) # 7x7
# Decoder (upsampling)
self.up2 = nn.ConvTranspose2d(base_channels * 2, base_channels * 2, 2, stride=2) # 14x14
self.dec2 = ResBlock(base_channels * 4, base_channels, time_dim) # concat with skip
self.up1 = nn.ConvTranspose2d(base_channels, base_channels, 2, stride=2) # 28x28
self.dec1 = ResBlock(base_channels * 2, base_channels, time_dim) # concat with skip
# Output
self.out = nn.Sequential(
nn.GroupNorm(8, base_channels),
nn.SiLU(),
nn.Conv2d(base_channels, in_channels, 1),
)
def forward(self, x, t):
"""
Args:
x: noisy image [B, C, H, W]
t: timestep [B] (integer)
Returns:
predicted noise [B, C, H, W]
"""
t_emb = self.time_embed(t)
# Encoder
e1 = self.enc1(x, t_emb) # [B, 64, 28, 28]
e2 = self.enc2(self.down1(e1), t_emb) # [B, 128, 14, 14]
# Bottleneck
b = self.bottleneck(self.down2(e2), t_emb) # [B, 128, 7, 7]
# Decoder with skip connections
d2 = self.up2(b) # [B, 128, 14, 14]
d2 = self.dec2(torch.cat([d2, e2], dim=1), t_emb) # [B, 64, 14, 14]
d1 = self.up1(d2) # [B, 64, 28, 28]
d1 = self.dec1(torch.cat([d1, e1], dim=1), t_emb) # [B, 64, 28, 28]
return self.out(d1)
# ====================
# Training
# ====================
def train_diffusion():
# Data
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize([0.5], [0.5]), # Scale to [-1, 1]
])
dataset = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
# Model
model = SimpleUNet(in_channels=IMG_CHANNELS, base_channels=64).to(DEVICE)
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
os.makedirs("diffusion_outputs", exist_ok=True)
for epoch in range(EPOCHS):
total_loss = 0
count = 0
for images, _ in dataloader:
images = images.to(DEVICE) # [B, 1, 28, 28], range [-1, 1]
B = images.shape[0]
# Sample random timesteps for each image
t = torch.randint(0, T, (B,), device=DEVICE)
# Sample noise
noise = torch.randn_like(images)
# Forward diffusion: create noisy images
sqrt_alpha_bar = schedule["sqrt_alphas_cumprod"][t][:, None, None, None]
sqrt_one_minus_alpha_bar = schedule["sqrt_one_minus_alphas_cumprod"][t][:, None, None, None]
x_t = sqrt_alpha_bar * images + sqrt_one_minus_alpha_bar * noise
# Predict the noise
predicted_noise = model(x_t, t)
# Loss: MSE between actual noise and predicted noise
loss = F.mse_loss(predicted_noise, noise)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
total_loss += loss.item() * B
count += B
avg_loss = total_loss / count
print(f"Epoch {epoch+1}/{EPOCHS} Loss: {avg_loss:.6f}")
# Generate samples every 5 epochs
if (epoch + 1) % 5 == 0:
samples = sample_images(model, n_samples=64)
save_image(samples, f"diffusion_outputs/epoch_{epoch+1:03d}.png",
nrow=8, normalize=True, value_range=(-1, 1))
return model
# ====================
# Sampling (DDPM)
# ====================
@torch.no_grad()
def sample_images(model, n_samples=64):
"""
Generate images using DDPM sampling.
Start from pure noise and iteratively denoise.
"""
model.eval()
# Start from pure Gaussian noise
x = torch.randn(n_samples, IMG_CHANNELS, IMG_SIZE, IMG_SIZE, device=DEVICE)
# Reverse diffusion: from t=T-1 down to t=0
for t_val in reversed(range(T)):
t = torch.full((n_samples,), t_val, device=DEVICE, dtype=torch.long)
# Predict noise
predicted_noise = model(x, t)
# Compute mean of p(x_{t-1} | x_t)
alpha_t = schedule["alphas"][t_val]
alpha_bar_t = schedule["alphas_cumprod"][t_val]
beta_t = schedule["betas"][t_val]
mean = (1 / torch.sqrt(alpha_t)) * (
x - (beta_t / torch.sqrt(1 - alpha_bar_t)) * predicted_noise
)
if t_val > 0:
noise = torch.randn_like(x)
variance = schedule["posterior_variance"][t_val]
x = mean + torch.sqrt(variance) * noise
else:
x = mean
model.train()
return x
# Train the model
# model = train_diffusion()
# Final generation
# samples = sample_images(model, n_samples=64)
# save_image(samples, "diffusion_outputs/final_samples.png",
# nrow=8, normalize=True, value_range=(-1, 1))
4. Text-to-Image: Stable Diffusion Architecture
4.1 Latent Diffusion Models
The key innovation of Stable Diffusion (Rombach et al., 2022) is performing diffusion in a compressed latent space rather than pixel space. This dramatically reduces computational cost.
Why Latent Space?
A 512x512x3 image has 786,432 dimensions. Running diffusion in this space is extremely expensive. Instead:
- Train a VAE to compress images into a smaller latent representation (e.g., 64x64x4 = 16,384 dimensions -- a 48x reduction)
- Run the entire diffusion process in this latent space
- Decode the final latent back to pixel space using the VAE decoder
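The compression arithmetic behind those numbers:

```python
# Pixel-space vs latent-space dimensionality in Stable Diffusion 1.x
pixel_dims = 512 * 512 * 3    # RGB image: 786,432 values
latent_dims = 64 * 64 * 4     # VAE latent: 16,384 values
reduction = pixel_dims / latent_dims
print(reduction)  # 48.0
```

Since attention cost scales quadratically with the number of spatial tokens, shrinking each spatial side by 8x makes every U-Net attention layer dramatically cheaper, not just 48x cheaper overall.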
4.2 The Full Stable Diffusion Pipeline
Complete Pipeline
Text Prompt: "a photo of an astronaut riding a horse on Mars"
|
v
+------------------+
| CLIP Text | Tokenize text, produce text embeddings
| Encoder | Output: [B, 77, 768] (77 tokens, 768-dim)
+------------------+
|
| (text embeddings fed via cross-attention)
|
v
+------------------+
| U-Net | Iterative denoising in latent space
| (in latent | Input: noisy latent z_t + timestep t + text
| space) | Output: predicted noise epsilon
+------------------+
|
| (after T denoising steps)
|
v
Clean Latent z_0
|
v
+------------------+
| VAE Decoder | Decode latent to pixel space
| | Input: [B, 4, 64, 64]
| | Output: [B, 3, 512, 512]
+------------------+
|
v
Generated Image (512 x 512 x 3)
Component Details
- VAE (Variational Autoencoder): Trained separately. Encoder compresses 512x512x3 to 64x64x4 (8x spatial downsampling). Decoder reverses this. The latent space is regularized to be approximately Gaussian.
- CLIP Text Encoder: The text encoder from OpenAI's CLIP model. Converts text prompts into dense embeddings that the U-Net conditions on via cross-attention. SD 1.x uses CLIP ViT-L/14 (768-dim). SDXL uses two text encoders.
- U-Net: The core diffusion model. Contains ~860M parameters in SD 1.5. ResNet blocks with self-attention and cross-attention at multiple resolutions.
- Scheduler/Sampler: Controls the denoising process. Different schedulers trade off speed vs quality (Euler, DPM++, PNDM, etc.).
4.3 Classifier-Free Guidance (CFG)
CFG is the technique that makes text-to-image generation work well in practice. Without it, generated images have low text-image alignment.
Classifier-Free Guidance:
epsilon_guided = epsilon_uncond + w * (epsilon_cond - epsilon_uncond)
During training, the model randomly drops the text condition (replaces with empty text) some percentage of the time. At inference:
1. Run the model twice: once with text (epsilon_cond) and once without text (epsilon_uncond)
2. Amplify the difference: move AWAY from the unconditional prediction toward the conditional one
3. The guidance scale w controls the strength. w=1 disables guidance (pure conditional prediction), w=7.5 is a typical default, w=15+ is very strong guidance
Higher w = more faithful to the prompt but less diversity and eventually artifacts.
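The guidance formula itself is a one-liner; a sketch (the function name is ours):

```python
import torch

def classifier_free_guidance(eps_cond, eps_uncond, w):
    """Combine conditional and unconditional noise predictions.
    w=0 returns the unconditional prediction, w=1 the conditional one;
    larger w extrapolates further away from the unconditional output."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# In practice the two predictions come from one batched forward pass:
# the U-Net is run on cat([x_t, x_t]) with [empty-prompt, prompt] embeddings,
# and the output is chunked into eps_uncond and eps_cond.
```

This doubles the compute per denoising step, which is why CFG roughly halves sampling throughput relative to unguided generation.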
4.4 PRACTICAL: Use Stable Diffusion with the Diffusers Library
from diffusers import (
StableDiffusionPipeline,
StableDiffusionXLPipeline,
DPMSolverMultistepScheduler,
EulerDiscreteScheduler,
EulerAncestralDiscreteScheduler,
)
import torch
# ====================
# Basic Stable Diffusion 1.5
# ====================
def generate_sd15(
prompt: str,
negative_prompt: str = "",
num_images: int = 1,
guidance_scale: float = 7.5,
num_inference_steps: int = 50,
seed: int = None,
):
"""Generate images with Stable Diffusion 1.5."""
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
safety_checker=None,
).to("cuda")
# Use DPM++ 2M scheduler for faster, higher-quality generation
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    generator = torch.Generator("cuda").manual_seed(seed) if seed is not None else None
images = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
num_images_per_prompt=num_images,
guidance_scale=guidance_scale,
num_inference_steps=num_inference_steps,
generator=generator,
).images
return images
# ====================
# Stable Diffusion XL
# ====================
def generate_sdxl(
prompt: str,
negative_prompt: str = "",
guidance_scale: float = 7.0,
num_inference_steps: int = 30,
seed: int = None,
):
"""Generate images with SDXL (1024x1024)."""
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
variant="fp16",
).to("cuda")
    generator = torch.Generator("cuda").manual_seed(seed) if seed is not None else None
image = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
guidance_scale=guidance_scale,
num_inference_steps=num_inference_steps,
generator=generator,
).images[0]
return image
# ====================
# Experiment with Different Schedulers
# ====================
def compare_schedulers(prompt: str, seed: int = 42):
"""
Compare different schedulers/samplers for the same prompt.
Different schedulers produce different results and have different speed/quality tradeoffs.
"""
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
safety_checker=None,
).to("cuda")
schedulers = {
"Euler": EulerDiscreteScheduler,
"Euler Ancestral": EulerAncestralDiscreteScheduler,
"DPM++ 2M": DPMSolverMultistepScheduler,
}
results = {}
for name, SchedulerClass in schedulers.items():
pipe.scheduler = SchedulerClass.from_config(pipe.scheduler.config)
generator = torch.Generator("cuda").manual_seed(seed)
image = pipe(
prompt=prompt,
num_inference_steps=30,
guidance_scale=7.5,
generator=generator,
).images[0]
results[name] = image
print(f"Generated with {name} scheduler")
return results
# ====================
# Experiment with CFG Scale
# ====================
def compare_cfg_scales(prompt: str, seed: int = 42):
"""
Show the effect of different classifier-free guidance scales.
"""
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
safety_checker=None,
).to("cuda")
cfg_scales = [1.0, 3.0, 5.0, 7.5, 10.0, 15.0, 20.0]
results = {}
for cfg in cfg_scales:
generator = torch.Generator("cuda").manual_seed(seed)
image = pipe(
prompt=prompt,
guidance_scale=cfg,
num_inference_steps=30,
generator=generator,
).images[0]
results[cfg] = image
print(f"CFG Scale {cfg}: generated")
return results
# Example usage:
# images = generate_sd15(
# prompt="a beautiful landscape painting of mountains at sunset, oil on canvas",
# negative_prompt="blurry, bad quality, distorted",
# guidance_scale=7.5,
# num_inference_steps=30,
# seed=42,
# )
# images[0].save("landscape.png")
5. Advanced Diffusion Techniques
5.1 ControlNet
ControlNet (Zhang et al., 2023) adds spatial conditioning to diffusion models. It lets you control the generation with edge maps, depth maps, pose skeletons, segmentation maps, and more.
How ControlNet Works
Control Image (e.g., Canny edges)
|
v
+-----------------------+
| ControlNet |
| (copy of U-Net |
| encoder + middle) |
+-----------------------+
|
| (zero-conv outputs added to main U-Net)
|
v
+-----------------------+
| Main U-Net | + Text conditioning via cross-attention
| (frozen weights) |
+-----------------------+
|
v
Generated Image (follows the control structure)
ControlNet creates a trainable copy of the U-Net's encoder blocks. The outputs are connected to the main (frozen) U-Net through "zero convolutions" -- 1x1 convs initialized with zero weights, so the ControlNet has no effect at the start of training and gradually learns to influence the generation.
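The zero-convolution trick can be sketched in a few lines of PyTorch. This is a hypothetical minimal version for illustration, not the actual ControlNet code; the class name `ZeroConv2d` is invented here:

```python
import torch
import torch.nn as nn

class ZeroConv2d(nn.Conv2d):
    """A 1x1 convolution whose weight and bias start at zero, so its
    output is exactly zero until training updates the parameters."""

    def __init__(self, channels: int):
        super().__init__(channels, channels, kernel_size=1)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)

# At initialization, adding the ControlNet branch changes nothing:
zero_conv = ZeroConv2d(channels=8)
control_features = torch.randn(1, 8, 16, 16)
unet_features = torch.randn(1, 8, 16, 16)
merged = unet_features + zero_conv(control_features)
assert torch.equal(merged, unet_features)  # identical at init
```

Because the added branch contributes exactly zero at the start, training begins from the frozen model's behavior and only gradually learns to inject control signal.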
from diffusers import (
StableDiffusionControlNetPipeline,
ControlNetModel,
UniPCMultistepScheduler,
)
from diffusers.utils import load_image
import cv2
import numpy as np
from PIL import Image
import torch
def generate_with_controlnet_canny(
prompt: str,
image_path: str,
negative_prompt: str = "",
guidance_scale: float = 7.5,
controlnet_conditioning_scale: float = 1.0,
seed: int = None,
):
"""
Generate an image conditioned on Canny edge detection of an input image.
"""
# Load ControlNet model for Canny edges
controlnet = ControlNetModel.from_pretrained(
"lllyasviel/sd-controlnet-canny",
torch_dtype=torch.float16,
)
# Create pipeline with ControlNet
pipe = StableDiffusionControlNetPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
controlnet=controlnet,
torch_dtype=torch.float16,
safety_checker=None,
).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
# Prepare Canny edge map from input image
image = cv2.imread(image_path)
image = cv2.resize(image, (512, 512))
edges = cv2.Canny(image, 100, 200)
edges = np.stack([edges] * 3, axis=-1) # Convert to 3-channel
control_image = Image.fromarray(edges)
    generator = torch.Generator("cuda").manual_seed(seed) if seed is not None else None
output = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
image=control_image,
guidance_scale=guidance_scale,
controlnet_conditioning_scale=controlnet_conditioning_scale,
num_inference_steps=30,
generator=generator,
).images[0]
return output, control_image
# Usage:
# output, edges = generate_with_controlnet_canny(
# prompt="a beautiful house in the forest, professional photo",
# image_path="input_photo.jpg",
# seed=42,
# )
# output.save("controlnet_output.png")
# edges.save("canny_edges.png")
5.2 LoRA for Stable Diffusion
LoRA (Low-Rank Adaptation) enables fine-tuning diffusion models on custom styles, characters, or concepts with minimal training data and compute.
How LoRA Works for Diffusion
Instead of fine-tuning all ~860M parameters of the U-Net, LoRA:
- Freezes the original model weights W
- Adds small trainable matrices: W' = W + (alpha / r) * B * A, where A is [r, d] (down-projection) and B is [d, r] (up-projection), with rank r much less than d (typically r=4 to 32)
- Only trains A and B matrices (a few MB instead of GB)
- Applied to attention layers (Q, K, V projections) in the U-Net
Benefits: small file size (2-100 MB vs 2-7 GB), fast training (minutes to hours), can be composed (apply multiple LoRAs at once).
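The low-rank update for a single linear layer can be sketched like this. It is a toy illustration of the idea, not the diffusers implementation; the class name `LoRALinear` is invented here:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update (alpha/r) * B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the original weights
        d_out, d_in = base.weight.shape
        # A projects down to rank r; B projects back up and starts at zero,
        # so the layer behaves exactly like the base layer at init.
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64), rank=4)
x = torch.randn(2, 64)
# With B = 0, the output matches the frozen base layer exactly
assert torch.allclose(layer(x), layer.base(x))
```

Only A and B are trained, which is why a LoRA checkpoint is a few megabytes while the base model stays untouched.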
# Training a LoRA with the diffusers library
# This uses the official training script
# Step 1: Prepare your dataset
# Create a folder with images and a metadata.jsonl file:
# {"file_name": "image1.jpg", "text": "a painting in the style of sks artist"}
# {"file_name": "image2.jpg", "text": "a landscape in the style of sks artist"}
# Step 2: Run the LoRA training script
"""
accelerate launch diffusers/examples/text_to_image/train_text_to_image_lora.py \
--pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
--dataset_name="path/to/your/dataset" \
--caption_column="text" \
--resolution=512 \
--train_batch_size=1 \
--num_train_epochs=100 \
--learning_rate=1e-4 \
--lr_scheduler="cosine" \
--lr_warmup_steps=0 \
--rank=4 \
--output_dir="./my-lora" \
--validation_prompt="a cat in the style of sks artist" \
--seed=42
"""
# Step 3: Use the trained LoRA
from diffusers import StableDiffusionPipeline
import torch
def generate_with_lora(prompt, lora_path, lora_scale=1.0, seed=42):
"""Generate images using a trained LoRA."""
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
).to("cuda")
# Load LoRA weights
pipe.load_lora_weights(lora_path)
generator = torch.Generator("cuda").manual_seed(seed)
image = pipe(
prompt=prompt,
num_inference_steps=30,
guidance_scale=7.5,
cross_attention_kwargs={"scale": lora_scale},
generator=generator,
).images[0]
return image
# Usage:
# image = generate_with_lora(
# "a mountain landscape in the style of sks artist",
# lora_path="./my-lora",
# lora_scale=0.8,
# )
# image.save("lora_output.png")
5.3 Inpainting and Image-to-Image
from diffusers import (
StableDiffusionInpaintPipeline,
StableDiffusionImg2ImgPipeline,
AutoPipelineForInpainting,
)
from PIL import Image
import torch
def inpaint_image(
prompt: str,
image_path: str,
mask_path: str,
guidance_scale: float = 7.5,
strength: float = 0.75,
seed: int = None,
):
"""
Inpaint a region of an image based on a text prompt.
Args:
prompt: Description of what to generate in the masked region
image_path: Original image path
mask_path: Binary mask (white = inpaint, black = keep)
strength: How much to change the masked region (0-1)
"""
pipe = AutoPipelineForInpainting.from_pretrained(
"runwayml/stable-diffusion-inpainting",
torch_dtype=torch.float16,
).to("cuda")
image = Image.open(image_path).convert("RGB").resize((512, 512))
mask = Image.open(mask_path).convert("L").resize((512, 512))
    generator = torch.Generator("cuda").manual_seed(seed) if seed is not None else None
    output = pipe(
        prompt=prompt,
        image=image,
        mask_image=mask,
        strength=strength,
        guidance_scale=guidance_scale,
        num_inference_steps=30,
        generator=generator,
    ).images[0]
return output
def image_to_image(
prompt: str,
image_path: str,
strength: float = 0.75,
guidance_scale: float = 7.5,
seed: int = None,
):
"""
Transform an existing image based on a text prompt.
Args:
prompt: Description of the desired output
image_path: Input image path
strength: How much to change (0 = almost identical, 1 = completely new)
"""
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
safety_checker=None,
).to("cuda")
image = Image.open(image_path).convert("RGB").resize((512, 512))
    generator = torch.Generator("cuda").manual_seed(seed) if seed is not None else None
output = pipe(
prompt=prompt,
image=image,
strength=strength,
guidance_scale=guidance_scale,
num_inference_steps=30,
generator=generator,
).images[0]
return output
5.4 Modern Architectures: SD3, Flux, Flow Matching
Flow Matching (SD3, Flux)
Flow matching is a continuous-time generalization of diffusion that has become the dominant approach in 2024-2026:
- Continuous time: Instead of discrete timesteps t in {0, 1, ..., T}, use continuous t in [0, 1]
- Straight-line interpolation: The forward path is x_t = (1-t) * x_0 + t * epsilon (simple linear interpolation between data and noise)
- Velocity prediction: The model predicts the velocity v_t = dx_t/dt instead of noise
- ODE solver for sampling: Sampling is solving an ODE, enabling adaptive step sizes
- Simpler math: No need for beta schedules, reparameterization tricks, or complex variance formulas
Diffusion Transformer (DiT)
DiT (Peebles & Xie, 2023) replaces the U-Net with a Transformer:
- Input: noisy latent patches (just like ViT patches)
- Conditioning: timestep and class/text embeddings via adaptive layer norm (adaLN-Zero)
- Architecture: standard Transformer blocks with self-attention
- Scales better than U-Net with more compute
- Used in: Sora, Stable Diffusion 3, Flux
Flux Architecture (Black Forest Labs, 2024-2025)
Flux is a successor to Stable Diffusion, created by researchers behind the original SD:
- Flow matching based (continuous-time formulation)
- DiT architecture (no U-Net)
- MMDiT: multimodal DiT where text and image tokens attend to each other jointly
- Rotary position embeddings (RoPE) for resolution flexibility
- Available in multiple sizes: Flux.1 [dev], Flux.1 [schnell] (fast), Flux.1 [pro]
# Using Flux with diffusers
from diffusers import FluxPipeline
import torch
def generate_with_flux(
prompt: str,
num_inference_steps: int = 50,
guidance_scale: float = 3.5,
seed: int = None,
):
"""Generate images with Flux.1 [dev]."""
pipe = FluxPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev",
torch_dtype=torch.bfloat16,
).to("cuda")
    generator = torch.Generator("cuda").manual_seed(seed) if seed is not None else None
image = pipe(
prompt=prompt,
guidance_scale=guidance_scale,
num_inference_steps=num_inference_steps,
generator=generator,
height=1024,
width=1024,
).images[0]
return image
# Usage:
# image = generate_with_flux(
# "a photorealistic portrait of a wise old wizard in a library",
# seed=42,
# )
# image.save("flux_output.png")
6. Video Generation with Diffusion
6.1 Extending Diffusion to Video
Video generation extends image diffusion by adding the temporal dimension. The key challenges are:
- Temporal consistency: Objects must look the same across frames. No flickering or sudden changes.
- Motion quality: Movement must be smooth and physically plausible.
- Computational cost: A 10-second 24fps video has 240 frames, each a full image.
- Long-range coherence: The beginning and end of a video must be consistent.
6.2 Sora and the DiT Architecture for Video
Sora's Key Concepts
- Spacetime patches: Instead of 2D image patches, Sora uses 3D patches that span space and time. A spacetime patch might be 2 frames x 16x16 pixels.
- Variable duration and resolution: Unlike fixed-size models, Sora can generate videos at various resolutions and lengths by varying the number of spacetime patches.
- DiT backbone: Uses Diffusion Transformer instead of U-Net, with 3D attention (attend across space and time).
- Scaling: Quality improves consistently with more compute, following scaling laws.
- Emergent properties: With enough scale, the model develops understanding of 3D consistency, object permanence, and basic physics.
6.3 PRACTICAL: Generate Video with Open-Source Models
# Video generation with CogVideoX (open-source)
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
import torch
def generate_video_cogvideox(
prompt: str,
num_frames: int = 49,
num_inference_steps: int = 50,
guidance_scale: float = 6.0,
seed: int = None,
):
"""
Generate a video using CogVideoX (open-source video diffusion model).
"""
    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX-5b",
        torch_dtype=torch.bfloat16,
    )
    # Offload submodules to CPU when idle; this manages device placement
    # itself, so do not also call .to("cuda")
    pipe.enable_model_cpu_offload()
    generator = torch.Generator("cuda").manual_seed(seed) if seed is not None else None
video_frames = pipe(
prompt=prompt,
num_frames=num_frames,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
generator=generator,
).frames[0]
return video_frames
# Usage:
# frames = generate_video_cogvideox(
# "A golden retriever running through a field of sunflowers",
# seed=42,
# )
# export_to_video(frames, "output_video.mp4", fps=8)
# ====================
# Video generation with Wan 2.1 (Alibaba, open-source)
# ====================
from diffusers import WanPipeline
def generate_video_wan(
prompt: str,
num_frames: int = 81,
guidance_scale: float = 5.0,
seed: int = None,
):
"""Generate video with Wan 2.1 (open-source, competitive quality)."""
pipe = WanPipeline.from_pretrained(
"Wan-AI/Wan2.1-T2V-14B",
torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
    generator = torch.Generator("cuda").manual_seed(seed) if seed is not None else None
output = pipe(
prompt=prompt,
num_frames=num_frames,
guidance_scale=guidance_scale,
generator=generator,
)
return output.frames[0]
# frames = generate_video_wan("A cat sitting on a windowsill watching rain", seed=42)
# export_to_video(frames, "wan_video.mp4", fps=16)
7. Diffusion Models vs LLMs for Image Generation
| Aspect | Diffusion Models | Autoregressive (LLM-style) |
|---|---|---|
| Generation process | Parallel denoising (all pixels at once) | Sequential token prediction |
| Architecture | U-Net or DiT | Transformer decoder |
| Speed | Multiple forward passes (10-50 steps) | Many sequential tokens |
| Quality | Excellent for images | Improving rapidly |
| Text-image integration | Separate encoders + cross-attention | Native (same vocabulary) |
| Editing | Inpainting, img2img | Natural with token manipulation |
| Examples | SD, DALL-E 3, Flux, Midjourney | Parti, Chameleon, Transfusion |
Hybrid Approaches (2025-2026)
The field is converging toward hybrid architectures:
- Transfusion: A single model that does both autoregressive text and diffusion-based images in a shared architecture
- Chameleon (Meta): Early-fusion multimodal model that tokenizes images and generates them autoregressively alongside text
- Native multimodal models: GPT-4o and Gemini 2.0 can both understand and generate images natively
- The trend: Moving toward unified models that handle all modalities in a single framework
Summary and Key Takeaways
What We Covered This Week
- Diffusion fundamentals: Forward (add noise) and reverse (denoise) processes, with a simple MSE loss on noise prediction.
- The math: Noise schedules, reparameterization trick, DDPM vs DDIM sampling.
- U-Net architecture: Encoder-decoder with skip connections, time embeddings, and cross-attention for conditioning.
- Stable Diffusion: Latent diffusion (VAE + CLIP + U-Net), classifier-free guidance, different schedulers.
- Advanced techniques: ControlNet for spatial conditioning, LoRA for style fine-tuning, inpainting, img2img.
- Modern architectures: DiT (Diffusion Transformer), Flow Matching (Flux, SD3), continuous-time formulations.
- Video generation: Spacetime patches, temporal consistency, Sora/CogVideoX/Wan architectures.
Preparation for Next Week
In Week 15: Capstone Project, you will apply everything you have learned across all 14 weeks to build a comprehensive AI engineering project. Review the project ideas and start thinking about which one excites you most.
Exercises
Exercise 1: Diffusion from Scratch
Train the SimpleUNet diffusion model on MNIST. Then modify it to use a cosine noise schedule instead of linear. Compare the quality of generated samples.
Exercise 2: DDIM Sampling
Implement DDIM sampling for your trained model. Compare samples generated with 1000 DDPM steps vs 50 DDIM steps. Measure FID if possible.
Exercise 3: CFG Exploration
Using the diffusers library, generate the same prompt with CFG scales from 1 to 20. Create a grid showing how guidance scale affects quality, diversity, and artifacts.
Exercise 4: ControlNet Application
Build a small web app (Gradio/Streamlit) that lets users upload an image, automatically extract edges/depth, and generate a new image in a specified style using ControlNet.
Exercise 5: LoRA Training
Collect 10-20 images in a specific art style. Train a LoRA on Stable Diffusion to capture that style. Generate images and evaluate how well the style transfers.