Week 14 of 16

Diffusion-Based Models

The Mathematics, Architecture, and Practice of Diffusion Models -- from DDPM to Stable Diffusion, ControlNet, and Video Generation

Advanced -- Estimated: 18-25 hours

Learning Objectives

Understand the Math

Derive and understand the forward/reverse diffusion processes, noise schedules, and the denoising loss function.

Implement from Scratch

Build a working diffusion model in PyTorch: noise schedule, U-Net, training loop, and sampling.

Master Stable Diffusion

Understand the full Stable Diffusion pipeline: VAE, CLIP, U-Net, schedulers, and CFG.

Apply Advanced Techniques

Use ControlNet, LoRA, inpainting, and understand the latest architectures (Flux, DiT).

1. Introduction to Diffusion Models

1.1 The Core Idea

Diffusion models are a class of generative models that learn to generate data by learning to reverse a gradual noising process. The intuition is beautifully simple:

The Two Processes


  FORWARD PROCESS (Fixed, no learning):
  Gradually add Gaussian noise to data over T timesteps until it becomes pure noise.

  Clean Image (x_0) --> Slightly Noisy (x_1) --> ... --> More Noisy (x_t) --> ... --> Pure Noise (x_T)

  Each step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * epsilon
             where epsilon ~ N(0, I)

  REVERSE PROCESS (Learned by a neural network):
  Starting from pure noise, gradually denoise to recover a clean image.

  Pure Noise (x_T) --> Less Noisy (x_{T-1}) --> ... --> Cleaner (x_t) --> ... --> Clean Image (x_0)

  Each step: x_{t-1} = f_theta(x_t, t)  [neural network predicts how to denoise]
                        

The key insight is that while the forward process is trivial (just add noise), the reverse process requires learning a complex function. If we can train a neural network to accurately reverse one small noise step, we can chain many such steps together to generate realistic data from pure noise.

[Figure: Diffusion Forward Process -- Image to Noise. Clean image x_0 becomes slightly noisy x_1, then noisier x_2, ..., x_t, ..., ending at pure noise x_T ~ N(0, I).]

[Figure: Diffusion Reverse Process -- Noise to Image. Pure noise x_T is denoised step by step (x_{T-1}, ..., x_t, ..., x_1) back to a clean image x_0; at each step the neural network epsilon_theta(x_t, t) predicts the noise to remove.]

1.2 Why Diffusion Models Overtook GANs

  Aspect             | GANs                                        | Diffusion Models
  -------------------|---------------------------------------------|------------------------------------------
  Training stability | Notoriously unstable (adversarial dynamics) | Stable (simple MSE regression loss)
  Mode coverage      | Mode collapse is a major issue              | Full distribution coverage by design
  Image quality      | High (for faces)                            | Higher (especially for complex scenes)
  Diversity          | Limited by mode collapse                    | Excellent diversity
  Text conditioning  | Requires architectural tricks               | Natural via cross-attention
  Controllability    | Limited                                     | Excellent (CFG, ControlNet, etc.)
  Sampling speed     | Fast (single forward pass)                  | Slow (many denoising steps)
  Likelihood         | Not available                               | Available (via variational bound)

1.3 Historical Context

Timeline of Diffusion Model Breakthroughs

  • 2015: Sohl-Dickstein et al. introduce the diffusion framework (deep unsupervised learning using nonequilibrium thermodynamics)
  • 2020 (June): DDPM (Ho et al.) makes diffusion practical with simplified training and high-quality image generation
  • 2020 (Oct): DDIM (Song et al.) enables faster sampling with fewer steps (deterministic)
  • 2021: Guided diffusion (Dhariwal & Nichol) shows diffusion beats GANs on ImageNet; Classifier-Free Guidance invented
  • 2021: DALL-E and GLIDE introduce text-to-image diffusion
  • 2022 (Apr): DALL-E 2 uses CLIP + diffusion prior + diffusion decoder
  • 2022 (Aug): Stable Diffusion released open-source (Latent Diffusion Model)
  • 2022: Midjourney launches with stunning artistic quality
  • 2023: SDXL, DALL-E 3 (integrated with ChatGPT), Midjourney v5
  • 2024: Stable Diffusion 3 (flow matching, DiT), Flux (Black Forest Labs), DALL-E 3 improvements
  • 2024 (Dec)-2025: Sora brings video diffusion to a broad audience; Flux 1.1 and distilled models push efficiency toward real-time generation, and video generation becomes widespread

2. The Math of Diffusion

Understanding the mathematics of diffusion models is essential for implementing them correctly and debugging issues. We will build up from first principles, explaining every formula with intuition.

2.1 The Forward Process (Adding Noise)

The forward process is a Markov chain that gradually adds Gaussian noise to data over T timesteps. Starting from a clean data point x_0, we define:

Single Forward Step:

q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I)

At each timestep t, we scale the previous sample by sqrt(1 - beta_t) (making it slightly smaller) and add Gaussian noise with variance beta_t. The noise schedule beta_1, ..., beta_T controls how quickly the signal is destroyed.

In code, a single forward step is:

import torch

def forward_step(x_prev, beta_t):
    """One step of the forward diffusion process (beta_t: scalar tensor)."""
    noise = torch.randn_like(x_prev)
    x_t = torch.sqrt(1 - beta_t) * x_prev + torch.sqrt(beta_t) * noise
    return x_t

The Noise Schedule

The values beta_1, ..., beta_T determine how much noise is added at each step:

import torch
import numpy as np

def linear_noise_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """
    Linear noise schedule (used in original DDPM).
    beta increases linearly from beta_start to beta_end.
    """
    return torch.linspace(beta_start, beta_end, T)


def cosine_noise_schedule(T=1000, s=0.008):
    """
    Cosine noise schedule (Nichol & Dhariwal, 2021).
    Produces a smoother, more gradual noising process.
    Better results than linear schedule in practice.
    """
    steps = torch.arange(T + 1, dtype=torch.float64)
    f_t = torch.cos(((steps / T) + s) / (1 + s) * (np.pi / 2)) ** 2
    alphas_cumprod = f_t / f_t[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clamp(betas, 0.0001, 0.999).float()


# Visualize both schedules
betas_linear = linear_noise_schedule()
betas_cosine = cosine_noise_schedule()

print(f"Linear schedule: beta_1={betas_linear[0]:.6f}, beta_T={betas_linear[-1]:.6f}")
print(f"Cosine schedule: beta_1={betas_cosine[0]:.6f}, beta_T={betas_cosine[-1]:.6f}")

The Reparameterization Trick: Jump to Any Timestep

A crucial property: we can compute x_t directly from x_0 without iterating through all intermediate steps. Define:

Key Quantities:

alpha_t = 1 - beta_t
alpha_bar_t = alpha_1 * alpha_2 * ... * alpha_t   [cumulative product]

Direct Sampling at Timestep t:

q(x_t | x_0) = N(x_t; sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)

Reparameterized Form:

x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon, where epsilon ~ N(0, I)

This means x_t is a weighted combination of the original clean image x_0 and random noise epsilon. As t increases, alpha_bar_t decreases (approaching 0), so x_t becomes more noise and less signal. At t=T, x_T is approximately pure noise.

def precompute_schedule(betas):
    """
    Precompute all quantities needed for training and sampling.
    """
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    alphas_cumprod_prev = torch.cat([torch.tensor([1.0]), alphas_cumprod[:-1]])

    # For forward process q(x_t | x_0)
    sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
    sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)

    # For reverse process p(x_{t-1} | x_t)
    sqrt_recip_alphas = torch.sqrt(1.0 / alphas)

    # Posterior variance (for reverse process)
    posterior_variance = betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod)

    return {
        "betas": betas,
        "alphas": alphas,
        "alphas_cumprod": alphas_cumprod,
        "sqrt_alphas_cumprod": sqrt_alphas_cumprod,
        "sqrt_one_minus_alphas_cumprod": sqrt_one_minus_alphas_cumprod,
        "sqrt_recip_alphas": sqrt_recip_alphas,
        "posterior_variance": posterior_variance,
    }


def forward_diffusion(x_0, t, schedule):
    """
    Apply forward diffusion: compute x_t from x_0 directly.

    Args:
        x_0: clean images [B, C, H, W]
        t: timestep indices [B] (integer values 0 to T-1)
        schedule: precomputed schedule dict

    Returns:
        x_t: noisy images [B, C, H, W]
        noise: the noise that was added [B, C, H, W]
    """
    noise = torch.randn_like(x_0)

    # Gather the schedule values for each sample's timestep
    sqrt_alpha_bar = schedule["sqrt_alphas_cumprod"][t]          # [B]
    sqrt_one_minus_alpha_bar = schedule["sqrt_one_minus_alphas_cumprod"][t]  # [B]

    # Reshape for broadcasting: [B] -> [B, 1, 1, 1]
    sqrt_alpha_bar = sqrt_alpha_bar[:, None, None, None]
    sqrt_one_minus_alpha_bar = sqrt_one_minus_alpha_bar[:, None, None, None]

    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * noise

    return x_t, noise


# Example: visualize the forward process
betas = linear_noise_schedule(T=1000)
schedule = precompute_schedule(betas)

# Show how an image gets noisier
import matplotlib.pyplot as plt
from torchvision import datasets, transforms

# Load a sample image
# dataset = datasets.MNIST(root="./data", train=True, download=True,
#                           transform=transforms.ToTensor())
# x_0 = dataset[0][0].unsqueeze(0)  # [1, 1, 28, 28]

# fig, axes = plt.subplots(1, 6, figsize=(15, 3))
# timesteps = [0, 50, 200, 500, 750, 999]
# for ax, t_val in zip(axes, timesteps):
#     t = torch.tensor([t_val])
#     x_t, _ = forward_diffusion(x_0, t, schedule)
#     ax.imshow(x_t[0, 0].numpy(), cmap="gray")
#     ax.set_title(f"t = {t_val}")
#     ax.axis("off")
# plt.suptitle("Forward Diffusion Process")
# plt.tight_layout()
# plt.show()
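As a sanity check, chaining the single-step forward updates should agree (in distribution) with the closed-form jump. The schedule length and sample count below are illustrative:

```python
import torch

# Compare iterating the one-step forward update against the closed-form
# q(x_T | x_0), using many independent draws of a scalar "image".
torch.manual_seed(0)
T = 200
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)

x0 = torch.ones(10_000)  # 10k copies of the same starting value

# Iterative: apply all T single steps
x_iter = x0.clone()
for beta in betas:
    x_iter = torch.sqrt(1 - beta) * x_iter + torch.sqrt(beta) * torch.randn_like(x_iter)

# Direct: jump straight to t = T with the reparameterized form
x_direct = torch.sqrt(alphas_cumprod[-1]) * x0 \
    + torch.sqrt(1 - alphas_cumprod[-1]) * torch.randn_like(x0)

# Both should have mean ~ sqrt(alpha_bar_T) and std ~ sqrt(1 - alpha_bar_T)
print(x_iter.mean().item(), x_direct.mean().item())
print(x_iter.std().item(), x_direct.std().item())
```

The empirical means and standard deviations of the two batches match to within sampling error, which is exactly what the reparameterization trick guarantees.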

2.2 The Reverse Process (Denoising)

The reverse process is what we actually want to learn. Starting from pure noise x_T, we want to iteratively denoise to recover a clean sample x_0.

Reverse Step (parametrized by theta):

p_theta(x_{t-1} | x_t) = N(x_{t-1}; mu_theta(x_t, t), sigma_t^2 * I)

The model learns to predict the mean mu_theta of the reverse distribution. The variance sigma_t^2 is typically fixed to beta_t or the posterior variance.

There are three equivalent ways to parametrize what the model predicts:

Three Prediction Targets

  1. Predict the noise epsilon (most common): The model epsilon_theta(x_t, t) predicts the noise that was added. The mean is then:
    mu_theta = (1/sqrt(alpha_t)) * (x_t - (beta_t / sqrt(1 - alpha_bar_t)) * epsilon_theta(x_t, t))
  2. Predict the clean image x_0: The model predicts the denoised image directly. Useful for some applications.
  3. Predict the velocity v (used in v-prediction): v = sqrt(alpha_bar_t) * epsilon - sqrt(1 - alpha_bar_t) * x_0. Used in newer models for better training dynamics.
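All three targets carry the same information: given x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon, each can be recovered from the others. A minimal sketch of the conversions (the helper names are ours, not from any library):

```python
import torch

def eps_to_x0(x_t, eps, alpha_bar_t):
    """Recover the clean-image estimate implied by a noise prediction."""
    return (x_t - torch.sqrt(1 - alpha_bar_t) * eps) / torch.sqrt(alpha_bar_t)

def x0_to_eps(x_t, x0, alpha_bar_t):
    """Recover the noise implied by a clean-image prediction."""
    return (x_t - torch.sqrt(alpha_bar_t) * x0) / torch.sqrt(1 - alpha_bar_t)

def v_to_eps_x0(x_t, v, alpha_bar_t):
    """Convert a v-prediction into the equivalent (eps, x0) pair."""
    sqrt_ab = torch.sqrt(alpha_bar_t)
    sqrt_1mab = torch.sqrt(1 - alpha_bar_t)
    eps = sqrt_ab * v + sqrt_1mab * x_t
    x0 = sqrt_ab * x_t - sqrt_1mab * v
    return eps, x0
```

Because the parametrizations are interchangeable, samplers written in terms of epsilon (like the DDPM/DDIM steps below) work with x_0- or v-predicting models after one of these conversions.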

2.3 The Denoising Objective

The training loss is remarkably simple. We train the model to predict the noise that was added:

Simplified Diffusion Loss:

L = E_{t, x_0, epsilon} [ || epsilon - epsilon_theta(x_t, t) ||^2 ]

1. Sample a clean image x_0 from the training data
2. Sample a random timestep t uniformly from {1, ..., T}
3. Sample random noise epsilon ~ N(0, I)
4. Compute x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
5. The model predicts the noise: epsilon_hat = epsilon_theta(x_t, t)
6. Loss = MSE(epsilon, epsilon_hat)

That is it. The model learns to denoise by predicting what noise was added.
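The six steps fit in a few lines of PyTorch. In this sketch, model is a stand-in for the U-Net built later, and schedule is the dict returned by precompute_schedule above:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x_0, schedule, T=1000):
    """One training step of the simplified DDPM objective."""
    B = x_0.shape[0]
    t = torch.randint(0, T, (B,), device=x_0.device)           # step 2: random timesteps
    eps = torch.randn_like(x_0)                                # step 3: random noise
    sqrt_ab = schedule["sqrt_alphas_cumprod"][t].view(B, 1, 1, 1)
    sqrt_1mab = schedule["sqrt_one_minus_alphas_cumprod"][t].view(B, 1, 1, 1)
    x_t = sqrt_ab * x_0 + sqrt_1mab * eps                      # step 4: forward diffusion
    eps_hat = model(x_t, t)                                    # step 5: predict the noise
    return F.mse_loss(eps_hat, eps)                            # step 6: MSE loss
```

Calling this inside a standard optimizer loop (backward, step) is all the training there is.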

2.4 DDPM vs DDIM

DDPM (Denoising Diffusion Probabilistic Models)

The original formulation by Ho et al. (2020):

  • Stochastic sampling: each reverse step adds noise (controlled randomness)
  • Requires many steps (typically T=1000) for good quality
  • Each run produces different outputs even with same starting noise
  • Sampling formula:
    x_{t-1} = mu_theta(x_t, t) + sigma_t * z, where z ~ N(0, I)

DDIM (Denoising Diffusion Implicit Models)

Song et al. (2020) showed that the same trained model can be sampled differently:

  • Deterministic sampling: same starting noise always produces the same output
  • Can skip steps: sample with 50 or even 20 steps instead of 1000
  • Enables interpolation in latent space
  • Same model, different sampling procedure (no retraining needed)
  • Sampling formula:
    x_{t-1} = sqrt(alpha_bar_{t-1}) * predicted_x_0 + sqrt(1 - alpha_bar_{t-1}) * direction_pointing_to_x_t

def ddpm_sample_step(model, x_t, t, schedule):
    """
    One step of DDPM (stochastic) sampling.
    t is an integer timestep; the model expects a batched tensor of timesteps.
    """
    betas = schedule["betas"]
    sqrt_recip_alphas = schedule["sqrt_recip_alphas"]
    sqrt_one_minus_alphas_cumprod = schedule["sqrt_one_minus_alphas_cumprod"]
    posterior_variance = schedule["posterior_variance"]

    # Predict noise (broadcast the scalar timestep across the batch)
    t_batch = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
    predicted_noise = model(x_t, t_batch)

    # Compute mean
    mean = sqrt_recip_alphas[t] * (
        x_t - (betas[t] / sqrt_one_minus_alphas_cumprod[t]) * predicted_noise
    )

    if t > 0:
        noise = torch.randn_like(x_t)
        x_prev = mean + torch.sqrt(posterior_variance[t]) * noise
    else:
        x_prev = mean  # No noise at final step

    return x_prev


def ddim_sample_step(model, x_t, t, t_prev, schedule, eta=0.0):
    """
    One step of DDIM sampling.

    Args:
        t, t_prev: integer timesteps (t_prev < t; pass t_prev=-1 on the final step)
        eta: controls stochasticity. eta=0 is deterministic, eta=1 recovers DDPM.
    """
    alphas_cumprod = schedule["alphas_cumprod"]

    alpha_bar_t = alphas_cumprod[t]
    alpha_bar_t_prev = alphas_cumprod[t_prev] if t_prev >= 0 else torch.tensor(1.0)

    # Predict noise (broadcast the scalar timestep across the batch)
    t_batch = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
    predicted_noise = model(x_t, t_batch)

    # Predict x_0
    predicted_x0 = (x_t - torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_bar_t)
    predicted_x0 = torch.clamp(predicted_x0, -1, 1)  # Clip for stability

    # Compute variance
    sigma = eta * torch.sqrt(
        (1 - alpha_bar_t_prev) / (1 - alpha_bar_t) * (1 - alpha_bar_t / alpha_bar_t_prev)
    )

    # Direction pointing to x_t
    direction = torch.sqrt(1 - alpha_bar_t_prev - sigma**2) * predicted_noise

    # Compute x_{t-1}
    x_prev = torch.sqrt(alpha_bar_t_prev) * predicted_x0 + direction

    if eta > 0 and t > 0:
        noise = torch.randn_like(x_t)
        x_prev = x_prev + sigma * noise

    return x_prev
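Chaining such steps over a strided subset of timesteps gives a full sampler. Below is a self-contained deterministic (eta = 0) sketch; model is any (noisy image, timestep batch) -> noise predictor, and the 50-step default is illustrative:

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, alphas_cumprod, num_steps=50, device="cpu"):
    """
    Deterministic DDIM sampler (eta = 0) over a strided timestep schedule.

    model: callable (x_t, t_batch) -> predicted noise
    alphas_cumprod: [T] tensor of cumulative products of alpha_t
    """
    T = alphas_cumprod.shape[0]
    # e.g. T=1000, num_steps=50 -> visit t = 999, 979, ..., 20, 0
    timesteps = torch.linspace(T - 1, 0, num_steps).long()

    x = torch.randn(shape, device=device)  # start from pure noise
    for i in range(num_steps):
        t = int(timesteps[i])
        ab_t = alphas_cumprod[t]
        # alpha_bar for the next (less noisy) timestep; 1.0 after the last step
        ab_prev = alphas_cumprod[int(timesteps[i + 1])] if i + 1 < num_steps else torch.tensor(1.0)

        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)

        # Predict x_0, then take the deterministic DDIM update
        x0 = (x - torch.sqrt(1 - ab_t) * eps) / torch.sqrt(ab_t)
        x0 = x0.clamp(-1, 1)
        x = torch.sqrt(ab_prev) * x0 + torch.sqrt(1 - ab_prev) * eps
    return x
```

Because the per-step update is deterministic, rerunning with the same starting noise reproduces the same image, and fewer steps (e.g. 20-50) trade a little quality for a large speedup.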

3. Diffusion Architecture Components

3.1 U-Net Architecture

The U-Net is the backbone of most diffusion models. Originally designed for biomedical image segmentation (Ronneberger et al., 2015), it was adapted for diffusion by DDPM. Its encoder-decoder structure with skip connections is ideal for the denoising task.

U-Net Structure


  Input (x_t + time embedding)
       |
  [Encoder / Downsampling Path]
       |
  Conv Block (64)  ----skip connection----> Concat
       |                                      |
  Downsample (stride 2)                       |
       |                                      |
  Conv Block (128) ----skip connection----> Concat
       |                                      |
  Downsample (stride 2)                       |
       |                                      |
  Conv Block (256) ----skip connection----> Concat
       |                                      |
  Downsample (stride 2)                       |
       |                                      |
  [Bottleneck]                                |
  Conv Block (512) + Self-Attention           |
       |                                      |
  [Decoder / Upsampling Path]                 |
       |                                      |
  Upsample (2x)                               |
       |                                      |
  Conv Block (256) <----- skip connection ----+
       |
  Upsample (2x)
       |
  Conv Block (128) <----- skip connection ----+
       |
  Upsample (2x)
       |
  Conv Block (64)  <----- skip connection ----+
       |
  Output Conv -> Predicted noise (epsilon)
                        

Key Components of the Diffusion U-Net

  • Skip connections: Concatenate encoder features with decoder features at matching resolutions. This preserves fine-grained spatial information that would be lost through downsampling.
  • Time embedding: The model must know which timestep it is denoising. The timestep t is embedded using sinusoidal position encoding (like in Transformers), then projected through an MLP. This time embedding is added to every residual block.
  • Self-attention layers: Added at lower resolutions (e.g., 16x16, 8x8) to capture global dependencies. Not used at full resolution due to quadratic cost.
  • Cross-attention layers: For conditional generation (e.g., text-to-image). The text embedding is injected via cross-attention at multiple resolutions.
  • Group normalization: Used instead of batch norm for stability with small batch sizes.

3.2 Time Embedding

import torch
import torch.nn as nn
import math

class SinusoidalTimeEmbedding(nn.Module):
    """
    Sinusoidal time step embedding, similar to positional encoding in Transformers.
    Maps integer timestep t to a high-dimensional vector.
    """
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        """
        Args:
            t: [B] tensor of integer timesteps
        Returns:
            [B, dim] time embeddings
        """
        device = t.device
        half_dim = self.dim // 2
        embeddings = math.log(10000) / (half_dim - 1)
        embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
        embeddings = t[:, None].float() * embeddings[None, :]
        embeddings = torch.cat([torch.sin(embeddings), torch.cos(embeddings)], dim=-1)
        return embeddings


class TimeMLPEmbedding(nn.Module):
    """
    Full time embedding: sinusoidal encoding -> MLP projection.
    """
    def __init__(self, time_dim, embed_dim):
        super().__init__()
        self.sinusoidal = SinusoidalTimeEmbedding(time_dim)
        self.mlp = nn.Sequential(
            nn.Linear(time_dim, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, t):
        return self.mlp(self.sinusoidal(t))

3.3 Cross-Attention for Conditioning

For text-to-image generation, the text prompt needs to influence the denoising process. This is done via cross-attention, where the image features are queries and the text features are keys and values:

class CrossAttention(nn.Module):
    """
    Cross-attention layer for text conditioning.

    Q = image features (what we're generating)
    K, V = text features (what we're conditioning on)
    """
    def __init__(self, query_dim, context_dim, num_heads=8, head_dim=64):
        super().__init__()
        inner_dim = num_heads * head_dim
        self.num_heads = num_heads
        self.head_dim = head_dim
        self.scale = head_dim ** -0.5

        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_out = nn.Linear(inner_dim, query_dim)

    def forward(self, x, context):
        """
        Args:
            x: image features [B, HW, query_dim]
            context: text features [B, seq_len, context_dim]
        Returns:
            conditioned features [B, HW, query_dim]
        """
        B, N, _ = x.shape
        h = self.num_heads

        q = self.to_q(x).reshape(B, N, h, self.head_dim).permute(0, 2, 1, 3)
        k = self.to_k(context).reshape(B, -1, h, self.head_dim).permute(0, 2, 1, 3)
        v = self.to_v(context).reshape(B, -1, h, self.head_dim).permute(0, 2, 1, 3)

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)

        out = (attn @ v).permute(0, 2, 1, 3).reshape(B, N, -1)
        return self.to_out(out)

3.4 PRACTICAL: Implement a Simple Diffusion Model in PyTorch

Let us build a complete, working diffusion model from scratch. We will train it on MNIST to generate handwritten digits.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.utils import save_image, make_grid
import os

# ====================
# Hyperparameters
# ====================
T = 1000              # Number of diffusion timesteps
IMG_SIZE = 28
IMG_CHANNELS = 1
BATCH_SIZE = 128
EPOCHS = 30
LR = 2e-4
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ====================
# Noise Schedule
# ====================
def get_schedule(T):
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    return {
        "betas": betas.to(DEVICE),
        "alphas": alphas.to(DEVICE),
        "alphas_cumprod": alphas_cumprod.to(DEVICE),
        "sqrt_alphas_cumprod": torch.sqrt(alphas_cumprod).to(DEVICE),
        "sqrt_one_minus_alphas_cumprod": torch.sqrt(1 - alphas_cumprod).to(DEVICE),
        "sqrt_recip_alphas": torch.sqrt(1.0 / alphas).to(DEVICE),
        "posterior_variance": (betas * (1 - torch.cat([torch.tensor([1.0]), alphas_cumprod[:-1]])) / (1 - alphas_cumprod)).to(DEVICE),
    }

schedule = get_schedule(T)

# ====================
# Simple U-Net
# ====================

class ResBlock(nn.Module):
    """Residual block with time embedding."""
    def __init__(self, in_ch, out_ch, time_dim):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.GroupNorm(min(8, in_ch), in_ch),  # in_ch may be 1 (grayscale input); groups must divide channels
            nn.SiLU(),
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
        )
        self.time_proj = nn.Sequential(
            nn.SiLU(),
            nn.Linear(time_dim, out_ch),
        )
        self.conv2 = nn.Sequential(
            nn.GroupNorm(8, out_ch),
            nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x, t_emb):
        h = self.conv1(x)
        h = h + self.time_proj(t_emb)[:, :, None, None]
        h = self.conv2(h)
        return h + self.shortcut(x)


class SimpleUNet(nn.Module):
    """
    A simplified U-Net for diffusion on 28x28 images.
    """
    def __init__(self, in_channels=1, base_channels=64, time_dim=256):
        super().__init__()

        # Time embedding (SinusoidalTimeEmbedding is defined in Section 3.2)
        self.time_embed = nn.Sequential(
            SinusoidalTimeEmbedding(time_dim),
            nn.Linear(time_dim, time_dim),
            nn.SiLU(),
            nn.Linear(time_dim, time_dim),
        )

        # Encoder (downsampling)
        self.enc1 = ResBlock(in_channels, base_channels, time_dim)         # 28x28
        self.down1 = nn.Conv2d(base_channels, base_channels, 3, stride=2, padding=1)  # 14x14

        self.enc2 = ResBlock(base_channels, base_channels * 2, time_dim)   # 14x14
        self.down2 = nn.Conv2d(base_channels * 2, base_channels * 2, 3, stride=2, padding=1)  # 7x7

        # Bottleneck
        self.bottleneck = ResBlock(base_channels * 2, base_channels * 2, time_dim)  # 7x7

        # Decoder (upsampling)
        self.up2 = nn.ConvTranspose2d(base_channels * 2, base_channels * 2, 2, stride=2)  # 14x14
        self.dec2 = ResBlock(base_channels * 4, base_channels, time_dim)  # concat with skip

        self.up1 = nn.ConvTranspose2d(base_channels, base_channels, 2, stride=2)  # 28x28
        self.dec1 = ResBlock(base_channels * 2, base_channels, time_dim)  # concat with skip

        # Output
        self.out = nn.Sequential(
            nn.GroupNorm(8, base_channels),
            nn.SiLU(),
            nn.Conv2d(base_channels, in_channels, 1),
        )

    def forward(self, x, t):
        """
        Args:
            x: noisy image [B, C, H, W]
            t: timestep [B] (integer)
        Returns:
            predicted noise [B, C, H, W]
        """
        t_emb = self.time_embed(t)

        # Encoder
        e1 = self.enc1(x, t_emb)        # [B, 64, 28, 28]
        e2 = self.enc2(self.down1(e1), t_emb)  # [B, 128, 14, 14]

        # Bottleneck
        b = self.bottleneck(self.down2(e2), t_emb)  # [B, 128, 7, 7]

        # Decoder with skip connections
        d2 = self.up2(b)                 # [B, 128, 14, 14]
        d2 = self.dec2(torch.cat([d2, e2], dim=1), t_emb)  # [B, 64, 14, 14]

        d1 = self.up1(d2)               # [B, 64, 28, 28]
        d1 = self.dec1(torch.cat([d1, e1], dim=1), t_emb)  # [B, 64, 28, 28]

        return self.out(d1)


# ====================
# Training
# ====================

def train_diffusion():
    # Data
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),  # Scale to [-1, 1]
    ])
    dataset = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
    dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)

    # Model
    model = SimpleUNet(in_channels=IMG_CHANNELS, base_channels=64).to(DEVICE)
    optimizer = torch.optim.AdamW(model.parameters(), lr=LR)

    print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

    os.makedirs("diffusion_outputs", exist_ok=True)

    for epoch in range(EPOCHS):
        total_loss = 0
        count = 0

        for images, _ in dataloader:
            images = images.to(DEVICE)  # [B, 1, 28, 28], range [-1, 1]
            B = images.shape[0]

            # Sample random timesteps for each image
            t = torch.randint(0, T, (B,), device=DEVICE)

            # Sample noise
            noise = torch.randn_like(images)

            # Forward diffusion: create noisy images
            sqrt_alpha_bar = schedule["sqrt_alphas_cumprod"][t][:, None, None, None]
            sqrt_one_minus_alpha_bar = schedule["sqrt_one_minus_alphas_cumprod"][t][:, None, None, None]
            x_t = sqrt_alpha_bar * images + sqrt_one_minus_alpha_bar * noise

            # Predict the noise
            predicted_noise = model(x_t, t)

            # Loss: MSE between actual noise and predicted noise
            loss = F.mse_loss(predicted_noise, noise)

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            total_loss += loss.item() * B
            count += B

        avg_loss = total_loss / count
        print(f"Epoch {epoch+1}/{EPOCHS}  Loss: {avg_loss:.6f}")

        # Generate samples every 5 epochs
        if (epoch + 1) % 5 == 0:
            samples = sample_images(model, n_samples=64)
            save_image(samples, f"diffusion_outputs/epoch_{epoch+1:03d}.png",
                      nrow=8, normalize=True, value_range=(-1, 1))

    return model


# ====================
# Sampling (DDPM)
# ====================

@torch.no_grad()
def sample_images(model, n_samples=64):
    """
    Generate images using DDPM sampling.
    Start from pure noise and iteratively denoise.
    """
    model.eval()

    # Start from pure Gaussian noise
    x = torch.randn(n_samples, IMG_CHANNELS, IMG_SIZE, IMG_SIZE, device=DEVICE)

    # Reverse diffusion: from t=T-1 down to t=0
    for t_val in reversed(range(T)):
        t = torch.full((n_samples,), t_val, device=DEVICE, dtype=torch.long)

        # Predict noise
        predicted_noise = model(x, t)

        # Compute mean of p(x_{t-1} | x_t)
        alpha_t = schedule["alphas"][t_val]
        alpha_bar_t = schedule["alphas_cumprod"][t_val]
        beta_t = schedule["betas"][t_val]

        mean = (1 / torch.sqrt(alpha_t)) * (
            x - (beta_t / torch.sqrt(1 - alpha_bar_t)) * predicted_noise
        )

        if t_val > 0:
            noise = torch.randn_like(x)
            variance = schedule["posterior_variance"][t_val]
            x = mean + torch.sqrt(variance) * noise
        else:
            x = mean

    model.train()
    return x


# Train the model
# model = train_diffusion()
# Final generation
# samples = sample_images(model, n_samples=64)
# save_image(samples, "diffusion_outputs/final_samples.png",
#            nrow=8, normalize=True, value_range=(-1, 1))

4. Text-to-Image: Stable Diffusion Architecture

4.1 Latent Diffusion Models

The key innovation of Stable Diffusion (Rombach et al., 2022) is performing diffusion in a compressed latent space rather than pixel space. This dramatically reduces computational cost.

Why Latent Space?

A 512x512x3 image has 786,432 dimensions. Running diffusion in this space is extremely expensive. Instead:

  1. Train a VAE to compress images into a smaller latent representation (e.g., 64x64x4 = 16,384 dimensions -- a 48x reduction)
  2. Run the entire diffusion process in this latent space
  3. Decode the final latent back to pixel space using the VAE decoder
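A quick arithmetic check of the figures above:

```python
# Dimensionality of pixel space vs. SD 1.x latent space (values from the text)
pixel_dims = 512 * 512 * 3   # full-resolution RGB image
latent_dims = 64 * 64 * 4    # VAE latent: 8x spatial downsampling, 4 channels

print(pixel_dims)                 # 786432
print(latent_dims)                # 16384
print(pixel_dims / latent_dims)   # 48.0 -> the "48x reduction"
```

Since U-Net cost scales with spatial resolution, working at 64x64 instead of 512x512 is what makes Stable Diffusion trainable and runnable on consumer GPUs.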

4.2 The Full Stable Diffusion Pipeline

Complete Pipeline


  Text Prompt: "a photo of an astronaut riding a horse on Mars"
       |
       v
  +------------------+
  | CLIP Text        |  Tokenize text, produce text embeddings
  | Encoder          |  Output: [B, 77, 768] (77 tokens, 768-dim)
  +------------------+
       |
       | (text embeddings fed via cross-attention)
       |
       v
  +------------------+
  | U-Net            |  Iterative denoising in latent space
  | (in latent       |  Input: noisy latent z_t + timestep t + text
  |  space)          |  Output: predicted noise epsilon
  +------------------+
       |
       | (after T denoising steps)
       |
       v
  Clean Latent z_0
       |
       v
  +------------------+
  | VAE Decoder      |  Decode latent to pixel space
  |                  |  Input: [B, 4, 64, 64]
  |                  |  Output: [B, 3, 512, 512]
  +------------------+
       |
       v
  Generated Image (512 x 512 x 3)
                        

Component Details

  • VAE (Variational Autoencoder): Trained separately. Encoder compresses 512x512x3 to 64x64x4 (8x spatial downsampling). Decoder reverses this. The latent space is regularized to be approximately Gaussian.
  • CLIP Text Encoder: The text encoder from OpenAI's CLIP model. Converts text prompts into dense embeddings that the U-Net conditions on via cross-attention. SD 1.x uses CLIP ViT-L/14 (768-dim). SDXL uses two text encoders.
  • U-Net: The core diffusion model. Contains ~860M parameters in SD 1.5. ResNet blocks with self-attention and cross-attention at multiple resolutions.
  • Scheduler/Sampler: Controls the denoising process. Different schedulers trade off speed vs quality (Euler, DPM++, PNDM, etc.).
[Figure: Text-to-Image Pipeline -- Stable Diffusion. Text prompt -> CLIP text encoder -> cross-attention into the U-Net; random noise z_T is denoised over T steps into a clean latent z_0, which the VAE decoder turns into the generated image.]

4.3 Classifier-Free Guidance (CFG)

CFG is the technique that makes text-to-image generation work well in practice. Without it, generated images have low text-image alignment.

Classifier-Free Guidance:

epsilon_guided = epsilon_uncond + w * (epsilon_cond - epsilon_uncond)

During training, the model randomly drops the text condition (replaces with empty text) some percentage of the time. At inference:
1. Run the model twice: once with text (epsilon_cond) and once without text (epsilon_uncond)
2. Amplify the difference: move AWAY from the unconditional prediction toward the conditional one
3. The guidance scale w controls the strength: w=1 reduces to the plain conditional prediction (no extra guidance), w=7.5 is a typical default, and w=15+ is very strong guidance

Higher w makes outputs more faithful to the prompt at the cost of sample diversity; very high values produce oversaturated colors and artifacts.
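The guidance formula above is a single line of tensor arithmetic. Here is a minimal NumPy sketch, with random arrays standing in for the two U-Net outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 64, 64))  # U-Net prediction with empty prompt
eps_cond = rng.standard_normal((4, 64, 64))    # U-Net prediction with the text prompt

def cfg(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate away from the unconditional prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# w=1 recovers the plain conditional prediction; w=0 the unconditional one
assert np.allclose(cfg(eps_uncond, eps_cond, 1.0), eps_cond)
assert np.allclose(cfg(eps_uncond, eps_cond, 0.0), eps_uncond)
```

For w > 1 the result lies outside the segment between the two predictions, which is exactly the amplification described in step 2.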

4.4 PRACTICAL: Use Stable Diffusion with the Diffusers Library

from diffusers import (
    StableDiffusionPipeline,
    StableDiffusionXLPipeline,
    DPMSolverMultistepScheduler,
    EulerDiscreteScheduler,
    EulerAncestralDiscreteScheduler,
)
import torch

# ====================
# Basic Stable Diffusion 1.5
# ====================

def generate_sd15(
    prompt: str,
    negative_prompt: str = "",
    num_images: int = 1,
    guidance_scale: float = 7.5,
    num_inference_steps: int = 50,
    seed: int = None,
):
    """Generate images with Stable Diffusion 1.5."""

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
        safety_checker=None,
    ).to("cuda")

    # Use DPM++ 2M scheduler for faster, higher-quality generation
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

    generator = torch.Generator("cuda").manual_seed(seed) if seed is not None else None

    images = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        num_images_per_prompt=num_images,
        guidance_scale=guidance_scale,
        num_inference_steps=num_inference_steps,
        generator=generator,
    ).images

    return images


# ====================
# Stable Diffusion XL
# ====================

def generate_sdxl(
    prompt: str,
    negative_prompt: str = "",
    guidance_scale: float = 7.0,
    num_inference_steps: int = 30,
    seed: int = None,
):
    """Generate images with SDXL (1024x1024)."""

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
    ).to("cuda")

    generator = torch.Generator("cuda").manual_seed(seed) if seed is not None else None

    image = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        guidance_scale=guidance_scale,
        num_inference_steps=num_inference_steps,
        generator=generator,
    ).images[0]

    return image


# ====================
# Experiment with Different Schedulers
# ====================

def compare_schedulers(prompt: str, seed: int = 42):
    """
    Compare different schedulers/samplers for the same prompt.
    Different schedulers produce different results and have different speed/quality tradeoffs.
    """
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
        safety_checker=None,
    ).to("cuda")

    schedulers = {
        "Euler": EulerDiscreteScheduler,
        "Euler Ancestral": EulerAncestralDiscreteScheduler,
        "DPM++ 2M": DPMSolverMultistepScheduler,
    }

    results = {}
    for name, SchedulerClass in schedulers.items():
        pipe.scheduler = SchedulerClass.from_config(pipe.scheduler.config)
        generator = torch.Generator("cuda").manual_seed(seed)

        image = pipe(
            prompt=prompt,
            num_inference_steps=30,
            guidance_scale=7.5,
            generator=generator,
        ).images[0]

        results[name] = image
        print(f"Generated with {name} scheduler")

    return results


# ====================
# Experiment with CFG Scale
# ====================

def compare_cfg_scales(prompt: str, seed: int = 42):
    """
    Show the effect of different classifier-free guidance scales.
    """
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
        safety_checker=None,
    ).to("cuda")

    cfg_scales = [1.0, 3.0, 5.0, 7.5, 10.0, 15.0, 20.0]
    results = {}

    for cfg in cfg_scales:
        generator = torch.Generator("cuda").manual_seed(seed)
        image = pipe(
            prompt=prompt,
            guidance_scale=cfg,
            num_inference_steps=30,
            generator=generator,
        ).images[0]

        results[cfg] = image
        print(f"CFG Scale {cfg}: generated")

    return results


# Example usage:
# images = generate_sd15(
#     prompt="a beautiful landscape painting of mountains at sunset, oil on canvas",
#     negative_prompt="blurry, bad quality, distorted",
#     guidance_scale=7.5,
#     num_inference_steps=30,
#     seed=42,
# )
# images[0].save("landscape.png")

5. Advanced Diffusion Techniques

5.1 ControlNet

ControlNet (Zhang et al., 2023) adds spatial conditioning to diffusion models. It lets you control the generation with edge maps, depth maps, pose skeletons, segmentation maps, and more.

How ControlNet Works


  Control Image (e.g., Canny edges)
       |
       v
  +-----------------------+
  | ControlNet            |
  | (copy of U-Net        |
  |  encoder + middle)    |
  +-----------------------+
       |
       | (zero-conv outputs added to main U-Net)
       |
       v
  +-----------------------+
  | Main U-Net            |   + Text conditioning via cross-attention
  | (frozen weights)      |
  +-----------------------+
       |
       v
  Generated Image (follows the control structure)
                        

ControlNet creates a trainable copy of the U-Net's encoder blocks. The outputs are connected to the main (frozen) U-Net through "zero convolutions" -- 1x1 convs initialized with zero weights, so the ControlNet has no effect at the start of training and gradually learns to influence the generation.
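The zero-convolution trick is easy to verify in isolation: a 1x1 conv with zero-initialized weights and bias outputs exactly zero, so adding its output to the frozen U-Net's features changes nothing at initialization, while gradients still flow into its parameters. A minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero, as used to connect ControlNet to the frozen U-Net."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

features = torch.randn(1, 320, 64, 64)  # frozen U-Net features
control = torch.randn(1, 320, 64, 64)   # ControlNet branch output
zc = zero_conv(320)

# At the start of training the ControlNet branch contributes nothing
assert torch.equal(features + zc(control), features)
```

Because the conv's input is nonzero, its weight gradients are nonzero, so during training it gradually learns how strongly the control signal should influence each feature map.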

from diffusers import (
    StableDiffusionControlNetPipeline,
    ControlNetModel,
    UniPCMultistepScheduler,
)
from diffusers.utils import load_image
import cv2
import numpy as np
from PIL import Image
import torch


def generate_with_controlnet_canny(
    prompt: str,
    image_path: str,
    negative_prompt: str = "",
    guidance_scale: float = 7.5,
    controlnet_conditioning_scale: float = 1.0,
    seed: int = None,
):
    """
    Generate an image conditioned on Canny edge detection of an input image.
    """
    # Load ControlNet model for Canny edges
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny",
        torch_dtype=torch.float16,
    )

    # Create pipeline with ControlNet
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
        safety_checker=None,
    ).to("cuda")

    pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

    # Prepare Canny edge map from input image
    image = cv2.imread(image_path)
    image = cv2.resize(image, (512, 512))
    edges = cv2.Canny(image, 100, 200)
    edges = np.stack([edges] * 3, axis=-1)  # Convert to 3-channel
    control_image = Image.fromarray(edges)

    generator = torch.Generator("cuda").manual_seed(seed) if seed is not None else None

    output = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        image=control_image,
        guidance_scale=guidance_scale,
        controlnet_conditioning_scale=controlnet_conditioning_scale,
        num_inference_steps=30,
        generator=generator,
    ).images[0]

    return output, control_image


# Usage:
# output, edges = generate_with_controlnet_canny(
#     prompt="a beautiful house in the forest, professional photo",
#     image_path="input_photo.jpg",
#     seed=42,
# )
# output.save("controlnet_output.png")
# edges.save("canny_edges.png")

5.2 LoRA for Stable Diffusion

LoRA (Low-Rank Adaptation) enables fine-tuning diffusion models on custom styles, characters, or concepts with minimal training data and compute.

How LoRA Works for Diffusion

Instead of fine-tuning all ~860M parameters of the U-Net, LoRA:

  1. Freezes the original model weights W
  2. Adds small trainable matrices: W' = W + alpha * B * A, where A is [r, d] and B is [d, r] with rank r much less than d (typically r=4 to 32)
  3. Only trains A and B matrices (a few MB instead of GB)
  4. Applied to attention layers (Q, K, V projections) in the U-Net

Benefits: small file size (2-100 MB vs 2-7 GB), fast training (minutes to hours), can be composed (apply multiple LoRAs at once).
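The parameter savings follow directly from the shapes. A minimal NumPy sketch for a single projection matrix, using an illustrative width of 768 (SD 1.x's cross-attention dimension) and rank 4:

```python
import numpy as np

d, r, alpha = 768, 4, 1.0          # layer width, LoRA rank, scaling factor
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))    # frozen pretrained weight
A = rng.standard_normal((r, d))    # trainable down-projection
B = np.zeros((d, r))               # trainable up-projection, zero-init so W' = W at start

W_prime = W + alpha * B @ A
assert np.allclose(W_prime, W)     # no effect before training

full_params = d * d                # 589,824 parameters in the frozen matrix
lora_params = d * r + r * d        # 6,144 trainable parameters -- about 1%
```

Zero-initializing B means the fine-tune starts exactly at the pretrained model, the same "start from identity" idea as ControlNet's zero convolutions.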

# Training a LoRA with the diffusers library
# This uses the official training script

# Step 1: Prepare your dataset
# Create a folder with images and a metadata.jsonl file:
# {"file_name": "image1.jpg", "text": "a painting in the style of sks artist"}
# {"file_name": "image2.jpg", "text": "a landscape in the style of sks artist"}

# Step 2: Run the LoRA training script
"""
accelerate launch diffusers/examples/text_to_image/train_text_to_image_lora.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --dataset_name="path/to/your/dataset" \
  --caption_column="text" \
  --resolution=512 \
  --train_batch_size=1 \
  --num_train_epochs=100 \
  --learning_rate=1e-4 \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=0 \
  --rank=4 \
  --output_dir="./my-lora" \
  --validation_prompt="a cat in the style of sks artist" \
  --seed=42
"""

# Step 3: Use the trained LoRA
from diffusers import StableDiffusionPipeline
import torch

def generate_with_lora(prompt, lora_path, lora_scale=1.0, seed=42):
    """Generate images using a trained LoRA."""
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    ).to("cuda")

    # Load LoRA weights
    pipe.load_lora_weights(lora_path)

    generator = torch.Generator("cuda").manual_seed(seed)

    image = pipe(
        prompt=prompt,
        num_inference_steps=30,
        guidance_scale=7.5,
        cross_attention_kwargs={"scale": lora_scale},
        generator=generator,
    ).images[0]

    return image

# Usage:
# image = generate_with_lora(
#     "a mountain landscape in the style of sks artist",
#     lora_path="./my-lora",
#     lora_scale=0.8,
# )
# image.save("lora_output.png")

5.3 Inpainting and Image-to-Image

from diffusers import (
    StableDiffusionInpaintPipeline,
    StableDiffusionImg2ImgPipeline,
    AutoPipelineForInpainting,
)
from PIL import Image
import torch


def inpaint_image(
    prompt: str,
    image_path: str,
    mask_path: str,
    guidance_scale: float = 7.5,
    strength: float = 0.75,
    seed: int = None,
):
    """
    Inpaint a region of an image based on a text prompt.

    Args:
        prompt: Description of what to generate in the masked region
        image_path: Original image path
        mask_path: Binary mask (white = inpaint, black = keep)
        strength: How much to change the masked region (0-1)
    """
    pipe = AutoPipelineForInpainting.from_pretrained(
        "runwayml/stable-diffusion-inpainting",
        torch_dtype=torch.float16,
    ).to("cuda")

    image = Image.open(image_path).convert("RGB").resize((512, 512))
    mask = Image.open(mask_path).convert("L").resize((512, 512))

    generator = torch.Generator("cuda").manual_seed(seed) if seed is not None else None

    output = pipe(
        prompt=prompt,
        image=image,
        mask_image=mask,
        strength=strength,
        guidance_scale=guidance_scale,
        num_inference_steps=30,
        generator=generator,
    ).images[0]

    return output


def image_to_image(
    prompt: str,
    image_path: str,
    strength: float = 0.75,
    guidance_scale: float = 7.5,
    seed: int = None,
):
    """
    Transform an existing image based on a text prompt.

    Args:
        prompt: Description of the desired output
        image_path: Input image path
        strength: How much to change (0 = almost identical, 1 = completely new)
    """
    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
        safety_checker=None,
    ).to("cuda")

    image = Image.open(image_path).convert("RGB").resize((512, 512))

    generator = torch.Generator("cuda").manual_seed(seed) if seed is not None else None

    output = pipe(
        prompt=prompt,
        image=image,
        strength=strength,
        guidance_scale=guidance_scale,
        num_inference_steps=30,
        generator=generator,
    ).images[0]

    return output

5.4 Modern Architectures: SD3, Flux, Flow Matching

Flow Matching (SD3, Flux)

Flow matching is a continuous-time generalization of diffusion that has become the dominant approach in 2024-2026:

  • Continuous time: Instead of discrete timesteps t in {0, 1, ..., T}, use continuous t in [0, 1]
  • Straight-line interpolation: The forward path is x_t = (1-t) * x_0 + t * epsilon (simple linear interpolation between data and noise)
  • Velocity prediction: The model predicts the velocity v_t = dx_t/dt instead of noise
  • ODE solver for sampling: Sampling is solving an ODE, enabling adaptive step sizes
  • Simpler math: No need for beta schedules, reparameterization tricks, or complex variance formulas
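Under the straight-line interpolation above, the velocity target is constant along the path: d/dt[(1-t) * x_0 + t * epsilon] = epsilon - x_0, which is what the model regresses against. A quick NumPy check, with random arrays standing in for data and noise:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))    # "clean data"
eps = rng.standard_normal((4, 8))   # Gaussian noise

def x_t(t):
    """Straight-line interpolation between data (t=0) and noise (t=1)."""
    return (1 - t) * x0 + t * eps

v_target = eps - x0                 # constant velocity along the path

# Finite-difference check that dx_t/dt matches the velocity target
h = 1e-6
fd = (x_t(0.3 + h) - x_t(0.3)) / h
assert np.allclose(fd, v_target, atol=1e-3)
```

Training then minimizes MSE between the network's output at (x_t, t) and v_target; sampling integrates the learned velocity field from t=1 back to t=0 with any ODE solver.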

Diffusion Transformer (DiT)

DiT (Peebles & Xie, 2023) replaces the U-Net with a Transformer:

  • Input: noisy latent patches (just like ViT patches)
  • Conditioning: timestep and class/text embeddings via adaptive layer norm (adaLN-Zero)
  • Architecture: standard Transformer blocks with self-attention
  • Scales better than U-Net with more compute
  • Used in: Sora, Stable Diffusion 3, Flux
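Patchifying a noisy latent into transformer tokens works exactly like ViT patch embedding. A minimal sketch, assuming an SD-style 32x32x4 latent and 2x2 patches (a common DiT configuration):

```python
import torch

latent = torch.randn(1, 4, 32, 32)   # [B, C, H, W] noisy latent
p = 2                                # patch size

B, C, H, W = latent.shape
# Cut into non-overlapping p x p patches and flatten each patch into one token
tokens = (
    latent.reshape(B, C, H // p, p, W // p, p)
          .permute(0, 2, 4, 1, 3, 5)
          .reshape(B, (H // p) * (W // p), C * p * p)
)
assert tokens.shape == (1, 256, 16)  # 16x16 = 256 tokens, each of dimension 4*2*2 = 16
```

In a real DiT each 16-dim token is then linearly projected to the model width, positional embeddings are added, and standard Transformer blocks (with adaLN-Zero conditioning) process the sequence.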

Flux Architecture (Black Forest Labs, 2024-2025)

Flux is the successor to Stable Diffusion, created by the original SD researchers:

  • Flow matching based (continuous-time formulation)
  • DiT architecture (no U-Net)
  • MMDiT: multimodal DiT where text and image tokens attend to each other jointly
  • Rotary position embeddings (RoPE) for resolution flexibility
  • Available in multiple sizes: Flux.1 [dev], Flux.1 [schnell] (fast), Flux.1 [pro]

# Using Flux with diffusers
from diffusers import FluxPipeline
import torch

def generate_with_flux(
    prompt: str,
    num_inference_steps: int = 50,
    guidance_scale: float = 3.5,
    seed: int = None,
):
    """Generate images with Flux.1 [dev]."""
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev",
        torch_dtype=torch.bfloat16,
    ).to("cuda")

    generator = torch.Generator("cuda").manual_seed(seed) if seed is not None else None

    image = pipe(
        prompt=prompt,
        guidance_scale=guidance_scale,
        num_inference_steps=num_inference_steps,
        generator=generator,
        height=1024,
        width=1024,
    ).images[0]

    return image

# Usage:
# image = generate_with_flux(
#     "a photorealistic portrait of a wise old wizard in a library",
#     seed=42,
# )
# image.save("flux_output.png")

6. Video Generation with Diffusion

6.1 Extending Diffusion to Video

Video generation extends image diffusion by adding the temporal dimension. The key challenges are:

  • Temporal consistency: Objects must look the same across frames. No flickering or sudden changes.
  • Motion quality: Movement must be smooth and physically plausible.
  • Computational cost: A 10-second 24fps video has 240 frames, each a full image.
  • Long-range coherence: The beginning and end of a video must be consistent.

6.2 Sora and the DiT Architecture for Video

Sora's Key Concepts

  • Spacetime patches: Instead of 2D image patches, Sora uses 3D patches that span space and time. A spacetime patch might be 2 frames x 16x16 pixels.
  • Variable duration and resolution: Unlike fixed-size models, Sora can generate videos at various resolutions and lengths by varying the number of spacetime patches.
  • DiT backbone: Uses Diffusion Transformer instead of U-Net, with 3D attention (attend across space and time).
  • Scaling: Quality improves consistently with more compute, following scaling laws.
  • Emergent properties: With enough scale, the model develops understanding of 3D consistency, object permanence, and basic physics.
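The spacetime-patch idea can be made concrete with quick arithmetic. The numbers below are illustrative only (Sora's actual patch sizes are not public); they use the 2-frame x 16x16-pixel patch mentioned above and the 10-second, 24 fps video from Section 6.1:

```python
# Spacetime patch count for a hypothetical video (illustrative numbers only)
frames, height, width = 240, 256, 256   # 10 s at 24 fps, 256x256 resolution
t_patch, s_patch = 2, 16                # 2 frames x 16x16 pixels per patch

n_patches = (frames // t_patch) * (height // s_patch) * (width // s_patch)
print(n_patches)  # 120 * 16 * 16 = 30720 tokens for the transformer to attend over
```

Attention cost scales quadratically in this token count, which is why video models compress aggressively in both space and time before the DiT backbone.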

6.3 PRACTICAL: Generate Video with Open-Source Models

# Video generation with CogVideoX (open-source)
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
import torch

def generate_video_cogvideox(
    prompt: str,
    num_frames: int = 49,
    num_inference_steps: int = 50,
    guidance_scale: float = 6.0,
    seed: int = None,
):
    """
    Generate a video using CogVideoX (open-source video diffusion model).
    """
    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX-5b",
        torch_dtype=torch.bfloat16,
    )

    # Enable memory optimization: keep the pipeline on CPU and let offloading
    # move submodules to the GPU only while they are needed
    pipe.enable_model_cpu_offload()

    generator = torch.Generator("cuda").manual_seed(seed) if seed is not None else None

    video_frames = pipe(
        prompt=prompt,
        num_frames=num_frames,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
        generator=generator,
    ).frames[0]

    return video_frames


# Usage:
# frames = generate_video_cogvideox(
#     "A golden retriever running through a field of sunflowers",
#     seed=42,
# )
# export_to_video(frames, "output_video.mp4", fps=8)


# ====================
# Video generation with Wan 2.1 (Alibaba, open-source)
# ====================
from diffusers import WanPipeline

def generate_video_wan(
    prompt: str,
    num_frames: int = 81,
    guidance_scale: float = 5.0,
    seed: int = None,
):
    """Generate video with Wan 2.1 (open-source, competitive quality)."""
    pipe = WanPipeline.from_pretrained(
        "Wan-AI/Wan2.1-T2V-14B",
        torch_dtype=torch.bfloat16,
    )
    pipe.enable_model_cpu_offload()

    generator = torch.Generator("cuda").manual_seed(seed) if seed is not None else None

    output = pipe(
        prompt=prompt,
        num_frames=num_frames,
        guidance_scale=guidance_scale,
        generator=generator,
    )

    return output.frames[0]

# frames = generate_video_wan("A cat sitting on a windowsill watching rain", seed=42)
# export_to_video(frames, "wan_video.mp4", fps=16)

7. Diffusion Models vs LLMs for Image Generation

  Aspect                 | Diffusion Models                        | Autoregressive (LLM-style)
  -----------------------+-----------------------------------------+--------------------------------
  Generation process     | Parallel denoising (all pixels at once) | Sequential token prediction
  Architecture           | U-Net or DiT                            | Transformer decoder
  Speed                  | Multiple forward passes (10-50 steps)   | Many sequential tokens
  Quality                | Excellent for images                    | Improving rapidly
  Text-image integration | Separate encoders + cross-attention     | Native (same vocabulary)
  Editing                | Inpainting, img2img                     | Natural with token manipulation
  Examples               | SD, DALL-E 3, Flux, Midjourney          | Parti, Chameleon, Transfusion

Hybrid Approaches (2025-2026)

The field is converging toward hybrid architectures:

  • Transfusion: A single model that does both autoregressive text and diffusion-based images in a shared architecture
  • Chameleon (Meta): Early-fusion multimodal model that tokenizes images and generates them autoregressively alongside text
  • Native multimodal models: GPT-4o and Gemini 2.0 can both understand and generate images natively
  • The trend: Moving toward unified models that handle all modalities in a single framework

Summary and Key Takeaways

What We Covered This Week

  1. Diffusion fundamentals: Forward (add noise) and reverse (denoise) processes, with a simple MSE loss on noise prediction.
  2. The math: Noise schedules, reparameterization trick, DDPM vs DDIM sampling.
  3. U-Net architecture: Encoder-decoder with skip connections, time embeddings, and cross-attention for conditioning.
  4. Stable Diffusion: Latent diffusion (VAE + CLIP + U-Net), classifier-free guidance, different schedulers.
  5. Advanced techniques: ControlNet for spatial conditioning, LoRA for style fine-tuning, inpainting, img2img.
  6. Modern architectures: DiT (Diffusion Transformer), Flow Matching (Flux, SD3), continuous-time formulations.
  7. Video generation: Spacetime patches, temporal consistency, Sora/CogVideoX/Wan architectures.

Preparation for Next Week

In Week 15: Capstone Project, you will apply everything you have learned across all 14 weeks to build a comprehensive AI engineering project. Review the project ideas and start thinking about which one excites you most.

Exercises

Exercise 1: Diffusion from Scratch

Train the SimpleUNet diffusion model on MNIST. Then modify it to use a cosine noise schedule instead of linear. Compare the quality of generated samples.

Exercise 2: DDIM Sampling

Implement DDIM sampling for your trained model. Compare samples generated with 1000 DDPM steps vs 50 DDIM steps. Measure FID if possible.

Exercise 3: CFG Exploration

Using the diffusers library, generate the same prompt with CFG scales from 1 to 20. Create a grid showing how guidance scale affects quality, diversity, and artifacts.

Exercise 4: ControlNet Application

Build a small web app (Gradio/Streamlit) that lets users upload an image, automatically extract edges/depth, and generate a new image in a specified style using ControlNet.

Exercise 5: LoRA Training

Collect 10-20 images in a specific art style. Train a LoRA on Stable Diffusion to capture that style. Generate images and evaluate how well the style transfers.