Week 12 - Advanced

Reasoning Models

Understand how modern reasoning models think, from chain-of-thought prompting to RLHF, DPO, and test-time compute scaling. Master the techniques behind o3, DeepSeek R1, and Claude's extended thinking.

8+ hands-on examples · ~6 hours of content · Last updated: March 2026

1. What Are Reasoning Models?

The Evolution of Reasoning in LLMs

The journey toward AI systems that can genuinely "think" has been one of the most exciting developments in AI. Here is the timeline of major reasoning model milestones:

Date      | Model                                 | Key Innovation
----------|---------------------------------------|----------------------------------------------------------------------
Mar 2023  | GPT-4                                 | First model widely capable of complex reasoning tasks
Sep 2024  | o1-preview                            | First "reasoning model" with hidden chain-of-thought (thinking tokens)
Dec 2024  | o1 (full)                             | Improved reasoning with o1-pro mode for harder problems
Jan 2025  | DeepSeek R1                           | Open-source reasoning model trained with pure RL (GRPO)
Jan 2025  | o3-mini                               | Smaller reasoning model with configurable "thinking effort"
Feb 2025  | Claude 3.7 Sonnet (Extended Thinking) | Anthropic's approach: visible extended thinking blocks
Apr 2025  | o3                                    | Full o3 model with state-of-the-art reasoning
2025-2026 | Claude 4 (thinking)                   | Claude 4 with extended thinking for complex reasoning

What Makes a "Reasoning Model" Different?

A reasoning model differs from a regular LLM in several fundamental ways:

1. Thinking Tokens / Extended Thinking

Regular LLMs generate output tokens directly. Reasoning models generate thinking tokens first -- internal reasoning steps that are processed before the final answer. These thinking tokens represent the model's step-by-step problem-solving process.

  • OpenAI's o1/o3: Thinking tokens are hidden from the user. You see a "thinking..." indicator, then get the final answer. The API returns a reasoning_tokens count.
  • Anthropic's Extended Thinking: Thinking blocks are visible to the developer. You can see exactly how the model reasons through a problem, making it more transparent and debuggable.
  • DeepSeek R1: Thinking process is fully visible in the output, wrapped in <think>...</think> tags.
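Since R1 exposes its reasoning inline, separating the thinking from the final answer is a simple parsing step. A minimal sketch, assuming at most one `<think>...</think>` block per completion (as R1 emits):

```python
import re

def split_r1_output(text: str) -> tuple[str, str]:
    """Split a DeepSeek R1-style completion into (thinking, answer).

    Assumes at most one <think>...</think> block, appearing before the answer.
    """
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if match:
        thinking = match.group(1).strip()
        answer = text[match.end():].strip()
        return thinking, answer
    # No thinking block found: treat the whole completion as the answer
    return "", text.strip()

# Example with a mock R1-style completion
raw = "<think>2 + 2: add the units digits, giving 4.</think>The answer is 4."
thinking, answer = split_r1_output(raw)
```

This is handy for logging the reasoning separately, or for stripping it before showing output to end users.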

2. Test-Time Compute Scaling

This is the key insight behind reasoning models: you can improve model performance by spending more computation at inference time (when the model is generating a response), not just during training.

A regular LLM has a fixed amount of "thinking" per token -- each token is generated in roughly the same amount of time. A reasoning model can choose to "think harder" on difficult problems by generating more thinking tokens. This is analogous to how humans spend more time thinking about hard problems.

The more thinking tokens a reasoning model uses, the better its answer tends to be. This creates a new trade-off: latency and cost vs. quality. You can configure how much the model should think:

  • Low thinking: Fast, cheap, good for simple tasks
  • Medium thinking: Balanced, suitable for most tasks
  • High thinking: Slow, expensive, best for very hard problems (math, complex coding, scientific analysis)

3. "System 1" vs "System 2" Thinking

Drawing from Daniel Kahneman's framework from "Thinking, Fast and Slow":

  • System 1 (Fast Thinking): Automatic, intuitive, quick responses. This is how regular LLMs operate -- they generate responses token-by-token based on pattern matching from training. Fast but error-prone on complex tasks.
  • System 2 (Slow Thinking): Deliberate, analytical, step-by-step reasoning. This is what reasoning models do -- they explicitly work through problems, check their work, consider alternatives, and verify their answers. Slower but much more accurate on complex tasks.

Reasoning models essentially add a "System 2" capability to LLMs. They can still do System 1 for simple tasks, but they can engage System 2 when needed for hard problems.

Reasoning Model Architecture
graph LR
    A[Input Prompt] --> B{Task Difficulty}
    B -->|Simple| C[System 1: Direct Output]
    B -->|Complex| D[System 2: Thinking Tokens]
    D --> E[Step-by-Step Reasoning]
    E --> F[Verify and Check]
    F -->|Error found| E
    F -->|Correct| G[Final Answer]
    C --> G
    style A fill:#1a1a2e,stroke:#e94560,color:#fff
    style D fill:#1a1a2e,stroke:#f5a623,color:#fff
    style E fill:#1a1a2e,stroke:#7c4dff,color:#fff
    style G fill:#1a1a2e,stroke:#00d4aa,color:#fff

Using Reasoning Models via API

OpenAI o3 API Usage

import openai

client = openai.OpenAI()

# Using o3 -- note the differences from regular models
response = client.chat.completions.create(
    model="o3",
    messages=[
        # Note: o3 has limited system prompt support
        # Use developer messages instead for instructions
        {
            "role": "developer",
            "content": "You are a math tutor. Show your work clearly."
        },
        {
            "role": "user",
            "content": "Prove that the square root of 2 is irrational."
        }
    ],
    # Reasoning effort: "low", "medium", "high"
    reasoning_effort="high",
    # max_completion_tokens includes BOTH thinking and output tokens
    max_completion_tokens=16000
)

print(f"Answer: {response.choices[0].message.content}")
print(f"Thinking tokens: {response.usage.completion_tokens_details.reasoning_tokens}")
print(f"Output tokens: {response.usage.completion_tokens - response.usage.completion_tokens_details.reasoning_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
                        
Anthropic Claude Extended Thinking API

import anthropic

client = anthropic.Anthropic()

# Claude with extended thinking
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Max tokens for thinking
    },
    messages=[
        {
            "role": "user",
            "content": "Solve this step by step: A train leaves Station A at 9:00 AM "
                       "traveling at 60 mph. Another train leaves Station B (300 miles away) "
                       "at 10:00 AM traveling toward Station A at 90 mph. At what time "
                       "do they meet, and how far from Station A?"
        }
    ]
)

# Process the response - it contains both thinking and text blocks
for block in response.content:
    if block.type == "thinking":
        print("=== THINKING ===")
        print(block.thinking)
        print("=== END THINKING ===\n")
    elif block.type == "text":
        print("=== ANSWER ===")
        print(block.text)
                        

2. Chain of Thought (CoT)

Chain of Thought Reasoning Flow
graph TD
    A[Complex Question] --> B[Step 1: Decompose Problem]
    B --> C[Step 2: Solve Sub-problem]
    C --> D[Step 3: Intermediate Result]
    D --> E[Step 4: Combine Results]
    E --> F[Final Answer]
    G[Standard Prompting] -->|Single step| F
    style A fill:#1a1a2e,stroke:#e94560,color:#fff
    style B fill:#1a1a2e,stroke:#f5a623,color:#fff
    style C fill:#1a1a2e,stroke:#f5a623,color:#fff
    style D fill:#1a1a2e,stroke:#7c4dff,color:#fff
    style F fill:#1a1a2e,stroke:#00d4aa,color:#fff
    style G fill:#1a1a2e,stroke:#e94560,color:#fff

Chain-of-Thought Prompting

Chain-of-Thought (CoT) prompting is a technique that encourages the model to break down complex problems into intermediate reasoning steps before arriving at a final answer. Introduced by Wei et al. (2022), it is one of the most impactful prompting techniques ever discovered.

Why CoT Works

LLMs generate tokens sequentially -- each token is influenced by all previous tokens. When a model is forced to "show its work," several things happen:

  • Decomposition: Complex problems are broken into manageable sub-problems.
  • Working Memory: Intermediate results are stored in the generated text, serving as "working memory" the model can reference.
  • Error Detection: The model can notice mistakes in earlier steps and correct them.
  • Correct Path Biasing: Each correct intermediate step steers subsequent token generation in the right direction.

Zero-Shot CoT

Simply add "Let's think step by step" to the prompt. Surprisingly effective.

Zero-Shot vs Standard Prompting

from openai import OpenAI

client = OpenAI()

def compare_standard_vs_cot(question: str):
    """Compare standard prompting vs chain-of-thought."""

    # Standard prompting
    standard_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
        temperature=0.0
    )

    # Zero-shot CoT
    cot_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"{question}\n\nLet's think step by step."
        }],
        temperature=0.0
    )

    print(f"Question: {question}")
    print(f"\n--- Standard ---")
    print(standard_response.choices[0].message.content[:300])
    print(f"\n--- Chain of Thought ---")
    print(cot_response.choices[0].message.content[:500])


# Test with problems that benefit from reasoning
compare_standard_vs_cot(
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

compare_standard_vs_cot(
    "If it takes 5 machines 5 minutes to make 5 widgets, "
    "how long would it take 100 machines to make 100 widgets?"
)
                        

Few-Shot CoT

Provide examples of step-by-step reasoning before asking the target question.

Few-Shot Chain of Thought

from openai import OpenAI

client = OpenAI()

few_shot_cot_prompt = """Solve each problem step by step.

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Let me think step by step.
1. Roger starts with 5 tennis balls.
2. He buys 2 cans, each containing 3 balls.
3. 2 cans x 3 balls = 6 new balls.
4. Total: 5 + 6 = 11 tennis balls.
The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A: Let me think step by step.
1. Starting apples: 23
2. Used for lunch: 23 - 20 = 3 remaining
3. Bought more: 3 + 6 = 9 apples
The answer is 9.

Q: {question}
A: Let me think step by step."""

def few_shot_cot(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": few_shot_cot_prompt.format(question=question)
        }],
        temperature=0.0
    )
    return response.choices[0].message.content

# Test
answer = few_shot_cot(
    "A store had 42 shirts. They sold 28 in the morning and received "
    "a shipment of 15 new shirts. Then they sold 9 more in the afternoon. "
    "How many shirts do they have at the end of the day?"
)
print(answer)
                        

Limitations of CoT

  • Uses more tokens: Reasoning steps add to the output length (and cost).
  • Not always needed: For simple tasks, CoT adds overhead without benefit.
  • Can be wrong confidently: The model can produce plausible-sounding but incorrect reasoning chains.
  • Model dependent: CoT works best with larger, more capable models. Small models may generate incoherent reasoning.

Self-Consistency: Sampling Multiple CoT Paths

Self-consistency (Wang et al., 2022) improves CoT by sampling multiple reasoning paths and taking the majority vote. The intuition: if a model arrives at the same answer via different reasoning paths, the answer is more likely correct.

Self-Consistency Implementation

from openai import OpenAI
from collections import Counter
import re

client = OpenAI()

def self_consistency(
    question: str,
    n_samples: int = 5,
    temperature: float = 0.7,
    model: str = "gpt-4o"
) -> dict:
    """
    Self-Consistency: Sample multiple chain-of-thought responses
    and take the majority vote on the final answer.
    """
    responses = []
    answers = []

    for i in range(n_samples):
        response = client.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "system",
                    "content": "Solve the problem step by step. End your response with "
                              "'The answer is: [X]' where X is your final numerical answer."
                },
                {"role": "user", "content": question}
            ],
            temperature=temperature,
            max_tokens=1000
        )
        text = response.choices[0].message.content
        responses.append(text)

        # Extract the final answer
        match = re.search(r"The answer is:?\s*\$?(\d+[\.,]?\d*)", text, re.I)
        if match:
            answer = match.group(1).replace(",", "")
            answers.append(answer)

    # Majority vote
    if answers:
        answer_counts = Counter(answers)
        majority_answer, count = answer_counts.most_common(1)[0]
        confidence = count / len(answers)
    else:
        majority_answer = "Could not extract answers"
        confidence = 0.0

    return {
        "question": question,
        "majority_answer": majority_answer,
        "confidence": confidence,
        "all_answers": answers,
        "answer_distribution": dict(Counter(answers)),
        "n_samples": n_samples,
        "agreement_rate": confidence,
    }


# Test
result = self_consistency(
    "A store sells apples for $2 each and oranges for $3 each. "
    "If Sarah buys 4 apples and 3 oranges, and she has a 10% discount coupon, "
    "how much does she pay in total?",
    n_samples=7
)

print(f"Question: {result['question']}")
print(f"Majority Answer: {result['majority_answer']}")
print(f"Confidence: {result['confidence']:.1%}")
print(f"Answer Distribution: {result['answer_distribution']}")
                        

Tree of Thoughts (ToT)

Tree of Thoughts (Yao et al., 2023) extends CoT by exploring multiple reasoning paths in a tree structure. At each step, the model generates several possible next thoughts, evaluates them, and pursues the most promising ones. It is like a chess player considering multiple moves ahead.

Simplified Tree of Thoughts

from openai import OpenAI
import json

client = OpenAI()

def tree_of_thoughts(
    problem: str,
    n_branches: int = 3,
    max_depth: int = 3,
    model: str = "gpt-4o"
) -> dict:
    """
    Simplified Tree of Thoughts:
    1. Generate multiple possible first steps
    2. Evaluate each step (promising? dead end?)
    3. Continue the most promising paths
    4. Select the best final answer
    """

    def generate_thoughts(problem: str, current_path: str, n: int) -> list[str]:
        """Generate n possible next thoughts/steps."""
        prompt = f"""Problem: {problem}

Current reasoning path:
{current_path if current_path else "(Starting fresh)"}

Generate {n} DIFFERENT possible next reasoning steps. Each should take a distinct approach.
Return JSON: {{"thoughts": ["thought1", "thought2", ...]}}"""

        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a problem solver exploring different reasoning paths."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.8,
            response_format={"type": "json_object"}
        )
        result = json.loads(response.choices[0].message.content)
        return result.get("thoughts", [])

    def evaluate_thought(problem: str, path: str) -> float:
        """Evaluate how promising a reasoning path is (0-1)."""
        prompt = f"""Problem: {problem}

Reasoning path so far:
{path}

Rate this reasoning path on a scale of 0.0 to 1.0:
- 1.0: Clearly on the right track, likely to reach the correct answer
- 0.5: Possible but uncertain
- 0.0: Wrong approach or contains errors

Return JSON: {{"score": <number from 0.0 to 1.0>, "reasoning": "brief explanation"}}"""

        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a critical evaluator of reasoning quality."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.0,
            response_format={"type": "json_object"}
        )
        result = json.loads(response.choices[0].message.content)
        return result.get("score", 0.5)

    # BFS-style tree search
    paths = [{"path": "", "score": 1.0}]

    for depth in range(max_depth):
        print(f"\n--- Depth {depth + 1} ---")
        new_paths = []

        for current in paths[:n_branches]:  # Keep top-n paths
            thoughts = generate_thoughts(problem, current["path"], n_branches)

            for thought in thoughts:
                new_path = f"{current['path']}\nStep {depth + 1}: {thought}"
                score = evaluate_thought(problem, new_path)
                new_paths.append({"path": new_path, "score": score})
                print(f"  Path (score={score:.2f}): ...{thought[:80]}...")

        # Keep the best paths
        paths = sorted(new_paths, key=lambda x: x["score"], reverse=True)

    # Get the best path and generate final answer
    best_path = paths[0]

    final_response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a problem solver. Based on the best reasoning path, provide a clear final answer."},
            {"role": "user", "content": f"Problem: {problem}\n\nBest reasoning path:\n{best_path['path']}\n\nNow provide the final answer."}
        ],
        temperature=0.0
    )

    return {
        "problem": problem,
        "best_path": best_path["path"],
        "best_score": best_path["score"],
        "final_answer": final_response.choices[0].message.content,
        "paths_explored": len(paths),
    }


# Test
result = tree_of_thoughts(
    "I have 6 horses and want to find the 3 fastest using a racetrack "
    "that fits only 3 horses at a time. What is the minimum number of races needed?",
    n_branches=3,
    max_depth=2
)
print(f"\nFinal Answer: {result['final_answer']}")
                        

3. RLHF Deep Dive

RLHF Training Pipeline
graph LR
    A[Pre-trained LLM] --> B[Stage 1: SFT]
    B -->|Fine-tune on demos| C[SFT Model]
    C --> D[Stage 2: Reward Model]
    D -->|Train on preferences| E[Reward Model]
    C --> F[Stage 3: PPO]
    E --> F
    F -->|Optimize policy| G[RLHF-Aligned Model]
    style A fill:#1a1a2e,stroke:#e94560,color:#fff
    style B fill:#1a1a2e,stroke:#f5a623,color:#fff
    style D fill:#1a1a2e,stroke:#7c4dff,color:#fff
    style F fill:#1a1a2e,stroke:#00d4aa,color:#fff
    style G fill:#1a1a2e,stroke:#e94560,color:#fff

The RLHF Pipeline

Reinforcement Learning from Human Feedback (RLHF) is the training technique that turns a raw language model into a helpful, harmless, and honest assistant. It was a key breakthrough that made ChatGPT possible. The full pipeline has three stages:

Stage 1: Supervised Fine-Tuning (SFT)

Start with a pre-trained base model and fine-tune it on high-quality demonstrations. Human annotators write ideal responses for a set of prompts. This teaches the model the basic format and behavior expected of an assistant.

SFT Data Format

# SFT training data: human-written ideal responses
sft_examples = [
    {
        "messages": [
            {"role": "user", "content": "Explain quantum computing in simple terms."},
            {"role": "assistant", "content": "Quantum computing uses quantum mechanical phenomena like superposition and entanglement to process information. Unlike classical computers that use bits (0 or 1), quantum computers use qubits which can be both 0 and 1 simultaneously. This allows them to explore many solutions at once, making them potentially much faster for certain types of problems like cryptography, drug discovery, and optimization. However, quantum computers are extremely delicate and currently only work at near absolute zero temperatures."}
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "Write a haiku about programming."},
            {"role": "assistant", "content": "Code flows like water\nBugs hide in the tangled depths\nTests bring the sunlight"}
        ]
    },
]

# SFT is essentially supervised learning:
# Loss = -log P(correct_response | prompt)
# The model learns to predict the human-written response given the prompt
                        
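The loss in the comment above can be made concrete: SFT is token-level cross-entropy over the assistant's response, with the prompt tokens masked out of the loss. A minimal sketch with toy logits (the label value -100 is the conventional ignore index in PyTorch and Hugging Face):

```python
import torch
import torch.nn.functional as F

# Toy setup: vocabulary of 10 tokens, a sequence of 6 positions.
vocab_size, seq_len = 10, 6
logits = torch.randn(1, seq_len, vocab_size)   # model outputs at each position
targets = torch.tensor([[3, 7, 1, 4, 2, 9]])   # next-token labels

# Mask the prompt: only the response (the last 3 tokens here) is supervised.
labels = targets.clone()
labels[:, :3] = -100  # -100 positions are ignored by cross_entropy

# Loss = -log P(response tokens | preceding context), averaged over unmasked tokens
loss = F.cross_entropy(
    logits.view(-1, vocab_size),
    labels.view(-1),
    ignore_index=-100,
)
print(loss.item())
```

The prompt mask is what distinguishes SFT from plain language-model pretraining: the model is graded only on the response it is supposed to imitate.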

Stage 2: Reward Model Training

Train a separate model to predict human preferences. Humans compare pairs of responses and indicate which is better. The reward model learns to score responses based on these preferences.

Reward Model Concept

"""
Reward Model Training (Conceptual)
====================================
The reward model learns human preferences using the Bradley-Terry model.
"""

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """
    A reward model predicts a scalar "reward" for a (prompt, response) pair.
    Higher reward = response more preferred by humans.

    Architecture: Same as the LLM but with a scalar head instead of a vocabulary head.
    """
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model          # Pre-trained LLM backbone
        self.reward_head = nn.Linear(         # Maps hidden state to scalar reward
            base_model.config.hidden_size, 1
        )

    def forward(self, input_ids, attention_mask):
        # Get the last hidden state from the base model
        outputs = self.base_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True
        )
        # Use the last token's hidden state
        last_hidden = outputs.hidden_states[-1][:, -1, :]
        reward = self.reward_head(last_hidden).squeeze(-1)
        return reward


def bradley_terry_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """
    Bradley-Terry Loss for preference learning.

    Given a pair where response_chosen > response_rejected (as judged by humans):
    Loss = -log(sigmoid(reward_chosen - reward_rejected))

    This pushes the reward model to assign higher rewards to preferred responses.

    The intuition: if the reward model correctly ranks the chosen response higher,
    the sigmoid is close to 1 and the loss is close to 0. If it gets it wrong,
    the loss is high.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


# Training data format: pairs of (prompt, chosen_response, rejected_response)
preference_data = [
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris. It has been the capital since the late 10th century and is the country's largest city with a population of about 2.1 million in the city proper.",
        "rejected": "Paris. It is a city in France."
    },
    {
        "prompt": "How do I learn Python?",
        "chosen": "Here is a structured approach to learning Python:\n1. Start with basics: variables, data types, and control flow\n2. Practice with small projects\n3. Learn libraries: NumPy, Pandas for data science\n4. Build real projects\n5. Read other people's code on GitHub",
        "rejected": "Just Google it and practice. Python is easy, you'll figure it out."
    }
]

# Conceptual training loop. `tokenize` stands in for a real tokenizer that
# returns the input_ids and attention_mask that RewardModel.forward expects.
def train_reward_model(model, preference_data, optimizer, epochs=3):
    for epoch in range(epochs):
        total_loss = 0
        for pair in preference_data:
            # Score prompt + chosen and prompt + rejected
            chosen_reward = model(**tokenize(pair["prompt"] + pair["chosen"]))
            rejected_reward = model(**tokenize(pair["prompt"] + pair["rejected"]))

            loss = bradley_terry_loss(chosen_reward, rejected_reward)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            total_loss += loss.item()

        print(f"Epoch {epoch+1}: Loss = {total_loss / len(preference_data):.4f}")
                        

Stage 3: PPO (Proximal Policy Optimization)

Use the reward model to fine-tune the SFT model with reinforcement learning. The LLM generates responses, the reward model scores them, and PPO updates the LLM to generate higher-reward responses.

PPO for LLMs (Conceptual)

"""
PPO for Language Models (Conceptual)
=====================================
Proximal Policy Optimization applied to LLM fine-tuning.
"""

import torch

# The PPO objective for language models:
#
# L_PPO = E[ min(r(θ) * A, clip(r(θ), 1-ε, 1+ε) * A) ] - β * KL(π_θ || π_ref)
#
# Where:
# - r(θ) = π_θ(a|s) / π_old(a|s)  -- probability ratio (new policy / old policy)
# - A = advantage = R(response) - baseline  -- how much better than expected
# - ε = clip range (typically 0.2)  -- prevents too-large policy updates
# - β = KL penalty coefficient  -- keeps the model close to the reference
# - π_ref = the original SFT model  -- prevents model from diverging too far

def ppo_step_conceptual(
    policy_model,      # Current LLM being trained
    reference_model,   # Original SFT model (frozen)
    reward_model,      # Trained reward model (frozen)
    prompts,           # Batch of prompts
    optimizer,
    clip_epsilon=0.2,
    kl_penalty=0.1,
):
    """One step of PPO training for an LLM."""

    # 1. Generate responses with current policy
    responses = policy_model.generate(prompts)

    # 2. Get rewards from the reward model
    rewards = reward_model.score(prompts, responses)

    # 3. Calculate log probabilities under current, old, and reference policies.
    # In a real implementation, log_probs_old is cached at generation time
    # (before this update step); it is shown here as a detached recomputation.
    log_probs_current = policy_model.log_prob(prompts, responses)
    log_probs_old = policy_model.log_prob(prompts, responses).detach()
    log_probs_ref = reference_model.log_prob(prompts, responses)  # Frozen SFT reference

    # 4. Calculate the probability ratio
    ratio = torch.exp(log_probs_current - log_probs_old)

    # 5. Calculate advantage (simplified: reward - baseline)
    baseline = rewards.mean()
    advantages = rewards - baseline

    # 6. PPO clipped objective
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # 7. KL divergence penalty (keep close to reference model)
    kl_div = (log_probs_current - log_probs_ref).mean()
    kl_loss = kl_penalty * kl_div

    # 8. Total loss
    total_loss = policy_loss + kl_loss

    # 9. Update the policy model
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

    return {
        "policy_loss": policy_loss.item(),
        "kl_div": kl_div.item(),
        "mean_reward": rewards.mean().item(),
        "total_loss": total_loss.item(),
    }


"""
Key Points About RLHF:

1. WHY the KL penalty matters:
   Without it, the model would "hack" the reward model -- finding degenerate
   outputs that score high rewards but are nonsensical or harmful. The KL
   penalty keeps the model close to the original SFT model, which acts as
   an anchor for coherent language.

2. Problems with RLHF:
   - Reward Hacking: The model exploits reward model weaknesses (e.g.,
     generating longer responses because the RM prefers verbose answers)
   - Mode Collapse: The model converges to a narrow set of "safe" responses
   - Expensive: Requires training 3 separate models (SFT, RM, PPO)
   - Unstable: PPO training can be finicky and hard to tune
   - Human labeling bias: The reward model inherits the biases of annotators

3. Why the field moved toward DPO:
   DPO eliminates the reward model entirely and is much simpler to train.
"""
                        

4. DPO (Direct Preference Optimization)

Why DPO? Simpler Than RLHF

Direct Preference Optimization (Rafailov et al., 2023) was a breakthrough that dramatically simplified the RLHF pipeline. The key insight: you don't need a separate reward model. The preference data can be used to directly optimize the language model.

DPO reformulates the RLHF objective so that the optimal policy can be derived in closed form from the reward model. This means you can skip reward model training and PPO entirely, and instead optimize a simple classification-like loss on preference pairs.
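The closed-form step can be sketched as follows (following the argument in the DPO paper): the KL-regularized RLHF objective has an optimal policy of known form, and inverting it expresses the reward in terms of the policy itself.

```latex
% Optimal policy of the KL-regularized RLHF objective:
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
    \exp\!\left(\tfrac{1}{\beta}\, r(x, y)\right)

% Solving for the reward in terms of the policy:
r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
    \;+\; \beta \log Z(x)
```

The intractable partition function Z(x) cancels when this reward is plugged into the Bradley-Terry preference probability (it appears in both the chosen and rejected terms), which is exactly what leaves the simple pairwise loss implemented below.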

The DPO Loss Function

DPO Loss - Mathematical and Code

"""
Direct Preference Optimization (DPO)
======================================
Aligns LLMs with human preferences without a separate reward model.
"""

import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log P(chosen | prompt) under current policy
    policy_rejected_logps: torch.Tensor,  # log P(rejected | prompt) under current policy
    ref_chosen_logps: torch.Tensor,       # log P(chosen | prompt) under reference model
    ref_rejected_logps: torch.Tensor,     # log P(rejected | prompt) under reference model
    beta: float = 0.1                      # Temperature parameter
) -> tuple[torch.Tensor, dict]:
    """
    DPO Loss Function:

    L_DPO = -log sigmoid(β * (log π_θ(y_w|x)/π_ref(y_w|x) - log π_θ(y_l|x)/π_ref(y_l|x)))

    Where:
    - π_θ = current policy (model being trained)
    - π_ref = reference model (frozen SFT model)
    - y_w = preferred (chosen/winning) response
    - y_l = dispreferred (rejected/losing) response
    - x = prompt
    - β = temperature (controls how far from reference model)

    Intuition:
    - The loss encourages the model to increase the probability of chosen responses
      and decrease the probability of rejected responses, RELATIVE to the reference model.
    - The reference model term prevents the model from diverging too far.
    - Higher β means stronger optimization (more divergence from reference allowed).
    """
    # Calculate log-ratios (how much the policy differs from reference)
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # DPO loss
    logits = beta * (chosen_logratios - rejected_logratios)
    loss = -F.logsigmoid(logits).mean()

    # Useful metrics
    chosen_rewards = beta * chosen_logratios.detach()
    rejected_rewards = beta * rejected_logratios.detach()
    reward_margin = (chosen_rewards - rejected_rewards).mean()

    return loss, {
        "loss": loss.item(),
        "reward_margin": reward_margin.item(),
        "chosen_reward": chosen_rewards.mean().item(),
        "rejected_reward": rejected_rewards.mean().item(),
        "accuracy": (chosen_rewards > rejected_rewards).float().mean().item(),
    }


"""
DPO vs RLHF Comparison:

| Aspect              | RLHF                          | DPO                       |
|---------------------|-------------------------------|---------------------------|
| Pipeline            | SFT -> RM -> PPO (3 stages)  | SFT -> DPO (2 stages)    |
| Reward Model        | Required (separate model)     | Not needed                |
| RL Training         | Yes (PPO, complex)            | No (supervised-like)      |
| Stability           | Often unstable                | Much more stable          |
| Compute Cost        | High (3 models in memory)     | Lower (2 models)          |
| Hyperparameters     | Many (PPO + KL + reward)      | Few (mainly β)            |
| Performance         | Strong                        | Comparable or better      |
| Implementation      | Complex                       | ~50 lines of training code|

DPO Variants:
- IPO (Identity Preference Optimization): More robust to noisy labels
- KTO (Kahneman-Tversky Optimization): Works with unpaired data (just good/bad, no pairs needed)
- ORPO (Odds Ratio Preference Optimization): Combines SFT and alignment in one step
- SimPO (Simple Preference Optimization): Reference-model-free DPO
"""
                        

PRACTICAL: Fine-Tune a Model with DPO Using TRL

DPO Training with TRL (Hugging Face)

# pip install trl transformers datasets peft accelerate bitsandbytes

"""
DPO Fine-Tuning with TRL
=========================
Fine-tune a language model using DPO with the Hugging Face TRL library.
"""

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

# ============================
# Step 1: Prepare the Dataset
# ============================

# DPO requires paired preference data: (prompt, chosen, rejected)
preference_data = {
    "prompt": [
        "Explain machine learning to a 5-year-old.",
        "What is the best programming language?",
        "How do I make pasta?",
        "What causes earthquakes?",
    ],
    "chosen": [
        "Imagine you have a robot toy. At first, it doesn't know anything. "
        "But every time you show it a picture of a cat and say 'cat', it remembers "
        "a little bit. After seeing lots of cats, it can recognize cats on its own! "
        "That's machine learning - teaching computers by showing them lots of examples.",

        "There isn't a single 'best' programming language - it depends on what you want "
        "to do. Python is great for data science and AI. JavaScript is essential for web "
        "development. Rust is excellent for systems programming with safety guarantees. "
        "I'd recommend starting with Python for its readability and versatility.",

        "Here's a simple pasta recipe:\n1. Boil a large pot of salted water\n"
        "2. Add pasta and cook for 8-10 minutes until al dente\n"
        "3. While waiting, heat olive oil in a pan\n"
        "4. Add garlic and your favorite sauce\n"
        "5. Drain pasta, toss with sauce, and serve\n"
        "The key is salting your water well and not overcooking the pasta.",

        "Earthquakes occur when tectonic plates (large slabs of Earth's crust) "
        "move against each other. Stress builds up along fault lines where plates meet. "
        "When the stress exceeds the friction holding them together, the plates suddenly "
        "slip, releasing energy as seismic waves that we feel as shaking.",
    ],
    "rejected": [
        "Machine learning is a subset of artificial intelligence that involves "
        "training algorithms on data to make predictions or decisions.",

        "Python is the best programming language.",

        "You cook pasta by boiling water and putting the pasta in it.",

        "Earthquakes happen because the ground shakes. They can be very dangerous "
        "and cause a lot of damage to buildings.",
    ]
}

dataset = Dataset.from_dict(preference_data)

# ================================
# Step 2: Load Model and Tokenizer
# ================================

model_name = "meta-llama/Llama-3.2-1B-Instruct"  # Using a smaller model for demo

# Quantize to 4-bit for memory efficiency
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# ==============================
# Step 3: Configure LoRA for PEFT
# ==============================

peft_config = LoraConfig(
    r=16,                       # LoRA rank
    lora_alpha=32,              # LoRA scaling factor
    lora_dropout=0.05,          # Dropout for LoRA layers
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# ============================
# Step 4: Configure DPO Training
# ============================

training_args = DPOConfig(
    output_dir="./dpo-output",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    beta=0.1,                   # DPO temperature parameter
    loss_type="sigmoid",        # Standard DPO loss
    logging_steps=10,
    save_strategy="epoch",
    remove_unused_columns=False,
    bf16=True,                  # Use bfloat16 if your GPU supports it
)

# ============================
# Step 5: Initialize DPO Trainer
# ============================

trainer = DPOTrainer(
    model=model,
    ref_model=None,             # With PEFT, the base model serves as reference
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)

# ============================
# Step 6: Train!
# ============================

print("Starting DPO training...")
trainer.train()

print("Saving model...")
trainer.save_model("./dpo-model-final")

# ============================
# Step 7: Inference with the DPO model
# ============================

from peft import PeftModel

# Load the base model and merge the LoRA adapters
base_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
model = PeftModel.from_pretrained(base_model, "./dpo-model-final")
model = model.merge_and_unload()

# Generate
prompt = "Explain quantum computing to a beginner."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
                        

5. DeepSeek R1 and Open Reasoning Models

How DeepSeek R1 Was Trained

DeepSeek R1, released in January 2025, was a landmark in open-source AI. It demonstrated that reasoning capabilities comparable to OpenAI's o1 could be achieved with a novel training approach. Key innovations:

R1-Zero: Pure RL Without SFT

The most remarkable finding was R1-Zero -- a model trained purely with reinforcement learning, without any supervised fine-tuning. Starting from the base DeepSeek-V3 model, they applied only RL (using GRPO) with a reward based on correctness of answers to math and coding problems.

What emerged was extraordinary: the model spontaneously developed reasoning behaviors that were never explicitly taught:

  • Self-verification: The model started checking its own work ("Wait, let me verify this...")
  • Reflection: The model would reconsider its approach ("Hmm, this doesn't seem right. Let me try a different method.")
  • Extended thinking: The model naturally began generating longer reasoning chains for harder problems
  • Problem decomposition: Breaking complex problems into sub-problems

This was a profound finding: reasoning can emerge purely from reinforcement learning, without needing humans to demonstrate step-by-step reasoning.

GRPO (Group Relative Policy Optimization)

Instead of PPO, DeepSeek used GRPO, a simpler RL algorithm:

GRPO Conceptual Overview

"""
GRPO (Group Relative Policy Optimization)
==========================================
DeepSeek's RL algorithm for training reasoning models.
Simpler than PPO -- no separate reward model or value function needed.
"""

def grpo_conceptual(policy_model, prompts, reward_function, group_size=8):
    """
    GRPO Algorithm (Conceptual):

    1. For each prompt, sample a GROUP of responses (e.g., 8)
    2. Score each response with the reward function
    3. Normalize rewards within the group (relative ranking)
    4. Update the policy: increase probability of higher-reward responses,
       decrease probability of lower-reward ones

    Key difference from PPO:
    - No value function (critic) needed
    - No separate reward model -- uses rule-based rewards
    - Rewards are normalized within each group, making training stable

    Reward function for R1:
    - Math problems: 1.0 if final answer is correct, 0.0 otherwise
    - Code problems: 1.0 if code passes all test cases, 0.0 otherwise
    - Format reward: small bonus for correct formatting (e.g., using <think> tags)
    """
    for prompt in prompts:
        # Step 1: Sample multiple responses from current policy
        responses = [policy_model.generate(prompt) for _ in range(group_size)]

        # Step 2: Score each response
        rewards = [reward_function(prompt, resp) for resp in responses]

        # Step 3: Normalize rewards within the group
        mean_reward = sum(rewards) / len(rewards)
        std_reward = (sum((r - mean_reward)**2 for r in rewards) / len(rewards)) ** 0.5
        normalized_rewards = [(r - mean_reward) / (std_reward + 1e-8) for r in rewards]

        # Step 4: Policy gradient update
        # Increase probability of responses with positive normalized rewards,
        # decrease probability of responses with negative normalized rewards
        policy_model.optimizer.zero_grad()
        for response, norm_reward in zip(responses, normalized_rewards):
            log_prob = policy_model.log_prob(prompt, response)
            loss = -log_prob * norm_reward  # REINFORCE-style gradient
            loss.backward()
        policy_model.optimizer.step()  # one update per prompt group


"""
The R1 Training Pipeline:

1. Start with DeepSeek-V3 base model (671B parameters, MoE)

2. Cold Start Stage:
   - Small amount of SFT data with long CoT reasoning examples
   - Teaches the model the format: <think>reasoning</think> followed by the final answer

3. RL Stage (the main training):
   - GRPO with rule-based rewards
   - Math: Correct answer = 1.0, wrong = 0.0
   - Code: Pass test cases = 1.0, fail = 0.0
   - Format: Correct use of <think> tags = small bonus
   - Train for many steps, gradually increasing problem difficulty

4. Rejection Sampling + SFT:
   - Use the RL model to generate many reasoning traces
   - Keep only the ones with correct answers
   - SFT on these good traces to improve coherence and readability

5. Final RL Stage:
   - Another round of GRPO
   - Also adds human preference rewards for helpfulness and safety
   - Balances reasoning ability with general assistant capabilities

Result: R1 matches or exceeds o1 on math, code, and science benchmarks,
while being fully open-source (including weights).
"""
                        

Distillation: Smaller Reasoning Models

DeepSeek distilled R1's reasoning capabilities into smaller models (1.5B, 7B, 8B, 14B, 32B, 70B parameters). The process:

  1. Use R1 to generate reasoning traces for a large set of problems
  2. Filter for correct answers only
  3. Fine-tune smaller models (like Llama 3 or Qwen) on these traces

The distilled models are remarkably capable -- the 32B distilled model outperforms many much larger models on reasoning benchmarks, making reasoning accessible on consumer hardware.
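The three distillation steps can be sketched as a generate-filter-collect loop. The `teacher_generate` and `check_answer` callables below are hypothetical stand-ins for the R1 API and an answer verifier:

```python
def build_distillation_set(problems, teacher_generate, check_answer, n_samples=4):
    """Sample up to n_samples teacher traces per problem; keep the first correct one."""
    sft_examples = []
    for prob in problems:
        for _ in range(n_samples):
            trace = teacher_generate(prob["question"])
            if check_answer(trace, prob["answer"]):
                sft_examples.append({"prompt": prob["question"], "completion": trace})
                break  # one verified trace per problem is enough for SFT
    return sft_examples

# Toy stand-ins for a real teacher model and answer checker
problems = [{"question": "2+2?", "answer": "4"}]
teacher = lambda q: "Let me compute... 4"
check = lambda trace, ans: trace.strip().endswith(ans)
print(build_distillation_set(problems, teacher, check))
```

The resulting (prompt, completion) pairs then go through a standard SFT run on the smaller student model.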

Open-Source Reasoning Model Landscape (March 2026)

  • DeepSeek R1: 671B MoE, strongest open reasoning model. Fully open weights.
  • DeepSeek R1 Distilled: Available in 1.5B to 70B sizes. Great for local deployment.
  • Qwen QwQ: Alibaba's reasoning model series, competitive performance.
  • Llama Reasoning variants: Community fine-tunes of Llama models with reasoning capabilities.
  • Phi Reasoning: Microsoft's small reasoning models.

6. How to Use Reasoning Models Effectively

When to Use Reasoning Models vs Regular Models

| Use Reasoning Models For           | Use Regular Models For               |
|------------------------------------|--------------------------------------|
| Complex math and logic problems    | Simple Q&A and information retrieval |
| Multi-step coding challenges       | Text classification and extraction   |
| Scientific analysis and reasoning  | Translation and summarization        |
| Complex planning and strategy      | Creative writing                     |
| Tasks that need verification       | Conversational chat                  |
| Problems with multiple constraints | Simple tool calling                  |

Prompting Tips for Reasoning Models

  • Do NOT use Chain-of-Thought prompts. Reasoning models do CoT internally. Adding "think step by step" is redundant and can actually hurt performance by conflicting with the model's own reasoning process.
  • Be specific about what you want. State the problem clearly, provide all constraints, and specify the desired output format.
  • Let the model think. Don't impose a specific reasoning structure. The model's internal reasoning process is often more effective than any manually specified approach.
  • Use longer context for harder problems. Increase the max_completion_tokens to give the model more thinking room.

Cost Considerations

Reasoning models are significantly more expensive than regular models because of thinking tokens:

Cost Comparison: Regular vs Reasoning Models

"""
Cost Analysis: Regular vs Reasoning Models
=============================================
"""

# Approximate pricing (March 2026)
PRICING = {
    "gpt-4o": {
        "input": 2.50,      # per 1M tokens
        "output": 10.00,
    },
    "gpt-4o-mini": {
        "input": 0.15,
        "output": 0.60,
    },
    "o3": {
        "input": 10.00,
        "output": 40.00,    # Includes thinking tokens
        "avg_thinking_tokens": 5000,  # Average thinking tokens per request
    },
    "o3-mini": {
        "input": 1.10,
        "output": 4.40,
        "avg_thinking_tokens": 2000,
    },
    "claude-sonnet-4": {
        "input": 3.00,
        "output": 15.00,
    },
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int,
                  thinking_tokens: int = 0) -> float:
    """Estimate cost for a single request."""
    pricing = PRICING.get(model, {})
    cost = (
        input_tokens * pricing.get("input", 5) / 1_000_000 +
        (output_tokens + thinking_tokens) * pricing.get("output", 15) / 1_000_000
    )
    return cost


def cost_comparison(question_complexity: str = "hard"):
    """Compare costs across models for different complexity levels."""

    scenarios = {
        "simple": {"input": 100, "output": 200, "thinking": 500},
        "medium": {"input": 500, "output": 500, "thinking": 3000},
        "hard": {"input": 1000, "output": 1000, "thinking": 10000},
    }

    s = scenarios[question_complexity]
    print(f"\nCost comparison for {question_complexity} question:")
    print(f"(Input: {s['input']} tokens, Output: {s['output']} tokens)")
    print(f"{'='*50}")

    models_to_compare = [
        ("gpt-4o-mini", 0),
        ("gpt-4o", 0),
        ("o3-mini", s["thinking"] // 2),
        ("o3", s["thinking"]),
    ]

    for model, thinking in models_to_compare:
        cost = estimate_cost(model, s["input"], s["output"], thinking)
        thinking_str = f" + {thinking} thinking" if thinking else ""
        print(f"  {model:15s}: ${cost:.4f}{thinking_str}")


cost_comparison("simple")
cost_comparison("medium")
cost_comparison("hard")

"""
Key Takeaways on Cost:
1. o3 can be 10-50x more expensive than gpt-4o for the same input/output
2. The cost comes from thinking tokens, which can be thousands per request
3. For simple tasks, reasoning models are massive overkill
4. Strategy: Use routing to send only complex tasks to reasoning models

Recommended approach:
- gpt-4o-mini: Default for simple tasks (classification, extraction, formatting)
- gpt-4o / claude-sonnet: Medium complexity (analysis, coding, creative)
- o3 / o3-mini: Only for genuinely hard reasoning (math proofs, complex code, planning)
"""
                        

PRACTICAL: Compare Regular vs Reasoning Models

Benchmarking Regular vs Reasoning Models

"""
Compare Regular vs Reasoning Models on Various Tasks
=====================================================
"""

import time
import json
from openai import OpenAI

client = OpenAI()


def benchmark_models(
    question: str,
    models: list[dict],
    expected_answer: str | None = None
) -> list[dict]:
    """Benchmark multiple models on the same question."""
    results = []

    for model_config in models:
        model_name = model_config["model"]
        print(f"\n{'='*50}")
        print(f"Model: {model_name}")

        start_time = time.time()

        kwargs = {
            "model": model_name,
            "messages": [{"role": "user", "content": question}],
        }

        # Add reasoning-specific params
        if model_config.get("reasoning_effort"):
            kwargs["reasoning_effort"] = model_config["reasoning_effort"]

        if model_config.get("max_completion_tokens"):
            kwargs["max_completion_tokens"] = model_config["max_completion_tokens"]
        else:
            kwargs["max_tokens"] = 2000

        response = client.chat.completions.create(**kwargs)

        elapsed = time.time() - start_time
        answer = response.choices[0].message.content

        thinking_tokens = 0
        if hasattr(response.usage, "completion_tokens_details") and response.usage.completion_tokens_details:
            thinking_tokens = getattr(response.usage.completion_tokens_details, "reasoning_tokens", 0) or 0

        result = {
            "model": model_name,
            "answer": answer[:500],
            "latency_seconds": elapsed,
            "input_tokens": response.usage.prompt_tokens,
            "output_tokens": response.usage.completion_tokens,
            "thinking_tokens": thinking_tokens,
            "total_tokens": response.usage.total_tokens,
        }

        print(f"  Latency: {elapsed:.1f}s")
        print(f"  Tokens: {result['total_tokens']} (thinking: {thinking_tokens})")
        print(f"  Answer: {answer[:200]}...")

        results.append(result)

    return results


# Define test cases
test_cases = [
    {
        "name": "Simple factual question",
        "question": "What is the capital of Japan?",
        "expected": "Tokyo",
    },
    {
        "name": "Math reasoning",
        "question": "A farmer has 17 sheep. All but 9 die. How many sheep are left?",
        "expected": "9",
    },
    {
        "name": "Complex logic puzzle",
        "question": (
            "Three people check into a hotel room that costs $30. They each pay $10. "
            "The manager realizes the room should cost $25 and gives $5 to the bellboy "
            "to return. The bellboy keeps $2 and gives $1 back to each person. "
            "Each person paid $9 (total $27) and the bellboy has $2. "
            "$27 + $2 = $29. Where is the missing dollar?"
        ),
        "expected": "There is no missing dollar. The $27 includes the $2 the bellboy kept.",
    },
    {
        "name": "Coding challenge",
        "question": (
            "Write a Python function to find the longest palindromic substring in a string. "
            "Use Manacher's algorithm for O(n) time complexity. Include the full algorithm "
            "with explanations."
        ),
        "expected": None,
    },
]

# Models to compare
models = [
    {"model": "gpt-4o-mini"},
    {"model": "gpt-4o"},
    {"model": "o3-mini", "reasoning_effort": "medium", "max_completion_tokens": 8000},
]

# Run benchmarks
for test in test_cases:
    print(f"\n{'#'*60}")
    print(f"TEST: {test['name']}")
    print(f"{'#'*60}")
    print(f"Question: {test['question'][:100]}...")

    results = benchmark_models(test["question"], models, test.get("expected"))

    # Summary
    print(f"\n--- Summary for '{test['name']}' ---")
    for r in results:
        print(f"  {r['model']:20s} | {r['latency_seconds']:5.1f}s | {r['total_tokens']:6d} tokens | {r['thinking_tokens']:5d} thinking")
                        

PRACTICAL: Smart Model Routing

Route Queries to Appropriate Models

"""
Smart Model Router
==================
Route queries to the cheapest model that can handle them well.
"""

from openai import OpenAI
import json

client = OpenAI()

class SmartRouter:
    """Route queries to the optimal model based on complexity."""

    def __init__(self):
        self.classifier_model = "gpt-4o-mini"  # Cheap model for classification

    def classify_complexity(self, query: str) -> dict:
        """Classify query complexity to determine the best model."""
        response = client.chat.completions.create(
            model=self.classifier_model,
            messages=[
                {
                    "role": "system",
                    "content": """Classify this query's complexity for model routing.

Categories:
- "simple": Factual lookup, simple Q&A, formatting, translation. Any fast model works.
- "medium": Analysis, summarization, coding tasks, creative writing. Needs a capable model.
- "hard": Complex math, multi-step logic, proofs, hard coding, planning with constraints. Needs a reasoning model.

Return JSON: {"complexity": "simple|medium|hard", "reasoning": "brief explanation"}"""
                },
                {"role": "user", "content": query}
            ],
            response_format={"type": "json_object"},
            temperature=0.0,
            max_tokens=100
        )
        return json.loads(response.choices[0].message.content)

    def route(self, query: str) -> dict:
        """Route a query to the optimal model."""
        classification = self.classify_complexity(query)
        complexity = classification.get("complexity", "medium")

        model_config = {
            "simple": {
                "model": "gpt-4o-mini",
                "max_tokens": 1000,
                "temperature": 0.3,
            },
            "medium": {
                "model": "gpt-4o",
                "max_tokens": 2000,
                "temperature": 0.5,
            },
            "hard": {
                "model": "o3-mini",
                "reasoning_effort": "high",
                "max_completion_tokens": 10000,
            },
        }

        config = model_config[complexity]
        print(f"[Router] Complexity: {complexity} -> Model: {config['model']}")
        print(f"[Router] Reasoning: {classification.get('reasoning', '')}")

        # Make the routed call
        messages = [{"role": "user", "content": query}]

        kwargs = {"model": config["model"], "messages": messages}
        if "reasoning_effort" in config:
            kwargs["reasoning_effort"] = config["reasoning_effort"]
            kwargs["max_completion_tokens"] = config.get("max_completion_tokens", 8000)
        else:
            kwargs["max_tokens"] = config.get("max_tokens", 2000)
            kwargs["temperature"] = config.get("temperature", 0.5)

        response = client.chat.completions.create(**kwargs)

        return {
            "answer": response.choices[0].message.content,
            "model_used": config["model"],
            "complexity": complexity,
            "tokens": response.usage.total_tokens,
        }


# Usage
router = SmartRouter()

queries = [
    "What is 2 + 2?",                                              # simple
    "Summarize the key differences between REST and GraphQL APIs",  # medium
    "Prove that there are infinitely many prime numbers",           # hard
]

for q in queries:
    print(f"\n{'='*60}")
    print(f"Query: {q}")
    result = router.route(q)
    print(f"Model: {result['model_used']} ({result['complexity']})")
    print(f"Tokens: {result['tokens']}")
    print(f"Answer: {result['answer'][:200]}...")
                        

Week 12 Summary

Key Takeaways

  • Reasoning models (o3, DeepSeek R1, Claude Extended Thinking) use "thinking tokens" to solve complex problems step-by-step, analogous to human System 2 thinking.
  • Test-time compute scaling is the key innovation: spending more compute at inference time (more thinking tokens) leads to better answers on hard problems.
  • Chain-of-Thought (CoT) prompting encourages models to show their work. Self-consistency (multiple samples + majority vote) and Tree of Thoughts (branching exploration) extend this further.
  • RLHF (SFT + Reward Model + PPO) was the original alignment technique. It is effective but complex and unstable.
  • DPO simplified alignment to a single training loss on preference data, eliminating the reward model and PPO. Variants like KTO and ORPO simplify it further.
  • DeepSeek R1 showed that reasoning can emerge from pure RL (GRPO) without supervised reasoning demonstrations. R1-Zero spontaneously developed self-verification and reflection.
  • Use reasoning models only when the task genuinely requires deep reasoning. For most tasks, regular models with good prompting are more cost-effective. Implement smart routing to choose the right model per query.

Course Conclusion

Congratulations on completing the first 12 weeks of the AI Engineering Mastery curriculum. You have built a strong foundation spanning from embeddings and RAG through agents and reasoning models. The field continues to evolve rapidly -- keep building, keep experimenting, and keep learning.