LLM Training at Scale
From raw data to deploying state-of-the-art language models: understand the entire lifecycle of building Large Language Models, including pre-training, post-training, and evaluation at industrial scale.
Learning Objectives
Understand the LLM Lifecycle
Master every stage from data collection through deployment and monitoring of production LLMs.
Data Processing at Scale
Learn how trillions of tokens are collected, cleaned, deduplicated, and prepared for training.
Pre-training Infrastructure
Understand distributed training, parallelism strategies, and the economics of training frontier models.
Post-training Alignment
Learn SFT, RLHF, DPO, and Constitutional AI that transform base models into helpful assistants.
Model Evaluation
Understand benchmarks, evaluation methodologies, and the challenges of measuring LLM capabilities.
Latest Developments
Stay current with DeepSeek V3/R1, Llama 4, Claude 4, and the open-weight model revolution through 2025-2026.
1. The LLM Lifecycle
Building a Large Language Model is a multi-stage engineering endeavor that can take months or years and cost anywhere from thousands to hundreds of millions of dollars. Understanding the full lifecycle is essential for any AI engineer, even if you never train a model from scratch yourself. Each stage has its own challenges, tooling, and best practices.
The Six Stages of the LLM Lifecycle
Every production LLM goes through these stages: Data Collection → Preprocessing → Pre-training → Post-training → Deployment → Monitoring. The line between stages can blur, and iteration across stages is common. Let us examine each in detail.
Stage 1: Data Collection
The foundation of any LLM is its training data. The quality, diversity, and scale of training data are arguably the most important factors determining a model's capabilities. Modern LLMs are trained on datasets containing trillions of tokens drawn from diverse sources.
Primary Data Sources
| Source | Description | Volume | Quality |
|---|---|---|---|
| Common Crawl | Monthly web crawls since 2008; petabytes of raw HTML | Very High | Low (requires heavy filtering) |
| Books | Digitized books, Project Gutenberg, Books3 | Medium | High |
| Wikipedia | Multilingual Wikipedia dumps | Low (~20B tokens) | Very High |
| Code | GitHub, GitLab, StackOverflow | High | Medium-High |
| Scientific Papers | arXiv, PubMed, Semantic Scholar | Medium | Very High |
| Social Media | Reddit (Pushshift), forums, discussions | High | Low-Medium |
| Synthetic Data | LLM-generated training data (increasingly common in 2025-2026) | Scalable | Varies |
Real-World Example - Llama 3: Meta's Llama 3 was trained on approximately 15 trillion tokens. The team built custom web crawlers, used extensive quality filtering, and carefully balanced the data mix. Meta reported that over 5% of the pre-training data was high-quality non-English text covering more than 30 languages, with English comprising the remaining roughly 95% of the final dataset.
Real-World Example - DeepSeek V3: DeepSeek V3 was trained on approximately 14.8 trillion tokens of diverse, high-quality data. Despite using a data volume similar to Llama 3's, DeepSeek achieved competitive or superior performance on many benchmarks at a fraction of the training cost.
Stage 2: Preprocessing
Raw collected data is far from usable. Preprocessing transforms messy, noisy web data into clean, high-quality training corpora. This stage is critically important -- "garbage in, garbage out" applies more strongly to LLMs than almost any other ML system.
Key preprocessing steps include:
- HTML parsing and text extraction -- strip markup, extract meaningful text content
- Language identification -- classify documents by language, filter as needed
- Quality filtering -- remove low-quality content (spam, boilerplate, autogenerated text)
- Deduplication -- remove duplicate or near-duplicate documents
- Toxicity filtering -- remove harmful, illegal, or extremely offensive content
- PII removal -- redact personal information (emails, phone numbers, SSNs)
- Tokenization -- convert text into token sequences the model can process
Stage 3: Pre-training
Pre-training is the most compute-intensive stage. The model learns general language understanding by predicting the next token in sequences drawn from the preprocessed corpus. This stage typically accounts for 90%+ of total training compute.
Real-World Example: Training Llama 3.1 405B required approximately 30.84 million GPU-hours on NVIDIA H100 GPUs spread across a 16,384-GPU cluster. The training ran for roughly 54 days, consuming an estimated 11.4 GWh of electricity. The total cost is estimated at over $100 million when accounting for hardware, electricity, cooling, and engineering staff.
Stage 4: Post-training
After pre-training, the model is a powerful text completion engine but is not yet useful as an assistant. Post-training aligns the model to be helpful, harmless, and honest through techniques like Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF).
Real-World Example: ChatGPT's breakthrough was largely due to post-training. OpenAI took GPT-3.5 (a pre-trained model) and applied SFT on thousands of human-written instruction-response pairs, followed by RLHF using a reward model trained on human preference data. This transformed a text-completion engine into a conversational AI assistant.
Stage 5: Deployment
Deploying LLMs in production involves:
- Inference optimization -- quantization, batching, KV-cache management
- Serving infrastructure -- GPU servers, load balancing, auto-scaling
- API design -- streaming, token-by-token generation, rate limiting
- Safety systems -- content filters, guardrails, monitoring for misuse
- Cost management -- balancing quality with cost per token
Stage 6: Monitoring
Once deployed, LLMs require continuous monitoring:
- Performance metrics -- latency, throughput, error rates
- Quality metrics -- user satisfaction, response quality, hallucination rates
- Safety monitoring -- detecting adversarial use, monitoring for harmful outputs
- Data collection -- gathering feedback for future training iterations
- Model drift -- ensuring model behavior remains consistent over time
2. Data Processing at Scale
Data processing for LLMs is an engineering challenge at massive scale. We are dealing with petabytes of raw data that must be cleaned, filtered, deduplicated, and transformed into training-ready token sequences. Let us explore each component in depth.
Web Crawling
The primary source of LLM training data is the open web. Two major approaches exist:
Common Crawl
Common Crawl is a nonprofit organization that has been crawling the web since 2008. They release monthly crawl dumps containing petabytes of raw HTML data. Key facts:
- Each monthly crawl contains approximately 3-4 billion web pages
- Total archive exceeds 250 petabytes (as of early 2026)
- Data is stored in WARC (Web ARChive) format on AWS S3
- The WET format provides extracted plaintext
- Freely available but requires significant processing
RefinedWeb (Falcon's Approach)
The Technology Innovation Institute created RefinedWeb by applying aggressive filtering and deduplication to Common Crawl data. Their key insight was that properly filtered web data alone can match or exceed curated datasets in downstream model quality. RefinedWeb demonstrated that quality filtering matters more than source diversity.
FineWeb (HuggingFace)
HuggingFace released FineWeb in 2024, a 15-trillion-token dataset derived from 96 Common Crawl snapshots (2013-2024). It applies multiple deduplication strategies and quality filters. FineWeb-Edu, a subset filtered for educational content, showed that aggressive quality filtering produces better models even with less data.
Data Cleaning
Deduplication
Duplicate documents are extremely common on the web. Training on duplicates wastes compute and can cause the model to memorize specific texts, increasing the risk of verbatim reproduction. There are three levels of deduplication:
- Exact Deduplication -- Remove documents with identical content. Typically done by comparing cryptographic hashes (SHA-256) of document text.
- Near-Duplicate Detection (MinHash + LSH) -- Documents that are very similar but not identical (e.g., copied with minor edits). MinHash with Locality-Sensitive Hashing is the standard approach, creating compact signatures for each document and finding similar pairs efficiently. This approach was used extensively by Llama 3 and DeepSeek V3.
- Substring Deduplication (Suffix Array) -- Remove repeated long substrings that appear across documents (boilerplate headers, footers, legal notices). Uses suffix arrays for efficient detection.
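MinHash is an efficient estimator of the Jaccard similarity between two documents' shingle sets, which is the quantity near-duplicate detection actually cares about. A minimal pure-Python sketch of the exact computation MinHash approximates (function names here are illustrative, not from any particular library):

```python
def shingles(text: str, n: int = 5) -> set:
    """The set of word n-gram shingles for a document."""
    words = text.lower().split()
    return {' '.join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: str, b: str, n: int = 5) -> float:
    """Exact Jaccard similarity between two documents' shingle sets."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Two near-identical sentences share most of their 5-gram shingles,
# so their Jaccard score is high; unrelated documents score near zero.
sim = jaccard(
    "machine learning is a subset of artificial intelligence systems",
    "machine learning is a subset of artificial intelligence research",
)
```

Computing exact Jaccard for every document pair is quadratic in corpus size; MinHash signatures plus LSH make it tractable at billions of documents.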
Quality Filtering
Not all web text is useful for training. Quality filtering removes low-quality content using multiple signals:
- Perplexity-based filtering -- A small language model scores each document; documents with very high perplexity (incoherent text) or very low perplexity (repetitive/templated text) are removed
- Heuristic rules -- Remove documents with too few words, too many special characters, abnormal word lengths, or excessive repetition
- Classifier-based filtering -- Train a binary classifier (e.g., fasttext) to distinguish "high-quality" text (Wikipedia, books) from "low-quality" text (spam, boilerplate)
- URL-based filtering -- Blocklist known spam/adult/low-quality domains
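To make perplexity-based filtering concrete: production pipelines typically score documents with a KenLM-style n-gram model trained on reference text (as in CCNet). The pure-Python unigram version below, with illustrative helper names, just shows the scoring mechanics; documents scoring far from the reference distribution would be dropped:

```python
import math
from collections import Counter

def train_unigram(reference: str) -> dict:
    """Unigram probabilities from a reference corpus (add-one smoothing)."""
    counts = Counter(reference.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words
    model = {w: (c + 1) / (total + vocab) for w, c in counts.items()}
    model[None] = 1 / (total + vocab)  # probability assigned to unseen words
    return model

def perplexity(model: dict, text: str) -> float:
    """Perplexity of text under the model; higher means less like the reference."""
    words = text.lower().split()
    log_p = sum(math.log(model.get(w, model[None])) for w in words)
    return math.exp(-log_p / max(len(words), 1))

model = train_unigram("the cat sat on the mat the dog sat on the rug")
ppl_good = perplexity(model, "the cat sat on the rug")  # fluent, in-domain
ppl_bad = perplexity(model, "zxqv wkrp flrm blat")      # gibberish scores much higher
```

Note that filtering removes both tails: very high perplexity signals incoherent text, while very low perplexity signals repetitive or templated text.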
Toxic Content Filtering
Removing harmful content from training data is both an ethical imperative and a practical necessity. Common approaches:
- Keyword-based filtering -- Lists of toxic words/phrases (crude but fast)
- Classifier-based filtering -- Models like Perspective API score text for toxicity, threat, profanity
- Domain filtering -- Remove entire domains known for harmful content
- Targeted removal -- Remove specific categories (CSAM, personal attacks, hate speech) while retaining educational discussions about these topics
Data Mixing
The composition of training data profoundly affects model capabilities. Labs carefully tune the proportion of different data types:
| Data Type | Typical Proportion | Impact on Model |
|---|---|---|
| Web text | 60-80% | General knowledge, language fluency |
| Code | 5-15% | Reasoning, coding ability, structured thinking |
| Books | 5-10% | Long-form reasoning, narrative understanding |
| Scientific papers | 3-8% | Technical knowledge, citation patterns |
| Wikipedia | 2-5% | Factual knowledge, structured information |
| Math/STEM | 2-5% | Mathematical reasoning, problem-solving |
| Multilingual | 5-15% | Cross-lingual capabilities |
A key finding from Llama 3's training is that including more code in training data improves general reasoning capabilities, even on non-coding tasks. This is because code requires logical thinking, variable tracking, and precise instruction following.
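In practice the mixture proportions become per-source sampling weights when assembling training batches. A minimal sketch (the weights and helper below are illustrative, not any lab's actual configuration):

```python
import random
from collections import Counter

# Illustrative mixture weights (fractions of the corpus), not a real lab's mix
DATA_MIX = {
    'web': 0.67, 'code': 0.10, 'books': 0.07, 'papers': 0.05,
    'wikipedia': 0.03, 'math': 0.03, 'multilingual': 0.05,
}

def sample_sources(num_docs: int, seed: int = 0) -> list:
    """Draw a source for each training document according to the mix weights."""
    rng = random.Random(seed)
    sources = list(DATA_MIX)
    weights = [DATA_MIX[s] for s in sources]
    return rng.choices(sources, weights=weights, k=num_docs)

# Over many draws, the empirical proportions converge to the configured mix
counts = Counter(sample_sources(10_000))
```

Real pipelines also apply per-source epoch caps (e.g., Wikipedia may be repeated several times while web text is seen once), but the weighted-sampling idea is the same.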
Tokenizer Training
Before training the LLM, a tokenizer must be trained on a representative sample of the training corpus. The tokenizer converts raw text into integer token IDs that the model processes.
Modern LLMs primarily use Byte-Pair Encoding (BPE):
- Start with individual bytes (or characters) as the initial vocabulary
- Count the frequency of every adjacent pair in the corpus
- Merge the most frequent pair into a new token
- Repeat until the desired vocabulary size is reached (typically 32K-128K tokens)
| Model | Tokenizer | Vocab Size |
|---|---|---|
| GPT-4 / GPT-4o | cl100k_base / o200k_base | 100K / 200K |
| Llama 3 | tiktoken-based BPE | 128K |
| Claude 3/4 | Custom BPE | ~100K |
| DeepSeek V3 | Custom BPE | 128K |
| Gemma 2 | SentencePiece | 256K |
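The four BPE steps listed above can be sketched in pure Python at toy scale (real tokenizers such as tiktoken or HuggingFace tokenizers use heavily optimized implementations; this toy version only illustrates the merge loop):

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int) -> list:
    """Learn BPE merges by repeatedly merging the most frequent adjacent pair."""
    # Each word starts as a tuple of single characters (the initial vocabulary)
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge: replace the pair with a single fused symbol
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

# The first merges capture the frequent 'lo' and 'low' units
merges = train_bpe("low low low lower lowest", 3)
```

Production tokenizers run this loop over byte sequences for hundreds of thousands of merges, which is why vocabulary sizes in the table above land in the 100K-256K range.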
Practical: Text Data Preprocessing Pipeline
Hands-On Exercise
Let us build a complete data preprocessing pipeline in Python that handles text cleaning, deduplication, quality filtering, and toxicity screening.
"""
Complete Text Data Preprocessing Pipeline for LLM Training
============================================================
This pipeline demonstrates the key stages of preparing text data
for LLM pre-training: cleaning, deduplication, quality filtering,
and toxicity filtering.
"""
import re
import hashlib
import unicodedata
from typing import List, Dict, Optional, Tuple
from dataclasses import dataclass, field
# For MinHash deduplication
# pip install datasketch
from datasketch import MinHash, MinHashLSH
@dataclass
class Document:
"""Represents a single document in our pipeline."""
text: str
url: str = ""
language: str = "en"
metadata: Dict = field(default_factory=dict)
quality_score: float = 0.0
is_duplicate: bool = False
class TextCleaner:
"""Stage 1: Clean raw text extracted from web pages."""
def __init__(self):
# Common boilerplate patterns to remove
self.boilerplate_patterns = [
r'cookie\s*(policy|consent|notice)',
r'privacy\s*policy',
r'terms\s*(of\s*service|and\s*conditions)',
r'all\s*rights\s*reserved',
r'subscribe\s*to\s*(our|the)\s*newsletter',
r'share\s*(this|on)\s*(facebook|twitter|linkedin)',
r'click\s*here\s*to\s*(read|learn|subscribe)',
r'copyright\s*\d{4}',
]
self.boilerplate_regex = re.compile(
'|'.join(self.boilerplate_patterns),
re.IGNORECASE
)
def clean(self, doc: Document) -> Document:
"""Apply all cleaning steps to a document."""
text = doc.text
# Step 1: Normalize Unicode characters
text = unicodedata.normalize('NFKC', text)
# Step 2: Remove HTML artifacts that survived extraction
text = re.sub(r'<[^>]+>', ' ', text)
text = re.sub(r'&[a-zA-Z]+;', ' ', text)
        text = re.sub(r'&#\d+;', ' ', text)  # numeric HTML entities like &#8217;
# Step 3: Remove URLs
text = re.sub(
r'https?://\S+|www\.\S+',
'[URL]',
text
)
# Step 4: Remove email addresses (PII)
text = re.sub(
r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
'[EMAIL]',
text
)
# Step 5: Remove phone numbers (PII)
text = re.sub(
r'(\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}',
'[PHONE]',
text
)
# Step 6: Normalize whitespace
text = re.sub(r'\n{3,}', '\n\n', text)
text = re.sub(r' {2,}', ' ', text)
text = re.sub(r'\t+', ' ', text)
# Step 7: Remove lines that are mostly boilerplate
lines = text.split('\n')
cleaned_lines = []
for line in lines:
stripped = line.strip()
if stripped and not self.boilerplate_regex.search(stripped):
cleaned_lines.append(line)
text = '\n'.join(cleaned_lines).strip()
doc.text = text
return doc
class ExactDeduplicator:
"""Stage 2a: Remove exact duplicate documents using SHA-256 hashes."""
def __init__(self):
self.seen_hashes = set()
def _hash_document(self, text: str) -> str:
"""Create a hash of normalized text."""
# Normalize before hashing: lowercase, remove extra whitespace
normalized = ' '.join(text.lower().split())
return hashlib.sha256(normalized.encode('utf-8')).hexdigest()
def deduplicate(self, documents: List[Document]) -> List[Document]:
"""Mark exact duplicates."""
results = []
for doc in documents:
doc_hash = self._hash_document(doc.text)
if doc_hash in self.seen_hashes:
doc.is_duplicate = True
else:
self.seen_hashes.add(doc_hash)
results.append(doc)
return results
class MinHashDeduplicator:
"""Stage 2b: Remove near-duplicate documents using MinHash LSH."""
def __init__(self, threshold: float = 0.8, num_perm: int = 128):
self.threshold = threshold
self.num_perm = num_perm
self.lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
self.doc_count = 0
def _create_minhash(self, text: str) -> MinHash:
"""Create a MinHash signature for a document."""
m = MinHash(num_perm=self.num_perm)
# Create n-grams (shingles) of words
words = text.lower().split()
for i in range(len(words) - 4):
shingle = ' '.join(words[i:i+5]) # 5-gram shingles
m.update(shingle.encode('utf-8'))
return m
def deduplicate(self, documents: List[Document]) -> List[Document]:
"""Mark near-duplicate documents."""
results = []
for doc in documents:
if doc.is_duplicate:
results.append(doc)
continue
minhash = self._create_minhash(doc.text)
doc_id = f"doc_{self.doc_count}"
# Check if similar document already exists
similar = self.lsh.query(minhash)
if similar:
doc.is_duplicate = True
else:
try:
self.lsh.insert(doc_id, minhash)
except ValueError:
pass # Handle edge case of identical MinHash
self.doc_count += 1
results.append(doc)
return results
class QualityFilter:
"""Stage 3: Filter documents based on quality heuristics."""
def __init__(self, config: Optional[Dict] = None):
self.config = config or {
'min_words': 50,
'max_words': 100000,
'min_avg_word_length': 3.0,
'max_avg_word_length': 15.0,
'max_special_char_ratio': 0.3,
'max_uppercase_ratio': 0.4,
'min_unique_word_ratio': 0.1,
'max_line_length': 10000,
'min_alpha_ratio': 0.6,
}
def _compute_quality_score(self, text: str) -> Tuple[float, Dict]:
"""Compute a quality score between 0 and 1."""
scores = {}
words = text.split()
num_words = len(words)
# Word count check
if num_words < self.config['min_words']:
scores['word_count'] = 0.0
elif num_words > self.config['max_words']:
scores['word_count'] = 0.5
else:
scores['word_count'] = 1.0
# Average word length
if num_words > 0:
avg_word_len = sum(len(w) for w in words) / num_words
if self.config['min_avg_word_length'] <= avg_word_len <= self.config['max_avg_word_length']:
scores['avg_word_length'] = 1.0
else:
scores['avg_word_length'] = 0.0
else:
scores['avg_word_length'] = 0.0
# Special character ratio
if len(text) > 0:
special_chars = sum(1 for c in text if not c.isalnum() and not c.isspace())
special_ratio = special_chars / len(text)
scores['special_chars'] = 1.0 if special_ratio < self.config['max_special_char_ratio'] else 0.0
else:
scores['special_chars'] = 0.0
# Uppercase ratio
alpha_chars = [c for c in text if c.isalpha()]
if alpha_chars:
upper_ratio = sum(1 for c in alpha_chars if c.isupper()) / len(alpha_chars)
scores['uppercase'] = 1.0 if upper_ratio < self.config['max_uppercase_ratio'] else 0.0
else:
scores['uppercase'] = 0.0
# Unique word ratio (measures repetitiveness)
if num_words > 0:
unique_ratio = len(set(w.lower() for w in words)) / num_words
scores['unique_words'] = 1.0 if unique_ratio > self.config['min_unique_word_ratio'] else 0.0
else:
scores['unique_words'] = 0.0
# Alphabetic character ratio
if len(text) > 0:
alpha_ratio = sum(1 for c in text if c.isalpha()) / len(text)
scores['alpha_ratio'] = 1.0 if alpha_ratio > self.config['min_alpha_ratio'] else 0.0
else:
scores['alpha_ratio'] = 0.0
# Overall quality score (weighted average)
weights = {
'word_count': 0.2,
'avg_word_length': 0.15,
'special_chars': 0.15,
'uppercase': 0.1,
'unique_words': 0.2,
'alpha_ratio': 0.2,
}
total_score = sum(scores[k] * weights[k] for k in scores)
return total_score, scores
def filter(self, documents: List[Document], min_score: float = 0.7) -> List[Document]:
"""Filter documents below quality threshold."""
results = []
for doc in documents:
if doc.is_duplicate:
results.append(doc)
continue
score, details = self._compute_quality_score(doc.text)
doc.quality_score = score
doc.metadata['quality_details'] = details
if score < min_score:
doc.metadata['filtered_reason'] = 'low_quality'
results.append(doc)
return results
class ToxicityFilter:
"""Stage 4: Filter toxic content using keyword and pattern matching."""
def __init__(self):
# In production, use a classifier (e.g., Perspective API, Detoxify)
# This is a simplified keyword-based approach for demonstration
self.toxic_patterns = [
# Add patterns as needed; keeping this minimal for the example
]
def filter(self, documents: List[Document]) -> List[Document]:
"""Mark documents with high toxicity."""
# In a real pipeline, you would use a trained toxicity classifier:
#
# from detoxify import Detoxify
# model = Detoxify('multilingual')
# results = model.predict(doc.text)
# if results['toxicity'] > 0.8:
# doc.metadata['filtered_reason'] = 'toxic'
#
# For this demonstration, we skip actual classification.
return documents
class DataPreprocessingPipeline:
"""
Complete data preprocessing pipeline that chains all stages together.
"""
def __init__(self, quality_threshold: float = 0.7, dedup_threshold: float = 0.8):
self.cleaner = TextCleaner()
self.exact_dedup = ExactDeduplicator()
self.minhash_dedup = MinHashDeduplicator(threshold=dedup_threshold)
self.quality_filter = QualityFilter()
self.toxicity_filter = ToxicityFilter()
self.quality_threshold = quality_threshold
def process(self, documents: List[Document]) -> List[Document]:
"""Run the full preprocessing pipeline."""
print(f"Starting pipeline with {len(documents)} documents")
# Stage 1: Clean text
print("Stage 1: Cleaning text...")
documents = [self.cleaner.clean(doc) for doc in documents]
# Stage 2a: Exact deduplication
print("Stage 2a: Exact deduplication...")
documents = self.exact_dedup.deduplicate(documents)
exact_dupes = sum(1 for d in documents if d.is_duplicate)
print(f" Found {exact_dupes} exact duplicates")
# Stage 2b: Near-duplicate detection
print("Stage 2b: Near-duplicate detection (MinHash LSH)...")
documents = self.minhash_dedup.deduplicate(documents)
total_dupes = sum(1 for d in documents if d.is_duplicate)
print(f" Found {total_dupes - exact_dupes} near-duplicates")
# Stage 3: Quality filtering
print("Stage 3: Quality filtering...")
documents = self.quality_filter.filter(
documents,
min_score=self.quality_threshold
)
low_quality = sum(
1 for d in documents
if d.metadata.get('filtered_reason') == 'low_quality'
)
print(f" Found {low_quality} low-quality documents")
# Stage 4: Toxicity filtering
print("Stage 4: Toxicity filtering...")
documents = self.toxicity_filter.filter(documents)
# Collect passing documents
passed = [
d for d in documents
if not d.is_duplicate
and 'filtered_reason' not in d.metadata
]
print(f"\nPipeline complete:")
print(f" Input: {len(documents)} documents")
print(f" Output: {len(passed)} documents")
print(f" Removed: {len(documents) - len(passed)} documents "
f"({(len(documents) - len(passed)) / len(documents) * 100:.1f}%)")
return passed
def get_statistics(self, documents: List[Document]) -> Dict:
"""Compute corpus statistics."""
total_chars = sum(len(d.text) for d in documents)
total_words = sum(len(d.text.split()) for d in documents)
avg_doc_length = total_words / len(documents) if documents else 0
return {
'num_documents': len(documents),
'total_characters': total_chars,
'total_words': total_words,
'avg_words_per_document': avg_doc_length,
'avg_quality_score': sum(d.quality_score for d in documents) / len(documents) if documents else 0,
}
# ==============================
# Example usage
# ==============================
if __name__ == "__main__":
# Create sample documents
sample_docs = [
Document(
text="""
Machine learning is a subset of artificial intelligence that enables
systems to learn and improve from experience without being explicitly
programmed. It focuses on the development of computer programs that
can access data and use it to learn for themselves. The process begins
with observations or data, such as examples, direct experience, or
instruction, in order to look for patterns in data and make better
decisions in the future.
""",
url="https://example.com/ml-intro"
),
Document(
text="""
Machine learning is a subset of artificial intelligence that enables
systems to learn and improve from experience without being explicitly
programmed. It focuses on the development of computer programs that
can access data and use it to learn for themselves. The process begins
with observations or data, such as examples, direct experience, or
instruction, in order to look for patterns in data and make better
decisions in the future.
""",
url="https://example.com/ml-intro-copy" # Exact duplicate
),
Document(
text="buy now!!! click here!!! $$$ FREE $$$",
url="https://spam.example.com" # Low quality
),
Document(
text="""
Transformers are a type of neural network architecture that has
revolutionized natural language processing. Introduced in the paper
'Attention Is All You Need' by Vaswani et al. in 2017, transformers
use self-attention mechanisms to process sequences in parallel rather
than sequentially. This architecture forms the basis of modern LLMs
like GPT-4, Claude, and Llama. The key innovation is the multi-head
attention mechanism which allows the model to attend to different
parts of the input simultaneously.
""",
url="https://example.com/transformers"
),
]
# Run the pipeline
pipeline = DataPreprocessingPipeline(quality_threshold=0.7)
clean_docs = pipeline.process(sample_docs)
# Print statistics
stats = pipeline.get_statistics(clean_docs)
print(f"\nCorpus Statistics:")
for key, value in stats.items():
print(f" {key}: {value}")
# Print surviving documents
print(f"\nSurviving documents:")
for doc in clean_docs:
preview = doc.text[:100].strip().replace('\n', ' ')
print(f" - {doc.url}: '{preview}...'")
print(f" Quality score: {doc.quality_score:.3f}")
Scale Considerations
The pipeline above works for demonstration purposes but would need significant modifications for production scale. At the scale of Common Crawl (billions of documents), you would use distributed computing frameworks like Apache Spark or Ray, stream data rather than loading it all into memory, and use optimized C++ implementations for MinHash computation. Meta's Llama 3 preprocessing pipeline processed data on a cluster of hundreds of machines over several weeks.
3. Pre-training
Pre-training is where an LLM acquires its core knowledge and capabilities. The model learns to predict the next token in a sequence, which requires understanding grammar, facts, reasoning patterns, and more. Let us explore the technical details of this process.
Next Token Prediction (Causal Language Modeling)
The pre-training objective for decoder-only models (GPT, Llama, Claude) is causal language modeling -- predicting the next token given all previous tokens.
Formally, given a sequence of tokens x = (x_1, x_2, ..., x_n), the model maximizes:
L(x) = sum_{i=1}^{n} log P(x_i | x_1, x_2, ..., x_{i-1}; theta)
where theta represents all model parameters.
During training:
- A batch of text sequences is fed to the model
- The model predicts a probability distribution over the vocabulary for each position
- Cross-entropy loss is computed between predictions and actual next tokens
- Gradients are computed via backpropagation
- Optimizer (typically AdamW) updates the weights
The beauty of this objective is its simplicity and scalability. There are no labels to annotate -- the text itself provides the supervision signal. Every token in every document becomes a training example.
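The objective can be made concrete with a toy calculation (pure Python; the probabilities below are illustrative): the per-token loss is the negative log-probability the model assigned to the token that actually came next.

```python
import math

def sequence_log_likelihood(token_probs: list) -> float:
    """Sum of log P(x_i | x_1, ..., x_{i-1}) over a sequence."""
    return sum(math.log(p) for p in token_probs)

# Illustrative: the probability the model assigned to each actual next token
probs = [0.5, 0.25, 0.8, 0.1]
log_lik = sequence_log_likelihood(probs)
avg_loss = -log_lik / len(probs)  # mean cross-entropy per token, in nats
# Confident correct predictions (p near 1) contribute little loss;
# the p = 0.1 token contributes the most here.
```

A frontier model's reported training loss is exactly this quantity, averaged over trillions of tokens.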
Training Infrastructure
Training frontier LLMs requires massive compute infrastructure. Let us look at what this involves:
GPU Clusters
| Model | GPUs Used | GPU Type | Training Duration | Estimated Cost |
|---|---|---|---|---|
| Llama 3.1 405B | 16,384 | H100 80GB | ~54 days | ~$100M+ |
| DeepSeek V3 | 2,048 | H800 | ~60 days | ~$5.6M |
| Gemini Ultra | Thousands of TPUs | TPU v4 | Months | ~$100M+ |
| GPT-4 | ~25,000 (estimated) | A100 80GB | ~100 days (est.) | ~$100M+ (est.) |
DeepSeek's Cost Efficiency
DeepSeek V3 stands out for training a highly competitive model at roughly $5.6 million -- 10-20x less than comparable models. They achieved this through architectural innovations (Multi-head Latent Attention, DeepSeekMoE), engineering optimizations (FP8 training, optimized communication), and training on fewer but higher-quality tokens. This demonstrated that frontier AI does not necessarily require frontier budgets.
Parallelism Strategies
A single GPU cannot hold a large model or process enough data. Training must be distributed across hundreds or thousands of GPUs. There are four main parallelism strategies:
1. Data Parallelism (DP)
The simplest form of distributed training. Each GPU holds a complete copy of the model and processes a different mini-batch of data. Gradients are averaged across all GPUs after each step.
# Conceptual: Data Parallelism
# GPU 0: model_copy_0 processes batch_0 -> gradient_0
# GPU 1: model_copy_1 processes batch_1 -> gradient_1
# GPU 2: model_copy_2 processes batch_2 -> gradient_2
# GPU 3: model_copy_3 processes batch_3 -> gradient_3
# Then: avg_gradient = mean(gradient_0, gradient_1, gradient_2, gradient_3)
# All GPUs update their model with avg_gradient
Limitation: Each GPU must hold the entire model in memory. For a 405B parameter model in FP16, that is approximately 810 GB -- far exceeding any single GPU's memory.
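The conceptual sketch above can be run end-to-end on a toy linear model (pure Python; the "GPUs" here are simulated sequentially, and the model and data are illustrative):

```python
def grad(w: float, batch: list) -> float:
    """Gradient of mean squared error for the model y = w * x on one mini-batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Each simulated "GPU" sees a different shard of (x, y) pairs with true slope 3
shards = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0)],
]
w = 0.0
for _ in range(200):
    grads = [grad(w, shard) for shard in shards]  # local gradient per GPU
    avg_grad = sum(grads) / len(grads)            # all-reduce: average gradients
    w -= 0.01 * avg_grad                          # identical update on every replica
# w converges toward the true slope 3.0
```

Because every replica applies the same averaged gradient, all copies of the model stay bit-for-bit identical; this is the invariant real data-parallel frameworks (PyTorch DDP, Horovod) maintain via all-reduce.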
2. Tensor Parallelism (TP)
Individual layers are split across multiple GPUs. Each GPU holds a portion of each layer's weights and computes its part of the output. Results are communicated between GPUs within each layer.
# Conceptual: Tensor Parallelism (splitting a linear layer)
# Original: Y = X @ W where W is [4096, 16384]
#
# GPU 0: Y_0 = X @ W[:, :4096] # W chunk [4096, 4096]
# GPU 1: Y_1 = X @ W[:, 4096:8192] # W chunk [4096, 4096]
# GPU 2: Y_2 = X @ W[:, 8192:12288] # W chunk [4096, 4096]
# GPU 3: Y_3 = X @ W[:, 12288:16384] # W chunk [4096, 4096]
#
# Y = concat(Y_0, Y_1, Y_2, Y_3) # AllGather operation
Advantage: Enables training models that do not fit on a single GPU. Limitation: High communication overhead between GPUs within each layer, so works best within a single node with fast NVLink interconnects.
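The column-split sketch can be verified at toy scale. The pure-Python example below (a 2x4 weight matrix split across two simulated GPUs) confirms that concatenating the partial outputs reproduces the full matrix multiply:

```python
def matmul(X, W):
    """Plain matrix multiply on nested lists."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)]
            for row in X]

X = [[1.0, 2.0]]            # activations, shape [1, 2]
W = [[1.0, 2.0, 3.0, 4.0],  # weights, shape [2, 4]
     [5.0, 6.0, 7.0, 8.0]]

# "GPU 0" holds the first two columns of W, "GPU 1" the last two
W0 = [row[:2] for row in W]
W1 = [row[2:] for row in W]
Y0 = matmul(X, W0)          # partial output computed on GPU 0
Y1 = matmul(X, W1)          # partial output computed on GPU 1

# AllGather: concatenating the partial outputs reproduces the full result
Y_parallel = [r0 + r1 for r0, r1 in zip(Y0, Y1)]
Y_full = matmul(X, W)
```

The equivalence holds because each output column depends on only one column slice of W; what makes real TP hard is doing the gather (and the matching backward-pass reduce) fast enough, which is why it is confined to NVLink-connected GPUs.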
3. Pipeline Parallelism (PP)
Different layers are assigned to different GPUs. Data flows through the pipeline like an assembly line. Uses micro-batching to keep all GPUs busy.
# Conceptual: Pipeline Parallelism
# GPU 0: Layers 0-19 (forward pass on micro-batch 1, then 2, then 3...)
# GPU 1: Layers 20-39 (waits for GPU 0, then processes)
# GPU 2: Layers 40-59 (waits for GPU 1, then processes)
# GPU 3: Layers 60-79 (waits for GPU 2, then processes)
#
# The "bubble" (idle time) is minimized by splitting batches into
# multiple micro-batches that can be processed in a pipelined fashion.
Advantage: Lower communication overhead than tensor parallelism (only activations between pipeline stages). Limitation: Pipeline "bubbles" where some GPUs are idle.
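Under an idealized GPipe-style schedule, the bubble has a simple closed form: with p stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1). A quick calculation (real schedules such as 1F1B change the memory profile but obey the same scaling):

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of an idealized GPipe-style pipeline schedule."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

waste_few = bubble_fraction(4, 4)    # few micro-batches: large bubble (3/7)
waste_many = bubble_fraction(4, 32)  # many micro-batches: small bubble (3/35)
```

This is why pipeline-parallel training runs with many more micro-batches than pipeline stages: the bubble shrinks toward zero as m grows.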
4. Sequence Parallelism (SP)
Long sequences are split across GPUs, with each GPU processing a portion of the sequence. Requires careful handling of attention (since each token attends to others across GPUs). Used alongside tensor parallelism, particularly for very long context models.
Combining Parallelism Strategies
In practice, frontier models use 3D parallelism -- combining data, tensor, and pipeline parallelism:
# Llama 3.1 405B Training Configuration (approximate)
# Total GPUs: 16,384 H100s
#
# Tensor Parallelism (TP): 8 GPUs per tensor group
# - Within a single server node (8 GPUs connected via NVLink)
# - Each layer split across 8 GPUs
#
# Pipeline Parallelism (PP): 16 stages
# - 16 groups of layers across 16 nodes
# - Each stage holds ~8 transformer layers
#
# Data Parallelism (DP): 128 replicas
# - 16,384 / (8 * 16) = 128 data parallel groups
# - Each processes a different batch of data
#
# Effective batch: 128 * micro_batch_size tokens per step
ZeRO Optimization
ZeRO (Zero Redundancy Optimizer) from Microsoft Research eliminates memory redundancy in data-parallel training. Without ZeRO, each GPU stores a complete copy of model states (parameters, gradients, optimizer states).
| ZeRO Stage | What is Partitioned | Memory Reduction | Communication Overhead |
|---|---|---|---|
| Stage 1 | Optimizer states only | ~4x | Same as DP |
| Stage 2 | Optimizer states + gradients | ~8x | Same as DP |
| Stage 3 | Optimizer states + gradients + parameters | Linear with # GPUs | ~1.5x DP |
Example memory breakdown for a 7B parameter model in FP16 with Adam optimizer:
# Memory per GPU WITHOUT ZeRO (Data Parallelism)
# Parameters (FP16): 7B * 2 bytes = 14 GB
# Gradients (FP16): 7B * 2 bytes = 14 GB
# Optimizer States:
# - FP32 params copy: 7B * 4 bytes = 28 GB
# - FP32 momentum: 7B * 4 bytes = 28 GB
# - FP32 variance: 7B * 4 bytes = 28 GB
# Total per GPU: 112 GB (doesn't fit on an 80 GB GPU!)
# With ZeRO Stage 3 across 8 GPUs:
# Everything partitioned: 112 GB / 8 = 14 GB per GPU
# Plus activation memory and communication buffers
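The breakdown above generalizes to a small calculator. This sketch follows the same accounting (2 bytes each for FP16 parameters and gradients, 12 bytes of FP32 optimizer state per parameter, activations and buffers excluded); the function name is illustrative:

```python
def model_state_gb(num_params: float, zero_stage: int = 0, num_gpus: int = 1) -> float:
    """Approximate per-GPU memory (GB) for model states with mixed-precision Adam."""
    GB = 1e9
    params = 2 * num_params / GB  # FP16 parameters
    grads = 2 * num_params / GB   # FP16 gradients
    optim = 12 * num_params / GB  # FP32 master params + momentum + variance
    if zero_stage >= 1:
        optim /= num_gpus   # Stage 1 shards optimizer states
    if zero_stage >= 2:
        grads /= num_gpus   # Stage 2 also shards gradients
    if zero_stage >= 3:
        params /= num_gpus  # Stage 3 also shards parameters
    return params + grads + optim

no_zero = model_state_gb(7e9)                          # the 112 GB case above
zero3 = model_state_gb(7e9, zero_stage=3, num_gpus=8)  # 14 GB per GPU
```

Plugging in other stages shows the intermediate savings: Stage 1 keeps the 28 GB of FP16 states replicated but shards the 84 GB of optimizer state, matching the ~4x reduction in the table.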
Training Stability
Training large models is notoriously unstable. Common issues and solutions:
Loss Spikes
Sudden increases in training loss can occur due to bad data batches, numerical instability, or learning rate issues. Approaches to handle them:
- Restart from checkpoint: Roll back to a checkpoint before the spike and skip the problematic data
- Reduce learning rate: Temporarily lower the learning rate when a spike is detected
- Gradient clipping: Cap gradient norms (typically at 1.0) to prevent extreme updates
- Data quality: Identify and remove the data batch that caused the spike
Llama 3 Example: Meta reported approximately 466 job interruptions during Llama 3.1 405B training. About 78% were due to hardware issues (GPU failures, network problems), and the rest were due to software bugs or environmental factors. Their checkpointing system was designed to resume training within minutes of any interruption.
Gradient Accumulation
When the desired batch size exceeds what fits in GPU memory, gradient accumulation simulates larger batches:
# Gradient Accumulation Example
accumulation_steps = 8 # Simulate 8x larger batch
optimizer.zero_grad()
for i, batch in enumerate(dataloader):
outputs = model(batch)
loss = outputs.loss / accumulation_steps # Scale loss
loss.backward() # Accumulate gradients
if (i + 1) % accumulation_steps == 0:
optimizer.step() # Update weights
optimizer.zero_grad() # Reset gradients
Practical: Distributed Training with PyTorch DDP
Hands-On Exercise
Let us set up a simple distributed training script using PyTorch's DistributedDataParallel (DDP).
"""
Distributed Training with PyTorch DDP
=======================================
This script demonstrates how to set up distributed training
using PyTorch's DistributedDataParallel (DDP).
Run with: torchrun --nproc_per_node=NUM_GPUS train_ddp.py
"""
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from torch.utils.data import Dataset
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
get_cosine_schedule_with_warmup,
)
import math
class TextDataset(Dataset):
"""Simple text dataset for causal language modeling."""
def __init__(self, texts, tokenizer, max_length=512):
self.tokenizer = tokenizer
self.max_length = max_length
self.examples = []
for text in texts:
encoding = tokenizer(
text,
truncation=True,
max_length=max_length,
padding="max_length",
return_tensors="pt",
)
self.examples.append({
"input_ids": encoding["input_ids"].squeeze(),
"attention_mask": encoding["attention_mask"].squeeze(),
})
def __len__(self):
return len(self.examples)
def __getitem__(self, idx):
return self.examples[idx]
def setup_distributed():
"""Initialize distributed training environment."""
# torchrun sets these environment variables automatically
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
return local_rank
def cleanup():
"""Clean up distributed training."""
dist.destroy_process_group()
def train(
model_name: str = "gpt2",
num_epochs: int = 3,
batch_size: int = 4,
learning_rate: float = 5e-5,
gradient_accumulation_steps: int = 4,
max_grad_norm: float = 1.0,
warmup_ratio: float = 0.1,
):
"""Main training function."""
# Setup distributed training
local_rank = setup_distributed()
global_rank = dist.get_rank()
world_size = dist.get_world_size()
is_main = global_rank == 0
if is_main:
print(f"Training with {world_size} GPUs")
print(f"Model: {model_name}")
print(f"Effective batch size: {batch_size * gradient_accumulation_steps * world_size}")
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16, # Use BF16 for training stability
)
model = model.to(local_rank)
# Wrap model with DDP
model = DDP(
model,
device_ids=[local_rank],
output_device=local_rank,
find_unused_parameters=False, # Set True if needed
)
# Create sample dataset (replace with your data)
sample_texts = [
"The transformer architecture revolutionized natural language processing.",
"Large language models learn by predicting the next token in a sequence.",
"Pre-training on diverse data gives LLMs broad general knowledge.",
"Fine-tuning adapts pre-trained models to specific tasks or domains.",
"Attention mechanisms allow models to focus on relevant context.",
"Distributed training enables training models across multiple GPUs.",
"Gradient accumulation simulates larger batch sizes with limited memory.",
"Mixed precision training uses FP16/BF16 to save memory and increase speed.",
] * 100 # Repeat for larger dataset
dataset = TextDataset(sample_texts, tokenizer)
# Distributed sampler ensures each GPU gets different data
sampler = DistributedSampler(
dataset,
num_replicas=world_size,
rank=global_rank,
shuffle=True,
)
dataloader = DataLoader(
dataset,
batch_size=batch_size,
sampler=sampler,
num_workers=2,
pin_memory=True,
)
# Optimizer
optimizer = torch.optim.AdamW(
model.parameters(),
lr=learning_rate,
weight_decay=0.01,
betas=(0.9, 0.95),
)
# Learning rate scheduler with warmup
total_steps = len(dataloader) * num_epochs // gradient_accumulation_steps
warmup_steps = int(total_steps * warmup_ratio)
scheduler = get_cosine_schedule_with_warmup(
optimizer,
num_warmup_steps=warmup_steps,
num_training_steps=total_steps,
)
# Training loop
global_step = 0
for epoch in range(num_epochs):
sampler.set_epoch(epoch) # Important for proper shuffling
model.train()
epoch_loss = 0.0
num_batches = 0
for step, batch in enumerate(dataloader):
input_ids = batch["input_ids"].to(local_rank)
attention_mask = batch["attention_mask"].to(local_rank)
# Forward pass -- causal LM uses input_ids as both input and labels
# Labels are shifted internally by the model
outputs = model(
input_ids=input_ids,
attention_mask=attention_mask,
labels=input_ids,
)
            loss = outputs.loss / gradient_accumulation_steps
            # Note: DDP all-reduces gradients on every backward(); for
            # efficiency, non-final accumulation steps can be wrapped in
            # model.no_sync() to skip the redundant communication.
            loss.backward()
epoch_loss += outputs.loss.item()
num_batches += 1
if (step + 1) % gradient_accumulation_steps == 0:
# Gradient clipping
torch.nn.utils.clip_grad_norm_(
model.parameters(), max_grad_norm
)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
global_step += 1
if is_main and global_step % 10 == 0:
avg_loss = epoch_loss / num_batches
lr = scheduler.get_last_lr()[0]
perplexity = math.exp(min(avg_loss, 20))
print(
f"Epoch {epoch+1}/{num_epochs} | "
f"Step {global_step}/{total_steps} | "
f"Loss: {avg_loss:.4f} | "
f"PPL: {perplexity:.2f} | "
f"LR: {lr:.2e}"
)
# Epoch summary
avg_epoch_loss = epoch_loss / num_batches
if is_main:
print(f"\nEpoch {epoch+1} complete. Avg loss: {avg_epoch_loss:.4f}")
        # Save checkpoint (only on main process)
        if is_main:
            checkpoint = {
                "epoch": epoch,
                "global_step": global_step,
                "model_state_dict": model.module.state_dict(),
                "optimizer_state_dict": optimizer.state_dict(),
                "scheduler_state_dict": scheduler.state_dict(),
                "loss": avg_epoch_loss,
            }
            torch.save(checkpoint, f"checkpoint_epoch_{epoch+1}.pt")
            print(f"Saved checkpoint_epoch_{epoch+1}.pt")
# Save final model
if is_main:
model.module.save_pretrained("./trained_model")
tokenizer.save_pretrained("./trained_model")
print("Training complete. Model saved to ./trained_model")
cleanup()
if __name__ == "__main__":
train()
# To run the distributed training script:
# Single node, multiple GPUs:
torchrun --nproc_per_node=4 train_ddp.py
# Multiple nodes:
# Node 0 (master):
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 \
--master_addr=10.0.0.1 --master_port=29500 train_ddp.py
# Node 1:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 \
--master_addr=10.0.0.1 --master_port=29500 train_ddp.py
4. Post-training
Post-training transforms a raw text-completion model into a helpful, harmless, and honest assistant. This stage is what makes the difference between a model that completes "What is the capital of France?" with "What is the capital of Germany? What is the capital of Italy?" versus one that responds "The capital of France is Paris." Let us explore the key techniques.
Supervised Fine-Tuning (SFT)
SFT is the first step in post-training. The model is fine-tuned on high-quality instruction-response pairs in a conversational format.
SFT Data Format
{
"messages": [
{
"role": "system",
"content": "You are a helpful AI assistant."
},
{
"role": "user",
"content": "Explain quantum computing in simple terms."
},
{
"role": "assistant",
"content": "Quantum computing is a type of computing that uses quantum mechanical phenomena..."
}
]
}
Key Aspects of SFT
- Data quality over quantity: A few thousand high-quality examples can be more effective than millions of low-quality ones. The LIMA paper showed that 1,000 carefully curated examples could produce a surprisingly capable model.
- Loss masking: During SFT, loss is typically only computed on the assistant's tokens, not on the user's messages or system prompts. This teaches the model to respond, not to predict what users will say.
- Chat templates: Each model family uses a specific template to format conversations (e.g., ChatML, Llama chat format). Consistency between training and inference is critical.
- Multi-turn conversations: SFT data should include multi-turn dialogues to teach the model to maintain context.
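In practice, loss masking is implemented by setting labels to -100 (PyTorch's cross-entropy ignore index) on every non-assistant token. A minimal sketch over plain token-ID lists; a real pipeline would derive the segments from the chat template rather than hard-code them:

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss

def build_sft_labels(segments):
    """Build (input_ids, labels) from (token_ids, is_assistant) segments.

    Loss is computed only where labels != IGNORE_INDEX, i.e. only on
    assistant tokens; system and user tokens are masked out.
    """
    input_ids, labels = [], []
    for token_ids, is_assistant in segments:
        input_ids.extend(token_ids)
        labels.extend(token_ids if is_assistant else [IGNORE_INDEX] * len(token_ids))
    return input_ids, labels

# Toy example: system+user prompt tokens [1,2,3,4], assistant reply [5,6,7]
ids, labels = build_sft_labels([([1, 2, 3, 4], False), ([5, 6, 7], True)])
print(ids)     # [1, 2, 3, 4, 5, 6, 7]
print(labels)  # [-100, -100, -100, -100, 5, 6, 7]
```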
RLHF (Reinforcement Learning from Human Feedback)
RLHF further aligns the model with human preferences. It consists of two stages:
Stage 1: Reward Model Training
Human annotators compare pairs of model responses and indicate which is better. A reward model is trained on these preferences.
# RLHF Reward Model Training (conceptual)
#
# Training data format:
# (prompt, chosen_response, rejected_response)
#
# Loss function (Bradley-Terry model):
# L = -log(sigmoid(r(chosen) - r(rejected)))
#
# where r(x) is the scalar reward score for response x
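The Bradley-Terry loss above reduces to simple scalar arithmetic per preference pair. A numeric sketch: a real reward model computes r(x) with a scalar head on a transformer, but here the rewards are just floats so the shape of the loss is visible:

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(x)) computed stably as log(1 + exp(-x))
    return math.log(1.0 + math.exp(-margin))

print(round(reward_model_loss(2.0, 0.0), 4))   # 0.1269 -- correct, confident ranking
print(round(reward_model_loss(0.0, 0.0), 4))   # 0.6931 -- indifferent (log 2)
print(round(reward_model_loss(-1.0, 1.0), 4))  # 2.1269 -- wrong ranking, high loss
```

The loss only cares about the *margin* between the two scores, which is why reward model outputs are meaningful only relative to each other, not on an absolute scale.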
Stage 2: PPO Training
The LLM is fine-tuned using Proximal Policy Optimization (PPO) to maximize the reward model's score while staying close to the SFT model (to prevent reward hacking).
# PPO Training Objective (simplified)
#
# L_PPO = E[min(
# ratio * advantage,
# clip(ratio, 1-epsilon, 1+epsilon) * advantage
# )] - beta * KL(policy || reference)
#
# where:
# - ratio = pi_new(a|s) / pi_old(a|s) (probability ratio)
# - advantage = reward - baseline
# - KL penalty prevents the model from diverging too far from the SFT model
# - beta controls the strength of the KL penalty
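The clipped surrogate can be illustrated per token with plain floats. This sketch covers only the min/clip core of the objective; the KL penalty and the advantage estimation (from a value model) are separate components:

```python
def ppo_clipped_term(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """min(ratio * A, clip(ratio, 1-eps, 1+eps) * A) for a single token."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the benefit of pushing the ratio up is capped at 1+eps
print(ppo_clipped_term(1.5, advantage=1.0))   # 1.2 (clipped)
print(ppo_clipped_term(1.1, advantage=1.0))   # 1.1 (inside the clip range)
# Negative advantage: min() keeps the unclipped, more pessimistic term
print(ppo_clipped_term(1.5, advantage=-1.0))  # -1.5
```

The asymmetry is the point: clipping caps how much the policy can profit from any single update, while the min() ensures bad moves are never hidden by the clip, keeping updates conservative.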
Why RLHF is Difficult
RLHF is notoriously difficult to get right. The reward model can be gamed (reward hacking), the KL penalty must be carefully tuned, PPO requires maintaining multiple models simultaneously (policy, reference, reward, value), and the process is computationally expensive. This led to the development of simpler alternatives like DPO.
DPO (Direct Preference Optimization)
DPO, introduced by Rafailov et al. (2023), eliminates the need for a separate reward model and RL training. Instead, it directly optimizes the language model on preference pairs.
# DPO Loss Function
#
# L_DPO = -E[log sigmoid(
#     beta * (log(pi(y_w|x) / pi_ref(y_w|x)) - log(pi(y_l|x) / pi_ref(y_l|x)))
# )]
#
# where:
# - pi is the current policy (model being trained)
# - pi_ref is the reference model (frozen SFT model)
# - y_w is the preferred (winning) response
# - y_l is the dispreferred (losing) response
# - x is the prompt
# - beta controls how much to deviate from the reference
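The DPO loss needs only four sequence log-probabilities per preference pair. A scalar sketch; in a real implementation each value is the sum of token log-probs of the response under the policy or the frozen reference model:

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """-log(sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))))."""
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(x)) computed stably as log(1 + exp(-x))
    return math.log(1.0 + math.exp(-logits))

# Policy identical to the reference: implicit reward margin is 0 -> loss = log 2
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931
# Policy has shifted probability mass toward the preferred response: loss drops
print(round(dpo_loss(-8.0, -14.0, -10.0, -12.0), 4))   # 0.5130
```

Note that the loss depends only on how the policy's log-ratios *moved relative to the reference*, which is exactly the implicit reward DPO optimizes, with no reward model in sight.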
Why DPO became popular:
- No need to train a separate reward model
- No RL training loop (simpler, more stable)
- Can be implemented as a standard supervised loss
- Often produces comparable results to RLHF
- Many top models in 2025-2026 use DPO or variants (ORPO, IPO, KTO)
Constitutional AI (Anthropic)
Anthropic developed Constitutional AI (CAI) as an alternative alignment approach. Instead of relying entirely on human feedback, CAI uses a set of principles (a "constitution") to guide the model's behavior.
The CAI Process
- Generate responses: The model generates responses to prompts, including potentially harmful ones
- Self-critique: The model critiques its own responses based on constitutional principles (e.g., "Is this response harmful? Is it honest?")
- Self-revision: The model revises its responses based on its critiques
- RLAIF: Fine-tune on the revised responses, then train a preference model from AI-generated comparisons (Reinforcement Learning from AI Feedback) instead of human labels
This approach reduces the need for human labelers to interact with harmful content and makes the alignment process more scalable and transparent.
Practical: Fine-tune with SFT Using HuggingFace
Hands-On Exercise
Fine-tune a small language model (GPT-2 or similar) on instruction data using HuggingFace's TRL (Transformer Reinforcement Learning) library.
"""
Supervised Fine-Tuning (SFT) Example
======================================
Fine-tune a small model on instruction-following data
using HuggingFace TRL.
pip install transformers datasets trl peft accelerate
"""
from datasets import Dataset
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
TrainingArguments,
)
from trl import SFTTrainer, SFTConfig
def create_instruction_dataset():
"""Create a small instruction-following dataset."""
examples = [
{
"instruction": "Explain what machine learning is.",
"input": "",
"output": (
"Machine learning is a branch of artificial intelligence that "
"enables computers to learn patterns from data without being "
"explicitly programmed. Instead of writing specific rules, "
"you provide the system with examples, and it learns to make "
"predictions or decisions based on those examples. Common "
"types include supervised learning (learning from labeled data), "
"unsupervised learning (finding patterns in unlabeled data), "
"and reinforcement learning (learning through trial and error)."
),
},
{
"instruction": "Write a Python function to calculate the factorial of a number.",
"input": "",
"output": (
"Here's a Python function to calculate factorial:\n\n"
"```python\n"
"def factorial(n):\n"
" \"\"\"Calculate the factorial of a non-negative integer.\"\"\"\n"
" if n < 0:\n"
" raise ValueError('Factorial is not defined for negative numbers')\n"
" if n == 0 or n == 1:\n"
" return 1\n"
" return n * factorial(n - 1)\n"
"\n"
"# Examples:\n"
"print(factorial(5)) # 120\n"
"print(factorial(10)) # 3628800\n"
"```\n\n"
"This function uses recursion. For very large numbers, consider "
"an iterative approach or `math.factorial()` from the standard library."
),
},
{
"instruction": "Summarize the following text.",
"input": (
"The transformer architecture, introduced in 2017, replaced "
"recurrent neural networks as the dominant architecture for "
"natural language processing. It uses self-attention mechanisms "
"to process all tokens in parallel, leading to much faster "
"training. The key innovation was the multi-head attention "
"mechanism, which allows the model to attend to different "
"parts of the input simultaneously."
),
"output": (
"The transformer architecture (2017) replaced RNNs in NLP by "
"using parallel self-attention mechanisms instead of sequential "
"processing, enabling faster training through its innovative "
"multi-head attention that attends to multiple input parts "
"simultaneously."
),
},
# Add more examples as needed...
] * 50 # Repeat for a larger dataset
return Dataset.from_list(examples)
def format_instruction(example):
"""Format instruction data into a chat-style prompt."""
if example["input"]:
text = (
f"### Instruction:\n{example['instruction']}\n\n"
f"### Input:\n{example['input']}\n\n"
f"### Response:\n{example['output']}"
)
else:
text = (
f"### Instruction:\n{example['instruction']}\n\n"
f"### Response:\n{example['output']}"
)
return text
def main():
# Configuration
model_name = "gpt2" # Use a small model for demonstration
output_dir = "./sft_output"
# Load tokenizer and model
print(f"Loading model: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
)
# Create dataset
print("Creating dataset...")
dataset = create_instruction_dataset()
# Format the dataset
def formatting_func(example):
return format_instruction(example)
# SFT Training configuration
sft_config = SFTConfig(
output_dir=output_dir,
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-5,
weight_decay=0.01,
warmup_ratio=0.1,
lr_scheduler_type="cosine",
logging_steps=10,
save_strategy="epoch",
bf16=True, # Use BF16 if GPU supports it
max_seq_length=512,
packing=True, # Pack multiple short sequences into one
dataset_text_field=None, # We use formatting_func instead
)
# Create SFT trainer
trainer = SFTTrainer(
model=model,
args=sft_config,
train_dataset=dataset,
tokenizer=tokenizer,
formatting_func=formatting_func,
)
# Train
print("Starting SFT training...")
trainer.train()
# Save model
print(f"Saving model to {output_dir}")
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
# Test the fine-tuned model
print("\nTesting fine-tuned model...")
from transformers import pipeline
generator = pipeline(
"text-generation",
model=output_dir,
tokenizer=output_dir,
max_length=200,
)
test_prompt = "### Instruction:\nWhat is deep learning?\n\n### Response:\n"
result = generator(test_prompt, do_sample=True, temperature=0.7)
print(f"Generated response:\n{result[0]['generated_text']}")
if __name__ == "__main__":
main()
5. Model Evaluation
Evaluating LLMs is one of the most challenging problems in AI. Unlike traditional ML where accuracy on a test set suffices, LLM evaluation must capture diverse capabilities across language understanding, reasoning, coding, math, safety, and more.
Perplexity
Perplexity is the most fundamental metric for language models. It measures how well the model predicts a held-out test set.
# Perplexity = exp(-1/N * sum(log P(x_i | x_{<i})))
#
# - Lower perplexity = better model
# - Perplexity of 1 = perfect prediction
# - Perplexity equal to vocab size = random guessing
#
# Example: GPT-2 perplexity on WikiText-103: ~29.4
# Example: GPT-3 perplexity on WikiText-103: ~20.5
Limitations: Perplexity measures raw prediction ability but does not directly correlate with downstream task performance, instruction following, or safety. A model with lower perplexity is not necessarily more helpful.
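Applied to a list of per-token log-probabilities (natural log), the formula is a one-liner. A toy sketch; in practice the log-probs come from the model's loss over a held-out corpus:

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Perfect prediction: every token was assigned probability 1.0
print(perplexity([0.0, 0.0, 0.0]))                   # 1.0
# Every token assigned probability 0.5 -> perplexity 2
print(perplexity([math.log(0.5)] * 4))               # 2.0
# Uniform guessing over a 50,257-token vocab (GPT-2's size)
print(round(perplexity([math.log(1 / 50257)] * 3)))  # 50257
```

Intuitively, perplexity is the effective branching factor: a perplexity of 2 means the model is, on average, as uncertain as a fair coin flip at each token.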
Key Benchmarks
| Benchmark | What It Measures | Format | Examples |
|---|---|---|---|
| MMLU | Broad knowledge across 57 subjects | Multiple choice (A/B/C/D) | 14,042 questions |
| MMLU-Pro | Harder version with 10 options + reasoning | Multiple choice | 12,032 questions |
| HumanEval | Python code generation | Complete a function | 164 problems |
| GSM8K | Grade school math | Word problems | 8,500 problems |
| MATH | Competition-level mathematics | Open-ended math problems | 12,500 problems |
| ARC | Science reasoning (grade school) | Multiple choice | 7,787 questions |
| HellaSwag | Common sense reasoning | Sentence completion | 10,042 examples |
| TruthfulQA | Truthfulness / avoiding common misconceptions | Open-ended + MC | 817 questions |
| GPQA | Graduate-level science questions | Multiple choice | 448 questions |
| SWE-Bench | Real software engineering tasks | Fix GitHub issues | 2,294 tasks |
The Contamination Problem
A major concern in LLM evaluation is benchmark contamination -- when test data appears in training data. If a model has seen the exact questions from MMLU during pre-training, its score is inflated and not meaningful.
Approaches to combat contamination:
- n-gram overlap checking: Check if test questions appear verbatim in training data
- Dynamic benchmarks: LiveBench generates new questions regularly so they cannot be in training data
- Private test sets: Keep test questions private (but this limits reproducibility)
- Paraphrased versions: Test with rephrased questions to see if performance holds
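A minimal version of the first approach: flag a test question if any of its n-grams appears in an index built from the training corpus. Real pipelines normalize and tokenize text and use approximate data structures at scale; this sketch uses lowercase whitespace-split words:

```python
def ngrams(text: str, n: int = 8) -> set:
    """All word n-grams of a text, lowercased, as a set of tuples."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question: str, train_ngrams: set, n: int = 8) -> bool:
    """True if any n-gram of the question appears in the training index."""
    return not ngrams(question, n).isdisjoint(train_ngrams)

train_doc = ("the quick brown fox jumps over the lazy dog "
             "while the cat watches from the window sill quietly")
index = ngrams(train_doc, n=8)

print(is_contaminated("the quick brown fox jumps over the lazy dog", index))  # True
print(is_contaminated("what is the capital of france and of spain", index))   # False
```

The choice of n is a precision/recall trade-off: short n-grams produce false positives on common phrases, while long ones miss lightly edited copies.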
Human Evaluation and Chatbot Arena
LMSYS's Chatbot Arena (now lmarena.ai) is widely considered the gold standard for LLM evaluation. Users interact with two anonymous models side-by-side and vote for which response they prefer. The results generate an Elo-style leaderboard.
Why it works:
- Real users with diverse questions (not artificial benchmarks)
- Blind comparison (users do not know which model is which)
- Large scale (millions of votes as of early 2026)
- Difficult to game (you cannot optimize for "Arena score")
- Captures aspects benchmarks miss (helpfulness, writing quality, nuance)
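The update behind an Elo-style leaderboard fits in a few lines. A minimal sketch of the classic sequential Elo rule; the Arena leaderboard actually fits a Bradley-Terry model over all votes at once, which is order-independent, but the intuition is the same:

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Apply one pairwise vote; return the updated (r_a, r_b)."""
    # Expected score of A under the Elo logistic model
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)  # zero-sum transfer of rating points
    return r_a + delta, r_b - delta

# Two equally rated models: the winner gains k/2 = 16 points
print(elo_update(1000.0, 1000.0, a_wins=True))  # (1016.0, 984.0)
# An upset (low-rated model beats a much higher-rated one) moves ratings more
print(elo_update(1000.0, 1200.0, a_wins=True))
```

Expected wins move ratings barely at all, while upsets move them a lot, which is what lets the leaderboard converge from noisy individual votes.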
Practical: Simple Benchmark Evaluation
"""
Simple LLM Benchmark Evaluation
==================================
Evaluate a model on a simple multiple-choice benchmark by hand,
scoring the log-likelihood of each choice. Equivalent commands for
the lm-evaluation-harness library are shown at the end.
"""
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import List, Dict, Tuple
class SimpleBenchmarkEvaluator:
"""Evaluate a model on simple multiple-choice questions."""
def __init__(self, model_name: str):
print(f"Loading model: {model_name}")
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto",
)
self.model.eval()
def evaluate_multiple_choice(
self,
question: str,
choices: List[str],
correct_idx: int,
) -> Tuple[int, bool]:
"""
Evaluate a single multiple-choice question.
Returns the model's predicted index and whether it was correct.
"""
        # Method: score each choice by the model's average log-likelihood
        # of "Question: ... Answer: <choice>". Caveat: the loss is averaged
        # over ALL tokens (question included), which biases against longer
        # choices; production harnesses score only the answer tokens.
choice_scores = []
for i, choice in enumerate(choices):
prompt = f"Question: {question}\nAnswer: {choice}"
inputs = self.tokenizer(prompt, return_tensors="pt").to(
self.model.device
)
with torch.no_grad():
outputs = self.model(**inputs, labels=inputs["input_ids"])
# Negative log-likelihood (lower = more likely)
loss = outputs.loss.item()
# We want higher likelihood, so negate the loss
choice_scores.append(-loss)
predicted_idx = choice_scores.index(max(choice_scores))
is_correct = predicted_idx == correct_idx
return predicted_idx, is_correct
def run_benchmark(self, questions: List[Dict]) -> Dict:
"""Run evaluation on a set of questions."""
correct = 0
total = len(questions)
results = []
for i, q in enumerate(questions):
pred_idx, is_correct = self.evaluate_multiple_choice(
question=q["question"],
choices=q["choices"],
correct_idx=q["correct_idx"],
)
if is_correct:
correct += 1
results.append({
"question": q["question"],
"predicted": q["choices"][pred_idx],
"correct": q["choices"][q["correct_idx"]],
"is_correct": is_correct,
})
print(
f"[{i+1}/{total}] "
f"{'CORRECT' if is_correct else 'WRONG'} | "
f"Predicted: {q['choices'][pred_idx]}"
)
accuracy = correct / total * 100
print(f"\nAccuracy: {correct}/{total} = {accuracy:.1f}%")
return {
"accuracy": accuracy,
"correct": correct,
"total": total,
"results": results,
}
# Sample benchmark questions (MMLU-style)
SAMPLE_QUESTIONS = [
{
"question": "What is the chemical symbol for gold?",
"choices": ["Ag", "Au", "Fe", "Cu"],
"correct_idx": 1,
},
{
"question": "Which planet is closest to the Sun?",
"choices": ["Venus", "Earth", "Mercury", "Mars"],
"correct_idx": 2,
},
{
"question": "What is the time complexity of binary search?",
"choices": ["O(n)", "O(n^2)", "O(log n)", "O(1)"],
"correct_idx": 2,
},
{
"question": "Who wrote 'A Brief History of Time'?",
"choices": [
"Albert Einstein",
"Stephen Hawking",
"Richard Feynman",
"Carl Sagan",
],
"correct_idx": 1,
},
{
"question": "What is the derivative of x^2?",
"choices": ["x", "2x", "x^2", "2x^2"],
"correct_idx": 1,
},
]
if __name__ == "__main__":
# Evaluate GPT-2 on our mini benchmark
evaluator = SimpleBenchmarkEvaluator("gpt2")
results = evaluator.run_benchmark(SAMPLE_QUESTIONS)
# For comprehensive evaluation, use lm-evaluation-harness:
# pip install lm-eval
# lm_eval --model hf --model_args pretrained=gpt2 \
# --tasks mmlu,hellaswag,arc_easy --batch_size 8
print("\n--- For comprehensive evaluation, use: ---")
print("pip install lm-eval")
print("lm_eval --model hf \\")
print(" --model_args pretrained=YOUR_MODEL \\")
print(" --tasks mmlu,hellaswag,arc_easy,truthfulqa \\")
print(" --batch_size 8")
6. Latest Developments (2025-2026)
The LLM landscape has evolved rapidly through 2025 and into early 2026. Here are the most significant developments:
DeepSeek V3 and R1: Efficiency Breakthroughs
DeepSeek, a Chinese AI lab, made waves with two groundbreaking models:
DeepSeek V3 (December 2024)
- 671B total parameters, 37B active per token (Mixture of Experts)
- Trained for approximately $5.6 million -- a fraction of comparable models
- Used 2,048 H800 GPUs (compared to 16,384 H100s for Llama 3)
- Innovations: Multi-head Latent Attention (MLA), DeepSeekMoE with auxiliary-loss-free balancing, FP8 mixed-precision training
- Competitive with GPT-4o and Claude 3.5 Sonnet on many benchmarks
DeepSeek R1 (January 2025)
- A reasoning model that shows its chain-of-thought process
- Trained using large-scale reinforcement learning (Group Relative Policy Optimization -- GRPO)
- Key insight: reasoning behaviors emerge from pure RL without SFT -- the model learned to think step-by-step, self-verify, and explore multiple approaches
- DeepSeek R1-Zero showed emergent "aha moments" where the model discovered new reasoning strategies during RL training
- Competitive with OpenAI's o1 on math and coding benchmarks
- Open-weight, spurring a wave of distilled reasoning models
Llama 3.1, 3.2, and 4
Llama 3.1 (July 2024)
- Released in 8B, 70B, and 405B parameter sizes
- 128K context length
- The 405B model was the first truly competitive open-weight frontier model
- Trained on 15T+ tokens
Llama 3.2 (September 2024)
- Added multimodal capabilities (vision) in 11B and 90B sizes
- Lightweight text models: 1B and 3B parameters for edge/mobile deployment
- Demonstrated that small models can be surprisingly capable when well-trained
Llama 4 (2025)
- Meta's next-generation model family with significant architectural changes
- Improved reasoning and instruction-following capabilities
- Enhanced multilingual and multimodal support
- Further efficiency improvements in training and inference
Claude 3.5 and Claude 4
- Claude 3.5 Sonnet (released mid-2024, updated late 2024): Became the leading model for coding tasks, with particularly strong performance on SWE-Bench. Introduced "computer use" capabilities for agentic tasks.
- Claude 3.5 Haiku: Fast, cost-effective model competitive with much larger models. Excellent for high-throughput applications.
- Claude 4 family (2025): Significant advances in reasoning, coding, and extended thinking capabilities. Claude Opus 4 set new benchmarks for agentic coding tasks.
- Anthropic's focus on safety and Constitutional AI continues to differentiate their approach.
GPT-4o and Multimodal Training
- GPT-4o ("omni"): Natively multimodal -- processes text, images, and audio in a unified architecture rather than separate models stitched together
- Real-time voice conversations with emotional expression
- Significantly lower latency and cost than GPT-4 Turbo
- o1 and o3: OpenAI's reasoning models that use "thinking" tokens to solve complex problems, achieving human-expert-level performance on GPQA and competitive math olympiad problems
Open-Weight Model Revolution
2024-2025 saw a dramatic shift toward open-weight models:
- Llama 3.1 405B proved open models can compete with proprietary frontier models
- DeepSeek V3/R1 showed that competitive models can be trained at a fraction of the cost
- Mistral continued releasing high-quality models (Mistral Large 2, Pixtral)
- Qwen 2.5 from Alibaba: Strong multilingual models in various sizes
- This shift has democratized AI and created a vibrant ecosystem of fine-tuned and merged models
Small Model Renaissance
Perhaps the most impactful trend of 2025 is the emergence of surprisingly capable small models:
| Model | Parameters | Notable Capability |
|---|---|---|
| Phi-4 (Microsoft) | 14B | Matches 70B+ models on reasoning benchmarks; heavy use of synthetic training data |
| Gemma 2 (Google) | 2B, 9B, 27B | State-of-the-art for their sizes; excellent for research and on-device |
| Qwen 2.5 (Alibaba) | 0.5B-72B | Strong across the board; excellent coding and math models |
| SmolLM2 (HuggingFace) | 135M-1.7B | Tiny but capable; designed for on-device applications |
| Llama 3.2 (Meta) | 1B, 3B | Efficient edge models with strong instruction following |
Key insights driving the small model revolution:
- Data quality over model size: Carefully curated training data (especially synthetic data) can compensate for smaller parameter counts
- Knowledge distillation: Smaller models trained on outputs of larger models inherit much of their capability
- Architecture improvements: Better attention mechanisms, training recipes, and optimization techniques make every parameter count more
- Practical deployment: Small models can run on laptops, phones, and edge devices, enabling new use cases
Summary and Key Takeaways
Week 5 Key Takeaways
- Data is everything: The quality and composition of training data is the single most important factor in model quality. Careful preprocessing, deduplication, and quality filtering are essential.
- Pre-training is expensive but conceptually simple: Next-token prediction at massive scale creates emergent capabilities. The engineering challenge is in distributed training infrastructure.
- Parallelism is necessary: Data parallelism, tensor parallelism, pipeline parallelism, and ZeRO optimization work together to train models that do not fit on single GPUs.
- Post-training matters enormously: SFT and alignment (RLHF/DPO/CAI) transform a raw text predictor into a helpful assistant. Without post-training, even the best base model is not useful for end users.
- Evaluation is hard: No single benchmark captures all of a model's capabilities. Chatbot Arena remains the best holistic evaluation, but automated benchmarks are essential for rapid iteration.
- Efficiency is the new frontier: DeepSeek V3 proved that competitive models can be trained at 10-20x lower cost. The focus is shifting from "bigger" to "smarter."
- Open-weight models are thriving: The gap between open and proprietary models has narrowed dramatically. Small models (1B-14B) are now capable enough for many production use cases.
Next Steps
In Week 6: Quantization and Fine-Tuning, we will dive deep into making these large models practical for deployment. You will learn about KV caches, quantization techniques (GPTQ, AWQ, GGUF), attention optimizations (FlashAttention, PagedAttention), and parameter-efficient fine-tuning with LoRA and QLoRA.