LLM Training at Scale
From raw data to deploying state-of-the-art language models: understand the entire lifecycle of building Large Language Models, including pre-training, post-training, and evaluation at industrial scale.
Learning Objectives
Understand the LLM Lifecycle
Master every stage from data collection through deployment and monitoring of production LLMs.
Data Processing at Scale
Learn how trillions of tokens are collected, cleaned, deduplicated, and prepared for training.
Pre-training Infrastructure
Understand distributed training, parallelism strategies, and the economics of training frontier models.
Post-training Alignment
Learn SFT, RLHF, DPO, and Constitutional AI that transform base models into helpful assistants.
Model Evaluation
Understand benchmarks, evaluation methodologies, and the challenges of measuring LLM capabilities.
Latest Developments
Stay current with DeepSeek V3/R1, Llama 4, Claude 4, and the open-weight model revolution through 2025-2026.
1. The LLM Lifecycle
Building a Large Language Model is a multi-stage engineering endeavor that can take months or years and cost anywhere from thousands to hundreds of millions of dollars. Understanding the full lifecycle is essential for any AI engineer, even if you never train a model from scratch yourself. Each stage has its own challenges, tooling, and best practices.
The Six Stages of the LLM Lifecycle
Every production LLM goes through these stages: Data Collection → Preprocessing → Pre-training → Post-training → Deployment → Monitoring. The line between stages can blur, and iteration across stages is common. Let us examine each in detail.
Stage 1: Data Collection
The foundation of any LLM is its training data. The quality, diversity, and scale of training data are arguably the most important factors determining a model's capabilities. Modern LLMs are trained on datasets containing trillions of tokens drawn from diverse sources.
Primary Data Sources
| Source | Description | Volume | Quality |
|---|---|---|---|
| Common Crawl | Monthly web crawls since 2008; petabytes of raw HTML | Very High | Low (requires heavy filtering) |
| Books | Digitized books, Project Gutenberg, Books3 | Medium | High |
| Wikipedia | Multilingual Wikipedia dumps | Low (~20B tokens) | Very High |
| Code | GitHub, GitLab, StackOverflow | High | Medium-High |
| Scientific Papers | arXiv, PubMed, Semantic Scholar | Medium | Very High |
| Social Media | Reddit (Pushshift), forums, discussions | High | Low-Medium |
| Synthetic Data | LLM-generated training data (increasingly common in 2025-2026) | Scalable | Varies |
Real-World Example - Llama 3: Meta's Llama 3 was trained on approximately 15 trillion tokens. The team built custom web crawlers, used extensive quality filtering, and carefully balanced the data mix. Meta reported that over 5% of the pre-training data was high-quality non-English text covering more than 30 languages, with English comprising the remaining roughly 95% of the final dataset.
Real-World Example - DeepSeek V3: DeepSeek V3 was trained on approximately 14.8 trillion tokens of diverse, high-quality data. Despite using a data volume similar to Llama 3's, DeepSeek achieved competitive or superior performance on many benchmarks at a fraction of the training cost.
Stage 2: Preprocessing
Raw collected data is far from usable. Preprocessing transforms messy, noisy web data into clean, high-quality training corpora. This stage is critically important -- "garbage in, garbage out" applies more strongly to LLMs than almost any other ML system.
Key preprocessing steps include:
- HTML parsing and text extraction -- strip markup, extract meaningful text content
- Language identification -- classify documents by language, filter as needed
- Quality filtering -- remove low-quality content (spam, boilerplate, autogenerated text)
- Deduplication -- remove duplicate or near-duplicate documents
- Toxicity filtering -- remove harmful, illegal, or extremely offensive content
- PII removal -- redact personal information (emails, phone numbers, SSNs)
- Tokenization -- convert text into token sequences the model can process
Stage 3: Pre-training
Pre-training is the most compute-intensive stage. The model learns general language understanding by predicting the next token in sequences drawn from the preprocessed corpus. This stage typically accounts for 90%+ of total training compute.
Real-World Example: Training Llama 3.1 405B required approximately 30.84 million GPU-hours on NVIDIA H100 GPUs spread across a 16,384-GPU cluster. The training ran for roughly 54 days, consuming an estimated 11.4 GWh of electricity. The total cost is estimated at over $100 million when accounting for hardware, electricity, cooling, and engineering staff.
Stage 4: Post-training
After pre-training, the model is a powerful text completion engine but is not yet useful as an assistant. Post-training aligns the model to be helpful, harmless, and honest through techniques like Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF).
Real-World Example: ChatGPT's breakthrough was largely due to post-training. OpenAI took GPT-3.5 (a pre-trained model) and applied SFT on thousands of human-written instruction-response pairs, followed by RLHF using a reward model trained on human preference data. This transformed a text-completion engine into a conversational AI assistant.
Stage 5: Deployment
Deploying LLMs in production involves:
- Inference optimization -- quantization, batching, KV-cache management
- Serving infrastructure -- GPU servers, load balancing, auto-scaling
- API design -- streaming, token-by-token generation, rate limiting
- Safety systems -- content filters, guardrails, monitoring for misuse
- Cost management -- balancing quality with cost per token
Stage 6: Monitoring
Once deployed, LLMs require continuous monitoring:
- Performance metrics -- latency, throughput, error rates
- Quality metrics -- user satisfaction, response quality, hallucination rates
- Safety monitoring -- detecting adversarial use, monitoring for harmful outputs
- Data collection -- gathering feedback for future training iterations
- Model drift -- ensuring model behavior remains consistent over time
2. Data Processing at Scale
Data processing for LLMs is an engineering challenge at massive scale. We are dealing with petabytes of raw data that must be cleaned, filtered, deduplicated, and transformed into training-ready token sequences. Let us explore each component in depth.
Web Crawling
The primary source of LLM training data is the open web. Two major approaches exist:
Common Crawl
Common Crawl is a nonprofit organization that has been crawling the web since 2008. They release monthly crawl dumps containing petabytes of raw HTML data. Key facts:
- Each monthly crawl contains approximately 3-4 billion web pages
- Total archive exceeds 250 petabytes (as of early 2026)
- Data is stored in WARC (Web ARChive) format on AWS S3
- The WET format provides extracted plaintext
- Freely available but requires significant processing
RefinedWeb (Falcon's Approach)
The Technology Innovation Institute created RefinedWeb by applying aggressive filtering and deduplication to Common Crawl data. Their key insight was that properly filtered web data alone can match or exceed curated datasets in downstream model quality. RefinedWeb demonstrated that quality filtering matters more than source diversity.
FineWeb (HuggingFace)
HuggingFace released FineWeb in 2024, a 15-trillion-token dataset derived from 96 Common Crawl snapshots (2013-2024). It applies multiple deduplication strategies and quality filters. FineWeb-Edu, a subset filtered for educational content, showed that aggressive quality filtering produces better models even with less data.
Data Cleaning
Deduplication
Duplicate documents are extremely common on the web. Training on duplicates wastes compute and can cause the model to memorize specific texts, increasing the risk of verbatim reproduction. There are three levels of deduplication:
- Exact Deduplication -- Remove documents with identical content. Typically done by comparing cryptographic hashes (SHA-256) of document text.
- Near-Duplicate Detection (MinHash + LSH) -- Documents that are very similar but not identical (e.g., copied with minor edits). MinHash with Locality-Sensitive Hashing is the standard approach, creating compact signatures for each document and finding similar pairs efficiently. This approach was used extensively by Llama 3 and DeepSeek V3.
- Substring Deduplication (Suffix Array) -- Remove repeated long substrings that appear across documents (boilerplate headers, footers, legal notices). Uses suffix arrays for efficient detection.
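MinHash is an efficient estimator of the Jaccard similarity between two documents' shingle sets, which is the quantity near-duplicate detection actually cares about. A minimal pure-Python sketch of the exact computation MinHash approximates (function names here are illustrative, not from any particular library):

```python
def shingles(text: str, n: int = 5) -> set:
    """The set of word n-gram shingles for a document."""
    words = text.lower().split()
    return {' '.join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: str, b: str, n: int = 5) -> float:
    """Exact Jaccard similarity between two documents' shingle sets."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Two near-identical sentences share most of their 5-gram shingles,
# so their Jaccard score is high; unrelated documents score near zero.
sim = jaccard(
    "machine learning is a subset of artificial intelligence systems",
    "machine learning is a subset of artificial intelligence research",
)
```

Computing exact Jaccard for every document pair is quadratic in corpus size; MinHash signatures plus LSH make it tractable at billions of documents.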
Quality Filtering
Not all web text is useful for training. Quality filtering removes low-quality content using multiple signals:
- Perplexity-based filtering -- A small language model scores each document; documents with very high perplexity (incoherent text) or very low perplexity (repetitive/templated text) are removed
- Heuristic rules -- Remove documents with too few words, too many special characters, abnormal word lengths, or excessive repetition
- Classifier-based filtering -- Train a binary classifier (e.g., fasttext) to distinguish "high-quality" text (Wikipedia, books) from "low-quality" text (spam, boilerplate)
- URL-based filtering -- Blocklist known spam/adult/low-quality domains
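To make perplexity-based filtering concrete: production pipelines typically score documents with a KenLM-style n-gram model trained on reference text (as in CCNet). The pure-Python unigram version below, with illustrative helper names, just shows the scoring mechanics; documents scoring far from the reference distribution would be dropped:

```python
import math
from collections import Counter

def train_unigram(reference: str) -> dict:
    """Unigram probabilities from a reference corpus (add-one smoothing)."""
    counts = Counter(reference.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words
    model = {w: (c + 1) / (total + vocab) for w, c in counts.items()}
    model[None] = 1 / (total + vocab)  # probability assigned to unseen words
    return model

def perplexity(model: dict, text: str) -> float:
    """Perplexity of text under the model; higher means less like the reference."""
    words = text.lower().split()
    log_p = sum(math.log(model.get(w, model[None])) for w in words)
    return math.exp(-log_p / max(len(words), 1))

model = train_unigram("the cat sat on the mat the dog sat on the rug")
ppl_good = perplexity(model, "the cat sat on the rug")  # fluent, in-domain
ppl_bad = perplexity(model, "zxqv wkrp flrm blat")      # gibberish scores much higher
```

Note that filtering removes both tails: very high perplexity signals incoherent text, while very low perplexity signals repetitive or templated text.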
Toxic Content Filtering
Removing harmful content from training data is both an ethical imperative and a practical necessity. Common approaches:
- Keyword-based filtering -- Lists of toxic words/phrases (crude but fast)
- Classifier-based filtering -- Models like Perspective API score text for toxicity, threat, profanity
- Domain filtering -- Remove entire domains known for harmful content
- Targeted removal -- Remove specific categories (CSAM, personal attacks, hate speech) while retaining educational discussions about these topics
Data Mixing
The composition of training data profoundly affects model capabilities. Labs carefully tune the proportion of different data types:
| Data Type | Typical Proportion | Impact on Model |
|---|---|---|
| Web text | 60-80% | General knowledge, language fluency |
| Code | 5-15% | Reasoning, coding ability, structured thinking |
| Books | 5-10% | Long-form reasoning, narrative understanding |
| Scientific papers | 3-8% | Technical knowledge, citation patterns |
| Wikipedia | 2-5% | Factual knowledge, structured information |
| Math/STEM | 2-5% | Mathematical reasoning, problem-solving |
| Multilingual | 5-15% | Cross-lingual capabilities |
A key finding from Llama 3's training is that including more code in training data improves general reasoning capabilities, even on non-coding tasks. This is because code requires logical thinking, variable tracking, and precise instruction following.
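In practice the mixture proportions become per-source sampling weights when assembling training batches. A minimal sketch (the weights and helper below are illustrative, not any lab's actual configuration):

```python
import random
from collections import Counter

# Illustrative mixture weights (fractions of the corpus), not a real lab's mix
DATA_MIX = {
    'web': 0.67, 'code': 0.10, 'books': 0.07, 'papers': 0.05,
    'wikipedia': 0.03, 'math': 0.03, 'multilingual': 0.05,
}

def sample_sources(num_docs: int, seed: int = 0) -> list:
    """Draw a source for each training document according to the mix weights."""
    rng = random.Random(seed)
    sources = list(DATA_MIX)
    weights = [DATA_MIX[s] for s in sources]
    return rng.choices(sources, weights=weights, k=num_docs)

# Over many draws, the empirical proportions converge to the configured mix
counts = Counter(sample_sources(10_000))
```

Real pipelines also apply per-source epoch caps (e.g., Wikipedia may be repeated several times while web text is seen once), but the weighted-sampling idea is the same.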
Tokenizer Training
Before training the LLM, a tokenizer must be trained on a representative sample of the training corpus. The tokenizer converts raw text into integer token IDs that the model processes.
Modern LLMs primarily use Byte-Pair Encoding (BPE):
- Start with individual bytes (or characters) as the initial vocabulary
- Count the frequency of every adjacent pair in the corpus
- Merge the most frequent pair into a new token
- Repeat until the desired vocabulary size is reached (typically 32K-128K tokens)
| Model | Tokenizer | Vocab Size |
|---|---|---|
| GPT-4 / GPT-4o | cl100k_base / o200k_base | 100K / 200K |
| Llama 3 | tiktoken-based BPE | 128K |
| Claude 3/4 | Custom BPE | ~100K |
| DeepSeek V3 | Custom BPE | 128K |
| Gemma 2 | SentencePiece | 256K |
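The four BPE steps listed above can be sketched in pure Python at toy scale (real tokenizers such as tiktoken or HuggingFace tokenizers use heavily optimized implementations; this toy version only illustrates the merge loop):

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int) -> list:
    """Learn BPE merges by repeatedly merging the most frequent adjacent pair."""
    # Each word starts as a tuple of single characters (the initial vocabulary)
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge: replace the pair with a single fused symbol
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

# The first merges capture the frequent 'lo' and 'low' units
merges = train_bpe("low low low lower lowest", 3)
```

Production tokenizers run this loop over byte sequences for hundreds of thousands of merges, which is why vocabulary sizes in the table above land in the 100K-256K range.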
Practical: Text Data Preprocessing Pipeline
Hands-On Exercise
Let us build a complete data preprocessing pipeline in Python that handles text cleaning, deduplication, quality filtering, and toxicity screening.
"""
Complete Text Data Preprocessing Pipeline for LLM Training
============================================================
This pipeline demonstrates the key stages of preparing text data
for LLM pre-training: cleaning, deduplication, quality filtering,
and toxicity filtering.
"""
import re
import hashlib
import unicodedata
from typing import List, Dict, Optional, Tuple
from dataclasses import dataclass, field
# For MinHash deduplication
# pip install datasketch
from datasketch import MinHash, MinHashLSH
@dataclass
class Document:
"""Represents a single document in our pipeline."""
text: str
url: str = ""
language: str = "en"
metadata: Dict = field(default_factory=dict)
quality_score: float = 0.0
is_duplicate: bool = False
class TextCleaner:
"""Stage 1: Clean raw text extracted from web pages."""
def __init__(self):
# Common boilerplate patterns to remove
self.boilerplate_patterns = [
r'cookie\s*(policy|consent|notice)',
r'privacy\s*policy',
r'terms\s*(of\s*service|and\s*conditions)',
r'all\s*rights\s*reserved',
r'subscribe\s*to\s*(our|the)\s*newsletter',
r'share\s*(this|on)\s*(facebook|twitter|linkedin)',
r'click\s*here\s*to\s*(read|learn|subscribe)',
r'copyright\s*\d{4}',
]
self.boilerplate_regex = re.compile(
'|'.join(self.boilerplate_patterns),
re.IGNORECASE
)
def clean(self, doc: Document) -> Document:
"""Apply all cleaning steps to a document."""
text = doc.text
# Step 1: Normalize Unicode characters
text = unicodedata.normalize('NFKC', text)
# Step 2: Remove HTML artifacts that survived extraction
text = re.sub(r'<[^>]+>', ' ', text)
text = re.sub(r'&[a-zA-Z]+;', ' ', text)
        text = re.sub(r'&#\d+;', ' ', text)  # numeric HTML entities like &#8217;
# Step 3: Remove URLs
text = re.sub(
r'https?://\S+|www\.\S+',
'[URL]',
text
)
# Step 4: Remove email addresses (PII)
text = re.sub(
r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
'[EMAIL]',
text
)
# Step 5: Remove phone numbers (PII)
text = re.sub(
r'(\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}',
'[PHONE]',
text
)
# Step 6: Normalize whitespace
text = re.sub(r'\n{3,}', '\n\n', text)
text = re.sub(r' {2,}', ' ', text)
text = re.sub(r'\t+', ' ', text)
# Step 7: Remove lines that are mostly boilerplate
lines = text.split('\n')
cleaned_lines = []
for line in lines:
stripped = line.strip()
if stripped and not self.boilerplate_regex.search(stripped):
cleaned_lines.append(line)
text = '\n'.join(cleaned_lines).strip()
doc.text = text
return doc
class ExactDeduplicator:
"""Stage 2a: Remove exact duplicate documents using SHA-256 hashes."""
def __init__(self):
self.seen_hashes = set()
def _hash_document(self, text: str) -> str:
"""Create a hash of normalized text."""
# Normalize before hashing: lowercase, remove extra whitespace
normalized = ' '.join(text.lower().split())
return hashlib.sha256(normalized.encode('utf-8')).hexdigest()
def deduplicate(self, documents: List[Document]) -> List[Document]:
"""Mark exact duplicates."""
results = []
for doc in documents:
doc_hash = self._hash_document(doc.text)
if doc_hash in self.seen_hashes:
doc.is_duplicate = True
else:
self.seen_hashes.add(doc_hash)
results.append(doc)
return results
class MinHashDeduplicator:
"""Stage 2b: Remove near-duplicate documents using MinHash LSH."""
def __init__(self, threshold: float = 0.8, num_perm: int = 128):
self.threshold = threshold
self.num_perm = num_perm
self.lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
self.doc_count = 0
def _create_minhash(self, text: str) -> MinHash:
"""Create a MinHash signature for a document."""
m = MinHash(num_perm=self.num_perm)
# Create n-grams (shingles) of words
words = text.lower().split()
for i in range(len(words) - 4):
shingle = ' '.join(words[i:i+5]) # 5-gram shingles
m.update(shingle.encode('utf-8'))
return m
def deduplicate(self, documents: List[Document]) -> List[Document]:
"""Mark near-duplicate documents."""
results = []
for doc in documents:
if doc.is_duplicate:
results.append(doc)
continue
minhash = self._create_minhash(doc.text)
doc_id = f"doc_{self.doc_count}"
# Check if similar document already exists
similar = self.lsh.query(minhash)
if similar:
doc.is_duplicate = True
else:
try:
self.lsh.insert(doc_id, minhash)
except ValueError:
pass # Handle edge case of identical MinHash
self.doc_count += 1
results.append(doc)
return results
class QualityFilter:
"""Stage 3: Filter documents based on quality heuristics."""
def __init__(self, config: Optional[Dict] = None):
self.config = config or {
'min_words': 50,
'max_words': 100000,
'min_avg_word_length': 3.0,
'max_avg_word_length': 15.0,
'max_special_char_ratio': 0.3,
'max_uppercase_ratio': 0.4,
'min_unique_word_ratio': 0.1,
'max_line_length': 10000,
'min_alpha_ratio': 0.6,
}
def _compute_quality_score(self, text: str) -> Tuple[float, Dict]:
"""Compute a quality score between 0 and 1."""
scores = {}
words = text.split()
num_words = len(words)
# Word count check
if num_words < self.config['min_words']:
scores['word_count'] = 0.0
elif num_words > self.config['max_words']:
scores['word_count'] = 0.5
else:
scores['word_count'] = 1.0
# Average word length
if num_words > 0:
avg_word_len = sum(len(w) for w in words) / num_words
if self.config['min_avg_word_length'] <= avg_word_len <= self.config['max_avg_word_length']:
scores['avg_word_length'] = 1.0
else:
scores['avg_word_length'] = 0.0
else:
scores['avg_word_length'] = 0.0
# Special character ratio
if len(text) > 0:
special_chars = sum(1 for c in text if not c.isalnum() and not c.isspace())
special_ratio = special_chars / len(text)
scores['special_chars'] = 1.0 if special_ratio < self.config['max_special_char_ratio'] else 0.0
else:
scores['special_chars'] = 0.0
# Uppercase ratio
alpha_chars = [c for c in text if c.isalpha()]
if alpha_chars:
upper_ratio = sum(1 for c in alpha_chars if c.isupper()) / len(alpha_chars)
scores['uppercase'] = 1.0 if upper_ratio < self.config['max_uppercase_ratio'] else 0.0
else:
scores['uppercase'] = 0.0
# Unique word ratio (measures repetitiveness)
if num_words > 0:
unique_ratio = len(set(w.lower() for w in words)) / num_words
scores['unique_words'] = 1.0 if unique_ratio > self.config['min_unique_word_ratio'] else 0.0
else:
scores['unique_words'] = 0.0
# Alphabetic character ratio
if len(text) > 0:
alpha_ratio = sum(1 for c in text if c.isalpha()) / len(text)
scores['alpha_ratio'] = 1.0 if alpha_ratio > self.config['min_alpha_ratio'] else 0.0
else:
scores['alpha_ratio'] = 0.0
# Overall quality score (weighted average)
weights = {
'word_count': 0.2,
'avg_word_length': 0.15,
'special_chars': 0.15,
'uppercase': 0.1,
'unique_words': 0.2,
'alpha_ratio': 0.2,
}
total_score = sum(scores[k] * weights[k] for k in scores)
return total_score, scores
def filter(self, documents: List[Document], min_score: float = 0.7) -> List[Document]:
"""Filter documents below quality threshold."""
results = []
for doc in documents:
if doc.is_duplicate:
results.append(doc)
continue
score, details = self._compute_quality_score(doc.text)
doc.quality_score = score
doc.metadata['quality_details'] = details
if score < min_score:
doc.metadata['filtered_reason'] = 'low_quality'
results.append(doc)
return results
class ToxicityFilter:
"""Stage 4: Filter toxic content using keyword and pattern matching."""
def __init__(self):
# In production, use a classifier (e.g., Perspective API, Detoxify)
# This is a simplified keyword-based approach for demonstration
self.toxic_patterns = [
# Add patterns as needed; keeping this minimal for the example
]
def filter(self, documents: List[Document]) -> List[Document]:
"""Mark documents with high toxicity."""
# In a real pipeline, you would use a trained toxicity classifier:
#
# from detoxify import Detoxify
# model = Detoxify('multilingual')
# results = model.predict(doc.text)
# if results['toxicity'] > 0.8:
# doc.metadata['filtered_reason'] = 'toxic'
#
# For this demonstration, we skip actual classification.
return documents
class DataPreprocessingPipeline:
"""
Complete data preprocessing pipeline that chains all stages together.
"""
def __init__(self, quality_threshold: float = 0.7, dedup_threshold: float = 0.8):
self.cleaner = TextCleaner()
self.exact_dedup = ExactDeduplicator()
self.minhash_dedup = MinHashDeduplicator(threshold=dedup_threshold)
self.quality_filter = QualityFilter()
self.toxicity_filter = ToxicityFilter()
self.quality_threshold = quality_threshold
def process(self, documents: List[Document]) -> List[Document]:
"""Run the full preprocessing pipeline."""
print(f"Starting pipeline with {len(documents)} documents")
# Stage 1: Clean text
print("Stage 1: Cleaning text...")
documents = [self.cleaner.clean(doc) for doc in documents]
# Stage 2a: Exact deduplication
print("Stage 2a: Exact deduplication...")
documents = self.exact_dedup.deduplicate(documents)
exact_dupes = sum(1 for d in documents if d.is_duplicate)
print(f" Found {exact_dupes} exact duplicates")
# Stage 2b: Near-duplicate detection
print("Stage 2b: Near-duplicate detection (MinHash LSH)...")
documents = self.minhash_dedup.deduplicate(documents)
total_dupes = sum(1 for d in documents if d.is_duplicate)
print(f" Found {total_dupes - exact_dupes} near-duplicates")
# Stage 3: Quality filtering
print("Stage 3: Quality filtering...")
documents = self.quality_filter.filter(
documents,
min_score=self.quality_threshold
)
low_quality = sum(
1 for d in documents
if d.metadata.get('filtered_reason') == 'low_quality'
)
print(f" Found {low_quality} low-quality documents")
# Stage 4: Toxicity filtering
print("Stage 4: Toxicity filtering...")
documents = self.toxicity_filter.filter(documents)
# Collect passing documents
passed = [
d for d in documents
if not d.is_duplicate
and 'filtered_reason' not in d.metadata
]
print(f"\nPipeline complete:")
print(f" Input: {len(documents)} documents")
print(f" Output: {len(passed)} documents")
print(f" Removed: {len(documents) - len(passed)} documents "
f"({(len(documents) - len(passed)) / len(documents) * 100:.1f}%)")
return passed
def get_statistics(self, documents: List[Document]) -> Dict:
"""Compute corpus statistics."""
total_chars = sum(len(d.text) for d in documents)
total_words = sum(len(d.text.split()) for d in documents)
avg_doc_length = total_words / len(documents) if documents else 0
return {
'num_documents': len(documents),
'total_characters': total_chars,
'total_words': total_words,
'avg_words_per_document': avg_doc_length,
'avg_quality_score': sum(d.quality_score for d in documents) / len(documents) if documents else 0,
}
# ==============================
# Example usage
# ==============================
if __name__ == "__main__":
# Create sample documents
sample_docs = [
Document(
text="""
Machine learning is a subset of artificial intelligence that enables
systems to learn and improve from experience without being explicitly
programmed. It focuses on the development of computer programs that
can access data and use it to learn for themselves. The process begins
with observations or data, such as examples, direct experience, or
instruction, in order to look for patterns in data and make better
decisions in the future.
""",
url="https://example.com/ml-intro"
),
Document(
text="""
Machine learning is a subset of artificial intelligence that enables
systems to learn and improve from experience without being explicitly
programmed. It focuses on the development of computer programs that
can access data and use it to learn for themselves. The process begins
with observations or data, such as examples, direct experience, or
instruction, in order to look for patterns in data and make better
decisions in the future.
""",
url="https://example.com/ml-intro-copy" # Exact duplicate
),
Document(
text="buy now!!! click here!!! $$$ FREE $$$",
url="https://spam.example.com" # Low quality
),
Document(
text="""
Transformers are a type of neural network architecture that has
revolutionized natural language processing. Introduced in the paper
'Attention Is All You Need' by Vaswani et al. in 2017, transformers
use self-attention mechanisms to process sequences in parallel rather
than sequentially. This architecture forms the basis of modern LLMs
like GPT-4, Claude, and Llama. The key innovation is the multi-head
attention mechanism which allows the model to attend to different
parts of the input simultaneously.
""",
url="https://example.com/transformers"
),
]
# Run the pipeline
pipeline = DataPreprocessingPipeline(quality_threshold=0.7)
clean_docs = pipeline.process(sample_docs)
# Print statistics
stats = pipeline.get_statistics(clean_docs)
print(f"\nCorpus Statistics:")
for key, value in stats.items():
print(f" {key}: {value}")
# Print surviving documents
print(f"\nSurviving documents:")
for doc in clean_docs:
preview = doc.text[:100].strip().replace('\n', ' ')
print(f" - {doc.url}: '{preview}...'")
print(f" Quality score: {doc.quality_score:.3f}")
Scale Considerations
The pipeline above works for demonstration purposes but would need significant modifications for production scale. At the scale of Common Crawl (billions of documents), you would use distributed computing frameworks like Apache Spark or Ray, stream data rather than loading it all into memory, and use optimized C++ implementations for MinHash computation. Meta's Llama 3 preprocessing pipeline processed data on a cluster of hundreds of machines over several weeks.
3. Pre-training
Pre-training is where an LLM acquires its core knowledge and capabilities. The model learns to predict the next token in a sequence, which requires understanding grammar, facts, reasoning patterns, and more. Let us explore the technical details of this process.
Next Token Prediction (Causal Language Modeling)
The pre-training objective for decoder-only models (GPT, Llama, Claude) is causal language modeling -- predicting the next token given all previous tokens.
Formally, given a sequence of tokens x = (x_1, x_2, ..., x_n), the model maximizes:
L(x) = sum_{i=1}^{n} log P(x_i | x_1, x_2, ..., x_{i-1}; theta)
where theta represents all model parameters.
During training:
- A batch of text sequences is fed to the model
- The model predicts a probability distribution over the vocabulary for each position
- Cross-entropy loss is computed between predictions and actual next tokens
- Gradients are computed via backpropagation
- Optimizer (typically AdamW) updates the weights
The beauty of this objective is its simplicity and scalability. There are no labels to annotate -- the text itself provides the supervision signal. Every token in every document becomes a training example.
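The objective can be made concrete with a toy calculation (pure Python; the probabilities below are illustrative): the per-token loss is the negative log-probability the model assigned to the token that actually came next.

```python
import math

def sequence_log_likelihood(token_probs: list) -> float:
    """Sum of log P(x_i | x_1, ..., x_{i-1}) over a sequence."""
    return sum(math.log(p) for p in token_probs)

# Illustrative: the probability the model assigned to each actual next token
probs = [0.5, 0.25, 0.8, 0.1]
log_lik = sequence_log_likelihood(probs)
avg_loss = -log_lik / len(probs)  # mean cross-entropy per token, in nats
# Confident correct predictions (p near 1) contribute little loss;
# the p = 0.1 token contributes the most here.
```

A frontier model's reported training loss is exactly this quantity, averaged over trillions of tokens.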
Training Infrastructure
Training frontier LLMs requires massive compute infrastructure. Let us look at what this involves:
GPU Clusters
| Model | GPUs Used | GPU Type | Training Duration | Estimated Cost |
|---|---|---|---|---|
| Llama 3.1 405B | 16,384 | H100 80GB | ~54 days | ~$100M+ |
| DeepSeek V3 | 2,048 | H800 | ~60 days | ~$5.6M |
| Gemini Ultra | Thousands of TPUs | TPU v4 | Months | ~$100M+ |
| GPT-4 | ~25,000 (estimated) | A100 80GB | ~100 days (est.) | ~$100M+ (est.) |
DeepSeek's Cost Efficiency
DeepSeek V3 stands out for training a highly competitive model at roughly $5.6 million -- 10-20x less than comparable models. They achieved this through architectural innovations (Multi-head Latent Attention, DeepSeekMoE), engineering optimizations (FP8 training, optimized communication), and training on fewer but higher-quality tokens. This demonstrated that frontier AI does not necessarily require frontier budgets.
Parallelism Strategies
A single GPU cannot hold a large model or process enough data. Training must be distributed across hundreds or thousands of GPUs. There are four main parallelism strategies:
1. Data Parallelism (DP)
The simplest form of distributed training. Each GPU holds a complete copy of the model and processes a different mini-batch of data. Gradients are averaged across all GPUs after each step.
# Conceptual: Data Parallelism
# GPU 0: model_copy_0 processes batch_0 -> gradient_0
# GPU 1: model_copy_1 processes batch_1 -> gradient_1
# GPU 2: model_copy_2 processes batch_2 -> gradient_2
# GPU 3: model_copy_3 processes batch_3 -> gradient_3
# Then: avg_gradient = mean(gradient_0, gradient_1, gradient_2, gradient_3)
# All GPUs update their model with avg_gradient
Limitation: Each GPU must hold the entire model in memory. For a 405B parameter model in FP16, that is approximately 810 GB -- far exceeding any single GPU's memory.
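The conceptual sketch above can be run end-to-end on a toy linear model (pure Python; the "GPUs" here are simulated sequentially, and the model and data are illustrative):

```python
def grad(w: float, batch: list) -> float:
    """Gradient of mean squared error for the model y = w * x on one mini-batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Each simulated "GPU" sees a different shard of (x, y) pairs with true slope 3
shards = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0)],
]
w = 0.0
for _ in range(200):
    grads = [grad(w, shard) for shard in shards]  # local gradient per GPU
    avg_grad = sum(grads) / len(grads)            # all-reduce: average gradients
    w -= 0.01 * avg_grad                          # identical update on every replica
# w converges toward the true slope 3.0
```

Because every replica applies the same averaged gradient, all copies of the model stay bit-for-bit identical; this is the invariant real data-parallel frameworks (PyTorch DDP, Horovod) maintain via all-reduce.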
2. Tensor Parallelism (TP)
Individual layers are split across multiple GPUs. Each GPU holds a portion of each layer's weights and computes its part of the output. Results are communicated between GPUs within each layer.
# Conceptual: Tensor Parallelism (splitting a linear layer)
# Original: Y = X @ W where W is [4096, 16384]
#
# GPU 0: Y_0 = X @ W[:, :4096] # W chunk [4096, 4096]
# GPU 1: Y_1 = X @ W[:, 4096:8192] # W chunk [4096, 4096]
# GPU 2: Y_2 = X @ W[:, 8192:12288] # W chunk [4096, 4096]
# GPU 3: Y_3 = X @ W[:, 12288:16384] # W chunk [4096, 4096]
#
# Y = concat(Y_0, Y_1, Y_2, Y_3) # AllGather operation
Advantage: Enables training models that do not fit on a single GPU. Limitation: High communication overhead between GPUs within each layer, so works best within a single node with fast NVLink interconnects.
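The column-split sketch can be verified at toy scale. The pure-Python example below (a 2x4 weight matrix split across two simulated GPUs) confirms that concatenating the partial outputs reproduces the full matrix multiply:

```python
def matmul(X, W):
    """Plain matrix multiply on nested lists."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)]
            for row in X]

X = [[1.0, 2.0]]            # activations, shape [1, 2]
W = [[1.0, 2.0, 3.0, 4.0],  # weights, shape [2, 4]
     [5.0, 6.0, 7.0, 8.0]]

# "GPU 0" holds the first two columns of W, "GPU 1" the last two
W0 = [row[:2] for row in W]
W1 = [row[2:] for row in W]
Y0 = matmul(X, W0)          # partial output computed on GPU 0
Y1 = matmul(X, W1)          # partial output computed on GPU 1

# AllGather: concatenating the partial outputs reproduces the full result
Y_parallel = [r0 + r1 for r0, r1 in zip(Y0, Y1)]
Y_full = matmul(X, W)
```

The equivalence holds because each output column depends on only one column slice of W; what makes real TP hard is doing the gather (and the matching backward-pass reduce) fast enough, which is why it is confined to NVLink-connected GPUs.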
3. Pipeline Parallelism (PP)
Different layers are assigned to different GPUs. Data flows through the pipeline like an assembly line. Uses micro-batching to keep all GPUs busy.
# Conceptual: Pipeline Parallelism
# GPU 0: Layers 0-19 (forward pass on micro-batch 1, then 2, then 3...)
# GPU 1: Layers 20-39 (waits for GPU 0, then processes)
# GPU 2: Layers 40-59 (waits for GPU 1, then processes)
# GPU 3: Layers 60-79 (waits for GPU 2, then processes)
#
# The "bubble" (idle time) is minimized by splitting batches into
# multiple micro-batches that can be processed in a pipelined fashion.
Advantage: Lower communication overhead than tensor parallelism (only activations between pipeline stages). Limitation: Pipeline "bubbles" where some GPUs are idle.
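Under an idealized GPipe-style schedule, the bubble has a simple closed form: with p stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1). A quick calculation (real schedules such as 1F1B change the memory profile but obey the same scaling):

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of an idealized GPipe-style pipeline schedule."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

waste_few = bubble_fraction(4, 4)    # few micro-batches: large bubble (3/7)
waste_many = bubble_fraction(4, 32)  # many micro-batches: small bubble (3/35)
```

This is why pipeline-parallel training runs with many more micro-batches than pipeline stages: the bubble shrinks toward zero as m grows.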
4. Sequence Parallelism (SP)
Long sequences are split across GPUs, with each GPU processing a portion of the sequence. Requires careful handling of attention (since each token attends to others across GPUs). Used alongside tensor parallelism, particularly for very long context models.
Combining Parallelism Strategies
In practice, frontier models use 3D parallelism -- combining data, tensor, and pipeline parallelism:
# Llama 3.1 405B Training Configuration (approximate)
# Total GPUs: 16,384 H100s
#
# Tensor Parallelism (TP): 8 GPUs per tensor group
# - Within a single server node (8 GPUs connected via NVLink)
# - Each layer split across 8 GPUs
#
# Pipeline Parallelism (PP): 16 stages
# - 16 groups of layers across 16 nodes
# - Each stage holds ~8 transformer layers
#
# Data Parallelism (DP): 128 replicas
# - 16,384 / (8 * 16) = 128 data parallel groups
# - Each processes a different batch of data
#
# Effective batch: 128 * micro_batch_size tokens per step
ZeRO Optimization
ZeRO (Zero Redundancy Optimizer) from Microsoft Research eliminates memory redundancy in data-parallel training. Without ZeRO, each GPU stores a complete copy of model states (parameters, gradients, optimizer states).
| ZeRO Stage | What is Partitioned | Memory Reduction | Communication Overhead |
|---|---|---|---|
| Stage 1 | Optimizer states only | ~4x | Same as DP |
| Stage 2 | Optimizer states + gradients | ~8x | Same as DP |
| Stage 3 | Optimizer states + gradients + parameters | Linear with # GPUs | ~1.5x DP |
Example memory breakdown for a 7B parameter model in FP16 with Adam optimizer:
# Memory per GPU WITHOUT ZeRO (Data Parallelism)
# Parameters (FP16): 7B * 2 bytes = 14 GB
# Gradients (FP16): 7B * 2 bytes = 14 GB
# Optimizer States:
# - FP32 params copy: 7B * 4 bytes = 28 GB
# - FP32 momentum: 7B * 4 bytes = 28 GB
# - FP32 variance: 7B * 4 bytes = 28 GB
# Total per GPU: 112 GB (doesn't fit on an 80 GB GPU!)
# With ZeRO Stage 3 across 8 GPUs:
# Everything partitioned: 112 GB / 8 = 14 GB per GPU
# Plus activation memory and communication buffers
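The breakdown above generalizes to a small calculator. This sketch follows the same accounting (2 bytes each for FP16 parameters and gradients, 12 bytes of FP32 optimizer state per parameter, activations and buffers excluded); the function name is illustrative:

```python
def model_state_gb(num_params: float, zero_stage: int = 0, num_gpus: int = 1) -> float:
    """Approximate per-GPU memory (GB) for model states with mixed-precision Adam."""
    GB = 1e9
    params = 2 * num_params / GB  # FP16 parameters
    grads = 2 * num_params / GB   # FP16 gradients
    optim = 12 * num_params / GB  # FP32 master params + momentum + variance
    if zero_stage >= 1:
        optim /= num_gpus   # Stage 1 shards optimizer states
    if zero_stage >= 2:
        grads /= num_gpus   # Stage 2 also shards gradients
    if zero_stage >= 3:
        params /= num_gpus  # Stage 3 also shards parameters
    return params + grads + optim

no_zero = model_state_gb(7e9)                          # the 112 GB case above
zero3 = model_state_gb(7e9, zero_stage=3, num_gpus=8)  # 14 GB per GPU
```

Plugging in other stages shows the intermediate savings: Stage 1 keeps the 28 GB of FP16 states replicated but shards the 84 GB of optimizer state, matching the ~4x reduction in the table.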
Training Stability
Training large models is notoriously unstable. Common issues and solutions:
Loss Spikes
Sudden increases in training loss can occur due to bad data batches, numerical instability, or learning rate issues. Approaches to handle them:
- Restart from checkpoint: Roll back to a checkpoint before the spike and skip the problematic data
- Reduce learning rate: Temporarily lower the learning rate when a spike is detected
- Gradient clipping: Cap gradient norms (typically at 1.0) to prevent extreme updates
- Data quality: Identify and remove the data batch that caused the spike
Llama 3 Example: Meta reported approximately 466 job interruptions during Llama 3.1 405B training. About 78% were due to hardware issues (GPU failures, network problems), and the rest were due to software bugs or environmental factors. Their checkpointing system was designed to resume training within minutes of any interruption.
Gradient Accumulation
When the desired batch size exceeds what fits in GPU memory, gradient accumulation simulates larger batches:
# Gradient Accumulation Example
accumulation_steps = 8 # Simulate 8x larger batch
optimizer.zero_grad()
for i, batch in enumerate(dataloader):
outputs = model(batch)
loss = outputs.loss / accumulation_steps # Scale loss
loss.backward() # Accumulate gradients
if (i + 1) % accumulation_steps == 0:
optimizer.step() # Update weights
optimizer.zero_grad() # Reset gradients
Practical: Distributed Training with PyTorch DDP
Hands-On Exercise
Let us set up a simple distributed training script using PyTorch's DistributedDataParallel (DDP).
"""
Distributed Training with PyTorch DDP
=======================================
This script demonstrates how to set up distributed training
using PyTorch's DistributedDataParallel (DDP).
Run with: torchrun --nproc_per_node=NUM_GPUS train_ddp.py
"""
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from torch.utils.data import Dataset
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
get_cosine_schedule_with_warmup,
)
import math
class TextDataset(Dataset):
"""Simple text dataset for causal language modeling."""
def __init__(self, texts, tokenizer, max_length=512):
self.tokenizer = tokenizer
self.max_length = max_length
self.examples = []
for text in texts:
encoding = tokenizer(
text,
truncation=True,
max_length=max_length,
padding="max_length",
return_tensors="pt",
)
self.examples.append({
"input_ids": encoding["input_ids"].squeeze(),
"attention_mask": encoding["attention_mask"].squeeze(),
})
def __len__(self):
return len(self.examples)
def __getitem__(self, idx):
return self.examples[idx]
def setup_distributed():
"""Initialize distributed training environment."""
# torchrun sets these environment variables automatically
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
return local_rank
def cleanup():
"""Clean up distributed training."""
dist.destroy_process_group()
def train(
model_name: str = "gpt2",
num_epochs: int = 3,
batch_size: int = 4,
learning_rate: float = 5e-5,
gradient_accumulation_steps: int = 4,
max_grad_norm: float = 1.0,
warmup_ratio: float = 0.1,
):
"""Main training function."""
# Setup distributed training
local_rank = setup_distributed()
global_rank = dist.get_rank()
world_size = dist.get_world_size()
is_main = global_rank == 0
if is_main:
print(f"Training with {world_size} GPUs")
print(f"Model: {model_name}")
print(f"Effective batch size: {batch_size * gradient_accumulation_steps * world_size}")
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16, # Use BF16 for training stability
)
model = model.to(local_rank)
# Wrap model with DDP
model = DDP(
model,
device_ids=[local_rank],
output_device=local_rank,
find_unused_parameters=False, # Set True if needed
)
# Create sample dataset (replace with your data)
sample_texts = [
"The transformer architecture revolutionized natural language processing.",
"Large language models learn by predicting the next token in a sequence.",
"Pre-training on diverse data gives LLMs broad general knowledge.",
"Fine-tuning adapts pre-trained models to specific tasks or domains.",
"Attention mechanisms allow models to focus on relevant context.",
"Distributed training enables training models across multiple GPUs.",
"Gradient accumulation simulates larger batch sizes with limited memory.",
"Mixed precision training uses FP16/BF16 to save memory and increase speed.",
] * 100 # Repeat for larger dataset
dataset = TextDataset(sample_texts, tokenizer)
# Distributed sampler ensures each GPU gets different data
sampler = DistributedSampler(
dataset,
num_replicas=world_size,
rank=global_rank,
shuffle=True,
)
dataloader = DataLoader(
dataset,
batch_size=batch_size,
sampler=sampler,
num_workers=2,
pin_memory=True,
)
# Optimizer
optimizer = torch.optim.AdamW(
model.parameters(),
lr=learning_rate,
weight_decay=0.01,
betas=(0.9, 0.95),
)
# Learning rate scheduler with warmup
total_steps = len(dataloader) * num_epochs // gradient_accumulation_steps
warmup_steps = int(total_steps * warmup_ratio)
scheduler = get_cosine_schedule_with_warmup(
optimizer,
num_warmup_steps=warmup_steps,
num_training_steps=total_steps,
)
# Training loop
global_step = 0
for epoch in range(num_epochs):
sampler.set_epoch(epoch) # Important for proper shuffling
model.train()
epoch_loss = 0.0
num_batches = 0
for step, batch in enumerate(dataloader):
input_ids = batch["input_ids"].to(local_rank)
attention_mask = batch["attention_mask"].to(local_rank)
# Forward pass -- causal LM uses input_ids as both input and labels
# Labels are shifted internally by the model
outputs = model(
input_ids=input_ids,
attention_mask=attention_mask,
labels=input_ids,
)
            loss = outputs.loss / gradient_accumulation_steps
            # Note: DDP all-reduces gradients on every backward(); for
            # efficiency, non-final accumulation steps can be wrapped in
            # model.no_sync() to skip the redundant communication.
            loss.backward()
epoch_loss += outputs.loss.item()
num_batches += 1
if (step + 1) % gradient_accumulation_steps == 0:
# Gradient clipping
torch.nn.utils.clip_grad_norm_(
model.parameters(), max_grad_norm
)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
global_step += 1
if is_main and global_step % 10 == 0:
avg_loss = epoch_loss / num_batches
lr = scheduler.get_last_lr()[0]
perplexity = math.exp(min(avg_loss, 20))
print(
f"Epoch {epoch+1}/{num_epochs} | "
f"Step {global_step}/{total_steps} | "
f"Loss: {avg_loss:.4f} | "
f"PPL: {perplexity:.2f} | "
f"LR: {lr:.2e}"
)
# Epoch summary
avg_epoch_loss = epoch_loss / num_batches
if is_main:
print(f"\nEpoch {epoch+1} complete. Avg loss: {avg_epoch_loss:.4f}")
        # Save checkpoint (only on main process)
        if is_main:
            checkpoint = {
                "epoch": epoch,
                "global_step": global_step,
                "model_state_dict": model.module.state_dict(),
                "optimizer_state_dict": optimizer.state_dict(),
                "scheduler_state_dict": scheduler.state_dict(),
                "loss": avg_epoch_loss,
            }
            torch.save(checkpoint, f"checkpoint_epoch_{epoch+1}.pt")
            print(f"Saved checkpoint_epoch_{epoch+1}.pt")
# Save final model
if is_main:
model.module.save_pretrained("./trained_model")
tokenizer.save_pretrained("./trained_model")
print("Training complete. Model saved to ./trained_model")
cleanup()
if __name__ == "__main__":
train()
# To run the distributed training script:
# Single node, multiple GPUs:
torchrun --nproc_per_node=4 train_ddp.py
# Multiple nodes:
# Node 0 (master):
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 \
--master_addr=10.0.0.1 --master_port=29500 train_ddp.py
# Node 1:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 \
--master_addr=10.0.0.1 --master_port=29500 train_ddp.py
4. Post-training
Post-training transforms a raw text-completion model into a helpful, harmless, and honest assistant. This stage is what makes the difference between a model that completes "What is the capital of France?" with "What is the capital of Germany? What is the capital of Italy?" versus one that responds "The capital of France is Paris." Let us explore the key techniques.
Supervised Fine-Tuning (SFT)
SFT is the first step in post-training. The model is fine-tuned on high-quality instruction-response pairs in a conversational format.
SFT Data Format
{
"messages": [
{
"role": "system",
"content": "You are a helpful AI assistant."
},
{
"role": "user",
"content": "Explain quantum computing in simple terms."
},
{
"role": "assistant",
"content": "Quantum computing is a type of computing that uses quantum mechanical phenomena..."
}
]
}
Key Aspects of SFT
- Data quality over quantity: A few thousand high-quality examples can be more effective than millions of low-quality ones. The LIMA paper showed that 1,000 carefully curated examples could produce a surprisingly capable model.
- Loss masking: During SFT, loss is typically only computed on the assistant's tokens, not on the user's messages or system prompts. This teaches the model to respond, not to predict what users will say.
- Chat templates: Each model family uses a specific template to format conversations (e.g., ChatML, Llama chat format). Consistency between training and inference is critical.
- Multi-turn conversations: SFT data should include multi-turn dialogues to teach the model to maintain context.
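In practice, loss masking is implemented by setting labels to -100 (PyTorch's cross-entropy ignore index) on every non-assistant token. A minimal sketch over plain token-ID lists; a real pipeline would derive the segments from the chat template rather than hard-code them:

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss

def build_sft_labels(segments):
    """Build (input_ids, labels) from (token_ids, is_assistant) segments.

    Loss is computed only where labels != IGNORE_INDEX, i.e. only on
    assistant tokens; system and user tokens are masked out.
    """
    input_ids, labels = [], []
    for token_ids, is_assistant in segments:
        input_ids.extend(token_ids)
        labels.extend(token_ids if is_assistant else [IGNORE_INDEX] * len(token_ids))
    return input_ids, labels

# Toy example: system+user prompt tokens [1,2,3,4], assistant reply [5,6,7]
ids, labels = build_sft_labels([([1, 2, 3, 4], False), ([5, 6, 7], True)])
print(ids)     # [1, 2, 3, 4, 5, 6, 7]
print(labels)  # [-100, -100, -100, -100, 5, 6, 7]
```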
RLHF (Reinforcement Learning from Human Feedback)
RLHF further aligns the model with human preferences. It consists of two stages:
Stage 1: Reward Model Training
Human annotators compare pairs of model responses and indicate which is better. A reward model is trained on these preferences.
# RLHF Reward Model Training (conceptual)
#
# Training data format:
# (prompt, chosen_response, rejected_response)
#
# Loss function (Bradley-Terry model):
# L = -log(sigmoid(r(chosen) - r(rejected)))
#
# where r(x) is the scalar reward score for response x
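The Bradley-Terry loss above reduces to simple scalar arithmetic per preference pair. A numeric sketch: a real reward model computes r(x) with a scalar head on a transformer, but here the rewards are just floats so the shape of the loss is visible:

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(x)) computed stably as log(1 + exp(-x))
    return math.log(1.0 + math.exp(-margin))

print(round(reward_model_loss(2.0, 0.0), 4))   # 0.1269 -- correct, confident ranking
print(round(reward_model_loss(0.0, 0.0), 4))   # 0.6931 -- indifferent (log 2)
print(round(reward_model_loss(-1.0, 1.0), 4))  # 2.1269 -- wrong ranking, high loss
```

The loss only cares about the *margin* between the two scores, which is why reward model outputs are meaningful only relative to each other, not on an absolute scale.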
Stage 2: PPO Training
The LLM is fine-tuned using Proximal Policy Optimization (PPO) to maximize the reward model's score while staying close to the SFT model (to prevent reward hacking).
# PPO Training Objective (simplified)
#
# L_PPO = E[min(
# ratio * advantage,
# clip(ratio, 1-epsilon, 1+epsilon) * advantage
# )] - beta * KL(policy || reference)
#
# where:
# - ratio = pi_new(a|s) / pi_old(a|s) (probability ratio)
# - advantage = reward - baseline
# - KL penalty prevents the model from diverging too far from the SFT model
# - beta controls the strength of the KL penalty
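The clipped surrogate can be illustrated per token with plain floats. This sketch covers only the min/clip core of the objective; the KL penalty and the advantage estimation (from a value model) are separate components:

```python
def ppo_clipped_term(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """min(ratio * A, clip(ratio, 1-eps, 1+eps) * A) for a single token."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the benefit of pushing the ratio up is capped at 1+eps
print(ppo_clipped_term(1.5, advantage=1.0))   # 1.2 (clipped)
print(ppo_clipped_term(1.1, advantage=1.0))   # 1.1 (inside the clip range)
# Negative advantage: min() keeps the unclipped, more pessimistic term
print(ppo_clipped_term(1.5, advantage=-1.0))  # -1.5
```

The asymmetry is the point: clipping caps how much the policy can profit from any single update, while the min() ensures bad moves are never hidden by the clip, keeping updates conservative.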
Why RLHF is Difficult
RLHF is notoriously difficult to get right. The reward model can be gamed (reward hacking), the KL penalty must be carefully tuned, PPO requires maintaining multiple models simultaneously (policy, reference, reward, value), and the process is computationally expensive. This led to the development of simpler alternatives like DPO.
DPO (Direct Preference Optimization)
DPO, introduced by Rafailov et al. (2023), eliminates the need for a separate reward model and RL training. Instead, it directly optimizes the language model on preference pairs.
# DPO Loss Function
#
# L_DPO = -E[log sigmoid(
#     beta * (log(pi(y_w|x) / pi_ref(y_w|x)) - log(pi(y_l|x) / pi_ref(y_l|x)))
# )]
#
# where:
# - pi is the current policy (model being trained)
# - pi_ref is the reference model (frozen SFT model)
# - y_w is the preferred (winning) response
# - y_l is the dispreferred (losing) response
# - x is the prompt
# - beta controls how much to deviate from the reference
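The DPO loss needs only four sequence log-probabilities per preference pair. A scalar sketch; in a real implementation each value is the sum of token log-probs of the response under the policy or the frozen reference model:

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """-log(sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))))."""
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(x)) computed stably as log(1 + exp(-x))
    return math.log(1.0 + math.exp(-logits))

# Policy identical to the reference: implicit reward margin is 0 -> loss = log 2
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931
# Policy has shifted probability mass toward the preferred response: loss drops
print(round(dpo_loss(-8.0, -14.0, -10.0, -12.0), 4))   # 0.5130
```

Note that the loss depends only on how the policy's log-ratios *moved relative to the reference*, which is exactly the implicit reward DPO optimizes, with no reward model in sight.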
Why DPO became popular:
- No need to train a separate reward model
- No RL training loop (simpler, more stable)
- Can be implemented as a standard supervised loss
- Often produces comparable results to RLHF
- Many top models in 2025-2026 use DPO or variants (ORPO, IPO, KTO)
Constitutional AI (Anthropic)
Anthropic developed Constitutional AI (CAI) as an alternative alignment approach. Instead of relying entirely on human feedback, CAI uses a set of principles (a "constitution") to guide the model's behavior.
The CAI Process
- Generate responses: The model generates responses to prompts, including potentially harmful ones
- Self-critique: The model critiques its own responses based on constitutional principles (e.g., "Is this response harmful? Is it honest?")
- Self-revision: The model revises its responses based on its critiques
- RLAIF: Fine-tune on the revised responses, then train a preference model from AI-generated comparisons (Reinforcement Learning from AI Feedback) instead of human labels
This approach reduces the need for human labelers to interact with harmful content and makes the alignment process more scalable and transparent.
Practical: Fine-tune with SFT Using HuggingFace
Hands-On Exercise
Fine-tune a small language model (GPT-2 or similar) on instruction data using HuggingFace's TRL (Transformer Reinforcement Learning) library.
"""
Supervised Fine-Tuning (SFT) Example
======================================
Fine-tune a small model on instruction-following data
using HuggingFace TRL.
pip install transformers datasets trl peft accelerate
"""
from datasets import Dataset
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
TrainingArguments,
)
from trl import SFTTrainer, SFTConfig
def create_instruction_dataset():
"""Create a small instruction-following dataset."""
examples = [
{
"instruction": "Explain what machine learning is.",
"input": "",
"output": (
"Machine learning is a branch of artificial intelligence that "
"enables computers to learn patterns from data without being "
"explicitly programmed. Instead of writing specific rules, "
"you provide the system with examples, and it learns to make "
"predictions or decisions based on those examples. Common "
"types include supervised learning (learning from labeled data), "
"unsupervised learning (finding patterns in unlabeled data), "
"and reinforcement learning (learning through trial and error)."
),
},
{
"instruction": "Write a Python function to calculate the factorial of a number.",
"input": "",
"output": (
"Here's a Python function to calculate factorial:\n\n"
"```python\n"
"def factorial(n):\n"
" \"\"\"Calculate the factorial of a non-negative integer.\"\"\"\n"
" if n < 0:\n"
" raise ValueError('Factorial is not defined for negative numbers')\n"
" if n == 0 or n == 1:\n"
" return 1\n"
" return n * factorial(n - 1)\n"
"\n"
"# Examples:\n"
"print(factorial(5)) # 120\n"
"print(factorial(10)) # 3628800\n"
"```\n\n"
"This function uses recursion. For very large numbers, consider "
"an iterative approach or `math.factorial()` from the standard library."
),
},
{
"instruction": "Summarize the following text.",
"input": (
"The transformer architecture, introduced in 2017, replaced "
"recurrent neural networks as the dominant architecture for "
"natural language processing. It uses self-attention mechanisms "
"to process all tokens in parallel, leading to much faster "
"training. The key innovation was the multi-head attention "
"mechanism, which allows the model to attend to different "
"parts of the input simultaneously."
),
"output": (
"The transformer architecture (2017) replaced RNNs in NLP by "
"using parallel self-attention mechanisms instead of sequential "
"processing, enabling faster training through its innovative "
"multi-head attention that attends to multiple input parts "
"simultaneously."
),
},
# Add more examples as needed...
] * 50 # Repeat for a larger dataset
return Dataset.from_list(examples)
def format_instruction(example):
"""Format instruction data into a chat-style prompt."""
if example["input"]:
text = (
f"### Instruction:\n{example['instruction']}\n\n"
f"### Input:\n{example['input']}\n\n"
f"### Response:\n{example['output']}"
)
else:
text = (
f"### Instruction:\n{example['instruction']}\n\n"
f"### Response:\n{example['output']}"
)
return text
def main():
# Configuration
model_name = "gpt2" # Use a small model for demonstration
output_dir = "./sft_output"
# Load tokenizer and model
print(f"Loading model: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
)
# Create dataset
print("Creating dataset...")
dataset = create_instruction_dataset()
# Format the dataset
def formatting_func(example):
return format_instruction(example)
# SFT Training configuration
sft_config = SFTConfig(
output_dir=output_dir,
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-5,
weight_decay=0.01,
warmup_ratio=0.1,
lr_scheduler_type="cosine",
logging_steps=10,
save_strategy="epoch",
bf16=True, # Use BF16 if GPU supports it
max_seq_length=512,
packing=True, # Pack multiple short sequences into one
dataset_text_field=None, # We use formatting_func instead
)
# Create SFT trainer
trainer = SFTTrainer(
model=model,
args=sft_config,
train_dataset=dataset,
tokenizer=tokenizer,
formatting_func=formatting_func,
)
# Train
print("Starting SFT training...")
trainer.train()
# Save model
print(f"Saving model to {output_dir}")
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
# Test the fine-tuned model
print("\nTesting fine-tuned model...")
from transformers import pipeline
generator = pipeline(
"text-generation",
model=output_dir,
tokenizer=output_dir,
max_length=200,
)
test_prompt = "### Instruction:\nWhat is deep learning?\n\n### Response:\n"
result = generator(test_prompt, do_sample=True, temperature=0.7)
print(f"Generated response:\n{result[0]['generated_text']}")
if __name__ == "__main__":
main()
5. Model Evaluation
Evaluating LLMs is one of the most challenging problems in AI. Unlike traditional ML where accuracy on a test set suffices, LLM evaluation must capture diverse capabilities across language understanding, reasoning, coding, math, safety, and more.
Perplexity
Perplexity is the most fundamental metric for language models. It measures how well the model predicts a held-out test set.
# Perplexity = exp(-1/N * sum(log P(x_i | x_{<i})))
#
# - Lower perplexity = better model
# - Perplexity of 1 = perfect prediction
# - Perplexity equal to vocab size = random guessing
#
# Example: GPT-2 perplexity on WikiText-103: ~29.4
# Example: GPT-3 perplexity on WikiText-103: ~20.5
Limitations: Perplexity measures raw prediction ability but does not directly correlate with downstream task performance, instruction following, or safety. A model with lower perplexity is not necessarily more helpful.
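Applied to a list of per-token log-probabilities (natural log), the formula is a one-liner. A toy sketch; in practice the log-probs come from the model's loss over a held-out corpus:

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Perfect prediction: every token was assigned probability 1.0
print(perplexity([0.0, 0.0, 0.0]))                   # 1.0
# Every token assigned probability 0.5 -> perplexity 2
print(perplexity([math.log(0.5)] * 4))               # 2.0
# Uniform guessing over a 50,257-token vocab (GPT-2's size)
print(round(perplexity([math.log(1 / 50257)] * 3)))  # 50257
```

Intuitively, perplexity is the effective branching factor: a perplexity of 2 means the model is, on average, as uncertain as a fair coin flip at each token.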
Key Benchmarks
| Benchmark | What It Measures | Format | Examples |
|---|---|---|---|
| MMLU | Broad knowledge across 57 subjects | Multiple choice (A/B/C/D) | 14,042 questions |
| MMLU-Pro | Harder version with 10 options + reasoning | Multiple choice | 12,032 questions |
| HumanEval | Python code generation | Complete a function | 164 problems |
| GSM8K | Grade school math | Word problems | 8,500 problems |
| MATH | Competition-level mathematics | Open-ended math problems | 12,500 problems |
| ARC | Science reasoning (grade school) | Multiple choice | 7,787 questions |
| HellaSwag | Common sense reasoning | Sentence completion | 10,042 examples |
| TruthfulQA | Truthfulness / avoiding common misconceptions | Open-ended + MC | 817 questions |
| GPQA | Graduate-level science questions | Multiple choice | 448 questions |
| SWE-Bench | Real software engineering tasks | Fix GitHub issues | 2,294 tasks |
The Contamination Problem
A major concern in LLM evaluation is benchmark contamination -- when test data appears in training data. If a model has seen the exact questions from MMLU during pre-training, its score is inflated and not meaningful.
Approaches to combat contamination:
- n-gram overlap checking: Check if test questions appear verbatim in training data
- Dynamic benchmarks: LiveBench generates new questions regularly so they cannot be in training data
- Private test sets: Keep test questions private (but this limits reproducibility)
- Paraphrased versions: Test with rephrased questions to see if performance holds
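A minimal version of the first approach: flag a test question if any of its n-grams appears in an index built from the training corpus. Real pipelines normalize and tokenize text and use approximate data structures at scale; this sketch uses lowercase whitespace-split words:

```python
def ngrams(text: str, n: int = 8) -> set:
    """All word n-grams of a text, lowercased, as a set of tuples."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question: str, train_ngrams: set, n: int = 8) -> bool:
    """True if any n-gram of the question appears in the training index."""
    return not ngrams(question, n).isdisjoint(train_ngrams)

train_doc = ("the quick brown fox jumps over the lazy dog "
             "while the cat watches from the window sill quietly")
index = ngrams(train_doc, n=8)

print(is_contaminated("the quick brown fox jumps over the lazy dog", index))  # True
print(is_contaminated("what is the capital of france and of spain", index))   # False
```

The choice of n is a precision/recall trade-off: short n-grams produce false positives on common phrases, while long ones miss lightly edited copies.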
Human Evaluation and Chatbot Arena
LMSYS's Chatbot Arena (now lmarena.ai) is widely considered the gold standard for LLM evaluation. Users interact with two anonymous models side-by-side and vote for which response they prefer. The results generate an Elo-style leaderboard.
Why it works:
- Real users with diverse questions (not artificial benchmarks)
- Blind comparison (users do not know which model is which)
- Large scale (millions of votes as of early 2026)
- Difficult to game (you cannot optimize for "Arena score")
- Captures aspects benchmarks miss (helpfulness, writing quality, nuance)
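The update behind an Elo-style leaderboard fits in a few lines. A minimal sketch of the classic sequential Elo rule; the Arena leaderboard actually fits a Bradley-Terry model over all votes at once, which is order-independent, but the intuition is the same:

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Apply one pairwise vote; return the updated (r_a, r_b)."""
    # Expected score of A under the Elo logistic model
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)  # zero-sum transfer of rating points
    return r_a + delta, r_b - delta

# Two equally rated models: the winner gains k/2 = 16 points
print(elo_update(1000.0, 1000.0, a_wins=True))  # (1016.0, 984.0)
# An upset (low-rated model beats a much higher-rated one) moves ratings more
print(elo_update(1000.0, 1200.0, a_wins=True))
```

Expected wins move ratings barely at all, while upsets move them a lot, which is what lets the leaderboard converge from noisy individual votes.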
Practical: Simple Benchmark Evaluation
"""
Simple LLM Benchmark Evaluation
==================================
Evaluate a model on a simple multiple-choice benchmark by hand,
scoring the log-likelihood of each choice. Equivalent commands for
the lm-evaluation-harness library are shown at the end.
"""
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import List, Dict, Tuple
class SimpleBenchmarkEvaluator:
"""Evaluate a model on simple multiple-choice questions."""
def __init__(self, model_name: str):
print(f"Loading model: {model_name}")
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto",
)
self.model.eval()
def evaluate_multiple_choice(
self,
question: str,
choices: List[str],
correct_idx: int,
) -> Tuple[int, bool]:
"""
Evaluate a single multiple-choice question.
Returns the model's predicted index and whether it was correct.
"""
        # Method: score each choice by the model's average log-likelihood
        # of "Question: ... Answer: <choice>". Caveat: the loss is averaged
        # over ALL tokens (question included), which biases against longer
        # choices; production harnesses score only the answer tokens.
choice_scores = []
for i, choice in enumerate(choices):
prompt = f"Question: {question}\nAnswer: {choice}"
inputs = self.tokenizer(prompt, return_tensors="pt").to(
self.model.device
)
with torch.no_grad():
outputs = self.model(**inputs, labels=inputs["input_ids"])
# Negative log-likelihood (lower = more likely)
loss = outputs.loss.item()
# We want higher likelihood, so negate the loss
choice_scores.append(-loss)
predicted_idx = choice_scores.index(max(choice_scores))
is_correct = predicted_idx == correct_idx
return predicted_idx, is_correct
def run_benchmark(self, questions: List[Dict]) -> Dict:
"""Run evaluation on a set of questions."""
correct = 0
total = len(questions)
results = []
for i, q in enumerate(questions):
pred_idx, is_correct = self.evaluate_multiple_choice(
question=q["question"],
choices=q["choices"],
correct_idx=q["correct_idx"],
)
if is_correct:
correct += 1
results.append({
"question": q["question"],
"predicted": q["choices"][pred_idx],
"correct": q["choices"][q["correct_idx"]],
"is_correct": is_correct,
})
print(
f"[{i+1}/{total}] "
f"{'CORRECT' if is_correct else 'WRONG'} | "
f"Predicted: {q['choices'][pred_idx]}"
)
accuracy = correct / total * 100
print(f"\nAccuracy: {correct}/{total} = {accuracy:.1f}%")
return {
"accuracy": accuracy,
"correct": correct,
"total": total,
"results": results,
}
# Sample benchmark questions (MMLU-style)
SAMPLE_QUESTIONS = [
{
"question": "What is the chemical symbol for gold?",
"choices": ["Ag", "Au", "Fe", "Cu"],
"correct_idx": 1,
},
{
"question": "Which planet is closest to the Sun?",
"choices": ["Venus", "Earth", "Mercury", "Mars"],
"correct_idx": 2,
},
{
"question": "What is the time complexity of binary search?",
"choices": ["O(n)", "O(n^2)", "O(log n)", "O(1)"],
"correct_idx": 2,
},
{
"question": "Who wrote 'A Brief History of Time'?",
"choices": [
"Albert Einstein",
"Stephen Hawking",
"Richard Feynman",
"Carl Sagan",
],
"correct_idx": 1,
},
{
"question": "What is the derivative of x^2?",
"choices": ["x", "2x", "x^2", "2x^2"],
"correct_idx": 1,
},
]
if __name__ == "__main__":
# Evaluate GPT-2 on our mini benchmark
evaluator = SimpleBenchmarkEvaluator("gpt2")
results = evaluator.run_benchmark(SAMPLE_QUESTIONS)
# For comprehensive evaluation, use lm-evaluation-harness:
# pip install lm-eval
# lm_eval --model hf --model_args pretrained=gpt2 \
# --tasks mmlu,hellaswag,arc_easy --batch_size 8
print("\n--- For comprehensive evaluation, use: ---")
print("pip install lm-eval")
print("lm_eval --model hf \\")
print(" --model_args pretrained=YOUR_MODEL \\")
print(" --tasks mmlu,hellaswag,arc_easy,truthfulqa \\")
print(" --batch_size 8")
6. Latest Developments (2025-2026)
The LLM landscape has evolved rapidly through 2025 and into early 2026. Here are the most significant developments:
DeepSeek V3 and R1: Efficiency Breakthroughs
DeepSeek, a Chinese AI lab, made waves with two groundbreaking models:
DeepSeek V3 (December 2024)
- 671B total parameters, 37B active per token (Mixture of Experts)
- Trained for approximately $5.6 million -- a fraction of comparable models
- Used 2,048 H800 GPUs (compared to 16,384 H100s for Llama 3)
- Innovations: Multi-head Latent Attention (MLA), DeepSeekMoE with auxiliary-loss-free balancing, FP8 mixed-precision training
- Competitive with GPT-4o and Claude 3.5 Sonnet on many benchmarks
DeepSeek R1 (January 2025)
- A reasoning model that shows its chain-of-thought process
- Trained using large-scale reinforcement learning (Group Relative Policy Optimization -- GRPO)
- Key insight: reasoning behaviors emerge from pure RL without SFT -- the model learned to think step-by-step, self-verify, and explore multiple approaches
- DeepSeek R1-Zero showed emergent "aha moments" where the model discovered new reasoning strategies during RL training
- Competitive with OpenAI's o1 on math and coding benchmarks
- Open-weight, spurring a wave of distilled reasoning models
Llama 3.1, 3.2, and 4
Llama 3.1 (July 2024)
- Released in 8B, 70B, and 405B parameter sizes
- 128K context length
- The 405B model was the first truly competitive open-weight frontier model
- Trained on 15T+ tokens
Llama 3.2 (September 2024)
- Added multimodal capabilities (vision) in 11B and 90B sizes
- Lightweight text models: 1B and 3B parameters for edge/mobile deployment
- Demonstrated that small models can be surprisingly capable when well-trained
Llama 4 (2025)
- Meta's next-generation model family with significant architectural changes
- Improved reasoning and instruction-following capabilities
- Enhanced multilingual and multimodal support
- Further efficiency improvements in training and inference
Claude 3.5 and Claude 4
- Claude 3.5 Sonnet (released mid-2024, updated late 2024): Became the leading model for coding tasks, with particularly strong performance on SWE-Bench. Introduced "computer use" capabilities for agentic tasks.
- Claude 3.5 Haiku: Fast, cost-effective model competitive with much larger models. Excellent for high-throughput applications.
- Claude 4 family (2025): Significant advances in reasoning, coding, and extended thinking capabilities. Claude Opus 4 set new benchmarks for agentic coding tasks.
- Anthropic's focus on safety and Constitutional AI continues to differentiate their approach.
GPT-4o and Multimodal Training
- GPT-4o ("omni"): Natively multimodal -- processes text, images, and audio in a unified architecture rather than separate models stitched together
- Real-time voice conversations with emotional expression
- Significantly lower latency and cost than GPT-4 Turbo
- o1 and o3: OpenAI's reasoning models that use "thinking" tokens to solve complex problems, achieving human-expert-level performance on GPQA and competitive math olympiad problems
Open-Weight Model Revolution
2024-2025 saw a dramatic shift toward open-weight models:
- Llama 3.1 405B proved open models can compete with proprietary frontier models
- DeepSeek V3/R1 showed that competitive models can be trained at a fraction of the cost
- Mistral continued releasing high-quality models (Mistral Large 2, Pixtral)
- Qwen 2.5 from Alibaba: Strong multilingual models in various sizes
- This shift has democratized AI and created a vibrant ecosystem of fine-tuned and merged models
Small Model Renaissance
Perhaps the most impactful trend of 2025 is the emergence of surprisingly capable small models:
| Model | Parameters | Notable Capability |
|---|---|---|
| Phi-4 (Microsoft) | 14B | Matches 70B+ models on reasoning benchmarks; heavy use of synthetic training data |
| Gemma 2 (Google) | 2B, 9B, 27B | State-of-the-art for their sizes; excellent for research and on-device |
| Qwen 2.5 (Alibaba) | 0.5B-72B | Strong across the board; excellent coding and math models |
| SmolLM2 (HuggingFace) | 135M-1.7B | Tiny but capable; designed for on-device applications |
| Llama 3.2 (Meta) | 1B, 3B | Efficient edge models with strong instruction following |
Key insights driving the small model revolution:
- Data quality over model size: Carefully curated training data (especially synthetic data) can compensate for smaller parameter counts
- Knowledge distillation: Smaller models trained on outputs of larger models inherit much of their capability
- Architecture improvements: Better attention mechanisms, training recipes, and optimization techniques make every parameter count more
- Practical deployment: Small models can run on laptops, phones, and edge devices, enabling new use cases
Summary and Key Takeaways
Week 5 Key Takeaways
- Data is everything: The quality and composition of training data is the single most important factor in model quality. Careful preprocessing, deduplication, and quality filtering are essential.
- Pre-training is expensive but conceptually simple: Next-token prediction at massive scale creates emergent capabilities. The engineering challenge is in distributed training infrastructure.
- Parallelism is necessary: Data parallelism, tensor parallelism, pipeline parallelism, and ZeRO optimization work together to train models that do not fit on single GPUs.
- Post-training matters enormously: SFT and alignment (RLHF/DPO/CAI) transform a raw text predictor into a helpful assistant. Without post-training, even the best base model is not useful for end users.
- Evaluation is hard: No single benchmark captures all of a model's capabilities. Chatbot Arena remains the best holistic evaluation, but automated benchmarks are essential for rapid iteration.
- Efficiency is the new frontier: DeepSeek V3 proved that competitive models can be trained at 10-20x lower cost. The focus is shifting from "bigger" to "smarter."
- Open-weight models are thriving: The gap between open and proprietary models has narrowed dramatically. Small models (1B-14B) are now capable enough for many production use cases.
Next Steps
In Week 6: Quantization and Fine-Tuning, we will dive deep into making these large models practical for deployment. You will learn about KV caches, quantization techniques (GPTQ, AWQ, GGUF), attention optimizations (FlashAttention, PagedAttention), and parameter-efficient fine-tuning with LoRA and QLoRA.