Week 2: Tokenization, Vectorization, and Attention

Tokenization

See how text gets broken into tokens

The ░trans former ░model ░process es ░lang uage

Tokens

Characters

BPE

Algorithm

1. Tokenization

1.1 Why We Need Tokenization

Computers don't understand text. They understand numbers. Tokenization is the process of converting raw text into a sequence of tokens (discrete units), each mapped to an integer ID. This is always the very first step in any NLP pipeline.

# The fundamental problem:
text = "Hello, world!"
# Computer sees: just a sequence of bytes/characters
# We need to convert this into numbers that a neural network can process

# After tokenization (using GPT-4's tokenizer):
# "Hello, world!" -> [9906, 11, 1917, 0]
# Each number is an index into a vocabulary of ~100,000 tokens

1.2 Tokenization Strategies

Four tokenization granularities applied to the same input text.

Character-Level Tokenization

Split text into individual characters.

# Character-level tokenization
text = "Hello world"
tokens = list(text)
print(tokens)  # ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']

# Vocabulary is tiny: ~256 characters (ASCII) or ~150,000 (Unicode)
# But sequences become VERY long: a 1000-word article = ~5000 characters
# This makes it hard for the model to learn long-range dependencies

# Pros: Small vocabulary, handles ANY text (no unknown tokens)
# Cons: Very long sequences, hard to capture word-level meaning

Word-Level Tokenization

Split text into words (by whitespace and punctuation).

# Word-level tokenization
text = "The cat sat on the mat."
tokens = text.split()
print(tokens)  # ['The', 'cat', 'sat', 'on', 'the', 'mat.']

# Problems:
# 1. Huge vocabulary: English has ~170,000 words + names + technical terms
# 2. Can't handle new words: "ChatGPT" would be [UNK] if not in vocabulary
# 3. Morphology is lost: "run", "running", "runner" are 3 separate tokens
#    with no shared representation
# 4. Different languages: Chinese/Japanese don't use spaces between words

# Pros: Intuitive, short sequences
# Cons: Huge vocab, can't handle unseen words, no morphological sharing

Subword Tokenization (The Sweet Spot)

Modern LLMs use subword tokenization — a middle ground between character-level and word-level. Common words stay as single tokens, while rare words are broken into meaningful subword units.

# Subword tokenization (what modern LLMs actually use)
# "unhappiness" -> ["un", "happiness"]  or  ["un", "happi", "ness"]
# "ChatGPT"    -> ["Chat", "G", "PT"]
# "the"        -> ["the"]  (common word stays whole)

# Benefits:
# - Manageable vocabulary size (32K - 128K tokens)
# - Handles ANY word (even new ones like "ChatGPT")
# - Shares representations: "unhappy" and "unhelpful" share "un"
# - Good sequence lengths (not too long, not too short)

Tokenization Pipeline

graph LR A["Raw Text
Hello world"] --> B["Tokenizer
BPE / WordPiece"] B --> C["Tokens
Hel, lo, world"] C --> D["Token IDs
4521, 328, 1917"] D --> E["Embedding Layer
Lookup Table"] E --> F["Dense Vectors
768-dim each"] style A fill:#e8f4f8,stroke:#333 style B fill:#d5e8d4,stroke:#333 style C fill:#fff2cc,stroke:#333 style D fill:#dae8fc,stroke:#333 style E fill:#e1d5e7,stroke:#333 style F fill:#f8cecc,stroke:#333

1.3 Byte Pair Encoding (BPE) — Step by Step

BPE is the most popular subword tokenization algorithm. Used by GPT-2, GPT-3, GPT-4, Llama, and most modern LLMs. The algorithm is surprisingly simple:

BPE Algorithm:

Start with a vocabulary of individual characters (bytes)
Count all adjacent pairs of tokens in the training corpus
Merge the most frequent pair into a new token
Repeat steps 2-3 until the desired vocabulary size is reached

BPE iteratively merges the most frequent character pairs until common words become single tokens.

"""
BPE (Byte Pair Encoding) - Complete Worked Example
===================================================
Let's trace through BPE on a small corpus.
"""

# Our tiny training corpus
corpus = "low low low low low lowest lowest newer newer newer wider wider wider"

# STEP 1: Split into characters and add end-of-word marker
# We add a special marker '' to preserve word boundaries
words = corpus.split()
word_freq = {}
for word in words:
    # Represent each word as a tuple of characters + end marker
    char_tuple = tuple(list(word) + [''])
    word_freq[char_tuple] = word_freq.get(char_tuple, 0) + 1

print("Initial vocabulary (characters):")
print(word_freq)
# {('l','o','w',''): 5,
#  ('l','o','w','e','s','t',''): 2,
#  ('n','e','w','e','r',''): 3,
#  ('w','i','d','e','r',''): 3}

# Current vocabulary: {'l','o','w','e','s','t','n','r','i','d',''}

# STEP 2: Count all adjacent pairs
def get_pair_counts(word_freq):
    """Count frequency of all adjacent symbol pairs."""
    pairs = {}
    for word, freq in word_freq.items():
        for i in range(len(word) - 1):
            pair = (word[i], word[i+1])
            pairs[pair] = pairs.get(pair, 0) + freq
    return pairs

pairs = get_pair_counts(word_freq)
print("\nPair frequencies:")
for pair, count in sorted(pairs.items(), key=lambda x: -x[1])[:10]:
    print(f"  {pair}: {count}")

# Most frequent pairs:
# ('e', 'r'): 6   (from "newer" x3 + "wider" x3)
# ('l', 'o'): 7   (from "low" x5 + "lowest" x2)
# ('o', 'w'): 7   (from "low" x5 + "lowest" x2)
# ... etc.

# STEP 3: Merge the most frequent pair
def merge_pair(word_freq, pair_to_merge):
    """Merge all occurrences of the most frequent pair."""
    new_word_freq = {}
    merged = pair_to_merge[0] + pair_to_merge[1]

    for word, freq in word_freq.items():
        new_word = []
        i = 0
        while i < len(word):
            if (i < len(word) - 1 and
                word[i] == pair_to_merge[0] and
                word[i+1] == pair_to_merge[1]):
                new_word.append(merged)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_word_freq[tuple(new_word)] = freq

    return new_word_freq

# Let's run several merges:
print("\n=== BPE Merge Steps ===")
num_merges = 10
merges = []

for step in range(num_merges):
    pairs = get_pair_counts(word_freq)
    if not pairs:
        break

    # Find the most frequent pair
    best_pair = max(pairs, key=pairs.get)
    best_count = pairs[best_pair]

    print(f"\nMerge {step+1}: '{best_pair[0]}' + '{best_pair[1]}' -> "
          f"'{best_pair[0]+best_pair[1]}' (frequency: {best_count})")

    merges.append(best_pair)
    word_freq = merge_pair(word_freq, best_pair)

    print(f"  Vocabulary after merge:")
    for word, freq in word_freq.items():
        print(f"    {word} : {freq}")

# After all merges, common subwords like "low", "er", "est" become tokens
# The merge rules are saved and applied at inference time

# Expected merge sequence (approximately):
# Merge 1: 'l' + 'o' -> 'lo'
# Merge 2: 'lo' + 'w' -> 'low'
# Merge 3: 'e' + 'r' -> 'er'
# Merge 4: 'er' + '' -> 'er'
# Merge 5: 'n' + 'e' -> 'ne'
# Merge 6: 'ne' + 'w' -> 'new'
# Merge 7: 'new' + 'er' -> 'newer'
# Merge 8: 'w' + 'i' -> 'wi'
# Merge 9: 'wi' + 'd' -> 'wid'
# Merge 10: 'low' + '' -> 'low'

1.4 PRACTICAL: Full BPE Implementation from Scratch

"""
Complete BPE Tokenizer Implementation
======================================
A working BPE tokenizer you can train on any text.
"""

import re
from collections import Counter, defaultdict

class BPETokenizer:
    """Byte Pair Encoding tokenizer built from scratch."""

    def __init__(self, vocab_size=256 + 256):  # 256 bytes + 256 merges
        self.vocab_size = vocab_size
        self.merges = []           # List of (pair_a, pair_b) merges in order
        self.vocab = {}            # token_id -> token_bytes
        self.inverse_vocab = {}    # token_bytes -> token_id

    def _get_pair_counts(self, token_sequences):
        """Count adjacent pairs across all sequences."""
        counts = Counter()
        for seq in token_sequences:
            for i in range(len(seq) - 1):
                counts[(seq[i], seq[i+1])] += 1
        return counts

    def _merge(self, token_sequences, pair, new_token):
        """Replace all occurrences of pair with new_token."""
        new_sequences = []
        for seq in token_sequences:
            new_seq = []
            i = 0
            while i < len(seq):
                if i < len(seq) - 1 and seq[i] == pair[0] and seq[i+1] == pair[1]:
                    new_seq.append(new_token)
                    i += 2
                else:
                    new_seq.append(seq[i])
                    i += 1
            new_sequences.append(new_seq)
        return new_sequences

    def train(self, text, verbose=True):
        """Train the BPE tokenizer on the given text.

        Args:
            text: Training text (string)
            verbose: Print merge steps
        """
        # Step 1: Convert text to bytes (UTF-8 encoding)
        text_bytes = text.encode('utf-8')

        # Step 2: Initialize vocabulary with individual bytes (0-255)
        for i in range(256):
            self.vocab[i] = bytes([i])
            self.inverse_vocab[bytes([i])] = i

        # Step 3: Create initial token sequences (each byte is a token)
        # We split into "chunks" at whitespace boundaries for efficiency
        # but this is a simplification - real BPE works on the full byte sequence
        token_sequences = [list(text_bytes)]

        # Step 4: Iteratively merge most frequent pairs
        num_merges = self.vocab_size - 256

        for i in range(num_merges):
            # Count all adjacent pairs
            pair_counts = self._get_pair_counts(token_sequences)
            if not pair_counts:
                break

            # Find the most frequent pair
            best_pair = max(pair_counts, key=pair_counts.get)
            best_count = pair_counts[best_pair]

            if best_count < 2:
                break  # No pair appears more than once

            # Create new token
            new_token = 256 + i
            new_bytes = self.vocab[best_pair[0]] + self.vocab[best_pair[1]]
            self.vocab[new_token] = new_bytes
            self.inverse_vocab[new_bytes] = new_token
            self.merges.append(best_pair)

            if verbose and (i < 10 or i % 50 == 0):
                try:
                    merged_str = new_bytes.decode('utf-8', errors='replace')
                except:
                    merged_str = str(new_bytes)
                print(f"Merge {i+1}: {best_pair} -> {new_token} "
                      f"('{merged_str}', count: {best_count})")

            # Apply the merge
            token_sequences = self._merge(token_sequences, best_pair, new_token)

        if verbose:
            print(f"\nTraining complete! Vocabulary size: {len(self.vocab)}")

    def encode(self, text):
        """Encode text to token IDs."""
        # Start with individual bytes
        tokens = list(text.encode('utf-8'))

        # Apply merges in order (same order as training)
        for pair in self.merges:
            new_token = self.inverse_vocab[
                self.vocab[pair[0]] + self.vocab[pair[1]]
            ]
            i = 0
            new_tokens = []
            while i < len(tokens):
                if (i < len(tokens) - 1 and
                    tokens[i] == pair[0] and tokens[i+1] == pair[1]):
                    new_tokens.append(new_token)
                    i += 2
                else:
                    new_tokens.append(tokens[i])
                    i += 1
            tokens = new_tokens

        return tokens

    def decode(self, token_ids):
        """Decode token IDs back to text."""
        byte_sequence = b''.join(self.vocab[tid] for tid in token_ids)
        return byte_sequence.decode('utf-8', errors='replace')


# ------- Train and use our BPE tokenizer -------

# Training text (in practice, this would be gigabytes of text)
training_text = """
The quick brown fox jumps over the lazy dog. The dog barked at the fox.
The fox ran quickly through the forest. The forest was dark and quiet.
Natural language processing is a subfield of artificial intelligence.
Language models learn to predict the next token in a sequence.
The transformer architecture revolutionized natural language processing.
Attention mechanisms allow models to focus on relevant parts of the input.
"""

print("=" * 60)
print("Training BPE Tokenizer")
print("=" * 60)

tokenizer = BPETokenizer(vocab_size=300)  # 256 bytes + 44 merges
tokenizer.train(training_text)

# Test encoding
test_text = "The fox jumps over the dog."
encoded = tokenizer.encode(test_text)
decoded = tokenizer.decode(encoded)

print(f"\nOriginal:  '{test_text}'")
print(f"Encoded:   {encoded}")
print(f"Decoded:   '{decoded}'")
print(f"Tokens:    {len(encoded)}")
print(f"Characters:{len(test_text)}")
print(f"Compression ratio: {len(test_text)/len(encoded):.1f}x")

1.5 WordPiece and SentencePiece

WordPiece (Used in BERT)

Similar to BPE but uses a different scoring criterion. Instead of merging the most frequent pair, WordPiece merges the pair that maximizes the likelihood of the training data (i.e., the pair whose merger most increases the language model probability).

# WordPiece marks subword continuations with "##"
# "unhappiness" -> ["un", "##happi", "##ness"]
# "playing"     -> ["play", "##ing"]
# "cat"         -> ["cat"]  (in vocabulary, stays as one token)

# The "##" prefix means "this token is a continuation of the previous word"
# This helps the model know which tokens start new words

# BERT vocabulary example:
# Token ID 2003 -> "the"
# Token ID 7592 -> "hello"
# Token ID 2015 -> "##ing"  (subword continuation)
# Token ID 2094 -> "##ed"   (subword continuation)

SentencePiece (Used in Llama, T5)

A language-independent tokenizer that treats the input as a raw byte stream, making it work for any language without pre-tokenization rules.

# SentencePiece key differences:
# 1. Treats spaces as regular characters (represented as '▁' / underscore)
# 2. Language-independent: no need for language-specific pre-processing
# 3. Can use BPE or Unigram model internally

# SentencePiece tokenization:
# "Hello world" -> ["▁Hello", "▁world"]
# The ▁ represents a space BEFORE the token (word boundary)

# This is crucial for languages without spaces (Chinese, Japanese):
# "東京は素晴らしい" -> ["▁東京", "は", "素晴ら", "しい"]
# Works the same way regardless of language!

# Using SentencePiece in Python:
import sentencepiece as spm

# Train a SentencePiece model
# spm.SentencePieceTrainer.train(
#     input='training_data.txt',
#     model_prefix='my_tokenizer',
#     vocab_size=32000,
#     model_type='bpe'  # or 'unigram'
# )

# Load and use
# sp = spm.SentencePieceProcessor(model_file='my_tokenizer.model')
# tokens = sp.encode('Hello world', out_type=str)
# ids = sp.encode('Hello world', out_type=int)

1.6 Vocabulary Size Tradeoffs

Model	Vocab Size	Tokenizer	Notes
GPT-2	50,257	BPE	Byte-level BPE
GPT-4 / GPT-4o	100,277	BPE (cl100k_base)	Larger vocab for efficiency
BERT	30,522	WordPiece	Cased and uncased versions
Llama 2	32,000	SentencePiece (BPE)	Relatively small vocab
Llama 3 / 3.1	128,256	Tiktoken (BPE)	4x larger than Llama 2
DeepSeek V3	129,280	BPE	Efficient multilingual coverage

Vocabulary Size Tradeoffs:

Larger vocab → shorter sequences (fewer tokens per text), faster inference, BUT larger embedding matrix (more parameters)
Smaller vocab → longer sequences (more tokens), slower inference, BUT smaller embedding matrix
The trend (2024-2026) is toward larger vocabularies (100K+) because the embedding matrix cost is small relative to total model size, and shorter sequences mean faster inference.

1.7 Special Tokens

# Special tokens serve critical roles in how models process text

special_tokens = {
    # BERT special tokens
    "[CLS]":    "Classification token. Placed at the start of input. "
                "Its final hidden state is used as the 'sentence embedding'.",

    "[SEP]":    "Separator token. Separates two sentences in pair tasks "
                "(e.g., 'Is sentence A entailed by sentence B?').",

    "[PAD]":    "Padding token. Used to make all sequences the same length "
                "in a batch. Attention mask ensures model ignores these.",

    "[MASK]":   "Mask token. Used in BERT's masked language modeling. "
                "Replaces a token that the model must predict.",

    # GPT special tokens
    "<|endoftext|>": "End of text marker in GPT-2/3. Signals the end of "
                     "a document during training.",

    "<|im_start|>":  "Start of a message in ChatGPT format.",
    "<|im_end|>":    "End of a message in ChatGPT format.",

    # Llama special tokens
    "":      "Beginning of sequence (BOS) in Llama.",
    "":     "End of sequence (EOS) in Llama.",

    # Common across models
    "<|pad|>":  "Generic padding token.",
    "":    "Unknown token. Fallback for characters not in vocabulary. "
                "Modern BPE tokenizers rarely produce this.",
}

for token, description in special_tokens.items():
    print(f"{token:20s} | {description}")

# Chat template example (Llama 3 format):
chat_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The capital of France is Paris.<|eot_id|>"""

print("\nLlama 3 Chat Template:")
print(chat_template)

1.8 PRACTICAL: Using Real Tokenizers

"""
Using Real Tokenizers in Practice
==================================
tiktoken (OpenAI) and HuggingFace tokenizers
"""

# ---- tiktoken (OpenAI's tokenizer) ----
import tiktoken

# GPT-4 / GPT-4o tokenizer
enc = tiktoken.get_encoding("cl100k_base")  # Used by GPT-4, GPT-4o

text = "Hello, how are you doing today? I'm learning about tokenization!"

tokens = enc.encode(text)
print(f"Text: '{text}'")
print(f"Token IDs: {tokens}")
print(f"Number of tokens: {len(tokens)}")
print(f"Number of characters: {len(text)}")
print(f"Characters per token: {len(text)/len(tokens):.1f}")

# Decode individual tokens to see what each one represents
print(f"\nToken breakdown:")
for token_id in tokens:
    token_text = enc.decode([token_id])
    print(f"  ID {token_id:6d} -> '{token_text}'")

# Output (approximately):
# ID   9906 -> 'Hello'
# ID     11 -> ','
# ID    1268 -> ' how'
# ID    527 -> ' are'
# ID    499 -> ' you'
# ID   3815 -> ' doing'
# ID   3432 -> ' today'
# ID     30 -> '?'
# ...

# ---- Cost Calculation ----
def estimate_api_cost(text, model="gpt-4o"):
    """Estimate API cost for processing text."""
    enc = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(enc.encode(text))

    # Prices as of early 2026 (approximate)
    prices = {
        "gpt-4o":       {"input": 2.50, "output": 10.00},  # per 1M tokens
        "gpt-4o-mini":  {"input": 0.15, "output": 0.60},
        "gpt-4-turbo":  {"input": 10.00, "output": 30.00},
    }

    price = prices.get(model, prices["gpt-4o"])
    input_cost = (n_tokens / 1_000_000) * price["input"]

    return {
        "tokens": n_tokens,
        "input_cost": input_cost,
        "model": model
    }

# Example: processing a long document
long_text = "This is a sample document. " * 1000  # ~5000 words
result = estimate_api_cost(long_text)
print(f"\nCost estimate for {result['tokens']} tokens ({result['model']}):")
print(f"  Input cost: ${result['input_cost']:.4f}")

# ---- HuggingFace Tokenizers ----
from transformers import AutoTokenizer

# Load Llama 3 tokenizer (requires HuggingFace access)
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Load BERT tokenizer (freely available)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization is fundamental to NLP."
encoded = tokenizer(text)

print(f"\nBERT tokenization of '{text}':")
print(f"  Input IDs: {encoded['input_ids']}")
print(f"  Token type IDs: {encoded['token_type_ids']}")
print(f"  Attention mask: {encoded['attention_mask']}")

# Decode to see tokens
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'])
print(f"  Tokens: {tokens}")
# Output: ['[CLS]', 'token', '##ization', 'is', 'fundamental', 'to', 'nl', '##p', '.', '[SEP]']
# Note: BERT adds [CLS] and [SEP], and uses ## for subword pieces

# ---- Compare tokenizers ----
text = "The transformer architecture uses self-attention mechanisms."

tokenizers_to_compare = {
    "GPT-4 (cl100k)": tiktoken.get_encoding("cl100k_base"),
    "GPT-2 (r50k)":   tiktoken.get_encoding("r50k_base"),
}

print(f"\nTokenizer comparison for: '{text}'")
print("-" * 60)
for name, tok in tokenizers_to_compare.items():
    tokens = tok.encode(text)
    print(f"{name:20s}: {len(tokens)} tokens -> {tokens[:10]}...")

# GPT-4's tokenizer is more efficient (fewer tokens for the same text)
# because it has a larger vocabulary (100K vs 50K)

2. Vectorization / Embeddings

2.1 Why Vectors?

Token IDs are arbitrary integers with no semantic structure. Embeddings place related concepts close together in continuous space.

Token IDs are arbitrary numbers — token 5000 is not "more" than token 3000 in any meaningful way. We need a representation where similar meanings are close together in a continuous space. This is what embeddings provide.

# The problem with raw token IDs:
# "king" = ID 5123
# "queen" = ID 8901
# "banana" = ID 2345
# There's no mathematical relationship between these numbers!

# With embeddings (hypothetical 3D space):
# "king"   = [0.8, 0.2, 0.9]
# "queen"  = [0.7, 0.3, 0.9]   <- close to "king" (both royalty)
# "banana" = [0.1, 0.8, 0.1]   <- far from "king" (not related)

# The famous word2vec result:
# king - man + woman ≈ queen
# This works because the embedding space captures semantic relationships!

2.2 Evolution of Word Representations

One-hot vectors are sparse and capture no similarity. Dense embeddings place related words close together in continuous space.

One-Hot Encoding

import numpy as np

# Vocabulary: ["the", "cat", "sat", "on", "mat"]
vocab_size = 5

# One-hot encoding: each word is a vector of size vocab_size
# with a 1 at the word's index and 0 everywhere else
one_hot = {
    "the": np.array([1, 0, 0, 0, 0]),
    "cat": np.array([0, 1, 0, 0, 0]),
    "sat": np.array([0, 0, 1, 0, 0]),
    "on":  np.array([0, 0, 0, 1, 0]),
    "mat": np.array([0, 0, 0, 0, 1]),
}

# Problems:
# 1. HUGE vectors for real vocabularies (50,000-dimensional!)
# 2. ALL words are equally distant from each other
cosine_sim = np.dot(one_hot["cat"], one_hot["mat"])
print(f"Similarity(cat, mat) = {cosine_sim}")  # 0.0
# "cat" and "mat" have zero similarity, same as "cat" and "quantum_physics"
# This representation captures NO semantic information

Word2Vec

Proposed by Mikolov et al. (2013), Word2Vec learns dense vector representations where semantically similar words are close together. Two architectures:

CBOW (Continuous Bag of Words): Predict the center word from surrounding context words.

# CBOW: Predict center word from context
# Context: "The [?] sat on the mat"
# Target: "cat"

# The model learns:
# - "cat" often appears between "The" and "sat"
# - "dog" also appears in similar contexts
# - Therefore, "cat" and "dog" get similar embeddings

# CBOW is faster to train, works well for frequent words

Skip-gram: Predict surrounding context words from the center word.

# Skip-gram: Predict context from center word
# Center: "cat"
# Predict: "The", "sat", "on", "the", "mat"

# Given "cat", what words are likely nearby?
# - "the", "sat", "on" (from this sentence)
# - "dog", "pet", "furry" (from other training sentences)

# Skip-gram is slower but works better for rare words

# Training intuition:
# The network has ONE hidden layer (the embedding layer)
# Input: one-hot vector of center word
# Output: probability distribution over vocabulary
# After training, the hidden layer weights ARE the word embeddings

Skip-gram trains a shallow network to predict context words. The hidden layer weights become the word embeddings.

GloVe (Global Vectors for Word Representation)

# GloVe: combines word2vec's local context with global statistics
# Instead of predicting context words, it factorizes the
# word-word co-occurrence matrix

# Key idea: the ratio of co-occurrence probabilities
# encodes meaning:
#
# P(ice | solid)  / P(ice | gas)   = large  (ice is solid)
# P(steam | solid) / P(steam | gas) = small  (steam is gas)
# P(water | solid) / P(water | gas) ≈ 1     (water is both)
#
# GloVe trains embeddings so that dot products approximate
# the log of co-occurrence probabilities:
# w_i · w_j + b_i + b_j ≈ log(X_ij)

Contextual Embeddings (BERT, GPT)

# The REVOLUTION: contextual embeddings
# Word2Vec gives ONE embedding per word regardless of context

# But "bank" means different things:
# "I went to the bank to deposit money"  <- financial institution
# "We sat on the river bank"             <- edge of river

# Word2Vec: "bank" = [0.3, 0.5, 0.2] (always the same)
# BERT:     "bank" in financial context = [0.8, 0.1, 0.9]
#           "bank" in river context     = [0.2, 0.7, 0.3]

# Contextual embeddings compute a DIFFERENT vector for each word
# based on its ENTIRE surrounding context. This is done by the
# transformer attention mechanism (covered in detail later).

# This is why transformer-based models are so much better:
# the same word gets different representations in different contexts

2.3 How Embedding Layers Work

An embedding layer is simply a matrix lookup: each token ID selects its corresponding row of learned dense values.

import numpy as np

class EmbeddingLayer:
    """
    An embedding layer is simply a lookup table.

    It maps each token ID to a dense vector of size `embedding_dim`.
    These vectors are LEARNED during training (they are parameters).
    """

    def __init__(self, vocab_size, embedding_dim):
        """
        Args:
            vocab_size: Number of unique tokens (e.g., 50257 for GPT-2)
            embedding_dim: Size of each embedding vector (e.g., 768 for GPT-2)
        """
        # Initialize with random values (will be learned during training)
        # Shape: (vocab_size, embedding_dim)
        self.embeddings = np.random.randn(vocab_size, embedding_dim) * 0.02
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim

        print(f"Embedding layer: {vocab_size} tokens x {embedding_dim} dimensions")
        print(f"Total parameters: {vocab_size * embedding_dim:,}")

    def forward(self, token_ids):
        """
        Look up embeddings for the given token IDs.

        This is just an array index operation - very fast!

        Args:
            token_ids: list or array of integer token IDs

        Returns:
            embeddings: array of shape (len(token_ids), embedding_dim)
        """
        return self.embeddings[token_ids]


# Example: GPT-2 scale embedding
embed = EmbeddingLayer(vocab_size=50257, embedding_dim=768)
# Embedding layer: 50257 tokens x 768 dimensions
# Total parameters: 38,597,376 (~38.6M parameters just for embeddings!)

# Look up embeddings for a token sequence
token_ids = [15496, 11, 995, 0]  # "Hello, world!"
embeddings = embed.forward(token_ids)

print(f"\nInput token IDs: {token_ids}")
print(f"Output shape: {embeddings.shape}")  # (4, 768)
print(f"Embedding for token 15496 (first 10 values):")
print(f"  {embeddings[0, :10]}")

# What does 768 dimensions mean?
# Each dimension captures some aspect of the token's meaning.
# No single dimension has a clear interpretation, but collectively
# they encode:
# - Part of speech (noun, verb, adjective)
# - Semantic category (animal, food, color)
# - Syntactic role (subject, object)
# - Sentiment (positive, negative)
# - And many more subtle features

# Common embedding dimensions:
# BERT-base:    768
# BERT-large:   1024
# GPT-2:        768
# GPT-3:        12288
# Llama 3.1 8B: 4096
# Llama 3.1 70B:8192
# GPT-4:        unknown (likely 12288+)

2.4 PRACTICAL: Train Word2Vec and Visualize

"""
Train Word2Vec Embeddings with Gensim
======================================
Train on a real corpus, then explore the embedding space.
"""

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
import numpy as np

# Prepare training data
sentences = [
    "the king ruled the kingdom with wisdom",
    "the queen ruled alongside the king",
    "the prince was the son of the king and queen",
    "the princess was the daughter of the king and queen",
    "the man worked in the field",
    "the woman worked in the market",
    "the boy played in the garden",
    "the girl played in the garden",
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the mouse",
    "the dog chased the cat",
    "paris is the capital of france",
    "london is the capital of england",
    "berlin is the capital of germany",
    "tokyo is the capital of japan",
    "machine learning is a branch of artificial intelligence",
    "deep learning uses neural networks with many layers",
    "natural language processing handles text data",
    "computer vision handles image data",
]

# Tokenize
tokenized = [simple_preprocess(sent) for sent in sentences]

# Train Word2Vec
model = Word2Vec(
    sentences=tokenized,
    vector_size=50,    # Embedding dimensions (small for demo)
    window=5,          # Context window size
    min_count=1,       # Include words appearing at least once
    workers=4,         # Parallel training threads
    epochs=100,        # Training epochs (more for small datasets)
    sg=1               # 1 = Skip-gram, 0 = CBOW
)

# Explore the embeddings
print("=" * 50)
print("Word2Vec Embedding Exploration")
print("=" * 50)

# Most similar words
print("\nMost similar to 'king':")
for word, sim in model.wv.most_similar('king', topn=5):
    print(f"  {word:15s} {sim:.4f}")

print("\nMost similar to 'cat':")
for word, sim in model.wv.most_similar('cat', topn=5):
    print(f"  {word:15s} {sim:.4f}")

# Word analogies: king - man + woman = ?
print("\nAnalogy: king - man + woman = ?")
result = model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)
for word, sim in result:
    print(f"  {word:15s} {sim:.4f}")
# Expected: "queen" (though results depend on training data size)

# Compute cosine similarity
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

pairs = [("king", "queen"), ("king", "cat"), ("cat", "dog"), ("paris", "france")]
print("\nCosine similarities:")
for w1, w2 in pairs:
    sim = cosine_similarity(model.wv[w1], model.wv[w2])
    print(f"  sim({w1}, {w2}) = {sim:.4f}")

# ---- t-SNE Visualization ----
from sklearn.manifold import TSNE

# Get all word vectors
words = list(model.wv.key_to_index.keys())
vectors = np.array([model.wv[w] for w in words])

# Reduce to 2D using t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=min(5, len(words)-1))
vectors_2d = tsne.fit_transform(vectors)

# Print 2D positions (in practice, you'd plot these)
print("\nt-SNE 2D Projections (would be plotted in practice):")
print(f"{'Word':15s} {'X':>8s} {'Y':>8s}")
print("-" * 33)
for i, word in enumerate(words[:20]):
    print(f"{word:15s} {vectors_2d[i,0]:8.2f} {vectors_2d[i,1]:8.2f}")

# In a real visualization:
# - Related words cluster together
# - king/queen/prince/princess form a cluster
# - cat/dog/mouse form another cluster
# - paris/london/berlin/tokyo form a "capitals" cluster

2.5 PRACTICAL: Sentence Embeddings and Semantic Search

"""
Sentence Embeddings with sentence-transformers
================================================
Generate embeddings for entire sentences and compute similarity.
This is the foundation of semantic search, RAG, and many AI applications.
"""

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a pre-trained sentence embedding model
# all-MiniLM-L6-v2 is small (80MB) but effective
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sentences to embed
sentences = [
    "The cat sat on the mat",
    "A kitten was resting on the rug",           # Similar to sentence 1
    "Machine learning is a subset of AI",
    "Deep learning is part of artificial intelligence",  # Similar to sentence 3
    "The weather is nice today",
    "I enjoy sunny days",                         # Similar to sentence 5
    "Quantum physics deals with subatomic particles",
]

# Generate embeddings
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")  # (7, 384)
# Each sentence is now a 384-dimensional vector

# Compute cosine similarity matrix
def cosine_similarity_matrix(embeddings):
    """Compute pairwise cosine similarity."""
    # Normalize vectors
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms
    # Dot product of normalized vectors = cosine similarity
    return normalized @ normalized.T

sim_matrix = cosine_similarity_matrix(embeddings)

print("\nCosine Similarity Matrix:")
print(f"{'':30s}", end="")
for i in range(len(sentences)):
    print(f"  S{i+1}", end="")
print()

for i, sent in enumerate(sentences):
    truncated = sent[:28] + ".." if len(sent) > 30 else sent
    print(f"{truncated:30s}", end="")
    for j in range(len(sentences)):
        print(f" {sim_matrix[i,j]:.2f}", end="")
    print()

# Expected: high similarity between semantically related pairs
# S1-S2 (cat/kitten): ~0.7-0.8
# S3-S4 (ML/DL):      ~0.8-0.9
# S5-S6 (weather):    ~0.6-0.7
# S1-S7 (cat/quantum): ~0.0-0.1

# ---- Semantic Search ----
print("\n" + "=" * 50)
print("Semantic Search Demo")
print("=" * 50)

# Knowledge base
documents = [
    "Python is a programming language known for its simplicity",
    "JavaScript is used for web development",
    "Machine learning models learn from data",
    "The Eiffel Tower is located in Paris, France",
    "Neural networks are inspired by biological brains",
    "Climate change is affecting global temperatures",
    "Transformers revolutionized natural language processing",
    "The Great Wall of China is visible from space",
]

doc_embeddings = model.encode(documents)

def semantic_search(query, documents, doc_embeddings, model, top_k=3):
    """Find the most relevant documents for a query."""
    query_embedding = model.encode([query])

    # Compute similarities
    similarities = np.dot(doc_embeddings, query_embedding.T).flatten()

    # Get top-k indices
    top_indices = np.argsort(similarities)[::-1][:top_k]

    print(f"\nQuery: '{query}'")
    print(f"Top {top_k} results:")
    for rank, idx in enumerate(top_indices):
        print(f"  {rank+1}. [{similarities[idx]:.4f}] {documents[idx]}")

# Test queries
semantic_search("How do AI systems learn?", documents, doc_embeddings, model)
semantic_search("What programming languages exist?", documents, doc_embeddings, model)
semantic_search("Tell me about famous landmarks", documents, doc_embeddings, model)

3. Positional Encodings

3.1 Why Position Matters

Unlike RNNs that process tokens sequentially (inherently knowing position), transformers process all tokens in parallel. Without positional information, the model sees the input as a bag of words — "dog bites man" and "man bites dog" would be identical!

# Without positional encoding:
# "The cat ate the fish" and "The fish ate the cat" produce the same output!
# Both have the same set of token embeddings, just in different positions.
# The model CANNOT distinguish them.

# With positional encoding:
# Each token embedding is ADDED to a position-specific vector
# embedding("the", position=0) ≠ embedding("the", position=3)
# Now the model knows where each word appears in the sequence

Positional Encoding Integration

graph TD T["Input Tokens"] --> E["Token Embeddings
Learned lookup table"] P["Position Index
0, 1, 2, ..."] --> PE["Positional Encoding
Sinusoidal or Learned"] E --> ADD["Element-wise Addition
Token Embed + Pos Embed"] PE --> ADD ADD --> TL["Transformer Layers
Model knows WHAT and WHERE"] style T fill:#e8f4f8,stroke:#333 style P fill:#dae8fc,stroke:#333 style E fill:#d5e8d4,stroke:#333 style PE fill:#fff2cc,stroke:#333 style ADD fill:#e1d5e7,stroke:#333 style TL fill:#f8cecc,stroke:#333

3.2 Sinusoidal Positional Encodings

Positional encodings are added element-wise to token embeddings, giving the model both meaning and position information.

The original Transformer paper ("Attention Is All You Need") used sinusoidal functions to generate positional encodings:

For position pos and dimension i:

PE(pos, 2i) = sin(pos / 10000^2i/d_model)

PE(pos, 2i+1) = cos(pos / 10000^2i/d_model)

Even dimensions use sine, odd dimensions use cosine.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """
    Generate sinusoidal positional encodings.

    Args:
        max_len: Maximum sequence length
        d_model: Model dimension (embedding size)

    Returns:
        PE matrix of shape (max_len, d_model)

    Why sinusoidal?
    1. Unique encoding for each position
    2. Bounded values (always between -1 and 1)
    3. Can extrapolate to longer sequences than seen during training
    4. Relative positions can be computed as linear transformations:
       PE(pos+k) can be represented as a linear function of PE(pos)
    """
    PE = np.zeros((max_len, d_model))

    position = np.arange(max_len).reshape(-1, 1)  # (max_len, 1)
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))

    # Even dimensions: sine
    PE[:, 0::2] = np.sin(position * div_term)
    # Odd dimensions: cosine
    PE[:, 1::2] = np.cos(position * div_term)

    return PE

# Generate for a small example
PE = sinusoidal_positional_encoding(max_len=10, d_model=8)

print("Positional Encoding Matrix (10 positions x 8 dimensions):")
print(f"{'Pos':>4}", end="")
for d in range(8):
    print(f"  dim{d:1d}", end="")
print()

for pos in range(10):
    print(f"{pos:4d}", end="")
    for d in range(8):
        print(f" {PE[pos, d]:6.3f}", end="")
    print()

# Key observations:
# 1. Low dimensions (dim 0,1) oscillate rapidly - capture fine position info
# 2. High dimensions (dim 6,7) oscillate slowly - capture coarse position info
# 3. Each position has a unique "fingerprint"
# 4. Nearby positions have similar encodings (smooth interpolation)

# How it's used:
# final_embedding = token_embedding + positional_encoding
# This is simply element-wise addition!

3.3 Modern Positional Encodings

Rotary Position Embeddings (RoPE)

Used by Llama, GPT-NeoX, and most modern LLMs. Instead of adding positional information, RoPE rotates the query and key vectors in the attention mechanism.

"""
RoPE (Rotary Position Embeddings)
==================================
Used in Llama 2, Llama 3, Mistral, and most modern open-source LLMs.

Key insight: encode position by ROTATING the embedding vectors.
The angle of rotation depends on the position.

Benefits over sinusoidal:
1. Directly encodes RELATIVE position (attention naturally depends on distance)
2. Better extrapolation to longer sequences
3. Decays attention naturally with distance
"""

import numpy as np

def apply_rope(x, position, d_model):
    """
    Apply Rotary Position Embeddings to a vector.

    Args:
        x: input vector of shape (d_model,)
        position: position index (integer)
        d_model: dimension of the vector

    Returns:
        Rotated vector of same shape

    The rotation is applied to PAIRS of dimensions:
    (dim 0, dim 1), (dim 2, dim 3), etc.
    Each pair is rotated by a different angle based on position.
    """
    output = np.zeros_like(x)

    for i in range(d_model // 2):
        # Rotation angle depends on position and dimension
        theta = position / (10000 ** (2 * i / d_model))

        # Apply 2D rotation to each pair of dimensions
        cos_theta = np.cos(theta)
        sin_theta = np.sin(theta)

        # Rotation matrix: [[cos, -sin], [sin, cos]]
        output[2*i]     = x[2*i] * cos_theta - x[2*i+1] * sin_theta
        output[2*i + 1] = x[2*i] * sin_theta + x[2*i+1] * cos_theta

    return output

# Demonstrate
d_model = 8
x = np.random.randn(d_model)

print("RoPE Demonstration:")
print(f"Original vector: {x}")
print(f"Position 0: {apply_rope(x, 0, d_model)}")  # No rotation at position 0
print(f"Position 1: {apply_rope(x, 1, d_model)}")  # Slight rotation
print(f"Position 5: {apply_rope(x, 5, d_model)}")  # More rotation

# Key property: dot product of rotated vectors depends on RELATIVE position
# q_pos1 · k_pos3 depends only on the distance (3-1=2), not absolute positions
# This is what makes RoPE so effective for attention

ALiBi (Attention with Linear Biases)

# ALiBi: Instead of modifying embeddings, add a bias to attention scores
# that penalizes distant tokens linearly.

# Standard attention:  score = Q·K^T / sqrt(d_k)
# ALiBi attention:     score = Q·K^T / sqrt(d_k) - m * |i - j|

# Where m is a head-specific slope and |i-j| is the distance between tokens
# Each attention head gets a different slope, allowing different heads to
# focus on different distance ranges.

# Benefits:
# - Simpler than RoPE (just a bias term)
# - Good length extrapolation
# - Used in BLOOM and some other models

# In practice, RoPE has become more dominant (2024-2026) because:
# - Better performance on long-context tasks
# - Works well with NTK-aware scaling for context extension
# - Adopted by Llama, Mistral, and most popular open-source models

4. The Attention Mechanism

4.1 Intuition

Self-attention lets each token "look at" every other token. Here, "it" strongly attends to "cat" (its referent).

When you read the sentence "The cat sat on the mat because it was tired", you instantly know "it" refers to "the cat", not "the mat". Your brain attends to the relevant part of the input. The attention mechanism gives neural networks the same ability.

Library Analogy for Q, K, V:

Query (Q): You walk into a library and ask: "I need information about machine learning." This is your query — what you're looking for.
Key (K): Each book on the shelf has a label/tag: "ML Fundamentals", "Cooking Recipes", "Neural Networks", "History of Rome". These are keys — descriptions of what each source contains.
Value (V): The actual content inside each book. This is the value — the information you'll extract.
Attention: You compare your query with each key to find the best matches, then read the corresponding values. "ML Fundamentals" and "Neural Networks" match well → you read those books more carefully.

4.2 The Attention Formula

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Where:

Q (Query): What am I looking for? Shape: (seq_len, d_k)
K (Key): What do I contain? Shape: (seq_len, d_k)
V (Value): What information do I have? Shape: (seq_len, d_v)
d_k: Dimension of key vectors (used for scaling)

Attention Mechanism Flow

graph TD X["Input Embeddings X"] --> WQ["Multiply by W_Q"] X --> WK["Multiply by W_K"] X --> WV["Multiply by W_V"] WQ --> Q["Query Matrix Q"] WK --> K["Key Matrix K"] WV --> V["Value Matrix V"] Q --> DOT["Q times K transpose
Dot Product"] K --> DOT DOT --> SCALE["Scale by 1 over sqrt d_k"] SCALE --> SM["Softmax
Attention Weights"] SM --> MUL["Multiply Weights x V"] V --> MUL MUL --> OUT["Context-Aware Output"] style X fill:#e8f4f8,stroke:#333 style Q fill:#dae8fc,stroke:#333 style K fill:#d5e8d4,stroke:#333 style V fill:#fff2cc,stroke:#333 style SM fill:#e1d5e7,stroke:#333 style OUT fill:#f8cecc,stroke:#333

4.3 Self-Attention Step by Step

"""
SELF-ATTENTION - Complete Numerical Example
=============================================
Let's trace through self-attention with actual numbers.

Input: 3 tokens ("I", "love", "AI") with embedding dimension 4
"""

import numpy as np

# Set seed for reproducibility
np.random.seed(42)

# ------- STEP 0: Input embeddings -------
# 3 tokens, each with 4-dimensional embedding
X = np.array([
    [1.0, 0.0, 1.0, 0.0],  # "I"
    [0.0, 1.0, 0.0, 1.0],  # "love"
    [1.0, 1.0, 0.0, 0.0],  # "AI"
])
# Shape: (3, 4) = (seq_len, d_model)
print(f"Input X:\n{X}\n")

# ------- STEP 1: Create Q, K, V weight matrices -------
# These are LEARNED parameters
d_model = 4
d_k = 3  # Dimension for Q and K (can differ from d_model)
d_v = 3  # Dimension for V

# Weight matrices (normally learned, here random for demo)
W_Q = np.random.randn(d_model, d_k) * 0.5
W_K = np.random.randn(d_model, d_k) * 0.5
W_V = np.random.randn(d_model, d_v) * 0.5

print(f"W_Q shape: {W_Q.shape} (d_model x d_k = {d_model} x {d_k})")
print(f"W_K shape: {W_K.shape}")
print(f"W_V shape: {W_V.shape}\n")

# ------- STEP 2: Compute Q, K, V -------
Q = X @ W_Q  # (3, 4) @ (4, 3) = (3, 3)
K = X @ W_K
V = X @ W_V

print(f"Q (Queries):\n{Q}\n")
print(f"K (Keys):\n{K}\n")
print(f"V (Values):\n{V}\n")

# ------- STEP 3: Compute attention scores (Q @ K^T) -------
# Each query "asks a question" and each key "provides a match score"
scores = Q @ K.T  # (3, 3) @ (3, 3) = (3, 3)
print(f"Raw attention scores (Q @ K^T):\n{scores}\n")

# scores[i][j] = how much token i should attend to token j
# e.g., scores[0][2] = how much "I" attends to "AI"

# ------- STEP 4: Scale by sqrt(d_k) -------
# WHY SCALE?
# Without scaling, dot products grow with d_k, pushing softmax into
# regions with tiny gradients (saturated softmax).
# Scaling by sqrt(d_k) keeps the variance of dot products at ~1.

scaled_scores = scores / np.sqrt(d_k)
print(f"Scaled scores (/ sqrt({d_k}) = / {np.sqrt(d_k):.2f}):\n{scaled_scores}\n")

# ------- STEP 5: Apply softmax -------
# Convert scores to probabilities (each row sums to 1)
def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

attention_weights = softmax(scaled_scores)
print(f"Attention weights (after softmax):\n{attention_weights}")
print(f"Row sums: {attention_weights.sum(axis=1)}\n")

# Interpretation:
# Row 0: How much "I" attends to ["I", "love", "AI"]
#   e.g., [0.35, 0.30, 0.35] means "I" attends roughly equally
# Row 1: How much "love" attends to each token
# Row 2: How much "AI" attends to each token

# ------- STEP 6: Weighted sum of values -------
# Each token's output = weighted combination of all values
output = attention_weights @ V  # (3, 3) @ (3, 3) = (3, 3)
print(f"Output (attention_weights @ V):\n{output}\n")

# output[0] = the new representation of "I", enriched with information
# from "love" and "AI" based on the attention weights.
# This is SELF-attention because each token attends to ALL tokens
# (including itself) in the same sequence.

print("=" * 50)
print("Summary of shapes:")
print(f"  Input X:          {X.shape}         (seq_len, d_model)")
print(f"  Q = X @ W_Q:      {Q.shape}         (seq_len, d_k)")
print(f"  K = X @ W_K:      {K.shape}         (seq_len, d_k)")
print(f"  V = X @ W_V:      {V.shape}         (seq_len, d_v)")
print(f"  QK^T:             {scores.shape}         (seq_len, seq_len)")
print(f"  Attention weights: {attention_weights.shape}         (seq_len, seq_len)")
print(f"  Output:           {output.shape}         (seq_len, d_v)")

4.4 Why Scale by √d_k?

"""
Why we scale by sqrt(d_k) - A demonstration
=============================================
"""

import numpy as np

np.random.seed(42)

# Consider two random vectors of dimension d_k
# Their dot product has variance proportional to d_k

for d_k in [4, 64, 512, 4096]:
    # Generate many random Q and K vectors
    Q = np.random.randn(10000, d_k)
    K = np.random.randn(10000, d_k)

    # Compute dot products
    dot_products = np.sum(Q * K, axis=1)

    print(f"d_k = {d_k:5d}: "
          f"mean = {dot_products.mean():8.3f}, "
          f"std = {dot_products.std():8.3f}, "
          f"std/sqrt(d_k) = {dot_products.std()/np.sqrt(d_k):.3f}")

# Output:
# d_k =     4: mean =    0.012, std =    2.011, std/sqrt(d_k) = 1.005
# d_k =    64: mean =    0.026, std =    7.989, std/sqrt(d_k) = 0.999
# d_k =   512: mean =   -0.050, std =   22.593, std/sqrt(d_k) = 0.999
# d_k =  4096: mean =   -0.106, std =   63.938, std/sqrt(d_k) = 0.999

# The standard deviation of dot products grows as sqrt(d_k)!
# For d_k=4096, scores range roughly from -192 to +192.
# Softmax of such extreme values becomes nearly one-hot
# (one element ≈ 1, all others ≈ 0), killing gradients.

# After scaling by sqrt(d_k), the variance is always ~1,
# keeping softmax in a well-behaved range.

# Demonstrate softmax behavior:
print("\nSoftmax behavior with different scales:")
x = np.array([1.0, 2.0, 3.0])
print(f"  softmax([1, 2, 3]):       {softmax(x)}")  # Reasonable distribution
print(f"  softmax([10, 20, 30]):    {softmax(x * 10)}")  # Nearly one-hot!
print(f"  softmax([100, 200, 300]): {softmax(x * 100)}")  # Completely saturated

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

4.5 PRACTICAL: Self-Attention in PyTorch

"""
Self-Attention Implementation in PyTorch
==========================================
A clean, complete implementation with proper tensor handling.
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    """
    Single-head self-attention mechanism.

    This is the core building block of the Transformer architecture.
    """

    def __init__(self, d_model, d_k=None, d_v=None):
        """
        Args:
            d_model: Input/output dimension
            d_k: Key/Query dimension (defaults to d_model)
            d_v: Value dimension (defaults to d_model)
        """
        super().__init__()

        self.d_k = d_k or d_model
        self.d_v = d_v or d_model

        # Linear projections for Q, K, V
        self.W_Q = nn.Linear(d_model, self.d_k, bias=False)
        self.W_K = nn.Linear(d_model, self.d_k, bias=False)
        self.W_V = nn.Linear(d_model, self.d_v, bias=False)

        # Output projection (to map back to d_model if d_v != d_model)
        self.W_O = nn.Linear(self.d_v, d_model, bias=False)

    def forward(self, x, mask=None):
        """
        Compute self-attention.

        Args:
            x: Input tensor of shape (batch_size, seq_len, d_model)
            mask: Optional mask of shape (seq_len, seq_len)

        Returns:
            output: Attended values, shape (batch_size, seq_len, d_model)
            attention_weights: Attention probabilities, shape (batch_size, seq_len, seq_len)
        """
        # Step 1: Compute Q, K, V
        Q = self.W_Q(x)  # (batch, seq_len, d_k)
        K = self.W_K(x)  # (batch, seq_len, d_k)
        V = self.W_V(x)  # (batch, seq_len, d_v)

        # Step 2: Compute scaled dot-product attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1))  # (batch, seq_len, seq_len)
        scores = scores / math.sqrt(self.d_k)

        # Step 3: Apply mask (if provided)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        # Step 4: Softmax to get attention weights
        attention_weights = F.softmax(scores, dim=-1)

        # Step 5: Weighted sum of values
        context = torch.matmul(attention_weights, V)  # (batch, seq_len, d_v)

        # Step 6: Output projection
        output = self.W_O(context)  # (batch, seq_len, d_model)

        return output, attention_weights


# ------- Test the implementation -------
torch.manual_seed(42)

batch_size = 2
seq_len = 5
d_model = 8

# Random input (simulating token embeddings)
x = torch.randn(batch_size, seq_len, d_model)

# Create attention layer
attention = SelfAttention(d_model=d_model, d_k=8, d_v=8)

# Forward pass
output, weights = attention(x)

print(f"Input shape:            {x.shape}")
print(f"Output shape:           {output.shape}")
print(f"Attention weights shape: {weights.shape}")
print(f"\nAttention weights for first example (batch 0):")
print(weights[0].detach().numpy().round(3))

# Each row shows how much each token attends to every other token
# Rows sum to 1 (probability distribution)
print(f"\nRow sums: {weights[0].sum(dim=-1).detach().numpy()}")

# With causal mask (for autoregressive models like GPT):
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print(f"\nCausal mask:\n{causal_mask}")

output_masked, weights_masked = attention(x, mask=causal_mask)
print(f"\nCausal attention weights (batch 0):")
print(weights_masked[0].detach().numpy().round(3))
# Notice: upper triangle is all zeros (can't attend to future tokens)

Week 2 Summary

Key Takeaways

Tokenization converts text to integers. BPE (subword tokenization) is the dominant approach, balancing vocabulary size with sequence length.
Embeddings map tokens to dense vectors where semantic similarity is preserved. Modern models use contextual embeddings where the same word gets different representations based on context.
Positional encodings inject position information since transformers process all tokens in parallel. RoPE is the current state-of-the-art.
Self-attention allows each token to "look at" every other token and compute a weighted combination. The formula: Attention(Q,K,V) = softmax(QK^T/√d_k)V
The scaling factor √d_k prevents attention scores from becoming too large, which would push softmax into saturated regions.

Exercises

Implement the full BPE tokenizer and train it on a book from Project Gutenberg. How does vocabulary size affect compression ratio?
Use tiktoken to tokenize the same text with GPT-2 vs GPT-4 tokenizers. Which produces fewer tokens? Why?
Train word2vec on a larger corpus and find interesting analogies beyond "king - man + woman = queen".
Implement the self-attention formula from scratch in NumPy, then verify your results match PyTorch's implementation.
Modify the attention implementation to use a causal mask. How do attention patterns change?