Retrieval Augmented Generation (RAG)
Build production-grade RAG systems that ground LLM responses in your own data. Master document processing, vector embeddings, vector databases, and advanced retrieval techniques to create AI systems that are accurate, up-to-date, and verifiable.
Learning Objectives
Understand RAG Architecture
Learn why RAG exists, when to use it, and how it compares to fine-tuning and prompt engineering.
Document Processing
Master chunking strategies that maximize retrieval quality for different document types.
Vector Embeddings
Understand how text is converted to vectors, and compare embedding models for quality and efficiency.
Vector Databases
Deploy and query vector databases with HNSW indexing, and understand the tradeoffs between solutions.
Build RAG Pipelines
Construct complete RAG systems both from scratch and with frameworks like LangChain.
Advanced RAG (2025-2026)
Learn cutting-edge techniques: GraphRAG, contextual retrieval, self-RAG, and RAG Fusion.
1. Why RAG?
Retrieval Augmented Generation (RAG) is a technique that enhances LLM responses by retrieving relevant information from external knowledge sources and including it in the prompt. It was introduced by Lewis et al. (2020) at Facebook AI Research and has become the most widely used pattern for building production LLM applications.
LLM Limitations RAG Solves
1. Hallucinations
LLMs can generate plausible-sounding but factually incorrect information. They have no mechanism to verify their outputs against reality. RAG provides grounding -- the model generates responses based on retrieved documents rather than relying solely on its parametric knowledge.
2. Outdated Knowledge
LLMs have a knowledge cutoff date. A model trained on data through December 2024 knows nothing about events in 2025 or 2026. RAG allows the system to access current information by retrieving from an up-to-date knowledge base.
3. No Access to Private Data
LLMs cannot access your company's internal documents, databases, or proprietary information. RAG bridges this gap by retrieving from your private data sources.
4. No Citations
A vanilla LLM cannot tell you where its information comes from. RAG enables citations by tracking which documents were used to generate each response.
RAG Architecture Overview
# RAG System Architecture (Text Diagram)
#
# ┌─────────────────────────────────────────────────────┐
# │ INDEXING PIPELINE │
# │ (runs once, or periodically for updates) │
# │ │
# │ Documents ──► Chunking ──► Embedding ──► Vector DB │
# │ (PDF, HTML, (split (convert to (store & │
# │ DOCX, MD) into vectors) index) │
# │ chunks) │
# └─────────────────────────────────────────────────────┘
#
# ┌─────────────────────────────────────────────────────┐
# │ QUERY PIPELINE │
# │ (runs on every user query) │
# │ │
# │ User Query │
# │ │ │
# │ ├──► Embed Query ──► Vector Search ──► Top K │
# │ │ │ chunks │
# │ │ │ │
# │ │ (optional: rerank top K chunks) │
# │ │ │ │
# │ ▼ ▼ │
# │ ┌─────────────────────────────────────┐ │
# │ │ Prompt = System Instructions │ │
# │ │ + Retrieved Context │ │
# │ │ + User Question │ │
# │ └──────────────┬──────────────────────┘ │
# │ │ │
# │ ▼ │
# │ LLM generates response │
# │ with citations │
# └─────────────────────────────────────────────────────┘
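The query pipeline above can be sketched end to end. This is a minimal illustration with a toy word-overlap "retriever" standing in for the embed-and-search steps; the function names and prompt wording are hypothetical, not part of any framework.

```python
def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query (a stand-in for vector search)."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Assemble system instructions + retrieved context + user question."""
    context = "\n\n".join(f"[{i+1}] {c}" for i, c in enumerate(context_chunks))
    return (
        "Answer using ONLY the context below. Cite sources as [n].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

docs = [
    "The refund window is 30 days from the date of purchase.",
    "Shipping to EU countries takes 5-7 business days.",
]
question = "How long is the refund window?"
prompt = build_prompt(question, retrieve(question, docs))
print(prompt)
```

The prompt string is what gets sent to the LLM; the `[n]` markers are what make citations possible in the generated answer.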
RAG vs Fine-tuning
| Aspect | RAG | Fine-tuning |
|---|---|---|
| Knowledge updates | Easy (update the index) | Hard (retrain the model) |
| Citations | Built-in (source documents) | Not naturally available |
| Cost | Retrieval + embedding costs | Training compute costs |
| Latency | Higher (retrieval step) | Lower |
| Best for | Knowledge-intensive Q&A, changing data | Custom behavior, style, format |
In practice, the two approaches are complementary: fine-tune for custom behavior, style, and output format, and use RAG for knowledge. Combined systems often give the best results.
2. Document Processing and Chunking
Before documents can be stored in a vector database, they must be split into smaller pieces called "chunks." Chunking strategy has an enormous impact on RAG quality -- bad chunking leads to irrelevant retrieval, which leads to bad answers.
Document Loaders
Different document formats require different extraction approaches:
| Format | Library | Considerations |
|---|---|---|
| PDF | PyMuPDF, pdfplumber, Unstructured | Tables, images, multi-column layouts are challenging |
| HTML | BeautifulSoup, Trafilatura | Remove navigation, ads, boilerplate |
| Markdown | Built-in, markdown-it | Preserve heading structure for metadata |
| DOCX | python-docx, Unstructured | Handle styles, headers, tables |
| CSV/Excel | pandas | Convert rows to text, or treat as structured data |
| Code | tree-sitter, custom parsers | Parse by function/class, preserve context |
Chunking Strategies
1. Fixed-Size Chunking
The simplest approach: split text into chunks of a fixed number of characters or tokens, with optional overlap.
def fixed_size_chunking(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Reached the end; avoid emitting a redundant trailing chunk
        start = end - overlap  # Overlap with previous chunk
    return chunks
# Pros: Simple, predictable chunk sizes
# Cons: Can split mid-sentence, mid-paragraph, or mid-concept
# Use when: Quick prototyping, uniform document types
2. Sentence-Based Chunking
Split at sentence boundaries, grouping sentences until reaching the target size.
import re
def sentence_chunking(text: str, max_chunk_size: int = 500) -> list[str]:
"""Split text into chunks at sentence boundaries."""
# Split into sentences (simple regex -- use spaCy/NLTK for production)
sentences = re.split(r'(?<=[.!?])\s+', text)
chunks = []
current_chunk = []
current_size = 0
for sentence in sentences:
sentence_len = len(sentence)
if current_size + sentence_len > max_chunk_size and current_chunk:
chunks.append(' '.join(current_chunk))
current_chunk = []
current_size = 0
current_chunk.append(sentence)
current_size += sentence_len + 1 # +1 for space
if current_chunk:
chunks.append(' '.join(current_chunk))
return chunks
# Pros: Never splits mid-sentence
# Cons: Chunk sizes vary; may still split topics
# Use when: General-purpose text documents
3. Recursive Character Text Splitting
The most popular approach (default in LangChain). Tries to split by paragraphs first, then sentences, then words, recursively ensuring chunks do not exceed the target size.
class RecursiveTextSplitter:
"""
Recursively split text using a hierarchy of separators.
Tries to keep semantically related text together.
"""
def __init__(
self,
chunk_size: int = 1000,
chunk_overlap: int = 200,
separators: list[str] = None,
):
self.chunk_size = chunk_size
self.chunk_overlap = chunk_overlap
self.separators = separators or [
"\n\n", # Paragraph breaks (highest priority)
"\n", # Line breaks
". ", # Sentence breaks
", ", # Clause breaks
" ", # Word breaks
"", # Character breaks (last resort)
]
def split_text(self, text: str) -> list[str]:
"""Split text recursively."""
final_chunks = []
self._split_recursive(text, self.separators, final_chunks)
return final_chunks
def _split_recursive(self, text: str, separators: list[str], final_chunks: list):
if len(text) <= self.chunk_size:
if text.strip():
final_chunks.append(text.strip())
return
# Find the best separator (first one that exists in the text)
separator = separators[-1]
remaining_separators = separators
for i, sep in enumerate(separators):
if sep in text:
separator = sep
remaining_separators = separators[i + 1:]
break
# Split by the chosen separator
splits = text.split(separator) if separator else list(text)
# Merge splits into chunks of appropriate size
current_chunk = []
current_length = 0
for split in splits:
piece = split + separator if separator else split
piece_len = len(piece)
if current_length + piece_len > self.chunk_size and current_chunk:
# Current chunk is full; finalize it
merged = (separator if separator else "").join(current_chunk)
if len(merged) <= self.chunk_size:
final_chunks.append(merged.strip())
else:
# Still too big -- recurse with finer separators
self._split_recursive(merged, remaining_separators, final_chunks)
# Start new chunk with overlap
overlap_chunks = []
overlap_len = 0
for c in reversed(current_chunk):
if overlap_len + len(c) > self.chunk_overlap:
break
overlap_chunks.insert(0, c)
overlap_len += len(c)
current_chunk = overlap_chunks
current_length = overlap_len
current_chunk.append(split)
current_length += piece_len
if current_chunk:
merged = (separator if separator else "").join(current_chunk)
if merged.strip():
final_chunks.append(merged.strip())
# Usage -- long_document stands for any large string you have loaded
splitter = RecursiveTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(long_document)
print(f"Created {len(chunks)} chunks")
for i, chunk in enumerate(chunks[:3]):
print(f"Chunk {i}: {len(chunk)} chars -- {chunk[:80]}...")
4. Semantic Chunking
Use embeddings to find natural topic boundaries. Split where the semantic similarity between consecutive sentences drops below a threshold.
import re

import numpy as np
from sentence_transformers import SentenceTransformer
def semantic_chunking(
text: str,
model_name: str = "all-MiniLM-L6-v2",
threshold: float = 0.5,
min_chunk_size: int = 100,
) -> list[str]:
"""
Split text at semantic boundaries using embedding similarity.
"""
model = SentenceTransformer(model_name)
    # Split into sentences (simple regex -- use spaCy/NLTK for production)
    sentences = re.split(r'(?<=[.!?])\s+', text)
if len(sentences) <= 1:
return [text]
# Embed all sentences
embeddings = model.encode(sentences)
# Compute cosine similarity between consecutive sentences
similarities = []
for i in range(len(embeddings) - 1):
sim = np.dot(embeddings[i], embeddings[i + 1]) / (
np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i + 1])
)
similarities.append(sim)
# Find split points where similarity drops below threshold
chunks = []
current_chunk = [sentences[0]]
for i, sim in enumerate(similarities):
if sim < threshold and len(' '.join(current_chunk)) >= min_chunk_size:
chunks.append(' '.join(current_chunk))
current_chunk = [sentences[i + 1]]
else:
current_chunk.append(sentences[i + 1])
if current_chunk:
chunks.append(' '.join(current_chunk))
return chunks
# Pros: Respects topic boundaries, produces semantically coherent chunks
# Cons: Requires embedding computation, variable chunk sizes
# Use when: Documents with clear topic shifts, quality is critical
5. Document-Structure-Aware Chunking
Use the document's own structure (headings, sections, lists) to guide chunking.
import re
from typing import List, Dict
def markdown_structure_chunking(
markdown_text: str,
max_chunk_size: int = 1000,
) -> List[Dict]:
"""
Chunk a Markdown document by its heading structure.
Each chunk includes metadata about its position in the hierarchy.
"""
# Split by headings
heading_pattern = r'^(#{1,6})\s+(.+)$'
lines = markdown_text.split('\n')
sections = []
current_section = {
'heading': 'Introduction',
'level': 0,
'content': [],
'path': [], # Breadcrumb path of parent headings
}
heading_stack = [] # Stack of (level, heading) for building paths
for line in lines:
match = re.match(heading_pattern, line)
if match:
# Save current section
if current_section['content']:
current_section['text'] = '\n'.join(current_section['content']).strip()
if current_section['text']:
sections.append(current_section.copy())
# Parse heading
level = len(match.group(1))
heading = match.group(2)
# Update heading stack
while heading_stack and heading_stack[-1][0] >= level:
heading_stack.pop()
heading_stack.append((level, heading))
# Start new section
current_section = {
'heading': heading,
'level': level,
'content': [],
'path': [h[1] for h in heading_stack],
}
else:
current_section['content'].append(line)
# Save last section
if current_section['content']:
current_section['text'] = '\n'.join(current_section['content']).strip()
if current_section['text']:
sections.append(current_section)
# Split large sections further
final_chunks = []
splitter = RecursiveTextSplitter(chunk_size=max_chunk_size, chunk_overlap=100)
for section in sections:
if len(section['text']) <= max_chunk_size:
final_chunks.append({
'text': section['text'],
'metadata': {
'heading': section['heading'],
'path': ' > '.join(section['path']),
'level': section['level'],
},
})
else:
# Split large section into smaller chunks
sub_chunks = splitter.split_text(section['text'])
for i, chunk in enumerate(sub_chunks):
final_chunks.append({
'text': chunk,
'metadata': {
'heading': section['heading'],
'path': ' > '.join(section['path']),
'level': section['level'],
'chunk_index': i,
},
})
return final_chunks
Chunk Size and Overlap Considerations
Guidelines for Chunk Size
- Smaller chunks (100-300 tokens): More precise retrieval, but may lack context. Good for factoid Q&A.
- Medium chunks (300-800 tokens): Balance between precision and context. Best general-purpose choice.
- Larger chunks (800-2000 tokens): More context per chunk, but less precise retrieval. Good for summarization tasks.
- Overlap (10-20% of chunk size): Prevents information loss at chunk boundaries. Critical for maintaining coherence.
- Match chunk size to embedding model's training data: If the embedding model was trained on passages of ~256 tokens, chunks around that size will be encoded most effectively.
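The overlap guideline has a concrete cost: each chunk after the first advances by only chunk_size minus overlap characters, so overlap increases the total chunk count. A small sketch of the arithmetic (the helper name is mine, not from any library):

```python
import math

def estimated_chunk_count(text_len: int, chunk_size: int, overlap: int) -> int:
    """After the first chunk, each new chunk advances by (chunk_size - overlap),
    so the remaining text is covered in strides of that length."""
    if text_len <= chunk_size:
        return 1
    stride = chunk_size - overlap
    return 1 + math.ceil((text_len - chunk_size) / stride)

# A 10,000-char document with 500-char chunks:
# no overlap -> 20 chunks; 10% (50-char) overlap -> stride 450 -> 23 chunks
print(estimated_chunk_count(10_000, 500, 0))   # 20
print(estimated_chunk_count(10_000, 500, 50))  # 23
```

So a 10% overlap costs roughly 15% more chunks to embed and store here, which is usually a worthwhile trade for boundary coherence.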
3. Vector Embeddings Deep Dive
Embeddings are dense vector representations of text that capture semantic meaning. Two texts with similar meanings will have similar embedding vectors, even if they use different words. This is the foundation of semantic search in RAG systems.
Text Embedding Models
| Model | Provider | Dimensions | Max Tokens | Notes |
|---|---|---|---|---|
| text-embedding-3-large | OpenAI | 3072 (or less via Matryoshka) | 8191 | Best proprietary model; supports dimension reduction |
| text-embedding-3-small | OpenAI | 1536 | 8191 | Cost-effective, strong quality |
| voyage-3 | Voyage AI | 1024 | 32000 | Strong for code and retrieval tasks |
| embed-v4 | Cohere | 1024 | 512 | Excellent retrieval quality; separate query/document input types |
| BGE-large-en-v1.5 | BAAI (open) | 1024 | 512 | Top open-source embedding model |
| E5-mistral-7b-instruct | Microsoft (open) | 4096 | 32768 | LLM-based embeddings, excellent quality |
| GTE-Qwen2-7B-instruct | Alibaba (open) | 3584 | 131072 | Long-context embeddings |
| all-MiniLM-L6-v2 | SBERT (open) | 384 | 256 | Tiny, fast, good for prototyping |
Matryoshka Embeddings
Matryoshka embeddings (named after Russian nesting dolls) are trained so that the first N dimensions of a larger embedding are themselves a valid, useful embedding of lower dimension. This allows you to choose between quality and efficiency at deployment time without retraining.
# Matryoshka embeddings: variable-dimension embeddings
# OpenAI text-embedding-3-large supports this natively
from openai import OpenAI
client = OpenAI()
text = "Machine learning is a subset of artificial intelligence."
# Full 3072 dimensions (highest quality)
response_full = client.embeddings.create(
model="text-embedding-3-large",
input=text,
dimensions=3072,
)
# Reduced to 1024 dimensions (good quality, 3x smaller)
response_1024 = client.embeddings.create(
model="text-embedding-3-large",
input=text,
dimensions=1024,
)
# Reduced to 256 dimensions (acceptable quality, 12x smaller)
response_256 = client.embeddings.create(
model="text-embedding-3-large",
input=text,
dimensions=256,
)
# The 256-dim embedding is a valid embedding that captures
# the most important semantic features. You can use it for
# applications where storage/speed matter more than precision.
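If you already store full-size embeddings, the same reduction can be approximated client-side: keep the first N dimensions and re-normalize to unit length, which mirrors what the API's `dimensions` parameter is documented to do server-side. A sketch with a random stand-in vector (real Matryoshka quality depends on the model having been trained for truncation):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize to unit length."""
    truncated = vec[:dims]
    return truncated / np.linalg.norm(truncated)

# Stand-in for a stored full-size embedding (already unit-normalized)
full = np.random.default_rng(0).normal(size=3072)
full = full / np.linalg.norm(full)

small = truncate_embedding(full, 256)
print(small.shape)  # (256,) -- still a unit vector, usable for cosine search
```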
Late Interaction (ColBERT)
Unlike standard embeddings (single vector per document), ColBERT produces one vector per token. At retrieval time, it computes token-level similarity using MaxSim, providing much more fine-grained matching.
# Standard Embedding:
# Document -> [single 768-dim vector]
# Query -> [single 768-dim vector]
# Similarity = cosine(doc_vec, query_vec)
# ColBERT Late Interaction:
# Document "machine learning is powerful" -> [vec_machine, vec_learning, vec_is, vec_powerful]
# Query "deep learning" -> [vec_deep, vec_learning]
#
# For each query token, find max similarity across all doc tokens:
# score = sum over query tokens of max(cosine(q_token, d_token) for d_token in doc)
#
# This captures partial matches much better than single-vector comparison
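The MaxSim scoring described above is a few lines of NumPy. This sketch uses random stand-in token vectors rather than real ColBERT output, but the scoring rule is the one the diagram describes:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token vector, take the
    maximum cosine similarity over all document token vectors, then sum."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T  # shape: (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
query = rng.normal(size=(2, 8))                        # 2 query token vectors
doc = np.vstack([rng.normal(size=(3, 8)), query[1]])   # doc shares one query token

# The shared token matches itself with cosine 1.0, so it contributes exactly
# 1.0 to the score; each query token can contribute at most 1.0.
score = maxsim_score(query, doc)
print(round(score, 3))
```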
Practical: Generate and Compare Embeddings
"""
Embedding Model Comparison
=============================
Generate embeddings with multiple models and compare their
quality on a retrieval task.
"""
import numpy as np
from sentence_transformers import SentenceTransformer
from typing import List, Tuple
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
"""Compute cosine similarity between two vectors."""
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
def evaluate_retrieval(
model: SentenceTransformer,
queries: List[str],
documents: List[str],
ground_truth: List[int], # Index of correct document for each query
) -> dict:
"""Evaluate retrieval quality of an embedding model."""
# Embed all documents
doc_embeddings = model.encode(documents, normalize_embeddings=True)
correct = 0
mrr_total = 0.0
for i, query in enumerate(queries):
# Embed query
query_embedding = model.encode([query], normalize_embeddings=True)[0]
# Compute similarities
similarities = [
cosine_similarity(query_embedding, doc_emb)
for doc_emb in doc_embeddings
]
# Rank documents
ranked_indices = np.argsort(similarities)[::-1]
# Check if top-1 is correct
if ranked_indices[0] == ground_truth[i]:
correct += 1
# Compute reciprocal rank
correct_rank = np.where(ranked_indices == ground_truth[i])[0][0] + 1
mrr_total += 1.0 / correct_rank
accuracy = correct / len(queries)
mrr = mrr_total / len(queries)
return {"accuracy_at_1": accuracy, "mrr": mrr}
# Test data
documents = [
"Python is a high-level programming language known for its simplicity and readability.",
"The Eiffel Tower is a wrought-iron lattice tower in Paris, France, built in 1889.",
"Photosynthesis is the process by which plants convert sunlight into chemical energy.",
"The stock market experienced significant volatility during the 2008 financial crisis.",
"Quantum computing uses qubits that can exist in superposition, enabling parallel computation.",
"The human genome contains approximately 3 billion base pairs of DNA.",
"Neural networks are computing systems inspired by biological neural networks in the brain.",
"Climate change is driven by greenhouse gas emissions from human activities.",
]
queries = [
"What programming language is easy to learn?", # -> 0 (Python)
"Tell me about a famous landmark in France.", # -> 1 (Eiffel Tower)
"How do plants make food from sunlight?", # -> 2 (Photosynthesis)
"What happened to financial markets in 2008?", # -> 3 (stock market)
"How does quantum computing work?", # -> 4 (quantum)
"What is DNA made of?", # -> 5 (genome)
"How are artificial neural networks structured?", # -> 6 (neural networks)
"What causes global warming?", # -> 7 (climate change)
]
ground_truth = [0, 1, 2, 3, 4, 5, 6, 7]
# Compare models
models_to_test = [
("all-MiniLM-L6-v2", 384),
("all-mpnet-base-v2", 768),
("BAAI/bge-small-en-v1.5", 384),
("BAAI/bge-base-en-v1.5", 768),
]
print("=" * 60)
print("EMBEDDING MODEL COMPARISON")
print("=" * 60)
for model_name, dim in models_to_test:
print(f"\nLoading {model_name} (dim={dim})...")
model = SentenceTransformer(model_name)
results = evaluate_retrieval(model, queries, documents, ground_truth)
print(f" Accuracy@1: {results['accuracy_at_1']:.1%}")
print(f" MRR: {results['mrr']:.3f}")
# Show similarity matrix for first query
query_emb = model.encode([queries[0]], normalize_embeddings=True)[0]
doc_embs = model.encode(documents, normalize_embeddings=True)
sims = [cosine_similarity(query_emb, d) for d in doc_embs]
print(f" Query: '{queries[0]}'")
top3_idx = np.argsort(sims)[::-1][:3]
for rank, idx in enumerate(top3_idx):
print(f" #{rank+1} (sim={sims[idx]:.3f}): {documents[idx][:60]}...")
4. Vector Databases
Vector databases are purpose-built systems for storing, indexing, and querying high-dimensional vectors. They are the backbone of any RAG system, enabling fast similarity search over millions or billions of embeddings.
Comparison of Vector Databases
| Database | Type | Hosting | Best For |
|---|---|---|---|
| Pinecone | Managed cloud | Fully managed | Production; zero ops, auto-scaling |
| Weaviate | Open-source | Self-hosted or cloud | Hybrid search (vector + keyword), GraphQL API |
| Qdrant | Open-source | Self-hosted or cloud | Rich filtering, Rust performance |
| Milvus | Open-source | Self-hosted or Zilliz cloud | Large-scale (billions of vectors) |
| Chroma | Open-source | Embedded or server | Prototyping, small-medium datasets |
| pgvector | PostgreSQL extension | Any PostgreSQL host | Integration with existing Postgres infrastructure |
| FAISS | Library | In-process | Research, benchmarking, maximum control |
Indexing Algorithms
Brute Force (Flat Index)
Compute distance from query to every vector in the database. Guarantees finding the exact nearest neighbor but is O(n) in the number of vectors.
# Brute force: compare query against all N vectors
# Time: O(N * d) where d is dimension
# Memory: O(N * d)
# Quality: Perfect (100% recall)
# Practical for: < 100K vectors
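Brute force is a one-liner with NumPy: score all N vectors, sort. A small sketch on random data (the function name is mine):

```python
import numpy as np

def brute_force_search(query: np.ndarray, vectors: np.ndarray, k: int = 3):
    """Exact nearest neighbors by cosine similarity: score all N vectors, sort."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                        # O(N * d): one dot product per vector
    top = np.argsort(scores)[::-1][:k]    # indices of the k best matches
    return top, scores[top]

rng = np.random.default_rng(42)
db = rng.normal(size=(1000, 64)).astype("float32")
query = db[7] + 0.01 * rng.normal(size=64)  # a near-duplicate of vector 7
top, scores = brute_force_search(query, db, k=3)
print(top[0])  # vector 7 should rank first
```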
IVF (Inverted File Index)
Partition vectors into clusters using k-means. At query time, only search the nearest clusters.
# IVF with nlist clusters and nprobe searched:
# Training: Run k-means to create nlist centroids
# Indexing: Assign each vector to its nearest centroid
# Query:
# 1. Find nprobe nearest centroids to query
# 2. Search only vectors in those clusters
# Time: O(nprobe * N/nlist * d) -- much faster when nprobe << nlist
# Quality: Approximate (recall depends on nprobe/nlist ratio)
# Typical: nlist=sqrt(N), nprobe=nlist/10
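The cluster-then-probe idea can be sketched with a tiny hand-rolled k-means on toy 2-D data (FAISS does all of this internally, with much better clustering):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2)).astype("float32")
nlist, nprobe = 4, 2

# Training: a few k-means iterations to produce nlist centroids
centroids = data[rng.choice(len(data), nlist, replace=False)].copy()
for _ in range(10):
    assign = np.argmin(((data[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    for c in range(nlist):
        members = data[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)

# Query: probe the nprobe nearest clusters, search only their members
query = np.array([0.5, -0.2], dtype="float32")
probed = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
candidates = np.where(np.isin(assign, probed))[0]
best = candidates[np.argmin(((data[candidates] - query) ** 2).sum(-1))]
print(f"searched {len(candidates)}/{len(data)} vectors, best index {best}")
```

Only the probed clusters are scanned, which is where the speedup (and the recall loss, when the true neighbor sits in an unprobed cluster) comes from.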
HNSW (Hierarchical Navigable Small World)
The most popular algorithm for approximate nearest neighbor search. Builds a multi-layer graph where each node is connected to its nearest neighbors.
# HNSW: Multi-layer navigable graph
#
# Layer 3: [A] -------- [B] (few nodes, long-range connections)
# | |
# Layer 2: [A] -- [C] -- [B] -- [D] (more nodes)
# | | | |
# Layer 1: [A]-[E]-[C]-[F]-[B]-[D]-[G] (even more nodes)
# | | | | | | |
# Layer 0: [A][E][H][C][F][I][B][D][G][J] (all nodes, dense connections)
#
# Search algorithm:
# 1. Start at entry point in top layer
# 2. Greedily traverse to nearest node in current layer
# 3. When no closer node found, drop to next layer
# 4. Repeat until bottom layer
# 5. Return top-K nearest neighbors from bottom layer
#
# Key parameters:
# M: max connections per node (higher = better quality, more memory)
# ef_construction: beam width during index building
# ef_search: beam width during query (higher = better recall, slower)
#
# Time: O(log(N) * d) -- logarithmic scaling!
# Memory: O(N * M * d) -- higher than flat due to graph structure
# Quality: Very high recall (>95%) with proper parameters
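The greedy traversal in steps 1-3 can be demonstrated on a single toy layer: hop to whichever neighbor is closest to the query, stop at a local minimum. The graph and points here are hand-made for illustration; HNSW repeats this descent on each layer, using the result as the entry point for the next one down.

```python
import numpy as np

# A toy single-layer proximity graph over 2-D points
points = np.array([[0, 0], [1, 0], [2, 0], [0, 1],
                   [1, 1], [2, 1], [0, 2], [1, 2]], dtype=float)
graph = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5], 3: [0, 4, 6],
         4: [1, 3, 5, 7], 5: [2, 4], 6: [3, 7], 7: [4, 6]}

def greedy_search(query: np.ndarray, entry: int = 0) -> int:
    """Hop to the neighbor closest to the query; stop at a local minimum."""
    current = entry
    while True:
        best = min(graph[current], key=lambda n: np.linalg.norm(points[n] - query))
        if np.linalg.norm(points[best] - query) >= np.linalg.norm(points[current] - query):
            return current  # no neighbor is closer: local minimum reached
        current = best

print(greedy_search(np.array([1.9, 1.1])))  # → 5, the point at (2, 1)
```

Each hop only examines the current node's M neighbors, which is why search cost grows logarithmically rather than linearly with N.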
Distance Metrics
| Metric | Formula | Range | Best For |
|---|---|---|---|
| Cosine Similarity | dot(a,b) / (‖a‖ · ‖b‖) | [-1, 1] | Text embeddings (most common) |
| Euclidean (L2) | sqrt(sum((a_i - b_i)^2)) | [0, inf) | Spatial data, image features |
| Dot Product | sum(a_i * b_i) | (-inf, inf) | Normalized embeddings (equivalent to cosine) |
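The equivalence claimed in the last row is easy to verify: after normalizing both vectors to unit length, a plain dot product equals cosine similarity. This is why vector databases often store unit vectors and use the cheaper inner-product metric.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine similarity computed directly from the formula
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize first, then a plain dot product gives the same number
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_normalized = np.dot(a_unit, b_unit)

print(round(cosine, 4), round(dot_normalized, 4))  # both 0.96
```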
Practical: Vector Search with Chroma and FAISS
"""
Vector Database Practical: Chroma and FAISS
=============================================
Set up vector stores, insert embeddings, and perform similarity search.
pip install chromadb faiss-cpu sentence-transformers
"""
# ===========================
# Part 1: ChromaDB
# ===========================
import chromadb
from chromadb.utils import embedding_functions
# Initialize Chroma (persistent storage)
client = chromadb.PersistentClient(path="./chroma_db")
# Use the default embedding function (all-MiniLM-L6-v2)
# Or specify a custom one:
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="all-MiniLM-L6-v2"
)
# Create or get a collection
collection = client.get_or_create_collection(
name="knowledge_base",
embedding_function=embedding_fn,
metadata={"hnsw:space": "cosine"}, # Use cosine similarity
)
# Add documents
documents = [
"Python is a versatile programming language used in AI and web development.",
"JavaScript is the most popular language for web development.",
"Machine learning models learn patterns from data to make predictions.",
"Deep learning uses neural networks with multiple layers.",
"Natural language processing enables computers to understand human language.",
"Computer vision allows machines to interpret and understand visual information.",
"Reinforcement learning trains agents through rewards and penalties.",
"Transfer learning leverages pre-trained models for new tasks.",
]
# Add documents with IDs and metadata
collection.add(
documents=documents,
ids=[f"doc_{i}" for i in range(len(documents))],
metadatas=[{"source": "textbook", "chapter": i + 1} for i in range(len(documents))],
)
print(f"Collection has {collection.count()} documents")
# Query the collection
results = collection.query(
query_texts=["How do machines learn from data?"],
n_results=3,
include=["documents", "distances", "metadatas"],
)
print("\nChroma Search Results:")
for i, (doc, dist, meta) in enumerate(zip(
results["documents"][0],
results["distances"][0],
results["metadatas"][0],
)):
print(f" #{i+1} (distance={dist:.4f}): {doc}")
print(f" metadata: {meta}")
# Query with metadata filtering
results_filtered = collection.query(
query_texts=["programming languages"],
n_results=3,
where={"chapter": {"$lte": 3}}, # Only chapters 1-3
include=["documents", "distances"],
)
print("\nFiltered results (chapters 1-3 only):")
for doc, dist in zip(results_filtered["documents"][0], results_filtered["distances"][0]):
print(f" (dist={dist:.4f}): {doc}")
# ===========================
# Part 2: FAISS
# ===========================
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
# Load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")
dimension = 384 # Embedding dimension for this model
# Create embeddings for our documents
doc_embeddings = model.encode(documents, normalize_embeddings=True)
doc_embeddings = np.array(doc_embeddings).astype("float32")
# ---- Flat Index (brute force, exact) ----
index_flat = faiss.IndexFlatIP(dimension) # Inner product (= cosine for normalized vectors)
index_flat.add(doc_embeddings)
print(f"\nFAISS Flat Index: {index_flat.ntotal} vectors")
# Search
query = "How do machines learn from data?"
query_embedding = model.encode([query], normalize_embeddings=True).astype("float32")
distances, indices = index_flat.search(query_embedding, k=3)
print("FAISS Flat Search Results:")
for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
print(f" #{i+1} (score={dist:.4f}): {documents[idx]}")
# ---- HNSW Index (approximate, fast) ----
index_hnsw = faiss.IndexHNSWFlat(dimension, 32) # M=32 connections
index_hnsw.hnsw.efConstruction = 128 # Construction-time beam width
index_hnsw.hnsw.efSearch = 64 # Search-time beam width
index_hnsw.add(doc_embeddings)
print(f"\nFAISS HNSW Index: {index_hnsw.ntotal} vectors")
distances, indices = index_hnsw.search(query_embedding, k=3)
print("FAISS HNSW Search Results:")
for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
print(f" #{i+1} (score={dist:.4f}): {documents[idx]}")
# ---- IVF Index (approximate, memory efficient) ----
nlist = 4 # Number of clusters (use sqrt(N) for large datasets)
quantizer = faiss.IndexFlatIP(dimension)
index_ivf = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT)
index_ivf.train(doc_embeddings) # Train the clustering
index_ivf.add(doc_embeddings)
index_ivf.nprobe = 2 # Search 2 out of 4 clusters
print(f"\nFAISS IVF Index: {index_ivf.ntotal} vectors")
distances, indices = index_ivf.search(query_embedding, k=3)
print("FAISS IVF Search Results:")
for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
print(f" #{i+1} (score={dist:.4f}): {documents[idx]}")
# Save and load index
faiss.write_index(index_hnsw, "knowledge_base.index")
loaded_index = faiss.read_index("knowledge_base.index")
print(f"\nLoaded index with {loaded_index.ntotal} vectors")
Practical: pgvector with PostgreSQL
"""
pgvector: Vector Search in PostgreSQL
=========================================
Use vectors alongside traditional relational data.
Setup:
1. Install PostgreSQL with pgvector extension
2. pip install psycopg2-binary sentence-transformers
Docker quickstart:
docker run -d --name pgvector -e POSTGRES_PASSWORD=password \
-p 5432:5432 pgvector/pgvector:pg16
"""
import psycopg2
import numpy as np
from sentence_transformers import SentenceTransformer
def setup_pgvector():
"""Set up pgvector database and table."""
conn = psycopg2.connect(
host="localhost",
port=5432,
dbname="postgres",
user="postgres",
password="password",
)
conn.autocommit = True
cur = conn.cursor()
# Enable pgvector extension
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
# Create table with vector column
cur.execute("""
CREATE TABLE IF NOT EXISTS documents (
id SERIAL PRIMARY KEY,
content TEXT NOT NULL,
source VARCHAR(255),
category VARCHAR(100),
embedding vector(384), -- 384 dimensions for MiniLM
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
# Create HNSW index for fast similarity search
cur.execute("""
CREATE INDEX IF NOT EXISTS documents_embedding_idx
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64)
""")
return conn
def insert_documents(conn, documents, model):
"""Insert documents with their embeddings."""
cur = conn.cursor()
for doc in documents:
embedding = model.encode(doc["content"]).tolist()
cur.execute(
"""
INSERT INTO documents (content, source, category, embedding)
VALUES (%s, %s, %s, %s::vector)
""",
(doc["content"], doc.get("source", ""), doc.get("category", ""), embedding),
)
conn.commit()
print(f"Inserted {len(documents)} documents")
def semantic_search(conn, model, query, k=5, category=None):
"""Perform semantic search with optional filtering."""
cur = conn.cursor()
query_embedding = model.encode(query).tolist()
if category:
cur.execute(
"""
SELECT content, source, category,
1 - (embedding <=> %s::vector) AS similarity
FROM documents
WHERE category = %s
ORDER BY embedding <=> %s::vector
LIMIT %s
""",
(query_embedding, category, query_embedding, k),
)
else:
cur.execute(
"""
SELECT content, source, category,
1 - (embedding <=> %s::vector) AS similarity
FROM documents
ORDER BY embedding <=> %s::vector
LIMIT %s
""",
(query_embedding, query_embedding, k),
)
results = cur.fetchall()
return [
{
"content": r[0],
"source": r[1],
"category": r[2],
"similarity": float(r[3]),
}
for r in results
]
# Usage
if __name__ == "__main__":
model = SentenceTransformer("all-MiniLM-L6-v2")
conn = setup_pgvector()
documents = [
{"content": "Python supports multiple programming paradigms.", "source": "docs", "category": "programming"},
{"content": "Neural networks are inspired by biological neurons.", "source": "textbook", "category": "ml"},
{"content": "PostgreSQL is a powerful relational database.", "source": "docs", "category": "databases"},
]
insert_documents(conn, documents, model)
results = semantic_search(conn, model, "How do brain-inspired algorithms work?", k=3)
for r in results:
print(f"[{r['similarity']:.3f}] {r['content']}")
5. RAG Pipeline Architecture
Hybrid Search
Combining keyword search (BM25) with semantic vector search often produces better results than either alone. BM25 excels at exact term matching, while semantic search captures meaning.
"""
Hybrid Search: BM25 + Vector Search
======================================
Combine keyword and semantic search for better retrieval.
pip install rank-bm25 sentence-transformers numpy
"""
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from typing import List, Tuple
class HybridSearcher:
"""Combine BM25 keyword search with semantic vector search."""
def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
self.model = SentenceTransformer(model_name)
self.documents = []
self.bm25 = None
self.embeddings = None
def index(self, documents: List[str]):
"""Index documents for both BM25 and vector search."""
self.documents = documents
# BM25 index
tokenized = [doc.lower().split() for doc in documents]
self.bm25 = BM25Okapi(tokenized)
# Vector index
self.embeddings = self.model.encode(
documents, normalize_embeddings=True
)
def search(
self,
query: str,
k: int = 5,
alpha: float = 0.5, # Weight for semantic search (1-alpha for BM25)
) -> List[Tuple[int, float, str]]:
"""
Hybrid search combining BM25 and semantic scores.
Args:
query: Search query
k: Number of results
alpha: Weight for semantic search (0=pure BM25, 1=pure semantic)
Returns:
List of (index, score, document) tuples
"""
# BM25 scores
bm25_scores = self.bm25.get_scores(query.lower().split())
# Normalize to [0, 1]
bm25_max = max(bm25_scores) if max(bm25_scores) > 0 else 1
bm25_normalized = bm25_scores / bm25_max
# Semantic scores
query_embedding = self.model.encode(
[query], normalize_embeddings=True
)[0]
semantic_scores = np.dot(self.embeddings, query_embedding)
# Already in [-1, 1] range for normalized embeddings
semantic_normalized = (semantic_scores + 1) / 2 # Shift to [0, 1]
# Combine scores
hybrid_scores = alpha * semantic_normalized + (1 - alpha) * bm25_normalized
# Get top-k results
top_indices = np.argsort(hybrid_scores)[::-1][:k]
results = []
for idx in top_indices:
results.append((
int(idx),
float(hybrid_scores[idx]),
self.documents[idx],
))
return results
# Demo
searcher = HybridSearcher()
docs = [
"The Python programming language was created by Guido van Rossum in 1991.",
"Machine learning algorithms can be supervised, unsupervised, or reinforcement-based.",
"PostgreSQL supports JSONB columns for storing semi-structured data.",
"Transfer learning uses pre-trained neural network weights as a starting point.",
"REST APIs use HTTP methods like GET, POST, PUT, and DELETE.",
"Convolutional neural networks excel at image recognition tasks.",
"Docker containers provide lightweight virtualization for application deployment.",
"The attention mechanism in transformers computes weighted sums of value vectors.",
]
searcher.index(docs)
# Test different queries
queries = [
"Who invented Python?", # BM25 should excel (exact term match)
"How do deep learning models see images?", # Semantic should excel
"neural network attention", # Both should contribute
]
for query in queries:
print(f"\nQuery: '{query}'")
results = searcher.search(query, k=3, alpha=0.5)
for idx, score, doc in results:
print(f" [{score:.3f}] {doc}")
Reranking with Cross-Encoders
Initial retrieval with a bi-encoder is tuned for recall (casting a wide net over potentially relevant documents); reranking with a cross-encoder is tuned for precision (ordering that candidate set correctly).
"""
Reranking with Cross-Encoders
================================
Use a cross-encoder to rerank initially retrieved documents
for higher precision.
pip install sentence-transformers
"""
from sentence_transformers import CrossEncoder
# Load cross-encoder model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# Initial retrieval results (from vector search)
query = "How do transformers process sequences?"
retrieved_docs = [
"The attention mechanism computes weighted sums of values based on query-key similarities.",
"Recurrent neural networks process sequences one token at a time.",
"Transformers use self-attention to process all tokens in parallel.",
"BERT is a bidirectional transformer model for language understanding.",
"CNNs use convolutional filters to detect local patterns.",
]
# Rerank: cross-encoder scores each (query, document) pair
pairs = [(query, doc) for doc in retrieved_docs]
scores = reranker.predict(pairs)
# Sort by score
ranked = sorted(
zip(scores, retrieved_docs),
key=lambda x: x[0],
reverse=True,
)
print(f"Query: {query}\n")
print("Before reranking:")
for i, doc in enumerate(retrieved_docs):
print(f" #{i+1}: {doc}")
print("\nAfter reranking:")
for i, (score, doc) in enumerate(ranked):
print(f" #{i+1} (score={score:.3f}): {doc}")
Practical: Complete RAG Pipeline from Scratch
"""
Complete RAG Pipeline from Scratch
=====================================
Build a full RAG system without any framework,
using only basic libraries.
pip install sentence-transformers chromadb openai
"""
import os
from typing import List, Dict, Optional
from dataclasses import dataclass
from sentence_transformers import SentenceTransformer, CrossEncoder
import chromadb
from openai import OpenAI
@dataclass
class RetrievedChunk:
"""A chunk of text retrieved from the knowledge base."""
text: str
source: str
score: float
metadata: Dict
class RAGPipeline:
"""Complete RAG pipeline: index, retrieve, generate."""
def __init__(
self,
embedding_model: str = "all-MiniLM-L6-v2",
reranker_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
llm_model: str = "gpt-4o-mini",
collection_name: str = "rag_knowledge_base",
persist_dir: str = "./rag_chroma_db",
):
# Embedding model
self.embedder = SentenceTransformer(embedding_model)
# Reranker
self.reranker = CrossEncoder(reranker_model)
# Vector store
self.chroma_client = chromadb.PersistentClient(path=persist_dir)
self.collection = self.chroma_client.get_or_create_collection(
name=collection_name,
metadata={"hnsw:space": "cosine"},
)
# LLM client
self.llm_client = OpenAI()
self.llm_model = llm_model
# Text splitter
self.chunk_size = 500
self.chunk_overlap = 50
# ---- INDEXING ----
def chunk_text(self, text: str, source: str = "") -> List[Dict]:
"""Split text into overlapping chunks with metadata."""
chunks = []
sentences = text.replace('\n', ' ').split('. ')
current_chunk = []
current_length = 0
for sentence in sentences:
sentence = sentence.strip()
if not sentence:
continue
sentence_with_period = sentence if sentence.endswith('.') else sentence + '.'
sentence_len = len(sentence_with_period)
if current_length + sentence_len > self.chunk_size and current_chunk:
chunk_text = ' '.join(current_chunk)
chunks.append({
"text": chunk_text,
"source": source,
"chunk_index": len(chunks),
})
# Keep last sentence for overlap
current_chunk = current_chunk[-1:]
current_length = len(current_chunk[0]) if current_chunk else 0
current_chunk.append(sentence_with_period)
current_length += sentence_len
if current_chunk:
chunks.append({
"text": ' '.join(current_chunk),
"source": source,
"chunk_index": len(chunks),
})
return chunks
def index_documents(self, documents: List[Dict[str, str]]):
"""
Index documents into the vector store.
Args:
documents: List of {"text": "...", "source": "..."} dicts
"""
all_chunks = []
for doc in documents:
chunks = self.chunk_text(doc["text"], doc.get("source", "unknown"))
all_chunks.extend(chunks)
if not all_chunks:
print("No chunks to index")
return
# Generate embeddings
texts = [c["text"] for c in all_chunks]
embeddings = self.embedder.encode(texts).tolist()
# Add to ChromaDB
# Offset IDs by the existing count so repeated indexing calls don't collide
offset = self.collection.count()
self.collection.add(
ids=[f"chunk_{offset + i}" for i in range(len(all_chunks))],
documents=texts,
embeddings=embeddings,
metadatas=[
{"source": c["source"], "chunk_index": c["chunk_index"]}
for c in all_chunks
],
)
print(f"Indexed {len(all_chunks)} chunks from {len(documents)} documents")
# ---- RETRIEVAL ----
def retrieve(
self,
query: str,
top_k: int = 10,
rerank_top_k: int = 3,
) -> List[RetrievedChunk]:
"""Retrieve and rerank relevant chunks."""
# Step 1: Initial retrieval with vector search
query_embedding = self.embedder.encode([query]).tolist()
results = self.collection.query(
query_embeddings=query_embedding,
n_results=top_k,
include=["documents", "distances", "metadatas"],
)
if not results["documents"][0]:
return []
# Step 2: Rerank with cross-encoder
pairs = [(query, doc) for doc in results["documents"][0]]
rerank_scores = self.reranker.predict(pairs)
# Combine and sort
chunks = []
for doc, dist, meta, rerank_score in zip(
results["documents"][0],
results["distances"][0],
results["metadatas"][0],
rerank_scores,
):
chunks.append(RetrievedChunk(
text=doc,
source=meta.get("source", ""),
score=float(rerank_score),
metadata=meta,
))
# Sort by reranker score (descending)
chunks.sort(key=lambda x: x.score, reverse=True)
return chunks[:rerank_top_k]
# ---- GENERATION ----
def generate(
self,
query: str,
retrieved_chunks: List[RetrievedChunk],
system_prompt: Optional[str] = None,
) -> str:
"""Generate a response using retrieved context."""
if system_prompt is None:
system_prompt = """You are a helpful AI assistant. Answer the user's question
based ONLY on the provided context. If the context doesn't contain
enough information to answer the question, say so clearly.
Always cite your sources by referencing the source document."""
# Build context from retrieved chunks
context_parts = []
for i, chunk in enumerate(retrieved_chunks, 1):
context_parts.append(
f"[Source {i}: {chunk.source}]\n{chunk.text}"
)
context = "\n\n".join(context_parts)
# Build the prompt
messages = [
{"role": "system", "content": system_prompt},
{
"role": "user",
"content": f"""Context:
{context}
Question: {query}
Please answer based on the context above. Cite sources using [Source N] notation.""",
},
]
# Call LLM
response = self.llm_client.chat.completions.create(
model=self.llm_model,
messages=messages,
temperature=0.3,
max_tokens=500,
)
return response.choices[0].message.content
# ---- FULL PIPELINE ----
def query(self, question: str) -> Dict:
"""Run the full RAG pipeline: retrieve + generate."""
# Retrieve relevant chunks
chunks = self.retrieve(question, top_k=10, rerank_top_k=3)
if not chunks:
return {
"answer": "I could not find any relevant information to answer your question.",
"sources": [],
}
# Generate answer
answer = self.generate(question, chunks)
return {
"answer": answer,
"sources": [
{
"text": c.text[:200] + "...",
"source": c.source,
"relevance_score": c.score,
}
for c in chunks
],
}
# ===========================
# Usage Example
# ===========================
if __name__ == "__main__":
rag = RAGPipeline()
# Index some documents
documents = [
{
"text": """
Transformers are a neural network architecture introduced in the paper
'Attention Is All You Need' by Vaswani et al. in 2017. They revolutionized
NLP by replacing recurrent architectures with self-attention mechanisms
that can process sequences in parallel. The key innovation is multi-head
attention, which allows the model to jointly attend to information from
different representation subspaces. Transformers consist of an encoder
and decoder, each made up of layers containing self-attention and
feed-forward neural network sublayers.
""",
"source": "transformer_overview.pdf",
},
{
"text": """
BERT (Bidirectional Encoder Representations from Transformers) is a
language model developed by Google in 2018. Unlike GPT which is
autoregressive (left-to-right), BERT is trained with a masked language
modeling objective where random tokens are masked and the model predicts
them using bidirectional context. BERT uses only the encoder part of
the transformer architecture. It achieved state-of-the-art results on
11 NLP benchmarks when released.
""",
"source": "bert_paper.pdf",
},
{
"text": """
GPT (Generative Pre-trained Transformer) models use the decoder part
of the transformer architecture. They are trained autoregressively to
predict the next token. GPT-2 demonstrated that large language models
can generate coherent long-form text. GPT-3 with 175 billion parameters
showed emergent few-shot learning capabilities. GPT-4 introduced
multimodal capabilities, accepting both text and image inputs.
""",
"source": "gpt_history.pdf",
},
]
rag.index_documents(documents)
# Query the system
result = rag.query("What is the key innovation of the transformer architecture?")
print(f"Answer: {result['answer']}")
print(f"\nSources used:")
for source in result['sources']:
print(f" - {source['source']} (score: {source['relevance_score']:.3f})")
Practical: RAG with LangChain
"""
RAG Pipeline with LangChain
==============================
Build the same RAG pipeline using LangChain framework.
pip install langchain langchain-openai langchain-community chromadb
"""
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.document_loaders import (
TextLoader,
PyPDFLoader,
DirectoryLoader,
)
# Step 1: Load documents
# From text files:
# loader = TextLoader("document.txt")
# From PDFs:
# loader = PyPDFLoader("document.pdf")
# From a directory:
# loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
# For demo, create documents manually
from langchain.schema import Document
documents = [
Document(
page_content="""
Retrieval Augmented Generation (RAG) combines retrieval and generation
to produce grounded, accurate responses. It was introduced by Lewis et al.
in 2020. The key idea is to retrieve relevant documents from a knowledge
base and include them as context for the language model.
""",
metadata={"source": "rag_paper.pdf", "page": 1},
),
Document(
page_content="""
Vector databases store high-dimensional embeddings and enable fast
similarity search. Popular options include Pinecone, Weaviate, Qdrant,
and Chroma. They use algorithms like HNSW for approximate nearest
neighbor search, achieving sub-millisecond query times even with
millions of vectors.
""",
metadata={"source": "vector_db_guide.pdf", "page": 1},
),
]
# Step 2: Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
# Step 3: Create vector store with embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./langchain_chroma_db",
collection_name="langchain_rag",
)
# Step 4: Create retriever
retriever = vectorstore.as_retriever(
search_type="similarity", # or "mmr" for diversity
search_kwargs={"k": 3},
)
# Step 5: Create the RAG chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)
prompt_template = PromptTemplate(
template="""Use the following context to answer the question.
If you cannot answer based on the context, say so.
Context:
{context}
Question: {question}
Answer:""",
input_variables=["context", "question"],
)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff", # "stuff" = put all docs in prompt
retriever=retriever,
return_source_documents=True,
chain_type_kwargs={"prompt": prompt_template},
)
# Step 6: Query
result = qa_chain.invoke({"query": "What is RAG and how does it work?"})
print(f"Answer: {result['result']}")
print(f"\nSource documents:")
for doc in result['source_documents']:
print(f" - {doc.metadata['source']}: {doc.page_content[:100]}...")
6. Advanced RAG Techniques (2025-2026)
Basic RAG works well for simple questions but struggles with complex queries, ambiguous questions, and multi-hop reasoning. Advanced techniques address these limitations.
Multi-Query RAG
Generate multiple variations of the user's query using an LLM, retrieve for each variation, then merge and deduplicate the results. This increases recall by capturing different aspects of the question.
def multi_query_retrieval(query: str, retriever, llm) -> list:
"""Generate multiple query variations and retrieve for each."""
# Generate query variations
prompt = f"""Generate 3 different versions of the following question
to help retrieve relevant documents. Each version should capture
a different aspect or phrasing of the question.
Original question: {query}
Provide the 3 alternative questions, one per line:"""
variations = llm.invoke(prompt).content.strip().split('\n')
variations = [v.strip() for v in variations if v.strip()]
all_queries = [query] + variations
# Retrieve for each query
all_docs = []
seen_contents = set()
for q in all_queries:
docs = retriever.invoke(q)  # get_relevant_documents() is deprecated in recent LangChain
for doc in docs:
if doc.page_content not in seen_contents:
all_docs.append(doc)
seen_contents.add(doc.page_content)
return all_docs
RAG Fusion
Similar to multi-query but uses Reciprocal Rank Fusion (RRF) to combine rankings from different queries into a single, better ranking.
def reciprocal_rank_fusion(rankings: list[list], k: int = 60) -> list:
"""
Combine multiple rankings using Reciprocal Rank Fusion (RRF).
RRF score(doc) = sum over rankings of 1 / (k + rank), with rank starting at 1
"""
scores = {}
for ranking in rankings:
for rank, doc_id in enumerate(ranking):
if doc_id not in scores:
scores[doc_id] = 0.0
scores[doc_id] += 1.0 / (k + rank + 1)
# Sort by RRF score
return sorted(scores.items(), key=lambda x: x[1], reverse=True)
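A quick, self-contained check of the formula (the function is restated compactly so this snippet runs on its own; document IDs are illustrative):

```python
from collections import defaultdict

def rrf(rankings, k=60):
    # Same formula as reciprocal_rank_fusion above, restated so this
    # snippet is self-contained
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Rankings from two query variations
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_c", "doc_a", "doc_b"],
]
fused = rrf(rankings)
print([doc for doc, _ in fused])  # ['doc_a', 'doc_c', 'doc_b']
```

Note how `doc_a` (ranked 1st and 2nd) edges out `doc_c` (ranked 3rd and 1st): RRF rewards documents that rank consistently well across queries.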
Self-RAG
Self-RAG (Asai et al., 2023) trains the LLM to decide when retrieval is needed, assess the relevance of retrieved passages, and critique its own responses for factual grounding.
# Self-RAG Decision Flow:
#
# 1. Given a query, the model generates a "retrieval token":
# [Retrieve] = Yes/No
#
# 2. If retrieval is needed:
# - Retrieve relevant passages
# - For each passage, generate a "relevance token":
# [IsRel] = Relevant/Irrelevant
#
# 3. Generate response using relevant passages
#
# 4. Generate a "support token" for each claim:
# [IsSup] = Fully Supported/Partially Supported/Not Supported
#
# 5. Generate a "utility token" for overall quality:
# [IsUse] = 5/4/3/2/1
#
# This creates a self-correcting RAG system that knows
# when it needs external information and can verify its claims.
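The decision flow above can be sketched as plain control flow. This is a hedged illustration, not the trained Self-RAG model: the callables stand in for the reflection tokens a fine-tuned model would emit, and all names here are hypothetical.

```python
from typing import Callable, Dict, List

def self_rag_answer(
    query: str,
    needs_retrieval: Callable[[str], bool],          # stands in for the [Retrieve] token
    retrieve_passages: Callable[[str], List[str]],   # any retriever
    is_relevant: Callable[[str, str], bool],         # stands in for the [IsRel] token
    generate: Callable[[str, List[str]], str],       # any generator
    is_supported: Callable[[str, List[str]], bool],  # stands in for the [IsSup] token
) -> Dict:
    """Self-RAG-style control flow with pluggable judgment functions."""
    if not needs_retrieval(query):
        # Answer from parametric knowledge alone
        return {"answer": generate(query, []), "passages": [], "supported": None}
    # Keep only passages the relevance judge accepts
    passages = [p for p in retrieve_passages(query) if is_relevant(query, p)]
    answer = generate(query, passages)
    # Verify the answer is grounded in the retained passages
    return {"answer": answer, "passages": passages,
            "supported": is_supported(answer, passages)}

# Toy stubs; a real Self-RAG model makes these judgments itself
result = self_rag_answer(
    "When was Python created?",
    needs_retrieval=lambda q: True,
    retrieve_passages=lambda q: ["Python was created in 1991.", "Cats are mammals."],
    is_relevant=lambda q, p: "Python" in p,
    generate=lambda q, ps: ps[0] if ps else "I don't know.",
    is_supported=lambda a, ps: any(a in p for p in ps),
)
print(result["answer"])  # Python was created in 1991.
```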
GraphRAG (Microsoft)
GraphRAG (2024) builds a knowledge graph from the corpus before retrieval. Instead of retrieving raw text chunks, it uses the graph structure to find and traverse related concepts.
# GraphRAG Architecture:
#
# Indexing Phase:
# 1. Extract entities and relationships from documents using LLM
# 2. Build a knowledge graph (nodes = entities, edges = relationships)
# 3. Detect communities in the graph using Leiden algorithm
# 4. Generate summaries for each community
#
# Query Phase:
# 1. Map query to relevant entities and communities
# 2. Retrieve community summaries (global search)
# OR traverse local neighborhood in graph (local search)
# 3. Use retrieved information as context for generation
#
# Advantages over standard RAG:
# - Better at answering questions about themes and connections
# - Can synthesize information across many documents
# - Handles "What are the main topics in this corpus?" queries
# - Provides more comprehensive answers to broad questions
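A toy sketch of the indexing and local-search ideas: hand-written triples stand in for LLM extraction, and connected components stand in for Leiden community detection (all names here are illustrative).

```python
from collections import defaultdict

# Hand-written (entity, relation, entity) triples; real GraphRAG extracts
# these from the corpus with an LLM
triples = [
    ("Transformer", "introduced_in", "Attention Is All You Need"),
    ("BERT", "based_on", "Transformer"),
    ("GPT", "based_on", "Transformer"),
    ("PostgreSQL", "supports", "JSONB"),
]

# Nodes = entities, edges = relationships (undirected for traversal)
adjacency = defaultdict(set)
for head, _, tail in triples:
    adjacency[head].add(tail)
    adjacency[tail].add(head)
graph = dict(adjacency)

def communities_of(graph):
    """Connected components as a stand-in for Leiden community detection."""
    seen, communities = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, community = [node], set()
        while stack:
            n = stack.pop()
            if n in community:
                continue
            community.add(n)
            stack.extend(graph[n] - community)
        seen |= community
        communities.append(community)
    return communities

def local_search(graph, entity):
    """Local search: traverse an entity's immediate neighborhood."""
    return sorted(graph[entity])

communities = communities_of(graph)
print(len(communities))                    # 2: transformer-family vs. database
print(local_search(graph, "Transformer"))
```

In real GraphRAG each community would then get an LLM-written summary, which is what global search retrieves for broad, corpus-level questions.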
Contextual Retrieval (Anthropic)
Anthropic's contextual retrieval (2024) prepends context about the whole document to each chunk before embedding. This helps chunks that are ambiguous on their own (e.g., "The company reported Q3 revenue of $5B" -- which company?).
def contextual_chunking(document: str, chunks: list[str], llm) -> list[str]:
"""
Add contextual information to each chunk.
Anthropic's approach: use an LLM to generate context for each chunk
based on the full document.
"""
contextualized_chunks = []
for chunk in chunks:
prompt = f"""Here is the full document:
{document[:3000]} # Truncate for context window
Here is a specific chunk from the document:
{chunk}
Please provide a brief (2-3 sentences) context that situates this chunk
within the overall document. Focus on who/what/when/where information
that would help someone understand this chunk in isolation.
Context:"""
context = llm.invoke(prompt).content.strip()
contextualized_chunk = f"{context}\n\n{chunk}"
contextualized_chunks.append(contextualized_chunk)
return contextualized_chunks
# Example:
# Original chunk: "Revenue increased by 15% year-over-year."
# Contextualized: "This passage is from Apple Inc.'s Q3 2025 earnings report,
# discussing financial performance. Revenue increased by 15% year-over-year."
Late Chunking
Late chunking (2024) processes the full document through the embedding model first, then chunks the resulting token embeddings. This preserves cross-chunk context in the embeddings, unlike traditional chunking where each chunk is embedded independently.
# Traditional chunking:
# Document -> Split into chunks -> Embed each chunk independently
# Problem: Each chunk's embedding only captures its local content
# Late chunking:
# Document -> Embed full document (get per-token embeddings)
# -> Split TOKEN EMBEDDINGS into chunks
# -> Pool each chunk's token embeddings into a single vector
# Advantage: Each chunk's embedding benefits from full document context
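A minimal sketch of the pooling step. A random array stands in for the per-token embeddings (with sentence-transformers these could come from `model.encode(doc, output_value="token_embeddings")`); the boundaries are hypothetical.

```python
import numpy as np

# Random stand-in for the per-token embeddings of the FULL document,
# shape (num_tokens, dim)
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(120, 384))

def late_chunk(token_embeddings: np.ndarray, boundaries) -> np.ndarray:
    """Mean-pool token-embedding slices into one vector per chunk."""
    return np.stack([token_embeddings[s:e].mean(axis=0) for s, e in boundaries])

# Token-position chunk boundaries (would come from the chunker's offsets).
# Because every token was embedded with full-document attention first,
# each pooled chunk vector carries cross-chunk context.
boundaries = [(0, 50), (40, 90), (80, 120)]
chunk_vectors = late_chunk(token_embeddings, boundaries)
print(chunk_vectors.shape)  # (3, 384)
```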
Summary and Key Takeaways
Week 7 Key Takeaways
- RAG solves critical LLM limitations: hallucinations, outdated knowledge, and inability to access private data. It is the most practical pattern for production LLM applications.
- Chunking strategy matters enormously: Recursive character splitting with structure awareness is a strong default. Semantic chunking provides the best quality but requires more computation.
- Choose embedding models carefully: OpenAI text-embedding-3 for proprietary, BGE/E5 for open-source. Match chunk sizes to what the embedding model was trained on.
- HNSW is the standard indexing algorithm: It provides logarithmic search time with high recall. Chroma and FAISS are excellent starting points.
- Hybrid search outperforms either approach alone: Combine BM25 keyword search with semantic vector search for robust retrieval.
- Reranking is essential: Cross-encoder reranking after initial retrieval significantly improves precision at minimal latency cost.
- Advanced techniques keep evolving: GraphRAG for corpus-level questions, contextual retrieval for better chunk embeddings, Self-RAG for self-correcting systems.
Next Steps
In Week 8: Hands-on RAG Implementation, we will build a complete production RAG chatbot with advanced retrieval strategies, reranking, input/output guardrails, AI safety measures, and evaluation using the RAGAS framework.