Retrieval Augmented Generation (RAG)
Build production-grade RAG systems that ground LLM responses in your own data. Master document processing, vector embeddings, vector databases, and advanced retrieval techniques to create AI systems that are accurate, up-to-date, and verifiable.
Learning Objectives
Understand RAG Architecture
Learn why RAG exists, when to use it, and how it compares to fine-tuning and prompt engineering.
Document Processing
Master chunking strategies that maximize retrieval quality for different document types.
Vector Embeddings
Understand how text is converted to vectors, and compare embedding models for quality and efficiency.
Vector Databases
Deploy and query vector databases with HNSW indexing, and understand the tradeoffs between solutions.
Build RAG Pipelines
Construct complete RAG systems both from scratch and with frameworks like LangChain.
Advanced RAG (2025-2026)
Learn cutting-edge techniques: GraphRAG, contextual retrieval, self-RAG, and RAG Fusion.
1. Why RAG?
Retrieval Augmented Generation (RAG) is a technique that enhances LLM responses by retrieving relevant information from external knowledge sources and including it in the prompt. It was introduced by Lewis et al. (2020) at Facebook AI Research and has become the most widely used pattern for building production LLM applications.
LLM Limitations RAG Solves
1. Hallucinations
LLMs can generate plausible-sounding but factually incorrect information. They have no mechanism to verify their outputs against reality. RAG provides grounding -- the model generates responses based on retrieved documents rather than relying solely on its parametric knowledge.
2. Outdated Knowledge
LLMs have a knowledge cutoff date. A model trained on data through December 2024 knows nothing about events in 2025 or 2026. RAG allows the system to access current information by retrieving from an up-to-date knowledge base.
3. No Access to Private Data
LLMs cannot access your company's internal documents, databases, or proprietary information. RAG bridges this gap by retrieving from your private data sources.
4. No Citations
A vanilla LLM cannot tell you where its information comes from. RAG enables citations by tracking which documents were used to generate each response.
RAG Architecture Overview
# RAG System Architecture (Text Diagram)
#
# ┌─────────────────────────────────────────────────────┐
# │ INDEXING PIPELINE │
# │ (runs once, or periodically for updates) │
# │ │
# │ Documents ──► Chunking ──► Embedding ──► Vector DB │
# │ (PDF, HTML, (split (convert to (store & │
# │ DOCX, MD) into vectors) index) │
# │ chunks) │
# └─────────────────────────────────────────────────────┘
#
# ┌─────────────────────────────────────────────────────┐
# │ QUERY PIPELINE │
# │ (runs on every user query) │
# │ │
# │ User Query │
# │ │ │
# │ ├──► Embed Query ──► Vector Search ──► Top K │
# │ │ │ chunks │
# │ │ │ │
# │ │ (optional: rerank top K chunks) │
# │ │ │ │
# │ ▼ ▼ │
# │ ┌─────────────────────────────────────┐ │
# │ │ Prompt = System Instructions │ │
# │ │ + Retrieved Context │ │
# │ │ + User Question │ │
# │ └──────────────┬──────────────────────┘ │
# │ │ │
# │ ▼ │
# │ LLM generates response │
# │ with citations │
# └─────────────────────────────────────────────────────┘
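The query pipeline above can be sketched end to end. This is a minimal illustration with a toy word-overlap "retriever" standing in for the embed-and-search steps; the function names and prompt wording are hypothetical, not part of any framework.

```python
def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query (a stand-in for vector search)."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Assemble system instructions + retrieved context + user question."""
    context = "\n\n".join(f"[{i+1}] {c}" for i, c in enumerate(context_chunks))
    return (
        "Answer using ONLY the context below. Cite sources as [n].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

docs = [
    "The refund window is 30 days from the date of purchase.",
    "Shipping to EU countries takes 5-7 business days.",
]
question = "How long is the refund window?"
prompt = build_prompt(question, retrieve(question, docs))
print(prompt)
```

The prompt string is what gets sent to the LLM; the `[n]` markers are what make citations possible in the generated answer.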
RAG vs Fine-tuning
| Aspect | RAG | Fine-tuning |
|---|---|---|
| Knowledge updates | Easy (update the index) | Hard (retrain the model) |
| Citations | Built-in (source documents) | Not naturally available |
| Cost | Retrieval + embedding costs | Training compute costs |
| Latency | Higher (retrieval step) | Lower |
| Best for | Knowledge-intensive Q&A, changing data | Custom behavior, style, format |
In practice, the two approaches are complementary: fine-tune for custom behavior, style, and output format, and use RAG for knowledge. Combined systems often give the best results.
2. Document Processing and Chunking
Before documents can be stored in a vector database, they must be split into smaller pieces called "chunks." Chunking strategy has an enormous impact on RAG quality -- bad chunking leads to irrelevant retrieval, which leads to bad answers.
Document Loaders
Different document formats require different extraction approaches:
| Format | Library | Considerations |
|---|---|---|
| PDF | PyMuPDF, pdfplumber, Unstructured | Tables, images, multi-column layouts are challenging |
| HTML | BeautifulSoup, Trafilatura | Remove navigation, ads, boilerplate |
| Markdown | Built-in, markdown-it | Preserve heading structure for metadata |
| DOCX | python-docx, Unstructured | Handle styles, headers, tables |
| CSV/Excel | pandas | Convert rows to text, or treat as structured data |
| Code | tree-sitter, custom parsers | Parse by function/class, preserve context |
Chunking Strategies
1. Fixed-Size Chunking
The simplest approach: split text into chunks of a fixed number of characters or tokens, with optional overlap.
def fixed_size_chunking(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Reached the end; avoid emitting a redundant trailing chunk
        start = end - overlap  # Overlap with previous chunk
    return chunks
# Pros: Simple, predictable chunk sizes
# Cons: Can split mid-sentence, mid-paragraph, or mid-concept
# Use when: Quick prototyping, uniform document types
2. Sentence-Based Chunking
Split at sentence boundaries, grouping sentences until reaching the target size.
import re
def sentence_chunking(text: str, max_chunk_size: int = 500) -> list[str]:
"""Split text into chunks at sentence boundaries."""
# Split into sentences (simple regex -- use spaCy/NLTK for production)
sentences = re.split(r'(?<=[.!?])\s+', text)
chunks = []
current_chunk = []
current_size = 0
for sentence in sentences:
sentence_len = len(sentence)
if current_size + sentence_len > max_chunk_size and current_chunk:
chunks.append(' '.join(current_chunk))
current_chunk = []
current_size = 0
current_chunk.append(sentence)
current_size += sentence_len + 1 # +1 for space
if current_chunk:
chunks.append(' '.join(current_chunk))
return chunks
# Pros: Never splits mid-sentence
# Cons: Chunk sizes vary; may still split topics
# Use when: General-purpose text documents
3. Recursive Character Text Splitting
The most popular approach (default in LangChain). Tries to split by paragraphs first, then sentences, then words, recursively ensuring chunks do not exceed the target size.
class RecursiveTextSplitter:
"""
Recursively split text using a hierarchy of separators.
Tries to keep semantically related text together.
"""
def __init__(
self,
chunk_size: int = 1000,
chunk_overlap: int = 200,
separators: list[str] = None,
):
self.chunk_size = chunk_size
self.chunk_overlap = chunk_overlap
self.separators = separators or [
"\n\n", # Paragraph breaks (highest priority)
"\n", # Line breaks
". ", # Sentence breaks
", ", # Clause breaks
" ", # Word breaks
"", # Character breaks (last resort)
]
def split_text(self, text: str) -> list[str]:
"""Split text recursively."""
final_chunks = []
self._split_recursive(text, self.separators, final_chunks)
return final_chunks
def _split_recursive(self, text: str, separators: list[str], final_chunks: list):
if len(text) <= self.chunk_size:
if text.strip():
final_chunks.append(text.strip())
return
# Find the best separator (first one that exists in the text)
separator = separators[-1]
remaining_separators = separators
for i, sep in enumerate(separators):
if sep in text:
separator = sep
remaining_separators = separators[i + 1:]
break
# Split by the chosen separator
splits = text.split(separator) if separator else list(text)
# Merge splits into chunks of appropriate size
current_chunk = []
current_length = 0
for split in splits:
piece = split + separator if separator else split
piece_len = len(piece)
if current_length + piece_len > self.chunk_size and current_chunk:
# Current chunk is full; finalize it
merged = (separator if separator else "").join(current_chunk)
if len(merged) <= self.chunk_size:
final_chunks.append(merged.strip())
else:
# Still too big -- recurse with finer separators
self._split_recursive(merged, remaining_separators, final_chunks)
# Start new chunk with overlap
overlap_chunks = []
overlap_len = 0
for c in reversed(current_chunk):
if overlap_len + len(c) > self.chunk_overlap:
break
overlap_chunks.insert(0, c)
overlap_len += len(c)
current_chunk = overlap_chunks
current_length = overlap_len
current_chunk.append(split)
current_length += piece_len
if current_chunk:
merged = (separator if separator else "").join(current_chunk)
if merged.strip():
final_chunks.append(merged.strip())
# Usage -- long_document stands for any large string you have loaded
splitter = RecursiveTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(long_document)
print(f"Created {len(chunks)} chunks")
for i, chunk in enumerate(chunks[:3]):
print(f"Chunk {i}: {len(chunk)} chars -- {chunk[:80]}...")
4. Semantic Chunking
Use embeddings to find natural topic boundaries. Split where the semantic similarity between consecutive sentences drops below a threshold.
import re

import numpy as np
from sentence_transformers import SentenceTransformer
def semantic_chunking(
text: str,
model_name: str = "all-MiniLM-L6-v2",
threshold: float = 0.5,
min_chunk_size: int = 100,
) -> list[str]:
"""
Split text at semantic boundaries using embedding similarity.
"""
model = SentenceTransformer(model_name)
    # Split into sentences (simple regex -- use spaCy/NLTK for production)
    sentences = re.split(r'(?<=[.!?])\s+', text)
if len(sentences) <= 1:
return [text]
# Embed all sentences
embeddings = model.encode(sentences)
# Compute cosine similarity between consecutive sentences
similarities = []
for i in range(len(embeddings) - 1):
sim = np.dot(embeddings[i], embeddings[i + 1]) / (
np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i + 1])
)
similarities.append(sim)
# Find split points where similarity drops below threshold
chunks = []
current_chunk = [sentences[0]]
for i, sim in enumerate(similarities):
if sim < threshold and len(' '.join(current_chunk)) >= min_chunk_size:
chunks.append(' '.join(current_chunk))
current_chunk = [sentences[i + 1]]
else:
current_chunk.append(sentences[i + 1])
if current_chunk:
chunks.append(' '.join(current_chunk))
return chunks
# Pros: Respects topic boundaries, produces semantically coherent chunks
# Cons: Requires embedding computation, variable chunk sizes
# Use when: Documents with clear topic shifts, quality is critical
5. Document-Structure-Aware Chunking
Use the document's own structure (headings, sections, lists) to guide chunking.
import re
from typing import List, Dict
def markdown_structure_chunking(
markdown_text: str,
max_chunk_size: int = 1000,
) -> List[Dict]:
"""
Chunk a Markdown document by its heading structure.
Each chunk includes metadata about its position in the hierarchy.
"""
# Split by headings
heading_pattern = r'^(#{1,6})\s+(.+)$'
lines = markdown_text.split('\n')
sections = []
current_section = {
'heading': 'Introduction',
'level': 0,
'content': [],
'path': [], # Breadcrumb path of parent headings
}
heading_stack = [] # Stack of (level, heading) for building paths
for line in lines:
match = re.match(heading_pattern, line)
if match:
# Save current section
if current_section['content']:
current_section['text'] = '\n'.join(current_section['content']).strip()
if current_section['text']:
sections.append(current_section.copy())
# Parse heading
level = len(match.group(1))
heading = match.group(2)
# Update heading stack
while heading_stack and heading_stack[-1][0] >= level:
heading_stack.pop()
heading_stack.append((level, heading))
# Start new section
current_section = {
'heading': heading,
'level': level,
'content': [],
'path': [h[1] for h in heading_stack],
}
else:
current_section['content'].append(line)
# Save last section
if current_section['content']:
current_section['text'] = '\n'.join(current_section['content']).strip()
if current_section['text']:
sections.append(current_section)
# Split large sections further
final_chunks = []
splitter = RecursiveTextSplitter(chunk_size=max_chunk_size, chunk_overlap=100)
for section in sections:
if len(section['text']) <= max_chunk_size:
final_chunks.append({
'text': section['text'],
'metadata': {
'heading': section['heading'],
'path': ' > '.join(section['path']),
'level': section['level'],
},
})
else:
# Split large section into smaller chunks
sub_chunks = splitter.split_text(section['text'])
for i, chunk in enumerate(sub_chunks):
final_chunks.append({
'text': chunk,
'metadata': {
'heading': section['heading'],
'path': ' > '.join(section['path']),
'level': section['level'],
'chunk_index': i,
},
})
return final_chunks
Chunk Size and Overlap Considerations
Guidelines for Chunk Size
- Smaller chunks (100-300 tokens): More precise retrieval, but may lack context. Good for factoid Q&A.
- Medium chunks (300-800 tokens): Balance between precision and context. Best general-purpose choice.
- Larger chunks (800-2000 tokens): More context per chunk, but less precise retrieval. Good for summarization tasks.
- Overlap (10-20% of chunk size): Prevents information loss at chunk boundaries. Critical for maintaining coherence.
- Match chunk size to embedding model's training data: If the embedding model was trained on passages of ~256 tokens, chunks around that size will be encoded most effectively.
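The overlap guideline has a concrete cost: each chunk after the first advances by only chunk_size minus overlap characters, so overlap increases the total chunk count. A small sketch of the arithmetic (the helper name is mine, not from any library):

```python
import math

def estimated_chunk_count(text_len: int, chunk_size: int, overlap: int) -> int:
    """After the first chunk, each new chunk advances by (chunk_size - overlap),
    so the remaining text is covered in strides of that length."""
    if text_len <= chunk_size:
        return 1
    stride = chunk_size - overlap
    return 1 + math.ceil((text_len - chunk_size) / stride)

# A 10,000-char document with 500-char chunks:
# no overlap -> 20 chunks; 10% (50-char) overlap -> stride 450 -> 23 chunks
print(estimated_chunk_count(10_000, 500, 0))   # 20
print(estimated_chunk_count(10_000, 500, 50))  # 23
```

So a 10% overlap costs roughly 15% more chunks to embed and store here, which is usually a worthwhile trade for boundary coherence.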
3. Vector Embeddings Deep Dive
Embeddings are dense vector representations of text that capture semantic meaning. Two texts with similar meanings will have similar embedding vectors, even if they use different words. This is the foundation of semantic search in RAG systems.
Text Embedding Models
| Model | Provider | Dimensions | Max Tokens | Notes |
|---|---|---|---|---|
| text-embedding-3-large | OpenAI | 3072 (or less via Matryoshka) | 8191 | Best proprietary model; supports dimension reduction |
| text-embedding-3-small | OpenAI | 1536 | 8191 | Cost-effective, strong quality |
| voyage-3 | Voyage AI | 1024 | 32000 | Strong for code and retrieval tasks |
| embed-v4 | Cohere | 1024 | 512 | Excellent retrieval quality; separate query/document input types |
| BGE-large-en-v1.5 | BAAI (open) | 1024 | 512 | Top open-source embedding model |
| E5-mistral-7b-instruct | Microsoft (open) | 4096 | 32768 | LLM-based embeddings, excellent quality |
| GTE-Qwen2-7B-instruct | Alibaba (open) | 3584 | 131072 | Long-context embeddings |
| all-MiniLM-L6-v2 | SBERT (open) | 384 | 256 | Tiny, fast, good for prototyping |
Matryoshka Embeddings
Matryoshka embeddings (named after Russian nesting dolls) are trained so that the first N dimensions of a larger embedding are themselves a valid, useful embedding of lower dimension. This allows you to choose between quality and efficiency at deployment time without retraining.
# Matryoshka embeddings: variable-dimension embeddings
# OpenAI text-embedding-3-large supports this natively
from openai import OpenAI
client = OpenAI()
text = "Machine learning is a subset of artificial intelligence."
# Full 3072 dimensions (highest quality)
response_full = client.embeddings.create(
model="text-embedding-3-large",
input=text,
dimensions=3072,
)
# Reduced to 1024 dimensions (good quality, 3x smaller)
response_1024 = client.embeddings.create(
model="text-embedding-3-large",
input=text,
dimensions=1024,
)
# Reduced to 256 dimensions (acceptable quality, 12x smaller)
response_256 = client.embeddings.create(
model="text-embedding-3-large",
input=text,
dimensions=256,
)
# The 256-dim embedding is a valid embedding that captures
# the most important semantic features. You can use it for
# applications where storage/speed matter more than precision.
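If you already store full-size embeddings, the same reduction can be approximated client-side: keep the first N dimensions and re-normalize to unit length, which mirrors what the API's `dimensions` parameter is documented to do server-side. A sketch with a random stand-in vector (real Matryoshka quality depends on the model having been trained for truncation):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize to unit length."""
    truncated = vec[:dims]
    return truncated / np.linalg.norm(truncated)

# Stand-in for a stored full-size embedding (already unit-normalized)
full = np.random.default_rng(0).normal(size=3072)
full = full / np.linalg.norm(full)

small = truncate_embedding(full, 256)
print(small.shape)  # (256,) -- still a unit vector, usable for cosine search
```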
Late Interaction (ColBERT)
Unlike standard embeddings (single vector per document), ColBERT produces one vector per token. At retrieval time, it computes token-level similarity using MaxSim, providing much more fine-grained matching.
# Standard Embedding:
# Document -> [single 768-dim vector]
# Query -> [single 768-dim vector]
# Similarity = cosine(doc_vec, query_vec)
# ColBERT Late Interaction:
# Document "machine learning is powerful" -> [vec_machine, vec_learning, vec_is, vec_powerful]
# Query "deep learning" -> [vec_deep, vec_learning]
#
# For each query token, find max similarity across all doc tokens:
# score = sum over query tokens of max(cosine(q_token, d_token) for d_token in doc)
#
# This captures partial matches much better than single-vector comparison
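The MaxSim scoring described above is a few lines of NumPy. This sketch uses random stand-in token vectors rather than real ColBERT output, but the scoring rule is the one the diagram describes:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token vector, take the
    maximum cosine similarity over all document token vectors, then sum."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T  # shape: (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
query = rng.normal(size=(2, 8))                        # 2 query token vectors
doc = np.vstack([rng.normal(size=(3, 8)), query[1]])   # doc shares one query token

# The shared token matches itself with cosine 1.0, so it contributes exactly
# 1.0 to the score; each query token can contribute at most 1.0.
score = maxsim_score(query, doc)
print(round(score, 3))
```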
Practical: Generate and Compare Embeddings
"""
Embedding Model Comparison
=============================
Generate embeddings with multiple models and compare their
quality on a retrieval task.
"""
import numpy as np
from sentence_transformers import SentenceTransformer
from typing import List, Tuple
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
"""Compute cosine similarity between two vectors."""
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
def evaluate_retrieval(
model: SentenceTransformer,
queries: List[str],
documents: List[str],
ground_truth: List[int], # Index of correct document for each query
) -> dict:
"""Evaluate retrieval quality of an embedding model."""
# Embed all documents
doc_embeddings = model.encode(documents, normalize_embeddings=True)
correct = 0
mrr_total = 0.0
for i, query in enumerate(queries):
# Embed query
query_embedding = model.encode([query], normalize_embeddings=True)[0]
# Compute similarities
similarities = [
cosine_similarity(query_embedding, doc_emb)
for doc_emb in doc_embeddings
]
# Rank documents
ranked_indices = np.argsort(similarities)[::-1]
# Check if top-1 is correct
if ranked_indices[0] == ground_truth[i]:
correct += 1
# Compute reciprocal rank
correct_rank = np.where(ranked_indices == ground_truth[i])[0][0] + 1
mrr_total += 1.0 / correct_rank
accuracy = correct / len(queries)
mrr = mrr_total / len(queries)
return {"accuracy_at_1": accuracy, "mrr": mrr}
# Test data
documents = [
"Python is a high-level programming language known for its simplicity and readability.",
"The Eiffel Tower is a wrought-iron lattice tower in Paris, France, built in 1889.",
"Photosynthesis is the process by which plants convert sunlight into chemical energy.",
"The stock market experienced significant volatility during the 2008 financial crisis.",
"Quantum computing uses qubits that can exist in superposition, enabling parallel computation.",
"The human genome contains approximately 3 billion base pairs of DNA.",
"Neural networks are computing systems inspired by biological neural networks in the brain.",
"Climate change is driven by greenhouse gas emissions from human activities.",
]
queries = [
"What programming language is easy to learn?", # -> 0 (Python)
"Tell me about a famous landmark in France.", # -> 1 (Eiffel Tower)
"How do plants make food from sunlight?", # -> 2 (Photosynthesis)
"What happened to financial markets in 2008?", # -> 3 (stock market)
"How does quantum computing work?", # -> 4 (quantum)
"What is DNA made of?", # -> 5 (genome)
"How are artificial neural networks structured?", # -> 6 (neural networks)
"What causes global warming?", # -> 7 (climate change)
]
ground_truth = [0, 1, 2, 3, 4, 5, 6, 7]
# Compare models
models_to_test = [
("all-MiniLM-L6-v2", 384),
("all-mpnet-base-v2", 768),
("BAAI/bge-small-en-v1.5", 384),
("BAAI/bge-base-en-v1.5", 768),
]
print("=" * 60)
print("EMBEDDING MODEL COMPARISON")
print("=" * 60)
for model_name, dim in models_to_test:
print(f"\nLoading {model_name} (dim={dim})...")
model = SentenceTransformer(model_name)
results = evaluate_retrieval(model, queries, documents, ground_truth)
print(f" Accuracy@1: {results['accuracy_at_1']:.1%}")
print(f" MRR: {results['mrr']:.3f}")
# Show similarity matrix for first query
query_emb = model.encode([queries[0]], normalize_embeddings=True)[0]
doc_embs = model.encode(documents, normalize_embeddings=True)
sims = [cosine_similarity(query_emb, d) for d in doc_embs]
print(f" Query: '{queries[0]}'")
top3_idx = np.argsort(sims)[::-1][:3]
for rank, idx in enumerate(top3_idx):
print(f" #{rank+1} (sim={sims[idx]:.3f}): {documents[idx][:60]}...")
4. Vector Databases
Vector databases are purpose-built systems for storing, indexing, and querying high-dimensional vectors. They are the backbone of any RAG system, enabling fast similarity search over millions or billions of embeddings.
Comparison of Vector Databases
| Database | Type | Hosting | Best For |
|---|---|---|---|
| Pinecone | Managed cloud | Fully managed | Production; zero ops, auto-scaling |
| Weaviate | Open-source | Self-hosted or cloud | Hybrid search (vector + keyword), GraphQL API |
| Qdrant | Open-source | Self-hosted or cloud | Rich filtering, Rust performance |
| Milvus | Open-source | Self-hosted or Zilliz cloud | Large-scale (billions of vectors) |
| Chroma | Open-source | Embedded or server | Prototyping, small-medium datasets |
| pgvector | PostgreSQL extension | Any PostgreSQL host | Integration with existing Postgres infrastructure |
| FAISS | Library | In-process | Research, benchmarking, maximum control |
Indexing Algorithms
Brute Force (Flat Index)
Compute distance from query to every vector in the database. Guarantees finding the exact nearest neighbor but is O(n) in the number of vectors.
# Brute force: compare query against all N vectors
# Time: O(N * d) where d is dimension
# Memory: O(N * d)
# Quality: Perfect (100% recall)
# Practical for: < 100K vectors
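Brute force is a one-liner with NumPy: score all N vectors, sort. A small sketch on random data (the function name is mine):

```python
import numpy as np

def brute_force_search(query: np.ndarray, vectors: np.ndarray, k: int = 3):
    """Exact nearest neighbors by cosine similarity: score all N vectors, sort."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                        # O(N * d): one dot product per vector
    top = np.argsort(scores)[::-1][:k]    # indices of the k best matches
    return top, scores[top]

rng = np.random.default_rng(42)
db = rng.normal(size=(1000, 64)).astype("float32")
query = db[7] + 0.01 * rng.normal(size=64)  # a near-duplicate of vector 7
top, scores = brute_force_search(query, db, k=3)
print(top[0])  # vector 7 should rank first
```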
IVF (Inverted File Index)
Partition vectors into clusters using k-means. At query time, only search the nearest clusters.
# IVF with nlist clusters and nprobe searched:
# Training: Run k-means to create nlist centroids
# Indexing: Assign each vector to its nearest centroid
# Query:
# 1. Find nprobe nearest centroids to query
# 2. Search only vectors in those clusters
# Time: O(nprobe * N/nlist * d) -- much faster when nprobe << nlist
# Quality: Approximate (recall depends on nprobe/nlist ratio)
# Typical: nlist=sqrt(N), nprobe=nlist/10
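The cluster-then-probe idea can be sketched with a tiny hand-rolled k-means on toy 2-D data (FAISS does all of this internally, with much better clustering):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2)).astype("float32")
nlist, nprobe = 4, 2

# Training: a few k-means iterations to produce nlist centroids
centroids = data[rng.choice(len(data), nlist, replace=False)].copy()
for _ in range(10):
    assign = np.argmin(((data[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    for c in range(nlist):
        members = data[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)

# Query: probe the nprobe nearest clusters, search only their members
query = np.array([0.5, -0.2], dtype="float32")
probed = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
candidates = np.where(np.isin(assign, probed))[0]
best = candidates[np.argmin(((data[candidates] - query) ** 2).sum(-1))]
print(f"searched {len(candidates)}/{len(data)} vectors, best index {best}")
```

Only the probed clusters are scanned, which is where the speedup (and the recall loss, when the true neighbor sits in an unprobed cluster) comes from.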
HNSW (Hierarchical Navigable Small World)
The most popular algorithm for approximate nearest neighbor search. Builds a multi-layer graph where each node is connected to its nearest neighbors.
# HNSW: Multi-layer navigable graph
#
# Layer 3: [A] -------- [B] (few nodes, long-range connections)
# | |
# Layer 2: [A] -- [C] -- [B] -- [D] (more nodes)
# | | | |
# Layer 1: [A]-[E]-[C]-[F]-[B]-[D]-[G] (even more nodes)
# | | | | | | |
# Layer 0: [A][E][H][C][F][I][B][D][G][J] (all nodes, dense connections)
#
# Search algorithm:
# 1. Start at entry point in top layer
# 2. Greedily traverse to nearest node in current layer
# 3. When no closer node found, drop to next layer
# 4. Repeat until bottom layer
# 5. Return top-K nearest neighbors from bottom layer
#
# Key parameters:
# M: max connections per node (higher = better quality, more memory)
# ef_construction: beam width during index building
# ef_search: beam width during query (higher = better recall, slower)
#
# Time: O(log(N) * d) -- logarithmic scaling!
# Memory: O(N * M * d) -- higher than flat due to graph structure
# Quality: Very high recall (>95%) with proper parameters
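The greedy traversal in steps 1-3 can be demonstrated on a single toy layer: hop to whichever neighbor is closest to the query, stop at a local minimum. The graph and points here are hand-made for illustration; HNSW repeats this descent on each layer, using the result as the entry point for the next one down.

```python
import numpy as np

# A toy single-layer proximity graph over 2-D points
points = np.array([[0, 0], [1, 0], [2, 0], [0, 1],
                   [1, 1], [2, 1], [0, 2], [1, 2]], dtype=float)
graph = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5], 3: [0, 4, 6],
         4: [1, 3, 5, 7], 5: [2, 4], 6: [3, 7], 7: [4, 6]}

def greedy_search(query: np.ndarray, entry: int = 0) -> int:
    """Hop to the neighbor closest to the query; stop at a local minimum."""
    current = entry
    while True:
        best = min(graph[current], key=lambda n: np.linalg.norm(points[n] - query))
        if np.linalg.norm(points[best] - query) >= np.linalg.norm(points[current] - query):
            return current  # no neighbor is closer: local minimum reached
        current = best

print(greedy_search(np.array([1.9, 1.1])))  # → 5, the point at (2, 1)
```

Each hop only examines the current node's M neighbors, which is why search cost grows logarithmically rather than linearly with N.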
Distance Metrics
| Metric | Formula | Range | Best For |
|---|---|---|---|
| Cosine Similarity | dot(a,b) / (‖a‖ · ‖b‖) | [-1, 1] | Text embeddings (most common) |
| Euclidean (L2) | sqrt(sum((a_i - b_i)^2)) | [0, inf) | Spatial data, image features |
| Dot Product | sum(a_i * b_i) | (-inf, inf) | Normalized embeddings (equivalent to cosine) |
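The equivalence claimed in the last row is easy to verify: after normalizing both vectors to unit length, a plain dot product equals cosine similarity. This is why vector databases often store unit vectors and use the cheaper inner-product metric.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine similarity computed directly from the formula
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize first, then a plain dot product gives the same number
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_normalized = np.dot(a_unit, b_unit)

print(round(cosine, 4), round(dot_normalized, 4))  # both 0.96
```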
Practical: Vector Search with Chroma and FAISS
"""
Vector Database Practical: Chroma and FAISS
=============================================
Set up vector stores, insert embeddings, and perform similarity search.
pip install chromadb faiss-cpu sentence-transformers
"""
# ===========================
# Part 1: ChromaDB
# ===========================
import chromadb
from chromadb.utils import embedding_functions
# Initialize Chroma (persistent storage)
client = chromadb.PersistentClient(path="./chroma_db")
# Use the default embedding function (all-MiniLM-L6-v2)
# Or specify a custom one:
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="all-MiniLM-L6-v2"
)
# Create or get a collection
collection = client.get_or_create_collection(
name="knowledge_base",
embedding_function=embedding_fn,
metadata={"hnsw:space": "cosine"}, # Use cosine similarity
)
# Add documents
documents = [
"Python is a versatile programming language used in AI and web development.",
"JavaScript is the most popular language for web development.",
"Machine learning models learn patterns from data to make predictions.",
"Deep learning uses neural networks with multiple layers.",
"Natural language processing enables computers to understand human language.",
"Computer vision allows machines to interpret and understand visual information.",
"Reinforcement learning trains agents through rewards and penalties.",
"Transfer learning leverages pre-trained models for new tasks.",
]
# Add documents with IDs and metadata
collection.add(
documents=documents,
ids=[f"doc_{i}" for i in range(len(documents))],
metadatas=[{"source": "textbook", "chapter": i + 1} for i in range(len(documents))],
)
print(f"Collection has {collection.count()} documents")
# Query the collection
results = collection.query(
query_texts=["How do machines learn from data?"],
n_results=3,
include=["documents", "distances", "metadatas"],
)
print("\nChroma Search Results:")
for i, (doc, dist, meta) in enumerate(zip(
results["documents"][0],
results["distances"][0],
results["metadatas"][0],
)):
print(f" #{i+1} (distance={dist:.4f}): {doc}")
print(f" metadata: {meta}")
# Query with metadata filtering
results_filtered = collection.query(
query_texts=["programming languages"],
n_results=3,
where={"chapter": {"$lte": 3}}, # Only chapters 1-3
include=["documents", "distances"],
)
print("\nFiltered results (chapters 1-3 only):")
for doc, dist in zip(results_filtered["documents"][0], results_filtered["distances"][0]):
print(f" (dist={dist:.4f}): {doc}")
# ===========================
# Part 2: FAISS
# ===========================
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
# Load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")
dimension = 384 # Embedding dimension for this model
# Create embeddings for our documents
doc_embeddings = model.encode(documents, normalize_embeddings=True)
doc_embeddings = np.array(doc_embeddings).astype("float32")
# ---- Flat Index (brute force, exact) ----
index_flat = faiss.IndexFlatIP(dimension) # Inner product (= cosine for normalized vectors)
index_flat.add(doc_embeddings)
print(f"\nFAISS Flat Index: {index_flat.ntotal} vectors")
# Search
query = "How do machines learn from data?"
query_embedding = model.encode([query], normalize_embeddings=True).astype("float32")
distances, indices = index_flat.search(query_embedding, k=3)
print("FAISS Flat Search Results:")
for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
print(f" #{i+1} (score={dist:.4f}): {documents[idx]}")
# ---- HNSW Index (approximate, fast) ----
index_hnsw = faiss.IndexHNSWFlat(dimension, 32) # M=32 connections
index_hnsw.hnsw.efConstruction = 128 # Construction-time beam width
index_hnsw.hnsw.efSearch = 64 # Search-time beam width
index_hnsw.add(doc_embeddings)
print(f"\nFAISS HNSW Index: {index_hnsw.ntotal} vectors")
distances, indices = index_hnsw.search(query_embedding, k=3)
print("FAISS HNSW Search Results:")
for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
print(f" #{i+1} (score={dist:.4f}): {documents[idx]}")
# ---- IVF Index (approximate, memory efficient) ----
nlist = 4 # Number of clusters (use sqrt(N) for large datasets)
quantizer = faiss.IndexFlatIP(dimension)
index_ivf = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT)
index_ivf.train(doc_embeddings) # Train the clustering
index_ivf.add(doc_embeddings)
index_ivf.nprobe = 2 # Search 2 out of 4 clusters
print(f"\nFAISS IVF Index: {index_ivf.ntotal} vectors")
distances, indices = index_ivf.search(query_embedding, k=3)
print("FAISS IVF Search Results:")
for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
print(f" #{i+1} (score={dist:.4f}): {documents[idx]}")
# Save and load index
faiss.write_index(index_hnsw, "knowledge_base.index")
loaded_index = faiss.read_index("knowledge_base.index")
print(f"\nLoaded index with {loaded_index.ntotal} vectors")
Practical: pgvector with PostgreSQL
"""
pgvector: Vector Search in PostgreSQL
=========================================
Use vectors alongside traditional relational data.
Setup:
1. Install PostgreSQL with pgvector extension
2. pip install psycopg2-binary sentence-transformers
Docker quickstart:
docker run -d --name pgvector -e POSTGRES_PASSWORD=password \
-p 5432:5432 pgvector/pgvector:pg16
"""
import psycopg2
import numpy as np
from sentence_transformers import SentenceTransformer
def setup_pgvector():
"""Set up pgvector database and table."""
conn = psycopg2.connect(
host="localhost",
port=5432,
dbname="postgres",
user="postgres",
password="password",
)
conn.autocommit = True
cur = conn.cursor()
# Enable pgvector extension
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
# Create table with vector column
cur.execute("""
CREATE TABLE IF NOT EXISTS documents (
id SERIAL PRIMARY KEY,
content TEXT NOT NULL,
source VARCHAR(255),
category VARCHAR(100),
embedding vector(384), -- 384 dimensions for MiniLM
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
# Create HNSW index for fast similarity search
cur.execute("""
CREATE INDEX IF NOT EXISTS documents_embedding_idx
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64)
""")
return conn
def insert_documents(conn, documents, model):
"""Insert documents with their embeddings."""
cur = conn.cursor()
for doc in documents:
embedding = model.encode(doc["content"]).tolist()
cur.execute(
"""
INSERT INTO documents (content, source, category, embedding)
VALUES (%s, %s, %s, %s::vector)
""",
(doc["content"], doc.get("source", ""), doc.get("category", ""), embedding),
)
conn.commit()
print(f"Inserted {len(documents)} documents")
def semantic_search(conn, model, query, k=5, category=None):
"""Perform semantic search with optional filtering."""
cur = conn.cursor()
query_embedding = model.encode(query).tolist()
if category:
cur.execute(
"""
SELECT content, source, category,
1 - (embedding <=> %s::vector) AS similarity
FROM documents
WHERE category = %s
ORDER BY embedding <=> %s::vector
LIMIT %s
""",
(query_embedding, category, query_embedding, k),
)
else:
cur.execute(
"""
SELECT content, source, category,
1 - (embedding <=> %s::vector) AS similarity
FROM documents
ORDER BY embedding <=> %s::vector
LIMIT %s
""",
(query_embedding, query_embedding, k),
)
results = cur.fetchall()
return [
{
"content": r[0],
"source": r[1],
"category": r[2],
"similarity": float(r[3]),
}
for r in results
]
# Usage
if __name__ == "__main__":
model = SentenceTransformer("all-MiniLM-L6-v2")
conn = setup_pgvector()
documents = [
{"content": "Python supports multiple programming paradigms.", "source": "docs", "category": "programming"},
{"content": "Neural networks are inspired by biological neurons.", "source": "textbook", "category": "ml"},
{"content": "PostgreSQL is a powerful relational database.", "source": "docs", "category": "databases"},
]
insert_documents(conn, documents, model)
results = semantic_search(conn, model, "How do brain-inspired algorithms work?", k=3)
for r in results:
print(f"[{r['similarity']:.3f}] {r['content']}")
5. RAG Pipeline Architecture
Hybrid Search
Combining keyword search (BM25) with semantic vector search often produces better results than either alone. BM25 excels at exact term matching, while semantic search captures meaning.
"""
Hybrid Search: BM25 + Vector Search
======================================
Combine keyword and semantic search for better retrieval.
pip install rank-bm25 sentence-transformers numpy
"""
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from typing import List, Tuple
class HybridSearcher:
"""Combine BM25 keyword search with semantic vector search."""
def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
self.model = SentenceTransformer(model_name)
self.documents = []
self.bm25 = None
self.embeddings = None
def index(self, documents: List[str]):
"""Index documents for both BM25 and vector search."""
self.documents = documents
# BM25 index
tokenized = [doc.lower().split() for doc in documents]
self.bm25 = BM25Okapi(tokenized)
# Vector index
self.embeddings = self.model.encode(
documents, normalize_embeddings=True
)
def search(
self,
query: str,
k: int = 5,
alpha: float = 0.5, # Weight for semantic search (1-alpha for BM25)
) -> List[Tuple[int, float, str]]:
"""
Hybrid search combining BM25 and semantic scores.
Args:
query: Search query
k: Number of results
alpha: Weight for semantic search (0=pure BM25, 1=pure semantic)
Returns:
List of (index, score, document) tuples
"""
# BM25 scores
bm25_scores = self.bm25.get_scores(query.lower().split())
# Normalize to [0, 1]
bm25_max = max(bm25_scores) if max(bm25_scores) > 0 else 1
bm25_normalized = bm25_scores / bm25_max
# Semantic scores
query_embedding = self.model.encode(
[query], normalize_embeddings=True
)[0]
semantic_scores = np.dot(self.embeddings, query_embedding)
# Already in [-1, 1] range for normalized embeddings
semantic_normalized = (semantic_scores + 1) / 2 # Shift to [0, 1]
# Combine scores
hybrid_scores = alpha * semantic_normalized + (1 - alpha) * bm25_normalized
# Get top-k results
top_indices = np.argsort(hybrid_scores)[::-1][:k]
results = []
for idx in top_indices:
results.append((
int(idx),
float(hybrid_scores[idx]),
self.documents[idx],
))
return results
# Demo
searcher = HybridSearcher()
docs = [
"The Python programming language was created by Guido van Rossum in 1991.",
"Machine learning algorithms can be supervised, unsupervised, or reinforcement-based.",
"PostgreSQL supports JSONB columns for storing semi-structured data.",
"Transfer learning uses pre-trained neural network weights as a starting point.",
"REST APIs use HTTP methods like GET, POST, PUT, and DELETE.",
"Convolutional neural networks excel at image recognition tasks.",
"Docker containers provide lightweight virtualization for application deployment.",
"The attention mechanism in transformers computes weighted sums of value vectors.",
]
searcher.index(docs)
# Test different queries
queries = [
"Who invented Python?", # BM25 should excel (exact term match)
"How do deep learning models see images?", # Semantic should excel
"neural network attention", # Both should contribute
]
for query in queries:
print(f"\nQuery: '{query}'")
results = searcher.search(query, k=3, alpha=0.5)
for idx, score, doc in results:
print(f" [{score:.3f}] {doc}")
Reranking with Cross-Encoders
Initial retrieval with a bi-encoder is tuned for recall (casting a wide net over potentially relevant documents); reranking with a cross-encoder is tuned for precision (ordering that candidate set correctly).
"""
Reranking with Cross-Encoders
================================
Use a cross-encoder to rerank initially retrieved documents
for higher precision.
pip install sentence-transformers
"""
from sentence_transformers import CrossEncoder
# Load cross-encoder model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# Initial retrieval results (from vector search)
query = "How do transformers process sequences?"
retrieved_docs = [
"The attention mechanism computes weighted sums of values based on query-key similarities.",
"Recurrent neural networks process sequences one token at a time.",
"Transformers use self-attention to process all tokens in parallel.",
"BERT is a bidirectional transformer model for language understanding.",
"CNNs use convolutional filters to detect local patterns.",
]
# Rerank: cross-encoder scores each (query, document) pair
pairs = [(query, doc) for doc in retrieved_docs]
scores = reranker.predict(pairs)
# Sort by score
ranked = sorted(
zip(scores, retrieved_docs),
key=lambda x: x[0],
reverse=True,
)
print(f"Query: {query}\n")
print("Before reranking:")
for i, doc in enumerate(retrieved_docs):
print(f" #{i+1}: {doc}")
print("\nAfter reranking:")
for i, (score, doc) in enumerate(ranked):
print(f" #{i+1} (score={score:.3f}): {doc}")
Practical: Complete RAG Pipeline from Scratch
"""
Complete RAG Pipeline from Scratch
=====================================
Build a full RAG system without any framework,
using only basic libraries.
pip install sentence-transformers chromadb openai
"""
import os
from typing import List, Dict, Optional
from dataclasses import dataclass
from sentence_transformers import SentenceTransformer, CrossEncoder
import chromadb
from openai import OpenAI
@dataclass
class RetrievedChunk:
"""A chunk of text retrieved from the knowledge base."""
text: str
source: str
score: float
metadata: Dict
class RAGPipeline:
"""Complete RAG pipeline: index, retrieve, generate."""
def __init__(
self,
embedding_model: str = "all-MiniLM-L6-v2",
reranker_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
llm_model: str = "gpt-4o-mini",
collection_name: str = "rag_knowledge_base",
persist_dir: str = "./rag_chroma_db",
):
# Embedding model
self.embedder = SentenceTransformer(embedding_model)
# Reranker
self.reranker = CrossEncoder(reranker_model)
# Vector store
self.chroma_client = chromadb.PersistentClient(path=persist_dir)
self.collection = self.chroma_client.get_or_create_collection(
name=collection_name,
metadata={"hnsw:space": "cosine"},
)
# LLM client
self.llm_client = OpenAI()
self.llm_model = llm_model
# Text splitter
self.chunk_size = 500
self.chunk_overlap = 50
# ---- INDEXING ----
def chunk_text(self, text: str, source: str = "") -> List[Dict]:
"""Split text into overlapping chunks with metadata."""
chunks = []
sentences = text.replace('\n', ' ').split('. ')
current_chunk = []
current_length = 0
for sentence in sentences:
sentence = sentence.strip()
if not sentence:
continue
sentence_with_period = sentence if sentence.endswith('.') else sentence + '.'
sentence_len = len(sentence_with_period)
if current_length + sentence_len > self.chunk_size and current_chunk:
chunk_text = ' '.join(current_chunk)
chunks.append({
"text": chunk_text,
"source": source,
"chunk_index": len(chunks),
})
# Keep last sentence for overlap
current_chunk = current_chunk[-1:]
current_length = len(current_chunk[0]) if current_chunk else 0
current_chunk.append(sentence_with_period)
current_length += sentence_len
if current_chunk:
chunks.append({
"text": ' '.join(current_chunk),
"source": source,
"chunk_index": len(chunks),
})
return chunks
def index_documents(self, documents: List[Dict[str, str]]):
"""
Index documents into the vector store.
Args:
documents: List of {"text": "...", "source": "..."} dicts
"""
all_chunks = []
for doc in documents:
chunks = self.chunk_text(doc["text"], doc.get("source", "unknown"))
all_chunks.extend(chunks)
if not all_chunks:
print("No chunks to index")
return
# Generate embeddings
texts = [c["text"] for c in all_chunks]
embeddings = self.embedder.encode(texts).tolist()
# Add to ChromaDB
# Offset IDs by the existing count so repeated indexing calls don't collide
offset = self.collection.count()
self.collection.add(
ids=[f"chunk_{offset + i}" for i in range(len(all_chunks))],
documents=texts,
embeddings=embeddings,
metadatas=[
{"source": c["source"], "chunk_index": c["chunk_index"]}
for c in all_chunks
],
)
print(f"Indexed {len(all_chunks)} chunks from {len(documents)} documents")
# ---- RETRIEVAL ----
def retrieve(
self,
query: str,
top_k: int = 10,
rerank_top_k: int = 3,
) -> List[RetrievedChunk]:
"""Retrieve and rerank relevant chunks."""
# Step 1: Initial retrieval with vector search
query_embedding = self.embedder.encode([query]).tolist()
results = self.collection.query(
query_embeddings=query_embedding,
n_results=top_k,
include=["documents", "distances", "metadatas"],
)
if not results["documents"][0]:
return []
# Step 2: Rerank with cross-encoder
pairs = [(query, doc) for doc in results["documents"][0]]
rerank_scores = self.reranker.predict(pairs)
# Combine and sort
chunks = []
for doc, dist, meta, rerank_score in zip(
results["documents"][0],
results["distances"][0],
results["metadatas"][0],
rerank_scores,
):
chunks.append(RetrievedChunk(
text=doc,
source=meta.get("source", ""),
score=float(rerank_score),
metadata=meta,
))
# Sort by reranker score (descending)
chunks.sort(key=lambda x: x.score, reverse=True)
return chunks[:rerank_top_k]
# ---- GENERATION ----
def generate(
self,
query: str,
retrieved_chunks: List[RetrievedChunk],
system_prompt: Optional[str] = None,
) -> str:
"""Generate a response using retrieved context."""
if system_prompt is None:
system_prompt = """You are a helpful AI assistant. Answer the user's question
based ONLY on the provided context. If the context doesn't contain
enough information to answer the question, say so clearly.
Always cite your sources by referencing the source document."""
# Build context from retrieved chunks
context_parts = []
for i, chunk in enumerate(retrieved_chunks, 1):
context_parts.append(
f"[Source {i}: {chunk.source}]\n{chunk.text}"
)
context = "\n\n".join(context_parts)
# Build the prompt
messages = [
{"role": "system", "content": system_prompt},
{
"role": "user",
"content": f"""Context:
{context}
Question: {query}
Please answer based on the context above. Cite sources using [Source N] notation.""",
},
]
# Call LLM
response = self.llm_client.chat.completions.create(
model=self.llm_model,
messages=messages,
temperature=0.3,
max_tokens=500,
)
return response.choices[0].message.content
# ---- FULL PIPELINE ----
def query(self, question: str) -> Dict:
"""Run the full RAG pipeline: retrieve + generate."""
# Retrieve relevant chunks
chunks = self.retrieve(question, top_k=10, rerank_top_k=3)
if not chunks:
return {
"answer": "I could not find any relevant information to answer your question.",
"sources": [],
}
# Generate answer
answer = self.generate(question, chunks)
return {
"answer": answer,
"sources": [
{
"text": c.text[:200] + "...",
"source": c.source,
"relevance_score": c.score,
}
for c in chunks
],
}
# ===========================
# Usage Example
# ===========================
if __name__ == "__main__":
rag = RAGPipeline()
# Index some documents
documents = [
{
"text": """
Transformers are a neural network architecture introduced in the paper
'Attention Is All You Need' by Vaswani et al. in 2017. They revolutionized
NLP by replacing recurrent architectures with self-attention mechanisms
that can process sequences in parallel. The key innovation is multi-head
attention, which allows the model to jointly attend to information from
different representation subspaces. Transformers consist of an encoder
and decoder, each made up of layers containing self-attention and
feed-forward neural network sublayers.
""",
"source": "transformer_overview.pdf",
},
{
"text": """
BERT (Bidirectional Encoder Representations from Transformers) is a
language model developed by Google in 2018. Unlike GPT which is
autoregressive (left-to-right), BERT is trained with a masked language
modeling objective where random tokens are masked and the model predicts
them using bidirectional context. BERT uses only the encoder part of
the transformer architecture. It achieved state-of-the-art results on
11 NLP benchmarks when released.
""",
"source": "bert_paper.pdf",
},
{
"text": """
GPT (Generative Pre-trained Transformer) models use the decoder part
of the transformer architecture. They are trained autoregressively to
predict the next token. GPT-2 demonstrated that large language models
can generate coherent long-form text. GPT-3 with 175 billion parameters
showed emergent few-shot learning capabilities. GPT-4 introduced
multimodal capabilities, accepting both text and image inputs.
""",
"source": "gpt_history.pdf",
},
]
rag.index_documents(documents)
# Query the system
result = rag.query("What is the key innovation of the transformer architecture?")
print(f"Answer: {result['answer']}")
print(f"\nSources used:")
for source in result['sources']:
print(f" - {source['source']} (score: {source['relevance_score']:.3f})")
Practical: RAG with LangChain
"""
RAG Pipeline with LangChain
==============================
Build the same RAG pipeline using LangChain framework.
pip install langchain langchain-openai langchain-community chromadb
"""
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.document_loaders import (
TextLoader,
PyPDFLoader,
DirectoryLoader,
)
# Step 1: Load documents
# From text files:
# loader = TextLoader("document.txt")
# From PDFs:
# loader = PyPDFLoader("document.pdf")
# From a directory:
# loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
# For demo, create documents manually
from langchain.schema import Document
documents = [
Document(
page_content="""
Retrieval Augmented Generation (RAG) combines retrieval and generation
to produce grounded, accurate responses. It was introduced by Lewis et al.
in 2020. The key idea is to retrieve relevant documents from a knowledge
base and include them as context for the language model.
""",
metadata={"source": "rag_paper.pdf", "page": 1},
),
Document(
page_content="""
Vector databases store high-dimensional embeddings and enable fast
similarity search. Popular options include Pinecone, Weaviate, Qdrant,
and Chroma. They use algorithms like HNSW for approximate nearest
neighbor search, achieving sub-millisecond query times even with
millions of vectors.
""",
metadata={"source": "vector_db_guide.pdf", "page": 1},
),
]
# Step 2: Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
# Step 3: Create vector store with embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./langchain_chroma_db",
collection_name="langchain_rag",
)
# Step 4: Create retriever
retriever = vectorstore.as_retriever(
search_type="similarity", # or "mmr" for diversity
search_kwargs={"k": 3},
)
# Step 5: Create the RAG chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)
prompt_template = PromptTemplate(
template="""Use the following context to answer the question.
If you cannot answer based on the context, say so.
Context:
{context}
Question: {question}
Answer:""",
input_variables=["context", "question"],
)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff", # "stuff" = put all docs in prompt
retriever=retriever,
return_source_documents=True,
chain_type_kwargs={"prompt": prompt_template},
)
# Step 6: Query
result = qa_chain.invoke({"query": "What is RAG and how does it work?"})
print(f"Answer: {result['result']}")
print(f"\nSource documents:")
for doc in result['source_documents']:
print(f" - {doc.metadata['source']}: {doc.page_content[:100]}...")
6. Advanced RAG Techniques (2025-2026)
Basic RAG works well for simple questions but struggles with complex queries, ambiguous questions, and multi-hop reasoning. Advanced techniques address these limitations.
Multi-Query RAG
Generate multiple variations of the user's query using an LLM, retrieve for each variation, then merge and deduplicate the results. This increases recall by capturing different aspects of the question.
def multi_query_retrieval(query: str, retriever, llm) -> list:
"""Generate multiple query variations and retrieve for each."""
# Generate query variations
prompt = f"""Generate 3 different versions of the following question
to help retrieve relevant documents. Each version should capture
a different aspect or phrasing of the question.
Original question: {query}
Provide the 3 alternative questions, one per line:"""
variations = llm.invoke(prompt).content.strip().split('\n')
variations = [v.strip() for v in variations if v.strip()]
all_queries = [query] + variations
# Retrieve for each query
all_docs = []
seen_contents = set()
for q in all_queries:
docs = retriever.invoke(q)  # get_relevant_documents() is deprecated in recent LangChain
for doc in docs:
if doc.page_content not in seen_contents:
all_docs.append(doc)
seen_contents.add(doc.page_content)
return all_docs
RAG Fusion
Similar to multi-query but uses Reciprocal Rank Fusion (RRF) to combine rankings from different queries into a single, better ranking.
def reciprocal_rank_fusion(rankings: list[list], k: int = 60) -> list:
"""
Combine multiple rankings using Reciprocal Rank Fusion (RRF).
RRF score(doc) = sum over rankings of 1 / (k + rank), with rank starting at 1
"""
scores = {}
for ranking in rankings:
for rank, doc_id in enumerate(ranking):
if doc_id not in scores:
scores[doc_id] = 0.0
scores[doc_id] += 1.0 / (k + rank + 1)
# Sort by RRF score
return sorted(scores.items(), key=lambda x: x[1], reverse=True)
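A quick, self-contained check of the formula (the function is restated compactly so this snippet runs on its own; document IDs are illustrative):

```python
from collections import defaultdict

def rrf(rankings, k=60):
    # Same formula as reciprocal_rank_fusion above, restated so this
    # snippet is self-contained
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Rankings from two query variations
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_c", "doc_a", "doc_b"],
]
fused = rrf(rankings)
print([doc for doc, _ in fused])  # ['doc_a', 'doc_c', 'doc_b']
```

Note how `doc_a` (ranked 1st and 2nd) edges out `doc_c` (ranked 3rd and 1st): RRF rewards documents that rank consistently well across queries.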
Self-RAG
Self-RAG (Asai et al., 2023) trains the LLM to decide when retrieval is needed, assess the relevance of retrieved passages, and critique its own responses for factual grounding.
# Self-RAG Decision Flow:
#
# 1. Given a query, the model generates a "retrieval token":
# [Retrieve] = Yes/No
#
# 2. If retrieval is needed:
# - Retrieve relevant passages
# - For each passage, generate a "relevance token":
# [IsRel] = Relevant/Irrelevant
#
# 3. Generate response using relevant passages
#
# 4. Generate a "support token" for each claim:
# [IsSup] = Fully Supported/Partially Supported/Not Supported
#
# 5. Generate a "utility token" for overall quality:
# [IsUse] = 5/4/3/2/1
#
# This creates a self-correcting RAG system that knows
# when it needs external information and can verify its claims.
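The decision flow above can be sketched as plain control flow. This is a hedged illustration, not the trained Self-RAG model: the callables stand in for the reflection tokens a fine-tuned model would emit, and all names here are hypothetical.

```python
from typing import Callable, Dict, List

def self_rag_answer(
    query: str,
    needs_retrieval: Callable[[str], bool],          # stands in for the [Retrieve] token
    retrieve_passages: Callable[[str], List[str]],   # any retriever
    is_relevant: Callable[[str, str], bool],         # stands in for the [IsRel] token
    generate: Callable[[str, List[str]], str],       # any generator
    is_supported: Callable[[str, List[str]], bool],  # stands in for the [IsSup] token
) -> Dict:
    """Self-RAG-style control flow with pluggable judgment functions."""
    if not needs_retrieval(query):
        # Answer from parametric knowledge alone
        return {"answer": generate(query, []), "passages": [], "supported": None}
    # Keep only passages the relevance judge accepts
    passages = [p for p in retrieve_passages(query) if is_relevant(query, p)]
    answer = generate(query, passages)
    # Verify the answer is grounded in the retained passages
    return {"answer": answer, "passages": passages,
            "supported": is_supported(answer, passages)}

# Toy stubs; a real Self-RAG model makes these judgments itself
result = self_rag_answer(
    "When was Python created?",
    needs_retrieval=lambda q: True,
    retrieve_passages=lambda q: ["Python was created in 1991.", "Cats are mammals."],
    is_relevant=lambda q, p: "Python" in p,
    generate=lambda q, ps: ps[0] if ps else "I don't know.",
    is_supported=lambda a, ps: any(a in p for p in ps),
)
print(result["answer"])  # Python was created in 1991.
```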
GraphRAG (Microsoft)
GraphRAG (2024) builds a knowledge graph from the corpus before retrieval. Instead of retrieving raw text chunks, it uses the graph structure to find and traverse related concepts.
# GraphRAG Architecture:
#
# Indexing Phase:
# 1. Extract entities and relationships from documents using LLM
# 2. Build a knowledge graph (nodes = entities, edges = relationships)
# 3. Detect communities in the graph using Leiden algorithm
# 4. Generate summaries for each community
#
# Query Phase:
# 1. Map query to relevant entities and communities
# 2. Retrieve community summaries (global search)
# OR traverse local neighborhood in graph (local search)
# 3. Use retrieved information as context for generation
#
# Advantages over standard RAG:
# - Better at answering questions about themes and connections
# - Can synthesize information across many documents
# - Handles "What are the main topics in this corpus?" queries
# - Provides more comprehensive answers to broad questions
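A toy sketch of the indexing and local-search ideas: hand-written triples stand in for LLM extraction, and connected components stand in for Leiden community detection (all names here are illustrative).

```python
from collections import defaultdict

# Hand-written (entity, relation, entity) triples; real GraphRAG extracts
# these from the corpus with an LLM
triples = [
    ("Transformer", "introduced_in", "Attention Is All You Need"),
    ("BERT", "based_on", "Transformer"),
    ("GPT", "based_on", "Transformer"),
    ("PostgreSQL", "supports", "JSONB"),
]

# Nodes = entities, edges = relationships (undirected for traversal)
adjacency = defaultdict(set)
for head, _, tail in triples:
    adjacency[head].add(tail)
    adjacency[tail].add(head)
graph = dict(adjacency)

def communities_of(graph):
    """Connected components as a stand-in for Leiden community detection."""
    seen, communities = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, community = [node], set()
        while stack:
            n = stack.pop()
            if n in community:
                continue
            community.add(n)
            stack.extend(graph[n] - community)
        seen |= community
        communities.append(community)
    return communities

def local_search(graph, entity):
    """Local search: traverse an entity's immediate neighborhood."""
    return sorted(graph[entity])

communities = communities_of(graph)
print(len(communities))                    # 2: transformer-family vs. database
print(local_search(graph, "Transformer"))
```

In real GraphRAG each community would then get an LLM-written summary, which is what global search retrieves for broad, corpus-level questions.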
Contextual Retrieval (Anthropic)
Anthropic's contextual retrieval (2024) prepends context about the whole document to each chunk before embedding. This helps chunks that are ambiguous on their own (e.g., "The company reported Q3 revenue of $5B" -- which company?).
def contextual_chunking(document: str, chunks: list[str], llm) -> list[str]:
"""
Add contextual information to each chunk.
Anthropic's approach: use an LLM to generate context for each chunk
based on the full document.
"""
contextualized_chunks = []
for chunk in chunks:
prompt = f"""Here is the full document:
{document[:3000]} # Truncate for context window
Here is a specific chunk from the document:
{chunk}
Please provide a brief (2-3 sentences) context that situates this chunk
within the overall document. Focus on who/what/when/where information
that would help someone understand this chunk in isolation.
Context:"""
context = llm.invoke(prompt).content.strip()
contextualized_chunk = f"{context}\n\n{chunk}"
contextualized_chunks.append(contextualized_chunk)
return contextualized_chunks
# Example:
# Original chunk: "Revenue increased by 15% year-over-year."
# Contextualized: "This passage is from Apple Inc.'s Q3 2025 earnings report,
# discussing financial performance. Revenue increased by 15% year-over-year."
Late Chunking
Late chunking (2024) processes the full document through the embedding model first, then chunks the resulting token embeddings. This preserves cross-chunk context in the embeddings, unlike traditional chunking where each chunk is embedded independently.
# Traditional chunking:
# Document -> Split into chunks -> Embed each chunk independently
# Problem: Each chunk's embedding only captures its local content
# Late chunking:
# Document -> Embed full document (get per-token embeddings)
# -> Split TOKEN EMBEDDINGS into chunks
# -> Pool each chunk's token embeddings into a single vector
# Advantage: Each chunk's embedding benefits from full document context
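A minimal sketch of the pooling step. A random array stands in for the per-token embeddings (with sentence-transformers these could come from `model.encode(doc, output_value="token_embeddings")`); the boundaries are hypothetical.

```python
import numpy as np

# Random stand-in for the per-token embeddings of the FULL document,
# shape (num_tokens, dim)
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(120, 384))

def late_chunk(token_embeddings: np.ndarray, boundaries) -> np.ndarray:
    """Mean-pool token-embedding slices into one vector per chunk."""
    return np.stack([token_embeddings[s:e].mean(axis=0) for s, e in boundaries])

# Token-position chunk boundaries (would come from the chunker's offsets).
# Because every token was embedded with full-document attention first,
# each pooled chunk vector carries cross-chunk context.
boundaries = [(0, 50), (40, 90), (80, 120)]
chunk_vectors = late_chunk(token_embeddings, boundaries)
print(chunk_vectors.shape)  # (3, 384)
```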
Summary and Key Takeaways
Week 7 Key Takeaways
- RAG solves critical LLM limitations: hallucinations, outdated knowledge, and inability to access private data. It is the most practical pattern for production LLM applications.
- Chunking strategy matters enormously: Recursive character splitting with structure awareness is a strong default. Semantic chunking provides the best quality but requires more computation.
- Choose embedding models carefully: OpenAI text-embedding-3 for proprietary, BGE/E5 for open-source. Match chunk sizes to what the embedding model was trained on.
- HNSW is the standard indexing algorithm: It provides logarithmic search time with high recall. Chroma and FAISS are excellent starting points.
- Hybrid search outperforms either approach alone: Combine BM25 keyword search with semantic vector search for robust retrieval.
- Reranking is essential: Cross-encoder reranking after initial retrieval significantly improves precision at minimal latency cost.
- Advanced techniques keep evolving: GraphRAG for corpus-level questions, contextual retrieval for better chunk embeddings, Self-RAG for self-correcting systems.
Next Steps
In Week 8: Hands-on RAG Implementation, we will build a complete production RAG chatbot with advanced retrieval strategies, reranking, input/output guardrails, AI safety measures, and evaluation using the RAGAS framework.