1. Project Planning
1.1 How to Choose a Capstone Project
Your capstone project is the culmination of 14 weeks of learning. It should demonstrate your ability to design, build, and deploy an AI-powered application. Here is how to choose wisely:
Selection Criteria
- Solves a real problem: Choose something you or others would actually use. Portfolio projects that solve real pain points stand out.
- Demonstrates breadth: Incorporate multiple concepts from the course (RAG, agents, evaluation, deployment).
- Achievable scope: You have 2 weeks. Better to ship a polished MVP than have an incomplete ambitious project.
- Showcases depth: Go beyond "call an API." Show that you understand the engineering behind the solution.
- Has a clear demo: The project should be demo-able in 5 minutes to a technical audience.
1.2 Scope Assessment Framework
Rate your project on these dimensions to ensure it is appropriately scoped:
| Dimension | Small (1-2 days) | Medium (3-5 days) | Large (1-2 weeks) |
|---|---|---|---|
| LLM Integration | Single API call | Chain of calls, tool use | Multi-agent, RAG, eval |
| Data Pipeline | Single file input | Multiple sources, embeddings | ETL, vector DB, caching |
| Frontend | Streamlit only | Polished Streamlit/Gradio | Custom React/Next.js |
| Backend | Script-based | FastAPI with basic routes | Full API, auth, queues |
| Evaluation | Manual testing | Basic metrics | Automated eval pipeline |
| Deployment | Local only | Single platform deploy | CI/CD, monitoring |
Target: Medium to Large scope. Aim for at least "Medium" in every dimension and "Large" in 2-3 dimensions.
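As a quick sanity check, the rubric above can be encoded as a self-assessment helper. This is a rough sketch; the dimension names and thresholds simply mirror the table and target statement:

```python
# Map each dimension rating to a scope score: 1 = Small, 2 = Medium, 3 = Large.
SCOPE = {"Small": 1, "Medium": 2, "Large": 3}

def assess_scope(ratings: dict[str, str]) -> str:
    """Check project ratings against the target: at least Medium everywhere,
    Large in 2-3 dimensions."""
    scores = [SCOPE[r] for r in ratings.values()]
    num_large = sum(1 for s in scores if s == 3)
    if min(scores) < 2:
        return "under-scoped: raise every dimension to at least Medium"
    if num_large > 3:
        return "over-scoped: too many Large dimensions for 2 weeks"
    if 2 <= num_large <= 3:
        return "on target"
    return "solid, but consider going Large in 2-3 dimensions"

print(assess_scope({
    "LLM Integration": "Large", "Data Pipeline": "Medium", "Frontend": "Medium",
    "Backend": "Medium", "Evaluation": "Large", "Deployment": "Medium",
}))  # → on target
```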
1.3 Timeline Planning (2-Week Build)
Recommended Timeline
Week 1: Foundation
==================
Day 1-2: Architecture design + project setup
- Define the problem and user stories
- Draw the architecture diagram
- Set up repo, virtual env, dependencies
- Choose and configure LLM provider(s)
Day 3-4: Core AI pipeline
- Implement the main AI/LLM functionality
- Build data ingestion (if RAG)
- Set up vector store (if needed)
- Get the core "happy path" working end-to-end
Day 5: API layer
- Build FastAPI endpoints
- Request/response models
- Error handling
- Basic tests
Week 2: Polish and Ship
=======================
Day 6-7: Frontend and UX
- Build the user interface
- Connect to API
- Handle loading states, errors
- Make it look professional
Day 8-9: Evaluation and hardening
- Set up evaluation pipeline
- Test edge cases
- Add caching, rate limiting
- Fix bugs, improve prompts
Day 10: Deployment and documentation
- Dockerize the application
- Deploy to cloud
- Write README
- Record demo video
& User Stories"] --> Arch["Design
Architecture"] Arch --> Setup["Project Setup
& Dependencies"] Setup --> Core["Build Core
AI Pipeline"] Core --> API["Build API
Layer"] API --> UI["Frontend
& UX"] UI --> Eval["Evaluation
& Hardening"] Eval --> Deploy["Deploy
& Document"] Deploy -->|"Iterate"| Eval style Define fill:#4CAF50,stroke:#333,color:#fff style Arch fill:#66BB6A,stroke:#333,color:#fff style Core fill:#2196F3,stroke:#333,color:#fff style API fill:#42A5F5,stroke:#333,color:#fff style UI fill:#FF9800,stroke:#333,color:#fff style Eval fill:#EF5350,stroke:#333,color:#fff style Deploy fill:#9C27B0,stroke:#333,color:#fff
1.4 Technology Stack Selection Guide
Decision Matrix
"I want to build quickly"
-> Streamlit + OpenAI API + Chroma
"I want production-grade"
-> FastAPI + LangGraph + Qdrant + Next.js
"I want to use open-source models"
-> FastAPI + Ollama/vLLM + pgvector + Gradio
"I want multi-agent"
-> LangGraph or CrewAI + FastAPI
"I want to process documents"
-> LangChain doc loaders + Qdrant + OpenAI
2. Project Ideas (Detailed)
Below are 10 detailed project ideas, each with architecture, tech stack, and learning outcomes. Choose one or combine elements from multiple ideas.
Project 1: AI-Powered Document Q&A System
Difficulty: Medium-Hard
Description
Build a production-grade RAG system that lets users upload documents (PDFs, Word, web pages), ask questions, and get accurate answers with source citations. Includes evaluation pipeline and admin dashboard.
Architecture
User uploads documents
|
v
+------------------+ +------------------+
| Document | --> | Chunking & |
| Ingestion | | Embedding |
| (PDF, DOCX, URL) | | Pipeline |
+------------------+ +------------------+
|
v
+------------------+
| Vector Store |
| (Qdrant) |
+------------------+
^
| retrieve
User asks question |
| |
v |
+------------------+ +------------------+
| Query Pipeline | -> | Reranker |
| (embedding + | | (cross-encoder) |
| search) | +------------------+
+------------------+ |
v
+------------------+
| LLM Generation |
| (with citations) |
+------------------+
|
v
Answer + Sources
Tech Stack
- Backend: FastAPI, Python 3.11+
- LLM: OpenAI GPT-4o / Anthropic Claude
- Embeddings: OpenAI text-embedding-3-small or sentence-transformers
- Vector Store: Qdrant (or Chroma for simplicity)
- Reranker: Cohere Rerank or cross-encoder
- Document Processing: PyMuPDF, python-docx, BeautifulSoup
- Frontend: Streamlit or Gradio
- Evaluation: RAGAS, custom metrics
Key Components to Build
- Document ingestion pipeline (multiple formats)
- Smart chunking (semantic, by section headers)
- Hybrid search (vector + keyword)
- Reranking for better precision
- Answer generation with source citations
- Conversation memory (multi-turn)
- Evaluation pipeline (faithfulness, relevancy, answer correctness)
- Admin panel: upload docs, view analytics, manage collections
What You Will Learn
RAG engineering, chunking strategies, embedding models, vector search, reranking, prompt engineering for grounded generation, evaluation with RAGAS, production API design.
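One of the key components above, hybrid search, is typically implemented by running vector and keyword retrieval separately and fusing the two ranked lists. A minimal reciprocal-rank-fusion sketch (the document IDs are illustrative):

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists into one.

    Each document scores sum(1 / (k + rank)) across the lists it appears in;
    k=60 is the constant commonly used for RRF.
    """
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # from embedding search
keyword_hits = ["doc_b", "doc_d", "doc_a"]  # from BM25 / keyword search
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that rank well in both lists (here `doc_b` and `doc_a`) float to the top, which is why fusion tends to beat either retriever alone.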
Project 2: Intelligent Customer Support Bot
Difficulty: Hard
Description
Build a multi-agent customer support system that can answer product questions (RAG), perform actions (tool calling), escalate to humans, and learn from feedback.
Architecture
Customer Message
|
v
+------------------+
| Router Agent | Classifies intent: FAQ, action, complaint, escalate
+------------------+
|
+-- FAQ --> RAG Agent (knowledge base search)
|
+-- Action --> Tool Agent (check order, update account, etc.)
|
+-- Complaint --> Empathy Agent (acknowledge, offer resolution)
|
+-- Complex --> Escalation (notify human agent)
|
v
+------------------+
| Response | Synthesize final response, maintain tone
| Synthesizer |
+------------------+
|
v
Customer Response + Internal Logging
Tech Stack
- Agent Framework: LangGraph (for stateful multi-agent orchestration)
- LLM: GPT-4o-mini (fast, cheap) + GPT-4o (complex cases)
- RAG: Qdrant + OpenAI embeddings
- Tools: Mock order system, CRM API
- Backend: FastAPI with WebSocket support
- Frontend: Streamlit chat or custom React chat widget
- Logging: LangSmith or custom logging
What You Will Learn
Multi-agent orchestration, tool calling, conversation management, escalation patterns, RAG for customer support, feedback loops, production agent deployment.
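The router pattern in the architecture above reduces to classifying the message and dispatching to a handler. A sketch with a stubbed keyword classifier (a real router would be an LLM call returning a structured intent label):

```python
def classify_intent(message: str) -> str:
    """Stub classifier. In practice this is an LLM call with structured output."""
    lowered = message.lower()
    if any(w in lowered for w in ("order", "refund", "cancel")):
        return "action"
    if any(w in lowered for w in ("angry", "terrible", "complaint")):
        return "complaint"
    return "faq"

HANDLERS = {
    "faq": lambda m: f"[RAG agent] searching knowledge base for: {m}",
    "action": lambda m: f"[Tool agent] executing request: {m}",
    "complaint": lambda m: f"[Empathy agent] acknowledging: {m}",
}

def route(message: str) -> str:
    intent = classify_intent(message)
    # Unknown intents escalate to a human, mirroring the diagram above.
    handler = HANDLERS.get(intent, lambda m: f"[Escalation] human needed for: {m}")
    return handler(message)

print(route("Where is my order #123?"))  # → [Tool agent] executing request: ...
```

In LangGraph, the same dispatch becomes a conditional edge from the router node to the specialist nodes.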
Project 3: Code Review Assistant
Difficulty: Medium
Description
An AI assistant that analyzes code diffs/pull requests, identifies potential bugs, suggests improvements, checks for security issues, and provides educational explanations.
Architecture
GitHub PR / Code Diff
|
v
+------------------+
| Code Parser | Parse diff, identify changed files and context
+------------------+
|
v
+-------------------------------------------+
| Parallel Analysis Agents |
| |
| [Bug Detector] [Security] [Style] [Perf] |
+-------------------------------------------+
|
v
+------------------+
| Review | Compile findings, prioritize, format
| Synthesizer |
+------------------+
|
v
Formatted Code Review (comments on specific lines)
Tech Stack
- LLM: Claude (excellent at code) or GPT-4o
- Code parsing: tree-sitter, unidiff
- GitHub integration: PyGithub or GitHub API
- Backend: FastAPI
- Frontend: GitHub App or Streamlit
What You Will Learn
Code analysis with LLMs, structured output for code review, GitHub API integration, parallel LLM calls, prompt engineering for technical tasks.
Project 4: AI Content Pipeline
Difficulty: Medium-Hard
Description
A multi-agent content creation pipeline: Research a topic, create an outline, write the content, edit for quality, generate images, and prepare for publishing. Each step uses a specialized agent.
Architecture
Topic Input: "Write a blog post about quantum computing for beginners"
|
v
[Research Agent] -> Search web, gather sources, extract key points
|
v
[Outline Agent] -> Create structured outline from research
|
v
[Writer Agent] -> Write full content following outline
|
v
[Editor Agent] -> Check grammar, flow, accuracy, suggest edits
|
v
[Image Agent] -> Generate illustrations with Stable Diffusion
|
v
[Publisher Agent] -> Format as HTML/Markdown, prepare for CMS
|
v
Final Content Package (text + images + metadata)
Tech Stack
- Agent Framework: LangGraph or CrewAI
- LLMs: GPT-4o (writing), Claude (editing)
- Web Search: Tavily API or SerpAPI
- Image Generation: DALL-E 3 or Stable Diffusion API
- Frontend: Streamlit with progress tracking
What You Will Learn
Multi-agent workflows, sequential pipeline orchestration, web search integration, content quality evaluation, image generation APIs, human-in-the-loop editing.
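At its simplest, the sequential hand-off above is function composition over a shared state dict, where each stage reads the previous stage's output and adds its own. A sketch with stub stages (each would be an LLM or API call in practice):

```python
def research(topic: str) -> dict:
    # Stub: a real agent would call a web-search API and extract key points.
    return {"topic": topic, "sources": ["source_1", "source_2"]}

def outline(state: dict) -> dict:
    state["outline"] = [f"Intro to {state['topic']}", "Key concepts", "Conclusion"]
    return state

def write(state: dict) -> dict:
    state["draft"] = " ".join(f"[section: {h}]" for h in state["outline"])
    return state

def run_pipeline(topic: str, stages) -> dict:
    state = stages[0](topic)
    for stage in stages[1:]:
        state = stage(state)   # each agent reads and extends the shared state
    return state

result = run_pipeline("quantum computing", [research, outline, write])
print(result["draft"])
```

Frameworks like LangGraph formalize exactly this: nodes are the stage functions and the state dict is the graph state, with the added benefits of retries, branching, and checkpointing.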
Project 5: Multimodal Search Engine
Difficulty: Hard
Description
A search engine that supports text, image, and video search using CLIP embeddings. Users can search by typing text, uploading images, or combining both. Supports cross-modal retrieval (find images matching text, find text matching images).
Tech Stack
- Embeddings: CLIP (OpenAI clip-vit-base-patch32 or SigLIP)
- Vector Store: Qdrant (supports multiple vector fields)
- Backend: FastAPI
- Video Processing: OpenCV, frame extraction
- Frontend: Streamlit or Next.js
What You Will Learn
Multimodal embeddings, CLIP architecture, cross-modal retrieval, vector database optimization, building search UIs, video processing.
Project 6: AI Tutor
Difficulty: Medium-Hard
Description
A personalized AI tutoring system that adapts to the student's level, tracks knowledge gaps, generates practice problems, explains concepts with analogies, and provides Socratic-style guidance rather than direct answers.
Tech Stack
- LLM: GPT-4o with structured prompts for pedagogy
- Knowledge Tracking: Simple skill graph in PostgreSQL
- RAG: Subject-specific knowledge base
- Frontend: Streamlit with interactive elements
What You Will Learn
Prompt engineering for educational contexts, knowledge graph construction, adaptive systems, long conversation management, pedagogical AI design.
Project 7: Automated Data Analysis Agent
Difficulty: Medium
Description
Upload a CSV/Excel file and get automated analysis: statistical summaries, visualizations, correlations, anomalies, and a narrative report. The agent writes and executes Python code to analyze the data.
Tech Stack
- LLM: GPT-4o (code generation + analysis)
- Code Execution: Sandboxed Python (E2B or Docker)
- Visualization: matplotlib, seaborn (generated by agent)
- Frontend: Streamlit with file upload and chart display
What You Will Learn
Code generation with LLMs, sandboxed execution, data analysis automation, chart generation, report writing, tool calling for data tasks.
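The core mechanic, executing model-generated code in isolation, can be sketched with a subprocess and a timeout. Note this is NOT a real security boundary; untrusted code needs E2B or a locked-down Docker container, as the tech stack above suggests:

```python
import subprocess
import sys

def run_generated_code(code: str, timeout_s: int = 10) -> dict:
    """Execute a generated Python snippet in a separate interpreter process.

    A separate process gives crash isolation and a hard timeout, but no
    real sandboxing -- use a proper sandbox for untrusted code.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return {"stdout": result.stdout, "stderr": result.stderr,
                "ok": result.returncode == 0}
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "timed out", "ok": False}

snippet = "import statistics; print(statistics.mean([1, 2, 3, 4]))"
print(run_generated_code(snippet))
```

The agent loop then becomes: generate code, run it, feed `stdout`/`stderr` back to the LLM, and retry on failure.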
Project 8: AI Meeting Assistant
Difficulty: Medium
Description
Processes meeting recordings or transcripts to produce structured summaries, action items, decisions, and follow-up reminders. Can answer questions about past meetings.
Tech Stack
- Transcription: Whisper (local) or AssemblyAI
- LLM: GPT-4o-mini (summaries) + GPT-4o (complex analysis)
- RAG: Store past meeting data for Q&A
- Frontend: Streamlit
What You Will Learn
Audio processing, speech-to-text, long document summarization, structured extraction, RAG over temporal data, calendar/task integration.
Project 9: Legal Document Analyzer
Difficulty: Hard
Description
Analyze legal contracts and documents: extract key clauses, highlight risks, compare document versions, and answer questions about legal terms. Includes citation to specific sections.
Tech Stack
- Document Processing: PyMuPDF, unstructured.io
- LLM: Claude (strong at long document analysis)
- RAG: Section-aware chunking + Qdrant
- Frontend: Streamlit with PDF viewer
What You Will Learn
Legal document processing, section-aware parsing, comparative analysis, risk assessment prompting, long-context strategies, citation generation.
Project 10: AI-Powered Monitoring Dashboard
Difficulty: Hard
Description
Build a monitoring and observability platform for LLM applications. Track latency, cost, token usage, error rates, prompt/response quality, and detect anomalies.
Tech Stack
- Data Collection: OpenTelemetry, custom middleware
- Storage: PostgreSQL + TimescaleDB
- LLM Evaluation: Automated quality scoring
- Alerting: Custom rules + anomaly detection
- Frontend: Streamlit dashboards or Grafana
What You Will Learn
LLM observability, production monitoring, cost tracking, quality metrics, anomaly detection, dashboard design, operational AI engineering.
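At its core, cost tracking is token counts multiplied by per-model prices. A minimal sketch (the prices below are illustrative placeholders, not current rates; always check your provider's pricing page):

```python
# Illustrative per-1M-token prices in USD -- placeholders, not real rates.
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of one LLM call from its token usage."""
    p = PRICES[model]
    return (prompt_tokens * p["input"] + completion_tokens * p["output"]) / 1_000_000

# Aggregate across logged calls -- the basis for per-user / per-day dashboards.
calls = [("gpt-4o-mini", 1200, 300), ("gpt-4o", 800, 500)]
total = sum(estimate_cost(m, pt, ct) for m, pt, ct in calls)
print(f"total: ${total:.6f}")
```

In the real dashboard, the token counts come from the `usage` field each provider returns with its responses, stored per request by the collection middleware.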
3. Architecture Patterns for AI Projects
Standard AI application architecture:

User
  |
  v
Frontend (Streamlit / React)
  |
  v
API Server (FastAPI)
  +-- LLM Service (OpenAI / Anthropic)
  +-- RAG Pipeline --> Vector Store (Qdrant / Chroma)
  |               \-> Embedding Service
  +-- Cache Layer (Redis)
  +-- Database (PostgreSQL)
  +-- Evaluation Pipeline

Build-evaluate-ship loop:

Build --> Test & Evaluate --> Review Results
  +-- Pass --> Ship to Production --> Monitor & Collect Feedback
  |                                    +-- New issues --> back to Build
  +-- Fail --> Fix Issues & Improve Prompts --> back to Build
3.1 Monolithic vs Microservices for AI
For Your Capstone: Start Monolithic
For a 2-week project, a well-structured monolith is the right choice. Here is why and how:
Recommended Structure (Monolithic):
===================================
project/
+-- app/
| +-- __init__.py
| +-- main.py # FastAPI app, routes
| +-- config.py # Settings and environment variables
| +-- models/
| | +-- schemas.py # Pydantic models for API
| | +-- database.py # DB models (if needed)
| +-- services/
| | +-- llm.py # LLM client wrapper
| | +-- rag.py # RAG pipeline
| | +-- embeddings.py # Embedding service
| | +-- agents.py # Agent logic
| +-- utils/
| +-- prompts.py # Prompt templates
| +-- chunking.py # Text chunking utilities
+-- tests/
| +-- test_rag.py
| +-- test_api.py
+-- evaluation/
| +-- eval_pipeline.py # Evaluation scripts
| +-- test_cases.json # Test Q&A pairs
+-- frontend/
| +-- app.py # Streamlit app
+-- scripts/
| +-- ingest_documents.py # Data ingestion scripts
+-- Dockerfile
+-- docker-compose.yml
+-- requirements.txt
+-- .env.example
+-- README.md
3.2 API Design for AI Services
from fastapi import FastAPI, UploadFile, File, HTTPException, BackgroundTasks
from pydantic import BaseModel, Field
from typing import Optional
import uuid
from datetime import datetime
app = FastAPI(title="AI Document Q&A", version="1.0.0")
# ====================
# Request/Response Models
# ====================
class QuestionRequest(BaseModel):
"""Request model for asking a question."""
question: str = Field(..., min_length=1, max_length=2000,
description="The question to ask about the documents")
collection_id: str = Field(default="default",
description="Which document collection to search")
max_sources: int = Field(default=5, ge=1, le=20,
description="Maximum number of source passages to retrieve")
model: str = Field(default="gpt-4o-mini",
description="LLM model to use for answer generation")
class Source(BaseModel):
"""A source passage used to generate the answer."""
document_name: str
page_number: Optional[int] = None
chunk_text: str
relevance_score: float
class AnswerResponse(BaseModel):
"""Response model for a question answer."""
answer: str
sources: list[Source]
model_used: str
latency_ms: float
token_usage: dict
class DocumentUploadResponse(BaseModel):
"""Response after uploading a document."""
document_id: str
filename: str
num_chunks: int
status: str
message: str
class HealthResponse(BaseModel):
"""Health check response."""
status: str
version: str
models_available: list[str]
vector_store_status: str
# ====================
# API Endpoints
# ====================
@app.get("/health", response_model=HealthResponse)
async def health_check():
"""Check the health of all services."""
return HealthResponse(
status="healthy",
version="1.0.0",
models_available=["gpt-4o-mini", "gpt-4o", "claude-sonnet"],
vector_store_status="connected",
)
@app.post("/documents/upload", response_model=DocumentUploadResponse)
async def upload_document(
file: UploadFile = File(...),
collection_id: str = "default",
background_tasks: BackgroundTasks = None,
):
"""
Upload a document for indexing.
Supports PDF, DOCX, TXT, and Markdown files.
Processing happens in the background.
"""
allowed_extensions = {".pdf", ".docx", ".txt", ".md"}
ext = "." + file.filename.split(".")[-1].lower()
if ext not in allowed_extensions:
raise HTTPException(
status_code=400,
detail=f"Unsupported file type: {ext}. Supported: {allowed_extensions}"
)
doc_id = str(uuid.uuid4())
# Save file and process in background
content = await file.read()
# In a real implementation:
# background_tasks.add_task(process_document, doc_id, content, ext, collection_id)
return DocumentUploadResponse(
document_id=doc_id,
filename=file.filename,
num_chunks=0, # Updated after processing
status="processing",
message="Document is being processed. Use the status endpoint to check progress.",
)
@app.post("/ask", response_model=AnswerResponse)
async def ask_question(request: QuestionRequest):
"""
Ask a question about the uploaded documents.
Uses RAG to retrieve relevant passages and generate an answer.
"""
import time
start_time = time.time()
# In a real implementation:
# 1. Embed the question
# 2. Search the vector store
# 3. Rerank results
# 4. Generate answer with LLM
# 5. Extract citations
# Placeholder response
latency = (time.time() - start_time) * 1000
return AnswerResponse(
answer="This is a placeholder answer. Implement the RAG pipeline.",
sources=[],
model_used=request.model,
latency_ms=latency,
token_usage={"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
)
@app.get("/documents/{collection_id}")
async def list_documents(collection_id: str = "default"):
"""List all documents in a collection."""
# Return list of documents with metadata
return {"collection_id": collection_id, "documents": []}
3.3 Caching Strategies for LLM Responses
import hashlib
import json
from functools import wraps
from typing import Optional
# ====================
# Simple In-Memory Cache
# ====================
class LLMCache:
"""
Cache for LLM responses to avoid redundant API calls.
Strategies:
1. Exact match: Cache based on exact prompt hash
2. Semantic cache: Cache based on embedding similarity (more advanced)
"""
def __init__(self, max_size: int = 1000):
self.cache: dict[str, dict] = {}
self.max_size = max_size
self.hits = 0
self.misses = 0
def _make_key(self, prompt: str, model: str, **kwargs) -> str:
"""Create a cache key from prompt and parameters."""
key_data = {
"prompt": prompt,
"model": model,
"temperature": kwargs.get("temperature", 0),
"max_tokens": kwargs.get("max_tokens"),
}
key_string = json.dumps(key_data, sort_keys=True)
return hashlib.sha256(key_string.encode()).hexdigest()
def get(self, prompt: str, model: str, **kwargs) -> Optional[str]:
"""Try to get a cached response."""
key = self._make_key(prompt, model, **kwargs)
if key in self.cache:
self.hits += 1
return self.cache[key]["response"]
self.misses += 1
return None
def set(self, prompt: str, model: str, response: str, **kwargs):
"""Cache a response."""
if len(self.cache) >= self.max_size:
# Evict oldest entry
oldest_key = next(iter(self.cache))
del self.cache[oldest_key]
key = self._make_key(prompt, model, **kwargs)
self.cache[key] = {
"response": response,
"model": model,
}
@property
def hit_rate(self) -> float:
total = self.hits + self.misses
return self.hits / total if total > 0 else 0
# Usage with a decorator
llm_cache = LLMCache()
def cached_llm_call(func):
"""Decorator to cache LLM calls."""
@wraps(func)
def wrapper(prompt: str, model: str = "gpt-4o-mini", **kwargs):
# Only cache deterministic calls (temperature=0)
if kwargs.get("temperature", 0) == 0:
cached = llm_cache.get(prompt, model, **kwargs)
if cached is not None:
return cached
response = func(prompt, model, **kwargs)
if kwargs.get("temperature", 0) == 0:
llm_cache.set(prompt, model, response, **kwargs)
return response
return wrapper
@cached_llm_call
def call_llm(prompt: str, model: str = "gpt-4o-mini", **kwargs) -> str:
"""Call the LLM (with caching)."""
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
**kwargs,
)
return response.choices[0].message.content
3.4 Queue-Based Processing for Async AI Tasks
import asyncio
from collections import deque
from dataclasses import dataclass, field
from enum import Enum
from typing import Any
import uuid
import time
class TaskStatus(str, Enum):
PENDING = "pending"
PROCESSING = "processing"
COMPLETED = "completed"
FAILED = "failed"
@dataclass
class AITask:
id: str = field(default_factory=lambda: str(uuid.uuid4()))
task_type: str = ""
input_data: dict = field(default_factory=dict)
status: TaskStatus = TaskStatus.PENDING
result: Any = None
    error: str | None = None
created_at: float = field(default_factory=time.time)
    completed_at: float | None = None
class AITaskQueue:
"""
Simple async task queue for AI processing.
For production, use Celery + Redis or similar.
This demonstrates the pattern.
"""
def __init__(self, max_concurrent: int = 5):
self.queue: deque[AITask] = deque()
self.tasks: dict[str, AITask] = {}
self.max_concurrent = max_concurrent
self.semaphore = asyncio.Semaphore(max_concurrent)
def submit(self, task_type: str, input_data: dict) -> str:
"""Submit a task and return its ID."""
task = AITask(task_type=task_type, input_data=input_data)
self.queue.append(task)
self.tasks[task.id] = task
return task.id
def get_status(self, task_id: str) -> dict:
"""Get the status of a task."""
task = self.tasks.get(task_id)
if not task:
return {"error": "Task not found"}
return {
"id": task.id,
"status": task.status,
"result": task.result,
"error": task.error,
}
async def process_task(self, task: AITask):
"""Process a single task (override for your specific logic)."""
async with self.semaphore:
task.status = TaskStatus.PROCESSING
try:
# Route to appropriate handler
if task.task_type == "document_ingestion":
result = await self._ingest_document(task.input_data)
elif task.task_type == "question_answer":
result = await self._answer_question(task.input_data)
else:
raise ValueError(f"Unknown task type: {task.task_type}")
task.result = result
task.status = TaskStatus.COMPLETED
except Exception as e:
task.error = str(e)
task.status = TaskStatus.FAILED
finally:
task.completed_at = time.time()
async def _ingest_document(self, data: dict) -> dict:
"""Process document ingestion."""
# Simulate processing
await asyncio.sleep(2)
return {"num_chunks": 42, "status": "indexed"}
async def _answer_question(self, data: dict) -> dict:
"""Process a question."""
await asyncio.sleep(1)
return {"answer": "Processed answer", "sources": []}
async def run_worker(self):
"""Background worker that processes tasks from the queue."""
while True:
if self.queue:
task = self.queue.popleft()
asyncio.create_task(self.process_task(task))
await asyncio.sleep(0.1)
# FastAPI integration
# task_queue = AITaskQueue(max_concurrent=5)
# @app.on_event("startup")
# async def startup():
# asyncio.create_task(task_queue.run_worker())
# @app.post("/tasks/submit")
# async def submit_task(task_type: str, input_data: dict):
# task_id = task_queue.submit(task_type, input_data)
# return {"task_id": task_id, "status": "pending"}
# @app.get("/tasks/{task_id}")
# async def get_task_status(task_id: str):
# return task_queue.get_status(task_id)
4. Tech Stack Recommendations (2026)
The AI Engineering Stack (March 2026)
| Layer | Recommended | Alternatives |
|---|---|---|
| Language | Python 3.12+ | TypeScript (for full-stack) |
| Backend Framework | FastAPI | Flask, Django, Hono (TS) |
| LLM APIs | OpenAI, Anthropic | Google Gemini, Mistral, Groq |
| Open-Source LLMs | Ollama (local), vLLM (serving) | llama.cpp, TGI |
| Embeddings | OpenAI text-embedding-3-small | sentence-transformers, Cohere |
| Vector Store | Qdrant | Chroma, pgvector, Pinecone, Weaviate |
| Agent Framework | LangGraph | CrewAI, Autogen, custom |
| Evaluation | promptfoo, RAGAS | DeepEval, custom |
| Observability | LangSmith, Langfuse | Helicone, custom logging |
| Frontend | Streamlit (quick), Next.js (production) | Gradio, Chainlit |
| Deployment | Docker + Railway/Render | Modal, AWS/GCP, Fly.io |
| Database | PostgreSQL | SQLite (simple), Supabase |
5. Sample Project Walkthrough: AI Document Q&A System
Let us build Project #1 (AI Document Q&A) step by step, with complete working code. This serves as a template you can adapt for your own capstone.
5.1 Project Setup
# requirements.txt
"""
fastapi==0.115.0
uvicorn==0.30.0
python-multipart==0.0.9
openai==1.50.0
anthropic==0.35.0
qdrant-client==1.11.0
PyMuPDF==1.24.0
python-docx==1.1.0
sentence-transformers==3.0.0
pydantic==2.9.0
pydantic-settings==2.5.0
streamlit==1.38.0
python-dotenv==1.0.1
"""
# .env.example
"""
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY=
COLLECTION_NAME=documents
EMBEDDING_MODEL=text-embedding-3-small
LLM_MODEL=gpt-4o-mini
"""
5.2 Configuration
# app/config.py
from pydantic_settings import BaseSettings
from functools import lru_cache
class Settings(BaseSettings):
"""Application settings loaded from environment variables."""
# API Keys
openai_api_key: str = ""
anthropic_api_key: str = ""
# Vector Store
qdrant_url: str = "http://localhost:6333"
qdrant_api_key: str = ""
collection_name: str = "documents"
# Models
embedding_model: str = "text-embedding-3-small"
embedding_dimension: int = 1536
llm_model: str = "gpt-4o-mini"
# Chunking
chunk_size: int = 500
chunk_overlap: int = 50
# Retrieval
top_k: int = 10
rerank_top_k: int = 5
class Config:
env_file = ".env"
@lru_cache
def get_settings() -> Settings:
return Settings()
5.3 Document Ingestion Pipeline
# app/services/ingestion.py
import fitz # PyMuPDF
import docx
from pathlib import Path
from dataclasses import dataclass
@dataclass
class DocumentChunk:
"""A chunk of text from a document."""
text: str
metadata: dict # source, page_number, chunk_index, etc.
class DocumentProcessor:
"""Process documents into text chunks for embedding."""
def __init__(self, chunk_size: int = 500, chunk_overlap: int = 50):
self.chunk_size = chunk_size
self.chunk_overlap = chunk_overlap
def process_file(self, file_path: str, file_bytes: bytes = None) -> list[DocumentChunk]:
"""Process a file and return chunks."""
ext = Path(file_path).suffix.lower()
if ext == ".pdf":
text_by_page = self._extract_pdf(file_path, file_bytes)
elif ext == ".docx":
text_by_page = self._extract_docx(file_path, file_bytes)
elif ext in (".txt", ".md"):
text_by_page = self._extract_text(file_path, file_bytes)
else:
raise ValueError(f"Unsupported file type: {ext}")
# Chunk each page
chunks = []
for page_num, page_text in enumerate(text_by_page):
page_chunks = self._chunk_text(page_text)
for chunk_idx, chunk_text in enumerate(page_chunks):
chunks.append(DocumentChunk(
text=chunk_text,
metadata={
"source": Path(file_path).name,
"page_number": page_num + 1,
"chunk_index": chunk_idx,
"total_chunks": len(page_chunks),
}
))
return chunks
def _extract_pdf(self, file_path: str, file_bytes: bytes = None) -> list[str]:
"""Extract text from each page of a PDF."""
if file_bytes:
doc = fitz.open(stream=file_bytes, filetype="pdf")
else:
doc = fitz.open(file_path)
pages = []
for page in doc:
text = page.get_text()
if text.strip():
pages.append(text)
doc.close()
return pages
def _extract_docx(self, file_path: str, file_bytes: bytes = None) -> list[str]:
"""Extract text from a DOCX file."""
import io
if file_bytes:
doc = docx.Document(io.BytesIO(file_bytes))
else:
doc = docx.Document(file_path)
full_text = "\n".join([para.text for para in doc.paragraphs if para.text.strip()])
return [full_text] # DOCX does not have pages per se
def _extract_text(self, file_path: str, file_bytes: bytes = None) -> list[str]:
"""Extract text from a plain text file."""
if file_bytes:
text = file_bytes.decode("utf-8")
else:
with open(file_path, "r", encoding="utf-8") as f:
text = f.read()
return [text]
def _chunk_text(self, text: str) -> list[str]:
"""
Split text into overlapping chunks.
Uses sentence-aware splitting to avoid cutting mid-sentence.
"""
if len(text) <= self.chunk_size:
return [text.strip()] if text.strip() else []
# Simple sentence-aware chunking
sentences = text.replace("\n", " ").split(". ")
chunks = []
current_chunk = ""
for sentence in sentences:
sentence = sentence.strip()
if not sentence:
continue
# Add period back if it was removed by split
if not sentence.endswith("."):
sentence += "."
if len(current_chunk) + len(sentence) + 1 <= self.chunk_size:
current_chunk += (" " + sentence) if current_chunk else sentence
else:
if current_chunk:
chunks.append(current_chunk.strip())
current_chunk = sentence
if current_chunk.strip():
chunks.append(current_chunk.strip())
return chunks
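The splitter above does not apply the `chunk_overlap` setting from the config. An overlap-aware variant repeats the tail of each chunk at the start of the next, so facts straddling a boundary remain retrievable. A standalone character-level sketch (production chunkers overlap on sentences or tokens rather than raw characters):

```python
def chunk_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks where each chunk
    repeats the last `overlap` characters of the previous one."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Step forward by less than a full chunk so the windows overlap.
        start += chunk_size - overlap
    return chunks

parts = chunk_with_overlap("abcdefghij" * 20, chunk_size=80, overlap=20)
print(len(parts), parts[0][-20:] == parts[1][:20])  # → 4 True
```

Overlap costs extra embedding tokens and index size, which is why it is a tunable setting rather than a fixed default.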
5.4 Embedding and Vector Store
# app/services/embeddings.py
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import (
Distance, VectorParams, PointStruct, Filter,
FieldCondition, MatchValue,
)
from app.config import get_settings
import uuid
class EmbeddingService:
"""Manage embeddings and vector store."""
def __init__(self):
self.settings = get_settings()
self.openai_client = OpenAI(api_key=self.settings.openai_api_key)
self.qdrant = QdrantClient(
url=self.settings.qdrant_url,
api_key=self.settings.qdrant_api_key or None,
)
self._ensure_collection()
def _ensure_collection(self):
"""Create the vector collection if it doesn't exist."""
collections = [c.name for c in self.qdrant.get_collections().collections]
if self.settings.collection_name not in collections:
self.qdrant.create_collection(
collection_name=self.settings.collection_name,
vectors_config=VectorParams(
size=self.settings.embedding_dimension,
distance=Distance.COSINE,
),
)
print(f"Created collection: {self.settings.collection_name}")
def embed_texts(self, texts: list[str]) -> list[list[float]]:
"""Generate embeddings for a list of texts."""
response = self.openai_client.embeddings.create(
model=self.settings.embedding_model,
input=texts,
)
return [item.embedding for item in response.data]
def index_chunks(self, chunks: list, collection_id: str = "default"):
"""Index document chunks into the vector store."""
if not chunks:
return 0
texts = [chunk.text for chunk in chunks]
embeddings = self.embed_texts(texts)
points = []
for chunk, embedding in zip(chunks, embeddings):
point = PointStruct(
id=str(uuid.uuid4()),
vector=embedding,
payload={
"text": chunk.text,
"collection_id": collection_id,
**chunk.metadata,
},
)
points.append(point)
self.qdrant.upsert(
collection_name=self.settings.collection_name,
points=points,
)
return len(points)
def search(
self,
query: str,
collection_id: str = "default",
top_k: int = 10,
) -> list[dict]:
"""Search for relevant chunks given a query."""
query_embedding = self.embed_texts([query])[0]
results = self.qdrant.search(
collection_name=self.settings.collection_name,
query_vector=query_embedding,
limit=top_k,
query_filter=Filter(
must=[
FieldCondition(
key="collection_id",
match=MatchValue(value=collection_id),
)
]
),
)
return [
{
"text": hit.payload["text"],
"score": hit.score,
"source": hit.payload.get("source", "unknown"),
"page_number": hit.payload.get("page_number"),
}
for hit in results
]
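The `Distance.COSINE` setting above means Qdrant ranks hits by cosine similarity between the query embedding and each stored chunk embedding. As a quick intuition check, here is the formula as a standalone function; this is a sketch of the math, not Qdrant's actual implementation:

```python
# Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1].
# Vectors pointing the same direction score 1.0 regardless of magnitude.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Same direction -> 1.0; orthogonal -> 0.0
print(round(cosine_similarity([1.0, 0.0], [2.0, 0.0]), 3))  # 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 5.0]), 3))  # 0.0
```

This is also why the `score` returned by `search` is directly usable as a relevance signal: higher means closer in embedding space.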
5.5 RAG Pipeline
# app/services/rag.py
from openai import OpenAI
from app.config import get_settings
from app.services.embeddings import EmbeddingService
class RAGPipeline:
"""
Full RAG pipeline: retrieve, rerank, generate with citations.
"""
def __init__(self):
self.settings = get_settings()
self.embedding_service = EmbeddingService()
self.llm_client = OpenAI(api_key=self.settings.openai_api_key)
def answer_question(
self,
question: str,
collection_id: str = "default",
        model: str | None = None,
) -> dict:
"""
Full RAG pipeline to answer a question.
Steps:
1. Retrieve relevant chunks
2. Build context from chunks
3. Generate answer with LLM
4. Extract source citations
"""
import time
start_time = time.time()
model = model or self.settings.llm_model
# Step 1: Retrieve relevant chunks
retrieved = self.embedding_service.search(
query=question,
collection_id=collection_id,
top_k=self.settings.top_k,
)
if not retrieved:
return {
"answer": "I could not find any relevant information in the uploaded documents to answer this question.",
"sources": [],
"model_used": model,
"latency_ms": (time.time() - start_time) * 1000,
"token_usage": {},
}
# Step 2: Build context
context = self._build_context(retrieved)
# Step 3: Generate answer
system_prompt = """You are a helpful assistant that answers questions based on the provided context.
Rules:
1. Only answer based on the provided context. Do not use external knowledge.
2. If the context doesn't contain enough information, say so clearly.
3. Cite your sources using [Source: filename, Page X] format.
4. Be concise but thorough.
5. If multiple sources support your answer, cite all of them."""
user_prompt = f"""Context:
{context}
Question: {question}
Answer the question based only on the context above. Cite your sources."""
response = self.llm_client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt},
],
temperature=0,
max_tokens=1024,
)
answer = response.choices[0].message.content
latency_ms = (time.time() - start_time) * 1000
# Step 4: Format sources
sources = [
{
"document_name": r["source"],
"page_number": r["page_number"],
"chunk_text": r["text"][:200] + "...",
"relevance_score": r["score"],
}
for r in retrieved[:5] # Top 5 sources
]
return {
"answer": answer,
"sources": sources,
"model_used": model,
"latency_ms": latency_ms,
"token_usage": {
"prompt_tokens": response.usage.prompt_tokens,
"completion_tokens": response.usage.completion_tokens,
"total_tokens": response.usage.total_tokens,
},
}
def _build_context(self, chunks: list[dict], max_context_length: int = 4000) -> str:
"""Build a context string from retrieved chunks."""
context_parts = []
total_length = 0
for i, chunk in enumerate(chunks):
source_info = f"[Source {i+1}: {chunk['source']}"
if chunk.get("page_number"):
source_info += f", Page {chunk['page_number']}"
source_info += "]"
part = f"{source_info}\n{chunk['text']}\n"
if total_length + len(part) > max_context_length:
break
context_parts.append(part)
total_length += len(part)
return "\n".join(context_parts)
5.6 Simple Web UI
# frontend/app.py
import os
import streamlit as st
import requests
# Default to localhost for local dev; docker-compose overrides this via API_URL
API_URL = os.environ.get("API_URL", "http://localhost:8000")
st.set_page_config(
page_title="AI Document Q&A",
page_icon="📚",
layout="wide",
)
st.title("AI Document Q&A System")
st.markdown("Upload documents and ask questions about their content.")
# Sidebar: Document Upload
with st.sidebar:
st.header("Upload Documents")
uploaded_file = st.file_uploader(
"Choose a document",
type=["pdf", "docx", "txt", "md"],
)
if uploaded_file and st.button("Upload & Index"):
with st.spinner("Processing document..."):
files = {"file": (uploaded_file.name, uploaded_file.getvalue())}
try:
response = requests.post(f"{API_URL}/documents/upload", files=files)
if response.status_code == 200:
result = response.json()
st.success(f"Document uploaded: {result['filename']}")
st.info(f"Chunks created: {result['num_chunks']}")
else:
st.error(f"Error: {response.text}")
except requests.ConnectionError:
st.error("Cannot connect to the API server. Is it running?")
st.divider()
st.header("Settings")
model = st.selectbox("Model", ["gpt-4o-mini", "gpt-4o"])
max_sources = st.slider("Max Sources", 1, 10, 5)
# Main: Chat Interface
st.header("Ask Questions")
# Initialize chat history
if "messages" not in st.session_state:
st.session_state.messages = []
# Display chat history
for message in st.session_state.messages:
with st.chat_message(message["role"]):
st.markdown(message["content"])
if message.get("sources"):
with st.expander(f"Sources ({len(message['sources'])})"):
for source in message["sources"]:
st.markdown(f"**{source['document_name']}** "
f"(Page {source.get('page_number', 'N/A')}, "
f"Score: {source['relevance_score']:.3f})")
st.caption(source["chunk_text"])
# Chat input
if prompt := st.chat_input("Ask a question about your documents..."):
# Display user message
st.session_state.messages.append({"role": "user", "content": prompt})
with st.chat_message("user"):
st.markdown(prompt)
# Get answer from API
with st.chat_message("assistant"):
with st.spinner("Thinking..."):
try:
response = requests.post(
f"{API_URL}/ask",
json={
"question": prompt,
"model": model,
"max_sources": max_sources,
},
)
if response.status_code == 200:
result = response.json()
st.markdown(result["answer"])
# Show sources
if result["sources"]:
with st.expander(f"Sources ({len(result['sources'])})"):
for source in result["sources"]:
st.markdown(
f"**{source['document_name']}** "
f"(Page {source.get('page_number', 'N/A')}, "
f"Score: {source['relevance_score']:.3f})"
)
st.caption(source["chunk_text"])
# Show metrics
col1, col2, col3 = st.columns(3)
col1.metric("Latency", f"{result['latency_ms']:.0f}ms")
col2.metric("Tokens", result["token_usage"].get("total_tokens", 0))
col3.metric("Model", result["model_used"])
# Save to history
st.session_state.messages.append({
"role": "assistant",
"content": result["answer"],
"sources": result["sources"],
})
else:
st.error(f"Error: {response.text}")
except requests.ConnectionError:
st.error("Cannot connect to the API server.")
5.7 Evaluation Setup
# evaluation/eval_pipeline.py
"""
Evaluation pipeline for the Document Q&A system.
Tests retrieval quality and answer accuracy.
"""
from dataclasses import dataclass
from openai import OpenAI
import time
@dataclass
class TestCase:
question: str
expected_answer: str
expected_sources: list[str] # Expected document names
@dataclass
class EvalResult:
question: str
generated_answer: str
expected_answer: str
faithfulness_score: float # Is the answer grounded in sources?
relevancy_score: float # Is the answer relevant to the question?
correctness_score: float # Is the answer factually correct?
source_recall: float # Were the right sources retrieved?
latency_ms: float
class RAGEvaluator:
"""Evaluate RAG pipeline quality using LLM-as-judge."""
def __init__(self, api_url: str = "http://localhost:8000"):
self.api_url = api_url
self.judge = OpenAI()
def evaluate_test_cases(self, test_cases: list[TestCase]) -> list[EvalResult]:
"""Run evaluation on a set of test cases."""
import requests
results = []
for tc in test_cases:
start = time.time()
# Get answer from the system
response = requests.post(
f"{self.api_url}/ask",
json={"question": tc.question},
)
latency_ms = (time.time() - start) * 1000
if response.status_code != 200:
print(f"Error for question: {tc.question}")
continue
data = response.json()
answer = data["answer"]
sources = [s["document_name"] for s in data.get("sources", [])]
# Score with LLM-as-judge
faithfulness = self._score_faithfulness(tc.question, answer, data.get("sources", []))
relevancy = self._score_relevancy(tc.question, answer)
correctness = self._score_correctness(tc.question, answer, tc.expected_answer)
source_recall = self._compute_source_recall(sources, tc.expected_sources)
result = EvalResult(
question=tc.question,
generated_answer=answer,
expected_answer=tc.expected_answer,
faithfulness_score=faithfulness,
relevancy_score=relevancy,
correctness_score=correctness,
source_recall=source_recall,
latency_ms=latency_ms,
)
results.append(result)
print(f"Q: {tc.question[:50]}... | "
f"Faith: {faithfulness:.2f} | Rel: {relevancy:.2f} | "
f"Corr: {correctness:.2f} | SrcRec: {source_recall:.2f}")
return results
def _score_faithfulness(self, question: str, answer: str, sources: list) -> float:
"""Score if the answer is faithful to (grounded in) the sources."""
source_texts = "\n".join([s.get("chunk_text", "") for s in sources])
response = self.judge.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": f"""Rate the faithfulness of the answer to the given sources on a scale of 0 to 1.
A score of 1 means the answer is fully supported by the sources.
A score of 0 means the answer contains information not in the sources.
Sources:
{source_texts}
Question: {question}
Answer: {answer}
Return ONLY a number between 0 and 1."""
}],
temperature=0,
)
try:
return float(response.choices[0].message.content.strip())
except ValueError:
return 0.0
def _score_relevancy(self, question: str, answer: str) -> float:
"""Score if the answer is relevant to the question."""
response = self.judge.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": f"""Rate the relevancy of the answer to the question on a scale of 0 to 1.
Question: {question}
Answer: {answer}
Return ONLY a number between 0 and 1."""
}],
temperature=0,
)
try:
return float(response.choices[0].message.content.strip())
except ValueError:
return 0.0
def _score_correctness(self, question: str, answer: str, expected: str) -> float:
"""Score the correctness of the answer against the expected answer."""
response = self.judge.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": f"""Compare the generated answer with the expected answer.
Rate the semantic similarity and factual correctness on a scale of 0 to 1.
Question: {question}
Generated Answer: {answer}
Expected Answer: {expected}
Return ONLY a number between 0 and 1."""
}],
temperature=0,
)
try:
return float(response.choices[0].message.content.strip())
except ValueError:
return 0.0
def _compute_source_recall(self, retrieved: list[str], expected: list[str]) -> float:
"""Compute recall of expected source documents."""
if not expected:
return 1.0
hits = sum(1 for e in expected if e in retrieved)
return hits / len(expected)
def print_summary(self, results: list[EvalResult]):
"""Print evaluation summary."""
n = len(results)
if n == 0:
print("No results to summarize.")
return
avg = lambda vals: sum(vals) / len(vals)
print("\n" + "=" * 60)
print("EVALUATION SUMMARY")
print("=" * 60)
print(f"Test cases: {n}")
print(f"Avg Faithfulness: {avg([r.faithfulness_score for r in results]):.3f}")
print(f"Avg Relevancy: {avg([r.relevancy_score for r in results]):.3f}")
print(f"Avg Correctness: {avg([r.correctness_score for r in results]):.3f}")
print(f"Avg Source Recall: {avg([r.source_recall for r in results]):.3f}")
print(f"Avg Latency: {avg([r.latency_ms for r in results]):.0f}ms")
print("=" * 60)
5.8 Deployment
# Dockerfile
FROM python:3.12-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*
# Copy requirements first so Docker layer caching skips reinstalls
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Expose port
EXPOSE 8000
# Run the application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
# docker-compose.yml
services:
  api:
    build: .
    ports:
      - "8000:8000"
    env_file:
      - .env
    depends_on:
      - qdrant
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage
  streamlit:
    build:
      context: .
      dockerfile: Dockerfile.streamlit
    ports:
      - "8501:8501"
    environment:
      - API_URL=http://api:8000

volumes:
  qdrant_data:
6. Presentation and Showcase Tips
6.1 How to Demo an AI Project
The 5-Minute Demo Structure
- Problem (30s): What problem does this solve? Why does it matter? A concrete example of the pain point.
- Solution Demo (2min): Show the happy path. Upload a document, ask a question, get an answer with citations. Make it look effortless.
- Architecture (1min): Show the high-level architecture diagram. Explain key design decisions in 2-3 sentences.
- Key Technical Detail (1min): Go deep on one interesting technical challenge you solved. This shows depth.
- Results and Next Steps (30s): Share evaluation metrics. What would you improve with more time?
6.2 Key Metrics to Highlight
- Quality metrics: Faithfulness, relevancy, correctness scores from your evaluation pipeline
- Performance metrics: Latency (p50, p95), throughput, token usage
- Cost metrics: Cost per query, monthly projected cost at scale
- Scale metrics: Number of documents indexed, concurrent user support
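If you log latency and token usage per query (the `/ask` response already returns both), these numbers fall out of a few lines of Python. The sample latencies and per-token rates below are placeholders; substitute your own logs and your model's current pricing:

```python
# Nearest-rank percentile over logged samples -- a rough sketch, fine for a demo.
def percentile(values: list[float], pct: float) -> float:
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(round(pct / 100 * (len(ordered) - 1))))
    return ordered[idx]

# Hypothetical per-query latencies from your logs, in milliseconds
latencies_ms = [220, 340, 310, 1250, 280, 300, 260, 900, 330, 270]
print(f"p50: {percentile(latencies_ms, 50):.0f}ms")
print(f"p95: {percentile(latencies_ms, 95):.0f}ms")

# Cost per query = prompt_tokens * input_rate + completion_tokens * output_rate.
# Rates below are placeholders, expressed per token (i.e. dollars-per-1M / 1M).
INPUT_RATE, OUTPUT_RATE = 0.15 / 1_000_000, 0.60 / 1_000_000
cost = 3200 * INPUT_RATE + 450 * OUTPUT_RATE
print(f"cost/query: ${cost:.6f}")
```

Reporting p95 alongside p50 matters: a single slow retrieval or long generation can be several times the median, and the audience will ask about worst-case behavior.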
6.3 Common Pitfalls to Avoid
- Do not demo without a backup. Have screenshots or a recorded video in case the live demo fails.
- Do not show error states accidentally. Test your demo flow beforehand. Use pre-loaded data.
- Do not over-scope. A polished small project beats an unfinished large project every time.
- Do not ignore evaluation. "It works when I try it" is not enough. Show systematic evaluation.
- Do not skip error handling. Show that your system gracefully handles bad inputs, API failures, and edge cases.
- Do not hardcode API keys. Use environment variables. Show that you follow security best practices.
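The last point is easy to enforce mechanically: read keys from the environment and fail fast, with a clear message, when one is missing. A minimal sketch:

```python
# Fail fast on missing secrets instead of crashing later with a cryptic 401.
import os

def require_env(name: str) -> str:
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it "
            f"before starting the app."
        )
    return value

os.environ["OPENAI_API_KEY"] = "sk-demo"  # for illustration only
print(require_env("OPENAI_API_KEY"))  # sk-demo
```

Calling this once at startup turns a misconfigured deploy into an obvious one-line error instead of a mid-demo failure.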
Summary
Your Capstone Checklist
- Choose a project from the ideas above (or propose your own)
- Draw the architecture diagram before writing code
- Set up the project structure and dependencies
- Build the core AI pipeline first (get end-to-end working)
- Add the API layer with proper request/response models
- Build a usable frontend (Streamlit is fine)
- Set up evaluation and test with real data
- Dockerize and deploy
- Write a clear README
- Prepare your 5-minute demo
Next week in Week 16: AI Engineering Principles, we will wrap up the course with best practices, production patterns, career guidance, and a comprehensive resource list.