A project-by-project curriculum built on your existing data science foundation. Every project you build here produces a real, deployable artifact — something you can demo, share, and talk about in any job interview. Updated for March 2026 industry standards.
Each phase builds on the previous one. Phase 1 gives you the foundations that Phase 2 relies on. Don't skip ahead.
Each project entry gives the full details: what you'll build, what you'll learn, the interview questions it prepares you for, and the tech stack to use.
If you see an industry term you don't know, it is explained in the Glossary section below. Industry vocabulary is important — learn the language.
Every project should end with a live demo URL and a written technical summary. A working URL is worth 10 GitHub repositories to a recruiter.
These are the terms you will hear in interviews, job postings, and team meetings. Learn what each one means in plain English before you start building.
The roadmap is divided into three phases. Each phase has a clear goal and produces real, deployable projects. Do them in order.
Every company has thousands of internal documents — manuals, reports, contracts — that nobody can quickly search. A basic keyword search returns too many irrelevant results. This project builds the system that actually understands the meaning of a question and finds the right answer in a large collection of documents. The key differentiator: you will measure your system's accuracy with real metrics, not just "it seems to work."
Most RAG tutorials build a chatbot that "seems to work" but has no measurements. Your version has a live benchmark dashboard with real accuracy numbers (RAGAS scores). When an interviewer asks "how do you know it works?", you will have a precise, quantitative answer. This is the difference between a junior and a senior AI engineer's approach.
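Hybrid Search: The combination of meaning-based search (dense) and keyword-based search (sparse). It typically outperforms either method alone, because dense retrieval catches paraphrases while sparse retrieval catches exact terms such as names and part numbers. RRF (Reciprocal Rank Fusion) is a widely used algorithm for merging the two result lists: each document's score is the sum of 1/(k + rank) over the lists it appears in. Hybrid search is a common baseline in production RAG systems in 2026.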
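To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. The document IDs and the two input rankings are made up for illustration, and k=60 is only the conventional default constant, not a value prescribed by this roadmap.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results from a dense (embedding) search and a sparse (keyword) search.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that appear in both lists rise to the top even if neither search ranked them first, which is the behavior hybrid search relies on.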
Customizing a resume for each job application takes 1–2 hours and most people skip it. Those who do it manually miss half the important keywords. This agent automates the whole pipeline: read the job → understand requirements → match to your experience → write a tailored resume → score how well it fits. A 2-hour task becomes 30 seconds. This also teaches you LangGraph at a beginner level, which you will use at an advanced level in Phase 2.
Your portfolio audience is recruiters. Showing a recruiter a live demo where you type a job URL and watch a tailored resume appear in 30 seconds is immediately understood. It also demonstrates a complete engineering pipeline: data extraction → retrieval → generation → evaluation — all in one coherent product. The match score makes it measurable, not just "it generated something."
Structured Outputs: Getting an LLM to return data in a predictable format (like JSON) instead of free-form text. Instructor is a Python library that wraps LLM clients, validates each response against a Pydantic schema, and retries when validation fails. Without this, LLM outputs are too unreliable to feed into production code.
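The idea behind structured outputs can be sketched without any third-party library. The example below uses only the standard library as a stand-in for Pydantic and Instructor: it parses raw model text as JSON and rejects anything that does not match the expected fields. The `JobMatch` schema and `parse_job_match` helper are hypothetical names chosen for this illustration.

```python
import json
from dataclasses import dataclass

@dataclass
class JobMatch:
    """Hypothetical schema for a resume-to-job match result."""
    job_title: str
    match_score: float
    missing_keywords: list

def parse_job_match(raw_text: str) -> JobMatch:
    """Parse raw LLM text into a JobMatch, raising ValueError if it does not fit."""
    data = json.loads(raw_text)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != {"job_title", "match_score", "missing_keywords"}:
        raise ValueError(f"unexpected fields: {sorted(data)}")
    if not isinstance(data["job_title"], str):
        raise ValueError("job_title must be a string")
    score = data["match_score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("match_score must be a number")
    if not 0 <= score <= 100:
        raise ValueError("match_score must be between 0 and 100")
    keywords = data["missing_keywords"]
    if not isinstance(keywords, list) or not all(isinstance(k, str) for k in keywords):
        raise ValueError("missing_keywords must be a list of strings")
    return JobMatch(data["job_title"], float(score), keywords)

good = '{"job_title": "ML Engineer", "match_score": 82, "missing_keywords": ["Docker"]}'
result = parse_job_match(good)
print(result)
```

In a real pipeline, a ValueError here would trigger a retry with the validation message fed back to the model, which is roughly the loop that a schema-enforcing library automates.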
Every company has data analysts who spend hours doing the same exploratory analysis on every new dataset: check distributions, find outliers, look for correlations, test hypotheses, make charts. This is valuable work but it follows a repeatable pattern. This agent automates the pattern while preserving the insight. Because you already know what good analysis looks like, you are uniquely qualified to design this agent's tools correctly.
Most agent demos do trivial tasks like "search the web and summarize results." This agent does complex analytical work on real data. You can walk into any interview, hand them a CSV, run the agent live, and show a complete analysis report in 2 minutes. Your DS background is the reason the tools are actually good — not toy examples. This combination of domain expertise + agent architecture is rare.
ReAct Pattern: Reason + Act. The agent alternates between thinking about what to do next (reason) and actually doing it (act), then observes the result and reasons again. This back-and-forth loop is how most real-world agents work. LangGraph makes this loop explicit and controllable.
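The reason, act, observe loop can be sketched in a few lines. This is an illustrative skeleton, not LangGraph code: `scripted_model` is a hypothetical stand-in for an LLM, and the single `column_mean` tool and the `DATA` table are invented for the example.

```python
# Hypothetical in-memory dataset the agent can inspect.
DATA = {"price": [10.0, 20.0, 30.0]}

# Tool registry: tool name -> callable taking one argument.
TOOLS = {"column_mean": lambda column: sum(DATA[column]) / len(DATA[column])}

def scripted_model(history):
    """Stand-in for an LLM: picks the next step from what has happened so far."""
    observations = [value for kind, value in history if kind == "observation"]
    if not observations:
        return ("act", "column_mean", "price")  # nothing known yet, so use a tool
    return ("finish", f"The mean price is {observations[-1]}")

def react_loop(question, model, tools, max_steps=5):
    history = [("question", question)]
    for _ in range(max_steps):
        decision = model(history)                      # reason
        if decision[0] == "finish":
            return decision[1], history
        _, tool_name, argument = decision
        observation = tools[tool_name](argument)       # act
        history.append(("action", f"{tool_name}({argument})"))
        history.append(("observation", observation))   # observe, then loop again
    return None, history  # step budget exhausted without a final answer

answer, trace = react_loop("What is the average price?", scripted_model, TOOLS)
print(answer)
```

The `max_steps` cap is the important design choice: a real agent needs a hard budget so a confused model cannot loop forever.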
AI research is published at a pace that is impossible to follow manually. Researchers and engineers need to quickly evaluate whether a new technique works, but reproducing the experiments takes days or weeks. This agent compresses that to hours. It also handles the common situation where the paper required 8 powerful GPUs but you only have a laptop — the agent scales the experiment down and tells you how confident the scaled result is. This is recursive proof of AI capability: an AI agent that understands and implements AI research.
Very few portfolios include anything like this. It is technically deep (multi-agent, code generation, scientific evaluation), immediately impressive in a demo, and understood by any ML researcher or AI engineer. The Compute Estimator feature alone is a conversation starter in interviews. It shows you understand real-world constraints, not just ideal scenarios.
Supervisor + Sub-Agent Architecture: One "boss" agent (the Supervisor) receives the main task and delegates subtasks to specialized workers. Each worker agent has its own tools and focus area. The Supervisor collects results and makes the final decision. This is a common architecture for complex multi-agent systems.
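A stripped-down sketch of the supervisor pattern follows. The workers here are plain functions with hypothetical names (`paper_reader`, `code_writer`), and the plan is hard-coded; in a real system an LLM would produce the plan and each worker would be its own agent with its own tools.

```python
from typing import Callable, Dict, List, Tuple

def paper_reader(task: str) -> str:
    """Hypothetical worker that would summarize part of a paper."""
    return f"[reader] summary of: {task}"

def code_writer(task: str) -> str:
    """Hypothetical worker that would generate code."""
    return f"[coder] script for: {task}"

# Role name -> worker. The supervisor only knows workers through this registry.
WORKERS: Dict[str, Callable[[str], str]] = {"read": paper_reader, "code": code_writer}

def supervisor(plan: List[Tuple[str, str]], workers: Dict[str, Callable[[str], str]]) -> List[str]:
    """Delegate each (role, subtask) pair to the matching worker and collect results."""
    results = []
    for role, subtask in plan:
        worker = workers.get(role)
        if worker is None:
            # Surface unroutable work instead of failing silently.
            results.append(f"[supervisor] no worker for role '{role}'")
            continue
        results.append(worker(subtask))
    return results

plan = [("read", "section 3 of the paper"), ("code", "training loop"), ("deploy", "demo")]
report = supervisor(plan, WORKERS)
for line in report:
    print(line)
```

Keeping the registry separate from the routing logic means a new specialist can be added without touching the supervisor itself.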
Before MCP (the Model Context Protocol, introduced by Anthropic in November 2024), every AI tool that needed to connect to external data (GitHub, databases, Slack, etc.) required its own custom integration. A developer had to build separate code for "Claude connecting to GitHub" and "ChatGPT connecting to GitHub." With MCP, you build one server that connects to GitHub, and any MCP-compatible AI model can use it instantly. OpenAI adopted MCP in March 2025 and Google DeepMind followed in April 2025; it is now the de facto standard. In 2026, "running an MCP server has become almost as common as running a web server." Knowing how to build one puts you ahead of most AI engineers.
First-mover advantage. Few portfolios include an MCP server built from scratch. Building one yourself (not just using existing servers) demonstrates you understand the protocol at the architecture level, not just as a user. The fact that your server works with Claude, ChatGPT, and any future MCP-compatible AI model proves you understand the standardization, a key signal for senior roles.
The Thoughtworks Technology Radar, a widely followed technology assessment, placed MCP in "Trial" status in late 2025. Trial means the technology is worth pursuing and should be tried on projects that can handle the risk. By March 2026, it is the de facto standard for AI tool integration. Employers will increasingly expect AI engineers to know this protocol.
Standard RAG has a serious flaw: it always retrieves something, even when nothing relevant exists in the document collection. It then generates an answer based on irrelevant documents, and the answer sounds confident but is wrong. CRAG and Self-RAG (published research papers from 2024) address this with two mechanisms: (1) CRAG grades the retrieved documents before using them, and (2) Self-RAG lets the model decide whether to retrieve at all. You are bringing cutting-edge research into a production system and measuring whether it actually improves accuracy.
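The document-grading idea can be illustrated with a toy gate. This is a deliberately simplified stand-in: the grader below is a word-overlap score invented for the example, whereas CRAG itself uses a trained retrieval evaluator. The function names, the 0.5 threshold, and the sample documents are all assumptions.

```python
def _words(text: str) -> set:
    """Lowercased words with trailing punctuation stripped."""
    return {w.strip("?.,").lower() for w in text.split()}

def grade(question: str, document: str) -> float:
    """Toy relevance grade: fraction of question words that appear in the document."""
    q_words = _words(question)
    if not q_words:
        return 0.0
    return len(q_words & _words(document)) / len(q_words)

def answer_with_grading(question: str, retrieved: list, threshold: float = 0.5) -> dict:
    """Keep only documents that pass the grade; refuse to answer if none do."""
    kept = [doc for doc in retrieved if grade(question, doc) >= threshold]
    if not kept:
        # Standard RAG would generate from irrelevant text here; the gate stops that.
        return {"status": "no_relevant_context", "context": []}
    return {"status": "answer_from_context", "context": kept}

retrieved = [
    "The warranty period for the drill is two years.",
    "Office parking rules apply on weekdays.",
]

on_topic = answer_with_grading("What is the warranty period for the drill?", retrieved)
off_topic = answer_with_grading("Who won the hockey game?", retrieved)
print(on_topic["status"], off_topic["status"])
```

Swapping `grade` for an LLM-based or NLI-based grader leaves the gate logic unchanged, which makes the two variants easy to benchmark against each other.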
This project is both a GitHub repository and a published technical article. The combination of research-paper implementation + measured benchmarks + written findings is exactly what AI research roles and senior engineering roles look for. Your data science background makes the statistical analysis in the blog post credible and rigorous — you know how to run a proper experiment and present results.
Natural Language Inference (NLI) is a classical NLP task: given a premise sentence and a hypothesis sentence, determine whether the hypothesis is entailed (supported) by the premise, contradicted by it, or neutral (unrelated). It is used here to check whether generated answers are supported by the retrieved documents. You likely already know the underlying ML (it is a three-class classification model).
"Should we fine-tune or use RAG?" is asked in every senior AI engineering interview and every enterprise AI project meeting. Most people give an opinion. You will give data. This project is also where your idea for a "Synthetic Data Creator" is implemented — not as a standalone demo but as the engine that generates thousands of training examples you actually need, published as a separate open-source CLI tool.
The written report is the primary asset. Engineers who can design experiments, run them rigorously, and communicate findings clearly are rare. This study answers a question that companies pay AI consultants to answer. The Synthetic Data Engine, published separately on GitHub, gives you a second portfolio artifact from the same project. Both together show breadth (engineering tool) and depth (scientific study).
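LoRA (Low-Rank Adaptation): A technique that fine-tunes LLMs by freezing the original weights and training small low-rank adapter matrices, instead of modifying all billions of parameters. It is often cited as using 80–90% less memory than full fine-tuning; the actual saving depends on the model, the adapter rank, and the optimizer.
GGUF: A file format, used by llama.cpp and compatible runtimes, for storing (usually quantized) LLMs that can run efficiently on CPU.
Evol-Instruct: A method to generate progressively harder training examples from simple seeds using an LLM.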
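A quick calculation shows why LoRA adapters are small. The sketch below counts trainable parameters for a single weight matrix when it is replaced by two low-rank factors. The 4096 x 4096 shape and rank 8 are illustrative assumptions, and the result is a parameter count for one layer, not a measurement of total training memory.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int):
    """Compare full fine-tuning with LoRA for one d_out x d_in weight matrix.

    LoRA freezes the original matrix W and trains B (d_out x rank) and
    A (rank x d_in), so the update is the low-rank product B @ A.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora, lora / full

full, lora, ratio = lora_trainable_params(d_out=4096, d_in=4096, rank=8)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank 8):    {lora:,} trainable parameters")
print(f"ratio:            {ratio:.4%}")
```

Overall memory savings are smaller than this ratio suggests, because the frozen base weights and the activations still have to fit in memory during training.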
Quebec carpenters and tradespeople deal with CCQ (Commission de la construction du Québec) regulations, CNESST (formerly CSST) safety standards, union agreements, and supplier catalogs, all in French and all spread across PDFs and government websites. Finding a specific regulation requires knowing which document to look in, which requires knowing the regulatory structure, which takes years of experience. This platform makes that knowledge instantly accessible to anyone. The specificity of the domain is what makes the project credible and memorable: it solves a real problem for real people.
A live URL at a real domain, for real potential users, with a bilingual technical stack, multi-tenant architecture, and a written case study. Domain specificity is what separates this from the 10,000 generic RAG chatbots in portfolios. "I built a production AI platform serving Quebec tradespeople in French, handling regulatory, safety, and commercial queries through specialized sub-agents" is a story. "I built a RAG chatbot" is not.
Use Railway.app for your first deployment: it handles Docker containers, databases, and environment variables with minimal configuration, and is inexpensive to start on (check current pricing, since free-tier terms change). Once you're comfortable, migrate to Google Cloud Run for auto-scaling. Both work well for this project. Keep it running permanently: a dead demo is worse than no demo.
These are not separate projects — they are habits and tools you adopt as you go. Start each one when the timeline says "From Phase X." Each one makes every project better.
Getting reliable, typed data from LLMs is essential for production. Raw LLM text output is unpredictable. Instructor + Pydantic forces LLMs to return data that matches a strict schema. Every agent tool call in this roadmap depends on this skill.
You cannot debug, optimize, or explain a system you cannot see. Every LLM call should be logged: input, output, latency, tokens, cost. LangSmith is a widely used hosted option (free tier, with paid plans beyond it). Langfuse is an open-source alternative you can self-host. Start using one from your very first project.
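Before adopting a tracing platform, it helps to see how little is needed to capture the basics. The decorator below is a minimal sketch using only the standard library: it records input, output, and latency for every call in an in-memory list. `fake_llm` is a hypothetical stand-in for a real model call, and the whitespace word count is only a crude proxy for tokens, so cost is left out.

```python
import functools
import time

CALL_LOG = []  # in-memory trace; a real setup would ship records to a tracing backend

def traced(fn):
    """Record input, output, latency, and a rough size estimate for each call."""
    @functools.wraps(fn)
    def wrapper(prompt, **kwargs):
        start = time.perf_counter()
        output = fn(prompt, **kwargs)
        CALL_LOG.append({
            "name": fn.__name__,
            "input": prompt,
            "output": output,
            "latency_s": time.perf_counter() - start,
            # Word count, not real tokens: replace with the provider's usage numbers.
            "approx_tokens": len(prompt.split()) + len(output.split()),
        })
        return output
    return wrapper

@traced
def fake_llm(prompt):
    """Hypothetical model call that just echoes the prompt."""
    return f"echo: {prompt}"

fake_llm("summarize the report")
print(CALL_LOG[0])
```

Once every call goes through a wrapper like this, switching to a hosted or self-hosted tracer is a change in one place rather than in every call site.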
Agent systems make many LLM calls, and most of each call is spent waiting on the network. Synchronous (blocking) code kills performance because it waits for each call to finish before starting the next. Async Python lets you run multiple calls concurrently, so the total time approaches that of the slowest call rather than the sum of all of them. Learn this before building Phase 2 agents.
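The difference is easy to see with simulated calls. In this sketch, `fake_llm_call` is a hypothetical coroutine that sleeps instead of contacting a model, and `asyncio.gather` runs three of them concurrently. The 0.2 second delay is an arbitrary stand-in for network latency.

```python
import asyncio
import time

async def fake_llm_call(prompt: str, delay: float = 0.2) -> str:
    """Simulate a slow network call; the event loop is free while this sleeps."""
    await asyncio.sleep(delay)
    return f"answer to {prompt}"

async def run_concurrently(prompts):
    # gather starts all coroutines and returns results in the order of the inputs.
    return await asyncio.gather(*(fake_llm_call(p) for p in prompts))

start = time.perf_counter()
answers = asyncio.run(run_concurrently(["q1", "q2", "q3"]))
elapsed = time.perf_counter() - start

print(answers)
print(f"elapsed: {elapsed:.2f}s (three sequential calls would take about 0.6s)")
```

With real providers, concurrency usually needs a cap (for example an `asyncio.Semaphore`) so a burst of calls does not hit rate limits.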
A live URL is worth 10 GitHub repositories to a recruiter. Docker packages your application so it runs identically anywhere. Railway.app or Cloud Run deploys it to the internet with minimal setup. Learn Docker basics in Phase 2 so every Phase 3 project ships with a live demo.
Red teaming means deliberately trying to break your own AI system — testing for prompt injections, jailbreaks, misleading inputs, and failure modes. Companies building production AI need engineers who think about safety. A documented red-teaming section on any project is an extreme rarity in portfolios and signals senior thinking.
AI engineers who write clearly tend to attract more inbound recruiter interest. Write one technical post per project: what you built, what was hard, what you measured, what you learned. Projects 6 and 7 produce posts that could genuinely go viral in the AI engineering community. Start writing before you think you're ready.
Building the projects is 60% of the work. How you present, document, and position them is the other 40%.
Your data science background makes this natural and powerful. Every project should have a metrics section with real numbers. "My RAG system achieves 87% answer relevancy on RAGAS, compared to 61% for a baseline chatbot" is infinitely more compelling than "I built a RAG system." Numbers make you credible. Opinions make you another candidate.
Publish a technical post for each portfolio project. The fine-tuning benchmark (P7) and the CRAG comparison (P6) are research papers in disguise — treat them that way. Post on Medium, share on LinkedIn, and link from your GitHub. AI engineers who explain their work clearly are rare. Be one of them.
Use carpentry and trades regulation documents as your test data for Projects 1, 2, and 6. By the time you reach Project 8, you know the domain deeply and your portfolio tells a coherent story: "I specialized in AI for the trades and construction sector." A focused portfolio reads like expertise. A scattered portfolio reads like curiosity.
Never hide your data science background — it is a rare advantage. In every interview and LinkedIn post, frame it: "Classical ML gave me strong foundations in statistics and evaluation rigor. AI engineering adds orchestration, retrieval, and agent design. Here is how the two combine in my work." This narrative resonates with senior hiring panels who have seen too many AI engineers who cannot evaluate their own systems.