Module 16 · AI Engineering · GenAI track
Python & Data Foundations
Python is the lingua franca of AI work, so fluency is assumed. This module covers the practical Python and the ML vocabulary you'll be expected to handle without hesitation.
By the end you'll be able to explain, with conviction:
- Why Python won AI, and how to manage environments properly.
- What NumPy, Pandas, and FastAPI are for, and how Python's async model works.
- The ML vocabulary you must use correctly.
1. Python for AI work: why it dominates
Python isn't the fastest language — it won AI for entirely different reasons.
Python dominates AI for three compounding reasons: an unmatched ecosystem (NumPy, Pandas, PyTorch, the entire ML and GenAI tooling world is Python-first), readability that lets researchers and engineers iterate fast, and a role as glue — its heavy numerical work runs in optimised C/C++/CUDA under the hood, so Python orchestrates while fast native code does the math.
That last point answers the obvious objection. "Isn't Python slow?" — yes, the interpreter is, but in practice the hot loops live in compiled libraries (NumPy, the deep-learning frameworks), so Python's slowness rarely bites for AI workloads. It's a productivity and ecosystem win, with performance delegated to native code. Stating that nuance shows you understand why the tradeoff works.
Interview angle
"Python won AI on ecosystem and readability, not speed. The interpreter is slow, but the numerical heavy lifting runs in optimised C and CUDA under libraries like NumPy and PyTorch — Python is the glue that orchestrates fast native code, which is the best of both worlds for this work."
2. Virtual environments & packaging
Different projects need different dependency versions, and installing everything globally creates conflicts ("dependency hell"). A virtual environment gives each project an isolated set of packages. venv is the built-in tool; conda is common in data science because it also manages non-Python dependencies; modern tools like Poetry and uv handle environments and dependency resolution together.
The principle that matters (echoing reproducible builds, Module 09): pin your dependencies in a lockfile so the environment is reproducible across machines and CI. "Each project gets an isolated, pinned environment" is the professional baseline — and being able to say why (avoid conflicts, reproducibility) beats just naming a tool.
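To make "isolated and pinned" concrete, here is a minimal sketch using only the standard library. It checks whether the interpreter is running inside a venv-style environment and lists installed packages as exact `name==version` pins. The function names are hypothetical, and a real lockfile tool records more than this (for example, the fully resolved dependency tree).

```python
import sys
from importlib import metadata


def in_virtualenv() -> bool:
    """True when running inside a venv-style environment.

    In a venv, sys.prefix points at the environment while sys.base_prefix
    still points at the base interpreter. Conda environments are not
    detected by this check.
    """
    return sys.prefix != sys.base_prefix


def pinned_requirements() -> list[str]:
    """Return installed distributions as sorted 'name==version' strings."""
    pins = set()
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name")
            version = dist.version
        except Exception:
            # Skip distributions whose metadata cannot be read.
            continue
        if name and version:
            pins.add(f"{name}=={version}")
    return sorted(pins, key=str.lower)


print("inside a virtual environment:", in_virtualenv())
for line in pinned_requirements()[:5]:
    print(line)
```

Exact pins are what make the environment reproducible: a loose constraint such as "any recent version" can resolve differently on a colleague's machine or in CI.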
3. NumPy & Pandas essentials
These two are the bedrock of data work. NumPy provides the fast n-dimensional array and vectorised operations: the foundation that most of the numerical Python stack, Pandas included, builds on, and the array model that ML frameworks mirror in their tensors. Its superpower is vectorisation: operating on whole arrays at once in native code instead of Python loops, which is dramatically faster.
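The difference between looping and vectorising is easiest to see side by side. The sketch below assumes NumPy is installed; the formula and the array size are arbitrary choices for illustration, and the timings will vary by machine.

```python
import time

import numpy as np


def scale_loop(values):
    """Apply 2*v + 1 one element at a time in the interpreter."""
    return [2.0 * v + 1.0 for v in values]


def scale_vectorised(values):
    """Apply the same formula to the whole array in one native-code pass."""
    return 2.0 * values + 1.0


values = np.arange(200_000, dtype=np.float64)

start = time.perf_counter()
looped = scale_loop(values)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vectorised = scale_vectorised(values)
vector_seconds = time.perf_counter() - start

# Both approaches compute the same numbers; only the cost differs.
print("same result:", np.allclose(looped, vectorised))
print(f"loop: {loop_seconds:.4f}s, vectorised: {vector_seconds:.4f}s")
```

The vectorised version contains no Python-level loop at all: the expression `2.0 * values + 1.0` hands the whole array to compiled code.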
Pandas builds on NumPy to give you the DataFrame — a labelled, table-like structure for real-world data, with tools to load, clean, filter, group, join, and aggregate (think SQL in Python). The interview-relevant instinct is the same as §1: prefer vectorised operations over explicit Python loops — "don't iterate rows, vectorise" signals you understand performance in this world. Knowing these are the tools for tabular data wrangling is the expected fluency.
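As a small illustration of "SQL in Python", the sketch below derives a column, filters rows, and aggregates by group without iterating over rows. It assumes Pandas is installed; the column names and values are invented for the example.

```python
import pandas as pd

# Invented request log: one row per model call.
df = pd.DataFrame(
    {
        "model": ["a", "a", "b", "b", "b"],
        "latency_ms": [120, 80, 200, 220, 180],
        "tokens": [100, 50, 400, 500, 300],
    }
)

# Derived column: computed for every row at once, no explicit loop.
df["ms_per_token"] = df["latency_ms"] / df["tokens"]

# Filter with a boolean mask (roughly SQL's WHERE).
fast = df[df["latency_ms"] < 200]

# Aggregate per group (roughly SQL's GROUP BY).
mean_latency = df.groupby("model")["latency_ms"].mean()

print(fast)
print(mean_latency)
```

Each step operates on whole columns, which is the same vectorisation habit as in NumPy applied to labelled tabular data.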
4. Building an API with FastAPI
FastAPI is the modern Python web framework of choice for AI services (Module 13), because it's fast, async-native, and uses Python type hints to give you automatic request validation (via Pydantic) and auto-generated API docs for free.
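from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    text: str  # type hint → automatic validation

def classify(text: str) -> str:
    # Placeholder (not in the original): stands in for a real model call.
    return "positive" if "good" in text.lower() else "negative"

@app.post("/predict")
async def predict(q: Query):
    return {"label": classify(q.text)}  # auto-serialised to JSON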
Type hints drive validation and docs; async lets the endpoint handle many concurrent requests efficiently.
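This is why FastAPI is everywhere in GenAI: wrapping a model behind an HTTP endpoint is the standard deployment pattern, and FastAPI makes it minimal, typed, and production-ready. The async def matters for AI specifically: calls to hosted models and other APIs are I/O-bound (§5), so async lets one worker serve many in-flight requests while they wait. The benefit only appears when the handler actually awaits that I/O. A blocking or CPU-heavy call inside an async def (such as running a local model in-process) stalls the event loop, so declare that endpoint with plain def, which FastAPI runs in a thread pool, or offload the work.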
5. Async Python basics
Python's async/await enables concurrency for I/O-bound work — exactly the profile of AI apps, which spend most of their time waiting on LLM API calls, database queries, and network requests. While one request awaits a slow LLM response, the event loop (the same idea as Module 12's JS event loop) runs others, so a single process handles many concurrent requests instead of blocking on each.
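A minimal sketch of that behaviour, using only the standard library. Here `asyncio.sleep` stands in for waiting on a slow LLM or database response, and `fake_llm_call` is a hypothetical name for illustration.

```python
import asyncio
import time


async def fake_llm_call(name: str, delay: float) -> str:
    # asyncio.sleep simulates waiting on the network; while this coroutine
    # is suspended, the event loop is free to run the others.
    await asyncio.sleep(delay)
    return f"{name} done"


async def main() -> tuple[list[str], float]:
    start = time.perf_counter()
    # gather schedules all three calls concurrently and returns their
    # results in the order the calls were passed in.
    results = await asyncio.gather(
        fake_llm_call("a", 0.2),
        fake_llm_call("b", 0.2),
        fake_llm_call("c", 0.2),
    )
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(main())
print(results)
print(f"elapsed: {elapsed:.2f}s")
```

Run one after another, the three calls would take about 0.6 seconds. Because the waits overlap, the total is close to the longest single wait instead.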
The crucial distinction interviewers probe: async helps I/O-bound work (waiting), not CPU-bound work (computing). For CPU-heavy pure-Python tasks you need multiprocessing, because the GIL (Global Interpreter Lock) in standard CPython lets only one thread execute Python bytecode at a time, which rules out true multi-threaded parallelism. "Async for I/O concurrency, multiprocessing for CPU parallelism, and the GIL is why" is a complete, senior-level answer.
Common trap
Expecting async to speed up a CPU-bound computation is a classic misunderstanding. Async only helps when you're waiting; it gives concurrency, not parallelism. Heavy computation needs multiprocessing to sidestep the GIL.
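The trap can be made visible by recording the order in which two tasks start and finish. In this standard-library sketch (task names and workload are invented), the I/O-style tasks interleave, while the CPU-bound tasks run strictly one after the other because they never yield to the event loop.

```python
import asyncio

events: list[str] = []


async def io_task(name: str) -> None:
    events.append(f"{name}:start")
    await asyncio.sleep(0.01)  # yields control to the event loop
    events.append(f"{name}:end")


async def cpu_task(name: str) -> None:
    events.append(f"{name}:start")
    sum(i * i for i in range(200_000))  # pure computation, never awaits
    events.append(f"{name}:end")


async def run_pair(task_fn) -> list[str]:
    events.clear()
    await asyncio.gather(task_fn("a"), task_fn("b"))
    return list(events)


io_order = asyncio.run(run_pair(io_task))
cpu_order = asyncio.run(run_pair(cpu_task))

print("I/O-bound :", io_order)   # both tasks start before either ends
print("CPU-bound :", cpu_order)  # a finishes completely before b starts
```

Wrapping computation in `async def` changes nothing about where it runs: it still occupies the single thread that drives the event loop. Getting real parallelism for that work means moving it to separate processes, for example with the standard library's `multiprocessing` or `concurrent.futures.ProcessPoolExecutor`.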
6. Jupyter workflow
Jupyter notebooks mix code, output, and prose in runnable cells — ideal for the exploratory, iterative nature of data science and ML: load data, try something, see the chart, adjust, repeat, all in one document. They're the lab bench of AI work.
The mature caveat shows judgment: notebooks are exploration tools, not production tools. Their hidden state (cells run out of order), poor version-control diffs, and weak testing make them a bad home for production code. The professional pattern is prototype in a notebook, then refactor the working logic into proper, tested, version-controlled Python modules. Naming that "notebook to module" transition reassures interviewers you won't ship a notebook to prod.
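What the "notebook to module" step produces might look like the sketch below: logic that once lived across several cells becomes one pure function with explicit inputs, plus a test that can run in CI. The function and its behaviour are hypothetical examples, not taken from a specific project.

```python
def normalise_scores(scores: list[float]) -> list[float]:
    """Min-max scale scores into the range 0..1.

    Everything the function needs arrives as an argument, so it does not
    depend on variables left behind by earlier notebook cells.
    """
    if not scores:
        return []
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # All values equal: avoid dividing by zero.
        return [0.0 for _ in scores]
    return [(s - lowest) / (highest - lowest) for s in scores]


def test_normalise_scores() -> None:
    # Written in the style a test runner such as pytest would collect.
    assert normalise_scores([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]
    assert normalise_scores([]) == []
    assert normalise_scores([3.0, 3.0]) == [0.0, 0.0]


test_normalise_scores()
print("all checks passed")
```

The edge cases in the test (empty input, identical values) are exactly the kind that out-of-order cell execution tends to hide.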
7. ML vocabulary you must not fumble
Even in a non-ML-research role, garbling this vocabulary undercuts you. The essentials:
- Supervised vs unsupervised — learning from labelled data (classification, regression) vs finding structure in unlabelled data (clustering).
- Training / validation / test split — fit on training, tune on validation, judge final performance on the untouched test set.
- Overfitting vs underfitting — memorising the training data and failing to generalise, vs being too simple to capture the pattern. The central tension.
- Features & labels — the inputs and the target you predict.
- Inference vs training — using a trained model to predict, vs the (expensive) process of building it.
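The split and the overfitting/underfitting bullets above can be seen together in a small experiment. This is a minimal sketch assuming NumPy is installed; the synthetic sine data, the split sizes, and the polynomial degrees are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: features x and labels y (a noisy sine curve).
x = rng.uniform(-1.0, 1.0, size=60)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.size)

# Shuffle the row indices once, then carve out three disjoint sets.
idx = rng.permutation(x.size)
train_idx, val_idx, test_idx = idx[:12], idx[12:36], idx[36:]


def mse(coeffs, rows) -> float:
    """Mean squared error of a fitted polynomial on the given rows."""
    pred = np.polyval(coeffs, x[rows])
    return float(np.mean((pred - y[rows]) ** 2))


# Fit on training data only; compare candidates on validation data.
results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
    results[degree] = (mse(coeffs, train_idx), mse(coeffs, val_idx))
    print(
        f"degree {degree}: train MSE {results[degree][0]:.4f}, "
        f"validation MSE {results[degree][1]:.4f}"
    )

# Tune on validation, then judge once on the untouched test set.
best_degree = min(results, key=lambda d: results[d][1])
final_coeffs = np.polyfit(x[train_idx], y[train_idx], deg=best_degree)
test_mse = mse(final_coeffs, test_idx)
print(f"chosen degree {best_degree}, test MSE {test_mse:.4f}")
```

Training error can only fall or stay level as the degree grows, which is why it cannot be used to choose the model. The validation column is what exposes a fit that is too simple (underfitting) or one that has started to memorise noise (overfitting). Exact numbers depend on the seed and the data.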
If you discuss model quality, precision vs recall is the pair to get right: precision is "of what I flagged, how much was correct"; recall is "of what I should have flagged, how much I caught". F1 is their harmonic mean, so it is high only when both are. Using these terms precisely is what separates "I work near ML" from "I'm guessing."
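The definitions translate directly into counts of true positives, false positives, and false negatives. A minimal standard-library sketch follows; the labels are invented, with 1 meaning "should be flagged".

```python
def precision_recall_f1(y_true: list[int], y_pred: list[int]) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # flagged, correctly
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # flagged, wrongly
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.1f}")
# precision=0.667 recall=0.500 f1=0.571 accuracy=0.7
```

Here accuracy is 0.7 even though the model caught only half of the positives, which is the kind of gap that makes precision and recall more informative than accuracy alone.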
Interview angle
"The tension I'd anchor on is overfitting versus underfitting — a model that memorises the training set won't generalise, one that's too simple misses the pattern. That's why we hold out a test set we never train on, and why precision and recall, not just accuracy, tell the real quality story."
Recap — what you can now teach
- Python won AI on ecosystem and readability; native libraries do the fast numerical work.
- Use isolated, pinned environments per project for reproducibility.
- NumPy arrays + Pandas DataFrames for data; vectorise, don't loop.
- FastAPI (typed, async) is the standard way to serve a model over HTTP.
- Async for I/O concurrency, multiprocessing for CPU — the GIL is why.
- Prototype in Jupyter, ship refactored modules; never fumble core ML vocabulary.
Self-check
Say each answer out loud before revealing it.
If Python is slow, why does it dominate AI?
Ecosystem and readability — and the heavy numerical work runs in optimised C/CUDA libraries, so Python orchestrates fast native code rather than doing the math itself.
When does async help, and when doesn't it?
It helps I/O-bound work (waiting on APIs, DBs) by running other tasks meanwhile; it doesn't help CPU-bound work — that needs multiprocessing because of the GIL.
What's overfitting?
When a model memorises the training data (including noise) and fails to generalise to new, unseen data — caught by evaluating on a held-out test set.
Why prefer vectorised NumPy/Pandas over Python loops?
Vectorised operations run on whole arrays in optimised native code, which is dramatically faster than iterating in the Python interpreter.
Precision vs recall?
Precision: of what you flagged, how much was correct. Recall: of what you should have flagged, how much you caught. F1 balances the two.