Module 15 · AI Engineering · GenAI track

GenAI & LLM Fundamentals

This is the deepest module in the course because GenAI is where the differentiated roles are right now. We go from "what is an LLM" to building production RAG and agent systems you can defend in detail.

⏱ 95 min deep read 🎯 17 sections 📊 3 diagrams

By the end you'll be able to explain, with conviction:

How an LLM works — tokens, attention, sampling — without hand-waving.
RAG end-to-end, the retrieval stack, and when to fine-tune instead.
Agents, evaluation, guardrails, and the cost/latency levers of real systems.

1What an LLM actually is

Strip away the mystique and an LLM is a very sophisticated next-token predictor.

A Large Language Model is a neural network trained on vast amounts of text to do one deceptively simple thing: predict the next token given the preceding ones. Generation is that prediction run repeatedly — predict a token, append it, predict the next. The astonishing capabilities (reasoning, coding, translation) are emergent from doing this extremely well at massive scale.

The implication that matters for engineering: an LLM has no database of facts and no inherent notion of truth — it generates statistically likely continuations. That's precisely why it can hallucinate (produce fluent falsehoods) and why its knowledge is frozen at its training cutoff. Everything else in this module — RAG, tools, guardrails — exists to compensate for those two facts. Stating the mechanism plainly is the foundation of every good GenAI answer.

💬 Interview angle

"An LLM is a next-token predictor trained at scale — its capabilities emerge from doing that well. Crucially it has no fact store and no sense of truth, so hallucination and a knowledge cutoff aren't bugs but consequences of the mechanism, which is exactly what RAG and tools are there to address."

2Tokens, embeddings & context windows

Models don't see words; they see tokens — sub-word chunks (roughly ¾ of a word on average in English). Text is split into tokens, each mapped to an ID. This matters practically: pricing and limits are counted in tokens, and odd tokenisation explains why models sometimes miscount letters or struggle with rare strings.

Each token becomes an embedding — a high-dimensional vector that captures meaning, such that semantically similar things sit close together in vector space. Embeddings are the bridge between language and maths, and the foundation of semantic search (§9). The context window is the maximum number of tokens the model can attend to at once — its working memory for a request. Everything (system prompt, history, retrieved docs, the question, and the answer) must fit. Exceeding it means truncation or failure, which is why context management (Module 14) is a real skill.

⚠ Common trap

The context window is not memory between calls — LLM APIs are stateless. A chatbot "remembers" only because the app resends the prior conversation each turn. Conflating the context window with persistent memory is a giveaway of shallow understanding.

3Transformers & attention — the intuition

The Transformer is the architecture behind modern LLMs, and its breakthrough is the attention mechanism. You don't need the maths; you need the intuition: when processing each token, attention lets the model weigh how relevant every other token is to it, and focus accordingly. It dynamically decides what to "pay attention to."

flowchart LR T1[The] --> A((Self-Attention)) T2[animal] --> A T3[didn't cross because] --> A T4[it] --> A A -.binds it to animal.-> Out[Contextual meaning]

Attention resolves "it" by weighting "animal" heavily — context is computed, not assumed.

In "the animal didn't cross the street because it was tired," attention is how the model links "it" to "animal." Two properties made Transformers win: attention captures long-range dependencies across the whole sequence, and unlike older recurrent models it processes tokens in parallel, which is what made training at today's scale feasible. "Attention weighs the relevance of every token to every other, in parallel" is the one-line answer.

4Temperature & top-p — controlling output

At each step the model produces a probability distribution over possible next tokens; sampling parameters control how you pick from it. Temperature scales randomness: low (→0) makes output focused and deterministic (it picks the most likely token — best for factual, code, extraction tasks); high (→1+) flattens the distribution for more diverse, creative, surprising output. Top-p (nucleus sampling) restricts the choice to the smallest set of tokens whose probabilities sum to p, trimming the unlikely tail.

The practical heuristic to state: low temperature for anything needing correctness and consistency (classification, JSON, SQL, factual Q&A), higher for creative generation (brainstorming, copy). Knowing that temperature 0 is your friend for structured, repeatable output is exactly the applied detail interviewers want.

5Prompt engineering — zero/few-shot & CoT

Prompting is how you steer a frozen model, and a few techniques carry most of the value. Zero-shot — just ask. Few-shot — include a handful of input→output examples in the prompt; the model infers the pattern, dramatically improving format adherence and accuracy on specialised tasks. Chain-of-Thought (CoT) — ask the model to "think step by step," which markedly improves reasoning on multi-step problems by letting it work through intermediate steps rather than blurting a final answer.

Beyond those: give clear role and instructions, specify the output format explicitly, and put the most important constraints where they're salient. The underlying truth (Module 14) is that prompting is precise communication — you're programming in natural language. "Few-shot examples for format, chain-of-thought for reasoning" is a strong, concrete answer.

💬 Interview angle

"I reach for few-shot examples when I need a specific format or behaviour, and chain-of-thought when the task needs multi-step reasoning — letting the model show its work raises accuracy. Underneath, prompting is just precise instruction; ambiguity is what produces bad output."

6ReAct & tool use

An LLM alone can't fetch live data, do reliable arithmetic, or take actions. Tool use fixes this: you give the model a set of tools (search, a calculator, an API, code execution) and it decides when to call them. ReAct (Reason + Act) is the pattern that makes this work as a loop: the model reasons about what to do, takes an action (calls a tool), observes the result, and repeats until it can answer.

This is the conceptual heart of agents (§13). It turns a static text generator into something that can interact with the world and ground its answers in real, current data — directly attacking the hallucination and knowledge-cutoff problems from §1. The "reason → act → observe → repeat" loop is the one to be able to draw.

7Function calling

Function calling is the structured mechanism behind tool use. You describe your functions to the model — names, descriptions, and a JSON schema of parameters — and instead of replying in prose, the model can return a structured request: "call get_weather with {city: 'Paris'}." Your code executes the function and feeds the result back; the model then continues with that data.

The key clarifications that show real understanding: the model never runs the function — it only asks you to, returning structured arguments your application executes. And it relies on good function descriptions and schemas to choose correctly — vague tool descriptions cause wrong calls, just like vague prompts cause wrong text. It's the typed, reliable bridge between the model's intent and your system's actions.

⚠ Common trap

Saying "the LLM calls the API" is imprecise. The model only emits a structured request to call it; your code runs it and returns the result. That separation is also where you enforce permissions and validation — the model doesn't get raw execution power.

8RAG — the architecture

Retrieval-Augmented Generation is the most important production pattern in GenAI, because it solves the two core LLM limitations: stale knowledge and hallucination. The idea: before answering, retrieve relevant information from your own knowledge base and inject it into the prompt as context, so the model answers grounded in real, current, private data it was never trained on.

flowchart LR Q[User question] --> EMB[Embed query] EMB --> VDB[(Vector DB
search)] VDB --> CTX[Top-k relevant chunks] CTX --> P["Prompt: context + question"] P --> LLM[LLM] LLM --> A[Grounded answer + citations]

Retrieve relevant chunks by semantic similarity, stuff them into the prompt, then generate grounded in them.

Two phases: indexing (offline) — split your documents into chunks, embed each, store the vectors; and retrieval + generation (per query) — embed the question, find the most similar chunks, and pass them to the model with the question. The payoffs: answers cite sources, knowledge updates by re-indexing (no retraining), and the model stays grounded. "RAG grounds the model in retrieved private data to cut hallucination and bypass the training cutoff" is the headline.

💬 Interview angle

"RAG retrieves relevant chunks from my own data and injects them as context so the model answers grounded in real, current sources instead of its frozen weights. It directly attacks hallucination and the knowledge cutoff, lets me cite sources, and updates by re-indexing rather than retraining."

9Vector databases & embeddings

RAG's retrieval step needs to find text by meaning, not keywords — and that's what a vector database does. You embed every chunk (§2) into a vector, and at query time embed the question and find the nearest vectors by similarity (cosine distance). Because embeddings place similar meanings close together, "how do I reset my password" matches a doc titled "account recovery" even with no shared words — that's semantic search, the leap beyond keyword matching.

At scale, exact nearest-neighbour search is too slow, so vector DBs (Pinecone, Weaviate, pgvector, FAISS) use ANN (approximate nearest neighbour) indexes like HNSW to trade a tiny bit of accuracy for huge speed. Increasingly, hybrid search — combining semantic vectors with keyword (BM25) matching — wins, because pure semantic search can miss exact terms like product codes. Naming hybrid search signals you've moved past the textbook.

10Chunking & retrieval strategies, reranking

RAG quality lives or dies on retrieval, and the details are where senior candidates shine. Chunking — how you split documents — is critical: too large and you dilute relevance and waste context; too small and you lose meaning. Good practice is semantically-aware chunking (by paragraph/section) with some overlap so ideas aren't cut mid-thought.

Reranking is the high-leverage second stage: retrieve a broad set of candidates cheaply (say top 50), then use a more expensive, accurate cross-encoder reranker to reorder and keep the best few for the prompt. This two-stage "retrieve wide, rerank precise" pattern noticeably lifts answer quality. Add query rewriting (reformulating the user's question for better retrieval) and you've named the levers that separate a toy RAG from a production one.

↗ Go deeper

When a RAG system gives bad answers, debug retrieval first: did the right chunks come back at all? Most RAG failures are retrieval failures (bad chunking, missing reranking, wrong top-k), not generation failures. Saying "I'd inspect what was retrieved before blaming the model" is a strong diagnostic instinct.

11Fine-tuning vs RAG vs prompting

A classic decision question — the framing is "knowledge vs behaviour." Prompting is the first resort: cheapest, fastest, no training; tune it before anything else. RAG is for giving the model knowledge it lacks — current, private, or factual data — and for grounding/citations. Fine-tuning changes the model's behaviour, style, or format by training on examples; it's for "talk like this / always output this structure / do this narrow task well," not for injecting facts.

Need	Reach for
Current / private / factual knowledge	RAG
Specific tone, format, or narrow skill	Fine-tuning
General steering, fastest iteration	Prompting

The senior soundbite: "RAG for knowledge, fine-tuning for behaviour, prompting first." The common mistake is fine-tuning to add facts — it's expensive, bakes in knowledge that goes stale, and RAG does it better and updatably.

💬 Interview angle

"I start with prompting, use RAG when the model needs knowledge it doesn't have, and fine-tune only to change behaviour, style, or format. People reach for fine-tuning to add facts, but that's RAG's job — fine-tuned facts are expensive and go stale."

12LangChain / LlamaIndex / LangGraph

These frameworks save you from wiring everything by hand. LangChain — a broad toolkit for chaining LLM calls, tools, memory, and prompts into pipelines. LlamaIndex — focused on RAG and data ingestion/indexing; strongest when retrieval is the centre of your app. LangGraph — models agent workflows as a graph with state, giving you explicit control over loops, branching, and multi-step agent control flow.

The mature note to add: frameworks accelerate prototyping but add abstraction and can obscure what's happening — and the underlying API calls are simple enough that many production teams use them thinly or drop to raw SDK calls for control. Knowing the concepts (chains, retrievers, agents, state graphs) matters more than any one framework, since they all implement the same ideas you've learned here.

13Agents & multi-agent systems

An agent combines everything so far: an LLM as the "brain" in a loop (ReAct, §6), with tools to act, memory for context, and the autonomy to plan and pursue a goal over multiple steps. The leap from a chatbot is autonomy and action — it decides what to do next and does it, rather than just answering.

flowchart TB G[Goal] --> O[Orchestrator agent] O --> A1[Research agent] O --> A2[Coding agent] O --> A3[Review agent] A1 --> O A2 --> O A3 --> O O --> R[Result]

Multi-agent: an orchestrator decomposes a goal and delegates to specialised agents, then synthesises.

Multi-agent systems use several specialised agents that collaborate — e.g. an orchestrator delegating to research, coding, and review agents. The benefit is separation of concerns and parallelism (each agent has a focused role and context); the cost is coordination complexity, more failure modes, and higher token spend. The honest framing interviewers respect: agents are powerful but error-prone and hard to make reliable — you constrain them with good tools, clear roles, and verification (Module 14), and reach for multi-agent only when a single agent genuinely can't cope.

14Evaluation — hallucination, faithfulness, RAGAS

"How do you know it's any good?" is the question that separates engineers from demo-builders. LLM output is non-deterministic and open-ended, so you can't unit-test it conventionally — you need evaluation. Key dimensions for RAG: faithfulness (is the answer grounded in the retrieved context, or made up?), answer relevance (does it address the question?), and context relevance/recall (did retrieval surface the right material?).

Approaches to name: a curated eval set of question→expected-answer pairs you run on every change (regression testing for prompts); LLM-as-a-judge (using a strong model to score outputs against criteria) for scale; human review for the gold standard; and frameworks like RAGAS that score the RAG dimensions above. The principle: treat prompts and pipelines as code that needs a test suite — without evals you're flying blind, and "it looked good in the demo" isn't engineering.

💬 Interview angle

"I build an eval set of representative cases and run it on every prompt or pipeline change — regression testing for non-deterministic systems. For RAG I track faithfulness and context relevance with something like RAGAS or an LLM-as-judge, plus human spot-checks. No evals means no idea if a change helped or hurt."

15Guardrails & responsible AI

Production LLM apps need guardrails — controls on both input and output. On input: filtering malicious content and defending against prompt injection (where user or retrieved text tries to override your instructions — the GenAI cousin of SQL injection from Module 05). On output: checking for toxicity, PII leakage, off-topic drift, and format validity before anything reaches the user.

Prompt injection deserves a specific mention as the signature new vulnerability — and the honest point is that it isn't fully solved, so you defend in layers: separate trusted instructions from untrusted input, least-privilege on any tools the model can call, and validate outputs. Broader responsible AI concerns — bias, fairness, transparency, privacy, human oversight on high-stakes decisions — round it out. Showing you think about safety and misuse, not just capability, marks real maturity.

⚠ Common trap

Treating the model's instructions as a hard security boundary is a mistake — a clever prompt can talk past them. Don't give an LLM tools or data access you wouldn't give an untrusted user, because injected text may end up driving those tools.

16Cost & latency optimization

Real LLM systems live or die on cost and speed, billed per token (Module 10's cost-awareness, applied). The levers worth naming: pick the right-sized model — use a small, fast, cheap model for easy tasks and reserve the large one for hard ones (often routing between them); cache repeated or similar requests; stream tokens so perceived latency drops even if total time is unchanged (Module 04's SSE); and trim context — every unnecessary token costs money and latency.

Other tactics: shorten outputs where possible, batch where you can, and for agents, cap the number of steps to prevent runaway loops. The framing that lands: cost, latency, and quality are a triangle you tune per use case — a customer-facing chat needs low latency and a strong model; a nightly batch job can use a cheaper model and tolerate slowness. Treating these as first-class design constraints is the senior signal.

17MCP awareness

The Model Context Protocol (MCP) is an open standard for connecting LLM applications to external tools and data sources through a common interface — think "a universal adapter for giving models context and capabilities." Instead of writing bespoke integrations for every data source and every app, a tool exposes an MCP server once, and any MCP-aware client can use it.

You don't need deep detail — awareness is enough. The point to make: MCP standardises the messy, fragmented integration layer (the same way an API gateway or a driver standardises access elsewhere in this course), making tools and context portable across AI applications. Dropping "MCP is emerging as a standard way to plug tools and data into LLM apps" shows you're tracking where the ecosystem is heading, not just where it's been.

Recap — what you can now teach

An LLM is a next-token predictor with no fact store — hence hallucination and a knowledge cutoff.
Tokens → embeddings; the context window is finite working memory, and APIs are stateless.
Attention weighs every token's relevance, in parallel; low temperature for correctness.
RAG grounds the model in retrieved data; quality lives in chunking, hybrid search, and reranking.
RAG for knowledge, fine-tuning for behaviour, prompting first; agents add autonomy and action.
Build evals (faithfulness, relevance), defend against prompt injection, and tune the cost/latency/quality triangle.

Self-check

Say each answer out loud before revealing it.

Why do LLMs hallucinate, mechanistically?

In one line, what does RAG do and why?

RAG vs fine-tuning vs prompting — when each?

When function calling, does the model execute the function?

How do you evaluate a non-deterministic LLM feature?

Next module → 16 · Python & Data Foundations