← Curriculum 03 · Databases & Data ⏱ 80 min

Module 03 · Data · All tracks

Databases & Data

Data questions reward depth more than almost any other area, because the tradeoffs are concrete and measurable. This module takes you from writing SQL to reasoning about why it's slow and which database to pick.

80 min deep read 🎯 12 sections 📊 2 diagrams

By the end you'll be able to explain, with conviction:

  • Joins, CTEs and window functions — the SQL that separates seniors from juniors.
  • Why a query is slow, and how indexes and execution plans fix it.
  • ACID, CAP, and how to choose between SQL and the NoSQL families.

1Why SQL still rules

Fifty years on, the relational model is still the default — and understanding why is the foundation for everything else.

SQL endures because it's declarative: you describe what data you want, and the database's query planner figures out how to get it efficiently. That separation is enormously powerful — the same query keeps working as data grows, because the engine re-plans around indexes and statistics you never see. Decades of optimisation live behind that simple SELECT.

Underneath sits the relational model: data in tables (relations), rows as records, columns as attributes, and keys wiring it together. A primary key uniquely identifies a row; a foreign key references another table's primary key, enforcing referential integrity. This structure plus ACID guarantees (§8) is why relational databases remain the trustworthy default for anything where correctness matters — money, identity, orders.

💬 Interview angle

"SQL is declarative — I state what I want and the planner decides how. That, plus the relational model's keys and ACID guarantees, is why it's still the default whenever correctness and relationships matter."

2Joins — combining tables

A join combines rows from two tables on a related column. The four you must know cold differ only in which non-matching rows they keep:

  • INNER JOIN — only rows with a match in both tables. The default workhorse.
  • LEFT JOIN — all rows from the left table, plus matches from the right (NULL where none). "Everyone, and their orders if any."
  • RIGHT JOIN — the mirror image; rarely needed since you can flip the tables.
  • FULL OUTER JOIN — all rows from both sides, matched where possible.
flowchart LR subgraph INNER I[Only matching rows] end subgraph LEFT L[All left + matches] end subgraph FULL F[All rows, both sides] end
The join type is simply a decision about which non-matching rows survive.

Common trap

A LEFT JOIN with a WHERE condition on the right table silently becomes an inner join — the WHERE filters out the NULL rows you were trying to keep. Put right-table conditions in the ON clause instead. This is a favourite "spot the bug" question.

3Subqueries & CTEs

Both let you build a query in stages. A subquery nests one query inside another — fine for small, one-off needs, but deeply nested subqueries become unreadable. A Common Table Expression (CTE), written with WITH, names an intermediate result you can then reference like a table.

WITH recent_orders AS (
    SELECT customer_id, SUM(total) AS spend
    FROM orders
    WHERE created_at > NOW() - INTERVAL '30 days'
    GROUP BY customer_id
)
SELECT c.name, r.spend
FROM customers c
JOIN recent_orders r ON r.customer_id = c.id
WHERE r.spend > 1000;

The CTE names a step, so the final query reads top-to-bottom like prose instead of inside-out.

The senior point isn't performance — modern planners often treat them similarly — it's readability and composability. CTEs let you decompose a gnarly query into named steps, and they unlock recursion (e.g. walking an org chart), which plain subqueries can't do.

4Window functions — the senior signal

Window functions are the single biggest "this person knows SQL" tell. They perform a calculation across a set of rows related to the current row — without collapsing them the way GROUP BY does. You keep every row and get the aggregate alongside it.

The mechanism is the OVER() clause, which defines the "window." PARTITION BY splits rows into groups; ORDER BY orders them within each group. With that you get running totals, rankings, and row-to-row comparisons that would otherwise need ugly self-joins.

SELECT
    name,
    department,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
FROM employees;

Rank within each department, but keep every employee row — impossible to do cleanly with GROUP BY.

💬 Interview angle

"Window functions aggregate across related rows without collapsing them — so I can show each row alongside its running total or its rank within a partition. They replace messy self-joins for things like 'top N per group.'"

5Indexes — speed vs tradeoff

An index is a separate, sorted data structure (usually a B-tree) that lets the database find rows without scanning the whole table — the textbook analogy is a book's index versus reading every page. It turns an O(n) full scan into an O(log n) lookup, which is the difference between milliseconds and seconds at scale.

The tradeoff is the part juniors forget: indexes cost writes and storage. Every INSERT, UPDATE, and DELETE must also update each index, and indexes consume disk. So you index the columns you filter and join on heavily — not every column. A composite index's column order matters too: an index on (a, b) helps queries filtering on a or a, b, but not b alone.

💬 Interview angle

"An index trades write speed and storage for read speed — it's a sorted structure that turns a full scan into a log-n lookup. So I index hot filter and join columns, not everything, because each index taxes every write."

6Query optimization & execution plans

When a query is slow, you don't guess — you ask the database what it's doing. EXPLAIN (or EXPLAIN ANALYZE) shows the execution plan: the steps the planner chose, their estimated and actual costs, and crucially whether it's using an index or doing a full sequential scan.

The usual culprits, in rough order: a missing index forcing a seq scan on a big table; a function wrapped around an indexed column (WHERE LOWER(email) = … defeats the index); fetching far more rows or columns than needed; and stale statistics misleading the planner. The senior move is to read the plan, find the most expensive node, and address that — optimisation is measurement, not intuition.

Common trap

"Just add an index" isn't always the answer — if a query returns most of the table, a sequential scan is genuinely faster than thrashing an index. Always read the plan before and after; let the numbers decide.

7Normalization & when to denormalize

Normalization organises tables to eliminate redundancy — every fact lives in exactly one place. Third Normal Form (3NF) is the practical target: no repeating groups, every non-key column depends on the whole key, and nothing else. The payoff is integrity: update a customer's address once, and it's correct everywhere, because it was only stored once.

Denormalization deliberately reintroduces redundancy — duplicating data to avoid expensive joins — to speed up reads. It's a conscious trade: faster queries in exchange for the burden of keeping the copies in sync. The senior framing: normalize by default for correctness, denormalize selectively as a measured performance optimization on read-heavy paths, never as a starting point.

💬 Interview angle

"I normalize by default so each fact lives in one place and updates can't create contradictions. I denormalize only as a deliberate, measured optimisation for read-heavy paths — accepting the cost of keeping duplicates consistent."

8ACID & isolation levels

ACID is the set of guarantees that make transactions trustworthy:

  • Atomicity — all of a transaction happens, or none of it. No half-completed transfers.
  • Consistency — a transaction moves the database from one valid state to another, respecting all constraints.
  • Isolation — concurrent transactions don't corrupt each other; results look as if they ran one at a time.
  • Durability — once committed, it survives crashes and power loss.

Isolation has levels, because full isolation is expensive. Loosening it admits specific anomalies: dirty reads (reading uncommitted data), non-repeatable reads (a row changes between two reads), and phantom reads (new rows appear). The levels — Read Uncommitted, Read Committed, Repeatable Read, Serializable — trade performance for stricter guarantees, each ruling out more anomalies.

Go deeper

Most databases default to Read Committed — a pragmatic middle ground. You only reach for Serializable when correctness genuinely demands it (e.g. financial invariants), because it serialises conflicting transactions and can hurt throughput. Knowing the default and when to climb above it is the interview-ready nuance.

9SQL vs NoSQL — choosing

The mature answer is never "NoSQL is modern, SQL is old" — it's matching the data and access pattern to the store. SQL gives you a rigid schema, powerful ad-hoc queries and joins, and strong ACID guarantees; it shines for structured, relational data where correctness matters. NoSQL trades some of that for flexible schemas and easy horizontal scaling; it shines for huge volume, rapidly evolving shapes, or simple high-throughput access patterns.

SQL (relational)NoSQL
SchemaRigid, enforcedFlexible / schema-less
ScalingMostly verticalHorizontal by design
QueriesRich joins, ad-hocOptimised for known patterns
ConsistencyStrong (ACID)Often eventual (tunable)
Best forOrders, money, identityCatalogs, sessions, telemetry, scale

💬 Interview angle

"I pick by data shape and access pattern, not fashion. Relational for structured, related data where I need ad-hoc queries and strong consistency; NoSQL when I need flexible schemas or horizontal scale and the access patterns are known up front."

10NoSQL types

"NoSQL" is four different families, each suited to a shape of data:

flowchart TB N[NoSQL families] --> KV[Key-Value
Redis, DynamoDB] N --> DOC[Document
MongoDB] N --> COL[Wide-Column
Cassandra] N --> GR[Graph
Neo4j]
Pick the family that matches how your data is shaped and queried.
  • Key-Value — a giant hashmap; blazing-fast lookups by key. Caching, sessions, feature flags.
  • Document — JSON-like documents; flexible, nested, queryable. App data with evolving shapes.
  • Wide-Column — rows with dynamic columns, built for massive write throughput across clusters. Time-series, event logs.
  • Graph — nodes and edges as first-class citizens; excels at relationship-heavy queries. Social networks, recommendations, fraud rings.

11CAP theorem in plain words

CAP says that in a distributed system, when a network partition happens (nodes can't talk), you can keep only one of two things: Consistency (every read sees the latest write) or Availability (every request still gets a response). You don't get to choose Partition-tolerance away — partitions will happen — so the real choice is C-vs-A during a partition.

Make it concrete: a banking ledger chooses CP — it would rather refuse a request than show a wrong balance. A social feed chooses AP — it would rather show slightly stale likes than go down. Most real systems are tunable per-operation rather than globally one or the other.

Common trap

"Pick two of three" is the popular but misleading phrasing. Partition tolerance isn't optional in a distributed system — so CAP is really "when partitioned, choose consistency or availability." Saying that precisely sets you apart.

12ORM & the N+1 trap

An ORM (Object-Relational Mapper) maps database rows to objects so you work in your language instead of raw SQL. It removes boilerplate and SQL-injection footguns — but it also hides the queries it generates, and that hiding causes the single most common performance bug: the N+1 query problem.

It happens when you fetch a list (1 query) and then lazily access a related field on each item, firing one more query per item (N queries). A page showing 100 orders with their customer can quietly become 101 round-trips to the database.

# N+1: 1 query for orders, then 1 per order for its customer
orders = Order.objects.all()
for o in orders:
    print(o.customer.name)        # lazy load → a query each iteration

# Fixed: eager-load the relation in a single join
orders = Order.objects.select_related("customer")

The fix is eager loading (JOIN/IN) — fetch the related data up front in one or two queries.

💬 Interview angle

"ORMs are productive but they hide the SQL, which breeds the N+1 problem — one query for a list, then one per row for a relation. I watch for it and fix it with eager loading, and I'm never afraid to drop to raw SQL for a hot query."

Recap — what you can now teach

  • SQL endures because it's declarative over a relational model with keys and ACID.
  • Window functions and CTEs are the senior SQL signals; joins differ only in which non-matches survive.
  • Indexes trade write speed and storage for read speed; read the execution plan before optimising.
  • Normalize for correctness, denormalize as a measured read optimisation.
  • ACID guarantees trustworthy transactions; isolation levels trade strictness for throughput.
  • Choose SQL vs NoSQL by data shape and access pattern; know the four NoSQL families.
  • CAP = consistency or availability during a partition; watch ORMs for the N+1 trap.

Self-check

Say each answer out loud before revealing it.

What do window functions do that GROUP BY can't?

What's the tradeoff when you add an index?

State CAP precisely.

What is the N+1 problem and how do you fix it?

When would you denormalize?