
Module 07 · Design · GenAI / Full-Stack

System Design

System-design rounds reward process over trivia. With a repeatable framework you can reason through any prompt out loud — which is exactly what's being graded.

80 min deep read 🎯 9 sections 📊 2 diagrams

By the end you'll be able to:

  • Drive any design question with a clear, repeatable framework.
  • Reason about scaling, availability, consistency, and estimation.
  • Walk through two classic designs end-to-end.

1. The framework: how to drive any design question

The interviewer is grading how you think, not whether you recall a "correct" diagram. A framework keeps you in control.

flowchart TB
    A["1 · Clarify requirements<br/>functional + non-functional"] --> B["2 · Estimate scale<br/>users, QPS, storage"]
    B --> C["3 · API + data model"]
    C --> D["4 · High-level design<br/>draw the boxes"]
    D --> E["5 · Deep-dive a component"]
    E --> F["6 · Address bottlenecks<br/>scale, cache, replicate"]
A six-step spine you can apply to any prompt — narrate each step aloud.

The single biggest mistake is jumping straight to boxes and arrows. Start by clarifying: what are the functional requirements (what it does) and — more importantly — the non-functional ones (scale, latency, availability, consistency)? Those constraints drive every later decision. Stating assumptions out loud and getting buy-in is half the score.

💬 Interview angle

"Before I draw anything I'd nail down requirements and scale — read-heavy or write-heavy, how many users, what latency and consistency we need. Those non-functional constraints decide the whole design, so I anchor on them first."

2. Scalability: vertical vs horizontal

Vertical scaling (scale up) means a bigger machine with more CPU and RAM. It is simple, but the largest machine you can buy is a hard ceiling, and that one machine is a single point of failure. Horizontal scaling (scale out) means more machines behind a load balancer: capacity grows by adding nodes and losing one node doesn't take the service down, but it forces you to handle distributed state.

The thing that enables horizontal scaling is statelessness (Module 04): if servers hold no session, any one can serve any request and you add capacity freely. State gets pushed to shared stores — databases, caches, object storage. For the data tier, scaling out means replication (copies for read scaling and redundancy) and sharding (partitioning data across nodes by a key for write scaling).
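Sharding by key can be sketched in a few lines. This is a minimal illustration, assuming a fixed shard count and a hypothetical `shard_for` helper (neither comes from the module); it only shows how a stable hash sends every request for the same key to the same node.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash.

    Python's built-in hash() is salted per process for strings, so a
    digest is used to keep the mapping identical on every server.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical user IDs spread across 4 shards.
placement = {user: shard_for(user, 4) for user in ("alice", "bob", "carol")}
print(placement)
```

Plain modulo placement moves most keys when the shard count changes, which is why consistent hashing is the usual refinement to mention if the interviewer pushes on resharding.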

💬 Interview angle

"I default to horizontal scaling because it removes the single point of failure and has no hard ceiling. The enabler is keeping services stateless and pushing state to shared stores; for the database I scale reads with replicas and writes with sharding."

3. High availability & fault tolerance

Availability is the percentage of time a system is up, often quoted in "nines": 99.9% ("three nines") allows about 8.8 hours of downtime a year; 99.99% allows about 53 minutes. Each extra nine cuts the downtime budget tenfold and costs real money and complexity, so you match the target to the business need rather than chasing five nines reflexively.
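The downtime figures are simple arithmetic, and it helps to be able to reproduce them on the spot. A small sketch, assuming a 365-day year; the function name is illustrative:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760; leap years ignored


def downtime_hours_per_year(availability_percent: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for target in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.1f} min)")
```

Three nines works out to roughly 8.76 hours a year and four nines to roughly 52.6 minutes.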

You buy availability with redundancy and no single points of failure: multiple instances across multiple availability zones, automatic failover, health checks that route around dead nodes, and replicated data. Fault tolerance goes further — the system keeps working through a failure, not just recovering after. The resilience patterns from Module 06 (circuit breakers, retries) are how you achieve graceful degradation rather than total collapse.
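"Health checks that route around dead nodes" can be illustrated with a toy load balancer. This is a sketch under stated assumptions: the `Instance` and `RoundRobinBalancer` names are hypothetical, and the `healthy` flag stands in for whatever a real health checker would set.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Instance:
    name: str
    zone: str
    healthy: bool = True  # a real health checker would update this flag


class RoundRobinBalancer:
    """Rotate over instances, skipping any that failed their health check."""

    def __init__(self, instances):
        self._instances = list(instances)
        self._counter = count()

    def pick(self) -> Instance:
        healthy = [i for i in self._instances if i.healthy]
        if not healthy:
            raise RuntimeError("no healthy instances available")
        return healthy[next(self._counter) % len(healthy)]


pool = [Instance("a", "zone-1"), Instance("b", "zone-2"), Instance("c", "zone-3")]
balancer = RoundRobinBalancer(pool)
pool[1].healthy = False  # simulate a failed health check in zone-2
print([balancer.pick().name for _ in range(4)])
```

Because the instances sit in different zones, losing one zone still leaves capacity elsewhere, which is the redundancy argument in miniature.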

4. Consistency models

This is where CAP (Module 03) becomes a design lever. Strong consistency means every read sees the latest write — intuitive and necessary for money or inventory, but it costs latency and availability under partitions. Eventual consistency means replicas converge over time; a read might briefly be stale, but the system stays fast and available. Perfect for likes, view counts, feeds.

The senior move is to apply consistency per feature, not globally. A social app might use strong consistency for a user's password change but eventual consistency for their follower count. Naming where you'd accept staleness — and where you absolutely wouldn't — shows real judgment.
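Per-feature consistency can be made concrete with a toy primary/replica pair. This is only a sketch: `ReplicatedStore`, `NEEDS_STRONG_READ`, and the feature names are hypothetical, and replication is triggered by hand so the stale window is visible.

```python
class ReplicatedStore:
    """Toy primary/replica pair; the replica only catches up on sync()."""

    def __init__(self):
        self._primary = {}
        self._replica = {}

    def write(self, key, value):
        self._primary[key] = value  # the replica lags until sync() runs

    def sync(self):
        self._replica = dict(self._primary)

    def read(self, key, strong: bool):
        source = self._primary if strong else self._replica
        return source.get(key)


# Hypothetical policy: True means the read must see the latest write.
NEEDS_STRONG_READ = {"password_hash": True, "follower_count": False}


def read_feature(store: ReplicatedStore, feature: str, user_id: str):
    return store.read(f"{user_id}:{feature}", strong=NEEDS_STRONG_READ[feature])


store = ReplicatedStore()
store.write("u1:follower_count", 10)
store.sync()
store.write("u1:follower_count", 11)  # not yet replicated
print(read_feature(store, "follower_count", "u1"))  # may be stale until sync()
```

The password read always goes to the primary and pays for it in latency; the follower count tolerates a briefly stale replica in exchange for cheaper reads.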

Common trap

Demanding strong consistency everywhere "to be safe" is a costly junior instinct. It needlessly sacrifices availability and latency where stale data would be perfectly fine. Match the model to each feature's real requirement.

5. Caching layers & CDN

Caching appears at every layer of a real system, and naming the layers signals depth: browser cache, CDN (edge), load-balancer/reverse-proxy cache, application cache (Redis/Memcached), and the database's own buffer cache. Each cuts load and latency for the layers behind it.

A CDN deserves its own mention: it serves static assets (and increasingly dynamic content) from edge locations physically near the user, slashing latency and offloading your origin. The rule of thumb: push reads as close to the user as you can tolerate staleness for. Combined with the caching strategies from Module 06, this is usually the highest-leverage performance win in a design.
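The staleness trade-off behind every cache layer can be sketched as a time-to-live cache. This is an illustrative in-process version, assuming a hypothetical `TTLCache` class and a caller-supplied `loader` that stands in for the slower layer behind it (origin server, database, and so on).

```python
import time


class TTLCache:
    """Serve a cached value until it is older than ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock    # injectable so expiry can be tested without sleeping
        self._entries = {}     # key -> (value, expires_at)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]                 # hit: still within the TTL
        value = loader(key)                 # miss or expired: ask the slower layer
        self._entries[key] = (value, now + self._ttl)
        return value


cache = TTLCache(ttl_seconds=60)
print(cache.get_or_load("logo.png", lambda key: f"bytes of {key}"))
```

The TTL is the freshness requirement expressed as a number: a longer TTL offloads more reads but lets data go staler.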

💬 Interview angle

"I think about caching at every layer — CDN at the edge for static assets, Redis at the app tier for hot data, down to the DB buffer cache. The art is pushing reads as close to the user as the freshness requirement allows."

6. Back-of-envelope estimation

Rough numbers justify your design choices and show you can reason about scale. You're not expected to be exact — you're expected to be in the right order of magnitude and to show the method.

The technique: start from users → derive requests per second → derive storage and bandwidth. Keep a few anchors handy: 1 day = 86,400 s (round to ~100k for mental math), so 1M daily actions ≈ 12/s on average (≈ 10/s with the rounded figure), and peak may be 5–10× that. For storage, multiply item size × count × retention. Always separate read QPS from write QPS: most systems are read-heavy by 10:1 or more, and that ratio decides where you cache and replicate.
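The users → QPS → storage chain can be written out as a short calculation. Every input below (1M writes a day, a 10:1 read ratio, a 5× peak factor, 500-byte items, 5-year retention) is an assumed figure for illustration, not a number from the module.

```python
SECONDS_PER_DAY = 86_400


def average_qps(actions_per_day: float) -> float:
    return actions_per_day / SECONDS_PER_DAY


def storage_bytes(item_size_bytes: int, items_per_day: int, retention_days: int) -> int:
    return item_size_bytes * items_per_day * retention_days


# Assumed inputs, for illustration only.
writes_per_day = 1_000_000
read_write_ratio = 10
peak_factor = 5

write_qps = average_qps(writes_per_day)
read_qps = write_qps * read_write_ratio
peak_read_qps = read_qps * peak_factor
five_year_storage = storage_bytes(500, writes_per_day, 5 * 365)

print(f"write QPS ~{write_qps:.0f}, read QPS ~{read_qps:.0f}, peak read ~{peak_read_qps:.0f}")
print(f"storage over 5 years ~{five_year_storage / 1e9:.0f} GB")
```

With these inputs the write path is tiny, the read path is a few hundred requests per second at peak, and storage stays just under a terabyte, so the design pressure is on reads rather than capacity.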

Go deeper

Useful latency anchors to quote: memory access ~100 ns, an SSD read ~100 µs, a network round-trip within a datacentre ~0.5 ms, cross-continent ~100+ ms. These "numbers every engineer should know" let you reason about where time actually goes.

7. Worked mini-design: URL shortener

A classic because it touches reads, writes, storage, and caching. Requirements: shorten a long URL to a short code; redirect on lookup; very read-heavy (redirects ≫ creations).

Core design: on create, generate a unique short code and store the mapping code → long URL. The code can be a base-62 encoding of an auto-increment ID (compact, collision-free) or a hash (needs collision handling). On read, look up the code and return a 301/302 redirect.
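The base-62 option can be sketched as follows. The alphabet order (digits, then lowercase, then uppercase) and the function names are assumptions; any fixed 62-character alphabet works as long as encode and decode agree.

```python
import string

# 62 symbols: 0-9, a-z, A-Z (the ordering is an arbitrary choice).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode_base62(n: int) -> str:
    """Turn a non-negative auto-increment ID into a short code."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, remainder = divmod(n, 62)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode_base62(code: str) -> int:
    """Recover the numeric ID from a short code."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n


print(encode_base62(123456789), decode_base62(encode_base62(123456789)))
```

Seven characters cover 62^7 ≈ 3.5 trillion IDs, which is a handy number to quote when sizing the code length.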

Scaling the reads: redirects dominate, so put the mapping behind a cache (Redis) — most lookups never hit the database. The data is simple key-value, so a key-value or well-indexed store scales horizontally with sharding by code. Tradeoff to mention: 301 (permanent) lets browsers cache the redirect and offloads you, but means you lose per-click analytics; 302 keeps every click flowing through you for tracking. Naming that tradeoff is the "senior" beat.
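The read path can be sketched as a cache-first lookup that also picks the redirect status. Plain dictionaries stand in for Redis and the database here, and the `resolve` function and `track_clicks` flag are hypothetical names used only for illustration.

```python
from typing import Optional, Tuple


def resolve(code: str, cache: dict, database: dict,
            track_clicks: bool) -> Optional[Tuple[int, str]]:
    """Return (status, long_url) for a short code, or None if it is unknown."""
    long_url = cache.get(code)
    if long_url is None:
        long_url = database.get(code)   # cache miss: fall back to the store
        if long_url is None:
            return None                 # the caller would answer 404
        cache[code] = long_url          # populate the cache for later lookups
    # 302 keeps every click flowing through the service; 301 lets browsers cache.
    status = 302 if track_clicks else 301
    return status, long_url


database = {"abc": "https://example.com/a/very/long/path"}
cache = {}
print(resolve("abc", cache, database, track_clicks=True))
```

After the first lookup the code is served from the cache, so repeat redirects for hot codes never touch the database.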

💬 Interview angle

"Since it's overwhelmingly read-heavy, I'd encode an auto-increment ID in base-62 for collision-free short codes, store a simple key-value mapping, and cache hot codes in Redis so most redirects skip the database. I'd use 302 if I need click analytics, 301 if I'd rather offload to browser caches."

8. Worked mini-design: rate limiter

Requirements: cap a client to N requests per time window; work across many distributed servers; be fast.

The standard answer is the token bucket: each client has a bucket that refills at a fixed rate up to a cap; each request spends a token; an empty bucket means the request is rejected with 429 Too Many Requests. It allows short bursts (up to the bucket's cap) while enforcing an average rate (the refill rate). Alternatives worth naming: fixed-window (simple, but a client can send up to 2× the limit in a short span straddling a window boundary) and sliding-window (smoother, more state).
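A single-process token bucket fits in a few lines. This is a minimal sketch, assuming a hypothetical `TokenBucket` class with lazy refill (tokens are topped up when a request arrives rather than by a background timer); it is not thread-safe and not distributed.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` while averaging `refill_per_second`."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock          # injectable so refill can be tested without sleeping
        self._tokens = capacity      # start full so an initial burst is allowed
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Lazy refill: add the tokens earned since the last call, capped at capacity.
        self._tokens = min(self.capacity,
                           self._tokens + elapsed * self.refill_per_second)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False                 # the caller would respond 429


bucket = TokenBucket(capacity=3, refill_per_second=1)
print([bucket.allow() for _ in range(4)])
```

The capacity sets the burst size and the refill rate sets the long-run average, which are the two knobs worth naming in the interview.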

The distributed twist is the real test: counters must be shared across all servers, so you keep them in a fast central store like Redis, using atomic operations to avoid race conditions when many servers update the same client's count at once. That "where does the shared counter live, and how do I make it atomic" question is what they're probing.

flowchart LR
    R[Request] --> LB[Servers]
    LB -->|atomic incr| Redis[("Redis<br/>token buckets")]
    Redis -->|allowed| OK[Forward]
    Redis -->|empty| NO[429 Too Many Requests]
Shared counters in Redis with atomic ops keep the limit correct across every server.
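The shared-counter idea can be sketched without a real Redis. In the example below, `InMemoryCounterStore` is a hypothetical stand-in whose lock-protected `incr` plays the role of an atomic increment such as Redis `INCR`, and the limiter is a fixed-window counter rather than a token bucket, because that is the simplest scheme a single atomic increment can support.

```python
import threading


class InMemoryCounterStore:
    """Stand-in for a shared store; the lock makes incr() atomic."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def incr(self, key: str) -> int:
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + 1
            return self._counts[key]


def is_allowed(store, client_id: str, limit: int,
               window_seconds: int, now: float) -> bool:
    """Fixed-window check: one atomic increment per request."""
    window = int(now // window_seconds)
    count = store.incr(f"rate:{client_id}:{window}")
    return count <= limit


store = InMemoryCounterStore()
print([is_allowed(store, "client-1", limit=2, window_seconds=60, now=5.0)
       for _ in range(3)])
```

Because the increment and the read of the new value happen as one step, two servers can never both see "count = limit" for the same request slot. In a real deployment the per-window keys would also need an expiry, and a distributed token bucket would typically need its read-modify-write wrapped in a server-side script so it stays atomic.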

9. Spotting the bottleneck

Strong candidates don't just build a system — they critique their own. After sketching a design, proactively ask "where does this fall over first?" Usually it's the database (the hardest tier to scale — address with read replicas, caching, sharding), a single point of failure (add redundancy), or a synchronous hot path that should be made async via a queue (Module 06).

This habit mirrors the whole module: design is iterative. Build the simple version, find where it breaks at the stated scale, and apply the specific tool — cache, replica, shard, queue, CDN — that relieves that pressure. Narrating that loop is what separates a senior answer from a memorised one.

💬 Interview angle

"After I sketch a design I immediately ask where it breaks first — usually the database. Then I apply the targeted fix: replicas and caching for read pressure, sharding for write pressure, a queue to make a hot synchronous path async. Design is iterative, and I want to show that loop."

Recap — what you can now teach

  • Drive with the framework: clarify → estimate → API/data → high-level → deep-dive → bottlenecks.
  • Prefer horizontal scaling via statelessness; scale data with replicas (reads) and shards (writes).
  • Availability comes from redundancy and no single points of failure; match the "nines" to the need.
  • Apply consistency per feature — strong where it matters, eventual where staleness is fine.
  • Cache at every layer; estimate by deriving QPS and storage from users.
  • For any design, name where it breaks first and apply the targeted fix.

Self-check

Say each answer out loud, then check it against the relevant section.

What's the first thing you do in a system-design question, and why?

What enables horizontal scaling of the app tier?

Why not use strong consistency everywhere?

How do you make a rate limiter work across many servers?

Where does a system usually bottleneck first, and what's the fix?