System Design Interview Deep Dive
Focus on HOW to design and WHY we make each decision
1. Design a URL Shortener (bit.ly)
This is often the first system design question you'll encounter. It seems simple but tests fundamental concepts: hashing, database design, caching, and scale estimation.
Step 1: Clarify Requirements
Questions to ask the interviewer:
- How many URLs per day? → "100 million new URLs/day"
- How long should shortened URLs be? → "As short as possible"
- Can users choose custom aliases? → "Yes, optionally"
- Do URLs expire? → "Default 5 years, configurable"
- Analytics needed? → "Basic click counts"
Step 2: Back-of-Envelope Estimation
Write Traffic:
100M URLs/day = 100M / 86400 ≈ 1,200 URLs/second
Read Traffic (assume 100:1 read:write ratio):
1,200 × 100 = 120,000 redirects/second
Storage (over 5 years):
100M URLs/day × 365 days × 5 years = 182.5 billion URLs
Average URL = 500 bytes (long URL + short URL + metadata)
182.5B × 500 bytes = ~91 TB
URL Length:
Using base62 (a-z, A-Z, 0-9): 62^7 = 3.5 trillion combinations
7 characters is sufficient for 182.5 billion URLs
WHY base62? We want URLs that are URL-safe without encoding. Base64 includes + and / which need URL encoding. Base62 (alphanumeric only) gives clean URLs like bit.ly/abc1234.
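HOW the conversion works in code — a minimal sketch (the digit ordering 0-9a-zA-Z here is an arbitrary choice; any fixed 62-character alphabet works, and real services may order it differently):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(num: int) -> str:
    """Convert a non-negative integer to a base62 string."""
    if num == 0:
        return ALPHABET[0]
    chars = []
    while num > 0:
        num, rem = divmod(num, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))  # most significant digit first

def base62_decode(s: str) -> int:
    """Inverse of base62_encode."""
    num = 0
    for ch in s:
        num = num * 62 + ALPHABET.index(ch)
    return num
```

Note the round-trip property: any 7-character code decodes back to a unique integer below 62^7, which is why 7 characters cover the 182.5 billion URLs estimated above.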
Step 3: High-Level Design
graph LR
Client([Client]) --> LB[Load Balancer]
LB --> API[API Servers]
API --> Cache[(Redis Cache)]
API --> DB[(Database)]
Cache -.-> DB
subgraph "Write Path"
API -->|"POST /shorten"| IDGen[ID Generator]
IDGen --> DB
end
subgraph "Read Path"
API -->|"GET /:shortUrl"| Cache
Cache -->|"Cache Miss"| DB
end
style Client fill:#88c0d0,color:#2e3440
style LB fill:#a3be8c,color:#2e3440
style API fill:#ebcb8b,color:#2e3440
style Cache fill:#bf616a,color:#2e3440
style DB fill:#b48ead,color:#2e3440
style IDGen fill:#d08770,color:#2e3440
Step 4: Deep Dive - The Core Problem
How do we generate unique short URLs?
This is THE key design decision. There are three main approaches:
Option A: Hash the Long URL
MD5(longUrl) = "5d41402abc4b2a76b9719d911017c592"
Take first 7 chars = "5d41402"
Convert to base62 = "aB3x9Kp"
Trade-offs:
- ✅ Same URL always produces same short URL (idempotent)
- ✅ No coordination between servers needed
- ❌ Collisions! Two different URLs might hash to same 7 chars
- ❌ Need collision resolution (append counter, rehash)
HOW to handle collisions:
- Generate hash, check if exists in DB
- If collision: append incrementing counter and rehash
- Repeat until unique
Problem: Multiple DB lookups on collision. Under high write load, this becomes expensive.
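The collision loop above can be sketched like this (an in-memory set stands in for the DB uniqueness check, and the first 7 hex digits stand in for the final base62 step):

```python
import hashlib

def short_code(long_url: str, existing: set, length: int = 7) -> str:
    """Hash-based short code with append-counter collision resolution.
    `existing` is a stand-in for a database uniqueness lookup."""
    salt = 0
    while True:
        candidate = long_url + (str(salt) if salt else "")
        digest = hashlib.md5(candidate.encode()).hexdigest()
        code = digest[:length]
        if code not in existing:
            existing.add(code)
            return code
        salt += 1  # collision: append incrementing counter and rehash
```

Each retry costs another existence check, which is the DB-lookup cost the text warns about.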
Option B: Auto-Increment ID + Base62 Encoding (Recommended)
Database auto-increment ID: 123456789
Convert to base62: 123456789 → "8M0kX"
WHY this is better:
- ✅ Guaranteed unique - no collisions ever
- ✅ Simple, predictable
- ✅ Short URLs get longer over time (starts at 1 char, grows)
- ❌ Single point of failure (one DB generating IDs)
- ❌ Predictable URLs (security concern - can enumerate)
HOW to scale ID generation:
Use multiple ID generators with different ranges:
Server 1: IDs 0, 2, 4, 6, 8... (even numbers)
Server 2: IDs 1, 3, 5, 7, 9... (odd numbers)
Or use ranges:
Server 1: 0 - 1 billion
Server 2: 1 billion - 2 billion
...
Twitter's Snowflake ID: 64-bit IDs = timestamp + datacenter + machine + sequence
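A minimal sketch of a Snowflake-style generator (bit widths follow Twitter's published 41/5/5/12 layout; the epoch constant and locking details here are illustrative, not Twitter's exact implementation):

```python
import threading
import time

class SnowflakeGenerator:
    """64-bit IDs: 41-bit ms timestamp | 5-bit datacenter | 5-bit machine | 12-bit sequence."""
    EPOCH = 1288834974657  # custom epoch in ms (assumption for this sketch)

    def __init__(self, datacenter_id: int, machine_id: int):
        self.datacenter_id = datacenter_id & 0x1F
        self.machine_id = machine_id & 0x1F
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    # 4096 IDs in one ms exhausted: spin until the next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.EPOCH) << 22) | (self.datacenter_id << 17) \
                   | (self.machine_id << 12) | self.sequence
```

Because the timestamp occupies the high bits, IDs from one generator sort by creation time, and the machine bits remove the single point of failure of one central DB.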
Option C: Pre-generated Key Service
Key Generation Service (KGS):
- Pre-generate millions of unique 7-char keys
- Store in database (used/unused columns)
- API servers request batch of keys from KGS
- Mark keys as used when assigned
WHY use this approach:
- ✅ Zero collision - keys pre-verified unique
- ✅ Fast - no generation at request time
- ✅ Keys are random (not sequential = more secure)
- ❌ More complex infrastructure
- ❌ Need to manage key exhaustion
Database Schema
CREATE TABLE urls (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    short_url VARCHAR(7) UNIQUE NOT NULL,
    long_url VARCHAR(2048) NOT NULL,
    user_id BIGINT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    expires_at TIMESTAMP,
    click_count BIGINT DEFAULT 0,
    INDEX idx_short_url (short_url),  -- For redirect lookups
    INDEX idx_user_id (user_id)       -- For user's URL list
);
WHY this schema:
short_url indexed (and UNIQUE) for fast redirect lookups
long_url not indexed - we never search by it
click_count denormalized for quick analytics (avoid COUNT queries)
Caching Strategy
graph LR
Request([GET /aB3x9Kp]) --> Cache{Redis Cache}
Cache -->|"HIT"| Return([Return long URL])
Cache -->|"MISS"| DB[(Database)]
DB --> UpdateCache[Update Cache]
UpdateCache --> Return
style Request fill:#88c0d0,color:#2e3440
style Cache fill:#bf616a,color:#2e3440
style DB fill:#b48ead,color:#2e3440
style Return fill:#a3be8c,color:#2e3440
style UpdateCache fill:#ebcb8b,color:#2e3440
HOW to implement caching:
def redirect(short_url):
    # Try cache first
    long_url = redis.get(short_url)
    if long_url:
        return redirect_301(long_url)
    # Cache miss - hit database
    long_url = db.query("SELECT long_url FROM urls WHERE short_url = ?", short_url)
    if not long_url:
        return 404
    # Populate cache (TTL 24 hours)
    redis.setex(short_url, 86400, long_url)
    # Increment click count async (don't block redirect)
    async_queue.push({"type": "click", "short_url": short_url})
    return redirect_301(long_url)
301 vs 302 Redirect?
- 301 (Permanent): Browser caches redirect. Fewer requests to our servers, but we lose analytics visibility.
- 302 (Temporary): Browser always hits our servers. More load, but accurate click tracking.
Choose based on whether analytics matter more than server load.
Final Architecture
graph TB
subgraph "Clients"
Web([Web App])
Mobile([Mobile App])
API_Client([API Clients])
end
subgraph "Edge Layer"
CDN[CDN / GeoDNS]
LB[Load Balancer]
end
subgraph "Application Layer"
API1[API Server 1]
API2[API Server 2]
API3[API Server N]
end
subgraph "Caching Layer"
Redis1[(Redis Primary)]
Redis2[(Redis Replica)]
end
subgraph "Database Layer"
DB_Primary[(MySQL Primary)]
DB_Replica1[(MySQL Replica)]
DB_Replica2[(MySQL Replica)]
end
subgraph "Async Processing"
Kafka[Kafka]
Analytics[Analytics Worker]
end
Web --> CDN
Mobile --> CDN
API_Client --> CDN
CDN --> LB
LB --> API1
LB --> API2
LB --> API3
API1 --> Redis1
API2 --> Redis1
API3 --> Redis1
Redis1 --> Redis2
API1 --> DB_Primary
DB_Primary --> DB_Replica1
DB_Primary --> DB_Replica2
API1 --> Kafka
Kafka --> Analytics
Analytics --> DB_Primary
style CDN fill:#88c0d0,color:#2e3440
style LB fill:#a3be8c,color:#2e3440
style API1 fill:#ebcb8b,color:#2e3440
style API2 fill:#ebcb8b,color:#2e3440
style API3 fill:#ebcb8b,color:#2e3440
style Redis1 fill:#bf616a,color:#2e3440
style Redis2 fill:#bf616a,color:#2e3440
style DB_Primary fill:#b48ead,color:#2e3440
style DB_Replica1 fill:#b48ead,color:#2e3440
style DB_Replica2 fill:#b48ead,color:#2e3440
style Kafka fill:#d08770,color:#2e3440
style Analytics fill:#d08770,color:#2e3440
2. Design a Rate Limiter
Rate limiting protects services from abuse, ensures fair usage, and prevents cascading failures. This tests your understanding of algorithms and distributed systems.
Step 1: Clarify Requirements
- What are we limiting? → "API requests per user"
- What limits? → "100 requests per minute per user"
- Distributed system? → "Yes, multiple API servers"
- What happens when limited? → "Return 429 Too Many Requests"
- Hard or soft limit? → "Hard limit, must be accurate"
Step 2: Algorithm Deep Dive
Algorithm 1: Token Bucket
Imagine a bucket that holds tokens. Each request takes a token. Tokens refill at a steady rate.
graph LR
subgraph "Token Bucket"
Bucket["🪣 Bucket
Capacity: 10 tokens
Current: 7 tokens"]
end
Refill["⏱️ Refill
1 token/second"] -->|"Add tokens"| Bucket
Request([Request]) -->|"Take 1 token"| Bucket
Bucket -->|"Tokens > 0"| Allow([✅ Allow])
Bucket -->|"Tokens = 0"| Deny([❌ Deny 429])
style Bucket fill:#88c0d0,color:#2e3440
style Refill fill:#a3be8c,color:#2e3440
style Allow fill:#a3be8c,color:#2e3440
style Deny fill:#bf616a,color:#2e3440
style Request fill:#ebcb8b,color:#2e3440
class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # Max tokens
        self.tokens = capacity          # Current tokens
        self.refill_rate = refill_rate  # Tokens per second
        self.last_refill = time.time()

    def allow_request(self):
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        tokens_to_add = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_refill = now
WHY Token Bucket?
- ✅ Allows bursts (bucket can fill up, then burst)
- ✅ Memory efficient (just 2 numbers per user)
- ✅ Smooth traffic over time
- Used by: AWS API Gateway, Stripe
Algorithm 2: Sliding Window Log
Keep a log of all request timestamps. Count requests in the last window.
class SlidingWindowLog:
    def __init__(self, window_size, max_requests):
        self.window_size = window_size    # e.g., 60 seconds
        self.max_requests = max_requests
        self.request_log = []             # List of timestamps

    def allow_request(self):
        now = time.time()
        window_start = now - self.window_size
        # Remove old entries outside window
        self.request_log = [ts for ts in self.request_log if ts > window_start]
        if len(self.request_log) < self.max_requests:
            self.request_log.append(now)
            return True
        return False
Trade-offs:
- ✅ Very accurate - no boundary issues
- ❌ Memory intensive - stores every timestamp
- ❌ O(n) cleanup of old entries
Algorithm 3: Sliding Window Counter (Best for Most Cases)
Hybrid approach: use counters per time bucket, weight by position in window.
graph LR
subgraph "Sliding Window Counter"
Prev["Previous Minute
Count: 84"]
Curr["Current Minute
Count: 36"]
end
Calc["Weighted Count:
84 × 0.25 + 36
= 21 + 36 = 57"]
Prev --> Calc
Curr --> Calc
Calc --> Check{57 < 100?}
Check -->|"Yes"| Allow([✅ Allow])
Check -->|"No"| Deny([❌ Deny])
style Prev fill:#434c5e,color:#e0e0e0
style Curr fill:#88c0d0,color:#2e3440
style Calc fill:#ebcb8b,color:#2e3440
style Allow fill:#a3be8c,color:#2e3440
style Deny fill:#bf616a,color:#2e3440
class SlidingWindowCounter:
    def __init__(self, window_size, max_requests):
        self.window_size = window_size
        self.max_requests = max_requests
        self.current_window_count = 0
        self.previous_window_count = 0
        self.current_window_start = self._get_window_start()

    def _get_window_start(self):
        # Align to fixed window boundaries (e.g., whole minutes)
        now = time.time()
        return now - (now % self.window_size)

    def allow_request(self):
        now = time.time()
        window_start = self._get_window_start()
        # Roll over to new window if needed
        if window_start != self.current_window_start:
            self.previous_window_count = self.current_window_count
            self.current_window_count = 0
            self.current_window_start = window_start
        # Calculate weighted count: previous window decays as we
        # progress through the current window
        window_progress = (now - window_start) / self.window_size
        weighted_count = (
            self.previous_window_count * (1 - window_progress) +
            self.current_window_count
        )
        if weighted_count < self.max_requests:
            self.current_window_count += 1
            return True
        return False
WHY Sliding Window Counter?
- ✅ Memory efficient (just 2 counters per user)
- ✅ Smooths boundary issues of fixed windows
- ✅ O(1) operations
- Used by: Cloudflare
Step 3: Distributed Rate Limiting
The Problem: With multiple API servers, each has its own counter. User makes 50 requests to Server A, 50 to Server B = 100 total, but each server thinks only 50!
Solution: Centralized Counter with Redis
graph TB
Client([Client]) --> LB[Load Balancer]
LB --> API1[API Server 1]
LB --> API2[API Server 2]
LB --> API3[API Server 3]
API1 --> Redis[(Redis)]
API2 --> Redis
API3 --> Redis
subgraph "Redis Data"
Key1["user:123:minute:1699999200
value: 45"]
Key2["user:456:minute:1699999200
value: 98"]
end
style Client fill:#88c0d0,color:#2e3440
style LB fill:#a3be8c,color:#2e3440
style API1 fill:#ebcb8b,color:#2e3440
style API2 fill:#ebcb8b,color:#2e3440
style API3 fill:#ebcb8b,color:#2e3440
style Redis fill:#bf616a,color:#2e3440
# Redis Lua script for atomic rate limiting (Token Bucket)
RATE_LIMIT_SCRIPT = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])
local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now
-- Calculate tokens to add
local elapsed = now - last_refill
local tokens_to_add = elapsed * refill_rate
tokens = math.min(capacity, tokens + tokens_to_add)
-- Check if request allowed
if tokens >= requested then
tokens = tokens - requested
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 60) -- TTL for cleanup
return 1 -- Allowed
else
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 60)
return 0 -- Denied
end
"""
def check_rate_limit(user_id, capacity=100, refill_rate=1.67):
    key = f"ratelimit:{user_id}"
    now = time.time()
    result = redis.eval(RATE_LIMIT_SCRIPT, 1, key, capacity, refill_rate, now, 1)
    return result == 1
WHY Lua script?
- Atomic execution - no race conditions
- Single round trip to Redis
- All logic runs on Redis server
Response Headers
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1699999260
Retry-After: 30
HOW clients should handle 429:
- Check the Retry-After header
- Implement exponential backoff
- Queue requests locally if needed
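The delay calculation on the client side can be sketched as follows (function name and defaults are illustrative): honor Retry-After when the server sends it, otherwise fall back to exponential backoff with full jitter:

```python
import random

def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based).
    Prefer the server's Retry-After value; otherwise use
    exponential backoff with full jitter, capped at `cap`."""
    if retry_after is not None:
        return retry_after
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Full jitter (random delay in [0, backoff]) spreads retries out so throttled clients don't all hammer the server again at the same instant.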
3. Design the Twitter Timeline (News Feed)
The Twitter timeline is a classic example of the "fan-out" problem. This tests your understanding of the trade-off between read and write optimization.
Step 1: Understand the Scale
Users: 300 million MAU, 150 million DAU
Tweets: 500 million tweets/day ≈ 6,000 tweets/second
Timeline reads: Each user checks timeline ~10 times/day
150M × 10 = 1.5 billion timeline reads/day ≈ 17,000 reads/second
Read:Write ratio: 17,000 : 6,000 ≈ 3:1 (reads dominate)
Average user follows: 200 accounts
Celebrity followers: Up to 100+ million (e.g., @elonmusk)
Step 2: The Core Problem - Fan-Out
When user A posts a tweet, how do we show it to all followers?
Approach 1: Fan-Out on Read (Pull Model)
sequenceDiagram
participant User
participant API
participant DB
User->>API: GET /timeline
API->>DB: Get list of people I follow
DB-->>API: [user_1, user_2, ..., user_200]
API->>DB: Get recent tweets from all 200 users
DB-->>API: [tweet_1, tweet_2, ...]
API->>API: Merge & Sort by time
API-->>User: Timeline (sorted tweets)
def get_timeline(user_id):
    # Get who this user follows
    following = db.query("""
        SELECT followed_id FROM follows
        WHERE follower_id = ?
    """, user_id)  # Returns ~200 IDs
    # Get recent tweets from all followed users
    tweets = db.query("""
        SELECT * FROM tweets
        WHERE user_id IN (?, ?, ?, ...)  -- 200 IDs
        ORDER BY created_at DESC
        LIMIT 100
    """, following)
    return tweets
Trade-offs:
- ✅ Writes are fast (just insert tweet)
- ✅ No wasted work if user never reads
- ❌ Reads are SLOW - query 200 users, merge, sort
- ❌ Can't handle 17,000 reads/second with this latency
Approach 2: Fan-Out on Write (Push Model)
sequenceDiagram
participant User_A as User A (Author)
participant API
participant Queue
participant Workers
participant Cache
participant Followers as Followers (1000)
User_A->>API: POST /tweet "Hello world"
API->>Queue: New tweet event
API-->>User_A: 200 OK
Queue->>Workers: Process tweet
Workers->>Workers: Get A's followers (1000 users)
loop For each follower
Workers->>Cache: Add tweet to follower's timeline cache
end
Note over Cache: Each user has pre-built timeline in Redis
def post_tweet(user_id, content):
    # 1. Save tweet to database
    tweet_id = db.insert("INSERT INTO tweets (user_id, content) VALUES (?, ?)",
                         user_id, content)
    # 2. Async: Push to all followers' timelines
    timestamp = time.time()
    followers = db.query("SELECT follower_id FROM follows WHERE followed_id = ?", user_id)
    for follower_id in followers:
        # Each user's timeline is a Redis sorted set (score = timestamp)
        redis.zadd(f"timeline:{follower_id}", {tweet_id: timestamp})
        redis.zremrangebyrank(f"timeline:{follower_id}", 0, -801)  # Keep only 800 tweets

def get_timeline(user_id):
    # Just read from cache - O(1)!
    tweet_ids = redis.zrevrange(f"timeline:{user_id}", 0, 100)
    tweets = db.query("SELECT * FROM tweets WHERE id IN (?)", tweet_ids)
    return tweets
Trade-offs:
- ✅ Reads are INSTANT - just read from Redis
- ✅ Timeline pre-computed
- ❌ Writes are expensive - 1 tweet = 1000 cache updates
- ❌ Celebrity problem: 1 tweet × 100M followers = 100M writes!
- ❌ Wasted work if followers never log in
Approach 3: Hybrid (What Twitter Actually Uses)
graph TB
subgraph "On Tweet"
Tweet([New Tweet]) --> Check{Author has
> 10K followers?}
Check -->|"No (99% of users)"| FanOut[Fan-out to all followers]
Check -->|"Yes (Celebrities)"| Store[Just store tweet]
end
subgraph "On Read"
Read([Get Timeline]) --> Cache[(Cached Timeline)]
Cache --> Merge[Merge with celebrity tweets]
Merge --> Celebrities[(Celebrity tweets
fetched on-demand)]
Merge --> Result([Final Timeline])
end
style Tweet fill:#88c0d0,color:#2e3440
style Check fill:#ebcb8b,color:#2e3440
style FanOut fill:#a3be8c,color:#2e3440
style Store fill:#a3be8c,color:#2e3440
style Read fill:#88c0d0,color:#2e3440
style Cache fill:#bf616a,color:#2e3440
style Celebrities fill:#b48ead,color:#2e3440
style Result fill:#a3be8c,color:#2e3440
WHY Hybrid works:
- 99% of users have < 10K followers → fan-out is cheap
- 1% celebrities → their tweets fetched on-demand
- Merging 5-10 celebrity feeds is fast (user follows ~5 celebrities)
- Best of both worlds!
def get_timeline(user_id):
    # 1. Get pre-computed timeline from cache
    cached_timeline = redis.zrevrange(f"timeline:{user_id}", 0, 100)
    # 2. Get list of celebrities this user follows
    celebrities = redis.smembers(f"follows_celebrities:{user_id}")
    # 3. Fetch recent tweets from celebrities
    celebrity_tweets = []
    for celeb_id in celebrities:
        tweets = redis.zrevrange(f"user_tweets:{celeb_id}", 0, 10)
        celebrity_tweets.extend(tweets)
    # 4. Merge and sort
    all_tweets = merge_sorted(cached_timeline, celebrity_tweets)
    return all_tweets[:100]
Complete Architecture
graph TB
subgraph "Write Path"
Client1([Client]) --> LB1[Load Balancer]
LB1 --> TweetAPI[Tweet Service]
TweetAPI --> TweetDB[(Tweet DB
MySQL)]
TweetAPI --> Kafka[Kafka]
Kafka --> FanOutWorkers[Fan-out Workers]
FanOutWorkers --> TimelineCache[(Timeline Cache
Redis Cluster)]
end
subgraph "Read Path"
Client2([Client]) --> LB2[Load Balancer]
LB2 --> TimelineAPI[Timeline Service]
TimelineAPI --> TimelineCache
TimelineAPI --> TweetDB
end
subgraph "Social Graph"
FollowDB[(Follow Graph
Graph DB or MySQL)]
FanOutWorkers --> FollowDB
TimelineAPI --> FollowDB
end
style TweetAPI fill:#ebcb8b,color:#2e3440
style TimelineAPI fill:#88c0d0,color:#2e3440
style TweetDB fill:#b48ead,color:#2e3440
style FollowDB fill:#b48ead,color:#2e3440
style TimelineCache fill:#bf616a,color:#2e3440
style Kafka fill:#d08770,color:#2e3440
style FanOutWorkers fill:#a3be8c,color:#2e3440
4. Design a Chat System (WhatsApp)
Chat systems require real-time delivery, guaranteed message ordering, and offline support. This tests WebSockets, message queuing, and consistency.
Step 1: Requirements
- 1:1 chat and group chat (up to 500 members)
- Real-time delivery < 100ms when online
- Delivery guarantees: sent → delivered → read receipts
- Offline support: messages delivered when user comes online
- Message ordering: messages appear in correct order
- Scale: 1 billion users, 100 billion messages/day
Step 2: Real-Time Communication
Why WebSockets?
| Method | How it works | Latency | Server Load |
|---|---|---|---|
| Polling | Client asks "any new messages?" every N seconds | High (up to N seconds) | Very High (constant requests) |
| Long Polling | Client asks, server holds connection until new message | Medium | High (many open connections) |
| WebSockets | Persistent bidirectional connection | Instant | Lowest (one connection per user) |
WHY WebSockets for chat:
- Bidirectional - server pushes messages instantly
- Persistent connection - no reconnection overhead
- Lower bandwidth - no HTTP headers on each message
- Real-time indicators - typing status, online presence
Step 3: Message Flow
sequenceDiagram
participant Alice
participant WS_A as WebSocket Server A
participant MQ as Message Queue
participant WS_B as WebSocket Server B
participant Bob
participant DB as Database
participant Push as Push Service
Alice->>WS_A: Send "Hello Bob"
WS_A->>DB: Store message (status: sent)
WS_A-->>Alice: ✓ Sent
WS_A->>MQ: Route to Bob
alt Bob is Online
MQ->>WS_B: Deliver message
WS_B->>Bob: "Hello Bob"
Bob-->>WS_B: ✓ Received
WS_B->>MQ: Delivery confirmation
MQ->>WS_A: Bob received
WS_A-->>Alice: ✓✓ Delivered
Bob->>WS_B: Message read
WS_B->>MQ: Read receipt
MQ->>WS_A: Bob read message
WS_A-->>Alice: ✓✓ Read (blue)
else Bob is Offline
MQ->>DB: Store for later delivery
MQ->>Push: Send push notification
Push->>Bob: 📱 Push: "Alice: Hello Bob"
end
Message States
enum MessageStatus {
    SENDING,    // Client sent, not yet acknowledged
    SENT,       // Server received and stored
    DELIVERED,  // Recipient's device received
    READ        // Recipient opened conversation
}
Step 4: How to Route Messages
The Problem: Alice connects to Server A. Bob connects to Server B. How does Alice's message reach Bob?
Solution: Connection Registry + Message Queue
graph TB
subgraph "Users"
Alice([Alice])
Bob([Bob])
Charlie([Charlie])
end
subgraph "WebSocket Servers"
WS1[WS Server 1]
WS2[WS Server 2]
end
subgraph "Service Discovery"
Redis[(Redis
Connection Registry)]
end
subgraph "Message Routing"
Kafka[Kafka
Partitioned by user_id]
end
Alice -->|"Connected"| WS1
Bob -->|"Connected"| WS2
Charlie -->|"Connected"| WS2
WS1 -->|"alice → ws1"| Redis
WS2 -->|"bob → ws2"| Redis
WS2 -->|"charlie → ws2"| Redis
WS1 -->|"Publish"| Kafka
Kafka -->|"Consume"| WS1
Kafka -->|"Consume"| WS2
style Alice fill:#88c0d0,color:#2e3440
style Bob fill:#88c0d0,color:#2e3440
style Charlie fill:#88c0d0,color:#2e3440
style WS1 fill:#ebcb8b,color:#2e3440
style WS2 fill:#ebcb8b,color:#2e3440
style Redis fill:#bf616a,color:#2e3440
style Kafka fill:#d08770,color:#2e3440
class WebSocketServer:
    def on_connect(self, user_id, connection):
        # Register this user's connection
        redis.hset("connections", user_id, self.server_id)
        self.local_connections[user_id] = connection

    def on_disconnect(self, user_id):
        redis.hdel("connections", user_id)
        del self.local_connections[user_id]

    def send_message(self, from_user, to_user, message):
        # Store message in database
        msg_id = db.insert("""
            INSERT INTO messages (from_user, to_user, content, status)
            VALUES (?, ?, ?, 'sent')
        """, from_user, to_user, message)
        # Publish to Kafka (partitioned by recipient)
        kafka.produce(
            topic="messages",
            key=to_user,  # Partition by recipient
            value={"msg_id": msg_id, "from": from_user, "content": message}
        )

    def consume_messages(self):
        for message in kafka.consume("messages"):
            recipient = message.key
            if recipient in self.local_connections:
                # User is connected to THIS server
                self.local_connections[recipient].send(message.value)
            # If not connected here, another server will handle it
WHY Kafka partitioned by user_id:
- All messages for a user go to same partition → ordering guaranteed
- Server consuming that partition delivers to user
- If user disconnects, messages stay in Kafka until delivered
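The ordering guarantee comes from deterministic partition assignment: the same key always hashes to the same partition. A simplified sketch (Kafka's default partitioner actually uses murmur2 over the key bytes; SHA-256 here is just for a stable illustration):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition. All messages with the same key
    land in the same partition, so the single consumer of that partition
    sees them in produce order."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Since every message to Bob carries key "bob", they all enter one partition and are delivered in the order they were produced.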
Step 5: Database Schema
-- Messages table (partitioned by time for scaling)
CREATE TABLE messages (
    message_id BIGINT PRIMARY KEY,  -- Snowflake ID (contains timestamp)
    conversation_id BIGINT NOT NULL,
    sender_id BIGINT NOT NULL,
    content TEXT,
    media_url VARCHAR(500),
    status ENUM('sent', 'delivered', 'read'),
    created_at TIMESTAMP,
    INDEX idx_conversation (conversation_id, created_at)
) PARTITION BY RANGE (created_at);

-- Conversations (1:1 and groups)
CREATE TABLE conversations (
    conversation_id BIGINT PRIMARY KEY,
    type ENUM('direct', 'group'),
    created_at TIMESTAMP
);

-- Conversation members
CREATE TABLE conversation_members (
    conversation_id BIGINT,
    user_id BIGINT,
    joined_at TIMESTAMP,
    last_read_message_id BIGINT,  -- For read receipts
    PRIMARY KEY (conversation_id, user_id)
);
HOW to find 1:1 conversation between two users:
-- Option 1: Canonical ID (smaller user_id first)
conversation_id = hash(min(user_a, user_b), max(user_a, user_b))
-- Option 2: Lookup table
SELECT conversation_id FROM conversation_members
WHERE user_id IN (user_a, user_b)
GROUP BY conversation_id
HAVING COUNT(*) = 2;
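Option 1 can even be done without hashing: order the pair so (a, b) and (b, a) map to the same ID, then pack both into one 64-bit key. The packing scheme below is an illustrative assumption (it presumes user IDs fit in 32 bits), not WhatsApp's actual format:

```python
def direct_conversation_id(user_a: int, user_b: int) -> int:
    """Canonical 1:1 conversation ID: smaller user ID in the high bits,
    larger in the low bits, so argument order doesn't matter."""
    lo, hi = sorted((user_a, user_b))
    return (lo << 32) | hi
```

This lets the client compute the conversation ID locally, with no lookup table at all.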
Step 6: Group Chat
graph TB
Sender([Alice sends to Group]) --> API[Chat Service]
API --> DB[(Store Message Once)]
API --> Kafka[Kafka]
Kafka --> Worker[Fan-out Worker]
Worker --> Members[(Get 500 group members)]
Worker -->|"Check online"| Registry[(Connection Registry)]
Worker -->|"Online members"| WS[WebSocket Servers]
Worker -->|"Offline members"| Push[Push Notification]
Worker -->|"All members"| Unread[Increment Unread Count]
style Sender fill:#88c0d0,color:#2e3440
style API fill:#ebcb8b,color:#2e3440
style DB fill:#b48ead,color:#2e3440
style Kafka fill:#d08770,color:#2e3440
style Worker fill:#a3be8c,color:#2e3440
style WS fill:#ebcb8b,color:#2e3440
style Push fill:#bf616a,color:#2e3440
Group size limits:
- WhatsApp: 1024 members (manageable fan-out)
- Slack: 500 members per channel (beyond = degraded experience)
- Discord: 500K members (different architecture - lazy loading, no delivery guarantees)
5. Design Uber (Location-Based Matching)
Uber's core challenge is efficiently matching riders with nearby drivers in real-time. This tests geospatial indexing and real-time systems.
Step 1: Core Requirements
Scale:
- 10 million drivers, 100 million riders
- 1 million concurrent rides
- Driver location update: every 4 seconds
Traffic:
- Location updates: 10M drivers / 4 seconds = 2.5M updates/second
- Ride requests: 1M rides/hour = ~300 requests/second
Latency requirements:
- Find nearby drivers: < 1 second
- Location update processing: < 500ms
Step 2: The Geospatial Problem
The Problem: "Find all drivers within 3km of rider at (40.7128, -74.0060)"
Naive approach: Check distance to all 10 million drivers → O(n) = way too slow!
Solution: Geohash
Geohash converts 2D coordinates into a 1D string. Nearby locations share common prefixes!
graph LR
subgraph "Geohash Encoding"
Coord["(40.7128, -74.0060)"] --> Hash["dr5ru7"]
end
subgraph "Nearby = Similar Prefix"
H1["dr5ru7 (Times Square)"]
H2["dr5ru6 (nearby)"]
H3["dr5ru8 (nearby)"]
H4["dr5ruk (further)"]
end
subgraph "Precision Levels"
P1["dr5 → ~156km"]
P2["dr5r → ~39km"]
P3["dr5ru → ~5km"]
P4["dr5ru7 → ~1.2km"]
end
style Coord fill:#88c0d0,color:#2e3440
style Hash fill:#a3be8c,color:#2e3440
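To see where the prefix property comes from, here is a from-scratch sketch of the standard geohash algorithm: bisect longitude and latitude alternately, record which half the point falls in as a bit, and map each 5-bit group to the geohash base32 alphabet. (Production code would use a library, as below.)

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lng, precision=6):
    """Interleave longitude/latitude bisection bits into a base32 string.
    Each extra character roughly quarters the cell area, which is why
    nearby points share a common prefix."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    bits, bit_count, result = 0, 0, []
    even = True  # even bit positions refine longitude
    while len(result) < precision:
        if even:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                bits = (bits << 1) | 1
                lng_lo = mid
            else:
                bits <<= 1
                lng_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        even = not even
        bit_count += 1
        if bit_count == 5:
            result.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(result)
```

A shorter precision is always a prefix of a longer one for the same point, which is exactly what makes prefix-based cell lookups work.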
import geohash

def find_nearby_drivers(rider_lat, rider_lng, radius_km):
    # 1. Get geohash of rider's location (precision 6 ≈ 1.2km cells)
    rider_hash = geohash.encode(rider_lat, rider_lng, precision=6)  # "dr5ru7"
    # 2. Get neighboring geohash cells (handles edge cases)
    neighbors = geohash.neighbors(rider_hash)  # 8 surrounding cells
    cells_to_search = [rider_hash] + neighbors  # 9 cells total
    # 3. Query drivers in these cells
    drivers = redis.sunion([f"drivers:{cell}" for cell in cells_to_search])
    # 4. Filter by exact distance (geohash is approximate)
    nearby = []
    for driver_id in drivers:
        lat, lng = redis.hmget(f"driver:{driver_id}", "lat", "lng")
        distance = haversine(rider_lat, rider_lng, float(lat), float(lng))
        if distance <= radius_km:
            nearby.append((driver_id, distance))
    return sorted(nearby, key=lambda x: x[1])  # Sort by distance
WHY Geohash?
- Converts 2D search into 1D prefix search → use standard indexes!
- Redis SET per geohash cell → O(1) lookup
- Only search 9 cells instead of millions of drivers
- Easy sharding by geohash prefix
Step 3: Driver Location Updates
graph TB
subgraph "Driver App"
Driver([Driver]) -->|"Every 4 sec"| GPS[GPS Location]
end
GPS --> LB[Load Balancer]
LB --> LocationService[Location Service]
LocationService --> OldCell{Geohash changed?}
OldCell -->|"Yes"| RemoveOld[Remove from old cell]
OldCell -->|"No"| UpdateOnly[Update location only]
RemoveOld --> AddNew[Add to new cell]
AddNew --> UpdateRedis[(Redis)]
UpdateOnly --> UpdateRedis
LocationService --> Kafka[Kafka
Location Stream]
Kafka --> Analytics[Analytics]
Kafka --> ETA[ETA Service]
style Driver fill:#88c0d0,color:#2e3440
style LocationService fill:#ebcb8b,color:#2e3440
style UpdateRedis fill:#bf616a,color:#2e3440
style Kafka fill:#d08770,color:#2e3440
def update_driver_location(driver_id, lat, lng, status):
    new_geohash = geohash.encode(lat, lng, precision=6)
    # Get previous location
    prev = redis.hgetall(f"driver:{driver_id}")
    prev_geohash = prev.get("geohash")
    # Update driver's location
    redis.hset(f"driver:{driver_id}", mapping={
        "lat": lat,
        "lng": lng,
        "geohash": new_geohash,
        "status": status,  # available, busy, offline
        "updated_at": time.time()
    })
    # Update geohash index if cell changed
    if prev_geohash != new_geohash:
        if prev_geohash:
            redis.srem(f"drivers:{prev_geohash}", driver_id)
        if status == "available":
            redis.sadd(f"drivers:{new_geohash}", driver_id)
    # Publish for real-time tracking (active rides)
    if driver_has_active_ride(driver_id):
        kafka.produce("driver_locations", key=driver_id, value={
            "lat": lat, "lng": lng, "timestamp": time.time()
        })
Step 4: Ride Matching Flow
sequenceDiagram
participant Rider
participant API as Ride Service
participant Location as Location Service
participant Match as Matching Service
participant Driver1 as Driver A
participant Driver2 as Driver B
Rider->>API: Request ride (pickup, destination)
API->>Location: Find drivers near pickup
Location-->>API: [Driver A (0.5km), Driver B (0.8km), ...]
API->>Match: Select best driver
Note over Match: Consider: distance, rating,
acceptance rate, vehicle type
Match-->>API: Driver A selected
API->>Driver1: 🔔 Ride request (15 sec timeout)
alt Driver A accepts
Driver1-->>API: Accept
API-->>Rider: Driver A matched!
API->>Location: Mark Driver A as "busy"
else Driver A ignores/rejects
API->>Driver2: 🔔 Ride request
Driver2-->>API: Accept
API-->>Rider: Driver B matched!
end
HOW matching works:
- Find drivers within radius (geohash query)
- Score each driver:
score = w1*distance + w2*rating + w3*acceptance_rate
- Send request to top driver with timeout
- If no response, try next driver
- After N failures, expand search radius
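The scoring step above can be sketched as a weighted sum (weights and signs here are illustrative; real matching also weighs ETA, surge pricing, and vehicle type):

```python
def score_driver(distance_km, rating, acceptance_rate,
                 w_dist=-1.0, w_rating=0.5, w_accept=0.5):
    """Weighted driver score, higher is better. Distance carries a
    negative weight so nearer drivers rank first."""
    return w_dist * distance_km + w_rating * rating + w_accept * acceptance_rate

def rank_drivers(candidates):
    """candidates: list of (driver_id, distance_km, rating, acceptance_rate).
    Returns the list sorted best-first; the top entry gets the ride request."""
    return sorted(candidates,
                  key=lambda c: score_driver(c[1], c[2], c[3]),
                  reverse=True)
```

On timeout or rejection, the matcher simply moves to the next entry in the ranked list.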
Complete Architecture
graph TB
subgraph "Clients"
RiderApp([Rider App])
DriverApp([Driver App])
end
subgraph "API Gateway"
LB[Load Balancer]
end
subgraph "Core Services"
RideService[Ride Service]
LocationService[Location Service]
MatchingService[Matching Service]
PricingService[Pricing Service]
ETAService[ETA Service]
end
subgraph "Real-time"
WebSocket[WebSocket Servers]
Kafka[Kafka]
end
subgraph "Data Stores"
Redis[(Redis
Driver Locations)]
PostgreSQL[(PostgreSQL
Rides, Users)]
Cassandra[(Cassandra
Location History)]
end
RiderApp --> LB
DriverApp --> LB
LB --> RideService
LB --> LocationService
LB --> WebSocket
RideService --> MatchingService
RideService --> PricingService
MatchingService --> LocationService
LocationService --> Redis
DriverApp -->|"Location updates"| LocationService
LocationService --> Kafka
Kafka --> ETAService
Kafka --> Cassandra
WebSocket --> Kafka
RideService --> PostgreSQL
style RiderApp fill:#88c0d0,color:#2e3440
style DriverApp fill:#88c0d0,color:#2e3440
style RideService fill:#ebcb8b,color:#2e3440
style LocationService fill:#ebcb8b,color:#2e3440
style MatchingService fill:#ebcb8b,color:#2e3440
style Redis fill:#bf616a,color:#2e3440
style PostgreSQL fill:#b48ead,color:#2e3440
style Kafka fill:#d08770,color:#2e3440
6. Design YouTube (Video Streaming)
Video streaming involves massive storage, transcoding pipelines, and CDN optimization. The key insight: optimize for the common case (watching videos) over the rare case (uploading).
Step 1: Scale Estimation
Upload:
- 500 hours of video uploaded per minute
- Average video: 10 minutes, 1GB raw → ~3GB after transcoding (multiple resolutions)
- Storage growth: 500 hours/min × 60 min = 30,000 hours of video per hour; at ~6 GB per hour of raw footage (10 min ≈ 1 GB), that's ~180 TB/hour ≈ 4.3 PB/day, before transcoded copies
Viewing:
- 1 billion hours watched per day
- Average video: 10 minutes = 6 billion video views/day
- Sustained: ~70,000 video views/second (6B / 86,400); peaks run higher
Read:Write ratio: 6 billion views/day vs ~4.3 million uploads/day (3,000 ten-minute videos per minute) ≈ 1,400:1
Key insight: Reading (streaming) is over a thousand times more common than writing (uploading). Optimize aggressively for reads. Uploads can be slow and async.
Step 2: Video Upload Pipeline
graph LR
subgraph "Upload"
Client([Creator]) -->|"Chunked upload"| UploadService[Upload Service]
UploadService --> OriginalStore[(Original Storage)]
end
subgraph "Processing Pipeline"
OriginalStore --> Queue[Job Queue]
Queue --> Transcode[Transcoding Workers]
Transcode -->|"360p"| CDN[(CDN Storage)]
Transcode -->|"720p"| CDN
Transcode -->|"1080p"| CDN
Transcode -->|"4K"| CDN
Transcode --> Thumbnail[Thumbnail Generator]
Thumbnail --> CDN
end
subgraph "Metadata"
UploadService --> MetadataDB[(Metadata DB)]
Transcode -->|"Update status"| MetadataDB
end
style Client fill:#88c0d0,color:#2e3440
style UploadService fill:#ebcb8b,color:#2e3440
style Transcode fill:#a3be8c,color:#2e3440
style CDN fill:#bf616a,color:#2e3440
style MetadataDB fill:#b48ead,color:#2e3440
def upload_video(user_id, video_file, metadata):
    # 1. Generate unique video ID
    video_id = generate_snowflake_id()

    # 2. Upload original to blob storage (resumable, chunked)
    original_url = s3.upload(
        bucket="originals",
        key=f"{video_id}/original",
        file=video_file,
        multipart=True  # Support resume on failure
    )

    # 3. Save metadata
    db.insert("""
        INSERT INTO videos (id, user_id, title, description, status)
        VALUES (?, ?, ?, ?, 'processing')
    """, video_id, user_id, metadata.title, metadata.description)

    # 4. Queue transcoding job
    kafka.produce("transcode_jobs", {
        "video_id": video_id,
        "source_url": original_url,
        "resolutions": ["360p", "720p", "1080p", "4k"]
    })

    return video_id  # Client polls for status
HOW transcoding works:
- Split video into segments (10-second chunks)
- Distribute segments to worker pool
- Each worker transcodes segment to target resolution
- Reassemble segments into final video
- Generate HLS/DASH manifest file
Parallelization: A 1-hour video becomes 360 segments, processed in parallel → transcoding completes in minutes, not hours.
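The segment fan-out above can be sketched with a thread pool standing in for the distributed worker fleet; `transcode_segment` is a placeholder for the real per-segment work (e.g. an ffmpeg subprocess), not an actual encoder:

```python
from concurrent.futures import ThreadPoolExecutor

SEGMENT_SECONDS = 10

def split_into_segments(duration_seconds):
    """Return (start, end) pairs covering the video in 10-second chunks."""
    return [(s, min(s + SEGMENT_SECONDS, duration_seconds))
            for s in range(0, duration_seconds, SEGMENT_SECONDS)]

def transcode_segment(segment, resolution):
    # Placeholder for the real work (e.g. an ffmpeg invocation per segment).
    start, _end = segment
    return f"{resolution}/segment_{start:06d}.ts"

def transcode_video(duration_seconds, resolutions):
    segments = split_into_segments(duration_seconds)
    outputs = {}
    with ThreadPoolExecutor(max_workers=32) as pool:
        for res in resolutions:
            # Segments are independent, so each rendition fans out in parallel.
            outputs[res] = list(pool.map(lambda seg: transcode_segment(seg, res), segments))
    return outputs

out = transcode_video(3600, ["360p", "720p"])
print(len(out["360p"]))  # a 1-hour video becomes 360 segments
```

In the real pipeline the pool would be a fleet of workers pulling from the job queue, and a final step would reassemble segments and write the HLS/DASH manifest.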
Step 3: Video Streaming (Adaptive Bitrate)
sequenceDiagram
participant Client as Video Player
participant CDN
participant Origin as Origin Server
Client->>CDN: GET /video/abc123/manifest.m3u8
CDN-->>Client: Manifest (list of quality levels)
Note over Client: Manifest lists quality levels<br/>(360p, 720p, 1080p, 4K)<br/>with per-quality playlist URLs
Client->>CDN: GET /video/abc123/720p/segment_001.ts
CDN-->>Client: Video segment (2-10 seconds)
Note over Client: Bandwidth drops...
Client->>CDN: GET /video/abc123/360p/segment_002.ts
Note over Client: Bandwidth improves...
Client->>CDN: GET /video/abc123/1080p/segment_003.ts
# HLS Manifest Example (master.m3u8)
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
360p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2800000,RESOLUTION=1280x720
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/playlist.m3u8
# Quality-specific playlist (720p/playlist.m3u8)
#EXTM3U
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
segment_001.ts
#EXTINF:10.0,
segment_002.ts
...
WHY Adaptive Bitrate Streaming (ABR)?
- Automatically adjusts quality based on bandwidth
- No buffering on slow connections (drops to 360p)
- Best quality on fast connections (jumps to 4K)
- Seamless switching without user intervention
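The switching logic boils down to "highest rendition whose bitrate fits under measured throughput, with a safety margin". A minimal sketch, using the bitrates from the master manifest example above (real players also smooth throughput estimates and consider buffer level):

```python
# Renditions as (name, required bandwidth in bits/sec), mirroring the manifest above.
RENDITIONS = [("360p", 800_000), ("720p", 2_800_000), ("1080p", 5_000_000)]

def pick_rendition(measured_bps, safety=0.8):
    """Choose the best quality whose bitrate fits within a fraction of
    measured throughput; fall back to the lowest rendition otherwise."""
    budget = measured_bps * safety
    best = RENDITIONS[0][0]
    for name, required in RENDITIONS:
        if required <= budget:
            best = name
    return best

print(pick_rendition(10_000_000))  # fast connection -> "1080p"
print(pick_rendition(1_500_000))   # slow connection -> "360p"
```

The client re-runs this choice before fetching each segment, which is what makes the mid-stream quality switches in the sequence diagram possible.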
Step 4: CDN Architecture
graph TB
subgraph "Viewers"
US([US Viewer])
EU([EU Viewer])
Asia([Asia Viewer])
end
subgraph "Edge (CDN PoPs)"
US_Edge[US Edge Server]
EU_Edge[EU Edge Server]
Asia_Edge[Asia Edge Server]
end
subgraph "Regional Cache"
US_Regional[US Regional]
EU_Regional[EU Regional]
end
subgraph "Origin"
Origin[Origin Servers]
S3[(S3 Storage)]
end
US --> US_Edge
EU --> EU_Edge
Asia --> Asia_Edge
US_Edge -->|"Cache Miss"| US_Regional
EU_Edge -->|"Cache Miss"| EU_Regional
Asia_Edge -->|"Cache Miss"| Origin
US_Regional -->|"Cache Miss"| Origin
EU_Regional -->|"Cache Miss"| Origin
Origin --> S3
style US fill:#88c0d0,color:#2e3440
style EU fill:#88c0d0,color:#2e3440
style Asia fill:#88c0d0,color:#2e3440
style US_Edge fill:#a3be8c,color:#2e3440
style EU_Edge fill:#a3be8c,color:#2e3440
style Asia_Edge fill:#a3be8c,color:#2e3440
style Origin fill:#ebcb8b,color:#2e3440
style S3 fill:#b48ead,color:#2e3440
HOW CDN caching works for video:
- Popular videos: Cached at edge (98% of requests)
- Less popular: Cached at regional tier
- Long tail: Fetched from origin on-demand
- Strategy: Pre-warm cache for new viral videos
80/20 rule: 20% of videos get 80% of views → those 20% are always cached at edge.
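The tiered lookup (edge → regional → origin) can be modeled as chained caches, each with its own LRU eviction. A toy sketch, not a real CDN:

```python
from collections import OrderedDict

class LruTier:
    """One cache tier; misses fall through to the parent tier or origin."""
    def __init__(self, name, capacity, parent=None):
        self.name, self.capacity, self.parent = name, capacity, parent
        self.store = OrderedDict()

    def get(self, video_id, origin_fetch):
        if video_id in self.store:
            self.store.move_to_end(video_id)        # refresh recency
            return self.store[video_id], self.name  # hit at this tier
        # Miss: ask the next tier up, or the origin at the top of the chain.
        if self.parent:
            data, served_by = self.parent.get(video_id, origin_fetch)
        else:
            data, served_by = origin_fetch(video_id), "origin"
        self.store[video_id] = data                 # populate on the way back
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)          # evict least recently used
        return data, served_by

regional = LruTier("regional", capacity=1000)
edge = LruTier("edge", capacity=100, parent=regional)

_, first = edge.get("abc123", lambda vid: b"segment-bytes")
_, second = edge.get("abc123", lambda vid: b"segment-bytes")
print(first, second)  # origin edge
```

Because each tier populates itself on the way back, the 20% of hot videos quickly end up pinned at the edge while the long tail stays at regional or origin.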
7. Design a Notification System
Notification systems must handle multiple channels (push, email, SMS), user preferences, rate limiting, and guaranteed delivery. This tests queue design and third-party integrations.
Step 1: Requirements
- Channels: Push notifications, Email, SMS, In-app
- Scale: 100 million notifications/day
- Delivery: At-least-once (no lost notifications)
- Latency: Push < 1 second, Email < 1 minute
- Features: User preferences, rate limiting, templates
Step 2: High-Level Design
graph TB
subgraph "Producers"
Service1([Order Service])
Service2([Social Service])
Service3([Marketing])
end
subgraph "Notification Service"
API[Notification API]
Validator[Validator]
Router[Channel Router]
end
subgraph "Queues (Kafka)"
PushQ[Push Queue]
EmailQ[Email Queue]
SMSQ[SMS Queue]
end
subgraph "Workers"
PushWorker[Push Workers]
EmailWorker[Email Workers]
SMSWorker[SMS Workers]
end
subgraph "Third Party"
FCM[Firebase/APNs]
SendGrid[SendGrid]
Twilio[Twilio]
end
Service1 --> API
Service2 --> API
Service3 --> API
API --> Validator
Validator --> Router
Router --> PushQ
Router --> EmailQ
Router --> SMSQ
PushQ --> PushWorker
EmailQ --> EmailWorker
SMSQ --> SMSWorker
PushWorker --> FCM
EmailWorker --> SendGrid
SMSWorker --> Twilio
style API fill:#ebcb8b,color:#2e3440
style Router fill:#88c0d0,color:#2e3440
style PushQ fill:#d08770,color:#2e3440
style EmailQ fill:#d08770,color:#2e3440
style SMSQ fill:#d08770,color:#2e3440
style PushWorker fill:#a3be8c,color:#2e3440
style EmailWorker fill:#a3be8c,color:#2e3440
style SMSWorker fill:#a3be8c,color:#2e3440
Step 3: Notification Flow
def send_notification(user_id, notification_type, data):
    # 1. Get user preferences
    prefs = get_user_preferences(user_id)

    # 2. Check if user wants this type of notification
    if not prefs.allows(notification_type):
        return "suppressed"

    # 3. Rate limit check
    if is_rate_limited(user_id, notification_type):
        return "rate_limited"

    # 4. Get user's devices and contact info
    devices = get_user_devices(user_id)  # For push
    email = get_user_email(user_id)
    phone = get_user_phone(user_id)

    # 5. Determine channels based on preferences
    channels = []
    if prefs.push_enabled and devices:
        channels.append("push")
    if prefs.email_enabled and email:
        channels.append("email")
    if prefs.sms_enabled and phone:
        channels.append("sms")

    # 6. Render notification from template
    notification = render_template(notification_type, data)

    # 7. Queue for each channel
    for channel in channels:
        kafka.produce(f"notifications_{channel}", {
            "notification_id": generate_id(),
            "user_id": user_id,
            "channel": channel,
            "content": notification,
            "created_at": time.time()
        })
User Preferences Schema
CREATE TABLE notification_preferences (
    user_id BIGINT PRIMARY KEY,
    push_enabled BOOLEAN DEFAULT true,
    email_enabled BOOLEAN DEFAULT true,
    sms_enabled BOOLEAN DEFAULT false,
    -- Granular preferences
    marketing_push BOOLEAN DEFAULT false,
    marketing_email BOOLEAN DEFAULT true,
    order_updates_push BOOLEAN DEFAULT true,
    order_updates_email BOOLEAN DEFAULT true,
    order_updates_sms BOOLEAN DEFAULT true,
    social_push BOOLEAN DEFAULT true,
    social_email BOOLEAN DEFAULT false,
    -- Quiet hours
    quiet_hours_start TIME,
    quiet_hours_end TIME,
    timezone VARCHAR(50)
);
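The quiet-hours columns need a check that handles windows crossing midnight (e.g. 22:00 to 07:00). A sketch, assuming `now` has already been converted to the user's local time using the stored `timezone` (e.g. via `zoneinfo`):

```python
from datetime import time

def in_quiet_hours(now, start, end):
    """True if local time `now` falls in [start, end), including
    windows that wrap past midnight (e.g. 22:00 -> 07:00)."""
    if start is None or end is None:
        return False                   # user has no quiet hours set
    if start <= end:
        return start <= now < end      # same-day window
    return now >= start or now < end   # window wraps midnight

print(in_quiet_hours(time(23, 30), time(22, 0), time(7, 0)))  # True
print(in_quiet_hours(time(12, 0), time(22, 0), time(7, 0)))   # False
```

During quiet hours the worker would typically hold or batch non-urgent notifications rather than drop them.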
Delivery Guarantees:
| Channel | Guarantee | Why |
| --- | --- | --- |
| Push | Best effort | Device might be offline, token expired |
| Email | At-least-once | Retry on failure, may result in duplicates |
| SMS | At-least-once | Delivery receipt from carrier |
Step 4: Handling Failures
class NotificationWorker:
    def process(self, notification):
        try:
            result = self.send(notification)
            self.mark_delivered(notification.id)
            return result
        except TemporaryError as e:
            # Retry with exponential backoff
            retry_count = notification.retry_count + 1
            if retry_count < MAX_RETRIES:
                delay = min(2 ** retry_count, 3600)  # Max 1 hour
                self.schedule_retry(notification, delay)
            else:
                self.move_to_dead_letter_queue(notification)
        except PermanentError as e:
            # Don't retry (invalid token, unsubscribed, etc.)
            self.mark_failed(notification.id, str(e))
            # If push token invalid, remove from user's devices
            if isinstance(e, InvalidTokenError):
                remove_device_token(notification.user_id, notification.token)
WHY separate queues per channel:
- Different SLAs: Push needs < 1s, email can take minutes
- Different retry strategies: Push retries fast, SMS retries slow
- Independent scaling: More push workers during app launch
- Isolation: Email provider outage doesn't affect push
Key Takeaways
The System Design Process
- Clarify requirements - Ask questions, don't assume
- Estimate scale - Back-of-envelope math shows you understand reality
- Start simple - Draw basic boxes, then iterate
- Identify bottlenecks - Where will it break at scale?
- Discuss trade-offs - There's no perfect solution
- Deep dive where asked - Follow interviewer's interest
Common Patterns to Know
| Pattern | When to Use |
| --- | --- |
| Fan-out on write | Read-heavy, bounded followers (Twitter for normal users) |
| Fan-out on read | Write-heavy, unbounded followers (Twitter for celebrities) |
| Geohash/QuadTree | Location-based queries (Uber, Yelp) |
| Token bucket | Rate limiting with burst allowance |
| Consistent hashing | Distributed caching, sharding |
| CDC + Kafka | Real-time data sync, event sourcing |
| CQRS | Separate read/write models at different scales |