System Design Interview Deep Dive
Focus on HOW to design and WHY we make each decision
1. Design a URL Shortener (bit.ly)
This is often the first system design question you'll encounter. It seems simple but tests fundamental concepts: hashing, database design, caching, and scale estimation.
Step 1: Clarify Requirements
Questions to ask the interviewer:
- How many URLs per day? → "100 million new URLs/day"
- How long should shortened URLs be? → "As short as possible"
- Can users choose custom aliases? → "Yes, optionally"
- Do URLs expire? → "Default 5 years, configurable"
- Analytics needed? → "Basic click counts"
Step 2: Back-of-Envelope Estimation
Write Traffic:
100M URLs/day = 100M / 86400 ≈ 1,200 URLs/second
Read Traffic (assume 100:1 read:write ratio):
1,200 × 100 = 120,000 redirects/second
Storage (over 5 years):
100M URLs/day × 365 days × 5 years = 182.5 billion URLs
Average URL = 500 bytes (long URL + short URL + metadata)
182.5B × 500 bytes = ~91 TB
URL Length:
Using base62 (a-z, A-Z, 0-9): 62^7 = 3.5 trillion combinations
7 characters is sufficient for 182.5 billion URLs
WHY base62? We want URLs that are URL-safe without encoding. Base64 includes + and / which need URL encoding. Base62 (alphanumeric only) gives clean URLs like bit.ly/abc1234.
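HOW the conversion works in code — a minimal sketch (the digit ordering 0-9a-zA-Z here is an arbitrary choice; any fixed 62-character alphabet works, and real services may order it differently):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(num: int) -> str:
    """Convert a non-negative integer to a base62 string."""
    if num == 0:
        return ALPHABET[0]
    chars = []
    while num > 0:
        num, rem = divmod(num, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))  # most significant digit first

def base62_decode(s: str) -> int:
    """Inverse of base62_encode."""
    num = 0
    for ch in s:
        num = num * 62 + ALPHABET.index(ch)
    return num
```

Note the round-trip property: any 7-character code decodes back to a unique integer below 62^7, which is why 7 characters cover the 182.5 billion URLs estimated above.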
Step 3: High-Level Design
graph LR
Client([Client]) --> LB[Load Balancer]
LB --> API[API Servers]
API --> Cache[(Redis Cache)]
API --> DB[(Database)]
Cache -.-> DB
subgraph "Write Path"
API -->|"POST /shorten"| IDGen[ID Generator]
IDGen --> DB
end
subgraph "Read Path"
API -->|"GET /:shortUrl"| Cache
Cache -->|"Cache Miss"| DB
end
style Client fill:#88c0d0,color:#2e3440
style LB fill:#a3be8c,color:#2e3440
style API fill:#ebcb8b,color:#2e3440
style Cache fill:#bf616a,color:#2e3440
style DB fill:#b48ead,color:#2e3440
style IDGen fill:#d08770,color:#2e3440
Step 4: Deep Dive - The Core Problem
How do we generate unique short URLs?
This is THE key design decision. There are three main approaches:
Option A: Hash the Long URL
MD5(longUrl) = "5d41402abc4b2a76b9719d911017c592"
Take first 7 chars = "5d41402"
Convert to base62 = "aB3x9Kp"
Trade-offs:
- ✅ Same URL always produces same short URL (idempotent)
- ✅ No coordination between servers needed
- ❌ Collisions! Two different URLs might hash to same 7 chars
- ❌ Need collision resolution (append counter, rehash)
HOW to handle collisions:
- Generate hash, check if exists in DB
- If collision: append incrementing counter and rehash
- Repeat until unique
Problem: Multiple DB lookups on collision. Under high write load, this becomes expensive.
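The collision loop above can be sketched like this (an in-memory set stands in for the DB uniqueness check, and the first 7 hex digits stand in for the final base62 step):

```python
import hashlib

def short_code(long_url: str, existing: set, length: int = 7) -> str:
    """Hash-based short code with append-counter collision resolution.
    `existing` is a stand-in for a database uniqueness lookup."""
    salt = 0
    while True:
        candidate = long_url + (str(salt) if salt else "")
        digest = hashlib.md5(candidate.encode()).hexdigest()
        code = digest[:length]
        if code not in existing:
            existing.add(code)
            return code
        salt += 1  # collision: append incrementing counter and rehash
```

Each retry costs another existence check, which is the DB-lookup cost the text warns about.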
Option B: Auto-Increment ID + Base62 Encoding (Recommended)
Database auto-increment ID: 123456789
Convert to base62: 123456789 → "8M0kX"
WHY this is better:
- ✅ Guaranteed unique - no collisions ever
- ✅ Simple, predictable
- ✅ Short URLs get longer over time (starts at 1 char, grows)
- ❌ Single point of failure (one DB generating IDs)
- ❌ Predictable URLs (security concern - can enumerate)
HOW to scale ID generation:
Use multiple ID generators with different ranges:
Server 1: IDs 0, 2, 4, 6, 8... (even numbers)
Server 2: IDs 1, 3, 5, 7, 9... (odd numbers)
Or use ranges:
Server 1: 0 - 1 billion
Server 2: 1 billion - 2 billion
...
Twitter's Snowflake ID: 64-bit IDs = timestamp + datacenter + machine + sequence
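A minimal sketch of a Snowflake-style generator (bit widths follow Twitter's published 41/5/5/12 layout; the epoch constant and locking details here are illustrative, not Twitter's exact implementation):

```python
import threading
import time

class SnowflakeGenerator:
    """64-bit IDs: 41-bit ms timestamp | 5-bit datacenter | 5-bit machine | 12-bit sequence."""
    EPOCH = 1288834974657  # custom epoch in ms (assumption for this sketch)

    def __init__(self, datacenter_id: int, machine_id: int):
        self.datacenter_id = datacenter_id & 0x1F
        self.machine_id = machine_id & 0x1F
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    # 4096 IDs in one ms exhausted: spin until the next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.EPOCH) << 22) | (self.datacenter_id << 17) \
                   | (self.machine_id << 12) | self.sequence
```

Because the timestamp occupies the high bits, IDs from one generator sort by creation time, and the machine bits remove the single point of failure of one central DB.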
Option C: Pre-generated Key Service
Key Generation Service (KGS):
- Pre-generate millions of unique 7-char keys
- Store in database (used/unused columns)
- API servers request batch of keys from KGS
- Mark keys as used when assigned
WHY use this approach:
- ✅ Zero collision - keys pre-verified unique
- ✅ Fast - no generation at request time
- ✅ Keys are random (not sequential = more secure)
- ❌ More complex infrastructure
- ❌ Need to manage key exhaustion
Database Schema
CREATE TABLE urls (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    short_url VARCHAR(7) UNIQUE NOT NULL,
    long_url VARCHAR(2048) NOT NULL,
    user_id BIGINT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    expires_at TIMESTAMP,
    click_count BIGINT DEFAULT 0,
    INDEX idx_short_url (short_url),  -- For redirect lookups
    INDEX idx_user_id (user_id)       -- For user's URL list
);
WHY this schema:
short_url indexed (and UNIQUE) for fast redirect lookups
long_url not indexed - we never search by it
click_count denormalized for quick analytics (avoid COUNT queries)
Caching Strategy
graph LR
Request([GET /aB3x9Kp]) --> Cache{Redis Cache}
Cache -->|"HIT"| Return([Return long URL])
Cache -->|"MISS"| DB[(Database)]
DB --> UpdateCache[Update Cache]
UpdateCache --> Return
style Request fill:#88c0d0,color:#2e3440
style Cache fill:#bf616a,color:#2e3440
style DB fill:#b48ead,color:#2e3440
style Return fill:#a3be8c,color:#2e3440
style UpdateCache fill:#ebcb8b,color:#2e3440
HOW to implement caching:
def redirect(short_url):
    # Try cache first
    long_url = redis.get(short_url)
    if long_url:
        return redirect_301(long_url)
    # Cache miss - hit database
    long_url = db.query("SELECT long_url FROM urls WHERE short_url = ?", short_url)
    if not long_url:
        return 404
    # Populate cache (TTL 24 hours)
    redis.setex(short_url, 86400, long_url)
    # Increment click count async (don't block redirect)
    async_queue.push({"type": "click", "short_url": short_url})
    return redirect_301(long_url)
301 vs 302 Redirect?
- 301 (Permanent): Browser caches redirect. Fewer requests to our servers, but we lose analytics visibility.
- 302 (Temporary): Browser always hits our servers. More load, but accurate click tracking.
Choose based on whether analytics matter more than server load.
Final Architecture
graph TB
subgraph "Clients"
Web([Web App])
Mobile([Mobile App])
API_Client([API Clients])
end
subgraph "Edge Layer"
CDN[CDN / GeoDNS]
LB[Load Balancer]
end
subgraph "Application Layer"
API1[API Server 1]
API2[API Server 2]
API3[API Server N]
end
subgraph "Caching Layer"
Redis1[(Redis Primary)]
Redis2[(Redis Replica)]
end
subgraph "Database Layer"
DB_Primary[(MySQL Primary)]
DB_Replica1[(MySQL Replica)]
DB_Replica2[(MySQL Replica)]
end
subgraph "Async Processing"
Kafka[Kafka]
Analytics[Analytics Worker]
end
Web --> CDN
Mobile --> CDN
API_Client --> CDN
CDN --> LB
LB --> API1
LB --> API2
LB --> API3
API1 --> Redis1
API2 --> Redis1
API3 --> Redis1
Redis1 --> Redis2
API1 --> DB_Primary
DB_Primary --> DB_Replica1
DB_Primary --> DB_Replica2
API1 --> Kafka
Kafka --> Analytics
Analytics --> DB_Primary
style CDN fill:#88c0d0,color:#2e3440
style LB fill:#a3be8c,color:#2e3440
style API1 fill:#ebcb8b,color:#2e3440
style API2 fill:#ebcb8b,color:#2e3440
style API3 fill:#ebcb8b,color:#2e3440
style Redis1 fill:#bf616a,color:#2e3440
style Redis2 fill:#bf616a,color:#2e3440
style DB_Primary fill:#b48ead,color:#2e3440
style DB_Replica1 fill:#b48ead,color:#2e3440
style DB_Replica2 fill:#b48ead,color:#2e3440
style Kafka fill:#d08770,color:#2e3440
style Analytics fill:#d08770,color:#2e3440
2. Design a Rate Limiter
Rate limiting protects services from abuse, ensures fair usage, and prevents cascading failures. This tests your understanding of algorithms and distributed systems.
Step 1: Clarify Requirements
- What are we limiting? → "API requests per user"
- What limits? → "100 requests per minute per user"
- Distributed system? → "Yes, multiple API servers"
- What happens when limited? → "Return 429 Too Many Requests"
- Hard or soft limit? → "Hard limit, must be accurate"
Step 2: Algorithm Deep Dive
Algorithm 1: Token Bucket
Imagine a bucket that holds tokens. Each request takes a token. Tokens refill at a steady rate.
graph LR
subgraph "Token Bucket"
Bucket["🪣 Bucket
Capacity: 10 tokens
Current: 7 tokens"]
end
Refill["⏱️ Refill
1 token/second"] -->|"Add tokens"| Bucket
Request([Request]) -->|"Take 1 token"| Bucket
Bucket -->|"Tokens > 0"| Allow([✅ Allow])
Bucket -->|"Tokens = 0"| Deny([❌ Deny 429])
style Bucket fill:#88c0d0,color:#2e3440
style Refill fill:#a3be8c,color:#2e3440
style Allow fill:#a3be8c,color:#2e3440
style Deny fill:#bf616a,color:#2e3440
style Request fill:#ebcb8b,color:#2e3440
class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # Max tokens
        self.tokens = capacity          # Current tokens
        self.refill_rate = refill_rate  # Tokens per second
        self.last_refill = time.time()

    def allow_request(self):
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        tokens_to_add = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_refill = now
WHY Token Bucket?
- ✅ Allows bursts (bucket can fill up, then burst)
- ✅ Memory efficient (just 2 numbers per user)
- ✅ Smooth traffic over time
- Used by: AWS API Gateway, Stripe
Algorithm 2: Sliding Window Log
Keep a log of all request timestamps. Count requests in the last window.
class SlidingWindowLog:
    def __init__(self, window_size, max_requests):
        self.window_size = window_size    # e.g., 60 seconds
        self.max_requests = max_requests
        self.request_log = []             # List of timestamps

    def allow_request(self):
        now = time.time()
        window_start = now - self.window_size
        # Remove old entries outside window
        self.request_log = [ts for ts in self.request_log if ts > window_start]
        if len(self.request_log) < self.max_requests:
            self.request_log.append(now)
            return True
        return False
Trade-offs:
- ✅ Very accurate - no boundary issues
- ❌ Memory intensive - stores every timestamp
- ❌ O(n) cleanup of old entries
Algorithm 3: Sliding Window Counter (Best for Most Cases)
Hybrid approach: use counters per time bucket, weight by position in window.
graph LR
subgraph "Sliding Window Counter"
Prev["Previous Minute
Count: 84"]
Curr["Current Minute
Count: 36"]
end
Calc["Weighted Count:
84 × 0.25 + 36
= 21 + 36 = 57"]
Prev --> Calc
Curr --> Calc
Calc --> Check{57 < 100?}
Check -->|"Yes"| Allow([✅ Allow])
Check -->|"No"| Deny([❌ Deny])
style Prev fill:#434c5e,color:#e0e0e0
style Curr fill:#88c0d0,color:#2e3440
style Calc fill:#ebcb8b,color:#2e3440
style Allow fill:#a3be8c,color:#2e3440
style Deny fill:#bf616a,color:#2e3440
class SlidingWindowCounter:
    def __init__(self, window_size, max_requests):
        self.window_size = window_size
        self.max_requests = max_requests
        self.current_window_count = 0
        self.previous_window_count = 0
        self.current_window_start = self._get_window_start()

    def _get_window_start(self):
        # Align to fixed window boundaries (e.g., whole minutes)
        now = time.time()
        return now - (now % self.window_size)

    def allow_request(self):
        now = time.time()
        window_start = self._get_window_start()
        # Roll over to new window if needed
        if window_start != self.current_window_start:
            self.previous_window_count = self.current_window_count
            self.current_window_count = 0
            self.current_window_start = window_start
        # Calculate weighted count: previous window decays as we
        # progress through the current window
        window_progress = (now - window_start) / self.window_size
        weighted_count = (
            self.previous_window_count * (1 - window_progress) +
            self.current_window_count
        )
        if weighted_count < self.max_requests:
            self.current_window_count += 1
            return True
        return False
WHY Sliding Window Counter?
- ✅ Memory efficient (just 2 counters per user)
- ✅ Smooths boundary issues of fixed windows
- ✅ O(1) operations
- Used by: Cloudflare
Step 3: Distributed Rate Limiting
The Problem: With multiple API servers, each has its own counter. User makes 50 requests to Server A, 50 to Server B = 100 total, but each server thinks only 50!
Solution: Centralized Counter with Redis
graph TB
Client([Client]) --> LB[Load Balancer]
LB --> API1[API Server 1]
LB --> API2[API Server 2]
LB --> API3[API Server 3]
API1 --> Redis[(Redis)]
API2 --> Redis
API3 --> Redis
subgraph "Redis Data"
Key1["user:123:minute:1699999200
value: 45"]
Key2["user:456:minute:1699999200
value: 98"]
end
style Client fill:#88c0d0,color:#2e3440
style LB fill:#a3be8c,color:#2e3440
style API1 fill:#ebcb8b,color:#2e3440
style API2 fill:#ebcb8b,color:#2e3440
style API3 fill:#ebcb8b,color:#2e3440
style Redis fill:#bf616a,color:#2e3440
# Redis Lua script for atomic rate limiting (Token Bucket)
RATE_LIMIT_SCRIPT = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])
local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now
-- Calculate tokens to add
local elapsed = now - last_refill
local tokens_to_add = elapsed * refill_rate
tokens = math.min(capacity, tokens + tokens_to_add)
-- Check if request allowed
if tokens >= requested then
tokens = tokens - requested
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 60) -- TTL for cleanup
return 1 -- Allowed
else
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 60)
return 0 -- Denied
end
"""
def check_rate_limit(user_id, capacity=100, refill_rate=1.67):
    key = f"ratelimit:{user_id}"
    now = time.time()
    result = redis.eval(RATE_LIMIT_SCRIPT, 1, key, capacity, refill_rate, now, 1)
    return result == 1
WHY Lua script?
- Atomic execution - no race conditions
- Single round trip to Redis
- All logic runs on Redis server
Response Headers
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1699999260
Retry-After: 30
HOW clients should handle 429:
- Check the Retry-After header
- Implement exponential backoff
- Queue requests locally if needed
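The delay calculation on the client side can be sketched as follows (function name and defaults are illustrative): honor Retry-After when the server sends it, otherwise fall back to exponential backoff with full jitter:

```python
import random

def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based).
    Prefer the server's Retry-After value; otherwise use
    exponential backoff with full jitter, capped at `cap`."""
    if retry_after is not None:
        return retry_after
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Full jitter (random delay in [0, backoff]) spreads retries out so throttled clients don't all hammer the server again at the same instant.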
3. Design the Twitter Timeline (News Feed)
The Twitter timeline is a classic example of the "fan-out" problem. This tests your understanding of the trade-off between read and write optimization.
Step 1: Understand the Scale
Users: 300 million MAU, 150 million DAU
Tweets: 500 million tweets/day ≈ 6,000 tweets/second
Timeline reads: Each user checks timeline ~10 times/day
150M × 10 = 1.5 billion timeline reads/day ≈ 17,000 reads/second
Read:Write ratio: 17,000 : 6,000 ≈ 3:1 (reads dominate)
Average user follows: 200 accounts
Celebrity followers: Up to 100+ million (e.g., @elonmusk)
Step 2: The Core Problem - Fan-Out
When user A posts a tweet, how do we show it to all followers?
Approach 1: Fan-Out on Read (Pull Model)
sequenceDiagram
participant User
participant API
participant DB
User->>API: GET /timeline
API->>DB: Get list of people I follow
DB-->>API: [user_1, user_2, ..., user_200]
API->>DB: Get recent tweets from all 200 users
DB-->>API: [tweet_1, tweet_2, ...]
API->>API: Merge & Sort by time
API-->>User: Timeline (sorted tweets)
def get_timeline(user_id):
    # Get who this user follows
    following = db.query("""
        SELECT followed_id FROM follows
        WHERE follower_id = ?
    """, user_id)  # Returns ~200 IDs
    # Get recent tweets from all followed users
    tweets = db.query("""
        SELECT * FROM tweets
        WHERE user_id IN (?, ?, ?, ...)  -- 200 IDs
        ORDER BY created_at DESC
        LIMIT 100
    """, following)
    return tweets
Trade-offs:
- ✅ Writes are fast (just insert tweet)
- ✅ No wasted work if user never reads
- ❌ Reads are SLOW - query 200 users, merge, sort
- ❌ Can't handle 17,000 reads/second with this latency
Approach 2: Fan-Out on Write (Push Model)
sequenceDiagram
participant User_A as User A (Author)
participant API
participant Queue
participant Workers
participant Cache
participant Followers as Followers (1000)
User_A->>API: POST /tweet "Hello world"
API->>Queue: New tweet event
API-->>User_A: 200 OK
Queue->>Workers: Process tweet
Workers->>Workers: Get A's followers (1000 users)
loop For each follower
Workers->>Cache: Add tweet to follower's timeline cache
end
Note over Cache: Each user has pre-built timeline in Redis
def post_tweet(user_id, content):
    # 1. Save tweet to database
    tweet_id = db.insert("INSERT INTO tweets (user_id, content) VALUES (?, ?)",
                         user_id, content)
    # 2. Async: Push to all followers' timelines
    timestamp = time.time()
    followers = db.query("SELECT follower_id FROM follows WHERE followed_id = ?", user_id)
    for follower_id in followers:
        # Each user's timeline is a Redis sorted set (score = timestamp)
        redis.zadd(f"timeline:{follower_id}", {tweet_id: timestamp})
        redis.zremrangebyrank(f"timeline:{follower_id}", 0, -801)  # Keep only 800 tweets

def get_timeline(user_id):
    # Just read from cache - O(1)!
    tweet_ids = redis.zrevrange(f"timeline:{user_id}", 0, 100)
    tweets = db.query("SELECT * FROM tweets WHERE id IN (?)", tweet_ids)
    return tweets
Trade-offs:
- ✅ Reads are INSTANT - just read from Redis
- ✅ Timeline pre-computed
- ❌ Writes are expensive - 1 tweet = 1000 cache updates
- ❌ Celebrity problem: 1 tweet × 100M followers = 100M writes!
- ❌ Wasted work if followers never log in
Approach 3: Hybrid (What Twitter Actually Uses)
graph TB
subgraph "On Tweet"
Tweet([New Tweet]) --> Check{Author has
> 10K followers?}
Check -->|"No (99% of users)"| FanOut[Fan-out to all followers]
Check -->|"Yes (Celebrities)"| Store[Just store tweet]
end
subgraph "On Read"
Read([Get Timeline]) --> Cache[(Cached Timeline)]
Cache --> Merge[Merge with celebrity tweets]
Merge --> Celebrities[(Celebrity tweets
fetched on-demand)]
Merge --> Result([Final Timeline])
end
style Tweet fill:#88c0d0,color:#2e3440
style Check fill:#ebcb8b,color:#2e3440
style FanOut fill:#a3be8c,color:#2e3440
style Store fill:#a3be8c,color:#2e3440
style Read fill:#88c0d0,color:#2e3440
style Cache fill:#bf616a,color:#2e3440
style Celebrities fill:#b48ead,color:#2e3440
style Result fill:#a3be8c,color:#2e3440
WHY Hybrid works:
- 99% of users have < 10K followers → fan-out is cheap
- 1% celebrities → their tweets fetched on-demand
- Merging 5-10 celebrity feeds is fast (user follows ~5 celebrities)
- Best of both worlds!
def get_timeline(user_id):
    # 1. Get pre-computed timeline from cache
    cached_timeline = redis.zrevrange(f"timeline:{user_id}", 0, 100)
    # 2. Get list of celebrities this user follows
    celebrities = redis.smembers(f"follows_celebrities:{user_id}")
    # 3. Fetch recent tweets from celebrities
    celebrity_tweets = []
    for celeb_id in celebrities:
        tweets = redis.zrevrange(f"user_tweets:{celeb_id}", 0, 10)
        celebrity_tweets.extend(tweets)
    # 4. Merge and sort
    all_tweets = merge_sorted(cached_timeline, celebrity_tweets)
    return all_tweets[:100]
Complete Architecture
graph TB
subgraph "Write Path"
Client1([Client]) --> LB1[Load Balancer]
LB1 --> TweetAPI[Tweet Service]
TweetAPI --> TweetDB[(Tweet DB
MySQL)]
TweetAPI --> Kafka[Kafka]
Kafka --> FanOutWorkers[Fan-out Workers]
FanOutWorkers --> TimelineCache[(Timeline Cache
Redis Cluster)]
end
subgraph "Read Path"
Client2([Client]) --> LB2[Load Balancer]
LB2 --> TimelineAPI[Timeline Service]
TimelineAPI --> TimelineCache
TimelineAPI --> TweetDB
end
subgraph "Social Graph"
FollowDB[(Follow Graph
Graph DB or MySQL)]
FanOutWorkers --> FollowDB
TimelineAPI --> FollowDB
end
style TweetAPI fill:#ebcb8b,color:#2e3440
style TimelineAPI fill:#88c0d0,color:#2e3440
style TweetDB fill:#b48ead,color:#2e3440
style FollowDB fill:#b48ead,color:#2e3440
style TimelineCache fill:#bf616a,color:#2e3440
style Kafka fill:#d08770,color:#2e3440
style FanOutWorkers fill:#a3be8c,color:#2e3440
4. Design a Chat System (WhatsApp)
Chat systems require real-time delivery, guaranteed message ordering, and offline support. This tests WebSockets, message queuing, and consistency.
Step 1: Requirements
- 1:1 chat and group chat (up to 500 members)
- Real-time delivery < 100ms when online
- Delivery guarantees: sent → delivered → read receipts
- Offline support: messages delivered when user comes online
- Message ordering: messages appear in correct order
- Scale: 1 billion users, 100 billion messages/day
Step 2: Real-Time Communication
Why WebSockets?
| Method | How it works | Latency | Server Load |
|---|---|---|---|
| Polling | Client asks "any new messages?" every N seconds | High (up to N seconds) | Very High (constant requests) |
| Long Polling | Client asks, server holds connection until new message | Medium | High (many open connections) |
| WebSockets | Persistent bidirectional connection | Instant | Lowest (one connection per user) |
WHY WebSockets for chat:
- Bidirectional - server pushes messages instantly
- Persistent connection - no reconnection overhead
- Lower bandwidth - no HTTP headers on each message
- Real-time indicators - typing status, online presence
Step 3: Message Flow
sequenceDiagram
participant Alice
participant WS_A as WebSocket Server A
participant MQ as Message Queue
participant WS_B as WebSocket Server B
participant Bob
participant DB as Database
participant Push as Push Service
Alice->>WS_A: Send "Hello Bob"
WS_A->>DB: Store message (status: sent)
WS_A-->>Alice: ✓ Sent
WS_A->>MQ: Route to Bob
alt Bob is Online
MQ->>WS_B: Deliver message
WS_B->>Bob: "Hello Bob"
Bob-->>WS_B: ✓ Received
WS_B->>MQ: Delivery confirmation
MQ->>WS_A: Bob received
WS_A-->>Alice: ✓✓ Delivered
Bob->>WS_B: Message read
WS_B->>MQ: Read receipt
MQ->>WS_A: Bob read message
WS_A-->>Alice: ✓✓ Read (blue)
else Bob is Offline
MQ->>DB: Store for later delivery
MQ->>Push: Send push notification
Push->>Bob: 📱 Push: "Alice: Hello Bob"
end
Message States
enum MessageStatus {
    SENDING,    // Client sent, not yet acknowledged
    SENT,       // Server received and stored
    DELIVERED,  // Recipient's device received
    READ        // Recipient opened conversation
}
Step 4: How to Route Messages
The Problem: Alice connects to Server A. Bob connects to Server B. How does Alice's message reach Bob?
Solution: Connection Registry + Message Queue
graph TB
subgraph "Users"
Alice([Alice])
Bob([Bob])
Charlie([Charlie])
end
subgraph "WebSocket Servers"
WS1[WS Server 1]
WS2[WS Server 2]
end
subgraph "Service Discovery"
Redis[(Redis
Connection Registry)]
end
subgraph "Message Routing"
Kafka[Kafka
Partitioned by user_id]
end
Alice -->|"Connected"| WS1
Bob -->|"Connected"| WS2
Charlie -->|"Connected"| WS2
WS1 -->|"alice → ws1"| Redis
WS2 -->|"bob → ws2"| Redis
WS2 -->|"charlie → ws2"| Redis
WS1 -->|"Publish"| Kafka
Kafka -->|"Consume"| WS1
Kafka -->|"Consume"| WS2
style Alice fill:#88c0d0,color:#2e3440
style Bob fill:#88c0d0,color:#2e3440
style Charlie fill:#88c0d0,color:#2e3440
style WS1 fill:#ebcb8b,color:#2e3440
style WS2 fill:#ebcb8b,color:#2e3440
style Redis fill:#bf616a,color:#2e3440
style Kafka fill:#d08770,color:#2e3440
class WebSocketServer:
    def on_connect(self, user_id, connection):
        # Register this user's connection
        redis.hset("connections", user_id, self.server_id)
        self.local_connections[user_id] = connection

    def on_disconnect(self, user_id):
        redis.hdel("connections", user_id)
        del self.local_connections[user_id]

    def send_message(self, from_user, to_user, message):
        # Store message in database
        msg_id = db.insert("""
            INSERT INTO messages (from_user, to_user, content, status)
            VALUES (?, ?, ?, 'sent')
        """, from_user, to_user, message)
        # Publish to Kafka (partitioned by recipient)
        kafka.produce(
            topic="messages",
            key=to_user,  # Partition by recipient
            value={"msg_id": msg_id, "from": from_user, "content": message}
        )

    def consume_messages(self):
        for message in kafka.consume("messages"):
            recipient = message.key
            if recipient in self.local_connections:
                # User is connected to THIS server
                self.local_connections[recipient].send(message.value)
            # If not connected here, another server will handle it
WHY Kafka partitioned by user_id:
- All messages for a user go to same partition → ordering guaranteed
- Server consuming that partition delivers to user
- If user disconnects, messages stay in Kafka until delivered
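The ordering guarantee comes from deterministic partition assignment: the same key always hashes to the same partition. A simplified sketch (Kafka's default partitioner actually uses murmur2 over the key bytes; SHA-256 here is just for a stable illustration):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition. All messages with the same key
    land in the same partition, so the single consumer of that partition
    sees them in produce order."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Since every message to Bob carries key "bob", they all enter one partition and are delivered in the order they were produced.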
Step 5: Database Schema
-- Messages table (partitioned by time for scaling)
CREATE TABLE messages (
    message_id BIGINT PRIMARY KEY,  -- Snowflake ID (contains timestamp)
    conversation_id BIGINT NOT NULL,
    sender_id BIGINT NOT NULL,
    content TEXT,
    media_url VARCHAR(500),
    status ENUM('sent', 'delivered', 'read'),
    created_at TIMESTAMP,
    INDEX idx_conversation (conversation_id, created_at)
) PARTITION BY RANGE (created_at);

-- Conversations (1:1 and groups)
CREATE TABLE conversations (
    conversation_id BIGINT PRIMARY KEY,
    type ENUM('direct', 'group'),
    created_at TIMESTAMP
);

-- Conversation members
CREATE TABLE conversation_members (
    conversation_id BIGINT,
    user_id BIGINT,
    joined_at TIMESTAMP,
    last_read_message_id BIGINT,  -- For read receipts
    PRIMARY KEY (conversation_id, user_id)
);
HOW to find 1:1 conversation between two users:
-- Option 1: Canonical ID (smaller user_id first)
conversation_id = hash(min(user_a, user_b), max(user_a, user_b))
-- Option 2: Lookup table
SELECT conversation_id FROM conversation_members
WHERE user_id IN (user_a, user_b)
GROUP BY conversation_id
HAVING COUNT(*) = 2;
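Option 1 can even be done without hashing: order the pair so (a, b) and (b, a) map to the same ID, then pack both into one 64-bit key. The packing scheme below is an illustrative assumption (it presumes user IDs fit in 32 bits), not WhatsApp's actual format:

```python
def direct_conversation_id(user_a: int, user_b: int) -> int:
    """Canonical 1:1 conversation ID: smaller user ID in the high bits,
    larger in the low bits, so argument order doesn't matter."""
    lo, hi = sorted((user_a, user_b))
    return (lo << 32) | hi
```

This lets the client compute the conversation ID locally, with no lookup table at all.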
Step 6: Group Chat
graph TB
Sender([Alice sends to Group]) --> API[Chat Service]
API --> DB[(Store Message Once)]
API --> Kafka[Kafka]
Kafka --> Worker[Fan-out Worker]
Worker --> Members[(Get 500 group members)]
Worker -->|"Check online"| Registry[(Connection Registry)]
Worker -->|"Online members"| WS[WebSocket Servers]
Worker -->|"Offline members"| Push[Push Notification]
Worker -->|"All members"| Unread[Increment Unread Count]
style Sender fill:#88c0d0,color:#2e3440
style API fill:#ebcb8b,color:#2e3440
style DB fill:#b48ead,color:#2e3440
style Kafka fill:#d08770,color:#2e3440
style Worker fill:#a3be8c,color:#2e3440
style WS fill:#ebcb8b,color:#2e3440
style Push fill:#bf616a,color:#2e3440
Group size limits:
- WhatsApp: 1024 members (manageable fan-out)
- Slack: 500 members per channel (beyond = degraded experience)
- Discord: 500K members (different architecture - lazy loading, no delivery guarantees)
5. Design Uber (Location-Based Matching)
Uber's core challenge is efficiently matching riders with nearby drivers in real-time. This tests geospatial indexing and real-time systems.
Step 1: Core Requirements
Scale:
- 10 million drivers, 100 million riders
- 1 million concurrent rides
- Driver location update: every 4 seconds
Traffic:
- Location updates: 10M drivers / 4 seconds = 2.5M updates/second
- Ride requests: 1M rides/hour = ~300 requests/second
Latency requirements:
- Find nearby drivers: < 1 second
- Location update processing: < 500ms
Step 2: The Geospatial Problem
The Problem: "Find all drivers within 3km of rider at (40.7128, -74.0060)"
Naive approach: Check distance to all 10 million drivers → O(n) = way too slow!
Solution: Geohash
Geohash converts 2D coordinates into a 1D string. Nearby locations share common prefixes!
graph LR
subgraph "Geohash Encoding"
Coord["(40.7128, -74.0060)"] --> Hash["dr5ru7"]
end
subgraph "Nearby = Similar Prefix"
H1["dr5ru7 (Times Square)"]
H2["dr5ru6 (nearby)"]
H3["dr5ru8 (nearby)"]
H4["dr5ruk (further)"]
end
subgraph "Precision Levels"
P1["dr5 → ~156km"]
P2["dr5r → ~39km"]
P3["dr5ru → ~5km"]
P4["dr5ru7 → ~1.2km"]
end
style Coord fill:#88c0d0,color:#2e3440
style Hash fill:#a3be8c,color:#2e3440
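To see where the prefix property comes from, here is a from-scratch sketch of the standard geohash algorithm: bisect longitude and latitude alternately, record which half the point falls in as a bit, and map each 5-bit group to the geohash base32 alphabet. (Production code would use a library, as below.)

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lng, precision=6):
    """Interleave longitude/latitude bisection bits into a base32 string.
    Each extra character roughly quarters the cell area, which is why
    nearby points share a common prefix."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    bits, bit_count, result = 0, 0, []
    even = True  # even bit positions refine longitude
    while len(result) < precision:
        if even:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                bits = (bits << 1) | 1
                lng_lo = mid
            else:
                bits <<= 1
                lng_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        even = not even
        bit_count += 1
        if bit_count == 5:
            result.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(result)
```

A shorter precision is always a prefix of a longer one for the same point, which is exactly what makes prefix-based cell lookups work.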
import geohash

def find_nearby_drivers(rider_lat, rider_lng, radius_km):
    # 1. Get geohash of rider's location (precision 6 ≈ 1.2km cells)
    rider_hash = geohash.encode(rider_lat, rider_lng, precision=6)  # "dr5ru7"
    # 2. Get neighboring geohash cells (handles edge cases)
    neighbors = geohash.neighbors(rider_hash)  # 8 surrounding cells
    cells_to_search = [rider_hash] + neighbors  # 9 cells total
    # 3. Query drivers in these cells
    drivers = redis.sunion([f"drivers:{cell}" for cell in cells_to_search])
    # 4. Filter by exact distance (geohash is approximate)
    nearby = []
    for driver_id in drivers:
        lat, lng = redis.hmget(f"driver:{driver_id}", "lat", "lng")
        distance = haversine(rider_lat, rider_lng, float(lat), float(lng))
        if distance <= radius_km:
            nearby.append((driver_id, distance))
    return sorted(nearby, key=lambda x: x[1])  # Sort by distance
WHY Geohash?
- Converts 2D search into 1D prefix search → use standard indexes!
- Redis SET per geohash cell → O(1) lookup
- Only search 9 cells instead of millions of drivers
- Easy sharding by geohash prefix
Step 3: Driver Location Updates
graph TB
subgraph "Driver App"
Driver([Driver]) -->|"Every 4 sec"| GPS[GPS Location]
end
GPS --> LB[Load Balancer]
LB --> LocationService[Location Service]
LocationService --> OldCell{Geohash changed?}
OldCell -->|"Yes"| RemoveOld[Remove from old cell]
OldCell -->|"No"| UpdateOnly[Update location only]
RemoveOld --> AddNew[Add to new cell]
AddNew --> UpdateRedis[(Redis)]
UpdateOnly --> UpdateRedis
LocationService --> Kafka[Kafka
Location Stream]
Kafka --> Analytics[Analytics]
Kafka --> ETA[ETA Service]
style Driver fill:#88c0d0,color:#2e3440
style LocationService fill:#ebcb8b,color:#2e3440
style UpdateRedis fill:#bf616a,color:#2e3440
style Kafka fill:#d08770,color:#2e3440
def update_driver_location(driver_id, lat, lng, status):
    new_geohash = geohash.encode(lat, lng, precision=6)
    # Get previous location
    prev = redis.hgetall(f"driver:{driver_id}")
    prev_geohash = prev.get("geohash")
    # Update driver's location
    redis.hset(f"driver:{driver_id}", mapping={
        "lat": lat,
        "lng": lng,
        "geohash": new_geohash,
        "status": status,  # available, busy, offline
        "updated_at": time.time()
    })
    # Update geohash index if cell changed
    if prev_geohash != new_geohash:
        if prev_geohash:
            redis.srem(f"drivers:{prev_geohash}", driver_id)
        if status == "available":
            redis.sadd(f"drivers:{new_geohash}", driver_id)
    # Publish for real-time tracking (active rides)
    if driver_has_active_ride(driver_id):
        kafka.produce("driver_locations", key=driver_id, value={
            "lat": lat, "lng": lng, "timestamp": time.time()
        })
Step 4: Ride Matching Flow
sequenceDiagram
participant Rider
participant API as Ride Service
participant Location as Location Service
participant Match as Matching Service
participant Driver1 as Driver A
participant Driver2 as Driver B
Rider->>API: Request ride (pickup, destination)
API->>Location: Find drivers near pickup
Location-->>API: [Driver A (0.5km), Driver B (0.8km), ...]
API->>Match: Select best driver
Note over Match: Consider: distance, rating,
acceptance rate, vehicle type
Match-->>API: Driver A selected
API->>Driver1: 🔔 Ride request (15 sec timeout)
alt Driver A accepts
Driver1-->>API: Accept
API-->>Rider: Driver A matched!
API->>Location: Mark Driver A as "busy"
else Driver A ignores/rejects
API->>Driver2: 🔔 Ride request
Driver2-->>API: Accept
API-->>Rider: Driver B matched!
end
HOW matching works:
- Find drivers within radius (geohash query)
- Score each driver:
score = w1*distance + w2*rating + w3*acceptance_rate
- Send request to top driver with timeout
- If no response, try next driver
- After N failures, expand search radius
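The scoring step above can be sketched as a weighted sum (weights and signs here are illustrative; real matching also weighs ETA, surge pricing, and vehicle type):

```python
def score_driver(distance_km, rating, acceptance_rate,
                 w_dist=-1.0, w_rating=0.5, w_accept=0.5):
    """Weighted driver score, higher is better. Distance carries a
    negative weight so nearer drivers rank first."""
    return w_dist * distance_km + w_rating * rating + w_accept * acceptance_rate

def rank_drivers(candidates):
    """candidates: list of (driver_id, distance_km, rating, acceptance_rate).
    Returns the list sorted best-first; the top entry gets the ride request."""
    return sorted(candidates,
                  key=lambda c: score_driver(c[1], c[2], c[3]),
                  reverse=True)
```

On timeout or rejection, the matcher simply moves to the next entry in the ranked list.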
Complete Architecture
graph TB
subgraph "Clients"
RiderApp([Rider App])
DriverApp([Driver App])
end
subgraph "API Gateway"
LB[Load Balancer]
end
subgraph "Core Services"
RideService[Ride Service]
LocationService[Location Service]
MatchingService[Matching Service]
PricingService[Pricing Service]
ETAService[ETA Service]
end
subgraph "Real-time"
WebSocket[WebSocket Servers]
Kafka[Kafka]
end
subgraph "Data Stores"
Redis[(Redis
Driver Locations)]
PostgreSQL[(PostgreSQL
Rides, Users)]
Cassandra[(Cassandra
Location History)]
end
RiderApp --> LB
DriverApp --> LB
LB --> RideService
LB --> LocationService
LB --> WebSocket
RideService --> MatchingService
RideService --> PricingService
MatchingService --> LocationService
LocationService --> Redis
DriverApp -->|"Location updates"| LocationService
LocationService --> Kafka
Kafka --> ETAService
Kafka --> Cassandra
WebSocket --> Kafka
RideService --> PostgreSQL
style RiderApp fill:#88c0d0,color:#2e3440
style DriverApp fill:#88c0d0,color:#2e3440
style RideService fill:#ebcb8b,color:#2e3440
style LocationService fill:#ebcb8b,color:#2e3440
style MatchingService fill:#ebcb8b,color:#2e3440
style Redis fill:#bf616a,color:#2e3440
style PostgreSQL fill:#b48ead,color:#2e3440
style Kafka fill:#d08770,color:#2e3440
6. Design YouTube (Video Streaming)
Video streaming involves massive storage, transcoding pipelines, and CDN optimization. The key insight: optimize for the common case (watching videos) over the rare case (uploading).
Step 1: Scale Estimation
Upload:
- 500 hours of video uploaded per minute
- Average video: 10 minutes, 1GB raw → ~3GB after transcoding (multiple resolutions)
- Storage growth: 500 hours/min × 60 min = 30,000 hours of video per hour; at ~6 GB per hour of raw footage (10 min ≈ 1 GB), that's ~180 TB/hour ≈ 4.3 PB/day, before transcoded copies
Viewing:
- 1 billion hours watched per day
- Average video: 10 minutes = 6 billion video views/day
- Sustained: ~70,000 video views/second (6B / 86,400); peaks run higher
Read:Write ratio: 6 billion views/day vs ~4.3 million uploads/day (3,000 ten-minute videos per minute) ≈ 1,400:1
Key insight: Reading (streaming) is over a thousand times more common than writing (uploading). Optimize aggressively for reads. Uploads can be slow and async.
Step 2: Video Upload Pipeline
graph LR
subgraph "Upload"
Client([Creator]) -->|"Chunked upload"| UploadService[Upload Service]
UploadService --> OriginalStore[(Original Storage)]
end
subgraph "Processing Pipeline"
OriginalStore --> Queue[Job Queue]
Queue --> Transcode[Transcoding Workers]
Transcode -->|"360p"| CDN[(CDN Storage)]
Transcode -->|"720p"| CDN
Transcode -->|"1080p"| CDN
Transcode -->|"4K"| CDN
Transcode --> Thumbnail[Thumbnail Generator]
Thumbnail --> CDN
end
subgraph "Metadata"
UploadService --> MetadataDB[(Metadata DB)]
Transcode -->|"Update status"| MetadataDB
end
style Client fill:#88c0d0,color:#2e3440
style UploadService fill:#ebcb8b,color:#2e3440
style Transcode fill:#a3be8c,color:#2e3440
style CDN fill:#bf616a,color:#2e3440
style MetadataDB fill:#b48ead,color:#2e3440
def upload_video(user_id, video_file, metadata):
    # 1. Generate unique video ID
    video_id = generate_snowflake_id()

    # 2. Upload original to blob storage (resumable, chunked)
    original_url = s3.upload(
        bucket="originals",
        key=f"{video_id}/original",
        file=video_file,
        multipart=True  # Support resume on failure
    )

    # 3. Save metadata
    db.insert("""
        INSERT INTO videos (id, user_id, title, description, status)
        VALUES (?, ?, ?, ?, 'processing')
    """, video_id, user_id, metadata.title, metadata.description)

    # 4. Queue transcoding job
    kafka.produce("transcode_jobs", {
        "video_id": video_id,
        "source_url": original_url,
        "resolutions": ["360p", "720p", "1080p", "4k"]
    })

    return video_id  # Client polls for status
HOW transcoding works:
- Split video into segments (10-second chunks)
- Distribute segments to worker pool
- Each worker transcodes segment to target resolution
- Reassemble segments into final video
- Generate HLS/DASH manifest file
Parallelization: A 1-hour video becomes 360 segments, processed in parallel → transcoding completes in minutes, not hours.
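The segment fan-out above can be sketched with a thread pool standing in for the distributed worker fleet; `transcode_segment` is a placeholder for the real per-segment work (e.g. an ffmpeg subprocess), not an actual encoder:

```python
from concurrent.futures import ThreadPoolExecutor

SEGMENT_SECONDS = 10

def split_into_segments(duration_seconds):
    """Return (start, end) pairs covering the video in 10-second chunks."""
    return [(s, min(s + SEGMENT_SECONDS, duration_seconds))
            for s in range(0, duration_seconds, SEGMENT_SECONDS)]

def transcode_segment(segment, resolution):
    # Placeholder for the real work (e.g. an ffmpeg invocation per segment).
    start, _end = segment
    return f"{resolution}/segment_{start:06d}.ts"

def transcode_video(duration_seconds, resolutions):
    segments = split_into_segments(duration_seconds)
    outputs = {}
    with ThreadPoolExecutor(max_workers=32) as pool:
        for res in resolutions:
            # Segments are independent, so each rendition fans out in parallel.
            outputs[res] = list(pool.map(lambda seg: transcode_segment(seg, res), segments))
    return outputs

out = transcode_video(3600, ["360p", "720p"])
print(len(out["360p"]))  # a 1-hour video becomes 360 segments
```

In the real pipeline the pool would be a fleet of workers pulling from the job queue, and a final step would reassemble segments and write the HLS/DASH manifest.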
Step 3: Video Streaming (Adaptive Bitrate)
sequenceDiagram
participant Client as Video Player
participant CDN
participant Origin as Origin Server
Client->>CDN: GET /video/abc123/manifest.m3u8
CDN-->>Client: Manifest (list of quality levels)
Note over Client: Manifest lists quality levels<br/>(360p, 720p, 1080p, 4K)<br/>with per-quality playlist URLs
Client->>CDN: GET /video/abc123/720p/segment_001.ts
CDN-->>Client: Video segment (2-10 seconds)
Note over Client: Bandwidth drops...
Client->>CDN: GET /video/abc123/360p/segment_002.ts
Note over Client: Bandwidth improves...
Client->>CDN: GET /video/abc123/1080p/segment_003.ts
# HLS Manifest Example (master.m3u8)
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
360p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2800000,RESOLUTION=1280x720
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/playlist.m3u8
# Quality-specific playlist (720p/playlist.m3u8)
#EXTM3U
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
segment_001.ts
#EXTINF:10.0,
segment_002.ts
...
WHY Adaptive Bitrate Streaming (ABR)?
- Automatically adjusts quality based on bandwidth
- No buffering on slow connections (drops to 360p)
- Best quality on fast connections (jumps to 4K)
- Seamless switching without user intervention
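The switching logic boils down to "highest rendition whose bitrate fits under measured throughput, with a safety margin". A minimal sketch, using the bitrates from the master manifest example above (real players also smooth throughput estimates and consider buffer level):

```python
# Renditions as (name, required bandwidth in bits/sec), mirroring the manifest above.
RENDITIONS = [("360p", 800_000), ("720p", 2_800_000), ("1080p", 5_000_000)]

def pick_rendition(measured_bps, safety=0.8):
    """Choose the best quality whose bitrate fits within a fraction of
    measured throughput; fall back to the lowest rendition otherwise."""
    budget = measured_bps * safety
    best = RENDITIONS[0][0]
    for name, required in RENDITIONS:
        if required <= budget:
            best = name
    return best

print(pick_rendition(10_000_000))  # fast connection -> "1080p"
print(pick_rendition(1_500_000))   # slow connection -> "360p"
```

The client re-runs this choice before fetching each segment, which is what makes the mid-stream quality switches in the sequence diagram possible.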
Step 4: CDN Architecture
graph TB
subgraph "Viewers"
US([US Viewer])
EU([EU Viewer])
Asia([Asia Viewer])
end
subgraph "Edge (CDN PoPs)"
US_Edge[US Edge Server]
EU_Edge[EU Edge Server]
Asia_Edge[Asia Edge Server]
end
subgraph "Regional Cache"
US_Regional[US Regional]
EU_Regional[EU Regional]
end
subgraph "Origin"
Origin[Origin Servers]
S3[(S3 Storage)]
end
US --> US_Edge
EU --> EU_Edge
Asia --> Asia_Edge
US_Edge -->|"Cache Miss"| US_Regional
EU_Edge -->|"Cache Miss"| EU_Regional
Asia_Edge -->|"Cache Miss"| Origin
US_Regional -->|"Cache Miss"| Origin
EU_Regional -->|"Cache Miss"| Origin
Origin --> S3
style US fill:#88c0d0,color:#2e3440
style EU fill:#88c0d0,color:#2e3440
style Asia fill:#88c0d0,color:#2e3440
style US_Edge fill:#a3be8c,color:#2e3440
style EU_Edge fill:#a3be8c,color:#2e3440
style Asia_Edge fill:#a3be8c,color:#2e3440
style Origin fill:#ebcb8b,color:#2e3440
style S3 fill:#b48ead,color:#2e3440
HOW CDN caching works for video:
- Popular videos: Cached at edge (98% of requests)
- Less popular: Cached at regional tier
- Long tail: Fetched from origin on-demand
- Strategy: Pre-warm cache for new viral videos
80/20 rule: 20% of videos get 80% of views → those 20% are always cached at edge.
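The tiered lookup (edge → regional → origin) can be modeled as chained caches, each with its own LRU eviction. A toy sketch, not a real CDN:

```python
from collections import OrderedDict

class LruTier:
    """One cache tier; misses fall through to the parent tier or origin."""
    def __init__(self, name, capacity, parent=None):
        self.name, self.capacity, self.parent = name, capacity, parent
        self.store = OrderedDict()

    def get(self, video_id, origin_fetch):
        if video_id in self.store:
            self.store.move_to_end(video_id)        # refresh recency
            return self.store[video_id], self.name  # hit at this tier
        # Miss: ask the next tier up, or the origin at the top of the chain.
        if self.parent:
            data, served_by = self.parent.get(video_id, origin_fetch)
        else:
            data, served_by = origin_fetch(video_id), "origin"
        self.store[video_id] = data                 # populate on the way back
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)          # evict least recently used
        return data, served_by

regional = LruTier("regional", capacity=1000)
edge = LruTier("edge", capacity=100, parent=regional)

_, first = edge.get("abc123", lambda vid: b"segment-bytes")
_, second = edge.get("abc123", lambda vid: b"segment-bytes")
print(first, second)  # origin edge
```

Because each tier populates itself on the way back, the 20% of hot videos quickly end up pinned at the edge while the long tail stays at regional or origin.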
7. Design a Notification System
Notification systems must handle multiple channels (push, email, SMS), user preferences, rate limiting, and guaranteed delivery. This tests queue design and third-party integrations.
Step 1: Requirements
- Channels: Push notifications, Email, SMS, In-app
- Scale: 100 million notifications/day
- Delivery: At-least-once (no lost notifications)
- Latency: Push < 1 second, Email < 1 minute
- Features: User preferences, rate limiting, templates
Step 2: High-Level Design
graph TB
subgraph "Producers"
Service1([Order Service])
Service2([Social Service])
Service3([Marketing])
end
subgraph "Notification Service"
API[Notification API]
Validator[Validator]
Router[Channel Router]
end
subgraph "Queues (Kafka)"
PushQ[Push Queue]
EmailQ[Email Queue]
SMSQ[SMS Queue]
end
subgraph "Workers"
PushWorker[Push Workers]
EmailWorker[Email Workers]
SMSWorker[SMS Workers]
end
subgraph "Third Party"
FCM[Firebase/APNs]
SendGrid[SendGrid]
Twilio[Twilio]
end
Service1 --> API
Service2 --> API
Service3 --> API
API --> Validator
Validator --> Router
Router --> PushQ
Router --> EmailQ
Router --> SMSQ
PushQ --> PushWorker
EmailQ --> EmailWorker
SMSQ --> SMSWorker
PushWorker --> FCM
EmailWorker --> SendGrid
SMSWorker --> Twilio
style API fill:#ebcb8b,color:#2e3440
style Router fill:#88c0d0,color:#2e3440
style PushQ fill:#d08770,color:#2e3440
style EmailQ fill:#d08770,color:#2e3440
style SMSQ fill:#d08770,color:#2e3440
style PushWorker fill:#a3be8c,color:#2e3440
style EmailWorker fill:#a3be8c,color:#2e3440
style SMSWorker fill:#a3be8c,color:#2e3440
Step 3: Notification Flow
def send_notification(user_id, notification_type, data):
    # 1. Get user preferences
    prefs = get_user_preferences(user_id)

    # 2. Check if user wants this type of notification
    if not prefs.allows(notification_type):
        return "suppressed"

    # 3. Rate limit check
    if is_rate_limited(user_id, notification_type):
        return "rate_limited"

    # 4. Get user's devices and contact info
    devices = get_user_devices(user_id)  # For push
    email = get_user_email(user_id)
    phone = get_user_phone(user_id)

    # 5. Determine channels based on preferences
    channels = []
    if prefs.push_enabled and devices:
        channels.append("push")
    if prefs.email_enabled and email:
        channels.append("email")
    if prefs.sms_enabled and phone:
        channels.append("sms")

    # 6. Render notification from template
    notification = render_template(notification_type, data)

    # 7. Queue for each channel
    for channel in channels:
        kafka.produce(f"notifications_{channel}", {
            "notification_id": generate_id(),
            "user_id": user_id,
            "channel": channel,
            "content": notification,
            "created_at": time.time()
        })
User Preferences Schema
CREATE TABLE notification_preferences (
    user_id BIGINT PRIMARY KEY,
    push_enabled BOOLEAN DEFAULT true,
    email_enabled BOOLEAN DEFAULT true,
    sms_enabled BOOLEAN DEFAULT false,
    -- Granular preferences
    marketing_push BOOLEAN DEFAULT false,
    marketing_email BOOLEAN DEFAULT true,
    order_updates_push BOOLEAN DEFAULT true,
    order_updates_email BOOLEAN DEFAULT true,
    order_updates_sms BOOLEAN DEFAULT true,
    social_push BOOLEAN DEFAULT true,
    social_email BOOLEAN DEFAULT false,
    -- Quiet hours
    quiet_hours_start TIME,
    quiet_hours_end TIME,
    timezone VARCHAR(50)
);
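The quiet-hours columns need a check that handles windows crossing midnight (e.g. 22:00 to 07:00). A sketch, assuming `now` has already been converted to the user's local time using the stored `timezone` (e.g. via `zoneinfo`):

```python
from datetime import time

def in_quiet_hours(now, start, end):
    """True if local time `now` falls in [start, end), including
    windows that wrap past midnight (e.g. 22:00 -> 07:00)."""
    if start is None or end is None:
        return False                   # user has no quiet hours set
    if start <= end:
        return start <= now < end      # same-day window
    return now >= start or now < end   # window wraps midnight

print(in_quiet_hours(time(23, 30), time(22, 0), time(7, 0)))  # True
print(in_quiet_hours(time(12, 0), time(22, 0), time(7, 0)))   # False
```

During quiet hours the worker would typically hold or batch non-urgent notifications rather than drop them.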
Delivery Guarantees:
| Channel | Guarantee | Why |
| --- | --- | --- |
| Push | Best effort | Device might be offline, token expired |
| Email | At-least-once | Retry on failure, may result in duplicates |
| SMS | At-least-once | Delivery receipt from carrier |
Step 4: Handling Failures
class NotificationWorker:
    def process(self, notification):
        try:
            result = self.send(notification)
            self.mark_delivered(notification.id)
            return result
        except TemporaryError as e:
            # Retry with exponential backoff
            retry_count = notification.retry_count + 1
            if retry_count < MAX_RETRIES:
                delay = min(2 ** retry_count, 3600)  # Max 1 hour
                self.schedule_retry(notification, delay)
            else:
                self.move_to_dead_letter_queue(notification)
        except PermanentError as e:
            # Don't retry (invalid token, unsubscribed, etc.)
            self.mark_failed(notification.id, str(e))
            # If push token invalid, remove from user's devices
            if isinstance(e, InvalidTokenError):
                remove_device_token(notification.user_id, notification.token)
WHY separate queues per channel:
- Different SLAs: Push needs < 1s, email can take minutes
- Different retry strategies: Push retries fast, SMS retries slow
- Independent scaling: More push workers during app launch
- Isolation: Email provider outage doesn't affect push
Key Takeaways
The System Design Process
- Clarify requirements - Ask questions, don't assume
- Estimate scale - Back-of-envelope math shows you understand reality
- Start simple - Draw basic boxes, then iterate
- Identify bottlenecks - Where will it break at scale?
- Discuss trade-offs - There's no perfect solution
- Deep dive where asked - Follow interviewer's interest
Common Patterns to Know
| Pattern | When to Use |
| --- | --- |
| Fan-out on write | Read-heavy, bounded followers (Twitter for normal users) |
| Fan-out on read | Write-heavy, unbounded followers (Twitter for celebrities) |
| Geohash/QuadTree | Location-based queries (Uber, Yelp) |
| Token bucket | Rate limiting with burst allowance |
| Consistent hashing | Distributed caching, sharding |
| CDC + Kafka | Real-time data sync, event sourcing |
| CQRS | Separate read/write models at different scales |