Core distributed systems knowledge at the right level for a Distinguished Architect interview. Focus on trade-offs, not theory.
| Choice | Means | Example |
|---|---|---|
| CP (Consistency + Partition Tolerance) | System becomes unavailable during partitions to maintain consistency | Traditional RDBMS with strong consistency, ZooKeeper |
| AP (Availability + Partition Tolerance) | System stays available but may return stale data | DynamoDB, Cassandra, DNS |
| CA (Consistency + Availability) | Only possible in a single node (not distributed) | Single-server databases |
"When would you choose consistency over availability for Twilio's messaging platform?"
"The messaging pipeline itself should be AP - we want to accept messages even during network partitions. If we can't replicate immediately, that's okay - we queue it and eventually deliver. However, for the control plane - account authentication, API key validation, rate limit state - I'd lean more CP.
Here's why: If a customer sends a message and we temporarily can't persist it due to a partition, we can return a retryable error and they'll resend. But if we accept an API request with an invalid key because we couldn't reach the auth service, that's a security issue.
In practice, I'd use PACELC thinking: partitions are rare, so optimize for the "else" case - low latency and eventual consistency for the data plane, stronger consistency with slightly higher latency for the control plane. We'd use DynamoDB with eventually consistent reads for message metadata, but strongly consistent reads for account authentication."
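The data-plane/control-plane read split can be sketched with a toy leader-plus-replicas model (pure Python, no real database; in DynamoDB terms, the strong read corresponds to `ConsistentRead=True` on `get_item`):

```python
import random

class ReplicatedStore:
    """Toy model of one leader plus lagging read replicas, to show why
    the control plane reads strongly while the data plane reads from
    any replica (eventually consistent)."""

    def __init__(self, num_replicas: int = 2):
        self.leader: dict = {}
        self.replicas = [dict() for _ in range(num_replicas)]

    def write(self, key, value):
        self.leader[key] = value         # replication happens later

    def replicate(self):
        for r in self.replicas:          # async replication catches up
            r.update(self.leader)

    def read_strong(self, key):
        return self.leader.get(key)      # always reflects the latest write

    def read_eventual(self, key):
        return random.choice(self.replicas).get(key)  # may be stale

store = ReplicatedStore()
store.write("api_key:AC123", "revoked")
# Before replication catches up, an eventually consistent read can miss
# the revocation - fine for message status, not for auth decisions.
assert store.read_strong("api_key:AC123") == "revoked"
assert store.read_eventual("api_key:AC123") is None
store.replicate()
assert store.read_eventual("api_key:AC123") == "revoked"
```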
| Model | Guarantee | Latency Cost | When to Use |
|---|---|---|---|
| Strong Consistency | Read always returns latest write | High - must coordinate across nodes | Financial transactions, inventory |
| Eventual Consistency | Eventually all replicas converge | Low - no coordination needed | Social media feeds, message status |
| Causal Consistency | Related writes seen in order | Medium - track causality | Message threads, comment replies |
| Read-Your-Writes | You see your own writes immediately | Low-Medium - session affinity | User settings, profile updates |
"How would you design message status tracking for billions of messages per day?"
"I'd use eventual consistency with DynamoDB. Here's the reasoning: message status is fundamentally asynchronous - the message goes through multiple states (queued, sent, delivered, read) over seconds or minutes. There's no user expectation of instant status updates.
The write path writes status to DynamoDB with eventually consistent replication across AZs. The read path (status check API) uses eventually consistent reads - we're optimizing for low latency and high throughput.
However, I'd add read-your-writes consistency for the customer's own writes. If they send a message and immediately query its status, we route that to the same AZ using session affinity so they see their own write. But if someone else queries that message status, eventual consistency is fine.
This gives us the throughput to handle billions of writes/day while keeping p99 latency under 10ms for status checks."
"Would you use consensus for Twilio's messaging pipeline?"
"No, not for the data plane. Consensus requires multiple round trips and a quorum, which would add 10-50ms latency per message and limit throughput. For a system handling millions of messages per second, that's a non-starter.
Instead, I'd use consensus sparingly for control plane operations: leader election in each cell, cluster membership management, and distributed configuration. Let etcd or ZooKeeper handle that - don't build it yourself.
For the message pipeline itself, I'd design for eventual consistency with idempotent operations. Messages get written to Kafka or Kinesis (which uses quorum internally but hides that complexity), processed by workers, and delivery status is eventually consistent.
The key insight: use consensus to set up the infrastructure (which cells are active, who's the leader), but not for every transaction. That's how you get both reliability and scale."
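The control-plane use of consensus can be illustrated with a toy lease-based leader election, standing in for what etcd leases or ZooKeeper ephemeral nodes provide (node names and TTLs are hypothetical):

```python
class LeaseKV:
    """Toy stand-in for etcd/ZooKeeper: a compare-and-swap key with a
    TTL. Real deployments get this from etcd leases or ZooKeeper
    ephemeral nodes instead of building it themselves."""

    def __init__(self):
        self.leader = None
        self.expires = 0.0

    def try_acquire(self, node: str, ttl: float, now: float) -> bool:
        # CAS semantics: take the lease only if it's free, expired,
        # or already ours (renewal).
        if self.leader is None or now >= self.expires or self.leader == node:
            self.leader, self.expires = node, now + ttl
            return True
        return False

kv = LeaseKV()
assert kv.try_acquire("cell-1-worker-a", ttl=5.0, now=0.0)      # wins election
assert not kv.try_acquire("cell-1-worker-b", ttl=5.0, now=1.0)  # lease held
assert kv.try_acquire("cell-1-worker-b", ttl=5.0, now=6.0)      # lease expired
```

The data plane never touches this path; the lease only decides which worker leads each cell, so consensus cost is paid once per election rather than per message.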
| Semantic | Guarantee | How to Achieve | Use Case |
|---|---|---|---|
| At-most-once | Message delivered 0 or 1 times | Fire and forget, no retries | Metrics, logs, non-critical events |
| At-least-once | Message delivered 1+ times | Retry until acknowledged | Most messaging, event processing |
| Exactly-once | Message's effect applied exactly 1 time | Distributed transactions or deduplication | Financial transactions, billing |
Recommended pattern: design for at-least-once delivery with idempotent consumers.
Result: end-to-end reliability without expensive distributed transactions.
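The pattern can be sketched as a consumer that deduplicates on message ID (a minimal in-memory sketch; production would use a persistent dedup store so the set survives restarts):

```python
def make_consumer():
    processed = set()   # in production: a durable dedup table keyed by ID
    delivered = []

    def handle(msg: dict):
        if msg["id"] in processed:      # duplicate redelivery: skip it
            return
        delivered.append(msg["body"])   # the side effect, applied once
        processed.add(msg["id"])        # record only after success

    return handle, delivered

handle, delivered = make_consumer()
handle({"id": "SM1", "body": "hello"})
handle({"id": "SM1", "body": "hello"})  # at-least-once redelivery
assert delivered == ["hello"]           # side effect applied exactly once
```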
"How would you ensure messages aren't delivered twice?"
"I'd design for at-least-once delivery with idempotency, not exactly-once delivery. Here's why: true exactly-once delivery requires distributed transactions across the message queue, database, and external carriers - that's too expensive and complex.
Instead, the architecture would be:
1. Ingestion: Accept message with client-provided idempotency key. Store in DynamoDB with key as primary key - duplicate POSTs get deduplicated automatically.
2. Processing: Write to Kafka with message SID. Workers consume with at-least-once semantics. If a worker crashes mid-processing, Kafka redelivers.
3. Carrier delivery: Workers send to carrier. If they get a timeout, they retry. Carrier might receive duplicate requests, but we include the same message ID so they can deduplicate.
4. Customer webhooks: We might send delivery status webhook twice if our system restarts. We include message SID and status update ID so customers can deduplicate.
This gives end-to-end idempotency without distributed transactions. Each layer handles duplicates locally."
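Step 1's deduplication reduces to a put-if-absent primitive (in DynamoDB, a `PutItem` with a `ConditionExpression` of `attribute_not_exists(...)` on the key attribute); a toy version:

```python
class IdempotentTable:
    """Toy version of ingestion dedup: the client-provided idempotency
    key is the primary key, and a second put with the same key returns
    the original item instead of creating a new message."""

    def __init__(self):
        self.items = {}

    def put_if_absent(self, key: str, item: dict) -> dict:
        if key in self.items:
            return self.items[key]   # duplicate POST: return the original
        self.items[key] = item
        return item

table = IdempotentTable()
first = table.put_if_absent("idem-abc", {"sid": "SM100", "body": "hi"})
retry = table.put_if_absent("idem-abc", {"sid": "SM999", "body": "hi"})
assert retry["sid"] == "SM100"   # the retry maps to the original message
```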
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Hash-based | hash(key) % num_partitions | Even distribution | Hard to reshard, can't do range queries |
| Range-based | Key ranges (A-M, N-Z) | Range queries efficient | Hotspots if keys aren't distributed |
| Consistent Hashing | Keys and nodes on a ring | Adding nodes only affects neighbors | More complex, potential imbalance |
| Geography-based | Partition by region/country | Locality, data residency | Uneven load across regions |
"How would you partition Twilio's message data?"
"I'd use a hybrid approach: partition by account_id and time_bucket.
Primary partition key: account_id - This gives us tenant isolation, which is critical for Twilio. Each customer's data lives on specific shards. This also aligns with access patterns - customers query their own messages, not across accounts.
Secondary partition: time_bucket - Within each account, partition by month or week. This keeps recent data hot and allows us to archive old messages to cheaper storage.
Number of shards: Start with 256 logical shards mapped to fewer physical nodes. This gives us room to scale - we can add physical nodes and rebalance logical shards without changing the partition scheme.
Handling hotspots: Large customers (enterprises sending millions of messages/day) get dedicated shards. We detect this in the API layer and route their traffic separately. Small customers share shards.
This approach scales to billions of messages, maintains tenant isolation, supports time-range queries, and handles heterogeneous customer sizes."
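The hybrid key can be sketched as a function from (account, timestamp) to (logical shard, time bucket); the account SID and shard count below are illustrative:

```python
import hashlib
from datetime import datetime, timezone

NUM_LOGICAL_SHARDS = 256  # logical shards, mapped to fewer physical nodes

def partition_for(account_sid: str, sent_at: datetime) -> tuple:
    # Primary: hash the account SID for tenant isolation and even spread.
    digest = hashlib.sha256(account_sid.encode()).digest()
    shard = int.from_bytes(digest[:4], "big") % NUM_LOGICAL_SHARDS
    # Secondary: monthly time bucket for hot/cold separation and archiving.
    bucket = sent_at.strftime("%Y-%m")
    return shard, bucket

shard, bucket = partition_for("AC5ef8732", datetime(2024, 3, 15, tzinfo=timezone.utc))
assert 0 <= shard < NUM_LOGICAL_SHARDS
assert bucket == "2024-03"
```

Because the account-to-shard mapping is deterministic, the hotspot override for large customers is just a lookup table consulted before this function, routing those accounts to dedicated shards.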
| Approach | How It Works | When to Use |
|---|---|---|
| Wall-clock time (NTP) | Synchronized physical clocks | Approximate ordering, timestamps for humans |
| Lamport timestamps | Logical counter, incremented per event | Total ordering of events (with node-ID tiebreak) |
| Vector clocks | Per-node counters, tracks causality | Detecting concurrent updates, conflict resolution |
| Hybrid logical clocks | Combines wall time + logical counter | Modern databases (CockroachDB, MongoDB) |
"How would you order message delivery when messages come from multiple regions?"
"First, I'd ask: do we actually need strict ordering? For most messaging use cases, best-effort ordering is sufficient. Messages sent seconds apart don't need microsecond precision.
If the customer doesn't require ordering, we accept messages in any region with a timestamp for approximate ordering, but we don't guarantee it. This is the cheapest and fastest approach.
If ordering IS required (like a conversation thread), I wouldn't rely on timestamps. Instead:
1. Each conversation gets a logical sequence number
2. Messages include a client-generated sequence number or explicitly reference the previous message
3. We detect gaps and can hold later messages until earlier ones arrive
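The gap detection in step 3 can be sketched as a per-conversation reorder buffer that holds later messages until earlier ones arrive:

```python
class ConversationBuffer:
    """Holds out-of-order messages until the sequence gap fills."""

    def __init__(self):
        self.next_seq = 1
        self.pending = {}     # seq -> body, waiting on earlier messages
        self.delivered = []

    def receive(self, seq: int, body: str):
        self.pending[seq] = body
        # Deliver every message whose predecessors have all arrived.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

buf = ConversationBuffer()
buf.receive(2, "how are you?")   # arrives early: held back
assert buf.delivered == []
buf.receive(1, "hi")             # gap fills: both deliver, in order
assert buf.delivered == ["hi", "how are you?"]
```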
For internal systems (audit logs, event sourcing), I'd use hybrid logical clocks: each event gets a timestamp that's mostly wall-clock time but includes a logical counter for events that happen on the same node in the same millisecond. This gives us human-readable timestamps that also support precise ordering.
The key is: don't rely on synchronized clocks across regions for correctness, only for approximate ordering."
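A minimal hybrid logical clock sketch (send-side only; it omits the merge rule a full HLC applies when receiving a remote timestamp):

```python
class HybridLogicalClock:
    """Mostly wall-clock time, plus a logical counter to order events
    that land in the same physical millisecond on one node."""

    def __init__(self):
        self.wall = 0      # last physical time seen (ms)
        self.logical = 0   # tiebreaker within one millisecond

    def now(self, physical_ms: int) -> tuple:
        if physical_ms > self.wall:
            self.wall, self.logical = physical_ms, 0
        else:
            self.logical += 1   # same ms (or the clock went backwards)
        return (self.wall, self.logical)

clk = HybridLogicalClock()
a = clk.now(1000)
b = clk.now(1000)   # same millisecond: logical counter breaks the tie
c = clk.now(1001)
assert a < b < c    # timestamps compare in event order
```

Note the `else` branch also absorbs a wall clock that jumps backwards: the HLC never regresses, which is exactly why correctness doesn't depend on synchronized clocks.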
| Pattern | Writes | Reads | Complexity | Use Case |
|---|---|---|---|---|
| Leader-Follower | Leader only | Any replica | Low | Most databases, read-heavy workloads |
| Multi-Leader | Any leader | Any replica | High (conflicts!) | Multi-region apps, collaborative editing |
| Leaderless (Quorum) | Quorum of nodes | Quorum of nodes | Medium | High availability, DynamoDB/Cassandra |
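The leaderless row relies on the quorum-intersection condition R + W > N: every read quorum then overlaps every write quorum, so a quorum read sees the latest acknowledged write. A brute-force check of that claim:

```python
from itertools import combinations

def quorums_intersect(n: int, w: int, r: int) -> bool:
    # True iff every possible write quorum of size w shares at least
    # one node with every possible read quorum of size r.
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

assert quorums_intersect(3, 2, 2)        # classic N=3, W=2, R=2
assert not quorums_intersect(3, 1, 1)    # W=1, R=1 can miss the write
assert (2 + 2 > 3) and not (1 + 1 > 3)   # matches the algebraic condition
```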
"How would you design multi-region replication for Twilio's account database?"
"I'd use leader-follower with regional read replicas and cross-region disaster recovery, not multi-leader.
Architecture:
- Primary region (e.g., us-east-1) has the leader database (Aurora)
- Same region has 2+ read replicas for read scaling
- Secondary region (e.g., us-west-2) has cross-region read replica for disaster recovery
Why not multi-leader?
Account data has strong consistency requirements - you can't have two regions accepting conflicting updates to the same account. Multi-leader would require conflict resolution, which is complex and error-prone for structured data like accounts and billing.
Write path:
All writes go to the leader in the primary region. Control plane writes are infrequent enough (account creation, setting updates) that cross-region latency is acceptable.
Read path:
- In-region reads go to read replicas (fast)
- Cross-region reads tolerate replication lag (eventual consistency)
- For read-your-writes consistency, session affinity to the region where you wrote
Failover:
If primary region fails, promote the cross-region replica to leader. This is rare but tested regularly.
This gives us read scalability, disaster recovery, and avoids the complexity of multi-leader conflict resolution."