System Design & Architecture

Patterns for building scalable, resilient, distributed systems at Twilio scale

Architectural Styles

Monolith vs Microservices

MONOLITH                              MICROSERVICES
┌─────────────────────────┐           ┌─────────┐ ┌─────────┐ ┌─────────┐
│      Single Deployment   │           │  User   │ │  Order  │ │ Payment │
│  ┌─────┐ ┌─────┐ ┌─────┐│           │ Service │ │ Service │ │ Service │
│  │User │ │Order│ │Pay  ││           └────┬────┘ └────┬────┘ └────┬────┘
│  │     │ │     │ │     ││                │          │          │
│  └──┬──┘ └──┬──┘ └──┬──┘│           ┌────▼──────────▼──────────▼────┐
│     │      │      │    │           │        Message Broker         │
│  ┌──▼──────▼──────▼──┐ │           │      (Kafka / RabbitMQ)       │
│  │   Shared Database  │ │           └───────────────────────────────┘
│  └────────────────────┘ │                │          │          │
└─────────────────────────┘           ┌────▼────┐┌───▼────┐┌────▼────┐
                                      │  DB 1   ││  DB 2  ││  DB 3   │
                                      └─────────┘└────────┘└─────────┘

Monolith Advantages

Monolith Disadvantages

Microservices Advantages

Microservices Disadvantages

Architect's Rule of Thumb

Start with a well-structured monolith. Extract microservices only when you have: (1) clear domain boundaries, (2) independent scaling needs, or (3) team scaling requirements. Premature microservices = distributed monolith = worst of both worlds.

Event-Driven Architecture

┌──────────────┐     Event      ┌─────────────────┐     Event      ┌──────────────┐
│ Order Service│────Published──▶│  Message Broker │◀───Consumed────│ Email Service│
└──────────────┘                │  (Kafka/SQS)    │                └──────────────┘
                                └────────┬────────┘
                                         │
              ┌──────────────────────────┼──────────────────────────┐
              │                          │                          │
              ▼                          ▼                          ▼
     ┌────────────────┐        ┌────────────────┐        ┌────────────────┐
     │Inventory Service│       │Analytics Service│       │ Audit Service  │
     └────────────────┘        └────────────────┘        └────────────────┘
// Domain Event
public record OrderPlacedEvent(
    String eventId,
    String orderId,
    String customerId,
    List<OrderItem> items,
    BigDecimal total,
    Instant occurredAt
) {
    public static OrderPlacedEvent from(Order order) {
        return new OrderPlacedEvent(
            UUID.randomUUID().toString(),
            order.getId(),
            order.getCustomerId(),
            order.getItems(),
            order.getTotal(),
            Instant.now()
        );
    }
}

// Order Service publishes event
@Service
public class OrderService {
    private final KafkaTemplate<String, OrderPlacedEvent> kafka;

    @Transactional
    public Order placeOrder(CreateOrderRequest request) {
        Order order = orderRepository.save(new Order(request));

        // Publish event AFTER transaction commits (outbox pattern better)
        kafka.send("orders.placed", order.getId(), OrderPlacedEvent.from(order));

        return order;
    }
}
// Email Service consumes event
@Service
public class OrderNotificationHandler {

    @KafkaListener(topics = "orders.placed", groupId = "email-service")
    public void handleOrderPlaced(OrderPlacedEvent event) {
        Customer customer = customerService.getById(event.customerId());
        emailService.sendOrderConfirmation(customer.getEmail(), event);
    }
}

// Inventory Service consumes same event (fan-out)
@Service
public class InventoryHandler {

    @KafkaListener(topics = "orders.placed", groupId = "inventory-service")
    public void handleOrderPlaced(OrderPlacedEvent event) {
        for (OrderItem item : event.items()) {
            inventoryService.decrementStock(item.productId(), item.quantity());
        }
    }
}

// Benefits:
// - Services are decoupled (don't know about each other)
// - Easy to add new consumers without changing producer
// - Events provide audit trail
// - Can replay events for recovery/debugging

Communication Patterns

Synchronous vs Asynchronous

PatternWhen to UseTrade-offs
REST/HTTP Simple CRUD, public APIs, request-response needed Synchronous, tight coupling, latency chains
gRPC Internal service-to-service, high performance needed Binary protocol, streaming support, schema-first
Message Queue Async processing, load leveling, reliability needed Eventual consistency, more complex debugging
Event Streaming Event sourcing, real-time analytics, audit logs Kafka complexity, ordering guarantees vary
// REST client with resilience (Spring WebClient)
@Service
public class UserServiceClient {
    private final WebClient webClient;

    public Mono<User> getUser(String userId) {
        return webClient.get()
            .uri("/users/{id}", userId)
            .retrieve()
            .bodyToMono(User.class)
            .timeout(Duration.ofSeconds(5))
            .retryWhen(Retry.backoff(3, Duration.ofMillis(100)))
            .onErrorResume(e -> {
                log.error("Failed to get user", e);
                return Mono.empty();
            });
    }
}

// With virtual threads (Java 21) - simpler blocking code
@Service
public class UserServiceClient {
    private final RestClient restClient;

    public User getUser(String userId) {
        return restClient.get()
            .uri("/users/{id}", userId)
            .retrieve()
            .body(User.class);  // Blocking is OK with virtual threads
    }
}
// gRPC - schema-first with Protocol Buffers
// user.proto
/*
syntax = "proto3";

service UserService {
    rpc GetUser(GetUserRequest) returns (User);
    rpc ListUsers(ListUsersRequest) returns (stream User);  // Streaming!
}

message GetUserRequest {
    string user_id = 1;
}

message User {
    string id = 1;
    string name = 2;
    string email = 3;
}
*/

// Generated client usage
@Service
public class UserServiceGrpcClient {
    private final UserServiceGrpc.UserServiceBlockingStub stub;

    public User getUser(String userId) {
        GetUserRequest request = GetUserRequest.newBuilder()
            .setUserId(userId)
            .build();
        return stub.getUser(request);
    }

    // Streaming response
    public Iterator<User> listUsers() {
        return stub.listUsers(ListUsersRequest.getDefaultInstance());
    }
}

// gRPC advantages:
// - 10x faster than REST (binary, HTTP/2)
// - Strong typing from proto definitions
// - Bi-directional streaming
// - Auto-generated clients in any language
// Message Queue (RabbitMQ/SQS) - async, reliable
@Service
public class OrderProcessor {

    private final RabbitTemplate rabbit;

    // Send message to queue
    public void submitOrder(Order order) {
        rabbit.convertAndSend("orders.pending", order);
        // Returns immediately - processing happens async
    }
}

// Worker processes messages
@Component
public class OrderWorker {

    @RabbitListener(queues = "orders.pending")
    public void processOrder(Order order) {
        try {
            paymentService.charge(order);
            inventoryService.reserve(order);
            order.setStatus(CONFIRMED);
            orderRepository.save(order);
        } catch (Exception e) {
            // Message will be requeued (or sent to DLQ)
            throw e;
        }
    }
}

// Queue benefits:
// - Load leveling (spikes absorbed by queue)
// - Reliability (messages persisted until processed)
// - Retry with backoff (automatic requeue)
// - Dead letter queue for failed messages

Data Patterns

CQRS (Command Query Responsibility Segregation)

                    ┌─────────────────────────────────────────────────────┐
                    │                     API Gateway                      │
                    └─────────────────────┬───────────────────────────────┘
                                          │
                    ┌─────────────────────┴───────────────────────────────┐
                    │                                                      │
           ┌────────▼────────┐                              ┌──────────────▼─────────────┐
           │   WRITE (Command)│                              │      READ (Query)          │
           │     Service       │                              │       Service              │
           └────────┬─────────┘                              └──────────────┬─────────────┘
                    │                                                       │
                    │ Events                                                │
           ┌────────▼─────────┐     Sync/Project      ┌────────────────────▼─────────────┐
           │   Write Database  │─────────────────────▶│         Read Database            │
           │   (Normalized)    │                      │   (Denormalized, Optimized)      │
           │   PostgreSQL      │                      │   Elasticsearch / Redis          │
           └──────────────────┘                       └──────────────────────────────────┘
// WRITE side - Commands modify state
public record CreateOrderCommand(String customerId, List<OrderItem> items) {}

@Service
public class OrderCommandHandler {
    private final OrderRepository repository;  // PostgreSQL
    private final EventPublisher events;

    public String handle(CreateOrderCommand cmd) {
        Order order = new Order(cmd.customerId(), cmd.items());
        repository.save(order);

        // Publish event for read side to sync
        events.publish(new OrderCreatedEvent(order));
        return order.getId();
    }
}

// READ side - Queries return data (optimized for reading)
public record OrderSummaryQuery(String customerId, int limit) {}

@Service
public class OrderQueryHandler {
    private final ElasticsearchOperations elastic;  // Read-optimized

    public List<OrderSummary> handle(OrderSummaryQuery query) {
        // Fast denormalized query - no joins needed
        return elastic.search(query.customerId(), query.limit());
    }
}

// Projector keeps read model in sync
@Component
public class OrderProjector {
    private final ElasticsearchOperations elastic;

    @EventListener
    public void on(OrderCreatedEvent event) {
        OrderSummary summary = new OrderSummary(
            event.orderId(),
            event.customerName(),  // Denormalized!
            event.itemCount(),
            event.total()
        );
        elastic.save(summary);
    }
}

// When to use CQRS:
// - Read/write patterns are very different
// - Need different scaling for reads vs writes
// - Complex queries that don't fit relational model
// - Event sourcing (natural fit)

Saga Pattern — Distributed Transactions

Problem: In microservices, you can't use ACID transactions across services. Saga coordinates a sequence of local transactions.

CHOREOGRAPHY SAGA (Event-driven)

Order Service          Payment Service         Inventory Service        Shipping Service
     │                       │                        │                       │
     │ OrderCreated          │                        │                       │
     ├──────────────────────▶│                        │                       │
     │                       │ PaymentCompleted       │                       │
     │                       ├───────────────────────▶│                       │
     │                       │                        │ InventoryReserved     │
     │                       │                        ├──────────────────────▶│
     │                       │                        │                       │ ShippingScheduled
     │◀─────────────────────────────────────────────────────────────────────────┤
     │ OrderCompleted        │                        │                       │

COMPENSATION (if payment fails):
     │                       │ PaymentFailed          │                       │
     │◀──────────────────────┤                        │                       │
     │ OrderCancelled        │                        │                       │
// CHOREOGRAPHY: Services react to events, no central coordinator

// Order Service starts the saga
@Service
public class OrderService {
    public Order createOrder(CreateOrderRequest req) {
        Order order = orderRepository.save(new Order(req, PENDING));
        events.publish(new OrderCreatedEvent(order));
        return order;
    }

    @EventListener
    public void on(ShippingScheduledEvent event) {
        Order order = orderRepository.findById(event.orderId());
        order.setStatus(COMPLETED);
        orderRepository.save(order);
    }

    @EventListener
    public void on(PaymentFailedEvent event) {
        Order order = orderRepository.findById(event.orderId());
        order.setStatus(CANCELLED);
        orderRepository.save(order);
    }
}

// Payment Service reacts to OrderCreated
@Service
public class PaymentService {
    @EventListener
    public void on(OrderCreatedEvent event) {
        try {
            paymentGateway.charge(event.customerId(), event.total());
            events.publish(new PaymentCompletedEvent(event.orderId()));
        } catch (PaymentException e) {
            events.publish(new PaymentFailedEvent(event.orderId(), e.getMessage()));
        }
    }
}

// Inventory Service reacts to PaymentCompleted
@Service
public class InventoryService {
    @EventListener
    public void on(PaymentCompletedEvent event) {
        Order order = getOrderDetails(event.orderId());
        reserveItems(order.getItems());
        events.publish(new InventoryReservedEvent(event.orderId()));
    }

    // Compensation: release inventory if shipping fails
    @EventListener
    public void on(ShippingFailedEvent event) {
        releaseItems(event.orderId());
        events.publish(new InventoryReleasedEvent(event.orderId()));
    }
}
// ORCHESTRATION: Central coordinator manages the saga

@Service
public class OrderSagaOrchestrator {
    private final PaymentServiceClient paymentService;
    private final InventoryServiceClient inventoryService;
    private final ShippingServiceClient shippingService;

    public void executeOrderSaga(Order order) {
        try {
            // Step 1: Payment
            PaymentResult payment = paymentService.charge(order);
            if (!payment.isSuccess()) {
                order.setStatus(PAYMENT_FAILED);
                return;
            }

            // Step 2: Reserve Inventory
            ReservationResult reservation = inventoryService.reserve(order);
            if (!reservation.isSuccess()) {
                // Compensate: refund payment
                paymentService.refund(payment.getTransactionId());
                order.setStatus(INVENTORY_FAILED);
                return;
            }

            // Step 3: Schedule Shipping
            ShippingResult shipping = shippingService.schedule(order);
            if (!shipping.isSuccess()) {
                // Compensate: release inventory, refund payment
                inventoryService.release(reservation.getReservationId());
                paymentService.refund(payment.getTransactionId());
                order.setStatus(SHIPPING_FAILED);
                return;
            }

            order.setStatus(COMPLETED);

        } catch (Exception e) {
            // Handle partial failure, trigger compensations
            handleSagaFailure(order, e);
        }
    }
}

// Orchestration pros:
// - Central place to see the flow
// - Easier to reason about
// - Simpler compensation logic

// Orchestration cons:
// - Orchestrator can become a bottleneck
// - Single point of failure
// - Tighter coupling to orchestrator

Resilience Patterns

Circuit Breaker

Problem: When a downstream service is failing, don't keep hammering it. Fail fast and recover gracefully.

         CLOSED                    OPEN                    HALF-OPEN
    ┌──────────────┐          ┌──────────────┐          ┌──────────────┐
    │   Normal     │ Failures │   Fail Fast  │ Timeout  │   Test       │
    │  Operation   │─────────▶│   (no calls) │─────────▶│  (1 request) │
    │              │ exceed   │              │ expires  │              │
    └──────────────┘ threshold└──────────────┘          └───────┬──────┘
          ▲                                                     │
          │                    Success                          │
          └─────────────────────────────────────────────────────┘
                              Failure: back to OPEN
// Using Resilience4j (modern, lightweight)
@Service
public class PaymentServiceClient {
    private final CircuitBreaker circuitBreaker;
    private final RestClient restClient;

    public PaymentServiceClient() {
        this.circuitBreaker = CircuitBreaker.ofDefaults("paymentService");
        // Or with config:
        this.circuitBreaker = CircuitBreaker.of("paymentService",
            CircuitBreakerConfig.custom()
                .failureRateThreshold(50)           // Open at 50% failures
                .waitDurationInOpenState(Duration.ofSeconds(30))
                .slidingWindowSize(10)              // Last 10 calls
                .build()
        );
    }

    public PaymentResult charge(Order order) {
        return circuitBreaker.executeSupplier(() -> {
            return restClient.post()
                .uri("/payments/charge")
                .body(order)
                .retrieve()
                .body(PaymentResult.class);
        });
    }

    // With fallback
    public PaymentResult chargeWithFallback(Order order) {
        return Try.ofSupplier(
            CircuitBreaker.decorateSupplier(circuitBreaker, () -> charge(order))
        ).recover(CallNotPermittedException.class, e -> {
            // Circuit is open - use fallback
            return PaymentResult.deferred(order.getId());
        }).get();
    }
}

// Spring Boot + Resilience4j annotations
@Service
public class PaymentServiceClient {

    @CircuitBreaker(name = "payment", fallbackMethod = "chargeFallback")
    public PaymentResult charge(Order order) {
        return restClient.post(...);
    }

    private PaymentResult chargeFallback(Order order, Exception e) {
        return PaymentResult.deferred(order.getId());
    }
}
// RETRY with exponential backoff
@Service
public class ExternalServiceClient {

    @Retry(name = "externalService", fallbackMethod = "fallback")
    public Data fetchData(String id) {
        return restClient.get().uri("/data/{id}", id).retrieve().body(Data.class);
    }

    // application.yml config:
    // resilience4j.retry.instances.externalService:
    //   maxAttempts: 3
    //   waitDuration: 500ms
    //   exponentialBackoffMultiplier: 2
    //   retryExceptions:
    //     - java.io.IOException
    //     - java.net.SocketTimeoutException
}

// TIMEOUT - don't wait forever
@Service
public class SlowServiceClient {

    @TimeLimiter(name = "slowService")
    public CompletableFuture<Data> fetchSlowData(String id) {
        return CompletableFuture.supplyAsync(() -> {
            return restClient.get().uri("/slow/{id}", id).retrieve().body(Data.class);
        });
    }

    // Config: timeout after 2 seconds
    // resilience4j.timelimiter.instances.slowService:
    //   timeoutDuration: 2s
}

// Combining patterns: Retry → CircuitBreaker → TimeLimiter
@CircuitBreaker(name = "backend")
@Retry(name = "backend")
@TimeLimiter(name = "backend")
public CompletableFuture<Data> resilientCall() {
    // TimeLimiter wraps Retry wraps CircuitBreaker
}
// BULKHEAD - isolate failures, limit concurrent calls
// Prevents one slow service from consuming all threads

@Service
public class MultiServiceClient {

    // Semaphore bulkhead: limit concurrent calls
    @Bulkhead(name = "paymentService", type = Bulkhead.Type.SEMAPHORE)
    public PaymentResult charge(Order order) {
        return paymentClient.charge(order);
    }

    // Config:
    // resilience4j.bulkhead.instances.paymentService:
    //   maxConcurrentCalls: 20
    //   maxWaitDuration: 500ms

    // ThreadPool bulkhead: separate thread pool
    @Bulkhead(name = "inventoryService", type = Bulkhead.Type.THREADPOOL)
    public CompletableFuture<InventoryResult> checkInventory(String productId) {
        return CompletableFuture.supplyAsync(() -> inventoryClient.check(productId));
    }

    // Config:
    // resilience4j.thread-pool-bulkhead.instances.inventoryService:
    //   maxThreadPoolSize: 10
    //   coreThreadPoolSize: 5
    //   queueCapacity: 100
}

// Why bulkheads matter:
// Without: Slow payment service → all 200 Tomcat threads waiting → entire app unresponsive
// With: Slow payment service → only 20 threads affected → other endpoints still work

Scalability Patterns

Caching Strategies

// CACHE-ASIDE (Lazy Loading): App manages cache
@Service
public class ProductService {
    private final RedisTemplate<String, Product> cache;
    private final ProductRepository repository;

    public Product getProduct(String id) {
        // 1. Check cache
        Product cached = cache.opsForValue().get("product:" + id);
        if (cached != null) {
            return cached;  // Cache hit
        }

        // 2. Cache miss: load from DB
        Product product = repository.findById(id).orElseThrow();

        // 3. Populate cache
        cache.opsForValue().set("product:" + id, product, Duration.ofMinutes(30));

        return product;
    }

    public void updateProduct(Product product) {
        repository.save(product);
        // Invalidate cache (or update it)
        cache.delete("product:" + product.getId());
    }
}

// Spring @Cacheable (declarative cache-aside)
@Service
public class ProductService {

    @Cacheable(value = "products", key = "#id")
    public Product getProduct(String id) {
        return repository.findById(id).orElseThrow();
    }

    @CacheEvict(value = "products", key = "#product.id")
    public void updateProduct(Product product) {
        repository.save(product);
    }

    @CachePut(value = "products", key = "#product.id")
    public Product updateAndCache(Product product) {
        return repository.save(product);
    }
}
// WRITE-THROUGH: Write to cache and DB together
@Service
public class ProductService {
    public void saveProduct(Product product) {
        // Write to both synchronously
        repository.save(product);
        cache.opsForValue().set("product:" + product.getId(), product);
        // Pros: Cache always consistent
        // Cons: Slower writes, both must succeed
    }
}

// WRITE-BEHIND (Write-Back): Write to cache, async to DB
@Service
public class ProductService {
    private final BlockingQueue<Product> writeQueue = new LinkedBlockingQueue<>();

    public void saveProduct(Product product) {
        // Write to cache immediately
        cache.opsForValue().set("product:" + product.getId(), product);
        // Queue for async DB write
        writeQueue.offer(product);
        // Pros: Fast writes
        // Cons: Risk of data loss if crash before DB write
    }

    @Scheduled(fixedDelay = 1000)
    public void flushToDatabase() {
        List<Product> batch = new ArrayList<>();
        writeQueue.drainTo(batch, 100);
        if (!batch.isEmpty()) {
            repository.saveAll(batch);
        }
    }
}

// CACHE INVALIDATION STRATEGIES
// 1. TTL (Time To Live) - cache expires after duration
// 2. Event-driven - invalidate on data change events
// 3. Version-based - include version in cache key

// "There are only two hard things in CS:
//  cache invalidation and naming things."

Database Scaling

READ REPLICAS                          SHARDING (Horizontal Partitioning)
┌─────────┐                            ┌─────────────────────────────────────┐
│  App    │                            │              App                     │
└────┬────┘                            └────┬───────────┬───────────┬────────┘
     │                                      │           │           │
     ├───Write───▶┌──────────┐              │           │           │
     │            │  Primary │              ▼           ▼           ▼
     │            └────┬─────┘         ┌────────┐ ┌────────┐ ┌────────┐
     │                 │ Replication   │Shard 0 │ │Shard 1 │ │Shard 2 │
     │                 ▼               │ A-H    │ │ I-P    │ │ Q-Z    │
     │            ┌──────────┐         └────────┘ └────────┘ └────────┘
     └──Read────▶ │ Replica 1│
                  ├──────────┤         Partition by: user_id % 3
                  │ Replica 2│         or: hash(user_id) → shard
                  └──────────┘
StrategyWhen to UseComplexity
Read ReplicasRead-heavy workloads (80%+ reads)Low - most DBs support natively
Vertical ScalingFirst option - bigger machineLow - but has limits
CachingFrequently accessed, rarely changed dataMedium - invalidation is hard
ShardingDataset too large for single machineHigh - cross-shard queries are hard
CQRSDifferent read/write patternsHigh - eventual consistency

API Design

RESTful Best Practices

// Resource naming: nouns, not verbs
GET    /users              // List users
GET    /users/{id}         // Get user
POST   /users              // Create user
PUT    /users/{id}         // Replace user
PATCH  /users/{id}         // Update user (partial)
DELETE /users/{id}         // Delete user

// Nested resources
GET    /users/{id}/orders  // User's orders
POST   /users/{id}/orders  // Create order for user

// Filtering, sorting, pagination
GET    /orders?status=pending&sort=-createdAt&page=2&limit=20

// HTTP Status Codes
200 OK            // Success (GET, PUT, PATCH)
201 Created       // Resource created (POST)
204 No Content    // Success, no body (DELETE)
400 Bad Request   // Invalid input
401 Unauthorized  // Not authenticated
403 Forbidden     // Not authorized
404 Not Found     // Resource doesn't exist
409 Conflict      // Conflict (duplicate, version mismatch)
422 Unprocessable // Validation failed
429 Too Many Req  // Rate limited
500 Server Error  // Unexpected error
503 Unavailable   // Service down

// Error response format
{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Invalid email format",
    "details": [
      {"field": "email", "message": "Must be valid email"}
    ],
    "requestId": "abc-123"  // For debugging
  }
}
// API VERSIONING strategies

// 1. URL path (most common, explicit)
/api/v1/users
/api/v2/users

// 2. Header (cleaner URLs, harder to test)
Accept: application/vnd.myapi.v2+json

// 3. Query param (easy to use, pollutes URL)
/api/users?version=2

@RestController
@RequestMapping("/api/v1/users")
public class UserControllerV1 {
    // V1 implementation
}

@RestController
@RequestMapping("/api/v2/users")
public class UserControllerV2 {
    // V2 with breaking changes
}

// PAGINATION (cursor-based is better for large datasets)
// Offset-based (simple, but slow for large offsets)
GET /orders?offset=1000&limit=20

// Cursor-based (consistent, performant)
GET /orders?cursor=eyJpZCI6MTIzfQ&limit=20

public record PagedResponse<T>(
    List<T> data,
    String nextCursor,
    boolean hasMore
) {}

// IDEMPOTENCY (safe retries)
// Client sends unique key, server deduplicates
POST /orders
Idempotency-Key: abc-123-unique

@PostMapping("/orders")
public Order createOrder(
    @RequestHeader("Idempotency-Key") String idempotencyKey,
    @RequestBody CreateOrderRequest request
) {
    // Check if already processed
    return idempotencyService.executeOnce(idempotencyKey, () -> {
        return orderService.create(request);
    });
}

Distributed Systems Fundamentals

CAP Theorem

                     CONSISTENCY
                         ▲
                        /│\
                       / │ \
                      /  │  \
                     /   │   \
                    /    │    \
          CP Systems    │     CA Systems
         (MongoDB,      │     (Traditional
          HBase)        │      RDBMS - but
                        │      not distributed)
                       /│\
                      / │ \
                     /  │  \
    AVAILABILITY ◀──────┴──────▶ PARTITION
                                 TOLERANCE
                   AP Systems
                  (Cassandra,
                   DynamoDB)

In a distributed system, when a network partition occurs,
you must choose between Consistency and Availability.
PropertyMeaningTrade-off
ConsistencyAll nodes see same data at same timeMay need to reject requests
AvailabilityEvery request gets a responseResponse may be stale
Partition ToleranceSystem works despite network failuresRequired for distributed systems

Practical CAP

Network partitions will happen. The real question is: when they do, do you want consistency (return error rather than stale data) or availability (return possibly stale data)?

Consistency Models

ModelGuaranteeExample
StrongRead always returns latest writeSingle-node RDBMS
EventualGiven time, all reads return latest writeDNS, Cassandra
CausalRelated operations seen in orderMongoDB (configurable)
Read-your-writesYou see your own writes immediatelyMany web apps

Interview Framework: System Design Questions

The 5-Step Approach

  1. Clarify Requirements (2-3 min) - Functional & non-functional, scale estimates
  2. High-Level Design (5 min) - Main components, data flow
  3. Deep Dive (15-20 min) - Key components in detail, trade-offs
  4. Address Bottlenecks (5 min) - Scaling, failure modes, monitoring
  5. Wrap Up (2 min) - Summary, future improvements

Questions to Ask

Back-of-Envelope Calculations

Common numbers to know:
- 1 day = 86,400 seconds ≈ 100,000 seconds
- 1 million requests/day ≈ 12 requests/second
- 1 billion requests/day ≈ 12,000 requests/second

Storage:
- 1 char = 1 byte (ASCII) or 2-4 bytes (UTF-8)
- 1 million users × 1KB/user = 1 GB
- 1 billion rows × 100 bytes/row = 100 GB

Network:
- Read from memory: 100 ns
- Read from SSD: 100 µs (1000x slower)
- Read from disk: 10 ms (100,000x slower)
- Network round trip (same datacenter): 500 µs
- Network round trip (cross-country): 150 ms