Patterns for building scalable, resilient, distributed systems at Twilio scale
MONOLITH MICROSERVICES
┌─────────────────────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Single Deployment │ │ User │ │ Order │ │ Payment │
│ ┌─────┐ ┌─────┐ ┌─────┐│ │ Service │ │ Service │ │ Service │
│ │User │ │Order│ │Pay ││ └────┬────┘ └────┬────┘ └────┬────┘
│ │ │ │ │ │ ││ │ │ │
│ └──┬──┘ └──┬──┘ └──┬──┘│ ┌────▼──────────▼──────────▼────┐
│ │ │ │ │ │ Message Broker │
│ ┌──▼──────▼──────▼──┐ │ │ (Kafka / RabbitMQ) │
│ │ Shared Database │ │ └───────────────────────────────┘
│ └────────────────────┘ │ │ │ │
└─────────────────────────┘ ┌────▼────┐┌───▼────┐┌────▼────┐
│ DB 1 ││ DB 2 ││ DB 3 │
└─────────┘└────────┘└─────────┘
Start with a well-structured monolith. Extract microservices only when you have: (1) clear domain boundaries, (2) independent scaling needs, or (3) team scaling requirements. Premature microservices = distributed monolith = worst of both worlds.
┌──────────────┐ Event ┌─────────────────┐ Event ┌──────────────┐
│ Order Service│────Published──▶│ Message Broker │◀───Consumed────│ Email Service│
└──────────────┘ │ (Kafka/SQS) │ └──────────────┘
└────────┬────────┘
│
┌──────────────────────────┼──────────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│Inventory Service│  │Analytics Service│  │  Audit Service  │
└─────────────────┘  └─────────────────┘  └─────────────────┘
// Domain Event
public record OrderPlacedEvent(
    String eventId,
    String orderId,
    String customerId,
    List<OrderItem> items,
    BigDecimal total,
    Instant occurredAt
) {
    public static OrderPlacedEvent from(Order order) {
        return new OrderPlacedEvent(
            UUID.randomUUID().toString(),
            order.getId(),
            order.getCustomerId(),
            order.getItems(),
            order.getTotal(),
            Instant.now()
        );
    }
}

// Order Service publishes event
@Service
public class OrderService {
    private final KafkaTemplate<String, OrderPlacedEvent> kafka;

    @Transactional
    public Order placeOrder(CreateOrderRequest request) {
        Order order = orderRepository.save(new Order(request));
        // Publish event AFTER transaction commits (outbox pattern is safer)
        kafka.send("orders.placed", order.getId(), OrderPlacedEvent.from(order));
        return order;
    }
}
// Email Service consumes event
@Service
public class OrderNotificationHandler {

    @KafkaListener(topics = "orders.placed", groupId = "email-service")
    public void handleOrderPlaced(OrderPlacedEvent event) {
        Customer customer = customerService.getById(event.customerId());
        emailService.sendOrderConfirmation(customer.getEmail(), event);
    }
}

// Inventory Service consumes the same event (fan-out)
@Service
public class InventoryHandler {

    @KafkaListener(topics = "orders.placed", groupId = "inventory-service")
    public void handleOrderPlaced(OrderPlacedEvent event) {
        for (OrderItem item : event.items()) {
            inventoryService.decrementStock(item.productId(), item.quantity());
        }
    }
}

// Benefits:
// - Services are decoupled (don't know about each other)
// - Easy to add new consumers without changing the producer
// - Events provide an audit trail
// - Events can be replayed for recovery/debugging
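The producer comment above flags the transactional outbox as the safer option: write the event to an "outbox" table in the same transaction as the state change, and let a background relay publish it. A minimal in-memory sketch of the idea; the names (`OutboxRecord`, `OutboxRelay`) are illustrative stand-ins for the outbox table and the poller, not a real library.

```java
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.BiConsumer;

// Stand-in for a row in the outbox table
record OutboxRecord(String topic, String key, String payload, Instant createdAt) {}

class OutboxRelay {
    // Stand-in for the outbox table, written in the SAME transaction as the order
    private final Queue<OutboxRecord> outbox = new ArrayDeque<>();

    // Called inside the business transaction: the event is persisted
    // atomically with the state change, so it can never be lost.
    void enqueue(String topic, String key, String payload) {
        outbox.add(new OutboxRecord(topic, key, payload, Instant.now()));
    }

    // Called by a background poller: publish pending records, then remove them.
    // Returns the number of records published.
    int relay(BiConsumer<String, String> publisher) {
        int published = 0;
        OutboxRecord rec;
        while ((rec = outbox.poll()) != null) {
            publisher.accept(rec.topic(), rec.payload()); // e.g. kafka.send(...)
            published++;
        }
        return published;
    }
}
```

The relay gives at-least-once delivery, so consumers still need to be idempotent.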
| Pattern | When to Use | Trade-offs |
|---|---|---|
| REST/HTTP | Simple CRUD, public APIs, request-response needed | Synchronous, tight coupling, latency chains |
| gRPC | Internal service-to-service, high performance needed | Binary (not human-readable), proto tooling required, limited browser support |
| Message Queue | Async processing, load leveling, reliability needed | Eventual consistency, more complex debugging |
| Event Streaming | Event sourcing, real-time analytics, audit logs | Kafka complexity, ordering guarantees vary |
// REST client with resilience (Spring WebClient)
@Service
public class UserServiceClient {
    private final WebClient webClient;

    public Mono<User> getUser(String userId) {
        return webClient.get()
            .uri("/users/{id}", userId)
            .retrieve()
            .bodyToMono(User.class)
            .timeout(Duration.ofSeconds(5))
            .retryWhen(Retry.backoff(3, Duration.ofMillis(100)))
            .onErrorResume(e -> {
                log.error("Failed to get user", e);
                return Mono.empty();
            });
    }
}

// With virtual threads (Java 21) - simpler blocking code
@Service
public class UserServiceClient {
    private final RestClient restClient;

    public User getUser(String userId) {
        return restClient.get()
            .uri("/users/{id}", userId)
            .retrieve()
            .body(User.class);  // Blocking is OK with virtual threads
    }
}
// gRPC - schema-first with Protocol Buffers
// user.proto
/*
syntax = "proto3";

service UserService {
  rpc GetUser(GetUserRequest) returns (User);
  rpc ListUsers(ListUsersRequest) returns (stream User);  // Streaming!
}

message GetUserRequest { string user_id = 1; }

message User {
  string id = 1;
  string name = 2;
  string email = 3;
}
*/

// Generated client usage
@Service
public class UserServiceGrpcClient {
    private final UserServiceGrpc.UserServiceBlockingStub stub;

    public User getUser(String userId) {
        GetUserRequest request = GetUserRequest.newBuilder()
            .setUserId(userId)
            .build();
        return stub.getUser(request);
    }

    // Streaming response
    public Iterator<User> listUsers() {
        return stub.listUsers(ListUsersRequest.getDefaultInstance());
    }
}

// gRPC advantages:
// - Often much faster than REST/JSON (binary encoding, HTTP/2 multiplexing)
// - Strong typing from proto definitions
// - Bi-directional streaming
// - Auto-generated clients in many languages
// Message Queue (RabbitMQ/SQS) - async, reliable
@Service
public class OrderProcessor {
    private final RabbitTemplate rabbit;

    // Send message to queue
    public void submitOrder(Order order) {
        rabbit.convertAndSend("orders.pending", order);
        // Returns immediately - processing happens async
    }
}

// Worker processes messages
@Component
public class OrderWorker {

    @RabbitListener(queues = "orders.pending")
    public void processOrder(Order order) {
        try {
            paymentService.charge(order);
            inventoryService.reserve(order);
            order.setStatus(CONFIRMED);
            orderRepository.save(order);
        } catch (Exception e) {
            // Message will be requeued (or sent to a DLQ)
            throw e;
        }
    }
}

// Queue benefits:
// - Load leveling (spikes absorbed by the queue)
// - Reliability (messages persisted until processed)
// - Retry with backoff (automatic requeue)
// - Dead letter queue for failed messages
┌─────────────────────────────────────────────────────┐
│ API Gateway │
└─────────────────────┬───────────────────────────────┘
│
┌─────────────────────┴───────────────────────────────┐
│ │
┌────────▼────────┐ ┌──────────────▼─────────────┐
│ WRITE (Command)│ │ READ (Query) │
│ Service │ │ Service │
└────────┬─────────┘ └──────────────┬─────────────┘
│ │
│ Events │
┌────────▼─────────┐ Sync/Project ┌────────────────────▼─────────────┐
│ Write Database │─────────────────────▶│ Read Database │
│ (Normalized) │ │ (Denormalized, Optimized) │
│ PostgreSQL │ │ Elasticsearch / Redis │
└──────────────────┘ └──────────────────────────────────┘
// WRITE side - Commands modify state
public record CreateOrderCommand(String customerId, List<OrderItem> items) {}

@Service
public class OrderCommandHandler {
    private final OrderRepository repository;  // PostgreSQL
    private final EventPublisher events;

    public String handle(CreateOrderCommand cmd) {
        Order order = new Order(cmd.customerId(), cmd.items());
        repository.save(order);
        // Publish event for the read side to sync
        events.publish(new OrderCreatedEvent(order));
        return order.getId();
    }
}

// READ side - Queries return data (optimized for reading)
public record OrderSummaryQuery(String customerId, int limit) {}

@Service
public class OrderQueryHandler {
    private final ElasticsearchOperations elastic;  // Read-optimized

    public List<OrderSummary> handle(OrderSummaryQuery query) {
        // Fast denormalized query - no joins needed
        return elastic.search(query.customerId(), query.limit());
    }
}

// Projector keeps the read model in sync
@Component
public class OrderProjector {
    private final ElasticsearchOperations elastic;

    @EventListener
    public void on(OrderCreatedEvent event) {
        OrderSummary summary = new OrderSummary(
            event.orderId(),
            event.customerName(),  // Denormalized!
            event.itemCount(),
            event.total()
        );
        elastic.save(summary);
    }
}

// When to use CQRS:
// - Read/write patterns are very different
// - Reads and writes need to scale independently
// - Complex queries that don't fit the relational model
// - Event sourcing (a natural fit)
Problem: In microservices, you can't use ACID transactions across services. Saga coordinates a sequence of local transactions.
CHOREOGRAPHY SAGA (Event-driven)
Order Service Payment Service Inventory Service Shipping Service
│ │ │ │
│ OrderCreated │ │ │
├──────────────────────▶│ │ │
│ │ PaymentCompleted │ │
│ ├───────────────────────▶│ │
│ │ │ InventoryReserved │
│ │ ├──────────────────────▶│
│ │ │ │ ShippingScheduled
│◀─────────────────────────────────────────────────────────────────────────┤
│ OrderCompleted │ │ │
COMPENSATION (if payment fails):
│ │ PaymentFailed │ │
│◀──────────────────────┤ │ │
│ OrderCancelled │ │ │
// CHOREOGRAPHY: Services react to events, no central coordinator

// Order Service starts the saga
@Service
public class OrderService {

    public Order createOrder(CreateOrderRequest req) {
        Order order = orderRepository.save(new Order(req, PENDING));
        events.publish(new OrderCreatedEvent(order));
        return order;
    }

    @EventListener
    public void on(ShippingScheduledEvent event) {
        Order order = orderRepository.findById(event.orderId());
        order.setStatus(COMPLETED);
        orderRepository.save(order);
    }

    @EventListener
    public void on(PaymentFailedEvent event) {
        Order order = orderRepository.findById(event.orderId());
        order.setStatus(CANCELLED);
        orderRepository.save(order);
    }
}

// Payment Service reacts to OrderCreated
@Service
public class PaymentService {

    @EventListener
    public void on(OrderCreatedEvent event) {
        try {
            paymentGateway.charge(event.customerId(), event.total());
            events.publish(new PaymentCompletedEvent(event.orderId()));
        } catch (PaymentException e) {
            events.publish(new PaymentFailedEvent(event.orderId(), e.getMessage()));
        }
    }
}

// Inventory Service reacts to PaymentCompleted
@Service
public class InventoryService {

    @EventListener
    public void on(PaymentCompletedEvent event) {
        Order order = getOrderDetails(event.orderId());
        reserveItems(order.getItems());
        events.publish(new InventoryReservedEvent(event.orderId()));
    }

    // Compensation: release inventory if shipping fails
    @EventListener
    public void on(ShippingFailedEvent event) {
        releaseItems(event.orderId());
        events.publish(new InventoryReleasedEvent(event.orderId()));
    }
}
// ORCHESTRATION: Central coordinator manages the saga
@Service
public class OrderSagaOrchestrator {
    private final PaymentServiceClient paymentService;
    private final InventoryServiceClient inventoryService;
    private final ShippingServiceClient shippingService;

    public void executeOrderSaga(Order order) {
        try {
            // Step 1: Payment
            PaymentResult payment = paymentService.charge(order);
            if (!payment.isSuccess()) {
                order.setStatus(PAYMENT_FAILED);
                return;
            }

            // Step 2: Reserve Inventory
            ReservationResult reservation = inventoryService.reserve(order);
            if (!reservation.isSuccess()) {
                // Compensate: refund payment
                paymentService.refund(payment.getTransactionId());
                order.setStatus(INVENTORY_FAILED);
                return;
            }

            // Step 3: Schedule Shipping
            ShippingResult shipping = shippingService.schedule(order);
            if (!shipping.isSuccess()) {
                // Compensate: release inventory, refund payment
                inventoryService.release(reservation.getReservationId());
                paymentService.refund(payment.getTransactionId());
                order.setStatus(SHIPPING_FAILED);
                return;
            }

            order.setStatus(COMPLETED);
        } catch (Exception e) {
            // Handle partial failure, trigger compensations
            handleSagaFailure(order, e);
        }
    }
}

// Orchestration pros:
// - Central place to see the flow
// - Easier to reason about
// - Simpler compensation logic
// Orchestration cons:
// - Orchestrator can become a bottleneck
// - Single point of failure
// - Tighter coupling to the orchestrator
Problem: When a downstream service is failing, don't keep hammering it. Fail fast and recover gracefully.
CLOSED OPEN HALF-OPEN
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Normal │ Failures │ Fail Fast │ Timeout │ Test │
│ Operation │─────────▶│ (no calls) │─────────▶│ (1 request) │
│ │ exceed │ │ expires │ │
└──────────────┘ threshold└──────────────┘ └───────┬──────┘
▲ │
│ Success │
└─────────────────────────────────────────────────────┘
Failure: back to OPEN
// Using Resilience4j (modern, lightweight)
@Service
public class PaymentServiceClient {
    private final CircuitBreaker circuitBreaker;
    private final RestClient restClient;

    public PaymentServiceClient() {
        this.circuitBreaker = CircuitBreaker.ofDefaults("paymentService");
        // Or with explicit config:
        // this.circuitBreaker = CircuitBreaker.of("paymentService",
        //     CircuitBreakerConfig.custom()
        //         .failureRateThreshold(50)                        // Open at 50% failures
        //         .waitDurationInOpenState(Duration.ofSeconds(30))
        //         .slidingWindowSize(10)                           // Last 10 calls
        //         .build());
    }

    public PaymentResult charge(Order order) {
        return circuitBreaker.executeSupplier(() ->
            restClient.post()
                .uri("/payments/charge")
                .body(order)
                .retrieve()
                .body(PaymentResult.class));
    }

    // With fallback
    public PaymentResult chargeWithFallback(Order order) {
        return Try.ofSupplier(
            CircuitBreaker.decorateSupplier(circuitBreaker, () -> charge(order))
        ).recover(CallNotPermittedException.class, e -> {
            // Circuit is open - use fallback
            return PaymentResult.deferred(order.getId());
        }).get();
    }
}

// Spring Boot + Resilience4j annotations
@Service
public class PaymentServiceClient {

    @CircuitBreaker(name = "payment", fallbackMethod = "chargeFallback")
    public PaymentResult charge(Order order) {
        return restClient.post(...);
    }

    private PaymentResult chargeFallback(Order order, Exception e) {
        return PaymentResult.deferred(order.getId());
    }
}
// RETRY with exponential backoff
@Service
public class ExternalServiceClient {

    @Retry(name = "externalService", fallbackMethod = "fallback")
    public Data fetchData(String id) {
        return restClient.get().uri("/data/{id}", id).retrieve().body(Data.class);
    }

    // application.yml config:
    // resilience4j.retry.instances.externalService:
    //   maxAttempts: 3
    //   waitDuration: 500ms
    //   exponentialBackoffMultiplier: 2
    //   retryExceptions:
    //     - java.io.IOException
    //     - java.net.SocketTimeoutException
}

// TIMEOUT - don't wait forever
@Service
public class SlowServiceClient {

    @TimeLimiter(name = "slowService")
    public CompletableFuture<Data> fetchSlowData(String id) {
        return CompletableFuture.supplyAsync(() ->
            restClient.get().uri("/slow/{id}", id).retrieve().body(Data.class));
    }

    // Config: timeout after 2 seconds
    // resilience4j.timelimiter.instances.slowService:
    //   timeoutDuration: 2s
}

// Combining patterns
@CircuitBreaker(name = "backend")
@Retry(name = "backend")
@TimeLimiter(name = "backend")
public CompletableFuture<Data> resilientCall() {
    // Default Resilience4j aspect order (regardless of annotation order):
    // Retry wraps CircuitBreaker, which wraps TimeLimiter
}
// BULKHEAD - isolate failures, limit concurrent calls
// Prevents one slow service from consuming all threads
@Service
public class MultiServiceClient {

    // Semaphore bulkhead: limit concurrent calls
    @Bulkhead(name = "paymentService", type = Bulkhead.Type.SEMAPHORE)
    public PaymentResult charge(Order order) {
        return paymentClient.charge(order);
    }
    // Config:
    // resilience4j.bulkhead.instances.paymentService:
    //   maxConcurrentCalls: 20
    //   maxWaitDuration: 500ms

    // ThreadPool bulkhead: separate thread pool
    @Bulkhead(name = "inventoryService", type = Bulkhead.Type.THREADPOOL)
    public CompletableFuture<InventoryResult> checkInventory(String productId) {
        return CompletableFuture.supplyAsync(() -> inventoryClient.check(productId));
    }
    // Config:
    // resilience4j.thread-pool-bulkhead.instances.inventoryService:
    //   maxThreadPoolSize: 10
    //   coreThreadPoolSize: 5
    //   queueCapacity: 100
}

// Why bulkheads matter:
// Without: slow payment service → all 200 Tomcat threads waiting → entire app unresponsive
// With:    slow payment service → only 20 threads affected → other endpoints still work
// CACHE-ASIDE (Lazy Loading): the app manages the cache
@Service
public class ProductService {
    private final RedisTemplate<String, Product> cache;
    private final ProductRepository repository;

    public Product getProduct(String id) {
        // 1. Check cache
        Product cached = cache.opsForValue().get("product:" + id);
        if (cached != null) {
            return cached;  // Cache hit
        }
        // 2. Cache miss: load from DB
        Product product = repository.findById(id).orElseThrow();
        // 3. Populate cache
        cache.opsForValue().set("product:" + id, product, Duration.ofMinutes(30));
        return product;
    }

    public void updateProduct(Product product) {
        repository.save(product);
        // Invalidate cache (or update it)
        cache.delete("product:" + product.getId());
    }
}

// Spring @Cacheable (declarative cache-aside)
@Service
public class ProductService {

    @Cacheable(value = "products", key = "#id")
    public Product getProduct(String id) {
        return repository.findById(id).orElseThrow();
    }

    @CacheEvict(value = "products", key = "#product.id")
    public void updateProduct(Product product) {
        repository.save(product);
    }

    @CachePut(value = "products", key = "#product.id")
    public Product updateAndCache(Product product) {
        return repository.save(product);
    }
}
// WRITE-THROUGH: write to cache and DB together
@Service
public class ProductService {

    public void saveProduct(Product product) {
        // Write to both synchronously
        repository.save(product);
        cache.opsForValue().set("product:" + product.getId(), product);
        // Pros: cache always consistent
        // Cons: slower writes, both must succeed
    }
}

// WRITE-BEHIND (Write-Back): write to cache, async to DB
@Service
public class ProductService {
    private final BlockingQueue<Product> writeQueue = new LinkedBlockingQueue<>();

    public void saveProduct(Product product) {
        // Write to cache immediately
        cache.opsForValue().set("product:" + product.getId(), product);
        // Queue for async DB write
        writeQueue.offer(product);
        // Pros: fast writes
        // Cons: risk of data loss if crash before DB write
    }

    @Scheduled(fixedDelay = 1000)
    public void flushToDatabase() {
        List<Product> batch = new ArrayList<>();
        writeQueue.drainTo(batch, 100);
        if (!batch.isEmpty()) {
            repository.saveAll(batch);
        }
    }
}

// CACHE INVALIDATION STRATEGIES
// 1. TTL (Time To Live) - cache entries expire after a duration
// 2. Event-driven - invalidate on data change events
// 3. Version-based - include a version in the cache key
// "There are only two hard things in CS:
//  cache invalidation and naming things."
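Strategy 3 above (version-based keys) can be sketched like this. `VersionedCacheKeys` is an illustrative name; a real implementation would keep the version counters in Redis next to the cache, and stale entries would simply age out via TTL instead of being deleted.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Version-based invalidation: bump a per-entity version on every write and
// embed it in the cache key, so stale entries are never read again.
class VersionedCacheKeys {
    private final Map<String, AtomicLong> versions = new ConcurrentHashMap<>();

    // Cache key embeds the current version, e.g. "product:42:v7"
    String keyFor(String entityId) {
        long v = versions.computeIfAbsent(entityId, id -> new AtomicLong(0)).get();
        return "product:" + entityId + ":v" + v;
    }

    // On update: bump the version instead of deleting cache entries
    void invalidate(String entityId) {
        versions.computeIfAbsent(entityId, id -> new AtomicLong(0)).incrementAndGet();
    }
}
```

The write path becomes "save to DB, bump version"; readers always compute the key first, so a bumped version is an instant, race-free invalidation.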
READ REPLICAS SHARDING (Horizontal Partitioning)
┌─────────┐ ┌─────────────────────────────────────┐
│ App │ │ App │
└────┬────┘ └────┬───────────┬───────────┬────────┘
│ │ │ │
├───Write───▶┌──────────┐ │ │ │
│ │ Primary │ ▼ ▼ ▼
│ └────┬─────┘ ┌────────┐ ┌────────┐ ┌────────┐
│ │ Replication │Shard 0 │ │Shard 1 │ │Shard 2 │
│ ▼ │ A-H │ │ I-P │ │ Q-Z │
│ ┌──────────┐ └────────┘ └────────┘ └────────┘
└──Read────▶ │ Replica 1│
├──────────┤ Partition by: user_id % 3
│ Replica 2│ or: hash(user_id) → shard
└──────────┘
| Strategy | When to Use | Complexity |
|---|---|---|
| Read Replicas | Read-heavy workloads (80%+ reads) | Low - most DBs support natively |
| Vertical Scaling | First option - bigger machine | Low - but has limits |
| Caching | Frequently accessed, rarely changed data | Medium - invalidation is hard |
| Sharding | Dataset too large for single machine | High - cross-shard queries are hard |
| CQRS | Different read/write patterns | High - eventual consistency |
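The sharding diagram above routes with `hash(user_id) → shard`. A minimal sketch of that router; this naive modulo scheme is for illustration only, since production systems usually prefer consistent hashing so that adding a shard doesn't remap every key.

```java
// Hash-based shard routing: hash(user_id) → shard index
class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    int shardFor(String userId) {
        // floorMod avoids negative indices when hashCode() is negative
        return Math.floorMod(userId.hashCode(), shardCount);
    }
}
```

The router must be deterministic: the same `userId` must always map to the same shard, on every app instance, or reads will miss data written by another instance.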
// Resource naming: nouns, not verbs
GET    /users            // List users
GET    /users/{id}       // Get user
POST   /users            // Create user
PUT    /users/{id}       // Replace user
PATCH  /users/{id}       // Update user (partial)
DELETE /users/{id}       // Delete user

// Nested resources
GET  /users/{id}/orders  // User's orders
POST /users/{id}/orders  // Create order for user

// Filtering, sorting, pagination
GET /orders?status=pending&sort=-createdAt&page=2&limit=20

// HTTP Status Codes
200 OK              // Success (GET, PUT, PATCH)
201 Created         // Resource created (POST)
204 No Content      // Success, no body (DELETE)
400 Bad Request     // Invalid input
401 Unauthorized    // Not authenticated
403 Forbidden       // Not authorized
404 Not Found       // Resource doesn't exist
409 Conflict        // Duplicate, version mismatch
422 Unprocessable   // Validation failed
429 Too Many Req    // Rate limited
500 Server Error    // Unexpected error
503 Unavailable     // Service down

// Error response format
{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Invalid email format",
    "details": [
      {"field": "email", "message": "Must be valid email"}
    ],
    "requestId": "abc-123"   // For debugging
  }
}
// API VERSIONING strategies
// 1. URL path (most common, explicit)
//    /api/v1/users
//    /api/v2/users
// 2. Header (cleaner URLs, harder to test)
//    Accept: application/vnd.myapi.v2+json
// 3. Query param (easy to use, pollutes URL)
//    /api/users?version=2

@RestController
@RequestMapping("/api/v1/users")
public class UserControllerV1 {
    // V1 implementation
}

@RestController
@RequestMapping("/api/v2/users")
public class UserControllerV2 {
    // V2 with breaking changes
}

// PAGINATION (cursor-based is better for large datasets)
// Offset-based (simple, but slow for large offsets):
//   GET /orders?offset=1000&limit=20
// Cursor-based (consistent, performant):
//   GET /orders?cursor=eyJpZCI6MTIzfQ&limit=20

public record PagedResponse<T>(
    List<T> data,
    String nextCursor,
    boolean hasMore
) {}

// IDEMPOTENCY (safe retries)
// Client sends a unique key, server deduplicates:
//   POST /orders
//   Idempotency-Key: abc-123-unique

@PostMapping("/orders")
public Order createOrder(
    @RequestHeader("Idempotency-Key") String idempotencyKey,
    @RequestBody CreateOrderRequest request
) {
    // Check if already processed
    return idempotencyService.executeOnce(idempotencyKey,
        () -> orderService.create(request));
}
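The controller above calls an `idempotencyService.executeOnce` helper. One possible shape for it (an assumption for illustration, not a standard API); a real version would back the map with Redis or a DB table plus a TTL so retried requests keep getting the original result across instances.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Deduplicates requests by idempotency key: the operation runs at most
// once per key; duplicates receive the cached first result.
class IdempotencyService {
    private final Map<String, Object> results = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    <T> T executeOnce(String key, Supplier<T> operation) {
        // computeIfAbsent is atomic per key on ConcurrentHashMap, so two
        // concurrent retries with the same key can't both run the operation
        return (T) results.computeIfAbsent(key, k -> operation.get());
    }
}
```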
                  CONSISTENCY
                       ▲
                      / \
                     /   \
      CA Systems    /     \    CP Systems
     (Traditional  /       \   (MongoDB,
      RDBMS - but /         \   HBase)
  not distributed)           \
                 /            \
                /              \
  AVAILABILITY ◀────────────────▶ PARTITION
                                  TOLERANCE
                  AP Systems
             (Cassandra, DynamoDB)
In a distributed system, when a network partition occurs,
you must choose between Consistency and Availability.
| Property | Meaning | Trade-off |
|---|---|---|
| Consistency | All nodes see same data at same time | May need to reject requests |
| Availability | Every request gets a response | Response may be stale |
| Partition Tolerance | System works despite network failures | Required for distributed systems |
Network partitions will happen. The real question is: when they do, do you want consistency (return error rather than stale data) or availability (return possibly stale data)?
| Model | Guarantee | Example |
|---|---|---|
| Strong | Read always returns latest write | Single-node RDBMS |
| Eventual | If writes stop, all reads eventually return the latest write | DNS, Cassandra |
| Causal | Related operations seen in order | MongoDB (configurable) |
| Read-your-writes | You see your own writes immediately | Many web apps |
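Read-your-writes is often implemented by pinning a user's reads to the primary for a short window after they write, and only using (possibly lagging) replicas otherwise. A minimal sketch under that assumption; the names are illustrative, and the window should exceed typical replication lag.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Routes reads: primary right after a user's own write, replica otherwise
class ReadRouter {
    private final Map<String, Instant> lastWrite = new ConcurrentHashMap<>();
    private final Duration pinWindow;

    ReadRouter(Duration pinWindow) {
        this.pinWindow = pinWindow;
    }

    // Record that this user just wrote (call after a successful write)
    void recordWrite(String userId) {
        lastWrite.put(userId, Instant.now());
    }

    // True → read from primary (guarantees the user sees their own write);
    // false → a replica is acceptable, even if slightly stale
    boolean readFromPrimary(String userId) {
        Instant w = lastWrite.get(userId);
        return w != null && Instant.now().isBefore(w.plus(pinWindow));
    }
}
```

In a multi-instance deployment the `lastWrite` map would live in a shared store (or a signed cookie), since the follow-up read may hit a different app instance.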
Common numbers to know:
- 1 day = 86,400 seconds ≈ 100,000 seconds
- 1 million requests/day ≈ 12 requests/second
- 1 billion requests/day ≈ 12,000 requests/second

Storage:
- 1 char = 1 byte (ASCII) or 1-4 bytes (UTF-8)
- 1 million users × 1 KB/user = 1 GB
- 1 billion rows × 100 bytes/row = 100 GB

Latency:
- Read from memory: 100 ns
- Read from SSD: 100 µs (1,000x slower)
- Read from disk: 10 ms (100,000x slower)
- Network round trip (same datacenter): 500 µs
- Network round trip (cross-country): 150 ms
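A worked example applying these rules of thumb to a hypothetical service with 10 million requests/day and 500-byte records:

```java
// Back-of-envelope helpers: average QPS from daily volume, and raw storage
// from row count × row size. The inputs below are made-up example numbers.
class BackOfEnvelope {
    static long qps(long requestsPerDay) {
        return requestsPerDay / 86_400;  // seconds per day
    }

    static long storageBytes(long rows, long bytesPerRow) {
        return rows * bytesPerRow;
    }

    public static void main(String[] args) {
        // 10M requests/day → ~115 req/s average (plan for 2-10x peak)
        System.out.println(qps(10_000_000L));
        // 10M rows × 500 bytes → 5,000,000,000 bytes = 5 GB (before indexes
        // and replication, which typically multiply this by 2-3x)
        System.out.println(storageBytes(10_000_000L, 500));
    }
}
```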