Multi-Region Active-Active

Global Distribution, Data Replication, Conflict Resolution, and Failover Strategies



Why Multi-Region Matters for Twilio DA

Twilio Context: Twilio operates a global communications platform where latency and availability directly impact customer experience. A US customer calling a UK number needs infrastructure in both regions. Multi-region isn't optional—it's the product.

Key Drivers for Multi-Region

Driver | Requirement | Twilio Example
Latency | < 100ms for real-time | Voice/Video requires regional media servers
Availability | 99.99%+ uptime | Region failure can't take down global service
Data Residency | GDPR, data sovereignty | EU customer data must stay in EU
Disaster Recovery | RTO < 1 hour, RPO < 1 min | Natural disasters, cloud provider outages
Carrier Proximity | Close to telecom infrastructure | Super Network connections in each region

Architecture Patterns

Pattern Comparison

Aspect | Active-Passive | Active-Active
Traffic distribution | 100% to primary, 0% to secondary | Split across all regions
Failover time | Minutes (DNS TTL, warmup) | Seconds (already serving traffic)
Resource utilization | 50% (standby is idle) | ~100% (all regions active)
Data consistency | Simpler (one writer) | Complex (conflict resolution)
Cost | Higher (paying for idle) | Lower (all resources active)
Operational complexity | Lower | Higher
Best for | Compliance, simple DR | Global latency, high availability

Active-Active Architecture

GLOBAL LAYER
  Route 53 / Global Accelerator (latency-based routing, health checks)
        │
   ┌────┴─────────────────────────────┐
   ▼                                  ▼
REGION: US-EAST-1                REGION: EU-WEST-1
  Cell Router                      Cell Router
    ├── Cell A (Ent)                 ├── Cell C (Ent)
    └── Cell B (SMB)                 └── Cell D (SMB)
  Regional Data (Aurora, etc.) ◄── async repl. ──► Regional Data (Aurora, etc.)
        │                                  │
        └───────────────┬──────────────────┘
                        ▼
GLOBAL DATA
  DynamoDB Global Tables (Identity, Routing)
  (multi-master, automatic conflict resolution)

Key Design Principles

  1. Regional autonomy: Each region can operate independently if isolated
  2. Data locality: Keep data close to where it's used
  3. Eventual consistency: Accept temporary inconsistency for availability
  4. Conflict resolution: Define deterministic rules for concurrent writes
  5. Health-based routing: Automatically shift traffic from unhealthy regions

2-Minute Answer: "Active-Active vs Active-Passive?"

"Active-passive has one primary region handling all traffic while a standby waits for failover. It's simpler—one writer means no conflicts—but failover takes minutes because the standby is cold and DNS needs to propagate. You're also paying for idle capacity. Active-active serves traffic from all regions simultaneously. Failover is nearly instant because every region is already warm. You get better latency by routing users to the nearest region. The trade-off is data consistency: with multiple writers, you need conflict resolution strategies. For Twilio, active-active is essential. Voice calls need low latency, so we need regional media servers. SMS delivery needs regional carrier connections. And 99.99% availability means we can't have minutes of downtime during failover. We accept the complexity of conflict resolution to get the availability and latency benefits."

Data Replication Strategies

Synchronous vs Asynchronous Replication

Synchronous Replication

Write Request
      │
      ▼
┌─────────┐    sync     ┌─────────┐
│ Primary │────────────►│ Replica │
│  (ack)  │◄────────────│  (ack)  │
└─────────┘             └─────────┘
      │
      ▼
Response to Client (only after both ack)
  • Consistency: Strong (RPO = 0)
  • Latency: High (cross-region RTT)
  • Availability: Lower (both must be up)
  • Use when: Financial transactions, inventory

Asynchronous Replication

Write Request
      │
      ▼
┌─────────┐
│ Primary │──── Response to Client (immediately)
│  (ack)  │
└─────────┘
      │
      │ async (background)
      ▼
┌─────────┐
│ Replica │
└─────────┘
  • Consistency: Eventual (RPO > 0)
  • Latency: Low (local write only)
  • Availability: Higher (tolerates replica lag)
  • Use when: User data, messages, logs
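
The trade-off above can be sketched as two write paths. This is a minimal sketch, assuming a hypothetical writeRemote helper that simulates cross-region RTT; a real system would queue and retry failed background replication.

```javascript
async function writeLocal(store, key, value) {
  store.set(key, value); // local, in-region write: fast
}

async function writeRemote(replica, key, value) {
  // Simulate cross-region RTT before the replica acknowledges (assumption).
  await new Promise((resolve) => setTimeout(resolve, 50));
  replica.set(key, value);
}

// Synchronous replication: acknowledge the client only after BOTH the
// primary and the replica have the write (RPO = 0, higher latency).
async function syncWrite(primary, replica, key, value) {
  await Promise.all([
    writeLocal(primary, key, value),
    writeRemote(replica, key, value),
  ]);
  return 'ack'; // client response pays the cross-region RTT
}

// Asynchronous replication: acknowledge after the local write and replicate
// in the background (low latency, RPO > 0).
function asyncWrite(primary, replica, key, value) {
  primary.set(key, value);
  // Fire-and-forget; a production system would queue and retry on failure.
  writeRemote(replica, key, value).catch(() => { /* retry / alert */ });
  return 'ack'; // the replica may briefly lag behind
}
```

The client-visible difference is exactly the one in the bullets: syncWrite's latency includes the RTT, while asyncWrite returns immediately and the replica converges shortly after.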

Replication Topologies

Topology | Description | Pros | Cons
Leader-Follower | One writer, multiple readers | Simple, no conflicts | Leader is a SPOF, failover lag
Multi-Leader | Multiple writers, each replicates to the others | Write availability, lower latency | Conflict resolution needed
Leaderless | Any node accepts writes (Dynamo-style) | Highest availability | Complex consistency (quorums)
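
The leaderless row relies on quorum intersection: with N replicas, choosing a write quorum W and read quorum R such that R + W > N guarantees every read set overlaps the latest successful write set. A toy sketch under simplifying assumptions (versioning and replica selection are illustrative, not a production protocol):

```javascript
// Each replica is a Map of key -> { value, version }.
function quorumWrite(replicas, key, value, version, W) {
  let acks = 0;
  for (const replica of replicas) {
    const current = replica.get(key);
    if (!current || current.version < version) {
      replica.set(key, { value, version }); // only overwrite older versions
    }
    acks += 1;
    if (acks >= W) break; // stop once the write quorum has acked
  }
  if (acks < W) throw new Error('write quorum not reached');
}

function quorumRead(replicas, key, R) {
  // Query R replicas and keep the highest-versioned value seen.
  return replicas
    .slice(0, R)
    .map((rep) => rep.get(key))
    .filter(Boolean)
    .reduce((best, cur) => (!best || cur.version > best.version ? cur : best), null);
}
```

Because R + W > N, at least one replica queried by quorumRead also received the latest quorumWrite, so the highest version wins even when some replicas missed the write.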

Replication Lag Implications

The Problem: User writes to Region A, then reads from Region B before replication completes. They see stale data—or worse, their write appears to have vanished.

Solutions

Pattern | How It Works | Trade-off
Read-your-writes | Route a user's reads to the same region as their writes | Requires sticky sessions or user-region affinity
Monotonic reads | Track last-seen timestamp, reject stale reads | May require waiting or retry
Causal consistency | Track dependencies between operations | Complex implementation
Version vectors | Each write includes a vector clock | Metadata overhead
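
Read-your-writes from the table above can be implemented with a short-lived region pin. A minimal sketch; the names sessionAffinity and WRITE_AFFINITY_MS, and the 60-second window, are illustrative assumptions:

```javascript
const WRITE_AFFINITY_MS = 60_000;  // how long reads stay pinned after a write (assumption)
const sessionAffinity = new Map(); // userId -> { region, lastWriteAt }

function recordWrite(userId, region, now = Date.now()) {
  sessionAffinity.set(userId, { region, lastWriteAt: now });
}

function pickReadRegion(userId, nearestRegion, now = Date.now()) {
  const pin = sessionAffinity.get(userId);
  if (pin && now - pin.lastWriteAt < WRITE_AFFINITY_MS) {
    return pin.region; // guarantee the user sees their own recent writes
  }
  return nearestRegion; // otherwise the lowest-latency region is fine
}
```

The window only needs to exceed the worst-case replication lag; after it expires, latency-based routing takes over again.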

Data Classification for Replication

Key Insight: Not all data needs the same replication strategy. Classify data by consistency requirements.
Data Type | Consistency Need | Replication Strategy | Twilio Example
Account/Identity | Strong | DynamoDB Global Tables (LWW) | API credentials, account settings
Routing/Cell Assignment | Strong | DynamoDB Global Tables + Cache | Customer → Cell mapping
Message Content | Eventual | Regional with async backup | SMS body, media files
Call State | Regional only | No cross-region replication | Active call data (ephemeral)
Analytics/Logs | Eventual | Async to central data lake | Usage metrics, audit logs

2-Minute Answer: "How do you handle data replication across regions?"

"I classify data by consistency requirements. For identity and routing data that must be globally consistent, I use DynamoDB Global Tables with last-writer-wins conflict resolution—it's multi-master, so any region can write, and conflicts are resolved automatically by timestamp. For transactional data like message delivery status, I use Aurora Global Database with a primary writer and read replicas in other regions. Writes go to the primary, reads can hit local replicas with acceptable lag. For truly regional data like active call state, I don't replicate at all—it's ephemeral and specific to the media servers handling that call. The key insight is that async replication introduces lag, so you need strategies like read-your-writes consistency for user-facing operations. I route a user's reads to the same region as their writes for the session, which guarantees they see their own updates immediately."

Conflict Resolution

The Fundamental Problem: Two users update the same record in different regions simultaneously. When the updates replicate, which one wins?

Conflict Resolution Strategies

1. Last-Writer-Wins (LWW)

Region A (t=100): UPDATE user SET name='Alice' WHERE id=1
Region B (t=101): UPDATE user SET name='Bob'   WHERE id=1

After replication: name='Bob' (higher timestamp wins)

Pros: Simple, deterministic, no manual intervention
Cons: Data loss (Alice's update is silently discarded)

Use when: Updates are idempotent, latest value is always correct (e.g., user profile)
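
The LWW rule above is easy to express as a merge function. A minimal sketch; the tie-break on region id is an assumption (real services differ on how timestamp ties are broken):

```javascript
// Each replicated record carries its write timestamp and origin region.
function lwwMerge(local, incoming) {
  if (incoming.ts > local.ts) return incoming; // higher timestamp wins
  if (incoming.ts < local.ts) return local;
  // Equal timestamps: break the tie deterministically by region id
  // (assumption) so all regions converge to the same value.
  return incoming.region > local.region ? incoming : local;
}
```

Determinism is the whole point: every region applies the same rule in any order and converges, at the cost of silently discarding the losing write.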

2. Merge/CRDT (Conflict-free Replicated Data Types)

Region A: ADD item='apple'  to shopping_cart
Region B: ADD item='banana' to shopping_cart

After replication: shopping_cart = ['apple', 'banana'] (union of both)

Types of CRDTs:
  • G-Counter: only increments, sum all replicas
  • PN-Counter: increment and decrement
  • G-Set: add-only set (union)
  • OR-Set: add and remove with unique tags
  • LWW-Register: last-writer-wins for single values

Use when: Operations can be mathematically merged (counters, sets)
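
A G-Counter from the list above is small enough to sketch in full: each region increments only its own slot, the value is the sum of all slots, and merge takes the element-wise max, which makes merges commutative, associative, and idempotent. A toy implementation, not a production CRDT library:

```javascript
class GCounter {
  constructor() { this.slots = {}; } // region -> that region's local count
  increment(region, n = 1) {
    this.slots[region] = (this.slots[region] || 0) + n;
  }
  value() {
    // The counter's value is the sum across all regions' slots.
    return Object.values(this.slots).reduce((a, b) => a + b, 0);
  }
  merge(other) {
    // Element-wise max: applying the same merge twice changes nothing.
    for (const [region, count] of Object.entries(other.slots)) {
      this.slots[region] = Math.max(this.slots[region] || 0, count);
    }
  }
}
```

Because merge is idempotent, replicas can exchange state repeatedly and in any order without double-counting.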

3. Application-Level Resolution

Region A: UPDATE account SET balance=100 WHERE id=1
Region B: UPDATE account SET balance=80  WHERE id=1

Conflict detected! Options:
  a) Prompt user to choose
  b) Apply business rules (e.g., keep lower balance for safety)
  c) Store both versions, resolve later
  d) Reject one based on business priority

Use when: Business logic determines correct resolution (financial data)

4. Operational Transformation (OT)

Document: "Hello"
Region A: INSERT ' World' at position 5 → "Hello World"
Region B: INSERT '!' at position 5      → "Hello!"

OT transforms operations:
  B's operation adjusted: INSERT '!' at position 11
Result: "Hello World!"

Use when: Collaborative editing (Google Docs-style)
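
The transform step in the example can be sketched for the insert-insert case; real OT engines also handle deletes, tie-breaking between equal positions, and multi-operation histories, so this is only the core idea:

```javascript
function applyInsert(doc, op) {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

// Transform opB so it can be applied AFTER opA: if B's insert lands at or
// after A's position, shift it right by the length of A's inserted text.
function transformInsert(opB, opA) {
  if (opB.pos >= opA.pos) {
    return { ...opB, pos: opB.pos + opA.text.length };
  }
  return opB;
}
```

Running the document's example through it: A's insert produces "Hello World", B's insert is shifted from position 5 to 11, and applying the transformed B yields "Hello World!".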

Conflict Resolution in AWS Services

Service | Conflict Resolution | Details
DynamoDB Global Tables | Last-Writer-Wins | Uses aws:rep:updatetime for ordering
Aurora Global Database | Single writer (no conflicts) | Only the primary accepts writes
S3 Cross-Region Replication | Last-Writer-Wins | Based on object timestamp
ElastiCache Global Datastore | Single writer | Primary cluster only

Designing for Conflict Avoidance

Best Strategy: Avoid conflicts by design rather than resolving them after the fact.
Technique | How It Works | Example
Region affinity | Each record "belongs" to one region | US customers write to US, EU to EU
Partitioned writes | Different records updated in different regions | User profile in home region only
Append-only data | Never update, only insert new records | Event log, message history
Idempotent operations | Same operation applied twice = same result | SET status='delivered' (not INCREMENT counter)
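
The idempotency row is worth making concrete: replaying a SET (for example, a duplicated replication event) converges to the same state, while replaying an INCREMENT double-counts. A toy event applier illustrating the difference:

```javascript
function applyEvent(store, event) {
  if (event.op === 'set') {
    store[event.key] = event.value; // idempotent: safe to replay
  } else if (event.op === 'increment') {
    store[event.key] = (store[event.key] || 0) + 1; // NOT replay-safe
  }
  return store;
}
```

This is why delivery-status updates are modeled as SET status='delivered': at-least-once replication can deliver the same event twice without corrupting state.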

2-Minute Answer: "How do you handle conflicts in multi-region writes?"

"First, I try to avoid conflicts by design. Most data has natural region affinity—a customer's data is written in their home region. For data that truly needs multi-region writes, I choose the strategy based on the data type. For simple values like user preferences, last-writer-wins works fine—DynamoDB Global Tables does this automatically using timestamps. For additive data like shopping carts or message queues, I use CRDT-like structures where the merge operation is a union. For critical data like financial balances, I avoid multi-region writes entirely—I designate a single authoritative region and route writes there, accepting the latency cost for correctness. The key insight is that conflict resolution isn't one-size-fits-all. I classify data into tiers: tier 1 gets strong consistency with single-region writes; tier 2 gets LWW with async replication; tier 3 gets CRDTs for merge-friendly operations."

Traffic Routing & DNS

Routing Strategies

Strategy | How It Works | Best For | AWS Service
Latency-based | Route to the region with lowest latency | Global users, real-time apps | Route 53 Latency Routing
Geolocation | Route based on the user's location | Data residency, compliance | Route 53 Geolocation
Weighted | Distribute traffic by percentage | Canary deploys, gradual migration | Route 53 Weighted
Failover | Primary/secondary with health checks | Disaster recovery | Route 53 Failover
Anycast | Same IP announced from multiple locations | DDoS protection, lowest latency | Global Accelerator

DNS-Based Routing

┌──────────┐  1. DNS Query        ┌─────────────┐
│  Client  │─────────────────────►│  Route 53   │
│          │◄─────────────────────│  (Latency   │
└──────────┘  2. IP of nearest    │   Routing)  │
      │          region           └─────────────┘
      │ 3. HTTPS Request                │ Health checks
      ▼                                 ▼
┌──────────────┐                 ┌──────────────┐
│  US-EAST-1   │                 │  EU-WEST-1   │
│  (50ms RTT)  │                 │ (150ms RTT)  │
└──────────────┘                 └──────────────┘

TTL Considerations:
  • Low TTL (60s): fast failover, more DNS queries
  • High TTL (300s): fewer queries, slower failover
  • Typical: 60-120s for active-active

AWS Global Accelerator vs Route 53

Route 53 (DNS)

  • Returns IP address, client connects directly
  • Failover limited by DNS TTL
  • Client caching can delay failover
  • Lower cost
  • Works with any endpoint

Global Accelerator (Anycast)

  • Static IPs, traffic routed through AWS backbone
  • Instant failover (no DNS propagation)
  • TCP/UDP termination at edge
  • Higher cost, better performance
  • DDoS protection built-in

Health Checks

Critical Design Decision: What does "healthy" mean for your application? A region might be up but degraded.
Health Check Type | What It Tests | Trade-off
TCP | Can establish a connection | Fast, but doesn't test the application
HTTP 200 | Endpoint returns 200 | Tests that the application is responding
String Match | Response contains an expected string | Tests application logic
Deep Health | Tests dependencies (DB, cache) | Most accurate, but risks cascading failures

// Deep health check endpoint (Express). Assumes checkDatabase, checkRedis,
// checkKafka, and checkCriticalService each return { healthy: boolean, ... }.
const express = require('express');
const app = express();

app.get('/health/deep', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    cache: await checkRedis(),
    queue: await checkKafka(),
    downstream: await checkCriticalService()
  };

  const allHealthy = Object.values(checks).every(c => c.healthy);
  const status = allHealthy ? 200 : 503;

  res.status(status).json({
    status: allHealthy ? 'healthy' : 'degraded',
    checks,
    region: process.env.AWS_REGION,
    timestamp: new Date().toISOString()
  });
});

2-Minute Answer: "How do you route traffic in a multi-region setup?"

"I use a layered approach. At the DNS layer, Route 53 with latency-based routing directs users to the nearest healthy region. Health checks run every 10 seconds against deep health endpoints that verify database connectivity, not just that the server responds. For critical applications, I add Global Accelerator on top—it gives us anycast IPs so failover is instant, no DNS TTL to wait for. Traffic flows through AWS's backbone network, which is faster and more reliable than the public internet. Within each region, the Cell Router handles customer-to-cell assignment using our DynamoDB-backed routing table. For data residency requirements, I use geolocation routing to ensure EU users always hit EU infrastructure, with failover only to other EU regions. The key trade-off is cost versus failover speed: Route 53 alone is cheap but failover takes 60+ seconds due to TTL; Global Accelerator costs more but fails over in seconds."

Failure Scenarios & Mitigation

Failure Categories

Failure Type | Scope | RTO Target | Mitigation
Single AZ failure | One datacenter | Automatic (seconds) | Multi-AZ deployment
Region failure | All AZs in a region | < 5 minutes | Multi-region active-active
Global service failure | AWS service (IAM, Route 53) | Variable | Static credentials, cached DNS
Network partition | Inter-region connectivity | Automatic | Regional autonomy, async replication
Data corruption | All regions (if replicated) | Hours | Point-in-time recovery, backups

Scenario 1: Complete Region Failure

BEFORE: Traffic split US-EAST-1 (60%) / EU-WEST-1 (40%)
EVENT:  US-EAST-1 goes offline

Timeline:
  T+0:    Region fails
  T+10s:  Health checks detect failure (3 consecutive failures)
  T+20s:  Route 53 removes US-EAST-1 from DNS responses
  T+60s:  DNS TTL expires, clients get EU-WEST-1 only
  T+120s: All traffic now to EU-WEST-1

AFTER: EU-WEST-1 (100%) - must handle 2.5x normal load

Key Questions:
  • Can EU-WEST-1 scale to handle all traffic?
  • Is data replicated enough that EU-WEST-1 has recent state?
  • What happens when US-EAST-1 comes back?

Scenario 2: Network Partition (Split Brain)

The Nightmare Scenario: Regions can't communicate but both think they're the primary. Users in each region continue writing, creating divergent data.
BEFORE: US-EAST-1 ◄──replication──► EU-WEST-1

EVENT: Inter-region network fails

  US-EAST-1 ◄────────╳────────► EU-WEST-1
      │      (network partition)     │
      ▼                              ▼
  Users write                   Users write
  balance=100                   balance=80

AFTER PARTITION HEALS:
  Which balance is correct? Both regions have different values.

Mitigation Options:
  1. Quorum writes (need majority of regions to write)
  2. Leader election (only one region accepts writes)
  3. Conflict resolution (LWW, merge, manual)
  4. Accept partition as a feature (regional autonomy)

Scenario 3: Cascading Failure

1. US-EAST-1 database becomes slow (not failed)
2. Health checks still pass (200 OK, just slow)
3. Requests queue up, timeouts increase
4. Circuit breakers don't trip (it's "working")
5. EU-WEST-1 overwhelmed when US users retry there
6. Both regions now degraded

Prevention:
  • Latency-based health checks (fail if p99 > threshold)
  • Adaptive load shedding
  • Request quotas per region
  • Circuit breakers with latency triggers
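
The first prevention item, latency-based health checks, can be sketched as a sliding-window p99 signal that reports unhealthy even while requests still return 200. The window size and threshold here are illustrative assumptions:

```javascript
class LatencyHealth {
  constructor(thresholdMs, windowSize = 100) {
    this.thresholdMs = thresholdMs; // fail health checks above this p99
    this.windowSize = windowSize;   // sliding window of recent requests
    this.samples = [];
  }
  record(latencyMs) {
    this.samples.push(latencyMs);
    if (this.samples.length > this.windowSize) this.samples.shift();
  }
  p99() {
    if (this.samples.length === 0) return 0;
    const sorted = [...this.samples].sort((a, b) => a - b);
    return sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * 0.99))];
  }
  healthy() {
    // Report degraded on "slow", not just "down" - this is what lets the
    // router shift traffic before the cascade starts.
    return this.p99() < this.thresholdMs;
  }
}
```

Wiring this into the deep health endpoint means a slow-but-responding region fails its checks and gets drained before retries overwhelm the other regions.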

Recovery Playbook

Phase | Actions | Validation
1. Detect | Automated alerts, health check failures | Alert fires within 60 seconds
2. Isolate | Remove the failed region from routing | No new traffic to the failed region
3. Scale | Increase capacity in healthy regions | Latency/error rate stable
4. Investigate | Root cause analysis | Understand the failure mode
5. Repair | Fix the issue in the failed region | Local validation passes
6. Resync | Replay missed writes, reconcile data | Data consistency verified
7. Restore | Gradually route traffic back (10% → 50% → 100%) | Error rate stays low
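
Phase 7's gradual ramp can be sketched as a loop that raises the recovered region's routing weight only while the error rate stays acceptable. setWeight and getErrorRate are hypothetical hooks into the DNS weighting and metrics systems, not real APIs:

```javascript
async function rampTraffic({ steps = [10, 50, 100], maxErrorRate = 0.01,
                             setWeight, getErrorRate }) {
  for (const weight of steps) {
    await setWeight(weight);       // e.g. update a Route 53 weighted record
    const errorRate = await getErrorRate(); // observe after each step
    if (errorRate > maxErrorRate) {
      await setWeight(0);          // bail out: drain the region again
      return { restored: false, failedAt: weight, errorRate };
    }
  }
  return { restored: true };
}
```

A real rollout would also wait a soak period between steps; the point is that restore is a guarded loop with a rollback path, not a single switch flip.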

2-Minute Answer: "What happens when a region fails?"

"Our health checks detect the failure within 10-30 seconds—we check every 10 seconds and require 2-3 consecutive failures to avoid false positives. Route 53 automatically removes the failed region from DNS responses. Existing connections drain; new connections go to healthy regions. The key design decision is whether healthy regions can absorb the load. We provision for N+1 capacity, meaning if we have 3 regions, each can handle 50% of total traffic—so losing one region still leaves us with 100% capacity. For data, we accept that async-replicated data might be slightly stale—RPO of a few seconds. Critical data uses DynamoDB Global Tables with sub-second replication. When the region recovers, we don't just flip traffic back on. We verify data consistency first, then gradually ramp traffic—10%, then 50%, then 100%—watching error rates at each step. The biggest risk isn't the failure itself; it's the recovery. Thundering herd when traffic shifts back has caused more outages than the original failures."

Multi-Region in Twilio's Cell Architecture

Regional Cell Deployment

GLOBAL LAYER
  Route 53 / Global Accelerator → Cell Router (Lambda@Edge)
  DynamoDB Global Tables (Identity, Customer→Cell mapping)
  Billing Aggregation Service
        │
   ┌────┴───────────────┬────────────────────┐
   ▼                    ▼                    ▼
US-EAST-1             EU-WEST-1            AP-SOUTHEAST-1
  Enterprise-US-A       Enterprise-EU-A      Enterprise-AP-A
  (Customers 1-50)      (Customers 1-50)     (Customers 1-50)
  SMB-US-Pool           SMB-EU-Pool          SMB-AP-Pool
  (Customers 1K+)       (Customers 1K+)      (Customers 1K+)

Regional resources in each region:
  • Carrier connections
  • Media servers
  • Message queues

Key Insight: Customers are region-homed, not distributed across regions

Data Distribution in Cell Architecture

Data Type | Location | Replication | Rationale
Identity/API Keys | Global | DynamoDB Global Tables | Auth from any region
Customer→Cell mapping | Global | DynamoDB + Redis cache | Route from any edge
Customer data | Cell-local | Within-region only | Isolation, data residency
Message content | Cell-local | Backup to cold storage | Regional compliance
Active calls | Regional (ephemeral) | None | Real-time, latency-sensitive
Usage/Billing | Cell → Global | Async aggregation | Eventually consistent OK

Customer Region Assignment

Design Choice: Customers are assigned to a HOME region. All their data lives in cells within that region. This simplifies data residency and avoids cross-region writes.
// Customer → Region → Cell assignment
{
  "customer_id": "AC123456",
  "home_region": "eu-west-1",        // GDPR requires EU data residency
  "cell_id": "enterprise-eu-west-1-a",
  "tier": "enterprise",
  "failover_region": "eu-central-1", // Failover stays in EU
  "created_at": "2024-01-15T10:00:00Z"
}

// Routing logic: serve locally when the request lands in the customer's
// home region, otherwise proxy to the home region.
async function routeCustomer(customerId, requestRegion) {
  const customer = await getCustomerMapping(customerId);

  if (customer.home_region === requestRegion) {
    return customer.cell_id;  // Local routing
  }

  // Cross-region request - proxy to home region
  return proxyToRegion(customer.home_region, customerId);
}

Cross-Region Communication

Scenario: US customer sends SMS to EU number

1. US customer calls Twilio API (US edge)
2. Cell Router: Customer AC123 → Cell enterprise-us-east-1-a
3. US cell processes request, needs EU carrier delivery
4. US cell → Super Network (global carrier connections)
5. Super Network routes to EU carrier via local connection
6. Delivery status: EU → US cell (async callback)

Key: Customer data stays in the US cell.
Only carrier routing crosses regions (via the Super Network).

Regional Failover for Cells

Scenario | Customer Impact | Failover Action
Single cell failure | 10-50 customers | Migrate to another cell in the same region
Region failure (EU customer) | All EU customers | Fail over to EU-CENTRAL-1 (stay in EU for GDPR)
Region failure (US customer) | All US customers | Fail over to US-WEST-2
Global routing failure | All customers | Fall back to a static routing table at the edge

2-Minute Answer: "How does multi-region work with your cell architecture?"

"Customers are region-homed, not globally distributed. When a customer signs up, they're assigned to a home region based on their location and data residency requirements—EU customers go to EU regions for GDPR. Within that region, they're assigned to a specific cell. All their data lives in that cell. Global services—identity, routing, billing—use DynamoDB Global Tables and are available in all regions. So I can authenticate a customer from any edge location, look up their cell assignment, and route them to the correct region. If their home region fails, we failover to another region in the same geographic area—EU fails over to another EU region, not to US. This maintains data residency compliance. The Super Network is the one truly global component that crosses regions—carrier connections exist in each region, but the routing intelligence to pick the best carrier path is global. For an EU customer sending an SMS to a US number, the request processes in their EU cell, but the Super Network routes delivery through US carrier connections for best deliverability."

AWS Services Deep Dive

DynamoDB Global Tables

What it is: Multi-region, multi-master database with automatic replication.
Feature | Details
Replication latency | Typically < 1 second
Conflict resolution | Last-writer-wins (using aws:rep:updatetime)
Consistency | Eventually consistent across regions; strongly consistent within a region
Write capacity | Must be provisioned in each region (replicated writes consume WCU)
Best for | Identity, routing tables, configuration, session data

// DynamoDB Global Tables conflict example (AWS SDK v2 style)
// Region A writes at t=100
await dynamodb.putItem({
  TableName: 'customer_settings',
  Item: {
    customer_id: { S: 'AC123' },
    theme: { S: 'dark' }
    // DynamoDB adds: aws:rep:updatetime = 100
  }
}).promise();

// Region B writes at t=101 (before replication arrives)
await dynamodb.putItem({
  TableName: 'customer_settings',
  Item: {
    customer_id: { S: 'AC123' },
    theme: { S: 'light' }
    // DynamoDB adds: aws:rep:updatetime = 101
  }
}).promise();

// After replication: theme = 'light' (t=101 > t=100)

Aurora Global Database

What it is: Single-writer PostgreSQL/MySQL with read replicas across regions.
Feature | Details
Replication | Storage-level; typically < 1 second of lag
Writes | Primary region only (single writer)
Reads | Any region (subject to replication lag)
Failover | Promote a secondary to primary (< 1 minute)
Write forwarding | Secondaries can forward writes to the primary (adds latency)
Best for | Transactional data, relational queries, complex joins

S3 Cross-Region Replication

Feature | Details
Replication | Asynchronous; typically minutes
Scope | Entire bucket, prefix, or tag-based
Versioning | Required on both source and destination
Delete markers | Optionally replicated
Best for | Media files, backups, static assets

ElastiCache Global Datastore

Feature | Details
Engine | Redis only (not Memcached)
Replication | Async; typically < 1 second
Writes | Primary cluster only
Failover | Promote a secondary (< 1 minute)
Best for | Session cache, rate limiting, real-time leaderboards

Service Comparison Matrix

Requirement | DynamoDB Global | Aurora Global | ElastiCache Global
Multi-region writes | Yes (multi-master) | No (single writer) | No (single writer)
Strong consistency | Within region only | Within primary only | Within primary only
Replication lag | < 1 second | < 1 second | < 1 second
Conflict resolution | LWW (automatic) | N/A (single writer) | N/A (single writer)
Query flexibility | Limited (key-value) | Full SQL | Limited (key-value)
Cost model | Per request + storage | Instance + storage | Instance-based

2-Minute Answer: "Which AWS database for multi-region?"

"It depends on the access pattern. For truly global data that needs multi-region writes—like identity, routing tables, or session data—I use DynamoDB Global Tables. It's multi-master, so any region can write, and conflicts resolve automatically with last-writer-wins. Replication is under a second. For transactional workloads with complex queries—like order processing or financial data—I use Aurora Global Database. It's single-writer, which means no conflicts, but writes must go to the primary region. Secondary regions get read replicas with sub-second lag. For caching, ElastiCache Global Datastore with Redis, again single-writer but incredibly fast reads from any region. The pattern I use: DynamoDB Global Tables for the control plane—who is this customer, which cell are they in. Aurora for the data plane—what are their messages, call records. ElastiCache for hot path acceleration—caching the DynamoDB lookups that happen on every request."

Interview Q&A

Design Questions

Q: "Design a multi-region active-active system for Twilio's SMS API"

Framework:

  1. Clarify requirements: Latency target? Data residency? RPO/RTO?
  2. Identify data types: What's global vs regional?
  3. Choose replication strategy: Multi-master vs single-writer per data type
  4. Design routing: How do requests reach the right region?
  5. Handle failures: What happens when a region goes down?

Q: "How would you handle a region failure during peak traffic?"

A: "First, the failure detection. Health checks run every 10 seconds; after 3 failures, Route 53 removes the region. During the DNS TTL window—60 seconds—some traffic still tries the failed region, so we need the load balancer to return 503 fast, not timeout. The surviving regions need capacity headroom. We provision N+1, so losing one of three regions leaves us with 100% capacity. Auto-scaling kicks in, but there's a lag, so baseline capacity matters. For data, anything in DynamoDB Global Tables is already in the surviving regions. Aurora needs a failover—promote the read replica to primary, which takes under a minute. The risk is the recovery. When the failed region comes back, we don't just flip traffic back. We verify data consistency, warm up caches, then gradually ramp from 10% to 50% to 100%, watching error rates. Thundering herd on recovery has caused more outages than the original failures."

Q: "How do you test multi-region failover?"

A: "We do three types of testing. First, synthetic failovers in staging—we literally turn off a region and watch what happens. This catches configuration issues but doesn't test production load. Second, chaos engineering in production—we use AWS Fault Injection Simulator to degrade regions gradually. Start with one AZ, then whole region, measuring impact at each step. Third, game days—scheduled exercises where we simulate region failure with the on-call team responding as if it's real. The game day reveals process gaps, not just technical ones. Does the runbook work? Can we actually reach the people we need? We also test recovery, which is often neglected. Bringing a region back online after extended outage—with stale caches, lagged data—is harder than the initial failover."

Q: "What's the hardest part of multi-region active-active?"

A: "Conflict resolution and the cognitive overhead it creates. With single-writer, the mental model is simple—there's one source of truth. With multi-master, every piece of data might have conflicts you haven't thought of. We had a bug where two regions both thought they were updating 'last_login' timestamp—harmless, right? But downstream analytics depended on that field being monotonically increasing for session tracking. Last-writer-wins caused sessions to appear to go backwards in time. The second hardest part is testing. You can't easily simulate network partitions between regions in a realistic way. And the failure modes are combinatorial—region A fails, or B fails, or the link between them fails, or A is slow but not down. Each scenario needs different handling. That's why I prefer regional affinity where possible—customer data lives in one region, reducing the surface area for conflicts."

Q: "How do you ensure data consistency across regions?"

A: "I don't try to guarantee global strong consistency—that requires synchronous replication which kills latency and availability. Instead, I design for the consistency level each data type actually needs. For identity and routing, I use DynamoDB Global Tables with eventual consistency—writes replicate in under a second, and for a routing table lookup, reading a slightly stale cell assignment is fine because the cell will redirect if wrong. For user-facing data where they need to see their own writes, I use read-your-writes consistency—route reads to the same region as their last write for that session. For financial data where inconsistency means real problems, I avoid multi-region writes entirely. One region is authoritative; other regions proxy writes there. Yes, it adds latency for users far from that region, but correctness matters more than speed for money. The key insight is that 'consistency' isn't binary—it's a spectrum, and different data has different needs."

Quick Reference: Key Numbers

Metric | Target | Why
Cross-region latency | 50-150 ms | Physical distance + network hops
DynamoDB Global Tables replication | < 1 second | Typical async replication lag
Aurora Global failover | < 1 minute | Promote a secondary to primary
DNS TTL | 60-120 seconds | Balance failover speed vs query volume
Health check interval | 10-30 seconds | Fast detection vs false positives
Capacity headroom | N+1 regions | Survive one region failure
RPO (Recovery Point Objective) | < 1 minute | Maximum acceptable data loss
RTO (Recovery Time Objective) | < 5 minutes | Maximum acceptable downtime

Key Takeaways

For Twilio DA Interview

  • Active-active for availability and latency, not just DR
  • Classify data by consistency needs
  • DynamoDB Global Tables for global data (LWW)
  • Aurora Global for transactional, single-writer
  • Regional affinity avoids conflicts by design
  • Test failover AND recovery

Trade-offs to Articulate

  • Consistency vs availability (CAP)
  • Multi-master vs single-writer complexity
  • Sync replication latency vs async data loss
  • DNS failover time vs Global Accelerator cost
  • Regional autonomy vs global consistency
  • N+1 capacity cost vs failover capability
Your Narrative: "Multi-region active-active isn't just about disaster recovery—it's about latency and availability for a global communications platform. I design with regional affinity to avoid conflicts, use DynamoDB Global Tables for truly global data, and accept eventual consistency where it's safe. The hardest part isn't the technology; it's the operational discipline—testing failover regularly, provisioning N+1 capacity, and having runbooks for scenarios you hope never happen."