Multi-Region Active-Active

Global Distribution, Data Replication, Conflict Resolution, and Failover Strategies



Why Multi-Region Matters for Twilio DA

Twilio Context: Twilio operates a global communications platform where latency and availability directly impact customer experience. A US customer calling a UK number needs infrastructure in both regions. Multi-region isn't optional—it's the product.

Key Drivers for Multi-Region

Driver | Requirement | Twilio Example
Latency | < 100ms for real-time | Voice/Video requires regional media servers
Availability | 99.99%+ uptime | Region failure can't take down global service
Data Residency | GDPR, data sovereignty | EU customer data must stay in EU
Disaster Recovery | RTO < 1 hour, RPO < 1 min | Natural disasters, cloud provider outages
Carrier Proximity | Close to telecom infrastructure | Super Network connections in each region

Architecture Patterns

Pattern Comparison

Aspect | Active-Passive | Active-Active
Traffic distribution | 100% to primary, 0% to secondary | Split across all regions
Failover time | Minutes (DNS TTL, warmup) | Seconds (already serving traffic)
Resource utilization | 50% (standby is idle) | ~100% (all regions active)
Data consistency | Simpler (one writer) | Complex (conflict resolution)
Cost | Higher (paying for idle) | Lower (all resources active)
Operational complexity | Lower | Higher
Best for | Compliance, simple DR | Global latency, high availability

Active-Active Architecture

GLOBAL LAYER
  Route 53 / Global Accelerator (latency-based routing, health checks)
        │
   ┌────┴─────────────────────────────┐
   ▼                                  ▼
REGION: US-EAST-1                REGION: EU-WEST-1
  Cell Router                      Cell Router
    ├── Cell A (Ent)                 ├── Cell C (Ent)
    └── Cell B (SMB)                 └── Cell D (SMB)
  Regional Data (Aurora, etc.) ◄── async repl. ──► Regional Data (Aurora, etc.)
        │                                  │
        └───────────────┬──────────────────┘
                        ▼
GLOBAL DATA
  DynamoDB Global Tables (Identity, Routing)
  (multi-master, automatic conflict resolution)

Key Design Principles

  1. Regional autonomy: Each region can operate independently if isolated
  2. Data locality: Keep data close to where it's used
  3. Eventual consistency: Accept temporary inconsistency for availability
  4. Conflict resolution: Define deterministic rules for concurrent writes
  5. Health-based routing: Automatically shift traffic from unhealthy regions

2-Minute Answer: "Active-Active vs Active-Passive?"

"Active-passive has one primary region handling all traffic while a standby waits for failover. It's simpler—one writer means no conflicts—but failover takes minutes because the standby is cold and DNS needs to propagate. You're also paying for idle capacity. Active-active serves traffic from all regions simultaneously. Failover is nearly instant because every region is already warm. You get better latency by routing users to the nearest region. The trade-off is data consistency: with multiple writers, you need conflict resolution strategies. For Twilio, active-active is essential. Voice calls need low latency, so we need regional media servers. SMS delivery needs regional carrier connections. And 99.99% availability means we can't have minutes of downtime during failover. We accept the complexity of conflict resolution to get the availability and latency benefits."

Data Replication Strategies

Synchronous vs Asynchronous Replication

Synchronous Replication

Write Request
      │
      ▼
┌─────────┐    sync     ┌─────────┐
│ Primary │────────────►│ Replica │
│  (ack)  │◄────────────│  (ack)  │
└─────────┘             └─────────┘
      │
      ▼
Response to Client (only after both ack)
  • Consistency: Strong (RPO = 0)
  • Latency: High (cross-region RTT)
  • Availability: Lower (both must be up)
  • Use when: Financial transactions, inventory

Asynchronous Replication

Write Request
      │
      ▼
┌─────────┐
│ Primary │──── Response to Client (immediately)
│  (ack)  │
└─────────┘
      │
      │ async (background)
      ▼
┌─────────┐
│ Replica │
└─────────┘
  • Consistency: Eventual (RPO > 0)
  • Latency: Low (local write only)
  • Availability: Higher (tolerates replica lag)
  • Use when: User data, messages, logs
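
The trade-off above can be sketched as two write paths. This is a minimal sketch, assuming a hypothetical writeRemote helper that simulates cross-region RTT; a real system would queue and retry failed background replication.

```javascript
async function writeLocal(store, key, value) {
  store.set(key, value); // local, in-region write: fast
}

async function writeRemote(replica, key, value) {
  // Simulate cross-region RTT before the replica acknowledges (assumption).
  await new Promise((resolve) => setTimeout(resolve, 50));
  replica.set(key, value);
}

// Synchronous replication: acknowledge the client only after BOTH the
// primary and the replica have the write (RPO = 0, higher latency).
async function syncWrite(primary, replica, key, value) {
  await Promise.all([
    writeLocal(primary, key, value),
    writeRemote(replica, key, value),
  ]);
  return 'ack'; // client response pays the cross-region RTT
}

// Asynchronous replication: acknowledge after the local write and replicate
// in the background (low latency, RPO > 0).
function asyncWrite(primary, replica, key, value) {
  primary.set(key, value);
  // Fire-and-forget; a production system would queue and retry on failure.
  writeRemote(replica, key, value).catch(() => { /* retry / alert */ });
  return 'ack'; // the replica may briefly lag behind
}
```

The client-visible difference is exactly the one in the bullets: syncWrite's latency includes the RTT, while asyncWrite returns immediately and the replica converges shortly after.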

Replication Topologies

Topology | Description | Pros | Cons
Leader-Follower | One writer, multiple readers | Simple, no conflicts | Leader is a SPOF, failover lag
Multi-Leader | Multiple writers, each replicates to the others | Write availability, lower latency | Conflict resolution needed
Leaderless | Any node accepts writes (Dynamo-style) | Highest availability | Complex consistency (quorums)
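
The leaderless row relies on quorum intersection: with N replicas, choosing a write quorum W and read quorum R such that R + W > N guarantees every read set overlaps the latest successful write set. A toy sketch under simplifying assumptions (versioning and replica selection are illustrative, not a production protocol):

```javascript
// Each replica is a Map of key -> { value, version }.
function quorumWrite(replicas, key, value, version, W) {
  let acks = 0;
  for (const replica of replicas) {
    const current = replica.get(key);
    if (!current || current.version < version) {
      replica.set(key, { value, version }); // only overwrite older versions
    }
    acks += 1;
    if (acks >= W) break; // stop once the write quorum has acked
  }
  if (acks < W) throw new Error('write quorum not reached');
}

function quorumRead(replicas, key, R) {
  // Query R replicas and keep the highest-versioned value seen.
  return replicas
    .slice(0, R)
    .map((rep) => rep.get(key))
    .filter(Boolean)
    .reduce((best, cur) => (!best || cur.version > best.version ? cur : best), null);
}
```

Because R + W > N, at least one replica queried by quorumRead also received the latest quorumWrite, so the highest version wins even when some replicas missed the write.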

Replication Lag Implications

The Problem: User writes to Region A, then reads from Region B before replication completes. They see stale data—or worse, their write appears to have vanished.

Solutions

Pattern | How It Works | Trade-off
Read-your-writes | Route a user's reads to the same region as their writes | Requires sticky sessions or user-region affinity
Monotonic reads | Track last-seen timestamp, reject stale reads | May require waiting or retry
Causal consistency | Track dependencies between operations | Complex implementation
Version vectors | Each write includes a vector clock | Metadata overhead
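
Read-your-writes from the table above can be implemented with a short-lived region pin. A minimal sketch; the names sessionAffinity and WRITE_AFFINITY_MS, and the 60-second window, are illustrative assumptions:

```javascript
const WRITE_AFFINITY_MS = 60_000;  // how long reads stay pinned after a write (assumption)
const sessionAffinity = new Map(); // userId -> { region, lastWriteAt }

function recordWrite(userId, region, now = Date.now()) {
  sessionAffinity.set(userId, { region, lastWriteAt: now });
}

function pickReadRegion(userId, nearestRegion, now = Date.now()) {
  const pin = sessionAffinity.get(userId);
  if (pin && now - pin.lastWriteAt < WRITE_AFFINITY_MS) {
    return pin.region; // guarantee the user sees their own recent writes
  }
  return nearestRegion; // otherwise the lowest-latency region is fine
}
```

The window only needs to exceed the worst-case replication lag; after it expires, latency-based routing takes over again.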

Data Classification for Replication

Key Insight: Not all data needs the same replication strategy. Classify data by consistency requirements.
Data Type | Consistency Need | Replication Strategy | Twilio Example
Account/Identity | Strong | DynamoDB Global Tables (LWW) | API credentials, account settings
Routing/Cell Assignment | Strong | DynamoDB Global Tables + Cache | Customer → Cell mapping
Message Content | Eventual | Regional with async backup | SMS body, media files
Call State | Regional only | No cross-region replication | Active call data (ephemeral)
Analytics/Logs | Eventual | Async to central data lake | Usage metrics, audit logs

2-Minute Answer: "How do you handle data replication across regions?"

"I classify data by consistency requirements. For identity and routing data that must be globally consistent, I use DynamoDB Global Tables with last-writer-wins conflict resolution—it's multi-master, so any region can write, and conflicts are resolved automatically by timestamp. For transactional data like message delivery status, I use Aurora Global Database with a primary writer and read replicas in other regions. Writes go to the primary, reads can hit local replicas with acceptable lag. For truly regional data like active call state, I don't replicate at all—it's ephemeral and specific to the media servers handling that call. The key insight is that async replication introduces lag, so you need strategies like read-your-writes consistency for user-facing operations. I route a user's reads to the same region as their writes for the session, which guarantees they see their own updates immediately."

Conflict Resolution

The Fundamental Problem: Two users update the same record in different regions simultaneously. When the updates replicate, which one wins?

Conflict Resolution Strategies

1. Last-Writer-Wins (LWW)

Region A (t=100): UPDATE user SET name='Alice' WHERE id=1
Region B (t=101): UPDATE user SET name='Bob'   WHERE id=1

After replication: name='Bob' (higher timestamp wins)

Pros: Simple, deterministic, no manual intervention
Cons: Data loss (Alice's update is silently discarded)

Use when: Updates are idempotent, latest value is always correct (e.g., user profile)
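
The LWW rule above is easy to express as a merge function. A minimal sketch; the tie-break on region id is an assumption (real services differ on how timestamp ties are broken):

```javascript
// Each replicated record carries its write timestamp and origin region.
function lwwMerge(local, incoming) {
  if (incoming.ts > local.ts) return incoming; // higher timestamp wins
  if (incoming.ts < local.ts) return local;
  // Equal timestamps: break the tie deterministically by region id
  // (assumption) so all regions converge to the same value.
  return incoming.region > local.region ? incoming : local;
}
```

Determinism is the whole point: every region applies the same rule in any order and converges, at the cost of silently discarding the losing write.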

2. Merge/CRDT (Conflict-free Replicated Data Types)

Region A: ADD item='apple'  to shopping_cart
Region B: ADD item='banana' to shopping_cart

After replication: shopping_cart = ['apple', 'banana'] (union of both)

Types of CRDTs:
  • G-Counter: only increments, sum all replicas
  • PN-Counter: increment and decrement
  • G-Set: add-only set (union)
  • OR-Set: add and remove with unique tags
  • LWW-Register: last-writer-wins for single values

Use when: Operations can be mathematically merged (counters, sets)
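
A G-Counter from the list above is small enough to sketch in full: each region increments only its own slot, the value is the sum of all slots, and merge takes the element-wise max, which makes merges commutative, associative, and idempotent. A toy implementation, not a production CRDT library:

```javascript
class GCounter {
  constructor() { this.slots = {}; } // region -> that region's local count
  increment(region, n = 1) {
    this.slots[region] = (this.slots[region] || 0) + n;
  }
  value() {
    // The counter's value is the sum across all regions' slots.
    return Object.values(this.slots).reduce((a, b) => a + b, 0);
  }
  merge(other) {
    // Element-wise max: applying the same merge twice changes nothing.
    for (const [region, count] of Object.entries(other.slots)) {
      this.slots[region] = Math.max(this.slots[region] || 0, count);
    }
  }
}
```

Because merge is idempotent, replicas can exchange state repeatedly and in any order without double-counting.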

3. Application-Level Resolution

Region A: UPDATE account SET balance=100 WHERE id=1
Region B: UPDATE account SET balance=80  WHERE id=1

Conflict detected! Options:
  a) Prompt user to choose
  b) Apply business rules (e.g., keep lower balance for safety)
  c) Store both versions, resolve later
  d) Reject one based on business priority

Use when: Business logic determines correct resolution (financial data)

4. Operational Transformation (OT)

Document: "Hello"
Region A: INSERT ' World' at position 5 → "Hello World"
Region B: INSERT '!' at position 5      → "Hello!"

OT transforms operations:
  B's operation adjusted: INSERT '!' at position 11
Result: "Hello World!"

Use when: Collaborative editing (Google Docs-style)
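
The transform step in the example can be sketched for the insert-insert case; real OT engines also handle deletes, tie-breaking between equal positions, and multi-operation histories, so this is only the core idea:

```javascript
function applyInsert(doc, op) {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

// Transform opB so it can be applied AFTER opA: if B's insert lands at or
// after A's position, shift it right by the length of A's inserted text.
function transformInsert(opB, opA) {
  if (opB.pos >= opA.pos) {
    return { ...opB, pos: opB.pos + opA.text.length };
  }
  return opB;
}
```

Running the document's example through it: A's insert produces "Hello World", B's insert is shifted from position 5 to 11, and applying the transformed B yields "Hello World!".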

Conflict Resolution in AWS Services

Service | Conflict Resolution | Details
DynamoDB Global Tables | Last-Writer-Wins | Uses aws:rep:updatetime for ordering
Aurora Global Database | Single writer (no conflicts) | Only the primary accepts writes
S3 Cross-Region Replication | Last-Writer-Wins | Based on object timestamp
ElastiCache Global Datastore | Single writer | Primary cluster only

Designing for Conflict Avoidance

Best Strategy: Avoid conflicts by design rather than resolving them after the fact.
Technique | How It Works | Example
Region affinity | Each record "belongs" to one region | US customers write to US, EU to EU
Partitioned writes | Different records updated in different regions | User profile in home region only
Append-only data | Never update, only insert new records | Event log, message history
Idempotent operations | Same operation applied twice = same result | SET status='delivered' (not INCREMENT counter)
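
The idempotency row is worth making concrete: replaying a SET (for example, a duplicated replication event) converges to the same state, while replaying an INCREMENT double-counts. A toy event applier illustrating the difference:

```javascript
function applyEvent(store, event) {
  if (event.op === 'set') {
    store[event.key] = event.value; // idempotent: safe to replay
  } else if (event.op === 'increment') {
    store[event.key] = (store[event.key] || 0) + 1; // NOT replay-safe
  }
  return store;
}
```

This is why delivery-status updates are modeled as SET status='delivered': at-least-once replication can deliver the same event twice without corrupting state.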

2-Minute Answer: "How do you handle conflicts in multi-region writes?"

"First, I try to avoid conflicts by design. Most data has natural region affinity—a customer's data is written in their home region. For data that truly needs multi-region writes, I choose the strategy based on the data type. For simple values like user preferences, last-writer-wins works fine—DynamoDB Global Tables does this automatically using timestamps. For additive data like shopping carts or message queues, I use CRDT-like structures where the merge operation is a union. For critical data like financial balances, I avoid multi-region writes entirely—I designate a single authoritative region and route writes there, accepting the latency cost for correctness. The key insight is that conflict resolution isn't one-size-fits-all. I classify data into tiers: tier 1 gets strong consistency with single-region writes; tier 2 gets LWW with async replication; tier 3 gets CRDTs for merge-friendly operations."

Traffic Routing & DNS

Routing Strategies

Strategy | How It Works | Best For | AWS Service
Latency-based | Route to the region with lowest latency | Global users, real-time apps | Route 53 Latency Routing
Geolocation | Route based on the user's location | Data residency, compliance | Route 53 Geolocation
Weighted | Distribute traffic by percentage | Canary deploys, gradual migration | Route 53 Weighted
Failover | Primary/secondary with health checks | Disaster recovery | Route 53 Failover
Anycast | Same IP announced from multiple locations | DDoS protection, lowest latency | Global Accelerator

DNS-Based Routing

┌──────────┐  1. DNS Query        ┌─────────────┐
│  Client  │─────────────────────►│  Route 53   │
│          │◄─────────────────────│  (Latency   │
└──────────┘  2. IP of nearest    │   Routing)  │
      │          region           └─────────────┘
      │ 3. HTTPS Request                │ Health checks
      ▼                                 ▼
┌──────────────┐                 ┌──────────────┐
│  US-EAST-1   │                 │  EU-WEST-1   │
│  (50ms RTT)  │                 │ (150ms RTT)  │
└──────────────┘                 └──────────────┘

TTL Considerations:
  • Low TTL (60s): fast failover, more DNS queries
  • High TTL (300s): fewer queries, slower failover
  • Typical: 60-120s for active-active

AWS Global Accelerator vs Route 53

Route 53 (DNS)

  • Returns IP address, client connects directly
  • Failover limited by DNS TTL
  • Client caching can delay failover
  • Lower cost
  • Works with any endpoint

Global Accelerator (Anycast)

  • Static IPs, traffic routed through AWS backbone
  • Instant failover (no DNS propagation)
  • TCP/UDP termination at edge
  • Higher cost, better performance
  • DDoS protection built-in

Health Checks

Critical Design Decision: What does "healthy" mean for your application? A region might be up but degraded.
Health Check Type | What It Tests | Trade-off
TCP | Can establish a connection | Fast, but doesn't test the application
HTTP 200 | Endpoint returns 200 | Tests that the application is responding
String Match | Response contains an expected string | Tests application logic
Deep Health | Tests dependencies (DB, cache) | Most accurate, but risks cascading failures

// Deep health check endpoint (Express). Assumes checkDatabase, checkRedis,
// checkKafka, and checkCriticalService each return { healthy: boolean, ... }.
const express = require('express');
const app = express();

app.get('/health/deep', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    cache: await checkRedis(),
    queue: await checkKafka(),
    downstream: await checkCriticalService()
  };

  const allHealthy = Object.values(checks).every(c => c.healthy);
  const status = allHealthy ? 200 : 503;

  res.status(status).json({
    status: allHealthy ? 'healthy' : 'degraded',
    checks,
    region: process.env.AWS_REGION,
    timestamp: new Date().toISOString()
  });
});

2-Minute Answer: "How do you route traffic in a multi-region setup?"

"I use a layered approach. At the DNS layer, Route 53 with latency-based routing directs users to the nearest healthy region. Health checks run every 10 seconds against deep health endpoints that verify database connectivity, not just that the server responds. For critical applications, I add Global Accelerator on top—it gives us anycast IPs so failover is instant, no DNS TTL to wait for. Traffic flows through AWS's backbone network, which is faster and more reliable than the public internet. Within each region, the Cell Router handles customer-to-cell assignment using our DynamoDB-backed routing table. For data residency requirements, I use geolocation routing to ensure EU users always hit EU infrastructure, with failover only to other EU regions. The key trade-off is cost versus failover speed: Route 53 alone is cheap but failover takes 60+ seconds due to TTL; Global Accelerator costs more but fails over in seconds."

Failure Scenarios & Mitigation

Failure Categories

Failure Type | Scope | RTO Target | Mitigation
Single AZ failure | One datacenter | Automatic (seconds) | Multi-AZ deployment
Region failure | All AZs in a region | < 5 minutes | Multi-region active-active
Global service failure | AWS service (IAM, Route 53) | Variable | Static credentials, cached DNS
Network partition | Inter-region connectivity | Automatic | Regional autonomy, async replication
Data corruption | All regions (if replicated) | Hours | Point-in-time recovery, backups

Scenario 1: Complete Region Failure

BEFORE: Traffic split US-EAST-1 (60%) / EU-WEST-1 (40%)
EVENT:  US-EAST-1 goes offline

Timeline:
  T+0:    Region fails
  T+10s:  Health checks detect failure (3 consecutive failures)
  T+20s:  Route 53 removes US-EAST-1 from DNS responses
  T+60s:  DNS TTL expires, clients get EU-WEST-1 only
  T+120s: All traffic now to EU-WEST-1

AFTER: EU-WEST-1 (100%) - must handle 2.5x normal load

Key Questions:
  • Can EU-WEST-1 scale to handle all traffic?
  • Is data replicated enough that EU-WEST-1 has recent state?
  • What happens when US-EAST-1 comes back?

Scenario 2: Network Partition (Split Brain)

The Nightmare Scenario: Regions can't communicate but both think they're the primary. Users in each region continue writing, creating divergent data.
BEFORE: US-EAST-1 ◄──replication──► EU-WEST-1

EVENT: Inter-region network fails

  US-EAST-1 ◄────────╳────────► EU-WEST-1
      │      (network partition)     │
      ▼                              ▼
  Users write                   Users write
  balance=100                   balance=80

AFTER PARTITION HEALS:
  Which balance is correct? Both regions have different values.

Mitigation Options:
  1. Quorum writes (need majority of regions to write)
  2. Leader election (only one region accepts writes)
  3. Conflict resolution (LWW, merge, manual)
  4. Accept partition as a feature (regional autonomy)

Scenario 3: Cascading Failure

1. US-EAST-1 database becomes slow (not failed)
2. Health checks still pass (200 OK, just slow)
3. Requests queue up, timeouts increase
4. Circuit breakers don't trip (it's "working")
5. EU-WEST-1 overwhelmed when US users retry there
6. Both regions now degraded

Prevention:
  • Latency-based health checks (fail if p99 > threshold)
  • Adaptive load shedding
  • Request quotas per region
  • Circuit breakers with latency triggers
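
The first prevention item, latency-based health checks, can be sketched as a sliding-window p99 signal that reports unhealthy even while requests still return 200. The window size and threshold here are illustrative assumptions:

```javascript
class LatencyHealth {
  constructor(thresholdMs, windowSize = 100) {
    this.thresholdMs = thresholdMs; // fail health checks above this p99
    this.windowSize = windowSize;   // sliding window of recent requests
    this.samples = [];
  }
  record(latencyMs) {
    this.samples.push(latencyMs);
    if (this.samples.length > this.windowSize) this.samples.shift();
  }
  p99() {
    if (this.samples.length === 0) return 0;
    const sorted = [...this.samples].sort((a, b) => a - b);
    return sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * 0.99))];
  }
  healthy() {
    // Report degraded on "slow", not just "down" - this is what lets the
    // router shift traffic before the cascade starts.
    return this.p99() < this.thresholdMs;
  }
}
```

Wiring this into the deep health endpoint means a slow-but-responding region fails its checks and gets drained before retries overwhelm the other regions.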

Recovery Playbook

Phase | Actions | Validation
1. Detect | Automated alerts, health check failures | Alert fires within 60 seconds
2. Isolate | Remove the failed region from routing | No new traffic to the failed region
3. Scale | Increase capacity in healthy regions | Latency/error rate stable
4. Investigate | Root cause analysis | Understand the failure mode
5. Repair | Fix the issue in the failed region | Local validation passes
6. Resync | Replay missed writes, reconcile data | Data consistency verified
7. Restore | Gradually route traffic back (10% → 50% → 100%) | Error rate stays low
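
Phase 7's gradual ramp can be sketched as a loop that raises the recovered region's routing weight only while the error rate stays acceptable. setWeight and getErrorRate are hypothetical hooks into the DNS weighting and metrics systems, not real APIs:

```javascript
async function rampTraffic({ steps = [10, 50, 100], maxErrorRate = 0.01,
                             setWeight, getErrorRate }) {
  for (const weight of steps) {
    await setWeight(weight);       // e.g. update a Route 53 weighted record
    const errorRate = await getErrorRate(); // observe after each step
    if (errorRate > maxErrorRate) {
      await setWeight(0);          // bail out: drain the region again
      return { restored: false, failedAt: weight, errorRate };
    }
  }
  return { restored: true };
}
```

A real rollout would also wait a soak period between steps; the point is that restore is a guarded loop with a rollback path, not a single switch flip.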

2-Minute Answer: "What happens when a region fails?"

"Our health checks detect the failure within 10-30 seconds—we check every 10 seconds and require 2-3 consecutive failures to avoid false positives. Route 53 automatically removes the failed region from DNS responses. Existing connections drain; new connections go to healthy regions. The key design decision is whether healthy regions can absorb the load. We provision for N+1 capacity, meaning if we have 3 regions, each can handle 50% of total traffic—so losing one region still leaves us with 100% capacity. For data, we accept that async-replicated data might be slightly stale—RPO of a few seconds. Critical data uses DynamoDB Global Tables with sub-second replication. When the region recovers, we don't just flip traffic back on. We verify data consistency first, then gradually ramp traffic—10%, then 50%, then 100%—watching error rates at each step. The biggest risk isn't the failure itself; it's the recovery. Thundering herd when traffic shifts back has caused more outages than the original failures."

Multi-Region in Twilio's Cell Architecture

Regional Cell Deployment

GLOBAL LAYER
  Route 53 / Global Accelerator → Cell Router (Lambda@Edge)
  DynamoDB Global Tables (Identity, Customer→Cell mapping)
  Billing Aggregation Service
        │
   ┌────┴───────────────┬────────────────────┐
   ▼                    ▼                    ▼
US-EAST-1             EU-WEST-1            AP-SOUTHEAST-1
  Enterprise-US-A       Enterprise-EU-A      Enterprise-AP-A
  (Customers 1-50)      (Customers 1-50)     (Customers 1-50)
  SMB-US-Pool           SMB-EU-Pool          SMB-AP-Pool
  (Customers 1K+)       (Customers 1K+)      (Customers 1K+)

Regional resources in each region:
  • Carrier connections
  • Media servers
  • Message queues

Key Insight: Customers are region-homed, not distributed across regions

Data Distribution in Cell Architecture

Data Type | Location | Replication | Rationale
Identity/API Keys | Global | DynamoDB Global Tables | Auth from any region
Customer→Cell mapping | Global | DynamoDB + Redis cache | Route from any edge
Customer data | Cell-local | Within-region only | Isolation, data residency
Message content | Cell-local | Backup to cold storage | Regional compliance
Active calls | Regional (ephemeral) | None | Real-time, latency-sensitive
Usage/Billing | Cell → Global | Async aggregation | Eventually consistent OK

Customer Region Assignment

Design Choice: Customers are assigned to a HOME region. All their data lives in cells within that region. This simplifies data residency and avoids cross-region writes.
// Customer → Region → Cell assignment
{
  "customer_id": "AC123456",
  "home_region": "eu-west-1",        // GDPR requires EU data residency
  "cell_id": "enterprise-eu-west-1-a",
  "tier": "enterprise",
  "failover_region": "eu-central-1", // Failover stays in EU
  "created_at": "2024-01-15T10:00:00Z"
}

// Routing logic: serve locally when the request lands in the customer's
// home region, otherwise proxy to the home region.
async function routeCustomer(customerId, requestRegion) {
  const customer = await getCustomerMapping(customerId);

  if (customer.home_region === requestRegion) {
    return customer.cell_id;  // Local routing
  }

  // Cross-region request - proxy to home region
  return proxyToRegion(customer.home_region, customerId);
}

Cross-Region Communication

Scenario: US customer sends SMS to EU number

1. US customer calls Twilio API (US edge)
2. Cell Router: Customer AC123 → Cell enterprise-us-east-1-a
3. US cell processes request, needs EU carrier delivery
4. US cell → Super Network (global carrier connections)
5. Super Network routes to EU carrier via local connection
6. Delivery status: EU → US cell (async callback)

Key: Customer data stays in the US cell.
Only carrier routing crosses regions (via the Super Network).

Regional Failover for Cells

Scenario | Customer Impact | Failover Action
Single cell failure | 10-50 customers | Migrate to another cell in the same region
Region failure (EU customer) | All EU customers | Fail over to EU-CENTRAL-1 (stay in EU for GDPR)
Region failure (US customer) | All US customers | Fail over to US-WEST-2
Global routing failure | All customers | Fall back to a static routing table at the edge

2-Minute Answer: "How does multi-region work with your cell architecture?"

"Customers are region-homed, not globally distributed. When a customer signs up, they're assigned to a home region based on their location and data residency requirements—EU customers go to EU regions for GDPR. Within that region, they're assigned to a specific cell. All their data lives in that cell. Global services—identity, routing, billing—use DynamoDB Global Tables and are available in all regions. So I can authenticate a customer from any edge location, look up their cell assignment, and route them to the correct region. If their home region fails, we failover to another region in the same geographic area—EU fails over to another EU region, not to US. This maintains data residency compliance. The Super Network is the one truly global component that crosses regions—carrier connections exist in each region, but the routing intelligence to pick the best carrier path is global. For an EU customer sending an SMS to a US number, the request processes in their EU cell, but the Super Network routes delivery through US carrier connections for best deliverability."

AWS Services Deep Dive

DynamoDB Global Tables

What it is: Multi-region, multi-master database with automatic replication.
Feature | Details
Replication latency | Typically < 1 second
Conflict resolution | Last-writer-wins (using aws:rep:updatetime)
Consistency | Eventually consistent across regions; strongly consistent within a region
Write capacity | Must be provisioned in each region (replicated writes consume WCU)
Best for | Identity, routing tables, configuration, session data

// DynamoDB Global Tables conflict example (AWS SDK v2 style)
// Region A writes at t=100
await dynamodb.putItem({
  TableName: 'customer_settings',
  Item: {
    customer_id: { S: 'AC123' },
    theme: { S: 'dark' }
    // DynamoDB adds: aws:rep:updatetime = 100
  }
}).promise();

// Region B writes at t=101 (before replication arrives)
await dynamodb.putItem({
  TableName: 'customer_settings',
  Item: {
    customer_id: { S: 'AC123' },
    theme: { S: 'light' }
    // DynamoDB adds: aws:rep:updatetime = 101
  }
}).promise();

// After replication: theme = 'light' (t=101 > t=100)

Aurora Global Database

What it is: Single-writer PostgreSQL/MySQL with read replicas across regions.
Feature | Details
Replication | Storage-level; typically < 1 second of lag
Writes | Primary region only (single writer)
Reads | Any region (subject to replication lag)
Failover | Promote a secondary to primary (< 1 minute)
Write forwarding | Secondaries can forward writes to the primary (adds latency)
Best for | Transactional data, relational queries, complex joins

S3 Cross-Region Replication

Feature | Details
Replication | Asynchronous; typically minutes
Scope | Entire bucket, prefix, or tag-based
Versioning | Required on both source and destination
Delete markers | Optionally replicated
Best for | Media files, backups, static assets

ElastiCache Global Datastore

Feature | Details
Engine | Redis only (not Memcached)
Replication | Async; typically < 1 second
Writes | Primary cluster only
Failover | Promote a secondary (< 1 minute)
Best for | Session cache, rate limiting, real-time leaderboards

Service Comparison Matrix

Requirement | DynamoDB Global | Aurora Global | ElastiCache Global
Multi-region writes | Yes (multi-master) | No (single writer) | No (single writer)
Strong consistency | Within region only | Within primary only | Within primary only
Replication lag | < 1 second | < 1 second | < 1 second
Conflict resolution | LWW (automatic) | N/A (single writer) | N/A (single writer)
Query flexibility | Limited (key-value) | Full SQL | Limited (key-value)
Cost model | Per request + storage | Instance + storage | Instance-based

2-Minute Answer: "Which AWS database for multi-region?"

"It depends on the access pattern. For truly global data that needs multi-region writes—like identity, routing tables, or session data—I use DynamoDB Global Tables. It's multi-master, so any region can write, and conflicts resolve automatically with last-writer-wins. Replication is under a second. For transactional workloads with complex queries—like order processing or financial data—I use Aurora Global Database. It's single-writer, which means no conflicts, but writes must go to the primary region. Secondary regions get read replicas with sub-second lag. For caching, ElastiCache Global Datastore with Redis, again single-writer but incredibly fast reads from any region. The pattern I use: DynamoDB Global Tables for the control plane—who is this customer, which cell are they in. Aurora for the data plane—what are their messages, call records. ElastiCache for hot path acceleration—caching the DynamoDB lookups that happen on every request."

Interview Q&A

Design Questions

Q: "Design a multi-region active-active system for Twilio's SMS API"

Framework:

  1. Clarify requirements: Latency target? Data residency? RPO/RTO?
  2. Identify data types: What's global vs regional?
  3. Choose replication strategy: Multi-master vs single-writer per data type
  4. Design routing: How do requests reach the right region?
  5. Handle failures: What happens when a region goes down?

Q: "How would you handle a region failure during peak traffic?"

A: "First, the failure detection. Health checks run every 10 seconds; after 3 failures, Route 53 removes the region. During the DNS TTL window—60 seconds—some traffic still tries the failed region, so we need the load balancer to return 503 fast, not timeout. The surviving regions need capacity headroom. We provision N+1, so losing one of three regions leaves us with 100% capacity. Auto-scaling kicks in, but there's a lag, so baseline capacity matters. For data, anything in DynamoDB Global Tables is already in the surviving regions. Aurora needs a failover—promote the read replica to primary, which takes under a minute. The risk is the recovery. When the failed region comes back, we don't just flip traffic back. We verify data consistency, warm up caches, then gradually ramp from 10% to 50% to 100%, watching error rates. Thundering herd on recovery has caused more outages than the original failures."

Q: "How do you test multi-region failover?"

A: "We do three types of testing. First, synthetic failovers in staging—we literally turn off a region and watch what happens. This catches configuration issues but doesn't test production load. Second, chaos engineering in production—we use AWS Fault Injection Simulator to degrade regions gradually. Start with one AZ, then whole region, measuring impact at each step. Third, game days—scheduled exercises where we simulate region failure with the on-call team responding as if it's real. The game day reveals process gaps, not just technical ones. Does the runbook work? Can we actually reach the people we need? We also test recovery, which is often neglected. Bringing a region back online after extended outage—with stale caches, lagged data—is harder than the initial failover."

Q: "What's the hardest part of multi-region active-active?"

A: "Conflict resolution and the cognitive overhead it creates. With single-writer, the mental model is simple—there's one source of truth. With multi-master, every piece of data might have conflicts you haven't thought of. We had a bug where two regions both thought they were updating 'last_login' timestamp—harmless, right? But downstream analytics depended on that field being monotonically increasing for session tracking. Last-writer-wins caused sessions to appear to go backwards in time. The second hardest part is testing. You can't easily simulate network partitions between regions in a realistic way. And the failure modes are combinatorial—region A fails, or B fails, or the link between them fails, or A is slow but not down. Each scenario needs different handling. That's why I prefer regional affinity where possible—customer data lives in one region, reducing the surface area for conflicts."

Q: "How do you ensure data consistency across regions?"

A: "I don't try to guarantee global strong consistency—that requires synchronous replication which kills latency and availability. Instead, I design for the consistency level each data type actually needs. For identity and routing, I use DynamoDB Global Tables with eventual consistency—writes replicate in under a second, and for a routing table lookup, reading a slightly stale cell assignment is fine because the cell will redirect if wrong. For user-facing data where they need to see their own writes, I use read-your-writes consistency—route reads to the same region as their last write for that session. For financial data where inconsistency means real problems, I avoid multi-region writes entirely. One region is authoritative; other regions proxy writes there. Yes, it adds latency for users far from that region, but correctness matters more than speed for money. The key insight is that 'consistency' isn't binary—it's a spectrum, and different data has different needs."

Quick Reference: Key Numbers

Metric | Target | Why
Cross-region latency | 50-150 ms | Physical distance + network hops
DynamoDB Global Tables replication | < 1 second | Typical async replication lag
Aurora Global failover | < 1 minute | Promote a secondary to primary
DNS TTL | 60-120 seconds | Balance failover speed vs query volume
Health check interval | 10-30 seconds | Fast detection vs false positives
Capacity headroom | N+1 regions | Survive one region failure
RPO (Recovery Point Objective) | < 1 minute | Maximum acceptable data loss
RTO (Recovery Time Objective) | < 5 minutes | Maximum acceptable downtime

Key Takeaways

For Twilio DA Interview

  • Active-active for availability and latency, not just DR
  • Classify data by consistency needs
  • DynamoDB Global Tables for global data (LWW)
  • Aurora Global for transactional, single-writer
  • Regional affinity avoids conflicts by design
  • Test failover AND recovery

Trade-offs to Articulate

  • Consistency vs availability (CAP)
  • Multi-master vs single-writer complexity
  • Sync replication latency vs async data loss
  • DNS failover time vs Global Accelerator cost
  • Regional autonomy vs global consistency
  • N+1 capacity cost vs failover capability
Your Narrative: "Multi-region active-active isn't just about disaster recovery—it's about latency and availability for a global communications platform. I design with regional affinity to avoid conflicts, use DynamoDB Global Tables for truly global data, and accept eventual consistency where it's safe. The hardest part isn't the technology; it's the operational discipline—testing failover regularly, provisioning N+1 capacity, and having runbooks for scenarios you hope never happen."