Multi-Region Active-Active
Global Distribution, Data Replication, Conflict Resolution, and Failover Strategies
Why Multi-Region Matters for Twilio DA
Twilio Context: Twilio operates a global communications platform where latency and availability directly impact customer experience. A US customer calling a UK number needs infrastructure in both regions. Multi-region isn't optional—it's the product.
Key Drivers for Multi-Region
| Driver | Requirement | Twilio Example |
|---|---|---|
| Latency | < 100ms for real-time | Voice/Video requires regional media servers |
| Availability | 99.99%+ uptime | Region failure can't take down global service |
| Data Residency | GDPR, data sovereignty | EU customer data must stay in EU |
| Disaster Recovery | RTO < 1 hour, RPO < 1 min | Natural disasters, cloud provider outages |
| Carrier Proximity | Close to telecom infrastructure | Super Network connections in each region |
Architecture Patterns
Pattern Comparison
| Aspect | Active-Passive | Active-Active |
|---|---|---|
| Traffic distribution | 100% to primary, 0% to secondary | Split across all regions |
| Failover time | Minutes (DNS TTL, warmup) | Seconds (already serving traffic) |
| Resource utilization | ~50% (standby is idle) | ~100% (all regions active) |
| Data consistency | Simpler (one writer) | Complex (conflict resolution) |
| Cost | Higher (paying for idle) | Lower (all resources active) |
| Operational complexity | Lower | Higher |
| Best for | Compliance, simple DR | Global latency, high availability |
Active-Active Architecture
┌─────────────────────────────────────────────────────────────┐
│ GLOBAL LAYER │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Route 53 / Global Accelerator │ │
│ │ (Latency-based routing, health checks) │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│ │
┌───────────────────────┘ └───────────────────────┐
│ │
▼ ▼
┌─────────────────────────────────────────┐ ┌─────────────────────────────────────────┐
│ REGION: US-EAST-1 │ │ REGION: EU-WEST-1 │
│ ┌───────────────────────────────────┐ │ │ ┌───────────────────────────────────┐ │
│ │ Cell Router │ │ │ │ Cell Router │ │
│ └───────────────────────────────────┘ │ │ └───────────────────────────────────┘ │
│ │ │ │ │ │
│ ┌─────────┴─────────┐ │ │ ┌─────────┴─────────┐ │
│ │ │ │ │ │ │ │
│ ▼ ▼ │ │ ▼ ▼ │
│ ┌──────┐ ┌──────┐ │ │ ┌──────┐ ┌──────┐ │
│ │Cell A│ │Cell B│ │ │ │Cell C│ │Cell D│ │
│ │(Ent) │ │(SMB) │ │ │ │(Ent) │ │(SMB) │ │
│ └──────┘ └──────┘ │ │ └──────┘ └──────┘ │
│ │ │ │
│ ┌───────────────────────────────────┐ │ │ ┌───────────────────────────────────┐ │
│ │ Regional Data (Aurora, etc.) │ │◄────────►│ │ Regional Data (Aurora, etc.) │ │
│ └───────────────────────────────────┘ │ Async │ └───────────────────────────────────┘ │
│ │ Repl. │ │
└─────────────────────────────────────────┘ └─────────────────────────────────────────┘
│ │
└──────────────────────┬─────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ GLOBAL DATA │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ DynamoDB Global Tables (Identity, Routing) │ │
│ │ Multi-master, automatic conflict resolution │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Key Design Principles
- Regional autonomy: Each region can operate independently if isolated
- Data locality: Keep data close to where it's used
- Eventual consistency: Accept temporary inconsistency for availability
- Conflict resolution: Define deterministic rules for concurrent writes
- Health-based routing: Automatically shift traffic from unhealthy regions
2-Minute Answer: "Active-Active vs Active-Passive?"
"Active-passive has one primary region handling all traffic while a standby waits for failover. It's simpler—one writer means no conflicts—but failover takes minutes because the standby is cold and DNS needs to propagate. You're also paying for idle capacity. Active-active serves traffic from all regions simultaneously. Failover is nearly instant because every region is already warm. You get better latency by routing users to the nearest region. The trade-off is data consistency: with multiple writers, you need conflict resolution strategies. For Twilio, active-active is essential. Voice calls need low latency, so we need regional media servers. SMS delivery needs regional carrier connections. And 99.99% availability means we can't have minutes of downtime during failover. We accept the complexity of conflict resolution to get the availability and latency benefits."
Data Replication Strategies
Synchronous vs Asynchronous Replication
Synchronous Replication
Write Request
│
▼
┌─────────┐ sync ┌─────────┐
│ Primary │───────────►│ Replica │
│ (ack) │◄───────────│ (ack) │
└─────────┘ └─────────┘
│
▼
Response to Client
(only after both ack)
- Consistency: Strong (RPO = 0)
- Latency: High (cross-region RTT)
- Availability: Lower (both must be up)
- Use when: Financial transactions, inventory
Asynchronous Replication
Write Request
│
▼
┌─────────┐
│ Primary │──── Response to Client
│ (ack) │ (immediately)
└─────────┘
│
│ async (background)
▼
┌─────────┐
│ Replica │
└─────────┘
- Consistency: Eventual (RPO > 0)
- Latency: Low (local write only)
- Availability: Higher (tolerates replica lag)
- Use when: User data, messages, logs
Replication Topologies
| Topology | Description | Pros | Cons |
|---|---|---|---|
| Leader-Follower | One writer, multiple readers | Simple, no conflicts | Leader is SPOF, failover lag |
| Multi-Leader | Multiple writers, each replicates to others | Write availability, lower latency | Conflict resolution needed |
| Leaderless | Any node accepts writes (Dynamo-style) | Highest availability | Complex consistency (quorums) |
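The leaderless row's quorum rule can be made concrete: with N replicas, choosing read and write quorum sizes so that R + W > N forces every read quorum to overlap every write quorum. A minimal sketch, with illustrative names and a per-write version number standing in for real vector clocks:

```javascript
// Quorum sketch for leaderless replication: with N replicas, requiring
// W write acks and R read responses such that R + W > N guarantees every
// read quorum overlaps every write quorum, so at least one replica in any
// read holds the newest value.
function quorumOverlaps(n, r, w) {
  return r + w > n;
}

// Pick the newest value among the replies a read quorum returned,
// using a per-write version number as the ordering key.
function readLatest(replies) {
  return replies.reduce((a, b) => (b.version > a.version ? b : a));
}
```

A common choice is N=3, R=2, W=2: one replica can be down and reads still see the latest write.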
Replication Lag Implications
The Problem: User writes to Region A, then reads from Region B before replication completes. They see stale data—or worse, their write appears to have vanished.
Solutions
| Pattern | How It Works | Trade-off |
|---|---|---|
| Read-your-writes | Route user's reads to same region as their writes | Requires sticky sessions or user-region affinity |
| Monotonic reads | Track last-seen timestamp, reject stale reads | May require waiting or retry |
| Causal consistency | Track dependencies between operations | Complex implementation |
| Version vectors | Each write includes vector clock | Metadata overhead |
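The read-your-writes pattern in the table above can be sketched as a small affinity map: after a write, pin the user's reads to the writing region for longer than the worst-case replication lag. All names and the 60-second window are illustrative:

```javascript
// Sketch of read-your-writes routing: after a user writes in a region,
// pin that user's reads to the same region for an affinity window that
// covers worst-case replication lag. Illustrative, not a real AWS API.
const AFFINITY_TTL_MS = 60_000;
const writeAffinity = new Map(); // userId -> { region, expiresAt }

function recordWrite(userId, region, now = Date.now()) {
  writeAffinity.set(userId, { region, expiresAt: now + AFFINITY_TTL_MS });
}

function pickReadRegion(userId, nearestRegion, now = Date.now()) {
  const pin = writeAffinity.get(userId);
  if (pin && pin.expiresAt > now) return pin.region; // read-your-writes
  return nearestRegion; // no recent write: lowest-latency region is fine
}
```

In practice the pin would live in a session cookie or a shared cache rather than process memory, so any edge node can honor it.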
Data Classification for Replication
Key Insight: Not all data needs the same replication strategy. Classify data by consistency requirements.
| Data Type | Consistency Need | Replication Strategy | Twilio Example |
|---|---|---|---|
| Account/Identity | Strong | DynamoDB Global Tables (LWW) | API credentials, account settings |
| Routing/Cell Assignment | Strong | DynamoDB Global Tables + Cache | Customer → Cell mapping |
| Message Content | Eventual | Regional with async backup | SMS body, media files |
| Call State | Regional only | No cross-region replication | Active call data (ephemeral) |
| Analytics/Logs | Eventual | Async to central data lake | Usage metrics, audit logs |
2-Minute Answer: "How do you handle data replication across regions?"
"I classify data by consistency requirements. For identity and routing data that must be globally consistent, I use DynamoDB Global Tables with last-writer-wins conflict resolution—it's multi-master, so any region can write, and conflicts are resolved automatically by timestamp. For transactional data like message delivery status, I use Aurora Global Database with a primary writer and read replicas in other regions. Writes go to the primary, reads can hit local replicas with acceptable lag. For truly regional data like active call state, I don't replicate at all—it's ephemeral and specific to the media servers handling that call. The key insight is that async replication introduces lag, so you need strategies like read-your-writes consistency for user-facing operations. I route a user's reads to the same region as their writes for the session, which guarantees they see their own updates immediately."
Conflict Resolution
The Fundamental Problem: Two users update the same record in different regions simultaneously. When the updates replicate, which one wins?
Conflict Resolution Strategies
1. Last-Writer-Wins (LWW)
Region A (t=100): UPDATE user SET name='Alice' WHERE id=1
Region B (t=101): UPDATE user SET name='Bob' WHERE id=1
After replication: name='Bob' (higher timestamp wins)
Pros: Simple, deterministic, no manual intervention
Cons: Data loss (Alice's update is silently discarded)
Use when: Updates are idempotent, latest value is always correct (e.g., user profile)
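A minimal sketch of the LWW merge itself; note the deterministic tiebreak on equal timestamps, without which two regions could resolve the same conflict differently (names are illustrative):

```javascript
// Last-writer-wins merge: the record with the higher timestamp survives.
// Equal timestamps need a deterministic tiebreak (here: region id) so that
// every replica picks the same winner regardless of merge order.
function lwwMerge(a, b) {
  if (a.ts !== b.ts) return a.ts > b.ts ? a : b;
  return a.region > b.region ? a : b; // deterministic tiebreak
}
```

Applied to the example above, Bob (t=101) wins and Alice's update is silently discarded.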
2. Merge/CRDT (Conflict-free Replicated Data Types)
Region A: ADD item='apple' to shopping_cart
Region B: ADD item='banana' to shopping_cart
After replication: shopping_cart = ['apple', 'banana'] (union of both)
Types of CRDTs:
• G-Counter: Only increments, sum all replicas
• PN-Counter: Increment and decrement
• G-Set: Add-only set (union)
• OR-Set: Add and remove with unique tags
• LWW-Register: Last-writer-wins for single values
Use when: Operations can be mathematically merged (counters, sets)
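A G-Counter, the simplest CRDT in the list above, fits in a few lines: each replica increments only its own slot, and merge takes the per-replica max, which makes merging commutative, associative, and idempotent. An illustrative sketch, not a production CRDT library:

```javascript
// G-Counter CRDT sketch: each replica increments only its own slot;
// merge takes the per-replica max, so concurrent increments never conflict
// and repeated merges are harmless.
class GCounter {
  constructor(replicaId) {
    this.replicaId = replicaId;
    this.counts = {}; // replicaId -> count
  }
  increment(by = 1) {
    this.counts[this.replicaId] = (this.counts[this.replicaId] || 0) + by;
  }
  value() {
    return Object.values(this.counts).reduce((a, b) => a + b, 0);
  }
  merge(other) {
    for (const [id, n] of Object.entries(other.counts)) {
      this.counts[id] = Math.max(this.counts[id] || 0, n);
    }
  }
}
```

A PN-Counter is just two G-Counters, one for increments and one for decrements; its value is their difference.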
3. Application-Level Resolution
Region A: UPDATE account SET balance=100 WHERE id=1
Region B: UPDATE account SET balance=80 WHERE id=1
Conflict detected! Options:
a) Prompt user to choose
b) Apply business rules (e.g., keep lower balance for safety)
c) Store both versions, resolve later
d) Reject one based on business priority
Use when: Business logic determines correct resolution (financial data)
4. Operational Transformation (OT)
Document: "Hello"
Region A: INSERT ' World' at position 5 → "Hello World"
Region B: INSERT '!' at position 5 → "Hello!"
OT transforms operations:
B's operation adjusted: INSERT '!' at position 11
Result: "Hello World!"
Use when: Collaborative editing (Google Docs-style)
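The transform step from the example above can be sketched for the insert-insert case: shift an operation right when a concurrent insert landed at or before its position. Real OT systems also need a deterministic tiebreak (e.g., site ID) for equal positions; this sketch follows the example, where A's operation wins the tie:

```javascript
// Minimal operational-transformation sketch for concurrent text inserts.
// applyInsert applies one insert; transformInsert rewrites an operation
// so it can be applied after a concurrent insert from another region.
function applyInsert(doc, op) {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

function transformInsert(op, against) {
  // The other insert landed at or before us: shift right by its length.
  if (against.pos <= op.pos) {
    return { pos: op.pos + against.text.length, text: op.text };
  }
  return op;
}
```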
Conflict Resolution in AWS Services
| Service | Conflict Resolution | Details |
|---|---|---|
| DynamoDB Global Tables | Last-Writer-Wins | Uses aws:rep:updatetime for ordering |
| Aurora Global Database | Single Writer (no conflicts) | Only primary accepts writes |
| S3 Cross-Region Replication | Last-Writer-Wins | Based on object timestamp |
| ElastiCache Global Datastore | Single Writer | Primary cluster only |
Designing for Conflict Avoidance
Best Strategy: Avoid conflicts by design rather than resolving them after the fact.
| Technique | How It Works | Example |
|---|---|---|
| Region affinity | Each record "belongs" to one region | US customers write to US, EU to EU |
| Partitioned writes | Different records updated in different regions | User profile in home region only |
| Append-only data | Never update, only insert new records | Event log, message history |
| Idempotent operations | Same operation applied twice = same result | SET status='delivered' (not INCREMENT counter) |
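The idempotency row is worth making concrete: replaying a SET leaves state unchanged, while replaying an INCREMENT does not, which is why duplicated or replayed replication events are safe only for the former. An illustrative sketch:

```javascript
// Idempotency sketch: applying the same SET twice yields the same state,
// while applying the same INCREMENT twice does not.
function applySet(state, field, value) {
  return { ...state, [field]: value };
}

function applyIncrement(state, field, by) {
  return { ...state, [field]: (state[field] || 0) + by };
}
```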
2-Minute Answer: "How do you handle conflicts in multi-region writes?"
"First, I try to avoid conflicts by design. Most data has natural region affinity—a customer's data is written in their home region. For data that truly needs multi-region writes, I choose the strategy based on the data type. For simple values like user preferences, last-writer-wins works fine—DynamoDB Global Tables does this automatically using timestamps. For additive data like shopping carts or message queues, I use CRDT-like structures where the merge operation is a union. For critical data like financial balances, I avoid multi-region writes entirely—I designate a single authoritative region and route writes there, accepting the latency cost for correctness. The key insight is that conflict resolution isn't one-size-fits-all. I classify data into tiers: tier 1 gets strong consistency with single-region writes; tier 2 gets LWW with async replication; tier 3 gets CRDTs for merge-friendly operations."
Traffic Routing & DNS
Routing Strategies
| Strategy | How It Works | Best For | AWS Service |
|---|---|---|---|
| Latency-based | Route to region with lowest latency | Global users, real-time apps | Route 53 Latency Routing |
| Geolocation | Route based on user's location | Data residency, compliance | Route 53 Geolocation |
| Weighted | Distribute traffic by percentage | Canary deploys, gradual migration | Route 53 Weighted |
| Failover | Primary/secondary with health checks | Disaster recovery | Route 53 Failover |
| Anycast | Same IP announced from multiple locations | DDoS protection, lowest latency | Global Accelerator |
DNS-Based Routing
┌──────────┐ 1. DNS Query ┌─────────────┐
│ Client │─────────────────────►│ Route 53 │
│ │ │ │
│ │◄─────────────────────│ (Latency │
└──────────┘ 2. IP of nearest │ Routing) │
│ region └─────────────┘
│ │
│ 3. HTTPS Request │ Health Checks
▼ ▼
┌──────────────┐ ┌──────────────┐
│ US-EAST-1 │ │ EU-WEST-1 │
│ (50ms RTT) │ │ (150ms RTT) │
└──────────────┘ └──────────────┘
TTL Considerations:
• Low TTL (60s): Fast failover, more DNS queries
• High TTL (300s): Fewer queries, slower failover
• Typical: 60-120s for active-active
AWS Global Accelerator vs Route 53
Route 53 (DNS)
- Returns IP address, client connects directly
- Failover limited by DNS TTL
- Client caching can delay failover
- Lower cost
- Works with any endpoint
Global Accelerator (Anycast)
- Static IPs, traffic routed through AWS backbone
- Instant failover (no DNS propagation)
- TCP/UDP termination at edge
- Higher cost, better performance
- DDoS protection built-in
Health Checks
Critical Design Decision: What does "healthy" mean for your application? A region might be up but degraded.
| Health Check Type | What It Tests | Trade-off |
|---|---|---|
| TCP | Can establish connection | Fast, but doesn't test the application |
| HTTP 200 | Endpoint returns 200 | Confirms the application is responding |
| String Match | Response contains expected string | Exercises application logic |
| Deep Health | Tests dependencies (DB, cache) | Most accurate, but risks cascading failures |
```javascript
// Deep health check endpoint (Express; check* helpers are app-specific)
app.get('/health/deep', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    cache: await checkRedis(),
    queue: await checkKafka(),
    downstream: await checkCriticalService()
  };
  const allHealthy = Object.values(checks).every(c => c.healthy);
  const status = allHealthy ? 200 : 503;
  res.status(status).json({
    status: allHealthy ? 'healthy' : 'degraded',
    checks,
    region: process.env.AWS_REGION,
    timestamp: new Date().toISOString()
  });
});
```
2-Minute Answer: "How do you route traffic in a multi-region setup?"
"I use a layered approach. At the DNS layer, Route 53 with latency-based routing directs users to the nearest healthy region. Health checks run every 10 seconds against deep health endpoints that verify database connectivity, not just that the server responds. For critical applications, I add Global Accelerator on top—it gives us anycast IPs so failover is instant, no DNS TTL to wait for. Traffic flows through AWS's backbone network, which is faster and more reliable than the public internet. Within each region, the Cell Router handles customer-to-cell assignment using our DynamoDB-backed routing table. For data residency requirements, I use geolocation routing to ensure EU users always hit EU infrastructure, with failover only to other EU regions. The key trade-off is cost versus failover speed: Route 53 alone is cheap but failover takes 60+ seconds due to TTL; Global Accelerator costs more but fails over in seconds."
Failure Scenarios & Mitigation
Failure Categories
| Failure Type | Scope | RTO Target | Mitigation |
|---|---|---|---|
| Single AZ failure | One datacenter | Automatic (seconds) | Multi-AZ deployment |
| Region failure | All AZs in region | < 5 minutes | Multi-region active-active |
| Global service failure | AWS service (IAM, Route 53) | Variable | Static credentials, cached DNS |
| Network partition | Inter-region connectivity | Automatic | Regional autonomy, async replication |
| Data corruption | All regions (if replicated) | Hours | Point-in-time recovery, backups |
Scenario 1: Complete Region Failure
BEFORE: Traffic split US-EAST-1 (60%) / EU-WEST-1 (40%)
EVENT: US-EAST-1 goes offline
Timeline:
T+0: Region fails
T+10s: Health checks detect failure (3 consecutive failures)
T+20s: Route 53 removes US-EAST-1 from DNS responses
T+60s: DNS TTL expires, clients get EU-WEST-1 only
T+120s: All traffic now to EU-WEST-1
AFTER: EU-WEST-1 (100%) - must handle 2.5x normal load
Key Questions:
• Can EU-WEST-1 scale to handle all traffic?
• Is data replicated enough that EU-WEST-1 has recent state?
• What happens when US-EAST-1 comes back?
Scenario 2: Network Partition (Split Brain)
The Nightmare Scenario: Regions can't communicate but both think they're the primary. Users in each region continue writing, creating divergent data.
BEFORE: US-EAST-1 ◄──replication──► EU-WEST-1
EVENT: Inter-region network fails
╳ (network partition)
US-EAST-1 ◄────────────────────────► EU-WEST-1
│ │
▼ ▼
Users write Users write
balance=100 balance=80
AFTER PARTITION HEALS:
Which balance is correct? Both regions have different values.
Mitigation Options:
1. Quorum writes (need majority of regions to write)
2. Leader election (only one region accepts writes)
3. Conflict resolution (LWW, merge, manual)
4. Accept partition as feature (regional autonomy)
Scenario 3: Cascading Failure
1. US-EAST-1 database becomes slow (not failed)
2. Health checks still pass (200 OK, just slow)
3. Requests queue up, timeouts increase
4. Circuit breakers don't trip (it's "working")
5. EU-WEST-1 overwhelmed when US users retry there
6. Both regions now degraded
Prevention:
• Latency-based health checks (fail if p99 > threshold)
• Adaptive load shedding
• Request quotas per region
• Circuit breakers with latency triggers
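The first prevention item can be sketched as a health evaluation that fails on tail latency, not just status codes, so a slow-but-up region is pulled from rotation before retries pile onto other regions. Thresholds and sample shapes are illustrative:

```javascript
// Latency-aware health evaluation: a region is unhealthy if its p99
// latency or error rate exceeds a limit, even if every request "works".
function evaluateHealth(samples, { p99LimitMs = 500, errorLimit = 0.05 } = {}) {
  const sorted = [...samples].sort((a, b) => a.latencyMs - b.latencyMs);
  const idx = Math.min(sorted.length - 1, Math.floor(sorted.length * 0.99));
  const p99 = sorted[idx].latencyMs;
  const errorRate = samples.filter(s => !s.ok).length / samples.length;
  return { healthy: p99 <= p99LimitMs && errorRate <= errorLimit, p99, errorRate };
}
```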
Recovery Playbook
| Phase | Actions | Validation |
|---|---|---|
| 1. Detect | Automated alerts, health check failures | Alert fires within 60 seconds |
| 2. Isolate | Remove failed region from routing | No new traffic to failed region |
| 3. Scale | Increase capacity in healthy regions | Latency/error rate stable |
| 4. Investigate | Root cause analysis | Understand failure mode |
| 5. Repair | Fix issue in failed region | Local validation passes |
| 6. Resync | Replay missed writes, reconcile data | Data consistency verified |
| 7. Restore | Gradually route traffic back (10%→50%→100%) | Error rate stays low |
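Phase 7's guarded ramp can be sketched as a tiny controller: advance the recovering region's weight 10% → 50% → 100% only while the observed error rate stays under a limit, and drop back to zero on a breach. Steps and threshold are illustrative:

```javascript
// Guarded traffic ramp for restoring a recovered region. Step -1 means
// the region is receiving no traffic; each healthy evaluation advances
// one step, and an error-rate breach drops it back out of rotation.
const RAMP_STEPS = [0.1, 0.5, 1.0];

function nextWeight(currentStep, errorRate, errorLimit = 0.01) {
  if (errorRate > errorLimit) return { step: -1, weight: 0 }; // back off
  const step = Math.min(currentStep + 1, RAMP_STEPS.length - 1);
  return { step, weight: RAMP_STEPS[step] };
}
```

In a real deployment the weight would be pushed to something like a Route 53 weighted record set between evaluations.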
2-Minute Answer: "What happens when a region fails?"
"Our health checks detect the failure within 10-30 seconds—we check every 10 seconds and require 2-3 consecutive failures to avoid false positives. Route 53 automatically removes the failed region from DNS responses. Existing connections drain; new connections go to healthy regions. The key design decision is whether healthy regions can absorb the load. We provision for N+1 capacity, meaning if we have 3 regions, each can handle 50% of total traffic—so losing one region still leaves us with 100% capacity. For data, we accept that async-replicated data might be slightly stale—RPO of a few seconds. Critical data uses DynamoDB Global Tables with sub-second replication. When the region recovers, we don't just flip traffic back on. We verify data consistency first, then gradually ramp traffic—10%, then 50%, then 100%—watching error rates at each step. The biggest risk isn't the failure itself; it's the recovery. Thundering herd when traffic shifts back has caused more outages than the original failures."
Multi-Region in Twilio's Cell Architecture
Regional Cell Deployment
┌─────────────────────────────────────────────────────────────────────────────────┐
│ GLOBAL LAYER │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Route 53 / Global Accelerator → Cell Router (Lambda@Edge) │ │
│ │ DynamoDB Global Tables (Identity, Customer→Cell mapping) │ │
│ │ Billing Aggregation Service │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘
│
┌───────────────────────────┼───────────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────────────┐ ┌───────────────────────┐ ┌───────────────────────┐
│ US-EAST-1 │ │ EU-WEST-1 │ │ AP-SOUTHEAST-1 │
│ │ │ │ │ │
│ ┌─────────────────┐ │ │ ┌─────────────────┐ │ │ ┌─────────────────┐ │
│ │ Enterprise-US-A │ │ │ │ Enterprise-EU-A │ │ │ │ Enterprise-AP-A │ │
│ │ (Customers 1-50)│ │ │ │ (Customers 1-50)│ │ │ │ (Customers 1-50)│ │
│ └─────────────────┘ │ │ └─────────────────┘ │ │ └─────────────────┘ │
│ │ │ │ │ │
│ ┌─────────────────┐ │ │ ┌─────────────────┐ │ │ ┌─────────────────┐ │
│ │ SMB-US-Pool │ │ │ │ SMB-EU-Pool │ │ │ │ SMB-AP-Pool │ │
│ │ (Customers 1K+) │ │ │ │ (Customers 1K+) │ │ │ │ (Customers 1K+) │ │
│ └─────────────────┘ │ │ └─────────────────┘ │ │ └─────────────────┘ │
│ │ │ │ │ │
│ Regional Resources: │ │ Regional Resources: │ │ Regional Resources: │
│ • Carrier connections│ │ • Carrier connections│ │ • Carrier connections│
│ • Media servers │ │ • Media servers │ │ • Media servers │
│ • Message queues │ │ • Message queues │ │ • Message queues │
└───────────────────────┘ └───────────────────────┘ └───────────────────────┘
Key Insight: Customers are region-homed, not distributed across regions
Data Distribution in Cell Architecture
| Data Type | Location | Replication | Rationale |
|---|---|---|---|
| Identity/API Keys | Global | DynamoDB Global Tables | Auth from any region |
| Customer→Cell mapping | Global | DynamoDB + Redis cache | Route from any edge |
| Customer data | Cell-local | Within-region only | Isolation, data residency |
| Message content | Cell-local | Backup to cold storage | Regional compliance |
| Active calls | Regional (ephemeral) | None | Real-time, latency-sensitive |
| Usage/Billing | Cell → Global | Async aggregation | Eventually consistent OK |
Customer Region Assignment
Design Choice: Customers are assigned to a HOME region. All their data lives in cells within that region. This simplifies data residency and avoids cross-region writes.
```javascript
// Customer → Region → Cell assignment record
const assignment = {
  customer_id: "AC123456",
  home_region: "eu-west-1",          // GDPR requires EU data residency
  cell_id: "enterprise-eu-west-1-a",
  tier: "enterprise",
  failover_region: "eu-central-1",   // Failover stays in EU
  created_at: "2024-01-15T10:00:00Z"
};

// Routing logic (getCustomerMapping and proxyToRegion are assumed helpers)
async function routeCustomer(customerId, requestRegion) {
  const customer = await getCustomerMapping(customerId);
  if (customer.home_region === requestRegion) {
    return customer.cell_id; // Local routing
  }
  // Cross-region request - proxy to home region
  return proxyToRegion(customer.home_region, customerId);
}
```
Cross-Region Communication
Scenario: US customer sends SMS to EU number
1. US Customer calls Twilio API (US edge)
2. Cell Router: Customer AC123 → Cell enterprise-us-east-1-a
3. US Cell processes request, needs EU carrier delivery
4. US Cell → Super Network (global carrier connections)
5. Super Network routes to EU carrier via local connection
6. Delivery status: EU → US Cell (async callback)
Key: Customer data stays in US cell
Only carrier routing crosses regions (via Super Network)
Regional Failover for Cells
| Scenario | Customer Impact | Failover Action |
|---|---|---|
| Single cell failure | 10-50 customers | Migrate to another cell in same region |
| Region failure (EU customer) | All EU customers | Failover to EU-CENTRAL-1 (stay in EU for GDPR) |
| Region failure (US customer) | All US customers | Failover to US-WEST-2 |
| Global routing failure | All customers | Fallback to static routing table at edge |
2-Minute Answer: "How does multi-region work with your cell architecture?"
"Customers are region-homed, not globally distributed. When a customer signs up, they're assigned to a home region based on their location and data residency requirements—EU customers go to EU regions for GDPR. Within that region, they're assigned to a specific cell. All their data lives in that cell. Global services—identity, routing, billing—use DynamoDB Global Tables and are available in all regions. So I can authenticate a customer from any edge location, look up their cell assignment, and route them to the correct region. If their home region fails, we failover to another region in the same geographic area—EU fails over to another EU region, not to US. This maintains data residency compliance. The Super Network is the one truly global component that crosses regions—carrier connections exist in each region, but the routing intelligence to pick the best carrier path is global. For an EU customer sending an SMS to a US number, the request processes in their EU cell, but the Super Network routes delivery through US carrier connections for best deliverability."
AWS Services Deep Dive
DynamoDB Global Tables
What it is: Multi-region, multi-master database with automatic replication.
| Feature | Details |
|---|---|
| Replication latency | Typically < 1 second |
| Conflict resolution | Last-writer-wins (using aws:rep:updatetime) |
| Consistency | Eventually consistent across regions; strongly consistent within region |
| Write capacity | Must provision in each region (writes replicate, consuming WCU) |
| Best for | Identity, routing tables, configuration, session data |
```javascript
// DynamoDB Global Tables conflict example
// Region A writes at t=100
await dynamodb.putItem({
  TableName: 'customer_settings',
  Item: {
    customer_id: { S: 'AC123' },
    theme: { S: 'dark' }
    // DynamoDB adds: aws:rep:updatetime = 100
  }
});

// Region B writes at t=101 (before replication)
await dynamodb.putItem({
  TableName: 'customer_settings',
  Item: {
    customer_id: { S: 'AC123' },
    theme: { S: 'light' }
    // DynamoDB adds: aws:rep:updatetime = 101
  }
});

// After replication: theme = 'light' (t=101 > t=100)
```
Aurora Global Database
What it is: Single-writer PostgreSQL/MySQL with read replicas across regions.
| Feature | Details |
|---|---|
| Replication | Storage-level, < 1 second lag typically |
| Writes | Primary region only (single writer) |
| Reads | Any region (with replication lag) |
| Failover | Promote secondary to primary (< 1 minute) |
| Write forwarding | Secondary can forward writes to primary (adds latency) |
| Best for | Transactional data, relational queries, complex joins |
S3 Cross-Region Replication
| Feature | Details |
|---|---|
| Replication | Asynchronous, typically minutes |
| Scope | Entire bucket, prefix, or tag-based |
| Versioning | Required (source and destination) |
| Delete markers | Optionally replicated |
| Best for | Media files, backups, static assets |
ElastiCache Global Datastore
| Feature | Details |
|---|---|
| Engine | Redis only (not Memcached) |
| Replication | Async, < 1 second typically |
| Writes | Primary cluster only |
| Failover | Promote secondary (< 1 minute) |
| Best for | Session cache, rate limiting, real-time leaderboards |
Service Comparison Matrix
| Requirement | DynamoDB Global | Aurora Global | ElastiCache Global |
|---|---|---|---|
| Multi-region writes | Yes (multi-master) | No (single writer) | No (single writer) |
| Strong consistency | Within region only | Within primary only | Within primary only |
| Replication lag | < 1 second | < 1 second | < 1 second |
| Conflict resolution | LWW (automatic) | N/A (single writer) | N/A (single writer) |
| Query flexibility | Limited (key-value) | Full SQL | Limited (key-value) |
| Cost model | Per request + storage | Instance + storage | Instance-based |
2-Minute Answer: "Which AWS database for multi-region?"
"It depends on the access pattern. For truly global data that needs multi-region writes—like identity, routing tables, or session data—I use DynamoDB Global Tables. It's multi-master, so any region can write, and conflicts resolve automatically with last-writer-wins. Replication is under a second. For transactional workloads with complex queries—like order processing or financial data—I use Aurora Global Database. It's single-writer, which means no conflicts, but writes must go to the primary region. Secondary regions get read replicas with sub-second lag. For caching, ElastiCache Global Datastore with Redis, again single-writer but incredibly fast reads from any region. The pattern I use: DynamoDB Global Tables for the control plane—who is this customer, which cell are they in. Aurora for the data plane—what are their messages, call records. ElastiCache for hot path acceleration—caching the DynamoDB lookups that happen on every request."
Interview Q&A
Design Questions
Q: "Design a multi-region active-active system for Twilio's SMS API"
Framework:
- Clarify requirements: Latency target? Data residency? RPO/RTO?
- Identify data types: What's global vs regional?
- Choose replication strategy: Multi-master vs single-writer per data type
- Design routing: How do requests reach the right region?
- Handle failures: What happens when a region goes down?
Q: "How would you handle a region failure during peak traffic?"
A: "First, the failure detection. Health checks run every 10 seconds; after 3 failures, Route 53 removes the region. During the DNS TTL window—60 seconds—some traffic still tries the failed region, so we need the load balancer to return 503 fast, not timeout. The surviving regions need capacity headroom. We provision N+1, so losing one of three regions leaves us with 100% capacity. Auto-scaling kicks in, but there's a lag, so baseline capacity matters. For data, anything in DynamoDB Global Tables is already in the surviving regions. Aurora needs a failover—promote the read replica to primary, which takes under a minute. The risk is the recovery. When the failed region comes back, we don't just flip traffic back. We verify data consistency, warm up caches, then gradually ramp from 10% to 50% to 100%, watching error rates. Thundering herd on recovery has caused more outages than the original failures."
Q: "How do you test multi-region failover?"
A: "We do three types of testing. First, synthetic failovers in staging—we literally turn off a region and watch what happens. This catches configuration issues but doesn't test production load. Second, chaos engineering in production—we use AWS Fault Injection Simulator to degrade regions gradually. Start with one AZ, then whole region, measuring impact at each step. Third, game days—scheduled exercises where we simulate region failure with the on-call team responding as if it's real. The game day reveals process gaps, not just technical ones. Does the runbook work? Can we actually reach the people we need? We also test recovery, which is often neglected. Bringing a region back online after extended outage—with stale caches, lagged data—is harder than the initial failover."
Q: "What's the hardest part of multi-region active-active?"
A: "Conflict resolution and the cognitive overhead it creates. With single-writer, the mental model is simple—there's one source of truth. With multi-master, every piece of data might have conflicts you haven't thought of. We had a bug where two regions both thought they were updating 'last_login' timestamp—harmless, right? But downstream analytics depended on that field being monotonically increasing for session tracking. Last-writer-wins caused sessions to appear to go backwards in time. The second hardest part is testing. You can't easily simulate network partitions between regions in a realistic way. And the failure modes are combinatorial—region A fails, or B fails, or the link between them fails, or A is slow but not down. Each scenario needs different handling. That's why I prefer regional affinity where possible—customer data lives in one region, reducing the surface area for conflicts."
Q: "How do you ensure data consistency across regions?"
A: "I don't try to guarantee global strong consistency—that requires synchronous replication which kills latency and availability. Instead, I design for the consistency level each data type actually needs. For identity and routing, I use DynamoDB Global Tables with eventual consistency—writes replicate in under a second, and for a routing table lookup, reading a slightly stale cell assignment is fine because the cell will redirect if wrong. For user-facing data where they need to see their own writes, I use read-your-writes consistency—route reads to the same region as their last write for that session. For financial data where inconsistency means real problems, I avoid multi-region writes entirely. One region is authoritative; other regions proxy writes there. Yes, it adds latency for users far from that region, but correctness matters more than speed for money. The key insight is that 'consistency' isn't binary—it's a spectrum, and different data has different needs."
Quick Reference: Key Numbers
| Metric | Target | Why |
|---|---|---|
| Cross-region latency | 50-150ms | Physical distance + network hops |
| DynamoDB Global Tables replication | < 1 second | Typical async replication lag |
| Aurora Global failover | < 1 minute | Promote secondary to primary |
| DNS TTL | 60-120 seconds | Balance failover speed vs query volume |
| Health check interval | 10-30 seconds | Fast detection vs false positives |
| Capacity headroom | N+1 regions | Survive one region failure |
| RPO (Recovery Point Objective) | < 1 minute | Max data loss acceptable |
| RTO (Recovery Time Objective) | < 5 minutes | Max downtime acceptable |
Key Takeaways
For Twilio DA Interview
- Active-active for availability and latency, not just DR
- Classify data by consistency needs
- DynamoDB Global Tables for global data (LWW)
- Aurora Global for transactional, single-writer
- Regional affinity avoids conflicts by design
- Test failover AND recovery
Trade-offs to Articulate
- Consistency vs availability (CAP)
- Multi-master vs single-writer complexity
- Sync replication latency vs async data loss
- DNS failover time vs Global Accelerator cost
- Regional autonomy vs global consistency
- N+1 capacity cost vs failover capability
Your Narrative: "Multi-region active-active isn't just about disaster recovery—it's about latency and availability for a global communications platform. I design with regional affinity to avoid conflicts, use DynamoDB Global Tables for truly global data, and accept eventual consistency where it's safe. The hardest part isn't the technology; it's the operational discipline—testing failover regularly, provisioning N+1 capacity, and having runbooks for scenarios you hope never happen."