๐ŸŽฏ Cell Routing & Assignment

DynamoDB-based Customer Routing & Least-Loaded Cell Assignment

Algorithm Overview

๐ŸŽฏ Goals of Cell Routing

  1. Stable Assignment: Customers stay in the same cell (DynamoDB is source of truth)
  2. Optimal Load Distribution: Assign new customers to least-loaded cells using real metrics
  3. Fast Lookups: Sub-10ms lookup time via Redis cache (95% hit rate)
  4. Segment Awareness: Route Enterprise to Enterprise cells, SMB to SMB cells
  5. Geographic Affinity: Route customers to geographically closest cells
  6. Controlled Migration: Explicit rebalancing via Control Plane when needed

โœ… Why This Approach is Better Than Consistent Hashing

Key Insight: If you're using DynamoDB as the source of truth and already doing lookups, consistent hashing adds complexity without benefit.

  • Simpler: No hash ring to maintain, version, or synchronize across Cell Routers
  • Better Load Distribution: Assign to least-loaded cell based on actual CPU/customer count, not hash distribution
  • Same Performance: DynamoDB lookup is required either way (for existing customers)
  • Controlled Migration: Control Plane explicitly chooses which customers to migrate, when, and where
  • Easier to Debug: Direct database queries show exactly where each customer is assigned

Consistent hashing is valuable for stateless distributed caches or database sharding where the hash IS the source of truth. But when you have a database, use it!

Architecture Components

[Diagram: Cell routing request flow]
API request (customer_id: 12345, region: us-east-1) → Cell Router Service:
  1. Check cache (ElastiCache)
  2. Lookup DynamoDB if miss
  3. Assign to least-loaded cell (new customers)
  4. Set X-Twilio-Cell-ID header
  5. Route via VPC Lattice
ElastiCache (Redis) caches customer → cell with a 1-hour TTL. On a cache miss, the DynamoDB Global Table customer_cell_mapping (PK: customer_id, SK: region; attributes: cell_id, segment) is the source of truth. Example assignment: customer_12345 → Enterprise Cell B (region us-east-1, segment Enterprise) → EKS cluster enterprise-us-east-1-b, namespace customer-12345. VPC Lattice routes the forwarded request to the correct EKS cluster → namespace → pod. Latency: ~5ms (cache hit) or ~15ms (cache miss).

Data Model: DynamoDB (Workflow-Affinity Routing)

// Table: customer_cell_mapping
// Global table replicated across US, EU, APAC
// Key includes service_category for workflow-affinity routing
{
  "customer_id": "12345",                       // Partition Key
  "region#category": "us-east-1#messaging",     // Composite Sort Key
  "cell_id": "messaging-enterprise-us-001",     // Assigned cell (workflow-specific)
  "service_category": "messaging",              // messaging | realtime | async | verify
  "region": "us-east-1",                        // Denormalized for queries
  "segment": "enterprise",                      // Customer segment
  "assigned_at": "2025-01-15T10:30:00Z",        // When assigned
  "last_accessed": "2025-01-20T14:22:11Z",      // For cold customer detection
  "api_calls_30d": 1500000,                     // Usage metrics for rebalancing
  "monthly_spend": 75000.00,                    // For segment reclassification
  "is_migrating": false,                        // Flag during cell migration
  "version": 5                                  // Optimistic locking
}

// Same customer, multiple cells (one per workflow category):
// { customer_id: "12345", region#category: "us-east-1#messaging", cell_id: "messaging-enterprise-us-001" }
// { customer_id: "12345", region#category: "us-east-1#realtime",  cell_id: "realtime-enterprise-us-001" }
// { customer_id: "12345", region#category: "us-east-1#verify",    cell_id: "verify-enterprise-us-001" }

// GSI-1: cell_id (to list all customers in a cell)
// GSI-2: segment + region + category (to find cells by segment/category)

Cell Routing Algorithm

๐ŸŽฏ Core Principle: Workflow-Affinity Routing

Every customer ร— service_category โ†’ cell assignment is stored in DynamoDB Global Tables. Same customer can have different cells for different workflow categories (messaging, realtime, verify, async).

Why? Services with different scaling characteristics (bursty SMS vs sustained Voice) can scale independently while services that call each other (SMS + WhatsApp in fallback chains) stay co-located.

Complete Routing Flow

Step-by-Step Routing Decision

Step 1: Extract Customer Context
  • Parse customer_id from JWT token or API key
  • Determine region from request origin (Route 53 geolocation)
  • Extract service_category from API path:
    • /v1/sms, /v1/mms, /v1/whatsapp โ†’ messaging
    • /v1/voice, /v1/video โ†’ realtime
    • /v1/verify, /v1/lookup โ†’ verify
    • /v1/email, /v1/fax โ†’ async
Step 2: Check Cache (95% of requests end here!)
  • Redis key: cell:{customer_id}:{region}:{category}
  • If HIT: Return cached cell_id (~5ms latency)
  • If MISS: Proceed to Step 3
Step 3: Lookup in DynamoDB
  • Query: SELECT cell_id FROM customer_cell_mapping WHERE customer_id = ? AND region#category = ?
  • If found: Cache in Redis (TTL: 1 hour), return cell_id (~15ms latency)
  • If not found (new customer or new category): Proceed to Step 4
Step 4: Assign New Customer/Category (only for first request per category)
  • Determine customer segment (from signup plan or API tier)
  • Call getLeastLoadedCell(region, segment, category) - see Cell Assignment section
  • Write to DynamoDB with conditional expression: attribute_not_exists(customer_id) AND attribute_not_exists(region#category)
  • Cache in Redis (TTL: 1 hour)
  • Return cell_id (~20ms latency)
Step 5: Set Routing Header & Forward
  • Set header: X-Twilio-Cell-ID: enterprise-us-east-1-a
  • VPC Lattice reads header and routes to correct cell's service endpoint
  • Request arrives at cell's ALB โ†’ EKS cluster โ†’ Kubernetes pod
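The decision logic above can be sketched as plain functions. This is a minimal sketch: CATEGORY_MAP mirrors the path table in Step 1, while extractCategory, cacheKey, and routingHeader are illustrative helper names, not part of the real service.

```javascript
const CATEGORY_MAP = {
  '/v1/sms': 'messaging', '/v1/mms': 'messaging', '/v1/whatsapp': 'messaging',
  '/v1/voice': 'realtime', '/v1/video': 'realtime',
  '/v1/verify': 'verify', '/v1/lookup': 'verify',
  '/v1/email': 'async', '/v1/fax': 'async',
};

// Step 1: derive the service category from the API path prefix
function extractCategory(apiPath) {
  const prefix = Object.keys(CATEGORY_MAP).find(p => apiPath.startsWith(p));
  if (!prefix) throw new Error(`Unknown API path: ${apiPath}`);
  return CATEGORY_MAP[prefix];
}

// Step 2: cache key shape used for the Redis lookup
function cacheKey(customerId, region, category) {
  return `cell:${customerId}:${region}:${category}`;
}

// Step 5: header VPC Lattice uses to pick the cell
function routingHeader(cellId) {
  return { 'X-Twilio-Cell-ID': cellId };
}

console.log(extractCategory('/v1/sms/Messages'));          // messaging
console.log(cacheKey('12345', 'us-east-1', 'messaging'));  // cell:12345:us-east-1:messaging
```

The cache key embeds the category, so the same customer can resolve to different cells per workflow without any cross-talk.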

Routing Service Implementation (Lambda)

import { DynamoDBClient, GetItemCommand, PutItemCommand } from "@aws-sdk/client-dynamodb"
import { createClient } from "redis"

const dynamodb = new DynamoDBClient({ region: process.env.AWS_REGION })
const redis = createClient({ url: process.env.REDIS_URL })

export async function routeToCell(customerId, region, segment, serviceCategory) {
  // Step 1: Check cache (95% of requests end here)
  // Key includes service_category for workflow-affinity routing
  const cacheKey = `cell:${customerId}:${region}:${serviceCategory}`
  let cellId = await redis.get(cacheKey)
  if (cellId) {
    console.log(`Cache HIT: ${cacheKey} → ${cellId}`)
    return { cellId, source: 'cache', latency: '~5ms' }
  }

  // Step 2: Check DynamoDB (composite sort key: region#category)
  const sortKey = `${region}#${serviceCategory}`
  const dbResult = await dynamodb.send(new GetItemCommand({
    TableName: 'customer_cell_mapping',
    Key: {
      customer_id: { S: customerId },
      'region#category': { S: sortKey }
    }
  }))
  if (dbResult.Item) {
    cellId = dbResult.Item.cell_id.S
    // Cache for 1 hour
    await redis.setEx(cacheKey, 3600, cellId)
    console.log(`DB HIT: ${customerId}:${serviceCategory} → ${cellId}`)
    return { cellId, source: 'database', latency: '~15ms' }
  }

  // Step 3: New customer or new category - assign to least-loaded cell
  // Same customer can have different cells for different workflow categories
  cellId = await assignToLeastLoadedCell(customerId, region, segment, serviceCategory)
  await redis.setEx(cacheKey, 3600, cellId)
  console.log(`NEW ASSIGNMENT: ${customerId}:${serviceCategory} → ${cellId}`)
  return { cellId, source: 'new_assignment', latency: '~20ms' }
}

Performance Metrics

Cache Hit Rate

95%

Redis cache hit rate

Cache Hit Latency

~5ms

Redis lookup time

Cache Miss Latency

~15ms

DynamoDB lookup

New Customer

~20ms

Assignment + DDB write

Cell Assignment Algorithm

๐ŸŽฏ Goal: Assign to Least-Loaded Workflow-Affinity Cell

When a customer first uses a service category, assign them to the cell with the most available capacity for that workflow type in their segment and region.

Same customer can have different cells for different service categories (messaging, realtime, verify, async).

Least-Loaded Assignment Algorithm

// Requires QueryCommand, PutItemCommand, and UpdateItemCommand from @aws-sdk/client-dynamodb
async function assignToLeastLoadedCell(customerId, region, segment, serviceCategory) {
  // Step 1: Get all active cells for this segment + region + category
  // Cells are workflow-specific: messaging cells, realtime cells, etc.
  const cells = await dynamodb.send(new QueryCommand({
    TableName: 'cell_metadata',
    IndexName: 'category-segment-region-index',
    KeyConditionExpression: 'category = :cat AND segment_region = :segReg',
    FilterExpression: '#status = :active',
    ExpressionAttributeNames: { '#status': 'status' },
    ExpressionAttributeValues: {
      ':cat': { S: serviceCategory },
      ':segReg': { S: `${segment}#${region}` },
      ':active': { S: 'active' }
    }
  }))

  if (cells.Items.length === 0) {
    throw new Error(`No active ${serviceCategory} cells for segment=${segment}, region=${region}`)
  }

  // Step 2: Find least-loaded cell
  // Sort by load score; DynamoDB returns numbers as strings, so coerce before arithmetic
  const leastLoaded = cells.Items
    .map(item => {
      const customers = Number(item.current_customers?.N ?? 0)
      const load = Number(item.load_metric?.N ?? 0)  // Category-specific: queue_depth, concurrent_connections, etc.
      const maxCustomers = Number(item.max_customers?.N ?? 100)
      return {
        id: item.cell_id.S,
        customers,
        maxCustomers,
        // Load score (0-200): utilization percentage plus category-specific metric
        loadScore: (customers / maxCustomers) * 100 + load
      }
    })
    .sort((a, b) => a.loadScore - b.loadScore)[0]

  console.log(`Selected ${serviceCategory} cell: ${leastLoaded.id} (${leastLoaded.customers}/${leastLoaded.maxCustomers} customers)`)

  // Step 3: Persist assignment with conditional write
  // Composite sort key: region#category (same customer can have multiple cells)
  // On ConditionalCheckFailedException, another router won the race; re-read and use its cell_id
  const sortKey = `${region}#${serviceCategory}`
  await dynamodb.send(new PutItemCommand({
    TableName: 'customer_cell_mapping',
    Item: {
      customer_id: { S: customerId },
      'region#category': { S: sortKey },
      cell_id: { S: leastLoaded.id },
      service_category: { S: serviceCategory },
      region: { S: region },
      segment: { S: segment },
      assigned_at: { S: new Date().toISOString() },
      version: { N: '1' }
    },
    ConditionExpression: 'attribute_not_exists(customer_id) AND attribute_not_exists(#sk)',
    ExpressionAttributeNames: { '#sk': 'region#category' }
  }))

  // Step 4: Update cell metadata (increment customer count)
  await dynamodb.send(new UpdateItemCommand({
    TableName: 'cell_metadata',
    Key: { cell_id: { S: leastLoaded.id } },
    UpdateExpression: 'ADD current_customers :one',
    ExpressionAttributeValues: { ':one': { N: '1' } }
  }))

  return leastLoaded.id
}

Load Balancing Comparison

Hash-Based Assignment

Method: hash(customer_id) % num_cells

Distribution:

  • Pseudo-random based on hash function
  • No awareness of actual cell load
  • Can assign to overloaded cells

Adding cells: Requires remapping existing customers

Least-Loaded Assignment RECOMMENDED

Method: Query cell metrics, pick lowest load

Distribution:

  • โœ… Based on real metrics (CPU, customer count)
  • โœ… Avoids overloaded cells
  • โœ… Adapts to heterogeneous cell sizes

Adding cells: New customers automatically go to new cell

Rebalancing: Explicit migration by Control Plane when needed

Race Condition Handling

What if two Cell Routers assign the same customer simultaneously?

Scenario: Customer makes first API call โ†’ Load balancer sends to 2 Cell Router instances simultaneously

Solution: DynamoDB Conditional Write

ConditionExpression: 'attribute_not_exists(customer_id)'

First write succeeds, second write gets ConditionalCheckFailedException

Second Cell Router catches exception, re-queries DynamoDB, uses the cell_id from first write

Result: Atomic assignment, no duplicate customers in different cells!
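The race can be modeled in a few lines. Here a Map stands in for DynamoDB and conditionalPut mimics attribute_not_exists semantics — a sketch, not the real SDK call.

```javascript
class MappingTable {
  constructor() { this.items = new Map(); }
  // Mimics PutItem with ConditionExpression: attribute_not_exists(customer_id)
  conditionalPut(key, cellId) {
    if (this.items.has(key)) {
      const err = new Error('The conditional request failed');
      err.name = 'ConditionalCheckFailedException';
      throw err;
    }
    this.items.set(key, cellId);
  }
  get(key) { return this.items.get(key); }
}

function assignWithRaceHandling(table, key, proposedCellId) {
  try {
    table.conditionalPut(key, proposedCellId);
    return table.get(key);                    // our write won the race
  } catch (err) {
    if (err.name !== 'ConditionalCheckFailedException') throw err;
    return table.get(key);                    // lost: re-read and adopt the winner
  }
}

// Two routers race to assign the same customer to different cells
const table = new MappingTable();
const a = assignWithRaceHandling(table, '12345:us-east-1#messaging', 'cell-a');
const b = assignWithRaceHandling(table, '12345:us-east-1#messaging', 'cell-b');
console.log(a, b); // cell-a cell-a — both converge on the first write
```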

Alternative: Consistent Hashing (Background Knowledge)

โš ๏ธ When to Use Consistent Hashing

Consistent hashing is valuable in specific scenarios:

  • Stateless distributed caches: Memcached/Redis clusters where hash function IS the source of truth
  • Database sharding: Cassandra-style partitioning where no central lookup table exists
  • Minimal state scenarios: When storing every mapping in a database is impractical

NOT recommended when: You already have a database (like DynamoDB) storing mappings!

Consistent Hashing Algorithm (Reference)

Why Consistent Hashing?

Problem: Simple hash modulo (e.g., hash(customer_id) % num_cells) causes massive redistribution when cells are added/removed.

Example: With 10 cells, adding 1 cell (11 total) reassigns ~91% of customers!

Solution: Consistent hashing minimizes redistribution to K/N where K = keys, N = nodes. Adding a cell only redistributes ~1/N customers.

How Consistent Hashing Works

[Diagram: Consistent hash ring over the 2^32 space]
  • Cell A and Cell B (Enterprise) and Cell C (Mid-Market) each place 3 virtual nodes on the ring (e.g., Cell A-v1 … Cell A-v3)
  • A customer hashes to a position on the ring and is assigned to the next virtual node clockwise

Algorithm Pseudocode

class ConsistentHashRing {
  constructor(cells, virtualNodesPerCell = 150) {
    this.ring = new SortedMap()   // Hash → Cell mapping
    this.cells = cells
    this.virtualNodesPerCell = virtualNodesPerCell

    // Add virtual nodes for each cell
    for (cell of cells) {
      for (i = 0; i < virtualNodesPerCell; i++) {
        // Create virtual node identifier
        vnodeKey = `${cell.id}:vnode:${i}`
        hash = sha256(vnodeKey)        // 32-byte hash
        hashInt = hash.toUInt64()      // Convert to 64-bit integer
        this.ring.put(hashInt, cell)
      }
    }
  }

  getCell(customerId, segment) {
    // Hash customer ID
    customerHash = sha256(`customer:${customerId}:${segment}`)
    hashInt = customerHash.toUInt64()

    // Find next cell clockwise on ring (ceiling lookup)
    entry = this.ring.ceilingEntry(hashInt)

    // Wrap around if at end of ring
    if (!entry) {
      entry = this.ring.firstEntry()
    }
    return entry.value   // Returns the cell
  }

  addCell(newCell) {
    // Add virtual nodes for new cell
    affectedCustomers = []
    for (i = 0; i < this.virtualNodesPerCell; i++) {
      vnodeKey = `${newCell.id}:vnode:${i}`
      hash = sha256(vnodeKey).toUInt64()

      // Find what was previously at this position
      prevEntry = this.ring.floorEntry(hash)

      // Insert new virtual node
      this.ring.put(hash, newCell)

      // Customers between prevEntry and this hash need migration
      if (prevEntry) {
        affectedCustomers.push({
          rangeStart: prevEntry.key,
          rangeEnd: hash,
          fromCell: prevEntry.value,
          toCell: newCell
        })
      }
    }
    return affectedCustomers   // ~1/N customers affected
  }

  removeCell(cellToRemove) {
    // Remove all virtual nodes for this cell
    affectedCustomers = []
    for (i = 0; i < this.virtualNodesPerCell; i++) {
      vnodeKey = `${cellToRemove.id}:vnode:${i}`
      hash = sha256(vnodeKey).toUInt64()

      // Find next cell (where customers will move)
      nextEntry = this.ring.higherEntry(hash)
      if (!nextEntry) nextEntry = this.ring.firstEntry()

      // Remove this virtual node
      this.ring.remove(hash)
      affectedCustomers.push({ fromCell: cellToRemove, toCell: nextEntry.value })
    }
    return affectedCustomers
  }
}

Virtual Nodes: Why 150?

Few Virtual Nodes (e.g., 3)

Pros:

  • Lower memory usage
  • Faster ring operations

Cons:

  • โŒ Uneven distribution
  • โŒ Large gaps between nodes
  • โŒ Some cells get 2x load

150 Virtual Nodes RECOMMENDED

Pros:

  • โœ… Excellent load distribution (ยฑ1-2%)
  • โœ… Smooth customer redistribution
  • โœ… Industry standard (Cassandra uses 256)

Cons:

  • Moderate memory (150 entries ร— cells)

Math: 10 cells ร— 150 vnodes = 1,500 ring entries (negligible memory)

Many Virtual Nodes (e.g., 1000)

Pros:

  • Perfect distribution

Cons:

  • โŒ Overkill for most systems
  • โŒ Slower lookups
  • โŒ Higher memory usage
  • โŒ Diminishing returns

Routing Algorithm Implementation (Workflow-Affinity)

Complete Routing Flow

Step-by-Step Routing Decision

Step 1: Extract Customer Context
  • Parse customer_id from JWT token or API key
  • Determine region from request origin (Route 53 geolocation)
  • Extract service_category from API path:
    • /v1/sms, /v1/mms, /v1/whatsapp โ†’ messaging
    • /v1/voice, /v1/video โ†’ realtime
    • /v1/verify, /v1/lookup โ†’ verify
    • /v1/email, /v1/fax โ†’ async
Step 2: Determine Customer Segment
  • Check ElastiCache for cached segment: GET customer:{customer_id}:segment
  • If miss, query DynamoDB: customer_cell_mapping table
  • If new customer, classify based on signup plan (free โ†’ SMB, paid โ†’ Mid-Market, enterprise contract โ†’ Enterprise)
  • Cache result with 1-hour TTL
Step 3: Check Cell Assignment Cache
  • Redis key: cell:{customer_id}:{region}:{category}
  • If HIT: Return cached cell_id (95% of requests)
  • If MISS: Proceed to Step 4
Step 4: Lookup or Assign Cell
  • Query DynamoDB: customer_cell_mapping with PK=customer_id, SK=region#category
  • If exists: Return cell_id, cache in Redis (TTL: 1 hour)
  • If not exists (new customer or first use of this category): Assign to least-loaded cell
  • Query cells for segment + region + category
  • Select cell with lowest load score (customer count + category-specific metric)
Step 5: Persist Assignment
  • Write to DynamoDB: customer_cell_mapping table
  • Attributes: customer_id, region#category, cell_id, service_category, segment, assigned_at
  • Cache in Redis with 1-hour TTL
  • Emit CloudWatch metric: NewCustomerCategoryAssignment
Step 6: Route Request to Cell
  • Update request header: X-Twilio-Cell-ID: enterprise-us-east-1-b
  • VPC Lattice routes to correct EKS cluster
  • Service Discovery resolves cell DNS: enterprise-us-east-1-b.internal
  • ALB forwards to target group (EKS worker nodes)
  • Kubernetes ingress routes to namespace: customer-12345

Routing Service Implementation (AWS Lambda)

import { DynamoDBClient, GetItemCommand, PutItemCommand, QueryCommand } from "@aws-sdk/client-dynamodb"
import { createClient } from "redis"

const dynamodb = new DynamoDBClient({ region: process.env.AWS_REGION })
const redis = createClient({ url: process.env.REDIS_URL })

// Service category mapping from API path
const CATEGORY_MAP = {
  '/v1/sms': 'messaging', '/v1/mms': 'messaging', '/v1/whatsapp': 'messaging',
  '/v1/voice': 'realtime', '/v1/video': 'realtime',
  '/v1/verify': 'verify', '/v1/lookup': 'verify',
  '/v1/email': 'async', '/v1/fax': 'async'
}

export async function routeToCell(customerId, region, segment, serviceCategory) {
  // Step 1: Check cache (includes service category)
  const cacheKey = `cell:${customerId}:${region}:${serviceCategory}`
  let cellId = await redis.get(cacheKey)
  if (cellId) {
    console.log(`Cache HIT: ${cacheKey} → ${cellId}`)
    return { cellId, source: 'cache' }
  }

  // Step 2: Check DynamoDB (composite sort key: region#category)
  const sortKey = `${region}#${serviceCategory}`
  const dbResult = await dynamodb.send(new GetItemCommand({
    TableName: 'customer_cell_mapping',
    Key: {
      customer_id: { S: customerId },
      'region#category': { S: sortKey }
    }
  }))
  if (dbResult.Item) {
    cellId = dbResult.Item.cell_id.S
    await redis.setEx(cacheKey, 3600, cellId)
    console.log(`DB HIT: ${customerId}:${serviceCategory} → ${cellId}`)
    return { cellId, source: 'database' }
  }

  // Step 3: New assignment - find least-loaded cell for this category
  cellId = await assignToLeastLoadedCell(customerId, region, segment, serviceCategory)
  await redis.setEx(cacheKey, 3600, cellId)
  console.log(`NEW ASSIGNMENT: ${customerId}:${serviceCategory} → ${cellId}`)
  return { cellId, source: 'least_loaded' }
}

async function assignToLeastLoadedCell(customerId, region, segment, category) {
  // Query cells for this category + segment + region
  const cells = await dynamodb.send(new QueryCommand({
    TableName: 'cell_metadata',
    IndexName: 'category-segment-region-index',
    KeyConditionExpression: 'category = :cat AND segment_region = :segReg',
    ExpressionAttributeValues: {
      ':cat': { S: category },
      ':segReg': { S: `${segment}#${region}` }
    }
  }))
  if (!cells.Items || cells.Items.length === 0) {
    throw new Error(`No cells for category=${category}, segment=${segment}, region=${region}`)
  }

  // Find least-loaded cell (DynamoDB numbers arrive as strings; coerce before sorting)
  const leastLoaded = cells.Items
    .map(item => ({ id: item.cell_id.S, load: Number(item.load_score.N) }))
    .sort((a, b) => a.load - b.load)[0]

  // Persist assignment; the conditional write guards against concurrent routers
  // (on ConditionalCheckFailedException, re-read the item and use the winner's cell_id)
  await dynamodb.send(new PutItemCommand({
    TableName: 'customer_cell_mapping',
    Item: {
      customer_id: { S: customerId },
      'region#category': { S: `${region}#${category}` },
      cell_id: { S: leastLoaded.id },
      service_category: { S: category }
    },
    ConditionExpression: 'attribute_not_exists(customer_id)'
  }))
  return leastLoaded.id
}


Cell Rebalancing Strategy

When to Rebalance?

  1. New Cell Added: Capacity expansion, new region launch
  2. Cell Removed: Decommissioning, failure, cost optimization
  3. Cell Overloaded: CPU > 80%, request rate > threshold
  4. Customer Reclassification: SMB โ†’ Mid-Market โ†’ Enterprise promotion
  5. Scheduled Optimization: Quarterly rebalancing to optimize cost/performance

Rebalancing Strategies

Strategy | Trigger | Scope | Downtime | Duration
Automatic (Consistent Hash) | New cell added/removed | ~1/N customers | Zero (dual-write) | Minutes to hours
Manual Migration | Customer upgrade (SMB→Enterprise) | Single customer | Zero (blue/green) | Seconds
Gradual Drain | Cell decommissioning | All customers in cell | Zero (redirect) | Days (batched)
Load-Based | Cell CPU > 80% | Hottest customers | Zero | Minutes
Scheduled Optimization | Quarterly maintenance | Imbalanced customers | Zero | Hours (off-peak)

Zero-Downtime Migration Process

Zero-Downtime Cell Migration

Phase 1: Dual Write (5 minutes)
  1. Update DynamoDB: set is_migrating=true, target_cell=new_cell
  2. Cell Router writes to BOTH old and new cells
  3. Reads still served from old cell
Phase 2: Background Data Sync (variable)
  1. Launch background job to copy customer data:
    • DynamoDB Streams → Lambda → copy to new cell's namespace
    • S3 data (recordings, attachments) → copy to new region (if applicable)
    • Aurora data (if customer-specific) → snapshot + restore
  2. Verify data integrity (checksums, row counts)
Phase 3: Cutover (instant)
  1. Update DynamoDB: cell_id=new_cell, increment version
  2. Invalidate Redis cache: DEL cell:{customer_id}:*
  3. Next request routes to new cell (reads from new cell)
Phase 4: Cleanup (hours later)
  1. Monitor new cell for errors (roll back if needed)
  2. After 24 hours: delete customer data from old cell

Timeline: T=0 start → T+5min data sync → T+30min cutover → T+24hrs cleanup

🔑 Key Insight: Dual-write ensures zero data loss; atomic cutover ensures zero downtime. Total customer-visible impact: 0ms (transparent migration).
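The phase mechanics can be modeled in miniature — Maps stand in for the cell data stores, and the class names are purely illustrative.

```javascript
class CellStore {
  constructor() { this.data = new Map(); }
}

class MigratingCustomer {
  constructor(oldCell, newCell) {
    this.oldCell = oldCell;
    this.newCell = newCell;
    this.phase = 'dual-write';               // Phase 1 starts immediately
  }
  write(key, value) {
    if (this.phase === 'dual-write') {
      this.oldCell.data.set(key, value);     // old cell stays authoritative
      this.newCell.data.set(key, value);     // mirror into the target cell
    } else {
      this.newCell.data.set(key, value);     // after cutover, only the new cell
    }
  }
  backfill() {                               // Phase 2: copy pre-migration data
    for (const [k, v] of this.oldCell.data) {
      if (!this.newCell.data.has(k)) this.newCell.data.set(k, v);
    }
  }
  cutover() { this.phase = 'cutover'; }      // Phase 3: atomic pointer flip
  read(key) {
    return (this.phase === 'cutover' ? this.newCell : this.oldCell).data.get(key);
  }
}

const m = new MigratingCustomer(new CellStore(), new CellStore());
m.oldCell.data.set('msg-1', 'pre-migration');   // existed before migration began
m.write('msg-2', 'during-dual-write');          // Phase 1 write lands in both cells
m.backfill();                                   // Phase 2
m.cutover();                                    // Phase 3
console.log(m.read('msg-1'), m.read('msg-2'));  // pre-migration during-dual-write
```

Because dual-write covers new data and backfill covers old data, the cutover never observes a gap — the same invariant the real pipeline relies on.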

Rebalancing Automation (Step Function)

// AWS Step Functions state machine for cell rebalancing
// (annotated with // comments; strip them before deploying, as JSON forbids comments)
{
  "Comment": "Cell Rebalancing Workflow",
  "StartAt": "IdentifyCustomersToMigrate",
  "States": {
    "IdentifyCustomersToMigrate": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:xxx:function:identify-migration-candidates",
      // Returns list of customer_ids that need migration
      "Next": "BatchCustomers"
    },
    "BatchCustomers": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      // Split customers into batches of 100 to avoid overwhelming cells
      "Parameters": { "FunctionName": "batch-customers", "BatchSize": 100 },
      "Next": "MigrateBatch"
    },
    "MigrateBatch": {
      "Type": "Map",
      // Parallel execution (up to 100 concurrent migrations)
      "MaxConcurrency": 100,
      "ItemsPath": "$.batches",
      "Iterator": {
        "StartAt": "MigrateSingleCustomer",
        "States": {
          "MigrateSingleCustomer": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:xxx:function:migrate-customer",
            "Retry": [{ "ErrorEquals": ["States.ALL"], "MaxAttempts": 3, "BackoffRate": 2.0 }],
            "Catch": [{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "LogFailure" }],
            "End": true
          },
          "LogFailure": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:xxx:function:log-migration-failure",
            // Alert on-call engineer, add to retry queue
            "End": true
          }
        }
      },
      "Next": "VerifyMigration"
    },
    "VerifyMigration": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:xxx:function:verify-migration",
      // Check error rates, latency for migrated customers
      "Next": "CheckHealth"
    },
    "CheckHealth": {
      "Type": "Choice",
      "Choices": [{ "Variable": "$.health_check_passed", "BooleanEquals": true, "Next": "ScheduleCleanup" }],
      "Default": "Rollback"
    },
    "Rollback": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:xxx:function:rollback-migration",
      // Revert customer to old cell, alert team
      "End": true
    },
    "ScheduleCleanup": {
      "Type": "Wait",
      "Seconds": 86400,   // 24 hours
      "Next": "CleanupOldCell"
    },
    "CleanupOldCell": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:xxx:function:cleanup-old-cell-data",
      "End": true
    }
  }
}

Load-Based Rebalancing

Automatic Hotspot Detection

Scenario: Enterprise Cell A has CPU at 85%, while Cell B is at 40%.

Solution: Identify "hottest" customers in Cell A and migrate to Cell B.

Algorithm:

  1. CloudWatch alarm triggers when Cell CPU > 80% for 5 minutes
  2. Query DynamoDB for customers in overloaded cell, sorted by api_calls_30d DESC
  3. Select top N customers representing ~20% of traffic
  4. Find underutilized cell in same segment/region (CPU < 50%)
  5. Migrate hot customers to underutilized cell using zero-downtime process
  6. Monitor for 1 hour, verify load distribution improved

Expected Result: Cell A CPU drops to ~65%, Cell B increases to ~60% (balanced)
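Step 3 of the algorithm (select top customers covering ~20% of traffic) could look like this sketch; the customer data is fabricated for illustration.

```javascript
// Customers sorted by 30-day call volume; take top accounts until ~20% of
// the cell's traffic is covered (field mirrors api_calls_30d in the table above).
function selectHotCustomers(customers, targetFraction = 0.2) {
  const total = customers.reduce((sum, c) => sum + c.api_calls_30d, 0);
  const sorted = [...customers].sort((a, b) => b.api_calls_30d - a.api_calls_30d);
  const picked = [];
  let covered = 0;
  for (const c of sorted) {
    if (covered >= total * targetFraction) break;  // enough traffic selected
    picked.push(c);
    covered += c.api_calls_30d;
  }
  return picked;
}

// Fabricated example: one whale dominates the overloaded cell
const cellCustomers = [
  { id: 'c1', api_calls_30d: 5_000_000 },
  { id: 'c2', api_calls_30d: 3_000_000 },
  { id: 'c3', api_calls_30d: 1_000_000 },
  { id: 'c4', api_calls_30d: 500_000 },
  { id: 'c5', api_calls_30d: 500_000 },
];
console.log(selectHotCustomers(cellCustomers).map(c => c.id)); // [ 'c1' ]
```

Migrating few high-volume accounts rather than many small ones keeps the number of zero-downtime migrations (and their risk) to a minimum.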

Production Implementation Considerations

DynamoDB Table Design

// Table: customer_cell_mapping
{
  "TableName": "customer_cell_mapping",
  "BillingMode": "PAY_PER_REQUEST",           // On-demand for unpredictable load
  "StreamEnabled": true,
  "StreamViewType": "NEW_AND_OLD_IMAGES",     // For audit trail
  "KeySchema": [
    { "AttributeName": "customer_id", "KeyType": "HASH" },
    { "AttributeName": "region#category", "KeyType": "RANGE" }   // composite sort key, matching the data model above
  ],
  "GlobalSecondaryIndexes": [
    {
      "IndexName": "cell_id-index",
      "KeySchema": [{ "AttributeName": "cell_id", "KeyType": "HASH" }]
      // Query: "List all customers in cell X" for rebalancing
    },
    {
      "IndexName": "segment-region-index",
      "KeySchema": [
        { "AttributeName": "segment", "KeyType": "HASH" },
        { "AttributeName": "region", "KeyType": "RANGE" }
      ]
      // Query: "List all enterprise customers in us-east-1"
    }
  ],
  "GlobalTableSettings": {
    "Regions": ["us-east-1", "eu-west-1", "ap-southeast-1"],
    "ReplicationSettings": {
      "RegionalReplicaSettings": {
        "us-east-1": { "ReadCapacityUnits": "AUTO" },
        "eu-west-1": { "ReadCapacityUnits": "AUTO" },
        "ap-southeast-1": { "ReadCapacityUnits": "AUTO" }
      }
    }
  }
}

// Table: cell_metadata (cell registry: capacity, status, and endpoints for cell selection)
{
  "TableName": "cell_metadata",
  "BillingMode": "PAY_PER_REQUEST",
  "KeySchema": [{ "AttributeName": "cell_id", "KeyType": "HASH" }],
  "Attributes": {
    "cell_id": "enterprise-us-east-1-a",
    "segment": "enterprise",
    "region": "us-east-1",
    "status": "active",                       // active | draining | inactive
    "capacity": {
      "max_customers": 1000,
      "current_customers": 875,
      "cpu_percent": 72,
      "memory_percent": 65
    },
    "endpoints": {
      "internal_dns": "enterprise-us-east-1-a.internal",
      "alb_arn": "arn:aws:elasticloadbalancing:..."
    },
    "created_at": "2024-01-01T00:00:00Z",
    "virtual_nodes": 150                      // only relevant if the consistent-hashing alternative is used
  }
}

Monitoring & Alerting

Metric | Threshold | Alert | Action
Cell CPU Utilization | > 80% for 5 min | PagerDuty P2 | Auto-scale or trigger load-based rebalancing
Cell Imbalance | Stddev > 20% across cells | Slack warning | Schedule rebalancing during off-peak
Cache Hit Rate | < 90% | Slack warning | Investigate cache eviction, increase Redis capacity
Cell Routing Latency | p99 > 50ms | PagerDuty P3 | Check DynamoDB throttling, Redis latency
Migration Failures | > 5 failures in 1 hour | PagerDuty P1 | Pause migrations, investigate root cause
Customer Distribution | Cell has < 10 customers | CloudWatch alarm | Consider cell decommissioning

Cost Optimization

DynamoDB Costs

Assumptions:

  • 10M customers
  • 3 regions (Global Tables)
  • 10 KB avg item size

Storage: 10M ร— 10KB ร— 3 regions = 300 GB = $75/month

Reads (cached): 1B req/month ร— 5% miss rate = 50M reads = $6/month

Writes: 10M new customers/month = $13/month

Replication: Included in Global Tables

Total: ~$100/month

ElastiCache Costs

Configuration:

  • cache.r7g.large (2 vCPU, 13 GB RAM)
  • 3 AZs, 3 replicas
  • Per region

Per Region: $0.188/hr ร— 3 nodes ร— 730 hrs = $411/month

3 Regions: $1,233/month

Alternative: cache.r7g.xlarge (26 GB) if > 10M customers

Cell Router Lambda

Configuration:

  • 1024 MB memory
  • ~20ms avg execution
  • 1B requests/month

Compute: 1B ร— 20ms = 20M GB-sec = $333/month

Requests: 1B requests = $200/month

Total: ~$533/month

Note: 95% cache hit rate saves significant Lambda cost!

๐Ÿ’ฐ Total Cell Routing Infrastructure Cost

Monthly: ~$1,866/month for 10M customers, 1B API calls

Per Customer: $0.0001866/month

Per API Call: $0.000001866

ROI: Enables reliable cell isolation, fast routing, zero-downtime migrations - well worth the investment!
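The arithmetic behind these totals can be reproduced directly. The prices are the document's own figures, not current AWS rates, and small rounding differences from the quoted ~$1,866 are expected.

```javascript
// Reproduce the monthly totals from the stated assumptions.
const dynamoTotal = 75 + 6 + 13;                      // storage + reads + writes ≈ $94 (quoted "~$100")
const cachePerRegion = 0.188 * 3 * 730;               // $/hr × 3 nodes × 730 hrs ≈ $412
const cacheTotal = cachePerRegion * 3;                // 3 regions ≈ $1,235
const lambdaCompute = 1e9 * 0.020 * 1 * 0.0000166667; // 20M GB-seconds ≈ $333
const lambdaRequests = (1e9 / 1e6) * 0.20;            // $0.20 per million requests = $200
const grandTotal = dynamoTotal + cacheTotal + lambdaCompute + lambdaRequests;
console.log(`~$${grandTotal.toFixed(0)}/month`);      // prints "~$1862/month"
```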

Control Plane: Cell Lifecycle Management

๐ŸŽฎ What is the Control Plane?

The Control Plane is the operational automation layer that manages the entire cell lifecycle. Think of it as the "cockpit of the platform" that orchestrates cell provisioning, customer assignment, rebalancing, and decommissioning.

It works in conjunction with the Cloud Native Landing Zone (AWS Control Tower), which provides the foundational multi-account governance structure.

Control Plane vs Cell Router

Cell Router (Data Plane)

Purpose: Route customer requests to the correct cell

Latency: ~5-15ms per request

Operations:

  • Extract customer_id from request
  • Query cache/DynamoDB for cell assignment
  • Assign new customers to the least-loaded cell (DynamoDB conditional write)
  • Set X-Twilio-Cell-ID header
  • Forward request via VPC Lattice

Scale: Millions of requests per second

Control Plane (Management Plane) NEW

Purpose: Manage cell lifecycle and infrastructure

Latency: Minutes to hours (batch operations)

Operations:

  • Provision new AWS accounts for cells
  • Deploy cell infrastructure (VPC, EKS, DB)
  • Maintain the cell registry (cell_metadata) and notify routers of changes
  • Orchestrate customer migrations
  • Decommission old cells

Scale: Hundreds of operations per day

Control Plane Architecture

[Diagram: Control Plane architecture]
Cloud Native Landing Zone (AWS Control Tower): multi-account organization, security guardrails, account vending (Account Factory), IAM Identity Center (SSO), CloudTrail/Config, Service Control Policies, billing & cost management, Organizational Units.
Control Plane (orchestration layer):
  • Cell Provisioner (Step Functions workflow): 1) create AWS account, 2) deploy VPC via Terraform, 3) deploy EKS cluster, 4) register with VPC Lattice
  • Cell Rebalancer (EventBridge + Lambda): 1) monitor cell capacity, 2) detect saturation (>70%), 3) trigger migration workflow, 4) update routing metadata
  • Cell Registry Manager (DynamoDB + Lambda): 1) store cell metadata, 2) publish registry updates (SNS), 3) version the registry, 4) notify Cell Routers
  • Migration Orchestrator (Step Functions): 1) dual-write, 2) data sync, 3) cutover, 4) cleanup
Cell infrastructure (managed by the Control Plane): Enterprise Cell A (AWS account 111111111111, VPC 10.0.0.0/16, EKS cluster, DynamoDB tables, 100 customers); Mid-Market Cell, shared (account 222222222222, VPC 10.0.0.0/16, 500 customers); Enterprise Cell B (account 333333333333, 85 customers); plus a new cell mid-provisioning via the Account Factory API and Terraform (EKS and DynamoDB pending, ETA ~15 minutes).

Control Plane Operations

Operation 1: Provision New Cell

Trigger: Existing cell reaches 70% capacity OR manual request
Step 1: Create AWS Account
  • Call AWS Control Tower Account Factory API
  • Account name: enterprise-cell-{region}-{letter}
  • Organizational Unit: /Cells/Enterprise
  • Wait for account creation (2-5 minutes)
Step 2: Deploy Infrastructure (Terraform)
  • VPC with CIDR 10.0.0.0/16 (overlapping IPs OK)
  • 3 private subnets across AZs (10.0.0.0/20, 10.0.16.0/20, 10.0.32.0/20)
  • 3 public subnets for NAT gateways
  • EKS cluster with managed node groups (20-50 nodes)
  • DynamoDB tables (replicated via Global Tables)
  • ElastiCache Redis cluster
Step 3: Register with VPC Lattice
  • Create VPC Lattice service for cell
  • Service name: enterprise-{region}-{letter}-api
  • Target: ALB in cell's VPC
  • Associate service with regional VPC Lattice network
Step 4: Update Cell Registry
  • Write the new cell to the DynamoDB cell_metadata table (status: active, capacity limits)
  • Increment the registry version number
  • Publish SNS notification → All Cell Router instances refresh their cell list
Step 5: Trigger Rebalancing
  • Calculate which customers should move to new cell (consistent hashing)
  • Start Step Function: MigrateCellBatch
  • Migrate ~1/N customers from existing cells (N = total cells)
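Steps 4-5 can be sketched in pure Python. The `CellRing` class and vnode naming below are illustrative stand-ins for the DynamoDB-backed ring, but they show the key property: when a fourth cell joins, only the keys that land on its virtual nodes move, which is roughly 1/N of all customers.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a key onto the ring using the first 8 bytes of its MD5 digest."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class CellRing:
    """Illustrative consistent hash ring with virtual nodes per cell."""
    def __init__(self, vnodes: int = 150):
        self.vnodes = vnodes
        self.ring = []        # sorted list of (hash, cell_id)
        self.version = 0

    def add_cell(self, cell_id: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_hash(f"{cell_id}#{i}"), cell_id))
        self.version += 1     # bump version so Cell Routers can detect staleness

    def lookup(self, customer_id: str) -> str:
        idx = bisect.bisect(self.ring, (_hash(customer_id), "")) % len(self.ring)
        return self.ring[idx][1]

# Step 5: after adding cell-d, find which customers would move (~1/N of them)
ring = CellRing()
for cell in ("cell-a", "cell-b", "cell-c"):
    ring.add_cell(cell)
before = {c: ring.lookup(c) for c in (f"customer-{i}" for i in range(1000))}
ring.add_cell("cell-d")
moved = [c for c, old in before.items() if ring.lookup(c) != old]
print(f"{len(moved)} of 1000 customers would migrate")  # roughly 1/4
```

Every customer that moves goes to the new cell only—existing cells never shuffle customers among themselves, which is what keeps the migration batch small.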

๐Ÿ”‘ Key Insight: Control Plane is Asynchronous

Unlike the Cell Router, which operates in the critical path (milliseconds), the Control Plane operates asynchronously (minutes to hours). This separation allows:

  • Cell Router to be ultra-fast and simple
  • Control Plane to be complex and robust (retries, validation, rollback)
  • Zero downtime during infrastructure changes
  • Safe experimentation with new cell configurations

Interview Talking Points: Control Plane

2-Minute Interview Answer

"How would you automate cell provisioning and lifecycle management?"

"I'd build a Control Plane as a set of Step Functions workflows that orchestrate cell lifecycle operations.

Architecture: The Control Plane sits on top of AWS Control Tower, which provides the multi-account Landing Zone. When we need a new cell, the Control Plane calls the Account Factory API to create an AWS account, then uses Terraform to deploy the standard cell stackโ€”VPC, EKS, DynamoDB, etc.

Separation of concerns: The Cell Router handles request-time routing in milliseconds, reading from a DynamoDB table and cache. The Control Plane handles infrastructure changes asynchronously in minutes, updating that table and notifying Cell Routers via SNS when the hash ring changes.

Capacity management: EventBridge rules monitor cell capacity metrics. When a cell hits 70% CPU or 100 customers, it triggers the provisioning workflow automatically. This ensures we're always ahead of demand.

Hash ring updates: The Control Plane is the single writer to the hash ring metadata in DynamoDB. It increments a version number with each change and publishes to SNS. Cell Routers poll every 30 seconds or receive push notifications to refresh their in-memory ring.

Why Step Functions: Cell provisioning involves multiple AWS API calls, Terraform applies, health checks, and rollback logic. Step Functions provides the orchestration, retries, and error handling out of the box. A single workflow can provision a cell end-to-end in about 15 minutes.

This architecture keeps the data plane simple and fast while allowing complex operational automation in the control plane."
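A minimal Amazon States Language sketch of what such a workflow could look like—all state names and the Lambda integration placeholders are hypothetical, but the `Retry` and `Catch` blocks show the built-in retry and rollback handling the answer refers to:

```json
{
  "Comment": "Illustrative cell-provisioning workflow (state names are hypothetical)",
  "StartAt": "CreateAccount",
  "States": {
    "CreateAccount": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Retry": [{"ErrorEquals": ["States.ALL"], "IntervalSeconds": 30,
                 "MaxAttempts": 3, "BackoffRate": 2.0}],
      "Next": "ApplyTerraform"
    },
    "ApplyTerraform": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "RollbackCell"}],
      "Next": "RegisterWithLattice"
    },
    "RegisterWithLattice": {"Type": "Task", "Resource": "arn:aws:states:::lambda:invoke", "End": true},
    "RollbackCell": {"Type": "Task", "Resource": "arn:aws:states:::lambda:invoke", "End": true}
  }
}
```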

Edge Cases & Failure Scenarios

Scenario 1: DynamoDB Unavailable

Problem

DynamoDB Global Table in us-east-1 is experiencing throttling or regional outage.

Impact

  • Cache misses cannot be resolved (5% of traffic)
  • New customer assignments fail

Mitigation

  1. Cross-Region Failover: Route 53 health check detects DynamoDB unavailability, routes requests to EU region
  2. Degraded Mode: Cell Router falls back to pure consistent hashing (no persistence check)
  3. Extended Cache TTL: Temporarily extend Redis TTL from 1 hour to 24 hours
  4. Rate Limiting: Reduce new customer signup rate to avoid overwhelming when DynamoDB recovers

Recovery

When DynamoDB recovers, a background job reconciles any assignments made during the outage.
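Mitigation 2 can be sketched as follows. The cell list and the simulated lookup failure are assumptions for illustration; the important property is that the fallback hash is deterministic, so every Cell Router instance makes the same degraded-mode decision for a given customer during the outage.

```python
import hashlib

CELLS = ["cell-a", "cell-b", "cell-c"]  # hypothetical active cells

def lookup_dynamodb(customer_id: str) -> str:
    """Placeholder for the real DynamoDB GetItem call."""
    raise TimeoutError("simulated regional DynamoDB outage")

def route(customer_id: str) -> tuple:
    """Return (cell_id, degraded). Falls back to a deterministic hash when
    DynamoDB is unreachable, so all routers agree while it is down."""
    try:
        return lookup_dynamodb(customer_id), False
    except TimeoutError:
        digest = hashlib.sha256(customer_id.encode()).digest()
        return CELLS[int.from_bytes(digest[:4], "big") % len(CELLS)], True

cell, degraded = route("customer-12345")
print(cell, degraded)
```

Assignments made in degraded mode are exactly what the recovery job has to reconcile against the table once DynamoDB is healthy again.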

Scenario 2: Redis Cache Failure

Problem

ElastiCache cluster in us-east-1 fails (e.g., all nodes restart simultaneously).

Impact

  • 100% cache miss rate (vs 5% normal)
  • 20x increase in DynamoDB read load
  • Routing latency increases from 5ms โ†’ 15ms

Mitigation

  1. Multi-AZ: ElastiCache cluster mode with 3 replicas across AZs (automatic failover)
  2. DynamoDB Auto-Scaling: On-demand mode handles read spike
  3. Circuit Breaker: If cache is down, skip cache entirely (direct to DynamoDB)
  4. Graceful Degradation: Inform customers of slight latency increase via status page
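Mitigation 3's circuit breaker can be sketched in a few lines. The threshold and cooldown values are assumptions; the point is that after a run of consecutive cache errors the router stops touching Redis entirely and goes straight to DynamoDB until a cooldown elapses.

```python
import time

class CacheCircuitBreaker:
    """Minimal sketch: after `threshold` consecutive cache errors, skip the
    cache for `cooldown` seconds and read straight from DynamoDB."""
    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = 0.0

    def cache_available(self) -> bool:
        if self.failures >= self.threshold:
            if time.monotonic() - self.opened_at < self.cooldown:
                return False       # breaker open: go direct to DynamoDB
            self.failures = 0      # cooldown elapsed: probe the cache again
        return True

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures == self.threshold:
            self.opened_at = time.monotonic()

    def record_success(self) -> None:
        self.failures = 0

breaker = CacheCircuitBreaker(threshold=3)
for _ in range(3):
    breaker.record_failure()       # e.g. redis.get raised ConnectionError
print(breaker.cache_available())   # False: requests bypass the cache
```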

Scenario 3: Customer Segment Changes Mid-Request

Problem

Customer upgrades from SMB โ†’ Enterprise while their request is in flight.

Impact

  • Request routed to SMB cell, but customer now belongs in Enterprise cell
  • Data may be written to wrong cell

Mitigation

  1. Optimistic Locking: Use DynamoDB version field with conditional writes
  2. Dual-Write Period: During upgrade, write to both SMB and Enterprise cells for 5 minutes
  3. Cell Router Check: Before routing, verify segment hasn't changed (compare cached vs DynamoDB)
  4. Automatic Redirect: If segment mismatch detected, invalidate cache and re-route
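Mitigations 3-4 together amount to a compare-and-invalidate step. Below is a minimal sketch using in-memory dictionaries as stand-ins for Redis and the DynamoDB table; the record shapes are assumptions for illustration.

```python
# In-memory stand-ins for the Redis cache and the DynamoDB table.
cache = {"customer-12345": {"cell_id": "smb-cell-1", "segment": "smb"}}
dynamodb = {"customer-12345": {"cell_id": "ent-cell-b", "segment": "enterprise"}}

def route_with_segment_check(customer_id: str) -> str:
    """On a suspected segment change, compare the cached entry against
    DynamoDB; on mismatch, invalidate the cache and re-route."""
    cached = cache.get(customer_id)
    authoritative = dynamodb[customer_id]
    if cached is None or cached["segment"] != authoritative["segment"]:
        cache[customer_id] = authoritative   # invalidate stale entry, re-prime
        return authoritative["cell_id"]      # redirect to the correct cell
    return cached["cell_id"]

print(route_with_segment_check("customer-12345"))  # ent-cell-b
```

In practice this check would only run when an upgrade event flags the customer, since verifying every request against DynamoDB would defeat the purpose of the cache.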

Scenario 4: Hash Ring Desynchronization

Problem

Cell Router instances have inconsistent views of the hash ring (e.g., new cell added but not all routers refreshed).

Impact

  • Same customer routed to different cells by different routers
  • Data inconsistency

Mitigation

  1. Authoritative Source: DynamoDB cell_metadata table is source of truth
  2. Version Number: Hash ring has version number, routers check version before routing
  3. Refresh Frequency: Cell Routers poll DynamoDB every 30 seconds for ring updates
  4. SNS Notification: When cell added/removed, publish to SNS โ†’ All routers immediately refresh
  5. Consistency Check: If routing result doesn't match DynamoDB assignment, trust DynamoDB and update local ring
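Mitigations 2 and 5 can be sketched together: a router with a stale ring refreshes when the version lags the authoritative one, and when its local ring still disagrees with the stored assignment, DynamoDB wins. The dictionaries below are hypothetical stand-ins for the `cell_metadata` and `customer_cell_mapping` tables.

```python
# Stand-ins: authoritative metadata vs a stale in-memory router view.
metadata = {"ring_version": 7}
assignments = {"customer-12345": "cell-d"}   # customer_cell_mapping table

class Router:
    def __init__(self, ring_version: int, ring_lookup):
        self.ring_version = ring_version
        self.ring_lookup = ring_lookup       # local in-memory ring

    def route(self, customer_id: str) -> str:
        if self.ring_version < metadata["ring_version"]:
            self.refresh()                   # stale ring: refresh before routing
        local = self.ring_lookup(customer_id)
        stored = assignments.get(customer_id)
        if stored and stored != local:
            return stored                    # mitigation 5: DynamoDB wins
        return local

    def refresh(self) -> None:
        self.ring_version = metadata["ring_version"]

router = Router(ring_version=6, ring_lookup=lambda cid: "cell-a")  # stale view
print(router.route("customer-12345"))  # cell-d — stored assignment wins
```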

Scenario 5: Entire Cell Failure

Problem

Enterprise Cell A (us-east-1) experiences complete failure (EKS cluster down, AZ outage, etc.).

Impact

  • All customers in Cell A cannot access services
  • ~100 enterprise customers affected (everyone assigned to Cell A)

Mitigation

  1. Health Checks: ALB target group health checks detect cell failure within 30 seconds
  2. Automatic Failover: Update DynamoDB: Set cell_status=inactive
  3. Emergency Rebalancing: Step Function immediately migrates all Cell A customers to Cell B (same region)
  4. Dual-Cell Strategy: Enterprise customers replicated to 2 cells (primary + hot standby)
  5. RTO: < 5 minutes for full recovery
  6. RPO: < 1 second (DynamoDB Global Tables replication lag)
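The failover path in mitigations 2-4 reduces to following a standby pointer once a cell is marked inactive. A minimal sketch, with hypothetical cell records standing in for the DynamoDB `cell_metadata` table:

```python
# Hypothetical cell_metadata records after mitigation 2 marks Cell A inactive.
cells = {
    "ent-cell-a": {"status": "inactive", "standby": "ent-cell-b"},
    "ent-cell-b": {"status": "active", "standby": None},
}

def resolve_cell(cell_id: str) -> str:
    """Follow the standby chain until an active cell is found."""
    seen = set()
    while cells[cell_id]["status"] != "active":
        if cell_id in seen:
            raise RuntimeError("no active cell in standby chain")
        seen.add(cell_id)
        cell_id = cells[cell_id]["standby"]
    return cell_id

print(resolve_cell("ent-cell-a"))  # ent-cell-b
```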

Interview Talking Points

๐ŸŽค 2-Minute Summary: Cell Routing & Assignment

"For Twilio's cell routing, I use DynamoDB Global Tables as the source of truth for customer โ†’ cell mappings. The Cell Router is simple: check Redis cache (95% hit rate, ~5ms), query DynamoDB on miss (~15ms), or assign new customers to the least-loaded cell (~20ms)."

"For new customer assignment, I query cell metadata from DynamoDB to find the least-loaded cell in the customer's segment and region. The assignment uses a load score combining customer count and CPU utilization, so we always assign to the cell with the most capacity. A conditional write prevents race conditions if two Cell Routers try to assign the same customer simultaneously."
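The assignment logic above can be sketched as follows. The 50/50 load-score weighting, the 100-customer cap, and the in-memory dictionaries are illustrative assumptions; `setdefault` stands in for DynamoDB's `attribute_not_exists(customer_id)` conditional write, which makes the assignment idempotent under races.

```python
# Illustrative cell_metadata rows (weights and values are assumptions).
cell_metadata = [
    {"cell_id": "ent-us-east-1-a", "segment": "enterprise", "region": "us-east-1",
     "customer_count": 92, "cpu_utilization": 0.61},
    {"cell_id": "ent-us-east-1-b", "segment": "enterprise", "region": "us-east-1",
     "customer_count": 85, "cpu_utilization": 0.48},
]
customer_cell_mapping = {}  # stand-in for the DynamoDB mapping table

def load_score(cell, max_customers=100):
    # Blend occupancy and CPU; lower means more spare capacity.
    return 0.5 * (cell["customer_count"] / max_customers) + 0.5 * cell["cpu_utilization"]

def assign_customer(customer_id, segment, region):
    candidates = [c for c in cell_metadata
                  if c["segment"] == segment and c["region"] == region]
    target = min(candidates, key=load_score)
    # Conditional write: only assign if no mapping exists yet, so a racing
    # router simply reads back the winner's assignment.
    return customer_cell_mapping.setdefault(customer_id, target["cell_id"])

print(assign_customer("customer-12345", "enterprise", "us-east-1"))
```

With these sample numbers, cell B scores 0.665 against cell A's 0.765, so the new customer lands in cell B, and any concurrent assignment attempt returns the same answer.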

"This approach is simpler than consistent hashing because DynamoDB is already the source of truthโ€”why calculate hashes when you can just look it up? It also gives better load distribution since we use real metrics instead of pseudo-random hash distribution."

"When adding new cells, the Control Plane provisions the AWS account and infrastructure. New customers automatically go to the new cell since it has the lowest load. For rebalancing existing customers, the Control Plane explicitly migrates them using a zero-downtime process: dual-write, background sync, atomic cutover, cleanup."

"The entire routing infrastructure costs about $1,866/month for 10M customers and 1B API callsโ€”DynamoDB global tables, ElastiCache, and Lambda. That's roughly $0.000002 per API call for reliable cell isolation and sub-20ms routing."
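The per-call figure is simple division over the numbers quoted above:

```python
monthly_cost = 1866        # USD/month: DynamoDB global tables + ElastiCache + Lambda
api_calls = 1_000_000_000  # API calls per month
per_call = monthly_cost / api_calls
print(f"${per_call:.7f} per API call")  # ~$0.0000019, i.e. roughly $0.000002
```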

๐Ÿ’ก Key Interview Points

  • Simplicity: "DynamoDB is the source of truth, so we just look it up. No hash ring versioning or synchronization."
  • Performance: "95% cache hit rate means ~5ms latency for most requests. Cache miss is ~15ms from DynamoDB."
  • Load Distribution: "Least-loaded assignment based on real metrics (CPU, customer count) beats hash-based distribution."
  • Atomic Assignment: "DynamoDB conditional writes prevent race conditions when multiple Cell Routers assign same customer."
  • Controlled Migration: "Control Plane explicitly decides which customers to migrate, when, and whereโ€”not dictated by hash ring changes."
  • When Hashing Makes Sense: "Consistent hashing is great for stateless caches (Memcached) or database sharding (Cassandra), but not when you already have a database!"