Cell Operations & Deployment Strategy

Progressive Rollout, API Versioning, and Schema Migrations for Cell-Based Architecture


The Challenge: Product-Agnostic Cells Add Deployment Complexity

When each cell runs ALL Twilio services (SMS, Voice, Video, WhatsApp, Email), releasing software across cells becomes complicated. This is the fundamental trade-off of product-agnostic cells:

Benefit                                       | Trade-off
----------------------------------------------|--------------------------------------------------
No cross-cell API calls (runtime simplicity)  | More cells to update during deployments
Customer's services co-located                | Breaking changes require global coordination
Blast radius limited to cell                  | Schema migrations must be orchestrated carefully

The Meta-Insight

Product-agnostic cells shift complexity from runtime (no cross-cell calls) to deployment (more cells to update). That's a good trade—deployment complexity is solved with automation; runtime complexity causes outages.

Deployment Strategy: Progressive Rollout

Wave-Based Deployment

Wave 0: Canary Cell (1 cell)
   └── Internal traffic or opt-in beta customers
   └── Bake time: 30-60 minutes
   └── Automated rollback on error rate spike

Wave 1: SMB Cells (25% of SMB)
   └── Lower blast radius, self-service customers
   └── Bake time: 2-4 hours
   └── Monitor error rates, latency p99

Wave 2: Remaining SMB Cells (75%)
   └── Bake time: 4-8 hours

Wave 3: Enterprise Cells (one at a time)
   └── Bake time: 24 hours per cell
   └── Customer notification for maintenance windows
                    

Why This Works

Concern                              | Solution
-------------------------------------|----------------------------------------------------------
Bad deploy takes down everything     | Wave-based rollout limits blast radius
Rolling back is hard                 | Each cell is independent; roll back only the affected cells
Enterprise customers need stability  | They're always last, with the longest bake times
Speed vs. safety                     | SMB cells move fast; enterprise cells move carefully

Automated Deployment Pipeline

# deployment-pipeline.yaml (Step Functions or similar)
states:
  DeployToCanary:
    cell: canary-us-internal
    action: deploy
    next: ValidateCanary

  ValidateCanary:
    checks:
      - error_rate < 0.1%
      - p99_latency < 500ms
      - no_pagerduty_alerts: 30m
    on_failure: RollbackCanary
    on_success: DeployToSMBWave1

  DeployToSMBWave1:
    cells: [smb-us-1, smb-us-2, smb-eu-1]  # 25% of SMB
    parallel: true
    next: ValidateSMBWave1

  ValidateSMBWave1:
    bake_time: 4h
    checks: [error_rate, latency, customer_complaints]
    on_failure: RollbackSMBWave1
    on_success: DeployToSMBWave2

  # ... continues through all waves

  DeployToEnterprise:
    cells: [enterprise-us-hipaa]
    requires_approval: true  # Human gate for enterprise
    notification: slack://enterprise-releases
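
The validation gates in the pipeline above can be sketched as a simple threshold check. This is a minimal sketch: the metric and threshold names are illustrative, not a real Step Functions or monitoring API.

```javascript
// Sketch of a wave-validation gate, mirroring the ValidateCanary checks
// above. An empty failures list means proceed to the next wave;
// otherwise the pipeline triggers rollback for the affected cells.
function validateWave(metrics, thresholds) {
  const failures = [];
  if (metrics.errorRate > thresholds.maxErrorRate) {
    failures.push(`error_rate ${metrics.errorRate} > ${thresholds.maxErrorRate}`);
  }
  if (metrics.p99LatencyMs > thresholds.maxP99LatencyMs) {
    failures.push(`p99_latency ${metrics.p99LatencyMs}ms > ${thresholds.maxP99LatencyMs}ms`);
  }
  if (metrics.pagerdutyAlerts > 0) {
    failures.push(`${metrics.pagerdutyAlerts} PagerDuty alerts during bake window`);
  }
  return { pass: failures.length === 0, failures };
}
```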

Decouple Deployment from Activation

The Principle

Deploy code to all cells quickly; activate features slowly. Feature flags make this separation possible.

// Feature flags let you deploy code without activating it
const features = {
  newVoiceCodec: {
    enabled: false,           // Deployed everywhere, activated nowhere
    rolloutPercentage: 0,
    allowlist: ['enterprise-customer-123']  // Beta testers
  }
};

// Gradual activation AFTER deployment is complete
// Day 1: Deploy to all cells (feature off)
// Day 2: Enable for 5% of traffic
// Day 3: Enable for 25%
// Day 7: Enable for 100%

Separation of Concerns

┌─────────────────────────────────────────────────────────────────┐
│   Breaking changes are a ROUTING problem, not a DEPLOYMENT     │
│   problem.                                                      │
│                                                                 │
│   Deploy code progressively → Cells (wave-based, safe)         │
│   Enable versions instantly → Landing Zone (global, atomic)    │
└─────────────────────────────────────────────────────────────────┘
                    

Operations Strategy: Centralized Control, Decentralized Execution

GitOps Model

infrastructure-repo/
├── modules/
│   └── cell/                    # Shared cell template
│       ├── sms-service.tf
│       ├── voice-service.tf
│       ├── video-service.tf
│       └── variables.tf
│
├── cells/
│   ├── enterprise-us-hipaa/
│   │   └── terragrunt.hcl       # Just variables, includes module
│   ├── enterprise-eu-standard/
│   │   └── terragrunt.hcl
│   ├── smb-us-standard/
│   │   └── terragrunt.hcl
│   └── ...
│
└── argocd/
    └── applications/            # One ArgoCD app per cell

Operational Pattern

┌─────────────────────────────────────────────────────┐
│              Central Control Plane                   │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │
│  │   ArgoCD    │  │  Datadog/   │  │   PagerDuty │  │
│  │  (Deploy)   │  │  Grafana    │  │  (Alerting) │  │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  │
└─────────┼────────────────┼────────────────┼─────────┘
          │                │                │
    ┌─────┴─────┬──────────┴────────┬───────┴────┐
    ▼           ▼                   ▼            ▼
┌───────┐  ┌───────┐           ┌───────┐    ┌───────┐
│Cell 1 │  │Cell 2 │    ...    │Cell N │    │Cell M │
│ (SMB) │  │ (SMB) │           │(Ent.) │    │(Ent.) │
└───────┘  └───────┘           └───────┘    └───────┘
                    

One Dashboard, Drill-Down Capability

# Datadog dashboard hierarchy
Global View:
  - All cells health heatmap (green/yellow/red)
  - Cross-cell comparison: latency, error rates, CPU
  - Deployment status: which cells on which version

Cell View (drill down):
  - Per-service metrics within cell
  - Customer impact (affected accounts)
  - Recent deployments and config changes

Service View (drill down further):
  - Individual pod/container metrics
  - Traces, logs, spans

Key Operational Simplifications

Challenge                           | Solution
------------------------------------|----------------------------------------------------------
"Which cell is on which version?"   | ArgoCD dashboard shows all cells
"How do I debug across cells?"      | Centralized logging (Datadog/Splunk) with a cell_id tag
"Runbooks for each cell?"           | Same runbook, parameterized by cell_id
"Alert fatigue from N cells?"       | Aggregate alerts by issue type, not by cell
"Config drift between cells?"       | GitOps ensures cells match repo state
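
The alert-aggregation row above can be sketched as a small grouping step: N cells firing the same alert should page once, listing the affected cells. The alert shape is illustrative, not a real Datadog/PagerDuty payload.

```javascript
// Group per-cell alerts by issue type so that one systemic problem
// produces one page, not N pages.
function aggregateAlerts(alerts) {
  const byType = new Map();
  for (const alert of alerts) {
    if (!byType.has(alert.type)) {
      byType.set(alert.type, { type: alert.type, cells: [] });
    }
    byType.get(alert.type).cells.push(alert.cellId);
  }
  return [...byType.values()];  // One entry per issue type
}
```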

API Versioning at the Edge (Landing Zone)

The Principle

Breaking API changes don't roll through cells—they're handled at the Landing Zone edge. Cells run multiple API versions simultaneously; the edge routes /v1/* to v1 handlers and /v2/* to v2 handlers.

Architecture: Version Routing at Edge

                    ┌─────────────────────────────────────┐
                    │       Cloud Native Landing Zone      │
                    │  ┌─────────────────────────────────┐ │
                    │  │   API Gateway (Version Router)  │ │
                    │  │                                 │ │
   POST /v1/Messages│  │   /v1/* ──→ v1 handler         │ │
   ─────────────────┼─▶│   /v2/* ──→ v2 handler         │ │
   POST /v2/Messages│  │                                 │ │
                    │  └───────────┬─────────────────────┘ │
                    └──────────────┼───────────────────────┘
                                   │
                    ┌──────────────┼───────────────────────┐
                    │         Cell Router                  │
                    │    (Routes to correct cell)          │
                    └──────────────┼───────────────────────┘
                                   │
              ┌────────────────────┼────────────────────┐
              ▼                    ▼                    ▼
        ┌──────────┐         ┌──────────┐        ┌──────────┐
        │  Cell 1  │         │  Cell 2  │        │  Cell 3  │
        │  (v1+v2) │         │  (v1+v2) │        │  (v1+v2) │
        └──────────┘         └──────────┘        └──────────┘
                    

Cells Run Multiple API Versions

// Cells are version-agnostic - they run ALL supported versions
// The edge determines which version handler to invoke

// In Landing Zone API Gateway:
const routes = {
  'POST /v1/Messages': 'sms-service:handleV1Message',
  'POST /v2/Messages': 'sms-service:handleV2Message',  // Breaking change
  'POST /v1/Calls':    'voice-service:handleV1Call',
  'POST /v2/Calls':    'voice-service:handleV2Call',
};

// Inside each cell - BOTH handlers exist simultaneously
class SmsService {
  handleV1Message(req) {
    // Old contract: { To, From, Body }
    return this.sendMessage(req.To, req.From, req.Body);
  }

  handleV2Message(req) {
    // New contract: { recipient, sender, content, metadata }
    return this.sendMessage(req.recipient, req.sender, req.content);
  }

  // Shared internal logic - only contract differs
  sendMessage(to, from, body) { /* ... */ }
}

What Lives Where

Layer         | Responsibilities                                   | Breaking Change Strategy
--------------|----------------------------------------------------|----------------------------------------------
Landing Zone  | API Gateway, version routing, rate limiting, auth  | Routes /v1/* vs /v2/* to correct handlers
Control Plane | Cell provisioning, customer routing, capacity      | Tracks which cells support which versions
Cell Router   | Customer → cell mapping                            | Version-unaware (all cells run all versions)
Cell          | Run services, process requests                     | Contains BOTH v1 and v2 handlers

Breaking Change Rollout Timeline

Month 0: Deploy v2 handlers to all cells (v2 disabled at edge)

Progressive rollout, normal wave-based deployment. v2 code exists but receives zero traffic.

Month 1: Enable v2 at API Gateway

Single config change in Landing Zone. Instant global availability. v1 still works, no customer impact.

Month 3: Announce v1 deprecation

Customer notification. v1 continues working.

Month 12: Disable v1 at API Gateway

Single config change. Remove v1 handlers from cells (cleanup).

Risk Model Inversion

Deployment is low-risk because the new code isn't reachable. Enablement is low-risk because all cells already have the code. And sunset is just another config change—disable the old route when customers have migrated.

Landing Zone API Gateway Config

# api-gateway-config.yaml (AWS API Gateway / Kong / Envoy)
routes:
  - path: /v1/Messages
    status: active
    backend: ${cell_router}/sms/v1/messages
    deprecation_date: null

  - path: /v2/Messages
    status: active
    backend: ${cell_router}/sms/v2/messages
    introduced: 2025-01-15

  - path: /v0/Messages
    status: deprecated
    backend: ${cell_router}/sms/v1/messages  # Internally aliased
    sunset_date: 2025-06-01
    response_headers:
      Deprecation: "true"
      Sunset: "Sat, 01 Jun 2025 00:00:00 GMT"

Control Plane Version Tracking

// Control Plane tracks cell capabilities
const cellMetadata = {
  'enterprise-us-hipaa': {
    versions: {
      sms: ['v1', 'v2'],
      voice: ['v1', 'v2'],
      video: ['v1']  // v2 not yet deployed to this cell
    },
    lastDeployed: '2025-01-15T10:30:00Z'
  }
};

// During v2 rollout, Control Plane can:
// 1. Track which cells have v2 code
// 2. Gate v2 enablement until all cells ready
// 3. Provide rollout status dashboard
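
Point 2 above (gating v2 enablement) can be sketched as a readiness check over the cell metadata. A minimal sketch, assuming the cellMetadata shape shown above; the function name is illustrative.

```javascript
// Only flip v2 on at the edge once every cell reports the v2 handler
// for that service. Returns the blocking cells if any are not ready.
function canEnableVersion(cellMetadata, service, version) {
  const blockers = Object.entries(cellMetadata)
    .filter(([, cell]) => !(cell.versions[service] || []).includes(version))
    .map(([cellId]) => cellId);
  return { ready: blockers.length === 0, blockers };
}
```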

Interview Talking Point: API Versioning

"Breaking API changes don't roll through cells—they're handled at the Landing Zone. Cells run multiple API versions simultaneously; the edge routes /v1/* to v1 handlers and /v2/* to v2 handlers.

The rollout becomes two separate concerns: First, deploy the new handlers to cells using normal progressive rollout—code exists but receives no traffic. Then, enable the new version at the API Gateway—a single global config change that's instant and atomic.

This inverts the risk model. Deployment is low-risk because the new code isn't reachable. Enablement is low-risk because all cells already have the code. And sunset is just another config change—disable the old route when customers have migrated.

The Control Plane tracks which cells support which versions, so we can gate enablement until all cells are ready. But the versioning logic itself lives at the edge where it belongs."

Database Schema Migrations Across Cells

Key Advantage

Cell isolation actually helps with schema migrations. Each cell has its own database, so you migrate cell-by-cell rather than globally. The blast radius of a failed migration is one cell, not the whole platform.

The Core Pattern: Expand-Contract Migrations

┌─────────────────────────────────────────────────────────────────┐
│                    EXPAND-CONTRACT PATTERN                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Phase 1: EXPAND (Schema)                                        │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ ALTER TABLE messages ADD COLUMN recipient_id VARCHAR;   │    │
│  │ -- Nullable, old code ignores it, new code can write    │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           ↓                                      │
│  Phase 2: MIGRATE (Code + Data)                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ - Deploy code that writes to BOTH old and new columns   │    │
│  │ - Backfill: UPDATE messages SET recipient_id = to_phone │    │
│  │ - Code reads from new column, falls back to old         │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           ↓                                      │
│  Phase 3: CONTRACT (Cleanup)                                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ - Deploy code that ONLY uses new column                 │    │
│  │ - ALTER TABLE messages DROP COLUMN to_phone;            │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
                    

Cell-by-Cell Migration Strategy

                    Control Plane
                    (Migration Orchestrator)
                           │
         ┌─────────────────┼─────────────────┐
         │                 │                 │
         ▼                 ▼                 ▼
    ┌─────────┐       ┌─────────┐       ┌─────────┐
    │ Cell 1  │       │ Cell 2  │       │ Cell 3  │
    │ v2.1    │       │ v2.0    │       │ v2.0    │
    │ Schema  │       │ Schema  │       │ Schema  │
    │ ✅ Done │       │ 🔄 Running│      │ ⏳ Pending│
    └─────────┘       └─────────┘       └─────────┘
         │                 │                 │
         ▼                 ▼                 ▼
    ┌─────────┐       ┌─────────┐       ┌─────────┐
    │   DB    │       │   DB    │       │   DB    │
    │ (new)   │       │(migrating)│     │ (old)   │
    └─────────┘       └─────────┘       └─────────┘

Migration order: Canary → SMB Wave 1 → SMB Wave 2 → Enterprise
                    

The Golden Rule: Code Must Handle Both Schemas

// During migration window, code must be "schema-bilingual"
class MessageRepository {
  async getMessage(id) {
    const row = await db.query('SELECT * FROM messages WHERE id = ?', [id]);

    // Handle both old and new schema
    return {
      id: row.id,
      // New column with fallback to old
      recipientId: row.recipient_id || row.to_phone,
      senderId: row.sender_id || row.from_phone,
      // ... rest of fields
    };
  }

  async createMessage(msg) {
    // Write to BOTH during migration (dual-write)
    await db.query(`
      INSERT INTO messages (id, to_phone, from_phone, recipient_id, sender_id, body)
      VALUES (?, ?, ?, ?, ?, ?)
    `, [msg.id, msg.recipientId, msg.senderId, msg.recipientId, msg.senderId, msg.body]);
    //         ↑ old columns        ↑ new columns (same values)
  }
}

Migration Orchestration in Control Plane

# migration-manifest.yaml
migration:
  id: "2025-01-15-recipient-id-column"
  description: "Rename to_phone → recipient_id"

  phases:
    - name: expand
      type: schema
      script: migrations/001_add_recipient_id.sql
      rollout:
        - cells: [canary-internal]
          wait: 1h
        - cells: [smb-us-1, smb-us-2]
          wait: 4h
        - cells: [smb-*]
          wait: 24h
        - cells: [enterprise-*]
          wait: 48h
          requires_approval: true

    - name: migrate-code
      type: deployment
      version: "v2.5.0"  # Code with dual-write
      rollout: standard  # Normal wave-based deployment

    - name: backfill
      type: data
      script: migrations/001_backfill_recipient_id.py
      parallel_cells: 3  # Max 3 cells backfilling at once
      checkpoint_interval: 10000  # Rows between checkpoints

    - name: migrate-code-readonly
      type: deployment
      version: "v2.6.0"  # Code reads only new column
      rollout: standard

    - name: contract
      type: schema
      script: migrations/001_drop_to_phone.sql
      rollout:
        - cells: [canary-internal]
          wait: 24h
        - cells: [smb-*]
          wait: 48h
        - cells: [enterprise-*]
          wait: 168h  # 1 week bake time
          requires_approval: true
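
The backfill phase above (checkpoint_interval: 10000) can be sketched as a batched, resumable job. This is a sketch under assumptions: the `db` helper methods and checkpoint names are illustrative, not a real client library.

```javascript
// Checkpointed backfill: small primary-key batches keep each UPDATE
// short-lived, and the saved checkpoint lets a crashed run resume
// instead of restarting from row zero.
async function backfillRecipientId(db, batchSize = 10000) {
  // Resume from the last checkpoint if a previous run was interrupted
  let lastId = (await db.getCheckpoint('001_backfill_recipient_id')) || 0;
  for (;;) {
    const rows = await db.query(
      'SELECT id, to_phone FROM messages WHERE id > ? ORDER BY id LIMIT ?',
      [lastId, batchSize]
    );
    if (rows.length === 0) break;  // Reached the end of the table
    for (const row of rows) {
      // Only fill rows the dual-writing code hasn't already populated
      await db.query(
        'UPDATE messages SET recipient_id = ? WHERE id = ? AND recipient_id IS NULL',
        [row.to_phone, row.id]
      );
    }
    lastId = rows[rows.length - 1].id;
    await db.saveCheckpoint('001_backfill_recipient_id', lastId);
  }
}
```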

Online Schema Change Tools (Zero Downtime)

# pt-online-schema-change (Percona) - for MySQL
# Creates shadow table, syncs with triggers, atomic swap
pt-online-schema-change \
  --alter "ADD COLUMN recipient_id VARCHAR(50)" \
  --execute \
  D=twilio,t=messages

# gh-ost (GitHub) - for MySQL
# Binlog-based replication, no triggers
gh-ost \
  --alter="ADD COLUMN recipient_id VARCHAR(50)" \
  --database=twilio \
  --table=messages \
  --execute

# For PostgreSQL - native online DDL is better
# But for large tables, use pg_repack or similar

Control Plane Migration State Tracking

// Control Plane tracks migration state per cell
const migrationState = {
  'migration-2025-01-15-recipient-id': {
    status: 'in_progress',
    currentPhase: 'backfill',
    cells: {
      'canary-internal':     { phase: 'contract', status: 'complete' },
      'smb-us-1':            { phase: 'backfill', status: 'complete', rows: 1500000 },
      'smb-us-2':            { phase: 'backfill', status: 'running', rows: 850000, total: 2000000 },
      'smb-eu-1':            { phase: 'migrate-code', status: 'complete' },
      'enterprise-us-hipaa': { phase: 'expand', status: 'pending' },
    },
    canRollback: true,  // Until contract phase
    startedAt: '2025-01-15T10:00:00Z',
  }
};

// Dashboard shows:
// ████████████░░░░░░░░ 60% complete (12/20 cells)
// Current: Backfilling smb-us-2 (42% of rows)
// Next: smb-eu-2 (waiting for smb-us-2)

The Rollback Question

Phase        | Rollback Strategy
-------------|-----------------------------------------------------------
Expand       | Just drop the new column (no data loss)
Migrate Code | Roll back to old code version (still reads old columns)
Backfill     | Stop backfill, no harm done
Contract     | Point of no return: old column is gone
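
The rollback table above can be encoded as a lookup the orchestrator consults before allowing a rollback. The phase names follow the migration-manifest example; the readonly-phase strategy is an assumption consistent with the dual-write pattern.

```javascript
// Rollback strategies per migration phase. Everything is reversible
// until contract drops the old column.
const ROLLBACK_STRATEGY = {
  'expand':                { reversible: true,  action: 'drop the new column' },
  'migrate-code':          { reversible: true,  action: 'roll back to previous code version' },
  'backfill':              { reversible: true,  action: 'stop the backfill job' },
  'migrate-code-readonly': { reversible: true,  action: 'roll back to dual-write code' },
  'contract':              { reversible: false, action: 'none: old column is gone' },
};

function canRollback(phase) {
  return ROLLBACK_STRATEGY[phase]?.reversible ?? false;
}
```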

Point of No Return

The contract phase (dropping old columns) is irreversible. This is why enterprise cells get 1-week bake times and require human approval before contract phase.

What About Global/Platform Services?

Platform Services (Identity, Billing) have shared databases that span regions:

Option 1: DynamoDB Global Tables (Schema-less)

  • No formal schema to migrate
  • Code handles missing/new attributes gracefully
  • Just deploy code that writes new attributes

Option 2: Aurora Global Database

  • Primary region does schema change
  • Replicates to read replicas automatically
  • Use expand-contract (longer timeline)
  • Schema change is atomic, replication handles it

Option 3: Blue-Green Database

  • Provision new database with new schema
  • Dual-write during migration
  • Cut over when backfill complete
  • More complex, but zero-risk rollback
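
The blue-green option above can be sketched as a repository that dual-writes and reads behind a cutover flag. A minimal sketch; the `oldDb`/`newDb` interfaces are illustrative.

```javascript
// Blue-green dual-write: every write hits both databases during the
// migration window; reads cut over via a single flag once the backfill
// completes. Flipping the flag back is the zero-risk rollback.
class BlueGreenRepo {
  constructor(oldDb, newDb) {
    this.oldDb = oldDb;
    this.newDb = newDb;
    this.readFromNew = false;  // Flip after backfill verification
  }
  async write(record) {
    await this.oldDb.insert(record);  // Old schema stays authoritative
    await this.newDb.insert(record);  // New schema kept in sync
  }
  async read(id) {
    return this.readFromNew ? this.newDb.get(id) : this.oldDb.get(id);
  }
}
```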

Interview Talking Point: Schema Migrations

"Schema migrations in a cell-based architecture are actually easier because each cell has its own database—you migrate cell by cell, not globally. The blast radius of a failed migration is one cell, not the whole platform.

The pattern is expand-contract: add the new column as nullable, deploy code that dual-writes, backfill existing data, then deploy code that only uses the new column, and finally drop the old column. Code must be schema-bilingual during the migration window.

The Control Plane orchestrates this—tracking which cells are in which phase, managing the rollout order (canary, SMB, then enterprise), and knowing when rollback is still possible. The point of no return is the contract phase when you drop the old column.

For global Platform Services like Identity, I'd use DynamoDB Global Tables where possible—no formal schema means no schema migrations. For relational data, Aurora Global handles replication automatically, but you still use expand-contract with longer timelines since there's no cell isolation to limit blast radius."

Interview Summary: Deployment & Operations

Comprehensive Talking Point

"Product-agnostic cells add deployment complexity, but we solve it with progressive rollout and decoupled activation. Code deploys in waves—canary cell first, then SMB cells, then enterprise cells last with 24-hour bake times. But I separate deployment from activation using feature flags, so we can deploy everywhere quickly but enable features gradually.

For operations, it's GitOps: one cell module, many instances with different variables. ArgoCD syncs the desired state, Datadog gives us one dashboard with drill-down to any cell, and runbooks are parameterized by cell_id. The complexity is in the automation, not the operator's head.

Breaking API changes don't roll through cells—they're handled at the Landing Zone. And schema migrations are actually easier with cell isolation—you migrate cell by cell, limiting blast radius to one cell instead of the whole platform."

Key Principles Summary

Principle                   | Implementation
----------------------------|------------------------------------------------------------------
Progressive Rollout         | Canary → SMB → Enterprise with increasing bake times
Decouple Deploy from Enable | Feature flags let you deploy everywhere, activate gradually
API Versioning at Edge      | Landing Zone routes /v1/* and /v2/* to different handlers
Schema: Expand-Contract     | Add column → Dual-write → Backfill → Drop old column
Cell Isolation Helps        | Blast radius limited to one cell for both deploys and migrations
GitOps Everything           | One module, N instances; ArgoCD ensures cells match repo state

The Meta-Principle

Cells are execution environments. They shouldn't make routing decisions about API versions. Push that complexity UP to the Landing Zone where you have global visibility and atomic control. Push operational complexity INTO automation, not operators' heads.