Distinguished Architect Interview Q&A

Architecture Deep Dive: RocksDB, Centrifuge, Kafka, and Data Infrastructure

How to Use This Document

This document prepares you for Distinguished Architect-level questions about the specific architectures used at Twilio Segment. Questions progress from foundational understanding to deep technical reasoning to trade-off analysis.

Topics: Kafka · RocksDB · Centrifuge · LSM Trees · Bloom Filters · Database-as-Queue · Immutable Architecture · Exactly-Once Delivery
For each question, practice giving a 2-minute answer first, then be ready to go deeper on follow-ups. Distinguished Architects need to explain complex systems simply, then demonstrate depth when probed.

Kafka & Event Streaming (Intermediate)

You're processing 400K events/second and need exactly-once semantics. How do you achieve this with Kafka?

Exactly-once in Kafka requires three components working together:

  1. Idempotent producer: Set enable.idempotence=true. Kafka assigns a producer ID and sequence number to each message, deduplicating retries within a session.
  2. Transactional writes: For cross-partition atomicity, use initTransactions(), beginTransaction(), commitTransaction(). Either all partitions get the message or none do.
  3. Consumer isolation: Set isolation.level=read_committed so consumers only see committed messages.
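
A minimal configuration sketch of those three settings, using confluent-kafka/librdkafka key names (the broker address, transactional.id, and group.id values are placeholders, not Segment's actual configuration):

```python
# Sketch of the three exactly-once settings using librdkafka/confluent-kafka
# configuration keys. Broker address, transactional.id, and group.id are
# hypothetical placeholders.

producer_conf = {
    "bootstrap.servers": "broker:9092",
    "enable.idempotence": True,            # (1) dedupe producer retries in-session
    "transactional.id": "tapi-writer-1",   # (2) enables cross-partition transactions
    "acks": "all",                         # required by idempotence
}

consumer_conf = {
    "bootstrap.servers": "broker:9092",
    "group.id": "dedup-workers",
    "isolation.level": "read_committed",   # (3) hide aborted/uncommitted messages
}

# Transactional produce flow (confluent_kafka.Producer API, shown as
# pseudocode here since it needs a live broker):
#   producer.init_transactions()
#   producer.begin_transaction()
#   producer.produce(topic, value=event)   # one or more partitions
#   producer.commit_transaction()          # all partitions or none
```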

But this is only exactly-once within Kafka. The real challenge is external system integration, which is why Segment uses application-level deduplication with RocksDB.

Why partition by messageId instead of customerId or sourceId?

Partitioning by messageId concentrates duplicates:

  • Same messageId always goes to the same partition and consumer
  • Deduplication becomes local - check one RocksDB instance, not a distributed lookup
  • If we partitioned by customerId, duplicates could land on different partitions, requiring cross-partition deduplication (expensive)

The trade-off: we lose per-customer ordering. But for analytics events, global ordering matters less than throughput. If you need per-customer ordering, you'd partition by customerId and accept the distributed dedup complexity.
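
The locality property is easy to demonstrate: a deterministic hash of messageId always yields the same partition, so a retried duplicate lands on the same worker (md5 here is an illustrative stand-in for Kafka's murmur2-based partitioner):

```python
import hashlib

NUM_PARTITIONS = 64

def partition_for(message_id: str) -> int:
    # Deterministic hash -> the same messageId always maps to the same
    # partition. (Kafka's default partitioner uses murmur2; md5 here is
    # just illustrative.)
    digest = hashlib.md5(message_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# A duplicate delivery of the same event hits the same partition,
# so the dedup check is a local RocksDB lookup on one worker.
assert partition_for("msg-abc-123") == partition_for("msg-abc-123")
```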

How does Segment handle Kafka broker failures?

Multi-tier failover within each TAPI shard:

  1. Primary cluster in the same AZ as the TAPI shard
  2. Secondary cluster in a different AZ, continuously replicated
  3. Replicated service monitors broker health metrics (latency, error rate, ISR shrinkage)
  4. When primary degrades beyond threshold, routes new writes to secondary
  5. Automatic revert when primary recovers

This is more sophisticated than Kafka's built-in replication because it handles entire cluster failures, not just broker failures.

Don't confuse Kafka partition replication (ISR) with cluster-level failover. Kafka's built-in replication handles individual broker failures. Segment adds cluster-level failover for AZ failures or cluster-wide issues.

RocksDB & Deduplication (Advanced)

Explain the role of RocksDB in Segment's architecture. Why not use Redis or DynamoDB?

RocksDB serves as a high-performance, embedded deduplication index. The key requirements:

  • 60 billion keys with 4-week retention
  • Sub-millisecond lookups for every incoming message
  • 1.5TB storage per partition

Why not alternatives:

| Option | Why Not |
|---|---|
| Redis | 60B keys × 40 bytes = 2.4TB RAM. Cost-prohibitive for a cache. |
| DynamoDB | Network latency on every message (~5-15ms) would throttle throughput. |
| In-memory HashMap | Same RAM problem as Redis, plus no persistence. |

RocksDB is embedded (no network hop), persistent (EBS-backed), and uses Bloom filters for the common case (new messages that haven't been seen).

Explain how Bloom filters work in RocksDB and why they're critical for this use case.

Bloom filters are probabilistic data structures that answer "definitely not in set" or "possibly in set."

How they work:

  1. Hash the key with k hash functions
  2. Set k bits in a bit array
  3. To query: hash again, check if all k bits are set
  4. If any bit is 0: definitely not seen (no disk read needed)
  5. If all bits are 1: possibly seen (read from LSM tree to confirm)

Why critical here: Most messages are new (not duplicates). For new messages, the Bloom filter returns "not seen" without touching disk. At 400K messages/second, avoiding disk reads on 99.4% of lookups is essential.
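
A toy Bloom filter makes the mechanics concrete (the bit-array size, hash count, and hashing scheme here are illustrative choices, not RocksDB's actual implementation):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions over an m-bit array."""
    def __init__(self, m_bits: int = 1 << 20, k: int = 7):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, key: str):
        # Derive k positions by salting the key with the hash index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # False -> definitely new (skip the disk read).
        # True  -> possibly seen (confirm against the LSM tree).
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = BloomFilter()
bf.add("msg-1")
assert bf.might_contain("msg-1")   # no false negatives, ever
# An unseen key is almost certainly reported "not seen" at this fill level,
# which is the fast no-disk path the dedup workers rely on.
```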

If RocksDB is derived from Kafka, what happens when a worker crashes? Walk me through recovery.
Crash Scenario:
1. Worker reads from Kafka INPUT
2. Writes to RocksDB
3. Publishes to Kafka OUTPUT
4. Commits offset to Kafka INPUT

Problem: What if crash between steps?

Case A: Crash after step 2, before step 3
- RocksDB has the ID, Kafka OUTPUT doesn't
- On restart: rebuild RocksDB from Kafka OUTPUT
- ID is NOT in rebuilt RocksDB, message is re-processed
- Correct behavior (at-least-once)

Case B: Crash after step 3, before step 4
- Kafka OUTPUT has the message
- On restart: Kafka replays from last committed offset
- Message re-processed, but RocksDB (rebuilt) knows it's a duplicate
- Correct behavior (deduplication handles it)

Key insight: Kafka OUTPUT is the commit point.
RocksDB is rebuildable from OUTPUT.
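
This invariant can be sanity-checked with a toy model in which the dedup index is purely derived state, rebuilt by folding over the OUTPUT topic (plain lists and sets stand in for Kafka and RocksDB):

```python
# Toy model: the dedup index is derived state, rebuilt by folding OUTPUT.
def rebuild_index(output_topic):
    return {msg["id"] for msg in output_topic}

def process(msg, index, output_topic):
    if msg["id"] in index:          # duplicate -> drop
        return
    index.add(msg["id"])            # step 2: write "RocksDB"
    output_topic.append(msg)        # step 3: publish (the commit point)

output = []
index = rebuild_index(output)
process({"id": "m1"}, index, output)

# Case A: crash after the index write but before publish.
index.add("m2")                     # "m2" never reached OUTPUT
index = rebuild_index(output)       # restart: rebuild from OUTPUT
process({"id": "m2"}, index, output)    # replayed and processed -- correct
assert [m["id"] for m in output] == ["m1", "m2"]

# Case B: crash after publish, before offset commit -> "m2" is replayed.
index = rebuild_index(output)
process({"id": "m2"}, index, output)    # rebuilt index drops the duplicate
assert [m["id"] for m in output] == ["m1", "m2"]
```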
Why did Segment switch from Memcached to RocksDB for deduplication?

Their engineering blog explicitly states:

"We no longer have a set of large Memcached instances which require failover replicas. Instead we use embedded RocksDB databases which require zero coordination."

Benefits of the switch:

  • No network hop: Memcached ~1ms, RocksDB local ~0.01ms
  • No coordination: No distributed cache invalidation protocols
  • No failover replicas: RocksDB is rebuilt from Kafka, not replicated
  • Simpler ops: No Memcached cluster to manage

Trade-off: Rebuild time on cold start (minutes). Acceptable because failures are rare and Kafka buffers during rebuild.

When discussing RocksDB, always tie it back to the Kafka source of truth. This shows you understand the full system, not just individual components.

Centrifuge: Database-as-Queue (Advanced)

Why did Segment build Centrifuge instead of using Kafka, RabbitMQ, or SQS for delivery to external APIs?

The core problem: 88,000 logical queues.

Segment has 42K sources sending to an average of 2.1 destinations = 88K source-destination pairs. Each needs:

  • Fair scheduling (big customer shouldn't block small customer)
  • Isolated failure handling (Google Analytics down shouldn't block Salesforce)
  • Per-pair retry policies

Why queues fail:

| Approach | Problem |
|---|---|
| Single queue | One slow endpoint backs up everything |
| Per-destination queue | Big customers dominate, small customers starve |
| Per-pair queue (88K) | Kafka/RabbitMQ can't handle this cardinality |

Database-as-queue insight: MySQL gives SQL flexibility for dynamic QoS. "Change QoS by running a single SQL statement." Need to deprioritize a failing destination? One UPDATE. Queues don't give you that.

Explain the immutable rows design in Centrifuge. Why no UPDATEs?

Schema design:

-- jobs: immutable after insert
CREATE TABLE jobs (
    ksuid CHAR(27) PRIMARY KEY,
    payload BLOB,
    endpoint VARCHAR(255),
    expires_at TIMESTAMP
);

-- job_state_transitions: append-only log
CREATE TABLE job_state_transitions (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    job_ksuid CHAR(27),
    to_state ENUM('awaiting_scheduling','executing','succeeded',
                  'discarded','awaiting_retry','archived'),
    transitioned_at TIMESTAMP
);

-- Current state = most recent transition (join back to read to_state)
SELECT j.* FROM jobs j
JOIN (SELECT job_ksuid, MAX(id) AS latest
      FROM job_state_transitions GROUP BY job_ksuid) t
  ON j.ksuid = t.job_ksuid
JOIN job_state_transitions s
  ON s.id = t.latest
WHERE s.to_state = 'awaiting_scheduling';

Benefits:

  • No write contention (no row locks on UPDATE)
  • No index fragmentation (KSUID is time-ordered)
  • Audit trail for free (every state change recorded)
  • Simple debugging (replay state history)
Explain the TABLE DROP trick. Why is it better than DELETE?

Traditional approach:

DELETE FROM jobs WHERE status = 'completed' AND created_at < NOW() - INTERVAL 1 HOUR;
-- Problems:
-- 1. Locks rows during deletion
-- 2. Creates index fragmentation
-- 3. Triggers expensive vacuuming/compaction
-- 4. Slow for millions of rows

Centrifuge approach:

1. Create new JobDB every ~30 minutes
2. Route new jobs to newest database
3. Old databases drain retry queues (4-hour window)
4. When database empty:
   DROP TABLE jobs;
   DROP TABLE job_state_transitions;
   -- Instant, regardless of table size

DROP TABLE is O(1) - it just removes metadata. DELETE is O(n) with all the cleanup overhead. At 400K writes/second, this difference is massive.

Walk me through how a Director works in Centrifuge.
  1. Acquire lock: Consul session locks a specific JobDB (one Director per database)
  2. Load jobs: SELECT awaiting_scheduling jobs into memory cache
  3. Accept RPCs: Upstream Kafka consumers call Director to log new jobs (INSERT)
  4. Execute HTTP: Pop from memory cache, make HTTP call to external API
  5. Handle response:
    • 200 OK: INSERT state → succeeded
    • 5xx/timeout: INSERT state → awaiting_retry
    • 4xx: INSERT state → discarded
  6. Retry loop: Background thread scans awaiting_retry with exponential backoff
  7. Archive: After 4 hours, write to S3, INSERT state → archived

Scale: 80-300 Directors at peak, scaled by CPU utilization. Each Director owns one JobDB.
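
Steps 5-6 reduce to a small response-to-state mapping plus a backoff schedule; a sketch using the state names above (the backoff base and cap constants are assumptions, with the overall retry budget being the 4-hour window):

```python
from typing import Optional

def next_state(http_status: Optional[int]) -> str:
    """Map a delivery attempt outcome to the job's next state."""
    if http_status is None:            # timeout / connection error
        return "awaiting_retry"
    if 200 <= http_status < 300:
        return "succeeded"
    if http_status >= 500:             # endpoint down: retry within the window
        return "awaiting_retry"
    return "discarded"                 # 4xx: the request itself is bad

def backoff_seconds(attempt: int, base: float = 5.0, cap: float = 3600.0) -> float:
    # Exponential backoff with a cap; base/cap values are illustrative.
    return min(cap, base * (2 ** attempt))

assert next_state(200) == "succeeded"
assert next_state(503) == "awaiting_retry"
assert next_state(404) == "discarded"
```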

When discussing Centrifuge, emphasize it's for external API delivery, not internal Kafka processing. The 4-hour retry window handles third-party outages (Google Analytics down, Salesforce rate limiting), not Segment's internal infrastructure.

Cell-Based Architecture for Twilio (Advanced)

This section ties together everything: how would you design Twilio's entire platform (SMS, Voice, Video, Segment, Verify, SendGrid) using cell-based architecture? This is the "big picture" question that demonstrates Distinguished Architect thinking.
How would you design a cell-based architecture for Twilio's platform?

Core principle: Product-agnostic cells, operationally-differentiated tiers.

Each cell runs ALL Twilio services—SMS, Voice, Video, WhatsApp, Email (SendGrid), Verify, Segment CDP. The differentiation is HOW cells are operated (Enterprise vs SMB), not WHAT they run.

┌─────────────────────────────────────────────────────────────────────────────┐
│                      TWILIO CELL-BASED ARCHITECTURE                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  LANDING ZONE (Management Plane)                                             │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │  AWS Control Tower │ IAM Identity Center │ CloudTrail │ SCPs           │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                    │                                         │
│  GLOBAL/PLATFORM SERVICES (Tier 0 - Multi-Region)                           │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │  Identity (Stytch) │ API Key Mgmt │ Billing │ Cell Router │ TrustHub   │ │
│  │  [DynamoDB Global Tables - Multi-Master Replication]                   │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                    │                                         │
│  CELL ROUTER (Data Plane - Lambda @ Edge)                                   │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │  1. Check Redis cache (95% hit, ~5ms)                                  │ │
│  │  2. Query DynamoDB on miss (~15ms)                                     │ │
│  │  3. Assign new customers to least-loaded cell (~20ms)                  │ │
│  │  4. Set X-Twilio-Cell-ID header → VPC Lattice routes to cell           │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                    │                                         │
│           ┌────────────────────────┼────────────────────────┐               │
│           ▼                        ▼                        ▼               │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐         │
│  │ ENTERPRISE CELL │    │ MID-MARKET CELL │    │    SMB CELL     │         │
│  │ (AWS Account)   │    │ (AWS Account)   │    │ (AWS Account)   │         │
│  │                 │    │                 │    │                 │         │
│  │ VPC 10.0.0.0/16 │    │ VPC 10.0.0.0/16 │    │ VPC 10.0.0.0/16 │         │
│  │ (overlapping OK │    │ (overlapping OK │    │ (overlapping OK │         │
│  │  via VPC Lattice)│    │  via VPC Lattice)│    │  via VPC Lattice)│         │
│  │                 │    │                 │    │                 │         │
│  │ ALL SERVICES:   │    │ ALL SERVICES:   │    │ ALL SERVICES:   │         │
│  │ • SMS Gateway   │    │ • SMS Gateway   │    │ • SMS Gateway   │         │
│  │ • Voice (WebRTC)│    │ • Voice (WebRTC)│    │ • Voice (WebRTC)│         │
│  │ • Video SFU     │    │ • Video SFU     │    │ • Video SFU     │         │
│  │ • WhatsApp      │    │ • WhatsApp      │    │ • WhatsApp      │         │
│  │ • SendGrid      │    │ • SendGrid      │    │ • SendGrid      │         │
│  │ • Verify        │    │ • Verify        │    │ • Verify        │         │
│  │ • Segment CDP   │    │ • Segment CDP   │    │ • Segment CDP   │         │
│  │   (Kafka+RocksDB│    │   (Kafka+RocksDB│    │   (Kafka+RocksDB│         │
│  │    +Centrifuge) │    │    +Centrifuge) │    │    +Centrifuge) │         │
│  │                 │    │                 │    │                 │         │
│  │ 10-50 customers │    │ 100-500 customers│   │ 1000+ customers │         │
│  │ 99.99% SLA      │    │ 99.95% SLA      │    │ 99.9% SLA       │         │
│  └─────────────────┘    └─────────────────┘    └─────────────────┘         │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
How would Kafka, RocksDB, and Centrifuge be deployed in a cell-based architecture?

Each cell has its own complete Segment CDP stack:

WITHIN EACH CELL:
┌─────────────────────────────────────────────────────────────────┐
│                     CELL: ENTERPRISE-US-EAST                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  SEGMENT CDP (per-cell instance)                                │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │  Tracking API (Go) ──► NSQ buffer ──► Kafka (cell-local)   │ │
│  │                                              │              │ │
│  │                                              ▼              │ │
│  │  Dedup Workers ◄── RocksDB (EBS-backed, per-partition)     │ │
│  │        │                                                    │ │
│  │        ▼                                                    │ │
│  │  Centrifuge (MySQL/RDS per-cell) ──► External destinations │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                  │
│  MESSAGING SERVICES (per-cell instance)                         │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │  SMS Gateway ──► Account Queue ──► Super Network           │ │
│  │  Voice Gateway ──► Media Server ──► PSTN/SIP              │ │
│  │  WhatsApp Business API ──► Meta Cloud API                  │ │
│  │  SendGrid MTA ──► ISP connections                          │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                  │
│  SHARED WITHIN CELL                                             │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │  EKS Cluster (100-200 nodes) │ Aurora PostgreSQL (primary) │ │
│  │  ElastiCache Redis │ MSK (Managed Kafka) │ S3 buckets      │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

KEY INSIGHT: Kafka is CELL-LOCAL, not global.
- Each cell has its own MSK cluster
- No cross-cell Kafka consumers
- Blast radius = one cell

Why cell-local Kafka?

  • Global Kafka = single point of failure for entire platform
  • Cell-local Kafka = cell failure affects only that cell's customers
  • Scale independently: Enterprise cells get bigger Kafka clusters
How do you handle the Super Network (carrier connections) in a cell-based architecture?

The Super Network is a special case: it's shared, not cell-local.

┌─────────────────────────────────────────────────────────────────┐
│                    SUPER NETWORK ARCHITECTURE                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  CELLS (Data Plane)                                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                      │
│  │Enterprise│  │Mid-Market│  │   SMB    │                      │
│  │   Cell   │  │   Cell   │  │   Cell   │                      │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘                      │
│       │             │             │                              │
│       └─────────────┼─────────────┘                              │
│                     ▼                                            │
│  SUPER NETWORK (Shared Infrastructure)                          │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                                                             │ │
│  │  Traffic Optimization Engine                                │ │
│  │  • 900+ data points per message                            │ │
│  │  • 3.2B data points monitored daily                        │ │
│  │  • 4x redundant routes per destination                     │ │
│  │                                                             │ │
│  │  SMPP Gateways (Regional)                                  │ │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐      │ │
│  │  │ US-EAST  │ │ US-WEST  │ │   EU     │ │   APAC   │      │ │
│  │  │ Gateway  │ │ Gateway  │ │ Gateway  │ │ Gateway  │      │ │
│  │  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘      │ │
│  │       │            │            │            │              │ │
│  │       └────────────┴────────────┴────────────┘              │ │
│  │                          │                                  │ │
│  │                          ▼                                  │ │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐          │ │
│  │  │  AT&T   │ │ Verizon │ │T-Mobile │ │Vodafone │ ...4800  │ │
│  │  └─────────┘ └─────────┘ └─────────┘ └─────────┘          │ │
│  │                                                             │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

WHY SHARED?
• Carrier connections are expensive to establish (contracts, compliance)
• 4,800 connections × N cells = unsustainable
• Traffic Optimization Engine needs global visibility for routing decisions
• Carriers don't care about your internal cell boundaries

Isolation strategy: Cells send to Super Network via internal API. Super Network is stateless (no customer data). If Super Network degrades, all cells degrade (acceptable - it's the carrier connection, not Twilio's logic).

How does the Cell Router work? Walk through a request.
REQUEST FLOW: Customer sends SMS via API

1. REQUEST ARRIVES AT EDGE
   POST https://api.twilio.com/2010-04-01/Accounts/AC123/Messages.json
   Authorization: Basic {API_KEY}

2. GLOBAL IDENTITY CHECK (Platform Services)
   → Validate API key (DynamoDB Global Tables)
   → Extract account_id: AC123
   → Check TrustHub compliance status

3. CELL ROUTER (Lambda @ Edge)
   async function routeToCell(accountId, region) {
     // Step 1: Check Redis cache (95% hit rate)
     let cellId = await redis.get(`cell:${accountId}:${region}`);
     if (cellId) return { cellId, latency: '~5ms' };

     // Step 2: Query DynamoDB
     const record = await dynamodb.get({
       TableName: 'customer_cell_mapping',
       Key: { account_id: accountId, region: region }
     });
     if (record.Item) {
       await redis.setEx(`cell:${accountId}:${region}`, 3600, record.Item.cell_id);
       return { cellId: record.Item.cell_id, latency: '~15ms' };
     }

     // Step 3: New customer - assign to least-loaded cell
     const cells = await getCellMetrics(region);  // per-cell CPU, customer count
     const bestCell = cells.sort((a,b) => a.loadScore - b.loadScore)[0];

     await dynamodb.put({
       TableName: 'customer_cell_mapping',
       Item: { account_id: accountId, region, cell_id: bestCell.id },
       ConditionExpression: 'attribute_not_exists(account_id)'  // Prevent race
     });

     return { cellId: bestCell.id, latency: '~20ms' };
   }

4. ROUTE TO CELL VIA VPC LATTICE
   → Set header: X-Twilio-Cell-ID: enterprise-us-east-1
   → VPC Lattice routes by service name (not IP address)
   → Request arrives at cell's SMS Gateway

5. CELL PROCESSES REQUEST
   → SMS Gateway validates, queues, sends to Super Network
   → All processing happens within cell boundary
   → Response returns via same path
How do you handle customer migrations between cells (e.g., SMB to Enterprise)?

Control Plane orchestrates multi-phase migration:

MIGRATION PHASES:

Phase 1: PREPARE (hours before)
├── Snapshot source cell data for customer
├── Begin async replication to target cell
└── Verify target cell has capacity

Phase 2: DUAL-WRITE (brief window)
├── Update routing: both cells receive writes
├── Source cell forwards to target cell
└── Ensures no data loss during cutover

Phase 3: CUTOVER (seconds)
├── Update DynamoDB: customer → new cell
├── Invalidate Redis cache (SNS broadcast)
├── New requests go to target cell immediately
└── Source cell rejects with redirect

Phase 4: DRAIN (minutes to hours)
├── Source cell processes in-flight requests
├── Centrifuge drains retry queues
├── Kafka consumers process remaining messages
└── Verify all data migrated

Phase 5: CLEANUP
├── Delete customer data from source cell
├── Update billing/metering
└── Notify customer (if requested)

TRAFFIC SHIFTING (optional for large customers):
Instead of instant cutover, shift gradually:
- 10% → target cell (monitor errors)
- 50% → target cell (monitor latency)
- 100% → target cell (complete migration)
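
The gradual traffic shift can be made sticky with a stable per-account hash bucketed into 100 slices, so the same account is always on the same side of the cutover percentage as it ramps (function names are illustrative, not Twilio's actual routing code):

```python
import hashlib

def shift_bucket(account_id: str) -> int:
    # Stable 0-99 slice per account: an account never flaps between cells
    # as the shifted percentage ramps up.
    h = hashlib.sha256(account_id.encode()).digest()
    return int.from_bytes(h[:4], "big") % 100

def route(account_id: str, percent_shifted: int,
          source_cell: str, target_cell: str) -> str:
    return target_cell if shift_bucket(account_id) < percent_shifted else source_cell

# Ramp 10% -> 50% -> 100%: once an account's bucket falls under the
# threshold, it stays on the target cell at every later (higher) percentage.
assert route("AC123", 100, "smb-1", "ent-1") == "ent-1"
assert route("AC123", 0, "smb-1", "ent-1") == "smb-1"
```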
What's the cost/benefit trade-off of cell-based architecture for Twilio?
| Benefit | Cost |
|---|---|
| Blast radius reduction: Cell failure = 10-1000 customers, not millions | Operational overhead: N cells to deploy, monitor, patch |
| Noisy neighbor isolation: SMB traffic spike doesn't affect Enterprise SLAs | Resource duplication: Each cell runs full stack (Kafka, RDS, EKS) |
| Independent scaling: Scale Enterprise cells without touching SMB | Cross-cell complexity: Migrations, global services, routing layer |
| Compliance isolation: HIPAA customers in dedicated cells with audit controls | Cost inefficiency: Under-utilized cells still have base costs |
| Simpler capacity planning: Plan per cell, not globally | Engineering complexity: Team must understand cell boundaries |

Net assessment: For Twilio's scale (10M+ developers, enterprise SLAs, global presence), the blast radius reduction and noisy neighbor isolation justify the operational overhead. The alternative—one global deployment—means any failure affects everyone.

How do you size cells? What metrics trigger creating a new cell?

Cell capacity thresholds (Control Plane monitors these):

| Metric | Enterprise Cell | SMB Cell | Action |
|---|---|---|---|
| Customer count | 50 | 2,000 | Create new cell when exceeded |
| CPU utilization | 60% | 70% | Migrate customers or scale cell |
| Kafka lag | 1 minute | 5 minutes | Scale consumers or create cell |
| RDS connections | 70% | 80% | Scale RDS or migrate customers |
| API latency P99 | 200ms | 500ms | Investigate, possibly migrate |

New cell provisioning (Control Plane automation):

1. Trigger: Cell at 70% capacity threshold
2. Control Plane initiates AWS Account Factory
3. Terraform provisions: VPC, EKS, RDS, MSK, etc. (~30 min)
4. Deploy services via ArgoCD (~10 min)
5. Health checks pass
6. Update hash ring: new cell joins rotation
7. New customers assigned to new cell
8. Existing customers stay put (no automatic migration)
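
The hash ring in step 6 can be sketched as a minimal consistent-hash ring (class name and vnode count are assumptions). The useful property: when a new cell joins the rotation, existing accounts either stay put or move to the new cell, never between old cells:

```python
import bisect
import hashlib

class CellRing:
    """Minimal consistent-hash ring; the vnode count is a tunable assumption."""
    def __init__(self, vnodes: int = 100):
        self.vnodes = vnodes
        self.ring = []            # sorted list of (hash, cell_id)

    def _hash(self, key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add_cell(self, cell_id: str):
        # Each cell owns many virtual nodes to spread load evenly.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{cell_id}#{i}"), cell_id))

    def cell_for(self, account_id: str) -> str:
        # Walk clockwise to the first vnode at or after the account's hash.
        h = self._hash(account_id)
        idx = bisect.bisect_right(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = CellRing()
ring.add_cell("smb-us-east-1")
ring.add_cell("smb-us-east-2")
before = ring.cell_for("AC999")
ring.add_cell("smb-us-east-3")    # new cell joins rotation
# "AC999" either kept its cell or moved to the new one -- never shuffled
# between the two old cells.
assert ring.cell_for("AC999") in {before, "smb-us-east-3"}
```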

Interview Talking Point: The 2-Minute Cell Architecture Answer

"I'd design Twilio's platform with product-agnostic, operationally-differentiated cells. Each cell runs ALL services—SMS, Voice, Video, Segment, Verify, SendGrid—in a single AWS account with its own VPC. The key insight is aligning customers to cells, not products, which eliminates cross-cell API calls and cascading failures.

The differentiation is operational tier: Enterprise cells have 10-50 customers with N+2 redundancy and 99.99% SLAs. SMB cells pack 1000+ customers with cost-optimized infrastructure and 99.9% SLAs. Same services, different operational posture.

For routing, I'd use a Cell Router at the edge—Lambda checking Redis cache first (95% hit rate), falling back to DynamoDB Global Tables. New customers get assigned to the least-loaded cell in their segment. VPC Lattice routes requests by service name, not IP address, which lets every cell use the same 10.0.0.0/16 CIDR without conflicts.

The trade-off is operational overhead—N cells to deploy and monitor—but the blast radius reduction is worth it. A cell failure affects 50 customers, not millions. And noisy neighbor isolation means an SMB traffic spike can't impact Enterprise SLAs."

Remember: The Super Network (4,800 carrier connections) is shared, not cell-local. Carrier connections are too expensive to replicate per cell. This is an intentional trade-off: carrier connectivity is a shared dependency, but it's stateless and externally managed.

Trade-off Questions (Advanced)

What are the trade-offs of embedding RocksDB versus using a centralized cache like Redis?
| Aspect | Embedded RocksDB | Centralized Redis |
|---|---|---|
| Latency | ~0.01ms (local) | ~1ms (network) |
| Operational complexity | None (embedded) | Cluster management, failover |
| Data locality | Must partition by messageId | Any partitioning works |
| Rebuild time | Minutes (read Kafka) | Gradual (cache warming) |
| Memory efficiency | Disk-backed (1.5TB ok) | RAM-backed (2.4TB expensive) |
| Cross-partition queries | Impossible | Easy |

When to choose RocksDB: High throughput, predictable access patterns, data fits single partition, can tolerate rebuild time.

When to choose Redis: Cross-partition queries needed, unpredictable access patterns, millisecond latency acceptable, data fits in RAM budget.

You've shown that Segment uses Kafka, RocksDB, and MySQL (Centrifuge) in the same pipeline. How do you decide when to use each?
| Technology | Use When | Example in Segment |
|---|---|---|
| Kafka | Durable log, replay needed, ordering within partition | Event backbone, source of truth |
| RocksDB | Local state, high-throughput lookups, derived from log | Deduplication index (60B keys) |
| MySQL (Centrifuge) | Complex querying, dynamic QoS, high cardinality queues | Delivery to 700+ external APIs |

Design principle: Use the simplest tool that meets requirements. Kafka when you need a log. RocksDB when you need local state. SQL when you need query flexibility.

What are the failure modes of this architecture and how are they handled?
| Failure | Impact | Mitigation |
|---|---|---|
| Kafka broker failure | Writes blocked | Multi-cluster failover, secondary cluster in different AZ |
| RocksDB corruption | Dedup state lost | Rebuild from Kafka OUTPUT topic |
| Director crash | Jobs orphaned | Consul TTL releases lock, new Director picks up |
| External API outage | Delivery blocked | 4-hour retry queue, S3 archival |
| AZ failure | Partial outage | TAPI shards + Kafka clusters span AZs |

Key insight: Every component is either stateless (can restart) or has state derived from a durable log (can rebuild). This is the core of event sourcing reliability.

System Design Questions (Advanced)

Design a system to process 1M events/second with exactly-once semantics and reliable delivery to 1000 external APIs.

This is essentially Segment's architecture. Walk through it:

1. INGESTION LAYER
   - Sharded API servers (TAPI) behind load balancer
   - Local NSQ buffer per server for backpressure
   - Write to Kafka with idempotent producer

2. KAFKA BACKBONE
   - Multi-AZ deployment with cluster failover
   - Partition by messageId (for dedup locality)
   - Retain 7 days for replay

3. DEDUPLICATION WORKERS
   - Consume from Kafka INPUT
   - Embedded RocksDB for 60B key lookups
   - Bloom filters for fast "not seen" path
   - Publish to Kafka OUTPUT (this is the commit point)

4. TRANSFORMATION WORKERS
   - JSON parsing, schema validation
   - Custom transforms per destination
   - Partition by destination for locality

5. DELIVERY LAYER (Centrifuge-style)
   - MySQL-based job queue (not traditional queue)
   - Immutable rows, state transition table
   - Directors with Consul locks
   - 4-hour retry with exponential backoff
   - S3 archival for undelivered

6. OBSERVABILITY
   - Per-stage latency histograms
   - Dedup rate monitoring (expect ~0.6%)
   - Per-destination success rate
   - Alerting on retry queue growth
How would you migrate this system to handle 10x traffic without downtime?

Scale each layer independently:

  1. Ingestion: Add TAPI shards. Horizontal scaling, just DNS/LB changes.
  2. Kafka: Add partitions (careful: can't reduce later). Consumer groups auto-rebalance.
  3. Dedup workers: Scale with partitions. Each worker owns a partition subset.
  4. RocksDB: More partitions = smaller RocksDB per worker. Or upgrade to larger EBS volumes.
  5. Centrifuge: More Directors + JobDBs. Scale by CPU utilization.

Key insight: The architecture was designed for horizontal scaling. No single component is a bottleneck because state is partitioned.
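
One caveat on step 2 worth knowing: with hash partitioning, adding partitions changes key-to-partition assignments, so messageId→RocksDB locality is disturbed until dedup state is rebuilt for the new layout. A quick illustration (md5 stands in for Kafka's murmur2 partitioner):

```python
import hashlib

def partition_for(message_id: str, num_partitions: int) -> int:
    # Illustrative stand-in for Kafka's key partitioner (which uses murmur2).
    h = int.from_bytes(hashlib.md5(message_id.encode()).digest()[:4], "big")
    return h % num_partitions

keys = [f"msg-{i}" for i in range(1000)]
moved = sum(1 for k in keys if partition_for(k, 64) != partition_for(k, 128))
# Roughly half of all keys map to a different partition after doubling,
# so each worker's RocksDB no longer covers its partition's full history.
assert moved > 0
```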

If you were starting from scratch today, what would you do differently?

Segment's architecture is 2016-era. Modern alternatives:

  • Kafka Streams/Flink: Instead of custom dedup workers, use streaming frameworks with built-in state management.
  • FoundationDB/TiKV: Instead of per-partition RocksDB, use distributed KV store with transactions.
  • Temporal/Cadence: Instead of custom Centrifuge, use workflow orchestration for reliable delivery.
  • Pulsar: Instead of Kafka + failover logic, use built-in geo-replication.

But: Segment's custom approach gives fine-grained control. The TABLE DROP trick in Centrifuge is hard to replicate in off-the-shelf systems. Sometimes custom wins.

Technical Leadership Questions (Intermediate)

How do you explain RocksDB's role to a non-technical stakeholder?

Analogy: "Imagine a hotel front desk checking if a guest has already checked in. They could call every room (slow), keep a list in their head (impossible for 60 billion guests), or have a quick lookup book right at the desk. RocksDB is that lookup book - it's right next to the worker, instant to check, and can handle billions of entries."

Business impact: "This lets us process 400,000 events per second without duplicates. Duplicates mean wrong analytics, double-charged customers, and broken integrations. RocksDB prevents that at scale."

How do you decide when to build custom infrastructure versus using off-the-shelf?

Build custom when:

  • Off-the-shelf doesn't meet scale requirements (88K queues)
  • Critical differentiator (reliability IS the product)
  • Control needed for specific optimizations (TABLE DROP trick)
  • Team has expertise to build and maintain

Use off-the-shelf when:

  • Not a differentiator (auth, payments, logging)
  • Maintenance burden exceeds value
  • Faster time to market needed

Segment's choice: Custom Centrifuge because queue cardinality was a hard requirement. Used AWS RDS (off-the-shelf) for the database underneath.

How do you ensure this architecture is maintainable by a team that didn't build it?

  • Documentation: Architecture decision records (ADRs) explaining WHY, not just WHAT.
  • Observability: Every component emits metrics. Dashboards show system health at a glance.
  • Runbooks: Step-by-step guides for common failures. "RocksDB rebuild takes 15 minutes, here's how to monitor."
  • Testing: Chaos engineering (kill Directors, corrupt RocksDB) to verify recovery works.
  • Simplicity: Each component has ONE job. Dedup workers dedupe. Directors deliver. No god services.

Behavioral & Leadership Questions DA Focus

Distinguished Architect interviews heavily weight leadership behaviors. Use the STAR format (Situation, Task, Action, Result) but emphasize the reasoning behind decisions and organizational impact. These questions assess whether you can operate at a company-wide level.

Influencing Without Authority

Tell me about a time you had to drive a major architectural change across teams that didn't report to you.

What they're assessing: Can you influence at scale without positional authority?

Structure your answer:

  1. Context: What was the architectural problem? Why did it matter to the business?
  2. Stakeholder landscape: Who needed to be convinced? What were their concerns?
  3. Your approach: How did you build alignment? (Data, prototypes, 1:1s, written proposals)
  4. Resistance: What pushback did you face? How did you address it?
  5. Outcome: What changed? What was the business impact?

Example themes:

  • Migrating from monolith to microservices (or vice versa)
  • Adopting cell-based architecture across product lines
  • Standardizing on a new database or messaging technology
  • Implementing SLO/SLI culture across engineering

Describe a situation where you had to say "no" to a product or business request for technical reasons.

What they're assessing: Can you hold technical standards while maintaining business partnerships?

Key elements:

  • Understand the "why": What business outcome were they trying to achieve?
  • Quantify the risk: "This would increase P0 incidents by 3x" not "this is bad"
  • Offer alternatives: Never just say no—propose a path that meets the need safely
  • Escalate appropriately: If you can't align, bring in leadership with clear trade-offs

Example: "Product wanted to ship a feature that bypassed our rate limiting. I explained that our rate limits protect carrier relationships—if we get flagged for spam, we lose sender reputation for ALL customers. Instead, I proposed a tiered approach where verified enterprise customers get higher limits through a separate queue."
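
The "tiered approach" in that example can be sketched as one token bucket per customer tier, so higher-trust traffic gets a higher limit without anyone bypassing rate limiting entirely. The tier names and limits below are illustrative assumptions, not Twilio's actual values.

```python
import time

class TokenBucket:
    """Classic token-bucket limiter: tokens refill at `rate` per second
    up to `capacity`; each admitted request spends one token."""

    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per tier instead of one global bypass (illustrative limits).
limiters = {"standard": TokenBucket(rate=10, capacity=10),
            "enterprise": TokenBucket(rate=100, capacity=100)}

def admit(tier: str) -> bool:
    return limiters[tier].allow()
```

The design point matches the story: the enterprise tier still has a ceiling that protects carrier relationships; it is simply a higher one.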

Technical Decision Making

Tell me about a technical decision you made that you later realized was wrong. What did you do?

What they're assessing: Self-awareness, learning orientation, intellectual honesty.

Structure:

  1. The decision: What did you decide and why did it seem right at the time?
  2. The signals: How did you realize it was wrong? (metrics, team feedback, incidents)
  3. Your response: Did you course-correct quickly or defend the decision?
  4. The fix: What did you do to remediate?
  5. The learning: What would you do differently? How did you prevent similar mistakes?

Good example themes:

  • Over-engineering (built for scale that never came)
  • Under-engineering (cut corners that caused incidents)
  • Technology choice that didn't fit the team's skills
  • Premature optimization vs. premature abstraction

How do you balance short-term delivery pressure with long-term architectural health?

What they're assessing: Strategic thinking, ability to communicate technical debt.

Framework:

SHORT-TERM vs LONG-TERM DECISION FRAMEWORK

1. CLASSIFY THE DEBT
   - Deliberate & Prudent: "We know this is a shortcut, ship now, fix in Q2"
   - Deliberate & Reckless: "We don't have time for design" (avoid this)
   - Inadvertent: "We didn't know better" (learning opportunity)

2. QUANTIFY THE COST
   - Developer productivity impact (hours/week)
   - Incident frequency and MTTR
   - Feature velocity slowdown
   - Recruitment/retention impact

3. MAKE IT VISIBLE
   - Tech debt backlog with business impact estimates
   - "Debt service" time in sprint planning (20% rule)
   - Architecture fitness functions that alert on degradation

4. NEGOTIATE EXPLICITLY
   - "We can ship in 2 weeks with debt, or 4 weeks clean"
   - "This debt will cost us 2 engineer-weeks per quarter until fixed"
   - Get business sign-off on the trade-off

Key quote: "I never let debt accumulate silently. If we're taking a shortcut, I make sure leadership understands the interest payments we'll make later."
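
The "negotiate explicitly" numbers reduce to a one-line break-even calculation. A sketch using the framework's own example figures (2 extra weeks to ship clean; 2 engineer-weeks of interest per quarter if you don't):

```python
def quarters_to_break_even(extra_cost_to_fix_weeks: float,
                           interest_weeks_per_quarter: float) -> float:
    """Quarters after which fixing now becomes cheaper than carrying the debt."""
    return extra_cost_to_fix_weeks / interest_weeks_per_quarter

def carrying_cost(interest_weeks_per_quarter: float, quarters: int) -> float:
    """Total engineer-weeks paid in 'interest' over a planning horizon."""
    return interest_weeks_per_quarter * quarters

# With the framework's numbers, the debt pays for itself after one quarter,
# and carrying it for a year costs 8 engineer-weeks.
break_even = quarters_to_break_even(extra_cost_to_fix_weeks=2,
                                    interest_weeks_per_quarter=2)
year_of_interest = carrying_cost(interest_weeks_per_quarter=2, quarters=4)
```

Numbers this simple are the point: leadership can sign off on a trade-off they can verify on a whiteboard.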

Organizational Impact

How have you raised the technical bar across an organization?

What they're assessing: Ability to scale your impact beyond your immediate team.

Examples of raising the bar:

  • Standards & Guidelines: Created API design standards adopted across 12 teams
  • Review Processes: Established architecture review board for cross-cutting changes
  • Tooling: Built internal platforms that encoded best practices
  • Education: Ran workshops, wrote internal docs, mentored senior engineers
  • Hiring: Raised interview bar, calibrated technical assessments
  • Culture: Modeled behaviors (blameless postmortems, design docs, ADRs)

Metrics to cite:

  • Reduction in P0 incidents
  • Improved deployment frequency
  • Faster onboarding time for new engineers
  • Increased reuse of shared components

Tell me about a time you had to align multiple teams with competing priorities on a shared technical initiative.

What they're assessing: Cross-functional leadership, prioritization, stakeholder management.

Structure:

  1. The initiative: What were you trying to accomplish? Why did it need multiple teams?
  2. The conflicts: What were the competing priorities? Why couldn't teams just agree?
  3. Your role: How did you facilitate alignment? (workshops, proposals, escalation)
  4. The trade-offs: What did you give up to get alignment? What was non-negotiable?
  5. The outcome: Did the initiative succeed? What would you do differently?

Example: "We needed to migrate to cell-based architecture, but each product team had different timelines. I created a migration framework where teams could adopt at their own pace, but I held firm on the shared cell router and routing table schema. Teams got flexibility on timing; we got architectural consistency."

Mentorship & Team Development

How do you develop senior engineers into staff/principal engineers?

What they're assessing: Do you multiply your impact through others?

Development approaches:

From      | To            | Development Focus
Senior    | Staff         | Scope expansion: own a system, not just features
Staff     | Principal     | Influence expansion: drive org-wide initiatives
Principal | Distinguished | Industry impact: thought leadership, external influence

Concrete tactics:

  • Stretch assignments: Give them ownership of ambiguous, cross-team problems
  • Visibility: Put them in front of leadership, let them present their work
  • Feedback loops: Regular 1:1s focused on growth, not just status
  • Sponsorship: Advocate for them in calibration and promotion discussions
  • Modeling: Show them what good looks like (design docs, influence, decisions)

Describe a conflict between two senior engineers on your team. How did you handle it?

What they're assessing: Conflict resolution, maintaining psychological safety.

Framework:

  1. Understand both perspectives: 1:1 conversations before any group discussion
  2. Separate people from positions: Focus on the technical trade-offs, not personalities
  3. Find the shared goal: What do both parties actually want for the system/team?
  4. Make it safe to disagree: "We can disagree on approach and still respect each other"
  5. Decide and move on: If consensus isn't possible, make a call and commit together

Red flags to avoid:

  • Taking sides publicly
  • Letting conflict fester ("they'll work it out")
  • Making it about who's "right"
  • Punishing dissent

Crisis & Incident Leadership

Tell me about a major production incident you led the response for. What did you do?

What they're assessing: Composure under pressure, systematic thinking, communication.

Structure (Incident Timeline):

1. DETECTION
   - How did you find out? (Alerts, customer report, intuition)
   - How long from start to detection?

2. TRIAGE
   - How did you assess severity and scope?
   - Who did you pull in? How did you communicate?

3. MITIGATION
   - What was the first thing you did to stop the bleeding?
   - Trade-offs: speed vs. thoroughness

4. RESOLUTION
   - Root cause identification
   - The actual fix

5. COMMUNICATION
   - Internal: leadership, other teams
   - External: customers, status page

6. FOLLOW-UP
   - Blameless postmortem
   - Action items and ownership
   - Systemic improvements

DA-level additions:

  • How did you shield the team from exec pressure during the incident?
  • What systemic changes did you drive afterward?
  • How did you turn the incident into a learning opportunity?

How do you build a culture of reliability and operational excellence?

What they're assessing: Can you shape culture, not just respond to incidents?

Cultural elements:

  • SLOs as contracts: Teams own their SLOs and error budgets
  • Blameless postmortems: Focus on systems, not individuals
  • On-call as first-class work: Not a burden, a learning opportunity
  • Chaos engineering: Proactively find weaknesses
  • Celebrate reliability: Recognize teams that improve MTTR, reduce incidents

Concrete practices:

  • Weekly reliability review with leadership
  • Error budget policies that slow feature work when budgets are exhausted
  • Runbook requirements before production deployment
  • Game days and disaster recovery drills
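
The error-budget policy above rests on simple arithmetic, sketched below assuming a 30-day window: a 99.9% SLO leaves 43.2 minutes of allowed downtime per month, while 99.99% leaves only 4.32.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes

def budget_exhausted(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> bool:
    """True when measured downtime has consumed the whole budget --
    the trigger point where the policy slows feature work."""
    return downtime_minutes >= error_budget_minutes(slo, window_days)
```

Teams that own these numbers can make the feature-vs-reliability trade-off themselves instead of escalating every release decision.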

Strategy & Vision

How do you develop and communicate a multi-year technical vision?

What they're assessing: Strategic thinking, communication at multiple levels.

Vision development process:

  1. Understand business direction: Where is the company going in 3-5 years?
  2. Assess current state: What are our technical strengths and weaknesses?
  3. Identify gaps: What technical capabilities do we need that we don't have?
  4. Prioritize ruthlessly: What 3 things matter most?
  5. Create milestones: What does year 1, year 2, year 3 look like?

Communication at different levels:

Audience               | Focus                                  | Format
Executives             | Business outcomes, investment required | 1-page strategy doc
Engineering leadership | Technical approach, team implications  | Architecture proposal
Individual engineers   | How their work connects to the vision  | All-hands, team meetings

What's your approach to build vs. buy decisions?

What they're assessing: Pragmatism, business acumen, long-term thinking.

Decision framework:

BUILD when:
✓ Core differentiator (this IS your product)
✓ Unique requirements that off-the-shelf can't meet
✓ Team has expertise to build AND maintain
✓ Control is critical (security, compliance, performance)

BUY when:
✓ Commodity capability (auth, payments, logging)
✓ Time-to-market is critical
✓ Maintenance burden exceeds value
✓ Vendor has expertise you don't

EXAMPLES from Twilio:
- BUILD: Centrifuge (unique 88K queue cardinality requirement)
- BUY: AWS RDS for MySQL underneath Centrifuge
- BUILD: Custom Cell Router (core to architecture)
- BUY: Consul for distributed locking (commodity)

Key insight: "The question isn't 'can we build it?' It's 'should we spend our engineering calories here?' Every hour spent building commodity infrastructure is an hour not spent on customer value."
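
One way to make the checklist above operational is a simple scorecard: each criterion that holds scores a point for its side, and ties default to buy (the cheaper mistake). The criterion names and the Centrifuge-style answers are illustrative assumptions, not a Twilio process.

```python
BUILD_CRITERIA = ["core_differentiator", "unique_requirements",
                  "team_expertise", "control_critical"]
BUY_CRITERIA = ["commodity", "time_to_market_critical",
                "maintenance_exceeds_value", "vendor_expertise_gap"]

def build_or_buy(answers: dict) -> str:
    """Tally which side of the checklist has more criteria satisfied."""
    build_score = sum(answers.get(c, False) for c in BUILD_CRITERIA)
    buy_score = sum(answers.get(c, False) for c in BUY_CRITERIA)
    return "build" if build_score > buy_score else "buy"

# Centrifuge-style case: unique 88K-queue requirement, reliability is the product.
centrifuge = {"core_differentiator": True, "unique_requirements": True,
              "team_expertise": True, "control_critical": True}
```

A scorecard never replaces judgment, but writing the criteria down forces the "should we spend engineering calories here?" conversation to happen explicitly.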

Interview Tips: Behavioral Questions

Common mistakes to avoid:
  • Being too technical: These questions assess leadership, not architecture. Lead with the people/org impact.
  • Taking all the credit: Use "we" for team accomplishments, "I" only for your specific contributions.
  • No concrete examples: Vague answers like "I always communicate well" don't demonstrate competence.
  • Badmouthing previous employers: Even if the situation was bad, focus on what you learned.
  • Not quantifying impact: "Improved reliability" → "Reduced P0 incidents from 4/month to 1/quarter"

STAR+ Format for DA interviews:
  • Situation: Context (brief—10% of answer)
  • Task: Your specific responsibility
  • Action: What YOU did (detailed—50% of answer)
  • Result: Outcome with metrics
  • +Reflection: What you learned, what you'd do differently (DA differentiator)

Quick Reference: Key Numbers to Remember

Segment CDP / Data Infrastructure

Metric                   | Value                                  | Context
Events/second            | 400,000 (sustained), 1M (peak)         | Tracking API throughput
Duplicate rate           | 0.6%                                   | From mobile client retries
RocksDB keys             | 60 billion                             | 4-week deduplication window
RocksDB size             | 1.5 TB                                 | Per partition
Centrifuge throughput    | 400K req/sec (sustained), 2M (tested)  | Outbound HTTP delivery
Source-destination pairs | 88,000                                 | 42K sources × 2.1 avg destinations
Directors                | 80-300                                 | Scaled by CPU utilization
Retry window             | 4 hours                                | With exponential backoff
JobDB rotation           | ~30 minutes                            | At 100% fill
Destinations supported   | 700+                                   | External integrations
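
The "4-hour retry window with exponential backoff" row implies a concrete schedule. A sketch, where the 30-second base delay and 1-hour cap are assumptions (the source states only the window and the backoff strategy):

```python
def retry_schedule(base_s: float = 30.0, window_s: float = 4 * 3600,
                   cap_s: float = 3600.0) -> list[float]:
    """Cumulative retry times (seconds since first failure) that fit in the window."""
    times, delay, elapsed = [], base_s, 0.0
    while elapsed + delay <= window_s:
        elapsed += delay
        times.append(elapsed)
        delay = min(delay * 2, cap_s)   # exponential backoff with a cap
    return times
```

With these assumed parameters the schedule yields nine attempts inside the window: delays double from 30 s until they hit the 1-hour cap, front-loading retries while the transient failure is likely still fresh.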

Cell-Based Architecture

Metric                     | Value        | Context
Enterprise cell customers  | 10-50        | Dedicated resources, 99.99% SLA
SMB cell customers         | 1,000+       | Shared resources, 99.9% SLA
Cell Router cache hit rate | 95%          | Redis ElastiCache
Cache lookup latency       | ~5ms         | Redis hit
DynamoDB lookup latency    | ~15ms        | Cache miss
New customer assignment    | ~20ms        | Least-loaded cell selection
VPC CIDR per cell          | 10.0.0.0/16  | Overlapping OK via VPC Lattice
Cell provisioning time     | ~40 minutes  | Account Factory + Terraform + ArgoCD
EKS nodes per cell         | 100-200      | Enterprise vs SMB sizing
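
The cache-hit and lookup-latency rows describe a cache-aside pattern: check the Redis-style cache first (~5ms hit), fall back to the DynamoDB-style routing table on a miss (~15ms), and populate the cache for next time. The sketch below uses dicts as stand-ins for both stores; it is illustrative, not Segment's router.

```python
class CellRouter:
    def __init__(self, routing_table: dict):
        self.cache = {}                 # stand-in for Redis ElastiCache
        self.table = routing_table      # stand-in for the DynamoDB routing table
        self.hits = self.misses = 0

    def route(self, customer_id: str) -> str:
        if customer_id in self.cache:   # ~5ms path, ~95% of lookups
            self.hits += 1
            return self.cache[customer_id]
        self.misses += 1                # ~15ms path on a cache miss
        cell = self.table[customer_id]
        self.cache[customer_id] = cell  # populate so the next lookup hits
        return cell
```

With a 95% hit rate, the blended lookup cost is roughly 0.95 × 5ms + 0.05 × 15ms = 5.5ms, which is why the cache layer dominates the router's latency profile.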

Super Network / Messaging

Metric                      | Value               | Context
Carrier connections         | 4,800               | Global SMPP/SIP connections
Data points per message     | 900+                | Routing decision inputs
Data points monitored daily | 3.2 billion         | Network health monitoring
Redundant routes            | 4x average          | Per destination failover
GLL Edge locations          | 9                   | Voice/Video low-latency edge
Short code throughput       | 100+ MPS            | Messages per second
Toll-free throughput        | 3 MPS (upgradeable) | Default rate
Queue timeout               | 4 hours             | Max queue time before error 30001