Technology Stack Overview
This document catalogs the specific technologies used at Twilio based on engineering blog posts, documentation, and architecture research. Technologies are categorized by function and include their specific use cases within Twilio's platform.
Event Streaming & Messaging
Apache Kafka DATA
Distributed event streaming platform. Core backbone for Segment CDP and event-driven architecture.
- Segment Tracking API event backbone
- Multi-tier failover (primary + secondary clusters per shard)
- Partitioning by messageId for deduplication locality
- Nearly 1M messages/second throughput
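The partitioning bullet above can be illustrated with a minimal sketch. It assumes a stable hash of the messageId modulo the partition count; CRC32 is used here only for illustration (Kafka's default Java partitioner uses murmur2), and the partition count is a hypothetical value, not a figure from the source.

```python
import zlib

NUM_PARTITIONS = 16  # hypothetical partition count, not from the source


def partition_for(message_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a messageId to a partition using a stable hash.

    Because the hash depends only on the ID, a retried message with the
    same messageId always lands on the same partition, so the consumer
    that owns that partition can deduplicate against local state.
    """
    return zlib.crc32(message_id.encode("utf-8")) % num_partitions


first_attempt = partition_for("msg-123")
retry_attempt = partition_for("msg-123")  # same partition as first_attempt
```

The design point is locality: no cross-consumer coordination is needed to detect a duplicate, because every copy of a given ID is routed to one place.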
NSQ DATA
Real-time distributed messaging platform. Lightweight local buffer before Kafka.
- Local buffer on each TAPI (Tracking API) server
- Absorbs burst traffic before writing to Kafka
- Decouples API response from Kafka write latency
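As a rough sketch of the buffering idea described above (not NSQ itself), the following uses an in-process queue and a background thread. The `LocalBuffer` class, its method names, and the `sink` callable standing in for the Kafka write are all hypothetical.

```python
import queue
import threading


class LocalBuffer:
    """In-process stand-in for a per-server buffer in front of a slow sink."""

    def __init__(self, sink, maxsize=10_000):
        self._queue = queue.Queue(maxsize=maxsize)
        self._sink = sink      # hypothetical callable, e.g. a Kafka produce call
        self.failed = []       # events the sink rejected (no retry in this sketch)
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def accept(self, event) -> bool:
        """Request path: enqueue and return without waiting for the sink."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            return False  # buffer is full; the caller decides how to shed load

    def _drain(self):
        """Background path: forward buffered events to the sink one by one."""
        while True:
            event = self._queue.get()
            try:
                self._sink(event)
            except Exception:
                self.failed.append(event)
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until every accepted event has been handed to the sink."""
        self._queue.join()
```

The API handler only pays the cost of `accept`, so a slow or briefly unavailable downstream write does not show up in response latency until the buffer fills.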
Amazon Kinesis AWS
Real-time data streaming service. Used for customer-facing event delivery.
- Event Streams product (customer-facing)
- Delivers message status updates to customer Kinesis streams
- Alternative to webhooks for high-volume customers
Kafka Architecture at Segment
Key Insight: Each TAPI shard has its own primary AND secondary Kafka cluster. This provides cluster-level failover, not just broker-level replication. The "Replicated" service monitors broker health and routes traffic.
| Aspect | Configuration | Rationale |
|---|---|---|
| Partitioning | By messageId | Same ID → same consumer → local dedup |
| Replication | Multi-cluster (not just multi-broker) | Survives entire cluster failures |
| Retention | 7 days minimum | Enables replay for recovery |
| Throughput | ~1M messages/second | Peak load across all shards |
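Cluster-level failover, as described in this section, can be sketched as a simple routing decision. This is an illustration under assumptions, not the actual logic of the "Replicated" service: the `Cluster` type, its `healthy` flag, and `choose_cluster` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    healthy: bool = True  # assumed to be updated by an external health monitor


def choose_cluster(primary: Cluster, secondary: Cluster) -> Cluster:
    """Prefer the shard's primary cluster; fall back to the secondary."""
    if primary.healthy:
        return primary
    if secondary.healthy:
        return secondary
    raise RuntimeError("no healthy Kafka cluster available for this shard")


primary = Cluster("shard-1-primary")
secondary = Cluster("shard-1-secondary")
target = choose_cluster(primary, secondary)  # the primary while it is healthy
```

Broker-level replication protects against losing individual brokers; the second cluster exists for the case where the whole primary cluster is unavailable.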
Databases & Storage
RocksDB STORAGE
Embedded key-value store built on LSM trees. Developed at Facebook and widely used in large-scale systems.
- Deduplication index in Segment CDP
- 60 billion keys, 1.5TB per partition
- 4-week deduplication window
- Bloom filters for fast "not seen" checks
- Replaced Memcached (100x improvement)
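The role of the Bloom filter in the bullets above can be shown with a small sketch. It assumes a hand-rolled filter built on SHA-256 and a Python set standing in for the on-disk RocksDB index; the class names, bit-array size, and hash count are illustrative choices, not details from the source.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        # Derive several bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class Deduplicator:
    def __init__(self):
        self.bloom = BloomFilter()
        self.index = set()  # stand-in for the disk-backed key index

    def is_duplicate(self, message_id: str) -> bool:
        # Fast path: a Bloom miss proves the ID is new, so the
        # (expensive, in the real system) index lookup is skipped.
        if self.bloom.might_contain(message_id) and message_id in self.index:
            return True
        self.bloom.add(message_id)
        self.index.add(message_id)
        return False
```

Since most incoming messages are not duplicates, the common case is answered from the filter alone, and the full index is consulted only on a possible hit.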
MySQL / Amazon RDS STORAGE
Relational database. Centrifuge uses it in a "database-as-queue" pattern.
- Centrifuge job queue (not traditional queue)
- Immutable rows design (no UPDATEs)
- jobs and job_state_transitions tables
- KSUID primary keys (time-sortable)
- DROP TABLE instead of DELETE for cleanup
- 400K outbound requests/second
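A minimal sketch of the immutable-rows idea, using SQLite in place of MySQL so that it runs standalone. The column layout, the state names, and the `new_id` helper (a timestamp plus sequence number, KSUID-like but not a real KSUID) are assumptions made for illustration.

```python
import itertools
import sqlite3
import time

_sequence = itertools.count()


def new_id() -> str:
    """Hypothetical time-sortable ID: zero-padded timestamp plus a counter."""
    return f"{time.time_ns():020d}-{next(_sequence):010d}"


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (
    id      TEXT PRIMARY KEY,
    payload TEXT NOT NULL
);
CREATE TABLE job_state_transitions (
    id     TEXT PRIMARY KEY,
    job_id TEXT NOT NULL,
    state  TEXT NOT NULL
);
""")


def record_state(job_id: str, state: str) -> None:
    """State changes are appended as new rows; nothing is ever UPDATEd."""
    conn.execute("INSERT INTO job_state_transitions VALUES (?, ?, ?)",
                 (new_id(), job_id, state))


def enqueue(payload: str) -> str:
    job_id = new_id()
    conn.execute("INSERT INTO jobs VALUES (?, ?)", (job_id, payload))
    record_state(job_id, "awaiting_scheduling")  # illustrative state name
    return job_id


def current_state(job_id: str):
    """The latest transition wins; time-sortable IDs give the ordering."""
    row = conn.execute(
        "SELECT state FROM job_state_transitions "
        "WHERE job_id = ? ORDER BY id DESC LIMIT 1", (job_id,)).fetchone()
    return row[0] if row else None
```

Because rows are only ever inserted, cleanup can be done by dropping a whole table once its jobs are finished, which avoids the cost of large DELETE operations.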
Amazon DynamoDB AWS
Fully managed NoSQL database. Key-value with global replication.
- Customer → Cell routing table (Global Tables)
- API key and identity lookups
- Session state for identity service
- Multi-region with ~5-15ms latency
Amazon Aurora PostgreSQL AWS
PostgreSQL-compatible edition of Amazon Aurora, AWS's managed relational database.
- Primary transactional database per cell
- Customer account data, configuration
- Multi-AZ deployment with read replicas
- Enterprise cells: 6 read replicas
- SMB cells: 2 read replicas
Redis / Amazon ElastiCache AWS
In-memory data store. Used for caching and session management.
- Cell Router cache (customer → cell mapping)
- 95% cache hit rate, ~5ms latency
- 1-hour TTL for routing entries
- Session cache for identity service
- Rate limiting counters
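The routing cache with a 1-hour TTL can be sketched as a cache-aside lookup. This uses an in-memory dictionary rather than Redis; `TTLCache`, `lookup_cell`, and the `load_from_routing_table` callable are hypothetical names, and the injectable clock exists only to make expiry easy to demonstrate.

```python
import time


class TTLCache:
    """In-memory stand-in for a cache with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (value, self._clock() + self._ttl)


def lookup_cell(customer_id, cache, load_from_routing_table):
    """Cache-aside: try the cache first, fall back to the routing table."""
    cell = cache.get(customer_id)
    if cell is None:
        cell = load_from_routing_table(customer_id)
        cache.set(customer_id, cell)
    return cell


routing_cache = TTLCache(ttl_seconds=3600)  # 1-hour TTL, as in the list above
```

The TTL bounds how long a stale customer-to-cell mapping can survive after a migration, at the cost of one routing-table read per customer per hour.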
Amazon S3 AWS
Object storage. Durable, scalable storage for any data type.
- Centrifuge archival (undelivered messages after 4 hours)
- Media storage (MMS, recordings)
- Log archival
- Segment warehouse sync destinations
Database Selection Philosophy
Principle: Use the simplest tool that meets requirements. Don't over-engineer.
| Use Case | Technology | Why This Choice |
|---|---|---|
| Deduplication (60B keys) | RocksDB | Embedded, no network, Bloom filters, disk-backed |
| Job queue (88K queues) | MySQL | SQL flexibility for QoS changes, immutable rows |
| Global routing | DynamoDB Global Tables | Multi-master, multi-region, ~15ms latency |
| Fast cache | Redis | Sub-5ms, in-memory, TTL support |
| Transactional | Aurora PostgreSQL | ACID, read replicas, AWS-managed |
| Archival | S3 | Cheap, durable, queryable with Athena |
Compute & Runtime
Go (Golang) COMPUTE
Statically typed, compiled language. Excellent for concurrent, high-throughput systems.
- Segment Tracking API (TAPI) servers
- Deduplication workers
- Custom JSON parser (zero-allocation)
- High-throughput services requiring low latency
Amazon EKS (Kubernetes) AWS
Managed Kubernetes service. Container orchestration at scale.
- Primary compute platform per cell
- 100-200 nodes per cell (varies by tier)
- Runs all microservices
- Auto-scaling based on CPU/memory
- IRSA for IAM authentication
AWS Lambda AWS
Serverless compute. Event-driven, auto-scaling, pay-per-use.
- Cell Router (Lambda@Edge)
- API Gateway integrations
- Event-driven processing
- Control Plane automation tasks
Consul COMPUTE
HashiCorp service mesh and distributed locking. Service discovery and coordination.
- Centrifuge Director locking
- One Director per JobDB (exclusive lock)
- Session-based TTL for automatic release
- Service discovery
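The exclusive-lock-with-TTL behaviour listed above can be modelled in a few lines. This is an in-memory model of the semantics only and does not call the Consul API; `LockTable`, its methods, and the key and owner names are hypothetical.

```python
import time


class LockTable:
    """Models exclusive locks that lapse unless the owner keeps renewing."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._held = {}  # key -> (owner, expires_at)

    def acquire(self, key, owner, ttl_seconds) -> bool:
        """Take or renew the lock; fail if another live owner holds it."""
        now = self._clock()
        current = self._held.get(key)
        if current is not None:
            holder, expires_at = current
            if holder != owner and now < expires_at:
                return False
        self._held[key] = (owner, now + ttl_seconds)
        return True

    def release(self, key, owner) -> None:
        current = self._held.get(key)
        if current is not None and current[0] == owner:
            del self._held[key]


locks = LockTable()
got_lock = locks.acquire("jobdb-1", "director-a", ttl_seconds=15)
```

If a Director crashes without releasing its lock, the TTL lets the lock expire so another Director can take over the JobDB without manual intervention.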
Language Choices
| Language | Use Cases | Rationale |
|---|---|---|
| Go | High-throughput data plane (TAPI, workers) | Performance, concurrency, low GC pause |
| Node.js | APIs, TwiML parsing, webhooks | Async I/O, JavaScript ecosystem |
| Python | Data pipelines, ML/AI, scripting | Data science libraries, rapid development |
| Java | Enterprise services, Android SDK | Mature ecosystem, strong typing |
AWS Services
Networking
Amazon VPC Lattice NETWORK
Application networking service. Routes by service name, not IP address.
- Cross-cell service mesh
- Enables overlapping VPC CIDRs (all cells use 10.0.0.0/16)
- Routes based on X-Twilio-Cell-ID header
- Eliminates IP address coordination
Amazon API Gateway AWS
Managed API service. REST and WebSocket APIs at scale.
- Public API endpoint (api.twilio.com)
- API versioning (/v1/*, /v2/*)
- Rate limiting
- Lambda integration for Cell Router
Amazon CloudFront AWS
Content delivery network. Global edge locations.
- Static asset delivery
- API acceleration for global customers
- Lambda@Edge for Cell Router logic
Amazon Route 53 AWS
Managed DNS service. Global traffic management.
- Global DNS for api.twilio.com
- Health check-based failover
- Latency-based routing to nearest region
- Private hosted zones per cell
Management & Operations
AWS Control Tower AWS
Multi-account governance. Landing zone for AWS Organizations.
- Cell = AWS Account (via Account Factory)
- Automated account provisioning
- Service Control Policies (SCPs)
- Centralized CloudTrail/Config
Terraform COMPUTE
Infrastructure as Code. HashiCorp's declarative provisioning tool.
- Cell infrastructure provisioning
- VPC, EKS, RDS, MSK setup
- Terragrunt for multi-cell management
- ~30 min to provision new cell
ArgoCD COMPUTE
GitOps continuous delivery for Kubernetes.
- Application deployment to cells
- One ArgoCD application per cell
- Git as source of truth
- Automated sync and drift detection
AWS Step Functions AWS
Serverless workflow orchestration. State machines for complex workflows.
- Control Plane orchestration
- Cell provisioning workflow
- Customer migration workflows
- Long-running async operations
Protocols & Standards
SMPP PROTOCOL
Short Message Peer-to-Peer. Industry standard for SMS carrier connectivity.
- 4,800 carrier connections worldwide
- Super Network SMPP gateways
- PDU encoding, segmentation, concatenation
- Delivery receipt (DLR) handling
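Segmentation and concatenation follow standard SMS limits: a single message carries 160 GSM-7 characters or 70 UCS-2 code units, and each part of a concatenated message carries 153 or 67 because the concatenation header takes up part of the payload. The sketch below only counts segments. It assumes the caller already knows the encoding and that GSM-7 text contains no extension-table characters (which take two septets each); the function name is hypothetical.

```python
import math

# (single-message limit, per-part limit when concatenated)
LIMITS = {
    "gsm7": (160, 153),
    "ucs2": (70, 67),
}


def segment_count(length_units: int, encoding: str = "gsm7") -> int:
    """Number of SMS parts needed for a message of the given length.

    length_units is the message length in septets for GSM-7 or in
    16-bit code units for UCS-2.
    """
    single_limit, part_limit = LIMITS[encoding]
    if length_units <= single_limit:
        return 1
    return math.ceil(length_units / part_limit)


def ucs2_units(text: str) -> int:
    """Length in 16-bit code units (characters outside the BMP count twice)."""
    return len(text.encode("utf-16-be")) // 2
```

Note the step at the boundary: 160 GSM-7 characters fit in one message, but 161 require two parts, since both parts then need room for the concatenation header.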
SIP PROTOCOL
Session Initiation Protocol. Standard for voice/video session setup.
- Programmable Voice connections
- BYOC (Bring Your Own Carrier) trunking
- Enterprise PBX integration
- SIP domains for custom endpoints
WebRTC PROTOCOL
Web Real-Time Communication. Browser/mobile real-time media.
- Programmable Voice (browser SDK)
- Programmable Video
- Global Low Latency (GLL) Edge - 9 locations
- STUN/TURN handled by Twilio
- SFU for group video rooms
OAuth 2.1 PROTOCOL
Authorization framework. Industry standard for delegated access.
- Stytch Connected Apps (AI agent auth)
- MCP protocol compliance (Anthropic standard)
- Scoped tokens for fine-grained access
- Human-in-the-loop step-up authentication
TwiML PROTOCOL
Twilio Markup Language. XML-based instructions for voice/messaging.
- Voice call control (<Say>, <Dial>, <Gather>)
- Messaging responses (<Message>)
- Webhook response format
- Declarative call flow definition
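To make the webhook response format concrete, here is a small sketch that builds a TwiML document with Python's standard XML library rather than Twilio's helper SDK. The function name, the prompt text, and the `/handle-key` action path are hypothetical; the `<Response>`, `<Gather>`, and `<Say>` verbs and the `numDigits` and `action` attributes are standard TwiML.

```python
import xml.etree.ElementTree as ET


def menu_twiml(prompt: str, action_url: str) -> str:
    """Build a TwiML response that reads a prompt and collects one digit."""
    response = ET.Element("Response")
    gather = ET.SubElement(
        response, "Gather", {"numDigits": "1", "action": action_url})
    ET.SubElement(gather, "Say").text = prompt
    # Executed only if the caller enters nothing and <Gather> times out.
    ET.SubElement(response, "Say").text = "We did not receive any input. Goodbye."
    return ET.tostring(response, encoding="unicode")


twiml = menu_twiml("Press 1 for sales or 2 for support.", "/handle-key")
```

A webhook handler would return this string as the HTTP response body with an XML content type; the call flow is then driven entirely by the returned document.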
MCP (Model Context Protocol) PROTOCOL
Anthropic's open standard for AI tool integration.
- Stytch AI agent authentication
- Claude/ChatGPT connector support
- OAuth 2.1 mandate for agent auth
- Scoped, ephemeral credentials for AI agents
Technology Summary by Function
| Function | Technology | Scale / Notes |
|---|---|---|
| Event Streaming | | |
| Event backbone | Apache Kafka (AWS MSK) | ~1M messages/sec, multi-cluster failover |
| Local buffer | NSQ | Per-server buffer before Kafka |
| Customer event delivery | Kinesis | Event Streams product |
| Databases | | |
| Deduplication | RocksDB (embedded) | 60B keys, 1.5TB, Bloom filters |
| Job queue | MySQL/RDS | Centrifuge, 400K req/sec, immutable rows |
| Global routing | DynamoDB Global Tables | Multi-master, ~15ms latency |
| Caching | Redis (ElastiCache) | 95% hit rate, ~5ms latency |
| Transactional | Aurora PostgreSQL | Per-cell primary database |
| Archival | S3 | Long-term storage, analytics |
| Compute | | |
| Container orchestration | Amazon EKS | 100-200 nodes per cell |
| Serverless | AWS Lambda | Cell Router, event processing |
| High-performance services | Go | TAPI: 800K RPS, 30ms latency |
| Distributed locking | Consul | Director locks in Centrifuge |
| Networking | | |
| Service mesh | VPC Lattice | Overlapping VPC CIDRs, service routing |
| API management | API Gateway | Rate limiting, versioning |
| CDN | CloudFront | Lambda@Edge for Cell Router |
| DNS | Route 53 | Latency-based routing |
| Operations | | |
| Infrastructure as Code | Terraform | Cell provisioning (~30 min) |
| GitOps | ArgoCD | Application deployment |
| Multi-account governance | AWS Control Tower | Cell = AWS Account |
| Workflow orchestration | Step Functions | Control Plane automation |
| Protocols | | |
| SMS carrier | SMPP | 4,800 carrier connections |
| Voice signaling | SIP | BYOC, PBX integration |
| Real-time media | WebRTC | 9 GLL Edge locations |
| AI agent auth | OAuth 2.1 + MCP | Stytch Connected Apps |
Interview Quick Reference
When asked "What technologies does Twilio use?"
"Twilio runs on AWS with a cell-based architecture. Each cell is a separate AWS account with its own EKS cluster, Aurora PostgreSQL, and MSK (Kafka). For the Segment CDP specifically, they use Go for high-throughput services, RocksDB for embedded deduplication at 60 billion keys, and MySQL as a 'database-as-queue' for Centrifuge which handles 400K outbound requests/second. DynamoDB Global Tables provide multi-region routing, Redis caches for 95% hit rates, and VPC Lattice enables service mesh across cells with overlapping IP spaces. For carrier connectivity, they maintain 4,800 SMPP connections through their Super Network."