REST Principles, Versioning Strategies, Breaking Changes, and API Governance at Scale
| Challenge | Impact | Twilio Example |
|---|---|---|
| Breaking changes | Customer code breaks in production | Changing SMS callback payload structure |
| Version proliferation | Maintenance burden, confusion | Supporting v1, v2, v3 simultaneously |
| Inconsistency | Poor developer experience | Different patterns across Voice vs Messaging APIs |
| Rate limiting | Customer frustration, abuse prevention | Protecting Super Network from traffic spikes |
| Documentation drift | Incorrect implementations, support burden | Outdated SDK examples |
# Resource-oriented (RESTful)
GET /messages # List messages
POST /messages # Create message
GET /messages/{sid} # Get specific message
PUT /messages/{sid} # Update message
DELETE /messages/{sid} # Delete message
# RPC-style verbs (avoid)
POST /sendMessage
POST /getMessage
POST /updateMessage
POST /deleteMessage
POST /listAllMessages
| Method | Purpose | Idempotent? | Safe? | Request Body |
|---|---|---|---|---|
| GET | Retrieve resource(s) | Yes | Yes | No |
| POST | Create resource | No | No | Yes |
| PUT | Replace resource entirely | Yes | No | Yes |
| PATCH | Partial update | No* | No | Yes |
| DELETE | Remove resource | Yes | No | Optional |
*PATCH can be idempotent if designed carefully (e.g., "set field to X" vs "increment field")
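The footnote's distinction can be made concrete with a small sketch (the resource and field names are hypothetical): a "set field to X" patch applied twice leaves the same state, while an "increment" patch does not, so a network retry would double-apply it.

```javascript
// Hypothetical in-memory resource to illustrate PATCH idempotency.
const message = { sid: 'SM123', status: 'queued', retryCount: 0 };

// Idempotent PATCH: "set status to X" — applying it twice yields the same state.
function patchSetStatus(resource, status) {
  return { ...resource, status };
}

// Non-idempotent PATCH: "increment retryCount" — each application changes state.
function patchIncrementRetries(resource) {
  return { ...resource, retryCount: resource.retryCount + 1 };
}

const once = patchSetStatus(message, 'sent');
const twice = patchSetStatus(once, 'sent');
// once and twice are identical — a client retry is harmless

const bumped = patchIncrementRetries(patchIncrementRetries(message));
// bumped.retryCount is 2 — a retry would have double-applied the change
```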
# Hierarchical resources
GET /accounts/{accountSid}/messages
GET /accounts/{accountSid}/calls/{callSid}/recordings
# Twilio pattern: Account scoping
POST /2010-04-01/Accounts/{AccountSid}/Messages.json
# Filtering via query params (not path)
GET /messages?status=delivered&dateSent>=2024-01-01
# Pagination
GET /messages?pageSize=50&pageToken=xyz123
# Sorting
GET /messages?sort=dateSent:desc
// Successful response (200 OK)
{
"sid": "SM123456789",
"accountSid": "AC987654321",
"to": "+14155551234",
"from": "+14155556789",
"body": "Hello, World!",
"status": "delivered",
"dateCreated": "2024-01-15T10:30:00Z",
"dateUpdated": "2024-01-15T10:30:05Z",
"uri": "/2010-04-01/Accounts/AC987654321/Messages/SM123456789.json",
"_links": {
"self": "/messages/SM123456789",
"account": "/accounts/AC987654321",
"media": "/messages/SM123456789/media"
}
}
// Collection response with pagination
{
"messages": [...],
"meta": {
"page": 0,
"pageSize": 50,
"totalCount": 1234,
"pageCount": 25
},
"pagination": {
"firstPageUri": "/messages?pageSize=50",
"nextPageUri": "/messages?pageSize=50&pageToken=abc123",
"previousPageUri": null
}
}
| Element | Convention | Example |
|---|---|---|
| URLs | lowercase, hyphens | /phone-numbers |
| JSON fields | camelCase | dateCreated, accountSid |
| Query params | camelCase | ?pageSize=50 |
| Resource names | Plural nouns | /messages, /calls |
| IDs | Prefixed, unique | SM..., CA..., AC... |
"A good REST API is resource-oriented—it exposes nouns, not verbs. Instead of 'sendMessage', you POST to /messages. Resources have stable URIs that can be bookmarked, cached, and linked. HTTP methods provide consistent semantics: GET is safe and idempotent, POST creates, PUT replaces, DELETE removes. Responses include hypermedia links so clients can navigate the API without hardcoding URLs. For consistency, I establish conventions early: camelCase for JSON, plural nouns for collections, prefixed IDs for debuggability. At scale, I add pagination with cursors instead of offsets—offset pagination breaks when data is added or removed between requests. The most important thing is predictability. A developer who's used one endpoint should be able to guess how another works. That means consistent naming, consistent error formats, and consistent behavior across the entire API surface."
| Approach | Example | Pros | Cons |
|---|---|---|---|
| URL Path | /v1/messages | Obvious, cacheable, easy routing | Not RESTful (resource URI changes) |
| Query Param | /messages?version=1 | Optional, backward compatible | Easy to forget, caching issues |
| Header | Accept: application/vnd.twilio.v1+json | Clean URLs, content negotiation | Hidden, harder to test, no browser support |
| Date-based | /2010-04-01/Messages | Clear timeline, Twilio's approach | Many versions accumulate |
| No versioning | Evolve in place | Simple, forces compatibility | Hard to make breaking changes |
# Twilio API versions
/2010-04-01/Accounts/{AccountSid}/Messages.json # Original version
/2020-03-15/Accounts/{AccountSid}/Messages.json # Newer version
# Benefits:
# - Clear timeline of when features were added
# - Customer pins to specific date, knows exactly what they get
# - New versions only when breaking changes needed
# - Old versions maintained for years with deprecation warnings
// Version-specific request/response transformers
class MessageHandlerV1 {
// 2010-04-01: Original format
async create(req) {
const input = this.transformRequest(req.body);
const result = await messageService.create(input); // Core service
return this.transformResponse(result);
}
transformRequest(body) {
return {
to: body.To, // V1 uses PascalCase
from: body.From,
body: body.Body,
};
}
transformResponse(message) {
return {
sid: message.id,
To: message.to, // V1 returns PascalCase
From: message.from,
Body: message.body,
Status: message.status,
DateCreated: message.createdAt.toISOString(),
};
}
}
class MessageHandlerV2 {
// 2020-03-15: Modern format
async create(req) {
const input = this.transformRequest(req.body);
const result = await messageService.create(input); // Same core service!
return this.transformResponse(result);
}
transformRequest(body) {
return {
to: body.to, // V2 uses camelCase
from: body.from,
body: body.body,
contentType: body.contentType, // V2 adds new fields
validityPeriod: body.validityPeriod,
};
}
transformResponse(message) {
return {
sid: message.id,
to: message.to, // V2 returns camelCase
from: message.from,
body: message.body,
status: message.status,
dateCreated: message.createdAt,
dateUpdated: message.updatedAt,
_links: { // V2 adds hypermedia
self: `/messages/${message.id}`,
media: `/messages/${message.id}/media`,
},
};
}
}
"I version in the URL path with a date-based scheme—like Twilio's /2010-04-01/Messages. The date makes it clear when a version was released and what behavior to expect. I only create new versions for breaking changes; additive changes go into the current version. The version is resolved at the API Gateway layer—the Landing Zone in our cell architecture. Each cell runs handlers for all supported versions, but the business logic is version-agnostic. Version handlers just transform requests and responses at the boundary. This means I can add a new version by adding a new transformer, not rewriting business logic. For deprecation, I give customers at least 12 months notice, add deprecation headers to responses, track version usage metrics, and reach out to heavy users of old versions before sunset. The goal is zero surprise breakage."
// Server: Liberal in accepting requests
app.post('/messages', async (req, res) => {
const { to, from, body, ...unknownFields } = req.body;
if (Object.keys(unknownFields).length > 0) {
logger.info('Ignoring unknown fields', { fields: Object.keys(unknownFields) });
// Don't reject! Just ignore.
}
// Process with known fields only
const message = await createMessage({ to, from, body });
res.json(message);
});
// Client: Liberal in parsing responses
function parseMessage(response) {
// Only extract fields we know about
// Ignore any new fields the server adds
return {
sid: response.sid,
to: response.to,
from: response.from,
body: response.body,
status: response.status,
// Don't fail if response has extra fields!
};
}
// Feature flag controls new behavior
async function sendMessage(request, account) {
const useNewValidation = await featureFlags.isEnabled(
'sms.strict_e164_validation',
{ accountId: account.id }
);
if (useNewValidation) {
// New stricter validation (potential breaking change)
if (!isStrictE164(request.to)) {
throw new ValidationError('Phone must be E.164 format: +1234567890');
}
} else {
// Old lenient validation
request.to = normalizeToE164(request.to);
}
return await messageService.send(request);
}
// Rollout strategy:
// Week 1: 1% of accounts (beta testers)
// Week 2: 10% of accounts
// Week 4: 50% of accounts
// Week 6: 100% of accounts
// Week 8: Remove flag, new behavior is default
"The best breaking change is the one you don't make. I follow additive-only evolution—new fields, new endpoints, new optional parameters. Clients should ignore unknown fields in responses and servers should ignore unknown fields in requests. When I absolutely must make a breaking change, I use the expand-contract pattern. First, I add the new field alongside the old one and support both for 6+ months. During this time, I add deprecation warnings to responses, update documentation, and track which customers still use the old pattern. I reach out directly to heavy users. Only after the migration period do I remove the old field. For risky changes, I use feature flags to roll out gradually—1%, then 10%, then 50%, watching error rates at each step. The goal is zero surprises. A customer should never wake up to broken code because we changed an API."
| Signal | Implementation | Purpose |
|---|---|---|
| HTTP Header | Deprecation: true / Sunset: Sun, 01 Jun 2025 00:00:00 GMT | Machine-readable, loggable |
| Response Field | "_deprecation": { "message": "...", "sunset": "..." } | Visible in response body |
| Documentation | Prominent "DEPRECATED" badge | Developers reading docs |
| SDK Warnings | console.warn('Method deprecated...') | IDE and runtime visibility |
| Email/Dashboard | Direct notification to affected accounts | Proactive communication |
// Response with deprecation headers
HTTP/1.1 200 OK
Content-Type: application/json
Deprecation: true
Deprecation-Date: Wed, 01 Jan 2025 00:00:00 GMT
Sunset: Tue, 01 Jul 2025 00:00:00 GMT
Link: <https://docs.twilio.com/migration/v2>; rel="deprecation"
{
"sid": "SM123456789",
"Status": "sent",
"_deprecation": {
"warning": "API version 2010-04-01 is deprecated",
"sunsetDate": "2025-07-01T00:00:00Z",
"migrationGuide": "https://docs.twilio.com/migration/v2",
"replacement": "/2024-01-01/Messages"
}
}
| Timeline | Action | Channel |
|---|---|---|
| T-12 months | Announce deprecation | Blog, changelog, email to all users |
| T-9 months | Add deprecation headers/warnings | API responses |
| T-6 months | Direct outreach to heavy users | Email, account manager contact |
| T-3 months | Sunset reminder | Email, in-app notification |
| T-1 month | Final warning | All channels, dashboard banner |
| T-0 | Begin returning 410 Gone | API response |
| T+1 month | Remove from documentation | Docs site |
-- Track deprecated endpoint usage
SELECT
api_version,
endpoint,
COUNT(DISTINCT account_id) as unique_accounts,
COUNT(*) as total_requests,
DATE_TRUNC('week', timestamp) as week
FROM api_requests
WHERE endpoint IN (SELECT endpoint FROM deprecated_endpoints)
GROUP BY api_version, endpoint, week
ORDER BY week DESC;
-- Alert if deprecated usage isn't declining
-- Goal: 50% reduction each quarter until sunset
"Deprecation is a 12-month process minimum. At announcement, I add the Deprecation and Sunset headers to every response from that endpoint. These are machine-readable—customers can detect them in CI/CD and set up alerts. I update documentation with prominent warnings and migration guides. At the 6-month mark, I pull usage data to identify customers still on the old API. The top 20 accounts by usage get direct outreach—often they have valid reasons or need help migrating. I track weekly metrics: unique accounts, total requests, error rates. If migration isn't happening fast enough, I extend the timeline or allocate engineering to help. At sunset, the endpoint returns 410 Gone with a body explaining what happened and where to go. But honestly, if you've communicated well, no one should be surprised. The goal is that by sunset day, usage is already near zero because everyone migrated months ago."
| Code | Meaning | When to Use | Retry? |
|---|---|---|---|
| 400 | Bad Request | Invalid input, validation failure | No (fix request) |
| 401 | Unauthorized | Missing or invalid credentials | No (fix auth) |
| 403 | Forbidden | Valid auth, but not allowed | No (permission issue) |
| 404 | Not Found | Resource doesn't exist | No |
| 409 | Conflict | Optimistic lock failure, duplicate | Maybe (re-fetch, retry) |
| 422 | Unprocessable Entity | Valid syntax, invalid semantics | No (fix data) |
| 429 | Too Many Requests | Rate limited | Yes (after delay) |
| 500 | Internal Server Error | Unexpected server failure | Yes (with backoff) |
| 502 | Bad Gateway | Upstream service failure | Yes (with backoff) |
| 503 | Service Unavailable | Temporary overload/maintenance | Yes (check Retry-After) |
// Twilio-style error response
{
"code": 21211,
"message": "The 'To' phone number is not a valid phone number",
"status": 400,
"moreInfo": "https://www.twilio.com/docs/errors/21211",
"details": {
"field": "to",
"value": "not-a-phone",
"reason": "Must be E.164 format: +1234567890"
}
}
// Multiple validation errors
{
"code": 20001,
"message": "Validation failed",
"status": 400,
"errors": [
{
"field": "to",
"code": "invalid_format",
"message": "Must be E.164 format"
},
{
"field": "body",
"code": "too_long",
"message": "Body exceeds 1600 characters"
}
]
}
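A server-side helper can aggregate field-level failures into that envelope. A minimal sketch — the validation rules and function name are illustrative, reusing the error codes from the examples above:

```javascript
// Validate a message request; return null on success, or a standard
// multi-error envelope (code 20001) listing every failing field at once,
// so clients can fix all problems in a single round trip.
function validateMessage({ to, body }) {
  const errors = [];
  if (!/^\+[1-9]\d{1,14}$/.test(to ?? '')) {
    errors.push({ field: 'to', code: 'invalid_format', message: 'Must be E.164 format' });
  }
  if ((body ?? '').length > 1600) {
    errors.push({ field: 'body', code: 'too_long', message: 'Body exceeds 1600 characters' });
  }
  if (errors.length === 0) return null;
  return { code: 20001, message: 'Validation failed', status: 400, errors };
}
```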
| Range | Category | Example |
|---|---|---|
| 10xxx | Account errors | 10001: Account not found |
| 20xxx | Request validation | 20003: Missing required parameter |
| 21xxx | Phone number errors | 21211: Invalid phone number |
| 30xxx | Message delivery | 30003: Unreachable destination |
| 50xxx | Internal errors | 50001: Internal server error |
// Include retry guidance in response headers
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1699900030
{
"code": 20429,
"message": "Rate limit exceeded",
"retryAfter": 30,
"rateLimitInfo": {
"limit": 100,
"remaining": 0,
"resetAt": "2024-01-15T10:30:30Z"
}
}
// Client retry logic
async function callAPIWithRetry(request, maxRetries = 3) {
for (let attempt = 0; attempt < maxRetries; attempt++) {
try {
return await makeRequest(request);
} catch (error) {
if (error.status === 429) {
const retryAfter = parseInt(error.headers['retry-after'], 10) || 30;
await sleep(retryAfter * 1000);
continue;
}
if (error.status >= 500) {
const backoff = Math.pow(2, attempt) * 1000; // Exponential backoff
await sleep(backoff);
continue;
}
throw error; // Don't retry 4xx errors
}
}
throw new Error('Max retries exceeded');
}
| Algorithm | How It Works | Pros | Cons |
|---|---|---|---|
| Fixed Window | N requests per minute, resets on the minute | Simple, predictable | Burst at window boundary |
| Sliding Window Log | Track timestamp of each request | Accurate, smooth | Memory-intensive |
| Sliding Window Counter | Weighted average of current + previous window | Balance of accuracy and efficiency | Approximate |
| Token Bucket | Bucket fills at constant rate, requests drain | Allows bursts up to bucket size | More complex |
| Leaky Bucket | Requests queue, processed at constant rate | Smooth output rate | Adds latency, queue management |
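Of the algorithms in the table, the sliding-window counter is worth a sketch: it approximates the rolling-window count as the previous window's count weighted by how much of it still overlaps the window, plus the current count. This in-memory version is illustrative; a production limiter would keep the two counters per account in Redis.

```javascript
class SlidingWindowCounter {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.current = { start: 0, count: 0 };
    this.previous = { start: -windowMs, count: 0 };
  }

  allowRequest(now) {
    const windowStart = Math.floor(now / this.windowMs) * this.windowMs;
    if (windowStart > this.current.start) {
      // Roll forward: the old current window becomes "previous" only if
      // it is exactly one window behind; otherwise it has fully expired.
      this.previous =
        windowStart - this.current.start === this.windowMs
          ? this.current
          : { start: windowStart - this.windowMs, count: 0 };
      this.current = { start: windowStart, count: 0 };
    }
    // Fraction of the previous window still inside the rolling window.
    const overlap = 1 - (now - windowStart) / this.windowMs;
    const estimate = this.previous.count * overlap + this.current.count;
    if (estimate >= this.limit) return false;
    this.current.count += 1;
    return true;
  }
}
```

The estimate is approximate (it assumes requests were evenly spread across the previous window), which is the accuracy/efficiency trade-off the table notes.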
// Redis-based token bucket
class TokenBucket {
constructor(redis, key, capacity, refillRate) {
this.redis = redis;
this.key = key;
this.capacity = capacity; // Max tokens (burst size)
this.refillRate = refillRate; // Tokens per second
}
async allowRequest(tokens = 1) {
const now = Date.now();
// Lua script for atomic operation
const script = `
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refillRate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])
local bucket = redis.call('HMGET', key, 'tokens', 'lastRefill')
local tokens = tonumber(bucket[1]) or capacity
local lastRefill = tonumber(bucket[2]) or now
-- Refill tokens based on time elapsed
local elapsed = (now - lastRefill) / 1000
tokens = math.min(capacity, tokens + (elapsed * refillRate))
if tokens >= requested then
tokens = tokens - requested
redis.call('HMSET', key, 'tokens', tokens, 'lastRefill', now)
redis.call('EXPIRE', key, 3600)
return {1, tokens} -- Allowed, remaining tokens
else
return {0, tokens} -- Denied, remaining tokens
end
`;
const [allowed, remaining] = await this.redis.eval(
script, 1, this.key,
this.capacity, this.refillRate, now, tokens
);
return { allowed: allowed === 1, remaining };
}
}
| Tier | Messages/sec | API calls/min | Burst |
|---|---|---|---|
| Free Trial | 1 | 60 | 10 |
| Standard | 10 | 600 | 100 |
| Enterprise | 100 | 6,000 | 1,000 |
| Enterprise+ | Custom | Custom | Custom |
// Standard rate limit headers
HTTP/1.1 200 OK
X-RateLimit-Limit: 100 // Max requests in window
X-RateLimit-Remaining: 87 // Requests left in window
X-RateLimit-Reset: 1699900060 // Unix timestamp when window resets
// Draft IETF standard (RateLimit header)
RateLimit-Limit: 100
RateLimit-Remaining: 87
RateLimit-Reset: 60 // Seconds until reset
// Twilio adds endpoint-specific limits
X-Twilio-Concurrent-Requests: 25
X-Twilio-Concurrent-Requests-Limit: 100
"I use token bucket for rate limiting because it allows controlled bursts while enforcing average rate. Each customer has a bucket that refills at their tier's rate—say 10 tokens per second for standard accounts. They can burst up to the bucket capacity, but sustained traffic is capped. Implementation is a Redis Lua script for atomicity across distributed systems. I set different limits per endpoint based on cost—sending an SMS is more expensive than reading a message, so lower limits. Response headers tell customers their limits and remaining quota so they can self-throttle. When rate limited, I return 429 with a Retry-After header. Critical design decision: do you rate limit by API key, by account, or by IP? I do account-level for authenticated requests and IP-level for unauthenticated. For cell architecture, rate limiting happens at the Cell Router before traffic even reaches cells, protecting downstream systems."
| Area | Standard | Enforcement |
|---|---|---|
| URL format | Lowercase, hyphens, plural nouns | Linter |
| JSON casing | camelCase | Linter |
| Date format | ISO 8601 (2024-01-15T10:30:00Z) | Linter |
| Pagination | Cursor-based with pageToken | Review board |
| Error format | Standard error object with code, message, details | Linter |
| Versioning | Date-based in URL path | Review board |
| Authentication | Account SID + Auth Token or API Key | Review board |
# .github/workflows/api-lint.yml
name: API Lint
on:
pull_request:
paths:
- 'api/**/*.yaml'
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Lint OpenAPI Spec
uses: stoplightio/spectral-action@v0.8
with:
file_glob: 'api/**/*.yaml'
spectral_ruleset: '.spectral.yaml'
- name: Check Breaking Changes
run: |
# Compare with main branch
oasdiff breaking api/spec.yaml main:api/spec.yaml
- name: Security Scan
run: |
# Check for sensitive data in examples
# Check auth requirements on all endpoints
api-security-scanner api/spec.yaml
# .spectral.yaml - API linting rules
extends: spectral:oas
rules:
# Naming conventions
paths-kebab-case:
description: Paths must use kebab-case
given: $.paths[*]~
then:
function: pattern
functionOptions:
match: "^(/[a-z0-9-]+)+$"
properties-camel-case:
description: Properties must use camelCase
given: $..properties[*]~
then:
function: casing
functionOptions:
type: camel
# Security
operation-security-defined:
description: All operations must have security defined
given: $.paths[*][*]
then:
field: security
function: truthy
# Documentation
operation-description:
description: Operations must have descriptions
given: $.paths[*][*]
then:
field: description
function: truthy
# Error responses
error-response-schema:
description: 4xx/5xx must use standard error schema
given: $.paths[*][*].responses[?(@property >= 400)]
then:
field: content.application/json.schema.$ref
function: pattern
functionOptions:
match: "#/components/schemas/Error"
| Review Trigger | Reviewers | SLA |
|---|---|---|
| New public API | API Board + Security + Docs | 1 week |
| Breaking change | API Board + affected teams | 1 week |
| New version | API Board | 3 days |
| Deprecation | API Board + Customer Success | 1 week |
| Additive change | Auto-approved if linting passes | Immediate |
"Three layers: standards, automation, and humans. First, I publish an API design standards document covering naming, error formats, pagination, versioning—everything teams need to build consistent APIs. Second, I automate enforcement through OpenAPI linting in CI/CD. Every PR that touches API specs runs Spectral rules checking for kebab-case URLs, camelCase properties, security definitions, and more. Breaking changes are automatically detected and blocked. Third, an API Review Board—architects from across the organization—reviews new APIs, breaking changes, and deprecations. They catch what automation can't: business logic issues, cross-team conflicts, strategic misalignment. The key is making the right thing easy. I provide templates, generators, and examples. If teams start from our standard OpenAPI template, they're 80% compliant automatically. The review board becomes a partnership, not a gate."
Notable Twilio API patterns: subresource actions (POST /Calls/{sid}/Recordings to start recording), date-based versioning (/2010-04-01/ in URL path), account scoping (/Accounts/{AccountSid}/), and prefixed SIDs (SM, CA, AC).
TwiML voice response (XML):
<?xml version="1.0" encoding="UTF-8"?>
<Response>
<Say voice="alice">Hello! Thanks for calling.</Say>
<Gather input="speech dtmf" timeout="3" numDigits="1">
<Say>Press 1 for sales, or 2 for support.</Say>
</Gather>
<Say>We didn't receive any input. Goodbye!</Say>
</Response>
// Idempotency key prevents duplicate sends
POST /2010-04-01/Accounts/{AccountSid}/Messages.json
Idempotency-Key: unique-request-id-12345
Content-Type: application/x-www-form-urlencoded
To=+15551234567&From=+15559876543&Body=Hello!
// If network fails and client retries with same key:
// - First request: Message created, returns 201
// - Retry: Same response returned, no duplicate message sent
// Twilio stores idempotency keys for 24 hours
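The server-side half of this contract can be sketched as follows: the first request stores its response under the key, and a retry with the same key replays the stored response instead of creating a duplicate. An in-memory Map stands in for a shared store with a 24-hour TTL, and the function names are illustrative.

```javascript
// Map of idempotency key → stored response (a real store would expire
// entries after 24 hours and be shared across servers).
const idempotencyStore = new Map();
let nextSid = 1;

function createMessage(idempotencyKey, body) {
  if (idempotencyStore.has(idempotencyKey)) {
    // Retry detected: replay the original response, send nothing.
    return { ...idempotencyStore.get(idempotencyKey), replayed: true };
  }
  const response = { status: 201, sid: `SM${nextSid++}`, body };
  idempotencyStore.set(idempotencyKey, response);
  return { ...response, replayed: false };
}

const first = createMessage('req-12345', 'Hello!');
const retry = createMessage('req-12345', 'Hello!');
// first.sid === retry.sid — no duplicate message was sent
```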
| Component | Versioning | Release Cadence |
|---|---|---|
| API (Backend) | Date-based (2010-04-01) | Breaking changes: ~yearly |
| SDK (Client libraries) | SemVer (8.1.0) | Minor: monthly, Major: yearly |
| SDK ↔ API mapping | SDK pins to API version | SDK 8.x → API 2020-03-15 |
// SDK hides API versioning from developers
const twilio = require('twilio')(accountSid, authToken);
// SDK internally uses /2020-03-15/Accounts/...
const message = await twilio.messages.create({
to: '+15551234567',
from: '+15559876543',
body: 'Hello from Twilio!'
});
// To use a different API version, create a separate client instance:
const twilioPinned = require('twilio')(accountSid, authToken, {
apiVersion: '2024-01-01' // Override default
});
"Twilio uses date-based versioning in the URL path—/2010-04-01/Messages. The date indicates when that API contract was locked. I'd keep this approach because it's explicit and self-documenting. Version routing happens at the Landing Zone, not in cells. When a request comes in, the API Gateway extracts the version and routes to the appropriate handler. All cells run all version handlers simultaneously, so there's no version-specific deployment. For SDKs, I use semver and pin each major SDK version to a specific API version. Developers use the SDK without thinking about API versions—it's abstracted away. When we release a new API version, we release a new SDK major version. This gives customers two migration paths: upgrade the SDK and get the new API automatically, or pin the SDK and never change. For breaking changes between API versions, I use the expand-contract pattern with 12+ months of overlap. The old version keeps working; we just stop adding features to it."
A: "First principle: never break existing integrations. I'd establish that additive changes are always safe—new optional fields, new endpoints, new enum values. For changes that could break clients, I use the expand-contract pattern: add new alongside old, support both for 6-12 months with deprecation warnings, then remove old. I'd track usage metrics obsessively—which accounts use which fields, which versions, which endpoints. Before any sunset, I'd know exactly who's affected and reach out directly. I'd build automated breaking change detection into CI/CD so engineers can't accidentally ship breaking changes. And I'd invest heavily in SDK quality—most customers use SDKs, not raw API calls, so the SDK becomes the abstraction layer that insulates them from API evolution."
A: "First, I'd validate the need. Is this a 'want' or a 'need'? What's the cost of not doing it? If it's truly necessary, I'd require a migration plan before approving: What's the timeline? How will existing customers be notified? What's the SDK impact? Who does direct customer outreach? Then I'd ensure it goes through the API Review Board. The new API version gets launched alongside the old one. We set a sunset date at least 12 months out. We add deprecation headers immediately. We track weekly migration metrics. We reach out to top accounts personally. Only when usage of the old version drops below a threshold do we sunset. I'd also push back on 'redesign' and ask if incremental evolution could achieve the same goals. Often a series of additive changes gets you 80% of the benefit without the migration pain."
A: "Rate limiting happens at the edge—in our cell architecture, that's the Cell Router in the Landing Zone. I use token bucket for its burst tolerance. Each account has a bucket that refills based on their tier. Implementation is Redis with Lua scripts for atomicity—you need the check-and-decrement to be atomic or you get race conditions. I set per-endpoint limits too, since some operations are more expensive than others. Response headers tell clients their limit, remaining quota, and reset time so they can implement client-side throttling. When rate limited, I return 429 with Retry-After. For distributed rate limiting across multiple Cell Routers, I use a shared Redis cluster. The alternative is local rate limiting with some slack—each router enforces a fraction of the limit—but that's less accurate. For enterprise customers, I support custom limits defined in their account config and retrieved during auth."
A: "First, I'd verify it's actually the same request—same endpoint, headers, body, account, time window. Caching can cause this legitimately. If it's truly identical requests with different responses, I'd look at: Which cell handled each request? Could there be data inconsistency between cells? Did a deployment happen between requests? Is there a race condition in the handler? Is there randomization that shouldn't exist—like load balancing across inconsistent replicas? I'd pull the request IDs from their logs and trace them through our system. For cell-based architecture, I'd check if the requests hit the same cell or different cells, and whether those cells have identical data state. I'd also check if they're using a deprecated API version that has known inconsistencies we're in the process of fixing."
A: "Three-pronged approach. First, a comprehensive API design guide that covers naming, error formats, pagination, versioning—everything a team needs to build a consistent API. I make this the first thing new engineers read. Second, automated enforcement via OpenAPI linting in CI/CD. Every spec change runs through Spectral rules. Breaking changes are automatically detected. PRs can't merge if they violate standards. Third, an API Review Board of senior engineers who review new APIs and significant changes. They catch what automation misses—business logic issues, cross-team conflicts, strategic misalignment. The key is making compliance easy. I provide templates, generators, and golden examples. If you start from our template, you're 80% compliant automatically. The review board should feel like a partnership, not a gate."
| Metric | Guideline | Rationale |
|---|---|---|
| Deprecation period | 12 months minimum | Customers need time to migrate |
| API response time (p99) | < 200ms | Developer experience threshold |
| Rate limit response | 429 + Retry-After header | Standard, actionable for clients |
| Page size default | 20-50 items | Balance response size and round trips |
| Page size max | 100-200 items | Prevent memory/timeout issues |
| Idempotency key retention | 24 hours | Handle retries from failures |
| Breaking change notice | 6 months minimum | Before removal of old behavior |