Purpose: Master deployment strategies used by companies like Netflix, Amazon, and Google to deploy thousands of times per day with zero downtime.
1. Blue-Green Deployment
Zero-downtime deployment by maintaining two identical environments.
Concept
graph LR
LB[Load Balancer]
subgraph "Blue Environment (Current)"
B1[App v1.0]
B2[App v1.0]
B3[App v1.0]
end
subgraph "Green Environment (New)"
G1[App v2.0]
G2[App v2.0]
G3[App v2.0]
end
LB -->|100% traffic| B1
LB -->|100% traffic| B2
LB -->|100% traffic| B3
    LB -.->|0% traffic<br/>Deploy & Test| G1
    LB -.->|0% traffic| G2
    LB -.->|0% traffic| G3
    style B1 fill:#5e81ac
    style B2 fill:#5e81ac
    style B3 fill:#5e81ac
    style G1 fill:#a3be8c,color:#2e3440
    style G2 fill:#a3be8c,color:#2e3440
    style G3 fill:#a3be8c,color:#2e3440
Deployment Process
- Deploy to Green: Deploy new version to idle environment
- Test Green: Run smoke tests, health checks
- Switch Traffic: Flip load balancer to Green (100% instant cutover)
- Monitor: Watch metrics, errors
- Rollback (if needed): Flip back to Blue instantly
import time
import random
from typing import List, Dict
from enum import Enum
class Environment(Enum):
BLUE = "blue"
GREEN = "green"
class LoadBalancer:
"""
Simulated load balancer for blue-green deployment.
Maintains two environments and can instantly switch traffic between them.
"""
def __init__(self):
self.blue_servers = [f"blue-{i}" for i in range(3)]
self.green_servers = [f"green-{i}" for i in range(3)]
self.active_env = Environment.BLUE
self.blue_version = "v1.0"
self.green_version = "v1.0"
def deploy_to_green(self, version: str):
"""Deploy new version to green environment"""
print(f"\n=== Deploying {version} to GREEN environment ===")
self.green_version = version
for server in self.green_servers:
print(f" Deploying to {server}...")
time.sleep(0.1) # Simulate deployment
print(f" GREEN environment updated to {version}")
def health_check(self, env: Environment) -> bool:
"""Run health checks on environment"""
servers = self.blue_servers if env == Environment.BLUE else self.green_servers
version = self.blue_version if env == Environment.BLUE else self.green_version
print(f"\n=== Health Check: {env.value} ({version}) ===")
for server in servers:
# Simulate health check (random success/failure for demo)
healthy = random.random() > 0.1 # 90% success rate
status = "Healthy" if healthy else "Unhealthy"
print(f" {server}: {status}")
if not healthy:
return False
print(f" All servers healthy!")
return True
def switch_traffic(self, to_env: Environment):
"""Instantly switch all traffic to specified environment"""
print(f"\n=== Switching Traffic: {self.active_env.value} to {to_env.value} ===")
self.active_env = to_env
active_version = self.blue_version if to_env == Environment.BLUE else self.green_version
print(f" All traffic now routed to {to_env.value} ({active_version})")
def route_request(self, request_id: int) -> str:
"""Route request to active environment"""
servers = self.blue_servers if self.active_env == Environment.BLUE else self.green_servers
server = random.choice(servers)
version = self.blue_version if self.active_env == Environment.BLUE else self.green_version
return f"Request {request_id} -> {server} ({version})"
# Example: Blue-Green Deployment
print("=== Blue-Green Deployment Example ===")
lb = LoadBalancer()
# Current state: Blue is active with v1.0
print("\n--- Current State ---")
for i in range(3):
print(lb.route_request(i))
# Deploy v2.0 to Green (no downtime!)
lb.deploy_to_green("v2.0")
# Health check Green
if lb.health_check(Environment.GREEN):
# Switch traffic to Green
lb.switch_traffic(Environment.GREEN)
print("\n--- After Switch ---")
for i in range(3):
print(lb.route_request(i))
# Simulate error detected - INSTANT ROLLBACK
print("\n--- ERROR DETECTED: Rolling back to BLUE ---")
lb.switch_traffic(Environment.BLUE)
print("\n--- After Rollback ---")
for i in range(3):
print(lb.route_request(i))
else:
print("\nHealth check failed - NOT switching traffic")
Blue-Green Benefits:
Zero downtime
Instant rollback (just flip switch)
Full testing in production environment
Simple to understand
Challenges:
2x infrastructure cost (two full environments)
Database migrations tricky (must be backward compatible)
All-or-nothing (100% cutover)
2. Canary Deployment
Gradual rollout to detect issues before affecting all users.
Concept
graph TD
Users[Users 100%]
subgraph "Stage 1: 5% Canary"
Users -->|5%| C1[New v2.0]
Users -->|95%| S1[Stable v1.0]
end
subgraph "Stage 2: 25% Canary"
Users2[Users 100%] -->|25%| C2[New v2.0]
Users2 -->|75%| S2[Stable v1.0]
end
subgraph "Stage 3: 100% Canary"
Users3[Users 100%] -->|100%| C3[New v2.0]
end
S1 -.->|Monitor metrics| Users2
S2 -.->|Monitor metrics| Users3
style C1 fill:#ebcb8b,color:#2e3440
style C2 fill:#ebcb8b,color:#2e3440
style C3 fill:#a3be8c,color:#2e3440
Deployment Process
- 5% canary: Route 5% traffic to new version
- Monitor: Error rate, latency, business metrics
- 25% canary: If good, increase to 25%
- 50% canary: Half and half
- 100% canary: Complete rollout
- Rollback at any stage: Route 100% back to stable
import random
from typing import Dict
class CanaryDeployment:
"""
Gradual rollout with automated metrics-based decision making.
Used by: Netflix, Google, Amazon
"""
def __init__(self):
self.stable_version = "v1.0"
self.canary_version = None
self.canary_percentage = 0
self.total_requests = 0
self.canary_errors = 0
self.stable_errors = 0
def deploy_canary(self, version: str, percentage: int):
"""Deploy canary version with specified traffic percentage"""
self.canary_version = version
self.canary_percentage = percentage
self.total_requests = 0
self.canary_errors = 0
self.stable_errors = 0
print(f"\n=== Canary Deployment: {percentage}% traffic to {version} ===")
def route_request(self, request_id: int) -> Dict:
"""Route request to canary or stable based on percentage"""
self.total_requests += 1
# Determine if this request goes to canary
goes_to_canary = random.random() * 100 < self.canary_percentage
if goes_to_canary and self.canary_version:
version = self.canary_version
# Simulate higher error rate in canary (for demo)
has_error = random.random() < 0.05 # 5% error rate
if has_error:
self.canary_errors += 1
else:
version = self.stable_version
# Stable has lower error rate
has_error = random.random() < 0.01 # 1% error rate
if has_error:
self.stable_errors += 1
return {
"request_id": request_id,
"version": version,
"error": has_error
}
    def get_error_rate(self, version_type: str) -> float:
        """Calculate error rate for canary or stable.

        Request counts are approximated from the configured traffic split;
        a real system would count actual requests served per version.
        """
        if self.total_requests == 0:
            return 0.0
        if version_type == "canary":
            canary_requests = self.total_requests * (self.canary_percentage / 100)
            return (self.canary_errors / canary_requests * 100) if canary_requests > 0 else 0
        else:
            stable_requests = self.total_requests * (1 - self.canary_percentage / 100)
            return (self.stable_errors / stable_requests * 100) if stable_requests > 0 else 0
def should_proceed(self, threshold: float = 2.0) -> bool:
"""
Automated decision: should we proceed with rollout?
Criteria: Canary error rate must be within threshold of stable
"""
canary_rate = self.get_error_rate("canary")
stable_rate = self.get_error_rate("stable")
print(f"\n=== Metrics Check ===")
print(f" Stable ({self.stable_version}): {stable_rate:.2f}% error rate")
print(f" Canary ({self.canary_version}): {canary_rate:.2f}% error rate")
if canary_rate > stable_rate + threshold:
print(f" Canary error rate too high! Rolling back.")
return False
else:
print(f" Canary looks healthy. Proceeding.")
return True
# Example: Automated Canary Deployment
print("=== Canary Deployment Example ===")
deployment = CanaryDeployment()
# Stage 1: 10% canary
deployment.deploy_canary("v2.0", percentage=10)
# Simulate traffic
for i in range(1000):
deployment.route_request(i)
# Check if we should proceed
if deployment.should_proceed():
# Stage 2: 50% canary
deployment.deploy_canary("v2.0", percentage=50)
for i in range(1000, 2000):
deployment.route_request(i)
# Check again
if deployment.should_proceed():
# Stage 3: 100% canary
deployment.deploy_canary("v2.0", percentage=100)
print(f"\nRollout complete! {deployment.canary_version} is now serving all traffic.")
else:
print(f"\nRollback triggered at 50% stage")
else:
print(f"\nRollback triggered at 10% stage")
Canary Metrics to Monitor:
- Error rate (5xx errors)
- Latency (p50, p95, p99)
- CPU/memory usage
- Business metrics (conversion rate, checkout success)
- Database query performance
Automated Decision: If canary > stable + threshold then auto-rollback
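Of the metrics listed above, latency percentiles (p50/p95/p99) deserve a concrete check alongside error rate. The sketch below compares canary and stable p95 latency with a nearest-rank percentile; all names and thresholds are illustrative, not a real monitoring API.

```python
import random


def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    index = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[index]


def canary_latency_ok(canary, stable, pct=95, tolerance=1.10):
    """Pass only if the canary's p95 is within 10% of stable's p95."""
    return percentile(canary, pct) <= percentile(stable, pct) * tolerance


# Simulated latency samples (ms): canary is slightly slower
random.seed(42)
stable_latencies = [random.gauss(100, 10) for _ in range(1000)]
canary_latencies = [random.gauss(115, 12) for _ in range(1000)]

print("p95 stable:", round(percentile(stable_latencies, 95), 1))
print("p95 canary:", round(percentile(canary_latencies, 95), 1))
print("proceed:", canary_latency_ok(canary_latencies, stable_latencies))
```

The same pattern applies to any metric in the list: compute it for both populations and gate the rollout on the relative difference, not an absolute number.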
3. Feature Flags
Decouple deployment from release using runtime toggles.
from typing import Set
from enum import Enum
class RolloutStrategy(Enum):
PERCENTAGE = "percentage" # Gradual rollout by %
USER_LIST = "user_list" # Specific users only
ALL = "all" # Everyone
NONE = "none" # Nobody
class FeatureFlag:
"""
Feature flag system (LaunchDarkly/Unleash style).
Enables:
- Deploy code but keep feature disabled
- Gradual rollout (0% to 100%)
- A/B testing
- Kill switch (instant disable)
"""
def __init__(self, flag_name: str):
self.flag_name = flag_name
self.strategy = RolloutStrategy.NONE
self.percentage = 0
self.enabled_users: Set[str] = set()
def enable_for_percentage(self, percentage: int):
"""Gradual rollout by percentage"""
self.strategy = RolloutStrategy.PERCENTAGE
self.percentage = percentage
print(f"[{self.flag_name}] Enabled for {percentage}% of users")
def enable_for_users(self, user_ids: Set[str]):
"""Enable for specific users (beta testers)"""
self.strategy = RolloutStrategy.USER_LIST
self.enabled_users = user_ids
print(f"[{self.flag_name}] Enabled for {len(user_ids)} specific users")
def enable_all(self):
"""Enable for everyone (full release)"""
self.strategy = RolloutStrategy.ALL
print(f"[{self.flag_name}] Enabled for ALL users")
def disable_all(self):
"""Kill switch - instant disable"""
self.strategy = RolloutStrategy.NONE
print(f"[{self.flag_name}] DISABLED (kill switch activated)")
def is_enabled(self, user_id: str) -> bool:
"""Check if feature is enabled for this user"""
if self.strategy == RolloutStrategy.NONE:
return False
if self.strategy == RolloutStrategy.ALL:
return True
if self.strategy == RolloutStrategy.USER_LIST:
return user_id in self.enabled_users
        if self.strategy == RolloutStrategy.PERCENTAGE:
            # Deterministic bucketing within one process; use a stable hash
            # (e.g. hashlib) in production, because the built-in hash() is
            # randomized per interpreter run
            user_hash = hash(user_id + self.flag_name) % 100
            return user_hash < self.percentage
return False
# Example: Feature flag usage
print("\n=== Feature Flags Example ===")
# New feature: AI-powered recommendations
ai_recommendations = FeatureFlag("ai_recommendations")
# Stage 1: Enable for internal testers
ai_recommendations.enable_for_users({"user_alice", "user_bob"})
users = ["user_alice", "user_bob", "user_charlie", "user_david"]
print("\nStage 1 - Internal testing:")
for user in users:
enabled = ai_recommendations.is_enabled(user)
print(f" {user}: {'Enabled' if enabled else 'Disabled'}")
# Stage 2: 25% rollout
print("\n")
ai_recommendations.enable_for_percentage(25)
print("\nStage 2 - 25% rollout:")
for i in range(10):
user = f"user_{i}"
enabled = ai_recommendations.is_enabled(user)
print(f" {user}: {'Enabled' if enabled else 'Disabled'}")
# Stage 3: Emergency - bug found! Kill switch
print("\n")
ai_recommendations.disable_all()
print("\nAfter kill switch:")
enabled = ai_recommendations.is_enabled("user_0")
print(f" All users: {'Enabled' if enabled else 'Disabled'}")
CI/CD Pipeline with Feature Flags
# .github/workflows/deploy.yml
name: Deploy with Feature Flags
on:
push:
branches: [main]
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- name: Checkout code
        uses: actions/checkout@v4
- name: Run tests
run: pytest tests/
- name: Build Docker image
run: docker build -t myapp:${{ github.sha }} .
- name: Push to registry
run: docker push myapp:${{ github.sha }}
- name: Deploy to production (blue-green)
run: |
# Deploy to green environment
kubectl set image deployment/myapp-green \
app=myapp:${{ github.sha }}
# Wait for rollout
kubectl rollout status deployment/myapp-green
# Health check
./scripts/health-check.sh green
# Switch traffic (blue to green)
kubectl patch service myapp \
-p '{"spec":{"selector":{"version":"green"}}}'
- name: Enable feature flag (gradual)
run: |
# Start at 5%
curl -X POST https://featureflags.com/api/flags/new-feature \
-d '{"strategy": "percentage", "value": 5}'
# Monitor for 1 hour
sleep 3600
# If metrics good, increase to 50%
curl -X POST https://featureflags.com/api/flags/new-feature \
-d '{"strategy": "percentage", "value": 50}'
4. Rollback Strategies
| Strategy | Speed | Safety | Use Case |
|---|---|---|---|
| Blue-Green Switch | Instant | Very Safe | Complete rollback needed |
| Canary Decrease | Gradual | Safe | Partial issues, want to limit blast radius |
| Feature Flag Disable | Instant | Very Safe | Feature-specific issue, rest of deployment OK |
| Revert Git Commit | Slow (redeploy) | Risky | Last resort (database migrations complicate this) |
| Traffic Replay | N/A | Testing | Replay production traffic to new version for testing |
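The table above can be read as a decision procedure: given the blast radius of the problem, pick the fastest safe lever. A minimal sketch of that logic (the scope labels and threshold are illustrative, not a real incident-response API):

```python
def choose_rollback(error_rate: float, scope: str,
                    threshold: float = 2.0) -> str:
    """Pick a rollback strategy from observed error rate and blast radius."""
    if error_rate <= threshold:
        return "no action"
    if scope == "single_feature":
        return "feature flag disable"   # instant, surgical
    if scope == "partial":
        return "canary decrease"        # shrink the blast radius gradually
    return "blue-green switch"          # instant full cutover back


print(choose_rollback(0.5, "partial"))         # no action
print(choose_rollback(5.0, "single_feature"))  # feature flag disable
print(choose_rollback(5.0, "whole_release"))   # blue-green switch
```

Reverting a git commit is deliberately absent from the fast paths: it requires a full redeploy and can collide with database migrations, which is why the table ranks it last.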
Best Practices
Deployment Checklist:
1. Automated tests (unit, integration, E2E)
2. Database migrations are backward compatible
3. Health checks configured
4. Metrics/monitoring dashboards ready
5. Rollback plan documented
6. Feature flags for new features
7. Gradual rollout (canary or blue-green)
8. Automated rollback triggers
9. Post-deployment verification
10. Communication plan (incident channel ready)
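The checklist above can be encoded as automated gates that a pipeline evaluates before allowing a deploy. A sketch under the assumption that each item maps to a callable check; the check names mirror the list, and the lambdas are stand-ins for real integrations:

```python
def run_preflight(checks: dict) -> bool:
    """Run each named gate; allow the deploy only if every one passes."""
    ok = True
    for name, check in checks.items():
        passed = check()
        print(f"  [{'PASS' if passed else 'FAIL'}] {name}")
        ok = ok and passed
    return ok


checks = {
    "tests_green": lambda: True,
    "migrations_backward_compatible": lambda: True,
    "health_checks_configured": lambda: True,
    "rollback_plan_documented": lambda: False,  # simulate a gap
}
print("Deploy allowed:", run_preflight(checks))  # → Deploy allowed: False
```

A single failing gate blocks the deploy, which is the point: the checklist only works when it cannot be skipped under pressure.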
Database Migration Pattern
-- Backward-compatible migration strategy
-- Step 1: Add new column (nullable)
ALTER TABLE users ADD COLUMN email_verified BOOLEAN NULL;
-- Deploy code that writes to BOTH old and new columns:
-- old code still works (ignores the new column),
-- new code uses the new column
-- Step 2: Backfill data
UPDATE users SET email_verified = (verification_status = 'verified');
-- Step 3: Make column NOT NULL
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;
-- Step 4: Remove old column (after all instances use new column)
ALTER TABLE users DROP COLUMN verification_status;
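The application side of Step 1 is a dual-write: new code keeps the legacy column in sync with the new one so that old instances continue to work mid-rollout. A minimal sketch, shown with `sqlite3` purely so it runs standalone; the table and column names follow the migration above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    verification_status TEXT,
    email_verified BOOLEAN
)""")
conn.execute("INSERT INTO users (id, verification_status) VALUES (1, 'verified')")


def save_verification(db, user_id: int, verified: bool) -> None:
    """Dual-write: update the legacy column and the new one together."""
    db.execute(
        "UPDATE users SET verification_status = ?, email_verified = ? WHERE id = ?",
        ("verified" if verified else "unverified", verified, user_id),
    )


save_verification(conn, 1, True)
print(conn.execute(
    "SELECT verification_status, email_verified FROM users WHERE id = 1"
).fetchone())
# → ('verified', 1)
```

Once every instance runs the dual-write code, Steps 2-4 can proceed; the dual-write is then deleted in a later release.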
Common Mistakes:
- Deploying database changes that break old code
- No automated health checks
- All-or-nothing deployments (no canary)
- No rollback plan
- Not monitoring business metrics during rollout
- Skipping load testing in staging
- No feature flags for risky changes
Real-World Examples
Netflix
- Strategy: Red-black deployment (similar to blue-green)
- Scale: Thousands of deployments per day
- Key Tool: Spinnaker (open-source CD platform)
- Rollback: Automated based on error rate thresholds
Amazon
- Strategy: One-box to fleet deployment
- Process: Deploy to one instance, monitor, then expand
- Frequency: Deploy every 11.7 seconds on average
- Safety: Automated rollback on metric deviations
Google
- Strategy: Gradual rollout with canarying
- Testing: 1% to 5% to 25% to 50% to 100%
- Monitoring: Automated anomaly detection
- Culture: "Launch and iterate" vs "launch perfect"