Purpose: Master deployment strategies used by companies like Netflix, Amazon, and Google to deploy thousands of times per day with zero downtime.
1. Blue-Green Deployment
Zero-downtime deployment by maintaining two identical environments.
Concept
graph LR
LB[Load Balancer]
subgraph "Blue Environment (Current)"
B1[App v1.0]
B2[App v1.0]
B3[App v1.0]
end
subgraph "Green Environment (New)"
G1[App v2.0]
G2[App v2.0]
G3[App v2.0]
end
LB -->|100% traffic| B1
LB -->|100% traffic| B2
LB -->|100% traffic| B3
    LB -.->|0% traffic<br/>Deploy & Test| G1
    LB -.->|0% traffic| G2
    LB -.->|0% traffic| G3
    style B1 fill:#5e81ac
    style B2 fill:#5e81ac
    style B3 fill:#5e81ac
    style G1 fill:#a3be8c,color:#2e3440
    style G2 fill:#a3be8c,color:#2e3440
    style G3 fill:#a3be8c,color:#2e3440
Deployment Process
- Deploy to Green: Deploy new version to idle environment
- Test Green: Run smoke tests, health checks
- Switch Traffic: Flip load balancer to Green (100% instant cutover)
- Monitor: Watch metrics, errors
- Rollback (if needed): Flip back to Blue instantly
import time
import random
from typing import List, Dict
from enum import Enum
class Environment(Enum):
BLUE = "blue"
GREEN = "green"
class LoadBalancer:
"""
Simulated load balancer for blue-green deployment.
Maintains two environments and can instantly switch traffic between them.
"""
def __init__(self):
self.blue_servers = [f"blue-{i}" for i in range(3)]
self.green_servers = [f"green-{i}" for i in range(3)]
self.active_env = Environment.BLUE
self.blue_version = "v1.0"
self.green_version = "v1.0"
def deploy_to_green(self, version: str):
"""Deploy new version to green environment"""
print(f"\n=== Deploying {version} to GREEN environment ===")
self.green_version = version
for server in self.green_servers:
print(f" Deploying to {server}...")
time.sleep(0.1) # Simulate deployment
print(f" GREEN environment updated to {version}")
def health_check(self, env: Environment) -> bool:
"""Run health checks on environment"""
servers = self.blue_servers if env == Environment.BLUE else self.green_servers
version = self.blue_version if env == Environment.BLUE else self.green_version
print(f"\n=== Health Check: {env.value} ({version}) ===")
for server in servers:
# Simulate health check (random success/failure for demo)
healthy = random.random() > 0.1 # 90% success rate
status = "Healthy" if healthy else "Unhealthy"
print(f" {server}: {status}")
if not healthy:
return False
print(f" All servers healthy!")
return True
def switch_traffic(self, to_env: Environment):
"""Instantly switch all traffic to specified environment"""
print(f"\n=== Switching Traffic: {self.active_env.value} to {to_env.value} ===")
self.active_env = to_env
active_version = self.blue_version if to_env == Environment.BLUE else self.green_version
print(f" All traffic now routed to {to_env.value} ({active_version})")
def route_request(self, request_id: int) -> str:
"""Route request to active environment"""
servers = self.blue_servers if self.active_env == Environment.BLUE else self.green_servers
server = random.choice(servers)
version = self.blue_version if self.active_env == Environment.BLUE else self.green_version
return f"Request {request_id} -> {server} ({version})"
# Example: Blue-Green Deployment
print("=== Blue-Green Deployment Example ===")
lb = LoadBalancer()
# Current state: Blue is active with v1.0
print("\n--- Current State ---")
for i in range(3):
print(lb.route_request(i))
# Deploy v2.0 to Green (no downtime!)
lb.deploy_to_green("v2.0")
# Health check Green
if lb.health_check(Environment.GREEN):
# Switch traffic to Green
lb.switch_traffic(Environment.GREEN)
print("\n--- After Switch ---")
for i in range(3):
print(lb.route_request(i))
# Simulate error detected - INSTANT ROLLBACK
print("\n--- ERROR DETECTED: Rolling back to BLUE ---")
lb.switch_traffic(Environment.BLUE)
print("\n--- After Rollback ---")
for i in range(3):
print(lb.route_request(i))
else:
print("\nHealth check failed - NOT switching traffic")
Blue-Green Benefits:
Zero downtime
Instant rollback (just flip switch)
Full testing in production environment
Simple to understand
Challenges:
2x infrastructure cost (two full environments)
Database migrations tricky (must be backward compatible)
All-or-nothing (100% cutover)
2. Canary Deployment
Gradual rollout to detect issues before affecting all users.
Concept
graph TD
Users[Users 100%]
subgraph "Stage 1: 5% Canary"
Users -->|5%| C1[New v2.0]
Users -->|95%| S1[Stable v1.0]
end
subgraph "Stage 2: 25% Canary"
Users2[Users 100%] -->|25%| C2[New v2.0]
Users2 -->|75%| S2[Stable v1.0]
end
subgraph "Stage 3: 100% Canary"
Users3[Users 100%] -->|100%| C3[New v2.0]
end
S1 -.->|Monitor metrics| Users2
S2 -.->|Monitor metrics| Users3
style C1 fill:#ebcb8b,color:#2e3440
style C2 fill:#ebcb8b,color:#2e3440
style C3 fill:#a3be8c,color:#2e3440
Deployment Process
- 5% canary: Route 5% traffic to new version
- Monitor: Error rate, latency, business metrics
- 25% canary: If good, increase to 25%
- 50% canary: Half and half
- 100% canary: Complete rollout
- Rollback at any stage: Route 100% back to stable
import random
from typing import Dict
class CanaryDeployment:
"""
Gradual rollout with automated metrics-based decision making.
Used by: Netflix, Google, Amazon
"""
def __init__(self):
self.stable_version = "v1.0"
self.canary_version = None
self.canary_percentage = 0
self.total_requests = 0
self.canary_errors = 0
self.stable_errors = 0
def deploy_canary(self, version: str, percentage: int):
"""Deploy canary version with specified traffic percentage"""
self.canary_version = version
self.canary_percentage = percentage
self.total_requests = 0
self.canary_errors = 0
self.stable_errors = 0
print(f"\n=== Canary Deployment: {percentage}% traffic to {version} ===")
def route_request(self, request_id: int) -> Dict:
"""Route request to canary or stable based on percentage"""
self.total_requests += 1
# Determine if this request goes to canary
goes_to_canary = random.random() * 100 < self.canary_percentage
if goes_to_canary and self.canary_version:
version = self.canary_version
# Simulate higher error rate in canary (for demo)
has_error = random.random() < 0.05 # 5% error rate
if has_error:
self.canary_errors += 1
else:
version = self.stable_version
# Stable has lower error rate
has_error = random.random() < 0.01 # 1% error rate
if has_error:
self.stable_errors += 1
return {
"request_id": request_id,
"version": version,
"error": has_error
}
    def get_error_rate(self, version_type: str) -> float:
        """Calculate error rate for canary or stable.

        Request counts are approximated from the configured traffic split;
        a real system would count actual requests served per version.
        """
        if self.total_requests == 0:
            return 0.0
        if version_type == "canary":
            canary_requests = self.total_requests * (self.canary_percentage / 100)
            return (self.canary_errors / canary_requests * 100) if canary_requests > 0 else 0
        else:
            stable_requests = self.total_requests * (1 - self.canary_percentage / 100)
            return (self.stable_errors / stable_requests * 100) if stable_requests > 0 else 0
def should_proceed(self, threshold: float = 2.0) -> bool:
"""
Automated decision: should we proceed with rollout?
Criteria: Canary error rate must be within threshold of stable
"""
canary_rate = self.get_error_rate("canary")
stable_rate = self.get_error_rate("stable")
print(f"\n=== Metrics Check ===")
print(f" Stable ({self.stable_version}): {stable_rate:.2f}% error rate")
print(f" Canary ({self.canary_version}): {canary_rate:.2f}% error rate")
if canary_rate > stable_rate + threshold:
print(f" Canary error rate too high! Rolling back.")
return False
else:
print(f" Canary looks healthy. Proceeding.")
return True
# Example: Automated Canary Deployment
print("=== Canary Deployment Example ===")
deployment = CanaryDeployment()
# Stage 1: 10% canary
deployment.deploy_canary("v2.0", percentage=10)
# Simulate traffic
for i in range(1000):
deployment.route_request(i)
# Check if we should proceed
if deployment.should_proceed():
# Stage 2: 50% canary
deployment.deploy_canary("v2.0", percentage=50)
for i in range(1000, 2000):
deployment.route_request(i)
# Check again
if deployment.should_proceed():
# Stage 3: 100% canary
deployment.deploy_canary("v2.0", percentage=100)
print(f"\nRollout complete! {deployment.canary_version} is now serving all traffic.")
else:
print(f"\nRollback triggered at 50% stage")
else:
print(f"\nRollback triggered at 10% stage")
Canary Metrics to Monitor:
- Error rate (5xx errors)
- Latency (p50, p95, p99)
- CPU/memory usage
- Business metrics (conversion rate, checkout success)
- Database query performance
Automated Decision: If canary > stable + threshold then auto-rollback
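Of the metrics listed above, latency percentiles (p50/p95/p99) deserve a concrete check alongside error rate. The sketch below compares canary and stable p95 latency with a nearest-rank percentile; all names and thresholds are illustrative, not a real monitoring API.

```python
import random


def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    index = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[index]


def canary_latency_ok(canary, stable, pct=95, tolerance=1.10):
    """Pass only if the canary's p95 is within 10% of stable's p95."""
    return percentile(canary, pct) <= percentile(stable, pct) * tolerance


# Simulated latency samples (ms): canary is slightly slower
random.seed(42)
stable_latencies = [random.gauss(100, 10) for _ in range(1000)]
canary_latencies = [random.gauss(115, 12) for _ in range(1000)]

print("p95 stable:", round(percentile(stable_latencies, 95), 1))
print("p95 canary:", round(percentile(canary_latencies, 95), 1))
print("proceed:", canary_latency_ok(canary_latencies, stable_latencies))
```

The same pattern applies to any metric in the list: compute it for both populations and gate the rollout on the relative difference, not an absolute number.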
3. Feature Flags
Decouple deployment from release using runtime toggles.
from typing import Set
from enum import Enum
class RolloutStrategy(Enum):
PERCENTAGE = "percentage" # Gradual rollout by %
USER_LIST = "user_list" # Specific users only
ALL = "all" # Everyone
NONE = "none" # Nobody
class FeatureFlag:
"""
Feature flag system (LaunchDarkly/Unleash style).
Enables:
- Deploy code but keep feature disabled
- Gradual rollout (0% to 100%)
- A/B testing
- Kill switch (instant disable)
"""
def __init__(self, flag_name: str):
self.flag_name = flag_name
self.strategy = RolloutStrategy.NONE
self.percentage = 0
self.enabled_users: Set[str] = set()
def enable_for_percentage(self, percentage: int):
"""Gradual rollout by percentage"""
self.strategy = RolloutStrategy.PERCENTAGE
self.percentage = percentage
print(f"[{self.flag_name}] Enabled for {percentage}% of users")
def enable_for_users(self, user_ids: Set[str]):
"""Enable for specific users (beta testers)"""
self.strategy = RolloutStrategy.USER_LIST
self.enabled_users = user_ids
print(f"[{self.flag_name}] Enabled for {len(user_ids)} specific users")
def enable_all(self):
"""Enable for everyone (full release)"""
self.strategy = RolloutStrategy.ALL
print(f"[{self.flag_name}] Enabled for ALL users")
def disable_all(self):
"""Kill switch - instant disable"""
self.strategy = RolloutStrategy.NONE
print(f"[{self.flag_name}] DISABLED (kill switch activated)")
def is_enabled(self, user_id: str) -> bool:
"""Check if feature is enabled for this user"""
if self.strategy == RolloutStrategy.NONE:
return False
if self.strategy == RolloutStrategy.ALL:
return True
if self.strategy == RolloutStrategy.USER_LIST:
return user_id in self.enabled_users
        if self.strategy == RolloutStrategy.PERCENTAGE:
            # Deterministic bucketing within one process; use a stable hash
            # (e.g. hashlib) in production, because the built-in hash() is
            # randomized per interpreter run
            user_hash = hash(user_id + self.flag_name) % 100
            return user_hash < self.percentage
return False
# Example: Feature flag usage
print("\n=== Feature Flags Example ===")
# New feature: AI-powered recommendations
ai_recommendations = FeatureFlag("ai_recommendations")
# Stage 1: Enable for internal testers
ai_recommendations.enable_for_users({"user_alice", "user_bob"})
users = ["user_alice", "user_bob", "user_charlie", "user_david"]
print("\nStage 1 - Internal testing:")
for user in users:
enabled = ai_recommendations.is_enabled(user)
print(f" {user}: {'Enabled' if enabled else 'Disabled'}")
# Stage 2: 25% rollout
print("\n")
ai_recommendations.enable_for_percentage(25)
print("\nStage 2 - 25% rollout:")
for i in range(10):
user = f"user_{i}"
enabled = ai_recommendations.is_enabled(user)
print(f" {user}: {'Enabled' if enabled else 'Disabled'}")
# Stage 3: Emergency - bug found! Kill switch
print("\n")
ai_recommendations.disable_all()
print("\nAfter kill switch:")
enabled = ai_recommendations.is_enabled("user_0")
print(f" All users: {'Enabled' if enabled else 'Disabled'}")
CI/CD Pipeline with Feature Flags
# .github/workflows/deploy.yml
name: Deploy with Feature Flags
on:
push:
branches: [main]
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- name: Checkout code
        uses: actions/checkout@v4
- name: Run tests
run: pytest tests/
- name: Build Docker image
run: docker build -t myapp:${{ github.sha }} .
- name: Push to registry
run: docker push myapp:${{ github.sha }}
- name: Deploy to production (blue-green)
run: |
# Deploy to green environment
kubectl set image deployment/myapp-green \
app=myapp:${{ github.sha }}
# Wait for rollout
kubectl rollout status deployment/myapp-green
# Health check
./scripts/health-check.sh green
# Switch traffic (blue to green)
kubectl patch service myapp \
-p '{"spec":{"selector":{"version":"green"}}}'
- name: Enable feature flag (gradual)
run: |
# Start at 5%
curl -X POST https://featureflags.com/api/flags/new-feature \
-d '{"strategy": "percentage", "value": 5}'
# Monitor for 1 hour
sleep 3600
# If metrics good, increase to 50%
curl -X POST https://featureflags.com/api/flags/new-feature \
-d '{"strategy": "percentage", "value": 50}'
4. Rollback Strategies
| Strategy | Speed | Safety | Use Case |
|---|---|---|---|
| Blue-Green Switch | Instant | Very Safe | Complete rollback needed |
| Canary Decrease | Gradual | Safe | Partial issues, want to limit blast radius |
| Feature Flag Disable | Instant | Very Safe | Feature-specific issue, rest of deployment OK |
| Revert Git Commit | Slow (redeploy) | Risky | Last resort (database migrations complicate this) |
| Traffic Replay | N/A | Testing | Replay production traffic to new version for testing |
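The table above can be read as a decision procedure: given the blast radius of the problem, pick the fastest safe lever. A minimal sketch of that logic (the scope labels and threshold are illustrative, not a real incident-response API):

```python
def choose_rollback(error_rate: float, scope: str,
                    threshold: float = 2.0) -> str:
    """Pick a rollback strategy from observed error rate and blast radius."""
    if error_rate <= threshold:
        return "no action"
    if scope == "single_feature":
        return "feature flag disable"   # instant, surgical
    if scope == "partial":
        return "canary decrease"        # shrink the blast radius gradually
    return "blue-green switch"          # instant full cutover back


print(choose_rollback(0.5, "partial"))         # no action
print(choose_rollback(5.0, "single_feature"))  # feature flag disable
print(choose_rollback(5.0, "whole_release"))   # blue-green switch
```

Reverting a git commit is deliberately absent from the fast paths: it requires a full redeploy and can collide with database migrations, which is why the table ranks it last.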
Best Practices
Deployment Checklist:
1. Automated tests (unit, integration, E2E)
2. Database migrations are backward compatible
3. Health checks configured
4. Metrics/monitoring dashboards ready
5. Rollback plan documented
6. Feature flags for new features
7. Gradual rollout (canary or blue-green)
8. Automated rollback triggers
9. Post-deployment verification
10. Communication plan (incident channel ready)
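The checklist above can be encoded as automated gates that a pipeline evaluates before allowing a deploy. A sketch under the assumption that each item maps to a callable check; the check names mirror the list, and the lambdas are stand-ins for real integrations:

```python
def run_preflight(checks: dict) -> bool:
    """Run each named gate; allow the deploy only if every one passes."""
    ok = True
    for name, check in checks.items():
        passed = check()
        print(f"  [{'PASS' if passed else 'FAIL'}] {name}")
        ok = ok and passed
    return ok


checks = {
    "tests_green": lambda: True,
    "migrations_backward_compatible": lambda: True,
    "health_checks_configured": lambda: True,
    "rollback_plan_documented": lambda: False,  # simulate a gap
}
print("Deploy allowed:", run_preflight(checks))  # → Deploy allowed: False
```

A single failing gate blocks the deploy, which is the point: the checklist only works when it cannot be skipped under pressure.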
Database Migration Pattern
-- Backward-compatible migration strategy
-- Step 1: Add new column (nullable)
ALTER TABLE users ADD COLUMN email_verified BOOLEAN NULL;
-- Deploy code that writes to BOTH old and new columns:
-- old code still works (ignores the new column),
-- new code uses the new column
-- Step 2: Backfill data
UPDATE users SET email_verified = (verification_status = 'verified');
-- Step 3: Make column NOT NULL
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;
-- Step 4: Remove old column (after all instances use new column)
ALTER TABLE users DROP COLUMN verification_status;
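The application side of Step 1 is a dual-write: new code keeps the legacy column in sync with the new one so that old instances continue to work mid-rollout. A minimal sketch, shown with `sqlite3` purely so it runs standalone; the table and column names follow the migration above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    verification_status TEXT,
    email_verified BOOLEAN
)""")
conn.execute("INSERT INTO users (id, verification_status) VALUES (1, 'verified')")


def save_verification(db, user_id: int, verified: bool) -> None:
    """Dual-write: update the legacy column and the new one together."""
    db.execute(
        "UPDATE users SET verification_status = ?, email_verified = ? WHERE id = ?",
        ("verified" if verified else "unverified", verified, user_id),
    )


save_verification(conn, 1, True)
print(conn.execute(
    "SELECT verification_status, email_verified FROM users WHERE id = 1"
).fetchone())
# → ('verified', 1)
```

Once every instance runs the dual-write code, Steps 2-4 can proceed; the dual-write is then deleted in a later release.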
Common Mistakes:
- Deploying database changes that break old code
- No automated health checks
- All-or-nothing deployments (no canary)
- No rollback plan
- Not monitoring business metrics during rollout
- Skipping load testing in staging
- No feature flags for risky changes
Real-World Examples
Netflix
- Strategy: Red-black deployment (similar to blue-green)
- Scale: Thousands of deployments per day
- Key Tool: Spinnaker (open-source CD platform)
- Rollback: Automated based on error rate thresholds
Amazon
- Strategy: One-box to fleet deployment
- Process: Deploy to one instance, monitor, then expand
- Frequency: Deploy every 11.7 seconds on average
- Safety: Automated rollback on metric deviations
Google
- Strategy: Gradual rollout with canarying
- Testing: 1% to 5% to 25% to 50% to 100%
- Monitoring: Automated anomaly detection
- Culture: "Launch and iterate" vs "launch perfect"