AWS Architecture Guide

Building Scalable, Secure Cloud Infrastructure

Guide Overview

This guide covers AWS architecture from the ground up: account organization, networking, compute, storage, databases, and security. Each section includes Terraform examples, comparison tables, and real-world architecture patterns for senior/staff engineer interviews.

Cross-references: See Messaging Systems for SQS, SNS, Kinesis, EventBridge details.

1. AWS Organizations & Account Structure

1.1 Multi-Account Strategy

Why Multiple Accounts?

Security Isolation: Blast radius containment - breach in dev doesn't affect prod
Billing Separation: Clear cost attribution per environment/team
Regulatory Compliance: Separate PCI/HIPAA workloads from general infrastructure
Resource Limits: Service quotas apply per account, and mostly per Region (e.g., VPCs, EC2 instances)

graph TB
    Root["AWS Organization Root<br/>Management Account"]
    Root --> Core["Core OU"]
    Root --> Workloads["Workloads OU"]
    Root --> Security["Security OU"]
    Core --> Log["Log Archive Account<br/>(CloudTrail, Config)"]
    Core --> Audit["Audit Account<br/>(Security Hub, GuardDuty)"]
    Workloads --> Dev["Development Account"]
    Workloads --> Staging["Staging Account"]
    Workloads --> Prod["Production Account"]
    Security --> SecurityTools["Security Tooling Account<br/>(Inspector, Macie)"]
    Root -.->|"SCPs"| Core
    Root -.->|"SCPs"| Workloads
    Root -.->|"SCPs"| Security
    style Root fill:#ff9900,stroke:#232f3e,stroke-width:3px,color:#232f3e
    style Prod fill:#bf616a,stroke:#d08770,stroke-width:2px,color:#2e3440
    style Security fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440

Account Structure Best Practices

Account Type	Purpose	Who Has Access
Management (Root)	AWS Organizations, billing, minimal workloads	Finance, senior ops only
Log Archive	Centralized logging (CloudTrail, VPC Flow Logs)	Security team (read-only), automated systems (write)
Audit/Security	Security Hub, GuardDuty aggregation, compliance	Security team, compliance auditors
Shared Services	Active Directory, VPN, Transit Gateway	Network team, infra team
Development	Development/testing workloads	All engineers (broad permissions)
Staging	Pre-production testing	Engineers (limited), QA team
Production	Live customer-facing workloads	On-call engineers (read-only + incident response)

Terraform: AWS Organizations Setup

# Create AWS Organization (run in management account)
resource "aws_organizations_organization" "org" {
  feature_set = "ALL"

  aws_service_access_principals = [
    "cloudtrail.amazonaws.com",
    "config.amazonaws.com",
    "guardduty.amazonaws.com",
    "securityhub.amazonaws.com"
  ]

  enabled_policy_types = [
    "SERVICE_CONTROL_POLICY",
    "TAG_POLICY"
  ]
}

# Create Organizational Units
resource "aws_organizations_organizational_unit" "workloads" {
  name      = "Workloads"
  parent_id = aws_organizations_organization.org.roots[0].id
}

resource "aws_organizations_organizational_unit" "security" {
  name      = "Security"
  parent_id = aws_organizations_organization.org.roots[0].id
}

# Create Production Account
resource "aws_organizations_account" "production" {
  name      = "production"
  email     = "aws-prod@company.com"
  parent_id = aws_organizations_organizational_unit.workloads.id

  role_name = "OrganizationAccountAccessRole"

  tags = {
    Environment = "production"
    CostCenter  = "engineering"
  }
}

# Service Control Policy (SCP) - Prevent region usage outside allowed list
resource "aws_organizations_policy" "require_regions" {
  name        = "RequireAllowedRegions"
  description = "Restrict operations to us-east-1 and us-west-2"
  type        = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "DenyAllOutsideAllowedRegions"
        Effect = "Deny"
        NotAction = [
          "iam:*",
          "organizations:*",
          "route53:*",
          "budgets:*",
          "cloudfront:*",
          "support:*",
          "sts:*"
        ]
        Resource = "*"
        Condition = {
          StringNotEquals = {
            "aws:RequestedRegion" = [
              "us-east-1",
              "us-west-2"
            ]
          }
        }
      }
    ]
  })
}

# Attach SCP to Workloads OU
resource "aws_organizations_policy_attachment" "workloads_regions" {
  policy_id = aws_organizations_policy.require_regions.id
  target_id = aws_organizations_organizational_unit.workloads.id
}
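
The organization diagram also shows a Core OU holding the Log Archive and Audit accounts, which the configuration above does not create. A minimal sketch of those pieces follows, reusing `aws_organizations_organization.org` from above. The resource names, account names, and email addresses are illustrative assumptions, not values from the original setup.

```hcl
# Core OU (hypothetical resource name), sibling of Workloads and Security
resource "aws_organizations_organizational_unit" "core" {
  name      = "Core"
  parent_id = aws_organizations_organization.org.roots[0].id
}

# Log Archive account: central destination for CloudTrail and Config data
resource "aws_organizations_account" "log_archive" {
  name      = "log-archive"
  email     = "aws-log-archive@company.com" # assumed address; must be unique per account
  parent_id = aws_organizations_organizational_unit.core.id

  role_name = "OrganizationAccountAccessRole"

  tags = {
    Purpose = "log-archive"
  }
}

# Audit account: aggregation point for Security Hub and GuardDuty findings
resource "aws_organizations_account" "audit" {
  name      = "audit"
  email     = "aws-audit@company.com" # assumed address; must be unique per account
  parent_id = aws_organizations_organizational_unit.core.id

  role_name = "OrganizationAccountAccessRole"

  tags = {
    Purpose = "security-audit"
  }
}
```

Every AWS account needs its own unique root email address, so teams often use plus-addressing or a distribution list per account.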

1.2 IAM Foundation

IAM Best Practices:

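Never use the root user for daily work: Enable MFA on it and create a separate admin identity immediately (ideally federated through IAM Identity Center rather than a long-lived IAM user)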
No long-lived credentials: Use IAM roles and temporary credentials
Principle of least privilege: Grant minimum permissions needed
Use IAM roles for services: EC2, Lambda, ECS should assume roles, not use API keys
Enable CloudTrail: Audit all API calls

IAM Components

Component	Purpose	Example Use Case
IAM Users	Human users (avoid for services)	Developer accessing AWS Console
IAM Roles	Assumed by services or federated users	EC2 instance accessing S3, cross-account access
IAM Policies	JSON documents defining permissions	Allow S3 read access to specific bucket
IAM Groups	Collection of users for permission management	Engineers group with EC2/S3 access
Service Control Policies (SCPs)	Org-level guardrails that cap what member accounts can do (they never grant permissions)	Prevent production account from deleting CloudTrail
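
The SCP use case in the last row of the table can be sketched as follows. This is one possible shape, assuming the `aws_organizations_account.production` resource from section 1.1. The policy name, the `Sid`, and the exact set of denied actions are illustrative choices.

```hcl
# SCP that blocks tampering with CloudTrail (hypothetical policy name)
resource "aws_organizations_policy" "protect_cloudtrail" {
  name        = "ProtectCloudTrail"
  description = "Deny stopping, deleting, or modifying CloudTrail trails"
  type        = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "DenyCloudTrailTampering"
        Effect = "Deny"
        Action = [
          "cloudtrail:DeleteTrail",
          "cloudtrail:StopLogging",
          "cloudtrail:UpdateTrail"
        ]
        Resource = "*"
      }
    ]
  })
}

# Attach directly to the production account (an OU ID would also work as target)
resource "aws_organizations_policy_attachment" "production_protect_cloudtrail" {
  policy_id = aws_organizations_policy.protect_cloudtrail.id
  target_id = aws_organizations_account.production.id
}
```

Because an SCP only filters permissions, principals in the account still need IAM policies granting whatever the SCP leaves allowed.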

Terraform: IAM Roles for Services

# IAM Role for EC2 instances to access S3 and DynamoDB
resource "aws_iam_role" "app_server" {
  name = "app-server-role"

  # Trust policy - who can assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })

  tags = {
    Purpose = "application-server"
  }
}

# Permission policy - what this role can do
resource "aws_iam_role_policy" "app_server_permissions" {
  name = "app-server-permissions"
  role = aws_iam_role.app_server.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # S3 access
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject"
        ]
        Resource = "arn:aws:s3:::my-app-bucket/*"
      },
      {
        # DynamoDB access
        Effect = "Allow"
        Action = [
          "dynamodb:GetItem",
          "dynamodb:PutItem",
          "dynamodb:Query"
        ]
        Resource = "arn:aws:dynamodb:us-east-1:*:table/Users"
      },
      {
        # Secrets Manager access
        Effect = "Allow"
        Action = [
          "secretsmanager:GetSecretValue"
        ]
        Resource = "arn:aws:secretsmanager:us-east-1:*:secret:app/*"
      }
    ]
  })
}

# Instance profile (wrapper for role used by EC2)
resource "aws_iam_instance_profile" "app_server" {
  name = "app-server-profile"
  role = aws_iam_role.app_server.name
}

# Attach instance profile to EC2 instance
resource "aws_instance" "app" {
  ami                  = "ami-12345678"
  instance_type        = "t3.medium"
  iam_instance_profile = aws_iam_instance_profile.app_server.name

  # Now this instance can access S3, DynamoDB, Secrets Manager
  # without any API keys or credentials!
}

Cross-Account Access Pattern

# In Production Account: Role that Dev Account can assume
resource "aws_iam_role" "prod_readonly_from_dev" {
  name = "prod-readonly-cross-account"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          AWS = "arn:aws:iam::111111111111:root"  # Dev account ID
        }
        Action = "sts:AssumeRole"
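        # ExternalId must be passed on every AssumeRole call; the console's
        # Switch Role feature cannot supply it, so this suits programmatic access
        Condition = {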
          StringEquals = {
            "sts:ExternalId" = "unique-external-id-12345"
          }
        }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "readonly" {
  role       = aws_iam_role.prod_readonly_from_dev.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}

# In Dev Account: Policy allowing users to assume prod role
resource "aws_iam_policy" "assume_prod_readonly" {
  name = "assume-prod-readonly"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = "sts:AssumeRole"
        Resource = "arn:aws:iam::222222222222:role/prod-readonly-cross-account"
      }
    ]
  })
}

# Attach to engineers group
resource "aws_iam_group_policy_attachment" "engineers_assume_prod" {
  group      = aws_iam_group.engineers.name
  policy_arn = aws_iam_policy.assume_prod_readonly.arn
}

2. Networking Architecture

2.1 VPC Fundamentals

What is a VPC?

A Virtual Private Cloud (VPC) is an isolated network within AWS where you launch resources. Think of it as your own private data center in the cloud with complete control over:

IP address ranges (CIDR blocks)
Subnets (subdivisions of your VPC)
Route tables (how traffic flows)
Internet gateways (connection to internet)
Security groups and NACLs (firewalls)

graph TB
    Internet["Internet"]
    subgraph VPC["VPC: 10.0.0.0/16"]
        IGW["Internet Gateway"]
        subgraph AZ1["Availability Zone us-east-1a"]
            PubSub1["Public Subnet<br/>10.0.1.0/24"]
            PrivSub1["Private Subnet<br/>10.0.10.0/24"]
            NAT1["NAT Gateway"]
        end
        subgraph AZ2["Availability Zone us-east-1b"]
            PubSub2["Public Subnet<br/>10.0.2.0/24"]
            PrivSub2["Private Subnet<br/>10.0.20.0/24"]
            NAT2["NAT Gateway"]
        end
        Web1["Web Server<br/>EC2"]
        Web2["Web Server<br/>EC2"]
        App1["App Server<br/>EC2"]
        App2["App Server<br/>EC2"]
        PubSub1 --> Web1
        PubSub2 --> Web2
        PrivSub1 --> App1
        PrivSub2 --> App2
        PubSub1 --> NAT1
        PubSub2 --> NAT2
    end
    Internet <--> IGW
    IGW <--> PubSub1
    IGW <--> PubSub2
    PrivSub1 --> NAT1
    PrivSub2 --> NAT2
    NAT1 --> IGW
    NAT2 --> IGW
    style VPC fill:#2e3440,stroke:#88c0d0,stroke-width:3px,color:#e0e0e0
    style PubSub1 fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440
    style PubSub2 fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440
    style PrivSub1 fill:#bf616a,stroke:#d08770,stroke-width:2px,color:#2e3440
    style PrivSub2 fill:#bf616a,stroke:#d08770,stroke-width:2px,color:#2e3440

Public vs Private Subnets

Subnet Type	Internet Access	Route Table	Use Cases
Public Subnet	Direct (via Internet Gateway)	0.0.0.0/0 → Internet Gateway	Load balancers, bastion hosts, NAT gateways
Private Subnet	Outbound only (via NAT Gateway)	0.0.0.0/0 → NAT Gateway	Application servers, databases, internal services
Isolated Subnet	None	Local VPC routes only	Databases with no internet access needed
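
The isolated subnet row can be illustrated with a short sketch. It assumes the `aws_vpc.main` VPC defined in the complete VPC setup in this section. The `10.0.100.0/24` CIDR and the resource names are hypothetical.

```hcl
# Isolated subnet: no route to an Internet Gateway or NAT Gateway
resource "aws_subnet" "isolated_1" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.100.0/24" # assumed free range inside 10.0.0.0/16
  availability_zone = "us-east-1a"

  tags = {
    Name = "isolated-subnet-1a"
    Type = "isolated"
  }
}

# Route table with no route blocks: only the implicit local VPC route exists
resource "aws_route_table" "isolated" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "isolated-route-table"
  }
}

resource "aws_route_table_association" "isolated_1" {
  subnet_id      = aws_subnet.isolated_1.id
  route_table_id = aws_route_table.isolated.id
}
```

The explicit association matters: a subnet without one falls back to the VPC's main route table, which may carry a default route.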

Terraform: Complete VPC Setup

# VPC
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "production-vpc"
  }
}

# Internet Gateway
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "production-igw"
  }
}

# Public Subnet AZ1
resource "aws_subnet" "public_1" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "us-east-1a"
  map_public_ip_on_launch = true

  tags = {
    Name = "public-subnet-1a"
    Type = "public"
  }
}

# Public Subnet AZ2
resource "aws_subnet" "public_2" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.2.0/24"
  availability_zone       = "us-east-1b"
  map_public_ip_on_launch = true

  tags = {
    Name = "public-subnet-1b"
    Type = "public"
  }
}

# Private Subnet AZ1
resource "aws_subnet" "private_1" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.10.0/24"
  availability_zone = "us-east-1a"

  tags = {
    Name = "private-subnet-1a"
    Type = "private"
  }
}

# Private Subnet AZ2
resource "aws_subnet" "private_2" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.20.0/24"
  availability_zone = "us-east-1b"

  tags = {
    Name = "private-subnet-1b"
    Type = "private"
  }
}

# Elastic IP for NAT Gateway
resource "aws_eip" "nat_1" {
  domain = "vpc"

  tags = {
    Name = "nat-gateway-eip-1a"
  }
}

# NAT Gateway (for private subnets to access internet)
resource "aws_nat_gateway" "nat_1" {
  allocation_id = aws_eip.nat_1.id
  subnet_id     = aws_subnet.public_1.id

  tags = {
    Name = "nat-gateway-1a"
  }

  depends_on = [aws_internet_gateway.main]
}

# Public Route Table
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }

  tags = {
    Name = "public-route-table"
  }
}

# Private Route Table
resource "aws_route_table" "private_1" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat_1.id
  }

  tags = {
    Name = "private-route-table-1a"
  }
}

# Route Table Associations
resource "aws_route_table_association" "public_1" {
  subnet_id      = aws_subnet.public_1.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "public_2" {
  subnet_id      = aws_subnet.public_2.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private_1" {
  subnet_id      = aws_subnet.private_1.id
  route_table_id = aws_route_table.private_1.id
}
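
The setup above wires only the first Availability Zone's private subnet to a NAT Gateway, while the diagram shows one NAT Gateway per AZ and later examples refer to `aws_route_table.private_2`. A sketch of the matching AZ2 resources, mirroring the AZ1 naming, might look like this:

```hcl
# Elastic IP for the second NAT Gateway
resource "aws_eip" "nat_2" {
  domain = "vpc"

  tags = {
    Name = "nat-gateway-eip-1b"
  }
}

# NAT Gateway in the AZ2 public subnet
resource "aws_nat_gateway" "nat_2" {
  allocation_id = aws_eip.nat_2.id
  subnet_id     = aws_subnet.public_2.id

  tags = {
    Name = "nat-gateway-1b"
  }

  depends_on = [aws_internet_gateway.main]
}

# Private route table for AZ2, pointing at the AZ-local NAT Gateway
resource "aws_route_table" "private_2" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat_2.id
  }

  tags = {
    Name = "private-route-table-1b"
  }
}

resource "aws_route_table_association" "private_2" {
  subnet_id      = aws_subnet.private_2.id
  route_table_id = aws_route_table.private_2.id
}
```

A NAT Gateway per AZ keeps outbound traffic working if one AZ fails and avoids cross-AZ data transfer charges, at the price of a second hourly NAT charge.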

2.2 Security Groups vs NACLs

Feature	Security Groups	Network ACLs
Level	Instance (ENI) level	Subnet level
State	Stateful (return traffic auto-allowed)	Stateless (must explicitly allow return traffic)
Rules	Allow rules only	Allow and Deny rules
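Rule Evaluation	All rules evaluated before a decision is made	Rules evaluated in ascending number order; first match wins
Default	New group: deny all inbound, allow all outbound	Default NACL: allow all inbound and outbound; custom NACL: deny all until rules are added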
Use Case	Primary firewall (granular control)	Additional subnet-level protection

# Security Group for Web Servers
resource "aws_security_group" "web" {
  name        = "web-servers"
  description = "Security group for web tier"
  vpc_id      = aws_vpc.main.id

  # Inbound rules
  ingress {
    description = "HTTPS from internet"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "HTTP from internet"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Outbound rules (Terraform removes AWS's default allow-all egress rule,
  # so it must be declared explicitly)
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "web-sg"
  }
}

# Security Group for Application Servers
resource "aws_security_group" "app" {
  name        = "app-servers"
  description = "Security group for application tier"
  vpc_id      = aws_vpc.main.id

  # Only allow traffic from web tier
  ingress {
    description     = "HTTP from web tier"
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.web.id]  # Reference web SG
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "app-sg"
  }
}

# Security Group for Database
resource "aws_security_group" "db" {
  name        = "database"
  description = "Security group for database tier"
  vpc_id      = aws_vpc.main.id

  # Only allow traffic from app tier
  ingress {
    description     = "PostgreSQL from app tier"
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "db-sg"
  }
}
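
The security groups above cover the stateful half of the comparison table. For the stateless half, here is a minimal NACL sketch for the private subnets from section 2.1. The rule numbers and port choices are illustrative assumptions, and a real NACL would need rules for every flow that crosses the subnet boundary, database traffic included.

```hcl
# Network ACL for the private subnets (hypothetical resource name)
resource "aws_network_acl" "private" {
  vpc_id     = aws_vpc.main.id
  subnet_ids = [aws_subnet.private_1.id, aws_subnet.private_2.id]

  # Inbound: application traffic from inside the VPC
  ingress {
    rule_no    = 100
    action     = "allow"
    protocol   = "tcp"
    cidr_block = "10.0.0.0/16"
    from_port  = 8080
    to_port    = 8080
  }

  # Inbound: return traffic for outbound HTTPS calls (ephemeral ports).
  # Stateless, so replies are NOT allowed automatically.
  ingress {
    rule_no    = 110
    action     = "allow"
    protocol   = "tcp"
    cidr_block = "0.0.0.0/0"
    from_port  = 1024
    to_port    = 65535
  }

  # Outbound: HTTPS to anywhere (via the NAT Gateway)
  egress {
    rule_no    = 100
    action     = "allow"
    protocol   = "tcp"
    cidr_block = "0.0.0.0/0"
    from_port  = 443
    to_port    = 443
  }

  # Outbound: replies to clients inside the VPC (ephemeral ports)
  egress {
    rule_no    = 110
    action     = "allow"
    protocol   = "tcp"
    cidr_block = "10.0.0.0/16"
    from_port  = 1024
    to_port    = 65535
  }

  tags = {
    Name = "private-nacl"
  }
}
```

Each direction needs its own rule for both the request and the reply, which is why security groups remain the primary control and NACLs a coarse second layer.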

2.3 VPC Peering & Transit Gateway

VPC Peering

Use VPC Peering when:

Connecting 2-3 VPCs
Simple, low-latency communication needed
No transitive routing required

Limitations: Non-transitive (A↔B, B↔C doesn't mean A↔C). A full mesh of n VPCs needs n(n-1)/2 peering connections (45 for 10 VPCs), so management becomes unwieldy beyond roughly 10 VPCs

# VPC Peering Connection
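resource "aws_vpc_peering_connection" "prod_to_shared" {
  vpc_id      = aws_vpc.production.id
  peer_vpc_id = aws_vpc.shared_services.id

  # auto_accept only works when both VPCs are in the same account and Region.
  # Setting peer_region requires auto_accept = false plus an
  # aws_vpc_peering_connection_accepter on the peer side, so it is omitted here.
  auto_accept = true

  tags = {
    Name = "prod-to-shared-services"
  }
}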

# Add route in Production VPC to Shared Services VPC
resource "aws_route" "prod_to_shared" {
  route_table_id            = aws_route_table.production_private.id
  destination_cidr_block    = "10.1.0.0/16"  # Shared Services CIDR
  vpc_peering_connection_id = aws_vpc_peering_connection.prod_to_shared.id
}

# Add route in Shared Services VPC to Production VPC
resource "aws_route" "shared_to_prod" {
  route_table_id            = aws_route_table.shared_private.id
  destination_cidr_block    = "10.0.0.0/16"  # Production CIDR
  vpc_peering_connection_id = aws_vpc_peering_connection.prod_to_shared.id
}

Transit Gateway (TGW)

Use Transit Gateway when:

Connecting 4+ VPCs
Need hub-and-spoke topology
Want centralized routing control
Connecting on-premises networks (VPN/Direct Connect)

Advantages: Transitive routing, simpler management at scale, centralized network monitoring

graph TB
    OnPrem["On-Premises<br/>Data Center"]
    TGW["Transit Gateway<br/>(Hub)"]
    VPC1["Production VPC<br/>10.0.0.0/16"]
    VPC2["Staging VPC<br/>10.1.0.0/16"]
    VPC3["Dev VPC<br/>10.2.0.0/16"]
    VPC4["Shared Services VPC<br/>10.3.0.0/16"]
    OnPrem <-->|"VPN/Direct Connect"| TGW
    TGW <--> VPC1
    TGW <--> VPC2
    TGW <--> VPC3
    TGW <--> VPC4
    style TGW fill:#ff9900,stroke:#232f3e,stroke-width:3px,color:#232f3e
    style VPC1 fill:#bf616a,stroke:#d08770,stroke-width:2px,color:#2e3440
    style VPC4 fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440

# Transit Gateway
resource "aws_ec2_transit_gateway" "main" {
  description                     = "Main TGW for all VPCs"
  default_route_table_association = "enable"
  default_route_table_propagation = "enable"

  tags = {
    Name = "main-tgw"
  }
}

# Attach Production VPC to TGW
resource "aws_ec2_transit_gateway_vpc_attachment" "production" {
  subnet_ids         = [aws_subnet.prod_private_1.id, aws_subnet.prod_private_2.id]
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  vpc_id             = aws_vpc.production.id

  tags = {
    Name = "production-tgw-attachment"
  }
}

# Attach Shared Services VPC to TGW
resource "aws_ec2_transit_gateway_vpc_attachment" "shared" {
  subnet_ids         = [aws_subnet.shared_private_1.id, aws_subnet.shared_private_2.id]
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  vpc_id             = aws_vpc.shared_services.id

  tags = {
    Name = "shared-services-tgw-attachment"
  }
}

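# Route from Production VPC to TGW (for reaching other VPCs)
resource "aws_route" "prod_to_tgw" {
  route_table_id         = aws_route_table.production_private.id
  destination_cidr_block = "10.0.0.0/8"  # All internal networks; the VPC's own /16 local route is more specific and still wins
  transit_gateway_id     = aws_ec2_transit_gateway.main.id

  # The route can only be created once the VPC is attached to the TGW
  depends_on = [aws_ec2_transit_gateway_vpc_attachment.production]
}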

2.4 VPC Endpoints (PrivateLink)

Why VPC Endpoints?

Access AWS services (S3, DynamoDB, etc.) without going through the internet. Benefits:

Security: Traffic stays within AWS network
Cost: Avoids NAT Gateway data processing charges (gateway endpoints are free; interface endpoints bill hourly plus per GB processed)
Performance: Lower latency

Endpoint Type	Services	How It Works
Gateway Endpoint	S3, DynamoDB	Route table entry, no ENI created
Interface Endpoint	Most AWS services (EC2, SNS, SQS, etc.)	ENI with private IP in your subnet

# S3 Gateway Endpoint (free)
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.us-east-1.s3"

  route_table_ids = [
    aws_route_table.private_1.id,
    aws_route_table.private_2.id
  ]

  tags = {
    Name = "s3-gateway-endpoint"
  }
}

# DynamoDB Gateway Endpoint (free)
resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.us-east-1.dynamodb"

  route_table_ids = [
    aws_route_table.private_1.id,
    aws_route_table.private_2.id
  ]

  tags = {
    Name = "dynamodb-gateway-endpoint"
  }
}

# Secrets Manager Interface Endpoint
resource "aws_vpc_endpoint" "secretsmanager" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.secretsmanager"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true

  subnet_ids = [
    aws_subnet.private_1.id,
    aws_subnet.private_2.id
  ]

  security_group_ids = [
    aws_security_group.vpc_endpoints.id
  ]

  tags = {
    Name = "secretsmanager-interface-endpoint"
  }
}
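
The interface endpoint above references `aws_security_group.vpc_endpoints`, which is not defined in this section. One plausible definition, assuming clients anywhere in `aws_vpc.main` should reach the endpoint over HTTPS, is:

```hcl
# Security group for interface endpoints (assumed definition)
resource "aws_security_group" "vpc_endpoints" {
  name        = "vpc-endpoints"
  description = "Allow HTTPS from inside the VPC to interface endpoints"
  vpc_id      = aws_vpc.main.id

  # AWS service APIs are served over TLS on port 443
  ingress {
    description = "HTTPS from VPC"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [aws_vpc.main.cidr_block]
  }

  tags = {
    Name = "vpc-endpoints-sg"
  }
}
```

No egress block is declared because the endpoint ENIs only answer requests, and security groups are stateful. A tighter variant would allow ingress from specific client security groups instead of the whole VPC CIDR.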

3. Compute Services

3.1 Compute Service Comparison

Service	Type	When to Use	Pros	Cons
EC2	Virtual Machines	Full OS control, legacy apps, persistent workloads	Complete control, any software	Must manage OS, scaling complexity
Lambda	Serverless Functions	Event-driven, short tasks (<15min), API backends	No servers, auto-scaling, pay-per-use	15min timeout, cold starts, limited runtime options
ECS	Container Orchestration	Docker containers, AWS-native orchestration	Simple, integrates well with AWS	AWS-only, less ecosystem than Kubernetes
EKS	Managed Kubernetes	Need Kubernetes, complex microservices, multi-cloud	Industry standard, portable, huge ecosystem	Complex, expensive, steep learning curve
Fargate	Serverless Containers	Containers without managing servers (ECS/EKS)	No server management, auto-scaling	More expensive than EC2, less control
App Runner	Fully Managed Apps	Deploy from source code/container with zero config	Easiest deployment, auto-scaling	Less control, newer service

3.2 EC2 Fundamentals

Instance Types Decision Tree

Family	Type	Use Case	Example
T3/T4g	Burstable	Web servers, dev/test, low-traffic apps	t3.medium (2 vCPU, 4 GB)
M5/M6i	General Purpose	Balanced compute/memory, app servers	m5.xlarge (4 vCPU, 16 GB)
C5/C6i	Compute Optimized	CPU-intensive (batch processing, ML inference)	c5.2xlarge (8 vCPU, 16 GB)
R5/R6i	Memory Optimized	In-memory databases, caches, big data	r5.xlarge (4 vCPU, 32 GB)
P3/P4	GPU	Machine learning training, HPC	p3.2xlarge (8 vCPU, 1 GPU)

# Launch template for web servers (used by the Auto Scaling Group below)
resource "aws_launch_template" "web" {
  name_prefix   = "web-"
  image_id      = "ami-12345678"  # Amazon Linux 2023
  instance_type = "t3.medium"

  # IAM instance profile
  iam_instance_profile {
    name = aws_iam_instance_profile.web.name
  }

  # Security group
  vpc_security_group_ids = [aws_security_group.web.id]

  # User data script (runs on boot)
  user_data = base64encode(<<-EOF
    #!/bin/bash
    yum update -y
    yum install -y httpd
    systemctl start httpd
    systemctl enable httpd
    echo "Hello from $(hostname)" > /var/www/html/index.html
  EOF
  )

  # EBS volume configuration
  block_device_mappings {
    device_name = "/dev/xvda"

    ebs {
      volume_size           = 20
      volume_type           = "gp3"
      encrypted             = true
      delete_on_termination = true
    }
  }

  # Metadata options (IMDSv2 enforced)
  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"  # Require IMDSv2
    http_put_response_hop_limit = 1
  }

  tag_specifications {
    resource_type = "instance"

    tags = {
      Name = "web-server"
    }
  }
}

# Auto Scaling Group
resource "aws_autoscaling_group" "web" {
  name                = "web-asg"
  vpc_zone_identifier = [aws_subnet.public_1.id, aws_subnet.public_2.id]
  target_group_arns   = [aws_lb_target_group.web.arn]
  health_check_type   = "ELB"
  health_check_grace_period = 300

  min_size         = 2
  max_size         = 10
  desired_capacity = 2

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }

  tag {
    key                 = "Name"
    value               = "web-server"
    propagate_at_launch = true
  }
}

# Auto Scaling Policy (CPU-based)
resource "aws_autoscaling_policy" "web_cpu" {
  name                   = "web-cpu-scaling"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 70.0  # Target 70% CPU
  }
}

3.3 Lambda (Serverless Functions)

Lambda Best Practices:

Keep it small: One function = one purpose
Environment variables: Use for configuration, Secrets Manager for secrets
Layers: Share code/dependencies across functions
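VPC sparingly: Attach to a VPC only when accessing VPC resources (the cold-start penalty has been small since the 2019 networking changes, but the function then needs a NAT Gateway or VPC endpoints to reach AWS APIs and the internet)
Timeout: Set appropriately (don't default to the 15-minute maximum)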

# Lambda function for S3 image processing
resource "aws_lambda_function" "image_processor" {
  filename      = "image_processor.zip"
  function_name = "process-uploaded-images"
  role          = aws_iam_role.lambda_exec.arn
  handler       = "index.handler"
  runtime       = "python3.11"

  source_code_hash = filebase64sha256("image_processor.zip")

  # Memory and timeout
  memory_size = 1024  # MB (also affects CPU)
  timeout     = 60    # seconds

  # Environment variables
  environment {
    variables = {
      DEST_BUCKET = aws_s3_bucket.processed_images.id
      MAX_WIDTH   = "1920"
      MAX_HEIGHT  = "1080"
    }
  }

  # VPC configuration (only if needed)
  # vpc_config {
  #   subnet_ids         = [aws_subnet.private_1.id, aws_subnet.private_2.id]
  #   security_group_ids = [aws_security_group.lambda.id]
  # }

  tags = {
    Purpose = "image-processing"
  }
}

# IAM role for Lambda
resource "aws_iam_role" "lambda_exec" {
  name = "lambda-image-processor-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "lambda.amazonaws.com"
        }
      }
    ]
  })
}

# Lambda policy - S3 access
resource "aws_iam_role_policy" "lambda_s3" {
  name = "lambda-s3-access"
  role = aws_iam_role.lambda_exec.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject"
        ]
        Resource = "${aws_s3_bucket.uploads.arn}/*"
      },
      {
        Effect = "Allow"
        Action = [
          "s3:PutObject"
        ]
        Resource = "${aws_s3_bucket.processed_images.arn}/*"
      }
    ]
  })
}

# Attach AWS managed policy for Lambda basics (CloudWatch Logs)
resource "aws_iam_role_policy_attachment" "lambda_basic" {
  role       = aws_iam_role.lambda_exec.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

# S3 trigger for Lambda
resource "aws_lambda_permission" "allow_s3" {
  statement_id  = "AllowExecutionFromS3"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.image_processor.function_name
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.uploads.arn
}

resource "aws_s3_bucket_notification" "upload_trigger" {
  bucket = aws_s3_bucket.uploads.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.image_processor.arn
    events              = ["s3:ObjectCreated:*"]
    filter_prefix       = "uploads/"
    filter_suffix       = ".jpg"
  }

  depends_on = [aws_lambda_permission.allow_s3]
}
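
The best-practices list for this section mentions layers, but the example does not use one. A hedged sketch of packaging shared dependencies as a layer follows. The archive name `imaging_layer.zip` and the layer name are hypothetical, and the archive is assumed to follow Lambda's layer directory layout for Python.

```hcl
# Shared dependency layer (hypothetical archive and layer name)
resource "aws_lambda_layer_version" "imaging" {
  filename            = "imaging_layer.zip"
  layer_name          = "imaging-deps"
  compatible_runtimes = ["python3.11"]

  # Publish a new layer version whenever the archive changes
  source_code_hash = filebase64sha256("imaging_layer.zip")
}

# To use it, add this argument to aws_lambda_function.image_processor:
#   layers = [aws_lambda_layer_version.imaging.arn]
```

Each publish creates a new immutable layer version, so functions referencing the Terraform attribute pick up the new ARN on the next apply.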

3.4 ECS (Elastic Container Service)

ECS vs EKS Decision:

Choose ECS if: AWS-only, simpler setup, tighter AWS integration
Choose EKS if: Need Kubernetes, multi-cloud, complex orchestration, large ecosystem

# ECS Cluster
resource "aws_ecs_cluster" "main" {
  name = "production-cluster"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  tags = {
    Environment = "production"
  }
}

# ECS Task Definition (defines container configuration)
resource "aws_ecs_task_definition" "app" {
  family                   = "app-service"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = "512"   # 0.5 vCPU
  memory                   = "1024"  # 1 GB

  # Task execution role (pulls images, writes logs)
  execution_role_arn = aws_iam_role.ecs_task_execution.arn

  # Task role (permissions for app itself)
  task_role_arn = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([
    {
      name  = "app"
      image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest"

      portMappings = [
        {
          containerPort = 8080
          protocol      = "tcp"
        }
      ]

      environment = [
        {
          name  = "ENV"
          value = "production"
        }
      ]

      # Secrets from Secrets Manager (the task execution role needs
      # secretsmanager:GetSecretValue on this secret to inject it)
      secrets = [
        {
          name      = "DB_PASSWORD"
          valueFrom = aws_secretsmanager_secret.db_password.arn
        }
      ]

      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = aws_cloudwatch_log_group.ecs.name
          "awslogs-region"        = "us-east-1"
          "awslogs-stream-prefix" = "app"
        }
      }

      healthCheck = {
        command     = ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
        interval    = 30
        timeout     = 5
        retries     = 3
        startPeriod = 60
      }
    }
  ])
}

# ECS Service (runs and maintains task instances)
resource "aws_ecs_service" "app" {
  name            = "app-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = [aws_subnet.private_1.id, aws_subnet.private_2.id]
    security_groups  = [aws_security_group.ecs_tasks.id]
    assign_public_ip = false
  }

  # Load balancer configuration
  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "app"
    container_port   = 8080
  }

  # Auto-scaling
  lifecycle {
    ignore_changes = [desired_count]  # Let auto-scaling manage this
  }

  depends_on = [aws_lb_listener.app]
}

# Auto-scaling for ECS service
resource "aws_appautoscaling_target" "ecs" {
  max_capacity       = 10
  min_capacity       = 2
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.app.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "ecs_cpu" {
  name               = "ecs-cpu-autoscaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 70.0
  }
}

4. Storage Services

4.1 Storage Service Comparison

Service	Type	Use Case	Max Size	Access
S3	Object Storage	Static assets, backups, data lake, archives	Unlimited (5TB per object)	HTTP API
EBS	Block Storage	EC2 instance boot/data volumes	64 TiB	Single EC2 instance (except io2 Multi-Attach)
EFS	File Storage (NFS)	Shared file storage across multiple EC2/containers	Unlimited	Multiple instances concurrently
FSx	Managed File Systems	Windows (SMB), Lustre (HPC), NetApp, OpenZFS	Varies	Multiple instances

4.2 S3 (Simple Storage Service)

S3 Storage Classes

Storage Class	Use Case	Availability	Retrieval	Cost
S3 Standard	Frequently accessed data	99.99%	Instant	$$$$
S3 Intelligent-Tiering	Unknown/changing access patterns	99.9%	Instant	$$$ + monitoring fee
S3 Standard-IA	Infrequently accessed (backups)	99.9%	Instant	$$ + retrieval fee
S3 One Zone-IA	Reproducible data (thumbnails)	99.5%	Instant	$ + retrieval fee
S3 Glacier Instant	Archive with instant access	99.9%	Instant	$ + retrieval fee
S3 Glacier Flexible	Archives (expedited retrieval in 1-5 min, standard in 3-5 h)	99.99%	Minutes-hours	Very low
S3 Glacier Deep Archive	Long-term archive (standard retrieval within 12 h)	99.99%	12-48 hours	Lowest

# S3 bucket with security best practices
resource "aws_s3_bucket" "app_data" {
  bucket = "my-app-data-bucket"

  tags = {
    Purpose = "application-data"
  }
}

# Block public access (CRITICAL)
resource "aws_s3_bucket_public_access_block" "app_data" {
  bucket = aws_s3_bucket.app_data.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Enable versioning
resource "aws_s3_bucket_versioning" "app_data" {
  bucket = aws_s3_bucket.app_data.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Server-side encryption (AES-256 or KMS)
resource "aws_s3_bucket_server_side_encryption_configuration" "app_data" {
  bucket = aws_s3_bucket.app_data.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.s3.id
    }
    bucket_key_enabled = true  # Reduces KMS costs
  }
}

# Lifecycle policy (auto-transition to cheaper storage)
resource "aws_s3_bucket_lifecycle_configuration" "app_data" {
  bucket = aws_s3_bucket.app_data.id

  rule {
    id     = "transition-old-data"
    status = "Enabled"

    # Empty filter applies the rule to every object in the bucket
    filter {}

    # Transition to IA after 30 days
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    # Transition to Glacier after 90 days
    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    # Delete after 365 days
    expiration {
      days = 365
    }

    # Clean up old versions
    noncurrent_version_expiration {
      noncurrent_days = 90
    }
  }
}

# Bucket policy (restrict access)
resource "aws_s3_bucket_policy" "app_data" {
  bucket = aws_s3_bucket.app_data.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Enforce SSL/TLS
        Sid    = "DenyInsecureTransport"
        Effect = "Deny"
        Principal = "*"
        Action = "s3:*"
        Resource = [
          aws_s3_bucket.app_data.arn,
          "${aws_s3_bucket.app_data.arn}/*"
        ]
        Condition = {
          Bool = {
            "aws:SecureTransport" = "false"
          }
        }
      }
    ]
  })
}
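
For data with unknown or changing access patterns (the S3 Intelligent-Tiering row in the storage class table), the optional archive tiers can be enabled per bucket. The following is a minimal sketch, assuming the app_data bucket above; the configuration name and the day thresholds are illustrative choices, not values from this guide. It only affects objects that are stored in the INTELLIGENT_TIERING storage class.

```hcl
# Sketch: opt Intelligent-Tiering objects into the archive access tiers.
# The name and thresholds below are assumptions for illustration.
resource "aws_s3_bucket_intelligent_tiering_configuration" "app_data_archive" {
  bucket = aws_s3_bucket.app_data.id
  name   = "archive-cold-objects"

  # Move objects not accessed for 90 days to the Archive Access tier
  tiering {
    access_tier = "ARCHIVE_ACCESS"
    days        = 90
  }

  # Move objects not accessed for 180 days to the Deep Archive Access tier
  tiering {
    access_tier = "DEEP_ARCHIVE_ACCESS"
    days        = 180
  }
}
```

Objects in the archive tiers are no longer instantly retrievable, so this fits data that can tolerate a restore delay.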

4.3 EBS (Elastic Block Store)

EBS Volume Types

Type	Use Case	IOPS	Throughput	Cost
gp3	General purpose (default choice)	3,000-16,000 (configurable)	125-1,000 MB/s	$0.08/GB-month
gp2	General purpose (older)	100-16,000 (size-based)	Up to 250 MB/s	$0.10/GB-month
io2	High-performance databases	Up to 64,000	Up to 1,000 MB/s	$0.125/GB-month + provisioned IOPS cost
st1	Big data, data warehouses (HDD)	N/A (throughput optimized)	Up to 500 MB/s	$0.045/GB-month
sc1	Cold storage (HDD)	N/A	Up to 250 MB/s	$0.015/GB-month

# EBS volume for database
resource "aws_ebs_volume" "database" {
  availability_zone = "us-east-1a"
  size              = 100  # GB
  type              = "gp3"
  iops              = 3000
  throughput        = 125  # MB/s
  encrypted         = true
  kms_key_id        = aws_kms_key.ebs.arn  # Expects the key ARN

  tags = {
    Name   = "database-volume"
    Backup = "true"  # Matches target_tags in the DLM policy
  }
}

# Attach to EC2 instance
resource "aws_volume_attachment" "database" {
  device_name = "/dev/sdf"
  volume_id   = aws_ebs_volume.database.id
  instance_id = aws_instance.database.id
}

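# One-off EBS snapshot (for scheduled backups, prefer Data Lifecycle Manager)
resource "aws_ebs_snapshot" "database_backup" {
  volume_id = aws_ebs_volume.database.id

  # Avoid timestamp() in the description: it changes on every run, which
  # would force Terraform to replace the snapshot on each apply.
  description = "Manual database backup"

  tags = {
    Name = "database-snapshot"
  }
}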

# Data Lifecycle Manager - automated snapshots
resource "aws_dlm_lifecycle_policy" "database_backups" {
  description        = "Daily database snapshots"
  execution_role_arn = aws_iam_role.dlm.arn
  state              = "ENABLED"

  policy_details {
    resource_types = ["VOLUME"]

    schedule {
      name = "Daily snapshots"

      create_rule {
        interval      = 24  # hours
        interval_unit = "HOURS"
        times         = ["03:00"]  # 3 AM UTC
      }

      retain_rule {
        count = 7  # Keep 7 daily snapshots
      }

      tags_to_add = {
        SnapshotCreator = "DLM"
      }

      copy_tags = true
    }

    target_tags = {
      Backup = "true"
    }
  }
}
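
The volume above sets encrypted = true explicitly. As a guardrail, encryption can also be made the default for every new volume and snapshot copy in a region. This is a small sketch, assuming the same aws_kms_key.ebs key referenced earlier; both settings apply per region, so they would need to be repeated for each region in use.

```hcl
# Sketch: encrypt all newly created EBS volumes in this region by default
resource "aws_ebs_encryption_by_default" "this" {
  enabled = true
}

# Use the customer-managed key instead of the AWS-managed aws/ebs key
resource "aws_ebs_default_kms_key" "this" {
  key_arn = aws_kms_key.ebs.arn
}
```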

4.4 EFS (Elastic File System)

EFS vs EBS:

EFS: Shared file storage, multiple instances, NFS protocol, auto-scaling
EBS: Block storage, single instance (usually), fixed size, lower latency

# EFS file system
resource "aws_efs_file_system" "shared_data" {
  creation_token = "shared-app-data"
  encrypted      = true
  kms_key_id     = aws_kms_key.efs.arn  # Expects the key ARN

  # Performance mode
  performance_mode = "generalPurpose"  # or "maxIO" for high parallelism

  # Throughput mode
  throughput_mode = "bursting"  # or "provisioned" for consistent throughput

  lifecycle_policy {
    transition_to_ia = "AFTER_30_DAYS"  # Move to cheaper Infrequent Access
  }

  tags = {
    Name = "shared-data"
  }
}

# Mount targets (one per AZ)
resource "aws_efs_mount_target" "az1" {
  file_system_id  = aws_efs_file_system.shared_data.id
  subnet_id       = aws_subnet.private_1.id
  security_groups = [aws_security_group.efs.id]
}

resource "aws_efs_mount_target" "az2" {
  file_system_id  = aws_efs_file_system.shared_data.id
  subnet_id       = aws_subnet.private_2.id
  security_groups = [aws_security_group.efs.id]
}

# Security group for EFS
resource "aws_security_group" "efs" {
  name        = "efs-mount-targets"
  description = "Security group for EFS mount targets"
  vpc_id      = aws_vpc.main.id

  ingress {
    description     = "NFS from app servers"
    from_port       = 2049
    to_port         = 2049
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]
  }

  tags = {
    Name = "efs-sg"
  }
}

# User data to mount EFS on EC2 instance
resource "aws_instance" "app" {
  # ... other config ...

  user_data = <<-EOF
    #!/bin/bash
    yum install -y amazon-efs-utils
    mkdir -p /mnt/efs
    # The "tls" option encrypts NFS traffic in transit via the EFS mount helper
    mount -t efs -o tls ${aws_efs_file_system.shared_data.id}:/ /mnt/efs
    echo "${aws_efs_file_system.shared_data.id}:/ /mnt/efs efs _netdev,tls 0 0" >> /etc/fstab
  EOF
}
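
When several applications or containers share one file system, an EFS access point can pin each client to its own directory and POSIX identity. The sketch below assumes the shared_data file system above; the path /app and the UID/GID 1000 are hypothetical values chosen for illustration.

```hcl
# Sketch: access point that confines clients to /app as UID/GID 1000
resource "aws_efs_access_point" "app" {
  file_system_id = aws_efs_file_system.shared_data.id

  # Every request through this access point is made as this user
  posix_user {
    uid = 1000
    gid = 1000
  }

  # Clients see /app as their root; EFS creates it if it does not exist
  root_directory {
    path = "/app"

    creation_info {
      owner_uid   = 1000
      owner_gid   = 1000
      permissions = "0755"
    }
  }

  tags = {
    Name = "app-access-point"
  }
}
```

ECS tasks and Lambda functions can then mount the file system through the access point rather than the file system root.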

5. Database Services

5.1 Database Service Comparison

Service	Type	Engine	When to Use	Scaling
RDS	Relational	PostgreSQL, MySQL, MariaDB, Oracle, SQL Server	Traditional RDBMS needs, ACID compliance	Vertical (read replicas for reads)
Aurora	Relational	MySQL, PostgreSQL compatible	High-performance relational, cloud-native	Vertical + auto-scaling read replicas
DynamoDB	NoSQL (Key-Value/Document)	Proprietary	Massive scale, single-digit ms latency, serverless	Horizontal (unlimited)
ElastiCache	In-Memory Cache	Redis, Memcached	Caching, session store, real-time analytics	Horizontal (cluster mode)
DocumentDB	NoSQL (Document)	MongoDB compatible	MongoDB workloads on AWS	Vertical + read replicas
Neptune	Graph	Gremlin, SPARQL	Graph databases (social networks, fraud detection)	Vertical + read replicas

Relational vs NoSQL Decision Tree

Choose Relational (RDS/Aurora) when:

Complex queries with JOINs
ACID transactions critical
Structured data with relationships
Need SQL standard compliance
Data integrity and referential integrity important

Choose NoSQL (DynamoDB) when:

Need massive scale (millions of requests/second)
Simple key-value or document lookups
Schema flexibility needed
Single-digit millisecond latency required
Serverless architecture

5.2 RDS (Relational Database Service)

# RDS PostgreSQL instance
resource "aws_db_instance" "postgres" {
  identifier     = "production-postgres"
  engine         = "postgres"
  engine_version = "15.4"
  instance_class = "db.t3.medium"

  # Storage
  allocated_storage     = 100  # GB
  max_allocated_storage = 1000  # Auto-scaling up to 1TB
  storage_type          = "gp3"
  storage_encrypted     = true
  kms_key_id            = aws_kms_key.rds.arn  # Expects the key ARN

  # Database config
  db_name  = "myapp"
  username = "dbadmin"  # "admin" is a reserved word in RDS for PostgreSQL
  password = random_password.db_master.result  # Use random password!
  port     = 5432

  # Networking
  db_subnet_group_name   = aws_db_subnet_group.main.name
  vpc_security_group_ids = [aws_security_group.rds.id]
  publicly_accessible    = false  # NEVER true for production!

  # High Availability
  multi_az               = true  # Standby in different AZ

  # Backups
  backup_retention_period = 7  # Days
  backup_window           = "03:00-04:00"  # UTC
  maintenance_window      = "Mon:04:00-Mon:05:00"  # UTC

  # Monitoring
  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
  monitoring_interval             = 60  # Enhanced monitoring every 60s
  monitoring_role_arn             = aws_iam_role.rds_monitoring.arn

  # Security
  deletion_protection       = true  # Prevent accidental deletion
  skip_final_snapshot       = false
  final_snapshot_identifier = "production-postgres-final-snapshot"

  # Performance Insights
  performance_insights_enabled          = true
  performance_insights_kms_key_id       = aws_kms_key.rds.arn  # Expects the key ARN
  performance_insights_retention_period = 7  # Days

  tags = {
    Name = "production-postgres"
  }
}

# DB Subnet Group (spans multiple AZs)
resource "aws_db_subnet_group" "main" {
  name       = "main-db-subnet-group"
  subnet_ids = [aws_subnet.private_1.id, aws_subnet.private_2.id]

  tags = {
    Name = "main-db-subnet-group"
  }
}

# Read Replica (for read scaling)
resource "aws_db_instance" "postgres_replica" {
  identifier              = "production-postgres-replica"
  replicate_source_db     = aws_db_instance.postgres.identifier
  instance_class          = "db.t3.medium"
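  publicly_accessible     = false
  skip_final_snapshot     = true  # A final snapshot cannot be taken when deleting a read replica

  # Optionally pin the replica to another AZ in the same region.
  # For a cross-region DR replica, set replicate_source_db to the source ARN instead.
  # availability_zone = "us-east-1b"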

  tags = {
    Name = "production-postgres-replica"
  }
}

# Random password for DB
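resource "random_password" "db_master" {
  length  = 32
  special = true
  # RDS rejects "/", "@", double quotes, and spaces in the master password
  override_special = "!#$%&*()-_=+[]{}<>:?"
}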

# Store password in Secrets Manager
resource "aws_secretsmanager_secret" "db_password" {
  name = "production/postgres/master-password"

  recovery_window_in_days = 7
}

resource "aws_secretsmanager_secret_version" "db_password" {
  secret_id     = aws_secretsmanager_secret.db_password.id
  secret_string = random_password.db_master.result
}
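
The instance above references aws_security_group.rds without showing it. A minimal sketch follows, mirroring the EFS security group pattern and assuming the same aws_security_group.app for the application tier. The parameter group is an optional addition that rejects non-TLS connections; its name is hypothetical, and on PostgreSQL 15 rds.force_ssl is believed to default to 1 already, so it is set here mainly to make the intent explicit.

```hcl
# Sketch: only the app tier may reach PostgreSQL
resource "aws_security_group" "rds" {
  name        = "rds-postgres"
  description = "Security group for RDS PostgreSQL"
  vpc_id      = aws_vpc.main.id

  ingress {
    description     = "PostgreSQL from app servers"
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]
  }

  tags = {
    Name = "rds-sg"
  }
}

# Sketch: require TLS for all client connections
resource "aws_db_parameter_group" "postgres_tls" {
  name   = "postgres15-force-ssl"
  family = "postgres15"

  parameter {
    name  = "rds.force_ssl"
    value = "1"
  }
}
```

To take effect, the parameter group would be attached to the instance with parameter_group_name = aws_db_parameter_group.postgres_tls.name.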

5.3 Aurora (Cloud-Native Relational)

Aurora Advantages over RDS:

Performance: Up to 5x the throughput of standard MySQL and 3x that of standard PostgreSQL (per AWS benchmarks)
Storage: Grows automatically from 10 GB up to 128 TiB, with 6 copies of the data across 3 AZs
Failover: Automatic, typically in under 30 seconds
Read Replicas: Up to 15 low-latency read replicas
Serverless: Aurora Serverless v2 auto-scales capacity

Trade-off: Typically around 20-30% more expensive than RDS, but often worth it

# Aurora Cluster
resource "aws_rds_cluster" "aurora" {
  cluster_identifier      = "production-aurora-cluster"
  engine                  = "aurora-postgresql"
  engine_version          = "15.4"
  database_name           = "myapp"
  master_username         = "dbadmin"  # "admin" is reserved in PostgreSQL engines
  master_password         = random_password.aurora.result

  # Networking
  db_subnet_group_name    = aws_db_subnet_group.main.name
  vpc_security_group_ids  = [aws_security_group.aurora.id]

  # Backups
  backup_retention_period = 7
  preferred_backup_window = "03:00-04:00"

  # Encryption
  storage_encrypted = true
  kms_key_id        = aws_kms_key.aurora.arn  # Expects the key ARN

  # Deletion protection
  deletion_protection = true
  skip_final_snapshot = false
  final_snapshot_identifier = "aurora-final-snapshot"

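  # Backtrack (in-place rewind without restoring a backup) is available only on
  # Aurora MySQL; there, backtrack_window is given in seconds (max 259200 = 72 h).
  # For Aurora PostgreSQL, rely on point-in-time restore from the backups above.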

  # CloudWatch Logs
  enabled_cloudwatch_logs_exports = ["postgresql"]

  tags = {
    Name = "production-aurora"
  }
}

# Aurora Writer Instance
resource "aws_rds_cluster_instance" "aurora_writer" {
  identifier         = "production-aurora-writer"
  cluster_identifier = aws_rds_cluster.aurora.id
  instance_class     = "db.r6g.large"
  engine             = aws_rds_cluster.aurora.engine
  engine_version     = aws_rds_cluster.aurora.engine_version

  performance_insights_enabled = true

  tags = {
    Name = "aurora-writer"
    Role = "writer"
  }
}

# Aurora Reader Instance (auto-scaling target)
resource "aws_rds_cluster_instance" "aurora_reader_1" {
  identifier         = "production-aurora-reader-1"
  cluster_identifier = aws_rds_cluster.aurora.id
  instance_class     = "db.r6g.large"
  engine             = aws_rds_cluster.aurora.engine
  engine_version     = aws_rds_cluster.aurora.engine_version

  performance_insights_enabled = true

  tags = {
    Name = "aurora-reader-1"
    Role = "reader"
  }
}

# Auto-scaling for Aurora read replicas
resource "aws_appautoscaling_target" "aurora_replicas" {
  max_capacity       = 5
  min_capacity       = 1
  resource_id        = "cluster:${aws_rds_cluster.aurora.cluster_identifier}"
  scalable_dimension = "rds:cluster:ReadReplicaCount"
  service_namespace  = "rds"
}

resource "aws_appautoscaling_policy" "aurora_replicas" {
  name               = "aurora-cpu-autoscaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.aurora_replicas.resource_id
  scalable_dimension = aws_appautoscaling_target.aurora_replicas.scalable_dimension
  service_namespace  = aws_appautoscaling_target.aurora_replicas.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "RDSReaderAverageCPUUtilization"
    }
    target_value = 70.0
  }
}
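
The advantages list mentions Aurora Serverless v2. As a rough sketch, a Serverless v2 cluster is a regular provisioned-mode cluster with a scaling range plus instances of class db.serverless. The resource names and the 0.5 to 8 ACU range are assumptions for illustration; the subnet group, security group, and password references reuse names from the cluster above.

```hcl
# Sketch: Aurora PostgreSQL cluster with Serverless v2 capacity
resource "aws_rds_cluster" "aurora_serverless" {
  cluster_identifier = "production-aurora-serverless"
  engine             = "aurora-postgresql"
  engine_mode        = "provisioned"  # Serverless v2 uses the provisioned engine mode
  engine_version     = "15.4"
  database_name      = "myapp"
  master_username    = "dbadmin"
  master_password    = random_password.aurora.result
  storage_encrypted  = true

  db_subnet_group_name   = aws_db_subnet_group.main.name
  vpc_security_group_ids = [aws_security_group.aurora.id]

  # Capacity range in Aurora Capacity Units (ACUs)
  serverlessv2_scaling_configuration {
    min_capacity = 0.5
    max_capacity = 8
  }
}

# Instances opt into Serverless v2 through the special instance class
resource "aws_rds_cluster_instance" "aurora_serverless" {
  identifier         = "production-aurora-serverless-1"
  cluster_identifier = aws_rds_cluster.aurora_serverless.id
  instance_class     = "db.serverless"
  engine             = aws_rds_cluster.aurora_serverless.engine
  engine_version     = aws_rds_cluster.aurora_serverless.engine_version
}
```

A low minimum keeps idle cost down, while the maximum caps spend during spikes; both would be tuned from observed load.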

5.4 DynamoDB (NoSQL)

DynamoDB Basics:

Primary Key: Partition key (required) + sort key (optional)
Partition Key: Determines which partition stores the item (hash)
Sort Key: Orders items within a partition, enables range queries
Secondary Indexes: Query on non-primary-key attributes

# DynamoDB table
resource "aws_dynamodb_table" "users" {
  name           = "Users"
  billing_mode   = "PAY_PER_REQUEST"  # On-demand pricing
  # Or "PROVISIONED" with read_capacity and write_capacity

  hash_key  = "user_id"       # Partition key
  range_key = "created_at"    # Sort key (optional)

  attribute {
    name = "user_id"
    type = "S"  # String
  }

  attribute {
    name = "created_at"
    type = "N"  # Number (Unix timestamp)
  }

  attribute {
    name = "email"
    type = "S"
  }

  # Global Secondary Index (query by email)
  global_secondary_index {
    name            = "EmailIndex"
    hash_key        = "email"
    projection_type = "ALL"  # Include all attributes in index

    # For on-demand billing, no need to specify capacity
  }

  # Point-in-time recovery
  point_in_time_recovery {
    enabled = true
  }

  # Server-side encryption
  server_side_encryption {
    enabled     = true
    kms_key_arn = aws_kms_key.dynamodb.arn
  }

  # TTL (auto-delete expired items)
  ttl {
    attribute_name = "expiration_time"
    enabled        = true
  }

  # DynamoDB Streams (for change data capture)
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  tags = {
    Name = "users-table"
  }
}

# DynamoDB Global Table (multi-region replication)
resource "aws_dynamodb_table" "orders_global" {
  name           = "Orders"
  billing_mode   = "PAY_PER_REQUEST"
  hash_key       = "order_id"

  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  attribute {
    name = "order_id"
    type = "S"
  }

  # Replicate to multiple regions
  replica {
    region_name = "us-west-2"
  }

  replica {
    region_name = "eu-west-1"
  }

  tags = {
    Name = "orders-global-table"
  }
}
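
The Users table comment mentions PROVISIONED billing as the alternative to on-demand. The following sketch shows what that could look like with target-tracking auto-scaling on read capacity. The Sessions table, its key, and the capacity numbers are hypothetical; write capacity would need a matching target and policy using dynamodb:table:WriteCapacityUnits.

```hcl
# Sketch: provisioned-capacity table (names and numbers are illustrative)
resource "aws_dynamodb_table" "sessions" {
  name           = "Sessions"
  billing_mode   = "PROVISIONED"
  read_capacity  = 5
  write_capacity = 5
  hash_key       = "session_id"

  attribute {
    name = "session_id"
    type = "S"
  }

  # Let Application Auto Scaling own capacity after creation
  lifecycle {
    ignore_changes = [read_capacity, write_capacity]
  }
}

resource "aws_appautoscaling_target" "sessions_read" {
  max_capacity       = 100
  min_capacity       = 5
  resource_id        = "table/${aws_dynamodb_table.sessions.name}"
  scalable_dimension = "dynamodb:table:ReadCapacityUnits"
  service_namespace  = "dynamodb"
}

# Keep consumed read capacity near 70% of provisioned capacity
resource "aws_appautoscaling_policy" "sessions_read" {
  name               = "sessions-read-autoscaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.sessions_read.resource_id
  scalable_dimension = aws_appautoscaling_target.sessions_read.scalable_dimension
  service_namespace  = aws_appautoscaling_target.sessions_read.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "DynamoDBReadCapacityUtilization"
    }
    target_value = 70.0
  }
}
```

Provisioned mode tends to suit steady, predictable traffic; on-demand is usually simpler for spiky or unknown workloads.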

5.5 ElastiCache (Redis/Memcached)

Redis vs Memcached:

Redis: Rich data structures (lists, sets, sorted sets), persistence, pub/sub, transactions. Choose for: session store, leaderboards, queues
Memcached: Simple key-value, multi-threaded, simpler. Choose for: simple caching, horizontal scaling

See Messaging Systems Guide for Redis Pub/Sub details.

# ElastiCache Redis Cluster
resource "aws_elasticache_replication_group" "redis" {
  replication_group_id = "production-redis"
  description          = "Redis cluster for session storage and caching"

  engine               = "redis"
  engine_version       = "7.0"
  node_type            = "cache.r6g.large"
  num_cache_clusters   = 3  # 1 primary + 2 replicas
  parameter_group_name = "default.redis7"
  port                 = 6379

  # Networking
  subnet_group_name  = aws_elasticache_subnet_group.main.name
  security_group_ids = [aws_security_group.redis.id]

  # High Availability
  automatic_failover_enabled = true  # Auto-failover to replica
  multi_az_enabled           = true

  # Backups
  snapshot_retention_limit = 5  # Days
  snapshot_window          = "03:00-04:00"
  maintenance_window       = "sun:05:00-sun:06:00"

  # Encryption
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  # AUTH token requires transit encryption; 16-128 printable characters
  auth_token                 = random_password.redis_auth.result

  # Logs
  log_delivery_configuration {
    destination      = aws_cloudwatch_log_group.redis.name
    destination_type = "cloudwatch-logs"
    log_format       = "json"
    log_type         = "slow-log"
  }

  tags = {
    Name = "production-redis"
  }
}

# Subnet group for ElastiCache
resource "aws_elasticache_subnet_group" "main" {
  name       = "main-cache-subnet-group"
  subnet_ids = [aws_subnet.private_1.id, aws_subnet.private_2.id]

  tags = {
    Name = "main-cache-subnet-group"
  }
}

# Security group for Redis
resource "aws_security_group" "redis" {
  name        = "redis-cluster"
  description = "Security group for Redis cluster"
  vpc_id      = aws_vpc.main.id

  ingress {
    description     = "Redis from app servers"
    from_port       = 6379
    to_port         = 6379
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]
  }

  tags = {
    Name = "redis-sg"
  }
}
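
The replication group above has a single shard (one primary plus replicas). The comparison table lists horizontal scaling through cluster mode; a sketch of a sharded variant is below. The identifier and the shard and replica counts are illustrative assumptions; the subnet group and security group reuse the resources above.

```hcl
# Sketch: Redis with cluster mode enabled (data partitioned across shards)
resource "aws_elasticache_replication_group" "redis_sharded" {
  replication_group_id = "production-redis-sharded"
  description          = "Sharded Redis cluster (cluster mode enabled)"

  engine               = "redis"
  engine_version       = "7.0"
  node_type            = "cache.r6g.large"
  parameter_group_name = "default.redis7.cluster.on"
  port                 = 6379

  # 3 shards, each with 1 primary + 1 replica (6 nodes total)
  num_node_groups         = 3
  replicas_per_node_group = 1

  automatic_failover_enabled = true  # Required when cluster mode is enabled
  multi_az_enabled           = true

  subnet_group_name  = aws_elasticache_subnet_group.main.name
  security_group_ids = [aws_security_group.redis.id]

  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
}
```

Clients must use a cluster-aware Redis library and connect through the configuration endpoint, since keys are spread across shards by hash slot.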

6. Security Deep Dive

6.1 KMS (Key Management Service)

# KMS key for encrypting data
resource "aws_kms_key" "app_data" {
  description             = "KMS key for application data encryption"
  deletion_window_in_days = 30  # Grace period before deletion
  enable_key_rotation     = true

  # Key policy
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "Enable IAM User Permissions"
        Effect = "Allow"
        Principal = {
          AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"
        }
        Action   = "kms:*"
        Resource = "*"
      },
      {
        Sid    = "Allow services to use the key"
        Effect = "Allow"
        Principal = {
          Service = [
            "s3.amazonaws.com",
            "rds.amazonaws.com",
            "dynamodb.amazonaws.com",
            "secretsmanager.amazonaws.com"
          ]
        }
        Action = [
          "kms:Decrypt",
          "kms:GenerateDataKey"
        ]
        Resource = "*"
      }
    ]
  })

  tags = {
    Name = "app-data-key"
  }
}

# Alias for easier reference
resource "aws_kms_alias" "app_data" {
  name          = "alias/app-data"
  target_key_id = aws_kms_key.app_data.key_id
}
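
Because the key policy delegates to IAM through the account root statement, individual roles can be granted use of the key with ordinary IAM policies. The sketch below limits a role to using the key only through S3 in one region. The role aws_iam_role.app and the us-east-1 region are assumptions for illustration.

```hcl
# Sketch: allow an application role to use the key only via S3 requests
data "aws_iam_policy_document" "app_use_key_via_s3" {
  statement {
    sid       = "UseKeyOnlyThroughS3"
    effect    = "Allow"
    actions   = ["kms:Decrypt", "kms:GenerateDataKey"]
    resources = [aws_kms_key.app_data.arn]

    # Direct KMS API calls are not covered; only calls S3 makes on the role's behalf
    condition {
      test     = "StringEquals"
      variable = "kms:ViaService"
      values   = ["s3.us-east-1.amazonaws.com"]
    }
  }
}

resource "aws_iam_role_policy" "app_use_key" {
  name   = "use-app-data-key-via-s3"
  role   = aws_iam_role.app.id # Hypothetical application role
  policy = data.aws_iam_policy_document.app_use_key_via_s3.json
}
```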

6.2 Secrets Manager

# Secrets Manager secret
resource "aws_secretsmanager_secret" "api_key" {
  name                    = "production/api/external-service-key"
  description             = "API key for external service"
  recovery_window_in_days = 7

  tags = {
    Purpose = "external-api-auth"
  }
}

# Secret value
resource "aws_secretsmanager_secret_version" "api_key" {
  secret_id     = aws_secretsmanager_secret.api_key.id
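  # Placeholder values for illustration. Never commit real credentials:
  # anything set here also ends up in the Terraform state.
  secret_string = jsonencode({
    api_key    = "super-secret-key"
    api_secret = "super-secret-value"
  })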
}

# IAM policy to allow Lambda to read secret
resource "aws_iam_role_policy" "lambda_read_secret" {
  name = "read-api-secret"
  role = aws_iam_role.lambda.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "secretsmanager:GetSecretValue"
        ]
        Resource = aws_secretsmanager_secret.api_key.arn
      }
    ]
  })
}

# Automatic rotation (Lambda function rotates secret)
resource "aws_secretsmanager_secret_rotation" "api_key" {
  secret_id           = aws_secretsmanager_secret.api_key.id
  rotation_lambda_arn = aws_lambda_function.rotate_secret.arn

  rotation_rules {
    automatically_after_days = 30
  }
}
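
One way to keep secret values out of source control is to pass them in as sensitive variables, for example through TF_VAR_ environment variables in CI. This is a sketch only: the variable names are hypothetical, and the resource below would take the place of a version that uses string literals rather than sit alongside it. The value is still written to the Terraform state, so the state backend needs to be encrypted and access-controlled.

```hcl
# Sketch: supply secret values at apply time instead of hardcoding them
variable "external_api_key" {
  type      = string
  sensitive = true # Redacted in plan and apply output
}

variable "external_api_secret" {
  type      = string
  sensitive = true
}

resource "aws_secretsmanager_secret_version" "api_key_from_vars" {
  secret_id = aws_secretsmanager_secret.api_key.id
  secret_string = jsonencode({
    api_key    = var.external_api_key
    api_secret = var.external_api_secret
  })
}
```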

6.3 Security Monitoring & Compliance

Essential Security Services

Service	Purpose	What It Detects
GuardDuty	Threat detection	Unusual API calls, compromised instances, reconnaissance
Security Hub	Centralized security findings	Aggregates GuardDuty, Inspector, Macie, Config findings
CloudTrail	API audit logging	Who did what, when (all AWS API calls)
Config	Resource compliance	Non-compliant resources (unencrypted EBS, public S3)
Inspector	Vulnerability scanning	EC2/ECR vulnerabilities, network exposure
Macie	Data discovery & protection	Sensitive data in S3 (PII, credentials)
WAF	Web application firewall	SQL injection, XSS, bot traffic

# Enable GuardDuty (threat detection)
resource "aws_guardduty_detector" "main" {
  enable = true

  finding_publishing_frequency = "FIFTEEN_MINUTES"

  datasources {
    s3_logs {
      enable = true
    }
    kubernetes {
      audit_logs {
        enable = true
      }
    }
  }

  tags = {
    Name = "main-guardduty"
  }
}

# Enable Security Hub (compliance & security findings)
resource "aws_securityhub_account" "main" {
  enable_default_standards  = true
  control_finding_generator = "SECURITY_CONTROL"
}

# Enable CloudTrail (API logging)
resource "aws_cloudtrail" "main" {
  name                          = "organization-trail"
  s3_bucket_name                = aws_s3_bucket.cloudtrail.id
  include_global_service_events = true
  is_multi_region_trail         = true
  is_organization_trail         = true

  # Enable logging of data events
  event_selector {
    read_write_type           = "All"
    include_management_events = true

    # Log S3 data events
    data_resource {
      type   = "AWS::S3::Object"
      values = ["arn:aws:s3"]  # Prefix match: objects in all buckets
    }

    # Log Lambda invocations
    data_resource {
      type   = "AWS::Lambda::Function"
      values = ["arn:aws:lambda"]  # Prefix match: all functions
    }
  }

  # CloudWatch Logs integration
  cloud_watch_logs_group_arn = "${aws_cloudwatch_log_group.cloudtrail.arn}:*"
  cloud_watch_logs_role_arn  = aws_iam_role.cloudtrail_cloudwatch.arn

  tags = {
    Name = "organization-cloudtrail"
  }
}

# Enable Config (resource compliance)
resource "aws_config_configuration_recorder" "main" {
  name     = "main-config-recorder"
  role_arn = aws_iam_role.config.arn

  recording_group {
    all_supported                 = true
    include_global_resource_types = true
  }
}

resource "aws_config_configuration_recorder_status" "main" {
  name       = aws_config_configuration_recorder.main.name
  is_enabled = true

  depends_on = [aws_config_delivery_channel.main]
}

# Config rules (compliance checks)
resource "aws_config_config_rule" "encrypted_volumes" {
  name = "encrypted-volumes"

  source {
    owner             = "AWS"
    source_identifier = "ENCRYPTED_VOLUMES"
  }

  depends_on = [aws_config_configuration_recorder.main]
}

resource "aws_config_config_rule" "s3_bucket_public_read_prohibited" {
  name = "s3-bucket-public-read-prohibited"

  source {
    owner             = "AWS"
    source_identifier = "S3_BUCKET_PUBLIC_READ_PROHIBITED"
  }

  depends_on = [aws_config_configuration_recorder.main]
}
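
WAF appears in the services table but has no example. A minimal sketch of a regional web ACL using one AWS managed rule group, attached to a load balancer, could look like this. The ACL name, metric names, and aws_lb.app are assumptions for illustration.

```hcl
# Sketch: web ACL with the AWS managed common rule set
resource "aws_wafv2_web_acl" "app" {
  name  = "app-web-acl"
  scope = "REGIONAL" # Use "CLOUDFRONT" (created in us-east-1) for CloudFront

  # Requests that match no rule are allowed
  default_action {
    allow {}
  }

  rule {
    name     = "aws-common-rules"
    priority = 1

    # Keep the actions defined inside the managed rule group
    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesCommonRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "aws-common-rules"
      sampled_requests_enabled   = true
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "app-web-acl"
    sampled_requests_enabled   = true
  }
}

# Attach the web ACL to an Application Load Balancer (hypothetical aws_lb.app)
resource "aws_wafv2_web_acl_association" "alb" {
  resource_arn = aws_lb.app.arn
  web_acl_arn  = aws_wafv2_web_acl.app.arn
}
```

Further managed groups, such as AWSManagedRulesSQLiRuleSet, can be added as additional rule blocks with distinct priorities.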

7. Real-World Architecture Patterns

7.1 Multi-Tier Web Application

Architecture: 3-Tier Web App with Auto-Scaling

Use Case: E-commerce website with variable traffic

graph TB
  Users["Users"]
  subgraph "AWS Cloud"
    Route53["Route 53<br/>DNS"]
    CloudFront["CloudFront CDN<br/>(Static Assets)"]
    subgraph "VPC"
      subgraph "Public Subnets"
        ALB["Application Load Balancer<br/>(us-east-1a, us-east-1b)"]
      end
      subgraph "Private Subnets - Web Tier"
        ASG1["Auto Scaling Group<br/>EC2 Web Servers"]
      end
      subgraph "Private Subnets - App Tier"
        ASG2["Auto Scaling Group<br/>EC2 App Servers"]
        Cache["ElastiCache Redis<br/>(Session Store)"]
      end
      subgraph "Private Subnets - Data Tier"
        RDS["Aurora PostgreSQL<br/>(Multi-AZ)"]
        S3["S3<br/>(User Uploads)"]
      end
    end
  end
  Users --> Route53
  Route53 --> CloudFront
  CloudFront --> S3
  Route53 --> ALB
  ALB --> ASG1
  ASG1 --> ASG2
  ASG2 --> Cache
  ASG2 --> RDS
  ASG2 --> S3
  style CloudFront fill:#ff9900,stroke:#232f3e,stroke-width:2px,color:#232f3e
  style ALB fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440
  style RDS fill:#bf616a,stroke:#d08770,stroke-width:2px,color:#2e3440

Key Components:

Route 53: DNS routing with health checks
CloudFront: CDN for static assets (images, CSS, JS)
ALB: Distributes traffic across web servers
Web Tier: Auto-scaling EC2 instances (t3.medium)
App Tier: Auto-scaling EC2 instances (m5.large), business logic
Cache Tier: Redis for session storage and caching
Data Tier: Aurora PostgreSQL (writer + 2 readers), S3 for uploads

Full Terraform:

Combine the VPC setup, ALB, Auto Scaling Groups, RDS, and ElastiCache examples above.
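
The one piece of this pattern not covered by the earlier examples is the DNS entry point. A short sketch of a Route 53 alias record pointing at the ALB follows; the hosted zone aws_route53_zone.main, the load balancer aws_lb.app, and the domain app.example.com are assumptions for illustration.

```hcl
# Sketch: alias record that sends app.example.com to the ALB
resource "aws_route53_record" "app" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  alias {
    name    = aws_lb.app.dns_name
    zone_id = aws_lb.app.zone_id
    # Stop answering with this record if the ALB has no healthy targets
    evaluate_target_health = true
  }
}
```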

7.2 Serverless Microservices

Architecture: Event-Driven Serverless

Use Case: Order processing system with multiple services

graph TB
  API["API Gateway<br/>(REST API)"]
  Lambda1["Lambda: Create Order"]
  Lambda2["Lambda: Process Payment"]
  Lambda3["Lambda: Update Inventory"]
  Lambda4["Lambda: Send Notification"]
  DDB["DynamoDB: Orders Table"]
  SQS1["SQS: Payment Queue"]
  SQS2["SQS: Inventory Queue"]
  SNS["SNS: Order Events Topic"]
  S3["S3: Order Receipts"]
  SES["SES: Email Service"]
  API --> Lambda1
  Lambda1 --> DDB
  Lambda1 --> SQS1
  Lambda1 --> SQS2
  Lambda1 --> SNS
  SQS1 --> Lambda2
  Lambda2 --> DDB
  SQS2 --> Lambda3
  Lambda3 --> DDB
  SNS --> Lambda4
  Lambda4 --> S3
  Lambda4 --> SES
  style API fill:#ff9900,stroke:#232f3e,stroke-width:2px,color:#232f3e
  style DDB fill:#88c0d0,stroke:#81a1c1,stroke-width:2px,color:#2e3440
  style SNS fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440

Flow:

User calls API Gateway → Lambda creates order in DynamoDB
Lambda publishes to SQS queues (payment, inventory)
Lambda publishes to SNS topic (order-events)
Payment Lambda processes from SQS, updates DynamoDB
Inventory Lambda processes from SQS, updates DynamoDB
Notification Lambda subscribes to SNS, generates receipt (S3), sends email (SES)

Benefits:

No servers to manage: Lambda auto-scales
Cost-effective: Pay per request
Decoupled: Services communicate via SQS/SNS
Resilient: SQS retries failed messages, DLQ for poison messages
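
The retry and DLQ behavior described above can be sketched for the payment queue as follows. Queue names, the retry count, and aws_lambda_function.process_payment are hypothetical, and the visibility timeout assumes a function timeout of about 30 seconds.

```hcl
# Sketch: dead-letter queue for messages that keep failing
resource "aws_sqs_queue" "payment_dlq" {
  name                      = "payment-dlq"
  message_retention_seconds = 1209600 # 14 days, the SQS maximum
}

# Main queue: after 5 failed receives, a message moves to the DLQ
resource "aws_sqs_queue" "payment" {
  name = "payment-queue"

  # Should comfortably exceed the Lambda timeout (assumed ~30 s here)
  visibility_timeout_seconds = 180

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.payment_dlq.arn
    maxReceiveCount     = 5
  })
}

# Lambda polls the queue in batches
resource "aws_lambda_event_source_mapping" "payment" {
  event_source_arn = aws_sqs_queue.payment.arn
  function_name    = aws_lambda_function.process_payment.arn
  batch_size       = 10

  # Only the failed messages in a batch are retried, not the whole batch
  function_response_types = ["ReportBatchItemFailures"]
}
```

With ReportBatchItemFailures enabled, the function has to return the IDs of the messages it could not process; otherwise the batch is treated as fully successful.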

See also: Messaging Systems Guide for detailed SQS/SNS/EventBridge patterns

7.3 Data Processing Pipeline

Architecture: Real-Time Analytics Pipeline

Use Case: Process clickstream data for real-time dashboards

graph LR
  App["Web App"]
  Kinesis["Kinesis Data Stream<br/>(Clickstream)"]
  Lambda1["Lambda: Transform"]
  Firehose["Kinesis Firehose"]
  S3["S3: Data Lake<br/>(Parquet)"]
  Athena["Athena<br/>(SQL Queries)"]
  QuickSight["QuickSight<br/>(Dashboards)"]
  Lambda2["Lambda: Real-time Alerts"]
  DDB["DynamoDB: Aggregates"]
  App --> Kinesis
  Kinesis --> Lambda1
  Lambda1 --> Firehose
  Firehose --> S3
  S3 --> Athena
  Athena --> QuickSight
  Kinesis --> Lambda2
  Lambda2 --> DDB
  style Kinesis fill:#ff9900,stroke:#232f3e,stroke-width:2px,color:#232f3e
  style S3 fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440

Flow:

Web app sends events to Kinesis Data Stream
Lambda transforms/enriches events in real-time
Firehose batches and writes to S3 (Parquet format)
Athena queries S3 data lake (SQL)
QuickSight creates dashboards from Athena queries
Separate Lambda reads stream for real-time alerts, writes aggregates to DynamoDB
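
The first two steps of this flow (stream ingestion and the transform Lambda) might be wired up as in the sketch below. The stream name, the batch settings, and aws_lambda_function.transform are assumptions for illustration.

```hcl
# Sketch: on-demand stream, so no shard count to manage
resource "aws_kinesis_stream" "clickstream" {
  name             = "clickstream"
  retention_period = 24 # Hours

  stream_mode_details {
    stream_mode = "ON_DEMAND"
  }
}

# Lambda reads batches of records from the stream
resource "aws_lambda_event_source_mapping" "transform" {
  event_source_arn  = aws_kinesis_stream.clickstream.arn
  function_name     = aws_lambda_function.transform.arn
  starting_position = "LATEST"
  batch_size        = 100

  # Wait up to 5 s to fill a batch before invoking
  maximum_batching_window_in_seconds = 5

  # On failure, split the batch to isolate a bad record, then give up after 3 retries
  bisect_batch_on_function_error = true
  maximum_retry_attempts         = 3
}
```

Bounding retries matters for streams: without a limit, one poison record can block its shard until the record expires.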

Interview Tips

Key Topics to Master:

Multi-account strategy: Why use multiple accounts, SCPs, cross-account access
VPC design: Public/private subnets, NAT Gateway, VPC Peering vs Transit Gateway
Security layers: Security Groups (stateful) vs NACLs (stateless)
Compute trade-offs: EC2 vs Lambda vs ECS vs EKS - when to use each
Database selection: RDS vs Aurora vs DynamoDB decision tree
Storage classes: S3 Standard → IA → Glacier based on access patterns
IAM best practices: Roles over users, least privilege, service accounts
Encryption: KMS for data-at-rest, TLS for data-in-transit
Monitoring: GuardDuty, Security Hub, CloudTrail, Config
Architecture patterns: Multi-tier, serverless, data pipelines

Common Interview Questions:

Design a highly available, scalable web application on AWS
How would you secure access to a database in a private subnet?
Explain the difference between Security Groups and NACLs
When would you choose DynamoDB over RDS?
How do you enable cross-account access securely?
Design a disaster recovery strategy for a critical application
How would you handle a Lambda function that needs to access resources in a VPC?
Explain the difference between VPC Peering and Transit Gateway
How do you implement defense-in-depth security in AWS?
Design a data pipeline for processing millions of events per day