AWS Architecture Guide
Building Scalable, Secure Cloud Infrastructure
Guide Overview
This guide covers AWS architecture from the ground up: account organization, networking, compute, storage, databases, and security. Each section includes Terraform examples, comparison tables, and real-world architecture patterns for senior/staff engineer interviews.
Cross-references: See Messaging Systems for SQS, SNS, Kinesis, EventBridge details.
1. AWS Organizations & Account Structure
1.1 Multi-Account Strategy
Why Multiple Accounts?
- Security Isolation: Blast radius containment - breach in dev doesn't affect prod
- Billing Separation: Clear cost attribution per environment/team
- Regulatory Compliance: Separate PCI/HIPAA workloads from general infrastructure
- Resource Limits: Service quotas apply per account (most of them per Region as well), e.g., VPCs per Region or EC2 vCPU limits
graph TB
Root["AWS Organization Root<br/>Management Account"]
Root --> Core["Core OU"]
Root --> Workloads["Workloads OU"]
Root --> Security["Security OU"]
Core --> Log["Log Archive Account<br/>(CloudTrail, Config)"]
Core --> Audit["Audit Account<br/>(Security Hub, GuardDuty)"]
Workloads --> Dev["Development Account"]
Workloads --> Staging["Staging Account"]
Workloads --> Prod["Production Account"]
Security --> SecurityTools["Security Tooling Account<br/>(Inspector, Macie)"]
Root -.->|"SCPs"| Core
Root -.->|"SCPs"| Workloads
Root -.->|"SCPs"| Security
style Root fill:#ff9900,stroke:#232f3e,stroke-width:3px,color:#232f3e
style Prod fill:#bf616a,stroke:#d08770,stroke-width:2px,color:#2e3440
style Security fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440
Account Structure Best Practices
| Account Type | Purpose | Who Has Access |
|---|---|---|
| Management (Root) | AWS Organizations, billing, minimal workloads | Finance, senior ops only |
| Log Archive | Centralized logging (CloudTrail, VPC Flow Logs) | Security team (read-only), automated systems (write) |
| Audit/Security | Security Hub, GuardDuty aggregation, compliance | Security team, compliance auditors |
| Shared Services | Active Directory, VPN, Transit Gateway | Network team, infra team |
| Development | Development/testing workloads | All engineers (broad permissions) |
| Staging | Pre-production testing | Engineers (limited), QA team |
| Production | Live customer-facing workloads | On-call engineers (read-only + incident response) |
Terraform: AWS Organizations Setup
# Create AWS Organization (run in management account)
resource "aws_organizations_organization" "org" {
feature_set = "ALL"
aws_service_access_principals = [
"cloudtrail.amazonaws.com",
"config.amazonaws.com",
"guardduty.amazonaws.com",
"securityhub.amazonaws.com"
]
enabled_policy_types = [
"SERVICE_CONTROL_POLICY",
"TAG_POLICY"
]
}
# Create Organizational Units
resource "aws_organizations_organizational_unit" "workloads" {
name = "Workloads"
parent_id = aws_organizations_organization.org.roots[0].id
}
resource "aws_organizations_organizational_unit" "security" {
name = "Security"
parent_id = aws_organizations_organization.org.roots[0].id
}
# Create Production Account
resource "aws_organizations_account" "production" {
name = "production"
email = "aws-prod@company.com"
parent_id = aws_organizations_organizational_unit.workloads.id
role_name = "OrganizationAccountAccessRole"
tags = {
Environment = "production"
CostCenter = "engineering"
}
}
# Service Control Policy (SCP) - Prevent region usage outside allowed list
resource "aws_organizations_policy" "require_regions" {
name = "RequireAllowedRegions"
description = "Restrict operations to us-east-1 and us-west-2"
type = "SERVICE_CONTROL_POLICY"
content = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "DenyAllOutsideAllowedRegions"
Effect = "Deny"
NotAction = [
"iam:*",
"organizations:*",
"route53:*",
"budgets:*",
"cloudfront:*",
"support:*",
"sts:*"
]
Resource = "*"
Condition = {
StringNotEquals = {
"aws:RequestedRegion" = [
"us-east-1",
"us-west-2"
]
}
}
}
]
})
}
# Attach SCP to Workloads OU
resource "aws_organizations_policy_attachment" "workloads_regions" {
policy_id = aws_organizations_policy.require_regions.id
target_id = aws_organizations_organizational_unit.workloads.id
}
1.2 IAM Foundation
IAM Best Practices:
- Avoid using the root user: Enable MFA on it, then do day-to-day work through a separate administrative identity (federated roles where possible)
- No long-lived credentials: Use IAM roles and temporary credentials
- Principle of least privilege: Grant minimum permissions needed
- Use IAM roles for services: EC2, Lambda, ECS should assume roles, not use API keys
- Enable CloudTrail: Audit all API calls
IAM Components
| Component | Purpose | Example Use Case |
|---|---|---|
| IAM Users | Human users (avoid for services) | Developer accessing AWS Console |
| IAM Roles | Assumed by services or federated users | EC2 instance accessing S3, cross-account access |
| IAM Policies | JSON documents defining permissions | Allow S3 read access to specific bucket |
| IAM Groups | Collection of users for permission management | Engineers group with EC2/S3 access |
| Service Control Policies (SCPs) | Org-level permission boundaries | Prevent production account from deleting CloudTrail |
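The SCP use case in the last table row can be sketched in Terraform. This is a minimal illustration, not a complete guardrail set: the policy name, the `Sid`, and the exact list of denied CloudTrail actions are assumptions chosen for the example, and it reuses the `aws_organizations_account.production` resource from section 1.1.

```hcl
# Hypothetical guardrail: nobody in the production account may stop or delete trails.
resource "aws_organizations_policy" "protect_cloudtrail" {
  name        = "ProtectCloudTrail" # assumed name
  description = "Deny tampering with CloudTrail"
  type        = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "DenyCloudTrailTampering"
        Effect = "Deny"
        Action = [
          "cloudtrail:StopLogging",
          "cloudtrail:DeleteTrail",
          "cloudtrail:UpdateTrail"
        ]
        Resource = "*"
      }
    ]
  })
}

# Attach directly to the production account (an OU ID would also work as target_id)
resource "aws_organizations_policy_attachment" "prod_protect_cloudtrail" {
  policy_id = aws_organizations_policy.protect_cloudtrail.id
  target_id = aws_organizations_account.production.id
}
```

Because an SCP only limits what identity policies can grant, this deny applies even to administrators inside the member account, which is what makes it useful for audit controls.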
Terraform: IAM Roles for Services
# IAM Role for EC2 instances to access S3 and DynamoDB
resource "aws_iam_role" "app_server" {
name = "app-server-role"
# Trust policy - who can assume this role
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "ec2.amazonaws.com"
}
}
]
})
tags = {
Purpose = "application-server"
}
}
# Permission policy - what this role can do
resource "aws_iam_role_policy" "app_server_permissions" {
name = "app-server-permissions"
role = aws_iam_role.app_server.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
# S3 access
Effect = "Allow"
Action = [
"s3:GetObject",
"s3:PutObject"
]
Resource = "arn:aws:s3:::my-app-bucket/*"
},
{
# DynamoDB access
Effect = "Allow"
Action = [
"dynamodb:GetItem",
"dynamodb:PutItem",
"dynamodb:Query"
]
Resource = "arn:aws:dynamodb:us-east-1:*:table/Users"
},
{
# Secrets Manager access
Effect = "Allow"
Action = [
"secretsmanager:GetSecretValue"
]
Resource = "arn:aws:secretsmanager:us-east-1:*:secret:app/*"
}
]
})
}
# Instance profile (wrapper for role used by EC2)
resource "aws_iam_instance_profile" "app_server" {
name = "app-server-profile"
role = aws_iam_role.app_server.name
}
# Attach instance profile to EC2 instance
resource "aws_instance" "app" {
ami = "ami-12345678"
instance_type = "t3.medium"
iam_instance_profile = aws_iam_instance_profile.app_server.name
# Now this instance can access S3, DynamoDB, Secrets Manager
# without any API keys or credentials!
}
Cross-Account Access Pattern
# In Production Account: Role that Dev Account can assume
resource "aws_iam_role" "prod_readonly_from_dev" {
name = "prod-readonly-cross-account"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Principal = {
AWS = "arn:aws:iam::111111111111:root" # Dev account ID
}
Action = "sts:AssumeRole"
Condition = {
StringEquals = {
"sts:ExternalId" = "unique-external-id-12345"
}
}
}
]
})
}
resource "aws_iam_role_policy_attachment" "readonly" {
role = aws_iam_role.prod_readonly_from_dev.name
policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}
# In Dev Account: Policy allowing users to assume prod role
resource "aws_iam_policy" "assume_prod_readonly" {
name = "assume-prod-readonly"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = "sts:AssumeRole"
Resource = "arn:aws:iam::222222222222:role/prod-readonly-cross-account"
}
]
})
}
# Attach to engineers group
resource "aws_iam_group_policy_attachment" "engineers_assume_prod" {
group = aws_iam_group.engineers.name
policy_arn = aws_iam_policy.assume_prod_readonly.arn
}
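One way to consume the cross-account role from Terraform itself is a provider alias that assumes it. The sketch below assumes the caller already has credentials in the dev account and the `assume-prod-readonly` policy attached; the alias name and session name are made up for illustration. Note that the external ID from the trust policy must be passed explicitly, otherwise the `sts:AssumeRole` call is denied.

```hcl
# Provider alias that assumes the read-only role in the production account
provider "aws" {
  alias  = "prod_readonly" # hypothetical alias
  region = "us-east-1"

  assume_role {
    role_arn     = "arn:aws:iam::222222222222:role/prod-readonly-cross-account"
    session_name = "dev-readonly-session" # assumed; shows up in CloudTrail
    external_id  = "unique-external-id-12345"
  }
}

# Any data source or resource using this alias runs with the assumed role
data "aws_caller_identity" "prod" {
  provider = aws.prod_readonly
}

output "prod_account_id" {
  value = data.aws_caller_identity.prod.account_id
}
```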
2. Networking Architecture
2.1 VPC Fundamentals
What is a VPC?
A Virtual Private Cloud (VPC) is an isolated network within AWS where you launch resources. Think of it as your own private data center in the cloud with complete control over:
- IP address ranges (CIDR blocks)
- Subnets (subdivisions of your VPC)
- Route tables (how traffic flows)
- Internet gateways (connection to internet)
- Security groups and NACLs (firewalls)
graph TB
Internet["Internet"]
subgraph VPC["VPC: 10.0.0.0/16"]
IGW["Internet Gateway"]
subgraph AZ1["Availability Zone us-east-1a"]
PubSub1["Public Subnet<br/>10.0.1.0/24"]
PrivSub1["Private Subnet<br/>10.0.10.0/24"]
NAT1["NAT Gateway"]
end
subgraph AZ2["Availability Zone us-east-1b"]
PubSub2["Public Subnet<br/>10.0.2.0/24"]
PrivSub2["Private Subnet<br/>10.0.20.0/24"]
NAT2["NAT Gateway"]
end
Web1["Web Server<br/>EC2"]
Web2["Web Server<br/>EC2"]
App1["App Server<br/>EC2"]
App2["App Server<br/>EC2"]
PubSub1 --> Web1
PubSub2 --> Web2
PrivSub1 --> App1
PrivSub2 --> App2
PubSub1 --> NAT1
PubSub2 --> NAT2
end
Internet <--> IGW
IGW <--> PubSub1
IGW <--> PubSub2
PrivSub1 --> NAT1
PrivSub2 --> NAT2
NAT1 --> IGW
NAT2 --> IGW
style VPC fill:#2e3440,stroke:#88c0d0,stroke-width:3px,color:#e0e0e0
style PubSub1 fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440
style PubSub2 fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440
style PrivSub1 fill:#bf616a,stroke:#d08770,stroke-width:2px,color:#2e3440
style PrivSub2 fill:#bf616a,stroke:#d08770,stroke-width:2px,color:#2e3440
Public vs Private Subnets
| Subnet Type | Internet Access | Route Table | Use Cases |
|---|---|---|---|
| Public Subnet | Direct (via Internet Gateway) | 0.0.0.0/0 → Internet Gateway | Load balancers, bastion hosts, NAT gateways |
| Private Subnet | Outbound only (via NAT Gateway) | 0.0.0.0/0 → NAT Gateway | Application servers, databases, internal services |
| Isolated Subnet | None | Local VPC routes only | Databases with no internet access needed |
Terraform: Complete VPC Setup
# VPC
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "production-vpc"
}
}
# Internet Gateway
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
tags = {
Name = "production-igw"
}
}
# Public Subnet AZ1
resource "aws_subnet" "public_1" {
vpc_id = aws_vpc.main.id
cidr_block = "10.0.1.0/24"
availability_zone = "us-east-1a"
map_public_ip_on_launch = true
tags = {
Name = "public-subnet-1a"
Type = "public"
}
}
# Public Subnet AZ2
resource "aws_subnet" "public_2" {
vpc_id = aws_vpc.main.id
cidr_block = "10.0.2.0/24"
availability_zone = "us-east-1b"
map_public_ip_on_launch = true
tags = {
Name = "public-subnet-1b"
Type = "public"
}
}
# Private Subnet AZ1
resource "aws_subnet" "private_1" {
vpc_id = aws_vpc.main.id
cidr_block = "10.0.10.0/24"
availability_zone = "us-east-1a"
tags = {
Name = "private-subnet-1a"
Type = "private"
}
}
# Private Subnet AZ2
resource "aws_subnet" "private_2" {
vpc_id = aws_vpc.main.id
cidr_block = "10.0.20.0/24"
availability_zone = "us-east-1b"
tags = {
Name = "private-subnet-1b"
Type = "private"
}
}
# Elastic IP for NAT Gateway
resource "aws_eip" "nat_1" {
domain = "vpc"
tags = {
Name = "nat-gateway-eip-1a"
}
}
# NAT Gateway (for private subnets to access internet)
resource "aws_nat_gateway" "nat_1" {
allocation_id = aws_eip.nat_1.id
subnet_id = aws_subnet.public_1.id
tags = {
Name = "nat-gateway-1a"
}
depends_on = [aws_internet_gateway.main]
}
# Public Route Table
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.main.id
}
tags = {
Name = "public-route-table"
}
}
# Private Route Table
resource "aws_route_table" "private_1" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.nat_1.id
}
tags = {
Name = "private-route-table-1a"
}
}
# Route Table Associations
resource "aws_route_table_association" "public_1" {
subnet_id = aws_subnet.public_1.id
route_table_id = aws_route_table.public.id
}
resource "aws_route_table_association" "public_2" {
subnet_id = aws_subnet.public_2.id
route_table_id = aws_route_table.public.id
}
resource "aws_route_table_association" "private_1" {
subnet_id = aws_subnet.private_1.id
route_table_id = aws_route_table.private_1.id
}
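The configuration above creates a NAT Gateway and private route table only for the first AZ, while the diagram shows one NAT Gateway per AZ and later sections reference `aws_route_table.private_2`. A sketch of the missing second-AZ pieces, mirroring the first AZ's naming (the resource names here are assumptions that follow that pattern):

```hcl
# Elastic IP and NAT Gateway for the second AZ
resource "aws_eip" "nat_2" {
  domain = "vpc"

  tags = {
    Name = "nat-gateway-eip-1b"
  }
}

resource "aws_nat_gateway" "nat_2" {
  allocation_id = aws_eip.nat_2.id
  subnet_id     = aws_subnet.public_2.id

  tags = {
    Name = "nat-gateway-1b"
  }

  depends_on = [aws_internet_gateway.main]
}

# Private route table for the second AZ, routing through its own NAT Gateway
resource "aws_route_table" "private_2" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat_2.id
  }

  tags = {
    Name = "private-route-table-1b"
  }
}

resource "aws_route_table_association" "private_2" {
  subnet_id      = aws_subnet.private_2.id
  route_table_id = aws_route_table.private_2.id
}
```

Keeping one NAT Gateway per AZ avoids cross-AZ data charges and means the loss of one AZ does not cut off outbound traffic for the other; a single shared NAT Gateway is cheaper but gives up both properties.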
2.2 Security Groups vs NACLs
| Feature | Security Groups | Network ACLs |
|---|---|---|
| Level | Instance (ENI) level | Subnet level |
| State | Stateful (return traffic auto-allowed) | Stateless (must explicitly allow return traffic) |
| Rules | Allow rules only | Allow and Deny rules |
| Rule Evaluation | All rules evaluated | Rules processed in number order (lowest first; first match wins) |
| Default | Deny all inbound, allow all outbound | Default NACL allows all inbound and outbound; a custom NACL denies everything until rules are added |
| Use Case | Primary firewall (granular control) | Additional subnet-level protection |
# Security Group for Web Servers
resource "aws_security_group" "web" {
name = "web-servers"
description = "Security group for web tier"
vpc_id = aws_vpc.main.id
# Inbound rules
ingress {
description = "HTTPS from internet"
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
ingress {
description = "HTTP from internet"
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
# Outbound rules (Terraform removes AWS's default allow-all egress rule, so declare it explicitly)
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = {
Name = "web-sg"
}
}
# Security Group for Application Servers
resource "aws_security_group" "app" {
name = "app-servers"
description = "Security group for application tier"
vpc_id = aws_vpc.main.id
# Only allow traffic from web tier
ingress {
description = "HTTP from web tier"
from_port = 8080
to_port = 8080
protocol = "tcp"
security_groups = [aws_security_group.web.id] # Reference web SG
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = {
Name = "app-sg"
}
}
# Security Group for Database
resource "aws_security_group" "db" {
name = "database"
description = "Security group for database tier"
vpc_id = aws_vpc.main.id
# Only allow traffic from app tier
ingress {
description = "PostgreSQL from app tier"
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.app.id]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = {
Name = "db-sg"
}
}
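For contrast with the stateful security groups above, the stateless behaviour of NACLs can be sketched as follows. This is an illustrative and deliberately incomplete rule set for the private subnets: the rule numbers, the port choices, and the 1024-65535 ephemeral range are assumptions for the example. Every flow needs a rule in each direction because the NACL does not track connections.

```hcl
resource "aws_network_acl" "private" {
  vpc_id     = aws_vpc.main.id
  subnet_ids = [aws_subnet.private_1.id, aws_subnet.private_2.id]

  # Inbound: application traffic from inside the VPC
  ingress {
    rule_no    = 100
    action     = "allow"
    protocol   = "tcp"
    cidr_block = aws_vpc.main.cidr_block
    from_port  = 8080
    to_port    = 8080
  }

  # Inbound: return traffic for connections started from these subnets
  # (e.g. HTTPS calls through the NAT Gateway). Note that this range also
  # covers ports such as 8080, one reason NACLs are only a coarse second layer.
  ingress {
    rule_no    = 110
    action     = "allow"
    protocol   = "tcp"
    cidr_block = "0.0.0.0/0"
    from_port  = 1024
    to_port    = 65535
  }

  # Outbound: HTTPS to anywhere
  egress {
    rule_no    = 100
    action     = "allow"
    protocol   = "tcp"
    cidr_block = "0.0.0.0/0"
    from_port  = 443
    to_port    = 443
  }

  # Outbound: responses to clients inside the VPC on their ephemeral ports
  egress {
    rule_no    = 110
    action     = "allow"
    protocol   = "tcp"
    cidr_block = aws_vpc.main.cidr_block
    from_port  = 1024
    to_port    = 65535
  }

  tags = {
    Name = "private-nacl"
  }
}
```

Anything not matched by a numbered rule falls through to the implicit deny, so omitting either of the ephemeral-port rules would silently break traffic that the security groups allow.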
2.3 VPC Peering & Transit Gateway
VPC Peering
Use VPC Peering when:
- Connecting 2-3 VPCs
- Simple, low-latency communication needed
- No transitive routing required
Limitations: Non-transitive (A↔B and B↔C does not imply A↔C); a full mesh of n VPCs needs n(n-1)/2 peering connections, so it becomes hard to manage beyond roughly 10 VPCs
# VPC Peering Connection
resource "aws_vpc_peering_connection" "prod_to_shared" {
vpc_id = aws_vpc.production.id
peer_vpc_id = aws_vpc.shared_services.id
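# Both VPCs are in the same account and region, so the request can be auto-accepted.
# peer_region is only for cross-region peering and cannot be combined with auto_accept = true.
auto_accept = true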
tags = {
Name = "prod-to-shared-services"
}
}
# Add route in Production VPC to Shared Services VPC
resource "aws_route" "prod_to_shared" {
route_table_id = aws_route_table.production_private.id
destination_cidr_block = "10.1.0.0/16" # Shared Services CIDR
vpc_peering_connection_id = aws_vpc_peering_connection.prod_to_shared.id
}
# Add route in Shared Services VPC to Production VPC
resource "aws_route" "shared_to_prod" {
route_table_id = aws_route_table.shared_private.id
destination_cidr_block = "10.0.0.0/16" # Production CIDR
vpc_peering_connection_id = aws_vpc_peering_connection.prod_to_shared.id
}
Transit Gateway (TGW)
Use Transit Gateway when:
- Connecting 4+ VPCs
- Need hub-and-spoke topology
- Want centralized routing control
- Connecting on-premises networks (VPN/Direct Connect)
Advantages: Transitive routing, simpler management at scale, centralized network monitoring
graph TB
OnPrem["On-Premises<br/>Data Center"]
TGW["Transit Gateway<br/>(Hub)"]
VPC1["Production VPC<br/>10.0.0.0/16"]
VPC2["Staging VPC<br/>10.1.0.0/16"]
VPC3["Dev VPC<br/>10.2.0.0/16"]
VPC4["Shared Services VPC<br/>10.3.0.0/16"]
OnPrem <-->|"VPN/Direct Connect"| TGW
TGW <--> VPC1
TGW <--> VPC2
TGW <--> VPC3
TGW <--> VPC4
style TGW fill:#ff9900,stroke:#232f3e,stroke-width:3px,color:#232f3e
style VPC1 fill:#bf616a,stroke:#d08770,stroke-width:2px,color:#2e3440
style VPC4 fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440
# Transit Gateway
resource "aws_ec2_transit_gateway" "main" {
description = "Main TGW for all VPCs"
default_route_table_association = "enable"
default_route_table_propagation = "enable"
tags = {
Name = "main-tgw"
}
}
# Attach Production VPC to TGW
resource "aws_ec2_transit_gateway_vpc_attachment" "production" {
subnet_ids = [aws_subnet.prod_private_1.id, aws_subnet.prod_private_2.id]
transit_gateway_id = aws_ec2_transit_gateway.main.id
vpc_id = aws_vpc.production.id
tags = {
Name = "production-tgw-attachment"
}
}
# Attach Shared Services VPC to TGW
resource "aws_ec2_transit_gateway_vpc_attachment" "shared" {
subnet_ids = [aws_subnet.shared_private_1.id, aws_subnet.shared_private_2.id]
transit_gateway_id = aws_ec2_transit_gateway.main.id
vpc_id = aws_vpc.shared_services.id
tags = {
Name = "shared-services-tgw-attachment"
}
}
# Route from Production VPC to TGW (for reaching other VPCs)
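resource "aws_route" "prod_to_tgw" {
route_table_id = aws_route_table.production_private.id
destination_cidr_block = "10.0.0.0/8" # All internal networks (the VPC's own local route is more specific and still wins)
transit_gateway_id = aws_ec2_transit_gateway.main.id
# The route can only be created once the VPC is attached to the TGW
depends_on = [aws_ec2_transit_gateway_vpc_attachment.production]
}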
2.4 VPC Endpoints (PrivateLink)
Why VPC Endpoints?
Access AWS services (S3, DynamoDB, etc.) without going through the internet. Benefits:
- Security: Traffic stays within AWS network
- Cost: No NAT Gateway data processing charges (gateway endpoints are free; interface endpoints are billed per hour and per GB processed)
- Performance: Lower latency
| Endpoint Type | Services | How It Works |
|---|---|---|
| Gateway Endpoint | S3, DynamoDB | Route table entry, no ENI created |
| Interface Endpoint | Most AWS services (EC2, SNS, SQS, etc.) | ENI with private IP in your subnet |
# S3 Gateway Endpoint (free)
resource "aws_vpc_endpoint" "s3" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.us-east-1.s3"
route_table_ids = [
aws_route_table.private_1.id,
aws_route_table.private_2.id
]
tags = {
Name = "s3-gateway-endpoint"
}
}
# DynamoDB Gateway Endpoint (free)
resource "aws_vpc_endpoint" "dynamodb" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.us-east-1.dynamodb"
route_table_ids = [
aws_route_table.private_1.id,
aws_route_table.private_2.id
]
tags = {
Name = "dynamodb-gateway-endpoint"
}
}
# Secrets Manager Interface Endpoint
resource "aws_vpc_endpoint" "secretsmanager" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.us-east-1.secretsmanager"
vpc_endpoint_type = "Interface"
private_dns_enabled = true
subnet_ids = [
aws_subnet.private_1.id,
aws_subnet.private_2.id
]
security_group_ids = [
aws_security_group.vpc_endpoints.id
]
tags = {
Name = "secretsmanager-interface-endpoint"
}
}
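The interface endpoint above references `aws_security_group.vpc_endpoints`, which is not defined in this guide. A minimal sketch of what it might look like, assuming any workload inside the VPC should be allowed to reach the endpoint over HTTPS (the group name and the VPC-wide source range are assumptions):

```hcl
# Security group attached to interface endpoint ENIs
resource "aws_security_group" "vpc_endpoints" {
  name        = "vpc-endpoints" # assumed name
  description = "Allow HTTPS from inside the VPC to interface endpoints"
  vpc_id      = aws_vpc.main.id

  # AWS service APIs are served over TLS on port 443
  ingress {
    description = "HTTPS from inside the VPC"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [aws_vpc.main.cidr_block]
  }

  tags = {
    Name = "vpc-endpoints-sg"
  }
}
```

A tighter variant would replace `cidr_blocks` with `security_groups` listing only the tiers that need the service, in the same style as the app and database groups in section 2.2.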
3. Compute Services
3.1 Compute Service Comparison
| Service | Type | When to Use | Pros | Cons |
|---|---|---|---|---|
| EC2 | Virtual Machines | Full OS control, legacy apps, persistent workloads | Complete control, any software | Must manage OS, scaling complexity |
| Lambda | Serverless Functions | Event-driven, short tasks (<15min), API backends | No servers, auto-scaling, pay-per-use | 15min timeout, cold starts, limited runtime options |
| ECS | Container Orchestration | Docker containers, AWS-native orchestration | Simple, integrates well with AWS | AWS-only, less ecosystem than Kubernetes |
| EKS | Managed Kubernetes | Need Kubernetes, complex microservices, multi-cloud | Industry standard, portable, huge ecosystem | Complex, expensive, steep learning curve |
| Fargate | Serverless Containers | Containers without managing servers (ECS/EKS) | No server management, auto-scaling | More expensive than EC2, less control |
| App Runner | Fully Managed Apps | Deploy from source code/container with zero config | Easiest deployment, auto-scaling | Less control, newer service |
3.2 EC2 Fundamentals
Instance Types Decision Tree
| Family | Type | Use Case | Example |
|---|---|---|---|
| T3/T4g | Burstable | Web servers, dev/test, low-traffic apps | t3.medium (2 vCPU, 4 GB) |
| M5/M6i | General Purpose | Balanced compute/memory, app servers | m5.xlarge (4 vCPU, 16 GB) |
| C5/C6i | Compute Optimized | CPU-intensive (batch processing, ML inference) | c5.2xlarge (8 vCPU, 16 GB) |
| R5/R6i | Memory Optimized | In-memory databases, caches, big data | r5.xlarge (4 vCPU, 32 GB) |
| P3/P4 | GPU | Machine learning training, HPC | p3.2xlarge (8 vCPU, 1 GPU) |
# Auto Scaling Group for web servers
resource "aws_launch_template" "web" {
name_prefix = "web-"
image_id = "ami-12345678" # Amazon Linux 2023
instance_type = "t3.medium"
# IAM instance profile
iam_instance_profile {
name = aws_iam_instance_profile.web.name
}
# Security group
vpc_security_group_ids = [aws_security_group.web.id]
# User data script (runs on boot)
user_data = base64encode(<<-EOF
#!/bin/bash
yum update -y
yum install -y httpd
systemctl start httpd
systemctl enable httpd
echo "Hello from $(hostname)" > /var/www/html/index.html
EOF
)
# EBS volume configuration
block_device_mappings {
device_name = "/dev/xvda"
ebs {
volume_size = 20
volume_type = "gp3"
encrypted = true
delete_on_termination = true
}
}
# Metadata options (IMDSv2 enforced)
metadata_options {
http_endpoint = "enabled"
http_tokens = "required" # Require IMDSv2
http_put_response_hop_limit = 1
}
tag_specifications {
resource_type = "instance"
tags = {
Name = "web-server"
}
}
}
# Auto Scaling Group
resource "aws_autoscaling_group" "web" {
name = "web-asg"
vpc_zone_identifier = [aws_subnet.public_1.id, aws_subnet.public_2.id]
target_group_arns = [aws_lb_target_group.web.arn]
health_check_type = "ELB"
health_check_grace_period = 300
min_size = 2
max_size = 10
desired_capacity = 2
launch_template {
id = aws_launch_template.web.id
version = "$Latest"
}
tag {
key = "Name"
value = "web-server"
propagate_at_launch = true
}
}
# Auto Scaling Policy (CPU-based)
resource "aws_autoscaling_policy" "web_cpu" {
name = "web-cpu-scaling"
autoscaling_group_name = aws_autoscaling_group.web.name
policy_type = "TargetTrackingScaling"
target_tracking_configuration {
predefined_metric_specification {
predefined_metric_type = "ASGAverageCPUUtilization"
}
target_value = 70.0 # Target 70% CPU
}
}
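The Auto Scaling Group above registers instances with `aws_lb_target_group.web` and uses ELB health checks, but the load balancer itself is not shown. A minimal sketch under stated assumptions: a plain HTTP listener on port 80 (a production setup would normally terminate HTTPS with an ACM certificate), and the existing `web` security group reused for the ALB for brevity rather than a dedicated one. The resource and tag names are illustrative.

```hcl
# Internet-facing Application Load Balancer in the public subnets
resource "aws_lb" "web" {
  name               = "web-alb"
  load_balancer_type = "application"
  security_groups    = [aws_security_group.web.id]
  subnets            = [aws_subnet.public_1.id, aws_subnet.public_2.id]
}

# Target group the ASG registers instances into
resource "aws_lb_target_group" "web" {
  name     = "web-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  # The ASG's health_check_type = "ELB" relies on these results
  health_check {
    path                = "/"
    matcher             = "200"
    interval            = 30
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}

# Forward all HTTP traffic to the target group
resource "aws_lb_listener" "web_http" {
  load_balancer_arn = aws_lb.web.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.web.arn
  }
}
```

With the load balancer in front, the instances no longer need to accept traffic from `0.0.0.0/0`; restricting their ingress to the ALB's security group would follow the tiered pattern from section 2.2.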
3.3 Lambda (Serverless Functions)
Lambda Best Practices:
- Keep it small: One function = one purpose
- Environment variables: Use for configuration, Secrets Manager for secrets
- Layers: Share code/dependencies across functions
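- VPC sparingly: Attach a function to a VPC only when it must reach VPC resources (it then needs a NAT Gateway or VPC endpoints to reach the internet or AWS APIs; the extra cold-start latency was largely removed in 2019)
- Timeout: Set appropriately (the default is 3 seconds; don't simply raise it to the 15-minute maximum)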
# Lambda function for S3 image processing
resource "aws_lambda_function" "image_processor" {
filename = "image_processor.zip"
function_name = "process-uploaded-images"
role = aws_iam_role.lambda_exec.arn
handler = "index.handler"
runtime = "python3.11"
source_code_hash = filebase64sha256("image_processor.zip")
# Memory and timeout
memory_size = 1024 # MB (also affects CPU)
timeout = 60 # seconds
# Environment variables
environment {
variables = {
DEST_BUCKET = aws_s3_bucket.processed_images.id
MAX_WIDTH = "1920"
MAX_HEIGHT = "1080"
}
}
# VPC configuration (only if needed)
# vpc_config {
# subnet_ids = [aws_subnet.private_1.id, aws_subnet.private_2.id]
# security_group_ids = [aws_security_group.lambda.id]
# }
tags = {
Purpose = "image-processing"
}
}
# IAM role for Lambda
resource "aws_iam_role" "lambda_exec" {
name = "lambda-image-processor-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "lambda.amazonaws.com"
}
}
]
})
}
# Lambda policy - S3 access
resource "aws_iam_role_policy" "lambda_s3" {
name = "lambda-s3-access"
role = aws_iam_role.lambda_exec.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"s3:GetObject"
]
Resource = "${aws_s3_bucket.uploads.arn}/*"
},
{
Effect = "Allow"
Action = [
"s3:PutObject"
]
Resource = "${aws_s3_bucket.processed_images.arn}/*"
}
]
})
}
# Attach AWS managed policy for Lambda basics (CloudWatch Logs)
resource "aws_iam_role_policy_attachment" "lambda_basic" {
role = aws_iam_role.lambda_exec.name
policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}
# S3 trigger for Lambda
resource "aws_lambda_permission" "allow_s3" {
statement_id = "AllowExecutionFromS3"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.image_processor.function_name
principal = "s3.amazonaws.com"
source_arn = aws_s3_bucket.uploads.arn
}
resource "aws_s3_bucket_notification" "upload_trigger" {
bucket = aws_s3_bucket.uploads.id
lambda_function {
lambda_function_arn = aws_lambda_function.image_processor.arn
events = ["s3:ObjectCreated:*"]
filter_prefix = "uploads/"
filter_suffix = ".jpg"
}
depends_on = [aws_lambda_permission.allow_s3]
}
3.4 ECS (Elastic Container Service)
ECS vs EKS Decision:
- Choose ECS if: AWS-only, simpler setup, tighter AWS integration
- Choose EKS if: Need Kubernetes, multi-cloud, complex orchestration, large ecosystem
# ECS Cluster
resource "aws_ecs_cluster" "main" {
name = "production-cluster"
setting {
name = "containerInsights"
value = "enabled"
}
tags = {
Environment = "production"
}
}
# ECS Task Definition (defines container configuration)
resource "aws_ecs_task_definition" "app" {
family = "app-service"
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
cpu = "512" # 0.5 vCPU
memory = "1024" # 1 GB
# Task execution role (pulls images, writes logs)
execution_role_arn = aws_iam_role.ecs_task_execution.arn
# Task role (permissions for app itself)
task_role_arn = aws_iam_role.ecs_task.arn
container_definitions = jsonencode([
{
name = "app"
image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest"
portMappings = [
{
containerPort = 8080
protocol = "tcp"
}
]
environment = [
{
name = "ENV"
value = "production"
}
]
# Secrets from Secrets Manager
secrets = [
{
name = "DB_PASSWORD"
valueFrom = aws_secretsmanager_secret.db_password.arn
}
]
logConfiguration = {
logDriver = "awslogs"
options = {
"awslogs-group" = aws_cloudwatch_log_group.ecs.name
"awslogs-region" = "us-east-1"
"awslogs-stream-prefix" = "app"
}
}
healthCheck = {
command = ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
interval = 30
timeout = 5
retries = 3
startPeriod = 60
}
}
])
}
# ECS Service (runs and maintains task instances)
resource "aws_ecs_service" "app" {
name = "app-service"
cluster = aws_ecs_cluster.main.id
task_definition = aws_ecs_task_definition.app.arn
desired_count = 2
launch_type = "FARGATE"
network_configuration {
subnets = [aws_subnet.private_1.id, aws_subnet.private_2.id]
security_groups = [aws_security_group.ecs_tasks.id]
assign_public_ip = false
}
# Load balancer configuration
load_balancer {
target_group_arn = aws_lb_target_group.app.arn
container_name = "app"
container_port = 8080
}
# Auto-scaling
lifecycle {
ignore_changes = [desired_count] # Let auto-scaling manage this
}
depends_on = [aws_lb_listener.app]
}
# Auto-scaling for ECS service
resource "aws_appautoscaling_target" "ecs" {
max_capacity = 10
min_capacity = 2
resource_id = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.app.name}"
scalable_dimension = "ecs:service:DesiredCount"
service_namespace = "ecs"
}
resource "aws_appautoscaling_policy" "ecs_cpu" {
name = "ecs-cpu-autoscaling"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.ecs.resource_id
scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
service_namespace = aws_appautoscaling_target.ecs.service_namespace
target_tracking_scaling_policy_configuration {
predefined_metric_specification {
predefined_metric_type = "ECSServiceAverageCPUUtilization"
}
target_value = 70.0
}
}
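The task definition above references `aws_iam_role.ecs_task_execution` without defining it. A sketch of that role follows, assuming the `DB_PASSWORD` secret is encrypted with the default AWS managed key (a customer managed KMS key would additionally need `kms:Decrypt`). The role and policy names are illustrative.

```hcl
# Execution role: used by the ECS agent, not by the application code
resource "aws_iam_role" "ecs_task_execution" {
  name = "ecs-task-execution-role" # assumed name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ecs-tasks.amazonaws.com"
        }
      }
    ]
  })
}

# AWS managed policy: pull images from ECR and write to CloudWatch Logs
resource "aws_iam_role_policy_attachment" "ecs_task_execution" {
  role       = aws_iam_role.ecs_task_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Needed because the task definition injects DB_PASSWORD via the "secrets" block
resource "aws_iam_role_policy" "ecs_read_db_secret" {
  name = "ecs-read-db-secret"
  role = aws_iam_role.ecs_task_execution.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["secretsmanager:GetSecretValue"]
        Resource = aws_secretsmanager_secret.db_password.arn
      }
    ]
  })
}
```

The separate task role (`aws_iam_role.ecs_task`) uses the same `ecs-tasks.amazonaws.com` trust policy but carries only the permissions the application itself needs at runtime.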
4. Storage Services
4.1 Storage Service Comparison
| Service | Type | Use Case | Max Size | Access |
|---|---|---|---|---|
| S3 | Object Storage | Static assets, backups, data lake, archives | Unlimited (5TB per object) | HTTP API |
| EBS | Block Storage | EC2 instance boot/data volumes | 64 TiB | Single EC2 instance (except io2 Multi-Attach) |
| EFS | File Storage (NFS) | Shared file storage across multiple EC2/containers | Unlimited | Multiple instances concurrently |
| FSx | Managed File Systems | Windows (SMB), Lustre (HPC), NetApp, OpenZFS | Varies | Multiple instances |
4.2 S3 (Simple Storage Service)
S3 Storage Classes
| Storage Class | Use Case | Availability | Retrieval | Cost |
|---|---|---|---|---|
| S3 Standard | Frequently accessed data | 99.99% | Instant | $$$$ |
| S3 Intelligent-Tiering | Unknown/changing access patterns | 99.9% | Instant | $$$ + monitoring fee |
| S3 Standard-IA | Infrequently accessed (backups) | 99.9% | Instant | $$ + retrieval fee |
| S3 One Zone-IA | Reproducible data (thumbnails) | 99.5% | Instant | $ + retrieval fee |
| S3 Glacier Instant Retrieval | Archive with instant access | 99.9% | Instant | $ + retrieval fee |
| S3 Glacier Flexible Retrieval | Archives (expedited retrieval in 1-5 min, standard in 3-5 hours) | 99.99% | Minutes-hours | Very low |
| S3 Glacier Deep Archive | Long-term archive (12h retrieval) | 99.99% | 12 hours | Lowest |
# S3 bucket with security best practices
resource "aws_s3_bucket" "app_data" {
bucket = "my-app-data-bucket"
tags = {
Purpose = "application-data"
}
}
# Block public access (CRITICAL)
resource "aws_s3_bucket_public_access_block" "app_data" {
bucket = aws_s3_bucket.app_data.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
# Enable versioning
resource "aws_s3_bucket_versioning" "app_data" {
bucket = aws_s3_bucket.app_data.id
versioning_configuration {
status = "Enabled"
}
}
# Server-side encryption (AES-256 or KMS)
resource "aws_s3_bucket_server_side_encryption_configuration" "app_data" {
bucket = aws_s3_bucket.app_data.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
kms_master_key_id = aws_kms_key.s3.id
}
bucket_key_enabled = true # Reduces KMS costs
}
}
# Lifecycle policy (auto-transition to cheaper storage)
resource "aws_s3_bucket_lifecycle_configuration" "app_data" {
bucket = aws_s3_bucket.app_data.id
rule {
id = "transition-old-data"
status = "Enabled"
# Empty filter applies the rule to every object in the bucket
filter {}
# Transition to IA after 30 days
transition {
days = 30
storage_class = "STANDARD_IA"
}
# Transition to Glacier after 90 days
transition {
days = 90
storage_class = "GLACIER"
}
# Delete after 365 days
expiration {
days = 365
}
# Clean up old versions
noncurrent_version_expiration {
noncurrent_days = 90
}
}
}
# Bucket policy (restrict access)
resource "aws_s3_bucket_policy" "app_data" {
bucket = aws_s3_bucket.app_data.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
# Enforce SSL/TLS
Sid = "DenyInsecureTransport"
Effect = "Deny"
Principal = "*"
Action = "s3:*"
Resource = [
aws_s3_bucket.app_data.arn,
"${aws_s3_bucket.app_data.arn}/*"
]
Condition = {
Bool = {
"aws:SecureTransport" = "false"
}
}
}
]
})
}
4.3 EBS (Elastic Block Store)
EBS Volume Types
| Type | Use Case | IOPS | Throughput | Cost |
|---|---|---|---|---|
| gp3 | General purpose (default choice) | 3,000-16,000 (configurable) | 125-1,000 MB/s | $0.08/GB-month |
| gp2 | General purpose (older) | 100-16,000 (size-based) | Up to 250 MB/s | $0.10/GB-month |
| io2 | High-performance databases | Up to 64,000 | Up to 1,000 MB/s | $0.125/GB-month + IOPS cost |
| st1 | Big data, data warehouses (HDD) | N/A (throughput optimized) | Up to 500 MB/s | $0.045/GB-month |
| sc1 | Cold storage (HDD) | N/A | Up to 250 MB/s | $0.015/GB-month |
# EBS volume for database
resource "aws_ebs_volume" "database" {
availability_zone = "us-east-1a"
size = 100 # GB
type = "gp3"
iops = 3000
throughput = 125 # MB/s
encrypted = true
kms_key_id = aws_kms_key.ebs.id
tags = {
Name = "database-volume"
Backup = "true" # Matches the DLM policy's target_tags so this volume gets snapshotted
}
}
# Attach to EC2 instance
resource "aws_volume_attachment" "database" {
device_name = "/dev/sdf"
volume_id = aws_ebs_volume.database.id
instance_id = aws_instance.database.id
}
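# EBS snapshot for backup (one-off; use Data Lifecycle Manager for recurring snapshots)
resource "aws_ebs_snapshot" "database_backup" {
volume_id = aws_ebs_volume.database.id
description = "Database backup ${formatdate("YYYY-MM-DD", timestamp())}"
tags = {
Name = "database-snapshot"
}
# timestamp() changes on every run; without this, each apply would replace the snapshot
lifecycle {
ignore_changes = [description]
}
}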
# Data Lifecycle Manager - automated snapshots
resource "aws_dlm_lifecycle_policy" "database_backups" {
description = "Daily database snapshots"
execution_role_arn = aws_iam_role.dlm.arn
state = "ENABLED"
policy_details {
resource_types = ["VOLUME"]
schedule {
name = "Daily snapshots"
create_rule {
interval = 24 # hours
interval_unit = "HOURS"
times = ["03:00"] # 3 AM UTC
}
retain_rule {
count = 7 # Keep 7 daily snapshots
}
tags_to_add = {
SnapshotCreator = "DLM"
}
copy_tags = true
}
target_tags = {
Backup = "true"
}
}
}
4.4 EFS (Elastic File System)
EFS vs EBS:
- EFS: Shared file storage mounted by multiple instances over NFS; capacity grows and shrinks automatically
- EBS: Block storage attached to a single instance (usually; io1/io2 Multi-Attach is the exception), fixed provisioned size, lower latency
# EFS file system
resource "aws_efs_file_system" "shared_data" {
creation_token = "shared-app-data"
encrypted = true
kms_key_id = aws_kms_key.efs.id
# Performance mode
performance_mode = "generalPurpose" # or "maxIO" for high parallelism
# Throughput mode
throughput_mode = "bursting" # or "provisioned" for consistent throughput
lifecycle_policy {
transition_to_ia = "AFTER_30_DAYS" # Move to cheaper Infrequent Access
}
tags = {
Name = "shared-data"
}
}
# Mount targets (one per AZ)
resource "aws_efs_mount_target" "az1" {
file_system_id = aws_efs_file_system.shared_data.id
subnet_id = aws_subnet.private_1.id
security_groups = [aws_security_group.efs.id]
}
resource "aws_efs_mount_target" "az2" {
file_system_id = aws_efs_file_system.shared_data.id
subnet_id = aws_subnet.private_2.id
security_groups = [aws_security_group.efs.id]
}
# Security group for EFS
resource "aws_security_group" "efs" {
name = "efs-mount-targets"
description = "Security group for EFS mount targets"
vpc_id = aws_vpc.main.id
ingress {
description = "NFS from app servers"
from_port = 2049
to_port = 2049
protocol = "tcp"
security_groups = [aws_security_group.app.id]
}
tags = {
Name = "efs-sg"
}
}
# User data to mount EFS on EC2 instance
resource "aws_instance" "app" {
# ... other config ...
user_data = <<-EOF
#!/bin/bash
yum install -y amazon-efs-utils
mkdir -p /mnt/efs
mount -t efs ${aws_efs_file_system.shared_data.id}:/ /mnt/efs
echo "${aws_efs_file_system.shared_data.id}:/ /mnt/efs efs defaults,_netdev 0 0" >> /etc/fstab
EOF
}
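EBS is normally attached to one instance, as the comparison above notes. The exception is Multi-Attach, which is limited to io1/io2 volumes and to instances in the same Availability Zone. A minimal sketch, with a hypothetical resource name and sizing:

```hcl
# Sketch: io2 volume that can be attached to several instances in the same AZ
resource "aws_ebs_volume" "shared_block" {
  availability_zone    = "us-east-1a"
  size                 = 100   # GB
  type                 = "io2" # Multi-Attach requires io1 or io2
  iops                 = 3000
  multi_attach_enabled = true
  encrypted            = true

  tags = {
    Name = "shared-block-volume" # hypothetical name
  }
}
```

Multi-Attach only shares the block device. Coordinating concurrent writes is left to a cluster-aware file system or the application, so for ordinary shared files EFS is usually the simpler option.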
5. Database Services
5.1 Database Service Comparison
| Service | Type | Engine | When to Use | Scaling |
| RDS | Relational | PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, Db2 | Traditional RDBMS needs, ACID compliance | Vertical (read replicas for reads) |
| Aurora | Relational | MySQL, PostgreSQL compatible | High-performance relational, cloud-native | Vertical + auto-scaling read replicas |
| DynamoDB | NoSQL (Key-Value/Document) | Proprietary | Massive scale, single-digit ms latency, serverless | Horizontal (virtually unlimited) |
| ElastiCache | In-Memory Cache | Redis, Memcached | Caching, session store, real-time analytics | Horizontal (cluster mode) |
| DocumentDB | NoSQL (Document) | MongoDB compatible | MongoDB workloads on AWS | Vertical + read replicas |
| Neptune | Graph | Gremlin, SPARQL, openCypher (query languages) | Graph workloads (social networks, fraud detection) | Vertical + read replicas |
Relational vs NoSQL Decision Tree
Choose Relational (RDS/Aurora) when:
- Complex queries with JOINs
- ACID transactions critical
- Structured data with relationships
- Need SQL standard compliance
- Data integrity and referential integrity important
Choose NoSQL (DynamoDB) when:
- Need massive scale (millions of requests/second)
- Simple key-value or document lookups
- Schema flexibility needed
- Single-digit millisecond latency required
- Serverless architecture
5.2 RDS (Relational Database Service)
# RDS PostgreSQL instance
resource "aws_db_instance" "postgres" {
identifier = "production-postgres"
engine = "postgres"
engine_version = "15.4"
instance_class = "db.t3.medium"
# Storage
allocated_storage = 100 # GB
max_allocated_storage = 1000 # Auto-scaling up to 1TB
storage_type = "gp3"
storage_encrypted = true
kms_key_id = aws_kms_key.rds.id
# Database config
db_name = "myapp"
username = "dbadmin" # "admin" is a reserved word for RDS PostgreSQL
password = random_password.db_master.result # Use random password!
port = 5432
# Networking
db_subnet_group_name = aws_db_subnet_group.main.name
vpc_security_group_ids = [aws_security_group.rds.id]
publicly_accessible = false # NEVER true for production!
# High Availability
multi_az = true # Standby in different AZ
# Backups
backup_retention_period = 7 # Days
backup_window = "03:00-04:00" # UTC
maintenance_window = "Mon:04:00-Mon:05:00" # UTC
# Monitoring
enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
monitoring_interval = 60 # Enhanced monitoring every 60s
monitoring_role_arn = aws_iam_role.rds_monitoring.arn
# Security
deletion_protection = true # Prevent accidental deletion
skip_final_snapshot = false
final_snapshot_identifier = "production-postgres-final-snapshot"
# Performance Insights
performance_insights_enabled = true
performance_insights_kms_key_id = aws_kms_key.rds.id
performance_insights_retention_period = 7
tags = {
Name = "production-postgres"
}
}
# DB Subnet Group (spans multiple AZs)
resource "aws_db_subnet_group" "main" {
name = "main-db-subnet-group"
subnet_ids = [aws_subnet.private_1.id, aws_subnet.private_2.id]
tags = {
Name = "main-db-subnet-group"
}
}
# Read Replica (for read scaling)
resource "aws_db_instance" "postgres_replica" {
identifier = "production-postgres-replica"
replicate_source_db = aws_db_instance.postgres.identifier
instance_class = "db.t3.medium"
publicly_accessible = false
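# Optionally pin the replica to a different AZ than the primary.
# A cross-region replica (for disaster recovery) instead needs the source ARN
# in replicate_source_db and a provider configured for the target region.
# availability_zone = "us-east-1b"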
tags = {
Name = "production-postgres-replica"
}
}
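# Random password for DB
resource "random_password" "db_master" {
length = 32
special = true
# RDS rejects "/", "@", double quotes, and spaces in the master password,
# so restrict the special characters the generator may use
override_special = "!#$%*()-_=+[]{}:?"
}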
# Store password in Secrets Manager
resource "aws_secretsmanager_secret" "db_password" {
name = "production/postgres/master-password"
recovery_window_in_days = 7
}
resource "aws_secretsmanager_secret_version" "db_password" {
secret_id = aws_secretsmanager_secret.db_password.id
secret_string = random_password.db_master.result
}
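The read replica above lives in the same region as its source. For disaster recovery across regions, a replica could be sketched roughly as follows. The provider alias `dr`, the region, and the `aws_kms_key.rds_dr`, `aws_db_subnet_group.dr`, and `aws_security_group.rds_dr` resources are all hypothetical names assumed to exist in the target region.

```hcl
# Hypothetical second provider for the disaster-recovery region
provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

resource "aws_db_instance" "postgres_replica_dr" {
  provider   = aws.dr
  identifier = "production-postgres-replica-dr"

  # Cross-region replicas reference the source by ARN, not by identifier
  replicate_source_db = aws_db_instance.postgres.arn
  instance_class      = "db.t3.medium"
  publicly_accessible = false

  # The source is encrypted, so the replica needs a KMS key in its own region
  kms_key_id = aws_kms_key.rds_dr.arn # hypothetical key in us-west-2

  # Networking must come from the DR region's VPC
  db_subnet_group_name   = aws_db_subnet_group.dr.name    # hypothetical
  vpc_security_group_ids = [aws_security_group.rds_dr.id] # hypothetical

  skip_final_snapshot = true
}
```

Promoting this replica during a regional outage is a manual or scripted step; it does not fail over automatically the way a Multi-AZ standby does.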
5.3 Aurora (Cloud-Native Relational)
Aurora Advantages over RDS:
- Performance: Up to 5x the throughput of MySQL and up to 3x that of PostgreSQL (per AWS's own benchmarks)
- Storage: Auto-scaling (10GB → 128TB), 6 copies across 3 AZs
- Failover: Automatic failover, typically in under 30 seconds
- Read Replicas: Up to 15 low-latency read replicas
- Serverless: Aurora Serverless v2 auto-scales capacity
Trade-off: Typically 20-30% more expensive than comparable RDS instances, which is often justified by the availability and performance gains
# Aurora Cluster
resource "aws_rds_cluster" "aurora" {
cluster_identifier = "production-aurora-cluster"
engine = "aurora-postgresql"
engine_version = "15.4"
database_name = "myapp"
master_username = "dbadmin" # "admin" is a reserved word for PostgreSQL engines
master_password = random_password.aurora.result
# Networking
db_subnet_group_name = aws_db_subnet_group.main.name
vpc_security_group_ids = [aws_security_group.aurora.id]
# Backups
backup_retention_period = 7
preferred_backup_window = "03:00-04:00"
# Encryption
storage_encrypted = true
kms_key_id = aws_kms_key.aurora.id
# Deletion protection
deletion_protection = true
skip_final_snapshot = false
final_snapshot_identifier = "aurora-final-snapshot"
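# Backtrack (rewinding the cluster in place without restoring a backup) is
# available only for Aurora MySQL, so it is not set on this PostgreSQL cluster.
# On Aurora MySQL, backtrack_window is given in seconds (max 259200 = 72 hours).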
# CloudWatch Logs
enabled_cloudwatch_logs_exports = ["postgresql"]
tags = {
Name = "production-aurora"
}
}
# Aurora Writer Instance
resource "aws_rds_cluster_instance" "aurora_writer" {
identifier = "production-aurora-writer"
cluster_identifier = aws_rds_cluster.aurora.id
instance_class = "db.r6g.large"
engine = aws_rds_cluster.aurora.engine
engine_version = aws_rds_cluster.aurora.engine_version
performance_insights_enabled = true
tags = {
Name = "aurora-writer"
Role = "writer"
}
}
# Aurora Reader Instance (auto-scaling target)
resource "aws_rds_cluster_instance" "aurora_reader_1" {
identifier = "production-aurora-reader-1"
cluster_identifier = aws_rds_cluster.aurora.id
instance_class = "db.r6g.large"
engine = aws_rds_cluster.aurora.engine
engine_version = aws_rds_cluster.aurora.engine_version
performance_insights_enabled = true
tags = {
Name = "aurora-reader-1"
Role = "reader"
}
}
# Auto-scaling for Aurora read replicas
resource "aws_appautoscaling_target" "aurora_replicas" {
max_capacity = 5
min_capacity = 1
resource_id = "cluster:${aws_rds_cluster.aurora.cluster_identifier}"
scalable_dimension = "rds:cluster:ReadReplicaCount"
service_namespace = "rds"
}
resource "aws_appautoscaling_policy" "aurora_replicas" {
name = "aurora-cpu-autoscaling"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.aurora_replicas.resource_id
scalable_dimension = aws_appautoscaling_target.aurora_replicas.scalable_dimension
service_namespace = aws_appautoscaling_target.aurora_replicas.service_namespace
target_tracking_scaling_policy_configuration {
predefined_metric_specification {
predefined_metric_type = "RDSReaderAverageCPUUtilization"
}
target_value = 70.0
}
}
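The Serverless v2 option listed among the Aurora advantages can be sketched as a variation on the cluster above. Resource names and the capacity range are assumptions for illustration; networking arguments are omitted for brevity and would mirror the provisioned cluster.

```hcl
resource "aws_rds_cluster" "aurora_serverless" {
  cluster_identifier = "example-aurora-serverless" # hypothetical name
  engine             = "aurora-postgresql"
  engine_mode        = "provisioned" # Serverless v2 uses the provisioned engine mode
  engine_version     = "15.4"
  database_name      = "myapp"
  master_username    = "dbadmin"
  master_password    = random_password.aurora.result
  storage_encrypted  = true

  # Capacity range in Aurora Capacity Units (ACUs); values are illustrative
  serverlessv2_scaling_configuration {
    min_capacity = 0.5
    max_capacity = 8
  }

  skip_final_snapshot = true # acceptable for a sketch, not for production
}

resource "aws_rds_cluster_instance" "aurora_serverless" {
  identifier         = "example-aurora-serverless-1"
  cluster_identifier = aws_rds_cluster.aurora_serverless.id
  instance_class     = "db.serverless" # marks the instance as Serverless v2
  engine             = aws_rds_cluster.aurora_serverless.engine
  engine_version     = aws_rds_cluster.aurora_serverless.engine_version
}
```

Serverless v2 tends to suit spiky or unpredictable load, while steady high utilization is usually cheaper on fixed instance classes such as the `db.r6g.large` used above.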
5.4 DynamoDB (NoSQL)
DynamoDB Basics:
- Primary Key: Partition key (required) + sort key (optional)
- Partition Key: Determines which partition stores the item (hash)
- Sort Key: Orders items within a partition, enables range queries
- Secondary Indexes: Query on non-primary-key attributes
# DynamoDB table
resource "aws_dynamodb_table" "users" {
name = "Users"
billing_mode = "PAY_PER_REQUEST" # On-demand pricing
# Or "PROVISIONED" with read_capacity and write_capacity
hash_key = "user_id" # Partition key
range_key = "created_at" # Sort key (optional)
attribute {
name = "user_id"
type = "S" # String
}
attribute {
name = "created_at"
type = "N" # Number (Unix timestamp)
}
attribute {
name = "email"
type = "S"
}
# Global Secondary Index (query by email)
global_secondary_index {
name = "EmailIndex"
hash_key = "email"
projection_type = "ALL" # Include all attributes in index
# For on-demand billing, no need to specify capacity
}
# Point-in-time recovery
point_in_time_recovery {
enabled = true
}
# Server-side encryption
server_side_encryption {
enabled = true
kms_key_arn = aws_kms_key.dynamodb.arn
}
# TTL (auto-delete expired items)
ttl {
attribute_name = "expiration_time"
enabled = true
}
# DynamoDB Streams (for change data capture)
stream_enabled = true
stream_view_type = "NEW_AND_OLD_IMAGES"
tags = {
Name = "users-table"
}
}
# DynamoDB Global Table (multi-region replication)
resource "aws_dynamodb_table" "orders_global" {
name = "Orders"
billing_mode = "PAY_PER_REQUEST"
hash_key = "order_id"
stream_enabled = true
stream_view_type = "NEW_AND_OLD_IMAGES"
attribute {
name = "order_id"
type = "S"
}
# Replicate to multiple regions
replica {
region_name = "us-west-2"
}
replica {
region_name = "eu-west-1"
}
tags = {
Name = "orders-global-table"
}
}
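The tables above use on-demand billing. The "PROVISIONED" alternative mentioned in the comment could look roughly like this, paired with Application Auto Scaling in the same way as the Aurora replicas earlier. The `Sessions` table and all capacity numbers are hypothetical.

```hcl
resource "aws_dynamodb_table" "sessions" {
  name           = "Sessions" # hypothetical table
  billing_mode   = "PROVISIONED"
  read_capacity  = 5
  write_capacity = 5
  hash_key       = "session_id"

  attribute {
    name = "session_id"
    type = "S"
  }

  lifecycle {
    # Let auto scaling own capacity so Terraform does not reset it
    ignore_changes = [read_capacity, write_capacity]
  }
}

resource "aws_appautoscaling_target" "sessions_read" {
  max_capacity       = 100
  min_capacity       = 5
  resource_id        = "table/${aws_dynamodb_table.sessions.name}"
  scalable_dimension = "dynamodb:table:ReadCapacityUnits"
  service_namespace  = "dynamodb"
}

resource "aws_appautoscaling_policy" "sessions_read" {
  name               = "sessions-read-utilization"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.sessions_read.resource_id
  scalable_dimension = aws_appautoscaling_target.sessions_read.scalable_dimension
  service_namespace  = aws_appautoscaling_target.sessions_read.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "DynamoDBReadCapacityUtilization"
    }
    target_value = 70.0 # aim for 70% of provisioned reads consumed
  }
}
```

Write capacity would need a matching target and policy using `dynamodb:table:WriteCapacityUnits` and `DynamoDBWriteCapacityUtilization`. Provisioned mode generally pays off for steady, predictable traffic; on-demand is simpler for spiky or unknown load.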
5.5 ElastiCache (Redis/Memcached)
Redis vs Memcached:
- Redis: Rich data structures (lists, sets, sorted sets), persistence, pub/sub, transactions. Choose for: session store, leaderboards, queues
- Memcached: Simple key-value, multi-threaded, simpler. Choose for: simple caching, horizontal scaling
See Messaging Systems Guide for Redis Pub/Sub details.
# ElastiCache Redis Cluster
resource "aws_elasticache_replication_group" "redis" {
replication_group_id = "production-redis"
description = "Redis cluster for session storage and caching" # replaces the deprecated replication_group_description
engine = "redis"
engine_version = "7.0"
node_type = "cache.r6g.large"
num_cache_clusters = 3 # 1 primary + 2 replicas
parameter_group_name = "default.redis7"
port = 6379
# Networking
subnet_group_name = aws_elasticache_subnet_group.main.name
security_group_ids = [aws_security_group.redis.id]
# High Availability
automatic_failover_enabled = true # Auto-failover to replica
multi_az_enabled = true
# Backups
snapshot_retention_limit = 5 # Days
snapshot_window = "03:00-04:00"
maintenance_window = "sun:05:00-sun:06:00"
# Encryption
at_rest_encryption_enabled = true
transit_encryption_enabled = true # required when auth_token is set
# auth_token must be 16-128 characters and allows only a limited set of special characters
auth_token = random_password.redis_auth.result
# Logs
log_delivery_configuration {
destination = aws_cloudwatch_log_group.redis.name
destination_type = "cloudwatch-logs"
log_format = "json"
log_type = "slow-log"
}
tags = {
Name = "production-redis"
}
}
# Subnet group for ElastiCache
resource "aws_elasticache_subnet_group" "main" {
name = "main-cache-subnet-group"
subnet_ids = [aws_subnet.private_1.id, aws_subnet.private_2.id]
tags = {
Name = "main-cache-subnet-group"
}
}
# Security group for Redis
resource "aws_security_group" "redis" {
name = "redis-cluster"
description = "Security group for Redis cluster"
vpc_id = aws_vpc.main.id
ingress {
description = "Redis from app servers"
from_port = 6379
to_port = 6379
protocol = "tcp"
security_groups = [aws_security_group.app.id]
}
tags = {
Name = "redis-sg"
}
}
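For the Memcached side of the comparison, a plain caching cluster could be sketched as below. The cluster name and the `aws_security_group.memcached` resource (allowing TCP 11211 from the app servers) are hypothetical.

```hcl
resource "aws_elasticache_cluster" "memcached" {
  cluster_id           = "example-memcached" # hypothetical name
  engine               = "memcached"
  node_type            = "cache.r6g.large"
  num_cache_nodes      = 3 # independent nodes; clients shard keys across them
  parameter_group_name = "default.memcached1.6"
  port                 = 11211
  az_mode              = "cross-az" # spread nodes across Availability Zones

  subnet_group_name  = aws_elasticache_subnet_group.main.name
  security_group_ids = [aws_security_group.memcached.id] # hypothetical SG
}
```

Unlike the Redis replication group, there is no replication or failover here: losing a node loses that node's share of the cache, which is acceptable only when the cache can be rebuilt from the source of truth.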
6. Security Deep Dive
6.1 KMS (Key Management Service)
# KMS key for encrypting data
resource "aws_kms_key" "app_data" {
description = "KMS key for application data encryption"
deletion_window_in_days = 30 # Grace period before deletion
enable_key_rotation = true
# Key policy
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "Enable IAM User Permissions"
Effect = "Allow"
Principal = {
AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"
}
Action = "kms:*"
Resource = "*"
},
{
Sid = "Allow services to use the key"
Effect = "Allow"
Principal = {
Service = [
"s3.amazonaws.com",
"rds.amazonaws.com",
"dynamodb.amazonaws.com",
"secretsmanager.amazonaws.com"
]
}
Action = [
"kms:Decrypt",
"kms:GenerateDataKey"
]
Resource = "*"
}
]
})
tags = {
Name = "app-data-key"
}
}
# Alias for easier reference
resource "aws_kms_alias" "app_data" {
name = "alias/app-data"
target_key_id = aws_kms_key.app_data.key_id
}
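On the consuming side, an application role can be limited to using the key only through a specific service. The sketch below uses the `kms:ViaService` condition key; `aws_iam_role.app` and the region in the condition value are assumptions for illustration.

```hcl
data "aws_iam_policy_document" "app_use_key_via_s3" {
  statement {
    sid       = "UseKeyOnlyThroughS3"
    effect    = "Allow"
    actions   = ["kms:Decrypt", "kms:GenerateDataKey"]
    resources = [aws_kms_key.app_data.arn]

    # Requests must arrive via S3 in this region, not as direct KMS calls
    condition {
      test     = "StringEquals"
      variable = "kms:ViaService"
      values   = ["s3.us-east-1.amazonaws.com"] # region is an assumption
    }
  }
}

resource "aws_iam_role_policy" "app_use_key" {
  name   = "use-app-data-key-via-s3"
  role   = aws_iam_role.app.id # hypothetical application role
  policy = data.aws_iam_policy_document.app_use_key_via_s3.json
}
```

With this in place the role can read and write encrypted S3 objects but cannot call `kms:Decrypt` directly on arbitrary ciphertext, which narrows the blast radius if its credentials leak.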
6.2 Secrets Manager
# Secrets Manager secret
resource "aws_secretsmanager_secret" "api_key" {
name = "production/api/external-service-key"
description = "API key for external service"
recovery_window_in_days = 7
tags = {
Purpose = "external-api-auth"
}
}
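# Secret value (placeholders only: supply real values through a sensitive
# variable or set them out-of-band; never hardcode secrets in source)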
resource "aws_secretsmanager_secret_version" "api_key" {
secret_id = aws_secretsmanager_secret.api_key.id
secret_string = jsonencode({
api_key = "super-secret-key"
api_secret = "super-secret-value"
})
}
# IAM policy to allow Lambda to read secret
resource "aws_iam_role_policy" "lambda_read_secret" {
name = "read-api-secret"
role = aws_iam_role.lambda.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"secretsmanager:GetSecretValue"
]
Resource = aws_secretsmanager_secret.api_key.arn
}
]
})
}
# Automatic rotation (Lambda function rotates secret)
resource "aws_secretsmanager_secret_rotation" "api_key" {
secret_id = aws_secretsmanager_secret.api_key.id
rotation_lambda_arn = aws_lambda_function.rotate_secret.arn
rotation_rules {
automatically_after_days = 30
}
}
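Rotation only works if Secrets Manager is allowed to invoke the rotation function. A minimal sketch of that permission, assuming the `aws_lambda_function.rotate_secret` resource referenced above:

```hcl
resource "aws_lambda_permission" "allow_secretsmanager_rotation" {
  statement_id  = "AllowSecretsManagerInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.rotate_secret.function_name
  principal     = "secretsmanager.amazonaws.com"
}
```

The rotation function's own execution role additionally needs permission to read and update the secret it rotates; that policy is not shown here.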
6.3 Security Monitoring & Compliance
Essential Security Services
| Service | Purpose | What It Detects |
| GuardDuty | Threat detection | Unusual API calls, compromised instances, reconnaissance |
| Security Hub | Centralized security findings | Aggregates GuardDuty, Inspector, Macie, Config findings |
| CloudTrail | API audit logging | Who did what, when (all AWS API calls) |
| Config | Resource compliance | Non-compliant resources (unencrypted EBS, public S3) |
| Inspector | Vulnerability scanning | EC2/ECR vulnerabilities, network exposure |
| Macie | Data discovery & protection | Sensitive data in S3 (PII, credentials) |
| WAF | Web application firewall | SQL injection, XSS, bot traffic |
# Enable GuardDuty (threat detection)
resource "aws_guardduty_detector" "main" {
enable = true
finding_publishing_frequency = "FIFTEEN_MINUTES"
# Newer AWS provider versions deprecate this block in favor of aws_guardduty_detector_feature
datasources {
s3_logs {
enable = true
}
kubernetes {
audit_logs {
enable = true
}
}
}
tags = {
Name = "main-guardduty"
}
}
# Enable Security Hub (compliance & security findings)
resource "aws_securityhub_account" "main" {
enable_default_standards = true
control_finding_generator = "SECURITY_CONTROL"
}
# Enable CloudTrail (API logging)
resource "aws_cloudtrail" "main" {
name = "organization-trail"
s3_bucket_name = aws_s3_bucket.cloudtrail.id
include_global_service_events = true
is_multi_region_trail = true
is_organization_trail = true
# Enable logging of data events
event_selector {
read_write_type = "All"
include_management_events = true
# Log S3 data events for all buckets
data_resource {
type = "AWS::S3::Object"
values = ["arn:aws:s3"]
}
# Log invocations of all Lambda functions
data_resource {
type = "AWS::Lambda::Function"
values = ["arn:aws:lambda"]
}
}
# CloudWatch Logs integration
cloud_watch_logs_group_arn = "${aws_cloudwatch_log_group.cloudtrail.arn}:*"
cloud_watch_logs_role_arn = aws_iam_role.cloudtrail_cloudwatch.arn
tags = {
Name = "organization-cloudtrail"
}
}
# Enable Config (resource compliance)
resource "aws_config_configuration_recorder" "main" {
name = "main-config-recorder"
role_arn = aws_iam_role.config.arn
recording_group {
all_supported = true
include_global_resource_types = true
}
}
resource "aws_config_configuration_recorder_status" "main" {
name = aws_config_configuration_recorder.main.name
is_enabled = true
depends_on = [aws_config_delivery_channel.main]
}
# Config rules (compliance checks)
resource "aws_config_config_rule" "encrypted_volumes" {
name = "encrypted-volumes"
source {
owner = "AWS"
source_identifier = "ENCRYPTED_VOLUMES"
}
depends_on = [aws_config_configuration_recorder.main]
}
resource "aws_config_config_rule" "s3_bucket_public_read_prohibited" {
name = "s3-bucket-public-read-prohibited"
source {
owner = "AWS"
source_identifier = "S3_BUCKET_PUBLIC_READ_PROHIBITED"
}
depends_on = [aws_config_configuration_recorder.main]
}
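The recorder status above depends on `aws_config_delivery_channel.main`, which is not shown. A minimal sketch of that resource, assuming a hypothetical `aws_s3_bucket.config` bucket whose policy allows AWS Config to write to it:

```hcl
resource "aws_config_delivery_channel" "main" {
  name           = "main-config-delivery"
  s3_bucket_name = aws_s3_bucket.config.id # hypothetical bucket for Config data

  snapshot_delivery_properties {
    delivery_frequency = "TwentyFour_Hours"
  }

  # The recorder must exist before a delivery channel can be created
  depends_on = [aws_config_configuration_recorder.main]
}
```

The ordering is recorder, then delivery channel, then recorder status, which is why the status resource lists the delivery channel in its `depends_on`.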
7. Real-World Architecture Patterns
7.1 Multi-Tier Web Application
Architecture: 3-Tier Web App with Auto-Scaling
Use Case: E-commerce website with variable traffic
graph TB
Users["Users"]
subgraph "AWS Cloud"
Route53["Route 53
DNS"]
CloudFront["CloudFront CDN
(Static Assets)"]
subgraph "VPC"
subgraph "Public Subnets"
ALB["Application Load Balancer
(us-east-1a, us-east-1b)"]
end
subgraph "Private Subnets - Web Tier"
ASG1["Auto Scaling Group
EC2 Web Servers"]
end
subgraph "Private Subnets - App Tier"
ASG2["Auto Scaling Group
EC2 App Servers"]
Cache["ElastiCache Redis
(Session Store)"]
end
subgraph "Private Subnets - Data Tier"
RDS["Aurora PostgreSQL
(Multi-AZ)"]
end
end
S3["S3
(User Uploads)"]
end
Users --> Route53
Route53 --> CloudFront
CloudFront --> S3
Route53 --> ALB
ALB --> ASG1
ASG1 --> ASG2
ASG2 --> Cache
ASG2 --> RDS
ASG2 --> S3
style CloudFront fill:#ff9900,stroke:#232f3e,stroke-width:2px,color:#232f3e
style ALB fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440
style RDS fill:#bf616a,stroke:#d08770,stroke-width:2px,color:#2e3440
Key Components:
- Route 53: DNS routing with health checks
- CloudFront: CDN for static assets (images, CSS, JS)
- ALB: Distributes traffic across web servers
- Web Tier: Auto-scaling EC2 instances (t3.medium)
- App Tier: Auto-scaling EC2 instances (m5.large), business logic
- Cache Tier: Redis for session storage and caching
- Data Tier: Aurora PostgreSQL (writer + 2 readers), S3 for uploads
Full Terraform:
Combine the VPC, ALB, Auto Scaling Group, RDS/Aurora, and ElastiCache examples from the sections above.
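The auto-scaling behavior of the web tier could be expressed as a target tracking policy. This is a sketch only: `aws_autoscaling_group.web` is a hypothetical Auto Scaling Group, and the 60% CPU target is an illustrative value.

```hcl
resource "aws_autoscaling_policy" "web_cpu" {
  name                   = "web-cpu-target-tracking"
  autoscaling_group_name = aws_autoscaling_group.web.name # hypothetical ASG
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 60.0 # add or remove instances to hold average CPU near 60%
  }
}
```

Target tracking suits the variable-traffic use case because it scales in both directions from a single setting, without separate scale-out and scale-in alarms.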
7.2 Serverless Microservices
Architecture: Event-Driven Serverless
Use Case: Order processing system with multiple services
graph TB
API["API Gateway
(REST API)"]
Lambda1["Lambda: Create Order"]
Lambda2["Lambda: Process Payment"]
Lambda3["Lambda: Update Inventory"]
Lambda4["Lambda: Send Notification"]
DDB["DynamoDB: Orders Table"]
SQS1["SQS: Payment Queue"]
SQS2["SQS: Inventory Queue"]
SNS["SNS: Order Events Topic"]
S3["S3: Order Receipts"]
SES["SES: Email Service"]
API --> Lambda1
Lambda1 --> DDB
Lambda1 --> SQS1
Lambda1 --> SQS2
Lambda1 --> SNS
SQS1 --> Lambda2
Lambda2 --> DDB
SQS2 --> Lambda3
Lambda3 --> DDB
SNS --> Lambda4
Lambda4 --> S3
Lambda4 --> SES
style API fill:#ff9900,stroke:#232f3e,stroke-width:2px,color:#232f3e
style DDB fill:#88c0d0,stroke:#81a1c1,stroke-width:2px,color:#2e3440
style SNS fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440
Flow:
- User calls API Gateway → Lambda creates order in DynamoDB
- Lambda publishes to SQS queues (payment, inventory)
- Lambda publishes to SNS topic (order-events)
- Payment Lambda processes from SQS, updates DynamoDB
- Inventory Lambda processes from SQS, updates DynamoDB
- Notification Lambda subscribes to SNS, generates receipt (S3), sends email (SES)
Benefits:
- No servers to manage: Lambda auto-scales
- Cost-effective: Pay per request
- Decoupled: Services communicate via SQS/SNS
- Resilient: Failed SQS messages become visible again for retry; a dead-letter queue (DLQ) captures poison messages after repeated failures
See also: Messaging Systems Guide for detailed SQS/SNS/EventBridge patterns
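The retry and DLQ behavior for the payment queue can be sketched as follows. Queue names, the receive count, and `aws_lambda_function.process_payment` are hypothetical.

```hcl
# Dead-letter queue for messages that repeatedly fail processing
resource "aws_sqs_queue" "payment_dlq" {
  name                      = "payment-dlq"
  message_retention_seconds = 1209600 # 14 days, the SQS maximum
}

resource "aws_sqs_queue" "payment" {
  name                       = "payment-queue"
  visibility_timeout_seconds = 180 # should exceed the consumer Lambda's timeout

  # After 5 failed receives, SQS moves the message to the DLQ
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.payment_dlq.arn
    maxReceiveCount     = 5
  })
}

# Lambda polls the queue and receives messages in batches
resource "aws_lambda_event_source_mapping" "payment" {
  event_source_arn = aws_sqs_queue.payment.arn
  function_name    = aws_lambda_function.process_payment.arn # hypothetical function
  batch_size       = 10
}
```

Because a message can be delivered more than once, the payment handler should be idempotent, for example by keying writes on the order ID.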
7.3 Data Processing Pipeline
Architecture: Real-Time Analytics Pipeline
Use Case: Process clickstream data for real-time dashboards
graph LR
App["Web App"]
Kinesis["Kinesis Data Stream
(Clickstream)"]
Lambda1["Lambda: Transform"]
Firehose["Kinesis Firehose"]
S3["S3: Data Lake
(Parquet)"]
Athena["Athena
(SQL Queries)"]
QuickSight["QuickSight
(Dashboards)"]
Lambda2["Lambda: Real-time Alerts"]
DDB["DynamoDB: Aggregates"]
App --> Kinesis
Kinesis --> Lambda1
Lambda1 --> Firehose
Firehose --> S3
S3 --> Athena
Athena --> QuickSight
Kinesis --> Lambda2
Lambda2 --> DDB
style Kinesis fill:#ff9900,stroke:#232f3e,stroke-width:2px,color:#232f3e
style S3 fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440
Flow:
- Web app sends events to Kinesis Data Stream
- Lambda transforms/enriches events in real-time
- Firehose batches and writes to S3 (Parquet format)
- Athena queries S3 data lake (SQL)
- QuickSight creates dashboards from Athena queries
- Separate Lambda reads stream for real-time alerts, writes aggregates to DynamoDB
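The ingestion side of this pipeline could be sketched as below. The shard count, batch sizes, and the `aws_lambda_function.transform` and `aws_lambda_function.alerts` resources are assumptions for illustration.

```hcl
resource "aws_kinesis_stream" "clickstream" {
  name             = "clickstream"
  shard_count      = 2  # illustrative; size to expected write throughput
  retention_period = 24 # hours

  stream_mode_details {
    stream_mode = "PROVISIONED"
  }
}

# Transform Lambda: enriches events before handing them to Firehose
resource "aws_lambda_event_source_mapping" "transform" {
  event_source_arn  = aws_kinesis_stream.clickstream.arn
  function_name     = aws_lambda_function.transform.arn # hypothetical function
  starting_position = "LATEST"
  batch_size        = 100
}

# Alerting Lambda: a second, independent consumer of the same stream
resource "aws_lambda_event_source_mapping" "alerts" {
  event_source_arn  = aws_kinesis_stream.clickstream.arn
  function_name     = aws_lambda_function.alerts.arn # hypothetical function
  starting_position = "LATEST"
  batch_size        = 100
}
```

Both consumers share each shard's read throughput in this setup. If the number of consumers grows, enhanced fan-out gives each one dedicated throughput at extra cost.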
Interview Tips
Key Topics to Master:
- Multi-account strategy: Why use multiple accounts, SCPs, cross-account access
- VPC design: Public/private subnets, NAT Gateway, VPC Peering vs Transit Gateway
- Security layers: Security Groups (stateful) vs NACLs (stateless)
- Compute trade-offs: EC2 vs Lambda vs ECS vs EKS - when to use each
- Database selection: RDS vs Aurora vs DynamoDB decision tree
- Storage classes: S3 Standard → IA → Glacier based on access patterns
- IAM best practices: Roles over users, least privilege, service roles
- Encryption: KMS for data-at-rest, TLS for data-in-transit
- Monitoring: GuardDuty, Security Hub, CloudTrail, Config
- Architecture patterns: Multi-tier, serverless, data pipelines
Common Interview Questions:
- Design a highly available, scalable web application on AWS
- How would you secure access to a database in a private subnet?
- Explain the difference between Security Groups and NACLs
- When would you choose DynamoDB over RDS?
- How do you enable cross-account access securely?
- Design a disaster recovery strategy for a critical application
- How would you handle a Lambda function that needs to access resources in a VPC?
- Explain the difference between VPC Peering and Transit Gateway
- How do you implement defense-in-depth security in AWS?
- Design a data pipeline for processing millions of events per day