Distributed Systems Comparison Guide

Choosing the right data store is one of the most critical architectural decisions. There's no "best" database—only the best fit for your specific requirements.

For Senior/Distinguished Engineers: You should understand CAP theorem implications, consistency models, scaling characteristics, and operational complexity for each system type.

Quick Decision Matrix

Use Case | Recommended System | Why
Transactional data, ACID required | RDBMS (PostgreSQL, MySQL) | Strong consistency, transactions, data integrity
Flexible schema, rapid development | Document DB (MongoDB, CouchDB) | Schema flexibility, JSON-like documents
High read/write throughput, simple queries | Key-Value Store (Redis, DynamoDB) | Extreme performance, simple data model
Time-series data, analytics | Column-Family (Cassandra, HBase) | Write-optimized, wide columns, compression
Relationship-heavy data | Graph DB (Neo4j, Amazon Neptune) | Native graph traversal, relationship queries
Full-text search | Search Engine (Elasticsearch, Solr) | Inverted indexes, relevance scoring
Event streaming, message queue | Message Broker (Kafka, RabbitMQ) | Decoupling, async processing, replay
Caching, session storage | Cache (Redis, Memcached) | In-memory speed, TTL support

1. Relational Databases (RDBMS)

Overview

Traditional SQL databases with structured schema, ACID transactions, and relational model.

CAP profile: CA (Consistency + Availability) on a single node. Distributed RDBMS deployments are typically CP.

Best For

  • Financial systems (banking, payments)
  • E-commerce (inventory, orders)
  • ERP, CRM systems
  • Any ACID-critical application

✓ Pros

  • ACID transactions
  • Data integrity (constraints, FK)
  • Powerful SQL queries (joins)
  • Mature tooling
  • Strong consistency

✗ Cons

  • Schema rigidity
  • Vertical scaling limits
  • Complex sharding
  • Write throughput limits
  • Horizontal scaling difficult

Python Example - PostgreSQL

from sqlalchemy import create_engine, Column, Integer, String, ForeignKey, Numeric
from sqlalchemy.orm import declarative_base, sessionmaker, relationship

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    email = Column(String, unique=True, nullable=False)
    name = Column(String)
    orders = relationship("Order", back_populates="user")

class Order(Base):
    __tablename__ = 'orders'
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey('users.id'), nullable=False)
    total = Column(Numeric(10, 2))
    user = relationship("User", back_populates="orders")

# Usage
engine = create_engine('postgresql://localhost/mydb')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()

# ACID Transaction
try:
    user = User(email='john@example.com', name='John')
    session.add(user)
    session.flush()  # Get user.id

    order = Order(user_id=user.id, total=99.99)
    session.add(order)

    session.commit()  # Atomic - all or nothing
except Exception as e:
    session.rollback()  # Rollback on error

# Complex join query
from sqlalchemy import func
result = session.query(
    User.name,
    func.count(Order.id).label('order_count'),
    func.sum(Order.total).label('total_spent')
).join(Order).group_by(User.name).all()

Scaling Strategies

Read Replicas

Async replication to read-only replicas. Handles read-heavy loads.

Trade-off: Eventual consistency for reads
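The read/write split can be sketched as a small router that sends writes to the primary and rotates reads across replicas. A minimal sketch; the DSN strings are placeholders, and in practice these would be driver connections or SQLAlchemy engines:

```python
import itertools

class ReadWriteRouter:
    """Route writes to the primary; round-robin reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def for_write(self):
        return self.primary

    def for_read(self):
        # Replica reads may lag the primary: this is where the
        # eventual-consistency trade-off shows up.
        return next(self._replicas)

router = ReadWriteRouter(
    'postgresql://primary/mydb',
    ['postgresql://replica1/mydb', 'postgresql://replica2/mydb'],
)
```

The application then asks the router for a connection per operation, keeping the consistency trade-off explicit at each call site.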

Sharding

Partition data across multiple databases by key (e.g., user_id).

Trade-off: No cross-shard joins, complex management
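Key-based routing can be sketched with a stable hash (hashlib rather than Python's salted built-in `hash()`, so the mapping survives process restarts). The shard count and key format are illustrative:

```python
import hashlib

def shard_for(user_id, num_shards=4):
    """Deterministically map a shard key to a shard index.

    The same key always lands on the same shard, so all of a
    user's rows live in one database.
    """
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

# Every query for this user is routed to a single shard;
# queries spanning users would need scatter-gather.
shard = shard_for('user-123')
```

Note that naive modulo hashing reshuffles most keys when `num_shards` changes; production systems typically use consistent hashing or a lookup table for that reason.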

Connection Pooling

Reuse database connections to reduce overhead.

Trade-off: Limited connections, queuing
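The queuing behavior is the essence of pooling: a bounded set of reusable connections behind a blocking queue. A toy sketch (real pools such as SQLAlchemy's QueuePool add overflow, recycling, and health checks; `connect` here is any connection factory):

```python
import queue

class ConnectionPool:
    """Minimal connection pool: a fixed number of reusable connections."""

    def __init__(self, connect, size=5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(connect())  # pay the connection cost once

    def acquire(self, timeout=1.0):
        # Blocks (queues) when all connections are checked out --
        # the trade-off noted above.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)
```

Callers must release what they acquire; real pools wrap this in a context manager so connections cannot leak.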

Popular RDBMS: PostgreSQL (feature-rich), MySQL (simple, fast), Oracle (enterprise), SQL Server (Microsoft ecosystem)

2. NoSQL - Document Databases

Overview

Store data as JSON-like documents. Flexible schema, horizontal scaling, eventual consistency.

CAP profile: MongoDB is typically CP; CouchDB is AP.

Best For

  • Content management systems
  • User profiles, catalogs
  • Real-time analytics
  • Rapid prototyping

✓ Pros

  • Flexible schema
  • Horizontal scaling (sharding)
  • Fast development
  • JSON-native
  • Good read performance

✗ Cons

  • No joins (embed or reference)
  • Eventual consistency (often)
  • Data duplication
  • Complex transactions
  • Storage overhead

Python Example - MongoDB

from pymongo import MongoClient
from datetime import datetime

client = MongoClient('mongodb://localhost:27017/')
db = client['ecommerce']

# Document structure (flexible schema)
user = {
    '_id': 'user-123',
    'email': 'john@example.com',
    'name': 'John Doe',
    'address': {  # Embedded document
        'street': '123 Main St',
        'city': 'NYC',
        'zip': '10001'
    },
    'orders': [  # Embedded array
        {
            'order_id': 'order-456',
            'total': 99.99,
            'date': datetime.now()
        }
    ],
    'tags': ['premium', 'vip'],
    'created_at': datetime.now()
}

# Insert
db.users.insert_one(user)

# Query with rich expressions
users = db.users.find({
    'tags': 'premium',
    'orders.total': {'$gt': 50}
}).sort('created_at', -1).limit(10)

# Update (atomic operations)
db.users.update_one(
    {'_id': 'user-123'},
    {
        '$push': {'orders': {'order_id': 'order-789', 'total': 49.99}},
        '$set': {'last_updated': datetime.now()}
    }
)

# Aggregation pipeline (powerful queries)
pipeline = [
    {'$unwind': '$orders'},
    {'$group': {
        '_id': '$_id',
        'total_spent': {'$sum': '$orders.total'},
        'order_count': {'$sum': 1}
    }},
    {'$match': {'total_spent': {'$gt': 100}}}
]
big_spenders = db.users.aggregate(pipeline)

Schema Design: Embed for one-to-few relationships; reference for many-to-many. Denormalize where read performance matters more than storage.
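The two shapes side by side, with plain dicts standing in for MongoDB documents:

```python
# Embedded (one-to-few): the address travels with the user document,
# so a single read fetches both -- at the cost of duplication if the
# same address were needed elsewhere.
user_embedded = {
    '_id': 'user-123',
    'name': 'John Doe',
    'addresses': [{'street': '123 Main St', 'city': 'NYC'}],
}

# Referenced (many-to-many): orders live in their own collection and
# point back via user_id; the "join" happens at read time, either
# application-side or with $lookup in an aggregation pipeline.
user = {'_id': 'user-123', 'name': 'John Doe'}
order = {'_id': 'order-456', 'user_id': 'user-123', 'total': 99.99}
```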

3. NoSQL - Key-Value Stores

Overview

Simplest NoSQL model: unique key → value. Extreme performance, horizontal scaling.

AP

Best For

  • Session storage
  • Caching layers
  • User preferences
  • Shopping carts
  • Real-time leaderboards

✓ Pros

  • Extreme performance (in-memory)
  • Simple data model
  • Easy horizontal scaling
  • High availability
  • Low latency (sub-ms)

✗ Cons

  • No complex queries
  • No relationships
  • Eventual consistency
  • Limited data structures
  • No transactions (usually)

Python Example - Redis

import redis
import json
from datetime import timedelta

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

# Simple key-value
r.set('user:123:name', 'John Doe')
r.expire('user:123:name', timedelta(hours=1))  # TTL
name = r.get('user:123:name')

# Store objects as JSON
user = {'id': 123, 'email': 'john@example.com', 'premium': True}
r.set('user:123', json.dumps(user))
user_data = json.loads(r.get('user:123'))

# Hash (field-value pairs)
r.hset('user:123:profile', mapping={
    'name': 'John',
    'email': 'john@example.com',
    'age': '30'
})
email = r.hget('user:123:profile', 'email')
profile = r.hgetall('user:123:profile')

# Lists (queues)
r.lpush('jobs', 'job1', 'job2', 'job3')  # Producer
job = r.rpop('jobs')  # Consumer

# Sets (unique values)
r.sadd('tags:user:123', 'premium', 'vip', 'early-adopter')
tags = r.smembers('tags:user:123')

# Sorted sets (leaderboard)
r.zadd('leaderboard', {'player1': 1000, 'player2': 1500, 'player3': 800})
top_players = r.zrevrange('leaderboard', 0, 9, withscores=True)  # Top 10

# Atomic operations
r.incr('page:views:123')  # Atomic increment
views = r.get('page:views:123')

# Pub/Sub
pubsub = r.pubsub()
pubsub.subscribe('notifications')

# Publisher (different process)
r.publish('notifications', json.dumps({'type': 'new_message', 'user_id': 123}))

# Cache pattern
def get_user_cached(user_id):
    cache_key = f'cache:user:{user_id}'
    cached = r.get(cache_key)

    if cached:
        return json.loads(cached)

    # Cache miss - fetch from database
    user = fetch_from_database(user_id)
    r.setex(cache_key, timedelta(minutes=5), json.dumps(user))
    return user

Popular Options: Redis (in-memory, rich data structures), DynamoDB (AWS-managed, SSD-backed), Riak (distributed, high availability)

4. NoSQL - Column-Family Stores

Overview

Wide-column stores optimized for write-heavy workloads. Data organized by rows with dynamic columns.

CAP profile: AP (with tunable consistency).

Best For

  • Time-series data (IoT, metrics)
  • Event logging
  • High write throughput
  • Analytics on large datasets

Python Example - Cassandra

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from datetime import datetime
import uuid

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

# Create keyspace (database)
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS metrics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.set_keyspace('metrics')

# Create table (wide column family)
session.execute("""
    CREATE TABLE IF NOT EXISTS sensor_data (
        sensor_id UUID,
        timestamp TIMESTAMP,
        temperature DOUBLE,
        humidity DOUBLE,
        location TEXT,
        PRIMARY KEY (sensor_id, timestamp)
    ) WITH CLUSTERING ORDER BY (timestamp DESC)
""")

# Insert (optimized for writes)
sensor_id = uuid.uuid4()
session.execute("""
    INSERT INTO sensor_data (sensor_id, timestamp, temperature, humidity, location)
    VALUES (%s, %s, %s, %s, %s)
""", (sensor_id, datetime.now(), 22.5, 65.0, 'warehouse-1'))

# Query (partition key required for performance)
rows = session.execute("""
    SELECT * FROM sensor_data
    WHERE sensor_id = %s
    AND timestamp >= %s
""", (sensor_id, datetime(2024, 1, 1)))

# Time-series query
latest = session.execute("""
    SELECT * FROM sensor_data
    WHERE sensor_id = %s
    LIMIT 100
""", (sensor_id,))

# Tunable consistency
from cassandra import ConsistencyLevel
statement = SimpleStatement(
    "SELECT * FROM sensor_data WHERE sensor_id = %s",
    consistency_level=ConsistencyLevel.QUORUM  # or ONE, ALL
)
rows = session.execute(statement, (sensor_id,))

Key Concept: The write path (MemTable → SSTable → compaction) follows the LSM-tree design, so writes are fast and sequential. Reads can be slower because a key may live in several SSTables; bloom filters help skip tables that cannot contain it.
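A toy illustration of that write path: writes land in an in-memory MemTable, which is flushed to an immutable sorted run when full; reads check the newest data first. This is a sketch only, omitting the commit log, bloom filters, and compaction that real LSM engines rely on:

```python
class TinyLSM:
    """Toy LSM-tree write path: MemTable buffer, flushed to sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []  # immutable flushed runs, newest last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value  # writes are in-memory, hence fast
        if len(self.memtable) >= self.limit:
            self._flush()

    def _flush(self):
        # Persist the MemTable as an immutable, sorted-by-key run.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        # Reads may consult several structures, newest first --
        # this is why reads can be slower than writes.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None
```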

5. Graph Databases

Overview

Optimized for relationship-heavy data. Nodes, edges, and properties. Fast graph traversal.

Best For

  • Social networks
  • Recommendation engines
  • Fraud detection
  • Knowledge graphs
  • Network topology

Python Example - Neo4j

from neo4j import GraphDatabase

class SocialGraph:
    def __init__(self, uri, user, password):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def close(self):
        self.driver.close()

    def create_user(self, name, email):
        with self.driver.session() as session:
            session.run(
                "CREATE (u:User {name: $name, email: $email})",
                name=name, email=email
            )

    def create_friendship(self, user1, user2):
        with self.driver.session() as session:
            session.run("""
                MATCH (u1:User {name: $user1})
                MATCH (u2:User {name: $user2})
                CREATE (u1)-[:FRIENDS_WITH]->(u2)
            """, user1=user1, user2=user2)

    def find_friends(self, name):
        """Direct friends"""
        with self.driver.session() as session:
            result = session.run("""
                MATCH (u:User {name: $name})-[:FRIENDS_WITH]->(friend)
                RETURN friend.name AS name
            """, name=name)
            return [record["name"] for record in result]

    def find_friends_of_friends(self, name):
        """2nd degree connections"""
        with self.driver.session() as session:
            result = session.run("""
                MATCH (u:User {name: $name})-[:FRIENDS_WITH*2]->(fof)
                WHERE fof <> u
                RETURN DISTINCT fof.name AS name
            """, name=name)
            return [record["name"] for record in result]

    def shortest_path(self, user1, user2):
        """Find shortest connection path"""
        with self.driver.session() as session:
            result = session.run("""
                MATCH path = shortestPath(
                    (u1:User {name: $user1})-[:FRIENDS_WITH*]-(u2:User {name: $user2})
                )
                RETURN [node in nodes(path) | node.name] AS path
            """, user1=user1, user2=user2)
            record = result.single()
            return record["path"] if record else None

# Usage
graph = SocialGraph("bolt://localhost:7687", "neo4j", "password")
graph.create_user("Alice", "alice@example.com")
graph.create_user("Bob", "bob@example.com")
graph.create_friendship("Alice", "Bob")
friends = graph.find_friends("Alice")
path = graph.shortest_path("Alice", "Charlie")

When NOT to use: simple lookups with no relationship traversal, or workloads that need very high write throughput.

6. Message Queues & Event Streaming

Overview

Asynchronous communication between services. Decouple producers from consumers.

Message Queue (RabbitMQ) vs Event Streaming (Kafka)

Aspect | RabbitMQ (Queue) | Kafka (Stream)
Model | Message queue (delete after consume) | Distributed log (persistent, replayable)
Delivery | One consumer per message | Multiple consumers (consumer groups)
Ordering | Per queue | Per partition
Retention | Until consumed | Configurable (days/weeks)
Replay | No | Yes (rewind offset)
Throughput | Good (10K+ msg/sec) | Excellent (millions/sec)
Use Case | Task queues, RPC, work distribution | Event sourcing, stream processing, logs

Python Example - Kafka

from kafka import KafkaProducer, KafkaConsumer
import json

# Producer
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Publish event
event = {
    'event_type': 'user_registered',
    'user_id': 'user-123',
    'timestamp': '2024-01-01T12:00:00Z'
}
producer.send('user-events', event)
producer.flush()

# Consumer
consumer = KafkaConsumer(
    'user-events',
    bootstrap_servers='localhost:9092',
    group_id='email-service',  # Consumer group
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    auto_offset_reset='earliest'  # Start from beginning
)

for message in consumer:
    event = message.value
    print(f"Processing: {event['event_type']}")
    # Send welcome email
    # Update analytics
    # Commit offset automatically (or manually for at-least-once)

Python Example - RabbitMQ

import pika
import json

# Producer
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

channel.queue_declare(queue='tasks', durable=True)

task = {'task_id': 'task-123', 'type': 'send_email', 'to': 'user@example.com'}
channel.basic_publish(
    exchange='',
    routing_key='tasks',
    body=json.dumps(task),
    properties=pika.BasicProperties(delivery_mode=2)  # Persistent
)
connection.close()

# Consumer (Worker)
def callback(ch, method, properties, body):
    task = json.loads(body)
    print(f"Processing task: {task['task_id']}")
    # Do work...
    ch.basic_ack(delivery_tag=method.delivery_tag)  # Acknowledge

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='tasks', durable=True)
channel.basic_qos(prefetch_count=1)  # Fair dispatch
channel.basic_consume(queue='tasks', on_message_callback=callback)
channel.start_consuming()

Decision: Use RabbitMQ for task queues and RPC-style work distribution. Use Kafka for event sourcing, high-throughput streams, and replay requirements.

Complete Comparison Table

System | Data Model | CAP | Consistency | Scaling | Query Capability
PostgreSQL | Relational (tables) | CA/CP | Strong (ACID) | Vertical, read replicas | SQL (powerful joins)
MongoDB | Document (JSON) | CP | Tunable (read/write concerns) | Horizontal (sharding) | Rich queries, aggregation
Redis | Key-Value | CP (cluster) | Weak (async replication) | Horizontal (cluster) | Simple (key lookup)
DynamoDB | Key-Value / Document | AP | Eventual (strong reads opt-in) | Auto (managed) | Limited (key + index)
Cassandra | Wide Column | AP | Tunable | Horizontal (excellent) | CQL (no joins)
Neo4j | Graph | CA | ACID | Vertical primarily | Cypher (graph queries)
Elasticsearch | Document (inverted index) | CP | Near real-time | Horizontal (sharding) | Full-text, aggregations
Kafka | Log / Stream | AP | Eventual | Horizontal (partitions) | N/A (streaming)

Decision Framework

How to Choose?

  1. Need ACID transactions? → RDBMS
  2. Need flexible schema? → Document DB (MongoDB)
  3. Need extreme read/write performance? → Key-Value (Redis, DynamoDB)
  4. Time-series or massive writes? → Column-Family (Cassandra)
  5. Relationship-heavy queries? → Graph DB (Neo4j)
  6. Full-text search? → Elasticsearch
  7. Async communication? → Message Queue (RabbitMQ, Kafka)
  8. Caching? → Redis/Memcached

Follow-up Questions

Polyglot Persistence

Real Systems Use Multiple Databases

Modern architectures combine different databases for different needs:

E-commerce Example
  • PostgreSQL: Orders, payments (ACID)
  • MongoDB: Product catalog (flexible)
  • Redis: Sessions, cart (fast)
  • Elasticsearch: Product search
  • Kafka: Event bus
  • Cassandra: Order history, analytics

Social Network Example
  • Neo4j: Friend graph
  • PostgreSQL: User accounts
  • MongoDB: Posts, comments
  • Redis: News feed cache
  • Cassandra: Activity logs
  • Kafka: Real-time events

Key Point: Don't force one database for everything. Use the right tool for each job.
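In code, polyglot persistence reduces to a service that delegates each step to the store best suited for it. A sketch with injected clients; the `db`, `cache`, and `bus` interfaces here are hypothetical stand-ins for a SQL client, Redis, and a Kafka producer:

```python
class CheckoutService:
    """Each dependency handles what it is best at."""

    def __init__(self, db, cache, bus):
        self.db = db        # RDBMS: ACID source of truth
        self.cache = cache  # Redis: fast, disposable state
        self.bus = bus      # Kafka: async fan-out to other services

    def place_order(self, user_id, total):
        order_id = self.db.insert_order(user_id, total)   # transactional write
        self.cache.delete(f'cart:{user_id}')              # drop the stale cart
        self.bus.publish('order-events',                  # notify consumers
                         {'order_id': order_id, 'user_id': user_id})
        return order_id
```

Because the stores are injected, each can be swapped or faked in tests without touching the business logic.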