LSM Tree (Log-Structured Merge Tree)

What is an LSM Tree?

LSM trees are disk-based data structures optimized for write-heavy workloads. They achieve fast writes by batching changes in memory and sequentially writing to disk, avoiding expensive random disk I/O.

Used in: Cassandra, RocksDB, LevelDB, HBase, BigTable, ScyllaDB

LSM Tree Architecture

graph TB
    subgraph Memory["MEMORY"]
        WAL["Write-Ahead Log (WAL)<br/>Durability"]
        MEM["MemTable<br/>Active writes<br/>HashMap: O(1)"]
        IMEM["Immutable MemTable<br/>Being flushed"]
    end
    subgraph Disk["DISK - SORTED STRING TABLES (SSTables)"]
        L0["Level 0<br/>4 SSTables<br/>Overlapping ranges"]
        L1["Level 1<br/>10 SSTables<br/>Non-overlapping"]
        L2["Level 2<br/>100 SSTables<br/>Non-overlapping"]
        L3["Level 3+<br/>1000+ SSTables<br/>Non-overlapping"]
    end
    WRITE["Write(key, value)"] -->|"1. Log"| WAL
    WRITE -->|"2. Insert"| MEM
    MEM -->|"When full"| IMEM
    IMEM -->|"Flush"| L0
    L0 -->|"Compact"| L1
    L1 -->|"Compact"| L2
    L2 -->|"Compact"| L3
    READ["Read(key)"] -.->|"1. Check"| MEM
    READ -.->|"2. Check"| IMEM
    READ -.->|"3. Search"| L0
    READ -.->|"4. Search"| L1
    style Memory fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#2e3440
    style Disk fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#2e3440
    style WRITE fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#2e3440
    style READ fill:#fff9c4,stroke:#f57c00,stroke-width:2px,color:#2e3440

Key Components

| Component | Purpose | Location | Mutable? |
|-----------|---------|----------|----------|
| MemTable | In-memory buffer for recent writes | RAM | Yes |
| WAL (Write-Ahead Log) | Logs every write before the memtable, for crash recovery | Disk | Append-only |
| SSTable | Sorted String Table - immutable sorted data | Disk | No (immutable) |
| Bloom Filter | Probabilistic filter to avoid unnecessary disk reads | Memory/Disk | No |

Write Path Flow

sequenceDiagram
    participant Client
    participant WAL as Write-Ahead Log
    participant MEM as MemTable
    participant SST as SSTable (Disk)
    Client->>+WAL: 1. Append write to log
    Note over WAL: Sequential write<br/>Fast: ~1ms
    WAL-->>-Client: Acknowledged (durable)
    Client->>+MEM: 2. Insert into MemTable
    Note over MEM: In-memory HashMap<br/>O(1) insertion
    MEM-->>-Client: Write complete
    alt MemTable Full (threshold reached)
        MEM->>MEM: Mark as Immutable
        Note over MEM: Freeze current memtable<br/>Create new active memtable
        MEM->>+SST: 3. Flush to SSTable (L0)
        Note over SST: Sort entries<br/>Write sequential blocks<br/>Create index + Bloom filter
        SST-->>-MEM: Flush complete
        MEM->>WAL: 4. Truncate WAL
        Note over WAL: Safe to delete logged entries
    end
    Note over Client,SST: Total Write Time: O(1) amortized
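The durability step in the write path above can be sketched as a minimal write-ahead log: every mutation is appended (and fsynced) to a log file before it touches the memtable, so a crash is recovered by replaying the log. This is a simplified sketch, not any particular database's format; the JSON-lines record layout and class name are illustrative.

```python
import json
import os

class SimpleWAL:
    """Minimal write-ahead log: append-only file of JSON lines, one per write."""

    def __init__(self, path: str):
        self.path = path
        self.file = open(path, "a")

    def append(self, key, value) -> None:
        # Sequential append + fsync: the write is durable on disk
        # before the memtable is touched
        self.file.write(json.dumps({"key": key, "value": value}) + "\n")
        self.file.flush()
        os.fsync(self.file.fileno())

    def replay(self) -> dict:
        # Crash recovery: rebuild the memtable by replaying logged entries
        memtable = {}
        with open(self.path) as f:
            for line in f:
                record = json.loads(line)
                memtable[record["key"]] = record["value"]
        return memtable

    def truncate(self) -> None:
        # After a successful flush to an SSTable, the logged entries
        # are no longer needed (step 4 in the diagram)
        self.file.close()
        open(self.path, "w").close()
        self.file = open(self.path, "a")
```

Note that `append` performs only a sequential write, which is why WAL logging is cheap compared to the random I/O a B-tree update would incur.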

Why Writes are Fast

- The WAL append and the SSTable flush are both sequential disk writes; the write path never performs random I/O
- The memtable insert is an O(1) in-memory hash map update
- Expensive reorganization (sorting, merging) is deferred to background flushes and compaction

Read Path Flow

sequenceDiagram
    participant Client
    participant MEM as MemTable
    participant BF as Bloom Filter
    participant SST0 as SSTable L0
    participant SST1 as SSTable L1
    participant SST2 as SSTable L2
    Client->>+MEM: 1. Check MemTable
    alt Found in MemTable
        MEM-->>-Client: Return value (fastest path)
    else Not in MemTable
        MEM-->>Client: Not found
        loop For each level (L0, L1, L2...)
            Client->>+BF: 2. Check Bloom Filter
            alt Bloom says "definitely not present"
                BF-->>-Client: Skip this SSTable
            else Bloom says "might be present"
                BF-->>Client: Check SSTable
                Client->>+SST0: 3. Binary search in SSTable
                Note over SST0: Read index<br/>Binary search: O(log n)<br/>Read data block
                alt Found
                    SST0-->>-Client: Return value
                else Not found
                    SST0-->>Client: Continue to next level
                end
            end
        end
        Client->>Client: Key not found
    end
    Note over Client,SST2: Read Time: O(k * log n)<br/>k = number of SSTables checked

Read Amplification Challenge

Problem: A single read may need to check several SSTables before the key is found (or proven absent).

Solutions:

- Bloom filters skip SSTables that definitely do not contain the key
- Compaction keeps the number of overlapping SSTables per level small
- Block caches keep hot index and data blocks in memory

Compaction Process

graph TB
    subgraph Before["BEFORE COMPACTION"]
        SST1["SSTable 1<br/>key1=v1, key2=v2, key5=v5"]
        SST2["SSTable 2<br/>key2=v2_new, key3=v3, key7=NULL"]
        SST3["SSTable 3<br/>key1=v1_latest, key4=v4"]
    end
    subgraph Process["COMPACTION PROCESS"]
        MERGE["Merge Sort<br/>1. Read all SSTables<br/>2. Merge sort by key<br/>3. Keep newest value<br/>4. Remove tombstones"]
    end
    subgraph After["AFTER COMPACTION"]
        MERGED["Compacted SSTable<br/>key1=v1_latest<br/>key2=v2_new<br/>key3=v3<br/>key4=v4<br/>key5=v5<br/>(key7 removed - tombstone)"]
    end
    SST1 --> MERGE
    SST2 --> MERGE
    SST3 --> MERGE
    MERGE --> MERGED
    style Before fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#2e3440
    style After fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#2e3440
    style MERGE fill:#fff3e0,stroke:#ef6c00,stroke-width:2px,color:#2e3440
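Because SSTables are already sorted, the merge step above can be sketched as a streaming k-way merge with Python's `heapq.merge`. Ties on the same key are broken by SSTable age, and `None` stands in for a tombstone; the function name and data below are illustrative, not from any real engine.

```python
import heapq

def compact(sstables):
    """Merge sorted SSTables into one, keeping only the newest value per key.

    `sstables` is ordered oldest -> newest; each is a list of (key, value)
    pairs sorted by key. A value of None is a tombstone (deleted key).
    """
    # Tag each entry with -age so that, for equal keys, the entry from
    # the newest SSTable sorts first in the merged stream
    tagged = [
        [(key, -age, value) for key, value in table]
        for age, table in enumerate(sstables)
    ]
    merged = []
    last_key = None
    for key, _, value in heapq.merge(*tagged):
        if key == last_key:
            continue  # an entry from a newer SSTable already won
        last_key = key
        if value is not None:  # drop tombstones
            merged.append((key, value))
    return merged
```

Running this on the three SSTables from the diagram yields exactly the compacted table shown: `key1=v1_latest`, `key2=v2_new`, `key3`, `key4`, `key5`, with `key7` removed by its tombstone.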

Compaction Strategies

| Strategy | How It Works | Used In | Trade-offs |
|----------|--------------|---------|------------|
| Size-Tiered | Merge SSTables of similar size | Cassandra (default) | Higher space amplification; good for write-heavy workloads |
| Leveled | Fixed levels with non-overlapping key ranges | RocksDB, LevelDB | Lower space amplification, better reads, more compaction I/O |
| Time-Window | Bucket by time, expire old windows | Time-series DBs | Efficient for time-based TTL, simple expiration |

Complexity Analysis

| Operation | Time Complexity | Explanation |
|-----------|-----------------|-------------|
| Write (Insert/Update) | O(1) amortized | Insert into memtable hash map; amortized due to occasional flush |
| Read (Point Query) | O(k × log n) | Check k SSTables, binary search in each (log n); Bloom filters reduce k |
| Delete | O(1) | Write a tombstone marker to the memtable |
| Range Scan | O(k × (log n + m)) | Binary search to the start (log n), then scan m results across k SSTables |
| Compaction | O(n log n) | Merge sort across multiple SSTables |

Amplification Factors

Write Amplification

Problem: Data written multiple times during compaction

Formula: Bytes written to disk / Bytes written by application

Typical value: 10-50× (depends on compaction strategy)

Example: 1 MB write → 10 MB disk I/O over time due to compaction
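A rough back-of-the-envelope for leveled compaction, under the common assumption of a size ratio (fanout) of about 10 between levels: every byte is written once to the WAL, once on memtable flush, and roughly `fanout` times for each level it passes through during compaction. The function below is an illustrative heuristic, not a formula from any specific engine.

```python
def leveled_write_amplification(levels: int, fanout: int = 10) -> int:
    """Rough worst-case write amplification for leveled compaction.

    1 write to the WAL + 1 write on flush, plus: on each promotion into
    a level, a key's data overlaps ~`fanout` SSTables there, so it is
    rewritten ~`fanout` times per level it descends through.
    """
    return 1 + 1 + fanout * levels

# A 5-level tree with fanout 10 gives a worst-case bound of 52x,
# consistent with the 10-50x range typically observed in practice.
```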

Read Amplification

Problem: Must check multiple SSTables

Formula: Disk reads per query

Without Bloom filter: 5-10 SSTables checked

With Bloom filter: usually 1-2 SSTables checked (false positives are rare, so almost every SSTable that cannot contain the key is skipped without a disk read)
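A minimal Bloom filter can be sketched as a bit array plus k hash functions (here derived by salting `hashlib.blake2b`; the sizes and class name are illustrative). The key property for the read path: `might_contain` can return a false positive, but never a false negative, so a "no" answer safely skips the SSTable.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key: str):
        # Derive k quasi-independent bit positions by salting one hash
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(key.encode(), salt=str(i).encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key: str) -> bool:
        # False -> key is definitely absent: skip this SSTable entirely
        return all(self.bits[pos] for pos in self._positions(key))
```

In a real engine, one filter is built per SSTable at flush time and consulted before the binary search; sizing the bit array at ~10 bits per key keeps the false-positive rate around 1%.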

Python Implementation

MemTable Class

import bisect
from typing import Any, Dict, List, Optional, Tuple


class MemTable:
    """In-memory buffer for recent writes

    Writes go here first for fast O(1) insertion.
    When full, it's flushed to disk as an immutable SSTable.
    """

    def __init__(self, max_size: int = 100):
        self.data: Dict[str, Any] = {}
        self.max_size = max_size

    def put(self, key: str, value: Any) -> None:
        """Insert or update a key-value pair

        Time Complexity: O(1)
        """
        self.data[key] = value

    def get(self, key: str) -> Optional[Any]:
        """Retrieve a value by key

        Time Complexity: O(1)
        """
        return self.data.get(key)

    def delete(self, key: str) -> None:
        """Mark a key as deleted (tombstone)

        LSM trees write a tombstone instead of deleting in place;
        this simplified version uses None as the tombstone marker.
        """
        self.data[key] = None  # Tombstone marker

    def is_full(self) -> bool:
        """Check if memtable has reached capacity"""
        return len(self.data) >= self.max_size

    def get_sorted_items(self) -> List[Tuple[str, Any]]:
        """Return all entries sorted by key, ready to flush as an SSTable"""
        return sorted(self.data.items())

    def clear(self) -> None:
        """Reset the memtable after a successful flush"""
        self.data = {}

SSTable Class

class SSTable:
    """Sorted String Table - immutable sorted data structure

    - Stored on disk in real implementation
    - Sorted by key for efficient binary search
    - Never modified after creation (immutable)
    """

    def __init__(self, data: List[Tuple[str, Any]], level: int = 0):
        # Store as (key, value) tuples sorted by key only, so that
        # tombstone values (None) are never compared during the sort
        self.data = sorted(data, key=lambda kv: kv[0])
        self.keys = [kv[0] for kv in self.data]
        self.level = level

    def get(self, key: str) -> Optional[Any]:
        """Binary search for key

        Time Complexity: O(log n)
        """
        idx = bisect.bisect_left(self.keys, key)
        if idx < len(self.keys) and self.keys[idx] == key:
            return self.data[idx][1]
        return None

    def scan_range(self, start_key: str, end_key: str) -> List[Tuple[str, Any]]:
        """Scan a range of keys

        Time Complexity: O(log n + k) where k is number of results
        """
        start_idx = bisect.bisect_left(self.keys, start_key)
        end_idx = bisect.bisect_right(self.keys, end_key)
        return self.data[start_idx:end_idx]

LSM Tree Main Class

class LSMTree:
    """Simplified LSM Tree implementation

    Architecture:
    1. Writes go to MemTable (in-memory)
    2. When MemTable is full, flush to L0 SSTable
    3. Background compaction merges SSTables to reduce read amplification
    4. Reads check MemTable first, then SSTables (newest to oldest)
    """

    def __init__(self, memtable_size: int = 100, compaction_threshold: int = 4):
        self.memtable = MemTable(max_size=memtable_size)
        self.immutable_memtable = None
        self.sstables: List[List[SSTable]] = [[] for _ in range(7)]  # 7 levels
        self.compaction_threshold = compaction_threshold

    def put(self, key: str, value: Any) -> None:
        """Write operation - always goes to memtable first

        Time Complexity: O(1) amortized
        """
        self.memtable.put(key, value)

        if self.memtable.is_full():
            self._flush_memtable()
            self._maybe_compact()

    def get(self, key: str) -> Optional[Any]:
        """Read operation - check memtable first, then SSTables

        Because None doubles as the tombstone marker, presence must be
        tested explicitly rather than by inspecting the returned value.

        Time Complexity: O(k * log n) where k is number of SSTables
        """
        # 1. Check active memtable (most recent data, including tombstones)
        if key in self.memtable.data:
            return self.memtable.data[key]  # None here means "deleted"

        # 2. Check immutable memtable if it exists
        if self.immutable_memtable and key in self.immutable_memtable.data:
            return self.immutable_memtable.data[key]

        # 3. Check SSTables level by level, newest first within each level
        for level_tables in self.sstables:
            for sstable in reversed(level_tables):
                idx = bisect.bisect_left(sstable.keys, key)
                if idx < len(sstable.keys) and sstable.keys[idx] == key:
                    # Newest version found; None means a tombstone
                    return sstable.data[idx][1]

        return None

    def _flush_memtable(self) -> None:
        """Flush memtable to L0 SSTable"""
        if len(self.memtable.data) > 0:
            sstable = SSTable(self.memtable.get_sorted_items(), level=0)
            self.sstables[0].append(sstable)
            self.memtable.clear()

    def _maybe_compact(self) -> None:
        """Compact any level that has accumulated too many SSTables"""
        for level in range(len(self.sstables) - 1):
            if len(self.sstables[level]) >= self.compaction_threshold:
                self._compact_level(level)

    def _compact_level(self, level: int) -> None:
        """Compact SSTables at a given level

        Merges all SSTables at this level into the next level,
        keeping only the newest value for each key.
        """
        merged_data = {}
        for sstable in self.sstables[level]:  # oldest to newest
            for key, value in sstable.data:
                merged_data[key] = value  # Newer values overwrite older ones

        # Drop tombstones. Simplification: a real LSM tree only drops a
        # tombstone at the bottommost level; dropping it earlier could let
        # an older value in a deeper level "resurrect" the deleted key.
        merged_data = {k: v for k, v in merged_data.items() if v is not None}

        if merged_data:
            new_sstable = SSTable(list(merged_data.items()), level=level + 1)
            self.sstables[level + 1].append(new_sstable)

        self.sstables[level].clear()
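The subtle part of the implementation above is tombstone handling: a delete must shadow any older value still sitting in a lower level. A condensed, self-contained sketch of the lookup rule, using plain dicts as stand-ins for the memtable and SSTables and an explicit `TOMBSTONE` sentinel (the simplified class above uses None for the same purpose):

```python
TOMBSTONE = object()  # sentinel marking a deleted key

def lookup(key, memtable, sstables):
    """Return the newest value for `key`, honoring tombstones.

    `sstables` is ordered newest -> oldest; the search stops at the
    first table containing the key, so a tombstone in a newer table
    hides any older value beneath it.
    """
    for table in [memtable, *sstables]:
        if key in table:
            value = table[key]
            return None if value is TOMBSTONE else value
    return None

# Hypothetical state: "key1" was overwritten, "key2" was deleted
memtable = {"key1": "v1_new"}
sstables = [{"key2": TOMBSTONE}, {"key1": "v1", "key2": "v2"}]
```

Here `lookup("key1", ...)` returns `"v1_new"`, and `lookup("key2", ...)` returns None even though `"v2"` still exists in the older SSTable: the tombstone wins because it is newer.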

LSM Tree vs Other Data Structures

| Feature | LSM Tree | B-Tree | Hash Table |
|---------|----------|--------|------------|
| Write Speed | Very fast (O(1)) | Moderate (O(log n)) | Very fast (O(1)) |
| Read Speed | Moderate (O(k log n)) | Fast (O(log n)) | Very fast (O(1)) |
| Range Queries | Good | Excellent | Poor (not supported) |
| Disk Efficiency | Excellent (sequential I/O) | Moderate (random I/O) | Poor |
| Write Amplification | High (10-50×) | Low (1-2×) | None |
| Best Use Case | Write-heavy workloads, time-series data | Balanced read/write, OLTP databases | In-memory caching |

Real-World Usage

Apache Cassandra

Compaction: Size-tiered (default)

Optimized for: High write throughput, distributed writes

Trade-off: Higher space usage for faster writes

RocksDB / LevelDB

Compaction: Leveled compaction

Optimized for: Better read performance, lower space amplification

Trade-off: More compaction I/O

HBase / BigTable

Architecture: LSM-based distributed database

Features: Bloom filters, block cache, compression

Use case: Large-scale analytics, wide-column store

ScyllaDB

Implementation: C++ rewrite of Cassandra

Optimizations: Per-core LSM trees, lock-free data structures

Performance: claims up to ~10× higher throughput than Cassandra in vendor benchmarks

Interview Tips

Common Questions