LSM (Log-Structured Merge) trees are disk-based data structures optimized for write-heavy workloads. They achieve fast writes by batching changes in memory and writing them to disk sequentially, avoiding expensive random disk I/O.
Used in: Cassandra, RocksDB, LevelDB, HBase, BigTable, ScyllaDB
| Component | Purpose | Location | Mutable? |
|---|---|---|---|
| MemTable | In-memory buffer for recent writes | RAM | Yes |
| WAL (Write-Ahead Log) | Crash recovery: every write is logged before reaching the memtable | Disk | Append-only |
| SSTable | Sorted String Table - immutable sorted data | Disk | No (immutable) |
| Bloom Filter | Probabilistic filter to avoid disk reads | Memory/Disk | No |
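The WAL row above can be sketched as a minimal append-only log; the JSON-lines record format and the `WriteAheadLog` name are illustrative choices, not any real database's API:

```python
import json
import os

class WriteAheadLog:
    """Append-only log written *before* the memtable is updated,
    so acknowledged writes survive a crash."""

    def __init__(self, path: str):
        self.path = path
        self.file = open(path, "a")

    def append(self, key: str, value) -> None:
        # One JSON record per line; fsync before acknowledging the write
        self.file.write(json.dumps({"k": key, "v": value}) + "\n")
        self.file.flush()
        os.fsync(self.file.fileno())

    def replay(self) -> dict:
        """Rebuild memtable contents after a restart by replaying the log."""
        memtable = {}
        with open(self.path) as f:
            for line in f:
                record = json.loads(line)
                memtable[record["k"]] = record["v"]
        return memtable
```

On restart, `replay()` re-applies records in order, so the last value written for each key wins, exactly as it would have in the memtable.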
Problem: a read may need to check multiple SSTables to find a key, so compaction periodically merges them to keep that number bounded.
Solutions:
| Strategy | How It Works | Used In | Trade-offs |
|---|---|---|---|
| Size-Tiered | Merge SSTables of similar size | Cassandra (default) | Higher space amplification, good for write-heavy |
| Leveled | Fixed levels with non-overlapping ranges | RocksDB, LevelDB | Lower space amplification, better reads, more compaction I/O |
| Time-Window | Bucket by time, expire old windows | Time-series DBs | Efficient for time-based TTL, simple expiration |
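As a sketch of how size-tiered selection picks what to merge, the bucketing below is loosely modeled on Cassandra's STCS options; `bucket_low`, `bucket_high`, and `min_threshold` are assumed names mirroring those options:

```python
def pick_compaction_bucket(sizes, min_threshold=4, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes (e.g. in MB) into buckets of similar size and
    return the first bucket with enough tables to merge, else None."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            # "Similar" = within [bucket_low, bucket_high] of the bucket average
            if bucket_low * avg <= size <= bucket_high * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    for bucket in buckets:
        if len(bucket) >= min_threshold:
            return bucket
    return None
```

Four tables of roughly 10 MB would be merged together, while a lone 500 MB table is left alone until peers of similar size accumulate.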
| Operation | Time Complexity | Explanation |
|---|---|---|
| Write (Insert/Update) | O(1) amortized | Insert into memtable HashMap. Amortized due to occasional flush. |
| Read (Point Query) | O(k × log n) | Check k SSTables, binary search in each (log n). Bloom filters reduce k. |
| Delete | O(1) | Write tombstone marker to memtable |
| Range Scan | O(k × (log n + m)) | Binary search to start (log n), scan m results across k SSTables |
| Compaction | O(n log n) | Merge sort across multiple SSTables |
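The compaction row can be illustrated with a heap-based k-way merge over sorted runs; treating `None` as a tombstone here is an illustrative convention:

```python
import heapq
from itertools import groupby
from operator import itemgetter

def compact(runs):
    """Merge sorted (key, value) runs, ordered oldest to newest.
    The newest value per key wins; None values (tombstones) are dropped.
    O(n log k) for n total entries across k runs."""
    # Tag each entry with -run_index so newer runs sort first on key ties
    tagged = [[(key, -i, value) for key, value in run]
              for i, run in enumerate(runs)]
    merged = heapq.merge(*tagged)
    result = []
    for key, group in groupby(merged, key=itemgetter(0)):
        _, _, value = next(group)  # first entry in the group is the newest
        if value is not None:
            result.append((key, value))
    return result
```

For example, merging `[("a", 1), ("b", 2)]` (older) with `[("a", 9), ("c", None)]` (newer) yields `[("a", 9), ("b", 2)]`: the newer value of `a` wins and the tombstoned `c` disappears.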
Problem: Data written multiple times during compaction
Formula: Bytes written to disk / Bytes written by application
Typical value: 10-50× (depends on compaction strategy)
Example: 1 MB write → 10 MB disk I/O over time due to compaction
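One common back-of-envelope estimate for leveled compaction (an approximation, not a formula from the text) counts one write for the WAL, one for the flush, and up to `fanout` rewrites per deeper level:

```python
def leveled_write_amp_estimate(num_levels: int, fanout: int = 10) -> int:
    """Rough upper bound on write amplification under leveled compaction."""
    wal_write = 1    # every record is logged once in the WAL
    flush_write = 1  # memtable is flushed once to an L0 SSTable
    # Each deeper level may rewrite incoming data up to `fanout` times
    rewrites = fanout * (num_levels - 1)
    return wal_write + flush_write + rewrites

leveled_write_amp_estimate(5)  # 5 levels, fanout 10 -> 42
```

A 5-level tree with fanout 10 gives an estimate of 42×, consistent with the 10-50× range above.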
Problem: Must check multiple SSTables
Formula: Disk reads per query
Without Bloom filter: 5-10 SSTables checked
With Bloom filter: 1-2 SSTables checked (a filter with a ~1% false-positive rate skips nearly every SSTable that does not hold the key)
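A minimal Bloom filter sketch (deriving positions from one SHA-256 digest via double hashing is an implementation choice, not from the text): a `False` answer means the SSTable definitely lacks the key, so its disk read can be skipped entirely.

```python
import hashlib

class BloomFilter:
    """Answers "definitely absent" or "maybe present" for a key."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key: str):
        # Derive num_hashes bit positions from one digest (double hashing)
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key: str) -> bool:
        # False => definitely not present; True => present or false positive
        return all(self.bits[pos] for pos in self._positions(key))
```

An LSM tree keeps one filter per SSTable, populated at SSTable-creation time, and consults it before every point read.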
```python
import bisect
from typing import Any, Dict, List, Optional, Tuple

# Sentinel marking a deleted key. A dedicated object (rather than None)
# lets reads distinguish "deleted here" from "not present at all".
TOMBSTONE = object()


class MemTable:
    """In-memory buffer for recent writes.

    Writes go here first for fast O(1) insertion. When full, the
    memtable is flushed to disk as an immutable SSTable.
    """

    def __init__(self, max_size: int = 100):
        self.data: Dict[str, Any] = {}
        self.max_size = max_size

    def put(self, key: str, value: Any) -> None:
        """Insert or update a key-value pair. Time complexity: O(1)."""
        self.data[key] = value

    def get(self, key: str) -> Optional[Any]:
        """Retrieve a value by key (may return TOMBSTONE). Time complexity: O(1)."""
        return self.data.get(key)

    def delete(self, key: str) -> None:
        """Mark a key as deleted. LSM trees write tombstones instead of
        deleting in place, because older SSTables may still hold the key.
        """
        self.data[key] = TOMBSTONE

    def is_full(self) -> bool:
        """Check whether the memtable has reached capacity."""
        return len(self.data) >= self.max_size

    def get_sorted_items(self) -> List[Tuple[str, Any]]:
        """Return entries sorted by key, ready to become an SSTable."""
        return sorted(self.data.items(), key=lambda kv: kv[0])

    def clear(self) -> None:
        """Reset the buffer after a flush."""
        self.data.clear()


class SSTable:
    """Sorted String Table: an immutable, key-sorted run of entries.

    - Stored on disk in a real implementation
    - Sorted by key for efficient binary search
    - Never modified after creation (immutable)
    """

    def __init__(self, data: List[Tuple[str, Any]], level: int = 0):
        # Sort by key only, so values never need to be comparable
        self.data = sorted(data, key=lambda kv: kv[0])
        self.keys = [item[0] for item in self.data]
        self.level = level

    def get(self, key: str) -> Optional[Any]:
        """Binary search for a key. Time complexity: O(log n)."""
        idx = bisect.bisect_left(self.keys, key)
        if idx < len(self.keys) and self.keys[idx] == key:
            return self.data[idx][1]
        return None

    def scan_range(self, start_key: str, end_key: str) -> List[Tuple[str, Any]]:
        """Scan keys in [start_key, end_key].
        Time complexity: O(log n + m), where m is the number of results.
        """
        start_idx = bisect.bisect_left(self.keys, start_key)
        end_idx = bisect.bisect_right(self.keys, end_key)
        return self.data[start_idx:end_idx]


class LSMTree:
    """Simplified LSM tree implementation.

    Architecture:
    1. Writes go to the MemTable (in memory)
    2. When the MemTable is full, flush it to an L0 SSTable
    3. Background compaction merges SSTables to reduce read amplification
    4. Reads check the MemTable first, then SSTables from newest to oldest
    """

    def __init__(self, memtable_size: int = 100, compaction_threshold: int = 4):
        self.memtable = MemTable(max_size=memtable_size)
        self.immutable_memtable: Optional[MemTable] = None
        self.sstables: List[List[SSTable]] = [[] for _ in range(7)]  # 7 levels
        self.compaction_threshold = compaction_threshold

    def put(self, key: str, value: Any) -> None:
        """Write path: always goes to the memtable first.
        Time complexity: O(1) amortized.
        """
        self.memtable.put(key, value)
        if self.memtable.is_full():
            self._flush_memtable()
            self._maybe_compact()

    def delete(self, key: str) -> None:
        """Delete by writing a tombstone to the memtable. O(1)."""
        self.memtable.delete(key)

    def get(self, key: str) -> Optional[Any]:
        """Read path: memtable first, then SSTables.
        Time complexity: O(k * log n), where k is the number of SSTables checked.
        """
        # 1. Active memtable (most recent data)
        value = self.memtable.get(key)
        if value is not None:
            return None if value is TOMBSTONE else value
        # 2. Immutable memtable, if one is awaiting flush
        if self.immutable_memtable is not None:
            value = self.immutable_memtable.get(key)
            if value is not None:
                return None if value is TOMBSTONE else value
        # 3. SSTables: lower levels hold newer data; within a level,
        #    check the newest table first
        for level_tables in self.sstables:
            for sstable in reversed(level_tables):
                value = sstable.get(key)
                if value is not None:
                    return None if value is TOMBSTONE else value
        return None

    def _flush_memtable(self) -> None:
        """Flush the memtable to a new L0 SSTable."""
        if self.memtable.data:
            sstable = SSTable(self.memtable.get_sorted_items(), level=0)
            self.sstables[0].append(sstable)
            self.memtable.clear()

    def _maybe_compact(self) -> None:
        """Compact any level that has accumulated too many SSTables."""
        for level in range(len(self.sstables) - 1):
            if len(self.sstables[level]) >= self.compaction_threshold:
                self._compact_level(level)

    def _compact_level(self, level: int) -> None:
        """Merge all SSTables at `level` into one table at `level + 1`.
        Keeps the newest value for each key. Tombstones are dropped
        eagerly here for simplicity; a production system only drops a
        tombstone once no older level can still contain the key.
        """
        merged_data: Dict[str, Any] = {}
        for sstable in self.sstables[level]:  # iterated oldest to newest
            for key, value in sstable.data:
                merged_data[key] = value  # newer values overwrite older
        merged_data = {k: v for k, v in merged_data.items() if v is not TOMBSTONE}
        if merged_data:
            new_sstable = SSTable(list(merged_data.items()), level=level + 1)
            self.sstables[level + 1].append(new_sstable)
        self.sstables[level].clear()
```
| Feature | LSM Tree | B-Tree | Hash Table |
|---|---|---|---|
| Write Speed | Very Fast (O(1)) | Moderate (O(log n)) | Very Fast (O(1)) |
| Read Speed | Moderate (O(k log n)) | Fast (O(log n)) | Very Fast (O(1)) |
| Range Queries | Good | Excellent | Poor (not supported) |
| Disk Efficiency | Excellent (sequential I/O) | Moderate (random I/O) | Poor |
| Write Amplification | High (10-50×) | Low (1-2×) | None |
| Best Use Case | Write-heavy workloads, time-series data | Balanced read/write, OLTP databases | In-memory caching |
Cassandra:
Compaction: Size-tiered (default)
Optimized for: High write throughput, distributed writes
Trade-off: Higher space usage for faster writes

RocksDB / LevelDB:
Compaction: Leveled compaction
Optimized for: Better read performance, lower space amplification
Trade-off: More compaction I/O

HBase / BigTable:
Architecture: LSM-based distributed database
Features: Bloom filters, block cache, compression
Use case: Large-scale analytics, wide-column store

ScyllaDB:
Implementation: C++ rewrite of Cassandra
Optimizations: Per-core LSM trees, lock-free data structures
Performance: claimed up to 10× higher throughput than Cassandra in vendor benchmarks