Concept | Clear explanation | Why it matters |
---|---|---|
SSTable (Sorted-String Table) | An immutable on-disk file that contains key-value pairs in sorted order. New writes first land in an in-memory memtable; once it reaches a size threshold (typically a few megabytes to a few tens of megabytes) the memtable is flushed as a brand-new SSTable, written sequentially from start to finish. (DEV Community) | Sequential writes are the fastest thing you can do on both spinning disks and SSDs; they avoid random-I/O penalties and keep write throughput close to the device's sequential bandwidth. |
Read path optimizations | Because there may be many SSTables, a lookup might need to probe several files. Two tricks slash that cost: Bloom filters (tiny in-memory bitmaps that tell you “definitely not here” for absent keys) and a sparse index that records every N-th key in each file. (Code Farm) | Shows interviewers you know how LSM engines (RocksDB, Cassandra) keep reads acceptable despite a write-optimised layout. |
Compaction / merging | A background process periodically merges SSTables: it rewrites overlapping segments into one larger file, discards overwritten or deleted keys, and keeps the data sorted by key. (Medium) | Trading extra background I/O for smaller, cleaner on-disk data is what keeps reads fast over time. |
LSM-tree = hierarchy of SSTables | Levels (L0, L1, L2…) each hold SSTables of exponentially increasing size. Writes move (“cascade”) from the top to deeper levels through repeated compaction. (Code Farm) | Explains why write amplification (bytes-written / bytes-ingested) is the main tuning knob for LSM stores. |
Practical mental model: Append-only, then clean up later. Anything that updates in place is slow; LSM says “just append” and relies on compaction to settle the mess in the background. A minimal sketch of that write and read path follows.
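To make the path concrete, here is a minimal, self-contained sketch of a memtable that flushes to immutable, sorted SSTable files with a sparse index. It is an illustration only: names like `MiniLSM` are invented, and a real engine (RocksDB, Cassandra) adds a WAL, Bloom filters, a binary file format, and compaction.

```python
import bisect, json, os, tempfile

class MiniLSM:
    """Toy LSM write/read path: an in-memory memtable flushed to sorted,
    immutable SSTable files (one JSON record per line), each with a sparse
    index of every Nth key. Hypothetical sketch, not any engine's real code."""

    def __init__(self, data_dir, memtable_limit=4, sparse_every=2):
        self.data_dir = data_dir
        self.memtable = {}                    # newest value per key, unsorted
        self.memtable_limit = memtable_limit
        self.sparse_every = sparse_every
        self.sstables = []                    # (path, sparse_index), newest last

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        """Write the memtable sequentially as one sorted, immutable SSTable."""
        path = os.path.join(self.data_dir, f"sstable_{len(self.sstables)}.jsonl")
        sparse = []                           # every Nth key -> byte offset
        with open(path, "wb") as f:
            for i, key in enumerate(sorted(self.memtable)):
                if i % self.sparse_every == 0:
                    sparse.append((key, f.tell()))
                f.write((json.dumps([key, self.memtable[key]]) + "\n").encode())
        self.sstables.append((path, sparse))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:              # freshest data wins
            return self.memtable[key]
        for path, sparse in reversed(self.sstables):   # newest SSTable first
            keys = [k for k, _ in sparse]
            i = bisect.bisect_right(keys, key) - 1     # seek near the key
            if i < 0:
                continue                      # key sorts before this file's range
            with open(path, "rb") as f:
                f.seek(sparse[i][1])
                for line in f:
                    k, v = json.loads(line)
                    if k == key:
                        return v
                    if k > key:               # sorted order: we can stop early
                        break
        return None

if __name__ == "__main__":
    db = MiniLSM(tempfile.mkdtemp())
    for i in range(10):
        db.put(f"user:{i:03d}", {"n": i})     # triggers two sequential flushes
    print(db.get("user:007"))                 # -> {'n': 7}
```

Note what the sketch leaves out: tombstones for deletes, a WAL protecting the unflushed memtable, and compaction of the growing SSTable list.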
Concept | Clear explanation | Why it matters |
---|---|---|
Fixed-size pages (≈ 4 KB) | The database is cut into uniform blocks that match the disk/OS page size. A page can reference child pages, letting you build a balanced tree. (O'Reilly Learning) | |
Lookup | Start at the root page, follow the range pointer that brackets your key, repeat until you land in a leaf page with the exact key/value. Depth stays tiny because the branching factor (children per page) is large—often 100–500. That means just 3–4 page reads fetch any record (see the quick check after this section). (Medium) | |
Update-in-place | Inserts or updates modify the leaf page directly. If a page overflows it is split and parent pointers are updated. This is a random write (read-modify-write), so each change costs at least one extra disk seek. | |
Crash-safety | Because pages mutate, a write-ahead log (WAL) is required: every change is first appended to the log so the db can replay or roll back after a crash. | |
Space & read amplification | B-trees keep only one copy of each key, so disk footprint is compact and reads see at most one version—no need for compaction cycles. | |
Practical mental model: Keep a tiny, shallow tree of pages on disk and mutate pages in place. That’s fantastic for point-lookups and range scans when random I/O is cheap (e.g., enterprise SSDs).
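A quick back-of-envelope check of the “3–4 page reads” claim above, assuming a fan-out of 500 children per page (an assumed figure, not a measurement):

```python
import math

def btree_depth(n_keys, fanout):
    """Approximate page reads to reach a leaf in a B-tree with the given
    fan-out (children per page) holding n_keys."""
    return max(1, math.ceil(math.log(n_keys, fanout)))

# Even a billion keys stay within ~4 page reads at a fan-out of 500.
for n in (1_000_000, 100_000_000, 1_000_000_000):
    print(f"{n:>13,} keys -> {btree_depth(n, 500)} page reads")
```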
Workload trait | Favors LSM / SSTable | Favors B-tree |
---|---|---|
Write volume (random keys) | High write throughput: the log-structured layout turns updates into sequential disk writes, and compaction cleans up afterwards. | Each update = random page I/O; heavy writes risk an I/O bottleneck. |
Read pattern (point look-ups) | Extra read amplification (a key may live in any of several SSTables) is mitigated by Bloom filters (a minimal sketch follows this table), but the latency tail is still higher. | Few page reads, predictable latency. |
Range scans | Data is sorted, but might span several SSTables if compaction hasn’t caught up. | Pages are in key order within one tree—excellent for large scans. |
Disk space | Can have space amplification: multiple obsolete versions linger until compaction. | Single live copy of each key keeps space tight. |
SSD endurance | Sequential writes are SSD-friendly; compaction writes extra data, but still mostly sequential. | Random writes stress SSD write-amp and garbage collection. |
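Because Bloom filters carry much of the read-path argument above, here is a minimal sketch of one, using double hashing over a SHA-256 digest; the bit-array size and hash count are arbitrary illustrative choices, whereas real engines size them from a target false-positive rate.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: answers "definitely not present" with no false
    negatives, at the cost of occasional false positives."""

    def __init__(self, size_bits=8192, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive k bit positions from two halves of one digest (double hashing).
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = BloomFilter()
for k in ("user:001", "user:002", "user:003"):
    bf.add(k)
print(bf.might_contain("user:002"))   # True: the key was added
print(bf.might_contain("user:999"))   # Almost certainly False: skip this SSTable
```

An LSM read consults the filter attached to each candidate SSTable and only opens the files whose filter answers “maybe”, which is what keeps the point-lookup tail tolerable.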
Rule of thumb for interviews
“If my service is log-like—millions of sensor pings per second—I reach for an LSM store (e.g., RocksDB, Cassandra). If my workload is read-heavy OLTP with secondary indexes—think user profiles or orders—I prefer a B-tree engine (Postgres, InnoDB).”
Then add the nuance: “In 2025 many systems offer both layouts (MySQL can run InnoDB or MyRocks; MongoDB’s WiredTiger has B-tree and LSM modes), so I start with the default and switch only after profiling proves a mismatch.”
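To put rough numbers on write amplification, here is a back-of-envelope comparison. It leans on the commonly quoted approximation that levelled compaction rewrites each byte about fan-out times per level, and assumes a B-tree rewrites a whole 4 KB page per small update; every figure here is an illustrative assumption, not a measurement.

```python
import math

def lsm_write_amp(total_data_gb, memtable_mb=64, fanout=10):
    """Rough write amplification for levelled compaction: each byte is
    rewritten about `fanout` times per level it descends through."""
    levels = math.ceil(math.log(total_data_gb * 1024 / memtable_mb, fanout))
    return levels * fanout

def btree_write_amp(page_bytes=4096, record_bytes=100):
    """Rough write amplification for a B-tree: a small update rewrites the
    whole page that holds it (the WAL record is ignored here)."""
    return page_bytes / record_bytes

print(f"LSM, levelled, 1 TB at fan-out 10: ~{lsm_write_amp(1024)}x")
print(f"B-tree, 4 KB pages, 100-byte rows: ~{btree_write_amp():.0f}x")
```

The two totals land in the same ballpark; the practical difference is that the LSM bytes are written in long sequential runs while the B-tree bytes are scattered random page rewrites, which is why the comparison in the next table hinges on the access pattern rather than the raw byte counts.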
Aspect | LSM / SSTable (log-structured) | B-tree (page-oriented) |
---|---|---|
How data lands on disk | New records are appended to an in-memory memtable; when it fills, the whole memtable is flushed once, sequentially, as an immutable SSTable. Updates never touch old files in place. | A change is written twice: (1) appended to the write-ahead log (WAL) for crash-safety, (2) applied in place to the 4 KB page that holds the key. Page splits may add more writes. |
Write amplification | Every compaction pass rewrites data, but the copies are sequential. With sensible “levelled” compaction, the total bytes re-written can still be lower than the random page rewrites of a B-tree—especially on spinning disks and SSDs that prefer sequential traffic. | Must re-write the WAL entry plus the whole page; on SSDs, random 4 KB overwrites eventually force the drive’s flash translation layer to erase and rewrite entire flash blocks, so extra write amplification is baked in. |
Throughput and latency | Sustains very high write throughput (disk bandwidth is used in long sequential bursts). Read median latency is fine, but tail latency (P99, P99.9) rises when a query arrives while a large compaction is hogging I/O. Needs monitoring to make sure compaction can keep up with the ingest rate. | Read and write latency are steadier: only one page is touched per lookup, and there is no background merge competing for I/O. Point-reads are usually faster because each key lives in exactly one place. |
Space & compression | Because obsolete versions are discarded during compaction, there is little internal fragmentation and the sorted files compress very well; the disk footprint is often smaller than that of an equivalent B-tree. | Page splits leave “air pockets” in half-empty pages; long-running databases gradually fragment and need occasional VACUUM / OPTIMIZE passes. |
Transactional locking | A key may exist simultaneously in several SSTables until the next compaction, so fine-grained locking is trickier; many LSM engines therefore go with coarse-grained or optimistic schemes. | A single, unique copy of every key makes it straightforward to attach row- or range-locks for serialisable transactions. |
Operational gotchas | If compaction falls behind (common on very heavy ingest) the number of SSTables grows, reads fan out over more files, and the disk can fill up. Engines rarely self-throttle writes, so alerting is essential. | Biggest risk is page-level corruption if a crash happens between WAL flush and page rewrite, but mature engines have decades of battle-tested recovery code. |
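As a concrete picture of what one compaction pass does, here is a sketch that merges several sorted runs, keeps only the newest version of each key, and drops deletions; the tuple layout and the `TOMBSTONE` marker are invented for illustration.

```python
import heapq

TOMBSTONE = object()   # marker an engine writes when a key is deleted

def compact(runs):
    """Merge sorted SSTable-like runs (newest first) into one sorted run,
    keeping only the newest value per key and dropping tombstoned keys.
    This is what reclaims space and shrinks read fan-out."""
    # Tag each entry with its run's age (0 = newest) so that, when merged by
    # (key, age), the newest version of a key is always seen first.
    tagged_runs = [[((key, age), value) for key, value in run]
                   for age, run in enumerate(runs)]
    merged, last_key = [], None
    for (key, _age), value in heapq.merge(*tagged_runs, key=lambda e: e[0]):
        if key == last_key:           # an older, superseded version: discard
            continue
        last_key = key
        if value is not TOMBSTONE:    # drop deleted keys entirely
            merged.append((key, value))
    return merged

newest = [("a", 3), ("c", TOMBSTONE)]       # most recent flush: a updated, c deleted
older  = [("a", 1), ("b", 2), ("c", 9)]     # an earlier SSTable
print(compact([newest, older]))             # -> [('a', 3), ('b', 2)]
```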
Index type | What it is | When to reach for it | Cost / trade-offs |
---|---|---|---|
Secondary index | An extra lookup structure on a column that is not the primary key (e.g. user_id on an orders table). Keys are not unique, so each entry points to a list of row-IDs or encodes the row-ID in the key itself. Works with either B-tree or LSM. | Needed for joins and predicate filters that touch non-key columns. | Duplicates a (small) copy of the key plus a pointer for every row; every insert/update has to update every secondary index. |
Clustered (primary) index | The table itself is kept in primary-key order; leaf pages store the entire row beside the key. (InnoDB does this automatically.) | Point-reads and range-scans that retrieve the whole row by primary key are a single I/O hop. | Row moves cause page splits; only one clustered index allowed because the row can live in only one physical order. |
Non-clustered + heap file | Each index entry stores only a pointer into an unordered heap of rows on disk. | Multiple indexes without duplicating full rows; updates that don’t change length can overwrite in place efficiently. | A read is at least two hops (index ➜ heap page). |
Covering / “included-columns” index | A non-clustered index that stores some extra columns besides the key, so the query can be satisfied entirely from the index. | Speeds up frequent, read-heavy filters without paying the space of a full clustered copy. | More disk, more write overhead, and the DBA must pick which columns to “include”. |
Concatenated (multi-column) index | Concatenates two or more fields into a compound key, e.g. (last_name, first_name). Sorted order lets you filter by any left-most prefix of the key (see the sketch after this table). | Queries that always constrain the leading column(s) (“find all orders for user X on date Y”). | Useless for predicates on the second column alone; the leading column must also be constrained. |
Multi-dimensional index (R-tree, space-filling curve) | Keeps rectangles (“bounding boxes”) in a tree so you can answer range queries over ≥ 2 numeric dimensions (lat/long, RGB values, time × temperature). | Geospatial, GIS, or any query that must bound two axes at once (e.g. “all cafés inside this map tile”). | More complex to maintain; many mainstream engines support it only through extensions (e.g. PostGIS R-tree, MySQL/SQLite R-tree). |
Full-text / fuzzy index | Breaks documents into terms; stores a term-dictionary and postings lists (think: SSTable of terms ➜ doc-IDs). Often organised as a trie or automaton so it can match prefixes or words within an edit-distance. | Search boxes, logs, product catalogues with typo-tolerance. | A completely separate engine (Lucene, Elasticsearch, Solr) is usually easiest; synchronising it with the primary store is an operational task. |
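To tie the secondary-, concatenated- and covering-index rows together, here is a toy sketch: an unordered heap of rows plus a sorted composite index on (last_name, first_name). It shows why a left-most-prefix lookup can use the index while a filter on first_name alone cannot; the table contents and helper names are invented for illustration.

```python
import bisect

# Toy heap file: row_id -> full row, stored in no particular order.
heap = {
    1: {"last_name": "Adams", "first_name": "Ann", "city": "Oslo"},
    2: {"last_name": "Brown", "first_name": "Bob", "city": "Lima"},
    3: {"last_name": "Brown", "first_name": "Ann", "city": "Kyiv"},
    4: {"last_name": "Cruz",  "first_name": "Zoe", "city": "Rome"},
}

# Concatenated (multi-column) secondary index: sorted entries of
# ((last_name, first_name), row_id) pointing back into the heap.
index = sorted(((row["last_name"], row["first_name"]), rid)
               for rid, row in heap.items())
keys = [entry[0] for entry in index]

def scan_prefix(last_name, first_name=None):
    """Range-scan the composite index by its left-most prefix."""
    if first_name is None:
        lo, hi = (last_name,), (last_name + "\uffff",)
    else:
        lo, hi = (last_name, first_name), (last_name, first_name + "\uffff")
    start, end = bisect.bisect_left(keys, lo), bisect.bisect_right(keys, hi)
    # Two hops per hit: index entry -> row_id -> heap lookup.
    return [heap[rid] for _key, rid in index[start:end]]

print(scan_prefix("Brown"))            # leading column only: index range scan
print(scan_prefix("Brown", "Ann"))     # full composite key: index point lookup
# A predicate on first_name alone cannot use this index; it degenerates
# into a full scan of the heap:
print([row for row in heap.values() if row["first_name"] == "Ann"])
```

If city were also stored in each index entry, the heap hop could be skipped for queries that only need city, which is exactly what the covering-index row above describes.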