Concept | Clear explanation | Why it matters |
---|---|---|
SSTable (Sorted-String Table) | An immutable on-disk file that contains key-value pairs in sorted order. New writes first land in an in-memory memtable; once it reaches a size threshold (typically a few megabytes to a few tens of megabytes) the memtable is flushed as a brand-new SSTable, written sequentially from start to finish. (DEV Community) | Sequential writes are the fastest thing you can do on both spinning disks and SSDs; they avoid random-I/O penalties and keep write throughput close to the device's sequential bandwidth. |
Read path optimizations | Because there may be many SSTables, a lookup might need to probe several files. Two tricks slash that cost: Bloom filters (tiny in-memory bitmaps that tell you “definitely not here” for absent keys) and a sparse index that records every N-th key in each file. (Code Farm) | Shows interviewers you know how LSM engines (RocksDB, Cassandra) keep reads acceptable despite a write-optimised layout. |
Compaction / merging | A background process periodically merges SSTables: it rewrites overlapping segments into one larger file, discards overwritten or deleted keys, and keeps the data sorted by key. (Medium) | Trading extra background I/O for smaller, cleaner on-disk data is what keeps reads fast over time. |
LSM-tree = hierarchy of SSTables | Levels (L0, L1, L2…) each hold SSTables of exponentially increasing size. Writes move (“cascade”) from the top to deeper levels through repeated compaction. (Code Farm) | Explains why write amplification (bytes-written / bytes-ingested) is the main tuning knob for LSM stores. |
Practical mental model: Append-only, then clean up later. Anything that updates in place is slow; LSM says “just append” and relies on compaction to settle the mess in the background. A minimal sketch of that write and read path follows.
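To make the path concrete, here is a minimal, self-contained sketch of a memtable that flushes to immutable, sorted SSTable files with a sparse index. It is an illustration only: names like `MiniLSM` are invented, and a real engine (RocksDB, Cassandra) adds a WAL, Bloom filters, a binary file format, and compaction.

```python
import bisect, json, os, tempfile

class MiniLSM:
    """Toy LSM write/read path: an in-memory memtable flushed to sorted,
    immutable SSTable files (one JSON record per line), each with a sparse
    index of every Nth key. Hypothetical sketch, not any engine's real code."""

    def __init__(self, data_dir, memtable_limit=4, sparse_every=2):
        self.data_dir = data_dir
        self.memtable = {}                    # newest value per key, unsorted
        self.memtable_limit = memtable_limit
        self.sparse_every = sparse_every
        self.sstables = []                    # (path, sparse_index), newest last

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        """Write the memtable sequentially as one sorted, immutable SSTable."""
        path = os.path.join(self.data_dir, f"sstable_{len(self.sstables)}.jsonl")
        sparse = []                           # every Nth key -> byte offset
        with open(path, "wb") as f:
            for i, key in enumerate(sorted(self.memtable)):
                if i % self.sparse_every == 0:
                    sparse.append((key, f.tell()))
                f.write((json.dumps([key, self.memtable[key]]) + "\n").encode())
        self.sstables.append((path, sparse))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:              # freshest data wins
            return self.memtable[key]
        for path, sparse in reversed(self.sstables):   # newest SSTable first
            keys = [k for k, _ in sparse]
            i = bisect.bisect_right(keys, key) - 1     # seek near the key
            if i < 0:
                continue                      # key sorts before this file's range
            with open(path, "rb") as f:
                f.seek(sparse[i][1])
                for line in f:
                    k, v = json.loads(line)
                    if k == key:
                        return v
                    if k > key:               # sorted order: we can stop early
                        break
        return None

if __name__ == "__main__":
    db = MiniLSM(tempfile.mkdtemp())
    for i in range(10):
        db.put(f"user:{i:03d}", {"n": i})     # triggers two sequential flushes
    print(db.get("user:007"))                 # -> {'n': 7}
```

Note what the sketch leaves out: tombstones for deletes, a WAL protecting the unflushed memtable, and compaction of the growing SSTable list.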
Concept | Clear explanation | Why it matters |
---|---|---|
Fixed-size pages (≈ 4 KB) | The database is cut into uniform blocks that match the disk/OS page size. A page can reference child pages, letting you build a balanced tree. (O'Reilly Learning) | |
Lookup | Start at the root page, follow the range pointer that brackets your key, repeat until you land in a leaf page with the exact key/value. Depth stays tiny because the branching factor (children per page) is large—often 100–500. That means just 3–4 page reads fetch any record (see the quick check after this section). (Medium) | |
Update-in-place | Inserts or updates modify the leaf page directly. If a page overflows it is split and parent pointers are updated. This is a random write (read-modify-write), so each change costs at least one extra disk seek. | |
Crash-safety | Because pages mutate, a write-ahead log (WAL) is required: every change is first appended to the log so the db can replay or roll back after a crash. | |
Space & read amplification | B-trees keep only one copy of each key, so disk footprint is compact and reads see at most one version—no need for compaction cycles. | |
Practical mental model: Keep a tiny, shallow tree of pages on disk and mutate pages in place. That’s fantastic for point-lookups and range scans when random I/O is cheap (e.g., enterprise SSDs).
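A quick back-of-envelope check of the “3–4 page reads” claim above, assuming a fan-out of 500 children per page (an assumed figure, not a measurement):

```python
import math

def btree_depth(n_keys, fanout):
    """Approximate page reads to reach a leaf in a B-tree with the given
    fan-out (children per page) holding n_keys."""
    return max(1, math.ceil(math.log(n_keys, fanout)))

# Even a billion keys stay within ~4 page reads at a fan-out of 500.
for n in (1_000_000, 100_000_000, 1_000_000_000):
    print(f"{n:>13,} keys -> {btree_depth(n, 500)} page reads")
```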
Workload trait | Favors LSM / SSTable | Favors B-tree |
---|---|---|
Write volume (random keys) | High write throughput: the log-structured layout turns updates into sequential disk writes, and compaction cleans up afterwards. | Each update = random page I/O; heavy writes risk an I/O bottleneck. |
Read pattern (point look-ups) | Extra read amplification (a key may live in any of several SSTables) is mitigated by Bloom filters (a minimal sketch follows this table), but the latency tail is still higher. | Few page reads, predictable latency. |
Range scans | Data is sorted, but might span several SSTables if compaction hasn’t caught up. | Pages are in key order within one tree—excellent for large scans. |
Disk space | Can have space amplification: multiple obsolete versions linger until compaction. | Single live copy of each key keeps space tight. |
SSD endurance | Sequential writes are SSD-friendly; compaction writes extra data, but still mostly sequential. | Random writes stress SSD write-amp and garbage collection. |
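Because Bloom filters carry much of the read-path argument above, here is a minimal sketch of one, using double hashing over a SHA-256 digest; the bit-array size and hash count are arbitrary illustrative choices, whereas real engines size them from a target false-positive rate.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: answers "definitely not present" with no false
    negatives, at the cost of occasional false positives."""

    def __init__(self, size_bits=8192, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive k bit positions from two halves of one digest (double hashing).
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = BloomFilter()
for k in ("user:001", "user:002", "user:003"):
    bf.add(k)
print(bf.might_contain("user:002"))   # True: the key was added
print(bf.might_contain("user:999"))   # Almost certainly False: skip this SSTable
```

An LSM read consults the filter attached to each candidate SSTable and only opens the files whose filter answers “maybe”, which is what keeps the point-lookup tail tolerable.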
Rule of thumb for interviews
“If my service is log-like—millions of sensor pings per second—I reach for an LSM store (e.g., RocksDB, Cassandra). If my workload is read-heavy OLTP with secondary indexes—think user profiles or orders—I prefer a B-tree engine (Postgres, InnoDB).”
Then add the nuance: “In 2025 many systems offer both layouts (MySQL can run InnoDB or MyRocks; MongoDB’s WiredTiger has B-tree and LSM modes), so I start with the default and switch only after profiling proves a mismatch.”
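To put rough numbers on write amplification, here is a back-of-envelope comparison. It leans on the commonly quoted approximation that levelled compaction rewrites each byte about fan-out times per level, and assumes a B-tree rewrites a whole 4 KB page per small update; every figure here is an illustrative assumption, not a measurement.

```python
import math

def lsm_write_amp(total_data_gb, memtable_mb=64, fanout=10):
    """Rough write amplification for levelled compaction: each byte is
    rewritten about `fanout` times per level it descends through."""
    levels = math.ceil(math.log(total_data_gb * 1024 / memtable_mb, fanout))
    return levels * fanout

def btree_write_amp(page_bytes=4096, record_bytes=100):
    """Rough write amplification for a B-tree: a small update rewrites the
    whole page that holds it (the WAL record is ignored here)."""
    return page_bytes / record_bytes

print(f"LSM, levelled, 1 TB at fan-out 10: ~{lsm_write_amp(1024)}x")
print(f"B-tree, 4 KB pages, 100-byte rows: ~{btree_write_amp():.0f}x")
```

The two totals land in the same ballpark; the practical difference is that the LSM bytes are written in long sequential runs while the B-tree bytes are scattered random page rewrites, which is why the comparison in the next table hinges on the access pattern rather than the raw byte counts.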
Aspect | LSM / SSTable (log-structured) | B-tree (page-oriented) |
---|---|---|
How data lands on disk | New records are appended to an in-memory memtable; when it fills, the whole memtable is flushed once, sequentially, as an immutable SSTable. Updates never touch old files in place. | A change is written twice: (1) appended to the write-ahead log (WAL) for crash-safety, (2) applied in place to the 4 KB page that holds the key. Page splits may add more writes. |
Write amplification | Every compaction pass rewrites data, but the copies are sequential. With sensible “levelled” compaction, the total bytes re-written can still be lower than the random page rewrites of a B-tree—especially on spinning disks and SSDs that prefer sequential traffic. | Must re-write the WAL entry plus the whole page; on SSDs, random 4 KB overwrites eventually force the drive’s flash translation layer to erase and rewrite entire flash blocks, so extra write amplification is baked in. |
Throughput and latency | Sustains very high write throughput (disk bandwidth is used in long sequential bursts). Read median latency is fine, but tail latency (P99, P99.9) rises when a query arrives while a large compaction is hogging I/O. Needs monitoring to make sure compaction can keep up with the ingest rate. | Read and write latency are steadier: only one page is touched per lookup, and there is no background merge competing for I/O. Point-reads are usually faster because each key lives in exactly one place. |
Space & compression | Because obsolete versions are discarded during compaction, there is little internal fragmentation and the sorted files compress very well; the disk footprint is often smaller than that of an equivalent B-tree. | Page splits leave “air pockets” in half-empty pages; long-running databases gradually fragment and need occasional VACUUM / OPTIMIZE passes. |
Transactional locking | A key may exist simultaneously in several SSTables until the next compaction, so fine-grained locking is trickier; many LSM engines therefore go with coarse-grained or optimistic schemes. | A single, unique copy of every key makes it straightforward to attach row- or range-locks for serialisable transactions. |
Operational gotchas | If compaction falls behind (common on very heavy ingest) the number of SSTables grows, reads fan out over more files, and the disk can fill up. Engines rarely self-throttle writes, so alerting is essential. | Biggest risk is page-level corruption if a crash happens between WAL flush and page rewrite, but mature engines have decades of battle-tested recovery code. |
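As a concrete picture of what one compaction pass does, here is a sketch that merges several sorted runs, keeps only the newest version of each key, and drops deletions; the tuple layout and the `TOMBSTONE` marker are invented for illustration.

```python
import heapq

TOMBSTONE = object()   # marker an engine writes when a key is deleted

def compact(runs):
    """Merge sorted SSTable-like runs (newest first) into one sorted run,
    keeping only the newest value per key and dropping tombstoned keys.
    This is what reclaims space and shrinks read fan-out."""
    # Tag each entry with its run's age (0 = newest) so that, when merged by
    # (key, age), the newest version of a key is always seen first.
    tagged_runs = [[((key, age), value) for key, value in run]
                   for age, run in enumerate(runs)]
    merged, last_key = [], None
    for (key, _age), value in heapq.merge(*tagged_runs, key=lambda e: e[0]):
        if key == last_key:           # an older, superseded version: discard
            continue
        last_key = key
        if value is not TOMBSTONE:    # drop deleted keys entirely
            merged.append((key, value))
    return merged

newest = [("a", 3), ("c", TOMBSTONE)]       # most recent flush: a updated, c deleted
older  = [("a", 1), ("b", 2), ("c", 9)]     # an earlier SSTable
print(compact([newest, older]))             # -> [('a', 3), ('b', 2)]
```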
Index type | What it is | When to reach for it | Cost / trade-offs |
---|---|---|---|
Secondary index | An extra lookup structure on a column that is not the primary key (e.g. user_id on an orders table). Keys are not unique, so each entry points to a list of row-IDs or encodes the row-ID in the key itself. Works with either B-tree or LSM. | Needed for joins and predicate filters that touch non-key columns. | Duplicates a (small) copy of the key plus a pointer for every row; every insert/update has to update every secondary index. |
Clustered (primary) index | The table itself is kept in primary-key order; leaf pages store the entire row beside the key. (InnoDB does this automatically.) | Point-reads and range-scans that retrieve the whole row by primary key are a single I/O hop. | Row moves cause page splits; only one clustered index allowed because the row can live in only one physical order. |
Non-clustered + heap file | Each index entry stores only a pointer into an unordered heap of rows on disk. | Multiple indexes without duplicating full rows; updates that don’t change length can overwrite in place efficiently. | A read is at least two hops (index ➜ heap page). |
Covering / “included-columns” index | A non-clustered index that stores some extra columns besides the key, so the query can be satisfied entirely from the index. | Speeds up frequent, read-heavy filters without paying the space of a full clustered copy. | More disk, more write overhead, and the DBA must pick which columns to “include”. |
Concatenated (multi-column) index | Concatenates two or more fields into a compound key, e.g. (last_name, first_name). Sorted order lets you filter by any left-most prefix of the key (see the sketch after this table). | Queries that always constrain the leading column(s) (“find all orders for user X on date Y”). | Useless for predicates on the second column alone; the leading column must also be constrained. |
Multi-dimensional index (R-tree, space-filling curve) | Keeps rectangles (“bounding boxes”) in a tree so you can answer range queries over ≥ 2 numeric dimensions (lat/long, RGB values, time × temperature). | Geospatial, GIS, or any query that must bound two axes at once (e.g. “all cafés inside this map tile”). | More complex to maintain; many mainstream engines support it only through extensions (e.g. PostGIS R-tree, MySQL/SQLite R-tree). |
Full-text / fuzzy index | Breaks documents into terms; stores a term-dictionary and postings lists (think: SSTable of terms ➜ doc-IDs). Often organised as a trie or automaton so it can match prefixes or words within an edit-distance. | Search boxes, logs, product catalogues with typo-tolerance. | A completely separate engine (Lucene, Elasticsearch, Solr) is usually easiest; synchronising it with the primary store is an operational task. |
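To tie the secondary-, concatenated- and covering-index rows together, here is a toy sketch: an unordered heap of rows plus a sorted composite index on (last_name, first_name). It shows why a left-most-prefix lookup can use the index while a filter on first_name alone cannot; the table contents and helper names are invented for illustration.

```python
import bisect

# Toy heap file: row_id -> full row, stored in no particular order.
heap = {
    1: {"last_name": "Adams", "first_name": "Ann", "city": "Oslo"},
    2: {"last_name": "Brown", "first_name": "Bob", "city": "Lima"},
    3: {"last_name": "Brown", "first_name": "Ann", "city": "Kyiv"},
    4: {"last_name": "Cruz",  "first_name": "Zoe", "city": "Rome"},
}

# Concatenated (multi-column) secondary index: sorted entries of
# ((last_name, first_name), row_id) pointing back into the heap.
index = sorted(((row["last_name"], row["first_name"]), rid)
               for rid, row in heap.items())
keys = [entry[0] for entry in index]

def scan_prefix(last_name, first_name=None):
    """Range-scan the composite index by its left-most prefix."""
    if first_name is None:
        lo, hi = (last_name,), (last_name + "\uffff",)
    else:
        lo, hi = (last_name, first_name), (last_name, first_name + "\uffff")
    start, end = bisect.bisect_left(keys, lo), bisect.bisect_right(keys, hi)
    # Two hops per hit: index entry -> row_id -> heap lookup.
    return [heap[rid] for _key, rid in index[start:end]]

print(scan_prefix("Brown"))            # leading column only: index range scan
print(scan_prefix("Brown", "Ann"))     # full composite key: index point lookup
# A predicate on first_name alone cannot use this index; it degenerates
# into a full scan of the heap:
print([row for row in heap.values() if row["first_name"] == "Ann"])
```

If city were also stored in each index entry, the heap hop could be skipped for queries that only need city, which is exactly what the covering-index row above describes.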