A pragmatic, interview-oriented crib sheet for why “distributed” really just means “now you have a thousand new failure modes”.
Theme | Plain-English core idea | Why you keep getting grilled on it |
---|---|---|
Partial failure | Parts of the cluster die, stall or get cut off while others live on. You can’t assume “all or nothing” like on a single box. | Every design doc needs an answer to “how do we cope when one DC link flaps or one VM freezes for 2 min?”
Unreliable network | Packets may be delayed, dropped, duplicated, reordered. Timeouts are guesses, not truths. | “Why did my micro-service return 502 at p99?” → understand queues, congestion & retry storms.
Unreliable clocks | Each node’s quartz drifts; NTP is typically only accurate to tens of ms; clocks can be stepped or slewed (NTP corrections, leap seconds, VM pauses). | Timestamp-based logic (TTL, LWW, JWT expiry, Spanner-style commits) lives or dies by this.
Process pauses & GC | Any thread can stop for milliseconds to minutes (GC, VM migration, paging) → apparent node “death”. | Causes ghost-leaders, expired leases, cascading fail-overs.
Safety vs. liveness | Safety: nothing bad ever happens. Liveness: something good eventually happens. | Lets you state guarantees without hand-waving (“reads never diverge” vs “client eventually hears back”). |
System models | ① Synchronous (bounded delay) ② Partially synchronous (usually OK, but sometimes wild) ③ Asynchronous (no timing assumptions). | Consensus algorithms quote their required model—know which fits real clouds (answer: #2). |
Crash vs. Byzantine faults | Ordinary software assumes nodes simply crash and may later recover; Byzantine ≈ nodes can lie or act maliciously. | Explain why Kafka/Raft ignore Byzantine and why blockchains pay the price to handle it.
Quorums & fencing | Majority voting decides truth; fencing tokens stop “zombie” leaders from corrupting state. | Must mention these to justify safe leader election & distributed locks (fencing sketch right after this table).
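
The fencing idea in ~20 lines of Java (illustrative only: `FencedStorage`, `persist` and the token scheme are made-up names, not any real library). The lock service issues a strictly increasing token with every lease, and the *storage side* rejects anything stamped with an older token, so a zombie leader can't hurt you.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical storage-side fencing: the lock service hands out a strictly
// increasing fencing token with each lease; any write carrying a stale token
// is rejected, so a "zombie" leader that wakes up after its lease expired
// cannot corrupt state.
class FencedStorage {
    private final AtomicLong highestTokenSeen = new AtomicLong(-1);

    /** Accept the write only if its token is at least as new as anything seen so far. */
    boolean write(long fencingToken, String key, String value) {
        while (true) {
            long current = highestTokenSeen.get();
            if (fencingToken < current) {
                return false;                                  // stale token: zombie leader, reject
            }
            if (fencingToken == current
                    || highestTokenSeen.compareAndSet(current, fencingToken)) {
                persist(key, value);                           // token is current or newer: apply
                return true;
            }
            // lost a CAS race against an even newer token; loop and re-check
        }
    }

    private void persist(String key, String value) { /* write to the real store */ }
}
```

In a real store the token check and the write would be atomic (same transaction or conditional write); the crucial point is that the *resource* enforces the check, not the client that believes it still holds the lock.
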
Code smell | Real-world face-palm |
---|---|
`if (System.currentTimeMillis() < leaseExpiry) stillLeader = true;` | GC pause → lease actually expired 30 s ago → two writers, data corrupt. (Safer pattern sketched after this table.)
Last-Write-Wins on clock time | A newer write stamped by a lagging clock is silently discarded in favour of an older value → data loss.
Fixed 5 s timeout | Works in lab; prod spike causes 6 s queues → mass false-failures → cascading restart storm. |
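
For contrast with the first code smell, a sketch of the safer shape (again illustrative, reusing the `FencedStorage` sketch above): deadlines on the monotonic clock plus a safety margin, and the fencing token attached to every write, because a GC pause can invalidate the local check at any moment.

```java
import java.util.concurrent.TimeUnit;

// Illustrative only: the local check is just an optimisation; correctness comes
// from the fencing token being verified by the storage side (FencedStorage above).
class LeaseHolder {
    private final long leaseDeadlineNanos;                       // deadline on the monotonic clock
    private final long safetyMarginNanos = TimeUnit.SECONDS.toNanos(2);
    private final long fencingToken;                             // issued by the lock service with the lease

    LeaseHolder(long leaseDurationMillis, long fencingToken) {
        this.leaseDeadlineNanos = System.nanoTime()
                + TimeUnit.MILLISECONDS.toNanos(leaseDurationMillis);
        this.fencingToken = fencingToken;
    }

    /** "Probably" still leader: monotonic clock minus a margin; a long pause can still bite. */
    boolean probablyStillLeader() {
        return leaseDeadlineNanos - System.nanoTime() > safetyMarginNanos;
    }

    /** Every write carries the token so the store can reject us if a newer leader exists. */
    boolean tryWrite(FencedStorage store, String key, String value) {
        return probablyStillLeader() && store.write(fencingToken, key, value);
    }
}
```
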
- Need only best-effort latency? → retry with back-off + idempotent ops (back-off sketch after this list).
- Need a single writer per shard? → majority lease + fencing token.
- Need global ordering/counters? → Raft/Paxos (partially synchronous, crash-recovery model).
- Users may be malicious? → Byzantine-fault-tolerant protocols (PBFT, blockchains) – $$$.
- Must audit event order to <10 ms? → GPS/PTP clocks + TrueTime-style intervals + commit-wait (toy interval sketch after this list).
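
Sketch for the first row: exponential back-off with full jitter around an idempotent call (hypothetical helper, not any library's API). The jitter is what keeps a fleet of clients from retrying in lock-step and turning one latency spike into a retry storm.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Retry helper for *idempotent* calls only: retrying a non-idempotent write can
// duplicate its effect if the first attempt succeeded but the reply got lost.
final class Retries {
    /** Assumes maxAttempts >= 1 and baseDelayMillis >= 0. */
    static <T> T withBackoff(Callable<T> idempotentCall, int maxAttempts, long baseDelayMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return idempotentCall.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) {
                    break;                                                        // out of attempts: fail, don't sleep
                }
                long cap = baseDelayMillis << Math.min(attempt - 1, 10);          // exponential growth, capped
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));      // full jitter in [0, cap]
            }
        }
        throw last;   // surface the last failure instead of retrying forever
    }
}
```
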
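And for the last row, a toy of the TrueTime/commit-wait trick (nothing like Spanner's real API; the 7 ms uncertainty bound is an arbitrary placeholder): read time as an interval, commit at `latest`, then wait until the interval has moved entirely past that timestamp before acking, so timestamp order matches real-time order.

```java
// Toy TrueTime-style clock: every reading is an interval assumed to contain
// "true" time. EPSILON_MILLIS stands in for the uncertainty you'd get from
// GPS/PTP sync quality; 7 ms is an arbitrary placeholder.
final class TrueTimeish {
    static final long EPSILON_MILLIS = 7;

    record Interval(long earliestMillis, long latestMillis) {}

    static Interval now() {
        long t = System.currentTimeMillis();
        return new Interval(t - EPSILON_MILLIS, t + EPSILON_MILLIS);
    }

    /** Commit-wait: pick ts = now().latest, then block until ts is certainly in the past. */
    static long commitTimestampWithWait() throws InterruptedException {
        long commitTs = now().latestMillis();
        // Wait out the uncertainty: once earliest > commitTs, every correctly synced node
        // agrees the timestamp is in the past, so later commits pick strictly larger ones.
        long waitMillis;
        while ((waitMillis = commitTs - now().earliestMillis() + 1) > 0) {
            Thread.sleep(waitMillis);
        }
        return commitTs;
    }
}
```
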