A pragmatic, interview-oriented crib sheet for why “distributed” really just means “now you have a thousand new failure modes”.
Theme | Plain-English core idea | Why you keep getting grilled on it |
---|---|---|
Partial failure | Parts of the cluster die, stall or get cut off while others live on. You can’t assume “all or nothing” like on a single box. | Every design doc needs an answer to “how do we cope when one DC link flaps or one VM freezes for 2 min?”
Unreliable network | Packets may be delayed, dropped, duplicated, reordered. Timeouts are guesses, not truths. | “Why did my micro-service return 502 at p99?” → understand queues, congestion & retry storms.
Unreliable clocks | Each node’s quartz drifts; NTP is typically only accurate to tens of ms; clocks can be stepped or slewed (NTP corrections, leap seconds, VM pauses). | Timestamp-based logic (TTL, LWW, JWT expiry, Spanner-style commits) lives or dies by this.
Process pauses & GC | Any thread can stop for milliseconds to minutes (GC, VM migration, paging) → apparent node “death”. | Causes ghost-leaders, expired leases, cascading fail-overs.
Safety vs. liveness | Safety: nothing bad ever happens. Liveness: something good eventually happens. | Lets you state guarantees without hand-waving (“reads never diverge” vs “client eventually hears back”). |
System models | ① Synchronous (bounded delay) ② Partially synchronous (usually OK, but sometimes wild) ③ Asynchronous (no timing assumptions). | Consensus algorithms quote their required model—know which fits real clouds (answer: #2). |
Crash vs. Byzantine faults | Ordinary software assumes nodes simply crash and may later recover; Byzantine ≈ nodes can lie or act maliciously. | Explain why Kafka/Raft ignore Byzantine and why blockchains pay the price to handle it.
Quorums & fencing | Majority voting decides truth; fencing tokens stop “zombie” leaders from corrupting state. | Must mention these to justify safe leader election & distributed locks (fencing sketch right after this table).
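
The fencing idea in ~20 lines of Java (illustrative only: `FencedStorage`, `persist` and the token scheme are made-up names, not any real library). The lock service issues a strictly increasing token with every lease, and the *storage side* rejects anything stamped with an older token, so a zombie leader can't hurt you.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical storage-side fencing: the lock service hands out a strictly
// increasing fencing token with each lease; any write carrying a stale token
// is rejected, so a "zombie" leader that wakes up after its lease expired
// cannot corrupt state.
class FencedStorage {
    private final AtomicLong highestTokenSeen = new AtomicLong(-1);

    /** Accept the write only if its token is at least as new as anything seen so far. */
    boolean write(long fencingToken, String key, String value) {
        while (true) {
            long current = highestTokenSeen.get();
            if (fencingToken < current) {
                return false;                                  // stale token: zombie leader, reject
            }
            if (fencingToken == current
                    || highestTokenSeen.compareAndSet(current, fencingToken)) {
                persist(key, value);                           // token is current or newer: apply
                return true;
            }
            // lost a CAS race against an even newer token; loop and re-check
        }
    }

    private void persist(String key, String value) { /* write to the real store */ }
}
```

In a real store the token check and the write would be atomic (same transaction or conditional write); the crucial point is that the *resource* enforces the check, not the client that believes it still holds the lock.
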
Code smell | Real-world face-palm |
---|---|
`if (System.currentTimeMillis() < leaseExpiry) stillLeader = true;` | GC pause → lease actually expired 30 s ago → two writers, data corrupt. (Safer pattern sketched after this table.)
Last-Write-Wins on clock time | A newer write stamped by a lagging clock is silently discarded in favour of an older value → data loss.
Fixed 5 s timeout | Works in lab; prod spike causes 6 s queues → mass false-failures → cascading restart storm. |
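
For contrast with the first code smell, a sketch of the safer shape (again illustrative, reusing the `FencedStorage` sketch above): deadlines on the monotonic clock plus a safety margin, and the fencing token attached to every write, because a GC pause can invalidate the local check at any moment.

```java
import java.util.concurrent.TimeUnit;

// Illustrative only: the local check is just an optimisation; correctness comes
// from the fencing token being verified by the storage side (FencedStorage above).
class LeaseHolder {
    private final long leaseDeadlineNanos;                       // deadline on the monotonic clock
    private final long safetyMarginNanos = TimeUnit.SECONDS.toNanos(2);
    private final long fencingToken;                             // issued by the lock service with the lease

    LeaseHolder(long leaseDurationMillis, long fencingToken) {
        this.leaseDeadlineNanos = System.nanoTime()
                + TimeUnit.MILLISECONDS.toNanos(leaseDurationMillis);
        this.fencingToken = fencingToken;
    }

    /** "Probably" still leader: monotonic clock minus a margin; a long pause can still bite. */
    boolean probablyStillLeader() {
        return leaseDeadlineNanos - System.nanoTime() > safetyMarginNanos;
    }

    /** Every write carries the token so the store can reject us if a newer leader exists. */
    boolean tryWrite(FencedStorage store, String key, String value) {
        return probablyStillLeader() && store.write(fencingToken, key, value);
    }
}
```
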
- Need only best-effort latency? → retry with back-off + idempotent ops (back-off sketch after this list).
- Need a single writer per shard? → majority lease + fencing token.
- Need global ordering/counters? → Raft/Paxos (partially synchronous, crash-recovery model).
- Users may be malicious? → Byzantine-fault-tolerant protocols (PBFT, blockchains) – $$$.
- Must audit event order to <10 ms? → GPS/PTP clocks + TrueTime-style intervals + commit-wait (toy interval sketch after this list).
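
Sketch for the first row: exponential back-off with full jitter around an idempotent call (hypothetical helper, not any library's API). The jitter is what keeps a fleet of clients from retrying in lock-step and turning one latency spike into a retry storm.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Retry helper for *idempotent* calls only: retrying a non-idempotent write can
// duplicate its effect if the first attempt succeeded but the reply got lost.
final class Retries {
    /** Assumes maxAttempts >= 1 and baseDelayMillis >= 0. */
    static <T> T withBackoff(Callable<T> idempotentCall, int maxAttempts, long baseDelayMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return idempotentCall.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) {
                    break;                                                        // out of attempts: fail, don't sleep
                }
                long cap = baseDelayMillis << Math.min(attempt - 1, 10);          // exponential growth, capped
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));      // full jitter in [0, cap]
            }
        }
        throw last;   // surface the last failure instead of retrying forever
    }
}
```
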
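And for the last row, a toy of the TrueTime/commit-wait trick (nothing like Spanner's real API; the 7 ms uncertainty bound is an arbitrary placeholder): read time as an interval, commit at `latest`, then wait until the interval has moved entirely past that timestamp before acking, so timestamp order matches real-time order.

```java
// Toy TrueTime-style clock: every reading is an interval assumed to contain
// "true" time. EPSILON_MILLIS stands in for the uncertainty you'd get from
// GPS/PTP sync quality; 7 ms is an arbitrary placeholder.
final class TrueTimeish {
    static final long EPSILON_MILLIS = 7;

    record Interval(long earliestMillis, long latestMillis) {}

    static Interval now() {
        long t = System.currentTimeMillis();
        return new Interval(t - EPSILON_MILLIS, t + EPSILON_MILLIS);
    }

    /** Commit-wait: pick ts = now().latest, then block until ts is certainly in the past. */
    static long commitTimestampWithWait() throws InterruptedException {
        long commitTs = now().latestMillis();
        // Wait out the uncertainty: once earliest > commitTs, every correctly synced node
        // agrees the timestamp is in the past, so later commits pick strictly larger ones.
        long waitMillis;
        while ((waitMillis = commitTs - now().earliestMillis() + 1) > 0) {
            Thread.sleep(waitMillis);
        }
        return commitTs;
    }
}
```
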