
Clocks In Distributed Systems

A pragmatic, interview-oriented crib sheet for why “distributed” really just means “now you have a thousand new failure modes”.

| Theme | Plain-English core idea | Why you keep getting grilled on it |
|---|---|---|
| Partial failure | Parts of the cluster die, stall, or get cut off while others live on; you can't assume "all or nothing" like on a single box. | Every design doc needs an answer to "how do we cope when one DC link flaps or one VM freezes for 2 min?" |
| Unreliable network | Packets may be delayed, dropped, duplicated, or reordered; timeouts are guesses, not truths. | "Why did my micro-service return 502 at p99?" → understand queues, congestion, and retry storms. |
| Unreliable clocks | Each node's quartz drifts; NTP is only accurate to roughly tens of milliseconds; clocks jump (slew, leap seconds, VM pauses). | Timestamp-based logic (TTLs, LWW, JWT expiry, Spanner-style commits) lives or dies by this. |
| Process pauses & GC | Any thread can stop for milliseconds to minutes ➜ apparent node "death". | Causes ghost leaders, expired leases, cascading fail-overs. |
| Safety vs. liveness | Safety: nothing bad ever happens. Liveness: something good eventually happens. | Lets you state guarantees without hand-waving ("reads never diverge" vs. "the client eventually hears back"). |
| System models | ① Synchronous (bounded delay) ② Partially synchronous (usually OK, occasionally wild) ③ Asynchronous (no timing assumptions). | Consensus algorithms quote their required model; know which fits real clouds (answer: #2). |
| Crash vs. Byzantine faults | Ordinary software assumes nodes simply crash or recover; Byzantine means nodes can lie or act maliciously. | Explain why Kafka/Raft ignore Byzantine faults and why blockchains pay the price to handle them. |
| Quorums & fencing | Majority voting decides truth; fencing tokens stop "zombie" leaders from corrupting state. | Must mention these to justify safe leader election and distributed locks (see the fencing sketch below). |
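
The fencing half of that last row is easiest to see from the storage side. Below is a minimal sketch, assuming the lock service issues a strictly increasing token with each lease grant (e.g. a ZooKeeper zxid or an etcd revision); `FencedStore` and its methods are hypothetical names, not a real API.

```java
// Minimal fencing-token sketch: the storage side rejects writes from stale
// lease holders. FencedStore and its methods are illustrative, not a real API.
class FencedStore {
    // Highest fencing token accepted so far. Tokens are assumed to be issued
    // by the lock service in strictly increasing order.
    private long highestToken = -1;
    private String value = "";

    // Accept the write only if its token is at least as new as anything seen.
    synchronized boolean write(long fencingToken, String newValue) {
        if (fencingToken < highestToken) {
            return false;           // zombie leader with an expired lease: rejected
        }
        highestToken = fencingToken;
        value = newValue;
        return true;
    }

    synchronized String read() { return value; }
}

public class FencingDemo {
    public static void main(String[] args) {
        FencedStore store = new FencedStore();
        // Client A held token 33, hit a long GC pause, and wakes up after its
        // lease has expired. Client B meanwhile acquired the lease with token 34.
        System.out.println(store.write(34, "from B"));  // true
        System.out.println(store.write(33, "from A"));  // false: stale token fenced off
        System.out.println(store.read());               // "from B"
    }
}
```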

Classic gotchas & illustrations

| Code smell | Real-world face-palm |
|---|---|
| `if (System.currentTimeMillis() < leaseExpiry) stillLeader = true;` | GC pause right after the check → lease actually expired 30 s ago → two writers, corrupted data (a safer pattern is sketched below). |
| Last-write-wins based on wall-clock time | Node with a slow clock overwrites a newer value; silent data loss. |
| Fixed 5 s timeout | Works in the lab; a prod spike causes 6 s queues → mass false failures → cascading restart storm. |
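
A less fragile version of that first row, as a sketch: measure remaining lease validity with the JVM's monotonic clock (`System.nanoTime()`) and keep a safety margin, instead of comparing the wall clock to an absolute expiry. `LeaseGuard` and its methods are hypothetical, and this only narrows the race window; the storage side still needs fencing tokens, as sketched above.

```java
// Sketch of a less fragile lease check, assuming the lock service returns a
// lease *duration* rather than an absolute wall-clock expiry. Names are illustrative.
class LeaseGuard {
    private final long safetyMarginNanos;
    private long leaseAcquiredNanos;   // monotonic clock reading at acquisition
    private long leaseDurationNanos;

    LeaseGuard(long safetyMarginMillis) {
        this.safetyMarginNanos = safetyMarginMillis * 1_000_000L;
    }

    // Record the grant using System.nanoTime(), which is monotonic and immune
    // to NTP steps, unlike System.currentTimeMillis().
    void onLeaseGranted(long durationMillis) {
        leaseAcquiredNanos = System.nanoTime();
        leaseDurationNanos = durationMillis * 1_000_000L;
    }

    // Re-check immediately before every write, keeping a safety margin so a
    // short pause between the check and the write is still covered. A long GC
    // pause can still strike after the check, hence the fencing tokens above.
    boolean probablyStillLeader() {
        long elapsed = System.nanoTime() - leaseAcquiredNanos;
        return elapsed < leaseDurationNanos - safetyMarginNanos;
    }
}
```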

Sound-bite answers you can drop


Mini decision tree – picking tactics

Need only best-effort latency?  -> retry w/ back-off, idempotent ops (back-off sketch below).
Need single-writer per shard?   -> majority lease + fencing token.
Need global ordering/counters?  -> Raft/Paxos (part-sync, crash-rec).
Users may be malicious?         -> Byzantine-tolerant (PBFT, blockchain) – $$$.
Must audit event order <10 ms?  -> GPS/PTP + TrueTime-style intervals + commit-wait (commit-wait sketch below).
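
For the first branch, a sketch of retry with capped exponential back-off and full jitter; the helper names here are made up, and the operation being retried must be idempotent, otherwise a retry after a lost response duplicates work.

```java
// Sketch of retry with capped exponential back-off and jitter.
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

public class Retry {
    static <T> T withBackoff(Supplier<T> op, int maxAttempts) throws Exception {
        long delayCapMs = 100;                    // initial back-off cap
        for (int attempt = 1; ; attempt++) {
            try {
                return op.get();                  // idempotent call, e.g. a PUT by key
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts) throw e;
                // Full jitter: sleep a random amount up to the current cap, so a
                // fleet of clients doesn't retry in lock-step (retry storms).
                Thread.sleep(ThreadLocalRandom.current().nextLong(delayCapMs + 1));
                delayCapMs = Math.min(delayCapMs * 2, 10_000);   // cap the back-off
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Toy usage: the operation fails twice with a transient error, then succeeds.
        int[] calls = {0};
        String result = withBackoff(() -> {
            if (calls[0]++ < 2) throw new RuntimeException("transient 502");
            return "ok";
        }, 5);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```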
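
For the last branch, a toy illustration of TrueTime-style commit-wait: assign the latest possible timestamp at commit time, then hold the result until the clock's uncertainty interval has definitely passed it, so timestamp order matches real-time order. The fixed `EPSILON_MS` bound is an assumption standing in for what a GPS/PTP-disciplined clock would actually report.

```java
// Toy sketch of TrueTime-style commit-wait. A real implementation would get
// [earliest, latest] from a GPS/PTP-disciplined clock service, not a constant.
public class CommitWaitDemo {
    static final long EPSILON_MS = 5;   // assumed worst-case clock error

    static long earliest() { return System.currentTimeMillis() - EPSILON_MS; }
    static long latest()   { return System.currentTimeMillis() + EPSILON_MS; }

    public static void main(String[] args) throws InterruptedException {
        // 1. Assign the commit timestamp pessimistically: the latest possible "now".
        long commitTs = latest();

        // 2. Commit-wait: block until our own earliest() has passed commitTs,
        //    i.e. every correct clock in the system is now beyond it.
        while (earliest() <= commitTs) {
            Thread.sleep(1);
        }

        // 3. Only now expose the commit; any later transaction anywhere gets a
        //    strictly larger timestamp.
        System.out.println("commit visible at ts=" + commitTs);
    }
}
```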


Quick anti-panic checklist