1. Where batch jobs fit in

| Style | Trigger | Response-time goal | Example |
| --- | --- | --- | --- |
| Online service | User click / API call | Milliseconds | "Show me my Instagram feed" |
| Batch job (offline) | Scheduler or manual start | Minutes to days; focus on throughput | "Recalculate all product recommendations every night" |
| Stream job (near-real-time) | Event arrival | Seconds | "Detect fraud within a minute of a card swipe" |

2. Unix philosophy → Big-data philosophy


3. Under the hood of a MapReduce job

  1. Split input into blocks (for example, 128 MB pieces on HDFS).
  2. Map phase – run user code on each block, emit (key, value) pairs.
  3. Shuffle & sort – framework partitions by key hash, sorts, and ships data to reducers.
  4. Reduce phase – user code sees one key at a time with all its values, writes results.

Rule of thumb: mappers prepare data, the shuffle moves data, reducers finish data.
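
A minimal single-process sketch of those four steps (plain Python, no Hadoop; the sample blocks and function names are invented for illustration):

```python
from collections import defaultdict

# 1. "Split": pretend each string is one input block.
blocks = ["to be or not to be", "to do is to be"]

# 2. Map phase: user code turns each record into (key, value) pairs.
def map_fn(block):
    for word in block.split():
        yield word, 1

# 3. Shuffle & sort: partition by key (one partition here) and group
#    all values for the same key together, as the framework would.
grouped = defaultdict(list)
for block in blocks:
    for key, value in map_fn(block):
        grouped[key].append(value)

# 4. Reduce phase: user code sees one key with all of its values at once.
def reduce_fn(key, values):
    return key, sum(values)

results = dict(reduce_fn(k, v) for k, v in sorted(grouped.items()))
print(results)  # {'be': 3, 'do': 1, 'is': 1, 'not': 1, 'or': 1, 'to': 4}
```

In a real cluster the grouping step is the expensive part: it hashes keys to partitions, spills sorted runs to disk, and copies them over the network to the reducers.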


4. Joins and grouping at scale – know three patterns

| Pattern | When to use | Analogy |
| --- | --- | --- |
| Reduce-side sort-merge join | Both inputs are big. Framework sorts by key, sends same keys to one reducer. | Classic relational join pushed into the reduce phase. |
| Broadcast (replicated) hash join | One input fits in RAM. Copy it to every mapper; hash-lookup as you stream the big side. | "Little black book next to you while scanning a huge ledger." |
| Partitioned hash join | Both inputs big but pre-partitioned the same way. Build a hash table per partition. | "Each worker owns a shard of both tables." |

Interview hint: articulate the trade-offs among network traffic, memory, and duplicate work.
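
For the broadcast case, a rough single-process sketch (pure Python; the users/clicks tables and their fields are invented for illustration):

```python
# Small input: copied ("broadcast") to every worker and kept as an
# in-memory hash table keyed by the join key.
users = {1: "alice", 2: "bob"}  # user_id -> name

# Big input: streamed record by record, never held in memory at once.
clicks = [
    {"user_id": 1, "url": "/home"},
    {"user_id": 2, "url": "/cart"},
    {"user_id": 1, "url": "/checkout"},
]

# Map-side join: one hash lookup per record of the big side; the big
# input needs no shuffle, at the cost of replicating the small one.
joined = [
    {"name": users[c["user_id"]], "url": c["url"]}
    for c in clicks
    if c["user_id"] in users
]
print(joined)
# [{'name': 'alice', 'url': '/home'}, {'name': 'bob', 'url': '/cart'},
#  {'name': 'alice', 'url': '/checkout'}]
```

The sketch spends memory (a full copy of the small table on every worker) to avoid the network-heavy shuffle that a reduce-side join would need, which is exactly the trade-off worth spelling out.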