1. Where batch jobs fit in

| Style | Trigger | Response-time goal | Example |
| --- | --- | --- | --- |
| Online service | User click / API call | Milliseconds | "Show me my Instagram feed" |
| Batch job (offline) | Scheduler or manual start | Minutes to days; focus on throughput | "Recalculate all product recommendations every night" |
| Stream job (near-real-time) | Event arrival | Seconds | "Detect fraud within a minute of a card swipe" |

2. Unix philosophy → Big-data philosophy


3. Under the hood of a MapReduce job

  1. Split input into blocks (for example, 128 MB pieces on HDFS).
  2. Map phase – run user code on each block, emit (key, value) pairs.
  3. Shuffle & sort – framework partitions by key hash, sorts, and ships data to reducers.
  4. Reduce phase – user code sees one key at a time with all its values, writes results.

Rule of thumb: mappers prepare data, the shuffle moves data, reducers finish data.
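
A minimal single-process sketch of those four steps (plain Python, no Hadoop; the sample blocks and function names are invented for illustration):

```python
from collections import defaultdict

# 1. "Split": pretend each string is one input block.
blocks = ["to be or not to be", "to do is to be"]

# 2. Map phase: user code turns each record into (key, value) pairs.
def map_fn(block):
    for word in block.split():
        yield word, 1

# 3. Shuffle & sort: partition by key (one partition here) and group
#    all values for the same key together, as the framework would.
grouped = defaultdict(list)
for block in blocks:
    for key, value in map_fn(block):
        grouped[key].append(value)

# 4. Reduce phase: user code sees one key with all of its values at once.
def reduce_fn(key, values):
    return key, sum(values)

results = dict(reduce_fn(k, v) for k, v in sorted(grouped.items()))
print(results)  # {'be': 3, 'do': 1, 'is': 1, 'not': 1, 'or': 1, 'to': 4}
```

In a real cluster the grouping step is the expensive part: it hashes keys to partitions, spills sorted runs to disk, and copies them over the network to the reducers.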


4. Joins and grouping at scale – know three patterns

| Pattern | When to use | Analogy |
| --- | --- | --- |
| Reduce-side sort-merge join | Both inputs are big. Framework sorts by key, sends same keys to one reducer. | Classic relational join pushed into the reduce phase. |
| Broadcast (replicated) hash join | One input fits in RAM. Copy it to every mapper; hash-lookup as you stream the big side. | "Little black book next to you while scanning a huge ledger." |
| Partitioned hash join | Both inputs big but pre-partitioned the same way. Build a hash table per partition. | "Each worker owns a shard of both tables." |

Interview hint: articulate the trade-offs among network traffic, memory, and duplicate work.
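
For the broadcast case, a rough single-process sketch (pure Python; the users/clicks tables and their fields are invented for illustration):

```python
# Small input: copied ("broadcast") to every worker and kept as an
# in-memory hash table keyed by the join key.
users = {1: "alice", 2: "bob"}  # user_id -> name

# Big input: streamed record by record, never held in memory at once.
clicks = [
    {"user_id": 1, "url": "/home"},
    {"user_id": 2, "url": "/cart"},
    {"user_id": 1, "url": "/checkout"},
]

# Map-side join: one hash lookup per record of the big side; the big
# input needs no shuffle, at the cost of replicating the small one.
joined = [
    {"name": users[c["user_id"]], "url": c["url"]}
    for c in clicks
    if c["user_id"] in users
]
print(joined)
# [{'name': 'alice', 'url': '/home'}, {'name': 'bob', 'url': '/cart'},
#  {'name': 'alice', 'url': '/checkout'}]
```

The sketch spends memory (a full copy of the small table on every worker) to avoid the network-heavy shuffle that a reduce-side join would need, which is exactly the trade-off worth spelling out.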