Chapter 4 (up to “Modes of Dataflow”) — Encoding & Evolution

Pocket reference you can reread in five minutes before an interview.

| Family | Pick it when… | Bytes for sample record* | Schema & evolution model | Good to know |
|---|---|---|---|---|
| Language-specific serializers (Java Serializable, Python pickle, Ruby Marshal, etc.) | Purely in-process caching or very short-lived queues | n/a | Object graph is tied to one language; versioning and security are weak | Decoding can instantiate arbitrary classes, so remote code execution bugs are common; locks you into one tech stack. |
| Text formats (JSON / XML / CSV) | Human readability or cross-org hand-offs matter more than performance | JSON: 81 B | Optional, heavyweight schemas (JSON Schema / XSD). Most teams ignore them and hand-roll checks. | Ambiguous numbers (ints vs floats), no native binary strings. (notes.shichao.io) |
| Binary JSON companions (MessagePack, BSON…) | You like JSON's data model but want it smaller/faster | MessagePack: 66 B | No formal schema ⇒ every record repeats field names | Space savings over JSON are small (66 B vs 81 B here). (notes.shichao.io) |
| Thrift BinaryProtocol | You need cross-language RPC with decent speed and simple tooling | 59 B | Each field has a tag number; writer may omit unset fields. Forward: old readers skip unknown tags. Backward: new readers can read old data as long as fields added later are optional or have defaults. | Requires a code-gen step; two wire formats: Binary (simple) and Compact (smaller). (notes.shichao.io) |
| Thrift CompactProtocol | Same as above, but you really care about payload size | 34 B (packs tag + type into one byte, var-ints) | Same tag-based evolution as BinaryProtocol | Thrift is used in, for example, the Hive metastore and Cassandra's legacy client API. (notes.shichao.io, Stack Overflow) |
| Protocol Buffers v2/v3 | High performance, strong IDE tooling, many languages | 33 B | Tag numbers like Thrift, plus optional / repeated markers: an optional field can become repeated without breaking old code (new readers see a list of zero or one element; old readers see only the last element). | No dedicated list/array type (repeated covers lists); the smaller feature set keeps tooling lightweight. (notes.shichao.io) |
| Apache Avro | Hadoop / Kafka pipelines, dynamic or frequently changing schemas | 32 B (smallest) | Writer's schema travels with the data: once in a file header, or as a version/ID that the reader looks up in a schema registry. Reader resolves differences (added / dropped / re-ordered fields) at read time; names, not numbers, identify fields. Added or removed fields need a default value. | No tag numbers → easier auto-generated schemas (e.g., dump DB tables). Works with or without code-gen. (notes.shichao.io) |
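
The tag-based evolution rules in the Thrift and Protobuf rows can be sketched with a toy tag-length-value encoding. This is an illustration only: the `encode` / `decode` helpers, the schemas, and the one-byte tag and length layout are assumptions made for this example, not the real Thrift or Protobuf wire formats.

```python
# Toy tag-length-value encoding; NOT the real Thrift or Protobuf wire format.
# Schemas here are hypothetical: they map a field name to its tag number.

def encode(record, schema):
    """Encode only the fields that are set, each as tag byte + length byte + UTF-8 payload."""
    out = bytearray()
    for name, tag in schema.items():
        value = record.get(name)
        if value is None:
            continue  # unset fields are simply omitted from the output
        payload = str(value).encode("utf-8")  # toy limit: payload must be < 256 bytes
        out += bytes([tag, len(payload)]) + payload
    return bytes(out)

def decode(data, schema):
    """Decode with the reader's schema: skip unknown tags, leave missing fields as None."""
    names_by_tag = {tag: name for name, tag in schema.items()}
    record = {name: None for name in schema}
    pos = 0
    while pos < len(data):
        tag, length = data[pos], data[pos + 1]
        payload = data[pos + 2 : pos + 2 + length]
        pos += 2 + length  # the length byte lets us step over fields we do not know
        if tag in names_by_tag:
            record[names_by_tag[tag]] = payload.decode("utf-8")
    return record

OLD_SCHEMA = {"user_name": 1}
NEW_SCHEMA = {"user_name": 1, "favorite_color": 2}  # new field gets a new tag

new_bytes = encode({"user_name": "Martin", "favorite_color": "blue"}, NEW_SCHEMA)
old_bytes = encode({"user_name": "Martin"}, OLD_SCHEMA)

forward = decode(new_bytes, OLD_SCHEMA)   # old reader, new data: tag 2 is skipped
backward = decode(old_bytes, NEW_SCHEMA)  # new reader, old data: favorite_color stays None
```

The point to remember is that the field name never appears in the bytes, so renaming a field is safe, while reusing or changing a tag number is not.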

How to choose in one breath

“If humans or third-party partners need to read it, stick with JSON/XML.

Inside one language runtime use the native serializer for quick & dirty caches—but never over the wire.

Need compact, super-fast RPC between micro-services? Thrift or Protobuf give you tag-based schemas and code-gen.

In big-data pipelines where schemas evolve weekly, Avro wins: the writer’s schema rides with the data and names trump numbers.”
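
"Names trump numbers" can be made concrete with a small sketch of Avro-style schema resolution. The `write_record` / `read_record` helpers and both schemas are hypothetical and only mimic the idea; they are not the Avro library API or its binary format.

```python
# Toy illustration of Avro-style schema resolution; not the real Avro library.

def write_record(record, writer_schema):
    """Store values positionally, in writer-schema order, with no names or tags."""
    return [record[name] for name in writer_schema]

def read_record(values, writer_schema, reader_schema):
    """Match fields by name: drop writer-only fields, fill reader-only fields from defaults."""
    written = dict(zip(writer_schema, values))  # the writer's schema says what each position means
    return {name: written.get(name, default) for name, default in reader_schema.items()}

# Writer's schema: ordered field names (assumed shape for this sketch).
writer_schema = ["user_name", "favorite_number", "interests"]

# Reader's schema: field name -> default. It re-orders fields,
# drops favorite_number, and adds photo_url with a default.
reader_schema = {"interests": [], "user_name": "", "photo_url": None}

values = write_record(
    {"user_name": "Martin", "favorite_number": 1337, "interests": ["hacking"]},
    writer_schema,
)
resolved = read_record(values, writer_schema, reader_schema)
```

Because the encoded values carry no field identifiers at all, the reader cannot decode them without the exact writer's schema, which is why that schema has to travel with the data or be retrievable by version.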


Interview talking points & traps


In the rest of this chapter we will explore some of the most common ways data flows between processes: