Chapter 4: Encoding and Evolution

Chapter 4 (up to “Modes of Dataflow”) — Encoding & Evolution

Pocket reference you can reread in five minutes before an interview.

Family	Pick it when…	Bytes for sample record*	Schema & evolution model	Good to know
Language-specific serializers (Java `Serializable`, Python `pickle`, Ruby `Marshal`, etc.)	- Purely in-process caching or very short-lived queues	n/a	Object graph is baked into one language; versioning & security are weak	Remote code execution bugs are common; locks you into one tech stack.
Text formats (JSON / XML / CSV)	- Human readability or cross-org hand-offs matter more than perf	JSON: 81 B for sample	Optional, heavyweight schemas (JSON Schema / XSD). Most teams ignore them and hand-roll checks.	Ambiguous numbers (ints vs floats), no native binary strings. (notes.shichao.io)
Binary JSON companions (MessagePack, BSON…)	- You like JSON’s data model but want it smaller/faster	MessagePack: 66 B	No formal schema ⇒ every record repeats field names	Saves little space vs JSON; gains are minor. (notes.shichao.io)
Thrift BinaryProtocol	- You need cross-language RPC with decent speed and simple tooling	59 B	Each field has a tag number; writers may omit unset fields. Forward: old readers skip tags they do not recognize. Backward: new readers can read old data as long as tag numbers never change and every added field is optional or has a default.	Requires a code-gen step; two wire formats: Binary (simple) and Compact (smaller). (notes.shichao.io)
Thrift CompactProtocol	- Same as above, but you really care about payload size	34 B (packs tag + type into one byte, var-ints)	Same tag-based evolution as BinaryProtocol	Thrift itself is used by the Hive metastore and by Cassandra's legacy client API. (notes.shichao.io, Stack Overflow)
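Protocol Buffers v2/v3	- High performance, strong IDE tooling, many languages	33 B	Tag numbers like Thrift. There is no list type, only a `repeated` marker, so an `optional` field can later become `repeated`: new readers of old data see zero or one elements, old readers of new data see only the last element.	No Avro-style union type (`oneof` is the closest equivalent); smaller feature set keeps tooling lightweight. (notes.shichao.io)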
Apache Avro	- Hadoop / Kafka pipelines, dynamic or frequently changing schemas	32 B (smallest)	Writer’s schema is stored once per file, or referenced from each record by a version number or schema ID, never repeated inside every record. Reader resolves differences (added, dropped, or reordered fields) at read time; names, not tag numbers, identify fields, and any field you add or remove needs a default.	No tag numbers → easier auto-generated schemas (e.g., dump DB tables). Works with or without code-gen. (notes.shichao.io)

* Sample record is the book’s {"userName":"Martin",…} example; byte counts come straight from the figures in Chapter 4.
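To make the size gap in the table concrete, here is a minimal sketch that measures the compact JSON encoding and hand-encodes a single integer field the way a tag-based binary format would. The full record is an assumption: only userName appears in these notes, and the other two fields are filled in from memory of the book's example. The helpers `encode_varint` and `field_key` are illustrative names, not any library's API; the key layout (tag number shifted left by three bits, OR-ed with a wire type) follows the Protocol Buffers convention.

```python
import json

# Assumed record: only userName is shown in these notes; favoriteNumber and
# interests are filled in from memory of the book's example.
record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# Compact JSON: no whitespace after separators.
json_bytes = json.dumps(record, separators=(",", ":")).encode("utf-8")


def encode_varint(n: int) -> bytes:
    """Encode a non-negative int in 7-bit groups, lowest group first.

    The top bit of each byte signals whether more bytes follow.
    """
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def field_key(tag: int, wire_type: int) -> bytes:
    """Protobuf-style key: tag number and wire type packed into one varint."""
    return encode_varint((tag << 3) | wire_type)


# favoriteNumber as tag 2 with wire type 0 (varint): key followed by value.
favorite_number = field_key(2, 0) + encode_varint(1337)

print(len(json_bytes))        # 81
print(favorite_number.hex())  # 10b90a
```

The JSON text `"favoriteNumber":1337` costs 21 bytes, while the tagged form costs 3, because the field name is replaced by a tag number and the integer is stored in binary.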

How to choose in one breath

“If humans or third-party partners need to read it, stick with JSON/XML.

Inside one language runtime use the native serializer for quick & dirty caches—but never over the wire.

Need compact, super-fast RPC between micro-services? Thrift or Protobuf give you tag-based schemas and code-gen.

In big-data pipelines where schemas evolve weekly, Avro wins: the writer’s schema rides with the data and names trump numbers.”
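The "names trump numbers" point can be sketched in a few lines. This is a toy model of Avro-style schema resolution, not the Avro library: schemas are plain Python lists, the encoded record is just a list of values in the writer's field order, and `resolve`, `NO_DEFAULT`, the sample values, and the `email` field are all hypothetical names chosen for illustration.

```python
from typing import Any, Dict, List, Tuple

NO_DEFAULT = object()  # sentinel: the reader's field declares no default


def resolve(
    values: List[Any],
    writer_fields: List[str],
    reader_fields: List[Tuple[str, Any]],
) -> Dict[str, Any]:
    """Match fields by name, in the spirit of Avro schema resolution.

    values        -- decoded values, in the writer's field order
    writer_fields -- field names from the writer's schema
    reader_fields -- (name, default) pairs from the reader's schema
    """
    if len(values) != len(writer_fields):
        raise ValueError("value count does not match the writer's schema")
    written = dict(zip(writer_fields, values))

    result = {}
    for name, default in reader_fields:
        if name in written:
            result[name] = written[name]   # present in both schemas
        elif default is not NO_DEFAULT:
            result[name] = default         # reader-only field: use its default
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    # Writer-only fields are never copied, so they are silently dropped.
    return result


writer_fields = ["userName", "favoriteNumber", "interests"]
values = ["Martin", 1337, ["reading", "chess"]]

# The reader reorders fields, drops favoriteNumber, and adds email with a default.
reader_fields = [
    ("interests", NO_DEFAULT),
    ("userName", NO_DEFAULT),
    ("email", None),
]

print(resolve(values, writer_fields, reader_fields))
# {'interests': ['reading', 'chess'], 'userName': 'Martin', 'email': None}
```

Because matching is by name, the reader needs the exact writer's schema to interpret the bytes, which is why the schema has to ride along with the data or be looked up by version.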

Interview talking points & traps

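Why tags matter – Renaming a field in Protobuf/Thrift is safe because encoded data carries tag numbers, never field names; changing or re-using a tag number makes existing data unreadable or silently misread.
Forward vs backward compatibility – Show you know both directions (backward: new code reads old data; forward: old code reads new data) and the rules (e.g., new fields must be optional or have a default in tag-based systems). (notes.shichao.io)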
Avro’s “writer ⇄ reader” resolution – Highlight that Avro can reorder fields, fill in defaults, and ignore unknowns automatically. (notes.shichao.io)
Security – Built-in language serializers can instantiate arbitrary classes while decoding, which often leads to remote code execution; never use them to decode untrusted data.
Byte-size anecdote – Quote the 81 B JSON ➜ 32 B Avro progression to prove you understand the real savings.
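The tag rules in the talking points above can be demonstrated with a toy tag-length-value format. This is a simplified sketch, not the Thrift or Protobuf wire format: each field is one tag byte, one length byte, and a UTF-8 payload. The schema dictionaries `V1` and `V2`, the field names, and the functions `encode` and `decode` are hypothetical.

```python
from typing import Dict, Optional


def encode(fields: Dict[int, str]) -> bytes:
    """Encode {tag: text} as repeated (tag byte, length byte, payload)."""
    out = bytearray()
    for tag, text in fields.items():
        payload = text.encode("utf-8")
        if not 0 < tag < 256 or len(payload) > 255:
            raise ValueError("tag or payload too large for this toy format")
        out += bytes([tag, len(payload)]) + payload
    return bytes(out)


def decode(data: bytes, schema: Dict[int, str]) -> Dict[str, Optional[str]]:
    """Decode with the reader's schema (tag -> field name).

    Every field is optional: absent tags stay None, unknown tags are skipped.
    """
    record: Dict[str, Optional[str]] = {name: None for name in schema.values()}
    pos = 0
    while pos < len(data):
        if pos + 2 > len(data):
            raise ValueError("truncated field header")
        tag, length = data[pos], data[pos + 1]
        pos += 2
        payload = data[pos:pos + length]
        if len(payload) != length:
            raise ValueError("truncated payload")
        pos += length
        if tag in schema:
            record[schema[tag]] = payload.decode("utf-8")
        # Unknown tag: the length byte already let us step over it.
    return record


V1 = {1: "user_name", 2: "email"}              # old code
V2 = {1: "user_name", 2: "email", 3: "phone"}  # new code adds optional tag 3

new_data = encode({1: "Martin", 3: "555-0100", 2: "m@example.com"})
old_data = encode({1: "Martin"})

# Forward compatibility: old code reads new data and skips tag 3.
print(decode(new_data, V1))  # {'user_name': 'Martin', 'email': 'm@example.com'}

# Backward compatibility: new code reads old data; missing fields stay None.
print(decode(old_data, V2))  # {'user_name': 'Martin', 'email': None, 'phone': None}

# Renaming is safe: names live only in the reader's schema, not in the bytes.
print(decode(old_data, {1: "login", 2: "email"}))  # {'login': 'Martin', 'email': None}
```

Re-using tag 3 for a different field would make `decode` hand old phone numbers to the new field without any error, which is the silent corruption the first talking point warns about.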

In the rest of this chapter we will explore some of the most common ways that data flows between processes:

Via databases
Via service calls
Via asynchronous message passing