Pocket reference you can reread in five minutes before an interview.
Family | Pick it when… | Bytes for sample record* | Schema & evolution model | Good to know |
---|---|---|---|---|
Language-specific serializers (Java Serializable, Python pickle, Ruby Marshal, etc.) | - Purely in-process caching or very short-lived queues | n/a | Object graph is baked into one language; versioning & security are weak | Remote code execution bugs are common; locks you into one tech stack. |
Text formats (JSON / XML / CSV) | - Human readability or cross-org hand-offs matter more than perf | JSON: 81 B | Optional, heavyweight schemas (JSON Schema / XSD). Most teams ignore them and hand-roll checks. | Ambiguous numbers (ints vs floats), no native binary strings. (notes.shichao.io) |
Binary JSON companions (MessagePack, BSON…) | - You like JSON’s data model but want it smaller/faster | MessagePack: 66 B | No formal schema ⇒ every record repeats field names | Saves little space vs JSON (66 B vs 81 B on the sample), so the gain may not justify losing human readability. (notes.shichao.io) |
Thrift BinaryProtocol | - You need cross-language RPC with decent speed and simple tooling | 59 B | Each field has a tag number; writer may omit unset fields. Forward: old readers skip unknown tags. Backward: new readers can read old data as long as every field added later is optional or has a default. | Requires a code-gen step; two wire formats: Binary (simple) and Compact (smaller). (notes.shichao.io) |
Thrift CompactProtocol | - Same as above, but you really care about payload size | 34 B (packs field type and tag number into a single byte; uses variable-length integers) | Same tag-based evolution as BinaryProtocol | Most widely used by Hive metastore, Cassandra internal RPC. (notes.shichao.io, Stack Overflow) |
Protocol Buffers v2/v3 | - High performance, strong IDE tooling, many languages | 33 B | Tag numbers like Thrift, plus optional / repeated markers that allow “upgrade single → list” without breaking old code. | No native unions; smaller feature set keeps tooling lightweight. (notes.shichao.io) |
Apache Avro | - Hadoop / Kafka pipelines, dynamic or frequently changing schemas | 32 B (smallest) | Writer’s schema accompanies the data: stored once per file, or referenced by a version number or negotiated per connection. Reader resolves differences (added, dropped, or reordered fields) at read time; names, not numbers, identify fields. | No tag numbers → easier auto-generated schemas (e.g., dump DB tables). Works with or without code-gen. (notes.shichao.io) |

*Sample record is the {"userName":"Martin",…} example; byte counts come straight from the figures in Chapter 4.

“If humans or third-party partners need to read it, stick with JSON/XML.
Inside one language runtime, use the native serializer for quick and dirty caches, but never over the wire.
Need compact, super-fast RPC between micro-services? Thrift or Protobuf give you tag-based schemas and code-gen.
In big-data pipelines where schemas evolve weekly, Avro wins: the writer’s schema rides with the data and names trump numbers.”
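
The two evolution models contrasted above (tag numbers versus field names) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the real Thrift, Protobuf, or Avro wire format: `encode_tagged`, `decode_tagged`, and `resolve_by_name` are hypothetical helpers, the one-byte tag and one-byte length framing is invented for brevity, and the field names and values are made up.

```python
# Toy tag-length-value format (an assumption for illustration only):
# each field is [1-byte tag][1-byte length][UTF-8 payload].
def encode_tagged(fields: dict[int, str]) -> bytes:
    out = bytearray()
    for tag, value in fields.items():
        payload = value.encode("utf-8")
        out += bytes([tag, len(payload)]) + payload  # tags and lengths must be < 256
    return bytes(out)


def decode_tagged(data: bytes, known_tags: dict[int, str]) -> dict[str, str]:
    """Decode with the reader's own tag table, skipping tags it does not know."""
    record = {}
    pos = 0
    while pos < len(data):
        tag, length = data[pos], data[pos + 1]
        payload = data[pos + 2 : pos + 2 + length]
        pos += 2 + length
        name = known_tags.get(tag)
        if name is None:
            continue  # unknown tag: the length tells us how many bytes to skip
        record[name] = payload.decode("utf-8")
    return record


def resolve_by_name(writer_fields, values, reader_fields):
    """Name-based resolution: values are positional in the writer's field order.

    reader_fields maps each field the reader wants to its default value.
    Fields only the writer knows are dropped; fields only the reader knows
    are filled from the default; field order does not matter.
    """
    written = dict(zip(writer_fields, values))
    return {name: written.get(name, default) for name, default in reader_fields.items()}


# Tag-based: a new writer adds tag 2; an old reader only knows tag 1.
new_bytes = encode_tagged({1: "Ada", 2: "ada@example.com"})
old_view = decode_tagged(new_bytes, {1: "user_name"})

# Name-based: the writer's field order differs and it lacks "age".
name_view = resolve_by_name(
    writer_fields=["email", "user_name"],
    values=["ada@example.com", "Ada"],
    reader_fields={"user_name": None, "age": -1},
)

print(old_view)   # {'user_name': 'Ada'}
print(name_view)  # {'user_name': 'Ada', 'age': -1}
```

In the tag-based half, the reader needs no knowledge of the writer's schema because every field carries its own tag and length. In the name-based half, the values carry no per-field metadata at all, which is why the reader must have the writer's field list before it can decode anything.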
In the rest of this chapter we will explore some of the most common ways in which data flows between processes: