Chapter 2: Data Models & Query Languages

Fast-lane overview

Relational tables and document-style JSON sit at opposite ends of a spectrum: relational is built for many-to-many joins and enforced schemas; documents are built for nested trees, locality, and flexible schemas. The hard part is knowing when each wins, because the wrong pick inflates complexity (or latency) in production systems, and it costs you in interviews too.

1. Core ideas, minus the noise

Big concept	Newbie-friendly mental model
Data-model layering	Reality → objects in code → general-purpose model (tables, docs) → bytes on disk. Each layer hides the ugliness below it.
Relational model	Think Excel sheets with constraints: every row is a tuple, columns are typed, and joins are the super-power that knit sheets together. Born for business transactions.
Document model	A Russian-doll JSON blob: store the whole tree together for one-to-many reads, skip the join. Great locality, optional schema.
Schema-on-write vs schema-on-read	RDBMS enforces structure before insert; docs let anything through and leave validation to readers. Flexibility today, surprise bugs tomorrow.
Normalization	“Don’t repeat yourself” in data: store each fact once and reference it by ID, so there are no duplicated strings and an update touches one row instead of every copy.
History lesson	In the 1960s-70s, IMS (hierarchical) and CODASYL (network) fought the same “tree vs graph” battle; the relational model won by hiding access paths behind a query optimizer. We’re replaying that debate with NoSQL.
Query languages	Hand-written MapReduce and JavaScript functions are flexible but verbose; NoSQL vendors tend to add declarative, SQL-like dialects on top (e.g., MongoDB aggregation pipeline).
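
To make the relational and document rows above concrete, here is a minimal sketch that stores the same toy profile both ways using Python's built-in sqlite3 and json modules. The table names (users, positions), columns, and sample data are made up for illustration and are not from the chapter.

```python
import json
import sqlite3

# Relational form: two normalized tables linked by user_id (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE positions (
        position_id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(user_id),
        title TEXT NOT NULL
    );
    INSERT INTO users VALUES (1, 'Ada');
    INSERT INTO positions VALUES (10, 1, 'Engineer'), (11, 1, 'Tech lead');
""")

# Reading the whole profile needs a join across the two tables.
rows = conn.execute("""
    SELECT u.name, p.title
    FROM users AS u
    JOIN positions AS p ON p.user_id = u.user_id
    WHERE u.user_id = ?
    ORDER BY p.position_id
""", (1,)).fetchall()
titles_from_sql = [title for _name, title in rows]

# Document form: the same one-to-many tree stored as one nested blob.
profile_doc = json.dumps({
    "user_id": 1,
    "name": "Ada",
    "positions": [{"title": "Engineer"}, {"title": "Tech lead"}],
})

# Reading the whole profile is one fetch and one parse, with no join.
doc = json.loads(profile_doc)
titles_from_doc = [p["title"] for p in doc["positions"]]
```

Both reads return the same titles. The difference is where the work happens: the relational version assembles the tree at query time, while the document version stored it pre-assembled.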

2. Interview gold-nuggets

Interview hook	How to flex Chapter 2 knowledge
“Pick a DB for user profiles.”	If reads fetch the whole profile + nested lists → document beats shredding into tables; warn that future features (recommendations, cross-links) may require joins, so give every entity a stable ID now.
“Our JSON store now needs friend-of-friend queries.”	Explain that docs hate many-to-many; options: denormalize (duplication risk), app-level joins (latency), or migrate hot paths to relational/graph.
“Why is schema-on-read dangerous?”	Because every micro-service may interpret the same field differently; catching mistakes late shifts failure from write-time to prod read-time outages.
“MapReduce or SQL for analytics?”	Say: MapReduce is fine for one-off ETL, but declarative SQL-on-Hadoop/Spark lets the optimizer reorder stages, so queries are shorter to write and often faster.
Lightning definition pair	Foreign key = relational pointer enabling join; document reference = same idea in a doc DB, but following the reference is manual (app code or a second query) and non-atomic.
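
The friend-of-friend hook above can be sketched in a few lines. This is an illustrative comparison only: the in-memory dict stands in for a document store, the friendships table and all user data are invented, and a real system would pay a network round trip for each lookup in the app-level version.

```python
import sqlite3

# Document-style store (a plain dict here): each user embeds a list of friend IDs.
docs = {
    "a": {"name": "Ann", "friends": ["b", "c"]},
    "b": {"name": "Bob", "friends": ["a", "d"]},
    "c": {"name": "Cy", "friends": ["a", "d", "e"]},
    "d": {"name": "Dee", "friends": ["b", "c"]},
    "e": {"name": "Eve", "friends": ["c"]},
}

def friends_of_friends_app_join(store, user_id):
    """Application-level join: one extra lookup per direct friend."""
    direct = set(store[user_id]["friends"])
    result = set()
    for friend_id in direct:
        # In a real document store this would be a separate fetch.
        result.update(store[friend_id]["friends"])
    return sorted(result - direct - {user_id})

# Relational form: friendships as an edge table, queried with a self-join.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friendships (user_id TEXT, friend_id TEXT)")
conn.executemany(
    "INSERT INTO friendships VALUES (?, ?)",
    [(uid, fid) for uid, d in docs.items() for fid in d["friends"]],
)

def friends_of_friends_sql(db, user_id):
    """Same question as one declarative query; the database plans the join."""
    rows = db.execute("""
        SELECT DISTINCT f2.friend_id
        FROM friendships AS f1
        JOIN friendships AS f2 ON f2.user_id = f1.friend_id
        WHERE f1.user_id = :uid
          AND f2.friend_id != :uid
          AND f2.friend_id NOT IN (
              SELECT friend_id FROM friendships WHERE user_id = :uid)
        ORDER BY f2.friend_id
    """, {"uid": user_id}).fetchall()
    return [r[0] for r in rows]
```

Calling either function with "a" yields the same two users. The app-level version grows by one lookup per friend, which is the latency cost the table row warns about.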

3. Tech-lead takeaways

Future-proof your model. The longer you run, the more cross-entity links appear. Prototype in documents, but keep an upgrade plan to relational/graph.
Keep hot data together. Locality (JSON blob) cuts I/O and cache misses; measure 95th-percentile fetch latency before and after denormalizing.
Don’t DIY optimizers. SQL’s cost-based optimizer picks access paths for you and re-plans as data and indexes change; hand-crafted MapReduce has to be re-tuned by hand every time.
Version everything. Treat field names like public APIs; add v2_email rather than mutating in place.
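
A rough sketch of what "version everything" looks like on the read side under schema-on-read. The v2_email name comes from the bullet above; its nested shape (a dict with "address" and "verified"), the helper name read_email, and the sample documents are assumptions made up for this example.

```python
def read_email(doc):
    """Return the email from either document version, or None if unusable.

    Assumed shapes (hypothetical):
      old: {"email": "user@example.com"}
      new: {"v2_email": {"address": "user@example.com", "verified": True}}
    """
    # Prefer the newer field when it is present and well-formed.
    v2 = doc.get("v2_email")
    if isinstance(v2, dict) and isinstance(v2.get("address"), str):
        return v2["address"]
    # Fall back to the legacy field; validate here because nothing
    # checked the document at write time.
    legacy = doc.get("email")
    return legacy if isinstance(legacy, str) else None

old_doc = {"user_id": 1, "email": "ada@example.com"}
new_doc = {
    "user_id": 2,
    "email": "stale@example.com",
    "v2_email": {"address": "bob@example.com", "verified": True},
}
bad_doc = {"user_id": 3, "email": 42}  # a writer stored the wrong type
```

Adding a new field instead of changing the meaning of email means old readers keep working, and one shared reader function keeps every service interpreting the field the same way.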

4. Cheat-sheet bullets (night-before-interview)

Tables = best for many-to-many & ad-hoc queries; docs = best for one-to-many tree reads.
Schema-on-read ≈ “no brakes”: quick dev, risky semantics.
Use IDs + joins to avoid duplication; denormalize only when profiling proves it.
CODASYL showed that imperative access-paths don’t age well; declarative wins in the long run.
NoSQL often re-adds a SQL-like layer for complex analytics; expect that trajectory.
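
To see the imperative-vs-declarative bullet in code, the sketch below answers one question twice: "total sightings per month for one animal family". The map/reduce half is a toy single-process imitation of the pattern, not a real MapReduce framework, and the sightings data, table, and function names are invented for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical records: (month, family, number_seen)
sightings = [
    ("2024-01", "shark", 3),
    ("2024-01", "shark", 4),
    ("2024-02", "shark", 1),
    ("2024-02", "ray", 9),
]

# Imperative style: you write the filter, the grouping, and the aggregation.
def map_fn(record):
    month, family, num = record
    if family == "shark":
        yield month, num

def reduce_fn(key, values):
    return key, sum(values)

grouped = defaultdict(list)
for rec in sightings:
    for key, value in map_fn(rec):
        grouped[key].append(value)
mr_result = sorted(reduce_fn(k, vs) for k, vs in grouped.items())

# Declarative style: state the result you want; the engine picks the plan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sightings (month TEXT, family TEXT, num INTEGER)")
conn.executemany("INSERT INTO sightings VALUES (?, ?, ?)", sightings)
sql_result = conn.execute("""
    SELECT month, SUM(num)
    FROM sightings
    WHERE family = 'shark'
    GROUP BY month
    ORDER BY month
""").fetchall()
```

Both halves produce the same per-month totals. Only the SQL version leaves the engine free to add an index or reorder the work without the query text changing.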