Data (6a) — Data & Intelligence

A program holds its working data in memory. The process exits and that memory disappears. Two processes touching the same file race each other and corrupt each other's writes. Power can fail mid-write and leave a record half on disk. A database is a separate process that owns the storage and exposes a transactional API, so every application stops re-implementing crash recovery and concurrent access by hand.

This page walks up the database stack — bytes on a page, indexes that find rows, transactions that hold invariants, replication that survives node loss, query planners that pick a plan, and the warehouse pipelines that move all of it where analytics needs it.

The same stack appears in PostgreSQL, MySQL, MongoDB, and the columnar engines under modern warehouses. What changes between them is the access method (B-tree vs LSM vs columnar) and the language at the top.

Why databases exist

Two problems break the naive "just write to a file" approach.

Crashes truncate writes mid-flight. A multi-page update interrupted by a power loss leaves part of the change on disk and part missing — a torn write. Recovering means writing a write-ahead log by hand, calling fsync correctly on every platform, replaying partial records on restart, and ordering the log relative to the data file. Getting any one detail wrong silently corrupts the file the next time the cord is yanked.

Concurrent writers race. Two processes that read-modify-write the same record can both read the old value, both compute a new one, and both write — a lost update. Operating-system advisory locks help on a single machine; over NFS or SMB their semantics quietly weaken.

A database centralises this work. One process owns the files, exposes a transactional API (BEGIN, COMMIT, ROLLBACK), and is responsible for two promises every application then takes for granted:

Durability — once COMMIT returns, the write survives a crash.
Isolation — concurrent transactions see a consistent view, never each other's half-finished work.

Everything else — schemas, indexes, query languages, replication — is built on top of those two.

The left side is what every application would build if databases didn't exist. The right side is what applications pay a query-protocol round-trip to outsource to one well-tested implementation.

The cost is fsync. Durability ultimately depends on forcing writes out of the kernel's page cache to the device. fsync stalls until the device confirms — typically 1–10 ms on consumer SSDs, longer on cloud disks. Group commit batches multiple transactions into one fsync call to amortise the wait. Skipping fsync looks fast in benchmarks and corrupts the database at the next power loss.

The relational model

Before the relational model, an application asked for a record by chasing pointers between files — the schema and the access path were the same thing, and changing one meant rewriting the other. The relational model separates the two. Data is a set of tables (relations) of typed rows over a fixed schema. The application says what it wants in SQL; the database picks how to get it.

A primary key uniquely identifies each row in a table. A foreign key in one table references the primary key of another, encoding a relationship. The operations on relations — filter, project, join, group, aggregate — are the relational algebra, and SQL is its declarative form.

SELECT u.name, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.created_at > '2025-01-01'
GROUP BY u.id, u.name;

That query never says whether to scan users first or orders first, whether to use a hash join or an index lookup, whether to read from a B-tree or a sequential scan. A query planner makes those choices from statistics it keeps about the data: row counts, value distributions, index selectivity, correlations between columns.

Two queries with the same logical plan can have wildly different physical plans depending on table sizes and indexes. EXPLAIN exposes the choice; ANALYZE runs it and reports actual row counts, exposing planner mis-estimates.

The optimizer enumerates physical plans and picks the cheapest under a cost model of rows-and-I/O per operator. Mis-estimated row counts are the single biggest reason production plans go wrong. If the optimizer expects 100 rows and gets 10 million, it will choose nested-loop where it should have chosen hash join, and a 50 ms query becomes a 50 minute query.

Three plans, same answer, costs that differ by two orders of magnitude. The planner's row estimates depend on statistics; stale stats are how queries silently regress overnight.

Normalisation factors tables so each fact lives in exactly one place. Third Normal Form (3NF) is the usual target for transactional databases: every non-key column depends on the whole primary key and nothing else. When reads dominate, duplicating data — denormalisation — trades write complexity for read speed. Analytical warehouses denormalise aggressively because they almost never update.

Pitfall — NULL is not equal to itself. SQL's logic has three values, and NULL = NULL evaluates to NULL (not true). WHERE x = NULL matches no rows; the correct form is WHERE x IS NULL. NOT IN with a NULL in the list rejects every row. COUNT(*) counts NULLs but COUNT(col) skips them. These rules are the source of long-lived bugs in queries that look correct at a glance.

Transactions and ACID

Without transactions, a multi-step change can fail partway and leave the database in an impossible state — a transfer that debits one account but never credits the other. A transaction is a unit of work the database treats as one indivisible step. ACID names the four promises that step keeps:

Atomicity — all writes commit together or none do. A partial transaction is impossible after COMMIT or ROLLBACK.
Consistency — the database moves from one valid state to another. Declared constraints (foreign keys, uniqueness, checks) are preserved across the boundary. The 'C' is partly the application's job — the database only enforces what you declared.
Isolation — concurrent transactions appear to run as if alone. The strength of "appear" is the isolation level.
Durability — once COMMIT returns, the change survives a crash.

The write-ahead log

Atomicity and durability share one implementation: the write-ahead log (WAL). The rule is one line: a log record describing a change must reach durable storage before the data page it changes does. Commit then becomes one fsync on a small sequential file, instead of flushing every modified data page. Data files catch up lazily in the background; on crash, recovery replays the log from the last checkpoint to bring them back in sync.

PostgreSQL calls it the WAL, InnoDB calls it the redo log, SQL Server calls it the transaction log; the algorithm — ARIES — is the same. Group commit batches many transactions into one fsync to amortise the per-flush cost.

Worked example: crash recovery from the WAL

Three transactions are in flight when the power dies. We walk what the log holds, what gets replayed, and what gets discarded.

State just before the crash. Last checkpoint was at LSN 40 (the database has guaranteed every change up to LSN 40 is already in the data files). The log on disk holds:

LSN	Record	Meaning
41	`T1: UPDATE accounts SET bal=80 WHERE id=1`	T1 first write
42	`T1: UPDATE accounts SET bal=120 WHERE id=2`	T1 second write
43	`T1: UPDATE accounts SET bal=50 WHERE id=3`	T1 third write
44	`T1: COMMIT`	T1 durably committed (fsync returned)
45	`T2: UPDATE accounts SET bal=200 WHERE id=4`	T2 first write
46	`T2: UPDATE accounts SET bal=10 WHERE id=5`	T2 second write
47	`T2: COMMIT`	T2 durably committed
48	`T3: UPDATE accounts SET bal=999 WHERE id=6`	T3 wrote but never committed

At LSN 48 the cord is yanked. T3 had not written a COMMIT record yet.

On restart, the database opens the log and runs three passes:

Analysis — read forward from the last checkpoint (LSN 40). Build two lists: committed transactions {T1, T2} (saw their COMMIT records) and in-flight {T3} (no COMMIT seen).
REDO — replay every log record from LSN 41 onward against the data pages, regardless of whether the transaction committed. After this pass, the data files reflect every write in the log, including T3's uncommitted bal=999.
UNDO — walk the in-flight list {T3} and reverse its writes. The WAL records carry the old value as well as the new, so undoing LSN 48 means writing bal back to whatever it was before. After undo, id=6 is restored.

Result. T1 and T2 are durable — their COMMIT was fsynced before the crash, so the client was promised they would survive, and they did. T3 vanishes — no commit was acknowledged to the client, so erasing it breaks no promise. The data files now match what would have been visible to a reader at the moment of T2's commit.

The key invariant: a COMMIT record reaching the log on disk is the only event that makes a transaction durable. Page writes can lag arbitrarily behind; the log is the source of truth.

Isolation levels and anomalies

Isolation describes what concurrent transactions are allowed to see of each other. Stronger isolation means fewer anomalies but more coordination — more locking, more aborts, lower throughput. Four anomalies define the levels:

Dirty read — read a value another transaction has written but not yet committed. If it rolls back, you read fiction.
Non-repeatable read — same query returns different values for the same row because someone updated it between your two reads.
Phantom read — same range query returns different rows because someone inserted or deleted in between.
Write skew — two transactions each read a set of rows, each makes a decision from what they read, each writes — and the combined writes break an invariant neither transaction's snapshot saw violated.

Filled cells are anomalies the level still allows. PostgreSQL's "repeatable read" is actually snapshot isolation; only "serializable" prevents write skew. PostgreSQL and Oracle default to read committed; MySQL InnoDB defaults to repeatable read. Production code that assumes serializable without setting it is where write-skew bugs hide.

The classic example of write skew: two on-call doctors each query "are at least two doctors on call?", each sees the answer "yes", and each goes off-call simultaneously. Neither transaction's snapshot saw the invariant violated; the combined effect of their writes did.

Each transaction reads a consistent snapshot; their writes target disjoint rows so the engine sees no conflict at commit. The fixes are explicit row locks (SELECT … FOR UPDATE), a constraint that catches the post-state, or SERIALIZABLE isolation that watches for dangerous read/write patterns.

Locking vs MVCC

Two implementation strategies give isolation.

Two-phase locking (2PL) acquires locks (shared for reads, exclusive for writes) and holds them until commit. Once any lock has been released, no new lock may be acquired. This produces serialisable schedules, at the cost that readers block writers and writers block readers. Deadlocks are inevitable; the engine detects cycles in a wait-for graph and aborts one transaction as the victim.

Multi-Version Concurrency Control (MVCC) keeps multiple versions of each row instead of locking. Every row carries two transaction IDs — xmin (the transaction that created it) and xmax (the transaction that deleted or replaced it). A transaction reading at snapshot S sees only versions where xmin is committed and visible to S, and xmax is either null or invisible to S. Writers create new versions; old versions are reclaimed by a background process — VACUUM in PostgreSQL, the purge thread in InnoDB.

The cost is space — old versions live until vacuumed — and write amplification, since every UPDATE creates a new tuple. The benefit is that long-running analytical reads never block the transactional workload.

Pitfall — long-running transactions hold back vacuum. A row version cannot be reclaimed while any open transaction might still see it. A reporting query that holds a snapshot open for an hour stops vacuum from cleaning up an hour's worth of dead rows; the table bloats and every query slows down. Watch pg_stat_activity for ancient transactions and set idle_in_transaction_session_timeout.

Worked example: MVCC visibility across three transactions

Three transactions touch the same row in PostgreSQL. We track which version each sees.

Setup. Row account(id=1, balance=100) exists as a tuple with (xmin=50, xmax=null). Transaction 50 committed long ago.

T=0 — TxA starts (txid=200). TxA's snapshot: "txids up to 199 visible if committed; nothing later."

TxA reads account(id=1):
  finds v1: (xmin=50, xmax=null, balance=100)
  xmin=50 visible? yes
  xmax=null → tuple is live
  → TxA sees balance=100

T=5 ms — TxB starts (txid=201). TxA at txid=200 is active, not committed, so TxB will not consider TxA's writes visible.

T=10 ms — TxA UPDATE balance to 150. Postgres does not modify v1. It inserts v2 (xmin=200, xmax=null, balance=150) and marks v1 dead-by-200 (xmin=50, xmax=200, balance=100). The row has two physical versions on disk.

T=15 ms — TxB reads. TxB walks the version chain:

v1: xmax=200 — TxA still active in TxB's snapshot
    → treat deletion as not yet happened
    → TxB sees v1: balance=100
v2: xmin=200 — TxA still active
    → skip

TxB sees balance=100. Snapshot isolation in action.

T=20 ms — TxA commits. Globally, txid 200 is now committed.

T=25 ms — TxB re-reads. TxB's snapshot was captured at T=5 ms and is fixed for its lifetime. Txid 200 is still not in TxB's committed set. TxB still sees balance=100. This is what "repeatable read" means.

T=30 ms — TxC starts (txid=202). Its snapshot now considers 200 committed. TxC sees v2: balance=150.

The cleanup tax. v1 is dead from every perspective except TxB's. As long as TxB lives, v1 stays on disk. After TxB ends, VACUUM reclaims its space. A five-minute analytical TxB can pin gigabytes of dead tuples.

Replication and sharding

A single-node database eventually runs out of hardware or becomes a single point of failure. Two answers, often combined.

Replication keeps copies of the same data on multiple nodes. The trade-off is between consistency, latency, and availability when nodes or networks fail.

Pick the topology by what failure modes you can tolerate. Async primary-replica is the default for read-heavy web apps; Raft-replicated is the default for systems that must not lose committed writes; multi-primary fits write-anywhere scenarios where the application can resolve conflicts.

Sharding splits one logical table across many physical nodes by a partition key, so write throughput and dataset size scale beyond one machine. The partitioning strategy decides which queries stay fast and where hot spots form.

Range works for time-series and historical data; hash distributes load uniformly for point lookups by key; list fits tenancy and residency. Most real systems compose them — partition by range on day, sub-partition by hash on user_id — and pay the planner-complexity tax for both properties.

Indexes

Without an index, every query is a sequential scan: read every page, evaluate WHERE on every row. That's O(n) — fine for a thousand rows, ruinous for a billion. An index is a separate sorted structure mapping key values to row locations. Storage and write cost go up; reads go from O(n) to O(log n) or O(k) for range scans.

Two structures dominate.

B-trees

A B-tree is a balanced multi-way search tree designed for block storage. Each node fills exactly one disk page (4–16 KB in practice — 8 KB in PostgreSQL, 16 KB in InnoDB) and holds many keys — the fanout. A typical fanout of a few hundred means a billion-row index is only 4–5 levels deep; the top levels live in memory, so a point lookup is one or two disk reads. Every leaf is at the same depth — the tree is balanced by construction.

Splits stay local most of the time. On heavy bulk insert in key order — monotonically increasing IDs — the rightmost path splits repeatedly; modern engines detect this and use a fast path that avoids touching the whole spine.

When a leaf insert overflows, the leaf splits in two and the median key is promoted to the parent. If the parent also overflows, the split cascades up — possibly creating a new root. Splits are durable: each one writes a WAL record so crash recovery can reconstruct the new shape.

Worked example: a leaf split during insertion

Pretend each leaf holds at most four keys (real engines hold hundreds; four keeps the diagram readable). We insert key 27 into a small tree.

Before. The tree has a root with one separator 25 and two leaves. The left leaf holds [10, 18, 22, 24] and the right leaf holds [28, 35, 48, 60]. Both leaves are full.

              root: [ 25 ]
              /          \
   [ 10, 18, 22, 24 ]   [ 28, 35, 48, 60 ]

Step 1 — descend. Insert 27. At the root, 27 > 25, so follow the right pointer to the leaf [28, 35, 48, 60].

Step 2 — attempt insert. 27 belongs at the front of that leaf (27 < 28). But the leaf is already at capacity. Overflow.

Step 3 — split the leaf. The engine takes the five keys that would be in the overflowed leaf — [27, 28, 35, 48, 60] — and splits them around the median 35. The left half [27, 28] stays in the original leaf node; a new leaf node is allocated for the right half [35, 48, 60].

Step 4 — promote the median. The median key 35 is inserted into the parent (the root) as a new separator. The root had room, so this insertion succeeds without further splitting.

After.

            root: [ 25, 35 ]
           /       |        \
[10,18,22,24]  [27, 28]   [35, 48, 60]

Both new leaves are at the same depth — the tree stayed balanced.

What if the parent had also been full? The parent would itself split, promoting its median key one level up. In the worst case the split propagates all the way to the root, which becomes a new internal node and the tree grows one level taller. This is the only way a B-tree's height changes — every level is shared by every leaf, by construction.

Each step writes a WAL record (new leaf allocated, keys redistributed, parent updated) so a crash mid-split is replayable.

LSM trees

A log-structured merge tree flips the trade-off for write-heavy workloads. Writes go into an in-memory memtable; when it fills, the engine flushes it to disk as an immutable sorted file called an SSTable at level 0. A background compactor merges level-0 files into level 1, level 1 into level 2, and so on, with each level roughly 10× the size of the one above.

Reads must consult the memtable plus every SSTable that could hold the key. Bloom filters at each SSTable cut this to "the memtable plus one or two files" in the common case.

RocksDB, Cassandra, ScyllaDB, LevelDB, and HBase all run variants of this. Without bloom filters, every point lookup would touch every level; with them, read amplification drops by 10–100×.

Worked example: data flowing through compaction levels

Pretend the memtable holds 4 keys and each level holds 4× the one above (so L0 caps at 4 SSTables, L1 at 16 keys, L2 at 64 keys). Real engines use much larger numbers; the ratios are the point.

Phase 1 — fill the memtable. Client writes arrive: put(7, "a"), put(3, "b"), put(11, "c"), put(7, "d"). The memtable now holds {3: "b", 7: "d", 11: "c"} — note 7 is overwritten in place, in memory.

Phase 2 — flush to L0. A fifth write put(20, "e") would overflow the memtable. The engine freezes the current memtable, opens a new empty one for incoming writes, and writes the frozen one to disk as a sorted file L0-file-1: [3, 7, 11]. L0 now has one file.

Phase 3 — repeat. The next four batches of writes flush as L0-file-2, L0-file-3, L0-file-4. L0 is now full (4 files). Critically, L0 files can overlap in key range — L0-file-1 might cover [3..11] and L0-file-3 might cover [5..15], because each was a snapshot of a different memtable.

Phase 4 — compact L0 → L1. Compactor picks all four L0 files and the L1 files that overlap their combined key range (initially none). It runs a merge-sort across them, resolving any duplicate keys by keeping the newest (highest sequence number). It writes the result as one or more new files in L1, where files are non-overlapping in key range. The four L0 files are deleted.

before:   L0: [3..11] [5..15] [9..20] [2..18]   L1: (empty)
                       \   merge-sort by key   /
after:    L0: (empty)                          L1: [2..20]

Phase 5 — fast-forward. Many memtable flushes later, L1 has grown to 16 keys (full). Same dance: the compactor picks the L1 file plus any L2 files in its key range, merges, writes the result down to L2, deletes the inputs.

Reading key=7 after all this.

Check the memtable — miss.
Check each L0 file in newest-first order, gated by its bloom filter. Bloom filter for L0-file-2 says "definitely not present" — skip without reading the file. L0-file-1's bloom filter says "possibly present" — binary-search inside it, miss.
Check L1. Files in L1 are sorted and non-overlapping, so we find at most one file whose range covers 7. Binary-search it. Found: 7: "d".

The bloom filters turn what would be "open every SSTable on disk" into "open one or two." That, plus the sequential-write friendliness of the memtable flush, is why LSMs sustain 5–10× more writes than a B-tree on the same hardware.

The trade-off is direct. B-tree wins reads: one tree traversal, predictable, cache-friendly. LSM wins writes: appends to memory are sequential and batched, so write throughput can be 5–10× higher, at the cost of compaction work and slower point reads. PostgreSQL, MySQL/InnoDB, SQLite use B-trees because transactional workloads balance reads and writes. Cassandra, ScyllaDB, RocksDB use LSM because their target is high write throughput at scale. Some systems (TiDB, CockroachDB) put a SQL B-tree-feeling API on top of an LSM key-value store underneath.

Pitfall — indexes only help when they're selective. An index pays off only when the query touches a small fraction of the table. Postgres switches to a sequential scan when an index lookup would return more than roughly 5–10% of rows; random-access I/O on many pages is slower than reading the whole table sequentially. Indexing booleans or any column with three values is rarely useful. Partial indexes (WHERE status = 'active') and covering indexes (include the columns the query reads) often are.

How one SELECT actually runs

With the planner, indexes, WAL, and MVCC defined, the path a single SELECT takes through them is concrete. The query is parsed and type-checked, the planner picks an access path, the executor walks the chosen index to find row pointers, the buffer pool (the database's in-memory cache of disk pages) supplies the heap pages those pointers reference, and an MVCC visibility filter discards row versions the transaction's snapshot should not see.

Every layer covered so far appears in one round-trip: planner picks the access path, B-tree returns row pointers, buffer pool serves the heap pages, MVCC filters out versions outside the snapshot. A write adds a new row version and flushes a WAL record before COMMIT returns.

Beyond relational

The relational model is general-purpose. For any one access pattern, a specialised model is faster, simpler, or both. Each family below trades generality for one specific kind of question it answers very well.

Document stores (MongoDB, Couchbase) — store nested JSON/BSON documents under a key. Schemas vary per document. Queries reach inside (db.users.find({"address.city": "Paris"})). Strong for application objects that vary in shape; weak when joins across collections matter.
Graph databases (Neo4j, JanusGraph, Neptune) — model nodes and edges with index-free adjacency: each node holds direct pointers to its neighbours, so a multi-hop traversal is O(hops × avg_degree), not O(hops × log n) joins. Queries use Cypher, GQL, or Gremlin to pattern-match against subgraphs. Wins when the relationships are the data: social networks, fraud rings, supply chains.
Time-series (InfluxDB, TimescaleDB, Prometheus) — append-heavy, timestamped, often dropped on a retention policy. Storage is partitioned by time and compressed with delta-of-delta encoding to reach under 2 bytes per sample. Queries are time-range plus aggregation.
Vector databases (pgvector, Pinecone, Weaviate, Qdrant) — index high-dimensional float vectors (typically 768 or 1536 dimensions) for approximate nearest-neighbour search. The dominant index is HNSW, a multi-layer proximity graph that gives sub-linear k-NN at recall above 0.95. Hands forward to Act VIb where the embeddings come from.

Each shape bets that one access pattern dominates the workload. Large systems usually combine several — relational for transactions, document for user content, time-series for telemetry, vector for semantic search — coordinated by an application layer above.

Lines blur. PostgreSQL adds JSONB (document), pg_trgm (text similarity), TimescaleDB hypertables, and pgvector to its relational core. CockroachDB and Spanner are relational over a globally distributed LSM. Category names are useful signposts, not strict containers.

Pitfall — eventual consistency is not free. Many of these stores trade strong consistency for availability under network partition. Reads can return stale data; writes can conflict and require application-level resolution (last-write-wins, CRDTs, vector clocks). Pick the consistency model deliberately. Eventual consistency is great for shopping carts and hostile for bank balances.

Data engineering

Operational databases serve transactions; the rest of the organisation wants the same data for analytics, dashboards, and ML training. Data engineering moves data from where it is born — the OLTP database, the event log, the third-party API — to where it is useful: the warehouse, the lakehouse, the feature store.

The deepest fault line in the data stack is OLTP (online transaction processing) versus OLAP (online analytical processing). The same row lives in two worlds with opposite priorities. An OLTP system serves thousands of small, latency-sensitive operations per second. An OLAP system answers a handful of large, throughput-sensitive aggregate queries. The systems, the storage layouts, and the engines that win at each are different.

Most "the database is slow" reports trace to running an OLAP query against an OLTP store, or vice versa. Pick the right system for the workload and pipe data between them.

Row-major vs column-major storage

The reason OLAP needs a different system is the storage layout. A row-major heap stores all columns of one row contiguously — fast for SELECT * WHERE id = 7 (one read), terrible for SELECT AVG(amount) FROM 1B rows (one read per row, most bytes thrown away). A column-major store inverts that: all values of one column are stored together, so a scan reads only the bytes the query needs.

Parquet, ORC, and the native formats inside Snowflake, BigQuery, and Redshift all use this layout — plus dictionary encoding for low-cardinality columns, run-length for sorted columns, per-page min/max stats for predicate pushdown, and bloom filters for equality.

Batch vs stream

Two ways to move and process data, distinguished by when the engine sees the input.

Batch collects data over an interval — an hour, a day — and processes the whole interval at once. Tools: Spark, dbt, Airflow, the warehouse's own SQL. Latency equals the window size; throughput is high; reasoning is easy because every job sees a fixed, finite input.

Stream processes records as they arrive. Tools: Kafka Streams, Flink, Materialize, Spark Structured Streaming. Latency is sub-second; throughput depends on parallelism; reasoning is harder because the input never ends. Two new concepts come with that: windows (tumbling, sliding, session) over which to aggregate, and handling for late-arriving events that show up after their window already closed.

Batch and stream aren't opposites; they're points on a continuum. Micro-batch (Spark Structured Streaming) and continuous-batch (dbt rebuild every 5 min) sit between them.

The streaming side is usually fed by change data capture (CDC). Instead of polling the OLTP database, a CDC connector tails the database's WAL and emits each row mutation as an event in commit order. The OLTP database pays nothing extra; downstream consumers learn about every insert, update, and delete in near real time.

Debezium, AWS DMS, and the native CDC inside modern warehouses all run this pattern. The downstream effect is profound: the OLTP database becomes the source of truth for an event-driven architecture without changing the application code.

Warehouse vs lakehouse

A data warehouse (Snowflake, BigQuery, Redshift) is a managed, columnar, analytical database. Storage is proprietary and tightly coupled to the engine. Query performance is excellent because the engine knows everything about how the data is laid out. The cost is lock-in: getting data out to train an ML model or feed a different engine means exporting and paying for egress.

A lakehouse keeps the columnar storage in open file formats (Parquet, ORC) on cheap object storage and adds a table format layer (Iceberg, Delta Lake, Hudi) that brings ACID transactions, schema evolution, time travel, and partition management to those raw files. Multiple engines (Spark, Trino, DuckDB, Snowflake-on-Iceberg) can read the same tables without copying the data.

The lakehouse pattern made open warehousing competitive with proprietary engines. Modern Snowflake, BigQuery, and Databricks all read external Iceberg tables alongside their native storage.

The table format does heavy lifting. Iceberg's layout, for example, layers a snapshot pointer, a manifest list, per-file manifests with statistics, and then the actual Parquet data files. Every commit writes a new snapshot; old snapshots stay around for time-travel queries (SELECT … FOR VERSION AS OF 17) until expired. That is how an eventually-consistent object store gets ACID semantics at the table level.

ETL, ELT, and lineage

The classic ETL sequence — Extract, Transform, Load — pulled data, transformed it in a separate engine, and loaded the result. ELT flips the last two steps: load raw data into the warehouse first and transform inside the warehouse with SQL. Cheap warehouse compute and tools like dbt (which compiles a graph of SQL SELECT statements into materialised tables with tests and lineage) made ELT the default for analytics.

When the same number — daily revenue, active users, model accuracy — appears in five dashboards and disagrees, someone has to answer "where did this number come from?" Lineage is the system of record for that question. At column granularity it tracks: this dashboard column came from that warehouse column, which came from that dbt join, which came from that Postgres table, which came from that Kafka event, which came from that application endpoint. OpenLineage standardises the events; DataHub, OpenMetadata, and Atlan visualise the graph.

Pitfall — pipelines silently break shape. Schema-on-read formats (JSON, untyped Parquet) make it easy to add a field upstream and not notice that downstream parsers now silently drop or mis-type it. Schema registries and column-level contracts catch this at the producer; without them, the bug surfaces three weeks later when a quarterly report disagrees with itself.

Standards

SQL — ISO/IEC 9075 (parts 1–14, current SQL:2023). Defines the language, the relational operators, the type system, JSON support, and SQL/PGQ (graph queries inside SQL). Vendors implement subsets and extensions; the standard is the lingua franca.
Apache Parquet — parquet.apache.org specification. Columnar, page-based file format with per-column compression and statistics; the de facto interchange format for analytics. ORC (orc.apache.org) is the close cousin from the Hive lineage.
Apache Iceberg — iceberg.apache.org/spec/ (current spec v3). Table format defining metadata structure, snapshot semantics, partition spec evolution, and atomic commits over object storage.
Delta Lake — delta.io/protocol. Linux Foundation's table format originating from Databricks; competitor and now-interoperable peer of Iceberg.
Apache Hudi — hudi.apache.org/docs/concepts. Third major table format; emphasizes incremental processing and upserts.
Avro — avro.apache.org/docs/. Compact, schema-defined binary serialization for record streams; the canonical Kafka payload format. Handed forward from Act I (serialization) — used here for stream payloads with registry-backed schema evolution.
Apache Arrow — arrow.apache.org/docs/format/Columnar.html. In-memory columnar layout for zero-copy interchange between query engines, ML frameworks, and language runtimes; Arrow Flight (format/Flight.html) is the RPC for moving Arrow batches.
JDBC / ODBC — Java SE JDBC API Specification (Oracle, JSR 221) and Microsoft ODBC Programmer's Reference. The lingua-franca client APIs every relational database implements; everything from BI tools to ETL tools targets these.
PostgreSQL wire protocol — postgresql.org/docs/current/protocol.html. Defines the message-oriented client/server protocol; many systems (CockroachDB, YugabyteDB, Materialize, Hyper) re-implement it to ride existing client libraries.
MongoDB Query Language (MQL) — mongodb.com/docs/manual/reference/operator/. Query, projection, and update operators over BSON documents; the aggregation pipeline is the warehouse-flavoured cousin.
Cypher — opencypher.org. Graph pattern-matching language originating in Neo4j; the basis for GQL — ISO/IEC 39075:2024, the first ISO standard for graph queries.
PromQL — prometheus.io/docs/prometheus/latest/querying/basics/. Functional query language for time-series; the de facto standard for metrics queries even outside Prometheus (Grafana, Cortex, Mimir, Thanos, VictoriaMetrics all support it).
HNSW — Yu. A. Malkov & D. A. Yashunin, Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs (IEEE TPAMI, 2018). The dominant graph-based ANN index. IVF (inverted file) variants — H. Jégou, M. Douze, C. Schmid, Product Quantization for Nearest Neighbor Search (IEEE TPAMI, 2011) — are the partition-based alternative.
OpenLineage — openlineage.io/spec/. Open standard for emitting lineage events from data pipelines; ingested by DataHub, OpenMetadata, Marquez.
WAL / ARIES recovery — Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging (ACM TODS, 1992). The recovery algorithm under InnoDB, PostgreSQL, SQL Server, and DB2.
Forward refs — Act VIb (Intelligence) builds directly on vector stores and embeddings introduced here; Act V (Connection) supplies the network protocols that ship rows between replicas and across pipelines.