Data

A program holds its working data in memory. The process exits and that memory disappears. Two processes touching the same file race each other and corrupt each other's writes. Power can fail mid-write and leave a record half on disk. A database is a separate process that owns the storage and exposes a transactional API, so every application stops re-implementing crash recovery and concurrent access by hand.

This page walks up the database stack — bytes on a page, indexes that find rows, transactions that hold invariants, replication that survives node loss, query planners that pick a plan, and the warehouse pipelines that move all of it where analytics needs it.

The database stack — bytes on disk to data pipelinesData pipelinesbatch · stream · warehouse · lakehouse · lineageQuery languagesSQL · MQL · Cypher · PromQL · vector searchQuery engineparser · planner · optimizer · executorTransactionsACID · isolation levels · MVCC · WAL · locksAccess methodsheap · B-tree · LSM tree · hash · bitmap · vectorStoragefixed-size pages · WAL · fsync · checksumseach layer hides the one below; the database holds its promise by coordinating all six
The same stack appears in PostgreSQL, MySQL, MongoDB, and the columnar engines under modern warehouses. What changes between them is the access method (B-tree vs LSM vs columnar) and the language at the top.

Why databases exist

Two problems break the naive "just write to a file" approach.

Crashes truncate writes mid-flight. A multi-page update interrupted by a power loss leaves part of the change on disk and part missing — a torn write. Recovering means writing a write-ahead log by hand, calling fsync correctly on every platform, replaying partial records on restart, and ordering the log relative to the data file. Getting any one detail wrong silently corrupts the file the next time the cord is yanked.

Concurrent writers race. Two processes that read-modify-write the same record can both read the old value, both compute a new one, and both write — a lost update. Operating-system advisory locks help on a single machine; over NFS or SMB their semantics quietly weaken.

A database centralises this work. One process owns the files, exposes a transactional API (BEGIN, COMMIT, ROLLBACK), and is responsible for two promises every application then takes for granted:

  • Durability — once COMMIT returns, the write survives a crash.
  • Isolation — concurrent transactions see a consistent view, never each other's half-finished work.

Everything else — schemas, indexes, query languages, replication — is built on top of those two.

Files-and-locks done by hand versus a transactional databaseFiles + locks (by hand)Transactional databaseevery app reinvents the wheelone ACID API, many clientsApp AApp BApp Cdata.bintorn writes when power dies mid-updatelost updates when two writers raceno atomic multi-record changesrecovery code is bespoke and rarely testednetwork-FS locks are unreliableApp AApp BApp CDBMSBEGIN · COMMIT · ROLLBACKall-or-nothing commitssnapshot reads ignore in-flight writescrash recovery is the database's problemone well-tested implementation, many clients
The left side is what every application would build if databases didn't exist. The right side is what applications pay a query-protocol round-trip to outsource to one well-tested implementation.

The cost is fsync. Durability ultimately depends on forcing writes out of the kernel's page cache to the device. fsync stalls until the device confirms — typically 1–10 ms on consumer SSDs, longer on cloud disks. Group commit batches multiple transactions into one fsync call to amortise the wait. Skipping fsync looks fast in benchmarks and corrupts the database at the next power loss.

The relational model

Before the relational model, an application asked for a record by chasing pointers between files — the schema and the access path were the same thing, and changing one meant rewriting the other. The relational model separates the two. Data is a set of tables (relations) of typed rows over a fixed schema. The application says what it wants in SQL; the database picks how to get it.

A primary key uniquely identifies each row in a table. A foreign key in one table references the primary key of another, encoding a relationship. The operations on relations — filter, project, join, group, aggregate — are the relational algebra, and SQL is its declarative form.

SELECT u.name, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.created_at > '2025-01-01'
GROUP BY u.id, u.name;

That query never says whether to scan users first or orders first, whether to use a hash join or an index lookup, whether to read from a B-tree or a sequential scan. A query planner makes those choices from statistics it keeps about the data: row counts, value distributions, index selectivity, correlations between columns.

SQL is parsed, type-checked, planned logically, then planned physicallySQLSELECT u.name FROM users u JOIN orders o ON o.user_id = u.idLogical plan (relational algebra)π_name ( σ_… ( users ⋈_id orders ) )project · select · join — order not yet fixedcost-based optimizerPhysical plan (operator tree)Project [u.name]Hash Join · u.id = o.user_id— right input —Seq Scan usersIndex Scan ordersChoices the planner made:· hash join over nested-loop because orders is large· seq scan on users because no useful index on created_at· index scan on orders.user_id because the join key is indexedEXPLAIN ANALYZE shows actual rows + time per node — the canonical debugging tool.
Two queries with the same logical plan can have wildly different physical plans depending on table sizes and indexes. EXPLAIN exposes the choice; ANALYZE runs it and reports actual row counts, exposing planner mis-estimates.

The optimizer enumerates physical plans and picks the cheapest under a cost model of rows-and-I/O per operator. Mis-estimated row counts are the single biggest reason production plans go wrong. If the optimizer expects 100 rows and gets 10 million, it will choose nested-loop where it should have chosen hash join, and a 50 ms query becomes a 50 minute query.

Three candidate physical plans for the same join, scored by the optimizer's cost modelSELECT o.id, c.name FROM orders o JOIN customers c ON o.cid = c.id WHERE o.total > 100stats: orders ≈ 10M rows · customers ≈ 200K rows · selectivity(total > 100) ≈ 0.05Nested-loop joinHash joinMerge joinfor each o in orders: if o.total > 100: lookup customers by o.cidbuild hash(customers)scan orders, filter,probe hash by cidsort orders by cidsort customers by cidmerge in lockstepprobe cost · 10M × log 200KI/O · heavy randombuild · 200K hashedI/O · two seq scanssort cost · two big sortsI/O · spill if coldcost ≈ 9.8e7cost ≈ 1.1e6 ✓cost ≈ 4.5e6planner picks hash join: customers fits in memory, orders scanned onceif customers had 200M rows, hash build would spill — merge join winsif orders had an index on cid and selectivity was very low — nested loop wins
Three plans, same answer, costs that differ by two orders of magnitude. The planner's row estimates depend on statistics; stale stats are how queries silently regress overnight.

Normalisation factors tables so each fact lives in exactly one place. Third Normal Form (3NF) is the usual target for transactional databases: every non-key column depends on the whole primary key and nothing else. When reads dominate, duplicating data — denormalisation — trades write complexity for read speed. Analytical warehouses denormalise aggressively because they almost never update.

Pitfall — NULL is not equal to itself. SQL's logic has three values, and NULL = NULL evaluates to NULL (not true). WHERE x = NULL matches no rows; the correct form is WHERE x IS NULL. NOT IN with a NULL in the list rejects every row. COUNT(*) counts NULLs but COUNT(col) skips them. These rules are the source of long-lived bugs in queries that look correct at a glance.

Transactions and ACID

Without transactions, a multi-step change can fail partway and leave the database in an impossible state — a transfer that debits one account but never credits the other. A transaction is a unit of work the database treats as one indivisible step. ACID names the four promises that step keeps:

  • Atomicity — all writes commit together or none do. A partial transaction is impossible after COMMIT or ROLLBACK.
  • Consistency — the database moves from one valid state to another. Declared constraints (foreign keys, uniqueness, checks) are preserved across the boundary. The 'C' is partly the application's job — the database only enforces what you declared.
  • Isolation — concurrent transactions appear to run as if alone. The strength of "appear" is the isolation level.
  • Durability — once COMMIT returns, the change survives a crash.

The write-ahead log

Atomicity and durability share one implementation: the write-ahead log (WAL). The rule is one line: a log record describing a change must reach durable storage before the data page it changes does. Commit then becomes one fsync on a small sequential file, instead of flushing every modified data page. Data files catch up lazily in the background; on crash, recovery replays the log from the last checkpoint to bring them back in sync.

Write-ahead log: log first, ack the client, page-flush asynchronously, replay on crashclientUPDATE x = x + 1DB processbuild log recordappend + fsyncWAL (sequential)LSN 41 · 42 · 43 · 44 · COMMITack to clientlater · background writer flushes dirty pagesdata pages (heap)random I/O · lazyon crash:1. find last checkpoint LSN · 2. REDO every committed log record · 3. UNDO uncommitted txnsdata pages converge to the state implied by the log · committed work survives, in-flight work disappearsone fsync on a small sequential file beats fsyncing every changed page
PostgreSQL calls it the WAL, InnoDB calls it the redo log, SQL Server calls it the transaction log; the algorithm — ARIES — is the same. Group commit batches many transactions into one fsync to amortise the per-flush cost.
Worked example: crash recovery from the WAL

Three transactions are in flight when the power dies. We walk what the log holds, what gets replayed, and what gets discarded.

State just before the crash. Last checkpoint was at LSN 40 (the database has guaranteed every change up to LSN 40 is already in the data files). The log on disk holds:

LSNRecordMeaning
41T1: UPDATE accounts SET bal=80 WHERE id=1T1 first write
42T1: UPDATE accounts SET bal=120 WHERE id=2T1 second write
43T1: UPDATE accounts SET bal=50 WHERE id=3T1 third write
44T1: COMMITT1 durably committed (fsync returned)
45T2: UPDATE accounts SET bal=200 WHERE id=4T2 first write
46T2: UPDATE accounts SET bal=10 WHERE id=5T2 second write
47T2: COMMITT2 durably committed
48T3: UPDATE accounts SET bal=999 WHERE id=6T3 wrote but never committed

At LSN 48 the cord is yanked. T3 had not written a COMMIT record yet.

On restart, the database opens the log and runs three passes:

  1. Analysis — read forward from the last checkpoint (LSN 40). Build two lists: committed transactions {T1, T2} (saw their COMMIT records) and in-flight {T3} (no COMMIT seen).
  2. REDO — replay every log record from LSN 41 onward against the data pages, regardless of whether the transaction committed. After this pass, the data files reflect every write in the log, including T3's uncommitted bal=999.
  3. UNDO — walk the in-flight list {T3} and reverse its writes. The WAL records carry the old value as well as the new, so undoing LSN 48 means writing bal back to whatever it was before. After undo, id=6 is restored.

Result. T1 and T2 are durable — their COMMIT was fsynced before the crash, so the client was promised they would survive, and they did. T3 vanishes — no commit was acknowledged to the client, so erasing it breaks no promise. The data files now match what would have been visible to a reader at the moment of T2's commit.

The key invariant: a COMMIT record reaching the log on disk is the only event that makes a transaction durable. Page writes can lag arbitrarily behind; the log is the source of truth.

Isolation levels and anomalies

Isolation describes what concurrent transactions are allowed to see of each other. Stronger isolation means fewer anomalies but more coordination — more locking, more aborts, lower throughput. Four anomalies define the levels:

  • Dirty read — read a value another transaction has written but not yet committed. If it rolls back, you read fiction.
  • Non-repeatable read — same query returns different values for the same row because someone updated it between your two reads.
  • Phantom read — same range query returns different rows because someone inserted or deleted in between.
  • Write skew — two transactions each read a set of rows, each makes a decision from what they read, each writes — and the combined writes break an invariant neither transaction's snapshot saw violated.
Isolation levels and the anomalies each one still allowsLevelDirty readNon-repeat.PhantomWrite skewRead uncommittedRead committedRepeatable readSnapshotSerializablepossiblepossiblepossiblepossiblepossiblepossiblepossiblepossiblepossiblepossiblepreventedpreventedpreventedpreventedpreventedpreventedpreventedpreventedpreventedprevented
Filled cells are anomalies the level still allows. PostgreSQL's "repeatable read" is actually snapshot isolation; only "serializable" prevents write skew. PostgreSQL and Oracle default to read committed; MySQL InnoDB defaults to repeatable read. Production code that assumes serializable without setting it is where write-skew bugs hide.

The classic example of write skew: two on-call doctors each query "are at least two doctors on call?", each sees the answer "yes", and each goes off-call simultaneously. Neither transaction's snapshot saw the invariant violated; the combined effect of their writes did.

Write skew: two transactions read consistent snapshots, write disjoint rows, commit, invariant breaksinvariantT1 (Alice)T2 (Bob)databaseon_call ≥ 1at all timesSELECT count(*)FROM oncall → 2SELECT count(*)FROM oncall → 2UPDATE meSET on_call=falseUPDATE meSET on_call=falseread snapshot S1read snapshot S2write Alice offwrite Bob offCOMMIT ✓COMMIT ✓result: on_call = 0 — invariant broken, no transaction saw an inconsistent snapshotSERIALIZABLE detects the read/write conflict pattern and aborts one; snapshot isolation alone cannot
Each transaction reads a consistent snapshot; their writes target disjoint rows so the engine sees no conflict at commit. The fixes are explicit row locks (SELECT … FOR UPDATE), a constraint that catches the post-state, or SERIALIZABLE isolation that watches for dangerous read/write patterns.

Locking vs MVCC

Two implementation strategies give isolation.

Two-phase locking (2PL) acquires locks (shared for reads, exclusive for writes) and holds them until commit. Once any lock has been released, no new lock may be acquired. This produces serialisable schedules, at the cost that readers block writers and writers block readers. Deadlocks are inevitable; the engine detects cycles in a wait-for graph and aborts one transaction as the victim.

Multi-Version Concurrency Control (MVCC) keeps multiple versions of each row instead of locking. Every row carries two transaction IDs — xmin (the transaction that created it) and xmax (the transaction that deleted or replaced it). A transaction reading at snapshot S sees only versions where xmin is committed and visible to S, and xmax is either null or invisible to S. Writers create new versions; old versions are reclaimed by a background process — VACUUM in PostgreSQL, the purge thread in InnoDB.

MVCC: each row carries xmin/xmax; readers see the snapshot they began withT100T105T110T115T120time →id=7 name="Alice" bal=100id=7 name="Alice" bal=80id=7 name="Alice" bal=120xmin=100 xmax=110xmin=110 xmax=120xmin=120 xmax=∞reader at snapshot 105 sees v1 (bal=100)reader at snapshot 117 sees v2 (bal=80)writers don't block readers; readers don't block writers; only conflicting writers serialize
The cost is space — old versions live until vacuumed — and write amplification, since every UPDATE creates a new tuple. The benefit is that long-running analytical reads never block the transactional workload.

Pitfall — long-running transactions hold back vacuum. A row version cannot be reclaimed while any open transaction might still see it. A reporting query that holds a snapshot open for an hour stops vacuum from cleaning up an hour's worth of dead rows; the table bloats and every query slows down. Watch pg_stat_activity for ancient transactions and set idle_in_transaction_session_timeout.

Worked example: MVCC visibility across three transactions

Three transactions touch the same row in PostgreSQL. We track which version each sees.

Setup. Row account(id=1, balance=100) exists as a tuple with (xmin=50, xmax=null). Transaction 50 committed long ago.

T=0 — TxA starts (txid=200). TxA's snapshot: "txids up to 199 visible if committed; nothing later."

TxA reads account(id=1):
  finds v1: (xmin=50, xmax=null, balance=100)
  xmin=50 visible? yes
  xmax=null → tuple is live
TxA sees balance=100

T=5 ms — TxB starts (txid=201). TxA at txid=200 is active, not committed, so TxB will not consider TxA's writes visible.

T=10 ms — TxA UPDATE balance to 150. Postgres does not modify v1. It inserts v2 (xmin=200, xmax=null, balance=150) and marks v1 dead-by-200 (xmin=50, xmax=200, balance=100). The row has two physical versions on disk.

T=15 ms — TxB reads. TxB walks the version chain:

v1: xmax=200TxA still active in TxB's snapshot
    → treat deletion as not yet happened
TxB sees v1: balance=100
v2: xmin=200TxA still active
    → skip

TxB sees balance=100. Snapshot isolation in action.

T=20 ms — TxA commits. Globally, txid 200 is now committed.

T=25 ms — TxB re-reads. TxB's snapshot was captured at T=5 ms and is fixed for its lifetime. Txid 200 is still not in TxB's committed set. TxB still sees balance=100. This is what "repeatable read" means.

T=30 ms — TxC starts (txid=202). Its snapshot now considers 200 committed. TxC sees v2: balance=150.

The cleanup tax. v1 is dead from every perspective except TxB's. As long as TxB lives, v1 stays on disk. After TxB ends, VACUUM reclaims its space. A five-minute analytical TxB can pin gigabytes of dead tuples.

Replication and sharding

A single-node database eventually runs out of hardware or becomes a single point of failure. Two answers, often combined.

Replication keeps copies of the same data on multiple nodes. The trade-off is between consistency, latency, and availability when nodes or networks fail.

Four replication topologies, each a different point on the consistency-latency-availability surfacePrimary-replicaMulti-primarySynchronousRaft / Paxosasync streamwrites anywhereall-or-nothing commitquorum log replicationPRRRPPPPWWWWLFFFquorum = 2 of 3+ low write latency+ scales reads− stale reads on replicas− failover loses tail+ writes anywhere+ no single SPOF− conflicts need merge− causal hazards+ no data loss+ replicas read-fresh− slowest node sets latency− loses availability if any down+ majority survives loss+ linearizable reads− needs odd member count− cross-region RTT costexamples · PostgreSQL streaming · Galera/BDR · Galera-sync · etcd, Spanner, CockroachDBCAP corner · AP eventual · AP eventual + merge · CP loses A under partition · CP loses A under partition"semi-sync" — wait for one replica ack, async to the rest — is the most common pragmatic compromise
Pick the topology by what failure modes you can tolerate. Async primary-replica is the default for read-heavy web apps; Raft-replicated is the default for systems that must not lose committed writes; multi-primary fits write-anywhere scenarios where the application can resolve conflicts.

Sharding splits one logical table across many physical nodes by a partition key, so write throughput and dataset size scale beyond one machine. The partitioning strategy decides which queries stay fast and where hot spots form.

Three partitioning strategies for the same orders tableRange partitionHash partitionList partitionPARTITION BY RANGE (created_at)PARTITION BY HASH (user_id)PARTITION BY LIST (region)orders_2024q1created_at IN [Jan, Apr)orders_2024q2created_at IN [Apr, Jul)orders_2024q3created_at IN [Jul, Oct)orders_2024q4created_at IN [Oct, Jan)shard 0hash(user_id) % 4 = 0shard 1hash(user_id) % 4 = 1shard 2hash(user_id) % 4 = 2shard 3hash(user_id) % 4 = 3orders_usregion IN ('us-east','us-west')orders_euregion IN ('eu-west','eu-cent')orders_apacregion IN ('ap-se','ap-ne')orders_defaultregion NOT IN above+ time-bounded queries scan one partition+ trivial archival: drop old partitions− writes hot-spot on the newest partition+ writes spread evenly across shards+ no hot partition− range queries fan out to all shards+ data residency / locality controls+ regional ops stay in-region− skew if one region dominates traffic
Range works for time-series and historical data; hash distributes load uniformly for point lookups by key; list fits tenancy and residency. Most real systems compose them — partition by range on day, sub-partition by hash on user_id — and pay the planner-complexity tax for both properties.

Indexes

Without an index, every query is a sequential scan: read every page, evaluate WHERE on every row. That's O(n) — fine for a thousand rows, ruinous for a billion. An index is a separate sorted structure mapping key values to row locations. Storage and write cost go up; reads go from O(n) to O(log n) or O(k) for range scans.

Two structures dominate.

B-trees

A B-tree is a balanced multi-way search tree designed for block storage. Each node fills exactly one disk page (4–16 KB in practice — 8 KB in PostgreSQL, 16 KB in InnoDB) and holds many keys — the fanout. A typical fanout of a few hundred means a billion-row index is only 4–5 levels deep; the top levels live in memory, so a point lookup is one or two disk reads. Every leaf is at the same depth — the tree is balanced by construction.

A 3-level B-tree (fanout shown small for clarity)25|6010|1835|4872|883,712,1520,2328,3240,4552,5865,7075,82internal nodes hold separator keys; leaves hold (key, row pointer) pairsleaves are linked left-to-right (B+ tree variant) so range scans are sequential I/Obalanced by construction: every leaf at the same depthon insert overflow, a node splits and promotes its median key to the parent
Splits stay local most of the time. On heavy bulk insert in key order — monotonically increasing IDs — the rightmost path splits repeatedly; modern engines detect this and use a fast path that avoids touching the whole spine.

When a leaf insert overflows, the leaf splits in two and the median key is promoted to the parent. If the parent also overflows, the split cascades up — possibly creating a new root. Splits are durable: each one writes a WAL record so crash recovery can reconstruct the new shape.

Worked example: a leaf split during insertion

Pretend each leaf holds at most four keys (real engines hold hundreds; four keeps the diagram readable). We insert key 27 into a small tree.

Before. The tree has a root with one separator 25 and two leaves. The left leaf holds [10, 18, 22, 24] and the right leaf holds [28, 35, 48, 60]. Both leaves are full.

              root: [ 25 ]
              /          \
   [ 10, 18, 22, 24 ]   [ 28, 35, 48, 60 ]

Step 1 — descend. Insert 27. At the root, 27 > 25, so follow the right pointer to the leaf [28, 35, 48, 60].

Step 2 — attempt insert. 27 belongs at the front of that leaf (27 < 28). But the leaf is already at capacity. Overflow.

Step 3 — split the leaf. The engine takes the five keys that would be in the overflowed leaf — [27, 28, 35, 48, 60] — and splits them around the median 35. The left half [27, 28] stays in the original leaf node; a new leaf node is allocated for the right half [35, 48, 60].

Step 4 — promote the median. The median key 35 is inserted into the parent (the root) as a new separator. The root had room, so this insertion succeeds without further splitting.

After.

            root: [ 25, 35 ]
           /       |        \
[10,18,22,24]  [27, 28]   [35, 48, 60]

Both new leaves are at the same depth — the tree stayed balanced.

What if the parent had also been full? The parent would itself split, promoting its median key one level up. In the worst case the split propagates all the way to the root, which becomes a new internal node and the tree grows one level taller. This is the only way a B-tree's height changes — every level is shared by every leaf, by construction.

Each step writes a WAL record (new leaf allocated, keys redistributed, parent updated) so a crash mid-split is replayable.

LSM trees

A log-structured merge tree flips the trade-off for write-heavy workloads. Writes go into an in-memory memtable; when it fills, the engine flushes it to disk as an immutable sorted file called an SSTable at level 0. A background compactor merges level-0 files into level 1, level 1 into level 2, and so on, with each level roughly 10× the size of the one above.

Reads must consult the memtable plus every SSTable that could hold the key. Bloom filters at each SSTable cut this to "the memtable plus one or two files" in the common case.

LSM tree: writes flow into the memtable, flush to L0, compact downward through exponentially larger levelsmemtablein-RAM · skip listflushL0L0L0L0L0 SSTables · may overlap in key rangecompactL1L1L1L1L1 ≈ 10× L0 · sorted, non-overlappingL2 (≈10× L1)… L3, L4, L5 — each ≈10× the previouswrites · append-only — fast (no random I/O until flush)reads · memtable + bloom-filtered SSTables + mergecompaction is the cost — write amplification of ≈10× total across all levels
RocksDB, Cassandra, ScyllaDB, LevelDB, and HBase all run variants of this. Without bloom filters, every point lookup would touch every level; with them, read amplification drops by 10–100×.
Worked example: data flowing through compaction levels

Pretend the memtable holds 4 keys and each level holds 4× the one above (so L0 caps at 4 SSTables, L1 at 16 keys, L2 at 64 keys). Real engines use much larger numbers; the ratios are the point.

Phase 1 — fill the memtable. Client writes arrive: put(7, "a"), put(3, "b"), put(11, "c"), put(7, "d"). The memtable now holds {3: "b", 7: "d", 11: "c"} — note 7 is overwritten in place, in memory.

Phase 2 — flush to L0. A fifth write put(20, "e") would overflow the memtable. The engine freezes the current memtable, opens a new empty one for incoming writes, and writes the frozen one to disk as a sorted file L0-file-1: [3, 7, 11]. L0 now has one file.

Phase 3 — repeat. The next four batches of writes flush as L0-file-2, L0-file-3, L0-file-4. L0 is now full (4 files). Critically, L0 files can overlap in key rangeL0-file-1 might cover [3..11] and L0-file-3 might cover [5..15], because each was a snapshot of a different memtable.

Phase 4 — compact L0 → L1. Compactor picks all four L0 files and the L1 files that overlap their combined key range (initially none). It runs a merge-sort across them, resolving any duplicate keys by keeping the newest (highest sequence number). It writes the result as one or more new files in L1, where files are non-overlapping in key range. The four L0 files are deleted.

before:   L0: [3..11] [5..15] [9..20] [2..18]   L1: (empty)
                       \   merge-sort by key   /
after:    L0: (empty)                          L1: [2..20]

Phase 5 — fast-forward. Many memtable flushes later, L1 has grown to 16 keys (full). Same dance: the compactor picks the L1 file plus any L2 files in its key range, merges, writes the result down to L2, deletes the inputs.

Reading key=7 after all this.

  1. Check the memtable — miss.
  2. Check each L0 file in newest-first order, gated by its bloom filter. Bloom filter for L0-file-2 says "definitely not present" — skip without reading the file. L0-file-1's bloom filter says "possibly present" — binary-search inside it, miss.
  3. Check L1. Files in L1 are sorted and non-overlapping, so we find at most one file whose range covers 7. Binary-search it. Found: 7: "d".

The bloom filters turn what would be "open every SSTable on disk" into "open one or two." That, plus the sequential-write friendliness of the memtable flush, is why LSMs sustain 5–10× more writes than a B-tree on the same hardware.

The trade-off is direct. B-tree wins reads: one tree traversal, predictable, cache-friendly. LSM wins writes: appends to memory are sequential and batched, so write throughput can be 5–10× higher, at the cost of compaction work and slower point reads. PostgreSQL, MySQL/InnoDB, SQLite use B-trees because transactional workloads balance reads and writes. Cassandra, ScyllaDB, RocksDB use LSM because their target is high write throughput at scale. Some systems (TiDB, CockroachDB) put a SQL B-tree-feeling API on top of an LSM key-value store underneath.

Pitfall — indexes only help when they're selective. An index pays off only when the query touches a small fraction of the table. Postgres switches to a sequential scan when an index lookup would return more than roughly 5–10% of rows; random-access I/O on many pages is slower than reading the whole table sequentially. Indexing booleans or any column with three values is rarely useful. Partial indexes (WHERE status = 'active') and covering indexes (include the columns the query reads) often are.

How one SELECT actually runs

With the planner, indexes, WAL, and MVCC defined, the path a single SELECT takes through them is concrete. The query is parsed and type-checked, the planner picks an access path, the executor walks the chosen index to find row pointers, the buffer pool (the database's in-memory cache of disk pages) supplies the heap pages those pointers reference, and an MVCC visibility filter discards row versions the transaction's snapshot should not see.

Lifecycle of one SELECT — parser, planner, executor, index, buffer pool, MVCCclient querySELECT * FROM ordersWHERE customer_id = 421 · ParserSQL → ASTresolve names · check types2 · Planner / optimizerpick idx_orders_customer over seq scanestimated rows = 303 · Executor · B-tree lookupdescend idx_orders_customer to 42return 30 (page, slot) row pointers4 · Buffer pool · fetch heap pagesmost pages hit cachecold pages → 1 read per miss5 · MVCC visibility filterdrop versions newer than this txndrop versions deleted before snapshot6 · Result rowsto clienta write would also:append a new row versionflush WAL record before committhe same SQL string drives a different physical plan tomorrow if statistics or table size change
Every layer covered so far appears in one round-trip: planner picks the access path, B-tree returns row pointers, buffer pool serves the heap pages, MVCC filters out versions outside the snapshot. A write adds a new row version and flushes a WAL record before COMMIT returns.

Beyond relational

The relational model is general-purpose. For any one access pattern, a specialised model is faster, simpler, or both. Each family below trades generality for one specific kind of question it answers very well.

  • Document stores (MongoDB, Couchbase) — store nested JSON/BSON documents under a key. Schemas vary per document. Queries reach inside (db.users.find({"address.city": "Paris"})). Strong for application objects that vary in shape; weak when joins across collections matter.
  • Graph databases (Neo4j, JanusGraph, Neptune) — model nodes and edges with index-free adjacency: each node holds direct pointers to its neighbours, so a multi-hop traversal is O(hops × avg_degree), not O(hops × log n) joins. Queries use Cypher, GQL, or Gremlin to pattern-match against subgraphs. Wins when the relationships are the data: social networks, fraud rings, supply chains.
  • Time-series (InfluxDB, TimescaleDB, Prometheus) — append-heavy, timestamped, often dropped on a retention policy. Storage is partitioned by time and compressed with delta-of-delta encoding to reach under 2 bytes per sample. Queries are time-range plus aggregation.
  • Vector databases (pgvector, Pinecone, Weaviate, Qdrant) — index high-dimensional float vectors (typically 768 or 1536 dimensions) for approximate nearest-neighbour search. The dominant index is HNSW, a multi-layer proximity graph that gives sub-linear k-NN at recall above 0.95. Hands forward to Act VIb where the embeddings come from.
Five database families, each shaped to its primary access patternRelationalDocumentGraphTime-seriesVectoridnamev1A·2B·{"id": 1,"addr": {"city":"Paris"},"tags":["a","b"]}index-free adjacencytimestamps · samples≈1.5 bytes/samplek-NN in 768-d spaceHNSW · IVF · cosinePostgreSQL, MySQL,SQLite, OracleMongoDB,CouchbaseNeo4j,JanusGraphInfluxDB,TimescaleDBpgvector,Pinecone
Each shape bets that one access pattern dominates the workload. Large systems usually combine several — relational for transactions, document for user content, time-series for telemetry, vector for semantic search — coordinated by an application layer above.

Lines blur. PostgreSQL adds JSONB (document), pg_trgm (text similarity), TimescaleDB hypertables, and pgvector to its relational core. CockroachDB and Spanner are relational over a globally distributed LSM. Category names are useful signposts, not strict containers.

Pitfall — eventual consistency is not free. Many of these stores trade strong consistency for availability under network partition. Reads can return stale data; writes can conflict and require application-level resolution (last-write-wins, CRDTs, vector clocks). Pick the consistency model deliberately. Eventual consistency is great for shopping carts and hostile for bank balances.

Data engineering

Operational databases serve transactions; the rest of the organisation wants the same data for analytics, dashboards, and ML training. Data engineering moves data from where it is born — the OLTP database, the event log, the third-party API — to where it is useful: the warehouse, the lakehouse, the feature store.

The deepest fault line in the data stack is OLTP (online transaction processing) versus OLAP (online analytical processing). The same row lives in two worlds with opposite priorities. An OLTP system serves thousands of small, latency-sensitive operations per second. An OLAP system answers a handful of large, throughput-sensitive aggregate queries. The systems, the storage layouts, and the engines that win at each are different.

OLTP and OLAP workloads pull storage and engines in opposite directionsOLTPOLAPtransactional · row-oriented · point opsanalytical · columnar · aggregate opsdimensionvaluedimensionvalueconcurrencythousands of txns/secquery shapepoint lookups, small writesrows touched1 to a fewlatency targetsingle-digit msstorage layoutrow-major (heap + B-tree)freshnessread-your-writesdata volumeGB to low TBcanonical enginePostgreSQL · MySQLconcurrencya few queries at oncequery shapescans, joins, aggregatesrows touchedmillions to billionslatency targetseconds to minutesstorage layoutcolumn-major (Parquet)freshnessminutes to a daydata volumeTB to PBcanonical engineSnowflake · BigQueryHTAP systems try to serve both — most teams just keep two systems and pipe between them
Most "the database is slow" reports trace to running an OLAP query against an OLTP store, or vice versa. Pick the right system for the workload and pipe data between them.

Row-major vs column-major storage

The reason OLAP needs a different system is the storage layout. A row-major heap stores all columns of one row contiguously — fast for SELECT * WHERE id = 7 (one read), terrible for SELECT AVG(amount) FROM 1B rows (one read per row, most bytes thrown away). A column-major store inverts that: all values of one column are stored together, so a scan reads only the bytes the query needs.

Row-major versus column-major storage of the same tableLogical tableidnameagecity1Alice30NYC2Bob42LA3Carol28NYCRow-major (heap)Column-major (Parquet)[ 1 · Alice · 30 · NYC ][ 2 · Bob · 42 · LA ][ 3 · Carol · 28 · NYC ]id123nameAliceBobCarolage304228cityNYCLANYC+ point lookups by row id are one read+ INSERTs are one append− scanning one column reads every row− poor compression (mixed types per page)+ scans of one column read only that column+ excellent compression (uniform type per page)+ vectorised SIMD execution per column− single-row INSERT/UPDATE is expensive
Parquet, ORC, and the native formats inside Snowflake, BigQuery, and Redshift all use this layout — plus dictionary encoding for low-cardinality columns, run-length for sorted columns, per-page min/max stats for predicate pushdown, and bloom filters for equality.

Batch vs stream

Two ways to move and process data, distinguished by when the engine sees the input.

Batch collects data over an interval — an hour, a day — and processes the whole interval at once. Tools: Spark, dbt, Airflow, the warehouse's own SQL. Latency equals the window size; throughput is high; reasoning is easy because every job sees a fixed, finite input.

Stream processes records as they arrive. Tools: Kafka Streams, Flink, Materialize, Spark Structured Streaming. Latency is sub-second; throughput depends on parallelism; reasoning is harder because the input never ends. Two new concepts come with that: windows (tumbling, sliding, session) over which to aggregate, and handling for late-arriving events that show up after their window already closed.

Batch processing aggregates closed windows; stream processing aggregates over open onesBatchStreamscheduled jobs over fixed windowscontinuous over open windows00:0006:0012:0018:0024:00jobjobjobjob+ exact aggregates over closed windows+ deterministic, easy to retry+ scales with data volume− latency = window size− reprocessing ripples downstreamtumblingtumblingtumblingtumblingoverlapping (sliding) windows+ sub-second latency+ continuous aggregates and joins− ordering, lateness, exactly-once are hard− window choice changes the answer− back-pressure when consumers fall behindmost platforms run both — stream for fresh, batch for canonical reconciliation
Batch and stream aren't opposites; they're points on a continuum. Micro-batch (Spark Structured Streaming) and continuous-batch (dbt rebuild every 5 min) sit between them.

The streaming side is usually fed by change data capture (CDC). Instead of polling the OLTP database, a CDC connector tails the database's WAL and emits each row mutation as an event in commit order. The OLTP database pays nothing extra; downstream consumers learn about every insert, update, and delete in near real time.

Change data capture: the OLTP database's WAL becomes a stream of row-level eventsapplicationOLTP databaseCDC connectorlog / brokerconsumersapptablesWALDebeziumKafkatopicINSERT/UPDATEcommit appendsto WALeventsWAL positionoffset 100offset 101offset 102offset 103offset 104warehousesearch indexcache invalidatoraudit / lineageevent shape · { before, after, op (c|u|d), source: {ts, lsn, table} }guarantees · at-least-once delivery, in-commit-order, snapshot + tail bootstrappitfalls · schema-change events need handling; huge transactions blow the connector buffer
Debezium, AWS DMS, and the native CDC inside modern warehouses all run this pattern. The downstream effect is profound: the OLTP database becomes the source of truth for an event-driven architecture without changing the application code.

Warehouse vs lakehouse

A data warehouse (Snowflake, BigQuery, Redshift) is a managed, columnar, analytical database. Storage is proprietary and tightly coupled to the engine. Query performance is excellent because the engine knows everything about how the data is laid out. The cost is lock-in: getting data out to train an ML model or feed a different engine means exporting and paying for egress.

A lakehouse keeps the columnar storage in open file formats (Parquet, ORC) on cheap object storage and adds a table format layer (Iceberg, Delta Lake, Hudi) that brings ACID transactions, schema evolution, time travel, and partition management to those raw files. Multiple engines (Spark, Trino, DuckDB, Snowflake-on-Iceberg) can read the same tables without copying the data.

Closed warehouse vs open lakehouse: where the data lives and who can read itWarehouse (closed)Lakehouse (open)SQL engineSnowflake · BigQuery · RedshiftProprietary columnar storageopaque file format · vendor-tied+ best query performance — engine and storage co-designed+ workload management, governance in one product− lock-in: Spark, DuckDB cannot read the bytes directly− egress to other tools means exportingSparkTrinoDuckDBTable formatIceberg · Delta · Hudi — ACID + time travel + schema evolutionParquet / ORC files on object storageS3 · GCS · Azure Blob — open, columnar, public spec+ same bytes readable by many engines — no copy+ storage and compute scale and price independently+ open formats survive vendors− less integrated than a single-vendor warehouse− table format choice (Iceberg vs Delta vs Hudi) is a real commitment
The lakehouse pattern made open warehousing competitive with proprietary engines. Modern Snowflake, BigQuery, and Databricks all read external Iceberg tables alongside their native storage.

The table format does heavy lifting. Iceberg's layout, for example, layers a snapshot pointer, a manifest list, per-file manifests with statistics, and then the actual Parquet data files. Every commit writes a new snapshot; old snapshots stay around for time-travel queries (SELECT … FOR VERSION AS OF 17) until expired. That is how an eventually-consistent object store gets ACID semantics at the table level.

ETL, ELT, and lineage

The classic ETL sequence — Extract, Transform, Load — pulled data, transformed it in a separate engine, and loaded the result. ELT flips the last two steps: load raw data into the warehouse first and transform inside the warehouse with SQL. Cheap warehouse compute and tools like dbt (which compiles a graph of SQL SELECT statements into materialised tables with tests and lineage) made ELT the default for analytics.

When the same number — daily revenue, active users, model accuracy — appears in five dashboards and disagrees, someone has to answer "where did this number come from?" Lineage is the system of record for that question. At column granularity it tracks: this dashboard column came from that warehouse column, which came from that dbt join, which came from that Postgres table, which came from that Kafka event, which came from that application endpoint. OpenLineage standardises the events; DataHub, OpenMetadata, and Atlan visualise the graph.

Pitfall — pipelines silently break shape. Schema-on-read formats (JSON, untyped Parquet) make it easy to add a field upstream and not notice that downstream parsers now silently drop or mis-type it. Schema registries and column-level contracts catch this at the producer; without them, the bug surfaces three weeks later when a quarterly report disagrees with itself.

Standards

  • SQLISO/IEC 9075 (parts 1–14, current SQL:2023). Defines the language, the relational operators, the type system, JSON support, and SQL/PGQ (graph queries inside SQL). Vendors implement subsets and extensions; the standard is the lingua franca.
  • Apache Parquetparquet.apache.org specification. Columnar, page-based file format with per-column compression and statistics; the de facto interchange format for analytics. ORC (orc.apache.org) is the close cousin from the Hive lineage.
  • Apache Icebergiceberg.apache.org/spec/ (current spec v3). Table format defining metadata structure, snapshot semantics, partition spec evolution, and atomic commits over object storage.
  • Delta Lakedelta.io/protocol. Linux Foundation's table format originating from Databricks; competitor and now-interoperable peer of Iceberg.
  • Apache Hudihudi.apache.org/docs/concepts. Third major table format; emphasizes incremental processing and upserts.
  • Avroavro.apache.org/docs/. Compact, schema-defined binary serialization for record streams; the canonical Kafka payload format. Handed forward from Act I (serialization) — used here for stream payloads with registry-backed schema evolution.
  • Apache Arrowarrow.apache.org/docs/format/Columnar.html. In-memory columnar layout for zero-copy interchange between query engines, ML frameworks, and language runtimes; Arrow Flight (format/Flight.html) is the RPC for moving Arrow batches.
  • JDBC / ODBC — Java SE JDBC API Specification (Oracle, JSR 221) and Microsoft ODBC Programmer's Reference. The lingua-franca client APIs every relational database implements; everything from BI tools to ETL tools targets these.
  • PostgreSQL wire protocolpostgresql.org/docs/current/protocol.html. Defines the message-oriented client/server protocol; many systems (CockroachDB, YugabyteDB, Materialize, Hyper) re-implement it to ride existing client libraries.
  • MongoDB Query Language (MQL)mongodb.com/docs/manual/reference/operator/. Query, projection, and update operators over BSON documents; the aggregation pipeline is the warehouse-flavoured cousin.
  • Cypheropencypher.org. Graph pattern-matching language originating in Neo4j; the basis for GQL — ISO/IEC 39075:2024, the first ISO standard for graph queries.
  • PromQLprometheus.io/docs/prometheus/latest/querying/basics/. Functional query language for time-series; the de facto standard for metrics queries even outside Prometheus (Grafana, Cortex, Mimir, Thanos, VictoriaMetrics all support it).
  • HNSWYu. A. Malkov & D. A. Yashunin, Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs (IEEE TPAMI, 2018). The dominant graph-based ANN index. IVF (inverted file) variants — H. Jégou, M. Douze, C. Schmid, Product Quantization for Nearest Neighbor Search (IEEE TPAMI, 2011) — are the partition-based alternative.
  • OpenLineageopenlineage.io/spec/. Open standard for emitting lineage events from data pipelines; ingested by DataHub, OpenMetadata, Marquez.
  • WAL / ARIES recovery — Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging (ACM TODS, 1992). The recovery algorithm under InnoDB, PostgreSQL, SQL Server, and DB2.
  • Forward refsAct VIb (Intelligence) builds directly on vector stores and embeddings introduced here; Act V (Connection) supplies the network protocols that ship rows between replicas and across pipelines.
Going deeper

Branches that earn their own article.

  • Query planning and optimization.
  • Storage engine internals (B-tree pages, WAL, compaction).
  • Individual database deep dives (PostgreSQL, MySQL, SQLite, DynamoDB, Cassandra, MongoDB).
  • Replication and sharding for databases.
  • Time-series database internals (InfluxDB, TimescaleDB).
  • Graph database engines (Neo4j, TigerGraph).
  • Vector search algorithms (HNSW, IVF).
  • Data warehouse architecture (Snowflake, BigQuery, Redshift).
  • Lakehouse formats (Apache Iceberg, Delta Lake, Hudi).
  • Stream processing frameworks (Kafka Streams, Flink, Spark Streaming).
  • Data orchestration (Airflow, Dagster, Prefect).
  • Data quality and testing (Great Expectations, dbt tests).
  • Schema evolution strategies.
  • Data governance and catalogs.