Data
A program holds its working data in memory. The process exits and that memory disappears. Two processes touching the same file race each other and corrupt each other's writes. Power can fail mid-write and leave a record half on disk. A database is a separate process that owns the storage and exposes a transactional API, so every application stops re-implementing crash recovery and concurrent access by hand.
This page walks up the database stack — bytes on a page, indexes that find rows, transactions that hold invariants, replication that survives node loss, query planners that pick a plan, and the warehouse pipelines that move all of it where analytics needs it.
Why databases exist
Two problems break the naive "just write to a file" approach.
Crashes truncate writes mid-flight. A multi-page update interrupted by a power loss leaves part of the change on disk and part missing — a torn write. Recovering means writing a write-ahead log by hand, calling fsync correctly on every platform, replaying partial records on restart, and ordering the log relative to the data file. Getting any one detail wrong silently corrupts the file the next time the cord is yanked.
Concurrent writers race. Two processes that read-modify-write the same record can both read the old value, both compute a new one, and both write — a lost update. Operating-system advisory locks help on a single machine; over NFS or SMB their semantics quietly weaken.
A database centralises this work. One process owns the files, exposes a transactional API (BEGIN, COMMIT, ROLLBACK), and is responsible for two promises every application then takes for granted:
- Durability — once
COMMITreturns, the write survives a crash. - Isolation — concurrent transactions see a consistent view, never each other's half-finished work.
Everything else — schemas, indexes, query languages, replication — is built on top of those two.
The cost is fsync. Durability ultimately depends on forcing writes out of the kernel's page cache to the device. fsync stalls until the device confirms — typically 1–10 ms on consumer SSDs, longer on cloud disks. Group commit batches multiple transactions into one fsync call to amortise the wait. Skipping fsync looks fast in benchmarks and corrupts the database at the next power loss.
The relational model
Before the relational model, an application asked for a record by chasing pointers between files — the schema and the access path were the same thing, and changing one meant rewriting the other. The relational model separates the two. Data is a set of tables (relations) of typed rows over a fixed schema. The application says what it wants in SQL; the database picks how to get it.
A primary key uniquely identifies each row in a table. A foreign key in one table references the primary key of another, encoding a relationship. The operations on relations — filter, project, join, group, aggregate — are the relational algebra, and SQL is its declarative form.
SELECT u.name, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.created_at > '2025-01-01'
GROUP BY u.id, u.name;
That query never says whether to scan users first or orders first, whether to use a hash join or an index lookup, whether to read from a B-tree or a sequential scan. A query planner makes those choices from statistics it keeps about the data: row counts, value distributions, index selectivity, correlations between columns.
EXPLAIN exposes the choice; ANALYZE runs it and reports actual row counts, exposing planner mis-estimates.The optimizer enumerates physical plans and picks the cheapest under a cost model of rows-and-I/O per operator. Mis-estimated row counts are the single biggest reason production plans go wrong. If the optimizer expects 100 rows and gets 10 million, it will choose nested-loop where it should have chosen hash join, and a 50 ms query becomes a 50 minute query.
Normalisation factors tables so each fact lives in exactly one place. Third Normal Form (3NF) is the usual target for transactional databases: every non-key column depends on the whole primary key and nothing else. When reads dominate, duplicating data — denormalisation — trades write complexity for read speed. Analytical warehouses denormalise aggressively because they almost never update.
Pitfall — NULL is not equal to itself. SQL's logic has three values, and NULL = NULL evaluates to NULL (not true). WHERE x = NULL matches no rows; the correct form is WHERE x IS NULL. NOT IN with a NULL in the list rejects every row. COUNT(*) counts NULLs but COUNT(col) skips them. These rules are the source of long-lived bugs in queries that look correct at a glance.
Transactions and ACID
Without transactions, a multi-step change can fail partway and leave the database in an impossible state — a transfer that debits one account but never credits the other. A transaction is a unit of work the database treats as one indivisible step. ACID names the four promises that step keeps:
- Atomicity — all writes commit together or none do. A partial transaction is impossible after
COMMITorROLLBACK. - Consistency — the database moves from one valid state to another. Declared constraints (foreign keys, uniqueness, checks) are preserved across the boundary. The 'C' is partly the application's job — the database only enforces what you declared.
- Isolation — concurrent transactions appear to run as if alone. The strength of "appear" is the isolation level.
- Durability — once
COMMITreturns, the change survives a crash.
The write-ahead log
Atomicity and durability share one implementation: the write-ahead log (WAL). The rule is one line: a log record describing a change must reach durable storage before the data page it changes does. Commit then becomes one fsync on a small sequential file, instead of flushing every modified data page. Data files catch up lazily in the background; on crash, recovery replays the log from the last checkpoint to bring them back in sync.
Worked example: crash recovery from the WAL
Three transactions are in flight when the power dies. We walk what the log holds, what gets replayed, and what gets discarded.
State just before the crash. Last checkpoint was at LSN 40 (the database has guaranteed every change up to LSN 40 is already in the data files). The log on disk holds:
| LSN | Record | Meaning |
|---|---|---|
| 41 | T1: UPDATE accounts SET bal=80 WHERE id=1 | T1 first write |
| 42 | T1: UPDATE accounts SET bal=120 WHERE id=2 | T1 second write |
| 43 | T1: UPDATE accounts SET bal=50 WHERE id=3 | T1 third write |
| 44 | T1: COMMIT | T1 durably committed (fsync returned) |
| 45 | T2: UPDATE accounts SET bal=200 WHERE id=4 | T2 first write |
| 46 | T2: UPDATE accounts SET bal=10 WHERE id=5 | T2 second write |
| 47 | T2: COMMIT | T2 durably committed |
| 48 | T3: UPDATE accounts SET bal=999 WHERE id=6 | T3 wrote but never committed |
At LSN 48 the cord is yanked. T3 had not written a COMMIT record yet.
On restart, the database opens the log and runs three passes:
- Analysis — read forward from the last checkpoint (LSN 40). Build two lists: committed transactions
{T1, T2}(saw theirCOMMITrecords) and in-flight{T3}(noCOMMITseen). - REDO — replay every log record from LSN 41 onward against the data pages, regardless of whether the transaction committed. After this pass, the data files reflect every write in the log, including T3's uncommitted
bal=999. - UNDO — walk the in-flight list
{T3}and reverse its writes. The WAL records carry the old value as well as the new, so undoing LSN 48 means writingbalback to whatever it was before. After undo,id=6is restored.
Result. T1 and T2 are durable — their COMMIT was fsynced before the crash, so the client was promised they would survive, and they did. T3 vanishes — no commit was acknowledged to the client, so erasing it breaks no promise. The data files now match what would have been visible to a reader at the moment of T2's commit.
The key invariant: a COMMIT record reaching the log on disk is the only event that makes a transaction durable. Page writes can lag arbitrarily behind; the log is the source of truth.
Isolation levels and anomalies
Isolation describes what concurrent transactions are allowed to see of each other. Stronger isolation means fewer anomalies but more coordination — more locking, more aborts, lower throughput. Four anomalies define the levels:
- Dirty read — read a value another transaction has written but not yet committed. If it rolls back, you read fiction.
- Non-repeatable read — same query returns different values for the same row because someone updated it between your two reads.
- Phantom read — same range query returns different rows because someone inserted or deleted in between.
- Write skew — two transactions each read a set of rows, each makes a decision from what they read, each writes — and the combined writes break an invariant neither transaction's snapshot saw violated.
The classic example of write skew: two on-call doctors each query "are at least two doctors on call?", each sees the answer "yes", and each goes off-call simultaneously. Neither transaction's snapshot saw the invariant violated; the combined effect of their writes did.
SELECT … FOR UPDATE), a constraint that catches the post-state, or SERIALIZABLE isolation that watches for dangerous read/write patterns.Locking vs MVCC
Two implementation strategies give isolation.
Two-phase locking (2PL) acquires locks (shared for reads, exclusive for writes) and holds them until commit. Once any lock has been released, no new lock may be acquired. This produces serialisable schedules, at the cost that readers block writers and writers block readers. Deadlocks are inevitable; the engine detects cycles in a wait-for graph and aborts one transaction as the victim.
Multi-Version Concurrency Control (MVCC) keeps multiple versions of each row instead of locking. Every row carries two transaction IDs — xmin (the transaction that created it) and xmax (the transaction that deleted or replaced it). A transaction reading at snapshot S sees only versions where xmin is committed and visible to S, and xmax is either null or invisible to S. Writers create new versions; old versions are reclaimed by a background process — VACUUM in PostgreSQL, the purge thread in InnoDB.
Pitfall — long-running transactions hold back vacuum. A row version cannot be reclaimed while any open transaction might still see it. A reporting query that holds a snapshot open for an hour stops vacuum from cleaning up an hour's worth of dead rows; the table bloats and every query slows down. Watch pg_stat_activity for ancient transactions and set idle_in_transaction_session_timeout.
Worked example: MVCC visibility across three transactions
Three transactions touch the same row in PostgreSQL. We track which version each sees.
Setup. Row account(id=1, balance=100) exists as a tuple with (xmin=50, xmax=null). Transaction 50 committed long ago.
T=0 — TxA starts (txid=200). TxA's snapshot: "txids up to 199 visible if committed; nothing later."
TxA reads account(id=1):
finds v1: (xmin=50, xmax=null, balance=100)
xmin=50 visible? yes
xmax=null → tuple is live
→ TxA sees balance=100
T=5 ms — TxB starts (txid=201). TxA at txid=200 is active, not committed, so TxB will not consider TxA's writes visible.
T=10 ms — TxA UPDATE balance to 150. Postgres does not modify v1. It inserts v2 (xmin=200, xmax=null, balance=150) and marks v1 dead-by-200 (xmin=50, xmax=200, balance=100). The row has two physical versions on disk.
T=15 ms — TxB reads. TxB walks the version chain:
v1: xmax=200 — TxA still active in TxB's snapshot
→ treat deletion as not yet happened
→ TxB sees v1: balance=100
v2: xmin=200 — TxA still active
→ skip
TxB sees balance=100. Snapshot isolation in action.
T=20 ms — TxA commits. Globally, txid 200 is now committed.
T=25 ms — TxB re-reads. TxB's snapshot was captured at T=5 ms and is fixed for its lifetime. Txid 200 is still not in TxB's committed set. TxB still sees balance=100. This is what "repeatable read" means.
T=30 ms — TxC starts (txid=202). Its snapshot now considers 200 committed. TxC sees v2: balance=150.
The cleanup tax. v1 is dead from every perspective except TxB's. As long as TxB lives, v1 stays on disk. After TxB ends, VACUUM reclaims its space. A five-minute analytical TxB can pin gigabytes of dead tuples.
Replication and sharding
A single-node database eventually runs out of hardware or becomes a single point of failure. Two answers, often combined.
Replication keeps copies of the same data on multiple nodes. The trade-off is between consistency, latency, and availability when nodes or networks fail.
Sharding splits one logical table across many physical nodes by a partition key, so write throughput and dataset size scale beyond one machine. The partitioning strategy decides which queries stay fast and where hot spots form.
Indexes
Without an index, every query is a sequential scan: read every page, evaluate WHERE on every row. That's O(n) — fine for a thousand rows, ruinous for a billion. An index is a separate sorted structure mapping key values to row locations. Storage and write cost go up; reads go from O(n) to O(log n) or O(k) for range scans.
Two structures dominate.
B-trees
A B-tree is a balanced multi-way search tree designed for block storage. Each node fills exactly one disk page (4–16 KB in practice — 8 KB in PostgreSQL, 16 KB in InnoDB) and holds many keys — the fanout. A typical fanout of a few hundred means a billion-row index is only 4–5 levels deep; the top levels live in memory, so a point lookup is one or two disk reads. Every leaf is at the same depth — the tree is balanced by construction.
When a leaf insert overflows, the leaf splits in two and the median key is promoted to the parent. If the parent also overflows, the split cascades up — possibly creating a new root. Splits are durable: each one writes a WAL record so crash recovery can reconstruct the new shape.
Worked example: a leaf split during insertion
Pretend each leaf holds at most four keys (real engines hold hundreds; four keeps the diagram readable). We insert key 27 into a small tree.
Before. The tree has a root with one separator 25 and two leaves. The left leaf holds [10, 18, 22, 24] and the right leaf holds [28, 35, 48, 60]. Both leaves are full.
root: [ 25 ]
/ \
[ 10, 18, 22, 24 ] [ 28, 35, 48, 60 ]
Step 1 — descend. Insert 27. At the root, 27 > 25, so follow the right pointer to the leaf [28, 35, 48, 60].
Step 2 — attempt insert. 27 belongs at the front of that leaf (27 < 28). But the leaf is already at capacity. Overflow.
Step 3 — split the leaf. The engine takes the five keys that would be in the overflowed leaf — [27, 28, 35, 48, 60] — and splits them around the median 35. The left half [27, 28] stays in the original leaf node; a new leaf node is allocated for the right half [35, 48, 60].
Step 4 — promote the median. The median key 35 is inserted into the parent (the root) as a new separator. The root had room, so this insertion succeeds without further splitting.
After.
root: [ 25, 35 ]
/ | \
[10,18,22,24] [27, 28] [35, 48, 60]
Both new leaves are at the same depth — the tree stayed balanced.
What if the parent had also been full? The parent would itself split, promoting its median key one level up. In the worst case the split propagates all the way to the root, which becomes a new internal node and the tree grows one level taller. This is the only way a B-tree's height changes — every level is shared by every leaf, by construction.
Each step writes a WAL record (new leaf allocated, keys redistributed, parent updated) so a crash mid-split is replayable.
LSM trees
A log-structured merge tree flips the trade-off for write-heavy workloads. Writes go into an in-memory memtable; when it fills, the engine flushes it to disk as an immutable sorted file called an SSTable at level 0. A background compactor merges level-0 files into level 1, level 1 into level 2, and so on, with each level roughly 10× the size of the one above.
Reads must consult the memtable plus every SSTable that could hold the key. Bloom filters at each SSTable cut this to "the memtable plus one or two files" in the common case.
Worked example: data flowing through compaction levels
Pretend the memtable holds 4 keys and each level holds 4× the one above (so L0 caps at 4 SSTables, L1 at 16 keys, L2 at 64 keys). Real engines use much larger numbers; the ratios are the point.
Phase 1 — fill the memtable. Client writes arrive: put(7, "a"), put(3, "b"), put(11, "c"), put(7, "d"). The memtable now holds {3: "b", 7: "d", 11: "c"} — note 7 is overwritten in place, in memory.
Phase 2 — flush to L0. A fifth write put(20, "e") would overflow the memtable. The engine freezes the current memtable, opens a new empty one for incoming writes, and writes the frozen one to disk as a sorted file L0-file-1: [3, 7, 11]. L0 now has one file.
Phase 3 — repeat. The next four batches of writes flush as L0-file-2, L0-file-3, L0-file-4. L0 is now full (4 files). Critically, L0 files can overlap in key range — L0-file-1 might cover [3..11] and L0-file-3 might cover [5..15], because each was a snapshot of a different memtable.
Phase 4 — compact L0 → L1. Compactor picks all four L0 files and the L1 files that overlap their combined key range (initially none). It runs a merge-sort across them, resolving any duplicate keys by keeping the newest (highest sequence number). It writes the result as one or more new files in L1, where files are non-overlapping in key range. The four L0 files are deleted.
before: L0: [3..11] [5..15] [9..20] [2..18] L1: (empty)
\ merge-sort by key /
after: L0: (empty) L1: [2..20]
Phase 5 — fast-forward. Many memtable flushes later, L1 has grown to 16 keys (full). Same dance: the compactor picks the L1 file plus any L2 files in its key range, merges, writes the result down to L2, deletes the inputs.
Reading key=7 after all this.
- Check the memtable — miss.
- Check each L0 file in newest-first order, gated by its bloom filter. Bloom filter for
L0-file-2says "definitely not present" — skip without reading the file.L0-file-1's bloom filter says "possibly present" — binary-search inside it, miss. - Check L1. Files in L1 are sorted and non-overlapping, so we find at most one file whose range covers
7. Binary-search it. Found:7: "d".
The bloom filters turn what would be "open every SSTable on disk" into "open one or two." That, plus the sequential-write friendliness of the memtable flush, is why LSMs sustain 5–10× more writes than a B-tree on the same hardware.
The trade-off is direct. B-tree wins reads: one tree traversal, predictable, cache-friendly. LSM wins writes: appends to memory are sequential and batched, so write throughput can be 5–10× higher, at the cost of compaction work and slower point reads. PostgreSQL, MySQL/InnoDB, SQLite use B-trees because transactional workloads balance reads and writes. Cassandra, ScyllaDB, RocksDB use LSM because their target is high write throughput at scale. Some systems (TiDB, CockroachDB) put a SQL B-tree-feeling API on top of an LSM key-value store underneath.
Pitfall — indexes only help when they're selective. An index pays off only when the query touches a small fraction of the table. Postgres switches to a sequential scan when an index lookup would return more than roughly 5–10% of rows; random-access I/O on many pages is slower than reading the whole table sequentially. Indexing booleans or any column with three values is rarely useful. Partial indexes (WHERE status = 'active') and covering indexes (include the columns the query reads) often are.
How one SELECT actually runs
With the planner, indexes, WAL, and MVCC defined, the path a single SELECT takes through them is concrete. The query is parsed and type-checked, the planner picks an access path, the executor walks the chosen index to find row pointers, the buffer pool (the database's in-memory cache of disk pages) supplies the heap pages those pointers reference, and an MVCC visibility filter discards row versions the transaction's snapshot should not see.
COMMIT returns.Beyond relational
The relational model is general-purpose. For any one access pattern, a specialised model is faster, simpler, or both. Each family below trades generality for one specific kind of question it answers very well.
- Document stores (MongoDB, Couchbase) — store nested JSON/BSON documents under a key. Schemas vary per document. Queries reach inside (
db.users.find({"address.city": "Paris"})). Strong for application objects that vary in shape; weak when joins across collections matter. - Graph databases (Neo4j, JanusGraph, Neptune) — model nodes and edges with index-free adjacency: each node holds direct pointers to its neighbours, so a multi-hop traversal is O(hops × avg_degree), not O(hops × log n) joins. Queries use Cypher, GQL, or Gremlin to pattern-match against subgraphs. Wins when the relationships are the data: social networks, fraud rings, supply chains.
- Time-series (InfluxDB, TimescaleDB, Prometheus) — append-heavy, timestamped, often dropped on a retention policy. Storage is partitioned by time and compressed with delta-of-delta encoding to reach under 2 bytes per sample. Queries are time-range plus aggregation.
- Vector databases (pgvector, Pinecone, Weaviate, Qdrant) — index high-dimensional float vectors (typically 768 or 1536 dimensions) for approximate nearest-neighbour search. The dominant index is HNSW, a multi-layer proximity graph that gives sub-linear k-NN at recall above 0.95. Hands forward to Act VIb where the embeddings come from.
Lines blur. PostgreSQL adds JSONB (document), pg_trgm (text similarity), TimescaleDB hypertables, and pgvector to its relational core. CockroachDB and Spanner are relational over a globally distributed LSM. Category names are useful signposts, not strict containers.
Pitfall — eventual consistency is not free. Many of these stores trade strong consistency for availability under network partition. Reads can return stale data; writes can conflict and require application-level resolution (last-write-wins, CRDTs, vector clocks). Pick the consistency model deliberately. Eventual consistency is great for shopping carts and hostile for bank balances.
Data engineering
Operational databases serve transactions; the rest of the organisation wants the same data for analytics, dashboards, and ML training. Data engineering moves data from where it is born — the OLTP database, the event log, the third-party API — to where it is useful: the warehouse, the lakehouse, the feature store.
The deepest fault line in the data stack is OLTP (online transaction processing) versus OLAP (online analytical processing). The same row lives in two worlds with opposite priorities. An OLTP system serves thousands of small, latency-sensitive operations per second. An OLAP system answers a handful of large, throughput-sensitive aggregate queries. The systems, the storage layouts, and the engines that win at each are different.
Row-major vs column-major storage
The reason OLAP needs a different system is the storage layout. A row-major heap stores all columns of one row contiguously — fast for SELECT * WHERE id = 7 (one read), terrible for SELECT AVG(amount) FROM 1B rows (one read per row, most bytes thrown away). A column-major store inverts that: all values of one column are stored together, so a scan reads only the bytes the query needs.
Batch vs stream
Two ways to move and process data, distinguished by when the engine sees the input.
Batch collects data over an interval — an hour, a day — and processes the whole interval at once. Tools: Spark, dbt, Airflow, the warehouse's own SQL. Latency equals the window size; throughput is high; reasoning is easy because every job sees a fixed, finite input.
Stream processes records as they arrive. Tools: Kafka Streams, Flink, Materialize, Spark Structured Streaming. Latency is sub-second; throughput depends on parallelism; reasoning is harder because the input never ends. Two new concepts come with that: windows (tumbling, sliding, session) over which to aggregate, and handling for late-arriving events that show up after their window already closed.
The streaming side is usually fed by change data capture (CDC). Instead of polling the OLTP database, a CDC connector tails the database's WAL and emits each row mutation as an event in commit order. The OLTP database pays nothing extra; downstream consumers learn about every insert, update, and delete in near real time.
Warehouse vs lakehouse
A data warehouse (Snowflake, BigQuery, Redshift) is a managed, columnar, analytical database. Storage is proprietary and tightly coupled to the engine. Query performance is excellent because the engine knows everything about how the data is laid out. The cost is lock-in: getting data out to train an ML model or feed a different engine means exporting and paying for egress.
A lakehouse keeps the columnar storage in open file formats (Parquet, ORC) on cheap object storage and adds a table format layer (Iceberg, Delta Lake, Hudi) that brings ACID transactions, schema evolution, time travel, and partition management to those raw files. Multiple engines (Spark, Trino, DuckDB, Snowflake-on-Iceberg) can read the same tables without copying the data.
The table format does heavy lifting. Iceberg's layout, for example, layers a snapshot pointer, a manifest list, per-file manifests with statistics, and then the actual Parquet data files. Every commit writes a new snapshot; old snapshots stay around for time-travel queries (SELECT … FOR VERSION AS OF 17) until expired. That is how an eventually-consistent object store gets ACID semantics at the table level.
ETL, ELT, and lineage
The classic ETL sequence — Extract, Transform, Load — pulled data, transformed it in a separate engine, and loaded the result. ELT flips the last two steps: load raw data into the warehouse first and transform inside the warehouse with SQL. Cheap warehouse compute and tools like dbt (which compiles a graph of SQL SELECT statements into materialised tables with tests and lineage) made ELT the default for analytics.
When the same number — daily revenue, active users, model accuracy — appears in five dashboards and disagrees, someone has to answer "where did this number come from?" Lineage is the system of record for that question. At column granularity it tracks: this dashboard column came from that warehouse column, which came from that dbt join, which came from that Postgres table, which came from that Kafka event, which came from that application endpoint. OpenLineage standardises the events; DataHub, OpenMetadata, and Atlan visualise the graph.
Pitfall — pipelines silently break shape. Schema-on-read formats (JSON, untyped Parquet) make it easy to add a field upstream and not notice that downstream parsers now silently drop or mis-type it. Schema registries and column-level contracts catch this at the producer; without them, the bug surfaces three weeks later when a quarterly report disagrees with itself.
Standards
- SQL — ISO/IEC 9075 (parts 1–14, current SQL:2023). Defines the language, the relational operators, the type system, JSON support, and SQL/PGQ (graph queries inside SQL). Vendors implement subsets and extensions; the standard is the lingua franca.
- Apache Parquet —
parquet.apache.orgspecification. Columnar, page-based file format with per-column compression and statistics; the de facto interchange format for analytics. ORC (orc.apache.org) is the close cousin from the Hive lineage. - Apache Iceberg —
iceberg.apache.org/spec/(current spec v3). Table format defining metadata structure, snapshot semantics, partition spec evolution, and atomic commits over object storage. - Delta Lake —
delta.io/protocol. Linux Foundation's table format originating from Databricks; competitor and now-interoperable peer of Iceberg. - Apache Hudi —
hudi.apache.org/docs/concepts. Third major table format; emphasizes incremental processing and upserts. - Avro —
avro.apache.org/docs/. Compact, schema-defined binary serialization for record streams; the canonical Kafka payload format. Handed forward from Act I (serialization) — used here for stream payloads with registry-backed schema evolution. - Apache Arrow —
arrow.apache.org/docs/format/Columnar.html. In-memory columnar layout for zero-copy interchange between query engines, ML frameworks, and language runtimes; Arrow Flight (format/Flight.html) is the RPC for moving Arrow batches. - JDBC / ODBC — Java SE JDBC API Specification (Oracle, JSR 221) and Microsoft ODBC Programmer's Reference. The lingua-franca client APIs every relational database implements; everything from BI tools to ETL tools targets these.
- PostgreSQL wire protocol —
postgresql.org/docs/current/protocol.html. Defines the message-oriented client/server protocol; many systems (CockroachDB, YugabyteDB, Materialize, Hyper) re-implement it to ride existing client libraries. - MongoDB Query Language (MQL) —
mongodb.com/docs/manual/reference/operator/. Query, projection, and update operators over BSON documents; the aggregation pipeline is the warehouse-flavoured cousin. - Cypher —
opencypher.org. Graph pattern-matching language originating in Neo4j; the basis for GQL — ISO/IEC 39075:2024, the first ISO standard for graph queries. - PromQL —
prometheus.io/docs/prometheus/latest/querying/basics/. Functional query language for time-series; the de facto standard for metrics queries even outside Prometheus (Grafana, Cortex, Mimir, Thanos, VictoriaMetrics all support it). - HNSW — Yu. A. Malkov & D. A. Yashunin, Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs (IEEE TPAMI, 2018). The dominant graph-based ANN index. IVF (inverted file) variants — H. Jégou, M. Douze, C. Schmid, Product Quantization for Nearest Neighbor Search (IEEE TPAMI, 2011) — are the partition-based alternative.
- OpenLineage —
openlineage.io/spec/. Open standard for emitting lineage events from data pipelines; ingested by DataHub, OpenMetadata, Marquez. - WAL / ARIES recovery — Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging (ACM TODS, 1992). The recovery algorithm under InnoDB, PostgreSQL, SQL Server, and DB2.
- Forward refs — Act VIb (Intelligence) builds directly on vector stores and embeddings introduced here; Act V (Connection) supplies the network protocols that ship rows between replicas and across pipelines.
Branches that earn their own article.
- Query planning and optimization.
- Storage engine internals (B-tree pages, WAL, compaction).
- Individual database deep dives (PostgreSQL, MySQL, SQLite, DynamoDB, Cassandra, MongoDB).
- Replication and sharding for databases.
- Time-series database internals (InfluxDB, TimescaleDB).
- Graph database engines (Neo4j, TigerGraph).
- Vector search algorithms (HNSW, IVF).
- Data warehouse architecture (Snowflake, BigQuery, Redshift).
- Lakehouse formats (Apache Iceberg, Delta Lake, Hudi).
- Stream processing frameworks (Kafka Streams, Flink, Spark Streaming).
- Data orchestration (Airflow, Dagster, Prefect).
- Data quality and testing (Great Expectations, dbt tests).
- Schema evolution strategies.
- Data governance and catalogs.