Data
A database is a promise about durability and consistency; ML is a bet that patterns in old data predict new data.
On this page
The working table of contents.
- Why databases exist — programs need data to survive restarts, and multiple programs need to share the same data without corruption.
- The relational model — tables, rows, columns, keys. SQL as the language to ask questions. Why this 50-year-old idea still dominates: it separates "what you want" from "how to get it".
- Transactions and ACID — the four promises a database makes (Atomicity, Consistency, Isolation, Durability). Isolation levels as the dial for how strict that isolation actually is. MVCC as the trick that lets readers and writers coexist.
- Indexes — without them, every query reads every row. B-trees (the workhorse), LSM trees (the write-optimized alternative). The trade-off: faster reads vs faster writes.
- Beyond relational — document stores (flexible schemas), graph databases (relationships first), time-series (optimized for timestamped data), vector stores (nearest-neighbor search for embeddings). Each exists because one access pattern matters more than the rest.
- Data engineering — getting data from where it's born to where it's useful. Batch (process all at once, periodically) vs stream (process as it arrives). Warehouses (structured, analytical queries) vs lakehouses (raw + structured, one system). The idea of lineage: knowing where data came from and what happened to it.
Going deeper
Branches that earn their own article.
- Query planning and optimization.
- Storage engine internals (B-tree pages, WAL, compaction).
- Individual database deep dives (PostgreSQL, MySQL, SQLite, DynamoDB, Cassandra, MongoDB).
- Replication and sharding for databases.
- Time-series database internals (InfluxDB, TimescaleDB).
- Graph database engines (Neo4j, TigerGraph).
- Vector search algorithms (HNSW, IVF).
- Data warehouse architecture (Snowflake, BigQuery, Redshift).
- Lakehouse formats (Apache Iceberg, Delta Lake, Hudi).
- Stream processing frameworks (Kafka Streams, Flink, Spark Streaming).
- Data orchestration (Airflow, Dagster, Prefect).
- Data quality and testing (Great Expectations, dbt tests).
- Schema evolution strategies.
- Data governance and catalogs.