Data

A database is a promise about durability and consistency; ML is a bet that patterns in old data predict new data.

On this page

The working table of contents.

  1. Why databases exist — programs need data to survive restarts, and multiple programs need to share the same data without corruption.
  2. The relational model — tables, rows, columns, keys. SQL as the language for asking questions. Why this 50-year-old idea still dominates: it separates "what you want" from "how to get it". (First sketch after this list.)
  3. Transactions and ACID — the four promises a database makes (Atomicity, Consistency, Isolation, Durability). Isolation levels as the dial for how strict the isolation really is. MVCC as the trick that lets readers and writers coexist. (Atomicity sketched after this list.)
  4. Indexes — without them, every query reads every row. B-trees (the workhorse), LSM trees (the write-optimized alternative). The trade-off: faster reads vs faster writes. (Sketched after this list.)
  5. Beyond relational — document stores (flexible schemas), graph databases (relationships first), time-series (optimized for timestamped data), vector stores (nearest-neighbor search for embeddings). Each exists because one access pattern matters more than the rest. (Nearest-neighbor sketched after this list.)
  6. Data engineering — getting data from where it's born to where it's useful. Batch (process all at once, periodically) vs stream (process as it arrives). Warehouses (structured, analytical queries) vs lakehouses (raw + structured, one system). The idea of lineage: knowing where data came from and what happened to it. (Batch vs stream sketched after this list.)
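For item 2, a minimal sketch of "what you want, not how to get it", using Python's standard-library sqlite3 (so the SQL is SQLite's dialect); the users table and its rows are invented for illustration.

```python
import sqlite3

# In-memory database: declare the schema, load a few rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users (name, age) VALUES (?, ?)",
                 [("Ada", 36), ("Grace", 45), ("Edsger", 72)])

# The query states *what* we want -- users over 40, oldest first --
# and says nothing about *how* to scan, filter, or sort. The engine decides.
for name, age in conn.execute(
        "SELECT name, age FROM users WHERE age > 40 ORDER BY age DESC"):
    print(name, age)  # Edsger 72, then Grace 45
```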
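For item 3, a sketch of atomicity with the same sqlite3 setup; the accounts table and the transfer rule are hypothetical. Python's connection context manager commits on success and rolls back on an exception, so the two updates land together or not at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # one transaction: commit on success, rollback on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            (balance,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                      (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")  # triggers the rollback
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except ValueError:
        pass  # the debit above was undone along with the rest of the transaction

transfer(conn, "alice", "bob", 150)  # alice only has 100, so this must fail cleanly
print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('alice', 100), ('bob', 0)] -- no partial transfer survived
```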
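For item 4, a sketch of what an index changes, using SQLite's EXPLAIN QUERY PLAN (the exact plan wording varies a little by SQLite version); the events table is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, ts TEXT)")

query = "SELECT * FROM events WHERE user_id = ?"

# No index yet: the plan is a full scan -- every query reads every row.
print(conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())
# detail column reads something like 'SCAN events'

# SQLite's indexes are B-trees; adding one turns the scan into a search.
conn.execute("CREATE INDEX idx_events_user ON events(user_id)")
print(conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())
# detail column reads something like
# 'SEARCH events USING INDEX idx_events_user (user_id=?)'
```

The trade-off from the list is visible here too: every INSERT into events now also has to update idx_events_user, so the faster read is paid for on the write path.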
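For item 5, the problem vector stores exist to speed up, written naively: exact nearest-neighbor search is a full scan over every stored embedding, and the HNSW/IVF structures under "Going deeper" exist to approximate it without that scan. The embeddings below are made up.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# A toy "vector store": answering a query means comparing against every
# entry, O(n) per lookup -- exactly what real vector indexes avoid.
store = {
    "doc-cats": [0.9, 0.1, 0.0],
    "doc-dogs": [0.8, 0.3, 0.1],
    "doc-tax":  [0.0, 0.2, 0.9],
}

query = [0.85, 0.2, 0.05]
best = max(store, key=lambda doc: cosine_similarity(store[doc], query))
print(best)  # 'doc-cats' -- the stored embedding nearest the query
```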
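For item 6, the batch/stream distinction in plain Python; the event shape and the per-hour count are invented. Both compute the same answer; what differs is when you have it.

```python
from collections import Counter

events = [{"user": "a", "hour": 9}, {"user": "b", "hour": 9}, {"user": "a", "hour": 10}]

# Batch: the whole dataset is already at rest; process it in one pass, on a schedule.
def batch_counts(events):
    return Counter(e["hour"] for e in events)

# Stream: events arrive one at a time; keep running state and emit an
# up-to-date answer after each arrival instead of waiting for a batch run.
def stream_counts(event_iter):
    state = Counter()
    for e in event_iter:
        state[e["hour"]] += 1
        yield dict(state)

print(batch_counts(events))             # Counter({9: 2, 10: 1})
print(list(stream_counts(events))[-1])  # {9: 2, 10: 1} -- same result, different timing
```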
Going deeper

Branches that earn their own article.

  • Query planning and optimization.
  • Storage engine internals (B-tree pages, WAL, compaction).
  • Individual database deep dives (PostgreSQL, MySQL, SQLite, DynamoDB, Cassandra, MongoDB).
  • Replication and sharding for databases.
  • Time-series database internals (InfluxDB, TimescaleDB).
  • Graph database engines (Neo4j, TigerGraph).
  • Vector search algorithms (HNSW, IVF).
  • Data warehouse architecture (Snowflake, BigQuery, Redshift).
  • Lakehouse formats (Apache Iceberg, Delta Lake, Hudi).
  • Stream processing frameworks (Kafka Streams, Flink, Spark Streaming).
  • Data orchestration (Airflow, Dagster, Prefect).
  • Data quality and testing (Great Expectations, dbt tests).
  • Schema evolution strategies.
  • Data governance and catalogs.