
Data & AI


TL;DR

Every modern product is in the business of storing, processing, and learning from data. A transactional app records what users do. An analytical pipeline copies those records somewhere else, reshapes them for analysis, and answers questions about the business. A machine-learning pipeline uses the same data to train models — functions that map inputs to predictions — which are then served back as APIs or baked into the product. These three activities (transact, analyse, learn) all share the same raw material and increasingly the same tooling.

"Data & AI" under one handbook is really two pipelines sharing most of their infrastructure. The data pipeline moves records from transactional systems (databases, services, event logs) to analytical stores (warehouses, lakes, lakehouses), cleaning and reshaping them along the way so analysts and dashboards can ask questions efficiently. The AI pipeline takes those same records, turns them into tensors (multi-dimensional arrays of numbers the GPU can crunch), trains a model, evaluates it, serves it in production, and watches for drift. Both pipelines pass through the same five choke points — ingestion, storage formats, query engines, evaluation, and governance — so learning the data side is most of learning the AI side.

This handbook walks those shared choke points in the order data actually flows through a mature system. Ingestion and CDC (change data capture) — how records leave the systems that produce them. Storage formats — row-oriented for transactions, columnar for analytics, and the lakehouse formats that tried to unify both. Query engines — vectorized, distributed, cost-based. Lakehouse vs warehouse — how the two architectures converged. Training pipelines — tensors, losses, optimizers, distributed training. Inference serving — batching, KV caches, quantization, all the reasons an LLM response costs what it costs. Evaluation, drift, and the offline/online gap — why a model that scored 99% in training can still ship a disaster. And governance, lineage, and deletion — the unglamorous work that keeps you out of court.

You will be able to

  • Explain, in plain language, what a "data pipeline" is, what an "ML training run" does, and what an "inference request" actually computes.
  • Decide between row-oriented, columnar, and object-storage layouts from the read pattern, and name the compression that pays rent.
  • Describe what an LLM inference request actually costs (GPU seconds, KV cache, batch efficiency, quantization) instead of staring at the API bill.
  • Place any "AI feature" on the pipeline so you know where its failure modes live.
  • Draw a data lineage graph for any pipeline and point at the exact step that would have to change under a deletion request.

The Map


Read the graph left-to-right for a record's lifecycle: born in a transactional system, captured by CDC, landed in columnar files, catalogued as a table, queried for features, baked into a model's weights, served as predictions, evaluated against ground truth, deleted on request. Governance is not a final step — it is a constraint every previous step must honour.

Station 1 — Ingestion, CDC, and the transactional tail

Data is born where users do things — an order in the e-commerce app, a message in the chat service, a click event in the mobile SDK. That data lives in OLTP systems (Online Transaction Processing — PostgreSQL, MySQL, DynamoDB, and Kafka-as-a-log). These systems are optimised for reading and writing single records quickly and consistently; they are not good at answering "how many orders did we ship in Germany last quarter?" without degrading the user-facing path.

Analytics and ML want exactly those slower, wider questions — aggregations across millions of records, joins across years of history, feature extraction across every user. Running those queries on the OLTP database at scale competes with the product for the same CPU, memory, and locks. So analytics lives somewhere else: a warehouse, a data lake, a lakehouse. Ingestion is the set of pipelines that moves records from the OLTP side to the analytical side without breaking either. Change Data Capture (CDC) is the modern preferred technique — instead of periodic full table dumps, stream the database's own write-ahead log into downstream systems so analytics sees every insert, update, and delete with minimal lag.

The hard engineering problem here is neither speed nor volume; it is consistency. How do we get a record into the warehouse at-least-once without duplicating it, reliably capture deletions, handle schema changes from the upstream team, and preserve ordering? The patterns that make this work — transactional outbox, CDC via logical replication, idempotent consumers — are Rung 5 of the Systems & Architecture handbook meeting the warehouse at the ingestion boundary.

  • Log-based CDC (Debezium on Postgres WAL, MySQL binlog, Oracle LogMiner) captures every committed write as an event with (op, before, after, lsn, ts). The event stream is replayable — a downstream system can rebuild state from any offset. Trigger-based CDC exists but taxes the source DB; log-based is the production default.
  • Kafka is the standard transport: topics partitioned by primary key so per-key ordering holds, typical retention 7–30 days, replication factor 3 for durability. Delivery is at-least-once by default; the Systems & Architecture handbook covers why that pushes idempotency into every consumer.
  • Schema evolution: the source's ALTER TABLE changes the event schema. Avro + a schema registry (Confluent Schema Registry, Apicurio) with compatibility rules (BACKWARD, FORWARD, FULL) is the usual discipline; without it, a column rename on Tuesday breaks ETL on Wednesday. See Foundations Station 5 on wire formats.
  • Late and out-of-order events are inevitable. Watermarks (Apache Flink, Beam) let you say "I will accept events for this window up to T + grace; after that, sealed." Without watermarks you close windows either too early (drop data) or never (unbounded state).
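The watermark rule fits in a few lines. A toy sketch (tumbling 60-second windows, hypothetical event shape; engines like Flink track this per key with persistent, checkpointed state):

```python
from collections import defaultdict

WINDOW = 60   # tumbling window length, seconds
GRACE = 30    # allowed lateness, seconds

windows = defaultdict(list)   # open windows: start -> events
sealed = {}                   # closed windows: start -> final contents
max_ts = 0                    # highest event time seen so far

def on_event(ts, value):
    """Assign an event to its window; seal windows the watermark has passed."""
    global max_ts
    max_ts = max(max_ts, ts)
    watermark = max_ts - GRACE            # "nothing older than this will be accepted"
    start = ts - ts % WINDOW
    if start + WINDOW <= watermark or start in sealed:
        return False                      # too late, window already sealed: dropped
    windows[start].append(value)
    for s in [s for s in windows if s + WINDOW <= watermark]:
        sealed[s] = windows.pop(s)        # watermark passed the window's end: seal it
    return True

on_event(10, "a")     # lands in window [0, 60)
on_event(130, "b")    # advances the watermark to 100, sealing [0, 60)
on_event(20, "late")  # returns False: its window is sealed, the event is dropped
```

Shrinking GRACE drops more data but bounds state sooner; growing it keeps windows open longer. That is exactly the "too early / never" trade-off above.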

The model you want: ingestion is a replayable log, not a copy. If you can't replay from the WAL offset or the Kafka offset, you don't have CDC — you have a fragile dump job with optimistic assumptions.
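A toy version of that model, with an event shape loosely modelled on Debezium's envelope (op, after-image, LSN); idempotency is what makes replay from any offset safe:

```python
# A toy replayable consumer. Idempotent because any (key, lsn) at or below the
# last applied LSN for that key is a no-op, so replaying the stream from an
# earlier offset (the at-least-once case) changes nothing downstream.
table = {}          # primary key -> current row
applied_lsn = {}    # primary key -> highest LSN applied so far

def apply_event(op, key, after, lsn):
    if lsn <= applied_lsn.get(key, -1):
        return                    # duplicate delivery: skip
    applied_lsn[key] = lsn
    if op in ("c", "u"):          # create/update events carry the full new row
        table[key] = after
    elif op == "d":               # deletes must propagate, not just inserts
        table.pop(key, None)

events = [("c", 1, {"name": "Ana"}, 10),
          ("u", 1, {"name": "Ana B."}, 11),
          ("d", 1, None, 12)]
for e in events + events:         # replaying the whole stream twice is harmless
    apply_event(*e)
```

Note the delete: after replay, key 1 is gone from the table, which is the behaviour a deletion request (Station 8's governance concern) depends on.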

WARNING

Never run analytics queries against a transactional primary "just this once" for a big backfill. The long scan plays havoc with OLTP's working set, evicts the hot pages, and your p99 on the app goes to hell. Always go via a replica, a snapshot, or the CDC stream — even if the pipeline looks like overkill for one job.

The wider picture. Data ingestion is a rich ecosystem:

  • Batch ETL (Extract, Transform, Load) — nightly jobs in Airflow, Dagster, Prefect, or cron. The classic approach; still common for non-realtime workloads.
  • Streaming ingestion — Kafka, Kinesis, Pub/Sub, Pulsar, Event Hubs, Redpanda. Continuous event streams with at-least-once semantics.
  • CDC tools — Debezium (Kafka-based), AWS DMS, Fivetran, Airbyte, Google Datastream. Capture database changes and replicate.
  • ELT vs ETL — load first then transform in the warehouse (ELT, modern default) vs transform in flight (ETL, legacy). dbt ecosystem leans hard on ELT.
  • Schema evolution and registries — Confluent Schema Registry, Buf Schema Registry, Apicurio. Avro / Protobuf / JSON Schema with compatibility rules.
  • Watermarks and late data — Apache Flink, Google Dataflow / Apache Beam, Kafka Streams. How streaming handles out-of-order events.
  • Transactional outbox — the standard pattern for "atomically persist + publish" from a single database.
  • Data contracts — explicit producer/consumer agreements (Andrew Jones's Driving Data Quality with Data Contracts). The "API for data."
  • Bronze / silver / gold layers (medallion architecture) — raw → cleansed → business-ready table tiers inside the warehouse.

Where this shows up next. Station 2 is where the ingested records land as files on disk. Stations 4 and 5 are where analytics and training finally consume them. Cross-link to Foundations Station 5 for the wire-format side, and to Systems & Architecture Rung 5 for the delivery-semantics backbone.

Go deeper: Kleppmann, Designing Data-Intensive Applications chapter 11 — the friendliest on-ramp; the Debezium documentation on connector architectures; the Confluent Schema Registry compatibility rules; Apache Flink's watermark chapter in the official docs; Andrew Jones, Driving Data Quality with Data Contracts for the modern producer/consumer discipline.

Station 2 — Storage formats: row, column, and table

Once data is ingested, it has to be laid out on disk in some format. The choice of format is the single biggest determinant of query cost and speed for the next ten years of the system's life, because every query has to read bytes in the order the format stored them.

There are two fundamental layouts. Row-oriented formats store every field of a record together (field 1, field 2, field 3 for row 1; then field 1, field 2, field 3 for row 2; and so on). This is ideal for OLTP systems that want to retrieve or modify entire records by key — the whole row comes off disk in one sequential read. Columnar formats invert the layout: all values of field 1 together, then all values of field 2 together, and so on. This is ideal for analytics, which typically aggregates over just a few columns of wide tables; you only read the columns the query needs, and values within a column compress extraordinarily well because they are similar.

A table format (Iceberg, Delta Lake, Hudi) is a third layer on top of columnar files. It adds a metadata layer — manifest files, snapshots, transaction logs — that lets you treat a pile of Parquet files in object storage as a database table with ACID transactions, schema evolution, time-travel reads, and efficient upserts. Table formats are what let "a bunch of files in S3" behave like a warehouse, which is the architectural move that made the modern lakehouse possible.

Once records land, the question is how they sit on disk. Three broad answers: row-oriented (the OLTP default, a whole row's bytes contiguous), columnar (all values for one column contiguous), and object-store table formats (Iceberg, Delta Lake, Hudi — columnar files plus a manifest that says which files currently make up the table).

  Three customers, four columns. Two layouts, same data.

  Row-oriented (Postgres, MySQL heap):
    Page 1:  [1, "Ana",  "CA", 2023] [2, "Bo",  "NY", 2024] [3, "Cy", "TX", 2024] …
            ↑ good for: row-at-a-time access, OLTP

  Columnar (Parquet row group):
    id:      1, 2, 3, …
    name:    "Ana", "Bo", "Cy", …
    state:   "CA", "NY", "TX", …
    year:    2023, 2024, 2024, …
            ↑ good for: SUM/AVG/WHERE on a few columns over millions of rows

Parquet and ORC are the two dominant columnar formats. Both organize a file into row groups (default ~128–256 MB), each containing column chunks with per-chunk statistics (min, max, null count) so query engines can skip entire chunks via predicate pushdown. Both use dictionary encoding for low-cardinality strings, run-length encoding for repeated values, and Snappy / ZSTD for post-encoding compression. Typical compression ratios on analytics workloads: 4–10×.

  • Parquet pages (typically ~1 MB) are the smallest I/O unit, nested inside each column chunk within a row group. Column chunks support nested data (lists, structs, maps) via Dremel-style repetition/definition levels — which is how a single Parquet file can store JSON-shaped records without falling back to blobs.
  • Iceberg / Delta / Hudi solve the "columnar files don't update" problem at the table level. A table is a JSON/manifest pointer to a set of Parquet files; an update writes new files and publishes a new manifest atomically. You get ACID on an object store, time-travel reads, and schema evolution without rewriting history. Iceberg (created at Netflix, now Apache) has the cleanest spec; Delta Lake (Databricks) is dominant on the Spark side; Hudi (Uber) specializes in upserts and streaming MERGE.
  • Compaction is necessary maintenance: small files from streaming writes get merged into large files so query planners don't spend milliseconds per file listing. Iceberg's rewrite_data_files and Delta's OPTIMIZE are the standard operators.
  • Partitioning is layout not indexing — a Hive-style year=2024/month=03/ directory structure means queries with WHERE year=2024 touch only those directories. Over-partitioning (too many tiny partitions) is a classic mistake; 10s to 100s of MB per partition is a healthy target.

The model you want: the storage format is a contract between write-once and read-many. Columnar + compression + predicate pushdown assumes you will scan selectively; it is terrible for point lookups. A write-heavy, point-lookup workload wants a row store; an analytics workload wants Parquet; a hybrid wants Iceberg or a warehouse's native format.
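The compression claim is easy to see in miniature. A sketch of dictionary and run-length encoding, simplified versions of what Parquet applies per column chunk:

```python
def dictionary_encode(column):
    """Map each distinct string to a small integer code (simplified dictionary encoding)."""
    dictionary = sorted(set(column))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in column]

def run_length_encode(codes):
    """Collapse runs of equal codes into [code, run_length] pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return runs

# One low-cardinality column, 1,000 values, as it would sit in a column chunk.
state = ["CA"] * 500 + ["NY"] * 300 + ["TX"] * 200
dictionary, codes = dictionary_encode(state)
runs = run_length_encode(codes)   # [[0, 500], [1, 300], [2, 200]]
```

A thousand strings collapse to a 3-entry dictionary plus three (code, length) pairs. Sorting or clustering a table by a column is what makes the runs long, which is why file layout (Station 2's partitioning and Z-ordering) and compression are the same conversation.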

TIP

"Just put everything in Parquet" beats 80% of ad-hoc data stores you'd otherwise build. DuckDB or Trino against Parquet on S3 is a minute-to-stand-up analytics stack. The remaining 20% — low-latency dashboards, high-concurrency lookups, transactions — is where a warehouse or a row-oriented store earns the extra money.

The wider picture. Storage formats span a broad landscape:

  • Row formats — CSV / TSV, JSON / JSONL / NDJSON, Avro, MessagePack, Protobuf, Thrift. Good for streaming and for single-record access.
  • Columnar formats — Parquet, ORC, Arrow (in-memory columnar), Vortex. 10–100× compression vs CSV, order-of-magnitude query speedup for analytics.
  • Table formats — Apache Iceberg, Delta Lake, Apache Hudi, Apache Paimon. Each adds transactions, schema evolution, and efficient upserts on top of columnar files.
  • Vector indexes — HNSW (Hierarchical Navigable Small World), IVF, ScaNN, DiskANN. For similarity search over embeddings.
  • Database-native formats — Postgres heap + TOAST, MySQL InnoDB, SQLite pages, SQL Server, Oracle.
  • Time-series formats — InfluxDB TSM, Prometheus TSDB, Gorilla compression, VictoriaMetrics.
  • Key-value engine files — LevelDB / RocksDB SST, LMDB B-trees, FoundationDB redwood.
  • Compression codecs — Snappy, LZ4, ZSTD, GZIP, BZip2, Brotli; column-encoding schemes (dictionary, run-length, bit-packing, delta).
  • File organisation — partitioning (by date, region), bucketing (hashing to N files), Z-order / Hilbert curves for multi-column skipping, Bloom filters for point-lookup speed.

Where this shows up next. Station 3 is the query engine that reads these files. Station 4 is the platform architecture (lakehouse, warehouse) built around them. Cross-link to Foundations Station 6 for the entropy / compression math these formats exploit.

Go deeper: Parquet format specification — short, readable, explains column chunks and statistics; Iceberg spec; Abadi, Madden & Hachem, "Column-Stores vs. Row-Stores: How Different Are They Really?" (SIGMOD 2008); Abadi et al., "The Design and Implementation of Modern Column-Oriented Database Systems" (2013) — the friendliest book-length introduction; Delta Lake paper (VLDB 2020); Hudi's streaming-write design doc.

Station 3 — Query engines: vectorized, distributed, and cost-based

A query engine is the software that takes a SQL (or SQL-like) query, figures out how to execute it efficiently against the stored data, and returns the result. It is the component that decides in which order to scan tables, which indexes to use, which joins to run first, and how to spread the work across cores and machines. The difference between a 100-millisecond query and a 10-minute one on the same data is almost always the query engine, not the hardware.

Modern engines share three ideas that together give orders of magnitude of speedup. Vectorized execution processes thousands of rows at a time in tight SIMD-friendly loops, rather than one-row-at-a-time ("tuple-at-a-time") as classical engines did — this exploits the cache and branch-prediction behaviour the Computer Architecture handbook describes. Distributed execution splits a query plan across many machines reading disjoint partitions in parallel — this is how Snowflake, BigQuery, and Spark scale to petabytes. Cost-based optimisation picks a join order and plan shape by estimating the data volume each step will produce, using statistics gathered from the tables — a good optimiser can be hundreds of times faster than a naïve one on the same query.

The engine is also where pushdown happens — pushing filters, projections, and aggregations down into the storage layer so less data moves. A columnar store with predicate pushdown can read 1% of the bytes a row-store reads for the same query. That 100× is why analytics migrated off row stores.
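Per-chunk min/max statistics (a "zone map") are what make that skipping possible. A toy sketch (invented chunk boundaries; Parquet stores real min/max per column chunk):

```python
# Each chunk carries min/max stats for the column; a scan consults the stats
# before touching the rows, skipping chunks that cannot contain a match.
chunks = [
    {"min": 2021, "max": 2021, "rows": [2021] * 4},
    {"min": 2022, "max": 2023, "rows": [2022, 2022, 2023, 2023]},
    {"min": 2024, "max": 2024, "rows": [2024] * 4},
]

def scan_eq(chunks, value):
    """Read only chunks whose [min, max] range can contain the predicate value."""
    hits, chunks_read = [], 0
    for chunk in chunks:
        if chunk["min"] <= value <= chunk["max"]:
            chunks_read += 1
            hits.extend(r for r in chunk["rows"] if r == value)
    return hits, chunks_read

hits, chunks_read = scan_eq(chunks, 2024)   # 1 of 3 chunks actually read
```

Sorted or partitioned data makes the min/max ranges tight and disjoint; randomly ordered data makes every chunk's range overlap the predicate, and the zone map skips nothing.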

A query engine takes SQL (or dataframe code), turns it into an execution plan, and runs that plan against columnar files or tables. Modern engines are vectorized — operators process batches of a few thousand column values per call instead of tuple-at-a-time — and cost-based in their planner, using table statistics to pick a join order and a join algorithm.

  • Vectorized execution (MonetDB/X100 2005, Vertica 2005, DuckDB today) processes data in cache-friendly column batches. A tight SIMD loop over 1024 int64s amortizes function-call overhead and hits the memory system in patterns the prefetcher loves — 10–100× faster than classic Volcano tuple-at-a-time on analytic workloads.
  • Distributed engines (Spark, Trino/Presto, ClickHouse, Snowflake, BigQuery Dremel) split work across workers and shuffle intermediate results by a partitioning key for joins and aggregations. Shuffle cost dominates in bad query plans; network bandwidth often ends up the bottleneck rather than CPU.
  • Join algorithms: hash join (build hash table on small side, probe with large side — O(n + m)); sort-merge join (for pre-sorted inputs, or when spill is needed); broadcast join (replicate small table to every worker, avoids shuffle). A cost-based optimizer picks using table size and distribution statistics.
  • DuckDB (Raasveldt & Mühleisen, 2019) is the single-node vectorized engine worth knowing about even if you run Spark — embedded in a Python process, runs Parquet-on-S3 queries at Trino-like speed, and has become the grep of analytics. Polars (Rust, Arrow-native) fills the same niche for dataframe workflows.

The model you want: modern analytics is vectorized columnar compute plus smart I/O; the fastest query is the one that skips the most data. Partition pruning, predicate pushdown, min/max stats — every "the query was slow" investigation ends with figuring out which of these the engine did or didn't do.
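The hash join from the list above is short enough to sketch in full (toy rows; Python dicts stand in for a columnar batch):

```python
def hash_join(small, large, key):
    """Two-phase hash join: build a hash table on the small side, probe with the
    large side. Roughly O(n + m), versus O(n * m) for a nested-loop join."""
    build = {}
    for row in small:                         # build phase
        build.setdefault(row[key], []).append(row)
    out = []
    for row in large:                         # probe phase: one pass over the big input
        for match in build.get(row[key], []):
            out.append({**match, **row})
    return out

users = [{"uid": 1, "name": "Ana"}, {"uid": 2, "name": "Bo"}]
orders = [{"uid": 1, "amount": 30}, {"uid": 1, "amount": 5}, {"uid": 3, "amount": 9}]
joined = hash_join(users, orders, "uid")      # two rows, both Ana's; uid 3 finds no match
```

The broadcast join in the list is this same build table replicated to every worker; the cost-based decision is essentially whether the build side fits in memory on each node.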

WARNING

SELECT * from a wide Parquet table reads every column; SELECT id, amount reads two. Dashboards that blindly SELECT * are often 5–20× more expensive than they need to be. Enforce column projection in your BI layer and your warehouse bill drops noticeably.

The wider picture. Query engines and analytical SQL ecosystems are a big tent:

  • Distributed SQL engines — Presto / Trino, Spark SQL, Apache Hive, Apache Impala, Apache Drill, Dremio.
  • Cloud warehouses — Snowflake, BigQuery, Amazon Redshift, Azure Synapse, Databricks SQL. Managed engines, often billed per query or per reserved compute unit.
  • Columnar OLAP engines — ClickHouse, Apache Druid, Apache Pinot, StarRocks, DuckDB (embeddable), Doris.
  • Streaming SQL — Apache Flink SQL, Materialize, RisingWave, ksqlDB. Continuous queries over event streams.
  • Federated and polystore — Trino, Presto, Apache Calcite (planning library). Query across many stores without copying data.
  • Embedded analytics — DuckDB, SQLite, Polars, Datafusion. "Warehouse in a binary" for laptops and edge devices.
  • Query optimisation layers — Apache Calcite, Volcano/Cascades optimisers, learned query optimisers (DB + ML).
  • Materialised views and query caching — pre-computed answers. dbt's incremental models, Druid pre-aggregations, Snowflake materialized views.
  • Dataframe libraries — pandas, Polars, Ibis, PySpark DataFrame. Query engines that pretend to be libraries.

Where this shows up next. Station 4 is the platform built around these engines (lakehouse vs warehouse). The training pipeline (Station 5) uses these engines for feature extraction; evaluation and drift monitoring (Station 7) are query workloads. Cross-link to the Cloud & Infrastructure handbook for the storage primitives beneath.

Go deeper: Boncz, Zukowski & Nes, "MonetDB/X100: Hyper-Pipelining Query Execution" (CIDR 2005); Kornacker et al., "Impala: A Modern, Open-Source SQL Engine for Hadoop" (CIDR 2015); Raasveldt & Mühleisen, "DuckDB: an Embeddable Analytical Database" (SIGMOD 2019); the Snowflake paper (SIGMOD 2016); Graefe, "Query Evaluation Techniques for Large Databases" (1993) — the foundational survey; Begoli et al., "Apache Calcite" (SIGMOD 2018) for the modern optimiser architecture.

Station 4 — Lakehouse and warehouse, and why they converged

For about two decades, analytical storage lived in two camps. A data warehouse (Teradata, Vertica, Redshift, Snowflake, BigQuery) was a tightly integrated managed system — proprietary storage, proprietary query engine, proprietary optimizer — offering excellent performance but locking you in. A data lake (S3 / GCS with Hive tables) was the opposite — open formats, cheap object storage, any engine can read it — but gave up transactions, schema evolution, and consistency.

A lakehouse is the architectural move that tried to get both: cheap open storage (Parquet + Iceberg / Delta / Hudi on S3) plus warehouse-grade guarantees (ACID transactions, schema evolution, time-travel, efficient upserts). Databricks popularised the term in 2021; Snowflake, BigQuery, and most cloud vendors have since added support for reading external table formats, so the lakehouse and the warehouse converged. The question is no longer "warehouse or lake?" but "which table format, on which object store, read by which engines?"

A data warehouse is a managed analytics DB with storage and compute bundled: Teradata, Snowflake, BigQuery, Redshift. A data lake is a pile of files on object storage (S3, GCS, ADLS) with a query engine layered over the top: Hadoop, Databricks' early pattern, S3 + Athena. A lakehouse is the convergence — open table formats (Iceberg, Delta) on object storage, with warehouse-grade SQL engines that treat the files as managed tables.

  Architecture layers from bottom to top:

   object store                S3 / GCS / Azure Blob         (durability, scale, cheap)
   table format                Iceberg / Delta / Hudi        (ACID, schema, time-travel)
   catalog                     AWS Glue / Unity / Polaris    (table discovery, permissions)
   query engines               Spark, Trino, DuckDB, Flink   (SQL + Python + streaming)
   orchestration + BI          Airflow, dbt, Looker, Tableau (ELT, dashboards)
  • The separation of storage and compute is the hallmark of the modern stack: S3 is the system of record, compute clusters spin up to query it, data is never trapped in one vendor's silo. Snowflake and BigQuery internalized the same split years before "lakehouse" got a name; the difference now is that the table format is open (Iceberg/Delta) and multiple engines can read/write the same tables concurrently.
  • ELT over ETL: load the raw data, then transform in-warehouse using SQL and tools like dbt. This is cheaper (no separate ETL cluster), more testable (versioned SQL, data tests), and more flexible — transformations can be re-run after the fact as understanding improves.
  • Iceberg REST catalog (2023) is the emerging open standard for catalog APIs — Snowflake's Polaris, Databricks' Unity, Apache Polaris, Project Nessie all speak it. The idea: any engine can authenticate to any catalog and read any Iceberg table, without the catalog lock-in that Hive Metastore created.
  • Cost model: pay-per-query (BigQuery, Athena) is great for ad-hoc, punishing for always-on dashboards; warehouse-size-time (Snowflake, Databricks) flips that; reserved capacity hides the real unit cost but flattens spend. Every warehouse bill investigation starts with separating those three workloads. See also the Cloud & Infrastructure handbook.

The model you want: storage is where the truth is; compute is a commodity that comes and goes; the catalog is what binds them. A stack without an open catalog is one vendor migration away from a multi-quarter project.
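The billing-model comparison above reduces to one break-even computation. A sketch with invented prices (illustrative only; real rates vary by vendor, region, and discount):

```python
# Hypothetical prices, not any vendor's actual rates.
PER_TB_SCANNED = 5.00       # $/TB, pay-per-query model (BigQuery/Athena-style)
WAREHOUSE_PER_HOUR = 2.00   # $/hour, warehouse-size-time model (Snowflake-style)

def monthly_cost(tb_scanned_per_day, warehouse_hours_per_day, days=30):
    """Return (pay_per_query, size_time) monthly cost for the same workload."""
    pay_per_query = tb_scanned_per_day * PER_TB_SCANNED * days
    size_time = warehouse_hours_per_day * WAREHOUSE_PER_HOUR * days
    return pay_per_query, size_time

# Ad-hoc analyst: 0.2 TB/day of scans vs keeping a warehouse up 8 h/day.
adhoc = monthly_cost(0.2, 8)        # pay-per-query wins by an order of magnitude
# Always-on dashboards: 10 TB/day of scheduled scans on a 24 h cluster.
dashboards = monthly_cost(10, 24)   # size-time pricing edges ahead
```

This is why the station says every bill investigation starts by separating the workloads: the same stack is cheap under one model and punishing under the other.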

CAUTION

"We'll standardize on one engine" inside a lakehouse is optional; "we'll standardize on one table format" is not. Iceberg vs Delta is the 2020s version of "do we use Parquet?" — you have to pick, and switching later costs real write-side rewrites.

The wider picture. The modern data platform includes:

  • Lakehouse platforms — Databricks (Delta), Snowflake (external Iceberg), BigQuery (BigLake), Dremio, Starburst. Converged architectures.
  • Warehouse-native — Snowflake, BigQuery, Redshift, Synapse, Teradata (historical), Vertica.
  • Lake-native — any object store + open table format + compute (EMR, Spark, Trino, Athena).
  • Transformation layer — dbt (the de facto SQL transformation tool), SQLMesh, dataform.
  • Orchestration — Airflow, Dagster, Prefect, Argo Workflows, Mage, Temporal.
  • Data catalogue and discovery — AWS Glue Catalog, Unity Catalog, Apache Atlas, DataHub, Amundsen, Alation, Collibra.
  • Data quality — Great Expectations, Soda, dbt tests, Monte Carlo, Bigeye. Assertions on ingested data.
  • Lakehouse + ML — Feature stores (Feast, Tecton, Hopsworks), ML training against the same lakehouse tables.
  • BI and analytics consumers — Looker, Tableau, PowerBI, Metabase, Hex, Mode, Superset.

Where this shows up next. Stations 5–8 all consume data from this layer. Cross-link to Cloud & Infrastructure for the storage economics and to Engineering Craft for the CI/CD discipline applied to data pipelines.

Go deeper: Armbrust et al., "Lakehouse: A New Generation of Open Platforms" (CIDR 2021); Dageville et al., "The Snowflake Elastic Data Warehouse" (SIGMOD 2016); Melnik et al., "Dremel: Interactive Analysis of Web-Scale Datasets" (VLDB 2010); Apache Iceberg documentation; Kimball, The Data Warehouse Toolkit (the foundational dimensional modelling book); the dbt guide on test and documentation patterns.

Station 5 — Training pipelines: from tensors to weights

A machine-learning model is, in the end, a function from inputs to outputs with a very large number of tuneable numeric parameters (the "weights"). Training is the process of finding values for those weights such that the model's outputs match a training set as well as possible. It has a few moving parts that every modern ML pipeline includes.

First, the data becomes tensors — multi-dimensional arrays of numbers the GPU can process in parallel. Images become 3-D tensors of pixel values; text becomes sequences of integer token IDs; tabular data becomes matrices of features. Next, the model is defined as a computation graph (a composition of differentiable operations — matrix multiplies, nonlinearities, attention, convolutions) implemented in a framework like PyTorch, JAX, or TensorFlow. The model ingests a batch of inputs, produces predictions, and computes a loss — a scalar measuring how wrong the predictions are.

The magic is the optimiser. It uses backpropagation — a systematic application of the chain rule — to compute the gradient of the loss with respect to every weight. The optimiser (SGD, Adam, AdamW) then takes a small step in the direction that most reduces the loss. Repeat for millions of batches on billions of tokens, possibly across thousands of GPUs, and you have a trained model. The engineering challenges at scale — memory partitioning (ZeRO, FSDP), tensor / pipeline / data parallelism, checkpointing, numerical stability in mixed precision — are mostly about "how do we do this very expensive computation without running out of memory or time?"
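Stripped to a single parameter, the whole loop looks like this (a pure-Python sketch; real frameworks compute the gradient by autograd rather than by hand):

```python
# Fit y = w * x by gradient descent on mean squared error.
# Loss: L(w) = mean((w*x - y)^2); chain rule gives dL/dw = mean(2*(w*x - y)*x).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # the true w is 2
w = 0.0          # the single "weight"
lr = 0.05        # learning rate (the eta in the update rule)

for step in range(200):
    # forward pass + backward pass, fused into one line for a 1-parameter model
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad                 # theta <- theta - eta * grad
```

Everything in the station scales this loop up: the batch becomes tensors on a GPU, the gradient comes from backpropagation through millions of operations, and the update is sharded across thousands of devices, but the shape is unchanged.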

Training turns a dataset into a model's weights via gradient descent. The shape of the pipeline is stable across frameworks: load data, tokenize/encode into tensors, forward pass, loss, backward pass, optimizer step, repeat for billions of examples. What changes every generation is the scale — datasets into the trillions of tokens, models into the hundreds of billions of parameters, clusters into the tens of thousands of GPUs.

  • Gradient descent is one equation, the same across every model: θ ← θ − η · ∇_θ L(θ). Stochastic variants (SGD, momentum, Adam, AdamW) differ in how they estimate the gradient from a mini-batch and how they adapt the learning rate per parameter. AdamW is the current default for transformers; Lion (Chen et al., 2023) is a memory-cheaper alternative with comparable results.
  • Distributed training scales by partitioning either the data, the model, or both. Data parallel (every GPU has the full model, sees a different batch, averages gradients) is the simplest. Tensor parallel splits a single matrix multiply across GPUs. Pipeline parallel assigns consecutive transformer layers to different GPUs. ZeRO (Rajbhandari 2019) shards optimizer state, gradients, and parameters across the data-parallel group. Frontier models use all four simultaneously, in what the community calls 3D parallelism + ZeRO.
  • Mixed precision is now standard: activations in bfloat16 or fp8, optimizer state in float32 (precision matters for the weight update's last bits). See Foundations Station 3 for why bfloat16 matters: it has float32's range with float16's footprint.
  • Checkpoints are the big-number reality: a 70B-parameter model in bfloat16 is ~140 GB of weights, plus optimizer state of ~4× that for AdamW's first and second moments — so a resumable checkpoint is ~700 GB. Ship them to S3, use sharded checkpoints (one shard per rank), and test restore-from-checkpoint before you rely on it.
  • Scaling laws (Kaplan 2020, Hoffmann "Chinchilla" 2022) say: given a compute budget, loss is minimized by a specific balance of parameters and tokens — roughly 20× tokens per parameter for Chinchilla-optimal. Training a 70B model on 1.4T tokens is not a random choice; it's sitting on the Chinchilla curve.

The model you want: training is streaming records through a function and nudging the function's parameters the way the gradient says. Everything else — distributed strategies, mixed precision, checkpointing — is engineering to make that loop survive at scale.
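The checkpoint and token numbers in the list above are worth reproducing once, under the same assumptions the text uses (bfloat16 weights, optimizer state at roughly 4× the weights, ~20 tokens per parameter):

```python
params = 70e9                        # a 70B-parameter model
weight_bytes = params * 2            # bfloat16: 2 bytes per parameter
optimizer_bytes = weight_bytes * 4   # AdamW state, per the ~4x rule of thumb above
checkpoint_bytes = weight_bytes + optimizer_bytes

print(f"weights:    {weight_bytes / 1e9:.0f} GB")      # 140 GB
print(f"checkpoint: {checkpoint_bytes / 1e9:.0f} GB")  # 700 GB

chinchilla_tokens = params * 20      # Chinchilla-optimal token budget
print(f"tokens:     {chinchilla_tokens / 1e12:.1f}T tokens")  # 1.4T
```

Two lines of arithmetic explain both why checkpoints get sharded across ranks and why "70B on 1.4T tokens" keeps appearing in model cards.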

CAUTION

A training run that looks good on a loss curve can still be bad on real tasks — memorization, data leakage, contamination of eval sets with training data, reward-model hacking. Always evaluate on a genuinely held-out set whose provenance you can attest to, not just its identity.

The wider picture. Training spans many more concerns:

  • Frameworks — PyTorch (dominant for research), JAX (functional, XLA-compiled), TensorFlow (still strong for production), MLX (Apple), PaddlePaddle.
  • Optimisers — SGD with momentum, Adam, AdamW, LAMB, Shampoo, Lion, Sophia. Each trades convergence speed against memory.
  • Scaling laws — Kaplan et al. (2020), Chinchilla (2022). The empirical curves that say "to get X loss, spend Y FLOPs on Z tokens."
  • Architectures — transformer (the current standard), MoE (mixture of experts), Mamba / state-space models, diffusion models, VAEs, GANs.
  • Distributed training — data / tensor / pipeline / expert parallelism; ZeRO stages; FSDP; Ring-AllReduce; NCCL.
  • Mixed precision and numeric formats — FP16, BF16, FP8, INT8, FP4. Cross-link to Foundations Station 3.
  • Gradient tricks — checkpointing (recompute activations instead of storing), accumulation, clipping, noise.
  • Fine-tuning and PEFT — LoRA, QLoRA, IA³, prompt tuning, full-FT. How you adapt a pretrained model cheaply.
  • Training infrastructure — Kubernetes + KubeFlow, Ray, SLURM, AWS SageMaker, GCP Vertex AI, Modal, Together AI.
  • Datasets and data curation — The Pile, Common Crawl, RedPajama, FineWeb. Curation is increasingly the bottleneck.
  • Experiment tracking — Weights & Biases, MLflow, Neptune, Aim, ClearML.

Where this shows up next. Station 6 is how the trained model is served. Station 7 is how you know it actually works. Cross-link to the Computer Architecture handbook for the GPU / TPU hardware this layer runs on.

Go deeper: Goodfellow, Bengio & Courville, Deep Learning (the friendly textbook); Kingma & Ba, "Adam: A Method for Stochastic Optimization" (ICLR 2015); Rajbhandari et al., "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models" (SC 2020); Kaplan et al., "Scaling Laws for Neural Language Models" (2020); Hoffmann et al., "Training Compute-Optimal Large Language Models" (Chinchilla, 2022); Vaswani et al., "Attention Is All You Need" (NeurIPS 2017); Shoeybi et al., "Megatron-LM" (2019).

Station 6 — Inference serving, batching, and the KV cache

A trained model is a static file — hundreds of gigabytes of weights on disk. Inference is taking that file and using it to answer live queries: a user sends a prompt, the server runs the weights on GPUs, and tokens stream back. The engineering of inference is a completely different discipline from training: training optimises for total throughput over hours or days; inference optimises for latency and cost per request, under strict real-time constraints.

The biggest costs at inference time are (1) matrix multiplies — a large model may do hundreds of billions of multiply-add operations per token; (2) memory bandwidth — every token generation streams all the weights (or a layer's worth) from GPU HBM, so bandwidth, not FLOPs, is often the bottleneck; and (3) the KV cache — in transformer inference, the model keeps a cache of "seen so far" keys and values per attention head, which grows with context length and dominates memory for long conversations.

The engineering wins have names. Continuous batching (Orca) packs many requests into one forward pass without padding to the slowest. PagedAttention (vLLM) treats KV cache as virtual memory with paging. Quantization (INT8, INT4, FP8) shrinks weights 2–8× with small accuracy loss. Speculative decoding uses a small draft model to propose tokens the big model then verifies in parallel. Together these techniques can cut cost per token by 10× over a naïve implementation — which is why the API price of calling a big model keeps falling.

Serving a trained model is a throughput-and-latency engineering problem. A request comes in, you run the forward pass, you return the output; behind that simple story is a choreography of GPU kernels, dynamic batching, and memory layouts engineered to extract as many tokens per second per dollar as physics allows.

For an autoregressive LLM, two phases dominate:

  Prefill phase  — prompt of N tokens, one forward pass on all N at once
                   COMPUTE-bound (large matmuls, GPU happy)
                   ~TFLOPS of arithmetic, ~tens of GB of weight reads

  Decode phase   — generate tokens one at a time, N+1, N+2, …
                   MEMORY-bound (each step reads all weights for one token)
                   ~1 KV-cache row per layer per head per previous token
                   → HBM bandwidth is the ceiling, not FLOPS

  KV cache shape (per layer):
     keys   [batch, heads, seq_len, head_dim]
     values [batch, heads, seq_len, head_dim]

      70B-class Llama, FP16, GQA (80 layers, 8 KV heads, head_dim 128):
         ~320 KB/token of KV for one request
         → 8k context ≈ 2.5 GiB of HBM per sequence just for KV
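The same arithmetic works for any transformer. A small helper — the Llama-2-70B-style values below (80 layers, 8 KV heads under GQA, head_dim 128) are illustrative assumptions; check your model's actual config before trusting the numbers:

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # keys + values: 2 tensors of kv_heads * head_dim elements per layer
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def kv_cache_gib(context_len, **cfg):
    return kv_bytes_per_token(**cfg) * context_len / 2**30

cfg = dict(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2)  # FP16
per_token = kv_bytes_per_token(**cfg)   # 327,680 bytes ~= 320 KB/token
gib_8k = kv_cache_gib(8192, **cfg)      # 2.5 GiB per sequence at 8k context
```

Without GQA (64 full KV heads instead of 8) the same model would need 8× the KV memory — which is exactly why MQA/GQA exist.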
  • Continuous batching (vLLM, TensorRT-LLM, Orca 2022): instead of waiting for a whole batch to finish before starting the next, swap new requests in at the token level. GPU utilization goes from ~20% (naive static batching) to ~80%+, which is why throughput-per-GPU can vary 4× between serving stacks.
  • PagedAttention (vLLM) treats the KV cache like virtual memory: non-contiguous GPU memory "pages" hold KV blocks, and a per-request page table maps logical token positions to physical blocks. This kills memory fragmentation at long-context scale and is the single biggest reason vLLM dominates open-source LLM serving.
  • Quantization — fewer bits per weight at inference — is free throughput in most cases. GPTQ, AWQ, SmoothQuant, int4 and int8 weight-only quantization cut model memory 2–4× with 0.1–1.0 point degradation on standard benchmarks. FP8 (native on Hopper/Blackwell hardware) trains and serves at 8-bit float and is rapidly becoming the default.
  • Speculative decoding (Leviathan et al., 2022): a small draft model proposes k tokens, the big model verifies them in one forward pass. If the big model agrees on j ≤ k tokens, you just generated j tokens in roughly one big-model step — 2–3× speedup on well-matched drafter/target pairs.
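The speculative-decoding payoff has a closed form under an independence assumption: if each drafted token is accepted with probability α, a draft of length k yields a geometric-series expected number of tokens per big-model pass (ignoring the drafter's own cost):

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per target-model forward pass when a
    draft proposes k tokens, each accepted independently w.p. alpha.
    Sum of 1 + alpha + ... + alpha^k = (1 - alpha^(k+1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return k + 1.0
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# A well-matched drafter (alpha ~ 0.8) proposing 4 tokens at a time:
speedup = expected_tokens_per_step(0.8, 4)   # ~3.4 tokens per big-model step
```

At α = 0 you get exactly 1 token per step — no worse than plain decoding, which is what makes the technique safe to deploy.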

The model you want: prefill is compute-bound, decode is memory-bound; the serving stack's job is to keep both phases near the GPU's ceiling. Throughput is a function of batch size, KV-cache management, and quantization; latency is a function of time-to-first-token (the prefill) and per-token decode speed — how quickly a request gets from your queue to the matmul.

WARNING

"Our LLM API is slow" usually means "we're running at batch size 1 or with static batching." Moving to a serving stack with continuous batching (vLLM, TGI, TensorRT-LLM, SGLang) often gives 3–10× throughput improvement without changing the model. The bill drops by the same factor. Measure before you assume the model is the problem.

The wider picture. Inference is a fast-evolving ecosystem:

  • Serving frameworks — vLLM, SGLang, TensorRT-LLM, Triton Inference Server, TGI (Text Generation Inference), Ray Serve, BentoML, llama.cpp, MLX, Ollama.
  • Quantization formats — INT8 (GPTQ, AWQ, SmoothQuant), INT4 (QLoRA, bitsandbytes), FP8, FP4, binary / ternary.
  • KV-cache management — PagedAttention, prefix caching, shared prompts, sliding-window attention, MQA / GQA (multi- vs grouped-query attention).
  • Batching strategies — continuous batching, chunked prefill, speculative decoding, lookahead decoding, medusa heads.
  • Model compression — distillation, pruning, mixture-of-experts, structured sparsity.
  • Retrieval-augmented generation (RAG) — vector database + retrieval + prompt augmentation. Pinecone, Weaviate, pgvector, Qdrant, Milvus.
  • Agents and tool use — function calling, OpenAI / Anthropic tool schemas, MCP (Model Context Protocol), agent frameworks (LangGraph, Semantic Kernel).
  • Edge / on-device inference — CoreML, TensorFlow Lite, ONNX Runtime, WebGPU, llama.cpp, MLX. Models running on phones and laptops.
  • Inference hardware — NVIDIA H100 / B100 / GB200, AMD MI300, Google TPU v5e/v5p, AWS Trainium / Inferentia, Groq LPU, Cerebras, Tenstorrent.
  • Cost optimisation — spot GPUs, multi-tenant hosting, reserved capacity, MoE routing, mixture of model sizes by request class.
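Weight-only quantization in its simplest form is a sketchable idea: symmetric per-tensor INT8 with scale = max|w|/127. Real schemes (GPTQ, AWQ) are per-channel or per-group and calibration-aware, but the round-trip below shows where the error bound comes from:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: one scale for the whole tensor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.02, -0.51, 0.33, 1.27, -1.0]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# round-trip error is bounded by half the quantization step, scale / 2
err = max(abs(a - b) for a, b in zip(w, w_hat))
```

The storage win is immediate — one byte per weight instead of two (FP16) or four (FP32) — and the error bound shrinks as the scheme moves from per-tensor to per-channel to per-group scales.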

Where this shows up next. Station 7 is how you evaluate what the model actually serves; Station 8 is how you govern it. Cross-link to the Computer Architecture handbook for the silicon inference runs on, and to the Cloud & Infrastructure handbook for how serving scales.

Go deeper: Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023); Yu et al., "Orca: A Distributed Serving System for Transformer-Based Generative Models" (OSDI 2022); Leviathan et al., "Fast Inference from Transformers via Speculative Decoding" (ICML 2023); NVIDIA TensorRT-LLM docs (the most accessible serving-engine documentation); Dettmers et al., "QLoRA" and "LLM.int8()" papers on quantization.

Station 7 — Evaluation, drift, and the offline/online gap

A model that scores 99% on a held-out test set can still ship a disaster in production. The reasons are mundane and frequent. The test set did not match the users you actually have. The training data was stale by the time the model shipped. The distribution of inputs shifted after launch (seasonality, trend, a new marketing campaign). The labels in training were noisy or biased. The metric you optimised correlates poorly with the business outcome you actually care about. Every mature ML team has a story in each of these categories.

Evaluation is the practice of measuring a model's real usefulness, not its test-set number. There is an offline evaluation — held-out sets, A/B held-out splits, task-specific benchmarks — and an online evaluation — A/B tests, shadow traffic, canary rollouts, comparing the new model's live metrics against the incumbent. Both are necessary; neither is sufficient alone. Drift is the phenomenon of the world changing faster than the model: data drift (input distributions shift), concept drift (the relationship between inputs and correct outputs shifts), and model drift (the model silently degrades as upstream systems mutate). Catching drift requires the same observability discipline the Engineering Craft handbook describes, applied to feature distributions and prediction distributions instead of latency and error rate.

A model that performs well on a held-out test set will not automatically perform well in production. The world drifts — the inputs shift, users change behaviour, adversaries adapt — and offline metrics that looked solid on Monday can be meaningless on Friday. Evaluation is not a step before deployment; it is a continuous system.

  • Training-serving skew — a feature that is computed differently offline vs online — is the #1 silent killer of deployed ML. Classic example: a "last hour of user activity" feature is cached for the train set but freshly computed at serving, giving the model different values. The fix is a single feature store (Feast, Tecton) computing features identically in batch and streaming paths.
  • Data drift (P(X) changes) and concept drift (P(Y|X) changes) are different diagnoses. Monitor both: Kolmogorov–Smirnov or population stability index (PSI) on input features, and rolling accuracy / precision / recall on labelled outcomes. A model with stable inputs but falling accuracy has concept drift; a model with shifting inputs may still perform well if the shift is within the training distribution.
  • LLM evaluation is its own universe. Leaderboard benchmarks (MMLU, HumanEval, GSM8K, HELM, Chatbot Arena) measure different things and game differently. Production evaluation uses LLM-as-judge (one model scores another's outputs against a rubric) combined with a small human-graded set as calibration. Contamination — the test set appearing in pretraining data — is endemic; the best defense is a private, regularly-refreshed eval set.
  • Statistical power in A/B tests: a minimum detectable effect (MDE) of one percentage point on a 5% conversion rate requires roughly 8,000 samples per arm to be 80% powered at α=0.05. Running without a pre-declared MDE is how teams "see" improvements that aren't there; an SRM check (Sample Ratio Mismatch) verifies assignment itself was random.
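The sample-size arithmetic is worth keeping as a helper. A sketch of the two-proportion normal approximation — a planning estimate, not a replacement for a proper power analysis:

```python
import math
from statistics import NormalDist

def samples_per_arm(p_base, mde_abs, alpha=0.05, power=0.80):
    """n per arm for a two-proportion z-test, normal approximation:
    n = (z_{alpha/2} + z_beta)^2 * (p1(1-p1) + p2(1-p2)) / mde^2"""
    p1, p2 = p_base, p_base + mde_abs
    z = NormalDist().inv_cdf
    z_alpha, z_beta = z(1 - alpha / 2), z(power)
    var_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * var_sum / mde_abs ** 2)

n = samples_per_arm(0.05, 0.01)   # ~8,000 per arm for a 1-pp lift on 5%
```

Note how fast n grows as the MDE shrinks: halving the detectable effect quadruples the required traffic, which is why tiny "wins" need enormous experiments.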

The model you want: evaluation is a feedback loop that must close faster than the world drifts. Offline is a filter; online is the truth; drift monitors are the alarm that tells you offline stopped mirroring online.

CAUTION

"The eval set says we're better" is not the same as "we're better." Test on representative data, not just convenient data; watch your metric variance, not just its mean; and trust multi-week A/B tests over three-day "wins." The shortest path from a well-intentioned model to a costly regression is a weak eval protocol.

The wider picture. Evaluation has many flavours:

  • Benchmarks — MMLU, HumanEval, GSM8K, BIG-Bench, HELM, MT-Bench, GPQA, SWE-bench, LiveCodeBench, Arena.
  • LLM-as-judge — use a strong model to grade outputs; high throughput, measurable bias.
  • A/B testing and causal inference — randomization, CUPED, stratification, Bayesian decision rules, multi-armed bandits.
  • Shadow, canary, and blue-green deploys — release the new model to a fraction of traffic and compare.
  • Drift detection — population stability index, Kolmogorov–Smirnov tests, adversarial validation, distribution-matching metrics.
  • Red-teaming and evals for LLMs — prompt-injection tests, jailbreak evaluations, safety benchmarks (TruthfulQA, ToxicChat, HarmBench).
  • Interpretability — SHAP, LIME, integrated gradients, attention visualisation, circuit-level mechanistic interpretability.
  • Calibration — reliability diagrams, expected calibration error, temperature scaling, Platt scaling.
  • Bias and fairness — demographic parity, equal odds, counterfactual fairness. See the Security & Cryptography handbook for the threat-modelling overlap.
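The PSI mentioned under drift detection is simple enough to implement inline: bin a baseline sample, compare bin proportions in the live sample, and sum the weighted log-ratios. The 0.2 alert threshold is the common rule of thumb, not a law:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two 1-D samples; bins are
    taken from the baseline (expected) sample's range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0
    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            i = min(max(int((x - lo) / width), 0), bins - 1)
            counts[i] += 1
        # floor tiny proportions so log() is defined for empty bins
        return [max(c / len(sample), 1e-6) for c in counts]
    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]        # uniform on [0, 1)
shifted  = [0.5 + i / 200 for i in range(100)]  # mass pushed into [0.5, 1)
# psi(baseline, baseline) == 0.0; psi(baseline, shifted) >> 0.2 → alert
```

Run it per feature, per day, against a frozen training-time baseline; the features that alert first are usually the ones the model leans on hardest.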

Where this shows up next. Station 8 is the governance practice that makes these evaluations auditable. Cross-link to the Engineering Craft handbook for the observability and on-call practices, and to the Security & Cryptography handbook for privacy and red-teaming.

Go deeper: Gama et al., "A Survey on Concept Drift Adaptation" (ACM Computing Surveys 2014); Liang et al., "Holistic Evaluation of Language Models" (HELM, 2022); Kohavi, Tang & Xu, Trustworthy Online Controlled Experiments (the friendly canonical book on A/B testing); Chip Huyen, Designing Machine Learning Systems chapters on monitoring and deployment — the most accessible ML-infra book; Zheng et al., "Judging LLM-as-a-Judge" (NeurIPS 2023).

Station 8 — Governance, lineage, and the deletion request

The final station is the least glamorous and — for many organisations — the one that eventually determines survival. Data governance is the set of practices that keep data lawful, ethical, auditable, and reversible. It is the layer that answers: who is allowed to read this table? Where did this column come from? If a user asks to be forgotten, what systems have to change? If a regulator asks "how did your model make this decision?", can you answer?

Modern data regulations — GDPR, CCPA, HIPAA, the EU AI Act — turn governance from a nice-to-have into a legal requirement. The right to deletion (GDPR Article 17) requires that if a user requests their data be deleted, it is actually deleted — from primary stores, backups, derived datasets, and trained models where feasible. That is an engineering problem: unless you tracked every derived dataset back to its source ("lineage"), you cannot fulfil the request. Model cards (Mitchell et al. 2019) and data sheets document what a model was trained on, intended for, and measured by. Access controls, audit logs, and encryption at rest / in transit are the low-level primitives; the Security & Cryptography handbook covers them in depth.

Data governance is the boring word for the difference between a company that can answer "where is this user's data" in a morning and a company that spends a quarter doing it under a court order. GDPR (EU, 2018), CCPA (California, 2020), and the EU AI Act (2024) put legal teeth on the ability to produce, correct, or delete records end-to-end — every table, every model, every downstream feature store.

  Minimum governance capabilities a data+AI stack must carry:

   1. Catalog           — every dataset has an owner, a description, a schema,
                          PII classification, retention policy, access grants
   2. Lineage           — for each column/feature, the upstream tables and
                          transformations; for each model, the training data
   3. Access control    — column-level and row-level policies, not just table-
                          level; attribute-based ideally (ABAC)
   4. Audit log         — who read what, when, via which tool
   5. Deletion          — forward propagation of a subject-access or erasure
                          request through every downstream copy
   6. Retention         — time-based lifecycle rules, soft-delete windows,
                          hard-delete scheduling
   7. Consent           — which records are usable for which purpose
                          (marketing, training, analytics) by explicit flag
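Capabilities 2 and 5 compose: with table-level lineage as an adjacency list, fulfilling an erasure request is a graph traversal to every transitive downstream copy. A minimal sketch — the table names are hypothetical:

```python
from collections import deque

# Hypothetical lineage graph: dataset -> datasets derived from it
lineage = {
    "raw.users": ["staging.users_clean"],
    "staging.users_clean": ["marts.user_features", "marts.marketing_list"],
    "marts.user_features": ["models.churn_training_set"],
    "marts.marketing_list": [],
    "models.churn_training_set": [],
}

def erasure_targets(root, lineage):
    """BFS from the source table: every dataset that must act on the
    deletion request, including derived features and training sets."""
    seen, queue = {root}, deque([root])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

targets = erasure_targets("raw.users", lineage)
# all five datasets — down to the model's training set — need the delete
```

If the graph is incomplete, so is the deletion — which is why ungoverned copies (see the warning below on home-directory datasets) are the real risk.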
  • Column-level lineage is the missing middle layer in most stacks. Table-level lineage is easy (dbt graph, Airflow DAG, Iceberg manifests); column-level is what you need to answer "which downstream model uses user.home_address." Tools like OpenLineage (LFAI), dbt docs with column-level extraction, and DataHub solve different slices of this.
  • Training data deletion under GDPR Art. 17 is an unsolved technical problem at the weights level — once a record is baked into a model's weights, "removing" it requires retraining or machine unlearning. In practice teams handle it by scheduling full retrains on a cadence that bounds the residency of any individual record and logging the inclusion/exclusion to support auditability.
  • Consent propagation: a record tagged as "consented for marketing but not for training" must carry that flag through every downstream pipeline. The cleanest implementation is a policy tag attached at ingestion (column or row) and enforced by the query engine, not by the individual job author.
  • PII classification at scale uses a mix of schema conventions (column named email, ssn) and content scanning (regex + ML classifiers for addresses, phone numbers, named entities). AWS Macie, Google DLP, and open-source tools like Microsoft Presidio do the scanning; the output feeds the catalog and the access policies.
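The schema-convention half of PII classification is a few lines of code; the content-scanning half is where real tools (Presidio, Macie, DLP) earn their keep. A sketch with hypothetical column names:

```python
import re

PII_NAME_HINT = re.compile(r"(email|ssn|phone|address|dob|passport)", re.I)
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def classify_columns(schema, sample_rows):
    """Flag columns whose NAME or sampled CONTENT looks like PII.
    schema: list of column names; sample_rows: list of dicts."""
    flagged = {col for col in schema if PII_NAME_HINT.search(col)}
    for row in sample_rows:
        for col, value in row.items():
            if isinstance(value, str) and EMAIL_PATTERN.search(value):
                flagged.add(col)   # content scan catches misnamed columns
    return sorted(flagged)

schema = ["user_id", "contact_email", "notes"]
rows = [{"user_id": "42", "contact_email": "a@b.com",
         "notes": "mail me at x@y.org"}]
flagged = classify_columns(schema, rows)
# flags contact_email (name hint) and notes (content scan), not user_id
```

The `notes` column is the interesting case: free-text fields accumulate PII that no naming convention will ever catch, which is why content scanning feeds the catalog too.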

The model you want: governance is not a report; it is a set of capabilities each pipeline stage must expose. If a pipeline stage cannot answer "who, what, when, why, how long, with what consent," it isn't production-ready, no matter how good its model is.

WARNING

Keeping "one extra copy" of a dataset in someone's home directory for a "quick experiment" is a GDPR accident waiting to be discovered. The moment data leaves the governed surface — laptop, ad-hoc S3 bucket, a notebook's local file — you cannot produce, correct, or delete it reliably. The fix is organizational (a governed exploration sandbox), not technical (a scan and a scolding).

The wider picture. Governance is a broad domain:

  • Regulation — GDPR (EU), CCPA / CPRA (California), HIPAA (US healthcare), PCI DSS (payments), SOX (financial reporting), EU AI Act (model regulation).
  • Lineage tooling — OpenLineage, Marquez, Apache Atlas, DataHub, Amundsen.
  • Access control — RBAC / ABAC, row-level security, column-level masking, policy-as-code (OPA).
  • Privacy engineering — differential privacy, k-anonymity, l-diversity, t-closeness, tokenization, format-preserving encryption, homomorphic encryption.
  • Model documentation — model cards (Mitchell 2019), data sheets for datasets (Gebru 2018), factsheets, system cards, transparency reports.
  • Auditability and observability — immutable audit logs, chain-of-custody, tamper-evident timelines, "right to explanation" implementations.
  • Responsible AI practices — impact assessments, red-teaming, bias audits, incident registers (MIT AI Incident Database).
  • Data retention and deletion — retention policies, backup expiration, crypto-shredding (encrypt data with a per-user key, delete the key).
  • Data contracts and ownership — every dataset has a human owner with a responsibility; ownerless data is governance-free by default.
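Crypto-shredding from the list above deserves a concrete shape: encrypt each user's records under a per-user key, and deletion becomes key deletion — every copy, backups included, becomes noise. The keystream below is a toy built from SHA-256 purely to illustrate the idea; use a real AEAD cipher (e.g. AES-GCM) in anything production:

```python
import hashlib, secrets

def xor_stream(key, data):
    """Toy keystream cipher — NOT real cryptography, illustration only."""
    out, counter = bytearray(), 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

user_keys = {}   # per-user key store: the ONLY surface that needs governing

def store(user_id, record):
    key = user_keys.setdefault(user_id, secrets.token_bytes(32))
    return xor_stream(key, record)   # ciphertext may live in backups forever

def forget(user_id):
    user_keys.pop(user_id, None)     # deleting the key shreds every copy

ct = store("u1", b"alice@example.com")
assert xor_stream(user_keys["u1"], ct) == b"alice@example.com"
forget("u1")
# ct is now unrecoverable — even copies in old backups are just noise
```

The design choice is that deletion never has to touch the data: backups, replicas, and downstream copies all stay immutable, and only the small key store needs an auditable delete path.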

Where this shows up next. Governance closes the loop: an incident at Station 7 may feed a lineage question at Station 8, which may drive a data-quality fix at Station 1. Cross-link to the Security & Cryptography handbook for the defensive primitives, and to the Engineering Craft handbook for the practices that keep it honest.

Go deeper: GDPR (Regulation (EU) 2016/679), particularly Articles 5, 15, 17, 20, 22; the EU AI Act (Regulation (EU) 2024/1689) on high-risk model documentation; ISO/IEC 38505-1 on data governance; Mitchell et al., "Model Cards for Model Reporting" (FAT* 2019); Gebru et al., "Datasheets for Datasets" (2018); the OpenLineage specification; Chip Huyen, Designing Machine Learning Systems chapter 8.

How the stations connect

Data moves left-to-right, from the transactional source to the model's predictions; governance wraps every stage. The AI plane is a downstream consumer of the data plane — its inputs are the lakehouse's tables, its outputs become features for other tables, and its governance constraints are stricter than any analytical query's.


The Foundations handbook defines the wire formats this stack serialises into; the Systems & Architecture handbook covers the delivery semantics ingestion inherits from Kafka; the Computer Architecture handbook is where training and inference actually run.

Standards & Specs

  • Apache Parquet format specification — the columnar format every engine speaks.
  • Apache Iceberg spec v2 and Delta Lake protocol — the open table formats.
  • Apache Avro spec — schema-on-write serialization favoured by Kafka.
  • OpenLineage spec — cross-tool lineage metadata.
  • ISO/IEC 38505-1:2017 — data governance principles.
  • NIST AI RMF 1.0 — AI risk management framework.
  • EU General Data Protection Regulation (Regulation (EU) 2016/679) — personal-data processing in the EU.
  • EU AI Act (Regulation (EU) 2024/1689) — risk-based AI regulation with transparency and documentation duties.
  • FTC Section 5 and CCPA — the US privacy backstop.
  • Canonical papers — Vaswani et al., "Attention Is All You Need" (NeurIPS 2017); Kaplan et al., "Scaling Laws for Neural Language Models" (2020); Hoffmann et al., "Chinchilla" (2022); Rajbhandari et al., "ZeRO" (SC 2020); Kwon et al., "PagedAttention / vLLM" (SOSP 2023); Leviathan et al., "Speculative Decoding" (ICML 2023); Dageville et al., "Snowflake" (SIGMOD 2016); Armbrust et al., "Lakehouse" (CIDR 2021); Melnik et al., "Dremel" (VLDB 2010); Boncz et al., "MonetDB/X100" (CIDR 2005).
  • Books — Kleppmann, Designing Data-Intensive Applications. Huyen, Designing Machine Learning Systems. Kohavi, Tang & Xu, Trustworthy Online Controlled Experiments. Bishop & Bishop, Deep Learning: Foundations and Concepts. Goodfellow, Bengio & Courville, Deep Learning. Machine Learning Design Patterns (Lakshmanan et al.).

Test yourself

A company's dashboards run on a Postgres replica. Engineering proposes moving them to a "data lake on S3." What three questions should the first design review answer?

(1) Read pattern and concurrency. Dashboards typically need interactive latency (sub-second) at high concurrency — a Parquet-on-S3 stack can serve ad-hoc scans at seconds-to-minutes, not milliseconds. If dashboards need sub-second, a warehouse or a cached BI layer is needed, not raw S3. (2) Table format and catalog. Without Iceberg/Delta + a catalog, concurrent writes corrupt quietly and schema evolution is manual. (3) Governance. Who has read access to PII columns, how is that enforced, and can deletions be propagated? See Stations 2, 4, 8.

An LLM API serves ~30 tokens/s for a single user. Users request longer prompts (8k → 32k context). Predict what happens to both throughput and cost if the serving stack is unchanged, and name two improvements.

The KV cache scales linearly with context length. Going 8k → 32k quadruples per-sequence KV-cache memory, cutting the number of sequences that fit in HBM by 4×, which shrinks the effective batch and cuts throughput roughly in the same ratio. Decode is memory-bound, so cost per token roughly quadruples on the same hardware. Improvements: (a) PagedAttention (vLLM) to eliminate KV fragmentation; (b) weight-only int4 quantization to free HBM for more KV; and at the cluster level (c) tensor-parallelism to share KV across more HBM. See Station 6.

A new model shows 2% offline accuracy gain on the holdout set but appears to degrade user engagement in an A/B test. Name three classes of cause and the first thing you'd check for each.

(1) Training-serving skew — features computed differently at train vs serve time. Check the feature store for identical definitions across batch and online paths. (2) Offline set is stale or biased — holdout data doesn't reflect production distribution. Compare input feature distributions between holdout and live traffic with KL divergence or PSI. (3) Metric misalignment — offline accuracy is not the thing users care about (e.g. optimizing precision but users want recall, or vice versa). Check whether the A/B metric and the training objective point in the same direction. See Station 7.