Intelligence

A database keeps a promise about durability. Machine learning makes a bet: that patterns in old data still hold on new data. The bet has the same shape in every modern system. Collect examples. Fit a function with millions to trillions of internal numbers. At run time, apply that function to inputs it has never seen. Whether it works is an empirical question, not a formal one — the pattern that fit yesterday may or may not fit tomorrow. This sub-act walks from "what is ML" up through transformers, language models, retrieval, and agents, and ends on the failures no architecture has erased.

The intelligence pipeline — data, training, inference, and the augmentation patterns above itDataTrainEvalDeploycollect · clean · labelfit weights · GPU daysheld-out · benchmarkship weights · serveInferencenew input → forward pass → predictionfeedbackRetrieval-augmented generationAgentsembed query · vector searchstuff top-k chunks into promptthink → call tool → observeloop until doneThe honest limitshallucinationplausible-wrongevaluation gapno ground truthinference costper-token computedata qualitythe real bottlenecka model is a function with billions of learned numbers — everything above is what you do with one
Train and infer are wildly asymmetric in cost. Most engineering above the model is about feeding it the right context (RAG) or letting it act in a loop (agents). Most failures still come from the data, not the architecture.

What ML actually is

Classical software needs a programmer to write every rule. That works when the rules are short — if salary > 50000 and tenure > 2 then approve — and breaks when the rules are uncountable. Nobody can write down the rule that distinguishes a cat photo from a dog photo at the pixel level. Machine learning inverts the relationship: provide examples — input/output pairs — and let an optimizer find a function that maps inputs to outputs by adjusting a large pile of internal numbers called parameters or weights. The output is still a function — a fixed sequence of arithmetic — but no human wrote the constants.

Two phases dominate the lifecycle, and they cost wildly different amounts. Training runs once: an optimizer iterates over the training set, nudging each parameter in the direction that reduces error. Training a frontier model burns weeks on thousands of GPUs and tens to hundreds of millions of dollars. Inference runs every time a user sends a request: a single forward pass through the same arithmetic, no optimizer, no gradient — milliseconds on one GPU. Training happens once. Inference happens billions of times. Production economics are inference economics.

Training and inference — same model, very different costsTrainingInferencefind the pattern · run onceapply the pattern · run many timesTraining dataArchitectureOptimizer (SGD / Adam)forward · loss · backprop · updateTrained weights≈10²³–10²⁵ FLOPs · weeks · thousands of GPUs1M1M–100M for a frontier modelNew inputTrained weightsForward passno gradients · no updatesPrediction≈10⁸–10¹¹ FLOPs / call · milliseconds0.00010.0001–0.10 per call · billions of callstraining is a one-time investment; inference is the recurring bill
This asymmetry shapes every downstream decision: weights are shipped once and cached forever, the inference path gets hand-tuned, and quantization (FP16, INT8, INT4) exists almost entirely to cut the inference bill.

Pitfall — train/serve skew. The function the model learned is only valid on the distribution it was trained on. Production inputs that drift — a new locale, a sensor recalibration, a user-behaviour shift — quietly degrade quality without throwing any error. Monitor input statistics, not just output accuracy, or you catch drift only after users complain.

Three learning modes

Different problems come with different supervision signals, and the signal dictates which algorithm fits.

Supervised learning has labelled pairs (X, Y) and learns f(X) ≈ Y. Image classification, spam detection, machine translation, and the second-stage tuning of LLMs all live here. The label is the entire teaching signal, so the quality of the labels caps the quality of the model.

Unsupervised learning has only X — no labels — and learns the structure of the input distribution itself: clusters, low-dimensional projections, density. Pre-training of an LLM is technically self-supervised — the label "what comes next" is extracted from the data itself — but the engineering shape is unsupervised: pour text in, no human annotation required.

Reinforcement learning has neither labels nor a static dataset. An agent takes actions in an environment, observes a reward, and learns a policy that maximises cumulative reward over time. Game-playing systems and the human-feedback stages of LLM training live here.

Three learning modes — different supervision signals, different problemsSupervisedUnsupervisedReinforcementlabeled examplesstructure of the datatrial · error · reward(X, Y) pairscat → "cat"dog → "dog"cat → "cat"learn f(X) ≈ Yclassificationregression · translationlearn P(X) · find clustersclustering · PCA · embeddingsstate saction areward rgames · robotics · RLHFmost production ML is supervised; the most interesting frontier is the boundary between the three
Modern LLMs use all three: self-supervised pre-training on internet text, supervised fine-tuning on instruction–response pairs, and reinforcement learning on human-preference comparisons.

Pitfall — using the wrong mode for your data. Reaching for reinforcement learning when supervised data would do is one of the most expensive mistakes a team can make. RL needs a tight feedback loop, a fast simulator, and a well-shaped reward. Without those, it converges slowly or not at all. If you have labelled examples, use them.

Neural networks

A function with a hundred billion knobs needs a parameterised shape that's expressive enough to fit anything and structured enough to train. A neural network is that shape: a stack of simple parameterised functions, called layers, that compose into one big differentiable function.

The basic unit is a linear layer — multiply the input vector by a weight matrix, add a bias — followed by a nonlinearity like ReLU (max(0, x)) or GELU. The nonlinearity is the only reason stacking helps: chain pure linear layers and they collapse back into a single linear layer by matrix multiplication. With one wide hidden layer, a network is already a universal approximator — it can fit any continuous function on a bounded domain. With many narrower layers stacked deep, the same approximation becomes parameter-efficient: depth lets the network compose features (edges to textures to shapes to objects) instead of memorising each example.

A small feed-forward network — forward pass and backpropagationInputHidden (ReLU)Output (softmax)forward pass · y = softmax(W₂ · ReLU(W₁ · x + b₁) + b₂)backward pass · ∂L/∂w via chain rule · update w ← w − η · ∂L/∂wW₁, b₁W₂, b₂
Every edge is a learned weight. A hidden layer in a modern LLM is several thousand wide; transformers stack dozens to hundreds of such layers; the resulting parameter counts run into the billions.

Training a network means three pieces working together. A loss function scores how wrong each prediction is — mean-squared error for regression, cross-entropy for classification, both differentiable. Backpropagation computes the gradient ∂L/∂w for every weight in one backward sweep using the chain rule from calculus; without it, estimating gradients in a deep network would require exponentially many forward passes. Gradient descent then takes a small step in the opposite direction: w ← w − η · ∂L/∂w, where η is the learning rate, usually 10⁻³ to 10⁻⁵.

Worked example: one gradient-descent step on f(w) = (w − 3)²

Take the simplest possible loss: a single weight w and a loss f(w) = (w − 3)² that is minimised at w = 3. Start at w = 5. The current loss is f(5) = (5 − 3)² = 4. The gradient — the slope of the loss with respect to w — is the derivative f'(w) = 2 · (w − 3). At w = 5 that slope is 2 · 2 = 4. It's positive, meaning the loss is rising in the direction of larger w, so to decrease the loss we step the opposite way — toward smaller w.

With learning rate η = 0.1:

w_new = w − η · gradient = 50.1 · 4 = 4.6
f(4.6) = (4.63)² = 2.56

Loss dropped from 4 to 2.56 in one step. Repeating: w = 4.28, then 4.024, then 3.819. After about 20 steps w ≈ 3.001. The gradient itself shrinks as w approaches 3, so the steps shrink with it — gradient descent slows down on its own at the minimum.

A real network has the same procedure, just wider. The loss is a function of millions to trillions of weights, the gradient is a vector of the same size — one slope per weight — and every weight updates by w_i ← w_i − η · ∂L/∂w_i in the same step. The single-knob walk above is what's happening to each knob, simultaneously.

Worked example: backprop on a one-neuron network y = w · x + b

Backpropagation is just the chain rule from calculus applied bottom-up. The smallest example that shows it: one input x, one weight w, one bias b, one output y = w · x + b, target t, squared-error loss L = (y − t)².

Forward pass. Pick x = 2, t = 5, and current w = 1, b = 0. Compute:

y = 1 · 2 + 0 = 2
L = (25)² = 9

Backward pass. We need ∂L/∂w and ∂L/∂b. The chain rule says: differentiate L with respect to y first, then multiply by how y depends on each parameter.

L/∂y = 2 · (y − t) = 2 · (25) = −6      (how loss reacts to output)
∂y/∂w = x = 2                                (how output reacts to w)
∂y/∂b = 1                                    (how output reacts to b)

L/∂w =L/∂y · ∂y/∂w =6 · 2 =12
L/∂b =L/∂y · ∂y/∂b =6 · 1 =6

Update. With η = 0.01:

w ← 10.01 · (12) = 1.12
b ← 00.01 · (6)  = 0.06

Check. New y = 1.12 · 2 + 0.06 = 2.30. New L = (2.30 − 5)² = 7.29. Loss dropped from 9 to 7.29.

In a deep network the chain extends: ∂L/∂(weight in layer 1) = ∂L/∂y · ∂y/∂(layer N output) · … · ∂(layer 2 output)/∂(layer 1 output) · ∂(layer 1 output)/∂(weight). Each layer contributes one factor to the product. Backprop walks the chain once, right-to-left, reusing the partial products — which is why it costs one backward pass, not exponentially many.

In practice nobody computes the gradient over the entire dataset at once. Stochastic gradient descent computes it on a small batch — 32 to a few thousand examples — and takes a step. The result is noisier but vastly faster. Adaptive variants like Adam track per-parameter running statistics so each weight gets its own effective learning rate. Adam plus a decaying schedule is the workhorse for almost all transformer training.

The training loop — one epoch is one pass over the data, broken into mini-batchesfor epoch in 1…E: shuffle dataset · split into mini-batchesone epoch = one full pass over the training setfor each mini-batch:ForwardLossBackwardOptimizer stepx → ŷL(ŷ, y)∂L/∂w (chain rule)w ← w − η · ∂L/∂wbatch sizelearning rate ηepochs E32 · 256 · 409610⁻³ to 10⁻⁵1–100 (vision) · ≈1 (LLM)three knobs do most of the work — everything else is regularization and schedule
Frontier LLMs do not finish a full epoch — the corpus is so large that one pass exhausts the compute budget. Vision and tabular models run dozens of epochs and lean on early stopping to know when to quit.

Pitfall — vanishing and exploding gradients. Deep networks multiply many small or large gradients along the chain rule. Products of many sub-1 numbers collapse to zero; products of many >1 numbers blow up. Either way, the optimizer stops working. Residual connections (which add the input back to the output of each block), layer normalization, careful weight initialization, and gradient clipping are the standard fixes — and they are what made networks deeper than about ten layers trainable at all.

Reading the loss curves during training tells you which thing is going wrong. A well-fit model sees training and validation loss decline together and plateau near each other. An overfit model drives training loss toward zero while validation loss bottoms out and climbs — the model is memorising training examples instead of learning the underlying pattern. An underfit model sees both losses stall high — the architecture lacks the capacity or the features are too weak. The fixes are different, so reading the shape matters: overfitting wants more data, regularization, or earlier stopping; underfitting wants a bigger model or a longer schedule.

Training vs validation loss — three classic shapesWell-fitOverfittingUnderfittinglossepochsepochsepochstrainval (close)early stopval risestrain fallsboth stuckhigh lossmodel has capacitydata covers taskcapacity too highor data too smallcapacity too lowor features too weaksolid line: training loss · dashed: validation loss · the gap between them is the generalisation error
The fixes diverge. Overfitting wants more data, regularization, dropout, or earlier stopping. Underfitting wants a bigger model, better features, or longer training. Mistaking one for the other wastes the next training run.

Architectures that bake in structure

Real signals are not bags of independent numbers. Pixels next to each other are related. Words earlier in a sentence constrain words later. A bare feed-forward network treats every input position identically and has to discover spatial or temporal structure from scratch. The architectures that won did so by baking the right structure — the inductive bias — directly into the layer.

Convolutional networks (CNNs)

A convolutional layer shares a small set of weights — a kernel, typically 3×3 — across every position in the image. The same edge detector applies whether the edge is in the corner or the centre, which means the model needs orders of magnitude fewer parameters than a fully-connected layer with the same coverage. Pooling layers reduce resolution between convolutions, so stacked small kernels see large effective regions of the input. CNNs dominated computer vision through the 2010s and still ship in most real-time vision systems.

A convolutional layer: shared kernel slides over the input gridInput image (5×5)Kernel (3×3)Feature map (3×3)w₀w₁w₂w₃w₄w₅w₆w₇w₈slide kernel over every position · output[i,j] = Σ kernel · input[i:i+3, j:j+3]9 weights apply to all positions · translation equivariance for freestack: conv → ReLU → pool → conv → … → fully-connected → class scores
The kernel is the entire learned vocabulary of the layer. Reusing it across millions of pixel positions is what makes vision networks tractable.

Recurrent networks (RNNs)

A recurrent network processes sequences one element at a time, carrying a hidden state forward from step to step. Each step sees the previous state plus the current input. LSTMs and GRUs added gating cells that let gradient signal flow over hundreds of steps without vanishing, which made them workable for translation and speech in the mid-2010s.

The fatal weakness was sequential dependence: step t+1 cannot start until step t finishes. GPUs love parallelism; RNNs don't have any along the time dimension. Training was slow and refused to scale, which is exactly the constraint the next architecture broke.

Transformers and attention

Transformers abandoned recurrence and used self-attention instead. Every position in the sequence produces three vectors: a query Q saying "what am I looking for," a key K saying "what do I contain," and a value V saying "what I'll contribute if you pick me." The output at each position is a weighted sum over all values, with weights computed as softmax(Q · Kᵀ / √d_k) — the softmax turns a row of raw scores into a probability distribution that sums to one. Every token sees every other token in a single matrix multiplication — fully parallel across positions, exactly the workload GPUs run fastest.

Scaled dot-product attention — one headInput XQ · Kᵀ / √d_kOutput[n × d_model]QKV[n × d_k][n × d_k][n × d_v]softmax[n × n]attn · V[n × d_v]Three learned projections turn the input into queries, keys, values.Q · Kᵀ produces an n × n score matrix; softmax turns rows into probabilities.Each output row is a weighted sum of all V rows — every position attends to every other in one shot.cost: O(n² · d) per layer · multi-head runs h such heads in parallel and concatenates
The breakthrough is parallelism. The cost is O(n²) in sequence length — fine at 1,000 tokens, painful at 200,000. Most "long-context" engineering attacks that quadratic.

A transformer first maps each input token to a learned vector — its embedding — and the attention layers operate on those vectors. Attention by itself is permutation-equivariant — shuffle the input tokens and the output shuffles the same way, because there's nothing in the operation that ties position 3 to a later step than position 2. Positional encoding fixes that: every position gets a unique vector added to (or rotated into) its embedding. The original transformer used fixed sinusoids; modern variants like RoPE rotate the query and key vectors by an angle proportional to position. The mechanism differs; the role is identical — mark where each token sits.

Positional encoding — every position becomes a unique vector that gets combined with the token embeddingToken embeddingPositional encodingSum (input to layer 1)E(▁the)E(▁cat)E(▁sat)d_modeld_modeld_modelPE(0)PE(1)PE(2)E + PE(0)E + PE(1)E + PE(2)PE(pos, 2i) = sin(pos / 10000^(2i/d)) · PE(pos, 2i+1) = cos(...)low frequencies encode coarse position · high frequencies encode neighbour-level offsetsRoPE rotates Q and K by angle proportional to position · ALiBi adds a per-distance bias to the score
Position is the inductive bias attention does not have for free. The encoding choice mostly determines how the model behaves at sequence lengths beyond its training window.

A single attention head learns one notion of "related." Multi-head attention runs several heads in parallel, each with its own smaller projections, then concatenates the outputs. Without anyone supervising it, one head ends up tracking syntactic dependencies, another tracks coreference, another tracks distance. None of this is hand-engineered; it falls out of giving the architecture room to learn multiple relations at once.

Multi-head attention — H parallel heads, each in a smaller subspace, concatenatedInput X[n × d_model]head 1head 2head 3head 4Q,K,VQ,K,VQ,K,VQ,K,VH = 8…128syntaxcorefdistancetopic[n × d_k][n × d_k][n × d_k][n × d_k]concat (head 1, head 2, …, head H)[n × H · d_k] · output projection W_oOutput
Total parameter count is unchanged — d_model splits across H heads — but the model can attend to several different relations at once. Different heads specialise without supervision.
Worked example: scaled dot-product attention on a 2-token sequence

A single-head attention pass with sequence length n=2 and dimension d=2. Just enough math to see the shape.

Input. Two token embeddings:

X = [ [1, 0],
      [0, 1] ]

1. Project to Q, K, V. Real models multiply by learned weight matrices. To keep the arithmetic visible we use identity projections, so Q = K = V = X.

2. Score matrix Q · Kᵀ. Row i, column j is "how much should token i attend to token j?"

Q · K= [[1, 0],
          [0, 1]]

3. Scale by √d_k. With d_k = 2, the scaling factor is about 1.41. Without it, scores would saturate softmax once dimensions get large.

(Q · K) /2[[0.71, 0.00],
                 [0.00, 0.71]]

4. Softmax row-wise. Each row becomes a probability distribution.

Row 0: softmax([0.71, 0.00])[0.67, 0.33]
Row 1: softmax([0.00, 0.71])[0.33, 0.67]

Token 0 attends 67% to itself, 33% to token 1. Token 1 mirrors.

5. Multiply by V. Take a weighted average of value vectors.

Output = [[0.67·[1,0] + 0.33·[0,1]],
          [0.33·[1,0] + 0.67·[0,1]]]
       = [[0.67, 0.33],
          [0.33, 0.67]]

Each token's output is a blend of every token's value, weighted by attention.

The cost. Steps 2 and 5 are both O(n² · d). Doubling sequence length quadruples both. That's the wall every long-context architecture is bumping against.

Pitfall — quadratic attention is the budget. Doubling the context window quadruples attention compute and memory. Sliding-window attention, FlashAttention's IO-aware kernel, KV-cache sharing, and state-space models like Mamba each attack this quadratic cost from a different angle — through sparsity, better memory access, sharing across requests, or a non-attention recurrence — while preserving the rest of the transformer block.

Large language models

A large language model is a transformer trained on one objective: predict the next token given everything that came before. A token is a sub-word unit (an English word averages 1.3 tokens). The model outputs a probability distribution over a 30K–200K-token vocabulary, and generation samples one token at a time. That single objective, applied at sufficient scale, ends up learning syntax, factual associations, reasoning patterns, code structure, and dozens of capabilities nobody hand-specified.

"Given everything that came before" is enforced inside attention by a causal mask: position i may attend to positions 0…i and is blocked from positions i+1…n. The mask is a lower-triangular pattern that sets blocked attention scores to −∞, so softmax sends them to zero. Without it, training would leak the answer — position t could peek at token t+1 and learn the shortcut "look ahead." The mask is what makes next-token prediction a non-trivial objective.

Causal attention mask — lower-triangular pattern that prevents peeking at future tokensAttention scores (Q · Kᵀ)After causal mask + softmaxt₀t₁t₂t₃t₄t₀t₁t₂t₃t₄2.10.41.10.71.50.32.40.81.00.60.91.22.00.51.30.40.61.42.20.90.71.10.51.32.51.00.10.90.20.30.50.10.20.30.40.10.20.10.30.3−∞−∞−∞−∞−∞−∞−∞−∞−∞−∞t₀t₁t₂t₃t₄t₀t₁t₂t₃t₄all positions visible — encoder behaviourrow i sums to 1 over columns 0…i onlydecoder-only models (GPT, Llama, Mistral) apply this mask in every layer · encoder-only models (BERT) do notsame architecture, one mask — different model family
The mask is the architectural switch between a bidirectional model (BERT, used for embeddings) and an autoregressive one (GPT, Llama, Mistral). Same machinery, one triangle of −∞.

Tokenization

The model lives in an integer-indexed inner world. Tokenization is the boundary between human text and that world. Byte-Pair Encoding starts with individual bytes and repeatedly merges the most frequent adjacent pair into a new token, building a vocabulary of 30K–200K subword units. Common words become single tokens; rare or compound words split; entirely novel strings decompose down to bytes. The same scheme handles every language and every Unicode codepoint without a fixed dictionary.

Byte-Pair Encoding — text becomes a sequence of subword token idsInput textTokenizer (BPE merges)Token ids"the unfortunate cat"19 charactersmerge most frequent pairrepeat until vocab target▁ marks word boundary[464, 22001, 9379]3 ids▁the▁unfortunate▁cat464555331944709379common words = 1 token · rare words split into subwordsEnglish ≈ 1.3 tokens / word · code is denser · CJK is sparser · emoji can take 3–4 tokens
The vocabulary is fixed at training time and frozen forever after. Every prompt and every API bill flows through this table; choosing it badly is one of the few mistakes that cannot be fixed without retraining.
Worked example: one BPE merge step on a tiny corpus

BPE is trained by repeating a single rule: count every adjacent pair of symbols, merge the most frequent pair into one new symbol, repeat. Start with text as a sequence of characters — every character is its own initial token.

Corpus. Three words, with their counts:

low      × 5
lower    × 2
newest   × 6

Initial tokenisation. Each word is a sequence of characters (the marks word boundary):

▁ l o w           × 5
▁ l o w e r       × 2
▁ n e w e s t     × 6

Count adjacent pairs across the whole corpus.

(l, o):  5 + 2     = 7
(o, w):  5 + 2     = 7
(w, e):  2 + 6     = 8winner
(e, r):  2
(n, e):  6
(e, s):  6
(s, t):  6

Merge the winning pair (w, e) → we. Replace every occurrence in the corpus:

▁ l o w           × 5      (unchanged)
▁ l o we r        × 2
▁ n e we s t      × 6

The vocabulary just grew by one token: we. Recount pairs and repeat. After a few thousand merges, common sequences like ▁the and ing become single tokens, while rare strings stay split. At inference time, encoding a new word means greedily applying the same merges in the order they were learned — so "lower" tokenises as ▁ low er if those merges fired, or ▁ l o w e r if none of them did.

The whole tokeniser is just this merge list plus a final id table. Nothing learned, no neural network — pure frequency counting on the training corpus, frozen forever.

The four-stage training stack

A base transformer that's been trained only to predict the next token can continue a piece of text but does not know it's in a conversation. Turning it into a useful assistant takes three more stages.

The four-stage training stack of a modern instruction-tuned LLMPre-trainInstruction tunePreference tuneDeploynext-token on textunsupervisedSFT on (prompt, response)supervisedRLHF / DPOhuman comparisonsquantize · KV cachespeculative decode≈10T tokens≈10–100K examples≈10–100K comparisonsruntime→ a base modelthat completes text→ follows instructionsin a chat format→ aligned to preferenceshelpful · harmless→ low-latencylow-cost serving≈10²⁵ FLOPs≈10²² FLOPs≈10²² FLOPsper requestweeks · 10K GPUsdays · 100s GPUsdays · 100s GPUsmilliseconds99% of the compute is in pre-training; 99% of the personality comes from the last two stages
Pre-training is what no one but a frontier lab can afford. The post-training stages are what make the same base model into a chatty assistant, a coding model, or a refusal-heavy product variant.

Pre-training runs the next-token objective on as much text as the team can curate — typically 1 to 15 trillion tokens from web crawl, books, code, and high-quality filtered subsets. The output is a base model that can continue arbitrary text. It will happily complete What is the capital of France? with a Wikipedia-style sentence, but it will just as happily continue with another question, because it has only seen text, not conversations.

Instruction tuning (also called SFT, supervised fine-tuning) takes the base model and fine-tunes it on tens of thousands of (prompt, response) demonstrations written or curated by humans. The output is a model that responds in a chat format — when asked a question, it answers.

Preference tuning aligns the model's behaviour with human judgement. Collect tens of thousands of comparisons where a human picked the better of two model outputs. Train a reward model on those preferences, then push the policy toward higher reward using RLHF (reinforcement learning from human feedback) or the simpler DPO (direct preference optimization, which skips the reward model). The model becomes more helpful, more honest, and — for shipped products — more disciplined about refusing harmful requests.

Deployment is its own engineering problem. The naive autoregressive forward pass would recompute attention over every previous token at every new step. Three optimizations turn this from "too expensive" to "shippable."

The KV cache is the first. During decoding, the keys and values that attention computes for past tokens never change as new tokens arrive — past context is fixed. Cache them. Each new token then only computes one new K and V and attends across the cached set. Decode complexity drops from O(n²) per token to O(n).

KV cache — turns O(n²) decode into O(n) per new tokenWithout KV cacheWith KV cachet₀t₁t₂t₃t₄to predict t₅:recompute K,V for t₀…t₄ — every stepstep 1: 1 tokenstep 2: 2 tokensstep 5: 5 tokenstotal: 1+2+3+4+5 = O(n²) workK,V₀K,V₁K,V₂K,V₃t₄cachedeach new token:read all cached K,V · compute one new K,Vtotal: O(n) per new tokencache memory grows linearly with context — KV cache dominates GPU memory at long contextsat 128K context, a 70B model can need ≈10 GB just for KV per request
The cache is invariant under autoregressive generation. Modern serving stacks (vLLM, TensorRT-LLM, SGLang) make KV-cache management — paging, sharing across requests, eviction — their central design problem.

Quantization is the second. The same weights stored in 16-bit floats can be re-encoded in 8-bit, 4-bit, or even fewer bits using per-channel scale factors learned in a short calibration step. Memory drops 2× to 4×, inference speeds up 2× to 4× on hardware with integer tensor cores, and the precision loss is small — perplexity rises by a few percent at 4-bit, undetectable to most users. Quantization is what turns a 70B model from "needs a datacentre GPU" into "fits on a high-end laptop."

Quantization — same weights, fewer bits per numberFP16 (baseline)INT8INT416 bits / weight8 bits / weight4 bits / weight−1.42710.81032.5546−1.43 · scale0.81 · scale2.55 · scale−1.4 · scale0.8 · scale2.5 · scale70 B params → 140 GBdatacentre GPU territory70 B params → 70 GBsingle H100 / dual A10070 B params → 35 GBconsumer GPU / laptop2× smaller2× smallerper-channel calibration learns scale factors that bound quantization errorprecision loss is real but small · perplexity rises a few percent at INT4 · most users cannot tellactivations and KV cache can be quantized too — different trade-offs, same idea
Quantization is the single most consequential post-training optimization. It does not change architecture, training data, or capabilities — it just makes the same model cheaper to run at every layer of the stack.

Speculative decoding is the third. Even with a perfect KV cache, autoregressive decoding generates one token per full forward pass. A small draft model cheaply proposes a short run of candidate tokens; the big verifier runs them all through one parallel forward pass and accepts the longest correct prefix. Because the verifier runs in parallel across candidates, accepting four tokens costs roughly the same as generating one. The output is bit-identical to running the verifier alone, and throughput rises 2–3×.

Speculative decoding — small draft model proposes, big verifier checks in parallelDraft modelsmall · cheap · fastpropose 4 tokensthecatsatonVerifier (target model)one parallel forward pass over all 4 candidatesthe okcat oksat okon noaccept 3 · re-decode 1guarantee:output identical torunning verifier alone
A free win when the draft model agrees with the verifier most of the time. Typical acceptance rates run 60–80%, with no quality loss. Pairs cleanly with quantization and KV-cache reuse.

Prompting

Because so much capability is already baked into the pre-trained weights, the question stops being "how do I train this?" and becomes "how do I phrase the request so the model surfaces the right behaviour?" Prompt engineering is that interface layer: system prompts that frame the model's role, few-shot examples that demonstrate the desired format, chain-of-thought triggers like "let's think step by step" that elicit explicit reasoning. It is engineering done in natural language, with all the brittleness that implies — small wording changes can swing accuracy by tens of percent.

Pitfall — knowledge cutoffs. Every base model has a date past which its training data ends. Ask a model trained in 2024 about events from 2026 and it will produce confident fabrication unless the system supplies fresh information at query time. Always know your model's cutoff and surface it to users.

Retrieval-augmented generation (RAG)

An LLM's weights encode patterns from its training data and nothing else. They contain no record of your company's documents, last week's news, or the state of any live system. Retraining for every new corpus is impractical. Retrieval-augmented generation closes the gap a different way: at query time, find relevant passages in a knowledge store, paste them into the prompt, and let the LLM render the answer using both its training and the retrieved context.

The mechanism reuses the embedding idea at a different scale. A dedicated embedding model — typically a transformer encoder — maps a whole passage of text to a single fixed-size vector of 768 to 4096 dimensions, trained so that passages with similar meaning land at nearby vectors. Documents get chunked (a few hundred tokens per chunk), each chunk is embedded once and stored in a vector database. At query time, embed the question, find the top-k nearest chunks by cosine distance (usually 5 to 20), concatenate them into the prompt, and ask the LLM to answer using only that context. (Vector databases and the underlying approximate-nearest-neighbour indexes are covered in Act VIa.)

A RAG pipeline — query-time retrieval grounds the LLM in private or fresh dataUser queryEmbed→ vector ∈ ℝᵈVector searchtop-k nearestPrompt = system + query + top-k chunks"answer the question using only this context"LLMrenders an answerAnswer + citationsKnowledge storechunk 1chunk 2chunk Nvec₁vec₂vec_NANNindexvector search hits the ANN index, returns top-k chunksembeddings + vector DB do the heavy lifting · the LLM just renders the answer in fluent prose
RAG turns "the model didn't know" into "the system didn't retrieve the right chunk." The failure mode shifts from training to indexing, and indexing is something engineers can debug.

The trade-offs are concrete. Chunk size matters: too small and the model loses context; too large and the top-k crowds out the question. Pure vector similarity over-rewards lexical overlap; hybrid search mixes vector with keyword (BM25) scores. Re-rankers — cross-encoder models that score the (query, chunk) pair jointly — sharpen the top-k from "semantically near" to "actually answers the question." Many RAG systems also ask the LLM to cite — to emit which chunk each claim came from — so users can verify and so hallucination becomes detectable.

Pitfall — RAG is not a memory. Stuffing more chunks into the context does not make the model remember them after the request ends. The next call starts fresh. Conversational memory needs a separate store — a summary, a vector of past turns, a structured user profile — maintained outside the LLM and re-injected on every request.

Agents

A single LLM call answers a question. Real tasks have multiple steps — search, read, decide, write, retry. An agent is an LLM placed inside a control loop that lets it (1) think — produce a plan or next step in text, (2) act — call a tool, (3) observe — read the result, (4) decide whether the task is done; if not, repeat. Structured tool calling is what makes this practical: the LLM emits a JSON object naming a tool and its arguments, the host parses it, runs the tool, and returns the output back into the next prompt.

An agent loop — think · act · observe · repeatLLM (think)plan · choose tool · or finishTool / API (act)search · code · DB · HTTPObservationtool output → back into promptMemoryhistory · scratchpad · vector storeactobserveappendreadfinish · return answertermination: max steps · explicit done · cost cap
Each iteration is a full LLM call: the entire history plus the latest tool output is sent back as input. Cost and latency grow linearly with the number of iterations — and so does the chance of going off the rails.

The same loop powers wildly different agents. A coding agent reads files, edits them, runs tests, reads the failure, fixes the code, and loops until tests pass. A research agent searches the web, reads pages, follows citations, and assembles a report. A support agent looks up orders, files refunds, drafts emails. The structure is identical; only the tool list changes.

The failure modes are new and they compound. Long-horizon error is the first: at each step the model has some probability p of choosing the wrong action, and over n steps the chance of a clean run is (1−p)ⁿ, which collapses fast. Goal drift — the model "helpfully" expands the task beyond what the user asked. Tool misuse — the wrong arguments to the right tool, or the right tool in the wrong order. Cost runaway — an agent that loops fifty times instead of three racks up an unbounded bill. Reliable agents need explicit guardrails: hard step limits, cost limits, structured plans the model commits to before executing, retries with bounded backoff, and grounding rules that refuse to act on facts the agent did not actually retrieve.

Pitfall — letting the agent be its own judge. Asking the LLM "are we done?" at the end of every iteration is asking the same model that may have hallucinated the work to evaluate the work. Production agents check completion against external signals — tests passing, an API returning success, a human accepting the result — not the model's self-report.

The honest limits

Modern ML's failures are stubborn and well-mapped. Pretending otherwise is how teams ship demos that never become products.

Hallucination is the production-defining failure of language models. The model emits fluent, grammatically perfect, plausibly-sourced text that is not true — a fabricated court case, a made-up API method, an invented citation. The cause is structural. Next-token prediction optimises for likelihood under the training distribution, not truth. There is no internal mechanism that distinguishes "I know this" from "this completes the pattern." Mitigations stack — RAG to ground claims in real sources, citation requirements so users can verify, constrained decoding that forces outputs into a JSON schema, self-consistency that samples multiple times and keeps the agreed answer — but none eliminate hallucination, only reduce its rate.

Evaluation difficulty is its quieter twin. For closed tasks (classification, extraction) you can compute accuracy on a held-out set. For open-ended tasks (write an email, summarise this document, plan this trip) there is no single right answer. Public benchmarks capture some dimensions, but rankings on benchmarks correlate weakly with usefulness in any specific product. LLM-as-judge — using a strong model to grade outputs — helps and introduces its own biases (preference for length, preference for the judge's own style). The frontier method is still humans rating outputs side-by-side, which is slow and expensive.

The four shapes of LLM evaluation — what counts as a correct answerMultiple choiceGenerationPreferenceAgenticMMLU · ARC · GPQATriviaQA · HellaSwagpick A / B / C / Dlog-likelihood scoringcheap · automatableceiling effects · saturatedHumanEval · MBPPGSM8K · MATH · BIG-benchexecute output · checkvs reference / unit testsnarrow but objectivecode & math · not chatLMSYS Arena · MT-BenchAlpacaEval · Chatbot Arenahuman or LLM picksA vs B side-by-sidecaptures "feel"slow · biased · driftySWE-benchτ-bench · WebArenaend-to-end tasksuccess in envclosest to productexpensive · noisyno single benchmark predicts product quality · ship with your own task-specific evalbenchmarks rank models · only your eval ranks them on your task
Each column up-weights a different failure mode. Most teams combine a public benchmark for tracking regressions, a domain eval written from real user prompts, and an agentic harness that runs the system end-to-end.

Inference cost is the line item every product line eventually meets. Each generated token requires a forward pass through every layer of a model that may hold tens or hundreds of billions of parameters. At frontier-class quality, a long agent run can cost tens of cents to a few dollars; multiplied by millions of users, the bill is real. Quantization, distillation into smaller specialised models, KV-cache reuse, and speculative decoding all exist to bring per-call cost down by orders of magnitude.

Data quality beats architecture. The widely-repeated lesson from scaling studies is that, given a fixed compute budget, more high-quality tokens often beat more parameters. Cleaning, filtering, deduplicating, and curating training data — the unglamorous parts of the pipeline — produce larger quality jumps than swapping one transformer variant for another. The same pattern repeats at fine-tuning scale: a few thousand carefully-written examples often beat tens of thousands of mediocre ones.

The honest limits of modern ML — the unfixed bugs of the fieldHallucinationEvaluation gapInference costData qualityplausible-soundingbut wrongcause: trained forlikelihood, not truthmitigation: RAG · cite ·constrained decodingno ground truth onopen-ended tasksbenchmarks weakpredictors of utilitymitigation: LLM-judge+ human side-by-side≈10⁸–10¹¹ FLOPsper tokenscales with users ×tokens per requestmitigation: quantize ·distill · cachecuration beatsarchitecture"the model isthe easy part"the actualbottlenecknone of these are research bugs waiting to be fixed; they are properties of the current paradigm
The gap between a working demo and a deployed product is mostly evaluation, cost, and data plumbing — not modelling. Teams that miss this ship demos forever.

There is a refrain among practitioners that captures all four limits at once: "the model is the easy part." Frontier models are downloadable; a serviceable assistant is a few API calls away. What remains hard is the data — collecting it, cleaning it, labelling it, evaluating against it, monitoring it for drift — and the harness around the model that turns probabilistic text into a reliable product. That work is closer to the rest of this book than it looks: it is data engineering (Act VIa), it is distributed systems (Act V), it is security and red-teaming (Act VIIa). Same craft as the earlier acts, applied to a new substrate.

Standards

ML is younger and more research-driven than the other layers in this book; it has fewer formal ISO/IEEE specifications and many more de facto standards anchored to seminal papers, open formats, and benchmark suites.

Foundational papers (the citations the field uses as anchors):

Documentation and governance:

  • Model cardsMitchell et al., "Model Cards for Model Reporting" (FAccT 2019). The de facto documentation standard for shipped models; adopted by Hugging Face Hub.
  • Datasheets for datasets — Gebru et al. (Communications of the ACM, 2021). The companion standard for the data side.
  • NIST AI Risk Management FrameworkAI RMF 1.0 (NIST, 2023). The U.S. reference for ML risk management, voluntary but widely adopted.
  • EU AI ActRegulation (EU) 2024/1689. The first cross-border legislation on ML systems; introduces risk tiers (unacceptable / high / limited / minimal) and obligations that scale accordingly.
  • ISO/IEC 42001 (2023)AI management systems. The first international management-system standard for AI, analogous to ISO 27001 for information security.

Open formats and exchange:

  • ONNXOpen Neural Network Exchange (onnx.ai). Cross-framework graph format; lets a model trained in PyTorch run in TensorRT, ONNX Runtime, or CoreML.
  • SafetensorsHugging Face's safe serialization format for tensors; supersedes Python pickle and PyTorch checkpoints for distribution because it cannot execute arbitrary code on load.
  • GGUFllama.cpp's quantized-weight format (successor to GGML). The dominant format for local LLM inference; supports INT2 to FP16 with embedded metadata.
  • Hugging Face Hubhuggingface.co, the de facto distribution standard for open models, datasets, and tokenizers; pinned by SHA, accessible via the transformers and datasets libraries.

Benchmarking:

  • MLPerfMLCommons (mlcommons.org). The closest the field has to JEDEC for hardware: standardised training and inference benchmarks across vision, language, recommendation, and reinforcement-learning workloads.
  • BIG-bench, MMLU, HumanEval, HELM — the de facto capability benchmarks for LLMs.

Forward references:

  • Vector stores — embeddings, ANN indexes, and the databases that hold them are covered in Act VIa (Data).
  • Distributed training — DDP, FSDP, ZeRO, pipeline and tensor parallelism — extends the runtime side of Act V (Connection).
  • Safety, red-teaming, and the adversarial side of ML systems — prompt injection, jailbreaks, model extraction, training-data exfiltration — belong to Act VIIa (Trust / Security).
Going deeper

Branches that earn their own article.

  • Classical ML algorithms (linear/logistic regression, decision trees, random forests, SVMs, k-means, PCA).
  • Deep learning math (loss functions, optimizers — SGD, Adam, learning rate schedules).
  • CNN architectures (ResNet, EfficientNet, Vision Transformer).
  • RNN/LSTM/GRU deep dives.
  • Transformer internals (multi-head attention, positional encoding, layer norm).
  • Diffusion models (Stable Diffusion, DALL-E).
  • Training infrastructure (distributed training, mixed precision, DeepSpeed, FSDL).
  • Fine-tuning methods (LoRA, QLoRA, prefix tuning, RLHF, DPO).
  • Inference optimization (quantization, distillation, speculative decoding, KV cache).
  • Evaluation frameworks and benchmarks.
  • MLOps (model registries, feature stores, experiment tracking, monitoring).
  • Responsible AI (bias, fairness, interpretability, red-teaming).
  • Multimodal models (vision-language, audio-language).