Computer Architecture
TL;DR
A modern CPU looks like magic from any distance above the instruction set — you write code, it runs billions of times per second, done. Zoom in and the magic resolves into a stack of concrete engineering layers, each solving a specific problem left by the layer below: transistors implement switches, switches make logic gates, gates make flip-flops and arithmetic units, those compose into a CPU core that executes a defined instruction set, the core is surrounded by caches and a memory system, cores are connected into chips, chips are connected over buses and interconnects, and special-purpose accelerators (GPUs, TPUs) hang off the side for workloads a general-purpose CPU cannot run efficiently.
Most of the time a program runs fast because the silicon team ten levels below your abstraction got something right (predicted the branch, kept the data in L1, fed the pipeline with parallel work). And most of the time a program runs slow "for no apparent reason" because the silicon optimised for someone else's workload (cache thrash, branch mispredict, pointer chase out to DRAM). Understanding the layers gives you a mental model of why the same algorithm can run at one instruction per cycle or at one instruction per hundred cycles depending on how it touches memory — and where to look when the profiler says "stalled on cache misses."
This handbook walks the stack from the transistor up: the CMOS switch at the bottom; gates, flip-flops, and the clock that turn switches into memory; the ISA and pipeline that turn instructions into work; caches and coherence that hide slow DRAM; branch prediction and speculation that fill the pipeline even on conditional code; the memory hierarchy and NUMA that explain why physical location matters; GPUs and TPUs as the purpose-built alternatives; and the interconnects (PCIe, CXL, NVLink) that tie everything together into a modern server.
You will be able to
- Explain, in plain language, what a transistor does, what a "cache miss" is, what "a branch mispredict" means, and why any of them matter for wall-clock time.
- Read a perf stat summary and say which microarchitectural resource the workload is starved on — frontend, backend, cache, TLB, branch.
- Explain why NUMA, cache coherence, and memory-order fences exist from the silicon's point of view, not the programmer's.
- Place a GPU, TPU, or accelerator in the memory hierarchy and know which costs you are signing up for when the data crosses PCIe or NVLink.
- Point at the silicon layer where any given surprise slow path lives.
The Map
- You will be able to
- The Map
- Station 1 — Transistors and CMOS
- Station 2 — Gates, flip-flops, and the clock
- Station 3 — ISA and pipelines
- Station 4 — Caches and coherence (MESI)
- Station 5 — Branch prediction and speculation
- Station 6 — Memory hierarchy and NUMA
- Station 7 — GPUs, TPUs, and the shape of accelerators
- Station 8 — Interconnect: PCIe, CXL, NVLink
- How the stations connect
- Standards & Specs
- Test yourself
Read the map bottom-up once (how electrons become instructions), then top-down forever (why the instruction you wrote became the slow one you're profiling). Each station is a different abstraction leak — the transistor's leakage current became the datacenter's power bill; the cache's line size became the false-sharing bug; the branch predictor became Spectre. The abstractions are good enough that most of the time you don't need them; the times you do are the expensive ones.
Station 1 — Transistors and CMOS
At the very bottom of the digital stack is a switch. Apply a voltage to the "gate" input, the switch closes (current flows); remove the voltage, the switch opens (current stops). That switch is a transistor. It is not a metaphor — modern computers are built from literally trillions of tiny switches etched into silicon, each one turning on and off billions of times per second. Everything above this layer (logic gates, registers, CPUs, operating systems, programming languages, the cloud) is what happens when you have enough switches wired together.
The specific style of transistor used today is MOSFET (Metal-Oxide-Semiconductor Field-Effect Transistor), in a circuit style called CMOS (Complementary MOS) that pairs two types — n-type (conducts when gate is high) and p-type (conducts when gate is low) — so the circuit only draws current at the instant it switches, not while it sits idle. That low idle power is what makes modern high-density chips thermally possible. A leading-edge chip (2025 process nodes, ~3 nm features) fits roughly 200 million transistors per square millimetre; a high-end server CPU has tens of billions of them.
The MOSFET (Metal-Oxide-Semiconductor Field-Effect Transistor) is the unit cell of every modern chip. A voltage on the gate opens or closes a channel between source and drain. NMOS conducts when the gate is high; PMOS conducts when the gate is low. Wire one of each in series between Vdd and ground with the output between them and you have a CMOS inverter — the building block from which every gate, every register, and every cache line is constructed.
CMOS inverter (the atom of digital logic):
       Vdd
        │
    ┌───┴───┐
    │ PMOS  │      in = 0 → PMOS on, NMOS off → out = Vdd (1)
in ─┤ gate  │      in = 1 → PMOS off, NMOS on → out = 0
    └───┬───┘
        ├──── out
    ┌───┴───┐
    │ NMOS  │
in ─┤ gate  │
    └───┬───┘
        │
       GND
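The series pull-down and parallel pull-up networks generalise beyond the inverter. A toy switch-level model in Python (all function names invented for illustration) captures the rule: PMOS conducts on 0, NMOS conducts on 1, and in static CMOS exactly one network drives the output at a time:

```python
# Toy switch-level model of static CMOS. PMOS conducts when its gate is 0,
# NMOS conducts when its gate is 1. Output is 1 if a pull-up path to Vdd
# conducts, 0 if a pull-down path to GND conducts.

def inverter(a: int) -> int:
    pull_up = (a == 0)            # single PMOS between Vdd and out
    pull_down = (a == 1)          # single NMOS between out and GND
    assert pull_up != pull_down   # static CMOS: exactly one network on
    return 1 if pull_up else 0

def nand(a: int, b: int) -> int:
    pull_up = (a == 0) or (b == 0)      # two PMOS in parallel
    pull_down = (a == 1) and (b == 1)   # two NMOS in series
    assert pull_up != pull_down
    return 1 if pull_up else 0

for a in (0, 1):
    assert inverter(a) == 1 - a
    for b in (0, 1):
        assert nand(a, b) == (0 if (a and b) else 1)
```

NAND (not AND) is the natural CMOS gate precisely because of this topology: series NMOS pulls down only when all inputs are high.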
Three numbers describe a process: node (nominal feature size — "5 nm," mostly a marketing label now), transistor density (MTr/mm², really what the "node" measures), and Vdd (supply voltage, currently ~0.7–0.9 V on leading-edge nodes). Dynamic power is P = α · C · V² · f — doubling the clock doubles power; halving V quarters it, which is why the entire industry now runs below 1 V and pushes performance through parallelism instead of clock.
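The scaling behaviour of P = α · C · V² · f is worth internalising with numbers. A quick sketch, where α (activity factor), C (switched capacitance), V, and f are made-up values chosen only to show the ratios:

```python
# Dynamic power P = alpha * C * V^2 * f for a switching node.
# All input numbers below are illustrative, not a real chip's.
def dynamic_power(alpha, c_farads, vdd_volts, freq_hz):
    return alpha * c_farads * vdd_volts**2 * freq_hz

base = dynamic_power(0.1, 1e-9, 1.0, 3e9)

# Doubling the clock doubles power:
assert abs(dynamic_power(0.1, 1e-9, 1.0, 6e9) / base - 2.0) < 1e-9
# Halving Vdd quarters power (the V^2 term):
assert abs(dynamic_power(0.1, 1e-9, 0.5, 3e9) / base - 0.25) < 1e-9
```

The V² term is why the industry migrated below 1 V and bought performance back with parallelism rather than frequency.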
- As of 2024, leading-edge logic processes pack ~170–300 MTr/mm² (TSMC N3, Intel 18A). A modern server CPU die is ~500 mm² with tens of billions of transistors. A top-end GPU die (H100) is ~800 mm² with ~80 billion.
- Leakage current flows even when the transistor is off (sub-threshold leakage + gate-oxide tunneling). Below ~90 nm, static power became a non-trivial fraction of total — which is why chips added deep sleep states (C-states on Intel, idle states on ARM) to cut Vdd to parts of the die.
- Dennard scaling (shrinking transistors kept power density constant, so every node gave faster, lower-power switches for free) broke around 2005. That's why clock speeds stalled at ~4 GHz for a decade; the only free lunch left was adding cores.
- FinFET (Intel's 22 nm tri-gate, shipped 2012) replaced planar transistors with a 3D fin to reduce leakage. GAAFET / nanosheet (Samsung 3GAE, TSMC N2) wraps the gate fully around the channel — the leading edge of the roadmap.
The model you want: a chip is a giant graph of CMOS switches clocked by a single distributed signal, and every design decision upstairs is a bet against that switch's power, delay, and leakage numbers. Moore's Law was the bet that those numbers would improve forever. They still improve; they no longer improve for free.
WARNING
"The chip is just following orders" is not quite true — at these densities it is a statistical ensemble. A single cosmic ray can flip a bit; a small area of unusually high leakage can thermally throttle a core; variation between dies means your 4 GHz chip and your neighbour's 4 GHz chip don't quite match. Firmware and OS hide all of this. Until they don't.
Go deeper: Hennessy & Patterson, Computer Architecture: A Quantitative Approach (6th ed) Appendix C; Harris & Harris, Digital Design and Computer Architecture chapter 1; the ISSCC proceedings for whatever year you want to know what's manufacturable right now.
Station 2 — Gates, flip-flops, and the clock
A single transistor is a switch. Wire a small number of transistors together in specific patterns and you get logic gates — circuits that implement the basic operations of Boolean algebra: AND (output is 1 only if all inputs are 1), OR (output is 1 if any input is 1), NOT (output is the inverse of the input), XOR (output is 1 if inputs differ). Every piece of logic a computer does — adding two numbers, comparing two values, choosing between two paths — is some combination of these gates.
Gates alone are stateless: change the inputs and the output changes instantly. To remember anything, you need a circuit that can hold a value over time. The fundamental memory element is the flip-flop, a feedback loop of gates that can be in one of two stable states (0 or 1) and latches a new value on the edge of a clock signal — a shared square wave that pulses billions of times per second. The clock is the heartbeat of the chip: every flip-flop in the CPU updates on the same edge, so values move through the pipeline in lockstep.
This layer is where an arithmetic logic unit (ALU) lives (combinational gates that compute a+b, a·b, a&b, a << b given two inputs), alongside registers (banks of flip-flops that hold the current working values). Everything higher — pipelines, pipelines with forwarding, multi-cycle operations — is choreography on top of this clocked substrate.
Combine CMOS inverters and you get gates: NAND, NOR, AND, OR, XOR. Combine gates and you get combinational logic — a function whose output depends only on current inputs. Add a feedback loop with a clock signal and you get sequential logic — a flip-flop that remembers one bit across clock edges. Every register, every SRAM cell, every finite state machine in silicon is built from these two kinds.
Edge-triggered D flip-flop (the memory atom of a register):
D (data) ──────────┐
                   │
clk ──┐            ▼
      │         ┌──────┐
      └────────▶│  FF  │──── Q (output, held)
                │      │
                └──────┘
On the rising edge of clk, Q captures D. Between edges, Q holds.
32 flip-flops side by side = one 32-bit register.
6 transistors per SRAM cell; 1 transistor + 1 capacitor per DRAM cell.
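The capture-on-edge behaviour can be modelled in a dozen lines. This is a behavioural sketch, not a hardware description; the class name and interface are invented:

```python
# Minimal model of an edge-triggered D flip-flop: Q changes only when
# clk transitions 0 -> 1; between edges, D can change without effect.
class DFlipFlop:
    def __init__(self):
        self.q = 0
        self._prev_clk = 0

    def tick(self, clk: int, d: int) -> int:
        if clk == 1 and self._prev_clk == 0:   # rising edge detected
            self.q = d                          # capture D
        self._prev_clk = clk
        return self.q                           # held otherwise

ff = DFlipFlop()
assert ff.tick(0, 1) == 0   # no edge: holds reset value
assert ff.tick(1, 1) == 1   # rising edge: captures D = 1
assert ff.tick(1, 0) == 1   # clk still high: D ignored
assert ff.tick(0, 0) == 1   # falling edge: still holds
assert ff.tick(1, 0) == 0   # next rising edge: captures D = 0
```

Thirty-two of these side by side, sharing one clock, is the behavioural model of a 32-bit register.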
A clock is a square wave distributed across the chip via a low-skew tree. On every rising edge, every flip-flop samples its input; between edges, combinational logic has until the next rising edge to compute the next inputs. The longest such combinational path is the critical path, and 1 / (critical-path delay + clock skew + setup margin) is the chip's maximum clock frequency.
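That frequency equation is simple enough to compute directly. The delays below are illustrative only (a 250 ps cycle budget is the right ballpark for a ~4 GHz core):

```python
# f_max = 1 / (critical-path delay + clock skew + setup margin).
# Delay numbers are illustrative, not from any real design.
def f_max(critical_path_s, skew_s, setup_s):
    return 1.0 / (critical_path_s + skew_s + setup_s)

# 200 ps of logic + 20 ps skew + 30 ps setup = 250 ps -> 4 GHz:
assert abs(f_max(200e-12, 20e-12, 30e-12) - 4e9) < 1e3

# Shave 50 ps off the critical path and the same chip clocks at 5 GHz:
assert abs(f_max(150e-12, 20e-12, 30e-12) - 5e9) < 1e3
```

This is why timing closure is a fight over picoseconds on a handful of worst paths: the slowest path sets the clock for the whole domain.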
- A modern CPU core has ~1,000–2,000 pipeline latches and maybe 100k flip-flops per core counting register files. SRAM caches are 6T cells per bit; L1D at 48 KB = ~2.3 M transistors just for the storage, plus decoders and sense amps.
- Setup and hold windows bracket the clock edge. Miss setup and the captured value is metastable — neither 0 nor 1 — which resolves in a few gate delays but can cascade if it crosses a clock domain. Clock domain crossings (CDCs) need synchronizers; getting CDCs wrong is a classic source of once-a-month hardware bugs.
- Dynamic vs static logic: static CMOS is the default; dynamic (domino) logic pre-charges a node and discharges it conditionally, faster but sensitive to noise, used inside high-performance ALUs.
- Clock gating and power gating turn off clocks or power to inactive regions. On a modern SoC, the mobile power budget wouldn't work without aggressive clock gating — idle blocks don't even see the clock edge.
The model you want: every visible computer operation is a rising edge moving data from one flip-flop through a combinational cloud into another flip-flop. The cloud is the ALU, the decoder, the cache lookup, the bypass network — all of them are just logic between two ranks of latches.
TIP
When a datasheet says "300 ps setup, 100 ps hold," that's the contract for your external signal into this pin. If a cable pushes you past the setup window by a bit, you get the previous cycle's value captured — the classic symptom is "it works sometimes, and always after a reset." The fix is usually a proper synchronizer, not a longer cable.
Go deeper: Harris & Harris, Digital Design and Computer Architecture chapters 2–3; Weste & Harris, CMOS VLSI Design chapters 1–4; a weekend with Yosys or Verilator simulating your own flip-flop, just to feel the edge.
Station 3 — ISA and pipelines
Compilers and programmers need to target some interface that is stable across chip generations. That interface is the ISA (Instruction Set Architecture) — the list of operations a CPU agrees to implement, how they are encoded in memory as byte sequences, what registers they can name, what memory-ordering they promise. An ISA is what "a program" is, at the lowest portable level. Common modern ISAs are x86-64 (Intel and AMD, CISC-style, variable-length instructions, dominant in servers and desktops), ARM / AArch64 (RISC-style, fixed 4-byte instructions, dominant in phones and increasingly in servers — Apple Silicon, Graviton), and RISC-V (open, modular, the rising research and embedded target).
Modern CPUs do not execute one instruction at a time like the textbook diagram suggests. They pipeline — at any moment, one instruction is being fetched from memory, while the previous one is being decoded, while the one before that is reading its operands, while the one before that is executing, and so on. A single core may have a dozen pipeline stages and be working on dozens of instructions in flight simultaneously. Further, superscalar cores issue multiple instructions per cycle by replicating ALUs, load/store units, and other resources — a modern high-end CPU can retire 4–8 instructions per cycle on friendly code.
The whole design target is to keep the pipeline full. Anything that stalls the pipeline — a mis-predicted branch, a cache miss that waits on DRAM, a dependency on a previous instruction's result — costs throughput. Everything in Stations 4–6 exists to minimise those stalls.
The ISA (Instruction Set Architecture) is the contract between hardware and software: the registers, instructions, addressing modes, memory model, and exception behaviour a program is allowed to assume. Everything above is a compiler target; everything below is a microarchitecture free to cheat as long as the contract holds.
Classic 5-stage RISC pipeline (MIPS, the textbook version):
cycle:   1    2    3    4    5    6    7    8
inst1:   IF   ID   EX   MEM  WB
inst2:        IF   ID   EX   MEM  WB
inst3:             IF   ID   EX   MEM  WB
inst4:                  IF   ID   EX   MEM  WB

IF  = instruction fetch       EX  = execute (ALU)
ID  = decode + register read  MEM = load/store
WB  = writeback

With the pipeline full, CPI → 1 in steady state.
Hazards (data, control, structural) stall and drop CPI.
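The diagram's arithmetic generalises: a k-stage pipeline finishes n independent instructions in k + (n − 1) cycles. A sketch of the fill and steady-state math, plus the steady CPI tax a control hazard adds (penalty and miss-rate numbers are illustrative):

```python
# With a k-stage pipeline and no hazards, n instructions finish in
# k + (n - 1) cycles: k cycles to fill, then one retirement per cycle.
def pipelined_cycles(n_instructions, stages=5):
    return stages + (n_instructions - 1)

assert pipelined_cycles(4) == 8                         # inst4 ends in cycle 8
assert pipelined_cycles(1_000_000) / 1_000_000 < 1.001  # CPI -> 1 as fill amortizes

# Control hazards break the ideal: a branch every 5th instruction,
# a 5% mispredict rate, and a 15-cycle flush add a steady CPI tax.
mispredict_cpi_tax = (1 / 5) * 0.05 * 15
assert abs(mispredict_cpi_tax - 0.15) < 1e-12           # +0.15 CPI
```

The same accounting is how architects budget every hazard: frequency of the event times cycles lost per event, summed into CPI.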
Modern cores do not stop at 5 stages. Intel Golden Cove and AMD Zen 4 are 14–19 stages deep, 6–8 wide (up to 8 instructions decoded, 10+ dispatched, 10+ retired per cycle), out-of-order (hundreds of in-flight instructions in a reorder buffer), and superscalar. But the abstract contract — fetch, decode, execute, access memory, write back, retire in program order — is still the one software sees.
- x86-64 is CISC in the ISA and RISC-like in the microarchitecture — the frontend translates variable-length x86 instructions into fixed-width micro-ops and feeds those through an out-of-order backend. A single REP MOVSB expands into a microcode-ROM sequence dozens of uops long.
- ARM (AArch64) and RISC-V (RV64) use fixed-width 32-bit instruction encodings (ARM has 16-bit Thumb variants, RISC-V has the C compressed extension). Decoding parallelism is cheaper and frontend power lower — one reason Apple's M-series cores can dispatch 8-wide at lower frequency and win on perf/watt.
- IPC (instructions per cycle) on a modern general-purpose core is typically 1–4 on SPEC-like code; 0.1–0.5 on memory-bound workloads; 5–6 on compiler-friendly loops with good branch prediction. IPC is the number perf stat reports as "insn per cycle"; CPI is its reciprocal.
- Vector/SIMD extensions (SSE → AVX → AVX-512 on x86; NEON → SVE/SVE2 on ARM; the V extension on RISC-V) process 128-, 256-, or 512-bit vectors per instruction. AVX-512 adds mask registers and gather/scatter; SVE is vector-length-agnostic (the same binary runs on any implementation from 128 to 2048 bits wide).
The model you want: the ISA is what you write; the microarchitecture is what you wait for. The same mov rax, [rdi] can take 4 cycles (L1 hit) or 200 cycles (DRAM miss) depending on state the ISA promises nothing about.
CAUTION
Micro-benchmarks that measure a single instruction in a loop are almost always measuring the frontend, not that instruction. If your benchmark runs faster than the frontend can decode (4–5 uops/cycle on Intel, 8+ on recent ARM), you are measuring the uop cache, not the core. Read Agner Fog's tables before you publish a number.
Go deeper: Hennessy & Patterson, Computer Architecture chapters 3–4; Agner Fog's optimization manuals and instruction tables; Shen & Lipasti, Modern Processor Design; Intel SDM Vol. 3; the RISC-V ISA spec (short, public, readable in a weekend).
Station 4 — Caches and coherence (MESI)
Main memory (DRAM) is slow compared to the CPU — a DRAM access takes roughly 100 ns, or ~300 CPU cycles at a 3 GHz clock. If the CPU had to wait on DRAM for every load, modern performance would be unachievable. The fix is a cache hierarchy: small, fast memories physically close to the core that hold recently-used data. A typical modern core has a ~32 KB L1 data cache (~1 ns access), a ~1 MB L2 (~3 ns), and shares a ~30 MB L3 with the other cores on the same chip (~10–20 ns). The access-latency ratio of L1 to DRAM is about 100×; getting good cache hit rates is what makes hot loops fast.
When multiple cores work on the same data, each has its own private L1 and L2. A naive design where each core caches its own copy would break correctness — core A writes a new value to location X, core B reads X from its stale cache, and they see different "true" values. Modern CPUs solve this with a cache coherence protocol — the canonical one is MESI (Modified, Exclusive, Shared, Invalid). Every cache line sits in one of four states; when any core writes, the hardware propagates an invalidation to other cores' caches, so they will refetch before their next read. Coherence is what lets shared-memory concurrency work correctly at the cost of bus traffic — and understanding MESI is what explains why contended atomics are slow even when the operation itself is a single instruction.
The speed gap between CPU and DRAM has widened every decade. A 2024 server core runs at ~4 GHz (one cycle = 0.25 ns) but DRAM is ~80–100 ns away — 300+ cycles. A cache miss is a bad day. The cache hierarchy exists to keep most loads from being that day.
Typical modern server core, order-of-magnitude numbers:
level          size          latency (cycles)    bandwidth
─────────────  ────────────  ──────────────────  ────────────────────
register       hundreds of B        0            effectively infinite
L1 I/D         32–48 KB           4–5            ~1 TB/s (per core)
L2             1–4 MB           12–15            hundreds of GB/s
L3 (LLC)       16–100+ MB       30–50            ~100 GB/s
DRAM (local)   GBs               ~300            ~50 GB/s per channel
DRAM (remote)  GBs    ~500 (NUMA hop)            ~50 GB/s per channel
NVMe SSD       TBs    ~50,000 (≈ 12 µs)          ~7 GB/s sequential
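Order-of-magnitude numbers like these plug straight into the classic average-memory-access-time recurrence, AMAT = hit time + miss rate × (next level's AMAT). A sketch with hypothetical hit rates:

```python
# AMAT from a list of (latency_cycles, miss_rate) per level, innermost
# first; the last level is a guaranteed hit (miss_rate 0).
def amat(levels):
    total = 0.0
    reach = 1.0                  # fraction of accesses that get this far
    for latency, miss_rate in levels:
        total += reach * latency
        reach *= miss_rate
    return total

# Well-cached loop: 95% L1 hits, 80% of the remainder hit L2, 70% of
# that remainder hit L3 (all rates invented for the example):
fast = amat([(4, 0.05), (14, 0.20), (40, 0.30), (300, 0.0)])
# Pointer-chasing loop over a big working set:
slow = amat([(4, 0.40), (14, 0.50), (40, 0.50), (300, 0.0)])

assert fast < 10    # ~6 cycles per load on average
assert slow > 35    # ~48 cycles per load: ~8x worse, same instruction
```

Same load instruction, same ISA, nearly an order of magnitude apart — all of the difference is in which level answers.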
Caches are organized in lines, typically 64 bytes on x86 and ARM (some ARM chips use 128). Memory is addressed at byte granularity but moves at cache-line granularity. Touch byte 0 of a line, you paid for bytes 0–63 whether you needed them or not. Two threads updating different bytes of the same line will bounce that line between cores — the classic false sharing performance bug.
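The false-sharing condition is pure layout arithmetic: do two hot fields fall in the same 64-byte line? A minimal model (field offsets invented for the example):

```python
# Which 64-byte cache line does each field land on? This is the layout
# arithmetic that alignas(64)-style padding manipulates.
LINE = 64

def line_of(offset_bytes):
    return offset_bytes // LINE

# Packed layout: two hot 8-byte counters, one written by each thread.
packed = {"counter_a": 0, "counter_b": 8}
assert line_of(packed["counter_a"]) == line_of(packed["counter_b"])
# Same line -> every write by one thread invalidates the other's copy.

# Padded layout: each counter starts on its own 64-byte boundary.
padded = {"counter_a": 0, "counter_b": 64}
assert line_of(padded["counter_a"]) != line_of(padded["counter_b"])
# Separate lines -> the two threads never exchange coherence traffic.
```

The fix costs 56 wasted bytes per counter and removes every cross-core invalidation — almost always a good trade for hot write-shared data.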
Multi-core makes this hard: if two cores each cache the same line, who sees whose writes? The answer is a cache coherence protocol, almost always a MESI variant.
Every cache line in every core's cache is in one of M/E/S/I — Modified (dirty, exclusive), Exclusive (clean, exclusive), Shared (clean, may be cached elsewhere), Invalid (no valid data). Extensions add O (Owned, dirty but shared — AMD MOESI) and F (Forward, Intel MESIF). The rules are enforced by snoop traffic on a ring or mesh fabric that connects every cache to every other.
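A drastically simplified MESI state machine for one line in two caches shows the protocol's shape. This sketch omits writebacks, snoop-filter details, and the O/F extensions mentioned above:

```python
# Simplified MESI for a single cache line in two caches. Each event
# returns (my_new_state, other_cache_new_state).
M, E, S, I = "M", "E", "S", "I"

def read(me, other):
    if me != I:
        return me, other          # hit: no coherence traffic
    if other in (M, E, S):        # someone else has it: both end Shared
        return S, S               # (an M copy is written back first)
    return E, I                   # sole copy: Exclusive

def write(me, other):
    return M, I                   # writer goes Modified, peer invalidated

a, b = I, I
a, b = read(a, b);  assert (a, b) == (E, I)   # core A reads: Exclusive
b, a = read(b, a);  assert (a, b) == (S, S)   # core B reads: both Shared
a, b = write(a, b); assert (a, b) == (M, I)   # core A writes: B invalidated
b, a = read(b, a);  assert (a, b) == (S, S)   # B re-reads: A supplies data
```

The last two transitions, repeated in a loop, are the "cacheline bounce" in the next bullet: every write forces an invalidate, every re-read forces a transfer.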
- A cacheline bounce — one core writes, another reads, over and over — costs 40–100+ cycles per round trip because coherence traffic has to propagate across the fabric. A spinlock on a contended lock is mostly measuring this.
- Write buffers and memory ordering: a store retires into a store buffer before becoming globally visible. x86-TSO lets stores drain from the buffer in order, with loads allowed to pass stores (its one relaxation). ARM and POWER are weaker — memory barriers (dmb ish, lwsync) are how you force visibility. Java's volatile, C++'s std::memory_order, and Go's sync/atomic translate into these primitives.
- Cache line alignment matters for anything concurrent. alignas(64) in C++ or @Contended in Java deliberately pads structures to their own line to prevent false sharing. It is very often the fix when a hot shared counter looks slower than it should.
- Non-temporal stores (movntdq, vmovntdq) bypass the cache entirely — useful for streaming writes you never read back (memcpy of a large buffer, a DMA-style fill). They trade cache pollution for bandwidth.
The model you want: loads and stores are not the primitive — cache-line fetches, invalidations, and writebacks are. Your loop touches 8 bytes; your machine moves 64. Align your data, prefer structure-of-arrays over array-of-structures when you stream, and expect a factor-of-8 slowdown if you got either wrong.
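The factor-of-8 is just line-granularity arithmetic. A sketch comparing bytes actually moved when streaming one 8-byte field out of 64-byte records, array-of-structures versus structure-of-arrays (record size and count invented for the example):

```python
# Bytes moved for a streaming pass that reads one 8-byte field from
# each of N 64-byte records, with 64-byte cache lines.
LINE, RECORD, FIELD, N = 64, 64, 8, 1_000_000

# Array-of-structures: each record begins a new line, so reading the
# field drags in the whole 64-byte line.
aos_bytes = N * LINE

# Structure-of-arrays: the fields sit in their own dense array, so
# every byte fetched is a byte used.
soa_bytes = N * FIELD

assert aos_bytes // soa_bytes == RECORD // FIELD   # the factor of 8
```

When the loop is bandwidth-bound, that ratio is the speedup ceiling for the layout change — no algorithmic cleverness required.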
WARNING
"We'll just add a lock around it" is correct — and also how hot locks become cache-line-bouncing grenades. Per-thread counters (with periodic aggregation), RCU, and lock-free structures exist because the coherence bill on a contended line is the real cost of the critical section, not the wall-clock of the code inside it.
Go deeper: Drepper, "What Every Programmer Should Know About Memory" §6; Sorin, Hill & Wood, A Primer on Memory Consistency and Cache Coherence (2nd ed); Preshing's blog on memory ordering; Intel SDM Vol. 3A §8 on the memory model; perf c2c for cache-line contention profiling on Linux.
Station 5 — Branch prediction and speculation
A pipeline needs to know what instruction to fetch next. For straight-line code, the next instruction is the one at the next address — easy. For conditional branches (if, while, for, virtual method dispatch), the next instruction depends on a comparison whose result is not yet known. Waiting for the comparison to finish before fetching would empty the pipeline and cost maybe 20 cycles of work on every branch; real programs have a branch roughly every 5 instructions, so this is catastrophic.
The fix is branch prediction. The CPU guesses which way a branch will go based on history ("this branch took the then path the last 10 times, so let's assume it will again") and speculatively fetches and executes down the predicted path. If the prediction is right, everything is saved — the pipeline never stalled. If it is wrong, the CPU has to throw away all the speculative work and restart from the correct path, paying the full pipeline-flush cost. Modern branch predictors are remarkably accurate — 95%+ on typical code — but the 5% of mispredicts is where a lot of "unexplained slowness" lives.
Branch prediction is one form of a broader trick called speculative execution: executing instructions before you are certain they should run, because waiting would stall the pipeline. Speculation made CPUs dramatically faster, but it also created the class of vulnerabilities known as Spectre and Meltdown (2018) — secrets can leak through side channels left by speculation even when the speculation is rolled back. The mitigations (disabled speculation on certain paths, retpoline, KPTI) cost a few percent of CPU performance on every modern chip.
A 19-stage pipeline cannot afford to wait on a conditional branch. By the time cmp rax, 0 ; je somewhere reaches the EX stage, 10+ more instructions have already been fetched. If the branch was predicted wrong, all of them must be squashed and the correct path fetched — a misprediction penalty of 15–20 cycles on a modern core. Good predictors turn "90% of branches go this way" from a statistic into throughput.
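Even the simplest real direction predictor, a two-bit saturating counter per branch, shows why a predictable past wins. A behavioural sketch (not any specific CPU's predictor):

```python
import random

# Two-bit saturating counter: states 0..3, predict taken when >= 2,
# nudge one step toward each actual outcome. Hysteresis means a single
# loop exit doesn't flip the prediction.
class TwoBitPredictor:
    def __init__(self):
        self.state = 1                       # weakly not-taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

def accuracy(outcomes):
    p, hits = TwoBitPredictor(), 0
    for taken in outcomes:
        hits += (p.predict() == taken)
        p.update(taken)
    return hits / len(outcomes)

loop = [True] * 99 + [False]                 # loop branch: taken 99x, then exit
assert accuracy(loop * 100) > 0.95           # predictable past -> ~99% accuracy

random.seed(0)
coin_flips = [random.random() < 0.5 for _ in range(10_000)]
assert accuracy(coin_flips) < 0.6            # random data -> no predictor helps
```

Real TAGE and perceptron predictors add long history and pattern correlation, but the failure mode is the same: control flow that carries no information about its own future cannot be predicted.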
Branch predictor components (logical):
┌───────────────────────────────────┐
│ Branch Target Buffer (BTB)        │  given PC → predicted target
│ ~8k–16k entries, tagged by PC     │
└─────────────────┬─────────────────┘
                  │
┌─────────────────┴─────────────────┐
│ Direction predictor               │  taken / not-taken
│ TAGE, perceptron, hybrid          │
└─────────────────┬─────────────────┘
                  │
┌─────────────────┴─────────────────┐
│ Return Stack Buffer (RSB)         │  predicts rets by pushing on calls
│ 16–32 entries                     │
└─────────────────┬─────────────────┘
                  │
┌─────────────────┴─────────────────┐
│ Indirect branch predictor         │  for virtual calls / switch tables
└───────────────────────────────────┘
- Modern predictors hit 95–99% accuracy on well-behaved branches. The remaining 1–5% dominates p99 latency on branchy code. A perceptron-based predictor (AMD since Piledriver, many Intel cores) uses a weighted sum of history bits; TAGE (TAgged GEometric-history, Seznec) uses tables indexed by different history lengths.
- Speculation is the other half: the core doesn't just predict, it executes the predicted path, accumulating results in the reorder buffer. If the branch resolves as predicted, the uops retire normally. If not, they are discarded — but not before their side effects touched the cache. That leak is the Spectre family of attacks (Kocher et al., 2018), observed through cache-timing side channels.
- Meltdown (2018) was a closely related flaw: speculative execution through a user-mode access to kernel memory, with the value observable via cache side channels before the fault retired. Mitigation was KPTI (kernel page-table isolation), costing 5–30% on syscall-heavy workloads. See the Operating Systems handbook on the syscall tax.
- Branchless programming (cmov, conditional select) avoids the predictor entirely for short unpredictable branches — useful when the branch is genuinely 50/50 and data-dependent. Linus Torvalds's famous linked-list rant is a worked example; the generic lesson is "unpredictable branch + easy arithmetic alternative = go branchless."
- perf stat -e branch-misses,branches and perf record -e branch-misses on your own hot code is a day well spent. A hot loop with a 5% branch-miss rate is the machine telling you to restructure the control flow.
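The branchless transform itself is mechanical: turn the condition into a mask and select with bitwise operations. A Python model of what cmov-style selection computes (this shows the transform, not the speedup):

```python
# Branchy version: the hardware must predict which return executes.
def select_branchy(cond, a, b):
    if cond:
        return a
    return b

# Branchless version: condition becomes an all-ones or all-zeros mask,
# and the result is picked with pure arithmetic — no control flow.
def select_branchless(cond, a, b):
    mask = -int(bool(cond))            # True -> -1 (all ones), False -> 0
    return (a & mask) | (b & ~mask)

for cond in (True, False):
    for a, b in [(7, 42), (0, -1), (123, 456)]:
        assert select_branchless(cond, a, b) == select_branchy(cond, a, b)
```

Compilers do this automatically when they can prove it's safe; the point of knowing the trick is recognising when a 50/50 data-dependent branch is worth rewriting so they will.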
The model you want: the predictor is betting on your code's recent past; write code whose future looks like its past. Sorted inputs, separated hot paths, and tight loops with predictable exits keep the predictor happy. Random control-flow on random data is how you find out the predictor is the only reason your code was fast.
CAUTION
Post-Spectre, speculation is not a purely internal matter — on shared-tenant hardware, speculative loads across a trust boundary can be used to exfiltrate data you never returned. This is why cloud vendors isolate tenants on full cores (not SMT siblings) and why modern compilers insert LFENCE/barrier sequences on some bounds checks. The performance cost is real; so was the vulnerability.
Go deeper: Kocher et al., "Spectre Attacks" (IEEE S&P 2019); Seznec & Michaud, "A case for (partially) TAgged GEometric history length branch prediction" (JILP 2006); Jim Smith, "A Study of Branch Prediction Strategies" (ISCA 1981); Intel's "Speculative Execution Side Channel Mitigations" whitepaper.
Station 6 — Memory hierarchy and NUMA
Caches (Station 4) are the first layer above the core. Zoom out and the whole machine has a memory hierarchy — each layer is bigger, slower, and cheaper than the one below it. Registers (~1 ns, hundreds of bytes). L1 / L2 / L3 caches (~1–20 ns, kilobytes to tens of megabytes). Local DRAM (~100 ns, tens to hundreds of gigabytes). Remote DRAM on a different socket (~150–200 ns). Local NVMe SSD (~20–100 µs). Network-attached storage (~1 ms). Every hop is an order of magnitude slower than the previous; the performance of most programs is determined by which layer their data actually lives in.
On multi-socket servers — two, four, or eight CPU chips in one machine — the memory is not uniform. Each socket has a directly-attached DRAM bank, and accessing another socket's DRAM requires going through an interconnect that adds latency. This is NUMA (Non-Uniform Memory Access), and it means a thread running on socket 1 that reads data allocated in socket 2's memory pays a tax that a better-placed thread would not. The operating system tries to keep threads near their memory (first-touch allocation, automatic NUMA balancing); optimising NUMA-aware software means pinning threads and allocating memory on the same socket.
This station is also where memory bandwidth becomes a thing you can run out of. A DDR5 channel pushes roughly 40 GB/s; a modern server with 8–12 channels per socket delivers 300–450 GB/s per socket; a GPU offers ~3 TB/s of HBM bandwidth. Workloads that stream data through memory (analytics, ML) are often bandwidth-bound, not compute-bound.
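Whether a workload is bandwidth-bound falls out of one ratio: its arithmetic intensity (flops per byte moved) versus the machine's balance (peak flops per byte of bandwidth) — the core of the roofline model. A sketch with illustrative peak numbers:

```python
# Roofline-style check: compare a kernel's arithmetic intensity with
# the machine balance. Peak numbers below are ballpark, not a datasheet.
def bound_by(flops_per_byte, peak_flops, peak_bw_bytes_per_s):
    machine_balance = peak_flops / peak_bw_bytes_per_s
    return "compute" if flops_per_byte > machine_balance else "bandwidth"

PEAK_FLOPS = 2e12     # ~2 Tflop/s per socket (illustrative)
PEAK_BW = 400e9       # ~400 GB/s DRAM bandwidth (illustrative)

# A dot product does ~2 flops per 8 bytes loaded (0.25 flops/byte):
assert bound_by(0.25, PEAK_FLOPS, PEAK_BW) == "bandwidth"
# A large tiled matrix multiply reuses each byte many times:
assert bound_by(50.0, PEAK_FLOPS, PEAK_BW) == "compute"
```

With a machine balance of 5 flops/byte, any kernel below that intensity cannot benefit from faster ALUs — only from moving fewer bytes.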
Once a machine has more than one CPU socket, memory stops being one shared pool. Each socket has its own memory controllers, its own DIMM slots, its own DRAM banks — and reaching another socket's memory goes over a socket-interconnect (Intel UPI, AMD Infinity Fabric) with extra latency and bounded bandwidth. That's NUMA (Non-Uniform Memory Access), and it is the reality on every server with two or more sockets.
Two-socket NUMA topology (conceptually):
       socket 0                       socket 1
   ┌──────────────┐              ┌──────────────┐
   │  cores 0–15  │              │  cores 16–31 │
   │   L1/L2/L3   │              │   L1/L2/L3   │
   └──────┬───────┘              └───────┬──────┘
          │ local                  local │
       ┌──┴───┐                      ┌───┴──┐
       │ DRAM │                      │ DRAM │
       │ 128  │                      │ 128  │
       │  GB  │                      │  GB  │
       └──┬───┘                      └───┬──┘
          │                              │
          └────────── UPI / IF ──────────┘
                ~30–100 GB/s total
                +30–60 ns latency one-way

Local DRAM:  ~80–100 ns latency
Remote DRAM: ~130–180 ns latency (sometimes 200+)
DRAM itself is not one thing. Every DIMM is organized into channels, ranks, banks, rows, and columns; the controller opens a row (~10 ns tRCD), streams columns out, and eventually closes it. Random access that keeps hitting different rows pays the full latency each time; sequential access within a row is fast. The memory controller schedules requests to maximize row-buffer locality — which is why measured DRAM latency is a distribution, not a constant.
- DDR5-4800 peaks around 38 GB/s per channel; a typical server has 8 or 12 channels per socket, so 300–450 GB/s per socket is realistic. HBM3 (on Intel Max, NVIDIA H100, AMD MI300) uses stacked DRAM on an interposer and hits 2–3 TB/s — ~10× DDR5 at much higher cost per GB.
- ECC DIMMs add 1 extra byte per 8 data bytes (64/72-bit words) to correct any single-bit error and detect double. Servers use them by default; consumer DRAM often does not. Google's "DRAM Errors in the Wild" paper (Schroeder et al., 2009) found memory error rates far higher than manufacturer estimates — ECC is not optional for any long-running service.
- NUMA policy on Linux: numactl --cpunodebind=0 --membind=0 ./app pins a process to socket 0's CPUs and RAM; mbind(2) / set_mempolicy(2) do it programmatically. The default (MPOL_DEFAULT) is "allocate on the first-touching node," which mostly does the right thing if the thread that allocates is the thread that uses — which requires you to write code that way.
- CXL (Compute Express Link), built on PCIe's physical and electrical layers, adds coherent memory semantics across devices. CXL 3.0 enables memory pooling across a rack — a host can map memory in another host's expansion chassis with cache-coherent semantics, at ~200–400 ns latency. It is what disaggregated memory looks like when it ships.
The model you want: memory is a tree, not a pool; every node of the tree has different latency and bandwidth. Know which node your data lives on and which node your thread is running on. If the two don't match, you're paying the cross-socket tax every load.
TIP
Run numactl --hardware on any server you care about — it prints the socket count, core map, local memory per socket, and the cross-socket distance matrix. lstopo (from hwloc) draws the picture. Each takes under five minutes; neither is optional before "we'll just scale up horizontally" becomes plan B.
Go deeper: Hennessy & Patterson chapter 2 on memory hierarchy; JEDEC DDR5 spec (JESD79-5) for DRAM timings; Schroeder et al., "DRAM Errors in the Wild" (SIGMETRICS 2009); CXL Consortium's 3.0 specification; Brendan Gregg, Systems Performance chapter 7 for worked NUMA diagnostics.
Station 7 — GPUs, TPUs, and the shape of accelerators
A general-purpose CPU is optimised for a single thread making unpredictable decisions with low latency per operation — branches, pointers, pointer chasing, syscalls. Lots of real workloads do not look like that. Training a neural network is doing the same small arithmetic operation (multiply-accumulate) on billions of floating-point numbers in parallel with highly regular control flow. Rendering a frame is doing the same shader computation on millions of pixels. The right silicon for those workloads is not a general-purpose CPU with four-wide superscalar issue; it is a massively parallel processor with thousands of small cores, regular memory access patterns, and hardware built for specific operation mixes.
GPUs (NVIDIA H100, B100, AMD MI300, Intel Xe) are the most common. A modern GPU has thousands of compute units grouped into "streaming multiprocessors," tens of gigabytes of on-package HBM (High Bandwidth Memory) with multi-TB/s bandwidth, and specialised matrix-multiply hardware (Tensor Cores) that do FMA on small matrix tiles in one instruction. TPUs (Google) are even more specialised — built around large systolic matrix-multiply arrays and designed specifically for ML workloads. Other accelerators (Cerebras, Groq, Tenstorrent, AWS Trainium / Inferentia) trade off specialisation vs flexibility differently, each targeting a specific shape of compute.
The architectural cost of an accelerator is that your data has to get there. A GPU sees main memory only across a PCIe bus (or NVLink, Station 8). Moving a terabyte of data from CPU RAM to GPU HBM takes seconds; optimising for accelerators means keeping as much work on the accelerator side as possible, not round-tripping.
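That "don't round-trip" rule has a break-even point you can sketch with a rough cost model. All numbers below are illustrative assumptions (an effective ~50 GB/s for an x16 PCIe 5.0 link, ~10 µs of launch overhead), not measurements:

```python
# Rough offload break-even (all figures are illustrative assumptions):
# total GPU-side time = launch overhead + transfer both ways + kernel time.
# Offloading wins only if that beats the CPU doing the work directly.
def worth_offloading(bytes_moved, gpu_kernel_s, cpu_s,
                     pcie_gb_s=50.0, launch_s=10e-6):
    transfer_s = 2 * bytes_moved / (pcie_gb_s * 1e9)   # to device and back
    return launch_s + transfer_s + gpu_kernel_s < cpu_s

# 1 MB job: a 5 us kernel can't amortise ~42 us of PCIe plus launch overhead
print(worth_offloading(1 * 1024**2, gpu_kernel_s=5e-6, cpu_s=30e-6))   # False
# 10 GB job: ~0.43 s of transfer is dwarfed by the 2 s the CPU would take
print(worth_offloading(10 * 1024**3, gpu_kernel_s=0.05, cpu_s=2.0))    # True
```

The structure matters more than the constants: the transfer term scales with data size, the launch term is fixed per kernel, so many small kernels are the pathological case.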
A CPU core is optimized for one thing: making a single sequential stream of instructions run fast, and it spends the overwhelming majority of its transistors on out-of-order machinery, branch prediction, and cache rather than on arithmetic. An accelerator is the opposite deal: a sea of simple arithmetic units, tiny per-unit control, explicit parallelism, specialized datapaths. You give up latency for throughput; you give up generality for perf/W.
Same die area, two philosophies:
CPU core (desktop-class, ~10 mm² of logic)
┌──────────────────────────────────────────┐
│ frontend OoO engine 4 ALUs │
│ decoders ROB 2 AGUs │
│ BTB/RSB 512-entry RS FPU/vector │
│ L1 I/D 32k L2 1MB huge caches │
└──────────────────────────────────────────┘
Perhaps 10–20 GFLOPS; great on one thread.
GPU SM / AMD CU (~1–2 mm² each, dozens per die)
┌──────────────────────────────────────────┐
│ 64–128 "CUDA cores" (lanes) │
│ tiny scheduler, no OoO │
│ shared memory (48–192 KB per SM) │
│ register file ~256 KB │
└──────────────────────────────────────────┘
Hundreds of GFLOPS per SM; useless alone; devastating at N=80+.
GPUs execute in the SIMT model (Single Instruction, Multiple Thread): a warp (NVIDIA) or wavefront (AMD) of 32 or 64 threads runs the same instruction on different data. Branch divergence within a warp costs cycles — lanes that diverge off the hot path idle until the warp reconverges. Memory accesses are coalesced when adjacent lanes hit adjacent addresses in a single cache line; uncoalesced accesses can be 10× slower.
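The divergence cost can be captured in a toy model. This is an illustration of the accounting, not of real warp scheduling (hardware uses reconvergence stacks and, on recent GPUs, independent thread scheduling):

```python
# Toy SIMT model (illustrative only): a warp's lanes run in lockstep, so on
# a divergent branch the warp executes the "then" path with one active mask,
# then the "else" path with the complement. Total cycles = sum of both paths.
WARP = 32

def warp_cycles(predicates, then_cycles, else_cycles):
    """Cycles a warp spends on an if/else, given each lane's predicate."""
    taken = sum(predicates)
    cost = 0
    if taken > 0:                  # at least one lane takes the branch
        cost += then_cycles
    if taken < len(predicates):    # at least one lane falls through
        cost += else_cycles
    return cost

uniform = warp_cycles([True] * WARP, 10, 10)                        # all agree: 10
divergent = warp_cycles([i % 2 == 0 for i in range(WARP)], 10, 10)  # split: 20
print(uniform, divergent)
```

The punchline: a warp where even one lane diverges pays for both paths, which is why data-dependent branches inside hot GPU kernels are restructured into predicated or sorted forms.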
- NVIDIA H100 (Hopper, 2022): 80 billion transistors, 80 GB HBM3 at ~3 TB/s, 4 TB/s SM-to-SM fabric, ~60 TFLOPS FP64 (with Tensor Cores), ~1000 TFLOPS dense BF16 with Tensor Cores, ~2 PFLOPS dense FP8, ~4 PFLOPS FP8 with sparsity. Those numbers are what makes a frontier LLM trainable.
- Tensor Cores (NVIDIA, since Volta 2017) and matrix engines (AMD CDNA, Intel AMX) do a small matrix-multiply in one instruction — typically 4×4×4 FP16 → FP32 per warp per cycle. This is the single innovation that took deep learning from "interesting" to "running everything."
- TPUs (Google, 2015–) go further: a systolic array of multiply-accumulate cells — Google's TPUv4 is a 128×128 array running at ~1 GHz, producing 16 K MACs per cycle — wired as a dataflow pipeline. No cache, no registers between MACs; just a grid where data and weights march through each other. Beat GPUs on perf/W for inference at Google's scale; less flexible for arbitrary workloads.
- ML pipelines: the Data & AI handbook covers what you run on these chips (batching, KV cache, quantization). At the hardware level, two numbers dominate — memory bandwidth (how fast weights reach the MACs) and arithmetic intensity (FLOPs per byte of memory traffic). Compute-bound kernels saturate the MACs; memory-bound kernels saturate HBM; the roofline model (Williams et al., 2009) is the standard picture for which regime you're in.
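The roofline picture reduces to a single comparison. A minimal sketch, using the approximate H100 figures from this section (the function names are ours):

```python
# Roofline model (Williams et al., 2009): attainable throughput is capped by
# min(peak compute, arithmetic intensity x memory bandwidth).
def attainable_tflops(intensity_flops_per_byte, peak_tflops, bw_tb_s):
    """Min of the flat compute roof and the bandwidth-limited slope."""
    return min(peak_tflops, intensity_flops_per_byte * bw_tb_s)

PEAK, BW = 1000.0, 3.0      # ~1000 TFLOPS BF16, ~3 TB/s HBM3 (this section's figures)
ridge = PEAK / BW           # intensity where the roofs meet: ~333 FLOPs/byte

# Low-intensity kernel (2 FLOPs/byte): memory-bound, far below peak
print(attainable_tflops(2, PEAK, BW))     # 6.0
# Big matmul (500 FLOPs/byte, right of the ridge): compute-bound at peak
print(attainable_tflops(500, PEAK, BW))   # 1000.0
```

The ridge point is the whole diagnosis in one number: kernels left of it cannot be helped by more FLOPS, only by more bandwidth or more data reuse.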
The model you want: accelerators are throughput machines with a narrow waist. You feed them big batches of the same operation, they give you 10–100× the ops/second of a CPU per watt. Anything that doesn't fit the narrow waist — irregular control flow, small batches, dependent chains — runs worse than it would on a CPU.
WARNING
"Just offload it to the GPU" is the worst sentence in performance engineering. The PCIe host-to-device transfer is on the order of tens of GB/s, and a small kernel's wall-clock is dominated by that transfer. If your workload is under a few MB of input per kernel launch, or you launch many small kernels, you may be paying more in PCIe than you save in compute.
Go deeper: Hennessy & Patterson chapter 4 (data-level parallelism) and chapter 7 (domain-specific architectures); Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit" (ISCA 2017); Williams, Waterman & Patterson, "Roofline: An Insightful Visual Performance Model" (CACM 2009); NVIDIA's CUDA C Programming Guide chapters on warps, coalescing, and shared memory.
Station 8 — Interconnect: PCIe, CXL, NVLink
A modern server is not a single chip; it is many chips — CPUs, GPUs, NICs, NVMe SSDs, DPUs — that all need to talk to each other. The interconnect is the physical bus and protocol that carries bytes between them, and its bandwidth and latency shape what kinds of workloads are possible.
PCIe (Peripheral Component Interconnect Express) is the standard bus for almost everything — a GPU, an NVMe drive, a network card are all on PCIe, each with 4, 8, or 16 "lanes" that together provide gigabytes per second of bidirectional bandwidth. PCIe 5.0 (current for servers) delivers ~4 GB/s per lane; PCIe 6.0 (rolling out) doubles that. NVLink (NVIDIA's proprietary interconnect between GPUs on the same board) is an order of magnitude faster — 900 GB/s between pairs of H100s — because ML training needs that bandwidth for all-reduce over model gradients. CXL (Compute Express Link) is the emerging cache-coherent fabric on top of PCIe physical layers — it lets accelerators and memory expanders share a coherent memory space with the CPU, blurring the line between "local RAM" and "device memory" that has stood for decades.
At the rack scale, interconnect becomes network: InfiniBand, RoCE (RDMA over Converged Ethernet), and emerging fabric technologies (Ultra Ethernet, NVLink Switch systems) connect many GPUs across many machines into one logical training cluster. The performance of a multi-node training job depends far more on the interconnect than on any single chip.
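The all-reduce mentioned above has a well-known bandwidth cost that makes the interconnect dependence concrete. A sketch using the standard ring all-reduce analysis; the link bandwidths and the 7B-model gradient size are illustrative assumptions:

```python
# Ring all-reduce cost model (standard result): each of p ranks sends and
# receives 2*(p-1)/p of the buffer, so time ~= 2*(p-1)/p * bytes / link
# bandwidth. Latency terms and overlap with compute are ignored.
def ring_allreduce_s(bytes_per_rank, p, link_gb_s):
    return 2 * (p - 1) / p * bytes_per_rank / (link_gb_s * 1e9)

grad_bytes = 14e9   # e.g. a 7B-parameter model's gradients in BF16 (assumption)
nvlink = ring_allreduce_s(grad_bytes, p=8, link_gb_s=450)  # ~450 GB/s/direction (assumed)
pcie = ring_allreduce_s(grad_bytes, p=8, link_gb_s=25)     # x16 PCIe 4.0-ish effective

print(f"{nvlink * 1e3:.0f} ms over NVLink vs {pcie * 1e3:.0f} ms over PCIe")
```

An order of magnitude per gradient exchange, every step, is why multi-node training economics are set by the fabric rather than by any one chip.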
Silicon doesn't live alone. Every chip talks to memory, to storage, to NICs, to other chips. The wires that carry those bits have their own standards, speeds, and topologies — and increasingly, their own coherence semantics. Interconnect is the station where "my chip" becomes "my system."
Per-lane speeds (full duplex, PAM4 from PCIe 6 onward):
generation year per-lane x16 link typical use
────────── ──── ────────── ───────── ─────────────────
PCIe 3.0 2010 ~1 GB/s ~16 GB/s CPUs shipped 2012–2021
PCIe 4.0 2017 ~2 GB/s ~32 GB/s AMD Zen2, Intel Ice Lake
PCIe 5.0 2019 ~4 GB/s ~64 GB/s current server CPUs
PCIe 6.0 2022 ~8 GB/s ~128 GB/s shipping 2024–25
PCIe 7.0 2025 ~16 GB/s ~256 GB/s spec ratified, silicon coming
NVLink 4 2022 ~50 GB/s per link, up to 18 links on H100
→ ~900 GB/s per GPU to GPU
Infinity Fabric 2017– AMD socket-to-socket and chiplet-to-chiplet
UCIe 2022 open chiplet die-to-die standard, up to 1 TB/s/mm
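The per-lane column in the table is derivable from the signaling rate and line coding. A sketch: PCIe 3.0–5.0 use 128b/130b encoding; PCIe 6.0 moves to PAM4 with FLIT-based framing, modeled here as 100% line-code efficiency, which is a simplification (FLIT overhead is ignored):

```python
# Per-lane PCIe bandwidth: raw rate (GT/s) x encoding efficiency / 8 bits.
def lane_gb_s(gt_per_s, efficiency):
    return gt_per_s * efficiency / 8

GENS = {                    # generation: (GT/s, line-code efficiency)
    "3.0": (8, 128 / 130),
    "4.0": (16, 128 / 130),
    "5.0": (32, 128 / 130),
    "6.0": (64, 1.0),       # PAM4 doubles bits per symbol at the same clock
}
for gen, (gt, eff) in GENS.items():
    print(f"PCIe {gen}: {lane_gb_s(gt, eff):.2f} GB/s/lane, "
          f"x16 ~ {16 * lane_gb_s(gt, eff):.0f} GB/s")
```

This reproduces the ~1/~2/~4/~8 GB/s per lane in the table, and shows why each generation is almost exactly a doubling: the encoding overhead has been nearly constant, so the signaling rate carries the whole improvement.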
- PCIe is packet-switched, full-duplex, and topologically a tree (root complex at the CPU, endpoints at devices, switches in between). Every NVMe SSD, every GPU, every high-speed NIC is a PCIe endpoint. An "x16" slot is simply sixteen lanes bonded into one link; a GPU typically wants x16, an NVMe SSD x4, a slow NIC x1.
- CXL (Compute Express Link) 1.1/2.0/3.0 rides on the PCIe physical layer but adds three protocols: CXL.io (PCIe-equivalent I/O), CXL.cache (a device coherently caching host memory), and CXL.mem (the host coherently accessing device memory). CXL 3.0 adds pooling and sharing: a single memory expander can be presented to multiple hosts, each seeing it as its own NUMA node at ~200–400 ns.
- NVLink is NVIDIA's proprietary GPU-to-GPU interconnect — higher bandwidth and lower latency than PCIe, with hardware coherence. An 8-GPU HGX-H100 node gives each GPU direct ~900 GB/s of bandwidth to every other GPU via NVSwitch, making all 80 GB × 8 = 640 GB of HBM effectively one address space for LLM training.
- DMA is the old idea behind all of this: a device moves bytes between its memory and host RAM without bothering the CPU. An IOMMU (Intel VT-d, AMD-Vi) translates device-side addresses and enforces access boundaries — the hardware counterpart to the MMU page tables discussed in the Operating Systems handbook.
- Disaggregation and chiplets: AMD's EPYC is already a package of chiplets connected by Infinity Fabric; Intel and NVIDIA have followed. UCIe (Universal Chiplet Interconnect Express, 2022) is the open die-to-die standard that lets chiplets from different vendors coexist on one substrate. The "one chip" is becoming "one package of many specialized chiplets."
The model you want: every chip is a node in a graph whose edges have bandwidth and latency, and every performance claim quietly assumes a particular graph. A GPU that does "4 PFLOPS" needs data moved to it; "4 PFLOPS" assumes NVLink or HBM at the right place in the graph. Move that data over x16 PCIe 4.0 and the claim becomes marketing.
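The graph framing can be made literal: effective bandwidth between two chips is the minimum-bandwidth edge on the path between them. A sketch with an illustrative topology (the edge numbers are assumed per-direction figures, not a specific product's):

```python
# Chips as graph nodes, links as edges with bandwidth: the effective
# bandwidth of any route is its bottleneck edge, so one PCIe hop caps
# an otherwise NVLink-class path.
def path_bw_gb_s(path, edges):
    """Bottleneck bandwidth along a path of named links."""
    return min(edges[(a, b)] for a, b in zip(path, path[1:]))

edges = {                       # GB/s, illustrative per-direction figures
    ("gpu0", "gpu1"): 900,      # NVLink/NVSwitch
    ("cpu", "gpu0"): 64,        # x16 PCIe 5.0
    ("cpu", "dram"): 400,       # 8-12 DDR5 channels
}
edges.update({(b, a): bw for (a, b), bw in list(edges.items())})  # links are symmetric

print(path_bw_gb_s(["gpu0", "gpu1"], edges))         # 900: GPU peers talk fast
print(path_bw_gb_s(["dram", "cpu", "gpu0"], edges))  # 64: PCIe caps the whole route
```

Datasheet FLOPS assume the 900 edge; a workload that streams from host DRAM lives on the 64 edge, which is the gap between the claim and the marketing.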
CAUTION
"We'll just plug more GPUs into this box" runs out quickly — a dual-socket server typically has ~128 PCIe 5.0 lanes total, so eight x16 GPUs leave nothing for NICs and storage. The reason HGX-style boxes exist is that NVIDIA engineered around PCIe's lane budget with NVSwitch. Copying their topology for your own accelerator means copying their interconnect, too.
Go deeper: PCIe Base Specification rev 5.0 / 6.0 (PCI-SIG); CXL 3.0 specification; NVIDIA's NVLink / NVSwitch whitepapers; UCIe 1.1 spec; Hennessy & Patterson chapter 6 on warehouse-scale computers.
How the stations connect
Each station is a lower layer's leaky abstraction. Transistors leak into power budgets; flip-flops leak into metastability; pipelines leak into branch predictor tuning; caches leak into false sharing and memory models; interconnect leaks into NUMA and PCIe lane arithmetic. All of it leaks into the software stack above.
The Operating Systems handbook treats this chip as a resource to multiplex — schedulers sit on Station 3, virtual memory sits on Stations 4 and 6, security primitives on Stations 3 and 5 (the speculation-attack surface). The Foundations handbook defines the representations this hardware runs on — IEEE 754 at Station 3, UTF-8 at the bytes Station 4 moves.
Standards & Specs
- IEEE 754-2019 — Floating-Point Arithmetic — the arithmetic contract every FPU on every ISA implements.
- IEEE 1149.1 — JTAG — the boundary-scan standard behind every chip's debug interface.
- JEDEC DDR5 (JESD79-5) — the DRAM-timing contract; similarly JESD235 for HBM3.
- PCI Express Base Specification — the canonical interconnect; CXL rides its physical layer.
- CXL Consortium 3.0 spec — cache-coherent memory semantics over PCIe.
- UCIe 1.1 spec — the open chiplet die-to-die standard.
- x86 System V ABI and ARM AAPCS64 — the register-allocation and calling contracts.
- Intel 64 / IA-32 Software Developer's Manual (Vols. 1–4) and ARM Architecture Reference Manual — the ISAs in full.
- RISC-V Unprivileged and Privileged ISA specs — the open-ISA canon; short enough to read in a weekend.
- Canonical papers — Mead & Conway, Introduction to VLSI Systems (1980); Patterson & Sequin, "RISC I" (ISCA 1981); Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units" (IBM J. R&D 1967); Hennessy et al., "MIPS: A Microprocessor Architecture" (1982); Kocher et al., "Spectre Attacks" (2019); Jouppi et al., "TPU" (ISCA 2017); Seznec, "TAGE" (2006); Williams, Waterman & Patterson, "Roofline" (CACM 2009); Schroeder et al., "DRAM Errors in the Wild" (2009).
- Books — Hennessy & Patterson, Computer Architecture: A Quantitative Approach (6th ed, the canonical textbook). Patterson & Hennessy, Computer Organization and Design (the undergraduate companion). Harris & Harris, Digital Design and Computer Architecture. Weste & Harris, CMOS VLSI Design. Sorin, Hill & Wood, A Primer on Memory Consistency and Cache Coherence. Agner Fog's optimization manuals (free online).
Test yourself
Two threads each increment a shared int32 counter in a tight loop, using an atomic add. Throughput is worse than a single thread. Name the root cause and the cheapest fix.
Cache-line bouncing on the counter's line. Each atomic add on a MESI system invalidates the line in the other core's cache, forcing a round trip on the coherence fabric — 40–100+ cycles of latency per increment. The "cheap" fixes are per-thread counters aggregated periodically, or a counter per CPU that a reader sums. A lock doesn't help — the lock itself lives on a cache line, so you've moved the problem. See Station 4.
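The per-thread-counter fix, sketched as a pattern. Python threads won't exhibit the cache-line effect (and the names here are ours), but the structure is identical in C/C++/Rust, where each shard would additionally be padded to its own 64-byte line:

```python
# Sharded counter: each thread increments only its own shard, so no cache
# line is contended between writers; readers pay the cost of summing.
import threading

class ShardedCounter:
    def __init__(self, nshards):
        self.shards = [0] * nshards      # in C: pad each shard to 64 bytes

    def add(self, shard, n=1):
        self.shards[shard] += n          # each thread writes only its shard

    def value(self):
        return sum(self.shards)          # aggregation happens at read time

counter = ShardedCounter(nshards=4)

def work(shard):
    for _ in range(10_000):
        counter.add(shard)

threads = [threading.Thread(target=work, args=(s,)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.value())    # 40000
```

The trade is deliberate: writes become local and cheap, reads become a small scan, and the counter's value is only ever approximately current while writers are running, which most metrics tolerate.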
A numerical workload runs at 3.2 TFLOPS on a GPU whose datasheet lists 60 TFLOPS. The profiler shows the SMs are stalled most of the time. What is the workload bound on, and what question do you ask before buying a bigger GPU?
The workload is almost certainly memory-bound, not compute-bound — the SMs are stalled waiting on HBM. Compute the arithmetic intensity (FLOPs per byte loaded from HBM) and check it against the GPU's ridge-point on the roofline. If you're left of the ridge, buying a GPU with more FLOPS won't help; you need one with more HBM bandwidth (or a kernel that reuses more data per load). See Station 7 and Williams's roofline model.
A dual-socket server shows p99 DB latency spikes that correlate with the query routing to cores on the "other" socket. What is going on in one sentence, and what is the one-line mitigation?
The query thread is running on socket 1 but the DB's buffer pool was allocated on socket 0 at startup, so every hot read crosses UPI — ~2× the local latency, with interconnect contention adding variance. Mitigation: pin the DB process with numactl --cpunodebind=0 --membind=0 ./db (or the container equivalent) so memory and threads share a socket. See Station 6.