Act II of X

Machines

A computer is a clock ticking through a list of simple instructions, very fast.

A computer is one trick repeated at six scales. A switch — useless alone — composes into a gate that does logic. Gates compose into an arithmetic unit and a small bag of named storage. Add a clock and a way to read instructions in order, and you have a CPU core. Cores need a pyramid of memory beneath them, because raw RAM is too slow to feed them. The clock that drives all of this stopped getting faster around 2005, so chips grow sideways now: more cores, then specialized accelerators that abandon the general-purpose contract entirely.

From transistor to system-on-a-chip — the layered build-upSystem on a chipCPU cores · GPU · NPU · memory controllers · I/OMulti-core CPUcores share L3 cache and a memory busCPU core + memory hierarchyL1 · L2 caches feed registers; clock drives fetch–executeALU · registers · control unitarithmetic + named storage + sequencingLogic gatesAND · OR · NOT · XOR — a complete basisTransistora switch you control with electricityeach layer hides the one below; abstraction is the only way the whole stack fits in one head
Each layer hides the one below it. When that hiding fails — Spectre and Meltdown are the canonical examples — a lower layer leaks through and breaks the abstraction the layer above relied on.

The transistor

Problem. Computing requires a physical thing that can hold a bit and flip it on command, billions of times per second, without falling apart.

Solution. A transistor is a switch operated by voltage instead of by hand. It has three terminals — a source, a drain, and a gate. Voltage on the gate decides whether current flows between source and drain: gate high closes the switch, gate low opens it (in an n-channel MOSFET; the p-channel transistor is its complement). That is the entire substrate-level mechanism on which everything else in this page rests.

Why it works. Two properties matter. A transistor has only two stable output states, on or off — not a continuum — so a noise margin separates them and bits can pass through millions of stages without degrading. And modern transistors are very small and very fast: a current-generation gate is about 20 nm long and switches in roughly a picosecond. A leading-edge die packs 100–200 billion of them in roughly the area of a thumbnail. Transistor counts have kept doubling on schedule even though clock frequencies stopped doubling decades ago.

A MOSFET as a controlled switchsourcedraingatechanneln-channel MOSFETgatechannel0 V (low)open · no currentVDD (high)closed · conducts≈20 nm gate length · ≈1 ps switching · ≈1011 per modern dietwo stable states · noise margin · cheap to copy— the substrate of everything above
The implementation has changed across decades (planar → FinFET → gate-all-around), but the abstraction is stable: voltage in, current allowed or denied. Every page above this one treats the transistor as the atomic unit.

That's the floor. From here, computers stop being electrical and start being mathematical.

From switches to logic

Problem. A single switch decides one thing. Computation needs combinations: "this and that", "this or that", "not this".

Solution. Wire two transistors in series and the output is high only when both inputs are high — that's an AND gate. Wire them in parallel and you get an OR. Add an inverter and you get NAND and NOR. The four common gates (AND, OR, NOT, XOR) are shorthands for the patterns that appear most often in real designs.

NAND alone — or NOR alone — is functionally complete: every Boolean expression can be built from one repeated gate. That's the leap from physics to math. A truth table is the contract a gate keeps; the gate symbol is just shorthand.

The four primitive logic gates and their truth tablesANDA B·Y0 000 101 001 11ORA B·Y0 000 111 011 11NOTA·Y0110XORA B·Y0 000 111 011 10NAND alone (or NOR alone) is enough — every Boolean function reduces to one repeated gate"functional completeness" is what makes the leap from physics to math possible
CMOS implements each gate with a small, fixed number of transistors — a 2-input NAND uses four. Multiply by billions and you have a chip.

The first non-obvious circuit built from these gates is a half-adder: two inputs, two outputs. The sum is A XOR B (1 when exactly one input is high); the carry is A AND B (1 only when both are). Two of these plus an OR gate become a full-adder, which also accepts a carry-in from a previous bit. Chain n full-adders together and you can add two n-bit numbers. That's the entire arithmetic story from Act I, now built from gates.

Half-adder: XOR for sum, AND for carryABSumCarryXORANDA BSumCarry0 0000 1101 0101 101Sum is "exactly one input high"; Carry is "both inputs high"
Two gates suffice for one bit of arithmetic. The naive ripple-carry chain has latency proportional to width; production designs use carry-lookahead to do 64-bit addition in roughly log₂(64) gate-delays instead of 64.

Pitfall — the gates so far are memoryless. Feed the same inputs to an adder and you get the same output, every time, regardless of what came before. That's a combinational circuit. To run a program, we need circuits that remember — outputs that depend on inputs and on history.

The fix. A latch or flip-flop holds a bit between clock ticks. The standard cell is the D flip-flop: one data input D, one clock input, one output Q. On the rising edge of the clock — and only then — Q snaps to whatever D held at that instant; between edges, Q stays put. A 64-bit register is sixty-four D flip-flops in parallel on a shared clock. The boundary between "logic" and "memory" disappears here: a register is just a row of flip-flops.

D flip-flop: Q samples D on the rising edge of the clockDclkQQ'D-FFedge-triggered · one bit of stateclkDQrising edges sample D into Q
Changes on D between rising edges are ignored — that property is what makes synchronous design tractable. Stack a few flip-flops with combinational logic and you have a state machine; every counter, controller, and program counter in a chip is built this way.

Arithmetic and memory together, sequenced by a clock, give the next layer: a CPU.

From logic to processor

A pile of gates is not a computer. To run a program, the chip needs to read instructions in some order, perform the operation each instruction names, and store the result somewhere it can read again. A CPU core is the smallest assembly that earns that name. It bundles four pieces:

  • Registers — a small bank of named storage (typically 16–32 general-purpose registers, each 32 or 64 bits wide). The fastest memory in the system; the ALU reads and writes them in one cycle.
  • ALU (arithmetic-logic unit) — does the math. Built from the adders and gates above, plus a multiplier, shifter, and comparator.
  • Control unit — sequences operations. Reads the next instruction, decodes which operation it names, and pulls the right wires to make the ALU and registers do that thing.
  • Buses — the wires that move bits between these pieces and out to memory.
Inside a CPU core: control unit, register file, ALUInstruction memoryData memoryControl unitRegister fileALUaddresses → bytesload · storealigned accessdecode · sequencex0 … x31named scratch spaceadd · sub · and · orshift · cmp · mulfetchcontrol signalsload / storeoperandsresultcontrol · datapath · register file · ALU — the four pieces every ISA exposes in some form
A real core adds a program counter, instruction queue, branch predictor, and load/store unit. The four-piece skeleton is what undergraduate textbooks ship — and what every microcontroller still looks like.

The core runs an endless fetch-decode-execute loop. Fetch the instruction at the address held in the program counter. Decode the bits to know which operation, which registers, which immediate values. Execute on the ALU, or read or write memory. Write the result back to a register and advance the program counter. Repeat, billions of times per second. Modern superscalar cores fetch and retire 4–8 instructions per cycle and decorate this loop with pipelines, predictors, and speculative execution — but the contract the programmer sees is still that loop, in order.

Pipelining

A naive core spends each cycle on one stage of one instruction — fetching, then decoding, then executing, then writing back — leaving most of the chip idle most of the time. A pipelined core overlaps the stages so multiple instructions are in flight at once, each at a different stage. The classic textbook RISC pipeline has five stages — IF (fetch), ID (decode), EX (execute), MEM (memory access), WB (writeback). Once the pipeline is full, five instructions are in flight at once and one instruction retires per cycle, even though each individual instruction still takes five cycles end to end. Same gates, rearranged, five times the throughput at the same clock.

A 5-stage pipeline: five instructions overlap to retire one per cyclecycle 123456789i1i2i3i4i5IFIDEXMEMWBIFIDEXMEMWBIFIDEXMEMWBIFIDEXMEMWBIFIDEXMEMWBsteady state · one instruction retires per cycle5 instructions in flight · 5× throughput at the same clockindividual latency unchanged · throughput is what scales
The throughput win requires the next instruction to be ready every cycle. Branches and data dependencies break that assumption — which is what the rest of this section is about.

Pitfall — data hazards. Two consecutive instructions can collide. If i2 reads a register that i1 is still computing, i2 must wait until i1 writes the result back. The naive fix is to stall the pipeline: insert a "bubble" until the value is in the register file, costing cycles for nothing.

The real fix — forwarding. Route i1's ALU result directly from the EX-stage output back to i2's EX-stage input, before the writeback ever happens. The two instructions retire without a stall. Modern cores forward across many paths, which is part of why their datapaths are so dense.

Data hazard: the stall (top) and the forwarding fix (bottom)Stall: bubble insertedForwarding: result routed EX → EXcyc 1234567i1: add r3,r1,r2i2: sub r5,r3,r4IFIDEXMEMWBIFIDstallstallEXMEMi2 reads r3 — must wait for i1 WB · 2 cycles losti1: add r3,r1,r2i2: sub r5,r3,r4IFIDEXMEMWBIFIDEXMEMWBforwarda wire from EX-out to EX-in turns a 2-cycle stall into a 0-cycle hazard
Data hazards have a clean fix; branches do not. The IF stage doesn't know which instruction to fetch next until a conditional branch executes — the topic of the next subsection.

Branch prediction and speculation

Problem. A pipelined core has to fetch the next instruction every cycle. After a conditional branch — if (x > 0) goto L1 else goto L2 — the next instruction lives at one of two addresses, and the core won't know which until the branch executes a few cycles later. Stalling until then would empty the pipeline.

Solution. A branch predictor guesses which way each branch will go and starts speculatively fetching from the predicted target. If the guess is right, no cycles are lost. If the guess is wrong, the speculatively fetched work is thrown out and the pipeline restarts from the correct target — typically a 15–20 cycle penalty on a deep modern pipeline.

Mechanism. The simplest predictor is a 2-bit saturating counter per branch: four states from "strongly not taken" through "strongly taken". A correct prediction nudges the counter further in that direction; a wrong prediction nudges it back. Two bits of history mean a single anomaly does not flip the prediction — it takes two consecutive surprises. Real predictors stack this with branch history and pattern tables and hit ≥95% accuracy on branchy code.

Branch prediction: a 2-bit saturating counter as a state machine00011011strongnot takenweaknot takenweaktakenstrongtakentakennot takentakennot takentakennot takennot takentakenpredict not takenpredict takentwo consecutive surprises to flip the prediction · ≥ 95% accuracy on real branchy code
Speculation has a security cost. Mispredicted instructions get squashed before they commit, but they can still leave traces in the cache. That is the foothold Spectre exploits.
Worked example: one mispredict and what the core has to throw away

A hot loop runs the same branch ten times. The first nine iterations take it; the tenth exits. Suppose the predictor's counter for this branch starts at 11 (strongly taken) and the pipeline is five stages deep.

  1. Cycles 1–9. Each iteration, the predictor says "taken" before the branch has executed. The fetch unit immediately starts pulling the loop-body instructions for iteration n+1 into IF, ID, EX behind the branch. By the time the branch resolves in EX, three or four follow-on instructions are already in flight — and they are the right ones. No cycles lost. Counter stays at 11.
  2. Cycle 10. The branch is not taken this time (loop exit), but the predictor still says "taken". The fetch unit speculatively pulls in one more loop-body iteration. Four cycles later the branch resolves in EX and the comparator says: wrong.
  3. The flush. Every instruction younger than the branch — the speculatively-fetched loop body, sitting in IF, ID, and EX — is squashed. Their register writes are discarded (the ROB never lets them commit). The program counter is reset to the loop-exit address and the pipeline starts refilling from there. That is the 15–20 cycle bubble the prose mentioned: the time to drain and refill on a deep modern pipeline.
  4. The counter update. The 2-bit counter drops from 11 to 10 — still "taken", just less confidently. One surprise doesn't flip the prediction. If the loop ran again right away and exited at the same point, the counter would drop to 01 and now the predictor would say "not taken" for next time.

The cost of being wrong is the pipeline depth times the issue width — easily 50–100 squashed micro-ops on a wide core. The cost of being right is zero. That asymmetry is why even 95%-accurate prediction is worth the silicon.

Out-of-order execution

The natural endpoint of pipelining is to stop pretending that program order matches execution order. In an out-of-order core, instructions enter in program order but execute as soon as their operands are ready, on whichever execution unit happens to be free. A reorder buffer (ROB) holds the in-flight instructions; results are kept private until the ROB commits them in original program order. The programmer-visible state still matches the program — but underneath, 200+ instructions are in flight at once.

The ROB is what holds the abstraction together. Misbehaved instructions (bad branches, page faults) can be squashed before they commit, so the architectural state never sees them. The visible machine still executes one instruction at a time, in order.

The instruction set architecture

Hardware and software need a contract: which instructions exist, what bits encode them, and what each one does. The instruction set architecture (ISA) is that contract. x86-64 is the dominant desktop and server ISA — variable-length instructions (1–15 bytes), born at Intel, accumulated decades of extensions. Arm dominates mobile and now leads laptop performance-per-watt; fixed 32-bit instructions, designed to be simpler to decode. RISC-V is the open ISA with a free specification, increasingly used in microcontrollers and research silicon. Compilers translate source code to one of these. Everything above the ISA is portable; everything below is not.

The same operation in three ISAs: x86-64, ARM A64, RISC-VOperation: r1 = r2 + r3, then store r1 to memoryx86-64 (CISC)ARM A64 (RISC)RISC-V (RISC)add rax, rbxmov [rcx], raxadd x0, x1, x2str x0, [x3]add x10, x11, x12sw x10, 0(x13)encoding: 1–15 bytesvariable length≈ 1500 base instructionsmemory operands allowedIntel · AMDencoding: fixed 4 bytes31 GP regs (x0–x30)≈ 400 base instructionsload/store only to memApple · Qualcomm · etc.encoding: 4 bytes (RV32I)32 GP regs (x0–x31)≈ 50 base instructionsload/store only to memSiFive · academic · IoTvariable-length CISC: dense binaries, hard to decode · fixed-length RISC: simpler decode, more instructionsall three are translated to similar internal micro-ops once inside the core
Modern x86-64 cores translate variable-length CISC instructions into RISC-like internal micro-ops and execute those, so the visible difference at the instruction level often disappears at the gate level.

A core can run any program — but only if it can be fed. The next problem is that RAM is too slow.

Memory hierarchy

Problem. A modern core executes one or more instructions per nanosecond. A read from main memory takes about a hundred nanoseconds — long enough to lose hundreds of instructions waiting. If every memory access stalled the CPU, the chip would spend most of its time idle.

Solution. The memory hierarchy is a stack of progressively larger, slower, cheaper storage that hides the gap by keeping recently and likely-needed data close. Each level is roughly 5–30× larger than the one above and 3–10× slower.

The memory hierarchy with sizes and latenciesRegistersL1 cacheL2 cacheL3 cacheMain memory (DRAM)SSDHDD / network storage≈0 cyc≈4 cyc≈12 cyc≈40 cyc≈150 cyc≈150,000 cyc≈15,000,000 cyc≈256 B32–64 KB256 KB – 2 MB8 – 64 MB8 – 512 GB0.5 – 8 TBmulti-TBlatency · per accesscapacity · per levela register access is >1,000,000× faster than a disk seeknumbers approximate · vary by chip generation · 1 cycle ≈ 0.3 ns at 3 GHz
Cache lines are typically 64 bytes — the smallest unit moved between levels. Reading one byte pulls in 64, because amortized cost wins over precision.

Why the pyramid works: locality

The hierarchy only pays off because real programs exhibit locality of reference. Temporal locality: an address used now is likely to be used again soon — loop counters, hot data structures, frequently-called functions. Spatial locality: addresses near a recently-used one are likely to be touched soon — sequential array scans, struct fields, instruction streams.

Caches exploit both. When the CPU misses a single byte, it doesn't fetch one byte; it fetches a 64-byte cache line, betting that the next bytes will be touched too. Code that respects locality (row-major iteration, packed structs, tight inner loops) often outruns "smarter" code that doesn't, by an order of magnitude.

Temporal vs spatial localityTemporalSpatialsame address re-accessed in timenearby addresses accessed togetheraddr A · addr B · addr A · addr B · addr Aloop counter · cached lookup · hot locktime →arr[0] arr[1] arr[2] arr[3] arr[4] …array scan · packed struct · instruction streamaddress →
Both kinds of locality are properties of programs, not of hardware. Code that has them runs faster on the same chip; code that doesn't pays the full memory latency. perf stat reports cache-miss rate; an L3 miss rate above a few percent is usually fixable.

Where to put a line: associativity

A cache also has to decide which slot holds each line. The trade-off is between fast lookup and avoiding conflicts.

  • Direct-mapped. Every memory address has exactly one slot it can occupy. Lookup is one tag compare — fast and cheap. But two hot addresses that map to the same slot evict each other forever.
  • Fully-associative. Any line can live in any slot. Zero conflict misses, but every lookup must compare against every tag — slow and power-hungry.
  • Set-associative. Each address maps to a set of N slots and the line can live in any of them. Real L1 caches are typically 8-way; L3 is often 16-way. The compromise wins everywhere.
Cache mapping: direct-mapped vs N-way set-associative vs fully-associativeDirect-mapped2-way set-assocFully associativeone slot per addresschoose 1 of 2 slotsany slot anywhereslot 0slot 1slot 2slot 3way 0way 1way 0way 1set 0set 1any · any · any · anyany · any · any · anyaddr A → slot 1addr B → slot 1conflict: A evicts Baddr A → set 0, way 0addr B → set 0, way 1both fit · no conflictaddr A → free slotaddr B → free slotno conflict missescheap · 1 tag compare2 tag compares · cheapN tag compares · expensivereal L1 caches are 8-way · L3 is often 16-way · fully-associative is reserved for very small, very precious caches (TLB)
Conflict misses are the lurking cost of direct-mapped caches. Two arrays whose start addresses differ by a multiple of the cache size will evict each other on every iteration of a loop that touches both — a footgun even good programmers hit.

Virtual addresses and the TLB

The same lookup question recurs one level up. The CPU sees virtual addresses; DRAM is indexed by physical addresses. Translating between them on every memory access would be ruinous, so the chip keeps a small, very-associative cache of recent translations: the TLB (translation lookaside buffer). A TLB hit costs one cycle and the access goes through. A miss triggers a page-table walk in DRAM that can take 100+ cycles, and a process switch may flush the entire TLB.

Virtual-to-physical address translation through the TLBVPN (virtual page number)offsetvirtual address (issued by core)TLBVPN → PFN cache≈ 64–2048 entrieshit · ≈ 1 cyclepage-table walkmiss · ≥ 100 cycles · DRAMinstall in TLBPFN (physical frame)offsetphysical addressPFN from TLB · offset unchangedDRAM access≈ 100 ns to fetch a lineTLB hit ≈ 1 cycle · TLB miss + walk ≈ 100+ cycles · context switch may flush the whole table
Huge pages (2 MB or 1 GB instead of 4 KB) make a measurable difference for big working sets — fewer translations, fewer misses, fewer walks.
Worked example: translating one 64-bit virtual address by hand

A page is 4 KB, so the bottom 12 bits of every address are the offset within the page (since 2^12 = 4096). x86-64 uses 48-bit virtual addresses (the top 16 bits are sign-extended and ignored). The remaining 36 bits index a 4-level page table — 9 bits per level, because each table is itself one 4 KB page holding 512 eight-byte entries (2^9 = 512).

Take the virtual address 0x00007F8E12345678 and split its low 48 bits into fields:

fieldbitsvalue (binary)value (decimal)
PML4 index47–39011111111255
PDPT index38–3000011100056
PD index29–21111000001449
PT index20–1200100011070
offset11–00100011110001144

A TLB miss kicks off the page-table walk:

  1. PML4. The CPU reads the PML4 base register, fetches entry 255 from that page — one DRAM access. The entry holds the physical address of the next-level table.
  2. PDPT. Fetch entry 56 from that table — another DRAM access. Returns the PD's physical address.
  3. PD. Fetch entry 449 — another DRAM access. Returns the PT's physical address.
  4. PT. Fetch entry 70 — one more DRAM access. This entry holds the physical frame number (PFN) for our page, plus permission bits.
  5. Assemble. The physical address is (PFN << 12) | 0x678. The offset is copied through unchanged — same byte within the page, just a different page.

Four DRAM accesses per miss is why TLB misses hurt. It is also why huge pages help: a 2 MB page skips the last level entirely (one fewer DRAM access per walk), and one 2 MB TLB entry covers what would otherwise be 512 separate 4 KB entries.

Pitfall: cache coherence

When two cores cache the same memory line, their writes can disagree. Hardware enforces a coherence protocol — MESI is the classic — so any read sees the most recent write. Each cache line, in each core, lives in one of four states: Modified (dirty, only-copy), Exclusive (clean, only-copy), Shared (clean, possibly in other caches), Invalid. Transitions are driven by snoops on the interconnect. The protocol is invisible to software, but it has a cost.

If two cores write different bytes of the same 64-byte cache line, the line ping-pongs between their caches — false sharing. A "lock-free" loop can run 10× slower than the same loop on independent cache lines. This is one of the few places where the abstraction visibly leaks.

MESI cache-coherence states for one cache line, per coreMESIModifiedExclusiveSharedInvalidlocal writeremote readremote writelocal readlocal writeremote write / evictM: dirty, only-copy — must write back on evictionE: clean, only-copy — silently upgradeable to MS: clean, may exist in other caches — read-onlyI: invalid — must refetch on next accessMESI: four states per cache line, per core
False sharing forces the line through M → I → M → I, ping-ponging across the bus. Padding hot per-thread variables to separate cache lines is one of the first fixes a profiler suggests.
Worked example: two cores, one cache line, the MESI states in order

Two cores share an interconnect. Both have private L1 caches. We watch one 64-byte line L as the cores touch it. Each row shows the state of the line in each core's cache after the operation.

stepoperationCore 0's LCore 1's Lwhat happened
0(initial)IInobody has it
1Core 0 reads LEICore 0 misses, fetches from DRAM. Snoops show no other copy, so the line is Exclusive to Core 0 — clean, only-copy.
2Core 0 writes LMISilent upgrade E → M. No bus traffic; the line is already private and clean. Now it's dirty.
3Core 1 reads LSSCore 1 misses and snoops. Core 0's cache responds with its dirty data (and writes back to DRAM), then both copies become Shared — clean, read-only, in two caches.
4Core 1 writes LIMCore 1 broadcasts an invalidate. Core 0's copy goes I. Core 1's copy becomes M. The next read on Core 0 will miss.
5Core 0 reads LSSSymmetric to step 3: Core 1's cache supplies the dirty data, writes back, both end up Shared again.

Where false sharing bites. Imagine Core 0 keeps writing byte 0 of L while Core 1 keeps writing byte 32. The bytes never overlap — there is no real data race — but MESI tracks state per line, not per byte. Each write triggers an invalidate of the other core's copy. The line ping-pongs through M (core 0) → I, M (core 1) → I, M (core 0) → I, … on every write, dragging cache-line traffic across the interconnect. The fix is to put the two bytes on different cache lines — pad the struct so each thread's hot field sits 64 bytes from the next.

The hierarchy and its protocols hide DRAM latency. The clock that drives all of this has its own constraint.

The clock and the power wall

The clock. Inside a CPU, the clock is a square-wave signal that synchronizes every flip-flop in the design. Each rising edge is one cycle: the moment when sequential logic latches its inputs and the next combinational stage starts computing. Clock frequency is how many cycles fit in a second — a 3 GHz processor ticks 3 × 10⁹ times per second, giving each cycle about 333 picoseconds, the time light takes to travel about 10 cm. The maximum frequency is set by the slowest combinational path between any two flip-flops; if a stage cannot finish in one cycle, the next latch grabs garbage.

Why it stopped climbing. For four decades, clock frequencies climbed in lockstep with transistor counts. That ended around 2005. Dynamic power in CMOS scales as roughly P ∝ C · V² · f — capacitance, voltage squared, frequency. Voltage couldn't keep falling because of leakage and reliability floors. Raising f with a fixed V meant cubic increases in power per transistor, and there is no way to dissipate that heat through 1 cm² of silicon. The industry called it the power wall. Moore's law continued; the partner law that kept power-per-transistor falling did not.

The power wall — frequency plateaus while transistor count keeps climbing101110910710510319851995200520152025Transistors / dieClock frequencySingle-thread perf≈2005: power walllog scaleMoore's law continues; Dennard scaling does not
Once raising clock speed became a quadratic-in-power proposition, the industry pivoted to throwing more cores and more specialized blocks at the same die area.

More cores, then NUMA

What replaced "faster cores" was "more cores". A modern desktop CPU has 8–32 cores; a server CPU has 64–192. Cores share the last level of cache (L3) and the memory controller; they coordinate through cache coherence and a small set of synchronization primitives — atomic loads, stores, and compare-and-swap.

The catch: most software wasn't written to use them. Speeding up a single sequential thread by buying a bigger CPU stopped working twenty years ago, and the burden of finding parallelism moved up the stack — to compilers, runtimes, and ultimately to programmers.

A multi-core die — cores share L3 and a memory busCore 0Core 1Core 2Core 3L1 + L2privateL1 + L2privateL1 + L2privateL1 + L2privateShared L3 cacheMemory controller · interconnectmore transistors → more cores, not faster cores
Cache coherence and the memory bus become the bottleneck once you have more than a handful of cores. Server-class chips replace the bus with a ring or mesh interconnect; designs above 32 cores frequently chiplet-disaggregate to keep the wires short.

Once a system has more than one socket, memory itself becomes non-uniform. In a NUMA (non-uniform memory access) box, each socket has its own DRAM channels — local memory — and reaches the other socket's DRAM only through an interconnect. Local access is roughly 80 ns; remote is 120–150 ns and shares interconnect bandwidth with other traffic. Where threads run, and which DRAM their pages were allocated on, can change throughput by 1.5–2× on the same hardware. Linux's default first-touch policy allocates a page on the node of the thread that first touches it — which is why initialization order changes throughput on a server you haven't tuned.

Beyond the CPU

Problem. A CPU optimizes for single-threaded latency: each core does one (or a few) instruction per cycle, with deep pipelines and aggressive branch predictors that keep one execution stream moving. That is the wrong shape for problems where the same operation runs across millions of independent data points — graphics, deep learning, scientific simulation. Most of the silicon area on a CPU core is spent on machinery (predictors, ROBs, caches) that is irrelevant to those workloads.

Solution. Build chips with the opposite trade-off: many simple lanes, all running the same instruction in lockstep. SIMD (single-instruction, multiple-data) is that idea inside a CPU as a vector unit — typically 4–16 lanes (AVX-512, ARM SVE). SIMT (single-instruction, multiple-thread) is GPUs at scale: a single high-end GPU has roughly 10,000 ALU lanes grouped into "warps" of 32 that share a fetch unit. Throughput per watt is 5–20× higher than a CPU on workloads that fit the model.

CPU vs GPU — latency cores vs throughput coresCPU core (SISD)GPU (SIMT)one instruction → one data → one resultone instruction → many lanes → many resultsFetch / decodeALUResultFetch / decode (one)ALUALUALUALUALUALUALUALU8 results / cycleCPU: low latency on the critical path · GPU: high throughput on regular workillustrative — a real GPU has thousands of lanes grouped into warps of 32
Branchy, irregular code is a CPU workload. Regular tensor and pixel math is a GPU workload. A modern system runs both, with the OS or a runtime dispatching work to whichever fits.

Domain-specific accelerators

The next step beyond GPUs is domain-specific accelerators: chips that abandon programmability for one job done very efficiently. The clearest example is the systolic array at the heart of TPUs, NPUs, and modern matrix-multiply blocks. It is a 2-D grid of small multiply-accumulate cells. Matrix tiles flow through the grid one element per clock, and every cell does one multiply-add per cycle — without ever fetching an instruction. Google's TPU v4 includes a 128×128 systolic array, doing 16,384 multiply-accumulates per cycle when fed continuously. NVIDIA's Tensor Cores, Apple's Neural Engine, and the NPUs in modern phones all share this shape.

Systolic array — matrix data streams through a grid of MAC cells×+×+×+×+×+×+×+×+×+×+×+×+×+×+×+×+×+×+×+×+×+×+×+×+×+Matrix A rows →Matrix B columns ↓→ partial sums accumulateno instruction fetch · every cell does one MAC per cycle
The cost of a systolic array is generality: the silicon does matmul (and very little else) very well. The benefit is one matmul per cycle when fed continuously, with no instruction-fetch overhead.

That trade-off is where the act lands. CPU cores run any program well enough. GPUs run regular dense math an order of magnitude better. Domain-specific accelerators run one shape of computation another order of magnitude better, at the cost of being useless for anything else. Future system performance depends less on a single faster transistor and more on choosing the right silicon for each piece of the workload — which is why a modern phone SoC contains a CPU complex, a GPU, an image-signal processor, a video codec block, and a neural engine, all on one die.

Standards

Every interface on this page has a canonical specification.

Going deeper

Branches that earn their own article.

  • Transistor physics (CMOS, FinFET).
  • Digital logic design (combinational, sequential, FSMs).
  • ISA deep dives (x86, ARM, RISC-V).
  • Pipelining, hazards, out-of-order execution.
  • Cache coherence protocols.
  • Branch prediction.
  • GPU architecture internals.
  • FPGA programming.
  • Chip fabrication process.