Machines — Act 2 | Digital Systems

A computer is one trick repeated at six scales. A switch — useless alone — composes into a gate that does logic. Gates compose into an arithmetic unit and a small bag of named storage. Add a clock and a way to read instructions in order, and you have a CPU core. Cores need a pyramid of memory beneath them, because raw RAM is too slow to feed them. The clock that drives all of this stopped getting faster around 2005, so chips grow sideways now: more cores, then specialized accelerators that abandon the general-purpose contract entirely.

Each layer hides the one below it. When that hiding fails — Spectre and Meltdown are the canonical examples — a lower layer leaks through and breaks the abstraction the layer above relied on.

The transistor

Problem. Computing requires a physical thing that can hold a bit and flip it on command, billions of times per second, without falling apart.

Solution. A transistor is a switch operated by voltage instead of by hand. It has three terminals — a source, a drain, and a gate. Voltage on the gate decides whether current flows between source and drain: gate high closes the switch, gate low opens it (in an n-channel MOSFET; the p-channel transistor is its complement). That is the entire substrate-level mechanism on which everything else in this page rests.

Why it works. Two properties matter. A transistor has only two stable output states, on or off — not a continuum — so a noise margin separates them and bits can pass through millions of stages without degrading. And modern transistors are very small and very fast: a current-generation gate is about 20 nm long and switches in roughly a picosecond. A leading-edge die packs 100–200 billion of them in roughly the area of a thumbnail. Transistor counts have kept doubling on schedule even though clock frequencies stopped doubling decades ago.

The implementation has changed across decades (planar → FinFET → gate-all-around), but the abstraction is stable: voltage in, current allowed or denied. Every page above this one treats the transistor as the atomic unit.

That's the floor. From here, computers stop being electrical and start being mathematical.

From switches to logic

Problem. A single switch decides one thing. Computation needs combinations: "this and that", "this or that", "not this".

Solution. Wire two transistors in series and the output is high only when both inputs are high — that's an AND gate. Wire them in parallel and you get an OR. Add an inverter and you get NAND and NOR. The four common gates (AND, OR, NOT, XOR) are shorthands for the patterns that appear most often in real designs.

NAND alone — or NOR alone — is functionally complete: every Boolean expression can be built from one repeated gate. That's the leap from physics to math. A truth table is the contract a gate keeps; the gate symbol is just shorthand.

CMOS implements each gate with a small, fixed number of transistors — a 2-input NAND uses four. Multiply by billions and you have a chip.

The first non-obvious circuit built from these gates is a half-adder: two inputs, two outputs. The sum is A XOR B (1 when exactly one input is high); the carry is A AND B (1 only when both are). Two of these plus an OR gate become a full-adder, which also accepts a carry-in from a previous bit. Chain n full-adders together and you can add two n-bit numbers. That's the entire arithmetic story from Act I, now built from gates.

Two gates suffice for one bit of arithmetic. The naive ripple-carry chain has latency proportional to width; production designs use carry-lookahead to do 64-bit addition in roughly log₂(64) gate-delays instead of 64.

Pitfall — the gates so far are memoryless. Feed the same inputs to an adder and you get the same output, every time, regardless of what came before. That's a combinational circuit. To run a program, we need circuits that remember — outputs that depend on inputs and on history.

The fix. A latch or flip-flop holds a bit between clock ticks. The standard cell is the D flip-flop: one data input D, one clock input, one output Q. On the rising edge of the clock — and only then — Q snaps to whatever D held at that instant; between edges, Q stays put. A 64-bit register is sixty-four D flip-flops in parallel on a shared clock. The boundary between "logic" and "memory" disappears here: a register is just a row of flip-flops.

Changes on D between rising edges are ignored — that property is what makes synchronous design tractable. Stack a few flip-flops with combinational logic and you have a state machine; every counter, controller, and program counter in a chip is built this way.

Arithmetic and memory together, sequenced by a clock, give the next layer: a CPU.

From logic to processor

A pile of gates is not a computer. To run a program, the chip needs to read instructions in some order, perform the operation each instruction names, and store the result somewhere it can read again. A CPU core is the smallest assembly that earns that name. It bundles four pieces:

Registers — a small bank of named storage (typically 16–32 general-purpose registers, each 32 or 64 bits wide). The fastest memory in the system; the ALU reads and writes them in one cycle.
ALU (arithmetic-logic unit) — does the math. Built from the adders and gates above, plus a multiplier, shifter, and comparator.
Control unit — sequences operations. Reads the next instruction, decodes which operation it names, and pulls the right wires to make the ALU and registers do that thing.
Buses — the wires that move bits between these pieces and out to memory.

A real core adds a program counter, instruction queue, branch predictor, and load/store unit. The four-piece skeleton is what undergraduate textbooks ship — and what every microcontroller still looks like.

The core runs an endless fetch-decode-execute loop. Fetch the instruction at the address held in the program counter. Decode the bits to know which operation, which registers, which immediate values. Execute on the ALU, or read or write memory. Write the result back to a register and advance the program counter. Repeat, billions of times per second. Modern superscalar cores fetch and retire 4–8 instructions per cycle and decorate this loop with pipelines, predictors, and speculative execution — but the contract the programmer sees is still that loop, in order.

Pipelining

A naive core spends each cycle on one stage of one instruction — fetching, then decoding, then executing, then writing back — leaving most of the chip idle most of the time. A pipelined core overlaps the stages so multiple instructions are in flight at once, each at a different stage. The classic textbook RISC pipeline has five stages — IF (fetch), ID (decode), EX (execute), MEM (memory access), WB (writeback). Once the pipeline is full, five instructions are in flight at once and one instruction retires per cycle, even though each individual instruction still takes five cycles end to end. Same gates, rearranged, five times the throughput at the same clock.

The throughput win requires the next instruction to be ready every cycle. Branches and data dependencies break that assumption — which is what the rest of this section is about.

Pitfall — data hazards. Two consecutive instructions can collide. If i2 reads a register that i1 is still computing, i2 must wait until i1 writes the result back. The naive fix is to stall the pipeline: insert a "bubble" until the value is in the register file, costing cycles for nothing.

The real fix — forwarding. Route i1's ALU result directly from the EX-stage output back to i2's EX-stage input, before the writeback ever happens. The two instructions retire without a stall. Modern cores forward across many paths, which is part of why their datapaths are so dense.

Data hazards have a clean fix; branches do not. The IF stage doesn't know which instruction to fetch next until a conditional branch executes — the topic of the next subsection.

Branch prediction and speculation

Problem. A pipelined core has to fetch the next instruction every cycle. After a conditional branch — if (x > 0) goto L1 else goto L2 — the next instruction lives at one of two addresses, and the core won't know which until the branch executes a few cycles later. Stalling until then would empty the pipeline.

Solution. A branch predictor guesses which way each branch will go and starts speculatively fetching from the predicted target. If the guess is right, no cycles are lost. If the guess is wrong, the speculatively fetched work is thrown out and the pipeline restarts from the correct target — typically a 15–20 cycle penalty on a deep modern pipeline.

Mechanism. The simplest predictor is a 2-bit saturating counter per branch: four states from "strongly not taken" through "strongly taken". A correct prediction nudges the counter further in that direction; a wrong prediction nudges it back. Two bits of history mean a single anomaly does not flip the prediction — it takes two consecutive surprises. Real predictors stack this with branch history and pattern tables and hit ≥95% accuracy on branchy code.

Speculation has a security cost. Mispredicted instructions get squashed before they commit, but they can still leave traces in the cache. That is the foothold Spectre exploits.

Worked example: one mispredict and what the core has to throw away

A hot loop runs the same branch ten times. The first nine iterations take it; the tenth exits. Suppose the predictor's counter for this branch starts at 11 (strongly taken) and the pipeline is five stages deep.

Cycles 1–9. Each iteration, the predictor says "taken" before the branch has executed. The fetch unit immediately starts pulling the loop-body instructions for iteration n+1 into IF, ID, EX behind the branch. By the time the branch resolves in EX, three or four follow-on instructions are already in flight — and they are the right ones. No cycles lost. Counter stays at 11.
Cycle 10. The branch is not taken this time (loop exit), but the predictor still says "taken". The fetch unit speculatively pulls in one more loop-body iteration. Four cycles later the branch resolves in EX and the comparator says: wrong.
The flush. Every instruction younger than the branch — the speculatively-fetched loop body, sitting in IF, ID, and EX — is squashed. Their register writes are discarded (the ROB never lets them commit). The program counter is reset to the loop-exit address and the pipeline starts refilling from there. That is the 15–20 cycle bubble the prose mentioned: the time to drain and refill on a deep modern pipeline.
The counter update. The 2-bit counter drops from 11 to 10 — still "taken", just less confidently. One surprise doesn't flip the prediction. If the loop ran again right away and exited at the same point, the counter would drop to 01 and now the predictor would say "not taken" for next time.

The cost of being wrong is the pipeline depth times the issue width — easily 50–100 squashed micro-ops on a wide core. The cost of being right is zero. That asymmetry is why even 95%-accurate prediction is worth the silicon.

Out-of-order execution

The natural endpoint of pipelining is to stop pretending that program order matches execution order. In an out-of-order core, instructions enter in program order but execute as soon as their operands are ready, on whichever execution unit happens to be free. A reorder buffer (ROB) holds the in-flight instructions; results are kept private until the ROB commits them in original program order. The programmer-visible state still matches the program — but underneath, 200+ instructions are in flight at once.

The ROB is what holds the abstraction together. Misbehaved instructions (bad branches, page faults) can be squashed before they commit, so the architectural state never sees them. The visible machine still executes one instruction at a time, in order.

The instruction set architecture

Hardware and software need a contract: which instructions exist, what bits encode them, and what each one does. The instruction set architecture (ISA) is that contract. x86-64 is the dominant desktop and server ISA — variable-length instructions (1–15 bytes), born at Intel, accumulated decades of extensions. Arm dominates mobile and now leads laptop performance-per-watt; fixed 32-bit instructions, designed to be simpler to decode. RISC-V is the open ISA with a free specification, increasingly used in microcontrollers and research silicon. Compilers translate source code to one of these. Everything above the ISA is portable; everything below is not.

Modern x86-64 cores translate variable-length CISC instructions into RISC-like internal micro-ops and execute those, so the visible difference at the instruction level often disappears at the gate level.

A core can run any program — but only if it can be fed. The next problem is that RAM is too slow.

Memory hierarchy

Problem. A modern core executes one or more instructions per nanosecond. A read from main memory takes about a hundred nanoseconds — long enough to lose hundreds of instructions waiting. If every memory access stalled the CPU, the chip would spend most of its time idle.

Solution. The memory hierarchy is a stack of progressively larger, slower, cheaper storage that hides the gap by keeping recently and likely-needed data close. Each level is roughly 5–30× larger than the one above and 3–10× slower.

Cache lines are typically 64 bytes — the smallest unit moved between levels. Reading one byte pulls in 64, because amortized cost wins over precision.

Why the pyramid works: locality

The hierarchy only pays off because real programs exhibit locality of reference. Temporal locality: an address used now is likely to be used again soon — loop counters, hot data structures, frequently-called functions. Spatial locality: addresses near a recently-used one are likely to be touched soon — sequential array scans, struct fields, instruction streams.

Caches exploit both. When the CPU misses a single byte, it doesn't fetch one byte; it fetches a 64-byte cache line, betting that the next bytes will be touched too. Code that respects locality (row-major iteration, packed structs, tight inner loops) often outruns "smarter" code that doesn't, by an order of magnitude.

Both kinds of locality are properties of programs, not of hardware. Code that has them runs faster on the same chip; code that doesn't pays the full memory latency. perf stat reports cache-miss rate; an L3 miss rate above a few percent is usually fixable.

Where to put a line: associativity

A cache also has to decide which slot holds each line. The trade-off is between fast lookup and avoiding conflicts.

Direct-mapped. Every memory address has exactly one slot it can occupy. Lookup is one tag compare — fast and cheap. But two hot addresses that map to the same slot evict each other forever.
Fully-associative. Any line can live in any slot. Zero conflict misses, but every lookup must compare against every tag — slow and power-hungry.
Set-associative. Each address maps to a set of N slots and the line can live in any of them. Real L1 caches are typically 8-way; L3 is often 16-way. The compromise wins everywhere.

Conflict misses are the lurking cost of direct-mapped caches. Two arrays whose start addresses differ by a multiple of the cache size will evict each other on every iteration of a loop that touches both — a footgun even good programmers hit.

Virtual addresses and the TLB

The same lookup question recurs one level up. The CPU sees virtual addresses; DRAM is indexed by physical addresses. Translating between them on every memory access would be ruinous, so the chip keeps a small, very-associative cache of recent translations: the TLB (translation lookaside buffer). A TLB hit costs one cycle and the access goes through. A miss triggers a page-table walk in DRAM that can take 100+ cycles, and a process switch may flush the entire TLB.

Huge pages (2 MB or 1 GB instead of 4 KB) make a measurable difference for big working sets — fewer translations, fewer misses, fewer walks.

Worked example: translating one 64-bit virtual address by hand

A page is 4 KB, so the bottom 12 bits of every address are the offset within the page (since 2^12 = 4096). x86-64 uses 48-bit virtual addresses (the top 16 bits are sign-extended and ignored). The remaining 36 bits index a 4-level page table — 9 bits per level, because each table is itself one 4 KB page holding 512 eight-byte entries (2^9 = 512).

Take the virtual address 0x00007F8E12345678 and split its low 48 bits into fields:

field	bits	value (binary)	value (decimal)
PML4 index	47–39	`011111111`	255
PDPT index	38–30	`000111000`	56
PD index	29–21	`111000001`	449
PT index	20–12	`001000110`	70
offset	11–0	`010001111000`	1144

A TLB miss kicks off the page-table walk:

PML4. The CPU reads the PML4 base register, fetches entry 255 from that page — one DRAM access. The entry holds the physical address of the next-level table.
PDPT. Fetch entry 56 from that table — another DRAM access. Returns the PD's physical address.
PD. Fetch entry 449 — another DRAM access. Returns the PT's physical address.
PT. Fetch entry 70 — one more DRAM access. This entry holds the physical frame number (PFN) for our page, plus permission bits.
Assemble. The physical address is (PFN << 12) | 0x678. The offset is copied through unchanged — same byte within the page, just a different page.

Four DRAM accesses per miss is why TLB misses hurt. It is also why huge pages help: a 2 MB page skips the last level entirely (one fewer DRAM access per walk), and one 2 MB TLB entry covers what would otherwise be 512 separate 4 KB entries.

Pitfall: cache coherence

When two cores cache the same memory line, their writes can disagree. Hardware enforces a coherence protocol — MESI is the classic — so any read sees the most recent write. Each cache line, in each core, lives in one of four states: Modified (dirty, only-copy), Exclusive (clean, only-copy), Shared (clean, possibly in other caches), Invalid. Transitions are driven by snoops on the interconnect. The protocol is invisible to software, but it has a cost.

If two cores write different bytes of the same 64-byte cache line, the line ping-pongs between their caches — false sharing. A "lock-free" loop can run 10× slower than the same loop on independent cache lines. This is one of the few places where the abstraction visibly leaks.

False sharing forces the line through M → I → M → I, ping-ponging across the bus. Padding hot per-thread variables to separate cache lines is one of the first fixes a profiler suggests.

Worked example: two cores, one cache line, the MESI states in order

Two cores share an interconnect. Both have private L1 caches. We watch one 64-byte line L as the cores touch it. Each row shows the state of the line in each core's cache after the operation.

step	operation	Core 0's L	Core 1's L	what happened
0	(initial)	I	I	nobody has it
1	Core 0 reads `L`	E	I	Core 0 misses, fetches from DRAM. Snoops show no other copy, so the line is Exclusive to Core 0 — clean, only-copy.
2	Core 0 writes `L`	M	I	Silent upgrade E → M. No bus traffic; the line is already private and clean. Now it's dirty.
3	Core 1 reads `L`	S	S	Core 1 misses and snoops. Core 0's cache responds with its dirty data (and writes back to DRAM), then both copies become Shared — clean, read-only, in two caches.
4	Core 1 writes `L`	I	M	Core 1 broadcasts an invalidate. Core 0's copy goes I. Core 1's copy becomes M. The next read on Core 0 will miss.
5	Core 0 reads `L`	S	S	Symmetric to step 3: Core 1's cache supplies the dirty data, writes back, both end up Shared again.

Where false sharing bites. Imagine Core 0 keeps writing byte 0 of L while Core 1 keeps writing byte 32. The bytes never overlap — there is no real data race — but MESI tracks state per line, not per byte. Each write triggers an invalidate of the other core's copy. The line ping-pongs through M (core 0) → I, M (core 1) → I, M (core 0) → I, … on every write, dragging cache-line traffic across the interconnect. The fix is to put the two bytes on different cache lines — pad the struct so each thread's hot field sits 64 bytes from the next.

The hierarchy and its protocols hide DRAM latency. The clock that drives all of this has its own constraint.

The clock and the power wall

The clock. Inside a CPU, the clock is a square-wave signal that synchronizes every flip-flop in the design. Each rising edge is one cycle: the moment when sequential logic latches its inputs and the next combinational stage starts computing. Clock frequency is how many cycles fit in a second — a 3 GHz processor ticks 3 × 10⁹ times per second, giving each cycle about 333 picoseconds, the time light takes to travel about 10 cm. The maximum frequency is set by the slowest combinational path between any two flip-flops; if a stage cannot finish in one cycle, the next latch grabs garbage.

Why it stopped climbing. For four decades, clock frequencies climbed in lockstep with transistor counts. That ended around 2005. Dynamic power in CMOS scales as roughly P ∝ C · V² · f — capacitance, voltage squared, frequency. Voltage couldn't keep falling because of leakage and reliability floors. Raising f with a fixed V meant cubic increases in power per transistor, and there is no way to dissipate that heat through 1 cm² of silicon. The industry called it the power wall. Moore's law continued; the partner law that kept power-per-transistor falling did not.

Once raising clock speed became a quadratic-in-power proposition, the industry pivoted to throwing more cores and more specialized blocks at the same die area.

More cores, then NUMA

What replaced "faster cores" was "more cores". A modern desktop CPU has 8–32 cores; a server CPU has 64–192. Cores share the last level of cache (L3) and the memory controller; they coordinate through cache coherence and a small set of synchronization primitives — atomic loads, stores, and compare-and-swap.

The catch: most software wasn't written to use them. Speeding up a single sequential thread by buying a bigger CPU stopped working twenty years ago, and the burden of finding parallelism moved up the stack — to compilers, runtimes, and ultimately to programmers.

Cache coherence and the memory bus become the bottleneck once you have more than a handful of cores. Server-class chips replace the bus with a ring or mesh interconnect; designs above 32 cores frequently chiplet-disaggregate to keep the wires short.

Once a system has more than one socket, memory itself becomes non-uniform. In a NUMA (non-uniform memory access) box, each socket has its own DRAM channels — local memory — and reaches the other socket's DRAM only through an interconnect. Local access is roughly 80 ns; remote is 120–150 ns and shares interconnect bandwidth with other traffic. Where threads run, and which DRAM their pages were allocated on, can change throughput by 1.5–2× on the same hardware. Linux's default first-touch policy allocates a page on the node of the thread that first touches it — which is why initialization order changes throughput on a server you haven't tuned.

Beyond the CPU

Problem. A CPU optimizes for single-threaded latency: each core does one (or a few) instruction per cycle, with deep pipelines and aggressive branch predictors that keep one execution stream moving. That is the wrong shape for problems where the same operation runs across millions of independent data points — graphics, deep learning, scientific simulation. Most of the silicon area on a CPU core is spent on machinery (predictors, ROBs, caches) that is irrelevant to those workloads.

Solution. Build chips with the opposite trade-off: many simple lanes, all running the same instruction in lockstep. SIMD (single-instruction, multiple-data) is that idea inside a CPU as a vector unit — typically 4–16 lanes (AVX-512, ARM SVE). SIMT (single-instruction, multiple-thread) is GPUs at scale: a single high-end GPU has roughly 10,000 ALU lanes grouped into "warps" of 32 that share a fetch unit. Throughput per watt is 5–20× higher than a CPU on workloads that fit the model.

Branchy, irregular code is a CPU workload. Regular tensor and pixel math is a GPU workload. A modern system runs both, with the OS or a runtime dispatching work to whichever fits.

Domain-specific accelerators

The next step beyond GPUs is domain-specific accelerators: chips that abandon programmability for one job done very efficiently. The clearest example is the systolic array at the heart of TPUs, NPUs, and modern matrix-multiply blocks. It is a 2-D grid of small multiply-accumulate cells. Matrix tiles flow through the grid one element per clock, and every cell does one multiply-add per cycle — without ever fetching an instruction. Google's TPU v4 includes a 128×128 systolic array, doing 16,384 multiply-accumulates per cycle when fed continuously. NVIDIA's Tensor Cores, Apple's Neural Engine, and the NPUs in modern phones all share this shape.

The cost of a systolic array is generality: the silicon does matmul (and very little else) very well. The benefit is one matmul per cycle when fed continuously, with no instruction-fetch overhead.

That trade-off is where the act lands. CPU cores run any program well enough. GPUs run regular dense math an order of magnitude better. Domain-specific accelerators run one shape of computation another order of magnitude better, at the cost of being useless for anything else. Future system performance depends less on a single faster transistor and more on choosing the right silicon for each piece of the workload — which is why a modern phone SoC contains a CPU complex, a GPU, an image-signal processor, a video codec block, and a neural engine, all on one die.

Standards

Every interface on this page has a canonical specification.

Floating-point arithmetic — IEEE 754-2019. Same standard from Act I; the hardware that obeys it lives here.
Hardware description — IEEE 1800 (SystemVerilog), IEEE 1076 (VHDL). The languages chips are designed in.
Test and debug — IEEE 1149.1 (JTAG). The boundary-scan interface every modern chip exposes.
Instruction set architectures — Intel® 64 and IA-32 Architectures Software Developer's Manual (Vols. 1–4); Arm Architecture Reference Manual for ARMv8-A / ARMv9-A; RISC-V Unprivileged ISA and RISC-V Privileged Architecture (riscv.org).
DRAM — JEDEC JESD79-4 (DDR4) and JESD79-5 (DDR5). Define the timings, signaling, and command set every memory controller targets.
GPU memory — JEDEC JESD235D (HBM3), JESD250C (GDDR6), and successors. The graphics-class memory specs.
Interconnect — PCI Express Base Specification (PCI-SIG); Compute Express Link Specification (CXL Consortium). The buses that link CPUs, GPUs, accelerators, and storage.
CMOS process nodes — IRDS roadmap (the successor to ITRS). Tracks what "5 nm" or "2 nm" actually means now that the number is no longer a physical gate length.