Machines
A computer is a clock ticking through a list of simple instructions, very fast.
A computer is one trick repeated at six scales. A switch — useless alone — composes into a gate that does logic. Gates compose into an arithmetic unit and a small bag of named storage. Add a clock and a way to read instructions in order, and you have a CPU core. Cores need a pyramid of memory beneath them, because raw RAM is too slow to feed them. The clock that drives all of this stopped getting faster around 2005, so chips grow sideways now: more cores, then specialized accelerators that abandon the general-purpose contract entirely.
The transistor
Problem. Computing requires a physical thing that can hold a bit and flip it on command, billions of times per second, without falling apart.
Solution. A transistor is a switch operated by voltage instead of by hand. It has three terminals — a source, a drain, and a gate. Voltage on the gate decides whether current flows between source and drain: gate high closes the switch, gate low opens it (in an n-channel MOSFET; the p-channel transistor is its complement). That is the entire substrate-level mechanism on which everything else in this page rests.
Why it works. Two properties matter. A transistor has only two stable output states, on or off — not a continuum — so a noise margin separates them and bits can pass through millions of stages without degrading. And modern transistors are very small and very fast: a current-generation gate is about 20 nm long and switches in roughly a picosecond. A leading-edge die packs 100–200 billion of them in roughly the area of a thumbnail. Transistor counts have kept doubling on schedule even though clock frequencies stopped doubling decades ago.
That's the floor. From here, computers stop being electrical and start being mathematical.
From switches to logic
Problem. A single switch decides one thing. Computation needs combinations: "this and that", "this or that", "not this".
Solution. Wire two transistors in series and the output is high only when both inputs are high — that's an AND gate. Wire them in parallel and you get an OR. Add an inverter and you get NAND and NOR. The four common gates (AND, OR, NOT, XOR) are shorthands for the patterns that appear most often in real designs.
NAND alone — or NOR alone — is functionally complete: every Boolean expression can be built from one repeated gate. That's the leap from physics to math. A truth table is the contract a gate keeps; the gate symbol is just shorthand.
The first non-obvious circuit built from these gates is a half-adder: two inputs, two outputs. The sum is A XOR B (1 when exactly one input is high); the carry is A AND B (1 only when both are). Two of these plus an OR gate become a full-adder, which also accepts a carry-in from a previous bit. Chain n full-adders together and you can add two n-bit numbers. That's the entire arithmetic story from Act I, now built from gates.
Pitfall — the gates so far are memoryless. Feed the same inputs to an adder and you get the same output, every time, regardless of what came before. That's a combinational circuit. To run a program, we need circuits that remember — outputs that depend on inputs and on history.
The fix. A latch or flip-flop holds a bit between clock ticks. The standard cell is the D flip-flop: one data input D, one clock input, one output Q. On the rising edge of the clock — and only then — Q snaps to whatever D held at that instant; between edges, Q stays put. A 64-bit register is sixty-four D flip-flops in parallel on a shared clock. The boundary between "logic" and "memory" disappears here: a register is just a row of flip-flops.
Arithmetic and memory together, sequenced by a clock, give the next layer: a CPU.
From logic to processor
A pile of gates is not a computer. To run a program, the chip needs to read instructions in some order, perform the operation each instruction names, and store the result somewhere it can read again. A CPU core is the smallest assembly that earns that name. It bundles four pieces:
- Registers — a small bank of named storage (typically 16–32 general-purpose registers, each 32 or 64 bits wide). The fastest memory in the system; the ALU reads and writes them in one cycle.
- ALU (arithmetic-logic unit) — does the math. Built from the adders and gates above, plus a multiplier, shifter, and comparator.
- Control unit — sequences operations. Reads the next instruction, decodes which operation it names, and pulls the right wires to make the ALU and registers do that thing.
- Buses — the wires that move bits between these pieces and out to memory.
The core runs an endless fetch-decode-execute loop. Fetch the instruction at the address held in the program counter. Decode the bits to know which operation, which registers, which immediate values. Execute on the ALU, or read or write memory. Write the result back to a register and advance the program counter. Repeat, billions of times per second. Modern superscalar cores fetch and retire 4–8 instructions per cycle and decorate this loop with pipelines, predictors, and speculative execution — but the contract the programmer sees is still that loop, in order.
Pipelining
A naive core spends each cycle on one stage of one instruction — fetching, then decoding, then executing, then writing back — leaving most of the chip idle most of the time. A pipelined core overlaps the stages so multiple instructions are in flight at once, each at a different stage. The classic textbook RISC pipeline has five stages — IF (fetch), ID (decode), EX (execute), MEM (memory access), WB (writeback). Once the pipeline is full, five instructions are in flight at once and one instruction retires per cycle, even though each individual instruction still takes five cycles end to end. Same gates, rearranged, five times the throughput at the same clock.
Pitfall — data hazards. Two consecutive instructions can collide. If i2 reads a register that i1 is still computing, i2 must wait until i1 writes the result back. The naive fix is to stall the pipeline: insert a "bubble" until the value is in the register file, costing cycles for nothing.
The real fix — forwarding. Route i1's ALU result directly from the EX-stage output back to i2's EX-stage input, before the writeback ever happens. The two instructions retire without a stall. Modern cores forward across many paths, which is part of why their datapaths are so dense.
Branch prediction and speculation
Problem. A pipelined core has to fetch the next instruction every cycle. After a conditional branch — if (x > 0) goto L1 else goto L2 — the next instruction lives at one of two addresses, and the core won't know which until the branch executes a few cycles later. Stalling until then would empty the pipeline.
Solution. A branch predictor guesses which way each branch will go and starts speculatively fetching from the predicted target. If the guess is right, no cycles are lost. If the guess is wrong, the speculatively fetched work is thrown out and the pipeline restarts from the correct target — typically a 15–20 cycle penalty on a deep modern pipeline.
Mechanism. The simplest predictor is a 2-bit saturating counter per branch: four states from "strongly not taken" through "strongly taken". A correct prediction nudges the counter further in that direction; a wrong prediction nudges it back. Two bits of history mean a single anomaly does not flip the prediction — it takes two consecutive surprises. Real predictors stack this with branch history and pattern tables and hit ≥95% accuracy on branchy code.
Worked example: one mispredict and what the core has to throw away
A hot loop runs the same branch ten times. The first nine iterations take it; the tenth exits. Suppose the predictor's counter for this branch starts at 11 (strongly taken) and the pipeline is five stages deep.
- Cycles 1–9. Each iteration, the predictor says "taken" before the branch has executed. The fetch unit immediately starts pulling the loop-body instructions for iteration n+1 into IF, ID, EX behind the branch. By the time the branch resolves in EX, three or four follow-on instructions are already in flight — and they are the right ones. No cycles lost. Counter stays at
11. - Cycle 10. The branch is not taken this time (loop exit), but the predictor still says "taken". The fetch unit speculatively pulls in one more loop-body iteration. Four cycles later the branch resolves in EX and the comparator says: wrong.
- The flush. Every instruction younger than the branch — the speculatively-fetched loop body, sitting in IF, ID, and EX — is squashed. Their register writes are discarded (the ROB never lets them commit). The program counter is reset to the loop-exit address and the pipeline starts refilling from there. That is the 15–20 cycle bubble the prose mentioned: the time to drain and refill on a deep modern pipeline.
- The counter update. The 2-bit counter drops from
11to10— still "taken", just less confidently. One surprise doesn't flip the prediction. If the loop ran again right away and exited at the same point, the counter would drop to01and now the predictor would say "not taken" for next time.
The cost of being wrong is the pipeline depth times the issue width — easily 50–100 squashed micro-ops on a wide core. The cost of being right is zero. That asymmetry is why even 95%-accurate prediction is worth the silicon.
Out-of-order execution
The natural endpoint of pipelining is to stop pretending that program order matches execution order. In an out-of-order core, instructions enter in program order but execute as soon as their operands are ready, on whichever execution unit happens to be free. A reorder buffer (ROB) holds the in-flight instructions; results are kept private until the ROB commits them in original program order. The programmer-visible state still matches the program — but underneath, 200+ instructions are in flight at once.
The ROB is what holds the abstraction together. Misbehaved instructions (bad branches, page faults) can be squashed before they commit, so the architectural state never sees them. The visible machine still executes one instruction at a time, in order.
The instruction set architecture
Hardware and software need a contract: which instructions exist, what bits encode them, and what each one does. The instruction set architecture (ISA) is that contract. x86-64 is the dominant desktop and server ISA — variable-length instructions (1–15 bytes), born at Intel, accumulated decades of extensions. Arm dominates mobile and now leads laptop performance-per-watt; fixed 32-bit instructions, designed to be simpler to decode. RISC-V is the open ISA with a free specification, increasingly used in microcontrollers and research silicon. Compilers translate source code to one of these. Everything above the ISA is portable; everything below is not.
A core can run any program — but only if it can be fed. The next problem is that RAM is too slow.
Memory hierarchy
Problem. A modern core executes one or more instructions per nanosecond. A read from main memory takes about a hundred nanoseconds — long enough to lose hundreds of instructions waiting. If every memory access stalled the CPU, the chip would spend most of its time idle.
Solution. The memory hierarchy is a stack of progressively larger, slower, cheaper storage that hides the gap by keeping recently and likely-needed data close. Each level is roughly 5–30× larger than the one above and 3–10× slower.
Why the pyramid works: locality
The hierarchy only pays off because real programs exhibit locality of reference. Temporal locality: an address used now is likely to be used again soon — loop counters, hot data structures, frequently-called functions. Spatial locality: addresses near a recently-used one are likely to be touched soon — sequential array scans, struct fields, instruction streams.
Caches exploit both. When the CPU misses a single byte, it doesn't fetch one byte; it fetches a 64-byte cache line, betting that the next bytes will be touched too. Code that respects locality (row-major iteration, packed structs, tight inner loops) often outruns "smarter" code that doesn't, by an order of magnitude.
perf stat reports cache-miss rate; an L3 miss rate above a few percent is usually fixable.Where to put a line: associativity
A cache also has to decide which slot holds each line. The trade-off is between fast lookup and avoiding conflicts.
- Direct-mapped. Every memory address has exactly one slot it can occupy. Lookup is one tag compare — fast and cheap. But two hot addresses that map to the same slot evict each other forever.
- Fully-associative. Any line can live in any slot. Zero conflict misses, but every lookup must compare against every tag — slow and power-hungry.
- Set-associative. Each address maps to a set of N slots and the line can live in any of them. Real L1 caches are typically 8-way; L3 is often 16-way. The compromise wins everywhere.
Virtual addresses and the TLB
The same lookup question recurs one level up. The CPU sees virtual addresses; DRAM is indexed by physical addresses. Translating between them on every memory access would be ruinous, so the chip keeps a small, very-associative cache of recent translations: the TLB (translation lookaside buffer). A TLB hit costs one cycle and the access goes through. A miss triggers a page-table walk in DRAM that can take 100+ cycles, and a process switch may flush the entire TLB.
Worked example: translating one 64-bit virtual address by hand
A page is 4 KB, so the bottom 12 bits of every address are the offset within the page (since 2^12 = 4096). x86-64 uses 48-bit virtual addresses (the top 16 bits are sign-extended and ignored). The remaining 36 bits index a 4-level page table — 9 bits per level, because each table is itself one 4 KB page holding 512 eight-byte entries (2^9 = 512).
Take the virtual address 0x00007F8E12345678 and split its low 48 bits into fields:
| field | bits | value (binary) | value (decimal) |
|---|---|---|---|
| PML4 index | 47–39 | 011111111 | 255 |
| PDPT index | 38–30 | 000111000 | 56 |
| PD index | 29–21 | 111000001 | 449 |
| PT index | 20–12 | 001000110 | 70 |
| offset | 11–0 | 010001111000 | 1144 |
A TLB miss kicks off the page-table walk:
- PML4. The CPU reads the PML4 base register, fetches entry 255 from that page — one DRAM access. The entry holds the physical address of the next-level table.
- PDPT. Fetch entry 56 from that table — another DRAM access. Returns the PD's physical address.
- PD. Fetch entry 449 — another DRAM access. Returns the PT's physical address.
- PT. Fetch entry 70 — one more DRAM access. This entry holds the physical frame number (PFN) for our page, plus permission bits.
- Assemble. The physical address is
(PFN << 12) | 0x678. The offset is copied through unchanged — same byte within the page, just a different page.
Four DRAM accesses per miss is why TLB misses hurt. It is also why huge pages help: a 2 MB page skips the last level entirely (one fewer DRAM access per walk), and one 2 MB TLB entry covers what would otherwise be 512 separate 4 KB entries.
Pitfall: cache coherence
When two cores cache the same memory line, their writes can disagree. Hardware enforces a coherence protocol — MESI is the classic — so any read sees the most recent write. Each cache line, in each core, lives in one of four states: Modified (dirty, only-copy), Exclusive (clean, only-copy), Shared (clean, possibly in other caches), Invalid. Transitions are driven by snoops on the interconnect. The protocol is invisible to software, but it has a cost.
If two cores write different bytes of the same 64-byte cache line, the line ping-pongs between their caches — false sharing. A "lock-free" loop can run 10× slower than the same loop on independent cache lines. This is one of the few places where the abstraction visibly leaks.
Worked example: two cores, one cache line, the MESI states in order
Two cores share an interconnect. Both have private L1 caches. We watch one 64-byte line L as the cores touch it. Each row shows the state of the line in each core's cache after the operation.
| step | operation | Core 0's L | Core 1's L | what happened |
|---|---|---|---|---|
| 0 | (initial) | I | I | nobody has it |
| 1 | Core 0 reads L | E | I | Core 0 misses, fetches from DRAM. Snoops show no other copy, so the line is Exclusive to Core 0 — clean, only-copy. |
| 2 | Core 0 writes L | M | I | Silent upgrade E → M. No bus traffic; the line is already private and clean. Now it's dirty. |
| 3 | Core 1 reads L | S | S | Core 1 misses and snoops. Core 0's cache responds with its dirty data (and writes back to DRAM), then both copies become Shared — clean, read-only, in two caches. |
| 4 | Core 1 writes L | I | M | Core 1 broadcasts an invalidate. Core 0's copy goes I. Core 1's copy becomes M. The next read on Core 0 will miss. |
| 5 | Core 0 reads L | S | S | Symmetric to step 3: Core 1's cache supplies the dirty data, writes back, both end up Shared again. |
Where false sharing bites. Imagine Core 0 keeps writing byte 0 of L while Core 1 keeps writing byte 32. The bytes never overlap — there is no real data race — but MESI tracks state per line, not per byte. Each write triggers an invalidate of the other core's copy. The line ping-pongs through M (core 0) → I, M (core 1) → I, M (core 0) → I, … on every write, dragging cache-line traffic across the interconnect. The fix is to put the two bytes on different cache lines — pad the struct so each thread's hot field sits 64 bytes from the next.
The hierarchy and its protocols hide DRAM latency. The clock that drives all of this has its own constraint.
The clock and the power wall
The clock. Inside a CPU, the clock is a square-wave signal that synchronizes every flip-flop in the design. Each rising edge is one cycle: the moment when sequential logic latches its inputs and the next combinational stage starts computing. Clock frequency is how many cycles fit in a second — a 3 GHz processor ticks 3 × 10⁹ times per second, giving each cycle about 333 picoseconds, the time light takes to travel about 10 cm. The maximum frequency is set by the slowest combinational path between any two flip-flops; if a stage cannot finish in one cycle, the next latch grabs garbage.
Why it stopped climbing. For four decades, clock frequencies climbed in lockstep with transistor counts. That ended around 2005. Dynamic power in CMOS scales as roughly P ∝ C · V² · f — capacitance, voltage squared, frequency. Voltage couldn't keep falling because of leakage and reliability floors. Raising f with a fixed V meant cubic increases in power per transistor, and there is no way to dissipate that heat through 1 cm² of silicon. The industry called it the power wall. Moore's law continued; the partner law that kept power-per-transistor falling did not.
More cores, then NUMA
What replaced "faster cores" was "more cores". A modern desktop CPU has 8–32 cores; a server CPU has 64–192. Cores share the last level of cache (L3) and the memory controller; they coordinate through cache coherence and a small set of synchronization primitives — atomic loads, stores, and compare-and-swap.
The catch: most software wasn't written to use them. Speeding up a single sequential thread by buying a bigger CPU stopped working twenty years ago, and the burden of finding parallelism moved up the stack — to compilers, runtimes, and ultimately to programmers.
Once a system has more than one socket, memory itself becomes non-uniform. In a NUMA (non-uniform memory access) box, each socket has its own DRAM channels — local memory — and reaches the other socket's DRAM only through an interconnect. Local access is roughly 80 ns; remote is 120–150 ns and shares interconnect bandwidth with other traffic. Where threads run, and which DRAM their pages were allocated on, can change throughput by 1.5–2× on the same hardware. Linux's default first-touch policy allocates a page on the node of the thread that first touches it — which is why initialization order changes throughput on a server you haven't tuned.
Beyond the CPU
Problem. A CPU optimizes for single-threaded latency: each core does one (or a few) instruction per cycle, with deep pipelines and aggressive branch predictors that keep one execution stream moving. That is the wrong shape for problems where the same operation runs across millions of independent data points — graphics, deep learning, scientific simulation. Most of the silicon area on a CPU core is spent on machinery (predictors, ROBs, caches) that is irrelevant to those workloads.
Solution. Build chips with the opposite trade-off: many simple lanes, all running the same instruction in lockstep. SIMD (single-instruction, multiple-data) is that idea inside a CPU as a vector unit — typically 4–16 lanes (AVX-512, ARM SVE). SIMT (single-instruction, multiple-thread) is GPUs at scale: a single high-end GPU has roughly 10,000 ALU lanes grouped into "warps" of 32 that share a fetch unit. Throughput per watt is 5–20× higher than a CPU on workloads that fit the model.
Domain-specific accelerators
The next step beyond GPUs is domain-specific accelerators: chips that abandon programmability for one job done very efficiently. The clearest example is the systolic array at the heart of TPUs, NPUs, and modern matrix-multiply blocks. It is a 2-D grid of small multiply-accumulate cells. Matrix tiles flow through the grid one element per clock, and every cell does one multiply-add per cycle — without ever fetching an instruction. Google's TPU v4 includes a 128×128 systolic array, doing 16,384 multiply-accumulates per cycle when fed continuously. NVIDIA's Tensor Cores, Apple's Neural Engine, and the NPUs in modern phones all share this shape.
That trade-off is where the act lands. CPU cores run any program well enough. GPUs run regular dense math an order of magnitude better. Domain-specific accelerators run one shape of computation another order of magnitude better, at the cost of being useless for anything else. Future system performance depends less on a single faster transistor and more on choosing the right silicon for each piece of the workload — which is why a modern phone SoC contains a CPU complex, a GPU, an image-signal processor, a video codec block, and a neural engine, all on one die.
Standards
Every interface on this page has a canonical specification.
- Floating-point arithmetic — IEEE 754-2019. Same standard from Act I; the hardware that obeys it lives here.
- Hardware description — IEEE 1800 (SystemVerilog), IEEE 1076 (VHDL). The languages chips are designed in.
- Test and debug — IEEE 1149.1 (JTAG). The boundary-scan interface every modern chip exposes.
- Instruction set architectures — Intel® 64 and IA-32 Architectures Software Developer's Manual (Vols. 1–4); Arm Architecture Reference Manual for ARMv8-A / ARMv9-A; RISC-V Unprivileged ISA and RISC-V Privileged Architecture (riscv.org).
- DRAM — JEDEC JESD79-4 (DDR4) and JESD79-5 (DDR5). Define the timings, signaling, and command set every memory controller targets.
- GPU memory — JEDEC JESD235D (HBM3), JESD250C (GDDR6), and successors. The graphics-class memory specs.
- Interconnect — PCI Express Base Specification (PCI-SIG); Compute Express Link Specification (CXL Consortium). The buses that link CPUs, GPUs, accelerators, and storage.
- CMOS process nodes — IRDS roadmap (the successor to ITRS). Tracks what "5 nm" or "2 nm" actually means now that the number is no longer a physical gate length.
Branches that earn their own article.
- Transistor physics (CMOS, FinFET).
- Digital logic design (combinational, sequential, FSMs).
- ISA deep dives (x86, ARM, RISC-V).
- Pipelining, hazards, out-of-order execution.
- Cache coherence protocols.
- Branch prediction.
- GPU architecture internals.
- FPGA programming.
- Chip fabrication process.