Instructions — Act 3 | Digital Systems

The CPU of Act II only runs instructions: numbered operations on registers and memory. People want to write code in something more humane. A programming language closes the gap. It does two jobs. First, it translates source text into instructions the machine can run. Second, it refuses certain operations — touching raw memory, skipping a type check, freeing your own allocations — and the refusal is what gives the language its character. Algorithms and data structures sit above all of this; they belong to the math, not the syntax.

Pick a path down the middle and a position on each axis at the bottom. That choice is roughly the personality of every language you have ever used.

What a program is

A source file is just bytes — the CPU has no idea what for or def or int main() mean. A translator reads that text and produces instructions the CPU does understand. Call it a compiler when its output runs later, an interpreter when its output runs immediately. Either way the work is the same: split source into tokens (lexing), build a tree (parsing), check that names and types are consistent (semantic analysis), and produce an intermediate representation (IR) that's easier to optimize and emit code from.

The IR is the choke point: every language-specific concept must be expressed in it, and every back end (x86, ARM, RISC-V, WebAssembly) consumes it. LLVM IR, GCC's GIMPLE, and JVM bytecode all play this role.

The IR decouples "what the language means" from "what the CPU runs". A front end lowers source to LLVM IR; a back end takes IR and emits x86 or ARM. Add a new language and every existing back end works for it; add a new CPU and every existing language runs on it. Linters and Language Servers reuse the same front end.

Between front end and back end sits the middle end: a sequence of optimization passes, each consuming IR and producing better IR. A pass might fold 1 + 2 to 3, delete a branch the compiler proves never runs, or inline a function's body at the call site. The same passes appear in every serious compiler — constant folding, dead-code elimination, inlining, loop transforms. Most of a release build's wall-clock goes here, not parsing. -O0 to -O2 typically buys 2-10x runtime speed on the same source.

The same IR shape is the input and output of every middle-end pass — that's why they compose. The back end picks instructions for one specific CPU and decides which values live in registers (fast, scarce) versus on the stack (slower, plentiful).

The back end's two hard jobs are instruction selection (which opcodes for which IR ops) and register allocation (which values live in registers, which spill to the stack). Registers are scarce — x86-64 has 16. Spilling in a hot loop can erase the wins from every other optimization, which is why -O2 matters.

Three paths to execution

The translator can produce instructions at three different times, and the choice shapes everything else about the language.

Ahead-of-time (AOT) compilation translates the whole program to native machine code before it runs. The output is a single binary the OS loads into memory. C, C++, Rust, and Go take this path. Compile time is a one-time cost; runtime overhead is roughly zero.

Interpretation keeps the source (or a parsed tree, or bytecode) and walks it every time. Bash, classic CPython, and Ruby take this path. The interpreter loop adds 10-100x overhead per operation, but startup is instant and code can be modified live.

Bytecode plus a VM sits between. The translator compiles once to a portable byte-stream (.class for Java, .pyc for Python); a runtime — the VM — interprets that bytecode wherever the program lands. Add a JIT ("just-in-time" compiler) to the VM and hot parts get re-compiled to native code at runtime, recovering most of the AOT speed. The JVM, .NET, V8, and modern Python all do this.

The boundaries blur in practice. Go AOT-compiles but ships a runtime for goroutines and GC. Python's "bytecode" lives in __pycache__/ and is interpreted. V8 has four JIT tiers. The shape of each path is durable; the lines between them aren't.

Linking: the AOT path's last step

A C program is rarely one source file. Each .c file compiles separately to an object file — machine code with holes where references to other files' functions go. The compiler has no idea where printf lives; it leaves a hole labelled printf and trusts someone to fill it.

The linker is that someone. It scans every object file and library, builds a global table of which file defines each symbol, then patches every hole with the right address. Two files defining the same symbol: link error. No file defining it: "undefined reference" — you forgot -lm or misspelled the function.

Worked example: linking main.o and foo.o into one binary

Two source files, compiled separately:

main.o contains the machine code for main(), which calls foo(). Its symbol table says: DEFINES main at offset 0x00; UNDEFINED foo (a hole at byte 0x12 inside main's code).
foo.o contains the machine code for foo(). Its symbol table says: DEFINES foo at offset 0x00; no undefined references.

The linker runs in three passes:

Scan defines. Walk every object file once and build a global table: { main → main.o:0x00, foo → foo.o:0x00 }. Two files defining the same name here would abort with "multiple definition".
Resolve undefs. For every undefined reference, look it up in the table. main.o needs foo → found in foo.o. Nothing left undefined; the program is closed. If anything were still missing, the linker would print undefined reference to 'foo' and stop.
Lay out and patch. Pick final addresses (say main at 0x401000, foo at 0x401080), then write those addresses into the holes (the byte at main.o:0x12 is rewritten from 00 00 00 00 to point at 0x401080). This last step is called relocation.

The output a.out now has zero holes. Every CALL instruction inside main jumps to a concrete address. Add a third file bar.o with another UNDEFINED foo and the linker resolves it to the same 0x401080 — that's how one definition serves the whole program.

Static linking copies the library's machine code into the binary. Dynamic linking leaves a stub and lets the OS loader patch in the address when the program starts (so one libc on disk serves every program on the machine).

Function calls and the stack

A running program lives mostly on its call stack — a region of memory that grows downward as functions call each other and shrinks back as they return. Each call pushes a stack frame: the return address, the caller's saved registers, and the callee's local variables.

Where do function arguments live? That's the calling convention — a contract between caller and callee. On x86-64 Linux/macOS, the first six integer arguments arrive in registers rdi, rsi, rdx, rcx, r8, r9; the rest go on the stack; the return value comes back in rax. Windows x64 picks different registers. The CPU doesn't care — these are conventions. Get them wrong and arguments arrive scrambled, which is exactly what happens when you forget extern "C" across a language boundary.

If a buffer overflow writes past the local array and over the saved return address, the function returns to wherever the attacker chose. This is "smashing the stack". Stack canaries — a random word placed between locals and the return address, checked on return — are the standard defence.

How a JIT works

A JIT recovers AOT speed without losing the bytecode-VM startup model. The runtime starts by interpreting bytecode — slow but instant. Each method gets a counter; once it passes a tier-up threshold (around 1,000 calls for a first-tier JIT, often 10,000 for an aggressive second tier), the JIT compiles that method to native code. Subsequent calls jump to the native version.

A JIT can be more aggressive than an AOT compiler because it sees actual runtime behaviour. If a call site has only ever seen int arguments, the JIT specializes for int and skips the type checks. If a branch has never been taken, the fast path omits it. These are bets. If a later call passes a string instead, the runtime de-optimizes: throws away the native code, falls back to the interpreter, and may re-compile later with the new profile.

Worked example: a hot function's life through a tiered JIT

Picture one method, dot(a, b), called inside a loop a million times. The runtime (think V8 or the JVM) carries a per-method counter and a profile of seen argument types. Threshold for tier 1 is 1,000 calls; tier 2 is 10,000.

Call #	What runs	Counter	Notes
1	Interpreter walks bytecode	1	first invocation; slow, no compile cost yet
2-999	Interpreter, recording types	999	every call: `a: float[]`, `b: float[]` — no surprises
1,000	Tier-up trigger. Baseline JIT compiles `dot` to native code. Specializes for `float[]` args; emits a guard `if (typeof(a) !== float[]) deopt`.	1,000	one-time compile latency (a few ms); next call jumps to native
1,001-9,999	Native code, with the guard checked each entry	9,999	maybe 20× faster than the interpreter
10,000	Tier-up again. Optimizing JIT re-compiles with inlining, loop unrolling, register allocation tuned to the observed shape.	10,000	another compile pause; afterwards ≈100× the interpreter
10,001-N	Heavily optimized native code	—	hottest path; bulk of the program's time spent here
Some later call	Caller passes a `string` instead	—	guard fails → deopt: throw away the native code, resume in the interpreter, eventually re-profile and re-compile

Two consequences worth noticing. Startup is slower than a pure interpreter because the first few thousand calls pay both the interpreter cost and the profile bookkeeping. Steady state is close to AOT because tier-2 native code knows things an AOT compiler can only guess at — like "this call site has only ever seen one type". Microbenchmarks that run for 100ms never escape tier 0 and report misleadingly slow numbers.

The first call to a function is slower under a JIT than under a pure interpreter (extra profiling). The thousandth call is far faster. Microbenchmarks that don't warm up the JIT report misleading numbers.

Four trade-offs every language makes

A language is defined as much by what it refuses as by what it allows. The refusals cluster on four axes. There's no neutral position — abstaining is itself a stance.

The axes are independent. Rust is "low-level" and "managed" because ownership is a compile-time form of management. The grid maps the design space; specific languages occupy specific cells.

Control vs safety

Hardware lets you do dangerous things — read uninitialized memory, treat an integer as a pointer, write off the end of an array. A language decides whether to expose that power.

C and assembly expose it. Cast an integer to a pointer and dereference — if the address isn't yours, the result is undefined behaviour: a segfault, a corrupted neighbour, a silently wrong answer, or a security exploit. Python and JavaScript hide it: no raw pointers, every collection access bounds-checked, every mistake a clean exception. Rust threads the needle — pointers exist but are tracked at compile time by ownership rules, and unsafe operations live in explicit unsafe blocks.

The trade is real. C lets you write a memcpy that beats glibc's. Python prevents you from writing a buffer overflow at all.

Speed vs expressiveness

Low-level code runs fast but is slow to write. High-level code is fast to write but slower at runtime. A language picks where on that line it wants to live.

Assembly is fastest but most painful — every function call is a stack manipulation you spelled out. C adds structs, control flow, and a standard library, no garbage collector. Java/Go/C# add managed memory, generics, richer abstractions; expressiveness rises and runtime cost rises with it. Python, Ruby, and Lisp keep climbing — comprehensions, metaclasses, dynamic dispatch on every method call — and pay another order of magnitude.

Rough rule: each step up the abstraction ladder buys you fewer lines per feature and costs you roughly an order of magnitude at runtime — Python is typically 10-50x slower than C on the same task. Rust sits near C on speed but far up on expressiveness; that's its claim.

Static vs dynamic typing

A program might call .length on an integer. When does the language tell you?

Static typing checks types before the program runs. The compiler refuses to produce a binary if a call doesn't make sense. Dynamic typing waits until the call happens and the actual value disagrees. Static catches the bug before any user sees it; dynamic lets you write the first version faster.

Gradual typing (Python's mypy, JS's TypeScript, Ruby's Sorbet) lets you slide a project leftward without rewriting it. The catch: gradual systems are unsound — types are only enforced where you've annotated, and the rest still fails at runtime.

Modern static checkers also narrow types as control flow refines them. A variable typed string | null becomes just string inside an if (x !== null) branch — the compiler proves the other case is gone. TypeScript, Kotlin, Swift, and Rust's match all do this. The technique is called flow-sensitive typing, and it's what makes "explicit nullability" tolerable instead of cast-heavy.

Manual vs managed memory

Every program allocates memory, and that memory has to get freed when nobody needs it. Who does the freeing?

Manual management: you do it. Every malloc needs a matching free, and the language doesn't check. Get it wrong and you produce one of three classic bugs: a leak (forgot to free, memory grows forever), a use-after-free (freed too early, the pointer now points at someone else's memory), or a double-free (freed twice, the allocator's bookkeeping corrupts). These three account for a large fraction of the CVEs ever filed against C and C++ programs.

Garbage collection (GC): the runtime does it. A tracing collector walks the live heap from a set of roots (CPU registers, the call stack, globals), marks every reachable object, and frees the rest. Java, Go, Python, and JavaScript all use GC. The cost is a pause while the collector runs. Modern concurrent collectors (Go's, Java's ZGC and Shenandoah) keep the stop-the-world portion in the single-digit milliseconds even on multi-hundred-gigabyte heaps, but it's never zero.

Ownership: the compiler does it. Rust's invention. Every value has exactly one owner; the owner gets dropped at the end of its scope; borrows are checked at compile time. No runtime cost, no leak class, no GC pause — paid for in compile-time effort and in code that won't compile until the lifetimes line up.

Java's ZGC and Shenandoah collectors target sub-millisecond pauses. Rust achieves the same goal with no pause but rejects programs whose lifetimes don't typecheck. They solve the same problem at different points in the lifecycle.

The simplest tracing GC is mark-and-sweep. Phase one: from the roots, follow every pointer and mark each reachable object. Phase two: walk the heap and free anything unmarked. Cost is proportional to live data on mark, total heap on sweep. Anything unreachable is, by definition, garbage.

The mark phase is graph traversal — BFS or DFS, doesn't matter which. The sweep phase is a linear walk of the heap. Compacting collectors add a third phase that moves live objects together to defeat fragmentation.

The decisive optimization on top of mark-and-sweep is the generational hypothesis: most objects die young. Across decades of measurements on real workloads, the large majority of allocations become unreachable shortly after they're made — often before the next allocation of comparable size. A generational collector splits the heap into a small young generation (collected often, cheaply) and a large old generation (collected rarely). Survivors of a few young-gen collections get promoted to the old gen. The young-gen pause stays small because the young gen does.

HotSpot's young gen is typically 5-10% of total heap. Modern G1 and ZGC use regions instead of strict generations but apply the same heuristic: short-lived objects collected cheaply and often.

The other GC family is reference counting (RC). Every object carries a count of references to it; the count goes up on copy, down on drop; the object frees itself when the count hits zero. No tracing, no pauses — freeing happens exactly when the last reference dies. Cycles are the failure mode: two objects pointing at each other keep each other alive even when nothing outside the cycle does, and neither count reaches zero.

weak ref to A. Weak refs don't bump the count.When user drops A, A.rc → 0 · A frees · A's strong ref to B drops · B.rc → 0 · B frees.

Languages built on RC (Swift, Python, C++ shared_ptr) ship a weak reference type that doesn't bump the count. Python also runs a periodic tracing collector to clean cycles the programmer forgot to break.

Paradigms

A paradigm is a default answer to "how should this code be organized?" Four answers dominate. Imperative programs read as sequences of commands that mutate state. Object-oriented programs model the world as things that hold state and respond to messages. Functional programs are transformations over immutable values. Declarative programs describe the desired result and let the runtime figure out how. Modern languages mix all four; the choice is per-module, not per-language.

Same answer, four shapes of thinking. A modern program might use all four within one file: imperative IO, OO domain model, functional transformations, embedded SQL for queries.

Two implementation tricks make these paradigms cheap at runtime. The first is the closure: a function value that carries the variables it captured from the enclosing scope, stored on the heap so they outlive the original stack frame. A callback often needs the local variables of the function that registered it, but by the time the callback runs that function has returned. The closure makes those variables stay.

no state"upvalues" in Lua, "free variables" elsewherereturna function value is two pointers, not one · the second is what makes "stateful callback" a language feature

C does not have closures, which is why C callbacks ship a void* user_data alongside every function pointer. C++ lambdas, Rust closures, and Go function literals all desugar to the same struct: code pointer plus captured fields.

The second trick is the vtable. In OO code, dog.speak() and cat.speak() run different functions even though the source looks identical — a lookup has to happen at runtime to pick the right one. The vtable does it in two memory loads. Each class has one vtable — an array of function pointers, one per method. Each instance starts with a header containing a pointer to its class's vtable. A virtual call becomes: load the vtable pointer from the object, load the right slot, call it. Branch predictors learn the common case within a few iterations.

Rust's dyn Trait is a "fat pointer" — data pointer plus vtable pointer — placing the vtable beside the reference rather than inside the object. Go's interfaces use the same shape.

Algorithms

An algorithm is a procedure whose cost can be reasoned about independently of any particular machine. The reasoning uses Big-O notation: write O(f(n)) to mean "for large enough n, this algorithm's work grows no faster than some constant times f(n)". Constants drop because real CPUs vary them too much to be portable claims.

What matters is the shape. O(1) is constant. O(log n) doubles input but adds one step. O(n) scales linearly. O(n log n) is what good sorts achieve. O(n²) is fine on small inputs and painful past a few thousand. O(2ⁿ) is hopeless past about thirty.

At a million items, O(n) finishes in a millisecond and O(n²) takes about seventeen minutes. Picking the right complexity class is usually a 1000x speedup; tuning constants inside it is usually 2x.

Sorting

Take n items, return them in order. A theorem says comparison-based sorts can't beat O(n log n) on average — an information-theoretic floor, not an engineering limit. Non-comparison sorts (radix, counting) reach O(n) but only when keys have structure (fixed-width integers, bounded range).

Real standard libraries don't ship a textbook sort. They ship hybrids that switch algorithm based on input shape. Python's list.sort() is Timsort: merge sort that detects already-sorted runs. C++'s std::sort is introsort: quicksort that falls back to heapsort if recursion gets too deep, then to insertion sort on small ranges. Rust uses pdqsort.

"Stable" means equal keys keep their original order — important when you sort by multiple keys in turn. Radix sort's k is the key width in digits; for 32-bit integers k is 4, so it's effectively linear.

Graph traversal

Given a graph (nodes connected by edges), visit every node reachable from a starting point.

Breadth-first search (BFS) uses a queue and visits nodes in order of distance from the start: root, then neighbours, then their neighbours. Best for shortest-path-by-edge-count. Depth-first search (DFS) uses a stack (or recursion, which is a stack) and goes as deep as possible before backtracking. Best for cycle detection and topological ordering. Both run in O(V + E). The only structural difference is queue vs stack.

Most graph problems reduce to one of these with a small twist. Dijkstra's shortest-path is BFS with a priority queue. A* is Dijkstra plus a heuristic. The numbers show visit order; both orderings reach all seven nodes.

Recursion vs iteration

Recursion and iteration compute the same answer with different memory shapes. A recursive factorial(n) allocates n stack frames before any multiplication happens. An iterative version uses a single accumulator and constant stack. The recursive form is often clearer; it also overflows the stack on large n in any language without tail-call optimization (TCO).

TCO is a compiler trick. If the recursive call is in tail position — its return value is the caller's return value with no further work — the compiler reuses the same stack frame instead of pushing a new one. Recursion compiles to JMP, not CALL. Scheme and OCaml require it; Rust and the JVM don't guarantee it; C compilers apply it opportunistically.

last thing the function does · TCO turns the CALL into a JMP

The naive recursive form does an extra multiply after the recursive call returns — that's why TCO can't apply. Rewriting with an accumulator (right column) puts the recursive call in tail position and unlocks the optimization.

Data structures

An algorithm is a procedure; a data structure is the shape it operates on. The right shape often makes an algorithm trivial; the wrong one makes the same algorithm infeasible. Four staples cover most needs.

Arrays (contiguous memory, indexed by integer): O(1) random access, expensive insertion in the middle. Hash maps (key-to-value via a hash function): O(1) average lookup and insertion, no order. Balanced trees (especially B-trees): O(log n) lookup with sortedness preserved — the basis for ordered maps and database indexes. Queues and stacks (FIFO / LIFO): what BFS and DFS use, plus task scheduling and undo stacks.

Big-O is the right floor for the conversation, but locality (see Act II) decides the wall-clock answer. A linear scan of a 1 KB array beats a tree lookup that misses cache, even though the array is "O(n) worst" and the tree is "O(log n)".

Inside the hash map

The hash map deserves a closer look because most programs use one. Recipe: hash the key into an integer, take that modulo the bucket-array size to pick a slot, store the key-value pair there.

Two keys can hash to the same slot — a collision, unavoidable past a certain density (the birthday paradox makes the first one likely after about √N insertions). Two strategies handle them. Chaining stores a small list per bucket. Open addressing probes neighbouring slots until it finds an empty one.

Past a load factor of about 0.75 (entries / slots), both strategies trigger a rehash: allocate a bigger bucket array, re-insert everything, throw the old one away. That single rehash is O(n), breaking the "O(1) per insert" claim — but amortized across all the cheap inserts before it, the average is still O(1).

Worked example: a collision in bucket 7, then a rehash

Start with a map of 8 slots (N = 8) using chaining. The hash of each key, mod 8, decides its slot:

Insert "Alice" → 1. hash("Alice") % 8 = 3. Slot 3 is empty; drop the pair in. Entries: 1, load factor 1/8 = 0.125.
Insert "Bob" → 2. hash("Bob") % 8 = 7. Slot 7 is empty; drop it in. Load factor 0.25.
Insert "Carol" → 3. hash("Carol") % 8 = 7. Collision — slot 7 already holds Bob. Chaining strategy: append Carol to slot 7's mini-list. Slot 7 is now a 2-item chain [Bob, Carol]. Lookups for Bob and Carol both go to slot 7, then walk the short list comparing keys. Load factor 0.375.
Insert "Dave" → 4. hash("Dave") % 8 = 7. Another collision in slot 7; chain becomes [Bob, Carol, Dave]. Load factor 0.5. Still under threshold.
Inserts continue until entries = 6. Load factor = 6/8 = 0.75 — threshold hit.
Rehash. Allocate a new array of 16 slots. Walk the old array, re-hash every key against the new size (hash(key) % 16), and place it in the new array. Bob, Carol, and Dave very likely scatter to different slots now because % 16 distributes them more finely than % 8 did. Free the old array.
Insert #7 onward runs against the bigger array. Load factor resets to 7/16 ≈ 0.44; collisions become rare again. The next rehash waits until entries = 12.

That one rehash at step 6 touched all 6 old entries — O(n) work. But the 5 cheap inserts before it, plus the cheap inserts that follow until the next rehash, average out: across many operations the cost per insert is O(1) amortized. With open addressing instead of chaining, step 3 would probe slot 8, 9, 10… until it found a free one — same load-factor logic, no extra list.

"Hash flooding" is a real attack: pick keys that all hash to the same bucket and a server's per-request hash map degrades to O(n) per insert. Modern languages (Python, Rust, Go) randomize the hash seed at process start to defeat it.

The pattern recurs everywhere. Compilers use hash maps for symbol tables, balanced trees for type lookup, queues for parser lookahead. Databases (Act VI) use B-trees for indexes, LSM trees for write-heavy workloads. Networking (Act V) uses queues for packet buffering and hash tables for flow tracking. The right shape isn't an algorithmic question; it's a question about which operation dominates.

Standards

Every language and algorithm cited on this page has a canonical reference.

C — ISO/IEC 9899 (current: C23). Defines the abstract machine, undefined behaviour, and the standard library every C compiler targets.
C++ — ISO/IEC 14882 (current: C++23).
JavaScript — ECMA-262 (annual editions). The host integration with browsers is split across WHATWG specs (HTML, DOM, Fetch).
JSON — ECMA-404 / RFC 8259 (handed forward from Act I).
Python — The Python Language Reference and the PEP series. The language is single-vendor but the spec is public; CPython is the reference implementation.
Rust — The Rust Reference. Single implementation, but the language and borrow checker are specified.
Go — The Go Programming Language Specification.
Java — The Java® Language Specification and The Java® Virtual Machine Specification (Oracle, JSR 924). The JVM bytecode spec is what every alternative JVM language (Kotlin, Scala, Clojure) targets.
POSIX — IEEE 1003.1 (the runtime contract every Unix-like OS exposes to programs; foreshadows Act IV).
SQL — ISO/IEC 9075 (foreshadows Act VI; programs embed it everywhere).
Algorithmic complexity — Bachmann–Landau notation (1894 / 1909); modern usage codified in Knuth, The Art of Computer Programming, vols. 1–4. No ISO standard, but the notation is universal.
Floating-point — IEEE 754-2019 (handed forward from Act I & Act II; affects every language's numeric semantics).