Act III of X

Instructions

A programming language is a set of constraints you choose, and those constraints shape what's easy and what's hard.

The CPU of Act II only runs instructions: numbered operations on registers and memory. People want to write code in something more humane. A programming language closes the gap. It does two jobs. First, it translates source text into instructions the machine can run. Second, it refuses certain operations — touching raw memory, skipping a type check, freeing your own allocations — and the refusal is what gives the language its character. Algorithms and data structures sit above all of this; they belong to the math, not the syntax.

From source text to running instructions, framed by four trade-offsSource texta file of bytes a programmer wroteTokens · AST · IRlex · parse · semantic check · lowerNative codeBytecode + VMTree-walkAOT · gcc · rustcjavac/JVM · CPythoninterpreter · CPython ASTRunning on the CPU of Act IIfetch · decode · execute · repeatThe four trade-offs every language picks oncecontrol ↔ safetyC ↔ Pythonspeed ↔ expressivenessasm ↔ Haskellstatic ↔ dynamicOCaml ↔ Lispmanual ↔ managedC ↔ Java/Rustalgorithms and data structures sit above all of this — they belong to the math, not the syntax
Pick a path down the middle and a position on each axis at the bottom. That choice is roughly the personality of every language you have ever used.

What a program is

A source file is just bytes — the CPU has no idea what for or def or int main() mean. A translator reads that text and produces instructions the CPU does understand. Call it a compiler when its output runs later, an interpreter when its output runs immediately. Either way the work is the same: split source into tokens (lexing), build a tree (parsing), check that names and types are consistent (semantic analysis), and produce an intermediate representation (IR) that's easier to optimize and emit code from.

Source textx = 1 + 2;lexerTokensIDENT('x') · '=' · INT(1) · '+' · INT(2) · ';'parserAST (parse tree)=x+12semantic analyzerTyped ASTx: i32 ← (1: i32) + (2: i32)loweringIntermediate representation (IR)%t1 = add i32 1, 2store i32 %t1, i32* %xfront end→ to back end
The IR is the choke point: every language-specific concept must be expressed in it, and every back end (x86, ARM, RISC-V, WebAssembly) consumes it. LLVM IR, GCC's GIMPLE, and JVM bytecode all play this role.

The IR decouples "what the language means" from "what the CPU runs". A front end lowers source to LLVM IR; a back end takes IR and emits x86 or ARM. Add a new language and every existing back end works for it; add a new CPU and every existing language runs on it. Linters and Language Servers reuse the same front end.

Between front end and back end sits the middle end: a sequence of optimization passes, each consuming IR and producing better IR. A pass might fold 1 + 2 to 3, delete a branch the compiler proves never runs, or inline a function's body at the call site. The same passes appear in every serious compiler — constant folding, dead-code elimination, inlining, loop transforms. Most of a release build's wall-clock goes here, not parsing. -O0 to -O2 typically buys 2-10x runtime speed on the same source.

Compiler middle and back end — IR through optimization passes to machine codeIR (from front end)%t1 = add i32 1, 2 · store %t1 → %xMiddle end · optimization passes (each: IR in → IR out)constant folddead-code eliminliningloop unrolladd 1, 2i32 3if (false) {…}(removed)call sq(x)x * xfor i in 0..4a;b;c;dCSE / GVNSROA · mem2reg · vectorizefind equal subexpressions, compute once(common-subexpression / global value numbering)promote stack slots to registers,SIMD-ify inner loops where possibleBack end · target-specific loweringinstruction selectionregister allocationmachine codemap IR ops → x86/ARMgraph colouring · spills.text section bytespasses are sequenced · each one is a tree rewrite · LLVM ships ≥ 100 of them, run at -O2 in a fixed pipeline
The same IR shape is the input and output of every middle-end pass — that's why they compose. The back end picks instructions for one specific CPU and decides which values live in registers (fast, scarce) versus on the stack (slower, plentiful).

The back end's two hard jobs are instruction selection (which opcodes for which IR ops) and register allocation (which values live in registers, which spill to the stack). Registers are scarce — x86-64 has 16. Spilling in a hot loop can erase the wins from every other optimization, which is why -O2 matters.

Three paths to execution

The translator can produce instructions at three different times, and the choice shapes everything else about the language.

Ahead-of-time (AOT) compilation translates the whole program to native machine code before it runs. The output is a single binary the OS loads into memory. C, C++, Rust, and Go take this path. Compile time is a one-time cost; runtime overhead is roughly zero.

Interpretation keeps the source (or a parsed tree, or bytecode) and walks it every time. Bash, classic CPython, and Ruby take this path. The interpreter loop adds 10-100x overhead per operation, but startup is instant and code can be modified live.

Bytecode plus a VM sits between. The translator compiles once to a portable byte-stream (.class for Java, .pyc for Python); a runtime — the VM — interprets that bytecode wherever the program lands. Add a JIT ("just-in-time" compiler) to the VM and hot parts get re-compiled to native code at runtime, recovering most of the AOT speed. The JVM, .NET, V8, and modern Python all do this.

AOT compilation, interpretation, and bytecode-VM-with-JIT — three contracts for getting bytes to runAOT compilationInterpretationBytecode + VM (+ JIT)C · Rust · GoBash · classic CPythonJava · JS · modern Pythonmain.ccompilemain.olink → ELFCPU runs nativeall-or-nothing translationdeploy = ship a binarymain.pyreadparseevalrepeatCPU runs interpreterloopwalk source per executiondeploy = ship the sourceFoo.javajavacFoo.class (bytecode)JVM interp + JITCPU runs native (hot)portable bytecode + runtimedeploy = ship bytecode + VMno path is "best" — they trade startup time, peak speed, deploy footprint, and platform reach
The boundaries blur in practice. Go AOT-compiles but ships a runtime for goroutines and GC. Python's "bytecode" lives in __pycache__/ and is interpreted. V8 has four JIT tiers. The shape of each path is durable; the lines between them aren't.

Linking: the AOT path's last step

A C program is rarely one source file. Each .c file compiles separately to an object file — machine code with holes where references to other files' functions go. The compiler has no idea where printf lives; it leaves a hole labelled printf and trusts someone to fill it.

The linker is that someone. It scans every object file and library, builds a global table of which file defines each symbol, then patches every hole with the right address. Two files defining the same symbol: link error. No file defining it: "undefined reference" — you forgot -lm or misspelled the function.

Worked example: linking main.o and foo.o into one binary

Two source files, compiled separately:

  • main.o contains the machine code for main(), which calls foo(). Its symbol table says: DEFINES main at offset 0x00; UNDEFINED foo (a hole at byte 0x12 inside main's code).
  • foo.o contains the machine code for foo(). Its symbol table says: DEFINES foo at offset 0x00; no undefined references.

The linker runs in three passes:

  1. Scan defines. Walk every object file once and build a global table: { main → main.o:0x00, foo → foo.o:0x00 }. Two files defining the same name here would abort with "multiple definition".
  2. Resolve undefs. For every undefined reference, look it up in the table. main.o needs foo → found in foo.o. Nothing left undefined; the program is closed. If anything were still missing, the linker would print undefined reference to 'foo' and stop.
  3. Lay out and patch. Pick final addresses (say main at 0x401000, foo at 0x401080), then write those addresses into the holes (the byte at main.o:0x12 is rewritten from 00 00 00 00 to point at 0x401080). This last step is called relocation.

The output a.out now has zero holes. Every CALL instruction inside main jumps to a concrete address. Add a third file bar.o with another UNDEFINED foo and the linker resolves it to the same 0x401080 — that's how one definition serves the whole program.

Linker resolves symbols across object files and libraries into one executablemain.omath.olibc.aDEFINES: mainUNDEF: squareUNDEF: printfDEFINES: squareDEFINES: cubeUNDEF: powDEFINES: printf, pow, …linkerbuild symbol tableresolve UNDEF → DEFapply relocationslay out sectionsa.out (ELF)main: 0x401000square: 0x401080cube: 0x4010c0printf: 0x401200pow: 0x4012a0static linking: copy symbols in at build time · dynamic linking: leave a stub, resolve at load time
Static linking copies the library's machine code into the binary. Dynamic linking leaves a stub and lets the OS loader patch in the address when the program starts (so one libc on disk serves every program on the machine).

Function calls and the stack

A running program lives mostly on its call stack — a region of memory that grows downward as functions call each other and shrinks back as they return. Each call pushes a stack frame: the return address, the caller's saved registers, and the callee's local variables.

Where do function arguments live? That's the calling convention — a contract between caller and callee. On x86-64 Linux/macOS, the first six integer arguments arrive in registers rdi, rsi, rdx, rcx, r8, r9; the rest go on the stack; the return value comes back in rax. Windows x64 picks different registers. The CPU doesn't care — these are conventions. Get them wrong and arguments arrive scrambled, which is exactly what happens when you forget extern "C" across a language boundary.

A function call's stack frame layout, with x86-64 SysV register passinghigh addresseslow addresses · stack grows downcaller's framemain()'s locals, saved regsreturn address (push by CALL)saved frame pointer (old rbp)callee-saved regscallee's localsint x · char buf[16] · …addressed off rbp as [rbp-N]rsp points here ← top of stackx86-64 SysV registersarg 1 → rdiarg 2 → rsiarg 3 → rdxarg 4 → rcxarg 5 → r8arg 6 → r9arg 7+ → on stackreturn → raxcaller-saved:rax · rcx · rdx · rsirdi · r8-r11callee-saved:rbx · rbp · r12-r15growsCALL pushes return addr · prologue saves rbp · epilogue + RET unwinds it
If a buffer overflow writes past the local array and over the saved return address, the function returns to wherever the attacker chose. This is "smashing the stack". Stack canaries — a random word placed between locals and the return address, checked on return — are the standard defence.

How a JIT works

A JIT recovers AOT speed without losing the bytecode-VM startup model. The runtime starts by interpreting bytecode — slow but instant. Each method gets a counter; once it passes a tier-up threshold (around 1,000 calls for a first-tier JIT, often 10,000 for an aggressive second tier), the JIT compiles that method to native code. Subsequent calls jump to the native version.

A JIT can be more aggressive than an AOT compiler because it sees actual runtime behaviour. If a call site has only ever seen int arguments, the JIT specializes for int and skips the type checks. If a branch has never been taken, the fast path omits it. These are bets. If a later call passes a string instead, the runtime de-optimizes: throws away the native code, falls back to the interpreter, and may re-compile later with the new profile.

Worked example: a hot function's life through a tiered JIT

Picture one method, dot(a, b), called inside a loop a million times. The runtime (think V8 or the JVM) carries a per-method counter and a profile of seen argument types. Threshold for tier 1 is 1,000 calls; tier 2 is 10,000.

Call #What runsCounterNotes
1Interpreter walks bytecode1first invocation; slow, no compile cost yet
2-999Interpreter, recording types999every call: a: float[], b: float[] — no surprises
1,000Tier-up trigger. Baseline JIT compiles dot to native code. Specializes for float[] args; emits a guard if (typeof(a) !== float[]) deopt.1,000one-time compile latency (a few ms); next call jumps to native
1,001-9,999Native code, with the guard checked each entry9,999maybe 20× faster than the interpreter
10,000Tier-up again. Optimizing JIT re-compiles with inlining, loop unrolling, register allocation tuned to the observed shape.10,000another compile pause; afterwards ≈100× the interpreter
10,001-NHeavily optimized native codehottest path; bulk of the program's time spent here
Some later callCaller passes a string insteadguard fails → deopt: throw away the native code, resume in the interpreter, eventually re-profile and re-compile

Two consequences worth noticing. Startup is slower than a pure interpreter because the first few thousand calls pay both the interpreter cost and the profile bookkeeping. Steady state is close to AOT because tier-2 native code knows things an AOT compiler can only guess at — like "this call site has only ever seen one type". Microbenchmarks that run for 100ms never escape tier 0 and report misleadingly slow numbers.

JIT tier-up and de-optimizationBytecodeProfilerNative (optimized)interpreted≈1× start costcounts calls + types"hot" when ≥ Nspecialized toobserved typesexecutetier up · compilede-optimize: assumption violated → fall back to interpreterMost code rarely runs. JITs spend their optimization budget only on the few percent of methodsthat get hot, and stay safe by guarding every assumption with a runtime check.
The first call to a function is slower under a JIT than under a pure interpreter (extra profiling). The thousandth call is far faster. Microbenchmarks that don't warm up the JIT report misleading numbers.

Four trade-offs every language makes

A language is defined as much by what it refuses as by what it allows. The refusals cluster on four axes. There's no neutral position — abstaining is itself a stance.

The four orthogonal trade-offs every language picksControlSafetylow-levelhigh-levelSpeedExpressivenessclose to metalclose to mathC · asmPythonasm · CHaskell · LispStatic typingDynamic typingcheckeddeferredManual memoryManaged memoryfree()GC · ownershipRust · OCamlLisp · JSC · C++Java · Go · Rustorthogonal axes · every language picks a coordinate · Rust sits at (control, speed, static, managed)
The axes are independent. Rust is "low-level" and "managed" because ownership is a compile-time form of management. The grid maps the design space; specific languages occupy specific cells.

Control vs safety

Hardware lets you do dangerous things — read uninitialized memory, treat an integer as a pointer, write off the end of an array. A language decides whether to expose that power.

C and assembly expose it. Cast an integer to a pointer and dereference — if the address isn't yours, the result is undefined behaviour: a segfault, a corrupted neighbour, a silently wrong answer, or a security exploit. Python and JavaScript hide it: no raw pointers, every collection access bounds-checked, every mistake a clean exception. Rust threads the needle — pointers exist but are tracked at compile time by ownership rules, and unsafe operations live in explicit unsafe blocks.

The trade is real. C lets you write a memcpy that beats glibc's. Python prevents you from writing a buffer overflow at all.

Speed vs expressiveness

Low-level code runs fast but is slow to write. High-level code is fast to write but slower at runtime. A language picks where on that line it wants to live.

Assembly is fastest but most painful — every function call is a stack manipulation you spelled out. C adds structs, control flow, and a standard library, no garbage collector. Java/Go/C# add managed memory, generics, richer abstractions; expressiveness rises and runtime cost rises with it. Python, Ruby, and Lisp keep climbing — comprehensions, metaclasses, dynamic dispatch on every method call — and pay another order of magnitude.

Rough rule: each step up the abstraction ladder buys you fewer lines per feature and costs you roughly an order of magnitude at runtime — Python is typically 10-50x slower than C on the same task. Rust sits near C on speed but far up on expressiveness; that's its claim.

Static vs dynamic typing

A program might call .length on an integer. When does the language tell you?

Static typing checks types before the program runs. The compiler refuses to produce a binary if a call doesn't make sense. Dynamic typing waits until the call happens and the actual value disagrees. Static catches the bug before any user sees it; dynamic lets you write the first version faster.

When type errors are caught — across the program lifecycleeditcompilelink / loadrunnever"red squiggle"type-check passJVM bytecode verifierTypeError thrownsilent wrong answerRust · OCaml · Haskell · TypeScript (strict)Java · C# · Go (compile + load-time bytecode verify)Python · Ruby · plain JSearlier ←→ laterwhen does the type system catch the bug?
Gradual typing (Python's mypy, JS's TypeScript, Ruby's Sorbet) lets you slide a project leftward without rewriting it. The catch: gradual systems are unsound — types are only enforced where you've annotated, and the rest still fails at runtime.

Modern static checkers also narrow types as control flow refines them. A variable typed string | null becomes just string inside an if (x !== null) branch — the compiler proves the other case is gone. TypeScript, Kotlin, Swift, and Rust's match all do this. The technique is called flow-sensitive typing, and it's what makes "explicit nullability" tolerable instead of cast-heavy.

Manual vs managed memory

Every program allocates memory, and that memory has to get freed when nobody needs it. Who does the freeing?

Manual management: you do it. Every malloc needs a matching free, and the language doesn't check. Get it wrong and you produce one of three classic bugs: a leak (forgot to free, memory grows forever), a use-after-free (freed too early, the pointer now points at someone else's memory), or a double-free (freed twice, the allocator's bookkeeping corrupts). These three account for a large fraction of the CVEs ever filed against C and C++ programs.

Garbage collection (GC): the runtime does it. A tracing collector walks the live heap from a set of roots (CPU registers, the call stack, globals), marks every reachable object, and frees the rest. Java, Go, Python, and JavaScript all use GC. The cost is a pause while the collector runs. Modern concurrent collectors (Go's, Java's ZGC and Shenandoah) keep the stop-the-world portion in the single-digit milliseconds even on multi-hundred-gigabyte heaps, but it's never zero.

Ownership: the compiler does it. Rust's invention. Every value has exactly one owner; the owner gets dropped at the end of its scope; borrows are checked at compile time. No runtime cost, no leak class, no GC pause — paid for in compile-time effort and in code that won't compile until the lifetimes line up.

Manual, garbage-collected, and ownership-based memory managementManualGarbage collectionOwnershipC · C++Java · Go · Python · JSRustmallocfreeuseprogrammer balances every alloc+ deterministic destruction+ zero runtime cost− leaks · use-after-free · double-free− entire bug class is a CVE sourcenewuseforgetGC pauseruntime sweeps unreachable objects+ no use-after-free / double-free+ "just allocate, forget about it"− pauses (μs to ms; tunable)− 1.5–4× memory overhead is typicallet x = …usescope endcompiler inserts drop at scope end+ deterministic · no GC · no UAF+ aliasing rules block data races− steeper learning curve− "fight the borrow checker"three points on the same axis: deterministic + cheap + safe — pick two, or shape your program around the thirdreference counting (Swift, Python's primary mechanism) is a fourth flavour — deterministic but bad with cycles
Java's ZGC and Shenandoah collectors target sub-millisecond pauses. Rust achieves the same goal with no pause but rejects programs whose lifetimes don't typecheck. They solve the same problem at different points in the lifecycle.

The simplest tracing GC is mark-and-sweep. Phase one: from the roots, follow every pointer and mark each reachable object. Phase two: walk the heap and free anything unmarked. Cost is proportional to live data on mark, total heap on sweep. Anything unreachable is, by definition, garbage.

Mark-and-sweep — DFS from roots marks live objects, sweep frees the restMark phaseSweep phaserootrootABCDEFunreachableunreachableheap (after sweep)ABDCEFfilled = live (kept)dashed = unmarked (freed)cost: O(live) mark + O(heap) sweepstop-the-world unless concurrentfragmentation accumulates over timeconcurrent collectors keep the program running during mark, using a tri-colour invariant to track work
The mark phase is graph traversal — BFS or DFS, doesn't matter which. The sweep phase is a linear walk of the heap. Compacting collectors add a third phase that moves live objects together to defeat fragmentation.

The decisive optimization on top of mark-and-sweep is the generational hypothesis: most objects die young. Across decades of measurements on real workloads, the large majority of allocations become unreachable shortly after they're made — often before the next allocation of comparable size. A generational collector splits the heap into a small young generation (collected often, cheaply) and a large old generation (collected rarely). Survivors of a few young-gen collections get promoted to the old gen. The young-gen pause stays small because the young gen does.

Generational GC — small frequent young-gen collections, rare expensive old-gen collectionsYoung generationOld (tenured) generationsmall · collected often (ms)large · collected rarely (seconds–minutes)survived NpromoteObject lifetime histogram (the generational hypothesis)≥ 95% die here (young)few % live long (old)very few live forever
HotSpot's young gen is typically 5-10% of total heap. Modern G1 and ZGC use regions instead of strict generations but apply the same heuristic: short-lived objects collected cheaply and often.

The other GC family is reference counting (RC). Every object carries a count of references to it; the count goes up on copy, down on drop; the object frees itself when the count hits zero. No tracing, no pauses — freeing happens exactly when the last reference dies. Cycles are the failure mode: two objects pointing at each other keep each other alive even when nothing outside the cycle does, and neither count reaches zero.

Reference counting cycle — two objects keep each other alive after external refs dropStep 1 · external ref + cycleStep 2 · external ref droppeduserABrc=2rc=1user → A · A ↔ Buser (gone)ABrc=1rc=1A.rc=1 · B.rc=1 · neither hits 0leak — unreachable but undeletableFix: weak references break the cycleA holds a strong ref to B; B holds a weak ref to A. Weak refs don't bump the count.When user drops A, A.rc → 0 · A frees · A's strong ref to B drops · B.rc → 0 · B frees.
Languages built on RC (Swift, Python, C++ shared_ptr) ship a weak reference type that doesn't bump the count. Python also runs a periodic tracing collector to clean cycles the programmer forgot to break.

Paradigms

A paradigm is a default answer to "how should this code be organized?" Four answers dominate. Imperative programs read as sequences of commands that mutate state. Object-oriented programs model the world as things that hold state and respond to messages. Functional programs are transformations over immutable values. Declarative programs describe the desired result and let the runtime figure out how. Modern languages mix all four; the choice is per-module, not per-language.

Same task — sum of squares 1..n — in four paradigmsImperativeObject-orientedFunctionalDeclarative"do these steps in order""send messages to objects""compose pure transformations""describe the result, not the path"total = 0for i in 1..=n: total += i * ireturn totalclass Range: def sum_sq(self): return sum( i*i for i in self)(1..=n) .map(|i| i*i) .sum()SELECT SUM(i*i) FROM generate_series( 1, n) AS s(i);
Same answer, four shapes of thinking. A modern program might use all four within one file: imperative IO, OO domain model, functional transformations, embedded SQL for queries.

Two implementation tricks make these paradigms cheap at runtime. The first is the closure: a function value that carries the variables it captured from the enclosing scope, stored on the heap so they outlive the original stack frame. A callback often needs the local variables of the function that registered it, but by the time the callback runs that function has returned. The closure makes those variables stay.

A closure — function value carries its captured environmentsourcefunction make_counter() { let count = 0; return () => ++count;}uselet next = make_counter();next(); next(); next(); // 1 2 3closure value at runtimecode ptr →.text: lambda bodyenv ptr →heap: { count: 3 } (one per make_counter call; captured by reference)a closure = code pointer + environment recordtwo closures from two calls share no state"upvalues" in Lua, "free variables" elsewherereturna function value is two pointers, not one · the second is what makes "stateful callback" a language feature
C does not have closures, which is why C callbacks ship a void* user_data alongside every function pointer. C++ lambdas, Rust closures, and Go function literals all desugar to the same struct: code pointer plus captured fields.

The second trick is the vtable. In OO code, dog.speak() and cat.speak() run different functions even though the source looks identical — a lookup has to happen at runtime to pick the right one. The vtable does it in two memory loads. Each class has one vtable — an array of function pointers, one per method. Each instance starts with a header containing a pointer to its class's vtable. A virtual call becomes: load the vtable pointer from the object, load the right slot, call it. Branch predictors learn the common case within a few iterations.

Object memory layout — instance header + fields, vtable holds method pointersDog instance (heap)headervtable ptr · type · GC bitsname: "Rex"age: 5weight: 22.0owner: ptr → Persontags: ptr → Vec+0+8+16+24+32+40Dog vtable (one per class)[0] speak → Dog::speak[1] eat → Animal::eat[2] sleep → Animal::sleep[3] fetch → Dog::fetch[4] drop → Dog::dropvtable ptrobj.speak() → obj.vtable0virtual dispatch = two loads + an indirect call
Rust's dyn Trait is a "fat pointer" — data pointer plus vtable pointer — placing the vtable beside the reference rather than inside the object. Go's interfaces use the same shape.

Algorithms

An algorithm is a procedure whose cost can be reasoned about independently of any particular machine. The reasoning uses Big-O notation: write O(f(n)) to mean "for large enough n, this algorithm's work grows no faster than some constant times f(n)". Constants drop because real CPUs vary them too much to be portable claims.

What matters is the shape. O(1) is constant. O(log n) doubles input but adds one step. O(n) scales linearly. O(n log n) is what good sorts achieve. O(n²) is fine on small inputs and painful past a few thousand. O(2ⁿ) is hopeless past about thirty.

Big-O complexity classes — log-y scale1091071051031011110²10⁴10⁶10⁸O(1)O(log n)O(n)O(n log n)O(n²)O(2ⁿ)workinput size nWall-clock at n = 10⁶ (1 ns / op)O(log n) ≈ 20 ns · O(n) ≈ 1 ms · O(n log n) ≈ 20 ms · O(n²) ≈ 17 minutes · O(2ⁿ) is heat-death-of-the-universe territory
At a million items, O(n) finishes in a millisecond and O(n²) takes about seventeen minutes. Picking the right complexity class is usually a 1000x speedup; tuning constants inside it is usually 2x.

Sorting

Take n items, return them in order. A theorem says comparison-based sorts can't beat O(n log n) on average — an information-theoretic floor, not an engineering limit. Non-comparison sorts (radix, counting) reach O(n) but only when keys have structure (fixed-width integers, bounded range).

Real standard libraries don't ship a textbook sort. They ship hybrids that switch algorithm based on input shape. Python's list.sort() is Timsort: merge sort that detects already-sorted runs. C++'s std::sort is introsort: quicksort that falls back to heapsort if recursion gets too deep, then to insertion sort on small ranges. Rust uses pdqsort.

Sorting algorithms: best, average, worst, space, stabilityAlgorithmBestAverageWorstSpaceStableInsertionO(n)O(n²)O(n²)O(1)yesMergeO(n log n)O(n log n)O(n log n)O(n)yesQuickO(n log n)O(n log n)O(n²)O(log n)noHeapO(n log n)O(n log n)O(n log n)O(1)noRadixO(nk)O(nk)O(nk)O(n+k)yesTimsortO(n)O(n log n)O(n log n)O(n)yesIntrosortO(n log n)O(n log n)O(n log n)O(log n)noyour standard library uses a hybrid · Python: Timsort · C++: introsort · Rust: pdqsort
"Stable" means equal keys keep their original order — important when you sort by multiple keys in turn. Radix sort's k is the key width in digits; for 32-bit integers k is 4, so it's effectively linear.

Graph traversal

Given a graph (nodes connected by edges), visit every node reachable from a starting point.

Breadth-first search (BFS) uses a queue and visits nodes in order of distance from the start: root, then neighbours, then their neighbours. Best for shortest-path-by-edge-count. Depth-first search (DFS) uses a stack (or recursion, which is a stack) and goes as deep as possible before backtracking. Best for cycle detection and topological ordering. Both run in O(V + E). The only structural difference is queue vs stack.

BFS vs DFS on the same graph — visit order labeled on each nodeBFS · queue · level-by-levelDFS · stack · depth-first12345671253467visit by level: root, then siblings, then grandchildrenvisit each subtree fully before its siblingboth are O(V + E) · same data, different discipline · queue vs stack is the only structural difference
Most graph problems reduce to one of these with a small twist. Dijkstra's shortest-path is BFS with a priority queue. A* is Dijkstra plus a heuristic. The numbers show visit order; both orderings reach all seven nodes.

Recursion vs iteration

Recursion and iteration compute the same answer with different memory shapes. A recursive factorial(n) allocates n stack frames before any multiplication happens. An iterative version uses a single accumulator and constant stack. The recursive form is often clearer; it also overflows the stack on large n in any language without tail-call optimization (TCO).

TCO is a compiler trick. If the recursive call is in tail position — its return value is the caller's return value with no further work — the compiler reuses the same stack frame instead of pushing a new one. Recursion compiles to JMP, not CALL. Scheme and OCaml require it; Rust and the JVM don't guarantee it; C compilers apply it opportunistically.

Recursion vs iteration vs tail-call optimization — same factorial, three stack shapesRecursiveIterativeTail-recursive + TCOfact(n): if n=0: return 1 return n*fact(n-1)fact(n): acc = 1 while n > 0: acc *= n; n -= 1 return accfact(n, acc=1): if n=0: return acc return fact(n-1, acc*n)fact(5)fact(4)fact(3)fact(2)fact(1)fact(0)fact (single frame)acc, n live in regsno calls pushedfact (frame reused)JMP, not CALLdepth stays at 1stack depth O(n)stack overflow risk past 10Kstack depth O(1)always safestack depth O(1)if compiler implements TCOtail call: recursive call is the last thing the function does · TCO turns the CALL into a JMP
The naive recursive form does an extra multiply after the recursive call returns — that's why TCO can't apply. Rewriting with an accumulator (right column) puts the recursive call in tail position and unlocks the optimization.

Data structures

An algorithm is a procedure; a data structure is the shape it operates on. The right shape often makes an algorithm trivial; the wrong one makes the same algorithm infeasible. Four staples cover most needs.

Arrays (contiguous memory, indexed by integer): O(1) random access, expensive insertion in the middle. Hash maps (key-to-value via a hash function): O(1) average lookup and insertion, no order. Balanced trees (especially B-trees): O(log n) lookup with sortedness preserved — the basis for ordered maps and database indexes. Queues and stacks (FIFO / LIFO): what BFS and DFS use, plus task scheduling and undo stacks.

Operation costs across the four staple data structuresStructureAccess by indexSearch by valueInsertDeleteArray (dynamic)O(1)O(n)O(1)* amortizedO(n)Linked listO(n)O(n)O(1)O(1)Hash mapO(1) avgO(1) avgO(1) avgBalanced treeO(log n)O(log n)O(log n)O(log n)Queue (FIFO)O(1)O(1)Heap (priority q)O(log n)O(log n)"avg" for hash map: assumes a good hash and a load factor below the rehash thresholdarrays beat linked lists in practice for less than 1000 items even where Big-O suggests otherwise — cache lines win
Big-O is the right floor for the conversation, but locality (see Act II) decides the wall-clock answer. A linear scan of a 1 KB array beats a tree lookup that misses cache, even though the array is "O(n) worst" and the tree is "O(log n)".

Inside the hash map

The hash map deserves a closer look because most programs use one. Recipe: hash the key into an integer, take that modulo the bucket-array size to pick a slot, store the key-value pair there.

Two keys can hash to the same slot — a collision, unavoidable past a certain density (the birthday paradox makes the first one likely after about √N insertions). Two strategies handle them. Chaining stores a small list per bucket. Open addressing probes neighbouring slots until it finds an empty one.

Past a load factor of about 0.75 (entries / slots), both strategies trigger a rehash: allocate a bigger bucket array, re-insert everything, throw the old one away. That single rehash is O(n), breaking the "O(1) per insert" claim — but amortized across all the cheap inserts before it, the average is still O(1).

Worked example: a collision in bucket 7, then a rehash

Start with a map of 8 slots (N = 8) using chaining. The hash of each key, mod 8, decides its slot:

  1. Insert "Alice" → 1. hash("Alice") % 8 = 3. Slot 3 is empty; drop the pair in. Entries: 1, load factor 1/8 = 0.125.
  2. Insert "Bob" → 2. hash("Bob") % 8 = 7. Slot 7 is empty; drop it in. Load factor 0.25.
  3. Insert "Carol" → 3. hash("Carol") % 8 = 7. Collision — slot 7 already holds Bob. Chaining strategy: append Carol to slot 7's mini-list. Slot 7 is now a 2-item chain [Bob, Carol]. Lookups for Bob and Carol both go to slot 7, then walk the short list comparing keys. Load factor 0.375.
  4. Insert "Dave" → 4. hash("Dave") % 8 = 7. Another collision in slot 7; chain becomes [Bob, Carol, Dave]. Load factor 0.5. Still under threshold.
  5. Inserts continue until entries = 6. Load factor = 6/8 = 0.75 — threshold hit.
  6. Rehash. Allocate a new array of 16 slots. Walk the old array, re-hash every key against the new size (hash(key) % 16), and place it in the new array. Bob, Carol, and Dave very likely scatter to different slots now because % 16 distributes them more finely than % 8 did. Free the old array.
  7. Insert #7 onward runs against the bigger array. Load factor resets to 7/16 ≈ 0.44; collisions become rare again. The next rehash waits until entries = 12.

That one rehash at step 6 touched all 6 old entries — O(n) work. But the 5 cheap inserts before it, plus the cheap inserts that follow until the next rehash, average out: across many operations the cost per insert is O(1) amortized. With open addressing instead of chaining, step 3 would probe slot 8, 9, 10… until it found a free one — same load-factor logic, no extra list.

Hash map mechanics: keys → hash → bucket, with a chained collision"Alice""Bob""Carol""Dave"hash()SipHash · FNV ·xxHash · CityHash↓ mod Nslot 0slot 1slot 2slot 3slot 4slot 5slot 6AliceCarolcollision · chainedload factor = entries / slots · rehash above ≈0.75 (doubles the bucket array, re-inserts everything)average O(1); worst-case O(n) when an adversary picks colliding keys — randomize your hash seedthe most-used non-trivial data structure in computing
"Hash flooding" is a real attack: pick keys that all hash to the same bucket and a server's per-request hash map degrades to O(n) per insert. Modern languages (Python, Rust, Go) randomize the hash seed at process start to defeat it.

The pattern recurs everywhere. Compilers use hash maps for symbol tables, balanced trees for type lookup, queues for parser lookahead. Databases (Act VI) use B-trees for indexes, LSM trees for write-heavy workloads. Networking (Act V) uses queues for packet buffering and hash tables for flow tracking. The right shape isn't an algorithmic question; it's a question about which operation dominates.

Standards

Every language and algorithm cited on this page has a canonical reference.

Going deeper

Branches that earn their own article.

  • Compiler internals (lexing, parsing, IRs, optimization passes, code generation).
  • Interpreter and VM design.
  • JIT compilation.
  • Garbage collection strategies (mark-sweep, generational, concurrent).
  • Type theory and type systems.
  • Specific language deep dives.
  • Formal algorithm analysis.
  • Individual data structure internals (B-trees, red-black trees, skip lists, bloom filters).
  • Concurrency primitives (threads, async/await, actors, CSP).
  • Formal methods and verification.