Act X of X

Frontier

Where computing is heading — the substrates, scales, and limits we have not finished negotiating with yet.

Three things are happening at the edge of computing in 2026, and an engineer who ships software is touched by all of them. The math underneath TLS is being replaced, because a working quantum computer would break the math that has carried public-key crypto for fifty years. The silicon underneath inference is fracturing into a dozen specialised shapes — TPUs, NPUs, LPUs, neuromorphic chips — because one architecture cannot win on every workload at a watt budget the grid can supply. And the substrate itself is being questioned for the first time since transistors won — photonic compute, in-memory compute, DNA storage are no longer paper ideas. This act walks each of these, names what is shipping versus what is research, and is honest about the walls — Landauer, the speed of light, the cooling envelope — that no engineering trick removes.

The Frontier — what is shipping, what is research, and where the walls are
  • Shipping today — post-quantum crypto · specialised silicon · spatial computing runtimes
  • Scaling now — hyperscale training · the energy frontier · neuromorphic pilots
  • Emerging — fault-tolerant quantum · AGI capability scaling · photonic compute
  • Speculative — DNA storage · molecular compute · reversible computing
  • The walls — Landauer (kT ln 2 per erase) · memory wall (CPU vs DRAM gap) · speed of light (1 ft/ns floor) · cooling (W per mm²)
the field has runway against most of these — and a few it does not
This act follows the band from top to bottom: what is in production, what is scaling, what is emerging, what is speculative, and what is permanent.

Post-quantum cryptography

A sufficiently large quantum computer running Shor's algorithm factors large integers and computes discrete logarithms in polynomial time. That single capability breaks RSA, finite-field Diffie-Hellman, and every elliptic-curve scheme — X25519, Ed25519, ECDSA, P-256. The current best public estimate for breaking RSA-2048 is in the millions of physical qubits, which 2026 hardware does not have. The threat is not "today." The threat is harvest-now-decrypt-later: an adversary recording encrypted traffic now and decrypting it years later once the hardware arrives. Anything whose confidentiality must outlive the next decade — health records, state secrets, long-lived API keys — is already at risk.

The solution is to replace the vulnerable primitives with algorithms whose hardness assumptions resist quantum attack. NIST ran an eight-year competition; the winners were standardised in August 2024:

  • FIPS 203 — ML-KEM (Module-Lattice Key Encapsulation, formerly Kyber). A key-encapsulation mechanism based on the Module-Learning-With-Errors problem.
  • FIPS 204 — ML-DSA (Module-Lattice Digital Signature Algorithm, formerly Dilithium). The lattice-based signature scheme, also Module-LWE.
  • FIPS 205 — SLH-DSA (Stateless Hash-Based Digital Signature, formerly SPHINCS+). A signature scheme whose security rests only on the underlying hash function — a deliberate hedge in case the lattice assumptions fall.

ML-KEM in particular is built around the difficulty of recovering a short vector from a noisy linear system over a lattice — a problem with decades of cryptanalytic attention and no known quantum speed-up beyond Grover's generic √N. The KEM produces a 32-byte shared secret. The catch is size. An ML-KEM-768 public key is 1184 bytes; the ciphertext (encapsulation) is 1088 bytes. X25519 public keys are 32 bytes and shares are 32 bytes. The key-share payloads therefore get roughly 35 times larger on the wire; the handshake as a whole grows by about 2.2 KB.

A hybrid TLS 1.3 handshake with X25519 plus ML-KEM-768 key shares
  ClientHello → key_share: x25519 (32 B) + ML-KEM-768 pk (1184 B); supported_groups: X25519MLKEM768, x25519, secp256r1
  ServerHello ← key_share: x25519 (32 B) + ML-KEM-768 ct (1088 B); selected_group: X25519MLKEM768
  Server derives: ss_ec = X25519(b, A); ss_pq = ML-KEM.Encap → ct; ss = HKDF(ss_ec ‖ ss_pq)
  Client derives: ss_ec = X25519(a, B); ss_pq = ML-KEM.Decap(sk, ct); ss = HKDF(ss_ec ‖ ss_pq)
  Combined session secret — secure if EITHER X25519 OR ML-KEM holds; total handshake bytes grow by approximately 2.2 KB versus pure X25519
The hybrid construction concatenates both shared secrets before feeding them to HKDF. If lattice cryptanalysis advances or if an implementation bug is found in ML-KEM, X25519 still protects the session. If a quantum computer arrives, ML-KEM still protects it.

The mechanism in production right now is hybrid key exchange: run both X25519 and ML-KEM-768 in parallel, concatenate the two shared secrets, and feed them through HKDF to derive the session keys. The IETF TLS working group specified this group as X25519MLKEM768; Chrome enabled it by default in 2024, Firefox followed, and Edge inherits Chromium's support. As of 2026 a non-trivial fraction of TLS traffic on the open internet is already post-quantum on the key-exchange side. The defence-in-depth argument is the engineering point: if either primitive is broken, the session still has the other.

Worked example: a hybrid TLS 1.3 ClientHello, byte by byte

A client opening a connection to example.com constructs a ClientHello that advertises one or more named groups in the key_share extension. For a hybrid post-quantum handshake the relevant lines look like:

supported_groups: X25519MLKEM768, x25519, secp256r1
key_share:
  X25519MLKEM768 (group 0x11EC):
    x25519 public key       32  bytes
    ML-KEM-768 public key 1184  bytes
                       = 1216  bytes
  x25519 (group 0x001D):
    public key              32  bytes

The client offers two shares. The server picks one — if it understands X25519MLKEM768 it picks the hybrid; otherwise it falls back cleanly to plain X25519. On a hybrid response, the ServerHello carries:

key_share:
  X25519MLKEM768:
    x25519 public key       32  bytes
    ML-KEM-768 ciphertext 1088  bytes
                       = 1120  bytes

Both sides now derive a 32-byte X25519 shared secret and a 32-byte ML-KEM shared secret. The TLS 1.3 key schedule concatenates them — ss = ss_ec ‖ ss_pq — and feeds the 64-byte concatenation into HKDF-Extract, which then drives the handshake-traffic-secret derivation exactly as TLS 1.3 already specifies.
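
A minimal sketch of that combination step in Python, assuming two placeholder 32-byte values stand in for the real X25519 and ML-KEM-768 outputs. It shows HKDF-Extract over the concatenation, not the full TLS 1.3 key schedule (which feeds the extracted secret through further labelled expansions):

import hmac, hashlib, os

def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    # RFC 5869 HKDF-Extract: PRK = HMAC-Hash(salt, IKM)
    return hmac.new(salt, ikm, hashlib.sha256).digest()

# Placeholder secrets standing in for the real primitives (illustrative only).
ss_ec = os.urandom(32)   # would be the X25519 shared secret
ss_pq = os.urandom(32)   # would be the ML-KEM-768 decapsulated secret

# Hybrid combination: concatenate, then extract. If either input is
# unpredictable to an attacker, the output is too.
prk = hkdf_extract(salt=b"\x00" * 32, ikm=ss_ec + ss_pq)
print(prk.hex())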

The cost. The handshake grows by roughly 1184 + 1088 = 2272 bytes. On a fast link this is invisible. On a satellite link with 600 ms round-trip and limited window size, the additional flight can push handshake latency from 1.2 s to 1.5 s. Most production tuning so far has been around fitting the ClientHello inside the initial congestion window and avoiding fragmentation at MTU boundaries.

Why hybrid is the right migration path. Defence in depth, twice. Classical security if PQ breaks (the lattice problems are younger than RSA and have less cryptanalysis behind them). PQ security if classical breaks (a quantum computer arriving sooner than expected). It is the conservative answer, which is the right answer for cryptography on the open internet.

Signatures migrate differently. They are larger — ML-DSA signatures run 2.4-4.6 KB depending on parameter set, SLH-DSA signatures run 8-50 KB — and they live longer. A TLS handshake replaces session keys every connection; a code-signing certificate may sit on disk for ten years. The migration sequence most organisations follow:

  1. Key exchange first. TLS, SSH, VPN, IPsec — anything where a recorded ciphertext is a future vulnerability. Hybrid X25519+ML-KEM in TLS 1.3 is the canonical move.
  2. Authenticator signatures next. Code signing, software-update signatures, root certificates. SLH-DSA for the highest-assurance roots (because hash-based security has no number-theoretic dependency), ML-DSA elsewhere.
  3. Long-lived identity last. WebPKI certificate hierarchies, government PKI, device-identity roots. The chains are slow to rotate and the size blow-up matters most here.

The honest cost of all of this is engineering hours, not security. ML-KEM is fast — comparable to X25519 in pure CPU time. ML-DSA verification is fast, ML-DSA signing is slower. The wire-size growth is the production concern. The unknown is how lattice cryptanalysis ages: the schemes have stood up so far, but the field is two decades younger than ECC. The hybrid construction exists precisely so this question does not need a single answer.

The substrate the field has chosen — replace the math, keep the protocols, hedge with hybrids — is the post-quantum frontier where engineering meets product reality. The handshake gets larger; the threat model gets bounded; classical cryptography in Act VIIa still composes the rest of the stack.

Quantum computing

Classical bits answer one question at a time. There is interest in a model of computation where a register can hold a combination of possible values until measured, and where the measurement statistics interfere — some answers reinforce, some cancel. If you can engineer that interference, certain problems collapse from exponential to polynomial cost. The whole field rests on that conditional.

The unit is the qubit: a two-level quantum system whose state is a complex superposition α|0⟩ + β|1⟩ with |α|² + |β|² = 1. A single qubit's pure state is naturally drawn on the Bloch sphere — a unit sphere where the north pole is |0⟩, the south pole is |1⟩, and every other point is a coherent combination. Measurement collapses the state to one of the poles, with probability |α|² for |0⟩ and |β|² for |1⟩. Between measurements, you can rotate the state continuously by applying quantum gates.

A Bloch sphere with a tilted qubit state vector
  |ψ⟩ = α|0⟩ + β|1⟩ with |α|² + |β|² = 1; poles |0⟩ and |1⟩, equator states |+⟩ and |−⟩
  α = cos(θ/2), β = e^(iφ) · sin(θ/2)
  measurement → |0⟩ with prob |α|², |1⟩ with prob |β|²
  one qubit is a point on the surface; n qubits live in a 2ⁿ-dimensional complex space
Two real angles (θ, φ) parameterise a pure single-qubit state. Multi-qubit states do not fit on this picture — the dimensionality is what makes them useful and what makes them hard to simulate classically.

Two operations distinguish a quantum register from a classical one. Superposition lets a single n-qubit register simultaneously represent every n-bit string with its own complex amplitude — 2ⁿ amplitudes in total. You cannot read all 2ⁿ amplitudes out; one measurement returns one n-bit string. The engineering trick is to use gate sequences that arrange interference so the amplitude on the answer you want is large while the rest cancels. Entanglement is the second: two or more qubits in a joint state that cannot be factored into independent single-qubit states. Measuring one entangled qubit instantaneously constrains the measurement statistics of the others. This is the resource quantum algorithms actually consume.

Physical implementations in 2026 come in four families. Superconducting transmons (Google, IBM, Rigetti) use Josephson-junction circuits at 10-20 millikelvin; the largest published processors run a few thousand physical qubits, with gate times in tens of nanoseconds and gate fidelities pushing into the 99.9% range. Trapped ions (IonQ, Quantinuum) hold individual atomic ions in vacuum traps and manipulate them with lasers; fewer qubits but very high fidelity and full all-to-all connectivity. Neutral atoms (QuEra, Pasqal, Atom Computing) trap atoms in optical tweezer arrays; Atom Computing crossed 1000 qubits in 2023, and the architecture scales by adding tweezers. Photonics (PsiQuantum, Xanadu) encodes qubits in single photons; the bet is that photons are easy to move and naturally room-temperature, at the cost of probabilistic gates.

The two named algorithms drive most of the cryptographic concern. Shor's algorithm (1994) factors n-bit integers in roughly O(n³) quantum operations versus the O(exp(n^(1/3))) of the best known classical algorithm. Applied to RSA-2048 with realistic error correction, the standing estimate is around 20 million physical qubits and several hours of wall-clock time — neither of which exists in 2026. Grover's algorithm (1996) searches an unstructured space of N items in O(√N) queries versus O(N) classically. The implication for symmetric crypto is concrete: AES-128 against Grover has roughly the work of AES-64 against a classical attacker, which is why post-quantum guidance recommends AES-256 for long-lived data. Preimage search against hashes degrades the same way; SHA-256 against Grover offers roughly the preimage resistance a 128-bit hash offers classically.

Grover's algorithm — amplitude amplification on a 4-item search
  Initial superposition: |00⟩ |01⟩ |10⟩ |11⟩ carry amplitudes ½ ½ ½ ½
  After oracle: ½ ½ ½ −½ (the marked item's sign is flipped)
  After diffusion: 0 0 0 1
  one Grover iteration finds the marked item with certainty on N = 4 — the rare exact case
The oracle flips the sign of the marked amplitude; diffusion reflects all amplitudes about their mean, lifting the marked one and zeroing the rest. The cost is √N oracle calls instead of N classical lookups.

Worked example: Grover's search on 4 items

Take the smallest non-trivial unstructured search: a database of 4 items indexed by 2 qubits, with exactly one marked item — say the marked item is |11⟩. The oracle is a black-box circuit that flips the sign of the marked amplitude. The goal is to identify which item is marked using as few oracle queries as possible. Classically you need 2.5 queries on average and up to 4 in the worst case. Grover does it in ⌊π/4 · √N⌋ = ⌊π/4 · 2⌋ = 1 iteration with certainty.

Step 0 — initialise. Apply Hadamards to both qubits, producing a uniform superposition:

|ψ₀⟩ = ½ |00⟩ + ½ |01⟩ + ½ |10⟩ + ½ |11⟩

Each amplitude is ½; each measurement probability is (½)² = ¼. Measuring now gives random results.

Step 1 — oracle. The oracle flips the sign on the marked state:

|ψ₁⟩ = ½ |00⟩ + ½ |01⟩ + ½ |10⟩ − ½ |11⟩

Measurement probabilities are still all ¼ — the sign is invisible to a single measurement. The oracle has hidden the answer in the phase, not the magnitude.

Step 2 — diffusion. Apply the Grover diffusion operator 2|ψ₀⟩⟨ψ₀| − I, which reflects all amplitudes about their mean. The mean of (½, ½, ½, −½) is ¼. Reflecting each amplitude a about ¼ gives 2·(¼) − a = ½ − a:

|ψ₂⟩ = (½ − ½) |00⟩ + (½ − ½) |01⟩ + (½ − ½) |10⟩ + (½ − (−½)) |11⟩
     = 0 |00⟩ + 0 |01⟩ + 0 |10⟩ + 1 |11⟩

The marked state has amplitude 1; everything else is exactly 0. Measurement returns |11⟩ with probability 1.

This is the rare case (N = 4) where Grover finds the answer with certainty. For larger N, the optimal number of iterations is ⌊π/4 · √N⌋, and the success probability is close to but not exactly 1. The headline scaling is O(√N) oracle calls, which is a quadratic speed-up — significant for unstructured search, but nothing like Shor's exponential gap.
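
The three steps fit in a few lines of NumPy. This is a toy state-vector simulation, with amplitudes held in a plain array and the oracle and diffusion applied directly, that reproduces the arithmetic of the worked example:

import numpy as np

n = 4                                  # search space of 4 items (2 qubits)
marked = 3                             # index of the marked state |11⟩

psi = np.full(n, 1 / np.sqrt(n))       # step 0: uniform superposition, amplitude ½ each
psi[marked] *= -1                      # step 1: oracle flips the sign of the marked amplitude
psi = 2 * psi.mean() - psi             # step 2: diffusion reflects every amplitude about the mean

print(psi)                             # [0. 0. 0. 1.]
print(psi ** 2)                        # measurement probabilities: [0. 0. 0. 1.]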

The architectural distinction that matters today is NISQ versus fault-tolerant. NISQ — Noisy Intermediate-Scale Quantum — describes today's machines: hundreds to a few thousand physical qubits, gate error rates around 10⁻³ to 10⁻⁴, no error correction, and circuit depth limited by decoherence to perhaps a few hundred two-qubit gates before the signal is noise. Fault-tolerant quantum computing requires logical qubits built from many physical qubits via error-correcting codes — most notably the surface code (Fowler et al., 2012), which encodes one logical qubit in a 2D grid of around 1000 physical qubits at current error rates. In 2024 and 2025 multiple groups demonstrated logical qubits with error rates below their constituent physical qubits — the threshold below which adding more physical qubits actually helps. As of 2026 the count of demonstrated logical qubits is in the single digits to low tens, not the millions needed for Shor against RSA-2048.

NISQ vs fault-tolerant — the gap in physical qubits per logical qubit
  NISQ (2026), noisy intermediate scale: physical qubits 10² – 10³ · gate error 10⁻³ to 10⁻⁴ · circuit depth hundreds before decoherence
  Fault-tolerant, error-corrected logical qubits: physical qubits 10⁶ – 10⁸ · logical error 10⁻¹² target for Shor · surface-code overhead ≈ 1000 physical per logical
  a few demonstrated logical qubits in 2024-2025 — millions required for cryptographically relevant Shor
The error-correction overhead is what separates today's machines from tomorrow's. Each generation of physical qubit improvement shrinks the per-logical overhead, but the gap to RSA-breaking scale is still many orders of magnitude.

The honest trade-off is that NISQ machines do not yet beat classical supercomputers on any commercially relevant problem with high confidence. Specific contrived benchmarks (random circuit sampling, Gaussian boson sampling) have produced quantum-advantage claims that classical algorithms then partially closed. Variational algorithms — quantum chemistry, optimization heuristics — run on NISQ hardware but rarely beat the classical state of the art at scale. The bet is that fault-tolerant machines, once they exist, will. The PQC migration above exists because the bet is reasonable enough to plan around.

The classical cryptography of Act VIIa is what PQC replaces; the bit of Act I is what the qubit generalises. Both are the substrate the rest of the book sits on.

Specialised silicon beyond GPUs

A GPU is a massively parallel array of small cores designed for graphics and adapted for deep learning. It is the right architecture for training because training is messy — different layer shapes, irregular memory accesses, frequent kernel launches, gradient sync. Inference is different. Inference runs the same model shape billions of times. The flexibility a GPU pays for in silicon area, power, and clock complexity is wasted on a fixed dataflow. Specialised inference silicon is the response.

TPUs (Google) are the longest-running example. The original TPU v1 (deployed 2015, paper ISCA 2017) was a single 256×256 8-bit systolic array — a grid of multiply-accumulate units where input activations flow horizontally, weights flow vertically, and partial sums accumulate downward. The mechanism is mechanically simple and very dense. By construction the array does one operation: matrix multiplication. Everything else — activation functions, layer norm, attention softmax — happens in surrounding logic. Successive generations (v2 added training support, v3 added more memory, v4 added inter-chip optical interconnect, v5 split into v5e/v5p, Trillium and Ironwood follow) have widened, deepened, and added bf16/int8 support, but the core is still systolic matmul.
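
A toy cycle-level model of that weight-stationary dataflow, in Python. Each cell fires on the cycle when the activation injected into its row reaches its column, and the column accumulators collect the partial sums. It is an illustration of the scheduling idea, not TPU microarchitecture:

import numpy as np

K, N = 4, 4                               # K-long activation vector, K×N weight matrix
rng = np.random.default_rng(0)
W = rng.integers(-3, 4, size=(K, N))      # weights, loaded once and held in the cells
x = rng.integers(-3, 4, size=K)           # one activation vector, injected with a skew

acc = np.zeros(N, dtype=int)              # partial sums flowing down each column
for t in range(K + N - 1):                # one outer iteration per cycle
    for k in range(K):
        j = t - k                         # the activation for row k was injected at cycle k,
        if 0 <= j < N:                    # so at cycle t it sits under column j = t - k
            acc[j] += x[k] * W[k, j]      # the only work a cell ever does

assert np.array_equal(acc, x @ W)         # same result as the dense matmul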

A 4×4 systolic array — weights stay, activations flow, partial sums accumulate
  each MAC cell holds one weight wᵢⱼ; activations a₀, a₁, a₂, … flow in from the left, partial sums flow down the columns
  no shared bus, no register file, no scheduling — just a grid that does matmul
Each cell holds one weight and does one multiply-accumulate per cycle. The data movement is the operation; nothing is read from a remote register file. This is what makes the architecture energy-dense for matrix multiplication and useless for anything else.

Per-inference energy across silicon families for a ResNet-50 forward pass — energy per image, log scale
  CPU ≈ 1 J · GPU ≈ 0.1 J · TPU ≈ 30 mJ · NPU ≈ 10 mJ · LPU ≈ 5 mJ · neuromorphic (est.) ≈ 0.1 mJ
Rough order-of-magnitude figures, not benchmark numbers. The point is that each architectural step cuts the energy of the previous by roughly a factor of two to ten, and that neuromorphic estimates remain estimates until comparable workloads are run end-to-end.

NPUs are the mobile inference cousin — Apple's Neural Engine, Qualcomm's Hexagon, Google's Edge TPU. The constraints are different: a few watts, integer arithmetic, fixed kernel set, and absolute deterministic latency. The architecture is the same recipe (dense matmul plus a fixed activation/normalisation pipeline), miniaturised. A 2024-era flagship NPU runs 30-50 TOPS at int8 on under 5 watts.

FPGAs (Xilinx Versal, Intel Agilex) are the reconfigurable middle ground. Logic blocks plus DSP slices plus on-chip memory can be wired into any dataflow at synthesis time. The penalty is roughly 10x in clock speed and 10x in density versus an ASIC running the same dataflow. The advantage is that the chip is a programmable chip; new models do not require a fab cycle. Microsoft's Project Catapult put FPGAs in front of Bing search and trained the industry that this was a real option.

Custom inference ASICs are where the per-watt economics get extreme.

  • Groq LPUs — Language Processing Units. The architectural bet is deterministic dataflow with no caches, no branch predictors, no out-of-order execution. The compiler statically schedules every instruction across hundreds of compute tiles and SRAM banks. The result is order-of-magnitude lower latency on LLM decode (500-700 tokens per second on Llama-class models, versus 50-100 on a GPU) because the inference path has no stalls and no contention.
  • Cerebras WSE — Wafer-Scale Engines. A full silicon wafer (around 46,225 mm²) treated as one chip with 900,000 cores and 44 GB of on-die SRAM. The mechanism is that on-chip is faster than off-chip; if you can fit a full model on one wafer, all communication is on-die and you skip the HBM and the interconnect entirely.
  • Tenstorrent — RISC-V-based scalable accelerators (Wormhole, Blackhole, Grayskull) with a tile-grid architecture and on-chip Ethernet. The play is an open software stack (TT-NN, TT-Metal) at a price point that undercuts H100-class GPUs.
  • AWS Trainium and Inferentia — Annapurna Labs' custom training and inference chips, offered as managed instances on AWS. The play is vertical integration: train cheaper than on rented GPUs, inference cheaper than on rented GPUs, all behind a familiar API.

Worked example: per-inference energy for a ResNet-50 forward pass

Take a concrete, well-studied workload: one ResNet-50 forward pass at fp16 or int8, batch size 1, on a 224×224 input. Around 8 GFLOPs of arithmetic. The energy cost varies by an order of magnitude or more across the silicon families:

CPU (modern x86 server, AVX-512):        1.0 J  per image
GPU (V100 / A100, mixed precision):      0.1 J  per image
TPU v4 / v5 (int8 systolic):              30 mJ per image
mobile NPU (Apple ANE, Hexagon, int8):    10 mJ per image
custom LPU (Groq-class):                   5 mJ per image
analog / neuromorphic estimate:          0.1 mJ per image

These are rough order-of-magnitude figures synthesised from MLPerf inference results and published efficiency disclosures. The exact numbers vary by model variant and precision, by software stack, by batch size, and by whether memory access is counted. The point is the spacing: each architectural step buys roughly half to one order of magnitude in energy. Multiplied across 10¹⁰ inferences per day across a large product, the difference between the CPU and the LPU is measured in megawatt-hours of electricity per day.
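
The arithmetic behind that last sentence, using the same illustrative per-inference figures:

inferences_per_day = 1e10
joules_per_inference = {"CPU": 1.0, "GPU": 0.1, "TPU": 30e-3, "NPU": 10e-3, "LPU": 5e-3}

for name, j in joules_per_inference.items():
    mwh_per_day = j * inferences_per_day / 3.6e9      # 1 MWh = 3.6e9 J
    avg_kw = j * inferences_per_day / 86_400 / 1e3    # continuous draw in kW
    print(f"{name}: {mwh_per_day:6.2f} MWh/day ({avg_kw:6.1f} kW average)")

# CPU ≈ 2.78 MWh/day (≈ 116 kW continuous); LPU ≈ 0.01 MWh/day (≈ 0.6 kW),
# before counting the rest of the server, the network, and the cooling overhead.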

This is why specialised silicon exists: at scale the answer is decided by joules per inference, not headline TOPS.

The trade-off is the same shape every time. A specialised chip wins on the workload it was designed for and is useless or slow on anything else. A change in model architecture — attention replaced by something new, a structurally different layer shape — can render a generation of inference ASICs obsolete before they age out of the data centre. The TPU's bet on bf16 systolic matmul aged well; an ASIC that bet specifically on a 2018-era CNN dataflow did not. Hyperscalers absorb this by amortising chip design across many workloads and many years; smaller players rent the silicon.

The instruction set of Act II generalised; the specialised silicon of this section narrows back down. Same trade-off as ever: flexibility versus efficiency, with the optimum point shifting as the workload calcifies.

Neuromorphic and analog computing

Conventional silicon — even the specialised silicon above — keeps the von Neumann split: memory in one place, compute in another, with a bus between. The bus is the bottleneck. Reading a 32-bit weight from off-chip DRAM costs roughly 640 pJ; the multiply-accumulate that consumes it costs a few picojoules. The ratio is roughly two orders of magnitude; most of the energy goes into moving the weight, not using it. The biological brain does not have this split. Synapses store weights and compute on them in the same physical structure, communicate via discrete asynchronous spikes, and consume around 20 W for the whole 86-billion-neuron system. The bet is that engineering can borrow enough of that pattern to make it useful.

Neuromorphic hardware models the brain's primitives: discrete neurons, weighted synapses, spike-based communication, event-driven execution. Intel Loihi 2 has 128 neuromorphic cores per chip with 1 million programmable neurons; it processes spikes asynchronously, idle cores draw near-zero power, and the asynchronous architecture means latency is set by spike propagation, not by a global clock. IBM TrueNorth (Merolla et al., 2014) was the early demonstration: 4096 cores, 1 million neurons, 256 million synapses, 70 mW on inference workloads. IBM's NorthPole (2023) is the modern follow-up — a single chip that fuses memory and compute and posts ResNet-50 inference at energy efficiencies IBM reports as roughly 25x better than contemporary GPUs.

Spiking-neuron timing — membrane potential V, threshold θ, and output spikes
  leaky integrate-and-fire neuron: integrate inputs · leak slowly · spike when V crosses θ · reset
Each input spike pushes the membrane potential up; in the absence of input it leaks toward rest. When the potential crosses threshold the neuron emits a spike and resets. Information is encoded in spike timing and rate, not in continuous activations.

The computational primitive is the Spiking Neural Network (SNN). A spiking neuron integrates incoming spikes weighted by synaptic strengths, leaks toward a resting potential, and fires its own spike when a threshold is crossed. The output is a sparse, asynchronous sequence of binary events in time, not a dense vector of floats. This sparsity is where the energy savings live: if a neuron does not fire, it consumes near-zero power; if its downstream neurons see no input, they consume near-zero power. Compare to a dense matrix multiply, which expends the full energy of every multiply-accumulate regardless of whether the values are near-zero.
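
A minimal discrete-time leaky integrate-and-fire neuron in Python; the leak, threshold, and input train are illustrative, not calibrated to any hardware:

leak, threshold, v = 0.9, 1.0, 0.0
inputs = [0.0, 0.6, 0.0, 0.5, 0.3, 0.0, 0.0, 0.9, 0.4, 0.0]   # weighted input per time step

for t, i_in in enumerate(inputs):
    v = leak * v + i_in          # integrate the weighted input, leak toward rest
    if v >= threshold:
        print(f"spike at step {t}")
        v = 0.0                  # reset after firing
    # on steps with no input and no spike, nothing happens;
    # that silence is where the energy savings live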

The other half of the story is in-memory compute and memristor crossbars. A crossbar array of resistive devices (memristors, phase-change cells, ReRAM) can perform a vector-matrix multiply in a single physical step: apply voltages on the row lines, read currents on the column lines, and Kirchhoff's current law sums the products in the analog domain. The energy cost is dominated by reading the result, not by the multiply itself — somewhere around 1-10 fJ per operation in laboratory devices versus 1-10 pJ for digital. Processing-in-Memory (PIM) is the more conservative version: keep the digital arithmetic but move compute units physically inside the DRAM die, eliminating the off-chip bus for the hottest accesses. Samsung's HBM-PIM and SK Hynix's GDDR6-AiM are first-generation commercial PIM.
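
The crossbar's analog multiply reduces to Ohm's law per device and Kirchhoff's current law per column. A sketch with made-up conductances; the physics is the dot product:

import numpy as np

# Conductances G[i][j] (siemens) encode the weights, one resistive device per crosspoint.
G = np.array([[1.0e-6, 3.0e-6],
              [2.0e-6, 0.5e-6],
              [4.0e-6, 1.0e-6]])       # 3 row lines × 2 column lines
v = np.array([0.2, 0.1, 0.3])          # voltages applied to the row lines (the input vector)

# Ohm's law per device (I = G·V), Kirchhoff's law per column (currents sum):
# every column current is one dot product, computed in a single physical step.
i_columns = v @ G
print(i_columns)                       # column currents in amperes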

The catch is the programming model. Mapping a model trained with gradient descent in dense fp32 onto a sparse spiking network with stochastic devices and limited precision is its own research field. Direct training of SNNs uses surrogate gradients to approximate the non-differentiable spike function during backprop. ANN-to-SNN conversion trains a conventional network and then converts the activations to firing rates. Neither matches the accuracy of plain digital training on large benchmarks today, though the gap is closing. Analog crossbars have their own headaches: device variation, drift over time, limited write endurance, and the difficulty of writing precise analog weights. Mixed-precision and digital-correction schemes exist; the engineering is real and unfinished.

The honest assessment for 2026: neuromorphic and analog computing are commercially deployed for narrow inference niches (always-on keyword spotting on phones, event-based vision sensors, edge ML on milliwatt budgets) and remain research for the general-purpose deep-learning workload. The energy ceiling is high enough that the field will keep pushing. The substrate of Act II — clocked digital — may not be the substrate that runs the next decade of inference.

Spatial computing

Conventional displays draw a frame, the user looks at it, and 30 ms of latency is invisible because the user is not turning their head. A head-mounted display is welded to the user's skull. When the head turns, the displayed image must rotate to match — and the threshold below which the mismatch is not perceived as lag, and above which it starts to drive nausea, is around 11 milliseconds of motion-to-photon latency. That is the budget within which the entire system — IMU sample, sensor fusion, application logic, render, display scan-out — has to complete a frame.

The solution is a runtime that decouples the application's render rate from the display's frame rate and uses prediction plus late-stage reprojection to hit the budget. The cross-vendor standard is OpenXR (Khronos, 2019), an API that abstracts the input devices, the spatial tracking system, and the swapchain across Vision Pro, Quest, Pico, and Windows Mixed Reality. The application submits a render at its native frame rate (often 60-90 fps); the runtime warps the latest frame to the predicted head pose at scan-out time, regardless of whether the application missed a frame.

Motion-to-photon timing pipeline for a VR frame — the 11 ms wall
  IMU (sample pose) → predict (extrapolate) → render (draw frame) → timewarp (warp to new pose) → scan-out (display photons), 0 ms to 11 ms
  miss the wall and the user feels lag; miss it badly and they feel sick
The application's render budget is the wide middle slice. The runtime owns the IMU sample at the start and the timewarp at the end, and decides at scan-out which frame to display warped against which pose.

Three engineering techniques absorb the budget gap. Pose prediction — the runtime extrapolates the head's position 10-15 ms into the future at the moment the application starts rendering, so the render targets where the head will be when the photons arrive, not where it is now. The IMU runs at 1 kHz; the prediction is a Kalman filter or similar over angular and linear velocity. Asynchronous timewarp (ATW) — after the application finishes rendering, the runtime samples the head pose again immediately before display, and warps the rendered frame to match. The warp is a fast homography on the GPU and shaves the last few milliseconds. Asynchronous spacewarp (ASW) — when the application drops a frame, the runtime synthesises an intermediate frame by warping the previous one with depth and motion vectors. ASW is what keeps the experience smooth when the application can only sustain 45 fps on a 90 Hz display.
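
A deliberately simplified, single-axis sketch of the prediction step under a constant-angular-velocity assumption. Real runtimes filter a full 6-DoF pose (typically with a Kalman-style filter over the 1 kHz IMU stream), but the extrapolation looks like this; all numbers are illustrative:

yaw_now_deg = 31.0          # latest fused head-yaw estimate
yaw_rate_dps = 120.0        # angular velocity from the IMU, degrees per second
lookahead_s = 0.012         # time from render start to scan-out, roughly 12 ms

yaw_predicted = yaw_now_deg + yaw_rate_dps * lookahead_s
print(yaw_predicted)        # 32.44 deg: render against this pose; timewarp
                            # corrects the residual just before scan-out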

Foveated rendering is the other major mechanism. The human eye has high acuity only in the fovea — the central 2-3 degrees of vision; everything else is blurry. A render that draws the full display at full resolution wastes pixels the user cannot see. Eye tracking (Vision Pro, Quest Pro, PSVR2) reports gaze direction with a few milliseconds of latency; the renderer shades the foveal region at full rate, the periphery at quarter or eighth rate, and the total fragment shader cost drops 40-60% with no perceptual loss. Fixed foveated rendering (no eye tracking, just a guess at the centre of the lens) is the simpler fallback and is standard on Quest 3.

Spatial input is the other half of the runtime. Headsets in 2026 track hands at millimetre-level precision from the headset's own inside-out cameras at 60-120 Hz; pinch and tap gestures replace controllers for most applications. Voice input is integrated where ambient noise allows. Eye-tracking-plus-pinch is the Vision Pro's primary input model: look at something, pinch your fingers, and the system treats it as a click. Each input stream brings its own latency and confidence; the application combines them with the same kind of sensor fusion the pose pipeline does.

The honest trade-off is that presence — the perceptual illusion that you are physically in the rendered space — depends on every layer hitting its budget every frame. A single dropped frame is felt. A render that lands 12 ms late instead of 10 ms produces a tiny but reliable nausea response in a fraction of users. The headset hardware and the OS scheduler and the GPU driver and the application all have to cooperate; if any layer misbehaves, the experience breaks. This is the closest engineering analogue to hard-real-time scheduling in Act IV that a consumer product has.

Whether this becomes a frontier or a niche — wearable all day, used for hours, supplanting the phone in some workflows — is open. The substrate works; the application surface is what 2026-2030 will decide.

The energy frontier

Compute used to be a software cost. Then it became a chip cost. As of the late 2020s, it is a power cost — specifically a primary-electricity-generation cost, and increasingly a carbon-emissions cost. Training a frontier-class language model takes 10²³ to 10²⁵ FLOPs, which on current hardware runs on tens of thousands of accelerators drawing tens of megawatts for weeks to months. A single run consumes megawatt-months of electricity. Inference at planetary scale — billions of queries per day — sustains gigawatt-class data-centre load. The grid notices.

The mechanism that gates how much of the electricity actually reaches the compute is Power Usage Effectiveness (PUE) — the ratio of total facility power to compute power. Defined by The Green Grid, PUE = total facility power / IT equipment power. A PUE of 1.0 means every watt drawn from the grid ends up doing arithmetic; a PUE of 2.0 means half goes to cooling, lighting, power conversion, and other overheads. State-of-the-art hyperscale facilities run PUE 1.1-1.2; legacy enterprise data centres run 1.6-2.0. Cooling is the dominant overhead; the rest is power conversion losses, transformer heat, and lighting.

Data-centre cooling — heat flow from die to atmosphere
  Direct-to-chip liquid: die (≈100 °C) → cold plate (≈65 °C) → CDU (≈45 °C) → cooling tower → ambient
  Immersion variant — server submerged in dielectric fluid: no cold plates · fluid carries heat directly · density ≈ 100 kW per rack
  Air cooling — fans, hot/cold aisles, CRAC units: legacy default · density tops out at ≈ 30 kW per rack · approaches its limit for H100/B200 racks
Air cooling tops out around 30 kW per rack; modern accelerator racks demand 80-120 kW. Direct-to-chip liquid cooling and full immersion are the answers, both of which require new rack designs and new mechanical infrastructure.

Liquid cooling is the architectural response. Direct-to-chip liquid cooling — cold plates on each die with a coolant loop carrying heat to a coolant distribution unit (CDU) and out to a dry cooler or cooling tower — handles the 80-120 kW per rack that contemporary accelerator clusters draw. Immersion cooling goes further, submerging entire servers in dielectric fluid; the fluid carries heat directly off the components without a separate cold plate. Single-phase immersion uses mineral or synthetic oils; two-phase uses fluorocarbons that boil at around 50 °C and condense at the top of the tank. Either approach roughly halves the cooling-energy share of PUE.

The carbon side is the harder problem. Carbon-aware compute is the engineering practice of scheduling workloads in time and space against grid carbon intensity. The same training run consumes the same energy regardless of when it runs; the carbon emitted depends on the marginal generation source at the moment. Run during high solar/wind hours in a renewable-heavy grid and emissions can fall by 50% or more versus running during a coal-fired evening peak. Tools like the Green Software Foundation's carbon-aware SDK, Electricity Maps APIs, and AWS/Azure/GCP region carbon dashboards expose the data; the scheduling decision is engineering. Hyperscalers also pair data centres with on-site or contracted renewable generation, geothermal (Google's Nevada and Iceland projects), and increasingly nuclear (Microsoft's Three Mile Island restart, Amazon's small-modular-reactor agreements).
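
A sketch of the scheduling decision itself, assuming a hypothetical hourly carbon-intensity forecast; in practice the numbers come from a grid-intensity API (Electricity Maps, cloud carbon dashboards) for the target region:

forecast_g_per_kwh = {      # hour of day -> forecast grid intensity, gCO2e per kWh (made up)
    0: 420, 4: 390, 8: 310, 12: 120, 16: 180, 20: 450,
}
job_energy_kwh = 2_000      # the deferrable batch job draws this much whenever it runs

best = min(forecast_g_per_kwh, key=forecast_g_per_kwh.get)
worst = max(forecast_g_per_kwh, key=forecast_g_per_kwh.get)
saved_kg = job_energy_kwh * (forecast_g_per_kwh[worst] - forecast_g_per_kwh[best]) / 1000

print(f"start at hour {best}: saves {saved_kg:.0f} kg CO2e versus hour {worst}")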

The numbers that anchor the discussion as of 2026:

  • A GPT-3-class training run (2020): around 1287 MWh, 502 tonnes CO₂e (Strubell et al. updates).
  • A GPT-4 / Llama 3 / frontier-class training run (2024-2025): order of magnitude 50-200 GWh per run, with carbon depending heavily on grid mix.
  • Inference: roughly 1-10 watt-hours per long LLM response at frontier scale, multiplied across billions of daily queries.
  • Global data-centre electricity (IEA Electricity 2024): around 460 TWh in 2022, projected to roughly double by 2026 driven primarily by AI workloads.

The trade-off is uncomfortable. The marginal carbon cost of running a model is real and measurable. The marginal benefit is harder to quantify but plausibly large for some applications and absent for others. The engineering response — efficiency gains in silicon and software, carbon-aware scheduling, renewables, nuclear — is necessary but not yet sufficient at the trajectory the workload is on. AI engineering (Act VIII) and the operations practice of Act IXb increasingly include a kilowatt-hour budget and a carbon-intensity SLO. This is now part of the job.

Hyperscale compute

Training a 100B-parameter model on a single GPU would take centuries. The job has to fan out across thousands to hundreds of thousands of accelerators that act, for synchronisation purposes, as one machine. The mechanism is parallelism — distributing the model and the data across the accelerators — combined with an interconnect fabric that lets gradients reach every accelerator on every step within the iteration budget.

The parallelism dimensions stack. Data parallelism replicates the model on each accelerator and splits the batch; all-reduce averages gradients across replicas at the end of each step. Tensor parallelism splits individual layer weights across accelerators within a node — the matrix multiply for one layer is computed in pieces with cross-accelerator activation exchanges. Pipeline parallelism splits the model layer-wise across accelerators — different stages process different micro-batches in a software pipeline. Real frontier training runs combine all three (3D parallelism), and on top of that FSDP/ZeRO (Rajbhandari et al.) shards optimizer state, gradients, and parameters across accelerators so a model larger than any single GPU's memory still fits.

3D parallelism — data, tensor, and pipeline dimensions on the same cluster
  Data parallel: replicate the model, split the batch · all-reduce gradients
  Tensor parallel: split each layer's weights (W₀ … W₃) within a node · all-gather activations
  Pipeline parallel: split the layers across nodes (layers 1-8, 9-16, 17-24, 25-32) · micro-batch pipeline
  a 24,576-GPU run might be 8-way tensor × 96-way pipeline × 32-way data
The three dimensions are orthogonal and combine: tensor parallelism is bandwidth-bound and stays within a node, pipeline parallelism amortises across nodes, and data parallelism uses the all-reduce ring across replica groups.
Ring all-reduce across N GPUs — bandwidth-optimal gradient sync
  step 1: scatter-reduce — each node ends up owning one fully reduced shard
  step 2: all-gather — each reduced shard is broadcast around the ring
  2(N−1) steps total, each moving data/N per node — total per-node traffic ≈ 2·data, independent of N at the limit
  the ring is bandwidth-optimal for large messages — every link runs at line rate the whole time
The ring all-reduce is the canonical implementation in NCCL. Scatter-reduce splits the gradient across nodes and sums each shard around the ring; all-gather broadcasts the final shards back. Bandwidth per node is independent of N for sufficiently large messages.

The collective that dominates the bandwidth budget is all-reduce — every accelerator contributes a gradient tensor, every accelerator receives the sum. The naive implementation costs O(N) bandwidth per node; the ring all-reduce (Baidu, 2017) splits the tensor into N shards and pipelines a scatter-reduce around the ring followed by an all-gather around the ring, costing 2·(N−1) steps that each move data/N per node — total per-node traffic of approximately 2·data regardless of N. Tree and hybrid algorithms exist for small messages and high-latency links. NCCL (NVIDIA Collective Communications Library) is the production implementation across NVIDIA hardware; the corresponding pieces for AMD (RCCL) and Intel/AWS-Trainium (HCCL, CCL) follow the same shape.
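
A toy in-memory simulation of the two phases (no transport, just the ring indexing) which ends with every node holding the full gradient sum:

import numpy as np

N = 4                                       # ring of 4 nodes, gradient split into N chunks
rng = np.random.default_rng(1)
grads = np.stack([rng.standard_normal(N) for _ in range(N)])   # grads[i] = node i's gradient
buf = grads.copy()                          # buf[i, c] = node i's current value of chunk c

# Phase 1: scatter-reduce. After N-1 steps, node i owns the fully reduced chunk (i+1) % N.
for s in range(N - 1):
    outgoing = [buf[i, (i - s) % N] for i in range(N)]     # snapshot: sends happen "at once"
    for i in range(N):
        src = (i - 1) % N                   # receive from the left neighbour
        buf[i, (src - s) % N] += outgoing[src]

# Phase 2: all-gather. Each reduced chunk travels the rest of the way round the ring.
for s in range(N - 1):
    outgoing = [buf[i, (i + 1 - s) % N] for i in range(N)]
    for i in range(N):
        src = (i - 1) % N
        buf[i, (src + 1 - s) % N] = outgoing[src]

assert np.allclose(buf, grads.sum(axis=0))  # every node now holds the full sum
# per node: 2*(N-1) messages of size data/N, total traffic about 2*data regardless of N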

The interconnect is what makes those collectives feasible. NVLink provides 900 GB/s per H100, 1.8 TB/s per B200 between accelerators in a node. NVSwitch connects 8-16 GPUs within a node at full NVLink bandwidth per pair; for the largest configurations, NVLink Switch System extends the topology across multiple nodes. Cross-node communication uses InfiniBand (400 Gb/s NDR, 800 Gb/s XDR coming) or RoCE (RDMA over Converged Ethernet) on similar speeds. The defining mechanism is RDMA (Remote Direct Memory Access) — one node writes directly into another node's memory without involving the remote CPU, OS, or copies. RDMA latency between nodes runs at single-digit microseconds for small messages; without it the all-reduce would be CPU-bound long before it was network-bound.

Reliability at this scale is a different problem. With 25,000 GPUs running for three months, even a per-GPU mean time between failures measured in years still produces failures daily across the fleet. Published failure rates from large training runs sit around 0.5-2 GPU failures per day across a 10,000-GPU cluster. The mechanisms that absorb this:

  • Checkpoint frequently. A typical frontier run checkpoints every 30 minutes to 2 hours. Checkpoint size for a 100B parameter model is around 1-2 TB; writing it to a parallel filesystem in under a minute is its own engineering problem.
  • Detect and recover. A failed GPU produces NaN gradients or NCCL timeouts; the framework (PyTorch DDP, Megatron-DeepSpeed) restarts from the last checkpoint with the dead GPU's work redistributed.
  • Replace transparently. Cluster managers (Slurm, Ray, custom hyperscaler stacks) keep a hot pool of spare nodes; a failure swaps a spare in and resumes within minutes.

The unfinished engineering is in efficiency. Strong scaling — adding accelerators to reduce wall-clock time on a fixed workload — degrades at large N because communication grows faster than compute. A modern frontier run might sustain 35-50% Model FLOPs Utilization (MFU) — the ratio of useful arithmetic to peak hardware capability — across the cluster, with the rest lost to communication stalls, kernel-launch overhead, memory bandwidth, and pipeline bubbles. Pushing MFU higher is a perpetual focus of the training-systems literature.
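
MFU is a single division once the useful-FLOPs estimate is fixed. A sketch using the common rule of thumb of roughly 6 FLOPs per parameter per trained token; the throughput and peak numbers are illustrative:

params = 100e9                  # 100B-parameter dense model
tokens_per_s = 6.5e6            # cluster-wide training throughput, tokens per second
gpus = 10_000
peak_flops_per_gpu = 989e12     # approximate H100 dense bf16 tensor-core peak

useful_flops_per_s = 6 * params * tokens_per_s      # forward + backward arithmetic
peak_flops_per_s = gpus * peak_flops_per_gpu
print(f"MFU = {useful_flops_per_s / peak_flops_per_s:.1%}")   # about 39% with these numbers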

This is the operations practice of Act IXb applied to a 200 MW supercomputer running for two months. It is also the distributed-systems craft of Act Vb at a scale where the eight fallacies are not theoretical.

AGI considerations

The discourse around AGI (Artificial General Intelligence) is contested at every level: what it means, whether current architectures get there, what timelines look like, what risks attend it. This section is deliberately measured. It describes what the technical literature actually says, where the disagreements are, and what engineering practice has converged on so far.

The empirical anchor is the scaling laws literature. Kaplan et al. (2020) and the Chinchilla paper (Hoffmann et al., 2022) showed that, for transformer language models, loss falls as a smooth power law in compute, parameters, and training tokens, with the optimal balance roughly parameters proportional to tokens. Capability on a wide range of downstream benchmarks improved in parallel, often emergently — a model trained for next-token prediction picks up arithmetic, translation, code generation, and so on without those being explicit training objectives. The trajectory from GPT-2 (2019) to GPT-4 (2023) to the 2025-2026 frontier models followed the curve. Each generation traded an order of magnitude more compute for a substantial capability lift.
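
The functional form is simple to state; the sketch below uses made-up constants for illustration only, not values fitted to any published model family:

# Chinchilla-style power law: loss(C) = E + A * C**(-alpha), with C = training compute in FLOPs
E, A, alpha = 1.7, 1.4e3, 0.15          # irreducible loss, scale, exponent (all invented)

def loss(compute_flops: float) -> float:
    return E + A * compute_flops ** (-alpha)

for c in (1e21, 1e23, 1e25):
    print(f"{c:.0e} FLOPs -> loss {loss(c):.2f}")
# each 100x in compute multiplies the reducible term by 100**(-0.15), roughly 0.5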

Scaling laws — loss as a power law in compute, parameters, and tokens
  test loss (roughly 5.0 down to 2.0) falls smoothly as training compute grows from 10¹⁹ to 10²⁷ FLOPs (log scale): GPT-2 → GPT-3 → PaLM / Chinchilla → GPT-4 era → 2025-26 frontier?
  Loss falls as a power law — until it might not
The dashed extension is where the 2024-2026 debate lives. The smooth curve described GPT-2 through GPT-4. Whether it continues to GPT-N with the same slope, flattens, or bends to test-time-compute is the open question.

The post-scaling debates of 2024-2026 are where the picture gets harder. Several practitioners and labs have reported diminishing returns on naive parameter-and-data scaling — capability gains are smaller per order of magnitude of compute than the early curves projected, and the data ceiling (high-quality text on the open internet) is finite. The response has been to scale test-time compute — reasoning models like OpenAI's o1/o3, Anthropic's extended-thinking Claude variants, and Google's Gemini thinking variants spend more compute per query through internal chain-of-thought rather than only at training. Whether this is a new exponential or a one-time lift is what 2026-2028 will resolve. There is also active work on different training paradigms: synthetic data at scale, self-play, environment-grounded learning. The substrate may shift before the curve flattens.

What "general" means in this context is itself unsettled. The minimal claim is "broadly competent across tasks humans find easy." Some definitions require capability matching or exceeding skilled humans across most economically valuable cognitive work. Some require autonomous agentic capability — the system can take long-horizon real-world actions, recover from errors, and accomplish goals without per-step human direction. Some require capabilities humans do not have at all. No single definition has won, and the engineering implications depend heavily on which one is being targeted. The pragmatic working definition most labs use is task-shaped: pick measurable benchmarks at the edge of current capability and chase them.

Evals at frontier scale are themselves a frontier. The benchmarks that defined the early LLM era — MMLU, HumanEval, GSM8K — saturated; frontier models score near 100% and the spread between systems disappears. The replacement is harder: GPQA (graduate-level science), Humanity's Last Exam, SWE-bench Verified (real software engineering tasks), tau-bench (agentic interactions), and frontier capability evals run by labs themselves and by safety institutes. The honest issue is that as evals approach human expert performance, building them requires human experts in the loop, which makes them slow, expensive, and impossible to fully automate. Frontier evaluation is now its own discipline, and AI engineering (Act VIII) has absorbed it as a routine part of shipping models.

The alignment problem is the engineering concern that sits underneath the rest. The standard formulation: as systems become more capable, ensuring their behaviour matches the intent of the people deploying them, and matches broader human values, gets harder to verify. The research-side work — interpretability (mechanistic interpretability, sparse autoencoders, circuit analysis), oversight (scalable oversight, debate, recursive reward modelling), capability evaluations (dangerous capability evals, autonomous-replication evals), and constitutional and RLHF methods — feeds into engineering practice via specific artefacts: safety policies and refusal behaviours baked into model fine-tuning, red-team exercises run before deployment, capability evaluations gated by responsible scaling policies (Anthropic's RSP, OpenAI's Preparedness Framework, Google DeepMind's Frontier Safety Framework), and post-deployment monitoring for misuse and unexpected behaviour. Engineers working at and around frontier labs cannot fully opt out of this work; it is part of shipping.

The honest uncertainty is large. Whether current scaling continues to general capability, whether it stalls and a new paradigm is needed, whether the timelines being discussed are calibrated — none of these have settled answers as of 2026. What is converging is the engineering response: capability evaluations as a prerequisite for release, deployment policies that gate model access by capability tier, infrastructure for monitoring and shutdown, and an emerging regulatory layer (EU AI Act, the US AI executive orders, the UK and Singapore AI safety institutes) that codifies parts of this practice. None of this answers the underlying question — what to build, how to align it, what the risks are — but it does shape the practice working engineers will be expected to follow. The substrate of Act VIb and the engineering practice of Act VIII are where the rubber meets the road.

Beyond silicon

CMOS silicon scaling, the engine of Moore's law for fifty years, is slowing. Transistor feature sizes are at 2-3 nm in 2026; below 1 nm the device is a few dozen atoms across and quantum effects dominate. Power density per die has hit a ceiling — chips cannot dissipate more than around 1 W/mm² without exotic cooling — and clock speeds have not climbed meaningfully since around 2005. The industry's response so far has been more cores, specialised silicon, and 3D stacking (HBM, chiplets, advanced packaging like TSMC's CoWoS). All of these are inside the silicon envelope. What follows are candidate substrates that are not. Speculation warning: most of this is research, not product. Where 2026 hardware exists, it is named; otherwise it is identified as speculative.

Photonic and optical computing replaces electrons with photons as the carrier of computation. The clearest near-term win is matrix multiplication: a Mach-Zehnder interferometer mesh can compute a vector-matrix product in essentially a single pass-through, at the speed of light through silicon waveguides, with energy dominated by the laser source and the read-out detectors rather than per-operation switching. Lightmatter (Envise, Passage), Lightelligence, Luminous, and Optalysys are early commercial efforts; PsiQuantum uses photonic substrates for quantum computing rather than classical. The numbers people quote are 10-100x improvements in energy per operation for matrix-heavy workloads. The honest limitations are real: nonlinear functions (activations) still require electronic conversion, optical memory is hard, integration density lags CMOS by orders of magnitude, and the manufacturing process is its own engineering programme. Photonic accelerators for specific workloads — communications, signal processing, fixed-shape neural-network inference — are commercially shipping in 2025-2026; general-purpose photonic compute is research.

DNA and molecular storage addresses storage density rather than compute. The premise is that DNA encodes roughly 2 bits per base pair, and a gram of DNA can in principle store on the order of 10¹⁸ bytes — about a million times denser than the densest tape. Church (2012) and Goldman et al. (2013) demonstrated end-to-end encoding of books, images, and short videos into DNA and retrieving them; subsequent work (Microsoft, Twist Bioscience, CATALOG) has pushed retrieval costs down and scale up. The bottleneck is write speed: current synthesis runs at hundreds of bases per second per channel, against a hard drive's tens of millions of bytes per second. Reading via sequencing is faster but still measured in hours per gigabyte. The trade-off as of 2026 is that DNA storage is interesting for archival cold storage where access is rare and density and longevity are paramount — DNA at room temperature is readable for thousands of years — but is not a hot-tier substitute. The economics are not yet competitive with magnetic tape at scale, though both speed and cost are dropping by roughly an order of magnitude every few years.

Molecular and chemical computing is the broader umbrella — using chemical reactions or biological systems as computational substrates. The work is largely academic in 2026: implementing Boolean logic in chemical reaction networks, building cellular computers from engineered bacteria, designing molecular memory. None of it is approaching the scale or speed of silicon, and most of it probably will not. The cases where it might matter are niche — sensors that compute directly on biological samples, computation embedded in living tissue for medical applications — rather than general purpose.

Neuromorphic photonics combines two of the threads above: spiking neural networks implemented in photonic substrates. Early work (Princeton, MIT, EPFL) suggests potential energy efficiencies far beyond either electronic neuromorphic or conventional photonic compute. Speculative as of 2026.

The broader speculation, valid as speculation: the 2030s and 2040s probably do not run on the same silicon CMOS that powered 1980-2025. They may run on a hybrid — silicon for control and general-purpose work, photonics for the bulk matrix multiplication, neuromorphic for sparse event-driven workloads, conventional storage for hot data, DNA for archive. They may run on something not yet named. The bit of Act I remains the unit; the substrate that holds the bit is what is shifting.

The walls

Engineering progresses by routing around walls. Some walls are economic — too expensive to push further but no physical impossibility. Some are physical — the universe says no, no further engineering possible. The frontier is partly defined by which walls are which.

The Landauer limit is the thermodynamic floor. Rolf Landauer (1961) proved that erasing a single bit of information has an unavoidable energy cost of at least kT ln 2, where k is Boltzmann's constant and T is the absolute temperature. At room temperature (T = 300 K) this evaluates to approximately 2.87 × 10⁻²¹ J per bit erased. Modern transistors operate around 10⁻¹⁵ J per switch, a factor of roughly a million above the limit, so engineering has runway. But the limit is real and fundamental: it follows from the second law of thermodynamics applied to information. The only theoretical out is reversible computing — Bennett (1973) showed that if a computation is logically reversible, in principle no information is erased and no minimum energy is required. Practical reversible computing remains a research direction; the engineering overheads have so far made it uncompetitive with conventional logic, even as the latter approaches the Landauer floor.
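
The floor itself is one line of arithmetic:

import math

k = 1.380649e-23                 # Boltzmann constant, J/K
T = 300                          # room temperature, K
landauer_j = k * T * math.log(2)

switch_j = 1e-15                 # rough modern CMOS switching energy, from the text
print(f"Landauer floor: {landauer_j:.2e} J per bit erased")    # ~2.87e-21 J
print(f"headroom: about {switch_j / landauer_j:.0e}x")         # a few hundred thousand-fold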

The walls — log-scale ceilings on energy, latency, and bandwidth
  CMOS transistor switching energy (J per bit, 10⁻¹² down to 10⁻²¹) plotted against the Landauer floor, 1990 → 2030?
  CMOS switching energy approaching the thermodynamic floor — six orders of magnitude of headroom remain, and most will be spent on the way down
The curve is approximate. The point is that conventional logic still has many decades of energy efficiency available before hitting the thermodynamic floor — but the floor exists, and reversible computing is the only theoretical path past it.

The memory wall is the divergence between CPU speed and DRAM access latency. CPUs have grown faster by orders of magnitude since 1980; DRAM access latency has improved by roughly a factor of 10 over the same period. The result is that a modern CPU can execute 300-500 instructions in the time it takes to fetch one cache miss from main memory. The mechanisms that absorbed the gap — multi-level caches, out-of-order execution, prefetching, hyper-threading — have largely run their course; what remains is architectural rather than circuit-level. HBM (High Bandwidth Memory) stacked on package, CXL (Compute Express Link) for coherent memory pooling, processing-in-memory, and on-die SRAM scaling (the Cerebras and Groq plays from earlier) are the contemporary attempts. The wall is not physical — bandwidth and latency can in principle improve further — but the cost curve bends sharply.

The speed-of-light wall is physical and absolute. Light travels approximately 1 foot per nanosecond in vacuum, slower in fibre. A signal from New York to London over fibre is at minimum roughly 28 ms one-way at the great-circle distance (about 19 ms in vacuum), in practice 35-40 ms. A signal to a low-Earth-orbit satellite is 2-3 ms one-way; to geostationary, 120 ms. Distributed systems inherit these latencies as floors no engineering removes. The mechanisms that work around them — edge compute, CDNs, regional replication, eventual consistency, optimistic concurrency — push computation closer to users rather than fighting the underlying constraint. Strict global consistency at planetary distance is bounded below by light-speed round-trips; this is not changing.
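
The floor for one concrete path, as arithmetic; the 1.47 factor is a typical fibre refractive index, not a constant of nature:

distance_km = 5_570              # New York to London, approximate great-circle distance
c_vacuum_km_per_s = 299_792
fibre_slowdown = 1.47            # light in glass travels at roughly c / 1.47

one_way_vacuum_ms = distance_km / c_vacuum_km_per_s * 1000
one_way_fibre_ms = one_way_vacuum_ms * fibre_slowdown
print(f"vacuum floor: {one_way_vacuum_ms:.1f} ms one-way")     # ~18.6 ms
print(f"fibre floor:  {one_way_fibre_ms:.1f} ms one-way")      # ~27.3 ms
# real routes add path stretch, queuing, and equipment delay: 35-40 ms in practice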

The cooling wall is the per-die heat density limit. Modern dies dissipate 300-700 W in roughly 800 mm² — around 0.5-0.9 W/mm². Pushing higher requires direct liquid cooling, immersion, or eventually exotic substrates. The fundamental cause is junction temperature: silicon transistor performance and reliability degrade above around 100 °C, and the rate at which heat can be carried off through a few millimetres of silicon and packaging to a coolant is limited by thermal conductivity. 3D stacking (HBM, chiplets) makes this worse — more transistors per footprint without more surface area. The architectural response is to spread the work across more chips, package, and rack — at which point the energy frontier of the earlier section becomes the binding constraint.

Amdahl's law is the algorithmic wall. Gene Amdahl (1967) observed that the speed-up of a program from parallelisation is bounded by the fraction that remains sequential. If 5% of a program is inherently serial, no amount of parallel hardware delivers more than a 20x speed-up — the serial fraction alone caps it. Frontier ML training exhibits this constantly: communication, synchronisation, and pipeline-fill effects bound MFU below 100% regardless of how many GPUs are added. The wall is not negotiable for a given algorithm; engineering it away means changing the algorithm, not the hardware.
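
The bound as a function, evaluated at the 5% serial fraction from the paragraph above:

def amdahl_speedup(p: float, n: int) -> float:
    # p = parallelisable fraction, n = number of parallel workers
    return 1.0 / ((1.0 - p) + p / n)

for n in (8, 64, 1024, 1_000_000):
    print(f"{n:>9} workers: {amdahl_speedup(0.95, n):5.1f}x")
# 5.9x, 15.4x, 19.6x, 20.0x: the 5% serial fraction caps the speed-up at 20x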

The cost of intelligence wall is the implicit one underneath the others. If frontier model training continues to grow by an order of magnitude per generation, by 2030 a single training run plausibly consumes percentage-of-national-grid electricity. The wall is economic and political before it is physical. Whether the field's trajectory continues to that wall, or whether efficiency gains in silicon, algorithms, and training methods bend the curve earlier, is the open question of the next five years.

The walls vary in negotiability. The Landauer limit is absolute but distant. Light speed is absolute and present. The memory wall, the cooling wall, and the energy wall are economic and engineering-bounded; pushing them costs money, and at some price they bend. Amdahl's wall is algorithmic and is only beaten by changing what you are computing. Knowing which is which is part of being honest about what comes next.

Standards

The frontier is by definition the part of the field where the standards are still forming. What follows is a mix of finalised standards, anchor papers, working drafts, and the canonical implementations engineers actually pin against.

Post-quantum cryptography

  • NIST FIPS 203 (2024) — Module-Lattice-Based Key-Encapsulation Mechanism Standard. ML-KEM (formerly Kyber). The first finalised post-quantum KEM.
  • NIST FIPS 204 (2024) — Module-Lattice-Based Digital Signature Standard. ML-DSA (formerly Dilithium).
  • NIST FIPS 205 (2024) — Stateless Hash-Based Digital Signature Standard. SLH-DSA (formerly SPHINCS+).
  • CRYSTALS — pq-crystals.org. The original Kyber and Dilithium project pages with full specifications and reference implementations.
  • SPHINCS+ — sphincs.org. Specification and reference for the hash-based signature scheme behind SLH-DSA.
  • IETF hybrid TLS drafts — X25519MLKEM768 and the earlier X25519Kyber768Draft00. The hybrid key-exchange groups now deployed in Chrome, Firefox, and Edge.
  • Open Quantum Safe — openquantumsafe.org. The reference open-source library (liboqs) and the OQS-OpenSSL provider that lets servers and clients test PQC integration.
  • NIST SP 800-208 — Stateful Hash-Based Signature Schemes. LMS and XMSS for firmware and code-signing where state can be managed.
  • NIST IR 8413 — Status report on the third round of the PQC competition, the document that locked in the eventual standards.

Quantum computing

Specialised silicon

Neuromorphic and analog

Spatial computing

  • OpenXR — Khronos OpenXR specification. The cross-vendor runtime API.
  • Apple visionOS — developer.apple.com/visionos. RealityKit, ARKit, and the runtime documentation for Vision Pro.
  • Meta Quest — developers.meta.com/horizon. Quest SDK, OpenXR conformance, and the Presence Platform documentation.
  • Foveated rendering — Guenter et al., "Foveated 3D Graphics" (SIGGRAPH Asia 2012); Meta's Variable Rate Shading documentation; NVIDIA Foveated Rendering technical disclosures.
  • Asynchronous timewarp — Carmack's original Oculus blog post (2014) and subsequent VR runtime literature.
  • Motion-to-photon — The 20 ms / 11 ms latency thresholds in the simulator-sickness literature (Stanney et al., Kennedy & Stanney).

The energy frontier

Hyperscale training

AGI considerations

Beyond silicon

The walls

Cross-act references

  • The bit that Landauer's limit prices is defined in Act I.
  • The silicon transistors and clocked machine model that this act reaches past are in Act II.
  • The classical cryptography that post-quantum migration replaces is in Act VIIa.
  • The ML substrate that frontier models extend is in Act VIb.
  • The engineering practice around those models — prompts, retrieval, tools, evals, product layer — is in Act VIII.
  • The operations reality of running compute at hyperscale — capacity, observability, incident response, SLOs — is in Act IXc.

Going deeper

Branches that earn their own article.

  • NIST PQC standards walkthrough (FIPS 203, 204, 205).
  • Hybrid TLS deployment guides.
  • Qiskit and OpenQASM tutorials.
  • Quantum error correction (surface codes, magic-state distillation).
  • TPU architecture deep dives.
  • Loihi 2 and neuromorphic programming.
  • OpenXR specification.
  • Data-center cooling and PUE in depth.
  • Distributed training internals (FSDP, 3D parallelism).
  • Photonic and optical compute research.
  • DNA data storage research.
  • Landauer's principle and reversible computing.