Act IV of X

Orchestration

The OS is a referee that makes every program believe it has the whole machine to itself.

A modern computer runs hundreds of programs on hardware that, at the silicon level, can do exactly one thing per core at any instant. Something has to multiplex the cores, partition the memory, arbitrate the disk, and stop one program's bug from corrupting another's data. That something is the operating system. Every abstraction in this act — process, address space, file, socket, container — exists to maintain one illusion: each program runs as if it had the machine to itself, while the OS time-slices the real hardware underneath.

[Figure: Kernel and user space, the layered stack the OS maintains. Four user-space processes (browser, editor, database, shell) run in ring 3 and reach the ring 0 kernel through syscalls. The kernel holds the scheduler, virtual memory, VFS, network stack, IPC and signals, page cache, block layer, and drivers, and reaches the hardware (CPU cores, RAM and MMU, disk, SSD, NIC, timers and IRQs) through those drivers.]
Every file read, every page of memory, every TCP byte crosses this boundary. The kernel is a small privileged program; the rest of the system is everything that asks it for things.

Why an OS exists

A bare CPU executes whatever instructions you feed it, including the ones that overwrite the disk, reprogram the timer, or scribble over another program's memory. With one program at a time, that's fine; embedded systems still work that way. Put two programs on the same hardware and they contend for the same registers, the same memory pages, the same I/O ports. Whichever wrote last wins; the other's state is corrupted. Multi-tenancy is the original problem an OS solves.

The hardware enforces the solution. Modern x86-64 and Arm64 cores run the kernel in a privileged mode (ring 0 on x86, EL1 on Arm) where every instruction is legal. Before handing control to a user program the kernel drops the CPU to an unprivileged mode (ring 3, EL0) where any instruction that touches hardware (port I/O, page-table writes, interrupt-mask changes, halt) faults. The only way a program can deliberately cross back up is the syscall instruction: it atomically switches privilege level and jumps to a kernel-controlled entry point. Interrupts and faults enter the kernel too, but they also land at addresses the kernel chose. Two rings, one door.

[Figure: No OS versus OS-mediated sharing. Left: programs A and B overlap in memory and race for the CPU, so one bug corrupts the other. Right: A and B reach the hardware only through syscalls into the kernel, which handles isolation, fairness, and sharing.]
Without an OS, two programs on one machine collide on shared hardware. With one, the kernel is the only code authorized to touch the resources they would otherwise overwrite.

The kernel's three jobs follow from this setup. Isolation — a buggy program can crash itself but not its neighbours; a null-pointer dereference becomes a segfault, not a corrupted disk sector. Multiplexing — a few cores look like dozens of always-running cores by switching tasks every few milliseconds. Resource sharing with safety — many programs reach the same file or NIC, but only through the kernel, with permissions checked on every access.

Pitfall — perfect isolation does not exist. The L3 cache, the branch predictor, and the memory bus are physically shared and statistically leak between processes. Spectre, Meltdown, and Rowhammer all exploit the gap between "the kernel says you can't see this" and "the silicon shares state". Mitigations cost real performance: the Meltdown fix (kernel page-table isolation) added 5–30% overhead to syscall-heavy workloads. Isolation between programs is a software contract the hardware enforces imperfectly.

Processes and threads

The OS needs two distinct units: one for isolation (so a crash in one program doesn't take down others) and one for concurrency (so a single program can do several things at once). Linux separates them.

A process is a running program with its own private virtual address space, its own file-descriptor table, and its own credentials. Two processes cannot read each other's memory by accident — the page tables physically prevent it. They communicate through pipes, shared memory, or sockets, and only when both sides opt in. A bug in one cannot corrupt another.

A thread is a unit of execution inside a process. Threads of the same process share the address space, the heap, and the file descriptors; each thread gets its own stack (1–8 MB on Linux), program counter, and registers. Creating a thread runs in roughly 10–30 µs and costs a few KB of kernel state. Creating a process via fork(2) runs in 100–500 µs because the kernel has to clone page tables and a chunk of bookkeeping. The trade is direct: threads share memory at no cost but can race on it just as easily; processes communicate only through explicit channels but survive each other's crashes.

[Figure: Process as isolation unit, threads as concurrency units inside it. Process 1 (PID 4271) holds text, data + bss, heap, and the file table, sockets, and credentials, all shared by threads T1 to T3; each thread has its own stack, PC, registers, SP, and TLS. Process 2 (PID 4302) is single-threaded, with a separate address space, FD table, and credentials, and no shared memory.]
Threads inside Process 1 share everything but their stacks. Process 2's memory is invisible to Process 1 — the page tables make it physically unreachable.
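The difference between the two units can be observed directly. The sketch below is a minimal illustration, assuming a Unix-like system where os.fork is available; the names counter, bump, run_in_thread, and run_in_child_process are made up for the example. A thread's write to a heap object is visible to the rest of the process, while a forked child's write lands in the child's own copy.

```python
import os
import threading

counter = {"value": 0}           # lives on this process's heap


def bump():
    """Increment the counter in whatever address space the caller runs in."""
    counter["value"] += 1


def run_in_thread():
    """A thread shares the heap, so its write is visible after join()."""
    t = threading.Thread(target=bump)
    t.start()
    t.join()
    return counter["value"]


def run_in_child_process():
    """Fork a child that bumps its own copy; return (parent's value, child's exit code)."""
    before = counter["value"]
    pid = os.fork()
    if pid == 0:
        # Child: same virtual addresses, but a private copy of the data.
        bump()
        os._exit(0 if counter["value"] == before + 1 else 1)
    _, status = os.waitpid(pid, 0)           # reap the child
    return counter["value"], os.WEXITSTATUS(status)
```

The child reports success through its exit code because that is one of the few channels that crosses the process boundary without extra setup.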

Each thread moves through a small state machine. Created (the kernel built its task struct), ready (runnable, waiting for a core), running (on a core), blocked (waiting for I/O or a lock), terminated (exited but the parent hasn't collected the exit code yet — the zombie state). Most transitions are involuntary: the timer tick preempts a running thread, a blocking syscall pushes it into Blocked, an I/O completion wakes it back to Ready.

Signals are the kernel's way to interrupt a running thread asynchronously. kill(pid, SIGTERM) sets a pending bit; on the next return-to-user, the kernel checks the bit and either runs the default action (terminate, ignore, dump core), invokes a handler the process installed with sigaction(2), or holds delivery if the signal is masked. By default the handler runs on top of the interrupted thread's stack (or on an alternate stack registered with sigaltstack(2)), and most of libc is not safe to call from one: only async-signal-safe functions qualify, a short list that includes write and _exit. Forgetting this is a classic source of deadlocks where a handler calls printf and trips on a lock the interrupted code already held.
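A common way to stay within the rules is to make the handler do nothing but record that the signal arrived. The sketch below shows that pattern with Python's signal module; got_signal, on_usr1, and deliver_to_self are illustrative names, and it assumes a Unix-like system and a call from the main thread. CPython itself defers the Python-level handler until the main thread reaches a safe point, so the async-signal-safe restriction bites hardest in C, but the flag-only discipline carries over.

```python
import os
import signal
import time

got_signal = False


def on_usr1(signum, frame):
    """Handler: record the event and return; real work happens elsewhere."""
    global got_signal
    got_signal = True


def deliver_to_self(timeout_s=1.0):
    """Install a SIGUSR1 handler, signal this process, and wait for the flag."""
    global got_signal
    got_signal = False
    previous = signal.signal(signal.SIGUSR1, on_usr1)
    try:
        os.kill(os.getpid(), signal.SIGUSR1)         # marks the signal pending
        deadline = time.monotonic() + timeout_s
        while not got_signal and time.monotonic() < deadline:
            time.sleep(0.001)
        return got_signal
    finally:
        # Restore whatever disposition was in place before the demo.
        restore = previous if previous is not None else signal.SIG_DFL
        signal.signal(signal.SIGUSR1, restore)
```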

[Figure: A thread moves through five kernel-tracked states across its lifetime: Created (task struct built), Ready (on a runqueue), Running (on a CPU), Blocked (waiting on I/O or a lock), Terminated (zombie until reaped). Edges: admit, schedule, preempt or slice end, blocking syscall (read, futex), wake on I/O ready, signal, or lock free, and exit() or fatal signal. A context switch is any transition that ends Running.]
The scheduler preempts on the timer tick; blocking syscalls push a task into Blocked; I/O completions and signals wake it back to Ready.
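The state machine is small enough to write down as a transition table. This is only a model of the description above, not kernel code; the event names (admit, schedule, preempt, block, wake, exit) are labels chosen for the sketch.

```python
from enum import Enum


class State(Enum):
    CREATED = "created"
    READY = "ready"
    RUNNING = "running"
    BLOCKED = "blocked"
    TERMINATED = "terminated"


# Legal transitions keyed by (current state, event).
TRANSITIONS = {
    (State.CREATED, "admit"): State.READY,
    (State.READY, "schedule"): State.RUNNING,
    (State.RUNNING, "preempt"): State.READY,       # timer tick, slice ended
    (State.RUNNING, "block"): State.BLOCKED,       # blocking syscall
    (State.BLOCKED, "wake"): State.READY,          # I/O ready, lock free
    (State.RUNNING, "exit"): State.TERMINATED,
}


def step(state, event):
    """Return the next state, or raise if the event is illegal in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event!r} from {state.name}") from None


def run(events, state=State.CREATED):
    """Feed a sequence of events through the machine and return the final state."""
    for event in events:
        state = step(state, event)
    return state
```

Note that a blocked thread never goes straight to Running: it has to pass through Ready and be picked by the scheduler like everyone else.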

Scheduling

There are usually more runnable threads than cores. The OS needs a policy that decides which thread runs next and for how long, fairly enough that no thread starves and quickly enough that the decision itself isn't the bottleneck.

Linux's default scheduler — CFS (Completely Fair Scheduler) — keeps a per-CPU red-black tree of runnable threads ordered by virtual runtime: CPU time consumed, weighted by priority (a higher-priority thread accumulates vruntime more slowly). The scheduler always picks the leftmost (least-served) node, runs it for a slice of 1–10 ms, updates its vruntime, and reinserts. Insert and pick are both O(log n), about a microsecond even with thousands of threads.

A context switch is what happens at the end of a slice: save the outgoing thread's registers and stack pointer, load the incoming thread's, swap page-table roots if the process changed. The direct cost is 1–10 µs. The indirect cost is larger and harder to see — the new thread runs with a cold L1/L2 cache and a TLB that holds no entries for its address space, so it stalls on the first dozen memory accesses. A workload that switches threads too often can spend most of its CPU on cache warmup rather than on the work itself.

[Figure: CFS picks the leftmost runnable task in a per-CPU red-black tree keyed by vruntime (ms). Smaller vruntime means less CPU time consumed; the leftmost node is picked next.]
CFS keeps every runnable thread on its CPU's tree, indexed by virtual runtime. The scheduler picks left, runs it for a slice, updates its vruntime, and reinserts.
Worked example: three threads, three quanta, who runs when

Three runnable threads on one core. The scheduler keeps an integer vruntime for each: "weighted CPU time consumed so far". Lower vruntime means "this thread is owed CPU". Each thread has a nice value (priority); lower nice means higher priority, a larger weight, and therefore smaller vruntime increments per slice. Use simple weights here: weight 1 for nice 0, weight 2 for nice −5 (higher priority, accumulates vruntime half as fast), weight 0.5 for nice +5 (lower priority, accumulates twice as fast). Slice length: 4 ms.

Initial state, all three vruntimes start at 100:

Thread   nice   weight   vruntime
A         0     1        100
B        −5     2        100
C        +5     0.5      100

A tie — the scheduler picks any (say A).

  1. Quantum 1. A runs for 4 ms. Increment its vruntime by slice / weight = 4 / 1 = 4. A is now 104, B and C still 100. Reinsert A; leftmost is now tied B/C. Pick B.
  2. Quantum 2. B runs for 4 ms. B is high-priority, so its vruntime grows slowly: 4 / 2 = 2. B is now 102, C still 100, A is 104. Leftmost is C. Pick C.
  3. Quantum 3. C runs for 4 ms. C is low-priority, so its vruntime grows quickly: 4 / 0.5 = 8. C is now 108. Order: B=102, A=104, C=108. Pick B again.

Over time, B (high priority) gets disproportionately more CPU because its vruntime climbs slowly; C (low priority) gets less because its vruntime climbs fast and pushes it to the right of the tree. Fair here means "each thread accumulates vruntime at the same rate" — and the weights make vruntime accumulation slower for higher-priority threads, so they get more wall-clock time to reach the same vruntime as everyone else. That single invariant — "pick the smallest vruntime" — produces priority-weighted fair sharing without any explicit "this thread's turn" bookkeeping.
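The worked example can be replayed in a few lines. This is a sketch under the example's simplified weights, not the kernel's algorithm: a min-heap stands in for the red-black tree, ties break alphabetically by thread name, and simulate, WEIGHTS, and SLICE_MS are names invented here.

```python
import heapq

SLICE_MS = 4
WEIGHTS = {"A": 1.0, "B": 2.0, "C": 0.5}    # simplified weights from the example


def simulate(quanta, start_vruntime=100.0):
    """Return (pick order, final vruntimes) after `quanta` slices on one core."""
    runqueue = [(start_vruntime, name) for name in sorted(WEIGHTS)]
    heapq.heapify(runqueue)
    order = []
    for _ in range(quanta):
        vruntime, name = heapq.heappop(runqueue)      # smallest vruntime runs next
        vruntime += SLICE_MS / WEIGHTS[name]          # larger weight, slower growth
        order.append(name)
        heapq.heappush(runqueue, (vruntime, name))    # reinsert
    return order, {name: v for v, name in runqueue}
```

Running it for many slices shows the shares settling at the weight ratio 1 : 2 : 0.5, which is the priority-weighted fairness described above: every seven slices, A gets two, B gets four, and C gets one.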

A blocking syscall takes a thread off the tree; an I/O completion puts it back on. A long sleeper's vruntime is stale and tiny, so left alone it would monopolise the CPU until it caught up with everyone else. To prevent that, the kernel raises the waking thread's vruntime to at least min_vruntime - sched_latency / 2, which places it just ahead of the leftmost runnable task. The sleeper gets a prompt turn without starving the threads that stayed runnable.
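The wake-up rule is a one-line clamp. The function below is a sketch of that rule as described here; place_on_wakeup is a made-up name, and taking the maximum with the thread's old value (so a briefly blocked thread is never moved backwards) is an assumption about how the adjustment is applied.

```python
def place_on_wakeup(stale_vruntime, min_vruntime, sched_latency):
    """Clamp a waking thread's vruntime so it cannot lag far behind the runqueue."""
    floor = min_vruntime - sched_latency / 2
    # A thread that slept only briefly keeps its own, larger value.
    return max(stale_vruntime, floor)
```

With min_vruntime at 1000 ms and a 6 ms latency target, a thread that slept with vruntime 10 wakes at 997: ahead of the queue, but only by half a latency period.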

Each core has its own runqueue, so the fast path takes no cross-CPU lock. A load balancer runs periodically (and whenever a CPU goes idle), pulling threads from the busiest queue to an emptier one. Migration is not free: caches and the TLB are cold on the new CPU, costing 10–100 µs of warmup, so the balancer prefers migrations within the same NUMA socket where DRAM access stays local.

Pitfall — fairness is not the same as throughput. CFS divides CPU time evenly among runnable threads, which is what an interactive desktop wants. A batch workload with 100 threads doing the same work would prefer fewer, longer slices and less context-switch overhead. Real-time and latency-sensitive workloads use different scheduling classes — SCHED_FIFO, SCHED_DEADLINE — that bypass fairness for guaranteed latency at the cost of starving lower-priority work.

Virtual memory

If two processes shared one physical address space, a wild pointer in one could trash the other. Worse, every program would need to be told at link time which addresses were free — a non-starter on a multi-tasking machine. The OS solves both with a translation layer: every process gets its own private virtual address space, and the hardware translates each access into a physical address before it reaches RAM.

The CPU's MMU (Memory Management Unit) does the translation. Each process has its own page table the kernel maintains; the unit of mapping is a page — almost always 4 KB on x86-64 and Arm64, with optional 2 MB and 1 GB huge pages for workloads that touch a lot of memory. The virtual address space is enormous (48 bits on x86-64, 256 TB) and almost entirely unmapped — code at the bottom, heap growing up, stack growing down, a vast nothing in the middle.

This single mechanism does several things at once. It gives every process a clean contiguous space; it isolates processes from each other (their page tables map different physical frames); it lets the kernel back virtual memory with physical RAM only when the program actually touches it; and it lets the kernel page cold memory out to disk so processes can use more "memory" than the machine physically has.

[Figure: Per-process virtual address space mapped to physical frames via the MMU. Process A's text, data + bss, heap (grows up), and stack (grows down) map through the page table (VPN → PFN, with the TLB caching recent translations) onto scattered physical frames, one of which holds a shared library. The large region between heap and stack is unmapped; touching it traps.]
Two processes can map the same physical frame — a shared library, a copy-on-write page after fork — a trick paging makes nearly free.

The page table itself is a tree. A flat table for a 48-bit virtual space with 4 KB pages would need 2^36 entries of 8 bytes each: half a terabyte of metadata per process, most of it empty. x86-64 walks four levels (PML4 → PDPT → PD → PT) with 9 bits of the address per level and 12 bits of offset within the page. Each level holds 512 8-byte entries, exactly one 4 KB page. A full walk on a TLB miss is four dependent memory loads: 50–500 cycles depending on what's cached. The TLB caches recent translations so that most accesses pay zero walk cost; a typical core has 64–512 entries for 4 KB pages, covering at most 2 MB of mapped memory.

[Figure: x86-64 four-level page-table walk. The 48-bit virtual address splits into PML4, PDPT, PD, and PT indices (9 bits each) plus a 12-bit page offset. Each table has 512 entries, and the PT entry points at the 4 KB frame holding the data. CR3 holds the physical address of the PML4, and a context switch reloads CR3. A TLB miss costs four dependent loads.]
Arm64 uses the same 4 KB / 4-level layout by default. Huge pages collapse the bottom levels into a single mapping for 2 MB or 1 GB regions.
Worked example: translating one virtual address into a physical one

The CPU has a virtual address it wants to load: 0x00007F8000402ABC. It needs to find the physical RAM location. Only the bottom 48 bits matter on x86-64; the top bits are sign-extension. Those 48 bits split into five fields, four 9-bit indices and one 12-bit offset:

48-bit VA: 0111 1111 1000 0000 0000 0000 0100 0000 0010 1010 1011 1100
           \_______/\________/\_________/\________/\___________________/
            PML4=255  PDPT=0    PD=2       PT=2      offset=0xABC

Each 9-bit index addresses one of 512 entries in a page-table page; each entry is 8 bytes and points at the next-level page. The CPU walks four levels:

  1. Load CR3. This register holds the physical address of this process's PML4. (A context switch is partly just "reload CR3 with the next process's PML4 address", as described under Scheduling.)
  2. PML4[255]. Read the 8-byte entry at offset 255 × 8 from CR3. It contains the physical address of a PDPT.
  3. PDPT[0]. Read entry 0 of that PDPT. It contains the physical address of a PD.
  4. PD[2]. Read entry 2 of that PD. It contains the physical address of a PT.
  5. PT[2]. Read entry 2 of that PT. It contains the physical address of the 4 KB frame backing this page — call it 0x1A2B3000.
  6. Concatenate. Glue the 12-bit offset 0xABC onto the frame: physical address 0x1A2B3ABC. Now the CPU does the actual load.

Four memory loads, all dependent, all just to translate one address. On a TLB miss this can cost 50–500 cycles before the real load even starts. The TLB stores recent (virtual page → physical frame) mappings as a small associative cache; on a hit the entire walk above collapses into one lookup. A typical core has 64–512 TLB entries, which is why working sets that exceed entries × 4 KB thrash badly, and why huge pages (one TLB entry covers 2 MB instead of 4 KB) can raise throughput substantially on memory-heavy workloads.
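The bit-slicing and the walk are easy to check mechanically. In the sketch below, nested dictionaries stand in for the physical page-table pages, and split, translate, and ROOT are names invented for the illustration; the frame address 0x1A2B3000 is the one assumed in the example above.

```python
PAGE_SHIFT = 12                      # 4 KB pages: 12 offset bits
INDEX_BITS = 9                       # 512 entries per table
INDEX_MASK = (1 << INDEX_BITS) - 1


def split(va):
    """Split a virtual address into (pml4, pdpt, pd, pt, offset)."""
    va &= (1 << 48) - 1                              # drop the sign-extension bits
    offset = va & ((1 << PAGE_SHIFT) - 1)
    indices = [(va >> (PAGE_SHIFT + INDEX_BITS * level)) & INDEX_MASK
               for level in (3, 2, 1, 0)]            # PML4 first, PT last
    return (*indices, offset)


def translate(va, root):
    """Walk four levels of toy tables; a KeyError models a page fault."""
    *indices, offset = split(va)
    node = root
    for index in indices:
        node = node[index]                           # one dependent load per level
    return node | offset                             # frame base plus page offset


# Toy tables: PML4[255] -> PDPT[0] -> PD[2] -> PT[2] -> frame 0x1A2B3000.
ROOT = {255: {0: {2: {2: 0x1A2B3000}}}}
```

Calling translate(0x00007F8000402ABC, ROOT) follows the same six steps and yields 0x1A2B3ABC; any address whose indices are missing from ROOT raises KeyError, the toy equivalent of a fault on an unmapped page.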

Several useful behaviours fall out of this. Demand paging: when a process touches a page with no physical backing, the CPU raises a page fault, the kernel allocates a frame, and the instruction retries. A program that maps a 1 GB file but reads only a few pages pays only for what it touches. Copy-on-write: fork(2) doesn't copy memory; it marks every writable parent page read-only and shares the frames with the child, copying a page only when one side writes to it. Memory-mapped files: mmap(2) ties a region of the address space to a file, so reads and writes go through the page cache and paging is automatic. Swap: under pressure, cold pages get written out to disk; the working set, not the total allocation, determines real RAM use.

Worked example: what fork() actually does (and why it returns twice)

A process with PID 4271 has a 50 MB heap and calls fork(). A naive implementation would allocate 50 MB of new physical RAM and memcpy everything across. Linux does almost no copying. Here is the actual sequence:

  1. Clone the task struct. The kernel allocates a new task struct, gives it PID 4272, copies the file-descriptor table, credentials, signal handlers, and a reference to the parent's address space.
  2. Copy the page tables, not the pages. The kernel duplicates the parent's PML4 / PDPT / PD / PT entries so the child has its own four-level tree. Both trees now point at the same physical frames. No heap data is copied.
  3. Flip every writable page to read-only in both trees. Each PTE that was RW becomes R-, with a kernel bookkeeping bit marking it as CoW. Shared read-only pages (code, already-RO mappings) are left alone — they were already safe to share.
  4. Return twice. The scheduler now has two runnable tasks pointing at almost-identical state. The fork syscall returns 4272 in the parent (the child's PID) and 0 in the child. Same instruction pointer, same stack pointer, two different return values — because each task has its own copy of the saved registers from the syscall.
  5. First write triggers the trap. Suppose the parent writes to a heap page. The CPU sees a write to a read-only PTE and raises a page fault. The kernel's fault handler checks the CoW bookkeeping bit, allocates a fresh physical frame, copies the 4 KB page into it, updates the parent's PTE to point at the new frame with RW permissions, and clears the CoW bit. The faulting instruction retries and succeeds. The child's PTE for that page still points at the original frame, still marked CoW.
  6. Most pages never get copied. If the child immediately execs a new program, the whole address space is torn down and replaced — none of the 50 MB ever got duplicated. This is why fork + exec is cheap despite looking expensive on paper.

The two-return-values trick is what makes the C idiom work: if (fork() == 0) { /* child */ } else { /* parent */ }. Same code path, two processes, the return value tells each one which it is.
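The same idiom is available from Python through os.fork, which wraps the syscall on Unix-like systems. The sketch below is illustrative only; fork_and_report is a made-up name and the exit code 7 is arbitrary. The child reports which branch it took through a pipe, and the parent reaps it with waitpid so it does not linger as a zombie.

```python
import os


def fork_and_report():
    """Fork once; return (child PID as seen by the parent, child's message, exit code)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: fork() returned 0 here.
        os.close(read_fd)
        os.write(write_fd, b"child saw 0")
        os._exit(7)                          # leave without running parent cleanup
    # Parent: fork() returned the child's PID.
    os.close(write_fd)
    message = os.read(read_fd, 64)
    os.close(read_fd)
    _, status = os.waitpid(pid, 0)           # collect the exit code
    return pid, message, os.WEXITSTATUS(status)
```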

A page fault is cheap or expensive depending on where the page comes from. Touching a fresh heap page (the kernel hands you a zeroed frame) finishes in under 10 µs. A copy-on-write fault is similar. A fault that has to read the page from disk — a hard fault — costs 1–10 ms, five or six orders of magnitude longer than the load that triggered it. A program with a high hard-fault rate has spilled out of RAM.

[Figure: Page fault flow. A user load or store misses in the MMU walk (PTE present = 0) and traps to the page-fault handler, which resolves the page by source: anonymous (first heap touch, map a fresh zero page, under 10 µs), file-backed (mmap region, read from disk or cache, 1–10 ms cold), CoW after fork (copy and remap RW, under 10 µs), or swap (read from swap, 1–10 ms). The kernel installs the PTE, flushes the local TLB entry, returns from the trap, and the instruction retries.]
Soft faults (zero pages, CoW) finish in microseconds. Hard faults that hit the disk cost milliseconds.

mmap deserves a closer look because it changes the cost model of file I/O. A normal read(2) copies bytes from the kernel's page cache into a user buffer — two copies per byte (disk → cache → user) and a syscall per batch. mmap instead points the user's page table at the same physical frames the page cache uses; reads and writes happen through ordinary loads and stores, with no syscall after the initial mapping. A multi-gigabyte file maps in microseconds because the kernel only sets up page-table entries; actual data arrives lazily on page fault.
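A short round trip shows the two access paths meeting in the same bytes. This is a minimal sketch using Python's mmap module on a scratch file; mmap_roundtrip is a made-up name. The file is created with ordinary writes, edited through a mapping with plain slice assignment, and read back with an ordinary read.

```python
import mmap
import os
import tempfile


def mmap_roundtrip():
    """Return (bytes seen through the mapping, bytes seen by a later read)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")
        with open(path, "wb") as f:
            f.write(b"hello page cache" + b"\0" * 4080)    # 4096 bytes in total

        with open(path, "r+b") as f:
            with mmap.mmap(f.fileno(), 0) as m:             # length 0 maps the whole file
                first = bytes(m[:5])                        # a memory read, no read(2)
                m[:5] = b"HELLO"                            # a memory write, no write(2)
                m.flush()                                   # ask for dirty pages to be written back

        with open(path, "rb") as f:
            return first, f.read(16)
```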

Pitfall — the TLB is small. A typical core covers at most 2 MB of memory in its 4 KB TLB. Workloads with a working set larger than that miss constantly, paying for a 4-level walk on every miss. Huge pages are the standard fix: one 2 MB page is one TLB entry, so a 2 GB working set fits in 1024 entries instead of 524,288. The trade is coarser permission and dirty-page tracking, and transparent huge pages can stall when the kernel tries to defragment memory to assemble them.
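The arithmetic behind those figures is worth having at hand when sizing a workload. The helpers below are a back-of-envelope sketch; tlb_reach and entries_needed are names chosen here, not part of any tool.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30


def tlb_reach(entries, page_size):
    """Bytes of address space the TLB can translate without a page-table walk."""
    return entries * page_size


def entries_needed(working_set, page_size):
    """TLB entries required to cover a working set, rounded up."""
    return -(-working_set // page_size)
```

For example, tlb_reach(512, 4 * KB) is 2 MB, while entries_needed(2 * GB, 2 * MB) is 1024 against 524,288 for 4 KB pages.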

Files and filesystems

A program needs to put bytes somewhere durable and find them again by name. A file is the OS abstraction that delivers this: a named, byte-addressable, persistent sequence of bytes. That single abstraction covers regular files, directories, devices, and pipes — everything you can open(2).

Internally, the kernel splits a file into three parts: the inode, the dentry, and the data blocks. The inode holds metadata (size, owner, permissions, timestamps) and pointers to the data blocks where the bytes live (a typical ext4 inode is 256 bytes). The dentry maps a name to an inode within a parent directory. Directories are themselves files whose data is a list of dentries. Path resolution is recursive: open root, look up home in its dentries, follow to that inode, look up user, follow, look up file.txt. Every / in a path is one inode load.

[Figure: Path resolution for /home/user/file.txt, walking dentries to inodes to data blocks: / (inode 2, dir) → home (inode 128, dir) → user (inode 4012, dir) → file.txt (inode 9001, regular file). Inode 9001 records mode 0644, uid/gid 1000/1000, size 16384 bytes, atime, mtime, ctime, and direct, indirect, double-indirect, and triple-indirect pointers to its data blocks.]
Direct pointers cover small files; indirect blocks let one inode address multi-gigabyte files. ext4 supplements this with extents — runs of contiguous blocks recorded as (start, length) — to cut metadata overhead on large files.
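The walk can be modelled with a table of inodes whose directory entries map names to inode numbers. This is a toy, not how the kernel stores anything: INODES and resolve are made-up names, the inode numbers follow the figure, and the extra etc entry is invented to give the root a second child.

```python
# inode number -> inode; directory inodes hold name -> inode-number entries.
INODES = {
    2:    {"type": "dir",  "entries": {"home": 128, "etc": 77}},
    128:  {"type": "dir",  "entries": {"user": 4012}},
    4012: {"type": "dir",  "entries": {"file.txt": 9001}},
    9001: {"type": "file", "size": 16384},
    77:   {"type": "dir",  "entries": {}},
}
ROOT_INO = 2


def resolve(path):
    """Return (inode number, directory lookups performed) for an absolute path."""
    ino, lookups = ROOT_INO, 0
    for name in path.strip("/").split("/"):
        if not name:
            continue                                 # "/" itself, or a doubled slash
        inode = INODES[ino]
        if inode["type"] != "dir":
            raise NotADirectoryError(path)
        if name not in inode["entries"]:
            raise FileNotFoundError(path)
        ino = inode["entries"][name]                 # follow the dentry to the next inode
        lookups += 1
    return ino, lookups
```

Each path component costs one directory lookup, which is why deep paths are slower to open cold and why the kernel caches dentries aggressively.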

The kernel layers this through a VFS (Virtual Filesystem) that defines a uniform interface — read, write, lookup, mkdir — that every concrete filesystem implements. ext4 is the Linux default; XFS scales to multi-petabyte volumes; ZFS and Btrfs add copy-on-write, snapshots, and end-to-end checksums; APFS is Apple's. They differ wildly internally — ext4 uses block groups and a journal, ZFS is always copy-on-write — but a program calling read(2) sees identical semantics on all of them. Journaling is the trick most use to survive crashes: writes are appended to a sequential log first, then applied to the main structures, so an interrupted update can be replayed or discarded cleanly.

Five common filesystems on the same VFS interface, compared on four axes

filesystem   on-disk shape                 crash safety               snapshots / CoW      checksums
ext4         block groups · extents        journaled metadata         no                   metadata only
XFS          B+tree · dynamic inodes       delayed alloc · journal    no                   metadata only
ZFS          B-tree of block ptrs · CoW    CoW (no journal needed)    yes (cheap)          data + metadata
Btrfs        B-tree · CoW                  CoW (no journal needed)    yes (cheap)          data + metadata
APFS         B-tree · CoW · clones         CoW (no journal needed)    yes (file clones)    metadata only

A program reads bytes from any of these the same way: VFS hides every difference behind read(2).
Journaling and copy-on-write are the two strategies for surviving a crash mid-update. End-to-end checksums (ZFS, Btrfs) catch silent disk corruption.
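The journaling idea (log first, commit, then apply) fits in a few lines if the disk is modelled as a dictionary. This is a deliberately simplified sketch, not any real filesystem's on-disk format; apply_with_journal, recover, and the record shapes are all invented for the illustration.

```python
def apply_with_journal(disk, journal, txid, updates, crash_before_commit=False):
    """Log the updates, then a commit record, then apply them in place."""
    for key, value in updates.items():
        journal.append(("write", txid, key, value))
    if crash_before_commit:
        return                                   # simulated power loss: no commit record
    journal.append(("commit", txid))
    disk.update(updates)                         # apply to the main structures


def recover(disk, journal):
    """Replay committed transactions and discard the rest."""
    committed = {rec[1] for rec in journal if rec[0] == "commit"}
    for rec in journal:
        if rec[0] == "write" and rec[1] in committed:
            _, _, key, value = rec
            disk[key] = value                    # replaying twice gives the same result
    journal.clear()
    return disk
```

A transaction that never reached its commit record leaves no trace after recovery, which is the "replayed or discarded cleanly" property: the main structures only ever reflect whole updates.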

Pitfall: fsync is the durability boundary. A successful write(2) only puts bytes in the page cache. They reach disk asynchronously, on the kernel's schedule, and a crash before that flush loses the data. Databases and journals call fsync(2) (or its relatives fdatasync(2) and O_SYNC) to force the flush, and fsync is expensive: 1–10 ms on a spinning disk, 50–500 µs on a good NVMe. Write-heavy systems batch their fsyncs aggressively because every one is a synchronous round-trip to the device.
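In code, the boundary is a single call placed after the last write. The sketch below uses Python's os module, which exposes write and fsync directly; durable_write and demo are illustrative names, and the scratch file stands in for a real log.

```python
import os
import tempfile


def durable_write(path, data):
    """Write `data` and return only after the kernel reports the flush complete."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                       # write(2) may accept fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                      # before this line, the bytes may exist only in RAM
    finally:
        os.close(fd)


def demo():
    """Write one record durably to a scratch file and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "journal.log")
        durable_write(path, b"committed\n")
        with open(path, "rb") as f:
            return f.read()
```

For a newly created file, careful programs also fsync the containing directory so that the name itself survives a crash; that step is left out here to keep the sketch short.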

I/O

I/O is slow in absolute terms and far slower still relative to the CPU. A modern core executes a billion-plus instructions per second. An L3 miss costs ≈100 ns. An NVMe read costs ≈50–100 µs (≈200,000 cycles). A spinning-disk seek costs ≈10 ms. A transcontinental TCP round-trip costs ≈80 ms. The OS designs the I/O API specifically so the CPU is not idle while waiting on the slow side.

Blocking I/O is the simplest API: a thread calls read(2), the kernel parks it in the blocked state until data arrives, and another runnable thread runs in the meantime. Easy to write, expensive to scale: serving 10,000 connections needs 10,000 threads, each with its own stack and scheduler overhead.

I/O multiplexing lets one thread sleep on many descriptors at once. select, poll, epoll (Linux), and kqueue (BSD/macOS) take a set of file descriptors and return whichever become ready. epoll scales to 100,000+ descriptors per thread because its core operation is O(ready), not O(total).
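Python's selectors module wraps whichever of these mechanisms the platform offers, which makes the pattern easy to try. The sketch below registers three connected socket pairs, writes to one of them, and asks which descriptors are readable; ready_sockets is a made-up name and the socket pairs stand in for real client connections.

```python
import selectors
import socket


def ready_sockets():
    """Register three readers, make one readable, and return (ready ids, payload)."""
    sel = selectors.DefaultSelector()          # picks epoll, kqueue, or a fallback
    pairs = [socket.socketpair() for _ in range(3)]
    try:
        for i, (reader, _writer) in enumerate(pairs):
            reader.setblocking(False)
            sel.register(reader, selectors.EVENT_READ, data=i)

        pairs[1][1].send(b"ping")              # only connection 1 has data waiting

        ready = [key.data for key, _events in sel.select(timeout=1.0)]
        payload = pairs[1][0].recv(16)
        return ready, payload
    finally:
        sel.close()
        for reader, writer in pairs:
            reader.close()
            writer.close()
```

One thread sleeps in select() on all three descriptors and wakes with a list containing only the one that has data, so the cost of each wakeup tracks the ready set rather than the registered set.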

Asynchronous I/O goes further: the program submits a batch of requests to a queue, the kernel completes them in the background, and the program reaps completions from a second queue. io_uring on Linux and IOCP on Windows let one syscall submit and reap dozens of operations at once, cutting per-op syscall overhead by 10–100× for I/O-heavy workloads.

[Figure: Three I/O models on a common timeline. Blocking: one read() per operation with the thread blocked for the duration of the device wait (1 syscall, 1 op). Multiplexed (epoll): epoll_wait followed by reads on the ready descriptors (about 3 syscalls, N ops). Async (io_uring): submit 32 ops, then reap 32 completions while the kernel and device work through the queue and the CPU stays free (2 syscalls, N ops). Syscall count drops as the API moves work from per-op to per-batch.]
Each generation of I/O API was invented to cut syscall overhead. epoll vs poll: O(ready) instead of O(fds). io_uring vs epoll: shared submit and completion rings, optionally zero syscalls in the steady state.

Pitfall — busy-waiting is almost always wrong. A non-blocking loop that calls read() until it succeeds pins a core at 100% utilization for a result that arrives in milliseconds. The energy bill compounds at scale. Use epoll_wait or io_uring to sleep until the kernel actually has data; reserve spin loops for sub-microsecond paths (lock contention on tiny critical sections, kernel-bypass NICs) where the wakeup itself dominates.

The syscall boundary

Programs need kernel services — open a file, send a packet, allocate memory — but cannot be allowed to run kernel code directly. The syscall is the controlled crossing: a single instruction (syscall on x86-64, svc #0 on Arm64) that atomically saves the user's program counter, switches privilege level to ring 0, and jumps to a fixed kernel entry point recorded in a hardware register.

On entry the kernel saves the rest of the user registers, looks up the syscall number in a table (Linux's sys_call_table has roughly 450 entries: read, write, mmap, clone, io_uring_enter, and so on), runs the handler, and returns by restoring the saved registers and executing sysret. Every byte that crosses the boundary is validated; every user pointer is checked against the user's address space; every privilege transition is atomic in hardware.

[Figure: A syscall, the privilege transition from ring 3 to ring 0 and back. The user process loads the syscall number into rax (0 for sys_read) and executes syscall. The kernel (1) saves user registers and switches to the kernel stack, (2) dispatches via sys_call_table[rax], (3) runs the handler, e.g. vfs_read(), then returns to ring 3 with sysret. The vDSO maps cheap reads (gettimeofday, clock_gettime) into user space and skips the boundary.]
The boundary is narrow on purpose. Every transition is atomic; every pointer is validated; every byte is copied through controlled paths.

The cost matters because it sets the floor on I/O throughput. A bare syscall round-trip on a modern x86-64 core is 100–500 cycles when caches and TLB are warm — call it 30–150 ns. After the Spectre and Meltdown mitigations (KPTI, IBRS, retpolines), syscall-heavy workloads lost 5–30% throughput because the boundary now flushes more state. The vDSO (virtual Dynamic Shared Object) is the optimization for read-only syscalls that don't actually need the kernel: gettimeofday, clock_gettime, getcpu are mapped read-only into every process and execute as ordinary function calls. For everything else, batching is the win — readv/writev, sendmmsg, io_uring — because amortizing one boundary crossing over 32 operations cuts the per-op cost by 32×.

Pitfall: syscalls in tight loops. A naive for (i = 0; i < N; i++) write(fd, &c, 1); makes N syscalls. Wrap it in fwrite or a manual 4 KB buffer and you collapse N calls into about N/4096: three to four orders of magnitude fewer boundary crossings for the same logical work. The same pattern holds for read, recv, send. If you see a profile dominated by syscall entry, the answer is almost always batching.
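The effect is easy to measure by counting calls rather than timing them. The sketch below is an illustration, not a replacement for a real buffered writer: CountingWriter and count_syscalls are made-up names, and the counts assume each write to a regular file is accepted in full.

```python
import os
import tempfile


class CountingWriter:
    """Buffers bytes in user space and counts the write(2) calls it issues."""

    def __init__(self, fd, buffer_size):
        self.fd = fd
        self.buffer_size = buffer_size
        self.buffer = bytearray()
        self.syscalls = 0

    def write(self, data):
        self.buffer += data
        while len(self.buffer) >= self.buffer_size:
            self._emit(self.buffer_size)

    def flush(self):
        while self.buffer:
            self._emit(len(self.buffer))

    def _emit(self, n):
        written = os.write(self.fd, bytes(self.buffer[:n]))   # one boundary crossing
        self.syscalls += 1
        del self.buffer[:written]


def count_syscalls(total_bytes, buffer_size):
    """Write `total_bytes` one byte at a time and report how many syscalls resulted."""
    with tempfile.TemporaryFile() as f:
        writer = CountingWriter(f.fileno(), buffer_size)
        for _ in range(total_bytes):
            writer.write(b"x")
        writer.flush()
        return writer.syscalls
```

With a 1-byte buffer the writer degenerates into the naive loop, one syscall per byte; with a 4096-byte buffer the same 8192 bytes go out in two calls.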

[Figure: Lifecycle of one read(fd, buf, 4096) across user-mode, kernel, and device/scheduler lanes: libc trampoline and syscall (≈10 ns), kernel entry, saving registers and moving from ring 3 to ring 0 (≈50 ns), VFS dispatch and inode lookup (≈100 ns), page-cache miss and submission to the block-device queue (≈1 µs), the SSD read while the task blocks and the scheduler runs other work (≈100 µs), IRQ wake and copy to the user buffer (≈100 ns), and sysret back to ring 3 (≈10 ns). Total ≈100 µs; syscall, VFS, and scheduling overhead is under 1%, and the SSD wait is everything else.]
One read() touches the syscall boundary, VFS, page cache, scheduler, and device. Most of the wall-clock time is the SSD wait — which is why the kernel deschedules the task for it.

Containers

Shipping a program means shipping its dependencies — every library version, config file, and directory layout it expects. Doing that at the OS level (a full virtual machine) is heavy: a separate kernel, a separate boot, gigabytes per instance. A container ships the whole user-space environment but reuses the host's kernel.

Two kernel features carry almost the entire weight. Namespaces partition global kernel state so a process inside a namespace sees only its own slice. The major namespace types isolate one axis each: PID (process IDs; the container's init is PID 1), MNT (the container has its own /), NET (its own interfaces, routes, firewall), UTS (hostname), USER (uid/gid mapping, so root inside can be unprivileged outside), and IPC, CGROUP, TIME. cgroups v2 complement them with quantitative limits — CPU shares, memory caps, I/O bandwidth, process counts — applied to a hierarchy of process groups.

[Figure: Containers vs VMs, same kernel versus isolated kernels. Containers: apps A, B, and C, each in its own PID, MNT, and NET namespaces and cgroup, on one host kernel (namespaces, cgroups, seccomp); start ≈50 ms, no extra kernel overhead, hundreds to thousands per host, kernel-level isolation. Virtual machines: each app with its own guest libs, guest kernel, guest init, and virtio drivers on a hypervisor (KVM, ESXi, Hyper-V); start ≈10–30 s, overhead ≈5–15%, tens per host, hardware-level isolation.]
Containers share the kernel; VMs duplicate it. Containers start in milliseconds and run at near-bare-metal speed; VMs start in seconds and pay a few percent overhead in exchange for hardware-enforced isolation.

The image is the second half of the contract. An OCI image is a stack of read-only layers, each a tarball of filesystem changes, plus a JSON manifest pointing to those layers by SHA-256 digest. At runtime, a union filesystem (overlayfs on Linux) stacks the layers and adds a writable top layer; reads fall through the stack until they find a file, writes go to the top. Two containers sharing a base layer share its bytes on disk and in the page cache — pull a 200 MB Ubuntu image once, run a hundred containers, pay 200 MB total.
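
The lookup rule is simple enough to model in a few lines. The following is a toy model of union-filesystem semantics, not overlayfs itself: layers are plain dictionaries mapping a path to whole-file bytes, there are no directories or partial writes, and the UnionFS class and WHITEOUT sentinel are hypothetical names for illustration.

```python
WHITEOUT = object()  # sentinel: path was deleted in the writable layer


class UnionFS:
    """Toy union view: read-only lower layers plus one writable upper layer."""

    def __init__(self, lower_layers):
        # lower_layers is ordered bottom to top and is never modified here.
        self.lowers = list(lower_layers)
        self.upper = {}

    def read(self, path):
        # Search from the top down; the first layer that knows the path wins.
        for layer in [self.upper, *reversed(self.lowers)]:
            if path in layer:
                data = layer[path]
                if data is WHITEOUT:
                    raise FileNotFoundError(path)
                return data
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Writes always land in the upper layer; lower layers stay untouched.
        self.upper[path] = data

    def delete(self, path):
        self.read(path)  # raises FileNotFoundError if the path is not visible
        self.upper[path] = WHITEOUT  # hide the lower copy without removing it


base = {"/etc/os-release": b"debian", "/bin/sh": b"sh-v1"}
app = {"/app/main.py": b"print('hi')", "/bin/sh": b"sh-v2"}

a = UnionFS([base, app])
b = UnionFS([base, app])  # second container sharing the same lower layers

a.write("/etc/os-release", b"patched")
a.delete("/bin/sh")

print(a.read("/etc/os-release"))  # a's private copy
print(b.read("/etc/os-release"))  # b still sees the shared base layer
print(b.read("/bin/sh"))          # the app layer shadows the base layer
```

Because the two instances reference the same lower dictionaries and only ever mutate their own upper layer, the model shows why a hundred containers can share one copy of the base image.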

[Figure: An OCI image as a stack of read-only layers plus a writable overlay. Top to bottom: writable layer (overlayfs upper, tmpfs); app + config, your image (sha256:9c…); runtime libs such as npm modules and pip wheels (sha256:4a…); language runtime such as node, python, jvm (sha256:7e…); base OS userspace such as debian, alpine (sha256:1f…). manifest.json lists layers: [ sha256:1f…, sha256:7e…, sha256:4a… ]. Reads fall through the stack to the first layer that has the file; writes land in the upper layer; deletes are recorded as whiteouts; two containers off the same image share their lower layers byte-for-byte.]
The image is content-addressed: change one byte in any layer and its SHA-256 digest changes, so caches and registries deduplicate exactly. The top writable layer is discarded when the container exits.
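
Content addressing can be demonstrated with the standard library alone. In this minimal sketch, short byte strings stand in for layer tarballs and the BlobStore class is a hypothetical in-memory stand-in for a registry or local cache; only the "sha256:" plus hex-digest naming follows the digest format the text describes.

```python
import hashlib


def digest(blob):
    """Name a blob by the SHA-256 of its bytes."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


class BlobStore:
    """Toy content-addressed store: identical bytes are kept once."""

    def __init__(self):
        self.blobs = {}

    def put(self, blob):
        key = digest(blob)
        self.blobs.setdefault(key, blob)  # no-op if the digest already exists
        return key


store = BlobStore()
base = b"base OS userspace layer"

d1 = store.put(base)
d2 = store.put(base)         # same bytes: same digest, nothing new stored
d3 = store.put(base + b"!")  # one extra byte: a different digest

print(d1 == d2)
print(d1 == d3)
print(len(store.blobs))
```

Deduplication falls out of the naming scheme: the store never compares contents, it only checks whether the digest is already a key.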

The combined effect is to ship the environment with the program. A binary that worked on the developer's laptop works on the production server because the entire user-space — every library version, every config file — travels with it. The kernel ABI is the only external dependency, and Linux's stable syscall ABI keeps that contract intact across decades. Containers are not a new operating system; they let one operating system pretend to be many.

Pitfall — kernel-level isolation is not VM-level isolation. Every container on a host shares the kernel, so any kernel privilege-escalation bug is a potential container escape. The mitigation stack — seccomp-bpf to filter syscalls, AppArmor/SELinux for mandatory access control, user namespaces so root-in-the-container is not root-on-the-host, read-only root filesystems — closes most of the gap, but not all of it. For untrusted code (multi-tenant PaaS, CI on user-submitted patches), the modern answer is micro-VMs (Firecracker, Kata Containers): containers wrapped in a stripped-down hypervisor for hardware-enforced isolation at near-container start times.
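
Part of that mitigation stack can be inspected from inside a process. As a rough sketch, assuming a Linux kernel recent enough to expose the Seccomp and NoNewPrivs fields in /proc/<pid>/status, the code below parses that file's text and reports the seccomp mode (0 disabled, 1 strict, 2 filter). The function names are hypothetical, and the sample text is made up for the example.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def parse_status(text):
    """Parse 'Key:\\tvalue' lines from /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def hardening_summary(status_text):
    """Report seccomp mode and the no_new_privs flag, if the kernel exposes them."""
    fields = parse_status(status_text)
    mode = fields.get("Seccomp", "")
    seccomp = SECCOMP_MODES.get(int(mode), "unknown") if mode.isdigit() else "unknown"
    return {
        "seccomp": seccomp,
        "no_new_privs": fields.get("NoNewPrivs") == "1",
    }


# Illustrative sample; on a real system read /proc/self/status instead.
sample = "Name:\tapp\nNoNewPrivs:\t1\nSeccomp:\t2\n"
print(hardening_summary(sample))

try:
    with open("/proc/self/status") as f:
        print(hardening_summary(f.read()))
except OSError:
    pass  # /proc is not available on this platform
```

A "filter" result only says that some seccomp-bpf program is installed; it says nothing about which syscalls that program allows.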

Standards

Every interface on this page has a canonical specification or, where none exists, a de facto reference.

  • POSIX: IEEE 1003.1, current edition POSIX.1-2024 / IEEE 1003.1-2024. Defines the runtime contract every Unix-like OS exposes: fork, exec, read, write, signals, threads, the shell command language, the regex grammar.
  • System V ABI / ELF: System V Application Binary Interface (the calling convention) and the ELF (Executable and Linkable Format) reference (TIS Committee, 1995, with per-architecture supplements). The format every Linux executable, shared library, and core dump uses.
  • Linux kernel — there is no formal ISO spec; the de facto reference is the Linux man-pages project (man-pages package, also kernel.org/doc/man-pages) and the kernel's own Documentation/ tree. The user-visible syscall ABI is stable by Torvalds' explicit policy.
  • OCI Runtime Specification and OCI Image Specification — Open Container Initiative (opencontainers.org). The Runtime spec defines config.json and the lifecycle every runtime (runc, crun, Kata) implements; the Image spec defines the layer + manifest format.
  • OCI Distribution Specification — the HTTP API every container registry (Docker Hub, GHCR, Quay, ECR) speaks for docker pull and docker push.
  • virtio — OASIS Virtual I/O Device Specification (current: v1.2). The paravirtualized device interface KVM, Xen, and Firecracker use for high-performance guest I/O.
  • UEFI — UEFI Forum Unified Extensible Firmware Interface Specification (current: 2.10). The firmware/OS boot interface that replaced legacy BIOS — defines the boot manager, GPT partition format, and the protocols a bootloader uses.
  • Filesystems — POSIX defines the read/write/stat semantics every filesystem must implement. On-disk formats are per-filesystem: ext4 Wiki and Documentation/filesystems/ext4/ (Linux), xfs man pages and the SGI XFS format guide, OpenZFS On-Disk Format document, Btrfs Documentation.
  • Memory management — the hardware semantics live in Intel® 64 and IA-32 Architectures Software Developer's Manual, Vol. 3A (System Programming, paging chapters) and the Arm Architecture Reference Manual (ARMv8/v9), VMSAv8-64 chapter. These define the actual MMU, TLB, and page-fault behaviour the kernel layers atop.
  • io_uring — no formal RFC; the stable interface is documented in io_uring_setup(2), io_uring_enter(2), and the liburing repository (git.kernel.dk/liburing). Linux's stable-ABI rule applies.
  • cgroups v2: Documentation/admin-guide/cgroup-v2.rst in the Linux source tree. The unified hierarchy that replaced the v1 patchwork; the spec for every container runtime's resource limits.
  • Forward refs: IEEE 754 (handed forward from Act I & Act II; the FPU the kernel preserves across context switches), and the Intel/Arm/RISC-V ISA references (handed forward from Act II & Act III; the privilege levels and atomic instructions everything in this act assumes).

Going deeper

Branches that earn their own article.

  • Scheduling algorithms (CFS, EDF, real-time).
  • Virtual memory deep dive (TLB, page tables, huge pages, NUMA).
  • Filesystem internals (ext4, ZFS, Btrfs, copy-on-write).
  • I/O models (select, poll, epoll, io_uring, kqueue).
  • Linux kernel architecture.
  • Device drivers.
  • IPC mechanisms (pipes, shared memory, message queues).
  • Microkernel vs monolithic kernel debate.
  • Container runtimes and OCI spec.
  • Virtualization (hypervisors, Type 1 vs Type 2).