Operating Systems
TL;DR
A modern computer runs hundreds of programs at once. A browser, a code editor, a dozen background services, the window manager, the clock in the corner, whatever you are actually trying to do. Each program behaves as if it owns the whole machine: it has its own memory, its own files, its own CPU time. None of that is actually true. The hardware has one set of CPUs, one pool of RAM, one set of disks and network cards, and nothing in the silicon forces those resources to be shared fairly.
An operating system is the software layer that sits between every running program and the hardware, and makes the illusion work. It decides which program runs on which CPU next, which bytes of physical memory each program is allowed to touch, which files each program is allowed to open, and what happens when two programs want the same resource at the same time. Everything a program asks the outside world to do — read a file, open a socket, allocate memory, start another program — it asks through the OS. The OS is the referee of the machine.
This handbook walks the eight stations of that arbitration in the order they show up in practice: how programs are isolated from each other (processes and threads), how they share CPU time (the scheduler), how they share memory (virtual memory and paging), how they share persistent storage (filesystems), how they ask the kernel to do things (syscalls and io_uring), what security boundaries the kernel enforces (capabilities, namespaces, cgroups, seccomp), and how the whole thing starts (boot). The distance between "my program works on my laptop" and "it works under load, on a shared host, with hostile neighbours" is exactly the distance covered here.
You will be able to
- Read a process's `/proc/self/maps`, `limits`, and `status` and say what resources it holds and how they'll be reclaimed.
- Pick between threads, processes, async, and io_uring for a given workload and say which syscall budget wins.
- Name the security boundary that matters (ring, capability, namespace, seccomp) for a given threat model.
The Map
- You will be able to
- The Map
- Station 1 — Processes, threads, and the task_struct
- Station 2 — Scheduling: CFS, EEVDF, realtime, deadline
- Station 3 — Virtual memory, page tables, and the TLB
- Station 4 — Filesystems, inodes, and journaling
- Station 5 — Syscalls, io_uring, and the cost of crossing rings
- Station 6 — Kernel/user boundary, capabilities, namespaces, cgroups
- Station 7 — Boot: UEFI, bootloader, kernel, pid 1
- How the stations connect
- Standards & Specs
- Test yourself
Read the graph in two passes. First, top-down — every running program sits under one kernel, every resource it touches is mediated by that kernel. Second, left-to-right — this is the order in which an engineer typically meets each station: "my program crashed" (processes) → "it's slow" (scheduler) → "it's using too much RAM" (virtual memory) → "writes are slow" (filesystems) → "syscalls are the bottleneck" (io_uring) → "can a compromised service reach my secrets?" (security) → "why won't it start?" (boot).
Station 1 — Processes, threads, and the task_struct
The first job of the kernel is to keep running programs from stepping on each other. If every program could read and write any memory it pleased, open any file on the disk, or kill any other program, you would not have an operating system — you would have a monolith where the first buggy or malicious program to start crashes everything. Some kind of isolation boundary is needed around each running program: a box that says "this memory belongs to this program, and nobody else touches it."
The Unix answer, 50 years old and still the dominant model, is the process. A process is the isolation boundary: when the kernel starts a program it gives the program its own private view of memory (its address space), its own private table of open files, its own user and group identity, and a handle the kernel uses to keep track of it (its process ID, or PID). A program inside a process can only reach out of the box through a small, well-defined gate called the syscall interface (Station 5). That gate is where the kernel can say yes or no.
Inside a process, the program often needs to do more than one thing at a time — a web browser rendering a page while also downloading an image while also running JavaScript. Creating a brand-new process for each task is expensive and loses the convenience of shared memory. So inside a process we have threads: multiple independent streams of execution that share the process's memory and files, but each has its own stack and its own set of CPU registers. The kernel schedules threads, not processes (a process with one thread is still scheduled as one thread). Isolation between processes is strong; isolation between threads of the same process is essentially zero.
A process is the kernel's unit of isolation: a private address space, a private file-descriptor table, a set of credentials, and at least one thread of execution. A thread is the kernel's unit of scheduling: a stack, a register set, and a pointer back to the process it belongs to. On Linux both are actually the same thing — a task_struct — distinguished only by which resources the clone() call shared with the parent.
clone() flags that turn a fork into a thread:
CLONE_VM share address space
CLONE_FS share filesystem info (cwd, root, umask)
CLONE_FILES share file descriptor table
CLONE_SIGHAND share signal handlers
CLONE_THREAD same thread group (same getpid(), different gettid())
fork() = clone() with no sharing → new process
pthread_create = clone with VM+FS+FILES+THREAD → new thread in same process
A process has one PGID (process-group id) and one SID (session id), which the shell uses to send SIGHUP to all children when a terminal closes. A process has one PID the rest of the world uses, and each thread inside it has a TID (`gettid`, the kernel-level identifier the scheduler actually tracks). `getpid()` in a threaded program returns the TGID — the TID of the "main" thread — which is why `gettid()` and `getpid()` disagree in every thread but the first.
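The PID/TID split is easy to see from user space. A minimal sketch in Python, assuming Linux and Python 3.8+ (`threading.get_native_id()` reports the kernel TID there):

```python
import os
import threading

ids = {}

def record(key):
    # get_native_id() returns the kernel TID (what gettid() reports);
    # os.getpid() returns the TGID shared by every thread of the process.
    ids[key] = (os.getpid(), threading.get_native_id())

record("main")                                  # runs on the main thread
t = threading.Thread(target=record, args=("worker",))
t.start()
t.join()

assert ids["main"][0] == ids["worker"][0]       # same process -> same PID (TGID)
assert ids["main"][1] == ids["main"][0]         # main thread's TID equals the PID
assert ids["worker"][1] != ids["worker"][0]     # worker thread has its own TID
```

The second assertion is the Linux-specific detail: the "main" thread's TID is the process's PID, which is exactly why `getpid()` is defined as "the main thread's TID".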
- Every process on Linux has a `/proc/<pid>/` directory: `maps` (what's mapped where in virtual memory), `status` (uid, gid, memory, threads), `fd/` (one symlink per open file descriptor), `limits` (RLIMIT_NOFILE, RLIMIT_AS, RLIMIT_STACK). Reading these when you're debugging beats printf.
- Fork is copy-on-write: the child inherits the parent's page tables with every page marked read-only, and actual copies happen only on write. Forking a 10 GB process is fast; having the child write 10 GB is what's slow.
- PID 1 is special: if pid 1 dies, the kernel panics. Docker containers run your app as pid 1 by default, which breaks signal handling and zombie reaping — use `--init` or a tiny init like `tini`, or you'll discover this the hard way when Ctrl-C does nothing.
- Linux threads and processes share the same scheduler. Context-switching between two threads of the same process costs ~1–2 µs; between processes, ~2–4 µs — the difference is the TLB flush that a full address-space change forces. Multiply by a million switches per second under heavy load and it's real latency.
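Those `/proc` files are trivially scriptable. A Linux-only sketch that pulls the thread count from `/proc/self/status`, counts open descriptors via `/proc/self/fd`, and cross-checks one limit through the `getrlimit` syscall:

```python
import os
import resource

# Parse /proc/self/status into a dict ("Threads", "VmRSS", "Uid", ...).
status = {}
with open("/proc/self/status") as f:
    for line in f:
        key, _, value = line.partition(":")
        status[key] = value.strip()

threads = int(status["Threads"])              # tasks in this thread group
open_fds = os.listdir("/proc/self/fd")        # one symlink per open descriptor

# The same numbers /proc/self/limits shows, fetched through the syscall:
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

print(f"{threads} thread(s), {len(open_fds)} open fd(s), "
      f"NOFILE soft={soft} hard={hard}")
```

The same loop pointed at `/proc/<pid>/status` for another pid is the core of what `ps` and `top` actually do.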
The model you want: every process is an island of isolated state connected to the world only through file descriptors and signals. If you can enumerate the fds and the signal handlers of a process, you know every way it can be influenced.
WARNING
fork() in a multi-threaded program is a loaded gun. Only the calling thread survives into the child; every mutex held by another thread is now locked forever in the child, with no owner to release it. POSIX only guarantees async-signal-safe calls between fork() and exec() for a reason. If you need processes, fork early (before threads) or use posix_spawn.
The wider picture. The isolation-and-execution layer has many more pieces:
- Process lifecycle — `fork`/`vfork`/`clone` to create, `execve` to replace the program, `wait`/`waitpid` to reap, `kill` to signal, `exit` to leave. Understanding the sequence is half of understanding UNIX.
- Signals — SIGTERM, SIGKILL, SIGINT, SIGSEGV, SIGPIPE, SIGHUP, SIGCHLD. Asynchronous messages a process must handle or default (kill). Signals are the original back-pressure.
- Inter-process communication (IPC) — pipes, named pipes (FIFOs), UNIX-domain sockets, message queues, shared memory (`shm_open`, `mmap`), semaphores, signals, D-Bus. Cross-links to Systems & Architecture Rung 3.
- User-space threading models — pthreads (1:1 mapping to kernel threads, the default), green threads (M:N), goroutines (M:N with a runtime scheduler), Java 21 virtual threads, coroutines (cooperative, not preempted). Covered in depth in the Languages handbook.
- Process groups, sessions, and job control — how `Ctrl-Z`, `bg`, `fg`, and `nohup` actually work.
- Zombies, orphans, and PID 1 — a child that exited but has not been reaped is a zombie; its parent has one job. If the parent dies first the child is re-parented to init.
- Resource limits (`ulimit`/`getrlimit`) — RLIMIT_NOFILE (open files), RLIMIT_AS (address space), RLIMIT_STACK, RLIMIT_NPROC. The first line of defence against a runaway program.
- Process state machines — running, runnable, sleeping (interruptible / uninterruptible), stopped, zombie. The letters in `ps`'s STAT column.
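The create → replace → reap sequence from the lifecycle bullet fits in a few lines. A Linux sketch in Python, where `sys.executable -c "raise SystemExit(7)"` stands in for whatever program you would actually exec:

```python
import os
import sys

pid = os.fork()                      # 1. create: child is a CoW copy of the parent
if pid == 0:
    # 2. replace: execve swaps in a new program image; nothing after this
    #    line runs in the child unless the exec fails.
    os.execv(sys.executable, [sys.executable, "-c", "raise SystemExit(7)"])
else:
    # 3. reap: waitpid collects the exit status so no zombie lingers.
    reaped, wstatus = os.waitpid(pid, 0)
    assert reaped == pid
    assert os.WIFEXITED(wstatus) and os.WEXITSTATUS(wstatus) == 7
```

Skip step 3 and the child's `task_struct` sticks around as a zombie until the parent exits and init reaps it — which is the whole PID 1 story above in miniature.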
Where this shows up next. Station 2 is how the scheduler picks which runnable thread gets the CPU. Station 3 is how the kernel makes each process believe it has its own memory. Station 6 is how the kernel tightens the box around a process (namespaces turn "this machine" into "this container"). The Systems & Architecture handbook rungs 1–3 are exactly the thread / process / IPC story from an application perspective.
Go deeper: Kerrisk, The Linux Programming Interface (the most accessible deep reference, chapters 24–28 for processes, 33–34 for threads); Bovet & Cesati, Understanding the Linux Kernel (3rd ed) chapters 3 and 7; Love, Linux Kernel Development; an hour reading your own process's /proc/self/maps and status while you poke the running program from another shell.
Station 2 — Scheduling: CFS, EEVDF, realtime, deadline
Once you have many threads but only a few CPUs, you must decide which thread runs next, and for how long, and what happens when a higher-priority thread wants the CPU that another thread is already using. There are always more threads that want to run than there are cores. The choice has to be fair (no thread is starved), responsive (interactive work doesn't wait behind long computations), and fast (the scheduling decision itself cannot eat the CPU it is trying to allocate).
The scheduler is the kernel subsystem that answers these questions. It holds a list of runnable threads (threads that are not waiting on I/O, a lock, or a sleep), and on every timer tick, every blocking call, and every wake-up it picks which thread goes next. On a modern machine this decision is made tens of thousands of times per second. Getting it wrong looks like slow mouse movement, laggy video calls, or a server whose p99 latency is ten times its median even though the CPUs show only 60% utilization.
Different workloads need different kinds of fairness. A web server wants every request to make progress; a video encoder wants every frame done before its deadline; a soft real-time audio thread wants its 5 ms slice every 20 ms or the output will glitch. Linux exposes several scheduling classes to serve these different needs, and a process can choose which class it runs under.
The scheduler decides which runnable thread gets a CPU next and for how long. On a 16-core machine with 3,000 runnable threads (a typical web server under load), this decision is made tens of thousands of times per second. The scheduler's quality shows up as tail latency, fairness, and whether your 99th percentile looks like your median.
For the last 15 years the default was CFS (Completely Fair Scheduler, Ingo Molnár, 2007) — a red-black tree of runnable tasks keyed by "virtual runtime," always picking the leftmost (least-run) task. It was fair in a weighted sense but had trouble with latency-sensitive short tasks hidden behind long CPU-bound ones. EEVDF (Earliest Eligible Virtual Deadline First, Peter Zijlstra, merged in Linux 6.6 late 2023) replaces CFS with a scheme that lets latency-sensitive tasks declare a deadline in addition to a weight, so short interactive tasks preempt longer ones earlier.
- `nice` values range −20 to +19. Each step is worth about 10% of CPU weight in CFS/EEVDF. `nice` is scheduling politeness, not priority; RT classes are where true priority lives.
- SCHED_FIFO / SCHED_RR (priorities 1–99) preempt everything below them and have no time slice — an RT-FIFO task will run until it blocks or yields, which is why a buggy RT thread freezes the whole box. Linux has a safety valve (`/proc/sys/kernel/sched_rt_runtime_us`, default 950,000 of 1,000,000) that caps all RT classes at 95% of CPU so SCHED_NORMAL can still breathe.
- SCHED_DEADLINE lets you declare `(runtime, deadline, period)` — "I need 200 µs every 10 ms, deadline 5 ms after each period starts." The kernel admits or rejects the job using earliest-deadline-first schedulability math; admitted jobs are guaranteed their budget.
- Context-switch cost: ~1–2 µs for CPU state, plus a TLB flush if the address space changes, plus cache eviction if the new task's working set differs. A web server spending 30% of CPU in context switches is usually running too many threads for the load; the fix is a thread pool sized to the CPU count, not a smarter scheduler.
The model you want: the scheduler is fair only in the units it measures. CFS measured virtual runtime; EEVDF measures eligibility-and-deadline; RT measures priority. Match your workload to the class that measures what matters to you — and pin threads with taskset/sched_setaffinity if you know better than the scheduler does.
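Pinning doesn't require shelling out to `taskset`; the affinity syscalls are exposed directly. A Linux-only sketch that pins the calling process to one CPU and then restores the original mask:

```python
import os

allowed = os.sched_getaffinity(0)        # 0 = the calling process
assert len(allowed) >= 1                 # the mask is never empty

one_cpu = min(allowed)                   # pick any CPU we're already allowed on
os.sched_setaffinity(0, {one_cpu})       # pin: scheduler may now use only this CPU
assert os.sched_getaffinity(0) == {one_cpu}

os.sched_setaffinity(0, allowed)         # undo the pinning
```

Pinning trades the scheduler's load balancing for cache and NUMA locality — a win only when you actually know the thread's working set belongs on that core.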
TIP
"Raise the thread priority" is almost never the fix. If nice -n -5 helps your service, you have a dependency holding a lock under load — the scheduler is telling you where to look. Actual RT priority is for hard deadlines, not for wishful latency.
The wider picture. Scheduling is a small world with many dialects:
- General-purpose fair schedulers — CFS (2007–2023), EEVDF (Linux 6.6+), BFS/MuQSS (out-of-tree), Windows's multi-level-feedback-queue scheduler, macOS's Mach scheduler.
- Real-time scheduling theory — rate-monotonic (Liu & Layland 1973), earliest-deadline-first (EDF), least-laxity, priority-inheritance, priority-ceiling. The Signals & Embedded handbook Station 8 covers this from the RTOS angle.
- Multicore concerns — load balancing, CPU pinning, cache-affinity, NUMA awareness (keep a thread on the node that owns its memory), hyper-threading / SMT, heterogeneous cores (big.LITTLE, Intel P-cores vs E-cores).
- Priority inversion and its fixes — a low-priority task holds a lock that a high-priority task wants, blocked by a medium-priority task (Mars Pathfinder, 1997). Priority inheritance or priority ceiling is the mutex-level fix.
- Scheduler domains and topology — Linux organizes CPUs into siblings, cores, sockets, nodes. Migrating a task costs more the further it moves.
- Cooperative vs preemptive — early Windows / early macOS / cooperative coroutine runtimes ask tasks to yield voluntarily; preemptive kernels force the yield at a timer interrupt. Most modern systems are preemptive except inside user-space async runtimes.
- cgroups CPU control — `cpu.shares`, `cpu.cfs_quota_us`/`cpu.cfs_period_us` (CFS bandwidth control). How Kubernetes "CPU requests / limits" are actually enforced.
Where this shows up next. The scheduler runs on top of processes and threads (Station 1). Virtual memory (Station 3) interacts with it through the TLB flush cost of a context switch. Syscalls (Station 5) block and unblock threads, which feeds work back into the scheduler's runnable queue. Outside this handbook, the Systems & Architecture handbook rung 2 describes threads from the application side, and the Computer Architecture handbook describes the per-core microarchitecture the scheduler targets.
Go deeper: Liu & Layland, "Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment" (JACM 1973) — the rate-monotonic and EDF paper every RT scheduler cites; Zijlstra's LKML thread announcing EEVDF (June 2023); Kerrisk, The Linux Programming Interface chapter 35 (process priorities and scheduling); Love, Linux Kernel Development chapter 4; perf sched and bpftrace one-liners for watching the scheduler on a running system.
Station 3 — Virtual memory, page tables, and the TLB
Stations 1 and 2 gave every program its own process with its own threads. Now we need to give every process its own view of memory. Two different processes should be able to dereference the same address (say, 0x7fff00000000) and get different data, because the address means something private to each. If they had to share one flat pool of physical RAM, one process could trash another's state with a stray pointer, and every program would need to know which addresses other programs had already claimed.
The answer is a level of indirection called virtual memory. Every address a user-space program uses is a virtual address — a number the CPU will translate, on every single memory access, into a physical address that actually points to RAM (or swap, or nowhere). The translation is driven by a table the kernel maintains per process, so the same virtual address in two different processes lands on two different physical pages. A program sees a clean, private, contiguous address space from 0 up to the architectural limit; the kernel decides which parts of that space actually have physical memory behind them, and which sit idle until first touched.
Translating every single load and store through a table is slow in the naïve case, so CPUs cache recent translations in a small hardware unit called the TLB (Translation Lookaside Buffer). A TLB hit is a few CPU cycles; a TLB miss triggers a "page walk" that reads several levels of the page-table tree from RAM — often hundreds of cycles. Almost everything interesting about OS memory management is either building the translation tables, minimising walks, or deciding what to do when a process touches an address whose page is not in RAM.
Every pointer in every user-space process is a virtual address. The CPU translates it to a physical address on every access, walking a per-process page-table tree the kernel maintains. That translation is what lets two processes both use 0x7fff00000000 for different data — the kernel points each process's page tables at different physical pages.
x86-64 4-level page table (also called "48-bit virtual"), 4 KiB pages:
virtual address bits 47.....39 38.....30 29.....21 20.....12 11.....0
┌─────────┬─────────┬─────────┬─────────┬────────┐
│ PML4 │ PDPT │ PD │ PT │ offset │
└────┬────┴────┬────┴────┬────┴────┬────┴────────┘
│ │ │ │
CR3 ▼ ▼ ▼ ▼
┌─┐ ┌─┐ ┌─┐ ┌─┐ physical
│ ├─▶│ ├──▶ │ ├──▶ │ ├──▶ page frame
└─┘ └─┘ └─┘ └─┘ +
offset
5-level ("57-bit virtual") adds PML5 for workloads that need > 128 TiB.
huge pages: 2 MiB (skip last level) 1 GiB (skip last two)
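The index fields in the diagram are just fixed bit slices of the address. A small sketch of the split for the 4-level, 4 KiB layout above (9 index bits per level, 12 offset bits):

```python
PAGE_SHIFT = 12          # 4 KiB pages -> 12 offset bits
INDEX_MASK = 0x1FF       # 9 bits -> 512 entries per table level

def split_va(va):
    """Slice a 48-bit virtual address the way the page-table walker does."""
    offset = va & ((1 << PAGE_SHIFT) - 1)          # bits 11..0
    pt     = (va >> 12) & INDEX_MASK               # bits 20..12
    pd     = (va >> 21) & INDEX_MASK               # bits 29..21
    pdpt   = (va >> 30) & INDEX_MASK               # bits 38..30
    pml4   = (va >> 39) & INDEX_MASK               # bits 47..39
    return pml4, pdpt, pd, pt, offset

# A typical user-space address: four 9-bit indices plus a 12-bit offset.
print(split_va(0x7FFF_0000_1234))
```

A 2 MiB huge page simply stops the walk one level early, so the `pt` field's 9 bits join the offset — which is exactly where the "skip last level" note comes from.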
A page-table walk takes 3–5 DRAM accesses. The TLB (Translation Lookaside Buffer) caches recent translations so the walk happens at most once per page. Modern Intel TLBs hold ~1,500 4 KiB entries for data — that's only about 6 MiB of addressable memory before a miss. When a workload's hot set exceeds the TLB's reach, every access becomes a page-walk, and performance falls off a cliff — TLB thrashing.
- Huge pages (2 MiB and 1 GiB on x86-64) let one TLB entry cover more memory. A 2 MiB page is 512× the reach of a 4 KiB page per TLB slot. Databases and JVMs frequently enable THP (Transparent Huge Pages) for heaps; latency-sensitive services often disable it because THP compaction pauses are unpredictable.
- Page faults come in three flavours. Minor: page is in memory, just not mapped (first touch after `malloc` + first write, or CoW after fork) — ~1 µs. Major: page must be loaded from disk (swap or a memory-mapped file) — milliseconds to tens of ms on SSD, hundreds on HDD. Invalid: segfault; the kernel sends SIGSEGV and the process dies unless it installed a handler.
- `mmap` is how you map a file (or anonymous memory) into your address space. Reading the file becomes a page fault; the kernel pulls in pages on demand and uses the page cache as a read/write buffer. Zero-copy I/O on Linux (`sendfile`, `splice`, `vmsplice`, `io_uring` registered buffers) is really "don't copy pages in and out of the page cache just to send them."
- OOM killer: when the kernel can't satisfy an allocation, it picks a process by `oom_score` (high memory use + low uptime + no privilege + `oom_score_adj` nudge) and sends it SIGKILL. You do not get a second chance. Container runtimes set `oom_score_adj` so pid 1 and critical system processes die last.
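Demand paging is observable from user space. A minimal sketch with Python's `mmap` module — an anonymous private mapping, so there is no file behind it and every page starts as demand-zero memory:

```python
import mmap

page = mmap.PAGESIZE                   # 4096 on x86-64

# Anonymous private mapping: two pages of demand-zero memory.
buf = mmap.mmap(-1, 2 * page)          # -1 = not backed by any file
assert buf[0] == 0                     # first read: minor fault, zero page mapped
buf[0:5] = b"hello"                    # first write: a real frame is allocated (CoW)
data = bytes(buf[0:5])
buf.close()
assert data == b"hello"
```

Swap `-1` for a real file descriptor and the same faults become page-cache reads — that's the entire mechanism behind memory-mapped file I/O.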
The model you want: virtual memory is a giant per-process map from virtual pages to "where is this page actually?" The answer can be "in RAM at frame X," "in swap," "in a file on disk," "not yet allocated," or "I said no."
CAUTION
Swap is not "extra RAM." Swap turns memory latency from nanoseconds into milliseconds — a 10⁵× cliff. A process that starts swapping is a process that stopped serving traffic. For production services, either size the box so you never swap, or disable swap entirely and let the OOM killer make the call honestly.
The wider picture. Memory management is a large subject with many ingredients:
- Paging vs segmentation — paging (4 KiB fixed pages) won on x86-64; segmentation (variable-size protected regions) survives only for thread-local storage.
- Page replacement policies — LRU, clock (second-chance), ARC, LIRS, 2Q. What the kernel evicts when RAM fills up.
- NUMA (Non-Uniform Memory Access) — on multi-socket servers, every CPU has a "near" memory node and a "far" one. `numactl`, `mbind`, and automatic NUMA balancing try to keep pages near the threads that use them.
- Memory-mapped files (`mmap`) — treat a file as an array; the page cache does the I/O. `MAP_SHARED`, `MAP_PRIVATE`, `MAP_ANONYMOUS`, `MAP_HUGE_2MB` are the knobs.
- Copy-on-write (CoW) — `fork` shares pages read-only; first write makes the copy. Foundation of snapshots and container layering.
- Kernel same-page merging (KSM) — deduplicates identical anonymous pages across processes / VMs.
- cgroup memory limits — `memory.max`, `memory.high`, `memory.swap.max`. How Kubernetes memory limits actually bite.
- Hardware memory protection — user / supervisor bit, NX/XD (no-execute), SMEP / SMAP (supervisor-mode access prevention), PKU (protection keys). Silicon support for the kernel/user boundary — cross-link to Computer Architecture Station 4.
- Memory errors and ECC — single-bit soft errors from cosmic rays, row-hammer, memory poisoning. Data-centre RAM has ECC; consumer laptops usually do not.
- Memory allocators — glibc's ptmalloc, jemalloc, tcmalloc, mimalloc; slab / SLUB / SLOB inside the kernel. The user-space allocator sits on top of `mmap`/`brk` and decides where your `malloc` returns.
Where this shows up next. Filesystems (Station 4) ride on the page cache, which is virtual memory reused. Syscalls (Station 5) include mmap, mprotect, madvise, brk that touch the page tables directly. Security primitives (Station 6) use page protection bits for the ring boundary. The Foundations handbook Station 2 explained how bytes become numbers; this station explained how those bytes live somewhere.
Go deeper: Drepper, "What Every Programmer Should Know About Memory" (2007) — long but still the best primer; Intel SDM Vol. 3A §4 on paging; Mel Gorman, Understanding the Linux Virtual Memory Manager (free online); perf stat -e dTLB-load-misses,dTLB-loads on a hot loop in your own code.
Station 4 — Filesystems, inodes, and journaling
Memory (Station 3) disappears when the machine loses power. Real programs need to keep data across restarts — documents, databases, photographs, config. That means writing bytes to persistent storage — a hard disk, an SSD, a USB stick, a cloud block device — and getting the same bytes back later. The storage device itself is dumb: it offers a flat array of fixed-size blocks, each identified by a number. "Write 4,096 bytes to block 1,234,567." Nothing in that interface knows about files, folders, permissions, or "is this block still in use?"
A filesystem is the data structure that layers organization on top of a block device so users can think in files and folders instead of block numbers. It decides how to map a path like /home/alice/report.pdf to the specific blocks that hold the PDF's bytes; how to store metadata (owner, size, modification time); how to reclaim blocks when a file is deleted; and — crucially — how to survive a sudden power loss without corrupting the structure.
Linux supports dozens of filesystems (ext4, xfs, btrfs, zfs, f2fs, ntfs, fat32, tmpfs, nfs, fuse-based overlays) and shields programs from their differences. The kernel's VFS (Virtual File System) is an abstraction layer: when a program calls open() or read(), VFS routes the call to the filesystem that owns that path, which translates the operation into block reads and writes against the underlying device. The program never has to know whether the file lives on a local SSD, a network share, or a FUSE-mounted cloud bucket.
A filesystem is a data structure on a block device plus the kernel code that gives it a POSIX face. The kernel's VFS layer sits above them all — open, read, write, stat are VFS methods that each filesystem implements in its own way. That's why you can ls across ext4, xfs, btrfs, NFS, FUSE, and tmpfs without caring which is which.
At the block level, a POSIX filesystem has four essential structures: a superblock (layout metadata), a bitmap of free blocks, an array of inodes (one per file: mode, uid, gid, size, timestamps, and pointers to data blocks), and directory entries that map names to inode numbers. A file's name is not stored with the file; it's stored in the directory, which is why `rename()` is atomic and hard links are free.
- `fsync` vs `write`: `write()` returns once the kernel's page cache has the bytes; the disk may not hear about them for up to 30 seconds (`dirty_expire_centisecs`). `fsync()` blocks until the disk has acknowledged. Databases call `fsync` on the journal before every commit — that's why slow disks kill transaction throughput.
- Journaling modes for ext4: `data=ordered` (default — metadata journaled, data written before metadata), `data=writeback` (metadata only, faster, risk of stale data on crash), `data=journal` (everything journaled twice, half the write bandwidth, highest durability). Knowing which mode your DB host runs is a 30-second `tune2fs -l` check that saves arguments.
- Copy-on-write filesystems (btrfs, ZFS, APFS) never overwrite data in place. A write goes to a new block; the old block is freed only after the new block's references are established. Snapshots are nearly free (a new root pointer) and torn writes are structurally impossible — but fragmentation is real and can hurt sequential read speed.
- Extents vs block pointers: ext4 uses extents — `(start, length)` pairs — which describe a 4 GiB contiguous file with one record. Old ext2/3 used indirect blocks, which needed many records for the same file and scaled poorly past a few GiB.
The model you want: the filesystem is a key-value store in a trench coat — inode numbers are keys, file contents are values, and the POSIX face on top is just an ancient API for that store. Modern object stores (S3, GCS) are the same idea without the POSIX face, which is why they scale differently.
WARNING
A rename across filesystems is not atomic — `rename(2)` refuses it with EXDEV, and tools like `mv` fall back to copy + unlink, so a crash in the middle leaves you with two copies or none. Within one filesystem, `rename(2)` is atomic, which is the basis of every safe "write-temp-file-then-rename" update pattern. Staying on one filesystem matters.
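The write-temp-then-rename pattern looks like this in practice. A Python sketch — the target file name and JSON payload are made up for illustration, and `os.replace` is Python's wrapper around the atomic same-filesystem `rename(2)`:

```python
import os
import tempfile

def atomic_write(path, data):
    # Temp file in the SAME directory -> same filesystem -> rename stays atomic.
    dirpath = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirpath)
    try:
        os.write(fd, data)
        os.fsync(fd)                   # bytes durable before they become visible
    finally:
        os.close(fd)
    os.replace(tmp, path)              # atomic rename(2) over the target

# Hypothetical target, for illustration only:
target = os.path.join(tempfile.gettempdir(), "demo-config.json")
atomic_write(target, b'{"retries": 3}')
```

Readers racing with this writer see either the complete old file or the complete new one — never a half-written mix, and never a missing file.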
The wider picture. Filesystems span a lot of design space:
- Local journaled filesystems — ext4, xfs, ReiserFS (historical), NTFS. Crash-consistent via journals.
- Copy-on-write filesystems — btrfs, ZFS, APFS, bcachefs. Snapshots, checksums, send/receive replication.
- Log-structured filesystems — F2FS (flash-friendly), NILFS, the LFS research family. Writes go to a log; GC reclaims old space.
- Distributed filesystems — NFS, SMB / CIFS, CephFS, GlusterFS, Lustre. One mount, many servers behind it.
- Cluster filesystems — GFS2, OCFS2. Multiple nodes mount the same block device.
- Object stores (S3-like) — same idea, different API. Key-value blobs, no POSIX, 11 nines durability. Covered in the Cloud & Infrastructure handbook.
- In-memory filesystems — tmpfs, ramfs. Page cache pretending to be a disk.
- Userspace filesystems (FUSE) — a kernel shim that forwards filesystem operations to a user-space process. How `sshfs`, `s3fs`, `rclone mount`, and many research filesystems ship.
- Filesystem features on top — encryption at rest (fscrypt, eCryptfs, LUKS block-level), compression (btrfs zstd, ZFS lz4), deduplication, quotas, ACLs (POSIX ACLs, NFSv4 ACLs), extended attributes (`xattr`).
- Fsync durability guarantees — what `fsync`, `fdatasync`, `O_SYNC`, `O_DIRECT`, and `sync_file_range` actually promise. A surprising number of databases were wrong about this.
- Page cache interactions — dirty pages, write-back, `vm.dirty_ratio`, `vm.dirty_expire_centisecs`.
Where this shows up next. Syscalls (Station 5) are how programs actually talk to the VFS — open, read, write, stat, rename, fsync. Security primitives (Station 6) apply DAC / MAC / capabilities to filesystem operations. The Data & AI handbook is what happens when you scale this layer past one host; the Security & Cryptography handbook covers filesystem encryption in detail.
Go deeper: McKusick, The Design and Implementation of the FreeBSD Operating System (2nd ed) chapters 8–9 — the friendliest fs design book; Ts'o's ext4 design notes on kernel.org; Bonwick & Moore's ZFS papers; LWN's article series on btrfs for the "why CoW" argument; Pillai et al., "All File Systems Are Not Created Equal" (OSDI 2014) for what fsync actually guarantees.
Station 5 — Syscalls, io_uring, and the cost of crossing rings
User-space programs cannot directly open files, allocate pages, or send network packets. Those operations require privileges that only the kernel has. So every time a program needs something from the outside world, it has to ask the kernel to do it on the program's behalf. The gate through which it asks is called a system call (syscall). There are about 400 of them on modern Linux, and they cover every kernel-mediated operation: open, read, write, mmap, fork, execve, socket, bind, epoll_wait, clock_gettime, and so on.
A syscall is more than a function call. The CPU has to switch from user mode (ring 3) to kernel mode (ring 0), swap to a kernel stack, save the caller's registers, run the kernel code, and then reverse the whole process to return. Each crossing takes tens to hundreds of nanoseconds even before the kernel does any real work. For programs that need to move millions of small pieces of data — a busy web server, a high-frequency database, a packet-processing daemon — this crossing is the single biggest cost, not the I/O itself.
Decades of operating-system evolution can be read as a slow war on syscall overhead. Batching multiple operations into one syscall (readv, writev, sendmmsg). Event loops that tell the kernel once what events you care about, then let many events fire without re-entering the kernel (epoll, kqueue). Shared-memory ring buffers where the kernel and user space exchange work without any syscalls at all (io_uring, AF_XDP). Knowing where your program sits on this ladder is the difference between "my I/O loop does 50k ops/s" and "it does 5 M ops/s on the same hardware."
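The first rung — batching — is directly visible from Python, whose `os.writev` wraps the `writev(2)` syscall. A sketch using a pipe; the HTTP-ish chunks are just example payload:

```python
import os

r, w = os.pipe()
chunks = [b"GET / HTTP/1.1\r\n", b"Host: example\r\n", b"\r\n"]

# One writev() syscall submits all three buffers (scatter-gather output),
# where three write() calls would cross the ring three times.
written = os.writev(w, chunks)
assert written == sum(len(c) for c in chunks)
assert os.read(r, written) == b"".join(chunks)

os.close(r)
os.close(w)
```

Same bytes on the wire, one ring crossing instead of three — the entire batching argument in miniature.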
A syscall is how user code asks the kernel to do something only the kernel can: open files, talk to sockets, allocate pages, fork. On x86-64 it's the `syscall` instruction (32-bit code used `int 0x80` or `sysenter`); the kernel catches it, switches to ring 0 with its own stack, services the request, and returns. Per-call cost: tens to hundreds of nanoseconds just for the transition, before the requested work begins.
Meltdown and the KPTI mitigation made this worse: crossing rings now switches to a kernel-only page-table mapping and back, and on CPUs without PCID forces TLB flushes — an extra tax of up to several hundred nanoseconds on every syscall. On syscall-heavy workloads (millions of recv/send per second per core) this showed up as a visible fleet-wide CPU regression when the patches rolled out in 2018.
The historical ladder of I/O APIs on Linux:

| API | Model | Since |
| --- | --- | --- |
| blocking `read()` | 1 syscall per request; the thread blocks | always |
| `select(2)` / `poll(2)` | 1 syscall to wait on N fds; rebuild the fd set each time | POSIX |
| epoll (edge / level) | kernel keeps the fd set; 1 syscall to get ready events | Linux 2.6 |
| AIO (libaio, `io_submit`) | async for O_DIRECT only; small API, many caveats | Linux 2.5 |
| io_uring (Axboe, 2019) | two ring buffers in shared memory; zero syscalls in the hot path with polled mode | Linux 5.1+ |
io_uring is the modern answer. Two ring buffers are shared between user space and the kernel: the SQ (submission queue, user writes, kernel reads) holds io_uring_sqe entries describing operations; the CQ (completion queue, kernel writes, user reads) holds io_uring_cqe entries with the results. Submitting work is a memory write; reaping a completion is a memory read. With IORING_SETUP_SQPOLL the kernel runs a poll thread and you can do I/O with zero syscalls in steady state.
io_uring submission (conceptual):
user space kernel
┌──────────────┐ ┌──────────────┐
│ SQ ring: │──── tail pointer ───────▶│ SQ head │
│ [SQE][SQE][..│ │ SQ poll thread│
└──────────────┘ └──────┬───────┘
│ executes
▼
┌──────────────┐ ┌──────────────┐
│ CQ ring: │◀─── writes CQEs ─────────│ completes I/O│
│ [CQE][CQE][..│ │ │
└──────────────┘ └──────────────┘
One memory barrier + one pointer update = submit a batch of I/O.
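The discipline the two rings implement can be modeled in a few lines of plain Python — a toy model of the head/tail protocol, not the real io_uring ABI (`Ring` and the tuples are invented here):

```python
class Ring:
    """Single-producer/single-consumer ring: the SQ/CQ discipline in miniature."""
    def __init__(self, size=8):
        self.entries = [None] * size
        self.head = 0          # consumer advances head
        self.tail = 0          # producer advances tail

    def push(self, item):      # producer side: a memory write + tail bump
        assert self.tail - self.head < len(self.entries), "ring full"
        self.entries[self.tail % len(self.entries)] = item
        self.tail += 1         # the real thing issues a release barrier here

    def pop(self):             # consumer side: a memory read + head bump
        assert self.head != self.tail, "ring empty"
        item = self.entries[self.head % len(self.entries)]
        self.head += 1
        return item

sq, cq = Ring(), Ring()                    # submission and completion queues
sq.push(("read", "fd=3", "len=4096"))      # user space submits: no syscall
op = sq.pop()                              # kernel poll thread consumes the SQE
cq.push((op[0], "res=4096"))               # kernel posts a CQE when the I/O is done
assert cq.pop() == ("read", "res=4096")    # user space reaps: a memory read
```

The whole point of io_uring is that both sides run this loop against shared memory, so steady-state submission and reaping never cross the ring-3/ring-0 wall.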
- A vanilla `read()` or `recv()` on Linux costs ~300–500 ns for the syscall itself on modern Intel; add page-cache/netdev work on top. On a single core doing 1 M ops/s, that's 30–50% CPU just in transition.
- io_uring supports linked and drained SQEs so you can express "open, then read, then close" as one chain and let the kernel do the sequencing without a round trip to user space per step. Combined with `IOSQE_BUFFER_SELECT` and registered buffers (`IORING_REGISTER_BUFFERS`), network servers can serve requests with near-zero per-request allocation.
- vDSO lets a few syscalls (`gettimeofday`, `clock_gettime`, `getcpu`, `time`) execute entirely in user space via a page the kernel maps into every process with the time data and the code to read it. That's how `clock_gettime(CLOCK_MONOTONIC)` can cost ~20 ns when a real syscall would cost ten times that.
- eBPF is the other half of this story: tiny verified programs loaded into the kernel that can filter packets (XDP, tc), trace syscalls (kprobes, tracepoints), or enforce policy (seccomp-BPF, Cilium). It lets you write kernel extensions without shipping a kernel module.
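The vDSO difference is observable even through Python's wrappers: `time.clock_gettime(CLOCK_MONOTONIC)` never enters the kernel, while `os.getpid` is an honest syscall on modern glibc. A rough comparison — interpreter overhead dominates both numbers, so treat the ratio, not the absolutes, as indicative:

```python
import os
import time
import timeit

n = 100_000
# clock_gettime goes through the vDSO page: no ring transition at all.
t_vdso = timeit.timeit(lambda: time.clock_gettime(time.CLOCK_MONOTONIC), number=n)
# getpid is a real syscall on modern glibc (the PID cache was removed in 2.25).
t_syscall = timeit.timeit(lambda: os.getpid(), number=n)

print(f"clock_gettime: {1e9 * t_vdso / n:.0f} ns/call (vDSO)")
print(f"getpid:        {1e9 * t_syscall / n:.0f} ns/call (real syscall)")
```

Run it twice; the numbers jitter, but the syscall side stays consistently more expensive.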
The model you want: a syscall is a trip across the wall; batch your trips or learn to live without them. The hierarchy is: blocking per-op → non-blocking + readiness (epoll) → completion-based (io_uring) → polled (io_uring + SQPOLL). Move up when the syscall budget says so.
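The middle rung of that hierarchy, readiness-based multiplexing, looks like this through Python's `select.epoll` wrapper (a pipe stands in for sockets; Linux only):

```python
import os
import select

r, w = os.pipe()
ep = select.epoll()
ep.register(r, select.EPOLLIN)         # tell the kernel once what we care about

assert ep.poll(timeout=0) == []        # nothing readable yet

os.write(w, b"ping")                   # data arrives
events = ep.poll(timeout=1)            # one epoll_wait; reports every ready fd
assert events == [(r, select.EPOLLIN)]
assert os.read(r, 4) == b"ping"

ep.close()
os.close(r)
os.close(w)
```

With ten thousand registered fds the cost of each `poll()` call stays proportional to the number of *ready* fds, not the number registered — that is the whole improvement over `select`/`poll`.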
TIP
Don't reach for io_uring until profiling shows syscall overhead as a real bottleneck. Its API has more sharp edges than epoll — registration, ring sizing, completion ordering, and a liburing dependency you don't need for a 1k-QPS service.
The wider picture. Kernel-boundary crossing is a much larger topic:
- I/O models — blocking, non-blocking, multiplexed (select / poll / epoll / kqueue), signal-driven I/O, asynchronous (POSIX AIO, Windows IOCP, io_uring). Stevens's UNIX Network Programming has a classic side-by-side table.
- Zero-copy — `sendfile`, `splice`, `vmsplice`, `tee`, `MSG_ZEROCOPY`, io_uring registered buffers. Moving bytes between fds without user-space copies.
- DMA (Direct Memory Access) and bus mastering — the hardware hands bytes directly between device and memory without the CPU. Device drivers and block I/O ride this.
- eBPF — a tiny verified bytecode machine inside the kernel. Tracers (bpftrace, bcc), packet filters (XDP, tc), LSM hooks (`bpf_lsm`), even schedulers (sched_ext, 6.12+). The single biggest Linux change in a decade.
- vDSO — a small shared library the kernel maps into every process so `gettimeofday`, `clock_gettime`, `getcpu` can run without a real syscall.
- syscall filtering — seccomp and seccomp-bpf (Station 6), which restrict which syscalls a process can even attempt.
- User-space networking — DPDK, AF_XDP, netmap. Bypass the kernel entirely for packet processing at 100+ Gbps.
- Signals vs completion events vs condition variables — three different notification mechanisms with three different cost profiles.
Where this shows up next. Security primitives (Station 6) restrict which syscalls a process can make. The boot path (Station 7) ends with the kernel running the first user-space program — via a syscall it issues to itself. Every time the Cloud & Infrastructure handbook talks about a high-throughput service, the latency and cost of the I/O model this station describes is the bottom layer.
Go deeper: Jens Axboe's "Efficient IO with io_uring" whitepaper; man io_uring_enter, man io_uring_setup; Stevens, UNIX Network Programming Vol. 1 chapter 6 (the I/O models table that has been right for 25 years); Brendan Gregg, BPF Performance Tools; Lemon, "Kqueue: A Generic and Scalable Event Notification Facility" (USENIX 2001) on the road from select to kqueue.
Station 6 — Kernel/user boundary, capabilities, namespaces, cgroups
Stations 1 through 5 gave every program a private process, fair CPU time, its own view of memory, organized access to files, and a way to talk to the kernel. Now we need to address the other side of isolation: even when a program is running correctly, how much of the machine is it allowed to see and change? The root user of a Unix system was historically all-or-nothing — a process either had full power (uid == 0, can do anything) or it did not. That model is too coarse for modern software: a web server needs to bind port 80 (requires privilege) but should not need to read /etc/shadow or modify kernel parameters.
The kernel offers a ladder of finer-grained security primitives for narrowing that power. Discretionary access control (DAC) — the classic user / group / other permission bits — is the oldest layer. Capabilities split root's power into ~40 individually grantable pieces (CAP_NET_BIND_SERVICE lets you bind port 80 without any of the others). Namespaces give a process a private view of a kernel subsystem (its own network stack, mount tree, PID space) so it literally cannot see other processes. Control groups (cgroups) limit how much CPU, memory, I/O, or PIDs a process can consume. seccomp filters which syscalls a process is allowed to make at all. Mandatory access control (MAC) — SELinux, AppArmor — adds a second layer of policy enforced by the kernel regardless of what DAC allows.
Containers (Docker, Podman, Kubernetes pods) are not a distinct kernel feature; they are a composition of namespaces + cgroups + seccomp + a layered filesystem. Once you understand each primitive individually, a container is transparent: a process with its own PID and network namespace, inside a cgroup with CPU and memory limits, with a seccomp filter restricting syscalls, rooted at an overlayfs mount.
The kernel/user boundary is the oldest security primitive in a modern OS. Ring 3 (user mode, Intel) or EL0 (ARM) cannot touch ring-0 memory, cannot execute privileged instructions (hlt, mov cr3, rdmsr), and can only invoke the kernel through the syscall gate. Everything else — capabilities, namespaces, seccomp, cgroups — sits on top of that first wall.
A container is not a kernel feature — it is a bundle of these primitives applied together. Docker starts a process under a new pid/mount/net/uts/ipc/user namespace, drops every capability except a short allowlist, applies a seccomp filter banning ~40 syscalls, sets cgroup limits for CPU and memory, and maybe applies an AppArmor profile. None of these individually are "containers"; the composition is.
- Capabilities split the old `root = all powers` into ~40 distinct privileges. `CAP_NET_BIND_SERVICE` lets a non-root process bind to port 80; `CAP_SYS_ADMIN` is still effectively root and should be feared; `CAP_DAC_READ_SEARCH` bypasses file permissions for reads. `getcap` and `setcap` are the tools; `capsh --print` shows your current set.
- seccomp-BPF lets you compile a syscall filter into a BPF program and install it on a thread. The filter sees the syscall number and argument registers and returns `ALLOW`, `ERRNO`, `KILL_THREAD`, or `TRAP`. Chromium's renderer process runs with ~150 syscalls allowed; the rest are rejected. This is the same mechanism Docker, systemd, and container runtimes use.
- Namespaces virtualize kernel resources. A pid namespace gives a process its own pid tree (pid 1 is the container's init); a mount namespace lets it have its own filesystem root without affecting the host; a network namespace gives it its own interfaces, routing table, and iptables rules. `unshare(2)` and `setns(2)` are the syscalls; every container runtime uses them.
- cgroups v2 is the accounting and throttling layer: `memory.max` caps a group at N bytes (OOM-kill on overflow), `cpu.max` sets a CFS bandwidth quota, `io.max` rate-limits block I/O, `pids.max` prevents fork-bombs. v2 (unified hierarchy) replaced the v1 per-controller hierarchies in 2016; kernel 5.x+ is v2-first.
- LSM (Linux Security Module) is the hook framework that SELinux and AppArmor live in — mandatory access control beyond DAC. SELinux labels every object; AppArmor uses path-based profiles; both can veto an operation the DAC layer would allow.
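The `CapEff` line in `/proc/<pid>/status` is just a hex bitmask of capability numbers, so decoding it is a few lines of Python. A sketch covering only a handful of the ~40 capabilities, with bit numbers taken from `linux/capability.h`:

```python
# A small subset of capability numbers from linux/capability.h.
CAPS = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    2: "CAP_DAC_READ_SEARCH",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    21: "CAP_SYS_ADMIN",
}

def decode_caps(hexmask: str) -> list[str]:
    """Turn a CapEff-style hex string into capability names (known bits only)."""
    mask = int(hexmask, 16)
    return [name for bit, name in sorted(CAPS.items()) if mask & (1 << bit)]

# A web server granted only the right to bind low ports:
# 0x400 is 1 << 10 — exactly the CAP_NET_BIND_SERVICE bit.
assert decode_caps("0000000000000400") == ["CAP_NET_BIND_SERVICE"]
```

`capsh --decode=<hexmask>` does the same job on the command line with the full table.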
The model you want: there is no single "secure" flag; security is a stack of reductions of attack surface, one primitive at a time. Each of capabilities, seccomp, namespaces, cgroups, LSM removes a different class of reach. A service that trusts only one layer is trusting none.
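One of those layers, the seccomp filter, boils down to a pure function from syscall to verdict. A toy model — the real filter is BPF bytecode inspecting the syscall number and argument registers, and this allowlist is invented for illustration:

```python
SECCOMP_RET_ALLOW, SECCOMP_RET_ERRNO, SECCOMP_RET_KILL = "ALLOW", "ERRNO", "KILL"

# A tiny allowlist in the spirit of a container runtime's default profile.
ALLOWED = {"read", "write", "close", "exit_group", "futex", "epoll_wait"}
ERRNO_INSTEAD = {"ptrace"}          # fail softly: caller sees an error, not death

def filter_syscall(name: str) -> str:
    if name in ALLOWED:
        return SECCOMP_RET_ALLOW
    if name in ERRNO_INSTEAD:
        return SECCOMP_RET_ERRNO    # syscall returns -EPERM to the caller
    return SECCOMP_RET_KILL         # default-deny: the thread dies

assert filter_syscall("read") == "ALLOW"
assert filter_syscall("ptrace") == "ERRNO"
assert filter_syscall("mount") == "KILL"
```

Default-deny is the important design choice: a new syscall added to the kernel is blocked until someone audits it, not allowed until someone notices it.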
CAUTION
--privileged on a container grants every capability, disables the seccomp filter and any AppArmor/SELinux confinement, and exposes the host's devices. The namespaces remain, but the process is effectively real root on the host with a fancy filesystem. If you see it in a docker run or Kubernetes manifest, treat it as "this container is the host." Justify every use in a comment, and revisit the justification every quarter.
The wider picture. Kernel-level security is a large topic with many pieces:
- Rings / privilege levels — x86 ring 0–3 (only 0 and 3 used in practice), ARM EL0–EL3, RISC-V M/S/U. The hardware foundation of the kernel/user boundary.
- DAC — POSIX user / group / other rwx bits, setuid / setgid / sticky bits, POSIX ACLs for finer grants, extended attributes.
- MAC — SELinux (type enforcement, RBAC, MCS/MLS), AppArmor (path-based profiles), Smack, TOMOYO. Policy the kernel enforces regardless of user preference.
- Capabilities (POSIX.1e) — ~40 distinct kernel powers: `CAP_NET_ADMIN`, `CAP_SYS_ADMIN` (the dangerous grab-bag), `CAP_DAC_OVERRIDE`, `CAP_NET_BIND_SERVICE`, `CAP_SETUID`, etc. Attach per-process; `setcap` writes them to executables.
- Namespaces — `pid`, `mnt`, `net`, `uts`, `ipc`, `user`, `cgroup`, `time`. Each gives the contained process its own view of a subsystem.
- cgroups v2 — unified hierarchy with subsystems `cpu`, `memory`, `io`, `pids`, `cpuset`, `hugetlb`, `rdma`. Resource isolation and accounting.
- seccomp and seccomp-bpf — per-process syscall allow / deny list, expressible as a BPF program.
- Landlock — unprivileged sandboxing for user-space programs (Linux 5.13+); add-only filesystem restrictions.
- Integrity subsystems — IMA (Integrity Measurement Architecture), EVM, Secure Boot, dm-verity, fs-verity. Prove that a file or filesystem hasn't been tampered with.
- Kernel hardening knobs — KASLR (kernel ASLR), KPTI (Meltdown mitigation), CFI (control-flow integrity), stack canaries, SMEP / SMAP / LAM, SLUB hardening.
- Container runtimes — runc, crun, youki, gVisor (user-space syscall emulation), Kata Containers (VM-isolated). All compose the primitives above differently.
- Virtualisation as a boundary — KVM / hypervisors cut a stronger line than namespaces; Firecracker and Cloud Hypervisor are the micro-VM flavour.
Where this shows up next. Boot (Station 7) sets up the initial security state — Secure Boot, measured boot, the root user's first capabilities. Cross-link to the Security & Cryptography handbook for how these primitives map to a threat model, and to the Cloud & Infrastructure handbook for how container orchestration uses them at scale.
Go deeper: Kerrisk, The Linux Programming Interface chapter 39 (capabilities) and chapter 42 (namespaces) — the clearest single reference; the kernel's Documentation/admin-guide/cgroup-v2.rst; LWN's article series on namespaces, seccomp, and Landlock; Smalley, "Configuring the SELinux Policy" (NSA report, still the best intro); the bpf(2) and seccomp(2) man pages.
Station 7 — Boot: UEFI, bootloader, kernel, pid 1
The computer has just been powered on. There is no operating system running. There are no processes, no file systems mounted, no network stack. The CPU is in its simplest possible state, executing whatever instructions the firmware hands it. Yet ten seconds later there is a full Linux kernel, a login prompt, and dozens of services answering network requests. How?
The answer is a sequence of small programs, each one bigger and more capable than the last, each one's only job being to load and run the next. This chain is called booting. It starts in firmware that lives on a chip soldered to the motherboard (UEFI today, BIOS historically), progresses through a bootloader (GRUB, systemd-boot, U-Boot on embedded systems) that knows enough about storage to find a kernel image, hands control to the kernel (which initialises drivers, mounts the root filesystem, sets up virtual memory and processes), and finally starts the first user-space process, pid 1 (init, systemd, launchd, openrc). Everything you see after that — logging in, loading a browser, opening a file — is pid 1 or one of its children.
Knowing the boot chain matters because every modern security feature (Secure Boot, measured boot, full-disk encryption) has to hook into it at some point, and because every "my server won't start" debugging session is walking this chain in reverse.
Between "press the power button" and "my service accepts requests" there is a chain of handovers, each one with a distinct trust model and failure mode. Not knowing the chain is fine until the day the machine won't come up.
1. Power on → CPU starts at the reset vector, runs firmware
(UEFI on modern x86 / ARM; legacy BIOS on old boxes)
▼
2. UEFI → runs firmware drivers, initializes RAM training,
enumerates PCIe, runs Option ROMs, loads the EFI
System Partition and launches an .EFI binary
▼
3. Bootloader (GRUB, systemd-boot, shim, rEFInd)
→ picks a kernel, loads kernel + initramfs into RAM,
passes command line and hands off
▼
4. Kernel early boot → decompresses, sets up page tables,
brings up CPUs (SMP), probes drivers,
mounts initramfs as /
▼
5. initramfs /init → loads block-device drivers (lvm, dm-crypt,
nvme, virtio), mounts the real root, does
a switch_root(), execs /sbin/init
▼
6. PID 1 (systemd, OpenRC, runit, tini)
→ reads units, brings up targets, mounts filesystems,
starts getty/ssh/your service
UEFI replaced legacy BIOS almost everywhere by 2012. It runs before any OS, has its own filesystem driver for FAT (the ESP — EFI System Partition), can boot directly from a network, and has a signed-boot chain called Secure Boot. The firmware verifies the bootloader's signature against keys stored in NVRAM (Microsoft's certificates are preloaded in the signature database on most x86 machines); the bootloader verifies the kernel; the kernel can verify every module. Break the chain and the firmware refuses to boot.
- Measured boot is the other half of Secure Boot: every stage hashes the next stage into a TPM PCR (Platform Configuration Register) before executing it. After boot, PCRs can be read remotely (remote attestation) or used as sealing keys so a disk only decrypts on a machine that booted the approved stack.
- initramfs is a cpio archive unpacked into a `tmpfs` at `/` by the kernel during early boot. Its purpose is to host the drivers needed to reach the real root filesystem (encryption, LVM, RAID, NVMe, network-root). After `switch_root`, the initramfs is freed — it only lives for seconds.
- systemd (Poettering et al., 2010) is the dominant pid 1 on Linux distributions. It parses `.service` / `.target` / `.socket` units, manages dependency graphs, handles journald, logind, resolved, networkd, and cgroup delegation. Its socket activation lets a service be started on first connection — the same idea as inetd, just with cgroups and a unit file.
- Startup time decomposition on a typical cloud VM: firmware/UEFI ~2–5 s, bootloader under 1 s, kernel to userspace ~1–2 s, systemd to `multi-user.target` ~2–15 s depending on services. `systemd-analyze blame` and `systemd-analyze critical-chain` are the right first tools when "why does this box take 90 seconds to boot?" is the question.
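The measured-boot arithmetic is small enough to sketch: a PCR can only be *extended*, never set, so the final value commits to the entire sequence of stages. A simplified model with hashlib (stage names are invented; real firmware hashes TCG event-log entries):

```python
import hashlib

def pcr_extend(pcr: bytes, measurement: bytes) -> bytes:
    # TPM PCR extend: new = SHA-256(old || SHA-256(measurement))
    return hashlib.sha256(pcr + hashlib.sha256(measurement).digest()).digest()

def boot_chain(stages: list[bytes]) -> bytes:
    pcr = bytes(32)                     # PCRs start at all-zeros at power-on
    for stage in stages:
        pcr = pcr_extend(pcr, stage)    # measure each stage before running it
    return pcr

good = boot_chain([b"shim-15.7", b"grub-2.06", b"vmlinuz-6.1", b"root=/dev/sda2"])
evil = boot_chain([b"shim-15.7", b"grub-2.06", b"vmlinuz-6.1-evil", b"root=/dev/sda2"])

assert good != evil      # any change anywhere in the chain changes the final PCR
assert len(good) == 32
```

Because extend is one-way, malware that runs late cannot "fix" the PCR back to the approved value — which is exactly what lets a sealed disk key refuse to unseal on a tampered stack.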
The model you want: boot is a trust handoff chain. Each link runs in a different environment with a different trust anchor; a failure at any link looks the same from the outside — "nothing happens." Trace the chain from the top when a box is dark.
WARNING
dd if=/dev/zero of=/dev/sda on a UEFI system wipes the GPT and the ESP; you will not boot again from that disk. The firmware's NVRAM boot entries now point at a partition that no longer exists, and you need a running OS (and efibootmgr) to create new ones. Keep a USB-stick rescue image and know how to enter firmware setup. You will need it once.
The wider picture. Boot is more diverse than the x86 / UEFI / systemd path many of us know:
- Firmware interfaces — UEFI (modern PCs and servers), Open Firmware / Device Tree (ARM, PowerPC, RISC-V), Coreboot (open-source firmware), LinuxBoot / u-root (Linux as the firmware payload).
- Bootloaders — GRUB 2 (the heavyweight), systemd-boot (UEFI-only, simple), rEFInd, LILO (historical), iPXE (network boot), Petitboot (kexec-based), U-Boot (a.k.a. Das U-Boot, embedded).
- Secure / verified / measured boot — Secure Boot (signed bootloader and kernel), dm-verity / fs-verity for the root filesystem, TPM attestation, Intel TXT, ARM TrustZone, AMD SEV-SNP, Confidential Computing frameworks.
- Early userspace — initramfs / initrd, dracut, mkinitcpio. A tiny root filesystem the kernel uses before the real root is available; contains the drivers and tools to find and mount the real root.
- Init systems — systemd (Linux mainstream), SysV init, OpenRC, runit, s6, upstart (historical), launchd (macOS), SMF (Solaris), rc.d (BSD).
- Service management — dependency graphs, socket activation, on-demand starting, automatic restart, journal logging, timers. systemd unified many of these; the `*.service` file format is the new lingua franca.
- Container images and the `/` filesystem — base images, OSTree-based image systems (Fedora CoreOS, Silverblue), A/B partitioning for safe upgrades (Android, Chrome OS, Fedora IoT).
- Network boot — PXE, iPXE, DHCP + TFTP + NBP, HTTP boot, WDS. Imaging datacentres at scale.
- Suspend / resume / hibernation — ACPI S-states, kernel suspend, hibernation to swap. The "wake-up" version of boot.
- Crash and recovery — kexec, kdump, early-printk, panic handlers, `sysrq`.
Where this shows up next. Everything in Stations 1–6 presumes the kernel is already running. Measured boot and dm-verity tie back to Station 7 of the Foundations handbook (hashes as content fingerprints) and to the Security & Cryptography handbook (trust anchors, TPM, attestation).
Go deeper: UEFI Specification v2.10 (the canonical reference; dry but exhaustive); the kernel's Documentation/arm/booting.rst / Documentation/x86/boot.rst; Love's Linux System Programming chapter on boot; Poettering's "systemd for Administrators" blog series; Jessie Frazelle and Julia Evans on strace-ing PID 1; Linux Kernel in a Nutshell on the boot sequence.
How the stations connect
The kernel is the node every other station routes through. A process runs because the scheduler picked it; it touches memory because the page tables map it; it touches files and sockets because VFS and the network stack translate its syscalls. Security primitives are filters that sit across the syscall gate. Boot is how all of this came to exist in this box's RAM.
The representation layer from the Foundations handbook rides on top of this stack — UTF-8 strings travel through sockets serviced by io_uring, Protobuf records are persisted by filesystem journals, content hashes dedupe pages in the page cache. The Systems & Architecture handbook starts where a second box enters the picture and Rung 7 opens.
Standards & Specs
- POSIX — IEEE Std 1003.1-2024 — the OS API nobody fully implements but everyone cites.
- System V ABI — x86-64 calling convention — the contract that lets binaries from different compilers link together.
- Linux Kernel documentation — `Documentation/scheduler/sched-design-CFS.rst`, `Documentation/admin-guide/cgroup-v2.rst`, `Documentation/filesystems/ext4.rst`, `Documentation/core-api/kernel-api.rst` — the authoritative references, and short enough to skim.
- UEFI Specification v2.10 and ACPI Specification — the firmware interfaces every modern boot rides on.
- RFC 7530 — NFSv4.0 and RFC 7862 — NFSv4.2 — the network-filesystem side of VFS.
- TCG TPM 2.0 library specification — the secure-boot and attestation anchor.
- Canonical papers — Dijkstra, "The Structure of the 'THE' Multiprogramming System" (1968); Ritchie & Thompson, "The UNIX Time-Sharing System" (1974); Liu & Layland, "Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment" (1973); Ousterhout, "Scheduling Techniques for Concurrent Systems" (1982); Rosenblum & Ousterhout, "The Design and Implementation of a Log-Structured File System" (1992); McKusick et al., "A Fast File System for UNIX" (1984); Bonwick, "The Slab Allocator" (1994); Axboe, "Efficient IO with io_uring" (2019).
- Books — Bovet & Cesati, Understanding the Linux Kernel (3rd ed). Kerrisk, The Linux Programming Interface. Love, Linux Kernel Development. Stevens & Rago, Advanced Programming in the UNIX Environment. Tanenbaum & Bos, Modern Operating Systems. McKusick et al., The Design and Implementation of the FreeBSD Operating System.
Test yourself
A Python service using threads to do CPU-bound work tops out at ~100% of one CPU on a 32-core box. Why, and what changes when the work is I/O-bound?
The GIL (Global Interpreter Lock) in CPython serialises bytecode execution — only one thread runs Python at a time, so threads don't scale CPU-bound work. For CPU-bound work, use multiprocessing (separate processes, separate interpreters, separate address spaces) or the no-GIL Python 3.13+ build. For I/O-bound work the GIL is released around blocking syscalls, so N threads can happily wait on N sockets at once — and at that point you should ask whether asyncio or a single-threaded event loop is cheaper than the context-switch traffic. Rung 2 in the Systems & Architecture ladder has more on this trade-off.
A Postgres host shows iostat -x with %util at 99% on the data disk but await still under 1 ms. Is the disk the bottleneck?
Not necessarily. %util is "the fraction of time the device had at least one request outstanding," and on modern NVMe with high queue depths it can pin at 99% while serving plenty of headroom — the queue depth is saturated, the throughput is not. Look at r/s, w/s, rkB/s, wkB/s vs the device's advertised IOPS/bandwidth, and at await/svctm. If await is flat, the disk is coping. If %util and await both climb together, then you're saturated. See Station 4, and Little's Law in the Systems & Architecture handbook.
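The arithmetic behind that answer, as a sketch (the numbers are invented): at queue depth 1, utilization is simply IOPS × latency, so a parallel device can read "100% busy" while using a fraction of its capacity.

```python
# The device counts as "busy" whenever >= 1 request is in flight.
iops = 1000          # requests per second the host is issuing
await_s = 0.001      # 1 ms average latency per request

util = min(iops * await_s, 1.0)
assert util == 1.0   # iostat would show %util ~ 100% — always one request in flight

# But an NVMe serving, say, 32 requests concurrently at the same per-request
# latency could sustain ~32x the throughput before it is actually saturated.
max_iops = 32 / await_s
assert max_iops == 32000
```

That is why `%util` alone cannot answer "is the disk the bottleneck" on devices with internal parallelism.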
A Kubernetes pod is OOM-killed despite its container showing 40% of its memory limit used in kubectl top. Predict the cause and the right mitigation.
kubectl top shows working-set RSS; cgroup memory accounting also counts the page cache attributed to the cgroup, and on kernel 5.x+ with cgroups v2 memory.max kills the cgroup when total accounted memory (RSS + cache + kernel) exceeds the limit. A process that reads many large files can push the page cache over the limit even though RSS is low. Mitigations: raise the limit, use O_DIRECT for the hot path so reads bypass the page cache, or use posix_fadvise(POSIX_FADV_DONTNEED) to let the kernel drop pages promptly. See Station 3 and Station 6.
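The `posix_fadvise` mitigation, sketched with a temp file standing in for the real data files (the advice is just that — advisory; the kernel may or may not drop the pages immediately):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * (1 << 20))     # a 1 MiB scan that lands in the page cache
    os.fsync(fd)                       # only clean pages are eligible for dropping
    # Tell the kernel we will not touch these pages again, so they stop
    # counting against this cgroup's page-cache charge.
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    os.lseek(fd, 0, os.SEEK_SET)
    data = os.read(fd, 4)              # the file itself is untouched
finally:
    os.close(fd)
    os.unlink(path)

assert data == b"xxxx"                 # only cache residency was dropped, not data
```

Call it after each file in a batch job and the cgroup's accounted memory stays close to actual RSS instead of ballooning with cold cache.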