Distributed Systems

The network is unreliable, and every interesting problem here is a consequence of that fact.

On this page

The working table of contents.

  1. The moment you have two computers, everything changes — messages take time, clocks disagree, machines crash without warning.
  2. The fallacies of distributed computing — the eight assumptions developers wrongly make (the network is reliable, latency is zero, etc.).
  3. Consistency models — what "correct" means when copies of data exist on multiple machines. Linearizability (strongest, slowest), eventual consistency (weakest, fastest), and the spectrum between.
  4. CAP theorem — a system can guarantee at most two of three: consistency, availability, partition tolerance. Since partitions always happen, you're really choosing between C and A during failures.
  5. Consensus — how machines agree on a single value when some of them might be dead. Raft as the approachable algorithm (leader election, log replication). Why it's hard and why it matters (every replicated database, every leader election depends on it).
  6. Replication — copies of data for durability and read scaling. Leader-follower, multi-leader, leaderless. Quorums.
  7. Partitioning/sharding — splitting data across machines when one machine can't hold it all. Hash-based vs range-based. The rebalancing problem.
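The quorum idea in item 6 fits in a few lines. This is a minimal sketch, not any real database's API: `QuorumStore` and its replica layout are invented for illustration. The key invariant is that R + W > N forces every read set to overlap every write set, so a versioned read can always find the latest write.

```python
# Hedged sketch of leaderless quorum reads/writes over N in-memory replicas.
# QuorumStore is a made-up name for illustration, not a real library.

class QuorumStore:
    def __init__(self, n=3, w=2, r=2):
        assert w + r > n, "R + W must exceed N so read and write sets overlap"
        self.n, self.w, self.r = n, w, r
        # Each replica maps key -> (version, value).
        self.replicas = [{} for _ in range(n)]

    def write(self, key, value, version):
        # A real system picks any W reachable replicas; here, the first W.
        for rep in self.replicas[:self.w]:
            rep[key] = (version, value)

    def read(self, key):
        # Query R replicas and keep the value with the highest version.
        # Because R + W > N, at least one queried replica saw the last write.
        responses = [rep[key] for rep in self.replicas[-self.r:] if key in rep]
        return max(responses)[1] if responses else None
```

With N=3, W=2, R=2, a write lands on replicas 0 and 1 and a read queries replicas 1 and 2; replica 1 is the guaranteed overlap.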
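The hash-versus-range trade-off in item 7 can also be shown directly. This is an illustrative sketch (the function names and shard boundaries are assumptions, not a real system): hashing spreads keys evenly but scatters adjacent keys across shards, while range partitioning keeps adjacent keys together at the cost of hot ranges.

```python
import bisect
import hashlib

def hash_partition(key: str, num_shards: int) -> int:
    # Hash-based: even spread, but a range scan must hit every shard.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def range_partition(key: str, boundaries: list[str]) -> int:
    # Range-based: adjacent keys land on the same shard (good for scans),
    # but a popular key range overloads one shard.
    return bisect.bisect_right(boundaries, key)

# Usage with 4 shards; the boundaries split the keyspace at "g", "n", "t".
assert 0 <= hash_partition("user:42", 4) < 4
assert range_partition("apple", ["g", "n", "t"]) == 0
assert range_partition("mango", ["g", "n", "t"]) == 1
assert range_partition("zebra", ["g", "n", "t"]) == 3
```

The rebalancing problem from item 7 shows up here too: with naive `hash % num_shards`, changing `num_shards` remaps almost every key, which is why real systems reach for consistent hashing or fixed virtual partitions.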
Going deeper

Branches that earn their own article.

  • Paxos (the original consensus algorithm).
  • Byzantine fault tolerance (BFT).
  • Logical clocks (Lamport, vector, hybrid logical).
  • CRDTs (conflict-free replicated data types).
  • Distributed transactions (2PC, 3PC, Saga pattern).
  • Gossip protocols.
  • PACELC theorem.
  • Formal verification of distributed protocols (TLA+).
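Of these branches, logical clocks are compact enough to preview here. A minimal Lamport clock sketch (illustrative, not a library API): increment on every local event, and on receive jump past the sender's timestamp, so causally related events get ordered timestamps.

```python
# Minimal Lamport clock sketch (a preview of the logical-clocks branch above).
# Rule: tick on every local event; on receive, jump past the sender's timestamp.

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self) -> int:
        self.time += 1
        return self.time

    def send(self) -> int:
        # A send is a local event whose timestamp travels with the message.
        return self.local_event()

    def receive(self, msg_time: int) -> int:
        # Take the max of both clocks, then tick, so the receive is
        # ordered after both the send and all prior local events.
        self.time = max(self.time, msg_time) + 1
        return self.time

# Usage: A ticks to 2, sends at 3; B (still at 0) receives at max(0, 3) + 1 = 4.
a, b = LamportClock(), LamportClock()
a.local_event()
a.local_event()
t = a.send()
assert t == 3
assert b.receive(t) == 4
```

Note what this buys and what it doesn't: if event x causally precedes y, then x's timestamp is smaller, but equal-or-smaller timestamps don't imply causality. Detecting concurrency is what vector clocks add.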