Systems Design & Cloud

This is Vc within Act V, where the unreliable network of Va and the API contracts of Vb become a single system. Once a request crosses a process boundary, every decision (how to split code, how to shape APIs, when to cache, where to run, how to tell whether the system is healthy) is a hedge against partial failure. The decisions split into five lanes: architecture, APIs, events, cloud, observability. The lanes look independent but aren't: picking microservices forces choices in every other lane. This chapter walks them in order.

From monolith to services

A growing engineering team eventually hits a wall: twenty people on one binary spend more time merging than coding, build times balloon, deploys queue, and one team's bug blocks another team's release. The fix is to split the system so teams ship independently — but every split costs something. The architecture spectrum reduces to one question: how many independently-deployable processes does the system run?

A monolith is one process holding every feature, with one database and one deploy. Function calls cost nanoseconds, transactions span the whole codebase, refactors happen in a single PR. Stack Overflow ran on this shape for years. The ceiling is human, not technical: coordination cost rises faster than headcount.

A modular monolith keeps the one-process deploy but enforces internal boundaries: each module lives in its own package and exposes an internal API that other modules must call through. You get monolith simplicity with service-shaped seams you can cut later. Shopify runs this way. The boundary you draw early becomes the cut line if you ever need one, and you may never need it.

Microservices flip the trade. Each service has its own repo, deploy, database, and scaling profile. Teams ship without coordinating; a bug stays in one service. The cost: every function call that used to be free is now a network round trip, with all of Act Va's failure modes. Service discovery, distributed tracing, schema-versioned APIs, and a 24/7 platform team all become prerequisites. Below roughly 50–100 engineers, you pay the tax without earning the benefit.

Cell-based architecture refines the right end. A cell is a complete copy of the service stack sized for a known fraction of traffic — say 5%. Users shard to cells deterministically by user-id hash, region, or tenant. A cell failing affects only its shard; the other 95% of users see nothing. AWS uses this internally; Slack adopted it after a 2021 outage.

[Figure: The architecture spectrum, from monolith to modular monolith to microservices to cells. Monolith: 1 process, 1 deploy, 1 team. Modular monolith: 1 process with modules (orders, billing, users, notify), 3–5 teams. Microservices: N processes, N deploys, 10+ teams. Cell-based: N services × M cells, platform org. Axis: team-coordination cost, increasing left to right. Takeaway: pick the smallest topology that solves your team-coordination problem.]
Conway's law made literal: the system shape mirrors the team shape. Right-size both.

Pitfall — distributed monolith. The worst topology looks like microservices but ships like a monolith: services that must all deploy together, share a database, and chain synchronous calls deep enough that p99 latency stacks (ten services at 50 ms each is 500 ms minimum, before retries). You bought the network tax and the deploy tax. The cure is real boundaries — separate stores, async messaging where it fits, no PR touching more than one service.

The cell shape is what mature microservices platforms converge on. Each cell holds the full stack — load balancer, services, datastores — for one shard. Routing happens at the edge by user-id hash. Re-sharding is mechanical: spin up a new cell, drain users to it, retire the old one.

[Figure: Cell-based deployment: shard users across isolated stacks within a region. An edge router shards by user-id hash (consistent hash, routing each user to a cell deterministically) across four cells, α, β, γ, and δ, each with its own LB, api, db, and cache and each serving 25% of users. Cell γ is down; its failure affects only its 25%, and the other three cells never notice.]
Cell isolation pushes the failure unit from "service" to "user shard." A bad deploy is a 25% incident, not a 100% one — and the rollback is one cell.
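Deterministic routing at the edge can be sketched in a few lines. The version below uses rendezvous (highest-random-weight) hashing, which is one member of the consistent-hashing family and is swapped in here for brevity; the cell names, the `pick_cell` helper, and the choice of SHA-256 are illustrative assumptions, not something the chapter prescribes.

```python
import hashlib


def pick_cell(user_id: str, cells: list[str]) -> str:
    """Rendezvous hashing: score every cell for this user, pick the top score."""
    def score(cell: str) -> int:
        digest = hashlib.sha256(f"{cell}:{user_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(cells, key=score)


# Hypothetical cell names, one per shard.
cells = ["alpha", "beta", "gamma", "delta"]
users = [f"user-{i}" for i in range(1000)]
before = {u: pick_cell(u, cells) for u in users}

# Retire one cell: only the users that lived on it should be re-homed.
survivors = [c for c in cells if c != "gamma"]
after = {u: pick_cell(u, survivors) for u in users}
moved = [u for u in users if before[u] != after[u]]
```

The property that matters for re-sharding is that a user's winning cell does not depend on the losers: removing a cell moves only the users it was serving, which is what makes draining and retiring a cell mechanical.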

API design

Once two services exist, they need an agreed format for talking. Three families dominate, each optimized for a different shape of caller.

REST treats the system as a set of resources at stable URLs and uses HTTP verbs as operations: GET /users/123 reads, POST /users creates, PUT /users/123 replaces, DELETE /users/123 removes. GET is safe and cacheable, and PUT and DELETE are idempotent; these are semantics every HTTP proxy, CDN, browser, and curl invocation already understands. REST is the default for public APIs because every developer already knows it. The cost shows up at scale: related data needs follow-up calls (fetch a user, then their posts, then each post's comments: the N+1 pattern), and endpoints return whole resources whether you wanted three fields or thirty.

gRPC models a call as a function invocation. Schema-first via Protocol Buffers — handed forward from Act I — it generates strongly-typed client and server stubs in many languages. The wire format is binary and several times smaller than JSON; HTTP/2 underneath supports four call shapes (unary plus client-, server-, and bidirectional-streaming). Internal service-to-service calls at Google and Netflix run on gRPC because the schema enforces compatibility and streams map well to long-lived calls. The cost: not browser-friendly without a translation proxy, debugging needs special tools, and you can't curl an endpoint to read the response.

GraphQL flips the model. Instead of the server defining endpoints, the client sends a query picking exactly the fields it wants from the types it cares about. One endpoint, one schema, arbitrary response shapes. This kills over-fetching and the N+1 round-trip — Facebook built it for slim payloads on weak mobile networks. The trade-off: HTTP caching breaks (every query is a unique POST body), resolvers can fan out to dozens of backend calls per query (the N+1 moved server-side), and authorization moves field-by-field rather than per endpoint.

[Figure: REST, gRPC, and GraphQL: three answers to the same question. REST (resources + verbs; fits public APIs and CRUD): HTTP-cache friendly, universal toolchain, verbs as semantics, but N+1 round trips, over-fetching, no built-in streams. gRPC (function calls + streams; fits internal microservices): binary and much smaller, schema-typed stubs, HTTP/2 streams, but not browser-native, opaque to curl, proxy-unfriendly. GraphQL (client picks fields; fits mobile and aggregating UIs): exact-shape responses, one endpoint, schema plus introspection, but HTTP caching breaks, resolver fan-out, per-field auth. Pick by caller; real platforms layer all three.]
The three are different sweet spots, not competitors. A common stack: gRPC between services, a GraphQL gateway facing the mobile app, REST for the public-developer API.

Pitfall — versioning blindness. Every API contract changes; clients in the wild outlive servers. A mobile app on someone's phone may keep calling your three-year-old endpoint long after you've forgotten about it. REST versions by URL prefix (/v1/) or Accept header. gRPC relies on Protobuf's tag rules: never reuse a field number, never change a field's type. GraphQL adds and deprecates fields without version bumps. Commit to forward and backward compatibility, or accept that every release is a rolling migration.

[Figure: Three API versioning strategies and their coupling trade-offs. URL versioning (explicit, breaking): GET /v1/users/42 returns {id, name} and GET /v2/users/42 returns {id, name, email}; v1 is frozen, routing is by prefix, and there are two codepaths. Obvious in logs and caches cleanly, but forks the spec. Header versioning (content-negotiated): GET /users/42 with Accept: application/vnd.ex.v1+json or application/vnd.ex.v2+json; one URL, two shapes. Same URL forever with negotiation built in, but proxies miss it. Field evolution (additive only): GET /users/42 returns {id, name} and later {id, name, email}; never remove, never retype, deprecate slowly. No version bumps and old clients keep working, but discipline is required. Summary: the URL is loud; the header is clean but invisible; evolution is cheapest if discipline holds.]
Field evolution is what gRPC and GraphQL force structurally — never break, only add and deprecate.

Auth, retries, and rate limits

Anything past a toy needs to know who is calling and how to survive when callers misbehave. Three cross-cutting concerns appear by month two.

Authentication answers "who is this." For browser and mobile clients calling third-party APIs, the modern default is OAuth 2.0 authorization code with PKCE. The user authenticates with the auth server directly (the app never sees credentials); the auth server issues a short-lived authorization code; the app exchanges it for an access token. PKCE (Proof Key for Code Exchange) closes a hole on mobile: an attacker stealing the redirect URL could otherwise redeem the code. The app generates a random verifier, sends only its hash with the initial request, and presents the original verifier when redeeming. Without the verifier the attacker never saw, the code is useless.

[Figure: OAuth 2.0 authorization code flow with PKCE, as a sequence between Browser, App, Auth server, and Resource API. 1. User clicks "log in". 2. App generates verifier = rand and challenge = SHA256(verifier). 3. Redirect to auth with code_challenge and redirect_uri. 4. Auth screen; user enters credentials. 5. Redirect back with auth_code. 6. POST /token with code and code_verifier. 7. Auth checks SHA256(verifier) = challenge. 8. access_token returned as a short-lived bearer. 9. Resource returns user data. PKCE binds the redeemed code to the original initiator.]
Short-lived access tokens (15 min) paired with a refresh token (days) keep the bearer-token blast radius small.

Worked example: one login through PKCE, step by step

A user opens a mobile app and taps "Sign in with Acme." Four parties are involved: the user, the app on the phone, Acme's auth server, and Acme's resource API.

  1. The app generates a secret. It picks a random 43-character string, call it verifier = "xK3...q9". It hashes it with SHA-256 and base64url-encodes the digest to get challenge = BASE64URL(SHA256(verifier)). Only the challenge leaves the device at this point; the verifier stays on the phone until step 5.
  2. The app opens the auth URL in a browser. The URL carries the challenge, the hash method, a redirect_uri the app controls, and a state nonce: https://acme.auth/authorize?code_challenge=abc...&code_challenge_method=S256&redirect_uri=app://cb&state=xyz.
  3. The user logs in to Acme directly. Username, password, maybe a second factor — all entered into the Acme page. The app never sees the credentials.
  4. Acme redirects back with an authorization code. The browser receives app://cb?code=ONE_TIME_CODE&state=xyz. The app intercepts the redirect.
  5. The app exchanges the code for a token. It POSTs { code: ONE_TIME_CODE, code_verifier: "xK3...q9" } to Acme's token endpoint.
  6. Acme verifies the verifier. It computes SHA256("xK3...q9") and checks it matches the challenge from step 2. If yes, it returns { access_token, refresh_token }.
  7. The app calls the resource API. Every request to api.acme.com carries Authorization: Bearer <access_token>.

Why the verifier matters: suppose an attacker intercepts the redirect in step 4 and steals the code. They try to redeem it — but they don't have the verifier (it never left the phone). Step 6's hash check fails and the code is useless. The verifier binds the redeemed code to the same client that started the flow.
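The verifier and challenge arithmetic from steps 1 and 6 can be sketched with the standard library alone. This is a minimal sketch of the S256 method as RFC 7636 describes it (the challenge is the base64url-encoded SHA-256 digest with padding stripped); the function names are hypothetical, and a real client would normally lean on an OAuth library rather than hand-rolling this.

```python
import base64
import hashlib
import secrets


def make_verifier() -> str:
    # 32 random bytes, base64url-encoded without padding: 43 characters.
    return secrets.token_urlsafe(32)


def challenge_for(verifier: str) -> str:
    # S256: base64url(SHA-256(verifier)) with the "=" padding removed.
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def server_accepts(stored_challenge: str, presented_verifier: str) -> bool:
    # What the token endpoint does in step 6: re-hash and compare.
    return secrets.compare_digest(challenge_for(presented_verifier), stored_challenge)


verifier = make_verifier()            # stays on the device until step 5
challenge = challenge_for(verifier)   # travels in the authorize URL
```

An attacker who captured only the authorization code has neither value that hashes to the stored challenge, so `server_accepts` returns False for anything they can present.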

Circuit breakers protect a service from a sick dependency. Naively retrying a failing downstream exhausts your thread pool on timeouts, dragging your service down with theirs. A circuit breaker wraps each outgoing call and tracks the recent failure rate. In the closed state, calls flow. When failures cross a threshold (say 50% of the last 20 calls), it opens — every subsequent call fails instantly. After a cooldown, the breaker goes half-open and lets one probe call through; success closes it, failure re-opens. The point is fail-fast: 1000 instant failures don't hurt, 1000 thirty-second timeouts do.

[Figure: Circuit breaker state machine: closed, open, half-open. CLOSED (calls flow, failures counted) moves to OPEN (fail fast, no calls go out) when the failure rate crosses the threshold; OPEN moves to HALF-OPEN (one probe call only) when the cooldown elapses; a successful probe returns to CLOSED and a failed probe re-opens. Trip: 50% failures over the last 20 calls (configurable). Cooldown: 30 s typical, prevents a stampede of retries. Probe: one in-flight call gates the recovery decision.]
Pair the breaker with a fallback — a cached value, a default, a structured error — so the caller still gets a useful response.
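The three states fit in a small class. What follows is a single-threaded sketch under stated assumptions: the class and exception names are hypothetical, the defaults mirror the 50%-of-20-calls and 30-second figures used above, the breaker only trips once the window is full, and the "one probe" rule holds only for sequential callers (a concurrent version would need a lock).

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, window=20, threshold=0.5, cooldown=30.0, clock=time.monotonic):
        self.results = deque(maxlen=window)  # True marks a failed call
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("failing fast")
            self.state = "half-open"  # cooldown elapsed: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "half-open":
            self.state = "closed"
            self.results.clear()
        self.results.append(False)

    def _on_failure(self):
        if self.state == "half-open":
            self._trip()  # the probe failed: re-open
            return
        self.results.append(True)
        full = len(self.results) == self.results.maxlen
        if full and sum(self.results) / len(self.results) >= self.threshold:
            self._trip()

    def _trip(self):
        self.state = "open"
        self.opened_at = self.clock()
        self.results.clear()
```

Callers would typically catch `CircuitOpenError` and return the fallback value rather than letting it propagate.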

Rate limiting protects you from callers and callers from each other. The pattern every API gateway settles on is the token bucket: a bucket holds up to B tokens, refills at R per second, and each request consumes one. A burst of B requests drains the bucket instantly; sustained traffic above R gets rejected with HTTP 429. This shape allows short bursts (a user mashing refresh) while bounding sustained throughput. State lives per key — per IP, per API key, per user — usually in Redis so all gateway instances share the counter.

[Figure: Token bucket: capacity B, refill rate R, each request consumes one token. The refill source drips R tokens per second (e.g. R = 10/s) into a bucket of capacity B, the burst limit. A request takes one token and is accepted; if the bucket is empty it is rejected with 429 and Retry-After. Behavior: a burst of B requests is accepted instantly and drains the bucket; a sustained rate at or below R is accepted in steady state; a sustained rate over R has the excess rejected with 429 + Retry-After. State is per key: one bucket per IP, API key, or user.]
Set B to allow honest bursts, set R to bound sustained load, and return 429 with a Retry-After hint so clients back off.
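A token bucket needs only two numbers of state per key. The sketch below keeps that state in process memory, which is an assumption made for brevity; a gateway fleet would hold the same two fields in a shared store so every instance sees one counter. The class name and the `retry_after` helper are hypothetical.

```python
import time


class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float, clock=time.monotonic):
        self.capacity = capacity        # B: the largest burst allowed
        self.refill_rate = refill_rate  # R: tokens added per second
        self.clock = clock
        self.tokens = capacity          # start full
        self.last = clock()

    def allow(self) -> bool:
        """Refill lazily for the time elapsed, then try to spend one token."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the caller would answer HTTP 429

    def retry_after(self) -> float:
        """Seconds until one token is available, for a Retry-After header."""
        return max(0.0, (1 - self.tokens) / self.refill_rate)
```

Refilling lazily on each request avoids a background timer: the bucket computes how many tokens would have dripped in since the last call and caps the total at B.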

Event-driven architecture

A synchronous call says "A calls B, A waits, B answers." That works until B is slow, B is down, or A's caller times out. Event-driven architecture flips the question: instead of "call me back when you have an answer," it's "publish what happened and let whoever cares react."

A producer writes an event ("Order#42 created") to a topic in a broker — Kafka, NATS, RabbitMQ, SQS, Pub/Sub. Consumers subscribe to topics and process events independently. The producer doesn't know who's listening; adding a new consumer changes nothing on the producer side. Failure isolation falls out: if the email service is down when the order is placed, the event sits in the queue until it recovers. Brokers like Kafka sustain millions of events per second and retain them for days so they can be replayed.

[Figure: Pub/sub topology and an event-sourced order timeline. Top: the producer order-svc appends to the topic orders.events (append-only log, partitioned, retention 7 d, replayable), which is read by the consumer groups billing-svc, email-svc, and analytics-svc. Bottom: event-sourced timeline for order #42: OrderCreated 10:00:01, ItemAdded 10:00:08, Paid 10:01:14, Shipped 10:42:00, giving derived state status: shipped, items: 1, total: $42.00. Events are the source of truth; current state is a fold over the timeline.]
Pub/sub above; event-sourced state below. The broker delivers in real time; the durable log lets you rebuild state from scratch.

Event sourcing takes the idea further: store the events themselves as the system of record, not a "current state" table. The current state of an order is what you get by folding OrderCreated → ItemAdded → Paid → Shipped from empty. Auditing falls out (the log is the audit). Replay is free (rebuild any read model from time zero). Time travel is free (project state as of any past timestamp). The cost is conceptual: you can never edit history. A bug means writing a correction event, not a DELETE.
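"Current state is a fold over the events" translates almost literally into code. The sketch below is illustrative only: the `Event` shape, the `apply` reducer, and the payload fields (`sku`, `price`) are assumptions, chosen to match the order timeline used in this section.

```python
from dataclasses import dataclass, field
from functools import reduce


@dataclass(frozen=True)
class Event:
    kind: str
    data: dict = field(default_factory=dict)


def apply(state: dict, event: Event) -> dict:
    """Return the next state; never mutate the previous one."""
    if event.kind == "OrderCreated":
        return {"status": "created", "items": [], "total": 0.0}
    if event.kind == "ItemAdded":
        return {
            **state,
            "items": state["items"] + [event.data["sku"]],
            "total": state["total"] + event.data["price"],
        }
    if event.kind == "Paid":
        return {**state, "status": "paid"}
    if event.kind == "Shipped":
        return {**state, "status": "shipped"}
    return state  # unknown event kinds are ignored


def project(events) -> dict:
    # Fold from the empty state over the whole log.
    return reduce(apply, events, {})


log = [
    Event("OrderCreated"),
    Event("ItemAdded", {"sku": "W1", "price": 42.0}),
    Event("Paid"),
    Event("Shipped"),
]
current = project(log)          # state now
as_of_payment = project(log[:3])  # "time travel": state after the third event
```

Replaying a prefix of the log is all that time travel amounts to, and a correction is just one more event appended at the end.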

CQRS (Command Query Responsibility Segregation) pairs naturally. Writes ("commands") flow through one path producing events; reads flow through projections — denormalized views built by consuming the event stream, one per query shape. The two sides scale independently: a 1000:1 read/write workload runs 100 read replicas behind one write leader without contention.

[Figure: CQRS: the write side and read side share the event log but optimize for different shapes. Write side: Command (PlaceOrder) → Validator (invariants) → Event store (append-only), with a write store of aggregate roots. Read side: projectors A, B, and C consume the event store and build read stores for order list, stats, and search, which serve the Query (GetOrders) for the UI's user dashboard. One source of truth, many derived shapes.]
Each read store is a cache of the event log shaped for one query. Lose it and you rebuild from the log.

Pitfall — eventual consistency. Read projections lag the write log by milliseconds to seconds. A user who places an order and immediately refreshes their dashboard may not see it. UIs paper over this with optimistic updates; APIs offer strong-read modes that go to the write store. Either way, "read your own writes" stops being free — design for it explicitly.

Sagas and the outbox

A business operation touching three services ("place order, charge card, schedule shipping") can't run as a single ACID transaction — no two-phase commit across HTTP. A saga decomposes the workflow into steps. Each step is a local transaction in one service that emits an event triggering the next. If a step fails partway, compensating events undo the previous steps' effects: payment fails, so InventoryReleased frees the held items and OrderCancelled closes the order. Compensations aren't rollbacks — the inventory really was held for thirty seconds — they're corrective actions that restore consistency.

Sagas come in two shapes. Choreography has each service subscribe to events and react: no central brain, fewest moving parts, hardest to debug. Orchestration uses a central coordinator (Temporal, Step Functions, Camunda) that drives the sequence and handles retries: easier to observe, more code.

[Figure: A saga: forward path with compensating events on failure, across Order, Inventory, Payment, Shipping, and Done. Forward path: create order (OrderCreated) → reserve items (ItemsReserved) → charge card, which FAILS → schedule ship → commit. Compensation path, rolling backward: payment declined (PaymentFailed) → release items (InventoryReleased) → cancel order. Choreography: each service subscribes to events; no central brain; fewest moving parts. Orchestration: a coordinator owns the workflow; easier to observe, more code to write.]
The forward path is the happy case; the compensation path is the reality.

Worked example: one order saga that fails at payment

A user clicks "Place order" for one widget at $42. Four services need to agree, none of them shares a database. No two-phase commit, no global transaction — each step is a local commit in one service, glued by events.

  1. Order service begins. It writes orders(id=42, status=pending, total=42.00) in its own DB and emits OrderCreated{order_id=42} to the broker. The local transaction is done.
  2. Inventory service consumes OrderCreated. It writes reservations(order_id=42, sku=W1, qty=1) and decrements stock by 1 in its own DB. Both rows commit together. It emits InventoryReserved{order_id=42}.
  3. Payment service consumes InventoryReserved. It calls Stripe with the user's saved card. Stripe returns card_declined. Payment writes payments(order_id=42, status=failed) and emits PaymentFailed{order_id=42}.
  4. Inventory service consumes PaymentFailed. This is the compensating step. It deletes the reservation row and increments stock back to its prior value. It emits InventoryReleased{order_id=42}. Note: the original reservation really happened — the widget was unavailable for the 800 ms between step 2 and step 4. The compensation doesn't erase that; it corrects forward.
  5. Order service consumes PaymentFailed. It updates orders(id=42, status=cancelled) and emits OrderCancelled{order_id=42}. The saga ends.

Two properties make this work. Idempotency: each consumer dedups on order_id, so if PaymentFailed is delivered twice (brokers redeliver when a consumer crashes before acknowledging), Inventory doesn't release stock twice. Compensation coverage: every forward step with a side effect has a compensating step, and every step is driven by a durable event, so if step 2 had succeeded but Payment had crashed mid-write in step 3, the redelivered InventoryReserved event would resume the saga from a known point. There's no rollback button, only more events.
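The control flow of the failing saga can be condensed into an in-process sketch. Note the swap: the worked example above is choreographed through a broker, whereas this sketch uses a tiny orchestrator loop because it fits in one file. Every function name and the `ctx` dictionary are hypothetical stand-ins for the services' local transactions.

```python
def run_saga(steps, ctx) -> bool:
    """Run (action, compensate) pairs in order; on failure undo newest-first."""
    completed = []
    for action, compensate in steps:
        try:
            action(ctx)
        except Exception:
            for undo in reversed(completed):
                undo(ctx)
            return False
        completed.append(compensate)
    return True


def create_order(ctx):
    ctx["order"] = "pending"
    ctx["events"].append("OrderCreated")


def cancel_order(ctx):
    ctx["order"] = "cancelled"
    ctx["events"].append("OrderCancelled")


def reserve_items(ctx):
    ctx["stock"] -= 1
    ctx["events"].append("InventoryReserved")


def release_items(ctx):
    ctx["stock"] += 1
    ctx["events"].append("InventoryReleased")


def charge_card(ctx):
    ctx["events"].append("PaymentFailed")
    raise RuntimeError("card_declined")  # simulated decline


def refund_card(ctx):
    ctx["events"].append("PaymentRefunded")


ctx = {"order": None, "stock": 10, "events": []}
ok = run_saga(
    [(create_order, cancel_order), (reserve_items, release_items), (charge_card, refund_card)],
    ctx,
)
```

The failed step's own compensation never runs, because the charge never took effect; only the steps that completed are undone, in reverse order.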

The dual-write trap hides in step 1. "Write the order row, then publish OrderCreated" is two operations on two systems. If the process crashes between them, the order exists with no event — Inventory never reserves, the order sits forever in pending. The outbox fixes this: step 1 writes the orders row and an outbox(event=OrderCreated, sent=false) row in the same local transaction. A separate relay process polls the outbox, publishes events to the broker, and marks them sent. If the relay crashes before publishing, the event is still in the outbox; the next run picks it up. If it crashes after publishing but before marking sent, the event is published twice — and the consumer dedup from above handles it.

Pitfall — the dual-write problem. "Save the order, then publish the event" has two failure windows: the DB write succeeds and the publish fails (event lost), or the publish succeeds and the DB write rolls back (event references a non-existent row). Two-phase commit across a DB and a broker is rarely available and rarely advisable.

The outbox pattern sidesteps it. The service writes the business row and the event row in the same local transaction. A separate relay process polls the outbox table — or tails the database's change log via change-data-capture — publishes each unpublished event, and marks it sent. If the relay crashes mid-publish, the event is still in the outbox and gets retried. At-least-once delivery falls out for free; consumers dedup with idempotency keys.

[Figure: The outbox pattern: business row and event row in one transaction; the relay publishes after commit. The service order-svc handles PlaceOrder and, in one database transaction, INSERTs into the orders table (id=42 status=new total=42.00) and the outbox table (id=99 type=OrderCreated payload=... sent=false); on COMMIT both rows become visible together. The relay (polling, or tailing via CDC) reads unsent rows, publishes to the orders.events topic on the broker, and marks them sent. Why this works: one local transaction, no distributed commit; a crash before publish means the relay retries from the outbox; a crash after publish means a duplicate event that the consumer dedups by id.]
The dual-write problem dies once the event lives in the same database as the row that produced it.
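The whole pattern fits in an in-memory SQLite database. The sketch below is a minimal illustration, not a production relay: the table layout follows the figure, while `place_order`, `relay_once`, and the `publish` callback are hypothetical names, and a real relay would add batching and concurrency control.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, total REAL);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         type TEXT, payload TEXT, sent INTEGER DEFAULT 0);
""")


def place_order(order_id: int, total: float) -> None:
    # One local transaction: both rows commit together or neither does.
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'new', ?)", (order_id, total))
        db.execute(
            "INSERT INTO outbox (type, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"order_id": order_id})),
        )


def relay_once(publish) -> int:
    """Publish every unsent event in order; return how many were picked up."""
    rows = db.execute(
        "SELECT id, type, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for event_id, event_type, payload in rows:
        # If publish raises, the row stays unsent and the next run retries it.
        publish(event_id, event_type, json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (event_id,))
    return len(rows)
```

Passing the outbox row id to `publish` gives consumers a stable key to dedup on when the relay crashes between publishing and marking the row sent.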

Caching

Computing the same answer over and over is wasted work. Caching stores answers so you don't recompute them. The pattern is universal: registers cache RAM, RAM caches disk (handed forward from Act II), DNS resolvers cache lookups, browsers cache HTML. At every layer the structure is the same — lookup hits the fast store first, falls through to the slow store on a miss, writes the answer back on the way out.

The cost difference is dramatic: Redis at roughly 0.5 ms, a CDN edge at 20 ms, a database query at 50 ms, an origin generation at 300 ms. A 95% hit rate on the most expensive layer is often the difference between a system that scales and one that doesn't.

A typical web request touches four cache layers in sequence:

  • Browser cache uses HTTP headers (Cache-Control: max-age=3600, ETag, Last-Modified) to keep static assets out of the network entirely. The fastest request is the one you don't make.
  • CDN edge cache (Cloudflare, Fastly, CloudFront) holds copies at points of presence worldwide, dropping a cross-ocean round-trip from 200 ms to 20 ms.
  • Application cache lives in-process or in a shared store like Redis or Memcached, holding hot query results and session data at sub-millisecond latency.
  • Database cache is the engine's buffer pool plus query result cache — already there, already working.

[Figure: The cache hierarchy of a typical web request. Browser (HTTP cache): ≈0 ms, hit ratio ≈30%, TTL hours. CDN edge (Cloudflare etc.): ≈20 ms, hit ratio ≈95%, TTL minutes. App cache (Redis, in-process): ≈0.5 ms, hit ratio ≈80%, TTL seconds. DB cache (buffer pool): ≈5 ms, hit ratio ≈99%, page LRU. DB (origin): ≈50 ms. Write patterns: cache-aside (app reads cache; on a miss reads DB and writes cache; the default), write-through (writes go to cache and DB together; consistent, slow), write-behind (writes hit cache, async flush to DB; fast, can lose data on crash).]
Each layer trades scope for hit ratio. Stack them well and 99% of requests never reach the origin.

The choice of who writes to the cache, and when, gives you the cache patterns. Cache-aside is the default: the app reads the cache and, on a miss, queries the DB and writes the result back. Write-through writes to cache and DB together: consistent, but writes are slow. Write-behind writes to the cache and flushes to the DB asynchronously: fast, but it loses unflushed writes on a crash. Pick by what you can tolerate: stale reads, slow writes, or lost writes.
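Cache-aside is short enough to show whole. The sketch below uses a plain dictionary as the fast store and a caller-supplied `load` function as the slow one; both, along with the class name and the TTL handling, are assumptions for illustration rather than any particular cache client's API.

```python
import time


class CacheAside:
    def __init__(self, load, ttl: float, clock=time.monotonic):
        self.load = load    # slow path, e.g. a database query
        self.ttl = ttl      # seconds an entry stays fresh
        self.clock = clock
        self.store = {}     # key -> (value, expires_at)

    def get(self, key):
        entry = self.store.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0]                      # hit: fast store only
        value = self.load(key)                   # miss: fall through
        self.store[key] = (value, self.clock() + self.ttl)  # write back
        return value

    def invalidate(self, key) -> None:
        # Explicit invalidation: call this from every write path for the key.
        self.store.pop(key, None)
```

The application owns both the read-through and the write-back, which is why cache-aside tolerates a cache outage gracefully: every read simply becomes a miss.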

Pitfall — invalidation. A cache holds an answer; the underlying truth changes; now what? TTL works for read-mostly data — tolerate n seconds of staleness and let entries expire. Explicit invalidation (delete the key when the source changes) is correct but fragile — every write path must remember every dependent key. Versioned keys (user:123:v42) sidestep the problem: bump the version and the old key is dead. Cache stampedes — a popular key expires and a thousand concurrent requests all miss and hammer the origin — need their own defense (request coalescing, probabilistic early refresh, lock-on-miss). Caching is easy; correct caching is the work.

Worked example: a cache stampede, and how coalescing kills it

Suppose homepage:trending is a Redis key holding the top-10 trending posts. The query that computes it scans a billion rows and takes 800 ms. It's served from cache at 0.5 ms with a 60-second TTL. Steady-state traffic is 1000 requests per second.

Without coalescing — the stampede:

  1. At t=0.000s the key expires.
  2. At t=0.001s, request #1 misses the cache and starts the 800 ms recompute.
  3. Between t=0.001s and t=0.801s, 800 more requests arrive. Each looks up the key, sees it absent, and also starts its own 800 ms recompute. The origin DB now has 800 concurrent identical scans running. The DB melts; p99 latency jumps from 5 ms to 30 s; unrelated queries time out.
  4. At t=0.801s, the 800 recomputes finish almost simultaneously and all write the same value back. The cache is hot again, but the damage is done.

The pathology: every miss launched its own recompute because none of them knew about the others.

With request coalescing — a lock on the miss:

  1. At t=0.000s the key expires.
  2. At t=0.001s, request #1 misses. Before recomputing, it tries to acquire a short-lived lock lock:homepage:trending in Redis with SET NX EX 5. It wins the lock and starts the 800 ms recompute.
  3. At t=0.002s, request #2 misses. It tries the same SET NX and fails — someone else is recomputing. It waits (polls the cache key every 20 ms, or subscribes to a Redis pub/sub channel keyed to the lock).
  4. The next 800 requests do the same: see no value, see the lock held, wait.
  5. At t=0.801s, request #1 writes the new value to the cache and releases the lock.
  6. The 800 waiting requests wake up, find the value, and return.

One origin call instead of 800. Two refinements show up in production. Probabilistic early refresh has each request, with a small probability that rises as the TTL nears expiry, voluntarily recompute before the key expires, so the key rarely goes cold. Stale-while-revalidate serves the expired value to all but one request while that one refreshes in the background: readers never wait at all, at the cost of serving a value somewhat older than the TTL until the refresh lands.
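The lock-on-miss idea can be tried in a single process with threads. This sketch swaps the Redis lock described above for an in-process lock and event, so it demonstrates the coalescing logic only, not cross-instance coordination; the class name, the 50 ms recompute, and the 50 concurrent callers are arbitrary choices, and TTL handling is left out.

```python
import threading
import time


class SingleFlightCache:
    def __init__(self, recompute):
        self.recompute = recompute
        self.lock = threading.Lock()
        self.values = {}
        self.in_flight = {}  # key -> Event set when the leader finishes

    def get(self, key):
        while True:
            with self.lock:
                if key in self.values:
                    return self.values[key]
                done = self.in_flight.get(key)
                leader = done is None
                if leader:
                    done = self.in_flight[key] = threading.Event()
            if not leader:
                done.wait()   # someone else is recomputing
                continue      # then re-check the cache
            try:
                value = self.recompute(key)
                with self.lock:
                    self.values[key] = value
                return value
            finally:
                with self.lock:
                    del self.in_flight[key]
                done.set()    # wake the waiters even if recompute raised


origin_calls = []


def slow_query(key):
    origin_calls.append(key)
    time.sleep(0.05)  # stand-in for the expensive scan
    return "top-10"


cache = SingleFlightCache(slow_query)
results = []
threads = [
    threading.Thread(target=lambda: results.append(cache.get("homepage:trending")))
    for _ in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the leader registers itself under the same lock that guards the cache lookup, a late arrival either finds the in-flight marker or finds the finished value; it never starts a second recompute.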

The cloud abstraction

Buying servers and running a datacenter is expensive and slow. Cloud providers rent datacenter primitives as APIs you call from a CLI or config file — VMs spin up in seconds, terabytes of storage allocate in milliseconds, load balancers cross regions with one flag. The mental model that survives across providers reduces to three primitive axes: compute, storage, network.

Compute

Compute is where your code runs, and the choice is a managed-ness gradient. VMs (EC2, Compute Engine) give you a whole simulated machine — you manage the OS and patches, boot takes 30–90 seconds. Containers (Kubernetes pods, ECS, Cloud Run) ship a packaged image — handed forward from Act IV — and the platform schedules them onto shared VMs; boot is seconds, you stop caring about underlying machines. Functions (Lambda, Cloud Functions) are stateless code triggered by events — no servers visible, billing per invocation and per millisecond, boot is milliseconds warm and can stretch to seconds cold.

Each step hands more responsibility to the provider and leaves you fewer knobs. A VM lets you tune anything but you patch it; a function abstracts everything but pins you to the provider's runtime.

Storage

Storage comes in three shapes, each optimized for a different access pattern.

  • Block storage (EBS, Persistent Disk) is a virtual disk attached to one VM. The OS formats it with any filesystem. Use it for databases — anything where one machine owns the data.
  • Object storage (S3, GCS, Azure Blob) is a flat key-value namespace — store, list, fetch by key. Replicates across availability zones for high durability, cheap per GB. Use it for "write once, read many, scale to petabytes."
  • File storage (EFS, Filestore) is a network filesystem mountable on many VMs at once. Slower per IO than block but multi-attach. Use it when a fleet shares a directory.

Network

Network is how the pieces find each other. A VPC (Virtual Private Cloud) is a software-defined network — your private IP space, routing tables, and subnets. A load balancer distributes traffic; at layer 4 it routes TCP connections (fast, dumb), at layer 7 it inspects HTTP and routes by path or header. A NAT lets private subnets reach the internet without being reachable from it. A CDN fans content out to points of presence near users. Every cloud has these primitives under different names.

[Figure: The three cloud axes: compute, storage, network. Compute (less control, more managed): VM (whole OS, 30–90 s boot), Container (k8s pod, 1–10 s boot), Function (stateless, per-invocation). Storage (pick by access pattern): Block (one-VM disk, any filesystem), Object (flat key-value, multi-AZ), File (NFS, multi-attach). Network (same primitives in every cloud): VPC (private IP space), Load balancer (L4 / L7 fan-out), NAT (private → public egress), CDN (edge cache, many PoPs).]
The axes are mostly independent — pick a compute, pick a storage, pick a network. Pricing makes the choice non-trivial.

Load balancing at scale

At small scale, one load balancer routes traffic to a few servers. At global scale, "load balancing" becomes a stack of layers, each making one decision and handing off:

  • DNS maps the hostname to an IP by geographic or latency-based routing, picking the region.
  • Anycast advertises the same IP from many points of presence so the BGP fabric routes the user to the nearest.
  • L7 load balancer at the regional edge terminates TLS, inspects HTTP, and routes to the right service or cell.
  • L4 load balancer inside the cell distributes TCP connections across pods — fast, doesn't read payloads.
  • Instance pick lands on one process — round-robin, least-connections, or consistent-hash for cache locality.

Each layer fails over independently: a bad backend is pulled from its L4 pool only, while a sick region withdraws its BGP routes and DNS records. The blast radius shrinks at every layer down.

[Figure: Multi-tier load balancing: DNS, anycast, L7, L4, instance. DNS (geo / latency) picks the region; Anycast (BGP) picks the nearest PoP; the L7 LB routes by path, header, or cell; the L4 LB spreads TCP connections (round-robin) and picks the instance; the instance is one pod, one PID. Notes: a DNS TTL that is too low adds lookup load, and one that is too high slows failover; anycast pulls users to a sick PoP unless health checks withdraw the route; L7 is HTTP-aware (rewrites, auth, WAF) while L4 is dumb and fast; consistent-hash at L4 keeps a user on the same backend for cache locality.]
Each tier is a different concern at a different speed.

Service mesh

Inside a cell, every cross-service call needs encryption, retries, timeouts, and telemetry. Implementing those in each service's code means every team reinvents the same wheel slightly differently. A service mesh moves these concerns out of the application by attaching a sidecar proxy to every pod that intercepts all in- and out-bound traffic. Istio, Linkerd, and Consul Connect share the shape: a data plane of proxies handling mutual TLS (mTLS, where both sides present certificates), retries, circuit-breaking, traffic-splitting, and span emission; a control plane that pushes config to all sidecars.

The win: every cross-service call gets the same encryption, observability, and resilience without app-code changes. The cost: every pod carries an extra proxy container, so container count roughly doubles, each hop adds a little latency, and you have a control plane to operate. The "do you need a mesh?" question reduces to whether your platform team is large enough to run one without slowing everyone else down.

[Figure: Service mesh: sidecar proxy per pod, control plane configures all sidecars. The control plane (routing config, cert authority) pushes config to the sidecars in pods A, B, and C; each app talks to its own sidecar over localhost only, and sidecars talk to each other over mTLS. What the sidecar gives you for free: mTLS, retries, timeouts, circuit-breaking, traffic-splitting, distributed traces, metrics, and access logs, with no app-code changes. Cost: a proxy container in every pod and a control plane to operate.]
The application talks to localhost; the sidecar does the network. The control plane reconfigures sidecars without restarting apps, which is how a 1% canary or a region failover is one config push.
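
To make two of those sidecar behaviors concrete, here is a minimal in-process sketch of retries wrapped around a circuit breaker. It is not how Istio, Linkerd, or Consul implement them; `CircuitBreaker`, `call_with_retries`, and the thresholds are hypothetical names and values chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected untried."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("upstream marked unhealthy")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result


def call_with_retries(breaker, fn, attempts=3):
    """Retry failed calls through the breaker; an open circuit is not retried."""
    last_error = None
    for _ in range(attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise
        except Exception as exc:
            last_error = exc
    raise last_error
```

Writing this once in a proxy, instead of once per service and per language, is the whole argument for the mesh. A production version would also add timeouts, backoff with jitter, and a retry budget.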

Pricing and the egress bill

Pricing is the fourth axis, and it's the one that turns architecture diagrams into spreadsheets. On-demand pays per hour with no commitment. Reserved instances cost much less in exchange for a 1- or 3-year commitment. Spot instances run on spare capacity at deep discounts but can be reclaimed on short notice (two minutes on AWS). Batch jobs and stateless web tiers can run on spot; latency-critical workloads can't.

The surprise is egress — bandwidth out of the cloud. AWS lists about $0.09 per GB to the internet (sliding lower with volume), and storage at one or two cents per GB-month; on a high-traffic app the transfer line dwarfs the storage line. A SELECT * returning a kilobyte per row at 100 QPS is fine in dev; at 100,000 QPS across regions it's tens of thousands of dollars a month. Cross-AZ replication on a multi-AZ database doubles the transfer line. The architecture review and the cost review converge — design without considering the bill and you ship a system you can't afford to run.
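
The arithmetic behind that claim is short enough to check. The sketch below assumes one 1 KB row per query, a steady rate over a 30-day month, 1 GB = 10^9 bytes, and the roughly $0.09/GB internet rate quoted above; inter-region rates are usually lower, so substitute your provider's current price. The function name is made up for the example.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month


def monthly_egress_usd(bytes_per_response, qps, usd_per_gb):
    """Monthly transfer cost for a steady request rate (1 GB = 10**9 bytes)."""
    gb_per_month = bytes_per_response * qps * SECONDS_PER_MONTH / 1e9
    return gb_per_month * usd_per_gb


# Assumed inputs: one 1 KB row per query at $0.09 per GB.
dev = monthly_egress_usd(1_000, 100, 0.09)
prod = monthly_egress_usd(1_000, 100_000, 0.09)
print(f"dev:  ${dev:,.2f}/month")
print(f"prod: ${prod:,.2f}/month")
```

Under these assumptions the dev workload costs about $23 a month and the production one about $23,000. Nothing about the query changed; only the request rate did.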

Infrastructure as code

Clicking buttons in a cloud console works for a demo and fails for a system. Nobody can audit who changed what, recovering from a region outage means re-clicking 200 dialogs, and dev/staging/prod drift apart until "works in staging" stops meaning anything. Infrastructure as code (IaC) describes infrastructure in version-controlled files — reviewed in PRs, tested in CI, deployed by a tool that reads them and makes reality match.

Terraform (and its open fork OpenTofu) is the dominant tool. It reads files declaring resources, holds a state file mapping declared resources to real cloud resources, computes a diff with terraform plan, and applies it with terraform apply. Pulumi does the same in general-purpose programming languages such as TypeScript, Python, and Go. AWS CDK compiles to CloudFormation. The engine pattern is universal: read desired state, read actual state, compute the minimal API calls to converge them. Kubernetes runs the same loop for workloads.

[Figure: The reconciliation loop at the heart of Terraform. Desired state (main.tf, HCL, declarative) feeds the planner (parse + graph, topo sort), which emits a diff (+ create, − destroy, ≈ update / replace); apply turns the diff into ordered API calls against the cloud provider (actual state). The state file (last-known actual, remote backend) is refreshed from the cloud provider; that refresh, drawn dashed, is drift detection. Desired ↔ actual: the planner emits the minimal call sequence to converge them.]
Reconciliation is idempotent: run it twice and the second run is a no-op. Drift detection — the dashed line — catches manual changes before they desync the model.
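
The loop can be sketched in a few lines against an in-memory stand-in for the cloud. This is a toy under stated assumptions, not Terraform's planner: resources are plain dicts, `plan` and `apply` are hypothetical function names, and dependency ordering and the replace-versus-update distinction are left out.

```python
def plan(desired: dict, actual: dict) -> list:
    """Return the (action, name) steps that would turn `actual` into `desired`."""
    steps = []
    for name in sorted(desired.keys() - actual.keys()):
        steps.append(("create", name))
    for name in sorted(desired.keys() & actual.keys()):
        if desired[name] != actual[name]:
            steps.append(("update", name))
    for name in sorted(actual.keys() - desired.keys()):
        steps.append(("destroy", name))
    return steps


def apply(desired: dict, actual: dict) -> dict:
    """Execute the plan against an in-memory 'cloud'; return the new actual state."""
    new_actual = dict(actual)
    for action, name in plan(desired, actual):
        if action == "destroy":
            del new_actual[name]
        else:  # create or update
            new_actual[name] = desired[name]
    return new_actual


desired = {"vpc": {"cidr": "10.0.0.0/16"}, "bucket": {"versioning": True}}
actual = {"bucket": {"versioning": False}, "old-queue": {}}

first = plan(desired, actual)    # create vpc, update bucket, destroy old-queue
actual = apply(desired, actual)
second = plan(desired, actual)   # empty: the second run is a no-op
```

Drift detection is the same `plan` call run on a schedule: if it returns steps when nobody changed the code, someone changed the cloud.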

Three things make IaC work in practice. Modules are the unit of reuse — a "VPC + subnets + NAT" module is a couple hundred lines reused fifty times across an org. State files map code to cloud resources; they must live in a remote backend with locking so concurrent applies can't corrupt them, and must never be hand-edited. Drift detection runs plan on a schedule and alerts on unexpected diffs — manual console changes, anything that breaks "code is the source of truth."

Pitfall — state file as load-bearing artifact. Lose the state file and Terraform thinks the cloud is empty — the next apply tries to recreate everything, collides with existing resources, and leaves you with a half-broken environment. Remote backend with locking, regular backups, no manual edits. The blast radius of a state-file mistake is the whole environment.

Deployment strategies

How code goes from a Git tag to live traffic is its own design space. The deployment strategy trades risk for speed.

  • Rolling replaces instances a few at a time. Simple, slow, fine for stateless services.
  • Blue/green stands up a second fleet next to the live one, runs smoke tests, and flips the load balancer. Rollback is one flip back. Costs 2× fleet during the rollout.
  • Canary sends a small fraction of traffic (1%, then 10%, then 50%) to the new version while watching key health signals such as error rate and latency, which the SLO section below calls SLIs. The rollout halts automatically on error-rate spikes.
  • Shadow mirrors production traffic to the new version with responses discarded — real load, zero user exposure.

Most production platforms compose them: a canary inside a blue/green, gated by automated reliability checks. If you can't articulate the rollback procedure in one sentence, you picked the wrong strategy.
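
The canary gate described in the list above can be sketched as a loop over traffic steps. The names are hypothetical: `error_rate_at` stands in for whatever metrics query your platform exposes, and the 1% threshold is an arbitrary example. A real controller would also shift traffic and wait out a soak period at each step before measuring.

```python
def run_canary(error_rate_at, steps=(1, 10, 50, 100), max_error_rate=0.01):
    """Walk the traffic steps; halt and route everything back on a bad reading.

    `error_rate_at` is a hypothetical callable: traffic percent -> error rate
    observed on the new version at that step.
    """
    for percent in steps:
        if error_rate_at(percent) > max_error_rate:
            # Rollback in one sentence: send 100% of traffic to the old version.
            return {"status": "halted", "failed_at": percent, "traffic": 0}
    return {"status": "promoted", "failed_at": None, "traffic": steps[-1]}


healthy = run_canary(lambda percent: 0.001)
# A bug that only shows under load: fine at 1%, failing from 10% upward.
regressed = run_canary(lambda percent: 0.05 if percent >= 10 else 0.001)
```

The second run halts at the 10% step, so at most a tenth of users ever saw the regression.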

[Figure: Four deployment strategies: rolling, blue/green, canary, shadow. Rolling: replace n at a time from t0 to t3; slow, simple. Blue/green: blue at 100%, green at 0%, then flip; two fleets, 2× cost, instant rollback. Canary: 1%, 5%, 25%, 50%, 100%; step and watch SLOs, halt on regression. Shadow: live traffic is copied to the shadow version and its responses discarded; real load, zero user exposure. Picking one: stateless web → rolling; DB schema → blue/green; user-facing → canary; rewrites → shadow first. SLO-gated: automated rollback on error budget burn.]
Strategy is a function of blast radius and confidence.

Observability

You can't fix what you can't see. Observability is the practice of building systems whose internal state can be inferred from their outputs. The canonical model is the three pillars — metrics, logs, and traces — each answering a different question.

Metrics are numeric time series: counters that only go up, gauges that go up and down, histograms that bucket distributions. Cheap to store, aggregable (sum, p50, p95, p99 over windows), the basis for dashboards and alerts. They answer "is the system healthy right now?" — p99 latency, error rate, request rate, queue depth.

Logs are timestamped events, typically structured JSON. Expensive at scale — a busy service can emit terabytes a day. They answer "what happened to this request?" — when a user reports a problem with a request id, you grep for it. Sampling and retention keep the bill survivable: keep error logs, sample 1% of successes, expire after thirty days.
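
One way to implement that sampling policy is to decide per request id rather than per log line, so a sampled request keeps all of its lines and the grep still works. The sketch below is an assumption about how you might do it, not a feature of any logging library; `keep_request` and `emit` are made-up names.

```python
import hashlib
import json


def keep_request(request_id: str, sample_rate: float) -> bool:
    """Deterministic sampling: the same request id always gets the same answer."""
    digest = hashlib.sha256(request_id.encode()).digest()
    # Compare the 64-bit hash against the matching fraction of the hash space.
    return int.from_bytes(digest[:8], "big") < sample_rate * 2**64


def emit(record: dict, success_sample_rate: float = 0.01):
    """Return a structured JSON log line, or None if the record is sampled out."""
    is_error = record.get("level") == "ERROR"
    if is_error or keep_request(record["request_id"], success_sample_rate):
        return json.dumps(record, sort_keys=True)
    return None


# Errors are always kept, whatever the sampling rate.
line = emit({"level": "ERROR", "request_id": "req-42", "msg": "db timeout"})
```

Hashing the id instead of rolling a random number per line means every service that sees the same request id makes the same keep-or-drop decision without coordinating.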

Traces stitch one request's path across services into a tree of spans — each span records a service, operation, start time, and duration. Spans link by a trace id propagated through HTTP headers (W3C Trace Context). A trace shows the slow request spent 480 ms in auth, 12 ms in the gateway, 30 ms in the database. At 100,000 requests per second you can't trace every one — head-based sampling decides at request entry and propagates the decision.
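
A sketch of what propagation looks like on the wire, using the W3C `traceparent` layout (version, 32-hex trace id, 16-hex parent span id, 2-hex flags, with the low flag bit meaning "sampled"). The helper names are invented for the example, and a real service would use an OpenTelemetry SDK rather than hand-rolling headers.

```python
import random
import secrets


def new_traceparent(sample_rate=0.01, rng=random.random):
    """Start a trace at the edge: fresh ids plus a head-based sampling decision."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if rng() < sample_rate else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Downstream hop: keep the trace id and sampling flag, mint a new span id."""
    version, trace_id, _parent_span, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def is_sampled(traceparent: str) -> bool:
    """True if the 'sampled' bit is set in the trace-flags field."""
    return (int(traceparent.split("-")[3], 16) & 0x01) == 1


root = new_traceparent(sample_rate=1.0)  # force sampling for the demo
hop = child_traceparent(root)            # what the next service would send on
```

The sampling decision is made once, in `new_traceparent`, and every later hop simply copies the flag. That is what "head-based" means: either every service records the request or none does.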

[Figure: The three pillars of observability: metrics, logs, traces. Metrics: numbers over time, e.g. requests_total{svc="api"}; small, aggregable; "is it healthy?"; scraped every 15 s. Logs: timestamped events, e.g. "10:00:01 INFO user 42 login", "10:00:02 WARN cache miss", "10:00:03 ERROR db timeout", "10:00:04 INFO retry success"; large, high-cardinality; "what happened?"; indexed, sampled. Traces: request journeys across api, auth, db, notify, and cache (trace-id: 7f3a..., span-id: a12); linked spans; "where did it go?"; head-sampled at entry. Use metrics to alert, logs to inspect, traces to navigate.]
The three aren't redundant. A metric tells you something is broken; a trace tells you which service; a log tells you why.

SLOs and error budgets

Raw telemetry doesn't tell you whether to wake someone up. The SLO contract turns signals into a decision. A Service Level Indicator (SLI) is a measurable property — request success rate, p99 latency, queue lag. A Service Level Objective (SLO) is a target on that SLI: "99.9% of requests succeed" measured over a rolling 30-day window.

The complement of the SLO is the error budget. 0.1% of 30 days is about 43 minutes of allowed full unavailability, or the equivalent share of failed requests spread across the window. You spend the budget on incidents, deploys, and experiments. When the budget is exhausted, you stop shipping features and fix reliability. That mechanic is what makes SLOs an engineering tool rather than wishful thinking.
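
The budget arithmetic, plus the burn rate that alerting is usually built on, fits in three small functions. The function names are illustrative; burn rate here means the observed error ratio divided by the ratio the SLO allows.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of total unavailability the SLO permits over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the service is failing."""
    return observed_error_ratio / (1 - slo)


def days_until_exhausted(rate: float, window_days: int = 30) -> float:
    """At a constant burn rate, how long a full budget lasts."""
    return window_days / rate


budget = error_budget_minutes(0.999)           # about 43.2 minutes
rate = burn_rate(0.01, 0.999)                  # 1% errors vs 0.1% allowed: about 10
print(f"{budget:.1f} min budget, burn rate {rate:.0f}x, "
      f"gone in {days_until_exhausted(rate):.0f} days")
```

A burn rate of 1 spends the budget exactly over the window. At 10, a 30-day budget is gone in three days, which is the kind of pace worth paging someone for.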

[Figure: SLO and error budget over a 30-day window. The SLI (success rate) is plotted from day 0 to day 30 on a 99% to 100% axis against the 99.9% SLO line; two incidents exhaust the budget, and releases freeze. Budget = 100% − SLO; burn-rate alerts fire when the 30-day spend pace puts you under target.]
Two big incidents and the month's allowance is spent — the contract says no risky deploys until the window resets.

Pitfall — vanity SLOs. An SLO no one enforces is decoration. The hard part is the human contract: when the budget is exhausted, are engineering leaders willing to halt the roadmap and pay down reliability debt? If the answer is "no, we ship anyway," the SLO is fiction and alert noise will make on-call miserable. Set targets you can defend, tied to user-visible behavior, with a documented protocol for what happens when the budget burns. That last clause is what makes the system work.

Standards

  • OpenAPI Specification: openapis.org/specification (current 3.1.x). The de facto contract format for REST APIs; consumed by code generators, mock servers, and gateway routers.
  • AsyncAPI: asyncapi.com/docs/reference/specification (current 3.x). The async / event API counterpart to OpenAPI; describes channels, messages, bindings to Kafka, AMQP, MQTT, etc.
  • gRPC + Protocol Buffers: grpc.io/docs and the language reference at protobuf.dev (handed forward from Act I). Wire format is stable; the .proto schema language defines the contract.
  • GraphQL: spec.graphql.org (October 2021 edition is current). Query language and execution semantics; schema-definition-language is where the contract lives.
  • CloudEvents: cloudevents.io (CNCF, current 1.0.2). Standard envelope for event payloads across brokers; lets a Kafka event and a webhook event share metadata.
  • OCI: opencontainers.org (handed forward from Act IV). The runtime, image, and distribution specs that define what a container actually is; every cloud's container service obeys them.
  • Kubernetes API conventions: kubernetes/community/contributors/devel/sig-architecture/api-conventions.md. The unwritten rules every K8s controller follows: spec/status split, server-side apply, label conventions.
  • W3C Trace Context: w3.org/TR/trace-context (W3C Recommendation, 2020). Defines traceparent and tracestate headers; makes distributed tracing portable across vendors.
  • OpenTelemetry: opentelemetry.io/docs/specs (CNCF). The unified spec for telemetry data (traces, metrics, logs), SDK semantics, and the OTLP wire protocol; the de facto successor to vendor-specific agents.
  • Prometheus exposition format: prometheus.io/docs/instrumenting/exposition_formats. The text format that almost every metrics-emitting tool now exposes; OpenMetrics (openmetrics.io) is its standardized superset.
  • Terraform / OpenTofu: developer.hashicorp.com/terraform/language and opentofu.org/docs/language for HCL syntax and resource semantics. OpenTofu is the MPL-licensed open fork that several large vendors now ship.
  • HTTP caching: RFC 9111 (replaces RFC 7234). Defines Cache-Control, ETag, conditional requests, and freshness semantics every CDN and browser implements.
  • SLO/SLA practices: Google's Site Reliability Engineering (O'Reilly, 2016) chapters 3–4 are the de facto reference; The Site Reliability Workbook (2018) adds practical rollouts.
  • Forward refs — the observability triad recurs in Act IXb (Engineering Craft) where it becomes the on-call discipline; zero-trust networking and identity-aware proxies recur in Act VIIa (Security).

Going deeper

Branches that earn their own article.

  • Microservices patterns (sidecar, ambassador, circuit breaker, bulkhead).
  • Message brokers (Kafka, RabbitMQ, NATS) deep dives.
  • CQRS and event sourcing implementation details.
  • Rate limiting and back-pressure strategies.
  • Cell-based architecture.
  • Multi-region and active-active patterns.
  • Capacity planning.
  • Kubernetes internals (scheduler, kubelet, etcd, CRI, CNI).
  • Service mesh (Istio, Linkerd).
  • Serverless architectures and cold-start trade-offs.
  • FinOps and cloud cost optimization.
  • CI/CD pipeline design.