Cloud & Infrastructure
14 min read
TL;DR
Cloud is not console clicks. It's five primitives — compute, storage, network, identity, observability — composed into four patterns (three-tier, serverless, edge-cached, event-driven), with latency and cost shaped by where state lives and how bytes move. Once you can read a design as primitives + pattern, every vendor's docs page is a lookup against a model you already have.
You will be able to
- Name the five primitives and give two vendor implementations of each.
- Pick the right composition pattern given a workload's state and traffic shape.
- Predict where the cloud bill accrues before you deploy anything.
The Map
- You will be able to
- The Map
- Station 1 — Compute primitives
- Station 2 — Storage primitives
- Station 3 — Network primitives
- Station 4 — Identity and policy
- Station 5 — Observability
- Station 6 — Composition patterns
- Station 7 — Cost shape
- How the stations connect
- Standards & Specs
- Test yourself
Each pattern is a different mix of the five primitives. Cost and latency fall out of the mix, not out of vendor choice.
Station 1 — Compute primitives
Four rungs, each with a different startup time, a different scaling shape, and a different bill:
Primitive        Start time     Max scale shape           $ shape
─────────        ──────────     ───────────────           ───────
VM               minutes        vertical + horizontal     per hour, committed
Container        seconds        horizontal, cluster       per hour, fractional
Function         ~50 ms (hot)   horizontal, per-event     per request + ms
Managed runtime  immediate      opaque, provider-bound    per unit of work
- VM (EC2, Compute Engine) — a whole OS. Maximum flexibility, slowest to scale, largest unit of waste.
- Container (ECS, GKE, ACI) — a process in a filesystem jail. Fractional scaling, one image everywhere.
- Function (Lambda, Cloud Functions) — code, not a machine. Per-invocation billing; scales to zero; cold-start tax.
- Managed runtime (App Runner, Cloud Run, Fly Machines) — somewhere between container and function; you don't manage the host, you rent the runtime.
The model you want: scale-to-zero is a feature you pay for with cold starts; always-on is a feature you pay for with idle cost. Pick the side your traffic shape favors.
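The trade-off above can be sketched as a break-even calculation. All prices here are illustrative assumptions, not real vendor rates; the shape of the curves is the point, not the numbers.

```python
# Sketch: break-even between per-request (function) and always-on (container)
# billing. Prices are made-up assumptions, not quotes from any provider.

FUNC_PRICE_PER_M_REQUESTS = 0.20      # $ per 1M invocations (assumed)
FUNC_PRICE_PER_GB_SECOND = 0.0000167  # $ per GB-second of runtime (assumed)
CONTAINER_PRICE_PER_HOUR = 0.04       # $ per hour, always on (assumed)

def monthly_function_cost(rps: float, ms_per_req: float, gb: float) -> float:
    """Cost of serving `rps` average requests/sec on a per-invocation model."""
    seconds_per_month = 30 * 24 * 3600
    requests = rps * seconds_per_month
    compute = requests * (ms_per_req / 1000) * gb * FUNC_PRICE_PER_GB_SECOND
    return requests / 1e6 * FUNC_PRICE_PER_M_REQUESTS + compute

def monthly_container_cost(instances: int = 1) -> float:
    """Always-on cost: you pay for idle, but never for cold starts."""
    return instances * CONTAINER_PRICE_PER_HOUR * 24 * 30

# At low, bursty traffic the function wins by orders of magnitude;
# as average RPS climbs, the always-on container becomes cheaper.
print(round(monthly_function_cost(rps=0.1, ms_per_req=100, gb=0.128), 2))
print(round(monthly_container_cost(), 2))
```

Run it with your own traffic numbers before believing either side of a serverless argument.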
WARNING
Lambda + VPC + RDS is the classic cold-start trap. A function that has to attach an ENI into a VPC adds seconds to the cold path. If your function talks to a private database, consider a provisioned-concurrency floor or a small always-on container instead.
Go deeper: "Serverless in the Wild" (Shahrad et al., ATC 2020) on real cold-start patterns; run the same trivial endpoint on VM, container, and function, measure p50/p99 at zero, 1, and 1000 RPS.
Station 2 — Storage primitives
Six shapes of storage. Each trades durability, latency, and query power:
Shape         Latency       Durability    Good for
─────         ───────       ──────────    ────────
Block         µs            single-node   raw disks, RDBMS files
Object        ~10-100 ms    11 nines      blobs, backups, static sites
File          ~ms           cluster       shared home dirs, POSIX apps
KV            µs - low ms   replicated    sessions, caches, feature flags
Relational    ms            replicated    transactional workloads
Log / stream  ms            replicated    CDC, events, append-only
- Block (EBS, Persistent Disk) — attach to a VM, mount a filesystem.
- Object (S3, GCS, Azure Blob) — HTTP-addressable blobs. Not a filesystem.
- File (EFS, Filestore) — POSIX over the network. Convenient, expensive.
- Key-value (DynamoDB, Redis, Bigtable) — low-latency gets; no joins.
- Relational (RDS, Cloud SQL, Aurora, Spanner) — SQL, ACID, one or many replicas.
- Log / stream (Kafka, Kinesis, Pub/Sub) — append-only, multi-consumer, usually retained.
The model you want: match the access pattern to the shape, not the label. Everyone reaches for Postgres first; often a KV or a log is the right shape for half the workload.
CAUTION
Object storage is not a filesystem. ls is slow. "Directories" are a client-side illusion. Listing a prefix with 10 million objects is an outage. Design for GET by key, not for walking a tree.
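One way to honor that caution is to make object keys deterministic, so any reader can reconstruct the exact key and issue one GET instead of listing a prefix. The `reports/tenant/date/id` scheme below is a made-up example, not a standard.

```python
# Sketch: deterministic object keys so reads are GET-by-key, never list-a-prefix.
# The key scheme (tenant/day/report_id) is a hypothetical example.

def report_key(tenant: str, day: str, report_id: str) -> str:
    """Build the exact object key for a report; anyone holding these three
    values can GET it directly without listing the bucket."""
    return f"reports/{tenant}/{day}/{report_id}.json"

# Anti-pattern: "find the report" by walking a prefix with millions of keys.
# Pattern: reconstruct the key from values you already have, issue one GET.
key = report_key("acme", "2024-06-01", "rpt-42")
print(key)  # reports/acme/2024-06-01/rpt-42.json
```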
Go deeper: Pat Helland's "Immutability Changes Everything"; the Dynamo paper; one weekend porting a table from Postgres to DynamoDB (or back) with honest measurements.
Station 3 — Network primitives
The network is usually what the bill is secretly about. Five primitives:
- VPC / subnets / security groups — the private network you rent. Everything that isn't public traffic travels on it.
- DNS — the first hop of every external request. Also the slowest one to change; plan TTLs consciously.
- Load balancer — L4 (fast, dumb, TCP/UDP) or L7 (slow, smart, HTTP/gRPC). L7 reads and routes on headers, which makes it more capable and more expensive per request.
- CDN — the cache at the edge. Turns your origin's "1000 RPS" into "10 RPS, and 990 served from cache."
- Egress — bytes leaving the cloud region. Almost always the line item you didn't estimate.
The model you want: bytes out of the cloud are expensive; bytes inside a zone are free-ish; bytes across zones or regions are middlingly expensive. Design data locality accordingly.
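That locality gradient can be written down as a one-function cost model. The per-GB rates are illustrative assumptions for the shape of the gradient, not any provider's price list.

```python
# Sketch: where the bytes go determines the network bill.
# Rates below are illustrative assumptions in $/GB, not real prices.

RATES = {
    "same_az": 0.00,          # free-ish
    "cross_az": 0.01,         # middling, often charged in both directions
    "cross_region": 0.02,     # middling-plus
    "internet_egress": 0.09,  # the line item you didn't estimate
}

def monthly_transfer_cost(gb_per_day: float, path: str) -> float:
    """Cost of moving gb_per_day along one path for 30 days."""
    return gb_per_day * 30 * RATES[path]

# The same 50 GB/day costs nothing inside an AZ and real money leaving the cloud.
for path in RATES:
    print(path, round(monthly_transfer_cost(50, path), 2))
```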
WARNING
Cross-AZ traffic charges feel like rounding errors until a chatty service with 100 req/s × 10 cross-AZ hops shows up on the invoice as a real number. Pin chatty pairs to the same AZ when you can.
Go deeper: the AWS "Pricing" pages for EC2 data transfer (slowly, with a calculator open); Cloudflare's "How we built Anycast" posts; traceroute from your laptop to your own origin and count the hops.
Station 4 — Identity and policy
The primitive engineers skip, then regret in post-mortems. Identity answers who is calling, policy answers what they can do.
┌─────────────┐          ┌──────────────┐          ┌─────────────┐
│  Principal  │──auth──▶ │    Policy    │──allow─▶ │  Resource   │
│ (user/svc)  │          │  (rule set)  │          │  (S3, DB…)  │
└──────┬──────┘          └──────┬───────┘          └─────────────┘
       │                        ├── who + what + when + where
       │                        └── deny beats allow
       └── rotate credentials; prefer roles over keys
Three habits that save years of pain:
- Least privilege by default. Start from deny; add only what the caller proves it needs.
- Roles, not keys. Long-lived access keys get leaked; role-assumption with short-lived tokens does not.
- Policies reviewed like code. An IAM policy is production configuration. It gets a PR and a reviewer.
The model you want: every principal should have the smallest policy that still lets it do its job. When an incident leaks credentials, the blast radius is exactly that policy — no bigger, no smaller.
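The evaluation order the diagram implies can be shown in a few lines: explicit deny wins, then explicit allow, and everything else is implicitly denied. This is a toy evaluator with exact-match resources; real IAM adds wildcards, conditions, and resource policies.

```python
# Sketch: policy evaluation with least privilege built in.
# Toy model: exact-match actions/resources, no wildcards or conditions.

def evaluate(policy: list[dict], action: str, resource: str) -> str:
    decision = "implicit_deny"  # least privilege: start from deny
    for stmt in policy:
        if action in stmt["actions"] and resource in stmt["resources"]:
            if stmt["effect"] == "Deny":
                return "deny"   # deny beats allow, stop immediately
            decision = "allow"
    return decision

policy = [
    {"effect": "Allow", "actions": ["s3:GetObject"],
     "resources": ["reports/q1", "reports/secret"]},
    {"effect": "Deny", "actions": ["s3:GetObject"],
     "resources": ["reports/secret"]},
]

print(evaluate(policy, "s3:GetObject", "reports/q1"))      # allow
print(evaluate(policy, "s3:GetObject", "reports/secret"))  # deny
print(evaluate(policy, "db:Read", "orders"))               # implicit_deny
```

Note the blast-radius property: an action/resource pair the policy never mentions is denied without anyone having to remember to deny it.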
CAUTION
"Give it admin and we'll scope it down later" is the single most common origin story in cloud incident post-mortems. Later does not arrive. Later is a place that exists only in project-management software, and the audit log does not check Jira.
Go deeper: AWS IAM docs on trust policies vs permission policies; the OWASP Cloud Security cheat sheet; one afternoon tightening an over-broad policy in your own account with IAM Access Analyzer.
Station 5 — Observability
Three pillars and an honest statement of what each one's for:
- Logs — full fidelity, expensive, slow to query. The forensic tool.
- Metrics — aggregates, cheap, pre-computed for dashboards. The alerting tool.
- Traces — cross-service causality with sampled detail. The "where did the p99 go?" tool.
The model you want: logs tell you what; metrics tell you how often; traces tell you who. Alerts on metrics. Root-cause on traces. Forensics on logs. Don't confuse which does what.
WARNING
"We'll just log it" has bankrupted more engineering budgets than crypto miners. Sample aggressively above a free tier. Cardinality — the number of unique tag combinations — is the hidden cost multiplier; a single user_id label on a metric multiplies every existing series by your user count overnight.
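The multiplier is easy to compute before you ship the label: the series count is the product of each label's cardinality. The label names and counts below are illustrative.

```python
# Sketch: metric cost scales with series count, which is the product
# of the label cardinalities. Numbers are illustrative assumptions.
from math import prod

def series_count(label_cardinalities: dict[str, int]) -> int:
    """Unique time series = product of unique values per label."""
    return prod(label_cardinalities.values())

base = {"service": 20, "endpoint": 50, "status": 5}  # a modest dashboard
with_user = {**base, "user_id": 10_000}              # one "harmless" label later

print(series_count(base))       # 5000
print(series_count(with_user))  # 50000000
```

Do this arithmetic in the PR review, not in the invoice review.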
Go deeper: Cindy Sridharan's "Distributed Systems Observability" free book; Google's SRE Book ch. 6 (monitoring) and ch. 12 (effective troubleshooting); put an OTel tracer in one service for a day and read a real trace.
Station 6 — Composition patterns
Four patterns cover most real systems. Each is a specific mix of the five primitives.
- Three-tier — web + app + DB. The reliable default for transactional workloads. Scales vertically before horizontally.
- Serverless — gateway + functions + managed state. Best for bursty, stateless workloads. Beware cold starts and vendor lock.
- Edge-cached — CDN in front of a small, cacheable origin. Best for read-heavy static or near-static content; approaches zero origin traffic.
- Event-driven — producer writes to a log or queue; one or more consumers process. Decouples services; best for pipelines, fan-out, audit trails.
The model you want: match the pattern to the workload shape, not to the hype cycle. Most products run one pattern for the core and another for analytics/audit; that's fine.
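The pattern choice above can be sketched as a decision function. The branch order and thresholds are illustrative judgment calls, not rules from any framework.

```python
# Sketch: workload shape -> composition pattern.
# The branching logic is an illustrative judgment call, not a standard.

def pick_pattern(read_mostly: bool, bursty: bool, stateless: bool,
                 pipeline: bool) -> str:
    if pipeline:
        return "event-driven"  # decouple producer and consumers via a log
    if read_mostly:
        return "edge-cached"   # let the CDN absorb the reads
    if bursty and stateless:
        return "serverless"    # scale-to-zero pays off
    return "three-tier"        # the reliable transactional default

print(pick_pattern(read_mostly=True, bursty=False, stateless=True,
                   pipeline=False))  # edge-cached
```

Most products would call this function twice: once for the core workload, once for the analytics/audit tail.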
TIP
"We moved to serverless and our bill went up" is almost always a workload that was 24/7 busy. Functions win on zero-to-bursty; they lose on always-on.
Go deeper: AWS Well-Architected Framework (a long PDF that is worth the half-day); a week running the same service as three-tier and as serverless, measuring p99 and $/request.
Station 7 — Cost shape
The cloud bill is a shadow of your architecture. Five line items cover >90% of most invoices:
Where the money goes        When it bites
────────────────────        ─────────────
Compute hours               always-on workloads
Data egress                 chatty integrations, cross-region reads
IOPS on storage tiers       write-heavy workloads on slow tiers
Cold starts & provisioning  bursty functions with warmup cost
Observability retention     unsampled logs at high cardinality
The model you want: every architecture decision is also a cost decision. A cross-region read replica buys you availability and sells you egress. A function buys you elasticity and sells you cold starts. Nothing is "cloud-native" for free.
CAUTION
"Reserved capacity" saves 30–60% on compute if you predict load correctly. It's also a bet. Bet small at first; read the utilization curves before betting bigger.
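The bet can be priced before you place it. With an assumed 40% discount, reserved capacity only wins above 60% utilization; the prices below are illustrative, not vendor quotes.

```python
# Sketch: reserved capacity as a bet on utilization.
# Discount and hourly price are illustrative assumptions.

ON_DEMAND_PER_HOUR = 0.10
RESERVED_DISCOUNT = 0.40   # reserved costs 60% of on-demand, busy or idle

def monthly_cost(utilization: float, reserved: bool) -> float:
    """utilization: fraction of the month the instance is actually busy."""
    hours = 720
    if reserved:
        # You pay for every hour whether or not it does work.
        return hours * ON_DEMAND_PER_HOUR * (1 - RESERVED_DISCOUNT)
    # On-demand: pay only for the busy hours.
    return hours * utilization * ON_DEMAND_PER_HOUR

# At 50% utilization the bet loses; break-even sits at the discount (60%).
print(round(monthly_cost(0.5, reserved=False), 2))  # 36.0
print(round(monthly_cost(0.5, reserved=True), 2))   # 43.2
```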
Go deeper: Corey Quinn's newsletter (cynical but accurate); the FinOps Foundation's framework; one monthly ritual where the engineer who shipped the biggest cost line reads the invoice out loud.
How the stations connect
The cloud is a machine for trading dollars for latency, durability, and availability. Primitives are the controls. Patterns are the settings. Cost is the readout.
Pick a pattern. Notice what it costs. Iterate.
Standards & Specs
The cloud has no single standards body; the useful authorities are a mix of IETF, vendor reference architectures, and cross-vendor open specs.
- AWS Well-Architected Framework — the reliable default: six pillars (operational excellence, security, reliability, performance efficiency, cost optimization, sustainability). Even non-AWS teams lift from it.
- Google Cloud Architecture Framework — same shape, different vocabulary; useful cross-reference.
- CNCF Cloud Native Landscape — the taxonomy of open-source cloud-native tooling; the "what are my options" index.
- Kubernetes API spec — the de facto container orchestration API.
- OpenTelemetry spec — the vendor-neutral wire format for traces, metrics, and logs. The thing you should be emitting regardless of observability vendor.
- OCI Image Format and OCI Distribution — the open standards your container registries speak.
- RFC 1918 (private address space) and RFC 4291 (IPv6 addressing) — what your VPC is doing under the hood.
- RFC 8446 — TLS 1.3 — the only acceptable transport for anything you'd show a regulator.
- NIST SP 800-207 — Zero Trust Architecture — the reference for identity-first network design.
- FinOps Framework — the open, vendor-neutral cost-management discipline; the organizational counterpart to well-architected's cost pillar.
- Papers — DeCandia et al., "Dynamo" (SOSP 2007); Corbett et al., "Spanner" (OSDI 2012); Calder et al., "Windows Azure Storage" (SOSP 2011).
Test yourself
A static marketing site behind Route 53 + CloudFront + S3 has a slow first paint in Sydney. Which primitive do you investigate first, and why not the others?
Network, specifically CDN presence. If CloudFront has a POP near Sydney, first-byte should be under 50 ms. If requests fall through to S3 in us-east-1, you'll see ~200 ms + origin time. Not compute (static site, no server work). Not storage IO (S3 is fast). Not identity. Edge caching is the primitive that turns geography into a miss-vs-hit question.
Your serverless API has a p99 of 2 seconds on new deploys, dropping to 80 ms after warmup. What architectural choice is paying for this and what's the cheap fix?
You're paying the cold-start tax of the Function compute primitive. Cheap fix: provisioned concurrency for hot endpoints (keeps N instances warm at ~container cost), or move the hot path to a small always-on container/managed runtime. Same five primitives, different composition.
A new feature triples your monthly bill. The compute line is flat. Where is the money actually going?
With compute flat, the prime suspects are egress and observability retention. New feature likely added a chatty integration (cross-region DB reads, a 3rd-party API call per request with large payloads), or generous log verbosity at high cardinality. Both shapes show up in "other" on the bill; both are architectural, not operational.