Engineering Craft
A codebase has two kinds of cost: the cost of writing it once, and the cost of living in it for years. Craft is the second cost. Version control records what changed and why. Review catches what tests can't. Tests, pipelines, and telemetry close the loop from commit to production. Incident response and decision records survive turnover. None of this is layered on top of the code; it weaves through every commit. This page walks the practices that, together, decide how expensive the second cost gets.
Version control
Multiple engineers edit the same files. Edits arrive in different orders on different machines. Someone needs to know what the code looked like three weeks ago and why it changed. A shared folder can't do any of this. Git is a version-control system: it records every change as a snapshot, links snapshots into a history graph, and lets parallel work converge through a merge algorithm rather than by overwriting.
A commit is the unit. One commit should be one change you could explain in a sentence — "rename User.email to User.primaryEmail and update the migration." Bundling ten unrelated changes into one commit destroys the ability to revert any one of them. Splitting one change across ten commits destroys the ability to see what happened. The history of a project is a directed acyclic graph of these commits, and that graph is the artefact engineers consult when they ask why a line exists.
The object store
Git stores history as content-addressed objects in .git/objects/. Every object — file content, directory listing, commit, tag — is named by the SHA-1 hash of its bytes (newer repos can use SHA-256). The hash isn't a label assigned to the object; it is the object's identity. Change one byte and the hash changes. Change a commit and every descendant commit's hash changes too.
Four object kinds compose the model. A blob holds file contents — no name, no metadata, just bytes. A tree maps names to blobs (or to other trees), so a tree is one directory snapshot. A commit points at one tree (the snapshot it captures), at zero or more parent commits, and at author and message metadata. A tag is a signed pointer at any of those. Because every reference is by content-hash, two branches that share a file share its blob automatically.
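The content-addressing rule is small enough to reproduce by hand. A minimal sketch in Python, using only the standard hashlib module: a blob's ID is the SHA-1 of a short header plus the file's raw bytes, which is the same computation git hash-object performs.
import hashlib

def blob_sha1(content: bytes) -> str:
    # Git's blob ID: SHA-1 over "blob <size>\0" followed by the raw bytes.
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

print(blob_sha1(b"hello\n"))   # the same ID `git hash-object --stdin` reports for these bytes
print(blob_sha1(b"hello!\n"))  # one extra byte, a completely different ID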
Branches, tags, and HEAD sit on top of this object store as a thin layer of mutable name pointers (refs under .git/refs/, plus the .git/HEAD file). Move the pointer and you change the branch. The objects themselves stay immutable until garbage collection prunes them. The practical consequence: history is tamper-evident. Publish the head SHA, and any later modification to any past commit changes that SHA. The chain anchors itself to its own contents, the same property a Merkle tree provides.
Merging and rebasing
Two engineers work on two branches. Eventually those branches need to combine. Merging produces a commit with two parents — both branches' history is preserved verbatim, and the new commit records that they came together. Rebasing replays one branch's commits onto the tip of the other — same patches, new parent pointers, new SHAs, and the original parallel history disappears from view.
Pick by what you want the history to say. Merge when the parallelism mattered — a release branch, a long-running integration, work that genuinely happened independently. Rebase when it didn't — a small feature you want to land as a clean line on main. Neither choice is technically correct or incorrect; both produce a working tree with the same files.
Both operations share one algorithm underneath. Given two branch tips, Git finds their most recent common ancestor, computes the diff from ancestor to each tip, and applies both patch sets. Lines edited on only one side merge cleanly. Lines edited on both sides become conflicts: the algorithm marks the region with <<<<<<< / ======= / >>>>>>> and stops, because picking a winner requires understanding what the edits meant. The ancestor is what makes this work — without it, Git would have to guess which side was "right"; with it, the question reduces to "did the two sides touch the same region?"
Worked example: a 3-way merge conflict on one file
Three versions of the same file greet.py exist. The base is the common ancestor — the last commit both branches agreed on. Ours is the version on the branch being merged into. Theirs is the incoming version.
Base (the ancestor):
def greet(name):
    return "Hello, " + name
Ours (main — someone added punctuation):
def greet(name):
    return "Hello, " + name + "!"
Theirs (feature — someone added a title argument):
def greet(name, title):
    return "Hello, " + title + " " + name
Git looks at each line. The function signature changed only on theirs — no conflict, take theirs. The return statement changed on both sides relative to base — conflict, because Git cannot tell whether the punctuation should survive the new title argument. The file in the working tree after git merge looks like:
def greet(name, title):
<<<<<<< HEAD
    return "Hello, " + name + "!"
=======
    return "Hello, " + title + " " + name
>>>>>>> feature
Everything between <<<<<<< and ======= is ours. Everything between ======= and >>>>>>> is theirs. The human resolves the conflict by editing the file to the intended version (probably "Hello, " + title + " " + name + "!") and removing the marker lines; then git add greet.py && git commit finishes the merge. The ancestor is what let Git auto-accept the signature change — without it, every line edited on either side would have to be flagged.
Branching strategies
A team that all commits to one branch trips over each other. A team that all works on long-lived parallel branches drowns in merge conflicts. Branching strategy is the coordination pattern that lives between those failure modes.
Trunk-based development keeps a single shared branch — usually main. Feature branches live less than a day before merging back. Continuous integration runs on every push. Half-finished work hides behind feature flags rather than living on an isolated branch. GitFlow keeps long-lived parallel branches — develop for integration, main for releases, separate feature/*, release/*, and hotfix/* branches for each lifecycle stage. Trunk-based fits daily deploys; GitFlow fits quarterly versioned releases. Most teams that ship continuously land between them, with short-lived feature branches and a single release branch per version.
Pitfall — force-push on shared history. git push --force rewrites the branch on the server. Anyone who pulled before the rewrite now has commits that no longer exist. The fix is --force-with-lease, which refuses the push if anyone else has updated the branch since you last fetched. Reserve unconditional force-push for branches only you touch.
Dependencies and semantic versioning
No project is self-contained. Every codebase depends on libraries; libraries depend on libraries. The version number a package publishes is the contract between author and consumer.
Semantic versioning (SemVer) formalises that contract as MAJOR.MINOR.PATCH. A PATCH bump (1.4.2 → 1.4.3) is a backwards-compatible bug fix. A MINOR bump (1.4.3 → 1.5.0) adds functionality without breaking existing callers. A MAJOR bump (1.5.0 → 2.0.0) is the package author saying "your code will break — read the migration notes." Lockfiles (package-lock.json, Cargo.lock, poetry.lock) freeze the exact versions of every transitive dependency so two machines and the CI runner all install the same bytes.
The failure mode is the MAJOR bump that's actually a MINOR, or worse, the MINOR bump that quietly breaks callers. SemVer is a social contract enforced by humans reading changelogs. When it breaks, you find out at runtime. Mitigations: pin to compatible ranges (^1.4.0 accepts patches and minor bumps but not majors), audit changelogs before bumping, and run the test suite against the new lockfile before merging the version bump. The point of the suite is to discover the broken assumption before production does.
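The caret-range rule itself is mechanical enough to sketch. A toy comparison helper for versions at or above 1.0.0, ignoring the pre-release tags, build metadata, and special-cased 0.x ranges that real package managers handle:
def parse(version: str) -> tuple[int, int, int]:
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch

def caret_allows(base_version: str, candidate: str) -> bool:
    # ^base accepts any version with the same MAJOR that is not older than base.
    base, cand = parse(base_version), parse(candidate)
    return cand[0] == base[0] and cand >= base

assert caret_allows("1.4.0", "1.4.3")        # patch bump: accepted
assert caret_allows("1.4.0", "1.5.0")        # minor bump: accepted
assert not caret_allows("1.4.0", "2.0.0")    # major bump: rejected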
Code review
You finish a change. You think it's correct. Nothing in the toolchain disagrees — tests pass, types check, the linter is quiet. But "correct" and "good for this codebase" are different properties. The function is correct in isolation and the wrong shape for where it lives. The name reads fine and contradicts the team's vocabulary. The abstraction works today and leaks the first time someone adds a second backend. Code review is the practice of having another engineer read the change before it merges, specifically to catch the things automation can't.
The default mental model — "review catches bugs" — is too narrow. Tests catch bugs. Review catches design drift. Every review is also asymmetric knowledge transfer: the author already understands the change; the reviewer reads themselves into it. Across enough reviews, more than one person knows each part of the codebase. The bus factor stays above one as a side effect of the practice.
What makes review work
Three forces shape the quality of the signal. Size — a 200-line PR gets read line by line; a 2,000-line PR gets rubber-stamped because nobody has the working memory. Latency — a review returned in four hours arrives while the author still has context; a review returned in three days arrives after they've moved on, after the branch needs rebasing, and after the next change is already blocked behind it. Tone — comments that critique the code with concrete alternatives ("consider extracting this branch — it's a different responsibility") are absorbed; comments that read as judgement ("this is wrong") aren't.
The review verbs are protocol, not vocabulary. Comment is non-binding chatter. Request changes blocks merge. Approve releases the gate. Conflating them dilutes the signal — every comment then has to be triaged as if it were a block, and the noise floor rises.
The two-author rule (one writes, at least one other reviews) is the hard floor. Many teams add a second reviewer for changes that cross domain boundaries — a backend engineer touching front-end code, a feature change touching auth. The cost of a second reviewer is hours; the cost of the wrong person being the only one who understands a critical subsystem is years.
Pitfall — review as gatekeeping. A review that asks for changes only to demonstrate seniority slows the team without raising quality. Distinguish required (the change is wrong, unsafe, or violates a documented standard) from suggested (you'd write it differently). Suggestions are fine; only requireds should block merge.
Tests
A test is code that runs the code and asserts something about the result. Without tests, every change is a hope. With tests, every change is a verifiable proposition — "the code still does what it claimed to do yesterday." The reason this matters isn't bug-finding in the moment; it's confidence in the moment after. Refactoring a function with no tests means rereading every caller. Refactoring a function with tests means running them.
The pyramid
Tests come in three rough sizes, and the right project has many of the small ones, fewer of the middle ones, and only a handful of the big ones. Unit tests exercise a single function or class with no I/O, run in milliseconds, and are the wide base of the pyramid. Integration tests exercise a component plus its real neighbours — a service plus its database, a parser plus its lexer — take 100 ms to 1 s, and form the narrower middle. End-to-end tests drive the whole system through its public interface, take seconds to minutes, and sit at the narrow top.
The shape matters because the suite has to be runnable. A 30-second unit suite runs on every save. A 30-minute end-to-end suite runs nightly, if at all. Inverting the pyramid — many slow end-to-end tests, few units — produces a suite that's slow, flaky, and skipped under deadline. The signal collapses the moment running it costs more than ignoring it.
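The size difference is easier to see side by side. A sketch, with a hypothetical add_vat function: the unit test touches pure code only, while the integration test pulls in a real SQL engine (SQLite in memory), so it runs slower and can fail for more reasons.
import sqlite3

def add_vat(net: float, rate: float = 0.2) -> float:
    # Hypothetical pure function: the kind of code unit tests cover.
    return round(net * (1 + rate), 2)

def test_add_vat_unit():
    # Unit: no I/O, runs in microseconds, fails only if the logic is wrong.
    assert add_vat(10.00) == 12.00

def test_order_total_integration():
    # Integration: a real SQL engine participates, so setup is heavier and a
    # failure can point at the query, the schema, or the logic.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, net REAL)")
    db.execute("INSERT INTO orders (net) VALUES (?)", (10.00,))
    (net,) = db.execute("SELECT net FROM orders").fetchone()
    assert add_vat(net) == 12.00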
Property-based testing
Example-based tests pin one input to one expected output. That works for known cases. It misses the input nobody thought to write — the empty list, NaN, the surrogate codepoint, the day after a leap second. Property-based testing flips the script. Instead of an input/output pair, you state an invariant the function must satisfy for all inputs: reverse(reverse(xs)) == xs, encode(decode(b)) == b, "the result is sorted." The framework generates hundreds of random inputs and checks the property against each.
When a property fails, the framework shrinks — it repeatedly trims the failing input until it produces the smallest example that still breaks the property. A 200-element list that fails on element 47 becomes a 2-element list that exposes the same bug. The shrinking step is what makes the technique practical: instead of a debugger crawl through a giant counterexample, the test report writes itself.
The cost is upfront — you write a generator rather than a literal value, and you have to articulate what should hold for every input, not just which output matches which example. The payoff is coverage breadth no example suite can match.
Worked example: the same reverse test, two styles
Example-based — you pick three inputs and assert on each:
def test_reverse_examples():
    assert reverse([]) == []
    assert reverse([1]) == [1]
    assert reverse([1, 2, 3]) == [3, 2, 1]
Three asserts, three inputs. Pass means: the function works for these three lists. Nothing said about a list of 50 elements, a list of duplicates, or a list containing None.
Property-based — you state one invariant and let the runner try inputs:
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_reverse_property(xs):
    assert reverse(reverse(xs)) == xs
One assert, one property. The runner generates inputs of its own choosing — typically 100 by default — and runs the assert against each. A sample run might try [], [0], [-1, 1], [42, 42, 42], [1, 2, ..., 47], and ninety-five others. The signal: the property holds across all of them, or it doesn't.
Now suppose someone introduces a bug — reverse flips its argument in place before building its result, so after the roundtrip the xs the assertion compares against has itself been reversed, and the property fails for any list that doesn't read the same forwards and backwards. The runner tries [1, 2, 3, 4, 5, 6], fails, then shrinks: try [1, 2, 3], still fails; try [1, 2], still fails; try [1], passes. The minimum failing input is the two-element list, and that's what gets printed in the report — not the original 6-element list the runner happened to generate first.
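A hypothetical implementation with exactly that bug, for concreteness:
def reverse(xs):
    xs.reverse()     # bug: flips the caller's list in place
    return xs[:]     # then returns a copy of the now-reversed list

# reverse(reverse([1, 2])) evaluates to [1, 2], but the original list object
# is now [2, 1], so `assert reverse(reverse(xs)) == xs` fails for any xs that
# doesn't read the same forwards and backwards.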
Mutation testing
Line coverage measures whether a test caused a line to execute. It does not measure whether the test would fail if the line were wrong. A function with 100% line coverage and no assertions has 100% coverage and 0% verification.
Mutation testing answers the question line coverage dodges. The tool modifies the production code one operator at a time — < becomes <=, + becomes -, if x becomes if not x — and runs the test suite against each mutant. A killed mutant is one the tests caught. A surviving mutant is one no test noticed: a hole in the suite that coverage metrics didn't see. The mutation score is killed-over-total, and a healthy suite hits 80%+.
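What a mutant looks like in practice, sketched on a hypothetical threshold check. The mutation tool generates the second version automatically; the suite's job is to tell the two apart.
def free_shipping(total: float) -> bool:
    return total >= 50.00            # original code

def free_shipping_mutant(total: float) -> bool:
    return total > 50.00             # mutant: >= flipped to >

def test_boundary_kills_the_mutant():
    # 50.00 is the only input where original and mutant disagree. A suite that
    # never asserts on the boundary lets this mutant survive, even at 100%
    # line coverage.
    assert free_shipping(50.00) is True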
Pitfall — the test that tests the mock. A unit test that mocks every collaborator only verifies that the function calls the mocks the way the test expects. Change the implementation in a legitimate way and the test breaks. Introduce a real bug downstream of the mock and the test passes. Push tests one level outward — exercise the smallest unit that actually does something — and failures get closer to reality.
CI/CD
A change works on your laptop. That tells you almost nothing — your laptop has different dependencies, different state, different secrets. The only way to know whether the change works in the configuration the company actually ships is to run it in that configuration. Continuous integration (CI) runs every commit on a clean machine against the canonical toolchain. Continuous delivery (CD) produces, on every green build, an artefact that could deploy. Continuous deployment is CD with the human removed from the loop — the artefact rolls out automatically.
The animating goal is that deploys should be boring. Frequent, small, reversible. A team that ships ten times a day has ten times as many opportunities to discover a problem early, and each of those ten changes is small enough to revert by name. A team that ships once a quarter ships a bundle so large that "what changed" can't be answered in a single sentence.
Rollback as a primitive
A pipeline that can deploy but cannot roll back isn't reliable; it's optimistic. Rollback as a first-class operation means the previous artefact is still warm — tagged, loadable in seconds, requiring no rebuild. If reverting requires running migrations backwards, restoring database backups, or paging an SRE, you don't have rollback; you have an aspiration.
Database schemas are the hard case. You can't roll back a column you already dropped. The expand-migrate-contract pattern keeps rollback available across schema changes: first add the new column without removing the old one (expand); dual-write to both for a window long enough that a rollback target still works (migrate); only later drop the old column (contract). The database speaks two versions at once during the risk window. Get this wrong and a rollback during the window corrupts data.
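A sketch of the three phases for a hypothetical rename of users.email to users.primary_email, written as the SQL three separate deploys would run (table and column names are illustrative):
# Deploy 1 (expand): the new column appears, the old one stays.
EXPAND = "ALTER TABLE users ADD COLUMN primary_email TEXT"

# Deploy 2 (migrate): application code dual-writes both columns, and a
# backfill copies historical rows. Rolling the app back in this window is
# safe because the old column is still being written.
BACKFILL = "UPDATE users SET primary_email = email WHERE primary_email IS NULL"

# Deploy 3 (contract): only after every reader and writer uses the new
# column. Rollback past this point stops being free.
CONTRACT = "ALTER TABLE users DROP COLUMN email"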
Progressive delivery
Once rollback is cheap, you can ship in increments. Canary deployments route a small fraction of real traffic — 1%, then 5%, then 25% — to the new version while watching error rate, p99 latency, and saturation. If the metrics hold, the canary ramps to 100%. If they don't, traffic snaps back to the old version automatically. Blue-green deployments keep two full environments running — one serving, one idle with the new version — and switch the load balancer between them. Blue-green gives instant cutover at the cost of double infrastructure; canary gives gradual exposure at the cost of needing reliable metric-driven gates.
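A sketch of the metric-gated ramp, assuming hypothetical set_traffic_split and error_rate hooks into the load balancer and the metrics store; real systems delegate this loop to a rollout controller rather than hand-rolled code:
import time

def canary_rollout(set_traffic_split, error_rate, threshold=0.01):
    # Ramp the new version through increasing slices of real traffic,
    # checking the error-rate gate at each step.
    for percent in (1, 5, 25, 100):
        set_traffic_split(new_version_percent=percent)
        time.sleep(600)                                # let real traffic accumulate
        if error_rate("new_version") > threshold:
            set_traffic_split(new_version_percent=0)   # snap traffic back
            return False                               # canary failed; old version stays
    return True                                        # fully ramped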
Pitfall — the long-running pipeline. Every minute the pipeline runs beyond about fifteen is a minute the author spends context-switched away. Engineers learn to push, walk away, and come back; PR throughput collapses. The fix is parallelism (split the unit suite across runners) and caching (don't rebuild dependencies that haven't changed), not "tolerate it; CI is slow."
Observability
Production lies to you. Code that worked in staging fails in production because real traffic has shapes test traffic doesn't. Code that emits no telemetry is opaque — the first signal of a problem is a customer email. Observability is the practice of instrumenting code before it ships so that, when something is wrong, you can answer "what?" and "why?" by query rather than by guesswork.
Three telemetry types do different jobs. Metrics are time-series numbers — request rate, error rate, p99 latency — cheap to aggregate and optimal for alerting. Logs are timestamped lines of context — the URL, the user ID, the error message — expensive at scale and optimal for debugging a specific incident. Traces are causally-linked spans across services — API call into auth service into database into cache — optimal for understanding the path of one request through a distributed system. The OpenTelemetry SDK is the vendor-neutral way to emit all three; the W3C traceparent header carries span context across service boundaries.
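A sketch of what instrumenting one code path looks like with the OpenTelemetry Python API (the opentelemetry-api package). Exporter and SDK configuration are omitted, so these calls are no-ops until a backend is wired in — which is the vendor-neutral point: instrument once, pick the backend at deployment time. The charge function and attribute names are illustrative.
from opentelemetry import metrics, trace

tracer = trace.get_tracer("checkout")
meter = metrics.get_meter("checkout")
orders_charged = meter.create_counter("orders_charged")

def charge(order_id: str, amount_cents: int) -> None:
    # One span per charge attempt; attributes carry the per-request context.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("charge.amount_cents", amount_cents)
        orders_charged.add(1, {"currency": "EUR"})   # metric: a counter
        # ... call the payment provider here ...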
Service level objectives
A team needs a way to decide whether the system is reliable enough. Without a number, "reliable enough" becomes whoever shouted last. A Service Level Objective (SLO) is that number: a measurable property the team has committed to, like "99.9% of API requests succeed within 200 ms over a 28-day window." The Service Level Indicator (SLI) is the actual measurement — what the system is doing right now. The gap between the SLO and 100% is the error budget: the volume of failures the team is allowed to accept.
The budget turns reliability into accounting. If the past 28 days ran at 99.95%, you have 0.05% of the window's traffic to spend on risky deploys, schema migrations, and chaos drills. If the past 28 days ran at 99.85%, the budget is exhausted and the org freezes risky changes until reliability recovers. The argument "should we ship features or fix reliability?" becomes "look at the budget."
Worked example: a 99.9% SLO in minutes
Start with the SLO: 99.9% of requests succeed in a 28-day window. The error budget is the 0.1% that's allowed to fail.
- Convert the window to minutes. 28 days × 24 hours × 60 minutes = 40,320 minutes.
- Compute the budget. 0.1% of 40,320 minutes = 40.3 minutes. That's the total time the service can be down (or, more precisely, failing more requests than the SLO permits) over the whole window before the budget is gone.
- Burn one incident. A bad deploy at day 10 takes the service down for 15 minutes before the on-call rolls back. Budget remaining: 40.3 − 15 = 25.3 minutes for the next 18 days.
- Decide what to do. 25 minutes across 18 days is still a livable budget — risky deploys can continue, but with more care. Two more 15-minute incidents in the same window and the budget goes negative, which triggers the change freeze the policy promised.
The same arithmetic at other SLOs is worth memorising. 99% over 28 days is 403 minutes — about 6.7 hours — which is generous. 99.99% is 4 minutes, which is brutal: one bad rollback eats the whole month. The number of nines you pick is not aesthetic; it's the budget your team gets to spend on every change you ship.
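The arithmetic fits in a one-line helper, handy for the "how many nines can we afford?" conversation:
def error_budget_minutes(slo: float, window_days: int = 28) -> float:
    # Minutes of total failure the SLO tolerates over the window.
    return (1 - slo) * window_days * 24 * 60

print(error_budget_minutes(0.999))    # ~40.3 minutes
print(error_budget_minutes(0.99))     # ~403 minutes (about 6.7 hours)
print(error_budget_minutes(0.9999))   # ~4 minutes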
Pitfall — vanity SLOs. "99.99% uptime" sounds rigorous and is meaningless if the SLI doesn't reflect what users experience. A status page that only checks /health reports 100% while every real query times out. The SLI must measure user-perceived behaviour — request success rate from the user's edge, not from inside the rack.
Debugging and profiling
Telemetry tells you that something is wrong. Debugging and profiling tell you what. Debugging is the loop of reproducing a failure, narrowing where in the code the failure happens, and confirming the fix. The skill isn't clever tricks — it's restraint. Reproduce reliably before guessing. Bisect the change history (git bisect) to find the commit that introduced the regression. Read the actual values flowing through the code; do not trust the model in your head. Most "mystery" bugs are bugs the model said couldn't happen, which is why they look like mysteries.
Profiling is debugging for performance. Instead of asking "why is this wrong?", you ask "where is the time going?". A sampling profiler interrupts the program many times per second and records the stack; the function names that appear most often are where most of the wall-clock lives. A flame graph stacks those samples visually — wide bars are hot. The standard mistake is to optimise the code you suspect rather than the code the profiler points at. Intuition about performance is unreliable often enough that the measurement is the only safe place to start.
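The standard library is enough to start. cProfile is a deterministic (tracing) profiler rather than a sampling one, but the workflow is the same: measure, then read the table before touching anything. The workload function here is a stand-in for whatever code path is under investigation.
import cProfile
import pstats

def workload():
    # Stand-in for the code path being profiled.
    return sorted(str(i) for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by cumulative time and show the ten call sites where the time went.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)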
Incidents
Every production system fails eventually. The question is not whether failures happen — they do — but how fast the system recovers, how much users notice, and whether the same failure happens again next month. Incident management is the practice that turns one bad afternoon into one learning, instead of one bad afternoon into the same bad afternoon every quarter.
Two numbers frame the conversation. MTTR (mean time to recovery) is the time from alert to user-visible-behaviour-normal. This is what the team optimises. MTBF (mean time between failures) is the inverse rate of incidents — informative, but mostly outside the team's short-term control. Teams running mature on-call rotations target MTTR in minutes for severe incidents and hours for the rest.
Blameless postmortems
A postmortem is the document written after an incident. The format is rigid on purpose: a one-paragraph summary, a minute-by-minute timeline, a list of contributing factors, and a list of action items each owned by a name. The timeline gets written before the analysis, because the timeline is observable and the analysis is interpretive.
Blameless means the document assumes the people involved acted reasonably given the information they had at the time. If they did, and the outage still happened, the outage is a system bug, not a personal failing. The opposite — punishing the person who pushed the bad change — produces a culture where the next person hides the next bad change, and the outage that follows is harder to investigate. Investigate the system; the next outage doesn't recur. Punish the person; the next outage hides.
The analysis section uses five whys. Start with the surface symptom. Ask "why?". Answer with the most immediate cause. Ask "why?" again of that answer. Keep going past the first plausible-sounding stop. The number five is approximate; the discipline is to keep descending. The first "why" usually surfaces a code fact ("the migration locked the rows"); the fifth usually surfaces a process fact ("we have no review step for schema changes against hot tables").
On-call and runbooks
On-call is the staffing model that makes incident response possible. The standard shape is one primary engineer paged for alerts at any hour, one backup if the primary doesn't respond, and weekly handoffs. Alerts are categorised by severity — pages-immediately at the top, next-business-day at the bottom — and every alert links to a runbook.
A runbook is a short action document — "if auth_5xx_rate > 1%, check the connection pool saturation; if saturated, restart the pool; if that fails, page the platform team." Runbooks turn a 3 a.m. page from "think hard and hope" into "follow the recipe." A runbook that doesn't exist is a runbook that gets written during the incident, badly.
Pitfall — the action item with no owner. A postmortem ending in eight unowned action items has produced no learning. Either each item gets a name and a date, or the analysis was performative. Track action-item completion rate as a leading indicator: a low completion rate means the postmortem process is generating documents but not changing the system, however thorough the documents read.
Decision records
A new engineer asks why the service uses Postgres and not Redis. Three current engineers give three different answers. Two of them are wrong, and nobody can tell which. The decision was made in a meeting two years ago by people who have since left. The reasoning is gone. The result is that the next decision in the same area gets made without the context that motivated this one.
Engineering decisions get made every day; engineering decisions get recorded almost never. The asymmetry is the whole problem. Code records what the system does. Only writing records why it does it that way. Architecture Decision Records (ADRs) and RFCs are the writing-things-down practice that closes the gap.
ADRs
Michael Nygard's original ADR template has four fields and fits on one page.
- Context — what's true about the world that forced a choice.
- Decision — what we're going to do.
- Status — proposed, accepted, or superseded by ADR-NNNN.
- Consequences — the trade-offs we're accepting, both good and bad.
That's it. ADRs are short (under 500 words), numbered, immutable once accepted (you don't edit ADR-0007 — you write ADR-0023 superseding it), and committed to the repo alongside the code they describe.
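A hypothetical ADR in that shape, condensed to show how little a useful one needs:
ADR-0012: Store orders in Postgres (hypothetical example)
Status: Accepted
Context: The orders service needs transactional writes across three tables and ad-hoc reporting queries. The team already operates Postgres for two other services; nobody operates Redis as a system of record.
Decision: Orders live in Postgres. Redis remains a cache, not a store of record.
Consequences: We accept slower point reads than Redis would give us; we gain transactions, SQL reporting, and one fewer system to page on. Supersede this record if read latency becomes the binding constraint.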
RFCs and writing for precision
An RFC is the longer cousin — a proposal for a change that touches enough of the system to need cross-team review before code gets written. RFCs are negotiated documents: comments shape the proposal before it's accepted. ADRs are recorded documents: they capture a decision after it's made. Together they close the gap between "we talked about it in a meeting" and "we wrote it down so the meeting survives the meeting."
Technical writing is its own skill. The vocabulary of RFC 2119 — MUST, SHOULD, MAY, and their negatives — exists because "should" in everyday English is ambiguous. In a spec, the uppercased keywords carry a defined meaning: MUST is a hard requirement, SHOULD admits exceptions with justification, MAY is optional. The same precision should run through any document the team will reread under pressure: be specific about what the system requires, what it permits, what's a recommendation. Write the document you would want to find at 3 a.m.
Pitfall — ADRs nobody reads. A repo of 200 ADRs that nobody consults is wasted ink. Make ADRs discoverable from the code, referenced in PRs ("this implements ADR-0023"), and surfaced during onboarding. An ADR that doesn't get cited within six months of acceptance is a candidate for archival.
The career ladder
A team needs a way to talk about where engineers are in their growth, what's expected at each step, and what "promotion" means beyond "more money." The ladder is that vocabulary. It isn't a measure of years; it's a measure of two axes — scope of impact and the share of value that comes from judgement rather than execution.
The classic five-level shape names the arc. Junior engineers implement solutions: they take a well-defined problem and produce code that solves it. Mid-level engineers design solutions: they take a fuzzy problem and decide how the code should be shaped. Senior engineers define problems: they look at a domain, see what should exist that doesn't, and write the plan. Staff engineers define the right problems: they distinguish the dozen things the org could work on from the two it should. Principal engineers redefine what "problem" means: they change the questions the company asks itself.
The phase change at staff and above deserves its own line. Up to senior, the question is "how good is my code?" — and the answer arrives via PRs, reviews, and incident metrics. At staff and above, the question becomes "how much leverage does my judgement create?" — and the answer arrives via decisions other people make, projects other people land, and the gradual disappearance of categories of work the org used to find painful. Staff engineers measure themselves by the size of the surface they no longer need to touch.
Pitfall — staff-engineer-as-tech-lead-forever. Some orgs collapse "staff" into "very senior tech lead" and never grow the role into the strategic seat it should be. The symptom is staff engineers spending 80% of their time in one team's standups. The fix is at the org level: a healthy staff role has a portfolio crossing team boundaries, a sponsor at director-or-above, and an explicit charter in writing.
Standards
- Git — the git(1) manual pages for the command set, and the Documentation/ directory in the Git source tree (Documentation/technical/) for the on-disk object format. Git has no formal RFC; the implementation is the spec. SHA-1 is the historical default; SHA-256 support (--object-format=sha256) arrived in Git 2.29 and can be chosen for newly created repositories.
- Semantic Versioning (SemVer 2.0.0) — MAJOR.MINOR.PATCH with the strict definition that MAJOR bumps signal incompatible API changes.
- Conventional Commits (1.0.0) — the feat: / fix: / chore: prefix grammar that lets release tooling infer SemVer bumps from commit messages.
- OpenAPI / AsyncAPI / gRPC — handed forward from Act Vc as the contract artefacts CI/CD validates: OpenAPI 3.1 (OpenAPI Initiative), AsyncAPI 3.0 (AsyncAPI Initiative), gRPC and Protocol Buffers 3 (Google / CNCF). Schema-first development means CI fails the build when the schema changes incompatibly.
- OpenTelemetry — CNCF specification for telemetry data, covering the trace, metric, and log data models, the OTLP wire protocol, and SDK behaviour. The vendor-neutral instrumentation layer.
- W3C Trace Context — W3C Recommendation defining the traceparent and tracestate HTTP headers that propagate span context across service boundaries.
- Prometheus exposition format — the de facto plain-text metrics format (# HELP, # TYPE, metric{label="x"} 42) that every modern metrics agent ingests; OpenMetrics is its formalised successor.
- CloudEvents — CNCF spec for event metadata (id, source, type, subject, time) so events from different systems can be routed and correlated by a common broker.
- Containers / OCI — handed forward from Act Vc. OCI Runtime Specification, OCI Image Specification, OCI Distribution Specification: the registry, image, and runtime contracts every container runtime obeys.
- JUnit XML report format — the de facto schema (<testsuite> / <testcase> / <failure>) every CI system ingests for test-result aggregation. There is no normative spec; the Ant tooling defines the shape.
- ADR convention — Michael Nygard, "Documenting Architecture Decisions" (Relevance, 2011). The MADR (Markdown Architectural Decision Records) project extends Nygard's template with structured sections for considered alternatives.
- RFC 2119 / RFC 8174 — Key words for use in RFCs to Indicate Requirement Levels. The "MUST / SHOULD / MAY" vocabulary used across technical writing; cite both because RFC 8174 clarified that the words are normative only when uppercased.
- SRE practice — Site Reliability Engineering (O'Reilly, 2016) and The Site Reliability Workbook (O'Reilly, 2018), the de facto reference for SLO/error-budget mathematics, blameless postmortem structure, and on-call rotation design.
- Software life cycle / safety — ISO/IEC/IEEE 12207 (software life-cycle processes); ISO 26262 (functional safety, automotive); IEC 62304 (medical-device software). The standards regulated industries cite when a project must produce evidence of disciplined engineering, not just shipped code.
- Forward refs — security postures from Act VIIa apply to CI/CD secrets management: pipeline credentials live in a secret manager (Vault, AWS Secrets Manager, SOPS-encrypted files in-repo), are scoped per environment, and rotate on a schedule. A leaked CI token is a leaked production credential.
Branches that earn their own article.
- Git internals (objects, refs, packfiles).
- Code review practices and tools.
- Testing strategies (mutation testing, fuzzing, contract testing, chaos engineering).
- CI/CD pipeline architectures.
- Observability tooling (Prometheus, Grafana, OpenTelemetry, Datadog).
- SRE practices (error budgets, toil reduction).
- Documentation systems (docs-as-code, API docs, runbooks).
- Technical writing for engineers.
- Staff+ engineering resources.