Engineering Craft
TL;DR
Senior engineers ship working code. Staff engineers ship code that a team can change safely a year from now. The gap is not talent — it's craft: testing strategy, atomic version control, review rigor, design-before-code, observability that actually explains incidents, and blameless postmortems. This roadmap is the map of those habits.
You will be able to
- Name the seven stations and the artifact each one produces.
- Spot which station a struggling team is under-investing in.
- Prioritize the habits that compound over the ones that look productive.
The Map
- Station 1 — Testing strategy, not "more tests"
- Station 2 — Version control habits
- Station 3 — Code review for design, not style
- Station 4 — Design before code
- Station 5 — Observability is a product feature
- Station 6 — Automation and CI hygiene
- Station 7 — Incident response and postmortems
- How the stations connect
- Standards & Specs
- Test yourself
Read the map this way: the senior practices are a prerequisite, not a ceiling. Staff habits are additive — they compose on top of "works and is tested." They also take longer to pay off, which is why they're the thing that gets skipped.
Station 1 — Testing strategy, not "more tests"
The pyramid is not gospel but it's a useful starting shape:
```
        ╱  ╲
       ╱ E2E╲          few, slow, flaky, high confidence
      ╱──────╲
     ╱  Int   ╲        some, medium, medium confidence
    ╱──────────╲
   ╱    Unit    ╲      many, fast, deterministic, narrow
  ╱──────────────╲
```
But the real model is what each test type gives you and what it costs:
- Unit — fast, deterministic, narrow scope. Catch logic bugs and regressions. Cheap to write, cheap to run.
- Integration — tests the contract between units (often including a real dependency — DB, HTTP client). Catches interface drift.
- End-to-end — tests a user-visible path. Catches the bugs the other two missed; is the most expensive to own.
- Property — generate inputs; assert invariants. Catches bugs you'd never have thought to test.
- Contract — consumer-driven or provider-driven schemas. Catches cross-service drift before deploy.
The model you want: pick the test type that gives you the highest confidence per minute-of-engineer-time. Most teams default to unit+e2e and under-use property and contract tests.
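Property testing is the least familiar of the five, so a sketch helps. In practice you would reach for a library like Hypothesis; this dependency-free version shows the idea with plain `random`: generate many inputs, assert invariants that must hold for all of them. The `dedupe_keep_order` function is a hypothetical unit under test.

```python
import random

def dedupe_keep_order(items):
    """Hypothetical unit under test: drop duplicates, keep first-seen order."""
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def test_dedupe_properties(trials=500):
    """Property test: random inputs, invariant assertions."""
    rng = random.Random(0)  # seeded so a failure is reproducible
    for _ in range(trials):
        xs = [rng.randint(-5, 5) for _ in range(rng.randint(0, 20))]
        out = dedupe_keep_order(xs)
        assert len(out) == len(set(out))          # no duplicates survive
        assert set(out) == set(xs)                # no elements lost or invented
        first_seen = sorted(set(xs), key=xs.index)
        assert out == first_seen                  # first-occurrence order kept

test_dedupe_properties()
```

Note what the invariants buy you: none of them names a specific input, so the test catches the edge case (empty list, all-duplicates, negative values) you would never have written by hand.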
WARNING
A codebase where every PR adds a test but bugs keep shipping has a test-strategy problem, not a test-coverage problem. Doubling down on the wrong test type is the engineering equivalent of looking for your keys under the streetlight because the light is better there.
Go deeper: Kent Beck's Test-Driven Development; John Hughes's "Testing the Hard Stuff and Staying Sane" (property testing); one week replacing a flaky e2e suite with contract tests and measuring the trust delta.
Station 2 — Version control habits
Good history is documentation you got for free. Bad history is a debugging cost you pay forever. Three habits:
| Habit | Artifact | Payoff |
| --- | --- | --- |
| Atomic commits | small, self-contained diffs | bisect finds bugs in minutes |
| Intent-revealing messages | "why", not "what" | future-you understands past-you |
| Rebase local, merge shared | linear history, real merges | readable log, clean reverts |
A good commit answers one question, makes one logical change, and passes tests on its own. "Refactor + fix bug + rename variable" is three commits.
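What an intent-revealing message looks like in practice, following Tim Pope's conventions (imperative subject under 50 characters, blank line, body that explains why, not what). The content here is a hypothetical example:

```
Fix double-refresh race in session renewal

Two concurrent requests could both observe an expired session and both
trigger a refresh, because the TTL check ran outside the lock. Move the
check inside the lock so only the first request refreshes.

Trade-off: slightly longer lock hold, acceptable because the check is O(1).
```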
The model you want: every commit should make the repo still work and still tell a story. That makes bisect, revert, cherry-pick, and review all cheap.
TIP
"Squash and merge" erases the story inside a branch. Prefer merge commits for multi-commit work; squash only the throwaway "WIP" and "fix typo" noise before merging.
Go deeper: Tim Pope's "A Note About Git Commit Messages" (short, mandatory); Pro Git book ch. 7; a weekend learning git rebase -i until it's boring.
Station 3 — Code review for design, not style
Every PR review is two questions in sequence: "is this the right change?" and then "is the change made right?"
The first question — is this the right change — is the one most reviewers skip. Line comments on style are easy. Saying "this shouldn't exist" or "this belongs somewhere else" is hard. Staff-level review does the hard one first.
Tone matters. A review comment is in text, asynchronous, public. Bias toward curiosity ("why this over X?") over edict ("use X"). The author always has context you don't.
CAUTION
Rubber-stamp reviews are negative value — they teach the author that review is theater. Either engage, or defer to someone who will.
Go deeper: Google's Engineering Practices docs (the "Code Review Developer Guide" is the canonical reference); pick three recent PRs and write the review you wish someone had written.
Station 4 — Design before code
For anything bigger than a bug fix, write first. The artifact has many names — RFC, ADR, design doc, tech spec — and one purpose: force the thinking to be text before it becomes code.
| Artifact | Scope | Length | Audience |
| --- | --- | --- | --- |
| ADR | one decision, right now | <1 page | future team |
| Tech spec | one feature, 1–4 weeks | 1–3 pages | reviewers, PM |
| RFC | architecture, quarter+ | 3–10 pages | cross-team, leadership |
A design doc is a forcing function for honesty. What are the tradeoffs? What did you consider and reject? What are the failure modes? What does "done" mean? If you can't answer those in text, you can't answer them in code.
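What those answers look like at the ADR scale. The headings are Nygard's format; the content is a hypothetical example:

```markdown
# ADR 012: Serve feature flags from a local cache, not the flag service

## Status
Accepted

## Context
Every request fetched flags from the flag service, so its p99 became our
p99. Considered and rejected: a shorter per-request timeout (still couples
our latency to theirs) and baking flags into the build (no kill switch).

## Decision
Cache flags in-process, refresh every 30 s, serve stale on refresh failure.

## Consequences
Flag changes take up to 30 s to propagate. Failure mode: a dead flag
service freezes flags at their last value instead of failing requests.
```

Note that half the document is the rejected alternatives and the failure mode. That is the honesty the format forces.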
The model you want: the doc is the cheap prototype. Iterating on words takes hours; iterating on code takes days. Find the tradeoffs in the cheap version.
WARNING
"We'll document it after" is a tax you cannot afford on anything architectural. "After" is when you've shipped the decision and the document is archaeology.
Go deeper: Michael Nygard's original ADR post; Eugenia Sherebay's "How to write a good design document" in Gergely Orosz's newsletter; your team's best existing design doc, read critically for what made it good.
Station 5 — Observability is a product feature
A system you can't see inside is a system you can't change safely. Three pillars — logs, metrics, and traces — see the Cloud & Infrastructure handbook for what each is for. The craft angle is deciding when to add which.
Two habits worth internalizing:
- Define an SLI per feature, an SLO per service. SLI (Service Level Indicator) = the numeric thing users care about (p99 latency, error rate). SLO (Service Level Objective) = the target. Error budget = 1 − SLO. When the budget is spent, you slow down launches.
- Trace first, log as a fallback. A trace tells you who called whom and where time went. A log is forensic, not diagnostic.
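The error-budget arithmetic is small enough to sanity-check inline. The 99.9% SLO and the 30-day window below are assumptions for illustration:

```python
# Error budget for an assumed 99.9% availability SLO over a 30-day window.
slo = 0.999
window_minutes = 30 * 24 * 60                  # 43,200 minutes in the window
budget_minutes = (1 - slo) * window_minutes    # error budget = 1 - SLO

# Burn-down: suppose 30 bad minutes have already been spent this window.
spent_minutes = 30.0
remaining_minutes = budget_minutes - spent_minutes

print(f"budget: {budget_minutes:.1f} min, remaining: {remaining_minutes:.1f} min")
```

A 99.9% SLO buys roughly 43 minutes of downtime per 30 days. The point of the budget is that it is spendable: on risky launches when there's headroom, on nothing when there isn't.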
The model you want: observability is a feature you ship with the code, not a thing you bolt on after. Adding it during debugging is always twice as expensive.
TIP
Structured logs (JSON with a stable schema) are ~free to emit and 100x easier to query than free-text logs. Just make the switch.
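A minimal structured-logging setup using only Python's stdlib. The schema here (ts, level, msg, request_id) is an assumption for illustration; what matters is that it is fixed and machine-queryable:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with a fixed, stable schema."""
    def format(self, record):
        return json.dumps({
            "ts": record.created,          # epoch seconds
            "level": record.levelname,
            "msg": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        })

logger = logging.getLogger("svc")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# `extra` attaches fields to the record; they land as first-class JSON keys.
logger.warning("cache miss", extra={"request_id": "req-42"})
```

Querying becomes `jq 'select(.request_id == "req-42")'` over the log stream instead of grep-and-guess over free text.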
Go deeper: Google SRE Book chapters 4–6; Liz Fong-Jones's talks on SLOs; the OpenTelemetry spec (skim, then wire it into a small service).
Station 6 — Automation and CI hygiene
Automation compounds. The rule: automate what gets done more than twice; skip what doesn't.
| Tier | Fast? | Blocks merge? | Examples |
| --- | --- | --- | --- |
| Pre-commit | instant | yes | format, lint, simple checks |
| Pre-push | seconds | yes | unit tests, type check |
| CI on PR | 1–5 min | yes | full tests, integration |
| CI on main | 10+ min | no | nightly, load, chaos, canary |
Keep the fast tiers fast. Every second of PR-CI latency is tax on every engineer every day. If PR CI takes 20 minutes, engineers context-switch and the day fragments.
The model you want: fast feedback is a primitive, not a nice-to-have. If CI is slow, fixing CI is the highest-leverage thing on your plate.
CAUTION
A flaky test is worse than no test — it trains the team to ignore red. First time red, you fix. Second time red for the same reason, you fix hard. Third time, it gets deleted. There is no fourth time that ends well.
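The three-strikes rule is easier to enforce when flakiness is measured rather than eyeballed. A sketch, assuming `flaky_test` stands in for invoking one real test in isolation via your actual runner:

```python
import random

def flake_rate(run_test, runs=200):
    """Rerun a single test in isolation and report its failure rate."""
    failures = sum(1 for _ in range(runs) if not run_test())
    return failures / runs

# Stand-in for a real test: passes ~90% of the time.
_rng = random.Random(7)  # seeded so the measurement is reproducible here
def flaky_test():
    return _rng.random() > 0.10

rate = flake_rate(flaky_test)
print(f"flake rate: {rate:.1%}")
```

Any rate above zero means the test cannot gate merges: at a 10% flake rate, a 10-test PR pipeline goes red for no reason most of the time.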
Go deeper: Martin Fowler's "Continuous Integration" article (the original, still right); Google's "Flaky Tests at Google" paper; a half-day parallelizing your own test suite and measuring wall-clock.
Station 7 — Incident response and postmortems
Every production system has incidents. The craft is in what you do with them.
Two non-negotiables:
- Mitigate before diagnose. Restore service first. Understand second. A rollback is almost always the right first action.
- Blameless postmortems. The postmortem asks "what made the system vulnerable to this failure," not "who screwed up." Humans are a symptom, systems are the diagnosis.
The "five whys" is the classic tool; "five hows" (Allspaw) is a better one because "why" smuggles in blame and "how" does not.
The model you want: every incident is a free lesson about how your system really works. Tax the lesson for all it's worth; the next incident gets cheaper.
WARNING
A team that runs postmortems but never does the action items is running postmortem theater. Track action items in the same backlog as features, with owners and due dates.
Go deeper: John Allspaw's "The Infinite How"; Google SRE Workbook ch. 10 (postmortem culture); read three of your team's past postmortems and grade them on "did we actually learn something?"
How the stations connect
The stations form a loop, not a pyramid. Design doc → tests-as-you-build → review for design → ship with observability → automate the repeatable → incident teaches → design doc for the fix.
A team that does all seven stations decently is unstoppable over years. A team that does any two of them brilliantly but skips the rest burns out.
Standards & Specs
Engineering craft lives in industry standards and canonical books more than in a single IETF spec. The authoritative references:
- Google SRE Book and SRE Workbook — the canonical reference for SLIs, SLOs, error budgets, on-call, and postmortem culture. Free to read.
- ISO/IEC/IEEE 29119 — Software testing — the formal test-strategy standard; most teams reinvent a subset of this and call it "our test plan."
- ISO/IEC 25010 — Software quality model — defines the quality characteristics (reliability, maintainability, security, portability, etc.) that every test strategy implicitly ranks.
- Agile Manifesto (2001) — short, still correct; reading it shows how much was lost in the ceremony that followed.
- IEEE 1044 — Standard Classification for Software Anomalies — the canonical taxonomy of defects; useful for incident classification.
- NIST SP 800-53 — Security and privacy controls — most review-for-design checklists eventually converge on a subset of these controls.
- OpenTelemetry spec — cross-vendor telemetry; the format your observability should emit in.
- Books — Nygard, Release It!; Hunt & Thomas, The Pragmatic Programmer; Fowler, Refactoring; Martin, Clean Code (read critically); Feathers, Working Effectively with Legacy Code; Allspaw & Robbins, Web Operations.
- Papers and essays — Allspaw, "The Infinite How"; Nygard, "Documenting Architecture Decisions" (ADRs); Fowler, "Continuous Integration"; Hughes, "Testing the Hard Stuff and Staying Sane" (property testing).
Test yourself
A team ships fast, reviews PRs in minutes, has 90% test coverage, and keeps having week-long outages once a quarter. Which station are they under-investing in?
Most likely Station 7 (incident response) or Station 5 (observability). Fast shipping + good review + high coverage is the senior-engineer stack. Outages that keep recurring mean the postmortems aren't producing durable action items — or the system is too opaque to diagnose inside an SLO budget. Look at the last three postmortems; if the action items are vague or uncompleted, that's the station.
A new engineer's PRs get rubber-stamped by reviewers who add only style comments. What practice is missing and where does it live?
Station 3 — reviewing for design, not style. The first reviewer question ("is this the right change?") is being skipped. Fix: set an explicit review checklist where the first item is "does this change belong?" and senior engineers model the behavior.
You inherit a service with full-text logs, no metrics, no traces. An incident happens. Predict the debugging experience and the first thing to add.
Debugging will be "grep and guess." You'll find the error eventually but won't know how common it is, whether it's degrading over time, or which upstream call it came from. First thing to add: structured logs with request IDs and a single metric (error rate). That unlocks the rest. Traces are the compounding win but also the larger investment.