Incident Post-Mortems Without the Ceremony


Big tech loves incident post-mortems. They come with slide decks, timelines, action items, and a dozen people pointing fingers.
For a one-person ops shop, that’s overkill. But you still need to learn from failure.

The Goal: Clarity, Not Blame

A post-mortem isn’t about self-flagellation. It’s a tool:

  • What went wrong?
  • Why did it happen?
  • How do I stop it happening again?

Skip the blame; focus on system design.

Write It Down While It Hurts

Memory is unreliable. As soon as you recover from the outage:

  • Capture the timeline: what happened and when
  • Note what you tried and what actually worked
  • Include any lucky breaks so you don’t rely on them next time

A few lines in a markdown file beat a perfect report written weeks later.
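
If even that feels like friction, a tiny helper can stamp out the skeleton for you. This is only a sketch: the incidents/ directory, the file naming, and the headings below are assumptions, not a prescribed format.

```python
#!/usr/bin/env python3
"""Stamp out a post-mortem skeleton while the outage is still fresh."""
from datetime import date
from pathlib import Path
import sys

# Hypothetical skeleton: the sections mirror the checklist above.
SKELETON = """# Incident: {slug} ({stamp})

## Timeline
- HH:MM  what happened

## What I tried / what actually worked
-

## Lucky breaks
-

## Root cause
-

## Action items
- [ ]
"""

def new_postmortem(slug: str, root: Path = Path("incidents")) -> Path:
    stamp = date.today().strftime("%Y-%m")          # e.g. 2025-05
    root.mkdir(exist_ok=True)
    path = root / f"incident-{stamp}-{slug}.md"     # e.g. incident-2025-05-db-lock.md
    path.write_text(SKELETON.format(slug=slug, stamp=stamp))
    return path

if __name__ == "__main__":
    # e.g. python new_postmortem.py db-lock
    print(new_postmortem(sys.argv[1] if len(sys.argv) > 1 else "unnamed"))
```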

Identify Root Cause, Not Just Symptom

The database locked up. Why?

  • Backup job and cron clash?
  • Disk filled silently?
  • Missing alert that should have caught it?

Ask “why” until you hit the design flaw, not just the immediate trigger.

Cheap Action Items

No corporate boardroom needed. Pick one or two things to change:

  • Add a log or alert you wish you’d had (sketched below)
  • Adjust a cron schedule
  • Write down the recovery steps so next time you aren’t improvising

Small, cheap improvements compound over time.
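
To make “the alert you wish you’d had” concrete, here is a minimal disk-space check, sketched in Python. The threshold and mount points are placeholders, and the “notification” is just a line on stderr plus a non-zero exit code for cron to pick up; swap in whatever actually pages you.

```python
"""Minimal sketch of a disk-space alert."""
import shutil
import sys

THRESHOLD_PCT = 90   # complain when a filesystem is fuller than this
MOUNTS = ["/"]       # add e.g. the volume your database lives on

def check(mounts=MOUNTS, threshold=THRESHOLD_PCT) -> int:
    exit_code = 0
    for mount in mounts:
        usage = shutil.disk_usage(mount)
        pct = usage.used / usage.total * 100
        if pct >= threshold:
            print(f"ALERT: {mount} is {pct:.0f}% full", file=sys.stderr)
            exit_code = 1
    return exit_code

if __name__ == "__main__":
    # run it from cron every half hour or so and let cron mail you the output
    sys.exit(check())
```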

Archive and Review

Keep post-mortems in your repo or docs:

  • Name them clearly (incident-2025-05-db-lock.md)
  • Review them occasionally to spot patterns (see the sketch below)
  • Fix recurring design flaws before they hurt you again

Your future self is your SRE team. Give them the gift of hindsight.
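
To take some of the chore out of that review, a few lines of Python can surface repeat offenders. This assumes the incident-YYYY-MM-topic.md naming above; adjust the parsing to whatever convention you actually use.

```python
"""Count recurring topics across archived post-mortems."""
from collections import Counter
from pathlib import Path

def recurring_topics(root: Path = Path("incidents")) -> Counter:
    topics = Counter()
    for f in sorted(root.glob("incident-*.md")):
        # incident-2025-05-db-lock -> ["incident", "2025", "05", "db-lock"]
        parts = f.stem.split("-", 3)
        if len(parts) == 4:
            topics[parts[3]] += 1
    return topics

if __name__ == "__main__":
    # repeat offenders float to the top
    for topic, count in recurring_topics().most_common():
        print(f"{count:3d}  {topic}")
```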

Boring but Necessary

A post-mortem doesn’t need theatre.
It needs honesty and follow-through.

If you write down what failed and act on it, you’re already ahead of most corporate ops teams.