I got inspired to write this post about what I’ve learnt about incident debriefs after seeing this tweet from Nora Jones (who you should follow immediately if you haven’t already).
Postmortem hot-tip: You'll get more ROI on completing the postmortem if a 3rd party (i.e. someone that wasn't involved in the incident) facilitates it. It may seem counterintuitive at first and that it will add extra time onto the incident investigation...— Nora Jones (@nora_js) February 21, 2020
Something we’ve been doing at Monzo is running some debriefs with an external facilitator. I’ve found it to be transformative as both a participant and a facilitator. Most of my experience comes from being in technical debriefs, so your mileage from this post may vary outside of the technical realm.
What problem are we tackling?
Typically when writing a postmortem and running the debrief, the group of people involved will be the ones most familiar with the system or the incident. For technical incidents, this may be the engineers who led the incident or did the analysis or own that part of the system.
A typical debrief structure may look like this:
- Brief overview of the systems in question (especially if there are folks who aren’t familiar with the day-to-day technicalities)
- Go through the timeline of the incident, distilling key events, key players and the decisions made
- Review the contributing factors of the incident
- Assess the action items and make sure there’s ownership / some notion of completion
- How are we going to make sure this doesn’t happen again?
- Closing discussion or a Q&A session
The large majority of this can be done outside of the debrief. Often, the action items are already underway (or even completed, depending on the severity) before the debrief is even held. The incident timeline can be read and digested offline, and it may already be familiar to many.
For technical incidents, action items are usually some form of shipping a fix, reviewing metrics and instrumentation, and improving processes / documentation. No one comes out of an incident wanting fewer metrics and less documentation. Software fixes typically address the immediate set of bugs. This stuff is important but not the goal of a debrief.
This format validates that this specific problem won’t manifest again in its existing form, but it does not effectively identify or tackle potential concerns at the root.
Where the external facilitator comes in
The primary role of the facilitator is to identify themes from the incident and steer the discussion around those. They keep the conversation on track because they’ve done the legwork of gathering concerns. The goal is to open up discussion on the root-level problems.
It’s important for the facilitator to deeply understand the incident. For me, that typically involves:
- Poring over the entire incident report (no matter what state it’s in)
- Examining the conversation history during the incident, picking out nuances and (trying to infer) emotional states based on language and communication style
- Poring over data, typically from monitoring systems
- Looking at Pull Requests which may have been applied as mitigations
- Watching / Listening to any recorded calls that may have been started in the heat of the moment (if available)
- Speaking to everyone involved in the incident
It’s almost like doing your own investigation of the incident and replaying the steps taken and forming your own thoughts. Often, it can be an additional learning opportunity for you (especially when it comes to gnarly bugs).
A tip I’ve used from my colleague Miles Bryant is to ask a set of private questions to all individuals involved in the incident beforehand about how they felt during the incident and what they thought was important. The answers are treated with full anonymity and without judgement.
Examples of good themes that have prompted some great deep thinking and discussion:
- Day-to-day processes for knowledge sharing, addressing internal information and knowledge asymmetry
- The time pressures of delivering immediate day-to-day value vs time spent testing, writing effective documentation / runbooks, fixing issues etc.
- Ability for those involved to speak up and raise concerns and make sure they are heard and acted on
Attack the hindsight bias head on
More often than not, I’ve seen the Q&A portion of debriefs turn into a crapshoot. Hindsight bias is very much alive in this sort of setting, regardless of how blameless your processes are. The line of questioning throughout the debrief is really important: it can affect both the flow of the debrief and everyone’s willingness to be open.
It’s really easy to have the incident participants or system owners go into a state of defense (to defend their work or save face), leading to a biased debrief. I’ve done it many more times than I care to admit. This is where an external person being the debrief facilitator can really help.
Tackling the potential of hindsight bias as a facilitator right at the beginning can work wonders. I usually start debriefs with the following remark:
Hindsight bias is a real possibility in debriefs. If you have a question that starts with “why didn’t you” or “shouldn’t you have”, I urge you to re-think your line of questioning and your objective.
If you truly believed this would’ve made a material difference, you had every opportunity to chime in and give your input before / during the incident.
This remark works at Monzo because the vast majority of incidents are open and public and anyone can join and participate in real time if they have something to add or volunteer assistance. If incidents unfold behind closed doors, this may not be as effective.
It’s the role of the facilitator to arbitrate discussions straying into hindsight territory. Reframing the question encourages folks to re-think what they want to achieve. The motivation behind the question is likely genuine, but the way it was originally framed may not have led to the best discussion, or may have felt like an attack to some groups of people.
There’s still a lot to learn about running effective incident debriefs and making the most out of them. I highly recommend these two resources if you want to take this to the next level:
- The Field Guide to Understanding ‘Human Error’ by Sidney Dekker
- Learning from Incidents
Jacob Scott has curated a great list of reading around the incident lifecycle (including debriefs). If I find more resources, I’ll keep expanding this list too.