(2021-07-02) Jones Incident Analysis Your Organization’s Secret Weapon

Nora Jones on Incident Analysis: Your Organization’s Secret Weapon. The research that my team and I have done in this space has shown the following responses to the question of why incident reviews are important: “I’m honestly not sure.” ... “Management wants us to.” ... “It gives the engineer space to vent.”

Gathering useful data about incidents does not come for free. You need time and space to determine it... It can give you that ROI you’re looking for and level up your entire organization.

There’s a famous equation in a book called Seeing What Others Don’t by Gary Klein. Gary Klein is a cognitive psychologist who studies experts and expertise in organizations. This metric he came up with is performance improvement. It’s the combination of error reduction + insight generation. You can’t have one without the other. Yet we focus as an industry way too much on the error reduction piece and not on the insight generation piece. Except we’re not actually going to improve the performance of our organizations if we’re only focusing on the error reduction piece.

Netflix: Story #1

Most of the time I realized that (just) the four of us were actually the ones using the (Chaos Monkey) tooling. We were using the tooling to create chaos experiments, to run chaos experiments, to analyze the results. Which meant, what were the teams doing? Well, they were receiving our results and sometimes they were fixing them and sometimes they weren’t... We weren’t the ones whose mental models needed refining or understanding, but we were the ones getting that refinement and understanding. Which actually didn’t provide much benefit to the organization.

Here’s the secret I found. Incident analysis is not actually about the incident; it’s an opportunity to see the delta between how we think our organization works and how it actually works. Most of the time we’re not good at exposing that delta... It’s a catalyst for understanding where you actually need to improve the socio of your socio-technical system: how you’re organizing teams, how people in different time zones are working together, how many people you need on each team, how folks are dealing with their OKRs given all the technical debt that they’re working through as well.

The Blame Game: Story #2

This was an organization that thought they were practicing blamelessness. We’ve all heard about blameless postmortems, yet we all use the term a little bit incorrectly. They thought they were practicing this without a deep understanding of it, and when something like this happens, when a Kieran makes an error, it’s usually met with instituting a new rule or process within the organization without publicly saying that you thought it was Kieran’s fault. Yet everyone, including Kieran, knows that folks think that. That’s still blameful. It’s not only unproductive, it is actually hurting your organization’s ability to generate those new insights from that equation we looked at earlier, and to build expertise after incidents. And so you’re actually harming your organization’s ability to improve its performance.

Lesson: Spotting Errors versus Encouraging Insights... Adding in these new rules and procedures actually diminishes the ability to glean new insights from these incidents.

Kieran was pretty new to the organization. And we had him on call for something like this, for two separate systems in the middle of the night, and I don’t really feel like this is Kieran’s fault so much anymore. I’m starting to think that this really wasn’t human error.

You can talk to people one-on-one like I did with Kieran. We call this an interview or a casual chat. And these individual interviews, prior to the bigger incident review, can determine what someone’s understanding of the event was.... Especially with emotionally charged incidents, we should set up some one-on-one individual chats like this. If I had asked Kieran the questions in the incident review meeting myself, it probably wouldn’t have revealed all the things that he revealed to me in that one-on-one chat.

The Promotion Packet Paradox: Story #3

People were losing promotions when they hadn’t completed things at the beginning of the quarter. But I know we’ve all been at organizations where we’ve committed to something at the beginning of the quarter, but we get midway through the quarter and realize that that’s not the most important thing anymore. Yet this is what we were judging people on... So they’d rush to complete those things just before promotion packets were due. And I saw spikes in incidents around the time promotion packets were due.

A good incident analysis should tell you where to look. And I mentioned this before, we’re not trained as software engineers to analyze incidents, we’re trained in different pieces of software and distributed systems. We can figure out technically what happened, but we’re not really trained to figure out socially what happened.

I was in an organization once where every time a certain guy came into the incident channel, everyone would react with the Batman emoji in Slack. And he was amazing, but it was actually a poor thing in this organization, because we relied on him a little bit too much. Incident analysis can help you see how you’re actually supporting that. You can see how much coordination efforts are costing you during incidents. As an industry, we pay a lot of attention to the customer costs of incidents and the repercussions of the incidents. We don’t pay a lot of attention to our coordination costs.

How to Make Incident Reviews Better

It doesn’t have to be every incident. And it doesn’t have to be every incident that just caused customer impact or just hit Twitter big time... Like if there were more than two teams involved... Or if it involved a misuse of something that seemed trivial, like expired certs... If a new service or interaction between services was involved.
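The selection criteria quoted above could be written down as a small checklist. The sketch below is only an illustration: the `Incident` fields and the `review_reasons` helper are hypothetical names, not anything from the talk, and a real team would likely add or change criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    """Hypothetical record of an incident; field names are assumptions."""
    teams_involved: int
    seemingly_trivial_cause: bool = False      # e.g. an expired cert
    new_service_or_interaction: bool = False   # new service, or new interaction between services


def review_reasons(incident: Incident) -> List[str]:
    """Return the criteria from the talk that this incident matches.

    An empty list means none of the three quoted criteria applied;
    it does not mean the incident is not worth reviewing.
    """
    reasons = []
    if incident.teams_involved > 2:
        reasons.append("more than two teams involved")
    if incident.seemingly_trivial_cause:
        reasons.append("misuse of something that seemed trivial")
    if incident.new_service_or_interaction:
        reasons.append("new service or interaction between services")
    return reasons


# Example: a three-team incident caused by an expired cert.
example = Incident(teams_involved=3, seemingly_trivial_cause=True)
for reason in review_reasons(example):
    print(reason)
```

Returning the matched reasons instead of a single yes/no keeps the decision with the people involved: the list shows where to look, and the team still decides whether a deeper review is worth the time.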

So what can you do today to improve incident analysis? You can give folks more time and space to come up with better analysis... Have folks who were not involved in the incident do the incident review, because you get that unbiased perspective; you get someone who can ask Kieran those questions without Kieran feeling like they’re blaming him... And allow investigation for the big ones.

So How Do You Know It’s Working? There are more folks attending the incident reviews and more folks reading them, not because they’re being asked to, not because they’re required to, but because they want to. This is an indication that they’re actually learning something... Teams are collaborating more. You’re not seeing coordination costs as high in your incidents.

Facilitate the meeting, output the report, and then, after some soak time of a day or so, come up with action items. I promise your action items are going to be so much better if you don’t do them right away.

