(2022-02-08) Allspaw What We Talk About When We Talk About Root Cause
John Allspaw: What we talk about when we talk about ‘root cause.’ In recent years, the understanding that failure in complex systems requires multiple contributors coming together to produce these surprising events that we call incidents has gained traction.
While this perspective isn’t yet considered a “mainstream” view, I suspect it aligns with what all experienced software engineers intuitively understand.
In his seminal paper How Complex Systems Fail (Cook, 1998), my colleague Dr. Richard Cook put it this way:
3) Catastrophe requires multiple failures—single point failures are not enough. The array of defenses works. System operations are generally successful. Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident.
Another description of this perspective was made by Ryan Kitchens at SRECon Americas in 2019:
“There is no root cause. The problem with this term isn't just that it's singular or that the word root is misleading: there's more. Trying to find causes at all is problematic...looking for causes to explain an incident limits what you'll find and learn.” (incident analysis)
“...breaking down incidents into their multiple contributing factors, we're able to see that the things that led to an incident are either always or transiently present. An incident is just the first time they combined into a perfect storm of normal things that went wrong at the same time.”
I’d like to explore in this article what seems to keep people using the term ‘root cause’ despite the growing skepticism of its value.
Research literature on this topic reveals that in descriptions of accidents and incidents, use of the term ‘root cause’ (or even multiple ‘root’ causes) serves social purposes more than technical ones. (social fiction)
Labeling something as a ‘root cause’ helps people cope with the (sometimes implicit) anxiety that comes along with the experience of incidents.
Incidents have a way of producing genuine and unsettling dismay.
This brings an immediate desire to identify what “caused” an event, so we can then do something (which typically means fixing something) in order to regain a sense of being in control.
Labeling something as a ‘root cause’ reflects a cherry-picked perspective; it highlights one aspect of a complex event and discounts others.
Quite often, we find this usage more to reflect a thing better conveyed as a trigger, rather than a cause. The term ‘trigger’ tends to do a better job of describing a specific dynamic that “activates” already existing conditions, some of which might have been latent in the code or architecture’s arrangement for some time.
What agenda(s) might the author (or speaker) have in their version of the story, other than providing the richest description they can? (narrative)
What details seem to be noticeably absent in the story you’re being told?
What questions can you imagine being dismissed or discounted by the storyteller, if you had the chance to ask them?