Five Whys Outside of Incident Post-Mortems

Running a root cause analysis, such as Five Whys, after a production outage is good practice.

If you’re unfamiliar with Five Whys, here’s an example from Wikipedia:

An example of a problem is: the vehicle will not start.

Why? – The battery is dead.

Why? – The alternator is not functioning.

Why? – The alternator belt has broken.

Why? – The alternator belt was well beyond its useful service life and not replaced.

Why? – The vehicle was not maintained according to the recommended service schedule. (A root cause)
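If it helps to see the shape of the technique, the chain above can be written down as a small data structure. This is only an illustrative sketch: the `FiveWhys` class and its `why`, `root_cause`, and `render` methods are hypothetical names invented here, not part of any tool or library, and treating the last answer as the root cause is a simplifying assumption.

```python
from dataclasses import dataclass, field


@dataclass
class FiveWhys:
    """A problem statement plus the ordered answers to each successive "Why?".

    Hypothetical helper for illustration only.
    """

    problem: str
    answers: list[str] = field(default_factory=list)

    def why(self, answer: str) -> "FiveWhys":
        # Each answer becomes the statement the next "Why?" is asked about.
        self.answers.append(answer)
        return self

    def root_cause(self) -> str:
        # Assumption: the last recorded answer is treated as a root cause.
        if not self.answers:
            raise ValueError("no answers recorded yet")
        return self.answers[-1]

    def render(self) -> str:
        # Produce the same layout as the written example.
        lines = [f"Problem: {self.problem}"]
        lines += [f"Why? – {answer}" for answer in self.answers]
        return "\n".join(lines)


analysis = (
    FiveWhys("The vehicle will not start.")
    .why("The battery is dead.")
    .why("The alternator is not functioning.")
    .why("The alternator belt has broken.")
    .why("The alternator belt was well beyond its useful service life and not replaced.")
    .why("The vehicle was not maintained according to the recommended service schedule.")
)

print(analysis.render())
print("Root cause:", analysis.root_cause())
```

Nothing about the structure is specific to outages: the `problem` can be any observation, which is the point of the rest of this post.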

Engineers at tech companies are familiar with this process in incident post-mortems.

However, most engineers don’t reach for root cause analysis outside of incidents.

I’ve found that using Five Whys in everyday engineering work helps me uncover underlying architectural and infrastructure problems before they cause an incident.

