Nivi · November 17th, 2008
“For every dollar spent in failure, learn a dollar’s worth of lesson.”
Summary: Get started with five whys by applying it to a specific team with a specific problem. Select a five whys master to conduct a post mortem with everyone who was involved in the problem. Email the results of the analysis to the whole company. Repeatedly applying five whys at IMVU created a startup immune system that let our developers go faster by reducing mistakes.
In Part 1, I described how to use five whys to discover the root cause of problems, make corrections, and build an immune system for your startup. So…
How do you get started with five whys?
I recommend that you start with a specific team and a specific class of problems. For my first time, it was scalability problems and our operations team. But you can start almost anywhere—I’ve run this process for many different teams.
Start by having a single person be the five whys master. This person will run the post mortem whenever anyone on the team identifies a problem.
But don’t let the five whys master do it alone; it’s important to get everyone who was involved with the problem (including those who diagnosed or debugged it) into a room together. Have the five whys master lead the discussion, and give him or her the power to assign responsibility for the solution to anyone in the room.
Distribute the results of five whys to the whole company.
Once that responsibility has been assigned, have that new person email the whole company with the results of the analysis. This last step is difficult, but I think it’s very helpful. Five whys should read like plain English. If they don’t, you’re probably obfuscating the real problem.
The advantage of sharing this information widely is that it gives everyone insight not only into the kinds of problems the team is facing, but also into how those problems are being tackled. If the analysis is airtight, it makes it pretty easy for everyone to understand why the team is taking some time out to invest in problem prevention instead of new features.
If, on the other hand, it ignites a firestorm—that’s good news too. Now you know you have a problem: either the analysis is not airtight, and you need to do it over again, or your company doesn’t understand why what you’re doing is important. Figure out which of these situations you’re in, and fix it.
What happens when you apply five whys for months and years?
Here’s my experience of what happens over time.
People get used to the rhythm of five whys, and it becomes completely normal to make incremental investments. Most of the time, you invest in things that otherwise would have taken tons of meetings to decide to do.
You’ll start to see people from all over the company chime in with interesting suggestions for how you could make things better. Now, everyone is learning together—about your product, process, and team. Each five whys email is a teaching document.
IMVU’s immune system after years of five whys.
Let me show you what this looked like after a few years of practicing five whys in the operations and engineering teams at IMVU. We had made so many improvements to our deployment tools and processes that it was pretty hard to take the site down. We had five strong levels of defense:
- Each engineer had his/her own sandbox which mimicked production as closely as possible (whenever it diverged, we’d inevitably find out in a five whys shortly thereafter).
- We had a comprehensive set of unit, acceptance, functional, and performance tests, and practiced TDD across the whole team. Our engineers built a series of test tags, so you could quickly run a subset of tests in your sandbox that you thought were relevant to your current project or feature.
- 100% of those tests ran, via a continuous integration cluster, after every checkin. When a test failed, it would prevent that revision from being deployed.
- When someone wanted to do a deployment, we had a completely automated system that we called the cluster immune system. This would deploy the change incrementally, one machine at a time. That process would continually monitor the health of those machines, as well as the cluster as a whole, to see if the change was causing problems. If it didn’t like what was going on, it would reject the change, do a fast revert, and lock deployments until someone investigated what went wrong.
- We had a comprehensive set of Nagios alerts that would trigger a pager in operations if anything went wrong. Because five whys kept turning up a few key metrics that were hard to set static thresholds for, we even had a dynamic prediction algorithm that would make forecasts based on past data, and fire alerts if the metric ever went out of its normal bounds. (You can read a cool paper one of our engineers wrote on this approach.)
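The fourth defense, the cluster immune system, can be sketched in a few lines. This is a hypothetical outline of the idea rather than IMVU's actual code; `deploy_to`, `is_healthy`, and `revert` are assumed callbacks supplied by the caller:

```python
# Hypothetical sketch of a "cluster immune system" style deploy loop.
# deploy_to, is_healthy, and revert are illustrative callbacks, not a real API.

def rolling_deploy(machines, deploy_to, is_healthy, revert):
    """Deploy a change one machine at a time, reverting everything on trouble.

    Returns (success, deployed), where `deployed` lists the machines that
    received the change before any revert happened.
    """
    deployed = []
    for machine in machines:
        deploy_to(machine)
        deployed.append(machine)
        # Check every machine touched so far, as a crude proxy for
        # monitoring both the new machine and the cluster as a whole.
        if not all(is_healthy(m) for m in deployed):
            for m in reversed(deployed):
                revert(m)  # fast revert, newest change first
            return False, deployed  # caller should lock deploys and investigate
    return True, deployed
```

In a real system you would also lock further deployments after a revert and alert whoever owns the change; the sketch just returns a failure flag so a caller can do that.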
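The dynamic alerting in the last bullet can also be sketched. As a minimal stand-in for the real forecasting algorithm described in the paper, assume the "forecast" is just a rolling mean and the alert bound is a few standard deviations around it:

```python
from collections import deque
import math

def make_dynamic_alert(window=60, sigmas=3.0):
    """Return a checker that ingests metric samples and flags anomalies.

    A toy version of "forecast from past data, alert when out of bounds":
    the forecast here is the rolling mean of the last `window` samples,
    with an alert band of `sigmas` standard deviations around it.
    """
    history = deque(maxlen=window)

    def check(value):
        alert = False
        if len(history) >= 10:  # wait for some history before judging
            mean = sum(history) / len(history)
            var = sum((x - mean) ** 2 for x in history) / len(history)
            band = sigmas * math.sqrt(var)
            alert = abs(value - mean) > band
        history.append(value)
        return alert

    return check
```

The point of the design is that the threshold moves with the metric: a value that is normal at peak traffic can still fire an alert at 4am, with no static threshold to hand-tune.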
A strong immune system lets you go faster by reducing mistakes.
So if you had been able to sneak over to the desk of one of our engineers, log into their machine, and secretly check in an infinite loop on some highly-trafficked page, then somewhere between 10 and 20 minutes later you would have received an email with a message more-or-less like this:
“Dear so-and-so, thank you so much for attempting to check in revision 1234. Unfortunately, that is a terrible idea, and your change has been reverted. We’ve also alerted the whole team to what’s happened, and look forward to you figuring out what went wrong. Best of luck, Your Software.”
OK, that’s not exactly what it said. But you get the idea.
Having this series of defenses was helpful for doing five whys. If a bad change got to production, we’d have a built-in set of questions to ask: Why didn’t the automated tests catch it? Why didn’t the cluster immune system reject it? Why didn’t operations get paged? And so forth.
And each and every time, we’d make a few more improvements to each layer of defense. Eventually, this let us do deployments to production dozens of times every day, without significant downtime or bug regressions.
In Part 3, I’ll show you how to apply five whys to “legacy” startups.