Nivi · November 14th, 2008
“When confronted with a problem, have you ever stopped and asked why five times?”
Summary: Whenever you find a defect, ask why five times to discover the root cause of the problem. Then make corrections at every level of the analysis. By applying five whys whenever you find a defect, you will (1) uncover the human problems beneath technical problems and (2) build an immune system for your startup.
Taiichi Ohno was one of the inventors of the Toyota Production System. His book, Toyota Production System, is a fascinating read, even though it’s decidedly non-practical. After reading it, you might not even realize that there are cars involved in Toyota’s business. Yet there is one specific technique that I learned most clearly from this book: asking why five times. I believe this is a critical lean startup technique.
When something goes wrong, we tend to see it as a crisis and seek to blame. A better way is to see it as a learning opportunity. Not in the existential sense of general self-improvement. Instead, we can use the technique of asking why five times to get to the root cause of the problem and make corrections.
Ask why five times whenever you discover a defect.
Here’s how it works. Let’s say you notice that your website is down. Obviously, your first priority is to get it back up. But as soon as the crisis is past, have the discipline to conduct a post-mortem in which you start asking why:
- Why was the website down? The CPU utilization on all our front-end servers went to 100%.
- Why did the CPU usage spike? A new bit of code contained an infinite loop!
- Why did that code get written? So-and-so made a mistake.
- Why did his mistake get checked in? He didn’t write a unit test for the feature.
- Why didn’t he write a unit test? He’s a new employee, and he was not properly trained in Test Driven Development (TDD).
Make five corrections.
So far, this isn’t very different from the kind of analysis any competent operations team would conduct for a site outage. The next step is this: you have to commit to making a proportional investment in corrective action at every level of the analysis. So, in the example above, we’d have to take five corrective actions:
- Bring the site back up.
- Remove the bad code.
- Help so-and-so understand why his code doesn’t work as written.
- Train so-and-so in the principles of TDD.
- Change the new engineer orientation to include TDD.
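If it helps to keep the analysis honest, the pairing of each why with its correction can be written down as data. What follows is only a minimal sketch in Python, not a tool described in this post: the `Why` class, the `OUTAGE` list, and the `open_corrections` helper are hypothetical names invented for illustration, and the wording of each entry is paraphrased from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Why:
    """One level of a five-whys analysis, paired with its correction (hypothetical type)."""
    question: str
    answer: str
    correction: str


# The site-outage example from the text, one correction per level of the analysis.
OUTAGE = [
    Why("Why was the website down?",
        "CPU utilization on all front-end servers went to 100%.",
        "Bring the site back up."),
    Why("Why did the CPU usage spike?",
        "A new bit of code contained an infinite loop.",
        "Remove the bad code."),
    Why("Why did that code get written?",
        "So-and-so made a mistake.",
        "Help so-and-so understand why the code doesn't work as written."),
    Why("Why did the mistake get checked in?",
        "No unit test was written for the feature.",
        "Train so-and-so in the principles of TDD."),
    Why("Why wasn't a unit test written?",
        "The new employee was not properly trained in TDD.",
        "Change the new engineer orientation to include TDD."),
]


def open_corrections(analysis, completed):
    """Return the corrections that have not been completed yet, in analysis order."""
    return [why.correction for why in analysis if why.correction not in completed]


if __name__ == "__main__":
    # Assume the first two corrections were made during the outage itself.
    done = {"Bring the site back up.", "Remove the bad code."}
    for correction in open_corrections(OUTAGE, done):
        print("TODO:", correction)
```

The point of the structure is that no level of the analysis can be recorded without a correction next to it, so the deeper, human-level fixes stay visible after the site is back up.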
Making corrections builds your startup immune system.
I have come to believe that this technique should be used for all kinds of defects, not just site outages. Each time, we use the defect as an opportunity to find out what’s wrong with our process, and make a small adjustment.
By continuously adjusting, we eventually build up a robust series of defenses that prevent problems from happening. This approach is at the heart of breaking down the “time/quality/cost, pick two” paradox, because these small investments cause the team to go faster over time.
Five whys uncovers the human problems beneath technology problems.
In the example above, what started as a technical problem actually turned out to be a human and process problem. This is completely typical. Our bias as technologists is to focus on the product part of the problem, and five whys tends to counteract that tendency.
It’s why, at my previous job, we were able to get a new engineer completely productive on their first day. We had a great on-boarding process, complete with a mentoring program and a syllabus of key ideas to be covered. Most engineers would ship code to production on their first day.
Make your corrections proportional to the cost of the defect.
We didn’t start with a great program like that, nor did we spend a lot of time all at once investing in it. Instead, five whys kept leading to problems caused by an improperly trained new employee, and we’d make a small adjustment. Before we knew it, we stopped having those kinds of problems altogether.
So it’s important to remember the proportional investment part of the rule above. It’s easy to decide that when something goes wrong, a complete ground-up rewrite is needed. It’s part of our tendency to focus on the technical and to overreact to problems.
If you have a severe problem, like a site outage, that costs your company tons of money or causes lots of person-hours of debugging, go ahead and allocate about that same number of person-hours or dollars to the solution.
The budget for corrections should be, in total, proportional to the cost of the defect that triggered the five whys. So, if the site was down and five people burned a whole day on it, maybe five person-days of fixing is appropriate. But if the problem cost three customers 25 cents each, maybe only a few hours is appropriate.
But always have a maximum, and always have a minimum. For small problems, just move the ball forward a little bit. Don’t over-invest. If the problem recurs, five whys will give you a little more budget to move the ball forward some more, so you can keep your cool.
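The budgeting rule is simple arithmetic: spend roughly what the defect cost, but never less than some minimum or more than some maximum. Here is a rough sketch in Python. The function name `correction_budget`, the one-hour floor, the 80-hour cap, and the eight-hour working day are all assumptions made for illustration; the rule itself only says that a minimum and a maximum should exist.

```python
HOURS_PER_DAY = 8  # assumption: one person-day is eight person-hours


def correction_budget(defect_cost_hours, floor_hours=1.0, cap_hours=80.0):
    """Return a correction budget in person-hours.

    The budget matches the cost of the defect, clamped between a floor
    (always do a little) and a cap (never over-invest). Both defaults
    are illustrative assumptions, not values from the text.
    """
    if defect_cost_hours < 0:
        raise ValueError("defect cost cannot be negative")
    return max(floor_hours, min(defect_cost_hours, cap_hours))


if __name__ == "__main__":
    # Five people burned a whole day on the outage.
    outage_cost = 5 * HOURS_PER_DAY
    print("Outage budget (person-hours):", correction_budget(outage_cost))

    # A tiny defect still gets the minimum investment.
    print("Small defect budget (person-hours):", correction_budget(0.25))
```

If the same small problem comes back, running the rule again adds another small budget on top of the last one, which is how the investment grows only for problems that actually recur.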
In Part 2, I’ll describe how to get started with five whys.