“For every dollar spent in failure, learn a dollar’s worth of lesson.”

Jesse Robbins, Amazon’s former Master of Disaster

Summary: Get started with five whys by applying it to a specific team with a specific problem. Select a five whys master to conduct a post mortem with everyone who was involved in the problem. Email the results of the analysis to the whole company. Repeatedly applying five whys at IMVU created a startup immune system that let our developers go faster by reducing mistakes.

This is a guest post by Eric Ries, a founder of IMVU and an advisor to Kleiner Perkins. Eric also has a great blog called Startup Lessons Learned.

In Part 1, I described how to use five whys to discover the root cause of problems, make corrections, and build an immune system for your startup. So…

How do you get started with five whys?

I recommend that you start with a specific team and a specific class of problems. For my first time, it was scalability problems and our operations team. But you can start almost anywhere—I’ve run this process for many different teams.

Start by having a single person be the five whys master. This person will run the post mortem whenever anyone on the team identifies a problem.

But don’t let the five whys master do it alone; it’s important to get everyone who was involved with the problem (including those who diagnosed or debugged it) into a room together. Have the five whys master lead the discussion, and give that person the power to assign responsibility for the solution to anyone in the room.

Distribute the results of five whys to the whole company.

Once that responsibility has been assigned, have the person who received it email the whole company with the results of the analysis. This last step is difficult, but I think it’s very helpful. A five whys analysis should read like plain English. If it doesn’t, you’re probably obfuscating the real problem.

The advantage of sharing this information widely is that it gives everyone insight not only into the kinds of problems the team is facing, but also into how those problems are being tackled. If the analysis is airtight, it makes it easy for everyone to understand why the team is taking some time out to invest in problem prevention instead of new features.

If, on the other hand, it ignites a firestorm—that’s good news too. Now you know you have a problem: either the analysis is not airtight, and you need to do it over again, or your company doesn’t understand why what you’re doing is important. Figure out which of these situations you’re in, and fix it.

What happens when you apply five whys for months and years?

In my experience, here’s what happens over time.

People get used to the rhythm of five whys, and it becomes completely normal to make incremental investments. Most of the time, you invest in things that otherwise would have taken tons of meetings to decide to do.

You’ll start to see people from all over the company chime in with interesting suggestions for how you could make things better. Now, everyone is learning together—about your product, process, and team. Each five whys email is a teaching document.

IMVU’s immune system after years of five whys.

Let me show you what this looked like after a few years of practicing five whys in the operations and engineering teams at IMVU. We had made so many improvements to our tools and processes for deployment that it was pretty hard to take the site down. We had five strong levels of defense:

  1. Each engineer had his or her own sandbox, which mimicked production as closely as possible (whenever it diverged, we’d inevitably find out in a five whys shortly thereafter).
  2. We had a comprehensive set of unit, acceptance, functional, and performance tests, and practiced TDD across the whole team. Our engineers built a series of test tags, so you could quickly run a subset of tests in your sandbox that you thought were relevant to your current project or feature.
  3. 100% of those tests ran, via a continuous integration cluster, after every checkin. When a test failed, it would prevent that revision from being deployed.
  4. When someone wanted to do a deployment, we had a completely automated system that we called the cluster immune system. This would deploy the change incrementally, one machine at a time. That process would continually monitor the health of those machines, as well as the cluster as a whole, to see if the change was causing problems. If it didn’t like what was going on, it would reject the change, do a fast revert, and lock deployments until someone investigated what went wrong.
  5. We had a comprehensive set of Nagios alerts that would trigger a pager in operations if anything went wrong. Because five whys kept turning up a few key metrics that were hard to set static thresholds for, we even had a dynamic prediction algorithm that would make forecasts based on past data, and fire alerts if the metric ever went out of its normal bounds. (You can read a cool paper one of our engineers wrote on this approach.)
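To make the idea of a dynamic threshold concrete, here is a minimal sketch. It is not IMVU's algorithm, which this post does not describe; it swaps in a much simpler technique, a band of `k` standard deviations around the mean of past samples. The function name `out_of_bounds`, the parameter `k`, and the sample numbers are all assumptions made up for illustration.

```python
from statistics import mean, stdev


def out_of_bounds(history, latest, k=3.0):
    """Return True if `latest` lies outside mean +/- k standard deviations of `history`.

    Hypothetical helper: a simple stand-in for a real forecasting model.
    """
    if len(history) < 2:
        return False  # not enough past data to estimate a normal range
    center = mean(history)
    spread = stdev(history)
    return abs(latest - center) > k * spread


# Made-up logins-per-minute samples from the same time slot on previous days.
past_logins = [118, 124, 121, 130, 115, 127, 122]

print(out_of_bounds(past_logins, 125))  # close to the usual level
print(out_of_bounds(past_logins, 40))   # far below the usual level
```

The point of the design is that nobody has to pick a fixed number such as "alert below 100 logins per minute"; the bounds move with whatever the metric normally does.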

A strong immune system lets you go faster by reducing mistakes.

So if you had been able to sneak over to the desk of any of our engineers, log into their machine, and secretly check in an infinite loop on some highly trafficked page, then somewhere between 10 and 20 minutes later they would have received an email with a message more or less like this:

“Dear so-and-so, thank you so much for attempting to check in revision 1234. Unfortunately, that is a terrible idea, and your change has been reverted. We’ve also alerted the whole team to what’s happened, and look forward to you figuring out what went wrong. Best of luck, Your Software.”

OK, that’s not exactly what it said. But you get the idea.
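The behavior behind that email (deploy one machine at a time, watch health, then reject, revert, and lock) can be sketched roughly as follows. This is a toy model under stated assumptions, not IMVU's actual system: the `Cluster` class, the `rollout` function, the `is_healthy` callback, and the machine and revision names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for a production cluster; every name here is hypothetical."""
    machines: dict            # machine name -> revision currently deployed
    locked: bool = False      # True once a rollout has been rejected
    log: list = field(default_factory=list)


def rollout(cluster, revision, is_healthy):
    """Deploy `revision` one machine at a time, reverting on the first bad health check.

    `is_healthy(cluster)` is an assumed callback that looks at the machines and
    the cluster as a whole and returns False if the change seems harmful.
    """
    if cluster.locked:
        raise RuntimeError("deployments are locked until someone investigates")

    previous = dict(cluster.machines)  # snapshot used for the fast revert
    for name in list(cluster.machines):
        cluster.machines[name] = revision
        cluster.log.append(f"deployed {revision} to {name}")
        if not is_healthy(cluster):
            cluster.machines.update(previous)  # put every machine back
            cluster.locked = True              # block further deployments
            cluster.log.append(f"rejected {revision}, reverted, locked")
            return False
    return True


cluster = Cluster(machines={"web1": "r1233", "web2": "r1233", "web3": "r1233"})


def no_bad_revision(c):
    # Pretend that any machine running r1234 makes the cluster unhealthy.
    return "r1234" not in c.machines.values()


accepted = rollout(cluster, "r1234", no_bad_revision)
print(accepted, cluster.locked)
print(cluster.log)
```

In this sketch the bad revision reaches only the first machine before it is rolled back, and any later call to `rollout` raises an error until someone clears the lock.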

Having this series of defenses was helpful for doing five whys. If a bad change got to production, we’d have a built-in set of questions to ask: Why didn’t the automated tests catch it? Why didn’t the cluster immune system reject it? Why didn’t operations get paged? And so forth.

And each and every time, we’d make a few more improvements to each layer of defense. Eventually, this let us do deployments to production dozens of times every day, without significant downtime or bug regressions.

In Part 3, I’ll show you how to apply five whys to “legacy” startups.

4 comments

  • ckstevenson

    Loving the series, can’t wait for more.

    My questions are for services companies: Are there ways that this concept can/should be modified for that type of company? Have you found methods of sharing the results most effectively for geographically dispersed companies? Do you use a wiki or any other tool for saving/sharing results and “lessons-learned”? I have found most companies/organizations have language somewhere about learning from mistakes via “lessons-learned” but rarely if ever do so, and almost never have a sound method for this. At best a Word document gets saved somewhere; there might even be a clunky searchable online database that has either too few or too many items to be of use.

    Thanks.

  • Eric Ries

    So glad you’re finding the series helpful. My experience has been that no amount of archiving, searching, or sorting works for this kind of thing. You have to find a way to get the learning in front of everyone in the company in a format that they will understand.

    So if you constantly send out automated reports, people quickly start auto-ignoring them. That’s no good.

    If your “five whys” analysis is full of technical jargon or is written in non-plain language, most people will assume it’s not written for them, and just ignore it. That doesn’t work either.

    So my conclusion is that the only thing that works is to have a real live human being write an email, by hand, in plain English, and send it out to the company-wide mailing list. The same list that you would use for a major company announcement.

    Now, people might still ignore that email. But odds are they won’t ignore every single one. And more importantly, you’re signaling to the whole company that deep learning, continuous improvement, and root cause analysis are important.

    Helpful?

  • Speed and Growth

    [...] Building in tools and methods into the process that let founders and team members observe, orient, decide and act faster is key. The more I think about the need to build metrics into applications and marketing solutions. I want to be able to understand the performance of my design decisions relative to previous experience/performance, new information, competitive analysis and traditions. This is the Orient part of the OODA loop but it requires the codification of new performance data and the existing business metrics and the cluster immune system. [...]

  • Chico Charlesworth

    The link to ‘cool paper’ is broken, should be http://www.evanmiller.org/poisson.pdf