Chaos Monkey and the Coffee Shop: A Quality Emergency Plan
In the last 24 hours, I have had three glimpses into emergency plans. First, my local coffee shop -- the true source of my productivity -- experienced a water main break. Second, a site I manage experienced a server failure. Third, I came across Netflix's recently open-sourced Chaos Monkey tool.
The emergencies I am talking about don't involve physical risk and are certainly not life-or-death circumstances. But they deal with a threat to a business: the loss of customers and revenue. Reflecting on these provides some insight into prevention and practice.
The Coffee Shop
When I walked into my corner coffee shop for my morning joe, the barista informed me that due to a water main break, they were unable to make most of their drinks. I got lucky -- they still had some drip coffee. I ordered and sat down to drink it.
I watched the team deal with customers who came in, deal with the equipment, and engage in problem solving. After an hour, the talk turned from carrying on to closing shop. My curiosity was piqued, so I asked about the protocol for handling situations like this. They explained that they have a procedure to follow for emergencies, and that they had stepped through it to the best of their abilities. But there was a problem: the protocol didn't adapt well to a sudden, unexpected loss of water.
The team did an admirable job, and took measures to encourage customers to come back. But the emergency plan had a flaw. (With a little inventiveness, I think they managed to avoid closing for the day.)
I admit that in most cases I am not good at planning for emergencies. But in the case of the website that failed, we had a plan. We had thought through enough of the possibilities that even the case that occurred was one we had anticipated. When the site failed, we implemented the plan, and it worked.
We could pat ourselves on the back, but here's the thing: Until our outage, we didn't know whether the emergency plan would work. We got lucky -- and even in our luck, there was a fair amount of flapping as we tried to implement our untested recovery plan.
If only we had been more proactive in testing our emergency plan.
Netflix takes a different approach to their emergency plan: they simulate emergencies. In fact, they built a tool to simulate emergencies for them, the aptly named Chaos Monkey.
In a nutshell, Chaos Monkey causes servers to break. Yes, they intentionally break their servers. Then the emergency recovery process kicks in. If some part of the emergency process is broken, they'll know and they will be prepared to react (I presume that it's not 2 AM on a Sunday when they run this thing).
This testing is somewhat chaotic, but it happens in a controlled environment, and it lets the Netflix engineers exercise and improve their emergency plans.
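To make the idea concrete, here is a toy drill I sketched in Python. It is not Netflix's tool and does not use its API: the Fleet class, the server names, and the unleash_monkey and run_drill functions are all made up for illustration. The sketch assumes a fleet that should always have a fixed number of servers running, breaks one at random, runs the recovery step, and reports whether the fleet came back to full strength.

```python
import random


class Fleet:
    """A pretend group of servers that should always run `desired` instances."""

    def __init__(self, desired):
        self.desired = desired
        self.next_id = 0
        self.running = set()
        self.recover()  # bring the fleet up to full strength

    def launch(self):
        """Start one new (imaginary) server and return its name."""
        self.next_id += 1
        name = f"server-{self.next_id}"
        self.running.add(name)
        return name

    def recover(self):
        """The emergency plan: launch replacements until capacity is restored."""
        launched = []
        while len(self.running) < self.desired:
            launched.append(self.launch())
        return launched


def unleash_monkey(fleet, rng):
    """Break one running server, chosen at random, and return its name."""
    victim = rng.choice(sorted(fleet.running))
    fleet.running.remove(victim)
    return victim


def run_drill(fleet, rng):
    """Break something on purpose, run the plan, and check that it worked."""
    victim = unleash_monkey(fleet, rng)
    replacements = fleet.recover()
    healthy = len(fleet.running) == fleet.desired and victim not in fleet.running
    return victim, replacements, healthy


if __name__ == "__main__":
    fleet = Fleet(desired=3)
    victim, replacements, healthy = run_drill(fleet, random.Random())
    print(f"broke {victim}, launched {replacements}, healthy: {healthy}")
```

The point of the drill is the last check. If recover() had a bug, healthy would come back False during a rehearsal, not during a real outage.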
An emergency plan is a must-have. But to play its role, it's got to work.