January 7, 2010

Fire: Destruction or Illumination?

Fire. *

It can shine forth as a candle lighting the darkness. Or it can burn everything in its path, leaving only ashes and ruin.

How does your team handle fire drills? When things go bump in the night (or smack dab in the middle of a work day), is it an opportunity to shine? Or just a tempestuous conflagration that destroys attitudes, team unity, confidence and business value?

I had the opportunity to “fire-test” our tech team this week. At HealthTalker, we are working toward running our entire infrastructure “in the cloud” using Citrix’s XenServer platform. Cloud computing provides all sorts of useful infrastructure features that make our platform that much more robust.

One component of this infrastructure is a set of SAN systems that provide fast, network-based disk storage for the virtual machines. Having the disks on a separate network device provides the backbone for much of the flexibility and redundancy of the cloud setup. Our SANs have some definite quirks that we are working through. Recently, the primary SAN required a system upgrade, which we performed. Unfortunately, the new kernel panicked, which left the SAN non-bootable and, because of the quirks of the enclosure, without any console, keyboard, or external drive access. In other words: bricked.

In an enterprise-grade setup, redundancy of components means that when (not if, but when!) this happens, there are failover options to keep things running with minimal or ideally no downtime. At HealthTalker, we are building out our infrastructure to be fully enterprise-grade across all aspects. However, like much in a start-up, we are building it out as we go, and at this time our backup SAN was not properly provisioned. Thus began The Fire Drill.

Except that it wasn’t a drill. It was the Real Thing™.

Get the systems back! Get the websites back! Get the data back, if possible. Or, if that's not possible, get the systems up with older data (which we had, but not as up-to-date as we would have liked). Can it be recovered? Can the SAN be restored? Can the SAN and the data be restored? Questions swirled. Frustrations mounted. Guesstimates were all over the map. 36+ hours later, scorched and weary, we were able to fully recover the SAN and the data thanks to the incredible dedication of our tech team, including Boris, our new IT guy (who had only started the day before this firestorm hit).

Lessons learned are legion:

  1. Quality people will always pull through. Get quality people and give them the freedom and trust to work their magic in their particular demesne.
  2. Mutual respect is critical. People respecting each other and working together can accomplish a LOT in a short period of time.
  3. Finger-pointing spreads fires. Don’t point fingers in the middle of the fire. It’s useless to talk about the why or who-did-what when it’s fire drill time. Put out the fire first… then talk about lessons learned.
  4. Procedures matter. P&P isn’t just for big companies. Just because you can doesn’t mean you should. Just because you should doesn’t mean you do.
  5. Communication is key. Between team members… amongst the company… with the clients… and with the web users. Let people know what is happening, even if it hurts to say it!
  6. Own your failures. Be honest and admit when you blow it. If it was something stupid, say so, then learn from it. But don’t hide it or downplay it. (BTW, I, like most people, have trouble with this one. Human nature, I guess.)
  7. And last… don’t forget to breathe. The fire drill will pass. The fire will be put out. Life will go on. And what is left behind might very well be a stronger infrastructure, a stronger team, and a stronger company!

* Fire. For Andy (who loves “The Boss”).
