Monday, May 17, 2010

High availability

Every business owner seems to want high availability for their website, even if it isn't ultimately important for their business, but no business owner wants to spend more money on it than they have to. However, it doesn't come for free, or even cheap. "There's no such thing as a free lunch", as my high school physics teacher used to say (likely quoting or paraphrasing The Moon is a Harsh Mistress).

Where I've seen it work, it was founded on redundancy. Single points of failure are poison. At the least, the redundancy must include:

  • Data redundancy: proper backups, replication, etc. exist, as well as a clear, well-tested procedure for recovery; if done right, it would take an extraordinary catastrophe to set the business back more than one day's worth of important data; if really done right (e.g. geo-redundant backups), even an extraordinary catastrophe might be guarded against to some degree.
  • Machine/services redundancy: at a minimum, HA hosts must be placed in fail-over pairs. There are additional advantages to this. For instance, fail-over paired machines are also typically used for load-balancing. Also, if one of the pair goes down, the surviving machine holds live data that can be copied directly to the recovered one, so you avoid having to dig out backups or risk losing any important data.
  • Support redundancy: a team of Ops support people (e.g. systems administrators) with overlapping skills and cross-training, who participate in an on-call rotation. In my experience, 4 is the minimum team size to avoid fast burnout.
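
To put a rough number on why fail-over pairs matter, here is a back-of-the-envelope sketch in Python. It is not a measurement of any real system. The 99% single-host availability is a made-up figure, the function names are mine, and the arithmetic assumes the two hosts fail independently and that fail-over is instantaneous. Both assumptions are optimistic: shared power, shared network, or the same bad deploy can take out both hosts at once.

```python
HOURS_PER_YEAR = 365 * 24


def availability_of_pair(single_host_availability: float) -> float:
    """Availability of a fail-over pair where either host can serve alone.

    Assumes independent failures: the pair is down only when both hosts
    are down at the same time.
    """
    unavailability = 1.0 - single_host_availability
    return 1.0 - unavailability ** 2


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one host
    pair = availability_of_pair(single)
    print(f"single host: {downtime_hours_per_year(single):.1f} hours down per year")
    print(f"fail-over pair: {downtime_hours_per_year(pair):.2f} hours down per year")
```

Under those assumptions, a lone host at 99% is down for roughly 88 hours a year, while the pair is down for under an hour. Real numbers will be worse, but the gap is the reason a single point of failure is so expensive.
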
Lastly, remember that sometimes shit just happens. The world is full of uncertainty, and we can only do our best to mitigate it.