Where I've seen it work, it was founded on redundancy. Single points of failure are poison. At the least, the redundancy must include:
- Data redundancy: proper backups, replication, etc., plus a clear, well-tested recovery procedure. Done right, it would take an extraordinary catastrophe to set the business back more than one day's worth of important data. Done really right (e.g. geo-redundant backups), even an extraordinary catastrophe can be guarded against to some degree.
- Machine/services redundancy: at a minimum, HA hosts must be placed in fail-over pairs. This has additional advantages. For instance, fail-over pairs are also typically used for load-balancing. Also, if one of the pair goes down, the surviving host holds live data that can be copied directly to the failed one for recovery, so you avoid digging out backups or risking the loss of important data.
- Support redundancy: a team of Ops support people (e.g. systems administrators) with overlapping skills and cross-training, who participate in an on-call rotation. In my experience, 4 is the minimum team size to avoid fast burnout.
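
To make the three criteria above concrete, here is a minimal sketch of an audit that flags single points of failure. Everything in it is an assumption for illustration: the `Host` record, the `find_redundancy_gaps` function, and the inventory data are hypothetical, and the one-day backup age and team size of 4 are simply the thresholds mentioned in the list, not universal rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Host:
    """Hypothetical inventory record for one HA host."""
    name: str
    failover_peer: Optional[str]  # name of the paired host, or None if unpaired
    last_backup: datetime         # time of the last verified backup


def find_redundancy_gaps(
    hosts: List[Host],
    on_call_team: List[str],
    now: datetime,
    max_backup_age: timedelta = timedelta(days=1),  # "one day's worth of data"
    min_team_size: int = 4,                         # rule of thumb from the list
) -> List[str]:
    """Return a human-readable list of single points of failure."""
    gaps = []
    known = {h.name for h in hosts}
    for h in hosts:
        # Machine redundancy: every host needs a peer that actually exists.
        if h.failover_peer is None or h.failover_peer not in known:
            gaps.append(f"{h.name}: no fail-over peer")
        # Data redundancy: the newest backup must be recent enough.
        if now - h.last_backup > max_backup_age:
            gaps.append(f"{h.name}: last backup is too old")
    # Support redundancy: the on-call rotation needs enough people.
    if len(on_call_team) < min_team_size:
        gaps.append(
            f"on-call team has {len(on_call_team)} people, "
            f"want at least {min_team_size}"
        )
    return gaps


# Example inventory (made-up data).
now = datetime(2024, 1, 10, 12, 0)
hosts = [
    Host("db1", "db2", now - timedelta(hours=3)),
    Host("db2", "db1", now - timedelta(hours=3)),
    Host("cache1", None, now - timedelta(days=3)),
]
for gap in find_redundancy_gaps(hosts, ["ana", "bo", "cy"], now):
    print(gap)
```

A check like this only covers what the inventory records. It cannot tell you whether the recovery procedure has actually been tested, which still has to be rehearsed by hand.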