Monday, August 29, 2011

Risk Assessment after Hurricane Irene

Now that Hurricane Irene is behind us, and you've used your Crisis Control skills to salvage as much as possible in real-time, there's a natural urge to start Risk Assessment for the next natural disaster.

However, we cannot predict what the next surprise will be; if we could, it wouldn't be a surprise. It's even harder to predict how much damage it will cause; nobody could have predicted the disaster and chaos caused by 9/11 - probably not even the terrorists who masterminded it, Bin Laden included.

Most of the time, the size of the catastrophe is inversely proportional to the amount of preparation for it.

The trick is to break down the various troubles from the current disaster and group them into distinct categories. Each category then gets treated separately - and each category can be assessed for future prevention, if the risk/cost ratio is high enough. 

Here are some potential groupings - feel free to classify them differently.
  • Communication
    • Internet connectivity
      • Access to our web site
      • Access to our email
      • Access to our customer support system
    • Phone connectivity
      • Mobile phones
      • Land-lines
    • Inter-office communication
      • List of key people's home phones
      • List of team leaders' contact info
      • Escalation plan for:
        • Emergencies
        • Helping each other
        • Moral Support
  • Access to office
    • Physical access
    • Emergency exits
    • Virtual access to email
    • Access to sensitive info like passwords
  • Backup and Restore
    • Web sites
    • Emails
    • Personal PCs
    • Off-site backup
    • Faster Restore
Each item on the list then needs to be taken care of. There are few - if any - generic solutions.

Let's work through the first item on our list: Access to our web site.

During the storm, our web site may have been inaccessible because the electricity went out, because the internet connection died, or because the site was overloaded by too many home-bound people trying to access it.

If our web site went offline during the storm, then we may want to have it hosted in more than one location. The exception would be a web site that serves only the local population: if the power to the web site goes down, then presumably our entire user base cannot get online either, so there would be no need to be online.
  • This highlights why disaster control plans need to be updated periodically. Now that mobile phones are everywhere, the above presumption is no longer valid. There may be no electricity or internet within 100 miles of your user base, but you can be 100% sure that many - if not most - of your users will be online from their mobiles, hand-helds, iPads, iPods and other wireless technology.
But simply deciding to distribute your internet hosting is not enough. It needs to be implemented. If your system was not designed for being hosted at multiple sites, this could be a major project. At some stage, management will have to make a decision. To help them, you want to have figures for:
  • The cost of damage to our reputation and traffic by remaining at a single site if it becomes unavailable for a day / week / month.
    • There's also the cumulative cost of people never coming back to a site that was down for more than a few moments.
  • The monthly financial cost of hosting at 2 or more sites.
  • The cost of maintaining more than one site for engineering, QA and IT.
  • The cost of the design, engineering and testing efforts required to upgrade the system.
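To make the comparison concrete for management, the figures above can be put side by side. The following is only a rough sketch: every probability and dollar amount is a made-up placeholder, the names (`OutageScenario`, `expected_annual_outage_cost`, `annual_multi_site_cost`) are hypothetical, and it assumes that a second site removes the outage cost entirely, which real life rarely guarantees.

```python
from dataclasses import dataclass


@dataclass
class OutageScenario:
    """One way the single site could be unavailable (all figures hypothetical)."""
    name: str
    annual_probability: float  # estimated chance of this outage in a given year
    cost_if_it_happens: float  # lost reputation and traffic, expressed in dollars


def expected_annual_outage_cost(scenarios):
    """Sum of probability x cost over all outage scenarios."""
    return sum(s.annual_probability * s.cost_if_it_happens for s in scenarios)


def annual_multi_site_cost(monthly_hosting, annual_maintenance,
                           upgrade_cost, amortize_years):
    """Yearly cost of the extra site, spreading the one-time upgrade over several years."""
    return 12 * monthly_hosting + annual_maintenance + upgrade_cost / amortize_years


# Placeholder estimates for staying at a single site: down for a day / week / month.
scenarios = [
    OutageScenario("down for a day", 0.20, 10_000),
    OutageScenario("down for a week", 0.05, 100_000),
    OutageScenario("down for a month", 0.01, 1_000_000),
]

risk = expected_annual_outage_cost(scenarios)
cost = annual_multi_site_cost(monthly_hosting=500, annual_maintenance=6_000,
                              upgrade_cost=30_000, amortize_years=3)

print(f"Expected yearly cost of staying single-site: ${risk:,.0f}")
print(f"Yearly cost of hosting at a second site:     ${cost:,.0f}")
print("Multi-site pays for itself" if risk > cost else "Single site is cheaper on paper")
```

The cumulative cost of visitors who never come back is the hardest number to estimate; in this sketch it would simply be folded into `cost_if_it_happens`.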
A final decision point to take into consideration is that the other site chosen may also be targeted by the next disaster.

When the Eyjafjallajökull volcano erupted, it caused airports all over Europe to be closed. Had your emergency flight plan called for racing across the continent from Belgium to Italy to catch a flight, it would have been useless as a backup plan.

Hurricane Irene proved that the 500 miles between NC and CT are not enough when doing hurricane Risk Assessment. An elementary understanding of weather patterns, the national electricity grids and the internet backbone would be required to decide on the safest location for hosting the mirror site.

If the decision is taken to host the web site at two locations, then the implementation has to be tested to ensure that if either site goes down, the other site will manage to run solo. It has to be able to handle the load: not only double the regular load, but maybe double that again, since during emergencies the site tends to get more traffic. That may be because it has useful information, because it entertains those who are home-bound due to the circumstances, or because the competition was not prepared to implement multi-site hosting.
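The "double, then double again" rule of thumb can be written down as a quick check to run against load-test results. This is a minimal sketch under assumptions: two sites sharing traffic evenly, an emergency surge factor of 2, and hypothetical function names and request rates.

```python
def required_solo_capacity(regular_load_per_site, sites=2, surge_factor=2.0):
    """Load one surviving site must carry alone, in the same units as the input.

    With two sites, taking over the other site's traffic doubles the load;
    the surge factor then accounts for extra traffic during the emergency.
    """
    total_regular_load = regular_load_per_site * sites
    return total_regular_load * surge_factor


def can_run_solo(tested_capacity, regular_load_per_site, sites=2, surge_factor=2.0):
    """True if the capacity measured in load testing covers the solo requirement."""
    needed = required_solo_capacity(regular_load_per_site, sites, surge_factor)
    return tested_capacity >= needed


# Hypothetical numbers: each site normally serves 500 requests per second.
needed = required_solo_capacity(regular_load_per_site=500)
print(f"A lone site should be tested up to {needed:.0f} requests per second")
print("Load test at 1500 req/s is enough:", can_run_solo(1500, 500))
print("Load test at 2500 req/s is enough:", can_run_solo(2500, 500))
```

The surge factor is a guess until you have traffic figures from a real emergency, so it is worth revisiting after each incident.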

Side benefit: Once we have multi-site hosting for our web site, maybe we can use it for other items on the list:
  • Email access
  • Lists of passwords and phone numbers
  • Escalation plan
  • Customer support
  • Off-site backups
But don't assume that any of these solutions is trivial; you have to take into account issues such as security and access control when sensitive information is made available online.

In conclusion, identifying the trouble spots is only the first task. At some level it's the easiest one. Implementing solutions is harder. In future posts maybe we'll analyse what needs to be done for other items on the list.

Risk Assessment and Crisis Control are full-time jobs, and the more time you spend on them, the better prepared you'll be for the next surprise that nature or mankind springs on you.

- Danny Schoemann


