Risky Thinking
February 2014
Michael Z. Bell
www.RiskyThinking.com

Risky Thinking is a free newsletter providing essays, analysis, insights, and oddities related to Business Continuity, Disaster Recovery, and Risk Management.

To subscribe, visit: http://www.RiskyThinking.com/newsletter/

For more information and articles, visit the RiskyThinking website at
http://www.RiskyThinking.com/.


In This Issue
  • 11 Business Continuity Mistakes IT Departments Make
  • Risk Assessment Toolkit
  • News: Flooding, Pandemics, Generators, and Really Unusual Flooding
  • Fixed Price Business Continuity Plan Review
  • Risk Assessment / BIA Seminar Dates and Locations
  • Administrivia, Subscribing and Unsubscribing

11 Business Continuity Mistakes IT Departments Make

The IT department is frequently the most prepared part of an organization when it comes to business continuity. IT staff are typically highly trained and only too well aware of the consequences of system failure and human error.

However, in reviewing business continuity and disaster recovery plans, I've found that there are some areas that are easily overlooked.

1. Underestimating the physical scale of a possible disaster

The closer to your normal location a recovery site is, the more convenient. At the same time, the closer it is, the more likely it is to be affected by the same incident and not provide the redundancy required. It's common to think that an adjacent building provides suitable redundancy, but this is not the case. A major fire, gas leak, forest fire, suspicious vehicle (possible car bomb) and many other events will cause the emergency services to evacuate quite a wide area. In addition, if the area is threatened by tornadoes, earthquakes, forest fires, or major storms, the area affected will be even larger.

The worst case we've seen was a recovery site which was located in the same large building as the main site. It afforded little redundancy for anything other than simple hardware failures. It's very unlikely that only half a building will be evacuated in a real emergency.

2. Underestimating Recovery Effort

There's a qualitative difference between doing things at a small scale and doing things at a larger scale. We all know this, but it is easy to overlook. Preparing and serving a meal for a few friends is within the capabilities of most people: serving that same meal to a few hundred or a few thousand people is completely different. Food prices need to be negotiated and food bought and stored in bulk. Refrigerated trucks may be required for temporary storage. Tables and tableware need to be bought or rented. Chefs, waiters, and waitresses need to be hired and trained. Industrial kitchen equipment needs to be purchased or rented. Portable toilet facilities need to be rented. Managers need to be hired to manage the process. There are lots of new things that have to be taken into account when scaling up.

In a similar manner, when large numbers of computers need to be replaced or provisioned at the same time, techniques suited to recovery of a single machine are of limited use. Even the effort involved in unboxing and plugging in replacement computers will be substantial, and may very well require the hiring of additional temporary staff. Ideally the computers need to be acquired with all relevant software pre-installed: a custom software installation and Windows update will take too long.

In addition, whatever techniques are proposed (e.g. drive imaging) must not rely on hardware which is potentially unavailable. It's very easy to forget that all the work may have to be done at an alternate site with any tools needed being purchased along with the replacement equipment and with data center recovery operations taking place at the same time.

3. Obsolete and Long Lead Time Equipment

Not everything is available off-the-shelf.  Some equipment may have very long lead times, and other equipment may no longer be available. 

One particular example of this relates to media libraries: although the media may still be readily available, the equipment to use it may not. If you lose the tape library (but not the tapes), can you still replace it? Can you still read the tapes? If the answer is "only by looking on eBay", it's uncertain whether the answer will be yes or no.

4. Unrealistic Quick Ship Agreements

Quick ship agreements are promises by a supplier to provide specific sets of equipment in a short timescale after a disaster. There are really two types: promises by a manufacturer to provide preferential treatment of current products, and promises by a third party to provide duplicates of specific pieces of equipment that are in use. At one time the latter made a lot of sense: there were a very limited number of very expensive pieces of equipment in use. It was too expensive for a company to purchase a backup mainframe computer, but a third party could purchase one, split the costs between a number of companies with a similar need, and earn a reasonable profit. 

However, the diversity and rapid obsolescence of equipment makes this approach unprofitable now. One company I audited had a quick ship agreement promising to deliver specific obsolete models of desktop computers, routers, servers etc. with specific network cards, disk drives etc. There was no way that the supplier could fulfill this contract profitably; it would literally need to buy a duplicate of each piece of equipment for each customer. Needless to say, during a test exercise the equipment supplied was not the equipment ordered, although there were lots of promises that the correct equipment would somehow be produced if it was a "real" disaster.

With quick ship agreements you should examine the business from the supplier's perspective: if it's not realistic to profitably meet your requirements, it's very unlikely that your requirements are actually going to be met, no matter what promises are given.

5. Redundancy That Isn't Redundant

If an organization relies heavily on power, telephone, or internet to function, then it may make sense to arrange redundant supplies to allow work to continue in the event of one supplier having problems. For power supply disruptions, redundancy can be achieved through the use of backup generators, or even having separate feeds from different parts of the power grid. With many types of supply, it's easy to see whether redundancy exists. This isn't true for internet and telephone connections.

While it's possible to arrange with two different companies for network connections to enter the building at two different points, the structure of the telecommunications industry means that this is not guaranteed to achieve any significant redundancy. Companies often lease fiber, switching capacity, etc. from each other. So although there may be some redundancy where the cables enter the building, two different companies may be leasing space on the same fiber a few hundred yards up the road. Paying more to a single supplier for redundant routing may sometimes be possible; but digging holes is expensive and preventing a single point of failure may not be cost-effective.
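If you can persuade each carrier to tell you which ducts, fiber runs, and exchanges its circuit actually uses, comparing the two lists is simple. The following Python sketch shows the idea. The route and segment names are made up for illustration; in practice the hard part is getting accurate route information from the carriers, not the comparison.

```python
def shared_segments(route_a, route_b):
    """Return the segments of route_a that also appear in route_b.

    Each route is a list of segment names (ducts, fiber runs, exchanges)
    as reported by the carrier. Any segment returned is a potential
    single point of failure.
    """
    in_route_b = set(route_b)
    return [segment for segment in route_a if segment in in_route_b]


# Hypothetical routes for two "independent" connections.
carrier_a = ["building-entry-north", "duct-17", "exchange-east"]
carrier_b = ["building-entry-south", "duct-17", "exchange-west"]

common = shared_segments(carrier_a, carrier_b)
if common:
    print("Not fully redundant; shared segments:", ", ".join(common))
else:
    print("No shared segments reported.")
```

An empty result only means that no overlap was reported: it is only as good as the route information supplied.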

6. Backups for non-IT equipment

It is rare for an IT department not to have a good knowledge of backup procedures, the difference between backup and replication, and the importance of off-site historical backups. Unfortunately this knowledge often doesn't extend beyond the IT department. With the increased "intelligence" of equipment, there can be non-IT equipment which requires backup too. 

Often other departments consider neither the need to revert to a previous configuration ("the configuration I had last month worked OK") nor the need to store backups off-site ("our basement just flooded and we lost the equipment and the configurations").

IT department control of non-IT equipment can be a political minefield, but an IT review of backup procedures for non-IT equipment is a sensible precaution.

7. If only we had the passwords...

An IT department needs to access a wide range of outside services, typically secured by passwords or cryptographic keys.  These need to be kept secure, with limited access and a limited number of copies. They also need to be duplicated at a secure off-site location. 

A company had a fire, which destroyed part of its IT infrastructure. It had a backup facility off-site. However, in order to change its DNS entries to use the backup facilities, its staff had to authenticate themselves to a domain registrar using passwords. Unfortunately the passwords were lost in the blaze: the IT staff had trouble convincing the domain registrar that they were the legitimate domain owners. It was therefore a number of days rather than a number of minutes before the backup facilities came on line.

Passwords, of course, are only one written record that needs to be preserved: also important are phone numbers, contact names, addresses, procedures, and current backup logs. In another company the only record of which off-site backup tape contained which backup was on one of the computers being backed up. If that computer had a major failure, there was the prospect of requesting and hunting through hundreds of off-site backup tapes to find the right backup tape to use.

8. Undocumented Procedures and Ill-defined Responsibilities

It's tempting with a small IT department to skimp on written documentation - and on duplicating that documentation to a secure off-site location. This may work well while the department remains small. 

As the department grows, it is no longer possible for every member of the team to know every procedure. 

At some point Joe may be the person responsible for backups, Kathy for network configuration, etc. Problems then arise if a member of the team gets sick, or leaves the company. Even with the best will in the world, unwritten procedures may be communicated incorrectly. In one example we saw, the person who was responsible for backups left. The unpopular job was shuffled round: everyone knew that backups needed to be made, but nobody had actual responsibility for making sure the backups were made correctly. There was no documentation, and it quickly became clear to us that not only was nobody sure what the correct procedure was, but not all systems were being backed up either.

Lack of documentation is also a major burden during recovery: without adequate documentation staff are limited to working only on tasks they normally do, and it is difficult to employ extra temporary staff to assist.

9. The Limitations of Uninterruptible Power Supplies and Backup Generators

It's rare that a data center does not have backup power which will allow it to keep operating for an hour or so in the event of a power failure. Depending upon the likely frequency and length of power cuts, a backup generator may also be used to keep the systems operating for an indefinite period. Unless an IT department has experienced extended power cuts, it may be unaware of the limitations of its power backup system.

If the backup power does not extend to the heating, ventilation, and cooling system (most don't) then a limitation on run-time may be the ability to keep the data center sufficiently cool.

If the backup power does not extend to wiring cabinets and remote switches and access points, then nobody may be able to use the systems, even if they are running. And with the convergence between voice and IT, voice switching equipment is often located in the data center.  Unlike systems based on PABXs, VoIP phones require power to be supplied at the phone (or by power over Ethernet). If backup power never reaches the phone, the phone system will shut down too. 

10. Missing Dependencies and Recovery Tasks

IT department disaster recovery plans are often developed before the rest of the company considers business continuity, and there is often a disconnect between the IT department's plans, and those of the rest of the company. Sometimes the dependencies between IT services and activities are not recognized (in one BIA I reviewed the IT department's services were listed as "not critical" even though most of the critical activities clearly depended upon them). In another case, the assumptions about the speed of recovery following a disaster were unrealistic because it was assumed that the data center would always be undamaged. The Recovery Time Objective of the data center was an order of magnitude longer than the Recovery Time Objective of all the processes that depended upon it.
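A mismatch like this is easy to check for mechanically once the dependencies are written down. The following is a minimal Python sketch, not a description of any particular tool: the activity names, the RTO figures in hours, and the function name rto_conflicts are all invented for illustration.

```python
def rto_conflicts(rto_hours, depends_on):
    """Find activities whose RTO is shorter than that of something they need.

    rto_hours:  dict mapping each activity or service to its RTO in hours.
    depends_on: dict mapping each activity to the services it depends on.
    Returns a list of (activity, dependency) pairs that cannot both be met.
    """
    conflicts = []
    for activity, dependencies in depends_on.items():
        for dependency in dependencies:
            if rto_hours[dependency] > rto_hours[activity]:
                conflicts.append((activity, dependency))
    return conflicts


# Hypothetical figures: the data center takes far longer to recover
# than the activities that rely on it are allowed to be down.
rto_hours = {"order processing": 4, "payroll": 24, "data center": 48}
depends_on = {
    "order processing": ["data center"],
    "payroll": ["data center"],
}

for activity, dependency in rto_conflicts(rto_hours, depends_on):
    print(f"{activity} (RTO {rto_hours[activity]}h) depends on "
          f"{dependency} (RTO {rto_hours[dependency]}h)")
```

This only compares direct dependencies. A fuller check would also follow chains of dependencies (an activity that needs a service that needs the data center).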

Also lost in the cracks may be the effort involved in supplying IT infrastructure to a new work location: this may involve leasing network connections and phone lines, and purchasing equipment. It may simply be assumed by non-IT planners that existing network facilities offered by a hotel or empty office used as a temporary work location will be adequate for voice and simple IT services; the IT department will know that this is rarely the case.

While it is easy to blame either the IT department or the rest of the company for a failure to communicate, the problems are more likely to arise by each side making unrealistic assumptions about the other's activities. Neither side knows what it does not know, and only by each party reviewing the other's plans will this become evident.

11. Competition for Outside Resources

Most events are localized, and only affect the building or the data center. However, if area-wide events are a significant threat (e.g. earthquakes, forest fires) it's often unrecognized that there will be competition for resources. Companies which sell "warm" or "cold" sites over-sell their location on the assumption that it is unlikely that more than one or two companies will be operating in "disaster mode" at the same time. In the event of competition, it is the company that formally declares a disaster first (generally with some financial cost) which gets priority for use of the facility.

Similar considerations apply for the purchase of replacement equipment, or the hire of temporary resources. 

Read the news carefully, and you can see frequent situations where multiple companies had to compete for resources unexpectedly: replacing office computers in the 9/11 aftermath; oversubscribed disaster recovery locations in the aftermath of a Manchester telecom tunnel fire; the hunt for backup generators following the north east blackouts.

Whenever a large-scale emergency occurs which affects multiple companies, an IT department needs to be ready to act quickly to acquire any external resources it needs, and it also needs to be prepared for the possibility that these resources will not be available.

To Conclude...

IT departments are generally extremely good at planning for events which affect just the systems and networks under their immediate control, and are typically ahead-of-the-curve with respect to the rest of the organization when it comes to business continuity. However, without an external review, it's easy for the IT department to miss non-IT threats and dependencies, and it's easy for dependencies between IT and non-IT services to be overlooked.

 


Ad: Fixed Price Continuity Plan Review

It's often difficult to see your Business Continuity / Disaster Recovery Plans from a different perspective. Are they realistic? Do they miss something obvious? An external pair of eyes can often see problems or solutions which you can't.

We offer an economical fixed price service to review your business continuity plan. We will review documentation, interview key staff, and prepare a confidential written report identifying the strengths and any weaknesses we can see in your current plan.

Contact us for further details.
http://www.riskythinking.com/


News: Flooding and Flood Insurance

The continuing series of floods in Europe provides an urgent reminder to review flood preparedness and flood insurance coverage. It's common for insurance policies to have many exclusions in this area, and it's worth checking what is and isn't covered. In particular, insurers may distinguish between ground water flooding (if the water table rises too high due to extensive rain), overland flooding (from rivers etc.), flooding from burst pipes, etc.

The Wikipedia entry on Flood Insurance is a good starting point for some of the issues involved, but you should review your local legislation and insurance policy, and consult your insurance broker, to confirm that you have the correct coverage and to find out what additional coverage is available:

http://en.wikipedia.org/wiki/Flood_insurance


News: Why do Hospital Generators Keep Failing?

I just came across a 2012 article in Salon magazine which looks specifically at the history of backup generator failures in hospitals, and suggests some reasons why they may fail when needed. Worth reading if you rely on backup generators to see whether any of the observed organizational or procedural flaws may also apply to you.

http://www.salon.com/2012/11/01/why_do_hospital_generators_keep_failing/


News: Pandemics, Wild Birds, and Horse Flu

A new University of Arizona study sheds some new light on the evolution of the flu virus, which is still the most likely cause of the next major pandemic. This press release describes the surprising results of a study that looked at the evolution of flu viruses and their transmission between species. The common assumption blaming wild birds for new flu strains may well be incorrect. It also includes an interesting account of the horse flu epidemic of 1872: it's a good exercise to try and guess some of the possible consequences of that epidemic before reading the press release.

http://uanews.org/story/ua-study-on-flu-evolution-may-change-textbooks-history-books


An Unusual Type of Flooding

As anybody who follows @RiskyThinking on Twitter knows, I collect unusual risks. Recently London's Underground was disrupted by a very different type of flooding. Quick-setting cement from a construction job accidentally flowed into a control room. Picture, article, and public statements:

http://metro.co.uk/2014/01/23/set-in-stone-victoria-line-closed-after-control-room-is-flooded-with-fast-setting-concrete-4275088/


Risk Assessment Toolkit

Our Risk Assessment Toolkit is designed to assist you in creating and maintaining a Risk Register and Business Impact Analysis by modeling dependencies, simulating disruptions, and calculating potential losses. Please download an evaluation copy if you are interested in finding a better way to do these things.

Download Evaluation Copy
http://www.riskythinking.com/risk_assessment_toolkit/


Updated Seminar Dates and Locations

Our Business Impact Analysis / Risk Assessment training seminars have some updated dates and locations. We are also including a free copy of the Risk Assessment Toolkit (a US$795 value) with each seminar seat. Hopefully I will get a chance to meet you there.

Seminar Details:
http://www.riskythinking.com/training/


Administrivia, Subscribing, and Unsubscribing

RISKY THINKING is a free newsletter providing essays, analysis, insights, and oddities related to Business Continuity, Disaster Recovery, and Risk Management. You can subscribe on the web at http://www.RiskyThinking.com/newsletter/.

Please feel free to forward RISKY THINKING to colleagues or friends who will find it valuable. You may reprint this newsletter providing it is reprinted in its entirety.