The IT department is frequently the most prepared part of an organization when it comes to business continuity. IT staff are typically highly trained and only too well aware of the consequences of system failure and human error.
However, in reviewing business continuity and disaster recovery
plans, I've found that there are some areas that are easily overlooked.
1. Underestimating the physical scale of a possible disaster
The closer a recovery site is to your normal location, the more convenient it is. At the same time, the closer it is, the more likely it is to be affected by the same incident and fail to provide the redundancy required.
It's common to think that an adjacent building provides suitable redundancy, but this is not the case. A major fire, gas leak, suspicious vehicle (possible car bomb), and many other events will cause the emergency services to evacuate quite a wide area. If the region is also threatened by tornadoes, earthquakes, forest fires, or major storms, the affected area will be larger still.
One case we've seen was a recovery site located in the same large building as the main site. It afforded little redundancy for anything other than a simple hardware failure. It's very unlikely that only half a building will be evacuated in a real emergency.
2. Underestimating Recovery Effort
There's a qualitative difference between
doing things at a small scale and doing things at a larger scale. We all know this, but it is easy to overlook. Preparing and serving a meal for a few friends is within the capabilities of most people: serving that same meal to a few hundred or a few thousand people is completely different. Food prices need to be negotiated and food bought and stored in bulk. Refrigerated trucks may be required for temporary storage. Tables and tableware need to be bought or rented. Chefs, waiters, and
waitresses need to be hired and trained. Industrial kitchen equipment needs to be purchased or rented. Portable toilet facilities need to be rented. Managers need to be hired to manage the process. There are many new factors to take into account when scaling up.
In a similar manner, when large numbers of computers need to be replaced or provisioned at the same time, techniques suited to recovery of a single
machine are of limited use. Even the effort involved in unboxing and plugging in replacement computers will be substantial, and may well require hiring additional temporary staff. Ideally the computers should be acquired with all relevant software pre-installed: a custom software installation followed by a round of Windows updates will take too long.
In addition, whatever techniques are proposed (e.g. drive imaging) must not rely on
hardware which is potentially unavailable. It's very easy to forget that all the work may have to be done at an alternate site with any tools needed being purchased along with the replacement equipment and with data center recovery operations taking place at the same time.
3. Obsolete and Long Lead Time Equipment
Not everything is available
off-the-shelf. Some equipment may have very long lead times, and other equipment may no longer be available.
One particular example of this relates to media libraries: although the media may still be readily available, the equipment to read it may not be. If you lose the tape library (but not the tapes), can you still replace it? Can you still read the tapes? If the answer is "only by looking on eBay", you cannot count on the answer being yes.
4. Unrealistic Quick Ship Agreements
Quick ship agreements are promises by a supplier to provide specific sets of equipment in a short timescale after a disaster. There are really two types: promises by a manufacturer to provide preferential treatment of current products, and promises by a third
party to provide duplicates of specific pieces of equipment that are in use. At one time the latter made a lot of sense: there were a very limited number of very expensive pieces of equipment in use. It was too expensive for a company to purchase a backup mainframe computer, but a third party could purchase one, split the costs between a number of companies with a similar need, and earn a reasonable profit.
However, the diversity and rapid obsolescence of equipment makes this approach unprofitable now. One company I audited had a quick ship agreement promising to deliver specific obsolete models of desktop computers, routers, servers etc. with specific network cards, disk drives etc. There was no way that they could fulfill this contract profitably; they would literally need to buy a duplicate of each piece of equipment for each customer. Needless to say, during a test exercise the
equipment supplied was not the equipment ordered, although there were lots of promises that the correct equipment would somehow be produced if it was a "real" disaster.
With quick ship agreements you should examine the business from the supplier's perspective: if it's not realistic to profitably meet your requirements, it's very unlikely that your requirements are actually going to be met, no matter what promises are made.
5. Redundancy That Isn't Redundant
If an organization relies heavily on power, telephone, or internet connections to function, then it may make sense to arrange redundant supplies to allow work to continue in the event of one supplier having problems. For power supply disruptions, redundancy can be achieved through the use of backup
generators, or even having separate feeds from different parts of the power grid. With many types of supply, it's easy to see whether redundancy exists. This isn't true for internet and telephone connections.
While it's possible to arrange with two different companies for network connections to enter the building at two different points, the structure of the telecommunications industry means that this is not guaranteed to
achieve any significant redundancy. Companies often lease fiber, switching capacity, etc. from each other. So although there may be some redundancy where the cables enter the building, two different companies may be leasing space on the same fiber a few hundred yards up the road. Paying more to a single supplier for redundant routing may sometimes be possible; but digging holes is expensive, and preventing a single point of failure may not be cost-effective.
6. Backups for non-IT equipment
It is rare for an IT department not to have a good knowledge of backup procedures, the difference between backup and replication, and the importance of off-site historical backups. Unfortunately this knowledge often doesn't extend beyond the IT department. With the increased "intelligence" of equipment, there can be non-IT equipment which
requires backup too.
Other departments often consider neither the need to revert to a previous configuration ("the configuration I had last month worked OK") nor the need to store backups off-site ("our basement just flooded and we lost the equipment and the configurations").
IT department control of non-IT equipment can be a political
minefield, but an IT review of backup procedures for non-IT equipment is a sensible precaution.
7. If only we had the passwords...
An IT department needs to access a wide range of outside services, typically secured by passwords or cryptographic keys. These need to be kept secure, with limited access and a limited number of copies.
They also need to be duplicated at a secure off-site location.
A company had a fire, which destroyed part of its IT infrastructure. It had a backup facility off-site. However, in order to change its DNS entries to use the backup facilities, its staff had to authenticate themselves to a domain registrar using passwords. Unfortunately the passwords were lost in the blaze: the IT staff had trouble convincing the domain
registrar that they were the legitimate domain owners. It was therefore a number of days rather than a number of minutes before the backup facilities came on line.
Passwords, of course, are only one written record that needs to be preserved: also important are phone numbers, contact names, addresses, procedures, and current backup logs. In another company the only record of which off-site backup tape contained which
backup was on one of the computers being backed up. If that computer had a major failure, there was the prospect of requesting and hunting through hundreds of off-site backup tapes to find the right one.
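A simple safeguard is to keep the backup catalog somewhere other than the machines being backed up, for example on a share replicated to the off-site location. A minimal sketch (all file names, host names, and field names here are illustrative):

```python
import json
import time
from pathlib import Path

def record_backup(catalog_path, host, tape_id, contents):
    """Append one backup run to a catalog file that lives AWAY from
    the machine being backed up (e.g. on an off-site replicated share).
    """
    catalog = Path(catalog_path)
    entries = json.loads(catalog.read_text()) if catalog.exists() else []
    entries.append({
        "date": time.strftime("%Y-%m-%d"),
        "host": host,
        "tape": tape_id,
        "contents": contents,
    })
    catalog.write_text(json.dumps(entries, indent=2))

def find_tape(catalog_path, host):
    """Return the most recent tape holding a given host's backup,
    or None if that host has never been cataloged."""
    entries = json.loads(Path(catalog_path).read_text())
    matches = [e for e in entries if e["host"] == host]
    return matches[-1]["tape"] if matches else None
```

The details don't matter; what matters is that, after a total loss of the host, the catalog can still answer "which tape do I recall from storage?" without hunting through hundreds of tapes.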
8. Undocumented Procedures and Ill-defined Responsibilities
It's tempting with a small IT department to skimp on written
documentation - and on duplicating that documentation to a secure off-site location. This may work well while the department remains small.
As the department grows, it is no longer possible for every member of the team to know every procedure.
At some point Joe may be the person responsible for backups, Kathy for network configuration,
etc. Problems then arise if a member of the team gets sick or leaves the company. Even with the best will in the world, unwritten procedures may be communicated incorrectly. In one example we saw, the person responsible for backups left. The unpopular job was shuffled around: everyone knew that backups needed to be made, but nobody had actual responsibility for making sure they were made correctly. There was no documentation, and it quickly became clear to us that not only was
nobody sure what the correct procedure was, but not all systems were being backed up either.
Lack of documentation is also a major burden during recovery: without adequate documentation staff are limited to working only on tasks they normally do, and it is difficult to employ extra temporary staff to assist.
9. The Limitations of
Uninterruptible Power Supplies and Backup Generators
It's rare that a data center does not have backup power which will allow it to keep operating for an hour or so in the event of a power failure. Depending upon the likely frequency and length of power cuts, a backup generator may also be used to keep the systems operating for an indefinite period. Unless an IT department has experienced extended power cuts,
it may be unaware of the limitations of its power backup system.
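One limitation is simply battery capacity. A back-of-the-envelope estimate (all figures illustrative; real runtime also depends on battery age, temperature, and discharge curve) shows how quickly a UPS is exhausted under load:

```python
def ups_runtime_minutes(battery_wh, load_w,
                        inverter_efficiency=0.9, usable_fraction=0.8):
    """Rough UPS runtime estimate: usable stored energy divided by
    load. The efficiency and usable-fraction figures are assumptions,
    and real batteries fade with age, so treat this as an upper bound.
    """
    usable_wh = battery_wh * usable_fraction * inverter_efficiency
    return usable_wh / load_w * 60

# A nominal 5 kWh battery bank feeding a 4 kW load lasts
# well under an hour:
print(round(ups_runtime_minutes(5000, 4000)))  # roughly 54 minutes
```

An estimate like this is only a starting point; the practical run-time limits discussed below (cooling, network closets) often bite sooner.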
If the backup power does not extend to the heating, ventilation, and cooling system (most don't) then a limitation on run-time may be the ability to keep the data center sufficiently cool.
If the backup power does not extend to wiring cabinets and remote switches and access points,
then nobody may be able to use the systems, even if they are running. And with the convergence of voice and IT, voice switching equipment is often located in the data center. Unlike traditional PABX handsets, which draw power over the phone line, VoIP phones require power to be supplied at the phone (or via Power over Ethernet). If backup power never reaches the phone, the phone system will shut down too.
10. Missing Dependencies and Unrealistic Assumptions
IT department disaster recovery plans are often developed before the rest of the company considers business continuity, and there is often a disconnect between the IT department's plans, and those of the rest of the company. Sometimes the dependencies between IT services and activities are not recognized (in one BIA I reviewed the IT department's services were listed as "not critical" even though most of
the critical activities clearly depended upon them). In another case, the assumptions about the speed of recovery following a disaster were unrealistic because it was assumed that the data center would always be undamaged. The Recovery Time Objective of the data center was an order of magnitude longer than the Recovery Time Objective of all the processes that depended upon it.
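This kind of mismatch is easy to check mechanically once dependencies are written down. A minimal sketch (process names and RTO figures are hypothetical):

```python
def rto_conflicts(rto, depends_on):
    """Flag processes whose Recovery Time Objective (RTO) is shorter
    than that of a service they depend on -- a process cannot recover
    faster than its dependencies. rto maps name -> hours; depends_on
    maps name -> list of dependency names."""
    conflicts = []
    for process, deps in depends_on.items():
        for dep in deps:
            if rto[dep] > rto[process]:
                conflicts.append((process, dep))
    return conflicts

rto = {"order entry": 4, "payroll": 48, "data center": 72}
deps = {"order entry": ["data center"], "payroll": ["data center"]}
# Flags both processes: each has a tighter RTO than the
# data center they depend on.
print(rto_conflicts(rto, deps))
```

Any non-empty result means either the dependency's RTO must shrink or the dependent process's RTO is fiction.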
Also liable to fall through the cracks is the effort
involved in supplying IT infrastructure to a new work location: this may involve leasing network connections and phone lines, and purchasing equipment. Non-IT planners may simply assume that the existing network facilities of a hotel or empty office used as a temporary work location will be adequate for voice and simple IT services; the IT department will know that this is rarely the case.
While it is
easy to blame either the IT department or the rest of the company for a failure to communicate, the problems are more likely to arise from each side making unrealistic assumptions about the other's activities. Neither side knows what it does not know, and only by each party reviewing the other's plans will this become evident.
11. Competition for Outside Resources
Most events are localized, affecting only a single building or data center. However, if area-wide events are a significant threat (e.g. earthquakes, forest fires), it's often not recognized that there will be competition for resources. Companies which sell "warm" or "cold" sites oversell their locations on the assumption that it is unlikely that more than one or two customers will be operating in "disaster mode" at the same time. In the event of competition, it is the company
that formally declares a disaster first (generally with some financial cost) which gets priority for use of the facility.
Similar considerations apply for the purchase of replacement equipment, or the hire of temporary resources.
Read the news carefully, and you can see frequent situations where multiple companies had to compete for resources
unexpectedly: replacing office computers in the aftermath of 9/11; oversubscribed disaster recovery locations after the Manchester telecom tunnel fire; the hunt for backup generators following the northeast blackouts.
Whenever a large-scale emergency occurs which affects multiple companies, an IT department needs to be ready to act quickly to acquire any external resources it needs, and it also needs to
be prepared for the possibility that these resources will not be available.
IT departments are generally extremely good at planning for events which affect just the systems and networks under their immediate control, and are typically ahead-of-the-curve with respect to the rest of the organization when it comes to business
continuity. However, without an external review, it's easy for the IT department to miss non-IT threats and dependencies, and it's easy for dependencies between IT and non-IT services to be overlooked.