Of Backups and Bare Metal Restore

Your data's safe, isn't it? If a disaster happened, you could simply buy new computers, restore from backups, and continue working. Or could you? Welcome to The First Rule of Real World Backups: backups don't exist unless you test them.

Your company keeps regular backups, doesn't it? Buried in an air-conditioned storeroom on the first floor of the building (safe from hurricanes and flooding) there's a shelf (or better still, a fireproof media safe) full of old backup tapes and DVDs. Each day or each week you back up everything and send the media off-site to be stored safely in a secret location, which you imagine is some underground fortress surrounded by barbed wire fences, patrolled by armed guards in military uniforms with ferocious Rottweilers. Even a James Bond villain would envy the mighty fortress in which your data is kept.

Or perhaps your company is past all that. Through the wonders of the internet, your data is automatically replicated in three different locations, scattered around the globe, heavily encrypted so that even NSA scientists working with secret quantum computers would require many millennia to read your email. Once again the backup locations are in top secret bunkers protected by armed guards from the ravages of war, civil insurrection, and all the Acts of God listed in loving detail in your insurance policies. Even if the world is annihilated by a nuclear war or an alien invasion, you know your data will survive.

Of course, we know that in reality the data is stored in some anonymous warehouse patrolled by a bored security guard on minimum wage, but it's still off-site and safe. And we can dream. If the worst happens we can restore everything easily, can't we?

It's a sad truth about backups that although we make them, we rarely test them. And we rarely test them under the conditions that would occur following a disaster.

As T. S. Eliot puts it in the poem The Hollow Men:

Between the idea
And the reality
Between the motion
And the act
Falls the Shadow

I first came across The Shadow in my first job. I was working for the research and development division of a software house. In those days, off-site backup consisted of sending a removable hard disk (probably holding all of 10 Mbytes) off to a service bureau each week. The bureau would then copy the contents of the disk to magnetic tape, store the tape, and return the disk to us.

Fortunately we never needed those off-site backups.

About a year after I started working there we received a check and an apology from the service bureau. It transpired that an operator had been omitting one vital step from this process: actually copying the files to tape. The backup tapes were blank. (The rumor we heard was that the operator had observed our write-only use of tapes, and figured he could do his job more efficiently if he omitted the tape backup part of it.)

Thus The First Rule of Real World Backups: backups don't exist unless you test them.
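
One way to make such a test concrete is to restore a backup to a scratch location and compare it, file by file, with the data it is supposed to contain. The following is only a minimal sketch in Python, under the assumption that both the original files and the restored copy are reachable as ordinary directories; the function names (file_digest, tree_digests, compare_restore) are ours, invented for illustration, and not part of any backup product.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_restore(original: Path, restored: Path) -> dict[str, list[str]]:
    """Report files that are missing from, or differ in, the restored tree."""
    src = tree_digests(original)
    dst = tree_digests(restored)
    return {
        "missing": sorted(set(src) - set(dst)),
        "changed": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }
```

A restore from a blank tape would show up here as every file listed under "missing". Files that legitimately changed after the backup was taken will also appear as "changed", so a check like this is most useful against a snapshot of known content.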

The Shadow, of course, turns up regularly on a small scale. How many times have you heard the plaintive user's cry: "But I thought that was being backed up?" Typically users assume that the IT department is backing up their computer. In reality the IT department doesn't have the space or time to store copies of the MP3 files, games, and other personal stuff that has somehow found its way onto the work computer. So backup is limited to a network drive, and excludes stuff stored in "My Documents" and programs installed without the IT department's knowledge. The user was probably told about the backup policy during their first day at work, along with how to complete a timesheet and where to find the photocopier. But the information was never used and was quickly forgotten.

This doesn't just happen at the user level. In a recent business continuity audit, we noted that responsibility for making backups had been transferred when the person who normally made the backups left suddenly. The unfortunate soul who had the job suddenly dumped upon him did the job as he supposed it should be done. There were no written procedures, and few records. While some things were being backed up, it was evident to us that other things were not.

Thus a corollary to the First Rule is: just because something is being backed up, you can't assume everything is being backed up.
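
A simple guard against this is to keep a written list of the things that matter and check it against what the backup job actually includes. Here is a rough sketch, assuming the backup selection can be expressed as a list of directory roots; the function name uncovered and the example paths are made up for illustration.

```python
from pathlib import PurePosixPath


def uncovered(important: list[str], backup_roots: list[str]) -> list[str]:
    """Return the important paths that no backup root contains."""
    roots = [PurePosixPath(r) for r in backup_roots]
    result = []
    for item in important:
        path = PurePosixPath(item)
        # A path is covered if it is a backup root or lies beneath one.
        if not any(path == r or r in path.parents for r in roots):
            result.append(item)
    return result


# Hypothetical inventory: only the network share is in the backup job.
important = ["/srv/share/finance", "/home/alice/Documents", "/opt/licence-server"]
backup_roots = ["/srv/share"]
print(uncovered(important, backup_roots))
# prints ['/home/alice/Documents', '/opt/licence-server']
```

The hard part is not the comparison but keeping the list of important things current, which is exactly what went missing when the job changed hands.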

In that same audit, we noted that the lack of records meant that if a backup tape needed to be recalled from off-site storage to rebuild a particular server, there was no method (apart from reading the catalog off the backup tapes) of determining which tape would need to be couriered from off-site storage. So if a restore was required, all recent tapes would have to be recalled and read to find the tape that was needed. (The client assured us that tapes could easily be retrieved by the date on which they were stored: we were never quite convinced of this.)

Another corollary to the First Rule is therefore: if you can't figure out how to get it back from off-site storage, you don't really have an off-site backup.
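
The missing record need not be elaborate. As a sketch only, suppose each tape sent off-site is logged with its label, the server it holds, and the backup date; finding the tape to recall is then a simple lookup. The TapeRecord fields, the tape_to_recall function, and the sample labels below are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TapeRecord:
    label: str        # what is written on the tape case
    server: str       # which server's backup the tape holds
    backed_up: date   # date the backup was taken


def tape_to_recall(
    catalog: list[TapeRecord], server: str, as_of: date
) -> Optional[TapeRecord]:
    """Return the most recent tape for a server taken on or before as_of."""
    candidates = [
        r for r in catalog if r.server == server and r.backed_up <= as_of
    ]
    return max(candidates, key=lambda r: r.backed_up, default=None)


# Hypothetical log of tapes sent off-site.
catalog = [
    TapeRecord("T-0101", "mail", date(2010, 2, 1)),
    TapeRecord("T-0102", "files", date(2010, 2, 1)),
    TapeRecord("T-0108", "mail", date(2010, 2, 8)),
]
print(tape_to_recall(catalog, "mail", date(2010, 2, 5)))
```

A copy of the log has to live somewhere other than the servers it describes, or it will be buried in the same rubble.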

When you do get the tapes back (and assuming you also have the hardware necessary to read them) it should all be plain sailing, right?

Not so. You are now faced with doing a Bare Metal Restore: restoring your backups to a machine on which nothing is installed. This is theoretically easy if the machine you are restoring to is the same as the machine you backed up. But it isn't. That machine is buried in a pile of smoking rubble. What you have is what your quick ship supplier promised would be a machine with an identical configuration, but isn't quite. It's got bigger disks (they don't make the type of disks you use any more), a different motherboard, a faster network adapter, and a different CPU.

We recently had an experience of the joys of a bare metal restore on a small scale. We had a full system backup made with Microsoft's NTBACKUP program. Here's what should have happened, according to Microsoft:

  1. Install Windows XP
  2. Restore full system backup, overwriting system files.
  3. Reboot
  4. Repair the XP installation from an installation CD to update Windows to match the hardware. (Note: do NOT use the Recovery Console. Take the subsequent option to repair an existing installation instead of overwriting it.)
  5. Reactivate Windows

Here's what actually happens:

  1. You install Windows XP on the replacement hardware.
  2. You use NTBACKUP to restore everything, overwriting system files.
  3. You reboot.
  4. As instructed, you repair the XP installation from an installation CD to update Windows to match the new hardware.
  5. You find Windows won't run because it isn't activated. When you try to activate it, you get an error message telling you the program MSOOBE.EXE has crashed.
  6. You spend several hours searching the web. You find that MSOOBE.EXE is a Microsoft program called, ironically, "Microsoft Out of Box Experience". You find that Microsoft has a lot of information about making backups with NTBACKUP.EXE, but almost nothing about restoring a system from backups.
  7. Finally you find this helpful posting from Kunal Mudliyar of Pune, India: http://social.microsoft.com/Forums/en-US/genuinewindowsxp/thread/9e2a22a4-429e-4abc-9f8e-1735e46fb0c4
  8. You boot Windows in Safe Mode, find the hidden spuninst.exe program in the hidden folder c:\windows\ie7\spuninst and uninstall Internet Explorer 7.
  9. You reboot Windows in normal mode and activate it.
  10. You reinstall Internet Explorer 7.
  11. A day has elapsed. You leave Windows updating itself and adjourn to the pub.

"Between the idea, And the reality … Falls the Shadow".

Bare metal restore is not the same as a simple file restore. So the final corollaries are these: you don't know how long a bare metal restore to different hardware will take until you try it. And: just because it worked last week doesn't mean it will work this week.

So, to conclude, The First Rule of Real World Backups is: backups don't exist unless you test them. So don't assume: test. And test using a good approximation to the real world. Different hardware. Different people. Tapes recalled from off-site storage (or files transferred from an on-line vault). And just because it worked last time, don't assume that a system update won't screw you up this time.

23 February 2010
