Small Fire, Large Outage
There are two business continuity plans at work here: the VoIP provider’s and the data center’s.
The VoIP Provider
From the VoIP provider’s perspective, this should have been a quick (if not automated) switch-over to a different data center with partial (if not full) functionality. It’s clear that the possibility of a complete data center failure was either (a) forgotten, (b) ignored, or (c) judged too costly to fully mitigate given the probability of it happening.
Whatever the reason, it doesn’t make them look good. And the fact that I only found out about their problems by encountering them myself rather than being warned in a helpful customer email or in a statement on their website does little to inspire confidence.
Even a rudimentary business continuity plan should include warning customers about service problems.
So the VoIP provider is handling the outage badly. They are cheap and full-featured, but I can no longer say that they are reliable. I won’t recommend them any more, and I will be looking for a suitable replacement.
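To make the idea of a quick switch-over concrete, here is a minimal sketch of picking an active site from a priority list. Nothing is known about how this provider actually routes traffic, so everything here is an assumption for illustration: the `Site` type, the site names, and the `is_healthy` callback are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Site:
    """A hypothetical data center that can host the service."""
    name: str           # illustrative label, not a real site
    full_service: bool  # False means only partial functionality is available


def choose_active_site(
    sites: Sequence[Site],
    is_healthy: Callable[[Site], bool],
) -> Optional[Site]:
    """Return the first healthy site in priority order, or None if all are down."""
    for site in sites:
        if is_healthy(site):
            return site
    return None


if __name__ == "__main__":
    primary = Site("primary-dc", full_service=True)
    standby = Site("standby-dc", full_service=False)

    # Assumed scenario: the primary site has lost power.
    down = {"primary-dc"}
    active = choose_active_site([primary, standby], lambda s: s.name not in down)
    print(active)
```

The point of the sketch is that a standby with only partial functionality still beats no service at all, and that the "all sites down" case has to be handled explicitly rather than assumed away.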
But what of the data center? Why isn’t it back yet?
The Data Center
Fortunately (if you know the right place to look and the right discussion board to follow) you can find out that the data center they used actually cared about notifying its customers. Here’s what happened:
Power remains off at our data center in REDACTED per the local fire marshal.
We have had an electrical failure with one of our redundant UPS’ that started to smoke and then had a small fire in the UPS room. The fire department was dispatched and the fire was extinguished quickly. The fire department subsequently cut power to the entire data center and disabled our generators while they and the utility verify electrical system. We have been working with both the fire department and the utility to expedite this process.
We are currently waiting on the fire marshal and local utility to re-energize the site. We are completely dependent upon their inspection and approval. We are hoping to get an update that we can share in the next hour.
At the current time, the fire department is controlling access to the building and we will not be able to let customers in.
And this is what business continuity plans often fail to take into account when considering the risk of a small fire. If it’s electrical, the priority of the fire department and the local electrical utility is safety. The fire may have been small. The fire may have been swiftly extinguished. But the fire marshal’s job is to ensure safety. That means shutting off all electrical systems and not taking any chances.
And when everything is shut down, and after repairs are made, it takes a surprisingly long time to switch everything back on and get it working correctly. Here’s an update from twenty-four hours later:
The REDACTED data center remains powered down at this time per the fire marshal. We are continuing with our cleanup efforts into the evening and working overnight as we make progress towards our 9AM EDT meeting time with the fire marshal and electrical inspectors in order to reinstate power at the site.
Once we receive approval and utility is restored, we will turn up critical systems. This will take approximately 5 hours. After the critical systems are restored, we will be turning up the carriers and then will start to turn the servers back on.
The fire marshal has requested replacement of the smoke detectors in the affected area as well as a full site inspection of the fire life safety system prior to allowing customers to enter the facility. Assuming that all goes as planned, the earliest that clients will be allowed back into the site to work on their equipment would be late in the day Wednesday.
The points to note here are that:
- After a fire, even a small one, you are no longer in control of your building: the Fire Marshal is.
- The Fire Marshal’s priority is safety. Electrical systems will be switched off if there is any possibility that they present a hazard to firefighters or have been damaged in a fire.
- Electrical systems have to be cleaned, repaired and inspected before they can be re-energized.
- Smoke detectors and other fire safety system components may need to be replaced and the entire fire safety system may need to be checked before normal access to the building is allowed.
- After everything is completed, it can take a long time (in this case at least five hours) before critical systems are all working again.
- Even after this, customer systems will still need to be restored.
- In this case data loss was limited to that caused by systems being unexpectedly powered off. Hopefully this was taken into account in the design of all the application programs.
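On the last point, one common way for an application to tolerate an unexpected power-off is to never overwrite a file in place. The sketch below writes to a temporary file, forces it to disk, and then renames it over the target, so a reader sees either the old contents or the new ones. The function name `atomic_write` is hypothetical, and this is one possible approach under those assumptions, not a description of what any system in this outage did.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the file at `path` with `data` without exposing a partial write."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to stable storage
        os.replace(tmp_path, path)  # swap the new file into place
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If power is lost before the rename, the original file is untouched and only a stray temporary file is left behind. On POSIX systems, full durability of the rename itself may also require syncing the containing directory, which this sketch leaves out for brevity.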
TL;DR? When planning, remember that even small fires with limited damage can have major consequences.