
Tesla first started making promises about full self-driving vehicles in 2013. Since then, AI models have been trained on millions of miles of driving. Surely that means they've seen every possible situation, and that full self-driving is just around the corner. So where's my self-driving car?
Let’s look at a list of some of the conditions I’ve encountered while driving or being a passenger in a vehicle. (It's a long list. When you think you’ve got the idea, you may want to skip to the next section!):
I’m not a commercial driver, but I’ve encountered all except one of these at least once since I started driving. Each problem is unique. Each problem is less about the mechanics of driving and more about understanding the wider environment and taking the appropriate action. Some of these are hard problems even for a human (ever had to negotiate which way you should be turning with a police officer directing traffic? Ever had to negotiate your way past a police road block? Is it better to hit a deer or drive off the road into a ditch?)
Self-driving trains work because they operate in a highly constrained environment. Problems requiring human judgement have mostly been engineered out of the system with fences, barriers, etc. Trains travel fast, can’t stop quickly, and many lives have been lost in previous accidents. Multiple redundant safety systems are incorporated to reduce the risk of collision with another train.
A real full self-driving car needs to work in an unconstrained environment, where the workings of the outside world need to be understood and taken into account. That requires a degree of general intelligence which no autonomous vehicle is likely to have in the foreseeable future. Expect driver assist, not driver replacement. Self-driving taxis will therefore continue to require occasional remote human intervention, whatever the grandiose promises being made to investors.
The AI model for self-driving cars may be able to train on common conditions — but it’s the unforeseen uncommon conditions that will kill you.
Stay up to date with the free Risky Thinking Newsletter.

Santa looked exhausted. Christmas always takes a lot out of him, but normally he has recovered by now.
"So what happened?" I asked him.
He stared at me, then whispered the word as if saying it out loud would cause his troubles to return: “ChatGPT”.
Not ChatGPT specifically, he quickly explained. But the idea of ChatGPT. That was the cause of the series of disasters that had almost caused Christmas not to happen.
"Our IT operation here is one of the largest on the planet. We have to determine who deserves presents, handle vast amounts of correspondence in many languages, and handle the logistics of sourcing, constructing, and delivering items from all over the world.
"Our elves have always been at the forefront of technology. They have to be. And when we read about the recent advances in Large Language Models (LLMs), we immediately looked for ways AI could be used to streamline our operations.
"Language translation has always been an issue for us. Wouldn’t it be great if our elves could work in their native language, with machine translation to and from other languages? That would solve so many problems. Do you know how difficult it is to find an elf who is able to read letters written in Xhosa or Mirandese? Our elves would even be able to say QISmaS DatIvjaj ‘ej DIS chu’ DatIvjaj wih or Ĝojan Kristnaskon.
"Then we thought about the Naughty or Nice assessment. We have to handle billions of reports to decide whether people are naughty or nice. In AI terms, that’s a problem known as “sentiment analysis”.
"We also considered using AI to help our Human Resources department. Perhaps we could evaluate all our elf workers with a simple AI interview. That would save our managers a lot of effort and improve consistency in assessments.
“Finally there’s public relations. Our brand looks old. We haven’t done much to modernize our image since our questionable tie-up with Coca-Cola in the 1930s. Some people think we should project a more modern image.”
“It all sounds very ambitious”, I remarked, “but it's quite conservative compared to some of the proposals I’ve been hearing recently. So what went wrong?”
He sighed. “Where shall I start? It wasn’t like we didn’t prototype things. But although we did pilot studies, things started to go wrong when we moved to production”.
“Perhaps we should start with the Naughty and Nice list?”
“Yes, that one’s definitely the biggie. The Naughty or Nice list drives our entire operation. Easy, we thought. We have over a thousand years of training data - written reports made by our agents around the world, each helpfully tagged with a classification of Naughty or Nice. We immediately started digitizing our archive and feeding the results into our learning algorithm. But when we started testing on real data we realized we had a problem. Does a report of showing too much skin in public mean you are Naughty or Nice, or is it neutral? If you often hang out with a group of naughty people, does that make you naughty too? It depends upon your sex, your status, and where and when you are. It’s easy to create accidental bias if you don’t understand local social norms. We found that by combining all the reports we’d created a monster of a model which was both sexist and racist, with a penchant for Victorian moral values. That’s probably due to the excessive number of written reports from that era. By the time we’d realized the approach wasn’t viable - world-wide historical data couldn’t provide the situation-specific classification we needed - we’d wasted a lot of effort. It took a lot of manual processing to get things back on track. If you didn’t get a Christmas present or lump of coal last year, that might be the reason.”
“So the Naughty or Nice list didn’t go so well. How did using large language models for email work out? Machine translation is getting pretty good now, isn’t it?”
Santa looked me straight in the face. “You realize how little is written in Elvish, don’t you? There’s almost nothing published in modern Elvish, just a few poems composed by Tolkien, who wasn’t even a native speaker. We quickly gave up on machine translation into modern Elvish. But even though they had been told not to, some elves started using ChatGPT to write email to our suppliers. And since the output was in a foreign language, often they didn’t read the email before sending it. Apparently the language model we were using had been trained on online reviews. It couldn’t explain why, but it started inserting plausible reasons for delaying or avoiding payments into outgoing emails. We discovered the problem when a supplier received an email complaining about the poor audio quality of Christmas Trees. Shortly afterwards, we falsely accused Apple of making fake Apple products. We had to fire a few elves over that one. Once our elves understood the possibility of AI hallucinations and what might happen if they didn’t check the output, it didn’t happen again.”
“You said you tried using large language models for HR?”
“Yes. Somebody in our HR department thought it would be a good idea to use a widely available LLM to generate our annual employment reviews. We’re not like most organizations. It generated bland questionnaires and tried to set our elves quite unreasonable objectives. “Where do you see yourself five years from now?” isn’t something to ask an elf working in the toy factory at the North Pole. Asking an elf about their “personal growth objective” is particularly insulting. People accept that somebody can make the occasional mistake, but with the sudden burst of AI-assisted mistake making, we nearly had a strike on our hands.”
“Surely you had some successes?”
“Public relations and branding. That mostly worked. As you know, our brand is very important to us. Think what the world would be like if everybody still thought about Krampus at Christmas! We used generative AI to come up with some new ideas, something that is very hard when you’ve been doing the same thing for hundreds of years. We came up with a few good ideas which unfortunately I can’t tell you about as we prefer to run our PR campaigns in stealth mode. We also tried using generative AI to come up with some new images. That didn’t work out so well. Initially our creatives were worried about their jobs. But when they discovered that the generative AI network we were using had an obsession with buxom females in fantasy landscapes… Perfect, perhaps, if our brand was aimed at teenage boys. But we’re family-oriented. Their jobs are definitely safe for another year.”
Santa looked quickly to one side. “I have to go shortly. Do you have any other questions?”
“Can you tell me what I’ll be getting for Christmas next year?” It was worth a try.
“As a large language model I only have knowledge of the world before last December. Is there anything else I can help you with today?”
“You’re pretending to be an AI generated Santa so you don’t have to answer my question, aren’t you?”
Santa smiled for the first time and guffawed loudly. “Ho, ho, ho” he roared. “I bet you didn’t think of using Artificial Intelligence as an excuse to avoid questions! But trust me, you really don’t want to know the answer.”
And at that point Santa closed the connection.
My interest in accurate timekeeping probably dates from my school days, when there were serious bragging rights to be earned from having an accurate watch which exactly matched the beeps of the Greenwich Time Signal on BBC Radio. Therefore, when a YouTube channel I followed showed how to create your own network time server using a cheap ESP32 micro-controller, I thought about building it. So I examined the AI-generated example code.
I’ve tried to keep this at a high level, but to understand the problems you have to understand some of the details - and that is partly the point.
The code has to do two things:
The AI-generated code:
The code “works”, in that it compiles and would give a syntactically correct reply if sent an NTP packet. It looks plausible.
But the code is wrong. Very wrong. And looking at how it is wrong gives some indication of the dangers of “vibe” coding.
I think you get the idea.
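For a sense of the details involved: here is a minimal sketch of a correct NTP server reply, written by me in Python for readability rather than the ESP32’s C++. The stratum, precision, and reference ID values are illustrative assumptions for a GPS-disciplined stratum-1 server, and the 1900-based NTP epoch offset is exactly the kind of detail that is easy to get wrong while still producing plausible-looking packets.

```python
import struct

# Assumption: a stratum-1, GPS-disciplined server. The NTP timescale counts
# seconds from 1900-01-01, Unix from 1970-01-01; confusing the two epochs
# is a classic source of plausible-looking but wrong replies.
NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01

def ntp_timestamp(unix_time: float) -> tuple[int, int]:
    """Split a Unix time into NTP seconds and a 32-bit fraction."""
    seconds = int(unix_time) + NTP_EPOCH_OFFSET
    fraction = int((unix_time % 1) * (1 << 32))
    return seconds, fraction

def build_reply(request: bytes, now: float) -> bytes:
    """Build a minimal 48-byte NTP server reply (mode 4)."""
    li_vn_mode = (0 << 6) | (4 << 3) | 4    # leap=0, version=4, mode=4 (server)
    secs, frac = ntp_timestamp(now)
    return struct.pack(
        "!BBBb11I",
        li_vn_mode, 1, 0, -20,              # stratum 1, poll 0, precision 2^-20
        0, 0,                               # root delay, root dispersion
        0x47505300,                         # reference ID "GPS\0" (illustrative)
        secs, frac,                         # reference timestamp
        *struct.unpack("!2I", request[40:48]),  # originate = client's transmit
        secs, frac,                         # receive timestamp
        secs, frac,                         # transmit timestamp
    )
```

A real server would also bind UDP port 123, timestamp the packet on receipt rather than reuse one time for every field, and report root dispersion honestly - all places where “looks plausible” and “is correct” diverge.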
But perhaps, I thought, I’m being unfair to the human joint-author of this code. He doesn’t claim to be an expert programmer. I therefore experimented with some “vibe coding” myself using Google’s gemini-cli for comparison. I tried a number of simple Python coding tasks, from adding functionality to an existing script to generating a simple command line application in Python. The results were mixed:
However, it was a great deal of fun and it felt as if it was productive…
The “vibe” code worked except:
But in the GPS case the code did look plausible and would probably pass some simple tests. Apart from the off-by-one second error, the faults would be unpredictable and intermittent - i.e. the type of faults which are most difficult and expensive to find and diagnose, especially after deployment.
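The faulty code isn’t shown here, so the following is purely my illustration of how an off-by-one (or off-by-eighteen) second error creeps into GPS time handling. The constants are real, well-known values; the scenario of a device reporting GPS week and seconds-of-week is an assumption.

```python
# Hypothetical illustration - not the code discussed above.
GPS_EPOCH_UNIX = 315964800      # 1980-01-06 00:00:00 UTC (the GPS epoch) as Unix time
SECONDS_PER_WEEK = 7 * 24 * 3600
GPS_UTC_LEAP_SECONDS = 18       # correct as of 2017; changes when new leap seconds are announced

def gps_to_unix(week: int, seconds_of_week: float,
                leap_seconds: int = GPS_UTC_LEAP_SECONDS) -> float:
    """Convert GPS week + seconds-of-week to Unix (UTC) time.
    GPS time is ahead of UTC by an integer number of leap seconds;
    omitting leap_seconds, or hard-coding a stale value, produces a small
    constant error that still passes a casual glance at the clock."""
    return GPS_EPOCH_UNIX + week * SECONDS_PER_WEEK + seconds_of_week - leap_seconds
```

An intermittent variant of the same trap: an NMEA sentence describes the fix that preceded it, so pairing sentences naively with PPS pulses can tag a second boundary with the wrong second.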
The code would probably fool an inexperienced programmer or someone who was unfamiliar with the hardware specifications. And that’s the problem. Does the person doing “vibe coding” actually know enough to know what they don’t know?

There are two business continuity plans at work here: the VoIP provider’s and the data center’s.
From the VoIP provider’s perspective this should have been a quick (if not automated) switch-over to a different data center with partial (if not full) functionality. It’s clear that the possibility of a complete data center failure was either (a) forgotten, (b) ignored, or (c) judged too costly to fully mitigate given the probability of it happening.
Whatever the reason, it doesn’t make them look good. And the fact that I only found out about their problems by encountering them myself rather than being warned in a helpful customer email or in a statement on their website does little to inspire confidence.
Even a rudimentary business continuity plan should include warning customers about service problems.
So the VoIP provider is handling the outage badly. They are cheap and full-featured, but I can no longer say that they are reliable. I won’t recommend them any more, and will be looking for a suitable replacement.
But what of the data center? Why isn’t it back yet?
Fortunately (if you know the right place to look and the right discussion board to follow) you can find out that the data center they used actually cared about notifying its customers. Here’s what happened:
Power remains off at our data center in REDACTED per the local fire marshal.
We have had an electrical failure with one of our redundant UPS’ that started to smoke and then had a small fire in the UPS room. The fire department was dispatched and the fire was extinguished quickly. The fire department subsequently cut power to the entire data center and disabled our generators while they and the utility verify electrical system. We have been working with both the fire department and the utility to expedite this process.
We are currently waiting on the fire marshal and local utility to re-energize the site. We are completely dependent upon their inspection and approval. We are hoping to get an update that we can share in the next hour.
At the current time, the fire department is controlling access to the building and we will not be able to let customers in.
And this is what business continuity plans often fail to take into account when considering the risk of a small fire. If it’s electrical, the priority of the fire department and the local electrical utility is safety. The fire may have been small. The fire may have been swiftly extinguished. But the fire marshal’s job is to ensure safety. That means shutting off all electrical systems and not taking any chances.
And when everything is shut down, and after repairs are made, it takes a surprisingly long time to switch everything back on and get it working correctly. Here’s an update from twenty-four hours later:
The REDACTED data center remains powered down at this time per the fire marshal. We are continuing with our cleanup efforts into the evening and working overnight as we make progress towards our 9AM EDT meeting time with the fire marshal and electrical inspectors in order to reinstate power at the site.
Once we receive approval and utility is restored, we will turn up critical systems. This will take approximately 5 hours. After the critical systems are restored, we will be turning up the carriers and then will start to turn the servers back on.
The fire marshal has requested replacement of the smoke detectors in the affected area as well as a full site inspection of the fire life safety system prior to allowing customers to enter the facility. Assuming that all goes as planned, the earliest that clients will be allowed back into the site to work on their equipment would be late in the day Wednesday.
The points to note here are that:
TL;DR? When planning remember that even small fires with limited damage can have major consequences.
I received an email survey today. Not surprising. Many companies send them out. This one was from Southwest Airlines, and it offered me a reward of $100 for completing their survey.
Except it wasn’t. I knew this immediately as I have never flown on Southwest Airlines.
This was a phishing email directing me to a plausible-sounding survey website. The fraudsters hadn’t bothered to get a free SSL certificate for their plausible domain, perhaps because newly issued certificates appear in public Certificate Transparency logs, which can trigger alerts.
So should I warn Southwest Airlines? I went to their website. Was there an email address or form to fill in to let them know about the website “southwestairlinessurvey.today” phishing their customers? No there wasn’t. Plenty of ways to report lost baggage, but no obvious way to report an issue to their security team - assuming they have one. I could have found out how long their customer service queues were by calling their 1-800 number, and found out whether their customer service agents knew how to contact their security team, but I didn’t.
I’m neither a customer nor a shareholder, so in a literal sense it’s none of my business. I will do things that are easy to do if it helps make society better. But I won’t put in a lot of effort to help a company that doesn’t make it easy to help them. I suspect I’m not alone in this.
If you care about people phishing your customers, make it easy to report it: otherwise you will only hear about it much later, from disgruntled customers who believe you cheated them out of rewards, goods, or services.