Y2K Revisited

In a recent radio show, the glib radio host implied that the Y2K risk never really existed. Was he right? Or do his comments tell us more about human nature than about the risks arising from time representation in computer systems?

It isn't the first time I've heard it, and it's becoming increasingly annoying: a glib radio presenter dismissing the whole Y2K risk because nothing happened.

Y2K as a risk was, of course, excessively hyped. Here was a risk where pundits could (and did) make up the most alarmist scenarios possible. Airplanes would fall out of the sky, elevators would stop working, the electricity grid would shut down, bank machines would go out of order, there would be panic in the streets, and the dead would rise up from their graves. (OK, I may have made up the last one.)

The truth of the matter was that if certain computer systems weren't fixed, they would stop working, or start working incorrectly. There are very few safety-critical systems that use the year or day in safety-critical calculations and would have a reason to store a date in YYMMDD form. Why would they? It's a terrible number format to work in when making time difference calculations between days, let alone between seconds or milliseconds. And while an elevator could use the day of the week to optimize its behavior or lock out certain floors at weekends, no sane person would design it to stop functioning entirely if the date was wrong. The same goes for most other devices.
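To see why a two-digit year is awkward for date arithmetic, here is a minimal Python sketch. The helper parse_yymmdd_assuming_1900 is hypothetical and simply stands in for any legacy routine that prefixes "19" to the year; it is not taken from any particular system.

```python
from datetime import date

def parse_yymmdd_assuming_1900(text):
    """Hypothetical legacy parser: always treats the two-digit year as 19YY."""
    yy, mm, dd = int(text[0:2]), int(text[2:4]), int(text[4:6])
    return date(1900 + yy, mm, dd)

new_years_eve = "991231"  # 31 December 1999
new_years_day = "000101"  # intended to mean 1 January 2000

# Subtracting the raw numbers gives a figure with no meaning as a count of days.
naive_difference = int(new_years_day) - int(new_years_eve)

# Parsing with the 19YY assumption places the later date a century in the past.
legacy_days = (parse_yymmdd_assuming_1900(new_years_day)
               - parse_yymmdd_assuming_1900(new_years_eve)).days

# With four-digit years the two dates are one day apart.
correct_days = (date(2000, 1, 1) - date(1999, 12, 31)).days

print(naive_difference, legacy_days, correct_days)
```

The two dates are a single day apart, yet neither the raw numeric subtraction nor the 19YY parse gets anywhere near that answer.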

But while the life-threatening scenarios were wrong, there clearly was a risk for many businesses using older software. The designers of the software had no reasonable expectation that their software would still be in use by Y2K. Even if the designers had taken this into account, it's highly unlikely that anyone would have thought it necessary to include in end-user documentation a “Y2K Statement” that the software would continue working after the year 2000. While some software now states that it will handle dates in a particular range (for example, MySQL currently states that its TIMESTAMP format is good until at least 2030), until the Y2K scare made end-users sensitive to dates there was little reason to put such information in manuals.

In the absence of information about the performance of software past the year 2000, it was necessary for companies to check their existing code (if they wrote it themselves) or check with their vendors (if they were still in business) to see what might happen as the year digits rolled over into the year 2000. This was expensive. There was a real risk that companies, their vendors, or their suppliers might not complete such work in time. From a corporate perspective, the risk was real. Fortunately it was acted upon: companies took the necessary steps, and governments protected themselves against the wilder scenarios. When the year 2000 finally came in, nothing much happened.

But if you were disappointed with Y2K, consider this: many systems use signed 32-bit integers to represent time (in seconds) since 00:00 on January 1st, 1970 GMT. This is known as the Unix Epoch, and is enshrined in the time_t data type used in the programming languages C and C++. Many programs are written in these languages, and it's not unlikely that they contain errors which will occur on Tuesday, 19th January 2038 at 03:14:08 UTC, when any signed 32-bit integers representing time since the Unix Epoch will overflow to a large negative number.
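The arithmetic behind that rollover can be checked with a short Python sketch. Python integers do not overflow, so the hypothetical helper to_int32 imitates the wraparound of a signed 32-bit two's-complement counter. That is an assumption about typical hardware behaviour: the C standard itself leaves signed overflow undefined.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # the Unix Epoch
INT32_MAX = 2**31 - 1                              # largest signed 32-bit value

def to_int32(n):
    """Wrap an integer into the signed 32-bit range (two's complement)."""
    return (n + 2**31) % 2**32 - 2**31

# Last second that a signed 32-bit seconds counter can represent.
last_good = EPOCH + timedelta(seconds=INT32_MAX)

# One second later the counter wraps to the most negative value...
wrapped = to_int32(INT32_MAX + 1)

# ...which, read as seconds since the Epoch, is a date in 1901.
wrapped_date = EPOCH + timedelta(seconds=wrapped)

print(last_good.isoformat())     # 2038-01-19T03:14:07+00:00
print(wrapped)                   # -2147483648
print(wrapped_date.isoformat())  # 1901-12-13T20:45:52+00:00
```

A program that keeps time this way would jump from January 2038 back to December 1901 in a single tick.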

Not a credible scenario? Consider this: Microsoft Windows has a convenient function GetTickCount() which returns the number of milliseconds since Windows booted as an unsigned 32-bit integer. Approximately every 49.71 days this number wraps around back to zero. Windows 95 and Windows 98 both require patches to remain functioning when this occurs, so it took at least two years for a bug in this area to be identified and fixed.
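The 49.71-day figure, and the kind of mistake a wrapping tick counter invites, can be illustrated with a small Python sketch. It does not call the real GetTickCount(). The tick values and the two elapsed_* helpers are made up for illustration.

```python
WRAP = 2**32                      # distinct values of an unsigned 32-bit counter
MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day

# How long a millisecond counter runs before returning to zero.
days_until_wrap = WRAP / MS_PER_DAY

def elapsed_naive(start, now):
    """Plain subtraction: goes badly wrong once the counter has wrapped."""
    return now - start

def elapsed_wrap_safe(start, now):
    """Subtraction modulo 2**32: correct for intervals shorter than one wrap."""
    return (now - start) % WRAP

start = WRAP - 1000  # tick value one second before the wrap
now = 500            # tick value 1.5 seconds later, just after the wrap

print(round(days_until_wrap, 2))      # 49.71
print(elapsed_naive(start, now))      # -4294965796
print(elapsed_wrap_safe(start, now))  # 1500
```

In C, subtracting two unsigned 32-bit values performs the modular arithmetic automatically. Trouble tends to arise when tick values are compared directly or stored in wider or signed types.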

At 4:30pm on September 14, 2004 the FAA's radio system in Palmdale (which uses Microsoft Windows) failed. A bug in the software meant that the system had to be rebooted at regular intervals to prevent failure. Unfortunately a technician forgot to reboot the system, resulting in the failure. A backup system also failed because of “human error”. The result was a loss of communication for three hours with aircraft over an area of some 178,000 square miles. The software “glitch” apparently exists in other air traffic control centers. 30,000 passengers were affected, 450 flights were diverted or canceled, and another 150 were delayed.

Which takes me back to the glib radio presenter. Because no planes fell out of the sky and nothing really bad happened, he believed that the Y2K risk never existed. By his faulty logic, those who warned of this risk deserved ridicule, since they must have been crying wolf.

What he failed to realize was that nothing happened because people recognized the risk and took the appropriate counter-measures.

This was a case where risk management actually worked.

22 April 2005
