And the Oscar goes to... Amazon S3. Learning from Human Error
Last month was an interesting time for human errors. A typing mistake by an engineer at Amazon resulted in their S3 service experiencing "higher than normal error rates", where the "higher than normal error rates" included the value 100%. Amazon S3 is an old and critical part of their cloud infrastructure. It's essentially a highly reliable file storage system, with the additional feature that the files can be served directly to the internet. Many of Amazon's other services depend on it, as does the infrastructure of literally thousands of other companies. When it stops working, large parts of the internet stop working too.
And then there were the Oscars… Concealed in the wings at each side of the stage are two accountants from PricewaterhouseCoopers, each with a set of envelopes containing the names of the prize winners. Why two? This is so that the prize presenters can enter from either side of the stage and receive an envelope just before making their announcement. Depending upon which side the presenter chooses to make an entrance, the accountant must either hand out an envelope or discard an envelope. At the penultimate presentation each accountant has two envelopes left. I'm speculating a bit here, but when holding only two envelopes it's easy to confuse the envelope to be discarded with the envelope to be retained. The wrong envelope is handed to Warren Beatty (one of the presenters). He realizes there is a problem (the card names an actress and a film instead of just a film). His co-presenter, Faye Dunaway, does what every good actor does when confronted with failing props or someone who has forgotten their lines: she carries on as best she can. She reads out the name of the film on the card (La La Land) and the Oscar is awarded to the wrong film. Pandemonium ensues, and the Oscars ceremony is slightly more interesting than usual.
Both of these were very visible human errors, but the reactions of the organizations involved were very different. Amazon, after spending a number of hours fixing the problem, issued a detailed statement explaining what went wrong. It emphasized that the unnamed staff member involved had been following proper procedures and that mechanisms were being put in place to prevent a similar error from occurring again. They recognized that most human errors are faults in procedure, not carelessness.
Contrast this with the Academy of Motion Picture Arts and Sciences (the Academy), which runs the Oscars: their reaction was to "fire" the accountants involved and announce that they were being banned from all future Oscar ceremonies.
By publicly blaming both of the accountants involved (even though one of them made no error), the Academy has replaced two people who had successfully carried out their job for the previous eight years with new people with zero experience. Although the job sounds simple, it's actually quite complex. The votes need to be tallied securely; the results need to be kept secret (people bet on the outcomes, and the surprise announcements create dramatic tension); precautions must be taken against the envelopes not reaching the ceremony due to a road accident or travel delay; the final result must be open to audit, etc. Before this year's ceremony there was quite a bit of publicity surrounding the precautions taken.
Which organization learned from its mistake and is less likely to make a mistake in future?
I'm still confident in Amazon's web services: I figure that another class of future problems has just been eliminated. But the Academy? Well, I guess their reaction helps to explain the existence of Superbabies: Baby Geniuses 2 (2004) …
Michael Z. Bell