Death by software

Source: Nikkei Asian Review

On Dec. 7, along with hundreds of others at Paris Orly Airport, I spent the day waiting for a flight to London. As we found out after several hours, all flights to the London area had been canceled because of a communications glitch. The cause: a software problem in the internal phone system of the U.K.’s National Air Traffic Services.

As the BBC reported, the software failure happened when controllers working overnight were due to make the handoff to the day team at around 6 a.m. “To be clear, this is a very complex and sophisticated system with more than a million lines of software,” an NATS spokesman was quoted as saying. “This is not simply internal telephones, it is the system that controllers use to speak to other (air traffic control) agencies both in the U.K. and Europe and is the biggest system of its kind in Europe.”

Eurocontrol, which manages European air safety, said around 1,300 flights, or 8% of all air traffic on the Continent, had been severely delayed. Almost 10% of flights at London Heathrow Airport were canceled.

A gentleman en queue with me turned and said, “Well, at least no one dies from software problems.”

If only that were true.

Computers are increasingly being introduced into safety-critical systems and, as a consequence, have been involved in accidents. Two of the most widely cited software-related accidents involved computerized radiation therapy machines.

The first involved the Therac-25 radiation machine. From June 1985 to January 1987, there were six known cases in which the machine gave massive overdoses, resulting in deaths and serious injuries.

The Therac-25 tragedy partly stemmed from a simple update to the user interface. The update allowed users to edit individual data fields, where before all data fields had to be re-entered if any error occurred. Because of timing problems, however, the software did not always recognize those edits.
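For readers who want to see how a timing problem of this general kind can arise, here is a minimal sketch in Python. It is not the Therac-25 code and does not claim to reproduce it. The function name, the field names, the values and the eight-tick setup time are all hypothetical, chosen only to show how an edit made while a slow setup step is under way can be missed.

```python
SETUP_TICKS = 8  # hypothetical: how long the hardware setup takes, in ticks


def run_session(edit_tick, restart_on_edit):
    """Simulate one treatment setup; return (screen, hardware) settings.

    All names and values here are illustrative assumptions.
    """
    screen = {"mode": "xray", "dose": 200}  # what the operator sees
    hardware = dict(screen)  # setup works from a snapshot of the screen
    ticks_left = SETUP_TICKS
    tick = 0
    while ticks_left > 0:
        if tick == edit_tick:
            # The operator corrects a single field while setup is running.
            screen["mode"] = "electron"
            if restart_on_edit:
                # Safer behavior: take a fresh snapshot and restart setup.
                hardware = dict(screen)
                ticks_left = SETUP_TICKS
        ticks_left -= 1
        tick += 1
    return screen, hardware


# Without a restart, the screen and the hardware end up disagreeing.
screen, hardware = run_session(edit_tick=3, restart_on_edit=False)
print(screen["mode"], hardware["mode"])

# With a restart, the hardware reflects what the operator sees.
screen, hardware = run_session(edit_tick=3, restart_on_edit=True)
print(screen["mode"], hardware["mode"])
```

The point of the sketch is only that the display and the machine can silently disagree when an edit lands in the wrong window of time. Real safety-critical systems need far more than a restart flag to rule that out.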

Years later, in January 2006, then 15-year-old Lisa Norris was a patient at the Beatson Oncology Centre in Glasgow, Scotland. While she was undergoing radiation therapy for a relatively rare and complex brain tumor, it was discovered that the 17 dose fractions she had received were some 58% higher than prescribed. Norris died in October 2006, her death hastened by the overexposure.

Dangerous secrets

While much can be learned from such accidents, fears of potential liability or loss of business make it difficult to find out the details behind serious engineering mistakes.

“Placing barriers in the way of widespread dissemination of relevant details of adverse events is a way of preventing learning in any organization,” said Dr. John Wreathall of the Resilience Engineering group. “Bear in mind that one hallmark of a resilient organization is that it is prepared not only for its own failures, but those which it can learn from others. The more resilient an organization is, the larger are the lessons it has learned from others.”

This can manifest itself in several ways. One is recognizing a broader set of challenges that the organization can face, including those it creates for itself as a result of its own activities. This helps the organization better understand “what went wrong” and calibrate itself against the experiences of others.

Of course, just having the data available will not in itself ensure safety. But cutting off the public dissemination of data will ensure that accidents can be repeated.

As Dr. Nancy Leveson wrote in her Therac-25 investigation report: “Most accidents are system accidents; that is, they stem from complex interactions between various components and activities. To attribute a single cause to an accident is usually a serious mistake. We want to emphasize the complex nature of accidents and the need to investigate all aspects of system development and operation to understand what has happened and to prevent future accidents.”

These problems are not limited to the medical sector. It is still a common belief that any good engineer can build software, regardless of whether he or she is trained in state-of-the-art software-engineering procedures.

Software engineering is a young discipline. A liberal estimate puts its age at 63 years. We should not be surprised that software and its practitioners have few construction standards, procedures, guilds, review boards, licensing regimes, acceptable manufacturing practices or continuing education requirements such as we may find in structural engineering, law, architecture and even massage therapy.

After all, when civil engineering was the same age as software engineering is now, the wedge had not yet been invented.

Woody Epstein serves as manager of risk consulting at Lloyd’s Register Consulting Japan. From 2011 to 2012 he was also a visiting scientist at the Ninokata Laboratory of the Tokyo Institute of Technology, where he was involved in analyzing the Fukushima disaster. The opinions expressed in this regular column do not reflect those of his employers, their affiliates or clients.