Unexampled Events, Resilience, and PRA


Woody Epstein
ABS Consulting, Tokyo, Japan
sepstein@absconsulting.com


Abstract

“Why isn’t it loaded? Are you afraid of shooting yourself?”
“Of course not. These weapons don’t go off accidentally. You have to do five things in a row before they’ll fire, and an accident can seldom count higher than three … which is a mystery of probability that my intuition tells me is rooted at the very base of physics. No, it’s never loaded because I am a pacifist.”
— Field Marshal Strassnitzky of the First Hussars of the Belvedere during WW I[ref]Helprin, Mark (1991). A Soldier of the Great War, Harcourt and Brace, pg. 546[/ref]


1 INTRODUCTION

I hope that my contribution to this symposium will be read as an essay (literally “an attempt”), a literary genre defined as the presentation of general material from a personal and opinionated point of view; my contribution should not be read as an academic or technical piece of writing, well balanced and well referenced. Instead, please consider this as an attempt to clarify, mostly for myself, some general ideas and observations about unexampled events, resilience, and PRA.

I would like to discuss three topics:

  1. Some ideas about rare, unfortunate juxtapositions of events, those called unexampled events, leading to accidents in well-tested, well-analyzed man/machine systems;
  2. Some ideas about resilience and the relationship between unexampled events and resilience;
  3. Safety and PRA: focusing not on the numbers, but on the act of doing the PRA itself, several times, to increase the ability of an individual or organization to respond to unexampled events resiliently.

2 UNEXAMPLED EVENTS

From the PRA point of view, there are three senses of unexampled events. The first sense is that of an extraordinary, never before thought of, challenge to the normal, daily flow of a system or an organization, such as three hijacked airplanes concurrently flown into well-known American buildings. The second sense is a juxtaposition of seemingly disparate events, practices, and influences, usually over time, which as if from nowhere suddenly create a startling state, as the operators at Three Mile Island discovered. The third sense of an unexampled event is one whose probability is so small that it warrants little attention, even if the consequences are severe, because though one can postulate such an event, there is no example, or evidence, of such an event ever occurring.

From my point of view, I believe that inattention to all three types of unexampled events that can lead to severe consequences poses a grave danger to the safety of vigilantly maintained, well-tested systems and organizations. Of special concern to me is the question of whether PRA can help an organization respond to significant unwanted, unexampled events.

I would like to discuss my notions of unexampled events, not solely as an intellectual exercise (which does have its own beauty and purpose), but also as the first steps towards (1) looking at safety in a different way, and (2) defining what it means to be resilient to unexpected impacts on safety.

In the spirit of Kaplan [ref]Kaplan, S. & Garrick, B.J. (1981). On the Quantitative Definition of Risk. Risk Analysis, Vol. 1, No. 1.[/ref], an event can be defined by three attributes: a scenario, a likelihood, and a consequence. Mathematically, risk analysts express this as e = [s, l, c]. The letter s is a description of the event, the scenario; l is the likelihood of the event actually occurring, perhaps some measure like the odds a bookmaker gives; and c is the consequence of the event, sometimes a measurable quantity like money or deaths/year, but often a list or description.

As an example, let e be an event that entails the release of toxic chemicals into the environment. So e = [s, l, c], where s is “the canister of toxic fluid was dropped from 2 meters”, l is the judgment “not very likely”, and c is the list “canister breaks, release of chemical on floor, cleanup necessary, no deaths”.

It is easy then to imagine a set of events big E = {ei}, where big E might be defined as “the set of all events where canisters of toxic fluid are dropped”. Some of the little ei in big E are of special interest to the PRA analyst: the events where canisters leak, where workers are injured, where toxic fumes get into the ventilation system. To practice our art, or deception, we try to enumerate all of the events we can think of which lead to the consequences of interest, somehow give odds for the occurrence of each event, and present the results in such a way that decisions can be made so as to prevent, or lessen the impact of, the unwanted consequences.
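A minimal sketch of this bookkeeping, in Python, with purely hypothetical scenarios and numbers (they come from no actual PRA), might look like this:

```python
from dataclasses import dataclass

@dataclass
class Event:
    scenario: str       # s: a description of what happens
    likelihood: float   # l: here a probability, though it could be a bookmaker's odds or a judgment
    consequence: str    # c: often a list or description rather than a single number

# A small, purely hypothetical set E of "canister dropped" events
E = [
    Event("canister dropped from 2 meters, no leak", 1e-2, "cleanup necessary, no injuries"),
    Event("canister dropped, contents reach ventilation system", 1e-5, "evacuation, possible injuries"),
    Event("canister dropped during transfer, worker struck", 1e-4, "one injury"),
]

# The events of special interest to the analyst: those whose consequences we care about
of_interest = [e for e in E if "no injuries" not in e.consequence]
for e in of_interest:
    print(f"{e.scenario}: l = {e.likelihood}, c = {e.consequence}")
```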

Continuing the example of toxic chemicals, suppose by some black art we can assemble all of the events which lead to deaths by inadvertent chemical release, measure the likelihood of each event as a probability between 0 and 1, and then finally assign a number of deaths that could result from each event. We could then plot the results, such as the curve idealized in Figure 1 (in most situations the more severe consequences actually do have a lower probability of occurrence than less severe, but the curve is usually more jagged).

Each point on the curve represents an event in terms of number of deaths and probability (there may be many events with the same probability, of course, and in that case methods exist for combining their probabilities in a suitable manner). The extreme right part of the graph, the tail, is the home of unexampled events, what are called the outliers, the low probability/high consequence events. There are also unexampled events in the tail on the left part of the graph (not much of a tail there), but they are usually of less interest because of their less severe consequences.

Of course, we have made one questionable assumption: we assumed that we have assembled all of the possible events that can lead to death by accidental release of toxic chemicals. Obviously, there will always be events not imagined and juxtapositions of circumstances not considered. But let us assume, for the moment, that in vigilantly maintained, well-tested systems, projects and organizations, these unconsidered unexampled events are of low probability. Later, I will add force to this assumption.

A rigorous definition would here give a method to locate the point on the graph where an event becomes unexampled. But this is not easily done, nor perhaps can it be done. An unexampled event is a normative notion, depending on cultural influences, personal history, and the events under scrutiny: is 1 out of 1,000,000 the limit for an unexampled event? However, if we look at how PRA characterizes risk, we can find an interesting connection between risk and the point where an event becomes unexampled.

In general, a PRA gives the odds for unwanted events occurring: the risk. In general, organizations that use the results of PRA (regulatory agencies, governments, insurance corporations) decide where they will place their marker on the graph and say, “Here is where the number of deaths is tolerable, since the probability of the event occurring is 1 out of 1,000,000; I’ll bet the event won’t happen.”

By the act of picking a point, one coordinate on the probability axis, one coordinate on the consequence axis, a decision maker makes an operational, normative definition of accepted, unwanted events; she decides what risks are acceptable in this situation. One can see that the shaded area of unexampled events in Figure 1 and the shaded area of unwanted, but accepted, events in Figure 2 bear a strong resemblance.
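As a sketch of the arithmetic behind Figures 1 and 2 (the events and the decision point below are invented for illustration), the decision maker's marker simply partitions the list of events:

```python
# Hypothetical (probability, deaths) pairs for events leading to death by chemical release
events = [
    (1e-3, 1), (5e-4, 2), (1e-4, 5), (1e-5, 20), (1e-6, 100), (1e-8, 1000),
]

# The decision maker's marker: one coordinate on each axis.  Events rarer than p_star
# are accepted (bet against), even when their consequences exceed c_star.
p_star, c_star = 1e-6, 50   # purely illustrative values

accepted  = [(p, c) for (p, c) in events if p <= p_star]   # the shaded tail: accepted, unexampled
eliminate = [(p, c) for (p, c) in events if p > p_star]    # the white area: to be driven out by vigilance

print("accepted (the tail):", accepted)
print("to be eliminated:   ", eliminate)
```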

Clearly, PRA bets that decreasing focus on low probability/high consequence events will not impact the total safety of the situation.

Imagine a complex, dangerous situation, such as the excavation and disposal of 500,000 chemical weapons. Assume that in the design and operations of this system there is a very high degree of reliability of equipment, that workers and managers are vigilant in their testing, observations, procedures, training, and operations so as to eliminate the unwanted events in the white area of the curve in Figure 2. Given that an accident does occur in this situation, will the superior vigilance and performance postulated lower the probability that the accident is severe? Surprisingly, at least to me 15 years ago, the answer is “No.”

In 1991 I was attempting a PRA of the software for the main engines of the NASA space shuttle. I was working with two more experienced colleagues, Marty Schuman and Herb Hecht. While planning tasks for the next day, Herb mentioned to me to pay close attention when doing a code walk-down on a software module which was seldom invoked during operations. The reason, he said, was “ … infrequently executed code has a higher failure rate than frequently executed code.” I had suddenly been awakened from my dogmatic slumbers.

Herein I summarize Herb Hecht’s ideas (first contained privately in “Rare Conditions – An Important Cause of Failures” [ref]Crane, P. & Hecht, H. (1994). Rare Conditions and their Effect on Software Failures. Proceedings Annual Reliability and Maintainability Symposium. [/ref]) from my copy of the first paper, now covered with the coffee stains and tobacco ashes of time. Like many artful and calm insights, Herb’s thesis is immediately grasped, perceived almost as a tautology; I had all of the same data available, but I had not seen the connection between them. Perhaps he was not the first to think of such an idea, however I believe that he was the first to see its implications for well-tested systems:

  1. In well-tested systems, rarely executed code has a higher failure rate than frequently executed code;
  2. Consequences of rare event failures in well-tested systems are more severe than those of other failures;
  3. Given that there is a failure in a well-tested system, significantly more of the failures are caused by rare events;
  4. Inability to handle multiple rare conditions is a prominent cause of failure in well-tested systems.

In short, we have tested out all of the light stuff and what we are left with are rare accidents with severe consequences in any well-tested software system. How does this apply to other well-tested, vigilantly maintained systems, with well trained staff and enlightened management, good operating procedures in place; do Herb Hecht’s observations about software systems apply to a process plant or nuclear facility? I believe that they do.
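A toy simulation, of my own construction and not Hecht's data, makes the point concrete: if testing finds and fixes faults roughly in proportion to how often a path is executed, the faults that remain pile up on the rarely executed paths.

```python
import random
from itertools import accumulate

random.seed(0)

# Toy model: 1000 code paths, each born with the same chance of containing a fault.
N_PATHS, FAULT_RATE, N_TESTS = 1000, 0.2, 50_000
paths = [{"rare": i < 100, "faulty": random.random() < FAULT_RATE} for i in range(N_PATHS)]

# During testing, rare paths are executed 50 times less often than frequent ones.
weights = [1 if p["rare"] else 50 for p in paths]
cum_weights = list(accumulate(weights))

for _ in range(N_TESTS):
    hit = random.choices(paths, cum_weights=cum_weights)[0]
    hit["faulty"] = False                    # any fault on an executed path is found and fixed

def residual_rate(rare: bool) -> float:
    group = [p for p in paths if p["rare"] == rare]
    return sum(p["faulty"] for p in group) / len(group)

print(f"residual fault rate, frequently executed paths: {residual_rate(False):.3f}")
print(f"residual fault rate, rarely executed paths:     {residual_rate(True):.3f}")

remaining = [p for p in paths if p["faulty"]]
rare_share = sum(p["rare"] for p in remaining) / max(len(remaining), 1)
print(f"share of the remaining faults that sit on rare paths: {rare_share:.2f}")
```

With these invented numbers, the faults on the frequently executed paths are essentially all gone, a visible fraction of the rare paths still carry theirs, and nearly all of the faults that remain sit on rare paths.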

Look at Figure 2 again. Our hypothetical chemical weapon disposal facility has calculated the risk of the unwanted events, and assigned a point to represent the risks it is willing to accept, the magenta area. The white area represents the unwanted events that the facility wants to eliminate entirely. By exceptional planning, maintenance, reliability of equipment, human factors, training, and organizational development skills, the facility is successful. The known and the easy problems are vanquished. What is left are the events in the magenta area, the accepted risks, the unexampled, rare events. So if there is a failure, chances are the failure is an unexampled event.

Moreover, Herb Hecht’s study makes the following observation: all of the software which failed from three rare events, also failed, perhaps less severely, from two rare events, and three-quarters of the software which failed from two rare events, also failed, perhaps less severely, from one rare event.

What this means at my postulated facility is that if unwanted events and their consequences are actively guarded against, and equipment is vigilantly maintained, barriers in place, and staff prepared to prevent these events, and if indeed symptoms of unwanted events begin to occur, then there is a good chance that if we are on a failure path, it is the start of a severe accident scenario, out there in the tail of Figure 1. Perhaps more failures will occur to compound the situation and form a scenario which may have never been thought of, or previously dismissed as being improbable, and there are no procedures, nor experience nor training to aid in recovery. Chances are that this is not a simple or known situation; the first rare event failure has a good probability of being a harbinger of a severe accident scenario.

3 RESILIENCE

I would like to step away from unexampled events for a moment and look at resilience, with an eye as to how it applies to the occurrence of an unexampled event with severe consequences.

Resilience can be defined as a technical term: the capability of a strained body to recover its size and shape after deformation caused especially by compressive stress. It can also be defined for general usage: an ability to recover from or adjust easily to misfortune or change. Both definitions imply a reaction, not an action, on the part of a material, individual, or, perhaps, an organization to an impact or stress. The technical definition also has an operational aspect: the resilience of a material, its coefficient of restitution and spring-like effect, can only be determined by experiment. Resilience is something that cannot be measured until after the fact of impact. Perhaps one could try to prepare for acting resiliently in a given situation (a proactive measure of presilience?). Predicting resilience may be easier, but it entails knowing the essential properties of materials, and, mutatis mutandis, of individuals or organizations, which make them resilient; and these properties must be confirmable in principle.

In the hypothetical chemical weapons disposal facility, I have postulated extraordinary activities to eliminate unwanted events; good work rules and work habits have been codified, proper procedures have been installed, proper surveillance and technical oversights groups are in place. In short, the facility has institutionalized what were before successful reactions, resilient responses, to known accidents, as standard operating procedures.

This is no small or trivial accomplishment. For anyone who has experienced a facility like this, a large nuclear power plant or the bigger than life oil platforms in the North Sea, the attention and concern given to safety is impressive, which Figure 3 represents as the white part of the graph, where the need for resilient reactions has been transformed into the need to strictly follow standard operating procedures. The magenta area in the tail of the graph represents unexampled, unwanted events, where resilience to stress is unknown and untested.

At this point, I wish that I were qualified to analyze the essential properties of resilience. This is probably the domain of psychology, a discipline in which I am formally unschooled. But I have had serious experience in situations where resilient reactions were needed during the 10 years I worked and lived in Israel. Without too much explanation, let me list some of the attributes which I believe are necessary, but not sufficient, properties of resilience:

  1. Experience – nothing is second to experience with a system and adversity;
  2. Intuition – intuition can give the best evidence for something being the case;
  3. Improvisation – knowing when to play out of book;
  4. Expecting the unexpected – not to be complacent;
  5. Examine preconceptions – assumptions are blinders;
  6. Thinking outside of the box – look at it once, then look at it again;
  7. Taking advantage of luck – when it happens, assent immediately.

This list is not exhaustive, but a witnessed list of underlying traits for successful resilient response to novel, critical situations, from removing a tractor from a sea of mud to acts of heroism. In all cases, my sense was that individuals or groups that showed the traits enumerated above were much different from individuals or groups whose work entailed following strict protocols, procedures, and rules. In critical situations there were sometimes clashes of these two different cultures. Resilience in an individual meets with no internal resistance; in a group, however, those who follow a rule and those who improvise a tune can find themselves at odds.

It is tempting to make comparisons between resilience and adaptation in the Darwinian sense. Natural selection, the modus ponens of evolutionary theory, makes the entailment of survival from adaptation: “natural selection is the claim that organisms enjoying differential reproductive success will be, on the average, those variants who are fortuitously better adapted to changing local environments, and that those variants will then pass their favored traits to offspring by inheritance” [ref]Gould, S.J. (2002). The Structure of Evolutionary Theory, Harvard University Press[/ref]. For me, the word “fortuitously” is key. An organism does not decide that the trait of wider stripes will be a better adaptation to a change in the environment, but by chance those organisms with wider stripes proliferate, as do the stripes.

Is resilience in response to an unexampled event a type of adaptation? I will carry the metaphor a bit longer to gain some insight. Unwanted, unexampled events are experienced as changes to the environment, albeit bad ones. The exact skills that may be necessary so as to rebound from the situation cannot be known ahead of time. No procedure is made for the unexampled by definition. However, if individual characteristics fortuitously exist that can aid in resilient response to critical and sudden unexampled events, I believe that severe consequences may be dampened and perhaps stopped. Can we plan what traits are needed ahead of time? Perhaps the list I presented above is a start to understand what underlies resilient response.

Does resilience apply to groups as well as individuals? Darwin believed quite strongly that natural selection applied only to individual organisms, not to groups, species, or clades. In 1972, Niles Eldredge and Stephen Jay Gould proposed the theory of punctuated equilibrium to explain the long periods of no change in the fossil record of a population, followed suddenly by a flurry of speciation:

“A new species can arise when a small segment of the ancestral population is isolated at the periphery of the ancestral range. Large, stable central populations exert a strong homogenizing influence. New and favorable mutations are diluted by the sheer bulk of the population through which they must spread. They may build slowly in frequency, but changing environments usually cancel their selective value long before they reach fixation. Thus, phyletic transformation in large populations should be very rare—as the fossil record proclaims. But small, peripherally isolated groups are cut off from their parental stock. They live as tiny populations in geographic corners of the ancestral range. Selective pressures are usually intense because peripheries mark the edge of ecological tolerance for ancestral forms. Favorable variations spread quickly. Small peripheral isolates are a laboratory of evolutionary change.” [ref]Gould, S.J. (2002). The Structure of Evolutionary Theory, Harvard University Press[/ref]

To continue the Darwinian metaphor: we should expect the operations of a vigilantly maintained, well-tested, well surveilled system to proceed flat and normally, with no signs of change or need of resilience, most of the time, then suddenly punctuated by critical challenges. Under great environmental stress, such as an unexpected, unforeseen accident, we can expect only a small, isolated group to respond resiliently and to “speciate” from the larger group.

And as those challenges are met, or not met, then the standard operating procedures of the system are changed to incorporate the resilient reactions that mitigated the situation; and another period of stasis will be entered.

4 PRA

My focus in this essay has been on well-tested, well-analyzed, vigilantly maintained systems, unexampled events, and resilience. I have tried to show that (1) unexampled events have an increased probability of severe consequences in these systems, and (2) resilience to respond to unexampled events is a trait that may be antithetical to the mindset that must run these systems without incidents. I would now like to focus on the implications to PRA.

PRA is the discipline of trying to quantify, under uncertainty, the risk or safety of an enterprise. To briefly state my views:

Quantification, or measuring, the risk/safety of a situation is not the goal of a PRA. Nor is it necessary to “quantify” with numbers (one could use colors). The act of trying to measure the risk involved is the source of knowledge. The acts of trying to assign values, combining them, questioning their verisimilitude, building the model are the great treasure of PRA: the key to the treasure is the treasure itself.

Uncertainty is not some noisy variation around a mean value that represents the true situation. Variation itself is nature’s only irreducible essence. Variation is the hard reality, not a set of imperfect measures for a central tendency. Means and medians are the abstraction.

Too often risk is defined as risk = likelihood * consequence and safety = 1-risk. I disagree with this. Risk is likelihood and consequence, not a simple multiplication with safety as the additive inverse of risk. Risk and safety are normative notions, changing with situations and expectations, and must be assessed accordingly.
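A small arithmetic example (the numbers are mine, and purely illustrative) shows what the multiplication throws away: a frequent nuisance and an unexampled catastrophe can carry exactly the same “risk” number.

```python
# Two hypothetical events with identical likelihood * consequence products
routine = {"likelihood": 1e-1, "consequence": 10}       # frequent, minor harm (arbitrary harm units)
extreme = {"likelihood": 1e-5, "consequence": 100_000}  # unexampled, catastrophic

for name, e in (("routine", routine), ("extreme", extreme)):
    print(name, e["likelihood"] * e["consequence"])     # the same "risk" number for both
```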

Modern PRA began with the publication of WASH-1400, “(The) Reactor Safety Study”, a report produced in 1975 for the USNRC by a committee of specialists under Professor Norman Rasmussen; its use spread widely after Three Mile Island. The study considered the course of events that might arise during a serious accident at a (then) large modern light water reactor, and estimated the radiological consequences of these events and the probability of their occurrence, using a fault tree/event tree approach.

The proposed event trees and fault trees were very small with respect to the number of systems and events they modeled. The mathematics was approximate, the data little more than reliability studies, and the initiating events well-known possible disturbances to normal plant operations. However, these simple methods gave operators and managers the first feel for the safety of the plant as a measurement, certainly one step forward in knowledge.
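A minimal sketch of that kind of calculation, reduced to a toy fault tree with two minimal cut sets and the rare-event approximation (the events and probabilities are invented for illustration), might look like this:

```python
# Toy fault tree: TOP = (pump_fails AND valve_fails) OR (power_lost AND diesel_fails)
basic_events = {
    "pump_fails":   1e-3,   # illustrative probabilities per demand
    "valve_fails":  2e-3,
    "power_lost":   1e-2,
    "diesel_fails": 5e-2,
}

minimal_cut_sets = [("pump_fails", "valve_fails"), ("power_lost", "diesel_fails")]

def cut_set_probability(cut_set):
    p = 1.0
    for event in cut_set:            # independence assumed, as in the simplest PRA models
        p *= basic_events[event]
    return p

# Rare-event approximation: the top event probability is roughly the sum over the cut sets
top = sum(cut_set_probability(cs) for cs in minimal_cut_sets)
print(f"top event probability (rare-event approximation): {top:.2e}")
```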

Times have changed, but the methods have not. Nuclear plant PRA models are orders of magnitude larger than envisioned by Rasmussen. The models are so large that they are neither reviewable nor surveyable. The results calculated are severe approximations with no knowledge of the error factors involved. Reliability techniques are used for human actions in questionable ways. The numerical results of the PRAs are highly suspect, and yet they remain the desiderata of the activity.

The focus of these PRAs is almost entirely on known system disturbances as initiating events, and static, sequential views of accident emergence and progression. As a result, procedures, training, regulations, and methods of operation were put in place to guard and watch out for the known disturbances. Risk models were used not for their insights, but for the quantitative results offered, thus never exploring novel failure modes of the facilities, totally missing the ability to postulate unexampled events and strange system and extra-system influences/interactions/background.

The result is that the attention of the risk analysts is not on unexampled events. Given that symptoms of system failure occur, attention will not be on the tail of the distributions where unexampled events reside. There will be little experience in the organization for imagining scenarios that change critical assumptions, have slightly different symptoms, or include multiple failures. Moreover, the standard operational culture is focused on the procedures and rules for dealing with known disturbances and standard ways of solving problems. And rightly so, since without this focus on the checklists, procedures, and protocol, controllable situations can easily escalate out of control, and the daily safety of the facility is impacted.

A second culture is also needed. To restate a central theme in this essay, in well-tested, etc., systems, given that there is an accident, chances are the level of consequence is high and that the causes had not been modeled in the PRA. The second culture, to be prepared for the unexampled event, must play with the model, question assumptions, run scenarios, and understand the uncertainty. When initial indications or symptoms appear that a system may be going astray, the second culture moves away from the probable and into the possible.

This can be visualized by using the typical Color Risk Matrix used by many, including me, to present risk analysis results. Figure 4 shows an example of two 6×5 risk matrices.

In this type of matrix, colors represent risk, with the order usually being like a traffic light: red, orange, yellow, and green (from high risk to low). The two dimensions represent consequence and likelihood as marked in Figure 4.

The upper matrix is the typical risk matrix for the standard operating culture, focusing on the area above the diagonal. The lower matrix is the typical risk matrix for the second culture, focusing on the area below the diagonal. Note how the two matrices are rotated.
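A sketch of the two matrices as data follows; the traffic-light coloring and the diagonal rule are my reading of Figure 4, not a reproduction of it.

```python
# 6 likelihood categories (rows) by 5 consequence categories (columns), low to high
COLORS = ["green", "yellow", "orange", "red"]

def color(likelihood_idx: int, consequence_idx: int) -> str:
    """A traffic-light scheme: risk grows with likelihood plus consequence."""
    score = likelihood_idx + consequence_idx           # 0 .. 9
    return COLORS[min(score // 3, len(COLORS) - 1)]

def matrix(focus: str):
    """focus='standard' marks the cells where the likelihood index exceeds the consequence
    index; focus='second' marks the rest.  Which side of the diagonal belongs to which
    culture is my reading of the figure, not a reproduction of it."""
    rows = []
    for i in range(6):                                 # likelihood index
        row = []
        for j in range(5):                             # consequence index
            focused = (i > j) if focus == "standard" else (i <= j)
            row.append(color(i, j).upper() if focused else color(i, j))
        rows.append(row)
    return rows

for culture in ("standard", "second"):
    print(f"\nfocus of the {culture} culture (upper-case cells):")
    for row in matrix(culture):
        print(" ".join(f"{cell:>6}" for cell in row))
```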

Can these two cultures coexist? Can one of the cultures be “proactively presilient”? I do not know the answers at all. But I do know that without both of them, we can be assured of accidents with higher levels of consequence.

Safety is connected not only to risk, but also to expectation. It is a normative notion. In operations like a nuclear power plant or a chemical weapons disposal facility, which are of the well-tested etc. category, I expect the rare events to be guarded against as well. I weight consequence more heavily than likelihood when assessing safety in well-tested etc. systems.

It’s in words that the magic is–Abracadabra, Open Sesame, and the rest–but … the magic words in one story aren’t magical in the next. The real magic is to understand which words work, and when, and for what; the trick is to learn the trick.

And those words are made from the letters of our alphabet: a couple-dozen … squiggles we can draw with the pen. This is the key! And the treasure, too, if we can only get our hands on it! It’s as if–as if the key to the treasure is the treasure!

John Barth in Chimera

Validation of the Open PSA Model Exchange Format


Validation Project for the Open-PSA Model Exchange using RiskSpectrum® and CAFTA®

Steven Epstein [ref]ABS Consulting, Yokohama, Japan[/ref], F. Mark Reinhart [ref]Vienna, Austria[/ref], and Antoine Rauzy [ref]Dassault Systemes, Paris, France[/ref]


Abstract: Under the sponsorship of the Institut pour la Maîtrise des Risques (IMdR), and supported financially and technically by more than ten European and US organizations (see section 4), this validation project has been successfully completed with both RiskSpectrum®, from Relcon Scandpower, and CAFTA®, from EPRI.
Keywords: PRA, Dynamic PSA, Model Exchange Format


1. INTRODUCTION

Over the last 5 years, much work has been done which shows the interest in, and the necessity of, improving both models and assessment tools in Probabilistic Safety Analyses (PSA). The following issues are of special importance.

  • Quality assurance of calculations;
  • Reliance on approximations and cutoffs;
  • Portability of models between different software;
  • Clarity and documentation of the models;
  • Completeness of the models;
  • Better visualization of PSA results;
  • Interoperability of the different tools.

1.1. The Open-PSA Initiative

The Open-PSA Initiative for a new generation of Probabilistic Safety Analyses was launched during spring 2007. This initiative aims to provide the community with an open forum of discussion and exchange. It has the ambition to help the design of new methods and new tools, bringing to the international PSA community the benefits of an open initiative, and to bring together the different groups who engage in large scale PSA, in a non-competitive and commonly shared organization.

Stimulated by ongoing international interest to improve Probabilistic Safety Analyses (PSA) as well as by improvements in computer algorithms, the Open PSA Initiative was initiated in early 2007. Since its inception, the Open PSA Initiative has involved numerous discussions among international interests and has held a series of ten meetings in Austria, France (3), Japan, Spain, Sweden, Switzerland, and the United States (2).

Interest has been demonstrated by participation by representatives from the following nations: Belgium, the Czech Republic, Estonia, Finland, France, Germany, Greece, Hungary, Italy, Japan, Lithuania, The Netherlands, Russia, Slovenia, Spain, Sweden, Switzerland, United States, the European Union, and the IAEA. In addition, a presentation to the U.S. NRC Advisory Committee on Reactor Safeguards, PRA Subcommittee was well received.

1.2. The Open-PSA Model Exchange Format

The very first objective of the Open-PSA initiative is the design of a Model Exchange Format for Probabilistic Safety Analyses. This format will make it possible to represent Fault Tree/Event Tree models. It will be complete and expressive enough to embed all existing models and open enough to provide room for future needs. It may be the kernel of an open architecture for the next generation of Probabilistic Safety Assessment tools.

The design of such a format is clearly a mandatory step to be able to tackle the issues mentioned above.

Under the sponsorship of the Institut pour la Maîtrise des Risques (IMdR), and supported financially and technically by more than ten European and US organizations (see section 4), this validation project has been successfully completed with both RiskSpectrum®, from Relcon Scandpower and CAFTA®, from EPRI.

2. STATE OF THE ART

Until the Open-PSA initiative work on this subject, no common Model Exchange Formats were available for PSA. Several formats were available for Fault Trees (e.g. the SETS format), but nothing regarding the following constructs:

  • Probability distributions of basic events;
  • Extra logical constructs such as common cause groups, delete terms, recovery rules, exchange events…
  • Event Trees;
  • Results of computations.

A first version of the Open-PSA Model Exchange Format (OPSA-MEF) has been available since December 2008. It covers all the above issues. It remained, however, to instantiate it in software, to link the software with it, to check it against a large Level 1 PSA, and possibly to adjust the OPSA-MEF accordingly.

Goals of the Open PSA Initiative include the following:

  • PSA models need the following capabilities:
    • Be as complete as is reasonably and progressively achievable.
    • Have well founded and documented bases for approximations and cut-offs.
    • Assure the optimum definition and timing for truncation to assure, among other aspects, consistent and correct importance calculations.
    • Account for event tree success path contributions.
    • Account for human actions reliability and recovery.
    • Support dynamic (real time) analyses, directly or through interfacing “risk monitor” software.
    • Provide enhanced results visualization, a common user interface, and support for current and anticipated Risk-Informed Decision Making (RIDM) applications.
    • Be clear, understandable, traceable, transparent, and well documented.
    • Assure benchmarking, validation, and other quality assurance capabilities.
    • Have a standard file format, transferable (portable) among PSA software programs.
  • PSA data needs the following:
    • Coordinated format, sources, compilation, analyses, assimilation, and data bases.
    • Coordinated platform for secure international access and dissemination.
    • Transparent and traceable data flow from the source to the model.
  • Needed or strongly desired supporting characteristics:
    • Standard and consistent semantics.
    • An Extensible Mark-up Language (XML) format for data, PSA models, and PSA software.
    • Ability to interface with Human Reliability Analysis software.
    • The capability for Markov processes, Bayesian analysis, and fragility analysis.
    • Inclusion of Boolean Driven Markov Processes (BDMP), dynamic event trees, and Failure Modes and Effects Analyses (FMEA).

The series of constructively collegiate meetings mentioned above offered, discussed, and dispositioned a number of valuable perspectives. A candid observation was that while many agreed that to pursue and attain these goals was necessary, there appeared to be no sustained effort in that direction. It was concluded that “we needed to act.” Consequently, a series of progressive and supportive steps was discussed and envisioned. An important first step was to develop and validate an Open PSA Model Exchange Format (OPSA-MEF).

A number of PSA software developers were approached to work with the Open PSA Initiative. Scandpower, now a member of the non-profit organization Lloyd’s Register, took the lead to validate that a RiskSpectrum PSA model [from the German KKB Nuclear Power Plant (NPP)] could be exported into the OPSA-MEF and re-imported from it. In this regard, Scandpower developed a “RiskSpectrum IMDR-XML” software application to export the KKB PSA model from RiskSpectrum into the OPSA-MEF and from the OPSA-MEF back into RiskSpectrum.

In accordance with the requirements of this project, Scandpower issued a user manual, RiskSpectrum IMDR-MEF. In addition to the Scandpower efforts, parallel efforts, to a lesser extent, were made with CAFTA, RiskMan, and ASTRA (a non-nuclear application) PSA models.

The fundamental goal of this first step was that PSA models exported to and re-imported from the OPSA-MEF, and exchanged among the various PSA software, carry the same clear meaning. The overall conclusion is that these efforts have achieved a high level of success, and there is confidence that the remaining challenges can and will be overcome.
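The shape of such a round-trip check can be sketched generically (the toy model, the JSON stand-in for the exchange format, and the function names below are my assumptions, not the actual OPSA-MEF schema or any vendor API):

```python
import json

# A stand-in for a PSA model: basic events with probabilities and gates with inputs.
# This is a generic illustration of the round-trip check, not the actual OPSA-MEF schema.
original = {
    "basic_events": {"BE1": 1e-3, "BE2": 2e-3, "BE3": 5e-4},
    "gates": {"TOP": {"type": "or",  "inputs": ["G1", "BE3"]},
              "G1":  {"type": "and", "inputs": ["BE1", "BE2"]}},
}

def export_model(model) -> str:
    """Serialize to a neutral exchange representation (JSON here, as a stand-in)."""
    return json.dumps(model, sort_keys=True)

def import_model(text: str):
    return json.loads(text)

round_tripped = import_model(export_model(original))

# The validation criterion: the re-imported model must carry the same clear meaning.
assert round_tripped["basic_events"] == original["basic_events"]
assert round_tripped["gates"] == original["gates"]
print("round trip preserved the basic events and the gate logic")
```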

3. RESULTS

The project has been completely successful; its results include:

  • The creation of small benchmark models that will be used both for this project and for future uses;
  • The design of translators from RiskSpectrum® internal format and CAFTA® internal format to the OPSA-MEF and vice-versa;
  • The validation of OPSA-MEF on the Kernkraftwerk Brunsbuettel (KKB) Level 1 PSA;
  • The issuance of reports describing and commenting on the experiments performed to validate the OPSA-MEF;
  • Independent validation by the Joint Research Center (JRC), Ispra Italy, using their PSA software, ASTRA®.

The OPSA-MEF was exercised; refinements were proposed (especially with additional grammar constructs) from lessons learned from the validation; and the process was validated at a “proof of concept” stage. The fundamental goal has been achieved to about 85%. The remaining 15% may be a more challenging task. Part of the remaining challenge will be to update and maintain a successful OPSA-MEF as a living tool for use among various PSA models.

The success with the OPSA-MEF and the RiskSpectrum IMDR-XML software application, in light of the fact that RiskSpectrum has approximately 400 customers internationally and is the most widely used PSA software in the world, has gained the attention of a considerable population of users of other PSA software. This is very important because a significant goal expressed by at least one major European regulator, and by at least one European nation operating a large number of NPPs, is to work with PSA and PSA applications among NPPs and regulators with a consistent PSA model and user interface. Such consistency in analyses and communications will help focus available resources on the most safety-significant considerations and contribute to increased safety in NPP design, construction, operation, and maintenance.

As presented above, the Open PSA Initiative has an ambitious catalog of goals, and the development of the OPSA-MEF and the RiskSpectrum IMDR-MEF contributes to the initial step. Future steps toward the achievement of the full catalog of goals are anticipated. One additional long-term goal is to accommodate the expressed interest, and the demonstrated measure of success, in exploring the application of the OPSA-MEF to industrial areas other than nuclear power plant safety assessment.

Some near-term future activities are anticipated to include the following:

  • Workshops on the Open PSA Initiative.
  • Submission of papers for the following conferences or publications:
    • 10th International Probabilistic Safety Assessment & Management Conference (PSAM 10), Seattle, Washington, USA, 7-11 June 2010.
    • The 8th Maîtrise des Risques et Sûreté de Fonctionnement, Oct 2010, La Rochelle, France, in the Field of Innovation and Risk Management.
  • Article for Scientific Review.
  • Future ESREL Conference. ESREL conferences are scheduled in Rhodes, Greece in September 2010 and in Troyes, France in September 2011.

Finally, the Open PSA Initiative website is established at www.open-psa.org.

4. PROJECT PARTICIPANTS

Contact Names Affiliation
Bäckström, Ola Relcon Scandpower AB
Becker, Günter RiSA Sicherheitsanalysen GmbH
Cepin, Marko IJS
Contini, Sergio EC JRC – Institute for the Protection and Security of the Citizen
Cojazzi, Giacomo EC JRC – Institute for the Protection and Security of the Citizen
Cronin, Frank ABS Consulting
Dassy, Françoise Suez Tractebel
da Zanna, Chiara TU Delft
Dres, Dennis Kernkraftwerk Leibstadt AG (KKL)
Duamp, Francois IRSN
Epstein, Steve (Woody) ABS Consulting
Hendrickx, Isabelle Suez Tractebel
Hibti, Mohamed EdF
Klugel, Jens Kernkraftwerk Goesgen AG (KKG)
Lannoy, André IMdR
Lundstedt, Lars Relcon Scandpower AB
Marle, Leila IMdR
Nusbaumer, Oliver Kernkraftwerk Leibstadt AG (KKL)
Rauzy, Antoine Dassault Systèmes
Richner, Martin NOK
Santos, Roberto Herrero CSN
Meléndez Asensio, Enrique CSN
Reinhart, Mark Interaction Internationale
Sörman, Johan Relcon Scandpower AB
Matuzas, Vaidas EC JRC – Institute for the Protection and Security of the Citizen

Lisa Norris’ Death by Software Changes


Resilience Needed, or Just Good Old Testing?

1. Introduction

The overexposure of Lisa Norris to radiation did not come about because the Beatson staff was un-resilient to a new threat to its standard, vigilant safe operations. The overexposure happened because the most elementary precautions of user testing of a computer system upgrade, before system acceptance and actual use, were never, I believe, carried out. In fact, it is difficult to find in the investigation report any reference to software testing, acceptance criteria, test cases, or user training on a live system.

The actual data on failure of the Varis 7 computer system for the type of treatment in question shows that of the five times the system was used, there were four failures of planning personnel to use the system correctly, with the first use of the system being contrary to the procedures then in place, and yet successful. It is my belief that if “dry testing” had been performed, these failure events would not have happened during operations; certainly 100 “dry tests” would have shown at least one failure of the type committed, to wit, the forgettery of normalization, and the accident would most probably never have taken place.

2. What Went Wrong?

Prior to May, 2005, the Varis computer system used only the Eclipse software module for the planning, scheduling, and delivery of radiotherapy treatment.

There is not much information in the incident report about how the Varis system was used prior to May, 2005. It is not clear if, or how, either the RTChart or the Patient Manager modules were used. It is clear that the values for treatment dose and number of fractions were input manually to Eclipse. But prior to May, 2005, the treatment dose was always entered in units of MU per 100 CentiGrays, and therefore the Treatment Plan Report printed by Eclipse was always in units of MU per 100 CentiGrays.

Were there ever any incidents of miscalculation of MUs prior to May, 2005? I would hope that if there had been any noticed occurrences, they would have been mentioned in the incident report; so let us assume that operations had been perfect with respect to the calculation and checking of MU values. It is important to note that the best information would differentiate, and include, calculation errors which were corrected by planning checks before actual treatment. In these cases, there was a system error (incorrect calculation) which was caught by a backup system (plan verification). This is not a success of the system as a whole, but a failure of the frontline system rectified by a backup system. In counting system failures, we must count this as a failure of the frontline.

Why was the Varis system not used as an integrated whole? Why was only the Eclipse module used? The only information we have is related in section 4.8 of the report: “… a decision had been taken at the BOC to use the Eclipse planning system as a standalone module … for a number of operational and technical reasons” [page 6]. There is no further elucidation.

In May, 2005 things changed. A decision was made by the BOC to use other Varis 7 modules:

After the upgrade in May 2005 to Varis 7, a decision was taken to integrate the Eclipse module more fully with other Varis software modules. [section 4.8, page 6].

Was this decision really made, as stated, after the upgrade? If so, does that mean that any testing that was done to verify that Varis 7 worked correctly was done before the decision to more fully integrate the modules? And after the decision, was a new round of testing done to make sure that the man-machine interactions would still produce a safe and reliable system? Did the BOC realize that a computer system incorporates both human and machine elements, and must be tested as such?

With the upgrade, it was possible to transfer information electronically between modules, including the treatment dose in terms of MUs. In this case, the Patient Manager module could import the treatment dose from RTChart directly to Eclipse, and then to the Treatment Plan Report for review, then to the Medulla Planning Form for treatment delivery. And this is what happened. The MU from RTChart was transferred electronically to Eclipse and to the Treatment Planning Report.

However, it was the actual MU which was transferred, not the normalized MU: after the upgrade, RTChart transferred actual units, whereas before the upgrade the manual transfer to Eclipse had always been in normalized units. This caused Miss Norris to receive almost twice the intended dose of radiation. No one, in this case, noticed the error.

The frontline system failed (incorrect calculation). The backup systems failed (treatment verification). The Varis 7 system calculated correctly, given the inputs, and printed information out nicely. No one noticed the error.
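The arithmetic of the error can be illustrated with invented numbers, which are not the actual values from the report: if a downstream step assumes the entered MU is normalized to 100 CentiGrays and scales it by the prescribed dose per fraction, then entering the already-scaled actual MU applies the factor twice.

```python
# Illustrative arithmetic only; the numbers below are hypothetical, not taken from the report.
DOSE_PER_FRACTION_CGY = 175.0   # prescribed dose per fraction in CentiGrays (hypothetical)
MU_PER_100_CGY = 200.0          # machine output needed per 100 CentiGrays (hypothetical)

def scale_to_prescription(entered_mu: float) -> float:
    """Downstream step: treat the entered MU as normalized to 100 cGy and scale it."""
    return entered_mu * DOSE_PER_FRACTION_CGY / 100.0

actual_mu = scale_to_prescription(MU_PER_100_CGY)   # the already-scaled value transferred after the upgrade

intended = scale_to_prescription(MU_PER_100_CGY)    # normalized entry, as before the upgrade
delivered = scale_to_prescription(actual_mu)        # actual entry, as after the upgrade

print(f"intended MU per fraction:  {intended:.1f}")
print(f"delivered MU per fraction: {delivered:.1f}")
print(f"overdose factor: {delivered / intended:.2f}")   # 1.75, i.e. almost twice
```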

Why was the decision made to integrate the modules? Again, from the report:

… ‘Manual transfer of data either from planning to treatment units or between treatment units is associated with a high risk of transcription error’ and recommends, therefore that ‘The transfer of treatment data sets should be by local area IT network as far as is possible’.
[section 8.3, page 33]

Changing from manual transcription to electronic transcription will lower the risk of transcription error, but will this lower the risk to the patient? I do not believe we can make this inference without some type of evidence and theory to stand behind the claim; it is not axiomatic. In an electronic system, an error in input, which is propagated electronically to other “treatment units”, will absolutely proliferate through all data; with manual systems, a human has many chances to look at the data, and unlike a machine, may even think about the data, or notice other information, such as instructions to normalize data. We tend to believe numbers on beautiful printouts or screen displays without questioning; this is not so when we work with our hands and minds.

Why was the procedure of entering normalized dose changed to entering actual dose? We only know that it was done to “ … optimize the benefit of the change to electronic data transfer …”[section 8.2, page 33]. Certainly it shows a lack of communication between the users of the system and the software developers. Certainly the software could have been customized so that the users could continue entering normalized doses.

Off-the-shelf software rarely works in the same way as an individual organization. For software developers, if, on the average, the software works in the same way as the organization in 90% of the functions, it is considered very successful. But “means” are fictions; variation is the rule.

The Varis 7 system was used as an integrated whole, even though some of its features were not used and some features were used in ways not intended; instead of complementing the way people worked, it caused a major change in the way they worked with normalized doses, with disastrous results.

Probably the software developers never imagined that the system would not be used as a completely integrated whole. They probably did not imagine, for example, that information generated by Eclipse would not be transferred electronically to RTChart. In this case, because the plan was a whole CNS plan, the BOC had made a decision to transfer data manually. When the system was used in this way, it was necessary to mark the Treatment Plan Report status as “Rejected” so that, I speculate, the data would not be transferred electronically. I imagine that the easiest way to stop data transfer from Eclipse to RTChart was to mark the Treatment Plan as “Rejected”.

Ironically, in this case, it was precisely because the plan was marked as “Rejected” that a radiologist and a senior planner discovered the errors being made [section 5.42, page 18].

Using a system in this way is sometimes called “walking hobbled”: we use a feature of a system (being able to reject a treatment plan) in a way never intended (to stop the plan from automatically going to RTChart). In its worst incarnation, “walking hobbled” becomes “using a ‘bug’ as a ‘feature’”, when we “trick” a computer system to get the output we need. When these “bugs” are subsequently fixed, there is no telling how the new system, old data, and users’ habits will interact to cause a system failure.

Computer systems are brittle, not resilient. They do not respond well in situations for which they were not programmed; they respond unpredictably when used in ways unforeseen (at least by the developers). And they cannot improvise when data is incorrect. A simple example here: if you have been typing “Y”, for yes, or “N”, for no, hundreds of times on an entry form, and suddenly you type “U”, instead of “Y” (remember that on anglo keyboards the “u” key is adjacent to the “y” key), will the software recognize this as a simple typographical error? I think not.
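A trivial sketch of the point, entirely illustrative: a strict parser silently reads the slip as “no”, while a slightly less brittle one at least notices that “u” sits next to “y”.

```python
# Keys adjacent to 'y' and 'n' on an anglo keyboard (illustrative, not exhaustive)
NEIGHBOURS = {"y": "tu", "n": "bm"}

def strict(answer: str) -> bool:
    """The brittle version: anything but 'Y' is silently treated as 'no'."""
    return answer.strip().upper() == "Y"

def suspicious(answer: str) -> str:
    """A slightly less brittle version: flag likely slips instead of silently deciding."""
    a = answer.strip().lower()
    if a in ("y", "n"):
        return a
    for intended, keys in NEIGHBOURS.items():
        if a in keys:
            return f"did you mean '{intended}'?"
    return "please answer Y or N"

print(strict("U"))        # False: the slip is silently read as 'no'
print(suspicious("U"))    # did you mean 'y'?
```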

I imagine a surly, middle-aged East European bureaucrat (and I have much experience in this arena) in place of Eclipse, and the joy he would feel by pointing out to me that the new rules specify normalization, and with a pen indicating where the changes must be made.

3. How Likely Was It?

In the absence of test data, we are left only with operational experience with which to estimate the likelihood of failure from non-normalization of units for MU.

During the period from May 2005 to February 2006 there were five whole CNS procedures planned.

For the first of these, in August 2005, the prescribed radiation dose was not included in the data input to Eclipse. This could mean two things: either the normalized radiation dose was input in place of the prescribed (actual) dose, or no radiation dose at all was entered. In either case, the procedures in effect for the upgraded system were not followed. Ironically, a failure of the system in place (not following procedures) contributed to a safe delivery of radiation treatment. There is no comment in the report as to why the prescribed dose was not entered. [section 5.22, page 13]

For the second plan, in November 2005, the prescribed (actual) dose was entered into Eclipse. But because the prescribed dose per fraction was 100 CentiGrays, the normalized values and the actual values are the same. So again, even though there was a failure of the system, accidentally, no adverse consequences resulted. [section 5.22, page 13]

The third plan, December 2005, was the treatment plan for Miss Norris. As we know, not only did the frontline system (data entry according to procedures) fail, but the backup systems (input verification) did also.

The fourth plan, a medulla plan, was done in January 2006. The normalization procedure was necessary in this case. A senior planner noticed that the unit of measurement on the Treatment Plan (175 CentiGrays) was different from the unit of measurement on the Medulla Planning Form, and made the appropriate re-calculation. It should be noted here that the senior planner in this case had never done a whole CNS before, and was unaware of the changes to procedures. His previous experience therefore did not blind him. He noticed things during transcription instead of proceeding in a familiar, but non-questioning, way. [section 5.4, page 19]

The last plan, February 2006, was fortuitously the plan which brought to light the previous errors. In this case, a question was raised by a radiographer as to why the Treatment Plan was marked as “Rejected”. A senior planner looked at the Treatment Plan to remind himself as to why the status “Rejected” was used; and in this second, more focused, examination discovered the original errors. [section 5.42, page 18]

We can summarize the data in this table:

Event            Failure to Normalize    Failure of Verification
August 2005               1                        ?
November 2005             1                        ?
December 2005             1                        1
January 2006              0                        0
February 2006             1                        0

As a Bayesian, I did a “back of the envelope” calculation. As my prior distribution, I used a flat prior, called FPRIOR, which indicates that I have no knowledge before operations as to the failure rate of the first Varis system used by the BOC. A flat prior indicates that every failure rate is equally possible.

I first updated FPRIOR with the data from the period 2003 to May 2005. The BOC estimates that 4-6 whole CNS treatments were performed each year. Since no incidents of failure to normalize were mentioned, we can conservatively say that there were 0 failures in 12 treatments. This distribution is named FAIL1. Then we must update FAIL1 with the data from after May 2005. Using the above table, I updated the distribution with 4 failures in 5 treatments, which results in the posterior distribution FAIL2 and its summary statistics.
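A minimal sketch of the arithmetic, under my assumption that the flat prior FPRIOR is a Beta(1,1) distribution and that the updating is conjugate Beta-Binomial (the original calculation may have used a different tool):

```python
# Conjugate Beta-Binomial updating of a failure probability per whole CNS procedure.
# FPRIOR = Beta(1, 1), i.e. a flat prior over [0, 1].
alpha, beta = 1.0, 1.0

# FAIL1: update with 0 failures in 12 whole CNS treatments (2003 to May 2005)
alpha, beta = alpha + 0, beta + 12

# FAIL2: update with 4 failures in 5 treatments (May 2005 to February 2006)
alpha, beta = alpha + 4, beta + (5 - 4)

mean = alpha / (alpha + beta)
print(f"FAIL2 = Beta({alpha:.0f}, {beta:.0f}), posterior mean = {mean:.2f}")   # about 0.26
```

The posterior mean of FAIL2 is 5/19, roughly 0.26.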

Given the resultant posterior, we have better than a 25% chance that an incident like this will occur per whole CNS procedure, or about once a year. Quite unacceptable.

I find it difficult to believe that acceptance testing or a “dry test” was performed by the BOC. If they had, then their performance would have shown a similar type of failure rate, before the possibility of an accident could occur.

It can be said that during testing, many subsequent errors do not surface because the testers take care while using a system and concentrate on doing things correctly. My response is that one must ensure, as closely as possible, that the test conditions are similar to operational conditions, using a broad spectrum of testers, test situations, and various time constraints.

Moreover, it has been said that with mature software systems, and seven versions of the Varis software indicate that it has been around, the Pareto Principle applies: 80% of the new errors will actually be caused by updates to the system. Computer software is brittle, and slight changes in one part of the software can cause errors which were never imagined.

Part of the Therac-25 tragedy, where 6 known accidents involved massive radiation overdoses, was caused by a simple update to the user interface allowing users to edit individual data fields, where before all data fields had to be re-entered if a data entry error occurred [Leveson, Turner 1993].

4. What Were the Consequences?

A person died.

On the Use of Agents Considered Harmful

Like many traveling business people, I often find myself on the Hertz bus returning to an airport on Friday afternoons. During the short trip, with my eyelids beginning to droop, I love to eavesdrop on week-ending philosophical insights from one business person to another. The following event took place on the way to O’Hare.

Two thirty-ish guys were returning to the airport together; one worked at the home office of a high-tech company, call him Homes, the other in the field, call him Fields. Fields says to Homes, “Does your staff get such-and-such magazine anymore?” Homes says, “Nope. We don’t get any magazines or journals for the staff. We found that they spend too much time reading.”

“In fact,” Homes goes on, “we don’t have subscriptions to any magazines or journals. We subscribe to a service. This service asks us what types of articles are of interest to us, which we specify as rules, like a boolean or keyword query; the service then creates ‘agents’ which apply these rules to magazines and journals, electronically ‘clips’ articles, and delivers them to us on our network as files! That way people don’t waste time.” And they don’t learn anything either, I thought.

In a way, this is another version of “garbage in, garbage out”. If the only things worth knowing are the broad categories you can specify logically, then the largest chunk of knowledge, the things you don’t yet know, will be hidden from you forever: “stuff you know in, stuff you know out”.

In a deeper sense, however, this vignette points to the weakest and most dangerous part of computers: they do what you say, not what you mean. Computers can only carry out a sequence of unambiguously specified logical instructions. We cannot even guarantee, in general, that this sequence will not go into an infinite loop (A. Turing, the Halting Problem).

It is impossible for me to specify logically all of the things of interest to me in a magazine. No computer can scan a magazine and free-associate. In fact, the act of browsing is its own reward: a new product announcement here, an ad with a great layout that helps to design an input screen there, a subject which never before generated interest catches my imagination.

Imagination! I guess that’s the real problem. Logic never made innovation. Insight, serendipity, and fortuitousness are the paths of innovation. To determine, via logical “agents” working in cyberspace, what will be of interest to you, will cut you off from the future. The richness of information on a page of WIRED, or even COMPUTERWORLD, cannot be logically specified for retrieval.

Cyberspace is a black box which can only be penetrated by logical query. The field of display, a 14″ to 20″ screen, is just not big enough to present things easily. Too many thoughts and physical manipulations are necessary to find information, let alone compare several pieces of information at the same time. Those who don’t learn the limitations of cyberspace are doomed to live in it.

Soon, Fields’ and Homes’ conversation drifted to the financial woes of their company. It seems that their company’s stock price has dropped below 16, a mythical lower barrier, and that their founder’s stock is next to worthless. Maybe they missed what is coming next.