11 June 1997
Artifacts such as computer systems are constructed with a purpose in mind: to fulfil a goal (Sea95). While much work in computer science has concentrated on how to state and fulfil such goals, there is much less work on failure to fulfil a goal.
In traditional engineering, failure is often correlated with a persistent failure state: something broke, and what's broke stays broke. With systems designed for complex behavior, such as may be found in most modern transport aircraft, this is no longer so. Such a system may exhibit unwanted behavior, may fail, even though nothing `breaks'. A trivial illustration is putting money into a vending machine, obtaining the desired item, but failing to receive change. Each action is appropriate in and of itself, but giving no change is not appropriate given the sequence of actions and states before it.
When the machine is a computer standing by itself, these misbehaviors can mostly be traced to their origins in the design of the system, software or hardware. But suppose the system is not isolated, and communicates with other such systems, as well as with human operators, in a changeable environment. In an aircraft, say. It's now much less clear where faults can be traced: not only to individual systems, to the operator, or to the environment, but also to the interactions between these components. Failure analysis becomes technically complex and intellectually difficult.
Suppose there has been a failure, and suppose that all the salient events and system states have been researched and more-or-less discovered. A further difficulty lies in the analysis of causality: which events and states were causal factors of which others?
Aircraft accidents are amongst the most carefully researched failures in all of engineering. One would expect the reports to be exemplary. While looking carefully at recent accident reports involving complex and often computerised aircraft, we found what appeared to us to be reasoning discrepancies: significant causal factors described in the body of a report did not appear in its final list of causes (`probable cause' and `contributing factors') (1). Simple logic mistakes appear to have been made. We believed that there must be a rigorous method of reasoning that would allow objective evaluation of events and states as causal factors. We thus developed the WB-graph method. (The name comes from `Why...Because'-graph, and was suggested by our collaborator, the Human Factors expert Dr. Everett Palmer of NASA Ames Research Center in California (PaLa97).)
Problems associated with piloting modern computerised aircraft are discussed in, for example, (Roc97).
A partial reason for occasional mistakes in reasoning becomes clear: the causal structure of an accident is complex. There are roughly sixty salient events and states mentioned in the report, and even more causal connections between them! Using the WB-graph method, supported by internal checks programmed in the language DATR (Ger97), ensures logical rigor.
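As a rough illustration of what such an internal check does, here is a minimal sketch in Python rather than DATR (the checks actually used are those of (Ger97); the representation below is our own assumption for exposition). A WB-graph is treated as a set of declared nodes together with, for each node, the set of its asserted causal factors; the sketch confirms that every node mentioned in an edge is declared, and that the why-because relation contains no cycles.

    # Illustrative sketch only: the checks actually used were programmed in DATR (Ger97).
    from typing import Dict, Set

    def check_wb_graph(nodes: Set[str], causes: Dict[str, Set[str]]) -> None:
        """causes[b] is the set of nodes asserted to be causal factors of node b."""
        # Check 1: every node mentioned in an edge must be a declared node.
        for effect, factors in causes.items():
            undeclared = ({effect} | factors) - nodes
            if undeclared:
                raise ValueError("undeclared nodes: %s" % undeclared)
        # Check 2: the why-because relation must be acyclic; no event or state
        # may turn out to be, directly or indirectly, its own causal factor.
        visiting: Set[str] = set()
        finished: Set[str] = set()
        def visit(n: str) -> None:
            if n in finished:
                return
            if n in visiting:
                raise ValueError("causal cycle through: %s" % n)
            visiting.add(n)
            for factor in causes.get(n, set()):
                visit(factor)
            visiting.discard(n)
            finished.add(n)
        for n in nodes:
            visit(n)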
The overall structure of the WB-graph for this accident is shown in Figure 1 (PS, 6K).
We observe that this graph can be broken down into three main sections along the `bottlenecks'. The `top' section is shown in Figure 2 (PS, 3K). (The other two components are shown in Figure 3 (PS, 3.5K) and Figure 4 (PS, 3K); the nodes that `join' two of these almost-components are included in both relevant figures.) If you use a PostScript viewer such as Ghostview, please note that Figures 2-4 are best viewed in landscape mode.
One can immediately observe from Figure 2 that node 3.1.2: earth bank in overrun path is a causally-necessary node: hitting the bank was a cause of the damage and fire; the hit directly killed one person and rendered the other unconscious and therefore unable to participate in the evacuation. Furthermore, this node is itself not caused by any other event or state in the sequence. It is therefore to be counted amongst the `original causes' of the accident, according to the WB-graph method. However, it does not appear amongst the `probable cause' or `contributing factors' of the final report. We have therefore found a reasoning mistake in the report. It is not the only node of which this is true.
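The test applied to node 3.1.2 generalises: any node that appears in the graph as a causal factor of something, but which is itself caused by nothing in the graph, counts as an `original cause'. Continuing the illustrative Python sketch above (the node labels below are invented stand-ins, not quotations from the report):

    from typing import Dict, Set

    def original_causes(nodes: Set[str], causes: Dict[str, Set[str]]) -> Set[str]:
        """Nodes that are causal factors of something in the graph but are
        themselves caused by nothing in the graph."""
        caused = {n for n in nodes if causes.get(n)}
        causing = {factor for factors in causes.values() for factor in factors}
        return causing - caused

    # Hypothetical fragment around node 3.1.2 (labels invented for illustration):
    nodes = {"3.1.2 earth bank in overrun path",
             "aircraft overruns runway",
             "damage and fire"}
    causes = {"damage and fire": {"3.1.2 earth bank in overrun path",
                                  "aircraft overruns runway"}}
    print(original_causes(nodes, causes))
    # In this tiny fragment the overrun also comes out as uncaused, because its own
    # causes are omitted; in the full graph only nodes such as the earth bank remain.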
This small example has shown how the WB-graph method renders reasoning rigorous, and enables the true original causal factors to be identified from amongst all the causally-relevant states and events.
What is the consequence of this rigorous reasoning? Once we have identified the position of the earth bank as an original causal factor, we know that had the bank not been where it was, the accident that happened would not have happened. (It is, of course, possible that the aircraft could have broken up and burned for some other reason; whether that was likely can be left to the experts to decide, but it's certainly not as likely as in the case where there's something there to hit!) Therefore, one could consider repositioning the bank in order to avoid a repeat. However, this was neither considered nor recommended in the report, we suppose because the position of the bank was not identified there as a causally-essential feature.
So, even though press and media opinion may focus on the automated systems of the accident aircraft, or on pilot `error', it is also true that had the aircraft had a free `overrun area' at the end of the runway in which to slow down, the accident could have been a mere incident: unfortunate but not deadly. In a valid account of the accident, this mundane feature must also be noted as a causal factor along with pilot and airplane behavior.
Thus can rigorous causal analysis help the future of air travel. And, in the future, we will all be travellers on computerised aircraft. But the computers are hardly ever the only cause of an accident.
Roughly speaking, Lewis's semantics for the assertion that A is a causal factor of B, where A and B are each either an event or a state, is that in the nearest possible world in which A did not happen, neither did B (Lew73). This relies on the notion from formal semantics of a `possible world', best illustrated by example. Suppose my office door is open. But it could have been shut. A semanticist can now say: in another possible world, it is shut. A possible world is a way of talking about things that could happen, but didn't. But what about `near' possible worlds? The `nearest' possible world in which my door is shut is one in which my door is shut, air currents around it behave appropriately, sound through it is muffled as it should be, but broadly speaking everything else remains the same. A further-away world would be one in which someone else who is not me is sitting here typing, and an even further-away world is one in which this whole environment is situated in Ghana rather than Germany.
Now, suppose my door shuts. What caused it to shut? I was pushing it shut. The air was still, there was no draft, the only thing moving was the door, and it was moving because I was pushing it shut. Intuitively, my action caused the door to shut. How do I know this from the formal semantics? In the nearest possible world in which I didn't push the door, did the door shut? We have already supposed that nothing else was moving, no air drafts, no other person in the vicinity, so in the nearest world these conditions would also hold. It could be that all the molecules in the door moved the same way at the same time, so that the door spontaneously shut; but this situation is so highly improbable as to be almost unthinkable, so could it really be the nearest such world? No. In the nearest world, everything behaved the same way, except that I didn't push the door. So it didn't shut. So according to the formal semantics, my action caused the door to shut.
This formal semantical test is particularly important in circumstances in which many causal factors conjoin to make something happen, which is by far the most usual case. The simple semantics asks a question of two events or states at a time, and by asking that question of all pairs, pair by pair, a complex WB-graph may be systematically built.
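To picture this pairwise procedure, the following sketch (again in Python, purely as an assumption for exposition) takes the set of salient events and states together with a counterfactual test for a single ordered pair, standing in for the analyst's judgement of the Lewis question, and collects every pair that passes the test. The door example above serves as the toy data.

    from typing import Callable, Iterable, Set, Tuple

    def build_wb_graph(items: Iterable[str],
                       is_causal_factor: Callable[[str, str], bool]) -> Set[Tuple[str, str]]:
        """Collect every ordered pair (a, b) for which the counterfactual test answers:
        in the nearest possible world in which a did not happen, b did not happen either."""
        items = list(items)
        return {(a, b)
                for a in items
                for b in items
                if a != b and is_causal_factor(a, b)}

    # Toy oracle encoding the door example from the text (an assumption, not report data):
    def door_oracle(a: str, b: str) -> bool:
        return (a, b) == ("I push the door", "the door shuts")

    edges = build_wb_graph(["I push the door", "the air is still", "the door shuts"],
                           door_oracle)
    print(edges)   # {('I push the door', 'the door shuts')}

Of course, the real work lies in answering the counterfactual question for each pair; the sketch only records the answers.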
(1): The observation that in causal explanation not just one `probable cause', but normally many causal factors explain the occurrence of an event, and that one cannot distinguish between `more necessary' and `less necessary' factors, is often attributed to John Stuart Mill; for example, as quoted by Steward (Ste97a, p214):
It is usually between a consequent and the sum of several antecedents; the concurrence of them all being requisite to produce, that is, to be certain of being followed by, the consequent. In such cases it is very common to single out only one of the antecedents under the denomination of Cause, calling the others merely Conditions. ... The real Cause is the whole of these antecedents; and we have, philosophically speaking, no right to give the name of causes to one of them exclusively of the others. (Mil43, p214)
(2): This accident was particularly poignant for computer scientists because Paris Kanellakis of Brown University was killed in the crash with his family.
(Col96): Aeronautica Civil of the Republic of Colombia, Aircraft Accident Report: Controlled Flight Into Terrain, American Airlines Flight 965, Boeing 757-223, N651AA, Near Cali, Colombia, December 20, 1995, Author, 1996. Also available in full in (LadCOMP).
(Ger97): T. Gerdsmeier, A Tool For Building and Analysing WB-graphs, Technical Report RVS-RR-97-02, at http://www.rvs.uni-bielefeld.de Publications, January 1997.
(GeLa97): T. Gerdsmeier, P. B. Ladkin and K. Loer, Analysing the Cali Accident With a WB-Graph, at http://www.rvs.uni-bielefeld.de Publications, January 1997. Also to appear in the Proceedings of the Glasgow Workshop on Human Error and Systems Development, March 1997.
(HoLa97): Michael Höhl and Peter B. Ladkin, Analysing the Warsaw Accident With a WB-Graph, in preparation, to appear as Technical Report RVS-Occ-97-07, at http://www.rvs.uni-bielefeld.de Publications, June 1997.
(LadCOMP): Peter B. Ladkin, ed., Computer-Related Incidents with Commercial Aircraft, compendium of accident reports, commentary and discussion, at http://www.rvs.uni-bielefeld.de
(Lew73): David Lewis, Causation, Journal of Philosophy 70, 1973, 556-567. Also in (SoTo93), 193-204.
(Lew86): David Lewis, Causal Explanation, in Philosophical Papers, ii, Oxford University Press, 1986, 214-240. Also in (Rub93), 182-206.
(Mil43): John Stuart Mill, A System of Logic, 8th edn., 1843; London: Longmans, 1873. Quoted in (Ste97a, p214).
(PaLa97): E. A. Palmer and P. B. Ladkin, Analysing An `Oops' Incident, in progress, will be available from http://www.rvs.uni-bielefeld.de
(Roc97): Gene L. Rochlin, Trapped in the Net, Princeton, New Jersey: Princeton University Press, 1997.
(Rub93): David-Hillel Ruben, ed., Explanation, Oxford Readings in Philosophy Series, Oxford University Press, 1993.
(Sea95): John R. Searle, The Construction of Social Reality, New York: Simon and Schuster, 1995; London: Penguin, 1996.
(Ste97a): Helen Steward, The Ontology of Mind: Events, States and Processes, Oxford: Clarendon Press, 1997.