This page was copied from: http://catless.ncl.ac.uk/Risks/18.10.html
ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator
On 20 December 1995, American Airlines flight 965 (AA965), a B757-223, registration N651AA, crashed into mountains on approach to Cali, Colombia. This was the first fatal B757 accident in a decade and a half of service, and I believe the first fatal jet accident for AA in about the same length of time. On 6 February 1996, a B757 of Birgenair, registration TC-GEN, crashed into the ocean off Puerto Plata, Dominican Republic, shortly after takeoff. This was the second fatal B757 accident. The US National Transportation Safety Board is `participating fully' in each of the investigations. Statements and documents concerning the accidents released by the Colombian and Dominican Republic agencies are available from the NTSB. Statements on Cali were released on 28 December 1995 and 15 February 1996, and the `docket' (containing the factual reports of the `specialty groups' of investigators) has been released recently. Statements on the Puerto Plata accident were released on 7 February, 1 March and 18 March 1996. Final reports are not yet available for either accident.

Contrary to the impression given by some recent articles in the non-specialist press, neither findings nor causes nor causal factors have yet been officially determined in either investigation. However, the factual reports so far indicate to me that the human-computer interface could be involved in both accident sequences.

These airplanes have a safety record amongst the best of any type in regular airline use - many B757 and B767 aircraft have been flying for a decade and a half in the fleets of airlines all over the world, and there have been only three fatal accidents (the first, to a Lauda Air B767, is thought to have been due to an unrecoverable technical failure alone). After such a long record of exemplary safe use, one should therefore not be too hasty to `blame the computer' or its interface. Here are some comments on both accidents.

Cali: The accident happened on approach to the airport, which is aligned along a relatively narrow valley with high mountains to either side. Because of benign weather conditions, the pilots were offered a `straight-in' approach (to land in the direction they were flying) rather than flying beyond the airport, turning, and landing in the opposite direction (the more usual procedure). They were not familiar with the `new' approach. During the approach procedures, the pilots were confused about exactly where they were in relation to the arrival procedure charts. They had passed a specific radio beacon (VOR) called Tulua, which is the start of the arrival procedure. It seems that they were not aware that they had done so; they entered this fix into the Flight Management Computer (FMC), and did not immediately notice that the aircraft had begun to turn back towards the Tulua VOR. This turn began a quick sequence of events that led to impact with a mountain (a CFIT, Controlled Flight Into Terrain, accident) about 3,000ft below the summit.

A Ground Proximity Warning System (GPWS) warning sounded about 9 seconds before impact, and the crew initiated recovery procedures (full power, maximum angle of climb, retraction of any draggy control surfaces that are deployed). The crew didn't retract the speed brakes (which are manually controlled), a necessary part of the escape procedure. The actual manoeuvres that led to impact were those of turning towards a specified VOR and turning towards a specified heading. Both were initiated by the crew. The former was apparently accomplished with the FMC, the latter with the autopilot. Small airplanes such as my Piper Archer have autopilots capable of such manoeuvres.
Either manoeuvre must be commanded by the pilot flying, as happened at Cali. However, the more sophisticated FMC requires more attention than a simple autopilot when entering fixes, and investigators are paying attention to the role the FMC-pilot interaction played. The question that arises is what exactly played a causal role. There is a known HCI phenomenon that when something is `not right', it takes more time and attention to `debug' the situation when more sophisticated devices are involved than when simpler ones are being used.

However, there were also other, human, procedural problems. The December 28 statement noted that there was no indication of descent checklist procedures being performed by the crew, and no indication of an arrival or approach procedures briefing. Also, it is a basic rule of flying that you know where you are at all times. These pilots didn't, at a crucial phase of flight, even though they had reputations for conscientiousness (see the Operational Factors/Human Performance factual report). Their confusion may also have been compounded by some of the pilot-controller discourse about the route for which they were cleared. (However, there was no evidence of `language difficulty', as this is usually understood.)

Puerto Plata: The captain's airspeed indicator (ASI) was observed to have failed on takeoff. This is not an event that requires emergency handling. The captain asked the copilot to call out the significant airspeeds for takeoff (called V1 and V2, also normal procedure) and took off as normal. He then called for the center autopilot to be switched on. His own airspeed indicator showed higher and higher airspeeds as the aircraft climbed, even though the actual airspeed of the aircraft (as recorded by ground radar) was much lower. The first officer observed that his ASI showed a marked decrease in airspeed, and the pilots became confused over which ASIs had failed (the captain thought at one point that they'd both failed). The aircraft apparently stalled, and the pilots did not succeed in recovering before hitting the ocean.

The DR/NTSB factual statement noted that the behavior of the aircraft was consistent with the captain's pitot being blocked. The airplane had been sitting on the ground for many days in the tropics without the pitots being covered (there is no procedural requirement for this, but pilots and mechanics all know that insects love pitots). The pitot is a tube facing into the airstream that receives ram air pressure in the direction of flight, and a related static port receives (roughly) ambient air pressure. The difference between the two is used to drive the ASIs in all aircraft. If the pitot is blocked, then pitot pressure remains roughly constant while ambient air pressure decreases as the aircraft climbs, leading to a greater difference between pitot pressure and static pressure, and thus to a greater `indicated' airspeed on the ASI.

The copilot has a separate pitot-static system. The pitot-static readings on the B757 are fed into the left (for the captain's system) and right (for the copilot's) Air Data Computers (ADCs), and thence to the CRT display instruments for the respective positions. There is a third pitot-static system that is purely mechanical (and therefore `traditional'). Normally, pilots of all airplanes are also trained to use the `alternate source' if a pitot-static system fails. `Alternate source' on the B757 means switching the displays so that the captain's ASI reads from the right ADC and the first officer's from the left ADC.
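To make the blocked-pitot effect described above more concrete, here is a small numerical sketch. It is my own illustration, not taken from the factual reports: it assumes a standard atmosphere, the simple incompressible airspeed relation used for ASI calibration, and a hypothetical blockage occurring near sea level while the aircraft indicates about 160 knots.

  # Rough sketch (my own, not from the report) of why a blocked pitot makes
  # indicated airspeed rise during a climb. Assumes a standard atmosphere and
  # the incompressible relation IAS = sqrt(2 * (p_pitot - p_static) / rho0).
  from math import sqrt

  P0 = 101325.0    # ISA sea-level static pressure, Pa
  RHO0 = 1.225     # sea-level air density, kg/m^3 (ASIs are calibrated to this)
  KT = 1.94384     # m/s to knots

  def static_pressure(alt_m):
      "ISA static pressure in the troposphere."
      return P0 * (1.0 - 0.0065 * alt_m / 288.15) ** 5.2561

  def indicated_airspeed(p_pitot, p_static):
      "Indicated airspeed in knots from pitot and static pressures."
      q = max(p_pitot - p_static, 0.0)    # dynamic pressure seen by the ASI
      return sqrt(2.0 * q / RHO0) * KT

  # Hypothetical case: the pitot blocks near sea level at about 160 kt,
  # trapping that total pressure in the line.
  trapped = static_pressure(0.0) + 0.5 * RHO0 * (160.0 / KT) ** 2

  for alt in (0, 1000, 2000, 3000, 4000):    # climb, metres
      ias = indicated_airspeed(trapped, static_pressure(alt))
      print(alt, "m ->", round(ias), "kt indicated")

Even if the real airspeed is falling, the trapped pitot pressure stays fixed while the static pressure drops with altitude, so the indicated value climbs steadily - which is just the behaviour described in the factual statement.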
Also, the `glass' ASIs should be checked against the mechanical `backup'. There is no evidence that either `alternate source' was used during the accident flight, or that either instrument was checked against the backup, even when the captain falsely thought that both his and the first officer's ASIs had failed.

David Learmount asserted in Flight International (27 March - 2 April) that the central autopilot gets its data from `the' ADC. He must have meant to say from the *left* (captain's) ADC. Supposing this is the case, the autopilot would react to the (false) increasing airspeed indication by raising the nose, to attempt to reduce `airspeed'. Since the false `airspeed' continued to increase with increasing altitude, the autopilot would have continued to raise the nose, the airplane would radically lose (real) airspeed, and it would eventually stall - which appears to be what happened. I have not yet been able to verify Learmount's (modified) assertion with the NTSB or Boeing engineers.

The question arises, why would the captain switch on the center autopilot if his ASI had failed and he knew it got its data from the same ADC? Anecdotal information (from a colleague with access to a B757 operations manual, and another who is a B757 pilot) suggests that this might not be information that one could expect to be at the front of every B757 pilot's mind. These two circumstances (that the center autopilot gets its data from the left ADC, and that this information is not in the mental foreground when dealing with ASI problems) may thus have played a causal role in the accident. (I emphasise again that I have not yet confirmed either.) This could be categorised as an HCI issue. As at Cali, there were other apparent procedural failures: failure to switch to the `alternate source', and failure to check against the standby mechanical instrument. Performance of either would have avoided the accident: as Learmount says, the pilots lost control of a flyable airplane.

Further Commentary: Maybe one can see in these accidents two known and often-observed HCI effects of automation, which I shall call Complacency and Complexity. The Complacency effect is that use of (normally reliable) automation can lead to reduced awareness of the state of the system. The Complexity effect is that increased automation makes some straightforward tasks more complex and interdependent. The effects are distinct, but both have the consequence that it becomes harder to figure out what's going on when something is wrong. As a discussion point, let me propose that the greater role at Cali seems to have been played by the Complacency effect, with some Complexity effect, and that at Puerto Plata by the Complexity effect alone. For the answers, we'll have to wait until the final reports.

Finally, I don't regard either accident as giving grounds for concern about the role of automation in itself. But the reports might yield insights into, and improvements to, the procedures for dealing with certain forms of automation. One always hopes to learn from the tragedies.

The text of the official statements referred to above, as well as other pertinent documents, may be found in the sections on these accidents in the hypertext Compendium `Computer-Related Incidents and Accidents to Commercial Airplanes' under http://www.techfak.uni-bielefeld.de/~ladkin/

Peter Ladkin
* Cutting: "Sunday Mail (Brisbane)", Sunday, the 14th April 1996, page 12: A Brisbane bloke was stunned to discover on his latest phone bill an amount of nearly $900 for a call of more than 10 hours duration to the Solomon Islands. The bloke _does_ occasionally call the Solomons and _does_ admit to being a bit of a yakker [talk the doors off a barn]. But for 10 hours? Query with Telstra [read as Bell Australia] brought the testy advice that he must have forgotten to hang up his receiver. The bloke pointed out that this theory was flawed by Telstra's own bill that showed that he'd used the same phone to call Melbourne (Australia) only 11 minutes after the start of the call to the Solomons. * Cutting: "Sunday Mail (Brisbane)", Sunday, the 21st April 1996, page 12: This case apparently stirred Telstra into prompt action. By Monday, 15 April, the matter had been "investigated" and the charge waived. Warrick JACKES, 52 Hamilton Road, MOOROOKA, Q. 4105 AUSTRALIA +61 411 18 55 68 wjackes@cit.gu.edu.au [Otherwise, he would have been a Gofer-Bloke. PGN]
Letter-to-the-editor of *The New York Times* Magazine section, 5 May 1996, quoted in full:

> James Gleick's `Manual Labor' (Fast Forward, April 7) touched a
> long-tender sore spot with me. For example, the manual that came
> with a car I bought not long ago contained no fewer than 31 Cautions,
> 32 Warnings, 28 Do Nots and 2 Nevers. (I never did discover the
> difference between a Do Not and a Never.) My favorite was this:
> `Do not open sun roof when car is covered with snow.'
> Robert L. Wolke, Pittsburgh

So much stuff to remember, but so few reasons. Wouldn't it be a lot easier to remember (or forever dismiss) such advisories if we knew the reasons, and had a structure to hang them on? Suppose I'm wedged between two trucks, the car is on fire, but there's still some snow on the roof. Should I try to get out through the sun roof? Do I risk only the obvious (some snow down my neck) -- or a death worse than by fire? Maybe it is best to wait till the snow has melted and hope that by then I'm not already fried? After all, if they've taken such pains to tell me something, then more than the obvious risk must be involved -- because if it were only the obvious, it wouldn't need to have been stated, right?
I heard on the radio today that a state legislator wondered why she was getting literature aimed at single parents. Investigating, she discovered that the state didn't record marital status on birth records. Instead, they assumed that if the parents' last names were not the same, they were both single! And these are the kind of "statistics" on which we base public policy!

In thinking about the 30% of California mothers who are thereby supposedly single, I realize that the issue is deeper than bad methodology; it is the reason why such numbers become important. Much public policy is driven by numbers. The CPI (Consumer Price Index) is a more pervasive example. It reflects something or other, but what it reflects is less important than that there be a way to keep score. Lest we blame the government, remember how much business policy is also made "by the numbers" -- numbers that aren't necessarily accurate. In one case a very large company assigned a programmer to assure consistency in the numbers it had gathered from the field, which were the basis for billions of dollars of corporate decision making. After attempting to clean up the data, he discovered that it was essentially pure noise by the time it had been massaged.

More obvious and more explicit are the "sweep" numbers used by television stations. Someone more involved in the specifics should provide the details, but apparently advertising rates are set by tallying the viewing during designated weeks. The problem is that everyone knows which weeks they are and therefore creates special shows to cook the numbers. Is there a better system? Perhaps not. This is, of course, a risk of technology, in that we have much better tools to gather numbers than we did in the past. But we also need some agreed-upon rules, even if they are perverse. Interestingly, these numbers might assure some degree of fairness by being so noisy that it is hard to predict their outcome using inside knowledge. But maybe I'm just getting too cynical. Or pragmatic?
Much as I dislike statistics about ``computer crimes'', this might at least be worth noting for the RISKS archives. Jon Swartz, writing in the *San Francisco Chronicle*, 7 May 1996, p. C1, summarized the conclusions of a survey just released by the Computer Security Institute and the FBI's International Computer Crime Squad. 41% of the respondents reported electronic intrusions into or unauthorized probes of their systems by disgruntled employees or competitors in the past year, and roughly half of those incidents involved insiders. The Internet has become a "prime source" of such activities. 20% said they did not know whether they had been invaded, and nearly 75% of those said they would not report incidents because of fear of negative publicity. Medical and financial institutions were particularly prone to data manipulations. (There were 428 respondents out of almost 5,000 queried.)

One of the biggest problems in justifying the need for computer security has always been that many organizations and individuals appear not to have been compromised. The 75% figure is perhaps the most interesting.
This morning, I started to receive e-mails accusing me of forcing a Swedish band called Ace of Base (AoB) to leave Sweden for ever. The e-mails continue to arrive - some are very abusive. I have never heard of Ace of Base!

Some detective work later, I discover the story: Scandinavian Airlines (SAS) apparently reviewed Ace of Base in their in-flight magazine and said that they prove that musical ability is not a requirement for Swedish bands (I paraphrase). An AoB fan has produced a Web page (http://www.ultranet.com/~wurther/opmdv.htm) that gives the story and invites readers to send their views by e-mail to a "senior SAS official"; he then gives *my* e-mail address!

Why does he do this? I'm guessing, but an AltaVista search for my e-mail address turns up Jan-Erik Andelin's MD80 Accidents Pages (http://www.clinet.fi/~andelin/md80acsk.htm), containing a copy of a report I posted to RISKS in November 1993, quoting a Flight International report on the December 1991 crash of an SAS MD-81. So I guess that "Wurther" searched for SAS, found this page, assumed I must be an SAS official, and added me to his list of hate targets! I've mailed him and his webmaster, so far with no effect.

Martyn Thomas, Praxis plc, 20 Manvers Street, Bath BA1 1PX UK. +44-1225-444700. mct@praxis.co.uk Fax: +44-1225-465205
re: X-URL: http://www.bell-atl.com/college/

Bell Atlantic is making it easy for students to disconnect service. Of course, they are also making it easy for OTHERS to disconnect service for you, and exposing the information you provide to anyone in between....
Folks following the _ACLU v. Reno_/_ALA v. DOJ_ case may want to check out our post-trial brief, filed this past Monday and now available from our home page at http://www.aclu.org . We'll also be posting our joint (ACLU/ALA) 200-page "Findings of Fact" shortly. That document neatly reorganizes all of the evidence presented at the hearing. The next and final step in the case is oral argument, scheduled for Friday, May 10th, at 9:30 a.m. in Philly. Ann Beeson Staff Counsel, _ACLU v. Reno_ American Civil Liberties Union 132 W. 43rd St. NY, NY 10036 212-944-9800 x788 [Courtesy of Audrie Krause, Executive Director, CPSR, P.O. Box 717 Palo Alto CA 94302 (415) 322-3778 akrause@cpsr.org * Web: http://cpsr.org/home.html ]
Another two risks demonstrated here are:

 - Summarisation of technical information by people who do not understand it - a reporter in this case, but the risk probably applies elsewhere.

 - Believing what newspapers print.

The story printed in the newspaper bears only a passing resemblance to the real incident. What actually happened was that a packet sniffer was found running on a machine on the subnet that connects the central Unix service, mail server, and so on. Everyone who uses these systems was required to change passwords. The e-mail system has not been replaced, and I've no idea how this detail got into the article.

Steve Early sde1000@cam.ac.uk
> 1) Use PGP.

This does not help. What if the output of PGP encryption innocently contains the byte sequence 0x63 0x75 0x6E 0x74? Since the output is gibberish, you didn't check it - but the computer may, and censor it. There are similar problems with uuencode, rot13, etc. Sigh!

Peter Miller pmiller@agso.gov.au uunet!munnari!agso.gov.au!pmiller

[Sigh-bernetics caught in a sigh-clone? Sigh-onara! Imagine being tossed in jail because your encrypted message just happened to trigger a filter! PGN]
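To put a rough number on how likely such an innocent match is, here is a small back-of-the-envelope sketch. The figures are my own illustration, not from the post: it simply treats ciphertext as uniformly random bytes and asks how often a fixed 4-byte pattern appears somewhere in a message.

  # Probability that a specific 4-byte pattern appears somewhere in a message
  # of effectively random bytes (e.g., PGP ciphertext). Each starting position
  # matches with probability 256**-4, i.e., about 2.3e-10.

  def prob_pattern(msg_len, pattern_len=4):
      "Chance of at least one match, treating positions as independent."
      positions = max(msg_len - pattern_len + 1, 0)
      return 1.0 - (1.0 - 256.0 ** -pattern_len) ** positions

  for size in (1000, 100000, 10000000):    # message sizes in bytes
      print(size, "bytes:", "%.1e" % prob_pattern(size))

Any single message is unlikely to be flagged, but a filter scanning millions of encrypted or compressed messages will eventually censor innocent traffic - and unlike plain text, the sender has no way of anticipating or avoiding it.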
I don't know about any public announcement, but in Richard Feynman's "What Do You Care What Other People Think?" there is an extended account of his part in the investigation of the Challenger disaster, including quite a lot about the odds quoted within NASA for various kinds of failure, and their (often tenuous) relation to reality. I'm afraid I don't remember any of the figures, but they were wildly overoptimistic. One interesting thing was that when Feynman talked to the engineers who actually worked on the shuttle components, they gave pretty good estimates for things; but somehow, as information propagated up the management structure, it got fudged. (Cf. the entry in the Jargon File under "SNAFU principle".)

Gareth McCaughan Dept. of Pure Mathematics & Mathematical Statistics, gjm11@pmms.cam.ac.uk Cambridge University, England.
[Referring to Richard P. Feynman, "What Do You Care What Other People Think?", Unwin Hyman Ltd. (London), 1988, ISBN 0-04-440341-0 (first published in the USA in 1988 by W. W. Norton & Company Inc.), chapter "Fantastic Figures", pp. 177-188.]

> What were the odds?

10^-5 (i.e., 1 in 100,000) was NASA's "official" figure. Feynman quotes the range safety officer at Kennedy as having arrived privately at an estimate of 1 in 100 (based on the observation that 4% of unmanned flights had failed, and an optimistic assumption that manned craft must be safer).

Peter Mellor, Centre for Software Reliability, City University, Northampton Square, London EC1V 0HB, UK. Tel: +44 (171) 477-8422 p.mellor@csr.city.ac.uk

[Also commented on by smurf@noris.de (Matthias Urlichs), "Ratzka Wolfgang Dr." <ratzka@braun0.HRZ.Uni-Marburg.DE>, and Roy Murphy <murphy@panix.com>. PGN]
I have Volume I of the Report of the Presidential Commission on the Space Shuttle Challenger Accident (U.S. Gov't Printing Office, June 6, 1986). There are 5 volumes; Volume I is the summary and has 256 pages. It is well written, easy to read, and remarkably free of technobabble. Any large library should have it; some of it is also online as the Rogers Commission Report at http://www.ksc.nasa.gov/shuttle/missions/51-l/docs/rogers-commission/table-of-contents.html .

Nowhere in this volume could I find a reference to the numerical odds of a shuttle accident. There are many statements that recognize that there are risks that cannot be totally eliminated. Conceivably, there might be a calculation of the odds in one of the other volumes of the report. According to a NASA press release at http://spacelink.msfc.nasa.gov/NASA.Projects/Human.Space.Flight/Shuttle/Shuttle.Program.Changes.Since.1986 , there has been 1 failure in 74 flights (through January 1996), for a reliability of 0.986. (I asked a web index tool to find references to "odds of a space shuttle accident" in order to locate these documents.)

Personally, I believe that every practicing engineer, and every manager in an engineering organization, should read this report regularly. It is both enlightening and sobering on the difficulties of building reliable, complex systems in the real world. The accident was caused by "known problems" in a faulty design. The attempted resolution of the problems was poorly executed and poorly managed. Safety concerns were not escalated up the management chain. Known problems were dismissed as "still within limits". Launch constraints were waived at the expense of safety. Management reversed the position of its own engineers. All of this came out at the time, but perhaps some people who were not around at the time have not heard about it. I urge everyone to read it.

Paul Green, Senior Technical Consultant, Stratus Computer, Inc., Marlboro, MA 01752 Paul_Green@stratus.com +1 508-460-2557 FAX: +1 508-460-0397
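A small arithmetic sketch (my own illustration, not from either post) shows how the quoted figures square with a record of 1 failure in 74 flights: if the true per-flight loss probability were p, how likely would at least one loss in 74 flights be?

  # How plausible is 1 loss in 74 flights under each quoted per-flight
  # failure probability? (Simple binomial arithmetic.)

  def p_at_least_one_loss(p_per_flight, flights=74):
      return 1.0 - (1.0 - p_per_flight) ** flights

  for label, p in (("NASA 'official' 1e-5", 1e-5),
                   ("range safety officer, 1 in 100", 1.0 / 100)):
      print(label, "-> P(>=1 loss in 74 flights) = %.4f" % p_at_least_one_loss(p))

  # Observed point estimate from the NASA release: 1 failure / 74 flights.
  print("observed rate: %.4f per flight, reliability %.3f" % (1.0 / 74, 1.0 - 1.0 / 74))

Under the official 10^-5 figure, a loss within 74 flights would have been well under a tenth of a percent likely, while under the 1-in-100 estimate it is close to an even-money outcome; the flight record is far easier to reconcile with the latter.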
I have not seen that report, but in the late 1980s, when I was working on Shuttle-derived launch vehicle studies, we did an in-house assessment assuming the post-Challenger fixes were made correctly. We got a figure of 1 in 70 per launch for losing an Orbiter, and 1 in 100 per launch for losing the crew. The difference between the two figures is that you can have a landing that the crew can walk away from, but that was hard enough to overstress the Orbiter structure, so that you have to write off the vehicle.

Dani Eder

[Further comments from Ann_Holt@ftdetrck-ccmail.army.mil. PGN]