University of Bielefeld - Faculty of Technology
Networks and Distributed Systems
Research Group of Prof. Peter B. Ladkin, Ph.D.
20 March 1998
I consider three papers on the Ariane 5 first-flight accident: one by Jézéquel and Meyer, suggesting that the problem was one of failing to use appropriate system design techniques; one by Garlington, on the culture of flight-control and avionics software engineering; and one by Baber, on the use of pre- and postconditions in programming. I conclude that Jézéquel and Meyer's argument is fundamentally mistaken, and that although Garlington's (reconstructed) critique of them is broadly correct, a conclusion closer to that of Jézéquel and Meyer is the most appropriate one, though it follows from a different argument from the one they gave.
On 4 June 1996 the maiden flight of the Ariane 5 launcher ended in a failure, about 40 seconds after initiation of the flight sequence. At an altitude of about 3700 m, the launcher veered off its flight path, broke up and exploded. The failure was caused by "complete loss of guidance and attitude information" 30 seconds after liftoff. To quote the synopsis of the official report: "This loss of information was due to specification and design errors in the software of the inertial reference system. The extensive reviews and tests carried out during the Ariane 5 development programme did not include adequate analysis and testing of the inertial reference system or of the complete flight control system, which could have detected the potential failure."

Because of this conclusion, the accident has generated considerable discussion. Code was reused from the Ariane 4 guidance system; the Ariane 4 has different flight characteristics from the Ariane 5 in the first 30 seconds of flight, and exception conditions were generated on both inertial guidance system (IGS) channels of the Ariane 5. There are instances in other domains in which what worked for the first implementation did not work for the second. Henry Petroski (Pet94) makes this point about the history of bridge-building in the nineteenth and twentieth centuries, noting that failures often came not from the first, careful, conservative implementation of a design, but from its extension. The European Space Agency has provided both a summary (ESA96a) and the Inquiry Board Report (ESA96b) on the Web.
The problem was caused by an `Operand Error' in converting data in a subroutine from 64-bit floating point to 16-bit signed integer. One value was too large to be converted, creating the Operand Error. This error was not explicitly handled in the program (although other potential Operand Errors were), and so the computer, the Inertial Reference System (SRI), halted, as specified in other requirements. There are two SRIs, one `active' and one `hot back-up', and the active one halted just after the backup, from the same problem. Since no inertial guidance was now available, and the control system depends on it, we can say that the destructive consequence was the result of `Garbage in, garbage out' (GIGO).

The conversion error occurred in a routine which had been reused from the Ariane 4 vehicle, whose launch trajectory was different from that of the Ariane 5. The variable containing the calculation of Horizontal Bias (BH), a quantity related to the horizontal velocity, thus went out of `planned' bounds (`planned' for the Ariane 4) and caused the Operand Error. Lots of software engineering issues arise from this case history.
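For concreteness, here is a minimal sketch of the kind of conversion involved. It is in Python rather than the Ada of the flight software, the names and the numeric value are invented for illustration, and the report does not say exactly what the protected conversions substituted for out-of-range values, so the saturating variant is only one plausible reading.

    I16_MIN, I16_MAX = -(2**15), 2**15 - 1   # range of a 16-bit signed integer

    def to_int16_unguarded(x: float) -> int:
        """Analogue of the unprotected conversion applied to BH: an out-of-range
        value raises, and if nothing handles the exception, execution simply
        stops -- the analogue of the unhandled Operand Error that halted the SRI."""
        if not (I16_MIN <= x <= I16_MAX):
            raise OverflowError(f"{x} does not fit in a 16-bit signed integer")
        return int(x)

    def to_int16_guarded(x: float) -> int:
        """Analogue of the conversions that were protected: the range is checked
        and the value clipped, so a sane (if saturated) result comes back instead
        of an exception propagating."""
        return int(min(I16_MAX, max(I16_MIN, x)))

    bh = 50_000.0                        # invented out-of-range value, standing in for BH on an Ariane 5 trajectory
    print(to_int16_guarded(bh))          # prints 32767: clipped, execution carries on
    # print(to_int16_unguarded(bh))      # would raise OverflowError: the unhandled case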
Four Comments on the Software and Its Development
Jean-Marc Jézéquel and Bertrand Meyer (JeMe97) argued that a different choice of design method and programming language would have avoided the problem. One argument for this conclusion could go like this. A language which forced explicit exception handling of all data type errors as well as other non-normal program states (whether expected or not) would have required an occurrence of an Operand Error in this conversion to be explicitly handled. To reproduce the problem, a programmer would have had to have written a handler which said `Do Nothing'. One can imagine that as part of the safety case for any new system, it would be required that such no-op handlers be tagged and inspected. An explicit inspection would have caught the problem before launch. As would, of course, other measures. Jézéquel and Meyer thus have to make the case that the programming language and design method would have highlighted such mistakes in a more reliable manner than other measures. Ken Garlington argues (Gar98) that they do not succeed in making this case.
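To make that hypothetical concrete: in a language regime which forces every conversion failure to be handled, reproducing the Ariane behaviour would mean writing something like the following explicitly. This is a Python sketch; the function name, the bounds and the shutdown call are all invented for illustration and are not taken from the flight code.

    import sys

    I16_MIN, I16_MAX = -(2**15), 2**15 - 1

    def convert_horizontal_bias(bh: float) -> int:
        """Hypothetical wrapper around the BH conversion."""
        try:
            if not (I16_MIN <= bh <= I16_MAX):
                raise OverflowError("BH out of 16-bit range")
            return int(bh)
        except OverflowError:
            # Explicit `do nothing useful' handler: halt the channel, which is
            # what the *unhandled* Operand Error did implicitly on the real SRI.
            # In the regime Jézéquel and Meyer envisage, this decision has to be
            # written out here, where it can be tagged and questioned during a
            # safety-case inspection.
            sys.exit("SRI channel shutdown")

The point is not that this code is better; it is that the shutdown decision now sits in plain view, rather than being the default consequence of an exception nobody wrote a handler for.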
Independently, Stefan Leue wondered if one couldn't make a case to blame the first Ariane 5 debacle partly on the programming language used, on the basis that with the right features, it wouldn't have happened:
Couldn't one attribute the failure of the inertial navigation software in the Ariane to the absence of a proper exception handling mechanism that would have caught the arithmetic overflow? Isn't that an accident caused by [the absence of] a robustness feature in the programming language used?
(Leu98)
That's an interesting case to argue. One would be saying that the routine was programmed at too low a level (the programmer had to handle the exception conditions him- or herself, rather than having available language features that would handle them), and the question would then be whether there was a reasonable alternative choice of language for that task that would have caught the overflow.
Jézéquel and Meyer believe it was a `reuse specification problem', and Garlington believes:
With respect to the Ariane 5 accident: The final report summary states that the development team deliberately avoided use of the available language feature that could have been used to protect the code; thus, it does not seem fair to blame the language in this case: [...]
(Gar98b)
An argument superficially similar to that made by Jézéquel and Meyer was proposed by Robert Baber (Bab97a) (Bab97b). Baber correctly points out that the strict precondition of any data conversion routine must include range guards, and claims
Lack of attention to the strict preconditions [...] was the direct cause of the destruction of the Ariane 5 and its payload -- a loss of approximately DM 1200 million.
(Bab97b)
Baber's use of the term `direct cause' is technically in error -- the direct cause, as normally understood by accident investigators, would be the faulty guidance computations. These were caused by the halting of the SRI computers; this in turn was caused by the Operand Error and the lack of explicit handling; this in turn was caused by .... well, I'll stop here. Furthermore, he doesn't offer any suggestion of what should have been done with this strict precondition in the software development. The report (ESA96b) points out that range guards were implemented for some, but not all, of the variables, and indicates why they were not implemented for BH. So Baber's implicit suggestion, that range guards be explicitly recognised and incorporated in the software development process, was in fact followed, but the guard was rejected with reason in this specific case for the Ariane 4. Baber has no further suggestion in his paper about what to do to avoid similar problems.
How are we to decide the role of the language? First, Garlington's brief argument quoted above is too slick. Observe that machine language is Turing complete. Then observe that exception handlers for `Operand Errors' can therefore be written in machine language, then interpret `the language' in his statement to refer instead to machine language. Garlington's conclusion would then read that, had the code been written in assembler instead, it would have been `unfair to blame the language'. I imagine everyone would agree that this conclusion would be nonsense, even though the premiss would remain true. Since a simple substitution instance of Garlington's argument leads to a false conclusion from a true premiss, I infer that the argument itself is invalid. Which doesn't necessarily mean that it should be thrown away - it may simply be incomplete, and needs to be completed (i.e. more premises need to be added).
(JeMe97) is a combination of two things:
(JeMe97) says:
(JeMe97) says also that (h) good practice (which they call Design by Contract) requires that precondition, postcondition and invariant be specified for each module in a safety-critical system which is reused from elsewhere. (They encapsulate this piece of wisdom in the phrase "Reuse without a contract is folly!"). They say furthermore that (j) such a fundamental constraint as a range constraint for data type conversions would be stated explicitly, and therefore would have been caught by good practice as in (h).
That's it. The entire article. Now, who could disagree mutatis mutandis with conclusions (h) or (j)? I shall assume no one. After all, they do shoot horses, don't they? (I say `mutatis mutandis' because requiring that one separate the specification of a module's pre- and post-behavior into pre- and postcondition is an unnecessary and occasionally annoying logical constraint, as those of us who use TLA are aware.)
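Before turning to Garlington's critique, a rough sketch of what (h) and (j) amount to in code may help. It is in Python rather than Eiffel, which builds the equivalent machinery in; the decorator, the names and the bounds are all invented for the example.

    from functools import wraps

    def contract(pre, post):
        """Minimal Design-by-Contract helper: check a precondition on the argument
        and a postcondition on the result, failing loudly if either is violated."""
        def decorate(f):
            @wraps(f)
            def wrapper(x):
                assert pre(x), f"precondition of {f.__name__} violated for {x!r}"
                result = f(x)
                assert post(result), f"postcondition of {f.__name__} violated"
                return result
            return wrapper
        return decorate

    I16_MIN, I16_MAX = -(2**15), 2**15 - 1

    @contract(pre=lambda x: I16_MIN <= x <= I16_MAX,    # the Ariane-4-era assumption, now explicit
              post=lambda r: I16_MIN <= r <= I16_MAX)
    def bh_to_int16(bh: float) -> int:
        """Reused conversion routine with its range assumption stated as a contract."""
        return int(bh)

Whoever reuses such a module on a new vehicle now has something concrete to check the new trajectory data against. Whether a violated precondition should abort, saturate or switch to a degraded mode is a further design decision; the point is only that the assumption is now visible.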
Now, (Gar98) considers and casts doubt on the claims that
Here's the reconstructed argument:
Conclusion (u): this is more properly classified as a requirements error rather than a programming error. The program was written against Ariane 4 requirements; these requirements were not transferred to the Ariane 5 requirements spec; the Ariane 5 requirements therefore did not state the range requirement; the (implicit in Ariane 5) range requirement was in conflict with the behavior of Ariane 5 (as in fact explicated in other Ariane 5 requirements); requirements came up against behavior and the rocket was destroyed. (It is not surprising that it was a requirements error - over 90% of safety-critical systems failures are requirements errors, according to a JPL study that has become folklore, as well as Knight-Leveson, I believe.)
Conclusion (v): since this was a requirements error, it is most appropriate to fix it by fixing the requirements process. For example, integrated testing, rather than testing narrowly against requirements.
Conclusion (w): since the choice of programming language does not clearly have much to do with requirements analysis, recommending precise specification of modules in a suitable design/programming language is barking up the wrong tree.
Conclusion (x): which isn't to say that using DbC/Eiffel or any one of many programming techniques wouldn't have avoided the problem: accidents usually have many necessary causal factors, and had any single one of them been eliminated, the accident would not have happened (this is almost the definition of `causal factor'; indeed, it *is* the USAF definition of `cause'). It's just to say that the choice of design/programming language isn't the most salient factor.
(Gar98) is actually more explicit than any of these conclusions. Although he didn't phrase it quite this way, Garlington suggests that
I have a lot of sympathy with conclusions (u) through (y), and I think Point (y) is particularly telling.
How does this compare with (JeMe97)? Let's agree with (JeMe97) that it wasn't a process management problem (incompetence or egregious management error). Then they rule out design error, implementation error and testing error. They conclude from this that it's a reuse error. Well, this reasoning is defective on a basic level. Here's why.
A system has requirements specs, design specs (down to and including the detailed design of the hardware) and hardware. I claim this is a complete, if very crude, list of its parts. Now, everybody agrees the Ariane 5 problem wasn't a hardware fault. So it's either a design error or a requirements error. (JeMe97) claims it wasn't a design fault (their `implementation' is included in the `design' part of my crude classification). They should have concluded it was a requirements error, which they didn't do. In fact, their proffered solution is a design solution (they're suggesting annotating individual modules), after having concluded it wasn't a design error. That's just mixed up. They should pay more attention to their basic ontology.
And Garlington's prognosis (y) seems to be right, even though (Gar98) hides this reasoning.
But I still think the basic point that (JeMe97) make is correct, and I think, pace (Gar98) and (Gar98b), that the language is partly to blame. Here's why.
First, requiring explicit documentation of data range assumptions is, as (JeMe97) imply, simply a part of good practice. You can enforce it in the language, as they suggest with DbC/Eiffel, and as Leue suggested should be done with a `decent' language. Writing basic low-level data-conversion programs in a high-level language such as Ada seems pretty barmy to me, although Ada, being strongly typed, at least requires you do it explicitly, unlike C++ for example. (`Standard practice' you say? Then `standard practice' should be encouraged to change. After all, they do shoot horses....)
Second, a safety case for any system should examine this explicit documentation - after all, (JeMe97) are correct in saying this is a `basic constraint', even if you allow (Gar98)'s assertion that you can't check everything. Therefore requiring an explicit `No Handler' at the design/programming-language level would suffice to bring it explicitly to the attention of the safety-case examiners. Such a language, or its equivalent, should therefore be used; Ada is clearly not such a language.
Third, (Gar98) is right that you can mess it up at any of the levels, programming or requirements or safety-case. Even though you enforce explicit evidence of a potential problem, there's no guarantee that safety inspectors won't skim over it on their way to the coffee break.
And finally, suppose one doesn't believe any of this reasoning. How could one objectively analyse the Ariane 5 accident, to find out where the causal factors lie and therefore to find possible solutions? Well, one could go to the report, as (Gar98) does -- but such reports often contain reasoning errors also. There's only one way I know of by which one can (a) objectively analyse the causal relations in the accident, from the data given in the report, and (b) rigorously prove, in logic, that the ensuing causal analysis is correct and sufficient (`sufficient' is a formal term which is relative to the amount of detail one wishes to include, which varies from customer to customer). That way is called Why-Because Analysis (WBA), and is a whole other story. For the curious, I do caution that use of WBA requires a strong stomach for formal logic.
Postscript
Jézéquel has a WWW page (Jez98) summarising and commenting on some discussion of his paper with Meyer. Garlington revised his paper in minor ways after our discussion on the Safety Critical Systems mailing list. There is a discrepancy between his version of Jézéquel and Meyer's title and that of the version of the paper he is criticising, which may mislead some readers. Garlington also describes deadlock (twice) as a timing fault or error, and refers to a methodology called ADARTS for justification. Since absence of deadlock is a classical safety property, which can be stated without any use of metric properties of a time line, whereas classical timing constraints are liveness properties, and the two are logically distinct sorts of things, I think it is highly misleading to suggest that deadlock is a timing error.
(Bab97a): Robert L. Baber, The Ariane 5 explosion: a software engineer's view, Risks-18.89, 12 March 1997, available at http://catless.ncl.ac.uk/Risks/18.89.html#subj6
(Bab97b): Robert L. Baber, The Ariane 5 Explosion as seen by a software engineer, available at http://www.cs.wits.ac.za/~bob/ariane5.htm
(Gar98): Ken E. Garlington, Critique of "Put it in the contract: The lessons of Ariane", available at http://www.flash.net/~kennieg/ariane.html
(Gar98b): Ken E. Garlington, Note to the Safety Critical Systems Mailing List, 14 March 1998, distributed by Dept. of Computer Science, University of York.
(Gar98c): Ken E. Garlington, Note to the Safety Critical Systems Mailing List, 15 March 1998, distributed by Dept. of Computer Science, University of York.
(JeMe97): Jean-Marc Jézéquel and Bertrand Meyer, Design by Contract: The Lessons of Ariane, IEEE Computer 30(2):129-130, January 1997, also at http://www.eiffel.com/doc/manuals/technology/contract/ariane/
(Jez98): Jean-Marc Jézéquel, Usenet comments on "Put it in the contract: The lessons of Ariane", at http://www.irisa.fr/pampa/EPEE/Ariane5-comments.html
(ESA96a): European Space Agency, ESA/CNES Joint Press Release Ariane 501, No. 33-96, 23 July 1996, available at http://www.esrin.esa.it/htdocs/tidc/Press/Press96/press33.html
(ESA96b): European Space Agency, Ariane 5, Flight 501 Failure, Board of Inquiry Report, 19 July 1996, available at http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html
(Leu98): Stefan Leue, personal communication, 13 March 1998.
(Pet94): Henry Petroski, Design Paradigms: Case Histories of Error and Judgement in Engineering, Cambridge University Press, 1994.
(WBA): The WBA Home Page, available at http://www.rvs.uni-bielefeld.de
Copyright © 1999 Peter B. Ladkin, 1999-02-08