University of Bielefeld - Faculty of Technology, Networks and Distributed Systems
Research group of Prof. Peter B. Ladkin, Ph.D.
This page was copied from: http://www.cs.york.ac.uk/~jdm/sclist/lelannariane.html
This contribution is prompted by the fact that there is still no widespread agreement on the nature of the failure of Ariane 5 flight 501 (June 1996). It is also prompted by discussions I have had with Peter Ladkin, whom I thank for helping to improve the presentation of the arguments that follow.
An Inquiry Board (IB) was formed to identify the cause(s) of the 501 failure. The IB report concludes that the causes are software (S/W) design and S/W implementation errors [ESA, 1996]; see [Le Lann, 1996] for an example of the analyses published since. (Of course, these analyses, as well as this contribution, assume that all causal factors appear in the IB report.) In fact, it is almost straightforward to show that the 501 failure has a unique cause, which is a system engineering (SE) fault.
This is so because this SE fault is the root of the causal graph that
leads to the 501 failure. Stated differently, none of the other causal
factors (such as the BH overflow) precedes this one. (I leave it to
Peter Ladkin to give a more refined definition of "cause".)
Back to the facts. The alignment task was running, despite the fact
that, after lift-off, realignment of the inertial platform, needed with
Ariane 4 (A4), is useless in the case of Ariane 5 (A5). This task
contains the conversion procedure that computes integer BH from
horizontal velocity.
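
To make the conversion concrete, here is a minimal sketch of a float-to-integer conversion that refuses values outside the target range. The 16-bit signed width is taken from the IB report [ESA, 1996], not from this text; the names `to_int16`, `compute_bh`, `OperandError` and the `scale` parameter are hypothetical and only illustrate the mechanism.

```python
import math

# Assumed target type: 16-bit signed integer (width as given in the IB report).
INT16_MIN, INT16_MAX = -32768, 32767


class OperandError(Exception):
    """Raised when a converted value does not fit the target integer type."""


def to_int16(value: float) -> int:
    """Truncate a float toward zero and reject results outside 16 bits."""
    truncated = math.trunc(value)
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OperandError(f"{value!r} does not fit in a 16-bit signed integer")
    return truncated


def compute_bh(horizontal_velocity: float, scale: float = 1.0) -> int:
    """Hypothetical stand-in for the BH conversion; `scale` is illustrative."""
    return to_int16(horizontal_velocity * scale)
```

With these assumptions, `compute_bh(1000.7)` yields `1000`, while `compute_bh(40000.0)` raises `OperandError`. Whether such an error is raised, saturated or masked is a matter of "how"; whether the task should run at all is a separate question.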
What if someone had had the idea of disallowing the execution of this
task after lift-off? Simple. The scenario that led to the 501 failure
could not have occurred.
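
As an illustration only, disallowing the task after lift-off can be sketched as a guard driven by a system-level requirement. The table below encodes just what the text states (realignment after lift-off is needed with A4 and useless with A5); the names `Phase`, `alignment_enabled` and `run_cycle` are hypothetical and do not describe the actual flight software.

```python
from enum import Enum, auto
from typing import Callable


class Phase(Enum):
    BEFORE_LIFT_OFF = auto()
    AFTER_LIFT_OFF = auto()


# System-level requirement as stated in the text: realignment after
# lift-off is needed with Ariane 4 and useless with Ariane 5.
ALIGNMENT_NEEDED_AFTER_LIFT_OFF = {"A4": True, "A5": False}


def alignment_enabled(launcher: str, phase: Phase) -> bool:
    """Decide whether the alignment task may run for this launcher and phase."""
    if phase is Phase.BEFORE_LIFT_OFF:
        return True
    return ALIGNMENT_NEEDED_AFTER_LIFT_OFF[launcher]


def run_cycle(launcher: str, phase: Phase,
              alignment_task: Callable[[], None]) -> bool:
    """Run the alignment task only if the requirement allows it.

    Returns True if the task ran, False if it was inhibited.
    """
    if alignment_enabled(launcher, phase):
        alignment_task()
        return True
    return False
```

The guard is trivial; the entry `"A5": False` is the part that only someone with knowledge of the launcher technology can supply.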
Now the argument.
How could this someone know that this was the right thing to do?
Obviously, only by correctly capturing the problem to be solved by those
engineers in charge of the A5 computer-based system, i.e. by correctly
specifying the interface between this particular A5 subsystem and the
A5 inertial platform subsystem.
Decomposition of a launcher into subsystems, and specification of
appropriate interfaces (capture of requirements and assumptions) between
these subsystems, are SE activities, which depend on which satellite
launcher technologies are selected. Only the main architect of a
launcher can conduct such SE activities correctly, because only the main
architect is responsible for deciding how to decompose a launcher into
subsystems, given the technological choices made.
Consequently, this someone can only be an Ariane 5 engineer. Indeed,
only an engineer aware of the technology retained for the A5 program can
say: "Given A5 technology, there is no need to have the strap-down
inertial platform aligned after lift-off".
That system engineering-dependent knowledge is totally independent of
whether the alignment "thing" which, after lift-off, happens to be
needed (A4) or not needed (A5) is implemented in hardware, in software,
or in melloware, correctly or incorrectly. That knowledge is also
totally independent of whether the "thing" is a reused "thing" or a
newly developed "thing". It is also totally independent of whether
inhibition of the "thing" after lift-off is instantiated via, e.g., a
boolean set to false or a mechanical switch activated after lift-off.
Hence, the 501 failure does not result from "how" the "what" (was needed
or not needed) was instantiated. The 501 failure was caused by
overlooking the "what", which is a requirement capture fault. And given
that the knowledge at stake is system engineering-dependent, the cause
is an SE fault.
It has never been the intent of ESA, of CNES, of Aérospatiale, or of
Arianespace to plan, commission, build and operate a launcher that is
based on A5's technology and needs inertial platform alignment after
lift-off, a fictitious launcher that could be labelled Ariane 4.5,
half-way between A4 and A5.
End of the argument.
Therefore, stricto sensu, all the work that has been invested in
"inspecting the code" and ironing out the "S/W errors" from the
alignment task, all the contributions - including ours [Le Lann,
1996], [Le Lann, 1997] - to the "Is the 501 failure due to software or
system engineering mistakes?" debate, apply to this fictitious Ariane
4.5 launcher, that will never be operated, and whose unique flight is
labelled 501, not to the Ariane 5 program.
The real qualification flights of A5 have been the (successful) flights
502 and 503, which were conducted with the alignment task inhibited
after lift-off. Consequently, success with these flights cannot result
from having "inspected the code and corrected the bugs" of the alignment
task, since this task was not in use after lift-off.
It is certainly interesting to keep discussing the 501 failure until, maybe, our community reaches a consensus on one of the three prevailing views, namely:
1) The 501 failure could have been avoided by "inspecting the S/W"
(group G1),
2) No way! The failure was caused by a requirement fault, a view
which is further split into two diagnoses (groups G2 and G3).
Still, we should not forget that these discussions make sense only in
the context of the fictitious Ariane 4.5 launcher. Neither should we
ignore that the issue of "correcting the bugs" of the alignment task
lost any practical relevance as early as 1996.
As a member of G3, I am interested in continuing to interact with
representatives of G2 (the most populated group, it seems, at this
time), and in discussing at greater length why I believe it does not
make sense to shift responsibilities from System Engineering to S/W
Engineering or to H/W Engineering.
In the particular case of flight 501, I have argued in [Le Lann, 1996]
and [Le Lann, 1997] that the errors identified in the IB report are
causal consequences of System Engineering faults. They are not causes of
the 501 failure, but manifestations of more "profound" causes.
Yes, maybe, with luck, following some "good" S/W Engineering method
(some "good" H/W Engineering method if H/W implementation had been
resorted to), someone could have been led to ask such questions as:
Q1: "Under which conditions should this function be available, be
inhibited?",
Q2: "What's the range of possible values for horizontal velocity?".
It's much less likely that the
Q3: "What's the failure model assumed for processors?"
or the
Q4: "Can the assumption that there is no common mode failure (of the
SRI module) be violated?"
questions would have been raised.
But why take chances, anyway? This knowledge (questions and responses)
is natural and obvious to Ariane 5 engineers (Q1 and Q2), natural and
obvious to (system-level) designers of the Ariane computer-based system
(Q3 and Q4). With a "good" System Engineering method at hand, it would
have been normal practice for these engineers to spontaneously
"propagate that knowledge", via specifications handed over to S/W (to
H/W) engineers, relieving them from the burden of "not forgetting to ask
(the right questions?, all of them?)".
It seems there is a temptation to consider that a S/W (or a H/W)
Engineering method is "good" not only if it guarantees correct
implementations of specifications but, furthermore, if it also
guarantees that the specifications under consideration are correct with
respect to some higher-level problem. Why should a S/W (or a H/W)
Engineering method compensate for lack of consideration for System
Engineering issues? Where do these specifications meant to be S/W
(or H/W) implemented come from? Is there not a boundary to the
"universe" that is tractable with S/W (or H/W) concepts?
Besides this, concerning the Ariane 5 program, a really interesting
question is as follows: Was the S/W used for flight 501, with the
exception of the alignment task, found to be "erroneous" and, if so,
have experts found fatal S/W errors, i.e., errors which, if not
corrected, would have led to a failure of flight 502?
As of now, I have received only one response with actual content
(i.e., other than "it's secret", which might be understandable). I have
been told by some experts in group G1 that they had found non-fatal
S/W errors. This demonstrates that bug-free S/W is not a necessity,
given that Ariane 4 has been operated very successfully for over 10
years despite the existence of these S/W errors.
[ESA, 1996] European Space Agency, "Ariane 5 - Flight 501 Failure",
Board of Inquiry Report, 19 July 1996, 18 p.
[http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html].
[Ladkin, 1998] P. Ladkin, "The Ariane 5 Accident: A Programming
Problem?", Article RVS-J-98-02, Bielefeld University, Germany, March
1998 [http://www.rvs.uni-bielefeld.de/publications/]
(or look at the Computer-Related Incidents with Commercial Aircraft).
[Le Lann, 1996] G. Le Lann, "The Ariane 5 Flight 501 Failure - A Case
Study in System Engineering for Computing Systems", INRIA Research
Report 3079, Dec. 1996, 26 p. [http://www.inria.fr/RRRT/publications-fra.html].
[Le Lann, 1997] G. Le Lann, "An Analysis of the Ariane 5 Flight 501
Failure - A System Engineering Perspective", 10th IEEE Intl. ECBS
Conference, March 1997, 339-346.
[RISKS] The RISKS Forum [http://catless.ncl.ac.uk/Risks].
[SCS] Safety Critical Systems Mailing List [ftp.cs.york.ac.uk, directory hise_reports/sc.list].
Copy by Michael Blume