University of Bielefeld - Faculty of Technology
Networks and Distributed Systems
Research Group of Prof. Peter B. Ladkin, Ph.D.
[The Space Shuttle Primary Control Software was developed by IBM's Federal Systems Division, which recently became the Loral Federal Systems Division and now is owned by Lockheed-Martin. This is a delightful tale of - inadvertent - software debugging. The interview of Tony Macina, then manager of flight operations for IBM's On-Board Space Shuttle Program, and Jack Clemons, then manager of avionics flight software development and verification, was conducted by David Gifford and Alfred Spector. PBL]
AS: Could you describe a training scenario on the SMS [the Shuttle Mission Simulator] that caused a problem for you?
Clemons: Yes - it was a "bad-news-good-news" situation. In 1981, just before STS-2 was scheduled to take off, some fuel was spilled on the vehicle and a number of tiles fell off. The mission was therefore delayed for a month or so. There wasn't much to do at the Cape, so the crew came back to Houston to put in more time on the SMS.
One of the abort simulations they chose to test is called a "Transatlantic abort," which supposes that the crew can neither return to the launch site nor go into orbit. The objective is to land in Spain after dumping some fuel. The crew was about to go into this dump sequence when all four of our flight computers locked up and went "catatonic". Had this been the real thing, the Shuttle would probably have had difficulty landing. This kind of scenario could only occur under a very specific and unlikely combination of physical and aerodynamic conditions; but there it was: Our machines all stopped. Our greatest fear had materialized - a generic software problem.
We went off to look at the problem. The crew was rather upset, and they went off to lunch.
AS: And contemplated their future on the next mission?
Clemons: We contemplated our future too. We analyzed the dump and determined what had happened. Some software in all four machines had simultaneously branched off into a place where there wasn't any code to branch off into. This resulted in a short loop in the operating system that was trying to field and to service repeated interrupts. No applications were being run. All the displays got a big X across them indicating that they were not being serviced.
AS: What does that indicate?
Macina: The display units are designed to display a large X whenever the I/O traffic between the PASS computers and the display is interrupted.
Clemons: We pulled four or five of our best people together, and they spent two days trying to understand what had happened. It was a very subtle problem.
We started outside the module with the bad branch and worked our way backward until we found the code that was responsible. The module at fault was a multi-purpose piece of code that could be used to dump fuel at several points in the trajectory. In this particular case, it had been invoked the first time during ascent, had gone through part of its process, and was then stopped by the crew. It had stopped properly. Later on, it was invoked again from a different point in the software, when it was supposed to open the tanks and dump some additional fuel. There were some counters in the code, however, that had not been reinitialized. The module restarted, thinking it was on its first pass. One variable that was not reinitialized was a counter that was being used as the basis for a GOTO. The code was expecting this counter to have a value between 1 and X, say, but because the counter was not reinitialized, it started out with a high value. Eventually the code encountered a value beyond the expected range, say X+1, which caused it to branch out of its logic. It was an "uncomputed" GOTO. Until we realized that the code had been called a second time, we couldn't figure out how the counter could return a value so high.
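[A minimal sketch of the bug pattern Clemons describes, written in C rather than the HAL/S actually used on the Shuttle. All names and phase values are hypothetical; the switch stands in for the computed GOTO driven by the stale counter:]

```c
#include <stdio.h>

/* Hypothetical sketch of the uninitialized-counter bug described above.
 * `phase` persists across invocations of the dump module, so a second
 * invocation resumes wherever the first one stopped. */
static int phase = 1;          /* should be reset to 1 on every new invocation */

static void dump_step(int p)
{
    switch (p) {               /* stands in for the computed GOTO on the counter */
    case 1: puts("arm dump valves");    break;
    case 2: puts("open dump valves");   break;
    case 3: puts("monitor fuel level"); break;
    default:
        /* With a stale counter, execution eventually lands here: a value
         * outside the expected range, i.e. the "uncomputed" GOTO. */
        puts("ERROR: phase out of range - no code to branch to");
        break;
    }
}

static void invoke_dump(int passes)
{
    /* BUG: phase is NOT reinitialized here, so a later invocation
     * picks up the stale value instead of starting at 1. */
    for (int i = 0; i < passes; i++)
        dump_step(phase++);
}

int main(void)
{
    invoke_dump(2);   /* first use during ascent, stopped by the crew */
    invoke_dump(3);   /* second use later: starts at phase 3 and runs past the range */
    return 0;
}
```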
We have always been careful to analyze our processes whenever we've done something that's let a discrepancy get out. We are, after all, supposed to deliver error-free code. We noticed that this discrepancy resembled three or four previous ones we had seen in more benign conditions in other code modules. In these earlier cases, the code had always involved a module that took more than one pass to finish processing. These modules had all been interrupted and didn't work correctly when they were restarted. An example is the opening of the Shuttle vent doors. On its first pass, a module executes commands to open these doors. A second pass checks to see whether the doors actually did open. A third pass checks how much time has elapsed or whether it has received a signal to close the doors again, etc. Important status is maintained in the module between passes.
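[A sketch of the multipass pattern Clemons describes, using the vent-door example: status is retained in the module between passes, so a restart that skips reinitialization misbehaves. C is used for illustration only, and all names, phases and timings are hypothetical:]

```c
#include <stdio.h>
#include <stdbool.h>

/* Hypothetical multipass module: state is carried from pass to pass. */
enum vent_phase { CMD_OPEN, VERIFY_OPEN, MONITOR };

static enum vent_phase phase = CMD_OPEN;   /* status retained between passes */
static int elapsed_passes = 0;

static bool doors_reported_open(void) { return elapsed_passes >= 1; }  /* stand-in sensor */

void vent_door_pass(void)
{
    switch (phase) {
    case CMD_OPEN:                      /* pass 1: issue the open commands */
        puts("commanding vent doors open");
        phase = VERIFY_OPEN;
        break;
    case VERIFY_OPEN:                   /* pass 2: check that they actually opened */
        puts(doors_reported_open() ? "doors confirmed open" : "doors NOT open");
        phase = MONITOR;
        break;
    case MONITOR:                       /* pass 3+: track elapsed time / close signal */
        printf("monitoring, %d passes elapsed\n", elapsed_passes);
        break;
    }
    elapsed_passes++;
}

/* If the module is stopped and later re-invoked, this reset must run;
 * the class of discrepancies described above came from restarts that skipped it. */
void vent_door_reset(void)
{
    phase = CMD_OPEN;
    elapsed_passes = 0;
}

int main(void)
{
    for (int i = 0; i < 4; i++) vent_door_pass();
    vent_door_reset();                   /* a restart that is correctly reinitialized */
    for (int i = 0; i < 2; i++) vent_door_pass();
    return 0;
}
```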
AS: Isn't flight control multipass?
Clemons: Yes, in a broad sense. But every pass through flight control looks like every other. We go in and sample data, and based on that data, we make some decision and take action. We don't wait for any set number of passes through flight control to occur.
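[By contrast, a sketch of the per-pass structure Clemons attributes to flight control: each pass samples fresh data, decides, and acts, with no phase or counter carried between passes. Names, signals, and gains are hypothetical, for illustration only:]

```c
#include <stdio.h>

/* Hypothetical memoryless control pass: sample, decide, act. */
struct sensors { double pitch_error; };

static double read_pitch_error(int pass) { return 0.1 * pass; }   /* stand-in sampling */

static void flight_control_pass(int pass)
{
    struct sensors s = { read_pitch_error(pass) };    /* sample fresh data */
    double command = -0.5 * s.pitch_error;            /* decide from this pass alone */
    printf("pass %d: elevon command %.2f\n", pass, command);   /* act */
}

int main(void)
{
    for (int pass = 1; pass <= 3; pass++)
        flight_control_pass(pass);   /* no counters or phase survive between passes */
    return 0;
}
```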
For the STS-2 problem, we took three of our people, all relatively fresh from school, gave them these discrepancy reports (DRs) from similar problems, and asked for help. We were looking for a way to analyze modules that had these multiple-pass characteristics systematically. After working for about a week and a half, they developed a list of seven questions that they felt would have a high probability of trapping these kinds of problems. To test the questions, we constructed a simple experiment: We asked a random group of analysts and programmers to apply the questions to a handful of modules, some with these types of discrepancies, some without. They found every one of the problems, and gave us several false alarms into the bargain. We were confident that they had found everything.
We then called everybody in our organization together and presented these results. We asked them to use these seven questions to "debug" all of our modules, and ended up finding about 35 more potential problems, which we turned into potential DRs. In many instances, we had to go outside IBM to find out whether these discrepancies could really occur. The final result was a total of 17 real discrepancy reports. Of those, only one would have had a serious effect.
It turned out that this one problem originated during a sequence of events that occurred during countdown. A process was invoked that could be interrupted if there was a launch hold. The only way it would be reset to its correct initialization values was if a signal was sent from the ground when the launch process was restarted. We incorrectly assumed that this signal was always sent. Had we not found this problem, we would have lost safety checking on the solid rocket boosters during ascent. We patched this one for STS-2 right away.
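[A sketch of the countdown-hold discrepancy just described: the faulty design reinitializes only when a ground signal arrives on restart, so a missing signal silently carries stale state into ascent. The "patched" variant shown simply reinitializes unconditionally; the interview does not say how the actual STS-2 patch was implemented. Names are hypothetical, in C:]

```c
#include <stdio.h>
#include <stdbool.h>

/* Hypothetical state that must be re-armed when the countdown restarts. */
static bool srb_safety_check_armed = false;

static void initialize_srb_checks(void)
{
    srb_safety_check_armed = true;
    puts("SRB safety checking armed");
}

/* Original (faulty) pattern: reinitialization depends on the ground signal. */
static void restart_after_hold_faulty(bool ground_reset_signal_received)
{
    if (ground_reset_signal_received)
        initialize_srb_checks();
    /* else: stale state from the interrupted run is silently reused,
     * and SRB safety checking would be lost during ascent */
}

/* Defensive alternative: reinitialize unconditionally on every restart. */
static void restart_after_hold_patched(void)
{
    initialize_srb_checks();
}

int main(void)
{
    srb_safety_check_armed = false;
    restart_after_hold_faulty(false);    /* the assumed signal never arrives */
    printf("faulty restart, checks armed: %s\n", srb_safety_check_armed ? "yes" : "NO");

    restart_after_hold_patched();
    printf("patched restart, checks armed: %s\n", srb_safety_check_armed ? "yes" : "NO");
    return 0;
}
```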
In retrospect, we took a very bad situation and turned it into something of a success story. We felt very good about it. This was the first time we'd been able to analyze this kind of error systematically. It's one thing to find logic errors, but in a system as complex as this, there are a lot of things that are difficult to test for. Despite a veritable ocean of test cases, the combination of requirements acting in concert under certain specific conditions is very difficult to identify, let alone test. There's a need for more appropriate kinds of analysis.
Copyright © 1999 Peter B. Ladkin, 1999-02-08
by Michael Blume