University of Bielefeld - Faculty of Technology
Networks and Distributed Systems
Research group of Prof. Peter B. Ladkin, Ph.D.
This page was copied from: http://catless.ncl.ac.uk/Risks/11.79.html
ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator
<><><><><><><><>  T h e   V O G O N   N e w s   S e r v i c e  <><><><><><><><>

 Edition : 2336              Tuesday  4-Jun-1991            Circulation : 8466

VNS TECHNOLOGY WATCH:                        [Mike Taylor, VNS Correspondent]
=====================                        [Littleton, MA, USA           ]

COMPUTERWORLD 1 April
CREATORS ADMIT UNIX, C HOAX

In an announcement that has stunned the computer industry, Ken Thompson,
Dennis Ritchie and Brian Kernighan admitted that the Unix operating system
and C programming language created by them are an elaborate April Fools prank
kept alive for over 20 years. Speaking at the recent UnixWorld Software
Development Forum, Thompson revealed the following:

"In 1969, AT&T had just terminated their work with the GE/Honeywell/AT&T
Multics project. Brian and I had just started working with an early release
of Pascal from Professor Niklaus Wirth's ETH labs in Switzerland, and we were
impressed with its elegant simplicity and power. Dennis had just finished
reading `Bored of the Rings', a hilarious National Lampoon parody of the
great Tolkien `Lord of the Rings' trilogy. As a lark, we decided to do
parodies of the Multics environment and Pascal. Dennis and I were responsible
for the operating environment. We looked at Multics and designed the new
system to be as complex and cryptic as possible to maximize casual users'
frustration levels, calling it Unix as a parody of Multics, as well as other
more risque allusions.

"Then Dennis and Brian worked on a truly warped version of Pascal, called
`A'. When we found others were actually trying to create real programs with
A, we quickly added additional cryptic features and evolved into B, BCPL and
finally C. We stopped when we got a clean compile on the following syntax:

   for(;P("\n"),R--;P("|"))for(e=C;e--;P("_"+(*u++/8)%2))P("| "+(*u/4)%2);

To think that modern programmers would try to use a language that allowed
such a statement was beyond our comprehension! We actually thought of selling
this to the Soviets to set their computer science progress back 20 or more
years. Imagine our surprise when AT&T and other US corporations actually
began trying to use Unix and C! It has taken them 20 years to develop enough
expertise to generate even marginally useful applications using this 1960s
technological parody, but we are impressed with the tenacity (if not common
sense) of the general Unix and C programmer. In any event, Brian, Dennis and
I have been working exclusively in Pascal on the Apple Macintosh for the past
few years and feel really guilty about the chaos, confusion and truly bad
programming that have resulted from our silly prank so long ago."

Major Unix and C vendors and customers, including AT&T, Microsoft,
Hewlett-Packard, GTE, NCR, and DEC, have refused comment at this time.
Borland International, a leading vendor of Pascal and C tools, including the
popular Turbo Pascal, Turbo C and Turbo C++, stated they had suspected this
for a number of years and would continue to enhance their Pascal products and
halt further efforts to develop C. An IBM spokesman broke into uncontrolled
laughter and had to postpone a hastily convened news conference concerning
the fate of the RS/6000, merely stating `VM will be available Real Soon Now'.
In a cryptic statement, Professor Wirth of the ETH institute, father of the
Pascal, Modula-2 and Oberon structured languages, merely stated that P. T.
Barnum was correct.

In a related late-breaking story, usually reliable sources are stating that a
similar confession may be forthcoming from William Gates concerning the
MS-DOS and Windows operating environments.
And IBM spokesmen have begun denying that the Virtual Machine (VM) product is
an internal prank gone awry.

{COMPUTERWORLD 1 April}   {contributed by Bernard L. Hayes}

<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
Please send subscription and backissue requests to CASEE::VNS

Permission to copy material from this VNS is granted (per DIGITAL PP&P)
provided that the message header for the issue and credit lines for the VNS
correspondent and original source are retained in the copy.

<><><><><><><><>   VNS Edition : 2336   Tuesday  4-Jun-1991   <><><><><><><><>
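For the curious: the statement quoted in the hoax article really does compile
once it is given some surrounding scaffolding. Everything below other than
the quoted line itself is a guess, not anything from the Computerworld piece:
P is taken to be printf, and the array behind u, the row count R and the
column count C are invented. With those assumptions it builds cleanly and
prints a small maze-like pattern of bars and underscores.

    #include <stdio.h>

    /* Hypothetical scaffolding -- not part of the quoted article. */
    #define P printf

    int main(void)
    {
        /* Arbitrary sample data; each byte selects "_" vs "" and "|" vs " ". */
        char cells[] = { 4, 9, 12, 1, 8, 3, 5, 10, 14, 7, 2, 11 };
        char *u = cells;
        int R = 3;          /* rows    (guess) */
        int C = 4;          /* columns (guess) */
        int e;

        /* The statement from the hoax, verbatim: */
        for(;P("\n"),R--;P("|"))for(e=C;e--;P("_"+(*u++/8)%2))P("| "+(*u/4)%2);

        P("\n");
        return 0;
    }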
From Herb Caen's column in the San Francisco Chronicle, June 4, 1991: "Ron Lemmen took his old friend Nellie White to Circuit City [a local discount electronics chain - PA] to buy a TV set and after she wrote a check with more than adequate ID, the computer turned her down. It hadn't been programmed for somebody born before 1900 (Nellie's 92)..."
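The column does not say exactly how the check-approval system failed, but the
stated cause (it "hadn't been programmed for somebody born before 1900")
suggests an input-validation rule of roughly the shape sketched below. This
is a guess at that kind of check, not the retailer's actual code; the cutoff
year, the function name and the test values are all invented.

    #include <stdio.h>

    /* Hypothetical check-approval rule of the kind described above: any
       birth year before 1900 is treated as invalid input, so a 92-year-old
       customer (born in 1899) is rejected out of hand.                    */
    static int birth_year_ok(int year)
    {
        return year >= 1900 && year <= 1991;   /* "current" year, 1991 */
    }

    int main(void)
    {
        int nellie = 1899;      /* 92 years old in 1991 */
        printf("born %d: %s\n", nellie,
               birth_year_ok(nellie) ? "accepted" : "rejected");
        return 0;
    }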
In 1968 I was a system programmer at the Columbia University Computer Center,
which had a tremendous complex of IBM mainframes: a 360/91 tightly coupled to
a 360/75, with a total of several megabytes of memory between them. Each
computer had a disk farm; there was a 2321 Data Cell, a 2540 card reader, and
a [number?] card reader/punch. And of course, several 1200 lpm 1403-N1
printers.

Usually the computer room was a reassuring hum of noises ... off in the far
corner, the very faint pulse of the PCU ("Plumbing Control Unit": the 91 was
cooled by a closed-loop distilled water system!); the snicker-click of the
1315s and 2315s as the heads moved back and forth; the ka-CHUNK of the data
cell as the selectors grabbed, turned, and released the magnetic cards in the
cells; the susurration of the card readers and occasionally the clunk of the
punch; and of course, the humming percussion of the printers.

One day I was in the room when suddenly it lapsed into near-total silence,
except for the whiz of paper slewing out of one of the printers at high
speed. The disks, card devices, data cell, and second printer had all stopped
dead. And what was worse, very few of the 50 gazillion lights on the consoles
of the computers were winking. Guessing immediately that something to do with
the printer had brought the whole huge system to a halt, I ran over to it and
hit (usually I don't like the word "hit" to mean "press", but this time I
really did HIT) the STOP button. The printer stopped, but unfortunately
nothing else came to life. Eerie dead silence, except for the water pump in
the far corner.

I set to work to find the problem. We did figure it out: the listing on that
printer had specified carriage control meaning "skip to the next punch in
channel 12 of the carriage tape" -- but there was nothing punched in channel
12 of the carriage tape. And the system, it seems, had at that instant
serviced all its interrupts except the one it was expecting from the printer
-- which never came, because the only thing that could cause the interrupt (a
punch in channel 12 of ...) never happened. So the entire multiprocessor
system was at a dead stop waiting for the interrupt.

What to do, what to do? I removed the carriage tape and punched a hole in
channel 12, then replaced it in the printer. I hit the START button, the
paper slewed and stopped, the interrupt arrived, and the whole system came to
life. Right? Nope. Nothing happened. No sweat, I'm a system programmer,
entitled to wander into the machine room and do anything I think I can get
away with; so fine, I'll just warm-start the suckers and off we'll go. We're
talking about $6,000,000 worth of computers, the finest IBM had built,
servicing all the needs, both academic and administrative (I mean, my
PAYCHECK was calculated and printed on those machines!), of a prestigious
university. One of whose trustees, by the way, was Thomas J. Watson, Jr.

So I warm-started the system. Unfortunately, that didn't work either. Nothing
worked. Nor could IBM make it work. We had to reinitialize the whole system
from scratch, the only time in my memory that that was necessary.

Now what to do about this terrible bug that could cause the whole machine to
grind to an irreparable halt? A committee met several times to try to figure
it out. (I wasn't invited. Too junior. Too technical.) The committee never
came to a decision; they couldn't figure out whether to press IBM to solve
the problem, or to try to find and fix it ourselves, or what. I did my part.
With no one's permission, I checked all the carriage tapes to make sure that
they all had at least one punch in every channel, and I instructed the
operators that whenever they made a new tape, they should make sure that
every channel had a punch somewhere. End of problem.

---Pete   kaiser@heron.enet.dec.com   +33 92.95.62.97
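Pete's fix -- make sure every channel of every carriage tape has at least one
punch -- is easy to express as a mechanical check. The sketch below is only
an illustration of that check, not anything that ran at Columbia: it models a
carriage tape as an array of 12-bit rows, one bit per channel per line of the
form, and flags any channel with no punch anywhere on the tape.

    #include <stdio.h>

    #define CHANNELS 12

    /* Hypothetical model of a 1403 carriage tape: one bit per channel per
       line of the form, set wherever the tape is punched.                 */
    static void check_tape(const unsigned short *tape, int lines)
    {
        for (int ch = 0; ch < CHANNELS; ch++) {
            int punched = 0;
            for (int i = 0; i < lines && !punched; i++)
                punched = (tape[i] >> ch) & 1;
            if (!punched)
                printf("WARNING: no punch in channel %d -- "
                       "a skip to it would wait forever\n", ch + 1);
        }
    }

    int main(void)
    {
        /* A made-up 66-line tape punched only in channel 1 (top of form). */
        unsigned short tape[66] = { 0 };
        tape[0] = 1 << 0;                /* channel 1 punch */
        check_tape(tape, 66);
        return 0;
    }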
I've received additional data that is pertinent to the discussion of the 767
crash. Boeing tested the 767 during certification for thrust reverser
activation in flight. Not only does in-flight deployment not cause damage,
but the aircraft can remain in flight in this condition. This is because the
thrust reversers are not as efficient as the non-reversed engines. There is
still enough total thrust produced that the aircraft can maintain flight with
one engine at full throttle and the other at full throttle with the reverser
deployed.

I realize that this information has no computer risk relevance, but
speculation on the cause of this "computer controlled" aircraft's crash has
run high, even on the RISKS group. The above info answers some of the
concerns that people had about operation of this aircraft. It also sheds some
light on the fault-tolerant nature of the Boeing hardware design, which
should be instructive to those of us who design software.

Steve
> New evidence that the crash of a Lauda Air Boeing 767-300 in Thailand had
> been caused by in-flight reversal of the thrust on one engine has stunned
> the aviation industry. Niki Lauda, owner of Lauda Air, made the claim in
> Vienna

This may have happened, but what about the ground witnesses who said the
plane flared up like a firecracker, what about the small size of the pieces
of the wreck, and what about the surface area of the dispersion site? All of
these are inconsistent with a simple loss of control resulting in an impact
with the ground. There seems to have been some mid-air explosion.

Alain   (514) 934 6320   UUCP: alain@elevia.UUCP
>Herr Lauda said the flight data recorder was damaged and could not be used
>to analyse the crash.

RISK of improvement, if true. The old FDRs used metal scribes on stainless
ribbon. They've been replaced by digital data recorders using magnetic tape.
They offer more reliability, less error (in at least one famous case, the
NTSB put the recorder in a centrifuge to explore the effects of high g-loads
on the mechanical slop in the pen linkages!), a larger number of channels --
hence many more data points -- maybe automatic reuse (as the voice recorder
does), and likely lower cost.

But, according to friends in the airline industry, they are also far less
indestructible. Makes sense: no matter what you do, ferric oxide on a plastic
base melts at a lower temperature than stainless steel. And that does not
even consider the Curie point of the oxide.

wb8foz@mthvax.cs.miami.edu
Before launching into a discussion of John's posting, some background is
necessary. Between 1979 and 1982, one of my assignments was on the AFTI-F16
program. My prime task was to augment the "user interface" between the
designers and the FLCCs (FLight Control Computers): Bendix BDX-930, AMD
2901-based systems with custom microcode, a 4 MHz cycle, and 450 nsec access
UVPROM memory. Specs that seem prehistoric today, but they were what we had
to work with.

During the flight test phase, as John points out, we had some "glitches",
primarily from complex interactions that seem simple when explained but had
us covering hallways with brush recorder outputs trying to figure out what
happened. The telemetry and flight record capability available to us was
typically (as I recall) limited to about 16 channels for each FLCC. Picking
the correct software data points to monitor from in excess of 5000 locations
was sometimes difficult.

Readers must remember that the AFTI-F16 was a technology demonstrator and,
for the era, we were "pushing the envelope" in a very steep learning curve,
not only for the digital flight controls but for the entire process of
designing digital flight controls. "What if" discussions abounded, right down
to philosophical discussions of what the pilot was permitted to do. (At the
time it seemed centered on the lowest-common-denominator thinking that (IMHO)
resulted in the Iranian debacle of 1980.) There were political as well as
practical problems to be solved.

I vividly remember one discussion concerning PLA (power lever angle:
throttle) authority. The P&W F100 engine had design limits (much like the
red-line in a car) and the thinking at the time said to design the controls
so that this point could not be exceeded. A few of us were of the opinion
that this was a combat aircraft and that a fighter pilot should have the
authority to exceed "design" limits if necessary to complete his mission.
Warn him, but give him the option, even at the risk of destroying his own
aircraft. In combat, the rules MUST be different. Today, it seems incredible
that the opposing viewpoint existed, but it did, and it was quite pervasive
in some governmental circles. Then, we were the mavericks.

>It seems that redundancy management became the primary source of
>unreliability in the AFTI-F16 DFCS.

Cost constraints and paper studies decreed that we would try a triplex design
with hydromechanical back-up; in production, lessons learned on AFTI resulted
in a quadruplex system. Trying to develop a dual-fail-operational
flight-critical system was not easy.

>...the unsynchronized individual computers may sample sensors at slightly
>different times, they can obtain readings that differ quite appreciably from
>one another...

Remember that I said "dual-fail operational". Synchronous operation would
have eliminated such latency, but a "first fail" could include loss of
synchronization. Therefore asynchronous design was a level-1 decision. Today,
processing and sensor speed has increased to the point that this approach
would not be a problem, but at 500 KIPS, cycle rates were under 100 per
second, and above Mach 1.3 a lot can happen in a couple of milliseconds.

> An even more serious shortcoming of asynchronous systems arises when the
>control laws contain decision points. Here, sensor noise and sampling skew
>may cause independent channels to take different paths at the decision
>points and to produce widely divergent outputs. This occurred on Flight 44
>of the AFTI-F16 flight tests [4, p. 44].
>Each channel declared the others failed; the analog back-up was not selected
>because the simultaneous failure of two channels had not been anticipated
>and the aircraft was flown home on a single digital channel.

The pilot had a switch that allowed him to select which computer(s) were in
use and could over-ride this digital decision if necessary. Note that the
aircraft still had the hydromechanical back-up with "get home" capability if
necessary. This condition HAD been anticipated.

> Another illustration is provided by a 3-second "departure" on Flight 36 of
>the AFTI-F16 flight tests, during which sideslip exceeded 20deg, normal
>acceleration exceeded first -4g, then +7g, angle of attack went to -10deg,
>then +20deg, the aircraft rolled 360deg, the vertical tail exceeded design
>load, all control surfaces were operating at rate limits, and failure
>indications were received from the hydraulics and canard actuators.

I do not have the records here, but I suspect that this was one of the "find
a long hallway" ones. This was probably the case where a combination of
extremely high AOA in a near-stall condition caused the envelope to be
exceeded on the back side, i.e., the plane was no longer flying and the
control surfaces had little effect. My memory may be going, but I seem to
recall one set of readings that indicated near-zero air speed with an AOA
above 80 degrees.

> The AFTI-F16 flight tests revealed numerous other problems of a similar
>nature. Summarizing, Mackall [4, pp. 40-41] writes:
>"The criticality and number of anomalies discovered in flight and ground
>tests owing to design oversights are more significant than those anomalies
>caused by actual hardware failures or software errors..."

Easy words to say. Remember, this was a full-authority, multiply-redundant
flight control system containing five modes of flight, with a computer that
could only address 32K of memory (the upper bit of the 16-bit address was
used to indicate an indirect operation).

>"...qualification of such a complex system as this, to some given level of
>reliability, is difficult ...[because] the number of test conditions becomes
>so large that conventional testing methods would require a decade for
>completion."

In other words, the only real way to test it and learn where the mistakes
were was to strap in a pilot and wish him luck (of course, thousands of hours
in a flight simulator connected to production hardware helped).

>The fault-tolerant design can also affect overall system reliability by
>being made too complex and by adding characteristics which are random in
>nature, creating an untestable design.

Huh? Nothing in a digital system is random, period. Interactions may be
unanticipated, but not random. Things were a bit more difficult before PCs,
though.

>2: However, the greater the benefit provided by DFCS, the less plausible it
>becomes to provide adequate back-up systems employing different
>technologies. For example, the DFCS of an experimental version of the F16
>fighter (the "Advanced Fighter Technology Integration" or AFTI-F16) provides
>control in flight regimes beyond the capability of the simpler analog
>back-up system. Extending the capability of the back-up system to the full
>flight envelope of the DFCS would add considerably to its complexity--and it
>is the very simplicity of that analog system that is its chief source of
>credibility as a back-up system [2].

Doubletalk. Sure, an analog system is going to have trouble with a Mach 1.3,
50 ft terrain-following mode (so are the pilot's kidneys).
What we found out was that you can make a plane do things with DFCS (digital
flight control system) that are impossible with an analog system. In TF
(terrain following), if you have a failure, the back-up does not try to
maintain that condition; instead a fly-up is initiated and the aircraft
returns to a maintainable mode.

> The danger of wide sensor selection thresholds is dramatically illustrated
>by a problem discovered in the X29A. ... It was subsequently discovered that
>if the nose probe failed to zero at low speed, it would still be within the
>threshold of correct readings...

At least we did not have this problem: on AFTI, valid sensor ranges were
confined so that any sensor reading zero or full scale was automatically
declared failed.

All in all, I thought that AFTI was pretty successful and led to the PDFCS
(Production Digital Flight Control System) program. We made mistakes and
learned from them. If anything, most thresholds were set too high, so that
failures were declared that did not need to be, but we were kind of cautious
in those days. Probably the riskiest thing was to bet that technology would
allow us to replace that 450 ns memory with 250 ns units for a needed
throughput improvement - three manufacturers had announced them but no one
had shipped any when we froze the design.

In any event, first flight was almost exactly ten years ago, and the most
significant event was that the chase planes had to double their normal
clear-space distances, since the AFTI-F16 could translate horizontally and
vertically without any warning - it was just suddenly someplace else.

Padgett
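The asynchronous-sampling problem discussed in the post above is easy to
reproduce in miniature. The sketch below is only an illustration of the
effect, not AFTI code, and every name and number in it is invented: three
unsynchronized channels sample the same fast-changing signal a few
milliseconds apart, compare their readings against a control-law decision
threshold, and, with a tight enough cross-channel failure limit, end up both
taking different branches and declaring one another failed even though no
hardware has failed.

    #include <stdio.h>
    #include <math.h>

    #define CHANNELS 3

    /* A fast-changing flight parameter (illustrative only). */
    static double sensor(double t) { return 100.0 * sin(25.0 * t); }

    int main(void)
    {
        /* Unsynchronized channels: each samples a few ms after the last. */
        double skew[CHANNELS] = { 0.000, 0.004, 0.008 };  /* seconds      */
        double threshold      = 50.0;   /* control-law decision point    */
        double fail_limit     = 15.0;   /* cross-channel failure limit   */
        double t              = 0.019;  /* instant being examined        */
        double r[CHANNELS];
        int    branch[CHANNELS];

        for (int i = 0; i < CHANNELS; i++) {
            r[i]      = sensor(t + skew[i]);
            branch[i] = r[i] > threshold;        /* decision point */
            printf("channel %d: reading %7.2f -> branch %c\n",
                   i, r[i], branch[i] ? 'A' : 'B');
        }

        /* Each channel compares itself against the others; a large enough
           skew-induced difference looks like a failed channel.            */
        for (int i = 0; i < CHANNELS; i++)
            for (int j = 0; j < CHANNELS; j++)
                if (i != j && fabs(r[i] - r[j]) > fail_limit)
                    printf("channel %d declares channel %d failed "
                           "(difference %.2f)\n", i, j, fabs(r[i] - r[j]));
        return 0;
    }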
This page was copied from: http://catless.ncl.ac.uk/Risks/11.79.html
by Michael Blume