Embedded Systems November 2000 Vol13_12


quarters as iceberg reports came in. There was not nearly enough precaution taken by the crew of the Titanic to minimize the risk of accident. Because of their confidence in their ability to withstand an accident, they didn't work diligently enough to avoid the risks prior to that point.

Principle 3: if an accident occurs, minimize the loss of life and/or property. The Titanic should have been designed to withstand a more severe collision. To design for the worst previously observed case is to remain ignorant of future hazards that don't follow past patterns. Even if the designers were completely confident that the liner could withstand virtually any collision, one must still have fail-safes in place in the event that the unthinkable occurs. All else being equal, if there had simply been sufficient lifeboats for all passengers, the loss of life would have been dramatically reduced.

Case study: Therac-25

The Therac-25 was the infamous medical linear accelerator that massively overdosed six radiation therapy patients over a two-year period.[4] Three of those patients died shortly after their treatments from complications immediately attributable to the radiation overdoses.

The case of the Therac-25 is troubling on several counts. First, the level of personal injury is disturbing. There is nothing pleasant about the damage that a massive radiation overdose can do to a human when concentrated on a small area. Second, the attitude of the manufacturer (Atomic Energy of Canada Limited, or AECL) was unbelievably cavalier in the face of such serious claims of injury. Even after lawsuits were being settled out of court with the families of deceased patients, AECL continued to deny that the Therac-25 could be the cause of the injuries. Their internal investigations were sloppy, their reports to the relevant government agencies inadequate, and their action plans disturbingly naive.
As an example, their earliest investigations into the accidents did not even consider the possibility that software could have been the root cause of the overdoses, even though software controlled virtually all relevant mechanisms in the machine. Finally, a thorough investigation revealed an incredible inattention to rigor in the software development process at AECL. There appeared to be little in the way of a formal software product life cycle in place, with much critical work being done in an ad hoc fashion. Even critical system tests were not necessarily documented or repeatable by engineers.

In Leveson's report, she identifies the following causal factors: overconfidence in software; confusing reliability with safety; lack of defensive design; failure to eliminate root causes; complacency; unrealistic risk assessments; inadequate investigation or follow-up on accident reports; inadequate software engineering practices; software reuse; safe vs. friendly user interfaces; and user and government oversight and standards.[5] Let's see. Anything significant missing here? We can't think of anything! It's almost unbelievable that such safety-critical software could be developed in such an ineffective fashion.

Case study: Ariane-5

Ariane-5 was the newest in a family of rockets designed to carry satellites into orbit. On its maiden launch on June 4, 1996, it flew for just under 40 seconds before self-destructing, destroying the rocket and its payload of four satellites. In contrast to the slipshod approach taken by AECL in the Therac-25 incident, Ariane immediately set up an Inquiry Board that conducted a thorough investigation and discovered the root cause of the accident.[7]

The core problem in the Ariane failure was incorrect software reuse.
A critical piece of software had been reused from the Ariane-4 system, but behaved differently in the Ariane-5 because of differences in the operational parameters of the two rockets. During a data conversion from a 64-bit value to a 16-bit value, an overflow occurred, which resulted in an operand error. Since the code was not designed to handle such an error, the inertial reference system in which the error occurred simply shut down. This caused control to pass to a second, redundant inertial reference system, which, operating under the same information as the first one, also shut down! The failure of these two systems led to the on-board computer misinterpreting diagnostic data as proper flight data, causing a deviation in flight path. This deviation in flight path led to the activation of the rocket's self-destruct mechanism.

One of the important lessons from the Ariane-5 failure is that the quality of a device's software must be considered in the context of the entire system. Software by itself has no inherent quality. It must be considered as part of a whole system. It is an important lesson to keep in mind as software reuse continues to be an important trend in software engineering.

Case study: Mars missions

Two recent back-to-back failures by NASA's Jet Propulsion Laboratory (JPL) have captured quite a bit of news recently. In the first case, the Mars Climate Orbiter (MCO) was launched on December 11, 1998 and spent nine months traveling toward Mars. Its purpose upon arrival was to orbit Mars as the first interplanetary weather satellite. In addition, it was to provide a communications relay for the Mars Polar Lander (MPL), which was scheduled to reach Mars three months later, in December of 1999.[8] On September 23, 1999 the Mars Orbiter stopped communicating with NASA and is presumed to have either
