Embedded Systems November 2000 Vol13_12

been destroyed in the atmosphere by entering orbit too sharply or to have passed by Mars by entering orbit too shallowly. The root cause for this failure to approach orbit at the right angle was discovered to be an inconsistency in the units of measure used by two modules created by separate software groups. Thruster performance data was computed in English units and fed into a module that computed small forces, but which expected data to be in metric units. By accumulating errors over a nine-month journey, the Orbiter approached Mars at an incorrect orientation, and was lost.

In the preliminary report after the loss of the Mars Orbiter, Dr. Edward Stone, director of JPL, issued the following statement: "Special attention is being directed at navigation and propulsion issues, and a fully independent 'red team' will review and approve the closure of all subsequent actions. We are committed to doing whatever it takes to maximize the prospects for a successful landing on Mars on Dec. 3."[9] This statement turned out to be an unfortunate bit of foreshadowing.

On January 3, 1999 (approximately one month after the Mars Orbiter was launched), NASA launched three spacecraft using a single launch vehicle: the Mars Polar Lander (MPL) and two Deep Space 2 (DS2) probes. The Mars Lander was to land on the surface of the planet and perform experiments for 90 days. The Deep Space probes were to be released above the planet surface and drop through the atmosphere, embedding themselves beneath the surface. According to plan, these three spacecraft ended communications as they prepared to enter the atmosphere of Mars on December 3, 1999. After arriving on the planet, they were to resume communication on the evening of December 4, 1999. Communication was never reestablished.[10] As in the case of the Mars Orbiter, NASA conducted a very thorough investigation, and explored a number of possible causes.
The most probable cause seems to be the generation of spurious signals when the Lander's legs were deployed during descent. Spurious signals could give the Lander a false indication that it had landed, causing the engines to shut down. Of course, shutting down the engines before the Lander had actually landed would result in the spacecraft crashing into the surface of Mars. The following root cause analysis is from NASA's report:

"It is not uncommon for sensors involved with mechanical operations, such as the lander leg deployment, to produce spurious signals. For MPL, there was no software requirement to clear spurious signals prior to using the sensor information to determine that landing had occurred. During the test of the lander system, the sensors were incorrectly wired due to a design error. As a result, the spurious signals were not identified by the systems test, and the systems test was not repeated with properly wired touchdown sensors. While the most probable direct cause of the failure is premature engine shutdown, it is important to note that the underlying cause is inadequate software design and systems test."[11]

The theme of underlying software design, verification, and validation is certainly a common one in these and most other failures of safety-critical software.

What level of risk?

Few systems are completely free of risk. What is required for a system to be usable is that it have acceptable risk. The level of risk that is acceptable will vary with the type of system and the potential losses.[12] Builders of safety-critical software must be aware of the principles of risk, and understand how to mitigate risk at each of these levels. Doing so will almost certainly involve the application of formal verification and validation methods, in addition to effective system tests.
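NASA's finding that there was "no software requirement to clear spurious signals" suggests a simple class of mitigation: require several consecutive consistent samples before acting on a touchdown indication. The sketch below is a hedged illustration of that idea, not the MPL design; the sample threshold and names are assumptions.

```c
#include <stdbool.h>

/* Number of consecutive asserted samples required before the software
 * treats a touchdown indication as real. The value 3 is illustrative. */
#define TOUCHDOWN_CONSECUTIVE_SAMPLES 3

/* Returns true only after the sensor has read "asserted" for
 * TOUCHDOWN_CONSECUTIVE_SAMPLES polls in a row. A single clear reading
 * resets the count, so an isolated transient (such as one produced by
 * leg deployment) cannot shut the engines down. The caller owns the
 * counter so the filter can be reset or tested independently. */
bool touchdown_confirmed(int *consecutive, bool sensor_asserted)
{
    if (sensor_asserted)
        (*consecutive)++;
    else
        *consecutive = 0;
    return *consecutive >= TOUCHDOWN_CONSECUTIVE_SAMPLES;
}
```

Fed the brief transient that leg deployment was known to produce, a filter like this stays quiet; only a signal sustained across several polling cycles confirms landing. The report's deeper lesson stands regardless: a systems test with correctly wired sensors would have been needed to show whether any such requirement actually worked.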
How does building safety-critical software differ from building any other kind of high-quality software? In many ways it doesn't differ at all. Many of the same principles and methods should be applied in either case. What sets safety-critical software apart is that the risks involved are potentially huge in terms of life and property. That should mean that the amount worth investing in building it right should be much greater. Schedule must be second to quality, particularly in those parts of a system that pose the highest risk. Safety-critical software must be held to the highest possible standards of engineering discipline. esp

Charles D. Knutson is an assistant professor of computer science at Brigham Young University in Provo, Utah. He holds a PhD in computer science from Oregon State University. You can contact him at knut-

Sam Carmichael is a validation engineer at Micro Systems Engineering. Contact him at carmicha@biotmnih.com.

Endnotes

1. It's common to lump defects and failures together into the imprecise term "bug." Most people mean "failure" when they say "bug," because there was an observable problem in the software that could be demonstrated. But we are equally concerned with "bugs" that have not yet manifested themselves.

2. These properties were originally identified in [2] and are still significant problems in safety-critical software today. That they haven't been eliminated as concerns speaks to their essential nature.

3. A detailed discussion of these techniques is beyond the scope of this paper. Refer to Nancy Leveson's book, Safeware: System Safety and Computers, for more information on these and other techniques.

4. In [5], Nancy Leveson gives an extremely detailed and thorough report of the
