Embedded Systems November 2000 Vol13_12


The challenges of distributed software

The majority of problems associated with distributed systems pertain to failures of some kind. These are generally manifestations of the unpredictable, asynchronous, and highly diverse nature of the physical world. In other words, because fault-tolerant distributed systems must contend with the complexity of the physical world, they are inherently complex.

Failures, faults, and errors

Let's introduce some basic terminology.[2] A failure is an event that occurs when a component fails to behave according to its specification. This is usually because the system experiencing the failure has reached some invalid state. We refer to such an undesirable state as an error. The underlying cause of an error is called a fault. For example, a bit in memory that is stuck at "high" is a fault. This will result in an error when a "low" value is written to that bit. When the value of that bit is read and used in a calculation, the outcome will be a failure.

Of course, this classification is a relative one. A fault is typically a failure at some lower level of abstraction (that is, the stuck-high bit may be the result of a lower-level fault due to an impurity in the manufacturing process).

When a failure occurs, it is first necessary to detect it and then to perform some basic failure handling. The latter involves diagnosis (determining the underlying cause of the fault), fault removal, and failure recovery. Each of these activities can be quite complex.

Consider, for example, failure diagnosis. A single fault can often lead to many errors and many different cascading failures, each of which may be reported independently. A key difficulty lies in sorting through the possible flurry of consequent error reports, correlating them, and determining the basic underlying cause (the fault).

Processing site failures

Because the processing sites of a distributed system are independent of each other, they are independent points of failure. While this is an advantage from the viewpoint of the user of the system, it presents a complex problem for developers. In a centralized system, the failure of a processing site implies the failure of all the software. In contrast, in a fault-tolerant distributed system, a processing site failure means that the software on the remaining sites needs to detect and handle that failure in some way. This may involve redistributing the functionality from the failed site to other, operational, sites, or it may mean switching to some emergency mode of operation.

Communication media failures

Another kind of failure that is inherent in most distributed systems comes from the communication medium. The most obvious, of course, is a complete hard failure of the entire medium, whereby communication between processing sites is not possible. In the most severe cases, this type of failure can lead to partitioning of the system into multiple parts that are completely isolated from each other. The danger here is that the different parts will undertake conflicting activities.

A different type of media failure is an intermittent failure. These are failures whereby messages travelling through a communication medium are lost, reordered, or duplicated. Note that these are not always due to hardware failures. For example, a message may be lost because the system may have temporarily run out of memory for buffering it. Message reordering may occur due to successive messages taking different paths through the communication medium. If the delays incurred on these paths are different, they may overtake each other.
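
To make the three intermittent failure modes concrete, here is a minimal sketch of how a receiver might tell them apart. It assumes that the sender stamps each message with a consecutive sequence number starting at 0; that scheme, and the names `Receiver`, `receive`, and `missing`, are illustrative assumptions and are not taken from the article.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Classifies messages by sequence number (hypothetical scheme, see lead-in)."""
    next_expected: int = 0                         # lowest sequence number not yet delivered
    held: dict = field(default_factory=dict)       # early arrivals waiting for a gap to fill
    delivered: list = field(default_factory=list)  # payloads handed on, in order

    def receive(self, seq: int, payload: str) -> str:
        # Already delivered, or already buffered: this copy is a duplicate.
        if seq < self.next_expected or seq in self.held:
            return "duplicate"
        if seq > self.next_expected:
            # A gap: an earlier message was lost or has merely been overtaken.
            self.held[seq] = payload
            return "out-of-order"
        # seq == next_expected: deliver it, then drain any buffered successors.
        self.delivered.append(payload)
        self.next_expected += 1
        while self.next_expected in self.held:
            self.delivered.append(self.held.pop(self.next_expected))
            self.next_expected += 1
        return "delivered"

    def missing(self) -> list:
        """Sequence numbers skipped so far; each is lost or still in transit."""
        if not self.held:
            return []
        return [s for s in range(self.next_expected, max(self.held))
                if s not in self.held]


rx = Receiver()
arrivals = [(0, "a"), (2, "c"), (1, "b"), (1, "b"), (4, "e")]
events = [rx.receive(seq, payload) for seq, payload in arrivals]
print(events)        # classification of each arrival
print(rx.delivered)  # payloads in their original order
print(rx.missing())  # gaps the receiver cannot yet explain
```

Note that the receiver cannot distinguish a lost message from one that is merely late. A gap reported by `missing()` only says that something has not arrived yet, which is the same uncertainty the sender faces.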
Duplication can occur in a number of ways. For instance, it may result from a retransmission due to an erroneous conclusion that the original message was lost in transit.

One of the central problems with unreliable communication media is that it is not always possible to positively ascertain that a message that was sent has actually been received by the intended remote destination. A common technique for dealing with this is to use some type of positive acknowl-
