Embedded Systems November 2000 Vol13_12

Issue link:

Contents of this Issue


Page 135 of 189

I t It has been formally proven that it is not possible to guarantee that two or more distributed sites will reach agreement in finite time over an asynchronous communication medium, if the medium between them is lossy or if one of the distributed sites can fail. attack at the proposed time but not Amphiloheus ( ince he did not ge t a confirmation). However, if the mes- senger wa caught before he reached Basileus, then Amphiloheus is in dan- ger of acting alone and uffe ring defeat. Furthermore, even if the mes- senger succeeds in getting back to Amphiloheus, there is still a possibili- ty that Basile u will n ot attack, because he i unsure that his confir- mation actually got through. To rem- edy this, Basileus may decide to send hi own messenger to Amphiloheu to ensure that his confirmation got through. But, the only way he can be certain of that is if he get a confir- mation of his confirmation. Since there is a possibility that ne ither mes- senger got through to Amphiloheus, Bas ileus is no better off than before if his second messenge r does not return . Clearly, while sending additional messengers can increase the likeli- hood that a confirmation will get through, it does not fundamentally solve the problem since there will always be a finite probability that mes- senger will get intercepted. It's a case of "does he know that I know that he knows that.. .?" and so on. Impossibility result Our parable of the generals is simply an illustra tion of a fundam ental impossibili ty result. amely, it ha been formally proven that it is not pos- sible to guarantee that two or more eli tributed sites will reach agreement in finite time over an asynchronous3 communication medium, if the medi- um between them is lossy [7] or if one of the distributed sites can fail [2] . This important result is, unfortu- nately, little known and many distrib- uted system developers are still trying to olve what is known to be an unsolv- able problem-the modern-day equiv- alent of trying to square the circle or devise a perpetual motion machine. The best that we can do in the e cir- cumstances is to reduce the possibili ty of non-termination to something that is highly unlikely. The Byzantine generals problem A common paradigm for a particular form of the distributed agreement problem is the so-called Byzantine gen- erals problem.4 In this problem it is assumed that at least one processing site is faul ty. Furthermore, any faul ty processing sites are assumed to be malicious in the sense that they are try- ing to subvert agreement by intention- ally sending incorrect information. While the number of such sites is known, their identity is not. The objec- tive is to find an algorithm whereby all the non-faul ty ites can reach agree- ment despite the disruptive actions of the faul ty sites. This may eem like a rather exotic and needle sly paranoid problem of little practical value. However, it is actually quite important for fault-toler- ant systems. The point is that any solu- tion that can survive a malicious attack will be robust enough to survive prac- tically any type of fault. Heterogeneity Many distributed system arose out of the need to integrate legacy stand- alone software y tems into a larger more comprehensive system. Since the individual systems we re often developed independe ntly of each other, they may have been based on different architectural principles and even different programming and operating system technologies. This creates another common problem of distribution: integ• -ation of heteroge- neous technologies. For example, one system might use a 16-bit integer fOJ- mat with the most significant bit in the 134 NOVEMBER 2000 Embedded Systems Programming - leftmost pOSitiOn, while its peer may use a 32-bit format with the most sig- nificant bit in the rightmost position. System establishment Since tl1ere are multiple processing sites in a distributed system, there are multiple loci of control. ote that, in gene ral, it cannot be assumed that all tl1e sites will necessa• -ily start executing at the same time. For example, a site that failed and was tl1en recovered will come up long after all the other com- ponents are all operational. A major problem is how these distributed sites find and synchronize with each other. Distributed algorithms and techniques In this section we examine some of the most common distributed algorithms. By "distributed" we mean that the decisions are reached by consensu , with each site making its own decision according to the rules of the algo- rithm and the evidence presented by the other participating sites. These algorithms gene rally are ba ed on shared (common) knowledge [7] ; tl1at is, each site must know what all the other sites know. Consequently, an important related problem is how to disseminate di stributed knowledge reliably and effi ciently. A major difficulty with common knowledge is that all parties have to know about each other. In dynamic systems, where sites may come and go, this can add a significant amount of overhead . In particular, difficulties result when a new site joins while a distributed algorithm is in progress. In those cases, it is typical to refuse entry to the newcomer until the algo- rithm terminates. Conversely, a site may fail in the middle of executing a distributed algorithm , which means that all the other participants mu t be notified. Communication techniques Both synchronous and asynchronous communication is used in distributed sys tems. Synchronous communica-

Articles in this issue

Archives of this issue

view archives of EETimes - Embedded Systems November 2000 Vol13_12