Embedded Systems November 2000 Vol13_12

Issue link:

Contents of this Issue


Page 141 of 189

If failures occur during the broadcast, the algorithms will fail since the participants will have inconsistent views of the system state. The site with the highest bid is select- ed as the leader (which explains the name of the algorithm). This site sends a confirmation message to all other ites. If a site in the monitoring state determines that the current leader has failed, it re-enters the election proce- dure by sending its bid to all sites that had previously entered a higher bid than its own. If none of these respond, the site elects itself as the leader and informs all lower-bid sites. However, should a higher-bidding site respond, it wa. receive the same set of messages • Only sent messages are delivered (that is, there are no "spontaneous" messages) Note that reliable broadcast does its to receive a confirmation mes-· sage from the new leader. If such a message does not arrive, the election process is re-entered. This simple sounding algorithm is fraught with practical implementation difficulties. As sites come and go, it is difficult to keep track of who is pre- sent and what phase of the election process they are in. If the communica- tion medium is not perfectly reliable, the difficulties are often insurmount- able. Detection of leader fai lures can also be problematic if the polling is based on timeouts. Fault-tolerant distributed broadcasting As we have seen, a key element of many distributed algorithms is the need to reliably broadcast local infor- mation to all the participants in an algorithm. If failures occur during the broadcast, the algorithms will fai l since the participants will have incon- sistent views of the system state. For tl1is reason, it is very useful to provide fault-tolerant broadcast services in a dis- tributed system.[5] The weakest form of fault-tolerant broadcast is called a reliable broadcast. Within a given group of sites, it guar- antees three basic properties: • All mes ages sent are received by all non-faulty sites in the group • All non-faulty sites in the group will not guaran tee the order in which the messages will be received. The order of reception can be crucial in some cases. For example, if an "on" message is sent after an "off' message, it is important that they are received in that order. For this reason a number of other types of fault-tolerant broad- casts are defined with progressively stronger resu·ictions on tl1e delivery order of messages. A FIFO broadcast is a reliable broad- cast that guarantees tl1 at messages sent by the same sender are delivered in sending order. Note, however, that multiple sites could be broadcasting at the same time so that the possibility exists even with FIFO broadcasts that different members of the group will receive messages from different sites in a different order. For this reason, we define atomic broadcasts, which guaran- tee that all non-faulty sites receive all messages in the same order. Another type of reliable broadcast is causal broadcast. In this case, tl1e delivery order is guaranteed to respect causal relationships. That is, it is not possible for a message to be delivered before another message that precedes it in the system cause-effect chain. Finally, it is sometimes useful to pro- vide a causal atomic broadcast that com- bines the features of causal and atom- ic broadcasts. Distributed groups Distributed algorithms are designed to operate within a given group of partic- ipants. Hence, they always imply a scope, a community of distributed sites within which the algorithm applies. Membership in this communi- ty may change dynamically as sites join 140 NOVEMBER 2000 Embedded Systems Programming and depart. For example, the fault-tol- erant broadcast algorithms described above are designed with this model in mind. This led to the idea of providing operating system support for distrib- uted groups. [ 4] A distributed group is simply a group of processing sites that are joined together for some common purpose. Sites may join and leave the group as they see fit. These distributed groups provide the basic services required by many different distributed algorithms and applications. A typical group facility provides the following: • A membership service, which is used to track which sites are currently in the group and to inform members of the group when new sites join or leave the group • Fault-tolerant b-roadcast services (as described previously) as well as reli- able point-to-point (member-to- member) communications • A mutual exclusion service that pro- vides a basic distributed locking mechanism within the group By relieving applications from hav- ing to implement the complex machinery involved with these ser- vices, a major simplification of appli- cations can be obtained. Centralized alternatives An alternative to fully distributed algo- rithms is to appoint one well-known site as a central arbiter. When agree- ment need to be reached, all the other sites simply consult the central arbiter and abide by its decision. This is a common solution in many disu·ib- uted systems but it has two drawbacks. The first is that a central site repre- sents a potential bottleneck. However, as we have seen previously, distributed algorithms require a significant over- head to establish and maintain com- mon knowledge. In fact, it is generally the case that a centralized algoritl1m is actually more efficient than a distrib- uted one. A more serious drawback is that a central arbiter becomes a single

Articles in this issue

Archives of this issue

view archives of EETimes - Embedded Systems November 2000 Vol13_12