Embedded Systems November 2000 Vol13_12


If the man stops kicking the dog, the dog will take advantage of the hesitation and bite the man.

The software would crash almost immediately, even if the code is completely bug free. This is exactly the sort of transient failure that watchdogs will catch.

Bugs in software can also cause the system to hang, if they lead to an infinite loop, an accidental jump out of the code area of memory, or a deadlock condition (in multitasking situations). Obviously, it is preferable to fix the root cause, rather than getting the watchdog to pick up the pieces. In a complex embedded system it may not be possible to guarantee that there are no bugs, but by using a watchdog you can guarantee that none of those bugs will hang the system indefinitely.

First aid

Once your watchdog has bitten, you have to decide what action to take. The hardware will usually assert the processor's reset line, but other actions are also possible. For example, when the watchdog bites it may directly disable a motor, engage an interlock, or sound an alarm until the software recovers. Such actions are especially important to leave the system in a safe state if, for some reason, the system's software is unable to run at all (perhaps due to chip death) after the failure.

A microcontroller with an internal watchdog will almost always contain a status bit that gets set when a bite occurs. By examining this bit after emerging from a watchdog-induced reset, we can decide whether to continue running, switch to a fail-safe state, and/or display an error message. At the very least, you should count such events, so that a persistently errant application won't be restarted indefinitely. A reasonable approach might be to shut the system down if three watchdog bites occur in one day.

If we want the system to recover quickly, the initialization after a watchdog reset should be much shorter than power-on initialization. A possible shortcut is to skip some of the device's self-tests. On the other hand, in some systems it is better to do a full set of self-tests since the root cause of the watchdog timeout might be identified by such a test.

In terms of the outside world, the recovery may be instantaneous, and the user may not even know a reset occurred. The recovery time will be the length of the watchdog timeout plus the time it takes the system to reset and perform its initialization. How well the device recovers depends on how much persistent data the device requires, and whether that data is stored regularly and read after the system resets.

Sanity checks

Kicking the dog on a regular interval proves that the software is running. It is often a good idea to kick the dog only if the system passes some sanity check, as shown in Figure 1. Stack depth, number of buffers allocated, or the status of some mechanical component may be checked before deciding to kick the dog. Good design of such checks will increase the family of errors that the watchdog will detect.

One approach is to clear a number of flags before each loop is started, as shown in Figure 2. Each flag is set at a certain point in the loop. At the bottom of the loop the dog is kicked, but first the flags are checked to see that all of the important points in the loop have been visited. The multitasking approach discussed later is based on a similar set of sanity flags.

For a specific failure, it is often a good idea to try to record the cause (possibly in NVRAM), since it may be difficult to establish the cause after the reset. If the watchdog bite is due to a bug (would that be a bug bite?) then any other information you can record about the state of the system, or the currently active task, will be valuable when trying to diagnose the problem.
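The flag scheme described under Sanity checks can be sketched in C roughly as follows. This is only an illustrative sketch, not the code from Figure 2: the flag names, the helper functions, and hw_kick_watchdog() are all hypothetical, and the kick is simulated by a counter so the example can run on a host. On real hardware, hw_kick_watchdog() would be replaced by the device-specific register write that restarts the watchdog.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical checkpoint flags: one bit per important point in the loop. */
enum {
    FLAG_INPUTS_READ = 1u << 0,
    FLAG_CONTROL_RAN = 1u << 1,
    FLAG_OUTPUTS_SET = 1u << 2,
    ALL_FLAGS = FLAG_INPUTS_READ | FLAG_CONTROL_RAN | FLAG_OUTPUTS_SET
};

static uint8_t  sanity_flags;  /* cleared at the top of every loop pass */
static unsigned kick_count;    /* host-side stand-in: counts simulated kicks */

/* Assumption: stands in for the device-specific watchdog restart. */
static void hw_kick_watchdog(void)
{
    kick_count++;
}

/* Clear all flags before the loop body runs. */
static void loop_begin(void)
{
    sanity_flags = 0;
}

/* Record that one important point in the loop has been visited. */
static void checkpoint(uint8_t flag)
{
    sanity_flags |= flag;
}

/* Kick the dog only if every checkpoint was reached; report whether we did. */
static bool loop_end(void)
{
    if ((sanity_flags & ALL_FLAGS) != ALL_FLAGS) {
        return false;  /* a point was missed: let the watchdog time out */
    }
    hw_kick_watchdog();
    return true;
}

/* One pass of a main loop; skip_control simulates a bug that bypasses a step. */
static bool run_one_pass(bool skip_control)
{
    loop_begin();
    checkpoint(FLAG_INPUTS_READ);
    if (!skip_control) {
        checkpoint(FLAG_CONTROL_RAN);
    }
    checkpoint(FLAG_OUTPUTS_SET);
    return loop_end();
}
```

Calling run_one_pass(false) visits every checkpoint and kicks the dog, while run_one_pass(true) misses one checkpoint and withholds the kick, so on real hardware the watchdog would eventually bite. Keeping the check in a single place at the bottom of the loop means that a new checkpoint only requires one more flag bit.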
Choosing the timeout interval

Any safety chain is only as good as its weakest link, and if the software policy used to decide when to kick the dog is not good, then using watchdog hardware can make your system less reliable. If you do not fully understand the timing characteristics of your program, you might pick a timeout interval that is too short. This could lead to occasional resets of the system, which may be difficult to diagnose. The inputs to the system, and the frequency of interrupts, can affect the length of a single loop.

One approach is to pick an interval which is several seconds long. Use this approach when you are only trying to reset a system that has definitely hung, but you do not want to do a detailed study of the timing of the system. This is a robust approach. Some systems require fast recovery, but for others, the only requirement is that the system is not left in a hung state indefinitely. For these more sluggish systems, there is no need to do precise measurements of the worst case time of the program's main loop to the nearest millisecond.

When picking the timeout you may also want to consider the greatest amount of damage the device can do between the original failure and the watchdog biting. With a slowly responding system, such as a large thermal mass, it may be acceptable to wait 10 seconds before resetting. Such a long time can guarantee that there
