
Watchdog Timers in RTOS


MULTI-TASKING

Watchdog timers
By Niall Murphy
Author
Front Panel: Designing Software for
Embedded User Interfaces

Making proper use of a watchdog
timer is not as simple as restarting
a counter. If you have a watchdog
timer in your system, you must
choose the timeout period care-
fully, ensure that the watchdog
timer is tested regularly, and, if
you are multi-tasking, monitor
all of the tasks. In addition, the
recovery actions you imple-
ment can have a big impact on
overall system reliability.
A watchdog timer is a piece
of hardware, often built into a
microcontroller, that can cause
a processor reset when it judges
that the system has hung, or is
no longer executing the correct
sequence of code. This article
will discuss exactly the sort of
failures a watchdog can detect,
and the decisions that must
be made in the design of your
watchdog system. The first half
of the article will assume that
there is no RTOS present. The
second half covers a scheme for
making use of a watchdog in a
multi-tasking system. The hard-
ware component of a watchdog
is a counter that is set to a
certain value and then counts
down towards zero. It is the
responsibility of the software to set the count to its original value often enough to ensure that it never reaches zero. If it does reach zero, it is assumed that the software has failed in some manner and the CPU is reset.

In other texts you will see various terms for restarting the timer: strobing, stroking or updating the watchdog. However, in this article we will use the more visual metaphor of a man kicking the dog periodically, with apologies to animal lovers. If the man stops kicking the dog, the dog will take advantage of the hesitation and bite the man.

It is also possible to design the hardware so that a kick that occurs too soon will cause a bite, but in order to use such a system, very precise knowledge of the timing characteristics of the main loop of your program is required.

What errors are caught?
A properly designed watchdog mechanism should, at the very least, catch events that hang the system. In electrically noisy environments, a power glitch may corrupt the program counter, stack pointer, or data in RAM. The software would crash almost immediately, even if the code is completely bug free. This is exactly the sort of transient failure that watchdogs will catch.

Bugs in software can also cause the system to hang, if they lead to an infinite loop, an accidental jump out of the code area of memory, or a deadlock condition (in multi-tasking situations). Obviously, it is preferable to fix the root cause, rather than getting the watchdog to pick up the pieces. In a complex embedded system it may not be possible to guarantee that there are no bugs, but by using a watchdog you can guarantee that none of those bugs will hang the system indefinitely.

First aid
Once your watchdog has bitten, you have to decide what action to take. The hardware will usually assert the processor’s reset line, but other actions are also possible. For example, when the watchdog bites it may directly disable a motor, engage an interlock, or sound an alarm until the software recovers. Such actions are especially important to leave the system in a safe state if, for some reason, the system’s software is unable to run at all (perhaps due to chip death) after the failure.

A microcontroller with an internal watchdog will almost always contain a status bit that gets set when a bite occurs. By examining this bit after emerging from a watchdog-induced reset, we can decide whether to continue running, switch to

EE Times-India | November 2000 | eetindia.com 


a fail-safe state, and/or display an error message. At the very least, you should count such events, so that a persistently errant application won’t be restarted indefinitely. A reasonable approach might be to shut the system down if three watchdog bites occur in one day.

If we want the system to recover quickly, the initialisation after a watchdog reset should be much shorter than power-on initialisation. A possible shortcut is to skip some of the device’s self-tests. On the other hand, in some systems it is better to do a full set of self-tests since the root cause of the watchdog timeout might be identified by such a test.

In terms of the outside world, the recovery may be instantaneous, and the user may not even know a reset occurred. The recovery time will be the length of the watchdog timeout plus the time it takes the system to reset and perform its initialisation. How well the device recovers depends on how much persistent data the device requires, and whether that data is stored regularly and read after the system resets.

Sanity checks
Kicking the dog on a regular interval proves that the software is running. It is often a good idea to kick the dog only if the system passes some sanity check, as shown in Figure 1. Stack depth, number of buffers allocated, or the status of some mechanical component may be checked before deciding to kick the dog. Good design of such checks will increase the family of errors that the watchdog will detect.

One approach is to clear a number of flags before each loop is started, as shown in Figure 2. Each flag is set at a certain point in the loop. At the bottom of the loop the dog is kicked, but first the flags are checked to see that all of the important points in the loop have been visited. The multi-tasking approach discussed later is based on a similar set of sanity flags.

For a specific failure, it is often a good idea to try to record the cause (possibly in NVRAM), since it may be difficult to establish the cause after the reset. If the watchdog bite is due to a bug (would that be a bug bite?) then any other information you can record about the state of the system, or the currently active task, will be valuable when trying to diagnose the problem.

Choosing the timeout interval
Any safety chain is only as good as its weakest link, and if the software policy used to decide when to kick the dog is not good, then using watchdog hardware can make your system less reliable. If you do not fully understand the timing characteristics of your program, you might pick a timeout interval that is too short. This could lead to occasional resets of the system, which may be difficult to diagnose. The inputs to the system, and the frequency of interrupts, can affect the length of a single loop.

One approach is to pick an interval which is several seconds long. Use this approach when you are only trying to reset a system that has definitely hung, but you do not want to do a detailed study of the timing of the system. This is a robust approach. Some systems require fast recovery, but for others, the only requirement is that the system is not left in a hung state indefinitely. For these more sluggish systems, there is no need to do precise measurements of the worst case time of the program’s main loop to the nearest millisecond.

When picking the timeout you may also want to consider the greatest amount of damage the device can do between the original failure and the watchdog biting. With a slowly responding system, such as a large thermal mass, it may be acceptable to wait 10 seconds before resetting. Such a long time can guarantee that there will be no false watchdog resets. On a medical ventilator, 10 seconds would have been far too long to leave the patient unassisted, but if the device can recover within a second then the failure will have minimal impact, so a choice of a 500ms timeout might be appropriate. When making such calculations be sure to include the time taken for the device to start up as well as the timeout time of the watchdog itself.

One real-life example is the Space Shuttle’s main engine controller [1]. The watchdog timeout is set at 18ms, which is shorter than one major control cycle. The response to the watchdog biting is to switch over to the backup computer. This mechanism allows control to pass from a failed computer to the backup before the engine has time to perform any irreversible actions.

While on the subject of timeouts, it is worth pointing out that some watchdog circuits allow the very first timeout to be considerably longer than the timeout used for the rest of the periodic checks.



This allows the processor
time to initialise, without having
to worry about the watchdog
biting.
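
The timeout arithmetic described above (watchdog timeout plus start-up time, compared against the longest outage the application can tolerate) can be written down as a small check. This is only a sketch: the function names are hypothetical, and the 400 ms start-up figure used in the note below is an assumption, not a number from the article.

```c
#include <stdbool.h>
#include <stdint.h>

/* Worst-case outage after a hang: the watchdog must first expire, and the
 * device must then reset and re-initialise before it is useful again. */
uint32_t worst_case_recovery_ms(uint32_t wdt_timeout_ms, uint32_t startup_ms)
{
    return wdt_timeout_ms + startup_ms;
}

/* True if a candidate timeout keeps the worst-case outage within the
 * longest interruption the application can tolerate. */
bool timeout_fits_budget(uint32_t wdt_timeout_ms, uint32_t startup_ms,
                         uint32_t max_outage_ms)
{
    return worst_case_recovery_ms(wdt_timeout_ms, startup_ms) <= max_outage_ms;
}
```

With the ventilator figures quoted earlier and an assumed 400 ms start-up, timeout_fits_budget(500, 400, 1000) holds, while a 10 second timeout does not.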
While the watchdog can
often respond fast enough
to halt mechanical systems,
it offers little protection for
damage that can be done
by software alone. Consider
an area of non-volatile RAM
which may be overwritten
with rubbish data if some loop
goes out of control. It is likely
that such an overwrite would
occur far faster than a watch-
dog could detect the fault.
For those situations you need
some other protection such as
a checksum. The watchdog is
really just one layer of protec-
tion, and should form part of a
comprehensive safety net.
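
As one possible illustration of the checksum idea, the sketch below seals a block of non-volatile data with a Fletcher-16 checksum and validates it before use. The block layout, its size, and the function names are assumptions made for the example; the article does not prescribe a particular checksum.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of a block kept in non-volatile RAM. */
typedef struct {
    uint8_t  payload[32];
    uint16_t checksum;      /* Fletcher-16 over payload */
} nv_block_t;

/* Fletcher-16: two running sums, each reduced modulo 255. */
uint16_t fletcher16(const uint8_t *data, size_t len)
{
    uint16_t sum1 = 0;
    uint16_t sum2 = 0;
    for (size_t i = 0; i < len; i++) {
        sum1 = (uint16_t)((sum1 + data[i]) % 255u);
        sum2 = (uint16_t)((sum2 + sum1) % 255u);
    }
    return (uint16_t)((sum2 << 8) | sum1);
}

/* Call after every legitimate update of the payload. */
void nv_block_seal(nv_block_t *blk)
{
    blk->checksum = fletcher16(blk->payload, sizeof blk->payload);
}

/* Call before trusting the payload, for example after a reset. */
bool nv_block_is_valid(const nv_block_t *blk)
{
    return blk->checksum == fletcher16(blk->payload, sizeof blk->payload);
}
```

Fletcher-16 is only one choice. It cannot tell a 0x00 byte from 0xFF, and an all-zero block validates, so a CRC or a non-zero seed may be preferable in a real design.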

Multiplying the interval


If you are not building the
watchdog hardware yourself,
then you may have little say in
determining the longest inter-
val available. On some micro-
controllers the built-in watch-
dog has a maximum timeout
on the order of a few hundred
milliseconds. If you decide that
you want more time, you need
to multiply that in software.
Say the hardware provides a
100ms timeout, but your policy
says that you only want to check
the system for sanity every
300ms. You will have to kick
the dog at an interval shorter
than 100ms, but only do the
sanity check every third time
the kick function is called. This
approach may not be suitable
for a single loop design if the main loop could take longer than 100ms to execute.

One possibility is to move the sanity check out to an interrupt. The interrupt would be called every 100ms, and would then kick the dog. On every third interrupt the interrupt function would check a flag that indicates that the main loop is still spinning. This flag is set at the end of the main loop, and cleared by the interrupt as soon as it has read it.

If you take the approach of kicking the watchdog from an interrupt, it is vital to have a check on the main loop, such as the one described in the previous paragraph. Otherwise it is possible to get into a situation where the main loop has hung, but the interrupt continues to kick the dog, and the watchdog never gets a chance to reset the system.

Self-test
Assume that the watchdog hardware fails in such a way that it never bites. How would you ever know? When the system works, such a fault is not apparent. The fault would only be discovered when some failure that normally leads to a reset, instead leads to a hung system. If such a failure was acceptable, you would never have bothered with the watchdog in the first place.

If you think watchdog failure is a rare thing, think again. Many systems contain a means to disable the watchdog, like a jumper that connects the watchdog output to the reset line. This is necessary for some test modes, and for debugging with any tool that can halt the program. If the jumper falls out, or a service engineer who removed the jumper for a test forgets to replace it, the watchdog will be rendered toothless.

The simplest way for a device to do a start-up self-test is to allow the watchdog to time out, causing a processor reset. To avoid looping infinitely in this way, it is necessary to distinguish the power-on case from the watchdog reset case. If the reset was due to a power-on, then perform this test, but if the reset was due to a watchdog bite,



then we may already be running the test. Usually you will want to write a value in RAM that will be preserved through a reset, so you can check if the reset was due to a watchdog test or to a real failure. A counter should be incremented while waiting for the reset. After the reset, check the counter to see how long you had to wait for the timeout, so you are sure that the watchdog bit after the correct interval.

If you are counting the number of watchdog resets in order to decide if the system should give up trying, then be sure that you do not inadvertently count the watchdog test reset as one of those.

Multi-tasking
A watchdog strategy has four objectives in a multi-tasking system:
• To detect an operating system failure
• To detect an infinite loop in any of the tasks
• To detect deadlock involving two or more tasks
• To detect if some lower priority tasks are not getting to run because higher priority tasks are hogging the CPU

Typically, not enough timing information is available on the possible paths of any given task to check for a minimum execution time or to set the time limit on a task to be exactly the time taken for the longest path. Therefore, while all infinite loops are detected, an error that causes a loop to execute a number of extra iterations may go undetected by the watchdog mechanism. A number of other considerations have to be taken into account to make any scheme feasible:
• The extra code added to the normal tasks (as distinct from a task created for monitoring tasks) must be small, to reduce the likelihood of becoming prone to errors itself
• The amount of system resources used, especially CPU cycles, must be reasonable

The solution I will describe was used on a medical ventilator running on the RTXC real-time operating system. The idea was loosely influenced by Agustus P. Lowell’s article “The Care and Feeding of Watchdogs,” which describes a way to build the watchdog scheme into the RTOS itself [2]. Unlike Lowell’s scheme, however, this scheme can run on top of any RTOS, without requiring changes to the RTOS code.

This scheme uses a task dedicated to the watchdog. This task wakes up at a regular interval and checks the sanity of all other tasks in the system. If all tasks pass the test, the watchdog is kicked. The watchdog monitor task runs at a higher priority than the tasks it is monitoring.

The nature of the tasks
Most tasks have some minimum period during which they are required to run. A task may run in reaction to a timer that occurs at a regular interval. These tasks have a start point through which they pass in each execution loop. These tasks are referred to as regular tasks. Other tasks respond to outside events, the frequency of which cannot be predicted. These tasks are referred to as waiting tasks.

First we will discuss how the scheme will work if all tasks are regular and then we will explain what extra work has to be done for waiting tasks.

The watchdog timeout can be chosen to be the maximum time during which all regular tasks have had a chance to run from their start point through one full loop back to their start point again. Each task has a flag which can have two values, ALIVE and UNKNOWN. The flag is later read and written by the monitor. The monitor’s job is to wake up before the watchdog timeout expires and check the status of each flag. If all flags contain the value ALIVE, every task got its turn to execute and the watchdog may be kicked. Some tasks may have executed several loops and set their flag to ALIVE several times, which is acceptable. After kicking the watchdog, the monitor sets all of the flags to UNKNOWN. By the time the monitor task executes again, all of the UNKNOWN flags should have been overwritten with ALIVE. Figure 3 shows an example with three tasks.

Waiting tasks
Waiting tasks can’t be guaranteed to pass through their start point within any finite amount of time. These tasks normally have one or more points at which they are waiting on an external event, such as a user key action or communication from another processor. At those points, the flags are set to the value ASLEEP. After the wait, the flag is set to ALIVE, and the process continues as described above. The monitor changes its scheme as follows: if the monitor checks the flags and sees the value ASLEEP, it considers that state to be valid. So, if all flags are either ASLEEP or ALIVE then the watchdog is kicked.

The disadvantage is that if a task sets a flag to ASLEEP and never changes it back, it always passes the test and any deadlock or infinite loops in that task go undetected. Therefore, one of our rules is that the line of code following the line where the flag is set to ASLEEP must perform the wait, normally using one of the blocking function calls from the operating system. The instruction which follows the wait must set the flag to ALIVE. For example:
    myFlag = ASLEEP;
    KS_wait(KEY_PRESS_HAPPENED);
    myFlag = ALIVE;

Because there are no conditions or branches in this sequence, no set of circumstances allow the task to continue with the flag in the ASLEEP state. Once the flag has been set to ALIVE, the task must run to some point where the flag is again set to ALIVE or ASLEEP, before the monitor task has time to clear the flag to UNKNOWN and wait one timeout period. Many tasks have only one place where they wait on an external event and set the flag to ASLEEP. Those tasks must complete one full loop and be back at the three lines shown above in less time than the monitor’s timeout.

Note that this mechanism is not used on all blocking calls to the operating system; it is only used for the waits that are dependent on events for which a finite return time cannot be guaranteed. There are still some concerns with this scheme. If a deadlock occurs that involves waits in a number of waiting tasks while each of the waiting tasks has its flag set to ASLEEP, the monitor cannot detect the fault. In order to avoid this pitfall, a graph can be manually created to show each task with an arrow to the tasks it waits on (drawing arrows only for waits that set the task flag to ASLEEP). If there is a complete loop (for example, Task1 waits on Task2; Task2 waits on Task3; and Task3 waits on Task1), then these waits are not genuinely waiting for external events and you should consider whether the task flag should be set to ASLEEP at all of these points. If such a loop cannot be avoided, an extra timeout could be set on one of the waits (assuming that your RTOS supports timed waits), and this timeout would provide protection against a deadlock. This timeout could be far longer than the watchdog timeout period. In the case of this extra timer timing out, the system would be judged to be in deadlock.

In some cases, you may choose to assign two flags to one task. The flags could then be set to ALIVE at different points within the task’s main loop. This would catch a problem where a task was stuck in a loop that reset one of the flags but skipped some vital part of its work. The monitor would only consider the task to be healthy if both flags are set to ALIVE within each period. For waiting tasks, all of the tasks’ flags are set to ASLEEP at the waiting point and all of them set to ALIVE immediately afterwards.
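
Seen from the monitor's side, the flag rules described so far might be sketched as follows. This is an illustration under stated assumptions, not the code used on the ventilator: the numeric flag encoding, NUM_FLAGS, task_flags, monitor_run_once() and the kick_watchdog() stub are all hypothetical names. Leaving ASLEEP flags untouched when re-arming is one reading of the scheme, chosen so that a task blocked across several monitor periods still passes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values named in the article; the numeric encoding is an assumption. */
enum { UNKNOWN = 0, ALIVE = 1, ASLEEP = 2 };

#define NUM_FLAGS 4u    /* hypothetical: one or more flags per monitored task */

/* Single-byte flags: written by the tasks, read and re-armed by the monitor. */
volatile uint8_t task_flags[NUM_FLAGS];

/* Stand-in for the hardware kick; a real port would restart the watchdog
 * counter here instead of counting calls. */
unsigned kick_count;
void kick_watchdog(void)
{
    kick_count++;
}

/* Body of one monitor period. Returns false, without kicking, if any flag
 * is neither ALIVE nor ASLEEP. Assumes the monitor runs at a higher
 * priority than every monitored task, so no task can write a flag between
 * the monitor's read and its write. */
bool monitor_run_once(void)
{
    for (size_t i = 0; i < NUM_FLAGS; i++) {
        uint8_t f = task_flags[i];
        if (f != ALIVE && f != ASLEEP) {
            return false;   /* a task failed to report: let the dog bite */
        }
    }
    kick_watchdog();
    for (size_t i = 0; i < NUM_FLAGS; i++) {
        if (task_flags[i] == ALIVE) {
            task_flags[i] = UNKNOWN;    /* must be overwritten next period */
        }
    }
    return true;
}
```

A task that has been given two flags simply owns two entries in task_flags; the monitor treats every entry alike.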



For example, if a task was allocated two flags called myFlag1 and myFlag2 then the sequence of calls when this task is waiting is as follows:
    myFlag1 = ASLEEP;
    myFlag2 = ASLEEP;
    KS_wait(KEY_PRESS_HAPPENED);
    myFlag1 = ALIVE;
    myFlag2 = ALIVE;

Concurrent access
Since writes of a single byte are atomic, it is safe to use a single byte as a flag for a single task. No matter when the task switch occurs, it is impossible to get an illegal value written to the byte.

In the case of the monitor, the byte is read and then written. Theoretically, a task switch between the read and the write could change the state of the byte, and then that change would be overwritten by the monitor. This can never happen if the monitor is a higher priority task than the tasks being monitored. The tasks being monitored never read the flag. They only write to it.

Monitor interval
As stated, the timeout interval must be enough for all of the tasks being monitored to complete at least one loop. If there is a big difference between the shortest task loop and the longest then the tasks with shorter execution times may only be getting checked after a few hundred loops. The list of flags can be divided into high frequency flags and low frequency flags. Each time the monitor is awakened, the high frequency tasks’ flags are checked, but the low frequency tasks’ flags are only checked on every nth iteration, where n is the ratio between the high and low frequency.

Debugging
When testing and debugging the system, it is a good idea to run the system with the watchdog timeout set tighter than it normally will be in the field. This will help identify any of the paths in the code that are borderline.

It is also a good idea to install the monitor task early in the development cycle, since that will show how the system reacts to the real bugs in the monitored tasks during development. During debugging, always place a breakpoint in the monitor task at the point where it detects a failed flag. Then a failed task is not only detected immediately, but you can also use the debugger to look at its state and figure out why it missed its deadline.

Priority of monitoring task
This watchdog scheme is designed on the assumption that the monitoring task is running at a higher priority than any of the tasks that it is monitoring. This has one drawback. It means that it may take up CPU cycles at a time when another task may be trying to meet some hard real-time target. If your monitoring task performs checks other than the flags described here, and if those checks consume a lot of CPU cycles, you may want to consider altering this scheme to one where the monitoring task runs at a lower priority. If you do this you will have to ensure that the watchdog task is scheduled to run more often so that it will not be deferred for so long by a high priority task that it does not strobe the hardware watchdog in time. For example, you might schedule it to kick the dog every 25ms, even though the hardware watchdog only requires a kick every 50ms. It will then survive a 25ms delay caused at a time when a higher priority task is running. If the monitoring task is delayed for longer than that, the hardware watchdog will eventually bite.

Using a lower priority task will improve the ability of high priority tasks to meet their hard real-time targets. The disadvantage of such an approach is that you lose the opportunity to record the identity of the task that fails to set its flag to ALIVE, which is useful debugging information. I also believe that it is harder to ensure that there are no circumstances where a properly functioning system will lock out the monitoring task for long enough to get an unwanted kick.

When the lower priority task is the monitoring task, you will also have to address the possibility that another task may interrupt the monitoring task while the flags are being updated. The assumptions made in the “Concurrent access” section no longer hold, and the other task may update the flag that the monitoring task has already read, but before the monitoring task has a chance to write to it. One option is to use a resource lock on the set of flags. Another option is to ensure that examining and updating the flag in the monitoring task is performed as an atomic read-and-modify operation, which may be available as a single CPU opcode, or your RTOS may provide a facility to do this.

Conclusion
A good watchdog mechanism requires careful consideration of both software and hardware. It also requires careful consideration of what action to take when the failure is detected. When you design with watchdog hardware, make sure you decide early on exactly how you intend to make best use of it, and you will reap the benefits of a more robust system.

References
1. www.hq.nasa.gov/office/pao/History/computers/Ch4-7.html
2. Lowell, Agustus P., “The Care and Feeding of Watchdogs,” Embedded Systems Programming, April 1992, p. 38.

