College of Computer Science
Boston, MA 02115
In this short note I argue that the model of self-stabilization and the tools developed for it are important and useful from the information survivability perspective. In a nutshell, the self-stabilization model provides a uniform abstraction for all possible transient faults, and self-stabilizing protocols are protocols which guarantee automatic recovery from any transient fault. In the remainder of this note, I give some background and mention a few approaches and results for self-stabilization.
In order to deal with faults, one has to model them appropriately. Clearly, there is no way to defend the system unless we assume that the faults are somehow restricted. One popular way to restrict faults is to view them as additional specific actions taken by the environment. For example, a ``shutdown'' action may model a processor crash; a ``drop message'' action has the obvious interpretation; etc. Another accepted way to restrict faults is to assume that they may occur only at a bounded number of places. A typical example is where system guarantees are contingent on the assumption that at most t processors are faulty, for some specified parameter t. A third approach was pioneered by Dijkstra: unrestricted transient faults. In this model, faults may be extremely powerful: when a fault occurs, the state of all system components may have been arbitrarily corrupted; the only restrictions assumed are that the protocol code remains intact and that eventually the faults stop. A protocol which can eventually resume normal operation after such a fault is called self-stabilizing.
It turns out that, despite the stringent requirements, self-stabilizing protocols are feasible (e.g., for token passing and spanning-tree construction) -- but they are tricky. The design of self-stabilizing protocols seems to require techniques which are foreign to conventional programmers. For example, we take it for granted that the result of a read(x) instruction is exactly the value v stored by the last write(x,v) instruction. But a self-stabilizing protocol can never just write a value and trust the system to keep it: there is no immediate way to tell whether the value returned by a subsequent read is ``genuine'' or whether a transient fault corrupted it before the read took place. Another unnatural consideration is that the program counter (the register which points to the next instruction to execute) is itself stored in volatile memory: a possible consequence of transient faults is that when executing the second statement in a sequence, there is no way to tell whether the first one was ever executed.
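To make the feasibility claim concrete, the following sketch simulates Dijkstra's classic K-state token-passing ring, the canonical self-stabilizing protocol: machine 0 is privileged when its counter equals its left neighbor's, every other machine is privileged when its counter differs from its left neighbor's, and with K at least the ring size n, any initial (arbitrarily corrupted) state converges to exactly one circulating privilege. The simulation scaffolding (function names, the random central scheduler) is our own illustration, not code from any cited paper.

```python
import random

def privileges(S):
    """Indices of machines currently holding a privilege (the token)."""
    n = len(S)
    out = [0] if S[0] == S[n - 1] else []
    out += [i for i in range(1, n) if S[i] != S[i - 1]]
    return out

def fire(S, i, K):
    """Execute machine i's move (caller must ensure i is privileged)."""
    n = len(S)
    if i == 0:
        S[0] = (S[n - 1] + 1) % K
    else:
        S[i] = S[i - 1]

def stabilize(S, K, rng):
    """Central scheduler firing a random privileged machine each step.
    Returns the number of moves until exactly one privilege remains."""
    moves = 0
    while len(privileges(S)) > 1:
        fire(S, rng.choice(privileges(S)), K)
        moves += 1
    return moves

rng = random.Random(0)
n, K = 5, 6                                  # K >= n guarantees convergence
S = [rng.randrange(K) for _ in range(n)]     # arbitrary corrupted state
stabilize(S, K, rng)
# Closure: once legitimate, every subsequent move keeps one privilege.
for _ in range(50):
    fire(S, privileges(S)[0], K)
    assert len(privileges(S)) == 1
```

Note that at least one machine is always privileged (if no counter differs from its left neighbor's, all counters are equal and machine 0 is privileged), so the protocol never deadlocks, even immediately after a fault.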
Recently there has been a trend toward designing general protocol transformers, which upgrade non-stabilizing protocols into self-stabilizing ones. The common approach [2,3,4] is to have a detection mechanism which periodically verifies the integrity of the application; when a fault is detected, it triggers a reset protocol. The task of the reset protocol is to flush the stale state information out of the system and to consistently install a new global state, from which the application can resume normal operation. It is fairly well understood when and how to use reset-based stabilization. The main research question for this approach is the design of an efficient, self-stabilizing reset protocol. An ideal reset protocol should work in time proportional to the network diameter and use only constant memory space at each node. The current state of the art offers either an optimal-space protocol or an optimal-time protocol, but no protocol achieves both simultaneously.
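The detect-then-reset pattern can be sketched in a few lines. The sketch below is a deliberately centralized toy (the cited transformers check integrity and reset in a distributed fashion); the function names and the toy token-ring application are our own illustration of the control flow only.

```python
def stabilizing_run(state, app_step, invariant, reset, rounds):
    """Detector + reset loop: verify the global invariant each round;
    on violation, flush the stale state and reinstall a consistent one
    before letting the application resume."""
    for _ in range(rounds):
        if not invariant(state):
            state = reset()
        state = app_step(state)
    return state

# Toy application: one token circulating around a ring of n nodes.
n = 4
invariant = lambda s: s.count(True) == 1       # exactly one token
reset = lambda: [True] + [False] * (n - 1)     # consistent initial state
app_step = lambda s: s[-1:] + s[:-1]           # pass the token rightward

corrupted = [True, False, True, True]          # transient fault hit 2 nodes
final = stabilizing_run(corrupted, app_step, invariant, reset, rounds=3)
assert final.count(True) == 1
```

The point of the sketch is the division of labor: the application makes no attempt to tolerate faults itself, and all recovery is concentrated in the generic detector/reset wrapper.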
Although the reset-based approach seems attractive from the engineering viewpoint, it is not widely used in the real world. The chief reason for rejecting reset is its insensitivity to the severity of the fault: even the slightest inconsistency results in a global system reset. Consider a single processor which has completely lost track of the protocol; the desired correction in such a case is probably to restart just that processor. We would not like to restart all processors executing the application unless we absolutely have to. It is highly desirable that the correction algorithm be time adaptive, namely that the time to overcome a fault be proportional to the extent of the fault (i.e., the number of processors affected by it).
Time-adaptive self-stabilizing algorithms have recently been the subject of intensive research. One work shows how a system can recover from a transient fault in time proportional to the number of nodes which were hit by the fault; unfortunately, the algorithm presented there is not self-stabilizing: in particular, the effect of the faults is internally accumulated and must be manually undone at some point. Another paper shows how to overcome a fault at a single node quickly, while faults which extend to a larger number of nodes may take much longer to repair. The latest result in this line of research shows how to repair any number F of faults in time proportional to F.
The basic technique which underlies the new algorithms is, of course, replication of information across the system. The main difficulty is that while the algorithm is required to change the values stored locally at the faulty nodes, it must avoid committing itself to wrong values: otherwise, a domino effect might take place, in which a few strategically located faulty nodes bring down the whole network.
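The cautious-commitment idea can be illustrated with the simplest form of replication, strict majority voting; this is our own minimal sketch of the principle, not the mechanism of any particular cited algorithm.

```python
from collections import Counter

def repair(copies):
    """Adopt a replicated value only if a strict majority of copies
    agrees; otherwise commit to nothing (return None), so that a few
    corrupted copies cannot propagate a wrong value through the system."""
    value, count = Counter(copies).most_common(1)[0]
    return value if count > len(copies) // 2 else None

assert repair(["a", "a", "b"]) == "a"    # 2 of 3 agree: safe to adopt
assert repair(["a", "b", "c"]) is None   # no majority: defer repair
```

Deferring when there is no majority is exactly what rules out the domino effect: a node never overwrites its state with a value that fewer than half of the replicas support.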
We remark that the main drawback of the latest algorithms is their excessive space requirements; however, under various practical assumptions, and using error-correcting codes, this overhead can be considerably reduced.