Time-Adaptive Self-Stabilization

We study the scenario where a transient fault hits f of the n nodes of a distributed system, corrupting their state. We consider the basic persistent-bit problem, in which the system must maintain a value despite transient faults by means of replication. We give an algorithm that recovers the value quickly: the bit is recovered at all nodes in O(f) time units for any unknown f < n/2. Moreover, complete state quiescence occurs in O(Diam) time units, where Diam denotes the diameter of the network. Consequently, the value persists indefinitely as long as any f < n/2 faults are followed by Omega(Diam) fault-free time units. We prove lower bounds showing that both time bounds are asymptotically optimal.
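
The requirement f < n/2 can be illustrated with a small centralized sketch. It is not the algorithm of the paper: it only shows why a replicated bit remains recoverable when fewer than half of the replicas are corrupted. The function names majority_bit and corrupt are hypothetical, and the sketch assumes the worst case in which every faulty node flips its copy. A global vote like this needs information from the whole network, which takes time proportional to Diam; the contribution described above is recovery in O(f) time.

```python
from collections import Counter


def majority_bit(replicas):
    """Return the value held by a strict majority of replicas, or None if no such value exists.

    Assumes a non-empty list of 0/1 values.
    """
    value, count = Counter(replicas).most_common(1)[0]
    return value if 2 * count > len(replicas) else None


def corrupt(replicas, faulty_nodes):
    """Model a transient fault: flip the bit at every node in faulty_nodes (worst case)."""
    return [1 - bit if i in faulty_nodes else bit for i, bit in enumerate(replicas)]


n = 7
original = [1] * n                        # every node stores the bit 1
damaged = corrupt(original, {0, 3, 5})    # f = 3 < n/2 nodes are hit
recovered = majority_bit(damaged)         # the correct copies still outvote the faulty ones
```

With f >= n/2 the corrupted copies could form a majority of their own, so no vote could tell the original value from the faulty one.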

Using the persistent-bit algorithm, we present a general transformer that takes a distributed non-reactive, non-stabilizing protocol P and produces a self-stabilizing protocol P' that solves the same problem as P, with the following additional property: if the number of faults that hit the system after stabilization is f, for any unknown f < n/2, then the output of P' regains stability in O(f) time units, and the state stabilizes in O(Diam) time units.
