Formal Aspects of Computing (1996) 3: 1-000
© 1996 BCS
Transformation of Programs for
Fault-tolerance
Zhiming Liu and Mathai Joseph
Department of Computer Science, University of Warwick, Coventry CV4 7AL, UK
Keywords: Failure semantics; Consistency; Fault-tolerant transformation; Fault-tolerant refinement
Abstract. In this paper we shall describe how a program constructed for a
fault-free system can be transformed into a fault-tolerant program for execution
on a system which is susceptible to failures. A program is described by a set of
atomic actions which perform transformations from states to states. We assume
that a fault environment is represented by a program $F$. The interference by the
fault environment $F$ on the execution of a program $P$ can then be described as a
fault-transformation $\mathcal{F}$ which transforms $P$ into a program $\mathcal{F}(P)$. This is proved
to be equivalent to the program $P [] P_F$, where $P_F$ is derived from $P$ and $F$, and
$[]$ is the union of the sets of actions of $P$ and $P_F$. A recovery transformation $\mathcal{R}$
transforms $P$ into a program $\mathcal{R}(P) = P [] R$ by adding a set of recovery actions
$R$, called a recovery program. If the system is fail-stop and faults do not affect
recovery actions, we have
$$\mathcal{F}(\mathcal{R}(P)) = \mathcal{F}(P) [] R = P [] P_F [] R$$
We illustrate this approach to fault-tolerant programming by considering the
problem of designing a protocol that guarantees reliable communication from
a sender to a receiver in spite of faults in the communication channel between
them.
Correspondence and offprint requests to: Zhiming Liu and Mathai Joseph, Department of
Computer Science, University of Warwick, Coventry CV4 7AL, UK.
This work was supported by research grant GR/D 11521 from the Science and Engineering
Research Council.
1. Introduction
There are several different ways in which a program can be developed using
formal rules which guarantee that it will satisfy a specification when executed
on a fault-free system [Lam83, AL88, Bac87, Bac88, BW89, Bac89]. In this
paper, we consider how similar methods can be used to construct fault-tolerant
programs for execution on systems which may exhibit any of a set of specified
failure properties.
Fault-tolerant programs are required for applications where it is essential that
failures do not cause a program to have unpredictable execution behaviours. We
assume that the failures do not arise from errors in the program, since methods
such as those mentioned above can be used to construct error-free programs.
So, the only failures we shall consider are those caused by hardware and system
malfunctions. Many such failures can be masked from the program using automatic error detection and correction methods, but there is a limit to the extent
to which this can be achieved at reasonable cost in terms of the resources and
the time needed for correction.
When the nature or frequency of the errors makes automatic detection and
correction infeasible, it may still be possible that error detection can be performed. It is desirable that fault-tolerant programs are able to perform predictably under these conditions: for example when using memory with single
bit error correction and double bit error detection which operates even when
the error correction is not effective. In fact, the provision of good program level
fault-tolerance can make it possible to reduce the amount of expensive system
error correction needed, as program level error recovery can often be focussed
more precisely on the damage caused by an error than a general-purpose error
correction mechanism.
The task is then to develop programs which perform predictably in the presence of detected system failures, and this requires the representation of such
failures in the execution of a program. Earlier attempts to use formal proof
methods for verifying the properties of fault-tolerant programs (e.g. [JMS87],
[HW88]) were based on an informal description of the effects of faults, and this
limits their applicability. Here we shall instead model a fault as an action which
performs state transformations in the same way as other program actions, making it possible to extend a semantic model to include fault actions and to use a
single consistent method of reasoning for both program and fault actions.
Let P be a program satisfying the specification Sp. Let the effect of each
physical fault in the system on which P is executed be described as a fault
action which transforms a good program state into an error state which violates
Sp. Physical faults are then modelled as the actions of a fault program F which
interferes with the execution of P . A failure at any point during the execution
of P takes it into an error state in which a boolean variable f is true. (F is
assumed not to change an error state into a good state.)
In general a high level specification of a program is not sufficient to specify its
behaviour in the presence of system faults or to transform it into a fault-tolerant
program. It is also necessary to describe the hardware organisation of the system
on which the program is to be executed, its use of the resources of the system
and the nature of the possible faults in the system, e.g. which processors and
channels may fail, as all of these factors can affect the execution of the program.
Moreover, very little can be said about the effects of a system fault on a program until
it has been refined to the level where these effects can be observed. There is a need
to represent faults and their effects at various levels of abstraction and here we
shall use specifications to develop both the program and the fault environment
in which it executes.
After this introduction, in Section 2 a simple model for representing program
actions, fault actions and recovery actions is defined. A specification language is
presented in Section 3, with its syntax and semantics. Based on the semantics
of failures w.r.t. a given failure-prone environment, the effect of faults on the
original program is defined in terms of a program transformation in Section 4.
Section 5 provides an abstract definition of consistency which is used to define
recovery transformations which permit both backward and forward recovery.
Section 6 shows that the fault-tolerant properties of a program can be refined
along with the other properties defined in a program specification. Finally, in
Section 7 an example is given to illustrate fault-tolerant programming using
the method described in this paper; the problem is to design a protocol that
guarantees reliable communication from a sender to a receiver in spite of faults
in the communication channel between them.
2. A Simple Model
This section presents a simple computational model for describing programs,
faults and recovery.
2.1. Programs
As usual, we describe a program in terms of its variables, states and actions
[Lam87]. Program variables are associated with a set of values, called the value
space of the program. A state of the program is a mapping from the set of program
variables to the value space.
A predicate $Q$ is a function from the set of the program states to the boolean
values $\{true, false\}$. Thus, given a state $S$, $S$ satisfies $Q$ (or $S$ is of property $Q$)
if $Q(S)$ is true.
An action $A$ is a relation on the set of states. In other words, an action $A$
is a set of pairs of states. For a pair $(S, T)$ of states, $(S, T) \in A$ means that
executing $A$ starting in state $S$ can produce (or terminate in) state $T$. An action
thus defines a set of transitions from state to state.
An action $A$ is said to be enabled in state $S$ if there is a state $T$ such that
$(S, T) \in A$. Therefore, for each action $A$, there is a predicate $g_A$ which defines
the set of states in which $A$ is enabled. Stated in another way, $g_A(S)$ is true iff
$A$ is enabled in $S$. $g_A$ is called the guard of $A$.
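The view of an action as a relation, with its guard derived from that relation, can be made concrete on a small finite state space. The following is a minimal sketch only: the single variable x, its range 0..3 and the action `inc` are assumptions chosen for illustration and do not come from the paper.

```python
# Toy state space (an assumption for illustration): one variable x in 0..3,
# so a state is represented simply by the value of x.
STATES = range(4)

# An action is a set of (S, T) pairs. This hypothetical action increments x
# and has no transition from x == 3.
inc = {(s, s + 1) for s in STATES if s < 3}


def enabled(action, s):
    """Return True iff some T exists with (s, T) in the action."""
    return any(src == s for (src, _dst) in action)


def guard(action):
    """Return the guard of an action as a predicate on states."""
    return lambda s: enabled(action, s)


g_inc = guard(inc)
enabled_states = [s for s in STATES if g_inc(s)]
print(enabled_states)  # [0, 1, 2]
```

Here the guard is computed from the relation rather than written separately, which mirrors the definition of the guard as the set of states in which the action is enabled.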
Let a program $P$ be described as a set $Act_P$ of actions and a set $Init_P$ of
predicates which define the initial states of $P$. The actions of $P$ are executed
atomically: if an action is chosen for execution, it will be executed without
interference from the other actions in the program. Such a program $P$ is denoted
as $(Init_P, Act_P)$.
An execution of program $P$ consists of an infinite sequence, $S_0, S_1, \ldots$, of
states, where each pair $(S_i, S_{i+1})$ is in some action of $Act_P$, or $S_i = S_{i+1}$, and $S_0$
satisfies all the predicates in $Init_P$. A fixed point of $P$ is a state $S$ such that for
any action $A \in Act_P$,
$$\forall T : (S, T) \in A \Rightarrow T = S$$
An execution, $E = S_0, S_1, \ldots$, of $P$ terminates if it reaches a fixed point, i.e.
for some $i \geq 0$, $S_i$ is a fixed point of $P$ and hence $S_j = S_i$ for any $j \geq i$. $E$ is
said to be fair for an action $A$ of $P$, if for any $i \geq 0$, either there is a $j \geq i$ such
that $(S_j, S_{j+1}) \in A$, or there is a $k \geq i$ such that $g_A(S_k) = false$.
If $P_1$ and $P_2$ are programs, the union composition program $P_1 [] P_2$ is defined
as
$$P_1 [] P_2 = (Init_{P_1} \cup Init_{P_2},\; Act_{P_1} \cup Act_{P_2})$$
In general, an action of a program is composed of a set of primitive actions.
If $A$ and $B$ are actions, the union $A \cup B$ is also an action. Let $A ; B$ denote the
action which is the composition of the actions $A$ and $B$:
$$A ; B = \{(S, T) \mid \exists T_1 : ((S, T_1) \in A) \wedge ((T_1, T) \in B)\}$$
The union operator models choice while the composition of actions models
sequential composition in programs.
During the transition from one state to another, defined by an atomic action
of program $P$, the system may pass through a number of intermediate states.
For example, if $A = A_1 ; A_2$ is in $Act_P$, the transition $(S, T) \in A$ is carried out
through an intermediate state $T_1$ such that
$$((S, T_1) \in A_1) \wedge ((T_1, T) \in A_2)$$
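Union and composition of actions are ordinary operations on relations, so they can be sketched directly on sets of state pairs. This is an illustrative sketch under assumed names: the actions `inc` and `reset` and the integer states are hypothetical.

```python
def union(a, b):
    """Choice between two actions: the union of the two relations."""
    return a | b


def compose(a, b):
    """Sequential composition: (S, T) is included iff some intermediate
    state T1 exists with (S, T1) in a and (T1, T) in b."""
    return {(s, t) for (s, t1) in a for (u, t) in b if t1 == u}


# Hypothetical actions over integer states.
inc = {(0, 1), (1, 2), (2, 3)}
reset = {(3, 0)}

inc_twice = compose(inc, inc)         # passes through an intermediate state
inc_or_reset = union(inc, reset)      # nondeterministic choice
inc_then_reset = compose(inc, reset)  # only 2 -> 3 -> 0 survives
```

The intermediate state is not visible in the composed relation, which is why faults striking at such states need separate treatment later in the paper.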
2.2. Concurrency
We can model the parallel execution of a program $P = (Init_P, Act_P)$ by
partitioning its actions $Act_P$ among a set $Proc = \{p_1, \ldots, p_k\}$ of processes, as in
Back's concurrent action system [Bac88]. A shared variable is one which is used
by two or more actions in different processes, while a private variable is used only
by the actions in one process. Two actions of $P$ can be executed in parallel if and
only if they are not in the same process and do not share variables. Therefore a
concurrent system $P$ with a set $Proc$ of $k$ processes can be written as
$$P = (Init_P,\; p_1 [] \ldots [] p_k)$$
where $p_i$ denotes the actions that are assigned to process $p_i$, $1 \leq i \leq k$. In such a
concurrent system, processes communicate by executing actions using common
variables.
We can also model the parallel execution of P by partitioning its variables
Var (P ) among its processes Proc (Back's distributed action system [Bac88]).
A shared (or joint) action refers to variables in two or more processes, while a
private action refers only to variables in one process. Two actions of a distributed
system can be executed in parallel if and only if they do not refer to variables
in the same process. A distributed system P with a set Proc of k processes can
then be written as:
$$P = (Init_P,\; Proc,\; A_1[p_{11}, \ldots, p_{1k_1}] [] \ldots [] A_n[p_{n1}, \ldots, p_{nk_n}])$$
where $Proc = \{p_1, \ldots, p_k\}$ is a partition of the variables of $P$ and each $A_j[p_{j1}, \ldots, p_{jk_j}]$
is an action shared by the processes $p_{j1}, \ldots, p_{jk_j}$ in $Proc$, $1 \leq j \leq n$.
An action $A_j[p_{j1}, \ldots, p_{jk_j}]$ is assumed to be executed jointly by the processes
$\{p_{j1}, \ldots, p_{jk_j}\}$ that share the action $A_j$, and these processes synchronize over the
execution of $A_j$. The action $A_j$ provides communication between the processes:
the variables of one process may be updated by Aj in a way that depends on the
values of the variables of other processes sharing this action.
A sequential action system is a special case of a parallel action system where
all the actions (or all the variables) are assigned to the same process. Another
special case is where each action (or each variable) is assigned to a separate
process. The program is then executed with maximal parallelism.
We can also view a concurrent action system
$$P = (Init_P,\; p_1 [] \ldots [] p_k)$$
as a distributed system, with the private variables of each process pi being local
to the process and the shared variables being partitioned among new processes.
Because actions are executed atomically, a parallel computation can be modelled by a sequential computation with interleaved execution of actions. Hence,
the set of possible executions of an action system will be the same for a concurrent or a distributed action system and for a sequential action system. This
allows us to separate the logical behaviour of processes from implementation
issues.
2.3. Reasoning About Programs
As usual, we define the Hoare triple $\{Q\}A\{R\}$ to mean
$$\forall (S, T) : ((S, T) \in A \wedge Q(S) \Rightarrow R(T))$$
Thus, $\{Q\}A\{R\}$ defines the total correctness¹ of action $A$: when $A$ is executed
in a state satisfying $Q$, it terminates in a state satisfying $R$. To reason about a
program $P$, $A$ is universally or existentially quantified over the actions of $P$; a
property which holds for all points of the execution of $P$ is defined using universal
quantification while a property which holds eventually is defined using existential
quantification [Lam87, CM88]. For example, a safety property of a concurrent
program $P$ is defined by an invariant $Q$ so that $\{Q\}A\{Q\}$ holds for any action
$A$ in $Act_P$.
Similarly, for any predicate $R$ and action $A$, the weakest precondition $wp(A, R)$
is defined as
$$wp(A, R)(S) = \forall T : ((S, T) \in A \Rightarrow R(T))$$
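Both definitions quantify over the pairs of a relation, so on a finite action they can be evaluated by enumeration. The sketch below assumes a hypothetical action `inc` over integer states; it illustrates the two formulas above and is not code from the paper.

```python
def hoare(q, action, r):
    """{Q} A {R}: every transition of A that starts in a state satisfying Q
    ends in a state satisfying R."""
    return all(r(t) for (s, t) in action if q(s))


def wp(action, r):
    """wp(A, R) as a predicate: all A-successors of S satisfy R.
    The predicate holds vacuously where A has no transition from S."""
    return lambda s: all(r(t) for (src, t) in action if src == s)


# Hypothetical action: increment x while x < 3.
inc = {(0, 1), (1, 2), (2, 3)}

below_three = lambda x: x < 3
holds_from_small = hoare(lambda x: x < 2, inc, below_three)  # True
holds_from_any = hoare(lambda x: True, inc, below_three)     # False, 2 -> 3
wp_below_three = wp(inc, below_three)
```

Note that `wp_below_three(3)` is true only because `inc` has no transition from 3, which follows directly from the universally quantified form of the definition.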
2.4. Refinement
A program $P$ is said to be refined by program $P'$, denoted $P \sqsubseteq P'$, if each
execution of $P'$ is an execution of $P$. An action $A$ is said to be refined by action
$A'$, denoted $A \sqsubseteq A'$, if for any $Q$ and $R$,
$$\{Q\}A\{R\} \Rightarrow \{Q\}A'\{R\}$$
The refinement operation is reflexive and two programs (or actions) $P$ and
$P'$ are equivalent, denoted $P \equiv P'$, if each is refined by the other.
Obviously, if $A \in Act_P$,
$$(g_A \Leftrightarrow g_{A'}) \wedge (A \sqsubseteq A') \Rightarrow (P \sqsubseteq P[A/A'])$$
where $P[A/A']$ is the program obtained from $P$ by replacing $A$ with $A'$.
A general definition and the rules of refinement can be found in [AL88,
Bac87, Bac88, Bac89, Mor90].
¹ This is Hoare logic for total correctness; the original proposal in [Hoa69] used the notation
$Q\{A\}R$ and dealt with partial correctness.
2.5. Modelling Faults
Let P be a program satisfying the specification Sp. The effect of a physical fault
in the system on which P is executed can be described as a fault action which
transforms a good state into an error state leading to a state which violates Sp.
The physical faults in the system can be then modelled as the actions of a fault
program F which interferes with the execution of P . A failure at any point during
the execution of P takes it into an error state in which a boolean variable f is
true .
Assume that a fault may interfere at any point in the execution of P . It
might appear that the interference by a fault environment F on the execution
of program P can simply be defined by the union composition P []F of P and
F , which would imply that faults occur only before or after the execution of
an atomic action of P , but not during its execution. However, this is clearly a
limited view since in practice faults do occur in intermediate states and can lead
to failures.
Let $P$ be a program constructed from a set of primitive actions so that
each action $A \in Act_P$ is constructed from primitive actions using the union
and composition of actions. The interference of $F$ on the execution of $P$ can be
defined as a transformation $\mathcal{F}$, in the following way.
1. For a primitive action $A$, $\mathcal{F}(A) = A \cup (\bigcup Act_F)$
2. $\mathcal{F}(A \cup B) = \mathcal{F}(A) \cup \mathcal{F}(B)$
3. $\mathcal{F}(A ; B) = \mathcal{F}(A) ; \mathcal{F}(B)$
4. $\mathcal{F}(P) = (Init_P \cup \{f = false\},\; \{\mathcal{F}(A) \mid A \in Act_P\})$
Using the algebra of relations, it is easy to prove the following theorem.
Theorem 2.1. For programs $P$ and $F$, there is a program $P_F$ such that
$$\mathcal{F}(P) \equiv P [] P_F$$
We denote $\mathcal{F}(P)$ by $P^F$. Assume that this transformation ensures that the
execution of $P$ under $F$ is fail-stop [SS83], i.e. that no further actions of $P$ will
be executed when a failure occurs. The execution of $P$ in the fault environment
$F$ is equivalent to the execution of $\mathcal{F}(P)$ in a fault-free environment.
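Rules 1 to 3 of the transformation are structural, so they can be sketched as a recursive function over action terms. Everything concrete below is an assumption made for illustration: the tuple encoding of terms (`"prim"`, `"union"`, `"seq"`), the action `inc`, the fault action `glitch` and the use of -1 as an error state.

```python
def compose(a, b):
    """Relational composition of two actions."""
    return {(s, t) for (s, t1) in a for (u, t) in b if t1 == u}


def fault_transform(term, faults):
    """Apply rules 1-3 to an action term; `faults` lists the relations of
    the fault program (its set of actions)."""
    kind = term[0]
    if kind == "prim":   # rule 1: add every fault action as an alternative
        return term[1] | set().union(*faults)
    if kind == "union":  # rule 2: distribute over choice
        return fault_transform(term[1], faults) | fault_transform(term[2], faults)
    if kind == "seq":    # rule 3: distribute over composition
        return compose(fault_transform(term[1], faults),
                       fault_transform(term[2], faults))
    raise ValueError(f"unknown action term: {kind!r}")


# Hypothetical example: A = inc ; inc, and a fault that strikes only in state 1.
inc = {(0, 1), (1, 2)}
glitch = {(1, -1)}  # -1 stands for an error state (assumption)

a = ("seq", ("prim", inc), ("prim", inc))
transformed = fault_transform(a, [glitch])
naive = compose(inc, inc) | glitch  # plain union of A with the fault action
```

In this toy case `transformed` contains the pair (0, -1), produced by a fault at the intermediate state 1, while `naive` does not, which is the limitation of plain union composition described in the text.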
2.6. Modelling Recovery
The behaviour of $\mathcal{F}(P)$ will not usually satisfy $Sp$. To make the program
fault-tolerant, $P$ must be transformed by a fault-tolerant transformation into a
program $\mathcal{T}(P)$ such that $\mathcal{F}(\mathcal{T}(P))$ is expected to satisfy the specification $Sp$ of $P$.
Unfortunately, even this may not always be possible though $\mathcal{F}(\mathcal{T}(P))$ may be
shown to satisfy a weaker but still acceptable specification. One kind of
fault-tolerant transformation is a recovery transformation, based on a definition of
consistency.
Let $P$ have an initial execution sequence $[S_0]A_1[S_1]A_2 \ldots A_m[S_m] \ldots A_n[S_n]$
in which each action $A_i$ transforms state $S_{i-1}$ into state $S_i$. Assume that this
sequence ends in state $S_n$ because of a failure. Let $S$ be a state which is the union of
the disjoint substates each of which is a substate of one of the states $S_m, \ldots, S_n$.
In general, such substates representing partial information (e.g. local states of
processes) have been saved at some of the execution points $S_m, \ldots, S_n$, and will
be used to calculate a global state $S$ for recovery after the occurrence of faults.
If the execution of $P$ can be restarted from state $S$ and still satisfy $Sp$, then the
state $S$ is said to be backward consistent with the interrupted execution sequence.
Alternatively, the execution of $P$ can be continued from a possible future state
$S_{n+k}$. If there exists an execution sequence $[S_n]A_{n+1}[S_{n+1}] \ldots A_{n+k}[S_{n+k}]$ of $P$
which satisfies $Sp$, then $S_{n+k}$ is said to be forward consistent with the
interrupted execution sequence. A consistent state is a state which is either backward
or forward consistent.
The recovery transformation $\mathcal{R}$ transforms a program $P$ into a program
$\mathcal{R}(P) = P [] P_R$ by adding a set of recovery actions $R$, called a recovery program.
The recovery actions of $R$ are enabled only when a failure occurs and transform
an error state into a good state which is consistent with the execution sequence
interrupted by the fault. We assume that the recovery actions are not affected
by the fault environment $F$, i.e. that no failure occurs during the execution of a
recovery action. Therefore, we have
$$\mathcal{F}(\mathcal{R}(P)) = \mathcal{F}(P) [] R = (P^F) [] P_R \equiv P [] P_F [] P_R$$
To ensure that the recovery program is eventually executed when a fault occurs,
the executions of $\mathcal{F}(\mathcal{R}(P))$ must be fair for each action of $P_R$.
Let $P_0 \sqsubseteq \ldots \sqsubseteq P_k$ be a sequence of program specifications such that $P_i \sqsubseteq
P_{i+1}$ and $P_k$ contains enough information for specifying the fault environment
$F$. From $P_k$ we may be able to determine the number of processes in the
program, those which may fail during execution, and the channel variables which are
faulty. Based on $F$ and the fault-transformation $\mathcal{F}$ on $P_k$, we can perform the
recovery transformation $\mathcal{R}$ on $P_k$ and obtain the fault-tolerant program $P_k [] P_{kR}$. For
example, fault-tolerant mechanisms such as checkpointing, recovery blocks and
conversations [BR81, Ran75, MR78, KT87] can then be introduced by applying
stepwise refinement to $P_k [] P_{kR}$.
3. A Specification Language
The specification notation is a combination of Back's action system formalism
[Bac88] and a UNITY-like notation [CM88], and this is defined in the model
presented in the previous section. It will be used for reasoning about programs
at different levels of abstraction and to provide a sound refinement calculus.
3.1. Commands and Actions
A program specification $P = (Init_P, Act_P)$ is a pair of sets, where $Init_P$ is a set
of predicates and $Act_P$ is a set of action specifications. An action specification
$A \in Act_P$ has the syntactic form $g \to c$, where $g$ is a boolean condition and $c$ is
a command [Bac88, BS89]. We use the same symbols for programs and program
specifications, and we call a program specification a program, and an action
specification an action.
Let $g_A$ and $c_A$ respectively denote the guard $g$ and the body $c$ of action $A$,
and let the action $true \to c$ be abbreviated to $c$ when there is no confusion.
A command $c$ is defined as follows,
$$\begin{array}{rll}
c ::= & x_i := x'_i.Q & \text{(nondeterministic assignment)} \\
\mid & c_1; \ldots; c_n,\ n > 1 & \text{(sequential composition)} \\
\mid & \mathbf{if}\ A_1 [] \ldots [] A_n\ \mathbf{fi},\ n \geq 1 & \text{(conditional composition)} \\
\mid & \mathbf{do}\ A_1 [] \ldots [] A_n\ \mathbf{od},\ n \geq 1 & \text{(iterative composition)}
\end{array}$$
Here $x_i$ and $x'_i$ are lists of variables and $Q$ is a condition over the values of
program variables [BK83]. In sequential composition, each $c_i$ is a command, and
each $A_i$ in the conditional and iterative composition commands is an action. We
write $IF_{i=1}^{n} A_i$ for $\mathbf{if}\ A_1 [] \ldots [] A_n\ \mathbf{fi}$ and $DO_{i=1}^{n} A_i$ for the iterative composition
command.
The effect of the nondeterministic assignment command is to assign to the
list of variables $x_i$ some value(s) $x'_i$ satisfying condition $Q$. (Note that this
may introduce unbounded nondeterminism: we shall discuss this later.) The set
$Act_P = \{A_1, \ldots, A_n\}$ of actions can also be represented as a list of actions
$A_1 [] \ldots [] A_n$.
3.2. Semantics of Commands and Actions
By defining each action specification as an atomic action, the semantics of a
program specification is a program in the computational model. Because of this
atomicity, a program can be viewed as being sequential and nondeterministic.
We also assume that, for simplicity, no abortion occurs in the execution of any
command, and that the execution of each command always terminates.
States: Let $V$ be the value space of program $P$. A state $S$ of a program $P$ is
a function from the program variables $Var(P)$ to the value space $V$. A sub-state
$S|_Y$ is the restriction of $S$ to $Y$ where $Y \subseteq Var(P)$. $\Sigma_P$ is used to denote the set
of all the states of program $P$.
Semantics of commands: The semantics of a command $c$ is a function:
$$\Phi(c) : \Sigma_P \to 2^{\Sigma_P}$$
This function can be extended as a function over the powerset $2^{\Sigma_P}$ in the
following way: for a set $\Omega$ of states
$$\Phi(c)(\Omega) = \bigcup_{S \in \Omega} \Phi(c)(S)$$
The semantics of the execution of a command $c$ in a state $S$ is defined
below.
1. Assignment:
$$\Phi(x_i := x'_i.Q)(S) = \{S' \mid Q(S') \wedge \forall y : y \notin x_i : S'(y) = S(y)\}$$
Thus the programmer must guarantee the existence of a state in which $Q$ is
true.
2. Sequential composition: $\Phi(c_1; c_2)(S) = \Phi(c_2)(\Phi(c_1)(S))$
3. Conditional composition:
$$\Phi(IF_{i=1}^{n} A_i)(S) = \bigcup_{g_{A_i}(S) = true} \Phi(c_{A_i})(S)$$
To guarantee that no abortion occurs, in any state $S$, one of the guards $g_{A_i}$
must be true.
4. Iterative composition: We first define
$$\Phi(A_1 [] \ldots [] A_n)(S) = \bigcup_{g_{A_i}(S) = true} \Phi(c_{A_i})(S)$$
within $2^{\Sigma_P}$ which is the complete partial order domain (CPO) under inclusion
between sets of states of $P$. Let $Y$ be a set of states and $F(S)$ be the least
fixed point² of the equation
$$Y = \{S\} \cup \Phi(A_1 [] \ldots [] A_n)(Y)$$
Then
$$\Phi(DO_{i=1}^{n} A_i)(S) = F(S) \cap M$$
where $M = \{S \mid \forall i : 1 \leq i \leq n : g_{A_i}(S) = false\}$. To guarantee the
termination of $DO_{i=1}^{n} A_i$, we assume that $F(S) \cap M$ is always non-empty
for any state $S$.
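On a finite state space the least fixed point in item 4 can be computed by straightforward iteration, which gives a small executable reading of the iterative composition. This is a sketch under assumptions: actions are represented as hypothetical (guard, body) pairs of Python functions, and the countdown actions `dec1` and `dec2` are invented for the example.

```python
def step(actions, states):
    """One application of A1 [] ... [] An to a set of states: the union of
    the results of every enabled action body."""
    out = set()
    for s in states:
        for guard, body in actions:
            if guard(s):
                out |= body(s)
    return out


def do_loop(actions, s):
    """Semantics of do A1 [] ... [] An od from state s: iterate
    Y = {s} | step(Y) to its least fixed point, then keep the states in
    which no guard holds. Terminates on finite reachable state sets."""
    y = {s}
    while True:
        nxt = {s} | step(actions, y)
        if nxt == y:
            break
        y = nxt
    return {t for t in y if not any(guard(t) for guard, _body in actions)}


# Hypothetical actions on an integer x.
dec1 = (lambda x: x > 0, lambda x: {x - 1})
dec2 = (lambda x: x > 1, lambda x: {x - 2})

print(do_loop([dec1, dec2], 3))  # {0}
print(do_loop([dec2], 3))        # {1}
```

The iteration starts from the singleton {s} rather than the empty set; both reach the same least fixed point because the first iteration from the empty set yields {s}.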
Semantics of actions: The semantics of an action $A = g \to c$ is a
function:
$$\Phi(A) : \Sigma_P \to 2^{\Sigma_P}$$
such that
$$\Phi(A)(S) = \begin{cases} \Phi(c)(S) & \text{if } g(S) = true \\ \emptyset & \text{if } g(S) = false \end{cases}$$
Semantics of programs: For the value space $V$ of program $P$, let $V^\dagger$
be the set of all the finite and infinite sequences³ over $V$. An observation of
program $P$ is a sequence of states which is defined as the function $\sigma : Var(P) \to
V^\dagger$ satisfying the following conditions:
OB1. $\sigma(x) \neq \langle\rangle$, for any $x \in Var(P)$,
OB2. $\#\sigma(x) = \#\sigma(y)$ (denoted by $\#\sigma$) for any $x, y \in Var(P)$,
OB3. for each $i < \#\sigma$, $\sigma[i]$ is a state of $P$ defined as $\sigma[i](x) = \sigma(x)(i)$ for any
$x \in Var(P)$,
OB4. $\sigma[0]$ is an initial state of $P$,
OB5. $\forall i : 0 < i < \#\sigma :: (\sigma[i] = \sigma[i-1]) \vee (\exists A \in Act_P : \sigma[i] \in \Phi(A)(\sigma[i-1]))$
² The existence of such a fixed point can be proved in set theory.
³ $\langle a, \ldots, b \rangle$ is the sequence of elements $a, \ldots, b$; $\langle\rangle$ is the empty sequence and ${}^\frown$ concatenates
two sequences according to the standard definition: e.g. $\langle a \rangle {}^\frown \langle b \rangle = \langle a, b \rangle$, $\langle a \rangle {}^\frown \langle\rangle = \langle a \rangle$,
etc.; $\sigma \trianglelefteq \sigma'$ denotes that $\sigma$ is a prefix of $\sigma'$; $\sigma \triangleleft \sigma'$ denotes that $\sigma$ is a proper prefix of
$\sigma'$; $\#\sigma$ is the length of the sequence $\sigma$ and $\#\sigma = \infty$ if $\sigma$ is infinite; and $\sigma(i-1)$ is the $i$th
element of $\sigma$, $1 \leq i \leq \#\sigma$; $head(\sigma)$ and $last(\sigma)$ denote respectively the first and last elements
of a non-empty sequence $\sigma$; $tail(\sigma)$ denotes the sequence obtained from $\sigma$ by removing the first
element of $\sigma$.
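We use the sequence notation to describe a property of a function from $Var(P)$
to the sequences $V^\dagger$; e.g. for the functions $\sigma$ and $\sigma'$, $\sigma {}^\frown \sigma'$ is the function such
that $(\sigma {}^\frown \sigma')(x) = \sigma(x) {}^\frown \sigma'(x)$ for each $x \in Var(P)$.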
A function : V ar(P) ! V satisfying Condition OB2 is said to be
a sub-observation of , i.e. , if there is an observation and a function
: V ar(P) ! V satisfying Condition OB2 such that
= ^ ^
Obviously, given a command $c$ and an action $A$, $\Phi(c)$ and $\Phi(A)$ can be
extended to be functions over the set $OB_P$ of the finite observations of $P$, e.g.
$\Phi(c)(\sigma) = \{\sigma {}^\frown S \mid S \in \Phi(c)(last(\sigma))\}$.
An execution $E$ of $P$ is an infinite observation of $P$. It is said to exhibit
fairness (or justice [MP83, GP89]) if for any $i \geq 0$ and $A \in Act_P$, either there
is a $j \geq i$ such that $E(j+1) \in \Phi(A)(E(j))$, or there is a $k \geq i$ such that
$g_A(E(k)) = false$. The semantics of a program $P$ is the set $\Phi(P)$ of its fair
executions. Programs $P$ and $P'$ are equivalent if they have the same semantics:
$$P \equiv P' \;\stackrel{\mathrm{def}}{=}\; \Phi(P) = \Phi(P')$$
Within this semantic model, $P [] skip \equiv P$, where $skip$ denotes the program
which never changes the values of the program variables. We also use $skip$ to
denote a command or an action which does not change the values of the program
variables.
Therefore, this model permits finite stuttering, i.e. in any execution, a state
can be repeated consecutively at most a finite number of times before the
execution reaches the fixed point of $P$. The stuttering property is required when
dealing with the refinement of reactive programs [Lam83, AL88, Bac89].
3.3. Reasoning About Action Systems
As in Hoare logic [Hoa69, Dij76], the specification $\{Q\}A\{R\}$ defines an action
$A$ which when executed in a state satisfying predicate $Q$ terminates in a state
satisfying predicate $R$. $A$ is universally or existentially quantified over the actions.
The logical operators unless, stable, invariant, ensures, leads-to ($\mapsto$)
are used to describe the safety and progress properties of programs. Following
UNITY [CM88], these operators are defined in terms of the Sat relation for the
formulae listed below.
$$P\ Sat\ (Q\ \mathbf{unless}\ R) \equiv \forall A : A \in Act_P :: \{g_A \wedge Q \wedge \neg R\}A\{Q \vee R\}$$
$$P\ Sat\ (\mathbf{stable}\ Q) \equiv P\ Sat\ (Q\ \mathbf{unless}\ false)$$
$$P\ Sat\ (\mathbf{invariant}\ Q) \equiv (Init_P \Rightarrow Q) \wedge P\ Sat\ (\mathbf{stable}\ Q)$$
$$P\ Sat\ (Q\ \mathbf{ensures}\ R) \equiv P\ Sat\ (Q\ \mathbf{unless}\ R) \wedge (\exists A : A \in Act_P :: \{g_A \wedge Q \wedge \neg R\}A\{R\})$$
$$\frac{P\ Sat\ (Q\ \mathbf{ensures}\ R)}{P\ Sat\ (Q \mapsto R)}, \qquad \frac{P\ Sat\ (Q \mapsto R),\ P\ Sat\ (R \mapsto R')}{P\ Sat\ (Q \mapsto R')},$$
for any set $W$:
$$\frac{\forall m : m \in W :: P\ Sat\ (Q(m) \mapsto R)}{P\ Sat\ ((\exists m : m \in W :: Q(m)) \mapsto R)}.$$
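The unless, stable and ensures formulae reduce to Hoare triples over the actions of the program, so on a finite program they can be checked by enumeration. The following sketch assumes a hypothetical representation of a program as a list of (guard, relation) pairs and a single invented action `inc`; leads-to is not covered because it is defined by inference rules rather than by a single formula.

```python
def triple(pre, rel, post):
    """{pre} A {post} for an action given as a relation on states."""
    return all(post(t) for (s, t) in rel if pre(s))


def unless(actions, q, r):
    """Q unless R: for all A, {g_A and Q and not R} A {Q or R}."""
    return all(
        triple(lambda s, g=g: g(s) and q(s) and not r(s),
               rel,
               lambda t: q(t) or r(t))
        for g, rel in actions
    )


def stable(actions, q):
    """stable Q is Q unless false."""
    return unless(actions, q, lambda s: False)


def ensures(actions, q, r):
    """Q ensures R: Q unless R, and some A with {g_A and Q and not R} A {R}."""
    return unless(actions, q, r) and any(
        triple(lambda s, g=g: g(s) and q(s) and not r(s), rel, r)
        for g, rel in actions
    )


# Hypothetical one-action program: increment x while x < 3.
inc = (lambda x: x < 3, {(0, 1), (1, 2), (2, 3)})
program = [inc]

one_ensures_two = ensures(program, lambda x: x == 1, lambda x: x == 2)
at_least_one_stable = stable(program, lambda x: x >= 1)
at_most_one_stable = stable(program, lambda x: x <= 1)
```

The guard is bound with a default argument (`g=g`) so that each lambda refers to the guard of its own action rather than to the last one in the loop.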
A fixed point of a program is a program state such that execution of any
statement of the program in that state leaves the state unchanged. The execution
of a program terminates if it reaches its fixed point. Let $Fixedpoint(P)$ denote
the predicate which defines the fixed points of program $P$. A state $S$ satisfies
$Fixedpoint(P)$ iff it is a fixed point of $P$.
A program specification can be written by using either UNITY-like logic or
an action system formalism. However, we will normally use the logic for high level
specification and the action system formalism for the refinement. We assume a
fairness rule and that the execution of a command always terminates.
4. Transformations for Specified Faults
The effect of faults on a program P is specified by a program F which defines
a set of atomic actions, called fault actions, representing the fault environment.
The execution of the program P under the fault environment specified by F is
equivalent to the execution of P together with F on the fault-free system. Such
a failure execution of P is defined by the following failure semantics.
4.1. Failure Semantics
For a program $P$ and a fault environment $F$, assume that $P$ has a boolean
variable $f$ to indicate the presence of a fault, and that the value of $f$ is never
changed in $P$. Each action $A \in F$ is assumed to satisfy
$$\{true\}A\{f\}$$
A state $S$ is good if $f(S) = false$ and an error state is a state which is not good.
If $c$ is a command of $P$, then the failure semantics of $c$ w.r.t. $F$ is a function
$$\Phi_F(c) : \Sigma_P \to 2^{\Sigma_P}$$
which satisfies the following conditions:
1. if $c$ is an assignment command, then
$$\Phi_F(c)(S) = \begin{cases} \{S\} & \text{if } f(S) = true \\ \bigcup_{a \in F} \Phi(a)(S) \cup \Phi(c)(S) & \text{if } f(S) = false \end{cases}$$
2. $\Phi_F(c_1; c_2)(S) = \Phi_F(c_2)(\Phi_F(c_1)(S))$
3. if $c$ is the conditional composition $\mathbf{if}\ A_1 [] \ldots [] A_n\ \mathbf{fi}$, $A_i = g_i \to c_i$, then
$$\Phi_F(c)(S) = \begin{cases} \bigcup_{g_{A_i}(S) = true} \Phi_F(c_{A_i})(S) & \text{if } \neg f(S) = true \\ \{S\} & \text{if } f(S) = true \end{cases}$$
4. for the iterative composition $c = \mathbf{do}\ A_1 [] \ldots [] A_n\ \mathbf{od}$, $A_i = g_i \to c_i$, we first
define
$$\Phi_F(A_1 [] \ldots [] A_n)(S) = \begin{cases} \bigcup_{g_i(S) = true} \Phi_F(c_i)(S) & \text{if } f(S) = false \\ \emptyset & \text{if } f(S) = true \end{cases}$$
Let $E(S)$ be the least fixed point of the equation
$$Y = \{S\} \cup \Phi_F(A_1 [] \ldots [] A_n)(Y)$$
Then $\Phi_F(c)(S)$ is
$$\Phi_F(c)(S) = E(S) \cap M$$
where $M = \{S \mid (\neg f \wedge \bigvee_{i=1}^{n} g_i)(S) = false\}$, i.e. the states in which a failure
has occurred or no guard holds.
For each action $A = g \to c$, $\Phi_F(A)$ is defined by
$$\Phi_F(A)(S) = \begin{cases} \Phi_F(c)(S) & \text{if } \neg f \wedge g(S) = true \\ \emptyset & \text{otherwise} \end{cases}$$
From the failure semantics of a command and an action, the failure
observation of the program $P$ w.r.t. $F$ can be derived as a function
$$\sigma : Var(P) \to V^\dagger$$
satisfying the following conditions:
FOB1. $\sigma(x) \neq \langle\rangle$, for any $x \in Var(P)$,
FOB2. $\#\sigma(x) = \#\sigma(y)$ (denoted by $\#\sigma$) for any $x, y \in Var(P)$,
FOB3. for each $i < \#\sigma$, $\sigma[i]$ is a state of $P$ defined as $\sigma[i](x) = \sigma(x)(i)$ for any
$x \in Var(P)$,
FOB4. $\sigma[0]$ is an initial state of $P$,
FOB5. $\forall i : 0 < i < \#\sigma :: (\sigma[i] = \sigma[i-1]) \vee (\exists A \in Act_P : \sigma[i] \in \Phi_F(A)(\sigma[i-1]))$
A failure execution $E$ of a program $P$ w.r.t. $F$ is an infinite failure observation
of $P$. It is said to be fair if any action $A$ of $P$ which is continuously enabled at
an execution point in $E$ is eventually executed following the failure semantics of
$A$. The failure semantics of $P$ w.r.t. $F$ is the set $\Phi_F(P)$ of the fair failure
executions of $P$ w.r.t. $F$.
With this definition, the execution of program $P$ can be seen to be fail-stop.
Therefore, the failure semantics of the program $P$ can be described in terms of
two functions $good$ and $error$ such that each failure observation $\sigma$ w.r.t. $F$
can be written as
$$\sigma = good(\sigma) {}^\frown error(\sigma)$$
where $good(\sigma)$ is an observation of $P$ and $error(\sigma)$ is either empty or contains
only error states. Obviously, if there are no faults, i.e. $F$ is empty, each `failure'
execution is the same as some execution of $P$:
$$F = \emptyset \Rightarrow \Phi_F(P) = \Phi(P)$$
Programs $P$ and $P'$ are said to be fault-prone equivalent w.r.t. $F$ if they are
equivalent and have the same failure semantics:
$$P \equiv_F P' \;\stackrel{\mathrm{def}}{=}\; P \equiv P' \wedge \Phi_F(P) = \Phi_F(P')$$
It should be noted that equivalent programs may not be fault-prone equivalent.
4.2. Fault Transformation
Given a program $P = A_1 [] \ldots [] A_m$ and a fault environment $F$, let $f = true$ if
a fault occurs in the execution of $P$. First, transform $P$ into a program $FS(P)$
such that
$$FS(P) = FS(A_1) [] \ldots [] FS(A_m)$$
and each $FS(A_i)$ is obtained from $A_i$ by changing $g_{A_i}$ into $g_{A_i} \wedge \neg f$ and every
primitive command $c$ which occurs in $A_i$ into the command
$$FS(c) = \mathbf{if}\ \neg f \to c\ []\ f \to skip\ \mathbf{fi}$$
If $c$ is a primitive command, $FS(c)$ is the primitive f-stop command of $c$.
For a command $c$ of $P$, $FS(c)$ is an f-stop command which is obtained from $c$ by
changing every primitive command $c'$ which occurs in $c$ into the primitive f-stop
command $FS(c')$. Given an action of $P$
$$A = g \to c$$
the f-stop action of $A$ is
$$FS(A) = \neg f \wedge g \to FS(c)$$
The execution of $FS(P)$ on a system with the faults $F$ is therefore fail-stop.
Obviously, if $f$ is invariantly false, $FS(P) \equiv P$ since $P$ does not change the
value of $f$.
For each f-stop command FS(c) and each f-stop action FS(A), a transformation M is defined as:
1. if c is a primitive command,
$$M(FS(c)) = \mathbf{if}\ \lnot f \to c \;[]\; f \to skip \;[]\; \lnot f \to F\ \mathbf{fi}$$
2. if $c = c_1; c_2$,
$$M(FS(c)) = M(FS(c_1)); M(FS(c_2)) \equiv \mathbf{if}\ (FS(c_1); FS(c_2)) \;[]\; \lnot f \to F \;[]\; FS(c_1); (\mathbf{if}\ f \to skip \;[]\; \lnot f \to F\ \mathbf{fi})\ \mathbf{fi}$$
3. for an action $A = g \to c$,
$$M(FS(A)) = \lnot f \land g \to M(FS(c))$$
4. for a command $c = \mathbf{if}\ A_1 \;[]\; \ldots \;[]\; A_n\ \mathbf{fi}$,
$$M(FS(c)) = \mathbf{if}\ M(FS(A_1)) \;[]\; \ldots \;[]\; M(FS(A_n)) \;[]\; f \to skip\ \mathbf{fi}$$
5. if $c = \mathbf{do}\ A_1 \;[]\; \ldots \;[]\; A_n\ \mathbf{od}$,
$$M(FS(c)) = \mathbf{do}\ M(FS(A_1)) \;[]\; \ldots \;[]\; M(FS(A_n))\ \mathbf{od}$$
For the program $P = A_1 \;[]\; \ldots \;[]\; A_m$, we define
$$M(FS(P)) = M(FS(A_1)) \;[]\; \ldots \;[]\; M(FS(A_m))$$
Given the faults F, the fault transformation $\mathcal{F}$ is defined as:
$$\mathcal{F}(P) = M(FS(P))$$
and $\mathcal{F}(P)$ is called the fault-affected program of program P, denoted by $P^F$.
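For a primitive command, rule 1 says that an enabled action may either run its command c or be pre-empted by a fault action. The following sketch computes the set of possible successor states under that reading. It assumes that fault actions are plain functions on dictionary states and are always enabled; fault_affected_successors, inc and crash are hypothetical names introduced for this illustration.

```python
def fault_affected_successors(state, guard, command, faults):
    """Possible next states of M(FS(A)) for A = g -> c with c primitive.

    Sketch of: not f and g -> if not f -> c [] f -> skip [] not f -> F fi.
    """
    if state["f"] or not guard(state):
        return []  # the fault-affected action is not enabled
    successors = [command(state)]                        # normal execution of c
    successors.extend(fault(state) for fault in faults)  # or one fault action of F
    return successors


# Hypothetical action x < 3 -> x := x + 1 and a fault that only raises f.
inc_guard = lambda s: s["x"] < 3
inc = lambda s: {**s, "x": s["x"] + 1}
crash = lambda s: {**s, "f": True}

succ = fault_affected_successors({"x": 0, "f": False}, inc_guard, inc, [crash])
print(succ)  # [{'x': 1, 'f': False}, {'x': 0, 'f': True}]
```

With an empty list of faults the function returns exactly the successor of the original action, which corresponds to the case in which F is empty.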
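Theorem 4.1. Given a program P and its fault environment F, the fault-transformation $\mathcal{F}$ satisfies
$$(P^F) = F(P)$$
Proof. From the definitions of observations and failure observations, it is only
required to prove that for each state S and an action $A = g \to c$ of P,
$$F(A)(S) = (M(FS(A)))(S)$$
Case 1: if c is a primitive command, then
$$\begin{aligned}
F(A)(S) &= \begin{cases} F(c)(S) & \text{if } \lnot f \land g(S) = true \\ & \text{otherwise} \end{cases} \\
&= \begin{cases} \bigcup_{a \in F} (a)(S) \cup (c)(S) & \text{if } \lnot f \land g(S) = true \\ & \text{otherwise} \end{cases} \\
&= \begin{cases} (c)(S) \cup (F)(S) & \text{if } \lnot f \land g(S) = true \\ & \text{otherwise} \end{cases} \\
&= (\lnot f \land g \to c)(S) \\
&= (M(FS(A)))(S)
\end{aligned}$$
Case 2: if $c = c_1; c_2$ and $F(c_i) = (M(FS(c_i)))$ for $i = 1, 2$, then
$$\begin{aligned}
F(A)(S) &= \begin{cases} F(c_1)(F(c_2)(S)) & \text{if } \lnot f \land g(S) = true \\ & \text{otherwise} \end{cases} \\
&= \begin{cases} (M(FS(c_1)))((M(FS(c_2)))(S)) & \text{if } \lnot f \land g(S) = true \\ & \text{otherwise} \end{cases} \\
&= \begin{cases} (M(FS(c_1)); M(FS(c_2)))(S) & \text{if } \lnot f \land g(S) = true \\ & \text{otherwise} \end{cases} \\
&= \begin{cases} (M(FS(c_1; c_2)))(S) & \text{if } \lnot f \land g(S) = true \\ & \text{otherwise} \end{cases} \\
&= (\lnot f \land g \to M(FS(c_1; c_2)))(S) \\
&= (M(FS(A)))(S)
\end{aligned}$$
Case 3: if $c = \mathbf{if}\ A_1 \;[]\; \ldots \;[]\; A_n\ \mathbf{fi}$, $A_i = g_i \to c_i$ for $i = 1, \ldots, n$, then
$$\begin{aligned}
F(A)(S) &= \begin{cases} \bigcup_{g_i(S) = true} F(c_i)(S) & \text{if } \lnot f \land g(S) = true \\ & \text{if } \lnot f \land g(S) = false \end{cases} \\
&= (\lnot f \land g \to \mathbf{if}\ M(FS(A_1)) \;[]\; \ldots \;[]\; M(FS(A_n)) \;[]\; f \to skip\ \mathbf{fi})(S) \\
&= (M(FS(A)))(S)
\end{aligned}$$
Case 4: the proof for the case of iterative composition is similar to Case 3.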
Corollary 4.1. Given a program P and its faults F,
$$F = \emptyset \Rightarrow P^F \equiv P$$
Therefore, a fault-free execution of the program $P^F$ is a fault-prone execution
of program P and vice-versa, and $P^F$ is equivalent to P if no fault occurs.
Further, as shown in the following theorem, $P^F$ is equivalent to the union
composition of the f-stop program FS(P) with a program $P_F$.
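Theorem 4.2. Given P and F, there is a program $P_F$ such that
$$P^F \equiv FS(P) \;[]\; P_F$$
Proof. Let $P = A_1 \;[]\; \ldots \;[]\; A_n$, $A_i = g_i \to c_i$. We prove that for each $A_i$, there is
an action $F_i$ such that
$$M(FS(A_i)) \equiv \lnot f \land g_i \to \mathbf{if}\ FS(c_{A_i}) \;[]\; F_i\ \mathbf{fi}$$
Case 1: if $c_i$ is a primitive command,
$$\begin{aligned}
M(FS(A_i)) &= \lnot f \land g_i \to M(FS(c_i)) \\
&= \lnot f \land g_i \to \mathbf{if}\ \lnot f \to c_i \;[]\; f \to skip \;[]\; \lnot f \to F\ \mathbf{fi} \\
&\equiv \lnot f \land g_i \to \mathbf{if}\ FS(c_i) \;[]\; \lnot f \land g_i \to F\ \mathbf{fi}
\end{aligned}$$
Case 2: if $c_i = c; c'$, then
$$\begin{aligned}
M(FS(A_i)) &= \lnot f \land g_i \to M(FS(c; c')) \\
&\equiv \lnot f \land g_i \to \mathbf{if}\ (FS(c); FS(c')) \;[]\; \lnot f \to F \;[]\; FS(c); (\mathbf{if}\ f \to skip \;[]\; \lnot f \to F\ \mathbf{fi})\ \mathbf{fi} \\
&\equiv \lnot f \land g_i \to \mathbf{if}\ FS(c; c') \;[]\; \lnot f \to F \;[]\; FS(c); (\mathbf{if}\ f \to skip \;[]\; \lnot f \to F\ \mathbf{fi})\ \mathbf{fi}
\end{aligned}$$
Case 3: if $c_i = \mathbf{if}\ A_{i1} \;[]\; \ldots \;[]\; A_{in}\ \mathbf{fi}$, and $M(FS(A_{ij})) \equiv FS(A_{ij}) \;[]\; F_{ij}$,
$$\begin{aligned}
M(FS(A_i)) &= \lnot f \land g_i \to \mathbf{if}\ M(FS(A_{i1})) \;[]\; \ldots \;[]\; M(FS(A_{in})) \;[]\; f \to skip\ \mathbf{fi} \\
&\equiv \lnot f \land g_i \to \mathbf{if}\ FS(A_{i1}) \;[]\; F_{i1} \;[]\; \ldots \;[]\; FS(A_{in}) \;[]\; F_{in} \;[]\; f \to skip\ \mathbf{fi} \\
&\equiv \lnot f \land g_i \to \mathbf{if}\ FS(A_{i1}) \;[]\; \ldots \;[]\; FS(A_{in}) \;[]\; F_{i1} \;[]\; \ldots \;[]\; F_{in} \;[]\; f \to skip\ \mathbf{fi} \\
&\equiv \lnot f \land g_i \to \mathbf{if}\ FS(c_i) \;[]\; F_{i1} \;[]\; \ldots \;[]\; F_{in}\ \mathbf{fi}
\end{aligned}$$
Case 4: the proof for the case of iterative composition is similar to Case 3.
Therefore,
$$\begin{aligned}
P^F &= M(FS(A_1)) \;[]\; \ldots \;[]\; M(FS(A_n)) \\
&\equiv \lnot f \land g_{A_1} \to \mathbf{if}\ FS(c_{A_1}) \;[]\; F_1\ \mathbf{fi} \;[]\; \ldots \;[]\; \lnot f \land g_{A_n} \to \mathbf{if}\ FS(c_{A_n}) \;[]\; F_n\ \mathbf{fi} \\
&\equiv FS(A_1) \;[]\; \lnot f \land g_1 \to F_1 \;[]\; \ldots \;[]\; FS(A_n) \;[]\; \lnot f \land g_n \to F_n \\
&\equiv FS(A_1 \;[]\; \ldots \;[]\; A_n) \;[]\; \lnot f \land g_1 \to F_1 \;[]\; \ldots \;[]\; \lnot f \land g_n \to F_n \\
&= FS(P) \;[]\; \lnot f \land g_1 \to F_1 \;[]\; \ldots \;[]\; \lnot f \land g_n \to F_n
\end{aligned}$$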
The fault transformation defined above is based on the assumption that each
fault action in F may interrupt the execution of any action in P. Such an assumption simplifies the presentation of the failure semantics, the fault transformations and the discussion of their properties. But this assumption is not
essential for the results obtained in this section. In general, an action in P can
be interrupted by only a subset of the fault actions in F.
For a program $P = A_1 \;[]\; \ldots \;[]\; A_n$, $A_i = g_i \to c_i$, let
$$FS(A_i) = \lnot f_{A_i} \land g_i \to FS(c_i)$$
where $FS(c_i)$ is the $f_{A_i}$-stop command of $c_i$. For each action $A_i$, let $f_i$ be a
boolean variable which is true if a fault occurs in the execution of $A_i$. A fault
action $A_f \in F$ interrupts the action $A_i$ if $A_f$ transforms $f_i$ from false to true,
$$A_f \Join A_i = \{\lnot f_i\}\, A_f \,\{f_i\}$$
A fault action $A_f \in F$ stops action $A_i$ if it makes $f_{A_i}$ true, i.e.
$$A_f \text{ stops } A_i = \exists f_j : \{\lnot f_j\}\, A_f \,\{f_j\} \land (f_j \Rightarrow f_{A_i})$$
where $A_f \in F$ and $\{f_i\}\, A_f \,\{f_i\}$ for $i = 1, \ldots, n$. Let $F_{A_i}$ be the set of fault actions
in F which interrupt action $A_i$,
$$F_{A_i} = \{A_f \mid A_f \in F \land A_f \Join A_i\}$$
Then $A_i^{F_{A_i}}$ is defined as:
$$A_i^{F_{A_i}} = \lnot f_{A_i} \land g_i \to M(FS(c_i))$$
where $M(FS(c_i))$ is obtained by applying to the command $c_i$ the transformations
FS and M w.r.t. $f_i$ and $F_{A_i}$. The fault-affected program $\mathcal{F}(P)$ is then defined
as:
$$\mathcal{F}(P) = P^F = A_1^{F_{A_1}} \;[]\; \ldots \;[]\; A_n^{F_{A_n}}$$
It is easy to see that the transformation $\mathcal{F}$ defined in this way still satisfies
Theorem 4.2. And if no fault action interrupts the action $A_i$, then
$$A_i^{F_{A_i}} \equiv A_i$$
In particular, let $P = p_1 \;[]\; \ldots \;[]\; p_n$ be a program with n processes and let $F_{p_i}$
be the fault actions which interrupt the actions of process $p_i$ but not of any other
process. The fault transformation $\mathcal{F}$ w.r.t. $\{F_{p_i} \mid i = 1, \ldots, n\}$ is defined as
$$\mathcal{F}(P) = p_1^{F_{p_1}} \;[]\; \ldots \;[]\; p_n^{F_{p_n}}$$
5. Consistency and Recovery Transformation
Let $P^F$ be the fault-affected version of a program P and $P_0 \sqsubseteq \ldots \sqsubseteq P_k = P$ be
a sequence of refinements. When the execution of $P^F$ reaches an error state it
will stay in that state forever and no further action in P can be executed. Therefore, the behaviour of $P^F$ will not in general satisfy the original specification
$P_0$, i.e. $P^F$ does not refine $P_0$. To make the execution of P recoverable from
an error state, the system has to be restored to a good state from which the
interrupted execution can resume. Such a good state can be described in terms
of consistency with the execution of P.
5.1. Reachability and Consistency
For any state S of a program P, let Reach(S) be the set of states which are
reachable from S by executing P, i.e.
$$Reach(S) = \{S' \mid \exists \sigma, \sigma' \in OB_P : S = last(\sigma) \land S' = last(\sigma') \land \sigma \triangleleft \sigma'\}$$
Let
$$Reachable(S, S') \text{ in } P = S' \in Reach(S)$$
and, as an abbreviation,
$$Reachable(S) \text{ in } P = Reachable(init_P, S) \text{ in } P$$
Let $\sigma$ be a finite observation of P. S is a possible future state of P for $\sigma$, if
there is an observation of P which extends $\sigma$ to S, i.e. S is forward consistent
(ForwCon) with $\sigma$:
$$ForwCon(S, \sigma) = Reachable(last(\sigma), S)$$
Let $\mathcal{X}$ be a set of disjoint subsets of Var(P) and $\Sigma = \{S_X \mid X \in \mathcal{X}\}$, where
$S_X$ is a substate over X. We say that $\Sigma$ is forward-consistent with $\sigma$, if there
exists a state S such that:
$$\forall X \in \mathcal{X} : S|_X = S_X \land ForwCon(S, \sigma)$$
We may consider a `state' as the union of sub-states previously reached at
different points in this execution, provided that this `state' could have been
reached in some execution of P. If the execution can continue from this state,
then it is consistent.
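For a finite-state program, reachability and forward consistency can be checked by a simple search. The sketch below is an illustration only: it assumes hashable states, actions given as (guard, command) pairs of functions, and a prefix relation that is reflexive, so that every state is reachable from itself. The names reach and forw_con, and the counter program, are hypothetical.

```python
from collections import deque


def reach(start, actions):
    """States reachable from `start` by executing guarded actions (breadth-first)."""
    seen = {start}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for guard, command in actions:
            if guard(s):
                t = command(s)
                if t not in seen:
                    seen.add(t)
                    queue.append(t)
    return seen


def forw_con(state, observation, actions):
    """ForwCon(S, sigma): S is reachable from the last state of the observation."""
    return state in reach(observation[-1], actions)


# Hypothetical program: a counter that is incremented while it is below 3.
actions = [(lambda s: s < 3, lambda s: s + 1)]

print(sorted(reach(0, actions)))      # [0, 1, 2, 3]
print(forw_con(3, [0, 1], actions))   # True: 3 is a possible future state
print(forw_con(0, [0, 1], actions))   # False: the counter never decreases
```

The search terminates only when the set of reachable states is finite; for unbounded programs a bound on the search would be needed.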
Let X = fX0; : : : ; Xn 1g be a partition of V ar(P ). Consider a set of substates = fSX j Xi 2 Xg which occur in a sub-observation of during the
execution of program P . More precisely, let = 1 ^ ^ . For each Xi 2 X there
exists ki such that i < j ) ki < kj and satises condition C:
(C)
8Xi 2 X : (ki )jX = SX
is backward consistent with , denoted BackwCon (; ), if there exists a
function
0 : Var (P ) ! V y
such that
i
i
i
BC1
^ BC2 ^ BC3
where for each Xi 2 X ,
0
BC1. 8j : 0 j ki :: (j )jX = (j )jX
0
BC2. 8j > ki : (j )jX = (ki )jX
i
BC3.
1 ^ 0
2 OB P
i
i
i
1 ^ 0 2 OB P is called a backward recovered observation of from .
If BackwCon (; ), for and 0 satisfying these conditions, let 0 = 1 ^ 0 .
Then is said to be a backward consistent prex of 0 and denoted as /bc 0 .
Similar to the case of forward consistency, let be given for a subset X P (V ar(P )), i.e. = fSX j Xi 2 Xg, satises Condition C. is backward-
consistent with , if there exists a state S such that:
i
8X 2 X : S jX = SX ^ BackwCon(S; )
Obviously, BackwCon( ; ) ) F orwCon( ; 1 ), where = 1 ^ ^ .
A state S is consistent with $\sigma$ if it is either forward or backward consistent
with $\sigma$, i.e.
$$Consistent(S, \sigma) = ForwCon(S, \sigma) \lor BackwCon(S, \sigma)$$
When there is no confusion, we will simplify the definitions and notation by
omitting $\sigma$, e.g. Consistent(S).
5.2. Recovery Transformation
To resume the execution of P after interruption by fault actions in F , P has
to be transformed into a program R(P ) by adding a set of recovery actions PR
called a recovery program.
Let ob be an auxiliary variable whose value space is the set of observations
of P F
: V ar(P ) ! V
where [i] is a good state for each i #.
y
A state predicate Q over V ar(P ) is extended to be a state predicate Qob over
V ar(P ) [ fobg such that for a state S over V ar(P ) [ fobg, Qob (S ) = true if
Q(S jV ar(P ) ) ^ 8x 2 V ar(P ) : last(ob)(x) = S (x)
For and in the value space of ob, let Ext(; ) = true if
9S 2 V ar(P ) : 8x 2 V ar(P ) : (x) = (x) ^ < S (x) >
The predicate ForwExt(; ) = true if ForwCon(last(); ) ^ /.
The predicate BackwExt(; ) = true if BackwCon(last(); ) ^ /bc.
We dene
ConsExt(; ) = ForwExt(; ) _ BackwExt(; )
Now let the specifications of P and F over Var(P) be transformed into
specifications over $Var(P) \cup \{ob\}$ so that:
1. each $\{Q \land \lnot f\}\, A \,\{R\}$ in P is transformed into
$$\{Q_{ob} \land \lnot f \land ob = \sigma\}\, A \,\{R_{ob} \land Ext(ob, \sigma)\}$$
2. each $\{Q\}\, A_f \,\{R\}$ in F is transformed into
$$\{Q \land ob = \sigma\}\, A_f \,\{R \land ob = \sigma\}$$
3. each $\{f\}\, A \,\{f\}$ in P or F is transformed into
$$\{f \land ob = \sigma\}\, A \,\{f \land ob = \sigma\}$$
After the execution of $\mathcal{F}(P)$ reaches an error state, the recovery program $P_R$
is invoked and restores the variables to a consistent state. $P_R$ must satisfy the
following conditions:
R1. any action of FS(P) excludes each action $A_r \in P_R$: $\forall A \in P : \lnot f \land g_A \Rightarrow \lnot g_{A_r}$
R2. each action $A_r \in P_R$ excludes any action of FS(P): $\forall A \in P : g_{A_r} \Rightarrow f \lor \lnot g_A$
R3. execution of the recovery program $P_R$ cannot be interrupted by the fault
program F.
R4. $P_R$ transforms an error state into a good consistent state:
$$(f \land ob = \sigma) \mapsto ((\lnot f)_{ob} \land ConsExt(ob, \sigma))$$
The recovery program $P_R$ can thus be given as:
Program PR :
  ⟨ f → X := X' . ConsExt(ob, ob0) ; f := false ⟩
End {PR}
These conditions for $P_R$ allow the recovery transformation $\mathcal{R}$ to take the form
$$\mathcal{R}(P) = P \;[]\; P_R$$
Condition R2 implies that a recovery action $A_r$ does not change a good state
and thus
$$\mathcal{R}(P) \equiv P$$
From Condition R3 and Theorem 4.2 we have
$$\mathcal{F}(\mathcal{R}(P)) \equiv \mathcal{F}(P) \;[]\; P_R \equiv FS(P) \;[]\; P_R \;[]\; P_F$$
$\mathcal{F}(\mathcal{R}(P))$ should ideally satisfy the specification $P_0$. Unfortunately, it is not usually possible to have such a strong transformation for an arbitrary program P
and arbitrary faults F. However, we can often have a recovery transformation
such that $\mathcal{F}(\mathcal{R}(P))$ is weaker than $P_0$ but acceptable in terms of its behaviour
and
$$F = \emptyset \Rightarrow \mathcal{F}(\mathcal{R}(P)) \equiv P$$
Two kinds of error recovery are used in practice [AL81]. With backward error
recovery, the system recovers from a fault by starting from a state which is
consistent with its previous states. Forward recovery is used when the effects of
an error cannot be overcome by backward recovery, or when (in a real-time
program) the time constraints do not permit backward recovery. As in the case
of backward recovery, for forward recovery the
variables Var(P) have to be assigned values so that the state S is good (i.e. f
is false) and forward consistent with the current observation ob.
The recovery transformation $\mathcal{R}$ can apply to both backward and forward
recovery and this shows that in principle backward and forward recovery methods
can both be used in one fault-tolerant system. Backward and forward recovery
programs are special cases of $P_R$ and are specified as $P_{BR}$ and $P_{FR}$ respectively:
Program PBR :
  ⟨ f → X := X' . BackwExt(ob, ob0) ; f := false ⟩
End {PBR}
Program PFR :
  ⟨ f → X := X' . ForwExt(ob, ob0) ; f := false ⟩
End {PFR}
where ob0 is the value of ob before the execution of $P_R$.
6. Using Refinement for Fault-tolerance in Programs
The recovery transformation $\mathcal{R}$ (or the recovery program $P_R$) describes what
should be done for recovery but imposes no restrictions on when $P_R$ is executed,
where it is executed (e.g. on which processor) or how it is to be executed (e.g.
how to find a consistent state). These restrictions can be introduced by using
transformations on $P \;[]\; P_R$ which will be described in terms of F-refinement.
Given a fault environment F, program P' is said to F-refine program P
(denoted as $P \sqsubseteq_F P'$) if
$$(P \sqsubseteq P') \land (P' \text{ Sat } (f \mapsto \lnot f))$$
and for each execution E of $\mathcal{F}(P')$ in which there are finitely many error states
of which E[k] is the last,
$$E[k+1](ob)\,{}^\frown\, last(E[k+2](ob))\,{}^\frown \ldots {}^\frown\, last(E[k+n](ob)) \ldots$$
is an execution of P.
As for the equivalence of fault-prone programs, it is not necessarily the case
that $P \sqsubseteq_F P'$ if P' refines P. However, the following results are easily proved:
Theorem 6.1. Given programs P, P', P'' and a fault environment F,
1. $P \sqsubseteq_F \mathcal{R}(P)$
2. $P \sqsubseteq_F P' \sqsubseteq_F P'' \Rightarrow P \sqsubseteq_F P''$
3. $P \sqsubseteq P' \Rightarrow \mathcal{R}(P) \sqsubseteq_F \mathcal{R}(P')$
4. $P \sqsubseteq P' \Rightarrow P \sqsubseteq_F \mathcal{R}(P')$
Proof. Directly from the definition of F-refinement and the specification of the
transformation $\mathcal{R}$.
The first two results show that fault-tolerance is introduced and preserved
by F-refinement transformations. From the next two results, it can be seen that
refinement transformations can be applied to the original program and fault-tolerance introduced after that, using F-refinement.
Corollary 6.1. Given a program P and a fault environment F,
1. $P_R \sqsubseteq P_R' \Rightarrow \mathcal{R}(P) \sqsubseteq_F P \;[]\; P_R'$
2. $P \sqsubseteq_F \mathcal{R}(P) \sqsubseteq_F P \;[]\; P_{BR}$
3. $P \sqsubseteq_F \mathcal{R}(P) \sqsubseteq_F P \;[]\; P_{FR}$
As an example of the first result in Corollary 6.1, let S0 be the initial state of
program P and
Program PR' :
  ⟨ f → X := S0 ; f := false ⟩
End {PR'}
then
$$(P_R \sqsubseteq P_R') \land (P \sqsubseteq_F P \;[]\; P_R \sqsubseteq_F P \;[]\; P_R')$$
Program $P \;[]\; P_R'$ is a fault-tolerant program such that whenever a fault occurs,
the execution of P will recover by re-starting from its initial state [JMS87, HH87].
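Recovery by restart can be tried out on a toy program. The sketch below is a minimal illustration under stated assumptions: the program is a hypothetical counter that runs from 0 to 3, the fault corrupts the counter and raises f, and the interleaving is fixed by an explicit schedule rather than chosen nondeterministically. None of these names come from the paper.

```python
S0 = {"x": 0, "f": False}  # assumed initial state of the toy program


def step_program(s):
    """FS(P): not f and x < 3 -> x := x + 1."""
    if not s["f"] and s["x"] < 3:
        s["x"] += 1
        return True
    return False


def fault(s):
    """Hypothetical fault action: corrupt x and raise f."""
    if not s["f"]:
        s["x"] = -1
        s["f"] = True
        return True
    return False


def recover(s):
    """Recovery action <f -> X := S0; f := false>, executed atomically."""
    if s["f"]:
        s.update(S0)
        return True
    return False


def run(schedule):
    """Apply the scheduled actions in order and record every state change."""
    s = dict(S0)
    trace = [dict(s)]
    for action in schedule:
        if action(s):
            trace.append(dict(s))
    return trace


trace = run([step_program, step_program, fault, step_program, recover,
             step_program, step_program, step_program])
print(trace[-1])  # {'x': 3, 'f': False}
```

The program action scheduled after the fault is disabled because f holds, so the only way forward is the recovery action, after which the computation restarts from S0 and completes.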
The following theorem introduces a useful F-refinement rule.
Theorem 6.2. Given a program $P = A_1 \;[]\; \ldots \;[]\; A_n$ and a fault environment F,
let I be a non-empty subset of $\{1, \ldots, n\}$ and
$$\mathcal{R}_I(P) = A_1' \;[]\; \ldots \;[]\; A_n'$$
where for each $i \in \{1, \ldots, n\}$,
$$A_i' = \begin{cases} \mathbf{if}\ A_i \;[]\; \lnot g_i \to skip\ \mathbf{fi};\ \mathbf{if}\ \lnot f \to skip \;[]\; P_R\ \mathbf{fi} & \text{if } i \in I \\ A_i & \text{if } i \notin I \end{cases}$$
then
1. $\mathcal{R}(P) \sqsubseteq_F \mathcal{R}_I(P)$
2. $I \subseteq J \Rightarrow \mathcal{R}_I(P) \sqsubseteq_F \mathcal{R}_J(P)$
The next theorem provides a rule to allow a recovery action to be freely introduced at any point in the program.
Theorem 6.3. For the program $\mathcal{R}_I(P) = A_1' \;[]\; \ldots \;[]\; A_n'$ given in Theorem 6.2, if
an action $A_i'$ of $\mathcal{R}_I(P)$ can be written as $a_{i1}; a_{i2}$, let
$$A_i'' = a_{i1};\ \mathbf{if}\ \lnot f \to a_{i2} \;[]\; f \to P_R\ \mathbf{fi}$$
Then
$$\mathcal{R}_I(P) \sqsubseteq_{T(F)} \mathcal{R}_I(P)[A_i''/A_i']$$
It is noted that for any command c
$$c \equiv skip; c \equiv c; skip$$
And we assume here that the fault environment of skip is always empty:
$$\mathcal{F}(c) \equiv \mathcal{F}(skip; c) \equiv \mathcal{F}(c; skip)$$
We can therefore introduce the recovery action $P_R$ (and its refined versions), by
replacing such a skip with $P_R$, at any suitable place.
A program P can be refined using rules of the form defined by Back [BS88],
and then Theorem 6.1 can be used to add F-refinement. Checkpointing actions
can then be added to P (or the refined version of P) by introducing new variables
and assignments [BS88, Mor90]. Then, following Theorem 6.2 and Theorem 6.3,
recovery actions can be introduced at appropriate points. The choice of recovery
point may follow well-known practice, e.g. by using recovery blocks or conversations. It can be shown that the checkpointing and recovery protocol suggested
by Koo and Toueg [KT87] can also be achieved by using F-refinement.
It may be noticed that Theorem 6.2 and Theorem 6.3 also hold for any
program $P_R \sqsubseteq_F P_R'$.
7. Example: A Protocol for Communication Over Faulty
Channels
In this section, we consider an example of fault-tolerant programming using the
methods presented in this paper. The problem is to design a protocol that
guarantees reliable communication from a sender to a receiver in spite of faults
in the communication channel between them. In the following programs, we shall
omit the declaration and initial values of the variables if they are clear from the
context.
The Sender process produces an infinite sequence $ms_0$ of data. The Receiver
process reads in a sequence mr satisfying the following specification:
  invariant $mr \triangleleft ms_0$    (i.e. mr is a prefix of $ms_0$)
  $\#mr = n \mapsto \#mr = n + 1$    (i.e. the length of mr increases eventually)
If the sender and the receiver communicate over an unbounded reliable FIFO
channel c, the communication between them can be implemented using the program Sender-Receiver.
Program Sender-Receiver
initially
  ms, mr, c = ms0, <>, <>
actions
  c, ms := c ^ head(ms), rest(ms)                    (Sender)
[]
  c ≠ <> → c, mr := rest(c), mr ^ head(c)            (Receiver)
End {Sender-Receiver}
The FIFO channel c can be implemented as the following program:
Program C
initially
  cs, cr = <>, <>
actions
  cs ≠ <> → cs, cr := rest(cs), cr ^ head(cs)
End {C}
Let program P be
Program P
initially
  ms, mr, cs, cr = ms0, <>, <>, <>
actions
  cs, ms := cs ^ head(ms), rest(ms)
[]
  cr ≠ <> → mr, cr := mr ^ head(cr), rest(cr)
End {P}
Then,
$$\text{Sender-Receiver} \sqsubseteq C \;[]\; P$$
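The fault-free composition of C and P can be simulated directly. The following sketch is an illustration under assumptions that are not part of the paper: the infinite sequence ms0 is replaced by a finite list, so the sender is given the guard "ms is non-empty", and the nondeterministic choice between enabled actions is made by a seeded random generator. The function name run_cp is hypothetical.

```python
import random


def run_cp(ms0, steps=1000, seed=0):
    """Random interleaving of the actions of C [] P on a finite message list."""
    rng = random.Random(seed)
    ms, mr, cs, cr = list(ms0), [], [], []
    for _ in range(steps):
        enabled = []
        if ms:
            enabled.append("send")
        if cs:
            enabled.append("transfer")
        if cr:
            enabled.append("receive")
        if not enabled:
            break  # every message has been delivered
        action = rng.choice(enabled)
        if action == "send":        # cs, ms := cs ^ head(ms), rest(ms)
            cs.append(ms.pop(0))
        elif action == "transfer":  # cs, cr := rest(cs), cr ^ head(cs)
            cr.append(cs.pop(0))
        else:                       # mr, cr := mr ^ head(cr), rest(cr)
            mr.append(cr.pop(0))
        # Safety part of the specification: mr is a prefix of ms0.
        assert mr == list(ms0)[:len(mr)]
    return mr


print(run_cp(list(range(10))))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Each message needs exactly three actions to travel from ms to mr, so ten messages are delivered after thirty steps regardless of the interleaving chosen.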
Now assume that the channel C is faulty and has the following behaviour:
1. any message sent along the channel may be lost; however, only a finite number
of messages can be lost consecutively;
2. any message sent along the channel may be replicated, but no message can
be replicated forever;
3. messages are not permuted, i.e. messages are delivered in the order in which
they are sent; and
4. messages are not corrupted, i.e. their contents are not altered.
A program F simulating such a fault environment can be given as:
Program F
declare
  b, f : boolean
initially
  cs, cr, b, f = <>, <>, false, false
actions
  ¬b ∧ cs ≠ <> → cs, f := rest(cs), true            (loss)
[]
  ¬b ∧ cs ≠ <> → cr, f := cr ^ head(cs), true       (duplication)
[]
  b := true        (guarantees the finiteness of consecutive faults)
End {F}
Consider the refinement $C \sqsubseteq C_1$, where
Program C1
initially
  cs, cr, b = <>, <>, false
actions
  cs ≠ <> → cs, cr, b := rest(cs), cr ^ head(cs), false
End {C1}
The behaviour of a faulty channel can be simulated by a program $FC = \mathcal{F}(C_1)$
with the fail-stop assumption:
Program F(C1)
declare
  b, f : boolean
initially
  b, f = false, false
actions
  ¬b ∧ cs ≠ <> → loss
[]
  ¬b ∧ cs ≠ <> → duplication
[]
  ¬f ∧ cs ≠ <> → correct-transfer
[]
  b := true        (guarantees the finiteness of consecutive faults)
End {F(C1)}
where
1. loss = cs, f := rest(cs), true
2. duplication = cr, f := cr ^ head(cs), true
3. correct-transfer = b, cs, cr := false, rest(cs), cr ^ head(cs)
And $\mathcal{F}(\text{Sender-Receiver})$ (or $\mathcal{F}(C \;[]\; P)$) is refined to the program $\mathcal{F}(C_1 \;[]\; P)$, given
below.
Program F(C1 [] P)
declare
  b, f : boolean
initially
  ms, mr, cs, cr, b, f = ms0, <>, <>, <>, false, false
actions
  ¬f → cs, ms := cs ^ head(ms), rest(ms)
[]
  ¬f ∧ cr ≠ <> → mr, cr := mr ^ head(cr), rest(cr)
[]
  ¬b ∧ cs ≠ <> → loss
[]
  ¬b ∧ cs ≠ <> → duplication
[]
  ¬f ∧ cs ≠ <> → correct-transfer
[]
  b := true        (guarantees the finiteness of consecutive faults)
End {F(C1 [] P)}
To design a recovery program for Sender-Receiver, let cs and cr be sequence
variables whose elements are pairs (integer, data item). Let $C_1 \;[]\; P$ be refined to
$P_1$:
Program P1
declare
  ks, kr : integer
initially
  cs, cr, ks, kr, b = <>, <>, 1, 0, false
actions
  cs, ks := cs ^ (ks, ms[ks]), ks + 1
[]
  cr ≠ <> → mr, cr, kr := mr ^ head(cr).val, rest(cr), kr + 1
[]
  cs ≠ <> → cs, cr, b := rest(cs), cr ^ head(cs), false
End {P1}
A message is lost if $head(cr).dex > kr + 1$ and a message is duplicated if
$head(cr).dex \le kr$. Therefore, the fault indicating variable f can be implemented
as
$$f \Leftrightarrow (lost \lor duplic)$$
where
$$lost = head(cr).dex > kr + 1$$
and
$$duplic = head(cr).dex \le kr$$
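The two predicates can be written down directly. In the sketch below, cr is assumed to be a Python list of (dex, val) pairs and kr the number of messages received so far; treating an empty cr as fault-free is an additional assumption, since head(cr) is undefined in that case. The function names are hypothetical.

```python
def lost(cr, kr):
    """lost = head(cr).dex > kr + 1 (assumed false when cr is empty)."""
    return bool(cr) and cr[0][0] > kr + 1


def duplic(cr, kr):
    """duplic = head(cr).dex <= kr (assumed false when cr is empty)."""
    return bool(cr) and cr[0][0] <= kr


def fault_flag(cr, kr):
    """The fault-indicating variable f, implemented as lost or duplic."""
    return lost(cr, kr) or duplic(cr, kr)


kr = 2  # messages 1 and 2 have already been received
print(fault_flag([(3, "c")], kr))  # False: message 3 is the expected one
print(lost([(4, "d")], kr))        # True: message 3 never arrived
print(duplic([(2, "b")], kr))      # True: message 2 was already received
```

Only the message with index kr + 1 leaves f false, which is why it suffices to make the receiving action f-stop.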
It is sufficient to assume that only the receiving command is f-stop. Such a
fail-stop version of $P_1$ is then given as $FS(P_1)$:
Program FS(P1)
initially
  cs, cr, ks, kr, b = <>, <>, 1, 0, false
actions
  cs, ks := cs ^ (ks, ms[ks]), ks + 1
[]
  ¬f ∧ cr ≠ <> → mr, cr, kr := mr ^ head(cr).val, rest(cr), kr + 1
[]
  cs ≠ <> → b, cs, cr := false, rest(cs), cr ^ head(cs)
End {FS(P1)}
The fault-affected version $\mathcal{F}(C_1 \;[]\; P)$ can be refined to $\mathcal{F}(P_1)$ which is $M(FS(P_1))$:
Program F(P1)
initially
  cs, cr, ks, kr, b = <>, <>, 1, 0, false
actions
  cs, ks := cs ^ (ks, ms[ks]), ks + 1
[]
  ¬f ∧ cr ≠ <> → mr, cr, kr := mr ^ head(cr).val, rest(cr), kr + 1
[]
  ¬b ∧ cs ≠ <> → loss
[]
  ¬b ∧ cs ≠ <> → duplication
[]
  cs ≠ <> → correct-transfer
[]
  b := true        (guarantees the finiteness of consecutive faults)
End {F(P1)}
The recovery program for $P_1$ in $\mathcal{R}(P_1)$ is refined to $P_{1R}$ given below.
Program P1R
initially
  ks, kr = 1, 0
actions
  lost → do cr ≠ <> → cr := rest(cr) od ;
         do cs ≠ <> → cs := rest(cs) od ;
         ks := kr + 1; cs, ks := cs ^ (ks, ms[ks]), ks + 1;
         cr, cs := cr ^ head(cs), rest(cs)
[]
  duplic → do duplic ∧ (cr ≠ <>) → cr := rest(cr) od
End {P1R}
In the following, we provide an informal outline of the proof that $\mathcal{F}(P_1) \;[]\; P_{1R}$
satisfies the specification of the program Sender-Receiver. (A full formal proof
is given in [Liu91].)
Proof.
To prove the invariant $mr \triangleleft ms$:
a) $mr \triangleleft ms$ is true initially,
b) each action in $\mathcal{F}(P_1)$ and $P_{1R}$ leaves $mr \triangleleft ms$ stable.
Hence $mr \triangleleft ms$ is invariant.
To prove the progress property $\#mr = n \mapsto \#mr = n + 1$:
a) it is easily seen that $\#mr = kr$ is invariant,
b) from both $\mathcal{F}(P_1)$ and $P_{1R}$, each message in ms will eventually be transferred to cs,
c) each message in cs will eventually be transferred to cr or lost,
d) if a message in cs is lost, it will be re-sent by the recovery action,
e) $\mathcal{F}(P_1)$ also guarantees that the correct transfer action will eventually be
executed,
f) by the second recovery action in $P_{1R}$ and the receiving action in $\mathcal{F}(P_1)$, each
message transferred into cr by the execution of the correct transfer action in
$\mathcal{F}(P_1)$ will eventually be transferred to mr.
Therefore progress is guaranteed.
$P_1 \;[]\; P_{1R}$ is thus a version of Sender-Receiver which can tolerate message loss
and duplication; it can of course be refined further for a different implementation
[CM88].
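The safety half of this argument can be exercised by simulation. The sketch below interleaves the actions of F(P1) and P1R at random and checks after every step that mr is a prefix of the message list. It rests on several assumptions that are not in the paper: the message sequence is a finite Python list, so the sender is guarded by ks ≤ #ms0; f is computed from lost and duplic rather than assigned by the fault actions; each recovery action runs atomically; and the scheduler is a seeded random generator. The name simulate is hypothetical.

```python
import random


def simulate(ms0, steps=2000, seed=1):
    """Random interleaving of F(P1) [] P1R on a finite message list ms0."""
    rng = random.Random(seed)
    cs, cr, mr = [], [], []
    ks, kr, b = 1, 0, False

    def lost():
        return bool(cr) and cr[0][0] > kr + 1

    def duplic():
        return bool(cr) and cr[0][0] <= kr

    for _ in range(steps):
        f = lost() or duplic()
        enabled = ["set_b"]                     # b := true is always enabled
        if ks <= len(ms0):
            enabled.append("send")
        if not f and cr:
            enabled.append("receive")           # the only f-stop action
        if not b and cs:
            enabled += ["loss", "duplication"]
        if cs:
            enabled.append("transfer")
        if f:
            enabled.append("recover")
        action = rng.choice(enabled)
        if action == "send":                    # cs, ks := cs ^ (ks, ms[ks]), ks + 1
            cs.append((ks, ms0[ks - 1]))
            ks += 1
        elif action == "receive":               # mr, cr, kr := mr ^ head(cr).val, rest(cr), kr + 1
            mr.append(cr.pop(0)[1])
            kr += 1
        elif action == "loss":                  # cs := rest(cs)
            cs.pop(0)
        elif action == "duplication":           # cr := cr ^ head(cs)
            cr.append(cs[0])
        elif action == "transfer":              # correct-transfer, resets b
            cr.append(cs.pop(0))
            b = False
        elif action == "set_b":
            b = True
        elif lost():                            # first recovery action of P1R
            cs.clear()
            cr.clear()
            ks = kr + 1
            cr.append((ks, ms0[ks - 1]))        # re-send message kr + 1 straight into cr
            ks += 1
        else:                                   # second recovery action of P1R
            while duplic():
                cr.pop(0)
        assert mr == ms0[:len(mr)]              # invariant: mr is a prefix of ms0
    return mr


for seed in range(5):
    received = simulate(list("abcdefgh"), seed=seed)
    print(seed, "".join(received))
```

With a finite message list the loss of the final messages cannot be detected, because detection relies on a later message arriving; the sketch therefore checks only the invariant, not the progress property.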
8. Discussion
Assume that $P_0 = Sp$ is the top level specification of a program in a UNITY-like
notation. The top level specification of the fault environment $F_0$ can be given as
$$\lnot f \mapsto f$$
$P_0$ can then be refined into an action system $P_1$ while $F_0$ can be simulated (or
refined) by an action system $F_1$ consisting of one action
$$true \to f := true$$
Based on $P_1$ and $F_1$, the fault and recovery transformations can then be applied
to program $P_1$ in the way described in Section 5.2. For each refinement step
$P_k \sqsubseteq P_{k+1}$ of the original program $P_0$, a refinement step $F_k \sqsubseteq F_{k+1}$ of the
fault environment is derived by providing more details about the system and its
possible faults. The fault and recovery transformations can be applied again to
$P_{k+1}$ and $F_{k+1}$. During the refinement of the original program, checkpoints and
recovery methods such as recovery blocks and conversations can be introduced.
Thus, in principle, a fault-tolerant program can be produced from its fault-free
specification by using F-refinement rules. We do not underestimate the difficulty
of doing this in practice, but the rules provided here do offer a formal means of
verifying what is otherwise left to less precise methods of reasoning.
Each step in the refinement $P_0 \sqsubseteq \ldots \sqsubseteq P_k \sqsubseteq \ldots$ provides more information
about the system on which the program is to be executed, and similar information
about the faults of the system is used for refining the fault specification. For
certain programs and faults, e.g. Sender-Receiver and the faulty channels, the
boolean variable f which indicates the presence of faults in $F_0$ can be derived
from the state predicates of program $P_k$ at a suitable stage in the refinement.
But it is an open question whether this is always possible for any program and
any set of possible faults.
With the fail-stop assumption, special states for abortion and nontermination are not necessary if it is assumed that each action of the original program
always terminates. However, without the fail-stop assumption [Liu91], abortion
and nontermination have to be considered, because faults may cause abortion or
nontermination within an action even if the action always terminates in a fault-free
environment.
The logic used for reasoning about programs is based on a built-in fairness
condition, known as strong fairness [MP83, GP89]. The logic itself is too weak to
express this or any other fairness condition [Fra86]. But as we have seen in the
previous sections (and elsewhere [Liu91]), the logic is sufficient for describing
at a relatively high level the basic principles of fault-tolerant refinement and
transformation. It would be interesting to examine the use of a more powerful
logic, such as Lamport's temporal logic of actions (TLA) [Lam90, Lam91], for
further work in this direction. There are extensions of Back's refinement calculus
for dealing with refinement of parallel programs [Bac89], and in [Liu91] we show
how this can be used for fault-tolerant refinement of parallel programs.
Fault-tolerant systems often also have real-time constraints. So it is important
that the timing properties of a program are refined along with the fault-tolerant
and functional properties defined in a program specification. If we extend the
model used in this paper by adding timing properties in some time domain
[JG88], the recovery transformation can be defined with timing constraints. The
specification and refinement of the recovery actions can then be required to satisfy the condition that after a fault occurs, the system is restored to a consistent
state within a time bound which includes the delay caused by the execution of
the recovery action. Thus the method described in this paper can be extended to
take account of timing constraints. However, there are numerous problems still
to be examined in making such a method practical, and these are the goals of
further work.
The idea that a hardware fault be modelled as another kind of action (operation) that performs state transformations was described in [Cri85] through
an example. There, processor crashes and faults that affect the physical storage
medium are taken to be special operations, referred to as fault operations, and described by axioms similar to those used to describe the semantics of ordinary
operations. As we have seen in this paper, developing this idea into a general
method for systematically constructing fault-tolerant programs is far from trivial.
We have attempted this by defining transformations of programs, because this
enables us to formally discuss the fault-tolerant properties of a program at different levels of abstraction.
Acknowledgements
We would like to thank our colleague, Asis Goswami, for helpful discussions on
the definition of consistency, and Jozef Hooman for his careful reading of an
earlier version of this paper and for many useful suggestions. A discussion with
R.J.R. Back was helpful in modelling faults and using the refinement calculus
for fault-tolerant programming. Finally, we thank a referee for some spirited and
detailed comments which helped to bring the paper to its present form.
References
[AL81] T. Anderson and P.A. Lee. Fault-tolerance: Principles and Practice. Prentice-Hall International, 1981.
[AL88] M. Abadi and L. Lamport. The existence of refinement mappings. In Proc. 3rd IEEE Symposium on Logic in Computer Science, 1988.
[Bac87] R.J.R. Back. A calculus of refinement for program derivations. Technical Report 54, Åbo Akademi, 1987.
[Bac88] R.J.R. Back. Refining atomicity in parallel algorithms. Technical Report 57, Åbo Akademi, 1988.
[Bac89] R.J.R. Back. Refinement calculus, Part II: Parallel and reactive programs. In Lecture Notes in Computer Science 340, pages 67–93. Springer-Verlag, 1989.
[BK83] R.J.R. Back and R. Kurki-Suonio. Decentralization of process nets with centralized control. In Second Annual ACM Symposium on Principles of Distributed Computing, pages 131–142, 1983.
[BR81] E. Best and B. Randell. A formal model of atomicity in asynchronous systems. Acta Informatica, 16:93–124, 1981.
[BS88] R.J.R. Back and K. Sere. Stepwise refinement of parallel algorithms. Technical Report 64, Åbo Akademi, 1988.
[BS89] R.J.R. Back and K. Sere. Stepwise refinement of action systems. Technical Report 78, Åbo Akademi, 1989.
[BW89] R.J.R. Back and J. von Wright. Refinement calculus, Part I: Sequential nondeterministic programs. In Lecture Notes in Computer Science 340, pages 42–66. Springer-Verlag, 1989.
[CM88] K.M. Chandy and J. Misra. Parallel Program Design: A Foundation. Addison-Wesley Publishing Company, 1988.
[Cri85] F. Cristian. A rigorous approach to fault tolerant programming. IEEE Transactions on Software Engineering, SE-11(1):23–31, January 1985.
[Dij76] E.W. Dijkstra. A Discipline of Programming. Englewood Cliffs, New Jersey: Prentice-Hall, 1976.
[Fra86] N. Francez. Fairness. Springer-Verlag, New York, 1986.
[GP89] R. Gerth and A. Pnueli. Rooting UNITY. In Proceedings of the 5th IEEE International Workshop on Software Specification and Design, February 1989.
[HH87] J. He and C.A.R. Hoare. Algebraic specification and proof of a distributed recovery algorithm. Distributed Computing, 2:1–12, 1987.
[Hoa69] C.A.R. Hoare. An axiomatic basis for computer programming. Communications of the ACM, 12(10):576–583, October 1969.
[HW88] M.P. Herlihy and J.M. Wing. Reasoning about atomic objects. In Lecture Notes in Computer Science 331, pages 193–208. Springer-Verlag, 1988.
[JG88] M. Joseph and A. Goswami. What's `real' about real-time systems? In Proceedings of IEEE Real-time Systems Symposium, pages 78–85, Huntsville, Alabama, December 1988.
[JMS87] M. Joseph, A. Moitra, and N. Soundararajan. Proof rules for fault tolerant distributed programs. Science of Computer Programming, 8(1):43–67, February 1987.
[KT87] R. Koo and S. Toueg. Checkpointing and rollback-recovery for distributed systems. IEEE Transactions on Software Engineering, SE-13(1):23–31, January 1987.
[Lam83] L. Lamport. Reasoning about nonatomic operations. In Proc. 10th ACM Conference on Principles of Programming Languages, pages 28–37, 1983.
[Lam87] L. Lamport. win and sin: Predicate transformers for concurrency. Technical Report 17, Systems Research Center of Digital Equipment Corporation in Palo Alto, California, May 1987.
[Lam90] L. Lamport. A temporal logic of actions. Technical Report 57, Digital SRC, California, 1990.
[Lam91] L. Lamport. The temporal logic of actions. Technical Report 79, Digital SRC, California, 1991.
[Liu91] Z. Liu. Fault-Tolerant Programming By Transformations. PhD thesis, Department of Computer Science, University of Warwick, 1991.
[Mor90] C. Morgan. Programming from Specifications. Prentice Hall, 1990.
[MP83] Z. Manna and A. Pnueli. How to cook a temporal proof system for your pet language. In Proceedings of 10th Annual ACM Symposium on Principles of Programming Languages, Austin, Texas, 1983.
[MR78] P.M. Merlin and B. Randell. State restoration in distributed systems. In Proceedings of 8th Ann. Int. Symp. on Fault-Tolerant Comput., pages 129–134, Toulouse, France, 1978.
[Ran75] B. Randell. System structure for software fault tolerance. IEEE Transactions on Software Engineering, SE-1(2):220–232, June 1975.
[SS83] R.D. Schlichting and F.B. Schneider. Fail-stop processes: An approach to designing fault-tolerant computing systems. ACM Transactions on Computer Systems, 1(3):222–238, 1983.