Formal Aspects of Computing (1996) 3: 1–000
© 1996 BCS

Transformation of Programs for Fault-tolerance

Zhiming Liu and Mathai Joseph
Department of Computer Science, University of Warwick, Coventry CV4 7AL, UK

Keywords: Failure semantics; Consistency; Fault-tolerant transformation; Fault-tolerant refinement

Abstract. In this paper we shall describe how a program constructed for a fault-free system can be transformed into a fault-tolerant program for execution on a system which is susceptible to failures. A program is described by a set of atomic actions which perform transformations from states to states. We assume that a fault environment is represented by a program $F$. The interference by the fault environment $F$ on the execution of a program $P$ can then be described as a fault-transformation $\mathcal{F}$ which transforms $P$ into a program $\mathcal{F}(P)$. This is proved to be equivalent to the program $P \mathbin{[]} P_F$, where $P_F$ is derived from $P$ and $F$, and $[]$ is the union of the sets of actions of $P$ and $P_F$. A recovery transformation $\mathcal{R}$ transforms $P$ into a program $\mathcal{R}(P) = P \mathbin{[]} R$ by adding a set of recovery actions $R$, called a recovery program. If the system is fail-stop and faults do not affect recovery actions, we have

$$\mathcal{F}(\mathcal{R}(P)) = \mathcal{F}(P) \mathbin{[]} R = P \mathbin{[]} P_F \mathbin{[]} R$$

We illustrate this approach to fault-tolerant programming by considering the problem of designing a protocol that guarantees reliable communication from a sender to a receiver in spite of faults in the communication channel between them.

Correspondence and offprint requests to: Zhiming Liu and Mathai Joseph, Department of Computer Science, University of Warwick, Coventry CV4 7AL, UK.

This work was supported by research grant GR/D 11521 from the Science and Engineering Research Council.

1. Introduction

There are several different ways in which a program can be developed using formal rules which guarantee that it will satisfy a specification when executed on a fault-free system [Lam83, AL88, Bac87, Bac88, BW89, Bac89].
In this paper, we consider how similar methods can be used to construct fault-tolerant programs for execution on systems which may exhibit any of a set of specified failure properties.

Fault-tolerant programs are required for applications where it is essential that failures do not cause a program to have unpredictable execution behaviours. We assume that the failures do not arise from errors in the program, since methods such as those mentioned above can be used to construct error-free programs. So, the only failures we shall consider are those caused by hardware and system malfunctions. Many such failures can be masked from the program using automatic error detection and correction methods, but there is a limit to the extent to which this can be achieved at reasonable cost in terms of the resources and the time needed for correction. When the nature or frequency of the errors makes automatic detection and correction infeasible, it may still be possible to perform error detection. It is desirable that fault-tolerant programs are able to perform predictably under these conditions: for example, when using memory with single bit error correction and double bit error detection which operates even when the error correction is not effective. In fact, the provision of good program level fault-tolerance can make it possible to reduce the amount of expensive system error correction needed, as program level error recovery can often be focussed more precisely on the damage caused by an error than a general-purpose error correction mechanism.

The task is then to develop programs which perform predictably in the presence of detected system failures, and this requires the representation of such failures in the execution of a program. Earlier attempts to use formal proof methods for verifying the properties of fault-tolerant programs (e.g. [JMS87], [HW88]) were based on an informal description of the effects of faults, and this limits their applicability.
Here we shall instead model a fault as an action which performs state transformations in the same way as other program actions, making it possible to extend a semantic model to include fault actions and to use a single consistent method of reasoning for both program and fault actions.

Let $P$ be a program satisfying the specification $Sp$. Let the effect of each physical fault in the system on which $P$ is executed be described as a fault action which transforms a good program state into an error state which violates $Sp$. Physical faults are then modelled as the actions of a fault program $F$ which interferes with the execution of $P$. A failure at any point during the execution of $P$ takes it into an error state in which a boolean variable $f$ is true. ($F$ is assumed not to change an error state into a good state.)

In general a high level specification of a program is not sufficient to specify its behaviour in the presence of system faults or to transform it into a fault-tolerant program. It is also necessary to describe the hardware organisation of the system on which the program is to be executed, its use of the resources of the system and the nature of the possible faults in the system, e.g. which processors and channels may fail, as all of these factors can affect the execution of the program. And very little can be said about the effects of a system fault on a program until it has been refined to the level where these effects can be observed. There is a need to represent faults and their effects at various levels of abstraction and here we shall use specifications to develop both the program and the fault environment in which it executes.

After this introduction, in Section 2 a simple model for representing program actions, fault actions and recovery actions is defined. A specification language is presented in Section 3, with its syntax and semantics.
Based on the semantics of failures w.r.t. a given failure-prone environment, the effect of faults on the original program is defined in terms of a program transformation in Section 4. Section 5 provides an abstract definition of consistency which is used to define recovery transformations which permit both backward and forward recovery. Section 6 shows that the fault-tolerant properties of a program can be refined along with the other properties defined in a program specification. Finally, in Section 7 an example is given to illustrate fault-tolerant programming using the method described in this paper; the problem is to design a protocol that guarantees reliable communication from a sender to a receiver in spite of faults in the communication channel between them.

2. A Simple Model

This section presents a simple computational model for describing programs, faults and recovery.

2.1. Programs

As usual, we describe a program in terms of its variables, states and actions [Lam87]. Program variables are associated with a set of values, called the value space of the program. A state of the program is a mapping from the set of program variables to the value space. A predicate $Q$ is a function from the set of the program states to the boolean values $\{true, false\}$. Thus, given a state $S$, $S$ satisfies $Q$ (or $S$ is of property $Q$) if $Q(S)$ is $true$.

An action $A$ is a relation on the set of states. In other words, an action $A$ is a set of pairs of states. For a pair $(S, T)$ of states, $(S, T) \in A$ means that executing $A$ starting in state $S$ can produce (or terminate in) state $T$. An action thus defines a set of transitions from state to state. An action $A$ is said to be enabled in state $S$ if there is a state $T$ such that $(S, T) \in A$. Therefore, for each action $A$, there is a predicate $g_A$ which defines the set of states in which $A$ is enabled. Stated in another way, $g_A(S)$ is true iff $A$ is enabled in $S$. $g_A$ is called the guard of $A$.
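The definitions above can be made concrete over a small finite state space. The following is a minimal sketch, not part of the original model: the variables `x` and `f`, their value ranges, and the action `increment` are all hypothetical, chosen only to show states as mappings, predicates as boolean functions, actions as sets of state pairs, and the guard as the predicate that holds exactly where an action is enabled.

```python
from itertools import product

# Hypothetical value space: x ranges over {0, 1, 2}, f is a boolean.
VALUES = {"x": (0, 1, 2), "f": (False, True)}


def make_state(**assignment):
    """A state maps each variable to a value; a frozenset keeps it hashable."""
    return frozenset(assignment.items())


def value(state, var):
    """Look up the value of a variable in a state."""
    return dict(state)[var]


# All states of the (hypothetical) program.
STATES = [make_state(x=x, f=f) for x, f in product(VALUES["x"], VALUES["f"])]


def x_below_two(state):
    """A predicate: a function from states to booleans."""
    return value(state, "x") < 2


# An action is a relation, i.e. a set of (S, T) pairs.
# "increment" takes x to x + 1 while x < 2 and leaves f unchanged.
increment = {
    (s, make_state(x=value(s, "x") + 1, f=value(s, "f")))
    for s in STATES
    if value(s, "x") < 2
}


def enabled(action, state):
    """True iff there is some T with (state, T) in the action."""
    return any(s == state for s, _ in action)


def guard(action):
    """The guard g_A: the predicate that is true exactly where A is enabled."""
    return lambda state: enabled(action, state)
```

In this sketch the guard of `increment` coincides with the predicate `x_below_two` on every state, which is the relationship between an action and its guard described above.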
Let a program $P$ be described as a set $\mathit{Act}_P$ of actions and a set $\mathit{Init}_P$ of predicates which define the initial states of $P$. The actions of $P$ are executed atomically: if an action is chosen for execution, it will be executed without interference from the other actions in the program. Such a program $P$ is denoted as $(\mathit{Init}_P, \mathit{Act}_P)$.

An execution of program $P$ consists of an infinite sequence $S_0, S_1, \ldots$ of states, where $S_0$ satisfies all the predicates in $\mathit{Init}_P$ and, for each $i$, either the pair $(S_i, S_{i+1})$ is in some action of $\mathit{Act}_P$ or $S_i = S_{i+1}$. A fixed point of $P$ is a state $S$ such that for any action $A \in \mathit{Act}_P$,

$$\forall T : (S, T) \in A \Rightarrow T = S$$

An execution $E = S_0, S_1, \ldots$ of $P$ terminates if it reaches a fixed point, i.e. for some $i \ge 0$, $S_i$ is a fixed point of $P$ and hence $S_j = S_i$ for any $j \ge i$. $E$ is said to be fair for an action $A$ of $P$ if, for any $i \ge 0$, either there is a $j \ge i$ such that $(S_j, S_{j+1}) \in A$, or there is a $k \ge i$ such that $g_A(S_k) = false$.

If $P_1$ and $P_2$ are programs, the union composition program $P_1 \mathbin{[]} P_2$ is defined as

$$P_1 \mathbin{[]} P_2 = (\mathit{Init}_{P_1} \cup \mathit{Init}_{P_2},\ \mathit{Act}_{P_1} \cup \mathit{Act}_{P_2})$$

In general, an action of a program is composed of a set of primitive actions. If $A$ and $B$ are actions, the union $A \cup B$ is also an action. Let $A \circ B$ denote the action which is the composition of the actions $A$ and $B$:

$$A \circ B = \{(S, T) \mid \exists T_1 : ((S, T_1) \in A) \wedge ((T_1, T) \in B)\}$$

The union operator models choice while the composition of actions models sequential composition in programs. During the transition from one state to another, defined by an atomic action of program $P$, the system may pass through a number of intermediate states. For example, if $A = A_1 \circ A_2$ is in $\mathit{Act}_P$, the transition $(S, T) \in A$ is carried out through an intermediate state $T_1$ such that

$$((S, T_1) \in A_1) \wedge ((T_1, T) \in A_2)$$

2.2. Concurrency

We can model the parallel execution of a program $P = (\mathit{Init}_P, \mathit{Act}_P)$ by partitioning its actions $\mathit{Act}_P$ among a set $\mathit{Proc} = \{p_1, \ldots, p_k\}$ of processes, as in Back's concurrent action system [Bac88].
A shared variable is one which is used by two or more actions in different processes, while a private variable is used only by the actions in one process. Two actions of $P$ can be executed in parallel if and only if they are not in the same process and do not share variables. Therefore a concurrent system $P$ with a set $\mathit{Proc}$ of $k$ processes can be written as

$$P = (\mathit{Init}_P,\ p_1 \mathbin{[]} \cdots \mathbin{[]} p_k)$$

where $p_i$ denotes the actions that are assigned to process $p_i$, $1 \le i \le k$. In such a concurrent system, processes communicate by executing actions using common variables.

We can also model the parallel execution of $P$ by partitioning its variables $\mathit{Var}(P)$ among its processes $\mathit{Proc}$ (Back's distributed action system [Bac88]). A shared (or joint) action refers to variables in two or more processes, while a private action refers only to variables in one process. Two actions of a distributed system can be executed in parallel if and only if they do not refer to variables in the same process. A distributed system $P$ with a set $\mathit{Proc}$ of $k$ processes can then be written as:

$$P = (\mathit{Init}_P,\ \mathit{Proc},\ A_1[p_{11}, \ldots, p_{1k_1}] \mathbin{[]} \cdots \mathbin{[]} A_n[p_{n1}, \ldots, p_{nk_n}])$$

where $\mathit{Proc} = \{p_1, \ldots, p_k\}$ is a partition of the variables of $P$ and each $A_j[p_{j1}, \ldots, p_{jk_j}]$ is an action shared by the processes $p_{j1}, \ldots, p_{jk_j}$ in $\mathit{Proc}$, $1 \le j \le n$. An action $A_j[p_{j1}, \ldots, p_{jk_j}]$ is assumed to be executed jointly by the processes $\{p_{j1}, \ldots, p_{jk_j}\}$ that share the action $A_j$, and these processes synchronize over the execution of $A_j$. The action $A_j$ provides communication between the processes: the variables of one process may be updated by $A_j$ in a way that depends on the values of the variables of other processes sharing this action.

A sequential action system is a special case of a parallel action system where all the actions (or all the variables) are assigned to the same process. Another special case is where each action (or each variable) is assigned to a separate process.
The program is then executed with maximal parallelism.

We can also view a concurrent action system $P = (\mathit{Init}_P,\ p_1 \mathbin{[]} \cdots \mathbin{[]} p_k)$ as a distributed system, with the private variables of each process $p_i$ being local to the process and the shared variables being partitioned among new processes.

Because actions are executed atomically, a parallel computation can be modelled by a sequential computation with interleaved execution of actions. Hence, the set of possible executions of an action system will be the same for a concurrent or a distributed action system and for a sequential action system. This allows us to separate the logical behaviour of processes from implementation issues.

2.3. Reasoning About Programs

As usual, we define the Hoare triple $\{Q\}A\{R\}$ to mean

$$\forall (S, T) : ((S, T) \in A \wedge Q(S) \Rightarrow R(T))$$

Thus, $\{Q\}A\{R\}$ defines the total correctness¹ of action $A$: when $A$ is executed in a state satisfying $Q$, it terminates in a state satisfying $R$. To reason about a program $P$, $A$ is universally or existentially quantified over the actions of $P$; a property which holds for all points of the execution of $P$ is defined using universal quantification while a property which holds eventually is defined using existential quantification [Lam87, CM88]. For example, a safety property of a concurrent program $P$ is defined by an invariant $Q$ so that $\{Q\}A\{Q\}$ holds for any action $A$ in $\mathit{Act}_P$.

Similarly, for any predicate $R$ and action $A$, the weakest precondition $wp(A, R)$ is defined as

$$wp(A, R)(S) = \forall T : ((S, T) \in A \Rightarrow R(T))$$

¹ This is Hoare logic for total correctness; the original proposal in [Hoa69] used the notation $Q\{A\}R$ and dealt with partial correctness.

2.4. Refinement

A program $P$ is said to be refined by program $P'$, denoted $P \sqsubseteq P'$, if each execution of $P'$ is an execution of $P$. An action $A$ is said to be refined by action $A'$, denoted $A \sqsubseteq A'$, if for any $Q$ and $R$,

$$\{Q\}A\{R\} \Rightarrow \{Q\}A'\{R\}$$

The refinement operation is reflexive and two programs (or actions) $P$ and $P'$ are equivalent, denoted $P \equiv P'$, if each is refined by the other. Obviously, if $A \in \mathit{Act}_P$,

$$(g_A \Leftrightarrow g_{A'}) \wedge (A \sqsubseteq A') \Rightarrow (P \sqsubseteq P[A/A'])$$

where $P[A/A']$ is the program obtained from $P$ by replacing $A$ with $A'$. A general definition and the rules of refinement can be found in [AL88, Bac87, Bac88, Bac89, Mor90].

2.5. Modelling Faults

Let $P$ be a program satisfying the specification $Sp$. The effect of a physical fault in the system on which $P$ is executed can be described as a fault action which transforms a good state into an error state leading to a state which violates $Sp$. The physical faults in the system can then be modelled as the actions of a fault program $F$ which interferes with the execution of $P$. A failure at any point during the execution of $P$ takes it into an error state in which a boolean variable $f$ is $true$.

Assume that a fault may interfere at any point in the execution of $P$. It might appear that the interference by a fault environment $F$ on the execution of program $P$ can simply be defined by the union composition $P \mathbin{[]} F$ of $P$ and $F$, which would imply that faults occur only before or after the execution of an atomic action of $P$, but not during its execution. However, this is clearly a limited view since in practice faults do occur in intermediate states and can lead to failures.

Let $P$ be a program constructed from a set of primitive actions so that each action $A \in \mathit{Act}_P$ is constructed from primitive actions using the union and composition of actions. The interference of $F$ on the execution of $P$ can be defined as a transformation $\mathcal{F}$, in the following way.

1. For a primitive action $A$, $\mathcal{F}(A) = A \cup (\bigcup \mathit{Act}_F)$
2. $\mathcal{F}(A \cup B) = \mathcal{F}(A) \cup \mathcal{F}(B)$
3. $\mathcal{F}(A \circ B) = \mathcal{F}(A) \circ \mathcal{F}(B)$
4. $\mathcal{F}(P) = (\mathit{Init}_P \cup \{f = false\},\ \{\mathcal{F}(A) \mid A \in \mathit{Act}_P\})$

Using the algebra of relations, it is easy to prove the following theorem.

Theorem 2.1.
For programs $P$ and $F$, there is a program $P_F$ such that

$$\mathcal{F}(P) \equiv P \mathbin{[]} P_F$$

We denote $\mathcal{F}(P)$ by $P\;F$. Assume that this transformation ensures that the execution of $P$ under $F$ is fail-stop [SS83], i.e. that no further actions of $P$ will be executed when a failure occurs. The execution of $P$ in the fault environment $F$ is equivalent to the execution of $\mathcal{F}(P)$ in a fault-free environment.

2.6. Modelling Recovery

The behaviour of $\mathcal{F}(P)$ will not usually satisfy $Sp$. To make the program fault-tolerant, $P$ must be transformed by a fault-tolerant transformation $\mathcal{T}$ into a program $\mathcal{T}(P)$ such that $\mathcal{F}(\mathcal{T}(P))$ is expected to satisfy the specification $Sp$ of $P$. Unfortunately, even this may not always be possible, though $\mathcal{F}(\mathcal{T}(P))$ may be shown to satisfy a weaker but still acceptable specification. One kind of fault-tolerant transformation is a recovery transformation, based on a definition of consistency.

Let $P$ have an initial execution sequence

$$[S_0]A_1[S_1]A_2 \ldots A_m[S_m] \ldots A_n[S_n]$$

in which each action $A_i$ transforms state $S_{i-1}$ into state $S_i$. Assume that this sequence ends in state $S_n$ because of a failure. Let $S$ be a state which is the union of the disjoint substates each of which is a substate of one of the states $S_m, \ldots, S_n$. In general, such substates representing partial information (e.g. local states of processes) have been saved at some of the execution points $S_m, \ldots, S_n$, and will be used to calculate a global state $S$ for recovery after the occurrence of faults. If the execution of $P$ can be restarted from state $S$ and still satisfy $Sp$, then the state $S$ is said to be backward consistent with the interrupted execution sequence.

Alternatively, the execution of $P$ can be continued from a possible future state $S_{n+k}$. If there exists an execution sequence $[S_n]A_{n+1}[S_{n+1}] \ldots A_{n+k}[S_{n+k}]$ of $P$ which satisfies $Sp$, then $S_{n+k}$ is said to be forward consistent with the interrupted execution sequence.
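The construction of a candidate recovery state from saved substates can be sketched as follows. This is an illustrative sketch only: the two-process trace, the variable names `sent` and `received`, and the helper names `restrict` and `union_of_substates` are hypothetical, and the code only assembles a global state from disjoint substates. Whether that state is backward consistent still depends on the specification $Sp$, which is not modelled here.

```python
def restrict(state, variables):
    """The substate of `state` (a dict) on the given variables."""
    return {v: state[v] for v in variables}


def union_of_substates(substates):
    """Union of substates whose variable sets are pairwise disjoint."""
    combined = {}
    for sub in substates:
        overlap = combined.keys() & sub.keys()
        if overlap:
            raise ValueError(f"substates overlap on {sorted(overlap)}")
        combined.update(sub)
    return combined


# Hypothetical interrupted run S_m, ..., S_n of a two-process program:
# the sender owns "sent", the receiver owns "received".
trace = [
    {"sent": 1, "received": 0},
    {"sent": 2, "received": 1},
    {"sent": 3, "received": 1},
]

# Assume the sender saved its local state at trace[1] and the receiver
# saved its local state at trace[0].
recovery_state = union_of_substates([
    restrict(trace[1], ["sent"]),
    restrict(trace[0], ["received"]),
])
```

Here `recovery_state` combines local states saved at different execution points, so it need not equal any single state of the trace; this is why a separate consistency condition is required before execution is restarted from it.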
A consistent state is a state which is either backward or forward consistent.

The recovery transformation $\mathcal{R}$ transforms a program $P$ into a program $\mathcal{R}(P) = P \mathbin{[]} P_R$ by adding a set of recovery actions $R$, called a recovery program. The recovery actions of $R$ are enabled only when a failure occurs and transform an error state into a good state which is consistent with the execution sequence interrupted by the fault. We assume that the recovery actions are not affected by the fault environment $F$, i.e. that no failure occurs during the execution of a recovery action. Therefore, we have

$$\mathcal{F}(\mathcal{R}(P)) = \mathcal{F}(P) \mathbin{[]} R = (P\;F) \mathbin{[]} P_R \equiv P \mathbin{[]} P_F \mathbin{[]} P_R$$

To ensure that the recovery program is eventually executed when a fault occurs, the executions of $\mathcal{F}(\mathcal{R}(P))$ must be fair for each action of $P_R$.

Let $P_0 \sqsubseteq \ldots \sqsubseteq P_k$ be a sequence of program specifications such that $P_i \sqsubseteq P_{i+1}$ and $P_k$ contains enough information for specifying the fault environment $F$. From $P_k$ we may be able to determine the number of processes in the program, those which may fail during execution, and the channel variables which are faulty. Based on $F$ and the fault-transformation $\mathcal{F}$ on $P_k$, we can perform the recovery transformation $\mathcal{R}$ on $P_k$ and obtain the fault-tolerant program $P_k \mathbin{[]} P_{kR}$. For example, fault-tolerant mechanisms such as checkpointing, recovery blocks and conversations [BR81, Ran75, MR78, KT87] can then be introduced by applying stepwise refinement to $P_k \mathbin{[]} P_{kR}$.

3. A Specification Language

The specification notation is a combination of Back's action system formalism [Bac88] and a UNITY-like notation [CM88], and this is defined in the model presented in the previous section. It will be used for reasoning about programs at different levels of abstraction and to provide a sound refinement calculus.

3.1. Commands and Actions

A program specification $P = (\mathit{Init}_P, \mathit{Act}_P)$ is a pair of sets, where $\mathit{Init}_P$ is a set of predicates and $\mathit{Act}_P$ is a set of action specifications. An action specification $A \in \mathit{Act}_P$ has the syntactic form $g \to c$, where $g$ is a boolean condition and $c$ is a command [Bac88, BS89]. We use the same symbols for programs and program specifications, and we call a program specification a program, and an action specification an action. Let $g_A$ and $c_A$ respectively denote the guard $g$ and the body $c$ of action $A$, and let the action $true \to c$ be abbreviated to $c$ when there is no confusion. A command $c$ is defined as follows:

$$\begin{array}{rcll}
c & ::= & x_i := x_i'.Q & \text{(nondeterministic assignment)} \\
  & \mid & c_1; \ldots; c_n,\ n > 1 & \text{(sequential composition)} \\
  & \mid & \mathbf{if}\ A_1 \mathbin{[]} \cdots \mathbin{[]} A_n\ \mathbf{fi},\ n \ge 1 & \text{(conditional composition)} \\
  & \mid & \mathbf{do}\ A_1 \mathbin{[]} \cdots \mathbin{[]} A_n\ \mathbf{od},\ n \ge 1 & \text{(iterative composition)}
\end{array}$$

Here $x_i$ and $x_i'$ are lists of variables and $Q$ is a condition over the values of program variables [BK83]. In sequential composition, each $c_i$ is a command, and each $A_i$ in the conditional and iterative composition commands is an action. We write $\mathit{IF}_{i=1}^{n} A_i$ for $\mathbf{if}\ A_1 \mathbin{[]} \cdots \mathbin{[]} A_n\ \mathbf{fi}$ and $\mathit{DO}_{i=1}^{n} A_i$ for the iterative composition command. The effect of the nondeterministic assignment command is to assign to the list of variables $x_i$ some value(s) $x_i'$ satisfying condition $Q$. (Note that this may introduce unbounded nondeterminism: we shall discuss this later.) The set $\mathit{Act}_P = \{A_1, \ldots, A_n\}$ of actions can also be represented as a list of actions $A_1 \mathbin{[]} \cdots \mathbin{[]} A_n$.

3.2. Semantics of Commands and Actions

By defining each action specification as an atomic action, the semantics of a program specification is a program in the computational model. Because of this atomicity, a program can be viewed as being sequential and nondeterministic. We also assume that, for simplicity, no abortion occurs in the execution of any command, and that the execution of each command always terminates.

States: Let $V$ be the value space of program $P$. A state $S$ of a program $P$ is a function from the program variables $\mathit{Var}(P)$ to the value space $V$. A sub-state $S|_Y$ is the restriction of $S$ to $Y$, where $Y \subseteq \mathit{Var}(P)$. $\Sigma_P$ is used to denote the set of all the states of program $P$.
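The informal effect of the nondeterministic assignment command can be illustrated over a finite value space. The sketch below is an assumption-laden illustration rather than the paper's definition: the function name `nondet_assign`, the representation of states as dictionaries, and the choice to evaluate the condition $Q$ on the candidate new state are all choices made here for the example.

```python
from itertools import product


def nondet_assign(state, targets, condition, value_space):
    """Enumerate the states reachable by assigning to `targets` any values
    from `value_space` such that `condition` holds in the resulting state.

    Variables not listed in `targets` keep their old values.
    """
    results = []
    for values in product(value_space, repeat=len(targets)):
        candidate = dict(state)
        candidate.update(zip(targets, values))
        if condition(candidate):
            results.append(candidate)
    return results


# Hypothetical example: assign to x some value in {0, 1, 2, 3} with 2 * x > 3.
outcomes = nondet_assign(
    state={"x": 0, "y": 5},
    targets=["x"],
    condition=lambda s: 2 * s["x"] > 3,
    value_space=range(4),
)
```

With a finite value space the set of outcomes is finite; the unbounded nondeterminism mentioned in the text arises only when the value space is infinite, which this sketch does not attempt to model.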
Semantics of commands: The semantics of a command $c$ is a function:

$$\Psi(c) : \Sigma_P \to 2^{\Sigma_P}$$

This function can be extended as a function over the powerset $2^{\Sigma_P}$ in the following way: for a set of states $\Theta$,

$$\Psi(c)(\Theta) = \bigcup_{S \in \Theta} \Psi(c)(S)$$

The semantics of the execution of a command $c$ in a state $S$ is defined below.

1. Assignment:
$$\Psi(x_i := x_i'.Q)(S) = \{S' \mid Q(S') \wedge \forall y : y \notin x_i : S'(y) = S(y)\}$$
Thus the programmer must guarantee the existence of a state in which $Q$ is $true$.

2. Sequential composition:
$$\Psi(c_1; c_2)(S) = \Psi(c_2)(\Psi(c_1)(S))$$

3. Conditional composition:
$$\Psi(\mathit{IF}_{i=1}^{n} A_i)(S) = \bigcup_{g_i(S) = true} \Psi(c_{A_i})(S)$$
To guarantee that no abortion occurs, in any state $S$, one of the guards $g_{A_i}$ must be $true$.

4. Iterative composition: We first define
$$\Psi(A_1 \mathbin{[]} \cdots \mathbin{[]} A_n)(S) = \bigcup_{g_i(S) = true} \Psi(c_{A_i})(S)$$
within $2^{\Sigma_P}$, which is the complete partial order domain (CPO) under inclusion between sets of states of $P$. Let $Y$ be a set of states and $F(S)$ be the least fixed point² of the equation
$$Y = \{S\} \cup \Psi(A_1 \mathbin{[]} \cdots \mathbin{[]} A_n)(Y)$$
Then
$$\Psi(\mathit{DO}_{i=1}^{n} A_i)(S) = F(S) \cap M$$
where $M = \{S \mid \forall i : 1 \le i \le n : g_{A_i}(S) = false\}$. To guarantee the termination of $\mathit{DO}_{i=1}^{n} A_i$, we assume that $F(S) \cap M$ is always non-empty for any state $S$.

² The existence of such a fixed point can be proved in set theory.

Semantics of actions: The semantics of an action $A = g \to c$ is a function $\Psi(A) : \Sigma_P \to 2^{\Sigma_P}$ such that

$$\Psi(A)(S) = \begin{cases} \Psi(c)(S) & \text{if } g(S) = true \\ \emptyset & \text{if } g(S) = false \end{cases}$$

Semantics of programs: For the value space $V$ of program $P$, let $V^{\dagger}$ be the set of all the finite and infinite sequences³ over $V$. An observation of program $P$ is a sequence of states which is defined as the function $\sigma : \mathit{Var}(P) \to V^{\dagger}$ satisfying the following conditions:

OB1. $\sigma(x) \ne \langle\rangle$, for any $x \in \mathit{Var}(P)$,
OB2. $\#\sigma(x) = \#\sigma(y)$ (denoted by $\#\sigma$) for any $x, y \in \mathit{Var}(P)$,
OB3. for each $i < \#\sigma$, $\sigma[i]$ is a state of $P$ defined as $\sigma[i](x) = \sigma(x)(i)$ for any $x \in \mathit{Var}(P)$,
OB4. $\sigma[0]$ is an initial state of $P$,
OB5. $\forall i : 0 < i < \#\sigma :: (\sigma[i] = \sigma[i-1]) \vee (\exists A \in \mathit{Act}_P : \sigma[i] \in \Psi(A)(\sigma[i-1]))$

³ $\langle a, \ldots, b \rangle$ is the sequence of elements $a, \ldots, b$; $\langle\rangle$ is the empty sequence and $^{\frown}$ concatenates two sequences according to the standard definition, e.g. $\langle a \rangle^{\frown}\langle b \rangle = \langle a, b \rangle$, $\langle a \rangle^{\frown}\langle\rangle = \langle a \rangle$, etc.; $\sigma' \trianglelefteq \sigma$ denotes that $\sigma'$ is a prefix of $\sigma$; $\sigma' \triangleleft \sigma$ denotes that $\sigma'$ is a proper prefix of $\sigma$; $\#\sigma$ is the length of the sequence $\sigma$ and $\#\sigma = \infty$ if $\sigma$ is infinite; $\sigma(i-1)$ is the $i$th element of $\sigma$, $1 \le i \le \#\sigma$; $head(\sigma)$ and $last(\sigma)$ denote respectively the first and last elements of a non-empty sequence $\sigma$; $tail(\sigma)$ denotes the sequence obtained from $\sigma$ by removing the first element of $\sigma$.

We use the sequence notation to describe a property of a function from $\mathit{Var}(P)$ to the sequences $V^{\dagger}$; e.g. for the functions $\sigma$ and $\sigma'$, $\sigma^{\frown}\sigma'$ is the function such that $(\sigma^{\frown}\sigma')(x) = \sigma(x)^{\frown}\sigma'(x)$ for each $x \in \mathit{Var}(P)$. A function $\sigma' : \mathit{Var}(P) \to V^{\dagger}$ satisfying Condition OB2 is said to be a sub-observation of $P$ if it is a segment of some observation of $P$, i.e. if there are an observation $\sigma$ and functions $\sigma_1, \sigma_2 : \mathit{Var}(P) \to V^{\dagger}$ satisfying Condition OB2 such that

$$\sigma = \sigma_1{}^{\frown}\sigma'^{\frown}\sigma_2$$

Obviously, given a command $c$ and an action $A$, $\Psi(c)$ and $\Psi(A)$ can be extended to be functions over the set $OB_P$ of the finite observations of $P$, e.g. $\Psi(c)(\sigma) = \{\sigma^{\frown}S \mid S \in \Psi(c)(last(\sigma))\}$.

An execution $E$ of $P$ is an infinite observation of $P$. It is said to exhibit fairness (or justice [MP83, GP89]) if for any $i \ge 0$ and $A \in \mathit{Act}_P$, either there is a $j \ge i$ such that $E(j+1) \in \Psi(A)(E(j))$, or there is a $k \ge i$ such that $g_A(E(k)) = false$. The semantics of a program $P$ is the set $\Psi(P)$ of its fair executions. Programs $P$ and $P'$ are equivalent if they have the same semantics:

$$P \equiv P' \;\stackrel{\text{def}}{=}\; \Psi(P) = \Psi(P')$$

Within this semantic model, $P \mathbin{[]} \mathit{skip} \equiv P$, where $\mathit{skip}$ denotes the program which never changes the values of the program variables. We also use $\mathit{skip}$ to denote a command or an action which does not change the values of the program variables. Therefore, this model permits finite stuttering, i.e.
in any execution, a state can be repeated consecutively at most a finite number of times before the execution reaches the fixed point of $P$. The stuttering property is required when dealing with the refinement of reactive programs [Lam83, AL88, Bac89].

3.3. Reasoning About Action Systems

As in Hoare logic [Hoa69, Dij76], the specification $\{Q\}A\{R\}$ defines an action $A$ which when executed in a state satisfying predicate $Q$ terminates in a state satisfying predicate $R$. $A$ is universally or existentially quantified over the actions. The logical operators unless, stable, invariant, ensures and leads-to ($\mapsto$) are used to describe the safety and progress properties of programs. Following UNITY [CM88], these operators are defined in terms of the $Sat$ relation for the formulae listed below.

$$P\ Sat\ (Q\ \mathbf{unless}\ R) \;\equiv\; \forall A : A \in \mathit{Act}_P :: \{g_A \wedge Q \wedge \neg R\}\,A\,\{Q \vee R\}$$

$$P\ Sat\ (\mathbf{stable}\ Q) \;\equiv\; P\ Sat\ (Q\ \mathbf{unless}\ false)$$

$$P\ Sat\ (\mathbf{invariant}\ Q) \;\equiv\; (\mathit{Init}_P \Rightarrow Q) \wedge P\ Sat\ (\mathbf{stable}\ Q)$$

$$P\ Sat\ (Q\ \mathbf{ensures}\ R) \;\equiv\; P\ Sat\ (Q\ \mathbf{unless}\ R) \wedge (\exists A : A \in \mathit{Act}_P :: \{g_A \wedge Q \wedge \neg R\}\,A\,\{R\})$$

The leads-to operator is defined by the following three inference rules:

$$\frac{P\ Sat\ (Q\ \mathbf{ensures}\ R)}{P\ Sat\ (Q \mapsto R)} \qquad \frac{P\ Sat\ (Q \mapsto R),\quad P\ Sat\ (R \mapsto R')}{P\ Sat\ (Q \mapsto R')}$$

and, for any set $W$,

$$\frac{\forall m : m \in W :: P\ Sat\ (Q(m) \mapsto R)}{P\ Sat\ ((\exists m : m \in W :: Q(m)) \mapsto R)}$$

A fixed point of a program is a program state such that execution of any statement of the program in that state leaves the state unchanged. The execution of a program terminates if it reaches its fixed point. Let $\mathit{Fixedpoint}(P)$ denote the predicate which defines the fixed points of program $P$. A state $S$ satisfies $\mathit{Fixedpoint}(P)$ iff it is a fixed point of $P$.

A program specification can be written by using either UNITY-like logic or an action system formalism. However, we will normally use the logic for high level specification and the action system formalism for the refinement. We assume a fairness rule and that the execution of a command always terminates.

4. Transformations for Specified Faults

The effect of faults on a program $P$ is specified by a program $F$ which defines a set of atomic actions, called fault actions, representing the fault environment. The execution of the program $P$ under the fault environment specified by $F$ is equivalent to the execution of $P$ together with $F$ on the fault-free system. Such a failure execution of $P$ is defined by the following failure semantics.

4.1. Failure Semantics

For a program $P$ and a fault environment $F$, assume that $P$ has a boolean variable $f$ to indicate the presence of a fault, and that the value of $f$ is never changed in $P$. Each action $A \in F$ is assumed to satisfy

$$\{true\}\,A\,\{f\}$$

A state $S$ is good if $f(S) = false$, and an error state is a state which is not good. If $c$ is a command of $P$, then the failure semantics of $c$ w.r.t. $F$ is a function

$$\Psi_F(c) : \Sigma_P \to 2^{\Sigma_P}$$

which satisfies the following conditions:

1. if $c$ is an assignment command, then
$$\Psi_F(c)(S) = \begin{cases} \{S\} & \text{if } f(S) = true \\ \bigcup_{a \in F} \Psi(a)(S) \cup \Psi(c)(S) & \text{if } f(S) = false \end{cases}$$

2. $\Psi_F(c_1; c_2)(S) = \Psi_F(c_2)(\Psi_F(c_1)(S))$

3. if $c$ is the conditional composition $\mathbf{if}\ A_1 \mathbin{[]} \cdots \mathbin{[]} A_n\ \mathbf{fi}$, $A_i = g_i \to c_i$, then
$$\Psi_F(c)(S) = \begin{cases} \bigcup_{g_{A_i}(S) = true} \Psi_F(c_{A_i})(S) & \text{if } \neg f(S) = true \\ \{S\} & \text{if } f(S) = true \end{cases}$$

4. for the iterative composition $c = \mathbf{do}\ A_1 \mathbin{[]} \cdots \mathbin{[]} A_n\ \mathbf{od}$, $A_i = g_i \to c_i$, we first define
$$\Psi_F(A_1 \mathbin{[]} \cdots \mathbin{[]} A_n)(S) = \begin{cases} \bigcup_{g_i(S) = true} \Psi_F(c_i)(S) & \text{if } f(S) = false \\ \emptyset & \text{if } f(S) = true \end{cases}$$
Let $E(S)$ be the least fixed point of the equation
$$Y = \{S\} \cup \Psi_F(A_1 \mathbin{[]} \cdots \mathbin{[]} A_n)(Y)$$
Then $\Psi_F(c)(S)$ is
$$\Psi_F(c)(S) = E(S) \cap M$$
where $M = \{S \mid (\neg f \wedge \bigvee_{i=1}^{n} g_i)(S) = false\}$, i.e. the states in which a fault has occurred or no guard is true.

For each action $A = g \to c$, $\Psi_F(A)$ is defined by

$$\Psi_F(A)(S) = \begin{cases} \Psi_F(c)(S) & \text{if } (\neg f \wedge g)(S) = true \\ \emptyset & \text{otherwise} \end{cases}$$

From the failure semantics of a command and an action, the failure observation of the program $P$ w.r.t. $F$ can be derived as a function $\sigma : \mathit{Var}(P) \to V^{\dagger}$ satisfying the following conditions:

FOB1. $\sigma(x) \ne \langle\rangle$, for any $x \in \mathit{Var}(P)$,
FOB2. $\#\sigma(x) = \#\sigma(y)$ (denoted by $\#\sigma$) for any $x, y \in \mathit{Var}(P)$,
FOB3. for each $i < \#\sigma$, $\sigma[i]$ is a state of $P$ defined as $\sigma[i](x) = \sigma(x)(i)$ for any $x \in \mathit{Var}(P)$,
FOB4. $\sigma[0]$ is an initial state of $P$,
FOB5. $\forall i : 0 < i < \#\sigma :: (\sigma[i] = \sigma[i-1]) \vee (\exists A \in \mathit{Act}_P : \sigma[i] \in \Psi_F(A)(\sigma[i-1]))$

A failure execution $E$ of a program $P$ w.r.t. $F$ is an infinite failure observation of $P$. It is said to be fair if any action $A$ of $P$ which is continuously enabled at an execution point in $E$ is eventually executed following the failure semantics of $A$. The failure semantics of $P$ w.r.t. $F$ is the set $\Psi_F(P)$ of the fair failure executions of $P$ w.r.t. $F$.

With this definition, the execution of program $P$ can be seen to be fail-stop. Therefore, the failure semantics of the program $P$ can be described in terms of two functions $good$ and $error$ such that each failure observation $\sigma$ w.r.t. $F$ can be written as

$$\sigma = good(\sigma)^{\frown}error(\sigma)$$

where $good(\sigma)$ is an observation of $P$ and $error(\sigma)$ is either empty or contains only error states. Obviously, if there are no faults, i.e. $F$ is empty, each "failure" execution is the same as some execution of $P$:

$$F = \emptyset \Rightarrow \Psi_F(P) = \Psi(P)$$

Programs $P$ and $P'$ are said to be fault-prone equivalent w.r.t. $F$ if they are equivalent and have the same failure semantics:

$$P \equiv_F P' \;\stackrel{\text{def}}{=}\; P \equiv P' \wedge \Psi_F(P) = \Psi_F(P')$$

It should be noted that equivalent programs may not be fault-prone equivalent.

4.2. Fault Transformation

Given a program $P = A_1 \mathbin{[]} \cdots \mathbin{[]} A_m$ and a fault environment $F$, let $f = true$ if a fault occurs in the execution of $P$. First, transform $P$ into a program $\mathit{FS}(P)$ such that

$$\mathit{FS}(P) = \mathit{FS}(A_1) \mathbin{[]} \cdots \mathbin{[]} \mathit{FS}(A_m)$$

and each $\mathit{FS}(A_i)$ is obtained from $A_i$ by changing $g_{A_i}$ into $g_{A_i} \wedge \neg f$ and every primitive command $c$ which occurs in $A_i$ into the command

$$\mathit{FS}(c) = \mathbf{if}\ \neg f \to c \mathbin{[]} f \to \mathit{skip}\ \mathbf{fi}$$

If $c$ is a primitive command, $\mathit{FS}(c)$ is the primitive $f$-stop command of $c$. For a command $c$ of $P$, $\mathit{FS}(c)$ is an $f$-stop command which is obtained from $c$ by changing every primitive command $c'$ which occurs in $c$ into the primitive $f$-stop command $\mathit{FS}(c')$.
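The failure semantics of a primitive command and the primitive $f$-stop command can be compared on a toy example. The sketch below rests on several assumptions that are not in the paper: commands and fault actions are modelled as Python functions from a state to a set of states, states are frozensets of variable bindings, and `increment` and `corrupt` are hypothetical stand-ins for a primitive command and a fault action satisfying $\{true\}\,A\,\{f\}$.

```python
def st(**bindings):
    """A hashable state: a frozenset of (variable, value) pairs."""
    return frozenset(bindings.items())


def get(state, var):
    """Value of a variable in a state."""
    return dict(state)[var]


def f_stop(command):
    """Sketch of FS(c) = if not f -> c [] f -> skip fi for a primitive c."""
    def run(state):
        return {state} if get(state, "f") else set(command(state))
    return run


def failure_semantics(command, fault_actions):
    """Sketch of the failure semantics of a primitive command:
    an error state is left unchanged; in a good state the outcomes are
    those of the command together with those of every fault action."""
    def run(state):
        if get(state, "f"):
            return {state}
        outcomes = set(command(state))
        for fault in fault_actions:
            outcomes |= fault(state)
        return outcomes
    return run


def increment(state):
    """Hypothetical primitive command: x := x + 1, f unchanged."""
    return {st(x=get(state, "x") + 1, f=get(state, "f"))}


def corrupt(state):
    """Hypothetical fault action: loses x and sets f."""
    return {st(x=0, f=True)}


good = st(x=1, f=False)
bad = st(x=1, f=True)
with_faults = failure_semantics(increment, [corrupt])
```

In a good state `with_faults` yields both the normal outcome and the error state produced by the fault, while in an error state both `with_faults` and `f_stop(increment)` leave the state unchanged. With an empty list of fault actions the two functions agree on every state, which mirrors the remark that a program without faults behaves as in the fault-free semantics.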
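Given an action of $P$

$$A = g \to c$$

the $f$-stop action of $A$ is

$$\mathit{FS}(A) = \neg f \wedge g \to \mathit{FS}(c)$$

The execution of $\mathit{FS}(P)$ on a system with the faults $F$ is therefore fail-stop. Obviously, if $f$ is invariantly false, $\mathit{FS}(P) \equiv P$ since $P$ does not change the value of $f$.

For each $f$-stop command $\mathit{FS}(c)$ and each $f$-stop action $\mathit{FS}(A)$, a transformation $\mathcal{M}$ is defined as:

1. if $c$ is a primitive command,
$$\mathcal{M}(\mathit{FS}(c)) = \mathbf{if}\ \neg f \to c \mathbin{[]} f \to \mathit{skip} \mathbin{[]} \neg f \to F\ \mathbf{fi}$$

2. if $c = c_1; c_2$,
$$\mathcal{M}(\mathit{FS}(c)) = \mathcal{M}(\mathit{FS}(c_1)); \mathcal{M}(\mathit{FS}(c_2)) \equiv \mathbf{if}\ (\mathit{FS}(c_1); \mathit{FS}(c_2)) \mathbin{[]} \neg f \to F \mathbin{[]} \mathit{FS}(c_1); (\mathbf{if}\ f \to \mathit{skip} \mathbin{[]} \neg f \to F\ \mathbf{fi})\ \mathbf{fi}$$

3. for an action $A = g \to c$,
$$\mathcal{M}(\mathit{FS}(A)) = \neg f \wedge g \to \mathcal{M}(\mathit{FS}(c))$$

4. for a command $c = \mathbf{if}\ A_1 \mathbin{[]} \cdots \mathbin{[]} A_n\ \mathbf{fi}$,
$$\mathcal{M}(\mathit{FS}(c)) = \mathbf{if}\ \mathcal{M}(\mathit{FS}(A_1)) \mathbin{[]} \cdots \mathbin{[]} \mathcal{M}(\mathit{FS}(A_n)) \mathbin{[]} f \to \mathit{skip}\ \mathbf{fi}$$

5. if $c = \mathbf{do}\ A_1 \mathbin{[]} \cdots \mathbin{[]} A_n\ \mathbf{od}$,
$$\mathcal{M}(\mathit{FS}(c)) = \mathbf{do}\ \mathcal{M}(\mathit{FS}(A_1)) \mathbin{[]} \cdots \mathbin{[]} \mathcal{M}(\mathit{FS}(A_n))\ \mathbf{od}$$

For the program $P = A_1 \mathbin{[]} \cdots \mathbin{[]} A_m$, we define

$$\mathcal{M}(\mathit{FS}(P)) = \mathcal{M}(\mathit{FS}(A_1)) \mathbin{[]} \cdots \mathbin{[]} \mathcal{M}(\mathit{FS}(A_m))$$

Given the faults $F$, the fault transformation is defined as:

$$\mathcal{F}(P) = \mathcal{M}(\mathit{FS}(P))$$

and $\mathcal{F}(P)$ is called the fault affected program of program $P$, denoted by $P\;F$.

Theorem 4.1. Given a program $P$ and its fault environment $F$, the fault-transformation $\mathcal{F}$ satisfies

$$\Psi(P\;F) = \Psi_F(P)$$

Proof. From the definitions of observations and failure observations, it is only required to prove that for each state $S$ and an action $A = g \to c$ of $P$,

$$\Psi_F(A)(S) = \Psi(\mathcal{M}(\mathit{FS}(A)))(S)$$

Case 1: if $c$ is a primitive command, then

$$\begin{array}{rcl}
\Psi_F(A)(S) & = & \begin{cases} \Psi_F(c)(S) & \text{if } (\neg f \wedge g)(S) = true \\ \emptyset & \text{otherwise} \end{cases} \\[2ex]
 & = & \begin{cases} \bigcup_{a \in F} \Psi(a)(S) \cup \Psi(c)(S) & \text{if } (\neg f \wedge g)(S) = true \\ \emptyset & \text{otherwise} \end{cases} \\[2ex]
 & = & \begin{cases} \Psi(c)(S) \cup \Psi(F)(S) & \text{if } (\neg f \wedge g)(S) = true \\ \emptyset & \text{otherwise} \end{cases} \\[2ex]
 & = & \Psi(\neg f \wedge g \to \mathcal{M}(\mathit{FS}(c)))(S) \\
 & = & \Psi(\mathcal{M}(\mathit{FS}(A)))(S)
\end{array}$$

Case 2: if $c = c_1; c_2$ and $\Psi_F(c_i) = \Psi(\mathcal{M}(\mathit{FS}(c_i)))$ for $i = 1, 2$, then

$$\begin{array}{rcl}
\Psi_F(A)(S) & = & \begin{cases} \Psi_F(c_2)(\Psi_F(c_1)(S)) & \text{if } (\neg f \wedge g)(S) = true \\ \emptyset & \text{otherwise} \end{cases} \\[2ex]
 & = & \begin{cases} \Psi(\mathcal{M}(\mathit{FS}(c_2)))(\Psi(\mathcal{M}(\mathit{FS}(c_1)))(S)) & \text{if } (\neg f \wedge g)(S) = true \\ \emptyset & \text{otherwise} \end{cases} \\[2ex]
 & = & \begin{cases} \Psi(\mathcal{M}(\mathit{FS}(c_1)); \mathcal{M}(\mathit{FS}(c_2)))(S) & \text{if } (\neg f \wedge g)(S) = true \\ \emptyset & \text{otherwise} \end{cases} \\[2ex]
 & = & \begin{cases} \Psi(\mathcal{M}(\mathit{FS}(c_1; c_2)))(S) & \text{if } (\neg f \wedge g)(S) = true \\ \emptyset & \text{otherwise} \end{cases} \\[2ex]
 & = & \Psi(\neg f \wedge g \to \mathcal{M}(\mathit{FS}(c_1; c_2)))(S) \\
 & = & \Psi(\mathcal{M}(\mathit{FS}(A)))(S)
\end{array}$$

Case 3: if $c = \mathbf{if}\ A_1 \mathbin{[]} \cdots \mathbin{[]} A_n\ \mathbf{fi}$, $A_i = g_i \to c_i$ for $i = 1, \ldots, n$, then

$$\begin{array}{rcl}
\Psi_F(A)(S) & = & \begin{cases} \bigcup_{g_i(S) = true} \Psi_F(c_i)(S) & \text{if } (\neg f \wedge g)(S) = true \\ \emptyset & \text{if } (\neg f \wedge g)(S) = false \end{cases} \\[2ex]
 & = & \Psi(\neg f \wedge g \to \mathbf{if}\ \mathcal{M}(\mathit{FS}(A_1)) \mathbin{[]} \cdots \mathbin{[]} \mathcal{M}(\mathit{FS}(A_n)) \mathbin{[]} f \to \mathit{skip}\ \mathbf{fi})(S) \\
 & = & \Psi(\mathcal{M}(\mathit{FS}(A)))(S)
\end{array}$$

Case 4: the proof for the case of iterative composition is similar to Case 3.

Corollary 4.1. Given a program $P$ and its faults $F$,

$$F = \emptyset \Rightarrow P\;F \equiv P$$

Therefore, a fault-free execution of the program $P\;F$ is a fault-prone execution of program $P$ and vice versa, and $P\;F$ is equivalent to $P$ if no fault occurs. Further, as shown in the following theorem, $P\;F$ is equivalent to the union composition of the $f$-stop program $\mathit{FS}(P)$ with a program $P_F$.

Theorem 4.2. Given $P$ and $F$, there is a program $P_F$ such that

$$P\;F \equiv \mathit{FS}(P) \mathbin{[]} P_F$$

Proof. Let $P = A_1 \mathbin{[]} \cdots \mathbin{[]} A_n$, $A_i = g_i \to c_i$. We prove that for each $A_i$, there is an action $F_i$ such that

$$\mathcal{M}(\mathit{FS}(A_i)) \equiv \neg f \wedge g_i \to \mathbf{if}\ \mathit{FS}(c_{A_i}) \mathbin{[]} F_i\ \mathbf{fi}$$

Case 1: if $c_i$ is a primitive command,

$$\begin{array}{rcl}
\mathcal{M}(\mathit{FS}(A_i)) & = & \neg f \wedge g_i \to \mathcal{M}(\mathit{FS}(c_i)) \\
 & = & \neg f \wedge g_i \to \mathbf{if}\ \neg f \to c_i \mathbin{[]} f \to \mathit{skip} \mathbin{[]} \neg f \to F\ \mathbf{fi} \\
 & \equiv & \neg f \wedge g_i \to \mathbf{if}\ \mathit{FS}(c_i) \mathbin{[]} \neg f \wedge g_i \to F\ \mathbf{fi}
\end{array}$$

Case 2: if $c_i = c; c'$, then

$$\begin{array}{rcl}
\mathcal{M}(\mathit{FS}(A_i)) & = & \neg f \wedge g_i \to \mathcal{M}(\mathit{FS}(c; c')) \\
 & \equiv & \neg f \wedge g_i \to \mathbf{if}\ (\mathit{FS}(c); \mathit{FS}(c')) \mathbin{[]} \neg f \to F \mathbin{[]} \mathit{FS}(c); (\mathbf{if}\ f \to \mathit{skip} \mathbin{[]} \neg f \to F\ \mathbf{fi})\ \mathbf{fi} \\
 & \equiv & \neg f \wedge g_i \to \mathbf{if}\ \mathit{FS}(c; c') \mathbin{[]} \neg f \to F \mathbin{[]} \mathit{FS}(c); (\mathbf{if}\ f \to \mathit{skip} \mathbin{[]} \neg f \to F\ \mathbf{fi})\ \mathbf{fi}
\end{array}$$

Case 3: if $c_i = \mathbf{if}\ A_{i1} \mathbin{[]} \cdots \mathbin{[]} A_{in}\ \mathbf{fi}$, and $\mathcal{M}(\mathit{FS}(A_{ij})) \equiv \mathit{FS}(A_{ij}) \mathbin{[]} F_{ij}$, then

$$\begin{array}{rcl}
\mathcal{M}(\mathit{FS}(A_i)) & = & \neg f \wedge g_i \to \mathbf{if}\ \mathcal{M}(\mathit{FS}(A_{i1})) \mathbin{[]} \cdots \mathbin{[]} \mathcal{M}(\mathit{FS}(A_{in})) \mathbin{[]} f \to \mathit{skip}\ \mathbf{fi} \\
 & \equiv & \neg f \wedge g_i \to \mathbf{if}\ \mathit{FS}(A_{i1}) \mathbin{[]} F_{i1} \mathbin{[]} \cdots \mathbin{[]} \mathit{FS}(A_{in}) \mathbin{[]} F_{in} \mathbin{[]} f \to \mathit{skip}\ \mathbf{fi} \\
 & \equiv & \neg f \wedge g_i \to \mathbf{if}\ \mathit{FS}(A_{i1}) \mathbin{[]} \cdots \mathbin{[]} \mathit{FS}(A_{in}) \mathbin{[]} F_{i1} \mathbin{[]} \cdots \mathbin{[]} F_{in} \mathbin{[]} f \to \mathit{skip}\ \mathbf{fi} \\
 & \equiv & \neg f \wedge g_i \to \mathbf{if}\ \mathit{FS}(c_i) \mathbin{[]} F_{i1} \mathbin{[]} \cdots \mathbin{[]} F_{in}\ \mathbf{fi}
\end{array}$$

Case 4: the proof for the case of iterative composition is similar to Case 3.

Therefore,

$$\begin{array}{rcl}
P\;F & = & \mathcal{M}(\mathit{FS}(A_1)) \mathbin{[]} \cdots \mathbin{[]} \mathcal{M}(\mathit{FS}(A_n)) \\
 & \equiv & \neg f \wedge g_{A_1} \to \mathbf{if}\ \mathit{FS}(c_{A_1}) \mathbin{[]} F_1\ \mathbf{fi} \mathbin{[]} \cdots \mathbin{[]} \neg f \wedge g_{A_n} \to \mathbf{if}\ \mathit{FS}(c_{A_n}) \mathbin{[]} F_n\ \mathbf{fi} \\
 & \equiv & \mathit{FS}(A_1) \mathbin{[]} \neg f \wedge g_1 \to F_1 \mathbin{[]} \cdots \mathbin{[]} \mathit{FS}(A_n) \mathbin{[]} \neg f \wedge g_n \to F_n \\
 & \equiv & \mathit{FS}(A_1 \mathbin{[]} \cdots \mathbin{[]} A_n) \mathbin{[]} \neg f \wedge g_1 \to F_1 \mathbin{[]} \cdots \mathbin{[]} \neg f \wedge g_n \to F_n \\
 & = & \mathit{FS}(P) \mathbin{[]} \neg f \wedge g_1 \to F_1 \mathbin{[]} \cdots \mathbin{[]} \neg f \wedge g_n \to F_n
\end{array}$$

The fault transformation defined above is based on the assumption that each fault action in $F$ may interrupt the execution of any action in $P$. Such an assumption simplifies the presentation of the failure semantics, the fault transformations and the discussion of their properties. But this assumption is not essential for the results obtained in this section. In general, an action in $P$ can be interrupted by only a subset of the fault actions in $F$.

For a program $P = A_1 \mathbin{[]} \cdots \mathbin{[]} A_n$, $A_i = g_i \to c_i$, let

$$\mathit{FS}(A_i) = \neg f_{A_i} \wedge g_i \to \mathit{FS}(c_i)$$

where $\mathit{FS}(c_i)$ is the $f_{A_i}$-stop command of $c_i$. For each action $A_i$, let $f_i$ be a boolean variable which is true if a fault occurs in the execution of $A_i$. A fault action $A_f \in F$ interrupts the action $A_i$ if $A_f$ transforms $f_i$ from false to true:

$$A_f \text{ interrupts } A_i \;\stackrel{\text{def}}{=}\; \{\neg f_i\}\,A_f\,\{f_i\}$$

A fault action $A_f \in F$ stops action $A_i$ if it makes $f_{A_i}$ true, i.e.

$$A_f \text{ stops } A_i \;\stackrel{\text{def}}{=}\; \exists f_j : \{\neg f_j\}\,A_f\,\{f_j\} \wedge (f_j \Rightarrow f_{A_i})$$

where $A_f \in F$ and $\{f_i\}\,A_f\,\{f_i\}$ for $i = 1, \ldots, n$. Let $F_{A_i}$ be the set of fault actions in $F$ which interrupt action $A_i$,

$$F_{A_i} = \{A_f \mid A_f \in F \wedge A_f \text{ interrupts } A_i\}$$

Then $A_i\;F_{A_i}$ is defined as:

$$A_i\;F_{A_i} = \neg f_{A_i} \wedge g_i \to \mathcal{M}(\mathit{FS}(c_i))$$

where $\mathcal{M}(\mathit{FS}(c_i))$ is obtained by applying to the command $c_i$ the transformations $\mathit{FS}$ and $\mathcal{M}$ w.r.t. $f_i$ and $F_{A_i}$. The fault affected program $\mathcal{F}(P)$ is then defined as:

$$\mathcal{F}(P) = P\;F = A_1\;F_{A_1} \mathbin{[]} \cdots \mathbin{[]} A_n\;F_{A_n}$$

It is easy to see that the transformation $\mathcal{F}$ defined in this way still satisfies Theorem 4.2. And if no fault action interrupts the action $A_i$, then

$$A_i\;F_{A_i} \equiv A_i$$

In particular, let $P = p_1 \mathbin{[]} \cdots \mathbin{[]} p_n$ be a program with $n$ processes and $F_{p_i}$ be the fault actions which interrupt the actions of process $p_i$ but not of any other process. The fault transformation $\mathcal{F}$ w.r.t. $\{F_{p_i} \mid i = 1, \ldots, n\}$ is defined as

$$\mathcal{F}(P) = p_1\;F_{p_1} \mathbin{[]} \cdots \mathbin{[]} p_n\;F_{p_n}$$

5.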
Consistency and Recovery Transformation Let P F be the fault aected version of a program P and P0 v : : : v Pk = P be a sequence of renements. When the execution of P F reaches an error state it will stay in that state forever and no further action in P can be executed. Therefore, the behaviour of P F will not in general satisfy the original specication P0, i.e. P F does not rene P0. To make the execution of P recoverable from an error state, the system has to be restored to a good state from which the interrupted execution can resume. Such a good state can be described in terms of consistency with the execution of P . 5.1. Reachability and Consistency For any state S of a program P , let Reach(S ) be the set of states which are reachable from S by executing P , i.e. Reach(S ) = fS j9; 2 OBP : S = last() ^ S = last( ) ^ / g Let, Reachable(S; S ) in P = S 2 Reach(S ) and, as an abbreviation, Reachable(S ) in P = Reachable(initP ; S ) in P Let be a nite observation of P . S is a possible future state of P for , if there is an observation of P which extends to S , i.e. S is forward consistent (ForwCon) with : 0 0 0 0 0 0 0 Transformation of Programs for Fault-tolerance 17 Reachable(last(); S ) F orwCon(S; ) = Let X be a set of disjoint subsets of Var (P ) and = fSX j X 2 Xg, where SX is a substate over X . We say that is forward-consistent with , if there exists a state S such that: 8X 2 X : S jX = SX ^ F orwCon(S; ) We may consider a `state' as the union of sub-states previously reached at dierent points in this execution, provided that this `state' could have been reached in some execution of P . If the execution can continue from this state, then it is consistent. Let X = fX0; : : : ; Xn 1g be a partition of V ar(P ). Consider a set of substates = fSX j Xi 2 Xg which occur in a sub-observation of during the execution of program P . More precisely, let = 1 ^ ^ . 
For each Xi 2 X there exists ki such that i < j ) ki < kj and satises condition C: (C) 8Xi 2 X : (ki )jX = SX is backward consistent with , denoted BackwCon (; ), if there exists a function 0 : Var (P ) ! V y such that i i i BC1 ^ BC2 ^ BC3 where for each Xi 2 X , 0 BC1. 8j : 0 j ki :: (j )jX = (j )jX 0 BC2. 8j > ki : (j )jX = (ki )jX i BC3. 1 ^ 0 2 OB P i i i 1 ^ 0 2 OB P is called a backward recovered observation of from . If BackwCon (; ), for and 0 satisfying these conditions, let 0 = 1 ^ 0 . Then is said to be a backward consistent prex of 0 and denoted as /bc 0 . Similar to the case of forward consistency, let be given for a subset X P (V ar(P )), i.e. = fSX j Xi 2 Xg, satises Condition C. is backward- consistent with , if there exists a state S such that: i 8X 2 X : S jX = SX ^ BackwCon(S; ) Obviously, BackwCon( ; ) ) F orwCon( ; 1 ), where = 1 ^ ^ . A state S is consistent with if it is either forward or backward consistent with , i.e. Consistent(S; ) = F orwCon(S; ) _ BackwCon(S; ) When there is no confusion, we will simplify the denitions and notation by omitting , e.g. Consistent(S ). 5.2. Recovery Transformation To resume the execution of P after interruption by fault actions in F , P has to be transformed into a program R(P ) by adding a set of recovery actions PR called a recovery program. 18 Z. Liu and M. Joseph Let ob be an auxiliary variable whose value space is the set of observations of P F : V ar(P ) ! V where [i] is a good state for each i #. y A state predicate Q over V ar(P ) is extended to be a state predicate Qob over V ar(P ) [ fobg such that for a state S over V ar(P ) [ fobg, Qob (S ) = true if Q(S jV ar(P ) ) ^ 8x 2 V ar(P ) : last(ob)(x) = S (x) For and in the value space of ob, let Ext(; ) = true if 9S 2 V ar(P ) : 8x 2 V ar(P ) : (x) = (x) ^ < S (x) > The predicate ForwExt(; ) = true if ForwCon(last(); ) ^ /. The predicate BackwExt(; ) = true if BackwCon(last(); ) ^ /bc. 
We define

$$ConsExt(\sigma, \sigma') = ForwExt(\sigma, \sigma') \lor BackwExt(\sigma, \sigma')$$

Now let the specifications of $P$ and $F$ over $Var(P)$ be transformed into specifications over $Var(P) \cup \{ob\}$ so that:

1. each $\{Q \land \neg f\}\ A\ \{R\}$ in $P$ is transformed into $\{Q_{ob} \land \neg f \land ob = \sigma\}\ A\ \{R_{ob} \land Ext(ob, \sigma)\}$
2. each $\{Q\}\ A_f\ \{R\}$ in $F$ is transformed into $\{Q \land ob = \sigma\}\ A_f\ \{R \land ob = \sigma\}$
3. each $\{f\}\ A\ \{f\}$ in $P$ or $F$ is transformed into $\{f \land ob = \sigma\}\ A\ \{f \land ob = \sigma\}$

After the execution of $\mathcal{F}(P)$ reaches an error state, the recovery program $P_R$ is invoked and restores the variables to a consistent state. $P_R$ must satisfy the following conditions:

R1. any action of $FS(P)$ excludes each action $A_r \in P_R$: $\forall A \in P : \neg f \land g_A \Rightarrow \neg g_{A_r}$
R2. each action $A_r \in P_R$ excludes any action of $FS(P)$: $\forall A \in P : g_{A_r} \Rightarrow f \lor \neg g_A$
R3. execution of the recovery program $P_R$ cannot be interrupted by the fault program $F$.
R4. $P_R$ transforms an error state into a good consistent state: $(f \land ob = \sigma) \mapsto ((\neg f)_{ob} \land ConsExt(ob, \sigma))$

The recovery program $P_R$ can thus be given as:

    Program P_R:
        ⟨ f → X := X'.ConsExt(ob, ob0) ; f := false ⟩
    End {P_R}

These conditions for $P_R$ allow the recovery transformation $\mathcal{R}$ to take the form

$$\mathcal{R}(P) = P \,[]\, P_R$$

Condition R2 implies that a recovery action $A_r$ does not change a good state and thus

$$\mathcal{R}(P) \equiv P$$

From Condition R3 and Theorem 4.2 we have

$$\mathcal{F}(\mathcal{R}(P)) \equiv \mathcal{F}(P) \,[]\, P_R \equiv FS(P) \,[]\, P_R \,[]\, P_F$$

$\mathcal{F}(\mathcal{R}(P))$ should ideally satisfy the specification $P_0$. Unfortunately, it is not usually possible to have such a strong transformation for an arbitrary program $P$ and arbitrary faults $F$. However, we can often have a recovery transformation such that $\mathcal{F}(\mathcal{R}(P))$ is weaker than $P_0$ but acceptable in terms of its behaviour and

$$F = \emptyset \Rightarrow \mathcal{F}(\mathcal{R}(P)) \equiv P$$

Two kinds of error recovery are used in practice [AL81]. With backward error recovery, the system recovers from a fault by starting from a state which is consistent with its previous states.
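As an illustration of backward error recovery, the following sketch keeps the auxiliary observation `ob` as a list of good states and, on recovery, restores the most recently recorded one. This is only one particular choice of consistent state, made here for simplicity; the recovery program above permits any state satisfying the consistency predicate. The class `Execution` and its method names are hypothetical.

```python
from typing import Callable, Dict, List

State = Dict[str, int]


class Execution:
    """A toy fail-stop execution with an observation history of good states."""

    def __init__(self, initial: State) -> None:
        self.state: State = dict(initial)
        self.f: bool = False                     # fault flag
        self.ob: List[State] = [dict(initial)]   # good states only

    def step(self, update: Callable[[State], State]) -> None:
        """A program action: does nothing while f holds (fail-stop)."""
        if self.f:
            return
        self.state = update(self.state)
        self.ob.append(dict(self.state))

    def fault(self, corrupt: Callable[[State], State]) -> None:
        """A fault action: may damage the state and raises f; ob is unchanged."""
        if not self.f:
            self.state = corrupt(self.state)
            self.f = True

    def recover(self) -> None:
        """Recovery action: restore the last recorded good state, then clear f."""
        if self.f:
            self.state = dict(self.ob[-1])
            self.f = False


run = Execution({"x": 0})
run.step(lambda s: {"x": s["x"] + 1})    # x becomes 1
run.fault(lambda s: {"x": -99})          # state damaged, f raised
run.step(lambda s: {"x": s["x"] + 1})    # ignored: execution has stopped
run.recover()                            # back to the good state x = 1
run.step(lambda s: {"x": s["x"] + 1})    # execution resumes, x becomes 2
```

The sequence of good states recorded in `run.ob` is the same as that of a fault-free run performing two increments, which is the sense in which the recovered state is consistent with the previous states.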
Forward recovery is used when a program has to recover from an error whose effects cannot be overcome by backward recovery or, in a real-time program, when the time constraints do not permit backward recovery. As in the case of backward recovery, for forward recovery the variables $Var(P)$ have to be assigned values so that the state $S$ is good (i.e. $f$ is false) and forward consistent with the current observation $ob$. The recovery transformation $\mathcal{R}$ applies to both backward and forward recovery, which shows that in principle both methods can be used in one fault-tolerant system. Backward and forward recovery programs are special cases of $P_R$ and are specified as $P_{BR}$ and $P_{FR}$ respectively:

    Program P_BR:
        ⟨ f → X := X'.BackwExt(ob, ob0) ; f := false ⟩
    End {P_BR}

    Program P_FR:
        ⟨ f → X := X'.ForwExt(ob, ob0) ; f := false ⟩
    End {P_FR}

where ob0 is the value of $ob$ before the execution of $P_R$.

6. Using Refinement for Fault-tolerance in Programs

The recovery transformation $\mathcal{R}$ (or the recovery program $P_R$) describes what should be done for recovery but imposes no restrictions on when $P_R$ is executed, where it is executed (e.g. on which processor) or how it is to be executed (e.g. how to find a consistent state). These restrictions can be introduced by using transformations on $P \,[]\, P_R$ which will be described in terms of $F$-refinement.

Given a fault environment $F$, program $P'$ is said to $F$-refine program $P$ (denoted as $P \sqsubseteq_F P'$) if

$$(P \sqsubseteq P') \land (P'\ \mathbf{Sat}\ (f \mapsto \neg f))$$

and for each execution $E$ of $\mathcal{F}(P')$ in which there are finitely many error states of which $E[k]$ is the last,

$$E[k+1](ob)\,{}^\frown\,last(E[k+2](ob))\,{}^\frown \cdots {}^\frown\,last(E[k+n](ob)) \cdots$$

is an execution of $P$.

As for the equivalence of fault-prone programs, it is not necessarily the case that $P \sqsubseteq_F P'$ if $P'$ refines $P$. However, the following results are easily proved:

Theorem 6.1. Given programs $P$, $P'$, $P''$ and a fault environment $F$,

1. $P \sqsubseteq_F \mathcal{R}(P)$
2. $P \sqsubseteq_F P' \sqsubseteq_F P'' \Rightarrow P \sqsubseteq_F P''$
3. $P \sqsubseteq P' \Rightarrow \mathcal{R}(P) \sqsubseteq_F \mathcal{R}(P')$
4. $P \sqsubseteq P' \Rightarrow P \sqsubseteq_F \mathcal{R}(P')$

Proof. Directly from the definition of $F$-refinement and the specification of the transformation $\mathcal{R}$.

The first two results show that fault-tolerance is introduced and preserved by $F$-refinement transformations. From the next two results, it can be seen that refinement transformations can be applied to the original program and fault-tolerance introduced after that, using $F$-refinement.

Corollary 6.1. Given a program $P$ and a fault environment $F$,

1. $P_R \sqsubseteq P_R' \Rightarrow \mathcal{R}(P) \sqsubseteq_F P \,[]\, P_R'$
2. $P \sqsubseteq_F \mathcal{R}(P) \sqsubseteq_F P \,[]\, P_{BR}$
3. $P \sqsubseteq_F \mathcal{R}(P) \sqsubseteq_F P \,[]\, P_{FR}$

As an example of the first result in Corollary 6.1, let $S_0$ be the initial state of program $P$ and

    Program P_R':
        ⟨ f → X := S0 ; f := false ⟩
    End {P_R'}

then

$$(P_R \sqsubseteq P_R') \land (P \sqsubseteq_F P \,[]\, P_R \sqsubseteq_F P \,[]\, P_R')$$

Program $P \,[]\, P_R'$ is a fault-tolerant program such that whenever a fault occurs, the execution of $P$ will recover by re-starting from its initial state [JMS87, HH87]. The following theorem introduces a useful $F$-refinement rule.

Theorem 6.2. Given a program $P = A_1 \,[]\, \ldots \,[]\, A_n$ and a fault environment $F$, let $I$ be a non-empty subset of $\{1, \ldots, n\}$ and

$$\mathcal{R}_I(P) = A_1' \,[]\, \ldots \,[]\, A_n'$$

where for each $i \in \{1, \ldots, n\}$,

$$A_i' = \begin{cases} \mathbf{if}\ A_i \,[]\, \neg g_i \to skip\ \mathbf{fi};\ \mathbf{if}\ \neg f \to skip \,[]\, P_R\ \mathbf{fi} & \text{if } i \in I \\ A_i & \text{if } i \notin I \end{cases}$$

then

1. $\mathcal{R}(P) \sqsubseteq_F \mathcal{R}_I(P)$
2. $I \subseteq J \Rightarrow \mathcal{R}_I(P) \sqsubseteq_F \mathcal{R}_J(P)$

The next theorem provides a rule to allow a recovery action to be freely introduced at any point in the program.

Theorem 6.3. For the program $\mathcal{R}_I(P) = A_1' \,[]\, \ldots \,[]\, A_n'$ given in Theorem 6.2, if an action $A_i'$ of $\mathcal{R}_I(P)$ can be written as $a_{i1}; a_{i2}$, let

$$A_i'' = a_{i1};\ \mathbf{if}\ \neg f \to a_{i2} \,[]\, f \to P_R\ \mathbf{fi}$$

Then

$$\mathcal{R}_I(P) \sqsubseteq_{T(F)} \mathcal{R}_I(P)[A_i'/A_i'']$$

It is noted that for any command $c$

$$c \equiv skip; c \equiv c; skip$$

and we assume here that the fault environment of $skip$ is always empty:

$$\mathcal{F}(c) \equiv \mathcal{F}(skip; c) \equiv \mathcal{F}(c; skip)$$

We can therefore introduce the recovery action $P_R$ (and its refined versions), by replacing such a $skip$ with $P_R$, at any suitable place.
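The shape of the transformed actions in Theorem 6.2, combined with the restart-from-initial-state recovery program of Corollary 6.1, can be sketched as follows. This is a minimal illustration under stated assumptions: states are dictionaries with a fault flag `"f"`, a fault is simulated by a command that damages the state and sets the flag, and the names `with_recovery`, `restart` and `increment` are hypothetical.

```python
from typing import Callable, Dict

State = Dict[str, object]
Guard = Callable[[State], bool]
Command = Callable[[State], State]

S0: State = {"x": 0, "f": False}   # assumed initial state of the program


def restart(s: State) -> State:
    """Recovery by restart: if f holds, reset all variables to S0 and clear f."""
    return dict(S0) if s["f"] else s


def with_recovery(guard: Guard, command: Command, recover: Command) -> Command:
    """Run the action if its guard holds (else skip), then recover if f is set."""
    def action(s: State) -> State:
        if guard(s):
            s = command(s)
        if s["f"]:
            s = recover(s)
        return s
    return action


def increment(fail: bool) -> Command:
    """x := x + 1, or a simulated fault that damages x and raises f."""
    def cmd(s: State) -> State:
        if fail:
            return {**s, "x": -1, "f": True}
        return {**s, "x": s["x"] + 1}
    return cmd


step_ok = with_recovery(lambda s: s["x"] < 3, increment(False), restart)
step_bad = with_recovery(lambda s: s["x"] < 3, increment(True), restart)

s1 = step_ok(dict(S0))   # normal step
s2 = step_bad(s1)        # fault during the step, followed by restart
s3 = step_ok(s2)         # execution resumes from the initial state
```

Placing the recovery check directly after the action corresponds to choosing that action's index in the set $I$; actions outside $I$ would be left unwrapped.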
A program $P$ can be refined using rules of the form defined by Back [BS88], and then Theorem 6.1 can be used to add $F$-refinement. Checkpointing actions can then be added to $P$ (or the refined version of $P$) by introducing new variables and assignments [BS88, Mor90]. Then, following Theorem 6.2 and Theorem 6.3, recovery actions can be introduced at appropriate points. The choice of recovery point may follow well-known practice, e.g. by using recovery blocks or conversations. It can be shown that the checkpointing and recovery protocol suggested by Koo and Toueg [KT87] can also be achieved by using $F$-refinement. It may be noticed that Theorem 6.2 and Theorem 6.3 also hold for any program $P_R'$ with $P_R \sqsubseteq_F P_R'$.

7. Example: A Protocol for Communication Over Faulty Channels

In this section, we consider an example of fault-tolerant programming using the methods presented in this paper. The problem is to design a protocol that guarantees reliable communication from a sender to a receiver in spite of faults in the communication channel between them. In the following programs, we shall omit the declaration and initial values of the variables if they are clear from the context.

The Sender process produces an infinite sequence ms0 of data. The Receiver process reads in a sequence mr satisfying the following specification:

    invariant mr ◁ ms0            (i.e. mr is a prefix of ms0)
    #mr = n ↦ #mr = n + 1         (i.e. the length of mr increases eventually)

If the sender and the receiver communicate over an unbounded reliable FIFO channel c, the communication between them can be implemented using the program Sender-Receiver.

    Program Sender-Receiver
    initially
        ms, mr, c = ms0, <>, <>
    actions
        c, ms := c ^ head(ms), rest(ms)                      (Sender)
        []
        c ≠ <> → c, mr := rest(c), mr ^ head(c)              (Receiver)
    End {Sender-Receiver}

The FIFO channel c can be implemented as the following program:

    Program C
    initially
        cs, cr = <>, <>
    actions
        cs ≠ <> → cs, cr := rest(cs), cr ^ head(cs)
    End {C}

Let program P be

    Program P
    initially
        ms, mr, cs, cr = ms0, <>, <>, <>
    actions
        cs, ms := cs ^ head(ms), rest(ms)
        []
        cr ≠ <> → mr, cr := mr ^ head(cr), rest(cr)
    End {P}

Then,

    Sender-Receiver ⊑ C [] P

Now assume that the channel C is faulty and has the following behaviour:

1. any message sent along the channel may be lost; however, only a finite number of messages can be lost consecutively;
2. any message sent along the channel may be replicated, but no message can be replicated forever;
3. messages are not permuted, i.e. messages are delivered in the order in which they are sent; and
4. messages are not corrupted, i.e. their contents are not altered.

A program F simulating such a fault environment can be given as:

    Program F
    declare
        b, f : boolean
    initially
        cs, cr, b, f = <>, <>, false, false
    actions
        ¬b ∧ cs ≠ <> → cs, f := rest(cs), true               (loss)
        []
        ¬b ∧ cs ≠ <> → cr, f := cr ^ head(cs), true          (duplication)
        []
        b := true          (guarantees the finiteness of consecutive faults)
    End {F}

Consider the refinement C ⊑ C1, where

    Program C1
    initially
        cs, cr, b = <>, <>, false
    actions
        cs ≠ <> → cs, cr, b := rest(cs), cr ^ head(cs), false
    End {C1}

The behaviour of a faulty channel can be simulated by a program FC = F(C1) with the fail-stop assumption:

    Program F(C1)
    declare
        b, f : boolean
    initially
        b, f = false, false
    actions
        ¬b ∧ cs ≠ <> → loss
        []
        ¬b ∧ cs ≠ <> → duplication
        []
        ¬f ∧ cs ≠ <> → correct-transfer
        []
        b := true          (guarantees the finiteness of consecutive faults)
    End {F(C1)}

where

1. loss = cs, f := rest(cs), true
2. duplication = cr, f := cr ^ head(cs), true
3. correct-transfer = b, cs, cr := false, rest(cs), cr ^ head(cs)

And F(Sender-Receiver) (or F(C [] P)) is refined to the program F(C1 [] P), given below.

    Program F(C1 [] P)
    declare
        b, f : boolean
    initially
        ms, mr, cs, cr, b, f = ms0, <>, <>, <>, false, false
    actions
        ¬f → cs, ms := cs ^ head(ms), rest(ms)
        []
        ¬f ∧ cr ≠ <> → mr, cr := mr ^ head(cr), rest(cr)
        []
        ¬b ∧ cs ≠ <> → loss
        []
        ¬b ∧ cs ≠ <> → duplication
        []
        ¬f ∧ cs ≠ <> → correct-transfer
        []
        b := true          (guarantees the finiteness of consecutive faults)
    End {F(C1 [] P)}

To design a recovery program for Sender-Receiver, let cs and cr be sequence variables whose elements are pairs (integer, data item). Let C1 [] P be refined to P1:

    Program P1
    declare
        ks, kr : integer
    initially
        cs, cr, ks, kr, b = <>, <>, 1, 0, false
    actions
        cs, ks := cs ^ (ks, ms[ks]), ks + 1
        []
        cr ≠ <> → mr, cr, kr := mr ^ head(cr).val, rest(cr), kr + 1
        []
        cs ≠ <> → cs, cr, b := rest(cs), cr ^ head(cs), false
    End {P1}

A message is lost if head(cr).dex > kr + 1 and a message is duplicated if head(cr).dex ≤ kr. Therefore, the fault indicating variable f can be implemented as

    f ⇔ (lost ∨ duplic)

where

    lost = head(cr).dex > kr + 1   and   duplic = head(cr).dex ≤ kr

It is sufficient to assume that only the receiving command is f-stop. Such a fail-stop version of P1 is then given as FS(P1):

    Program FS(P1)
    initially
        cs, cr, ks, kr, b = <>, <>, 1, 0, false
    actions
        cs, ks := cs ^ (ks, ms[ks]), ks + 1
        []
        ¬f ∧ cr ≠ <> → mr, cr, kr := mr ^ head(cr).val, rest(cr), kr + 1
        []
        cs ≠ <> → b, cs, cr := false, rest(cs), cr ^ head(cs)
    End {FS(P1)}

The fault affected version F(C1 [] P) can be refined to F(P1), which is M(FS(P1)):

    Program F(P1)
    initially
        cs, cr, ks, kr, b = <>, <>, 1, 0, false
    actions
        cs, ks := cs ^ (ks, ms[ks]), ks + 1
        []
        ¬f ∧ cr ≠ <> → mr, cr, kr := mr ^ head(cr).val, rest(cr), kr + 1
        []
        ¬b ∧ cs ≠ <> → loss
        []
        ¬b ∧ cs ≠ <> → duplication
        []
        cs ≠ <> → correct-transfer
        []
        b := true          (guarantees the finiteness of consecutive faults)
    End {F(P1)}

The recovery program for P1 in R(P1) is refined to P1R given below.

    Program P1R
    initially
        ks, kr = 1, 0
    actions
        lost → do cr ≠ <> → cr := rest(cr) od ;
               do cs ≠ <> → cs := rest(cs) od ;
               ks := kr + 1 ;
               cs, ks := cs ^ (ks, ms[ks]), ks + 1 ;
               cr, cs := cr ^ head(cs), rest(cs)
        []
        duplic → do duplic ∧ (cr ≠ <>) → cr := rest(cr) od
    End {P1R}

In the following, we provide an informal outline of the proof that F(P1) [] P1R satisfies the specification of the program Sender-Receiver. (A full formal proof is given in [Liu91].)

Proof. To prove the invariant mr ◁ ms:

a) mr ◁ ms is true initially,
b) each action in F(P1) and P1R leaves mr ◁ ms stable.

Hence mr ◁ ms is invariant.

To prove the progress property #mr = n ↦ #mr = n + 1:

a) it is easily seen that #mr = kr is invariant,
b) from both F(P1) and P1R, each message in ms will eventually be transferred to cs,
c) each message in cs will eventually be transferred to cr or lost,
d) if a message in cs is lost, it will be re-sent by the recovery action,
e) F(P1) also guarantees that the correct transfer action will eventually be executed,
f) by the second recovery action in P1R and the receiving action in F(P1), each message transferred into cr by the execution of the correct transfer action in F(P1) will eventually be transferred to mr.

Therefore progress is guaranteed.

P1 [] P1R is thus a version of Sender-Receiver which can tolerate message loss and duplication; it can of course be refined further for a different implementation [CM88].

8. Discussion

Assume that $P_0 = Sp$ is the top level specification of a program $P$ in a UNITY-like notation. The top level specification of the fault environment $F_0$ can be given as

$$F_0 : true \mapsto f$$

$P_0$ can then be refined into an action system $P_1$ while $F_0$ can be simulated (or refined) by an action system $F_1$ consisting of one action

$$true \to f := true$$

Based on $P_1$ and $F_1$, the fault and recovery transformations can then be applied to program $P_1$ in the way described in Section 5.2. For each refinement step $P_k \sqsubseteq P_{k+1}$ of the original program $P_0$, a refinement step $F_k \sqsubseteq F_{k+1}$ of the fault environment is derived by providing more details about the system and its possible faults.
The fault and recovery transformations can be applied again to $P_{k+1}$ and $F_{k+1}$. During the refinement of the original program, checkpoints and recovery methods such as recovery blocks and conversations can be introduced. Thus, in principle, a fault-tolerant program can be produced from its fault-free specification by using $F$-refinement rules. We do not underestimate the difficulty of doing this in practice, but the rules provided here do offer a formal means of verifying what is otherwise left to less precise methods of reasoning.

Each step in the refinement $P_0 \sqsubseteq \cdots \sqsubseteq P_k \sqsubseteq \cdots$ provides more information about the system on which the program is to be executed, and similar information about the faults of the system is used for refining the fault specification. For certain programs and faults, e.g. Sender-Receiver and the faulty channels, the boolean variable $f$ which indicates the presence of faults in $F$ can be derived from the state predicates of program $P_k$ at a suitable stage in the refinement. But it is an open question whether this is always possible for any program and any set of possible faults.

With the fail-stop assumption, special states for abortion and nontermination are not necessary if it is assumed that each action of the original program always terminates. However, without the fail-stop assumption [Liu91], abortion and nontermination have to be considered, because faults may cause abortion or nontermination within an action even if the action always terminates in a fault-free environment.

The logic used for reasoning about programs is based on a built-in fairness condition, known as strong fairness [MP83, GP89]. The logic itself is too weak to express this or any other fairness condition [Fra86]. But as we have seen in the previous sections (and elsewhere [Liu91]), the logic is sufficient for describing at a relatively high level the basic principles of fault-tolerant refinement and transformation. It would be interesting to examine the use of a more powerful logic, such as Lamport's temporal logic of actions (TLA) [Lam90, Lam91], for further work in this direction. There are extensions of Back's refinement calculus for dealing with refinement of parallel programs [Bac89], and in [Liu91] we show how this can be used for fault-tolerant refinement of parallel programs.

Fault-tolerant systems often also have real-time constraints. So it is important that the timing properties of a program are refined along with the fault-tolerant and functional properties defined in a program specification. If we extend the model used in this paper by adding timing properties in some time domain [JG88], the recovery transformation can be defined with timing constraints. The specification and refinement of the recovery actions can then be required to satisfy the condition that after a fault occurs, the system is restored to a consistent state within a time bound which includes the delay caused by the execution of the recovery action. Thus the method described in this paper can be extended to take account of timing constraints. However, there are numerous problems still to be examined in making such a method practical, and these are the goals of further work.

The idea that a hardware fault be modelled as another kind of action (operation) that performs state transformations was described in [Cri85] through an example. There, processor crashes and faults that affect the physical storage medium are taken to be special operations, referred to as fault operations, and described by axioms similar to those used to describe the semantics of ordinary operations. As we have seen in this paper, developing this idea into a general method for systematically constructing fault-tolerant programs is far from trivial.
We have attempted this by defining transformations of programs, because this enables us to formally discuss the fault-tolerant properties of a program at different levels of abstraction.

Acknowledgements

We would like to thank our colleague, Asis Goswami, for helpful discussions on the definition of consistency, and Jozef Hooman for his careful reading of an earlier version of this paper and for many useful suggestions. A discussion with R.J.R. Back was helpful in modelling faults and using the refinement calculus for fault-tolerant programming. Finally, we thank a referee for some spirited and detailed comments which helped to bring the paper to its present form.

References

[AL81] T. Anderson and P.A. Lee. Fault-tolerance: Principles and Practice. Prentice-Hall International, 1981.
[AL88] M. Abadi and L. Lamport. The existence of refinement mapping. In Proc. 3rd IEEE Symposium on Logic and Computer Science, 1988.
[Bac87] R.J.R. Back. A calculus of refinement for program derivations. Technical Report 54, Åbo Akademi, 1987.
[Bac88] R.J.R. Back. Refining atomicity in parallel algorithms. Technical Report 57, Åbo Akademi, 1988.
[Bac89] R.J.R. Back. Refinement calculus, Part II: Parallel and reactive programs. In Lecture Notes in Computer Science 340, pages 67–93. Springer-Verlag, 1989.
[BK83] R.J.R. Back and R. Kurki-Suonio. Decentralization of process nets with centralized control. In Second Annual ACM Symposium on Principles of Distributed Computing, pages 131–142, 1983.
[BR81] E. Best and B. Randell. A formal model of atomicity in asynchronous systems. Acta Informatica, 16:93–124, 1981.
[BS88] R.J.R. Back and K. Sere. Stepwise refinement of parallel algorithms. Technical Report 64, Åbo Akademi, 1988.
[BS89] R.J.R. Back and K. Sere. Stepwise refinement of action systems. Technical Report 78, Åbo Akademi, 1989.
[BW89] R.J.R. Back and J. von Wright. Refinement calculus, Part I: Sequential nondeterministic programs. In Lecture Notes in Computer Science 340, pages 42–66. Springer-Verlag, 1989.
[CM88] K.M. Chandy and J. Misra. Parallel Program Design: A Foundation. Addison-Wesley Publishing Company, 1988.
[Cri85] F. Cristian. A rigorous approach to fault tolerant programming. IEEE Transactions on Software Engineering, SE-11(1):23–31, January 1985.
[Dij76] E.W. Dijkstra. A Discipline of Programming. Englewood Cliffs, New Jersey: Prentice-Hall, 1976.
[Fra86] N. Francez. Fairness. Springer-Verlag, New York, 1986.
[GP89] R. Gerth and A. Pnueli. Rooting UNITY. In Proceedings of the 5th IEEE International Workshop on Software Specification and Design, February 1989.
[HH87] J. He and C.A.R. Hoare. Algebraic specification and proof of a distributed recovery algorithm. Distributed Computing, 2:1–12, 1987.
[Hoa69] C.A.R. Hoare. An axiomatic basis for computer programming. Communications of the ACM, 12(10):576–583, October 1969.
[HW88] M.P. Herlihy and J.M. Wing. Reasoning about atomic objects. In Lecture Notes in Computer Science 331, pages 193–208. Springer-Verlag, 1988.
[JG88] M. Joseph and A. Goswami. What's `real' about real-time systems? In Proceedings of IEEE Real-time Systems Symposium, pages 78–85, Huntsville, Alabama, December 1988.
[JMS87] M. Joseph, A. Moitra, and N. Soundararajan. Proof rules for fault tolerant distributed programs. Science of Computer Programming, 8(1):43–67, February 1987.
[KT87] R. Koo and S. Toueg. Checkpointing and rollback-recovery for distributed systems. IEEE Transactions on Software Engineering, SE-13(1):23–31, January 1987.
[Lam83] L. Lamport. Reasoning about nonatomic operations. In Proc. 10th ACM Conference on Principles of Programming Languages, pages 28–37, 1983.
[Lam87] L. Lamport. win and sin: Predicate transformers for concurrency. Technical Report 17, Systems Research Center of Digital Equipment Corporation in Palo Alto, California, May 1987.
[Lam90] L. Lamport. A temporal logic of actions. Technical Report 57, Digital SRC, California, 1990.
[Lam91] L. Lamport. The temporal logic of actions. Technical Report 79, Digital SRC, California, 1991.
[Liu91] Z. Liu. Fault-Tolerant Programming By Transformations. PhD thesis, Department of Computer Science, University of Warwick, 1991.
[Mor90] C. Morgan. Programming from Specification. Prentice Hall, 1990.
[MP83] Z. Manna and A. Pnueli. How to cook a temporal proof system for your pet language. In Proceedings of 10th Annual ACM Symposium on Principles of Programming Languages, Austin, Texas, 1983.
[MR78] P.M. Merlin and B. Randell. State restoration in distributed systems. In Proceedings of 8th Ann. Int. Symp. on Fault-Tolerant Comput., pages 129–134, Toulouse, France, 1978.
[Ran75] B. Randell. System structure for software fault tolerance. IEEE Transactions on Software Engineering, SE-1(2):220–232, June 1975.
[SS83] R.D. Schlichting and F.B. Schneider. Fail-stop processes: An approach to designing fault-tolerant computing systems. ACM Transactions on Computer Systems, 1(3):222–238, 1983.