Download i ≠ 1 - The Department of Computer Science

Self-Stabilizing Systems as a Base for Autonomic Computing Shlomi Dolev Yinnon Haviv, Reuven Yagel, Olga Brukman Trustworthy Systems: Why Is It So Hard? Corbató’91: "It almost goes without saying that ambitious systems never quite work as expected“ http://larch-www.lcs.mit.edu:8001/~corbato/turing91/ "You must pay extreme attention to detail here. One wrong bit will make things fail… " http://my.execpc.com/~geezer/os/pm.htm From Pentium’s manual: “… if the ESP or SP register is 1 when the PUSH instruction is executed, the processor shuts down due to a lack of stack space. No exception is generated to indicate this condition" Mars Rover - Spirit …The Spirit rover has a radiation-hardened R6000 CPU from Lockheed-Martin Federal Systems…The operating system is Wind River Systems' Vx-Works.. …attempted to allocate more files than the RAM-based directory structure could accommodate. That caused an exception, which caused the task that had attempted the allocation to be suspended… …Spirit fell silent, alone on the emptiness of Mars, trying and trying to reboot http://www.eetimes.com/sys/news/OEG20040220S0046 Linux and Windows do not Stabilize Self-Stabilization Self-healing, Self-managing, Self-* Recovery Oriented Computing [Berkeley, Stanford] Autonomic Computing [IBM] Self-Stabilization  Self-Stabilizing algorithm for mutual exclusion in a ring topology [Dijkstra’74] Well Established Theory ! Self-Stabilizing Systems Elegant fault tolerant approach. Started at any state, the system convergences to a desired behavior. Generally used in distributed systems.  Routing, clock synchronization, leader election, etc. Overcome transient faults in the system.  Transient faults: soft-errors (“98% of RAM errors are soft errors”), wrong CRC during communication etc. Self-Stabilization The combination and type of faults cannot be totally anticipated in on-going systems Any on-going system must be self stabilizing (or manually monitored) L E First Self-Stabilizing Algorithm: Token Passing Token Passing do forever if x1=xn then x1:=(x1+1)mod(n+1) Pi(i ≠ 1):do forever if xi≠xi-1 then xi:=xi-1 1 P1: 2 3 4 5 6 Token Passing Cont. {0; 0; 0; 0; 0}; {1; 0; 0; 0; 0}; {1; 1; 0; 0; 0}; {1; 1; 1; 0; 0}; Surely works when we start in x1 = x2 = … = xn = 0. {1; 1; 1; 1; 0}; One processor may change a state at a time. {2; 2; 1; 1; 1}; {1; 1; 1; 1; 1}; {2; 1; 1; 1; 1}; {2; 2; 2; 1; 1}; {2; 2; 2; 2; 1}; {2; 2; 2; 2; 2} … Token Passing: Faults Transient fault, soft errors, wrong CRC, unexpected temporal severe conditions, etc. Assigns each processor with an arbitrary state (in the range of its state space). For example {3; 4; 4; 1; 0}. p2; p4; and p5 have tokens! Will the system ever recover? Token Passing: Automatic Recovery p1 changes state infinitely often, Otherwise, let s1 be the fixed state of p1, p2 eventually copies s1 from p1, then p3 eventually copies s1 from p2, then ... pn eventually copies s1 from pn-1, then p1 changes state. p1 changes state in the order 4; 5; 0; 1; 2; 3; 4; 5; 0; ... Token Passing: Automatic Recovery Cont. In any initial state at least one state is missing, {4; 4; 1; 0; 2}, 3 and 5 are missing. Once p1 reaches the missing state e.g., 5, all the processors must copy 5, before p1 reads 5 from pn and changes state to 0. Will It Stabilize With mod (n - 2)? Mod 3 {0,0,2,1,0} p1 {1,0,2,1,0} p5 {1,0,2,1,1} p4 {1,0,2,2,1} p3 {1,0,0,2,1} p2 {1,1,0,2,1} +1 mod 3 ! Is Self-Stabilization a Toy? Talk Outline Self Stabilizing Microprocessor [DH04] Self Stabilizing Operating System [DY04] Self-Stabilization Preserving Compiler[DH05] Self-Stabilizing Automatic Recoverer For Eventual Byzantine Software [BDK03] Recovery Oriented Programming[BD05] Self-Stabilizing Microprocessor Overcoming Soft-Errors Shlomi Dolev and Yinnon A. Haviv 17th International Conference on Architecture of Computing Systems (ARCS) Motivation Soft-Errors: Single Event Upsets (SEU) Caused by cosmic ray / other disruptions. Cause a logical gate to flip its content. Currently handled only in memories. Significant impact on the microprocessors. Soft-Errors - Current Solutions Obtaining masking using probabilistic approaches:     Information redundancy (ECC / Parity) Space redundancy Time redundancy Failure detection / recovery. Known solutions:    IBM S-390 Compaq NonStop Himalaya IROC Self-Stabilizing Microprocessor  Self-stabilizing algorithms assume that the microprocessor executes them.  Soft-errors may cause the microprocessor to be stuck in a faulty state.  A microprocessor is self-stabilizes if:   Started in any internal state, converges in a finite number of steps into the set of safe states. Microprocessor’s safe state – in which it performs “fetch-decode-execute” cycle Proving Convergence  Proving that there exists no “bad” cycle in the transition graph of the microprocessor.  Too large ! (we must explore the entire graph)  Using an abstraction:~ Group together states in which the micro-code program counter is the same. h a D g j b A E i c l d B F k C e f Self-Stabilizing Microprocessor: Summary Soft-errors are here to stay, we should:   Design our systems to mask them. Self-stabilize following a non-masked error. We provide methodology for validating selfstabilization property of microprocessors. Talk Outline Self Stabilizing Microprocessor [DH04] Self Stabilizing Operating System [DY04] Self-Stabilization Preserving Compiler[DH05] Self-Stabilizing Automatic Recoverer For Eventual Byzantine Software [BDK03] Recovery Oriented Programming[BD05] Toward Self-Stabilizing Operating System (SOS) Shlomi Dolev and Reuven Yagel, SAACS’04 Workshop, Zaragoza Talk Outline The first self-stabilizing algorithm (of Dijkstra) Self Stabilizing Microprocessor [DH04] Self Stabilizing Operating System [DY04] Self-Stabilization Preserving Compiler[DH05] Self-Stabilizing Automatic Recoverer For Eventual Byzantine Software [BDK03] Recover Oriented Programming[BD05] Basic Directions Black-box   Take existing OS (Unix, Windows, RTOS) Add stabilization layer Carefully tailoring a tiny kernel    Processor scheduling Memory management Device allocation Assumptions Every configuration (processor/memory) is possible At least some program code is hardwired (in ROM) and is correct – Harvard Model Processor:   Instruction manual (e.g. x86\IA-32) defines a transition function. Self-stabilizing [DH04] Black Box Requirements:   Defining a legal execution is usually impractical At least - restore original state (variables + code), infinitely often Periodic Reset Re-install and Execute    Watchdog timer (self-stabilizing) Periodic processor reset During bootstraps OS reinstall from ROM Weak self-stabilization   E = (ci, ai, ci+1, …., RRE, c1, a1, c2, a2, …., ci, ai, ci+1, …., RRE, c1, a1, c2, a2, …. Is it always acceptable? Alternative: Periodic re-install code only, add consistency check and enforcement Tailored Kernel Tiny Scheduler Tiny Memory Manager Requirements:    Self-stabilizing Fair Process stabilization preserving (e.g. validity of P.C. value) Tiny SOS Scheduler ~70 lines of a real machine assembly code 16bit Real mode & 32bit Protected mode. Standard build and emulation tools (Nasm, ld, Bochs) Detailed proof of requirement preservation ; increase task 10 mov word ax, [currentProc] 11 and ax, PROC_MASK ... ; load task state ... ;restore ip 52 mov ax, [bx+4] ;validate ip 53 and ax, IP_MASK 54 mov word [ss:STACK TOP], ax ;restore general registers 55 mov cx, word [bx+12] 56 mov dx, word [bx+14] 57 mov si, word [bx+16] 58 mov di, word [bx+18] Tiny SOS Memory Manager Requirements:   Consistency of memory hierarchy Self-stabilization preservation Tiny SOS Scheduler Clock tick / execute next Any State Some Error Establish Scheduler Consistency Some Error Some Error Process(ing) Next Process Validated & Ready Tiny SOS Scheduler Clock tick / execute next Any State NMI / load PC with scheduler handler Process(ing) Establish Scheduler Consistency Next Process Validated & Ready Sketch of Proof In every execution E, the code of the scheduler is started to be executed and is executed from the first instruction to the last instruction infinitely often In every execution E of the scheduler each process is executed infinitely often The self-stabilizing scheduler preservers stabilization of processes. Talk Outline Self Stabilizing Microprocessor [DH04] Self Stabilizing Operating System [DY04] Self-Stabilization Preserving Compiler[DH05] Self-Stabilizing Automatic Recoverer For Eventual Byzantine Software [BDK03] Recover Oriented Programming[BD05] Self-Stabilization Preserving Compiler Shlomi Dolev, Yinnon A. Haviv, Department of Computer Science Ben-Gurion University, Israel Mooly Sagiv, Department of Computer Science Tel Aviv University, Israel Motivation Transient malfunctions. Single processor:   Hardware glitches. Soft-Errors. Distributed environment:   Processor crashes / recoveries. Link errors. Resulting in an unpredictable system state. Coping with Transient Errors Masking (safety factor) achieved by:   Information redundancy (e.g., ECC). Time/Space redundancy. (e.g., TMR) Self-Stabilization [Dijkstra74]:    Assuming any system state (caused by errors). Recovering by converging into legal behavior. Existing algorithms for distributed tasks:  Routing, leader election, mutual exclusion, etc. Self-Stabilizing Algorithms – a Solution to Soft-Errors? Self-Stabilizing algorithm assumes that the microprocessor executes it.  Soft-Errors may cause the microprocessor to be stuck in a faulty state. Composition of self-stabilizing algorithms creates a self-stabilizing system.  Make the microprocessor eventually fetch-decodeexecute machine code. The Gap. Need a transformation between:   Input program P written in a high abstraction language, e.g., (D)ASM. Output program Q in a machine language, say, JVM. Existing compilers?   P and Q behaves the same when started in the initial state. What if Q reaches an unexpected state due to soft-error experienced by microprocessor? Trivial Example A statement of the form: For each i in {0..9} do f(i) May be compiled to  Start with cx=12 inside the loop… Moreover: Any runtime mechanism can get stuck / inconsistent. mov ax, 10 mov cx, 0 loop1: push cx call f inc cx cmp cx,ax jne loop Stabilization Preserving Compiler – a closer look Ensuring that Q eventually behaves as P: State space of P State space of Q Self-Stabilization Preserving Compiler: Summary Front end of compiler established. Typed version of ASM. JavaCC as a parser generator. Interpreter (used as a model). Fast stabilization vs. optimizations. Self Stabilization preserving compiler.  Language with clear semantics from any state.  Innovative demands from compiler. Talk Outline Self Stabilizing Microprocessor [DH04] Self Stabilizing Operating System [DY04] Self-Stabilization Preserving Compiler[DH05] Self-Stabilizing Automatic Recoverer For Eventual Byzantine Software [BDK03] Recover Oriented Programming[BD05] Self-Stabilization and Evolving Systems Real world systems cannot be verified exhaustively… We enforce safety and live-ness specifications Contract between the client, project manager and programmers, that is checked on line! Make sure that the additional (thin) monitoring and recovering layer is self-stabilizing A change can be made to the implementation/specification to support evolving environments Self-Stabilizing Recoverer for Eventual Byzantine Software Olga Brukman, Shlomi Dolev Department of Computer Science Ben-Gurion University, Israel Hillel Kolodner, Haifa Research Labs IBM, Israel Software Contains Bugs Heisenbugs, corrupt states, leaked resources are common… Correct and faultless SW is hard  Long-lived running programs, e.g., OS Usually software is tested when starting from initial state and considering limited time scenarios. Fault Model Reflecting Reality Software packages can be trusted to work as required after restart. Eventual Byzantine software. System administrators and users use reboot to deal with faults. Middleware Architecture <Preds,RActs>1 <Preds,RActs>2 … <Preds,RActs>n OMR Kernel OS Monitor-Restarter for Process and Subsystem <Pred,RActs>1 <Pred,RActs>2 … Restart Actions – Mature Approach Subsystem waits for completion of a restart of its components. Restart action may vary, depending on component internal state.    Reschedule Roll-back Kill & Restart Few restart attempts with more drastic restart actions. Computational Model: rsf-execution An execution E is rsf (restart supporting fair)execution iff E is a fair execution in which every subsystem subi that is initialised during E respects its specification function ssi. Requirement: Every rsf-execution E has a suffix in which the system respects its specification function ss. Tools for Implementation – Black Box Approach Software package is a black box. Package is monitored by recording it’s IO (e.g., strace in Linux). Monitors are independent of specific implementation Tools for Implementation – Transparent Box Approach Software package implementation tool is known. Run-Time Reflection tools are used to monitor and restart the package. Possible in Java, C++, CORBA, COM. Practical Experience: Printers Problem Corrupted pdf, doc or ps file sent to printing server. Printer can’t print the file. Cause retries by printing server  Printer is “stuck” on one job. Predicate for printing server:  Restrict number of retries, try format conversions, send error message to user. Self-Stabilizing Automatic Recoverer: Summary Theory foundations of self-stabilization and restart techniques could serve as a basis for the new paradigms. General framework for design and correctness proof for autonomic recoverer. Printers experience coordinated with IBM. Recovery Oriented Programming Olga Brukman and Shlomi Dolev Department of Computer Science Ben-Gurion University, Israel Towards Robust Software Programming  Structural programming, OOD, Design Patterns… Testing and debugging   Unit testing [JUnit, CppUnit]… Design By Contract (Eiffel) … Formal specification languages  ASM, IO Automata, NURPL Model checking Online recovery   ROC [PBB02]. Self-Stabilizing Autonomic Recoverer for Eventual Byzantine Software [BDK03] - black box software packages. Our Contribution Program invariants derived from design specifications.  Checked every time invariant variables are updated. Automatic code generation for invariant verification and recovery upon invariant violation. Invariants are verified during runtime.  Change of invariant variable is pre-checked in sandbox.  Violation is prevented and replaced with a recovery action. Our Contribution Cont. Recovery action is chosen depending on the current state and history.  Roll back & resume.  Wait.  Reschedule.  Kill & restart. External Monitoring Monitoring the whole task to avoid  transient faults occurrence after which invariant variables are not changed ( and no invariant checks are done)  liveness problem – monitor over time Producer Consumer Example Producer Producer, Consumer – threads Queue – a circular bounded length queue  Queue   Consumer  int queue[size] int start – position of current first element in the queue int end – position of the first empty place in the queue boolean empty Invariants for Queue start, end are in some values range ({0,..,size-1}) Queue is not empty iff start != end A possible invariant is: (start in {0,..,size-1}, end in{0,.., size-1}) && (start != end => empty == false) Given Implementation Code class Queue { private int[] queue; private int start, end, size; private boolean empty; public Queue(int size){ queue = new int[size]; start = 0; end =0; empty= true; } public synchronized int dequeue() { int result; while (empty) wait(); result = queue[start]; start= (start + 1) % size; empty = (start == end) ? true; notifyAll(); return result; } public synchronized void enqueue(int value) { while (start == end && !empty) //while full wait(); queue[end] = value; end = (end + 1) % size; empty=false; notifyAll(); } } Code Transformation class Queue { invariant: (start in {0,..,size-1}, end in {0,.., size-1}) && (start != end => empty == false) recovery actions: {this = new Queue(); empty = false;} private int[] queue; private int start, end, size; private boolean empty; public Queue(int size){ queue = new int[size]; atomic{ start = 0; end =0; empty= true;} } public synchronized int get() { int result; while (empty) wait(); result = queue[start]; atomic{ start= (start + 1) % size; empty = (start == end) ? true;} notifyAll(); return result; } Code Transformation Cont. public synchronized void put(int value) { while (start == end && !empty) //while full wait(); queue[end] = value; atomic{ end = (end + 1) % size; empty=false;} notifyAll(); } } External Monitoring The possible invariant is: (start in {0,..,size-1}, end in {0,.., size-1}) && (start != end => empty = false) && (start = end && empty = false => producer.wait()) && (empty = true =>consumer.wait()) && [(producer.wait() && !consumer.wait()) || (!producer.wait() && consumer.wait())] New recovery actions: interrupt producer/consumer and initialize it. Recovery Oriented Programming: Summary Programming with self-stabilization enforcing. Eventually safe execution. Talk Conclusions Self-Stabilization as an effective paradigm for creating robust systems. Rigorous approach for designing basic system components     Microprocessor Operating system Compiler Evolving and Recovery Oriented

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download i ≠ 1 - The Department of Computer Science