Event-based Failure Prediction: An Extended Hidden Markov Model Approach

Dissertation submitted in partial fulfillment of the requirements for the academic degree of Doktor-Ingenieur (Dr.-Ing.) in Computer Science at the Mathematisch-Naturwissenschaftliche Fakultät II, Humboldt-Universität zu Berlin, by Dipl.-Ing. Felix Salfner, born on 27 April 1974 in Düsseldorf.

President of Humboldt-Universität zu Berlin: Prof. Dr. Christoph Markschies
Dean of the Mathematisch-Naturwissenschaftliche Fakultät II: Prof. Dr. Wolfgang Coy
Reviewers: 1. Prof. Dr. M. Malek, 2. Prof. Dr. Dr. h.c. G. Hommel, 3. Prof. Dr. A. Reinefeld
Date of the oral examination: 6 February 2008

To Gesine, Anton Linus, Henry, and Fabienne.

Acknowledgments

First of all, I would like to thank my doctoral advisor Miroslaw Malek for his ongoing support and advice; I have benefited greatly from his broad experience. I am also very grateful to Katinka Wolter, who has led me to the fascinating beauty of stochastic processes and who has repeatedly helped me to review, rethink, and revise my ideas. A part of this work was carried out as a member of the Graduate School "Stochastische Modellierung und quantitative Analyse großer Systeme in den Ingenieurwissenschaften" (MAGSI), which has provided an inspiring scientific environment. I would like to thank the members of MAGSI for discussions and for giving feedback on my work from the most diverse viewpoints. In particular, I would like to acknowledge the effort of Günter Hommel and Armin Zimmermann (Technical University Berlin) in organizing and providing a forum for stimulating scientific exchange, and of Tobias Harks (Technical University Berlin), who kept a watchful eye on the mathematical aspects of this work. This work was also greatly improved by fruitful discussions with my colleagues, especially with Günther Hoffmann, Maren Lenk, and Peter Ibach, and by the great support from Jan Richling and Steffen Tschirpke, whom I hereby thank.
I am also grateful for discussions, help, and comments from Alexander Schliep (Max Planck Institute for Molecular Genetics, Berlin), Tobias Scheffer and Ulf Brefeld (Max Planck Institute for Computer Science), and Aad van Moorsel (School of Computer Science, Newcastle University), who have given many impulses to my work, and I would like to express my thanks to my old friend Patrick Stiegeler, who was an open-minded reviewer of my thesis. Beyond working life, I am very grateful to my parents for taking good care of me, especially during the writing of the first half of the thesis, and for improving many of the figures found in this dissertation. Finally, I want to extend my most heartfelt thanks to my wonderful wife Fabienne and our children, without whose support and consideration this work would not have come into existence. This work was also supported by the Deutsche Forschungsgemeinschaft (German Research Foundation) project "Failure Prediction in Critical Infrastructures" and by Intel Corporation.

Abstract

Human lives and organizations are increasingly dependent on the correct functioning of computer systems, and their failure might cause personal as well as economic damage. There are two non-exclusive approaches to minimize the risk of such hazards: (a) fault intolerance tries to eliminate design and manufacturing faults in hardware and software before a system is put into service, and (b) fault tolerance techniques deal with faults that occur during service, trying to prevent faults from turning into failures. Since faults, in most cases, cannot be ruled out, we focus on the second approach. Traditionally, fault tolerance has followed a reactive scheme of fault detection, location, and subsequent recovery by redundancy in either space or time. However, in recent years the focus has changed from these reactive methods towards more proactive schemes that try to evaluate the current situation of a running system in order to start acting even before a failure occurs.
Once a failure is predicted, it may either be prevented or the outage may be shifted from unplanned to planned downtime, both of which can significantly improve the system's reliability. The first step in this approach, online failure prediction, is the main focus of this thesis. The objective of online failure prediction is to predict the occurrence of failures in the near future based on the current state of the system as observed by runtime monitoring. A new failure prediction method that builds on the evaluation of error events is introduced in this dissertation. More specifically, it treats the occurrence of errors as an event-driven temporal sequence and applies a pattern recognition technique in order to predict upcoming failures. Hidden Markov models have successfully solved many pattern recognition tasks. However, standard hidden Markov models are not well suited to processing sequences in continuous time, and existing augmentations do not adequately account for the event-driven character of error sequences. Hence, an extension of hidden Markov models has been developed that employs a semi-Markov process for state traversals, providing the flexibility to model a great variety of temporal characteristics of the underlying stochastic process. The proposed hidden semi-Markov model has been applied to industrial data from a commercial telecommunication platform. The case study showed significantly improved failure prediction capabilities in comparison to well-known existing approaches, and it also demonstrated that hidden semi-Markov models perform significantly better than standard hidden Markov models. In order to assess the impact of failure prediction and subsequent actions, a reliability model has been developed that enables the computation of steady-state system availability, reliability, and hazard rate. Based on the model, it is shown that such approaches can significantly improve system dependability.
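The prediction scheme summarized above — a generative sequence model trained on error sequences that preceded failures, another trained on failure-free sequences, and a warning raised when the log-likelihood ratio of the currently observed error sequence exceeds a threshold — can be illustrated in miniature. The sketch below is not the thesis implementation: it substitutes ordinary discrete-time hidden Markov models evaluated with the forward algorithm for the hidden semi-Markov models developed here, and all model parameters and the sample error sequence are invented for the example.

```python
from math import log

def forward_loglik(pi, A, B, obs):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the forward algorithm and per-step normalization
    (the sum of the log normalizers equals the sequence log-likelihood)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    s = sum(alpha)
    loglik = log(s)
    alpha = [a / s for a in alpha]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
        s = sum(alpha)
        loglik += log(s)
        alpha = [a / s for a in alpha]
    return loglik

# Invented two-state models over three error types (symbols 0, 1, 2):
# one standing in for the "failure-prone sequences" model, one for the
# "failure-free sequences" model.
pi = [0.6, 0.4]
A_fail = [[0.7, 0.3], [0.2, 0.8]]
B_fail = [[0.1, 0.6, 0.3], [0.2, 0.2, 0.6]]
A_ok = [[0.9, 0.1], [0.5, 0.5]]
B_ok = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]

def failure_warning(obs, threshold=0.0):
    """Warn if the observed error sequence is more likely under the
    failure model than under the non-failure model."""
    ratio = (forward_loglik(pi, A_fail, B_fail, obs)
             - forward_loglik(pi, A_ok, B_ok, obs))
    return ratio > threshold

print(failure_warning([1, 2, 2, 1, 2]))  # a sequence dominated by error types 1 and 2
```

In the thesis, the same decision rule is applied with hidden semi-Markov models, so that the delays between error events — not only their types and order — contribute to the two likelihoods.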
Keywords: Event-based failure prediction, Hidden semi-Markov model, Proactive fault management, Autonomic Computing

Zusammenfassung

There is hardly any part of our society left that does not depend on the correct and fault-free functioning of, at times, highly complex computer systems. Not only can the survival of entire companies depend on this, but human lives as well. There are two fundamental approaches to dealing with this risk: (a) one tries to eliminate the causes of faults during the design and manufacturing phases, i.e., before the system goes into operation (fault intolerance), and/or (b) in order to prevent a system outage, one tries to build a system that can cope with faults that may occur in production despite sophisticated fault-intolerance techniques (fault tolerance). This thesis concentrates on the latter approach. Traditionally, fault-tolerance techniques have merely reacted to faults, attempting to prevent failures of the overall system through spatial or temporal redundancy. In recent years, however, the focus of research has shifted from these rather static techniques towards more dynamic approaches that attempt to intervene even before a failure occurs. To this end, the state of the running system is monitored and analyzed in order to predict a potential outage. If an outage is imminent, an attempt is then made either to prevent it or to prepare for it in order to reduce repair time. Both can considerably improve the reliability of the system. This thesis is primarily concerned with the prediction of failures and pursues an approach based on the recognition of patterns in sequences of error events.
The prediction method developed here is the first to successfully integrate both the type of error events and their time of occurrence, and it applies a pattern recognition technique to decide whether an error sequence observed in the system is symptomatic of an imminent failure or not. The pattern recognition technique is based on hidden Markov models that have been extended to hidden semi-Markov models, which better reflect the event-driven character of errors. The failure prediction method was applied to and evaluated on data from a commercial telecommunication platform. Significantly better prediction quality is achieved in comparison both to the best-known existing methods and to conventional discrete-time hidden Markov models. Failure prediction is only the first important step towards proactive fault handling: following a prediction, actions must be carried out in order to prevent an imminent failure or to minimize its consequences. This thesis presents a reliability model with which the steady-state availability, reliability, and hazard rate of systems with failure prediction and subsequent actions can be computed. With the help of this model it can be shown that the combination of failure prediction and subsequent actions can considerably improve system reliability.

Schlagwörter (Keywords): Event-driven failure prediction, Hidden semi-Markov model, Proactive fault tolerance, Autonomic Computing

Contents

List of Figures
List of Tables
Mathematical Notation
Preface

I  Introduction, Problem Statement, and Related Work

1  Introduction, Motivation and Main Contributions
1.1  From Fault Tolerance to Proactive Fault Management
1.2  Origins and Background
1.3  Outline of the Thesis
1.4  Main Contributions
2  Problem Statement, Key Properties, and Approach to Solution
2.1  A Definition of Online Failure Prediction
2.1.1  Failures
2.1.2  Online Prediction
2.2  The Objective of the Case Study
2.3  Key Properties
2.4  Approach
2.5  Analysis of the Approach
2.5.1  Identifiable Types of Failures
2.5.2  Identifiable Types of Faults
2.5.3  Relation to Other Research Areas and Issues
2.6  Summary

3  A Survey of Online Failure Prediction Methods
3.1  A Taxonomy and Survey of Online Failure Prediction Methods
3.2  Methods Used for Comparison
3.2.1  Dispersion Frame Technique
3.2.2  Eventset Method
3.2.3  SVD-SVM Method
3.2.4  Periodic Prediction
3.3  Summary

4  Introduction to Hidden Markov Models and Related Work
4.1  An Introduction to Hidden Markov Models
4.1.1  The Forward-Backward Algorithm
4.1.2  Training: The Baum-Welch Algorithm
4.2  Sequences in Continuous Time
4.2.1  Four Approaches to Incorporate Continuous Time
4.3  Related Work on Time-Varying Hidden Markov Models
4.4  Summary

II  Modeling

5  Data Preprocessing
5.1  From Logfiles to Sequences
5.1.1  From Messages to Error-IDs
5.1.2  Tupling
5.1.3  Extracting Sequences
5.2  Clustering of Failure Sequences
5.2.1  Obtaining the Dissimilarity Matrix
5.2.2  Grouping Failure Sequences
5.2.3  Determining the Number of Groups
5.2.4  Additional Notes on Clustering
5.3  Filtering the Noise
5.4  Improving Logfiles
5.4.1  Event Type and Event Source
5.4.2  Hierarchical Numbering
5.4.3  Logfile Entropy
5.4.4  Existing Solutions
5.5  Summary

6  The Model
6.1  The Hidden Semi-Markov Model
6.1.1  Wrap-up of Semi-Markov Processes
6.1.2  Combining Semi-Markov Processes with HMMs
6.2  Sequence Processing
6.2.1  Recognition of Temporal Sequences: The Forward Algorithm
6.2.2  Sequence Prediction
6.3  Training Hidden Semi-Markov Models
6.3.1  Beta, Gamma and Xi
6.3.2  Reestimation Formulas
6.3.3  A Summary of the Training Algorithm
6.4  Difference Between the Approach and Other HSMMs
6.5  Proving Convergence of the Training Algorithm
6.5.1  A Proof of Convergence Framework
6.5.2  The Proof for HSMMs
6.6  HSMMs for Failure Prediction
6.7  Computational Complexity
6.8  Summary

7  Classification
7.1  Bayes Decision Theory
7.1.1  Simple Classification
7.1.2  Classification with Costs
7.1.3  Rejection Thresholds
7.2  Classifiers for Failure Prediction
7.2.1  Threshold on Sequence Likelihood
7.2.2  Threshold on Likelihood Ratio
7.2.3  Using Log-likelihood
7.2.4  Multi-class Classification Using Log-Likelihood
7.3  Bias and Variance
7.3.1  Bias and Variance for Regression
7.3.2  Bias and Variance for Classification
7.3.3  Conclusions for Failure Prediction
7.4  Summary
III  Applications of the Model

8  Evaluation Metrics
8.1  Evaluation of Clustering
8.1.1  Dendrograms
8.1.2  Banner Plots
8.1.3  Agglomerative and Divisive Coefficient
8.2  Metrics for Prediction Quality
8.2.1  Contingency Table
8.2.2  Metrics Obtained from Contingency Tables
8.2.3  Plots of Contingency Table Measures
8.2.4  Cost Impact of Failure Prediction
8.2.5  Other Metrics
8.3  Evaluation Process
8.3.1  Setting of Parameters
8.3.2  Three Types of Data Sets
8.3.3  Cross-validation
8.4  Statistical Confidence
8.4.1  Theoretical Assessment of Accuracy
8.4.2  Confidence Intervals by Assuming Normal Distributions
8.4.3  Jackknife
8.4.4  Bootstrapping
8.4.5  Bootstrapping with Cross-validation
8.4.6  Confidence Intervals for Plots
8.5  Summary

9  Experiments and Results Based on Industrial Data
9.1  Description of the Case Study
9.2  Data Preprocessing
9.2.1  Making Logfiles Machine-Processable
9.2.2  Error-ID Assignment
9.2.3  Tupling
9.2.4  Extracting Sequences
9.2.5  Grouping (Clustering) of Failure Sequences
9.2.6  Noise Filtering
9.3  Properties of the Preprocessed Dataset
9.3.1  Error Frequency
9.3.2  Distribution of Delays
9.3.3  Distribution of Failures
9.3.4  Distribution of Sequence Lengths
9.4  Training HSMMs
9.4.1  Parameter Space
9.4.2  Results for Parameter Investigation
9.5  Detailed Analysis of Failure Prediction Quality
9.5.1  Precision, Recall, and F-measure
9.5.2  ROC and AUC
9.5.3  Accumulated Runtime Cost
9.6  Dependence on Application Specific Parameters
9.6.1  Lead-Time
9.6.2  Data Window Size
9.7  Dependence on Data Specific Issues
9.7.1  Size of the Training Data Set
9.7.2  System Configuration and Model Aging
9.8  Failure Sequence Grouping and Filtering
9.8.1  Failure Grouping
9.8.2  Sequence Filtering
9.9  Comparative Analysis
9.9.1  Dispersion Frame Technique (DFT)
9.9.2  Eventset
9.9.3  SVD-SVM
9.9.4  Periodic Prediction Based on MTBF
9.9.5  Comparison with Standard HMMs
9.9.6  Comparison with Random Predictor
9.9.7  Comparison with UBF
9.9.8  Discussion and Summary of Comparative Approaches
9.10  Summary

IV  Improving Dependability, Conclusions, and Outlook

10  Assessing the Effect on Dependability
10.1  Proactive Fault Management
10.1.1  Downtime Avoidance
10.1.2  Downtime Minimization
10.2  Related Models
10.3  The Availability Model
10.3.1  The Original Model for Software Rejuvenation by Huang et al.
10.3.2  Availability Model for Proactive Fault Management
10.4  Computing the Rates of the Model
10.4.1  The Parameters in Detail
10.4.2  Computing the Rates from Parameters
10.5  Computing Availability
10.6  Computing Reliability
10.6.1  The Reliability Model
10.6.2  Reliability and Hazard Rate
10.7  How to Estimate the Parameters from Experiments
10.7.1  Failure Prediction Accuracy
10.7.2  Failure Probabilities P_TP, P_FP, and P_TN
10.7.3  Repair Time Improvement k
10.7.4  Summary of the Estimation Procedure
10.8  A Case Study and an Example
10.8.1  Experiment Description
10.8.2  Results
10.8.3  An Advanced Example
10.9  Summary

11  Summary and Conclusions
11.1  Phase I: Problem Statement, Key Properties and Related Work
11.2  Phase II: Data Preprocessing, the Model, and Classification
11.2.1  Data Preprocessing
11.2.2  The Hidden Semi-Markov Model
11.2.3  Sequence Classification
11.3  Phase III: Evaluation Methods and Results for Industrial Data
11.3.1  Evaluation Methods
11.3.2  Results for the Telecommunication System Case Study
11.4  Phase IV: Dependability Improvement
11.4.1  Proactive Fault Management
11.4.2  Models
11.4.3  Parameter Estimation
11.4.4  Case Study and an Advanced Example
11.5  Main Contributions
11.6  Conclusions

12  Outlook
12.1  Further Development of Prediction Models
12.1.1  Improving the Hidden Semi-Markov Model
12.1.2  Bias and Variance
12.1.3  Online Learning
12.1.4  Further Issues
12.1.5  Further Application Domains for HSMMs
12.2  Proactive Fault Management

V  Appendix

Derivatives with respect to Parameters for Selected Distributions
Erklärung
Acronyms
Index
Bibliography

List of Figures

1.1  Predict-react cycle
1.2  The engineering cycle
2.1  Definitions and interrelations of faults, errors and failures
2.2  Four stages where faults can become visible
2.3  Distinction between root cause analysis and failure prediction
2.4  Time relations in online failure prediction
2.5  Failure definition for the case study
2.6  Data acquisition setup
2.7  Two phase machine learning approach
2.8  Dependencies among components lead to a temporal sequence of errors
2.9  Overview of the training procedure
2.10  Overview of the online failure prediction approach
2.11  Permanent, intermittent and transient faults (Siewiorek & Swarz [241])
2.12  Fault model based on Barborak et al. [23]
3.1  A taxonomy for online failure prediction approaches
3.2  Failure prediction by function approximation
3.3  Failure prediction using signal processing techniques
3.4  Failure prediction based on the occurrence of errors
3.5  Failure prediction by recognition of failure-prone error patterns
3.6  Dispersion Frame Technique
3.7  The eventset method
3.8  Bag-of-words representation of error sequences
3.9  Singular value decomposition
3.10  Maximum margin classification using support vector machines
4.1  Discrete Time Markov Chain
4.2  A discrete-time hidden Markov model
4.3  A trellis to visualize the forward algorithm
4.4  A trellis visualizing the computation of ξ_t(i, j)
4.5  Notations for event-driven temporal sequences
4.6  Incorporating continuous time by time slotting
4.7  Duration modeling by a discrete-time HMM with self-transitions
4.8  Representing time by delay symbols
4.9  Delay representation by two-dimensional output probability distributions
4.10  Duration modeling by explicit modeling of state durations
4.11  Topology of an Expanded State HMM
5.1  From faults to error messages
5.2  Truncation and collision in tupling
5.3  Plotting the number of tuples over time window size ε
5.4  Extracting sequences
5.5  For each failure sequence F^i, a separate HSMM M^i is trained
5.6  Matrix of logarithmic sequence likelihoods
5.7  Inter-cluster distance rules
5.8  Noise filtering
5.9  Three different sequence sets to compute symbol prior probabilities
5.10  Hierarchical error numbering with SHIP
5.11  An inherent problem of hard classification approaches
5.12  Sets of required information and given information of a log record
5.13  A plot of log entropy
5.14  Principle structure of a Common Base Event
6.1  A semi-Markov process
6.2  A sample hidden semi-Markov model
6.3  Notation for temporal sequences
6.4  Summary of the complete training algorithm for HSMMs
6.5  A simplified sketch of phoneme assignment to a speech signal
6.6  Assigning states to observations in speech processing
6.7  Trellis structure for the forward algorithm with duration modeling
6.8  Lower bound optimization
6.9  Gradient vector projection
6.10  Failure prediction model structure used for training
6.11  Model with intermediate states
7.1  Classification by maximum posterior for a two-class example
7.2  Error in regression problems
7.3  True and estimated posterior probabilities
7.4  Distribution of estimated posterior
7.5  Boundary error plots
7.6  Early stopping
8.1  Dendrograms
8.2  Banner plots
8.3  Sample precision/recall-plot for two failure predictors
8.4  Sample ROC plot
8.5  Relation between ROC plots and precision and recall
8.6  Detection error trade-off plot
8.7  Iso-cost lines in ROC space
8.8  Determining minimum cost from ROC
8.9  Cost curves
8.10  Exemplary accumulated runtime cost
8.11  AUC can be misleading
8.12  Cross-validation and bootstrapping
8.13  Averaging ROC curves
9.1  Experiment setup
9.2  Typical error log record
9.3  Levenshtein similarity plot
9.4  Effect of tupling window size for cluster-wide logfile
9.5  Effect of tupling window size
9.6  HSMM topology for failure sequence grouping
9.7  Effect of clustering methods
9.8  Effect of number of states
9.9  Effect of background distribution weight
9.10  Values of Xi for noise filtering: Cluster prior
9.11  Values of Xi for noise filtering: Cluster failure sequences
9.12  Values of Xi for noise filtering: all sequences
9.13  Mean sequence length depending on filtering threshold
9.14  Number of errors per five minutes
9.15  Histogram and QQ-plots of delays between errors
9.16  Analysis of time between failure
9.17  Normalized autocorrelation of failure occurrence
9.18  Histogram and ECDF for the length of sequences
9.19  Average negative training sequence log-likelihood
9.20  Mean training time for number of states and maximum span of shortcuts
9.21  Computation times for testing
9.22  Upper bounds for mean testing times
9.23  Precision/recall and F-measure plot for industrial data
9.24  ROC plot for industrial data
9.25  Accumulated runtime cost for industrial data
9.26  Failure prediction performance for various lead-times
9.27  Effects of data window size ∆t_d
9.28  Data sets for experiments investigating size of the data set
F-measure and Training time as function of size of training data set . . . Data sets for experiments investigating system configuration . . . . . . Prediction quality as function of train-test gap. . . . . . . . . . . . . . . Precision / recall plot and ROC plot for single failure group model . . . Histograms of time-between-errors for DFT . . . . . . . . . . . . . . . Precision/recall and ROC plot for the SVD-SVM prediction algorithm . Summary of prediction results for comparative approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 179 180 181 182 184 186 187 189 189 190 191 192 194 195 196 197 201 203 204 204 206 207 208 209 210 210 211 212 213 214 215 217 220 10.1 10.2 10.3 10.4 10.5 10.6 10.7 10.8 10.9 10.10 10.11 10.12 10.13 Principle approach of proactive fault management . . . . . . . . . Improved TTR for prediction-driven repair schemes . . . . . . . . The original rejuvenation model . . . . . . . . . . . . . . . . . . Availability model for proactive fault management . . . . . . . . . Four cases of prediction including lead-time and prediction-period. Time relations for prediction . . . . . . . . . . . . . . . . . . . . CTMC model for reliability . . . . . . . . . . . . . . . . . . . . . Four situations in failure prediction experiments . . . . . . . . . . Cases with fault injection . . . . . . . . . . . . . . . . . . . . . . Summary of the procedure to estimate model parameters . . . . . Overview of the case study . . . . . . . . . . . . . . . . . . . . . Reliability for the case study . . . . . . . . . . . . . . . . . . . . Hazard rate for the case study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228 230 234 235 238 242 245 247 250 253 254 256 257 xix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.14 Reliability for the more sophisticated example . . . . . . . . . . . . . . . 10.15 Hazard rate for the more sophisticated example. . . . . . . . . . . . . . 
11.1 Trade-off between predictive power and complexity
12.1 Steps of proactive fault management

List of Tables

8.1 Contingency table
8.2 Metrics obtained from contingency table
9.1 Number of different log messages
9.2 Experiment settings for detailed analysis
9.3 Contingency table for a random predictor
9.4 Contingency table for the UBF failure prediction approach
9.5 Summary of computation times for comparative approaches
10.1 Actions performed after prediction
10.2 Parameters used for modeling
10.3 Simplified contingency table
10.4 Solution to the steady-state equations for availability
10.5 Mapping of cases to situations
10.6 Estimation results for the case study
10.7 Relative amount of the four types of prediction
10.8 Parameters assumed for the sophisticated example

Mathematical Notation

• Vectors are typeset in bold lower-case letters or square brackets, such as π = [π1, . . . , πN]
• Matrices are typeset in bold capital letters, such as B = [bij], or as a vector of vectors, such as A = [ai]
• Sets are indicated by curly brackets, such as E = {x, y, z}
• Random variables are denoted by capital letters such as X.
If a random variable is fixed to some value, the notation X = x is used
• Observation symbols are denoted by lower-case letters oi ∈ O, where O denotes the alphabet of size M. The alphabet is simply a set of observation symbols: O = {o1, . . . , oM}
• Sequences of observations are denoted by a sequence of random variables O without separating commas, such as O1 O2 . . . OL. For a specific, given sequence of observations, the vector notation o = [Ot] is used
• The notation Oi = ok expresses that the i-th element of an observation sequence is equal to symbol ok
• States are denoted by lower-case letters si ∈ S, where S denotes the set of all N states. Similar to observations, random variables denoting states use a capital S, and sequences of states are defined equivalently to observation sequences
• Observation probabilities in hidden Markov models are denoted either in matrix form B = [bij] or in functional form bij = bi(oj)

Preface

There are no faults, only lessons.
— freely adapted from Dr. Chérie Carter-Scott

Knowing the future has always been ingrained in the desires of mankind, and it has been fascinating ever since. Think, for example, of the oracle at Delphi during the classical period of Greece, or of the priests of the oracle at Siwa. Their supposed ability to foresee the future created an aura and a reputation that have lasted for more than 2000 years. Stonehenge, as a second example, has probably been an equinox predictor. Today, predictions are used in a multitude of areas. There are methods to forecast wars, the weather, and winds.1 Financial markets, healthcare, and insurance make heavy use of predictions as well. Turning to physics and engineering, prediction strategies are applied, for example, to predict the path of meteorites, or the future development of a signal in signal processing.
Even in computer science, prediction methods are quite frequently used: in microprocessors, branch prediction tries to prefetch instructions that are most likely to be executed, and memory or cache prediction tries to forecast what data might be required next.2 In this dissertation, prediction techniques are used to forecast the occurrence of system failures. Today, human lives and organizations increasingly depend on the correct functioning of computer systems. Train control systems, emergency systems, stock trading software, and enterprise resource planning systems are only a few examples. A failure in any of these systems may cause huge personal as well as economic damage. However, computer systems have reached a level of complexity that precludes the development of a completely correct system. Therefore, the occurrence of failures cannot be fully ruled out, but the likelihood of their occurrence should be minimized. This dissertation contributes to an approach called proactive fault management, which tries to deal with faults even before a failure has occurred. These methods can be applied most efficiently if it is known whether a failure is imminent in the system or not. This is called online failure prediction, and it is the main topic of this thesis. Turning back to the historic oracles, it was the search for structures3 and interrelations, and the ability to identify fundamental influencing factors, that was essential to their "modus operandi". Based on this knowledge, they were able to analyze the present situation and to infer future developments. These two principles are also the key to the challenge of online failure prediction in complex computer systems. In particular, the

1 To interested readers, specific references on war forecasting (Moll & Luebbert [186]), weather forecasting (Pielke [204]), and wind forecasting (Marzban & Stumpf [178]) can be found.
2 Specific references can be found on signal processing (Kalman & Bucy [140]), instruction prefetching (Jiménez & Lin [134]), and cache prediction (Joseph & Grunwald [136])
3 For example, Jacob Burckhardt [41] reports in his book on Greek culture that in ancient times priests hoped to forecast the future by examining the viscera of sacrificial animals

approach proposed in this dissertation investigates interrelations between system components by identifying symptomatic error patterns. One of the key problems in prediction is that the future is in principle not fully predictable. Hence, any prediction needs to handle uncertainty. In the case of the historic oracles, the replies were intentionally cryptic and ambiguous, as can be seen from one of the best-known replies, the one given to Croesus: when Croesus asked the oracle at Delphi whether he should go to war with the Persians, the oracle responded: "If Croesus attacks the Persians, he will destroy a mighty empire."4 However, it was Croesus's mighty empire that was destroyed, not the Persian one; nevertheless, the oracle's reply remained true. The prediction method proposed here takes a different approach to handling uncertainty: it strictly follows a probabilistic approach. Due to the size and complexity of contemporary computer systems, machine learning techniques have been applied in order to reveal symptomatic patterns from failures observed in the past. This is a fundamental difference from the task the ancient predictors were confronted with: oracles had to evaluate singular events, while in failure prediction there is a chance to gain experience. Hence, the problem that is solved in this dissertation is incomparably easier than the job of the venerable Greek oracles.
4 Herodot [118]

Part I
Introduction, Problem Statement, and Related Work

Chapter 1
Introduction, Motivation and Main Contributions

Many domains of today's life and organizations are becoming increasingly dependent on the correct functioning of computer systems. Automotive assistant systems, medical imaging devices, banking systems, and production planning and control systems are only a few examples. Hence dependability, which is about preventing personal as well as economic damage, has become a crucial issue. However, computer systems have reached a level of complexity that precludes the development of a completely correct system. Being built of commercial off-the-shelf components with millions of transistors and millions of lines of code, the occurrence of failures cannot be fully ruled out, but the likelihood of their occurrence should be minimized. Considering availability, another aspect can be observed: striving for high availability in most cases implies extremely short repair times. For example, five-nines availability1 implies that a system must, on average, not be down for more than 5.26 minutes per year. It is almost impossible for a human being to analyze, diagnose, and repair a complex system within such a short time interval.2 Hence, systems need to react to failures more or less automatically. But even if the reaction is automated, it might in some cases be rather difficult even to restart the system within five minutes. One way out of this dilemma is to follow a more proactive approach that starts acting even before the failure occurs. This requires some short-term anticipation of upcoming failures based on an evaluation of the current runtime state of the system, followed by proactive mechanisms that either try to avoid the upcoming failure or try to minimize its effects (see Figure 1.1).
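The downtime figures quoted above can be verified with a quick back-of-the-envelope calculation (an illustrative sketch, not part of the thesis; it assumes a 365-day year, which is consistent with the 15.768-minute figure in the footnote):

```python
# Allowed downtime for a given availability level.
# Assumes a 365-day year (525,600 minutes).
def max_downtime_minutes(availability, years=1):
    minutes_per_year = 365 * 24 * 60  # 525600
    return (1 - availability) * minutes_per_year * years

five_nines = 0.99999
print(round(max_downtime_minutes(five_nines), 3))           # ~5.256 min per year
print(round(max_downtime_minutes(five_nines, years=3), 3))  # ~15.768 min over three years
```

With a 365.25-day year the annual figure rounds to the 5.26 minutes stated in the text.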
This thesis focuses on online failure prediction for centralized complex computer systems, which is the first step towards efficient proactive fault management. The need for accurate short-term failure prediction methods for computer systems has recently been demonstrated by Liang et al. [165]. The authors mention that checkpointing3 is one of the most efficient ways to improve dependability in large-scale computers. However, in parallel computing, the overhead of checkpointing is immense and can even nullify the gain in dependability, due to the fact that failures occur irregularly. Failure prediction methods are needed to differentiate between periods with few failures and periods with many, and to adapt checkpointing to these situations. Oliner & Sahoo [197] carry out experiments showing that failure prediction-driven checkpointing4 can boost both performance and reliability of large-scale systems.

1 I.e., the ratio of uptime over lifetime equals at least 0.99999
2 Even if a failure occurs only once every three years, it seems rather difficult to repair the system within 15.768 minutes
3 Checkpointing denotes the strategy of regularly saving the entire state of a system such that this consistent state can be restored when a failure has occurred

Figure 1.1: Predict-react cycle

1.1 From Fault Tolerance to Proactive Fault Management

Online failure prediction belongs to the research discipline called fault tolerance, which dates back to the pioneers of computing (cf., e.g., Hamming [113] or von Neumann [192]). The methods developed at that time mainly concerned ways to deal with incredibly unreliable hardware components such as relays and vacuum tubes. As the complexity of computing systems increased over the years, the main interest in reliable computing gradually shifted to a system-wide view (Esary & Proschan [92]). Along with this development, fault tolerance methods became more dynamic.
4 The authors call it cooperative checkpointing

One well-known example is the Self-Testing And Repairing (STAR) computer, developed by Avižienis et al. [15]. Various variants of fault tolerance mechanisms employing static and dynamic fault tolerance techniques (hybrid approaches) have been developed (see, e.g., Siewiorek & Swarz [241] for an introduction). At the same time, software became more and more complex, and software fault tolerance techniques such as recovery blocks (Randell [212]) and N-version programming (Avižienis [14], Kelly et al. [143]) were developed. This was in part a reaction to the fact that the relative amount of software-related failures had become predominant (see, e.g., Sullivan & Chillarege [252]). However, fault tolerance techniques developed until the 1990s were reactive, passive, and still static in nature: they were triggered after a problem had been detected, and the type of reaction had to be prespecified during system design. In 1995, Huang et al. [126] proposed a new approach that has become well-known under the term rejuvenation. Rejuvenation is a technique that restarts parts of a system even if no fault has occurred. It has proven to be a successful concept for dealing with problems of software aging (Parnas [198]) such as accumulating numerical rounding errors, corruption of data, exhaustion of resources, memory leaks, etc. All the while, system complexity has not stopped growing, and traditional fault tolerance mechanisms could not keep pace with the dynamics and flexibility of new computing architectures and paradigms. Both industry and academia set off in search of new concepts in fault tolerance and other dependability issues such as security, as can be seen from initiatives and research efforts on autonomic computing (Horn [123]), trustworthy computing (Mundie et al.
[188]), adaptive enterprise (Coleman & Thompson [63]), recovery-oriented computing (Brown & Patterson [40]), responsive computing (e.g., Malek [173]), rejuvenation (e.g., Garg et al. [101]), and various conferences on self-* properties (see, e.g., Babaoglu et al. [19]), where the asterisk can be replaced by any of "configuration", "healing", "optimization", or "protection". Throughout this dissertation, the term proactive fault management will be used. In parallel to computer fault tolerance, research in mechanical engineering has developed the concept of preventive maintenance. Preventive maintenance tries to improve system reliability by the replacement of components (cf., e.g., Gertsbakh [105] for an overview). Several replacement strategies exist, ranging from simple lifetime distribution models to more complex models, including prediction-based preventive maintenance that incorporates monitoring data (cf., e.g., Williams et al. [278]). However, due to the fact that the actions triggered for mechanical machines differ significantly from those for computing systems, and since the observation-based methods seem unable to account for the complexity of contemporary large computer systems, the two research communities have not merged (except for some rare approaches such as Albin & Chao [4]).

1.2 Origins and Background

The initial point for the work described in this dissertation was the challenge to develop failure prediction algorithms based on data collected from an industrial telecommunication system. At the Computer Architecture and Communication group at Humboldt University Berlin, three different approaches have been proposed: Steffen Tschirpke has introduced an adaptive fault dictionary, Günther Hoffmann has developed a method based on data from continuous system monitoring (Hoffmann [120]), and this thesis focuses on a prediction method based on error event patterns.
However, the prediction method described in the following chapters is not the first attempt to master the challenge. Previously, a rather straightforward solution had been developed that builds on a semi-Markov process and clustering of similar error events. This method has been named Similar Events Prediction (see Salfner et al. [226] for details). However, it has two major drawbacks:
1. Computing overhead for predictions more than three minutes in advance resulted in unacceptable computation times, due to the exponentially growing complexity of the algorithms.
2. Although results seemed promising, prediction quality dropped to a low level if the test data differed only slightly (e.g., due to a different configuration of the system under investigation) from the data that had been used to build the model. The explanation for this behavior is called overfitting, which means that the model is too specifically tailored to the data analyzed: if an observed pattern under investigation varied only slightly from the patterns observed in the training data, it was not recognized anymore and hence no failure was predicted.
Having learned these lessons, the task of failure prediction for the commercial telecommunication system has been analyzed from scratch in a structured, traditional engineering fashion (see Figure 1.2): First, key properties of the system have been identified and, by abstraction, a precise problem statement has been formulated. Then, a methodology has been developed that is specifically targeted at the key properties of the problem. The methodology has then been implemented and tested with the industrial data of the telecommunication system in order to assess how well the solution solves the problem. In the last phase of the engineering cycle, the solution is usually applied to improve the system.
However, failure prediction per se does not improve system dependability unless it is coupled with proactive actions, which is beyond the scope of this dissertation. Therefore, only a theoretical assessment of the effects on dependability has been performed.

Figure 1.2: The engineering cycle.

1.3 Outline of the Thesis

Following the engineering approach depicted in Figure 1.2, this thesis is divided into four parts:
• Part I: The first step, abstraction and identification of key properties, is described in Chapter 2: a problem statement is given, and the principal approach taken in this dissertation is motivated, introduced, and discussed. Before developing a new solution, any engineer should review and investigate existing ones. In Chapter 3, a survey of failure prediction methods is provided. This includes a taxonomy in order to categorize existing methods and to classify the approach taken in this thesis. Furthermore, some approaches are described in more detail, since these methods are used for comparison in the experiments carried out in Part III. Due to the fact that the prediction method presented here builds on hidden Markov models (HMMs), related work on HMMs and their extension to continuous time is described in Chapter 4.
• Part II: The second step of the engineering cycle, which is concerned with the development of a methodology, is covered by Chapters 5 to 7. In Chapter 5, some concepts of data preprocessing are described, including issues related to error logfiles, a clustering method to identify failure mechanisms, and an approach to tackle the problem of noisy data. In Chapter 6, the hidden semi-Markov model used for failure prediction is presented. Since, for failure prediction, the output of hidden Markov models consists of probabilistic likelihoods, a subsequent classification is necessary in order to decide whether the current runtime state is failure-prone or not. Classification is discussed in Chapter 7.
• Part III: The third step of the engineering cycle involves experiments, in order to verify that the assumptions made during modeling match the original problem and to investigate how well the developed methodology performs. Prediction performance is gauged by several measures, which are introduced in Chapter 8. Then the model is applied to industrial data of the commercial telecommunication system in Chapter 9. This includes a detailed analysis of the data, data preprocessing, prediction performance, and a comparative analysis with the most well-known prediction approaches in that area.
• Part IV: In order to close the engineering cycle, dependability improvement capabilities are assessed in Chapter 10, in which a model is developed to theoretically assess the effect of failure prediction-driven fault tolerance mechanisms (proactive fault management) on availability, reliability, and hazard rate. The chapter also includes results of a case study where such mechanisms have been applied to a demo web-shop application. The main results are summarized and an outlook on future research topics is provided in Chapters 11 and 12. The main contributions of each chapter are presented in chapter summaries.

1.4 Main Contributions

The overall contribution of this dissertation is the development of a novel approach to error event-based failure prediction. Experiments on data of an industrial telecommunication system have shown superior prediction performance in comparison with the most well-known prediction algorithms in that area. In addition, several advancements to the state of the art are presented:
• A novel extension of hidden Markov models to incorporate continuous time. In contrast to previous extensions, which have been developed mainly in the area of speech recognition, the model developed in this thesis is specifically tailored to event-driven temporal sequences.
• To our knowledge, the first taxonomy of and survey on computer failure prediction approaches, including an indication of promising areas for further research. The taxonomy is based on the fundamental relationship among faults, errors, and failures. Symptoms, which reflect side-effects of faults, have been added to this basic concept.
• To our knowledge, the first model to assess the dependability of prediction-driven fault tolerance techniques (proactive fault management). The model incorporates correct and false predictions, downtime avoidance as well as downtime minimization techniques, and cases where failures are induced by the fault management techniques themselves.
• A novel methodology to group failure sequences. Although it is only used for data preprocessing here, the approach may also contribute to diagnosis.
• To our knowledge, the first measure to quantify the quality of logfiles: logfile entropy combines Shannon's information entropy with specific requirements for comprehensive logfiles.
All in all, the comprehensive approach to online failure prediction proposed in this thesis, if combined with preventive actions, has the potential to increase computer system availability by an order of magnitude.

Chapter 2
Problem Statement, Key Properties, and Approach to Solution

The first step in any scientific as well as any engineering project should be a proper statement of the problem to be solved. The challenge that had to be solved in the course of this work is online failure prediction, which is defined in Section 2.1. The motivating case study that led to the selection of this topic is an industrial telecommunication system for which we were given the chance to collect data. In Section 2.2, the prediction objective is clearly specified for the concrete scenario of the telecommunication system.
The case study is introduced at this early point of the thesis in order to identify key properties of systems for which the failure prediction method proposed in this thesis is designed. The key properties are discussed in Section 2.3. From these key properties, the principal approach to the solution is presented in Section 2.4, and its general properties are analyzed in Section 2.5.

2.1 A Definition of Online Failure Prediction

The aim of online failure prediction is to predict the occurrence of failures during runtime based on the current system state. For a more precise definition, the terms "failure" and "online prediction" are defined separately.

2.1.1 Failures

Failures are commonly defined as follows (Avižienis & Laprie [16]): A system failure occurs when the delivered service deviates from the specified service, where the service specification is an agreed description of the expected service. Similar definitions can be found, e.g., in Melliar-Smith & Randell [180], Laprie & Kanoun [155], and Avižienis et al. [17]. The main point here is that a failure refers to misbehavior that can be observed by the user, which can be either a human or a computer component using another component. Things may go wrong inside the system, but as long as this does not result in corrupted output,1 there is no failure. More specifically, a failure is an event: it is the point in time at which a system ceases to fulfill its intended function [64]. Faults are the root cause of failures and are defined to be a defective (incorrect) state [64]. In most cases, faults remain undetected for some time. Once a fault has become visible, it is called an error. That is why errors are called "manifestations" of faults. Figure 2.1, which is a modified version of a figure by Siewiorek & Swarz [241], visualizes these relationships.

Figure 2.1: Definitions and interrelations of faults, errors and failures
The key aspect to note here is that faults are unobserved defective states. Four stages exist at which faults can become visible (see Figure 2.2): 1. The system can be audited in order to actively search for faults, e.g., by testing on checksums of data structures, etc. 2. System parameters such as memory usage, number of processes, workload, etc., can be monitored in order to identify side-effects of the faults. These side-effects are called symptoms. For example, the side-effect of a memory leak (the fault) is that the amount of free memory decreases over time. 3. If a fault is activated and detected (observed), it turns into an error. 4. If the fault is not detected by fault detection mechanisms, it might directly turn into a failure which can be observed from outside the system or component. A good example for this are faults on disk drives: Consider the fault of a defective disk sector. Until no read/write operations have been performed trying to access the sector, the fault remains unobserved. Auditing would make it visible by, e.g., reading the entire disk (not for data but for testing purposes). Symptoms of a (not yet completely failed disk) could be observed by monitoring, e.g., wobbling of the disk. Once the sector is completely damaged and data shall be read from it, an error is detected. In a single disk environment, this is usually equivalent to the occurrence of a failure. However, if the defective disk is, e.g., part of a redundant array of independent disks (RAID), the desired service of data delivery can still be fulfilled and hence no failure occurs. 
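The disk example can be turned into a toy model that makes the fault/error/failure distinction concrete (purely illustrative; the classes and names are invented for this sketch and do not appear in the thesis):

```python
# Toy model (illustrative only): a latent sector fault becomes an error
# only when the sector is accessed, and an error becomes a failure only
# if no redundancy can mask it.

class Disk:
    def __init__(self, bad_sectors=()):
        self.bad_sectors = set(bad_sectors)  # faults: defective but unobserved

    def read(self, sector):
        # Fault activation: reading a bad sector surfaces a detected error.
        if sector in self.bad_sectors:
            raise IOError(f"error: bad sector {sector}")
        return b"data"

def service_read(disks, sector):
    """A failure occurs only if the delivered service deviates from the
    specification, i.e. no replica can serve the request."""
    for disk in disks:
        try:
            return disk.read(sector)
        except IOError:
            continue  # error detected, but possibly masked by redundancy
    raise RuntimeError("failure: no replica could deliver the data")

single = [Disk(bad_sectors={7})]
raid = [Disk(bad_sectors={7}), Disk()]  # mirrored pair

assert service_read(raid, 3) == b"data"   # fault present, never activated
assert service_read(raid, 7) == b"data"   # error occurs, masked -> no failure
try:
    service_read(single, 7)               # error unmasked -> failure
except RuntimeError as failure:
    print(failure)
```

The same fault thus passes through all stages only in the single-disk setting; in the mirrored setting the error never reaches the service interface.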
1 Including the case that there is no output at all

Figure 2.2: Faults can become visible at four stages: by auditing, by monitoring of system parameters such as workload, memory usage, etc., to capture symptoms of faults, by detecting manifestations of faults (errors), or by a failure that can be observed from outside the system or component

Figure 2.3: Distinction between root cause analysis and failure prediction

Another key aspect for a precise definition of failure prediction methods is that there is usually no one-to-one mapping between faults and errors: several faults may result in one single error, or one fault may result in several errors. The same holds for errors and failures: some errors result in a failure, some do not, and even more complicated are cases where some errors result in a failure only under special conditions; some faults may even cause failures directly. Moreover, some faults remain inactive for the entire system lifetime. For this reason, two distinct research directions have evolved: root cause analysis and failure prediction. Having observed some misbehavior by one of the means shown in Figure 2.2, root cause analysis tries to identify the fault that caused an error or failure, while failure prediction tries to assess the risk that the misbehavior will result in a future failure (see Figure 2.3). For example, if it is observed that a database is not available, root cause analysis tries to identify the reason for the unavailability: a broken network connection, a changed configuration, etc. Failure prediction, on the other hand, tries to assess whether this situation bears the risk that the system cannot deliver its expected results, which depends on the system and the current situation: is there a backup database or some other fault tolerance mechanism available? What is the current load of the system?
2.1.2 Online Prediction

The term "failure prediction" is widely used, e.g., for reliability prediction, where the goal is to assess the future reliability of a system from its design or specification (see, e.g., Musa et al. [189], Bowles [35], Denson [77], Blischke & Murthy [32]). In contrast, however, the topic of online failure prediction is to identify during runtime whether a failure will occur in the near future, based on an assessment of the monitored current system state. Although architectural properties such as interdependencies play a crucial role in some online failure prediction methods, online failure prediction is concerned with a short-term assessment that allows one to decide whether there will be a failure, e.g., five minutes ahead, or not. Reliability prediction, however, is concerned with long-term predictions based on input data such as architectural properties or the number of bugs that have been fixed. More precisely, for the case of online failure prediction, four different times need to be defined (see Figure 2.4):
• Lead-time ∆tl defines how far in the future from present time failures are predicted.
• Minimal warning-time ∆tw defines the minimum lead-time such that failure prediction is of any use. If the lead-time were shorter than the warning-time, there would not be enough time to perform any preparatory or preventive actions.
• Prediction-period ∆tp is the time for which a prediction holds. Increasing ∆tp increases the probability that a failure is predicted correctly.2 On the other hand, if ∆tp is too large, the prediction is of little use, since it is not clear when exactly the failure will occur.
• Data window size ∆td defines the amount of data that is taken into account for failure prediction. Even if online failure prediction algorithms take the current system state into account, many algorithms additionally investigate what happened shortly before present time.
However, in some approaches the amount of data is not determined by a time window but by other measures such as a fixed number of error events. In this case, ∆td is still defined, but may vary with each prediction.

Figure 2.4: Time relations in online failure prediction. Present time is denoted by t. Failures are predicted with lead-time ∆tl, which must be greater than the minimal warning-time ∆tw. A prediction is assumed to be valid for some time period, named prediction-period ∆tp. In order to perform the prediction, data up to a time horizon of ∆td is used; ∆td is called the data window size.

² For ∆tp → ∞, simply predicting that a failure will occur would always be 100% correct!

2.2 The Objective of the Case Study

Data of an industrial telecommunication system serves as a gauge of the extent to which the online failure prediction algorithm is able to predict the occurrence of failures. Although it is a case study, it demonstrates the type of systems and environments in which the developed online failure prediction method is intended to be applied, which is why the concrete objective of the case study is described at this early point of the thesis. In subsequent sections, the case study serves to identify key properties that are typical for the problem domain. The main purpose of the telecommunication system under investigation is to realize a so-called Service Control Point (SCP) in an Intelligent Network (IN) [171]. An SCP provides services³ to handle communication-related management data such as billing, number translation, or prepaid functionality for various services of mobile communication: Mobile Originated Calls (MOC), Short Message Service (SMS), or General Packet Radio Service (GPRS). The fact that the system is an SCP implies that it cooperates closely with other telecommunication systems in the Global System for Mobile Communication (GSM). Note that the system does not switch calls itself.
It rather has to respond to a large variety of different service requests regarding accounts, billing, etc., submitted to the system over various protocols such as Remote Authentication Dial In User Service (RADIUS), Signaling System Number 7 (SS7), or the Internet Protocol (IP). The system's architecture is very complex and cannot be reproduced here for confidentiality reasons. However, two key facts are that it has a multi-tier architecture and employs a component-based software design. At the time when the data were collected, the system consisted of more than 1.6 million lines of code and approximately 200 components realized by more than 2000 classes, running simultaneously in several containers, each replicated for fault tolerance. Typically, one of the most complicated parts of reliability-related projects is the clear definition of what a failure is. As defined before, a failure is the event when a system ceases to fulfill its specification. The specification for the telecommunication system requires that within successive, non-overlapping five-minute intervals, the fraction of calls having a response time longer than 250 milliseconds must not exceed 0.01%, as shown in Figure 2.5.

Figure 2.5: If, within a five-minute interval, the fraction of calls having response time > 250 ms exceeds 0.01%, a failure has occurred.

This definition is equivalent to a required four-nines interval service availability:

    Ai = (no. of service requests within 5 min having response time ≤ 250 ms) / (total no. of service requests within 5 min) ≥ 0.9999 .    (2.1)

³ so-called Service Control Functions (SCF)

Various classifications of failures have been published, one of which is by Cristian et al. [69], extended by Laranjeira et al.
[156], who classify failures into the following categories:
• crash failure: the service stops operating and does not resume operation until repair
• omission failure: the service does not respond to a request
• performance failure: the service responds too late (given a threshold)
• timing failure: the service responds too early or too late (given two thresholds)
• computation failure: the service's response shows wrong results
• arbitrary failure: the service's response shows arbitrary behavior
where each failure class is included in the following classes. According to this definition, the objective of this thesis is to predict performance failures of the telecommunication system. Using the terminology of Laprie & Kanoun [155], these failures are consistent timing failures. However, it is not possible (for us) to assess the consequences on the environment, since these are top-level failures and no information is available on how other parts of a telecommunication network rely on the service of the system analyzed here.

Figure 2.6: Data acquisition setup. Error logs have been collected from the telecommunication system, while a failure log has been obtained from an entity that tracked response times of calls.

Field data was collected under various workloads. Request response times have been measured, and all failed requests (i.e., those having response times of more than 250 milliseconds) have been written into a failure log. The second source of data is the error logs, which have been collected from the telecommunication system (see Figure 2.6). Both failure and error logs have been collected for 200 days, containing a total of 1560 failures.

2.3 Key Properties

By analyzing the telecommunication case study, key properties have been identified, yielding the assumptions on which the failure prediction approach developed in this thesis is based. In particular, the key properties are:
1. Only very little knowledge about the system internals is available.
Since we did not have full access to the system internals, a thorough analysis of the system's structure has not been possible. Moreover, such an analysis seems infeasible due to the sheer size of the system.
2. A lot of data is available. The error logs of the 200 days of testing contained an overall amount of 26,991,314 log records, which corresponds to an average logging activity of 43 log records per minute on one node and 51 log records per minute on the second node, respectively. Investigations have shown that only a small fraction of the error records give notice of upcoming failures.
3. Failures occur rarely. This leads to an imbalance of failure and non-failure data.
4. The telecommunication system is built of software components. Software components are more or less isolated subsystems that are executed in so-called containers, which provide additional functionality such as data persistency, replication, logging, etc. The system serves requests by invoking one or more components, which in turn invoke other components to fulfill the job. This leads to interdependencies within the software. Usually, the interdependencies form a forest in terms of graph theory, but cycles cannot be excluded in general.
5. Fine-grained fault detection and error reporting is built into the system. For example, each component continuously observes its state and checks the input received from other components. Additionally, there may be several steps of escalation that can assign different levels of severity to error events.
6. The system runs multiple tasks and processes in parallel. For this reason, several concurrent tasks can send messages to the error logging back-end. Such behavior can be interpreted as noise in the error logs. A second effect of this property is that the order of events can be interchanged if several events occur more or less concurrently.
7.
Error logs have at least two dimensions: a timestamp and a type specifying what has happened.⁴ It is assumed that both dimensions contribute information that can be exploited for failure prediction.
⁴ In many cases, such as the telecommunication system investigated here, the type is only implicitly specified by an error message in natural language. The task of message type assignment is addressed in Chapter 5.
8. Since they are event-triggered and take values from a finite, countable set, error logs form a temporal sequence.
9. The telecommunication system can serve requests for several protocols such as GPRS, SMS, MOC, etc. Data of two groups of protocols have been recorded separately. Furthermore, the interval service availability requirements must be fulfilled separately for both groups. In general, it must be assumed that contemporary systems can show failures of various types, and different failure definitions may exist for each of them.
10. In a system of such complexity, it must be assumed that several failure mechanisms exist for each failure type. A failure mechanism denotes the relation of faults and system states to a failure, with focus on the process by which the faults lead to the failure. This is closely related to the term failure modes as defined by Laprie & Kanoun [155], but the term failure mechanism is used here in order to emphasize the temporal aspect.
11. The telecommunication system is highly configurable: more than 2000 parameters can be adjusted. Configurability also adds to system complexity, e.g., by parametrization of interrelations within the system.
12. Systems are subject to updates, which can alter system behavior significantly. Hence, the process of adapting failure predictors to new system specifics should require as little effort as possible. At least from that perspective, algorithmic solutions seem preferable to human analysis.
13.
The system is non-distributed. Although the data is collected from two machines interconnected by a dedicated high-speed local network and running on synchronized clocks, the data is merged into one single error log, and no computing-node-specific aspects are used throughout this thesis.

2.4 Approach

Due to the fact that only limited analytical knowledge but a large amount of data is available, a machine learning approach has been chosen. It infers symptoms of upcoming failures from measurements (training data) rather than from an analytical analysis of the system. Machine learning, as applied here, consists of two steps (see Figure 2.7): first, a model is built from recorded data using some training algorithm, which means that model parameters are adjusted such that some objective function is optimized. Specifically, the training data consists of error-log files and failure logs, the latter being used to identify whether a failure occurred or not. Having trained a model, the model is used to predict failures online during runtime. However, as we do not have access to the running system, this thesis must do without real-time testing. Rather, the data set is divided into a training and a test dataset, such that prediction quality is estimated from samples that were not available during training.

Figure 2.7: Machine learning approach: First, a model is built from training data (a). After training, the model is used to predict the occurrence of failures during runtime (b).

The key notion of the approach is that dependencies in the component-based system lead to error patterns, as shown in Figure 2.8. Assume that component "C3" is faulty.

Figure 2.8: Dependencies among components lead to a temporal sequence of errors.

Once the fault is detected, an error message "C" is generated and written to the error log.
Some time later, component "C1" needs some functionality of "C3", but since "C3" is faulty, "C1" also has a problem and reports an error of type "A". Due to component-internal mechanisms and dependencies (see, e.g., Hansen & Siewiorek [114]), the component writes a second error message "B". After some time, the same happens to "C2": when functionality of "C2" is requested but cannot be delivered, an error message of type "D" is generated. As can be seen from the bottom time line of Figure 2.8, this behavior leads to an event-triggered temporal sequence of error events. The telecommunication system under study is a fault-tolerant system. Hence, the chain of dependencies shown in the figure is not necessarily traversed for a single request. For example, if component "C3" has problems connecting to the database, which results in error message "C", this problem may be handled by another component⁵ or it may lead to a single failed call request. But a single failed request does not yet constitute a failure. However, if component "C3" is faulty for a while, there are conditions under which other components start to have problems, too, which is component "C1" in the figure. This may still be fine, but in some situations even "C2" gets a problem, which finally leads to a failure, since too many components are having problems and hence too many call requests fail. These effects give rise to the central idea of the failure prediction approach investigated in this thesis:
⇒ Dependencies in the system lead to error patterns, as shown in Figure 2.8.
⇒ Some error patterns lead to failures and others do not, depending on conditions that are not observable from the outside.
⇒ Apply pattern recognition techniques to identify those patterns that have led to failures.
⇒ Analyze previously recorded error patterns to train the pattern recognizer using machine learning techniques.
⁵ which is a component failover
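The central idea above can be sketched in a simplified way (function and parameter names are illustrative assumptions, not the thesis's actual preprocessing pipeline): given the merged error log and a list of failure timestamps, training sequences are cut out using the data window ∆td and the lead-time ∆tl.

```python
# Simplified sketch: cutting failure-preceding training sequences out of
# an error log. The windowing scheme shown here is an illustrative
# assumption, not the thesis's actual preprocessing.

def extract_sequences(error_log, failure_times, dt_lead, dt_data):
    """error_log: list of (timestamp, event_type), sorted by timestamp.
    Returns one error sequence per failure: all events observed in the
    window [t_f - dt_lead - dt_data, t_f - dt_lead] before failure time t_f."""
    sequences = []
    for t_f in failure_times:
        end = t_f - dt_lead        # prediction must be made dt_lead ahead
        start = end - dt_data      # only dt_data worth of history is used
        seq = [(t, e) for (t, e) in error_log if start <= t <= end]
        sequences.append(seq)
    return sequences

# Error events as in the C3/C1/C2 example: types "C", "A", "B", "D".
log = [(1.0, "C"), (4.0, "A"), (5.0, "B"), (9.0, "D"), (12.0, "C")]
# One failure at t = 14; predict 2 time units ahead from a 10-unit window.
print(extract_sequences(log, [14.0], dt_lead=2.0, dt_data=10.0))
# -> [[(4.0, 'A'), (5.0, 'B'), (9.0, 'D'), (12.0, 'C')]]
```

Non-failure sequences would be cut out analogously at times far away from any failure; both sets then serve as labeled training data for the pattern recognizer.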
Hidden Markov Models (HMMs) have been shown to be successful pattern recognition tools in a large variety of recognition tasks, ranging from speech recognition to intrusion detection in computer systems. This is the first reason for choosing HMMs for failure prediction; a second rationale refers to the very basic distinction between faults, errors, and failures: Faults are by definition unobserved. Once they manifest, they turn into errors, which are observable. This insight transfers directly to HMMs: the states of an HMM are hidden, i.e., unobservable, and generate observation symbols. Hence, there is a close match between faults and the hidden states of an HMM on the one hand, and between their manifestations, the errors, and the observation symbols on the other. As the occurrence of a failure represents some final state (at least in non-repairable systems), failures are represented by an absorbing final state producing a dedicated failure symbol. However, standard hidden Markov models are not well suited to represent event-triggered temporal sequences (as is discussed in Section 4.2). For this reason, an extension of HMMs has been developed that permits modeling the time behavior of error sequences by use of a continuous-time semi-Markov process.

The training procedure. The goal of training is to adjust HMM parameters to error patterns that are indicative of upcoming failures. To account for the imbalance of failure versus non-failure data (class skewness), HMMs are trained with failure-prone sequences only. Since it is assumed that several failure mechanisms exist in the system and hence are present in the data, a separate HMM is trained for each of them. The term failure mechanism denotes the principal process by which specific faults, states, and circumstances lead to a specific failure.
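For later reference, the sequence likelihood computation that underlies both training and prediction can be sketched with the standard discrete-HMM forward algorithm (a plain HMM, not the semi-Markov extension developed in this thesis; all parameter values below are made up):

```python
import numpy as np

def sequence_likelihood(pi, A, B, obs):
    """Forward algorithm: P(obs | model) for a discrete HMM.
    pi:  initial state distribution, shape (N,)
    A:   state transition matrix, shape (N, N)
    B:   observation probability matrix, shape (N, M)
    obs: sequence of observation symbol indices."""
    alpha = pi * B[:, obs[0]]          # forward variable at t = 0
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate, then weight by emission
    return float(alpha.sum())

# Toy two-state model over three error symbols; state 1 is absorbing,
# standing in for the dedicated failure state.
pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])
B = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.1, 0.8]])
print(sequence_likelihood(pi, A, B, [0, 2]))  # ~0.248
```

During online prediction, this likelihood is computed for every trained model, and the observed sequence is attributed to the model class that explains it best.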
In order to separate the failure sequences in the training data into groups belonging to the same failure mechanism, clustering of failure sequences is performed (see Section 5.2). In order to distinguish failure-prone from non-failure sequences in the prediction phase, a separate model targeted at non-failure sequences is needed. It is trained from a selection of non-failure-prone sequences in the training data. Although grouping of non-failure sequences would in principle be possible, it is not applied, since the non-failure sequence model only serves as a reference for classification. Furthermore, sequence clustering would not be applicable due to the large number of non-failure sequences in the data set. Since logfiles are noisy, and since sometimes there is too much logging going on for a prediction method to be successful, the data needs to be preprocessed. Data preprocessing involves filtering mechanisms and statistical testing. An overview of the training procedure is provided by Figure 2.9.

Figure 2.9: An overview of the training procedure. Model 0 is trained with non-failure sequences. Failure sequences are grouped by means of clustering. A separate model is then trained for each of the u groups.⁶

Online prediction. Given an error event sequence observed at runtime, online failure prediction is performed by computing the similarity of the observed sequence to the sequences of the training data. This is done by computing the sequence likelihood for each model, including the model targeted at non-failure sequences. Sequence likelihood can be interpreted as a probabilistic measure of similarity between the given sequence and the sequence characteristics represented by the hidden Markov model. In order to come to a decision whether the current situation is failure-prone or not, multi-class classification based on Bayes decision theory is performed. As was the case for training, data preprocessing, including failure-group-specific filtering, has to be applied prior to sequence likelihood computation. An overview of the procedure for online failure prediction is depicted in Figure 2.10.

⁶ The letter u is used here since the letters i to n, which are commonly used to indicate integer numbers, occur frequently in later chapters and have fixed connotations in this thesis.

2.5 Analysis of the Approach

In order to show principal properties and limitations of the approach, various aspects are discussed in the following sections. The intention is to position the approach with respect to existing failure and fault models, and to relate it to other research areas.

2.5.1 Identifiable Types of Failures

A classification of failures has already been given in Section 2.2, from which it has been concluded that the objective for the telecommunication system is to predict performance failures. However, the prediction algorithm can be applied to other systems as well. Since the algorithm is data-driven, it can only learn to predict failures whose underlying failure mechanism is similar to the mechanisms contained in the training data. Furthermore, the machine learning approach focuses on general principles in the data, which means that very rare special cases are more or less ignored. The conclusion from this discussion is that the proposed prediction approach can only predict failures that occur more or less frequently; it is not appropriate for predicting really rare failure events.

Figure 2.10: An overview of the online failure prediction approach. In order to investigate an observed error sequence, the sequence likelihood is computed for each of the models, including the model targeted at non-failure sequences (Model 0). Sequence likelihood is a probabilistic measure for the similarity of the observed error sequence to sequences of the training data.
Failure prediction is then performed by subsequent classification of whether the current situation is failure-prone or not. In order to prepare the sequence for this process, data preprocessing, including failure-group-specific filtering, has to be applied.

This may seem insufficient from a researcher's viewpoint, but it is useful from an engineer's perspective. For example, Chillarege et al. [60] show that the distribution of failures resembles a Pareto distribution, from which it follows that a few failures contribute to the majority of outages. Levy & Chillarege [162] state that, from an economic viewpoint, it is most efficient to first address those failures that occur most frequently, in order to achieve the largest impact on overall system availability. Furthermore, Lee & Iyer [159] report in a study of the Tandem GUARDIAN system that over two-thirds of reported software failures are recurrences of previously reported faults. The authors concluded that "in addition to reducing the number of software faults, software dependability in Tandem systems can be enhanced by reducing the recurrence rate".

2.5.2 Identifiable Types of Faults

Research on dependable computing has put much effort into analyzing and categorizing the things that can go wrong in computer systems. Classifications of different types of faults are called fault models, which can be helpful, e.g., to determine the potential and limits of a fault tolerance technique.

Design–Runtime fault model. A fundamental distinction of faults addresses the development phase from which the fault originates. Design faults originate from bad system design, e.g., use of an algorithm that does not converge in some situations and hence might cause an "infinite loop." Opposed to these are runtime faults, which occur during the production phase of a system.

Permanent–Intermittent–Transient fault model. Another well-known classification focuses on the duration of faults, as shown in Figure 2.11.
Figure 2.11: Permanent, intermittent and transient faults (Siewiorek & Swarz [241]).

The figure introduces three types of faults:
• permanent faults, which are defects that stay active until the fault is removed by repair. A typical example is a damaged sector on a hard disk.
• intermittent faults, which are temporary defects that result from system-internal flaws.
• transient faults, which are temporary defects that trace back to environmental influences such as a hit by an alpha particle, etc.
As may have become apparent, this categorization is focused on hardware issues. Although the concept can in principle be transferred to software, there are some difficulties. For example, since a software fault (a bug) can only be removed by repair, software faults would have to be classified as permanent. However, some studies have shown that their occurrence resembles that of transient faults (see, e.g., Gray [107]), because their activation patterns depend on many conditions in the system.

Bohr–Mandel–Heisen–Schrödingbugs fault model. This fault model is tailored to software faults and explores an analogy between software bugs and well-known physicists and mathematicians. It focuses on the bugs' observability and tangibility. Gray & Reuter [109] classify software bugs into "Bohrbugs" and "Heisenbugs". This concept has been extended, as, for example, in Candea [42]:
• Bohrbugs. According to the rather simple and deterministic atom model of Niels Bohr,⁷ Bohrbugs are deterministic bugs that can be reproduced most easily. Most Bohrbugs are identified by testing and eliminated in a thorough software engineering process.
• Mandelbugs. According to the mathematician Benoît B. Mandelbrot, who is one of the founders of chaos theory, Mandelbugs are bugs that appear chaotic due to manifold and complex dependencies.
• Heisenbugs.
According to Werner Heisenberg's uncertainty principle, Heisenbugs disappear or change their behavior when being investigated. For example, race conditions can disappear when a program is run in a debugger, since the debugger changes the timing behavior of the program.
• Schrödingbugs. According to Schrödinger's cat thought experiment in quantum physics, Schrödingbugs do not manifest until, e.g., someone reading the source code notices them, whereupon the program stops working for everybody until the bug is fixed. An example of such a bug might be a security hole that is exploited rapidly after being identified, so that the program becomes unusable until the bug is fixed.

⁷ Terming the model "simple" is not intended to belittle the merits of Niels Bohr; remember that he proposed this model as early as 1913!

Fail-stop–to–Byzantine fault model. This model characterizes faults with respect to their "hazardousness" or "behavior". The fault model presented here is taken from Barborak et al. [23], which is an extended version of Laranjeira et al. [156], who themselves extended a model introduced by Cristian et al. [68] (see Figure 2.12). One of the beautiful properties of the model is that inner fault classes are proper subsets of outer fault classes. The farther outside a fault resides in the picture, the more difficult it is to detect, and hence the more complex are the resulting failure scenarios.

Figure 2.12: Fault model based on Barborak et al. [23].

The types of faults can be described as follows:
• Fail-stop fault. A faulty processing entity ceases operation and signals this to other processors.
• Crash fault. The processor simply halts (crashes).
• Omission fault. The processor omits to react to some tasks.
• Timing fault. The processor reacts to tasks, but too early or too late.
• Incorrect computation fault. The processor responds to all requests in time, but the result is corrupted.
• Authenticated Byzantine fault. An arbitrary or even malicious fault that cannot corrupt authenticated messages (sender or receiver can detect corruption).
• Byzantine fault. Every fault or malicious action is possible.

Software–Hardware–Human fault model. While the classifications presented so far mainly reflect design and operational faults, there are also faults that can be attributed to human operators. One way to incorporate operator faults is to classify faults according to their origin: hardware, software, or human. Several variants of this distinction exist that basically refer to the same concepts. For example, Scott [232] uses the terms "technology and disasters", "application failure", and "operator error", and in the SHIP model (Malek [174]), the concept is extended by the incorporation of "interoperability" faults.

Discussion of fault models. Unfortunately, none of the presented fault models provides a tight boundary that completely describes all faults leading to failures that can be predicted by the presented approach. Nonetheless, each fault model provides a framework to discuss the potential and limits of the failure prediction approach presented in this dissertation.
1. Design–Runtime fault model. Design faults are the target of fault intolerance techniques (Avižienis [13]), which attempt to eliminate flaws by elaborate engineering such as formal specification, design reviews, and thorough testing. If, despite all efforts to build a flaw-free system, something goes wrong, runtime faults are addressed by fault tolerance techniques, which try to handle the situation such that no catastrophic failure occurs. Online failure prediction is a fault tolerance technique and is hence targeted at runtime faults. However, the boundary between design and runtime faults is sometimes blurred.
If, for example, a design fault always results in similar misbehavior that is clearly identifiable by patterns of error events, the proposed failure prediction method can anticipate failures caused by such design faults as well.
2. Permanent–Intermittent–Transient fault model. The failure prediction approach of this thesis identifies faults that trigger failure mechanisms known from the training data. This is most likely the case for permanent faults. Although this fault model is of limited use for software faults, failures caused by transient or intermittent faults can also be predicted, provided that their triggering has been observed often enough in the training data. This seems rather unlikely for faults such as a hit by an alpha particle. However, as the failure prediction approach is targeted at identifying failure-triggering conditions, it fits the transient behavior of software faults as observed through condition-based activation patterns.
3. Bohr–Mandel–Heisen–Schrödingbugs fault model. Online failure prediction will most likely be performed on fault-tolerant systems that have undergone thorough code revision, testing, etc. For this reason, it can be assumed that most Bohrbugs have been eliminated. Schrödingbugs are a construct that is very unlikely to occur and, since all programs stop until the bug is fixed, there is no need for online failure prediction. Mandelbugs and Heisenbugs are the typical bugs for which failure prediction is relevant. Both are triggered under complex conditions, and the difference between the two relates more to root cause analysis than to failure prediction.
4. Fail-stop–to–Byzantine fault model. Since this fault model has the property that more "friendly" fault classes are proper subsets of more general fault classes, it is sufficient to determine an upper bound.
Due to the fact that Byzantine faults can behave arbitrarily, they can trigger failure mechanisms that have not been present in the training data and hence cannot be predicted. The same holds for authenticated Byzantine faults. Incorrect computation faults can be predicted, as long as they lead to errors that are detected within components. Nevertheless, it should be pointed out that there is no 100% coverage, not even for fail-stop faults.⁸
5. Software–Hardware–Human fault model. The failure prediction approach operates on errors that have been logged by some software. From this it follows that hardware faults can only be detected if they result in an error at the software level. If, for example, it is never detected until system failure that some hard disk controller delivers corrupted data, this failure cannot be predicted. However, several studies on the causes of failures, such as Gray [107], Gray [108], and Scott [232], have documented a trend towards software-caused failures. The most astonishing study is Lee & Iyer [159], who investigated the Tandem GUARDIAN system and identified that 89.5% of reported failures were caused by software.

2.5.3 Relation to Other Research Areas and Issues

In the following, relations to other research areas are briefly discussed. A comprehensive classification of the proposed failure prediction algorithm with respect to other prediction approaches is given in Chapter 3.

Fault diagnosis. According to Marciniak & Korbicz [176], there are three different approaches to pattern recognition for fault diagnosis:
• Minimal distance methods. Classification is achieved by assigning the data under investigation to the nearest class as determined by a distance metric in feature space. For failure prediction, error sequences would have to be analyzed in order to extract features such as the frequency of error occurrence, etc.
⁸ Although, due to the fault-tolerant design of the system, fail-stop faults are very unlikely to evolve into a system failure.
• Statistical methods. The goal is to estimate the probability of a class given the data point under investigation: P(c|x). In failure prediction, the classes refer to failure-prone or not failure-prone, and x refers to an error sequence.
• Approximation approach. The class membership function F(x) is approximated by a function. In the case of failure prediction, F(x) would determine whether error sequence x belongs to the class of failure-prone sequences.
With respect to this classification, the approach of this thesis is a statistical method, since the outcome of the HMM forward algorithm is the sequence likelihood P(x|c), which is turned into P(c|x) by the subsequent Bayesian classification step.

Temporal sequence processing. It has been stated that the approach is related to temporal sequence processing. According to Sun [253], temporal sequence processing typically addresses one of four problems:
1. Sequence generation. Having specified a model, generate samples of the time series.
2. Sequence recognition. Does a given sequence belong to the typical behavior of the underlying stochastic process or not? More precisely: what is its probability?
3. Sequence prediction. Given the beginning of a sequence, assess the probability of the next observation (or state) of the time series.
4. Sequential decision making. Select a sequence of actions in order to achieve some goal or to optimize some cost function.
Failure prediction, as introduced here, clearly refers to sequence recognition. However, Section 12.1.1 in the outlook sketches a variant of failure prediction that makes use of sequence prediction. Since the majority of models for temporal sequence processing deal with series whose values occur equidistantly (see, e.g., Box et al.
[36] for an overview), it seems infeasible to compare the HMM approach to other temporal sequence modeling techniques.

Machine learning. The solution presented here clearly belongs to the group of supervised learning algorithms. Supervised learning refers to the property that training data is labeled with a target value. In terms of failure prediction, this means that for every error event sequence in the training data set it is known whether it is a failure or non-failure sequence. Furthermore, the presented approach employs batch learning,9 which means that the approach consists of two phases: a training phase and an application phase (see Figure 2.7). Such an approach is valid as long as the dynamics of the system stay more or less the same. Due to the configurability of the system and to updates, this assumption holds only partly, as is investigated in Section 9.7.2. A solution to this problem is online learning, where the model is adapted continuously during runtime.

The No Free Lunch Theorem10 of machine learning proves that, on the criterion of generalization performance, there is no single modeling technique that is superior to all other techniques on all problems. However, this does not imply that for a given problem all approaches are equal. In fact, it is the topic of this thesis to design, test, and verify the superiority of one specific modeling technique for the concrete task of online failure prediction from error events.

9 also called offline learning
10 see, e.g., Wolpert [280]

Data-driven approaches. The approach presented here is clearly a measurement data-driven approach. Such approaches can, despite their generalization capabilities, only learn interrelations that are present in the training data. Hamerly & Elkan [112] and Petsche et al.
[202] argue that one escape from the dilemma is to build anomaly detectors, which invert the problem: the focus of modeling is not the abnormal failure behavior but the way the system behaves when it is running well. However, this approach also fails if normal behavior is very diverse, which can be assumed for systems of such complexity as the telecommunication system. In the outlook (Chapter 12), a new approach to this dilemma is proposed: the HSMM developed in this thesis may be augmented manually to account for failure mechanisms that are not contained in the training data.

Class Skewness. Failure prediction approaches usually have to deal with extreme class skewness: measurements for failures, even performance failures, occur much more rarely than measurements for non-failures. As can be seen from Figure 2.2 on Page 11, errors occur late in the process from faults to failures: an error is only reported once some misbehavior in the system has been detected. Hence, in comparison to failure prediction approaches operating on periodically measured symptoms, the ratio of failure to non-failure data is more balanced and the problem of class skewness is mitigated. Nevertheless, both classes are far from being equally distributed, and hence failure models are trained on failure data only.

2.6 Summary

This chapter has defined the objective of this thesis: online failure prediction. In terms of the telecommunication system case study, the failures to be predicted are performance failures, defined as a drop of five-minute interval call availability below a four-nines threshold. Key properties of the objective have been identified and the approach pursued in this thesis has been outlined. The last section of the chapter included a brief description of one failure model and four fault models and discussed the potentials and limits of the described approach.
The following list summarizes the line of argument that leads to the approach to online failure prediction followed in this thesis:

• Dependencies within systems lead to error sequences.

• In fault-tolerant systems, not every occurrence of errors leads to a failure.

• Fault-tolerant systems fail only under certain conditions.

• Error pattern recognition is applied to distinguish between error sequences that are failure-prone and those that are not.

• It is assumed that both dimensions of error sequences, the time of event occurrence and the type of the event, are equally important. Hence, error sequences are treated as temporal sequences.

• Extended hidden Markov models are used as the pattern recognition toolkit. The extension allows the temporal behavior of error patterns to be modeled by use of a semi-Markov process.

• Several failure mechanisms are assumed to be present in a system. In order to separate failure mechanisms, failure sequences in the training data are grouped by clustering.

• Since error logs are a noisy data source, data preprocessing has to be applied.

• In order to address the problem of class skewness, failure models are trained using failure sequences only.

• The approach is a supervised, batch-learning machine learning task.

• By use of an additional model for non-failure sequences, Bayes decision theory is applied for online prediction in order to classify the current situation of a running system as failure-prone or not.

Contributions of this chapter. This chapter has discussed the stages at which faults can be observed. It turned out that the classical distinction between faults, errors, and failures is not sufficient, since it misses side-effects of faults, which are called symptoms. Hence one contribution is the extension of this differentiation. The second contribution is a novel view on the task of online failure prediction.
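As a minimal illustration of the last bullet point, the Bayes decision step can be sketched as follows. The log-likelihood values and the prior are hypothetical; in the thesis itself, the likelihoods come from the HSMM forward algorithm on failure and non-failure models:

```python
import math

def classify(log_p_seq_given_F, log_p_seq_given_nF, prior_F,
             cost_fp=1.0, cost_fn=1.0):
    """Bayes decision: classify an error sequence as failure-prone or not.

    log_p_seq_given_F / log_p_seq_given_nF are sequence log-likelihoods,
    e.g., as returned by an HMM forward algorithm.  With equal costs the
    rule reduces to comparing the posteriors P(F|x) and P(nF|x); the
    shared evidence term P(x) cancels and can be ignored.
    """
    log_post_F = log_p_seq_given_F + math.log(prior_F)
    log_post_nF = log_p_seq_given_nF + math.log(1.0 - prior_F)
    # Decide "failure-prone" if the cost-weighted posterior for F is larger.
    if log_post_F + math.log(cost_fn) > log_post_nF + math.log(cost_fp):
        return "failure-prone"
    return "not failure-prone"

# Example: the failure model explains the observed sequence much better,
# which outweighs the small prior probability of failure.
print(classify(log_p_seq_given_F=-42.0, log_p_seq_given_nF=-55.0,
               prior_F=0.01))
# failure-prone
```

The misclassification costs are included because, in failure prediction, a missed failure is usually far more expensive than a false alarm; Chapter 9 discusses this trade-off in terms of precision and recall.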
To the best of our knowledge, this work is the first to treat the problem as a pattern recognition task on temporal sequences.

Relation to other chapters. This chapter has formally defined the objective of the thesis and has presented an overview of the approach. The next two chapters provide background on related approaches. The reason why there are two chapters on related work is that this thesis extends an existing modeling technique, hidden Markov models, and applies it to the area of online failure prediction. Hence Chapter 3 provides an overview of other approaches to online failure prediction, while Chapter 4 covers related work on hidden Markov models.

Chapter 3

A Survey of Online Failure Prediction Methods

As mentioned in Section 2.1, online failure prediction denotes only a small area in the broad field of prediction techniques. However, even in that limited sense, a wide spectrum of approaches has been published. This chapter provides a survey of published methods and points to techniques that might be applied to online failure prediction in the future. In order to structure the spectrum, a taxonomy is introduced in Section 3.1. Major concepts are briefly explained and related work is referenced. Since it is not possible to implement all techniques without a huge team of researchers, only the most promising approaches closely related to the one presented in this thesis have been selected for comparative analysis in the case study. These methods are explained in more detail in Section 3.2.

3.1 A Taxonomy and Survey of Online Failure Prediction Methods

A significant body of work has been published in the area of online failure prediction. This section introduces a taxonomy that structures the manifold of approaches (see Figure 3.1). The most fundamental differentiation of failure prediction approaches refers to the ability to evaluate the current state.
Since the current state can only be considered if some monitoring of the system is used as input data, these methods are also called monitoring-based methods. However, to be complete, there also exist failure prediction mechanisms that are based only on, e.g., lifetime probability distributions, the system's architecture, or other static properties of the system (Branch 2 in the taxonomy). Reliability models and most methods known from preventive maintenance fall into this category. The book by Lyu [170], and especially the chapters Farr [94] and Brocklehurst & Littlewood [38], provide a good overview, while the book by Musa et al. [189] covers the topic comprehensively.

The category of methods that evaluate the current system state (branches starting with 1 in the taxonomy) can be further divided into four categories according to the stage of failure evolution at which observations are taken. Referring to Figure 2.2 on Page 11, faults can be observed at four stages: by audits, by monitoring of symptoms, by detection of errors, or by observation of failures. However, since audit-based methods are mainly offline procedures,1 they are not included in the taxonomy.

Failure Observation (1.1)

The basic idea of failure prediction based on previous failure occurrences is to draw conclusions about the probability distribution of future failure occurrences. The framework for these conclusions can be quite formal, as is the case with Bayesian classifiers, or rather heuristic, as in the case of counting and thresholding.

Bayesian Predictors (1.1.1)

The key notion of Bayesian failure prediction is to estimate the probability distribution of the next time to failure by exploiting the knowledge obtained from previous failure occurrences in a Bayesian framework.
In Csenki [72], such a Bayesian predictive approach [3] is applied to the Jelinski-Moranda software reliability model [132] in order to yield an improved estimate of the probability distribution of the next time to failure.

Non-parametric Methods (1.1.2)

It has been observed that the failure process can be non-stationary and hence the probability distribution of time-between-failures (TBF) varies. The reasons for non-stationarity are manifold: the fixing of bugs, changes in configuration, or even varying utilization patterns can affect the failure process. In these cases, techniques such as histograms result in poor estimations since stationarity2 is inherently assumed. For these reasons, the non-parametric method of Pfefferman & Cernuschi-Frias [203] assumes the failure process to be a Bernoulli experiment in which a failure of type k occurs at time n with probability pk(n). From this assumption it follows that the probability distribution of TBF for failure type k is geometric, since TBFk(n) = m means that m − 1 trials without a type-k failure are followed by one trial with such a failure:

    Pr{ TBFk(n) = m | failure of type k at n } = pk(n) (1 − pk(n))^(m−1).    (3.1)

The authors propose a method to estimate pk(n) using an autoregressive averaging filter with a "window size" depending on the probability of the failure type k.

Counting / Thresholding (1.1.3)

It has been observed several times that failures occur in clusters, in a temporal as well as in a spatial sense. Liang et al. [165] choose such an approach to predict failures of IBM's BlueGene/L from event logs containing reliability, availability, and serviceability data. The key to their approach is data preprocessing that employs first a categorization and then temporal and spatial compression: temporal compression combines all events at a single location occurring with inter-event times lower than some threshold, and spatial compression combines all messages that refer to the same location within some time window.
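The temporal compression step described above can be sketched as follows. This is a minimal illustration with hypothetical event tuples; the actual preprocessing of Liang et al. also performs categorization and spatial compression:

```python
def temporal_compression(events, threshold):
    """Collapse event bursts per location.

    `events` is a list of (location, timestamp) pairs sorted by timestamp.
    Successive events at the same location whose gap to the previous event
    at that location is below `threshold` are merged into one burst,
    reported as (location, first_timestamp, count).
    """
    bursts = []          # each open burst: [location, first_ts, last_ts, count]
    open_burst = {}      # location -> index of its currently open burst
    for loc, t in events:
        i = open_burst.get(loc)
        if i is not None and t - bursts[i][2] < threshold:
            bursts[i][2] = t       # extend the burst with this event
            bursts[i][3] += 1
        else:
            open_burst[loc] = len(bursts)
            bursts.append([loc, t, t, 1])
    return [(loc, first_ts, n) for loc, first_ts, _, n in bursts]

# Three rapid events on node "R01" collapse into one burst;
# the much later event on "R01" starts a new burst.
log = [("R01", 0), ("R01", 2), ("R01", 3), ("R02", 10), ("R01", 100)]
print(temporal_compression(log, threshold=5))
# [('R01', 0, 3), ('R02', 10, 1), ('R01', 100, 1)]
```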
1 We have not found any publication investigating audit-based online failure prediction.
2 at least within a time window

Figure 3.1: A taxonomy for online failure prediction approaches

Prediction methods are rather straightforward: using data from temporal compression, if a failure of type application I/O or network appears, it is very likely that another failure will follow shortly. If spatial compression suggests that some components have reported more events than others, it is very likely that additional failures will occur at that location. A paper by Fu & Xu [99] formalizes the concept by introducing a measure of temporal and spatial correlation.

Symptom Monitoring (1.2)

Some types of faults affect the system gradually, which is also known as service degradation. A prominent example of such faults are memory leaks. If some part of a system has a memory leak, more and more system memory is consumed over time but, as long as memory is still available, neither an error nor a failure is observed. Only when memory is getting scarce may the computer first slow down,3 and only when no memory is left does an error occur, which may then result in a failure. The key notion of failure prediction based on monitoring data is that faults like memory leaks can be grasped by their side-effects on the system, such as exceptional memory usage, CPU load, or disk I/O. These side-effects are called symptoms. Four principal approaches have been identified: failure prediction based on a system model, function approximation techniques, classifiers, and time series analysis.

System Models (1.2.1)

The foundation of these failure prediction methods is a model of system behavior, which is in most cases built from previously recorded training data.
Stochastic models (1.2.1.1): Vaidyanathan & Trivedi [263] construct a semi-Markov reward model in the following way: several system parameters, including the number of process context switches and the number of page-in and page-out operations, are periodically measured on a running system. Clustering the training data yielded eleven clusters, which the authors assume to represent eleven different workload states. A semi-Markov reward model was built in which each cluster corresponds to one state. State transition probabilities were estimated from the measurement dataset, and sojourn-time distributions were obtained by fitting two-stage hyperexponential or two-stage hypoexponential distributions to the training data. Then, a resource consumption "reward" rate for each workload state is estimated from the data: depending on the workload state the system is in, the state reward defines the rate at which the modeled resource changes. The rate was estimated by fitting a linear function to the data using the method of Sen [233]. The authors modeled two resources: the amount of swap space used and the amount of free real memory. Failure prediction is accomplished by estimating the time until resource exhaustion, which is achieved by computing the expected reward rate at steady state from the semi-Markov reward model.

Berenji et al. [27] build a system model in a hierarchical two-step approach: first, they build component simulation models that try to mimic the input/output behavior of system components. These models are used to train component diagnostic models by combining input data with component outputs obtained from the component simulation models. The target output values of the diagnostic models are binary, where a value of one corresponds to faulty component behavior and zero to non-faulty behavior.

3 e.g., due to memory swapping
The same approach is then applied at the next hierarchical level to obtain a system-wide diagnostic model. The authors use a clustering method to obtain a radial basis function rule base.

A more theoretical approach that could in principle be applied to online failure prediction is to abstract system behavior by a queuing model that incorporates additional knowledge about the current state of the system. Failure prediction can be performed by computing the input-dependent expected response time of the system. Ward & Whitt [272] show how to compute estimated response times of an M/G/1 processor-sharing queue based on measurable input data, such as the number of jobs in the system at the time of arrival, using a numerical approximation of the inverse Laplace transform.

Anomaly detectors (1.2.1.2): One of the most intuitive methods of failure prediction is to build a model that captures key aspects of system behavior and to check during runtime whether the actual system behavior deviates from this normal behavior. For example, Elbaum et al. [89] describe an experiment in which function calls, changes in the configuration, module loading, etc. of the email client "pine" were recorded. The authors proposed three types of failure prediction, among which sequence-based checking was most successful: a failure is predicted if two successive events observed in "pine" during runtime do not belong to any of the event transitions seen in the training data.

Candea et al. [45] describe a dependable system consisting of several parts, such as the Pinpoint problem determination approach [53] and automatic failure path inference [44]. Even though these methods are only used in the context of recovery-oriented computing [40], they could easily be extended to detect deviations from usual behavior during runtime in order to predict upcoming failures.
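Sequence-based checking of the kind Elbaum et al. describe can be sketched in a few lines. The event names are hypothetical; the actual experiment recorded function calls, configuration changes, and module loads:

```python
def train_transitions(training_sequences):
    """Collect the set of transitions between successive events that
    were observed in the (assumed fault-free) training runs."""
    seen = set()
    for seq in training_sequences:
        seen.update(zip(seq, seq[1:]))
    return seen

def is_anomalous(runtime_events, seen):
    """Predict a failure if any pair of successive runtime events forms
    a transition never observed during training."""
    return any(pair not in seen
               for pair in zip(runtime_events, runtime_events[1:]))

seen = train_transitions([["open", "read", "close"],
                          ["open", "write", "close"]])
print(is_anomalous(["open", "read", "close"], seen))   # False
print(is_anomalous(["open", "close", "write"], seen))  # True: ("close", "write") unseen
```

Note that such a detector can only flag transitions absent from the training data; like all anomaly detectors, it raises false alarms on rare but legitimate behavior.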
The same holds for a failure diagnosis system that employs a decision tree evaluating runtime properties of requests to a large Internet site [54]. In [144], a χ2 goodness-of-fit test is used to determine whether the proportion of runtime paths between a component instance and other component classes deviates from fault-free behavior.

Control theory (1.2.1.3): It is common in control theory to have an abstraction of the controlled system that estimates the internal state of the system and its progression over time by mathematical equations, such as systems of linear or differential equations, Kalman filters, etc. (see, e.g., Lunze [169]). These methods are widely used for fault diagnosis (see, e.g., Korbicz et al. [147]) but have only rarely been used for failure prediction. However, many of the methods inherently include the possibility to predict the future behavior of the system and hence the ability to predict failures. For example, Neville [193] describes in his Ph.D. thesis the prediction of failures in large-scale engineering plants. Another example is Discenzo et al. [78], who mention that such methods have been used to predict failures of an intelligent motor using the standard IEEE motor model. Limiting the scope to failure prediction in computer systems, only a few examples exist, one of which is Yang [282], who uses Kalman filters to predict future states in combination with an "early failure detection and isolation arrangement" (EFDIA) Petri net. Another approach has been published by Singer et al. [243], who propose the Multivariate State Estimation Technique (MSET) to detect system disturbances by a comparison of the estimated and measured system state. More precisely, a matrix of measurement data of normal operation is collected.

Figure 3.2: Function approximation tries to mimic an unknown target function by the use of measurements taken from a system at runtime
This training data is further processed such that an expressive subset of the training data is selected. In the operational phase, a combination of selected data vectors, weighted by similarity to the current (runtime) observations, is used to compute a state estimate. The difference between the observed and estimated state constitutes a residual that is checked for significant deviation by a sequential probability ratio test (SPRT). In Gross et al. [110], the authors applied the method to detect software aging [198] in an experiment where a memory-leak fault injector consumed system memory at an adjustable rate. MSET and SPRT were used to detect whether the fault injector was active and, if so, at what rate it was operating. From this, the time to memory exhaustion can be estimated. MSET has also been applied to online transaction processing servers in order to detect software aging (Cassidy et al. [48]).

Function Approximation (1.2.2)

Function approximation techniques try to mimic target values, which are assumed to be the outcome of an unknown function of the input data. Target functions include, e.g., the probability of failure occurrence or the true long-term progression of resource consumption. Since neither the function itself is known, nor can the faults that form part of the input to the unknown function be observed, the target function can only be estimated from measurements (see Figure 3.2). Function approximation is a broad research area, and various approaches have been published to address this type of problem, among which some that are related to failure prediction are listed here. Prediction of failures can be achieved with function approximation techniques in two ways:

1. The target function is the probability of failure occurrence. In these cases, the target value in the training dataset is boolean. This case is depicted in Figure 3.2.

2.
The target function is some computing resource, and failure prediction is accomplished by estimating the time until resource exhaustion. Since most of the work presented below follows the second approach, the categorization distinguishes between function approximation methods rather than target functions.

Curve fitting (1.2.2.1): In this category of techniques, the target function is the true long-term progression of some system resource, e.g., system memory. However, if free system memory is measured periodically during runtime, the measurements vary heavily, since it is natural that memory is allocated and freed during normal system operation. Curve fitting techniques4 adapt the parameters of a function such that the curve best fits the measurement data, e.g., by minimizing the mean square error. The simplest form of curve fitting is regression with a linear function. Garg et al. [100] presented work where, after data smoothing, a statistical test (the seasonal Kendall test) is applied in order to identify whether a trend is present and, if so, a non-parametric trend estimation procedure [233] is applied. Failure prediction is then accomplished by computing the estimated time to resource exhaustion. Castelli et al. [49] mention that IBM has implemented a curve fitting algorithm for the xSeries Software Rejuvenation Agent. Several types of curves are fitted to the measurement data and a model-selection criterion is applied in order to choose the best curve. Prediction is again accomplished by extrapolating the curve. Cheng et al. [57] present a framework for high-availability cluster systems. Failure prediction is accomplished in two stages: first, a health index ∈ [0, 1] is established from measurement data employing fuzzy logic, and then trend analysis is applied in order to estimate the mean time to the next failure.
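The simplest variant described above, linear regression extrapolated to the exhaustion point, can be sketched as follows. The measurements are hypothetical, and real implementations smooth the data and test for the presence of a trend first:

```python
def time_to_exhaustion(times, free_mem):
    """Fit free memory over time with a least-squares line
    free = a + b * t and extrapolate to the instant where the line
    reaches zero.  Returns None if no downward trend is present."""
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(free_mem) / n
    b = (sum((t - mean_t) * (y - mean_y) for t, y in zip(times, free_mem))
         / sum((t - mean_t) ** 2 for t in times))
    a = mean_y - b * mean_t
    if b >= 0:
        return None        # memory is not decreasing; no exhaustion predicted
    return -a / b          # time at which the fitted line hits zero

# Free memory (MB) sampled every 10 s, leaking roughly 5 MB per sample.
t = [0, 10, 20, 30, 40]
mem = [100, 95, 91, 85, 80]
print(round(time_to_exhaustion(t, mem)))
# 200  (the fitted line free = 100.2 - 0.5*t reaches zero near t = 200 s)
```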
Andrzejak & Silva [10] apply deterministic function approximation techniques, such as splines, to characterize the functional relationship between the target function5 and "work metrics" such as the amount of work accomplished since the last restart of the system. Deterministic modeling offers a simple and concise description of system behavior with few parameters. Additionally, using work-based input variables offers the advantage that the function no longer depends on absolute time: for example, if there is only little load on a server, aging factors accumulate slowly and so does accomplished work, whereas in the case of high load, both accumulate more quickly.

Genetic programming (1.2.2.2): In the paper by Abraham & Grosan [1], the target function is the so-called stressor-susceptibility-interaction (SSI), which basically denotes failure probability as a function of external stressors such as environment temperature or power supply voltage. The overall failure probability can be computed by integration over single SSIs. The paper presents an approach in which genetic programming has been used to generate code representing the overall SSI function by learning from training data. Although the paper mainly focuses on electronic devices, the approach might be adopted for failure prediction in complex computer systems; however, this is difficult to tell since only few results are presented in the paper.

Machine learning (1.2.2.3): One of the predominant applications of machine learning is function approximation, and various such techniques have a long tradition in failure prediction, as can also be seen from several patents in that area. In 1990, Troudet et al. proposed to use neural networks for failure prediction of mechanical parts, and Wong et al. [281] use neural networks to approximate the impedance of passive components of power systems. The authors used an RLC-Π model in which faults were simulated to generate the training data.
Neville [193] has described how standard neural networks can be used for failure prediction in large-scale engineering plants.

4 which are also called regression techniques
5 the authors use the term "aging indicator"

Turning to publications on failure prediction in large-scale computer systems, various techniques have been applied there, too. Ning et al. [194] have modeled resource consumption time series by fuzzy wavelet networks (FWN). They use fuzzy logic inference to predict software aging in application servers based on performance parameters. Turnbull & Alldrin [259] use radial basis functions (RBF) to predict server failures based on hardware sensors on motherboards. In his dissertation [120], Günther Hoffmann has developed a failure prediction approach based on universal basis functions (UBF), an extension of RBFs that uses a weighted convex combination of two kernel functions instead of a single kernel. He has applied the method to predict failures of the same telecommunication system used as the case study in this thesis. However, UBF primarily builds on equidistantly monitored data to identify symptoms, while the method proposed in this dissertation focuses on event-driven error sequences. In [122], Hoffmann et al. have conducted a comparative study of several modeling techniques with the goal of predicting resource consumption of the Apache webserver. The study showed that UBF yielded the best results for free physical memory prediction, while server response times could be predicted best by support vector machines (SVM). However, the authors point out that the issue of choosing a good subset of input variables has a much greater influence on prediction accuracy than the choice of modeling technique. This means that results might improve if, for example, only workload and free physical memory are taken into account while other measurements, such as used swap space, are ignored.
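One way to search for such a subset is forward stepwise selection, sketched below. The `score` function and feature values are hypothetical toy stand-ins; real implementations score a candidate subset by cross-validated prediction error of the fitted model:

```python
def forward_selection(candidates, score):
    """Greedily grow a feature subset: in each round, add the candidate
    feature that improves `score` (higher is better) the most; stop as
    soon as no remaining candidate improves the score."""
    selected = []
    best = score(selected)
    while True:
        gains = [(score(selected + [c]), c)
                 for c in candidates if c not in selected]
        if not gains:
            break
        top_score, top_c = max(gains)
        if top_score <= best:
            break               # no candidate helps any further
        selected.append(top_c)
        best = top_score
    return selected

# Toy score: "workload" and "free_mem" are informative, "swap" only hurts.
useful = {"workload": 0.4, "free_mem": 0.3, "swap": -0.1}
score = lambda feats: sum(useful[f] for f in feats)
print(forward_selection(["workload", "free_mem", "swap"], score))
# ['workload', 'free_mem']
```

Backward elimination works analogously, starting from the full set and greedily removing features.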
Variable selection6 is concerned with finding the optimal subset of measurements. Typical examples of variable selection algorithms are principal component analysis (PCA, see Hotelling [124]), as used in Ning et al. [194], and forward stepwise selection (see, e.g., Hastie et al. [115]), which has been used in Turnbull & Alldrin [259]. Günther Hoffmann has also developed a new algorithm called the probabilistic wrapper approach (PWA), which combines probabilistic techniques with forward selection or backward elimination.

Instance-based learning methods store the entire training dataset, including input and target values, and predict by finding similar matches in the stored database of training data (possibly combining them). Kapadia et al. [141] have applied three learning algorithms (k-nearest-neighbors, weighted average, and weighted polynomial regression) to predict the CPU time of semiconductor simulation software based on input data such as the number of grid points or the number of etch steps of the simulated semiconductor.

Classifiers (1.2.3)

In contrast to function approximation, classification approaches do not strive to mimic some target function but try to come directly to a decision about the criticality of the system's state. For this reason, training data for classification approaches has discrete (and in most cases binary) target labels. However, the input data to classification approaches can consist of discrete as well as continuous measurements. For example, for hard disk failure prediction based on SMART7 values, the input data may consist of the number of reallocated sectors (a discrete value) and the drive's temperature (theoretically a continuous variable). The target values are not continuous but a binary classification of whether the drive is failure-prone or not.

6 some authors also use the term feature selection
7 Self-Monitoring And Reporting Technology

Statistical Tests (1.2.3.1): Ward et al.
[271] estimate the time-dependent mean and variance of the number of TCP connections in various states of a web proxy server in order to identify Internet service performance failures. If actual measurements deviate significantly from the mean of the training data, a failure is predicted. A more robust statistical test has been applied to hard disk failure prediction by Hughes et al. [127]. The authors employ a rank-sum hypothesis test to identify failure-prone hard disks. The basic idea is to collect SMART values from fault-free drives and to store them as a reference data set. Then, during runtime, the SMART values of the monitored drive are tested as follows: the combined data set consisting of the reference data and the values observed at runtime is sorted, and the ranks of the observed measurements are computed.8 The ranks are summed up and compared to a threshold. If the drive is not fault-free, the distribution of observed values is skewed and the sum of ranks tends to be greater or smaller than for fault-free drives.

Bayesian Classifier (1.2.3.2): In [112], Hamerly & Elkan describe two Bayesian failure prediction approaches. The first Bayesian classifier proposed by the authors is abbreviated NBEM, expressing that a specific naïve Bayes model is trained with the expectation-maximization algorithm on a real data set of SMART values of Quantum Inc. disk drives. Specifically, a mixture model is proposed in which each naïve Bayes submodel m is weighted by a model prior P(m), and an expectation-maximization algorithm is used to iteratively adjust the model priors as well as the submodel probabilities. Second, a standard naïve Bayes classifier is trained from the same input data set. More precisely, SMART variables xi, such as read soft error rate or calibration retries, are divided into bins, and conditional probabilities for class k ∈ {Failure, Non-failure} are computed.
The term naïve derives from the fact that all attributes xi are assumed to be independent, and hence the joint probability can simply be computed as the product of the single attribute probabilities P(xi | k). The authors report that both models outperform the rank-sum hypothesis test failure prediction algorithm of Hughes et al. [127].9 Pizza et al. [205] propose a Bayesian method to distinguish between transient and permanent faults on the basis of diagnosis results. In this case the measured symptoms are obtained by monitoring and evaluation of modules or components. Although not mentioned in the paper, this method could be used for failure prediction by issuing a failure warning once a permanent fault has been detected.

Other approaches (1.2.3.3): Failures of computer systems can be predicted by applying a clustering method directly to system measurement data: after collecting a labeled training data set indicating whether measurements are failure-prone or not, a clustering method can be used, e.g., to identify centroids of failure-free and failure-prone regions. During runtime, actual measurements can be classified by assessing their proximity to the failure-prone and failure-free centroids. Sfetsos [234] describes how clustering has been used together with function approximation techniques for load forecasting of power systems. Additionally, clustering is part of the training procedure in Berenji et al. [27], which has been described in category 1.2.1.1.

8 which in fact involves nothing more than simple counting
9 The rank sum test was announced and submitted to the journal in 2000, but appeared only after the publication of the NBEM algorithm in 2002.

Cheng et al. [57] apply a fuzzy logic soft classifier to compute a health index in high-availability cluster systems (see category 1.2.2.1). Daidone et al.
[73] have proposed to use a hidden Markov model approach to infer whether the true state of a monitored component is healthy or not. The use of hidden Markov models is motivated by the fact that the true state of the monitored component cannot be observed. However, the state can be estimated from a sequence of monitoring results by the so-called forward algorithm of hidden Markov models. Additionally, mistakes in the component-specific defect detection mechanism (the authors use the term “deviation detection mechanism”) are included in the model. Since this method is based on concurrent monitoring, it could also be used for failure prediction: If a component is detected to be faulty, a failure is likely to occur. Chen et al. [52] and Kiciman & Fox [144], which are related publications, apply a probabilistic context-free grammar (PCFG; for more details on PCFGs, see category 1.3.3.1) to evaluate call paths collected from a Java 2 Enterprise Edition (J2EE) demo application, an industrial enterprise voice application network, and from eBay servers. Although the approach is designed to identify failures quickly, it could also be used to predict upcoming failures: if the probability of the beginning of a call path is very low, it is likely that the system is not behaving normally and there is an increased probability that a failure will occur in the further course of the request.

Time Series Analysis (1.2.4): Failure prediction methods belonging to this category directly measure the target function and analyze it in order to determine whether a failure is imminent or not. Feature analysis computes a residual of the measurement series, while time series prediction models try to predict the future progression of the target function from the series' own values (without using other measurements as input data). Finally, signal processing techniques can also be used for time series analysis.

Feature analysis (1.2.4.1): Crowell et al.
[71] have discovered that memory-related system parameters such as kernel memory or system cache resident bytes show multifractal characteristics in the case of software aging. The authors used the Hölder exponent, a residual expressing the amount of fractality in the time series, to identify fractality. In a later paper [238], the same authors extended this concept and built a failure prediction system by applying the Shewhart change detection algorithm [24] to the residual time series of Hölder exponents. A failure warning is issued after detection of the second change point.

Time Series Prediction (1.2.4.2): In Hellerstein et al. [117], the authors describe an approach to predict whether a target function will violate a threshold. In order to achieve this, several time series models are employed to model stationary as well as non-stationary effects. For example, the model accounts for the influence of the day of the week, the time of day, etc. Experiments have been carried out on prediction of HTTP operations per second of a production webserver. A similar approach has been described in Vilalta et al. [266]. Li et al. [163] collect various parameters from a web server and build an autoregressive model with auxiliary input (ARX) to predict the further progression of system resource utilization. Failures are predicted by estimating resource exhaustion times. A similar approach has been proposed by Sahoo et al. [220], who applied various time series models to data of a 350-node cluster system to predict parameters like percentage of system utilization, idle time, and network IO.

Signal Processing (1.2.4.3): Signal processing techniques are of course related to methods that have already been described (e.g., Kalman filters in category 1.2.1.3).
However, in contrast to the methods presented above, techniques of this category neither rely on any other input data nor do they require an abstract model of system behavior or a concept of (hidden) system states. Algorithms that fall into this category use signal processing techniques such as low-pass or noise filtering to obtain a clean estimate of a system resource measurement. For example, if free system memory is measured, observations will vary greatly due to allocation and freeing of memory. Such a measurement series can be seen as a noisy signal to which noise filtering techniques can be applied in order to obtain the “true” behavior of free system memory: If it is a continuously decreasing function, software aging is likely to be in progress, and the amount of free memory can be estimated for the near future by means of signal processing prediction methods (see Figure 3.3). However, to the best of our knowledge, signal processing techniques such as frequency transformations have only been used for data preprocessing so far.

Figure 3.3: Failure prediction using signal processing techniques on measurement data can, for example, be achieved by noise filtering

Manifestation of Faults – Errors (1.3): As already mentioned, the third major group of failure prediction methods that incorporate the current state of the system analyzes the occurrence of error events in order to assess the current situation with regard to upcoming failures. One of the major differences between error and symptom monitoring is that errors always denote an event, while symptoms are in most cases detected by periodic system observations. Furthermore, symptoms are in most cases values out of a continuous range, while error events are mostly characterized by discrete, categorical data such as event IDs, component IDs, etc. (see Figure 3.4).
Frequency of Occurrence (1.3.1): One assumption that is very common in failure prediction approaches is the notion that the frequency of error occurrence increases before a failure occurs. Several methods building on this assumption have been proposed over the decades.

Figure 3.4: Failure prediction based on the occurrence of errors (A, B, C). The goal is to assess the risk of failure at some point in the future (indicated by the question mark). In order to perform the prediction, some data that have occurred shortly before present time are taken into account (data window).

According to Siewiorek & Swarz [241], Nassar & Andrews [190] were the first to propose two ways of failure prediction based on the occurrence of errors. The first approach investigates the distribution of error types. If the distribution of error types changes systematically (i.e., one type of error occurs more frequently), a failure is supposed to be imminent. The second approach investigates error distributions for all error types obtained for intervals between crashes. If the error generation rate increases significantly, a failure is looming. Both approaches result in the computation of threshold values upon which a failure warning can be issued. Iyer et al. [131] apply a hierarchical aggregation method to error occurrences in order to filter out so-called symptoms (not to be confused with side effects of faults as the term is used in this thesis): First, errors of equal type reported by one machine form so-called clusters. Second, subsequent clusters that occur within some specified time interval are combined to form so-called error groups. Third, error groups that occur within a 24h interval and that share at least two error records are called “events”. After data aggregation, Iyer et al. estimate singleton and joint probabilities to test for statistical dependence (for independent random variables A and B, P(A, B) = P(A) · P(B) holds; if not, A and B are dependent and likely to occur together). A symptom of an event is formed by records that are common to most of the groups in an event.
Although originally used for automatic identification of the root cause of permanent faults, the detection of a symptom could as well be used for the prediction of upcoming failures (see also Iyer et al. [130]). The dispersion frame technique (DFT) developed by Lin & Siewiorek [167] uses a set of heuristic rules on the time of occurrence of consecutive error events to identify looming permanent failures. Since this method is used for comparison with the model presented in this thesis, DFT is further explained in Section 3.2.1. Lal & Choi [153] show plots and histograms of errors occurring in a UNIX server. The authors propose to aggregate errors in an approach similar to tupling (cf. Tsao & Siewiorek [258]) and state that the frequency of clustered error occurrence indicates an upcoming failure. Furthermore, they show histograms of error occurrence frequency over time before failure. More recently, Leangsuksun et al. [157] have presented a study where hardware sensor measurements such as fan speed, temperature, etc. are aggregated using several thresholds to generate error events with several levels of criticality. These events are analyzed in order to eventually generate a failure warning that can be processed by other modules. The study was carried out on data of a high-availability high-performance Linux cluster. In the paper presented by Levy & Chillarege [162], the authors derive three principles, two of which fall into this category: principle one (“counts tell”) again emphasizes the property that the number of errors per time unit increases before a failure (since the paper is about a telecommunication system, the authors use the term alarm for what is termed an error here).
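As an illustration, the “counts tell” principle can be reduced to a sliding-window error-rate check. This is a minimal sketch; the window length and threshold are arbitrary assumptions, not values from any of the cited papers.

```python
from collections import deque

class ErrorRatePredictor:
    """Warn when the number of errors observed within the last
    `window` time units exceeds `threshold`."""
    def __init__(self, window, threshold):
        self.window, self.threshold = window, threshold
        self.times = deque()

    def observe(self, t):
        # Record the new error and drop errors that left the window.
        self.times.append(t)
        while self.times and self.times[0] <= t - self.window:
            self.times.popleft()
        # True means a failure warning is issued.
        return len(self.times) > self.threshold
```

Each call to observe() both updates the window and returns whether the current error rate warrants a warning.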
Principle number three (“clusters form early”) basically states the same, putting more emphasis on the fact that for common failures the effect is even more apparent if errors are clustered into groups. Another link to this relationship between errors and failures is provided by Liang et al. [165]: The authors have analyzed jobs of an IBM BlueGene/L supercomputer and support the thesis: “On average, we observe that if a job experiences two or more non-fatal events after filtering, then there is a 21.33% chance that a fatal failure will follow. For jobs that only have one non-fatal event, this probability drops to 4.7%”.

Rule-based Systems (1.3.2): The essence of rule-based failure prediction is that the occurrence of a failure is predicted once at least one of a set of conditions is met. Hence rule-based failure prediction has the form

IF <condition_1> THEN <failure warning>
IF <condition_2> THEN <failure warning>
...

Since in most computer systems the set of conditions cannot be set up manually, the goal of failure prediction algorithms in this category is to identify the conditions algorithmically from a set of training data. The art is to find a set of rules that is general enough to capture as many failures as possible but that is also specific enough not to generate too many false failure warnings.

Data mining (1.3.2.1): To our knowledge, the first data mining approach to failure prediction has been published by Hätönen et al. [116]. The authors describe that a rule miner was set up by manually specifying certain characteristics of episode rules. For example, the maximum length of the data window, the types of error messages (as this work has also been published in the telecommunication community, the authors use the term alarm instead of error), and ordering requirements had to be specified. However, the algorithm returned so many rules that they had to be presented to human operators with system knowledge in order to filter out informative ones. Weiss [275] introduces a failure prediction technique called “timeweaver” that is based on a genetic training algorithm.
In contrast to searching and selecting patterns that exist in the database, rules are generated “from scratch” by use of a simple language: error events are connected with three types of ordering primitives. The genetic algorithm starts with an initial set of rules (the so-called initial population) and repetitively applies crossing and mutation operations to generate new rules. The quality of the obtained candidates is assessed using a special fitness function that incorporates both prediction quality (based on a variant of the F-measure that allows adjusting the relative weight of precision and recall) as well as diversity of the rule set. After generating a rule set with the genetic algorithm, the rule set is pruned in order to remove redundant patterns. Results are compared to three standard machine learning algorithms: C4.5rules [209], RIPPER [61] and FOIL [208]. Although timeweaver outperforms these algorithms, standard learning algorithms might work well for failure prediction in other applications. Vilalta & Ma [268] describe a data mining approach that is tailored to short-term prediction of Boolean data. Since the approach builds on a concept termed “eventsets”, the failure prediction algorithm is referred to here as the eventset method. The method searches for predictive subsets of events occurring before a target event. In the terminology used here, events refer to errors and target events to failures. The first major concept of the method addresses class skewness (see Section 2.5.3). The solution is —similar to the solution used in this thesis— to first consider only error sequences preceding a failure within some time window, and to incorporate non-failure data only to remove unwanted patterns in a later step.
The eventset method is used for comparative analysis and is hence explained in more detail in Section 3.2.2. The eventset method has also been applied for failure prediction in a 350-node cluster system, as described in [220]. As indicated by its name, the eventset method operates on sets of errors and does not take the ordering of errors into account, while the timeweaver method includes partial ordering. However, there are other data mining methods having the potential to achieve good results, which have not yet been applied to the problem of failure prediction. For example, a lot of research has been published in the field of sequential pattern mining. As an example, Srikant & Agrawal [249] introduce the concept of ontologies that would enable incorporating relationships between error messages, which is closely related to hierarchical fault models. A second area of research having developed methods that could as well be applied to failure prediction is concerned with the analysis of path traversal patterns. For example, Chen et al. [55] generate a tree structure of path traversals to identify frequent paths and to isolate those paths that set up a basis (so-called “maximal reference sequences”). However, since the method assumes a dedicated start of all paths (which is the root node of the tree), its application to failure prediction is limited to areas where such dedicated starting points exist, as in transaction-based systems.

Fault trees (1.3.2.2): Fault trees have been developed in the 1960s and have become a standard reliability modeling technique. A comprehensive treatment of fault trees is, for example, given by Vesely et al. [265]. The purpose of fault trees is to model conditions under which failures can occur using logical expressions. Expressions are arranged in the form of a tree, and probabilities are assigned to the leaf nodes, facilitating computation of the overall failure probability.
Fault tree analysis is a static analysis that does not take the current system status into account. However, if the leaf nodes are combined with online fault detectors, and the logical expressions are transformed into a set of rules, fault trees can be used as an online failure predictor. Although such an approach has been applied to chemical process failure prediction [260] and power systems [216], we have not found it applied to computer systems.

Other approaches (1.3.2.3): In the area of machine learning, a broad spectrum of methods is available that could in principle be used for online failure prediction. This paragraph only lists a few techniques that either have been applied for failure prediction or that seem at least promising. A relatively new technique on the rise is so-called rough set theory [199]. Chiang & Braun [58] propose a combination of rough set theory with neural networks to predict failures in computer networks based on network events. Rough set theory has also been applied to aircraft component failure prediction (cf., e.g., Pena et al. [200]). Bai et al. [20] employ a Markov Bayesian network for reliability prediction, but a similar approach might work for online failure prediction as well. The same holds for decision tree methods: upcoming failures can be predicted if error events are classified using a decision tree approach similar to Chen et al. [54], which has been described in Section 1.2.1.2.

Pattern recognition (1.3.3): Sequences of errors form error patterns. The principle of pattern recognition in this category is to assign a ranking value to an observed sequence of error events expressing similarity with learned patterns that are known to lead to system failures.
Failure prediction is then accomplished by classification based on pattern similarity rankings (see Figure 3.5).

Figure 3.5: Failure prediction by recognition of failure-prone error patterns

Probabilistic context-free grammars – PCFG (1.3.3.1): This modeling technique has been developed in the area of statistical natural language processing (see, e.g., Manning & Schütze [175]). A probabilistic context-free grammar consists of a set of rules of a context-free grammar N_i → X, where N_i is a nonterminal symbol and X is a sequence of terminals and nonterminals. Furthermore, PCFGs associate a probability with each rule such that

∀i: Σ_j P(N_i → X_j) = 1.

Given a sentence, which is a sequence of terminal symbols, i.e., the words, the sentence's probability can be computed by finding and summing over all possible parse trees having the given sentence as leaf nodes. The probability of each tree is defined as the product of the rule probabilities that were used to generate the parse tree. Algorithms have been developed to perform these computations efficiently in a dynamic programming manner. Furthermore, algorithms have been developed to learn rule probabilities from a given set of training sentences. Failure prediction could be realized with PCFGs by learning the grammar of error event sequences that have led to a failure in the training dataset. Following the approach depicted in Figure 3.5, failures can be predicted during runtime by computing the probability of the sequence of error events that have occurred in a time window before present time. To our knowledge, such an approach has not been implemented for online failure prediction. The only failure-related publications that use PCFGs are Chen et al. [52] and Kiciman & Fox [144]. However, these papers analyze runtime paths, which are symptoms rather than errors —hence this approach has been described in category 1.2.3.3.
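To illustrate how a sentence probability is obtained by summing products of rule probabilities over all parse trees, the following toy computes inside probabilities for a hypothetical PCFG whose terminals are error events. The grammar, its probabilities, and the event names are invented for illustration only.

```python
from functools import lru_cache

# A toy PCFG in Chomsky normal form: nonterminal -> [(rhs, probability)],
# where rhs is either a pair of nonterminals or a single terminal symbol.
# Rule probabilities for each nonterminal sum to 1.
GRAMMAR = {
    "S": [(("A", "B"), 0.7), (("B", "A"), 0.3)],
    "A": [("err_mem", 0.6), ("err_io", 0.4)],
    "B": [("err_net", 1.0)],
}

@lru_cache(maxsize=None)
def inside(nonterminal, sentence):
    """Inside probability: P(nonterminal derives the terminal sequence),
    summing over all parse trees via memoized dynamic programming."""
    total = 0.0
    for rhs, p in GRAMMAR[nonterminal]:
        if isinstance(rhs, tuple):                # binary rule N -> X Y
            for split in range(1, len(sentence)):
                total += p * inside(rhs[0], sentence[:split]) \
                           * inside(rhs[1], sentence[split:])
        elif len(sentence) == 1 and sentence[0] == rhs:   # terminal rule
            total += p
    return total
```

A low inside probability for the observed error sequence would indicate that it does not match the learned failure grammar.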
A further well-known stochastic speech modeling technique is n-gram models [175]. N-grams represent sentences by conditional probabilities taking into account a context of up to n words in order to compute the probability of a given sentence (although in most applications of statistical natural language processing the goal is to predict the next word using P(w_n | w_1, ..., w_{n−1}), the two problems are connected via the theorem of conditional probabilities). The conditional densities are estimated from training data. Transferring this concept to failure prediction, error events correspond to words and error sequences to sentences. If the probabilities (the “grammar”) of an n-gram model were estimated from failure sequences, high sequence probabilities would translate into “failure-prone” and low probabilities into “not failure-prone”.

Markov models (1.3.3.2): Similarity of error sequences to failure-prone patterns extracted from training data can be computed with Markov models in two different ways, depending on whether a Markov chain or a hidden Markov model (HMM) is used. In the case of Markov chains, each error event corresponds to a state in the chain. Sequence similarity is hence computed as the product of state traversal probabilities. Similar events prediction (SEP), which is the predecessor of the prediction technique developed in this thesis, was built on this concept (see [226] for a description). The failure prediction approach described in this thesis also belongs to this category. The first ideas have been published in Salfner [223], but an implementation has shown that the concept needed to be developed further, which resulted in the approach presented here.

Pairwise alignment (1.3.3.3): Computing similarity between sequences is one of the key tasks in biological sequence analysis [86]. Various algorithms have been developed, such as the Needleman-Wunsch algorithm [191], the Smith-Waterman algorithm [244], and the BLAST algorithm [8]. The outcome of such algorithms is usually a score evaluating the alignment of two sequences.
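Such an alignment score can be sketched with a basic Needleman-Wunsch implementation over event sequences; the match, mismatch, and gap scores below are arbitrary placeholders for a real substitution matrix.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of two sequences (Needleman-Wunsch).
    A full substitution matrix could replace the match/mismatch scores."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty sequence costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)
    return score[-1][-1]
```

A high score between an observed error sequence and a known failure sequence would then count as evidence for an upcoming failure.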
If used as a similarity measure between the sequence under investigation and known failure sequences, failure prediction can be accomplished as depicted in Figure 3.5. One of the advantages of alignment algorithms is that they build on a substitution matrix providing scores for the substitution of symbols. In terms of error event sequences, this technique has the potential to define a score for one error event being “replaced” by another event, making it possible to use a hierarchical grouping of errors as defined in Section 5.4. However, to our knowledge, no failure prediction approaches applying pairwise alignment algorithms have been published to date.

Other Methods (1.3.4)

Statistical tests (1.3.4.1): Principle number two (“the mix changes”) in Levy & Chillarege [162] delineates the discovery that the order of subsystems, sorted by error generation frequency, changes prior to a failure. According to the paper, relative error generation frequencies of subsystems follow a Pareto distribution: Most errors are generated by only a few subsystems, while most subsystems generate only very few errors (this is also known as Zipf's law [285]). The proposed failure prediction algorithm monitors the order of subsystems and predicts a failure if it changes significantly, which basically is a statistical test.

Classifier (1.3.4.2): Classifiers usually associate an input vector with a class label. In category 1.3, input data consists of one or more error events that have to be represented by a vector in order to be processed by a classification algorithm. A straightforward solution would be to use the error type of the first event in a sequence as the value of the first input vector component, the second type as the second component, and so on.
However, it turns out that such a solution does not work: If the sequence is only shifted one step, the sequence vector is orthogonally rotated in the input space, and most classifiers will not judge the two vectors as similar. One solution to this problem has been proposed by Domeniconi et al. [81]: SVD-SVM (Singular Value Decomposition and Support Vector Machine) borrows a technique known from information retrieval: the so-called “bag-of-words” representation of texts [175]. In the bag-of-words representation, there is a dimension for each word of the language. Each text is a point in this high-dimensional space, where the magnitude along each dimension is defined by the number of occurrences of the specific word in the text (there are more sophisticated representations incorporating term weighting, such as tf.idf, but these have not been used for SVD-SVM). SVD-SVM applies the same technique to represent error event sequences. Since SVD-SVM is used for comparative analysis, it is described in more detail in the next section.

3.2 Methods Used for Comparison

In order to compare the prediction method presented in this thesis to the state of the art, other prediction methods have been implemented and applied to the data of the case study. The selection of approaches is primarily based on the type of input data: the best-known and most promising error-based approaches have been chosen, which are:

• Dispersion Frame Technique (DFT) developed by Lin [166], which is an error-frequency-based approach (category 1.3.1)
• Eventset Method developed by Vilalta & Ma [268], which is a data-mining approach (category 1.3.2)
• SVD-SVM developed by Domeniconi et al. [81], which is a classification approach (category 1.3.4)

Together with the pattern recognition approach presented in this dissertation, all categories of error-based failure prediction are covered.
In addition to these methods, a periodic prediction of failures based on mean time between failures (MTBF), which belongs to category 1.1, has been applied in order to show the prediction results that can be achieved with almost no effort. Comparing the data that is taken into account by the various prediction methods, one can conclude:

• DFT only makes use of the time of error occurrence.
• Eventset only makes use of the type of error occurrence.
• SVD-SVM makes use of the type of error events. Using a bag-of-words representation, the number of error occurrences can also be incorporated. Using a special representation to incorporate the time of error occurrence has not been successful for the case study.
• MTBF only takes the occurrence of failures into account.

In this regard, the novelty of the approach presented here is that it is the first to analyze error events as an event-triggered temporal sequence.

3.2.1 Dispersion Frame Technique

Lin [166] has developed a technique called Dispersion Frame Technique (DFT) that evaluates the time of error occurrence and is therefore classified into category 1.3.1. It is based on the notion that errors occur more frequently before a failure occurs. It is a well-known heuristic to analyze error occurrence frequencies and has been shown to be superior to classic statistical approaches like fitting of Weibull distribution shape parameters [167, 12]. The technique was developed for data of the Andrew distributed File System at Carnegie Mellon University. The following paragraphs describe DFT as originally published; notes about its application to the case study in this thesis are provided at the end.

Figure 3.6: Dispersion Frame Technique. Diamond i denotes the last error that has occurred, i − 1 the predecessor error of the same type. DF denotes a dispersion frame and EDI the error dispersion index. W denotes a failure warning that is issued at the end of DF1 centered around error i − 2.
The first step of DFT prediction is to separate all error events pertinent to one device. Then the time of error occurrence for each device is analyzed. A Dispersion Frame (DF) is the time interval between successive error events of the same type. In Figure 3.6, two DFs are shown: DF1 is the time interval between errors i − 4 and i − 3, whereas DF2 is the interval between errors i − 3 and i − 2. Each DF is shifted such that it is centered around the next and the next-but-one error. The Error Dispersion Index (EDI) is defined to be the number of error occurrences in the later half of a DF. If it is observed that a DF is less than 168 hours, a heuristic is activated, which predicts a failure if at least one of the following rules is met:

1. when two successive applications of the same DF each exhibit an EDI of at least three (in Figure 3.6 this is true for DF1 centered around i − 3 and i − 2),
2. when two successive DFs each exhibit an EDI of at least three,
3. when a dispersion frame is less than one hour,
4. when four error events occur within a 24-hour frame,
5. when there are four monotonically decreasing DFs and at least one DF is half the size of its previous DF (this rule is also met in Figure 3.6).

The failure warning is issued at the end of the dispersion frame, as shown in the figure. As might have become clear, the rules are heuristic and account for several types of system behavior. For example, rules three and four put absolute thresholds on error-occurrence frequencies, whereas rules one and two put thresholds on window-averaged occurrence frequencies. Finally, rule five is designed to detect trends in error occurrence frequencies. It should be noted that DFT was developed for data of the Andrew distributed File System (AFS). In this dissertation, the approach has been transferred to the prediction of failures of a component-based industrial telecommunication system.
Therefore, the DFT method had to be adapted slightly:

1. AFS is a physically distributed campus-wide system, and error messages could easily be assigned to field replaceable units (FRUs), which are also strong fault containment regions. The data used for the case study in this thesis derives from a non-distributed system built from software components. In the case of AFS, error detection took place within each FRU, while in the case study considered here, software components are much weaker fault containment regions and error detection frequently took place in other parts of the system. Moreover, components were sometimes not even identifiable in the data. Hence, software containers, which execute the components, have been considered as the entity equivalent to FRUs.

2. There are several parameters in the ruleset that are problem-specific. For example, the activation threshold of 168 hours is the time above which faults are considered to be unrelated. Since the goal of the case study used here is to predict service availability failures on a five-minute timescale, the ruleset parameters had to be adapted. To do this, each parameter has been “optimized” separately by varying parameter values. Each choice has been evaluated with respect to the ability to predict failures. If two choices for a parameter were almost equal in precision and recall (see Chapter 8), the one with fewer false positives has been chosen.

3. There is no notion of warning time ∆tw in the method. Since warning time is the minimum time for any failure prediction to be useful, failure warnings issued for the interval (t, t + ∆tw] are removed. Indeed, by design DFT can only predict failures at most half the length of a dispersion frame ahead. This resulted in the removal of quite a lot of warnings due to the short inter-error-event times occurring in the data.
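As a rough sketch of how part of the DFT ruleset can be operationalized, the following checks only the absolute-threshold rules three and four on a list of error timestamps (in hours). The EDI-based rules one, two, and five, the 168-hour activation check, and the warning-time handling are all omitted; parameter names are ours.

```python
def dispersion_frames(times):
    """Inter-arrival times between successive errors of the same type
    (the dispersion frames, in hours)."""
    return [b - a for a, b in zip(times, times[1:])]

def dft_warning(times, frame_limit=1.0, burst_count=4, burst_span=24.0):
    """Issue a warning per DFT rule three (a dispersion frame shorter than
    one hour) or rule four (four errors within a 24-hour frame)."""
    frames = dispersion_frames(times)
    if frames and frames[-1] < frame_limit:                 # rule 3
        return True
    recent = [t for t in times if t > times[-1] - burst_span]
    return len(recent) >= burst_count                       # rule 4
```

In a full implementation, the EDI rules would additionally count errors in the later half of each frame as it is shifted over subsequent errors.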
3.2.2 Eventset Method

The prediction approach published by Vilalta & Ma [268] is based on data mining techniques. The basic concept of the method is the so-called eventset. As the name indicates, an eventset E = {X_i} is a set of error events that indicates an upcoming failure. The failure predictor consists of a set of eventsets. The goal of the training procedure is to find a good set of eventsets such that as many failures as possible can be captured with as few false warnings as possible. In order to deal with the imbalance of class distributions (failures are rare events), the method first considers only failure data and uses non-failure data in a second validation step. Failure data consists of all error events that have occurred within a time window of length ∆td before each failure in the training dataset. These windows are termed failure windows here. The original approach does not consider lead-time ∆tl; it has been incorporated by shifting the failure window, as depicted in Figure 3.7.

Figure 3.7: The eventset method builds a database of sets of errors occurring within a time window before failures (indicated by t). The database is then reduced in several steps to yield a better predictor. In some of these steps, data occurring in non-failure windows are used. ∆td denotes the length of the data window and ∆tl the lead-time.

An initial database consisting of all subsets of events that have occurred in the event windows is set up. This initial database of eventsets is then reduced in three steps:

1. Keep only frequent eventsets. An eventset is said to be frequent if it has support greater than a user-defined threshold. Support is defined to be the relative frequency of occurrence in the failure windows:

support(E) = (number of failure windows containing E) / (total number of failure windows). (3.2)

In the example, eventsets {A}, {B}, and {A, B} have support 100% and eventsets {C}, {A, C}, {B, C}, and {A, B, C} have support 50%.
Assuming a threshold of, say, 70%, only the first three eventsets remain in the database.

2. Keep only accurate eventsets. In the example, the event A also occurs between the two failures, which leads to the conclusion that the occurrence of A does not indicate an upcoming failure. Confidence takes this into account: confidence is defined to be the relative frequency of occurrence of the eventset with respect to all time windows (including those that do not precede a failure event):

confidence(E) = (number of failure windows containing E) / (number of all windows containing E). (3.3)

An eventset is said to be accurate if it has confidence greater than a user-defined threshold. In the example, eventsets {B} and {A, B} have confidence 100%, while {A} has confidence 2/3. Assuming a confidence threshold of, say, 70%, only eventsets {B} and {A, B} remain in the database. Since putting a threshold on confidence does not check for negative correlations, an additional statistical test is performed, testing the null hypothesis

H0 : P(E | failure windows) ≤ P(E | non-failure windows). (3.4)

Only eventsets E for which H0 can be rejected (with a certain confidence level) stay in the database.

3. Remove eventsets that are too general. Remaining eventsets are ordered by confidence in the first place, subsequently by support, and finally by specificity: an eventset E1 is more specific than E2 if E2 ⊂ E1. Going through the sorted list of eventsets, the algorithm removes eventsets that are less specific. In the example, the sorted list of eventsets consists of [{A, B}, {B}]. Since {B} ⊂ {A, B}, {B} is removed and the only remaining eventset is {A, B}. This means that events A and B must occur together in order to indicate an upcoming failure.

Failure prediction is performed by checking whether any eventset of the database is a subset of the currently observed set of error events.
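The support and confidence computations of steps 1 and 2 can be sketched as follows, assuming that failure and non-failure windows are given as sets of event types; the function names are hypothetical and the window contents reproduce the running example:

```python
# Minimal sketch of the support/confidence filtering steps (names hypothetical).
from itertools import chain, combinations

def subsets(events):
    """All non-empty subsets of a set of events."""
    s = sorted(events)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(1, len(s) + 1))]

def support(E, failure_windows):
    return sum(E <= w for w in failure_windows) / len(failure_windows)

def confidence(E, failure_windows, nonfailure_windows):
    containing = sum(E <= w for w in failure_windows + nonfailure_windows)
    return sum(E <= w for w in failure_windows) / containing if containing else 0.0

# Example from the text: two failure windows and one non-failure window.
fail_sets = [{"A", "B", "C"}, {"A", "B"}]
nonfail_sets = [{"A"}]

candidates = {E for w in fail_sets for E in subsets(w)}
frequent = [E for E in candidates if support(E, fail_sets) > 0.7]
accurate = [E for E in frequent
            if confidence(E, fail_sets, nonfail_sets) > 0.7]
```

With a 70% threshold on both measures, `frequent` keeps {A}, {B}, and {A, B}, and `accurate` then removes {A}, whose confidence of 2/3 falls below the threshold.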
For example, if, during runtime, errors A, C, and B occur within a time window spanning an interval of length ∆td, a failure is predicted since the eventset {A, B} ⊂ {A, B, C}. As might have become clear, the initial database of eventsets has the cardinality of a power set, which would render the algorithm infeasible in real applications. Therefore, the first step of support filtering is incorporated into the generation of the initial eventset database by use of the Apriori algorithm (Agrawal et al. [2]), which also applies branch-and-bound techniques.

3.2.3 SVD-SVM Method

Latent semantic indexing (LSI) is a technique developed in information retrieval that makes it possible to find related text documents even if they do not share search terms. LSI is based on the notion of co-occurrence of terms and provides a method to identify "latent" semantic concepts in texts (see, e.g., [175]). Domeniconi et al. [81] have applied this technique to the problem of failure prediction and assume that co-occurrence of error events indicates the "latent" state of the system.23 More specifically, the approach consists of three steps.

1. Error sequences are represented in a so-called bag-of-words representation, which is frequently used in natural language processing: for text documents, there is a dimension for each word of the language, and the magnitude along each dimension is, for example, simply the number of times the word occurs in the document. In the case of error event sequences, there is a dimension for each event type, and the magnitude along the dimension (i.e., the distance from the origin) represents how "prominent" an error type is in the sequence.
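The online prediction step of the eventset method can be sketched in a few lines; the function name is hypothetical and the database reproduces the running example:

```python
# Hypothetical sketch of the online prediction step: a failure is predicted
# as soon as any eventset in the database is a subset of the errors
# currently observed within the window of length ∆td.
def predict_failure(eventset_db, observed_errors):
    observed = set(observed_errors)
    return any(E <= observed for E in eventset_db)

db = [frozenset({"A", "B"})]                  # database from the example

print(predict_failure(db, {"A", "C", "B"}))   # → True, since {A,B} ⊆ {A,B,C}
print(predict_failure(db, {"A", "C"}))        # → False
```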
The authors describe three ways of assigning a value to "prominence":

• existence: one if an event occurs in the sequence, zero if not;
• count: the number of occurrences in the sequence;
• temporal: partitioning the sequence into time slots and assigning a one to a binary digit if the event occurs within the corresponding time slot.

The key notion of the bag-of-words representation is that each event sequence represents a point in a high-dimensional space, and hence the entire training data set forms a multidimensional point cloud. The process of turning error log data into sequences is similar to the eventset method: all errors occurring within a time window of length ∆td preceding a failure by lead-time ∆tl constitute a failure sequence, which is translated into a positive (failure-prone) bag-of-words data point. Errors occurring in data windows between failures constitute negative examples (see Figure 3.8).

Figure 3.8: Bag-of-words representation of error sequences occurring prior to failures. Each time window defines an event sequence. Assuming that there are only two types of error messages (A and B), each sequence can be mapped to a point in two-dimensional event-type space, where the magnitude along each dimension is determined by the number of times the event occurs in the sequence. Sequences from windows preceding a failure are positive examples (black bullets); sequences from windows between failures constitute negative examples (white bullets). ∆td denotes the length of the window and ∆tl the lead-time.

2. Semantic concepts, which refer to the latent states of the system, are identified by means of singular value decomposition (SVD). The result of the SVD is then used to reduce the number of dimensions in the data. More precisely, co-occurring events in the space of event types are mapped onto the same dimensions in the space of latent states by a least-squares method that decomposes the matrix of training event sequences into a product of orthogonal and diagonal matrices. Assuming that there are n training sequences and m event types, the matrix of training data D is an m×n matrix with each column corresponding to an event sequence. SVD decomposes D into

D = U S V^T, (3.5)

where S is a diagonal matrix with ordered singular values on the main diagonal, indicating the amount of variation along each dimension. SVD has the property that projecting the data onto the first k dimensions yields a least-squares optimal projection.23 The projection matrix is defined by the first k columns of matrix U, and the projection can simply be performed by matrix multiplication. An example is shown in Figure 3.9: assuming that there are only two different errors A and B, the training data set can be represented in two-dimensional space. Figure 3.9-a shows an example using the count encoding, with black bullets indicating failure and white bullets indicating non-failure sequences. The training data defines a 2×11 matrix D. SVD computes new dimensions x1 and x2 as shown in (b), such that the projection (c) results in a least-squares optimal overall error. The projected data set has only one dimension.

Figure 3.9: Singular value decomposition (SVD). (a): Bag-of-words representation of the training data set. (b): Rotated dimensions found by SVD. (c): Projection onto the new dimension x1.

3. A classifier is trained in order to distinguish between failure and non-failure sequences. The input data to classification are the projected event sequences (obtained from step two). The classification technique used is Support Vector Machines (SVMs), which were developed in the early 1990s by Vapnik [264].24 Support vector machines are linear maximum-margin classifiers. Linear means that the decision boundary corresponds to a straight line in two-dimensional space and to a hyperplane in higher dimensions.

23 The authors call it "pattern context information".
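Steps one and two can be sketched with NumPy, assuming the count encoding; the event sequences below are made-up illustration data:

```python
# Sketch of count encoding + SVD projection (the data is made up).
import numpy as np

event_types = ["A", "B"]
sequences = [["A", "A", "B"], ["A", "B", "B"], ["B"], ["A"]]

# Bag-of-words "count" encoding: one row per event type, one column per sequence.
D = np.array([[seq.count(e) for seq in sequences] for e in event_types],
             dtype=float)                       # shape (m event types, n sequences)

U, s, Vt = np.linalg.svd(D, full_matrices=False)
k = 1                                           # keep the strongest latent dimension
projection = U[:, :k]                           # m × k projection matrix

reduced = projection.T @ D                      # k × n reduced training data

# Online prediction reuses the projection matrix on a new bag-of-words vector:
new_seq = ["B", "A", "B"]
x = np.array([[new_seq.count(e)] for e in event_types], dtype=float)
x_reduced = projection.T @ x                    # k-dimensional vector to classify
```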
However, such an approach can only classify linearly separable problems appropriately, which is not the case for most real-world classification problems. To remedy this, a second transformation into a high-dimensional feature space including non-linear features is performed, which can turn complex classification problems into linear problems in feature space. Figure 3.10 depicts such a transformation, denoted by ϕ. Although the additional transformation seems to introduce extra computational complexity, it is in fact one of the reasons for the computational efficiency of SVMs: the trick is that transformations exist for which the distance measure in feature space can be computed much more efficiently. The second important feature of SVMs is that they belong to the class of maximum-margin classifiers, which means that the decision boundary is chosen such that the margin25 is maximal. It has been proven that this results in the most robust classification (see, e.g., [237]).

Figure 3.10: Maximum-margin classification in feature space. On the left-hand side, data points in the original space cannot be separated linearly. By the transformation ϕ, data points are transformed into a feature space where a linear separation is possible. The decision boundary (indicated by the dashed line) is chosen such that the margin (solid lines) is maximal.

After training, online failure prediction is performed in three steps:

1. All error events that occurred within a time window of length ∆td before the present time are represented as a bag-of-words.
2. Singular value decomposition need not be performed for online prediction. Instead, the bag-of-words vector is transformed into the reduced semantic space by multiplication with the projection matrix.
3. The resulting k-dimensional vector is classified using a support vector machine, which includes a further transformation using ϕ.
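The maximum-margin idea can be illustrated with a minimal linear SVM trained by stochastic sub-gradient descent on the regularized hinge loss. This is a simplified stand-in for the solvers actually used with SVMs, not the authors' implementation; the toy data, learning rate, and regularization constant are made-up assumptions:

```python
# Minimal linear SVM sketch: sub-gradient descent on the hinge loss
# max(0, 1 - y * (w·x + b)) + lam * ||w||^2 (all values are made up).
import random

# Toy linearly separable data: label +1 if roughly x1 + x2 > 1, else -1.
data = [((0.0, 0.0), -1), ((0.2, 0.1), -1), ((0.1, 0.4), -1),
        ((1.0, 1.0), +1), ((0.9, 0.8), +1), ((1.2, 0.7), +1)]

w = [0.0, 0.0]
b = 0.0
lam, lr = 0.01, 0.1                     # regularization strength, learning rate

random.seed(0)
for epoch in range(200):
    random.shuffle(data)
    for (x1, x2), y in data:
        margin = y * (w[0] * x1 + w[1] * x2 + b)
        if margin < 1:                  # point inside the margin: hinge is active
            w[0] += lr * (y * x1 - 2 * lam * w[0])
            w[1] += lr * (y * x2 - 2 * lam * w[1])
            b += lr * y
        else:                           # only the regularizer contributes
            w[0] -= lr * 2 * lam * w[0]
            w[1] -= lr * 2 * lam * w[1]

def classify(x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1
```

The kernel trick mentioned above would replace the inner product w·x with a kernel evaluation, avoiding an explicit computation of ϕ.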
24 An introduction can be found in Cristianini & Shawe-Taylor [70].
25 The margin is the distance to the closest data points.

According to the authors of SVD-SVM, failure patterns show similar properties to text classification tasks. For example, the frequency distribution of error events follows Zipf's law [285], which inspired them to apply text processing techniques.

3.2.4 Periodic Prediction

The failure prediction method used to estimate some sort of lower bound can be derived directly from reliability theory, since the probability of failure occurrence up to time t is simply

F(t) = 1 − R(t), (3.6)

where R(t) is reliability. Assuming a Poisson failure process, reliability turns out to be an exponential distribution (see, e.g., Musa et al. [189]) and the failure probability is

F(t) = 1 − e^{−λt}. (3.7)

The distribution parameter λ is fit to the data by setting

λ = 1 / MTBF, (3.8)

where MTBF denotes the mean time between failures of the training data set.26 Using this model, a failure is predicted according to the median of the failure distribution:

Tp = ln(2) / λ. (3.9)

3.3 Summary

This chapter has introduced a taxonomy of online failure prediction approaches for complex computer systems and has provided a comprehensive survey of online failure prediction methods. Furthermore, the survey points to research areas that provide a toolbox of methods that could most promisingly be applied to the task of online failure prediction. From this it can be concluded that the technique presented in this thesis is the first to apply temporal sequence pattern recognition methods to the task of online failure prediction. The second major goal of this chapter was to describe in detail four existing failure prediction approaches that are used for comparative analysis in this thesis, namely the dispersion frame technique, the eventset method, SVD-SVM, and a periodic prediction based on a reliability model.

Contributions of this chapter.
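The baseline can be sketched directly from Equations 3.8 and 3.9; the failure times below are made-up illustration data:

```python
import math

# Reliability-based baseline sketch: fit λ from the MTBF of the training
# data, then predict failure at the median of F(t) = 1 - exp(-λt).
# The failure times are made-up example data (in minutes).
failure_times = [0.0, 90.0, 210.0, 290.0, 420.0]
intervals = [b - a for a, b in zip(failure_times, failure_times[1:])]
mtbf = sum(intervals) / len(intervals)
lam = 1.0 / mtbf                        # Equation 3.8

t_p = math.log(2) / lam                 # Equation 3.9: median of the distribution

print(f"MTBF = {mtbf:.1f} min, predict next failure after {t_p:.1f} min")
```

At Tp, the cumulative failure probability F(Tp) equals exactly 1/2, which is what makes the median a natural periodic prediction point.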
To the best of our knowledge, this chapter provides the first taxonomy and the first survey of online failure prediction approaches.

26 Some works use MTTF instead of MTBF, but since in the case study performance failures are predicted, repair time is not an issue here.

Relation to other chapters. This chapter has presented related work with respect to online failure prediction approaches. Since the failure prediction method presented in this thesis is based on an extension to hidden Markov models, related work on hidden Markov models is presented in the next chapter. However, in order to explain the various models, an introduction to the theory of hidden Markov models is provided first.

Chapter 4

Introduction to Hidden Markov Models and Related Work

As a result of their capabilities, hidden Markov models (HMMs) are used more and more frequently in modeling. Examples include the detection of intrusion into computer systems [273], fault diagnosis [73], network traffic modeling [229, 274], estimation and control [90], speech recognition [125], part-of-speech tagging [175], and genetic sequence analysis applications [86]. In this work, HMMs are used for online failure prediction following a pattern recognition approach. For this reason, this chapter gives an introduction to the theory of HMMs (Section 4.1). The approach taken in this thesis builds on the assumption that the time and type of error occurrences are crucial for accurate failure prediction. However, standard HMMs are not appropriate models for processing temporal sequences. Section 4.2 presents principal approaches to handling temporal sequences with HMMs, followed by related work on time-varying HMMs in Section 4.3.
4.1 An Introduction to Hidden Markov Models

HMMs are based on discrete-time Markov chains (DTMCs), which consist of a set S = {si} of N states, a square matrix A = [aij] defining transition probabilities between the states, and a vector of initial state probabilities π = [πi] (see Figure 4.1). A is a stochastic matrix, which means that all row sums equal one:

∀i : \sum_{j=1}^{N} aij = 1. (4.1)

Additionally, the vector of initial state probabilities π must define a discrete probability distribution such that

\sum_{i=1}^{N} πi = 1. (4.2)

Figure 4.1: Discrete-time Markov chain.

The stochastic process defined by a DTMC can be described as follows: an initial state is chosen according to the probability distribution π. Starting from the initial state, the process transits from state to state according to the transition probabilities defined by A: being in state si, the successor state sj is chosen according to the probabilities aij. Such a process exhibits the so-called Markov assumptions or properties:

1. The process is memoryless: a transition's destination depends only on the current state, irrespective of the states that have been visited previously.
2. The process is time-homogeneous: the transition probabilities A stay the same regardless of the time that has already elapsed (A does not depend on time t).

More formally, both assumptions can be expressed by the following equation:

P(St+1 = sj | St = si, . . . , S0) = P(S1 = sj | S0 = si). (4.3)

Loss of memory is expressed by the fact that all previous states S0, . . . , St−1 are ignored on the right-hand side of Equation 4.3, and time-homogeneity is reflected by the fact that the transition probabilities for time t → t + 1 equal the probabilities for time 0 → 1. Hidden Markov models extend the concept of DTMCs in that at each time step an output (or observation) is generated according to a probability distribution.
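The generative process just described can be sketched in a few lines; the values of π and A below are made-up illustration values:

```python
import random

# Minimal sketch of the generative process of a DTMC (values are made up).
pi = [0.6, 0.4]                       # initial state distribution
A = [[0.7, 0.3],                      # row i: transition probabilities from state i
     [0.2, 0.8]]                      # each row sums to one (stochastic matrix)

def sample(dist, rng):
    """Draw an index according to a discrete probability distribution."""
    u, acc = rng.random(), 0.0
    for i, p in enumerate(dist):
        acc += p
        if u < acc:
            return i
    return len(dist) - 1

def run_dtmc(steps, rng):
    state = sample(pi, rng)           # choose the initial state according to π
    path = [state]
    for _ in range(steps):
        state = sample(A[state], rng) # memoryless, time-homogeneous transition
        path.append(state)
    return path

path = run_dtmc(10, random.Random(42))
```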
The key notion is that this output probability distribution depends on the state the stochastic process is in. Two types of HMMs can be distinguished with regard to the type of their outputs:

• If the output is continuous, e.g., a vector of real numbers, the model is called a continuous HMM.1
• If the output is chosen from some finite countable set, the outputs are called symbols. Such models are called discrete HMMs.

Since error message IDs are finite and countable, only discrete HMMs are considered. In order to formalize this, HMMs additionally define a finite countable set O = {oi} of M different symbols, which is called the alphabet of the HMM. A matrix B = [bij] of observation probabilities is defined, where each row i of B defines a probability distribution for state si, such that bij is the probability of emitting symbol oj given that the stochastic process is in state si:

bij = P(Ot = oj | St = si), (4.4)

where Ot denotes the random variable for the observation at time t. Hence, B has dimensions N × M and is a stochastic matrix such that

∀i : \sum_{j=1}^{M} bij = 1. (4.5)

Note that for readability reasons, bij will sometimes be denoted by bsi(oj). Figure 4.2 shows a simple discrete-time HMM.

1 Not to be confused with continuous-time HMMs, as explained later.

Figure 4.2: A discrete-time HMM with N = 4 states and M = 2 observation symbols.

The reason why HMMs are called "hidden" stems from the perspective that only the outputs can be observed from outside, while the actual state si the stochastic process resides in is hidden from the observer. From this notion, three basic problems arise, for which algorithms have been developed:

1. Given a sequence of observations and a hidden Markov model, but having no clue about the states the process has passed through to generate the sequence: what is the overall probability that the given sequence can be generated? This probability is called sequence likelihood.
The Forward algorithm provides an efficient solution to this problem.

2. Given a sequence and a model as above: what is the most probable sequence of states the process has traveled through while producing the given observation sequence? The Forward-Backward and Viterbi algorithms provide solutions to this problem.

3. Given a set of observation sequences: what are optimal HMM parameters A, B, and π such that the likelihood of the sequence set is maximal? The Baum-Welch training algorithm yields a solution by iteratively converging to at least a local maximum.

The following sections will introduce the three algorithms. Although the algorithms can be found in many textbooks or in Rabiner [210], they are described here for reasons of comparison: in Chapter 6, these algorithms are adapted for the hidden semi-Markov model introduced in this thesis.

4.1.1 The Forward-Backward Algorithm

As the name might suggest, the Forward-Backward algorithm consists of a forward and a backward part. The forward part alone provides a solution to the first problem: the computation of sequence likelihood. The likelihood of a given observation sequence o = [Ot] is the probability that a given HMM with parameters λ = (A, B, π) has generated the sequence, which is denoted by P(o|λ). In order to compute this probability, first assume that the sequence of hidden states s = [St] was known. The likelihood could then be computed by

P(o, s|λ) = π_{S0} b_{S0}(O0) \prod_{t=1}^{L} a_{St−1 St} b_{St}(Ot), (4.6)

where L is the length of the sequence. As only o is known, all possible state sequences s have to be considered and summed up:

P(o|λ) = \sum_{s} π_{S0} b_{S0}(O0) \prod_{t=1}^{L} a_{St−1 St} b_{St}(Ot). (4.7)

However, such an approach results in intractable complexity, since there are N^{L+1} different state sequences.
An efficient reformulation has been found by exploiting the Markov assumption that transition probabilities are time-homogeneous and depend only on the current state. Using this property, Equation 4.7 can be rearranged such that repetitive computations are grouped together. From this rearrangement it is only a small step to a recursive formulation, also known as dynamic programming. The resulting algorithm is called the Forward algorithm.

Forward algorithm. The algorithm is based on a forward variable αt(i) denoting the probability of the sub-sequence O0 . . . Ot together with the stochastic process being in state si at time t:

αt(i) = P(O0 O1 . . . Ot, St = si | λ). (4.8)

αt(i) can be computed by the following recursive scheme:

α0(i) = πi bsi(O0)
αt(j) = \sum_{i=1}^{N} αt−1(i) aij bsj(Ot); 1 ≤ t ≤ L. (4.9)

The algorithm can be visualized by a trellis structure, as shown in Figure 4.3. Each node represents one αt(i), while the edges visualize the terms of the sum in Equation 4.9. The trellis can be computed from left to right, from which the name "Forward algorithm" is derived. As αL(i) is the probability of the entire sequence together with the stochastic process being in state si at the end of the sequence, the sequence likelihood P(o|λ) can be computed by summing over all states in the rightmost column of the trellis:

P(o|λ) = \sum_{i=1}^{N} αL(i), (4.10)

which is the solution to the first problem.

Figure 4.3: A trellis visualizing the Forward algorithm. Bold edges indicate the terms that have to be summed up in order to compute αt(i).

Backward algorithm. A backward variable βt(i) can be defined in a similar way, denoting the probability of the rest of the sequence Ot+1 . . . OL given that the stochastic process is in state si at time t:

βt(i) = P(Ot+1 . . . OL | St = si, λ).
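The forward recursion of Equations 4.8-4.10 can be sketched in a few lines; the model parameters and observation sequence below are made-up illustration values:

```python
# Forward algorithm following Equations 4.8-4.10 (model values are made up).
N = 2                                   # number of states
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.8, 0.2],                        # B[i][j] = P(symbol j | state i)
     [0.3, 0.7]]

def forward(obs):
    # alpha[i] holds α_t(i); the trellis is computed column by column.
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)                   # Equation 4.10: P(o|λ)

likelihood = forward([0, 0, 1])
```

Each step costs O(N²) operations, so the whole computation is O(N²L) instead of the O(N^{L+1}) enumeration of all state sequences.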
(4.11)

βt(i) can be computed in a similar recursive way:

βL(i) = 1 (4.12)
βt(i) = \sum_{j=1}^{N} aij bsj(Ot+1) βt+1(j); 0 ≤ t ≤ L − 1. (4.13)

Forward-backward algorithm. Combining αt(i) and βt(i) leads to an estimate of the probability that the process is in state si at time t, given an observation sequence o. This probability is denoted by

γt(i) = P(St = si | o, λ). (4.14)

Some computations yield:

P(St = si | O0 . . . OL, λ) = P(St = si, O0 . . . Ot Ot+1 . . . OL | λ) / P(O0 . . . OL | λ) (4.15)
= P(St = si, O0 . . . Ot | λ) P(Ot+1 . . . OL | St = si, λ) / P(O0 . . . OL | λ) (4.16)
= αt(i) βt(i) / P(O0 . . . OL | λ) (4.17)

and hence γt(i) can be computed by

γt(i) = αt(i) βt(i) / P(o | λ) = αt(i) βt(i) / \sum_{i=1}^{N} αt(i) βt(i). (4.18)

Viterbi algorithm. The forward-backward variable γt(i) does not yet solve the second problem completely, since γt(i) yields the most probable state at one point in time, but the task is to find the most probable sequence of states. A straightforward solution would be to select the most probable state at each time step t:

Smax(t) = arg max_i γt(i). (4.19)

However, it turns out that models exist for which some transitions from Smax(t) to Smax(t + 1) are not possible (i.e., the transition probability aij equals zero). This is due to the fact that α and β both combine all possible paths through the states of the DTMC, and γ is only the product of α and β. One solution to this problem is the Viterbi algorithm. Very similar to αt(i), let δt(i) denote the probability of the most probable state sequence for the sub-sequence of observations O0 . . . Ot that ends in state si:

δt(i) = max_{S0 ... St−1} P(O0 . . . Ot, S0 . . . St−1, St = si | λ). (4.20)

δt(i) can be computed by a slight modification of the Forward algorithm, using the maximum operator instead of the sum over all states:

δ0(i) = πi bsi(O0)
δt(j) = max_{1≤i≤N} δt−1(i) aij bsj(Ot); 1 ≤ t ≤ L.
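The Viterbi recursion with backtracing can be sketched as follows; the model parameters are the same kind of made-up illustration values as before:

```python
# Viterbi algorithm following Equation 4.20 and the recursion above
# (model values are made up).
N = 2
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.8, 0.2], [0.3, 0.7]]

def viterbi(obs):
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]
    back = []                                  # argmax predecessor per step
    for o in obs[1:]:
        prev = delta
        step = [max(range(N), key=lambda i: prev[i] * A[i][j])
                for j in range(N)]
        delta = [prev[step[j]] * A[step[j]][j] * B[j][o] for j in range(N)]
        back.append(step)
    # Backtrace from arg max δ_L(i) through the stored predecessors.
    state = max(range(N), key=lambda i: delta[i])
    path = [state]
    for step in reversed(back):
        state = step[state]
        path.append(state)
    return path[::-1]

best_path = viterbi([0, 0, 1, 1])
```

Replacing the sum of the Forward algorithm by a maximum, plus storing the argmax per node, is all that is needed to recover the single most probable state sequence.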
(4.21) (4.22)

In order to identify the states that contributed to the most probable sequence, each state selected by the maximum operator has to be stored in a separate array. The sequence can then be reconstructed by tracing backwards through the array, starting from state arg max_i δL(i).

4.1.2 Training: The Baum-Welch Algorithm

In the Forward-Backward algorithm, the HMM parameters λ were assumed to be fixed and known. However, in the majority of applications, λ cannot be inferred analytically but needs to be estimated from recorded sample data. In the machine learning community, such a procedure is called training. Several algorithms exist for HMM training, of which the Baum-Welch algorithm is the most prominent. In terms of HMMs, the goal of training is to maximize the sequence likelihood of the training sequences. More precisely, the parameters π, A, and B have to be set such that Equation 4.10 is maximized. For convenience, only a single training sequence is considered here; the case of multiple sequences is discussed later.

The algorithm can be understood most easily by first considering a simpler case where the sequence of "hidden" states is known. This occurs, e.g., in part-of-speech tagging2 applications. In this case, the parameters of the HMM can be optimized by maximum likelihood estimates:

• Initial state probabilities πi are determined by the relative frequency of sequences starting in state si:

π̂i = (number of sequences starting in si) / (total number of sequences). (4.23)

• Transition probabilities aij are determined by the number of times the process went from state si to state sj, divided by the number of times the process left state si to anywhere:

âij = (number of transitions si → sj) / (number of transitions si → ?). (4.24)

2 See, e.g., Manning & Schütze [175].
• Emission probabilities bsi(oj) are determined by the number of times the process has generated symbol oj in state si, compared to the number of times the process has been in state si:

b̂i(oj) = (number of times symbol oj has been emitted in state si) / (number of times the process has been in state si). (4.25)

However, in many applications the sequence of states is not known. The solution found by Baum and Welch introduces expected values for the unknown quantities. The algorithm belongs to the class of Expectation-Maximization (EM) algorithms.3 It consists of two major steps:

1. Expectation step: compute estimates for the unknown data (state probabilities) using the current set of model parameters.
2. Maximization step: adjust the model parameters to maximize data likelihood, using the estimates for the unknown data from the expectation step.

This scheme is repeated until the sequence likelihood converges. It can be proven (see Section 6.5) that at least a local maximum is found. In the following paragraphs, both steps are described in more detail.

Expectation step. Let Xt(i, j) denote the binary random variable indicating whether the transition taking place at time t passes from state si to sj or not. The expected value of Xt(i, j) is equal to the probability that Xt(i, j) is one.4 Let ξt(i, j) denote this probability (given an observation sequence o):

ξt(i, j) = P(St = si, St+1 = sj | o, λ). (4.26)

ξt(i, j) can be computed similarly to Equations 4.15-4.17 by interposing the transition from si to sj between α and β:

ξt(i, j) = αt(i) aij bsj(Ot+1) βt+1(j) / P(o|λ) (4.27)
= αt(i) aij bsj(Ot+1) βt+1(j) / \sum_{i=1}^{N} \sum_{j=1}^{N} αt(i) aij bsj(Ot+1) βt+1(j). (4.28)

This computation can also be visualized in a trellis, as shown in Figure 4.4. While ξt(i, j) is the expected value that a transition i → j takes place at time t, the expected value for the total number of transitions from si to sj is

\sum_t E[Xt(i, j)] = \sum_{t=0}^{L−1} ξt(i, j).
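For the simpler case with known state sequences, the maximum likelihood estimates of Equations 4.23-4.25 amount to counting; the labeled sequences below are made-up illustration data:

```python
from collections import Counter

# ML estimates (Equations 4.23-4.25) when the hidden states are known.
labeled = [                              # (state sequence, observation sequence)
    ([0, 0, 1], ["A", "A", "B"]),
    ([1, 1, 1], ["B", "A", "B"]),
]

start = Counter(states[0] for states, _ in labeled)
trans = Counter((a, b) for states, _ in labeled
                for a, b in zip(states, states[1:]))
emit = Counter((s, o) for states, obs in labeled
               for s, o in zip(states, obs))

pi_hat = {i: start[i] / len(labeled) for i in (0, 1)}
a_hat = {(i, j): trans[(i, j)] / sum(trans[(i, k)] for k in (0, 1))
         for (i, j) in trans}
b_hat = {(i, o): emit[(i, o)] / sum(n for (s, _), n in emit.items() if s == i)
         for (i, o) in emit}
```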
(4.29)

3 A more detailed discussion is given along with the proof of convergence for HSMMs; see Section 6.5.
4 E[X] = \sum_x x P(X = x) = 0 · P(X = 0) + 1 · P(X = 1) = P(X = 1).

Figure 4.4: A trellis visualizing the computation of ξt(i, j).

Note that summing ξt(i, j) over all destination states sj yields the probability of the source state si at time t:

\sum_{j=1}^{N} ξt(i, j) = γt(i). (4.30)

The expectation step requires knowledge of the model parameters λ, which are either known from (random) initialization or from a previous iteration of the EM algorithm.

Maximization step. The second step of the Baum-Welch algorithm is a maximum likelihood optimization of the parameters λ, based on the expected values estimated in the first step:

π̄i ≡ (expected number of sequences starting in state si) / (total number of sequences) ≡ γ0(i) (4.31)

āij ≡ (expected number of transitions si → sj) / (expected number of transitions si → ?) ≡ \sum_{t=0}^{L−1} ξt(i, j) / \sum_{t=0}^{L−1} γt(i) (4.32)

b̄i(k) ≡ (expected number of times observing ok in state si) / (expected number of times in state si) ≡ \sum_{t=0, Ot=ok}^{L−1} γt(i) / \sum_{t=0}^{L−1} γt(i). (4.33)

Notes on the Baum-Welch algorithm. The starting point of the Baum-Welch algorithm is a completely initialized HMM. This means that the number of states, the number of observation symbols, the transition probabilities, initial probabilities, and observation probabilities need to be defined. The algorithm then iteratively improves the model's parameters λ = (A, B, π) until a (local) maximum in sequence likelihood is reached.5 In each M-step, the expectation values of the previous E-step are used, and vice versa. Several properties of the algorithm can be derived from this:

• The number of states and the size of the alphabet are not changed by the algorithm.

5 For implementations, a maximum number of iterations is often used as an additional stopping criterion.
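A single E/M iteration for one observation sequence can be sketched compactly; the model parameters and observation sequence below are made-up assumptions, not values from the thesis:

```python
# One E/M iteration of Baum-Welch (Equations 4.27-4.33), made-up example model.
N, M = 2, 2
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.8, 0.2], [0.3, 0.7]]
obs = [0, 0, 1, 0]
L = len(obs) - 1

# Forward and backward variables over the whole trellis.
alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
for o in obs[1:]:
    alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(N)) * B[j][o]
                  for j in range(N)])
beta = [[1.0] * N]
for o in reversed(obs[1:]):
    beta.insert(0, [sum(A[i][j] * B[j][o] * beta[0][j] for j in range(N))
                    for i in range(N)])
likelihood = sum(alpha[-1])

# E-step: gamma (Equation 4.18) and xi (Equation 4.27).
gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(N)]
         for t in range(L + 1)]
xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
        for j in range(N)] for i in range(N)] for t in range(L)]

# M-step (Equations 4.31-4.33).
pi_new = gamma[0]
A_new = [[sum(xi[t][i][j] for t in range(L)) /
          sum(gamma[t][i] for t in range(L))
          for j in range(N)] for i in range(N)]
B_new = [[sum(gamma[t][i] for t in range(L) if obs[t] == k) /
          sum(gamma[t][i] for t in range(L))
          for k in range(M)] for i in range(N)]
```

Note that, by Equation 4.30, each row of A_new automatically sums to one, so the re-estimated parameters remain valid stochastic matrices.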
• The model structure is not altered during the training process: if there is no transition from state si to sj (aij = 0), the Baum-Welch algorithm will never change this.

• Initialization should exploit as much a-priori knowledge as possible. If this is not possible, random initialization can be used.

Training with multiple sequences. The formulas presented here have considered only a single observation sequence, although in most applications there is a large set of training sequences. The main idea of multiple-sequence training is that the numerators and denominators of Equations 4.31 to 4.33 are transformed into sums over the sequences ok, each scaled by 1/P(ok | λ), which is computed along with the E-step of the algorithm.

4.2 Sequences in Continuous Time

Error events occur on a continuous time scale, but so far there has been no notion of time in HMMs. In this section, four approaches to incorporating time into HMMs are introduced, followed by a review of approaches that have been published on this topic. An observation sequence is assumed to be an event-driven sequence consisting of symbols that are elements of a finite countable set. Such sequences are called temporal sequences. Sequences of length L + 1 are considered, where the first symbol occurs at time t0 and the last at time tL, as shown in Figure 4.5.

Figure 4.5: Notations for an event-driven temporal sequence. The sequence consists of symbols A, B, and C that occur at times t0, . . . , tL. The delay between two successive symbols is denoted by dk.

In order to clarify notation, let

• {o1, . . . , oM} denote the set of symbols that can potentially occur, which is {A, B, C} in the example,
• Ok denote the symbol that has occurred at time tk, and
• dk denote the length of the time interval tk − tk−1.

4.2.1 Four Approaches to Incorporate Continuous Time

The notion of continuous time can be incorporated into HMMs in four ways:

1.
Time can be divided into equidistant time slots;
2. delays can be represented by delay symbols;
3. events and delays can be represented by two-dimensional outputs;
4. a time-varying stochastic process can be used.

The following paragraphs investigate each solution and discuss its properties.

Time slots. Time is divided into non-overlapping intervals of equal length, as shown in Figure 4.6. Since hidden Markov models generate a symbol in each time step, time slots not containing any error symbols need to be "filled" with a special observation indicating "silence". Performing this procedure on the temporal sequence shown in Figure 4.6 results in the observation sequence "A C B S S S A", where S denotes the symbol indicating silence.

Figure 4.6: Incorporating continuous time by division of time into slots of equal length.

The simplest way to incorporate time-slotting into HMMs is to introduce state self-transitions: in each time step, there is some probability that the stochastic process transits to itself and hence stays in the state (see Figure 4.7).

Figure 4.7: Duration modeling by a discrete-time HMM with self-transitions.

This approach leads to a geometric distribution of state sojourn times, since the probability of staying in state si for d time steps equals

Pi(D = d) = aii^{d−1} (1 − aii). (4.34)

Time-slotting has the following characteristics:

+ Standard HMMs can be used.
+ There is almost no increase in computational complexity.
– Time slot size is critical. If it is too small, long delays must be represented by repetitions of the silence symbol, as can be seen in the example. The geometric delay distribution leads to poor modeling quality in most cases.6 On the other hand, if the time slot size is too large, more than one event will probably occur within a time slot.
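The sojourn-time distribution of Equation 4.34 can be checked numerically; the self-transition probability below is a made-up value:

```python
# Geometric sojourn-time distribution implied by a self-transition
# (Equation 4.34); a_ii = 0.8 is a made-up value.
a_ii = 0.8

def sojourn_pmf(d):
    """P(D = d): stay for d-1 self-transitions, then leave."""
    return a_ii ** (d - 1) * (1 - a_ii)

# The probabilities sum to one, and the mean sojourn time is 1 / (1 - a_ii).
total = sum(sojourn_pmf(d) for d in range(1, 200))
mean = sum(d * sojourn_pmf(d) for d in range(1, 2000))
```

The strictly decreasing shape of this distribution (the mode is always d = 1) is precisely why it models realistic inter-event delays so poorly.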
There are several solutions to the problem of multiple events per slot, including the definition of additional symbols representing combined events, dropping of events, or assignment to the next "free" slot. However, all of these solutions have their problems. In general, if the length of inter-symbol intervals varies greatly, time slots cannot represent the temporal behavior of event sequences appropriately.
– Time resolution is reduced to the size of the time slots, since it is no longer known when exactly an event has occurred within a slot. This is especially true for long time slot intervals.
For these reasons, time-slotting does not appear appropriate for online failure prediction.

Delay symbols. A second approach to incorporating inter-event delays is to define a set of delay symbols representing delays of various lengths. The sequence shown in Figure 4.6 could then be represented by, e.g., "A S1 C S1 B S3 A". An evaluation of this approach shows:
+ In comparison to time-slotting, the representation of time is improved, since "chains" of silence symbols are avoided. If delays are represented on a logarithmic scale, a wide range of inter-symbol delays can be handled.
+ The approach can be implemented in a completely discrete environment, such that standard implementations of HMMs can be used.
– The structure of HMMs must be adapted. Since events and delay symbols alternate, there must be two distinct sets of states, one of which generates event symbols while the other generates delay symbols. This results in increased computational complexity (see Figure 4.8).

Figure 4.8: Representing time by delay symbols. States Ei generate error observation symbols and states Di generate delay symbols.

– The internal (hidden) stochastic process does not represent the properties of the stochastic process that originally generated the observation sequence.
– Time resolution is even worse than for time-slotting, since one symbol accounts for long time intervals.
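The logarithmic delay-symbol encoding can be sketched as follows; the base-10 scale and the concrete events and delays are assumptions chosen for illustration, not values from the thesis:

```python
import math

# Sketch: mapping inter-event delays to logarithmically scaled delay symbols
# ("S1", "S2", ...), yielding sequences like "A S1 C S1 B S3 A".
# The base and the example events/delays are illustrative assumptions.

def delay_symbol(delay: float, base: float = 10.0) -> str:
    """Delays in [base^(k-1), base^k) map to symbol Sk; shorter delays to S1."""
    return "S%d" % max(1, math.floor(math.log(delay, base)) + 1)

events = ["A", "C", "B", "A"]
delays = [3.0, 7.0, 250.0]          # seconds between successive events

sequence = [events[0]]
for ev, d in zip(events[1:], delays):
    sequence += [delay_symbol(d), ev]

print(" ".join(sequence))           # A S1 C S1 B S3 A
```

With a logarithmic scale, a handful of symbols covers delays spanning several orders of magnitude, which is what avoids the long "chains" of silence symbols produced by time-slotting.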
Footnote 6: The effect can be reduced by introducing silence sub-models, which are out of scope here.

Figure 4.9: Delay representation by two-dimensional output probability distributions.

Two-dimensional output symbols. Hidden Markov models permit the use of multi-dimensional output symbols. Hence, the temporal sequence can be represented by tuples consisting of the event type and the delay to the previous event. The example sequence of Figure 4.6 would then be represented as (A, d0) (C, d1) (B, d2) (A, d3), where d0 is not relevant. For such a representation, observation probabilities are two-dimensional: one dimension is discrete, representing event symbols, while the second dimension is continuous, representing inter-event delays, as shown in Figure 4.9. Output probabilities have to obey

∀i : Σ_{j=1..M} ∫_0^∞ b_si(oj, τ) dτ = 1 .    (4.35)

An assessment of the method yields:
+ Lossless representation of the temporal sequence with, in principle, unlimited time resolution.
– The internal (hidden) stochastic process does not represent the temporal properties of the stochastic process that originally generated the observation sequence. This is a problem especially when future behavior of the stochastic process is to be predicted.
– To the best of our knowledge, public implementations exist only for purely discrete and purely continuous outputs. Hence, an implementation would require the development or adaptation of a new toolkit.

Time-varying internal process. The fourth approach is to incorporate the temporal behavior of the stochastic process that originally generated the observation sequence directly into the stochastic process of hidden state transitions. For example, a straightforward solution is to replace the internal DTMC by a continuous-time Markov chain (CTMC), which is able to handle transitions of arbitrary duration since transition probabilities are defined by exponential probability distributions P(t).
Such an approach results in:
+ Lossless representation of the temporal sequence.
+ The internal stochastic process can (at least in part) mimic the stochastic process that originally generated the observation sequence.
– Although various extensions to time-varying processes have been published (see next section), to our knowledge no publicly available toolkit exists.

Summary. Error event sequences are temporal sequences. Four approaches to incorporating continuous time into HMMs have been described. From the discussion, it follows that the most promising approach is to incorporate time variation directly into the hidden stochastic process, which is the approach taken in this thesis. Since various solutions for incorporating time variation into the stochastic process exist, related work with such a focus is presented in the following.

4.3 Related Work on Time-Varying Hidden Markov Models

A few decades ago, application of standard discrete HMMs was the only way to obtain a feasible (i.e., real-time) solution, even in application domains where temporal behavior is important. One such domain is speech recognition, where, e.g., phoneme durations vary statistically. Since it was quickly observed that continuous-time models can improve modeling performance significantly (see, e.g., Russell & Cook [218]), and due to increasing available computing power, more and more time-varying models have been published. The development was mainly driven by the speech recognition research community, but time-varying models have also been applied to other domains such as web-workload modeling [284]. The following sections give an overview of the various classes of time-varying HMMs.

Continuous-Time Hidden Markov Models

Incorporating time variance into HMMs by replacing the internal (hidden) DTMC by a continuous-time Markov chain (CTMC) has been described in Wei et al. [274].
The resulting model is abbreviated CT-HMM and should not be confused with continuous HMMs (CHMMs), which are discrete-time HMMs with continuous output probability densities. Like DTMCs, CTMCs are determined by an initial distribution, but the transition matrix A is replaced by an infinitesimal generator matrix Q. Determination of Q follows a two-step approach: first, a transition matrix P(∆) and the initial distribution are estimated from the training data by Baum-Welch training. Then Q is obtained by Taylor expansion of the equation

Q = (1/∆) ln(P) ,    (4.36)

which can be derived directly from Kolmogorov's equations (see, e.g., Cox & Miller [67]). ∆ denotes some minimal delay (a time step).

Hidden Semi-Markov Models

Models such as CT-HMMs imply strong assumptions about the underlying stochastic process, since CTMCs are based on exponential distributions, which are time-homogeneous and memoryless. A more powerful approach towards continuous-time HMMs is to substitute the underlying DTMC by a semi-Markov process (SMP), which allows arbitrary probability distributions to be used for the specification of each transition's duration (footnote 7). The resulting models are called Hidden Semi-Markov Models (HSMMs).

Footnote 7: The only requirement is that the duration depends solely on the two states of the transition. A precise definition is given in Chapter 6.

Figure 4.10: Duration modeling by explicit modeling of state durations.

A first approach to HSMMs is to substitute the self-transitions of Figure 4.7 by state durations that follow a state-specific probability distribution pi(d), as depicted in Figure 4.10. Several solutions have been developed to explicitly specify and determine pi(d) from training data along with the Baum-Welch algorithm.

Ferguson's model. One of the first approaches to explicit state duration modeling was proposed by Ferguson [96] in 1980 (footnote 8).
The idea was to use a discrete probability distribution for pi(d). While the approach was very flexible, it showed three disadvantages: first, it is a discrete-time model requiring the definition of a time step ∆ and a maximum delay D; second, convergence of the training algorithm was slow; and third, much more training data was needed. The last two drawbacks result from a dramatically increased number of parameters that have to be estimated from the training data: the number of parameters increases from N self-transition probabilities to N × D duration probabilities. Mitchell et al. [182] extend the approach to transition durations and propose a training algorithm with reduced complexity.

HSMMs with Poisson-distributed durations. In order to reduce the number of parameters, Ferguson already proposed to use parametric distributions instead of discrete ones. Russell & Moore [219] did so, using Poisson distributions. A comparison of both models showed that the Poisson-distributed model performs better when an insufficient amount of training data is available [218].

HSMMs with gamma-distributed durations. Levinson [161] provided a maximum likelihood estimation for the parameters of gamma-distributed durations. As is the case with most maximum likelihood procedures, optimal parameters are obtained by differentiating the likelihood function. However, this derivative cannot be computed explicitly and numerical approximation has to be applied. Azimi et al. [18] apply HSMMs with gamma-distributed durations to signal processing but adjust duration parameters from the estimated mean and variance of durations in the training data set.

HSMMs with durations from the exponential family. Mitchell & Jamieson [183] extended the spectrum of available distributions for explicit duration modeling to all distributions of the exponential family, which includes gamma distributions.
Their work is also founded on a direct computation of the maximum likelihood involving numerical approximation of the maximum.

Footnote 8: A crisp overview can be found in Rabiner [210].

HSMMs with Viterbi-path-constrained uniform distributions. Kim et al. [145] present an approach where transition durations are assumed to be uniformly distributed. Their key idea is that, first, the parameters π, A and B are obtained by the discrete-time standard HMM reestimation procedure as explained in Section 4.1. A subsequent step involves computation of Viterbi paths for the training data in order to identify minimum and maximum durations for each transition: this defines a uniform duration distribution for each transition.

Expanded State HMMs (ESHMMs). In parallel to the development of HSMMs with parametrized probability distributions, it has been found that Ferguson's model can be implemented much more easily by a series-parallel topology of the hidden states (Cook & Russell [65]). To be precise, each state of the HMM is replaced by a DTMC whose states share the same emission probability distribution. State durations are then expressed by the transition probabilities of the DTMC. Figure 4.11 shows a small example for an HMM with left-to-right topology. Such models are called Expanded State HMMs (ESHMMs).

Figure 4.11: Topology of an Expanded State HMM (ESHMM). The model represents discrete state duration probabilities pi(d) by discrete-time Markov chains. Emission probabilities bsi(oj) have been omitted.

The benefit of ESHMMs is that they can be implemented using standard discrete-time HMM toolkits. Furthermore, the idea of representing state durations by state chains led to several variants extending Ferguson's model. For example, the duration Markov chain may have self-transitions that allow durations of arbitrary length to be modeled instead of a fixed maximum duration D.
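A sketch of why such chains add flexibility: replacing a macro-state by K serial sub-states, each with self-transition probability a, realizes a sum of K geometric sojourns (a negative binomial) instead of the single geometric of Figure 4.7. The values K = 3 and a = 0.5 are illustrative assumptions, not taken from the thesis.

```python
# Sketch: the sojourn-time distribution realized by an expanded-state chain
# of K serial sub-states with self-transition probability a (illustrative values).

def geometric_pmf(a, d_max):
    """pmf[d] = P(sub-state sojourn = d) = a^(d-1) * (1 - a)."""
    return [0.0] + [a ** (d - 1) * (1 - a) for d in range(1, d_max + 1)]

def convolve(p, q):
    """Distribution of the sum of two independent discrete variables."""
    r = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r

K, a = 3, 0.5
g = geometric_pmf(a, 40)
dur = g
for _ in range(K - 1):
    dur = convolve(dur, g)        # dur[d] = P(total sojourn = d)

mean = sum(d * p for d, p in enumerate(dur))
peak = max(range(len(dur)), key=dur.__getitem__)

# Unlike the geometric distribution of Equation 4.34, which always peaks at
# d = 1, the chain's duration peaks later and has mean K / (1 - a).
assert peak > 1 and abs(mean - K / (1 - a)) < 1e-6
```

The resulting negative-binomial shape has an interior mode, which is closer to the duration statistics of, e.g., phonemes than a monotonically decaying geometric distribution.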
Additional chain structures have been proposed by Noll & Ney [195] and Pylkkönen [206], and a comparison of two extended structures is provided by Russell & Cook [218]. More elaborate training algorithms for ESHMMs have been proposed by Wang [270] and Bonafonte et al. [33].

Segmental HMMs. Segmental HMMs are used to model sequences whose behavior changes in epochs. It is assumed that there is some outer stochastic process determining the "type" of a segment. Some discrete duration is chosen, specifying the length of the epoch. Once the type and the duration of the epoch are fixed, an inner stochastic process determines the behavior within the segment. Examples of such models can be found in Ge [102] and Russell [217].

Hidden Semi-Markov Event Sequence Model (HSMESM). In [93], Faisan et al. have presented a hidden semi-Markov model for the modeling of functional magnetic resonance imaging (fMRI) sequences (footnote 9). The key idea with respect to temporal modeling is that discrete duration probabilities are stored for each transition rather than for each state. However, the model is specifically targeted at fMRI.

Footnote 9: More details on the model can be found in Thoraval [255].

Inhomogeneous HMMs (IHMMs). Ramesh & Wilpon [211] have developed another variant of HMMs, called the Inhomogeneous HMM (IHMM). Time homogeneity of a stochastic process refers to the property that its behavior (i.e., the probability distributions) does not change over time. In terms of Markov chains, this means that the transition probabilities aij are constant and not a function of time. The authors abandon this assumption and define

aij(d) = P(St+1 = j | St = i, dt(i) = d) ;  1 ≤ d ≤ D ,    (4.37)

which is the transition probability from state si to state sj given that the duration dt(i) in state si at time t equals d. In order to define a proper stochastic process, the transition probabilities must satisfy

∀d ∈ {1, ..., D} : Σ_{j=1..N} aij(d) = 1 .    (4.38)

As can be seen from the formulas, Ramesh & Wilpon also assume discretized time and a maximum state duration D.

4.4 Summary

The approach to online failure prediction taken in this thesis is to use HMMs as a pattern recognition tool for error sequences, which are event-driven sequences in continuous time with symbols from a finite countable set. Such sequences are called temporal sequences. This chapter has introduced the theory of standard HMMs and has identified four ways in which sequences in continuous time can be handled by HMMs. From this discussion, it followed that the most promising solution is to turn the stochastic process of hidden state traversals into a time-varying process. Since this idea is not new, related work on previous extensions has been presented. Most domains to which standard hidden Markov models have been applied are characterized by
• equidistant / periodic occurrence of observation symbols caused by sampling. This defines a minimum time step size such that all temporal aspects can be expressed in integer multiples of the sampling interval.
• a maximum duration. For example, in speech recognition, phonemes, syllables, etc. can well be assumed to have limited duration.
However, these assumptions do not hold for online failure prediction based on error events: observation symbols can occur on a continuous time scale, and delays between errors can range from very short to very long time intervals. Therefore, none of the continuous-time extensions presented in this chapter seems appropriate for failure prediction. The extended hidden semi-Markov model proposed in this dissertation differs from existing solutions in the following aspects:
1. The model operates on true continuous time instead of multiples of a minimum time step size.
This feature circumvents the problems associated with time-slotting and is advantageous if sequences show great variability of inter-event delays, as is the case for the log data used in the case study.
2. There is no maximum duration D. The model can handle very long inter-error delays with the same computational overhead as short delays.
3. The model allows a great variety of parametric transition probability distributions to be used. More specifically, every parametric continuous distribution for which the density's gradient with respect to the parameters can be computed is applicable. This includes well-known distributions such as the Gaussian, exponential, gamma, etc. The advantage of this feature is that transition duration distributions can be adapted to the delays occurring in the system rather than assuming some distribution a-priori. Furthermore, the model allows background distributions to be used, which helps to deal with noise in the data.
4. The model allows transition durations to be specified rather than state durations. The widely used state durations are a special case in which all transitions of a state are identically distributed. This feature alleviates the Markov restriction that the process depends only on the current state.
Although some of the models presented in this chapter share some of these features, the proposed model is the first to provide the combination of all four properties, which, as will be seen later, proved to be beneficial.

Contributions of this chapter. First, four ways to incorporate continuous time into HMMs have been identified and discussed. Second, this chapter appears to be the first work to present a summary of the state of the art in continuous-time extensions of hidden Markov models.

Relation to other chapters. This chapter concludes the first phase of the engineering cycle, which has focused on a problem statement, the identification of key properties, and related work.
The second phase focuses on a proper formalization of the approach that has been sketched in Figure 2.9 on Page 19 and Figure 2.10 on Page 20. More specifically, formalization of the approach includes data preprocessing (Chapter 5), the hidden semi-Markov model (Chapter 6), and classification (Chapter 7).

Part II
Modeling

Chapter 5
Data Preprocessing

The overall approach to online failure prediction consists of several steps, of which data preprocessing is the first. It is applied for training, i.e., the estimation of model parameters, as well as for online prediction. In Section 5.1, some known concepts of error-log preprocessing are described. A novel approach to separating failure mechanisms is introduced in Section 5.2, and a statistical method to filter noise is explained in Section 5.3. Finally, in Section 5.4, logfile formatting is discussed and a novel concept of logfile entropy is introduced.

5.1 From Logfiles to Sequences

Error logfiles are a natural source of information if something goes wrong in the system, and they are frequently used both for diagnosis and for online failure prediction (footnote 1). This section describes the steps necessary to get from raw error logs to the temporal event sequences used as input data for the hidden semi-Markov models.

5.1.1 From Messages to Error-IDs

One of the major handicaps of error logfiles is that they are commonly not designed for automatic processing. Their main purpose is to convey information to human operators to support quick identification of problems. Hence, error logs frequently do not contain any error ID. Instead, they consist of error messages in natural language. This also holds for the error logs of the telecommunication system, and hence methods had to be developed to turn natural language messages into error IDs.
The method described here has been developed together with Steffen Tschirpke (footnote 2). The key idea of translating natural language messages into an error ID is to apply a similarity measure known from text editing to yield a similarity matrix, to cluster the matrix, and to assign an error ID to each cluster. However, even if dedicated log data such as timestamps, etc. are ignored, almost every log message is unique. This is due to numbers and log-record-specific data in the messages. For example, the log message

process 1534: end of buffer reached

will most probably occur only very rarely in an error log, since it happens infrequently that exactly the process with number 1534 will have the same problem. For this reason, the mapping from error messages to error IDs consists of three steps:
1. All numbers and log-record-specific data such as IP addresses, etc. are replaced by placeholders. For example, the message shown above is translated into: process nn: end of buffer reached
2. A 100% complete replacement of all record-specific data is infeasible. Furthermore, there are even typos in the error messages themselves. Hence, dissimilarities between all pairs of log messages are computed using the Levenshtein distance metric [11], which measures the number of deletions, insertions, and substitutions required to transform one string into the other.
3. Log messages are grouped by a simple threshold on dissimilarity: all messages having dissimilarity below the threshold are assigned to one message ID.

Footnote 1: See category 1.3 in the taxonomy.
Footnote 2: In fact, he was the one who implemented it and who solved all the real problems regarding this issue.

The goal of the method described is to assign an ID to textual messages. The ID then forms the so-called error type or symbol. If additional information from log messages, such as, e.g., thread IDs, is to be used, the various numbers have to be combined into a single error type.
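The three steps can be sketched as follows. The example messages, the normalized-distance threshold of 0.2, and the simple representative-based grouping are illustrative assumptions; the thesis clusters a full pairwise dissimilarity matrix of real telecommunication-system logs.

```python
import re

# Sketch of the three-step message-to-ID mapping (illustrative messages
# and threshold; a stand-in for clustering the full dissimilarity matrix).

def levenshtein(s: str, t: str) -> int:
    """Number of deletions, insertions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def normalize(msg: str) -> str:
    # Step 1: replace numbers (and similar record-specific data) by placeholders.
    return re.sub(r"\d+", "nn", msg)

def assign_ids(messages, threshold=0.2):
    # Steps 2 and 3: pairwise dissimilarity plus a simple threshold.
    norm = [normalize(m) for m in messages]
    ids, reps = [], []                 # reps[k] = representative of cluster k
    for m in norm:
        for k, r in enumerate(reps):
            if levenshtein(m, r) / max(len(m), len(r)) < threshold:
                ids.append(k)
                break
        else:
            ids.append(len(reps))
            reps.append(m)
    return ids

logs = ["process 1534: end of buffer reached",
        "process 77: end of bufer reached",     # typo, but same error type
        "link 3 down"]
print(assign_ids(logs))   # [0, 0, 1]
```

The normalized distance absorbs both the residual record-specific data and typos, so the two buffer messages receive the same error ID despite differing characters.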
5.1.2 Tupling

As Iyer & Rosetti [129] have noted, repetitive log records that occur more or less at the same time are frequently multiple reports of the same fault. Hansen & Siewiorek analyzed this property further and presented an illustrative figure, which is reproduced for convenience in Figure 5.1. Please note that terms have been adapted in order to be consistent with other chapters. The figure depicts the process from a fault to corresponding events in the error log.

Figure 5.1: A fault, once activated, can result in various misbehaviors. Some misbehaviors are not detected, some are detected several times, and sometimes several misbehaviors are caught by one single detection. Due to, e.g., a system crash, not every error may occur as a message in the error log [114].

Once activated, a fault may lead to various misbehaviors in the system. There are four possibilities for how such misbehavior can be detected:
1. unusual behavior is detected, leading to one error
2. unusual behavior is not detected and hence no error occurs
3. unusual behavior is detected by several fault detectors, leading to several errors
4. one fault detector detects several misbehaviors, resulting in one single error
However, not every error finds its way to the error log. For example, if the fault causes the logging process or the entire system to crash, the error cannot be written to the logfile. In order to increase the expressiveness of logfiles, Tsao & Siewiorek [258] introduced a procedure called tupling, which basically refers to the grouping of error events that occur within some time interval or that refer to the same location. However, equating the location reported in an error message with the true location of the fault only works for systems with strong fault containment regions. Since this assumption does not hold for the telecommunication system under consideration, spatial tupling is not considered any further here.
There are two principal approaches to grouping errors in the temporal domain:
1. After some pause, all errors that occur within a fixed interval starting from the first error are grouped, as proposed by Iyer et al. [131].
2. All errors showing an inter-arrival time less than a threshold ε are grouped, as proposed by Tsao & Siewiorek [258] (footnote 3).
Further considerations refer only to the second grouping method. Two problems can arise when tupling is applied (see Figure 5.2):
1. Error messages might be combined that refer to several (unrelated) faults. This case is called a collision.
2. If an inter-arrival time > ε occurs within the error pattern of one single fault, this pattern is divided into more than one tuple. This effect is called truncation.
Both the number of collisions and the number of truncations depend on ε. If ε is large, truncation happens rarely but collisions occur very likely; if ε is small, the effect is reversed. In order to analyze the relationship, Hansen & Siewiorek [114] have derived a formula for the probability of collision. Assuming that faults occur with exponentially distributed inter-arrival times, the collision probability can be computed by

Pc(ε) = Σ_j pj ( 1 − e^(−λF ε) e^(−λF lj) ) ,    (5.1)

where λF is the fault rate and pj denotes the discrete distribution of tuples of length lj estimated from the logfile. However, the fault rate λF is unobservable, and the authors suggest estimating it by the tuple rate λT. The authors have checked their results using two machine-years of data from a Tandem TNS II system and showed that the formula can provide a rough estimate.

Footnote 3: In Tsao & Siewiorek [258], there is a second, larger threshold to add later events if they are similar, but this is not considered further here.

Figure 5.2: For demonstration purposes, the first and second time lines depict the error patterns of two faults separately. The bottom line shows what is observed in the error log.
Error logs are grouped if the inter-arrival time is less than ε. Each group defines a tuple (shaded areas). Truncation occurs if the inter-arrival time for one fault is > ε. However, a large ε leads to collisions if events of other faults occur earlier than ε [114].

As stated above, reducing the number of collisions by lowering ε increases the number of truncations. However, truncation is much more complicated to identify, since it is mostly difficult to tell whether some error occurring much later (> ε) belongs to the same fault or to another. Therefore, the authors suggest the following strategy: plotting the number of tuples over ε yields an L-shaped curve, as shown in Figure 5.3.

Figure 5.3: Plotting the number of tuples over time window size ε yields an L-shaped curve [114].

If ε equals zero, the number of tuples equals the number of error events in the logfile. As ε is increased, the number first drops quickly. At some point, the curve suddenly flattens. Choosing ε slightly above this point seems optimal. The rationale behind this procedure is the assumption that, on average, there is a small gap between the errors of different faults: if ε is large enough to capture all errors belonging to one fault, the number of resulting tuples decreases more slowly as ε is increased further. Current research aims at quantifying temporal and spatial tupling. For example, Fu & Xu [99] introduce a correlation measure for this purpose, but since this research is at an early stage, such a measure has not been applied here. For the rest of this chapter, it is assumed that both error-ID assignment and tupling have been applied.

5.1.3 Extracting Sequences

The hidden Markov models are trained using either failure or non-failure sequences, as shown in Figure 2.9 on Page 19. A failure sequence is defined as a temporal sequence of error events preceding a failure (see Figure 5.4).
Its maximum duration is determined by the data window size ∆td, as defined in Section 2.1 (see Figure 2.4 on Page 12). The time of failure occurrence is usually not reported in the error logs themselves but in documents such as operator repair reports, logs of stress generators, service trackers, etc.

Figure 5.4: Extracting sequences from an error log. Sequences are extracted from a time window of duration ∆td. Sequences preceding a failure (denoted by t) by lead time ∆tl form failure sequences F i. Sequences occurring between failures (with some margin ∆tm) set up non-failure sequences NF i.

Non-failure sequences denote sequences that have occurred between failures. In order to be relatively sure that the system is healthy and no failure is imminent, non-failure sequences must not occur within some margin ∆tm before or after any failure. Non-failure sequences can be generated with overlapping or non-overlapping windows, or by random sampling.

5.2 Clustering of Failure Sequences

A failure mechanism, as used in this thesis, denotes a principal chain of actions or conditions that leads to a system failure. It is assumed that various failure mechanisms exist in complex computer systems such as the telecommunication system. Different failure mechanisms can show completely different behavior in the error event logs, which makes it very difficult for a learning algorithm to extract the inherent "principle" of failure behavior from a given training data set. For this reason, a novel approach to the identification and separation of failure mechanisms has been developed. The key notion of the approach is that failure sequences of the same underlying failure mechanism are more similar to each other than to failure sequences of other failure mechanisms. Grouping can be achieved by clustering algorithms; the challenge, however, is to define a similarity measure between any pair of error event sequences.
Since there is no "natural" distance such as the Euclidean norm for error event sequences, sequence likelihoods from small hidden semi-Markov models are used for this purpose (footnote 4). The approach is related to Smyth [246], but it yields separate specialized models instead of one mixture model.

Footnote 4: The same hidden semi-Markov models are used as developed in the next chapter. However, since this thesis follows the order of tasks preprocessing → modeling → classification, details on the model are presented in Chapter 6. For the time being, it is sufficient to remember that HSMMs are hidden Markov models tailored to temporal sequences.

5.2.1 Obtaining the Dissimilarity Matrix

Since most clustering algorithms require dissimilarities among data points as input, a dissimilarity matrix D is computed from the set of failure sequences F i. More precisely, D(i, j) denotes the dissimilarity between failure sequences F i and F j. In order to compute D(i, j), first a small HSMM M i is trained for each failure sequence F i, as shown in Figure 5.5.

Figure 5.5: For each failure sequence F i, a separate HSMM M i is trained.

Second, the sequence likelihood is computed for each sequence F i using each model M j. However, since the sequence likelihood takes on very small values for longer sequences, it cannot be represented properly even by double-precision floating point numbers, and the logarithm of the likelihood (log-likelihood) is used here (footnote 5). The sequence likelihoods of all sequences F i computed with all HSMMs M j define a matrix where each element (i, j) is the logarithm of the probability that model j can generate failure sequence i: log P(F i | M j) ∈ (−∞, 0]. In other words, the logarithmic sequence likelihood is close to zero if the sequence fits the model very well, and is significantly smaller if it does not really fit.
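Given such pairwise log-likelihoods, the symmetrization into the dissimilarity matrix D can be sketched as follows; the log-likelihood values below are invented for illustration (in the thesis they come from the per-sequence HSMMs M i):

```python
# Sketch: turning a matrix of pairwise log-likelihoods log P(F^i | M^j)
# into the symmetric dissimilarity matrix D. The values are illustrative.

loglik = [[ -2.1,  -9.5, -30.2],   # loglik[i][j] = log P(F^i | M^j)
          [ -8.8,  -1.7, -28.4],
          [-31.0, -27.9,  -3.0]]

n = len(loglik)
D = [[abs((loglik[i][j] + loglik[j][i]) / 2.0) for j in range(n)]
     for i in range(n)]

# D is symmetric by construction, and its diagonal is small but not exactly
# zero, as discussed in the text.
assert all(abs(D[i][j] - D[j][i]) < 1e-12 for i in range(n) for j in range(n))
assert all(D[i][i] < min(D[i][j] for j in range(n) if j != i) for i in range(n))
```

Averaging the two asymmetric log-likelihoods and taking the absolute value is exactly the construction of Equation 5.2 below.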
Since model M j has been adjusted to the specifics of failure sequence F j in the first step, P(F i | M j) expresses some sort of proximity between the two failure sequences F i and F j. An exemplary resulting matrix of log-likelihoods is shown in Figure 5.6. Unfortunately, the matrix is not yet a dissimilarity matrix, since, first, its values are ≤ 0 and, second, sequence likelihoods are not symmetric: P(F i | M j) ≠ P(F j | M i). This is solved by taking the arithmetic mean of both log-likelihoods and using the absolute value. Hence, D(i, j) is defined as:

D(i, j) = | ( log[P(F i | M j)] + log[P(F j | M i)] ) / 2 | .    (5.2)

Still, matrix D is not a proper dissimilarity matrix, since a proper metric requires that D(i, j) = 0 if F i = F j. There is no solution to this problem, since from D(j, j) = 0 it follows that P(F j | M j) = 1. However, if M j assigned a probability of one to F j, it would assign a probability of zero to all other sequences F i ≠ F j, which would be useless for clustering. Nevertheless, D(j, j) is close to zero, since it denotes the log-sequence likelihood of the sequence that model M j has been trained with. For this reason, matrix D is used as defined above.

Footnote 5: In fact, many HMM implementations only return the log-likelihood.

Figure 5.6: Matrix of logarithmic sequence likelihoods. Each element (i, j) in the matrix is the logarithmic sequence likelihood log P(F i | M j) for sequence F i and model M j.

Regarding the topology of the models M i, the purpose of each model is to get a rough notion of proximity between failure sequences. In contrast to the models used for failure prediction (cf. Section 6.6), the purpose is not to clearly identify sequences that are very similar to the training data set and to judge other sequences as "completely different".
Therefore, the models M^i have only a few states and the structure of a clique, which means that there is a transition from every state to every other state. (Such models are also called ergodic.) In order to further avoid overly specific models, so-called background distributions are applied (c.f. Page 112). The effects of the number of states and of background distributions are further investigated along with the case study.

5.2.2 Grouping Failure Sequences

In order to group similar failure sequences, a clustering algorithm is applied. Two groups of clustering algorithms exist (c.f. Kaufman & Rousseeuw [142]): Partitioning techniques divide the data into u different clusters (partitions), where u is a fixed number that needs to be specified in advance. Hierarchical clustering approaches do not rely on a prespecification of the number of clusters. They either divide the data into more and more subgroups (divisive approach), or start with each data point as a separate cluster and repeatedly merge smaller clusters into bigger ones (agglomerative approach). In general, partitioning approaches yield better results for a single u, while hierarchical algorithms are much quicker than repeatedly partitioning for different values of u. Since u cannot be determined upfront, hierarchical clustering is used for the grouping of failure sequences.

The output of a hierarchical clustering algorithm is a grouping function g_F(u) that partitions the set of failure sequences F = {F^i} into u groups:

g_F(u) = \left\{ G_l ;\; 1 \le l \le u ;\; \forall\, l : G_l \subset F ;\; \bigcup_l G_l = F ,\; \bigcap_l G_l = \emptyset \right\} ,   (5.3)

where G_l denotes the set of failure sequences that belong to group l.

5.2.3 Determining the Number of Groups

Hierarchical clustering yields the function g_F(u), determining for each number of groups u which sequences belong to which group. The number of groups u still needs to be determined in order to separate the failure sequences in the training data.
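A minimal agglomerative procedure that yields the grouping function g_F(u) for every u at once can be sketched as follows. This is a plain single-linkage implementation on a toy dissimilarity matrix; the thesis does not prescribe this particular implementation, and the inter-cluster distance rule is a design choice discussed in Section 5.2.4.

```python
def agglomerate(D):
    """Single-linkage agglomerative clustering on a dissimilarity matrix D.
    Returns a dict mapping each number of groups u to a partition
    (list of sets of sequence indices), i.e. the grouping function g_F(u)."""
    n = len(D)
    clusters = [{i} for i in range(n)]
    g = {n: [set(c) for c in clusters]}
    while len(clusters) > 1:
        # find the pair of clusters with the smallest inter-cluster distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]   # merge the two closest clusters
        del clusters[b]
        g[len(clusters)] = [set(c) for c in clusters]
    return g

# Toy dissimilarity matrix with two clearly separated pairs of sequences
D = [[0.0, 1.0, 9.0, 8.5],
     [1.0, 0.0, 9.5, 8.0],
     [9.0, 9.5, 0.0, 1.5],
     [8.5, 8.0, 1.5, 0.0]]
g = agglomerate(D)
# g[2] contains the groups {0, 1} and {2, 3}
```

The dict g plays the role of g_F(u): once a suitable u has been determined (Section 5.2.3), g[u] is the desired partition.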
In principle, u should be as small as possible, since a separate model needs to be trained for each group, which affects computation time both for training and for online prediction. Moreover, the more groups there are, the fewer failure sequences remain in the training data set of each group, which results in worse models. On the other hand, if u is too small, there is no clear separation of failure mechanisms and the resulting failure prediction models have difficulties learning the structure of failure sequences. Several ways have been proposed to determine the number of groups u:

• Visual inspection is a very robust technique if the data is presented adequately. Banner plots (see Section 8.1.2) have shown to be an adequate representation for this purpose. However, visual inspection works only if the number of failure sequences is not too large.

• Evaluation of inter-cluster distances. Such approaches investigate the distance level at which clusters are merged or divided. The basic idea is that if there is a large gap in cluster distance (one that deviates significantly from the others), some fundamental difference must be present in the data. Such approaches are sometimes called stopping rules (see, e.g., Mojena [185], Lance & Williams [154], Salvador & Chan [228]).

• Elbow criterion. The percentage of variance explained (the ratio of within-group variance to total variance) is plotted for each number of groups. The point at which adding a new cluster does not add sufficient information can be observed as an elbow in the plot (see, e.g., Aldenderfer & Blashfield [5]).

• Bayesian framework. Using Bayes’ theorem, the most probable number of groups given the data, arg max_u P(u|D), can be computed from the probability of the data given the number of groups, P(D|u). However, this requires trying all values of u, ranging from one to the number of sequences F, and each trial with u groups requires training u HSMMs. Hence, \frac{F(F+1)}{2} = \frac{F^2 + F}{2} Baum-Welch training procedures would have to be performed, which is not feasible in reasonable time.
Because the number of failure sequences in the case study is still manageable, and visual inspection is a simple but robust technique, it is the method of choice in this thesis.

Figure 5.7: Inter-cluster distance rules: (a) nearest neighbor, (b) furthest neighbor, and (c) unweighted pair-group average method.

5.2.4 Additional Notes on Clustering

Matrix D defines some sort of distance between single failure sequences. However, for clustering, a measure is needed to evaluate the distance between clusters, which can have a decisive impact on the result of clustering. The three predominant techniques for agglomerative clustering are (see Figure 5.7):

• Nearest neighbor. The shortest connection between two clusters is considered. This approach tends to yield elongated clusters due to the so-called chaining effect: if two clusters get close in only one point, the two clusters are merged. For this reason, the nearest neighbor rule is also called the single linkage rule.

• Furthest neighbor. The maximum distance between any two points of the two clusters is considered. This approach tends to yield compact clusters that are not necessarily well separated. This rule is also called the complete linkage rule.

• Unweighted pair-group average method (UPGMA). The distance of two clusters is computed as the average of the distances from all points of one group to all points of the other. This approach results in ball-shaped clusters that are in most cases well separated.

In addition to these inter-cluster distance measures, Ward’s method generates clusters by minimizing the squared Euclidean distance to the cluster mean. Each method has its advantages and disadvantages, and it is difficult to determine upfront which is best suited for a given data set. Therefore, all four methods have been applied to the data of the case study (c.f. Section 9.2.5).
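The three inter-cluster distance rules differ only in how they aggregate the pairwise dissimilarities between two clusters. A minimal sketch with invented toy values:

```python
def single_link(D, A, B):
    """Nearest neighbor: shortest connection between the two clusters."""
    return min(D[i][j] for i in A for j in B)

def complete_link(D, A, B):
    """Furthest neighbor: largest distance between any two points."""
    return max(D[i][j] for i in A for j in B)

def upgma(D, A, B):
    """Unweighted pair-group average: mean over all cross-cluster pairs."""
    return sum(D[i][j] for i in A for j in B) / (len(A) * len(B))

# Toy dissimilarity matrix and two clusters (hypothetical values)
D = [[0, 2, 9, 7],
     [2, 0, 5, 6],
     [9, 5, 0, 3],
     [7, 6, 3, 0]]
A, B = {0, 1}, {2, 3}
# single_link -> 5, complete_link -> 9, upgma -> (9 + 7 + 5 + 6) / 4 = 6.75
```

Swapping one of these functions into the agglomeration loop changes the shape of the resulting clusters as described above, which is why all variants were tried on the case-study data.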
Besides its use for failure sequence grouping in data preprocessing, the clustering method presented here can possibly be used to enhance diagnosis, as is discussed in the outlook.

5.3 Filtering the Noise

The objective of the previous clustering step was to group failure sequences that are traces of the same failure mechanism. Hence it could be expected that failure sequences of one group are more or less similar. However, experiments have shown that this is not the case. The reason is that error logfiles contain noise, which results mainly from parallelism within the system (see Section 2.3). Therefore, some filtering is necessary to eliminate the noise and to mine those events in the sequences that make up the true pattern.

The filtering applied in this thesis is based on the notion that at certain times within failure sequences of the same failure mechanism, indicative events occur more frequently than within all other sequences. The precise definition of “more frequently” is based on the χ² test of goodness of fit.

Figure 5.8: After grouping similar failure sequences by means of clustering, filtering is applied to each group in order to remove noise from the training data set. For failure group u, the blow-up shows that sequences are aligned at the time of failure occurrence (t). For each time window (vertical shaded bars), each error symbol (A, B, C) is checked as to whether it occurs significantly more frequently than expected. Those symbols that do not pass the filter (crossed-out symbols) are removed from the training sequences.

The filtering process is depicted in the blow-up of Figure 5.8 and performs the following steps:

1. Prior probabilities are estimated for each symbol. Priors express the “general” probability that a given symbol occurs.

2. All sequences of one group (which are similar and are expected to represent one failure mechanism) are aligned such that the failure occurs at time t = 0.
In the figure, sequences F^1, F^2, and F^4 are aligned and the dashed line indicates the time of failure occurrence.

3. Time windows are defined that reach backwards in time. The length of the time windows is fixed and time windows may overlap. Time windows are indicated by shaded vertical bars in the figure.

4. The test is performed for each time window separately, taking into account all error events that have occurred within the time window in all failure sequences of the group.

5. Only error events that occur significantly more frequently in the time window than their prior probability suggests stay in the training sequences. All other error events within the time window are removed, since these are assumed to be noise. In the figure, removed error events are crossed out.

6. Filtering rules are stored for each time window, specifying the error symbols that pass the filter. The filter rules are used for online failure prediction, where new sequences have to be processed in order to classify the current state of the system as failure-prone or not. Each incoming error sequence is filtered before sequence likelihood is computed. Each failure group has a separate set of filter rules and no filtering is applied for the non-failure sequence model. That is why there is a group-specific part of the preprocessing block in Figure 2.10 on Page 20.

In order to formalize the test, let \hat{p}^0_i denote the estimated prior probability of error event type (symbol) i, which constitutes the null hypothesis. The set of failure sequences under consideration is obtained from clustering. Assume the l-th group is to be filtered; then the set of filtering sequences G_l, consisting of sequences G^j_l, is given by:

G_l = \{ G^j_l \} = \left[ g_F(u) \right]_l ,   (5.4)

where g_F(u) is defined by Equation 5.3. Let S denote the set of symbols that occur in any of the failure sequences of G_l within the time window (t − Δt, t]:

S = \bigcup_j \left\{ s \in G^j_l \;\middle|\; s \text{ occurs within } (t - \Delta t, t] \right\} .   (5.5)

Each symbol s_i ∈ S is checked for significant deviation from the prior \hat{p}^0_i by a test variable known from χ-grams, which are a non-squared version of the test variable of the χ² goodness of fit test (see, e.g., Schlittgen [230]). The test variable X_i is defined as the non-squared standardized difference:

X_i = \frac{n_i - n\,\hat{p}^0_i}{\sqrt{n\,\hat{p}^0_i}} ,   (5.6)

where n_i denotes the number of occurrences of symbol s_i and n is the total number of symbols in the time window. Disregarding estimation effects, the properties of the test variable X_i can be assessed by assuming that n_i is binomially distributed, so that from the Poisson approximation (\hat{p}^0_i can be assumed to be rather small) follows for expectation value and variance:

E[n_i] \approx n\,\hat{p}^0_i ,   (5.7)

V[n_i] \approx n\,\hat{p}^0_i ,   (5.8)

where E[·] denotes the expectation value and V[·] the variance. Hence,

E[X_i] = E\!\left[ \frac{n_i - n\,\hat{p}^0_i}{\sqrt{n\,\hat{p}^0_i}} \right] \approx 0 ,   (5.9)

V[X_i] = V\!\left[ \frac{n_i - n\,\hat{p}^0_i}{\sqrt{n\,\hat{p}^0_i}} \right] \approx 1 .   (5.10)

From this analysis follows that all X_i are standardized and can be compared to a common threshold c: filtering eliminates from S all symbols s_i within the time window (t − Δt, t] for which X_i < c. Hence, the set of remaining symbols for the time window is:

S' = \{ s_i \in S \mid X_i \ge c \} .   (5.11)

Figure 5.9: Three different sequence sets can be used to compute symbol prior probabilities: the set of all training sequences, the set of failure training sequences, and the set of failure training sequences belonging to the same group (indicated by G_i). In reality, the grouped sequence sets G_i cover (i.e., partition) the set of failure training sequences.

The set of filtered training sequences G'_l = \{ G'^j_l \} is finally obtained by removing from each sequence all symbols that do not occur in any of the filtered symbol sets S' covering the time at which the symbol occurs in the sequence. G'_l is then used to train the model for the l-th failure mechanism / group (see Section 6.6).
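The per-window test of Equations 5.6 and 5.11 can be sketched as follows. The symbol counts and priors are invented toy values, and the threshold c = 2.0 is an arbitrary illustrative choice.

```python
import math

def chi_gram_filter(counts, n, priors, c):
    """Keep only symbols whose frequency in the time window deviates
    significantly from the prior (Equations 5.6 and 5.11):
    X_i = (n_i - n * p_i) / sqrt(n * p_i); a symbol survives iff X_i >= c."""
    kept = set()
    for sym, n_i in counts.items():
        p = priors[sym]
        x = (n_i - n * p) / math.sqrt(n * p)
        if x >= c:
            kept.add(sym)
    return kept

# Toy window: 20 symbols in total; 'A' occurs far more often than its prior
counts = {"A": 10, "B": 2, "C": 8}
priors = {"A": 0.05, "B": 0.10, "C": 0.40}
survivors = chi_gram_filter(counts, n=20, priors=priors, c=2.0)
# 'A': X = (10 - 1) / sqrt(1) = 9.0 -> kept
# 'B': X = (2 - 2) / sqrt(2) = 0.0 -> removed
# 'C': X = (8 - 8) / sqrt(8) = 0.0 -> removed
```

Only 'A' passes the filter here: it occurs ten times where its prior predicts one occurrence, while 'B' and 'C' occur exactly as often as expected and are therefore treated as noise.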
For online prediction, the sequence under investigation is filtered in the same way before its sequence likelihood is computed.

Three variants regarding the computation of the priors \hat{p}^0_i have been investigated in this thesis (see Figure 5.9):

1. \hat{p}^0_i are estimated from all training sequences (failure and non-failure). X_i compares the frequency of occurrence of symbol s_i to its frequency of occurrence within the entire training data.

2. \hat{p}^0_i are estimated from all failure sequences (irrespective of the groups obtained from clustering). X_i compares the frequency of occurrence of symbol s_i to all failure sequences.

3. \hat{p}^0_i are estimated separately for each group of failure sequences from all errors within the group. For each symbol s_i, the test variable X_i compares the occurrence within one time window to the entire group of failure sequences.

All variants have been applied to the data of the case study. An analysis is provided in Section 9.2.6.

5.4 Improving Logfiles

Life would have been easier if logfiles had been written in a format that is suited for automatic processing. From the experience of working with logfiles, a paper has been written (Salfner et al. [227]) discussing several issues relevant for logging. The major concepts are described here in shorter form. At the end of the section, a comparison to the Common Base Event format is added, which is not included in the paper.

5.4.1 Event Type and Event Source

While data like timestamps and process identifiers are given more or less explicitly in logfiles, the logged event itself is in most cases represented in natural language. Analyzing messages like “Could not get connection to service X” reveals that the textual description merges two different pieces of information that should be represented distinguishably:

• what has happened: some connection could not be established.
This information is called the event type.

• what resource the problem arose with, which is “service X” in the example. This information is called the source the event is associated with. Note that the source is in general not identical to the detector issuing the error report.

Event type and source are related to orthogonal defect classification (Chillarege et al. [59]), where the type correlates with the defect type and the source with the defect trigger. However, since in our scheme error events and not the root causes of defects are considered, the event type does not necessarily coincide with the defect type. The source is only a suspected trigger entity, while the defect trigger, as defined by Chillarege et al., describes the entire state in which the defect occurred.

Of course, a natural language sentence is able to carry more information than only “event type” and “source”. To prevent the additional information from being lost, it should be spelled out in additional fields of the log record.

5.4.2 Hierarchical Numbering

In Section 5.1.1 a method has been described to map natural language error messages to event IDs. This step could have been avoided if message IDs had been written directly into the log. Furthermore, if error message IDs are chosen in a systematic way, such an approach can be superior to natural language error messages, as can be shown for the numbering scheme described in the following.

The numbering scheme is based on a hierarchical classification of errors, represented by a tree. The topmost classification is based on the SHIP fault model (c.f. Section 2.5). The software subtree has been further developed, introducing 62 categories, of which an excerpt is shown in Figure 5.10. Error message identifiers are simply constructed of the labels along the path from the root to the leaf node, separated by a dot. This numbering scheme originates from Dewey [51] and has become popular, e.g., with LDAP (Lightweight Directory Access Protocol) [269]. In cases where an error matches several leaves of the tree, all possible identifiers should be written into the log. On the other hand, if an event cannot be resolved down to a leaf category, the most detailed identifiable categorization should be used. Furthermore, the error classification scheme can be extended easily.

Figure 5.10: Hierarchical numbering scheme for error event types with the SHIP model. The example only shows a sub-classification for software errors.

In comparison with freely chosen error IDs, as they occur with methods such as the one presented in Section 5.1.1, the numbering scheme provides two advantages:

1. It provides an ordering that can be exploited to derive a notion of similarity between error event types.

2. It provides means to present error data at multiple levels of detail.

A distance metric. The numbering scheme gives rise to a measure of similarity that could be used, e.g., in clustering algorithms to group error messages. For example, failure prediction algorithms could benefit from a notion of error proximity, or clusters could be analyzed in order to diagnose an apparent problem. The distance metric proposed here is defined as follows:

d(id_1, id_2) := \text{length of the path between } id_1 \text{ and } id_2 ,   (5.12)

which has the properties:

d(id_1, id_2) = 0 \Leftrightarrow id_1 = id_2 ,   (5.13)

d(id_1, id_2) = d(id_2, id_1) ,   (5.14)

d(id_1, id_3) \le d(id_1, id_2) + d(id_2, id_3) ,   (5.15)

from which follows that d(id_1, id_2) is a proper metric.
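Assuming dotted identifiers as described above, the path-length distance of Equation 5.12 can be computed from the identifiers alone by stripping their common label prefix. The SHIP-style labels in the example are hypothetical.

```python
def id_distance(id1, id2):
    """Path length between two nodes of the classification tree, computed
    directly from the dotted identifiers (no tree lookup needed): strip the
    common prefix of labels, then count the remaining labels on both sides."""
    a, b = id1.split("."), id2.split(".")
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return (len(a) - k) + (len(b) - k)

# Siblings under the same parent are two edges apart,
# e.g. id_distance("S.mem.leak", "S.mem.corrupt") == 2,
# while identifiers in different top-level subtrees are far apart.
```

The function satisfies the metric properties of Equations 5.13 through 5.15 for canonical identifiers, and it illustrates why no knowledge of the classification tree itself is required.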
It can be computed efficiently by simply comparing the individual parts of id_1 and id_2 from left to right and calculating d(id_1, id_2) directly from the position at which the two identifiers differ. (This algorithm does not even require knowledge of the error classification tree.)

Due to the lack of system knowledge, it has not been possible to apply hierarchical numbering to the data of the telecommunication system, and hence the distance metric has not been applied to industrial data. Therefore, one potential conceptual problem known from decision trees could not be investigated using real data: the proposed metric can assign a large distance to objects that are closely related in reality but reside in different subtrees (see Figure 5.11).

Figure 5.11: An inherent problem of hard classification approaches such as decision trees: the two highlighted points are assigned a long distance (thick lines), although they are close in reality.

Multiple levels of detail. The proposed numbering scheme supports views of diverse granularity on the data, which makes it possible to present the log data at multiple levels of detail. For example, a failure prediction tool will need more fine-grained information than an administrator who is only observing whether the system is running well. Presentation at various levels of granularity can simply be achieved by truncating the error numbers.

5.4.3 Logfile Entropy

Gaining experience from working with the logfiles of various programs, one gets a notion of what makes a good logfile. In order to assess the quality of logfiles quantitatively, a metric has been developed. Due to its affinity to Shannon’s definition [235], it is called the information entropy of logfiles.

Starting from Shannon’s work, information entropy is defined as:

H(X_i) = \log_2 \left( \frac{1}{P(X_i)} \right) ,   (5.16)

where X_i is a symbol of a signal source, and P(X_i) is the probability that X_i occurs. In terms of error logs, X_i corresponds to the type of a log record.
If P(X_i) = 1, the logfile consists only of messages of type X_i. According to Shannon, as the occurrence of such a log record is fully predictable, it does not convey any new information and the entropy is zero.

However, the frequency of occurrence is only one part of what makes a good log record. A metric must also comprise the information that is given within the record. To measure this, log records are taken to be sets whose elements relate to pieces of information such as timestamp, process ID, etc. Let R_i be the set of information required to fully describe the event that log record X_i is reporting on, and let G_i denote the set of information that is actually given within log record X_i.

Figure 5.12: Sets of required information (R_i) and given information (G_i) of a log record.

As can be seen from Figure 5.12, the intersection R_i ∩ G_i is the required information that is actually present in the log record, and (R_i ∪ G_i) \ (R_i ∩ G_i) is the set of information that is either missing or irrelevant. The bigger the intersection and the smaller the rest, the better the log record. This is expressed by the integrity function I(X_i), where the notation ♯(·) denotes the cardinality of a set:

I(X_i) = \frac{\sharp(R_i \cap G_i)}{\sharp(R_i \cup G_i)} - \frac{\sharp\big((R_i \cup G_i) \setminus (R_i \cap G_i)\big)}{\sharp(R_i \cup G_i)} .   (5.17)

The first term is a Jaccard score [106] for the similarity between given and required information, and the second evaluates the amount of missing and irrelevant information. To see how integrity is measured by I(X_i), consider the following two extreme cases: If a log record contains exactly the information that is required and nothing more, R_i ∩ G_i equals R_i ∪ G_i, and hence I(X_i) equals one. If a log record contains none of the required information (all given information is irrelevant), R_i ∩ G_i = ∅ and the result is −1. Therefore, I(X_i) can take any value in the range [−1, 1].
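Equation 5.17 translates directly into set operations. The required-information set below is a made-up example; the two asserts reproduce the extreme cases just discussed.

```python
def integrity(required, given):
    """Integrity I(X_i) of a log record (Equation 5.17): Jaccard score of
    required vs. given information minus the fraction of missing or
    irrelevant information. Ranges over [-1, 1]."""
    union = required | given
    inter = required & given
    return len(inter) / len(union) - len(union - inter) / len(union)

R = {"timestamp", "source", "event_type", "details"}
# a perfect record contains exactly the required information ...
assert integrity(R, set(R)) == 1.0
# ... while a record with none of the required information scores -1
assert integrity(R, {"ascii_art"}) == -1.0
```

Intermediate cases fall strictly between the two extremes; for instance, a record giving one required item plus one irrelevant one yields a negative but larger value than −1.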
Not only the fraction of given and required information but also the absolute number of statements contained in a log message has an impact on information density. The number of reasonable statements in the record is

S(X_i) = \sharp(R_i \cap G_i) .   (5.18)

Combined with a linear transformation of I(X_i) to the range [0, 1], the quality Q(X_i) of a log record is measured by

Q(X_i) = S(X_i)\, \frac{I(X_i) + 1}{2} .   (5.19)

Finally, the entropy of one log record is the product of the quality Q(X_i) and Shannon’s logarithmic quantity measure given by Equation 5.16:

H(X_i) = Q(X_i) \log_2 \left( \frac{1}{P(X_i)} \right) .   (5.20)

In order to compute the entropy of an entire logfile, the expected value over all log records is computed analogously to Shannon:

H(X) = \sum_{i=1}^{m} P(X_i)\, H(X_i) .   (5.21)

Properties of logfile entropy. Q(X_i) contributes to H(X_i) on a linear scale, while the quantity function has a logarithmic scale. Not surprisingly, for a high quality Q(X_i) that occurs with very small probability P, the entropy takes on very high values (see Figure 5.13). In order to compute the maximum entropy, integrity I(X_i) = 1 and P(X_i) = 1/m is assumed for all m log records. Then, the maximum entropy is [103]:

H_{max}(X) = \overline{\sharp R}\, \log_2 m ,   (5.22)

where \overline{\sharp R} denotes the mean number of required statements per log record.

The set R_i is defined to contain all the information needed to comprehensively describe the event that caused the writing of the log record, but nothing more. Analyzing R_i for each error type is a laborious task. In Salfner et al. [227] an example is provided and it is shown how an intuitively better logfile results in an increased entropy.

5.4.4 Existing Solutions

When IBM started its autonomic computing initiative, it was quickly found that automatic processing of logfiles is crucial; Bridgewater [37] called them “a nervous system for computer infrastructures”. Against the background of multi-vendor commercial-off-the-shelf systems, a log standard had to be developed.
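Equations 5.17 through 5.20 combine into a short per-record computation. The record contents and probabilities below are illustrative only.

```python
import math

def record_entropy(required, given, p):
    """Entropy of a single log record (Equations 5.17-5.20):
    I per Equation 5.17, quality Q = #(R ∩ G) * (I + 1) / 2,
    and entropy H = Q * log2(1 / P)."""
    union = required | given
    inter = required & given
    I = len(inter) / len(union) - len(union - inter) / len(union)
    Q = len(inter) * (I + 1) / 2.0
    return Q * math.log2(1.0 / p)

R = {"timestamp", "source", "event_type"}
# A perfect record occurring with probability 1/8 scores 3 * log2(8) = 9 bits;
# a fully predictable record (p = 1) scores 0 regardless of its quality.
perfect = record_entropy(R, set(R), 0.125)   # -> 9.0
predictable = record_entropy(R, set(R), 1.0)  # -> 0.0
```

Summing P(X_i) * record_entropy(...) over all record types then gives the logfile entropy of Equation 5.21.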
Together with other companies such as HP, Oracle and SAP, the Common Base Event [196], which is part of the “Web Services Distributed Management” standard (WSDM 1.0), has been developed to enable standardized logging.

Figure 5.13: A surface plot of the entropy H(X_i). Quality Q has been normalized to the range [0, 1].

A Common Base Event (CBE) is a specification of one event, which has been called a log record so far. A CBE consists of three major parts:

1. The component reporting a particular situation
2. The component affected by the situation
3. The situation itself

Each of the three parts is further specified by several fixed attributes. The specification contains a UML description of the CBE format, of which Figure 5.14 is a simplified version to visualize the concept.

Figure 5.14: Principal structure of a Common Base Event, depicted as a UML model: a reporter and a source ComponentIdentification (with attributes such as application, componentID, componentType, processID), and a Situation (with categoryName and sub-types such as StartSituation, ConfigureSituation, DependencySituation, ConnectSituation); the event itself carries creationTime, severity, and message.

Evaluating CBE with respect to the issues raised in the previous sections, the following conclusions can be drawn:

• CBE also separates event type and source: the “Situation” specifies the event type, and the “source ComponentIdentification” contains a specification of the failed resource.

• The “reporter ComponentIdentification” corresponds to parts of the log record that have not been addressed in the previous sections. For example, in application logfiles, the reporter is in most cases the application that wrote the log.

• Instead of a hierarchical numbering scheme specifying the event, eleven “situationNames” have been defined. Valid situation identifiers include “START”, “CONNECT”, or “CONFIGURE”.
For this reason, the proposed distance metric cannot be applied to CBE.

The error logs of Sun Microsystems’ Solaris 10 operating system, which has been released in 2004, also separate event type and event source. Solaris 10 error reports support three levels of detail by structuring each report into an outline, error details, and error ID details. However, instead of a hierarchical numbering scheme, unique event IDs are used that can be looked up on the Internet in order to obtain further details.

5.5 Summary

This chapter has covered the steps that are applied to obtain a set of training sequences from error logfiles. Specifically, this process involves:

• Mapping natural language error messages to event IDs using the Levenshtein edit distance.

• Removal of repetitive reportings of the same cause by means of tupling.

• Extraction of sequences from the filtered logfiles.

• Grouping of failure sequences that belong to the same failure mechanism. To achieve this, hierarchical clustering based on a dissimilarity matrix computed using small HSMMs is applied.

• Filtering the noise that is present in the data by means of a statistical test related to the χ² goodness of fit test.

The last section of the chapter addressed the topic of logfiles in general. It has been proposed that event type and source should be separated and that a hierarchical numbering scheme should be applied to assign IDs to error events. The numbering scheme makes it possible to define a distance metric and to present logfiles at various levels of detail. Furthermore, a measure for the quality of logfiles has been developed. The measure is based on Shannon’s definition of information entropy and is hence called “logfile entropy”. Finally, these principles of “educated logging” have been compared to an existing logging standard from autonomic computing, the Common Base Event.

Contributions of this chapter.
This chapter has introduced

• a novel approach to identify failure mechanisms in the system by means of failure sequence clustering, which may also be helpful for failure diagnosis;

• a novel approach to noise reduction in failure sequences by means of the χ²-related statistical test;

• a novel way to represent error events using hierarchical numbering, which gives rise to a definition of a distance between error event IDs;

• a novel way to assess the quality of logfiles by means of an entropy measure for logfiles.

Relation to other chapters. Having covered data preprocessing in this chapter, the extended hidden Markov model, which is the heart of the failure prediction approach presented in this dissertation, is described in the next chapter.

Chapter 6

The Model

This chapter describes the essence of the proposed approach to failure prediction: the hidden semi-Markov model that is used for pattern recognition. In Section 6.1, the model is defined. Subsequently, Section 6.2 describes how the model is used to process temporal sequences, followed by a delineation of the training procedure in Section 6.3. A proof of convergence for the training procedure is given in Section 6.5, and modeling issues that are specific to failure prediction are discussed in Section 6.6. Finally, in Section 6.7, computational complexity is analyzed.

6.1 The Hidden Semi-Markov Model

Similar to the way standard hidden Markov models (HMMs) are an extension of discrete-time Markov chains (DTMCs), hidden semi-Markov models (HSMMs) extend semi-Markov processes (SMPs). For this reason, SMPs are defined first, followed by their extension to HSMMs.

6.1.1 Wrap-up of Semi-Markov Processes

SMPs are continuous-time stochastic processes that allow probability distributions to be specified for the duration of transitions from one state to the next. Several definitions exist, which all lead to the same properties. In this dissertation, the approach of Kulkarni [149] is adopted.
Semi-Markov processes are a continuous-time extension of Markov renewal sequences, which are defined as follows: A sequence of bivariate random variables {(Y_n, T_n)} is called a Markov renewal sequence if

1. T_0 = 0, \; T_{n+1} \ge T_n; \; Y_n \in S , and   (6.1)

2. P(Y_{n+1} = j, T_{n+1} - T_n \le t \mid Y_n = i, T_n, \ldots, Y_0, T_0) = P(Y_1 = j, T_1 \le t \mid Y_0 = i) \quad \forall\, n \ge 0 .   (6.2)

Here, S denotes the set of states, and the random variables Y_n and T_n denote the state and time of the n-th element in the Markov renewal sequence. Note that T_n refers to points in time on a continuous time scale and t is the length of the interval between T_n and T_{n+1}. Similarly to Equation 4.3 on Page 56, Equation 6.2 expresses that Markov renewal sequences are memoryless and time-homogeneous: as the transition probabilities only depend on the immediate predecessor, the process has no memory of previous states, and since the transition probabilities at time n are equal to the probabilities at time 0, the process is time-homogeneous.

Let g_{ij}(t) denote the conditional probability that state s_j follows s_i after time t, as defined by Equation 6.2. Then the matrix G(t) := [g_{ij}(t)] is called the kernel of the Markov renewal sequence. Note that g_{ij}(t) has all the properties of a cumulative probability distribution except that the limiting probability p_{ij} only has to be equal to or less than one:

p_{ij} := \lim_{t \to \infty} g_{ij}(t) = P(Y_1 = j \mid Y_0 = i) \le 1 .   (6.3)

Even though Markov renewal sequences are defined on a continuous time scale, they form a discrete sequence of points. If the gaps between the points of a Markov renewal sequence are “filled”, a semi-Markov process (SMP) is obtained. More formally: A continuous-time stochastic process {X(t), t ≥ 0} with countable state space S is said to be a semi-Markov process if

1. it has piecewise constant, right-continuous sample paths, and

2. {(Y_n, T_n), n ≥ 0} is a Markov renewal sequence, where T_n is the n-th jump epoch and Y_n = X(T_n+).
Y_n = X(T_n+) denotes that the state X of the SMP at any time t is given by the state Y_n of the Markov renewal sequence. The notation T_n+ indicates that the sample path is right-continuous, and n is determined as the largest index for which T_n ≤ t (see Figure 6.1).

Figure 6.1: A semi-Markov process X(t) defined by a Markov renewal sequence {(Y_n, T_n)}.

An SMP is called regular if it only performs a finite number of transitions in a finite amount of time. As only regular SMPs are considered in this thesis, the term “regular” will be omitted from now on.

As can be seen from Equation 6.3, the limiting probabilities p_{ij} “eliminate” the temporal behavior. Hence, they define a DTMC that is said to be embedded in the SMP. From this analogy it is clear that the following property holds for each transient state s_i:

\forall\, i : \; \sum_{j=1}^{N} p_{ij} = 1 ,   (6.4)

expressing the fact that the SMP is certain to leave state s_i as time t approaches infinity. In addition to the notion of the embedded DTMC, the limiting probabilities p_{ij} can be used to define a quantity that helps to understand the way SMPs operate. Let d_{ij}(t) denote a probability distribution for the duration of a transition from state s_i to state s_j:

d_{ij}(t) = P(T_1 \le t \mid Y_0 = i, Y_1 = j) .   (6.5)

Using the limiting probabilities p_{ij}, the durations d_{ij}(t) can be computed from g_{ij}(t) in the following way:

d_{ij}(t) = \begin{cases} \dfrac{g_{ij}(t)}{p_{ij}} & \text{if } p_{ij} > 0 \\[4pt] 1 & \text{if } p_{ij} = 0 , \end{cases}   (6.6)

and therefore, g_{ij}(t) can be split into a transition probability and a transition duration distribution:

g_{ij}(t) = p_{ij}\, d_{ij}(t) ,   (6.7)

which leads to an intuitive description of the behavior of SMPs: Assume that at time 0 the system enters state i. Then, it chooses the next state to be j according to probability p_{ij}. Having decided upon the next state j, it stays in state i for a random amount of time sampled from the distribution d_{ij}(t) before it enters state j.
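This generative behavior (choose the successor according to p_ij, then wait a d_ij-distributed amount of time) can be illustrated by a small simulation sketch. The two-state chain and the constant transition duration below are illustrative assumptions, not part of the model used in this thesis.

```python
import random

def simulate_smp(pi, P, sample_duration, t_end, rng):
    """Simulate one sample path of a semi-Markov process given the initial
    distribution pi, the embedded-DTMC transition matrix P, and a function
    sample_duration(i, j) that draws from d_ij(t). Returns the Markov
    renewal sequence [(Y_0, T_0), (Y_1, T_1), ...] up to time t_end."""
    def draw(probs):
        u, acc = rng.random(), 0.0
        for k, p in enumerate(probs):
            acc += p
            if u < acc:
                return k
        return len(probs) - 1

    state, t, path = draw(pi), 0.0, []
    while t <= t_end:
        path.append((state, t))
        nxt = draw(P[state])              # choose the successor per p_ij ...
        t += sample_duration(state, nxt)  # ... then wait a d_ij-distributed time
        state = nxt
    return path

rng = random.Random(42)
P = [[0.0, 1.0], [1.0, 0.0]]  # two states, deterministic alternation
path = simulate_smp([1.0, 0.0], P, lambda i, j: 0.5, t_end=2.0, rng=rng)
# states alternate 0, 1, 0, ... at times 0.0, 0.5, 1.0, ...
```

Replacing the constant-duration lambda with draws from actual distributions d_ij(t) yields a general SMP simulator.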
Once the SMP enters state j, it loses all memory of the history and behaves as before, starting from state j. Note that the theory of SMPs allows p_{ii} \neq 0, i.e., the SMP may return to state i immediately after leaving it. However, for simplicity, it will be assumed from now on that p_{ii} = 0. This description of SMPs also shows why they are called semi-Markov processes: the choice of the successor state is a Markov process, but the duration probability depends on both the current and the successor state and is therefore non-Markovian. Hence the name semi-Markov.

Finally, it should be noted that SMPs are fully specified by two quantities:

1. the initial distribution \pi = [\pi_i] = [P(X(0) = i)]
2. the kernel G(t) of the underlying Markov renewal sequence.

From Equation 6.7 it follows that G(t) can alternatively be specified by P = [p_{ij}], the transition matrix of the embedded DTMC, and D(t) = [d_{ij}(t)], defining the probability distributions of the duration of each transition from s_i to s_j. Be aware that Equation 6.7 only holds for each g_{ij}(t) separately, and hence the matrices P and D(t) can only be multiplied element-wise.

6.1.2 Combining Semi-Markov Processes with Hidden Markov Models

HSMMs extend SMPs in the same way that HMMs extend DTMCs. Hence, each time the stochastic process of state traversals enters a state s_i, an observation o_j is produced according to the probability distribution b_{s_i}(o_j). Since error event-based failure prediction evaluates temporal sequences with discrete symbols, only discrete distributions b_{s_i}(o_j) are considered here. Nevertheless, the approach could easily be extended to continuous, multimodal outputs.^1 An example is shown in Figure 6.2.

^1 See, e.g., Liporace [168], Juang et al. [137], Rabiner [210] for a summary of how this is done for discrete-time HMMs.
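The generative behavior just described — choose the successor state via p_{ij}, draw the sojourn time from d_{ij}, and emit a symbol via b_{s_i}(o_j) — can be sketched in a few lines of Python. The two-state model, the exponential duration kernels, and all probability values below are illustrative assumptions, not parameters from this thesis:

```python
import random

def sample_hsmm(pi, P, B, rates, length, seed=7):
    """Sample an observation sequence from an HSMM: traverse the hidden
    semi-Markov process (successor via p_ij, sojourn via d_ij) and emit one
    symbol per visited state via b_si(o_j).  Exponential duration kernels
    are an illustrative assumption."""
    rng = random.Random(seed)
    N, M = len(P), len(B[0])
    state = rng.choices(range(N), weights=pi)[0]
    t, seq = 0.0, []
    for _ in range(length):
        # emit an observation symbol according to b_si(o_j)
        seq.append((t, rng.choices(range(M), weights=B[state])[0]))
        # embedded DTMC: choose the successor j with probability p_ij (p_ii = 0)
        nxt = rng.choices(range(N), weights=P[state])[0]
        # stay in i for a random time drawn from d_ij before entering j
        t += rng.expovariate(rates[state][nxt])
        state = nxt
    return seq

# a toy two-state model alternating between an "intact" and a "degraded" state
seq = sample_hsmm(pi=[1.0, 0.0],
                  P=[[0.0, 1.0], [1.0, 0.0]],
                  B=[[0.9, 0.1], [0.1, 0.9]],
                  rates=[[2.0, 2.0], [2.0, 2.0]],
                  length=5)
```

Each entry of `seq` is a pair (t_n, o) of emission time and symbol index, i.e., a sampled realization of the observation process of Figure 6.2.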
Figure 6.2: Similar to the HMM shown in Figure 4.2, an HSMM consists of a semi-Markov process of (hidden) state traversals defined by g_{ij}(t) and output probabilities b_{s_i}(o_j)

According to Equation 6.7, g_{ij}(t) is the product of the limiting probabilities p_{ij} and the durations d_{ij}(t). The durations d_{ij}(t) can in general be arbitrary time-continuous cumulative distributions, which need not even be differentiable. For example, d_{ij}(t) can be a piecewise constant non-decreasing function. In this thesis, however, a convex combination of parametrized probability distributions is assumed:

d_{ij}(t) = \sum_{r=0}^{R} w_{ij,r} \, \kappa_{ij,r}(t \mid \theta_{ij,r})   (6.8)

\text{s.t.} \quad \sum_{r=0}^{R} w_{ij,r} = 1, \quad w_{ij,r} \ge 0.   (6.9)

Each duration distribution d_{ij}(t) is a weighted sum of cumulative probability distributions \kappa_{ij,r}(t \mid \theta_{ij,r}), each with a specific set of parameters \theta_{ij,r} and weight w_{ij,r}. The weights sum to one so that a proper probability distribution is obtained. The single distributions \kappa_{ij,r} are called kernels. For example, if \kappa_{ij,r} is a Gaussian kernel, the parameters \theta_{ij,r} consist of the mean \mu_{ij,r} and the variance \sigma^2_{ij,r}. Additionally, as stated above, it is assumed that p_{ii} = 0, expressing the fact that there are no self-transitions in the model. In the literature, such a convex combination is sometimes termed a mixture of probability distributions, even though that term is mathematically less precise.

In summary, an HSMM is completely defined by

• the set of states S = \{s_1, \ldots, s_N\}
• the set of observation symbols O = \{o_1, \ldots, o_M\}
• the N-dimensional initial state probability vector \pi
• the N \times M matrix of emission probabilities B
• the N \times N matrix of limiting transition probabilities P
• the N \times N matrix of cumulative transition duration distribution functions D(t)

For better readability of formulas, let \lambda = \{\pi, B, P, D(t)\} denote the set of parameters. Taking Equation 6.7 into account, the notation \lambda = \{\pi, B, G(t)\} is sometimes also used.
S and O are not included since O is determined by the application and S is not altered by the training procedure, as explained in Section 6.3.

6.2 Sequence Processing

In machine learning, a training procedure is usually applied first in order to adjust the model parameters to a training data set; after that, the resulting model is applied. For failure prediction with HSMMs, this translates into determining the model parameters \lambda and then processing error sequences observed during system runtime. Nevertheless, the description of the two steps is reversed here for simplicity, since sequence processing is better suited to explain the hidden semi-Markov model. Training is then covered in the next section.

6.2.1 Recognition of Temporal Sequences: The Forward Algorithm

Online failure prediction with HSMMs consists of three stages: preprocessing, sequence recognition resulting in sequence likelihood, and subsequent classification (c.f. Figure 2.10 on Page 20). This section covers the second stage and provides the algorithm to compute sequence likelihood from a given observation sequence: the forward algorithm.

Figure 6.3 illustrates some notations that are used throughout this chapter. The notation O_i = o_k is used to describe observation sequences, expressing that the i-th symbol in a sequence is symbol o_k \in O. The notation is adopted from the literature on random variables, such as Cox & Miller [67], where capital letters denote the variables and small letters the realizations of the variables. In the figure, o_k is either "A" or "B". The events occur at times t_0 to t_2. However, if relative distances between events are relevant, time is represented by delays d_i = t_i - t_{i-1}. The sequence of hidden states that are traversed to generate the observations is denoted as a sequence of random variables S_i = s_j, where s_j \in S.

Figure 6.3: Notations used for temporal sequences in this chapter.
Capital letters denote random variables, small letters realizations (actual values). [O_i] denotes the sequence of observation symbols and [S_i] the sequence of hidden states. Time is expressed as the delay d_i between the observations at times t_i and t_{i-1}.

The forward algorithm for HSMMs is derived from the discrete-time equivalent as defined by Equation 4.9 on Page 58. The fact that sequences in continuous time are considered leads to a change in time indexing: instead of t denoting an equidistant time step, t_k denotes the time when the k-th symbol occurred. As can be seen from comparing Figure 6.2 with Figure 4.2 on Page 57, the transition probabilities a_{ij} are replaced by g_{ij}(t) in HSMMs. However, a strict one-to-one replacement is not sufficient, as can be seen from the following considerations:

1. Assume that at time t_{k-1} the stochastic process has just entered state s_i and has emitted observation symbol o_l: S_{k-1} = s_i, O_{k-1} = o_l.
2. Assume that there is a state transition when the next observation occurs. Hence, the duration of the transition is d_k := t_k - t_{k-1}.
3. Knowing d_k, transition probabilities to successor states s_h can be computed by g_{ih}(d_k). Assume that the successor state is s_j: S_k = s_j.
4. The subsequent symbol O_k = o_m is then emitted by state s_j with probability b_{s_j}(o_m).
5. However, the inequality

\sum_{h=1}^{N} g_{ih}(d_k) \le 1   (6.10)

holds, and equality is only reached for d_k \to \infty (c.f., Equations 6.3 and 6.4, keeping in mind that g_{ii}(t) \equiv 0). Hence, for d_k < \infty, the sum is less than one, which means that some fraction of the probability mass is not distributed among successor states. The explanation is as follows: there is a non-zero probability that the stochastic process still resides in state s_i when time d_k has elapsed. In this case, state s_i generates symbol o_m, and the probability for this is

1 - \sum_{h=1}^{N} g_{ih}(d_k).   (6.11)

6.
Applying the Markov assumptions, the stochastic process loses all memory and the considerations for the next observation start again from 1.

In order to formalize these considerations, a probability v_{ij}(d_k) is defined as follows:

v_{ij}(d_k) = P(S_k = s_j, d_k = t_k - t_{k-1} \mid S_{k-1} = s_i)   (6.12)

= \begin{cases} g_{ij}(d_k) & \text{if } j \neq i \\ 1 - \sum_{h \neq i} g_{ih}(d_k) & \text{if } j = i \end{cases}   (6.13)

with the property that

\forall \, i, d: \; \sum_{j=1}^{N} v_{ij}(d) = 1.   (6.14)

One of the advantageous characteristics of this approach is that it can handle the situation where the order of errors occurring closely together is changed, which happens frequently in systems where several components send error messages to a central logging component (c.f., Property 6 on Page 15). More technically, if two symbols O_1 = o_a and O_2 = o_b occur at the same time (d = 0), the resulting sequence likelihood is identical regardless of the order, since for d = 0 the process stays in state s_i with probability one and the resulting part of the sequence likelihood is b_{s_i}(o_a) \, b_{s_i}(o_b) = b_{s_i}(o_b) \, b_{s_i}(o_a).

The forward algorithm. Similar to the case of discrete-time HMMs (c.f. Section 4.1), the forward variable \alpha for HSMMs equals the probability of the sequence up to time t_k for all state sequences that end in state s_i (at time t_k):

\alpha_k(i) = P(O_0 O_1 \ldots O_k, S_k = s_i \mid \lambda).   (6.15)

By replacing a_{ij} with v_{ij}(t) and changing the time indexing, the following recursive computation scheme for \alpha_k(i) is derived from Equation 4.9 on Page 58:

\alpha_0(i) = \pi_i \, b_{s_i}(O_0)
\alpha_k(j) = \sum_{i=1}^{N} \alpha_{k-1}(i) \, v_{ij}(t_k - t_{k-1}) \, b_{s_j}(O_k); \quad 1 \le k \le L.   (6.16)

The forward algorithm can also be visualized by a trellis structure, as shown in Figure 4.3 on Page 59.

Sequence likelihood. In the context of online failure prediction with HSMMs, sequence likelihood is a probabilistic measure for the similarity of the observed error sequence to the sequences in the training data set.
More specifically, sequence likelihood is denoted as P(o \mid \lambda): the probability that an HSMM with parameter set \lambda generates the observation sequence o. As for standard HMMs, this probability can be computed by summing over the last column of the trellis structure for the forward variable \alpha:

P(o \mid \lambda) = \sum_{i=1}^{N} \alpha_L(i).   (6.17)

When executing the forward algorithm on computers, probabilities quickly approach the limit of computational accuracy, even with double-precision floating point numbers. Therefore, a technique called scaling is applied (see, e.g., Rabiner [210]). The values of column k in the trellis for \alpha are scaled to one by a scaling factor c_k:

c_k := \frac{1}{\sum_i \alpha_k(i)} \quad \Rightarrow \quad \sum_i c_k \, \alpha_k(i) = \sum_i \alpha'_k(i) = 1.   (6.18)

Instead of the sequence likelihood, which also becomes too small very quickly, the logarithm of the likelihood is used. It can be shown that the so-called sequence log-likelihood \log[P(o \mid \lambda)] can be computed easily by summing the logarithms of the scaling factors:

\log[P(o \mid \lambda)] = -\sum_{k=1}^{L} \log c_k.   (6.19)

Finding the most probable sequence of states: Viterbi algorithm. The forward algorithm incorporates all possible state sequences. In some applications, however, this is not desired and only the most probable sequence of states is of interest. This is computed by the Viterbi algorithm.

In analogy to discrete-time HMMs,^2 the Viterbi algorithm is derived from the forward algorithm by replacing the sum over all previous states with the maximum operator:

\delta_k(i) = \max_{S_0 S_1 \ldots S_{k-1}} P(O_0 O_1 \ldots O_k, S_0, S_1, \ldots, S_{k-1}, S_k = s_i \mid \lambda)   (6.20)
\delta_0(i) = \pi_i \, b_{s_i}(O_0)   (6.21)
\delta_k(j) = \max_{1 \le i \le N} \delta_{k-1}(i) \, v_{ij}(t_k - t_{k-1}) \, b_{s_j}(O_k).   (6.22)

Hence, \max_i \delta_L(i) is the maximum probability of a single state sequence generating the observation sequence o. The sequence of states itself can be obtained by storing which state was selected by the maximum operator and then tracing back through the array, starting from state \arg\max_i \delta_L(i).

^2 c.f., Equations 4.20–4.22
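The scaled forward recursion of Equations 6.16–6.19 can be sketched as follows. The kernel v_{ij}(d) is built from Equation 6.13; the two-state model, the exponential duration CDFs, and all numbers are illustrative assumptions:

```python
import math

def hsmm_loglik(obs, times, pi, B, v):
    """Scaled forward algorithm (Eqs. 6.16-6.19): returns log P(o | lambda).
    `v(i, j, d)` is the transition kernel of Eq. 6.13; a sketch, not the
    thesis implementation."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]      # alpha_0
    log_lik = 0.0
    for k in range(len(obs)):
        if k > 0:
            d = times[k] - times[k - 1]                   # delay d_k
            alpha = [sum(alpha[i] * v(i, j, d) for i in range(N)) * B[j][obs[k]]
                     for j in range(N)]
        c = 1.0 / sum(alpha)                              # scaling factor c_k
        alpha = [c * a for a in alpha]
        log_lik -= math.log(c)                            # Eq. 6.19
    return log_lik

# toy kernel: g_ij(t) = p_ij * (1 - exp(-t)), i.e. exponential duration CDFs
P0 = [[0.0, 1.0], [1.0, 0.0]]
def v(i, j, d):
    g = lambda h: P0[i][h] * (1.0 - math.exp(-d))
    return g(j) if j != i else 1.0 - sum(g(h) for h in range(2) if h != i)

pi0 = [0.6, 0.4]
B0 = [[0.9, 0.1], [0.2, 0.8]]
ll = hsmm_loglik(obs=[0, 1, 0], times=[0.0, 0.5, 1.2], pi=pi0, B=B0, v=v)
```

Note that for two symbols with delay d = 0, the kernel keeps all probability mass in the current state, so the log-likelihood is invariant under swapping the two symbols — the reordering property discussed above.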
6.2.2 Sequence Prediction

Sequence prediction deals with the estimation of the future behavior of a temporal sequence. Although not used for failure prediction in this thesis, other application areas exist that take advantage of anticipating the further evolution of a given sequence. Given a model and the beginning of a temporal sequence, sequence prediction addresses the question of how the sequence will evolve in the near future, based on the characteristics expressed by the underlying model. More precisely, two different types of sequence prediction can be distinguished:

1. What is the probability for the next observation of the sequence?
2. What is the probability that the underlying stochastic process will reach a certain distinguished state within some time interval?

Probability of the Next Observation. In order to estimate the probability of the next observation, the following probability is defined:

\eta_t(o_k) = P(O_{L+1} = o_k, T \le t \mid t_L, O_0 \ldots O_L, \lambda); \quad t \ge t_L.   (6.23)

Here, \eta_t(o_k) is the probability that the next emitted observation symbol is o_k, occurring at time T \le t, given an HSMM \lambda, the beginning of an observation sequence o = O_0 \ldots O_L, and the time of occurrence of the last symbol t_L. \eta_t(o_k) can be computed as follows:

\eta_t(o_k) = \sum_{j=1}^{N} P(S_{L+1} = s_j, O_{L+1} = o_k, T \le t \mid t_L, o, \lambda)   (6.24)

= \sum_{j=1}^{N} P(O_{L+1} = o_k \mid S_{L+1} = s_j, T \le t, t_L, o, \lambda) \times P(S_{L+1} = s_j, T \le t \mid t_L, o, \lambda).   (6.25)

The first probability in Equation 6.25 is simply the observation probability for state s_j:

P(O_{L+1} = o_k \mid S_{L+1} = s_j, T \le t, t_L, o, \lambda) = b_{s_j}(o_k),   (6.26)

whereas the second probability in Equation 6.25 can be split up further:

P(S_{L+1} = s_j, T \le t \mid t_L, o, \lambda) =   (6.27)
= \sum_{i=1}^{N} P(S_{L+1} = s_j, S_L = s_i, T \le t \mid t_L, o, \lambda)   (6.28)
= \sum_{i=1}^{N} P(S_{L+1} = s_j, T \le t \mid S_L = s_i, t_L, o, \lambda) \, P(S_L = s_i \mid t_L, o, \lambda).
(6.29)

The first term of the product in Equation 6.29 is the probability that the state process is in state s_j at time T \le t given that it was in state s_i at time t_L. This equals the cumulative probability distribution v_{ij}(d):

P(S_{L+1} = s_j, T \le t \mid S_L = s_i, t_L, o, \lambda) = v_{ij}(t - t_L).   (6.30)

The second term of the product in Equation 6.29 is the probability that the state process resides in state s_i at the end of the observation sequence. This can be computed by use of the forward algorithm:

P(S_L = s_i \mid t_L, o, \lambda) = \frac{P(o, S_L = s_i \mid t_L, \lambda)}{P(o \mid \lambda)} = \frac{\alpha_L(i)}{P(o \mid \lambda)} = \frac{\alpha_L(i)}{\sum_{j=1}^{N} \alpha_L(j)}.   (6.31)

Summarizing the results, the probability that observation symbol o_k will occur up to time t in the future can be computed by

\eta_t(o_k) = \sum_{j=1}^{N} b_{s_j}(o_k) \sum_{i=1}^{N} v_{ij}(t - t_L) \, \frac{\alpha_L(i)}{P(o \mid \lambda)}.   (6.32)

Probability to Reach a Distinguished State. Computing the probability of the next observation symbol involved one single state transition (see Equation 6.30). However, if not the next observation symbol is of interest but the probability distribution to reach a distinguished state, computation of the first-step successor is not sufficient. Rather, the general probability to reach the distinguished state s_d for the first time by time t, irrespective of the number of hops, is desired:

P(S_d = s_d, T_d \le t \mid o, \lambda); \quad T_d = \min(t : S_t = s_d).   (6.33)

The procedure to compute this probability involves two steps:

1. Based on the given observation sequence o and the model \lambda, compute the probability distribution of the last hidden state in the sequence, P(S_L = s_i \mid o, \lambda), using Equation 6.31.
2. Use P(S_L = s_i \mid o, \lambda) as the starting point to estimate the future behavior of the system. The objective is the probability defined in Equation 6.33, which is called the first passage time distribution.

In principle, an estimation of future behavior should take into account both the process of hidden state traversals and the generated observation symbols.
Taking observation symbols into account would result in a sum over all symbols for each state. However, only the semi-Markov process of hidden state transitions has to be analyzed, since the observation probabilities can be omitted due to \sum_{k=1}^{M} b_{s_i}(o_k) = 1.

In order to compute the first passage time distribution, so-called first step analysis (Kulkarni [149]) is applied. The essence of first step analysis can be summarized as follows: In order to reach the designated state, the first step of the stochastic process either reaches the state directly, or the process transits to an intermediate state. In the latter case, the designated state is then reached directly from the intermediate state or via another intermediate state. This establishes a recursive computation scheme. As in Equation 6.33, let T_d denote the time to first reach the designated state s_d, and let F_{id}(t) = P(T_d \le t \mid S_L = s_i) denote the probability to reach s_d by time t given that the process is in state s_i at the end of the observation sequence. Then

F_{id}(t) = g_{id}(t) + \sum_{j \neq d} \int_0^t dg_{ij}(\tau) \, F_{jd}(t - \tau),   (6.34)

where g_{id}(t) is the cumulative probability distribution as defined in Section 6.1.1 and \int_0^t dg_{ij}(\tau) \, F_{jd}(t - \tau) denotes a Lebesgue-Stieltjes integral (see, e.g., Saks [221]). Equation 6.34 is derived from first step analysis: either state s_d is reached directly within t — for which the probability is g_{id}(t) — or via some intermediate state s_j \neq s_d. In this case the transition to s_j takes time \tau, and state s_d is then reached within the remaining time t - \tau. As may have become clear from the formula, this is a recursive problem, since starting from s_j the destination state may either be reached directly or via yet another intermediate state. However, the duration of the transition from s_i to the intermediate state s_j is not known. Therefore, all possible values for \tau have to be considered, which results in the integral with bounds 0 and t.
In order to solve the equation system defined by Equation 6.34, a recursive scheme can be defined:

F^{(0)}_{id}(t) = 0
F^{(n+1)}_{id}(t) = g_{id}(t) + \sum_{j \neq d} \int_0^t dg_{ij}(\tau) \, F^{(n)}_{jd}(t - \tau).   (6.35)

Kulkarni [149] showed that this recursion has the approximation property

\sup_{0 \le \tau \le t} \left| F_{id}(\tau) - F^{(n)}_{id}(\tau) \right| \le \mu^{\lfloor n/r \rfloor},   (6.36)

where \mu and r are derived from a result on regular Markov renewal processes stating that for any fixed t \ge 0, an integer r and a real number 0 < \mu < 1 exist such that

\sum_j g^{*r}_{ij}(t) \le \mu,   (6.37)

where g^{*r}_{ij}(t) denotes the r-th convolution of g_{ij}(t) with itself.

Since F_{id}(t) assumes the stochastic process to be initially in state s_i, the sum over all states has to be computed, where the probability of each state is determined by Equation 6.31. Hence, in summary, the probability to reach state s_d within time t is given by:

P(S_d = s_d, T_d \le t \mid o, \lambda) = \sum_i F_{id}(t) \, P(S_L = s_i \mid o, \lambda).   (6.38)

Computation of Equation 6.35 can be quite costly, depending on n, the maximum number of transitions up to time t that are considered in the approximation. Additionally, each step involves the solution of the Lebesgue-Stieltjes integral, which in many cases must be solved numerically, as there are many distributions for which no analytical representation exists (e.g., the cumulative distribution of a Gaussian random variable). However, computational complexity can be limited, since the maximum number of transitions is commonly limited by the application (in most applications, there is a minimum delay between successive observations). Furthermore, a minimum delay between observations also limits the number of points in time for which the Lebesgue-Stieltjes integral has to be approximated. It should also be noted that F_{id}(t) depends on the parameters of the HSMM but not on the observation sequence: hence, the complex computations, including the integrations, can be precomputed.
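The recursion of Equation 6.35 can be sketched numerically on a uniform time grid, with the Lebesgue-Stieltjes integral approximated by increments of the kernel; the grid resolution, iteration count, and the single-transition toy kernel are illustrative assumptions:

```python
import math

def first_passage(g, N, d, t_grid, n_iter=20):
    """Iterate the first-passage recursion of Eq. 6.35 on a uniform time grid,
    approximating the Lebesgue-Stieltjes integral by kernel increments.
    A numerical sketch, not the thesis implementation."""
    K = len(t_grid)
    F = [[0.0] * K for _ in range(N)]                 # F^(0)_id(t) = 0
    for _ in range(n_iter):
        Fn = [[0.0] * K for _ in range(N)]
        for i in range(N):
            for k in range(K):
                acc = g(i, d, t_grid[k])              # reach s_d directly within t
                for j in range(N):
                    if j == d:
                        continue
                    # integral term: sum of dg_ij(tau) * F_jd(t - tau)
                    for m in range(1, k + 1):
                        dg = g(i, j, t_grid[m]) - g(i, j, t_grid[m - 1])
                        acc += dg * F[j][k - m]
                Fn[i][k] = acc
        F = Fn
    return F

# toy kernel: only the transition s_0 -> s_1 exists, with an exponential CDF
g = lambda i, j, t: (1.0 - math.exp(-t)) if (i, j) == (0, 1) else 0.0
t_grid = [0.1 * m for m in range(21)]
F = first_passage(g, N=2, d=1, t_grid=t_grid)
```

In this degenerate example the recursion reproduces F_{01}(t) = g_{01}(t) exactly, since no intermediate states exist; with more transitions, the convolution terms contribute.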
An online evaluation of Equation 6.38 then only involves computing Equation 6.31 for each state, multiplying with the precomputed F_{id}(t), and summing the products.

6.3 Training Hidden Semi-Markov Models

In the previous sections it has been assumed that the parameters \lambda of an HSMM are given. This section deals with the task of estimating the parameters from training sequences. For this purpose, the Baum-Welch algorithm for standard HMMs (see Section 4.1.2) is adapted to hidden semi-Markov models.

6.3.1 Beta, Gamma and Xi

In addition to the forward variable \alpha_k(i), the reestimation formulas for standard HMMs are based on a backward variable \beta_t(i), a state probability \gamma_t(i), and a transition probability \xi_t(i, j). The same applies to reestimation for HSMMs, which uses the equivalent variables \beta_k(i), \gamma_k(i) and \xi_k(i, j).

Analogously to standard HMMs, the backward variable \beta_k(i) denotes the probability of the rest of the observation sequence O_{k+1} \ldots O_L given that the process is in state s_i at time t_k:

\beta_k(i) = P(O_{k+1} \ldots O_L \mid S_k = s_i, \lambda).   (6.39)

\beta_k(i) is computed backwards, starting from time t_L:

\beta_L(i) = 1
\beta_k(i) = \sum_{j=1}^{N} v_{ij}(d_{k+1}) \, b_{s_j}(O_{k+1}) \, \beta_{k+1}(j).   (6.40)

\gamma_k(i) denotes the probability that the stochastic process is in state s_i at the time the k-th observation occurs. It can be computed from \alpha_k(i) and \beta_k(i) following the same scheme as presented in Section 4.1.1:

\gamma_k(i) = \frac{\alpha_k(i) \, \beta_k(i)}{\sum_{i=1}^{N} \alpha_k(i) \, \beta_k(i)}.   (6.41)

\xi_k(i, j) is the probability that the stochastic process is in state s_i at time t_k and in state s_j at time t_{k+1}:

\xi_k(i, j) = P(S_k = s_i, S_{k+1} = s_j \mid o, \lambda)   (6.42)

\xi_k(i, j) = \frac{\alpha_k(i) \, v_{ij}(d_{k+1}) \, b_{s_j}(O_{k+1}) \, \beta_{k+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_k(i) \, v_{ij}(d_{k+1}) \, b_{s_j}(O_{k+1}) \, \beta_{k+1}(j)}.   (6.43)

As was the case for standard HMMs, the expected number of transitions from state s_i to state s_j is the sum over time

\sum_{k=0}^{L-1} \xi_k(i, j).
(6.44)

6.3.2 Reestimation Formulas

As described in Section 4.1.2, the most common training procedure for standard HMMs is the Baum-Welch algorithm, an iterative procedure. As for standard HMMs, the "expectation" step comprises the computation of \alpha, \beta, and subsequently \gamma and \xi. Then the "maximization" step is performed, in which the model parameters are adjusted using the values computed in the expectation step. This section provides the formulas for the maximization step, which are derived in the course of the proof of convergence in Section 6.5.

Initial probabilities \pi.

\bar{\pi}_i \equiv \frac{\text{expected number of sequences starting in state } s_i}{\text{total number of sequences}} \equiv \gamma_0(i).   (6.45)

Emission probabilities b_{s_i}(o_j).

\bar{b}_i(o_j) \equiv \frac{\text{expected number of times observing } o_j \text{ in state } s_i}{\text{expected number of times in state } s_i} \equiv \frac{\sum_{k=0, \, O_k = o_j}^{L} \gamma_k(i)}{\sum_{k=0}^{L} \gamma_k(i)}.   (6.46)

Except for a different notation of time, these formulas are the same as for standard HMMs.

Transition parameters. Since the stochastic process underlying the state traversals is changed from a discrete-time Markov chain for standard HMMs to a semi-Markov process in the case of HSMMs, maximization of the transition parameters differs considerably from standard HMMs. The key difficulty is that the parameters of all outgoing transitions of s_i occur in two places: once for the transition s_i \to s_j, j \neq i, and once in the computation of the probability that the process has stayed in state s_i. This can be seen from the definition of v_{ij}(d) (see Equation 6.13), which is repeated here in extended form for convenience:

v_{ij}(d_k) = \begin{cases} p_{ij} \, d_{ij}(d_k) & \text{if } j \neq i \\ 1 - \sum_{h \neq i} p_{ih} \, d_{ih}(d_k) & \text{if } j = i. \end{cases}   (6.47)

The fact that p_{ij} occurs in both cases of the equation prohibits applying formulas similar to those for standard HMMs.
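The closed-form updates of Equations 6.45 and 6.46 can be sketched as follows for a single training sequence; the gamma values are assumed to be precomputed via Equation 6.41, and all numbers are illustrative:

```python
def reestimate_pi_B(gamma, obs, M):
    """Maximization step for pi and B (Eqs. 6.45-6.46) from precomputed
    state probabilities gamma[k][i]; a sketch for one training sequence."""
    N = len(gamma[0])
    pi_new = list(gamma[0])                       # Eq. 6.45: gamma_0(i)
    num = [[0.0] * M for _ in range(N)]
    den = [0.0] * N
    for k, o in enumerate(obs):
        for i in range(N):
            num[i][o] += gamma[k][i]              # expected count of o_j in s_i
            den[i] += gamma[k][i]                 # expected time spent in s_i
    B_new = [[num[i][j] / den[i] for j in range(M)] for i in range(N)]
    return pi_new, B_new

# illustrative gamma values for a 3-observation sequence over 2 states
gamma = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
pi_new, B_new = reestimate_pi_B(gamma, obs=[0, 1, 0], M=2)
```

By construction, each row of the reestimated B sums to one, as required for a proper emission distribution.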
Because of this coupling, a gradient-based iterative optimization is used instead to maximize the likelihood of the training sequence with respect to the transition parameters, which are specifically:

• the limiting transition probabilities p_{ij}
• the kernel parameters \theta_{ij,r} of each transition duration d_{ij}(d_k) (c.f., Equation 6.8)
• the kernel weights w_{ij,r} of each transition duration d_{ij}(d_k) (c.f., Equation 6.9)

As derived in detail in Section 6.5, optimization is performed for each state s_i by maximizing the objective function Q_{v_i}:

Q_{v_i} = \sum_k \left[ \sum_{j \neq i} \xi_k(i, j) \log\big( p_{ij} \, d_{ij}(d_k) \big) + \xi_k(i, i) \log\Big( 1 - \sum_{h \neq i} p_{ih} \, d_{ih}(d_k) \Big) \right].   (6.48)

The gradient comprises the partial derivatives of the objective function with respect to the HSMM transition parameters p_{ij}, w_{ij,r}, \theta_{ij,r}. The derivative with respect to p_{ij} is given by:

\frac{\partial Q_{v_i}}{\partial p_{ij}} = \sum_k \left[ \xi_k(i, j) \frac{1}{p_{ij}} - \xi_k(i, i) \frac{d_{ij}(d_k)}{1 - \sum_{h \neq i} p_{ih} \, d_{ih}(d_k)} \right]; \quad i \neq j.   (6.49)

For i = j, the derivative equals zero. The derivative of Q_{v_i} with respect to w_{ij,r} can be computed as follows:

\frac{\partial Q_{v_i}}{\partial w_{ij,r}} = \frac{\partial Q_{v_i}}{\partial d_{ij}(d_k)} \, \frac{\partial d_{ij}(d_k)}{\partial w_{ij,r}},   (6.50)

where

\frac{\partial d_{ij}(d_k)}{\partial w_{ij,r}} = \kappa_{ij,r}(d_k \mid \theta_{ij,r})   (6.51)

and

\frac{\partial Q_{v_i}}{\partial d_{ij}(d_k)} = \sum_k \left[ \xi_k(i, j) \frac{1}{d_{ij}(d_k)} - \xi_k(i, i) \frac{p_{ij}}{1 - \sum_{h \neq i} p_{ih} \, d_{ih}(d_k)} \right]; \quad i \neq j.   (6.52)

Again, for i = j the derivative equals zero. The derivative of Q_{v_i} with respect to \theta_{ij,r} is determined by:

\frac{\partial Q_{v_i}}{\partial \theta_{ij,r}} = \frac{\partial Q_{v_i}}{\partial d_{ij}(d_k)} \, \frac{\partial d_{ij}(d_k)}{\partial \kappa_{ij,r}} \, \frac{\partial \kappa_{ij,r}}{\partial \theta_{ij,r}},   (6.53)

where \partial Q_{v_i} / \partial d_{ij}(d_k) is given by Equation 6.52,

\frac{\partial d_{ij}(d_k)}{\partial \kappa_{ij,r}} = w_{ij,r},   (6.54)

and \partial \kappa_{ij,r} / \partial \theta_{ij,r} depends on the type of probability distribution. For example, if an exponential distribution is used, the derivative is given by:

\frac{\partial \kappa_{ij,r}}{\partial \theta_{ij,r}} = \frac{\partial}{\partial \lambda_{ij,r}} \left( 1 - e^{-\lambda_{ij,r} d_k} \right) = d_k \, e^{-\lambda_{ij,r} d_k}.   (6.55)

Gradient-based optimization techniques are usually iterative, using a search direction s^{(n)} that is at least in part based on the gradient. The algorithms perform an update of length \eta in the direction of s^{(n)}.
Various techniques exist to estimate \eta, including line search and the Goldstein-Armijo rule (see, e.g., Dennis & Moré [76]). The next point of evaluation in the parameter space \lambda is determined by:

\lambda^{(n+1)} = \lambda^{(n)} + \eta \, s^{(n)}.   (6.56)

The search direction s^{(n)} is given by:

s^{(n)} = \nabla Q_{v_i} \big|_{\lambda^{(n)}},   (6.57)

where \nabla Q_{v_i}|_{\lambda^{(n)}} denotes the gradient vector of Q_{v_i} with respect to the parameters, evaluated at the point \lambda^{(n)}. A slight modification is used by conjugate gradient approaches (Hestenes & Stiefel [119]), where the next search direction is obtained by:

s^{(n)} = \nabla Q_{v_i} \big|_{\lambda^{(n)}} + \zeta \, s^{(n-1)},   (6.58)

where \zeta is a scalar that can be computed from the gradient.^3

Several equality constraints apply to the optimization problem for a fixed state s_i, such as:

1. \sum_j p_{ij} \stackrel{!}{=} 1   (6.59)
2. \forall j: \; \sum_r w_{ij,r} \stackrel{!}{=} 1,   (6.60)

which results in a restricted search space for the gradient method. Equality constraints of this form can be incorporated by projecting the search direction onto the hyperplane defined by the constraints. The following example explains this procedure. Assume that state s_i has J outgoing transitions, and each duration distribution d_{ij}(t) consists of exactly one Gaussian distribution with two parameters \mu_{ij} and \sigma_{ij}. Then the gradient vector \nabla Q_{v_i} has 3J components.^4 The constraint on the limiting transition probabilities p_{ij} defines the hyperplane given by

\sum_j p_{ij} - 1 = 0.   (6.61)

Let M denote the 3J \times J matrix^5 of orthonormal base vectors of the hyperplane, translated to cross the origin of the parameter space. The new gradient vector (\nabla Q_{v_i})', which obeys the equality constraints, is obtained by projecting onto the hyperplane by matrix multiplication:

(\nabla Q_{v_i})' = (M M^T) \, \nabla Q_{v_i}.   (6.62)

If several equality constraints apply to the optimization problem, M is the matrix of orthonormal base vectors of the intersection of all hyperplanes induced by the constraints.

^3 See, e.g., Shewchuk [239]
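For a single sum-to-one constraint, the projection of Equation 6.62 has a particularly simple form that can be sketched without constructing M explicitly: multiplying by M M^T amounts to removing the gradient component along the all-ones normal vector, i.e., subtracting the mean. This equivalence is an assumption of the sketch, not a statement from the thesis:

```python
def project_gradient(grad):
    """Project a gradient onto the hyperplane sum(x) = const (cf. Eq. 6.62).
    For this single constraint, M M^T x equals x minus its component along
    the all-ones normal, i.e. x minus its mean (a sketch)."""
    m = sum(grad) / len(grad)
    return [g - m for g in grad]

raw = [0.2, -0.5, 0.9]
proj = project_gradient(raw)
```

A step of any length along the projected gradient leaves the sum of the p_{ij} unchanged, and the projection is idempotent, as expected of M M^T.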
Moreover, the equality constraints are also obeyed by conjugate gradient approaches, since both (\nabla Q_{v_i})' and s^{(n)} lie within the constraint hyperplanes, and hence a linear combination of the two vectors also results in a search direction within the hyperplane.

The variables also have to satisfy inequality constraints. For example, the probabilities p_{ij} can only take values within the interval [0, 1]. In order to account for this, \eta must be restricted such that the optimization algorithm cannot leave the space of feasible parameter values. This can be achieved by checking whether \lambda^{(n+1)} is in the range of feasible values. If not, \eta must be made smaller, which can either be done by computing the intersection of the line \lambda^{(n)} + a \, s^{(n)} with the bordering hyperplane^6 or by other heuristics.

6.3.3 A Summary of the Training Algorithm

The goal of the training procedure is to adjust the model parameters \lambda such that the likelihood of a given training sequence o is maximized. The training algorithm only affects \pi, B, P, and D(t), but not the structure of the HSMM. The structure consists of

• the set of states S = \{s_1, \ldots, s_N\},
• the set of symbols O = \{o_1, \ldots, o_M\}, which is also called the alphabet,
• the topology of the model. It defines which of the N states can be initial states, which of the potentially N \times N transitions can be traversed by the stochastic process, and which of the potentially N \times M emissions are available in each state. Technically, a transition s_i \to s_j is "removed" by setting p_{ij} = 0. The same holds for the initial state distribution \pi and the emission probabilities: if b_{s_i}(o_k) is set to zero, state s_i cannot generate observation symbol o_k. Since the training algorithm can never assign a non-zero value to probabilities that are initialized to zero, it does not change the structure of the HSMM.
• the specification of the transition durations D(t). This includes the number R and the types of kernels \kappa_{ij,r} for each existing transition.
The structure may also comprise the specification of additional parameters that are not adjusted by the training procedure, such as upper and lower bounds for uniform background distributions, which need to be set up before training starts.

^4 Since there is only one duration distribution, no weights w_{ij,r} are needed. Hence each of the J outgoing transitions is determined by \mu_{ij}, \sigma_{ij}, and p_{ij}.
^5 Equation 6.61 defines a J dimensional hyperplane in 3J-dimensional space.
^6 E.g., defined by p_{ij} = 0

Having specified the model structure, the training algorithm performs the steps shown in Figure 6.4 in order to adjust the parameters \lambda such that the sequence likelihood P(o \mid \lambda) reaches at least a local maximum.

Some notes on the training procedure: Gradient-based maximization within an EM algorithm has been used to train standard HMMs, e.g., in Wilson & Bobick [279]. Such an approach is called a Generalized Expectation Maximization algorithm. If a conjugate gradient approach is applied, the resulting HMM learning algorithm is called Expectation Conjugate Gradient (ECG). Under certain conditions, ECG performs even better than the original Baum-Welch algorithm (Salakhutdinov et al. [222]), but the computational complexity is increased. However, complexity can be limited:

• The number of parameters that have to be estimated depends heavily on the number of outgoing transitions (J). These, in turn, depend on the topology of the model: if, for example, the topology is a simple chain, then each state, except the last one, has only one outgoing transition. In the case of an ergodic topology, where every state is connected to every other, the number equals N − 1.

• The kernel weights w_{ij,r} do not necessarily need to be optimized. If the number of parameters is too large, the weights of the convex combination can simply be fixed, which reduces the number of parameters by J \times \bar{R}, where \bar{R} denotes the average number of kernels per outgoing transition.
• The number of kernels may be reduced when some duration background distribution is used. If specified a priori, background distributions do not increase the size of the parameter vector.

• As is shown in Section 6.5, the overall EM algorithm also converges if the sequence likelihood is only increased by a sufficiently large amount. Therefore, the gradient-based optimization can be stopped after a few iterations.

• Since the optimization algorithm is based on the gradient, only cumulative distributions d_{ij}(d_k) can be used for which the derivatives with respect to their parameters are available. However, this is the case for many widespread distributions. See Appendix V for some examples.

The preceding notes have mainly addressed the embedded gradient-based optimization of Q_{v_i}. Regarding the entire training procedure, the following notes should be kept in mind when applying HSMMs as a modeling technique:

• Equivalently to standard HMMs, the scaling factors c_k used to scale \alpha_k(i) (c.f., Equation 6.18) can also be used to scale \beta_k. \xi and \gamma can then be computed on the basis of the scaled \alpha' and \beta'.

• It has been shown that for large models, the results and the speed of convergence can be improved if prior knowledge is incorporated into the parameter initialization. For example, the lengths of the failure sequences used for training of one model show a certain distribution with respect to the number of observations. This can be exploited

1. Initialize the model by assigning values to \pi, B, and G(t) for all entries that exist in the structure. This constitutes \lambda_{old}.

2. Compute \alpha_k(i) by Equation 6.16, \beta_k(i) by Equation 6.40, \gamma_k(i) by Equation 6.41, and \xi_k(i, j) by Equation 6.43, using \lambda_{old} and the observation sequence o.

3. Compute the sequence likelihood P(o \mid \lambda_{old}) by Equation 6.17.

4. Adjust \pi by Equation 6.45 and B by Equation 6.46, resulting in \lambda_{new}^{\pi} and \lambda_{new}^{B}.

5. Reestimate the parameters of G by the embedded optimization algorithm.
For each state si , perform: (a) Compute the gradient vector g (n) of Qvi with respect to the parameters of (n) G at λGi , which is either initialized by λold Gi or obtained from a previous iteration g (n) = ∂ Qvi ∂ Qvi ∂ Qvi = , ..., , ..., , ... ∂ pij ∂ wij,r ∂ κij,r " (∇Qvi ) (n) λG i # , (n) λG i where Qvi is given by Equation 6.48 (b) Project the gradient onto the hyperplane of feasible solutions for equality equations. This is achieved by matrix multiplication: g 0(n) = (M M T ) g (n) , where M denotes the matrix of orthonormal base vectors for hyperplanes defined by equality constraints such as that the sum of probabilities should equal one. (c) Determine a search direction s(n) from g 0(n) and eventually s(n−1) and a step size η, e.g., by line search. Assure that the search vector does not cross the boundaries induced by inequality equations such as the condition that probabilities must lie between [0, 1]. The next point in search space is obtained by: (n+1) λGi (n) = λGi + η s(n) . (d) Repeat from Step 5a until step size is less than some bound or a maximum number of steps is reached. The result constitutes to λnew Gi 6. Set λold := λnew and repeat Steps 2 to 6 until the difference in observation sequence likelihood P (o | λnew ) − P (o | λold ) is less than some bound. Figure 6.4: Summary of the complete training algorithm for HSMMs. 112 6. The Model to come up with a better guess for initial probabilities π. Additionally, initialization of observation probabilities can be improved by taking the prior distribution of symbols into account. Other techniques first apply the Viterbi algorithm to come up with an initial assignment of states to observations, as described, e.g., in Juang & Rabiner [138]. Similar techniques can also be used to obtain an initial guess for transition durations. • It has also been shown that results can be improved by setting all observation probabilities bi (ok ) to zero that are less than some threshold (Rabiner [210]). 
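Steps 5a to 5d of Figure 6.4 can be illustrated on a toy instance: maximizing the concave function Q(p) = Σ_j c_j log p_j over probabilities subject to Σ_j p_j = 1 by projected gradient ascent. For this particular constraint, multiplying by M M^T amounts to subtracting the mean of the gradient components; the coefficients, step size, and iteration count below are illustrative choices, not values from the thesis:

```python
def project(g):
    # Projection onto {x : sum(x) = 0}; for this hyperplane M M^T = I - (1/J) 11^T,
    # i.e., subtracting the mean of the gradient components.
    m = sum(g) / len(g)
    return [x - m for x in g]

def maximize(c, eta=0.01, steps=2000):
    """Maximize Q(p) = sum_j c_j log p_j subject to sum_j p_j = 1
    by projected gradient ascent (a toy stand-in for Steps 5a-5d)."""
    p = [1.0 / len(c)] * len(c)
    for _ in range(steps):
        g = project([cj / pj for cj, pj in zip(c, p)])   # Steps 5a + 5b
        step = eta
        # Step 5c: shrink the step so that no probability leaves (0, 1)
        while any(not 0.0 < pj + step * gj < 1.0 for pj, gj in zip(p, g)):
            step /= 2.0
        p = [pj + step * gj for pj, gj in zip(p, g)]
    return p

c = [2.0, 1.0, 1.0]                      # illustrative coefficients
p = maximize(c)
expected = [cj / sum(c) for cj in c]     # analytic optimum: p_j = c_j / sum(c)
assert all(abs(a - b) < 1e-4 for a, b in zip(p, expected))
```

Because every update direction sums to zero, the iterates never leave the constraint hyperplane, which is precisely the purpose of the projection in Step 5b.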
• The training procedure improves the model parameters until some local maximum is reached, which can be significantly lower than the global maximum. Therefore, in this thesis the training procedure is performed several times with different (random) parameter initializations. Other approaches are discussed in the outlook (Chapter 12).
• Gradient-based optimization could be applied to the sequence likelihood directly (and not to the Q-function, which is a lower bound for the likelihood). However, first, the dimensionality of the optimization parameter space would be dramatically increased, and second, the efficiency gained from solving parts of the optimization problem analytically would be lost.
• The training procedure described has only considered one single training sequence. The extension to multiple sequences is similar to that for standard HMMs, with the slight difference that the gradient takes all training sequences into account. However, the update vectors λ^(n) → λ^(n+1) for single sequences can be linearly combined, exploiting the fact that the log-likelihood of multiple sequences is the sum of the single-sequence log-likelihoods.
• Background distributions for observation probabilities B can be applied to HSMMs in the same way as to standard HMMs. They are frequently used to circumvent one of the major drawbacks of the Baum-Welch algorithm: Observation probabilities b_si(o_j) are computed from the number of occurrences of observation o_j (c.f., Equation 6.46). If one specific symbol o_c has not occurred in the training data, b_si(o_c) is set to zero for all states s_i in the first iteration of the training algorithm. Hence, in the forward algorithm, any observation sequence containing o_c is assigned a sequence likelihood of zero (c.f., Equation 6.16). Background probabilities remedy this problem by substituting b_si(o_j) with b′_si(o_j) > 0 as defined in the following. Let P_b(o_j) denote a discrete probability distribution over all observation symbols.
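The resulting substitution, a convex combination of the original observation probabilities and P_b (Equation 6.63), can be sketched as a small example; the probability values below are illustrative:

```python
def smooth(b_row, background, rho):
    """Convex combination b' = rho * P_b + (1 - rho) * b (Equation 6.63)."""
    return [rho * pb + (1.0 - rho) * b for b, pb in zip(b_row, background)]

# Illustrative numbers: a state that never emitted the third symbol in training ...
b_s1 = [0.6, 0.4, 0.0]
# ... and a background distribution P_b estimated over all symbols.
p_b = [0.5, 0.3, 0.2]

b_prime = smooth(b_s1, p_b, rho=0.05)
assert abs(sum(b_prime) - 1.0) < 1e-12   # still a probability distribution
assert all(b > 0.0 for b in b_prime)     # no symbol has zero probability anymore
```

Since both inputs are probability distributions and the weights ρ and 1 − ρ sum to one, the smoothed row remains a valid distribution while removing all zero entries.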
The observation probabilities of a hidden Markov model then become a convex combination of the original observation probabilities b_si(o_j) and P_b(o_j):

$$ b'_{ij} = b'_{s_i}(o_j) = \rho_i\, P_b(o_j) + (1 - \rho_i)\, b_{s_i}(o_j); \qquad 0 \le \rho_i \le 1 , \qquad (6.63) $$

where ρ_i is a state-dependent weighting factor.

6.4 Difference Between the Approach and other HSMMs

The term “hidden semi-Markov model” has been used for various models, since “semi” simply indicates that a model employs some probability distribution for the representation of time. Due to the fact that these models have been developed in the area of speech recognition and signal processing, almost all of them assume the input data to be an equidistant time series, which leads to the simplification that a minimum time step exists and durations can be handled as multiples of that time step.

Speech recognition. In order to better explain the differences between the approach presented here and previously published work, the task of assigning phonemes7 to a speech signal is taken as an example. A plethora of work exists on this topic8 introducing various methods and techniques to improve speech recognition quality; however, the focus here is on duration modeling and only the basic principles are explained. The process of phoneme recognition is sketched in Figure 6.5.

Figure 6.5: A simplified sketch of phoneme assignment to a speech signal.

Starting from the top of the figure, the analog sound signal is sampled and converted into a digital signal. Portions of the sampled signal are then analyzed in order to extract features of the signal. Feature extraction involves, e.g., a short-time Fourier transform and various other computations.
Since in this thesis only discrete emissions are considered, assume that the result of feature extraction is one symbol out of a discrete set, denoted by “A” and “B”.9 Subsequently, the sequence of features is analyzed by several HMMs: Each HMM models one phoneme, and sequence likelihood is computed for each HMM using the forward or Viterbi algorithm. In order to assign a phoneme to the sequence of features, some classification is performed.

7 A phoneme is the smallest unit of speech that distinguishes meaning.
8 For an overview, see, e.g., Cole et al. [62].
9 Usually, it is a feature vector containing both discrete and continuous values.

As has been pointed out by several authors (see, e.g., Russell & Cook [218]), the quality of the assignment can be improved by introducing the notion of state duration: Rather than traversing to the next state each time an observation symbol (i.e., a feature) occurs, the stochastic process may reside in one state for a certain time, generating several subsequent observation symbols before traversing to the next state.

Figure 6.6: Assigning states s_i to observations (A or B). (a) shows the case where a state transition takes place each time an observation symbol occurs. If state durations are introduced, the process may reside in one state accounting for several subsequent observations. However, several state sequences are possible, of which a few are shown in (b)-(d).

Figure 6.6 (a) shows the case where the occurrence of each feature symbol corresponds to a state transition. Introducing the notion of state duration, the process of state transitions is decoupled from the occurrence of observation symbols. However, this flexibility results in several potential state sequences, as can be seen from Figures 6.6 (b) to (d). Considering all potential state sequences increases the complexity of computing the sequence likelihood, since all possible state paths have to be summed up.
To be precise, the number of potential paths increases from N^L, where N denotes the number of states and L the length of the sequence (c.f., Equation 4.7 on Page 58), to

$$ \sum_{k=0}^{L-1} \binom{L-1}{k}\, N\, (N-1)^k , \qquad (6.64) $$

where k is the number of state transitions that take place.10 The major drawback of this is that a dynamic programming approach such as the forward algorithm cannot be applied. This is due to the fact that the Markov assumptions do not hold: the condition that all the information needed to compute α_t(j) is included in the α's of the previous time step is not fulfilled for variable state durations. Concrete models that were used in speech recognition have typically applied one restriction in order to come up with a feasible algorithm: They included an upper bound for state durations (denoted by D). This leads to the following forward-like algorithm (see, e.g., Mitchell & Jamieson [183]):

$$ \alpha_t(j) = \sum_{i=1}^{N} \sum_{\tau=1}^{\min(D,t)} \alpha_{t-\tau}(i)\; a_{ij}\; d_j(\tau) \prod_{m=0}^{\tau-1} b_{s_j}(O_{t-m}) , \qquad (6.65) $$

where α_t(j) denotes the probability of the observation sequence over all state sequences for which state s_j ends at time t. The algorithm includes an additional sum over τ, which is the duration for which the process stays in state s_j, and d_j(τ) specifies the probability distribution of the duration. The product over b_sj(·) results from the fact that during its stay, state s_j has to produce all the emission symbols O_{t−τ+1} . . . O_t. Similar to the standard forward algorithm, the approach can be visualized by a trellis structure, as shown in Figure 6.7.

Figure 6.7: The trellis structure for the forward algorithm with duration modeling. A maximum duration of D = 2 is used. Thick lines highlight terms involved in the computation of α_3(1).

As can be seen from the figure, the major drawback of the algorithm is its computational complexity: according to Ramesh & Wilpon [211], it increases by a factor of D²/2.

10 It is assumed that x^0 = 1.
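The duration-augmented recursion of Equation 6.65 can be sketched directly. A useful sanity check is that for D = 1 and d_j(1) = 1 it must reduce to the standard forward algorithm; the two-state model below is an illustrative toy, not taken from the thesis:

```python
def forward_duration(pi, A, B, d, obs, D):
    """Forward algorithm with explicit state durations (Equation 6.65).
    d[j][tau] is the probability of staying in state j for tau steps (tau >= 1).
    Initialization is simplified for this toy example."""
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    for j in range(N):
        alpha[0][j] = pi[j] * B[j][obs[0]]
    for t in range(1, T):
        for j in range(N):
            total = 0.0
            for i in range(N):
                for tau in range(1, min(D, t) + 1):
                    emit = 1.0
                    for m in range(tau):          # state j emits O_{t-tau+1} ... O_t
                        emit *= B[j][obs[t - m]]
                    total += alpha[t - tau][i] * A[i][j] * d[j][tau] * emit
            alpha[t][j] = total
    return alpha[-1]

# Toy two-state model; with D = 1 and d_j(1) = 1 the recursion must reduce
# to the standard forward algorithm.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
d = [{1: 1.0}, {1: 1.0}]
obs = [0, 1, 1, 0]

a_dur = forward_duration(pi, A, B, d, obs, D=1)

# Standard forward recursion for comparison
a_std = [pi[j] * B[j][obs[0]] for j in range(2)]
for t in range(1, len(obs)):
    a_std = [sum(a_std[i] * A[i][j] for i in range(2)) * B[j][obs[t]] for j in range(2)]

assert all(abs(x - y) < 1e-12 for x, y in zip(a_dur, a_std))
```

The nested loop over τ is exactly the source of the additional D²-order cost discussed above.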
Various modifications to this approach have been proposed, of which the major categories have been described in Chapter 4.

Temporal sequences. The essential difference between speech recognition and temporal sequence processing is that symbols occur equidistantly in the first case, which does not apply to the latter. The periodicity in speech recognition is caused by the underlying sampling of an analog signal, whereas in temporal sequences such as error sequences, the occurrence of symbols is event-triggered. This difference leads to the following conclusions:

• Using discrete time steps of fixed size is appropriate for speech signals but not for temporal sequences, due to the reasons given in the discussion of time-slotting (c.f. Section 4.2.1).
• In event-driven temporal sequences, temporal variability is already included in the observation sequence itself. Therefore, a tight relation between hidden state transitions and the occurrence of observation symbols can be assumed. Specifically, the model presented in this thesis assumes a one-to-one mapping.
• The one-to-one mapping between state transitions and observation symbol occurrence has two advantages:
1. It enforces the Markov assumption, which leads to an efficient forward algorithm that is very similar to the standard forward algorithm of discrete-time HMMs. Specifically, the sum over durations τ in Equation 6.65 is avoided, so that the algorithm belongs to the same complexity class as the standard forward algorithm, as is shown in Section 6.7.
2. Durations can be assigned to transitions rather than to states, which increases modeling flexibility and expressiveness. Obviously, state durations are a special case of transition durations.11

Considering Equation 6.13, the approach is related to inhomogeneous HMMs (IHMMs). However, the process must still be called homogeneous, since the probabilities v_ij(d) stay the same regardless of the time when the transition takes place, i.e.
at the beginning or at the end of the sequence. Furthermore, in contrast to IHMMs, continuous rather than discrete duration distributions are used in this thesis.

11 In this case, all outgoing transitions have the same duration distribution.

6.5 Proving Convergence of the Training Algorithm

The objective of the training procedure is to find a set of parameters λ_opt that maximizes the sequence likelihood of the training data:

$$ \lambda_{opt} = \arg\max_{\lambda} P(o \mid \lambda) . \qquad (6.66) $$

The training procedure described here is an Expectation-Maximization (EM) algorithm (Dempster et al. [75]). It improves the sequence likelihood until at least some local maximum is reached. The algorithm is closely related to the Baum-Welch algorithm, whose convergence was originally proven by Baum & Sell [25] without the framework of EM algorithms. However, the framework of EM algorithms provides a view on the problem that allows for simpler proofs. This approach is adapted here to prove convergence of the presented training algorithm. In the following, first a general proof of convergence for EM algorithms by Minka [181] is presented, which is subsequently adapted to the specifics of HSMMs.

6.5.1 A Proof of Convergence Framework

EM algorithms are maximum-a-posteriori (MAP) estimators and hence rely on the presence of some data that has been observed, which in this case refers to the observation sequence o forming the dataset O. The goal is to maximize the data likelihood P(o|λ). The potential and wide range of application of EM algorithms stem from two properties:

1. EM algorithms build on lower bound optimization (Minka [181]). Instead of optimizing a complex objective function directly, some simpler lower bound is optimized.
2. EM algorithms can handle incomplete / unobservable data.

Lower bound optimization.
In lower bound optimization, which is also called the primal-dual method (Bazaraa & Shetty [26]), a computationally intractable objective function is optimized by repeated maximization of some lower bound that is easier to compute. More specifically, if o(λ) denotes the objective function, a simpler lower bound b(λ) that equals o(λ) at the current estimate of λ is maximized (see Figure 6.8). Maximization of b(λ) yields a new estimate for λ, for which the objective o(λ) is increased (except for the case when the derivative of the objective equals zero, i.e., a local optimum has been reached). If the objective function is continuous and bounded, as is the case for HSMMs, iteratively increasing the lower bound converges to at least a local optimum of the objective function.

Figure 6.8: Lower bound optimization. Starting from the current estimate of parameter λ, a lower bound b(λ) to the objective function o(λ) is determined that is easier to maximize than the objective function. If the lower bound equals the objective function at the current estimate of λ, maximization of the lower bound leads to a new estimate of λ for which the value of the objective function is increased. Performing this procedure iteratively yields at least a local maximum of the objective function.

From this, the following iterative optimization scheme can be derived:

1. Determine a lower bound simpler than the objective function that equals the objective function at the current estimate of parameter λ.
2. Determine the maximum of the lower bound, yielding the next estimate of λ.
3. Repeat until the increase of the objective function is below some threshold.

Compared to this, gradient-based optimization approaches approximate the objective function by its tangent at the current estimate of λ and move along that line for some distance to obtain the new estimate.

Handling of unobservable data.
Unobservable data describes the situation where some quantity used in modeling cannot be observed by measurements. In the case of HMMs and their variants, this refers to the fact that the sequence of hidden states s, which the stochastic process has traversed, cannot be observed. Analogously to O, let S = {s} denote the set of state sequences s of the training data set. Two data sets must be distinguished: the complete dataset Z = (O, S) includes both observed and unknown data, while the incomplete dataset only consists of observed data. The objective is to optimize the data likelihood of the (observable) incomplete dataset, P(o|λ). EM algorithms deal with this problem by taking the incomplete data likelihood to be the marginal of the complete data likelihood. Hence,

$$ P(o \mid \lambda) = \int_s P(o, s \mid \lambda)\, ds . \qquad (6.67) $$

The Q-Function. In order to determine a lower bound to the data likelihood, Jensen's inequality [133] can be used:

$$ \sum_j g(j)\, a_j \;\ge\; \prod_j g(j)^{a_j}; \qquad a_j \ge 0,\quad \sum_j a_j = 1,\quad g(j) \ge 0 , \qquad (6.68) $$

stating that the arithmetic mean is greater than or equal to the geometric mean. Application to Equation 6.67 requires extension by some arbitrary function q(s) as follows (see Minka [181]):

$$ P(o \mid \lambda) = \int_s \frac{P(o, s \mid \lambda)}{q(s)}\, q(s)\, ds \;\ge\; \prod_s \left( \frac{P(o, s \mid \lambda)}{q(s)} \right)^{q(s)\, ds} = f\big(\lambda, q(s)\big) , \qquad (6.69) $$

where

$$ \int_s q(s)\, ds = 1 . \qquad (6.70) $$

f(λ, q(s)) is the lower bound and q(s) is some arbitrary probability density over s. The function q needs to be chosen such that f touches the objective function at the current estimate of the parameters λ_old (see Figure 6.8). It can be shown that setting

$$ q(s) = P(s \mid o, \lambda_{old}) \qquad (6.71) $$

fulfills this requirement (see Minka [181]). Maximization of the lower bound is performed by maximizing its logarithm. Taking the logarithm yields

$$ \log\big[ f(\lambda, q(s)) \big] = \int_s q(s) \log\big[ P(o, s \mid \lambda) \big]\, ds \;-\; \int_s q(s) \log\big[ q(s) \big]\, ds . \qquad (6.72) $$

Substituting Equation 6.71 into Equation 6.72 and dropping terms that do not depend on λ yields the so-called Q-function:

$$ Q(\lambda, \lambda_{old}) = \int_s \log\big[ P(o, s \mid \lambda) \big]\, P(s \mid o, \lambda_{old})\, ds , \qquad (6.73) $$

which in fact is the expected value, over the unknown data s, of the log-likelihood of the complete data set. Since the likelihood of the complete data set is in many cases easier to optimize than that of the incomplete data set, EM algorithms can solve more complex optimization problems.

EM algorithms. With the notation just developed, the procedure of EM algorithms can be refined as follows:

• E-step: Compute the Q-function based on the parameters λ_old obtained from initialization or the previous M-step.
• M-step: Compute the next estimate for λ by maximizing the Q-function:

$$ \lambda_{new} = \arg\max_{\lambda} Q(\lambda, \lambda_{old}) . \qquad (6.74) $$

• Repeat until the increase in data likelihood P(o|λ) is less than some threshold.

Convergence of the procedure is guaranteed, since 0 ≤ Q(λ, λ_old) ≤ P(o|λ) ≤ 1 holds for the objective function and the lower bound Q does not decrease in any iteration. In the case of HMMs, a local maximum of Q is usually found by taking partial derivatives of the Q-function, solving the equation

$$ \frac{\partial Q}{\partial \lambda} \overset{!}{=} 0 , \qquad (6.75) $$

and using Lagrange multipliers to account for additional constraints on the parameters (e.g., the sum of outgoing probabilities has to equal one). Another way to optimize Q is to apply an iterative approximation technique. Even if the optimum of Q is not found exactly, the algorithm still converges if some new parameter values λ are found for which the lower bound is sufficiently greater than for λ_old. Such an approach is called a Generalized EM algorithm (e.g., Wilson & Bobick [279]).

6.5.2 The Proof for HSMMs

For HSMMs, the complete dataset Z = (O, S) consists of the observation sequence o and the sequence of hidden states s that the stochastic process has traversed.
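The E/M iteration and its monotonicity property can be illustrated on a minimal example with one hidden variable per observation: a mixture of two Bernoulli components with known parameters, where only the mixing weight is estimated. All names and numbers below are illustrative, not taken from the thesis:

```python
import math

P_A, P_B = 0.9, 0.2  # known component parameters (illustrative)

def bern(p, x):
    return p if x == 1 else 1.0 - p

def log_likelihood(w, data):
    return sum(math.log(w * bern(P_A, x) + (1 - w) * bern(P_B, x)) for x in data)

def em(data, w=0.5, iterations=20):
    history = [log_likelihood(w, data)]
    for _ in range(iterations):
        # E-step: posterior probability of the hidden component per observation
        resp = [w * bern(P_A, x) / (w * bern(P_A, x) + (1 - w) * bern(P_B, x))
                for x in data]
        # M-step: maximizing the Q-function here reduces to the mean responsibility
        w = sum(resp) / len(resp)
        history.append(log_likelihood(w, data))
    return w, history

data = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]
w, hist = em(data)
# EM property used in the convergence proof: the data log-likelihood
# never decreases from one iteration to the next.
assert all(b >= a - 1e-12 for a, b in zip(hist, hist[1:]))
```

The assertion at the end checks exactly the property the proof relies on: each E/M cycle leaves the incomplete-data likelihood unchanged or increased.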
If both the sequence of hidden states and the observation sequence are known, the (complete) data likelihood is computed by alternately multiplying state transition probabilities and observation probabilities along the path of states s:

$$ P(o, s \mid \lambda) = \pi_{s_0}\, b_{s_0}(O_0) \prod_{k=1}^{L} v_{s_{k-1} s_k}(d_k)\, b_{s_k}(O_k) \qquad (6.76) $$

$$ = \pi_{s_0} \prod_{k=0}^{L} b_{s_k}(O_k) \prod_{k=1}^{L} v_{s_{k-1} s_k}(d_k) \qquad (6.77) $$

and hence the Q-function is (c.f., Equation 6.73):

$$ Q(\lambda, \lambda_{old}) = \sum_{s \in S} \log\big[ P(o, s \mid \lambda) \big]\, P(s \mid o, \lambda_{old}) \qquad (6.78) $$
$$ = \sum_{s \in S} \log\big[ \pi_{s_0} \big]\, P(s \mid o, \lambda_{old}) \qquad (6.79) $$
$$ + \sum_{s \in S} \sum_{k=0}^{L} \log\big[ b_{s_k}(O_k) \big]\, P(s \mid o, \lambda_{old}) \qquad (6.80) $$
$$ + \sum_{s \in S} \sum_{k=1}^{L} \log\big[ v_{s_{k-1} s_k}(d_k) \big]\, P(s \mid o, \lambda_{old}) \qquad (6.81) $$
$$ = Q^{\pi}(\pi, \lambda_{old}) + Q^{b}(B, \lambda_{old}) + Q^{v}(G, \lambda_{old}) , \qquad (6.82) $$

where S denotes the set of all possible state sequences s. Some papers (e.g., Bilmes [29]) use P(s, o | λ_old) instead of P(s | o, λ_old). However, this difference does not matter, since

$$ P(s, o \mid \lambda_{old}) = P(s \mid o, \lambda_{old})\, P(o \mid \lambda_{old}) , \qquad (6.83) $$

and since P(o | λ_old) is independent of λ, it does not affect the arg max operator used to determine λ_new (c.f., Equation 6.74). The important feature of Equation 6.82 is that the terms Q^π, Q^b, and Q^v are independent of each other with respect to π, B, and G. Due to the partial derivatives involved in maximization, Q^π, Q^b, and Q^v can be maximized separately.

Maximizing Q^π. Q^π can be further simplified:

$$ Q^{\pi}(\pi, \lambda_{old}) = \sum_{s \in S} \log\big[ \pi_{s_0} \big]\, P(s \mid o, \lambda_{old}) = \sum_{i=1}^{N} \log\big[ \pi_i \big]\, P(S_0 = s_i \mid o, \lambda_{old}) , \qquad (6.84) $$

since for each s ∈ S, only the first state s_0 is of importance. The second term on the right of the equation, P(S_0 = s_i | o, λ_old), subsumes all state sequences starting with state s_i, and hence the sum over all state sequences s can be turned into a sum over all states. In order to determine λ_opt with respect to π, the following constrained maximization problem has to be solved:

$$ \pi^{opt} = \underset{\pi_i;\ i=1,\ldots,N}{\arg\max}\ Q^{\pi}(\pi, \lambda_{old}); \qquad \text{s.t.} \quad \sum_{i=1}^{N} \pi_i = 1 . \qquad (6.85) $$

This can be accomplished by a Lagrange multiplier ϕ.
Note that the derivative is taken with respect to one specific π_i out of the sum over all π_i:

$$ \frac{\partial}{\partial \pi_i} \left( \sum_{i=1}^{N} \log\big[ \pi_i \big]\, P(S_0 = s_i \mid o, \lambda_{old}) - \varphi \left( \sum_{i=1}^{N} \pi_i - 1 \right) \right) \overset{!}{=} 0 \qquad (6.86) $$

$$ \Leftrightarrow \quad \frac{1}{\pi_i}\, P(S_0 = s_i \mid o, \lambda_{old}) - \varphi = 0 . \qquad (6.87) $$

The Lagrange multiplier ϕ can be determined by substituting Equation 6.87 into the side condition:

$$ \sum_{i=1}^{N} \frac{P(S_0 = s_i \mid o, \lambda_{old})}{\varphi} \overset{!}{=} 1 \qquad (6.88) $$
$$ \Leftrightarrow \quad \varphi = \sum_{i=1}^{N} P(S_0 = s_i \mid o, \lambda_{old}) \qquad (6.89) $$
$$ \Leftrightarrow \quad \varphi = 1 , \qquad (6.90) $$

since it is certain that the stochastic process is in some state at the beginning of the sequence. Using this result, Equation 6.87 can be solved to obtain the reestimation formula given in Equation 6.45:

$$ \pi_i = P(S_0 = s_i \mid o, \lambda_{old}) = \gamma_0(i) , \qquad (6.91) $$

as can be seen from the definition of γ_t(i) (c.f., Equation 4.14 on Page 59).

Maximizing Q^b. In order to maximize the second term of the Q-function, it is simplified first. The “row-wise” collection along state sequences s is exchanged for a “column-wise” collection for each time step k. Therefore, P(s | o, λ_old) is exchanged for P(S_k = s_i | o, λ_old) and the sums are adapted accordingly:

$$ Q^{b}(B, \lambda_{old}) = \sum_{s \in S} \sum_{k=0}^{L} \log\big[ b_{s_k}(O_k) \big]\, P(s \mid o, \lambda_{old}) = \sum_{i=1}^{N} \sum_{k=0}^{L} \log\big[ b_{s_i}(O_k) \big]\, P(S_k = s_i \mid o, \lambda_{old}) . \qquad (6.92) $$

For readability, b_si(o_j) is denoted by b_ij in the following. The maximization problem is:

$$ B^{opt} = \underset{b_{ij}}{\arg\max}\ Q^{b}(B, \lambda_{old}); \qquad \text{s.t.} \quad \forall i: \sum_{j=1}^{M} b_{ij} = 1 , \qquad (6.93) $$

leading to

$$ \frac{\partial}{\partial b_{ij}} \left( \sum_{i=1}^{N} \sum_{k=0}^{L} \log\big[ b_{s_i}(O_k) \big]\, P(S_k = s_i \mid o, \lambda_{old}) - \sum_{i=1}^{N} \varphi_i \left( \sum_{j=1}^{M} b_{ij} - 1 \right) \right) = 0 \qquad (6.94) $$

$$ \Leftrightarrow \quad \sum_{k=0;\, O_k = o_j}^{L} \frac{1}{b_{ij}}\, P(S_k = s_i \mid o, \lambda_{old}) - \varphi_i = 0 \qquad (6.95) $$

$$ \Leftrightarrow \quad b_{ij} = \frac{ \sum_{k=0;\, O_k = o_j}^{L} P(S_k = s_i \mid o, \lambda_{old}) }{ \varphi_i }; \qquad \varphi_i \ne 0 . \qquad (6.96) $$

Substitution into the side constraints yields:

$$ \sum_{j=1}^{M} \frac{ \sum_{k=0;\, O_k = o_j}^{L} P(S_k = s_i \mid o, \lambda_{old}) }{ \varphi_i } \overset{!}{=} 1 \qquad (6.97) $$

$$ \Leftrightarrow \quad \varphi_i = \sum_{j=1}^{M} \sum_{k=0;\, O_k = o_j}^{L} P(S_k = s_i \mid o, \lambda_{old}) = \sum_{k=0}^{L} P(S_k = s_i \mid o, \lambda_{old}) . \qquad (6.98) $$

The condition ϕ_i ≠ 0 is fulfilled if state s_i is reachable with the given sequence.
Finally,

$$ b_{ij} = \frac{ \sum_{k=0;\, O_k = o_j}^{L} P(S_k = s_i \mid o, \lambda_{old}) }{ \sum_{k=0}^{L} P(S_k = s_i \mid o, \lambda_{old}) } = \frac{ \sum_{k=0;\, O_k = o_j}^{L} \gamma_k(i) }{ \sum_{k=0}^{L} \gamma_k(i) } . \qquad (6.99) $$

Maximizing Q^v. In order to maximize the transition part of the Q-function for HSMMs, the sums are again rearranged. This time, the grouping collects all transitions from S_{k−1} = s_i to S_k = s_j as follows:

$$ Q^{v}(G, \lambda_{old}) = \sum_{s \in S} \sum_{k=1}^{L} \log\big[ v_{s_{k-1} s_k}(d_k \mid G) \big]\, P(s \mid o, \lambda_{old}) = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1}^{L} \log\big[ v_{ij}(d_k \mid G) \big]\, P(S_{k-1} = s_i, S_k = s_j \mid o, \lambda_{old}) . \qquad (6.100) $$

In contrast to the maximization of π and B, and in contrast to standard HMMs, this maximization cannot be performed analytically. The reason can be traced back to the definition of v_ij(d_k), which actually is a function of the parameters P and D:12

$$ v_{ij}(d_k \mid G) = v_{ij}\big(d_k \mid P, D(d)\big) = \begin{cases} p_{ij}\, d_{ij}(d_k) & \text{if } j \ne i \\[4pt] 1 - \sum_{h=1;\, h \ne i}^{N} p_{ih}\, d_{ih}(d_k) & \text{if } j = i . \end{cases} \qquad (6.101) $$

The problem is that p_ij and d_ij(d_k) appear twice in v_ij(d_k), once in each case, which complicates computations, as can be seen from the derivative with respect to p_ij. In order to shorten notation, note that from the definition given in Equation 6.42,

$$ P(S_{k-1} = s_i, S_k = s_j \mid o, \lambda_{old}) = \xi_k(i, j) . \qquad (6.102) $$

Incorporating the side conditions given in Equation 6.4, the Lagrangian L for Equation 6.100 is:

$$ \mathcal{L} = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1}^{L} \log\big[ v_{ij}(d_k) \big]\, \xi_k(i, j) - \sum_{i=1}^{N} \varphi_i \left( \sum_{j=1;\, j \ne i}^{N} p_{ij} - 1 \right) \qquad (6.103) $$

$$ = \sum_{k=1}^{L} \sum_{i=1}^{N} \left( \sum_{j=1;\, j \ne i}^{N} \log\big[ p_{ij}\, d_{ij}(d_k) \big]\, \xi_k(i, j) + \log\Big[ 1 - \sum_{h=1;\, h \ne i}^{N} p_{ih}\, d_{ih}(d_k) \Big]\, \xi_k(i, i) \right) - \sum_{i=1}^{N} \varphi_i \left( \sum_{j=1;\, j \ne i}^{N} p_{ij} - 1 \right) . \qquad (6.104) $$

12 Although it has been assumed that p_ii ≡ 0, the notation includes h ≠ i to highlight that no self-transitions are incorporated.

Setting the partial derivative to zero yields:

$$ \frac{\partial}{\partial p_{ij}} \mathcal{L} = 0 \qquad (6.105) $$

$$ \Leftrightarrow \quad \sum_{k=1}^{L} \left( \frac{d_{ij}(d_k)}{p_{ij}\, d_{ij}(d_k)}\, \xi_k(i, j) + \frac{ -d_{ij}(d_k) }{ 1 - \sum_{h \ne i} p_{ih}\, d_{ih}(d_k) }\, \xi_k(i, i) \right) - \varphi_i = 0 \qquad (6.106) $$

$$ \Leftrightarrow \quad \sum_{k=1}^{L} \left( \frac{1}{p_{ij}}\, \xi_k(i, j) - \frac{ d_{ij}(d_k) }{ 1 - \sum_{h \ne i,j} p_{ih}\, d_{ih}(d_k) - p_{ij}\, d_{ij}(d_k) }\, \xi_k(i, i) \right) - \varphi_i = 0 . \qquad (6.107) $$

Although a solution for p_ij exists, it is not possible to solve for ϕ_i analytically. For this reason, a gradient-based approximation technique is applied. It can be seen from Equation 6.107 that the derivatives are independent for each state s_i. Therefore, the parameters can be optimized separately for each state, and the objective function Q^v can be split as follows:

$$ Q^{v}\big(P, D(d), \lambda_{old}\big) = \sum_{i=1}^{N} Q^{v}_i\big(P_i, D_i(d), \lambda_{old}\big) . \qquad (6.108) $$

This reduces the complexity of the optimization procedure, since the number of parameters per subproblem is much smaller. Since each objective comprises all outgoing transitions of state s_i, the objective function is denoted by Q^v_i:

$$ Q^{v}_i\big(P_i, D_i(d), \lambda_{old}\big) = \sum_{k=1}^{L} \sum_{j=1;\, j \ne i}^{N} \left( \log\big[ p_{ij}\, d_{ij}(d_k) \big]\, \xi_k(i, j) + \log\Big[ 1 - \sum_{h=1;\, h \ne i,j}^{N} p_{ih}\, d_{ih}(d_k) - p_{ij}\, d_{ij}(d_k) \Big]\, \xi_k(i, i) \right) . \qquad (6.109) $$

Let ∇Q^v_i denote the gradient vector, whose components are obtained by partial derivatives of Equation 6.109 with respect to the parameters. The derivatives with respect to the kernel weights w_ij,r and the kernel parameters θ_ij,r are obtained by

$$ \big(\nabla Q^v_i\big)_{w_{ij,r}} = \frac{\partial Q^v_i}{\partial d_{ij}(d_k)}\, \frac{\partial d_{ij}(d_k)}{\partial w_{ij,r}} , \qquad \big(\nabla Q^v_i\big)_{\theta_{ij,r}} = \frac{\partial Q^v_i}{\partial d_{ij}(d_k)}\, \frac{\partial d_{ij}(d_k)}{\partial \kappa_{ij,r}}\, \frac{\partial \kappa_{ij,r}}{\partial \theta_{ij,r}} . \qquad (6.110) $$

The dimension of ∇Q^v_i equals

$$ \dim\big(\nabla Q^v_i\big) = J \Big( 1 + \bar{R}\, \big( 1 + \bar{\theta}\, \big) \Big) , \qquad (6.111) $$

where J is the number of outgoing transitions, R̄ the average number of kernels per transition, and θ̄ the average number of kernel parameters θ_ij,r per kernel. However, the optimization procedure has to obey several restrictions. The first has already been expressed by the Lagrangian in Equation 6.103: the sum over p_ij for all outgoing transitions has to equal one. Rearranging this restriction yields:

$$ \sum_{j=1}^{J} p_{ij} - 1 = 0 , \qquad (6.112) $$

which is the defining equation of a J-dimensional hyperplane in the space of all optimization parameters. The interpretation of this is that all feasible solutions to the optimization problem have to be points within the hyperplane.
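Before turning to the numerical optimization of Q^v, the closed-form reestimation formulas derived above (Equations 6.91 and 6.99) can be turned into a few lines of code once the γ_k(i) have been obtained from the forward-backward pass; the γ values below are illustrative placeholders rather than the output of an actual model:

```python
def reestimate_b(gamma, obs, n_symbols):
    """Reestimate observation probabilities b_ij (Equation 6.99):
    expected number of times state i emits symbol j, normalized per state."""
    n_states = len(gamma[0])
    B = [[0.0] * n_symbols for _ in range(n_states)]
    norm = [0.0] * n_states
    for k, o_k in enumerate(obs):
        for i in range(n_states):
            B[i][o_k] += gamma[k][i]   # numerator: sum over k with O_k = o_j
            norm[i] += gamma[k][i]     # denominator: sum over all k
    return [[x / norm[i] for x in row] for i, row in enumerate(B)]

# Illustrative values: gamma[k][i] = P(S_k = s_i | o, lambda_old)
gamma = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.1, 0.9]]
obs = [0, 1, 1, 0]

B = reestimate_b(gamma, obs, n_symbols=2)
pi_new = gamma[0]                      # Equation 6.91: pi_i = gamma_0(i)

# Each row of B remains a probability distribution over symbols
assert all(abs(sum(row) - 1.0) < 1e-12 for row in B)
```

The per-state normalization by Σ_k γ_k(i) is exactly the Lagrange-multiplier term ϕ_i computed in Equation 6.98.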
However, the vector ∇Q^v_i does not necessarily point in a direction parallel to the hyperplane, so that an unrestricted gradient ascent would leave the hyperplane of feasible solutions. In order to avoid this, the gradient vector is projected onto the hyperplane, which results in the direction of steepest ascent within the subspace of feasible solutions (see Figure 6.9).

Figure 6.9: Projecting the gradient vector g onto the plane of values for which p_1 + p_2 = 1. The result is denoted by g′. Θ denotes an arbitrary third parameter.

Projection onto a hyperplane can be achieved by a simple matrix multiplication:

$$ \big(\nabla Q^v_i\big)' = \big(M M^T\big)\, \nabla Q^v_i , \qquad (6.113) $$

where (∇Q^v_i)′ denotes the projected gradient vector and M is the matrix of orthonormal basis vectors of the hyperplane, translated such that it crosses the origin of parameter space. Note that M is constant, so that the projection matrix can be precomputed. In most applications, duration distributions will be applied that are a convex combination of two or more kernels. In this case, the requirement that the kernel weights sum up to one (c.f., Equation 6.9 on Page 98) constitutes an additional hyperplane restricting the subspace of feasible solutions, similar to Equation 6.112. In a geometric interpretation, the subspace of feasible solutions is then defined by the intersection of all constraining hyperplanes. For example, in Figure 6.9, if the parameter Θ had to be equal to zero, the subspace of feasible solutions would only consist of the intersection of the shaded hyperplane with the p_1-p_2 plane, as indicated by the bold line. Matrix M then consists of the orthonormal basis vectors of the intersection of all restricting hyperplanes, which can also be precomputed. However, there are further restrictions. For example, the p_ij denote probabilities, which can hence only take values in the range [0, 1]. Another example is that the parameter λ of an exponential distribution must be greater than zero.
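The projection g′ = (M M^T) g can be made concrete for the three-parameter situation of Figure 6.9 (two probabilities p_1, p_2 with p_1 + p_2 = 1, plus an unconstrained parameter Θ). The basis vectors below are constructed for this toy case only:

```python
import math

# Orthonormal basis of the feasible directions for the constraint p1 + p2 = 1:
# directions whose p-components sum to zero; the third axis (Theta) is free.
u1 = (1 / math.sqrt(2), -1 / math.sqrt(2), 0.0)
u2 = (0.0, 0.0, 1.0)

def project(g):
    """Compute g' = (M M^T) g for M = [u1 u2]."""
    c1 = sum(a * b for a, b in zip(u1, g))  # component of g along u1
    c2 = sum(a * b for a, b in zip(u2, g))  # component of g along u2
    return tuple(c1 * a + c2 * b for a, b in zip(u1, u2))

g = (0.7, 0.1, 0.4)   # an unconstrained gradient (illustrative values)
gp = project(g)

# The projected gradient no longer changes the sum p1 + p2 ...
assert abs(gp[0] + gp[1]) < 1e-12
# ... while the unconstrained parameter Theta keeps its full gradient component.
assert abs(gp[2] - g[2]) < 1e-12
```

Since u1 and u2 are fixed by the constraints, M M^T can indeed be precomputed once, as noted above.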
The solution to this problem is that the step size along the projected gradient vector needs to be restricted such that the optimization cannot leave the admissible range. In the geometric interpretation, this corresponds to clipping the projected gradient vector at boundary hyperplanes such as λ = 0.

Summary of the proof of convergence. The goal of this section was to prove convergence of the training algorithm. The strategy of EM algorithms is to iteratively maximize a lower bound in order to reach a maximum of the objective function. The lower bound of EM algorithms is the so-called Q-function, which is the expected training data likelihood over all combinations of unknown data. In the case of HSMMs, the Q-function is the expected observation sequence likelihood over all sequences of hidden states. Similar to standard HMMs, the Q-function for HSMMs can be separated into a sum of three independent parts, such that maximization of Q can be achieved by individual maximization. The maximum for the initial probabilities π and the observation probabilities B has been computed analytically using the method of Lagrange multipliers, resulting in reestimation formulas similar to those of the Baum-Welch algorithm for standard HMMs. However, an analytical solution is not available for the transition parameters. Therefore, a gradient-based iterative maximization procedure is used for this part of the Q-function. The fact that the Q-function is increased leads to an increased value of the objective function. Since the objective function is continuous and bounded, a repeated increase converges to a local maximum.

6.6 HSMMs for Failure Prediction

There is a principal interrelation between the number of free parameters and the amount of training data needed to estimate the parameters: The more parameters need to be estimated, the more training sequences are required to yield reliable estimates.
Since in failure prediction the models are trained from failure data, the amount of training data is naturally limited. Hence the number of free parameters must be kept small. The number of free model parameters is mainly determined by the number of states and the topology, which determines the connections among states. The most widespread topology for HMMs is a chain-like structure, since, first, the notion of a sequence has some “left-to-right” connotation and, second, it has the least number of transitions. The model topologies used for online failure prediction are no exception in that respect. However, there are some particularities that need to be explained. It is a principal and unavoidable characteristic of supervised machine learning approaches that the desired specifics are extracted from training data, which can never capture all properties of the true underlying interrelations. More specifically, this results from the fact that

1. Training data is a finite sample, from which it follows that samples only contain/reveal a subset of the true characteristics.
2. Measurement data is subject to noise. In the case of error sequences, e.g., it is common that error messages that are not related to the failure mechanism occur in the training data (noise filtering can alleviate the problem but cannot completely remove noise from the data).

In order to account for these two properties, a strict left-to-right model is extended in two steps:

1. Jumps are introduced such that states can be left out, as is shown in Figure 6.10. This addresses missing error events in training sequences, which is related to the first particularity listed above.
2. After training, intermediate states are introduced (see Figure 6.11), addressing the second particularity.

Training is performed between the two steps in order to keep the number of parameters as small as possible.

Chain model with shortcuts.
The model topology to which training is applied is shown in Figure 6.10. Since this structure is rather sparse, training computation times remain acceptable. Note that only shortcuts bypassing one state have been included in the figure; the models used for the telecommunication system case study also included shortcuts of larger maximum span.

Figure 6.10: Failure prediction model structure used for training. Only shortcuts bypassing one state are shown; in implementations, shortcuts with a larger span have also been used.

Transition parameters and prior probabilities are initialized randomly. Observation probabilities are also initialized randomly, with one restriction: failure symbols can only be generated by the last, absorbing failure state, and error event IDs only by the transient states. Since the number of states N is not altered by the training procedure, it must be prespecified, although the optimal number of states cannot be identified upfront: if there are too few states, there are not enough transitions to represent all symbols in the sequences; if there are too many, the number of parameters is too large to be reliably estimated from the limited amount of training data, and the model might overfit the training data. For this reason, several values of N are tried and the most appropriate model is selected. Please also note that the training sequences have been filtered in the process of data preprocessing, as described in Section 5.3.

Background distributions. As has been pointed out in Section 6.3.3, the Baum-Welch algorithm sets observation probabilities to zero for all observation symbols that do not occur in the training data set, and hence every sequence containing one of those symbols is assigned a sequence likelihood of zero. This is not appropriate for failure prediction since the subsequent classification step builds on a continuous measure of similarity.
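The restricted random initialization of observation probabilities can be sketched as follows; sizes, seed, and symbol layout are hypothetical, the restriction (failure symbol only from the absorbing state, error IDs only from transient states) follows the text:

```python
import random
random.seed(42)

N, M = 5, 10        # hypothetical sizes: N states, M observation symbols
FAIL = M - 1        # index of the failure symbol

# Random row-stochastic observation matrix B with the stated restriction.
B = [[random.random() for _ in range(M)] for _ in range(N)]
for i in range(N - 1):
    B[i][FAIL] = 0.0                 # transient states: no failure symbol
B[N - 1] = [0.0] * M
B[N - 1][FAIL] = 1.0                 # absorbing state: failure symbol only
B = [[x / sum(row) for x in row] for row in B]
```

Each row is renormalized after zeroing the forbidden entries, so every state still carries a proper probability distribution over symbols.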
Furthermore, training data is incomplete: during online prediction, there might be failure-prone sequences that are very similar to the training sequences but contain some symbol that was not present in the (filtered) training data. Assigning such sequences a likelihood (i.e., a similarity) of zero is obviously not appropriate. Hence, after training the chain model with shortcuts, background distributions have to be applied.

Intermediate states. For each transition of the model, a fixed number of intermediate states is added such that the sum of mean transition times equals the mean transition time of the original transition (see Figure 6.11).

Figure 6.11: Adding intermediate states for each transition. Bold arcs visualize transitions from the model shown in Figure 6.10. µij denotes the mean duration of the transition from state si to state sj. Observation probability distributions bsi(oj) for states 1, 2, and 3 have been omitted.

More precisely, for any pair of states si and sj of the model obtained from training (c.f. Figure 6.10), v intermediate states sij,1, ..., sij,v are added such that the mean transition duration via the intermediate states equals the mean duration of the direct transition si → sj.13 Limiting transition probabilities pij are adapted by distributing a fixed, prespecified amount of probability mass equally among the intermediate states. For example, in Figure 6.11, if it is specified upfront that 10% of the probability mass should be assigned to intermediates, then p12 and p13 are scaled by 0.9 and the probability from state s1 to each of the four intermediates equals 0.1/4. Observation probabilities of intermediate states are not subject to training, and hence prior probabilities P(oj) estimated from the entire training data set are used.
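The redistribution of probability mass to the intermediate states amounts to a simple rescaling; a minimal sketch (the concrete values of p12 and p13 are hypothetical, the 10% mass and four intermediates follow the example in the text):

```python
def redistribute(direct_probs, n_intermediates, mass=0.10):
    """Scale the direct transition probabilities by (1 - mass) and split the
    reserved probability mass equally among the intermediate states."""
    scaled = {k: v * (1.0 - mass) for k, v in direct_probs.items()}
    return scaled, mass / n_intermediates

# p12 and p13 are scaled by 0.9; each of the four intermediate successors
# of s1 receives 0.1 / 4 = 0.025 of probability mass.
scaled, per_state = redistribute({"p12": 0.6, "p13": 0.4}, n_intermediates=4)
```

The outgoing probabilities of s1 still sum to one after the operation, which is the invariant the rescaling must preserve.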
13 That is, if, e.g., the mean transition time from state s1 to s2 is µ12 = 12 s and there are two intermediate states s12,1 and s12,2, the mean durations from s1 to s12,1, from s12,1 to s12,2, and from s12,2 to s2 are each four seconds, and the mean duration from s1 to s12,2 is eight seconds.

6.7 Computational Complexity

An assessment of computational complexity for most machine learning techniques has to consider two cases: training and online application. Training is performed offline, and computing time is hence less critical than for the application of the model, which in this case is the online prediction of upcoming failures. Both cases are investigated separately.

Application complexity. The approach to failure prediction presented here involves computation of the forward algorithm for each sequence. The forward algorithm of standard HMMs is of the order O(N²L), as can be seen from the trellis shown in Figure 4.3 on Page 59: for each of the L+1 symbols of the sequence, a sum over N terms has to be computed for each of the N states. However, this is only true if really all predecessors have to be taken into account: if the implementation uses adjacency lists, this assessment applies only to ergodic (fully connected) model structures, while for the frequently used left-to-right structures complexity goes down to O(NL). The complexity of the Viterbi algorithm is the same, since the sum of the forward algorithm is simply replaced by a maximum operator, which also has to investigate all N predecessors in order to select the maximum value. The complexity of the backward algorithm equals that of the forward algorithm as well; here the multiplication by bsi(Ot) cannot be factored out, but since constant factors do not change the complexity class in the O-calculus, the same class results. Turning to HSMMs, the algorithms belong to the same complexity class, since the only difference between the algorithms is that aij is replaced by vij(dk).
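The O(N²L) structure of the forward algorithm is easiest to see in code; a plain-Python sketch for a standard HMM (the toy two-state model is hypothetical), where the nested loops over states expose the O(N²) work per symbol:

```python
def forward_likelihood(pi, A, B, obs):
    """Plain HMM forward algorithm. The nested loops over states make the
    O(N^2) cost per symbol visible; over a sequence of length L this gives
    the O(N^2 L) bound for a fully connected model."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)

# Tiny deterministic 2-state toy model: states alternate, each emitting
# "its" symbol, so the sequence 0,1,0 has likelihood one.
pi = [1.0, 0.0]
A = [[0.0, 1.0], [1.0, 0.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
p = forward_likelihood(pi, A, B, [0, 1, 0])
```

With adjacency lists for a left-to-right topology, the inner sum would run over a constant number of predecessors instead of all N, which is exactly the drop to O(NL) discussed above.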
More precisely:

$a_{ij} \;\Leftrightarrow\; p_{ij} \sum_{r=0}^{R} w_{ij,r}\, \kappa_{ij,r}(d \mid \theta_{ij,r}) \quad \text{for } i \neq j \,. \qquad (6.114)$

The κij,r(d) are cumulative probability distributions that have to be evaluated for delay d. Depending on the type of distribution, this may involve more or fewer computations since, e.g., for Gaussian distributions there is no closed-form expression for the cumulative distribution function. However, since R is constant (and most likely less than five) irrespective of N and L, it is a constant factor, and complexity in terms of the O-calculus is the same as for standard HMMs. For the case that the process has stayed in state si (j = i), computations are even less costly if the products pij dij(d), j ≠ i, are summed up "on the fly".

Training complexity. Estimating the overall complexity of the Baum-Welch algorithm is a difficult task since the number of iterations depends on many factors, such as

• model initialization, which is in many cases random,
• quality and quantity of the training data, which includes the number of training sequences,
• appropriateness of the HMM assumptions,
• appropriateness of the model topology,
• number of parameters of the model.

In the case of a standard HMM, this number is determined by N values for π, up to N² transition probabilities aij in the case of a fully connected HMM, and NM observation probabilities B. Since M is determined by the application, it is assumed to be constant. Hence, the number of parameters is O(N²). Some approaches have been published that try to predict computation time (e.g., Hoffmann [120]), but since these models are based on measurements, they do not help to derive an O-calculus assessment. Due to the number of parameters being of the order O(N²), it is assumed here that the number of iterations is also in O(N²), which in reality is a quite loose upper bound; in fact, convergence can be much better if a large amount of consistent training data is available.
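The substitution in Equation 6.114 can be sketched as follows; a hedged illustration in which the mixture parameters are hypothetical and Gaussian components (via the standard library's `NormalDist`) stand in for arbitrary kernels κ:

```python
from statistics import NormalDist

def duration_kernel(p_ij, weights, components, d):
    """Sketch of Eq. 6.114: a_ij is replaced by p_ij times a weighted sum of
    cumulative distributions kappa_{ij,r} evaluated at delay d. Here each
    kappa is a Gaussian CDF, evaluated numerically (no closed form exists)."""
    return p_ij * sum(w * k.cdf(d) for w, k in zip(weights, components))

# Hypothetical two-component mixture (R = 1, i.e. two kernels).
v = duration_kernel(0.5, [0.3, 0.7],
                    [NormalDist(10, 2), NormalDist(30, 5)], d=20.0)
```

Since the number of components is a small constant irrespective of N and L, each such evaluation costs O(1), which is why the overall complexity class is unchanged.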
Furthermore, in real applications a constant upper bound on the number of iterations is used. Note that this does not guarantee that the training procedure ends close to a local maximum. However, since training is usually repeated several times with different random initializations, this drawback is relatively small. The complexity of one reestimation step can be determined: the E-step of the EM algorithm involves execution of the forward-backward algorithm of complexity O(N²L). Then, to accomplish the M-step, reestimation of

• π requires O(N) steps,
• B requires O(NL) steps,
• A requires O(N²L) steps

for each sequence. Hence one step of the overall training procedure also has complexity O(N²L). Putting this together with the number of iterations, the overall training complexity is of the order O(N⁴L). As for model application, the complexity of models used in real applications (e.g., left-to-right topology) is lower. Turning to HSMMs, reestimation of π and B remains the same, while reestimation of A is replaced by an iterative approximation procedure, which leads to an increased complexity of HSMMs:

• The optimization algorithm has to be run for each of the N states.
• For a fully connected model, the number of parameters that have to be estimated increases by const · (N − 1), which is of the order O(N).
• Computing the gradient involves a sum over all training data, which is O(L).
• Since a few gradient-based optimization steps are sufficient, and assuming constant complexity for determining the step size, the number of iterations can be limited to O(1).

The resulting complexity is:

$N \cdot O(N) \cdot O(L) \cdot O(1) = O(N^2 L) \,. \qquad (6.115)$

Assuming the number of iterations of the outer EM algorithm to be O(N²) as before, this again yields an overall complexity of O(N⁴L). Again, in real applications such as online failure prediction, a left-to-right structure is used, which also limits the training complexity of each iteration to O(NL).
Additionally, a constant upper bound on the number of iterations can be applied, with the same drawback as for standard HMMs. In general, this analysis shows the sometimes misleading oversimplification of the O-calculus: although they belong to the same complexity class, HSMMs are clearly more complex than standard HMMs. However, as the experiments along with the case study will show, computation times are still acceptable (see Sections 9.4.2, 9.7.1, and 9.9.5).

6.8 Summary

Hidden Semi-Markov Models (HSMMs) are a combination of semi-Markov processes (SMPs) and standard hidden Markov models (HMMs): standard HMMs employ a discrete-time Markov chain for the stochastic process of hidden state traversals, which is replaced by a continuous-time SMP in the case of HSMMs. Although it is not the first time that such a combination has been proposed, previous approaches were limited to discrete time steps of length ∆t, have used state duration distributions instead of transition durations, and/or were limited to a maximum duration. The forward, backward, and Viterbi algorithms have been derived, yielding algorithms that are of the same complexity class14 as the algorithms of standard HMMs. This has been achieved by a strict application of the Markov property and the assumption that a state transition takes place each time an observation occurs. Although this might sound too simplistic, a comparison of event-triggered temporal sequence processing with the situation encountered in speech recognition reveals why this assumption is appropriate for temporal sequence processing: the temporal properties of the process appear at the surface and are expressed by the times when events occur, whereas speech recognition operates on periodic (i.e., equidistant) sampling, and hence the underlying temporal properties do not appear in the observation data. The forward or Viterbi algorithm is used for sequence recognition.
Sequence prediction aims to forecast the further development of the stochastic process. There are two different types of prediction: first, it might be of interest what the next observation symbol at a certain time in the future will be; second, the probability that the stochastic process reaches a distinguished state by some time t in the future can be computed. Solutions to both goals have been derived. Training of HSMMs is accomplished in a similar way to standard HMMs: based on the forward and backward algorithms, an expectation-maximization algorithm is employed. However, the formulas known from standard HMMs can only be adopted for the initial state and observation distributions π and bsi(oj), respectively. Limiting transition probabilities pij and transition durations dij(dk) need to be optimized by an embedded gradient-based optimization procedure. The entire training procedure has been summarized on Page 111. A proof that the training procedure converges to a local maximum of sequence likelihood has been presented. It is based on the notion that EM algorithms perform lower-bound optimization, from which a so-called Q-function can be derived. This derivation has been applied to the case of HSMMs, yielding three terms that can be optimized independently. The proof investigates all terms and derives the training formulas. Topics that are relevant to the application of HSMMs to online failure prediction have been covered, including a two-step model construction process. Together with the application of background distributions, this process increases model bias and lowers variance, as is shown in Section 7.3. Finally, the complexity of the derived algorithms has been assessed using the O-calculus. For a fully connected (ergodic) model, both the algorithms for standard HMMs and for HSMMs are of complexity O(N⁴L), assuming the number of outer EM iterations to be of O(N²).
However, the constant factors, which are hidden by the O-calculus, are significant for HSMMs. Furthermore, for many applications complexity is reduced to O(NL).

14 In terms of the O-calculus.

Contributions of this chapter. HSMMs, as proposed in this chapter, follow a novel approach to extending hidden Markov models to continuous time. The fundamental difference between the periodically sampled input data of applications such as speech recognition and event-triggered temporal sequences is that temporal aspects of the underlying stochastic process are revealed at the level of observations. By exploiting this difference, a hidden semi-Markov model has been proposed that operates on true continuous time rather than discrete time steps. It is able to model transition durations rather than state sojourn times, and does not require specification of a maximum duration. Furthermore, the model provides great flexibility in terms of the distributions used and offers the possibility to incorporate background distributions for transition durations. Moreover, the algorithms are of the same complexity class as those of standard hidden Markov models.

Relation to other chapters. In online failure prediction, the similarity of an error sequence observed in the running system to the failure-prone sequences of the training dataset is assessed by computing sequence likelihood. Since at least two HSMMs are used (one for similarity to failure sequences and one for similarity to non-failure sequences), a classification step is needed in order to come to a final evaluation of the current system status. Several approaches to classification are presented in the next chapter.

Chapter 7 Classification

Classification is the last stage of the failure prediction process (see Figure 2.10 on Page 20). Classification facilitates a decision whether the current status of the system, as expressed by the observed error sequence, is failure-prone or not. This chapter discusses issues related to that topic.
More specifically, in Section 7.1 Bayes decision theory is introduced, while topics directly related to failure prediction are discussed in Section 7.2. As the outcome of the classifier is a decision that can be either right or wrong, the classification error is analyzed in more detail in Section 7.3. This includes the bias-variance dilemma and approaches by which the trade-off between bias and variance can be controlled.

7.1 Bayes Decision Theory

Classification, in its principal sense, denotes the assignment of some class label ci, i ∈ {0, ..., u}, to an input feature vector s. It is not surprising that a decision theory bearing the name of Revd. Thomas Bayes provides a stochastic formal foundation for deriving and evaluating rules for class label assignment based on Bayes' rule. The principal approach of Bayesian decision is that class label assignment is based on the probability of class ci after having observed feature vector s, the so-called posterior probability distribution P(ci | s). Applying Bayes' rule, the posterior can be computed by:

$P(c_i \mid s) = \frac{p(s \mid c_i)\, P(c_i)}{p(s)} = \frac{p(s \mid c_i)\, P(c_i)}{\sum_l p(s \mid c_l)\, P(c_l)} \,, \qquad (7.1)$

where p(s | ci) is called the likelihood and P(ci) is called the prior. The likelihood expresses that certain features occur with different probabilities depending on the true class. The prior accounts for the fact that classes ci are not equally frequent. Since classification theory has mainly been developed for continuous feature vectors, there are infinitely many values of s, and the likelihood p(s | ci) is a probability density, which is denoted by a lowercase "p".

7.1.1 Simple Classification

The simplest classification rule is to assign an observed feature vector s to the class with maximum posterior probability:

$\operatorname{class}(s) = \arg\max_{c_i} P(c_i \mid s) \qquad (7.2)$

$= \arg\max_{c_i} \frac{p(s \mid c_i)\, P(c_i)}{\sum_l p(s \mid c_l)\, P(c_l)} \qquad (7.3)$

$= \arg\max_{c_i} p(s \mid c_i)\, P(c_i) \,. \qquad (7.4)$
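Bayes' rule as used in Equation 7.1 can be computed directly; a minimal sketch with hypothetical numbers, illustrating how a strong likelihood interacts with an unbalanced prior:

```python
def posteriors(likelihoods, priors):
    """Bayes' rule (Eq. 7.1): the posterior is likelihood times prior,
    normalized by the evidence p(s) = sum_l p(s | c_l) P(c_l)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Class 0 has a much higher likelihood for s, but class 1 is far more
# frequent; the prior tips the posterior in favor of class 1.
post = posteriors([0.20, 0.05], [0.1, 0.9])
```

The example shows why the prior matters: the likelihood alone would favor class 0, yet the posterior assigns more mass to class 1.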
The last step from Equation 7.3 to Equation 7.4 can be performed since the denominator is independent of ci and hence does not influence the arg max operator. This classification rule seems intuitively correct, and it can be shown that it minimizes the misclassification error (see Bishop [30]). For the sake of simplicity, let us assume that there are only two classes c1 and c2. Let Ri denote a decision region, which is a not necessarily contiguous partition of feature space: if a data point s occurs within region Ri, class label ci is assigned to s. The total probability of misclassification, i.e., the error, is given by:

$P(\text{error}) = P(s \in R_2, c_1) + P(s \in R_1, c_2) \qquad (7.5)$

$= P(s \in R_2 \mid c_1)\, P(c_1) + P(s \in R_1 \mid c_2)\, P(c_2) \qquad (7.6)$

$= \int_{R_2} p(s \mid c_1)\, P(c_1)\, ds + \int_{R_1} p(s \mid c_2)\, P(c_2)\, ds \,. \qquad (7.7)$

The boundaries between decision regions are known as decision surfaces or decision boundaries. Figure 7.1 visualizes Equation 7.7 for a one-dimensional feature space s and two contiguous regions defining a single decision boundary θ.

Figure 7.1: Classification by maximum posterior for a two-class example. The curves show p(s|ci) P(ci) and hatched areas indicate the error. R1 and R2 are decision regions: every s within R1 is classified as c1 and within R2 as c2. It can be seen that the error is minimal if the decision boundary θ equals the point where the two probabilities cross.

It can be seen from the figure that the total probability of an error (i.e., the hatched area) is minimal if θ is chosen to be the value of s for which p(s | c1) P(c1) = p(s | c2) P(c2). From this follows that the decision rule given in Equation 7.4 results in the minimum probability of misclassification for two classes. The resulting minimum error for this boundary is called the Bayes error rate. In the case of more classes, it is easier to compute the probability of correct classification:

$P(\text{correct}) = \sum_i \int_{R_i} p(s \mid c_i)\, P(c_i)\, ds \,. \qquad (7.8)$
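The crossing point p(s | c1) P(c1) = p(s | c2) P(c2) can be found numerically; a sketch for two hypothetical Gaussian class-conditional densities with equal priors, using bisection between the means:

```python
from statistics import NormalDist

# Hypothetical class-conditional densities and priors.
g1, g2 = NormalDist(0.0, 1.0), NormalDist(3.0, 1.0)
P1, P2 = 0.5, 0.5

def diff(s):
    """Positive where class c1 dominates, negative where c2 dominates."""
    return g1.pdf(s) * P1 - g2.pdf(s) * P2

# Bisection for the unique crossing point between the two means.
lo, hi = 0.0, 3.0
for _ in range(60):
    mid = (lo + hi) / 2.0
    if diff(mid) > 0:
        lo = mid
    else:
        hi = mid
theta = (lo + hi) / 2.0
```

For equal variances and equal priors the boundary falls at the midpoint of the means, which serves as a sanity check for the numeric result.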
Choosing decision regions such that the probability of correct classification is maximized leads to Equation 7.4 in its general form for multiple classes. In summary, the Bayes classifier chooses decision regions such that the probability of correct classification is maximized; no other partitioning can yield a smaller probability of error (Duda & Hart [84]).

7.1.2 Classification with Costs

The classification rule derived above has not considered any cost or risk involved in classification. However, cost can influence classification significantly. For instance, in the case of medical screening, classifying an image of a tumor as normal is much worse than the reverse. The same might hold for failure prediction, too: not predicting an upcoming failure might cause much higher cost than spuriously predicting a failure while the system is actually running well. In order to account for cost, a cost or risk matrix is introduced.1 Each element rta of the risk matrix defines the cost/risk associated with assigning a pattern s to class ca when in reality it belongs to class ct. Although the term "risk" might not seem appropriate for cases where the correct class label is assigned, the term is used here nonetheless. Instead of minimizing the probability of error, an optimal cost-based classification minimizes the expected risk. To derive a formula, first the expected risk of assigning a sequence s to class ca is considered:

$R_a(s) = \sum_t r_{ta}\, P(c_t \mid s) \,. \qquad (7.9)$

Since class ca is assigned to all s ∈ Ra, the average cost of assignment to class ca is:

$R_a = \int_{R_a} \sum_t r_{ta}\, P(c_t \mid s)\, p(s)\, ds = \int_{R_a} \sum_t r_{ta}\, \frac{p(s \mid c_t)\, P(c_t)}{p(s)}\, p(s)\, ds \qquad (7.10)$

and the total expected risk equals

$R = \sum_a R_a = \sum_a \int_{R_a} \sum_t r_{ta}\, p(s \mid c_t)\, P(c_t)\, ds \,. \qquad (7.11)$
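Minimizing the conditional risk of Equation 7.9 is a one-line computation per class; a sketch with a hypothetical 2×2 loss matrix in which missing a failure is ten times as costly as a false alarm:

```python
def min_risk_class(posterior, risk):
    """Assign the class a that minimizes the conditional risk
    R_a(s) = sum_t r_ta P(c_t | s), following Eq. 7.9; risk[t][a] is the
    cost of assigning class a when the true class is t."""
    n = len(posterior)
    risks = [sum(risk[t][a] * posterior[t] for t in range(n))
             for a in range(n)]
    return min(range(n), key=risks.__getitem__)

# Truth 1 ("failure") assigned as 0 costs 10; a false alarm costs 1.
choice = min_risk_class([0.8, 0.2], [[0, 1], [10, 0]])
```

Note how the cost matrix overrides the raw posterior: class 1 is chosen even though class 0 has posterior 0.8, because the expected loss of missing a failure dominates.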
Risk is minimized if the integrand is minimized for each sequence s, which is achieved by choosing the decision region for assignment to class ca such that s ∈ Ra if

$\sum_t r_{ta}\, p(s \mid c_t)\, P(c_t) < \sum_t r_{ti}\, p(s \mid c_t)\, P(c_t) \quad \forall\, i \neq a \,, \qquad (7.12)$

resulting in a Bayes decision rule where the minimum loss across all assignments for sequence s is chosen. If two assignments have equal loss, any tie-breaking rule can be used.

1 In classification, the matrix is also called loss matrix.

7.1.3 Rejection Thresholds

Bishop [30] mentions that classification can also yield the result that a given instance cannot be classified with enough confidence. The idea is to classify a sequence s only if the maximum posterior is above some threshold θ ∈ [0, 1] (c.f. Equation 7.2):

$\operatorname{class}(s) = \begin{cases} c_k = \arg\max_{c_i} P(c_i \mid s) & \text{if } P(c_k \mid s) \geq \theta \\ \emptyset & \text{else.} \end{cases} \qquad (7.13)$

Rejection thresholds might be useful for online failure prediction if there is a human operator who can be alerted when a sequence cannot be classified, in order to further investigate or observe the system's status. However, since the experiments carried out in this work are based solely on a data set (there was no operator to alert), rejection thresholds have not been applied here. A second application of rejection thresholds is concerned with improving the computing performance of classifiers: in a first step, simple classifiers can be used to classify the non-ambiguous cases; for more complex situations, in which the simple classifiers do not exceed the rejection threshold, more sophisticated but computationally more expensive methods can be applied to further analyze the situation. However, since optimization of computing performance is not the purpose of this dissertation, such an approach has not been applied in this thesis either.

7.2 Classifiers for Failure Prediction

Bayesian decision theory provides the basic framework for classification.
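The rejection rule of Equation 7.13 translates directly into code; a minimal sketch where `None` stands for the empty assignment ∅ and the posterior values are hypothetical:

```python
def classify_with_rejection(posterior, theta=0.8):
    """Eq. 7.13: return the maximum-posterior class label only if its
    posterior reaches the rejection threshold theta; otherwise reject
    (None represents the empty assignment)."""
    best = max(posterior, key=posterior.get)
    return best if posterior[best] >= theta else None

confident = classify_with_rejection({"failure": 0.93, "ok": 0.07})
rejected = classify_with_rejection({"failure": 0.55, "ok": 0.45})
```

In the operator scenario described above, the rejected case would be handed to a human for further observation rather than triggering a warning.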
In this section, failure-prediction-specific as well as practical issues are discussed. Note that from now on the probability p(s | ci) denotes the likelihood of sequence s, which has been observed during runtime. In the case of hidden Markov models, sequence likelihood is computed by the forward algorithm.

7.2.1 Threshold on Sequence Likelihood

The simplest classification rule is to have only one single HSMM trained on all failure sequences irrespective of the failure mechanism, and to apply a threshold θ ∈ [0, 1] to the sequence likelihood p(s | λF), where λF denotes a model that has been trained on failure data only. The problem is that observation sequences s are delimited by a time window ∆td (c.f. Figure 5.4 on Page 79), resulting in a varying number of symbols in observation sequences. Sequence likelihood decreases monotonically with the number of observation symbols, and hence the threshold θ has to depend on the number of observation symbols. Furthermore, experiments have shown that such an approach does not result in decisive models. For these reasons, the method of simple thresholding is not used in this thesis.

7.2.2 Threshold on Likelihood Ratio

One way to circumvent the problem of varying length of observation sequences is to use exactly two models (λF for failure and λF̄ for non-failure sequences) and to compute the ratio of sequence likelihoods. A failure is predicted if the ratio is above some threshold θ ∈ [0, ∞). More formally, a failure is predicted if

$\frac{P(s \mid \lambda_F)}{P(s \mid \lambda_{\bar{F}})} > \theta \,. \qquad (7.14)$

In order to analyze this approach, it is cast into the framework of Bayes decision theory. To simplify matters, the formulas of Bayes decision theory become handier if rephrased for the two-class case.
From Equation 7.12 it follows that the classifier should opt for a failure if

$r_{FF}\, p(s \mid c_F)\, P(c_F) + r_{\bar{F}F}\, p(s \mid c_{\bar{F}})\, P(c_{\bar{F}}) < r_{F\bar{F}}\, p(s \mid c_F)\, P(c_F) + r_{\bar{F}\bar{F}}\, p(s \mid c_{\bar{F}})\, P(c_{\bar{F}}) \qquad (7.15)$

$\Leftrightarrow \quad (r_{FF} - r_{F\bar{F}})\, p(s \mid c_F)\, P(c_F) < (r_{\bar{F}\bar{F}} - r_{\bar{F}F})\, p(s \mid c_{\bar{F}})\, P(c_{\bar{F}}) \,. \qquad (7.16)$

Under the reasonable assumption that rF̄F̄ < rF̄F, which means that the cost associated with correctly classifying a non-failure-prone situation as o.k. is less than the cost associated with falsely classifying it as failure-prone, the inequality can be transformed as follows:

$\Leftrightarrow \quad \frac{r_{FF} - r_{F\bar{F}}}{r_{\bar{F}\bar{F}} - r_{\bar{F}F}}\, p(s \mid c_F)\, P(c_F) > p(s \mid c_{\bar{F}})\, P(c_{\bar{F}}) \qquad (7.17)$

$\Leftrightarrow \quad \frac{p(s \mid c_F)}{p(s \mid c_{\bar{F}})} > \frac{(r_{\bar{F}F} - r_{\bar{F}\bar{F}})\, P(c_{\bar{F}})}{(r_{F\bar{F}} - r_{FF})\, P(c_F)} \,. \qquad (7.18)$

Identifying the likelihoods p(s | cF) with the estimated sequence likelihoods P(s | λF) obtained from the models, it can be seen that classification by a threshold on sequence likelihoods is optimal if the threshold θ satisfies:

$\theta = \frac{(r_{\bar{F}F} - r_{\bar{F}\bar{F}})\, P(c_{\bar{F}})}{(r_{F\bar{F}} - r_{FF})\, P(c_F)} \,. \qquad (7.19)$

7.2.3 Using Log-Likelihood

In many real applications and models such as hidden semi-Markov models, sequence likelihoods P(s | λ) become too small to be computed, and hence the log-likelihood is used (c.f. Equation 6.19 on Page 101). However, this does not rule out Bayes classification, since the logarithm is a strictly monotonically increasing function and Equation 7.18 can be transformed into

$\underbrace{\log p(s \mid \lambda_F) - \log p(s \mid \lambda_{\bar{F}})}_{\in\, (-\infty,\, \infty)} \;>\; \log\!\left[\frac{r_{\bar{F}F} - r_{\bar{F}\bar{F}}}{r_{F\bar{F}} - r_{FF}}\right] + \underbrace{\log\!\left[\frac{P(c_{\bar{F}})}{P(c_F)}\right]}_{\text{const.}} \,. \qquad (7.20)$

The usefulness of the formula can be seen more easily if only the costs for misclassification are taken into account, i.e., rFF = rF̄F̄ = 0. Hence,

$\tilde{\theta} = \log\!\left[\frac{r_{\bar{F}F}}{r_{F\bar{F}}}\right] + c \,. \qquad (7.21)$

Equation 7.21 approaches −∞ as rF̄F → 0.
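Computing the cost-optimal threshold of Equation 7.19 is straightforward once the risk matrix and priors are fixed; a sketch with hypothetical values (false alarm cost 1, missed failure cost 10, failures rare):

```python
import math

def lr_threshold(r_fbar_f, r_fbar_fbar, r_f_fbar, r_f_f, prior_fbar, prior_f):
    """Eq. 7.19: cost-optimal threshold on the likelihood ratio. Argument
    names follow r_{ta} (true class, assigned class); all values here are
    hypothetical illustrations."""
    return ((r_fbar_f - r_fbar_fbar) * prior_fbar) / ((r_f_fbar - r_f_f) * prior_f)

# Misclassification costs only (r_FF = r_FbarFbar = 0).
theta = lr_threshold(1.0, 0.0, 10.0, 0.0, prior_fbar=0.99, prior_f=0.01)
log_theta = math.log(theta)   # threshold on the log-likelihood difference
```

With these numbers θ = (1 · 0.99) / (10 · 0.01) = 9.9: the rarity of failures pushes the threshold up, while the high cost of missing one pulls it down.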
In other words, if the cost of incorrectly raising a failure warning approaches zero, the threshold gets infinitely small, and consequently classifying every event sequence as failure-prone results in minimal cost. On the other hand, if the cost of such a misclassification is high, then it must be quite evident that the current status is failure-prone, i.e., there must be a big difference in sequence log-likelihoods, before a failure warning is raised. In terms of rFF̄ the situation is inverse.

7.2.4 Multi-class Classification Using Log-Likelihood

As can be seen from Figure 2.10 on Page 20, in the approach presented here one non-failure model and u failure models (one for each failure mechanism) are used to predict a failure, which naturally leads to a multi-class classification problem. If sequence likelihoods P(s | λt) were available, Equation 7.12 would have to be used for classification. However, in real applications only log-likelihoods log P(s | λt) are available, and Equation 7.12 cannot be rearranged to include singleton log P(s | λt) terms. Therefore, the multi-class classification problem is turned into a two-class one by selecting the maximum log sequence likelihood of the failure models and comparing it to the log sequence likelihood of the non-failure model:

$\operatorname{class}(s) = F \;\Leftrightarrow\; \max_{i=1}^{u} \log P(s \mid \lambda_i) - \log P(s \mid \lambda_0) > \log \theta \,, \qquad (7.22)$

where θ is as in Equation 7.19. The motivation for the approach is as follows: the failure models are related since they all indicate an upcoming failure. If the system encounters an upcoming failure, the observed error sequence is the outcome of exactly one underlying failure mechanism. Hence, the failure model that is targeted at this failure mechanism should recognize the error sequence as most similar, which is expressed by the maximum sequence log-likelihood. An additional advantage of the approach is that the cost matrix defining θ has only four elements, which can be surveyed and determined more easily.
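The decision rule of Equation 7.22 reduces to a max and a comparison; a sketch with hypothetical log-likelihood values for three failure models and one non-failure model:

```python
import math

def predict_failure(logl_failure_models, logl_nonfailure, log_theta):
    """Eq. 7.22: warn iff the best failure model beats the non-failure
    model by more than log(theta)."""
    return max(logl_failure_models) - logl_nonfailure > log_theta

# Best failure model exceeds the non-failure model by 5 nats: warn.
warn = predict_failure([-120.0, -95.0, -140.0], -100.0, math.log(2.0))
# All failure models clearly below the non-failure model: no warning.
ok = predict_failure([-120.0, -110.0, -140.0], -100.0, math.log(2.0))
```

Working in the log domain keeps the computation numerically safe even when the raw likelihoods would underflow.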
7.3 Bias and Variance

Bayes decision theory has been based on minimizing the classification error for each single observation sequence (c.f. Equation 7.5). However, the classifier is trained from some finite training data set. Analyzing the dependence on training data yields fundamental insights into machine learning, which in turn lead to improved modeling techniques. In order to introduce the concept, bias and variance are first derived for regression, as developed by Geman et al. [104]. With this concept in mind, the work of Friedman [98], who has proposed an analysis of bias and variance for classification, is then described. The purpose of presenting this material is to provide the background for a discussion of bias and variance in the context of failure prediction and for an overview of known techniques to control the trade-off between bias and variance. For further details, please refer to textbooks such as Bishop [30] or Duda et al. [85].

7.3.1 Bias and Variance for Regression

Machine learning techniques usually try to estimate unknown mechanisms or interrelations from training samples, which leads to different resulting models depending on the data present in the training data set.

Figure 7.2: Mean square error in regression problems. Dots in each figure indicate two different training datasets D1 and D2 from which (in this case linear) models y(s; Di) have been trained. Mean square error is determined by (y(s; D) − t(s))², where t(s) is the target value at point s.

This is due to the fact that training data is a finite sample and the system under investigation might itself be stochastic. The following considerations assess the dependence on the choice of training data, resulting in an analysis of bias and variance.
A common way to explain the two terms is to first investigate the mean square error E for regression: the error is measured by the square of the difference between y(s; D), which is the output value for input data point s of some model that has been trained from a training data set D of fixed size n, and the target value t(s) (see Figure 7.2). Since training data is a finite sample, the resulting models may vary with every different training dataset. The expected error over all training datasets for one data point s is computed and decomposed as follows (c.f., e.g., Alpaydin [6]):

$E = E_D\!\left[\big(y(s; D) - t(s)\big)^2\right] \qquad (7.23)$

$= E_D[y^2] - 2\, t\, E_D[y] + t^2 \qquad (7.24)$

$= E_D[y^2] - 2\, t\, E_D[y] + t^2 + E_D[y]^2 - E_D[y]^2 \qquad (7.25)$

$= \big(E_D[y]^2 - 2\, t\, E_D[y] + t^2\big) + \big(E_D[y^2] - E_D[y]^2\big) \qquad (7.26)$

$= \underbrace{\big(E_D[y(s; D)] - t(s)\big)^2}_{\text{Bias}^2} + \underbrace{E_D\!\left[y(s; D)^2\right] - E_D\!\left[y(s; D)\right]^2}_{\text{Variance}} \,. \qquad (7.27)$

Equation 7.27 indicates that the mean squared deviation from the true target data of any machine learning method consists of two parts:

1. the ability to mimic the training data set (bias), and

2. the sensitivity of the training method to variations in the selection of the training data set (variance).

The relation can be understood best if two extreme cases are considered:

• Assume a machine learning technique that memorizes all training data points. Such a technique has a bias of zero. However, the resulting model differs strongly for different selections of the training data set, resulting in high variance.

• Assume a "learning" technique that does not adapt to the training data at all (e.g., a fixed straight line). Then the resulting model is the same irrespective of the data set (zero variance). However, the deviation from the target values is quite high, resulting in a high bias.

The key insight of Equation 7.27 is that in order to obtain a model with small average error on s, both bias and variance must be reduced.
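The decomposition of Equation 7.27 can be verified numerically; a Monte Carlo sketch at a single point s, where the "model" is simply the mean of a noisy training sample (a toy estimator with hypothetical parameters):

```python
import random
random.seed(1)

# Many training datasets of size n are drawn; y(s; D) is the sample mean.
target, n, runs = 2.0, 5, 20000
outputs = [sum(target + random.gauss(0.0, 1.0) for _ in range(n)) / n
           for _ in range(runs)]

mean_y = sum(outputs) / runs
mse = sum((y - target) ** 2 for y in outputs) / runs
bias2 = (mean_y - target) ** 2
variance = sum((y - mean_y) ** 2 for y in outputs) / runs
# Eq. 7.27: the empirical MSE equals bias^2 plus variance exactly.
```

For this unbiased estimator the bias term is near zero and the variance is close to the theoretical 1/n = 0.2, so the whole mean square error is driven by variance.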
A good model achieves a balance between underfitting (high bias, low variance) and overfitting (low bias, high variance), which is also known as the bias-variance dilemma.

7.3.2 Bias and Variance for Classification

The above derivations investigated mean square error for regression problems. Turning to classification, the situation is different. In two-class classification, there are only two target values t ∈ {0, 1}. Mean square error (y(s; D) − t)^2 could be used to measure proximity of the model output to binary target data as well, but this is not a proper approach. Consider, for example, a classifier that yields output y(s) = 0.51 for t = 1 and y(s) = 0.49 for t = 0 for all s. This is a perfect classifier since, with a threshold of 0.5, all s would be classified correctly. However, in terms of mean square error, the classifier would receive a high bias. Friedman [98] was one of the first to investigate this problem and to derive an assessment of bias and variance for classification problems. Although others such as Shi & Manduchi [240] and Domingos [82] have developed the topic further, only the basic findings of Friedman are presented here.

The regression problem of the previous section used the notation y(s; D) to denote the output of the model. In terms of classifiers, classification is based on the modeled posterior class probability (c.f., Equation 7.1):

f̂(s; D) = P̂(c = 1 | s) = 1 − P̂(c = 0 | s) ,   (7.28)

which is an estimate of the true posterior probability f(s) = P(c1 | s). The posterior estimate f̂(s; D) is used to classify input s in a Bayes classifier. In a two-class classification problem, the assigned class label is determined by:

ĉ(s; D) = I_A( f̂(s; D) ≥ r01 / (r01 + r10) ) ,   (7.29)

where I_A(x) is the standard indicator function and r_ta denotes classification risk as in Equation 7.12. Correspondingly, the optimal classification is based on the true posterior:

c_B(s) = I_A( f(s) ≥ r01 / (r01 + r10) ) ,   (7.30)

which results in cost-minimal (Bayes) classification.
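The decision rule of Equations 7.29 and 7.30 can be sketched as follows (hypothetical function name and cost values, chosen for illustration; r01 and r10 follow the notation of Equation 7.29):

```python
def bayes_classify(posterior_failure, r01=1.0, r10=1.0):
    """Return 1 (failure warning) iff the estimated posterior probability
    exceeds the cost-dependent decision level r01 / (r01 + r10)."""
    threshold = r01 / (r01 + r10)
    return 1 if posterior_failure >= threshold else 0

# With equal costs r01 = r10, the decision level is 1/2 (as assumed below).
assert bayes_classify(0.51, r01=1, r10=1) == 1
assert bayes_classify(0.49, r01=1, r10=1) == 0
# If missing a failure is four times as costly as a false warning,
# warnings are already raised at posterior 0.2.
assert bayes_classify(0.3, r01=1.0, r10=4.0) == 1
```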
In order to simplify notation, equal cost r01 = r10 is assumed such that the decision level is set to 1/2. Figure 7.3 shows the situation.

Figure 7.3: True posterior probability f(s) and estimated posterior f̂(s; D) obtained from training using dataset D. In regions of s where f(s) and f̂(s; D) are on the same side of the Bayesian decision boundary 1/2, a correct classification results and the classification error rate is minimal (regions R2 and R4). If not, the classifier based on f̂(s; D) assigns the wrong class label, resulting in maximal classification cost (for s in that region).

Similar to the derivation of bias and variance for regression, the estimated posterior f̂(s; D) is a random variable depending on the training data set D. For one training data set, f̂(s; D) may be on the correct side of the decision boundary (for s), for another data set not. In order to handle this dependency on training data, again the expected value E_D is used to assess the average misclassification rate

P( ĉ(s) ≠ c(s) ) = E_D[ P( ĉ(s; D) ≠ c(s) ) ] ,   (7.31)

where c(s) is the true class of input s. It can be shown that Equation 7.31 can be separated into the minimal Bayes error P( c_B(s) ≠ c(s) ) and a term that is linearly dependent on the so-called boundary error P( ĉ(s) ≠ c_B(s) ). Since the Bayes error does not depend on the classifier, only the boundary error needs to be investigated.

For further assessment, Friedman assumes that the estimated posterior f̂(s; D) is distributed, for varying datasets D, according to p( f̂(s) ), which is unknown in general. However, since many machine learning algorithms (including Baum-Welch) employ averaging, p( f̂(s) ) can be approximated by a normal distribution:

p( f̂(s) ) = N( E_D[ f̂(s; D) ]; Var[ f̂(s; D) ] ) .   (7.32)
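The probability mass of this normal approximation lying on the wrong side of the decision boundary 1/2 (the boundary error discussed in this section) can be sketched numerically (hypothetical function names; the upper tail Φ is expressed via the error function erf):

```python
import math

def normal_upper_tail(x):
    """Upper tail integral of the standard normal distribution."""
    return 0.5 * (1.0 - math.erf(x / math.sqrt(2.0)))

def wrong_side_probability(f_true, mean_f_hat, var_f_hat):
    """Mass of N(mean_f_hat; var_f_hat) on the opposite side of the
    decision boundary 1/2 from the true posterior f_true."""
    z = (0.5 - mean_f_hat) / math.sqrt(var_f_hat)
    if f_true < 0.5:                       # wrong side: f_hat >= 1/2
        return normal_upper_tail(z)
    return 1.0 - normal_upper_tail(z)      # wrong side: f_hat < 1/2

# Estimator centered on the correct side: error shrinks with the variance.
assert wrong_side_probability(0.8, 0.6, 0.01) < wrong_side_probability(0.8, 0.6, 0.25)
# Estimator centered on the wrong side: error exceeds 1/2.
assert wrong_side_probability(0.8, 0.4, 0.01) > 0.5
```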
In order to compute the boundary error P( ĉ(s) ≠ c_B(s) ), the desired quantity is the probability that f̂(s) and f(s) are on opposite sides of the decision boundary 1/2, which yields (see Figure 7.4):

P( ĉ(s) ≠ c_B(s) ) =  ∫_{1/2}^{∞} p( f̂(s) ) df̂     if f(s) < 1/2
                      ∫_{−∞}^{1/2} p( f̂(s) ) df̂    if f(s) ≥ 1/2 .   (7.33)

The two cases can be turned into one using the sign function:

P( ĉ(s) ≠ c_B(s) ) = Φ( sign( f(s) − 1/2 ) · ( E_D[ f̂(s; D) ] − 1/2 ) · Var[ f̂(s; D) ]^{−1/2} ) ,   (7.34)

where the product sign( f(s) − 1/2 ) · ( E_D[ f̂(s; D) ] − 1/2 ) is the boundary bias, the factor Var[ f̂(s; D) ]^{−1/2} accounts for the variance, and Φ is the upper tail integral of the normal distribution.² Plots of the boundary error as a function of f and E_D[f̂] are provided for two values of Var[f̂] in Figure 7.5.

Figure 7.4: Distribution of the estimated posterior.

Figure 7.5: Boundary error P( ĉ(s) ≠ c_B(s) ). Plot (a) shows the dependence on E_D[f̂(s; D)] and Var[f̂(s; D)] for a given true posterior of f(s) = −0.25. Plot (b) shows the dependence on the true posterior f(s) and the expected value E_D[f̂(s; D)] for Var[f̂(s; D)] = 0.05. Note that, depending on the modeling technique, estimates f̂(s; D) may exceed the range [0, 1]. This is not a problem since classification is performed by comparing f̂(s) to the decision boundary 1/2.

Several key insights into the nature of classification error can be gained from this:

1. From Equation 7.34 it can be seen that bias and variance affect each other in a multiplicative rather than additive way, as was the case for regression (c.f., Equation 7.27). This results in the complex relationship seen in Figure 7.5-a.

2. Small classification errors can only be achieved if the variance Var[f̂(s; D)] is low.
However, this is only true if the boundary bias is positive, i.e., if f(s) and f̂(s; D) are on the same side of the decision boundary 1/2. If it is negative, a very large classification error results.

[Footnote 2: Hence Φ(x) = (1/2) (1 − erf(x/√2)).]

3. Except for the special case of zero variance, the error rate depends on the distance of E_D[f̂(s; D)] from the decision boundary 1/2. For this reason, bias is expressed as boundary bias.

4. The error rate of the classifier does not depend on the distance between f(s) and the decision boundary, as long as f(s) and E_D[f̂(s; D)] are on the same side. In Figure 7.5-b, it can be seen that for fixed E_D[f̂(s; D)] the boundary error is the same for all f < 1/2 and for all f ≥ 1/2, respectively.

From this discussion follows that optimal classification³ is achieved for small variance (resulting models are more or less equal regardless of the selection of training data), provided that the training algorithm on average yields an estimate of the posterior probability that is on the correct side of the Bayes decision boundary.

Note that all the formulas derived above evaluate only one single s. If the overall error rate is to be assessed, a further integral is needed:

P( ĉ ≠ c ) = ∫_{−∞}^{∞} P( ĉ(s) ≠ c(s) ) p(s) ds .   (7.35)

7.3.3 Conclusions for Failure Prediction

The detailed analysis of classification error with respect to bias and variance has shown that, first, there is a trade-off between underfitting and overfitting, and second, in the case of classification, small variance is more important than small bias. For this reason, bias and variance have to be controlled in order to achieve a robust classifier. A variety of techniques exists, among which a few are shortly described here, including a discussion of whether they can be used for online failure prediction with HSMMs.

• The most intuitive golden rule for machine learning approaches is to increase the amount of training data.
However, in most real applications the amount of available training data is limited, either because the cost of data acquisition is too high or, as in the case of failure prediction, because data acquisition simply takes too long. Since in most applications one part of the available data is used for training and the other is used to assess the generalization / prediction quality of the models, it is suggested to use techniques such as m-fold cross validation to make full use of the limited data. This technique has also been used in this thesis (see Section 8.3.3).

• Training with noise. In the case that not enough training data is available, noise can be synthetically added to the training data in order to divert the training procedure and to avoid memorizing of training data points (overfitting), hence increasing bias and lowering variance. In the case of regression, “noise” refers to a simple zero mean stochastic process being added to measurement data. However, it is not clear how this concept translates to failure sequences. While a zero mean random number could be added to the delay between error events, it seems hazardous to interchange the event type, which is a nominal, i.e., non-ordinal, variable.⁴ Hence, this technique could not be applied in this thesis.

[Footnote 3: Remember that the overall error rate is the sum of the Bayes error and a term linear in P( ĉ(s) ≠ c_B(s) ).]

• Early stopping. Many machine learning techniques apply an iterative estimation algorithm to stepwise adapt model parameters to the training data. This corresponds to a stepwise transition from under- to overfitting. The idea of early stopping is to evaluate generalization performance with a data set that is not used for training and to halt the training procedure once the validation error begins to rise (see Figure 7.6). Experiments have shown that early stopping does not seem to be an appropriate technique for hidden semi-Markov models. The reason for this is that the Baum-Welch estimation procedure sets all observation probabilities of symbols that do not occur in the training data set to zero in the first iteration. As early stopping can only halt at integer steps, the first possible stop is already “too late”. Combining early stopping with background distributions has also been tried, but this did not result in a significant improvement in comparison to the application of background distributions alone.

Figure 7.6: Early stopping. The error for the training data decreases with every training step, approaching some minimum error. Evaluating generalization performance using a separate validation data set shows an increasing error after some number of training steps due to the fact that the model is overfitting the training data. Early stopping interrupts the training procedure once the validation error begins to rise.

• Growing and pruning. One of the major factors influencing the trade-off between bias and variance is the number of free parameters of the model: provided that there is enough training data, the greater the number of free parameters, the better a model can memorize training data points, resulting in a low bias but high variance. The idea of growing or pruning algorithms is to iteratively increase / decrease the number of free parameters until an optimal solution is found. In hidden Markov models, the number of parameters is mainly determined by the number of states and transitions, and hence algorithms try to add / delete edges or nodes / states following some mostly heuristic rule. Bicego et al. [28] have proposed several pruning algorithms. However, these methods can only be applied to models with recurrent states, which is not the case for the models used for online failure prediction.

• Model order selection.
As discussed above, growing and pruning are not applicable within an automatic rule-based approach. However, “growing and pruning” can be achieved by simple trial and error for some range of model parameters such as the number of states, the number of intermediate states, or the maximum span of shortcuts. In this approach, the most appropriate model is selected by applying techniques such as cross-validation.

[Footnote 4: If a numbering scheme for event IDs similar to the one proposed in Section 5.4.2 is used, adding noise could be applied. However, data of the telecommunication platform did not provide such a numbering.]

• Parameter tying. The number of free parameters of some model classes such as neural networks and hidden Markov models can be reduced if several parameters are “grouped”. In the case of hidden semi-Markov models, for example, the transition parameters p_ij of several transitions can be forced to be equal, which reduces the number of free parameters. However, in order to apply tying wisely, not blindly, strong assumptions and hence detailed knowledge about the modeled process are necessary, which is not the case for the problem addressed in this dissertation.

• Background distributions, intermediate states, and shortcuts. Observation probabilities of hidden Markov models can be mixed with so-called background distributions (c.f., Page 112). Background distributions “blur” the output probabilities of the HMM, which results in an increased training bias but reduced variance. If observation probabilities are trained using the Baum-Welch algorithm (as is the case for this thesis), the application of background distributions is especially important to circumvent the problem of zero probability for observation symbols not occurring in the training data set. Intermediate states and shortcuts added to the model topology (see Section 6.6) have a similar effect on the specificity of state transitions.
Furthermore, HSMMs also allow incorporating background distributions into transition durations. Due to the fact that (a) observation background distributions have been available for the HMM toolkit on which the implementation of HSMMs is based, (b) transition background distributions are within the core of HSMMs, and (c) intermediate states and shortcuts can easily be incorporated by modifying the model structure, these techniques have primarily been used in this thesis.

• Regularization. The techniques described so far have left the core of the training procedure untouched. Regularization methods modify the training procedure itself in that the objective function of training is changed such that model complexity is penalized. Many regularization techniques exist for neural networks (see, e.g., Bishop [30]); for hidden Markov models, however, there are fewer. Hence, regularization has been left for future work.

• Aggregated models. Another group of techniques does not build on one single model but rather on a population of component models that are aggregated to form a big one. One of the predominant techniques is called arcing,⁵ of which bagging and boosting are the most well-known variants. Bagging trains various component models on randomly chosen subsets of the training data. The output of the aggregated model is simply a majority vote among the component models. Boosting, of which AdaBoost⁶ is the most well-known variant, first trains a component model from a subset of training data, and then subsequently trains further component models from data sets that consist half of input data that is correctly classified by the previous component models and half of incorrectly classified training samples. By this method, subsequent component models are somewhat complementary to their predecessors. See Duda et al. [85] for an overview of these methods. In this thesis, aggregated models have not been used.

[Footnote 5: Adaptive Reweighting and CombinING.]
[Footnote 6: “Adaptive Boosting”.]
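As an illustration of the bagging idea just described, the following toy sketch (hypothetical data and a deliberately simple base learner, not from the thesis) trains one-dimensional threshold classifiers ("decision stumps") on random subsets and aggregates them by majority vote:

```python
import random

random.seed(1)

def train_stump(samples):
    """Choose the threshold (among sample positions) with fewest errors."""
    best_thr, best_errs = 0.0, len(samples) + 1
    for thr, _ in samples:
        errs = sum((x >= thr) != y for x, y in samples)
        if errs < best_errs:
            best_thr, best_errs = thr, errs
    return best_thr

def bagging_predict(thresholds, x):
    votes = sum(x >= t for t in thresholds)
    return votes * 2 > len(thresholds)        # majority vote

# Training data: class 1 iff x >= 0.5, with 10% label noise.
xs = [random.random() for _ in range(200)]
data = [(x, (x >= 0.5) != (random.random() < 0.1)) for x in xs]

# Train 15 component stumps, each on 50 points drawn with replacement.
models = [train_stump(random.choices(data, k=50)) for _ in range(15)]

assert bagging_predict(models, 0.95)          # clearly on the positive side
assert not bagging_predict(models, 0.05)      # clearly on the negative side
```

The majority vote makes the aggregate less sensitive to the particular subset each component was trained on, i.e., it reduces variance.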
Nonetheless, the concepts could be applied without restrictions.

7.4 Summary

In this chapter, the theory of the last step of online failure prediction using a pattern recognition approach such as hidden semi-Markov models has been covered: the final classification whether the current status of the system, as expressed by the observed error event sequence, is failure-prone or not. In order to ground the classification process in a theoretical framework, Bayes decision theory has been introduced. It has been shown why the overall error rate of any classifier is minimal if decision boundaries are chosen at the points where posterior probability distributions cross. This concept has been extended to multi-class classification, minimum cost classification, and the use of rejection thresholds. Based on the framework, other straightforward classification schemes have been analyzed. Since in real applications log-likelihood is most commonly used, classification based on log-likelihood has been investigated, leading to the conclusion that only two-class classification can be used. Since the modeling approach of this thesis employs a model for each failure mechanism, all failure-related models are combined using the maximum operator, which is then compared to the sequence log-likelihood of the non-failure model.

The framework of Bayes decision theory is also the foundation for a detailed analysis of classifier error rate in terms of bias and variance. The so-called bias-variance dilemma has been introduced by means of the simpler case of regression. Subsequently, an analysis for classification has been presented. The main purpose of this excursion was to explain the necessity of controlling the bias-variance trade-off of the modeling approach. Finally, a collection of well-known techniques has been described, and each has been discussed in the light of online failure prediction with hidden semi-Markov models.

Contributions of this chapter.
The overview of the main methods to control the trade-off between bias and variance is a collection of the techniques found in several textbooks on machine learning and pattern recognition. Additionally, it is, to the best of our knowledge, the first time the aspect of log-likelihood for multi-class classification is considered. Furthermore, some new figures and plots have been developed in the hope of making Friedman's theory more understandable.

Relation to other chapters. This chapter has covered the third stage of the comprehensive approach to online failure prediction pursued in this thesis: after data preprocessing and HSMM modeling, it has described the step of coming to a conclusion about the current status of the system. This chapter also concludes the modeling part of the thesis. Being equipped with the principal solution to the problem of online failure prediction, the next part turns to the third phase of the engineering cycle: the application of the principal solution to industrial data of a commercial telecommunication system.

Part III: Applications of the Model

Chapter 8: Evaluation Metrics

Having presented the approach to online failure prediction in detail, this third part of the thesis is concerned with the experimental evaluation of the approach. Experiments have been performed on data of an industrial telecommunication system. Before presenting experimental results, this chapter introduces the metrics used for evaluation. Specifically, in Section 8.1 metrics related to failure sequence clustering are presented, and in Section 8.2 metrics to evaluate the accuracy / quality of failure predictions are covered. The evaluation process, including how statistical significance is assessed, is described in Section 8.3.
8.1 Evaluation of Clustering

Data preprocessing includes clustering at two levels: first, when message IDs are assigned to log records, and second, when failure sequences are grouped in order to separate failure mechanisms in the training data (c.f., Sections 5.1.1 and 5.2). Several aspects must be considered in the process of clustering: a (hierarchical) clustering algorithm must be chosen (i.e., agglomerative or divisive clustering), and in the case of agglomerative clustering, the inter-cluster distance metric needs to be defined (i.e., nearest neighbor, furthest neighbor, unweighted pair-group average, or Ward's method). Using dendrograms and banner plots, the choice of methods can be visually investigated in order to see whether the clustering technique results in a clear and reasonable division. A more formal analysis is provided by the agglomerative and divisive coefficients, which try to express “clusterability” as a real number between zero and one. After clustering, the number of groups into which the data is partitioned needs to be determined. Methods for this have been covered in Section 5.2.3, one of which is visual inspection. For visual inspection, dendrograms or banner plots can be used as well.

8.1.1 Dendrograms

Dendrograms are tree-like charts that indicate which data points have successively been merged / divided in the course of agglomerative / divisive hierarchical clustering. In Figure 8.1, dendrograms for a simple six-point example clustered with three different clustering methods are shown.
The tree structure indicates which data points are merged / divided, and the height of the connecting horizontal bar indicates the corresponding level of the distance metric, termed “height”. It can be seen that different clustering algorithms can result in different groupings. In the example depicted in Figure 8.1, divisive and single linkage clustering suggest a division into two groups {A, B} and {C, D, E, F}, while complete linkage clustering suggests three groups {A, B}, {C, D}, and {E, F}.

Figure 8.1: Dendrograms for a six-point example. (a) shows the data points to be clustered. (b) shows the result of divisive clustering (divisive coefficient = 0.78), (c) agglomerative clustering with the single linkage distance metric (agglomerative coefficient = 0.63), and (d) agglomerative clustering using the complete linkage distance metric (agglomerative coefficient = 0.84).

8.1.2 Banner Plots

Although dendrograms provide an intuitive way to present the result of clustering, they get overly complicated if the number of data points is increased. Rousseeuw [215] has introduced banner plots, which are better suited to large data sets. Therefore, in this dissertation, banner plots are used to visualize clustering results. A banner plot is a horizontal plot that connects data points by a colored bar whose length corresponds to the level of division / merge. As is the case for dendrograms, this sometimes requires reordering of data points. Figure 8.2 shows the corresponding banner plots for the dendrograms shown in Figures 8.1-b and 8.1-d.
Note that banner plots for divisive and agglomerative clustering are reversed, since banner plots document the “operation” of the clustering algorithm, i.e., division and merging, from left to right.

Figure 8.2: Banner plots for divisive clustering (a, divisive coefficient = 0.78) and agglomerative clustering based on complete linkage (b, agglomerative coefficient = 0.84). The plots correspond to dendrograms (b) and (d) of Figure 8.1.

8.1.3 Agglomerative and Divisive Coefficient

Dendrograms and banner plots visually give a notion of the data set's “clusterability”. Formal metrics addressing this aspect are the divisive and agglomerative coefficients. For divisive algorithms, let d(i) denote the diameter of the last cluster to which observation i belongs (before being split off as a single observation), divided by the diameter of the whole dataset. For agglomerative algorithms, let m(i) denote the dissimilarity of observation i to the first cluster it is merged with, divided by the dissimilarity of the merger in the final step of the algorithm. Then the divisive coefficient DC and the agglomerative coefficient AC are defined as follows:

DC = (1/n) Σ_{i=1}^{n} (1 − d(i))  ∈ [0, 1]   (8.1)

AC = (1/n) Σ_{i=1}^{n} (1 − m(i))  ∈ [0, 1] .   (8.2)

Both coefficients can be interpreted as the average width of the banner plot, which is also a measure of the “filling” of the banner plot. Since the banner plot is scaled such that the first split / last merger determines one border of the plot, the larger the filled area, the clearer the structure in the data. Hence, AC and DC can be interpreted as indicators of the strength of the clustering structure in the data. However, with an increasing number of observations n, both AC and DC grow and should therefore not be used to compare data sets of very different sizes.
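A minimal sketch of Equation 8.2 (the hypothetical dissimilarity values below are invented for illustration): given, for each observation, the dissimilarity of its first merger and the dissimilarity of the final merger, the agglomerative coefficient is the average of 1 − m(i).

```python
def agglomerative_coefficient(first_merge_dissims, final_merge_dissim):
    """AC = (1/n) * sum(1 - m(i)), with m(i) = first merger dissimilarity
    of observation i divided by the final merger dissimilarity."""
    m = [d / final_merge_dissim for d in first_merge_dissims]
    return sum(1.0 - mi for mi in m) / len(m)

# Clear structure: every point joins its cluster at a tiny dissimilarity
# compared to the final merger, so AC is close to 1.
assert agglomerative_coefficient([1, 1, 2, 2, 1, 2], 20) > 0.9
# Weak structure: first mergers almost as dissimilar as the last, AC near 0.
assert agglomerative_coefficient([18, 19, 17, 19], 20) < 0.15
```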
8.2 Metrics for Prediction Quality

The output of online failure prediction is a binary decision whether the current status of the system is failure-prone or not. Evaluating these binary decisions results in a so-called contingency table, from which a variety of metrics can be inferred. The advantage of these metrics is that an intuitive interpretation of classification results exists. On the other hand, as explained in Chapter 7, decisions are subject to various parameters such as classification cost and prior distributions. While prior distributions can be estimated from the data set, an assignment of classification cost is quite application-specific and not an easy task. Indeed, by the choice of classification cost, a comparison of failure prediction methods can easily be tuned in favor of one or another method. For this reason, classification-independent metrics are also used to evaluate the predictive power of online failure prediction approaches. The purpose of this section is to provide a comprehensive overview of the various evaluation metrics for failure prediction algorithms. However, only

• precision, recall, true positive rate, false positive rate,
• F-measure,
• precision-recall plot,
• ROC plot,
• AUC, and
• accumulated runtime cost

are used in this dissertation.

8.2.1 Contingency Table

Obviously, the goal of any failure prediction is to predict a failure if and only if the system really is failure-prone. However, it can be doubted that any prediction algorithm will ever reach such a one-to-one match between failure predictions and the true situation of the system. In fact, two types of mispredictions can occur:

• The failure prediction algorithm may predict an upcoming failure although the system is running well and no failure is about to occur. This is called a false positive, or Type I error.
In failure prediction, a positive prediction is also called a failure warning, and hence this misprediction is a false warning.

• The failure prediction algorithm may suggest that the system is in a correct, non-failure-prone state although this is not true. Such a misprediction is called a false negative, or Type II error. Since there is no warning about the upcoming failure, this situation is also called a missing warning.

Similarly, there are two cases for correct predictions:

• If the system is correctly identified as failure-prone, the prediction is a true positive or correct warning.
• If the system is correctly identified as non-failure-prone, the prediction is a true negative or correct no-warning.

If, for an experiment, each prediction is assigned to one of the four cases and the number of occurrences of each case is counted, a so-called contingency table is obtained, as shown in Table 8.1. The table is sometimes also called the confusion matrix (e.g., in Kohavi & Provost [146]), and it depends on the lead-time Δt_l, the prediction-period Δt_p, and the data window size Δt_d (c.f., Figure 2.4 on Page 12).

Table 8.1: Contingency table. Any failure prediction belongs to one out of four cases: if the prediction algorithm decides in favor of an upcoming failure, the prediction is called a positive, resulting in the raising of a failure warning. This decision can be right or wrong. If in truth the system is in a failure-prone state, the prediction is a true positive; if not, a false positive. Analogously, in case the prediction decides that the system is running well (a negative prediction), this prediction may be right (true negative) or wrong (false negative).

                          | True Failure          | True Non-failure        | Sum
Prediction: Failure       | true positive (TP)    | false positive (FP)     | positives (POS)
(failure warning)         | (correct warning)     | (false warning)         |
Prediction: No failure    | false negative (FN)   | true negative (TN)      | negatives (NEG)
(no failure warning)      | (missing warning)     | (correctly no warning)  |
Sum                       | failures (F)          | non-failures (NF)       | total (N)

8.2.2 Metrics Obtained from Contingency Tables

Various metrics that express different aspects of the contingency table have been proposed in different research communities. Table 8.2 summarizes these metrics. Although the table already lists the metrics that are used in this thesis, they are briefly discussed in the next paragraphs. Please note further that the terms “precision” and “accuracy” are used differently than in measurement theory, where they refer to the mean deviation from the true value and the spread of measurements. Moreover, there are at least seven more meanings of “precision”.

Table 8.2: Metrics obtained from the contingency table (c.f., Table 8.1). Different names for the same measures have been used in various research areas (rightmost column). Specificity, false negative rate, negative predictive value, and false positive error rate are listed for completeness; they are not further discussed in this thesis as they do not add a fundamentally different view on the contingency table.

Name of the metric          | Symbol  | Formula                          | Other names
Precision                   | p       | TP / (TP + FP) = TP / POS        | Confidence, Positive predictive value
Recall / True positive rate | r, tpr  | TP / (TP + FN) = TP / F          | Support, Sensitivity, Statistical power
False positive rate         | fpr     | FP / (FP + TN) = FP / NF         | Fall-out
True negative rate          | 1 − fpr | TN / (TN + FP) = TN / NF         | Specificity
False negative rate         | 1 − r   | FN / (TP + FN) = FN / F          |
Negative predictive value   | npv     | TN / (TN + FN) = TN / NEG        |
False positive error rate   | 1 − p   | FP / (FP + TP) = FP / POS        |
Accuracy                    | acc     | (TP + TN) / (TP + TN + FP + FN)  |
Odds ratio                  | OR      | (TP · TN) / (FP · FN)            |

Precision and recall. The terms precision and recall have originally been introduced for information retrieval by van Rijsbergen [214].
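The four cells of Table 8.1 can be filled from paired predictions and ground-truth labels, sketched as follows (hypothetical encoding: 1 = failure / failure warning, 0 = non-failure / no warning):

```python
def contingency_table(predictions, truths):
    """Count TP, FP, FN, TN from paired binary predictions and truths."""
    cells = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for pred, true in zip(predictions, truths):
        key = ("T" if pred == true else "F") + ("P" if pred == 1 else "N")
        cells[key] += 1
    return cells

t = contingency_table([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
assert t == {"TP": 2, "FP": 1, "FN": 1, "TN": 2}
```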
Precision is defined as the ratio of correctly identified failures to the number of all failure predictions. Recall is the ratio of correctly predicted failures to the number of true failures:

Precision p = true positives / (true positives + false positives) = correct warnings / failure warnings  ∈ [0, 1]   (8.3)

Recall r = true positives / (true positives + false negatives) = correct warnings / failures  ∈ [0, 1] .   (8.4)

Consider the following two examples for clarification. First, a perfect failure predictor would achieve a precision and recall of 1.0. Second, a real prediction algorithm that achieves a precision of 0.8 generates correct failure warnings (referring to true failures) in 80% of all cases and false positives in 20% of all cases. A recall of 0.9 expresses that 90% of all true failures are predicted and 10% are missed.

Since information retrieval has to cope with extreme class imbalance,¹ precision and recall are also well suited for the evaluation of failure prediction tasks: failures are usually much rarer than non-failures.

There are two boundary cases for which precision and recall are not defined:

• Precision is not defined if there are no positive predictions at all. Since the number of true positives then equals the number of all positives (both are zero), a precision of 1 is used. The same result is obtained if a threshold is involved in classification (c.f., Section 7.2.2): with increasing threshold, the prediction algorithm must be “more sure” about an upcoming failure to issue a warning. Hence precision increases. At some point the threshold is so high that not a single prediction is positive, and precision is hence set to one.

• Recall is not defined if the number of failures in the experiment is zero. However, since testing a failure predictor without any failures in the test data set is not useful, this case is not considered further.
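Equations 8.3 and 8.4 with the boundary-case convention just described can be sketched as (hypothetical helper names):

```python
def precision(tp, fp):
    """Equation 8.3; convention: no positive predictions -> precision = 1."""
    return 1.0 if tp + fp == 0 else tp / (tp + fp)

def recall(tp, fn):
    """Equation 8.4; undefined (and not useful) if there are no failures."""
    return tp / (tp + fn)

assert precision(8, 2) == 0.8
assert recall(9, 1) == 0.9
assert precision(0, 0) == 1.0   # no warnings issued at a very high threshold
```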
Weiss & Hirsh [277] argue that in real applications of failure prediction, first, the same failure might be predicted several times and, second, false positives occurring in bursts should not be counted equally to false positives occurring separately. Therefore, the authors introduce a modified version of precision and recall:

p′ = predicted failures / (predicted failures + discounted false warnings)   (8.5)

r′ = predicted failures / total number of failures ,   (8.6)

where discounted false warnings refers to the number of complete, non-overlapping prediction periods Δt_p associated with a false prediction.

[Footnote 1: Usually the number of relevant documents is much smaller than the total number of documents.]

F-measure. Improving precision, i.e., reducing the number of false positives, often results in worse recall, i.e., an increasing number of false negatives, at the same time. To integrate the trade-off between precision and recall, the F-measure can be used (Makhoul et al. [172]). The F-measure is the weighted harmonic mean of precision and recall, where precision is weighted by α ∈ [0, 1]:

F_α = 1 / ( α/p + (1 − α)/r ) = (p · r) / ( (1 − α) p + α r )  ∈ [0, 1] .   (8.7)

A special case is F_0.5, where precision and recall are weighted equally:

F_0.5 = (2 · p · r) / (p + r) .   (8.8)

If precision and recall both equal zero, the F-measure is not defined, but the discontinuity can be removed such that the F-measure equals 0 in this case.²

False positive rate and true positive rate. The false positive rate is defined as the ratio of incorrect failure warnings to the number of all non-failures:

fpr = false positives / (false positives + true negatives) = false warnings / non-failures .   (8.9)

The definition of the true positive rate tpr is equivalent to recall. However, in combination with the false positive rate, the term true positive rate is used.

Accuracy. All evaluation metrics are concerned with the “accuracy” of failure prediction approaches in the general meaning of the word.
Confusingly, one such measure is actually called accuracy, which is defined as the ratio of correct predictions to all predictions performed:

accuracy acc = (true positives + true negatives) / (true positives + false positives + false negatives + true negatives) . (8.10)

However, accuracy is not an appropriate measure for failure prediction. This is due to the fact that failures are rare events. Consider, for example, a predictor that always classifies the system as non-failure-prone. Since the vast majority of predictions refer to non-failure-prone situations, the predictor achieves excellent accuracy since it is right in most of the cases. Precision and recall, in contrast, measure the percentage of correct failure warnings and the percentage of correctly predicted failures, respectively. Hence, these metrics are more appropriate to assess the quality of failure prediction algorithms.

Odds ratio. Although mainly used in medical research, the odds ratio can be applied to assess failure prediction algorithms. In statistics, odds are a way to describe probabilities in a p : q manner. More specifically, the odds O of an event E are defined as:

O(E) = P(E) / (1 − P(E)) . (8.11)

For example, if 60% of all cats are black, the odds for a cat to be black are 60:40 = 1.5. The odds ratio is defined as the ratio of the odds of an event occurring in one group to the odds of it occurring in another group:

OR(E) = O1(E) / O2(E) . (8.12)

For example, if the odds for mice to be black are 1:10 = 0.1, the odds ratio is 1.5 / 0.1 = 15, expressing that cats are much more likely to be black than mice. Due to the fact that OR(E) can take values from [0, ∞), the odds ratio is skewed.

2 To prove lim(p,r)→(0,0) F(p, r) = 0, it has to be shown that ∀ ε > 0 ∃ δ > 0 such that ∀ (p, r), p, r > 0, with |(p, r) − (0, 0)| < δ: 2 p r / (p + r) < ε (c.f., e.g., Bronstein et al. [39]). The existence of δ can be proven by letting p, r = ε/2, from which follows that δ = ε/√2.
However, taking the logarithm turns it into a measure with values in (−∞, ∞), which additionally is normally distributed such that the standard error and hence confidence intervals can be computed (see, e.g., Bland & Altman [31]). In the case of failure prediction evaluation, the odds ratio is

OR(W) = (TP · TN) / (FP · FN) , (8.13)

expressing the “odds” that a failure warning occurs in the case of a true failure rather than in the case of a true non-failure. However, the odds ratio is equivalent to tpr/(1 − tpr) · (1 − fpr)/fpr, and a comparison with ROC plots, which also relate tpr and fpr (see below), has shown that ROC plots are much more meaningful (Pepe et al. [201]). Therefore, the odds ratio is not used explicitly in this dissertation.

8.2.3 Plots of Contingency Table Measures

The various measures obtained from a contingency table are singleton values that share two restrictions:

1. They evaluate binary decisions. As derived in Chapter 7, binary decisions result from comparison with a threshold θ. Hence contingency table-based metrics are dependent on θ.

2. They represent average behavior over the entire evaluation data set.

If either of the two restrictions is relaxed, a curve rather than a singleton value results. By inspection of these curves, more insight into a predictor’s characteristics can be gained. On the other hand, comparability between failure prediction methods is worse.

Precision-recall curves. To visualize the inverse relationship between precision and recall —improving recall by more frequently warning about an upcoming failure often results in worse precision and vice versa— values of precision and recall can be plotted for various threshold levels. The resulting graph is called a precision-recall curve. Figure 8.3 shows an exemplary plot. Note that neither precision nor recall incorporates the number of true negative predictions. Receiver operating characteristics employ the false positive rate, which indirectly includes the number of true negatives.
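The construction of such threshold-dependent curves can be sketched as follows: each threshold θ over a set of prediction scores yields one contingency table and hence one precision/recall point. Scores, labels, and names below are made-up illustrations, not data from the thesis:

```python
def pr_curve(scores, labels):
    """For each threshold theta, classify score >= theta as a failure warning
    and compute one (recall, precision) point; label True marks a true failure."""
    points = []
    for theta in sorted(set(scores)):
        warn = [s >= theta for s in scores]
        tp = sum(w and l for w, l in zip(warn, labels))
        fp = sum(w and not l for w, l in zip(warn, labels))
        fn = sum(not w and l for w, l in zip(warn, labels))
        p = tp / (tp + fp) if tp + fp > 0 else 1.0  # boundary convention
        r = tp / (tp + fn)
        points.append((r, p))
    return points

# Four scored sequences, two of which precede true failures:
points = pr_curve([0.9, 0.8, 0.4, 0.3], [True, False, True, False])
```

Raising the threshold moves along the curve from high recall towards high precision, as described in the text.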
Figure 8.3: Sample precision/recall plot for two failure predictors A and B. Each point on a curve corresponds to one classification threshold θ. Predictor A shows relatively good precision for most recall values but then drops quickly. In the limiting case that all sequences are classified as failure-prone, a recall of one and a corresponding precision of F/N is achieved. In the opposite case, where no sequence is classified as failure-prone, recall is zero and precision equals one.

Receiver Operating Characteristics (ROC). ROC curves (see, e.g., Egan [88]) are one of the most versatile plots used in machine learning. They plot true positive rate over false positive rate. Since a perfect classifier achieves a false positive rate fpr = 0 and a true positive rate tpr = 1, the closer a curve gets to the upper left corner, the better the classifier. If applicable, points for various thresholds are drawn and linearly interpolated, resulting in a curve.3 As has been shown in Chapter 7, in the case of Bayes classification, θ depends on skewness as well as on the cost involved with the four cases of classification. Figure 8.4 shows ROC curves for three threshold-based predictors / classifiers and a perfect classifier.

Figure 8.4: ROC plot. True positive rate is plotted over false positive rate for varying classification threshold θ. Predictor A shows better performance than B, while predictor C corresponds to random guessing. A perfect predictor would achieve fpr = 0 and tpr = 1.

3 Other methods, such as decision trees (e.g., C4.5), apply a “fixed” classification and hence result in a single point in ROC space.

In order to relate ROC plots to precision and recall, consider the following equivalent formula for precision:

p = TP / (TP + FP) = (TP/F) / ( TP/F + FP/F ) = (TP/F) / ( TP/F + (NF/F) · (FP/NF) ) = tpr / ( tpr + (NF/F) · fpr ) , (8.14)

which is a function of tpr and fpr.
NF/F denotes the ratio of non-failure to failure sequences, i.e., class skewness. It can be shown that iso-precision curves in ROC space are straight lines originating from point (0,0) (c.f., Flach [97]). Keeping in mind that the true positive rate equals recall, each point on the ROC curve can be associated with a value for precision and recall, as shown in Figure 8.5.

Figure 8.5: Relation between ROC plots and precision and recall. Each point on the ROC curve is associated with a precision / recall pair. Iso-precision lines originate at (0,0). In the graph, precision p1 > p2 > p3. Since recall equals true positive rate, corresponding recall values r1, r2 and r3 can be read off directly.

ROC plots, as well as precision-recall plots, account for all possible values of θ, which is one of their major advantages. However, in the special case of failure prediction, one problem occurs with ROC plots. In failure prediction, there is usually non-negligible class skewness, since failures are encountered less frequently than non-failure cases. Therefore, low false positive rates are easily obtained and hence only a small fraction of ROC space is of “interest”. In other words, in many failure prediction approaches, and especially in those evaluating periodically measured data, true negative predictions dominate, which results in a small fpr. Flach [97] has analyzed effects of class skewness on ROC plots and has defined skew-insensitive variants of accuracy, precision, and F-measure. However, these are not considered in this thesis since experiments are carried out on one single data set and hence class skewness is the same for all experiments.

Detection error trade-off (DET). Another way to compensate for class skewness is to use DET curves (Martin et al. [177]). DET curves differ from ROC plots in two ways:

1. Instead of the true positive rate, the y-axis plots the false negative rate fnr = 1 − tpr.
This gives uniform treatment to both types of misprediction: false positives and false negatives.

2. Both axes are plotted on a normal deviate scale. This leads to a linear curve in the case of normal class distributions.

Figure 8.6 shows an example.

Figure 8.6: Detection error trade-off (DET) plot. In comparison to ROC plots, DET plots show the false negative rate fnr = 1 − tpr instead of tpr over the false positive rate. Both axes have normal deviate scale. Curve B corresponds to random prediction, while predictor A is better than random.

The drawback of DET curves is that there is no graphical way to determine minimum cost, as there is for ROC plots (see below). Additionally, DET curves have not yet been established as a standard plot for classification performance evaluation, and no failure prediction related publication has been found that uses them. Hence, DET curves are not further considered.

8.2.4 Cost Impact of Failure Prediction

In Section 7.1.2, a cost or risk matrix was introduced, where rta denotes the cost for assigning class label a to a sequence which in reality belongs to class t; e.g., rF F̄ denotes the cost for falsely classifying a failure-prone sequence as non-failure-prone. If the true positive rate and false positive rate of a failure prediction algorithm are known, its expected cost can be determined as follows:

cost = F/N · [ tpr rF F + (1 − tpr) rF F̄ ] + NF/N · [ (1 − fpr) rF̄ F̄ + fpr rF̄ F ] . (8.15)

The equation distinguishes between all four cases: true and false, positive and negative predictions. F/N determines the fraction of failure and NF/N the fraction of non-failure sequences. The true positive rate (tpr) indicates the fraction of failure sequences that are predicted4 and hence cost rF F is assigned to this case. The same argumentation applies to the remaining three cases. Given a cost / risk matrix, the overall goal is to find a failure predictor with minimum expected cost.
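Equation 8.15 translates directly into code. The following sketch uses illustrative names; the cost values in the example are the ones assumed for Figure 8.7, while the 1:9 failure/non-failure split is an arbitrary assumption:

```python
def expected_cost(tpr, fpr, f, nf, r_ff, r_ffbar, r_fbarf, r_fbarfbar):
    """Eq. 8.15: expected prediction cost. r_ta is the cost of assigning
    class a to a sequence of true class t ('bar' marks non-failure)."""
    n = f + nf
    return (f / n) * (tpr * r_ff + (1 - tpr) * r_ffbar) \
         + (nf / n) * (fpr * r_fbarf + (1 - fpr) * r_fbarfbar)

# Cost setting as in Figure 8.7; a 1:9 failure/non-failure split is assumed.
c = expected_cost(tpr=0.9, fpr=0.1, f=1, nf=9,
                  r_ff=10, r_ffbar=1000, r_fbarf=100, r_fbarfbar=1)
```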
4 “Caught” by the failure predictor.

When analyzing contour lines of classification cost in ROC space, it can be shown that iso-cost lines are straight lines having slope

d tpr / d fpr = (NF/F) · (rF̄ F − rF̄ F̄) / (rF F̄ − rF F) , (8.16)

which is only dependent on class skewness NF/F and the classification cost matrix rij. Figure 8.7 shows iso-cost lines for two values of class skewness.

Figure 8.7: Iso-cost lines. Contours of equal cost are plotted for two class distributions. Solid lines correspond to a ratio of NF : F = 40 : 1, while dashed lines correspond to a ratio of 4 : 1. Classification cost has been assumed to be rF̄ F̄ = 1, rF F = 10, rF̄ F = 100, rF F̄ = 1000 (c.f., Section 7.1.2).

As expected, lower cost is achieved near the top-left corner of the ROC plot. Since the slope of iso-cost lines is only dependent on variables that are determined by the application and not by the classifier, the minimum achievable cost can be assessed by identifying the iso-cost line that is a tangent to the ROC curve (see Figure 8.8).

Cost graphs of Drummond & Holte. In [83], Drummond & Holte propose a way to turn ROC plots into a graph that explicitly shows cost. They define a so-called probability cost function

PCF = (F/N · rF F̄) / ( F/N · rF F̄ + NF/N · rF̄ F ) , (8.17)

expressing the ratio of the cost of misclassifying a failure-prone sequence as non-failure-prone (F/N · rF F̄) to the maximum expected cost, which is the sum of both types of misclassification. Note that PCF only consists of application-specific parameters. Normalized expected cost NE is defined as expected cost divided by maximum cost. It can be shown that

NE = (1 − tpr − fpr) PCF + fpr , (8.18)

Figure 8.8: Determining minimum achievable cost from ROC. Three iso-cost lines c1 < c2 < c3 of slope determined by Equation 8.16 are drawn in the figure.
Minimum achievable cost can be determined by the tangent to the ROC curve.

NE is a linear function of PCF with bounding values fpr and 1 − tpr. Hence, for each point on the ROC curve (i.e., tpr and fpr for a given threshold θ), there is a tpr / fpr pair defining a straight line in the cost graph. If this line is plotted for various ROC points / thresholds, a convex hull results (see Figure 8.9). The convex hull can be used to identify the optimal threshold, resulting in minimal normalized expected cost, for each (application-specific) value of PCF. Furthermore, the intersection of the convex hull with the lines for always-positive and always-negative predictions defines the range of operation in terms of PCF for a given predictor.

Figure 8.9: Cost curves. Varying the classification threshold θ = {θi} for one predictor results in a set of corresponding pairs (tpri, fpri). Each pair defines a straight line showing normalized expected cost (NE) as a function of the probability cost function (PCF). The diagonals correspond to the two trivial predictors that always predict a failure F or non-failure F̄. If for every value on the PCF-axis the minimum value is chosen, a convex hull results (thick line). It can be seen that for some values of PCF, expected cost is greater than for a trivial predictor. This defines the operating range (in terms of PCF) of the predictor.

However, as can be seen from Equation 8.17, the plot only takes misclassification cost rF F̄ and rF̄ F into account,5 but cost for correct classification is not involved. Due to this restriction, and due to the fact that cost is difficult to estimate for the telecommunication system, the cost graphs of Drummond & Holte are left for future investigation.

5 Hence, the cost/risk matrix would have zeros on the main diagonal.

Accumulated runtime cost. All of the above metrics and graphs build on average values for the entire data set.
However, it makes a difference whether a failure predictor runs very well most of the time except for short periods showing bursts of mispredictions, or whether the same number of wrong predictions occurs spread over the whole data set. Accumulated runtime cost graphs yield exactly this insight by adding cost rij for each prediction and showing the step function of accumulating cost over the runtime of the test (see Figure 8.10 for an example). They have initially been developed together with Dr. Günther Hoffmann (see, e.g., Salfner et al. [224]) and have been extended in this dissertation. An accumulated runtime cost curve can be drawn either for several predictors or for varying thresholds θ of one predictor.

Figure 8.10: Exemplary accumulated runtime cost. Cost for all four types of prediction (true / false positive / negative) is plotted as it accumulates over time for two predictors A and B. In the figure, a cost setting of rF̄ F̄ : rF F : rF̄ F : rF F̄ = 1 : 2 : 4 : 8 has been assumed. Shaded areas indicate cost boundaries: maximum cost (each prediction is wrong), cost without failure prediction (failures are missed), cost for a perfect predictor (each prediction correct), and cost for oracle predictions (rF F for each failure). Diamonds (♦) on the time line indicate the time of failure occurrence and circles (•) the time of predictions between failures.

A further advantage of accumulated runtime cost is that cost boundaries can be visualized:

• An oracle, which of course does not exist, would need no evaluation of measurement data. It would just know when a failure is about to occur. Hence accumulating cost would only consist of cost for correct failure predictions rF F occurring each time a true failure is observed.

• In contrast to the oracle, real predictors need to evaluate measurements from the running system.
As each evaluation incurs some cost, real predictors result in higher accumulated cost. However, the perfect predictor, which only performs correct predictions, indicates the minimum cost for any predictor operating at the times of measurements. More specifically, cost of rF F occurs at times of failure and cost of rF̄ F̄ at times of non-failure predictions / measurements. Nevertheless, it must be pointed out that this only determines the minimal achievable cost for one class of predictors. If, for example, measurements and hence predictions are performed much more rarely, lower cumulative cost can result even for non-perfect predictors. One typical example for this is the distinction whether prediction is performed on error events or on periodic measurements of system parameters such as workload: in most systems, errors occur less often than periodic measurements.

• Cost if no predictor is in place can be determined in the following way: At each occurrence of a failure, cost of rF F̄ − rF̄ F̄ occurs, which means that all failures are missed and no predictions are performed in between. The reason why rF F̄ is decreased by rF̄ F̄ is that rF F̄ also includes the cost for performing a prediction. The cost for a prediction without action can be approximated by true negative predictions and hence rF̄ F̄ is subtracted.

• Maximum cost can be determined by assuming all predictions to be wrong. Hence, each non-failure prediction receives cost of rF̄ F and each prediction at the time of failure occurrence receives rF F̄. This also applies only to one class of predictors.

Of course, as is the case for all plots assuming fixed cost, the graph can look significantly different if the ratio of costs rij is changed. Furthermore, the difficulty of estimating the cost / risk matrix for real systems also applies to accumulated cost graphs.
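The step-function construction described above can be sketched as follows; the outcome sequence and the cost mapping (the 1 : 2 : 4 : 8 setting of Figure 8.10) are illustrative assumptions:

```python
def accumulated_cost(outcomes, cost):
    """Add the cost r_ij of each prediction outcome ('TP', 'FP', 'FN', 'TN')
    and return the accumulating step-function values over runtime."""
    total, curve = 0, []
    for o in outcomes:
        total += cost[o]
        curve.append(total)
    return curve

# r_FbarFbar : r_FF : r_FbarF : r_FFbar = 1 : 2 : 4 : 8, as in Figure 8.10
costs = {'TN': 1, 'TP': 2, 'FP': 4, 'FN': 8}
curve = accumulated_cost(['TN', 'TN', 'TP', 'FP', 'FN'], costs)
```

A burst of mispredictions shows up as a steep segment of the resulting curve, which is exactly the temporal insight the averaged metrics cannot provide.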
Nevertheless, since accumulated cost graphs do not build on average values, they provide insight into the temporal behavior of a failure prediction algorithm and are for this reason used in this dissertation.

8.2.5 Other Metrics

Besides the measures obtained from the contingency table (see Table 8.1) and the plots shown here, some other measures should be mentioned.

Area under ROC curve (AUC). The integral of a ROC curve,

AUC = ∫₀¹ tpr(fpr) d fpr ∈ [0, 1] , (8.19)

is a wide-spread measure of classification accuracy. AUC can also be interpreted as the probability that a randomly chosen failure-prone sequence receives a higher rating than a randomly chosen non-failure sequence. AUC turns the ROC curve into a single real number which, in contrast to ROC plots, enables numeric comparison of classifiers. Obviously, a perfect predictor achieves an AUC equal to one and a purely random classifier receives an AUC of 0.5.6 AUC is threshold-independent, which is the major difference to contingency table based metrics. However, AUC has its problems, too:

• AUC equally incorporates all possible threshold values regardless of class skewness (c.f., the discussion of the ROC curve).

• Interpretation of the AUC is not as intuitive as that of contingency table based methods.

• For a given cost setting and class skewness, AUC can be misleading: even though a classifier has a larger AUC than other predictors, it might incur higher cost. For example, in Figure 8.11, the AUC for predictor B is larger than for predictor A. However, the minimal achievable cost for B is C2, which is larger than C1 for predictor A.

Figure 8.11: AUC can be misleading: predictor B (dashed line) has better AUC than predictor A (solid line). However, for a given cost setting and NF/F ratio, the cost incurred by prediction is higher for B than for A, since C2 > C1.

6 Note that the inverse inference is not valid: an AUC of 0.5 does not necessarily imply a random classifier!
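As an illustration, the integral of Equation 8.19 can be approximated by the trapezoidal rule over a finite set of ROC points; the function name is illustrative, not part of the thesis' tool chain:

```python
def auc(roc_points):
    """Trapezoidal approximation of Eq. 8.19; roc_points are (fpr, tpr)
    pairs sorted by fpr, including (0, 0) and (1, 1)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(roc_points, roc_points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

auc_random = auc([(0.0, 0.0), (1.0, 1.0)])               # diagonal classifier
auc_perfect = auc([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])  # perfect classifier
```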
Precision-recall break-even. One special point in precision-recall curves is the point where the precision-recall curve crosses the ascending diagonal. At this point, precision and recall are equal, resulting in a scalar measure that can be used for comparison. However, if precision and recall are not equally significant for the application, this approach is not convincing and is hence not further considered in this thesis.

Further metrics. Many other metrics have been proposed in various scientific disciplines such as data mining or machine learning with decision trees. These include more recently introduced measures such as the G-measure (Flach [97]), weighted relative accuracy (Todorovski et al. [256]), and SAR (Squared error, Accuracy, and ROC area, see Caruana & Niculescu-Mizil [46]), as well as well-known metrics such as the Gini coefficient, lift, Piatetsky-Shapiro, φ-coefficient, etc. These measures could be applied to failure prediction as well; however, to the best of our knowledge, this has not been investigated so far. One exception is the κ-statistic, which has been used by Elbaum et al. [89] to build a detector for anomalous software events within the email client “pine”. The interesting thing about the κ-statistic is that it allows for a “soft” evaluation of prediction performance based on the κ value (see Altman [7] for details).

8.3 Evaluation Process

In the previous sections, the metrics by which the potential to predict failures is assessed have been discussed. In this section, the focus is on the evaluation process, i.e., how the metrics are obtained. The ultimate goal of evaluation is, of course, to assess the potential of a failure prediction approach to predict failures, given some data set. The evaluation process consists of several parts:

1.
Many modeling approaches, such as the one described in this thesis, involve parameters that need to be adjusted, which is also called training.

2. In machine learning, training is based on data. However, this data should not be used for evaluation. Hence the data set needs to be split.

3. In application domains such as failure prediction, the amount of data available for evaluation is limited. Hence a technique called cross-validation is applied.

The following sections discuss each issue separately.

8.3.1 Setting of Parameters

Ideally, parameters involved in modeling should be adjusted such that optimal failure prediction performance is achieved. However, “performance”, as has been discussed, can be assessed by various metrics. Having decided upon one optimization criterion (e.g., the F-measure), theoretically, each parameter in the modeling process should be analyzed with respect to its effect on final failure prediction performance, which implies that each value of each parameter must be tested in combination with each value of each other parameter of the entire modeling process. Not surprisingly, this is hardly feasible given that more than 15 parameters are involved in HSMM failure prediction. For this reason, the evaluation process consists of a mixture of “greedy” and “non-greedy” steps:

• greedy: Parameters that can be set rather “robustly” by some local optimization criterion or heuristic. Local optimization in this context means that not overall prediction performance needs to be evaluated but some criterion that can be computed without a fully trained prediction model. “Robust” in this context means that there is sufficient background knowledge about the effect of the parameter on final prediction performance. “Greedy” also implies that —once adjusted— a parameter is not changed in later stages of the modeling process. An example of a parameter that can be set greedily is the length of the tupling interval ε.
The local optimization criterion here is the number of resulting tuples (c.f., Section 5.1.2).

• non-greedy: Parameters for which no local optimization criterion exists, or about which little is known with respect to their effect on final failure prediction performance, need to be tested in combination with all other parameters that cannot be determined greedily. In order to reduce complexity, the range of values needs to be limited. Additionally, not each single value of the range needs to be explored if it is expected that final prediction performance is a smooth function of the parameter. An example of such a parameter is the maximum span of shortcuts in the structure of the HSMM (c.f., Section 6.6). For each combination of parameters, the full modeling process needs to be performed and prediction performance is assessed with respect to the selected evaluation metric. Since greedy parameter optimizations include only one local optimization each, increasing their number drastically reduces the overall training effort. Chapter 9 will provide details on how each parameter for modeling of the industrial telecommunication system has been set.

8.3.2 Three Types of Data Sets

As described in Section 2.4, a typical two-phase batch learning approach is applied in this thesis: first, a model is trained from previously recorded data, and it is subsequently applied to the running system in order to predict failures online. However, the project from which the data has been acquired did not allow to apply the failure prediction approach to a running system for evaluation. For this reason, failure prediction performance must be evaluated from the data set itself. However, assessing prediction of failures that have been known in the training phase does not yield a realistic estimate of prediction performance —hence the data needs to be separated into disjoint training and test data sets.
However, training involves non-greedy estimation of model parameters, from which follows that these parameters have to be adjusted with respect to the final prediction performance metric. For this reason, the training data set needs to be further subdivided to yield a so-called validation data set. Hence, three types of data sets result:

1. Training data set: The data on which the training algorithm is run.

2. Validation data set: Parameters for which no local optimization criterion is available need to be optimized with respect to final prediction performance (non-greedy estimation). Validation data is used to assess the prediction performance of each setting.

3. Test data set: Final evaluation of failure prediction performance is carried out on completely new data, which is the test data. By this, the generalization performance of the model is assessed, which is taken as an indication of how well the failure predictor would predict future upcoming failures in a running system. Since evaluation is performed on data that has not been available for training and validation, such evaluation is also called out-of-sample.

8.3.3 Cross-validation

In many machine learning applications, so much data is available that it cannot be processed entirely. In this case, the issue is to determine the minimum size of data sets that is needed to assure some statistical significance. In the case of online failure prediction, the situation is different: failure data is always scarce and all available data must be used. It is even so scarce that after splitting data into training, validation and test data, the data sets get too small to yield statistically reliable results. To remedy this situation, m-fold cross validation,7 which exploits the limited amount of data available by cyclic reuse, each time holding out another portion of the data for validation / testing of performance, can be used.
More precisely, for m-fold cross validation, the data is split randomly into m disjoint sets of equal size n/m, where n is the size of the data set. The training and testing procedure is repeated m times, each time holding out a different subset for testing. The remaining portion, which is of size n − n/m, is subsequently split further into training and validation data. A special form of cross validation uses stratification, which means that the distribution of classes NF and F remains the same in each subset. However, stratification can only be applied to validation, since it is one of the main characteristics of the training procedure to separate failure from non-failure sequences in order to deal with class imbalance. Hence, stratification has not been applied here. A further validation variant is Monte-Carlo cross validation (Shao [236]), where the data set is repeatedly divided into a fraction β for testing and (1−β) for training. The procedure has been shown to yield more stable results for selecting the number of kernels in a Gaussian mixture modeling problem (Smyth [245]). However, since, first, it is not clear upfront that Monte-Carlo cross validation also performs better for failure prediction, and, second, it adds another parameter (β) that needs to be determined, only standard m-fold cross validation has been applied in this dissertation.

8.4 Statistical Confidence

In order to gain trust in the assessment of failure prediction quality, each evaluation metric should be accompanied by confidence intervals. For the accuracy evaluation metric, a theoretical analysis is available. A second theoretical analysis for other metrics builds on the assumption of a normal sampling distribution, which cannot be guaranteed. For this reason, confidence intervals are obtained from a well-known resampling strategy called “bootstrapping”.
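The m-fold splitting of Section 8.3.3 can be sketched as follows; function names and the random seed are arbitrary choices, not part of the thesis' tool chain:

```python
import random

def m_fold_splits(data, m=10, seed=0):
    """Split data into m disjoint subsets; yield (rest, held_out) pairs,
    where held_out serves for testing and rest is split further into
    training and validation data."""
    d = list(data)
    random.Random(seed).shuffle(d)
    folds = [d[i::m] for i in range(m)]
    for i in range(m):
        rest = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield rest, folds[i]
```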
7 According to Duda et al. [85], cross validation was invented by Cover [66]. However, Yu [283] claims that cross-validation was first invented by Kurtz [151] and was developed into multi-cross validation by Krus & Fuller [148]. Even more confusingly, Bishop [30] mentions Stone [251] as its inventor.

8.4.1 Theoretical Assessment of Accuracy

Mitchell [184] provides an analysis of confidence intervals for the mean error rate observed in an experiment:

Es = E_S[ P(ĉ(s) ≠ c(s)) ] = (1/n) Σ_{s∈S} (1 − δ_{ĉ(s) c(s)}) , (8.20)

where n denotes the size of the experiment’s data set S = {s}, c(s) is the true value for s, ĉ(s) the estimated value, and δij is the Kronecker delta. Es is also called the sample error rate. Confidence intervals can be obtained from the fact that counting misclassifications within a test data set of size n is a Bernoulli experiment, and the probability to encounter k misclassifications in the test data set is

P(k) = (n choose k) p^k (1 − p)^(n−k) , (8.21)

where p is the true yet unknown error rate. p can be estimated from the number of misclassified sequences in the test data set, which is k, since the maximum likelihood estimate

p ≈ k/n = Es (8.22)

is an unbiased estimator, given that the samples of the test data set have been drawn according to the prior distribution P(s). From the fact that p is estimated as a mean value and that k is binomially distributed follows that the standard deviation of the error rate is approximately

σEs ≈ √( Es (1 − Es) / n ) . (8.23)

For n Es (1 − Es) ≥ 5, the binomial distribution can be well approximated by a normal distribution8 and confidence intervals can be obtained:

CN(Es) = [ Es − zN √( Es (1 − Es) / n ) , Es + zN √( Es (1 − Es) / n ) ] , (8.24)

where zN is the width of the smallest interval about the mean that includes N% of the total probability mass. Finally, a confidence interval for accuracy can be obtained from the relation

acc = 1 − Es . (8.25)

However, Duda et al.
[85] show that unless n is fairly large, the maximum likelihood estimate of p must be interpreted with caution. Furthermore, the analysis is only applicable to error rate / accuracy, but confidence intervals are needed for all of the evaluation metrics presented. Hence this approach is not applied in this thesis.

8.4.2 Confidence Intervals by Assuming Normal Distributions

The central limit theorem states that any sum of independent and identically distributed random variables tends towards the normal distribution. From this follows that statistics such as the mean, which is defined by a sum, also tend to be normally distributed, and hence confidence intervals can be obtained by

C = [ x̄ − s/√n , x̄ + s/√n ] , (8.26)

where x̄ denotes the mean of values observed in the test data set, s denotes the standard deviation, and n denotes the sample size. However, this parametric way to determine confidence intervals only works for statistics that yield normal sampling distributions, which is a strong assumption that cannot be made for all statistics. Furthermore, there is no way to correct for bias or skew of the estimator. For this reason, this approach is also not applied in this thesis.

8 Otherwise, the cumulative binomial distribution must be computed directly.

8.4.3 Jackknife

Quenouille [207] invented an estimation procedure that is applicable to any statistic estimator θ̂. The principal idea of the method is to compute the statistic for a data set from which one single data point has been removed. This is repeated by removing each data point once, and the overall value of the statistic is finally obtained by the so-called leave-one-out mean:

θ̂ = (1/n) Σ_{i=1}^{n} θ̂(i) , (8.27)

where θ̂ denotes the estimate of statistic θ and θ̂(i) is the statistic for the data set from which data point i has been removed.
The major benefit of this method is that bias and variance of the statistic can be estimated, even for statistics that resist theoretical analysis, such as the mode or median. For this versatility, the method has also become known under the term jackknife. Although this method could in principle be applied in this thesis, its major problem is that it processes exactly n subsets. Computational complexity can be limited by leaving out more than one single sequence (similar to m-fold cross-validation), but this in turn deteriorates the quality of the statistic's estimate.

8.4.4 Bootstrapping

Bootstrapping (Efron [87]) adds more flexibility to the estimation process and is currently seen as state-of-the-art (at least in engineering disciplines). According to Moore & McCabe [187], bootstrapping should be applied when the sampling distribution is non-normal, biased, or skewed, or for the estimation of statistics for which parametric estimates of confidence intervals are not available (such as the well-known outlier-resistant 25% trimmed mean). The basic idea of bootstrapping is that, based on one original sample, many so-called resamples are generated by randomly selecting n instances from the original sample with replacement. Similar to the jackknife, the desired statistic is computed for each resample. However, the number of resamplings can be chosen arbitrarily, and the same data point may occur several times in one resample. One explanation of why this method works is that the resulting bootstrapping distribution, which is the distribution of the statistic among resamples, can be shown to approximate the true sampling distribution if the original sample represents reality rather well.
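The resampling idea can be sketched in a few lines of Python. This is an illustration with invented sample values, and it shows a simple percentile interval rather than the BCa intervals used in the thesis:

```python
import random
from statistics import mean

def bootstrap_distribution(sample, statistic, n_resamples, seed=0):
    """Draw resamples of size n from the original sample with replacement
    and collect the statistic of each resample; the sorted collection
    approximates the sampling distribution of the statistic."""
    rng = random.Random(seed)
    n = len(sample)
    return sorted(statistic([rng.choice(sample) for _ in range(n)])
                  for _ in range(n_resamples))

# toy original sample (hypothetical per-fold error rates)
sample = [0.231, 0.198, 0.254, 0.212, 0.243, 0.226, 0.205, 0.238]
boot = bootstrap_distribution(sample, mean, n_resamples=2000)
theta = mean(sample)
bias = mean(boot) - theta                 # bootstrap estimate of bias
ci95 = (boot[int(0.025 * len(boot))],     # simple 95% percentile interval;
        boot[int(0.975 * len(boot))])     # the thesis uses BCa instead
```

Note that the same eight data points are reused thousands of times; the resamples add no new information, which is why bootstrapping cannot compensate for a too-small original sample.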
The statistic's bias can be estimated by

    bias = (1/B) Σ_{b=1}^{B} θ̂_(b) − θ̂ ,    (8.28)

where B denotes the number of resamples, θ̂_(b) the statistic θ computed from the b-th resample, and θ̂ the statistic computed from the original sample. This estimate of bias can be used to yield more reliable confidence intervals, even for biased and skewed sampling distributions. In this thesis, bootstrap bias-corrected accelerated confidence intervals (BCa) have been used, which require that the number of artificial resamples B be set to at least 5000.

8.4.5 Bootstrapping with Cross-validation

As stated before, failure data is scarce and cross-validation needs to be applied to fully exploit the available data. Each step in cross-validation could be analyzed separately and the results combined afterwards. However, bootstrapping cannot compensate for small original samples! Even if the resampling process is run many thousands of times, resamples only consist of the few data points available in the original sample. For this reason, a combination of cross-validation and bootstrapping has been applied in this thesis (see Figure 8.12):

• The complete dataset is randomly divided into ten groups for 10-fold cross-validation.
• Each group is used once as test group:
  – The remaining nine data groups form the training / validation dataset.
  – A model is trained / validated.
  – The resulting model is applied to the data of the test group.
  – Model outputs are stored in a test result dataset.
• After performing this for all ten groups, the evaluation metric (statistic) is computed from the (combined) test result dataset.
• Bootstrapping is applied to the test results, which means that the test results are resampled 5000 times in order to yield BCa confidence intervals for the evaluation metric.

Ten-fold cross-validation has been described above, which implies that ten complete modeling procedures have to be performed.
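The combined procedure can be sketched as follows. This is an illustrative Python fragment with a toy "model"; `train`, `test`, and `metric` are placeholders for the actual HSMM training, prediction, and evaluation metric:

```python
import random
from statistics import mean

def cv_then_bootstrap(dataset, train, test, metric,
                      folds=10, resamples=1000, seed=0):
    """m-fold cross-validation produces one pooled test-result dataset;
    bootstrapping then resamples only these cheap per-sequence results,
    not the expensive model trainings."""
    rng = random.Random(seed)
    data = dataset[:]
    rng.shuffle(data)                                # random division into folds
    groups = [data[i::folds] for i in range(folds)]
    results = []
    for i, test_group in enumerate(groups):
        training = [s for j, g in enumerate(groups) if j != i for s in g]
        model = train(training)                      # one of the m expensive trainings
        results += [test(model, s) for s in test_group]
    point = metric(results)                          # statistic on pooled test results
    boot = sorted(metric([rng.choice(results) for _ in results])
                  for _ in range(resamples))         # cheap resampling of results only
    return point, (boot[int(0.025 * resamples)], boot[int(0.975 * resamples)])

# toy usage: the "model" is just the training-set mean, and a test sample
# counts as correctly predicted if it lies within 30 units of it
point, ci = cv_then_bootstrap(
    list(range(100)),
    train=lambda s: mean(s),
    test=lambda m, x: float(abs(x - m) < 30),
    metric=mean)
```

A simple percentile interval is returned here for brevity; in the thesis, BCa intervals with at least 5000 resamples are computed from the pooled test results instead.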
However, this procedure can be adapted to the computing power available: the number of folds can be increased up to n, which would result in the jackknife method with subsequent bootstrapping of the results. Note that the bootstrapping procedure only operates on the result of the training and testing procedure, which is incomparably less laborious than training 5000 models. In summary, the number of data points in the result dataset from which the statistic is estimated is always n, and bootstrapping tries to compensate for the reduced number of trainings. Please also note that

• cross-validation simulates the variability in selecting the training and test data, whereas
• bootstrapping simulates the sampling process in order to mimic the sampling distribution;

although the two are related, they are not exactly the same.

Figure 8.12: Cross-validation and bootstrapping. The dataset contains three failure sequences (hatched boxes at the top) and seven non-failure sequences (shaded boxes at the top). All sequences are randomly divided into ten groups. Each group is used once as test data set. For each test group, the remaining nine groups are used as training / validation dataset. After training / validation, the sequences of the test group are fed into the model and the results are stored in the test result dataset. The evaluation statistic is computed at the end from all test results. In order to estimate confidence intervals, bootstrapping with 5000 resamples is applied.

8.4.6 Confidence Intervals for Plots

The estimation procedure shown above can be applied directly to contingency table-based metrics such as precision, recall, etc. However, plots such as ROC or precision / recall curves have two equally important dimensions. Fawcett [95] discusses the topic extensively for ROC curves and proposes to compute confidence intervals in both directions by fixing the threshold (see Figure 8.13). The same concept applies to precision / recall curves.
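Threshold averaging can be sketched as follows. This is an illustrative Python fragment with invented curve data; each curve is assumed to be a mapping from a decision threshold to its (false positive rate, true positive rate) point:

```python
from math import sqrt
from statistics import mean, stdev

def threshold_average(curves, thresholds, z=1.96):
    """Fawcett-style threshold averaging: fix the decision threshold,
    collect the (fpr, tpr) point of every curve at that threshold, and
    attach a normal-approximation confidence half-width per direction."""
    n = len(curves)
    out = []
    for t in thresholds:
        fprs = [c[t][0] for c in curves]
        tprs = [c[t][1] for c in curves]
        out.append({"threshold": t,
                    "fpr": mean(fprs), "fpr_ci": z * stdev(fprs) / sqrt(n),
                    "tpr": mean(tprs), "tpr_ci": z * stdev(tprs) / sqrt(n)})
    return out

# two toy ROC curves evaluated at the same decision thresholds
curves = [{0.3: (0.40, 0.90), 0.5: (0.20, 0.80)},
          {0.3: (0.50, 0.80), 0.5: (0.40, 0.60)}]
avg = threshold_average(curves, thresholds=[0.3, 0.5])
```

The averaged curve is then plotted through the mean points, with the per-direction half-widths drawn as vertical and horizontal error bars, as in Figure 8.13.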
Confidence intervals are not investigated for accumulated runtime cost, since this graph depends on one specific excerpt of the data (the times of predictions and failures are shown on the x-axis). As AUC integrates over all threshold values, no threshold-based averaging can be applied either; instead, confidence intervals can be computed directly by the bootstrapping procedure.

8.5 Summary

In this chapter the process of evaluating failure prediction methods has been discussed. Starting from an evaluation of clustering results, which is only relevant for the approach in this dissertation, the subsequent sections discussed failure prediction metrics. There are two principal groups:

1. Contingency table-based measures such as precision, recall, F-measure, false positive and true positive rate, or accuracy. These measures evaluate binary decisions and are hence dependent on one specific decision threshold.

2. Plots that account for various thresholds. In this thesis, precision / recall plots, ROC plots, detection error trade-off (DET) plots, cost curves, and accumulated runtime cost graphs have been presented.

AUC has a special place: although it is obtained from ROC plots, it is a single value and does not depend on one specific threshold.

Figure 8.13: Averaging ROC curves. For each value of the threshold (A, B, C, D, and E), confidence intervals are computed separately for the false positive and true positive rate. The ROC curve is then plotted through the average values.

The subsequent topic addressed by this chapter has been a description of the evaluation process. Three topics have been discussed: greedy vs. non-greedy parameter optimization; the distinction between training, validation, and test data sets; and cross-validation. Evaluating failure prediction by the use of data naturally raises the question of statistical confidence.
Several approaches to the assessment of statistical confidence have been discussed, and it has been argued why most of them cannot be applied to the case of online failure prediction. The discussion concluded that bootstrapping is used in this thesis, and a combination of cross-validation and bootstrapping has been proposed. Finally, it has been described how confidence intervals can be generated for plots having two equally important variables.

Contributions of this chapter.

• To the best of our knowledge, the first comprehensive overview of failure prediction evaluation metrics has been presented.

• A novel evaluation plot —accumulated runtime cost graphs— has been introduced. In comparison to other evaluation techniques, the graph can reveal whether a predictor operates very well most of the time but fails for a short period, or whether false predictions are distributed equally over time. Furthermore, the graph allows the cost incurred by a predictor to be compared with the cost for an oracle, a perfect, or a worst predictor, and with the cost for a system without failure prediction in place. However, these comparisons only hold for predictors of the same class. A further drawback is that the graph is highly sensitive to the assignment of cost for true and false positive and negative predictions.

• To the best of our knowledge, this thesis presents a novel combination of m-fold cross-validation and bootstrapping: the computationally much more expensive task of model training is reduced and at least partly compensated by bootstrapping with a large number of resamplings. Furthermore, this approach allows the limited amount of data to be fully exploited and takes advantage of the state-of-the-art confidence interval estimation offered by the bootstrap.

Relation to other chapters. This chapter has been the first of the third phase of the engineering cycle, in which the modeling methodology is applied to industrial data of the real system.
Having defined the measures for evaluation as well as the procedure by which these measures are obtained, the whole approach will be applied to real data of the industrial telecommunication system in the next chapter.

Chapter 9: Experiments and Results Based on Industrial Data

The failure prediction approach proposed in this dissertation has been applied to industrial data of a commercial telecommunication system. In this chapter, detailed results are provided. The chapter is organized along the process of modeling: starting with the introduction of the case study (Section 9.1) and data preprocessing (Section 9.2), properties of the data set are presented in Section 9.3, and the training of HSMMs is discussed in Section 9.4. The resulting failure predictor is analyzed in detail (Section 9.5), and its dependence on various parameters is investigated in Sections 9.6 and 9.7. Furthermore, a comparative analysis is provided by applying several different prediction techniques to the same data. Note that for readability reasons, in this chapter the term "model" is used not only to denote the class of hidden semi-Markov models, but also a concrete HSMM parametrization, such as a "model with 50 states", which denotes an instance of an HSMM that has 50 states.

9.1 Description of the Case Study

Although the telecommunication system has been briefly introduced in Section 2.2, the description is repeated here for convenience. The main purpose of the telecommunication system is to realize a Service Control Point (SCP) in an Intelligent Network (IN), providing Service Control Functions (SCF) for communication-related management such as billing, number translation, or prepaid functionality. Services are offered for Mobile Originated Calls (MOC), Short Message Service (SMS), or General Packet Radio Service (GPRS).
Service requests are transmitted to the system using various communication protocols such as Remote Authentication Dial In User Service (RADIUS), Signaling System Number 7 (SS7), or Internet Protocol (IP). Since the system is an SCP, it cooperates closely with other telecommunication systems in the Global System for Mobile Communication (GSM); however, it does not switch calls itself. The system is realized as a multi-tier architecture employing a component-based software design. At the time when the measurements were taken, the system consisted of more than 1.6 million lines of code and approximately 200 components, realized by more than 2000 classes, running simultaneously in several containers, each replicated for fault tolerance.

The specification for the telecommunication system requires that, within successive, non-overlapping five-minute intervals, the fraction of calls having a response time longer than 250 ms must not exceed 0.01%. This definition is equivalent to a required four-nines interval service availability (c.f., Equation 2.1 on Page 13). Hence the failures predicted in this work are performance failures.

The setup from which data has been collected is depicted in Figure 9.1. A call tracker kept track of request response times and logged each request that showed a response time exceeding 250 ms. Furthermore, the call tracker provided information in five-minute intervals on whether call availability dropped below 99.99%. More specifically, the exact time of failure has been determined to be the first failed request that caused interval availability to drop below the threshold. The telecommunication system consisted of two nodes connected by a high-speed local network. Error logs have been collected separately from both nodes and have been combined into a system-wide logfile by merging both logs into one based on timestamps (the system runs with synchronized clocks), treating the system as a whole.
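Since both per-node logs are already sorted by timestamp, the merge into one system-wide log is a standard k-way merge. The following Python fragment is a sketch with invented toy records; real log records carry many more fields:

```python
import heapq

# toy log records as (utc_timestamp, node, message) tuples
log_a = [(100.10, "A", "err-1"), (100.60, "A", "err-2"), (102.00, "A", "err-3")]
log_b = [(100.25, "B", "err-4"), (101.50, "B", "err-5")]

# merge the two per-node logs into one system-wide log ordered by timestamp;
# this presupposes synchronized clocks, as in the case study
merged = list(heapq.merge(log_a, log_b))
```

`heapq.merge` requires its inputs to be pre-sorted, which holds here because each node writes its log chronologically.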
Figure 9.1: Experiment setup. Call response times have been tracked from outside the system in order to identify failures. The telecommunication system consisted of two computing nodes from which error logs have been collected.

We had access to data collected on 200 non-consecutive days spanning a period of 273 days. The entire dataset consists of the error logs of two machines, comprising 12,377,877 + 14,613,437 = 26,991,314 log records and including 1,560 failures of two types: the first type (885 instances) relates to GPRS, and the second (675 instances) to SMS and MOC services. Due to limited human resources, only the first failure type has been investigated.

Some notes on the procedure. As has been stated in Section 8.3, there are two strategies for setting parameters: greedy and non-greedy. Obviously, the best parameter setting would be found by trying all combinations of parameters and evaluating each with respect to failure prediction. However, such an approach is not feasible, and a different approach has been taken for the experiments: as long as there is a reasonable way to set parameters directly based on some local criterion or observation, parameters are set by this heuristic. This implies that once a parameter has been set by a "local" criterion or heuristic, its effect on overall failure prediction quality is not checked later, and hence it cannot be determined whether even better prediction results might be achievable with the method. However, since the results achieved by this strictly forward approach are already convincing, there is no need to do so —at least from an engineering point of view. For this reason, the following sections go through the entire data preprocessing and modeling process from the start and investigate each step one after another.

The implementation of the HSMM approach has been accomplished by modifying the General Hidden Markov Model (GHMM) [179] library developed by the Algorithmics group led by Dr.
Alexander Schliep at the Max Planck Institute for Molecular Genetics, Berlin, Germany. The GHMM library, and hence its modifications, are written in C, wrapped by Python classes, which in turn are controlled by shell scripts. Clustering, evaluation, and plotting have been performed using the R statistical language (see, e.g., Dalgaard [74]).

9.2 Data Preprocessing

As explained in Chapter 2, modeling first involves data preprocessing, which consists of several steps. The following investigations explain and analyze each step separately in the order they have been performed on the data.

9.2.1 Making Logfiles Machine-Processable

System logfiles contain events of all architectural layers above the cluster management layer, including 55 different, partially non-numeric variables. Figure 9.2 shows one (anonymized) log record consisting of three lines in the error log:

    2004/04/09-19:26:13.634089-29846-00010-LIB_ABC_USER-AGOMP#020200034000060|020101044430000|000000000000-020234f43301e000-2.0.1|020200003200060|00000001
    2004/04/09-19:26:13.634089-29846-00010-LIB_ABC_USER-NOT: src=ERROR_APPLICATION sev=SEVERITY_MINOR id=020d02222083730a
    2004/04/09-19:26:13.634089-29846-00010-LIB_ABC_USER-unknown nature of address value specified

Figure 9.2: Typical error log record consisting of three log lines (anonymized).

In order to obtain machine-readable logfiles, many steps had to be performed, and the tremendous effort of Steffen Tschirpke, who has done most of the programming for these steps, should be acknowledged at this point. The major steps of logfile preprocessing include:

1. Eliminating logfile rotation. Many large systems perform logfile rotation, which means that logfiles are limited either in size or time span (or both), and once a logfile has reached the limit, logging is redirected to the next file. After logging to the n-th file, logging starts again from the first file in a ring-buffer fashion. This behavior led to duplicated log messages.
The data has been reorganized to form one large chronologically ordered logfile for each computing node.

2. Identifying borders between messages. While error messages "travel" through various modules and levels of the system, more and more information is accumulated until the resulting log record is written into the logfile. In our case, various delimiters between the pieces of information were used, and one log record could even span several lines in the logfile, sometimes quoting the error message several times. For this reason, the logfile had to be parsed in order to generate a log where each line corresponds to one log record, to enforce the use of a unique delimiter, and to assign pieces of information to fixed positions (columns) within the line.

3. Converting time. Timestamps in the original logfiles were tailored to be "processed" by humans and were of the form 2004/04/09-19:26:13.634089, stating that the log message occurred at 7 pm, 26 minutes and 13.634089 seconds on April 9th in the year 2004. In order to be able to, e.g., compute the length of the time interval between two successive error messages, time had to be transformed into a format that can be processed by computers. Real-valued UTC has been used for this purpose, which roughly relates to seconds since Jan. 1st, 1970.

9.2.2 Error-ID Assignment

After preprocessing, the next step involved the assignment of an error ID to each message, as described in Section 5.1.1. In the case of the telecommunication data, there were originally 1,695,160 different log messages. By replacing numbers, etc., the number of different messages has been reduced to 12,533. By applying the Levenshtein distance metric to each pair of messages (resulting in 157,063,556 distances), the log messages could be assigned to 1,435 groups using a constant similarity threshold. Table 9.1 summarizes the numbers.
    Data                     No. of different messages   Reduction in %
    Original                 1,695,160                   n/a
    Without numbers          12,533                      99.26%
    Levenshtein clustering   1,435                       88.55% / 99.92% (original)

Table 9.1: Number of different log messages in the original data, after substitution of numbers by placeholders, and after clustering by the Levenshtein distance metric.

In principle, the task of message grouping is a clustering problem. However, grouping 12,533 data points using a full-blown clustering algorithm is a considerably complex task. Furthermore, the application of such complex algorithms is not necessary. Figure 9.3 provides a plot in which the gray value of each point indicates the distance of the corresponding message pair. Except for a few blocks in the middle of the plot, there are dark steps along the main descending diagonal, and the rest of the plot is rather light-colored. The plot has been created by putting sequences next to each other if their Levenshtein distance was below some fixed threshold. Since plotting similarities is not possible for all messages, Figure 9.3 has been generated from a subset of the data. The figure indicates that strong similarity is only present within groups of log messages and not across message types. Hence a rather robust grouping can be achieved by one of the simplest clustering methods: grouping by a threshold on dissimilarity. The reason why this simple method works rather robustly is that (after replacement of numbers by a placeholder) messages with more or less the same text agree in most parts, while other messages are significantly different.

Note that each error message type corresponds to one error symbol (indicated by A, B, or C in previous chapters). Together with the number of failure types (which are at most

Figure 9.3: Levenshtein similarity plot for a subset of message types. Each point represents the Levenshtein distance of one pair of error messages.
Dark dots indicate similar messages (small distance), while lighter dots indicate a larger Levenshtein distance. Messages have been arranged such that sequences are next to each other if their Levenshtein distance is below some fixed threshold.

two in our case study), the number of different error messages defines the size of the HSMM alphabet. Therefore, the experiments in this case study had alphabets of size 1,436 (1,435 errors plus one failure), since only one failure type has been investigated at a time. Please also note that the memory consumption of the observation symbol matrix B is determined by the number of states times the size of the alphabet. For these reasons, reducing the number of error messages is an important step in the failure prediction approach described in this thesis.

9.2.3 Tupling

As described in Section 5.1.2, tupling is a technique that combines several occurrences of the same event in order to account for multiple reporting of the same problem. In order to determine the optimal time window size ε, the heuristic shown in Figure 5.3 on Page 78 has been applied to the data. The size of the optimal time window is identified graphically by plotting the number of resulting tuples over various values of ε. Figure 9.4 shows the plot for a subset of one million log records of the cluster logfile (which has been obtained by merging the error logs of both machines). The graph strongly supports the claim of Iyer & Rosetti that a change-point value for ε can be identified above which the number of tuples decreases much more slowly. According to the heuristic, ε is chosen slightly above the change point. In order to show that properties related to tupling do not change by merging the two error logs, the tupling analysis has also been performed for each machine's error log separately. As shown in Figure 9.5, the change point for both machines occurs at roughly the same point.
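The tupling step itself can be sketched in a few lines of Python. This is an illustration with invented timestamps and error IDs; successive occurrences of the same error ID are lumped together whenever their inter-arrival time is at most ε:

```python
def tuple_events(events, epsilon):
    """Tupling sketch: `events` is a list of (timestamp, error_id) pairs
    sorted by timestamp; bursts of the same error ID with gaps <= epsilon
    collapse into a single tuple [start, error_id, last_seen, count]."""
    tuples = []
    for t, eid in events:
        if tuples and tuples[-1][1] == eid and t - tuples[-1][2] <= epsilon:
            tuples[-1][2] = t                # extend the current tuple
            tuples[-1][3] += 1
        else:
            tuples.append([t, eid, t, 1])    # start a new tuple
    return tuples

# toy event stream: a burst of error 17, a late repeat of 17, then error 42
events = [(0.000, 17), (0.004, 17), (0.007, 17), (0.100, 17), (0.105, 42)]
tupled = tuple_events(events, epsilon=0.015)   # window chosen above the change point
# the first three type-17 events collapse into one tuple; the fourth starts
# a new tuple because the gap (0.093 s) exceeds epsilon
```

Note that, as in the case study, only bursts of the same message are lumped; an interleaved message from the other machine splits a burst into separate tuples.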
The most striking difference between Figure 9.5 and Figure 9.4 is that the number of resulting tuples is smaller for the single-machine logfiles. This can be traced back to the merging process: tupling only lumps bursts of the same message —if a different message from the second machine is woven into the burst, the burst results in at least two separate tuples. However, the main point of the analysis is that a change point exists and, furthermore, that it occurs at roughly the same value of ε in the single-machine logfiles. Based on this analysis, a value of ε = 0.015 s has been used for the experiments.

Figure 9.4: Effect of tupling window size for the cluster-wide logfile. The graph shows the resulting number of tuples depending on the tupling time window size ε (in seconds).

9.2.4 Extracting Sequences

After tupling, sequences are extracted from the error logs (c.f., Section 5.1.3 and especially Figure 5.4 on Page 79). In order to decide whether a sequence is a failure sequence or not, the failure log written by the call tracker has been analyzed to extract the timestamps and types of failure occurrences. Three time intervals determine the process of sequence extraction:

1. Lead-time ∆t_l. If not specified explicitly, a lead-time of five minutes has been used, although it is shown in Section 9.6.1 that prediction performance is comparably good even for longer lead-times. However, since the lead-time experiments have been carried out relatively late, the previous experiments have not been repeated, and results are reported for a lead-time of five minutes. For large and complex computer systems, it is assumed that proactive fault handling actions such as restart, garbage collection, or checkpointing can be performed within five minutes, i.e., the warning time ∆t_w is shorter than five minutes.

2. Data window size ∆t_d. The analyses presented in the next section are based on a data window size of five minutes.
An explicit analysis of ∆t_d is carried out in Section 9.6.2.

3. Margin for non-failure sequences ∆t_m. This value is used to determine time intervals in which no failure is imminent in the system. Since it cannot be measured whether the system really is fault-free, a value of 20 minutes has been chosen arbitrarily. From an analysis of the failure data, it has been observed that failures often occur in bursts, which are interpreted to be caused by the same instability (fault). Employing a margin of 20 minutes seems to yield a stable separation. For other systems that show long-range failure behavior (e.g., in the order of hours), this value might be too small.

Figure 9.5: Effect of tupling window size for each individual machine.

Non-failure sequences have been generated using overlapping time windows, which simulates the case that failure prediction is performed each time an error occurs.

9.2.5 Grouping (Clustering) of Failure Sequences

The goal of failure sequence clustering is to identify the failure mechanisms contained in the training data set (c.f., Section 5.2). The approach builds on ergodic (fully connected) HSMMs to determine the dissimilarity matrix that is subsequently analyzed by a clustering method. Clustering has been performed using the cluster library of the statistical programming language R.¹ The approach involves several parameters, such as the size of the HSMMs. This section explores their influence on sequence clustering. In order to explore this influence, many combinations of parameters have been tried. Although it is not possible to include all plots here, key results are presented and visualized by plots. In order to support the clarity of the plots, a data excerpt from five successive days including 40 failure sequences has been used.
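The clustering step given a precomputed dissimilarity matrix can be sketched in Python using SciPy, as an equivalent to the R cluster library used in the thesis. The matrix below is an invented toy example; in the case study, its entries are derived from pairwise HSMM sequence likelihoods:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# toy symmetric dissimilarity matrix for six failure sequences:
# two clearly separated groups of three
D = np.array([[0.00, 0.10, 0.20, 0.90, 0.80, 0.90],
              [0.10, 0.00, 0.15, 0.85, 0.90, 0.95],
              [0.20, 0.15, 0.00, 0.90, 0.85, 0.90],
              [0.90, 0.85, 0.90, 0.00, 0.10, 0.20],
              [0.80, 0.90, 0.85, 0.10, 0.00, 0.15],
              [0.90, 0.95, 0.90, 0.20, 0.15, 0.00]])

Z = linkage(squareform(D), method="ward")         # Ward's procedure
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into two groups
```

`squareform` converts the full symmetric matrix into the condensed form that `linkage` expects; changing `method` to `"single"`, `"complete"`, or `"average"` reproduces the other agglomerative variants compared below.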
The HSMMs used to compute the sequence likelihoods had a topology as shown in Figure 9.6 and used exponential duration distributions mixed with a uniform background.

Figure 9.6: Topology of HSMMs used for computation of the dissimilarity matrix. The model shown here has five states and an additional absorbing failure state.

Results are presented for one failure type only. However, the conclusions drawn from the analysis also apply to the second failure type.

Clustering method. As explained in Section 5.2, several hierarchical clustering methods exist. In this thesis, one divisive and four agglomerative approaches have been applied to the same data: the DIANA algorithm described in Kaufman & Rousseeuw [142] for divisive clustering, and agglomerative clustering using single linkage, average linkage, complete linkage, and Ward's procedure. The agglomerative clustering method is called "AGNES", hence this name is also used in the plots to indicate agglomerative clustering. Figure 9.7 shows banner plots (c.f., Section 8.1.2) for all methods using a dissimilarity matrix that has been generated using an HSMM with 20 states and a background level of 0.25. As will be shown next, the choice of the number of states and the background level have only very little impact on clustering results. Therefore, the results look very similar if the clustering methods are applied to dissimilarity matrices computed with another model configuration. The plotting software could not include sequence labels on the y-axis in the plots; however, checking the grouping by hand for some instances yielded similar groupings.

Regarding single linkage clustering first (second row, left), the typical chaining effect can be observed. Since single linkage merges two clusters if they get close at only one point, elongated clusters result.

¹ See http://www.r-project.org, or Dalgaard [74].
Although beneficial for some applications, this behavior does not result in a good separation of failure sequences, yielding an agglomerative coefficient of only 0.45. Hence single linkage is not appropriate for this purpose. Complete linkage (first row, right) performs better, resulting in a clear separation of two groups and an agglomerative coefficient of 0.72. Not surprisingly, average linkage (first row, left) resembles a mixture of single and complete linkage clustering. The result is not convincing, with two single sequences left over. As was the case for complete linkage, it cannot be clearly stated how many groups are in the data. Hence average linkage also does not seem appropriate. Divisive clustering (bottom row, left) divides the data into three groups at the beginning but does not look consistent, since groups are split up further rather quickly; the resulting divisive coefficient is 0.69. Finally, agglomerative clustering using Ward's method (second row, right) results in the clearest separation, achieving an agglomerative coefficient of 0.85. Considering other parameter settings, the picture is always the same: single linkage fails and Ward's method results in the clearest separation. For this reason, Ward's method is considered the most robust and most appropriate for failure sequence clustering and has been used in all further experiments conducted in this dissertation. Nevertheless, there are other parameters of failure clustering, such as the number of states of the HSMMs, which are investigated in the following.

Number of states. Since it is not clear a priori how many states the HSMMs should have, experiments have been conducted with model sizes ranging from five to 50 states. Results for clustering using Ward's procedure are shown in Figure 9.8. It can be observed from the figure that the order in which clusters are merged is very similar for 20, 35, and 50 states, but different for five states.
Although not provable, the effect might be attributed to the number of the model's transitions. Let N denote the number of states²; then the number of transitions equals N · (N − 1) + N = N². Considering the empirical cumulative distribution function (ECDF) of the length of failure sequences (c.f., Figure 9.18-b on Page 197), it can be observed that for N = 5 (i.e., 25 transitions) more than 60% of the failure sequences have more symbols than there are transitions in the model, whereas for N = 20 (i.e., 400 transitions) there is no failure sequence for which this is the case. Although the number of transitions is not directly proportional to a model's recognition ability, it gives an indication. Note that the ergodic models used here can in principle recognize sequences of arbitrary length, but if transitions have to be "reused", probabilities get blurred and the model loses discriminative power.

² Not including the absorbing failure state F.

Figure 9.7: Effect of clustering methods. Five different clustering methods are applied to the same dissimilarity matrix, which has been generated by a 20-state HSMM with 0.25 background weight. The agglomerative clustering algorithm is called "agnes" and the divisive algorithm "diana". For agglomerative clustering, average linkage, complete linkage, single linkage, and Ward's procedure have been used. [Banner plots; coefficients: average 0.57, complete 0.72, single 0.45, Ward 0.85, divisive (diana) 0.69.]
Similar observations can be made if clustering methods other than Ward’s procedure are used. As a rule of thumb, the number of states of HSMMs used for failure sequence clustering should be chosen such that N > √L for the majority of failure sequences, where N denotes the number of states and L the length of the sequence.

Weight of background distributions. It has already been mentioned in Section 6.6 that background distributions must be used with HSMMs since observation probabilities for errors that do not occur in the (single) training sequence are set to zero by the Baum-Welch training algorithm. Hence each failure sequence that contains at least one error message not contained in the training sequence would receive a sequence likelihood of zero (or −∞ in the case of log-likelihood) and no useful dissimilarity matrix would be obtained. Using background distributions, a small probability is assigned to all observation symbols, resulting in non-zero sequence likelihoods. In the experiments, a uniform distribution over all error symbols occurring in the entire set of failure sequences has been used. The effect of background distributions on sequence clustering has been investigated by varying the background distribution weighting factor ρi, which has been equal for all states i of the HSMM (c.f., Equation 6.63 on Page 112). Figure 9.9 shows results for clustering with an HSMM with 20 states using Ward’s method. As can be seen from the plots, varying the background weight only slightly affects grouping. In fact, with increasing background weight more “chaining effects” can be observed and the agglomerative coefficient decreases. The explanation for this behavior is that the single-sequence HSMMs become “more equal” with increasing ρi because the uniform background distribution supersedes the specialized output probabilities obtained from training.
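The mixing just described can be written down directly. This is a minimal sketch of the convex combination b′(o) = (1 − ρ)·b(o) + ρ·u(o) (cf. Equation 6.63); the probability values are invented for illustration.

```python
# Minimal sketch of mixing a trained observation distribution with a uniform
# background. Numbers are illustrative; in the experiments the background was
# uniform over all error symbols occurring in the failure sequences.
import numpy as np

def mix_with_background(b, rho):
    """Convex combination of trained emission probabilities b and a uniform
    background u over the same symbol alphabet: (1 - rho)*b + rho*u."""
    b = np.asarray(b, dtype=float)
    u = np.full_like(b, 1.0 / len(b))
    return (1.0 - rho) * b + rho * u

# Baum-Welch on a single sequence leaves unseen symbols at probability zero...
b_trained = np.array([0.7, 0.3, 0.0, 0.0])
# ...but after mixing, every symbol has non-zero probability, so no sequence
# receives a log-likelihood of minus infinity.
b_mixed = mix_with_background(b_trained, rho=0.1)
```

The result remains a proper distribution (it still sums to one), which is why the mixture can be used directly as an emission distribution.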
The more similar the models, the more equal the sequence likelihoods, resulting in less structure in the dissimilarity matrix. Nevertheless, all background values result in a grouping that is similar to the ones obtained by the majority of clustering approaches. The analysis here is based on Ward’s procedure, but the same effect can be observed for other clustering methods as well. For some of the procedures, clustering is affected if the background distribution weight gets too large. A plot for a background weight of zero has not been included since it could not be used for clustering due to sequence log-likelihoods of −∞. Hence, the conclusion from this analysis is that the background weight does not have much influence on clustering but should neither be too small nor too large. For the case study, a value of 0.1 has been used.

Summary of failure sequence grouping. From the experiments the following conclusions (regarding failure sequence clustering) can be drawn:

• Agglomerative clustering using Ward’s procedure yields the most robust and clearest grouping.

• The number of states of the HSMMs used to compute sequence likelihoods is not critical; however, it should be chosen such that the number of transitions is larger than the number of error symbols of the majority of failure sequences, hence the number of states should be roughly equal to √L.

Figure 9.8: Effect of number of states.
The plots show clustering results using agglomerative clustering with Ward’s procedure for dissimilarity matrices computed by HSMMs with 5, 20, 35, and 50 hidden states (agglomerative coefficient 0.89 in each case).

Figure 9.9: Effect of background distribution weight. One HSMM with 20 states has been trained and dissimilarity matrices have been computed using three different values of the background distribution weight ρi (denoted by “bg” in the plots). The banner plots show results of agglomerative clustering using Ward’s procedure. (Agglomerative coefficients: 0.89 for ρi = 0.05, 0.85 for ρi = 0.25, and 0.82 for ρi = 0.45.)

• Background distributions are necessary in order to yield useful dissimilarity matrices, but the actual value is not very decisive. A value of 0.1 is used in the case study.

9.2.6 Noise Filtering

The goal of the statistical test involved in noise filtering (c.f., Section 5.3) is to eliminate error messages that are not indicative of failure sequences. The idea is to consider only error messages that occur significantly more frequently in the failure sequences in comparison to the expected number of occurrences in a given time frame. The decision is based on a testing variable Xi (c.f., Equation 5.6 on Page 85), which involves the prior probability p̂0i. As described in Section 5.3, three variants exist to compute the priors p̂0i:

1. p̂0i are estimated separately for each group of failure sequences.
2. p̂0i are estimated from all failure sequences, irrespective of the groups.
3. p̂0i are estimated from all sequences, containing failure and non-failure sequences.

Noise filtering has been implemented such that Xi values are stored for each symbol in order to allow for filtering with various thresholds c.
Experiments have been performed on the dataset used previously for the clustering analysis, and six non-overlapping filtering time windows of length 50 seconds have been analyzed. Figures 9.10–9.12 show bar plots of Xi values for each symbol and time window. Figure 9.10 has been generated using group-based priors, Figure 9.11 using failure-sequence-based priors and Figure 9.12 using a prior computed from the entire training dataset. Each figure shows two plots: one for each group of failure sequences. The three figures are ordered by specificity of priors: the group-wise prior is computed from the failure symbols themselves (but without windowing), resulting in rather small values of Xi since the distribution of failures in the time window is very close to the expected distribution. More general priors result in larger values of Xi, as can be seen in Figures 9.10, 9.11, and 9.12.³ Regarding Figure 9.10, it can be observed that the distribution of symbols depends on time before failure. The prior has been computed without time windows, which can be seen as the average over the entire length of failure sequences. Xi values mark the difference to the prior for each time window. The figure shows that the deviation from priors is different for each window. This is an important finding: it is further evidence for one of the principal assumptions of this thesis, namely that timing information —at least time-before-failure— cannot be neglected in online failure prediction. Moreover, Figure 9.10 supports the second principle mentioned by Levy & Chillarege in [162], stating that the mix of errors changes prior to a failure. Due to the fact that the prior is computed for each group separately, the sum of Xi values over all time windows should be equal to zero. Although this is the case for most of the symbols, some violate this equality.
The explanation for this is that sequences of length up to 300 seconds have been used, but only time windows up to 250 s have been plotted for readability reasons.

³ Note that y-axes have been scaled to fit all Xi values.

Figure 9.10: Values of Xi for noise filtering with a prior computed from each cluster of failure sequences. The upper plot is for the first group of failure sequences and the lower for the second group. Within each plot, each group corresponds to one time window. Within each group, each bar corresponds to one error symbol and the y-axis displays the value of the testing variable Xi. Numbers below each group denote the center of the time interval in seconds before failure occurrence.

Figure 9.11: Values of Xi for noise filtering with a prior computed from failure sequences.

Figure 9.12: Values of Xi for noise filtering with a prior computed from all sequences.

Regarding Figure 9.11, it can be observed that the distributions of Xi values are quite different in the two groups. This is due to the fact that the prior has been computed from all failure sequences (regardless of the group), which can be interpreted as an indication that failure sequence grouping supports failure pattern recognition since separate models can be trained that are tailored towards the distributions in each group.
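The three prior variants can be sketched as simple relative-frequency estimates. The sequences below are invented toy data; the variant numbering follows the list in Section 9.2.6 above.

```python
# Sketch of the three variants for estimating the priors p0_i: (1) per group
# of failure sequences, (2) over all failure sequences, (3) over all
# sequences including non-failure ones. Toy sequences for illustration.
from collections import Counter

def symbol_prior(sequences):
    """Relative frequency of each symbol over a set of sequences."""
    counts = Counter(s for seq in sequences for s in seq)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

group1 = [["A", "A", "B"], ["A", "B"]]
group2 = [["C", "C"], ["C", "B"]]
nonfailure = [["B", "B", "B"], ["D", "B"]]

prior_group1 = symbol_prior(group1)                     # variant 1 (per group)
prior_failures = symbol_prior(group1 + group2)          # variant 2
prior_all = symbol_prior(group1 + group2 + nonfailure)  # variant 3
```

The toy data already shows the specificity ordering discussed above: the group-wise prior is closest to the symbols actually observed in that group, while the global prior also reflects symbols (here "D") that never occur in failure sequences.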
The third figure (Figure 9.12), which is based on a prior from failure and non-failure sequences, supports the third principle described by Levy & Chillarege in [162], called “clusters form early”: it can be observed, especially in the lower plot, that a few error symbols heavily exceed their expected values. Furthermore, the effect becomes stronger the closer the time window is to the occurrence of failures (the further right in the plot). In order to investigate the effect of filtering on sequences, the number of symbols within each sequence has been analyzed. Figure 9.13 plots the average number of symbols in one group of failure sequences after filtering out all symbols with Xi < c for various values of c. Again, all three types of priors have been investigated. Regarding first the “global” prior computed from all sequences (solid line), the resulting curve can be characterized as follows: for very small thresholds, all symbols pass the filter and the average number of symbols in failure sequences equals the average number without filtering. At some value of c the length of sequences starts dropping quickly until some point where sequence lengths stabilize for some range of c. With further increasing c, average sequence length drops again until finally not a single symbol passes filtering. The supposed explanation for this behavior is that the first drop results from filtering real noise. The middle plateau indicates some sort of a “gap”, which may result from some significant difference in the data: this is the filtering range where error symbols relevant to failure sequences still get through but background noise is eliminated. At some point c becomes too large even for relevant error symbols to get through and the average number of symbols in failure sequences drops to zero.

Figure 9.13: For each filtering threshold value c, mean sequence length has been plotted.
The solid line shows values for a prior computed from all sequences, the dashed line for a prior computed from all failure sequences, and the dotted line for priors computed individually for each group/cluster of failure sequences.

The plateau around c = 40 is interpreted to result from outliers. Comparing the “global” prior with the two other priors, it can be observed that the curve for a cluster-based prior drops most quickly and the curve for a “global” prior drops most slowly. The reason for this is again the specificity of the priors. For the other priors, plateaus are not as obvious as for the global prior.

Summary of noise filtering. From this analysis it follows that a “global” prior computed from all sequences (failure and non-failure) seems most appropriate. Therefore, further experiments are based on data filtered using such a prior. Similar to the tupling heuristic proposed by Tsao & Siewiorek [258], the filtering threshold c has been chosen such that it is slightly above the beginning of the middle plateau.

9.3 Properties of the Preprocessed Dataset

Before going into details of the modeling process, the preprocessed data has been analyzed. Later sections will refer back to the properties described here. Additionally, data analysis helps to better understand the system under investigation and may also help others to judge whether results presented in this thesis can be transferred to their systems.

9.3.1 Error Frequency

One of the most straightforward methods for online failure prediction is to look at the frequency of error occurrence and to warn about an upcoming failure once the frequency starts to rise significantly. However, as Figure 9.14 shows, such a simple approach is not effective when applied to the commercial telecommunication system.
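To make explicit what such a frequency-based detector looks like, here is a minimal sketch with synthetic timestamps (not the case-study data): count log records per fixed window and raise an alarm when the count exceeds a threshold.

```python
# Naive frequency-based "predictor": count error-log records per five-minute
# window and flag windows whose count exceeds a threshold. Timestamps are
# synthetic; the point is only to illustrate the counting scheme.
import numpy as np

def errors_per_window(timestamps_min, window_min=5.0):
    """Histogram of error timestamps (in minutes) into fixed-size windows."""
    t = np.asarray(timestamps_min, dtype=float)
    edges = np.arange(0.0, t.max() + window_min, window_min)
    counts, _ = np.histogram(t, bins=edges)
    return counts

ts = [0.5, 1.2, 3.9, 4.4, 6.0, 12.5, 12.6, 12.7, 13.0]
counts = errors_per_window(ts)
alarms = counts > 3      # warn when the frequency "rises significantly"
```

As the analysis of Figure 9.14 shows, on the case-study data such alarms are largely uncorrelated with actual failures, which is why this approach is discarded.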
Figure 9.14: Number of errors per five minutes in preprocessed data. Diamonds (♦) indicate the occurrence of failures.

More specifically, the figure shows the number of errors per five-minute time interval. The plot has been generated from data obtained after tupling. As can be seen from the plot, the number of log records varies greatly, ranging from zero to 153 log records within five minutes. Note that Figure 9.14 is only an excerpt of the data. The peak value observed in the data of five successive days (the same data that has also been used in the previous analyses) even reaches 267 log records within five minutes. Performing the same analysis with time intervals of length one second reveals that there are up to eight messages per second. The figure shows that a straightforward counting method would not work well since the pure number of errors seems quite unrelated to the occurrence of failures: failures occur at times with many and with few errors per time interval, and in sections where the number of errors increases as well as decreases. There are time intervals with heavy error reporting but only a few failures, and time intervals with few errors but several failures.

9.3.2 Distribution of Delays

The model for online failure prediction proposed in this dissertation builds on the timing between error occurrences and hence uses probability distributions to handle time between successive errors (delays). This section provides an analysis of delays in error sequences. The theory of HSMMs allows defining a unique convex combination of distributions for each transition. However, it is not possible to determine upfront which transition should have what type of distribution, and it is not practical for real applications.
Therefore, the same combination of distributions has been used for all transitions: each transition, for example, consists of a convex combination of an exponential and a uniform distribution. Note that this does not imply that the distributions are equal: the parameters of the distributions (e.g., the rate λ of the exponential distributions, the combining weight, etc.) are initialized randomly and then further adjusted by the Baum-Welch algorithm. In order to get a picture of delay distributions, the delays occurring in the entire dataset have been analyzed. More precisely, a histogram and quantile-quantile plots (QQ-plots) are provided in Figure 9.15. The dataset used for the analysis comprised 24,787 delays spanning a range from zero⁴ to 29.39 seconds with a mean of 1.404 seconds. The histogram shown at the top left of the figure plots the relative frequency of delays with a resolution of one second. The distribution of the data seems to resemble an exponential distribution except for the peak at 12–13 seconds. It might be supposed that the peak results from some outliers. However, 1,048 delays fall into this category and hence the peak more likely results from some system-inherent property. In order to further investigate which parametric distribution fits the data best, QQ-plots have been generated plotting quantiles of the observed delay distribution against the parametric ones: the normal distribution (middle row, left) obviously fits very badly. This is due to its property that the distribution can take on negative values, which is inappropriate for delays. Exponential (top row, right) and lognormal (middle row, right) fit much better. However, both distributions show a quite bad match for higher quantiles. As HSMMs provide the possibility to mix distributions, the exponential and log-normal distributions have been mixed with a uniform distribution, resulting in an improved fit (except for very large delays), with the exponential being slightly better than the lognormal.
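The convex combination just mentioned can be written down directly. The sketch below evaluates the density w·Exp(λ) + (1 − w)·Uniform(0, t_max); the parameter values are illustrative (in the experiments they are initialized randomly and refined by Baum-Welch), and the point is only that the uniform component lifts the tail where a pure exponential underestimates large delays.

```python
# Sketch of the delay model: a convex combination of an exponential and a
# uniform distribution. Parameter values are illustrative only.
import numpy as np

def mixture_pdf(t, w, lam, t_max):
    """Density of w*Exp(lam) + (1-w)*Uniform(0, t_max) at delays t >= 0."""
    t = np.asarray(t, dtype=float)
    exp_part = lam * np.exp(-lam * t)
    uni_part = np.where((t >= 0) & (t <= t_max), 1.0 / t_max, 0.0)
    return w * exp_part + (1.0 - w) * uni_part

# Rate chosen so the exponential component has the observed mean delay of
# about 1.4 s; the uniform component covers the observed delay range.
t = np.linspace(0.0, 30.0, 4)
dens = mixture_pdf(t, w=0.9, lam=1.0 / 1.404, t_max=30.0)
```

At a delay of 30 s the pure exponential density is practically zero, while the mixture still assigns non-negligible probability, which is the tail-lifting effect visible in the QQ-plots.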
However, further investigations have revealed that very long delays (> 12 s) occur only in 0.41% of all cases and a worse fit of the distribution can be accepted. Based on this analysis, experiments have been performed using a convex combination of exponential and uniform distribution.

9.3.3 Distribution of Failures

Assumptions on the distribution of failures (with respect to their time of occurrence) are used in various areas of dependable computing research. For example, preventive maintenance, reliability engineering, and reliability modeling make use of them. As has been described in Chapter 3, there are also online failure prediction approaches exploiting the time of failure occurrence. Therefore, an analysis of the distribution of time-between-failures (TBF) has been performed. However, since failures are not as common as errors, the entire dataset of 200 days has been analyzed. More precisely, the dataset consisted of 885 timestamps of failures of one type. Figure 9.16 summarizes the results. Similar to the analysis of inter-error delays, a histogram is provided at the top left of the figure. Note that the histogram might not fully represent reality for the first two slots since failures occurring earlier than 20 minutes after a previous failure have been considered as related to the previous one and have been eliminated from the dataset during data preprocessing.

⁴ A delay of zero means that two log records occur with the same timestamp in the log. Technically, this means that the two records have a delay lower than the minimum time resolution of the system, which is about a millisecond in the telecommunication system.
Figure 9.15: Histogram and QQ-diagrams of delays between errors. QQ-plots plot the distribution of delays observed in the dataset versus several parametric distributions: exponential, normal, log-normal, exponential mixed with uniform and log-normal mixed with uniform. The straight line indicates a perfect match of quantiles.
Parameters of the parametric distributions have been estimated from the data (e.g., the mean of the normal distribution has been set to the mean of the data).

Figure 9.16: Analysis of time-between-failures (TBF). The top left plot shows a histogram. The five other graphs plot quantiles of the observed data against quantiles of exponential, normal, log-normal, gamma, and Weibull distributions (QQ-plots).

In addition to the histogram, QQ-plots are provided for the most frequently used distributions in reliability theory. Parameters for the gamma and Weibull distributions have been estimated by maximum likelihood.
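The maximum-likelihood fitting step can be sketched as follows. The TBF sample below is synthetic (lognormal draws standing in for the 885 observed values), so this only illustrates the procedure, not the case-study result.

```python
# Sketch of fitting candidate TBF distributions by maximum likelihood and
# comparing their log-likelihoods. Data are synthetic lognormal samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
tbf = rng.lognormal(mean=4.5, sigma=0.8, size=885)   # minutes, illustrative

fits = {}
for name, dist in [("expon", stats.expon), ("gamma", stats.gamma),
                   ("weibull", stats.weibull_min), ("lognorm", stats.lognorm)]:
    params = dist.fit(tbf, floc=0.0)        # ML fit, location fixed at zero
    fits[name] = float(np.sum(dist.logpdf(tbf, *params)))

best = max(fits, key=fits.get)              # distribution with highest log-lik
```

Since the synthetic data is lognormal, the lognormal fit wins here; on the real TBF data the dissertation reaches the same conclusion via the QQ-plots of Figure 9.16.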
The interesting observation here is that the frequently used exponential distribution yields a relatively bad fit. Other frequently used distributions such as the gamma or the Weibull distribution do not really fit the data either. The best approximation is obtained by a lognormal distribution. Results of a second analysis are provided in Figure 9.17. In order to investigate whether some periodicity is present in the data, the normalized autocorrelation of failure occurrence has been plotted. More specifically, the data has been divided into buckets of five-minute intervals and the autocorrelation has been computed for lags of up to 240 minutes. The observation is that there is almost no periodicity in failure occurrence, which is the reason why periodic prediction does not work for the case study (see Section 9.9.4).

Figure 9.17: Normalized autocorrelation of failure occurrence. Failure data has been grouped into buckets of five-minute intervals and autocorrelation has been computed for lags of up to 240 minutes.

9.3.4 Distribution of Sequence Lengths

Error sequences are delimited by time ∆td and hence an analysis of the length of sequences in terms of the number of errors is provided here. For the test dataset, a histogram of the number of symbols is shown in Figure 9.18-a. Taking only failure sequences into account, Figure 9.18-b plots the empirical cumulative distribution function.

Figure 9.18: (a) Histogram of length of all sequences. (b) Empirical cumulative distribution function (ECDF) for the length of failure sequences.

The histogram of all sequences (Figure 9.18-a) shows two peaks, one around 50, the other around 225 symbols.
This means that a large number of sequences have either around 50 or around 225 symbols, although most of the sequences span a time interval of five minutes. An explanation for this phenomenon is that the system writes either a great many error log records or only a few, depending on the varying call load on the system present in the rather small excerpt of data (as can also be observed in Figure 9.14). An analysis of the entire data set showed a more even distribution. In order to be consistent with the other analyses presented in this section, the distribution has been plotted as is. Regarding failure sequences (Figure 9.18-b), the empirical cumulative distribution function is presented since it is the appropriate visualization for the argumentation used in Section 9.2.5. The reason why the maximum length of failure sequences is smaller than for all sequences is simply random variability: a separate investigation has shown that there are also failure sequences with more than 200 symbols. Again, for the reason of consistency, the plot is provided for the same data that has been used in the investigations of previous sections. Comparing Figure 9.18-b to Figure 9.13, it might look surprising how an average length of 25 can result from the ECDF shown in Figure 9.18. The explanation for this is that Figure 9.13 only plots the average length of sequences belonging to one failure group. The second group has an average length of 75.2 without noise filtering.

9.4 Training HSMMs

In the previous sections, data preprocessing has been explored, which is not necessarily focused on HSMM failure prediction. This section describes and analyzes the steps involved in training HSMMs for failure prediction. Note that for reasons of legibility the previous analysis was based on a small excerpt of data. In order to yield more reliable results, a larger data set has been used for the experiments described in the following sections.
9.4.1 Parameter Space

A lot of parameters are involved in modeling. Although most of them have already been mentioned and/or explained in previous chapters, an overview is provided here. Moreover, the number of parameters and their possible values is too large to compare all combinations. Hence some parameters have been explored in detail while reasonable values, based on an “educated guess”, have been assumed for others (this approach has been termed greedy versus non-greedy in Section 8.3.1).

Parameters that have been set heuristically. No experiments have been performed for the following parameters. Instead, values have been chosen according to the reasons described.

• Intermediate probability mass and distribution. In the experiments, 10% of the probability mass of each transition has been distributed among intermediate states (c.f., Section 6.6). The distributions themselves have been chosen to be normal distributions since they are centered around the mean, which is useful for the requirement that the sum of the mean intermediate durations should equal the mean duration between the original states that are extended.⁵ Since for uncorrelated random variables the following property holds:

    Var( ∑ᵢ Xᵢ ) = ∑ᵢ Var(Xᵢ) ,     (9.1)

the variance of the intermediate distributions has been set to the variance of inter-error delays divided by the number of intermediate states plus one. The assumption that two successive delays are uncorrelated might not hold,⁶ however it led to reasonably good prediction results.

• Number of tries in optimization. As stated before, the Baum-Welch algorithm converges to a local optimum starting from a random initialization. The problem is that it cannot be determined whether the local optimum is close to the global one or not. Ignoring more sophisticated techniques such as evolutionary strategies, the Baum-Welch algorithm has simply been performed 20 times and the best solution in terms of maximum overall training sequence likelihood has been chosen.
• Type of background distributions. In principle, the concept of background distributions for observation probabilities allows arbitrary distributions to be used. In this thesis, the distribution of symbols estimated from the entire training data set has been used since this reflects the overall frequency of error occurrence.

⁵ Due to the central limit theorem, if there are many intermediate distributions having finite variance, the sum approximates a normal distribution anyway.
⁶ E.g., due to the bursty behavior described in Section 9.3.4.

Parameters that have been varied. Several experiments have been performed in order to determine the effects of the following parameters. Results of these experiments are provided in the next section.

• Number of states. As can be seen from the figures on the principal prediction approach (Figure 2.9 on Page 19 and Figure 2.10 on Page 20), u + 1 HSMMs are involved, where u is the number of groups obtained from failure sequence clustering. Each model consists of N states. The question is how the number of states affects the modeling process. Since the prediction models have a strict left-to-right structure (c.f., Figure 6.10 on Page 126), the maximum number of transitions along any path through the model is N − 1. From this one might conclude that the models should have as many states as there are symbols in the sequences. On the other hand, the larger the model, the more model parameters have to be estimated from the same limited amount of training data, resulting in worse estimates. Therefore, a better solution might be obtained if an HSMM with fewer states is used and some very long sequences are ignored.

• Maximum span of shortcuts. Figure 6.10 on Page 126 shows that there are shortcut transitions in the model bypassing several states. Increasing the maximum span of shortcuts increases the flexibility of the models but almost doubles, triples, etc. the number of transitions and hence the number of transition parameters.
• Number of intermediate states. After training, intermediate states are added to the model (c.f., Figure 6.11 on Page 127). The number of states added between each pair of states affects the generality of the model. If there are no intermediate states, the models might be overfitted. If there are too many, the model is too general.

• Amount of background weight. Background distributions are an important way to reduce the variance of hidden Markov models. The weight ρi by which background distributions are mixed with the observation distributions obtained from training also affects the bias-variance trade-off.

9.4.2 Results for Parameter Investigation

Four parameters have been listed that need to be explored with respect to failure prediction performance. One way to investigate their effect would be to perform a separate experiment for each parameter. However, such an approach has two problems: first, it neglects coherence among parameters and second, while testing one parameter, it is not clear what (fixed) values should be assumed for the others. However, an investigation reveals that there are two ways in which the parameters influence the model:

1. The number of states and the maximum span of shortcuts determine the number of parameters (i.e., the degrees of freedom) of the HSMM that need to be optimized from a fixed and finite amount of training data. The trade-off is that a higher degree of freedom in principle allows the model to better adapt to the data specifics; however, since more parameters need to be estimated from the same amount of data, the estimates get worse, resulting in worse adaptation to data specifics.

2. The number of intermediate states and the amount of background weight affect the generality of the models after training. More general models can account for a larger variety of input data.
On the other hand, models that are too general yield blurred sequence likelihoods, which in turn can result in worse classification results. Therefore, the parameters have been investigated in two groups. First, models are trained for various combinations of the number of states and the maximum span of shortcuts. In a second step, each resulting model is altered by adding intermediate states and applying some amount of background weight. Tests are performed in order to evaluate the dependence of failure prediction quality on all four parameters. Additionally, failure prediction depends on the final classification threshold θ. In order to eliminate the dependence on θ, for each combination of the four parameters, various values of θ have been investigated and the maximum F-measure has been used to compare prediction results.

Training with varying number of states and maximum span of shortcuts. The number of states and the maximum span of shortcuts are integer variables. However, a complete enumeration of values is not possible. If, for example, the maximum span of shortcuts were varied from zero to five and the number of states from 20 to 500, 2886 combinations of model parameters would have to be tested, which is not feasible since preparation of the data, setup of models, training, testing and evaluation of prediction results would be too time consuming. Hence, only some values for the maximum span of shortcuts and the number of states have been selected and all their combinations have been tested. More specifically, HSMMs with 20, 50, 100, and 200 states have been investigated. Larger models could not be considered due to requirements both in terms of memory and computing time. The maximum span of shortcuts has been varied from zero to three.
This selection is based on the following reasons: shortcuts are introduced to account for missing errors in failure sequences (e.g., if a symptomatic pattern is B-A-A-B but one example sequence consists only of B-A-B, shortcuts enable both sequences to be aligned). By limiting the maximum span of shortcuts to three, it is assumed that no more than three successive errors are missing. Even if this case occurs, the sequence with missing errors can still be aligned from the state after next onward. Furthermore, this limitation is sufficient since the best failure prediction results are achieved with a shorter maximum span of shortcuts, as is shown in the next paragraphs. Also note that shortcuts are not necessarily required to handle short sequences: due to the initial probabilities πi, a short sequence may start “in the middle” of the model.

In order to visualize the trade-off, the average training sequence log-likelihood is plotted. For legibility reasons, the negative of the sequence log-likelihood is shown in Figure 9.19; that means the higher the bar, the worse the training result, which can be seen as a sort of training error. The dataset used for these experiments consisted of 3650 sequences, among which are 278 failure sequences.

Looking at training likelihoods for a maximum shortcut span of zero (first column in Figure 9.19), it can be observed that adaptation to the training data improves with an increasing number of states up to a model with 100 states, but gets worse for a model with 200 states. Regarding the effect of the maximum span of shortcuts, it can be seen that incorporating shortcuts spanning one to three states deteriorates models with 20 and 50 states but improves training for models with 100 or 200 states. Overall, the best training result is achieved with a model of 100 states and a maximum shortcut span of one.
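Two scalar quantities underlie these comparisons: the average negative training sequence log-likelihood plotted in Figure 9.19, and the maximum F-measure over all classification thresholds θ used later to compare prediction results. The following is an illustrative sketch only, not the evaluation code of the thesis; `scores` stands for whatever per-sequence output the predictor thresholds (a hypothetical stand-in), and the sequence likelihoods P(O|λ) would come from the forward algorithm:

```python
import math

def avg_neg_log_likelihood(likelihoods):
    """Mean negative log-likelihood over training sequences; the higher the
    value, the worse the fit to the training data (bar height in Fig. 9.19)."""
    return -sum(math.log(p) for p in likelihoods) / len(likelihoods)

def max_f_measure(scores, labels):
    """Maximum F-measure over all candidate classification thresholds theta,
    eliminating the dependence of the comparison on theta."""
    best = 0.0
    for theta in sorted(set(scores)):
        tp = sum(s >= theta and y for s, y in zip(scores, labels))
        fp = sum(s >= theta and not y for s, y in zip(scores, labels))
        fn = sum(s < theta and y for s, y in zip(scores, labels))
        if tp:
            p, r = tp / (tp + fp), tp / (tp + fn)
            best = max(best, 2 * p * r / (p + r))
    return best
```

For example, a score vector that ranks all failure sequences above all non-failure sequences reaches a maximum F-measure of 1.0 at some threshold.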
The following conclusions can be drawn from these observations:

Figure 9.19: Average negative training sequence log-likelihood for several combinations of the number of states and maximum span of shortcuts.

1. Models with 20 and 50 states seem to be inappropriately small, since the number of states determines the maximum length of sequences that can be handled. Since shortcuts do not remedy this problem but only introduce additional parameters, training results get worse due to worse probability estimates.

2. As can be seen from the experiments without shortcuts, models with 200 states are too large. With unlimited training data one would expect the average negative training log-likelihood to be smaller than for 100 states, since the model has more degrees of freedom and can hence better adapt to the training data. Therefore, the fact that the training likelihood is worse than for models with 100 states is attributed to worse parameter estimation from the limited amount of training data. Furthermore, since the Baum-Welch algorithm assigns some small fraction of the probability mass to all transitions, results also get blurred if there are too many transitions.

3. Considering only models without shortcuts, models with 100 states achieve the minimum negative log-likelihood. By introducing shortcuts of length one, results can be improved further. The fact that the negative training log-likelihood increases if shortcuts spanning more states are included can be explained by the same effects as in point 1.

Note that these investigations do not automatically allow the conclusion that models with 100 states and shortcuts spanning one state should be used for online failure prediction, since such models could be overfitted to the training data, as will be discussed in the next section.

Number of intermediate states and amount of background weight.
Intermediate states and background distributions are applied after training to control the trained model's generalization capabilities. Overfitting can be reduced either by using fewer intermediate states and more background weight or vice versa. This is the principal reason why non-greedy parameter selection is necessary: a model with worse training sequence likelihood might, after the introduction of intermediate states and the application of background distributions, result in better failure prediction performance than the model with the best training results (see the discussion of bias and variance in Section 7.3). Hence, all 16 combinations of the number of states and shortcuts have been combined with zero to three intermediate states per transition and with five levels of background distribution weight. This selection is based on the following considerations: similar to the introduction of shortcuts, the introduction of intermediate states aims at aligning sequences with additional errors in between symptomatic ones, and for similar reasons the introduction of up to three intermediate states per transition is sufficient. Background weight is a real-valued parameter, and hence five values have been selected spanning a range from zero to 0.2.

In order to evaluate each combination, the maximum achievable F-measure with respect to out-of-sample prediction of validation sequences has been determined. Out-of-sample means that the validation sequences have not been available for training. Since it is not possible to present the results of all 320 combinations here, the three most important findings are described in the following:

1. The application of background distributions can increase failure prediction performance for all combinations. However, this is only true if the background distribution weight is rather small.
Values of the background distribution weight that are too large quickly result in “random” models with worse prediction performance than models without a background distribution. Hence, in later experiments a background weight ρi of 0.05 has been applied (cf. Equation 6.63 on Page 112).

2. A similar effect can be observed for the introduction of intermediate states. Overall, the effect of adding intermediate states to the models did not meet expectations: failure prediction performance could only be improved slightly when one intermediate state per transition was added. This setting has been used for further experiments.

3. One setting for a model with 50 states and no shortcuts achieved roughly the same failure prediction quality as the model with 100 states and a maximum shortcut span of one, which supports the described observation that the model with the best training likelihood does not guarantee optimal prediction performance on test data. On the other hand, the model with the best training likelihood belongs to the set of models with the best failure prediction performance. Therefore, the model with 100 states has been used for further experiments, since it can account for longer sequences.

Computation times. A theoretical analysis of the algorithms' complexity has been provided in Chapter 6, but the analysis was rather coarse-grained and only took the number of states and the length of the sequence into account. Although the effect of the four parameters investigated in this section could in principle be traced down to the number of states and the number of edges, or even further to the number of multiplications and additions, such a full-fledged analysis is not provided here. Instead, the time needed to train the models and to classify test data has been measured several times on one and the same machine, which allows at least a relative estimate of the effect of the parameters.
Training time is affected only by the number of states and the maximum span of shortcuts; testing time is additionally influenced by the number of intermediate states. The amount of background weight has no influence on testing times since the output probabilities bi(Oj) are altered before testing starts. Figures 9.20 to 9.22 show the results.

In Figure 9.20, the mean training time for all 16 combinations of parameters is shown. Training time is determined by the time needed to train one model.

Figure 9.20: Mean training time depending on the number of states and maximum span of shortcuts.

Not surprisingly, training time increases both with the number of states and with the maximum span of shortcuts, since both increase the number of parameters that need to be estimated from the training data set. The figure suggests that the number of states has a stronger influence than the maximum span of shortcuts. One reason is that the maximum span of shortcuts only increases the number of transition parameters, which are only a subset of all parameters that need to be determined. For the configuration used in further experiments (100 states, maximum shortcut span of one), a mean training time of 1365 seconds resulted.

With respect to testing, 75% trimmed mean testing times are plotted in Figure 9.21. Testing time is determined by the mean time needed to perform a prediction on one single sequence. In Figure 9.21-a, processing time is plotted depending on the number of states and the maximum span of shortcuts. The number of states clearly dominates testing time, which can again be explained by the fact that the maximum span of shortcuts only increases the number of transitions (and no state-dependent parameters) and hence only affects the innermost loops of the algorithm. In addition to the number of states and the maximum span of shortcuts, testing time is also influenced by the number of intermediate states.
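The alteration of the output probabilities mentioned above, i.e., mixing each trained distribution bi with a background distribution before testing, can be sketched as follows. The exact mixing form of Equation 6.63 may differ; the uniform background and the convex combination below are illustrative assumptions:

```python
def mix_background(b_i, rho=0.05):
    """Mix a trained discrete output distribution b_i with a uniform
    background distribution using weight rho:
        b'_i(o) = (1 - rho) * b_i(o) + rho * (1 / M)
    Illustrative sketch: every symbol keeps some non-zero probability,
    which reduces the variance of the trained model."""
    m = len(b_i)
    return [(1 - rho) * p + rho / m for p in b_i]
```

Since the mixture is convex, the result remains a probability distribution, and no symbol retains probability zero.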
Figure 9.21-b shows 75% trimmed mean testing times depending on the number of states and the number of intermediate states for a maximum shortcut span of one. Surprisingly, computation time decreases with the introduction of intermediate states. An analysis has revealed that this is due to the fact that with intermediate states, probabilities decrease more quickly in the forward and backward algorithm, such that shortcuts implemented in the algorithm (which skip computations once probabilities fall below a certain threshold) are executed for some sequences.

Performing online failure prediction is a real-time application. However, no full-fledged real-time analysis can be presented here. Instead, Figure 9.22 shows the upper limits of 95% confidence intervals on mean testing time. This is obviously no guarantee that the algorithm can always be performed in real time. However, two things should be taken into consideration: first, the algorithm operates on a lead-time that is much larger (e.g., five minutes), hence there is some room for “buffering”. Second, errors occur in short bursts separated by longer time intervals with only very few errors. This means that there is some chance for the algorithm to catch up. If not, the algorithm could simply ignore some sequences (note that the processing times shown in Figures 9.21 and 9.22 refer to an entire sequence).

Figure 9.21: Computation time needed for testing a single sequence depending on the number of states and maximum span of shortcuts for one intermediate state (a), and depending on the number of states and number of intermediate states for a maximum shortcut span of one (b).

Figure 9.22: 95% upper confidence interval limits for mean testing times corresponding to Figure 9.21.

In the experiments, one non-failure model and two failure models have been used. With respect to testing, the effect of the number of groups is linear.
However, with respect to training, the effect is more complex, since with an increasing number of groups there are fewer training sequences in each group, partly compensating for the overhead of training more models. The number of groups is expected to reflect the number of failure mechanisms in the system and is determined during data preprocessing. Nevertheless, the effect has been analyzed: the same data has been processed with only one failure group. It has been found that the total time of training with only one non-failure and one failure model is increased by approximately 20%, since more iterations on a larger training data set have to be performed.

9.5 Detailed Analysis of Failure Prediction Quality

In the previous sections the parameters involved in setting up an HSMM-based failure predictor have been investigated. Although some model parameters have been assessed with respect to failure prediction, only the maximum F-measure has been used. In this section, the quality of failure prediction is assessed in more detail. Specifically, Section 9.5.1 focuses on precision, recall, and F-measure; Section 9.5.2 provides ROC curves and related metrics; and Section 9.5.3 deals with cost-based metrics. All experiments shown here have been performed using the parameter settings listed in Table 9.2.

Table 9.2: Experiment settings for detailed analysis.
  lead time ∆tl: 5 min
  data window length ∆td: 5 min
  no. of states N: 100
  max. span of shortcuts: 1
  no. of intermediate states: 1
  background weight: 0.05

With respect to data sets, the experiments performed in previous sections have been evaluated using out-of-sample validation data, while the results reported in this section refer to out-of-sample test data (cf. Section 8.3.2).
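Before turning to the individual metrics, the following sketch summarizes how they are computed (definitions per Section 8.2.2). It is illustrative only; as a consistency check, it also reproduces the operating point reported below, where a precision of 0.70 and a recall of 0.62 combine to an F-measure of about 0.66:

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and their harmonic mean (the F-measure)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

def auc(roc_points):
    """Area under a ROC curve given as (fpr, tpr) points, computed by
    trapezoidal integration of the piecewise linear interpolation."""
    pts = sorted(roc_points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# F-measure at the reported operating point (precision 0.70, recall 0.62):
f = 2 * 0.70 * 0.62 / (0.70 + 0.62)
```

Rounding `f` to two digits yields the reported value of 0.66.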
95% confidence intervals have been estimated by the procedure described in Section 8.4.5.

9.5.1 Precision, Recall, and F-measure

Precision, recall, and F-measure have been defined in Section 8.2.2. As they have been developed for information retrieval evaluation, they are geared towards imbalanced class distributions, as is the case for failure prediction. However, precision, recall, and F-measure depend on the classification threshold θ (cf. Equation 7.20 on Page 137), and hence a precision/recall plot and a plot of the F-measure are provided for a selection of eleven thresholds ranging from −∞ to ∞. At each of the eleven classification threshold levels θ, 95% confidence intervals have been computed.

Figure 9.23: Precision/recall plot (a) and corresponding values of the F-measure (b) for the HSMM failure prediction model. A selection of eleven thresholds ranging from −∞ to ∞ has been plotted, including 95% confidence intervals for precision and recall.

At the threshold level yielding the maximum F-measure of 0.66, the corresponding values of precision and recall are 0.70 and 0.62, respectively, which means that failure warnings are correct in 70% of all cases and almost two thirds of all failures are caught by the prediction algorithm. Either value can be pushed towards 1.0 by adjusting the classification threshold θ. Whether high precision or high recall is more important depends on the methods and actions triggered by the prediction algorithm.

9.5.2 ROC and AUC

Taking true negative predictions into account, ROC curves plot the true positive rate versus the false positive rate (cf. Section 8.2.2), and AUC is the area under the resulting curve, estimated by integrating the piecewise linearly interpolated ROC curve. Figure 9.24 shows the ROC curve for HSMM failure prediction. Choosing the threshold yielding the maximum F-measure results in a false positive rate of 0.016 and a true positive rate (which is equal to the recall) of 0.62. The area under the ROC curve (AUC) equals 0.873.

9.5.3 Accumulated Runtime Cost

The plot showing accumulated runtime cost (cf. Section 8.2.4) depends on the cost assigned to each of the four cases that can occur in failure prediction:

• A true negative prediction has cost rF̄F̄. Since it is a negative prediction, no subsequent actions are performed. Furthermore, since it is a correct decision, rF̄F̄ should be the smallest value. A value of 1 has been chosen arbitrarily.

• A true positive prediction has cost rFF. Since the occurrence of a failure is predicted, some actions are performed in order to deal with the upcoming failure, resulting in higher cost. However, it is a correct prediction and hence the cost should not be too high; a value of 10 has been chosen.

• A false positive prediction has cost rF̄F. A failure is predicted and actions are performed as in the previous case; however, these actions are unnecessary since in truth no failure is imminent. Hence a value of 20 has been chosen.

• A false negative prediction has cost rFF̄. From the point of view of computational workload, the cost should equal rF̄F̄. However, this is the worst case since an upcoming failure is not predicted and nothing is done about it: the system fails, which implies the highest cost. Therefore, a cost of 1000 has been assigned to this case.

Figure 9.25 shows accumulated runtime cost for a simulated run of 31.5 days. The figure includes boundary cost for:

• oracle predictor: this predictor issues only a true positive failure warning at the time of failure occurrence, setting the lower bound of overall achievable cost.
• perfect predictor: performs a prediction each time an error message occurs. However, each prediction is correct, i.e., only true positive or true negative predictions occur.

• no prediction: if no predictor were in place, the cost of a false negative prediction would be incurred each time a failure occurs.

• maximum cost: a prediction is performed each time an error message occurs. However, each prediction is wrong, and hence only false positive and false negative predictions are performed.

Figure 9.24: ROC plot for the HSMM failure prediction model applied to telecommunication system data. A selection of eleven thresholds ranging from −∞ to ∞ has been plotted. At each threshold level, 95% confidence intervals for the true and false positive rate are provided.

As can be seen from the plot (Figure 9.25), many failures occurred at the beginning of the run, followed by a “silent” period. Due to the limited plotting resolution it cannot be seen that some failures occurred quite close together in time, resulting in a total of 232 failures. By use of the HSMM failure predictor, accumulated runtime cost can be cut down to approximately one fifth of the cost without a failure predictor.

9.6 Dependence on Application Specific Parameters

The experiments conducted so far have analyzed the parameters involved in data preprocessing and modeling. These parameters were not specific to the application domain for which failure prediction is performed. This section investigates application specific factors, i.e., restrictions or properties imposed by the application domain or the system.

9.6.1 Lead-Time

In Section 2.1, or more specifically in Figure 2.4 on Page 12, it is shown that the lead-time ∆tl has a lower bound called the warning-time ∆tw, which is determined by the time needed to perform some action upon a failure warning.
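Referring back to the accumulated runtime cost of Section 9.5.3 (Figure 9.25): the accumulation itself is simple bookkeeping over the four outcome types using the cost ratio 1 : 10 : 20 : 1000. A sketch follows; the outcome labels are hypothetical stand-ins for the predictor's actual output:

```python
# Cost per prediction outcome, following the ratio used in Section 9.5.3:
COST = {"TN": 1, "TP": 10, "FP": 20, "FN": 1000}

def accumulated_cost(outcomes):
    """Running total of cost over a sequence of prediction outcomes,
    i.e., the kind of curve plotted in Figure 9.25."""
    total, curve = 0, []
    for o in outcomes:
        total += COST[o]
        curve.append(total)
    return curve
```

A single missed failure (FN) dominates the curve, which is why even a moderately precise predictor cuts accumulated cost substantially.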
In the experiments carried out so far, a lead-time ∆tl of five minutes has been used. In order to evaluate the effect of lead-time, experiments with lead-times ranging from ∆tl = 5 minutes to ∆tl = 30 minutes have been performed.

Figure 9.25: Accumulated runtime cost for the HSMM failure prediction model. A test run of 31.5 days has been plotted. A cost ratio of rF̄F̄ : rFF : rF̄F : rFF̄ = 1 : 10 : 20 : 1000 has been used. The plot also includes boundary cost for an oracle predictor, a perfect predictor, a system without prediction, and maximum cost. Triangles at the bottom indicate times of failure occurrence.

Figure 9.26 summarizes the results in terms of maximum F-measure with 95% confidence intervals determined from out-of-sample test data. Although one could expect a rather linear decrease of failure prediction performance, the experiments indicate that performance stays more or less constant up to a lead-time of 20 minutes, after which the F-measure drops quickly. The rather sharp drop observed in the figure indicates that symptomatic manifestations of an upcoming failure are only observable up to 20 minutes (plus the length of the data window ∆td) before failure occurrence. Taking into account that errors occur late in the process from faults to failures, it can be concluded that the fine-grained detection mechanism in the telecommunication system is able to grasp the first misbehaviors up to 20 minutes before failure.

9.6.2 Data Window Size

Training of HSMMs, as well as of other failure prediction models, is based on error sequences. The length of each sequence is determined by the data window size ∆td.
Although ∆td is a data preprocessing parameter (cf. Section 9.2.4), it is analyzed here since its effects on failure prediction quality have not been investigated in Section 9.2.4. As is the case with many parameters, the effects of ∆td on failure prediction quality are manifold and can hardly be assessed analytically. In principle, longer sequences should result in a more precise classification. On the other hand, the farther sequences reach back into the past, the more likely it becomes that failure-unrelated errors are included in failure sequences, which deteriorates failure prediction.

Figure 9.26: Failure prediction performance for various lead-times ∆tl. The plot shows the F-measure with 95% confidence intervals.

Figure 9.27 plots the maximum F-value for five values of ∆td: data windows of length 1, 5, 10, 15, and 20 minutes. Figure 9.27-a shows failure prediction quality in terms of maximum F-value and 95% confidence intervals. As can be seen from the figure, except for a data window size of ten minutes, failure prediction quality improves with larger data window sizes. The exception at ∆td = 10 min might be caused by random effects and the fact that the Baum-Welch algorithm only converges to a local maximum rather than a global one, even if it is repeated 20 times. Improved prediction comes at the price of memory consumption and processing time. Figure 9.27-b shows the mean processing time per sequence in seconds. Processing time increases heavily with increasing ∆td: with twenty-minute data windows, the average processing time reaches 2.34 seconds per sequence. This increase is caused by two effects: (a) the length of the sequences (L) increases with ∆td, and (b) HSMMs also need more states (N) in order to represent longer sequences.
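The two effects compound: for a standard HMM forward pass the operation count grows roughly as N²·L, so enlarging the data window increases cost through both factors at once. This is only a back-of-the-envelope sketch; the HSMM algorithms of Chapter 6 carry additional per-transition duration terms:

```python
def forward_pass_operations(n_states, seq_len):
    """Rough operation count of a standard HMM forward pass, O(N^2 * L).
    Back-of-the-envelope sketch only; the HSMM variant is costlier still."""
    return n_states * n_states * seq_len
```

Doubling both the number of states and the sequence length, as a larger ∆td tends to do, multiplies this estimate by a factor of eight.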
The reason why confidence intervals for processing time get wider with increasing ∆td is that the number of errors per sequence varies more: time windows of five minutes' length are “filled with errors” in most cases, whereas time windows of 20 minutes' length sometimes contain larger gaps, resulting in sequences with fewer errors.

9.7 Dependence on Data Specific Issues

Building a failure prediction model following a data-driven machine learning approach always depends on the quality and quantity of the data and, of course, on the system itself. This section investigates the sensitivity of failure prediction quality with respect to data-specific issues.

Figure 9.27: Experiments for various data window sizes ∆td. (a) Failure prediction performance reported as maximum F-value. (b) Mean processing times per sequence in seconds. 95% confidence intervals are shown in both plots.

9.7.1 Size of the Training Data Set

The objective of machine learning is to identify unobservable relationships from measured data, which is usually blurred by noise. Hence one of the rules of thumb in machine learning is to use as many data points as available. While in many cases the maximum size of the training data set is the limiting factor, the time needed for training may also be critical. Additionally, if very old data is included in the training data set, it might not precisely represent the relationships as they are present in the running system. In order to investigate the effect of the size of the data set, subsets of increasing size have been selected to train a model, which has then been tested on the same test data set (see Figure 9.28).
More precisely, the experiments investigated the relationship between the amount of available training data and the resulting failure prediction quality, as well as the time needed for training.

Figure 9.28: Selection of training data sets for experiments on the effect of the amount of training data.

In order to visualize the effect, two plots are presented: Figure 9.29-a plots the maximum F-measure for the three data sets of different size, and Figure 9.29-b shows the time needed to train the models. In failure prediction, the number of failures in the data set is usually the limiting factor. In the experiments, a small data set with only 72 failure sequences, a medium data set with 134 failure sequences, and a large data set with 278 failure sequences were used, all obtained by reducing the data set that was also used in previous experiments.

Figure 9.29: Effects of the size of the training data set. (a) shows the maximum F-measure and (b) the time needed to train the models. Data set one contained 72, data set two 134, and data set three 278 failure sequences.

Regarding Figure 9.29-a, the F-measure is roughly similar for the large data set (3) and the medium-sized data set (2). When the data set is further reduced (data set 1), the F-measure drops significantly. This is due to the fact that there are too few examples to learn from or, more precisely, to robustly estimate all the parameters of the HSMMs. Figure 9.29-b shows the expected dependence of the time needed to train the models on the size of the data set.

9.7.2 System Configuration and Model Aging

Complex computer systems involve a multitude of configuration parameters and are subject to patches and updates.
In the case of the telecommunication system investigated in this thesis, the number of configuration parameters has been estimated by system experts to exceed 2,000. A separate configuration database is installed on the system, and the system is flexible enough that different versions or implementations of a component can be deployed just by updating one value in the configuration database. Hence a single change in the configuration may alter system failure behavior significantly. It is the goal of this section to investigate the sensitivity of the trained HSMM failure prediction models to changes in system configuration. However, the problem is that we had access neither to the configuration database nor to any logs indicating configuration changes. Therefore, sensitivity can only be investigated in terms of the temporal gap ∆tg between the training data set and the test data set (see Figure 9.30).

Figure 9.30: Selection of test data sets for experiments on the effect of changing system configuration. ∆tg indicates the gap between the end of the training and the start of the test data set.

Figure 9.31 presents results from training a failure prediction model with five different gaps ∆tg. More precisely, experiments have been performed with gaps of 13, 42, 91, 125, and 152 days between the end of the training data and the beginning of the test data. Since we had no access to the configuration database, these numbers have been chosen from an in-depth analysis of the entire data set, which revealed, e.g., changes in the log format. Two conclusions can be drawn from the figure: prediction quality decreases with an increasing temporal distance between training and application of the failure predictor, and the 95% confidence intervals get larger with increasing gap size.
Both characteristics can be interpreted against the background of continuous partial updates and patches: if only parts of the system are changed, some failure-indicating error sequences stay the same while others change. The HSMM recognizes known (old) error sequences well, while it fails on new sequences. The increasing diversity of sequences is reflected in the wider confidence intervals obtained by the bootstrapping procedure.

Besides the aspect of configuration, plotting failure prediction quality as a function of ∆tg brings up another aspect of machine learning. The training procedure applied in this thesis is called supervised offline batch learning, which means that first a batch of data is collected, which is then used entirely to train a model. In this context, offline means that training is performed not during operation but in the two-phase approach indicated by Figure 2.7 on Page 16. There are other machine learning approaches that try to continuously adapt the model in order to keep it up to date; however, in order to keep the approach simple, such techniques have not been investigated in this dissertation (see Chapter 12). The important thing to note here is that, assuming an ever-changing real system, the model is always outdated, even right after training. Hence, the question is how quickly key properties of the system change with respect to the prediction of upcoming failures. The gap ∆tg is one way to express the “age” of a model.

9.8 Failure Sequence Grouping and Filtering

In this dissertation, two data preprocessing techniques have been proposed and used without scrutiny. The experiments described in this section close this gap by investigating the effect of failure sequence grouping as well as noise filtering.

9.8.1 Failure Grouping

In order to obtain more consistent training data sets, failure sequence grouping intends to separate failure mechanisms in the data.
However, grouping also decreases the number of sequences available for training each model. In order to investigate the effects of failure grouping, prediction performance has been evaluated for a predictor with only one HSMM failure group model (and, of course, a non-failure model). Figure 9.32 presents the results, which are intended to be compared to Figures 9.23 and 9.24, respectively. The results show that failure prediction performance without separating failure sequences into groups is worse: a maximum F-measure of 0.5097 and an AUC of 0.7700 are achieved. This indicates that the failure sequences are too diverse to be represented by one single HSMM. Since clustering groups similar sequences and a separate model is trained for each group, the models can better adapt to the specifics of error sequences indicating an upcoming failure.

Figure 9.31: Prediction quality (expressed as maximum F-measure) depending on the temporal gap ∆tg between training and test data. The gap is expressed in days.

9.8.2 Sequence Filtering

In order to remove noise from failure sequences, a statistical filter technique has been applied. To investigate its effects, a model with the same parameters as used in Section 9.5 has been trained and evaluated using unfiltered data. A maximum F-measure of 0.3601, resulting from a precision of 0.670 and a recall of 0.246 with a false positive rate of 0.0095, has been achieved. Hence, sequence filtering improves failure prediction performance markedly, at least for the parameter settings used previously (although it cannot be excluded that other model parametrizations exist that achieve better prediction performance). Additionally, filtering removes symbols from sequences, which in turn has a positive effect on computation times: the average processing time for the prediction of a sequence without filtering is 16.9% higher.
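With one trained model per failure group plus a non-failure model, online classification reduces to comparing model likelihoods against the threshold θ (cf. Equation 7.20). The max-over-groups log-likelihood ratio below is an assumption for illustration, not necessarily the exact decision rule of the thesis:

```python
def predict_failure(failure_logliks, nonfailure_loglik, theta):
    """Predict an upcoming failure if the best failure-group model explains
    the sequence sufficiently better than the non-failure model.
    Sketch only: max-over-groups and the log-ratio form are assumptions."""
    score = max(failure_logliks) - nonfailure_loglik
    return score > theta
```

With a single failure group, the `max` degenerates to the lone model's log-likelihood, which matches the setup evaluated in Section 9.8.1.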
9.9 Comparative Analysis

In order to judge the results presented in previous sections, the HSMM-based failure prediction approach has been compared to several published failure prediction approaches. As already explained in Section 3.2, the most promising and well-known approaches to error-driven failure prediction, as identified as subbranches of Category 1.3 in the failure prediction taxonomy (c.f., Figure 3.1 on Page 31), have been selected. Additionally, results are provided for the simplest prediction method: prediction based on mean time to failure. Since the HSMM-based failure prediction approach presented in this thesis extends standard HMMs, results for standard HMMs are provided, and a comparison to a random predictor and to the UBF approach proposed by Hoffmann is given. Even though the approaches have already been described in Section 3.2, their key ideas are repeated here for convenience. All experiments have been carried out on the same dataset that has been used in Section 9.5 with a lead-time ∆tl of five minutes. Each model is discussed separately and results are summarized at the end of the section, including computation times.

11) However, it cannot be excluded that other model parametrizations exist that achieve better prediction performance.

Figure 9.32: Precision/recall plot (a) and ROC plot (b) for prediction with a single failure group model (confidence level: 0.95).

9.9.1 Dispersion Frame Technique (DFT)

DFT (c.f., Section 3.2.1) investigates the time of error occurrence by defining dispersion frames (DF) and computing the error dispersion index (EDI). A failure is predicted at the end of the DF if at least one out of five heuristic rules matches.
In addition to the original method, predictions that are closer to present time than the warning time ∆tw of three minutes have not been considered. Initial results for DFT using the parameter setting from the paper by Lin & Siewiorek [167] showed poor prediction performance. Parameters such as the thresholds for tupling and for the rules have therefore been varied in order to improve prediction. However, even after investigating 540 different combinations of parameters, the best achievable result only obtained an F-measure of 0.115, resulting from a precision of 0.597 but a recall of only 0.063; the false positive rate equals 0.00352. Compared to the original work by Lin and Siewiorek, the achieved prediction performance is worse. The main reason seems to be the difference between the investigated systems: while the original paper investigated failures in the Andrew distributed file system based on the occurrence of host errors, our study applied the technique to errors that had been reported by software components in order to predict upcoming performance failures. In our study, intervals between errors of the same type are much shorter. As software container IDs have been chosen as the entity corresponding to field replaceable units (FRUs), Figure 9.33 shows histograms of the time between errors for three different software containers.

Figure 9.33: Histogram of time-between-errors for the dispersion frame technique. Since container IDs have been chosen to be the FRU equivalent, error messages of three containers have been analyzed.
As can be seen from the histograms, for the leftmost container ID the vast majority of delays is below five seconds. (In order to obtain histograms in which details can be seen, only delays up to the 99% quantile after tupling have been used, removing very rare but extremely large values.) Since DFT can predict a failure at most half of the delay ahead, most of the failure predictions for this container are dropped, since they are closer to present time than the warning period of 100 seconds. The same holds for the rightmost container ID. The fact that most of the predictions are dropped results in the low recall. However, if a failure warning is issued, it is correct in almost 60% of all cases.

9.9.2 Eventset

The eventset method (c.f., Section 3.2.2) is based on data mining techniques that identify sets of error event types indicative of upcoming failures, which make up a rule database. Construction of the rule database involves the choice of four parameters:

• length of the data window
• level of minimum support
• level of confidence
• significance level for the statistical test

The training algorithm has been run for 64 combinations of values for these parameters, and the best combination with respect to F-measure has been selected. Since the first part of the algorithm potentially needs to investigate the power set of all 1435 error symbols, which has approximately 9.5 · 10^430 elements, a branch-and-bound algorithm called "apriori" has been used, as indicated in the paper by Vilalta & Ma [268].12 The best results have been achieved for a window length of five minutes, a confidence of 10%, a support of 25%, and a significance level of 5%, yielding a precision of 0.465, a recall of 0.327, an F-measure of 0.3841, and a false positive rate of 0.0422.

12) More specifically, the implementation by Christian Borgelt (see [34]).
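The frequent-itemset search at the core of the eventset method can be illustrated with a minimal apriori-style sketch. This is not Borgelt's implementation; the function name and the toy event windows are purely illustrative. The idea is to keep only those error-type sets whose support among the failure-preceding windows reaches a minimum threshold, growing candidate sets level by level so that the power set never needs to be enumerated.

```python
def apriori(eventsets, min_support):
    """Minimal apriori-style frequent itemset search: keep only error-type
    sets whose support among the event windows reaches min_support,
    growing candidate sets level by level."""
    n = len(eventsets)
    items = {t for s in eventsets for t in s}
    # level 1: frequent single error types
    levels = [{frozenset([t]) for t in items
               if sum(t in s for s in eventsets) / n >= min_support}]
    k = 2
    while levels[-1]:
        prev = levels[-1]
        # candidate generation: unions of frequent (k-1)-sets of size k
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # support counting; infrequent candidates are pruned
        levels.append({c for c in candidates
                       if sum(c <= s for s in eventsets) / n >= min_support})
        k += 1
    return [s for level in levels for s in level]

# toy windows of error types preceding failures (illustrative only)
windows = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]
frequent = apriori(windows, min_support=0.5)
print(sorted(sorted(s) for s in frequent))
# → [['A'], ['A', 'B'], ['A', 'C'], ['B'], ['B', 'C'], ['C']]
```

Because every superset of an infrequent set is itself infrequent, pruning at each level keeps the search far below the power-set size mentioned above.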
9.9.3 SVD-SVM

Support Vector Machines (SVMs) are state-of-the-art classifiers with various desirable properties, such as convexity of the optimization criterion. The major problem when using SVM classifiers for failure prediction is the representation of error data. Domeniconi et al. [81] have used a bag-of-words representation together with latent semantic indexing techniques to solve this problem, resulting in the failure prediction approach described in Section 3.2.3. 90 different configurations have been tested and the configuration with maximum F-measure has been selected. In particular, configurations have been defined by the following parameters:

• length of the data window ∆td
• type of kernel function: linear, polynomial, and radial basis functions (c.f., e.g., Chen et al. [56])
• parameters controlling the kernels, such as γ for the radial basis function kernel
• trade-off between training error and margin (parameter C, as in, e.g., Schölkopf et al. [231])
• feature encoding: either existence, count, or temporal (c.f., Section 3.2.3)

The approach has been implemented using R and the free SVM toolkit "SVMlight" [135]. However, there is one difference to the algorithm as originally published by Domeniconi et al. [81]: since the output of SVMlight is not only a class label but a distance from the decision boundary, precision/recall and ROC plots can be drawn. The idea is to classify a sequence as failure-prone only if the SVM output is above some customizable threshold; the classification performance of the original algorithm hence corresponds to a threshold equal to zero. Figure 9.34 presents the results. The best results have been achieved using a radial basis function kernel with γ = 0.6, error/margin trade-off C = 10, and count feature encoding. Using this setting, a maximum F-measure of 0.226, a precision of 0.182, a recall of 0.299, and a false positive rate of 0.1103 have been achieved.
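The thresholding idea described above can be sketched as follows. The decision values and labels below are hypothetical toy data, not output of the actual SVMlight experiments; sweeping a threshold over the raw SVM outputs yields one precision/recall point and one ROC point per threshold.

```python
def sweep_thresholds(scores, labels):
    """Sweep a decision threshold over raw SVM outputs (distances from the
    decision boundary) and compute one precision/recall/ROC point each.
    labels: True = sequence actually preceded a failure."""
    points = []
    for theta in sorted(set(scores)):
        pred = [s >= theta for s in scores]
        tp = sum(p and l for p, l in zip(pred, labels))
        fp = sum(p and not l for p, l in zip(pred, labels))
        fn = sum(not p and l for p, l in zip(pred, labels))
        tn = sum(not p and not l for p, l in zip(pred, labels))
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn)
        fpr = fp / (fp + tn) if fp + tn else 0.0
        points.append((theta, precision, recall, fpr))
    return points

# hypothetical decision values and ground truth (not the real experiment data)
scores = [-1.2, -0.3, 0.1, 0.4, 0.9, 1.5]
labels = [False, False, False, True, False, True]
for theta, p, r, fpr in sweep_thresholds(scores, labels):
    print(f"theta={theta:+.1f}  precision={p:.2f}  recall={r:.2f}  fpr={fpr:.2f}")
```

A threshold of zero reproduces the original algorithm's hard classification; raising it trades recall for precision.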
The fact that encoding error messages by the count scheme works better than by the temporal scheme might seem to contradict one of the principal assumptions of this dissertation, namely that taking both type and time of error messages into account should improve failure prediction. However, this is not the case, since the way time is represented in the temporal scheme has a fundamental flaw: by representing the occurrences of each error type as a binary number, the temporal scheme captures the absolute rather than the relative time of error occurrence in the sequence, and it discretizes time rather than treating it continuously (c.f., Section 4.2.1). As an example, assume that there is only one occurrence of one specific error message type in a sequence. If the error message appears only a little earlier, such that it falls into the next time slot, the magnitude along that error dimension is doubled in the bag-of-words representation.

Figure 9.34: Failure prediction results for the SVD-SVM failure prediction algorithm. Results are reported as precision/recall plot (a) and ROC plot (b) (confidence level: 0.95).

9.9.4 Periodic Prediction Based on MTBF

The reliability model-based failure prediction approach is rather simple and has been included to show the prediction performance that can be achieved with the most straightforward prediction method. Not surprisingly, prediction performance is low: precision equals 0.054 and recall also equals 0.054, yielding an F-measure of 0.0541. Since the approach only issues failure warnings (there are no non-failure predictions), the false positive rate cannot be determined here.
The reason why this prediction method does not really work for the case study is that it is periodic: the next failure is predicted to occur at the median of a distribution that is not adapted during runtime. The histogram of time-between-failures (Figure 9.16) shows that the distribution of time-between-failures is widespread, and the auto-correlation of failure occurrence (Figure 9.17) shows that there is no periodicity evident in the failure data set.

9.9.5 Comparison with Standard HMMs

This dissertation is the first to apply hidden Markov models to the task of online failure prediction. In Section 4.2 it has been argued that standard HMMs are not well-suited for representing error logs due to insufficient capabilities to represent time (c.f., Figure 4.7 on Page 64). While this has so far been a claim based on theoretical analysis, this section provides experimental evidence for it. Three experiments have been performed in which

1. no timing information is used,
2. time-slotting (c.f., Section 4.2.1) is used, and
3. the model described in Salfner [223] is used.

The third case did not work out due to the fact that the process is forced to always traverse the same limited set of states and hence loses its pattern recognition potential. Due to this theoretical flaw, this approach was not investigated further. In the first two cases, the structure of the HMM was similar to the structure of the failure prediction HSMM. In the case of the time-slotting model, an extra observation symbol representing silence has been added. Theoretically, the time slot size should be set equal to the minimum delay between the errors, which is determined by the tupling parameter ε = 0.015 s. However, this would lead to huge models, since

    #states = (5 min × 60 s/min) / 0.015 s = 20000 slots = 20000 states.    (9.2)

However, 20000 states are far too many to be trained from limited data within reasonable time.
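The state-explosion argument of Equation 9.2 and the slot encoding with a silence symbol can be sketched as follows. This is a simplified illustration; the function names are hypothetical, but the random tie-breaking within a slot mirrors the treatment described for the experiment.

```python
import random

def num_states(window_minutes, slot_seconds):
    """Number of time slots (= HMM states) when the data window is
    discretized into fixed-size slots, as in Equation 9.2."""
    return round(window_minutes * 60 / slot_seconds)

def slot_encode(events, window_s, slot_s, silence="~"):
    """Map a timed error sequence [(t, symbol), ...] onto discrete slots.
    Empty slots emit a silence symbol; if several errors fall into one
    slot, one is picked at random (i.e., treated as noise)."""
    slots = [[] for _ in range(round(window_s / slot_s))]
    for t, sym in events:
        slots[int(t / slot_s)].append(sym)
    return [random.choice(s) if s else silence for s in slots]

print(num_states(5, 0.015))  # → 20000: far too many states to train
print(num_states(5, 0.2))    # → 1500: the coarser slotting actually used
print(slot_encode([(0.05, "A"), (0.41, "B"), (0.47, "C")], window_s=1.0, slot_s=0.2))
```

The sketch makes the trade-off tangible: a finer slot size preserves more timing information but inflates the state space linearly with the number of slots.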
Hence, a larger time-slotting interval of ε = 0.2 s has been used, resulting in a model with 1500 states. If two error symbols occurred within one time slot, one symbol has been chosen randomly, which treats such cases as noise. In contrast, the model without timing had 100 states. The model without timing achieved a prediction performance of precision = 0.230 and recall = 0.176, hence an F-measure of 0.1996; the false positive rate was 0.049. The model with time-slotting achieved a prediction performance of precision = 0.079 and recall = 0.129, hence an F-measure of 0.0982, with a false positive rate of 0.124. Similar to the SVD-SVM method, this experiment shows that the sheer incorporation of time information does not automatically lead to good failure prediction results. Rather, in the case considered here, incorporating time by time-slotting rendered the prediction approach almost unusable.

9.9.6 Comparison with Random Predictor

The term "random predictor" denotes a predictor that, each time a prediction is to be performed, issues a failure warning with probability 0.5. Applied to the case study, such a predictor would result in a contingency table as shown in Table 9.3.

                          Prediction: Failure   Prediction: No Failure    Sum
    True Failure                  139                     139             278
    True Non-failure             1686                    1686            3372
    Sum                          1825                    1825            3650

Table 9.3: Contingency table for a random predictor.

From the table, the following values for precision, recall, and false positive rate can be computed:

• precision = 139 / 1825 ≈ 0.076
• recall = 139 / 278 = 0.5
• fpr = 1686 / 3372 = 0.5

which results in an F-measure of approximately 0.1322. One might conclude from these considerations that any predictor with a recall of less than 50% is useless. However, this is not true: as precision and recall are in most cases inversely linked, many prediction methods trade recall for precision, and are hence useful even though recall is below 50%.
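The figures above follow directly from the contingency table; a small sketch (metric definitions as used throughout this chapter, function name illustrative) reproduces them:

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, false positive rate and F-measure from the four
    contingency-table counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, fpr, f_measure

# counts of the random predictor from Table 9.3
p, r, fpr, f = metrics(tp=139, fp=1686, fn=139, tn=1686)
print(round(p, 3), round(r, 3), round(fpr, 3), round(f, 4))  # → 0.076 0.5 0.5 0.1322
```

The same function applied to the UBF counts of Table 9.4 yields the values reported in Section 9.9.7.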
9.9.7 Comparison with UBF

HSMM-based failure prediction operates on event-triggered input data, and the comparative approaches have been selected from this class, too. However, Günther Hoffmann has proposed a failure prediction technique called "Universal Basis Functions" (UBF) and has applied it to symptom monitoring data, such as workload and memory consumption,13 of the same telecommunication system (Hoffmann & Malek [121]). Therefore, results are outlined here for comparison. In Hoffmann [120], values for true/false positives/negatives have been published, shown here as contingency Table 9.4.

                          Prediction: Failure   Prediction: No Failure    Sum
    True Failure                    4                       2               6
    True Non-failure               49                     192             241
    Sum                            53                     194             247

Table 9.4: Contingency table for the UBF failure prediction approach.

From this table, a precision of 0.076, a recall of 0.667 and a false positive rate of 0.2033 are computed, yielding an F-measure of 0.13559. AUC is reported to be 0.846. It should be noted that these values are derived from a rather small data set containing only 247 predictions, among which are only six failures. Looking at precision and F-measure, it seems as if UBF were not much better than a random predictor. However, this is not true, since UBF operates on different data: a random predictor applied to the UBF data would only achieve a precision of 0.024, resulting in an F-measure of 0.0463. Furthermore, UBF achieves an AUC value that is similar to that of the HSMM approach, but at considerably lower computational cost: in order to perform a UBF prediction, each kernel has to be evaluated only once14 and the results are linearly combined. Therefore, in scenarios where, e.g., false positive alarms do not incur high cost, UBF can achieve similar results at lower cost. However, if high precision is a requirement, HSMM outperforms UBF significantly.

14) In case the embedding dimension is zero, which has been shown to yield best results for UBF.
9.9.8 Discussion and Summary of Comparative Approaches

In this section, HSMM-based failure prediction has been compared to other well-known prediction techniques: the Dispersion Frame Technique (DFT), the Eventset method, and SVD-SVM. Additionally, periodic prediction has been investigated in order to show the performance of the most straightforward failure prediction approach. Failure prediction quality has been expressed as precision, recall, F-measure and false positive rate (FPR). Figure 9.35 summarizes the results for the event-based failure predictors, including 95% BCa confidence intervals obtained from bootstrapping.

13) More precisely, a variable selection technique called PWA has been applied, yielding the number of semaphore operations per second and the amount of allocated kernel memory as the most descriptive variables.

Figure 9.35: Summary of prediction results for comparative approaches (periodic, DFT, Eventset, SVD-SVM, HSMM; precision, recall, F-measure, false positive rate). Results are reported as mean values and 95% confidence intervals.

In summary, HSMM-based failure prediction significantly outperforms the other techniques in most of the metrics. The second best technique is Eventset, which has been developed at IBM and has been used for failure prediction in large-scale parallel systems. However, the improved prediction is not for free: HSMM is by far the most complex failure prediction algorithm with respect to both time and memory consumption. More specifically, Table 9.5 lists training times and the time needed to perform one prediction, along with lower and upper bounds of 95% confidence intervals.
It can be observed that, with respect to training, the HSMM approach takes approximately 2.4 times as long as SVD-SVM and almost 60 times as long as Eventset, which is the second-best prediction algorithm in this comparative analysis. Nevertheless, training has no tight real-time constraints and can still be performed within reasonable timescales. With respect to online prediction, HSMM takes much longer than the other techniques. However, computation times are still sufficiently small if seen in the context of a lead time of at least five minutes.

    Prediction technique          Training                       Online Prediction
    Reliability-based periodic    n/a                            n/a
    DFT                           n/a                            4.9767e-5 / 1.2857e-4 / 2.0728e-4
    Eventset                      17.22 / 22.90 / 28.58          0.000715 (15)
    SVD-SVM (max. F-measure)      572.62 / 572.73 / 572.83       5.7566e-4 / 6.7781e-4 / 7.7995e-4
    HSMM (max. F-measure)         1295.90 / 1365.00 / 1434.10    5.7761e-2 / 0.15715 / 0.25655

Table 9.5: Summary of average computation times for comparative approaches. Each cell lists lower 95% confidence bound / mean / upper 95% confidence bound. Time is reported in seconds.

15) Time variances have been below system time resolution, hence no confidence intervals can be provided here.

HSMM-based failure prediction has also been compared to standard HMMs, since HSMMs are derived from HMMs. The prediction performance of a random predictor has been computed for comparative reasons, and, finally, HSMM-based failure prediction has been compared to Universal Basis Functions (UBF), which is a very good failure prediction technique for the analysis of periodic measurements such as system workload or memory consumption.

9.10 Summary

In this chapter, the theory developed in previous chapters has been applied to industrial data of a commercial telecommunication system in order to investigate how well failures of a complex computing system can be predicted.
All steps from data preprocessing to the evaluation of failure prediction quality have been thoroughly investigated, which means that the effects of the various parameters involved have been assessed. In more detail, the following issues have been covered:

• Data preprocessing, consisting of the assignment of error IDs to error messages, tupling, failure sequence clustering, and noise filtering.
• Properties of the resulting data set have been investigated. This involved an analysis of error frequency, the distribution of inter-error delays, the distribution of failures, and the length of the resulting error sequences.
• Modeling. Parameters involved in HSMM modeling include the number of states, the maximum span of shortcuts, the number of intermediate states, intermediate probability mass and distribution, distribution type and amount of background weight, and the number of tries for the Baum-Welch algorithm. Some of these parameters have been set heuristically, while others, for which no values could be determined upfront, have been investigated with respect to failure prediction performance on out-of-sample validation data.
• For the given setting of parameters, failure prediction quality has been investigated in more detail using out-of-sample test data: precision, recall, F-measure, false positive rate, precision-recall plot, ROC plot, AUC and accumulated runtime cost have been reported.
• The application-specific parameters lead-time ∆tl and data window size ∆td have been explored in order to determine their effect on failure prediction performance.
• Data-specific issues have been investigated in order to determine how failure prediction depends on the size of the training data set and on the temporal distance between training and test dataset, which can be taken as an indication of model aging due to system configuration changes and updates.
• The effects of failure sequence clustering and noise filtering have been investigated.
• In order to show that the theory developed in this thesis really improves failure prediction quality, a comparative analysis has been performed. The selection of comparative approaches includes the best-known approaches to error-driven failure prediction, as identified as subbranches of Category 1.3 in the failure prediction taxonomy (c.f., Figure 3.1 on Page 31). Specifically, HSMM-based failure prediction has been compared to the Dispersion Frame Technique (DFT) developed by Lin & Siewiorek [167], the Eventset method developed by Vilalta & Ma [268] at IBM, and singular value decomposition / support vector machines (SVD-SVM) developed by Domeniconi et al. [81]. In order to provide a rough baseline for effortless prediction, periodic prediction on the basis of MTBF has also been applied to the same data. In a further experiment, HSMM-based prediction has been compared to standard hidden Markov models, a random predictor, and Universal Basis Functions (UBF) developed by Hoffmann [120].

In summary, it has been shown that for the industrial data of the commercial telecommunication system, HSMM-based prediction is superior to all failure prediction approaches it has been compared with. Presumably, the main reasons for this are, first, that the approach efficiently exploits both time and type of error messages by treating them as a temporal sequence, and second, the modeling flexibility provided by HSMMs. For example, one characteristic is that HSMMs can handle permutations of error symbols occurring together within a short time interval (c.f., Page 100). This property is relevant for error-sequence-based failure prediction since the ordering of error events occurring closely in time cannot be guaranteed in complex environments such as the telecommunication system.
On the other hand, it must be conceded that this modeling flexibility comes at the price of a considerable number of parameters that need to be adjusted. Hence, applying HSMMs as a modeling technique requires substantial investigation and experience. Additionally, the computational effort is increased: in comparison with, e.g., SVD-SVM, HSMM-based failure prediction consumes approximately 2.4 times as much time for training and 232 times as much for online prediction. However, this is still not prohibitive given that HSMM-based failure prediction is able to reliably predict the occurrence of failures with a lead-time of up to 20 minutes.

Contributions of this chapter. The contributions of this chapter are two-fold. From an engineering point of view, the chapter has shown in detail how an industrial system can be modelled and how the various parameters can be investigated and adjusted to the specifics of a system. From a scientific point of view, the main contributions of this chapter are an in-depth performance evaluation of the HSMM method and a comparative analysis of the approach with other well-known prediction approaches. Furthermore, it has been shown that extending standard HMMs to HSMMs is worth the effort, since prediction quality is significantly improved, and that the proposed preprocessing techniques, i.e., failure sequence clustering and noise filtering, improve failure prediction results.

Relation to other chapters. It has been shown in this chapter that a prediction of upcoming failures is possible in complex computer systems. However, prediction alone does not improve system dependability! Hence, the following fourth part of the thesis addresses the question of what to do about a failure that has been predicted. In terms of the engineering cycle, the third phase has been completed and a solution for failure prediction has been obtained. The next part covers the fourth and last phase of the engineering cycle, which focuses on system improvement.
Part IV
Improving Dependability, Conclusions, and Outlook

Chapter 10
Assessing the Effect on Dependability

The last phase of the engineering cycle, named "improvement", closes the loop: the goal is to use the failure prediction solution developed in previous chapters in order to improve the system with respect to dependability. However, dependability improvement is not the primary goal of this dissertation, which focuses mainly on failure prediction. That is why this last part is shorter and the investigations are not as detailed as in previous chapters. More specifically, in Section 10.1 proactive fault management is introduced, which denotes the combination of online failure prediction and actions to improve system dependability. Related work on previous approaches to model proactive fault management is discussed in Section 10.2. In Sections 10.3 to 10.6, an availability model and a simplified reliability model are proposed, and closed-form solutions for availability, reliability and hazard rate are derived. The issue of parameter estimation from experimental data is covered in Section 10.7, and some experiments that have been performed in the course of a diploma thesis primarily supervised by the author are presented in Section 10.8.

10.1 Proactive Fault Management

System dependability cannot be improved solely by predicting failures; some actions are necessary in order to do something about the failure that has been predicted. As shown in Figure 1.1 on Page 4, online failure prediction and actions form a cycle: a running system is continuously monitored in order to obtain data on its current status, and a prediction algorithm classifies whether the current situation is failure-prone or not. If so, it raises a failure warning and actions are performed in order to do something about the failure.
This might include diagnosis to investigate the root cause of the imminent problem and a decision which technique will be most effective (see Chapter 12). There are two different classes of actions that can be performed upon failure prediction (see Figure 10.1):

• Downtime avoidance (or failure avoidance) aims at circumventing the occurrence of the failure such that the system continues to operate without interruption.
• Downtime minimization (minimization of the impact of failures) involves downtime, but the goal is to reduce it by preparing for true upcoming failures.

Figure 10.1: Proactive fault management combines failure prediction and proactive actions. Actions either try to avoid downtime or to minimize it, e.g., by intentionally bringing the system down in order to shift it from unplanned to forced downtime.

Although several systems combining failure prediction with actions have been described in the literature, there is no unified name for this approach. Following Castelli et al. [49], the name proactive fault management (PFM) is used in this thesis. Several examples of systems employing PFM have been described in the literature. For example, Castelli et al. [49] describe a resource consumption trend estimation technique that has been implemented in the IBM Director Management Software for xSeries servers and that can restart parts of the system. In Cheng et al. [57], a framework called application cluster service is described that facilitates failover (both preventive and after a failure) and state recovery services. Li & Lan [164] propose FT-Pro, a failure prediction-driven adaptive fault management system. It uses the false positive rate and false negative rate of a failure predictor together with cost and expected downtime to choose among the options to migrate processes, to trigger checkpointing, or to do nothing.
The behavior of PFM can be described in more detail as follows: if the failure predictor's analysis suggests that the system is running well and hence no failure is anticipated in the near future (a negative prediction), no action occurs. If a failure is predicted (a positive prediction), either downtime avoidance actions or downtime minimization actions are performed, or both. However, it is obvious that any failure predictor can make wrong decisions: the predictor might forecast an upcoming failure even if this is not the case, which is called a false positive, or the predictor might fail to predict a failure that is imminent in the system, which is called a false negative (c.f., Table 8.1 on Page 153 for an overview of all four cases that may occur). It follows that in case of a false positive prediction (FP), actions are performed unnecessarily, while in case of a false negative prediction (FN), nothing is done about the failure that is imminent in the system. Table 10.1 summarizes these cases.

    Prediction        Downtime avoidance        Downtime minimization
    True positive     Try to prevent failure    Prepare repair (recovery)
    False positive    Unnecessary action        Unnecessary preparation
    True negative     No action                 No action
    False negative    No action                 Standard (unprepared) repair (recovery)

Table 10.1: Actions performed after prediction. For a definition of true/false positives/negatives, see Table 8.1 on Page 153.

Especially in event-based failure prediction, there are situations where a failure occurs without any prediction having been performed at all, since there was no triggering event prior to the failure. However, this case can easily be incorporated by treating it as a false negative prediction. All mechanisms that can benefit from the knowledge about an upcoming failure can be used within PFM.
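The mapping of Table 10.1 can be sketched as a small dispatch function. The action names are illustrative placeholders, not part of any actual PFM implementation; the point is only that actions depend on the prediction, while whether they were necessary depends on the (unknown) ground truth.

```python
def react(prediction_positive, failure_imminent):
    """Return the PFM actions triggered by one prediction outcome,
    following Table 10.1 (action names are illustrative placeholders)."""
    if prediction_positive:
        # true or false positive: actions are triggered either way;
        # a false positive merely makes them unnecessary
        return ["downtime avoidance", "prepare repair"]
    if failure_imminent:
        # false negative (including failures without any triggering event)
        return ["standard (unprepared) repair"]
    return []  # true negative: no action

print(react(True, True))    # true positive
print(react(False, True))   # false negative
print(react(False, False))  # true negative
```

Note that the second argument is only available in hindsight; at prediction time the system can act solely on the first.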
It is not the focus of this thesis to provide a detailed analysis of all kinds of actions falling into this category; hence, only some major concepts are described in the following.

10.1.1 Downtime Avoidance

Downtime avoidance actions are triggered by a failure predictor in order to prevent the occurrence of a failure that seems to be imminent in the system but has not yet occurred. Three categories of mechanisms can be identified:

• State clean-up tries to avoid failures by cleaning up resources. Examples include garbage collection, clearance of queues, correction of corrupt data, or elimination of "hung" processes.
• Preventive failover techniques perform a preventive switch to some spare hardware or software unit. Several variants of this technique exist. One of these is failure prediction-driven load balancing, accomplishing a gradual "failover" from a failure-prone to a failure-free component. For example, Chakravorty et al. [50] describe a multiprocessor environment that is able to migrate processes in case of an imminent failure.
• Lowering the load is a common way to prevent failures. For example, web servers reject connection requests in order not to become overloaded. Within proactive fault management, the number of allowed connections is adaptive and depends on the risk of failure.

10.1.2 Downtime Minimization

Repairing the system after failure occurrence is the classical way of failure handling. Detection mechanisms such as coding checks, replication checks, timing checks or plausibility checks trigger the recovery. Within PFM, these actions still incur downtime, but its occurrence is either anticipated or even intended in order to reduce time-to-repair. More specifically, there are two types of downtime minimization methods:

1. techniques that react to the occurrence of failures with the goal of reducing time-to-repair by preparing for the failure; this is called reactive downtime minimization, and
2.
techniques that intentionally bring the system down in order to cause less downtime in comparison to the downtime associated with unplanned failure occurrence. This class of techniques is termed proactive downtime minimization.

Figure 10.2: Improved time-to-repair for prediction-driven repair schemes. (a) sketches classical recovery and (b) improved recovery in case of preparation for an upcoming failure. “Checkpoint” denotes the last checkpoint before failure, “Failure” the time of failure occurrence, “Reconfigured” the time when reconfiguration has finished and “Up” the time when the system is up again. In (b) time-to-repair is improved since reconfiguration can start after prediction of an upcoming failure and the prediction-triggered checkpoint is closer to the occurrence of the failure, which results in less computation that needs to be recomputed after reconfiguration.

Reactive downtime minimization. The goal of such techniques can be summarized as bringing the system into a consistent fault-free state. If this state is a previous one (a so-called checkpoint), the action applies a roll-backward scheme (see, e.g., Elnozahy et al. [91] for a survey of roll-back recovery in message passing systems). In this case, all computation from the last checkpoint up to the time of failure occurrence has to be recomputed. Typical examples are recovery from a checkpoint or the recovery block scheme introduced by Siewiorek & Swarz [241]. In case of a roll-forward scheme, the system is moved forward to a consistent state by either dropping or approximating the computations that have failed (see, e.g., Randell et al. [213]). Both schemes may comprise reconfiguration such as switching to a hardware spare or another version of a software program, changing network routing, etc. Reconfiguration takes place before computations are redone or approximated.
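The improvement sketched in Figure 10.2 can be illustrated with a small numerical sketch. All times below are hypothetical and chosen only for illustration, not taken from the thesis:

```python
def time_to_repair(reconfig_time, work_lost):
    """TTR = reconfiguration time + time to redo the lost computation
    (the two factors of Figure 10.2)."""
    return reconfig_time + work_lost


# Classical recovery (Figure 10.2-a): periodic checkpoints, so the
# failure strikes, on average, well after the last checkpoint, and
# reconfiguration only starts once the failure has occurred.
classical = time_to_repair(reconfig_time=10.0, work_lost=30.0)

# Prediction-driven recovery (Figure 10.2-b): reconfiguration is
# prepared during the lead-time, and a prediction-triggered checkpoint
# is taken close to the failure, so little work has to be redone.
predicted = time_to_repair(reconfig_time=2.0, work_lost=5.0)

print(classical, predicted)  # prediction-driven repair is much shorter
```

With these (made-up) minute values, TTR drops from 40 to 7, illustrating how both factors of TTR shrink when repair is prepared.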
In traditional fault-tolerant computing without PFM, checkpoints are saved independently of upcoming failures, e.g., periodically. When a failure occurs, reconfiguration takes place first until the system is ready for recomputation/approximation, and then all the computations from the last checkpoint up to the time of failure occurrence are redone. Time-to-repair (TTR) is determined by two factors: the time needed for reconfiguration and the time needed for recomputation or approximation of lost computations. In the case of roll-backward strategies, recomputation time is determined by the length of the time interval between the checkpoint and the time of failure occurrence (see Figure 10.2-a). In some cases recomputation may take less time than the original computation, but the implication still holds. Note that not all types of repair actions exhibit both factors contributing to TTR. A large variety of repair actions exist that can benefit from failure prediction. In principle, coupling with a failure predictor can reduce both factors contributing to TTR (see Figure 10.2-b):
• The time needed for reconfiguration can be reduced since reconfiguration can be prepared for an upcoming failure. Think, for example, of a cold spare: Booting the spare machine can be started right after an upcoming failure has been predicted (and hence before failure occurrence) such that reconfiguration is almost finished when the failure occurs.
• Checkpoints may be saved upon failure prediction close to the failure, which reduces the amount of computation that needs to be repeated. This minimizes the time consumed by recomputation. On the other hand, it might not be wise to save a checkpoint at a time when a failure can be anticipated since the system state might already be corrupted. The question whether such a scheme is applicable depends on fault isolation between the system that is going to observe the failure and the state that is included in the checkpoint.
For example, if the amount of free memory is monitored for failure prediction but the checkpoint comprises database tables of a separate database server, it might be safe to rely on the correctness of the database checkpoint. Additionally, an adaptive checkpointing scheme similar to the one described in Oliner & Sahoo [197] could be applied. Leangsuksun et al. [157] describe that they have implemented predictive checkpointing for a high-availability high-performance Linux cluster.

Proactive downtime minimization. Parnas [198] reported on an effect that he called software aging, a term covering effects such as memory leaks, unreleased file locks, file descriptor leaking, or memory corruption. Based on these observations, Huang et al. introduced a concept that the authors termed rejuvenation. The idea of rejuvenation is to deal with problems related to software aging by restarting the system (or at least parts of it). By this approach, unplanned / unscheduled / unprepared downtime incurred by non-anticipated failures is replaced by forced / scheduled / anticipated downtime. The authors have shown that —under certain assumptions— overall downtime and downtime cost can be reduced by this approach. In Candea et al. [43] the approach is extended in the context of recovery-oriented computing (see, e.g., Brown & Patterson [40]), where restarting is organized recursively until the problem is solved.

10.2 Related Models

The objective of this chapter is a theoretical assessment of proactive fault management with respect to system dependability, or more precisely steady-state system availability, reliability and hazard rate. As is common in reliability theory, a model expressing the relevant interrelations is used. Proactive fault management is rooted in preventive maintenance, which has been a research issue for several decades (an overview can be found, e.g., in Gertsbakh [105]).
More specifically, proactive fault management belongs to the category of condition-based preventive maintenance (c.f., e.g., Starr [250]). However, the majority of work has focused on industrial production systems such as heavy assembly-line machines and, more recently, on computing hardware. With respect to software, preventive maintenance has focused more on long-term software product aging, such as software versions and updates, rather than short-term execution aging. The only exception is software rejuvenation, which has been investigated heavily (c.f., e.g., Kajko-Mattson [139]). Starting from general preventive maintenance theory, Kumar & Westberg [150] compute the reliability of condition-based preventive maintenance. However, their approach is based on a graphical analysis of so-called total time on test plots of singleton observation variables such as temperature, rendering the approach inappropriate for application to automatic proactive fault management in software systems. An approach better suited to software has been presented by Amari & McLaughlin [9]. They use a continuous-time Markov chain (CTMC) to model system deterioration, periodic inspection, preventive maintenance and repair. However, one of the major disadvantages of their approach is that they assume perfect periodic inspection, which does not reflect failure prediction reality, as has been shown by the case study presented in Chapter 9. A significant body of work has been published addressing software rejuvenation. Initially, Huang et al. [126] have used a CTMC in order to compute steady-state availability and expected downtime cost. In order to overcome various limitations of the model, e.g., that constant transition rates are not well-suited to model software aging, several variations of the original model of Huang et al. have been published over the years, some of which are briefly discussed here. Dohi et al.
have extended the model to a semi-Markov process to deal more appropriately with the deterministic behavior of periodic restarting. Furthermore, they have slightly altered the topology of the model since they assume that there are cases where a repair does not result in a clean state and restart (rejuvenation) has to be performed after repair. The authors have computed steady-state availability (Dohi et al. [80]) and cost (Dohi et al. [79]) using this model. Cassady et al. [47] propose a slightly different model and use Weibull distributions to characterize state transitions. However, due to this choice, the model cannot be solved analytically and an approximate solution from simulated data is presented. Garg et al. [101] have used a three-state discrete-time Markov chain (DTMC) with two subordinated non-homogeneous CTMCs to model rejuvenation in transaction processing systems. One subordinated CTMC models the queuing behavior of transaction processing and the second models preventive maintenance. The authors compute steady-state availability, the probability of losing a transaction, and an upper bound on response time for periodic rejuvenation. They model a more complex scheme that starts rejuvenation when the processing queue is empty. The same three-state macro-model has been used in Vaidyanathan & Trivedi [262], but here, time-to-failure is estimated using a monitoring-based subordinated semi-Markov reward model. However, for the model solution, the authors approximate time-to-failure with an increasing failure rate distribution. Leangsuksun et al. [158] have presented a detailed stochastic reward net model of a high-availability cluster system in order to model availability. The model differentiates between servers, clients and network. Furthermore, it distinguishes permanent as well as intermittent failures that are either covered (i.e., eliminated by reconfiguration) or uncovered (i.e., eliminated by rebooting the cluster).
Again, the model is too complex to be analyzed analytically and hence simulations are performed. An analytical solution for computing the optimal rejuvenation schedule is provided by Andrzejak & Silva [10], who use deterministic function approximation techniques to characterize the relationship between aging factors and work metrics. The optimal rejuvenation schedule can then be found by an analytical solution to an optimization problem. The key property of PFM is that it operates upon failure predictions rather than a purely time-triggered execution of fault-tolerance mechanisms. One of the first papers to address this issue is Vaidyanathan et al. [261]. The authors propose several stochastic reward nets (SRN), one of which explicitly models prediction-based rejuvenation. However, there are two limitations to this model: first, only one type of wrong prediction is covered, and second, the model is tailored to rejuvenation —downtime avoidance and reactive downtime minimization are not included. Furthermore, due to the complexity of the model, no analytical solution for availability is presented. Focusing on service degradation, Bao et al. [21] propose a CTMC that includes the number of service requests in the system plus the amount of leaked memory. An adaptive rejuvenation scheme is analyzed that is based on estimated resource consumption. Later, the model has been combined with the three-state macro-model in order to compute availability (Bao et al. [22]). However, this model also does not investigate the effect of mispredictions. Last but not least, the model presented in this dissertation is not the first attempt to assess the effects of proactive fault management. In Salfner & Malek [225], an approach has been published that directly extends the well-known formula for steady-state availability:

A = MTTF / (MTTF + MTTR) .   (10.1)

However, the approach proposed in Salfner & Malek [225] had three limitations:
1.
It did not clearly distinguish between true and false positive and negative predictions. This flaw resulted in an inappropriate handling of prevented and induced failures.
2. Only steady-state availability could be estimated. Other dependability metrics such as reliability and hazard rate could not be computed.
3. The model is implicit and hence not transparent enough to help better understand the behavior of proactive fault management.
In summary, to the best of our knowledge, no work has been published that captures downtime avoidance as well as both reactive and proactive downtime minimization, and that incorporates all four cases of failure prediction: true and false positives as well as negatives.

10.3 The Availability Model

As is the case for many of the rejuvenation models mentioned before, the model developed here is based on the CTMC originally published by Huang et al. [126]. First, the original model is briefly presented and then the new model is introduced.

10.3.1 The Original Model for Software Rejuvenation by Huang et al.

As described by Parnas, software aging can be observed in long-running software. However, software aging does not cause the software to crash immediately but increases the risk of failure. For example, if a memory leak is present, the amount of available memory is continuously decreasing (in its long-term behavior). Assuming that each service request requires some (stochastically distributed) amount of memory, the risk that some service request fails due to insufficient free memory increases over time. However, if the maximum number of concurrent service requests and the maximum amount of memory consumption of each service request are limited, software aging does not affect service availability as long as the amount of free memory is above some threshold. This observation is one of the key concepts in the model for rejuvenation proposed by Huang et al.
[126]: There exists a failure-probable state S_P that a running system may enter (see Figure 10.3). In the example, the system transits into this state when the amount of free memory drops below the described threshold. Rejuvenation is performed periodically in order to clean up the system and to bring it back into the fault-free state S_0. The occurrence of forced downtime (e.g., incurred by rejuvenation) is known, while failures occur stochastically (unplanned downtime). The key notion of software rejuvenation is that both downtime and the associated downtime cost are less for forced downtime than for unplanned downtime. Therefore, the model has two different down states: one for rejuvenation (S_R) and one for failures (S_F). Since the periodically triggered restarting process is different from repair after failure, two transition rates r1 and r3 are used.

Figure 10.3: The original CTMC model as used by Huang et al. [126] to compute availability of a system with rejuvenation. S_0 denotes the state when everything is up and running, S_P the failure-probable state, S_R the rejuvenation state and S_F the failed state, with the transition rates as used in the original paper.

10.3.2 Availability Model for Proactive Fault Management

In order to develop an availability model for proactive fault management, three key differences are taken into account:
• In addition to rejuvenation, proactive fault management involves downtime avoidance techniques. In terms of the model, this means that there needs to be some way to get from the failure-probable state back to the S_0 state without an intermediate down state.
• Proactive fault management actions operate upon failure prediction rather than periodically. However, predictions can be correct or false. Moreover, it makes a difference whether there really is a failure imminent in the system or not.
Hence, the single failure-probable state S_P in Figure 10.3 needs to be split up to allow a more fine-grained analysis: According to the four cases of prediction, there is a state for true positive predictions (S_TP), false positive predictions (S_FP), true negative predictions (S_TN) and false negative predictions (S_FN).
• Besides rejuvenation, which is a proactive downtime minimization technique, proactive fault management also comprises reactive downtime minimization actions. However, both types of actions can be assessed in terms of their effect on time-to-repair. Hence, it is sufficient to retain two down states: one for prepared / forced downtime (S_R) and one for unprepared / unplanned downtime (S_F).
The resulting CTMC is shown in Figure 10.4.

Figure 10.4: Availability CTMC for proactive fault management. State S_0 is the fault-free state. States S_TP, S_FP, S_TN and S_FN are failure-probable states corresponding to the four cases of failure prediction correctness. States 5 and 6 are “down” states, where S_R accounts for forced downtime caused by scheduled restart or prepared repair, and S_F accounts for the unplanned counterpart.

In order to better explain the model, consider the following scenario: Starting from the up-state S_0, a failure prediction is performed at some point in time. If the predictor comes to the conclusion that a failure is imminent, the prediction is a positive one and a failure warning is raised. If this is true (something is really going wrong in the system) the prediction is a true positive and a transition into S_TP takes place. Due to the warning, some actions are performed in order to either prevent the failure from occurring (downtime avoidance), or to prepare for some forced downtime (downtime minimization). Assuming first that some preventive actions are performed, let

P_TP := P(failure | true positive prediction)   (10.2)

denote the probability that the failure occurs despite preventive actions.
Hence, with probability P_TP a transition into the down state S_R takes place, and with probability (1 − P_TP) the failure can be avoided and the system returns to state S_0. Due to the fact that a failure warning was raised (the prediction was a positive one), preparatory actions have been performed and repair is quicker (on average), such that state S_0 is entered with rate r_R. If the failure warning is wrong (in truth the system is doing well) the prediction is a false positive (state S_FP). In this case actions are performed unnecessarily. However, although no failure was imminent in the system, there is some risk that a failure is caused by the additional workload for failure prediction and subsequent actions. Hence, let

P_FP := P(failure | false positive prediction)   (10.3)

denote the probability that an additional failure is induced. Since there was a failure warning, preparation for an upcoming failure has been carried out and hence the system transits into state S_R.

In case of a negative prediction (no failure warning is issued) no action is performed. If the judgment that the current situation is not failure-prone is correct (there is no failure imminent), the prediction is a true negative (state S_TN). In this case, one would expect that nothing happens since no failure is imminent. However, depending on the system, even failure prediction alone (without subsequent actions) may put additional load onto the system, which can lead to a failure although no failure was imminent at the time when the prediction started. Hence there is also some small probability of failure occurrence in the case of a true negative prediction:

P_TN := P(failure | true negative prediction) .   (10.4)

Since no failure warning has been issued, the system is not prepared for the failure and hence a transition to state S_F, rather than S_R, takes place.
This implies that the transition back to the fault-free state S_0 occurs at rate r_F, i.e., repair takes longer (on average). If no additional failure is induced, the system returns to state S_0 directly with probability (1 − P_TN). If the predictor does not recognize that something is going wrong in the system and a failure comes up, the prediction is a false negative (state S_FN). Since nothing is done about the upcoming failure, there is no transition back to the up-state and the model transits to the failure state S_F without any preparation. The reason why there is an intermediate state S_FN originates from the way transition rates are computed, as explained in the next section.

10.4 Computing the Rates of the Model

Reliability modeling is typically performed to investigate new techniques for systems that are under design¹ in order to determine their potential effect on system parameters such as availability. (¹ As already mentioned, in the case of this dissertation it was not possible to try the methods on the commercial system.) The model shown in Figure 10.4 comprises the following parameters:
• P_TP, P_FP, P_TN denote the probabilities of failure occurrence given a true positive, false positive, or true negative prediction.
• r_TP, r_FP, r_TN, and r_FN denote the rates of true/false positive and negative predictions.
• r_A denotes the action rate, which is determined by the average time from the start of the prediction to downtime or to the return to the fault-free state.
• r_R denotes the repair rate for forced / prepared downtime.
• r_F denotes the repair rate for unplanned downtime.
However, some of these parameters are difficult to determine. Therefore, more intuitive parameters are used, from which the rates of the CTMC model are computed. Usually, there are two groups of parameters:
1. fixed parameters that are estimated / measured from a given system or determined by the application area
2.
parameters that shall be investigated / optimized in order to assess their effect on target metrics.
In the case considered here, it is assumed that a system without proactive fault management shall be extended by PFM, and the effect of PFM with respect to availability, reliability and hazard rate shall be investigated. More specifically, it is assumed that the fixed parameters comprise mean-time-to-failure (MTTF), mean-time-to-repair (MTTR), lead-time ∆t_l, and prediction-period ∆t_p. The second group of parameters (those that shall be investigated) includes parameters evaluating the accuracy of failure prediction and parameters investigating the efficiency of actions. Table 10.2 summarizes the specific parameters that are used in the following. Note that in contrast to the definition in Section 8.2.2, for readability reasons the single letter “f” is used to denote the false positive rate in this chapter.

Parameter                                   | Symbol | Fixed | Investigated
Mean time to failure (system w/o PFM)       | MTTF   |   X   |
Mean time to repair (system w/o PFM)        | MTTR   |   X   |
Lead-time                                   | ∆t_l   |   X   |
Prediction-period                           | ∆t_p   |   X   |
Precision                                   | p      |       |      X
Recall                                      | r      |       |      X
False positive rate                         | f      |       |      X
Failure probability given TP prediction     | P_TP   |       |      X
Failure probability given FP prediction     | P_FP   |       |      X
Failure probability given TN prediction     | P_TN   |       |      X
Repair time improvement                     | k      |       |      X

Table 10.2: Parameters used for modeling.

In summary, it is intuitively clear that any proactive fault management technique should strive to achieve the following parameter values in order to minimize downtime:
1. Failure prediction should be as accurate as possible. This translates into high precision, high recall and a low false positive rate.
2. Failure occurrence probabilities P_TP, P_FP and P_TN should be as close to zero as possible.
3. Time to repair for forced downtime / prepared repair should be as small as possible in comparison to the repair time for unplanned / accidental downtime.

10.4.1 The Parameters in Detail

Parameters can be divided into three groups (see Table 10.2):
1.
Precision, recall and false positive rate specify failure prediction accuracy² (² in a general sense, not as strict as in Definition 8.10 on Page 156).
2. Failure probabilities P_TP, P_FP, P_TN assess the effectiveness of downtime avoidance and the risk of additional failures that are induced by the additional workload of prediction and actions.
3. The repair time improvement factor k determines the effectiveness of downtime minimization.

Failure prediction accuracy. Figure 10.5 visualizes all four cases of failure prediction correctness including lead-time ∆t_l and prediction-period ∆t_p. The case that a failure occurs without any failure prediction being performed³ is mapped to a missing failure warning, which is a false negative prediction. Although the contingency table, precision, recall and false positive rate have been defined in Chapter 8 (c.f., Table 8.1 on Page 153, Equations 8.3 and 8.4 on Page 155, and Equation 8.9 on Page 156), they are repeated here for convenience. Also, the notation is slightly changed in order to emphasize that the metrics are defined by numbers that have been counted during an experiment: For example, n_TP denotes the number of true positive predictions within one experiment with a total of n predictions.

Figure 10.5: A timeline showing failures (t) and all four types of predictions (P): true positive, false positive, false negative, and true negative. A failure is counted as predicted if it occurs within the prediction-period of length ∆t_p, which starts lead-time ∆t_l after the beginning of the prediction.

Table 10.3 shows the modified version of the contingency table. Using this notation,

                         | True Failure | True Non-failure | Sum
Prediction: Failure      | n_TP         | n_FP             | n_POS
Prediction: No failure   | n_FN         | n_TN             | n_NEG
Sum                      | n_F          | n_NF             | n

Table 10.3: This contingency table is a simplified version of Table 8.1 on Page 153. It emphasizes that the fields consist of the number of true positive (n_TP), false positive (n_FP), etc.
predictions from an experiment with a total of n predictions. (³ E.g., in error-based prediction, if no error occurs prior to a failure.) Precision, recall and false positive rate are defined as follows:

Precision:            p = n_TP / (n_TP + n_FP) = n_TP / n_POS   (10.5)
Recall:               r = n_TP / (n_TP + n_FN) = n_TP / n_F     (10.6)
False positive rate:  f = n_FP / (n_FP + n_TN) = n_FP / n_NF    (10.7)

Effectiveness of downtime avoidance and risk of induced failures. Preventive actions are applied in order to avoid an imminent failure, which affects time-to-failure (TTF). However, the opposite effect may also occur: due to the additional load generated by failure prediction and actions, failures can be provoked that would not have occurred if no PFM had been in place. In order to account for this effect, the model uses three probabilities corresponding to the types of failure prediction correctness:
P_TP is the probability that a failure occurs in case of a correct warning. This is the probability that the preventive action is not successful.
P_FP is the probability of failure occurrence in case of a false positive warning. Since no failure is imminent at the time of prediction, it corresponds to the probability that a failure is provoked by the extra load of failure prediction and subsequent actions.
P_TN is the probability that an extra failure is provoked by prediction alone: since the prediction is a true negative, no failure is imminent in the system and no actions are performed, so any failure that occurs is induced by the prediction itself.
There is no need to define a probability for false negative predictions since nothing is done about the failure that will occur. The probability of failure occurrence is hence equal to one.

Effectiveness of downtime minimization. The effects of forced downtime / prepared repair on availability, reliability, and hazard rate are gauged by time-to-repair.
More specifically, the effect is expressed by the mean relative improvement, i.e., how much faster the system is up in case of forced downtime / prepared repair in comparison to the MTTR after an unanticipated failure:

k = MTTR / MTTR_p ,   (10.8)

which is the ratio of the MTTR without preparation to the MTTR for the forced / prepared case. Obviously, one would expect that preparation for upcoming failures improves MTTR, thus k > 1, but the definition also allows k < 1, corresponding to a change for the worse.

10.4.2 Computing the Rates from Parameters

CTMC models express temporal behavior using exponential distributions for the sojourn time in a state before transitioning. Exponential distributions are determined by a single parameter: the transition rate. In this dissertation, only constant transition rates are considered, which are determined by the inverse of the mean time.

It is the objective of this section to relate the model’s rates r_TP, r_FP, r_TN, r_FN, r_A, r_R, and r_F to the more intuitive parameters listed in Table 10.2. Using the formulas developed in the following, the rates of the CTMC can thus be computed from the more intuitive parameters. The text follows a bottom-up approach such that basic relationships and equations are developed first, which are subsequently used to derive the equations for the CTMC rates given by Equations 10.30 to 10.33. The starting point for the computations is to determine the distribution of predictions among true and false positives and negatives. This can be obtained using the prediction-related metrics precision, recall and false positive rate. The distribution is expressed by the number of, e.g., true positive predictions divided by the total number of predictions.
By reference to Table 10.3 and the definitions given by Equations 10.5 to 10.7, it can be derived that:

$$
\begin{aligned}
n &= n_F + n_{NF} &&\text{(10.9)}\\
  &= \frac{n_{TP}}{r} + \frac{n_{FP}}{f} &&\text{(10.10)}\\
  &= \frac{n_{TP}}{r} + \frac{n_{POS} - n_{TP}}{f} &&\text{(10.11)}\\
  &= \frac{n_{TP}}{r} + \frac{n_{TP}}{p\,f} - \frac{n_{TP}}{f} &&\text{(10.12)}\\
\Rightarrow\ \frac{n_{TP}}{n} &= \frac{1}{\frac{1}{r} + \frac{1}{p\,f} - \frac{1}{f}}\,, &&\text{(10.13)}
\end{aligned}
$$

which is an equation to compute the fraction of true positive predictions n_TP (in comparison to the total number of predictions n) from the prediction-related parameters precision, recall, and false positive rate. In order to compute the fractions of false positive, false negative, and true negative predictions, it is necessary to determine:

$$
\begin{aligned}
\frac{n_{POS}}{n} &= \frac{1}{p}\,\frac{n_{TP}}{n} &&\text{(10.14)}\\
\frac{n_{F}}{n} &= \frac{1}{r}\,\frac{n_{TP}}{n}\,, &&\text{(10.15)}
\end{aligned}
$$

which leads to

$$
\begin{aligned}
\frac{n_{FP}}{n} &= \frac{n_{POS}}{n} - \frac{n_{TP}}{n} &&\text{(10.16)}\\
\frac{n_{FN}}{n} &= \frac{n_{F}}{n} - \frac{n_{TP}}{n} &&\text{(10.17)}\\
\frac{n_{TN}}{n} &= \frac{n_{NF}}{n} - \frac{n_{FP}}{n} = 1 - \frac{n_{F}}{n} - \frac{n_{FP}}{n}\,. &&\text{(10.18)}
\end{aligned}
$$

Now that the relative distribution among true and false positive and negative predictions is known, the corresponding transition rates r_TP, r_FP, r_TN, r_FN can be computed. The approach is to first compute the overall prediction rate r_p, which determines the timing of the process once it has entered state S_0. The mean time is determined by mean-time-to-prediction (MTTP), which is computed in two steps: First, mean-time-between-predictions (MTBP) is computed from the temporal parameters MTTF, MTTR, lead-time Δt_l, and prediction-period Δt_p. MTTF and MTTR are assumed to be known from a system without PFM, which is why they are fixed parameters. In a second step, MTTP is obtained from MTBP by subtracting the time needed for prediction and repair.

The principal notion in computing MTBP is that there are x times as many predictions as true failures. Denoting the number of predictions by n and the number of true failures by n_F, x can be determined by expressing n in terms of n_F, as is shown in the following:

$$
\begin{aligned}
n &= n_F + n_{NF} &&\text{(10.19)}\\
  &= n_F + \frac{n_{FP}}{f} &&\text{(10.20)}\\
  &= n_F + \frac{n_{POS}}{f} - \frac{n_{TP}}{f} &&\text{(10.21)}\\
  &= n_F + \frac{n_{TP}}{p\,f} - \frac{n_{TP}}{f} &&\text{(10.22)}\\
  &= n_F + n_{TP}\left(\frac{1}{p\,f} - \frac{1}{f}\right) &&\text{(10.23)}\\
  &= n_F + n_F\,r\left(\frac{1}{p\,f} - \frac{1}{f}\right) &&\text{(10.24)}\\
  &= n_F\left(1 + r\,\frac{1-p}{p\,f}\right). &&\text{(10.25)}
\end{aligned}
$$

This means that there are 1 + r(1−p)/(pf) times as many predictions as failures. Hence it can be concluded that for the mean times holds:

$$
MTBP = \frac{MTBF}{1 + r\,\frac{1-p}{p\,f}}\,, \quad\text{(10.26)}
$$

where MTBF denotes "mean-time-between-failures" for a system without proactive fault management, which can be computed from MTTF and MTTR by the standard formula

$$
MTBF = MTTF + MTTR\,. \quad\text{(10.27)}
$$

As can be seen in Figure 10.6, MTTP can be computed from MTBP by subtracting lead-time Δt_l and repair time R. Additionally, half of the prediction-period has to be subtracted, since a failure may occur at any time within the prediction-period and hence, on average, failures occur at half of the prediction-period.⁴ However, for a system with PFM, repair time R is not equal to MTTR, since there are two different repair times: one for prepared repair (or forced downtime) and one for the unprepared / unplanned case. But as we only consider mean values, mean repair time R is a combination of both cases, and the mixture is determined by the fraction of positive predictions in comparison to negative predictions.

[Figure 10.6: Time relations for prediction. Failures are indicated by t, predictions by P, and repair by R]

More specifically, MTTP is given by:

$$
MTTP = MTBP - \Delta t_l - \frac{\Delta t_p}{2} - \frac{n_{TP} + n_{FP}}{n}\,MTTR_p - \frac{n_{TN} + n_{FN}}{n}\,MTTR\,, \quad\text{(10.28)}
$$

where MTTR_p is mean-time-to-repair for the case of forced / prepared downtime. It is related to the MTTR of unplanned downtime by the repair time improvement factor k (c.f., Equation 10.8). Finally, prediction rate r_p is computed by:

$$
r_p = \left[\frac{MTTF + MTTR}{1 + r\,\frac{1-p}{p\,f}} - \Delta t_l - \frac{\Delta t_p}{2} - \left(\frac{n_{TP} + n_{FP}}{k\,n} + \frac{n_{TN} + n_{FN}}{n}\right) MTTR\right]^{-1}. \quad\text{(10.29)}
$$

⁴ To be precise, a symmetric distribution centered around the middle of the prediction-period is assumed, i.e., a distribution with zero skewness and median equal to Δt_p/2.

As already mentioned, the transition rates from S_0 to S_TP, S_FP, S_TN, and S_FN are determined by distributing r_p among true / false positive / negative predictions:

$$
r_{ij} = \frac{n_{ij}}{n}\,r_p \quad\text{where } i \in \{T, F\} \text{ and } j \in \{P, N\}\,, \quad\text{(10.30)}
$$

where n_ij/n denotes the fractions given by Equations 10.13 to 10.18. The three remaining rates are the action rate (r_A), the repair rate for forced downtime / prepared repair (r_R), and the repair rate for an unprepared failure (r_F). r_A is characterized by the average time from the beginning of the prediction to the occurrence of downtime or its prevention and can hence be computed from lead-time Δt_l and prediction-period Δt_p:

$$
r_A = \frac{1}{\Delta t_l + \frac{1}{2}\,\Delta t_p}\,. \quad\text{(10.31)}
$$

Repair rate r_F is determined by the mean repair time of a system without PFM, which is MTTR:

$$
r_F = \frac{1}{MTTR} \quad\text{(10.32)}
$$

and the repair rate for forced downtime / prepared repair is determined by MTTR and k:

$$
r_R = \frac{1}{MTTR_p} = \frac{k}{MTTR} = k\,r_F\,. \quad\text{(10.33)}
$$

10.5 Computing Availability

Steady-state availability is defined as the portion of uptime versus lifetime, which is equivalent to the portion of time the system is up. In terms of our CTMC model, this quantity can be determined from the equilibrium state distribution: it is the portion of probability mass in steady-state assigned to the non-failed states, which are S_0, S_TP, S_FP, S_TN, and S_FN. In order to simplify the representation, numbers 0 to 6 (as indicated in Figure 10.4) are used to identify the states of the CTMC. The infinitesimal generator matrix Q of the CTMC model is:

$$
Q = \begin{bmatrix}
-r_p & r_{TP} & r_{FP} & r_{TN} & r_{FN} & 0 & 0\\
(1-P_{TP})\,r_A & -r_A & 0 & 0 & 0 & P_{TP}\,r_A & 0\\
(1-P_{FP})\,r_A & 0 & -r_A & 0 & 0 & P_{FP}\,r_A & 0\\
(1-P_{TN})\,r_A & 0 & 0 & -r_A & 0 & 0 & P_{TN}\,r_A\\
0 & 0 & 0 & 0 & -r_A & 0 & r_A\\
r_R & 0 & 0 & 0 & 0 & -r_R & 0\\
r_F & 0 & 0 & 0 & 0 & 0 & -r_F
\end{bmatrix} \quad\text{(10.34)}
$$

The equilibrium state distribution of a CTMC can be determined by solving the global balance equations. This is equivalent to a solution of the following linear equation system (see, e.g., Kulkarni [149]):

$$
\pi\,Q = 0 \quad\text{(10.35)}
$$

subject to

$$
\sum_{i=0}^{6} \pi_i = 1\,. \quad\text{(10.36)}
$$
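Before solving the model, the rate computations of Equations 10.13 to 10.18 and 10.26 to 10.33, which feed the generator matrix above, can be summarized in code. The following is a minimal illustrative sketch, not the author's implementation; all function and variable names are my own:

```python
# Hedged sketch (not the dissertation's code): CTMC transition rates from
# the model's input parameters, following Eqs. 10.13-10.18 and 10.26-10.33.

def prediction_fractions(p, r, f):
    """Return (n_TP/n, n_FP/n, n_TN/n, n_FN/n), Eqs. 10.13-10.18."""
    n_tp = 1.0 / (1.0 / r + 1.0 / (p * f) - 1.0 / f)  # Eq. 10.13
    n_pos = n_tp / p                                   # Eq. 10.14
    n_f = n_tp / r                                     # Eq. 10.15
    n_fp = n_pos - n_tp                                # Eq. 10.16
    n_fn = n_f - n_tp                                  # Eq. 10.17
    n_tn = 1.0 - n_f - n_fp                            # Eq. 10.18
    return n_tp, n_fp, n_tn, n_fn

def transition_rates(p, r, f, mttf, mttr, dt_l, dt_p, k):
    """Return (r_TP, r_FP, r_TN, r_FN, r_A, r_F, r_R), Eqs. 10.26-10.33."""
    n_tp, n_fp, n_tn, n_fn = prediction_fractions(p, r, f)
    mtbp = (mttf + mttr) / (1.0 + r * (1.0 - p) / (p * f))        # Eq. 10.26
    # Eq. 10.28 with MTTR_p = MTTR / k:
    mttp = (mtbp - dt_l - dt_p / 2.0
            - (n_tp + n_fp) * mttr / k - (n_tn + n_fn) * mttr)
    r_p = 1.0 / mttp                                              # Eq. 10.29
    r_tp, r_fp, r_tn, r_fn = (x * r_p for x in (n_tp, n_fp, n_tn, n_fn))  # Eq. 10.30
    r_a = 1.0 / (dt_l + dt_p / 2.0)                               # Eq. 10.31
    r_f = 1.0 / mttr                                              # Eq. 10.32
    r_r = k * r_f                                                 # Eq. 10.33
    return r_tp, r_fp, r_tn, r_fn, r_a, r_f, r_r
```

With the case-study parameters reported later in Table 10.6 (p = 0.167, r = 0.25, f = 0.0617284, MTTF = 25711 s, MTTR = 2 s, Δt_l = 60 s, Δt_p = 300 s, k = 1.5625), this reproduces the transition rates listed there, e.g. r_TP ≈ 1.178e-05 per second.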
The way to a solution is based on the following observation: if π is a solution to Equation 10.35, then each scaling of π is also a solution. Hence, an infinite number of solutions exists, one of which fulfills Equation 10.36. Therefore, π_6 is arbitrarily set to one and the inhomogeneous equation system π′Q′ = b is solved by Gaussian elimination, yielding a single solution π′, where Q′ is

$$
Q' = \begin{bmatrix}
-r_p & r_{TP} & r_{FP} & r_{TN} & r_{FN} & 0\\
(1-P_{TP})\,r_A & -r_A & 0 & 0 & 0 & P_{TP}\,r_A\\
(1-P_{FP})\,r_A & 0 & -r_A & 0 & 0 & P_{FP}\,r_A\\
(1-P_{TN})\,r_A & 0 & 0 & -r_A & 0 & 0\\
0 & 0 & 0 & 0 & -r_A & 0\\
r_R & 0 & 0 & 0 & 0 & -r_R
\end{bmatrix} \quad\text{(10.37)}
$$

and

$$
b = \begin{bmatrix} -r_F & 0 & 0 & 0 & 0 & 0 \end{bmatrix}. \quad\text{(10.38)}
$$

The final solution π is obtained by scaling the π′_i such that the sum equals one (c.f., Equation 10.36):

$$
\pi_i = \frac{\pi'_i}{\sum_{j=0}^{5}\pi'_j + 1} \quad i \in \{0 \ldots 5\}\,, \qquad
\pi_6 = \frac{1}{\sum_{j=0}^{5}\pi'_j + 1}\,. \quad\text{(10.39)}
$$

By exploiting that r_R = k r_F, the results can be further simplified. Equations for the π_i are provided by Table 10.4. Using the common denominator

$$
D = k\,r_F\,(r_A + r_p) + r_A\left(P_{FP}\,r_{FP} + P_{TP}\,r_{TP} + k\,P_{TN}\,r_{TN} + k\,r_{FN}\right),
$$

the solutions are:

π_i | Solution
π_0 | k r_F r_A / D
π_1 | k r_F r_TP / D
π_2 | k r_F r_FP / D
π_3 | k r_F r_TN / D
π_4 | k r_F r_FN / D
π_5 | r_A (P_FP r_FP + P_TP r_TP) / D
π_6 | k r_A (P_TN r_TN + r_FN) / D

Table 10.4: Solution to the equation system defined by Equations 10.35 and 10.36.
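The solution procedure (set π_6 = 1, solve π′Q′ = b by Gaussian elimination, then normalize) can also be carried out numerically. The following is an illustrative re-implementation under assumed names, not the author's code; the parameter values are the case-study rates that Table 10.6 will report later:

```python
# Hedged sketch: equilibrium distribution of the availability CTMC via
# Eqs. 10.35-10.39 (pi_6 := 1, solve pi' Q' = b, normalize).

def gauss_solve(a, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(m[i][c]))
        m[c], m[piv] = m[piv], m[c]
        for i in range(c + 1, n):
            fac = m[i][c] / m[c][c]
            for j in range(c, n + 1):
                m[i][j] -= fac * m[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def steady_state(r_tp, r_fp, r_tn, r_fn, r_a, r_f, r_r, p_tp, p_fp, p_tn):
    """Return (pi_0, ..., pi_6) for the generator Q of Eq. 10.34."""
    r_p = r_tp + r_fp + r_tn + r_fn
    q = [
        [-r_p, r_tp, r_fp, r_tn, r_fn, 0.0, 0.0],
        [(1 - p_tp) * r_a, -r_a, 0.0, 0.0, 0.0, p_tp * r_a, 0.0],
        [(1 - p_fp) * r_a, 0.0, -r_a, 0.0, 0.0, p_fp * r_a, 0.0],
        [(1 - p_tn) * r_a, 0.0, 0.0, -r_a, 0.0, 0.0, p_tn * r_a],
        [0.0, 0.0, 0.0, 0.0, -r_a, 0.0, r_a],
        [r_r, 0.0, 0.0, 0.0, 0.0, -r_r, 0.0],
        [r_f, 0.0, 0.0, 0.0, 0.0, 0.0, -r_f],
    ]
    # pi' Q' = b is equivalent to Q'^T x = b^T with b = [-r_F, 0, ..., 0]
    qt = [[q[j][i] for j in range(6)] for i in range(6)]
    pi_prime = gauss_solve(qt, [-r_f] + [0.0] * 5)
    norm = sum(pi_prime) + 1.0          # Eq. 10.39
    return [v / norm for v in pi_prime] + [1.0 / norm]

# Case-study rates and failure probabilities (values of Table 10.6):
pi = steady_state(1.178169e-05, 5.876737e-05, 0.000893264, 3.534508e-05,
                  0.004761905, 0.5, 0.78125, 0.5, 0.1463768, 0.04366895)
availability = sum(pi[:5])              # up-states 0..4 (Eq. 10.40)
```

For these parameters the result is an availability of about 0.99986, in agreement with the closed form of Equation 10.40.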
The π_i are the equilibrium (steady-state) probabilities for the states in the availability model.

Steady-state availability is determined by the portion of time the stochastic process stays in one of the up-states 0 to 4:

$$
A = \sum_{i=0}^{4} \pi_i = 1 - \pi_5 - \pi_6
$$

$$
A = \frac{(r_A + r_p)\,k\,r_F}{k\,r_F\,(r_A + r_p) + r_A\left(P_{FP}\,r_{FP} + P_{TP}\,r_{TP} + k\,P_{TN}\,r_{TN} + k\,r_{FN}\right)}\,, \quad\text{(10.40)}
$$

yielding a closed-form solution for the steady-state availability of systems with PFM.

10.6 Computing Reliability

Reliability R(t) is defined as the probability that no failure occurs up to time t, given that the system is fully operational at t = 0. In terms of CTMC modeling, this is equivalent to a non-repairable system and the computation of the first passage time into the down-state.

10.6.1 The Reliability Model

Since a non-repairable system is to be modeled, the distinction between the two down-states (S_R and S_F) is not required anymore. Furthermore, there is no transition back to the up-state. That is why a simpler topology can be used, where the two failure states are merged into one absorbing state S_F, as shown in Figure 10.7.

[Figure 10.7: CTMC model for reliability. Failure states 5 and 6 of Figure 10.4 have been merged into one absorbing state S_F]

The generator matrix for this model has the form:

$$
Q = \begin{bmatrix} T & t_0\\ 0 & 0 \end{bmatrix}, \quad\text{(10.41)}
$$

where T equals

$$
T = \begin{bmatrix}
-r_p & r_{TP} & r_{FP} & r_{TN} & r_{FN}\\
(1-P_{TP})\,r_A & -r_A & 0 & 0 & 0\\
(1-P_{FP})\,r_A & 0 & -r_A & 0 & 0\\
(1-P_{TN})\,r_A & 0 & 0 & -r_A & 0\\
0 & 0 & 0 & 0 & -r_A
\end{bmatrix} \quad\text{(10.42)}
$$

and t_0 equals

$$
t_0 = \begin{bmatrix} 0 & P_{TP}\,r_A & P_{FP}\,r_A & P_{TN}\,r_A & r_A \end{bmatrix}^{T}. \quad\text{(10.43)}
$$

10.6.2 Reliability and Hazard Rate

The distribution of the probability to first reach the down-state S_F yields the cumulative distribution of time-to-failure. In terms of CTMCs, this quantity is called the first-passage-time distribution F(t). Reliability R(t) and hazard rate h(t) can be computed from F(t) in the following way:

$$
R(t) = 1 - F(t) \quad\text{(10.44)}
$$

$$
h(t) = \frac{f(t)}{1 - F(t)}\,, \quad\text{(10.45)}
$$

where f(t) denotes the corresponding probability density of F(t).
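Evaluating R(t) numerically amounts to computing α exp(tT) e for the sub-generator T of Equation 10.42, with the initial probability mass on the up-state S_0. The following hedged sketch uses uniformization, a standard technique for CTMC transient analysis; it is not the author's code, and the rates are again the case-study values that Table 10.6 will report:

```python
# Hedged sketch: R(t) = alpha * exp(tT) * 1 by uniformization,
# exp(tT) = sum_k e^{-qt} (qt)^k / k! * P^k with P = I + T/q and
# q = max_i |T_ii|  (assumes at least one strictly negative diagonal entry).
import math

def survival(t, t_mat, alpha, kmax=400):
    n = len(t_mat)
    q = max(-t_mat[i][i] for i in range(n))
    p = [[(1.0 if i == j else 0.0) + t_mat[i][j] / q for j in range(n)]
         for i in range(n)]
    v = alpha[:]                      # v = alpha * P^k, updated in place
    total, weight = 0.0, math.exp(-q * t)
    for k in range(kmax):
        total += weight * sum(v)      # sum(v) = alpha * P^k * 1-vector
        v = [sum(v[i] * p[i][j] for i in range(n)) for j in range(n)]
        weight *= q * t / (k + 1)     # next Poisson weight
    return total

# Case-study rates (values of Table 10.6):
r_tp, r_fp, r_tn, r_fn = 1.178169e-05, 5.876737e-05, 0.000893264, 3.534508e-05
r_a = 0.004761905
p_tp, p_fp, p_tn = 0.5, 0.1463768, 0.04366895
r_p = r_tp + r_fp + r_tn + r_fn
T = [
    [-r_p, r_tp, r_fp, r_tn, r_fn],
    [(1 - p_tp) * r_a, -r_a, 0.0, 0.0, 0.0],
    [(1 - p_fp) * r_a, 0.0, -r_a, 0.0, 0.0],
    [(1 - p_tn) * r_a, 0.0, 0.0, -r_a, 0.0],
    [0.0, 0.0, 0.0, 0.0, -r_a],
]
alpha = [1.0, 0.0, 0.0, 0.0, 0.0]     # fully operational at t = 0
r500 = survival(500.0, T, alpha)      # R(500 s)
```

R(0) = 1 by construction, and R(t) decreases with t; for these parameters R(500) stays well above 0.9, consistent with the small rates involved.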
F(t) and f(t) are the cumulative distribution and density of a phase-type exponential distribution defined by T and t_0 (see, e.g., Kulkarni [149]):

$$
F(t) = 1 - \alpha \exp(t\,T)\,e \quad\text{(10.46)}
$$

$$
f(t) = \alpha \exp(t\,T)\,t_0\,, \quad\text{(10.47)}
$$

where e is a vector of all ones, exp(tT) denotes the matrix exponential, and α is the initial state probability distribution. It can be determined from the fact that reliability is defined such that the system is fully operational at time t = 0. Hence:

$$
\alpha = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \end{bmatrix}. \quad\text{(10.48)}
$$

Closed-form expressions exist and can be computed using a symbolic computer algebra tool. However, the solution would fill several pages⁵ and will hence not be provided here.

⁵ The solution found by Maple™ contains approximately 3000 terms.

10.7 How to Estimate the Parameters from Experiments

The previous sections described how availability and reliability for systems with PFM can be determined as a function of eleven parameters: MTTF, MTTR, Δt_l, Δt_p, p, r, f, P_TP, P_FP, P_TN, and k (c.f., Table 10.2). MTTF and MTTR were assumed to be known from a system without proactive fault management and can hence be estimated from an existing system; Δt_l and Δt_p have been assumed to be application-specific. The remaining seven parameters refer to proactive fault management. If reliable estimates for these parameters are available from similar systems, the derived formulas can be applied directly. If not, however, it seems impossible to derive them analytically from system specifications. Therefore, they must be estimated by experiments in an environment similar to the production environment. In this section, an estimation procedure is described that separates the mutual influence of failure prediction and reaction schemes in order to determine all seven parameters.

10.7.1 Failure Prediction Accuracy

During the first experiment, only those parameters characterizing failure prediction (namely p, r, and f) are investigated, with as little feedback onto the system as possible.
This can be accomplished either by performing predictions offline, working with previously recorded logfiles (as has been done in Chapter 9 for the telecommunication system), or by performing predictions on a separate machine. Side-effects such as additional workload are incorporated in later experiments.

The outcome of the failure prediction experiment is a sequence of predictions (either positive or negative) and a sequence of failures, by which predictions can be classified as true or false. Figure 10.8 shows all four cases that can occur. The figure is almost the same as Figure 10.5; however, it assigns situation IDs ① to ④, which are needed in later steps of the estimation procedure.

[Figure 10.8: A timeline obtained from an experiment showing true failures (t) and prediction results. "!" indicates positive predictions (failure warnings) and "Ø" negative predictions. Four situations can occur, as indicated by ① to ④]

Starting from a timeline as in Figure 10.8, predictions can be assigned to be true positive (situation ①), false positive (situation ②), false negative (situation ③), or true negative (situation ④). From this assignment, p, r, and f can be computed according to the definitions given in Equations 10.5 to 10.7:

$$
p = \frac{count(①)}{count(①) + count(②)} \quad\text{(10.49)}
$$

$$
r = \frac{count(①)}{count(①) + count(③)} \quad\text{(10.50)}
$$

$$
f = \frac{count(②)}{count(②) + count(④)}\,, \quad\text{(10.51)}
$$

where count(x) denotes the number of times situation x has occurred in the experiment. Using p, r, and f, the expected ratios of true and false positive and negative predictions n_TP/n, n_FP/n, n_TN/n, and n_FN/n can be computed using Equations 10.13 to 10.18, where in the following N denotes the total number of predictions in the experiment's trace. From now on, n_TP/n, n_FP/n, n_TN/n, and n_FN/n are assumed to be known. They are used in later experiments.
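Counting according to Equations 10.49 to 10.51 is direct. A minimal sketch follows; the situation counts below are invented for illustration, chosen so that they happen to reproduce the precision, recall, and false positive rate later reported in Table 10.6:

```python
# Hedged sketch of Eqs. 10.49-10.51: precision, recall, and false positive
# rate from the four situation counts of Figure 10.8. Example counts are
# hypothetical.

def estimate_prf(c1, c2, c3, c4):
    """c1..c4 = count(situation 1) .. count(situation 4)."""
    p = c1 / (c1 + c2)   # Eq. 10.49: precision
    r = c1 / (c1 + c3)   # Eq. 10.50: recall
    f = c2 / (c2 + c4)   # Eq. 10.51: false positive rate
    return p, r, f

p, r, f = estimate_prf(c1=10, c2=50, c3=30, c4=760)
```

This yields p ≈ 0.167, r = 0.25, and f ≈ 0.0617.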
10.7.2 Failure Probabilities P_TP, P_FP, and P_TN

The goal of the second experiment is to assess the capability of downtime avoidance mechanisms. Since these mechanisms are run only in case of a positive prediction, and a failure can only be avoided if one is imminent, failure avoidance capability is gauged by the probability P_TP. More precisely, P_TP is the probability that a failure occurs even though an upcoming true failure has been predicted and downtime avoidance mechanisms have been performed. To estimate it, failure predictions and downtime avoidance mechanisms have to be performed together on a test system that mimics key features of the modeled system as closely as possible.

The outcome of the experiment is again a timeline as shown in Figure 10.8. However, the simple assignment of cases to true / false positives / negatives is not possible anymore, due to the following observations:

• Situation ① (co-occurrence of failure warning and failure). This situation might be traced back to two scenarios: (a) the prediction was a true positive and the triggered downtime avoidance action could not prevent the occurrence of the failure, or (b) the prediction was a false positive that successively led to a failure induced by the prediction algorithm and / or the triggered action (e.g., due to additional load).

• Situation ② (failure warning without occurrence of a failure). This situation might be traced back to (a) a false positive prediction or (b) a true positive prediction with successful avoidance of the failure.

• Situation ③ (occurrence of a failure only). This situation can be caused by (a) a false negative prediction or (b) a true negative prediction where the execution of the failure prediction algorithm (there are no actions performed upon negative predictions) caused the failure.

Table 10.5 provides a complete list of all these cases.
Situation | Comment | Prediction | Failure | Prob. of occurrence
① | action not successful | TP | F | (n_TP/n) · P_TP
① | failure caused by PFM | FP | F | (n_FP/n) · P_FP
② | failure prevented | TP | NF | (n_TP/n) · (1 − P_TP)
② | false positive prediction | FP | NF | (n_FP/n) · (1 − P_FP)
----------|---------|------------|---------|--------------------
③ | failure caused by prediction | TN | F | (n_TN/n) · P_TN
③ | false negative prediction | FN | F | n_FN/n
④ | correctly no warning | TN | NF | (n_TN/n) · (1 − P_TN)

Table 10.5: Mapping of cases to situations. Although only four different situations can be observed in the experiment's output (c.f., Figure 10.8), they can be traced back to seven different cases if downtime avoidance techniques are applied.

As is indicated by the horizontal line, there are two groups of non-overlapping parameters and situations: the first group comprises the parameters P_TP and P_FP and situations ① and ②, while the second group comprises the parameter P_TN and situations ③ and ④. Since handling of the second group is easier, this group is discussed first.

Estimation of P_TN. By combining rows referring to the same situation in the second group of Table 10.5, the following linear equation system can be set up:

$$
\frac{n_{TN}}{n}\,P_{TN} + \frac{n_{FN}}{n} = \frac{count(③)}{N} \quad\text{(10.52)}
$$

$$
\frac{n_{TN}}{n}\,(1 - P_{TN}) = \frac{count(④)}{N} \quad\text{(10.53)}
$$

Since there are two equations for one parameter (P_TN), it can be shown that a solution only exists if

$$
\frac{count(③)}{N} + \frac{count(④)}{N} = \frac{n_{TN}}{n} + \frac{n_{FN}}{n}\,, \quad\text{(10.54)}
$$

expressing that the observed fraction of negative predictions (left-hand side) is equal to the expected fraction computed from precision, recall, and false positive rate, which have been estimated before (right-hand side of the equation). Assuming that this is the case, one of Equations 10.52 or 10.53 can be chosen. Since situation ④ is expected to occur more frequently, the estimation error is expected to be lower, and hence Equation 10.53 is solved for P_TN:

$$
P_{TN} = 1 - \frac{count(④)/N}{n_{TN}/n}\,, \quad\text{(10.55)}
$$

which has an intuitive interpretation: if count(④)/N equals n_TN/n (expressing that all true negative predictions appear in situation ④), there are no true negative predictions that resulted in situation ③, which means that no failures are induced by the prediction algorithm, which in turn is consistent with P_TN being equal to zero.

Estimation of P_TP and P_FP. The same procedure is applied to the first group in Table 10.5. The linear equation system is

$$
\frac{n_{TP}}{n}\,P_{TP} + \frac{n_{FP}}{n}\,P_{FP} = \frac{count(①)}{N} \quad\text{(10.56)}
$$

$$
\frac{n_{TP}}{n}\,(1 - P_{TP}) + \frac{n_{FP}}{n}\,(1 - P_{FP}) = \frac{count(②)}{N}\,, \quad\text{(10.57)}
$$

which are two equations for two variables. However, Equations 10.56 and 10.57 are not independent. Similar to the estimation of P_TN, a solution only exists if

$$
\frac{count(①)}{N} + \frac{count(②)}{N} = \frac{n_{TP}}{n} + \frac{n_{FP}}{n}\,. \quad\text{(10.58)}
$$

Since there is only one (independent) equation containing two variables, an additional, independent equation involving P_TP or P_FP has to be formed. The following options are available:

1. Since P_FP denotes the risk of a failure induced by the execution of failure prediction algorithms and subsequent (unnecessary) actions, P_FP could be set a priori, yielding

$$
P_{FP} = const. \quad\text{(10.59)}
$$

2. In case of a true positive prediction, a failure may occur for two reasons: the action was not able to avoid the failure, or the action would have avoided the failure that had been predicted, but due to additional load, another failure occurs. However, the risk of inducing an additional failure is P_FP (see above), and hence one could assume that

$$
P_{TP} = P(\text{failure cannot be avoided}) + P_{FP}\,. \quad\text{(10.60)}
$$

The difficulty then is to determine the probability that a failure cannot be avoided.

3. A fixed ratio of P_TP : P_FP could be assumed.
For example, a ratio of 10:1 would express that the risk of failure occurrence after the issuing of a failure warning is ten times as high if the warning is correct as if it is a false warning. In general, this leads to

$$
P_{TP} = c\,P_{FP}\,, \quad\text{(10.61)}
$$

where c is a constant (ten in the example).

4. Either P_TP or P_FP can be determined in a separate experiment. This also results in

$$
P_{FP} = const. \quad\text{or}\quad P_{TP} = const. \quad\text{(10.62)}
$$

Solutions one to three involve assumptions that are vague and difficult to support by measurements. In contrast, solution four is based on experimental evidence. It might seem that it does not make a difference whether P_TP or P_FP is estimated, but this is not true: in order to estimate P_FP or P_TP, it must be known when a prediction is a false or true positive. In the false positive case, it must be proven that a failure would not have occurred if failure prediction and actions had not been in place, which seems infeasible. In the second case, however, it must be assured that a positive prediction is a true positive, which means that a failure really is imminent. This can be achieved by fault injection (see, e.g., Silva & Madeira [242] for an introduction), as is explained in the following.

Once again, P_TP is the probability of failure occurrence given a true positive prediction. Applying a maximum likelihood estimator yields:

$$
P_{TP} = P(F \mid TP) = \frac{count(F \wedge TP)}{count(TP)} = \frac{count(F \wedge TP)}{count(F \wedge TP) + count(NF \wedge TP)}\,, \quad\text{(10.63)}
$$

where count(F ∧ TP) denotes the number of true positive predictions where (despite all preventive actions) a failure has occurred, and count(NF ∧ TP) denotes the number of cases where a failure warning is raised correctly but is not followed by a failure. Fault injection is applied in order to know when a failure really is imminent in the system; hence, any positive prediction (failure warning) occurring within some time interval after fault injection is a true positive.
The case that a true positive prediction is followed by a failure (F ∧ TP) can be identified directly in the log of a fault-injection experiment (c.f., situation ⑤ in Figure 10.9).

[Figure 10.9: Identifying true positive predictions by fault injection. ⑤: If a failure (t) occurs within a given time-interval after fault injection and the failure is preceded by a failure warning (exclamation mark), the situation is assumed to be a true positive prediction where the failure could not be prevented. ⑥: If no failure but a failure warning is observed after fault injection, this corresponds either to a false positive prediction, if fault injection was not successful, or to a true positive prediction where the failure has been prevented]

Identification of the case that no failure occurs after a true positive prediction (NF ∧ TP) is more complicated. The reason for this is that the injection of a fault does not always lead to a failure. Hence, situation ⑥ in Figure 10.9 can either be a true positive where the failure has been prevented (this is the case needed for Equation 10.63) or a false positive prediction, in the case that the fault injection did not succeed. However, these two cases can be distinguished by the relative frequencies of true positive and false positive predictions, which is determined by precision. But since a fault injector can in some cases change system behavior significantly, precision has to be estimated separately for the fault injection experiments, following the same offline procedure as described in the previous section (p′ is used to indicate precision in this case). This leads to the following formula for maximum likelihood estimation of P_TP:

$$
P_{TP} = \frac{count(⑤)}{count(⑤) + p'\,count(⑥)}\,. \quad\text{(10.64)}
$$

It should be noted that fault injection is a difficult issue, and care should be taken that a broad range of faults is injected such that failures of different types occur.
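The estimators of Equations 10.55 and 10.64, plus P_FP obtained by solving Equation 10.56 for P_FP, can be sketched as follows. All function names and example numbers are hypothetical:

```python
# Hedged sketch of the failure-probability estimators: P_TN from Eq. 10.55,
# P_TP from the fault-injection counts of Eq. 10.64, and P_FP by solving
# Eq. 10.56 for P_FP. All example numbers are invented.

def estimate_p_tn(c4, n_total, n_tn_frac):
    """Eq. 10.55: c4 = count(situation 4), n_tn_frac = n_TN / n."""
    return 1.0 - (c4 / n_total) / n_tn_frac

def estimate_p_tp(c5, c6, p_prime):
    """Eq. 10.64: c5, c6 = fault-injection counts of situations 5 and 6."""
    return c5 / (c5 + p_prime * c6)

def estimate_p_fp(c1, n_total, n_tp_frac, n_fp_frac, p_tp):
    """Solve Eq. 10.56 for P_FP, given the estimate of P_TP."""
    return (c1 / n_total - n_tp_frac * p_tp) / n_fp_frac

# If all expected true negatives show up as situation 4, P_TN = 0:
p_tn = estimate_p_tn(c4=894, n_total=1000, n_tn_frac=0.894)

# Fault injection: 10 runs with warning followed by failure, 40 with a
# warning only, at separately estimated precision p' = 0.5:
p_tp = estimate_p_tp(c5=10, c6=40, p_prime=0.5)
```

A useful consistency check for estimate_p_fp is a round trip: generating count(①) from known P_TP and P_FP via Equation 10.56 and feeding it back recovers the chosen P_FP.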
If downtime avoidance techniques are only able to compensate for upcoming failures of certain classes, P_TP equals one for failure types that are not taken care of. If the distribution of failure types is known, the estimate given in Equation 10.64 can be improved.

By substituting the solution for P_TP (Equation 10.64) into either Equation 10.56 or 10.57, P_FP can be computed. Using the first equation yields

$$
P_{FP} = \frac{\dfrac{count(①)}{N} - \dfrac{n_{TP}}{n}\,P_{TP}}{\dfrac{n_{FP}}{n}}\,. \quad\text{(10.65)}
$$

Measuring deviation. Since all experiments are finite samples, and since, if actions are performed, failure prediction accuracy might deviate slightly from the values determined by offline estimation, exact equalities in Equations 10.54 and 10.58 will be observed rather rarely. If the deviation is sufficiently small, the equations can be used nonetheless. If not, the experiments have to be repeated with an increased sample size or with more similar environments and conditions. The amount of deviation can be determined by Equations 10.54 and 10.58. It can be observed that the deviation is symmetric: if the observed fraction of negative predictions is larger than expected (left-hand side > right-hand side in Equation 10.54), the observed fraction of positive predictions is smaller than expected (left-hand side < right-hand side in Equation 10.58), and vice versa. Therefore, either one can be used to determine the deviation from expectations. Since there usually are more negative than positive predictions, the estimate is more reliable if negative predictions are used, and deviation is defined as follows (c.f., Equation 10.54):

$$
dev = \frac{count(③)}{N} + \frac{count(④)}{N} - \frac{n_{TN}}{n} - \frac{n_{FN}}{n}\,. \quad\text{(10.66)}
$$

10.7.3 Repair Time Improvement k

In order to estimate the repair time improvement factor k, an experimental trace such as in Figure 10.8 that additionally includes time-to-repair is needed.
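Given such a trace, estimating k reduces to two sample means: the mean repair time without a preceding failure warning (unplanned / unprepared) divided by the mean repair time with a warning (forced / prepared), c.f. Equation 10.8. A minimal sketch with invented repair times:

```python
# Hedged sketch: repair time improvement factor k as the ratio of mean
# unplanned/unprepared repair time to mean forced/prepared repair time
# (c.f., Eq. 10.8). The repair times below are hypothetical examples.

def repair_improvement(unprepared, prepared):
    mttr = sum(unprepared) / len(unprepared)   # no failure warning preceded
    mttr_p = sum(prepared) / len(prepared)     # failure warning preceded
    return mttr / mttr_p

k = repair_improvement(unprepared=[2.1, 1.9, 2.0], prepared=[1.3, 1.26])
```

With these example times (mean 2.0 s unprepared versus 1.28 s prepared), k = 1.5625, the value that will reappear in the case study below.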
As k is the ratio of MTTR for unplanned / unprepared downtime to MTTR for forced / prepared downtime (c.f., Equation 10.8), mean values for both cases need to be computed. The distinction between both types of downtime is based on failure prediction: in case of a failure warning (situation ① in Figure 10.8), time to repair contributes to forced / prepared downtime; in case of no failure warning (situation ③ in Figure 10.8), it contributes to unplanned / unprepared downtime. Comparing the value of MTTR for the unpredicted case to the fixed value known from a system without PFM yields a further indication of how representative the estimate is.

10.7.4 Summary of the Estimation Procedure

Since the estimation procedure is quite complex, involving several experiments, a brief summary of the procedure is provided in Figure 10.10.

10.8 A Case Study and an Example

In his diploma thesis [254], Olaf Tauber has set up an experimental environment in order to explore the effects of proactive fault management on a real system. The case study has been performed by extending the .NET Pet Shop application⁶ from Microsoft. This section summarizes the work and highlights main results. Since the results have not been convincing, a more advanced example is also presented.

10.8.1 Experiment Description

.NET is a runtime environment developed by Microsoft that is able to execute software components written in various programming languages. Furthermore, it provides ready-to-use functionality to handle a multitude of tasks, ranging from multi-threading to graphical user interfaces. "Pet Shop" is a small, open-source web-shop demo application that has been built in order to demonstrate superiority over the Java-based "PetStore" demo application developed by Sun Microsystems. Running the Pet Shop application requires at least two additional components: a webserver that handles HTTP requests from web-browsers (clients) and a database to store the data in.
In order to create an experimental environment for testing proactive fault management techniques, several modules had to be added to the system (see Figure 10.11):

• Stressors. In order to simulate a real scenario, workload must be put onto the system. An existing load generator called JMeter has been adapted to simulate a variety of actions associated with shopping (e.g., logging in, browsing the catalog, viewing and changing the shopping cart, payment, etc.). Activity patterns have been executed randomly, obeying several boundary conditions, such as that users have to log in prior to payment. Furthermore, the stressors have been replicated, simulating a total of 70 users shopping concurrently. The second important part of the stressors is response analysis: each response has been analyzed with respect to response time and correctness of the returned web-page. For performance reasons, relevant data has only been stored during runtime and has been analyzed offline after each test run.

• Monitoring. Since proactive fault management is about acting upon an analysis of the current state, runtime monitoring is necessary. In this case, a .NET component has been used to report system-wide Windows performance counters such as the number of active database transactions, the size of the swap file, etc.

• Failure Prediction. Monitoring values have been transmitted over a network socket to a failure predictor. It must be pointed out that the failure prediction algorithm proposed in this dissertation has not been used. Instead, Olaf Tauber has developed a simple prediction algorithm that is based on weighted events generated from threshold violations. The reason for this is that the implementation of HSMM-based failure prediction had not been finished at the time Olaf Tauber carried out the experiments.

• Action. If a failure is predicted, some action is triggered. One downtime avoidance and one downtime minimization technique have been implemented:
  – Load lowering has been chosen for downtime avoidance. More specifically, lowering of the load was achieved by displaying a web page stating that the server is temporarily overloaded and that clients should retry in a few seconds.
  – A two-level hierarchical reboot strategy was used for downtime minimization. The reboot strategy was able to either reboot the application layer in the .NET runtime or reboot the entire system.

• Fault Injection. One of the most effective fault injection techniques is to limit available resources. Olaf Tauber has opted for allocating memory such that the rest of the system (including the Pet Shop application, the webserver, and the database) has to cope with a reduced amount of free memory. Specifically, fault injection has been implemented by a multi-threaded process controlled⁷ from outside the system.

⁶ See http://msdn2.microsoft.com/en-us/library/ms978487.aspx
⁷ This means specification of start, duration, and amount of memory allocation.

[Figure 10.10: Summary of the procedure to estimate model parameters:

1. Experiment 1: without feedback onto the system; either write logfiles or execute predictions on a separate computer.
   (a) Classify the resulting data into situations ① to ④ (c.f., Figure 10.8).
   (b) Compute precision p, recall r, and false positive rate f using Equations 10.49, 10.50, and 10.51.
   (c) Using p, r, and f, compute the expected ratios of true and false positives and negatives n_TP/n, n_FP/n, n_TN/n, and n_FN/n using Equations 10.13 to 10.18.

2. Experiment 2: with failure prediction and actions, similar to a production system.
   (a) Classify the resulting data into situations ① to ④ (c.f., Figure 10.8).
   (b) Determine the relative amount of negative predictions, which is count(③)/N + count(④)/N, where N is the total number of predictions. Compute the deviation dev by Equation 10.66 using n_TN/n and n_FN/n from Experiment 1.
   (c) If the deviation is significant,ᵃ Experiments 1 and 2 have to be repeated, either with more samples to reduce sampling effects or such that computing environments and conditions are more similar.
   (d) Estimate P_TN using Equation 10.55.
   (e) From repair times occurring in the experiment, estimate MTTR and MTTR_p and compute the repair time improvement factor k as described in Section 10.7.3.
   (f) Compute the overall prediction rate r_p using Equation 10.29.
   (g) Using r_p, compute the prediction rates r_TP, r_FP, r_TN, r_FN from Equation 10.30 using the values n_TP/n, n_FP/n, n_TN/n, and n_FN/n from Experiment 1 (c), above.
   (h) Compute the rates r_A, r_F, and r_R using Equations 10.31 to 10.33.

3. Experiment 3: with fault injection but without feedback onto the system.
   (a) Identify occurrences of situations ① and ② in the fault injection experiment (c.f., Figure 10.8).
   (b) Estimate p′ using Equation 10.49.

4. Experiment 4: with fault injection, prediction, and actions in place. By analyzing situations ⑤ and ⑥ (c.f., Figure 10.9), estimate P_TP using Equation 10.64.

5. Estimate P_FP by Equation 10.65 using the data of Experiment 2.

ᵃ The threshold is application-specific and cannot be provided here.]

[Figure 10.11: Overview of the case study]

10.8.2 Results

At the time when Olaf Tauber carried out his experiments, the model proposed in this chapter had not been developed, and hence he used the formulas and estimation technique proposed in Salfner & Malek [225]. Fortunately, the supplemental DVD to the diploma thesis contained the complete recordings collected during the experiments, and the data could be analyzed with the estimation procedure described in Section 10.7. In contrast to this procedure, the one that Olaf Tauber applied consisted of only two phases; but since he applied fault injection in his experiments, the data could be split further into time intervals with and without failure prediction, resulting in data for four experiments.
In order to clearly separate the parts, some time period after the end of each fault injection has been removed from consideration. Two proactive fault management techniques have been investigated by Olaf Tauber: downtime minimization by restart, and downtime avoidance by presenting a static page saying "server is busy". Since the only type of failures observed were singleton runtime failures, each affecting only a few requests, the restarting approach was not at all successful: even in the case of application-level restarting, eleven times as many service requests got lost during restart as by the failure itself. For this reason, only the downtime avoidance technique is analyzed in the following.

The analysis has been performed with a lead-time Δt_l of 60 s and a prediction-period Δt_p of five minutes. Table 10.6 shows the parameter values for the resulting model. Unfortunately, the limited amount of data is not sufficient to yield a statistically reliable assessment of the parameters. Hence, the results need to be interpreted with care. Deviation, as defined by Equation 10.66, has been equal to 0.0164.

Fixed parameters | Value [s]
MTTF | 25711
MTTR | 2.00
Δt_l | 60
Δt_p | 300

Estimated parameters | Value
p | 0.167
r | 0.25
f | 0.0617284
P_TP | 0.5
P_FP | 0.1463768
P_TN | 0.04366895
k | 1.5625

Resulting rates | Value [1/s]
r_TP | 1.178169e-05
r_FP | 5.876737e-05
r_TN | 0.000893264
r_FN | 3.534508e-05
r_A | 0.004761905
r_F | 0.5
r_R | 0.78125

Table 10.6: Resulting values for model parameters as estimated from the data of the case study. Fixed parameters refer to the parameters not depending on PFM. Estimated parameters are those that are estimated from experiments as described in Section 10.7. The third part lists the resulting transition rates computed from the estimated parameters.

It might look surprising that k is not equal to one, since showing a "server is busy" page aims at downtime avoidance rather than downtime minimization.
The explanation for this behavior is that MTTR as well as MTTR_p are determined by the first successful request after a failed one. If only a static page is displayed, the first successful response can be delivered earlier,⁸ and hence MTTR is reduced.

Using the estimated values for the model parameters, steady-state availability, reliability, and hazard rate can be computed and plotted. In particular, the steady-state availability of the system without proactive fault management was equal to A = 0.9999222 and that of the system with PFM A_PFM = 0.9998618. This is a dramatic decrease! More precisely,

$$
\frac{1 - A_{PFM}}{1 - A} \approx 1.78\,, \quad\text{(10.67)}
$$

which indicates that unavailability is approximately doubled.

⁸ To be precise, after 1.28 seconds rather than 2.00 seconds, resulting in k = 1.5625.

[Figure 10.12: Reliability for the case study. (a) Reliability with and without PFM; (b) blow-up of the first 500 s of (a), showing the phase-type character of the reliability model]

Regarding reliability, a similar picture is observed: in Figure 10.12-a, the reliability of the system with and without proactive fault management is plotted. It can be observed that the PetShop system without PFM shows better reliability than the altered PetShop system. A more fine-grained analysis of the first few hundred seconds reveals that the reliability of the case with PFM is slightly higher within the first 300 seconds (see Figure 10.12-b). However, this results most likely from the simple model used to compute the reliability of the system without PFM, which employs an exponential distribution of time-to-failure:

$$
R(t) = 1 - F(t) = 1 - \left(1 - e^{-\frac{t}{MTTF}}\right) = e^{-\frac{t}{MTTF}}\,. \quad\text{(10.68)}
$$
(10.68) Nevertheless, the fine-grained analysis reveals the phase-type character of reliability as a consequence of the modeling approach. Hazard rates are shown in Figure 10.13. From the usage of a single exponential distribution for the system without PFM (c.f., Equation 10.68) results a constant hazard rate: h(t) = 1 λ e−λ t =λ= . −λ t 1 − (1 − e ) MT T F (10.69) Regarding hazard rate of the system with PFM, the characteristic that the hazard rate is zero for t = 0 results from the fact that there is no direct transition from the initial up-state to a failure. It can also be observed from Figure 10.13 that for t → ∞ hazard rate approaches a constant, which results from the CTMC settling into steady-state. As could be expected from worse steady-state availability and reliability, the constant value is higher than for the case without PFM. Looking at Table 10.6, the bad performance of the proactive fault management can be traced back both to low values for precision and recall and the inefficiency of the downtime avoidance technique: • Low precision and recall express that the used simplistic threshold-based failure prediction method is not able to achieve sufficiently accurate failure prediction: A 257 h(t) 2e−05 4e−05 6e−05 8e−05 10.8 A Case Study and an Example 0e+00 with PFM w/o PFM 0 200 400 600 800 1000 time [s] Figure 10.13: Hazard rate for the case study precision of 0.167 implies that about 83% of all failure warnings are false. Orthogonal to that, only 25% of failures are caught by the prediction algorithm and three fourth are missed. As a side-remark, these values are a good example to show that accuracy —as defined by Equation 8.10 on Page 156— is not an appropriate metric to evaluate failure prediction: accuracy equals 90.59%! The explanation for this discrepancy is that most of the predictions are true negatives, as can be seen from Table 10.7 listing the relative distribution among predictions as obtained from Equations 10.13 to 10.18. 
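The relative amounts of the four prediction types follow from precision p, recall r, and false positive rate f alone. The sketch below assumes only the standard contingency-table definitions of these metrics (it does not reproduce Equations 10.13 to 10.18 themselves); it recovers the table values and an accuracy of about 90.6%.

```python
# Relative amounts of the four prediction types, derived from precision p,
# recall r, and false positive rate f (standard contingency-table definitions).
p, r, f = 0.167, 0.25, 0.0617284

# Predictions per failure = 1 + r*(1-p)/(p*f), so the fraction of all
# predictions corresponding to actual failures (TP + FN) is its inverse.
failures = 1.0 / (1.0 + r * (1.0 - p) / (p * f))

tp = r * failures            # true positives:  recall * failures
fn = (1.0 - r) * failures    # false negatives: missed failures
fp = tp * (1.0 - p) / p      # false positives, from precision
tn = fp * (1.0 - f) / f      # true negatives, from false positive rate

accuracy = tp + tn           # why accuracy is misleading: dominated by TN
print(round(100 * tp, 2), round(100 * fp, 2),
      round(100 * tn, 2), round(100 * fn, 2), round(100 * accuracy, 2))
```

This reproduces the percentages of Table 10.7 and shows there are roughly 21 predictions per failure occurrence.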
Type of prediction   True positives   False positives   True negatives   False negatives
Relative amount      1.18%            5.88%             89.40%           3.54%

Table 10.7: Relative amount of the four types of prediction

• Poor ability to prevent failures: the probability that a failure occurs even if it is predicted is P_TP = 0.5. This number is a coarse estimate, since only a total of 36 predictions, containing five failures, are available from the data. The value of P_TP being one half indicates that with every second true positive prediction, a failure occurs anyway. Although downtime is smaller for predicted outages (k > 1), this cannot compensate for the fact that there are 1 + r(1−p)/(p·f) = 21.25 times as many predictions as occurrences of failures in the PetShop without proactive fault management. Even that would be no problem if most of the predictions were true negatives and P_TN was sufficiently small, which is also not the case in this example.

In summary, the experiment has shown that the application of proactive fault management can make a system worse. However, the applied failure prediction method was too simple and downtime avoidance was far from being effective. The next section will demonstrate the effects in a more sophisticated setting.

10.8.3 An Advanced Example

In order to show that proactive fault management can indeed improve steady-state system availability, calculations have been carried out assuming parameter values from a better failure predictor and more effective actions. More specifically, the values that have been observed for HSMM-based failure prediction for the telecommunication system (c.f., Chapter 9) have been used, which are a precision of 0.70, a recall of 0.62, and a false positive rate of 0.016. Also with respect to the effectiveness of actions and risk-induced failures, slightly better values have been assumed.

Parameter   p      r      f       P_TP    P_FP    P_TN     k
Value       0.70   0.62   0.016   0.25    0.1     0.001    2

Table 10.8: Parameters assumed for the sophisticated example
Exact values for P_TP, P_FP, P_TN, and k are listed in Table 10.8. Values for MTTF and MTTR are the same as in the case study by Olaf Tauber. Using these values, a steady-state availability of A_PFM = 0.999962 has been computed. Availability of the system without PFM is the same as in the previous experiment. This results in cutting down unavailability approximately by half:

(1 − A_PFM) / (1 − A) ≈ 0.488 .    (10.70)

Reliability and hazard rate are also improved, as can be seen from Figures 10.14 and 10.15. This time, the constant limiting hazard rate is below the hazard rate of a system without proactive fault management.

[Figure 10.14: Reliability for the sophisticated example, with and without PFM. Similar to Figure 10.12, the first 500 seconds are magnified, showing the phase-type character of the underlying distribution.]

[Figure 10.15: Hazard rate for the more sophisticated example.]

10.9 Summary

In this chapter, a model has been introduced in order to assess the effect of proactive fault management, which denotes the approach of combining proactive techniques with a failure predictor: each time an imminent failure is predicted, actions are triggered that try either to avoid or to minimize the downtime incurred by a failure occurrence. Examples have been given for both types of actions. The model presented is based on the well-known continuous-time Markov chain model used by Huang et al. [126] to model software rejuvenation, which is a special case of downtime minimization by periodic restarting. The model replaces the failure-probable state of the original rejuvenation CTMC, which is one of its major drawbacks, by four states representing the correctness of failure predictions. The model is based on eleven parameters, of which four are determined by the boundary conditions of the system and the remaining seven characterize the efficiency of proactive fault management:

• precision, recall, and false positive rate are used for the assessment of failure prediction accuracy;

• the probabilities of failure occurrence in the case of true positive, false positive, or true negative predictions are used to assess the success of downtime avoidance techniques as well as to capture the probability of failures that are induced by failure prediction and the actions themselves;

• a repair time improvement factor accounts for the effect of improved repair times in the case of forced versus unplanned downtime.

Closed-form solutions for steady-state availability, reliability, and hazard rate have been developed, and a procedure for estimating these seven parameters from experimental data has been described. Finally, a case study has been presented, where the Microsoft .NET demo web-shop called "Pet Shop" has been extended in order to facilitate testing of simple proactive fault management techniques. The case study is based on data gathered by Olaf Tauber in the course of his diploma thesis, which has primarily been supervised by the author. However, neither the failure prediction algorithm (which is not the HSMM-based algorithm described in this thesis) nor the applied downtime minimization and avoidance techniques have been convincing, such that availability, reliability, and hazard rate get worse if the techniques are applied. For this reason, a second, more advanced example has been presented using values for precision, recall, and false positive rate that have been achieved by HSMM-based failure prediction for the telecommunication system case study.
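The headline numbers of the two scenarios can be re-derived directly from the reported steady-state availabilities; a minimal check:

```python
# Unavailability ratios for the two scenarios (Equations 10.67 and 10.70).
a = 0.9999222            # steady-state availability without PFM

a_pfm_case = 0.9998618   # case study: simplistic prediction and actions
a_pfm_adv = 0.999962     # advanced example: HSMM-grade prediction assumed

ratio_case = (1 - a_pfm_case) / (1 - a)  # > 1: unavailability roughly doubled
ratio_adv = (1 - a_pfm_adv) / (1 - a)    # < 1: unavailability roughly halved
print(round(ratio_case, 2), round(ratio_adv, 3))
```

The computed ratios agree with the values 1.78 and 0.488 reported above.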
Regarding the efficiency of the methods, slightly better values than the ones estimated from Olaf Tauber's experiments have been used. In this setting, reliability was significantly improved and unavailability was cut down by half. However, it should be noted that HSMM-based prediction, if applied to the Pet Shop system, would not have reached results as good as for the telecommunication case study. The reason for this is that no fine-grained fault detection is built into the Pet Shop; therefore, only very few indicative symptomatic errors are reported prior to a failure.

One of the major limitations of the model is that it operates only on mean times, which is a direct consequence of using continuous-time Markov chains. Other models such as stochastic activity networks (SAN) can model more details. On the other hand, finding closed-form solutions is rather difficult for these models. A further limitation of the model presented here is that diagnosis and scheduling of actions (see Chapter 12) are not explicitly modeled: If a PFM system comprises several different actions, a decision is necessary about which action to trigger in a given situation. This decision can be either correct or wrong. Although the decision accuracy of the dispatcher is inherently contained in the probabilities P_TP, P_FP, P_TN, and k (think, for example, of the case that the dispatcher chooses to prepare a repair action instead of triggering a preventive action: then the probability of failure occurrence is increased while k is improved), a more detailed modeling would be desirable. On the other hand, the introduction of even more states and parameters makes the model more difficult to understand and results in more parameters that need to be estimated from experimental data.

Contributions of this chapter. The main contribution of this chapter is the proposal of a CTMC model to assess the effect of proactive fault management on availability, reliability, and hazard rate.
A brief survey of existing models that try to evaluate the effect of proactive fault management has revealed that, to the best of our knowledge, the proposed CTMC model is the first to

• clearly distinguish between all four types of failure predictions: true positives, true negatives, false positives, and false negatives,

• handle both downtime minimization as well as downtime avoidance techniques,

• incorporate the case that failure prediction plus triggered actions can induce failures, i.e., due to additional load caused by prediction or actions, a failure occurs that would not have occurred if no proactive fault management was in place.

From a practical point of view, the three main contributions of the model are:

• It can help to decide whether the application of proactive fault management is useful for a given system. In order to do so, MTTF and MTTR must be determined from the current version of the system. The remaining parameters must be estimated from experiments in similar environments, as done in Chapter 9 for the assessment of failure prediction effectiveness.

• When analyzing a system that already employs proactive fault management techniques, partial derivatives of the availability / reliability formulas may give an indication which of the seven parameters would be most effective to increase availability. For example, if a system's engineer had $100,000 to spend on improved proactive fault management, differentiating the formulas derived in this chapter indicates whether it is, e.g., more effective to spend the money on improved failure prediction methods or on a reduction of MTTR for forced / prepared outages.

• It can be used to determine the optimal trade-off between precision, recall, and false positive rate.
In order to do so, all parameters except for precision, recall, and false positive rate must be assumed to be fixed. Then, by Equation 10.40, availability becomes a function of these three parameters. Hence, an availability value can be assigned to each point of the trajectory through the space of precision / recall / false positive rate, and the optimal combination can be chosen.

Relation to other chapters. This chapter is the first of the fourth phase of the engineering cycle, and since the main focus of this thesis is on failure prediction, it is also the last. The remaining chapters will conclude the thesis and provide an outlook onto further research.

Chapter 11

Summary and Conclusions

The initial spark that has lit up the fire providing energy to write this dissertation has been the challenge to predict the failures of a commercial telecommunication platform from the errors that occur. In this chapter, the essentials are summarized, major contributions are pointed out, and remaining issues are discussed. Beginning with the aim to improve a given system, a typical engineering approach can be divided into four phases forming the "engineering cycle" (c.f., Figure 1.2 on Page 6). The thesis has been structured along this concept, and so is its summary.

11.1 Phase I: Problem Statement, Key Properties and Related Work

The ultimate goal addressed in this dissertation is to improve computer system dependability by means of proactive management of faults. However, the thesis has focused on the prerequisite first step, which is online failure prediction, where the objective is to predict the occurrence of failures in the near future based on the current state of the system as observed by runtime monitoring. As a case study, failures of a commercial telecommunication platform, for which industrial data has been available, were to be predicted.
A detailed analysis of the surrounding conditions of the case study has revealed several key properties, for which the proposed approach to failure prediction has been designed:

• The size of the system is so immense that detailed knowledge of complex relationships is rare; at least, it has not been overt to us. However, with the ever-growing complexity of systems and the increasing use of commercial off-the-shelf components, this assumption might also be valid for the companies themselves. For this reason, a black-box approach has been applied. However, as is discussed in the outlook, the model can be augmented by analytical knowledge, which would turn it into a gray-box approach.

• A huge amount of data is available. Therefore, a data-driven approach from machine learning has been chosen that aims at filtering out relevant interrelations from data rather than building on an analytical approach where interrelations are extracted manually. This approach has a major consequence: Only those types of failures can be predicted that have occurred (frequently enough) in the training data. Events that are really rare are not the focus here. However, as Levy & Chillarege [162] have pointed out, failure types follow Zipf's law, and targeting frequent failures first results in the biggest impact. Regarding the telecommunication system case study, the goal was to predict performance failures a few minutes ahead.

• Faults can become visible at four stages: by auditing, which means actively searching for faults, by symptom monitoring, by error detection, or by observing a failure. In the case of this thesis, errors have been used as input data. There are reasons both in favor of and against this choice. The most important are:

– Errors occur late in the process from faults to failures. In order to be able to predict failures with a reasonable lead time, fine-grained fault detection must be in place that is able to capture misbehavior early enough.
+ Due to the property of occurring late, input occurs only when something is going wrong in the system. This alleviates the problem of class skewness: the ratio of failure to non-failure data is more even than in symptom monitoring-based approaches.

+ Since error reporting is inherently built into the majority of systems, error-based prediction techniques are expected to have less effect on the production system than monitoring-based approaches: As the case study in the diploma thesis of Olaf Tauber has shown, system response times are dramatically influenced by the amount as well as the frequency of collected monitoring data.

+ Quite a lot of symptom monitoring-based approaches have been published, while the area of error event-based methods is not well explored. From a scientific point of view, it has been alluring to explore this white spot in prediction methods.

Experiments conducted in this thesis have been performed using previously recorded logfiles. It should be noted that for real application to a running system, a direct interface to error event reporting should be used.

• Component-based software architectures are common in large systems. The clear structure of encapsulated entities advocates an approach that builds on interrelationships and dependencies among components. From this it follows that the order of error events is relevant. Moreover, an analysis has revealed that not only the order but also the temporal delay between errors is decisive. Since errors occur non-equidistantly and the type of each error belongs to a finite countable set, temporal sequences are the input data for the failure prediction algorithm.

• Fault-tolerant systems can cope with many erroneous situations but fail under some conditions. The principal assumption in this thesis is that erroneous situations leading to a failure can be distinguished from erroneous situations that do not by identifying patterns in temporal error sequences. For this reason, a pattern recognition approach has been applied.
For this reason, a pattern recognition approach has been applied. • It is a non-distributed system. Although the approach might in principle be applicable to distributed systems as well, such aspects have not been considered in this thesis. 11.2 Phase II: Data Preprocessing, the Model, and Classification 265 The resulting approach is divided into two major steps: first, models are adjusted to system specifics from previously recorded training data. After training, error sequences occurring at runtime are analyzed in order to classify the current status of the system as failure-prone or not. In machine learning, such procedure is called an supervised offline batch learning approach. In order to review existing approaches to online failure prediction, a taxonomy has been developed and a comprehensive survey of major publications has been presented. Additionally, related work on extending hidden Markov models to continuous time has been presented. 11.2 Phase II: Data Preprocessing, the Model, and Classification The second phase of the engineering cycle aims at synthesizing a problem-specific methodology. In many cases including this thesis, existing approaches need to be adapted or a new model needs to be developed. Online failure prediction is performed in three steps: 1. Error messages that have occurred within a given time window before present time form an error sequence. The sequence is preprocessed, which includes assignment of symbols, tupling and noise filtering. In the case of training, failure sequences are additionally grouped by clustering. 2. Using extended hidden Markov models, similarity to failure and non-failure sequences is computed. Sequence likelihood is used as a measure for similarity between the observed sequence under investigation and the sequences of the training data. 3. Applying Bayes decision theory, a final decision is made whether the current situation is failure-prone or not. 
11.2.1 Data Preprocessing

Data preprocessing consists of several steps, of which the assignment of error IDs, the tupling technique by Iyer & Rosetti, and sequence extraction are of a technical rather than conceptual nature and are hence not summarized here.

Failure sequence clustering. Due to the complexity of the system, it must be assumed that several failure mechanisms exist and are hence present in the data. The term failure mechanism is used to denote specific relations of faults and system states to a failure. In this thesis, a technique has been developed that separates failure mechanisms by means of clustering. The basic notion of failure sequence clustering is that a dissimilarity matrix is formed by training a small hidden semi-Markov model for each sequence and by computing sequence likelihoods with each model for all failure sequences. Then, a standard clustering technique can be applied to identify groups of failure sequences that are "close" in the sense of large mutual sequence likelihoods. An analysis using the telecommunication data has revealed that agglomerative clustering using Ward's procedure yields the most stable results. Other clustering parameters, such as the number of states of the models and the level of background distributions, are not very decisive, and default values have been derived.

Noise filtering. The purpose of noise filtering is to remove non-failure-related errors from failure sequences, both in the training data and for online prediction. Noise filtering is based on a statistical test derived from the well-known χ² test of goodness of fit. The principal notion is that only symbols that are outstanding, i.e., occur more frequently than expected at a given time, are considered. At least for the data of the case study, an analysis has shown that computing the symbols' expected probabilities from all sequences yields a rather clear separation between signal and noise.

Improved logfiles.
Although not applied to the data of the case study, a principled investigation of logfiles has resulted in two proposals for how logfiles can be improved for automatic processing:

1. Event type and event source should be clearly separated.

2. A hierarchical numbering scheme should be used, which supports data investigation by providing multiple levels of detail. Furthermore, a distance metric can be defined that would facilitate clustering of error message types.

In order to quantify the quality of logfiles, logfile entropy has been defined. It is based on Shannon's information entropy but additionally incorporates the overlap of required and given information in logfiles.

11.2.2 The Hidden Semi-Markov Model

In this thesis, a pattern recognition approach is applied to the task of online failure prediction. Hidden Markov models (HMMs) have been chosen as the modeling formalism since, first, HMMs have successfully been used in many advanced pattern recognition tasks, and second, there is an appealing match of concepts from faults to hidden states and from errors to observation symbols. However, temporal sequences, which are sequences in continuous time, are used as input data, but standard HMMs are not designed for continuous time. Four ways in which standard HMMs can be used or extended to process continuous-time sequences have been discussed. An extension of the stochastic process of hidden state traversals seemed most promising due to a lossless representation of time and the power to mimic the temporal behavior of the underlying stochastic process. In order to achieve this, a new model has been proposed in this dissertation. Its key concepts and properties are summarized in the following:

• HMMs have been combined with a semi-Markov process, resulting in a hidden semi-Markov model (HSMM). HSMMs combine the mature formalism and well-understood properties and algorithms of standard HMMs with great flexibility to specify the duration of transitions from one hidden state to the next.
• For sequence recognition, the efficient forward algorithm has been adapted to HSMMs. By this, sequence likelihood can be computed, which is a probabilistic measure of similarity between the sequence under investigation and the set of sequences the HSMM has been trained with. In order to find the most probable sequence of hidden states, the Viterbi algorithm has been adapted as well. However, this has not been of major concern for this thesis, although it might be of interest for diagnosis.

• The HSMM can also be used for sequence prediction, which is likewise not used for online failure prediction here. However, this technique might be of interest for diagnosis or other applications of the model.

• In order to train the model, the Baum-Welch algorithm used for standard HMMs has been adapted to HSMMs. It belongs to the class of generalized expectation maximization algorithms, combining techniques from maximum likelihood estimation and gradient-based methods for the optimization of transition duration distribution parameters.

• Convergence of the training procedure has been proven based on the rather universal theory of EM algorithms, which employs lower bound optimization resulting in a so-called Q-function. The specific Q-function for HSMMs has been derived, and by partial differentiation and application of Lagrange multipliers it has been shown that the algorithm converges at least to a local maximum of training sequence likelihood.

• For the specific task of online failure prediction, a dedicated topology of HSMMs is used. Failure prediction models employ a chain-like, or left-to-right, structure. However, in order to deal with missing errors in training sequences, shortcuts are included in the model.
In order to deal with additional error messages (noise) that have not been present in the training data, intermediate states are added after completion of the training procedure: By this, model flexibility is increased without affecting the complexity of the training procedure. Experiments with the telecommunication data have shown that one intermediate state per transition is most effective, although the benefit lags behind expectations.

• In order to assess the complexity of the algorithm, two cases must be distinguished: application of HSMMs for online failure prediction during runtime, and training of HSMM parameters. Application complexity of general HSMMs belongs to the class O(N²L), where N denotes the number of states and L the number of symbols in the error sequence. However, due to the left-to-right structure used for online failure prediction, complexity actually belongs to the class O(NL). Theoretically, training complexity is O(N⁴L). However, the chain-like structure reduces complexity to O(N³L). In order to remedy the problem of convergence to local maxima, the entire training procedure is repeated 20 times with varying random initialization.

• The forward algorithm of the HSMM developed in this thesis is much more efficient than previous extensions to continuous time. The main reason for this is that previous extensions have mainly been developed in the area of speech recognition. An in-depth comparison of the task of failure prediction with speech recognition has revealed that for failure prediction a one-to-one mapping between states and observation symbols can be assumed, and temporal properties are mainly included in the stochastic process of hidden state traversals. This analysis allows a strict enforcement of the Markov assumption, which results in a forward algorithm that is almost as efficient as its discrete-time counterpart.
Furthermore, this approach allows time to be modeled as transition durations rather than state sojourn times, which offers more modeling flexibility.

11.2.3 Sequence Classification

The final step in a pattern recognition approach to online failure prediction is to classify whether the current runtime state is failure-prone or not. Bayes decision theory has been used in order to derive classification rules. More specifically:

• An introduction to Bayes decision theory has been given, including the proof that the classification error rate is minimal if each sequence is classified according to maximum posterior probabilities, and a minimum-cost classification rule.

• Since for real applications of hidden Markov models (as well as of their hidden semi-Markov extensions) only logarithmic sequence likelihood can be used, the Bayesian decision rule has been extended to a multi-class classification rule for log-likelihoods.

• By introducing the bias-variance dilemma, it has been shown why it is important to control the trade-off between bias (which is how closely training can adapt to the training data) and variance (which is how much the resulting model depends on the selection of training data). Several techniques have been discussed with respect to their applicability to online failure prediction with HSMMs. In this dissertation, model order selection and background distributions, in combination with a maximum amount of training data, have been applied.

11.3 Phase III: Evaluation Methods and Results for Industrial Data

Having developed the theoretical methodology, the third phase of the engineering cycle is concerned with implementing it and performing experiments with data. This leads to a solution that can be applied to a running system.

11.3.1 Evaluation Methods

Many different metrics exist that capture various aspects of failure prediction. The comprehensive overview and discussion of their characteristics is one of this thesis' contributions.

Metrics for prediction quality.
Many metrics for the evaluation of predictions are based on the contingency table, which classifies each prediction as either a true positive, false positive, true negative, or false negative. A table has been presented listing a great variety of metrics and their synonymous names. In this thesis, precision, recall, false positive rate, and true positive rate have been used. Additionally, the F-measure is used in order to turn precision and recall into a single real number.

One of the major drawbacks of contingency-table-based methods is that they are defined on a binary basis: a prediction is either positive (a failure warning) or negative (no warning). However, many prediction methods, such as the HSMM approach, employ a customizable threshold upon which the decision is based, and each threshold value may result in a different contingency table and subsequently in different values for the associated metrics. Several plots address this problem: Precision / recall curves plot precision over recall for various values of the decision threshold. In addition to the F-measure, the point where precision and recall are equal can be used to turn them into a single number. A second well-known plot is the receiver operating characteristic (ROC), where the true positive rate is plotted versus the false positive rate. In order to turn this graph into a single number, the integral under the ROC curve is used, which is called "area under curve" (AUC).

A new type of graph has been introduced in this thesis: accumulated runtime cost graphs plot prediction cost as it accumulates over runtime. In contrast to contingency-table-based metrics, which imply mean values, accumulated runtime cost graphs reveal a temporal aspect of prediction, since it can be seen when, e.g., false positive predictions have occurred.
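The contingency-table metrics named above can be stated compactly. The counts below are hypothetical and only illustrate the definitions; note how a large true-negative count inflates accuracy even when precision and recall are poor, which is exactly the effect observed in the case study.

```python
# Contingency-table metrics: precision, recall, false positive rate,
# the F-measure combining precision and recall, and accuracy.
def metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called true positive rate
    fpr = fp / (fp + tn)             # false positive rate
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, fpr, f_measure, accuracy

# Hypothetical counts: poor precision/recall, yet accuracy near 96%
# because true negatives dominate.
print(metrics(tp=5, fp=25, tn=950, fn=15))
```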
Furthermore, any predictor can be compared to an oracle predictor, a perfect measurement-based predictor, a system without a predictor, and maximum cost.

In summary, it should be pointed out that there is no single perfect evaluation metric. For example, precision and recall do not account for true negative predictions. AUC weights all threshold values equally, which results in cases where a predictor with better AUC incurs higher cost. Accumulated runtime cost graphs are sensitive to the relative distribution of cost, which can be chosen such that the graph is altered significantly.

Evaluation process and statistical confidence. One of the major problems with many machine learning approaches is that a lot of parameters are involved that are not directly optimized by the training procedure. For example, the length of the data time window is usually assumed to be fixed, but it is not clear what size of window results in optimal prediction quality. Moreover, several parameters are interdependent, so that each combination of all values for all parameters would have to be tested and evaluated with respect to final prediction performance. Since more than 15 parameters are involved, such an approach would result in tremendous computation times. For this reason, a mixed approach has been applied in this thesis: Parameters that could be set by separate experiments have been optimized separately (greedy approach), while other parameters have been optimized in combination with dependent ones that cannot be determined in a greedy way.

Three types of data sets have been used in the experiments:

• Training data is used as input data for the training procedure.

• Validation data is used to assess and control overfitting.

• Test data is used for final out-of-sample assessment of failure prediction performance.

Even though a lot of data has been available for the telecommunication system, the amount of failure data is still limited.
A fixed division into three data sets of equal size would not result in a sufficient estimate of real prediction performance. The standard solution to this kind of problem is called m-fold cross validation, where the data is divided into m parts, and m − 1 parts are used for training / validation while the remaining part is used for testing. This procedure is repeated m times such that each of the m parts is used for testing once. Training data is then further divided into training and validation data in the same way. In order to obtain an estimate of confidence intervals, cross validation is combined with a technique called bootstrapping. Other confidence interval estimation techniques have been investigated, too, but they are either not applicable (such as assuming a Bernoulli experiment or normal distributions) or not flexible enough to be applied to large datasets (such as the jackknife).

11.3.2 Results for the Telecommunication System Case Study

Industrial data of a commercial telecommunication system has been analyzed in order to assess the potential to predict failures of the system. The entire modeling procedure has been described and analyzed in detail, from the first steps of data preprocessing to a detailed analysis of the effect of the modeling parameters on final prediction quality. The most relevant results are provided here.

Data preprocessing. The main goal was to investigate whether the assumptions made in the theoretical development of the methodology fit reality as observed in the industrial data. In particular, findings included:
1. The proposed procedure to assign error IDs to error messages is relatively robust. The majority of assignments is unambiguous. The procedure reduced the number of different message types from 1,695,160 to 1,435.
2. It seems safe to determine the tupling window size by the procedure proposed by Iyer & Rosetti. The expected bend in the number of resulting tuples can be identified clearly.
3. Agglomerative clustering with Ward's method should be used to group failure sequences. The number of states for the HSMMs used to compute sequence likelihoods should be chosen to be approximately √L, where L denotes the maximum length of the majority of failure sequences. The weight assigned to background distributions should be chosen rather small; a value of 0.1 has been used in the experiments.
4. Noise filtering works best if a global prior estimated from the entire training data set is used. Experiments indicate that the proposed noise filtering mechanism can distinguish between signal and noise. The filtering threshold should be chosen slightly above a plateau in average sequence length that has been observable in the data of the case study. Furthermore, experiments support two principles observed by Levy & Chillarege: prior to a failure, the mix of errors changes, and a few errors outnumber their expected value heavily.

Analysis of the preprocessed dataset. After preprocessing, the resulting dataset has been investigated with respect to the following characteristics:
• Error frequency varies heavily in the data set. However, no correlation between the number of errors per time unit and the occurrence of failures can be observed. Hence, straightforward counting and thresholding techniques do not seem appropriate.
• Delays between errors can be approximated best by a mixed probability distribution consisting of an exponential and a uniform distribution.
• An analysis of the distribution of time-between-failures revealed that frequently used distributions such as the exponential, gamma, or Weibull distribution do not fit the data very well. For this reason, failure prediction or preventive maintenance techniques that simply rely on lifetime distributions are most likely doomed to fail. Furthermore, an autocorrelation analysis shows that there is no periodicity in the occurrence of failures.
That is why periodic techniques cannot achieve good prediction results.

Model parameters. Quite a few parameters are involved in the modeling step. The parameters have been divided into two groups:
• Parameters that can be fixed heuristically in a greedy manner. This group includes the probability mass and distribution of intermediate states, the number of iterations of the Baum-Welch algorithm, and the type of the background distribution.
• Parameters that can only be evaluated by training a model and testing prediction performance on test data. This group includes the number of states of the HSMM, the maximum number of states that are skipped by shortcuts, the number of intermediate states that are added to the model after training, and the amount of background weight applied after training.
Parameters of the second group have been investigated with respect to the F-measure and their effect on computation times. Best results have been achieved with a model of 100 states, shortcuts bypassing one state, one intermediate state per transition, and a background weight of 0.05. The optimal set of parameters has then been investigated further, and precision / recall plots, ROC curves, AUC, and cost curves have been provided. At the threshold value of maximum F-measure (0.66), a precision of 0.70, a recall of 0.62, and a false positive rate of 0.016 have been achieved. AUC was equal to 0.873.

Application-specific parameters. Two modeling parameters depend on the application rather than the model itself: lead-time (i.e., how far in the future failures are predicted) and data window size (how much data is used for prediction). An analysis of these two parameters has shown that prediction quality approximately stays at the same level for lead-times of up to 20 minutes and drops quickly for longer lead-times. With respect to the size of the data window, model quality in general becomes better if longer sequences are taken into account.
However, mean processing time increases heavily for longer sequences, putting a limit on the size of the data window.

Sensitivity analysis. Large-scale computer systems such as the telecommunication system are highly configurable and undergo repetitive updates. In order to assess sensitivity to these issues, the approach has been tested in several ways:
• Dependence of prediction quality on the size of the training dataset. Many stochastic estimators such as the mean yield unreliable results if the number of data points is decreased. A similar effect was observed in the case study: by reducing the size of the training data set in two steps, results remained stable after the first step, but failure prediction quality broke down after the second reduction. Not surprisingly, mean training time was also reduced for smaller training data sets.
• Dependence on changing system configurations and model aging. Since with offline batch learning the parameters of the HSMM are trained once, the behavior of the running system will differ increasingly from system behavior at training time with every change to the configuration and every update. This effect has been simulated by an increasing time gap between selected training and test data. Experiments have shown that the mean maximum F-measure decreases almost linearly with increasing size of the gap. Additionally, it has been observed that the confidence intervals obtained from bootstrapping get wider, which can be explained by the fact that with increasing gap size more and more sequences are significantly different from the training data.
• Grouping of failure sequences has been applied in order to separate failure mechanisms. However, partitioning the set of failure sequences results in fewer training sequences for each model, which in turn may deteriorate the HSMM parameter estimation involved in the training procedure.
In order to check whether this is the case, an HSMM failure predictor with only one failure group model has been trained. Results for this model have been significantly worse, supporting the assumption that the HSMMs can adapt better to the training sequences if failure sequences are grouped according to their similarity.

Comparative analysis. The HSMM-based failure prediction approach has been compared to the most promising and well-known failure prediction approaches in this area, which are the dispersion frame technique (DFT) by Lin & Siewiorek, the Eventset method by Vilalta & Ma, and SVD-SVM by Domeniconi et al. DFT only evaluates the time of error occurrence, while Eventset and SVD-SVM only investigate the type of errors that occur (although SVD-SVM can in principle incorporate both time and type of error messages, prediction actually deteriorates if time is included). In contrast, HSMM-based failure prediction investigates both the time of error occurrence and their type: it treats the input data as a temporal sequence. In order to provide a comparison with a very simple prediction method, periodic prediction based on MTBF has also been included in the comparison. Standard, discrete-time HMMs can be used for failure prediction, too. In order to assess the gain in prediction performance achieved by introducing a semi-Markov process, the prediction performance of standard HMMs has been tested as well. Additionally, HSMM-based failure prediction has been compared to failure prediction based on universal basis functions (UBF) developed by Günther Hoffmann, although UBF prediction belongs to a different class of prediction algorithms operating on equidistant monitoring of system variables. In summary, it can be concluded from the comparative analysis that HSMM-based failure prediction outperforms the other failure prediction approaches significantly. However,
improved failure prediction comes at the price of computational complexity: model training consumes 2.38 times and online prediction 224.5 times as much time as the slowest comparative approach. Nevertheless, the approach demonstrates what prediction performance is achievable with error-event triggered online failure prediction.

11.4 Phase IV: Dependability Improvement

Failure prediction is not worth the effort if it does not help to improve system dependability. In order to improve dependability, failure prediction must be coupled with subsequent actions that are performed once an upcoming failure has been predicted. This is called proactive fault management. However, the focus of this thesis is on failure prediction, and therefore only a theoretical analysis of the effect of proactive fault management on system dependability has been provided.

11.4.1 Proactive Fault Management

Two strategies exist for improving system dependability in case of a predicted upcoming failure:
• Downtime avoidance techniques try to prevent the failure. Their goal is to achieve continuous operation. Three groups of downtime avoidance techniques have been identified: state clean-up, preventive failover, and load lowering.
• Downtime minimization techniques can be further divided into two subgroups: reactive techniques let the predicted failure happen, but the system is prepared for its occurrence such that time-to-repair is reduced. This is achieved by either one or both of two effects: (a) reconfiguration time can be shortened if an upcoming failure is anticipated, and (b) the time needed for recomputation can be reduced. Proactive techniques, on the other hand, actively trigger repair actions such as a restart, turning unplanned downtime into planned downtime, which is expected to be shorter or to incur less cost.
Several examples for all types of techniques have been given.
11.4.2 Models

Based on the continuous-time Markov chain (CTMC) model for software rejuvenation (i.e., preventive restart of components or the entire system) introduced by Huang et al., two CTMC models have been developed: the first model is used to compute steady-state system availability, while the second, simplified model is used to compute system reliability and hazard rate. It has been shown how the rates of the model can be computed from eleven parameters, of which four are application-specific and hence assumed to be fixed. The remaining seven modeling parameters are: precision, recall, false positive rate, the failure probabilities given a true positive, a false positive, and a true negative prediction, and the repair time improvement factor. Using these parameters, closed-form solutions for steady-state system availability, reliability, and hazard rate have been derived.

11.4.3 Parameter Estimation

A procedure has been described for estimating the various parameters from experiments. The procedure consists of four experiments, two of which include fault injection in order to assure that the prediction of a failure is a true positive.

11.4.4 Case Study and an Advanced Example

A diploma thesis has set up an experimental environment where simple proactive fault management techniques have been applied to an open web-shop demo application on the basis of Microsoft .NET. Specifically, preventive restart on the application as well as on the system level has been used as a downtime minimization technique, and delivering a page stating that the server is temporarily busy has been used as a downtime avoidance technique based on system relieving. The parameter estimation procedure has been applied to the data recorded in the experiments. However, neither of the two proactive techniques has been able to improve system availability, reliability, or hazard rate (in the long term).
The main reason for this is that the implemented failure prediction algorithm (which was not HSMM-based prediction but a simple threshold-based method) has not been able to provide sufficiently good predictions. Furthermore, both types of actions have not proven successful: instead of reducing downtime, restarting took eleven times as long as the downtime incurred by a failure, and at every second true positive prediction a failure occurred even though system relieving was in place. Since the experiments have not resulted in improved system dependability, a more sophisticated example has been provided. In this second example, the values estimated from the telecommunication case study have been used for prediction quality. Additionally, better but still realistic values have been assumed for the other parameters. This scenario resulted in a considerable improvement in dependability: unavailability was cut by half, and reliability as well as hazard rate have been significantly improved.

11.5 Main Contributions

In summary, a novel failure prediction approach has been developed that has strong foundations in stochastic pattern recognition rather than heuristics, and that outperforms well-known prediction techniques when applied to industrial data of a commercial telecommunication system of considerable size. On the way to this result, several contributions to the state of the art have been achieved:
• In the fundamental relationship between faults, errors, and failures, side-effects of faults are missing. In this dissertation, side-effects of faults are termed symptoms, and the fundamental concept has been extended accordingly.
• A comprehensive taxonomy of online failure prediction methods has been introduced. Based on the taxonomy, an in-depth survey of online failure prediction techniques has been presented, including research areas that have not been explored for the objective of failure prediction.
• The failure prediction method developed in this thesis is the first to apply pattern recognition to error event-driven time sequences (temporal sequences).
• A novel extension of hidden Markov models to incorporate continuous time has been developed. Since previous extensions to continuous time have focused on equidistant time series, the extension presented here is the first to specifically address temporal sequences as input data.
• A novel model to theoretically assess the dependability of proactive fault management, which is prediction-driven fault tolerance, has been introduced. To our knowledge, it is the first to incorporate correct and false predictions, as well as downtime avoidance and downtime minimization techniques. In addition, the model incorporates failures that are induced by proactive fault management itself, e.g., by the additional load that is put onto the system.
Although not directly related to the failure prediction model, several other contributions to the state of the art have been made:
• To the best of our knowledge, this thesis is the first to collect and discuss the various evaluation metrics for prediction tasks.
• A novel methodology to identify failure mechanisms and to group failure sequences has been developed. Although only used for data preprocessing in this thesis, the approach might be useful for diagnosis as well.
• To our knowledge, the first measure to quantify the quality of logfiles has been introduced. Due to its roots in information theory, the measure is called logfile entropy.

11.6 Conclusions

In this dissertation, an effective online failure prediction approach has been proposed that builds on the recognition of symptomatic patterns of error sequences. A novel continuous-time extension of hidden Markov models has been developed, and the approach has been applied to industrial data of a commercial telecommunication system.
In comparison to the best-known error-based failure prediction approaches, the proposed methodology showed superior prediction accuracy. However, accuracy comes at the price of computational complexity. Although this is intuitively comprehensible, Legg [160] has investigated the performance and complexity of prediction algorithms in a principled way. Based on a universal formal theory for sequence prediction by Solomonoff [247, 248], which is not computable in general, Legg has proven that predictors of a given predictive power require some minimum computational complexity (see Figure 11.1). Another important result of Legg's work is that, although very powerful predictors exist for computable sequences, they are not provable due to Gödel incompleteness problems. In other words, for provable algorithms, an upper bound with respect to predictive power exists. Hence, the maximum achievable predictive accuracy for the telecommunication case study might be worse than 100% precision and 100% recall, and HSMM-based failure prediction may be even closer to the optimum than it appears.

Figure 11.1: Trade-off between predictive power and complexity. It can be shown that for a given complexity, there is an upper bound on predictive power. Hence, there is also an upper bound on the predictive power achievable by algorithms with provable complexity. However, it can also be shown that algorithms with better predictive power exist, but their complexity is unprovable. The HSMM-based prediction algorithm lies within the hatched area (Legg [160]).

The starting point of the model's development was an analysis of key properties of complex, component-based, non-distributed software systems, and the failure prediction approach has been designed with these properties in mind. Hence, HSMMs should also show very good prediction results if applied to other systems sharing the same properties.
Additionally, HSMMs can be adapted to different situations by adjusting the various parameters involved in modeling. Furthermore, since they are a general contribution to event-driven temporal sequence processing, HSMMs might prove to achieve similarly outstanding results in other application domains beyond failure prediction as well.

Chapter 12: Outlook

As is the case with most projects, there is always room for further investigations and improvements. In this chapter, some potential and promising directions are highlighted. Starting from technical issues concerning how the proposed hidden semi-Markov model (HSMM) could be further improved, the scope is widened successively.

12.1 Further Development of Prediction Models

The survey of online prediction models (see Chapter 3) has shown that quite a few prediction models have been developed in the past, but also that there are several areas that seem promising to explore. The discussion along the branches of the taxonomy is not reiterated here; rather, the focus is on more sophisticated machine learning techniques.

12.1.1 Improving the Hidden Semi-Markov Model

More sophisticated optimization techniques than the gradient-based one could be used for the estimation of transition duration parameters in the Baum-Welch algorithm for HSMMs. For example, second-order optimization algorithms such as Newton's method, or quasi-Newton methods such as Broyden-Fletcher-Goldfarb-Shanno (BFGS) and Davidon-Fletcher-Powell (DFP), could be applied. In this thesis, the problem of local maxima has been addressed by simply running the Baum-Welch algorithm several times. A more sophisticated solution would, for example, apply an evolutionary optimization strategy. Additionally, the EM training algorithm used in this dissertation does not alter the structure of the HSMM. Extended algorithms such as state pruning, which also alter the topology of an HSMM, may be investigated.
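The multiple-restart strategy mentioned above can be sketched generically (a toy hill-climbing optimizer and a hypothetical multimodal objective stand in for one Baum-Welch run and the training-sequence likelihood; all names are illustrative, not from the thesis):

```python
import math
import random

def hill_climb(objective, start, rng, step=0.1, iters=300):
    # Stand-in for a single Baum-Welch run: converges only to a nearby local maximum.
    x, best = start, objective(start)
    for _ in range(iters):
        cand = x + rng.uniform(-step, step)
        val = objective(cand)
        if val > best:
            x, best = cand, val
    return x, best

def multi_restart(objective, n_restarts=25, lo=-3.0, hi=3.0, seed=0):
    # Run the local optimizer from several random initializations; keep the best run.
    rng = random.Random(seed)
    runs = [hill_climb(objective, rng.uniform(lo, hi), rng) for _ in range(n_restarts)]
    return max(runs, key=lambda r: r[1])

# Multimodal toy objective: global maximum near x = 2, local maximum near x = -2.
toy = lambda x: math.exp(-(x - 2) ** 2) + 0.5 * math.exp(-(x + 2) ** 2)
x_best, f_best = multi_restart(toy)
```

A single run started near x = −2 would get stuck at the inferior local maximum; restarting and keeping the run with the highest objective value is exactly the remedy applied to Baum-Welch in the thesis.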
The black-box approach can actually be turned into a gray-box approach by adding failure group models that are constructed manually. More specifically, if it is known from system design that the activation of a specific failure mechanism results in a unique sequence of errors, an additional failure group model can be built that is specifically targeted to this sequence. Variations and uncertainties in time as well as in error symbols can be modeled by transition and observation probabilities. The resulting additional failure group model can be seamlessly integrated with the models obtained from data-driven machine learning. Referring to Figures 2.9 and 2.10 on Pages 19 and 20, respectively, the hand-built model would simply be added as model u + 1. By this procedure, the purely data-driven modeling approach described in this thesis is turned into a hybrid machine-learning / analytic modeling approach.

12.1.2 Bias and Variance

Controlling bias and variance means controlling the trade-off between under- and overfitting. As has been mentioned in Chapter 7, algorithms such as bagging and boosting can be applied to HSMMs as well. A further technique for controlling the bias-variance trade-off is called regularization (see, e.g., Bishop [30]). Regularization usually denotes a technique where the optimization objective is augmented by a term putting a penalty on model complexity or specificity, such as curvature in regression problems. Regularization can in principle also be applied to HSMMs. In order to do so, the Baum-Welch algorithm would have to be changed such that the optimization objective, which is the training sequence likelihood, is augmented by a complexity / specificity term. For example, a penalty could be put on setting transition or observation probabilities to zero. Another approach is to introduce a prior probability distribution over the values of the model parameters, as has been introduced by Hughey & Krogh [128].
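The idea of augmenting the training objective with a penalty on near-zero probabilities can be sketched abstractly (a hypothetical scoring function of my own, not the thesis's actual Baum-Welch modification; `alpha` controls the regularization strength):

```python
import math

def penalized_score(log_likelihood, transition_probs, alpha=0.1):
    # Regularized objective: training-sequence log-likelihood minus a penalty
    # that grows as transition probabilities approach zero. Maximizing this
    # score discourages the degenerate zero-probability solutions named above.
    penalty = -sum(math.log(p) for p in transition_probs if p > 0)
    penalty += sum(1e6 for p in transition_probs if p <= 0)  # forbid exact zeros
    return log_likelihood - alpha * penalty

# Two parameter settings with identical data likelihood: the near-degenerate
# one receives the worse (lower) penalized score.
balanced = penalized_score(-100.0, [0.4, 0.3, 0.3])
degenerate = penalized_score(-100.0, [0.98, 0.01, 0.01])
```

Setting `alpha = 0` recovers the plain maximum-likelihood objective, so the penalty strength can itself be tuned on validation data like any other bias-variance knob.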
However, regularization changes the model rather deeply at its core, and similar results can very likely be achieved by other techniques such as background distributions, which have been used in this thesis. It is common knowledge that every single modeling technique is well-suited for some problems but performs worse on others. This is called the inductive bias of a modeling technique. Meta-learning (see, e.g., Vilalta & Drissi [267]) makes use of the different inductive biases of several modeling techniques. For example, one meta-learning technique assigns a new problem to the base-learner with the most appropriate inductive bias. This has been shown to improve failure prediction significantly in Gujrati et al. [111], even though a very simple meta-learning algorithm has been applied.

12.1.3 Online Learning

Systems undergo permanent updates and configuration changes. With each such step, the failure behavior of the system might change. The consequence is that the models obtained from training become more and more outdated. One solution to this problem is online learning, where the model is permanently updated such that it adapts to the changes in the system. A straightforward solution to online learning for HSMMs would be to collect new failure and non-failure sequences at runtime and to periodically train new models in the background. However, most likely more sophisticated approaches can be applied.

12.1.4 Further Issues

Prediction in continuous time. In this dissertation, failures have been predicted with a fixed lead-time ∆t_l. However, if the error sequence under investigation is assumed to be the start of a temporal sequence, sequence prediction techniques (cf. Section 6.2.2) can be used to determine the continuous cumulative probability of failure occurrence over time.
Such an approach is beneficial if several proactive actions are available in a system that imply different warning times (i.e., minimum lead-times): failure prediction would have to be performed only once rather than once for every lead-time.

Conditional random fields. Markov models in general are subject to the so-called label bias problem (see, e.g., Lafferty et al. [152]). The problem is that the entire probability mass is distributed among successor states. Hence, if a state has only one successor, the stochastic process transits with probability one to the next state. If there are two successors and both are equally likely, the process proceeds with a probability of roughly one half. From this it follows that sequence likelihood depends on the number of outgoing transitions. Even though this problem is less urgent for HSMMs (first, the model topology is rather symmetric, as most states have the same number of successor states, and second, transition probability is also determined by the duration of the transition), the principal restriction still applies. In recent years, new stochastic models have been developed, among which conditional random fields (CRFs) are promising candidates. These models have a second important advantage: the objective function is convex, which guarantees that the training procedure converges to a global rather than a local maximum. However, these models are rather new and experience with them is limited; that is why they have not been considered in this thesis.

Input variables. In this dissertation, error events have served as input data to the hidden semi-Markov model. However, as the title of the thesis indicates, any event-based data source may be used as well. For example, by defining a threshold, any (equidistant) monitoring of system variables such as memory consumption or workload can be turned into an event-based data source.
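Such a thresholding step could look as follows (a minimal sketch under my own assumptions; the function, event name, and sample values are illustrative, not from the thesis):

```python
def threshold_events(samples, interval, threshold):
    """Convert equidistant samples of a monitored variable into
    (timestamp, event) pairs, emitted whenever the value crosses
    the threshold from below."""
    events = []
    prev = None
    for i, value in enumerate(samples):
        if prev is not None and prev <= threshold < value:
            events.append((i * interval, "THRESHOLD_EXCEEDED"))
        prev = value
    return events

# Memory utilization sampled every 30 s; an event is generated at each upward crossing.
mem = [0.61, 0.72, 0.85, 0.93, 0.88, 0.95]
events = threshold_events(mem, interval=30, threshold=0.9)
# → [(90, 'THRESHOLD_EXCEEDED'), (150, 'THRESHOLD_EXCEEDED')]
```

The resulting (timestamp, symbol) pairs have exactly the shape of the error events the HSMM consumes, so thresholded monitoring could be mixed with log-based events.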
Although this has not been applied in this thesis, it might be a valuable solution for systems that do not have fine-grained fault detection installed. In many machine learning applications, the problem of variable selection is an important issue. For online failure prediction based on symptom monitoring, Hoffmann et al. [122] have shown that a good selection of variables can be even more decisive than a sophisticated choice of modeling technique. In the course of this thesis, some experiments with different sets of error-message variables have been performed. However, results could not be improved. The main reason for this is that, in contrast to symptom monitoring, not all variables are available all the time: each error message may contain a different set of variables. Hence, in order to successfully apply variable selection techniques, extra care must be taken regarding missing variables. Since this is not the case for most existing variable selection algorithms, further research is needed.

Mining rare events. A further issue is related to the problem that failure sequences are rare in comparison to non-failure sequences. Weiss [276] has comprehensively investigated this topic, even though the main focus has been on data mining. Many of the proposed techniques, e.g., training failure models on the rare class only and using evaluation metrics that are robust to rare classes, such as precision and recall, have been applied in this thesis. However, other techniques, such as advanced sampling methods, could additionally be applied.

Distributed systems. This dissertation has focused on centralized systems only. However, distributed systems are also important and should be considered. Owing to its ability to flexibly incorporate timing behavior, to model interdependencies of more or less isolated entities, and to handle missing events or permutations in their order, HSMM-based failure prediction seems to be a good candidate for failure prediction in distributed systems.
Design for predictability. It has been assumed throughout this thesis that the system is fixed and given, and that failure prediction algorithms have to adapt to its specifics. However, in the future it may also be the other way round: "designing for predictability" may be considered from the very beginning of the software development process. At the current stage, it is not yet clear what characteristics of a software design make failures predictable, and further research is needed. However, it can be concluded from this dissertation that if error event-based failure prediction is to be applied, fine-grained fault detection has to be embedded throughout the system.

12.1.5 Further Application Domains for HSMMs

The hidden semi-Markov model developed in this dissertation has been designed for the processing of event-triggered temporal sequences. Therefore, HSMMs can be applied to other problem domains as well. The prerequisite key characteristic is that observations (input data) must occur in an event-driven way and that input values must belong to a finite, countable set (observation symbols). There are presumably many areas where HSMMs can be applied, among which are:
• Web user profiling. The click stream of a web user navigating through a site forms a temporal sequence: each click is an event and, e.g., the requested URL is the observation symbol. HSMMs might be used to distinguish between various types of users (sequence recognition) or to predict the most probable URL that the web user will click next (sequence prediction). Both could be used to dynamically adjust web pages to the user's needs and preferences.
• Shopping tour prediction. In a retail store, each time a customer puts an item into the (technically enhanced) cart, an event is generated. The type of the event is defined, e.g., by the ID of the item, its location, etc. Temporal sequence processing based on HSMMs might be used to, e.g., display context-sensitive advertisements.
In contrast to existing data-mining approaches, not only the set of items is relevant but also the time when the customer has put each item into the cart, which would make it possible to present advertisements along the customer's anticipated route through the shop or to enable predictive planning of cash counter personnel.
• Failure prediction in critical infrastructures. Many infrastructures that are used every day (such as electricity, telecommunication, water supply, and food transport) can, in case of a failure, impose drastic restrictions on daily life or even pose a severe threat to the health of many people. Failure prediction may be used to predict infrastructure failures such that appropriate actions can be undertaken to prevent or at least alleviate them. HSMM-based failure prediction might prove to be especially successful for infrastructures where only critical events but no continuous monitoring is available.

12.2 Proactive Fault Management

The essence of proactive fault management is to act proactively rather than reactively to system failures. In the context of this dissertation, techniques are considered that rely on the prediction of upcoming failures. Even though there are techniques, such as checkpointing, that can be triggered directly by a failure prediction algorithm, subsequent diagnosis is required in order to investigate what is going wrong in the system, i.e., what caused the imminent failure. Based on failure prediction and diagnosis results, a decision needs to be made as to which of the implemented downtime avoidance or downtime minimization techniques should be applied and when it should be executed in order to remedy the problem (see Figure 12.1).

Figure 12.1: The steps involved in proactive fault management. After prediction of an upcoming failure, diagnosis is needed in order to find the fault that causes the upcoming failure.
Failure prediction and diagnosis results are used to decide which proactive method to apply and to schedule its execution.

Both diagnosis and the choice and scheduling of actions are complex problems that need to be solved for proactive fault management to be most effective. Nevertheless, the following paragraphs discuss some issues that are related to HSMM-based failure prediction as proposed in this dissertation.

Diagnosis. The objective of diagnosis is to find out where the fault is located (e.g., at which component) and, sometimes, what caused it. Note that in contrast to traditional diagnosis, in proactive fault management diagnosis is invoked by failure prediction, i.e., when the failure is imminent but has not yet occurred. One idea for how diagnosis could be accomplished is to analyze the hidden semi-Markov models used for failure prediction: since the HSMM approach makes use of several HSMM instances (one for non-failure and several others for failure sequences), and each failure group model is targeted at one failure mechanism, the sequence likelihoods of the failure group models can be compared in order to determine which failure mechanism might be active in the system. The fault might then be located by an analysis of characteristic error messages, which might also include identification of the most probable sequence of hidden states by applying the Viterbi algorithm. Some parts of this analysis could even be precomputed after clustering of the training failure sequences.

Scheduling of actions. The investigation of dependability enhancement presented in Chapter 10 has been based on a binary classification of whether a failure is imminent or not. However, in general, the decision which proactive technique to apply should be based on an objective function taking the cost of actions, the confidence in the prediction, and the effectiveness and complexity of actions into account in order to determine the optimal trade-off.
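Such a confidence-dependent choice among countermeasures can be illustrated with a toy threshold scheme driven by the predictor's posterior failure probability. This is only a sketch: all action names, costs, and confidence thresholds below are invented for illustration and are not taken from this dissertation.

```python
# Hypothetical action scheduler; values are illustrative assumptions.
ACTIONS = [
    # (name, cost, minimum required confidence in the failure prediction)
    ("restart system", 100.0, 0.95),
    ("preventive failover", 40.0, 0.80),
    ("write supplemental checkpoint", 5.0, 0.30),
]

def select_action(posterior_failure_prob):
    """Pick the most drastic action whose confidence requirement is met.

    More expensive (and more disruptive) actions require higher confidence
    that a failure is actually imminent; if no threshold is reached, do
    nothing and keep monitoring.
    """
    eligible = [a for a in ACTIONS if posterior_failure_prob >= a[2]]
    if not eligible:
        return None
    # Among eligible actions, prefer the highest-cost one, assuming in
    # this toy model that cost correlates with effectiveness.
    return max(eligible, key=lambda a: a[1])[0]

assert select_action(0.99) == "restart system"
assert select_action(0.85) == "preventive failover"
assert select_action(0.40) == "write supplemental checkpoint"
assert select_action(0.10) is None
```

A real scheduler would optimize the expected-cost objective function directly rather than relying on fixed thresholds, but the principle is the same: the richer the confidence information reported by the predictor, the finer the trade-offs the scheduler can make.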
For example, to trigger a rather costly technique such as a system restart, the scheduler should be almost sure about an upcoming failure, whereas for a less expensive action such as writing a supplemental checkpoint, less confidence in the correctness of the failure prediction is required. In contrast to many other failure prediction approaches, HSMM-based failure prediction can support the scheduler by reporting the posterior probability (cf. Equation 7.1) rather than only a binary decision whether a failure is coming up or not.

Both topics, diagnosis and scheduling, are challenges each worth a separate dissertation, raising a manifold of scientific questions. The crucial issue, however, is to bring proactive fault management into practical applications in order to prove that system dependability can be boosted by up to an order of magnitude by the proactive fault handling toolbox, which is a combination of effective downtime avoidance and downtime minimization techniques, diagnosis, action scheduling, and, last but not least, accurate online failure prediction.

Part V: Appendix

Derivatives with Respect to Parameters for Selected Distributions

In order to compute the gradient used in hidden semi-Markov model training (cf. Section 6.3.2), partial derivatives with respect to the parameters of the transition duration distributions are needed. Derivatives for some commonly used distributions are provided in the following. Note that cumulative parametric probability distributions are used to specify a hidden semi-Markov model’s transition durations.

Exponential distribution. The cumulative distribution is given by

\[ \kappa_{ij,r} = 1 - e^{-\lambda_{ij,r}\, d_k} . \]

The derivative with respect to $\lambda_{ij,r}$ is hence:

\[ \frac{\partial}{\partial \lambda_{ij,r}} \left( 1 - e^{-\lambda_{ij,r}\, d_k} \right) = d_k \, e^{-\lambda_{ij,r}\, d_k} . \]

Normal distribution. No closed-form representation of the cumulative normal distribution $\Phi_{\mu,\sigma}(t)$ is known.
However, it can be expressed using the so-called error function $\operatorname{erf}(t)$:

\[ \operatorname{erf}(t) = \frac{2}{\sqrt{\pi}} \int_0^t e^{-\tau^2}\, d\tau \qquad\qquad \frac{\partial \operatorname{erf}}{\partial t} = \frac{2}{\sqrt{\pi}}\, e^{-t^2} \]

The cumulative normal distribution is then given by:

\[ \Phi_{\mu,\sigma}(t) = \frac{1}{2} \left[ 1 + \operatorname{erf}\!\left( \frac{t-\mu}{\sqrt{2}\,\sigma} \right) \right] . \]

In order to compute the partial derivatives of $\Phi$, let

\[ f_{\mu,\sigma}(t) := \frac{t-\mu}{\sqrt{2}\,\sigma}, \qquad \frac{\partial f}{\partial \mu} = -\frac{1}{\sqrt{2}\,\sigma}, \qquad \frac{\partial f}{\partial \sigma} = \frac{\mu-t}{\sqrt{2}\,\sigma^2} \]

and hence,

\[ \frac{\partial \Phi}{\partial \mu} = \frac{1}{2} \cdot \frac{2}{\sqrt{\pi}}\, e^{-f^2} \cdot \left( -\frac{1}{\sqrt{2}\,\sigma} \right) = -\frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{t^2 - 2t\mu + \mu^2}{2\sigma^2} \right) \]

\[ \frac{\partial \Phi}{\partial \sigma} = \frac{\mu-t}{\sqrt{2\pi}\,\sigma^2} \exp\!\left( -\frac{t^2 - 2t\mu + \mu^2}{2\sigma^2} \right) \]

Log-normal distribution. Similar to the normal distribution, the cumulative log-normal distribution can be expressed using the error function:

\[ \Psi_{\mu,\sigma}(t) = \frac{1}{2} \left[ 1 + \operatorname{erf}\!\left( \frac{\ln(t)-\mu}{\sqrt{2}\,\sigma} \right) \right] . \]

Therefore, the derivatives are obtained analogously to the normal distribution:

\[ \frac{\partial \Psi}{\partial \mu} = -\frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{\ln(t)^2 - 2\ln(t)\mu + \mu^2}{2\sigma^2} \right) \]

\[ \frac{\partial \Psi}{\partial \sigma} = \frac{\mu-\ln(t)}{\sqrt{2\pi}\,\sigma^2} \exp\!\left( -\frac{\ln(t)^2 - 2\ln(t)\mu + \mu^2}{2\sigma^2} \right) \]

Pareto distribution. The cumulative distribution is determined by the location parameter $t_m$, which determines the minimum value for $t$, and a shape parameter $k$:

\[ P_{t_m,k}(t) := 1 - \left( \frac{t_m}{t} \right)^{k} \]

Using Maple™, the derivative with respect to $t_m$ can be determined, yielding

\[ \frac{\partial P}{\partial t_m} = -\left( \frac{t_m}{t} \right)^{k} \frac{k}{t_m} \]

and the derivative with respect to $k$ is given by:

\[ \frac{\partial P}{\partial k} = -\left( \frac{t_m}{t} \right)^{k} \ln\!\left( \frac{t_m}{t} \right) \]

Gamma distribution. The density of the gamma distribution is defined as

\[ g_{k,\theta}(t) = \frac{t^{k-1}\, e^{-\frac{t}{\theta}}}{\theta^{k}\, \Gamma(k)} \]

where $\Gamma(k)$ denotes the gamma function. The cumulative distribution is given by

\[ G_{k,\theta}(t) = \frac{\gamma\!\left(k; \frac{t}{\theta}\right)}{\Gamma(k)} \]

where $\gamma$ denotes the incomplete gamma function. Differentiation of $G_{k,\theta}(t)$ with respect to $k$ as well as to $\theta$ is possible; however, the result comprises many different terms and is hence not displayed here.
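As a sanity check, the closed-form derivatives above can be compared against two-sided finite differences. The following short sketch (with illustrative parameter values, not values from the dissertation) verifies the normal-distribution and Pareto cases:

```python
import math

def norm_cdf(t, mu, sigma):
    # Phi_{mu,sigma}(t) = 1/2 * (1 + erf((t - mu) / (sqrt(2) * sigma)))
    return 0.5 * (1.0 + math.erf((t - mu) / (math.sqrt(2.0) * sigma)))

def dphi_dmu(t, mu, sigma):
    # closed form: -1/(sqrt(2 pi) sigma) * exp(-(t - mu)^2 / (2 sigma^2))
    return -math.exp(-((t - mu) ** 2) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def dphi_dsigma(t, mu, sigma):
    # closed form: (mu - t)/(sqrt(2 pi) sigma^2) * exp(-(t - mu)^2 / (2 sigma^2))
    return (mu - t) * math.exp(-((t - mu) ** 2) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma ** 2)

def pareto_cdf(t, tm, k):
    # P_{tm,k}(t) = 1 - (tm / t)^k
    return 1.0 - (tm / t) ** k

def dpareto_dk(t, tm, k):
    # closed form: -(tm / t)^k * ln(tm / t)
    return -((tm / t) ** k) * math.log(tm / t)

def central_diff(f, x, h=1e-6):
    # two-sided numerical derivative used as the reference
    return (f(x + h) - f(x - h)) / (2.0 * h)

t, mu, sigma = 1.7, 1.0, 0.5
assert abs(central_diff(lambda m: norm_cdf(t, m, sigma), mu) - dphi_dmu(t, mu, sigma)) < 1e-6
assert abs(central_diff(lambda s: norm_cdf(t, mu, s), sigma) - dphi_dsigma(t, mu, sigma)) < 1e-6

t, tm, k = 3.0, 1.5, 2.0
assert abs(central_diff(lambda kk: pareto_cdf(t, tm, kk), k) - dpareto_dk(t, tm, k)) < 1e-6
```

The log-normal case follows from the normal one by substituting $\ln(t)$ for $t$, so the same check applies there as well.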
Rather, the result can be seen by evaluating the following four lines using Maple™:

    gam := int(t^(a-1)*exp(-t), t=0..x);
    CDFgam := subs(a=k, x=x/theta, gam) / GAMMA(k);
    diff(CDFgam, k);
    diff(CDFgam, theta);

Declaration

I hereby declare that

• I have written the present dissertation, “Event-based Failure Prediction: An Extended Hidden Markov Model Approach”, independently and without unauthorized assistance;

• I have not previously applied for a doctoral degree elsewhere, nor do I hold one;

• I am aware of the doctoral degree regulations of the Mathematisch-Naturwissenschaftliche Fakultät II of Humboldt-Universität zu Berlin (published in Amtl. Mitteilungsblatt Nr. 34/2006).

Acronyms

AC Agglomerative coefficient
AFS Andrews file system
AGNES Agglomerative Nesting
ARMA Auto-regressive moving average
ARX Auto-regressive model with auxiliary input
AUC Area under (ROC) curve
BCa Bias corrected accelerated confidence intervals
BFGS Broyden-Fletcher-Goldfarb-Shanno
BLAST Basic local alignment search tool
CBE Common base event
CDF Cumulative distribution function
CHMM Continuous hidden Markov model
CPU Central processing unit
CRF Conditional random fields
CTMC Continuous time Markov chain
CT-HMM Continuous time hidden Markov model
DC Divisive coefficient
DET Detection error tradeoff
DF Dispersion frame
DFT Dispersion frame technique
DIANA Divisive analysis clustering
DTMC Discrete time Markov chain
DVD Digital versatile disk
ECDF Empirical cumulative distribution function
ECG Expectation conjugate gradient
EDI Error dispersion index
EFDIA Early failure detection and isolation arrangement
EM Expectation maximization
ESHMM Expanded state HMM
fMRI Functional magnetic resonance imaging
FN False negative
FOIL First order ??? language (not spelled out in the paper)
FP False positive
FPR False positive rate
FRU Field replaceable unit
FWN Fuzzy wavelet network
GHMM General hidden Markov model library
GPRS General packet radio service
GSM Global system for mobile communication
HMM Hidden Markov model
HP Hewlett-Packard
HSMM Hidden semi-Markov model
HSMESM Hidden semi-Markov event sequence model
HTTP Hypertext transport protocol
IBM International Business Machines
ID Identifier
IHMM Inhomogeneous hidden Markov model
IN Intelligent network
IO Input output
IP Internet protocol
LDAP Lightweight directory access protocol
LSI Latent semantic indexing
MAP Maximum a-posteriori
ML Maximum likelihood
MOC Mobile originated call
MSET Multivariate state estimation technique
MTBF Mean time between failures
MTBP Mean time between predictions
MTTF Mean time to failure
MTTP Mean time to prediction
MTTR Mean time to repair
NBEM Naive Bayes expectation maximization
NF Non-failure
OR Odds ratio
PCA Principal component analysis
PCF Probability cost function
PCFG Probabilistic context-free grammar
PFM Proactive fault management
PR Precision-recall
PWA Probabilistic wrapper approach
QQ Quantile quantile
RADIUS Remote authentication dial-in user service
RAID Redundant array of independent disks
RBF Radial basis function
RLC Resistor inductor capacitor
ROC Receiver operating characteristic
SAN Stochastic activity network
SAP Systems, applications, products
SAR System activity reporter
SCF Service control function
SCP Service control point
SEP Similar events prediction
SHIP Software hardware interoperability people
SMART Self-monitoring, analysis and reporting technology
SMP Semi-Markov process
SMS Short message service
SPRT Sequential probability ratio test
SRN Stochastic reward net
SSI Stressor susceptibility interaction
STAR Self-testing and repairing
SVD Singular value decomposition
SVM Support vector machine
TBF Time between failures
TCP Transmission control protocol
TN True negative
TP True positive
TTF Time to failure
TTP Time to prediction
TTR Time to repair
UBF Universal basis function
UML Unified Modeling Language
UPGMA Unweighted pair-group average method
URL Uniform resource locator
UTC Universal time coordinated
WSDM Web services distributed management

Index

φ-coefficient, 166 a-priori algorithm, 49 abstraction, 6 accumulated runtime cost, 163 accuracy, 156 AdaBoost, 145 adaptive enterprise, 5 agglomerative coefficient, 151 aggregated models, 145 alarm, 41 alphabet, 56, 109 amount of background weight, 199 anomaly detectors, 33 approximation approach, 25 arcing, 145 area under curve (AUC), 164 autonomic computing, 5 background distribution, 81, 126, 145 backward algorithm HMM, 59 HSMM, 105 bag-of-words, 50 bagging, 145, 278 banner plot, 151 Baum-Welch algorithm HMM, 60 HSMM, 106 Bayes error rate, 135 Bayes prediction, 30 Bayesian prediction, 37 BCa, 171 bias and variance, 138 bias, 139 classification, 140 regression, 138 variance, 139 bias-variance dilemma, 140 boosting, 145, 278 bootstrapping, 170 boundary bias, 143 boundary error, 141 bug Bohrbugs, 22 Heisenbugs, 22 Mandelbugs, 22 Schrödingbugs, 22 chaining effect, 83 checkpoint, 230 class skewness, 26 classification, 19 cost, 135 cost-based, 135 failure prediction, 136 likelihood ratio, 136 log-likelihood, 137 loss matrix, 135 multiclass log-likelihood, 138 rejection thresholds, 136 risk, 135 sequence likelihood, 136 clustering, 37 agglomerative, 81 complete linkage, 83 divisive, 81 failure sequences, 18 hierarchical, 81 nearest neighbor, 83 partitioning, 81 stopping rules, 82 unweighted pair-group average, 83 Ward’s method, 83 clusters, 83 collision, 77 common base event (CBE), 90 conditional random fields, 279 confidence, 49 confusion matrix, 153 containers, 15 contingency table, 152, 153 continuous output probability densities, 67 continuous time sequences, 63 convex combination, 98 cooperative checkpointing, 4 correct no-warning, 153 correct warning, 153 count encoding, 216 counting and thresholding
prediction, 30 data mining, 41 data sets, 167 test, 167 training, 167 validation, 167 data window size ∆td , 136, 208 data window size ∆td ., 180 decision boundaries, 134 region, 134 surfaces, 134 defect trigger, 87 defect type, 87 delay symbols, 65 dendrogram, 149 detection error trade-off (DET), 159 diagnosis, 281 discrete time Markov chain (DTMC), 55 dispersion frame technique (DFT), 46 dissimilarity matrix, 80 distributed system, 16 divisive coefficient, 151 downtime avoidance, 227 downtime minimization, 227 duration, 97 early stopping, 144 engineering cycle, 6 entropy, 90 equilibrium state distribution, 243 ergodic topology, 81, 110 error, 10, 41 error function, 285 error patterns, 17 error type, 76 error-based failure prediction, 39 classifier, 45 frequency, 39 pattern recognition, 43 rule-based, 41 statistical tests, 45 event, 39 event type, 87 Index event-triggered temporal sequence, 46 eventset, 48 accurate, 49 frequent, 48 method, 42, 48 expectation conjugate gradient (ECG), 110 expectation maximization (EM), 61, 116 generalized, 110, 119 expected risk, 135 F-measure, 155 failure, 10 arbitrary, 14 computation, 14 crash, 14 omission, 14 performance, 14, 19 timing, 14 failure avoidance, 227 failure mechanism, 15, 18, 79 failure modes, 15 failure prediction, 11 online, 12 failure probable, 233 failure sequence, 79 clustering, 79, 182, 212 grouping, 79, 182, 212 failure warning, 153 failure windows, 48 false negative, 153, 228 false positive, 153, 228 false positive rate, 156 false warning, 153 fault, 10 auditing, 10 design, 21 detection, 10 intermittent, 21 monitoring, 10 permanent, 21 runtime, 21 transient, 21 fault injection, 250 fault intolerance, 23 fault model, 20 fault tolerance, 23 fault trees, 42 feature analysis, 38 feature selection, 36 first passage time distribution, 103 first step analysis, 104 Index forced downtime, 228 forward algorithm HMM, 58 HSMM, 101 frequency of error occurrence, 39 function approximation-based prediction, 34 curve 
fitting, 34 genetic programming, 35 machine learning, 35 furthest neighbor, 83 G-measure, 165 generalized EM, 110, 119 Gini coefficient, 166 growing and pruning, 144 hidden Markov model (HMM), 56 basic problems of, 57 continuous (CHMM), 56, 67 continuous time (CT-HMM), 67 discrete, 56 hidden semi-Markov model (HSMM), 68, 95 complexity, 128 event sequence model (HSMESM), 69 expanded state (ESHMM), 69 exponentially-distributed durations, 68 Ferguson’s model, 68 gamma-distributed durations, 68 inhomogeneous (IHMM), 70, 116 Poisson-distributed durations, 68 proof of convergence, 116 reestimation formulas, 106 segmental, 69 structure of, 109 topology of, 109 Viterbi path constrained durations, 69 hierarchical numbering, 87 hybrid modeling approach, 278 incomplete data, 117 inductive bias, 278 information entropy of logfiles, 89 interdependencies, 15 intermediate states, 127, 145 jacknife, 170 kernels, 96, 98 label bias problem, 279 lead-time ∆tl , 180, 207 learning 297 batch, 25 offline, 25 online, 25 supervised, 25 leave-one-out, 170 lift, 166 likelihood, 133 logarithmic, 80, 101 load lowering, 229 logfile entropy, 89 error ID assignment, 178 hierarchical numbering, 87 tupling, 179 type and source, 86 lower bound optimization, 117 m-fold cross validation, 143, 168 machine learning, 16 marginal, 118 margins for non-failure sequences, 180 Markov assumptions, 56 properties, 56, 96 Markov renewal sequence, 95 kernel, 96 maximum cost, 164 maximum span of shortcuts, 199 median, 170 meta-learning, 278 minimal distance methods, 24 missing warning, 153 mixture of distributions, 98 mode, 170 model order selection, 144 monitoring-based classifiers, 29, 36 Bayesian classifier, 37 clustering, 37 statistical tests, 37 n-version programming, 4 no free lunch theorem, 25 noise filtering, 83, 188 non-failure sequences, 79 non-parametric prediction, 30 number of intermediates, 199 number of states, 199, 200 number of tries in each optimization step, 198 observation probabilities, 56 odds 
ratio, 157 298 online learning, 278 oracle, 163 out-of-sample, 167, 202 overfitting, 5, 140 pairwise alignment, 44 parameter setting greedy, 166 non-greedy, 167 parameter tying, 145 pattern recognition-based prediction, 43 Markov models, 44 pairwise alignment, 44 probabilistic context-free grammar, 43 perfect predictor, 164 periodic prediction, 53, 217 Piatetsky-Shapiro, 166 positive, 153 posterior probability distribution, 133 precision, 154 precision recall break-even, 165 precision recall curves, 157 prediction overview, 18 preparation, 227 preventive failover, 229 primal-dual method, 117 prior, 133 proactive downtime minimization, 229 proactive fault management, 5, 228 probabilistic context-free grammars (PCFG), 43 probabilistic wrapper approach (PWA), 36 properties of the data set, 221 Q-function, 118 quality of logfiles, 89 reactive downtime minimization, 229 recall, 154 receiver operating characteristics (ROC), 158 recovery oriented computing, 5 reestimation step, 129 regularization, 145, 278 rejuvenation, 4, 231 reliability model, 244 resamples, 170 responsive computing, 5 roll-backward scheme, 230 roll-forward scheme, 230 root cause, 10 analysis, 11 Index rule-based prediction, 41 data mining, 41 fault trees, 42 sample error rate, 169 SAR, 165 scaling, 101 self-* properties, 5 self-testing and repairing computer (STAR), 4 self-transitions, 64 semi-Markov process (SMP), 67, 95 sequence generation, 25 likelihood, 18 prediction, 25 recognition, 25 sequence extraction, 79 sequence likelihood, 57 HSMM, 101 sequence prediction, 102 sequential decision making, 25 sequential pattern mining, 42 service degradation, 32 SHIP fault model, 23 shortcuts, 126, 145 signal processing, 39 similar events prediction (SEP), 5 single linkage, 83 singular value decomposition (SVD), 50 software aging, 5, 231 software components, 15 source, 87 speech recognition, 113 state clean-up, 229 state duration, 114 statistical confidence, 168 statistical methods, 25 steady-state 
availability, 243, 244 stratification, 168 structure, 109 supervised offline batch learning, 212 support, 49 support vector machines (SVM), 51 SVD-SVM, 50 symbol, 56, 76 symptoms, 10, 40 system configuration, 211 system model-based prediction, 32 anomaly detectors, 33 control theory, 33 stochastic, 32 Index temporal encoding, 216 temporal output, 66 temporal sequence, 15, 63, 115 temporal sequence pattern recognition, 53 test data, 167 test data set, 167 time series analysis, 38 feature analysis, 38 signal processing, 39 time series prediction, 38 time slotting, 64 time-varying internal process, 66 topology, 109 training overview, 18 training data set, 167 training with noise, 143 transition duration, 97 probability, 97 true negative, 153 true positive, 153 true positive rate, 156 truncation, 77 trustworthy computing, 5 tupling, 76, 77 two dimensional output, 66 Type I error, 153 Type II error, 153 type of background distributions, 198 underfitting, 140 universal basis functions (UBF), 219 unobservable data, 117 validation, 167 validation data set, 167 variable selection, 36, 279 Viterbi algorithm HMM, 59 HSMM, 101 weighted relative accuracy, 165 299 Bibliography [1] Abraham, A. & Grosan, C. Genetic programming approach for fault modeling of electronic hardware. In IEEE Proceedings Congress on Evolutionary Computation (CEC’05),, volume 2, 1563–1569. Edinburgh, UK, 2005 [2] Agrawal, R., Imieliński, T., & Swami, A. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD international conference on Management of data (SIGMOD 93), 207–216. ACM Press, 1993 [3] Aitchison, J. & Dunsmore, I. R. Statistical Prediction Analysis. Cambridge University Press, 1975 [4] Albin, S. & Chao, S. Preventive replacement in systems with dependent components. IEEE Transactions on Reliability, volume 41(2): 230–238, 1992 [5] Aldenderfer, M. & Blashfield, R. Cluster Analysis. 
Sage Publications, Inc., Newbury Park (CA,USA), 1984 [6] Alpaydin, E. Introduction To Machine Learning. MIT Press, 2004 [7] Altman, D. G. Practical Statistics for Medical Research. Chapman-Hall, 1991 [8] Altschul, S., Gish, W., Miller, W., Myers, E., & Lipman, D. Basic local alignment search tool. Journal of Molecular Biology, volume 215(3): 403–410, 1990 [9] Amari, S. & McLaughlin, L. Optimal design of a condition-based maintenance model. In IEEE Proceedings of Reliability and Maintainability Symposium (RAMS), 528–533. 2004 [10] Andrzejak, A. & Silva, L. Deterministic Models of Software Aging and Optimal Rejuvenation Schedules. In 10th IEEE/IFIP International Symposium on Integrated Network Management (IM ’07), 159–168. 2007 [11] Apostolico, A. E. D. & Galil, Z. Pattern Matching Algorithms. Oxford University Press, 1997 [12] Ascher, H. E., Lin, T.-T. Y., & Siewiorek, D. P. Modification of: Error Log Analysis: Statistical Modeling and Heuristic Trend Analysis. IEEE Transactions on Reliability, volume 41(4): 599–601, 1992 [13] Avižienis, A. Fault-tolerance and fault-intolerance: Complementary approaches to reliable computing. In Proceedings of the international conference on Reliable software, 458–464. ACM Press, New York, NY, USA, 1975 [14] Aviz̆ienis, A. The N-Version Approach to Fault-Tolerant Software. IEEE Transactions on Software Engineering, volume SE-11(12): 1491–1501, 1985 301 302 Bibliography [15] Aviz̆ienis, A., Gilley, G., Mathur, F., Rennels, D., Rohr, J., & Rubin, D. The STAR (SelfTesting And Repairing) Computer: An Investigation of the Theory and Practice of FaultTolerant Computer Design. IEEE Transactions on Computers, volume C-20(11): 1312– 1321, 1971 [16] Aviz̆ienis, A. & Laprie, J.-C. Dependable computing: From concepts to design diversity. Proceedings of the IEEE, volume 74(5): 629–638, 1986 [17] Avižienis, A., Laprie, J.-C., Randell, B., & Landwehr, C. Basic concepts and taxonomy of dependable and secure computing. 
IEEE Transactions on Dependable and Secure Computing, volume 1(1): 11–33, 2004 [18] Azimi, M., Nasiopoulos, P., & Ward, R. K. Offline and Online Identification of Hidden Semi-Markov Models. IEEE Transactions on Signal Processing, volume 53(8): 2658–2663, 2005 [19] Babaoglu, O., Jelasity, M., Montresor, A., Fetzer, C., Leonardi, S., van Moorsel A., & van Steen, M. (eds.). Self-Star Properties in Complex Information Systems, Lecture Notes in Computer Science, volume 3460. Springer-Verlag, 2005 [20] Bai, C. G., Hu, Q. P., Xie, M., & Ng, S. H. Software failure prediction based on a Markov Bayesian network model. Journal of Systems and Software, volume 74(3): 275–282, 2005 [21] Bao, Y., Sun, X., & Trivedi, K. Adaptive Software Rejuvenation: Degradation Model and Rejuvenation Scheme. In Proceedings of the 2003 International Conference on Dependable Systems and Networks (DSN’2003). IEEE Computer Society, 2003 [22] Bao, Y., Sun, X., & Trivedi, K. A workload-based analysis of software aging, and rejuvenation. IEEE Transactions on Reliability, volume 54(3): 541–548, 2005 [23] Barborak, M., Dahbura, A., & Malek, M. The consensus problem in fault-tolerant computing. ACM Computing Surveys, volume 25(2): 171–220, 1993 [24] Basseville, M. & Nikiforov, I. Detection of abrupt changes: theory and application. Prentice Hall, 1993 [25] Baum, L. E. & Sell, G. R. Growth Transformations for Functions on Manifolds. Pacific Journal of Mathematics, volume 27(2): 211–227, 1968 [26] Bazaraa, M. S. & Shetty, C. M. Nonlinear Programming. John Wiley and Sons, New York, 1979 [27] Berenji, H., Ametha, J., & Vengerov, D. Inductive learning for fault diagnosis. In IEEE Proceedings of 12th International Conference on Fuzzy Systems (FUZZ’03), volume 1. 2003 [28] Bicego, M., Murino, V., & Figueiredo, M. A. T. A sequential pruning strategy for the selection of the number of states in hidden Markov models. Pattern Recognition Letters, volume 24(9–10): 1395–1407, 2003 [29] Bilmes, J. A. 
A Gentle Tutorial on the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Tech. report ICSI-TR-97021, U.C. Berkeley, International Computer Science Institute, Berkeley, CA, 1998 [30] Bishop, C. M. Neural Networks for Pattern Recognition. Oxford University Press, 1995 Bibliography 303 [31] Bland, J. M. & Altman, D. G. The odds ratio. British Medical Journal, volume 320(7247): 1468, 2000 [32] Blischke, W. R. & Murthy, D. N. P. Reliability: Modeling, Prediction, and Optimization. Probability and Statistics. John Wiley and Sons, 2000 [33] Bonafonte, A., Vidal, J., & Nogueiras, A. Duration modeling with expanded HMM applied to speech recognition. In IEEE Proceedings of the Fourth International Conference on Spoken Language (ICSLP 96), volume 2, 1097–1100. 1996 [34] Borgelt, C. & Kruse, R. Induction of Association Rules: Apriori Implementation. In Proceedings of 15th Conference on Computational Statistics (Compstat 2002). Physica Verlag, Heidelberg, Germany, 2002 [35] Bowles, J. A survey of reliability-prediction procedures for microelectronic devices. IEEE Transactions on Reliability, volume 41(1): 2–12, 1992 [36] Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. Time Series Analysis: Forecasting and Control. Prentice Hall, Englewood Cliffs, New Jersey, 3rd edition, 1994 [37] Bridgewater, D. Standardize Messages with the Common Base Event Model. 2004. URL www-106.ibm.com/developerworks/autonomic/library/ac-cbe1/ [38] Brocklehurst, S. & Littlewood, B. Techniques for Prediction Analysis and Recalibration. In Lyu, M. R. (ed.), Handbook of software reliability engineering, chapter 4, 119–166. McGraw-Hill, 1996 [39] Bronstein, I. N., Semendjajew, K. A., Musiol, G., & Mühlig, H. Taschenbuch der Mathematik. Harri Deutsch, Frankfurt am Main, Germany, 6th edition, 2005 [40] Brown, A. & Patterson, D. Embracing Failure: A Case for Recovery-Oriented Computing (ROC). In High Performance Transaction Processing Symposium. 
2001 [41] Burckhardt, J. Griechische Kultur. Safari Verlag, Berlin, Germany, 1958 [42] Candea, G. The Enemies of Dependability I: Software. Technical Report CS444a, Stanford University, CA, 2003 [43] Candea, G., Cutler, J., & Fox, A. Improving Availability with Recursive Microreboots: A Soft-State System Case Study. Performance Evaluation Journal, volume 56(1-3), 2004 [44] Candea, G., Delgado, M., Chen, M., & Fox, A. Automatic Failure-Path Inference: A Generic Introspection Technique for Internet Applications. In Proceedings of the 3rd IEEE Workshop on Internet Applications (WIAPP). San Jose, CA, 2003 [45] Candea, G., Kiciman, E., Zhang, S., Keyani, P., & Fox, A. JAGR: An Autonomous SelfRecovering Application Server. In Proceedings of the 5th International Workshop on Active Middleware Services. Seattle, WA, USA, 2003 [46] Caruana, R. & Niculescu-Mizil, A. Data mining in metric space: an empirical analysis of supervised learning performance criteria. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD 04), 69–78. ACM Press, New York, NY, USA, 2004 304 Bibliography [47] Cassady, C., Maillart, L., Bowden, R., & Smith, B. Characterization of optimal agereplacement policies. In IEEE Proceedings of Reliability and Maintainability Symposium, 170–175. 1998 [48] Cassidy, K. J., Gross, K. C., & Malekpour, A. Advanced Pattern Recognition for Detection of Complex Software Aging Phenomena in Online Transaction Processing Servers. In Proceedings of Dependable Systems and Networks (DSN), 478–482. 2002 [49] Castelli, V., Harper, R., P., H., Hunter, S., Trivedi, K., Vaidyanathan, K., & Zeggert, W. Proactive management of software aging. IBM Journal of Research and Development, volume 45(2): 311–332, 2001 [50] Chakravorty, S., Mendes, C., & Kale, L. Proactive fault tolerance in large systems. In HPCRI Workshop in conjunction with HPCA 2005. 2005 [51] Chan, L. M., Comaromi, J. P., Mitchell, J. S., & Satija, M. 
Dewey Decimal Classification: A Practical Guide. OCLC Forest Press, Albany, N.Y., 2nd edition, 1996 [52] Chen, M., Accardi, A., Lloyd, J., Kiciman, E., Fox, A., Patterson, D., & Brewer, E. Pathbased Failure and Evolution Management. In Proceedings of USENIX/ACM Symposium on Networked Systems Design and Implementation (NSDI). San Francisco, CA, 2004 [53] Chen, M., Kiciman, E., Fratkin, E., Fox, A., & Brewer, E. Pinpoint: Problem Determination in Large, Dynamic Internet Services. In Proceedings of 2002 International Conference on Dependable Systems and Networks (DSN), IPDS track, 595–604. IEEE Computer Society, 2002 [54] Chen, M., Zheng, A., Lloyd, J., Jordan, M., & Brewer, E. Failure diagnosis using decision trees. In IEEE Proceedings of International Conference on Autonomic Computing, 36–43. 2004 [55] Chen, M.-S., Park, J. S., & Yu, P. S. Efficient Data Mining for Path Traversal Patterns. IEEE Transactions on Knowledge and Data Engineering, volume 10(2): 209–221, 1998. URL citeseer.nj.nec.com/article/chen98efficient.html [56] Chen, P., Lin, C. J., & Schoelkopf, B. A tutorial on ν-Support Vector Machines. Applied Stochastic Models in Business and Industry, volume 21(2): 111–136, 2005 [57] Cheng, F., Wu, S., Tsai, P., Chung, Y., & Yang, H. Application Cluster Service Scheme for Near-Zero-Downtime Services. In IEEE Proceedings of the International Conference on Robotics and Automation, 4062–4067. 2005 [58] Chiang, F. & Braun, R. Intelligent Network Failure Domain Prediction in Complex Telecommunication Systems with Hybrid Neural Rough Nets. In The Second International Symposium on Neural Networks (ISNN 2005). Chongqing, China, 2005 [59] Chillarege, R., Bhandari, S., Chaar, J. K., Halliday, M. J., Moebus, D. S., Ray, B. K., & Wong, M.-Y. Orthogonal Defect Classification - A Concept for In-Process Measurements. IEEE Transactions on Software Engineering, volume 18(11): 943–955, 1992 [60] Chillarege, R., Biyani, S., & Rosenthal, J. 
Measurement of Failure Rate in Widely Distributed Software. In FTCS ’95: Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing, 424–432. IEEE Computer Society, 1995 Bibliography 305 [61] Cohen, W. W. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, 115–123. 1995 [62] Cole, R., Mariani, J., Uszkoreit, H., Varile, G. B., Zaenen, A., Zampolli, A., & Zue, V. (eds.). Survey of the State of the Art in Human Language Technology. Cambridge University Press and Giardini, 1997 [63] Coleman, D. & Thompson, C. Model Based Automation and Management for the Adaptive Enterprise. In Proceedings of the 12th Annual Workshop of HP OpenView University Association, 171–184. 2005 [64] Comission, I. I. T. (ed.). Dependability and Quality of Service, chapter 191. IEC, 2nd edition, 2002 [65] Cook, A. E. & Russell, M. J. Improved duration modeling in hidden Markov models using series-parallel configurations of states. Proc. Inst. Acoust., volume 8: 299–306, 1986 [66] Cover, T. M. Learning in pattern recognition. In Watanabe, S. (ed.), Methodologies of Pattern Recognition, 111–132. Academic Press, 1968 [67] Cox, D. R. & Miller, H. D. The Theory of Stochastic Processes. Chapman and Hall, London, UK, 1st edition, 1965 [68] Cristian, F., Aghili, H., Strong, R., & Dolev, D. Atomic Broadcast: From Simple Message Diffusion to Byzantine Agreement. In IEEE Proceedings of 15th International Symposium on Fault Tolerant Computing (FTCS). 1985 [69] Cristian, F., Dancey, B., & Dehn, J. Fault-tolerance in the Advanced Automation System. In IEEE Proceedings of 20th International Symposium on Fault-Tolerant Computing (FTCS20), 6–17. 1990 [70] Cristianini, N. & Shawe-Taylor, J. An introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press, 2000 [71] Crowell, J., Shereshevsky, M., & Cukic, B. Using fractal analysis to model software aging. 
Technical report, West Virginia University, Lane Department of CSEE, Morgantown, WV, 2002 [72] Csenki, A. Bayes Predictive Analysis of a Fundamental Software Reliability Model. IEEE Transactions on Reliability, volume 39(2): 177–183, 1990 [73] Daidone, A., Di Giandomenico, F., Bondavalli, A., & Chiaradonna, S. Hidden Markov Models as a Support for Diagnosis: Formalization of the Problem and Synthesis of the Solution. In IEEE Proceedings of the 25th Symposium on Reliable Distributed Systems (SRDS 2006). Leeds, UK, 2006 [74] Dalgaard, P. Introductory Statistics with R. Springer, 2002 [75] Dempster, A., Laird, N., & Rubin, D. Maximum-Likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, volume 39(1): 1–38, 1977 [76] Dennis, J. E. J. & Moré, J. J. Quasi-Newton Methods, Motivation and Theory. SIAM Review, volume 19(1): 46–89, 1977 [77] Denson, W. The history of reliability prediction. IEEE Transactions on Reliability, volume 47(3): 321–328, 1998 306 Bibliography [78] Discenzo, F., Unsworth, P., Loparo, K., & Marcy, H. Self-diagnosing intelligent motors: a key enabler for nextgeneration manufacturing systems. In IEE Colloquium on Intelligent and Self-Validating Sensors. 1999 [79] Dohi, T., Goseva-Popstojanova, K., & Trivedi, K. S. Analysis of Software Cost Models with Rejuvenation. In Proceedings of IEEE Intl. Symposium on High Assurance Systems Engineering, HASE 2000. 2000 [80] Dohi, T., Goseva-Popstojanova, K., & Trivedi, K. S. Statistical Non-Parametric Algorihms to Estimate the Optimal Software Rejuvenation Schedule. In Proceedings of the Pacific Rim International Symposium on Dependable Computing (PRDC 2000). 2000 [81] Domeniconi, C., Perng, C.-S., Vilalta, R., & Ma, S. A Classification Approach for Prediction of Target Events in Temporal Sequences. In Elomaa, T., Mannila, H., & Toivonen, H. 
(eds.), Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD'02), LNAI, volume 2431, 125–137. Springer-Verlag, Heidelberg, 2002
[82] Domingos, P. A Unified Bias-Variance Decomposition for Zero-One and Squared Loss. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, 564–569. 2000
[83] Drummond, C. & Holte, R. C. Explicitly representing expected cost: an alternative to ROC representation. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'00), 198–207. ACM Press, New York, NY, USA, 2000
[84] Duda, R. O. & Hart, P. E. Pattern classification and scene analysis. John Wiley and Sons, New York, London, Sydney, Toronto, 1973
[85] Duda, R. O., Hart, P. E., & Stork, D. G. Pattern Classification. Wiley-Interscience, 2nd edition, 2000
[86] Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, UK, 1998
[87] Efron, B. Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics, volume 7(1): 1–26, 1979
[88] Egan, J. P. Signal detection theory and ROC analysis. Academic Press, New York, 1975
[89] Elbaum, S., Kanduri, S., & Amschler, A. Anomalies as precursors of field failures. In IEEE Proceedings of the 14th International Symposium on Software Reliability Engineering (ISSRE 2003), 108–118. 2003
[90] Elliott, R. J., Aggoun, L., & Moore, J. B. Hidden Markov Models: Estimation and Control, Stochastic Modelling and Applied Probability, volume 29. Springer Verlag, 1st edition, 1995
[91] Elnozahy, E. N., Alvisi, L., Wang, Y., & Johnson, D. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, volume 34(3): 375–408, 2002
[92] Esary, J. D. & Proschan, F. The Reliability of Coherent Systems. In Wilcox & Mann (eds.), Redundancy Techniques for Computing Systems, 47–61.
Spartan Books, Washington, DC, 1962
[93] Faisan, S., Thoraval, L., Armspach, J., & Heitz, F. Unsupervised Learning and Mapping of Brain fMRI Signals Based on Hidden Semi-Markov Event Sequence Models. In Goos, G., Hartmanis, J., & van Leeuwen, J. (eds.), Medical Image Computing and Computer-Assisted Intervention (MICCAI 2003), Lecture Notes in Computer Science, volume 2879, 75–82. Springer, 2003
[94] Farr, W. Software Reliability Modeling Survey. In Lyu, M. R. (ed.), Handbook of Software Reliability Engineering, chapter 3, 71–117. McGraw-Hill, 1996
[95] Fawcett, T. ROC graphs: notes and practical considerations for data mining researchers. Technical Report 2003-4, HP Laboratories, Palo Alto, CA, USA, 2003
[96] Ferguson, J. Variable duration models for speech. In Proceedings of the Symposium on the Application of HMMs to Text and Speech, 143–179. 1980
[97] Flach, P. The geometry of ROC space: understanding machine learning metrics through ROC isometrics. In Proceedings of the 20th International Conference on Machine Learning (ICML'03), 194–201. AAAI Press, 2003
[98] Friedman, J. H. On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality. Data Mining and Knowledge Discovery, volume 1(1): 55–77, 1997
[99] Fu, S. & Xu, C.-Z. Quantifying Temporal and Spatial Fault Event Correlation for Proactive Failure Management. In IEEE Proceedings of Symposium on Reliable and Distributed Systems (SRDS 07). 2007
[100] Garg, S., van Moorsel, A., Vaidyanathan, K., & Trivedi, K. S. A Methodology for Detection and Estimation of Software Aging. In Proceedings of the 9th International Symposium on Software Reliability Engineering, ISSRE 1998. 1998
[101] Garg, S., Puliafito, A., Telek, M., & Trivedi, K. Analysis of Preventive Maintenance in Transactions Based Software Systems. IEEE Trans. Comput., volume 47(1): 96–107, 1998
[102] Ge, X. Segmental semi-Markov models and applications to sequence analysis. Ph.D. thesis, University of California, Irvine, 2002.
Chair: Padhraic Smyth
[103] Gellert, W., Küstner, H., Hellwig, M., & Kästner, H. (eds.). Kleine Enzyklopädie Mathematik. VEB Bibliographisches Institut, Leipzig, Germany, 1965
[104] Geman, S., Bienenstock, E., & Doursat, R. Neural networks and the bias/variance dilemma. Neural Computation, volume 4(1): 1–58, 1992
[105] Gertsbakh, I. Reliability Theory: with Applications to Preventive Maintenance. Springer-Verlag, Berlin, Germany, 2000
[106] Goldberg, D. E. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison Wesley, 1989
[107] Gray, J. Why do computers stop and what can be done about it? In Proceedings of Symposium on Reliability in Distributed Software and Database Systems (SRDS-5), 3–12. IEEE CS Press, Los Angeles, CA, 1986
[108] Gray, J. A census of Tandem system availability between 1985 and 1990. IEEE Transactions on Reliability, volume 39(4): 409–418, 1990
[109] Gray, J. & Reuter, A. Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992
[110] Gross, K. C., Bhardwaj, V., & Bickford, R. Proactive Detection of Software Aging Mechanisms in Performance Critical Computers. In SEW '02: Proceedings of the 27th Annual NASA Goddard Software Engineering Workshop (SEW-27'02). IEEE Computer Society, Washington, DC, USA, 2002
[111] Gujrati, P., Li, Y., Lan, Z., Thakur, R., & White, J. A Meta-Learning Failure Predictor for Blue Gene/L Systems. In IEEE Proceedings of International Conference on Parallel Processing (ICPP 2007). 2007
[112] Hamerly, G. & Elkan, C. Bayesian approaches to failure prediction for disk drives. In Proceedings of the Eighteenth International Conference on Machine Learning, 202–209. Morgan Kaufmann Publishers Inc., 2001
[113] Hamming, R. W. Error Detecting and Error Correcting Codes. Bell System Technical Journal, volume 29(2): 147–160, 1950
[114] Hansen, J. & Siewiorek, D. Models for time coalescence in event logs.
In IEEE Proceedings of International Symposium on Fault-Tolerant Computing (FTCS-22), 221–227. 1992
[115] Hastie, T., Tibshirani, R., & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer Verlag, 2001
[116] Hätönen, K., Klemettinen, M., Mannila, H., Ronkainen, P., & Toivonen, H. TASA: Telecommunication Alarm Sequence Analyzer, or: How to enjoy faults in your network. In IEEE Proceedings of Network Operations and Management Symposium, volume 2, 520–529. Kyoto, Japan, 1996
[117] Hellerstein, J. L., Zhang, F., & Shahabuddin, P. An approach to predictive detection for service management. In IEEE Proceedings of Sixth International Symposium on Integrated Network Management, 309–322. 1999
[118] Herodot. Historien. Kröner Verlag, Stuttgart, Germany, 1971
[119] Hestenes, M. R. & Stiefel, E. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, volume 49(6): 409–436, 1952
[120] Hoffmann, G. A. Failure Prediction in Complex Computer Systems: A Probabilistic Approach. Shaker Verlag, 2006
[121] Hoffmann, G. A. & Malek, M. Call Availability Prediction in a Telecommunication System: A Data Driven Empirical Approach. In Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems (SRDS 2006). Leeds, United Kingdom, 2006
[122] Hoffmann, G. A., Trivedi, K. S., & Malek, M. A Best Practice Guide to Resource Forecasting for Computing Systems. IEEE Transactions on Reliability, volume 56(4): 615–628, 2007
[123] Horn, P. Autonomic Computing: IBM's perspective on the State of Information Technology. 2001. URL http://www.research.ibm.com/autonomic/manifesto/autonomic_computing.pdf
[124] Hotelling, H. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, volume 24: 417–441, 1933
[125] Huang, X., Acero, A., & Hon, H.-W.
Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall, Upper Saddle River, NJ, USA, 2001
[126] Huang, Y., Kintala, C., Kolettis, N., & Fulton, N. Software Rejuvenation: Analysis, Module and Applications. In Proceedings of IEEE Intl. Symposium on Fault Tolerant Computing, FTCS 25. 1995
[127] Hughes, G., Murray, J., Kreutz-Delgado, K., & Elkan, C. Improved disk-drive failure warnings. IEEE Transactions on Reliability, volume 51(3): 350–357, 2002
[128] Hughey, R. & Krogh, A. Hidden Markov models for sequence analysis: extension and analysis of the basic method. CABIOS, volume 12(2): 95–107, 1996
[129] Iyer, R. & Rosetti, D. A statistical load dependency of CPU errors at SLAC. In IEEE Proceedings of 12th International Symposium on Fault Tolerant Computing (FTCS-12). 1982
[130] Iyer, R. K., Young, L. T., & Iyer, P. K. Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data. IEEE Transactions on Computers, volume 39(4): 525–537, 1990
[131] Iyer, R. K., Young, L. T., & Sridhar, V. Recognition of error symptoms in large systems. In Proceedings of 1986 ACM Fall Joint Computer Conference, 797–806. IEEE Computer Society Press, Los Alamitos, CA, USA, 1986
[132] Jelinski, Z. & Moranda, P. Software reliability research. In Freiberger, W. (ed.), Statistical computer performance evaluation. Academic Press, 1972
[133] Jensen, J. L. W. V. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica, volume 30(1): 175–193, 1906
[134] Jiménez, D. A. & Lin, C. Neural methods for dynamic branch prediction. ACM Transactions on Computer Systems, volume 20(4): 369–397, 2002
[135] Joachims, T. Making large-scale SVM Learning Practical. In Schölkopf, B., Burges, C., & Smola, A. (eds.), Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999
[136] Joseph, D. & Grunwald, D. Prefetching Using Markov Predictors. IEEE Transactions on Computers, volume 48(2): 121–133, 1999
[137] Juang, B.-H., Levinson, S. E., & Sondhi, M. M. Maximum Likelihood Estimation for Multivariate Mixture Observations of Markov Chains. IEEE Transactions on Information Theory, volume 32(2): 307–309, 1986
[138] Juang, B.-H. & Rabiner, L. The segmental K-means algorithm for estimating parameters of hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, volume 38(9): 1639–1641, 1990
[139] Kajko-Mattson, M. Can We Learn Anything from Hardware Preventive Maintenance? In ICECCS '01: Proceedings of the Seventh International Conference on Engineering of Complex Computer Systems, 106–111. IEEE Computer Society, 2001
[140] Kalman, R. E. & Bucy, R. S. New results in linear filtering and prediction theory. Transactions of the ASME, Series D, Journal of Basic Engineering, volume 83: 95–107, 1961
[141] Kapadia, N. H., Fortes, J. A. B., & Brodley, C. E. Predictive application-performance modeling in a computational grid environment. In IEEE Proceedings of the Eighth International Symposium on High Performance Distributed Computing, 47–54. 1999
[142] Kaufman, L. & Rousseeuw, P. J. Finding Groups in Data. John Wiley and Sons, New York, 1990
[143] Kelly, J. P. J., Avižienis, A., Ulery, B. T., Swain, B. J., Lyu, M. R., Tai, A., & Tso, K. S. Multi-Version Software Development. In Proceedings IFAC Workshop SAFECOMP'86, 43–49. Sarlat, France, 1986
[144] Kiciman, E. & Fox, A. Detecting application-level failures in component-based Internet services. IEEE Transactions on Neural Networks, volume 16(5): 1027–1041, 2005
[145] Kim, W.-G., Choi, J.-Y., & Youn, D. H. HMM with global path constraint in Viterbi decoding for isolated word recognition. In IEEE Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP-94), volume 1, 605–608. 1994
[146] Kohavi, R. & Provost, F. Glossary of terms. Machine Learning, volume 30(2/3): 271–274, 1998
[147] Korbicz, J., Kościelny, J. M., Kowalczuk, Z., & Cholewa, W. (eds.).
Fault Diagnosis: Models, Artificial Intelligence, Applications. Springer Verlag, 2004
[148] Krus, D. J. & Fuller, E. A. Computer Assisted Multicrossvalidation in Regression Analysis. Educational and Psychological Measurement, volume 42(1): 187–193, 1982
[149] Kulkarni, V. G. Modeling and Analysis of Stochastic Systems. Chapman and Hall, London, UK, 1st edition, 1995
[150] Kumar, D. & Westberg, U. Maintenance scheduling under age replacement policy using proportional hazards model and TTT-plotting. European Journal of Operational Research, volume 99(3): 507–515, 1997
[151] Kurtz, A. K. A research test of the Rorschach test. Personnel Psychology, volume 1: 41–53, 1948
[152] Lafferty, J., McCallum, A., & Pereira, F. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. 18th International Conf. on Machine Learning, 282–289. Morgan Kaufmann, San Francisco, CA, 2001. URL citeseer.ist.psu.edu/article/lafferty01conditional.html
[153] Lal, R. & Choi, G. Error and Failure Analysis of a UNIX Server. In IEEE Proceedings of Third International High-Assurance Systems Engineering Symposium (HASE), 232–239. IEEE Computer Society, Washington, DC, USA, 1998
[154] Lance, G. N. & Williams, W. T. A general theory of classificatory sorting strategies, 1. Hierarchical Systems. The Computer Journal, volume 9(4): 373–380, 1967
[155] Laprie, J.-C. & Kanoun, K. Software Reliability and System Reliability. In Lyu, M. R. (ed.), Handbook of Software Reliability Engineering, chapter 2, 27–69. McGraw-Hill, 1996
[156] Laranjeira, L., Malek, M., & Jenevein, R. On tolerating faults in naturally redundant algorithms. In IEEE Proceedings of Tenth Symposium on Reliable Distributed Systems (SRDS), 118–127. 1991
[157] Leangsuksun, C., Liu, T., Rao, T., Scott, S., & Libby, R. A Failure Predictive and Policy-Based High Availability Strategy for Linux High Performance Computing Cluster.
In The 5th LCI International Conference on Linux Clusters: The HPC Revolution, 18–20. 2004
[158] Leangsuksun, C., Shen, L., Liu, T., Song, H., & Scott, S. Availability prediction and modeling of high mobility OSCAR cluster. In IEEE Proceedings of International Conference on Cluster Computing, 380–386. 2003
[159] Lee, I. & Iyer, R. K. Software dependability in the Tandem GUARDIAN system. IEEE Transactions on Software Engineering, volume 21(5): 455–467, 1995
[160] Legg, S. Is There an Elegant Universal Theory of Prediction? In Algorithmic Learning Theory, Lecture Notes in Computer Science, volume 4264, 274–287. Springer Verlag, 2006
[161] Levinson, S. E. Continuously variable duration hidden Markov models for automatic speech recognition. Computer Speech and Language, volume 1(1): 29–45, 1986
[162] Levy, D. & Chillarege, R. Early Warning of Failures through Alarm Analysis - A Case Study in Telecom Voice Mail Systems. In ISSRE '03: Proceedings of the 14th International Symposium on Software Reliability Engineering. IEEE Computer Society, Washington, DC, USA, 2003
[163] Li, L., Vaidyanathan, K., & Trivedi, K. S. An Approach for Estimation of Software Aging in a Web Server. In Proceedings of the Intl. Symposium on Empirical Software Engineering, ISESE 2002. Nara, Japan, 2002
[164] Li, Y. & Lan, Z. Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Computing. In IEEE Proceedings of the Sixth International Symposium on Cluster Computing and the Grid (CCGRID'06), 531–538. IEEE Computer Society, Los Alamitos, CA, USA, 2006
[165] Liang, Y., Zhang, Y., Sivasubramaniam, A., Jette, M., & Sahoo, R. BlueGene/L Failure Analysis and Prediction Models. In IEEE Proceedings of the International Conference on Dependable Systems and Networks (DSN 2006), 425–434. 2006
[166] Lin, T.-T. Y. Design and evaluation of an on-line predictive diagnostic system. Ph.D.
thesis, Department of Electrical and Computer Engineering, Carnegie-Mellon University, Pittsburgh, PA, 1988
[167] Lin, T.-T. Y. & Siewiorek, D. P. Error log analysis: statistical modeling and heuristic trend analysis. IEEE Transactions on Reliability, volume 39(4): 419–432, 1990
[168] Liporace, L. A. Maximum Likelihood Estimation for Multivariate Observations of Markov Sources. IEEE Transactions on Information Theory, volume 28(5): 729–734, 1982
[169] Lunze, J. Automatisierungstechnik. Oldenbourg, 1st edition, 2003
[170] Lyu, M. R. (ed.). Handbook of Software Reliability Engineering. McGraw-Hill, 1996
[171] Magedanz, T. & Popescu-Zeletin, R. Intelligent networks: basic technology, standards and evolution. Internat. Thomson Computer Press, London, UK, 1996
[172] Makhoul, J., Kubala, F., Schwartz, R., & Weischedel, R. Performance Measures for Information Extraction. In Proceedings of DARPA Broadcast News Workshop. Herndon, VA, 1999
[173] Malek, M. Responsive Systems: The challenge for the nineties. Microprocessing and Microprogramming, volume 30: 9–16, 1990
[174] Malek, M. Personal communication. 2007
[175] Manning, C. D. & Schütze, H. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts, 1999
[176] Marciniak, A. & Korbicz, J. Pattern Recognition Approach to Fault Diagnostics. In Korbicz, J., Kościelny, J. M., Kowalczuk, Z., & Cholewa, W. (eds.), Fault Diagnosis: Models, Artificial Intelligence, Applications, chapter 14, 557–590. Springer Verlag, 2004
[177] Martin, A., Doddington, G., Kamm, T., Ordowski, M., & Przybocki, M. The DET curve in assessment of detection task performance. In Proceedings of the 5th European Conference on Speech Communication and Technology, volume 4, 1895–1898. 1997
[178] Marzban, C. & Stumpf, G. J. A Neural Network for Damaging Wind Prediction. Weather and Forecasting, volume 13(1): 151–163, 1998
[179] Max Planck Institute for Molecular Genetics.
General Hidden Markov Model library. 2007. URL http://www.ghmm.org, date: 06-12-07
[180] Melliar-Smith, P. M. & Randell, B. Software reliability: The role of programmed exception handling. SIGPLAN Not., volume 12(3): 95–100, 1977
[181] Minka, T. Expectation-Maximization as lower bound maximization. Tutorial published on the web at http://research.microsoft.com/users/minka/papers/minka-em-tut.ps.gz, 1998
[182] Mitchell, C., Harper, M., & Jamieson, L. On the Complexity of Explicit Duration HMM's. IEEE Transactions on Speech and Audio Processing, volume 3(3): 213–217, 1995
[183] Mitchell, C. & Jamieson, L. Modeling duration in a hidden Markov model with the exponential family. In IEEE Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP-93), volume 2, 331–334. 1993
[184] Mitchell, T. M. Machine Learning. McGraw-Hill, international edition, 1997
[185] Mojena, R. Hierarchical grouping methods and stopping rules: An evaluation. The Computer Journal, volume 20(4): 359–363, 1977
[186] Moll, K. D. & Luebbert, G. M. Arms Race and Military Expenditure Models: A Review. The Journal of Conflict Resolution, volume 24(1): 153–185, 1980
[187] Moore, D. S. & McCabe, G. P. Introduction to the Practice of Statistics. W. H. Freeman & Co., New York, NY, USA, 5th edition, 2006
[188] Mundie, C., de Vries, P., Haynes, P., & Corwine, M. Trustworthy Computing. Technical report, Microsoft Corp., 2002. URL http://www.microsoft.com/mscorp/twc/twc_whitepaper.mspx
[189] Musa, J. D., Iannino, A., & Okumoto, K. Software Reliability: Measurement, Prediction, Application. McGraw-Hill, 1987
[190] Nassar, F. A. & Andrews, D. M. A Methodology for Analysis of Failure Prediction Data. In IEEE Real-Time Systems Symposium, 160–166. 1985
[191] Needleman, S. B. & Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins.
Journal of Molecular Biology, volume 48(3): 443–453, 1970
[192] von Neumann, J. Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components. In Shannon, C. & McCarthy, J. (eds.), Automata Studies, 43–98. Princeton University Press, Princeton, 1956
[193] Neville, S. W. Approaches for Early Fault Detection in Large Scale Engineering Plants. Ph.D. thesis, University of Victoria, 1998
[194] Ning, M. H., Yong, Q., Di, H., Ying, C., & Zhong, Z. J. Software Aging Prediction Model Based on Fuzzy Wavelet Network with Adaptive Genetic Algorithm. In 18th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'06), 659–666. IEEE Computer Society, Los Alamitos, CA, USA, 2006
[195] Noll, A. & Ney, H. Training of phoneme models in a sentence recognition system. In IEEE Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP '87), volume 12, 1277–1280. 1987
[196] Ogle, D., Kreger, H., Salahshour, A., Cornpropst, J., Labadie, E., Chessell, M., Horn, B., & Gerken, J. Canonical Situation Data Format: The Common Base Event. IBM Specification ACAB.BO0301.1.1, 2003. URL http://xml.coverpages.org/IBMCommonBaseEventV111.pdf
[197] Oliner, A. & Sahoo, R. Evaluating cooperative checkpointing for supercomputing systems. In IEEE Proceedings of 20th International Parallel and Distributed Processing Symposium (IPDPS 2006). 2006
[198] Parnas, D. L. Software aging. In IEEE Proceedings of the 16th International Conference on Software Engineering (ICSE '94), 279–287. IEEE Computer Society Press, Los Alamitos, CA, USA, 1994
[199] Pawlak, Z., Wong, S. K. M., & Ziarko, W. Rough sets: Probabilistic versus deterministic approach. International Journal of Man-Machine Studies, volume 29: 81–95, 1988
[200] Pena, J. M., Létourneau, S., & Famili, F. Application of Rough Sets Algorithms to Prediction of Aircraft Component Failure.
In Advances in Intelligent Data Analysis: Third International Symposium (IDA-99), LNCS, volume 1642. Springer Verlag, Amsterdam, The Netherlands, 1999
[201] Pepe, M. S., Janes, H., Longton, G., Leisenring, W., & Newcomb, P. Limitations of the Odds Ratio in Gauging the Performance of a Diagnostic, Prognostic, or Screening Marker. American Journal of Epidemiology, volume 159(9): 882–890, 2004
[202] Petsche, T., Marcantonio, A., Darken, C., Hanson, S. J., Kuhn, G. M., & Santoso, I. A Neural Network Autoassociator for Induction Motor Failure Prediction. In Touretzky, D. S., Mozer, M. C., & Hasselmo, M. E. (eds.), Advances in Neural Information Processing Systems, volume 8, 924–930. The MIT Press, 1996. URL citeseer.ist.psu.edu/petsche96neural.html
[203] Pfefferman, J. & Cernuschi-Frias, B. A nonparametric nonstationary procedure for failure prediction. IEEE Transactions on Reliability, volume 51(4): 434–442, 2002
[204] Pielke, R. Mesoscale Meteorological Modeling, International Geophysics, volume 78. Elsevier, 2nd edition, 2001
[205] Pizza, M., Strigini, L., Bondavalli, A., & Di Giandomenico, F. Optimal Discrimination between Transient and Permanent Faults. In IEEE Proceedings of Third International High-Assurance Systems Engineering Symposium (HASE'98), 214–223. IEEE Computer Society, Los Alamitos, CA, USA, 1998
[206] Pylkkönen, J. Phone Duration Modeling Techniques in Continuous Speech Recognition. Master's thesis, Helsinki University of Technology, Department of Computer Science and Engineering, Laboratory of Computer and Information Science, 2004
[207] Quenouille, M. H. Notes on Bias in Estimation. Biometrika, volume 43(3/4): 353–360, 1956
[208] Quinlan, J. Learning logical definitions from relations. Machine Learning, volume 5(3): 239–266, 1990
[209] Quinlan, J. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993
[210] Rabiner, L. R. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition.
Proceedings of the IEEE, volume 77(2): 257–286, 1989
[211] Ramesh, P. & Wilpon, J. G. Modeling state durations in hidden Markov models for automatic speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-92), volume 1, 381–384. 1992
[212] Randell, B. System structure for software fault tolerance. IEEE Transactions on Software Engineering, volume 1(2): 220–232, 1975
[213] Randell, B., Lee, P., & Treleaven, P. C. Reliability Issues in Computing System Design. ACM Computing Surveys, volume 10(2): 123–165, 1978
[214] van Rijsbergen, C. J. Information Retrieval. Butterworth, London, 2nd edition, 1979
[215] Rousseeuw, P. J. A visual display for hierarchical classification. In Diday, E., Escoufier, Y., Lebart, L., Pagès, J., Schektman, Y., & Tomassone, R. (eds.), Data Analysis and Informatics IV, 743–748. North-Holland, Amsterdam, 1986
[216] Rovnyak, S., Kretsinger, S., Thorp, J., & Brown, D. Decision trees for real-time transient stability prediction. IEEE Transactions on Power Systems, volume 9(3): 1417–1426, 1994
[217] Russell, M. A segmental HMM for speech pattern modelling. In IEEE Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP-93), volume 2, 499–502. 1993
[218] Russell, M. & Cook, A. Experimental evaluation of duration modelling techniques for automatic speech recognition. In IEEE Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP '87), volume 12, 2376–2379. 1987
[219] Russell, M. J. & Moore, R. K. Explicit Modelling of State Occupancy in Hidden Markov Models for Automatic Speech Recognition. In IEEE Proceedings of Int. Conf. on Acoustics, Speech and Signal Processing, 5–8. 1985
[220] Sahoo, R. K., Oliner, A. J., Rish, I., Gupta, M., Moreira, J. E., Ma, S., Vilalta, R., & Sivasubramaniam, A. Critical Event Prediction for Proactive Management in Large-scale Computer Clusters.
In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), 426–435. ACM Press, 2003
[221] Saks, S. Theory of the Integral. G. E. Stechert & Co., New York, USA, 1937
[222] Salakhutdinov, R., Roweis, S., & Ghahramani, Z. Expectation-Conjugate Gradient: An Alternative to EM. IEEE Signal Processing Letters, volume 11(7), 2004
[223] Salfner, F. Predicting Failures with Hidden Markov Models. In Proceedings of 5th European Dependable Computing Conference (EDCC-5), Student Forum Volume, 41–46. Budapest, Hungary, 2005
[224] Salfner, F., Hoffmann, G. A., & Malek, M. Prediction-Based Software Availability Enhancement. In Babaoglu, O., Jelasity, M., Montresor, A., Fetzer, C., Leonardi, S., van Moorsel, A., & van Steen, M. (eds.), Self-Star Properties in Complex Information Systems, Lecture Notes in Computer Science, volume 3460. Springer-Verlag, 2005
[225] Salfner, F. & Malek, M. Proactive Fault Handling for System Availability Enhancement. In IEEE Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS'05), DPDNS Workshop. Denver, CO, 2005
[226] Salfner, F., Schieschke, M., & Malek, M. Predicting Failures of Computer Systems: A Case Study for a Telecommunication System. In Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS 2006), DPDNS Workshop. Rhodes Island, Greece, 2006
[227] Salfner, F., Tschirpke, S., & Malek, M. Comprehensive Logfiles for Autonomic Systems. In IEEE Proceedings of International Parallel and Distributed Processing Symposium (IPDPS), Workshop on Fault-Tolerant Parallel, Distributed and Network-Centric Systems (FTPDS). IEEE Computer Society, Santa Fe, New Mexico, USA, 2004
[228] Salvador, S. & Chan, P. Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms.
In IEEE Proceedings of 16th International Conference on Tools with Artificial Intelligence (ICTAI 2004), 576–584. 2004
[229] Salvo Rossi, P., Romano, G., Palmieri, F., & Iannello, G. A hidden Markov model for Internet channels. In IEEE Proceedings of the 3rd International Symposium on Signal Processing and Information Technology (ISSPIT 2003), 50–53. 2003
[230] Schlittgen, R. Einführung in die Statistik: Analyse und Modellierung von Daten. Oldenbourg-Wissenschaftsverlag, München, Wien, 9th edition, 2000
[231] Schölkopf, B., Smola, A. J., Williamson, R. C., & Bartlett, P. L. New Support Vector Algorithms. Neural Computation, volume 12(5): 1207–1245, 2000
[232] Scott, D. Making Smart Investments to Reduce Unplanned Downtime. Technical Report Tactical Guidelines, TG-07-4033, GartnerGroup RAS Services, 1999
[233] Sen, P. K. Estimates of the Regression Coefficient Based on Kendall's Tau. Journal of the American Statistical Association, volume 63(324): 1379–1389, 1968
[234] Sfetsos, A. Short-term load forecasting with a hybrid clustering algorithm. IEE Proceedings of Generation, Transmission and Distribution, volume 150(3): 257–262, 2003
[235] Shannon, C. A Mathematical Theory of Communication. The Bell System Technical Journal, volume 27: 379–423, 623–656, 1948
[236] Shao, J. Linear Model Selection by Cross-Validation. Journal of the American Statistical Association, volume 88(422): 486–494, 1993
[237] Shawe-Taylor, J. & Cristianini, N. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004
[238] Shereshevsky, M., Crowell, J., Cukic, B., Gandikota, V., & Liu, Y. Software aging and multifractality of memory resources. In Proceedings of the International Conference on Dependable Systems and Networks (DSN 2003), 721–730. IEEE Computer Society, San Francisco, CA, USA, 2003
[239] Shewchuk, J. An introduction to the conjugate gradient method without the agonizing pain.
Technical report, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, 1994
[240] Shi, X. & Manduchi, R. Invariant operators, small samples, and the bias-variance dilemma. In IEEE Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR 2004), volume 2. 2004
[241] Siewiorek, D. P. & Swarz, R. S. Reliable Computer Systems. Digital Press, Bedford, MA, 2nd edition, 1992
[242] Silva, J. G. & Madeira, H. Experimental Dependability Evaluation. In Diab, H. B. & Zomaya, A. Y. (eds.), Dependable Computing Systems, chapter 12, 327–355. John Wiley & Sons, 2005
[243] Singer, R. M., Gross, K. C., Herzog, J. P., King, R. W., & Wegerich, S. Model-Based Nuclear Power Plant Monitoring and Fault Detection: Theoretical Foundations. In Proceedings of Intelligent System Application to Power Systems (ISAP 97), 60–65. Seoul, Korea, 1997
[244] Smith, T. & Waterman, M. Identification of Common Molecular Subsequences. Journal of Molecular Biology, volume 147: 195–197, 1981
[245] Smyth, P. Clustering Using Monte Carlo Cross-Validation. In ACM Proceedings of Knowledge Discovery and Data Mining (KDD 1996), 126–133. 1996
[246] Smyth, P. Clustering Sequences with Hidden Markov Models. In Mozer, M. C., Jordan, M. I., & Petsche, T. (eds.), Advances in Neural Information Processing Systems, volume 9, 648. The MIT Press, 1997
[247] Solomonoff, R. J. A Formal Theory of Inductive Inference, Part 1. Information and Control, volume 7(1): 1–22, 1964
[248] Solomonoff, R. J. A Formal Theory of Inductive Inference, Part 2. Information and Control, volume 7(2): 224–254, 1964
[249] Srikant, R. & Agrawal, R. Mining Sequential Patterns: Generalizations and Performance Improvements. In Apers, P. M. G., Bouzeghoub, M., & Gardarin, G. (eds.), Proc. 5th Int. Conf. Extending Database Technology, EDBT, volume 1057, 3–17. Springer-Verlag, 1996. URL citeseer.nj.nec.com/article/srikant96mining.html
[250] Starr, A.
A structured approach to the selection of condition based maintenance. In IEE Proceedings of Fifth International Conference on Factory 2000 - The Technology Exploitation Process. 1997
[251] Stone, M. Cross-Validatory Choice and Assessment of Statistical Predictions. Journal of the Royal Statistical Society, volume 36(2): 111–147, 1974
[252] Sullivan, M. & Chillarege, R. Software defects and their impact on system availability - a study of field failures in operating systems. In 21st Int. Symp. on Fault-Tolerant Computing (FTCS-21), 2–9. 1991. URL citeseer.ist.psu.edu/sullivan91software.html
[253] Sun, R. Introduction to Sequence Learning. In Sun, R. & Giles, C. L. (eds.), Sequence Learning: Paradigms, Algorithms, and Applications, Lecture Notes in Computer Science, volume 1828, 1–11. Springer, Berlin / Heidelberg, 2001
[254] Tauber, O. Einfluss vorhersagegesteuerter Restarts auf die Verfügbarkeit. Master's thesis, Humboldt-Universität zu Berlin, Berlin, Germany, 2006
[255] Thoraval, L. Hidden Semi-Markov Event Sequence Models. Technical report, Université Louis Pasteur, Strasbourg, France, 2002
[256] Todorovski, L., Flach, P., & Lavrac, N. Predictive performance of weighted relative accuracy. In Zighed, D. A., Komorowski, J., & Żytkow, J. (eds.), Proceedings of the Fourth European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2000), Lecture Notes in Artificial Intelligence, volume 1910, 255–264. Springer, 2000
[257] Troudet, T. & Merrill, W. A real time neural net estimator of fatigue life. In IEEE Proceedings of International Joint Conference on Neural Networks (IJCNN 90), 59–64. 1990
[258] Tsao, M. M. & Siewiorek, D. P. Trend Analysis on System Error Files. In Proc. 13th International Symposium on Fault-Tolerant Computing, 116–119. Milano, Italy, 1983
[259] Turnbull, D. & Alldrin, N. Failure Prediction in Hardware Systems. Technical report, University of California, San Diego, 2003.
Available at http://www.cs.ucsd.edu/ ~dturnbul/Papers/ServerPrediction.pdf [260] Ulerich, N. & Powers, G. On-line hazard aversion and fault diagnosis in chemical processes: the digraph+fault-tree method. IEEE Transactions on Reliability, volume 37(2): 171–177, 1988 [261] Vaidyanathan, K., Harper, R. E., Hunter, S. W., & Trivedi, K. S. Analysis and implementation of software rejuvenation in cluster systems. In Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, 62–71. ACM Press, 2001 [262] Vaidyanathan, K. & Trivedi, K. A comprehensive model for software rejuvenation. IEEE Transactions on Dependable and Secure Computing, volume 2: 124–137, 2005 318 Bibliography [263] Vaidyanathan, K. & Trivedi, K. S. A Measurement-Based Model for Estimation of Resource Exhaustion in Operational Software Systems. In Proceedings of the International Symposium on Software Reliability Engineering (ISSRE). 1999 [264] Vapnik, V. N. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995 [265] Vesely, W., Goldberg, F. F., Roberts, N. H., & Haasl, D. F. Fault Tree Handbook. Technical Report NUREG-0492, U.S. Nuclear Regulatory Commission, Washington, DC, 1981 [266] Vilalta, R., Apte, C. V., Hellerstein, J. L., Ma, S., & Weiss, S. M. Predictive algorithms in the management of computer systems. IBM Systems Journal, volume 41(3): 461–474, 2002 [267] Vilalta, R. & Drissi, Y. A perspective view and survey of meta-learning. Artificial Intelligence Review, volume 18(2): 77–95, 2002 [268] Vilalta, R. & Ma, S. Predicting Rare Events In Temporal Domains. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02), 474–482. IEEE Computer Society, Washington, DC, USA, 2002 [269] Wahl, M., Howes, T., & Kille, S. Lightweight Directory Access Protocol (v3). RFC 2251, 1997. http://www.ietf.org/rfc/rfc2251.txt [270] Wang, X. Durationally constrained training of HMM without explicit state durational PDF. 
In Proceedings of the Institute of Phonetic Sciences, University of Amsterdam, volume 18, 111–130. 1994 [271] Ward, A., Glynn, P., & Richardson, K. Internet service performance failure detection. SIGMETRICS Performance Evaluation Review, volume 26(3): 38–43, 1998 [272] Ward, A. & Whitt, W. Predicting response times in processor-sharing queues. In Glynn, P. W., MacDonald, D. J., & Turner, S. J. (eds.), Proc. of the Fields Institute Conf. on Comm. Networks. 2000 [273] Warrender, C., Forrest, S., & Pearlmutter, B. Detecting intrusions using system calls: alternative data models. In IEEE Proceedings of the 1999 Symposium on Security and Privacy, 133–145. 1999 [274] Wei, W., Wang, B., & Towsley, D. Continuous-time hidden Markov models for network performance evaluation. Performance Evaluation, volume 49(1-4): 129–146, 2002 [275] Weiss, G. Timeweaver: A Genetic Algorithm for Identifying Predictive Patterns in Sequences of Events. In Proceedings of the Genetic and Evolutionary Computation Conference, 718–725. Morgan Kaufmann, San Francisco, CA, 1999 [276] Weiss, G. M. Mining with rarity: a unifying framework. SIGKDD Explor. Newsl., volume 6(1): 7–19, 2004 [277] Weiss, G. M. & Hirsh, H. Learning to Predict Rare Events in Event Sequences. In R. Agrawal, P. S. & Piatetsky-Shapiro, G. (eds.), Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 359–363. AAAI Press, Menlo Park, California, 1998 [278] Williams, J., Davies, A., & Drake, P. (eds.). Condition-based Maintenance and Machine Diagnostics. Springer Verlag, 1994 Bibliography 319 [279] Wilson, A. D. & Bobick, A. F. Recognition and interpretation of parametric gesture. In IEEE Proceedings of Sixth International Conference on Computer Vision, 329–336. 1998 [280] Wolpert, D. H. The Mathematics of Generalization. Addison-Wesley, Reading, MA, 1995 [281] Wong, K. C. P., Ryan, H., & Tindle, J. Early Warning Fault Detection Using Artificial Intelligent Methods. 
In Proceedings of the Universities Power Engineering Conference. 1996. URL citeseer.nj.nec.com/217993.html [282] Yang, S. A condition-based failure-prediction and processing-scheme for preventive maintenance. IEEE Transactions on Reliability, volume 52(3): 373–383, 2003 [283] Yu, C. H. Resampling methods: concepts, applications, and justification. Practical Assessment, Research and Evaluation, volume 8(19), 2003 [284] Yu, S.-Z., Liu, Z., Squillante, M. S., Xia, C., & Zhang, L. A hidden semi-Markov model for web workload self-similarity. In IEEE Proceedings of 21st International Performance, Computing, and Communications Conference, 65–72. 2002 [285] Zipf, G. K. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley Press, Cambridge, Mass, 1949