Event-based Failure Prediction
An Extended Hidden Markov Model Approach
DISSERTATION
A dissertation submitted in partial fulfillment of the requirements for the academic degree of
Doktor-Ingenieur (Dr.-Ing.)
in Computer Science
to the
Mathematisch-Naturwissenschaftliche Fakultät II
Humboldt-Universität zu Berlin
by
Dipl.-Ing. Felix Salfner
born on 27 April 1974 in Düsseldorf
President of Humboldt-Universität zu Berlin:
Prof. Dr. Christoph Markschies
Dean of the Mathematisch-Naturwissenschaftliche Fakultät II:
Prof. Dr. Wolfgang Coy
Reviewers:
1. Prof. Dr. M. Malek
2. Prof. Dr. Dr. h.c. G. Hommel
3. Prof. Dr. A. Reinefeld
Date of the oral examination:
6 February 2008
To Gesine, Anton Linus, Henry, and
Fabienne.
Acknowledgments
First of all, I would like to thank my doctoral advisor Miroslaw Malek for his ongoing
support and advice — I have benefited greatly from his broad experience. I am also
very grateful to Katinka Wolter, who introduced me to the fascinating beauty of stochastic
processes and who has repeatedly helped me to review, rethink, and revise my ideas.
Part of this work was carried out as a member of the Graduate School “Stochastische
Modellierung und quantitative Analyse großer Systeme in den Ingenieurwissenschaften”
(MAGSI), which has provided an inspiring scientific environment. I would like to thank
the members of MAGSI for discussions and for giving feedback on my work from the
most diverse viewpoints. In particular, I would like to acknowledge the effort of Günter
Hommel and Armin Zimmermann (Technical University Berlin) in organizing and providing
a forum for stimulating scientific exchange, and of Tobias Harks (Technical University
Berlin), who kept a watchful eye on the mathematical aspects of this work. This work was
also greatly improved by fruitful discussions with my colleagues, especially with Günther
Hoffmann, Maren Lenk, and Peter Ibach, and by the great support from Jan Richling
and Steffen Tschirpke, whom I hereby thank. I am also grateful for discussions, help,
and comments from Alexander Schliep (Max Planck Institute for Molecular Genetics,
Berlin), Tobias Scheffer and Ulf Brefeld (Max Planck Institute for Computer Science),
and Aad van Moorsel (School of Computer Science, Newcastle University), who have given
many impulses to my work, and I would like to express my thanks to my old friend Patrick
Stiegeler, who was an open-minded reviewer of my thesis.
Beyond working life, I am very grateful to my parents for taking good care of me,
especially during the writing of the first half of this thesis, and for improving many of the
figures found in this dissertation. Finally, I want to extend my most heartfelt thanks to my
wonderful wife Fabienne and our children, without whose support and consideration this
work would not have come into existence.
This work was also supported by the Deutsche Forschungsgemeinschaft (German Research Foundation) project “Failure Prediction in Critical Infrastructures” and by Intel Corporation.
Abstract
Human lives and organizations are increasingly dependent on the correct functioning of
computer systems and their failure might cause personal as well as economic damage.
There are two non-exclusive approaches to minimize the risk of such hazards: (a) fault intolerance tries to eliminate design and manufacturing faults in hardware and software
before a system is put into service; (b) fault tolerance techniques deal with faults that occur during service, trying to prevent faults from turning into failures. Since faults, in most cases,
cannot be ruled out, we focus on the second approach. Traditionally, fault tolerance has
followed a reactive scheme of fault detection, location, and subsequent recovery by redundancy in either space or time. However, in recent years the focus has shifted from these
reactive methods towards more proactive schemes that try to evaluate the current situation
of a running system in order to start acting even before a failure occurs. Once a failure is predicted, it may either be prevented, or the outage may be shifted from unplanned
to planned downtime; both can significantly improve the system’s reliability. The
first step in this approach, online failure prediction, is the main focus of this thesis. The
objective of online failure prediction is to predict the occurrence of failures in the near
future based on the current state of the system as observed by runtime monitoring.
A new failure prediction method that builds on the evaluation of error events is introduced in this dissertation. More specifically, it treats the occurrence of errors as an
event-driven temporal sequence and applies a pattern recognition technique in order to
predict upcoming failures. Hidden Markov models have successfully solved many pattern recognition tasks. However, standard hidden Markov models are not well-suited to
processing sequences in continuous time, and existing augmentations do not account adequately for the event-driven character of error sequences. Hence, an extension of hidden
Markov models has been developed that employs a semi-Markov process for state traversals, providing the flexibility to model a great variety of temporal characteristics of the
underlying stochastic process.
The proposed hidden semi-Markov model has been applied to industrial data of a
commercial telecommunication platform. The case study showed significantly improved
failure prediction capabilities in comparison to well-known existing approaches. The case
study also demonstrated that hidden semi-Markov models perform significantly better
than standard hidden Markov models.
In order to assess the impact of failure prediction and subsequent actions, a reliability model has been developed that enables the computation of steady-state system availability,
reliability, and hazard rate. Based on the model, it is shown that such approaches can
significantly improve system dependability.
Keywords:
Event-based failure prediction, Hidden semi-Markov model, Proactive fault
management, Autonomic Computing
Zusammenfassung
There is hardly any area of our society that does not depend on the correct and
failure-free functioning of sometimes highly complex computer systems.
Not only the survival of entire companies may depend on it, but also human lives.
There are two fundamental approaches to dealing with this risk:
(a) one tries to eliminate fault causes during the design and manufacturing phases,
i.e., before the system goes into operation (fault intolerance), and/or (b)
one tries to build a system that can cope with faults — which may occur
despite sophisticated fault intolerance techniques in the production phase —
in order to prevent a failure of the system (fault tolerance). The present work
concentrates on the latter approach.
Traditionally, fault tolerance techniques have merely reacted to faults and tried
to prevent failures of the overall system by means of spatial or temporal redundancy.
In recent years, however, the focus of research has shifted from these rather
static techniques to more dynamic approaches that try to intervene even before
a failure occurs. To this end, the state of the running system is monitored
and analyzed in order to predict a possible failure. In case of an imminent
failure, an attempt is then made either to prevent the failure or to prepare for
it in order to reduce the repair time. Both can considerably improve the reliability
of the system.
The present work is primarily concerned with the prediction of failures and
pursues an approach based on the recognition of patterns in sequences of
error events. The developed prediction technique is the first that successfully integrates
both the type of error events and their time of occurrence, and
that applies a pattern recognition technique to decide whether a sequence of errors
observed in the system is symptomatic of an imminent failure or not.
The pattern recognition technique is based on hidden Markov models extended
to hidden semi-Markov models, which better account for the event-driven character
of errors.
The failure prediction technique was applied to data of a commercial telecommunication platform and evaluated. A significantly better prediction quality is achieved
both in comparison to the best-known existing techniques and in comparison
to conventional discrete-time hidden Markov models.
A failure prediction is merely the first important step towards proactively dealing
with faults: following the prediction, actions must be carried out
to avert an imminent failure or to minimize its consequences.
This work presents a reliability model with which the steady-state availability,
reliability, and hazard rate of systems with failure prediction and subsequent
actions can be computed. With the help of this model it can be shown
that the combination of failure prediction and subsequent actions
can considerably improve system reliability.
Keywords:
Event-based failure prediction, Hidden semi-Markov model, Proactive
fault management, Autonomic Computing
Contents

List of Figures
List of Tables
Mathematical Notation
Preface

I  Introduction, Problem Statement, and Related Work

1 Introduction, Motivation and Main Contributions
   1.1 From Fault Tolerance to Proactive Fault Management
   1.2 Origins and Background
   1.3 Outline of the Thesis
   1.4 Main Contributions

2 Problem Statement, Key Properties, and Approach to Solution
   2.1 A Definition of Online Failure Prediction
       2.1.1 Failures
       2.1.2 Online Prediction
   2.2 The Objective of the Case Study
   2.3 Key Properties
   2.4 Approach
   2.5 Analysis of the Approach
       2.5.1 Identifiable Types of Failures
       2.5.2 Identifiable Types of Faults
       2.5.3 Relation to Other Research Areas and Issues
   2.6 Summary

3 A Survey of Online Failure Prediction Methods
   3.1 A Taxonomy and Survey of Online Failure Prediction Methods
   3.2 Methods Used for Comparison
       3.2.1 Dispersion Frame Technique
       3.2.2 Eventset Method
       3.2.3 SVD-SVM Method
       3.2.4 Periodic Prediction
   3.3 Summary

4 Introduction to Hidden Markov Models and Related Work
   4.1 An Introduction to Hidden Markov Models
       4.1.1 The Forward-Backward Algorithm
       4.1.2 Training: The Baum-Welch Algorithm
   4.2 Sequences in Continuous Time
       4.2.1 Four Approaches to Incorporate Continuous Time
   4.3 Related Work on Time-Varying Hidden Markov Models
   4.4 Summary

II  Modeling

5 Data Preprocessing
   5.1 From Logfiles to Sequences
       5.1.1 From Messages to Error-IDs
       5.1.2 Tupling
       5.1.3 Extracting Sequences
   5.2 Clustering of Failure Sequences
       5.2.1 Obtaining the Dissimilarity Matrix
       5.2.2 Grouping Failure Sequences
       5.2.3 Determining the Number of Groups
       5.2.4 Additional Notes on Clustering
   5.3 Filtering the Noise
   5.4 Improving Logfiles
       5.4.1 Event Type and Event Source
       5.4.2 Hierarchical Numbering
       5.4.3 Logfile Entropy
       5.4.4 Existing Solutions
   5.5 Summary

6 The Model
   6.1 The Hidden Semi-Markov Model
       6.1.1 Wrap-up of Semi-Markov Processes
       6.1.2 Combining Semi-Markov Processes with HMMs
   6.2 Sequence Processing
       6.2.1 Recognition of Temporal Sequences: The Forward Algorithm
       6.2.2 Sequence Prediction
   6.3 Training Hidden Semi-Markov Models
       6.3.1 Beta, Gamma and Xi
       6.3.2 Reestimation Formulas
       6.3.3 A Summary of the Training Algorithm
   6.4 Difference Between the Approach and other HSMMs
   6.5 Proving Convergence of the Training Algorithm
       6.5.1 A Proof of Convergence Framework
       6.5.2 The Proof for HSMMs
   6.6 HSMMs for Failure Prediction
   6.7 Computational Complexity
   6.8 Summary

7 Classification
   7.1 Bayes Decision Theory
       7.1.1 Simple Classification
       7.1.2 Classification with Costs
       7.1.3 Rejection Thresholds
   7.2 Classifiers for Failure Prediction
       7.2.1 Threshold on Sequence Likelihood
       7.2.2 Threshold on Likelihood Ratio
       7.2.3 Using Log-likelihood
       7.2.4 Multi-class Classification Using Log-Likelihood
   7.3 Bias and Variance
       7.3.1 Bias and Variance for Regression
       7.3.2 Bias and Variance for Classification
       7.3.3 Conclusions for Failure Prediction
   7.4 Summary

III  Applications of the Model

8 Evaluation Metrics
   8.1 Evaluation of Clustering
       8.1.1 Dendrograms
       8.1.2 Banner Plots
       8.1.3 Agglomerative and Divisive Coefficient
   8.2 Metrics for Prediction Quality
       8.2.1 Contingency Table
       8.2.2 Metrics Obtained from Contingency Tables
       8.2.3 Plots of Contingency Table Measures
       8.2.4 Cost Impact of Failure Prediction
       8.2.5 Other Metrics
   8.3 Evaluation Process
       8.3.1 Setting of Parameters
       8.3.2 Three Types of Data Sets
       8.3.3 Cross-validation
   8.4 Statistical Confidence
       8.4.1 Theoretical Assessment of Accuracy
       8.4.2 Confidence Intervals by Assuming Normal Distributions
       8.4.3 Jackknife
       8.4.4 Bootstrapping
       8.4.5 Bootstrapping with Cross-validation
       8.4.6 Confidence Intervals for Plots
   8.5 Summary

9 Experiments and Results Based on Industrial Data
   9.1 Description of the Case Study
   9.2 Data Preprocessing
       9.2.1 Making Logfiles Machine-Processable
       9.2.2 Error-ID Assignment
       9.2.3 Tupling
       9.2.4 Extracting Sequences
       9.2.5 Grouping (Clustering) of Failure Sequences
       9.2.6 Noise Filtering
   9.3 Properties of the Preprocessed Dataset
       9.3.1 Error Frequency
       9.3.2 Distribution of Delays
       9.3.3 Distribution of Failures
       9.3.4 Distribution of Sequence Lengths
   9.4 Training HSMMs
       9.4.1 Parameter Space
       9.4.2 Results for Parameter Investigation
   9.5 Detailed Analysis of Failure Prediction Quality
       9.5.1 Precision, Recall, and F-measure
       9.5.2 ROC and AUC
       9.5.3 Accumulated Runtime Cost
   9.6 Dependence on Application Specific Parameters
       9.6.1 Lead-Time
       9.6.2 Data Window Size
   9.7 Dependence on Data Specific Issues
       9.7.1 Size of the Training Data Set
       9.7.2 System Configuration and Model Aging
   9.8 Failure Sequence Grouping and Filtering
       9.8.1 Failure Grouping
       9.8.2 Sequence Filtering
   9.9 Comparative Analysis
       9.9.1 Dispersion Frame Technique (DFT)
       9.9.2 Eventset
       9.9.3 SVD-SVM
       9.9.4 Periodic Prediction Based on MTBF
       9.9.5 Comparison with Standard HMMs
       9.9.6 Comparison with Random Predictor
       9.9.7 Comparison with UBF
       9.9.8 Discussion and Summary of Comparative Approaches
   9.10 Summary

IV  Improving Dependability, Conclusions, and Outlook

10 Assessing the Effect on Dependability
   10.1 Proactive Fault Management
       10.1.1 Downtime Avoidance
       10.1.2 Downtime Minimization
   10.2 Related Models
   10.3 The Availability Model
       10.3.1 The Original Model for Software Rejuvenation by Huang et al.
       10.3.2 Availability Model for Proactive Fault Management
   10.4 Computing the Rates of the Model
       10.4.1 The Parameters in Detail
       10.4.2 Computing the Rates from Parameters
   10.5 Computing Availability
   10.6 Computing Reliability
       10.6.1 The Reliability Model
       10.6.2 Reliability and Hazard Rate
   10.7 How to Estimate the Parameters from Experiments
       10.7.1 Failure Prediction Accuracy
       10.7.2 Failure Probabilities P_TP, P_FP, and P_TN
       10.7.3 Repair Time Improvement k
       10.7.4 Summary of the Estimation Procedure
   10.8 A Case Study and an Example
       10.8.1 Experiment Description
       10.8.2 Results
       10.8.3 An Advanced Example
   10.9 Summary

11 Summary and Conclusions
   11.1 Phase I: Problem Statement, Key Properties and Related Work
   11.2 Phase II: Data Preprocessing, the Model, and Classification
       11.2.1 Data Preprocessing
       11.2.2 The Hidden Semi-Markov Model
       11.2.3 Sequence Classification
   11.3 Phase III: Evaluation Methods and Results for Industrial Data
       11.3.1 Evaluation Methods
       11.3.2 Results for the Telecommunication System Case Study
   11.4 Phase IV: Dependability Improvement
       11.4.1 Proactive Fault Management
       11.4.2 Models
       11.4.3 Parameter Estimation
       11.4.4 Case Study and an Advanced Example
   11.5 Main Contributions
   11.6 Conclusions

12 Outlook
   12.1 Further Development of Prediction Models
       12.1.1 Improving the Hidden Semi-Markov Model
       12.1.2 Bias and Variance
       12.1.3 Online Learning
       12.1.4 Further Issues
       12.1.5 Further Application Domains for HSMMs
   12.2 Proactive Fault Management

V  Appendix

Derivatives with respect to Parameters for Selected Distributions
Erklärung
Acronyms
Index
Bibliography
List of Figures

1.1  Predict-react cycle
1.2  The engineering cycle

2.1  Definitions and interrelations of faults, errors and failures
2.2  Four stages where faults can become visible
2.3  Distinction between root cause analysis and failure prediction
2.4  Time relations in online failure prediction
2.5  Failure definition for the case study
2.6  Data acquisition setup
2.7  Two phase machine learning approach
2.8  Dependencies among components lead to a temporal sequence of errors
2.9  Overview of the training procedure
2.10 Overview of the online failure prediction approach
2.11 Permanent, intermittent and transient faults (Siewiorek & Swarz [241])
2.12 Fault model based on Barborak et al. [23]

3.1  A taxonomy for online failure prediction approaches
3.2  Failure prediction by function approximation
3.3  Failure prediction using signal processing techniques
3.4  Failure prediction based on the occurrence of errors
3.5  Failure prediction by recognition of failure-prone error patterns
3.6  Dispersion Frame Technique
3.7  The eventset method
3.8  Bag-of-words representation of error sequences
3.9  Singular value decomposition
3.10 Maximum margin classification using support vector machines

4.1  Discrete Time Markov Chain
4.2  A discrete-time hidden Markov model
4.3  A trellis to visualize the forward algorithm
4.4  A trellis visualizing the computation of ξ_t(i, j)
4.5  Notations for event-driven temporal sequences
4.6  Incorporating continuous time by time slotting
4.7  Duration modeling by a discrete-time HMM with self-transitions
4.8  Representing time by delay symbols
4.9  Delay representation by two-dimensional output probability distributions
4.10 Duration modeling by explicit modeling of state durations
4.11 Topology of an Expanded State HMM

5.1  From faults to error messages
5.2  Truncation and collision in tupling
5.3  Plotting the number of tuples over time window size ε
5.4  Extracting sequences
5.5  For each failure sequence F^i, a separate HSMM M^i is trained
5.6  Matrix of logarithmic sequence likelihoods
5.7  Inter-cluster distance rules
5.8  Noise filtering
5.9  Three different sequence sets to compute symbol prior probabilities
5.10 Hierarchical error numbering with SHIP
5.11 An inherent problem of hard classification approaches
5.12 Sets of required information and given information of a log record
5.13 A plot of log entropy
5.14 Principle structure of a Common Base Event

6.1  A semi-Markov process
6.2  A sample hidden semi-Markov model
6.3  Notation for temporal sequences
6.4  Summary of the complete training algorithm for HSMMs
6.5  A simplified sketch of phoneme assignment to a speech signal
6.6  Assigning states to observations in speech processing
6.7  Trellis structure for the forward algorithm with duration modeling
6.8  Lower bound optimization
6.9  Gradient vector projection
6.10 Failure prediction model structure used for training
6.11 Model with intermediate states

7.1  Classification by maximum posterior for a two-class example
7.2  Error in regression problems
7.3  True and estimated posterior probabilities
7.4  Distribution of estimated posterior
7.5  Boundary error plots
7.6  Early stopping

8.1  Dendrograms
8.2  Banner plots
8.3  Sample precision/recall-plot for two failure predictors
8.4  Sample ROC plot
8.5  Relation between ROC plots and precision and recall
8.6  Detection error trade-off plot
8.7  Iso-cost lines in ROC space
8.8  Determining minimum cost from ROC
8.9  Cost curves
8.10 Exemplary accumulated runtime cost
8.11 AUC can be misleading
8.12 Cross-validation and bootstrapping
8.13 Averaging ROC curves

9.1  Experiment setup
9.2  Typical error log record
9.3  Levenshtein similarity plot
9.4  Effect of tupling window size for cluster-wide logfile
9.5  Effect of tupling window size
9.6  HSMM topology for failure sequence grouping
9.7  Effect of clustering methods
9.8  Effect of number of states
9.9  Effect of background distribution weight
9.10 Values of Xi for noise filtering: cluster prior
9.11 Values of Xi for noise filtering: cluster failure sequences
9.12 Values of Xi for noise filtering: all sequences
9.13 Mean sequence length depending on filtering threshold
9.14 Number of errors per five minutes
9.15 Histogram and QQ-plots of delays between errors
9.16 Analysis of time between failures
9.17 Normalized autocorrelation of failure occurrence
9.18 Histogram and ECDF for the length of sequences
9.19 Average negative training sequence log-likelihood
9.20 Mean training time for number of states and maximum span of shortcuts
9.21 Computation times for testing
9.22 Upper bounds for mean testing times
9.23 Precision/recall and F-measure plot for industrial data
9.24 ROC plot for industrial data
9.25 Accumulated runtime cost for industrial data
9.26 Failure prediction performance for various lead-times
9.27 Effects of data window size ∆t_d
9.28 Data sets for experiments investigating size of the data set
9.29 F-measure and training time as function of size of training data set
9.30 Data sets for experiments investigating system configuration
9.31 Prediction quality as function of train-test gap
9.32 Precision/recall plot and ROC plot for single failure group model
9.33 Histograms of time-between-errors for DFT
9.34 Precision/recall and ROC plot for the SVD-SVM prediction algorithm
9.35
Summary of prediction results for comparative approaches . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
177
179
180
181
182
184
186
187
189
189
190
191
192
194
195
196
197
201
203
204
204
206
207
208
209
210
210
211
212
213
214
215
217
220
10.1
10.2
10.3
10.4
10.5
10.6
10.7
10.8
10.9
10.10
10.11
10.12
10.13
Principle approach of proactive fault management . . . . . . . . .
Improved TTR for prediction-driven repair schemes . . . . . . . .
The original rejuvenation model . . . . . . . . . . . . . . . . . .
Availability model for proactive fault management . . . . . . . . .
Four cases of prediction including lead-time and prediction-period.
Time relations for prediction . . . . . . . . . . . . . . . . . . . .
CTMC model for reliability . . . . . . . . . . . . . . . . . . . . .
Four situations in failure prediction experiments . . . . . . . . . .
Cases with fault injection . . . . . . . . . . . . . . . . . . . . . .
Summary of the procedure to estimate model parameters . . . . .
Overview of the case study . . . . . . . . . . . . . . . . . . . . .
Reliability for the case study . . . . . . . . . . . . . . . . . . . .
Hazard rate for the case study . . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
228
230
234
235
238
242
245
247
250
253
254
256
257
xix
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
10.14 Reliability for the more sophisticated example . . . . . . . . . . . . . . .
10.15 Hazard rate for the more sophisticated example. . . . . . . . . . . . . . .
259
259
11.1
Trade-off between predictive power and complexity . . . . . . . . . . . .
276
12.1
Steps of proactive fault management . . . . . . . . . . . . . . . . . . . .
281
xx
List of Tables

8.1   Contingency table
8.2   Metrics obtained from contingency table

9.1   Number of different log messages
9.2   Experiment settings for detailed analysis
9.3   Contingency table for a random predictor
9.4   Contingency table for the UBF failure prediction approach
9.5   Summary of computation times for comparative approaches

10.1  Actions performed after prediction
10.2  Parameters used for modeling
10.3  Simplified contingency table
10.4  Solution to the steady-state equations for availability
10.5  Mapping of cases to situations
10.6  Estimation results for the case study
10.7  Relative amount of the four types of prediction
10.8  Parameters assumed for the sophisticated example
Mathematical Notation
• Vectors are typeset in bold lower-case letters or square brackets such as π = [π1, . . . , πN]
• Matrices are typeset in bold capital letters such as B = [bij] or as a vector of vectors such as A = [ai]
• Sets are indicated by curly brackets such as E = {x, y, z}
• Random variables are denoted by capital letters such as X. If a random variable is fixed to some value, the notation X = x is used
• Observation symbols are denoted by lower-case letters oi ∈ O, where O denotes the alphabet of size M. The alphabet is simply a set of observation symbols: O = {o1, . . . , oM}
• Sequences of observations are denoted by a sequence of random variables O without separating commas, such as O1 O2 . . . OL. For a specific, given sequence of observations, vector notation o = [Ot] is used
• The notation Oi = ok expresses that the i-th element in an observation sequence is equal to symbol ok
• States are denoted by lower-case letters si ∈ S, where S denotes the set of all N states. Similar to observations, random variables denoting states use capital S, and sequences of states are defined equivalently to observation sequences
• Observation probabilities in hidden Markov models are either denoted in matrix form B = [bij] or in functional form bij = bi(oj)
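Purely as an illustration (the variable names below are my own, chosen to mirror the notation above; they are not part of the thesis), these conventions map onto simple data structures:

```python
# Illustrative sketch only: variable names mirror the notation above.
N, M = 3, 4                            # number of states N, alphabet size M

pi = [1.0, 0.0, 0.0]                   # initial state distribution pi = [pi_1, ..., pi_N]
A = [[1.0 / N] * N for _ in range(N)]  # transition matrix A = [a_ij]
B = [[1.0 / M] * M for _ in range(N)]  # observation probabilities B = [b_ij] = b_i(o_j)

o = [0, 2, 1]                          # observation sequence O_1 O_2 ... O_L (symbol indices)

# Every row of A and B, as well as pi, must be a probability distribution:
assert all(abs(sum(row) - 1.0) < 1e-9 for row in A + B)
assert abs(sum(pi) - 1.0) < 1e-9
```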
Preface
There are no faults, only lessons.
— freely adapted from Dr. Chérie Carter-Scott
The desire to know the future has always been ingrained in mankind, and predictions have been fascinating ever since. Think, for example, of the oracle at Delphi during the classical period of Greece or the priests of the oracle at Siwa. Their supposed ability to foresee the future created an aura and a reputation that have lasted for more than 2000 years. Stonehenge, as a second example, was probably an equinox predictor. Today, predictions are used in a multitude of areas. There are methods to forecast wars, the weather, and winds.1 Financial markets, healthcare, and insurance rely heavily on predictions as well. Turning to physics and engineering, prediction strategies are applied, for example, to predict the path of meteorites or the future development of a signal in signal processing. Even in computer science, prediction methods are used quite frequently: in microprocessors, branch prediction tries to prefetch the instructions that are most likely to be executed, and memory or cache prediction tries to forecast what data might be required next.2 In this dissertation, prediction techniques are used to forecast the occurrence of system failures.
Today, human lives and organizations are increasingly dependent on the correct functioning of computer systems. Train control systems, emergency systems, stock trading
software, and enterprise resource planning systems are only a few examples. A failure in
any of these systems may cause huge personal as well as economic damage. However,
computer systems have reached a level of complexity that precludes the development of a
completely correct system. Therefore, the occurrence of failures cannot be fully ruled out
but the likelihood of their occurrence should be minimized. This dissertation contributes
to an approach called proactive fault management, which tries to deal with faults even before a failure has occurred. These methods can be applied most efficiently if it is known whether or not a failure is imminent. This assessment is called online failure prediction, and it is the main topic of this thesis.
Turning back to historic oracles, it was the search for structures3 and interrelations and the ability to identify fundamental influencing factors that were essential to their “modus operandi”. Based on this knowledge they were able to analyze the present situation and to infer future developments. These two principles are also the key to the challenge of online failure prediction in complex computer systems. In particular, the approach proposed in this dissertation investigates interrelations between system components by identifying symptomatic error patterns.

1 Interested readers can find specific references on war forecasting (Moll & Luebbert [186]), weather forecasting (Pielke [204]), and wind forecasting (Marzbana & Stumpf [178]).
2 Specific references can be found on signal processing (Kalman & Bucy [140]), instruction prefetching (Jiménez & Lin [134]), and cache prediction (Joseph & Grunwald [136]).
3 For example, Jacob Burckhardt [41] reports in his book about Greek culture that in ancient times priests hoped to forecast the future by examining the viscera of sacrificial animals.
One of the key problems in prediction is that the future is in principle not fully predictable. Hence, any prediction needs to handle uncertainty. In the case of the historic oracles, their replies were intentionally cryptic and ambiguous, as can be seen from one of the best-known replies, the one given to Croesus: When Croesus asked the oracle at Delphi whether he should go to war with the Persians, the oracle responded: “If Croesus attacks the Persians, he will destroy a mighty empire”.4 However, it was Croesus’s mighty empire that was destroyed, not the Persian one; nevertheless, the oracle’s reply remained true. The prediction method proposed here takes a different approach to handling uncertainty: it is strictly probabilistic.
Due to the size and complexity of contemporary computer systems, machine learning techniques have been applied in order to reveal symptomatic patterns from failures observed in the past. This is a fundamental difference from the task the ancient predictors were confronted with: oracles had to evaluate singular events, while in failure prediction there is a chance to gain experience. Hence the problem solved in this dissertation is incomparably easier than the job of the venerable Greek oracles.
4 Herodot [118]
Part I
Introduction, Problem Statement, and Related Work
Chapter 1
Introduction, Motivation and Main Contributions
Many domains in today’s life and organizations are becoming increasingly dependent on the correct functioning of computer systems. Automotive assistant systems, medical imaging devices, banking systems, and production planning and control systems are only a few examples. Hence dependability, which is about preventing personal as well as economic damage, becomes a crucial issue. However, computer systems have reached a level of complexity that precludes the development of a completely correct system. Since systems are built of commercial off-the-shelf components with millions of transistors and millions of lines of code, the occurrence of failures cannot be fully ruled out, but the likelihood of their occurrence should be minimized. Considering availability, another aspect
can be observed: striving for high availability in most cases implies extremely short repair
times. For example, a five-nines availability1 implies that a system must on average not
be down for more than 5.26 minutes per year. It is almost impossible for a human being to analyze, diagnose and repair a complex system within such a short time interval.2
Hence, systems need to react to failures more or less automatically. But even if reaction
is automated, it might in some cases be rather difficult to even restart the system within
five minutes. One way out of this dilemma is to follow a more proactive approach that
starts acting even before the failure occurs. This requires some short-term anticipation
of upcoming failures based on an evaluation of the current runtime state of the system,
followed by some proactive mechanisms that either try to avoid the upcoming failure or
try to minimize its effects (see Figure 1.1). This thesis focuses on online failure prediction for centralized complex computer systems, which is the first step towards an efficient
proactive fault management.
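The five-nines figure quoted above follows from one line of arithmetic; as an illustrative sanity check (not part of the thesis):

```python
# An availability of 0.99999 allows at most a fraction 1e-5 of the
# year as downtime.
availability = 0.99999
minutes_per_year = 365 * 24 * 60                 # 525,600 minutes in a non-leap year
downtime_per_year = (1.0 - availability) * minutes_per_year
print(round(downtime_per_year, 3))               # about 5.256 minutes per year
# A failure occurring only every three years would still have to be
# repaired within roughly three years' worth of allowed downtime:
print(round(3 * downtime_per_year, 3))           # about 15.768 minutes
```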
The need for accurate short-term failure prediction methods for computer systems has
recently been demonstrated by Liang et al. [165]. The authors mention that checkpointing3 is one of the most efficient ways to improve dependability in large scale computers.
However, in parallel computing, the overhead of checkpointing is immense and can even
nullify the gain in dependability due to the fact that failures occur irregularly.

1 I.e., the ratio of uptime over lifetime equals at least 0.99999.
2 Even if a failure occurs only every three years, it seems rather difficult to repair the system within 15.768 minutes.
3 Checkpointing denotes the strategy to regularly save the entire state of a system such that this consistent state can be restored when a failure has occurred.

Figure 1.1: Predict-react cycle

Failure prediction methods are needed to differentiate between periods with few failures and periods with many, and to adapt checkpointing to these situations. Oliner & Sahoo [197]
carry out experiments showing that failure prediction-driven checkpointing4 can boost
both performance and reliability of large-scale systems.
1.1 From Fault Tolerance to Proactive Fault Management
Online failure prediction belongs to the research discipline called fault tolerance which
dates back to the pioneers of computing (cf., e.g., Hamming [113] or von Neumann
[192]). The methods developed at that time mainly concerned ways to deal with incredibly unreliable hardware components such as relays and vacuum tubes. As complexity of
computing systems increased over the years, the main interest in reliable computing also
gradually shifted to a system-wide view (Esary & Proschan [92]). Along with this
development, fault tolerance methods became more dynamic. One well-known example
is the Self-Testing And Repairing (STAR) computer, developed by Avižienis et al. [15].
Various variants of fault tolerance mechanisms employing static and dynamic fault tolerance techniques (hybrid approaches) have been developed (see, e.g., Siewiorek & Swarz
[241] for an introduction). At the same time, software became more and more complex
and software fault tolerance techniques such as recovery blocks (Randell [212]) and N-version programming (Avižienis [14], Kelly et al. [143]) have been developed. This was
in part a reaction to the fact that the relative amount of software-related failures became
predominant (see, e.g., Sullivan & Chillarege [252]). However, fault tolerance techniques
developed until the 1990s were reactive, passive and still static in nature: They were
triggered after a problem had been detected and the type of reactions had to be prespecified during system design. In 1995, Huang et al. [126] proposed a new approach that
has become well-known under the term rejuvenation.

4 The authors call it cooperative checkpointing.

Rejuvenation is a technique that restarts parts of a system even if no fault has occurred. It has proven to be a successful
concept to deal with problems of software-aging (Parnas [198]) such as accumulating numerical rounding errors, corruption of data, exhaustion of resources, memory leaks, etc.
All the while system complexity has not stopped growing, and traditional fault tolerance
mechanisms could not keep pace with the dynamics and flexibility of new computing
architectures and paradigms. Both industry and academia set off the search for new concepts in fault tolerance and other dependability issues like security as can be seen from
initiatives and research efforts on autonomic computing (Horn [123]), trustworthy computing (Mundie et al. [188]), adaptive enterprise (Coleman & Thompson [63]), recovery-oriented computing (Brown & Patterson [40]), responsive computing (e.g., Malek [173]), rejuvenation (e.g., Garg et al. [101]), and various conferences on self-* properties (see, e.g., Babaoglu et al. [19]), where the asterisk can be replaced by any of “configuration”,
“healing”, “optimization”, or “protection”. Throughout this dissertation, the term proactive fault management will be used.
In parallel to computer fault tolerance, research in mechanical engineering developed
the concept of preventive maintenance. Preventive maintenance tries to improve system
reliability by replacement of components (cf., e.g., Gertsbakh [105] for an overview).
Several replacement strategies exist ranging from simple lifetime distribution models to
more complex models including prediction-based preventive maintenance incorporating
monitoring data (cf., e.g., Williams et al. [278]). However, due to the fact that the actions
triggered for mechanical machines differ significantly from those for computing systems
and since the observation-based methods seem not to be able to account for the complexity
of contemporary large computer systems, the two research communities have not merged
(except for some rare approaches such as Albin & Chao [4]).
1.2 Origins and Background
The starting point for the work described in this dissertation was the challenge to develop failure
prediction algorithms based on data collected from an industrial telecommunication system. At the Computer Architecture and Communication group at Humboldt University
Berlin, three different approaches have been proposed: Steffen Tschirpke has introduced
an adaptive fault dictionary, Günther Hoffmann has developed a method based on data
from continuous system monitoring (Hoffmann [120]) and this thesis focuses on a prediction method based on error event patterns. However, the prediction method described in
the following chapters is not the first attempt to master the challenge. Previously, a rather
straightforward solution has been developed that builds on a semi-Markov process and
clustering of similar error events. This method has been named Similar Events Prediction
(see Salfner et al. [226] for details). However, it has two major drawbacks:
1. Predictions more than three minutes in advance resulted in unacceptable computation times due to the exponentially growing complexity of the algorithms.
2. Although results seemed promising, prediction quality dropped to a low level if test
data differed only slightly (e.g., caused by a different configuration of the system
under investigation) from the data that had been used to build the model. The explanation for this behavior is called overfitting, which means that the model is too
specifically tailored to the data analyzed: If an observed pattern under investigation varied only slightly from the patterns observed in the training data, it was not
recognized anymore and hence no failure was predicted.
Having learned the lessons, the task of failure prediction for the commercial telecommunication system has been analyzed from scratch in a structured, traditional engineering
fashion (see Figure 1.2): First, key properties of the system have been identified and by
abstraction, a precise problem statement has been formulated. Then, a methodology has
been developed that is specifically targeted to the key properties of the problem. Having
developed a methodology, it has been implemented and tested with the industrial data of
the telecommunication system in order to assess how well the solution solves the problem.
In the last phase of the engineering cycle, the solution is usually applied to improve the
system. However, failure prediction per se does not improve system dependability unless
coupled with proactive actions, which is beyond the scope of this dissertation. Therefore,
only a theoretical assessment of the effects on dependability has been performed.
Figure 1.2: The engineering cycle.
1.3 Outline of the Thesis
Following the engineering approach depicted in Figure 1.2, this thesis is divided into four
parts:
• Part I The first step —abstraction and identification of key properties— is described in Chapter 2: a problem statement is given and the principle approach
taken in this dissertation is motivated, introduced, and discussed. Before developing a new solution, any engineer should review and investigate existing ones. In
Chapter 3, a survey of failure prediction methods is provided. This includes a taxonomy in order to categorize existing methods and to classify the approach taken
in this thesis. Furthermore, some approaches are described in more detail since
these methods are used for comparison in the experiments carried out in Part III.
Due to the fact that the prediction method presented here builds on hidden Markov
models (HMMs), related work on HMMs and their extension to continuous time
are described in Chapter 4.
• Part II The second step of the engineering cycle, which is concerned with the development of a methodology, is covered by Chapters 5 to 7. In Chapter 5, some concepts of data preprocessing are described including issues related to error logfiles,
a clustering method to identify failure mechanisms and an approach to tackle the
problem of noisy data. In Chapter 6, the hidden semi-Markov model used for failure prediction is presented. Since for failure prediction the outputs of hidden Markov models are probabilistic likelihoods, subsequent classification is necessary in order
to decide whether the current runtime state is failure-prone or not. Classification is
discussed in Chapter 7.
• Part III The third step of the engineering cycle involves experiments in order to
verify that the assumptions made during modeling match the original problem and
to investigate how well the developed methodology performs. Prediction performance is gauged by several measures, which are introduced in Chapter 8. Then the
model is applied to industrial data of the commercial telecommunication system in
Chapter 9. This includes a detailed analysis of the data, data preprocessing, prediction performance and a comparative analysis with the most well-known prediction
approaches in that area.
• Part IV In order to close the engineering cycle, dependability improvement capabilities are assessed in Chapter 10, in which a model is developed in order to theoretically assess the effect of failure prediction-driven fault tolerance mechanisms
(proactive fault management) on availability, reliability and hazard rate. The chapter also includes results of a case study where such mechanisms have been applied
to a demo web-shop application.
Main results are summarized and an outlook to future research topics is provided in
Chapters 11 and 12.
The main contributions of each chapter are presented in chapter summaries.
1.4 Main Contributions
The overall contribution of this dissertation is the development of a novel approach to error event-based failure prediction. Experiments on industrial data of a commercial telecommunication system have shown superior prediction performance in comparison with the most well-known prediction algorithms in that area. In addition to that, several advancements to the state of the art are presented:
• A novel extension of Hidden Markov Models to incorporate continuous time. In
contrast to previous extensions that have been developed mainly in the area of
speech recognition, the model developed in this thesis is specifically tailored to
event-driven temporal sequences.
• To our knowledge the first taxonomy and survey on computer failure prediction
approaches including indication of promising areas for further research. The taxonomy is based on the fundamental relationship among faults, errors, and failures.
Symptoms, which reflect side-effects of faults, have been added to this basic concept.
• To our knowledge the first model to assess dependability of prediction-driven fault
tolerance techniques (proactive fault management). The model incorporates correct
and false predictions, downtime avoidance as well as downtime minimization techniques and cases where failures are induced by the fault management techniques
themselves.
• A novel methodology to group failure sequences. Although only used for data preprocessing, the approach may as well contribute to diagnosis.
• To our knowledge the first measure to quantify the quality of logfiles: logfile entropy combines Shannon’s information entropy with specific requirements for comprehensive logfiles.
All in all, the comprehensive approach to online failure prediction proposed in this thesis, if combined with preventive actions, has the potential to increase computer system availability by an order of magnitude.
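The logfile-entropy measure itself is defined later in the thesis; purely as an illustration of its Shannon-entropy ingredient, the entropy of the distribution of message types in a log can be computed as follows (the function name and log contents are invented for this sketch):

```python
import math
from collections import Counter

def shannon_entropy(messages):
    """Shannon entropy (in bits) of the message-type distribution of a log.

    Illustrative sketch only: the thesis combines this quantity with further
    requirements on comprehensive logfiles to obtain its logfile entropy.
    """
    counts = Counter(messages)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A log that always emits the same message carries no information ...
assert shannon_entropy(["ERR_42"] * 10) == 0.0
# ... while a log with four equally likely message types yields 2 bits.
assert abs(shannon_entropy(["A", "B", "C", "D"] * 5) - 2.0) < 1e-9
```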
Chapter 2
Problem Statement, Key Properties, and Approach to Solution
The first step in any scientific as well as any engineering project should be a proper statement of the problem to be solved. The challenge that had to be solved in the course of
this work is online failure prediction, which is defined in Section 2.1. The motivating case
study that lead to the selection of this topic is an industrial telecommunication system of
which we had given the chance to collect data. In Section 2.2, the prediction objective
is clearly specified for the concrete scenario of the telecommunication system. The case
study is introduced at this early point of the thesis in order to identify key properties of
systems for which the failure prediction method proposed in this thesis is designed. The
key properties are discussed in Section 2.3. From these key properties, the principle approach to the solution is presented in Section 2.4 and its general properties are analyzed
in Section 2.5.
2.1 A Definition of Online Failure Prediction
The aim of online failure prediction is to predict the occurrence of failures during runtime
based on the current system state. For a more precise definition, the terms “failure” and
“online prediction” are defined separately.
2.1.1 Failures
Failures are commonly defined as follows (Avižienis & Laprie [16]):
A system failure occurs when the delivered service deviates from the specified service, where the service specification is an agreed description of the
expected service.
Similar definitions can be found, e.g., in Melliar-Smith & Randell [180], Laprie & Kanoun [155], Avižienis et al. [17]. The main point here is that a failure refers to misbehavior
that can be observed by the user, which can either be a human or a computer component
using another component. Things may go wrong inside the system, but as long as it does
not result in corrupted output,1 there is no failure. More specifically, a failure is an event: it is the point in time when a system ceases to fulfill its intended function [64].

Figure 2.1: Definitions and interrelations of faults, errors and failures
Faults are the root cause of failures and are defined to be a defective (incorrect) state
[64]. In most cases faults remain undetected for some time. Once a fault has become visible, it is called an error. That is why errors are called the “manifestation” of faults. Figure 2.1, which is a modified version of a figure by Siewiorek & Swarz [241], visualizes these relationships. The key aspect to note here is that faults are unobserved defective states. Four
stages exist at which faults can become visible (see Figure 2.2):
1. The system can be audited in order to actively search for faults, e.g., by checking checksums of data structures, etc.
2. System parameters such as memory usage, number of processes, workload, etc.,
can be monitored in order to identify side-effects of the faults. These side-effects
are called symptoms. For example, the side-effect of a memory leak (the fault) is
that the amount of free memory decreases over time.
3. If a fault is activated and detected (observed), it turns into an error.
4. If the fault is not detected by fault detection mechanisms, it might directly turn into
a failure which can be observed from outside the system or component.
A good example of this is a fault on a disk drive: Consider the fault of a defective disk sector. As long as no read/write operations try to access the sector, the fault remains unobserved. Auditing would make it visible by, e.g., reading the entire disk (not for data but for testing purposes). Symptoms of a (not yet completely failed) disk could be observed by monitoring, e.g., wobbling of the disk. Once the sector is completely damaged and data is to be read from it, an error is detected. In a single-disk
environment, this is usually equivalent to the occurrence of a failure. However, if the
defective disk is, e.g., part of a redundant array of independent disks (RAID), the desired
service of data delivery can still be fulfilled and hence no failure occurs.
1 Including the case that there is no output at all.
Figure 2.2: Faults can become visible at four stages: by auditing, by monitoring of system
parameters such as workload, memory usage, etc., to capture symptoms of faults,
by detecting manifestation of faults (errors), or by a failure that can be observed
from outside the system or component
Figure 2.3: Distinction between root cause analysis and failure prediction
Another key aspect for a precise definition of failure prediction methods is that usually
there is no one-to-one mapping between faults and errors: Several faults may result in one single error, or one fault may result in several errors. The same holds for errors and failures: Some errors result in a failure, some do not; even more complicated are cases where errors result in a failure only under special conditions, and some faults may cause failures directly. Moreover, some faults remain inactive for the entire
system lifetime. For this reason, two distinct research directions have evolved: root cause
analysis and failure prediction. Having observed some misbehavior by one of the means
shown in Figure 2.2, root cause analysis tries to identify the fault that caused an error
or failure, while failure prediction tries to assess the risk that the misbehavior will result
in future failure (see Figure 2.3). For example, if it is observed that a database is not
available, root cause analysis tries to identify what the reason for unavailability is: a
broken network connection, or a changed configuration, etc. Failure prediction on the
other hand tries to assess whether this situation bears the risk that the system cannot
deliver its expected results, which depends on the system and the current situation: is
there a backup database or some other fault tolerance mechanism available? What is the
current load of the system?
2.1.2 Online Prediction
The term “failure prediction” is widely used, e.g., for reliability prediction where the goal
is to assess future reliability of a system from its design or specification (see, e.g., Musa
et al. [189], Bowles [35], Denson [77], Blischke & Murthy [32]). However, in contrast,
the topic of online failure prediction is to identify during runtime whether a
failure will occur in the near future based on an assessment of the monitored
current system state.
Although architectural properties such as interdependencies play a crucial role in some
online failure prediction methods, online failure prediction is concerned with a short-term
assessment that makes it possible to decide whether there will be a failure, e.g., five minutes
ahead, or not. Reliability prediction, in contrast, is concerned with long-term predictions based on
input data such as architectural properties or the number of bugs that have been fixed.
More precisely, for the case of online failure prediction, four time-related quantities need to be
defined (see Figure 2.4):
• Lead-time ∆tl defines how far from present time failures are predicted in the future.
• Minimal warning-time ∆tw defines the minimum lead-time such that failure prediction is of any use. If the lead-time were shorter than the warning-time, there would
not be enough time to perform any preparatory or preventive actions.
• Prediction-period ∆tp is the time for which a prediction holds. Increasing ∆tp
increases the probability that a failure is predicted correctly.2 On the other hand, if
∆tp is too large, the prediction is of little use since it is not clear when exactly the
failure will occur.
• Data window size ∆td defines the amount of data that is taken into account for
failure prediction. Even if online failure prediction algorithms take the current
system state into account, many algorithms additionally investigate what happened
shortly before present time. However, in some approaches the amount of data is not
determined by a time window but by other measures such as, e.g., a fixed number of
error events. In this case ∆td is still defined, but may vary with each prediction.
Figure 2.4: Time relations in online failure prediction. Present time is denoted by t. Failures
are predicted with lead-time ∆tl, which must be greater than the minimal
warning-time ∆tw. A prediction is assumed to be valid for some time period, named
prediction-period, ∆tp. In order to perform the prediction, some data up to a time
horizon of ∆td are used; ∆td is called the data window size.
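The timing relations above can be captured in a small predicate. The following is a minimal sketch (function and parameter names are ours, not from the thesis): a prediction made at time t is usable only if its lead-time respects the minimal warning-time, and it counts as correct if the failure falls into the window [t + ∆tl, t + ∆tl + ∆tp].

```python
def prediction_is_correct(t, t_failure, lead, period, min_warning):
    """A prediction made at time t targets the window
    [t + lead, t + lead + period] (lead = delta-t_l, period = delta-t_p).
    It is usable only if lead >= min_warning (delta-t_w), and correct
    if the failure time falls inside the window."""
    if lead < min_warning:
        return False  # no time left for preparatory or preventive actions
    return t + lead <= t_failure <= t + lead + period

# a failure at t = 300 s is caught by a prediction made at t = 0
# with a 5-minute lead-time and a 60-second prediction period
print(prediction_is_correct(0, 300, lead=300, period=60, min_warning=120))  # True
```

Enlarging `period` makes more failures fall into the window, which is exactly the trade-off noted above: the prediction becomes "more correct" but less useful.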
2.2
The Objective of the Case Study
Data of an industrial telecommunication system serves as a gauge of the extent to which
the online failure prediction algorithm is able to predict the occurrence of failures. Although it is a case study, it demonstrates the type of systems and environments in which
the developed online failure prediction method is intended to be applied, and that is why
the concrete objective of the case study is described at this early point of the thesis. In
subsequent sections, the case study serves to identify key properties that are typical for
the problem domain.

2 For ∆tp → ∞, simply predicting that a failure will occur would always be 100% correct!
The main purpose of the telecommunication system under investigation is to realize
a so-called Service Control Point (SCP) in an Intelligent Network (IN) [171]. An SCP
provides services3 to handle communication-related management data such as billing,
number translation, or prepaid functionality for various services of mobile communication: Mobile Originated Calls (MOC), Short Message Service (SMS), or General Packet
Radio Service (GPRS). The fact that the system is an SCP implies that it cooperates closely with other telecommunication systems in the Global System for Mobile
Communication (GSM). Note that the system does not switch calls itself. Rather, it has
to respond to a large variety of different service requests regarding accounts, billing, etc.,
submitted to the system over various protocols such as Remote Authentication Dial In
User Service (RADIUS), Signaling System Number 7 (SS7), or Internet Protocol (IP).
The system’s architecture is very complex and cannot be reproduced here for confidentiality reasons. However, two key facts can be stated: the system has a multi-tier architecture
and employs a component-based software design. At the time the data were collected, the system
consisted of more than 1.6 million lines of code and approximately 200 components realized
by more than 2000 classes, running simultaneously in several containers, each replicated
for fault tolerance.
Typically, one of the most complicated parts of reliability-related projects is the clear
definition of what a failure is. As defined before, a failure is the event when a system
ceases to fulfill its specification. The specification for the telecommunication system requires
that within successive, non-overlapping five-minute intervals, the fraction of calls having a
response time longer than 250 milliseconds must not exceed 0.01%, as shown in Figure 2.5.

Figure 2.5: If within a five-minute interval the fraction of calls having response time > 250 ms
exceeds 0.01%, a failure has occurred
This definition is equivalent to a required four-nines interval service availability:

Ai = (no. of service requests within 5 min having response time ≤ 250 ms) / (total no. of service requests within 5 min) ≥ 0.9999 .   (2.1)

3 so-called Service Control Functions (SCF)
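The failure definition and Eq. (2.1) translate directly into code. A minimal sketch with illustrative function names, assuming the response times of all requests within one five-minute interval are given in milliseconds:

```python
def interval_availability(response_times_ms, threshold_ms=250):
    """A_i of Eq. (2.1): fraction of requests in a five-minute
    interval answered within the response-time threshold."""
    ok = sum(1 for rt in response_times_ms if rt <= threshold_ms)
    return ok / len(response_times_ms)

def interval_is_failure(response_times_ms):
    """A failure has occurred iff A_i drops below four-nines availability."""
    return interval_availability(response_times_ms) < 0.9999

# 100,000 requests, 11 of them slower than 250 ms:
# A_i = 0.99989 < 0.9999, i.e., the interval counts as a failure
times = [10] * 99_989 + [400] * 11
print(interval_is_failure(times))  # True
```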
Various classifications of failures have been published, one of which is Cristian et al.
[69], extended by Laranjeira et al. [156], who classify failures by the following categories:
• crash failure: the service stops operating and does not resume operation until repair
• omission failure: the service does not respond to a request
• performance failure: the service responds too late (given a threshold)
• timing failure: the service responds too early or too late (given two thresholds)
• computation failure: the service’s response shows wrong results
• arbitrary failure: the service may fail in an arbitrary way
where each failure class is included in the following classes. According to this definition,
the objective of this thesis is to predict performance failures of the telecommunication
system. Using the terminology of Laprie & Kanoun [155], these failures are consistent
timing failures. However, it is not possible for us to assess the consequences on the
environment, since these are top-level failures and no information is available on how
other parts of a telecommunication network rely on the service of the system analyzed
here.
Figure 2.6: Data acquisition setup. Error logs have been collected from the telecommunication system while a failure log has been obtained from an entity that tracked
response times of calls.
Field data was collected under various workloads. Request response times have
been measured, and all failed requests (i.e., those having response times of more than 250 milliseconds) have been written into a failure log. The second source of data is error logs,
which have been collected from the telecommunication system (see Figure 2.6). Both
failure and error logs have been collected for 200 days, containing a total of 1560 failures.
2.3
Key Properties
By analyzing the telecommunication case study, key properties have been identified, yielding the assumptions on which the failure prediction approach developed in this thesis is
based. In particular, the key properties are:
1. Only very little knowledge about the system internals is available. Because we did
not have full access to the system internals, a thorough analysis of the system’s
structure has not been possible. Moreover, such an analysis seems infeasible due to
the sheer size of the system.
2. A lot of data is available. The error logs of the 200 days of testing contained a
total of 26,991,314 log records, which corresponds to an average logging
activity of 43 log records per minute on one node and 51 log records per minute on
the second node, respectively. Investigations have shown that only a small fraction
of the error records gives notice of upcoming failures.
3. Failures occur rarely. This leads to an imbalance of failure and non-failure data.
4. The telecommunication system is built of software components. Software components are more or less isolated subsystems that are executed in so-called containers,
which provide additional functionality such as data persistency, replication, logging, etc.
The system serves requests by invoking one or more components, which in turn invoke other components to fulfill the job. This leads to interdependencies within the
software. Usually, the interdependencies form a forest in terms of graph theory, but
cycles cannot be excluded in general.
5. Fine-grained fault detection and error reporting is built into the system. For example, each component is continuously observing its state and is checking the input
received from other components. Additionally, there might be several steps of escalation that can assign different levels of severity to error events.
6. The system is running multiple tasks and processes in parallel. For this reason,
several concurrent tasks can send messages to the error logging back-end. Such behavior can be interpreted as noise in the error logs. A second effect of this property
is that the order of events can be interchanged if several events occur more or less
concurrently.
7. Error logs have at least two dimensions: a timestamp and a type specifying what
has happened.4 It is assumed that both dimensions contribute information that can
be exploited for failure prediction.
8. Due to the property of being event-triggered and taking values from a finite, countable set, error logs form a temporal sequence.
9. The telecommunication system can serve requests for several protocols such as
GPRS, SMS, MOC, etc. Data of two groups of protocols have been recorded separately. Furthermore, interval service availability requirements must be fulfilled
separately for both groups. In general, it must be assumed that contemporary systems can show failures of various types and different failure definitions may exist
for each of them.
10. In a system of such complexity, it must be assumed that several failure mechanisms exist for each failure type. A failure mechanism denotes the relation of faults
and system states to a failure, with focus on the process of how the faults lead to the
failure. This is closely related to the term failure modes as defined by Laprie &
Kanoun [155], but the term failure mechanism is used here in order to emphasize
the temporal aspect.

4 In many cases, such as the telecommunication system investigated here, the type is only implicitly specified by an error message in natural language. The task of message type assignment is addressed in Chapter 5.
11. The telecommunication system is highly configurable: more than 2000 parameters can be adjusted. Configurability also adds to system complexity, e.g., by
parametrization of interrelations within the system.
12. Systems are subject to updates which can alter system behavior significantly. Hence,
the process to adapt failure predictors to new system specifics should require as little effort as possible. At least from that perspective, algorithmic solutions seem
preferable in comparison to human analysis.
13. The system is non-distributed. Although the data is collected from two machines interconnected by a dedicated high-speed local network and running on synchronized
clocks, the data is merged into one single error log and no computing node-specific
aspects are used throughout this thesis.
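Properties 7 and 8 suggest a minimal data representation: an error log is a time-ordered sequence of (timestamp, type) events whose types come from a finite set and whose timestamps are not equidistant. A sketch (field names and numbers are ours; the event types follow Figure 2.8):

```python
from collections import namedtuple

ErrorEvent = namedtuple("ErrorEvent", ["timestamp", "type"])

# event-triggered temporal sequence: types from a finite set,
# timestamps determined by when errors are detected
log = [
    ErrorEvent(12.0, "C"),  # fault in C3 detected
    ErrorEvent(15.4, "A"),  # C1 affected by faulty C3
    ErrorEvent(15.5, "B"),  # C1-internal escalation
    ErrorEvent(19.1, "D"),  # C2 affected
]

deltas = [round(b.timestamp - a.timestamp, 1) for a, b in zip(log, log[1:])]
print(deltas)  # [3.4, 0.1, 3.6] -- inter-event times are irregular
```

Both dimensions carry information: the type sequence "C A B D" and the irregular inter-event times, which is why the prediction model must handle event-triggered (rather than equidistant) data.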
2.4
Approach
Due to the property that only limited analytical knowledge but a large amount of data is
available, a machine learning approach has been chosen. It infers symptoms of upcoming
failures from measurements (training data) rather than from an analytical model of the
system. Machine learning, as applied here, consists of two steps (see Figure 2.7): first,
a model is built from recorded data using some training algorithm, which means that
model parameters are adjusted such that some objective function is optimized. Specifically, the training data consists of error-log files and failure logs, which are used to identify
whether a failure occurred or not. Having trained a model, the model is used to predict
failures online during runtime. However, as we do not have access to the running system,
this thesis must do without real-time testing. Rather, the data set is divided into a training
and a test dataset such that prediction quality must be estimated from samples that were
not available in training.

Figure 2.7: Machine learning approach: First a model is built from training data (a). After
training, the model is used to predict the occurrence of failures during runtime (b)
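Since no live system is available, prediction quality is estimated on held-out data; this amounts to a simple split of the recorded sequences, sketched below with illustrative names:

```python
def split_sequences(sequences, train_fraction=0.7):
    """Divide recorded sequences into a training part (used to fit
    the models) and a test part (used to estimate prediction
    quality on samples not seen in training)."""
    cut = int(round(len(sequences) * train_fraction))
    return sequences[:cut], sequences[cut:]

train_seqs, test_seqs = split_sequences(list(range(10)))
print(len(train_seqs), len(test_seqs))  # 7 3
```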
The key notion of the approach is that dependencies in the component-based system
lead to error patterns, as shown in Figure 2.8. Assume that component “C3” is faulty.
Once the fault is detected, an error message “C” is generated and written to the error log.
Some time later, component “C1” needs some functionality of “C3”, but because
“C3” is faulty, “C1” also has a problem and reports an error of type “A”. Due to
component-internal mechanisms and dependencies (see, e.g., Hansen & Siewiorek [114]),
the component writes a second error message “B”. After some time, the same happens to
“C2”: when its functionality is requested but cannot be delivered, an error message
of type “D” is generated. As can be seen from the bottom time line of Figure 2.8, this
behavior leads to an event-triggered temporal sequence of error events.

Figure 2.8: Dependencies among components lead to a temporal sequence of errors
The telecommunication system under study is a fault-tolerant system. Hence, the chain
of dependencies shown in the figure is not necessarily traversed for a single request.
For example, if component “C3” has problems connecting to the database, resulting
in error message “C”, this problem may be handled by another component5 or it may
lead to a single failed call request. But a single failed request does not yet constitute a failure.
However, if component “C3” is faulty for a while, there are some conditions under
which other components start to have problems, too (component “C1” in the
figure). This may still be fine, but in some situations even “C2” gets a problem, which
finally leads to a failure, since too many components are having problems and hence too
many call requests fail. These effects give rise to the central idea of the failure prediction
approach investigated in this thesis:
⇒ Dependencies in the system lead to error patterns, as shown in Figure 2.8.
⇒ Some error patterns lead to failures while others do not, depending on conditions
that are not observable from the outside.
⇒ Apply pattern recognition techniques to identify those patterns that have led to
failures.
⇒ Train the pattern recognizer on previously recorded error patterns using machine
learning techniques.
5 which is a component failover
Hidden Markov Models (HMMs) have been shown to be successful pattern recognition tools in a large variety of recognition tasks ranging from speech recognition to
intrusion detection in computer systems. This being the first reason for the choice to use
HMMs for failure prediction, there is a second rationale referring to the very basic distinction between faults, errors, and failures: Faults are by definition unobservable. Once
they manifest, they turn into errors, which are observable. This insight transfers
directly to HMMs: the states of an HMM are hidden, i.e., unobservable, and generate observation symbols. Hence, a close analogy exists between faults and
the hidden states of HMMs, and between their manifestations, which are errors and observation
symbols, respectively. As the occurrence of a failure represents some final state (at least
in non-repairable systems), failures are represented by an absorbing final state producing
a dedicated failure symbol.

However, standard hidden Markov models are not well-suited to represent event-triggered temporal sequences (as is discussed in Section 4.2). For this reason, an extension
of HMMs has been developed that makes it possible to model the temporal behavior of error sequences by
use of a continuous-time semi-Markov process.
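For reference, sequence likelihood under a standard (discrete, untimed) HMM is obtained with the forward algorithm; the extension developed in this thesis replaces the untimed state process with a continuous-time semi-Markov process, which the plain version below does not capture. A pure-Python sketch with a toy two-state model (all numbers illustrative):

```python
def sequence_likelihood(obs, pi, A, B):
    """Standard HMM forward algorithm: returns P(obs | model).
    pi[i]   -- initial probability of hidden state i
    A[i][j] -- transition probability from state i to state j
    B[i][o] -- probability that state i emits symbol o
    Note: only the *order* of symbols enters; the time stamps of the
    error events are ignored, which motivates the semi-Markov extension."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

# toy model over the error types of Figure 2.8
pi = [0.8, 0.2]
A = [[0.7, 0.3], [0.1, 0.9]]
B = [{"A": 0.1, "B": 0.1, "C": 0.7, "D": 0.1},
     {"A": 0.4, "B": 0.3, "C": 0.1, "D": 0.2}]
print(sequence_likelihood(["C", "A", "B", "D"], pi, A, B))
```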
The training procedure. The goal of training is to adjust HMM parameters to error
patterns that are indicative of upcoming failures. To account for the imbalance of failure
versus non-failure data (class skewness), HMMs are trained with failure-prone sequences
only. Since it is assumed that several failure mechanisms exist in the system and hence
are present in the data, a separate HMM is trained for each. The term failure mechanism denotes the principal process by which specific faults, states, and circumstances lead to
a specific failure. In order to group failure sequences of the training data by failure
mechanism, the failure sequences are clustered (see Section 5.2). In order to distinguish
failure-prone from non-failure sequences in the prediction phase, a separate model targeted
at non-failure sequences is needed. It is trained from a selection of non-failure-prone
sequences in the training data. Although grouping of non-failure sequences would in
principle be possible, it is not applied, since the non-failure sequence model only serves
as a reference for classification. Furthermore, sequence clustering would not be applicable
anyway due to the large number of non-failure sequences in the data set.

Since logfiles are noisy and sometimes there is too much logging going on for a
prediction method to be successful, the data needs to be preprocessed. Data
preprocessing involves filtering mechanisms and statistical testing. An overview of the
training procedure is provided by Figure 2.9.
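The grouping step can be illustrated with a deliberately simplified stand-in: the thesis clusters failure sequences with the techniques of Section 5.2, but any notion of sequence dissimilarity plus a grouping rule shows the idea. The edit-distance measure, the greedy rule, and all names below are ours, not the thesis's method:

```python
def edit_distance(s, t):
    """Levenshtein distance as a crude dissimilarity between two
    error-type sequences (a stand-in for the thesis's clustering)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def group_failure_sequences(sequences, max_dist=1):
    """Greedy single-linkage grouping: a sequence joins the first group
    containing a member within max_dist, else starts a new group.
    Each resulting group would then train its own HMM."""
    groups = []
    for seq in sequences:
        for g in groups:
            if any(edit_distance(seq, m) <= max_dist for m in g):
                g.append(seq)
                break
        else:
            groups.append([seq])
    return groups

# two failure mechanisms, recognizable by similar error-type strings
seqs = ["CABD", "CABB", "XYZ", "XYZZ"]
print(group_failure_sequences(seqs))  # [['CABD', 'CABB'], ['XYZ', 'XYZZ']]
```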
Online prediction. Given an error event sequence observed at runtime, online failure
prediction is performed by computing the similarity of the observed sequence to the sequences of the training data. This is done by computing the sequence likelihood for each
model, including the model targeted to non-failure sequences. Sequence likelihood can
be interpreted as a probabilistic measure of similarity between the given sequence and
the sequence characteristics represented by the hidden Markov model. In order to come
to a decision whether the current situation is failure-prone or not, multi-class classification based on Bayes decision theory is performed. As was the case for training, data
preprocessing including failure-group-specific filtering has to be applied prior to sequence
likelihood computation. An overview of the procedure for online failure prediction is
depicted in Figure 2.10.

Figure 2.9: An overview of the training procedure. Model 0 is trained with non-failure sequences. Failure sequences are grouped by means of clustering. A separate
model is then trained for each of the u groups.6

6 The letter u is used here since letters i to n, which are commonly used to indicate integer numbers, occur frequently in later chapters and have fixed connotations in this thesis.
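The classification step can be sketched as a Bayes decision over the per-model sequence likelihoods: with P(x | model k) from the forward computation and priors P(model k), the sequence is flagged failure-prone when the a-posteriori most probable model is one of the failure models rather than model 0. All numbers below are illustrative:

```python
def classify(likelihoods, priors):
    """Bayes decision: the posterior P(k | x) is proportional to
    P(x | k) * P(k). Index 0 is the non-failure model,
    indices 1..u the failure-mechanism models."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(joint)
    posteriors = [j / total for j in joint]
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return best != 0, posteriors  # failure-prone iff best model != 0

# model 0 (non-failure) plus two failure-mechanism models;
# priors reflect that failure-prone situations are rare
failure_prone, post = classify(
    likelihoods=[1e-6, 4e-6, 2e-6],  # P(x | model k)
    priors=[0.97, 0.02, 0.01],
)
print(failure_prone)  # False: the prior outweighs the higher likelihoods
```

Only when a failure model's likelihood dominates strongly enough does the decision flip; e.g., `classify([1e-8, 4e-6, 2e-6], [0.97, 0.02, 0.01])` yields a failure-prone verdict.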
2.5
Analysis of the Approach
In order to show the principal properties and limitations of the approach, various aspects are
discussed in the following sections. The intention is to position the approach with respect
to existing failure and fault models, and to relate it to other research areas.
2.5.1
Identifiable Types of Failures
A classification of failures has already been given in Section 2.2, from which it has been
concluded that the objective for the telecommunication system is to predict performance
failures. However, the prediction algorithm can be applied to other systems as well. Since
the algorithm is data-driven, it can only learn to predict failures whose
underlying failure mechanism is similar to the mechanisms contained in the training data.
Furthermore, the machine learning approach focuses on general principles in the data,
which means that very rare special cases are more or less ignored. The conclusion from
this discussion is that the proposed prediction approach can only predict failures that occur
more or less frequently; it is not appropriate for predicting really rare failure events.

Figure 2.10: An overview of the online failure prediction approach. In order to investigate an
observed error sequence, sequence likelihood is computed for each of the models, including the model targeted to non-failure sequences (Model 0). Sequence
likelihood is a probabilistic measure for the similarity of the observed error sequence
to sequences of the training data. Failure prediction is then performed by subsequent classification whether the current situation is failure-prone or not. In order
to prepare the sequence for this process, data preprocessing including failure-group-specific filtering has to be applied.

This
may seem insufficient from a researcher’s viewpoint, but it is useful from an engineer’s
perspective. For example, in [60], Chillarege et al. show that the distribution of failures
resembles a Pareto distribution, from which it follows that a few failures contribute to the
majority of outages. Levy & Chillarege [162] state that from an economic viewpoint
it is most efficient to first address those failures that occur most frequently, in order to
achieve the largest impact on overall system availability. Furthermore, Lee & Iyer [159]
report in a study of the Tandem GUARDIAN system that over two-thirds of reported
software failures are recurrences of previously reported faults. The authors concluded that
“in addition to reducing the number of software faults, software dependability in Tandem
systems can be enhanced by reducing the recurrence rate”.
2.5.2
Identifiable Types of Faults
Research on dependable computing has put much effort into analyzing and categorizing the
things that can go wrong in computer systems. Classifications of different types of faults
are called fault models; they can be helpful, e.g., to determine the potentials and limits
of a fault tolerance technique.
Design–Runtime fault model. A fundamental distinction of faults addresses the development phase from which the fault originates. Design faults originate from bad system
design, e.g., use of an algorithm that does not converge in some situations and hence
might cause an “infinite loop.” Opposed to these are runtime faults, which occur during the
production phase of a system.
Permanent–Intermittent–Transient fault model. Another well-known classification
focuses on the duration of faults, as shown in Figure 2.11.
Figure 2.11: Permanent, intermittent and transient faults (Siewiorek & Swarz [241]).
The figure introduces three types of faults:
• permanent faults, which are defects that stay active until the fault is removed by
repair. A typical example is a damaged sector on a hard disk.
• intermittent faults, which are temporary defects that result from system-internal
flaws.
• transient faults, which are temporary defects that trace back to environmental causes such as a hit by an alpha particle.
As may have become apparent, this categorization is focused on hardware issues. Although the concept can in principle be transferred to software, there are some difficulties.
For example, since a software fault (a bug) can only be removed by repair,
software faults should be classified as permanent. However, some studies have shown that
their occurrence resembles that of transient faults (see, e.g., Gray [107]), due to the fact that their
activation patterns depend on many conditions in the system.
Bohr–Mandel–Heisen–Schrödingbugs fault model. This fault model is tailored to
software faults and draws an analogy between software bugs and well-known physicists and mathematicians. It focuses on the bugs’ type in terms of observability and tangibility. Gray & Reuter [109] classify software bugs into “Bohrbugs” and “Heisenbugs”. This
concept has been extended, as, for example, in Candea [42]:
22
2. Problem Statement, Key Properties, and Approach to Solution
• Bohrbugs. Alluding to the rather simple and deterministic atom model of Niels
Bohr,7 Bohrbugs are deterministic bugs that can be reproduced most easily. Most
Bohrbugs are identified by testing and eliminated in a thorough software engineering process.
• Mandelbugs. Named after the mathematician Benoît B. Mandelbrot, who is one
of the founders of chaos theory, Mandelbugs are bugs that appear chaotic due to
manifold and complex dependencies.
• Heisenbugs. Alluding to Werner Heisenberg’s uncertainty principle, Heisenbugs
disappear or change behavior when being investigated. For example, race conditions
can disappear when a program is run in a debugger, since the debugger changes the
timing behavior of the program.
• Schrödingbugs. Alluding to Schrödinger’s cat thought experiment in quantum
physics, Schrödingbugs do not manifest until, e.g., someone reading the source code
notices them, whereupon the program stops working for everybody until the bug is fixed. An example of
such a bug might be a security breach that is exploited rapidly after being identified,
so that the program becomes unusable until the bug is fixed.
Fail-stop–to–Byzantine fault model. This model characterizes faults with respect to their “hazardousness” or “behavior”. The version presented here is taken from Barborak et al.
[23], which is an extended version of Laranjeira et al. [156], who themselves extended a
model introduced by Cristian et al. [68] (see Figure 2.12). One of the beautiful properties
of the model is that inner fault classes are proper subsets of outer fault classes. The farther
outside a fault resides in the picture, the more difficult it is to detect, and hence the more
complex the resulting failure scenarios are.
Figure 2.12: Fault model based on Barborak et al. [23]
The types of faults can be described as follows:
7 Terming the model “simple” is not intended to belittle the merits of Niels Bohr; remember that he proposed this model as early as 1913!
• Fail-stop fault: a faulty processing entity ceases operation and signals this to other processors.
• Crash fault: the processor simply halts (crashes).
• Omission fault: the processor omits to react to some tasks.
• Timing fault: the processor reacts to tasks, but too early or too late.
• Incorrect computation fault: the processor responds to all requests in time, but the result is corrupted.
• Authenticated Byzantine fault: an arbitrary or even malicious fault that cannot corrupt authenticated messages (sender or receiver can detect the corruption).
• Byzantine fault: every fault or malicious action is possible.
Software–Hardware–Human fault model. While the classifications presented so far reflect mainly design and operational faults, there is also a number of faults that
can be attributed to human operators. One way to incorporate operator faults is to classify faults according to their origin: hardware, software, or human. Several variants of this
distinction exist that basically refer to the same concepts. For example, Scott [232] uses
the terms “technology and disasters”, “application failure”, and “operator error”, and in the
SHIP model (Malek [174]), the concept is extended by the incorporation of “interoperability”
faults.
Discussion of fault models. Unfortunately, none of the presented fault models provides
a tight boundary that completely describes all faults leading to failures that can
be predicted by the presented approach. Nonetheless, each fault model provides a framework to discuss the potentials and limits of the failure prediction approach presented in this
dissertation.
1. Design–Runtime fault model. Design faults are the target of fault intolerance techniques (Avižienis [13]), which attempt to eliminate flaws by elaborate engineering
such as formal specification, design reviews, and thorough testing. If, despite
all efforts to build a flaw-free system, something goes wrong, runtime faults
are addressed by fault tolerance techniques, which try to handle the situation such
that no catastrophic failure occurs. Online failure prediction is a fault tolerance
technique and is hence targeted at runtime faults. However, the boundary between
design and runtime faults is sometimes blurred. If, for example, a design fault always results in similar misbehavior that is clearly identifiable by patterns of error
events, the proposed failure prediction method can anticipate failures caused by design faults as well.
2. Permanent–Intermittent–Transient fault model. The failure prediction approach of
this thesis identifies faults that trigger failure mechanisms known from the training data.
This is most likely the case for permanent faults. Although this fault model
is of limited use for software faults, failures caused by transient or intermittent
faults can also be predicted, provided that the triggering has been observed often enough in the training data. This seems rather unlikely for faults such as the hit by an alpha particle.
However, as the failure prediction approach is targeted at identifying failure-triggering conditions, it fits the transient behavior of software faults as observed in
condition-based activation patterns.
3. Bohr–Mandel–Heisen–Schrödingbugs fault model. Online failure prediction will
most likely be performed on fault-tolerant systems that have undergone thorough
code revision, testing, etc. For this reason, it can be assumed that most Bohrbugs
have been eliminated. Schrödingbugs are a construct that is very unlikely to occur,
and since the program stops working until the bug is fixed, there is no need for online failure
prediction. Mandelbugs and Heisenbugs are the typical bugs for which failure prediction is relevant. Both are triggered under complex conditions, and the difference
between them matters more for root cause analysis than for failure prediction.
4. Fail-stop–to–Byzantine fault model. Since this fault model has the property that
more “friendly” fault classes are proper subsets of more general fault classes, it is
sufficient to determine an upper bound. Because Byzantine faults can
behave arbitrarily, they can trigger failure mechanisms that have not been present in
the training data and hence cannot be predicted. The same holds for authenticated
Byzantine faults. Incorrect computation faults can be predicted as long as they lead
to errors that are detected within components. Nevertheless, it should be pointed
out that there is no 100% coverage, not even for fail-stop faults.8
5. Software–Hardware–Human fault model. The failure prediction approach operates
on errors that have been logged by some software. From this it follows that hardware
faults can only be detected if they result in an error at the software level. If, for
example, it is never detected until system failure that some hard disk controller
delivers corrupted data, this failure cannot be predicted. However, several studies on
causes of failures, such as Gray [107], Gray [108], and Scott [232], have documented
a trend towards software-caused failures. The most striking study is that of Lee & Iyer
[159], who investigated the Tandem GUARDIAN system and found that 89.5% of
reported failures were caused by software.
2.5.3
Relation to Other Research Areas and Issues
In the following, relations to other research areas are briefly discussed. A comprehensive
classification of the proposed failure prediction algorithm with respect to other prediction
approaches is given in Chapter 3.
Fault diagnosis. According to Marciniak & Korbicz [176], there are three different
approaches to pattern recognition for fault diagnosis:
• Minimal distance methods. Classification is achieved by assigning data under investigation to the nearest class as determined by a distance metric in feature space.
In failure prediction, error sequences would have to be analyzed in order to extract
features like frequency of error occurrence, etc.
8 Although fail-stop faults are very unlikely to evolve into a system failure due to the fault-tolerant design of the system.
2.5 Analysis of the Approach
25
• Statistical methods. The goal is to estimate the probability of a class given the data
point under investigation: P(c|x). In failure prediction, classes refer to failure-prone or not-failure-prone and x refers to an error sequence.
• Approximation approach. The class membership function F(x) is approximated by a function. In the case of failure prediction, F(x) would determine whether error
sequence x belongs to the class of failure-prone sequences.
With respect to this classification, the approach of this thesis is a statistical method since
the outcome of the HMM forward algorithm is the sequence likelihood P(x|c), which is turned into P(c|x) by the subsequent Bayesian classification step.
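The classification step just described can be sketched numerically. The following is a minimal illustration of Bayes' rule turning class-conditional sequence likelihoods into posteriors; the likelihood and prior values are invented for illustration and do not come from the thesis.

```python
# Sketch: turning class-conditional sequence likelihoods P(x|c) into
# posteriors P(c|x) via Bayes' rule, as in the classification step above.
# All numeric values below are hypothetical.

def posterior(likelihoods, priors):
    """Return P(c|x) for each class c, given P(x|c) and P(c)."""
    joint = {c: likelihoods[c] * priors[c] for c in likelihoods}
    evidence = sum(joint.values())  # P(x) = sum_c P(x|c) P(c)
    return {c: joint[c] / evidence for c in joint}

# Hypothetical outputs of two sequence models (e.g., HMM forward algorithm):
likelihoods = {"failure-prone": 1e-4, "not-failure-prone": 9e-4}
priors = {"failure-prone": 0.05, "not-failure-prone": 0.95}

post = posterior(likelihoods, priors)
# Classify as failure-prone if the posterior exceeds a decision threshold.
warning = post["failure-prone"] > 0.5
```

With these made-up numbers the posterior of the failure-prone class stays small, so no warning is raised; the decision threshold itself is a design parameter of the classifier.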
Temporal sequence processing. It has been stated that the approach is related to temporal sequence processing. According to Sun [253], temporal sequence processing typically addresses one of four problems:
1. Sequence generation. Having specified a model, generate samples of time series.
2. Sequence recognition. Does some given sequence belong to the typical behavior of
the underlying stochastic process or not? More precisely: What is the probability
of it?
3. Sequence prediction. Given the beginning of a sequence, assess the probability of
the next observation (or state) of the time series.
4. Sequential decision making. Select a sequence of actions in order to achieve some
goal or to optimize some cost function.
Failure prediction, as introduced here, clearly refers to sequence recognition. However,
Section 12.1.1 in the outlook sketches a variant of failure prediction that makes use of
sequence prediction. Since the majority of models for temporal sequence processing deal
with series whose values occur equidistantly (see, e.g., Box et al. [36] for an overview),
it seems infeasible to compare the HMM approach to other temporal sequence modeling
techniques.
Machine learning. The solution presented here clearly belongs to the group of supervised learning algorithms. Supervised learning refers to the property that training data is
labeled with a target value. In terms of failure prediction, this means that for every error
event sequence in the training data set it is known whether it is a failure or non-failure
sequence. Furthermore, the presented approach employs batch learning,9 which denotes
that the approach consists of two phases: a training phase and an application phase (see
Figure 2.7). Such an approach is valid as long as the dynamics of the system stay more or less the same. Due to configurability of the system and updates, this assumption holds only partly, as is investigated in Section 9.7.2. A solution to this problem is online learning, where the model is adapted continuously during runtime.
The No Free Lunch Theorem10 of machine learning proves that on the criterion of generalization performance, there is no single modeling technique that is superior to all other
9 Also called offline learning.
10 See, e.g., Wolpert [280].
techniques on all problems. However, this does not imply that for a given problem all
approaches are equal. In fact, it is the topic of this thesis to design, test, and verify the superiority of one specific modeling technique for the concrete task of online failure prediction from error events.
Data-driven approaches. The approach presented here is clearly a measurement data-driven approach. Such approaches can, despite their generalization capabilities,
only learn interrelations that are present in the training data. Hamerly & Elkan [112] and
Petsche et al. [202] argue that one escape from the dilemma is to build anomaly detectors,
which inverts the problem: The focus of modeling is not the abnormal failure behavior
but the way the system behaves when it is running well. However, this approach also fails
if normal behavior is very diverse, which can be assumed for systems of such complexity
as the telecommunication system. In the outlook (Chapter 12), a new approach to this
dilemma is proposed: The HSMM developed in this thesis may be augmented manually
to account for failure mechanisms that are not contained in the training data.
Class Skewness. Failure prediction approaches usually have to deal with extreme class
skewness: measurements for failures, even performance failures, occur much more rarely than measurements for non-failures. As can be seen from Figure 2.2 on Page 11,
errors occur late in the process from faults to failures: an error is only reported if some
misbehavior in the system has been detected. Hence, in comparison to failure prediction
approaches operating on periodically measured symptom monitoring, the ratio of failure
and non-failure data is more balanced and the problem of class skewness is mitigated.
Nevertheless, both classes are far from being equally distributed and hence failure models
are trained on failure data only.
2.6 Summary
This chapter has defined the objective of this thesis: online failure prediction. In terms of
the telecommunication system case study, the failures to be predicted are performance failures, which are defined to be a drop below a four-nines threshold on five-minute-interval call availability. Key properties of the objective have been identified and the approach pursued in this thesis has been outlined. The last section of the chapter included
a brief description of one failure and four fault models and has discussed potentials and
limits of the described approach.
The following list summarizes the line of arguments that leads to the approach to online
failure prediction followed in this thesis:
• Dependencies within systems lead to error sequences.
• In fault-tolerant systems, not every occurrence of errors leads to a failure.
• Fault-tolerant systems fail only under some conditions.
• Error pattern recognition is applied to distinguish between error sequences that are
failure-prone and those that are not.
• It is assumed that both dimensions of error sequences, time of event occurrence
and type of the event, are equally important. Hence, error sequences are treated as
temporal sequences.
• Extended hidden Markov models are used as the pattern recognition toolkit. The extension allows modeling the temporal behavior of error patterns by use of a semi-Markov process.
• Several failure mechanisms are assumed to be present in a system. In order to
separate failure mechanisms, failure sequences in the training data are grouped by
clustering.
• Since error logs are a noisy data source, data preprocessing has to be applied to the data.
• In order to address the problem of class skewness, failure models are trained using
failure sequences only.
• The approach is a supervised machine learning task employing batch learning.
• By use of a model targeted to non-failure sequences, Bayes decision theory is applied for online prediction in order to classify the current situation of a running
system as failure-prone or not.
Contributions of this chapter. This chapter has discussed the stages at which faults
can be observed. It turned out that the classical distinction between faults, errors, and
failures is not sufficient, since it misses the side-effects of faults, which are called symptoms.
Hence one contribution is the extension of this differentiation.
The second contribution is a novel view on the task of online failure prediction. To the
best of our knowledge, this work is the first to treat the problem as a pattern recognition
task of temporal sequences.
Relation to other chapters. This chapter has formally defined the objective of the thesis
and has presented an overview of the approach. The next two chapters provide some
background on related approaches. The reason why there are two chapters on related
work is that in this thesis, an existing modeling technique —hidden Markov models—
has been extended and applied to the area of online failure prediction. Hence Chapter 3
provides an overview of other approaches to online failure prediction, while Chapter 4
covers related work on hidden Markov models.
Chapter 3
A Survey of Online Failure Prediction Methods
As mentioned in Section 2.1, online failure prediction denotes only a small area in the
broad field of prediction techniques. However, even in that limited sense, a wide spectrum of approaches has been published. This chapter provides a survey of published methods and points to techniques that might in the future be applied to online failure prediction. In order to structure the spectrum, a taxonomy is introduced in Section 3.1. Major concepts are briefly explained and related work is referenced. As it is not possible to implement all techniques without a huge team of researchers, only the most promising approaches, which are closely related to the one presented in this thesis, have been selected for comparative analysis in the case study. These methods
are explained in more detail in Section 3.2.
3.1 A Taxonomy and Survey of Online Failure Prediction Methods
A significant body of work has been published in the area of online failure prediction. This
section introduces a taxonomy that structures the multitude of approaches (see Figure 3.1).
The most fundamental differentiation of failure prediction approaches refers to the
ability to evaluate the current state. Since the current state can only be considered if some
monitoring of the system is used as input data, these methods are also called monitoring-based methods. However, for completeness, failure prediction mechanisms exist that are,
e.g., only based on lifetime probability distributions, the system’s architecture, or other
static properties of the system (Branch 2 in the taxonomy). Reliability models and most
methods known from preventive maintenance fall into this category. The book by Lyu
[170], and especially the chapters by Farr [94] and Brocklehurst & Littlewood [38], provide
a good overview, while the book by Musa et al. [189] covers the topic comprehensively.
The category of methods that evaluate the current system state (branches starting with 1 in the taxonomy) can be further divided into four categories by analyzing at which stage of failure evolution observations are taken. Referring to Figure 2.2 on Page 11,
faults can be observed at four stages: by audits, by monitoring of symptoms, by detection of errors, or by observation of failures. However, since audit-based methods are mainly offline
procedures,1 they are not included in the taxonomy.
Failure Observation (1.1)
The basic idea of failure prediction based on previous failure occurrence is to draw conclusions about the probability distribution of future failure occurrence. The framework
for these conclusions can be quite formal, as is the case with Bayesian classifiers, or rather heuristic, as in the case of counting and thresholding.
Bayesian Predictors (1.1.1)
The key notion of Bayesian failure prediction is to estimate the probability distribution of
the next time to failure by benefiting from the knowledge obtained from previous failure
occurrences in a Bayesian framework. In Csenki [72], such a Bayesian predictive approach [3] is applied to the Jelinski-Moranda software reliability model [132] in order to
yield an improved estimate of the next time to failure probability distribution.
Non-parametric Methods (1.1.2)
It has been observed that the failure process can be non-stationary and hence the probability distribution of time-between-failures (TBF) varies. Reasons for non-stationarity are manifold: the fixing of bugs, changes in configuration, or even varying utilization patterns can affect the failure process. In these cases, techniques such as histograms
result in poor estimations since stationarity2 is inherently assumed. For these reasons,
the non-parametric method of Pfefferman & Cernuschi-Frias [203] assumes the failure
process to be a Bernoulli experiment where a failure of type k occurs at time n with probability p_k(n). From this assumption it follows that the probability distribution of TBF for failure type k is geometric, since only the m-th outcome is a failure of type k and hence
the probability is:
Pr{TBF_k(n) = m | failure of type k at n} = p_k(n) · (1 − p_k(n))^(m−1).    (3.1)
The authors propose a method to estimate p_k(n) using an autoregressive averaging filter
with a “window size” depending on the probability of the failure type k.
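The geometric TBF distribution of Eq. (3.1) can be sketched in a few lines. For simplicity the sketch below estimates p_k(n) with a plain moving average over a 0/1 failure-indicator sequence; the original method uses an autoregressive averaging filter, and the window size and data are invented.

```python
# Sketch of Eq. (3.1): geometric time-between-failures distribution for
# failure type k. The moving-average estimator of p_k(n) is a simplification
# (the paper uses an autoregressive averaging filter); data are synthetic.

def estimate_p(indicators, window=50):
    """Estimate p_k(n) from the last `window` failure indicators (0/1)."""
    recent = indicators[-window:]
    return sum(recent) / len(recent)

def tbf_probability(p, m):
    """P(TBF_k = m): the first m-1 steps are failure-free, the m-th fails."""
    return p * (1.0 - p) ** (m - 1)

indicators = [0] * 95 + [1] * 5          # hypothetical: 5 failures in 100 steps
p = estimate_p(indicators, window=100)
probs = [tbf_probability(p, m) for m in range(1, 1000)]
```

Summing `probs` over a long horizon approaches 1, which is a quick sanity check that the geometric form is normalized.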
Counting / Thresholding (1.1.3)
It has been observed several times that failures occur in clusters in a temporal as well as
in a spatial sense. Liang et al. [165] choose such an approach to predict failures of IBM’s
BlueGene/L from event logs containing reliability, availability and serviceability data.
The key to their approach is data preprocessing employing first a categorization and then
temporal and spatial compression: Temporal compression combines all events at a single
location occurring with inter-event times lower than some threshold, and spatial compression combines all messages that refer to the same location within some time window.
1 We have not found any publication investigating audit-based online failure prediction.
2 At least within a time window.
Figure 3.1: A taxonomy for online failure prediction approaches
Prediction methods are rather straightforward: Using data from temporal compression, if
a failure of type application I/O or network appears, it is very likely that another failure will
follow shortly. If spatial compression suggests that some components have reported more
events than others, it is very likely that additional failures will occur at that location. A
paper by Fu & Xu [99] formalizes the concept by introducing a measure of temporal and
spatial correlation.
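The temporal-compression step described above can be sketched as follows. This is one plausible reading of the scheme: events at the same location whose inter-event gap is below a threshold are merged into a single burst (the choice to chain gaps within a burst, the event tuples, and the threshold are my own illustrative assumptions).

```python
# Sketch of temporal compression of an event log: consecutive events at the
# same location with inter-event times below a threshold are merged, keeping
# the first event of each burst. Event data and threshold are illustrative.

def temporal_compression(events, threshold):
    """events: list of (timestamp, location, type), sorted by timestamp."""
    compressed = []
    last_time = {}  # last seen timestamp per location
    for t, loc, etype in events:
        if loc in last_time and t - last_time[loc] < threshold:
            last_time[loc] = t  # event extends the current burst; drop it
            continue
        compressed.append((t, loc, etype))
        last_time[loc] = t
    return compressed

events = [(0, "node1", "io"), (1, "node1", "io"), (2, "node1", "io"),
          (50, "node1", "net"), (51, "node2", "io")]
result = temporal_compression(events, threshold=10)
# The burst at node1 (t = 0, 1, 2) collapses into the single event at t = 0.
```

Spatial compression would work analogously, grouping by location within a sliding time window instead of by inter-event gap.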
Symptom Monitoring (1.2)
Some types of faults affect the system gradually, which is also known as service degradation. A prominent example of such faults is a memory leak. If some part
of a system has a memory leak, more and more system memory is consumed over time,
but, as long as there is still memory available, neither an error nor a failure is observed.
When memory is getting scarce, the computer may first slow down3 and an error occurs only once no memory is left, which may then result in a failure. The key notion of failure
prediction based on monitoring data is that faults like memory leaks can be grasped by
their side-effects on the system such as exceptional memory usage, CPU load, or disk
I/O. These side-effects are called symptoms. Four principal approaches have been identified: failure prediction based on a system model, function approximation techniques,
classifiers, and time series analysis.
System Models (1.2.1)
The foundation of these failure prediction methods is a model of system behavior, which
is in most cases built from previously recorded training data.
Stochastic models (1.2.1.1): Vaidyanathan & Trivedi [263] construct a semi-Markov
reward model in the following way: Several system parameter measurements are periodically taken from a running system including the number of process context switches
and the number of page-in and page-out operations. Clustering training data yielded
eleven clusters. The authors assume that these clusters represent eleven different workload
states. A semi-Markov reward model was built where each of the clusters corresponds to
one state in the Markov model. State transition probabilities were estimated from the
measurement dataset and sojourn-time distributions were obtained by fitting two-stage hyperexponential or two-stage hypoexponential distributions to the training data. Then, a
resource consumption “reward” rate for each workload state is estimated from the data:
Depending on the workload state the system is in, the state reward defines at what rate
the modeled resource is changing. The rate was estimated by fitting a linear function to
the data using the method of Sen [233]. The authors modeled two resources: the amount
of swap-space used and the amount of free real memory. Failure prediction is accomplished by estimating the time until resource exhaustion. This is achieved by computing
the expected reward rate at steady state from the semi-Markov reward model.
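The core computation of this predictor can be sketched under simplifying assumptions: given embedded-chain transition probabilities, mean sojourn times, and per-state resource consumption rates, the semi-Markov steady state yields an expected depletion rate and hence a time to exhaustion. All numbers below are invented, and power iteration stands in for an exact stationary-distribution solver.

```python
# Sketch: time to resource exhaustion from a (simplified) semi-Markov reward
# model. Transition matrix, sojourn times, and reward rates are illustrative.

def stationary(P, iters=1000):
    """Stationary distribution of the embedded Markov chain (power iteration)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def time_to_exhaustion(P, sojourn, rates, free_resource):
    nu = stationary(P)
    # Semi-Markov steady state: weight embedded-chain visits by sojourn times.
    w = [nu[i] * sojourn[i] for i in range(len(P))]
    total = sum(w)
    pi = [x / total for x in w]
    rate = sum(pi[i] * rates[i] for i in range(len(P)))  # expected units/hour
    return free_resource / rate

P = [[0.0, 1.0], [0.5, 0.5]]   # two hypothetical workload states
sojourn = [2.0, 1.0]           # mean hours spent per visit
rates = [1.0, 4.0]             # resource consumed per hour in each state
hours = time_to_exhaustion(P, sojourn, rates, free_resource=100.0)
```

With eleven clustered workload states, as in the cited work, only the dimensions of `P`, `sojourn`, and `rates` change.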
Berenji et al. [27] build a system model in a hierarchical two-step approach: First, they
build component simulation models that try to mimic the input / output behavior of system
components. These models are used to train component diagnostic models by combining
input data with component outputs obtained from the component simulation models. The
3 E.g., due to memory swapping.
3.1 A Taxonomy and Survey of Online Failure Prediction Methods
33
target output values of the diagnostic models are binary where a value of one corresponds
to faulty component behavior and zero to non-faulty behavior. The same approach is then
applied on the next hierarchical level to obtain a system-wide diagnostic model. The
authors use a clustering method to obtain a radial basis function rule base.
A more theoretical approach that could in principle be applied to online failure prediction is to abstract system behavior by a queuing model that incorporates additional
knowledge about the current state of the system. Failure prediction can be performed
by computing the input value dependent expected response time of the system. Ward
& Whitt [272] show how to compute estimated response times of an M/G/1 processor-sharing queue based on measurable input data such as the number of jobs in the system at
time of arrival using a numerical approximation of the inverse Laplace transform.
Anomaly detectors (1.2.1.2): One of the most intuitive methods of failure prediction is
to build a model that captures key aspects of system behavior and to check during runtime whether the actual system behavior deviates from normal behavior. For example, Elbaum
et al. [89] describe an experiment where function calls, changes in the configuration,
module loading, etc. of the email client “pine” had been recorded. The authors have
proposed three types of failure prediction among which sequence-based checking was
most successful: a failure was predicted if two successive events occurring in "pine" during runtime did not belong to any of the event transitions observed in the training data.
Candea et al. [45] describe a dependable system consisting of several parts such as the
pinpoint problem determination approach [53] or automatic failure path inference [44].
Even though the methods are only used in the context of recovery-oriented computing
[40] the methods could easily be extended to detect deviation from usual behavior during
runtime in order to predict upcoming failures. The same holds for a failure diagnosis
system that employs a decision tree evaluating runtime properties of requests to a large
Internet site [54]. In [144], a χ2 goodness-of-fit test is used to determine whether the proportion of runtime paths between a component instance and other component classes deviates from fault-free behavior.
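The sequence-based check described for "pine" can be sketched directly: record every transition between successive events during training, then raise a warning at runtime when an unseen transition appears. Event names are invented for illustration.

```python
# Sketch of sequence-based anomaly detection: a failure is predicted when two
# successive runtime events form a transition never seen in the training data.
# Event names are hypothetical.

def learn_transitions(training_sequences):
    """Collect all observed (event, next_event) pairs."""
    seen = set()
    for seq in training_sequences:
        seen.update(zip(seq, seq[1:]))
    return seen

def predict_failure(runtime_events, seen):
    """Return True if any successive event pair is unknown."""
    return any(pair not in seen
               for pair in zip(runtime_events, runtime_events[1:]))

training = [["open", "read", "close"], ["open", "write", "close"]]
seen = learn_transitions(training)
ok = predict_failure(["open", "read", "close"], seen)      # known transitions
alarm = predict_failure(["open", "close", "write"], seen)  # unseen pair
```

This kind of detector is cheap but brittle: any legitimate behavior missing from the training data also triggers an alarm, which is exactly the diversity problem raised in the data-driven discussion above.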
Control theory (1.2.1.3): It is common in control theory to have an abstraction of the
controlled system estimating the internal state of the system and its progression over time
by some mathematical equations, such as linear equation systems, differential equation
systems, Kalman filters, etc. (see, e.g., Lunze [169]). These methods are widely used
for fault diagnosis (see, e.g., Korbicz et al. [147]) but have only rarely been used for
failure prediction. However, many of the methods inherently include the possibility to
predict future behavior of the system and hence have the ability to predict failures. For
example, Neville [193] describes in his Ph.D. thesis the prediction of failures in large
scale engineering plants. Another example is Discenzo et al. [78] who mention that such
methods have been used to predict failures of an intelligent motor using the standard IEEE
motor model. Limiting the scope to failure prediction in computer systems, only a few
examples exist, one of which is Yang [282] who uses Kalman filters to predict future states
in combination with an “early failure detection and isolation arrangement” (EFDIA) Petri
Net.
Another approach has been published by Singer et al. [243] who propose the Multivariate State Estimation Technique (MSET) to detect system disturbances by a comparison of the estimated and measured system state. More precisely, a matrix of measurement
34
3. A Survey of Online Failure Prediction Methods
Figure 3.2: Function approximation tries to mimic an unknown target function by the use of
measurements taken from a system at runtime
data of normal operation is collected. This training data is further processed such that
an expressive subset of training data is selected. In the operational phase, a combination
of selected data vectors weighted by similarity to the current (runtime) observations is
used to compute a state estimate. The difference between observed and estimated state
constitutes a residual that is checked for significant deviation by a sequential probability
ratio test (SPRT). In Gross et al. [110], the authors have applied the method to detect software aging [198] in an experiment where a memory-leak fault injector consumed system
memory at an adjustable rate. MSET and SPRT have been used to detect whether the fault
injector was active and, if so, at what rate it was operating. By this, the time to memory exhaustion can be estimated. MSET has also been applied to online transaction processing
servers in order to detect software aging (Cassidy et al. [48]).
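The flavor of MSET's estimate-and-residual scheme can be sketched as follows. This is a loose approximation: inverse-distance weighting stands in for MSET's similarity operator, and a fixed threshold stands in for the sequential probability ratio test; the training vectors are synthetic.

```python
# Sketch in the spirit of MSET: the current state is estimated as a
# similarity-weighted combination of stored normal-operation vectors, and the
# residual (observed minus estimated) is checked for deviation. The weighting
# scheme, threshold, and data are illustrative simplifications (no SPRT).

def estimate(memory, x):
    """Inverse-distance-weighted combination of stored training vectors."""
    weights = []
    for v in memory:
        d = sum((a - b) ** 2 for a, b in zip(v, x)) ** 0.5
        weights.append(1.0 / (d + 1e-9))
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, memory)) / total
            for i in range(len(x))]

def residual_norm(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) ** 0.5

memory = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # normal-operation snapshots
normal = [2.0, 20.0]
drifted = [2.0, 80.0]                             # e.g., leaking memory usage
r_normal = residual_norm(normal, estimate(memory, normal))
r_drift = residual_norm(drifted, estimate(memory, drifted))
disturbed = r_drift > 5.0   # illustrative threshold in place of the SPRT
```

The SPRT replaces the hard threshold in the real method, accumulating evidence over successive residuals so that a decision is made at a controlled false-alarm rate.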
Function Approximation (1.2.2)
Function approximation techniques try to mimic target values, which are assumed to be
the outcome of an unknown function of input data. Target functions include, e.g., the
probability of failure occurrence or the true long-term progression of resource consumption. Since neither the function itself is known nor can the faults, which are part of the input to the unknown function, be observed, the target function can only be estimated from measurements (see Figure 3.2). Function approximation is a broad research area, and various approaches have been published to address this type of problem; some that are related to failure prediction are listed here.
Prediction of failures can be achieved with function approximation techniques in two
ways:
1. The target function is the probability of failure occurrence. In these cases, the target
value in the training dataset is boolean. This case is depicted in Figure 3.2.
2. The target function is some computing resource and failure prediction is accomplished by estimating the time until resource exhaustion.
However, since most of the work presented below follows the second approach, the categorization distinguishes between function approximation methods rather than between target functions.
Curve fitting (1.2.2.1): In this category of techniques, the target function is the true,
long-term progression of some system resource, e.g., system memory. However, if free
3.1 A Taxonomy and Survey of Online Failure Prediction Methods
35
system memory is measured periodically during runtime, measurements vary heavily
since it is natural that memory is allocated and freed during normal system operation.
Curve fitting techniques4 adapt parameters of a function such that the curve best fits the
measurement data, e.g., by minimizing mean square error. The simplest form of curve
fitting is regression with a linear function. Garg et al. [100] have presented work where, after data smoothing, a statistical test (seasonal Kendall test) is applied in order to identify
whether a trend is present and if so, a non-parametric trend estimation procedure [233]
is applied. Failure prediction is then accomplished by computing the estimated time to
resource exhaustion. Castelli et al. [49] mention that IBM has implemented a curve fitting
algorithm for the xSeries Software Rejuvenation Agent. Several types of curves are fit to
the measurement data and a model-selection criterion is applied in order to choose the
best curve. Prediction is again accomplished by extrapolating the curve.
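The simplest variant of this family, linear regression followed by extrapolation to resource exhaustion, can be sketched in a few lines; the memory measurements below are synthetic.

```python
# Sketch of the simplest curve-fitting predictor: fit a linear trend to noisy
# free-memory measurements (least squares) and extrapolate the time at which
# the resource is exhausted. The measurement values are invented.

def linear_fit(ts, ys):
    """Least-squares slope and intercept for y = a*t + b."""
    n = len(ts)
    mt = sum(ts) / n
    my = sum(ys) / n
    a = (sum((t - mt) * (y - my) for t, y in zip(ts, ys))
         / sum((t - mt) ** 2 for t in ts))
    return a, my - a * mt

def time_to_exhaustion(ts, ys):
    a, b = linear_fit(ts, ys)
    if a >= 0:
        return None  # no downward trend, no exhaustion predicted
    return -b / a    # time at which the fitted line reaches zero

ts = [0, 1, 2, 3, 4]
ys = [100, 91, 79, 71, 59]   # free memory in MB, trending downward
t_empty = time_to_exhaustion(ts, ys)
```

The more elaborate schemes cited above differ mainly in the trend test applied beforehand and in the family of curves fitted, not in this basic extrapolation step.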
Cheng et al. [57] present a framework for high availability cluster systems. Failure
prediction is accomplished in two stages: first, a health index ∈ [0, 1] is established based
on measurement data employing fuzzy logic and then trend analysis is applied in order to
estimate the mean time to next failure.
Andrzejak & Silva [10] apply deterministic function approximation techniques such as
splines to characterize the functional relationships between the target function5 and “work
metrics” such as the work that has been accomplished since the last restart of the system.
Deterministic modeling offers a simple and concise description of system behavior with
few parameters. Additionally, using work-based input variables offers the advantage that the function no longer depends on absolute time: For example, if there is only
little load on a server, aging factors accumulate slowly and so does accomplished work
whereas in case of high load, both accumulate more quickly.
Genetic programming (1.2.2.2): In the paper by Abraham & Grosan [1] the target
function is the so-called stressor-susceptibility-interaction (SSI), which basically denotes
failure probability as a function of external stressors such as environment temperature or
power supply voltage. The overall failure probability can be computed by integration of
single SSIs. The paper presents an approach where genetic programming has been used
to generate code representing the overall SSI function by learning from training data.
Although the paper mainly focuses on electronic devices, the approach might be adopted
for failure prediction in complex computer systems. However, this is difficult to tell since
only a few results are presented in the paper.
Machine learning (1.2.2.3): One of the predominant applications of machine learning
is function approximation. It seems natural that various techniques have a long tradition
in failure prediction, as can also be seen from various patents in that area. In 1990, Troudet et al. proposed using neural networks for failure prediction of mechanical parts, and Wong et al. [281] use neural networks to approximate the impedance of passive
components of power systems. The authors have used an RLC-Π model where faults have
been simulated to generate the training data. Neville [193] has described how standard
neural networks can be used for failure prediction in large scale engineering plants.
4 Which are also called regression techniques.
5 The authors use the term “aging indicator”.
Turning to publications regarding failure prediction in large-scale computer systems,
various techniques have been applied there, too. Ning et al. [194] have modeled resource
consumption time series by fuzzy wavelet networks (FWN). They use fuzzy logic inference to predict software aging in application servers based on performance parameters. Turnbull & Alldrin [259] use Radial Basis Functions (RBF) to predict server failures
based on hardware sensors on motherboards. In his dissertation [120], Günther Hoffmann
has developed a failure prediction approach based on universal basis functions (UBF),
which are an extension to RBFs that use a weighted convex combination of two kernel
functions instead of a single kernel. He has applied the method to predict failures of
the same telecommunication system used as case study in this thesis. However, UBF
primarily builds on equidistantly monitored data to identify symptoms while the method
proposed in this dissertation focuses on event-driven error sequences. In [122], Hoffmann
et al. have conducted a comparative study of several modeling techniques with the goal
to predict resource consumption of the Apache webserver. The study showed that UBF yielded the best results for free physical memory prediction, while server response times could be predicted best by support vector machines (SVM). However, the
authors point out that the issue of choosing a good subset of input variables has a much
greater influence on prediction accuracy than the choice of modeling technology. This
means that the result might be better if, for example, only workload and free physical
memory are taken into account and other measurements such as used swap space are ignored. Variable selection6 is concerned with finding the optimal subset of measurements.
Typical examples of variable selection algorithms are principal component analysis (PCA,
see Hotelling [124]) as used in Ning et al. [194] or Forward Stepwise Selection (see, e.g.,
Hastie et al. [115]), which has been used in Turnbull & Alldrin [259]. Günther Hoffmann
has also developed a new algorithm called probabilistic wrapper approach (PWA), which
combines probabilistic techniques with forward selection or backward elimination.
Instance-based learning methods store the entire training dataset including input and
target values and predict by finding similar matches in the stored database of training data
(possibly combining them). Kapadia et al. [141] have applied three learning algorithms
(k-nearest-neighbors, weighted average and weighted polynomial regression) to predict
CPU-time of semiconductor simulation software based on input data such as number of
grid points, or number of etch steps of the simulated semiconductor.
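Instance-based prediction as just described can be sketched with k-nearest-neighbors: the training set is stored as-is and the prediction is the average target of the k most similar stored instances. The feature/target values below (loosely modeled on "simulation parameters to CPU time") are invented.

```python
# Sketch of instance-based learning (k-nearest-neighbors): predict by
# averaging the targets of the k most similar stored training instances.
# Training data are hypothetical.

def knn_predict(train, x, k=3):
    """train: list of (features, target). Returns mean target of k neighbors."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda item: dist(item[0], x))[:k]
    return sum(target for _, target in nearest) / k

train = [([10, 1], 5.0), ([12, 1], 6.0), ([50, 4], 40.0),
         ([55, 4], 44.0), ([11, 1], 5.5)]
cpu_time = knn_predict(train, [11, 1], k=3)
```

Weighted-average and weighted polynomial regression, the other two algorithms mentioned, differ only in how the retrieved neighbors are combined.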
Classifiers (1.2.3)
In contrast to function approximation, classification approaches do not strive to mimic
some target function but try to directly come to a decision about the criticality of the system's
state. For this reason, training data for classification approaches has discrete (and in
most cases binary) target labels. However, the input data to classification approaches can
consist of discrete as well as continuous measurements. For example, for hard disk failure
prediction based on SMART7 values, input data may consist of the number of reallocated
sectors (discrete value) and the drive’s temperature (theoretically a continuous variable).
Target values are not continuous but form a binary classification of whether the drive is
failure-prone or not.
6 Some authors also use the term feature selection.
7 Self-Monitoring, Analysis and Reporting Technology.
Statistical Tests (1.2.3.1): Ward et al. [271] estimate time-dependent mean and variance of the number of TCP connections in various states from a web proxy server in
order to identify Internet service performance failures. If actual measurements deviate
significantly from the mean of training data, a failure is predicted.
A more robust statistical test has been applied to hard disk failure prediction in Hughes
et al. [127]. The authors employ a rank sum hypothesis test to identify failure prone hard
disks. The basic idea is to collect SMART values from fault-free drives and store them as
reference data set. Then, during runtime SMART values of the monitored drive are tested
the following way: The combined data set consisting of the reference data and the values
observed at runtime is sorted and the ranks of the observed measurements are computed8.
The ranks are summed up and compared to a threshold. If the drive is not fault-free, the
distribution of observed values is skewed and the sum of ranks tends to be greater or
smaller than for fault-free drives.
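The rank-sum idea can be sketched as follows. A full Wilcoxon/Mann-Whitney test would derive the critical values from the null distribution and handle ties properly; here the reference values, observed values, and threshold are illustrative, and ties are resolved naively.

```python
# Sketch of the rank-sum check: observed attribute values are ranked within
# the combined (reference + observed) data set and their rank sum is compared
# to a threshold. Data and critical value are illustrative; ties are handled
# naively rather than with midranks.

def rank_sum(reference, observed):
    combined = sorted((v, src)
                      for src, data in (("ref", reference), ("obs", observed))
                      for v in data)
    # Assign ranks 1..N and sum the ranks of the observed values.
    return sum(rank for rank, (v, src) in enumerate(combined, start=1)
               if src == "obs")

reference = [1, 2, 3, 4, 5, 6, 7, 8]   # attribute values of fault-free drives
healthy = [3, 5, 6]                    # similar distribution
suspect = [9, 10, 11]                  # skewed towards large values
rs_healthy = rank_sum(reference, healthy)
rs_suspect = rank_sum(reference, suspect)
failure_prone = rs_suspect > 27        # illustrative upper critical value
```

A drive whose values all fall beyond the reference data collects the largest possible ranks, which is exactly the skew the test is designed to detect.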
Bayesian Classifier (1.2.3.2): In [112], Hamerly & Elkan describe two Bayesian failure
prediction approaches. The first Bayesian classifier proposed by the authors is abbreviated
by NBEM expressing that a specific Naïve Bayes model is trained with the Expectation
Maximization algorithm based on a real data set of SMART values of Quantum Inc. disk
drives. Specifically, a mixture model is proposed where each naïve Bayes submodel m
is weighted by a model prior P (m) and an expectation maximization algorithm is used
to iteratively adjust model priors as well as submodel probabilities. Second, a standard
naïve Bayes classifier is trained from the same input data set. More precisely, SMART
variables xi such as read soft error rate or calibration retries are divided into bins and
conditional probabilities for class k ∈ {Failure, Non-failure} are computed. The term
naïve derives from the fact that all attributes xi are assumed to be independent and hence
the joint probability can simply be computed as the product of single attribute probabilities
P (xi | k). The authors report that both models outperform the rank sum hypothesis test
failure prediction algorithm of Hughes et al. [127].9
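The naïve Bayes computation can be sketched as follows (the synthetic binned data and add-one smoothing are assumptions of this sketch, not details of Hamerly & Elkan's model):

```python
from collections import defaultdict

def train_naive_bayes(samples):
    """samples: list of (binned_attributes, class_label) pairs. Estimates
    class priors and per-attribute conditional counts."""
    class_counts = defaultdict(int)
    attr_counts = defaultdict(int)  # (label, attribute index, bin) -> count
    bins = defaultdict(set)         # attribute index -> bins seen in training
    for attrs, label in samples:
        class_counts[label] += 1
        for i, b in enumerate(attrs):
            attr_counts[(label, i, b)] += 1
            bins[i].add(b)
    return class_counts, attr_counts, bins

def classify(attrs, model):
    """Pick the class maximizing prior * product of P(attribute | class);
    the product form is the naïve independence assumption."""
    class_counts, attr_counts, bins = model
    total = sum(class_counts.values())
    best_label, best_p = None, -1.0
    for label, n in class_counts.items():
        p = n / total
        for i, b in enumerate(attrs):
            # add-one smoothing so unseen bins do not zero out the product
            p *= (attr_counts[(label, i, b)] + 1) / (n + len(bins[i]))
        if p > best_p:
            best_label, best_p = label, p
    return best_label

# Hypothetical binned SMART attributes: (read soft error rate, calibration retries)
training = [
    (("low", "low"), "Non-failure"),
    (("low", "low"), "Non-failure"),
    (("low", "high"), "Non-failure"),
    (("high", "high"), "Failure"),
    (("high", "low"), "Failure"),
]
model = train_naive_bayes(training)
classify(("high", "high"), model)  # -> "Failure"
```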
Pizza et al. [205] propose a Bayesian method to distinguish between transient and
permanent faults on the basis of diagnosis results. In this case the measured symptoms
are obtained by monitoring and evaluation of modules or components. Although not
mentioned in the paper, this method could be used for failure prediction by issuing a
failure warning once a permanent fault has been detected.
Other approaches (1.2.3.3): Failures of computer systems can be predicted by applying a clustering method directly to system measurement data: After collection of a labeled
training data set indicating whether measurements are failure-prone or not, a clustering
method can be used, e.g., to identify centroids of failure-free and failure-prone regions.
During runtime, actual measurements can be classified by assessing proximity to failureprone and failure-free centroids. Sfetsos [234] describes that clustering has been used
together with function approximation techniques for load-forecasting of power systems.
Additionally, clustering is part of the training procedure in Berenji et al. [27], which has
been described in category 1.2.1.1.
8 Which in fact involves nothing more than simple counting.
9 The rank sum test was announced and submitted to the journal in 2000, but appeared after the publication of the NBEM algorithm in the year 2002.
3. A Survey of Online Failure Prediction Methods
Cheng et al. [57] apply a fuzzy logic soft classifier to compute a health index in high
availability cluster systems (see category 1.2.2.1).
Daidone et al. [73] have proposed to use a hidden Markov model approach to infer
whether the true state of a monitored component is healthy or not. The use of hidden
Markov models is motivated by the fact that the true state of the monitored component
cannot be observed. However, the state can be estimated from a sequence of monitoring
results by the so-called forward algorithm of hidden Markov models. Additionally, mistakes in the component-specific defect detection mechanism10 are included in the model.
Since this method is based on concurrent monitoring the method could also be used for
failure prediction: If a component is detected to be faulty, a failure is likely to occur.
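The forward computation can be sketched as follows (a made-up two-state component model; Daidone et al.'s model additionally accounts for mistakes of the detection mechanism):

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Forward algorithm of an HMM: computes the (unnormalized) joint
    probability of each hidden state and the observation sequence by
    dynamic programming over all hidden state paths."""
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: emit_p[s][obs] * sum(alpha[r] * trans_p[r][s] for r in states)
                 for s in states}
    return alpha

# A two-state component model; all probabilities are made up for illustration.
states = ("healthy", "faulty")
start = {"healthy": 0.8, "faulty": 0.2}
trans = {"healthy": {"healthy": 0.9, "faulty": 0.1},
         "faulty":  {"healthy": 0.2, "faulty": 0.8}}
emit = {"healthy": {"ok": 0.95, "error": 0.05},
        "faulty":  {"ok": 0.3,  "error": 0.7}}

alpha = forward(["error", "error"], states, start, trans, emit)
max(alpha, key=alpha.get)  # most likely true state: "faulty"
```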
Chen et al. [52] and Kiciman & Fox [144], which are related publications, apply a
probabilistic context free grammar (PCFG)11 to evaluate call paths collected from a Java
2 Enterprise Edition (J2EE) demo application, an industrial enterprise voice application network, and from eBay servers. Although the approach is designed to identify failures
quickly, the approach could also be used to predict upcoming failures: if the probability of
the beginning of a call path is very low, it is likely that the system is not behaving normally
and there is an increased probability that a failure will occur in the further course of the
request.
Time Series Analysis (1.2.4)
Failure prediction approaches belonging to this category directly measure the target function and analyze it in order to determine whether a failure is imminent or not. Feature analysis computes a residual of the measurement series, while time series prediction models try to predict the future progression of the target function from the series' values themselves (without using other measurements as input data). Finally, signal processing techniques can also be used for time series analysis.
Feature analysis (1.2.4.1): Crowell et al. [71] have discovered that memory-related system parameters such as kernel memory or system cache resident bytes show multifractal characteristics in the case of software aging. The authors used the Hölder exponent, a residual expressing the amount of fractality in the time series, to identify fractality. In a later paper [238], the same authors extended this concept and built a failure prediction system by applying the Shewhart change detection algorithm [24] to the residual time series of Hölder exponents. A failure warning is issued after detection of the second change point.
Time Series Prediction (1.2.4.2): In Hellerstein et al. [117], the authors describe an
approach to predict if a target function will violate a threshold. In order to achieve this,
several time series models are employed to model stationary as well as non-stationary
effects. For example, the model accounts for the influence of the day-of-the-week, or
time-of-the-day, etc. Experiments have been carried out on prediction of HTTP operations
per second of a production webserver. A similar approach has been described in Vilalta
et al. [266].
10
The authors use the term “deviation detection mechanism”.
11
For more details on PCFGs, see category 1.3.3.1
Li et al. [163] collect various parameters from a web server and build an autoregressive model with auxiliary input (ARX) to predict the further progression of system resource utilization. Failures are predicted by estimating resource exhaustion times.
A similar approach has been proposed by Sahoo et al. [220] who applied various time
series models to data of a 350-node cluster system to predict parameters like percentage
of system utilization, idle time and network IO.
Signal Processing (1.2.4.3): Signal processing techniques are of course related to methods that have already been described (e.g., Kalman filters in category 1.2.1.3). However,
in contrast to the methods presented above, techniques of this category neither rely on any
other input data nor do they require an abstract model of system behavior or a concept of
(hidden) system states. Algorithms that fall into this category use signal processing techniques such as low-pass or noise filtering to obtain a clean estimate of a system resource
measurement. For example, if free system memory is measured, observations will vary
greatly due to allocation and freeing of memory. Such measurement series can be seen
as a noisy signal where noise filtering techniques can be applied in order to obtain the
“true” behavior of free system memory: If it is a continuously decreasing function, software aging is likely in progress and the amount of free memory can be estimated for the
near-future by means of signal processing prediction methods (see Figure 3.3). However,
to the best of our knowledge, signal processing techniques such as frequency transformations have only been used for data preprocessing so far.
Figure 3.3: Failure prediction using signal processing techniques on measurement data can, for example, be achieved by noise filtering.
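The idea behind the figure can be sketched with a moving-average filter and a linear extrapolation of the filtered series (window size and the synthetic memory trace are illustrative choices):

```python
def moving_average(signal, window=5):
    """Simple low-pass filter: average over a sliding window."""
    out = []
    for i in range(len(signal)):
        chunk = signal[max(0, i - window + 1):i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def exhaustion_estimate(filtered):
    """Least-squares line through the filtered series; returns the time step
    at which it reaches zero, or None if the trend is not decreasing."""
    n = len(filtered)
    x_mean = (n - 1) / 2
    y_mean = sum(filtered) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(filtered))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sxy / sxx
    if slope >= 0:
        return None  # no exhaustion in sight
    intercept = y_mean - slope * x_mean
    return -intercept / slope

# Free memory: a linear decrease buried in alternating measurement noise.
memory = [1000 - 10 * t + (20 if t % 2 else -20) for t in range(20)]
exhaustion_estimate(moving_average(memory))  # extrapolated exhaustion, around t = 110
```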
Manifestation of Faults – Errors (1.3)
As already mentioned, the third major group of failure prediction methods that incorporate
the current state of the system analyzes the occurrence of error events in order to assess the
current situation with regard to upcoming failures. One of the major differences between
errors and symptom monitoring is that errors always denote an event while symptoms are
in most cases detected by periodic system observations. Furthermore, symptoms are in
most cases values out of a continuous range while error events are mostly characterized
by discrete, categorical data such as event IDs, component IDs, etc. (see Figure 3.4).
Frequency of Occurrence (1.3.1)
One assumption that is very common in failure prediction approaches is the notion that the frequency of error occurrence increases before a failure occurs. Several methods building on this assumption have been proposed over the decades.
Figure 3.4: Failure prediction based on the occurrence of errors (A, B, C). The goal is to assess the risk of failure at some point in the future (indicated by the question mark). In order to perform the prediction, some data that have occurred shortly before the present time are taken into account (data window).
According to Siewiorek & Swarz [241], Nassar & Andrews [190] were the first to propose two ways of failure prediction based on the occurrence of errors. The first approach
investigates the distribution of error types. If the distribution of error types changes systematically (i.e., one type of error occurs more frequently) a failure is supposed to be
imminent. The second approach investigates error distributions for all error types obtained for intervals between crashes. If the error generation rate increases significantly,
a failure is looming. Both approaches resulted in computation of threshold values upon
which a failure warning can be issued.
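The first approach can be sketched as follows (total variation distance and a fixed threshold are illustrative stand-ins for the threshold values Nassar & Andrews computed):

```python
from collections import Counter

def distribution(errors):
    """Relative frequency of each error type in a list of error events."""
    counts = Counter(errors)
    total = sum(counts.values())
    return {etype: c / total for etype, c in counts.items()}

def shift(baseline, window):
    """Total variation distance between the long-run and the recent
    error-type distributions."""
    p, q = distribution(baseline), distribution(window)
    types = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0) - q.get(t, 0)) for t in types)

def warn(baseline, window, threshold=0.3):
    """Warn if the recent error-type mix deviates systematically (the
    threshold is an illustrative choice)."""
    return shift(baseline, window) > threshold

baseline = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
warn(baseline, ["A", "A", "B", "A", "B", "C", "A", "A", "C", "B"])  # same mix: no warning
warn(baseline, ["C"] * 8 + ["A", "B"])                              # type C dominates: warning
```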
Iyer et al. [131] apply a hierarchical aggregation method to error occurrences in order
to filter out so-called symptoms:12 First, errors of equal type reported by one machine
form so-called clusters. Second, subsequent clusters that occur within some specified
time interval are combined to form so-called error groups. Third, error groups that occur
within a 24h interval and that share at least two error records are called “events”. After
data aggregation, Iyer et al. estimate singleton and joint probabilities to test for statistical
dependence.13 A symptom of an event is formed by records that are common to most
of the groups in an event. Although originally used for automatic identification of the
root cause of permanent faults, the detection of a symptom could as well be used for the
prediction of upcoming failures (see also Iyer et al. [130]).
The dispersion frame technique (DFT) developed by Lin & Siewiorek [167] uses a set
of heuristic rules on the time of occurrence of consecutive error events to identify looming
permanent failures. Since this method is used for comparison with the model presented
in this thesis, DFT is further explained in Section 3.2.1.
Lal & Choi [153] show plots and histograms of errors occurring in a UNIX server. The authors propose to aggregate errors in an approach similar to tupling (c.f., Tsao & Siewiorek [258]) and state that the frequency of clustered error occurrence indicates an upcoming failure. Furthermore, they show histograms of error occurrence frequency over time before failure.
More recently, Leangsuksun et al. [157] have presented a study where hardware sensor measurements such as fan speed, temperature, etc. are aggregated using several thresholds to generate error events with several levels of criticality. These events are analyzed in order to eventually generate a failure warning that can be processed by other modules. The study was carried out on data of a high-availability, high-performance Linux cluster.
12 Not to be confused with side-effects of faults as used in this thesis.
13 For independent random variables A and B, the following equation holds: P(A, B) = P(A) · P(B). If it does not, A and B are not independent and are likely to occur together.
In the paper presented by Levy & Chillarege [162], the authors derive three principles,
two of which fall into this category: principle one (“counts tell”) again emphasizes the
property that the number of errors14 per time unit increases before a failure. Principle
number three (“clusters form early”) basically states the same by putting more emphasis
on the fact that for common failures the effect is even more apparent if errors are clustered
into groups.
Another link to this relationship between errors and failures is provided by Liang et al.
[165]: The authors have analyzed jobs of an IBM BlueGene/L supercomputer and support
the thesis: “On average, we observe that if a job experiences two or more non-fatal events
after filtering, then there is a 21.33% chance that a fatal failure will follow. For jobs that
only have one non-fatal event, this probability drops to 4.7%”.
Rule-based Systems (1.3.2)
The essence of rule-based failure prediction is that the occurrence of a failure is predicted
once at least one of a set of conditions is met. Hence rule-based failure prediction has the
form
IF <condition1> THEN <failure warning>
IF <condition2> THEN <failure warning>
...
Since in most computer systems the set of conditions cannot be set up manually, the goal
of failure prediction algorithms in this category is to identify conditions algorithmically
from a set of training data. The art is to find a set of rules that is general enough to capture
as many failures as possible but that is also specific enough not to generate too many false
failure warnings.
Data mining (1.3.2.1): To our knowledge, the first data mining approach to failure prediction has been published by Hätönen et al. [116]. The authors describe that a rule miner
was set up by manually specifying certain characteristics of episode rules. For example,
the maximum length of the data window, types of error messages15 and ordering requirements had to be specified. However, the algorithm returned too many rules such that
they had to be presented to human operators with system knowledge in order to filter out
informative ones.
Weiss [275] introduces a failure prediction technique called “timeweaver” that is
based on a genetic training algorithm. In contrast to searching and selecting patterns
that exist in the database, rules are generated “from scratch” by use of a simple language:
error events are connected with three types of ordering primitives. The genetic algorithm starts with an initial set of rules16 and repetitively applies crossing and mutation operations to generate new rules. The quality of the obtained candidates is assessed using a special fitness function that incorporates both prediction quality17 and diversity of the rule set. After generating a rule set with the genetic algorithm, the rule set is pruned in order to remove redundant patterns. Results are compared to three standard machine learning algorithms: C4.5rules [209], RIPPER [61] and FOIL [208]. Although timeweaver outperforms these algorithms, standard learning algorithms might work well for failure prediction in other applications.
14 Since the paper is about a telecommunication system, the authors use the term alarm for what is termed an error here.
15 As this work has also been published in the telecommunication community, the authors use the term alarm instead of error.
Vilalta & Ma [268] describe a data-mining approach that is tailored to short-term
prediction of boolean data. Since the approach builds on a concept termed “eventsets”, the
failure prediction algorithm is referenced here as eventset method. The method searches
for predictive subsets of events occurring before a target event. In the terminology used
here, events refer to errors and target events to failures. The first major concept of the
method addresses class skewness (see Section 2.5.3). The solution is —similar to the
solution used in this thesis— to first consider only error sequences preceding a failure
within some time window, and to incorporate non-failure data only to remove unwanted
patterns in a later step. The eventset method is used for comparative analysis and is hence explained in more detail in Section 3.2.2. The eventset method has also been applied for
failure prediction in a 350-node cluster system, as described in [220].
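The mining step behind such eventsets can be sketched as follows (a simplified support count over failure-preceding windows; the actual method's candidate generation and its validation against non-failure data are omitted):

```python
from itertools import combinations

def frequent_eventsets(failure_windows, min_support):
    """Count how often each subset of error types occurs in the data windows
    preceding failures and keep those whose support (fraction of windows)
    reaches min_support."""
    counts = {}
    for window in failure_windows:
        types = frozenset(window)  # eventsets ignore ordering and repetition
        for r in range(1, len(types) + 1):
            for subset in combinations(sorted(types), r):
                counts[subset] = counts.get(subset, 0) + 1
    n = len(failure_windows)
    return {s for s, c in counts.items() if c / n >= min_support}

# Error types observed in windows before three failures (synthetic data).
windows = [["A", "B"], ["A", "B", "C"], ["A", "C"]]
frequent_eventsets(windows, min_support=0.6)  # e.g., ("A", "B") is kept, ("B", "C") is not
```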
As indicated by its name, the eventset method operates on sets of errors and does
not take the ordering of errors into account while the timeweaver method includes partial
ordering. However, there are other data-mining methods having the potential to achieve
good results, which have not yet been applied to the problem of failure prediction. For
example, a lot of research has been published in the field of sequential pattern mining. For instance, Srikant & Agrawal [249] introduce the concept of ontologies, which would enable incorporating relationships between error messages; this is closely related to hierarchical fault models. A second area of research that has developed methods that could
as well be applied to failure prediction is concerned with the analysis of path traversal
patterns. For example, Chen et al. [55] generate a tree structure of path traversals to
identify frequent paths and to isolate those paths that set up a basis (so-called “maximal
reference sequences”). However, since the method assumes a dedicated start of all paths18
application of the method to failure prediction is limited to areas where some dedicated
starting points exist such as in transaction-based systems.
Fault trees (1.3.2.2): Fault trees have been developed in the 1960s and have become a standard reliability modeling technique. A comprehensive treatment of fault trees is, for example, given by Vesely et al. [265]. The purpose of fault trees is to model conditions under which failures can occur using logical expressions. The expressions are arranged in the form of a tree, and probabilities are assigned to the leaf nodes, making it possible to compute the overall failure probability.
Fault tree analysis is a static analysis that does not take the current system status into account. However, if the leaf nodes are combined with online fault detectors, and the logical expressions are transformed into a set of rules, fault trees can be used as an online failure predictor.
16 The so-called initial population.
17 Based on a variant of the F-Measure that allows adjusting the relative weight of precision and recall.
18 Which is the root node of the tree.
Although such an approach has been applied to chemical process failure prediction [260] and to power systems [216], we have not found it being applied to computer systems.
Other approaches (1.3.2.3): In the area of machine learning, a broad spectrum of methods is available that could in principle be used for online failure prediction. This paragraph only lists a few techniques that either have been applied to failure prediction or seem at least promising.
A relatively new technique on the rise is the so-called “rough set theory” [199]. Chiang & Braun [58] propose a combination of rough set theory with neural networks to
predict failures in computer networks based on network events. Rough set theory has also
been applied to aircraft component failure prediction (c.f., e.g., Pena et al. [200]).
Bai et al. [20] employ a Markov Bayesian Network for reliability prediction but a
similar approach might work for online failure prediction, as well. The same holds for
decision tree methods: upcoming failures can be predicted if error events are classified
using a decision tree approach similar to Chen et al. [54], which has been described in
Section 1.2.1.2.
Pattern recognition (1.3.3)
Sequences of errors form error patterns. The principle of pattern recognition in this category is to assign a ranking value to an observed sequence of error events expressing
similarity with learned patterns that are known to lead to system failures. Failure prediction is then accomplished by classification based on pattern similarity rankings (see
Figure 3.5).
Figure 3.5: Failure prediction by recognition of failure-prone error patterns
Probabilistic context-free grammars – PCFG (1.3.3.1): This modeling technique has
been developed in the area of statistical natural language processing (see, e.g., Manning
& Schütze [175]). A probabilistic context free grammar consists of a set of rules of a
context-free grammar Ni → X where Ni is a nonterminal symbol and X is a sequence
of terminals and nonterminals. Furthermore, PCFGs associate a probability with each
rule such that
∀i : Σj P(Ni → Xj) = 1.
Given a sentence, which is a sequence of terminal symbols (i.e., the words), the sentence's probability can be computed by finding all possible parse trees that have the given sentence as leaf nodes and summing their probabilities. The probability of each tree is defined as the product of the rule probabilities that were used to generate the parse tree. Algorithms have been developed to
perform these computations efficiently in a dynamic programming manner. Furthermore,
algorithms have been developed to learn rule probabilities from a given set of training
sentences.
Failure prediction could be realized with PCFGs by learning the grammar of error event sequences that have led to a failure in the training dataset. Following the approach depicted in Figure 3.5, failures can be predicted during runtime by computing the probability of the sequence of error events that have occurred in a time window before present
time. To our knowledge, such an approach has not been implemented for online failure
prediction. The only failure-related publications that use PCFGs are Chen et al. [52] and
Kiciman & Fox [144]. However, these papers analyze runtime-paths, which are symptoms
rather than errors —hence this approach has been described in category 1.2.3.3.
A further well-known stochastic speech modeling technique are n-gram models [175].
N -grams represent sentences by conditional probabilities taking into account a context of
up to n words in order to compute the probability of a given sentence.19 Conditional
densities are estimated from training data. Transferring this concept to failure prediction,
error events correspond to words and error sequences to sentences. If the probabilities (the
“grammar”) of an n-gram model were estimated from failure sequences, high sequence
probabilities would translate into "failure-prone" and low probabilities into "not failure-prone".
Markov models (1.3.3.2): Similarity of error sequences to failure-prone patterns extracted from training data can be computed with Markov models in two different ways, depending on whether a Markov chain or a hidden Markov model (HMM) is used.
In case of Markov chains, each error event corresponds to a state in the chain. Sequence similarity is hence computed by the product of state traversal probabilities. Similar events prediction (SEP), which is the prequel of the prediction technique developed in
this thesis, was built on this concept (see [226] for a description).
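The Markov chain scoring can be sketched as follows (maximum-likelihood transition estimates and a small floor probability for unseen transitions are assumptions of this sketch, not SEP's exact formulation):

```python
from collections import defaultdict

def estimate_transitions(sequences):
    """Maximum-likelihood transition probabilities from training sequences
    (each sequence is a list of error-event types seen before a failure)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    trans = {}
    for a, nexts in counts.items():
        total = sum(nexts.values())
        trans[a] = {b: n / total for b, n in nexts.items()}
    return trans

def sequence_probability(seq, trans, floor=1e-6):
    """Product of state-traversal probabilities; unseen transitions get a
    small floor probability so the product does not collapse to zero."""
    p = 1.0
    for a, b in zip(seq, seq[1:]):
        p *= trans.get(a, {}).get(b, floor)
    return p

trans = estimate_transitions([["A", "B", "C"], ["A", "B", "C"], ["A", "C"]])
sequence_probability(["A", "B", "C"], trans)  # high: similar to failure patterns
sequence_probability(["C", "B", "A"], trans)  # near zero: dissimilar
```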
The failure prediction approach described in this thesis also belongs to this category.
The first ideas have been published in Salfner [223], but an implementation has shown
that the concept needed to be developed further, which resulted in the approach presented
here.
Pairwise alignment (1.3.3.3): Computing similarity between sequences is one of the
key tasks in biological sequence analysis [86]. Various algorithms have been developed
such as the Needleman-Wunsch algorithm [191], Smith-Waterman algorithm [244] or the
BLAST algorithm [8]. The outcome of such algorithms is usually a score evaluating
the alignment of two sequences. If used as a similarity measure between the sequence
under investigation and known failure sequences, failure prediction can be accomplished
as depicted in Figure 3.5. One of the advantages of alignment algorithms is that they build
on a substitution matrix providing scores for the substitution of symbols. In terms of error event sequences, this technique has the potential to define a score for one error event being "replaced" by another event, which gives rise to using a hierarchical grouping of errors as defined in Section 5.4. However, to our knowledge, no failure prediction approaches applying pairwise alignment algorithms have been published at this time.
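A global alignment score in the style of the Needleman-Wunsch algorithm can be sketched as follows (flat match/mismatch/gap scores stand in for a full substitution matrix that could encode a hierarchical grouping of error types):

```python
def needleman_wunsch(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Global alignment score via dynamic programming; higher scores mean
    the observed error sequence is more similar to a known failure sequence."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # aligning a prefix against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if seq_a[i - 1] == seq_b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m]

failure_pattern = ["A", "B", "C", "D"]
needleman_wunsch(failure_pattern, ["A", "B", "C", "D"])  # identical: score 8
needleman_wunsch(failure_pattern, ["A", "X", "C", "D"])  # one substitution: score 5
```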
19 Although in most applications of statistical natural language processing the goal is to predict the next word using P(wn | w1, . . . , wn−1), the two problems are connected via the theorem of conditional probabilities.
Other Methods (1.3.4)
Statistical tests (1.3.4.1): Principle number two (“the mix changes”) in Levy &
Chillarege [162] delineates the discovery that the order of subsystems sorted by error
generation frequency changes prior to a failure. According to the paper, relative error
generation frequencies of subsystems follow a Pareto distribution: Most errors are generated by only a few subsystems while most subsystems generate only very few errors.20
The proposed failure prediction algorithm monitors the order of subsystems and predicts
a failure if it changes significantly, which basically is a statistical test.
Classifier (1.3.4.2): Classifiers usually associate an input vector with a class label. In
category 1.3, input data consists of one or more error events that have to be represented by
a vector in order to be processed by a classification algorithm. A straightforward solution
would be to use the error type of the first event in a sequence as value of the first input
vector component, the second type as second component, and so on. However, it turns out
that such a solution does not work: If the sequence is only shifted one step, the sequence
vector is orthogonally rotated in the input space and most classifiers will not judge the
two vectors as similar. One solution to this problem has been proposed by Domeniconi
et al. [81]: SVD-SVM21 borrows a technique known from information retrieval: the so-called "bag-of-words" representation of texts [175]. In the bag-of-words representation, there is a dimension for each word of the language. Each text is a point in this high-dimensional space, where the magnitude along each dimension is defined by the number of occurrences of the specific word in the text.22 SVD-SVM applies the same technique
to represent error event sequences. Since SVD-SVM is used for comparative analysis, it
is described in more detail in the next section.
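The bag-of-words representation of error sequences can be sketched as follows (hypothetical error types A through D; the SVD and SVM steps of the full method are omitted):

```python
def bag_of_errors(sequence, vocabulary):
    """Represent an error sequence as a vector of per-error-type counts,
    one dimension per known error type. Unlike the naive "i-th event as
    i-th component" encoding, a shifted sequence keeps the same counts."""
    return [sequence.count(etype) for etype in vocabulary]

vocab = ["A", "B", "C", "D"]
bag_of_errors(["A", "B", "C"], vocab)  # -> [1, 1, 1, 0]
bag_of_errors(["B", "C", "A"], vocab)  # the shifted sequence maps to the same vector
```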
3.2 Methods Used for Comparison
In order to compare the prediction method presented in this thesis to the state-of-the-art,
other prediction methods have been implemented and applied to the data of the case study.
The selection of approaches is primarily based on the type of input data: the best-known
and most promising error-based approaches have been chosen, which are:
• Dispersion Frame Technique (DFT) developed by Lin [166], which is an error-frequency-based approach (category 1.3.1)
• Eventset Method developed by Vilalta & Ma [268], which is a data-mining approach (category 1.3.2)
• SVD-SVM developed by Domeniconi et al. [81], which is a classification approach
(category 1.3.4)
Together with the pattern recognition approach presented in this dissertation, all categories of error-based failure prediction are covered.
20 This is also known as Zipf's law [285].
21 Singular Value Decomposition and Support Vector Machine.
22 There are more sophisticated representations incorporating term weighting such as tf.idf, but this has not been used for SVD-SVM.
In addition to that, a periodic prediction of failures based on mean-time-between-failures (MTBF), which belongs to category 1.1, has been applied in order to show the prediction results that can be achieved with almost no effort.
Comparing the data that is taken into account by the various prediction methods, one
can conclude:
• DFT only makes use of the time of error occurrence
• Eventset only makes use of the type of error occurrence
• SVD-SVM makes use of the type of error events. Using a bag-of-words representation, the number of error occurrences can also be incorporated. Using a special representation to incorporate the time of error occurrence has not been successful for the case study.
• MTBF only takes the occurrence of failures into account.
In this regard, the novelty of the approach presented here is that it is the first to analyze error events as an event-triggered temporal sequence.
3.2.1 Dispersion Frame Technique
Lin [166] has developed a technique called the Dispersion Frame Technique (DFT) that evaluates the time of error occurrence and is therefore classified into category 1.3.1. It is based on the notion that errors occur more frequently before a failure occurs. It is a well-known heuristic to analyze error occurrence frequencies and has been shown to be superior to classic statistical approaches like fitting of Weibull distribution shape parameters [167, 12]. The technique was developed for data of the Andrew File System at Carnegie Mellon University. The following paragraphs describe DFT as originally published; notes about its application to the case study in this thesis are provided at the end.
Figure 3.6: Dispersion Frame Technique. Diamond i denotes the last error that has occurred, i − 1 the predecessor error of the same type. DF denotes a dispersion frame and EDI the error dispersion index. W denotes a failure warning that is issued at the end of DF1 centered around error i − 2.
The first step of DFT prediction is to separate all error events pertinent to one device.
Then the time of error occurrence for each device is analyzed. A Dispersion Frame (DF)
is the interval time between successive error events of the same type. In Figure 3.6, two DFs are shown: DF1 is the time interval between errors i − 4 and i − 3, whereas DF2 is the interval between errors i − 3 and i − 2. Each DF is shifted such that it is centered around the next and the next but one error. The Error Dispersion Index (EDI) is defined to be the number of error occurrences in the later half of a DF. If it is observed that a DF is less than 168 hours, a heuristic is activated, which predicts a failure if at least one of the following rules is met:
1. when two consecutive EDIs from successive applications of the same DF exhibit an EDI of at least three (in Figure 3.6 this is true for DF1 centered around i − 3 and i − 2),
2. when two consecutive EDIs from two successive DFs exhibit an EDI of at least three,
3. when a dispersion frame is less than one hour,
4. when four error events occur within a 24-hour frame,
5. when there are four monotonically decreasing DFs and at least one DF is half the size of its previous DF (this rule is also met in Figure 3.6).
The failure warning is issued at the end of the data frame, as shown in the figure.
As might have become clear, the rules are heuristic and account for several types of system behavior. For example, rules three and four put absolute thresholds on error-occurrence frequencies, whereas rules one and two put thresholds on window-averaged occurrence frequencies. Finally, rule five is designed to detect trends in error occurrence frequencies.
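The two absolute-threshold rules (three and four) can be sketched as follows; the EDI-based rules one, two, and five are omitted in this sketch:

```python
HOUR = 3600.0

def dft_simple_rules(timestamps):
    """Check rules three and four on a sorted list of error timestamps
    (in seconds) for one unit: rule 3 fires if a dispersion frame, i.e.
    the gap between successive errors of one type, is shorter than one
    hour; rule 4 fires if four errors fall within a 24-hour frame."""
    rule3 = any(b - a < HOUR for a, b in zip(timestamps, timestamps[1:]))
    rule4 = any(timestamps[i + 3] - timestamps[i] <= 24 * HOUR
                for i in range(len(timestamps) - 3))
    return rule3 or rule4

dft_simple_rules([0, 100 * 3600, 250 * 3600, 500 * 3600])  # sparse errors: no warning
dft_simple_rules([0, 1800, 3000, 4000])                    # error burst: warning
```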
It should be noted that DFT was developed for data of the Andrew File System (AFS). In this dissertation, the approach has been transferred to the prediction of failures of a component-based industrial telecommunication system. Therefore, the DFT method had to be adapted slightly:
1. AFS is a physically distributed campus-wide system, and error messages could be assigned easily to field replaceable units (FRUs), which are also strong fault containment regions. The data used for the case study in this thesis derives from a non-distributed system built from software components. However, in the case of AFS, error detection took place within each FRU, while in the case study considered here, software components are much weaker fault containment regions and error detection frequently took place in other parts of the system. Moreover, components were sometimes not even identifiable in the data. Hence, software containers, which execute the components, have been considered as the entity equivalent to FRUs.
2. There are several parameters in the ruleset that are problem-specific. For example, the activation threshold of 168 hours is the time above which faults are considered to be unrelated. Since the goal of the case study used here is to predict service availability failures on a five-minute timescale, the ruleset parameters had to be adapted. To do this, each parameter has been "optimized" separately by varying parameter values. Each choice has been evaluated with respect to the ability to predict failures. If two choices for a parameter were almost equal in precision and recall (see Chapter 8), the one with fewer false positives has been chosen.
3. There is no notion of warning-time ∆tw in the method. Since the warning time is the minimum time for any failure prediction to be useful, failure warnings issued for the interval (t, t + ∆tw] are removed. Indeed, by design, DFT can only predict failures at most half the length of a dispersion frame ahead. This resulted in the removal of quite a lot of warnings, due to the short inter-error-event times occurring in the data.
3.2.2 Eventset Method
The prediction approach published by Vilalta & Ma [268] is based on data-mining techniques. The basic concept of the method is the so-called eventset. As the name indicates, an eventset E = {Xi} is a set of error events that indicates an upcoming failure. The
failure predictor consists of a set of eventsets. The goal of the training procedure is to find
a good set of eventsets such that as many failures as possible can be captured with as few
false warnings as possible.
In order to deal with the imbalance of class distributions (failures are rare events), the
method first considers only failure data and uses non-failure data in a second validation
step. Failure data consists of all error events that have occurred within a time window
of length ∆td before each failure in the training dataset. These windows are termed
failure windows here. The original approach does not consider the lead-time ∆tl; it has been incorporated by shifting the failure window, as depicted in Figure 3.7.
Figure 3.7: The eventset method builds a database of sets of errors occurring within a time window before failures. The database is then reduced in several steps to yield a better predictor. In some of these steps, data occurring in non-failure windows are used. ∆td denotes the length of the data window and ∆tl the lead-time.
An initial database consisting of all subsets of events that have occurred in the event
windows is set up. This initial database of eventsets is then reduced in three steps:
1. Keep only frequent eventsets. An eventset is said to be frequent if it has support greater than a user-defined threshold. Support is defined to be the relative frequency of occurrence in the failure windows:

support(E) = (number of failure windows containing E) / (total number of failure windows) .   (3.2)
In the example, the eventsets {A}, {B}, and {A, B} have support 100%, and the eventsets {C}, {A, C}, {B, C}, and {A, B, C} have support 50%. Assuming a threshold of, say, 70%, only the first three eventsets remain in the database.
2. Keep only accurate eventsets. In the example, the event A also occurs between the
two failures, which leads to the conclusion that the occurrence of A does not indicate an upcoming failure. Confidence takes this into account: confidence is defined to be the relative frequency of occurrence of the eventset with respect to all time windows (including those that do not precede a failure event):

confidence(E) = (number of failure windows containing E) / (number of all windows containing E) .   (3.3)
An eventset is said to be accurate if it has confidence greater than a user-defined threshold. In the example, the eventsets {B} and {A, B} have confidence 100%, while {A} has confidence 2/3. Assuming a confidence threshold of, say, 70%, only the eventsets {B} and {A, B} remain in the database.
Due to the fact that putting a threshold on confidence does not check for negative correlations, an additional statistical test is performed, testing the null hypothesis:

H0 : P(E | failure windows) ≤ P(E | non-failure windows) .   (3.4)
Only eventsets E for which H0 can be rejected (with a certain confidence level)
stay in the database.
3. Remove eventsets that are too general. Remaining eventsets are ordered by confidence in the first place, subsequently by support, and finally by specificity: an eventset E1 is more specific than E2 if E2 ⊂ E1. Going through the sorted list of eventsets, the algorithm removes eventsets that are less specific. In the example, the sorted list of eventsets consists of [{A, B}, {B}]. Since {B} ⊂ {A, B}, {B} is removed and the only remaining eventset is {A, B}. This means that events A and B must occur together in order to indicate an upcoming failure.
Failure prediction is performed by checking whether any eventset of the database is a subset of the currently observed set of error events. For example, if, during runtime, errors A, C, and B occur within a time window spanning an interval of length ∆td, a failure is predicted since {A, B} ⊂ {A, B, C}.
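The filtering and prediction steps above can be sketched as follows. The toy data mirrors the {A}, {B}, {A, B} example from the text; all function names are illustrative and not taken from the original publication.

```python
# Sketch of the eventset method: support (Eq. 3.2), confidence (Eq. 3.3),
# and subset-based prediction. Windows are represented as frozensets of events.

def support(E, failure_windows):
    return sum(E <= w for w in failure_windows) / len(failure_windows)

def confidence(E, failure_windows, all_windows):
    containing = sum(E <= w for w in all_windows)
    if containing == 0:
        return 0.0
    return sum(E <= w for w in failure_windows) / containing

def predicts_failure(database, observed):
    # A failure is predicted if any stored eventset is a subset of the
    # currently observed set of errors.
    return any(E <= observed for E in database)

# toy data: two failure windows and one non-failure window containing only A
failure_windows = [frozenset("AB"), frozenset("ABC")]
all_windows = failure_windows + [frozenset("A")]
database = [frozenset("AB")]
```

With this data, support({A, B}) is 1.0, confidence({A}) is 2/3, and the observation {A, B, C} triggers a failure warning since {A, B} ⊂ {A, B, C}.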
As might have become clear, the initial database of eventsets has the cardinality of the power set, which would make the algorithm infeasible in real applications. Therefore, the first step of support filtering is incorporated into the generation of the initial eventset database by use of the Apriori algorithm (Agrawal et al. [2]), which also applies branch-and-bound techniques.
3.2.3 SVD-SVM Method
Latent semantic indexing (LSI) is a technique developed in information retrieval that makes it possible to find related text documents even if they do not share search terms. LSI is based on the notion of co-occurrence of terms and provides a method to identify "latent" semantic concepts in texts (see, e.g., [175]). Domeniconi et al. [81] have applied this technique to the problem of failure prediction and assume that co-occurrence of error events indicates the "latent" state of the system.23 More specifically, the approach consists of three steps.
1. Error sequences are represented in a so-called bag-of-words representation, which is frequently used in natural language processing: for text documents, there is a dimension for each word of the language, and the magnitude along each dimension is, for example, simply the number of times the word occurs in the document. In the case of error event sequences, there is a dimension for each event type, and the magnitude along the dimension (i.e., the distance from the origin) represents how "prominent" an error type is in the sequence. The authors describe three ways of assigning a value to "prominence":
• existence: one if an event occurs in the sequence, zero if not
• count: the number of occurrences in the sequence
• temporal: partitioning the sequence into time slots and assigning a one to a binary digit if the event occurs within the corresponding time slot.
The key notion of the bag-of-words representation is that each event sequence represents a point in a high-dimensional space, and hence the entire training data set comprises a multidimensional point cloud.
The process of turning error log data into sequences is similar to the eventset method: all errors occurring within a time window of length ∆td preceding a failure by lead-time ∆tl constitute a failure sequence, which is translated into a positive (failure-prone) bag-of-words data point. Errors occurring in data windows between failures constitute negative examples (see Figure 3.8).
2. Semantic concepts, which refer to the latent states of the system, are identified by means of singular value decomposition (SVD). The result of SVD is then used to reduce the number of dimensions in the data. More precisely, co-occurring events in the space of event types are mapped onto the same dimensions in the space of latent states by a least-squares method to decompose the matrix of training event sequences into a product of square and diagonal matrices. Assuming that there are n training sequences and m event types, the matrix of training data D is an m × n matrix with each column corresponding to an event sequence. SVD decomposes D into

D = U S V^T ,   (3.5)

where S is a diagonal matrix with ordered singular values on the main diagonal indicating the amount of variation for each dimension. SVD has the property that projecting data onto the first k dimensions yields a least-squares optimal projection. The projection matrix is defined by the first k columns of matrix U, and projection can simply be performed by matrix multiplication.

Figure 3.8: Bag-of-words representation of error sequences occurring prior to failures. Each time window defines an event sequence. By assuming that there are only two types of error messages (A and B), each sequence can be mapped to a point in two-dimensional event-type space, where the magnitude along each dimension is determined by the number of times the event occurs in the sequence. Sequences from windows preceding a failure are positive examples (black bullets); sequences from windows between failures constitute negative examples (white bullets). ∆td denotes the length of the window and ∆tl the lead-time.

23 The authors call it "pattern context information"
An example is shown in Figure 3.9: assuming that there are only two different errors A and B, the training data set can be represented in two-dimensional space. Figure 3.9-a shows an example using the count encoding, with black bullets indicating failure and white bullets indicating non-failure sequences. The training data defines a 2 × 11 matrix D. SVD computes new dimensions x1 and x2 as shown in (b), such that the projection (c) is optimal in the least-squares sense. The projected data set has only one dimension.
Figure 3.9: Singular value decomposition (SVD). (a): Bag-of-words representation of training
data set. (b): Rotated dimensions found by SVD. (c): Projection onto the new
dimension x1 .
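Step two can be sketched in a few lines of NumPy. The matrix D below is toy data (two event types, four sequences), not the 2 × 11 example from Figure 3.9.

```python
# Sketch of dimensionality reduction via SVD: project bag-of-words columns
# of D (event types x sequences) onto the first k left singular vectors.
import numpy as np

def svd_projection(D, k):
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    P = U[:, :k]           # projection matrix: first k columns of U
    return P, P.T @ D      # projected training data (k x n)

D = np.array([[3., 1., 2., 0.],
              [2., 1., 1., 1.]])   # toy data: 2 event types, 4 sequences
P, D_reduced = svd_projection(D, k=1)
```

For online prediction, a new bag-of-words vector x is mapped into the reduced semantic space simply as P.T @ x, i.e., by multiplication with the projection matrix.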
3. A classifier is trained in order to distinguish between failure and non-failure sequences. The input data for classification are the projected event sequences (obtained from step two). The classification technique used is Support Vector Machines (SVMs), which were developed at the beginning of the 1990s by Vapnik [264].24 Support vector machines are linear maximum margin classifiers. Linear means that the decision boundary corresponds to a straight line in two-dimensional space and to a hyperplane in higher dimensions. However, such an approach can only classify linearly separable problems appropriately, which is not the case for most real-world classification problems. To remedy this problem, a second transformation into a high-dimensional feature space including non-linear features is performed, which can turn complex classification problems into linear problems in feature space. Figure 3.10 depicts such a transformation, denoted by ϕ. Although the additional transformation seems to introduce extra computational complexity, it is in fact one of the reasons for the computational efficiency of SVMs: the trick is that transformations exist for which the distance measure can be computed much more efficiently. The second important feature of SVMs is that they belong to the class of maximum margin classifiers, which means that the decision boundary is chosen such that the margin25 is maximal. It has been proven that this results in the most robust classification (see, e.g., [237]).
Figure 3.10: Maximum margin classification in feature space. On the left-hand side data
points in the original space cannot be separated linearly. By transformation ϕ
data points are transformed into a feature space, where a linear separation is
possible. The decision boundary (indicated by the dashed line) is chosen such
that the margin (solid lines) is maximal.
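The "trick" mentioned above can be made concrete with one common example: the Gaussian (RBF) kernel evaluates the inner product in an implicit feature space without ever constructing ϕ(x). The kernel choice and the γ parameter below are illustrative; the referenced work does not prescribe them.

```python
# Sketch of the kernel trick: k(x, y) = <phi(x), phi(y)> is computed directly
# from x and y; the feature map phi itself is never constructed.
import math

def rbf_kernel(x, y, gamma=0.5):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

An SVM only ever needs such kernel evaluations between pairs of data points, which is why the transformation into a very high-dimensional (even infinite-dimensional) feature space remains computationally cheap.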
After training, online failure prediction is performed in three steps:
1. All error events that occurred within a time window of length ∆td before the present time are represented as a bag-of-words.
2. Singular value decomposition need not be performed for online prediction. Instead, the bag-of-words is transformed into the reduced semantic space by multiplication with the projection matrix.
3. The resulting k-dimensional vector is classified using a support vector machine, which includes a further transformation using ϕ.
24 An introduction can be found in Cristianini & Shawe-Taylor [70].
25 The margin is the distance to the closest data points.
According to the authors of SVD-SVM, failure patterns show properties similar to those of text classification tasks. For example, the frequency distribution of error events follows Zipf's law [285], which inspired them to apply text processing techniques.
3.2.4 Periodic Prediction
The failure prediction method used to estimate some sort of lower bound can be derived
directly from reliability theory, since the probability of failure occurrence up to time t is
simply:
F(t) = 1 − R(t) ,   (3.6)
where R(t) is reliability.
Assuming a Poisson failure process, reliability turns out to follow an exponential distribution (see, e.g., Musa et al. [189]) and the failure probability is:

F(t) = 1 − e^{−λt} .   (3.7)
The distribution parameter λ is fitted to the data by setting

λ = 1 / MTBF ,   (3.8)

where MTBF denotes the mean time between failures of the training data set.26
Using this model, a failure is predicted according to the median of the failure distribution:

Tp = ln(2) / λ .   (3.9)
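The complete periodic predictor fits into a few lines; the function name and the example MTBF value below are illustrative.

```python
# Sketch of the periodic predictor: fit lambda via Equation 3.8 and predict
# a failure at the median of the exponential failure distribution (Eq. 3.9).
import math

def median_time_to_failure(mtbf):
    lam = 1.0 / mtbf           # Equation 3.8
    return math.log(2) / lam   # Equation 3.9

horizon = median_time_to_failure(100.0)  # ~69.3 hours for an MTBF of 100 h
```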
3.3 Summary
This chapter has introduced a taxonomy of online failure prediction approaches for complex computer systems and has provided a comprehensive survey of online failure prediction methods. Furthermore, the survey points to research areas that provide a toolbox of
methods that could most promisingly be applied to the task of online failure prediction.
From this it can be concluded that the technique presented in this thesis is the first to apply
temporal sequence pattern recognition methods to the task of online failure prediction.
The second major goal of this chapter was to describe in detail four existing failure prediction approaches that are used for comparative analysis in this thesis, namely the dispersion frame technique, the eventset method, SVD-SVM, and periodic prediction based on a reliability model.
Contributions of this chapter. To the best of our knowledge, this chapter provides the
first taxonomy and the first survey of online failure prediction approaches.
26 Some works use MTTF instead of MTBF, but since performance failures are predicted in the case study, repair time is not an issue here.
Relation to other chapters. This chapter has presented related work with respect to
online failure prediction approaches. Since the failure prediction method presented in this
thesis is based on an extension to hidden Markov models, related work on hidden Markov
models is presented in the next chapter. However, in order to explain the various models,
first, an introduction to the theory of hidden Markov models is provided.
Chapter 4
Introduction to Hidden Markov Models and Related Work
As a result of their capabilities, hidden Markov models (HMMs) are used more and more frequently in modeling. Examples include the detection of intrusions into computer systems [273], fault diagnosis [73], network traffic modeling [229, 274], estimation and control [90], speech recognition [125], part-of-speech tagging [175], and genetic sequence analysis [86]. In this work, HMMs are used for online failure prediction following a pattern recognition approach. For this reason, this chapter gives an introduction to the theory of HMMs (Section 4.1). The approach taken in this thesis builds on the assumption that the time and type of error occurrences are crucial for accurate failure prediction. However, standard HMMs are not appropriate models for processing temporal sequences. Section 4.2 presents the principal approaches by which temporal sequences can be handled by HMMs, followed by related work on time-varying HMMs in Section 4.3.
4.1 An Introduction to Hidden Markov Models
HMMs are based on discrete-time Markov chains (DTMCs), which consist of a set S =
{si } of N states, a square matrix A = [aij ] defining transition probabilities between the
states, and a vector of initial state probabilities π = [πi ] (see Figure 4.1). A is a stochastic
matrix, which means that all row sums equal one:
∀i : Σ_{j=1}^{N} aij = 1 .   (4.1)
Additionally, the vector of initial state probabilities π must define a discrete probability
distribution such that
Σ_{i=1}^{N} πi = 1 .   (4.2)
The stochastic process defined by a DTMC can be described as follows: An initial
state is chosen according to the probability distribution π. Starting from the initial state,
Figure 4.1: Discrete Time Markov Chain
the process transits from state to state according to the transition probabilities defined by A: being in state i, the successor state j is chosen according to the probability distribution given by the i-th row of A. Such a process exhibits the so-called Markov assumptions or properties:
1. The process is memoryless: a transition's destination depends only on the current state, irrespective of the states that have been visited previously.
2. The process is time-homogeneous: the transition probabilities A stay the same regardless of the time that has already elapsed (A does not depend on time t).
More formally, both assumptions can be expressed by the following equation:
P(St+1 = sj | St = si , . . . , S0) = P(S1 = sj | S0 = si) .   (4.3)
Loss of memory is expressed by the fact that all previous states S0 , . . . , St−1 are ignored
on the right-hand side of Equation 4.3, and time-homogeneity is reflected by the fact
that the transition probabilities for time t → t + 1 are equal to the probabilities for time
0 → 1.
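The generative process described above can be sketched as follows; the two-state chain and its parameters are toy values chosen purely for illustration.

```python
# Sketch of sampling a state path from a DTMC (pi, A): the successor state
# depends only on the current state (memoryless, time-homogeneous).
import random

def sample_dtmc(pi, A, steps, seed=0):
    rng = random.Random(seed)
    state = rng.choices(range(len(pi)), weights=pi)[0]
    path = [state]
    for _ in range(steps):
        # transition probabilities are read from the current state's row of A
        state = rng.choices(range(len(A)), weights=A[state])[0]
        path.append(state)
    return path

pi = [1.0, 0.0]                      # always start in state 0
A = [[0.9, 0.1],
     [0.0, 1.0]]                     # state 1 is absorbing
path = sample_dtmc(pi, A, steps=10)
```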
Hidden Markov Models extend the concept of DTMCs in that at each time step an
output (or observation) is generated according to a probability distribution. The key notion
is that this output probability distribution depends on the state the stochastic process is in.
Two types of HMMs can be distinguished regarding the types of their outputs:
• If the output is continuous, e.g., a vector of real numbers, the model is called a continuous HMM.1
• If the output is chosen from some finite countable set, outputs are called symbols. Such models are called discrete HMMs. Since error message IDs are finite and countable, only discrete HMMs are considered here.
In order to formalize this, HMMs additionally define a finite countable set of symbols O =
{oi } of M different symbols, which is called the alphabet of the HMM. A matrix B =
[bij ] of observation probabilities is defined where each row i of B defines a probability
distribution for state si such that bij is the probability for emitting symbol oj given that
the stochastic process is in state si :
bij = P(Ot = oj | St = si) ,   (4.4)

1 Not to be confused with continuous-time HMMs, as explained later
where Ot denotes the random variable for the observation at time t. Hence, B has dimensions N × M and is a stochastic matrix such that

∀i : Σ_{j=1}^{M} bij = 1 .   (4.5)
Note that for readability reasons, bij will sometimes be denoted by bsi (oj ). Figure 4.2
shows a simple discrete-time HMM.
Figure 4.2: A discrete-time HMM with N = 4 states and M = 2 observation symbols
The reason why HMMs are called "hidden" stems from the perspective that only the outputs can be observed from outside, while the actual state si that the stochastic process resides in is hidden from the observer. From this notion, three basic problems arise for which algorithms have been developed:
1. Given a sequence of observations and a hidden Markov model, but having no clue
about the states the process has passed to generate the sequence, what is the overall
probability that the given sequence can be generated? This probability is called
sequence likelihood. The Forward algorithm provides an efficient solution to this
problem.
2. Given a sequence and a model as above: What is the most probable sequence of
states the process has traveled through while producing the given observation sequence? The Forward-Backward and Viterbi algorithms provide solutions to this
problem.
3. Given a set of observation sequences: What are optimal HMM parameters A, B,
and π such that the likelihood of the sequence set is maximal? The Baum-Welch
training algorithm yields a solution by iteratively converging to at least a local maximum.
The following sections will introduce the three algorithms. Although the algorithms can
be found in many textbooks or in Rabiner [210], they are described here for reasons
of comparison: In Chapter 6, these algorithms are adapted for the hidden semi-Markov
model introduced in this thesis.
4.1.1 The Forward-Backward Algorithm
As the name might suggest, the Forward-Backward algorithm consists of a forward and
a backward part. The forward part alone provides a solution to the first problem: the
computation of sequence likelihood. The likelihood of a given observation sequence o =
[Ot ] is the probability that a given HMM with parameters λ = (A, B, π) has generated
the sequence, which is denoted by P (o|λ). In order to compute this probability, first
assume that the sequence of hidden states s = [St ] was known. The likelihood could then
be computed by:
P(o, s | λ) = π_{S0} b_{S0}(O0) ∏_{t=1}^{L} a_{St−1 St} b_{St}(Ot) ,   (4.6)
where L is the length of the sequence. As only o is known, all possible state sequences s
have to be considered and summed up:
P(o | λ) = Σ_s π_{S0} b_{S0}(O0) ∏_{t=1}^{L} a_{St−1 St} b_{St}(Ot) .   (4.7)
However, such an approach results in intractable complexity since there are N^{L+1} different state sequences. An efficient reformulation has been found that exploits the Markov assumption that transition probabilities are time-homogeneous and depend only on the current state. Using this property, Equation 4.7 can be rearranged such that repetitive computations can be grouped together. From this rearrangement it is only a small step to a recursive formulation, which is also known as dynamic programming. The resulting algorithm is called the Forward algorithm.
Forward algorithm. The algorithm is based on a forward variable αt (i) denoting the
probability for sub-sequence O0 . . . Ot under the assumption that the stochastic process is
in state i at time t:
αt(i) = P(O0 O1 . . . Ot , St = si | λ) .   (4.8)
αt(i) can be computed by the following recursive computation scheme:

α0(i) = πi bsi(O0)
αt(j) = Σ_{i=1}^{N} αt−1(i) aij bsj(Ot) ,   1 ≤ t ≤ L .   (4.9)
The algorithm can be visualized by a trellis structure as shown in Figure 4.3. Each node
represents one αt (i) while edges visualize the terms of the sum in Equation 4.9. The
trellis can be computed from left to right, from which the name “forward algorithm” is
derived.
As αL (i) is the probability of the entire sequence together with the fact that the
stochastic process is in state i at the end of the sequence, sequence likelihood P (o|λ)
can be computed by summing over all states in the rightmost column of the trellis:
P(o | λ) = Σ_{i=1}^{N} αL(i) ,   (4.10)

which is the solution to the first problem.
Figure 4.3: A trellis to visualize the forward algorithm. Bold edges indicate the terms that have
to be summed up in order to compute αt (i)
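A direct transcription of the recursion in Equation 4.9 is shown below; the two-state model parameters are toy values for illustration.

```python
# Sketch of the forward algorithm (Equations 4.9 and 4.10). obs is a list
# of symbol indices; pi, A, B are the HMM parameters lambda.

def forward(pi, A, B, obs):
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]          # alpha_0(i)
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                 for j in range(N)]                           # alpha_t(j)
    return sum(alpha)   # sequence likelihood P(o | lambda), Equation 4.10

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
likelihood = forward(pi, A, B, [0, 1, 0])
```

Note that the cost is O(N²L) instead of the O(N^{L+1}) of the naive sum in Equation 4.7; practical implementations additionally rescale the α values to avoid numerical underflow.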
Backward Algorithm. A backward variable βt (i) can be defined in a similar way,
denoting the probability of the rest of the sequence Ot+1 . . . OL given the fact that the
stochastic process is in state i at time t:
βt(i) = P(Ot+1 . . . OL | St = si , λ) .   (4.11)

βt(i) can be computed in a similar recursive way by:

βL(i) = 1   (4.12)
βt(i) = Σ_{j=1}^{N} aij bsj(Ot+1) βt+1(j) ,   0 ≤ t ≤ L − 1 .   (4.13)
Forward-backward algorithm. Combining αt(i) and βt(i) leads to an estimate of the probability that the process is in state si at time t, given an observation sequence o. This probability is denoted by

γt(i) = P(St = si | o, λ) .   (4.14)
Some computations yield:

P(St = si | O0 . . . OL , λ) = P(St = si , O0 . . . Ot Ot+1 . . . OL | λ) / P(O0 . . . OL | λ)   (4.15)
= [ P(St = si , O0 . . . Ot | λ) P(Ot+1 . . . OL | St = si , λ) ] / P(O0 . . . OL | λ)   (4.16)
= αt(i) βt(i) / P(O0 . . . OL | λ) ,   (4.17)

and hence γt(i) can be computed by:

γt(i) = αt(i) βt(i) / P(o | λ) = αt(i) βt(i) / Σ_{i=1}^{N} αt(i) βt(i) .   (4.18)
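The backward recursion and the computation of γ can be sketched analogously, using the same toy parameters as in the forward sketch.

```python
# Sketch of the backward recursion (Eqs. 4.12-4.13) and the state
# probabilities gamma_t(i) (Eq. 4.18), combining alpha and beta.

def backward(A, B, obs):
    N = len(A)
    beta = [[1.0] * N]                       # beta_L(i) = 1
    for t in range(len(obs) - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * nxt[j]
                            for j in range(N)) for i in range(N)])
    return beta                              # beta[t][i]

def gammas(pi, A, B, obs):
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    beta = backward(A, B, obs)
    result = []
    for t in range(len(obs)):
        w = [alpha[t][i] * beta[t][i] for i in range(N)]
        z = sum(w)                           # equals P(o | lambda), Eq. 4.18
        result.append([x / z for x in w])
    return result

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
g = gammas(pi, A, B, [0, 1, 0])
```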
Viterbi algorithm. The forward-backward variable γt (i) does not yet solve the second
problem completely, since γt (i) solves for the most probable state at one point in time but
the task is to find the most probable sequence of states. A straightforward solution would
be to select the most probable state at each time step t:
Smax(t) = arg max_i γt(i) .   (4.19)
However, it turns out that there exist models for which some transitions from Smax(t) to Smax(t + 1) are not possible (i.e., the transition probability aij equals zero). This is due to the fact that α and β both combine all possible paths through the states of the DTMC, and γ is only the product of α and β.
One solution to this problem is the Viterbi algorithm. Very similar to αt(i), let δt(i) denote the probability of the most probable state sequence for the sub-sequence of observations O0 . . . Ot that ends in state si:

δt(i) = max_{S0 ... St−1} P(O0 . . . Ot , S0 . . . St−1 , St = si | λ) .   (4.20)
δt(i) can be computed by a slight modification of the forward algorithm, using the maximum operator instead of the sum over all states:

δ0(i) = πi bsi(O0)   (4.21)
δt(j) = max_{1≤i≤N} δt−1(i) aij bsj(Ot) ,   1 ≤ t ≤ L .   (4.22)
In order to identify the states that contributed to the most probable sequence, each state selected by the maximum operator has to be stored in a separate array. The sequence can then be reconstructed by tracing backwards through the array, starting from state arg max_i δL(i).
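The Viterbi recursion with the backtracking array can be sketched as follows (toy parameters; argmax ties are broken towards the lower state index).

```python
# Sketch of the Viterbi algorithm (Eqs. 4.21-4.22) with backtracking.

def viterbi(pi, A, B, obs):
    N = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]
    back = []
    for t in range(1, len(obs)):
        # store, for each successor j, the maximizing predecessor state
        back.append([max(range(N), key=lambda i: delta[i] * A[i][j])
                     for j in range(N)])
        delta = [delta[back[-1][j]] * A[back[-1][j]][j] * B[j][obs[t]]
                 for j in range(N)]
    # reconstruct the most probable state sequence by tracing backwards
    state = max(range(N), key=lambda i: delta[i])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.insert(0, state)
    return path

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
path = viterbi(pi, A, B, [0, 0, 1])
```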
4.1.2 Training: The Baum-Welch Algorithm
In the forward-backward algorithm, the HMM parameters λ were assumed to be fixed and known. However, in the majority of applications, λ cannot be inferred analytically but needs to be estimated from recorded sample data. In the machine learning community, such a procedure is called training. Several algorithms exist for HMM training, of which the Baum-Welch algorithm is the most prominent.
In terms of HMMs, the goal of training is to maximize sequence likelihood for the training sequences. More precisely, the parameters π, A, and B have to be set such that Equation 4.10 is maximized. For convenience, only a single training sequence is considered here; the case of multiple sequences is discussed later.
The algorithm can be understood most easily by first considering a simpler case where the sequence of "hidden" states is known. This occurs, e.g., in part-of-speech tagging2 applications. In this case, the parameters of the HMM can be optimized by maximum likelihood estimates:
• Initial state probabilities πi are determined by the relative frequency of sequences starting in state si:

π̂i = (number of sequences starting in si) / (total number of sequences) .   (4.23)

2 See, e.g., Manning & Schütze [175]
• Transition probabilities aij are determined by the number of times the process went from state si to state sj, divided by the number of times the process left state si to anywhere:

âij = (number of transitions si → sj) / (number of transitions si → ?) .   (4.24)
• Emission probabilities bsi(oj) are determined by the number of times the process has generated symbol oj in state si, compared to the number of times the process has been in state si:

b̂i(oj) = (number of times symbol oj has been emitted in state si) / (number of times the process has been in state si) .   (4.25)
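When the state sequences are known, the three estimates in Equations 4.23 to 4.25 reduce to simple counting, as this sketch illustrates (function and variable names are illustrative).

```python
# Sketch of the supervised maximum-likelihood estimates (Eqs. 4.23-4.25)
# from training examples where states and emitted symbols are both known.

def ml_estimates(state_seqs, obs_seqs, N, M):
    pi = [0.0] * N
    trans = [[0.0] * N for _ in range(N)]
    emit = [[0.0] * M for _ in range(N)]
    for states, obs in zip(state_seqs, obs_seqs):
        pi[states[0]] += 1                      # Eq. 4.23: count start states
        for a, b in zip(states, states[1:]):
            trans[a][b] += 1                    # Eq. 4.24: count transitions
        for s, o in zip(states, obs):
            emit[s][o] += 1                     # Eq. 4.25: count emissions
    pi = [c / len(state_seqs) for c in pi]
    A = [[c / sum(row) if sum(row) else 0.0 for c in row] for row in trans]
    B = [[c / sum(row) if sum(row) else 0.0 for c in row] for row in emit]
    return pi, A, B

pi_hat, A_hat, B_hat = ml_estimates([[0, 0, 1]], [[0, 1, 1]], N=2, M=2)
```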
However, in many applications the sequence of states is not known. The solution found by Baum and Welch introduces expectation values for the unknown quantities. The algorithm belongs to the class of Expectation-Maximization (EM) algorithms.3 It consists of two major steps:
1. Expectation step: compute estimates for the unknown data (state probabilities) using the current set of model parameters.
2. Maximization step: adjust the model parameters to maximize data likelihood, using the estimates for the unknown data from the expectation step.
This scheme is repeated until sequence likelihood converges. It can be proven (see Section 6.5) that at least a local maximum is found. In the following paragraphs, both steps are described in more detail.
Expectation-Step. Let Xt(i, j) denote the binary random variable indicating whether the transition taking place at time t passes from state si to sj or not. The expected value of Xt(i, j) is equal to the probability that Xt(i, j) is one.4 Let ξt(i, j) denote this probability (given an observation sequence o):

ξt(i, j) = P(St = si , St+1 = sj | o, λ) .   (4.26)
ξt(i, j) can be computed similarly to Equations 4.15–4.17 by interposing the transition from si to sj between α and β:

ξt(i, j) = αt(i) aij bsj(Ot+1) βt+1(j) / P(o | λ)   (4.27)
= αt(i) aij bsj(Ot+1) βt+1(j) / [ Σ_{i=1}^{N} Σ_{j=1}^{N} αt(i) aij bsj(Ot+1) βt+1(j) ] .   (4.28)
This approach can also be visualized in a trellis as shown in Figure 4.4.
While ξt(i, j) is the expected value that a transition i → j takes place at time t, the expected value for the total number of transitions from si to sj is

E[ Σ_t Xt(i, j) ] = Σ_{t=0}^{L−1} ξt(i, j) .   (4.29)

3 A more detailed discussion is given along with the proof of convergence for HSMMs, see Section 6.5
4 E[X] = Σ_X X·P(X) = 0·P(X = 0) + 1·P(X = 1) = P(X = 1)
Figure 4.4: A trellis visualizing the computation of ξt (i, j)
Note that summing up ξt(i, j) over all destination states sj yields the probability for the source state si at time t:

Σ_{j=1}^{N} ξt(i, j) = γt(i) .   (4.30)
The expectation step requires knowledge of model parameters λ which are either known
from (random) initialization or a previous iteration of the EM algorithm.
Maximization-Step. The second step of the Baum-Welch algorithm is a maximum likelihood optimization of the parameters λ based on the expected values estimated in the first step:

π̄i ≡ (expected number of sequences starting in state si) / (total number of sequences) ≡ γ0(i)   (4.31)

āij ≡ (expected number of transitions si → sj) / (expected number of transitions si → ?) ≡ Σ_{t=0}^{L−1} ξt(i, j) / Σ_{t=0}^{L−1} γt(i)   (4.32)

b̄i(k) ≡ (expected number of times observing ok in state si) / (expected number of times in state si) ≡ Σ_{t=0, s.t. Ot=ok}^{L−1} γt(i) / Σ_{t=0}^{L−1} γt(i) .   (4.33)
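One E-step/M-step iteration for a single sequence can be sketched as follows, with toy parameters and without the rescaling a real implementation would use to avoid numerical underflow; training would repeat this step until the likelihood converges.

```python
# Sketch of one Baum-Welch iteration (Eqs. 4.27-4.33) for a single sequence.

def baum_welch_step(pi, A, B, obs):
    N, L = len(pi), len(obs)
    # E-step: forward and backward variables (Eqs. 4.9 and 4.13)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, L):
        alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    beta = [[1.0] * N for _ in range(L)]
    for t in range(L - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                       for j in range(N)) for i in range(N)]
    po = sum(alpha[-1])                      # P(o | lambda)
    gamma = [[alpha[t][i] * beta[t][i] / po for i in range(N)] for t in range(L)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / po
            for j in range(N)] for i in range(N)] for t in range(L - 1)]
    # M-step: re-estimation (Eqs. 4.31-4.33)
    new_pi = gamma[0]
    new_A = [[sum(xi[t][i][j] for t in range(L - 1)) /
              sum(gamma[t][i] for t in range(L - 1))
              for j in range(N)] for i in range(N)]
    M = len(B[0])
    new_B = [[sum(gamma[t][i] for t in range(L) if obs[t] == k) /
              sum(gamma[t][i] for t in range(L))
              for k in range(M)] for i in range(N)]
    return new_pi, new_A, new_B

new_pi, new_A, new_B = baum_welch_step(
    [0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]], [[0.9, 0.1], [0.2, 0.8]], [0, 1, 0])
```

Note how the re-estimated π, A, and B remain stochastic (rows sum to one), which follows from Equation 4.30.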
Notes on the Baum-Welch algorithm. Initial point for the Baum-Welch algorithm is
a completely initialized HMM. This means that the number of states, number of observation symbols, transition probabilities, initial probabilities and observation probabilities need to be defined. The algorithm then iteratively improves the model’s parameters
λ = (A, B, π) until a (local) maximum in sequence likelihood is reached.5 In each
5
For implementation, a maximum number of iterations is often used as an additional stopping criterion.
4.2 Sequences in Continuous Time
63
M-step, the expectation values of the previous E-step are used and vice versa. Several
properties of the algorithm can be derived from that:
• The number of states and the size of the alphabet are not changed by the algorithm.
• The model structure is not altered during the training process: if there is no transition from state si to sj (aij = 0), the Baum-Welch algorithm will never change
this.
• Initialization should exploit as much a-priori knowledge as possible. If none is available, random initialization can be used.
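As an illustration of Equations 4.31 to 4.33, the following sketch performs one M-step from already computed expected counts for a single sequence. The function name `m_step` and all numbers are illustrative, not from this thesis; in a real implementation, γ and ξ would come from the forward-backward computation of the E-step.

```python
# Sketch of one M-step (Eqs. 4.31-4.33) for a single observation sequence.
# gamma[t][i] and xi[t][i][j] are assumed to come from the E-step.

def m_step(gamma, xi, obs, num_symbols):
    """Re-estimate (pi, A, B) from expected counts."""
    L = len(gamma)      # number of observations
    N = len(gamma[0])   # number of hidden states

    # Eq. 4.31: initial probabilities = expected state occupancy at t = 0
    pi = list(gamma[0])

    # Eq. 4.32: expected transitions i -> j over expected transitions out of i
    # (a length-L sequence has L - 1 transition slots)
    A = [[sum(xi[t][i][j] for t in range(L - 1)) /
          sum(gamma[t][i] for t in range(L - 1))
          for j in range(N)] for i in range(N)]

    # Eq. 4.33: expected emissions of symbol k in i over expected visits to i
    B = [[sum(gamma[t][i] for t in range(L) if obs[t] == k) /
          sum(gamma[t][i] for t in range(L))
          for k in range(num_symbols)] for i in range(N)]
    return pi, A, B

# Toy expected counts, consistent with Eq. 4.30: sum_j xi[t][i][j] == gamma[t][i]
gamma = [[0.6, 0.4], [0.5, 0.5], [0.7, 0.3]]
xi = [[[0.4, 0.2], [0.1, 0.3]],
      [[0.3, 0.2], [0.4, 0.1]]]
pi, A, B = m_step(gamma, xi, obs=[0, 1, 0], num_symbols=2)
```

Note that π and the rows of A and B are probability distributions by construction, since Equation 4.30 ties ξ and γ together.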
Training with Multiple Sequences. The formulas presented here consider only a single observation sequence, although in most applications there is a large set of training sequences. The main idea of multiple-sequence training is that the numerators and denominators of Equations 4.31 to 4.33 are turned into sums over all sequences ok, each term scaled by 1/P(ok | λ), which is computed along with the E-step of the algorithm.
4.2 Sequences in Continuous Time
Error events occur on a continuous time scale, but so far HMMs have included no notion of time. In this section, four approaches to incorporating time into HMMs are introduced, followed by a review of approaches that have been published on the topic.
An observation sequence is assumed to be an event-driven sequence consisting of symbols from a finite, countable set. Such sequences are called temporal sequences. Sequences of length L + 1 are considered, where the first symbol occurs at time t0 and the last at time tL, as shown in Figure 4.5. In order to clarify notation, let
Figure 4.5: Notation for an event-driven temporal sequence. The sequence consists of symbols A, B and C that occur at times t0, . . . , tL. The delay between two successive symbols is denoted by dk.
• {o1 , . . . , oM } denote the set of symbols that can potentially occur, which is
{A, B, C} in the example.
• Ok denotes the symbol that has occurred at time tk and
• dk denotes the length of the time interval tk − tk−1 .
4.2.1 Four Approaches to Incorporate Continuous Time
The notion of continuous time can be incorporated into HMMs in four ways:
1. Time can be divided into equidistant time slots
2. Delays can be represented by delay symbols
3. Events and delays can be represented by two-dimensional outputs
4. A time-varying stochastic process can be used.
The following paragraphs investigate each approach and discuss its properties.
Time slots. Time is divided into non-overlapping intervals of equal length, as shown in Figure 4.6. Since hidden Markov models generate a symbol in each time step, time slots that do not contain any error symbol need to be "filled" with a special observation indicating "silence". Applying this procedure to the temporal sequence shown in Figure 4.6 results in the observation sequence "A C B S S S A", where S denotes the silence symbol.
Figure 4.6: Incorporating continuous time by division of time into slots of equal length
The simplest way to incorporate time-slotting into HMMs is to introduce state self-transitions: in each time step, there is some probability that the stochastic process transitions back to the same state and hence stays in it (see Figure 4.7).
Figure 4.7: Duration modeling by a discrete-time HMM with self-transitions.
This approach leads to a geometric distribution of state sojourn times, since the probability of staying in state si for d time steps equals

    Pi(D = d) = aii^(d−1) (1 − aii) .    (4.34)

Time-slotting has the following characteristics:
+ Standard HMMs can be used.
+ There is almost no increase in computational complexity.
– Time slot size is critical. If it is too small, long delays must be represented by
repetitions of the silence symbol as can be seen in the example. The geometric
delay distribution leads to poor modeling quality in most cases.6 On the other hand,
if time slot size is too large, more than one event will probably occur within a time
slot. There are several solutions to this issue including the definition of additional
symbols representing combined events, dropping of events or assignment to the
next “free” slot. However, all of these solutions have their problems. In general, if
the length of inter-symbol intervals varies greatly, time slots cannot represent the
temporal behavior of event sequences appropriately.
– Time resolution is reduced to the size of time slots since it is no longer known when
exactly an event has occurred within the time slot. This is true especially for the
case of long time slot intervals.
For these reasons, time slotting does not appear appropriate for online failure prediction.
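The geometric sojourn-time distribution of Equation 4.34, which underlies the poor modeling quality mentioned above, can be evaluated numerically. The self-transition probability below is an arbitrary example value, not taken from the thesis.

```python
def sojourn_pmf(a_ii, d):
    """P(D = d) = a_ii^(d-1) * (1 - a_ii), cf. Eq. 4.34."""
    return a_ii ** (d - 1) * (1.0 - a_ii)

a_ii = 0.8
pmf = [sojourn_pmf(a_ii, d) for d in range(1, 6)]
# The distribution always peaks at d = 1 and decays monotonically, so long
# sojourn times can only be represented at the price of heavy probability
# mass on short ones.
```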
Delay symbols. A second approach to incorporating inter-event delays is to define a set
of delay symbols representing delays of various lengths. The sequence shown in Fig. 4.6
could then be represented by, e.g., “A S1 C S1 B S3 A”. An evaluation of this approach
shows that:
+ In comparison to time slotting, representation of time is improved since “chains”
of silence symbols are avoided. If delays are represented on a logarithmic scale, a
wide range of inter-symbol delays can be handled.
+ The approach can be implemented using a completely discrete environment such
that standard implementations of HMMs can be used.
– The structure of HMMs must be adapted: since event and delay symbols alternate, there must be two distinct sets of states, one generating event symbols and the other generating delay symbols. This results in increased computational complexity (see Figure 4.8).
Figure 4.8: Representing time by delay symbols. States Ei generate error observation symbols and states Di generate delay symbols
– The internal (hidden) stochastic process does not represent the properties of the
stochastic process that originally generated the observation sequence.
– Time resolution is even worse than for time slotting due to the fact that one symbol
accounts for long time intervals.
6 The effect can be reduced by introducing silence sub-models, which are out of scope here.
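The delay-symbol encoding can be sketched as follows. The logarithmic base and the symbol names S1, S2, … are illustrative choices; the thesis does not prescribe a particular mapping.

```python
import math

# Map each inter-event delay to a delay symbol on a logarithmic scale, so a
# temporal sequence becomes an alternating event/delay symbol sequence.

def delay_symbol(delay, base=10.0):
    """Map a positive delay to a logarithmically scaled delay symbol."""
    return "S%d" % (int(math.log(delay, base)) + 1)

def encode(events):
    """events: list of (symbol, timestamp) pairs, sorted by time."""
    out = [events[0][0]]
    for (_, t_prev), (sym, t) in zip(events, events[1:]):
        out.append(delay_symbol(t - t_prev))
        out.append(sym)
    return out

seq = encode([("A", 0.0), ("C", 5.0), ("B", 12.0), ("A", 500.0)])
# -> ["A", "S1", "C", "S1", "B", "S3", "A"]
```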
Figure 4.9: Delay representation by two-dimensional output probability distributions
Two-dimensional output symbols. Hidden Markov models permit the use of multidimensional output symbols. Hence the temporal sequence can be represented by tuples consisting of the event type and the delay to the previous event. The example sequence of Figure 4.6 would then be represented as (A, d0) (C, d1) (B, d2) (A, d3), where d0 is not relevant. In such a representation, observation probabilities are two-dimensional: one dimension is discrete, representing event symbols, while the second is continuous, representing inter-event delays, as shown in Figure 4.9. Output probabilities have to obey

    ∀i : Σ_{j=1}^{M} ∫_0^∞ bsi(oj, τ) dτ = 1 .    (4.35)
An assessment of the method yields:
+ Lossless representation of the temporal sequence with, in principle, unlimited time resolution.
– The internal (hidden) stochastic process does not represent temporal properties of
the stochastic process that originally generated the observation sequence. This is a
problem especially when future behavior of the stochastic process is to be predicted.
– Public implementations exist, to the best of our knowledge, only for purely discrete or purely continuous outputs. Hence an implementation would require the development or adaptation of a new toolkit.
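Constraint 4.35 can be checked numerically for a hypothetical two-dimensional output distribution that factors into a discrete symbol weight times an exponential delay density. All parameter values below are invented for illustration.

```python
import math

weights = [0.5, 0.3, 0.2]   # discrete part: one weight per symbol A, B, C
rates = [1.0, 0.5, 2.0]     # delay densities: lam * exp(-lam * tau)

def b(j, tau):
    """Two-dimensional output: discrete symbol weight times delay density."""
    return weights[j] * rates[j] * math.exp(-rates[j] * tau)

# Riemann-sum approximation of sum_j integral_0^inf b(j, tau) dtau (Eq. 4.35)
dt, horizon = 0.001, 40.0
total = sum(b(j, k * dt) * dt
            for j in range(len(weights))
            for k in range(int(horizon / dt)))
# total is close to 1, since the discrete weights sum to 1 and each delay
# density integrates to 1
```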
Time-varying internal process. The fourth approach is to incorporate the temporal behavior of the stochastic process that originally generated the observation sequence directly
into the stochastic process of hidden state transitions. For example, a straightforward solution is to replace the internal DTMC by a continuous-time Markov chain (CTMC), which
is able to handle transitions of arbitrary durations since transition probabilities are defined
by exponential probability distributions P (t). Such an approach results in:
+ Lossless representation of the temporal sequence
+ The internal stochastic process can (at least in part) mimic the stochastic process
that originally generated the observation sequence
– Although various extensions to time-varying processes have been published (see
next section), to our knowledge no publicly available toolkit exists.
Summary. Error event sequences are temporal sequences. Four approaches have been described for incorporating continuous time into HMMs. From the discussion it follows that the most promising approach is to incorporate time variation directly into the hidden stochastic process, which is the approach taken in this thesis. Since various solutions for incorporating time variation into the stochastic process exist, related work with this focus is presented in the following.
4.3 Related Work on Time-Varying Hidden Markov Models
A few decades ago, applying standard discrete HMMs was the only way to arrive at a feasible (i.e., real-time) solution, even in application domains where temporal behavior is important. One such domain is speech recognition, where, e.g., phoneme durations vary statistically. Since it was quickly observed that continuous-time models can improve modeling performance significantly (see, e.g., Russell & Cook [218]), and due to increasing available computing power, more and more time-varying models have been published. The development was mainly driven by the speech recognition research community, but time-varying models have also been applied to other domains such as web-workload modeling [284]. The following sections give an overview of the various classes of time-varying HMMs.
Continuous Time Hidden Markov Models
Incorporating time variance into HMMs by replacing the internal (hidden) DTMC process by a continuous-time Markov chain (CTMC) has been described by Wei et al. [274]. The resulting model is abbreviated CT-HMM and should not be confused with continuous HMMs (CHMMs), which are discrete-time HMMs with continuous output probability densities. Like DTMCs, CTMCs are determined by an initial distribution, but the transition matrix A is replaced by an infinitesimal generator matrix Q. Determination of the infinitesimal generator matrix Q follows a two-step approach: first, a transition matrix P(∆) and the initial distribution are estimated from the training data by Baum-Welch training. Then Q is obtained by Taylor expansion of the equation

    Q = (1/∆) ln(P) ,    (4.36)

which can be derived directly from Kolmogorov's equations (see, e.g., Cox & Miller [67]). ∆ denotes some minimal delay (a time step).
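The two-step estimation can be sketched as follows. The transition matrix P and the truncation depth of the Taylor series ln(P) = Σ_{k≥1} (−1)^{k+1} (P − I)^k / k are illustrative; the series converges only when P is sufficiently close to the identity.

```python
# Approximate the generator Q = (1/Delta) * ln(P) of a CT-HMM via the Taylor
# series of the matrix logarithm, using plain nested lists for clarity.

def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def generator(P, delta, terms=50):
    n = len(P)
    # M = P - I
    M = [[P[i][j] - (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    Q = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in M]           # holds M^k, starting at k = 1
    for k in range(1, terms + 1):
        sign = 1.0 if k % 2 == 1 else -1.0  # (-1)^(k+1)
        for i in range(n):
            for j in range(n):
                Q[i][j] += sign * power[i][j] / k
        power = mat_mul(power, M)
    return [[q / delta for q in row] for row in Q]

P = [[0.95, 0.05], [0.10, 0.90]]            # made-up transition matrix
Q = generator(P, delta=1.0)
# Rows of a generator matrix sum to zero; off-diagonal entries are positive.
```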
Hidden Semi-Markov Models
Models such as CT-HMMs imply strong assumptions about the underlying stochastic process, since CTMCs are based on exponential distributions, which are time-homogeneous and memoryless. A more powerful approach to continuous-time HMMs is to substitute a semi-Markov process (SMP) for the underlying DTMC, which allows arbitrary probability distributions to be used for the specification of each transition's duration.7 The resulting models are called Hidden Semi-Markov Models (HSMMs).

7 The only requirement is that it depends solely on the two states of the transition. A precise definition is given in Chapter 6.
Figure 4.10: Duration modeling by explicit modeling of state durations
A first approach to HSMMs is to replace the self-transitions of Figure 4.7 with state durations that follow a state-specific probability distribution pi(d), as depicted in Figure 4.10. Several solutions have been developed to explicitly specify and estimate pi(d) from training data along with the Baum-Welch algorithm.
Ferguson’s model. One of the first approaches to explicit state duration modeling was proposed by Ferguson [96] in the year 1980.8 The idea was to use a discrete probability distribution for pi(d). While the approach was very flexible, it showed three disadvantages: first, it is a discrete-time model requiring the definition of a time step ∆ and a maximum delay D; second, convergence of the training algorithm was slow; and third, much more training data was needed. The last two drawbacks result from a dramatically increased number of parameters that have to be estimated from the training data: the number of parameters increases from N self-transition probabilities to N × D duration probabilities. Mitchell et al. [182] extend the approach to transition durations and propose a training algorithm with reduced complexity.
HSMMs with Poisson-distributed durations. In order to reduce the number of parameters, Ferguson already proposed to use parametric distributions instead of discrete ones. So did Russell & Moore [219], who used Poisson distributions. A comparison of both models showed that the Poisson-distributed model performs better when an insufficient amount of training data is available [218].
HSMMs with gamma-distributed durations. Levinson [161] provided a maximum likelihood estimation for the parameters of gamma-distributed durations. As is the case with most maximum likelihood procedures, optimal parameters are obtained by differentiation of the likelihood function. However, this derivative cannot be computed explicitly, and numerical approximation has to be applied. Azimi et al. [18] apply HSMMs with gamma-distributed durations to signal processing but adjust the duration parameters from the estimated mean and variance of durations in the training data set.
HSMMs with durations from the exponential family. Mitchell & Jamieson [183] extended the spectrum of available distributions for explicit duration modeling to all distributions of the exponential family, which includes gamma distributions. Their work is also
founded on a direct computation of maximum likelihood involving numerical approximation of the maximum.
8 A crisp overview can be found in Rabiner [210].
HSMMs with Viterbi-path-constrained uniform distributions. Kim et al. [145] present an approach where transition durations are assumed to be uniformly distributed. Their key idea is that the parameters π, A and B are first obtained by the standard discrete-time HMM reestimation procedure explained in Section 4.1. A subsequent step computes Viterbi paths for the training data in order to identify minimum and maximum durations for each transition; these define a uniform duration distribution for each transition.
Expanded State HMMs (ESHMMs). In parallel to the development of HSMMs with parametrized probability distributions, it was found that Ferguson’s model can be implemented much more easily by a series-parallel topology of the hidden states (Cook & Russell [65]). To be precise, each state of the HMM is replaced by a DTMC whose states share the same emission probability distribution. State durations are then expressed by the transition probabilities of the DTMC. Figure 4.11 shows a small example for an HMM with left-to-right topology. Such models are called Expanded State HMMs (ESHMMs).
Figure 4.11: Topology of an Expanded State HMM (ESHMM). The model represents discrete state duration probabilities pi(d) by discrete-time Markov chains. Emission probabilities bsi(oj) have been omitted.
The benefit of ESHMMs is that they can be implemented using standard discrete-time HMM toolkits. Furthermore, the idea of representing state durations by state chains led to several variants extending Ferguson’s model. For example, the duration Markov chain may have self-transitions that allow durations of arbitrary length to be modeled instead of a fixed maximum duration D. Some structures have been proposed by Noll & Ney [195] and Pylkkönen [206], and a comparison of two extended structures is provided by Russell & Cook [218]. More elaborate training algorithms for ESHMMs have been proposed by Wang [270] and Bonafonte et al. [33].
Segmental HMMs. Segmental HMMs are used to model sequences whose behavior
changes in epochs. It is assumed that there is some outer stochastic process determining
the “type” of the segment. Some discrete duration is chosen specifying the length of the
epoch. Once the type and the duration of the epoch are fixed, an inner stochastic process
determines the behavior for the segment. Examples for such models can be found in Ge
[102] and Russell [217].
Hidden Semi-Markov Event Sequence Model (HSMESM). In [93], Faisan et al. have presented a hidden semi-Markov model for the modeling of functional magnetic resonance imaging (fMRI) sequences.9 The key idea with respect to temporal modeling is that discrete duration probabilities are stored for each transition rather than for each state. However, the model is specifically targeted at fMRI.
Inhomogeneous HMMs (IHMMs). Ramesh & Wilpon [211] have developed another variant of HMMs, called Inhomogeneous HMM (IHMM). Time homogeneity of a stochastic process refers to the property that its behavior (i.e., its probability distributions) does not change over time. In terms of Markov chains, this means that the transition probabilities aij are constant and not a function of time. However, the authors abandon this assumption and define

    aij(d) = P(St+1 = j | St = i, dt(i) = d),    1 ≤ d ≤ D ,    (4.37)

which is the transition probability from state si to state sj given that the duration dt(i) in state si at time t equals d. In order to define a proper stochastic process, the transition probabilities must satisfy

    ∀d ∈ {1, . . . , D} : Σ_{j=1}^{N} aij(d) = 1 .    (4.38)

As can be seen from the formulas, Ramesh & Wilpon also assume discretized time and a maximum state duration D.
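The IHMM definition can be made concrete with a toy duration-dependent transition table; the numbers below are invented for illustration.

```python
# Duration-dependent transition probabilities a[d][i][j] (cf. Eqs. 4.37-4.38):
# the probability of leaving a state changes with the time d already spent in it.

D = 3  # maximum state duration
a = [
    [[0.9, 0.1], [0.2, 0.8]],   # d = 1: state 0 is unlikely to be left
    [[0.6, 0.4], [0.5, 0.5]],   # d = 2
    [[0.1, 0.9], [0.7, 0.3]],   # d = 3: state 0 is likely to be left
]

# Eq. 4.38: for every duration d, each row must be a probability distribution
normalized = all(abs(sum(a[d][i]) - 1.0) < 1e-9
                 for d in range(D) for i in range(2))
```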
4.4 Summary
The approach to online failure prediction taken in this thesis is to use HMMs as a pattern recognition tool for error sequences, which are event-driven sequences in continuous time with symbols from a finite, countable set. Such sequences are called temporal sequences. This chapter has introduced the theory of standard HMMs and has identified four ways in which sequences in continuous time can be handled by HMMs. From this discussion it followed that the most promising solution is to turn the stochastic process of hidden state traversals into a time-varying process. Since this idea is not new, related work on previous extensions has been presented.
Most domains to which standard hidden Markov models have been applied are characterized by
• equidistant / periodic occurrence of observation symbols caused by sampling. This defines a minimum time step size such that all temporal aspects can be expressed as integer multiples of the sampling interval.
• a maximum duration. For example in speech recognition, phonemes, syllables, etc.
can well be assumed to have limited duration.
However, these assumptions do not hold for online failure prediction based on error events: observation symbols can occur on a continuous time scale, and delays between errors can range from very short to very long time intervals. Therefore, none of the continuous-time extensions presented in this chapter seems appropriate for failure prediction. The extended hidden semi-Markov model proposed in this dissertation differs from existing solutions in the following aspects:
9 More details on the model can be found in Thoraval [255].
1. The model operates on true continuous time instead of multiples of a minimum time
step size. This feature circumvents the problems associated with time-slotting and
is advantageous if sequences show a great variability of inter-event delays as is the
case for the log data used in the case study.
2. There is no maximum duration D. The model can handle very long inter-error
delays with the same computational overhead as short delays.
3. The model allows the use of a great variety of parametric transition probability distributions. More specifically, every parametric continuous distribution for which the gradient of the density with respect to the parameters can be computed is applicable. This includes well-known distributions such as the Gaussian, exponential, and gamma distributions. The advantage of this feature is that transition duration distributions can be adapted to the delays occurring in the system rather than assuming some distribution a-priori. Furthermore, the model allows the use of background distributions, which helps to deal with noise in the data.
4. The model allows transition durations to be specified rather than state durations. The widely used state durations are a special case in which all transitions are identically distributed. This feature alleviates the Markov restriction that the process depends only on the current state.
Although some of the models presented in this chapter share some of these features, the proposed model is the first to provide the combination of all four properties, which, as will be seen later, proves to be beneficial.
Contributions of this chapter. First, four ways to incorporate continuous time into HMMs have been identified and discussed. Second, this chapter appears to be the first work to present a summary of the state of the art for continuous-time extensions of hidden Markov models.
Relation to other chapters. This chapter concludes the first phase of the engineering
cycle, which has focused on a problem statement, identification of key properties and
related work. The second phase focuses on a proper formalization of the approach that
has been sketched in Figure 2.9 on Page 19 and Figure 2.10 on Page 20, respectively.
More specifically, formalization of the approach includes data preprocessing (Chapter 5),
the hidden semi-Markov model (Chapter 6), and classification (Chapter 7).
Part II
Modeling
Chapter 5
Data Preprocessing
The overall approach to online failure prediction consists of several steps, of which data preprocessing is the first. It is applied both for training, i.e., the estimation of model parameters, and for online prediction. In Section 5.1, some known concepts of error-log preprocessing are described. A novel approach to separating failure mechanisms is introduced in Section 5.2, and a statistical method to filter noise is explained in Section 5.3. Finally, in Section 5.4, logfile formatting is discussed and a novel concept of logfile entropy is introduced.
5.1 From Logfiles to Sequences
Error logfiles are a natural source of information if something goes wrong in the system,
and they are frequently used both for diagnosis and online failure prediction.1 This section
describes the necessary steps to get from raw error logs to temporal event sequences used
as input data for the hidden semi-Markov models.
5.1.1 From Messages to Error-IDs
One of the major handicaps with error logfiles is that they are commonly not designed for
automatic processing. Their main purpose is to convey information to human operators to
support quick identification of problems. Hence error logs frequently do not contain any
error-ID. Instead, they consist of error messages in natural language. This also holds for
the error logs of the telecommunication system and hence methods had to be developed
to turn natural language messages into error IDs. The method described here has been
developed together with Steffen Tschirpke.2
The key idea of translating natural language messages into an error ID is to apply
a similarity measure known from text editing to yield a similarity matrix, to cluster the
matrix and to assign an error ID to each cluster. However, even if dedicated log data
1 See category 1.3 in the taxonomy.
2 In fact, he was the one who implemented it and who solved all the real problems regarding this issue.
such as timestamps, etc. are ignored, almost every log message is unique. This is due to
numbers and log-record specific data in messages. For example, the log message:
process 1534: end of buffer reached
will most probably occur only very rarely in an error log since it happens infrequently
that exactly the process with number 1534 will have the same problem. For this reason,
the mapping from error messages to error IDs consists of three steps:
1. All numbers, and log-record specific data such as IP addresses, etc. are replaced by
placeholders. For example, the message shown above is translated into:
process nn: end of buffer reached
2. A 100% complete replacement of all record-specific data is infeasible. Furthermore,
there are even typos in the error messages themselves. Hence, dissimilarities between all pairs of log messages are computed using the Levenshtein distance metric
[11], which measures the number of deletions, insertions and substitutions required
to transform one string into the other.
3. Log messages are grouped by a simple threshold on dissimilarity: All messages
having dissimilarity below the threshold are assigned to one message ID.
The goal of the method described is to assign an ID to textual messages. The ID then forms the so-called error type or symbol. If additional information from log messages, such as thread IDs, is to be used, the various numbers have to be combined into a single error type.
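The three-step mapping described above can be sketched as follows. The placeholder patterns, the greedy grouping strategy, and the threshold value are illustrative choices, not the implementation used for the telecommunication system.

```python
import re

def normalize(msg):
    """Step 1: replace record-specific data by placeholders."""
    msg = re.sub(r"\d+(\.\d+)+", "ip", msg)   # crude dotted-number placeholder
    return re.sub(r"\d+", "nn", msg)          # replace remaining numbers

def levenshtein(a, b):
    """Step 2: edit distance (deletions, insertions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def assign_ids(messages, threshold=5):
    """Step 3: a message joins the first group within the threshold."""
    groups = []   # one representative (normalized) message per error ID
    ids = []
    for msg in messages:
        norm = normalize(msg)
        for eid, rep in enumerate(groups):
            if levenshtein(norm, rep) < threshold:
                ids.append(eid)
                break
        else:
            ids.append(len(groups))
            groups.append(norm)
    return ids

ids = assign_ids([
    "process 1534: end of buffer reached",
    "process 77: end of buffer reached",
    "connection to 10.0.0.1 lost",
])
# the two buffer messages share one ID; the connection message gets another
```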
5.1.2 Tupling
As Iyer & Rosetti [129] have noted, repetitive log records that occur more or less at the same time are frequently multiple reports of the same fault. Hansen & Siewiorek analyzed this property further and presented an illustrative figure, which is reproduced for convenience in Figure 5.1. Please note that terms have been adapted in order to be consistent with other chapters.
Figure 5.1: A fault, once activated, can result in various misbehaviors. Some misbehaviors are not detected, some are detected several times, and sometimes several misbehaviors are caught by one single detection. Due to, e.g., a system crash, not every error may occur as a message in the error log [114].
The figure depicts the process from a fault to the corresponding events in the error log. Once activated, a fault may lead to various misbehaviors in the system. There are four ways in which such misbehavior can be detected:
1. unusual behavior is detected leading to one error
2. unusual behavior is not detected and hence no error occurs
3. unusual behavior is detected by several fault detectors leading to several errors
4. one fault detector detects several misbehaviors resulting in one single error
However, not each error finds its way to the error log. For example, if the fault causes the
logging process or the entire system to crash, the error cannot be written to the logfile.
In order to increase the expressiveness of logfiles, Tsao & Siewiorek [258] introduced a procedure called tupling, which basically refers to the grouping of error events that occur within some time interval or that refer to the same location. However, equating the location reported in an error message with the true location of the fault only works for systems with strong fault containment regions. Since this assumption does not hold for the telecommunication system under consideration, spatial tupling is not considered any further here.
There are two principal approaches to grouping errors in the temporal domain:
1. After some pause, all errors that occur within a fixed interval starting from the first error are grouped, as proposed by Iyer et al. [131].
2. All errors showing an inter-arrival time less than a threshold ε are grouped, as proposed by Tsao & Siewiorek [258].3
Further considerations only refer to the second grouping method. Two problems can arise
when tupling is applied (see Figure 5.2):
1. Error messages that refer to several (unrelated) faults might be combined. According to the paper, this case is called a collision.
2. If an inter-arrival time > ε occurs within the error pattern of one single fault, the pattern is divided into more than one tuple. This effect is called truncation.
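The grouping rule of the second approach is simple to sketch: consecutive errors whose inter-arrival time is below ε fall into the same tuple. Timestamps and ε below are illustrative.

```python
# Temporal tupling after Tsao & Siewiorek: group sorted event timestamps
# whenever the gap to the previous event is smaller than eps.

def tuple_events(timestamps, eps):
    """Group sorted timestamps into tuples of inter-arrival time < eps."""
    tuples = [[timestamps[0]]]
    for prev, t in zip(timestamps, timestamps[1:]):
        if t - prev < eps:
            tuples[-1].append(t)
        else:
            tuples.append([t])
    return tuples

events = [0.0, 0.2, 0.3, 5.0, 5.1, 20.0]
grouped = tuple_events(events, eps=1.0)
# -> [[0.0, 0.2, 0.3], [5.0, 5.1], [20.0]]
```

A larger ε merges the first two groups (a possible collision); a smaller ε splits them further (possible truncation).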
Both the number of collisions and the number of truncations depend on ε. If ε is large, truncation happens rarely but collisions become very likely; if ε is small, the effect is reversed. In order to analyze this relationship, Hansen & Siewiorek [114] have derived a formula for the probability of collision. Assuming that faults are exponentially distributed, the collision probability can be computed as

    Pc(ε) = (1 − e^{−λF ε}) · Σ_j pj e^{−λF lj} ,    (5.1)

where λF is the fault rate and pj denotes the discrete distribution of tuples of length lj estimated from the logfile. However, the fault rate λF is unobservable, and the authors suggest estimating it by the tuple rate λT. The authors have checked their results using two machine-years of data from a Tandem TNS II system and showed that the formula can provide a rough estimate.
3 In Tsao & Siewiorek [258], there is a second, larger threshold to add later events if they are similar, but this is not considered further here.
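Equation 5.1 can be evaluated numerically; the fault rate and the tuple-length distribution below are invented for illustration, with the fault rate standing in for the tuple-rate approximation suggested by Hansen & Siewiorek.

```python
import math

def p_collision(eps, fault_rate, length_dist):
    """Eq. 5.1; length_dist: list of (p_j, l_j) pairs from the logfile."""
    return (1.0 - math.exp(-fault_rate * eps)) * \
           sum(p * math.exp(-fault_rate * l) for p, l in length_dist)

length_dist = [(0.7, 0.5), (0.2, 2.0), (0.1, 10.0)]  # tuple-length distribution
rate = 0.01                                          # faults per second
probs = [p_collision(eps, rate, length_dist) for eps in (1, 10, 100)]
# collision probability grows monotonically with eps
```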
Figure 5.2: For demonstration purposes, the first and second time lines depict the error patterns of two faults separately. The bottom line shows what is observed in the error log. Errors are grouped if their inter-arrival time is less than ε; each group defines a tuple (shaded areas). Truncation occurs if the inter-arrival time for one fault is > ε. However, a large ε leads to collisions if events of other faults occur earlier than ε [114].
As stated above, reducing the number of collisions by lowering ε increases the number of truncations. However, truncation is much more complicated to identify, since it is mostly difficult to tell whether some error occurring much later (> ε) belongs to the same fault or to another. Therefore, the authors suggest the following strategy: plotting the number of tuples over ε yields an L-shaped curve, as shown in Figure 5.3. If ε equals zero, the number of tuples equals the number of error events in the logfile. As ε is increased, the number first drops quickly; at some point, the curve suddenly flattens. Choosing ε slightly above this point seems optimal. The rationale behind this procedure is the assumption that, on average, there is a small gap between the errors of different faults: if ε is large enough to capture all errors belonging to one fault, the number of resulting tuples decreases more slowly as ε is further increased.
Figure 5.3: Plotting the number of tuples over time window size ε yields an L-shaped curve [114].
Current research aims at quantifying temporal and spatial tupling. For example, Fu & Xu [99] introduce a correlation measure for this purpose, but since this research is at an early stage, such a measure has not been applied here. For the rest of this chapter, it is assumed that both error-ID assignment and tupling have been applied.
5.1.3 Extracting Sequences
The hidden Markov models are trained using either failure or non-failure sequences, as shown in Figure 2.9 on Page 19. A failure sequence is defined as a temporal sequence of error events preceding a failure (see Figure 5.4). Its maximum duration is determined by the data window size ∆td, as defined in Section 2.1 (see Figure 2.4 on Page 12). The time of failure occurrence is usually not reported in the error logs themselves but in documents such as operator repair reports, logs of stress generators, service trackers, etc.
Figure 5.4: Extracting sequences from an error log. Sequences are extracted from a time window of duration ∆td. Sequences preceding a failure by lead-time ∆tl form failure sequences Fi. Sequences occurring between failures (with some margin ∆tm) set up non-failure sequences NFi.
Non-failure sequences denote sequences that have occurred between failures. In order to be relatively sure that the system is healthy and no failure is imminent, non-failure sequences must not occur within some margin ∆tm before or after any failure. Non-failure sequences can be generated with overlapping or non-overlapping windows or by random sampling.
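The window arithmetic of Figure 5.4 can be sketched as follows. Event times, window parameters, and the helper names are illustrative only.

```python
# A failure sequence contains all errors in a window of length dt_d that ends
# dt_l (the lead-time) before a failure; non-failure windows must keep a
# margin dt_m from every failure.

def failure_sequence(events, t_failure, dt_d, dt_l):
    """events: sorted (timestamp, symbol) pairs."""
    end = t_failure - dt_l
    start = end - dt_d
    return [(t, s) for t, s in events if start <= t <= end]

def is_nonfailure_window(start, end, failure_times, dt_m):
    """True if [start, end] keeps a margin dt_m to every failure."""
    return all(end <= tf - dt_m or start >= tf + dt_m for tf in failure_times)

events = [(1.0, "A"), (4.0, "B"), (7.0, "C"), (9.5, "A")]
seq = failure_sequence(events, t_failure=10.0, dt_d=5.0, dt_l=2.0)
# window is [3.0, 8.0] -> [(4.0, "B"), (7.0, "C")]
```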
5.2 Clustering of Failure Sequences
A failure mechanism, as the term is used in this thesis, denotes a principal chain of actions or conditions that leads to a system failure. It is assumed that various failure mechanisms exist in complex computer systems such as the telecommunication system. Different failure mechanisms can show completely different behavior in the error event logs, which makes it very difficult for the learning algorithm to extract the inherent “principle” of failure behavior in a given training data set. For this reason, a novel approach to the identification and separation of failure mechanisms has been developed. The key notion of the approach is that failure sequences of the same underlying failure mechanism are more similar to each other than to failure sequences of other failure mechanisms. Grouping can be achieved by clustering algorithms; however, the challenge is to define a similarity measure between any pair of error event sequences. Since there is no “natural” distance such as the Euclidean norm for error event sequences, sequence likelihoods from small hidden semi-Markov models are used for this purpose.4 The approach is related to Smyth [246] but yields separate specialized models instead of one mixture model.

4 The same hidden semi-Markov models are used as developed in the next chapter. However, since this thesis follows the order of tasks preprocessing → modeling → classification, details on the model are presented in Chapter 6. For the time being, it is sufficient to remember that HSMMs are hidden Markov models tailored to temporal sequences.
5.2.1 Obtaining the Dissimilarity Matrix
Since most clustering algorithms require dissimilarities among data points as input, a dissimilarity matrix D is computed from the set of failure sequences F^i. More precisely, D(i, j) denotes the dissimilarity between failure sequences F^i and F^j.

In order to compute D(i, j), first, a small HSMM M^i is trained for each failure sequence F^i, as shown in Figure 5.5.

Figure 5.5: For each failure sequence F^i, a separate HSMM M^i is trained.
Second, the sequence likelihood is computed for each sequence F^i using each model M^j. However, since the sequence likelihood takes on very small values for longer sequences, it cannot be represented properly even by double-precision floating point numbers, and the logarithm of the likelihood (log-likelihood) is used here.⁵ The sequence likelihoods of all sequences F^i computed with all HSMMs M^j define a matrix where each element (i, j) is the logarithm of the probability that model j can generate failure sequence i: log[P(F^i | M^j)] ∈ (−∞, 0]. In other words, the logarithmic sequence likelihood is close to zero if the sequence fits the model very well, and significantly smaller if it does not really fit. Since model M^j has been adjusted to the specifics of failure sequence F^j in the first step, P(F^i | M^j) expresses some sort of proximity between the two failure sequences F^i and F^j. An exemplary resulting matrix of log-likelihoods is shown in Figure 5.6.
Unfortunately, the matrix is not yet a dissimilarity matrix since, first, its values are ≤ 0 and, second, sequence likelihoods are not symmetric: P(F^i | M^j) ≠ P(F^j | M^i). This is solved by taking the arithmetic mean of both log-likelihoods and using the absolute value. Hence D(i, j) is defined as:

D(i, j) = | ( log[P(F^i | M^j)] + log[P(F^j | M^i)] ) / 2 | .   (5.2)
Still, matrix D is not a proper dissimilarity matrix, since a proper metric requires that D(i, j) = 0 if F^i = F^j. There is no remedy for this: D(j, j) = 0 would imply P(F^j | M^j) = 1, but if M^j assigned a probability of one to F^j, it would assign a probability of zero to all other sequences F^i ≠ F^j, which would be useless for clustering. Nevertheless, D(j, j) is close to zero, since it denotes the log-sequence likelihood of the very sequence that model M^j has been trained with. For this reason, matrix D is used as defined above.

⁵ In fact, many HMM implementations only return the log-likelihood.

Figure 5.6: Matrix of logarithmic sequence likelihoods. Each element (i, j) in the matrix is the logarithmic sequence likelihood log P(F^i | M^j) for sequence F^i and model M^j.
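The construction of D per Equation 5.2 can be sketched as follows. This is a minimal illustration only: `train_model` and `log_likelihood` are hypothetical stand-ins for HSMM training and the forward algorithm (replaced here by a toy smoothed unigram model); only the symmetrized absolute mean of the log-likelihoods is taken from the text.

```python
import math

def train_model(seq):
    # Toy stand-in for HSMM training: a smoothed unigram distribution
    # over the symbols of the sequence (add-one smoothing).
    counts = {}
    for s in seq:
        counts[s] = counts.get(s, 0) + 1
    alphabet = set(seq)
    total = len(seq) + len(alphabet)
    return {s: (counts.get(s, 0) + 1) / total for s in alphabet}

def log_likelihood(seq, model):
    # Toy stand-in for the forward algorithm; unseen symbols get a floor.
    floor = 1e-6
    return sum(math.log(model.get(s, floor)) for s in seq)

def dissimilarity_matrix(sequences):
    """Eq. (5.2): D(i,j) = |(log P(F^i|M^j) + log P(F^j|M^i)) / 2|."""
    models = [train_model(f) for f in sequences]   # one model M^i per F^i
    n = len(sequences)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            D[i][j] = abs((log_likelihood(sequences[i], models[j])
                           + log_likelihood(sequences[j], models[i])) / 2.0)
    return D
```

With a real HSMM in place of the unigram stand-in, similar sequences yield small entries of D and dissimilar ones large entries, and the matrix is symmetric by construction.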
Regarding the topology of the models M^i, the purpose of each model is to get a rough notion of proximity between failure sequences. In contrast to the models used for failure prediction (cf. Section 6.6), the purpose is not to clearly identify sequences that are very similar to the training data set and to judge other sequences as “completely different”. Therefore, the models M^i have only a few states and have the structure of a clique, which means that there is a transition from every state to every other state.⁶ In order to further avoid overly specific models, so-called background distributions are applied (cf. Page 112). The effects of the number of states and of background distributions are investigated further in the case study.
5.2.2 Grouping Failure Sequences
In order to group similar failure sequences, a clustering algorithm is applied. Two families of clustering algorithms exist (cf. Kaufman & Rousseeuw [142]): Partitioning techniques divide the data into u different clusters (partitions), where u is a fixed number that needs to be specified in advance. Hierarchical clustering approaches do not rely on a prespecification of the number of clusters. They either divide the data into more and more subgroups (divisive approach), or start with each data point as a separate cluster and repeatedly merge smaller clusters into bigger ones (agglomerative approach). In general, partitioning approaches yield better results for a single u, while hierarchical algorithms are much quicker than repeatedly partitioning for different values of u. Since u cannot be determined upfront, hierarchical clustering is used for the grouping of failure sequences.
The output of hierarchical clustering algorithms is a grouping function g_F(u) that partitions the set of failure sequences F = {F^i} into u groups:

g_F(u) = { G_l : 1 ≤ l ≤ u; ∀ l : G_l ⊂ F; ⋃_l G_l = F, ⋂_l G_l = ∅ } ,   (5.3)

where G_l denotes the set of failure sequences that belong to group l.

⁶ These models are also called ergodic.
5.2.3 Determining the Number of Groups
Hierarchical clustering yields the function g_F(u), determining for each number of groups u which sequences belong to which group. Hence, the number of groups u needs to be determined in order to separate the failure sequences in the training data. In principle, u should be as small as possible, since a separate model needs to be trained for each group, which affects computation time both for training and for online prediction. Moreover, the more groups there are, the fewer failure sequences remain in the training data set of each group, which results in poorer models. On the other hand, if u is too small, there is no clear separation of failure mechanisms, and the resulting failure prediction models have difficulty learning the structure of failure sequences. Several ways have been proposed to determine the number of groups u:
• Visual inspection is a very robust technique if the data is presented adequately. Banner plots (see Section 8.1.2) have proven to be an adequate representation for this purpose. However, visual inspection works only if the number of failure sequences is not too large.
• Evaluation of inter-cluster distances. Such approaches investigate the distance level at which clusters are merged or divided. The basic idea is that if there is a large gap in cluster distance (one that deviates significantly from the others), some fundamental difference must be present in the data. Such approaches are sometimes called stopping rules (see, e.g., Mojena [185], Lance & Williams [154], Salvador & Chan [228]).
• Elbow criterion. The percentage of variance explained⁷ is plotted for each number of groups. The point at which adding a new cluster does not add sufficient information can be observed as an elbow in the plot (see, e.g., Aldenderfer & Blashfield [5]).
• Bayesian framework. Using Bayes' theorem, the most probable number of groups given the data, arg max_u P(u | D), can be computed from the probability of the data given the number of groups, P(D | u). However, this requires trying all values of u, ranging from one to the number of sequences F, and each trial requires training u HSMMs. Hence, F(F + 1)/2 = (F² + F)/2 Baum–Welch training procedures would have to be performed, which is not feasible in reasonable time.
Since the number of failure sequences in the case study is still manageable, and visual inspection is a very simple but robust technique, it is the method of choice in this thesis.
⁷ This is the ratio of between-group variance to total variance.
Figure 5.7: Inter-cluster distance rules: (a) nearest neighbor, (b) furthest neighbor, and (c) unweighted pair-group average method.
5.2.4 Additional Notes on Clustering
Matrix D defines some sort of distance between single failure sequences. However, for clustering, some measure is needed to evaluate the distance between clusters, which can have a decisive impact on the clustering result. The three predominant techniques for agglomerative clustering are (see Figure 5.7):
• Nearest neighbor. The shortest connection between two clusters is considered. This approach tends to yield elongated clusters due to the so-called chaining effect: if two clusters get close at only one point, the two clusters are merged. For this reason, the nearest neighbor rule is also called the single linkage rule.
• Furthest neighbor. The maximum distance between any two points of the two clusters is considered. This approach tends to yield compact clusters that are not necessarily well separated. This rule is also called the complete linkage rule.
• Unweighted pair-group average method (UPGMA). The distance of two clusters is
computed by the average of distances from all points of one group to all points of
the other. This approach results in ball-shaped clusters that are in most cases well
separated.
In addition to these inter-cluster distance measures, Ward's method generates clusters by minimizing the squared Euclidean distance to the cluster mean.
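The three inter-cluster distance rules can be illustrated with a small agglomerative clustering sketch operating on a precomputed dissimilarity matrix; this is a minimal pure-Python illustration, not the implementation used in the case study:

```python
def agglomerative(D, u, linkage="average"):
    """Agglomeratively merge clusters until u remain, using a precomputed
    dissimilarity matrix D and one of the three inter-cluster distance
    rules: 'single' (nearest neighbor), 'complete' (furthest neighbor),
    or 'average' (UPGMA)."""
    clusters = [[i] for i in range(len(D))]
    rule = {"single": min, "complete": max,
            "average": lambda ds: sum(ds) / len(ds)}[linkage]

    def cluster_dist(a, b):
        return rule([D[i][j] for i in a for j in b])

    while len(clusters) > u:
        # Find and merge the closest pair of clusters.
        _, p, q = min((cluster_dist(clusters[p], clusters[q]), p, q)
                      for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters
```

Iterating over u and recording the merge distances yields the dendrogram information that banner plots and stopping rules evaluate.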
Each method has its advantages and disadvantages, and it is difficult to determine upfront which one is best suited for a given data set. Therefore, all methods have been applied to the data of the case study (cf. Section 9.2.5). Beyond failure sequence grouping for data preprocessing, the clustering method presented here could possibly be used to enhance diagnosis, as is discussed in the outlook.
5.3 Filtering the Noise
The objective of the previous clustering step was to group failure sequences that are traces of the same failure mechanism. Hence it could be expected that failure sequences of one group are more or less similar. However, experiments have shown that this is not the case. The reason is that error logfiles contain noise, which results mainly from parallelism within the system (see Section 2.3). Therefore, some filtering is necessary to eliminate the noise and to mine those events in the sequences that make up the true pattern.
The filtering applied in this thesis is based on the notion that, at certain times within failure sequences of the same failure mechanism, indicative events occur more frequently than within all other sequences. The precise definition of “more frequently” is based on the χ² goodness-of-fit test.

Figure 5.8: After grouping similar failure sequences by means of clustering, filtering is applied to each group in order to remove noise from the training data set. For failure group u, the blow-up shows that sequences are aligned at the time of failure occurrence (t). For each time window (vertical shaded bars), each error symbol (A, B, C) is checked for whether it occurs significantly more frequently than expected. Symbols that do not pass the filter (crossed-out symbols) are removed from the training sequence.
The filtering process is depicted in the blow-up of Figure 5.8 and performs the following steps:
1. Prior probabilities are estimated for each symbol. Priors express the “general” probability that a given symbol occurs.
2. All sequences of one group (which are similar and are expected to represent one failure mechanism) are aligned such that the failure occurs at time t = 0. In the figure, sequences F^1, F^2, and F^4 are aligned and the dashed line indicates the time of failure occurrence.
3. Time windows are defined that reach backwards in time. The length of the time
window is fixed and time windows may overlap. Time windows are indicated by
shaded vertical bars in the figure.
4. The test is performed for each time window separately, taking into account all error
events that have occurred within the time window in all failure sequences of the
group.
5. Only error events that occur significantly more frequently in the time window than suggested by their prior probability remain in the training sequences. All other error events within the time window are removed, since they are assumed to be noise. In the figure, removed error events are crossed out.
6. Filtering rules are stored for each time window, specifying the error symbols that pass the filter. The filter rules are used for online failure prediction, where new sequences have to be processed in order to classify the current state of the system as failure-prone or not. Each incoming error sequence is filtered before its sequence likelihood is computed. Each failure group has a separate set of filter rules, and no filtering is applied for the non-failure sequence model. That is why there is a group-specific part of the preprocessing block in Figure 2.10 on Page 20.
In order to formalize the test, let p̂_i^0 denote the estimated prior probability of error event type (symbol) i, which constitutes the null hypothesis. The set of failure sequences under consideration is obtained from clustering. Assume the l-th group is to be filtered; then the set of filtering sequences G_l, consisting of sequences G_l^j, is defined by:

G_l = { G_l^j } = [ g_F(u) ]_l ,   (5.4)
where g_F(u) is defined by Equation 5.3. Let S denote the set of symbols that occur in the failure sequences of G_l within the time window (t − ∆t, t]:

S = ⋃_j { s ∈ G_l^j | s occurs within (t − ∆t, t] } .   (5.5)
Each symbol s_i ∈ S is checked for significant deviation from the prior p̂_i^0 by a test variable known from χ-grams, which are a non-squared version of the testing variable of the χ² goodness-of-fit test (see, e.g., Schlittgen [230]). The testing variable X_i is defined as the non-squared standardized difference:

X_i = (n_i − n p̂_i^0) / √(n p̂_i^0) ,   (5.6)
where n_i denotes the number of occurrences of symbol s_i, and n is the total number of symbols in the time window. Disregarding estimation effects, the properties of the testing variable X_i can be assessed by assuming that n_i is binomially distributed, so that the Poisson approximation yields⁸ for the expectation value and variance:

E[n_i] ≈ n p̂_i^0 ,   (5.7)
V[n_i] ≈ n p̂_i^0 ,   (5.8)

where E[·] denotes the expectation value and V[·] the variance. Hence,

E[X_i] = E[ (n_i − n p̂_i^0) / √(n p̂_i^0) ] ≈ 0 ,   (5.9)
V[X_i] = V[ (n_i − n p̂_i^0) / √(n p̂_i^0) ] ≈ 1 .   (5.10)
From this analysis it follows that all X_i are standardized and can be compared to a threshold c: filtering eliminates all symbols s_i from S within time window (t − ∆t, t] for which X_i < c. Hence, the set of remaining symbols for the time window is:

S′ = { s_i ∈ S | X_i ≥ c } .   (5.11)

⁸ p̂_i^0 can be assumed to be rather small.
Figure 5.9: Three different sequence sets can be used to compute symbol prior probabilities: the set of all training sequences, the set of failure training sequences, and the set of failure training sequences belonging to the same group (indicated by G^i). In reality, the grouped sequence sets G^i cover (i.e., partition) the set of failure training sequences.

The set of filtered training sequences G′_l = {G′^j_l} is finally obtained by removing from each sequence all symbols that do not occur in any of the filtered symbol sets S′ covering the time at which the symbol occurs in the sequence. G′_l is then used to train the model for the l-th failure mechanism/group (see Section 6.6). For online prediction, the sequence under investigation is filtered in the same way before its sequence likelihood is computed.
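The test of Equations 5.6 and 5.11 can be sketched for a single time window as follows; the prior dictionary and the floor value for symbols without an estimated prior are illustrative assumptions:

```python
import math

def chi_gram_filter(window_symbols, priors, c):
    """Keep only the symbols of one time window whose frequency deviates
    significantly (X_i >= c) from their prior probability, following
    Eqs. (5.6) and (5.11)."""
    n = len(window_symbols)
    counts = {}
    for s in window_symbols:
        counts[s] = counts.get(s, 0) + 1
    kept = set()
    for s, n_i in counts.items():
        p0 = priors.get(s, 1e-6)   # floor for symbols without a prior (assumption)
        x_i = (n_i - n * p0) / math.sqrt(n * p0)
        if x_i >= c:
            kept.add(s)
    return kept
```

Running this per time window, with priors estimated according to one of the three variants discussed below, yields the filter rules that are stored for online prediction.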
Three variants regarding the computation of the priors p̂_i^0 have been investigated in this thesis (see Figure 5.9):
1. p̂_i^0 are estimated from all training sequences (failure and non-failure). X_i compares the frequency of occurrence of symbol s_i to its frequency of occurrence within the entire training data.
2. p̂_i^0 are estimated from all failure sequences (irrespective of the groups obtained from clustering). X_i compares the frequency of occurrence of symbol s_i to all failure sequences (irrespective of the group).
3. p̂_i^0 are estimated separately for each group of failure sequences from all errors within the group. For each symbol s_i, the testing variable X_i compares the occurrence within one time window to the entire group of failure sequences.
All variants have been applied to the data of the case study. An analysis is provided in
Section 9.2.6.
5.4 Improving Logfiles

Life would have been easier if logfiles were written in a format suited for automatic processing. Based on the experience of working with logfiles, a paper has been written (Salfner et al. [227]) discussing several issues relevant for logging. The major concepts are summarized here. At the end of the section, a comparison to the Common Base Event format is added, which is not included in the paper.
5.4.1 Event Type and Event Source

While data like timestamps and process identifiers are given more or less explicitly in logfiles, the logged event itself is in most cases represented in natural language. Analyzing
messages like “Could not get connection to service X” reveals that the textual description merges two different pieces of information that should be represented separately:
• What has happened: some connection could not be established. This information is called the event type.
• What resource the problem arose with, which is “service X” in the example. This information is called the source the event is associated with. Note that the source is in general not identical to the detector issuing the error report.
Event type and source are related to orthogonal defect classification (Chillarege et al. [59]), where the type correlates with the defect type and the source with the defect trigger. However, since in our scheme error events rather than the root causes of defects are considered, the event type need not necessarily coincide with the defect type. The source is only a suspected trigger entity, while the defect trigger, as defined by Chillarege et al., describes the entire state in which the defect occurred.
Of course, a natural language sentence is able to carry more information than only “event type” and “source”. To prevent the additional information from being lost, it should be spelled out in additional fields of the log record.
5.4.2 Hierarchical Numbering
In Section 5.1.1, a method has been described that maps natural language error messages to event IDs. This step could have been avoided if message IDs had been written directly into the log. Furthermore, if error message IDs are chosen in a systematic way, such an approach can be superior to natural language error messages, as is shown for the numbering scheme described in the following.

The numbering scheme is based on a hierarchical classification of errors, represented by a tree. The topmost classification is based on the SHIP fault model (cf. Section 2.5). The software subtree has been developed further, introducing 62 categories, of which an excerpt is shown in Figure 5.10. Error message identifiers are simply constructed from the labels along the path from the root to the leaf node, separated by dots. This numbering scheme originates from Dewey [51] and has become popular, e.g., with LDAP (Lightweight Directory Access Protocol) [269]. In cases where an error matches several leaves of the tree, all possible identifiers should be written into the log. On the other hand, if an event cannot be resolved down to a leaf category, the most detailed identifiable categorization should be used. Furthermore, the error classification scheme can be extended easily.

Figure 5.10: Hierarchical numbering scheme for error event types with the SHIP model. The example only shows a sub-classification for software errors.
In comparison with freely chosen error IDs, as they occur with methods such as the one presented in Section 5.1.1, the numbering scheme provides two advantages:
1. It provides an ordering that can be exploited to derive a notion of similarity between error event types.
2. It provides means to present error data at multiple levels of detail.
A distance metric. The numbering scheme gives rise to a measure of similarity that could be used, e.g., in clustering algorithms to group error messages. For example, failure prediction algorithms could benefit from a notion of error proximity, or clusters could be analyzed in order to diagnose an apparent problem. The distance metric proposed here is defined as follows:

d(id1, id2) := length of the path between id1 and id2 ,   (5.12)
which has the properties:

d(id1, id2) = 0 ⇔ id1 = id2 ,   (5.13)
d(id1, id2) = d(id2, id1) ,   (5.14)
d(id1, id3) ≤ d(id1, id2) + d(id2, id3) ,   (5.15)
from which follows that d(id1 , id2 ) is a proper metric. It can be efficiently computed by
simply comparing the individual parts of id1 and id2 from left to right and calculating
d(id1 , id2 ) directly from the position in which the two identifiers differ.9
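The position-based computation can be sketched as follows; the example identifiers such as `SW.MEM.LEAK` are hypothetical labels, since the actual 62-category scheme is not reproduced here:

```python
def id_distance(id1, id2):
    """Length of the tree path between two hierarchical error IDs.

    Computed directly from the position at which the dotted identifiers
    first differ, without any knowledge of the classification tree."""
    a, b = id1.split("."), id2.split(".")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Path goes up from id1 to the deepest common ancestor, then down to id2.
    return (len(a) - common) + (len(b) - common)
```

Since the identifiers are paths in a tree, the function satisfies identity, symmetry, and the triangle inequality of Equations 5.13 to 5.15; truncating identifiers to a fixed number of components gives the coarser views mentioned below.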
Due to the lack of system knowledge, it has not been possible to apply hierarchical numbering to the data of the telecommunication system, and hence the distance metric has not been applied to industrial data. Therefore, one potential conceptual problem known from decision trees could not be investigated using real data: the proposed metric can assign a large distance to objects that are closely related in reality but reside in different subtrees (see Figure 5.11).

Figure 5.11: An inherent problem of hard classification approaches such as decision trees: the two highlighted points are assigned a long distance (thick lines), although they are close in reality.

⁹ This algorithm does not even require knowledge of the error classification tree.
Multiple levels of detail. The proposed numbering scheme supports views of the data at diverse granularity, which makes it possible to present the log data at multiple levels of detail. For example, a failure prediction tool will need more fine-grained information than an administrator who only observes whether the system is running well. Presentation at various levels of granularity can simply be achieved by truncating the error numbers.
5.4.3 Logfile Entropy

From the experience of working with logfiles of various programs, one gets a notion of what makes a good logfile. In order to assess the quality of logfiles quantitatively, a metric has been developed. Due to its affinity to Shannon's definition [235], it is called the information entropy of logfiles.
Starting from Shannon's work, information entropy is defined as:

H(X_i) = log₂( 1 / P(X_i) ) ,   (5.16)

where X_i is a symbol of a signal source, and P(X_i) is the probability that X_i occurs. In terms of error logs, X_i corresponds to the type of a log record. If P(X_i) = 1, the logfile will consist only of messages of type X_i. According to Shannon, as the occurrence of such a log record is fully predictable, it does not convey any new information and the entropy is zero.
However, the frequency of occurrence is only one part of what makes a good log record. A metric must also comprise the information that is given within the record. To measure this, log records are taken to be sets whose elements relate to pieces of information such as timestamp, process ID, etc. Let R_i be the set of information required to fully describe the event that log record X_i is reporting on, and let G_i denote the set of information that is actually given within log record X_i. As can be seen from Figure 5.12, the intersection R_i ∩ G_i is the required information that is actually present in the log record, and (R_i ∪ G_i) \ (R_i ∩ G_i) is the set of information that is either missing or irrelevant. The bigger the intersection and the smaller the rest, the better the log record. This is expressed by the integrity function I(X_i), where the notation ♯(·) denotes the cardinality of a set:

I(X_i) = ♯(R_i ∩ G_i) / ♯(R_i ∪ G_i) − ♯((R_i ∪ G_i) \ (R_i ∩ G_i)) / ♯(R_i ∪ G_i) .   (5.17)

Figure 5.12: Sets of required information (R_i) and given information (G_i) of a log record.
The first term is a Jaccard score [106] for the similarity between given and required information, and the second evaluates the amount of missing and irrelevant information. To see how integrity is measured by I(X_i), consider the following two extreme cases: if a log record contains exactly the information that is required and nothing more, R_i ∩ G_i equals R_i ∪ G_i, and hence I(X_i) equals one. If a log record contains none of the required information (all given information is irrelevant), R_i ∩ G_i = ∅ and the result is −1. Therefore, I(X_i) can take any value in the range [−1, 1].
Not only the fraction of given and required information but also the absolute number of statements contained in a log message has an impact on information density. The number of reasonable statements in the record is

S(X_i) = ♯(R_i ∩ G_i) .   (5.18)

Combined with a linear transformation of I(X_i) to the range [0, 1], the quality Q(X_i) of a log record is measured by

Q(X_i) = S(X_i) · (I(X_i) + 1) / 2 .   (5.19)

Finally, the entropy of one log record is the product of the quality Q(X_i) and Shannon's logarithmic quantity measure given by Equation 5.16:

H(X_i) = Q(X_i) · log₂( 1 / P(X_i) ) .   (5.20)
In order to compute the entropy of entire logfiles, the expected value over all log records is computed analogously to Shannon:

H(X) = Σ_{i=1}^{m} P(X_i) · H(X_i) .   (5.21)
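Equations 5.16 to 5.21 translate directly into code once log records are represented as sets of information items; a minimal sketch (the set representation is an assumption for illustration):

```python
import math

def integrity(required, given):
    # Eq. (5.17): Jaccard score minus the fraction of missing/irrelevant info.
    union = required | given
    inter = required & given
    return len(inter) / len(union) - len(union - inter) / len(union)

def record_entropy(required, given, p):
    # Eqs. (5.18)-(5.20): number of reasonable statements, quality,
    # and quality times Shannon's logarithmic quantity measure.
    s = len(required & given)
    q = s * (integrity(required, given) + 1) / 2
    return q * math.log2(1 / p)

def logfile_entropy(records):
    # Eq. (5.21): expectation over all record types.
    # records: list of (required_set, given_set, probability) triples.
    return sum(p * record_entropy(r, g, p) for r, g, p in records)
```

A record that contains exactly the required information scores I = 1 and contributes its full statement count, while a record with only irrelevant information contributes zero entropy.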
Properties of logfile entropy. Q(X_i) contributes to H(X_i) on a linear scale, while the quantity function has a logarithmic scale. Not surprisingly, for a high quality Q(X_i) that occurs with very small probability P, the entropy takes on very high values (see Figure 5.13). In order to compute the maximum entropy, integrity I(X_i) = 1 and P(X_i) = 1/m are assumed for all m log records. Then, the maximum entropy is [103]:

H_max(X) = ♯R · log₂ m ,   (5.22)

where ♯R denotes the mean number of required statements per log record.
The set R_i is defined to contain all the information needed to comprehensively describe the event that caused the log record to be written, but nothing more. Analyzing R_i for each error type is a laborious task. In Salfner et al. [227], an example is provided and it is shown how an intuitively better logfile results in increased entropy.
5.4.4 Existing Solutions

When IBM started its autonomic computing initiative, it was quickly found that automatic processing of logfiles is crucial; Bridgewater [37] called them “a nervous system for computer infrastructures”. Against the background of multi-vendor commercial off-the-shelf systems, a log standard had to be developed. Together with other companies such as HP, Oracle and SAP, the Common Base Event [196], which is
part of the “Web Services Distributed Management” standard (WSDM 1.0), has been developed to enable standardized logging.

Figure 5.13: A surface plot of the entropy H(X_i). Quality Q has been normalized to the range [0, 1].
A Common Base Event (CBE) is a specification of one event, i.e., of what has been called a log record so far. A CBE consists of three major parts:
1. The component reporting a particular situation
2. The component affected by the situation
3. The situation itself
Each of the three parts is further specified by several fixed attributes. The specification contains a UML description of the CBE format, of which Figure 5.14 is a simplified version to visualize the concept.
Figure 5.14: Principal structure of a Common Base Event, depicted as a UML model.
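The three-part structure can be illustrated as a minimal record type; the attribute names follow the simplified UML sketch of Figure 5.14 and are not the normative CBE schema:

```python
from dataclasses import dataclass

@dataclass
class ComponentIdentification:
    application: str
    component_id: str
    component_type: str
    process_id: int

@dataclass
class Situation:
    category_name: str   # e.g. "StartSituation", "ConnectSituation"

@dataclass
class CommonBaseEvent:
    reporter: ComponentIdentification   # component reporting the situation
    source: ComponentIdentification     # component affected by the situation
    situation: Situation                # the situation itself
    creation_time: str
    severity: int
    message: str
```

Note how reporter and source are deliberately distinct, matching the separation of detector and affected resource discussed in Section 5.4.1.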
Evaluating CBE with respect to the issues raised in the previous sections, the following conclusions can be drawn:
• CBE also separates event type and source: “Situation” specifies the event type
and the “source ComponentIdentification” contains a specification of the failed resource.
• The “reporter ComponentIdentification” corresponds to parts of the log records that
have not been addressed in the previous sections. For example, in application logfiles, the reporter is in most cases the application that wrote the log.
• Instead of a hierarchical numbering scheme specifying the event, eleven “situationNames” have been defined. Valid situation identifiers include “START”, “CONNECT”, or “CONFIGURE”. For this reason, the proposed distance metric cannot
be applied to CBE.
Error logs of Sun Microsystems' Solaris 10 operating system, which was released in 2004, also separate event type and event source. Solaris 10 error reports support three levels of detail by structuring each report into an outline, error details, and error-ID details. However, instead of a hierarchical numbering scheme, unique event IDs are used that can be looked up on the Internet in order to obtain further details.
5.5 Summary
This chapter has covered the steps that are applied to yield a set of training sequences
from error logfiles. Specifically, this process involves:
• Mapping natural language error messages to event IDs using the Levenshtein edit
distance.
• Removal of repeated reports of the same cause by means of tupling.
• Extraction of sequences from the filtered logfiles.
• Grouping of failure sequences that belong to the same failure mechanism. To
achieve this, a hierarchical clustering based on a dissimilarity matrix computed by
using small HSMMs is applied.
• Filtering the noise that is present in the data by means of a statistical test related to
the χ2 goodness of fit test.
The last section of the chapter addressed the topic of logfiles in general. It has been proposed that event type and source should be separated and that a hierarchical numbering scheme should be applied to assign IDs to error events. The numbering scheme makes it possible to define a distance metric and to present logfiles at various levels of detail. A measure for the quality of logfiles has been developed. The measure is based on Shannon's definition of information entropy and is hence called “logfile entropy”. Finally, these principles of “educated logging” have been compared to an existing logging standard from autonomic computing, the Common Base Event.
Contributions of this chapter. This chapter has introduced
• a novel approach to identifying failure mechanisms in the system by means of failure sequence clustering, which may also be helpful for failure diagnosis;
• a novel approach to noise reduction in failure sequences by means of the χ²-related statistical test;
• a novel way to represent error events using hierarchical numbering, which gives rise to a definition of a distance between error event IDs;
• a novel way to assess the quality of logfiles by means of an entropy measure for logfiles.
Relation to other chapters. Data preprocessing having been covered in this chapter, the next chapter describes the extended hidden Markov model, which is the heart of the failure prediction approach presented in this dissertation.
Chapter 6
The Model
This chapter describes the essence of the proposed approach to failure prediction: The
hidden semi-Markov model that is used for pattern recognition. In Section 6.1, the model
is defined. Subsequently, in Section 6.2 it is described how the model is used to process
temporal sequences, followed by a delineation of the training procedure in Section 6.3.
A proof of convergence for the training procedure is given in Section 6.5 and modeling issues that are specific to failure prediction are discussed in Section 6.6. Finally, in
Section 6.7, computational complexity is analyzed.
6.1 The Hidden Semi-Markov Model

Similar to the way standard hidden Markov models (HMMs) are an extension of discrete-time Markov chains (DTMCs), hidden semi-Markov models (HSMMs) extend semi-Markov processes (SMPs). For this reason, SMPs are defined first, followed by their extension to HSMMs.
6.1.1 Wrap-up of Semi-Markov Processes

SMPs are continuous-time stochastic processes that make it possible to specify probability distributions for the duration of transitions from one state to the next. Several definitions exist,
which all lead to the same properties. In this dissertation, the approach of Kulkarni [149]
is adopted. Semi-Markov processes are a continuous-time extension of Markov renewal
sequences, which are defined as follows:
A sequence of bivariate random variables {(Yn , Tn )} is called a Markov renewal sequence
if
1. T0 = 0, Tn+1 ≥ Tn ; Yn ∈ S , and (6.1)

2. P(Yn+1 = j, Tn+1 − Tn ≤ t | Yn = i, Tn , . . . , Y0 , T0) = P(Y1 = j, T1 ≤ t | Y0 = i) ∀n ≥ 0 . (6.2)
Here, S denotes the set of states, and the random variables Yn and Tn denote the state
and time of the n-th element in the Markov renewal sequence. Note that Tn refers to
points in time on a continuous time scale and t is the length of the interval between Tn
and Tn+1. Similarly to Equation 4.3 on Page 56, Equation 6.2 expresses that Markov
renewal sequences are memoryless and time-homogeneous: as the transition probabilities
only depend on the immediate predecessor, the process has no memory of previous states, and
since the transition probabilities at time n are equal to those at time 0, the process
is time-homogeneous.
Let gij (t) denote the conditional probability that state sj follows si after time t as
defined by Equation 6.2. Then the matrix G(t) := [gij (t)] is called the kernel of the
Markov renewal sequence. Note that gij (t) has all properties of a cumulative probability
distribution except that the limiting probability pij must be equal to or less than one:
pij := lim_{t→∞} gij(t) = P(Y1 = j | Y0 = i) ≤ 1 . (6.3)
Even though Markov renewal sequences are defined on a continuous time scale, they form a
discrete sequence of points. If the gaps between the points of a Markov renewal sequence
are “filled”, a semi-Markov process (SMP) is obtained. More formally:
A continuous-time stochastic process {X(t), t ≥ 0} with countable state space S is said
to be a semi-Markov process if
1. it has piecewise constant, right continuous sample paths, and
2. {(Yn , Tn ), n ≥ 0} is a Markov renewal sequence, where Tn is the n-th jump epoch
and Yn = X(Tn +) .
Yn = X(Tn +) denotes that the state X of the SMP is defined by the state Yn of the
Markov renewal sequence at any time t. The notation Tn + indicates that the sample path
is right continuous, and n is determined such that it is the largest index for which Tn ≤ t
(see Figure 6.1).
Figure 6.1: A semi-Markov process X(t) defined by a Markov renewal sequence {(Yn , Tn )}
An SMP is called regular if it performs only a finite number of transitions in a finite
amount of time. As only regular SMPs are considered in this thesis, the term “regular”
will be omitted from now on.
As can be seen from Equation 6.3, the limiting probabilities pij “eliminate” temporal
behavior. Hence, they define a DTMC that is said to be embedded in the SMP. From this
analogy it is clear that the following property holds for each transient state si:

∀i : ∑_{j=1}^{N} pij = 1 , (6.4)
expressing the fact that the SMP is certain to leave state si as time t approaches infinity.
In addition to the notion of the embedded DTMC, the limiting probabilities pij can be
used to define a quantity that helps to understand how SMPs operate. Let dij(t)
denote a probability distribution for the duration of a transition from state si to state sj:

dij(t) = P(T1 ≤ t | Y0 = i, Y1 = j) . (6.5)
Using the limiting probabilities pij, the durations dij(t) can be computed from gij(t) in the following way:

dij(t) = gij(t) / pij   if pij > 0
dij(t) = 1              if pij = 0 ,   (6.6)

and therefore, gij(t) can be split into a transition probability and a transition duration
distribution:

gij(t) = pij dij(t) , (6.7)
which leads to an intuitive description of the behavior of SMPs: Assume that at time 0
the system enters state i. Then, it chooses the next state to be j according to probability
pij. Having decided upon the next state to be j, it stays in state i for a random amount of
time sampled from distribution dij(t) before it enters state j. Once the SMP enters state j,
it loses all memory of the history and behaves as before, starting from state j. Note that
the theory of SMPs allows pii ≠ 0, i.e., the SMP may return to state i immediately after
leaving it. However, for simplicity, it will be assumed from now on that pii = 0.
This description of SMPs also shows why they are called semi-Markov processes: the
choice of the successor state is a Markov process, but the duration probability depends
on both the current and the successor state and is therefore non-Markovian.
Hence the name semi-Markov.
Finally, it should be noted that SMPs are fully specified by two quantities:
1. the initial distribution π = [πi ] = [P (X(0) = i)]
2. the kernel G(t) of the underlying Markov renewal sequence. From Equation 6.7
it follows that G(t) can alternatively be specified by P = [pij], the transition matrix of the embedded DTMC, and D(t) = [dij(t)], defining the probability
distributions for the duration of each transition from si to sj.
Be aware that Equation 6.7 only holds for each gij (t) separately and hence matrices P
and D(t) can only be multiplied element-wise.
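The step-by-step behavior of an SMP described above (choose the successor with probability pij, then wait for a duration drawn from dij) can be illustrated by a small simulation. This is only a sketch, not code from this thesis; the function name `simulate_smp` and the callback `sample_dur(i, j, rng)`, assumed to draw a duration from dij, are illustrative assumptions.

```python
import numpy as np

def simulate_smp(pi, P, sample_dur, t_max, rng):
    """Simulate a sample path of an SMP: pick a start state from the
    initial distribution pi, choose each successor state with probability
    p_ij, and stay for a random duration drawn from d_ij before jumping
    (cf. the split g_ij(t) = p_ij d_ij(t) in Equation 6.7)."""
    state = rng.choice(len(pi), p=pi)
    t, path = 0.0, [(0.0, state)]
    while t < t_max:                      # last jump may cross t_max
        nxt = rng.choice(len(pi), p=P[state])
        t += sample_dur(state, nxt, rng)  # duration sampled from d_ij
        path.append((t, nxt))
        state = nxt
    return path                           # list of (jump epoch, state)
```

Because pii = 0 is assumed, a valid transition matrix P has a zero diagonal, so the simulated path never jumps back into the state it just left.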
6.1.2 Combining Semi-Markov Processes with Hidden Markov Models

HSMMs extend SMPs in the same way that HMMs extend DTMCs. Hence, once the
stochastic process of state traversals enters a state si, an observation oj is produced according to the probability distribution bsi(oj). Due to the fact that error event-based failure
prediction evaluates temporal sequences with discrete symbols, only discrete distributions
bsi(oj) are considered here. Nevertheless, the approach could easily be extended to continuous, multimodal outputs.1 An example is shown in Figure 6.2.
1 See, e.g., Liporace [168], Juang et al. [137], or Rabiner [210] for a summary of how this is done for discrete-time HMMs.
Figure 6.2: Similar to the HMM shown in Figure 4.2, a HSMM consists of a semi-Markov process of (hidden) state traversals defined by gij (t) and output probabilities bsi (oj )
According to Equation 6.7, gij(t) is the product of limiting probabilities pij and durations dij(t). Durations dij(t) can in general be arbitrary time-continuous cumulative
distributions, which need not even be differentiable. For example, dij(t) can be a piecewise constant non-decreasing function. In this thesis, however, a convex combination of
parametrized probability distributions is assumed:

dij(t) = ∑_{r=0}^{R} wij,r κij,r(t | θij,r) (6.8)

s.t. ∑_{r=0}^{R} wij,r = 1 , wij,r ≥ 0 . (6.9)
Each duration distribution dij(t) is a sum of R cumulative probability distributions
κij,r(t | θij,r), each with a specific set of parameters θij,r, weighted by wij,r. The weights sum
to one so that a proper probability distribution is obtained. The single distributions
κij,r are called kernels. For example, if κij,r is a Gaussian kernel, the parameters θij,r consist of the mean µij,r and the variance σ²ij,r. Additionally, as stated above, it is assumed that
pii = 0, expressing the fact that there are no self-transitions in the model. In the literature,
such a convex combination is sometimes termed a mixture of probability distributions, even
though the term is mathematically less precise.
In summary, an HSMM is completely defined by
• The set of states S = {s1 , . . . , sN }
• The set of observation symbols O = {o1 , . . . , oM }
• The N -dimensional initial state probability vector π
• The N × M matrix of emission probabilities B
• The N × N matrix of limiting transition probabilities P
• The N × N matrix of cumulative transition duration distribution functions D(t)
For better readability of formulas, let λ = {π, B, P , D(t)} denote the set of parameters.
Taking Equation 6.7 into account, sometimes also the notation λ = {π, B, G(t)} is used.
S and O are not included since O is determined by the application and S is not altered
by the training procedure, as is explained in Section 6.3.
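The complete parameter set λ = {π, B, P, D(t)} listed above can be collected in a small container, which also checks the structural assumptions made so far (stochastic rows, pii = 0). This is an illustrative sketch; the class name `HSMM` and its layout are assumptions, not code belonging to the thesis.

```python
import numpy as np

class HSMM:
    """Illustrative container for the HSMM parameter set λ = {π, B, P, D(t)}."""
    def __init__(self, pi, B, P, duration_cdfs):
        self.pi = np.asarray(pi)   # N initial state probabilities π_i
        self.B  = np.asarray(B)    # N x M emission probabilities b_si(oj)
        self.P  = np.asarray(P)    # N x N limiting transition probabilities p_ij
        self.D  = duration_cdfs    # D[i][j]: callable CDF d_ij(t)
        N = len(self.pi)
        assert self.B.shape[0] == N and self.P.shape == (N, N)
        assert np.allclose(self.pi.sum(), 1.0)
        assert np.allclose(self.B.sum(axis=1), 1.0)
        assert np.allclose(self.P.sum(axis=1), 1.0)   # Equation 6.4
        assert np.allclose(np.diag(self.P), 0.0)      # no self-transitions

    def g(self, i, j, t):
        """Kernel g_ij(t) = p_ij * d_ij(t) (Equation 6.7)."""
        return self.P[i, j] * self.D[i][j](t)
```

A model with, say, exponential duration kernels can then be instantiated by passing closures `lambda t: 1.0 - np.exp(-lam * t)` as the entries of `duration_cdfs`.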
6.2 Sequence Processing
In machine learning, usually a training procedure is applied first in order to adjust model
parameters to a training data set; after that, the resulting model is applied. For failure
prediction with HSMMs, this translates into determining the model parameters λ and then
processing error sequences observed during system runtime. Nevertheless, the description of the
two steps is reversed here for simplicity, since sequence processing is better suited
to explaining the hidden semi-Markov model. Training is then covered in the next section.
6.2.1 Recognition of Temporal Sequences: The Forward Algorithm
Online failure prediction with HSMMs consists of three stages: preprocessing, sequence recognition resulting in sequence likelihood, and subsequent classification (c.f.
Figure 2.10 on Page 20). This section covers the second stage and provides the forward
algorithm, which computes the sequence likelihood of a given observation sequence.
Figure 6.3 illustrates some notations that are used throughout this chapter. The notation Oi = ok is used to describe observation sequences, expressing that the i-th symbol
in a sequence is symbol ok ∈ O. The notation is adopted from literature on random
variables such as Cox & Miller [67], where capital letters denote the variables and small
letters the realization of the variable. In the figure, ok is either “A”, or “B”. The events
occur at times t0 to t2 . However, if relative distances between events are relevant, time is
represented by delays di = ti − ti−1 . The sequence of hidden states that are traversed to
generate the observations is denoted as a sequence of random variables Si = sj , where
sj ∈ S .
Figure 6.3: Notations used for temporal sequences in this chapter. Capital letters denote
random variables, small letters realizations (actual values). [Oi ] denotes the sequence of observation symbols and [Si ] the sequence of hidden states. Time is
expressed as delay di between observations at time ti and ti−1 .
The forward algorithm for HSMMs is derived from the discrete-time equivalent as
defined by Equation 4.9 on Page 58. The fact that sequences in continuous time are
considered leads to a change in time indexing: instead of t denoting an equidistant time
step, tk denotes the time when the k-th symbol has occurred.
As can be seen from comparing Figure 6.2 with Figure 4.2 on Page 57, transition probabilities aij are replaced by gij(t) in HSMMs. However, a strict one-to-one replacement is
not sufficient, as can be seen from the following considerations:
1. Assume that at time tk−1 the stochastic process has just entered state si and has
emitted observation symbol ol : Sk−1 = si , Ok−1 = ol .
2. Assume that there is a state transition when the next observation occurs. Hence, the
duration of the transition is dk := tk − tk−1 .
3. Knowing dk , transition probabilities to successor states sh can be computed by
gih (dk ). Assume that the successor state is sj : Sk = sj .
4. The subsequent symbol Ok = om is then emitted by state sj with probability
bsj (om ).
5. However, the inequality

∑_{h=1}^{N} gih(dk) ≤ 1 (6.10)

holds, and equality is only reached for dk → ∞ (c.f., Equations 6.3 and 6.4 and
keeping in mind that gii(t) ≡ 0). Hence, for dk < ∞, the sum is less than one,
which means that some fraction of the probability mass is not distributed among
successor states. The explanation for this is as follows: there is a non-zero probability that the stochastic process still resides in state si when time dk has elapsed.
In this case, state si generates symbol om, and the probability for this is

1 − ∑_{h=1}^{N} gih(dk) . (6.11)
6. Applying the Markov assumptions, the stochastic process loses all memory and
the considerations for the next observation start again from step 1.
In order to formalize these considerations, a probability vij(dk) is defined as follows:

vij(dk) = P(Sk = sj, dk = tk − tk−1 | Sk−1 = si) (6.12)

vij(dk) = gij(dk)                              if j ≠ i
vij(dk) = 1 − ∑_{h=1, h≠i}^{N} gih(dk)         if j = i   (6.13)

with the property that

∀ i, d : ∑_{j=1}^{N} vij(d) = 1 . (6.14)
One of the advantageous characteristics of this approach is that it can handle the situation when the order of errors occurring closely together is changed, which happens
frequently in systems where several components send error messages to a central logging
component (c.f., Property 6 on Page 15). More technically, if two symbols O1 = oa and
O2 = ob occur at the same time (d = 0), the resulting sequence likelihood is identical
regardless of the order, since for d = 0 the process stays in state si with probability
one and the resulting (partial) sequence likelihood is bsi(oa) bsi(ob) = bsi(ob) bsi(oa).
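The construction of vij(dk) in Equation 6.13 can be sketched directly in code. This is an illustrative sketch; `P` is the matrix of limiting probabilities, `D` a nested list of duration CDFs dij, and the function name `v` is an assumption.

```python
import numpy as np

def v(P, D, i, j, d):
    """v_ij(d) as in Equation 6.13: the kernel g_ij(d) = p_ij d_ij(d)
    for j != i, and the remaining probability mass of staying in s_i
    for j == i (Equation 6.11)."""
    N = P.shape[0]
    if j != i:
        return P[i, j] * D[i][j](d)
    # probability that the process has not yet left s_i after delay d
    return 1.0 - sum(P[i, h] * D[i][h](d) for h in range(N) if h != i)
```

By construction every row of [vij(d)] sums to one for any delay d, which is exactly the property stated in Equation 6.14, and v(P, D, i, i, 0) = 1 reflects that the process cannot have left si in zero time.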
The forward algorithm. Similar to the case of discrete-time HMMs (c.f. Section 4.1),
the forward variable α for HSMMs equals the probability of the sequence up to time tk
for all state sequences that end in state si (at time tk):

αk(i) = P(O0 O1 . . . Ok, Sk = si | λ) . (6.15)

By replacing aij by vij(t) and changing the time indexing, the following recursive computation scheme for αk(i) is derived from Equation 4.9 on Page 58:

α0(i) = πi bsi(O0)
αk(j) = ∑_{i=1}^{N} αk−1(i) vij(tk − tk−1) bsj(Ok) ;  1 ≤ k ≤ L . (6.16)
The forward algorithm can also be visualized by a trellis structure as shown in Figure 4.3
on Page 59.
Sequence likelihood. In the context of online failure prediction with HSMMs, sequence
likelihood is a probabilistic measure for the similarity of the observed error sequence to
the sequences in the training data set. More specifically, sequence likelihood is denoted
as P(o | λ), which is the probability that an HSMM with parameter set λ generates
observation sequence o. As for standard HMMs, this probability can be computed
by the sum over the last column of the trellis structure for the forward variable α:

P(o | λ) = ∑_{i=1}^{N} αL(i) . (6.17)
When executing the forward algorithm on computers, probabilities quickly approach
the limit of computational accuracy, even with double precision floating point numbers.
Therefore, a technique called scaling is applied (see, e.g., Rabiner [210]). The values of
column k in the trellis for α are scaled to one by a scaling factor ck:

ck := 1 / ∑_i αk(i)   ⇒   ∑_i ck αk(i) = ∑_i α′k(i) = 1 . (6.18)
Instead of the sequence likelihood, which also gets too small very quickly, the logarithm
of the sequence likelihood is used. It can be shown that the so-called log-likelihood
log[P(o | λ)] can be computed easily by summing up the logarithms of the scaling factors:

log[P(o | λ)] = − ∑_{k=1}^{L} log ck . (6.19)
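The scaled forward recursion of Equations 6.16, 6.18, and 6.19 can be sketched as follows. This is illustrative code, not the thesis implementation; `v_fn(d)` is assumed to return the full N × N matrix [vij(d)], and in this sketch every trellis column, including column 0, is scaled, so the log-likelihood accumulates all scaling factors.

```python
import numpy as np

def forward_loglik(pi, B, v_fn, obs, times):
    """Scaled forward algorithm for HSMMs. obs[k] is the index of the
    k-th observation symbol, times[k] its time stamp."""
    alpha = pi * B[:, obs[0]]          # α_0(i) = π_i b_si(O_0)
    c = 1.0 / alpha.sum()              # scaling factor (Equation 6.18)
    alpha = alpha * c
    log_lik = -np.log(c)
    for k in range(1, len(obs)):
        d = times[k] - times[k - 1]    # delay d_k between observations
        alpha = (alpha @ v_fn(d)) * B[:, obs[k]]   # recursion (Eq. 6.16)
        c = 1.0 / alpha.sum()
        alpha = alpha * c
        log_lik -= np.log(c)           # log P(o|λ) = −Σ log c_k (Eq. 6.19)
    return log_lik, alpha              # alpha: scaled α_L, sums to one
```

For short sequences the result agrees with the logarithm of the unscaled sum ∑i αL(i) from Equation 6.17; the scaled variant simply remains numerically stable for long sequences.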
Finding the most probable sequence of states: Viterbi algorithm. The forward algorithm incorporates all possible state sequences. In some applications, however, this is not
desired and only the most probable sequence of states is of interest. This is computed by
the Viterbi algorithm.
In analogy to discrete-time HMMs,2 the Viterbi algorithm is derived from the forward
algorithm by replacing the sum over all previous states by the maximum operator:

δk(i) = max_{S0 S1 ... Sk−1} P(O0 O1 . . . Ok, S0, S1, . . . , Sk−1, Sk = si | λ) (6.20)

δ0(i) = πi bsi(O0) (6.21)

δk(j) = max_{1≤i≤N} δk−1(i) vij(tk − tk−1) bsj(Ok) . (6.22)

Hence, maxi δL(i) is the maximum probability of a single state sequence generating observation sequence o. The sequence of states itself can be obtained by storing which state
was selected by the maximum operator and then tracing back through the array starting
from state arg maxi δL(i).
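The recursion and the back-tracing step can be sketched as follows, computed in log space for numerical stability. This is an illustrative sketch; `v_fn(d)` is assumed to return the matrix [vij(d)] and all names are assumptions.

```python
import numpy as np

def viterbi(pi, B, v_fn, obs, times):
    """Viterbi algorithm for HSMMs (Equations 6.20-6.22): the most
    probable hidden state sequence for an observation sequence."""
    N, L = len(pi), len(obs)
    delta = np.log(pi * B[:, obs[0]])          # log δ_0(i)  (Eq. 6.21)
    back = np.zeros((L, N), dtype=int)
    for k in range(1, L):
        cand = delta[:, None] + np.log(v_fn(times[k] - times[k - 1]))
        back[k] = cand.argmax(axis=0)          # best predecessor per state
        delta = cand.max(axis=0) + np.log(B[:, obs[k]])   # Eq. 6.22
    path = [int(delta.argmax())]               # arg max_i δ_L(i)
    for k in range(L - 1, 0, -1):              # trace back through the array
        path.append(int(back[k, path[-1]]))
    return path[::-1], float(delta.max())
```

On tiny models the returned log-probability agrees with a brute-force maximization over all N^L state sequences, which is a convenient sanity check.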
6.2.2 Sequence Prediction
Sequence prediction deals with the estimation of the future behavior of a temporal sequence. Although not used for failure prediction in this thesis, other application areas
exist that take advantage of anticipating the further evolution of a given sequence.
Given a model and the beginning of a temporal sequence, sequence prediction addresses the question of how the sequence will evolve in the near future based on the characteristics expressed by the underlying model. More precisely, two different types of
sequence prediction can be distinguished:
1. What is the probability of the next observation of the sequence?
2. What is the probability that the underlying stochastic process will reach a certain
distinguished state within some time interval?
Probability of the Next Observation. In order to estimate the probability of the next observation, the following probability is defined:

ηt(ok) = P(OL+1 = ok, T ≤ t | tL, O0 . . . OL, λ) ;  t ≥ tL . (6.23)

Here, ηt(ok) is the probability that the next emitted observation symbol is ok, occurring at
time T ≤ t, given an HSMM λ, the beginning of an observation sequence o = O0 . . . OL,
and the time of occurrence of the last symbol tL. ηt(ok) can be computed as follows:
ηt(ok) = ∑_{j=1}^{N} P(SL+1 = sj, OL+1 = ok, T ≤ t | tL, o, λ) (6.24)

ηt(ok) = ∑_{j=1}^{N} P(OL+1 = ok | SL+1 = sj, T ≤ t, tL, o, λ) × P(SL+1 = sj, T ≤ t | tL, o, λ) . (6.25)

The first probability of Equation 6.25 is simply the observation probability for state sj:

P(OL+1 = ok | SL+1 = sj, T ≤ t, tL, o, λ) = bsj(ok) (6.26)

2 c.f., Equations 4.20–4.22
whereas the second probability in Equation 6.25 can be split up further:

P(SL+1 = sj, T ≤ t | tL, o, λ) (6.27)
= ∑_{i=1}^{N} P(SL+1 = sj, SL = si, T ≤ t | tL, o, λ) (6.28)
= ∑_{i=1}^{N} P(SL+1 = sj, T ≤ t | SL = si, tL, o, λ) P(SL = si | tL, o, λ) . (6.29)
The first term of the product in Equation 6.29 is the probability that the state process is in
state sj at time T ≤ t given that it was in state si at time tL. This equals the cumulative
probability distribution vij(d):

P(SL+1 = sj, T ≤ t | SL = si, tL, o, λ) = vij(t − tL) . (6.30)

The second term of the product in Equation 6.29 is the probability that the state process
resides in state si at the end of the observation sequence. This can be computed by use of
the forward algorithm:

P(SL = si | tL, o, λ) = P(o, SL = si | tL, λ) / P(o | λ) = αL(i) / P(o | λ) = αL(i) / ∑_{j=1}^{N} αL(j) . (6.31)
Summarizing the results, the probability that observation symbol ok will occur up to
time t in the future can be computed by

ηt(ok) = ∑_{j=1}^{N} bsj(ok) ∑_{i=1}^{N} vij(t − tL) αL(i) / P(o | λ) . (6.32)
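Equation 6.32 amounts to three small linear-algebra steps, which can be sketched as follows. This is illustrative code; `alpha_L` holds the (possibly unnormalized) final forward column, `v_fn` is assumed to return [vij(d)], and all names are assumptions.

```python
import numpy as np

def next_symbol_prob(alpha_L, B, v_fn, dt):
    """Sketch of Equation 6.32: η_t(o_k) for all symbols o_k at horizon
    dt = t − t_L. Normalizing α_L implements Equation 6.31, since
    P(o|λ) = Σ_j α_L(j)."""
    state_dist = alpha_L / alpha_L.sum()   # P(S_L = s_i | o, λ)  (Eq. 6.31)
    succ = state_dist @ v_fn(dt)           # Σ_i P(S_L=s_i|·) v_ij(t − t_L)
    return succ @ B                        # Σ_j (·) b_sj(o_k)    (Eq. 6.32)
```

Because the state distribution, the rows of [vij(d)], and the rows of B each sum to one, the returned vector of ηt(ok) values is itself a probability distribution over the alphabet.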
Probability to Reach a Distinguished State. Computing the probability of the next observation symbol involved one single state transition (see Equation 6.30). However, if
not the next observation symbol is of interest but the probability distribution to reach a
distinguished state, computation of the first-step successor is not sufficient. Instead, the
general probability to reach the distinguished state sd for the first time by time t, irrespective of the number of hops, is desired:

P(Sd = sd, Td ≤ t | o, λ) ;  Td = min( t : St = sd ) . (6.33)
The procedure to compute this probability involves two steps:
1. Based on the given observation sequence o and the model λ, compute the probability distribution for the last hidden state in the sequence P (SL = si | o, λ) using
Equation 6.31.
2. Use P (SL = si | o, λ) as the starting point to estimate the future behavior of the
system. The objective is the probability defined in Equation 6.33, which is called
first passage time distribution.
In principle, an estimation of future behavior should take into account both the process
of hidden state traversals and the generated observation symbols. Taking observation symbols into account results in a sum over all symbols for each state. However, only the
semi-Markov process of hidden state transitions has to be analyzed, since the observation
probabilities can be omitted due to ∑_{k=1}^{M} bsi(ok) = 1.
In order to compute the first passage time distribution, the so-called first step analysis
(Kulkarni [149]) is applied. The essence of first step analysis can be summarized as
follows:
In order to reach the designated state, the first step of the stochastic process
either reaches the state directly or the process transits to an intermediate state.
In the latter case, the designated state is then reached directly from the intermediate state or via another intermediate state. This establishes a recursive
computation scheme.
As in Equation 6.33, let Td denote the time to first reach the designated state sd and let
Fid(t) = P(Td ≤ t | SL = si) denote the probability to reach sd by time t given that the
process is in state si at the end of the observation sequence; then

Fid(t) = gid(t) + ∑_{j≠d} ∫_0^t dgij(τ) Fjd(t − τ) , (6.34)

where gid(t) is the cumulative probability distribution as defined in Section 6.1.1 and
∫_0^t dgij(τ) Fjd(t − τ) denotes a Lebesgue-Stieltjes integral (see, e.g., Saks [221]). Equation 6.34 is derived
from first step analysis: either state sd is reached directly within t —for which the probability is gid(t)— or via some intermediate state sj ≠ sd. In this case the transition to
sj takes time τ, and state sd is then reached from sj within time t − τ. As might have become
clear from the formula, this is a recursive problem, since starting from sj the destination
state may either be reached directly or via yet another intermediate state. However, the
duration of the transition from si to the intermediate state sj is not known. Therefore, all
possible values of τ have to be considered, which results in the integral with bounds 0
and t.
In order to solve the equation system defined by Equation 6.34, a recursive scheme
can be defined:

F^(0)_id(t) = 0
F^(n+1)_id(t) = gid(t) + ∑_{j≠d} ∫_0^t dgij(τ) F^(n)_jd(t − τ) . (6.35)

Kulkarni [149] showed that this recursion has the approximation property:

sup_{0≤τ≤t} | F^(n)_id(τ) − Fid(τ) | ≤ µ^⌊n/r⌋ , (6.36)

where µ and r are derived from a result on regular Markov renewal processes stating that
for any fixed t ≥ 0, an integer r and a real number 0 < µ < 1 exist such that

∑_j g^{∗r}_ij(t) ≤ µ (6.37)
and g^{∗r}_ij(t) denotes the r-th convolution of gij(t) with itself.
Since Fid(t) assumes the stochastic process to be initially in state si, the sum over
all states has to be computed, where the probability of each state is determined by Equation 6.31. Hence, in summary, the probability to reach state sd within time t is given
by:

P(Sd = sd, Td ≤ t | o, λ) = ∑_i Fid(t) P(SL = si | o, λ) . (6.38)
Computation of Equation 6.35 can be quite costly, depending on n, the maximum number of transitions up to time t that are considered in the approximation. Additionally, each step involves a solution of the Lebesgue-Stieltjes integral, which must in
many cases be solved numerically, as there are many distributions for which no analytical representation exists (e.g., the cumulative distribution of a Gaussian random variable).
However, computational complexity can be limited, since the maximum number of transitions is commonly limited by the application (in most applications, there is a minimum
delay between successive observations). Furthermore, a minimum delay between observations also limits the number of points in time for which the Lebesgue-Stieltjes integral
has to be approximated.
It should also be noted that Fid(t) depends on the parameters of the HSMM but not on
the observation sequence: hence, the complex computations including the integrations can
be precomputed. An online evaluation of Equation 6.38 only involves computation of
Equation 6.31 for each state, multiplication with the precomputed Fid(t), and summing up the
products.
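The precomputation of Fid(t) can be sketched numerically. The discretization below, replacing the Lebesgue-Stieltjes integral by a sum over kernel increments on a fixed time grid, is an assumption of this sketch, not the thesis procedure; all names are illustrative.

```python
import numpy as np

def first_passage(P, D, d, t_grid, n_iter=20):
    """Numerical sketch of the recursion in Equation 6.35. F[i, k]
    approximates F_id(t_grid[k]); the Lebesgue-Stieltjes integral is
    replaced by a discrete convolution over kernel increments Δg_ij."""
    N, K = P.shape[0], len(t_grid)
    # kernel g_ij(t) = p_ij d_ij(t) on the grid, and its increments
    g = np.array([[[P[i, j] * D[i][j](t) for t in t_grid]
                   for j in range(N)] for i in range(N)])
    dg = np.diff(g, axis=2, prepend=0.0)
    F = np.zeros((N, K))                      # F_id^(0)(t) = 0
    for _ in range(n_iter):
        F_new = g[:, d, :].copy()             # direct transition: g_id(t)
        for i in range(N):
            for j in range(N):
                if j == d:
                    continue
                for k in range(K):            # ∫_0^t dg_ij(τ) F_jd(t−τ)
                    F_new[i, k] += np.sum(dg[i, j, :k + 1] * F[j, k::-1])
        F = F_new
    return F
```

In a two-state model where s0 can only jump directly to the target, the computed F0d(t) reduces to the kernel g0d(t) itself, which provides a simple check of the discretization.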
6.3 Training Hidden Semi-Markov Models
In the previous sections it has been assumed that the parameters λ of an HSMM are given. This
section deals with the task of estimating the parameters from training sequences. For this
purpose, the Baum-Welch algorithm for standard HMMs (see Section 4.1.2) is adapted to
hidden semi-Markov models.
6.3.1 Beta, Gamma and Xi
In addition to the forward variable αk (i), reestimation formulas for standard HMMs are
based on a backward variable βt (i), a state probability γt (i), and a transition probability
ξt (i, j). The same applies to reestimation for HSMMs, which uses equivalent variables
βk (i), γk (i) and ξk (i, j).
Analogously to standard HMMs, the backward variable βk(i) denotes the probability
of the rest of the observation sequence Ok+1 . . . OL given that the process is in state si at
time tk:

βk(i) = P(Ok+1 . . . OL | Sk = si, λ) . (6.39)
βk(i) is computed backwards, starting from time tL:

βL(i) = 1
βk(i) = ∑_{j=1}^{N} vij(dk+1) bsj(Ok+1) βk+1(j) . (6.40)
γk(i) denotes the probability that the stochastic process is in state si at the time when
the k-th observation occurs. It can be computed from αk(i) and βk(i) following the same
scheme as presented in Section 4.1.1:

γk(i) = αk(i) βk(i) / ∑_{i=1}^{N} αk(i) βk(i) . (6.41)
ξk(i, j) is the probability that the stochastic process is in state si at time tk and in
state sj at time tk+1:

ξk(i, j) = P(Sk = si, Sk+1 = sj | o, λ) (6.42)

ξk(i, j) = αk(i) vij(dk+1) bsj(Ok+1) βk+1(j) / ∑_{i=1}^{N} ∑_{j=1}^{N} αk(i) vij(dk+1) bsj(Ok+1) βk+1(j) . (6.43)
As was the case for standard HMMs, the expected number of transitions from state si
to state sj is the sum over time:

∑_{k=0}^{L−1} ξk(i, j) . (6.44)
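The forward and backward passes and the derived quantities γ and ξ can be sketched together. This is an illustrative, unscaled sketch (hence only suitable for short sequences); `v_fn(d)` is assumed to return [vij(d)] and all names are assumptions.

```python
import numpy as np

def e_step(pi, B, v_fn, obs, times):
    """Sketch of the E-step quantities: α (Eq. 6.16), β (Eq. 6.40),
    γ (Eq. 6.41) and ξ (Eq. 6.43) for one observation sequence."""
    N, L = len(pi), len(obs)
    alpha = np.zeros((L, N))
    beta = np.ones((L, N))                    # β_L(i) = 1
    alpha[0] = pi * B[:, obs[0]]
    for k in range(1, L):
        alpha[k] = (alpha[k - 1] @ v_fn(times[k] - times[k - 1])) * B[:, obs[k]]
    for k in range(L - 2, -1, -1):
        V = v_fn(times[k + 1] - times[k])     # v_ij(d_{k+1})
        beta[k] = V @ (B[:, obs[k + 1]] * beta[k + 1])
    gamma = alpha * beta                      # Equation 6.41 (then normalized)
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = np.zeros((L - 1, N, N))
    for k in range(L - 1):                    # Equation 6.43
        V = v_fn(times[k + 1] - times[k])
        xi[k] = alpha[k][:, None] * V * (B[:, obs[k + 1]] * beta[k + 1])[None, :]
        xi[k] /= xi[k].sum()
    return gamma, xi
```

A useful consistency property, which follows directly from the definitions, is that summing ξk(i, j) over the successor state j recovers γk(i).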
6.3.2 Reestimation Formulas
As has been described in Section 4.1.2, the most common training procedure of standard
HMMs is the Baum-Welch algorithm, which is an iterative procedure. Similar to standard
HMMs, the “expectation” step comprises computation of α, β, and subsequently γ and
ξ. Then, the “maximization” step is performed where model parameters are adjusted
using the values computed in the expectation step. This section provides the formulas
for the maximization step, which are derived in the course of the proof of convergence in
Section 6.5.
Initial probabilities π

π̄i ≡ (expected number of sequences starting in state si) / (total number of sequences) ≡ γ0(i) . (6.45)
Emission probabilities bsi(oj)

b̄i(oj) ≡ (expected number of times observing oj in state si) / (expected number of times in state si)
≡ ∑_{k=0, Ok=oj}^{L} γk(i) / ∑_{k=0}^{L} γk(i) . (6.46)
Except for a different notation of time, the formulas are the same as for standard HMMs.
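Given γ, these two re-estimation formulas can be sketched directly. This is illustrative code, not the thesis implementation; `gamma` is an L × N array of γk(i) values for one sequence and `M` the alphabet size.

```python
import numpy as np

def reestimate_pi_B(gamma, obs, M):
    """Sketch of the re-estimation of π (Eq. 6.45) and the emission
    probabilities B (Eq. 6.46) from γ_k(i) of a single sequence."""
    L, N = gamma.shape
    pi_new = gamma[0]                      # π̄_i ≡ γ_0(i)
    B_new = np.zeros((N, M))
    for k in range(L):                     # numerator: Σ_{k: O_k = o_j} γ_k(i)
        B_new[:, obs[k]] += gamma[k]
    B_new /= gamma.sum(axis=0)[:, None]    # denominator: Σ_k γ_k(i)
    return pi_new, B_new
```

Since every γ column sums to one over the states, the re-estimated rows of B are again proper probability distributions, as Equation 6.46 requires.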
Transition parameters. Since the stochastic process underlying the state traversals is
changed from a discrete-time Markov chain for standard HMMs to a semi-Markov process in the case of HSMMs, maximization of the transition parameters is quite different from
standard HMMs. The key difficulty is that the parameters of all outgoing transitions from si
occur in two places: once for the transition si → sj, j ≠ i, and once in the computation of the probability that the process has stayed in state si. This can be seen from
the definition of vij(d) (see Equation 6.13), which is reiterated here in an extended form for
convenience:

vij(dk) = pij dij(dk)                              if j ≠ i
vij(dk) = 1 − ∑_{h=1, h≠i}^{N} pih dih(dk)         if j = i .   (6.47)
The fact that pij occurs in both cases of the equation prohibits applying formulas similar to those
for standard HMMs. Instead, a gradient-based iterative optimization is used to maximize the
likelihood of the training sequence with respect to the transition parameters, which are
specifically:
• limiting transition probabilities pij
• kernel parameters θij,r for each transition duration dij(dk) (c.f., Equation 6.8)
• kernel weights wij,r for each transition duration dij(dk) (c.f., Equation 6.9)
As is derived in detail in Section 6.5, optimization is performed for each state si by maximizing the objective function Qvi:

Qvi = ∑_k [ ∑_{j≠i} ξk(i, j) log( pij dij(dk) ) + ξk(i, i) log( 1 − ∑_{h≠i} pih dih(dk) ) ] . (6.48)
The gradient comprises the partial derivatives of the objective function with respect to the HSMM
transition parameters pij, wij,r, θij,r. The derivative with respect to pij is given by:

∂Qvi / ∂pij = ∑_k [ ξk(i, j) / pij − ξk(i, i) dij(dk) / ( 1 − ∑_{h≠i} pih dih(dk) ) ] ;  i ≠ j . (6.49)

For i = j, the derivative is equal to zero.
The derivative of Qvi with respect to wij,r can be computed as follows:

∂Qvi / ∂wij,r = ( ∂Qvi / ∂dij(dk) ) ( ∂dij(dk) / ∂wij,r ) , (6.50)

where

∂dij(dk) / ∂wij,r = κij,r(dk | θij,r) (6.51)

and

∂Qvi / ∂dij(dk) = ∑_k [ ξk(i, j) / dij(dk) − ξk(i, i) pij / ( 1 − ∑_{h≠i} pih dih(dk) ) ] ;  i ≠ j . (6.52)
Again, for i = j, the derivative is equal to zero.
The derivative of Qvi with respect to θij,r is determined by:

∂Qvi / ∂θij,r = ( ∂Qvi / ∂dij(dk) ) ( ∂dij(dk) / ∂κij,r ) ( ∂κij,r / ∂θij,r ) , (6.53)

where ∂Qvi / ∂dij(dk) is as given by Equation 6.52,

∂dij(dk) / ∂κij,r = wij,r , (6.54)

and ∂κij,r / ∂θij,r depends on the type of probability distribution. For example, if an exponential distribution is used, the derivative is given by:

∂κij,r / ∂θij,r = ∂/∂λij,r ( 1 − e^(−λij,r dk) ) = dk e^(−λij,r dk) . (6.55)
Gradient-based optimization techniques are usually iterative, using a search direction
s(n) that is at least in part based on the gradient. The algorithms perform an update
of length η in the direction of s(n). Various techniques exist to estimate η, including line
search and the Goldstein-Armijo rule (see, e.g., Dennis & Moré [76]). The next point of
evaluation in the parameter space λ is determined by:

λ(n+1) = λ(n) + η s(n) . (6.56)

The search direction s(n) is given by:

s(n) = ∇Qvi |λ(n) , (6.57)

where ∇Qvi |λ(n) denotes the gradient vector of Qvi with respect to the parameters,
evaluated at the point λ(n). A slight modification is used by conjugate gradient approaches (Hestenes & Stiefel [119]), where the next search direction is obtained by:

s(n) = ∇Qvi |λ(n) + ζ s(n−1) , (6.58)

where ζ is a scalar that can be computed from the gradient.3
Several equality constraints apply to the optimization problem for a fixed state si, such
as:

1. ∑_j pij = 1 (6.59)

2. ∀j : ∑_r wij,r = 1 , (6.60)
which results in a restricted search space for the gradient method. Equality constraints
of this form can be incorporated by projecting the search direction onto the hyperplane
defined by the constraints. The following example explains this procedure. Assume that
state si has J outgoing transitions, and each duration distribution dij (t) consists of exactly
one Gaussian distribution having two parameters µij and σij . Then, the gradient vector
3
See, e.g., Shewchuk [239]
∇Qvi has 3J components.4 The constraint on the limiting transition probabilities pij defines
the hyperplane given by

∑_j pij − 1 = 0 . (6.61)

Let M denote the 3J × J matrix5 of orthonormal base vectors for the hyperplane, translated to cross the origin of the parameter space. The new gradient vector (∇Qvi)′, which
obeys the equality constraints, is obtained by projecting onto the hyperplane via matrix multiplication:

(∇Qvi)′ = (M M^T) ∇Qvi . (6.62)
If several equality constraints apply to the optimization problem, M is the matrix of orthonormal base vectors for the intersection of all hyperplanes induced by the constraints. Moreover, equality constraints are also obeyed by conjugate gradient approaches, since both (∇Qvi)′ and s(n) lie within the constraint hyperplanes, and hence a linear combination of the two vectors also results in a search direction within the hyperplane.
The variables also have to satisfy inequality constraints. For example, probabilities pij can only take values within the interval [0, 1]. In order to account for this, η must be restricted such that the optimization algorithm cannot leave the space of feasible parameter values. This can be achieved by checking whether λ(n+1) is in the range of feasible values. If not, η must be made smaller, which can be done either by computing the intersection of the line λ(n) + a s(n) with the bordering hyperplane6 or by other heuristics.
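As an illustration of the two mechanisms just described, the following sketch projects a raw gradient onto the sum-to-one hyperplane and then shrinks the step size so that no probability leaves [0, 1]. All numeric values are hypothetical; the mean-subtraction shortcut used here is the closed form of the (M M^T) multiplication for this single constraint:

```python
def project_gradient(grad):
    """Project grad onto the hyperplane sum(p) = 1, i.e. remove the
    component along the constraint normal (1, 1, ..., 1).  For this
    single constraint, this equals the (M M^T) product of Eq. 6.62."""
    mean = sum(grad) / len(grad)
    return [g - mean for g in grad]

def clip_step(p, s, eta):
    """Shrink the step size eta so that p + eta*s stays inside [0, 1]^J
    (inequality constraints), i.e. compute the intersection with the
    bordering hyperplanes such as p_ij = 0."""
    for pi, si in zip(p, s):
        if si < 0:
            eta = min(eta, (0.0 - pi) / si)   # would hit p_ij = 0
        elif si > 0:
            eta = min(eta, (1.0 - pi) / si)   # would hit p_ij = 1
    return eta

p = [0.2, 0.5, 0.3]          # current transition probabilities (hypothetical)
grad = [1.0, -2.0, 0.5]      # raw gradient of Q w.r.t. p (hypothetical)

s = project_gradient(grad)   # search direction within the constraint plane
eta = clip_step(p, s, 0.1)
p_new = [pi + eta * si for pi, si in zip(p, s)]
# sum(p_new) remains 1 because sum(s) == 0, and all entries stay feasible
```

The projected direction has components summing to zero, so moving along it never changes the total probability mass; the clipping only ever shortens the step, so the ascent property of the gradient step is preserved.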
6.3.3 A Summary of the Training Algorithm
The goal of the training procedure is to adjust the model parameters λ such that the likelihood of a given training sequence o is maximized. The training algorithm only affects π, B, P, and D(t), but not the structure of the HSMM. The structure consists of
• the set of states S = {s1 , . . . , sN },
• the set of symbols O = {o1, . . . , oM}, which is also called the alphabet,
• the topology of the model. It defines which of the N states can be initial states, which of the potentially N × N transitions can be traversed by the stochastic process, and which of the potentially N × M emissions are available in each state.
Technically, a transition si → sj is “removed” by setting pij = 0. The same holds
for the initial state distribution π and the emission probabilities: if bsi (ok ) is set to
zero, state si cannot generate observation symbol ok . Since the training algorithm
can never assign a non-zero value to probabilities that are initialized to zero, it does
not change the structure of the HSMM.
• specification of the transition durations D(t). This includes the number R and types of kernels κij,r for each existing transition. The structure may also comprise the specification of additional parameters that are not adjusted by the training procedure, such as upper and lower bounds for uniform background distributions, which need to be set up before training starts.

4 Since there is only one duration distribution, no weights wij,r are needed. Hence each of the J outgoing transitions is determined by µij, σij, and pij.
5 Equation 6.61 defines a J-dimensional hyperplane in 3J-dimensional space.
6 E.g., defined by pij = 0.
Having specified the model structure, the training algorithm performs the steps shown in Figure 6.4 in order to adjust the parameters λ such that the sequence likelihood P(o | λ) reaches at least a local maximum.
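The outer loop of this procedure can be condensed into the following skeleton. The helper functions e_step, reestimate_pi_b, and optimize_durations stand in for the equations cited in Figure 6.4; the stub implementations below exist only to make the skeleton runnable and are not part of the thesis:

```python
def train_hsmm(lam, obs, tol=1e-4, max_iter=100):
    """Outer loop of the HSMM training procedure (cf. Figure 6.4):
    iterate E-step, analytic reestimation, and embedded gradient ascent
    until the gain in sequence likelihood drops below a bound."""
    old_ll = float("-inf")
    for _ in range(max_iter):
        stats = e_step(lam, obs)                # alpha, beta, gamma, xi
        ll = stats["log_likelihood"]            # sequence likelihood
        if ll - old_ll < tol:                   # convergence test (step 6)
            break
        lam = reestimate_pi_b(lam, stats)       # adjust pi and B (step 4)
        lam = optimize_durations(lam, stats)    # gradient ascent on G (step 5)
        old_ll = ll
    return lam, old_ll

# Dummy stand-ins so the skeleton runs; they mimic a monotonically
# improving likelihood and do NOT implement the actual equations:
def e_step(lam, obs):
    return {"log_likelihood": -1.0 / (1 + lam["iters"])}

def reestimate_pi_b(lam, stats):
    return lam

def optimize_durations(lam, stats):
    return {**lam, "iters": lam["iters"] + 1}

model, final_ll = train_hsmm({"iters": 0}, obs=[])
```

The essential structural point is that the likelihood is evaluated once per outer iteration and the loop terminates either on a likelihood gain below the bound or on an iteration limit.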
Some notes on the training procedure: gradient-based maximization within an EM algorithm has been used to train standard HMMs, e.g., in Wilson & Bobick [279]. Such an approach is called a Generalized Expectation-Maximization algorithm. If a conjugate gradient approach is applied, the resulting HMM learning algorithm is called Expectation Conjugate Gradient (ECG). Under certain conditions, ECG performs even better than the original Baum-Welch algorithm (Salakhutdinov et al. [222]), but computational complexity is increased. However, complexity can be limited:
• The number of parameters that have to be estimated depends heavily on the number of outgoing transitions (J). This, in turn, depends on the topology of the model: if, for example, the topology is a simple chain, then each state, except for the last one, has only one outgoing transition. In the case of an ergodic topology, where every state is connected to every other, the number equals N − 1.
• The kernel weights wij,r do not necessarily need to be optimized. If the number of
parameters is too large, the weights of the convex combination can simply be fixed,
which reduces the number of parameters by J × R̄, where R̄ denotes the average
number of kernels per outgoing transition.
• The number of kernels may be reduced when some duration background distribution is used. If specified a priori, background distributions do not increase the size of the parameter vector.
• As is shown in Section 6.5, the overall EM algorithm also converges if the sequence likelihood is merely increased by a sufficiently large amount rather than maximized. Therefore, gradient-based optimization can be stopped after a few iterations.
• Since the optimization algorithm is based on the gradient, only cumulative distributions dij(dk) can be used for which the derivatives with respect to their parameters are available. However, this is the case for many widespread distributions. See Appendix V for some examples.
The foregoing notes have mainly addressed the embedded gradient-based optimization of Qvi. Regarding the entire training procedure, the following notes should be kept in mind when applying HSMMs as a modeling technique:
• As for standard HMMs, the scaling factors sk used to scale αk(i) (c.f., Equation 6.18) can also be used to scale βk(i). ξ and γ can then be computed on the basis of the scaled α′ and β′.
• It has been shown that for large models, results and speed of convergence can be improved if prior knowledge is incorporated into parameter initialization. For example, the lengths of failure sequences used for training of one model show a certain distribution with respect to the number of observations. This can be exploited to come up with a better guess for the initial probabilities π. Additionally, initialization of observation probabilities can be improved by taking the prior distribution of symbols into account. Other techniques first apply the Viterbi algorithm to come up with an initial assignment of states to observations, as described, e.g., in Juang & Rabiner [138]. Similar techniques can also be used to obtain an initial guess for transition durations.

1. Initialize the model by assigning values to π, B, and G(t) for all entries that exist in the structure. This constitutes λold.

2. Compute αk(i) by Equation 6.16, βk(i) by Equation 6.40, γk(i) by Equation 6.41, and ξk(i, j) by Equation 6.43, using λold and the observation sequence o.

3. Compute the sequence likelihood P(o | λold) by Equation 6.17.

4. Adjust π by Equation 6.45 and B by Equation 6.46, resulting in λnew_π and λnew_B.

5. Reestimate the parameters of G by the embedded optimization algorithm. For each state si, perform:

   (a) Compute the gradient vector g(n) of Qvi with respect to the parameters of G at λ(n)_Gi, which is either initialized by λold_Gi or obtained from a previous iteration:

   g(n) = (∇Qvi)|λ(n)_Gi = [ ∂Qvi/∂pij , . . . , ∂Qvi/∂wij,r , . . . , ∂Qvi/∂κij,r , . . . ]|λ(n)_Gi ,

   where Qvi is given by Equation 6.48.

   (b) Project the gradient onto the hyperplane of feasible solutions for the equality constraints. This is achieved by matrix multiplication:

   g′(n) = (M M^T) g(n) ,

   where M denotes the matrix of orthonormal base vectors for the hyperplanes defined by equality constraints, such as the condition that the sum of probabilities must equal one.

   (c) Determine a search direction s(n) from g′(n) (and possibly s(n−1)) and a step size η, e.g., by line search. Ensure that the search vector does not cross the boundaries induced by inequality constraints, such as the condition that probabilities must lie within [0, 1]. The next point in search space is obtained by:

   λ(n+1)_Gi = λ(n)_Gi + η s(n) .

   (d) Repeat from Step 5a until the step size is less than some bound or a maximum number of steps is reached. The result constitutes λnew_Gi.

6. Set λold := λnew and repeat Steps 2 to 6 until the difference in observation sequence likelihood P(o | λnew) − P(o | λold) is less than some bound.

Figure 6.4: Summary of the complete training algorithm for HSMMs.
• It has also been shown that results can be improved by setting to zero all observation probabilities bi(ok) that are less than some threshold (Rabiner [210]).
• The training procedure improves model parameters until some local maximum is
reached, which can be significantly lower than the global maximum. Therefore, in
this thesis the training procedure is performed several times with different (random)
parameter initializations. Other approaches are discussed in the outlook (Chapter 12).
• Gradient-based optimization could be applied to the sequence likelihood directly (and not to the Q-function, which is a lower bound for the likelihood). However, first, the dimensionality of the optimization parameter space would be dramatically increased, and second, the efficiency gained from solving parts of the optimization problem analytically would be lost.
• The training procedure described considers only one single training sequence. The extension to multiple sequences is similar to that for standard HMMs, with the slight difference that the gradient takes all training sequences into account. However, the vectors λ(n) → λ(n+1) for single sequences can be linearly combined, exploiting the fact that the log-likelihood for multiple sequences is the sum of the single-sequence log-likelihoods.
• Background distributions for observation probabilities B can be applied to HSMMs in the same way as to standard HMMs. They are frequently used to circumvent one of the major drawbacks of the Baum-Welch algorithm: observation probabilities bsi(oj) are computed from the number of occurrences of observation oj (c.f., Equation 6.46). If one specific symbol oc has not occurred in the training data, bsi(oc) is set to zero for all states si in the first iteration of the training algorithm. Hence, in the forward algorithm, any observation sequence containing oc is assigned a sequence likelihood of zero (c.f., Equation 6.16). Background probabilities remedy this problem by substituting bsi(oj) with b′si(oj) > 0, as defined in the following.
Let Pb(oj) denote a discrete probability distribution over all observation symbols. The observation probabilities of a hidden Markov model become a convex combination of the original observation probabilities bsi(oj) and Pb(oj):

b′ij = b′si(oj) = ρi Pb(oj) + (1 − ρi) bsi(oj) ;   0 ≤ ρi ≤ 1 ,   (6.63)
where ρi is a state-dependent weighting factor.
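Equation 6.63 amounts to the following smoothing step; the symbol counts, the uniform background, and ρ = 0.1 are hypothetical choices:

```python
def smooth_emissions(b_i, background, rho):
    """Convex combination of Eq. 6.63: mixes the trained emission
    probabilities of one state with a background distribution so that
    no symbol keeps probability zero."""
    assert abs(sum(b_i) - 1.0) < 1e-9
    assert abs(sum(background) - 1.0) < 1e-9
    return [rho * pb + (1.0 - rho) * b for b, pb in zip(b_i, background)]

# One state emits symbols {o1, o2, o3}; o3 never occurred in training:
b_state = [0.7, 0.3, 0.0]
p_b = [1 / 3, 1 / 3, 1 / 3]      # e.g., a uniform background (assumption)
b_prime = smooth_emissions(b_state, p_b, rho=0.1)
# every entry of b_prime is now strictly positive,
# and the result still sums to one
```

Because both inputs are proper distributions, the convex combination is again a proper distribution, so no renormalization is needed after smoothing.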
6.4 Difference Between the Approach and other HSMMs
The term “hidden semi-Markov model” has been used for various models, since “semi”
simply indicates that a model employs some probability distribution for representation of
time. Due to the fact that the models have been developed in the area of speech recognition
and signal processing, almost all models assume input data to be an equidistant time
series, which leads to the simplification that a minimum time step exists and durations
can be handled by multiples of the time step.
Speech recognition. In order to better explain the differences between the approach presented here and previously published work, the task of assigning phonemes7 to a speech signal is taken as an example. A plethora of work exists on this topic,8 introducing various methods and techniques to improve speech recognition quality; however, the focus here is on duration modeling and only the basic principles are explained.
The process of phoneme recognition is sketched in Figure 6.5.

Figure 6.5: A simplified sketch of phoneme assignment to a speech signal.

Starting from the top of the figure, the analog sound signal is sampled and converted into a digital signal. Portions
of the sampled signal are then analyzed in order to extract features of the signal. Feature
extraction involves, e.g., a short-time Fourier transform and various other computations.
Since in this thesis only discrete emissions are considered, assume that the result of feature
extraction is one symbol out of a discrete set, denoted by “A” and “B”.9 Subsequently,
the sequence of features is analyzed by several HMMs: each HMM models one phoneme, and sequence likelihood is computed for each HMM using the forward or Viterbi algorithm. In order to assign a phoneme to the sequence of features, some classification is performed.

7 A phoneme is the smallest unit of speech that distinguishes meaning.
8 For an overview, see, e.g., Cole et al. [62].
9 Usually, it is a feature vector containing both discrete and continuous values.
As has been pointed out by several authors (see, e.g., Russell & Cook [218]), the quality of assignment can be improved by introducing the notion of state duration: Rather than
traversing to the next state each time an observation symbol (i.e., a feature) occurs, the
stochastic process may reside in one state for a certain time, generating several subsequent observation symbols before traversing to the next state. Figure 6.6 (a) shows the case where the occurrence of each feature symbol corresponds to a state transition.

Figure 6.6: Assigning states si to observations (A or B). (a) shows the case where a state transition takes place each time an observation symbol occurs. If state durations are introduced, the process may reside in one state, accounting for several subsequent observations. However, several state sequences are possible, of which a few are shown in (b)-(d).

Introducing the notion of state duration, the process of state transitions is decoupled from the
occurrence of observation symbols. However, this flexibility results in several potential
state sequences, as can be seen from Figures 6.6 (b) to (d). Considering all potential state
sequences increases the complexity to compute sequence likelihood since all possible
state paths have to be summed up. To be precise, the number of potential paths increases
from N L where N denotes the number of states and L the length of the sequence (c.f.,
Equation 4.7 on Page 58) to
L−1
X
k=0
!
L−1
N (N − 1)k ,
k
(6.64)
where k is the number of state transitions that take place.10 The major drawback of this is that dynamic programming approaches such as the forward algorithm cannot be applied.

10 It is assumed that C(x, 0) = 1.
Figure 6.7: The trellis structure for the forward algorithm with duration modeling. A maximum duration of D = 2 is used. Thick lines highlight the terms involved in the computation of α3(1).
This is due to the fact that the Markov assumptions do not apply: the condition that all the
information needed to compute αt (j) is included in α’s of the previous time step is not
fulfilled for variable state durations.
Concrete models used in speech recognition have typically applied one restriction in order to arrive at a feasible algorithm: they include an upper bound for state durations (denoted by D). This leads to the following forward-like algorithm (see, e.g., Mitchell & Jamieson [183]):

αt(j) = Σ_{i=1}^{N} Σ_{τ=1}^{min(D,t)} αt−τ(i) aij dj(τ) ∏_{m=0}^{τ−1} bsj(Ot−m) ,   (6.65)
where αt(j) denotes the probability of the observation sequence over all state sequences in which state sj ends at time t. The algorithm includes an additional sum over τ, the duration for which the process stays in state sj, and dj(τ) specifies the probability distribution of that duration. The product over bsj(·) results from the fact that during its stay, state sj has to produce all the emission symbols Ot−τ+1 . . . Ot. Similar to the standard forward algorithm, the approach can be visualized by a trellis structure, as shown
in Figure 6.7. As can be seen from the figure, the major drawback of the algorithm is its computational complexity: according to Ramesh & Wilpon [211], it increases by a factor of D²/2. Various modifications to this approach have been proposed, of which the major categories have been described in Chapter 4.
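A direct transcription of Equation 6.65 might look as follows; an initial-segment term (for durations that start the sequence, which the recursion leaves implicit) has been added, and the two-state parameters and D = 2 are hypothetical:

```python
def forward_duration(pi, a, b, d, obs, D):
    """Forward-like recursion of Eq. 6.65 for an explicit-duration HMM.
    alpha[t][j] = P(observations up to t, state j ends at t).
    pi: initial probs, a: transition probs, b: emission probs,
    d[j][tau]: probability of staying tau steps in state j (tau = 1..D)."""
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    for t in range(T):
        for j in range(N):
            for tau in range(1, min(D, t + 1) + 1):
                # emissions produced while residing in state j
                emit = 1.0
                for m in range(tau):
                    emit *= b[j][obs[t - m]]
                if t - tau < 0:
                    # segment starts the sequence: use initial probability
                    alpha[t][j] += pi[j] * d[j][tau] * emit
                else:
                    alpha[t][j] += sum(alpha[t - tau][i] * a[i][j]
                                       for i in range(N)) * d[j][tau] * emit
    return alpha

pi = [0.5, 0.5]                    # initial state probabilities (hypothetical)
a = [[0.0, 1.0], [1.0, 0.0]]       # transitions (no self-transitions)
b = [[0.9, 0.1], [0.2, 0.8]]       # emission probabilities
d = [[0.0, 0.5, 0.5],              # d[j][tau], tau = 1..D
     [0.0, 0.5, 0.5]]

alpha = forward_duration(pi, a, b, d, obs=[0, 0], D=2)
```

The triple loop over t, states, and durations makes the D²/2-type overhead mentioned above visible: every additional unit of maximum duration adds both another τ term and another factor in the emission product.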
Temporal sequences. The essential difference between speech recognition and temporal sequence processing is that symbols occur equidistantly in the first case but not in the latter. The periodicity in speech recognition is caused by the underlying sampling of an analog signal, whereas in temporal sequences such as error sequences, the occurrence of symbols is event-triggered. This difference leads to the following conclusions:
• Using discrete timesteps of fixed size is appropriate for speech signals but not for
temporal sequences due to the reasons given in the discussion of time-slotting (c.f.
Section 4.2.1).
• In event-driven temporal sequences, temporal variability is already included in the
observation sequence itself. Therefore, a tight relation between hidden state transitions and occurrence of observation symbols can be assumed. Specifically, the
model presented in this thesis assumes a one-to-one mapping.
• The one-to-one mapping between state transitions and observation symbol occurrence has two advantages:
1. It enforces the Markov assumption, which leads to an efficient forward algorithm that is very similar to the standard forward algorithm of discrete-time
HMMs. Specifically, the sum over durations τ in Equation 6.65 is avoided
so that the algorithm belongs to the same complexity class as the standard
forward algorithm, as is shown in Section 6.7.
2. Durations can be assigned to transitions rather than to states, which increases
modeling flexibility and expressiveness. Obviously, state durations are a special case of transition durations.11
Considering Equation 6.13, the approach is related to inhomogeneous HMMs (IHMMs). However, the process must still be called homogeneous, since the probabilities vij(d) stay the same regardless of the time when the transition takes place, i.e., at the beginning or end of the sequence. Furthermore, in contrast to IHMMs, continuous duration distributions rather than discrete ones are used in this thesis.
6.5 Proving Convergence of the Training Algorithm
The objective of the training procedure is to find a set of parameters λopt that maximizes the sequence likelihood of the training data:

λopt = argmax_λ P(o | λ) .   (6.66)
The training procedure described here is an Expectation-Maximization (EM) algorithm (Dempster et al. [75]). It improves sequence likelihood until at least some local maximum is reached. The algorithm is closely related to the Baum-Welch algorithm, whose convergence was originally proven by Baum & Sell [25] without the framework of EM algorithms. However, the framework of EM algorithms provides a view on the problem that allows for simpler proofs. This approach is adapted to prove convergence of the training algorithm presented here. In the following, first a general proof of convergence for EM algorithms by Minka [181] is presented, which is subsequently adapted to the specifics of HSMMs.
6.5.1 A Proof-of-Convergence Framework
EM algorithms are maximum-a-posteriori (MAP) estimators and hence rely on the presence of some data that has been observed, which in this case refers to the observation sequence o forming the dataset O. The goal is to maximize the data likelihood P(o|λ). The potential and wide range of application of EM algorithms stem from two properties:
1. EM algorithms build on lower bound optimization (Minka [181]). Instead of optimizing a complex objective function directly, some simpler lower bound is optimized.
2. EM algorithms can handle incomplete / unobservable data.
11 In this case, all outgoing transitions have the same duration distribution.
Lower bound optimization. In lower bound optimization, which is also called the primal-dual method (Bazaraa & Shetty [26]), a computationally intractable objective function is optimized by repetitive maximization of some lower bound that is easier to compute. More specifically, if o(λ) denotes the objective function, a simpler lower bound b(λ) that equals o(λ) at the current estimate of λ is maximized (see Figure 6.8). Maximization of b(λ) yields a new estimate for λ, for which the objective o(λ) is increased (except for the case when the derivative of the objective equals zero, which is a local optimum). If the objective function is continuous and bounded, as is the case for HSMMs, iteratively increasing the lower bound converges to at least a local optimum of the objective function.
Figure 6.8: Lower bound optimization. Starting from the current estimate of parameter λ,
a lower bound b(λ) to the objective function o(λ) is determined that is easier to
maximize than the objective function. If the lower bound equals the objective function at the current estimate of λ, maximization of the lower bound leads to a new
estimate of λ for which the value of the objective function is increased. Performing
this procedure iteratively yields at least a local maximum of the objective function.
From this, the following iterative optimization scheme can be derived:
1. Determine a lower bound simpler than the objective function that equals the objective function at the current estimate of parameter λ.
2. Determine the maximum of the lower bound, yielding the next estimate of λ.
3. Repeat until the increase of the objective function is below some threshold.
Compared to this, gradient-based optimization approaches approximate the objective
function by the tangent to the objective function at the current estimate for λ and move
along that line for some distance to obtain the new estimate.
Handling of unobservable data. Unobservable data describes the situation where some quantity used in modeling cannot be observed by measurements. In the case of HMMs and their variants, this refers to the fact that the sequence of hidden states s which the stochastic process has traversed cannot be observed. Analogously to O, let S = {s} denote the set of state sequences s of the training data set. Two data sets must be distinguished: the complete dataset Z = (O, S) includes both observed and unknown data, while the incomplete dataset consists only of observed data. The objective is to optimize the data likelihood of the (observable) incomplete dataset, P(o|λ). EM algorithms deal with this problem by assuming the incomplete data likelihood to be the marginal of the complete data set. Hence,
P(o|λ) = ∫_s P(o, s|λ) ds .   (6.67)
The Q-Function. In order to determine a lower bound on the data likelihood, Jensen's inequality [133] can be used:

Σ_j aj g(j) ≥ ∏_j g(j)^aj ;   aj ≥ 0,  Σ_j aj = 1,  g(j) ≥ 0 ,   (6.68)

stating that the (weighted) arithmetic mean is greater than or equal to the (weighted) geometric mean. Application to Equation 6.67 requires extension by some arbitrary function q(s) as follows (see Minka [181]):

P(o|λ) = ∫_s ( P(o, s|λ) / q(s) ) q(s) ds ≥ ∏_s ( P(o, s|λ) / q(s) )^{q(s) ds} = f(λ, q(s)) ,   (6.69)

where

∫_s q(s) ds = 1 .   (6.70)

f(λ, q(s)) is the lower bound and q(s) is some arbitrary probability density over s.
The function q(s) needs to be chosen such that the lower bound touches the objective function at the current estimate of the parameters λold (see Figure 6.8). It can be shown that setting

q(s) = P(s | o, λold)   (6.71)

fulfills this requirement (see Minka [181]).
Maximization of the lower bound is performed by maximizing its logarithm. Taking the logarithm yields

log[ f(λ, q(s)) ] = ∫_s q(s) log[ P(o, s|λ) ] ds − ∫_s q(s) log[ q(s) ] ds .   (6.72)

Substituting Equation 6.71 into Equation 6.72 and dropping the terms that do not depend on λ yields the so-called Q-function:

Q(λ, λold) = ∫_s log[ P(o, s|λ) ] P(s | o, λold) ds ,   (6.73)
which is in fact the expected value, over the unknown data s, of the log-likelihood of the complete data set. Since the likelihood of the complete data set is in many cases easier to optimize than that of the incomplete data set, EM algorithms can solve more complex optimization problems.
EM algorithms. With the notation just developed, the procedure of EM algorithms can
be refined as follows:
• E-step: Compute the Q-function based on parameters λold obtained from initialization or the previous M-step.
• M-step: Compute the next estimate for λ by maximizing the Q-function:

λnew = argmax_λ Q(λ, λold) .   (6.74)
• Repeat until increase in data likelihood P (o|λ) is less than some threshold.
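For a discrete hidden variable, the lower bound of Equation 6.69 becomes log f = Σ_s q(s) log(P(o, s|λ)/q(s)), and the choice of Equation 6.71 makes the bound touch the objective exactly. The tiny numeric check below uses hypothetical joint probabilities:

```python
import math

def log_f(p_joint, q):
    """Discrete version of the lower bound f(lambda, q) of Eq. 6.69:
    log f = sum_s q(s) * log(P(o, s | lambda) / q(s))."""
    return sum(qs * math.log(ps / qs)
               for ps, qs in zip(p_joint, q) if qs > 0)

# Hypothetical joint probabilities P(o, s | lambda) for two hidden states:
p_joint = [0.06, 0.14]
log_lik = math.log(sum(p_joint))          # log P(o | lambda)

# q(s) = P(s | o, lambda), the choice of Eq. 6.71:
posterior = [p / sum(p_joint) for p in p_joint]
other_q = [0.5, 0.5]                      # any other valid density

# log_f(p_joint, posterior) equals log_lik (bound is tight),
# while log_f(p_joint, other_q) stays strictly below log_lik
```

This is exactly the mechanism the E-step relies on: by recomputing the posterior, each iteration starts from a bound that touches the objective, so maximizing the bound cannot decrease the likelihood.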
Convergence of the procedure is guaranteed, since the objective function satisfies 0 ≤ Q(λ, λold) ≤ P(o|λ) ≤ 1 and the lower bound Q does not decrease in any iteration.
In the case of HMMs, a local maximum of Q is usually found by taking partial derivatives of the Q-function, solving the equation

∂Q/∂λ = 0 ,   (6.75)

and using Lagrange multipliers to account for additional constraints on the parameters (e.g., the sum of outgoing probabilities has to equal one). Another way to optimize Q is to apply an iterative approximation technique. Even if the optimum of Q is not found exactly, the algorithm still converges, provided that new parameter values λ are found for which the lower bound is sufficiently greater than for λold. Such an approach is called a Generalized EM algorithm (e.g., Wilson & Bobick [279]).
6.5.2 The Proof for HSMMs
For HSMMs, the complete dataset Z = (O, S) consists of the observation sequence o
and the sequence of hidden states s that the stochastic process has traversed. If both the
sequence of hidden states and observation sequence are known, (complete) data likelihood is computed by alternately multiplying state transition probabilities and observation
probabilities along the path of states s:
P(o, s | λ) = πs0 bs0(O0) ∏_{k=1}^{L} vsk−1 sk(dk) bsk(Ok)   (6.76)

= πs0 ∏_{k=0}^{L} bsk(Ok) ∏_{k=1}^{L} vsk−1 sk(dk) ,   (6.77)

and hence the Q-function is (c.f., Equation 6.73):

Q(λ, λold) = Σ_{s∈S} log[ P(o, s|λ) ] P(s | o, λold)   (6.78)

= Σ_{s∈S} log[ πs0 ] P(s | o, λold)   (6.79)

+ Σ_{s∈S} Σ_{k=0}^{L} log[ bsk(Ok) ] P(s | o, λold)   (6.80)

+ Σ_{s∈S} Σ_{k=1}^{L} log[ vsk−1 sk(dk) ] P(s | o, λold)   (6.81)

= Qπ(π, λold) + Qb(B, λold) + Qv(G, λold) ,   (6.82)
where S denotes the set of all possible state sequences s.
Some papers (e.g., Bilmes [29]) use P(s, o | λold) instead of P(s | o, λold). However, this difference does not matter, since

P(s, o | λold) = P(s | o, λold) P(o | λold) ,   (6.83)

and since P(o | λold) is independent of λ, it does not affect the arg max operator used to determine λnew (c.f., Equation 6.74).
The important feature of Equation 6.82 is that the terms Qπ, Qb, and Qv are independent of each other with respect to π, B, and G. Due to the partial derivatives involved in maximization, Qπ, Qb, and Qv can be maximized separately.
Maximizing Qπ. Qπ can be further simplified:

Qπ(π, λold) = Σ_{s∈S} log[ πs0 ] P(s | o, λold) = Σ_{i=1}^{N} log[ πi ] P(S0 = si | o, λold) ,   (6.84)

since for each s ∈ S, only the first state s0 is of importance. The second term on the right-hand side, P(S0 = si | o, λold), subsumes all state sequences starting with state si, and hence the sum over all state sequences s can be turned into a sum over all states.
In order to determine λopt with respect to π, the following constrained maximization problem has to be solved:

πopt = argmax_{πi; i=1,...,N} Qπ(π, λold)   s.t.   Σ_{i=1}^{N} πi = 1 .   (6.85)
This can be accomplished with a Lagrange multiplier ϕ. Note that the derivative is taken with respect to one specific πi out of the sum over the πi's:

∂/∂πi [ Σ_{i=1}^{N} log[ πi ] P(S0 = si | o, λold) − ϕ ( Σ_{i=1}^{N} πi − 1 ) ] = 0   (6.86)

⇔ (1/πi) P(S0 = si | o, λold) − ϕ = 0 .   (6.87)
The Lagrange multiplier ϕ can be determined by substituting Equation 6.87 into the side condition:

Σ_{i=1}^{N} P(S0 = si | o, λold) / ϕ = 1   (6.88)

⇔ ϕ = Σ_{i=1}^{N} P(S0 = si | o, λold)   (6.89)

⇔ ϕ = 1 ,   (6.90)

since it is certain that the stochastic process is in one of the states at the beginning of the sequence. Using this result, Equation 6.87 can be solved to obtain the reestimation formula given in Equation 6.45:

πi = P(S0 = si | o, λold) = γ0(i) ,   (6.91)

as can be seen from the definition of γk(i) (c.f., Equation 4.14 on Page 59).
Maximizing Qb. In order to maximize the second term of the Q-function, it is simplified first. The "row-wise" collection along state sequences s is exchanged for a "column-wise" collection for each time step k. Therefore, P(s | o, λold) is exchanged for P(Sk = si | o, λold) and the sums are adapted accordingly:

Qb(B, λold) = Σ_{s∈S} Σ_{k=0}^{L} log[ bsk(Ok) ] P(s | o, λold) = Σ_{i=1}^{N} Σ_{k=0}^{L} log[ bsi(Ok) ] P(Sk = si | o, λold) .   (6.92)
For readability, bsi(oj) is denoted by bij in the following. The maximization problem is:

Bopt = argmax_{bij; i=1,...,N; j=1,...,M} Qb(B, λold)   s.t. ∀i : Σ_{j=1}^{M} bij = 1 ,   (6.93)
leading to

∂/∂bij [ Σ_{i=1}^{N} Σ_{k=0}^{L} log[ bsi(Ok) ] P(Sk = si | o, λold) − Σ_{i=1}^{N} ϕi ( Σ_{j=1}^{M} bij − 1 ) ] = 0   (6.94)

⇔ Σ_{k=0; Ok=oj}^{L} (1/bij) P(Sk = si | o, λold) − ϕi = 0   (6.95)

⇔ bij = ( Σ_{k=0; Ok=oj}^{L} P(Sk = si | o, λold) ) / ϕi ;   ϕi ≠ 0 .   (6.96)
Substitution into the side constraints yields:

Σ_{j=1}^{M} ( Σ_{k=0; Ok=oj}^{L} P(Sk = si | o, λold) ) / ϕi = 1   (6.97)

⇔ ϕi = Σ_{j=1}^{M} Σ_{k=0; Ok=oj}^{L} P(Sk = si | o, λold) = Σ_{k=0}^{L} P(Sk = si | o, λold) .   (6.98)
The condition ϕi ≠ 0 is fulfilled if state si is reachable with the given sequence. Finally,

bij = ( Σ_{k=0; Ok=oj}^{L} P(Sk = si | o, λold) ) / ( Σ_{k=0}^{L} P(Sk = si | o, λold) ) = ( Σ_{k=0; Ok=oj}^{L} γk(i) ) / ( Σ_{k=0}^{L} γk(i) ) .   (6.99)
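Equation 6.99 translates directly into code; the γ values below are hypothetical posteriors for one state along a four-symbol sequence:

```python
def reestimate_b(gamma_i, obs, M):
    """Eq. 6.99: emission probabilities of one state s_i as the expected
    number of times symbol o_j is emitted in s_i, normalized by the
    expected total time spent in s_i.  gamma_i[k] = gamma_k(i)."""
    denom = sum(gamma_i)
    return [sum(g for g, o in zip(gamma_i, obs) if o == j) / denom
            for j in range(M)]

gamma_i = [0.9, 0.2, 0.7, 0.4]    # P(S_k = s_i | o, lambda_old), hypothetical
obs = [0, 1, 0, 1]                # observed symbol indices
b_i = reestimate_b(gamma_i, obs, M=2)
# b_i sums to one by construction, since the numerators over all
# symbols partition the denominator
```

Note how the zero-probability pitfall discussed earlier arises here: a symbol that never occurs in obs contributes an empty numerator sum, so its reestimated probability is exactly zero.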
Maximizing Qv. In order to maximize the transition part of the Q-function for HSMMs, the sums are again rearranged. This time, the grouping collects all transitions from Sk−1 = si to Sk = sj as follows:

Qv(G, λold) = Σ_{s∈S} Σ_{k=1}^{L} log[ vsk−1 sk(dk | G) ] P(s | o, λold) = Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{k=1}^{L} log[ vij(dk | G) ] P(Sk−1 = si, Sk = sj | o, λold) .   (6.100)
In contrast to the maximization of π and B, and in contrast to standard HMMs, this maximization cannot be performed analytically. The reason can be traced back to the definition of vij(dk), which actually is a function of the parameters P and D:12

vij(dk | G) = vij(dk | P, D(d)) = { pij dij(dk)   if j ≠ i ;   1 − Σ_{h=1, h≠i}^{N} pih dih(dk)   if j = i .   (6.101)

The problem is that pij and dij(dk) appear twice in vij(dk), once in each case, which complicates computations, as can be seen from a derivative with respect to pij. To shorten notation, note from the definition given in Equation 6.42 that

P(Sk−1 = si, Sk = sj | o, λold) = ξk(i, j) .   (6.102)
Incorporating the side conditions given in Equation 6.4, the Lagrangian L for Equation 6.100 is:

L = Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{k=1}^{L} log[ vij(dk) ] ξk(i, j) − Σ_{i=1}^{N} ϕi ( Σ_{j=1, j≠i}^{N} pij − 1 )   (6.103)

= Σ_{k=1}^{L} Σ_{i=1}^{N} [ Σ_{j=1, j≠i}^{N} log[ pij dij(dk) ] ξk(i, j) + log( 1 − Σ_{h=1, h≠i}^{N} pih dih(dk) ) ξk(i, i) ] − Σ_{i=1}^{N} ϕi ( Σ_{j=1, j≠i}^{N} pij − 1 ) .   (6.104)

12 Although it has been assumed that pii ≡ 0, the notation includes h ≠ i to highlight that no self-transitions are incorporated.
Setting the partial derivative to zero yields:

∂L/∂pij = 0   (6.105)

⇔ Σ_{k=1}^{L} [ ( dij(dk) / (pij dij(dk)) ) ξk(i, j) + ( −dij(dk) / ( 1 − Σ_{h≠i} pih dih(dk) ) ) ξk(i, i) ] − ϕi = 0   (6.106)

⇔ Σ_{k=1}^{L} [ (1/pij) ξk(i, j) − ( dij(dk) / ( 1 − Σ_{h≠i,j} pih dih(dk) − pij dij(dk) ) ) ξk(i, i) ] − ϕi = 0 .   (6.107)
Although a solution for pij exists, ϕi cannot be determined analytically. For this reason, a gradient-based approximation technique is applied. It can be seen from Equation 6.107 that the derivatives are independent for each state si. Therefore, the parameters can be optimized separately for each state, and the objective function Qv can be split as follows:

Qv(P, D(d), λold) = Σ_{i=1}^{N} Qvi(Pi, Di(d), λold) .   (6.108)
This reduces the complexity of the optimization procedure, since the number of parameters per subproblem is much smaller. The objective comprises all outgoing transitions of state si; the objective function is denoted by Qvi:

Qvi(Pi, Di(d), λold) = Σ_{k=1}^{L} [ Σ_{j=1, j≠i}^{N} log[ pij dij(dk) ] ξk(i, j) + log( 1 − Σ_{h=1, h≠i}^{N} pih dih(dk) ) ξk(i, i) ] .   (6.109)
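The analytic derivative of Equation 6.109 with respect to pij (the bracketed term of Equation 6.107, before the Lagrange multiplier is introduced) can be validated against a finite difference. All numbers below (durations, ξ values, exponential kernel rates) are hypothetical, and a single exponential kernel per transition stands in for the thesis's kernel mixtures; p collects the pij of the two successor states only:

```python
import math

durations = [0.5, 1.2, 2.0, 0.8]           # observed delays d_k (hypothetical)
xi = [[0.3, 0.2], [0.1, 0.4],              # xi[k][j] = xi_k(i, j), successors j
      [0.25, 0.25], [0.4, 0.1]]
xi_self = [0.5, 0.5, 0.5, 0.5]             # xi_k(i, i)
rates = [1.0, 0.5]                         # exponential kernel rates (assumption)

def dens(j, t):
    """Duration density d_ij(t): a single exponential kernel per transition."""
    return rates[j] * math.exp(-rates[j] * t)

def Q_vi(p):
    """Transition part of the Q-function for one state s_i (cf. Eq. 6.109)."""
    total = 0.0
    for k, t in enumerate(durations):
        stay = 1.0 - sum(p[h] * dens(h, t) for h in range(len(p)))
        total += sum(math.log(p[j] * dens(j, t)) * xi[k][j]
                     for j in range(len(p)))
        total += math.log(stay) * xi_self[k]
    return total

def dQ_dpij(p, j):
    """Analytic derivative of Q_vi w.r.t. p_ij (cf. Eq. 6.107, without phi_i)."""
    g = 0.0
    for k, t in enumerate(durations):
        stay = 1.0 - sum(p[h] * dens(h, t) for h in range(len(p)))
        g += xi[k][j] / p[j] - dens(j, t) * xi_self[k] / stay
    return g

p, eps = [0.3, 0.4], 1e-6
finite_diff = [(Q_vi([p[0] + eps * (j == 0), p[1] + eps * (j == 1)]) - Q_vi(p)) / eps
               for j in range(2)]
# finite_diff[j] agrees with dQ_dpij(p, j) up to discretization error
```

Such a check is worth running whenever new kernel types are added, because a sign or indexing mistake in the hand-derived gradient would otherwise silently degrade the embedded optimization.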
Let ∇Qvi denote the gradient vector, whose components are obtained by partial derivatives of Equation 6.109 with respect to the parameters. The derivatives with respect to the kernel weights wij,r and the kernel parameters θij,r are obtained by the chain rule:

(∇Qvi)wij,r = ( ∂Qvi / ∂dij(dk) ) ( ∂dij(dk) / ∂wij,r ) ,   (∇Qvi)θij,r = ( ∂Qvi / ∂dij(dk) ) ( ∂dij(dk) / ∂κij,r ) ( ∂κij,r / ∂θij,r ) .   (6.110)

The dimension of ∇Qvi equals

dim(∇Qvi) = J ( 1 + R̄ (1 + θ̄) ) ,   (6.111)

where J is the number of outgoing transitions, R̄ the average number of kernels per transition, and θ̄ the average number of kernel parameters θij,r per kernel.
However, the optimization procedure has to obey several restrictions. The first has already been expressed by the Lagrangian in Equation 6.103: the sum over p_ij for all outgoing transitions has to equal one. Rearranging this restriction yields:

$$\sum_{j=1}^{J} p_{ij} - 1 = 0 \,, \qquad (6.112)$$
which is the defining equation of a hyperplane in the space of all optimization parameters. The interpretation is that all feasible solutions to the optimization problem have to be points within the hyperplane. However, the vector ∇Q_{v_i} does not necessarily point in a direction parallel to the hyperplane, such that an unrestricted gradient ascent would leave the hyperplane of feasible solutions. In order to avoid this, the gradient vector is projected onto the hyperplane, which results in the direction of steepest ascent within the subspace of feasible solutions (see Figure 6.9).
Figure 6.9: Projecting the gradient vector g into the plane of values for which p_1 + p_2 = 1. The result is denoted by g′. Θ denotes an arbitrary third parameter.
Projection onto a hyperplane can be achieved by a simple matrix multiplication:

$$(\nabla Q_{v_i})' = (M M^T)\, \nabla Q_{v_i} \,, \qquad (6.113)$$

where (∇Q_{v_i})′ denotes the projected gradient vector and M is the matrix of orthonormal basis vectors of the hyperplane, translated such that it crosses the origin of the parameter space. Note that M is constant, so the projection matrix can be precomputed.
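To make the projection concrete, the following is a minimal numerical sketch of Equation 6.113, restricted to the transition-probability coordinates of a single state. For the constraint Σ_j p_ij = 1 the projector M Mᵀ reduces to I − n nᵀ, where n is the unit normal of the hyperplane; the function names are illustrative, not taken from the thesis.

```python
import numpy as np

def projection_matrix(J):
    # Projector M M^T onto the directions parallel to the hyperplane
    # sum(p) = 1, i.e. onto the null space of the constraint normal (1,...,1).
    n = np.ones((J, 1)) / np.sqrt(J)   # unit normal of the hyperplane
    return np.eye(J) - n @ n.T         # equals M M^T for any orthonormal M

def project_gradient(grad):
    grad = np.asarray(grad, dtype=float)
    return projection_matrix(grad.size) @ grad
```

A step along the projected gradient leaves the sum of the probabilities unchanged, so a point satisfying the constraint stays on the hyperplane.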
In most applications, duration distributions will be used that are a convex combination of two or more kernels. In this case, the requirement that kernel weights sum up to one (c.f., Equation 6.9 on Page 98) constitutes an additional hyperplane restricting the subspace of feasible solutions, similar to Equation 6.112. In a geometric interpretation, the subspace of feasible solutions is then defined by the intersection of all constraining hyperplanes. For example in Figure 6.9, if parameter Θ had to be equal to zero, the subspace of feasible solutions would only consist of the intersection of the shaded hyperplane with the p_1 p_2 plane, as indicated by the bold line. Matrix M then consists of the orthonormal basis vectors of the intersection of all restricting hyperplanes, which can also be precomputed.
However, there are further restrictions. For example, the p_ij denote probabilities and can hence only take values in the range [0, 1]. Another example is that the parameter λ of an exponential distribution must be greater than zero. The solution to this problem is that the stepsize along the projected gradient vector needs to be restricted such that the optimization cannot leave the admissible range. In the geometric interpretation, this corresponds to clipping the projected gradient vector at boundary hyperplanes such as λ = 0.
Summary of the proof of convergence. The goal of this section was to prove convergence of the training algorithm. The strategy of EM algorithms is to iteratively maximize
a lower bound to reach a maximum of the objective function. The lower bound of EM
algorithms is the so-called Q function, which is the expected training data likelihood over
all combinations of unknown data. In the case of HSMMs, the Q function is the expected
observation sequence likelihood over all sequences of hidden states. Similar to standard
HMMs, the Q function for HSMMs can be separated into a sum of three independent
parts such that maximization of Q can be achieved by individual maximization. The
maximum for initial probabilities π and observation probabilities B has been computed
analytically using the method of Lagrange multipliers, resulting in reestimation formulas similar to those of the Baum-Welch algorithm for standard HMMs. However, an
analytical solution is not available for transition parameters. Therefore, a gradient-based
iterative maximization procedure is used for this part of the Q function. The fact that the
Q function is increased leads to an increased value of the objective function. Since the
objective function is continuous and bounded, a repetitive increase converges to a local
maximum.
6.6 HSMMs for Failure Prediction
There is a principal interrelation between the number of free parameters and the amount of training data needed to estimate the parameters: the more parameters need to be estimated, the more training sequences are required to yield reliable estimates. Since in failure prediction the models are trained from failure data, the amount of training data is naturally limited. Hence, the number of free parameters must be kept small. The number of free model parameters is mainly determined by the number of states and the topology, which determines the connections among states.
The most widespread topology for HMMs is a chain-like structure since, first, the notion of a sequence has some "left-to-right" connotation and, second, it has the least number of transitions. The model topologies used for online failure prediction are no exception in that respect. However, there are some particularities that need to be explained.
It is a principal and unavoidable characteristic of supervised machine learning approaches that the desired specifics are extracted from training data, which can never capture all properties of the true underlying interrelations. More specifically, this results from the fact that

1. Training data is a finite sample, from which follows that samples only reveal a subset of the true characteristics.

2. Measurement data is subject to noise. In the case of error sequences, e.g., it is common that error messages that are not related to the failure mechanism occur in the training data (noise filtering can alleviate the problem but cannot completely remove noise from the data).
In order to account for these two properties, a strict left-to-right model is extended in two steps:

1. Jumps are introduced such that states can be left out, as is shown in Figure 6.10. This addresses missing error events in training sequences, which is related to the first particularity listed above.

2. After training, intermediate states are introduced (see Figure 6.11), addressing the second particularity.

Training is performed between the two steps in order to keep the number of parameters as small as possible.
Chain model with shortcuts. The model topology to which training is applied is shown in Figure 6.10. Since this structure is rather sparse, training computation times remain acceptable. Note that only shortcuts bypassing one state have been included in the figure. The models used for the telecommunication system case study also included shortcuts of larger maximum span.

Figure 6.10: Failure prediction model structure used for training. Only shortcuts bypassing one state are shown. In implementations, shortcuts having a larger span have also been used.
Transition parameters and prior probabilities are initialized randomly. Observation probabilities are also initialized randomly, with one restriction: failure symbols can only be generated by the last, absorbing failure state, and error event IDs only by the transient states. Since the number of states N is not altered by the training procedure, it must be prespecified, although the optimal number of states cannot be identified upfront: if there are too few states, there are not enough transitions to represent all symbols in the sequences. If there are too many, the number of parameters is too large to be reliably estimated from the limited amount of training data. Furthermore, the model might overfit the training data. For this reason, several values of N are tried and the most appropriate model is selected. Please also note that training sequences have been filtered in the process of data preprocessing, as described in Section 5.3.
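As an illustration of this topology, the following sketch builds a randomly initialized, row-stochastic transition matrix for a chain with shortcuts bypassing at most one state. The helper name and the plain-matrix representation are hypothetical simplifications; the actual model additionally carries transition duration distributions.

```python
import numpy as np

def chain_with_shortcuts(N, max_span=2, seed=0):
    """Row-stochastic transition matrix of a left-to-right chain in which
    shortcuts may bypass up to max_span - 1 states; random initialization."""
    rng = np.random.default_rng(seed)
    A = np.zeros((N, N))
    for i in range(N - 1):
        # allowed successors: the next state plus shortcut targets
        succ = list(range(i + 1, min(i + 1 + max_span, N)))
        w = rng.random(len(succ))
        A[i, succ] = w / w.sum()       # outgoing probabilities sum to one
    A[N - 1, N - 1] = 1.0              # absorbing failure state
    return A
```

The sparsity of this structure, compared with a fully connected model, is what keeps training computation times acceptable.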
Background distributions. As has been pointed out in Section 6.3.3, the Baum-Welch algorithm sets observation probabilities to zero for all observation symbols that do not occur in the training data set, and hence every sequence containing one of those symbols is assigned a sequence likelihood of zero. This is not appropriate for failure prediction since the subsequent classification step builds on a continuous measure of similarity. Furthermore, training data is incomplete: during online prediction, there might be failure-prone sequences that are very similar to training sequences but contain some symbol that has not been present in the (filtered) training data. Assigning a sequence likelihood (i.e., a similarity) of zero is obviously not appropriate. Hence, after training the chain model with shortcuts, background distributions have to be applied to observation sequences.
Intermediate states. For each transition of the model, a fixed number of intermediate states is added such that the sum of mean transition times equals the mean transition time of the original transition (see Figure 6.11). More precisely, for any pair of states s_i and s_j of the model obtained from training (c.f., Figure 6.10), v intermediate states s_{ij,1}, ..., s_{ij,v} are added such that the mean transition duration via the intermediate states equals the mean duration of the direct transition s_i → s_j.^13 Limiting transition probabilities p_ij are adapted by distributing a fixed, prespecified amount of probability mass equally to the intermediate states. For example in Figure 6.11, if it is specified upfront that 10% of the probability mass should be assigned to intermediates, then p_12 and p_13 are scaled by 0.9 and the probability from state s_1 to each of the intermediates equals 0.1/4. Observation probabilities of intermediate states are not subject to training, and hence prior probabilities P(o_j) estimated from the entire training data set are used.

Figure 6.11: Adding intermediate states for each transition. Bold arcs visualize transitions from the model shown in Figure 6.10. µ_ij denotes the mean duration of the transition from state s_i to state s_j. Observation probability distributions b_{s_i}(o_j) for states 1, 2, and 3 have been omitted.

^13 That is, e.g., if the mean transition time from state s_1 to s_2 is µ_12 = 12s and there are two intermediate states s_{12,1} and s_{12,2}, the mean durations from s_1 to s_{12,1}, from s_{12,1} to s_{12,2}, and from s_{12,2} to s_2 are all four seconds, whereas the mean duration from s_1 to s_{12,2} is eight seconds.
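The two adaptation rules, splitting the mean duration across the hops and redistributing a prespecified share of probability mass, can be sketched as follows; the helper names are illustrative only.

```python
def split_transition(mu, v):
    """Mean durations of the v + 1 hops via v intermediate states; their
    sum equals the mean duration mu of the direct transition."""
    return [mu / (v + 1)] * (v + 1)

def redistribute_mass(p_out, n_intermediate, share=0.1):
    """Scale the direct transition probabilities by (1 - share) and assign
    the remaining probability mass equally to the intermediate states."""
    scaled = {j: p * (1 - share) for j, p in p_out.items()}
    return scaled, share / n_intermediate
```

With µ_12 = 12s and v = 2 this reproduces the three four-second hops of footnote 13, and with a 10% share and four intermediates it reproduces the 0.1/4 per-intermediate probability from the example above.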
6.7 Computational Complexity
An assessment of computational complexity for most machine learning techniques has
to consider two cases: training and online application. Training is performed offline and
computing time is hence less critical than the application of the model, which is in this
case the online prediction of upcoming failures. Both cases are investigated separately.
Application complexity. The approach to failure prediction presented here involves computation of the forward algorithm for each sequence. The forward algorithm of standard HMMs is of the order O(N²L), as can be seen from the trellis shown in Figure 4.3 on Page 59: for each of the L + 1 symbols of the sequence, a sum over N terms has to be computed for each of the N states. However, this only holds if all predecessors really have to be taken into account. If the implementation uses adjacency lists, this assessment applies only to ergodic (fully connected) model structures; in the case of the frequently used left-to-right structures, complexity goes down to O(NL).
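For reference, here is a minimal sketch of the standard-HMM forward recursion discussed above: for each symbol, every one of the N states sums over its N predecessors, giving O(N²L). The matrix names follow the usual HMM convention (A transitions, B observation probabilities) rather than the thesis notation.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Standard-HMM forward algorithm: O(N^2) work per symbol, O(N^2 L) total."""
    alpha = pi * B[:, obs[0]]           # initialization with the first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # sum over all N predecessor states
    return alpha.sum()                  # sequence likelihood P(o | lambda)
```

With an adjacency-list (sparse) representation of A, the inner sum only visits actual predecessors, which is how the O(NL) bound for left-to-right structures arises.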
The complexity of the Viterbi algorithm is the same, since the sum of the forward algorithm is simply replaced by a maximum operator, which also has to investigate all N predecessors in order to select the maximum value. The complexity of the backward algorithm also equals that of the forward algorithm, although the multiplication by b_{s_i}(O_t) cannot be factored out; but since constant factors do not change the class of complexity in the O-calculus, the same class results.
Turning to HSMMs, the algorithms belong to the same complexity class, since the only difference between the algorithms is that a_ij is replaced by v_ij(d_k). More precisely:

$$a_{ij} \;\Leftrightarrow\; p_{ij} \sum_{r=0}^{R} w_{ij,r}\, \kappa_{ij,r}(d \mid \theta_{ij,r}) \qquad \text{for } i \neq j \,. \qquad (6.114)$$
κ_{ij,r}(d) are cumulative probability distributions that have to be evaluated for delay d. Depending on the type of distribution, this might involve more or fewer computations since, e.g., for Gaussian distributions there is no closed-form formula for the cumulative distribution. However, since R is constant (and most likely less than five) irrespective of N and L, it is a constant factor, and complexity in terms of the O-calculus is the same as for standard HMMs. For the case that the process has stayed in state s_i (j = i), computations are even less costly if the products p_ij d_ij(d), j ≠ i, are summed up "on the fly".
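A sketch of evaluating the right-hand side of Equation 6.114 for one transition, assuming exponential kernels as an example of cumulative distributions with a closed form (the helper names are illustrative, not the thesis API):

```python
import math

def v_ij(d, p_ij, weights, kernel_cdfs):
    """v_ij(d) = p_ij * sum_r w_r * kappa_r(d) with cumulative kernels,
    mirroring Equation 6.114; kernel_cdfs are arbitrary CDF callables."""
    return p_ij * sum(w * K(d) for w, K in zip(weights, kernel_cdfs))

def exp_cdf(lam):
    """Exponential-distribution CDF, one example of a closed-form kernel."""
    return lambda d: 1.0 - math.exp(-lam * d)
```

Since every kernel CDF tends to one, v_ij(d) approaches the limiting transition probability p_ij for large delays, which matches the semi-Markov interpretation of the model.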
Training complexity. Estimating the overall complexity of the Baum-Welch algorithm is a difficult task since the number of iterations depends on many factors such as:

• model initialization, which is in many cases random
• quality and quantity of the training data, which includes the number of training sequences
• appropriateness of the HMM assumptions
• appropriateness of model topology
• number of parameters of the model. In the case of a standard HMM, the number is determined by N values for π, up to N² transition probabilities a_ij in the case of a fully connected HMM, and N·M observation probabilities B. Since M is determined by the application, it is assumed to be constant. Hence, the number of parameters is O(N²).
Some approaches have been published that try to predict computation time (e.g., Hoffmann [120]), but since these models are based on measurements, they do not help to derive an O-calculus assessment. Due to the number of parameters being of the order O(N²), it is assumed here that the number of iterations is also in O(N²), which in reality is a quite loose upper bound. In fact, convergence can be much better if a large amount of consistent training data is available. Furthermore, in real applications, a constant upper bound on the number of iterations is used. Note that this does not guarantee that the training procedure gets close to a local maximum. However, since training is usually repeated several times with different random initializations, this drawback is relatively small.
The complexity of one reestimation step can be determined: the E-step of the EM algorithm involves execution of the forward-backward algorithm of complexity O(N²L). Then, to accomplish the M-step, reestimation of

• π requires O(N) steps,
• B requires O(NL) steps,
• A requires O(N²L) steps

for each sequence. Hence one reestimation step also has complexity O(N²L). Putting this together with the number of iterations, overall training complexity is of the order O(N⁴L). Similar to model application, the complexity of models used in real applications (e.g., left-to-right topology) is lower.
Turning to HSMMs, reestimation of π and B remains the same, while reestimation of A is replaced by an iterative approximation procedure, which leads to an increased complexity of HSMMs:

• The optimization algorithm has to be run for each of the N states.
• For a fully connected model, the number of parameters that have to be estimated increases by const · (N − 1), which is of the order O(N).
• Computing the gradient involves a sum over all training data, which is O(L).
• Since a few gradient-based optimization steps are sufficient and assuming constant complexity to determine the step size, the number of iterations can be limited to O(1).

The resulting complexity is:

$$N \cdot O(N) \cdot O(L) \cdot O(1) = O(N^2 L) \,. \qquad (6.115)$$
Assuming the number of iterations of the outer EM algorithm to be O(N²) as before, this again yields an overall complexity of O(N⁴L). Again, in real applications such as online failure prediction, a left-to-right structure is used, which also limits the training complexity of each iteration to O(NL). Additionally, a constant upper bound on the number of iterations can be applied, with the same drawback as for standard HMMs. In general, this analysis shows the sometimes misleading oversimplification of the O-calculus: although belonging to the same complexity class, HSMMs are clearly more complex than standard HMMs. However, as experiments along with the case study will show, computation times are still acceptable (see Sections 9.4.2, 9.7.1, and 9.9.5).
6.8 Summary
Hidden Semi-Markov Models (HSMMs) are a combination of semi-Markov processes (SMPs) and standard hidden Markov models (HMMs): standard HMMs employ a discrete-time Markov chain for the stochastic process of hidden state traversals, which is replaced by a continuous-time SMP in the case of HSMMs. Although it is not the first time that such a combination has been proposed, previous approaches were limited to discrete time steps of length ∆t, used state duration distributions instead of transition durations, and/or were limited to a maximum duration.
The forward, backward, and Viterbi algorithms have been derived, yielding algorithms that are of the same complexity class^14 as those of standard HMMs. This has been achieved by a strict application of the Markov property and the assumption that a state transition takes place each time an observation occurs. Although this might sound too simplistic, a comparison of event-triggered temporal sequence processing with the situation encountered in speech recognition reveals why this assumption is appropriate for temporal sequence processing: temporal properties of the process appear at the surface and are expressed by the times when events occur, whereas speech recognition operates on periodic (i.e., equidistant) sampling, and hence the underlying temporal properties do not appear in the observation data.
The forward or Viterbi algorithms are used for sequence recognition. Sequence prediction aims to forecast the further development of the stochastic process. There are two different types of prediction: first, it might be of interest what the next observation symbol at a certain time in the future will be; second, the probability that the stochastic process reaches a distinguished state up to some time t in the future can be computed. Solutions to both goals have been derived.
Training of HSMMs is accomplished in a similar way to standard HMMs: based on the forward and backward algorithm, an expectation-maximization algorithm is employed. However, the formulas known from standard HMMs can only be adopted for the initial state and observation distributions π and b_{s_i}(o_j), respectively. Limiting transition distributions p_ij and transition durations d_ij(d_k) need to be optimized by an embedded gradient-based optimization procedure. The entire training procedure has been summarized on Page 111. A proof that the training procedure converges to a local maximum of sequence likelihood has been presented. It is based on the notion that EM algorithms perform lower-bound optimization, from which a so-called Q-function can be derived. This derivation has been applied to the case of HSMMs, yielding three terms that can be optimized independently. The proof investigates all terms and derives the training formulas.
Topics that are relevant to the application of HSMMs to online failure prediction have been covered, including a two-step model construction process. Together with the application of background distributions, this process increases model bias and lowers variance, as is shown in Section 7.3.
Finally, the complexity of the derived algorithms has been assessed using the O-calculus. For a fully connected (ergodic) model, both the algorithms for standard HMMs and HSMMs are of complexity O(N⁴L), assuming the number of outer EM iterations to be of O(N²). However, the constant factors, which are hidden by the O-calculus, are significant for HSMMs. Furthermore, for many applications complexity is reduced to O(NL).

^14 in terms of the O-calculus
Contributions of this chapter. The HSMMs proposed in this chapter follow a novel approach to extending hidden Markov models to continuous time. The fundamental difference between the periodically sampled input data of applications such as speech recognition and event-triggered temporal sequences is that temporal aspects of the underlying stochastic process are revealed at the level of observations. By exploiting this difference, a hidden semi-Markov model has been proposed that operates on truly continuous time rather than discrete time steps. It is able to model transition durations rather than state sojourn times and does not require specification of a maximum duration. Furthermore, the model provides great flexibility in terms of the distributions used and offers the possibility to incorporate background distributions for transition durations. Moreover, the algorithms are of the same complexity class as those of standard hidden Markov models.
Relation to other chapters. In online failure prediction, an error sequence that has been observed in the running system is compared to the failure-prone sequences of the training dataset by computing sequence likelihood. Since at least two HSMMs are used, one for similarity to failure sequences and one for non-failure sequences, a classification step is needed in order to come to a final evaluation of the current system status. Several approaches to classification are presented in the next chapter.
approaches to classification are presented in the next chapter.
Chapter 7
Classification
Classification is the last stage of the failure prediction process (see Figure 2.10 on Page 20). Classification facilitates a decision whether the current status of the system, as expressed by the observed error sequence, is failure-prone or not. This chapter discusses issues related to that topic. More specifically, in Section 7.1 Bayes decision theory is introduced, while topics directly related to failure prediction are discussed in Section 7.2. As the outcome of the classifier is a decision that can be either right or wrong, classification error is analyzed in more detail in Section 7.3. This includes the bias-variance dilemma and approaches to controlling the trade-off between bias and variance.
7.1 Bayes Decision Theory
Classification, in its principal sense, denotes the assignment of some class label c_i, i ∈ {0, ..., u}, to an input feature vector s. It is not surprising that a decision theory bearing the name of Revd. Thomas Bayes is a stochastic formal foundation to derive and evaluate rules for class label assignment based on Bayes' rule. The principal approach of Bayesian decision is that class label assignment is based on the probability of class c_i after having observed feature vector s, which is the so-called posterior probability distribution P(c_i | s). Applying Bayes' rule, the posterior can be computed by:

$$P(c_i \mid s) = \frac{p(s \mid c_i)\, P(c_i)}{p(s)} = \frac{p(s \mid c_i)\, P(c_i)}{\sum_l p(s \mid c_l)\, P(c_l)} \,, \qquad (7.1)$$

where p(s | c_i) is called the likelihood and P(c_i) is called the prior. The likelihood expresses that certain features occur with different probabilities depending on the true class. The prior accounts for the fact that the classes c_i are not equally frequent. Because classification theory has mainly been developed for continuous feature vectors, there are infinitely many values of s, and the likelihood p(s | c_i) is a probability density, which is denoted by a lowercase "p".
7.1.1 Simple Classification
The simplest classification rule is to assign an observed feature vector s to the class with maximum posterior probability:

$$\text{class}(s) = \arg\max_{c_i} P(c_i \mid s) \qquad (7.2)$$

$$= \arg\max_{c_i} \frac{p(s \mid c_i)\, P(c_i)}{\sum_l p(s \mid c_l)\, P(c_l)} \qquad (7.3)$$

$$= \arg\max_{c_i} p(s \mid c_i)\, P(c_i) \,. \qquad (7.4)$$

The last step from Equation 7.3 to Equation 7.4 can be performed since the denominator is independent of c_i and hence does not influence the arg max operator.
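Equation 7.4 amounts to a one-line decision rule; the sketch below assumes the likelihoods p(s | c_i) have already been evaluated for the observed s.

```python
import numpy as np

def bayes_classify(likelihoods, priors):
    """Assign the class maximizing p(s | c_i) P(c_i) (Equation 7.4); the
    evidence p(s) is omitted since it does not affect the arg max."""
    scores = np.asarray(likelihoods) * np.asarray(priors)
    return int(np.argmax(scores))
```

Note how a strong prior can overrule the likelihood: a class that fits the observation worse may still be chosen if it is far more frequent a priori.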
This classification rule seems intuitively correct, and it can be shown that it minimizes the misclassification error (see Bishop [30]). For the sake of simplicity, let us assume that there are only two classes c_1 and c_2. Let R_i denote a decision region, which is a not necessarily contiguous partition of feature space: if a data point s occurs within region R_i, class label c_i is assigned to s. The total probability of misclassification, i.e., the error, is given by:

$$P(\text{error}) = P(s \in R_2, c_1) + P(s \in R_1, c_2) \qquad (7.5)$$

$$= P(s \in R_2 \mid c_1)\, P(c_1) + P(s \in R_1 \mid c_2)\, P(c_2) \qquad (7.6)$$

$$= \int_{R_2} p(s \mid c_1)\, P(c_1)\, ds + \int_{R_1} p(s \mid c_2)\, P(c_2)\, ds \,. \qquad (7.7)$$
The boundaries between decision regions are known as decision surfaces or decision boundaries. Figure 7.1 visualizes Equation 7.7 for a one-dimensional feature space s and two continuous regions defining a single decision boundary θ.

Figure 7.1: Classification by maximum posterior for a two-class example. The curves show p(s|c_i) P(c_i), and hatched areas indicate the error. R_1 and R_2 are decision regions: every s within R_1 is classified as c_1 and within R_2 as c_2. It can be seen that the error is minimal if the decision boundary θ equals the point where the two probabilities cross.

It can be seen from the figure that the total probability of an error (i.e., the hatched area in the figure) is minimal if θ is chosen to be the value of s for which p(s | c_1) P(c_1) = p(s | c_2) P(c_2). From this follows that the decision rule given in Equation 7.4 results in the minimum probability of misclassification for two classes. The resulting minimum error for this boundary is called the Bayes error rate.
In the case of more classes, it is easier to compute the probability of correct classification:

$$P(\text{correct}) = \sum_{c_i} \int_{R_i} p(s \mid c_i)\, P(c_i)\, ds \,. \qquad (7.8)$$

Choosing decision regions such that the probability of correct classification is maximized leads to Equation 7.4 in its general form for multiple classes. In summary, the Bayes classifier chooses decision regions such that the probability of correct classification is maximized. No other partitioning can yield a smaller probability of error (Duda & Hart [84]).
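The claim can be checked numerically on a discretized version of the two-class setting of Figure 7.1, here assumed to be two unit-variance Gaussians with equal priors for illustration: the boundary at the crossing point s = 0 yields the highest probability of correct classification.

```python
import numpy as np

# Discretized one-dimensional two-class example with equal priors (0.5 each).
s = np.linspace(-4.0, 4.0, 2001)
ds = s[1] - s[0]
g1 = 0.5 * np.exp(-0.5 * (s + 1) ** 2) / np.sqrt(2 * np.pi)  # p(s|c1) P(c1)
g2 = 0.5 * np.exp(-0.5 * (s - 1) ** 2) / np.sqrt(2 * np.pi)  # p(s|c2) P(c2)

def p_correct(theta):
    """Probability of correct classification (Equation 7.8) when every
    s < theta is assigned to c1 and every s >= theta to c2."""
    return (g1[s < theta].sum() + g2[s >= theta].sum()) * ds
```

Shifting the boundary in either direction away from the crossing point lowers P(correct), in agreement with the discussion of the Bayes error rate above.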
7.1.2 Classification with Costs
The classification rule derived above has not considered any cost or risk involved with classification. However, cost can influence classification significantly. For instance, in the case of medical screening, classifying an image of a tumor as normal is much worse than the reverse. The same might hold for failure prediction, too: not predicting an upcoming failure might cause much higher cost than spuriously predicting a failure when the system is actually running well. In order to account for cost, a cost or risk matrix is introduced.^1 Each element of the risk matrix r_ta defines the cost/risk associated with assigning a pattern s to class c_a when in reality it belongs to class c_t. Although the term "risk" might not seem appropriate for cases where the correct class label is assigned, the term is used here. Instead of minimizing the probability of error, an optimal cost-based classification minimizes expected risk. To derive a formula, first the expected risk of assigning a sequence s to class c_a is considered:

$$R_a(s) = \sum_{t} r_{ta}\, P(c_t \mid s) \,. \qquad (7.9)$$
Since class c_a is assigned to all s ∈ R_a, the average cost of assignment to class c_a is:

$$R_a = \int_{R_a} \sum_{t} r_{ta}\, P(c_t \mid s)\, p(s)\, ds = \int_{R_a} \sum_{t} r_{ta}\, \frac{p(s \mid c_t)\, P(c_t)}{p(s)}\, p(s)\, ds \qquad (7.10)$$

and the total expected risk equals

$$R = \sum_{a} R_a = \sum_{a} \int_{R_a} \sum_{t} r_{ta}\, p(s \mid c_t)\, P(c_t)\, ds \,. \qquad (7.11)$$

Risk is minimized if the integrand is minimized for each sequence s, which is achieved by choosing the decision region for assignment to class c_a such that s ∈ R_a if

$$\sum_{t} r_{ta}\, p(s \mid c_t)\, P(c_t) < \sum_{t} r_{ti}\, p(s \mid c_t)\, P(c_t) \quad \forall\, i \neq a \,, \qquad (7.12)$$

resulting in a Bayes decision rule where the minimum loss across all assignments for sequence s is chosen. If two assignments have equal loss, any tie-breaking rule can be used.
^1 In classification, the matrix is also called a loss matrix.
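The minimum-risk rule of Equations 7.9 and 7.12 can be sketched as follows, assuming the posteriors P(c_t | s) are available; a high cost for missing a class can overturn the plain maximum-posterior decision.

```python
import numpy as np

def min_risk_classify(posteriors, risk):
    """Choose the assignment a minimizing R_a(s) = sum_t r_ta P(c_t | s)
    (Equations 7.9 and 7.12); risk[t][a] is the cost of assigning class a
    when the true class is t."""
    expected = np.asarray(risk, dtype=float).T @ np.asarray(posteriors)
    return int(np.argmin(expected))
```

For example, with posteriors (0.9, 0.1) and a cost of 100 for missing class 1 versus a cost of 1 for a false alarm, the rule opts for class 1 even though class 0 has the far higher posterior.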
7.1.3 Rejection Thresholds
Bishop [30] mentions that classification can also yield the result that a given instance cannot be classified with enough confidence. The idea is to classify a sequence s only if the maximum posterior is above some threshold θ ∈ [0, 1] (c.f., Equation 7.2):

$$\text{class}(s) = \begin{cases} c_k = \arg\max_{c_i} P(c_i \mid s) & \text{if } P(c_k \mid s) \geq \theta \\ \emptyset & \text{else.} \end{cases} \qquad (7.13)$$
Rejection thresholds might be useful for online failure prediction if there is a human operator who can be alerted when a sequence cannot be classified, in order to further investigate or observe the system's status. However, since the experiments carried out in this work are only based on a data set (there has been no operator to alert), rejection thresholds have not been applied here. A second application of rejection thresholds is concerned with improving the computing performance of classifiers: in a first step, simple classifiers can be used to classify the non-ambiguous cases. For more complex situations, in which the simple classifications do not exceed the rejection thresholds, more sophisticated but computationally more expensive methods can be applied to further analyze the situation. However, since optimization of computing performance is not the purpose of this dissertation, such an approach has also not been applied in this thesis.
7.2 Classifiers for Failure Prediction
Bayesian decision theory provides the basic framework for classification. In this section, failure-prediction-specific as well as practical issues are discussed. Note that from now on the probability p(s | c_i) denotes the likelihood of a sequence s that has been observed during runtime. In the case of hidden Markov models, sequence likelihood is computed by the forward algorithm.
7.2.1 Threshold on Sequence Likelihood
The simplest classification rule is to have only one single HSMM trained on all failure sequences irrespective of the failure mechanism, and to apply a threshold θ ∈ [0, 1] to the sequence likelihood p(s | λ_F), where λ_F denotes a model that has been trained on failure data only. The problem is that observation sequences s are delimited by a time window ∆t_d (c.f., Figure 5.4 on Page 79), resulting in a varying number of symbols in observation sequences. Sequence likelihood decreases monotonically with the number of observation symbols, and hence the threshold θ would have to depend on the number of observation symbols. Furthermore, experiments have shown that such an approach does not result in decisive models. For these reasons, the method of simple thresholding is not used in this thesis.
7.2.2 Threshold on Likelihood Ratio
One way to circumvent the problem of varying length of observation sequences is to use exactly two models, λ_F for failure and λ_F̄ for non-failure sequences, and to compute the ratio of sequence likelihoods. A failure is predicted if the ratio is above some threshold θ ∈ [0, ∞). More formally, a failure is predicted if

$$\frac{P(s \mid \lambda_F)}{P(s \mid \lambda_{\bar F})} > \theta \,. \qquad (7.14)$$
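In practice the ratio test of Equation 7.14 is evaluated in log space (as elaborated in Section 7.2.3); a minimal sketch, with illustrative argument names:

```python
import math

def predict_failure(logp_failure, logp_nonfailure, theta):
    """Raise a failure warning if the sequence-likelihood ratio exceeds
    theta (Equation 7.14), evaluated in log space for numerical stability."""
    return (logp_failure - logp_nonfailure) > math.log(theta)
```

The same log-likelihood difference appears on the left-hand side of Equation 7.20 below, so tuning theta directly corresponds to tuning the cost-dependent right-hand side.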
In order to analyze this approach it is cast into the framework of Bayes decision theory.
However, to simplify affairs, formulas of Bayes decision theory become more handy if
rephrased for the two-class case. From Equation 7.12 follows that the classifier should
opt for a failure if
rF F p(s | cF ) P (cF ) + rF̄ F p(s | cF̄ ) P (cF̄ ) <
rF F̄ p(s | cF ) P (cF ) + rF̄ F̄ p(s | cF̄ ) P (cF̄ )
(7.15)
⇔
(7.16)
(rF F − rF F̄ ) p(s | cF ) P (cF ) < (rF̄ F̄ − rF̄ F ) p(s | cF̄ ) P (cF̄ ) .
Under the reasonable assumption that rF̄F̄ < rF̄F, which means that the cost associated
with correctly classifying a non-failure-prone situation as o.k. is less than the cost associated
with falsely classifying it as failure-prone, the inequality can be transformed as follows:

⇔   (rFF − rFF̄) / (rF̄F̄ − rF̄F) · p(s | cF) P(cF) > p(s | cF̄) P(cF̄)       (7.17)

⇔   p(s | cF) / p(s | cF̄) > (rF̄F − rF̄F̄) P(cF̄) / [(rFF̄ − rFF) P(cF)] .   (7.18)
Identifying the likelihoods p(s | cF) with the estimated sequence likelihoods P(s | λF)
obtained from the model, it can be seen that classification by a threshold on the likelihood
ratio is optimal if the threshold θ satisfies:

    θ = (rF̄F − rF̄F̄) P(cF̄) / [(rFF̄ − rFF) P(cF)] .                       (7.19)

7.2.3
Using Log-likelihood
In many real applications and models such as hidden semi-Markov models, sequence
likelihoods P(s | λt) get too small to be computed, and hence the log-likelihood is used
(c.f., Equation 6.19 on Page 101). However, this does not preclude Bayes classification,
since the logarithm is a strictly monotonically increasing function, and hence
Equation 7.18 can be transformed into

    log p(s | λF) − log p(s | λF̄) > log[(rF̄F − rF̄F̄) / (rFF̄ − rFF)] + log[P(cF̄) / P(cF)] ,   (7.20)

where the left-hand side can take any value in (−∞, ∞) and the right-hand side is a constant.
The usefulness of the formula can be seen more easily if only the costs of misclassification are
taken into account, which means rFF = rF̄F̄ = 0. Hence,

    θ̃ = log(rF̄F / rFF̄) + c ,                                            (7.21)

where c denotes the constant term log[P(cF̄) / P(cF)].
Equation 7.21 approaches −∞ as rF̄F → 0. In other words, if the cost of incorrectly
raising a failure warning approaches zero, the threshold θ̃ gets infinitely small, and consequently classifying every event sequence as failure-prone results in minimal cost. On
the other hand, if the cost of such misclassification is high, the current status must quite
evidently be failure-prone, i.e., there must be a big difference in sequence log-likelihoods, before a
failure warning is raised. In terms of rFF̄ the situation is inverse.
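The log-domain decision can be sketched as follows; the log-likelihood values, costs, and prior below are hypothetical. Working with log-likelihoods sidesteps the numerical underflow mentioned above:

```python
import math

def log_ratio_warn(loglik_F, loglik_Fbar, r_FbarF, r_FFbar, p_F):
    """Two-class decision in the log domain (Equation 7.20 with
    r_FF = r_FbarFbar = 0): warn iff the log-likelihood difference exceeds
    log(r_FbarF / r_FFbar) + log(P(c_Fbar) / P(c_F))."""
    log_theta = math.log(r_FbarF / r_FFbar) + math.log((1.0 - p_F) / p_F)
    return (loglik_F - loglik_Fbar) > log_theta

# Likelihoods around e^-690 would underflow as a direct ratio of floats,
# but their log-likelihoods are harmless.
print(log_ratio_warn(-690.0, -700.0, r_FbarF=1.0, r_FFbar=10.0, p_F=0.1))  # -> True
```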
7.2.4
Multi-class Classification Using Log-Likelihood
As can be seen from Figure 2.10 on Page 20, in the approach presented here, one non-failure model and u failure models —one for each failure mechanism— are used to predict a failure, which naturally leads to a multi-class classification problem. If sequence
likelihoods P(s | λt) were available, Equation 7.12 would have to be used for classification. However, in real applications only log-likelihoods log P(s | λt) are available, and
Equation 7.12 cannot be rearranged into singleton log P(s | λt) terms. Therefore, the
multi-class classification problem is turned into a two-class one by selecting the maximum
log sequence likelihood of the failure models and comparing it to the log sequence likelihood of
the non-failure model:
    class(s) = F   ⇔   max_{i=1,…,u} log P(s | λi) − log P(s | λ0) > log θ ,       (7.22)
where θ is as in Equation 7.19. The motivation for the approach is as follows: Failure
models are related since they all indicate an upcoming failure. If the system encounters an
upcoming failure, the observed error sequence is the outcome of exactly one underlying
failure mechanism. Hence the failure model that is targeted to this failure mechanism
should recognize the error sequence as most similar, which is expressed by maximum
sequence log-likelihood. An additional advantage of the approach is that the cost matrix
defining θ has only four elements, which can be grasped and determined more easily.
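A minimal sketch of this two-class reduction (Equation 7.22); the log-likelihood values below are hypothetical:

```python
import math

def predict_failure(logliks_failure, loglik_ok, theta):
    """Equation 7.22: warn iff the best-matching failure model beats the
    non-failure model lambda_0 by more than log(theta)."""
    return max(logliks_failure) - loglik_ok > math.log(theta)

# Hypothetical log-likelihoods of one error sequence under u = 3 failure
# models and one non-failure model.
logliks = [-812.4, -790.1, -805.7]
print(predict_failure(logliks, loglik_ok=-795.0, theta=0.9))  # -> True
```

Only the second failure model recognizes the sequence, but the maximum operator is enough: one matching failure mechanism suffices to raise a warning.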
7.3
Bias and Variance
Bayes decision theory has been based on minimizing classification error for each single
observation sequence (c.f., Equation 7.5). However, the classifier is trained from some
finite training data set. Analyzing dependence on training data yields fundamental insights
into machine learning, which in turn lead to improved modeling techniques. In order to
describe the concept, bias and variance are first derived for regression, as it has been
developed by Geman et al. [104]. Having the concept in mind, the work of Friedman [98]
is described, who has proposed an analysis of bias and variance for classification. The
purpose of presenting this material is to provide the background for a discussion of bias
and variance in the context of failure prediction and for an overview of known techniques
to control the trade-off between bias and variance. For further details, please refer to
textbooks such as Bishop [30] or Duda et al. [85].
7.3.1
Bias and Variance for Regression
Machine learning techniques usually try to estimate unknown mechanisms / interrelations
from training samples, which leads to different resulting models depending on the data
Figure 7.2: Mean square error in regression problems. Dots in each figure indicate two different training datasets D1 and D2 from which (in this case linear) models y(s; Di)
have been trained. The mean square error is determined by (y(s; D) − t(s))², where
t(s) is the target value at point s.
present in the training data set. This is due to the fact that training data is a finite sample
and the system under investigation might also be stochastic. The following considerations
assess the dependence on the choice of training data, resulting in an analysis of bias and
variance. A common way to explain the two terms is to first investigate mean square error
E for regression: The error is measured by square of the difference between y(s; D),
which is the output value for input data point s of some model that has been trained
from training data set D of fixed size n, and the target value t(s) (see Figure 7.2). Since
training data is a finite sample, resulting models may vary with every different training
dataset. The expected error over all training datasets for one data point s is computed and
decomposed as follows (c.f., e.g., Alpaydin [6]):

    E = ED[ (y(s; D) − t(s))² ]                                           (7.23)
      = ED[y²] − 2 t ED[y] + t²                                           (7.24)
      = ED[y²] − 2 t ED[y] + t² + ED[y]² − ED[y]²                         (7.25)
      = ED[y]² − 2 t ED[y] + t² + ( ED[y²] − ED[y]² )                     (7.26)
      = ( ED[y(s; D)] − t(s) )² + ED[ y(s; D)² ] − ED[ y(s; D) ]² ,       (7.27)

where the first term is the squared bias (Bias²) and the remaining two terms constitute the variance.
Equation 7.27 indicates that the mean squared deviation from the true target data of any
machine learning method consists of two parts:
1. ability to mimic the training data set (bias)
2. sensitivity of the training method to variations in the selection of the training data
set (variance)
The relation can be understood best if two extreme cases are considered:
• Assume a machine learning technique that memorizes all training data points. Such
a technique has a bias of zero. However, the resulting model differs strongly for
different selections of the training data set, resulting in high variance.
• Assume a “learning” technique that does not adapt to the training data at all (e.g., a
fixed straight line); then the resulting model is the same irrespective of the data set
(zero variance). However, the deviation from the target values is quite high, resulting in
a high bias.
The key insight of Equation 7.27 is that in order to obtain a model with small average
error on s, both bias and variance must be reduced. A good model achieves a balance between
underfitting (high bias, low variance) and overfitting (low bias, high variance), which is
also known as the bias-variance dilemma.
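The decomposition of Equation 7.27 can be checked numerically. The following sketch uses a deliberately rigid toy model (it predicts the mean target value of its training set) and entirely hypothetical data; for the empirical moments, the identity holds exactly:

```python
import random

random.seed(1)

def true_target(s):          # the unknown function to be learned
    return 2.0 * s

def fit_constant(dataset):   # rigid model: always predicts the mean target
    return sum(t for _, t in dataset) / len(dataset)

s0, n_sets, n_points = 1.5, 2000, 10
predictions = []
for _ in range(n_sets):
    data = [(s, true_target(s) + random.gauss(0.0, 1.0))
            for s in (random.uniform(0, 2) for _ in range(n_points))]
    predictions.append(fit_constant(data))   # y(s0; D), independent of s0 here

mean_y = sum(predictions) / n_sets
mse = sum((y - true_target(s0)) ** 2 for y in predictions) / n_sets
bias_sq = (mean_y - true_target(s0)) ** 2
variance = sum((y - mean_y) ** 2 for y in predictions) / n_sets
# Equation 7.27: mean square error = Bias^2 + Variance, here exactly
# for the empirical moments over the 2000 training sets.
print(abs(mse - (bias_sq + variance)) < 1e-9)   # -> True
```

Because the model ignores the input s0, its variance stays small while its bias at s0 = 1.5 is large; a memorizing model would show the opposite pattern.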
7.3.2
Bias and Variance for Classification
The above derivations investigated mean square error for regression problems. Turning
to classification, the situation is different. In two-class classification, there are only two
target values t ∈ {0, 1}. Mean squared error (y(s; D) − t)² could be used to measure
proximity of the model output to the binary target data as well, but this is not a proper approach. Consider, for example, a classifier that yields output y(s) = 0.51 for t = 1 and
y(s) = 0.49 for t = 0 for all s. This is a perfect classifier, since with a threshold of 0.5 all
s would be classified correctly. However, in terms of mean square error, the classifier would
receive a high bias. Friedman [98] was one of the first to investigate this problem and to
derive an assessment of bias and variance for classification problems. Although others
such as Shi & Manduchi [240], Domingos [82] have developed the topic further, only the
basic findings of Friedman are presented here.
The regression problem of the previous section involved the notation y(s; D) to denote the output of the model. In terms of classifiers, classification is based on modeled
posterior class probability (c.f., Equation 7.1):
    f̂(s; D) = P̂(c = 1 | s) = 1 − P̂(c = 0 | s) ,                          (7.28)

which is an estimate of the true posterior probability f(s) = P(c1 | s). The posterior estimate f̂(s; D) is used to classify input s in a Bayes classifier. In a two-class classification
problem, the assigned class label is determined by:

    ĉ(s; D) = I[ f̂(s; D) ≥ r01 / (r01 + r10) ] ,                         (7.29)

where I[·] denotes the standard indicator function and rta denotes classification risk as in
Equation 7.12. Correspondingly, the optimal classification is based on the true posterior:

    cB(s) = I[ f(s) ≥ r01 / (r01 + r10) ] ,                               (7.30)
which results in cost minimal (Bayes) classification. In order to simplify notations, equal
cost r01 = r10 is assumed such that the decision level is set to 1/2. Figure 7.3 shows the
situation.
Similar to derivation of bias and variance for regression, the estimated posterior
fˆ(s; D) is a random variable depending on the training data set D. For one training
Figure 7.3: True posterior probability f(s), and estimated posterior f̂(s; D) obtained from
training using dataset D. In regions of s where f(s) and f̂(s; D) are on the
same side of the Bayesian decision boundary 1/2, a correct classification results
and the classification error rate is minimal (regions R2 and R4). If not, the classifier
based on f̂(s; D) assigns the wrong class label, resulting in maximal classification
cost (for s in that region).
data set, fˆ(s; D) may be on the correct side of the decision boundary (for s), for another
data set not. In order to handle this dependency on training data, again the expected value
ED is used to assess the average misclassification rate

    P(ĉ(s) ≠ c(s)) = ED[ P(ĉ(s; D) ≠ c(s)) ] ,                           (7.31)

where c(s) is the true class of input s. It can be shown that Equation 7.31 can be separated
into the minimal Bayes error P(cB(s) ≠ c(s)) and a term that is linearly dependent on
the so-called boundary error P(ĉ(s) ≠ cB(s)). Since the Bayesian error does not depend
on the classifier, only the boundary error needs to be investigated.
For further assessment, Friedman assumes that the estimated posterior f̂(s; D) is distributed —for varying datasets D— according to p(f̂(s)), which is unknown in general.
However, since many machine learning algorithms (including Baum-Welch) employ averaging, p(f̂(s)) can be approximated by a normal distribution:

    p(f̂(s)) = N( ED[f̂(s; D)]; Var[f̂(s; D)] ) .                          (7.32)
In order to compute the boundary error P(ĉ(s) ≠ cB(s)), the desired quantity is the
probability that f̂(s) and f(s) are on opposite sides of the decision boundary 1/2, which
yields (see Figure 7.4):

    P(ĉ(s) ≠ cB(s)) = ∫_{1/2}^{∞}  p(f̂(s)) df̂    if f(s) < 1/2
                      ∫_{−∞}^{1/2} p(f̂(s)) df̂    if f(s) ≥ 1/2 .         (7.33)
The two cases can be turned into one using the sign function:
    P(ĉ(s) ≠ cB(s)) = Φ( sign(f(s) − 1/2) · (ED[f̂(s; D)] − 1/2) · Var[f̂(s; D)]^(−1/2) ) ,   (7.34)
                          └──────────── boundary bias ────────────┘  └────── variance ──────┘
Figure 7.4: Distribution of the estimated posterior f̂(s).
where Φ is the upper tail integral of the normal distribution.2 Plots of the boundary error
as a function of f and ED[f̂] are provided for two values of Var[f̂] in Figure 7.5.
Figure 7.5: Boundary error P(ĉ(s) ≠ cB(s)). Plot (a) shows the dependence on ED[f̂(s; D)]
and Var[f̂(s; D)] for a given true posterior of f(s) = −0.25. Plot (b)
shows the dependence on the true posterior f(s) and expected value ED[f̂(s; D)] for
Var[f̂(s; D)] = 0.05. Note that depending on the modeling technique, estimates
f̂(s; D) may exceed the range [0, 1]. This is not a problem since classification is
performed by comparing f̂(s) to the decision boundary 1/2.
Several key insights into the nature of classification error can be gained from this:
1. From Equation 7.34 it can be seen that bias and variance affect each other in a multiplicative way, rather than additively as in the case of regression (c.f., Equation 7.27).
This results in the complex relationship seen in Figure 7.5-a.
2. Small classification errors can only be achieved if variance Var[fˆ(s; D)] is low.
However, this is only true if the boundary bias is positive, i.e., f(s) and f̂(s; D) are on
the same side of the decision boundary 1/2. If it is negative, a very large classification error results.

2 Hence Φ(·) = 1 − erf(·).
3. Except for the special case of zero variance, the error rate depends on the
distance of ED[f̂(s; D)] from the decision boundary 1/2. For this reason, bias is
expressed as boundary bias.
4. The error rate of the classifier does not depend on the distance between f(s) and the
decision boundary, as long as f(s) and ED[f̂(s; D)] are on the same side. In Figure 7.5-b, it can be seen that for fixed ED[f̂(s; D)] the boundary error is the same
for all f < 1/2 and for all f ≥ 1/2, respectively.
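Equation 7.34 can be evaluated directly. The sketch below uses the standard definition of the normal upper-tail via Python's math.erfc (which may be scaled differently from the erf convention used in the footnote); the posterior values are hypothetical:

```python
import math

def upper_tail(x):
    """Upper tail of the standard normal distribution, playing the role of Phi
    in Equation 7.34 (standard scaling via erfc)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def boundary_error(f, mean_fhat, var_fhat):
    """P(chat(s) != cB(s)) per Equation 7.34, decision boundary 1/2."""
    x = math.copysign(1.0, f - 0.5) * (mean_fhat - 0.5) / math.sqrt(var_fhat)
    return upper_tail(x)

# Positive boundary bias (estimate on the same side as f): small error.
print(round(boundary_error(f=0.8, mean_fhat=0.7, var_fhat=0.01), 4))  # -> 0.0228
# Negative boundary bias: error above 1/2, however small the variance.
print(boundary_error(f=0.8, mean_fhat=0.4, var_fhat=0.01) > 0.5)      # -> True
```

Playing with the arguments reproduces insight 4: for fixed mean_fhat and var_fhat, any f above 1/2 gives the same error, and likewise for any f below 1/2.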
From this discussion it follows that optimal classification3 is achieved for small variance
(the resulting models are more or less equal regardless of the selection of training data), provided that the training algorithm on average yields an estimate of the posterior
probability that is on the correct side of the Bayes decision boundary.
Note that all the formulas derived above have evaluated only a single s. If the
overall error rate is to be assessed, a further integral is needed:

    P(ĉ ≠ c) = ∫_{−∞}^{∞} P(ĉ(s) ≠ c(s)) p(s) ds .                       (7.35)
7.3.3
Conclusions for Failure Prediction
The detailed analysis of classification error with respect to bias and variance has shown
that, first, there is a trade-off between underfitting and overfitting, and second, in the case
of classification, small variance is more important than small bias. For this reason, bias
and variance have to be controlled in order to achieve a robust classifier. A variety of
techniques exists, a few of which are briefly described here, including a discussion
of whether or not they can be used for online failure prediction with HSMMs.
• The most intuitive golden rule for machine learning approaches is to increase the
amount of training data. However, in most real applications the amount of available
training data is limited, either because the cost of data acquisition is too high or, as in
the case of failure prediction, because data acquisition simply takes too long. Since in most
applications one part of the available data is used for training and the other is used to
assess the generalization / prediction quality of the models, it is suggested to use
techniques such as m-fold cross validation to make full use of the limited data. This
technique has also been used in this thesis (see Section 8.3.3).
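The index handling of m-fold cross validation can be sketched in a few lines; the fold count and random seed below are arbitrary illustration values:

```python
import random

def m_fold_splits(n_samples, m, seed=42):
    """Partition sample indices into m disjoint folds; each fold serves once
    as validation set while the remaining m-1 folds are used for training."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::m] for i in range(m)]
    for i in range(m):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

for train, val in m_fold_splits(10, m=5):
    assert len(val) == 2 and not set(train) & set(val)
print("ok")  # every sample is validated exactly once across the m folds
```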
• Training with noise. In the case that not enough training data is available, noise
can be synthetically added to the training data in order to divert the training procedure and to avoid memorization of training data points (overfitting), hence increasing
bias and lowering variance. In the case of regression, “noise” refers to a simple
zero-mean stochastic process being added to measurement data. However, it is not
clear how this concept translates to failure sequences. While a zero-mean random
number could be added to the delay between error events, it seems hazardous to
number could be added to the delay between error events, it seems hazardous to
3
Remember that the overall error rate is the sum of Bayes error and a term linear in P ĉ(s) 6= cB (s)
144
7. Classification
interchange the event type, which is a nominal, i.e., non-ordinal, variable.4 Hence,
this technique could not be applied in this thesis.
• Early stopping. Many machine learning techniques apply an iterative estimation algorithm to stepwise adapt model parameters to the training data. This corresponds
to a stepwise transition from under- to overfitting. The idea of early stopping is
to evaluate generalization performance with a data set that is not used for training and to halt the training procedure once the validation error begins to rise (see
Figure 7.6).

Figure 7.6: Early stopping. Error for the training data decreases with every training step, approaching some minimum error. Evaluating generalization performance using a
separate validation data set shows an increasing error after some number of training steps due to the fact that the model is overfitting the training data. Early stopping interrupts the training procedure once validation error begins to rise.

Experiments have shown that early stopping does not seem to be an appropriate technique for hidden semi-Markov models. The reason for this is that the
Baum-Welch estimation procedure sets all observation probabilities of symbols that
do not occur in the training data set to zero in the first iteration. As early stopping
can only halt at integer steps, the first possible stop is already “too late”. It has also
been tried to combine early stopping with background distributions but this did not
result in significant improvement in comparison to the application of background
distributions alone.
• Growing and pruning. One of the major factors influencing the trade-off between
bias and variance is the number of free parameters of the model: Provided that there
is enough training data, the greater the number of free parameters, the better a model
can memorize training data points resulting in a low bias but high variance. The idea
of growing or pruning algorithms is to iteratively increase / decrease the number of
free parameters until an optimal solution is found. In hidden Markov models, the
number of parameters is mainly determined by the number of states and transitions
and hence algorithms try to add / delete edges or nodes / states following some
mostly heuristic rule. Bicego et al. [28] have proposed several pruning algorithms.
However, these methods can only be applied to models with recurrent states, which
is not the case for the models used for online failure prediction.
4 If a numbering scheme for event IDs similar to the one proposed in Section 5.4.2 is used, adding noise could be applied. However, the data of the telecommunication platform did not provide such a numbering.

• Model order selection. As discussed above, growing and pruning are not applicable within an automatic rule-based approach. However, “growing and pruning” is
achieved by simple trial and error for some range of model parameters such as the
number of states, number of intermediate states or maximum span of shortcuts. In
this approach, the most appropriate model is selected applying techniques such as
cross-validation.
• Parameter tying. The number of free parameters of some model classes such as
neural networks and hidden Markov models can be reduced if several parameters
are “grouped”. In the case of hidden semi-Markov models, for example, transition
parameters pij of several transitions can be forced to be equal, which reduces the
number of free parameters. However, in order to apply tying wisely, not blindly,
strong assumptions and hence detailed knowledge about the modeled process are
necessary, which is not the case for the problem addressed in this dissertation.
• Background distributions, intermediate states, and shortcuts. Observation probabilities of hidden Markov models can be mixed with so-called background distributions (c.f. Page 112). Background distributions “blur” the output probabilities of
the HMM which results in an increased training bias but reduced variance. If observation probabilities are trained using the Baum-Welch algorithm (as is the case
for this thesis) the application of background distributions is especially important
to circumvent the problem of zero probability for observation symbols not occurring in the training data set. Intermediate states and shortcuts added to the model
topology (see Section 6.6) have a similar effect on specificity of state transitions.
Furthermore, HSMMs also allow to incorporate background distributions into transition durations. Due to the fact that (a) observation background distributions have
been available for the HMM toolkit on which the implementation of HSMMs is
based, (b) transition background distributions are within the core of HSMMs, and
(c) intermediate states and shortcuts can easily be incorporated by modifying the
model structure, these techniques have primarily been used in this thesis.
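The mixing step for observation background distributions can be sketched as follows; the uniform background and the weight of 0.1 are arbitrary illustration values, not the ones used in the thesis:

```python
def mix_with_background(obs_probs, weight=0.1):
    """Blur an HMM observation distribution with a uniform background
    distribution: symbols unseen in training keep a small non-zero
    probability, and the result still sums to one."""
    m = len(obs_probs)
    background = 1.0 / m
    return [(1.0 - weight) * p + weight * background for p in obs_probs]

# Symbols 2 and 3 never occurred in training, so Baum-Welch estimated
# probability zero for them; the background repairs that.
trained = [0.5, 0.5, 0.0, 0.0]
mixed = mix_with_background(trained, weight=0.1)
print(all(p > 0 for p in mixed))       # -> True
print(abs(sum(mixed) - 1.0) < 1e-12)   # -> True
```

This is exactly the repair needed for the zero-probability problem described above: a previously unseen symbol no longer forces the sequence likelihood to zero.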
• Regularization. The techniques described so far have left the core of the training
procedure untouched. Regularization methods modify the training procedure itself:
the objective function of training is changed such that model complexity
is penalized. Many regularization techniques exist for neural networks (see, e.g.,
Bishop [30]); for hidden Markov models, however, there are fewer. Hence, regularization has been left for future work.
• Aggregated models. Another group of techniques does not build on one single
model but rather on a population of component models that are aggregated to form
a larger one. One of the predominant families of techniques is called arcing,5 of which bagging and boosting are the most well-known. Bagging trains various component models
on randomly chosen subsets of the training data. The output of the aggregated model
is simply a majority vote among the component models. Boosting, of which AdaBoost6 is the most well-known variant, first trains a component model from a subset of training data, and then subsequently trains further component models from data sets that
consist half of input data that is correctly classified by the previous component models and half of incorrectly classified training samples. By this method, subsequent
component models are somewhat complementary to their predecessors. See Duda
et al. [85] for an overview of these methods. In this thesis, aggregated models have
not been used. Nonetheless, the concepts could be applied without restrictions.

5 Adaptive Reweighting and CombinING
6 “Adaptive Boosting”
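A bagging sketch with a toy decision stump as component model; all data and models here are hypothetical, and, as noted above, the thesis itself does not use aggregated models:

```python
import random

def bag_train(data, n_models, train_one, seed=7):
    """Bagging: train each component model on a bootstrap sample of the data."""
    rng = random.Random(seed)
    return [train_one([rng.choice(data) for _ in data]) for _ in range(n_models)]

def bag_predict(models, x):
    """Aggregate by majority vote among the component models."""
    votes = [m(x) for m in models]
    return max(set(votes), key=votes.count)

# Toy component model: a stump thresholding at the mean input of its sample.
def train_stump(data):
    mid = sum(x for x, _ in data) / len(data)
    return lambda x: int(x > mid)

data = [(x / 10.0, int(x >= 5)) for x in range(10)]
models = bag_train(data, n_models=11, train_one=train_stump)
print(bag_predict(models, 0.9), bag_predict(models, 0.1))
```

The majority vote smooths out the variability of the individual bootstrap-trained stumps, which is precisely the variance reduction bagging is used for.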
7.4
Summary
In this chapter, the theory of the last step of online failure prediction using a pattern recognition approach such as hidden semi-Markov models has been covered: the final classification of whether the current status of the system, as expressed by the observed error event
sequence, is failure-prone or not.
In order to ground the classification process in a theoretical framework, Bayes decision theory has been introduced. It has been shown why the overall error rate of any
classifier is minimal if decision boundaries are chosen at the points where the posterior probability distributions cross. This concept has been extended to multi-class classification,
minimum-cost classification, and the use of rejection thresholds. Based on this framework,
other straightforward classification schemes have been analyzed. Since in real applications log-likelihood is most commonly used, classification based on log-likelihood has
been investigated, leading to the conclusion that only two-class classification can be used.
Since the modeling approach of this thesis employs a model for each failure mechanism,
all failure-related models are combined using the maximum operator, which is then compared to the sequence log-likelihood of the non-failure model.
The framework of Bayes decision theory is also the foundation for a detailed analysis
of the classifier error rate in terms of bias and variance. The so-called bias-variance dilemma
has been introduced via the simpler case of regression. Subsequently, an analysis for
classification has been presented. The main purpose of this excursion was to explain the
necessity of controlling the bias-variance trade-off of the modeling approach. Finally,
a collection of well-known techniques has been described, and each has been discussed in
the light of online failure prediction with hidden semi-Markov models.
Contributions of this chapter. The overview of the main methods to control the trade-off
between bias and variance is a collection of the techniques found in several textbooks
on machine learning and pattern recognition. Additionally, it is —to the best of our
knowledge— the first time the aspect of log-likelihood for multi-class classification is
considered. Furthermore, some new figures and plots have been developed in the hope of
making Friedman’s theory more understandable.
Relation to other chapters. This chapter has covered the third stage of the comprehensive approach to online failure prediction pursued in this thesis: after data preprocessing
and HSMM modeling, it has described the step of coming to a conclusion about the current status of the system.
This chapter also concludes the modeling part of the thesis. Being equipped with
the principal solution to the problem of online failure prediction, the next part turns to the
third phase of the engineering cycle: The application of the principal solution to industrial
data of a commercial telecommunication system.
Part III
Applications of the Model
Chapter 8
Evaluation Metrics
Having presented the approach to online failure prediction in detail, this third part of the
thesis is concerned with the experimental evaluation of the approach. Experiments have
been performed on data of an industrial telecommunication system. Before presenting
experimental results, this chapter introduces the metrics used for evaluation. Specifically,
in Section 8.1 metrics related to failure sequence clustering are presented, and in Section 8.2 metrics to evaluate the accuracy / quality of failure predictions are covered. The
evaluation process, including how statistical significance is assessed, is described in Section 8.3.
8.1
Evaluation of Clustering
Data preprocessing includes clustering at two levels: first, when message IDs are assigned
to log records and second, when failure sequences are grouped in order to separate failure
mechanisms in the training data (c.f., Sections 5.1.1 and 5.2). Several aspects must be
considered in the process of clustering: a (hierarchical) clustering algorithm must be chosen (i.e., agglomerative or divisive clustering), and in case of agglomerative clustering, the
inter-cluster distance metric needs to be defined (i.e., nearest neighbor, furthest neighbor,
unweighted pair-group average, or Ward’s method). Using dendrograms and banner plots,
the choice of methods can be visually investigated in order to see whether the clustering
technique results in a clear and reasonable division. A more formal analysis is provided
by the agglomerative and divisive coefficients, which try to express “clusterability” as a real
number between zero and one.
After clustering, the number of groups into which the data is partitioned needs to be
determined. Several methods for this have been covered in Section 5.2.3, one of which is visual inspection; for visual inspection, dendrograms or banner plots can be used as well.
8.1.1
Dendrograms
Dendrograms are tree-like charts that indicate which data points have successively been
merged / divided in the course of agglomerative / divisive hierarchical clustering. In
Figure 8.1, dendrograms for a simple six point example clustered with three different
clustering methods are shown.

Figure 8.1: Dendrograms for a six-point example. (a) shows the data points to be clustered,
(b) the result of divisive clustering (divisive coefficient = 0.78), (c) agglomerative clustering with the single linkage distance metric (agglomerative coefficient = 0.63), and (d) agglomerative clustering using the complete
linkage distance metric (agglomerative coefficient = 0.84).

The tree structure indicates which data points are merged /
divided, and the height of the connecting horizontal bar indicates the corresponding level
of the distance metric termed “height”. It can be seen that different clustering algorithms
can result in different groupings. In the example depicted in Figure 8.1, divisive and
single linkage clustering suggest a division into two groups {A, B} and {C, D, E, F }
while complete linkage clustering suggests three groups {A, B}, {C, D}, and {E, F }.
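A naive single linkage clustering sketch on hypothetical coordinates mimicking Figure 8.1-a (the real example's coordinates are not given in the text):

```python
def single_linkage(points, n_clusters):
    """Agglomerative clustering sketch: repeatedly merge the two clusters
    with the smallest nearest-neighbor (single linkage) distance."""
    clusters = [[p] for p in points]

    def dist(a, b):  # single linkage: minimum pairwise Euclidean distance
        return min(((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
                   for x1, y1 in a for x2, y2 in b)

    while len(clusters) > n_clusters:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

# Hypothetical coordinates: A, B near each other; C..F in a second group.
pts = {'A': (1, 1), 'B': (2, 2), 'C': (10, 10), 'D': (11, 10),
       'E': (10, 14), 'F': (11, 14)}
groups = single_linkage(list(pts.values()), n_clusters=2)
print(sorted(len(g) for g in groups))  # -> [2, 4]
```

With these coordinates the algorithm recovers the two-group division {A, B} and {C, D, E, F} suggested by the dendrograms.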
8.1.2
Banner Plots
Although dendrograms provide an intuitive way to present the result of clustering, they
get overly complicated if the number of data points is increased. Rousseeuw [215] has
introduced banner plots, which are more suited to large data sets. Therefore, in this dissertation, banner plots are used to visualize clustering results.
A banner plot is a horizontal plot that connects data points by a colored bar whose length
corresponds to the level of division / merge. As is the case for dendrograms, this sometimes
requires reordering of data points. Figure 8.2 shows corresponding banner plots for the
dendrograms shown in Figure 8.1-b and 8.1-d.

Figure 8.2: Banner plots for divisive clustering (a, divisive coefficient = 0.78) and agglomerative clustering based on complete linkage (b, agglomerative coefficient = 0.84). The plots correspond to dendrograms (b) and (d) of Figure 8.1.

Note that banner plots for divisive and
agglomerative clustering are reversed, since banner plots document the “operation” of the
clustering algorithm, i.e., division and merging, from left to right.
8.1.3
Agglomerative and Divisive Coefficient
Dendrograms and banner plots visually give a notion of the data set’s “clusterability”.
Formal metrics addressing this aspect are the divisive and agglomerative coefficients. For the
divisive algorithms, let d(i) denote the diameter of the last cluster to which observation
i belongs (before being split off as a single observation), divided by the diameter of the
whole dataset. For agglomerative algorithms, let m(i) denote the dissimilarity of observation
i to the first cluster it is merged with, divided by the dissimilarity of the merger in the final
step of the algorithm. Then the divisive coefficient DC and the agglomerative coefficient AC are
defined as follows:
    DC = (1/n) Σ_{i=1}^{n} (1 − d(i))   ∈ [0, 1]                          (8.1)

    AC = (1/n) Σ_{i=1}^{n} (1 − m(i))   ∈ [0, 1] .                        (8.2)
Both coefficients can be interpreted as the average width of the banner plot, which is also a
measure of how “filled” the banner plot is. Since the banner plot is scaled such that the
first split / last merger determines one border of the plot, the larger the filled area, the
clearer the structure in the data. Hence, AC and DC can be interpreted as indicators
of the strength of the clustering structure in the data. However, with an increasing number of
observations n, both AC and DC grow and should therefore not be used to compare data
sets of very different sizes.
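Equation 8.2 can be sketched directly; the first-merge heights below are hypothetical and do not reproduce the coefficients of Figure 8.1:

```python
def agglomerative_coefficient(first_merge_heights, final_height):
    """AC per Equation 8.2: m(i) is the dissimilarity at which observation i
    is first merged, divided by the dissimilarity of the final merger."""
    n = len(first_merge_heights)
    return sum(1.0 - h / final_height for h in first_merge_heights) / n

# Hypothetical first-merge heights for six observations; final merger at 26.9.
first = [2.0, 2.0, 4.0, 4.0, 3.0, 3.0]
print(round(agglomerative_coefficient(first, final_height=26.9), 2))  # -> 0.89
```

Because every m(i) is small relative to the final merger, the coefficient is close to one, indicating a strong clustering structure.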
8.2
Metrics for Prediction Quality
The output of online failure prediction is a binary decision whether the current status
of the system is failure-prone or not. Evaluating these binary decisions results in a so-called contingency table, from which a variety of metrics can be inferred. The advantage
of these metrics is that an intuitive interpretation of classification results exists. On the
other hand, as explained in Chapter 7, decisions are subject to various parameters such as
classification cost and prior distributions. While prior distributions can be estimated from
the data set, an assignment of classification cost is quite application-specific and is not
an easy task. Indeed, by choice of classification cost, a comparison of failure prediction
methods can easily be tuned in favor of one method or another. For this
reason, classification-independent metrics are also used to evaluate the predictive power
of online failure prediction approaches.
The purpose of this section is to provide a comprehensive overview of the various evaluation metrics for failure prediction algorithms. However, only
• precision, recall, true positive rate, and false positive rate,
• F-measure,
• precision-recall plots,
• ROC plots,
• AUC, and
• accumulated runtime cost
are used in this dissertation.
8.2.1 Contingency Table
Obviously, the goal of any failure prediction is to predict a failure if and only if the system really is failure-prone. However, it is doubtful that any prediction algorithm will ever reach such a one-to-one match between failure predictions and the true situation of the system. In fact, two types of mispredictions can occur:
• The failure prediction algorithm may predict an upcoming failure although the system is in fact running well and no failure is about to occur. This is called a false positive, or Type I error. In failure prediction, a positive prediction is also called a failure warning, and hence this misprediction is a false warning.
• The failure prediction algorithm may suggest that the system is in a correct, not failure-prone state although this is not true. Such a misprediction is called a false negative or Type II error. Since there is no warning about the upcoming failure, this situation is also called a missing warning.
Similarly, there are two cases for correct predictions:
• If the system is correctly identified as failure-prone, the prediction is a true positive or correct warning.
• If the system is correctly identified as non-failure-prone, the prediction is a true negative or correct no-warning.
If, for an experiment, each prediction is assigned to one of the four cases and the number of occurrences of each case is counted, a so-called contingency table is obtained, as shown
in Table 8.1. The table is sometimes also called the confusion matrix (e.g., in Kohavi & Provost [146]), and it depends on lead-time ∆t_l, prediction-period ∆t_p, and data window size ∆t_d (c.f., Figure 2.4 on Page 12).

                            True Failure            True Non-failure           Sum
  Prediction: Failure       true positive (TP)      false positive (FP)        positives (POS)
  (failure warning)         (correct warning)       (false warning)
  Prediction: No failure    false negative (FN)     true negative (TN)         negatives (NEG)
  (no failure warning)      (missing warning)       (correctly no warning)
  Sum                       failures (F)            non-failures (NF)          total (N)

Table 8.1: Contingency table. Any failure prediction belongs to one of four cases: if the prediction algorithm decides in favor of an upcoming failure, the prediction is called a positive, resulting in a failure warning being raised. This decision can be right or wrong. If the system truly is in a failure-prone state, the prediction is a true positive; if not, a false positive. Analogously, if the prediction algorithm decides that the system is running well (a negative prediction), this prediction may be right (true negative) or wrong (false negative).
8.2.2 Metrics Obtained from Contingency Tables

Various metrics have been proposed in different research communities that express various aspects of the contingency table. Table 8.2 summarizes them. Although the table already lists the metrics that are used in this thesis, they are briefly discussed in the next paragraphs. Please note that the terms "precision" and "accuracy" are used differently than in measurement theory, where they refer to the mean deviation from the true value and the spread of measurements, respectively. Moreover, there are at least seven more meanings of "precision".
Name of the metric            Symbol    Formula                            Other names
Precision                     p         TP / (TP+FP) = TP / POS            Confidence, Positive predictive value
Recall / True positive rate   r, tpr    TP / (TP+FN) = TP / F              Support, Sensitivity, Statistical power
False positive rate           fpr       FP / (FP+TN) = FP / NF             Fall-out
True negative rate            1 − fpr   TN / (TN+FP) = TN / NF             Specificity
False negative rate           1 − r     FN / (TP+FN) = FN / F
Negative predictive value     npv       TN / (TN+FN) = TN / NEG
False positive error rate     1 − p     FP / (FP+TP) = FP / POS
Accuracy                      acc       (TP+TN) / (TP+TN+FP+FN)
Odds ratio                    OR        (TP · TN) / (FP · FN)

Table 8.2: Metrics obtained from the contingency table (c.f., Table 8.1). Different names for the same measures have been used in various research areas (rightmost column). Specificity, false negative rate, negative predictive value, and false positive error rate are listed for completeness; they are not further discussed in this thesis as they do not add a fundamentally different view on the contingency table.
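To make the relations in Table 8.2 concrete, the following sketch derives the main metrics from raw contingency-table counts (the counts themselves, and the function name, are hypothetical):

```python
def contingency_metrics(tp, fp, fn, tn):
    """Metrics from Table 8.2, computed from contingency-table counts."""
    pos, neg = tp + fp, tn + fn      # positive / negative predictions
    f, nf = tp + fn, fp + tn         # true failures / true non-failures
    return {
        "precision": tp / pos,
        "recall": tp / f,            # identical to the true positive rate
        "fpr": fp / nf,
        "npv": tn / neg,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "odds_ratio": (tp * tn) / (fp * fn),
    }

# Hypothetical counts for an experiment with 60 failures among 1000 predictions:
m = contingency_metrics(tp=40, fp=10, fn=20, tn=930)
```

Note how the example already hints at the accuracy problem discussed below: accuracy is high (0.97) because true negatives dominate, although a third of all failures are missed.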
Precision and recall. The terms precision and recall have originally been introduced
for information retrieval by van Rijsbergen [214]. Precision is defined as the ratio of
correctly identified failures to the number of all failure predictions. Recall is the ratio of
correctly predicted failures to the number of true failures:
\[
\text{Precision } p = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} = \frac{\text{correct warnings}}{\text{failure warnings}} \;\in\; [0,1] \tag{8.3}
\]

\[
\text{Recall } r = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} = \frac{\text{correct warnings}}{\text{failures}} \;\in\; [0,1]\,. \tag{8.4}
\]
Consider the following two examples for clarification: First, a perfect failure predictor would achieve precision and recall of 1.0. Second, a real prediction algorithm that achieves a precision of 0.8 generates correct failure warnings (referring to true failures) in 80% of all cases and false positives in 20% of all cases. A recall of 0.9 expresses that 90% of all true failures are predicted and 10% are missed.
Since information retrieval has to cope with extreme class imbalance¹, precision and recall are also well suited for the evaluation of failure prediction tasks: failures are usually much rarer than non-failures. There are two boundary cases in which precision and recall are not defined:
• Precision is not defined if there are no positive predictions at all. Since in this case the number of true positives equals the number of all positive predictions (both are zero), a precision of one is used. The same result is obtained if a threshold is involved in classification (c.f., Section 7.2.2): with increasing threshold, the prediction algorithm must be "more sure" about an upcoming failure to issue a warning; hence precision increases. At some point the threshold is so high that not a single prediction is positive, and precision is therefore set to one.
• Recall is not defined if the number of failures in the experiment is zero. However, since testing a failure predictor without any failures in the test data set is not useful, this case is not considered further.
Weiss & Hirsh [277] argue that in real applications of failure prediction, first, the same failure might be predicted several times, and second, false positives occurring in bursts should not be counted equally to false positives occurring separately. Therefore, the authors introduce a modified version of precision and recall:

\[
p' = \frac{\text{predicted failures}}{\text{predicted failures} + \text{discounted false warnings}} \tag{8.5}
\]

\[
r' = \frac{\text{predicted failures}}{\text{total number of failures}}\,, \tag{8.6}
\]

where discounted false warnings refers to the number of complete, non-overlapping prediction periods ∆t_p associated with a false prediction.
¹ Usually, the number of relevant documents is much smaller than the total number of documents.

F-measure. Improving precision, i.e., reducing the number of false positives, often results in worse recall, i.e., an increased number of false negatives, at the same time. To capture the trade-off between precision and recall, the F-measure can be used (Makhoul et al. [172]). The F-measure is the weighted harmonic mean of precision and recall, where precision is weighted by α ∈ [0, 1]:
\[
F_\alpha = \frac{1}{\frac{\alpha}{p} + \frac{1-\alpha}{r}} = \frac{p \cdot r}{(1-\alpha)\,p + \alpha\,r} \;\in\; [0,1]\,. \tag{8.7}
\]
A special case is F_{0.5}, where precision and recall are weighted equally:

\[
F_{0.5} = \frac{2 \cdot p \cdot r}{p + r}\,. \tag{8.8}
\]
If precision and recall both equal zero, the F-measure is not defined, but the discontinuity can be removed such that the F-measure equals 0 in this case.²
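A minimal sketch of Equation 8.7, including the removable discontinuity at p = r = 0 (the function name and input values are illustrative):

```python
def f_measure(p, r, alpha=0.5):
    """Equation 8.7; the removable discontinuity at p = r = 0 yields 0."""
    if p == 0 and r == 0:
        return 0.0
    return (p * r) / ((1 - alpha) * p + alpha * r)

f_equal = f_measure(0.8, 0.9)   # alpha = 0.5: the special case F_{0.5}
```

With alpha = 1, the F-measure reduces to precision; with alpha = 0, it reduces to recall.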
False positive rate and true positive rate. The false positive rate is defined as the ratio
of incorrect failure warnings to the number of all non-failures:
\[
\text{false positive rate } fpr = \frac{\text{false positives}}{\text{false positives} + \text{true negatives}} = \frac{\text{false warnings}}{\text{non-failures}}\,. \tag{8.9}
\]
The definition of true positive rate tpr is equivalent to recall. However, in combination
with false positive rate, the term true positive rate is used.
Accuracy. All evaluation metrics are concerned with the "accuracy" of failure prediction approaches in the general meaning of the word. Confusingly, one such measure is actually called accuracy; it is defined as the ratio of correct predictions to all predictions performed:

\[
\text{accuracy } acc = \frac{\text{true positives} + \text{true negatives}}{\text{true positives} + \text{false positives} + \text{false negatives} + \text{true negatives}}\,. \tag{8.10}
\]
However, accuracy is not an appropriate measure for failure prediction. This is due
to the fact that failures are rare events. Consider, for example, a predictor that always
classifies the system to be non-failure-prone. Since the vast majority of predictions refer
to non-failure prone situations, the predictor achieves excellent accuracy since it is right
in most of the cases. Instead, precision and recall measure the percentage of correct
failure warnings and percentage of correctly predicted failures, respectively. Hence, these
metrics are more appropriate to assess the quality of failure prediction algorithms.
Odds ratio. Although mainly used in medical research, the odds ratio can be applied to assess failure prediction algorithms. In statistics, odds are a way to describe probabilities in a p : q manner. More specifically, the odds O of an event E are defined as:

\[
O(E) = \frac{P(E)}{1 - P(E)}\,. \tag{8.11}
\]

For example, if 60% of all cats are black, the odds for a cat to be black are 60:40 = 1.5.
² To prove $\lim_{(p,r)\to(0,0)} F(p,r) = 0$, it has to be shown that $\forall\,\varepsilon > 0\;\exists\,\delta > 0$ such that for all $(p,r)$ with $p, r > 0$ and $|(p,r)-(0,0)| < \delta$: $\frac{2\,p\,r}{p+r} < \varepsilon$ (c.f., e.g., Bronstein et al. [39]). The existence of $\delta$ can be proven by letting $p = r = \frac{\varepsilon}{2}$, from which $\delta = \frac{\varepsilon}{\sqrt{2}}$ follows.
The odds ratio is defined as the ratio of the odds of an event occurring in one group to the odds of it occurring in another group:

\[
OR(E) = \frac{O_1(E)}{O_2(E)}\,. \tag{8.12}
\]

For example, if the odds for mice to be black are 1:10 = 0.1, the odds ratio is 1.5/0.1 = 15, expressing that cats are much more likely to be black than mice. Due to the fact that OR(E) can take values from [0, ∞), the odds ratio is skewed. However, taking the logarithm turns it into a measure with values in (−∞, ∞), which additionally is normally distributed such that standard error and hence confidence intervals can be computed (see, e.g., Bland & Altman [31]).
In the case of failure prediction evaluation, the odds ratio is

\[
OR(W) = \frac{TP \cdot TN}{FP \cdot FN}\,, \tag{8.13}
\]

expressing the "odds" that a failure warning occurs in the case of a true failure rather than in the case of a true non-failure. However, the odds ratio is equivalent to $\frac{tpr}{1-tpr}\cdot\frac{1-fpr}{fpr}$, and a comparison with ROC plots, which also relate tpr and fpr (see below), has shown that ROC plots are much more meaningful (Pepe et al. [201]). Therefore, the odds ratio is not used explicitly in this dissertation.
8.2.3 Plots of Contingency Table Measures

The various measures obtained from a contingency table are singleton values that share two restrictions:
1. They evaluate binary decisions. As derived in Chapter 7, binary decisions result from comparison with a threshold θ. Hence, contingency table-based metrics depend on θ.
2. They represent average behavior over the entire evaluation data set.
If either of the two restrictions is relaxed, a curve rather than a singleton value results. By inspecting these curves, more insight into a predictor's characteristics can be gained. On the other hand, comparability between failure prediction methods is worse.
Precision-recall curves. To visualize the inverse relationship between precision and recall (improving recall by warning more frequently about upcoming failures often results in worse precision, and vice versa), values of precision and recall can be plotted for various threshold levels. The resulting graph is called a precision-recall curve. Figure 8.3 shows an exemplary plot.

Note that neither precision nor recall incorporates the number of true negative predictions. Receiver operating characteristics employ the false positive rate, which indirectly includes the number of true negatives.
Figure 8.3: Sample precision/recall plot for two failure predictors A and B. Each point on a curve corresponds to one classification threshold θ. Predictor A shows relatively good precision for most recall values but then drops quickly. In the limiting case where all sequences are classified as failure-prone, a recall of one and a corresponding precision of F/N is achieved. In the opposite case, where no sequence is classified as failure-prone, recall is zero and precision equals one.
Receiver Operating Characteristics (ROC). ROC curves (see, e.g., Egan [88]) are
one of the most versatile plots used in machine learning. They plot true positive rate over
false positive rate. Since a perfect classifier achieves a false positive rate fpr = 0 and
true positive rate tpr = 1, the closer a curve gets to the upper left corner, the better the
classifier. If applicable, points for various thresholds are drawn and linearly interpolated
resulting in a curve.3 As has been shown in Chapter 7, in case of Bayes classification,
θ depends on skewness as well as on the cost involved with the four cases of classification. Figure 8.4 shows ROC curves for three threshold-based predictors / classifiers and a
perfect classifier.
Figure 8.4: ROC plot. True positive rate is plotted over false positive rate for a varying classification threshold θ. Predictor A shows better performance than B, while predictor C corresponds to random guessing. A perfect predictor would achieve fpr = 0 and tpr = 1.
³ Other methods, such as decision trees (e.g., C4.5), apply a "fixed" classification and hence result in a single point in ROC space.
In order to relate ROC plots to precision and recall, consider the following equivalent formula for precision:

\[
p = \frac{TP}{TP + FP} = \frac{\frac{TP}{F}}{\frac{TP}{F} + \frac{FP}{F}} = \frac{\frac{TP}{F}}{\frac{TP}{F} + \frac{NF}{F}\cdot\frac{FP}{NF}} = \frac{tpr}{tpr + \frac{NF}{F}\cdot fpr}\,, \tag{8.14}
\]

which is a function of tpr and fpr. NF/F denotes the ratio of non-failure to failure sequences, which is class skewness. It can be shown that iso-precision curves in ROC space are concentric lines originating from the point (0,0) (c.f., Flach [97]). Keeping in mind that true positive rate equals recall, each point on the ROC curve can be associated with a value for precision and recall, as shown in Figure 8.5.
Figure 8.5: Relation between ROC plots and precision and recall. Each point on the ROC
curve is associated with a precision / recall pair. Iso-precision lines are concentric
at (0,0). In the graph, precision p1 > p2 > p3 . Since recall equals true positive
rate, corresponding recall values r1 , r2 and r3 can be read off directly.
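Equation 8.14 can be applied directly to annotate a ROC point with its precision, given the class skew NF/F (the values below are hypothetical):

```python
def precision_from_roc(tpr, fpr, skew):
    """Equation 8.14: precision at a ROC point; skew is the ratio NF/F."""
    return tpr / (tpr + skew * fpr)

# A seemingly low fpr of 0.05 still costs precision at a skew of NF/F = 10:
p_at_point = precision_from_roc(tpr=0.8, fpr=0.05, skew=10.0)
```

The example illustrates the skewness problem discussed next: at NF/F = 10, a false positive rate of only 0.05 already pushes precision down to roughly 0.62.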
ROC plots, as well as precision-recall plots, account for all possible values of θ, which is one of their major advantages. However, in the special case of failure prediction, one problem occurs with ROC plots: there is usually non-negligible class skewness, since failures are encountered less frequently than non-failure cases. Therefore, low false positive rates are easily obtained, and hence only a small fraction of ROC space is of "interest". In other words, in many failure prediction approaches, and especially in those evaluating periodically measured data, true negative predictions dominate, which results in a small fpr. Flach [97] has analyzed the effects of class skewness on ROC plots and has defined skew-insensitive variants of accuracy, precision, and F-measure. However, these are not considered in this thesis since experiments are carried out on a single data set and hence class skewness is the same for all experiments.
Detection error trade-off (DET). Another way to compensate for class skewness is to
use DET curves (Martin et al. [177]). DET curves differ from ROC plots in two ways:
1. Instead of true positive rate, the y-axis plots false negative rate f nr = 1 − tpr. This
gives uniform treatment to both types of mispredictions: false positives and false
negatives.
2. Both axes are plotted on normal deviate scale. This leads to a linear curve in the
case of normal class distributions.
Figure 8.6 shows an example.
Figure 8.6: Detection error trade-off (DET) plot. In comparison to ROC plots, DET plots show false negative rate fnr = 1 − tpr instead of tpr over false positive rate. Both axes have normal deviate scale. Curve B corresponds to random prediction, while predictor A is better than random.
The drawback of DET curves is that there is no graphical way to determine minimum cost, as there is for ROC plots (see below). Additionally, DET curves have not yet been established as a standard plot for classification performance evaluation, and no failure-prediction-related publication has been found that uses them. Hence, DET curves are not further considered.
8.2.4 Cost Impact of Failure Prediction
In Section 7.1.2, a cost or risk matrix was introduced, where r_ta denotes the cost for assigning class label a to a sequence that in reality belongs to class t; e.g., r_FF̄ denotes the cost for falsely classifying a failure-prone sequence as non-failure-prone. If true positive rate and false positive rate of a failure prediction algorithm are known, its expected cost can be determined as follows:

\[
cost = \frac{F}{N}\Bigl[(1-tpr)\,r_{F\bar{F}} + tpr\,r_{FF}\Bigr] + \frac{NF}{N}\Bigl[(1-fpr)\,r_{\bar{F}\bar{F}} + fpr\,r_{\bar{F}F}\Bigr]\,. \tag{8.15}
\]

The equation distinguishes between all four cases: true and false, positive and negative predictions. F/N determines the fraction of failure sequences and NF/N the fraction of non-failure sequences. The true positive rate (tpr) indicates the fraction of failure sequences that are predicted⁴, and hence cost r_FF is assigned to this case. The same argumentation applies to the remaining three cases. Given a cost/risk matrix, the overall goal is to find a failure predictor with minimum expected cost.
⁴ "Caught" by the failure predictor.
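Equations 8.15 and 8.16 can be sketched as follows; the dictionary keys are hypothetical names for the cost entries r_F̄F̄, r_FF, r_F̄F, and r_FF̄, with the values taken from the example in Figure 8.7:

```python
def expected_cost(tpr, fpr, f, nf, r):
    """Equation 8.15: expected cost per prediction, weighting the four
    cases of the contingency table by class fractions F/N and NF/N."""
    n = f + nf
    return (f / n) * ((1 - tpr) * r["FFbar"] + tpr * r["FF"]) \
         + (nf / n) * ((1 - fpr) * r["FbarFbar"] + fpr * r["FbarF"])

def iso_cost_slope(f, nf, r):
    """Slope of iso-cost lines in ROC space (Equation 8.16)."""
    return (nf / f) * (r["FbarF"] - r["FbarFbar"]) / (r["FFbar"] - r["FF"])

costs = {"FbarFbar": 1, "FF": 10, "FbarF": 100, "FFbar": 1000}
c = expected_cost(tpr=0.9, fpr=0.1, f=25, nf=1000, r=costs)
s = iso_cost_slope(f=25, nf=1000, r=costs)
```

Sweeping (tpr, fpr) over a predictor's ROC points and taking the minimum of `expected_cost` corresponds to the graphical tangent construction of Figure 8.8.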
When analyzing contour lines of classification cost in ROC space, it can be shown that iso-cost lines are straight lines having slope

\[
\frac{d\,tpr}{d\,fpr}\bigg|_{cost\,=\,\text{const.}} = \frac{NF}{F}\cdot\frac{r_{\bar{F}F} - r_{\bar{F}\bar{F}}}{r_{F\bar{F}} - r_{FF}}\,, \tag{8.16}
\]

which depends only on class skewness NF/F and the classification cost matrix r_ij.
Figure 8.7 shows iso-cost lines for two values of class skewness. As expected, lower cost is achieved near the top-left corner of the ROC plot.

Figure 8.7: Iso-cost lines. Contours of equal cost are plotted for two class distributions. Solid lines correspond to a ratio of NF : F = 40 : 1, while dashed lines correspond to a ratio of 4 : 1. Classification cost has been assumed to be r_F̄F̄ = 1, r_FF = 10, r_F̄F = 100, r_FF̄ = 1000 (c.f., Section 7.1.2).
Since the slope of iso-cost lines depends only on variables that are determined by the application and not by the classifier, minimum achievable cost can be assessed by identifying the iso-cost line that is a tangent to the ROC curve (see Figure 8.8).
Cost graphs of Drummond & Holte. In [83], Drummond & Holte propose a way to turn ROC plots into a graph that explicitly shows cost. They define a so-called probability cost function

\[
PCF = \frac{\frac{F}{N}\,r_{F\bar{F}}}{\frac{F}{N}\,r_{F\bar{F}} + \frac{NF}{N}\,r_{\bar{F}F}}\,, \tag{8.17}
\]

expressing the ratio of the cost of misclassifying a failure-prone sequence as non-failure-prone ($\frac{F}{N}\,r_{F\bar{F}}$) to the maximum expected cost, which is the sum of both types of misclassification cost. Note that PCF consists only of application-specific parameters.

Normalized expected cost NE is defined as expected cost divided by maximum cost. It can be shown that

\[
NE = (1 - tpr - fpr)\,PCF + fpr\,, \tag{8.18}
\]
Figure 8.8: Determining minimum achievable cost from ROC. Three iso-cost lines c1 < c2 < c3, with slope determined by Equation 8.16, are drawn in the figure. Minimum achievable cost can be determined by the tangent to the ROC curve.
which is a linear function in PCF with bounding values fpr and 1 − tpr. Hence, for each point on the ROC curve (i.e., tpr and fpr for a given threshold θ), there is a tpr/fpr pair defining a straight line in the cost graph. If this line is plotted for various ROC points / thresholds, a convex hull results (see Figure 8.9). The convex hull can be used to identify the optimal threshold, resulting in minimal normalized expected cost, for each (application-specific) value of PCF. Furthermore, the intersection of the convex hull with the lines for always-positive and always-negative predictions defines the range of operation in terms of PCF for a given predictor.
Figure 8.9: Cost curves. Varying the classification threshold θ = {θi} for one predictor results in a set of corresponding pairs (tpr_i, fpr_i). Each pair defines a straight line showing normalized expected cost (NE) as a function of the probability cost function (PCF). The diagonals correspond to the two trivial predictors that always predict a failure F or a non-failure F̄. If for every value on the PCF axis the minimum value is chosen, a convex hull results (thick line). It can be seen that for some values of PCF, expected cost is greater than for a trivial predictor. This defines the operating range (in terms of PCF) of the predictor.
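The two quantities behind such cost curves, Equations 8.17 and 8.18, can be sketched as follows (the dictionary keys for the cost entries are hypothetical names for r_FF̄ and r_F̄F):

```python
def pcf(f, nf, r):
    """Equation 8.17: probability cost function, built from
    misclassification costs only (hypothetical key names)."""
    n = f + nf
    a = (f / n) * r["FFbar"]    # cost share of missed failures
    b = (nf / n) * r["FbarF"]   # cost share of false warnings
    return a / (a + b)

def normalized_expected_cost(tpr, fpr, p_cf):
    """Equation 8.18: the straight line in the cost graph that one
    ROC point (tpr, fpr) defines over the PCF axis."""
    return (1 - tpr - fpr) * p_cf + fpr
```

Evaluating `normalized_expected_cost` for several ROC points at a fixed, application-specific PCF value and taking the minimum corresponds to reading the convex hull of Figure 8.9.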
However, as can be seen from Equation 8.17, the plot only takes the misclassification costs r_FF̄ and r_F̄F into account⁵; costs for correct classifications are not involved. Due to this restriction, and due to the fact that cost is difficult to estimate for the telecommunication system, cost graphs of Drummond & Holte are left for future investigation.
Accumulated runtime cost. All of the above metrics and graphs build on average values for the entire data set. However, it makes a difference whether a failure predictor runs very well for most of the time except for short periods showing bursts of mispredictions, or whether the same number of wrong predictions occurs spread over the entire data set. Accumulated runtime cost graphs yield exactly this insight by adding cost r_ij for each prediction and showing the step function of accumulating cost over the runtime of the test (see Figure 8.10 for an example). They were initially developed together with Dr. Günther Hoffmann (see, e.g., Salfner et al. [224]) and have been extended in this dissertation. An accumulated runtime cost curve can be drawn either for several predictors or for varying thresholds θ of one predictor.
Figure 8.10: Exemplary accumulated runtime cost. Cost for all four types of prediction (true / false positive / negative) is plotted as it accumulates over time for two predictors A and B. In the figure, a cost setting of r_F̄F̄ : r_FF : r_F̄F : r_FF̄ = 1 : 2 : 4 : 8 has been assumed. Shaded areas indicate cost boundaries: maximum cost (each prediction is wrong), cost without failure prediction (failures are missed), cost for a perfect predictor (each prediction correct), and cost for oracle predictions (r_FF for each failure). Diamonds (♦) on the time line indicate the times of failure occurrence and circles (•) the times of predictions between failures.
⁵ Hence, the cost/risk matrix would have zeros on the main diagonal.

A further advantage of accumulated runtime cost is that cost boundaries can be visualized:
• An oracle, which of course does not exist, would need no evaluation of measurement data. It would simply know when a failure is about to occur. Hence, accumulated cost would consist only of the cost for correct failure predictions r_FF, occurring each time a true failure is observed.
• In contrast to the oracle, real predictors need to evaluate measurements from the running system. As each evaluation incurs some cost, real predictors result in higher accumulated cost. However, the perfect predictor, which only performs correct predictions, indicates the minimum cost for any predictor operating at the times of measurements. More specifically, a cost of r_FF occurs at times of failure and a cost of r_F̄F̄ at times of non-failure predictions / measurements. Nevertheless, it must be pointed out that this only determines the minimal achievable cost for one class of predictors. If, for example, measurements and hence predictions are performed much more rarely, lower cumulative cost can result even for non-perfect predictors. One typical example for this is the distinction whether prediction is performed on error events or on periodic measurements of system parameters such as workload: in most systems, errors occur less frequently than periodic measurements.
• Cost if no predictor is in place can be determined in the following way: at each occurrence of a failure, a cost of r_FF̄ − r_F̄F̄ occurs, which means that all failures are missed and no predictions are performed in between. The reason why r_FF̄ is decreased by r_F̄F̄ is that r_FF̄ also includes the cost of performing a prediction. The cost of a prediction without action can be approximated by true negative predictions, and hence r_F̄F̄ is subtracted.
• Maximum cost can be determined by assuming all predictions to be wrong. Hence, each non-failure prediction receives a cost of r_F̄F and each prediction at the time of failure occurrence receives r_FF̄. This also applies only to one class of predictors.
Of course, as is the case for all plots assuming fixed cost, the graph can look significantly different if the ratio of costs r_ij is changed. Furthermore, the difficulty of estimating the cost / risk matrix for real systems also applies to accumulated cost graphs. Nevertheless, since accumulated cost graphs do not build on average values, they provide insight into the temporal behavior of a failure prediction algorithm and are for this reason used in this dissertation.
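The accumulation itself is straightforward; a sketch using the 1 : 2 : 4 : 8 cost setting of Figure 8.10 and a hypothetical outcome encoding:

```python
def accumulated_cost(outcomes, r):
    """Running sum of cost r[outcome] over a chronological sequence of
    prediction outcomes ('TP', 'FP', 'FN', 'TN'); encoding is hypothetical."""
    total, curve = 0, []
    for o in outcomes:
        total += r[o]
        curve.append(total)
    return curve

# TN : TP : FP : FN = 1 : 2 : 4 : 8, as assumed in Figure 8.10:
curve = accumulated_cost(["TN", "TN", "FP", "TP", "FN"],
                         r={"TN": 1, "TP": 2, "FP": 4, "FN": 8})
```

Plotting `curve` over the prediction times yields the step function of Figure 8.10; bursts of mispredictions show up as steep segments.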
8.2.5 Other Metrics

Apart from the measures obtained from the contingency table (see Table 8.1) and the plots shown above, some other measures should be mentioned.
Area under ROC curve (AUC). The integral of a ROC curve,

\[
AUC = \int_0^1 tpr(fpr)\; d\,fpr \;\in\; [0,1]\,, \tag{8.19}
\]

is a widespread measure of classification accuracy. AUC can also be interpreted as the probability that a randomly chosen failure-prone sequence receives a higher rating than a randomly chosen non-failure sequence.
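When only a finite set of ROC points is available, Equation 8.19 is typically approximated by trapezoidal integration of the linearly interpolated curve, as in this sketch:

```python
def auc(points):
    """Equation 8.19, approximated by trapezoidal integration over
    ROC points given as (fpr, tpr) pairs sorted by fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

a_random = auc([(0.0, 0.0), (1.0, 1.0)])                 # diagonal
a_perfect = auc([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])    # upper-left corner
```

The two example curves reproduce the boundary cases discussed below: the diagonal (random guessing) yields 0.5, the perfect predictor yields 1.0.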
AUC turns the ROC curve into a single real number which, in contrast to ROC plots, enables numeric comparison of classifiers. Obviously, a perfect predictor achieves an AUC equal to one, and a purely random classifier receives an AUC of 0.5.⁶ AUC is threshold-independent, which is the major difference to contingency table-based metrics. However, AUC has its problems, too:
• AUC equally incorporates all possible threshold values, regardless of class skewness (c.f., the discussion of ROC curves).
• The interpretation of AUC is not as intuitive as that of contingency table-based metrics.
• For a given cost setting and class skewness, AUC can be misleading: a classifier with a larger AUC might result in a higher cost impact of failure prediction; hence, even though its AUC is better than that of other predictors, the predictor incurs worse cost. For example, in Figure 8.11, the AUC of predictor B is larger than that of predictor A. However, the minimal achievable cost for B is C2, which is larger than C1 for predictor A.
Figure 8.11: AUC can be misleading: predictor B (dashed line) has a better AUC than predictor A (solid line). However, for a given cost setting and NF/F ratio, the cost incurred by prediction is higher for B than for A, since C2 > C1.
Precision-recall break-even. One special point on precision-recall curves is the point where the curve crosses the ascending diagonal. At this point, precision and recall are equal, resulting in a scalar measure that can be used for comparison. However, if precision and recall are not equally significant for the application, this approach does not seem convincing and is hence not further considered in this thesis.
Further metrics. Many other metrics have been proposed in various scientific disciplines such as data mining or machine learning with decision trees. These include more recently introduced measures such as the G-measure (Flach [97]), weighted relative accuracy (Todorovski et al. [256]), and SAR (Squared error, Accuracy, and ROC area, see Caruana & Niculescu-Mizil [46]), as well as well-known metrics such as the Gini coefficient, lift, Piatetsky-Shapiro, φ-coefficient, etc. These measures could be applied to failure prediction as well; however, to the best of our knowledge, this has not been investigated so far.

⁶ Note that the inverse inference is not valid: an AUC of 0.5 does not necessarily imply a random classifier!
One exception is the κ-statistic, which has been used by Elbaum et al. [89] to build a detector for anomalous software events within the email client "pine". The interesting aspect of the κ-statistic is that it allows for a "soft" evaluation of prediction performance based on the κ value (see Altman [7] for details).
8.3 Evaluation Process
In the previous sections, the metrics by which the potential to predict failures is assessed have been discussed. In this section, the focus is on the evaluation process, i.e., how the metrics are obtained. The ultimate goal of evaluation is, of course, to identify the potential of a failure prediction approach to predict failures, given some data set. The evaluation process consists of several parts:
1. Many modeling approaches, such as the one described in this thesis, involve parameters that need to be adjusted, which is also called training.
2. In machine learning, training is based on data. However, this data should not be used for evaluation. Hence, the data set needs to be split.
3. In application domains such as failure prediction, the amount of data available for evaluation is limited. Hence, a technique called cross-validation is applied.
The following sections discuss each issue separately.
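The splitting step (issue 2, refined into three disjoint data sets below) can be sketched as follows; the proportions and the helper name are illustrative:

```python
import random

def three_way_split(sequences, train=0.6, validation=0.2, seed=0):
    """Hypothetical split into disjoint training, validation, and test
    sets: validation data serves to tune non-greedy parameters, and the
    test data is used only for the final performance evaluation."""
    seqs = list(sequences)
    random.Random(seed).shuffle(seqs)  # fixed seed for reproducibility
    n = len(seqs)
    a, b = round(train * n), round((train + validation) * n)
    return seqs[:a], seqs[a:b], seqs[b:]

tr, va, te = three_way_split(range(100))
```

In practice this single split is repeated within a cross-validation scheme, so that every sequence is used for testing exactly once.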
8.3.1 Setting of Parameters

Ideally, the parameters involved in modeling should be adjusted such that optimal failure prediction performance is achieved. However, "performance", as has been discussed, can be assessed by various metrics. Having decided upon one optimization criterion (e.g., F-measure), theoretically each parameter in the modeling process should be analyzed with respect to its effect on final failure prediction performance, which implies that each value of each parameter must be tested in combination with each value of every other parameter of the entire modeling process. Not surprisingly, this is hardly feasible, as more than 15 parameters are involved in HSMM failure prediction. For this reason, the evaluation process consists of a mixture of "greedy" and "non-greedy" steps:
• greedy: Parameters that can be set rather "robustly" by some local optimization criterion or heuristic. Local optimization in this context means that not the overall prediction performance needs to be evaluated, but some criterion that can be computed without a fully trained prediction model. "Robust" in this context means that there is sufficient background knowledge about the effect of the parameter on final prediction performance. "Greedy" also implies that, once adjusted, a parameter is not changed in later stages of the modeling process. An example of a parameter that can be set greedily is the length of the tupling interval ε; the local optimization criterion here is the number of resulting tuples (c.f., Section 5.1.2).
• non-greedy: Parameters for which no local optimization criterion exists, or about which little is known with respect to their effect on final failure prediction performance, need to be tested in combination with all other parameters that cannot be determined greedily. In order to reduce complexity, the range of values needs to be limited. Additionally, not every single value in the range needs to be explored if it can be expected that final prediction performance is a smooth function of the parameter. An example of such a parameter is the maximum span of shortcuts in the structure of the HSMM (c.f., Section 6.6). For each combination of parameters, the full modeling process needs to be performed, and prediction performance is assessed with respect to the selected evaluation metric.
Since greedy parameter optimizations involve only one local optimization each, increasing their number drastically reduces the overall training effort. Chapter 9 will provide details on how each parameter has been set for modeling the industrial telecommunication system.
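The exhaustive search over the non-greedy parameter combinations described above can be sketched as follows. The parameter names (`num_states`, `max_shortcut_span`) and the `evaluate` callback are illustrative assumptions of this sketch, not the thesis's actual interface:

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Evaluate every combination of non-greedy parameters.

    param_grid: dict mapping parameter name -> list of candidate values.
    evaluate:   callable(params) -> validation score (e.g., F-measure);
                higher is better. Each call stands for one full
                training + validation run.
    """
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical example: two non-greedy parameters with coarse value ranges.
grid = {"num_states": [25, 50, 100], "max_shortcut_span": [1, 2, 3]}
```

Because each `evaluate` call entails a complete training run, the candidate value ranges must be kept coarse, as the text points out.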
8.3.2 Three Types of Data Sets
As described in Section 2.4, a typical two-phase batch learning approach is applied in this thesis: first, a model is trained from previously recorded data, and it is subsequently applied to the running system in order to predict failures online. However, the project from which the data has been acquired did not allow applying the failure prediction approach to a running system for evaluation. For this reason, failure prediction performance must be evaluated from the data set itself. However, assessing prediction of failures that have been known in the training phase does not yield a realistic estimate of prediction performance —hence the data needs to be separated into disjoint training and test data sets.
Training, however, involves non-greedy estimation of model parameters, from which it follows that parameters have to be adjusted with respect to the final prediction performance metric. For this reason, the training data set needs to be further subdivided to yield a so-called validation data set. Hence three types of data sets result:
1. Training data set: The data on which the training algorithm is run.
2. Validation data set: Parameters for which no local optimization criterion is available need to be optimized with respect to final prediction performance (non-greedy estimation). Validation data is used to assess the prediction performance of each setting.
3. Test data set: The final evaluation of failure prediction performance is carried out on completely new data, which is the test data. By this, the generalization performance of the model is assessed, which is taken as an indication of how well the failure predictor would predict upcoming failures in a running system. Since evaluation is performed on data that has not been available for training and validation, such evaluation is also called out-of-sample.
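A minimal sketch of such a three-way split follows. The split fractions and the fixed random seed are assumptions for illustration; the thesis actually derives these sets via cross-validation (Section 8.3.3):

```python
import random

def three_way_split(sequences, test_frac=0.2, val_frac=0.2, seed=42):
    """Split data into disjoint training, validation, and test sets.

    The fractions are illustrative assumptions; the fixed seed makes
    the split reproducible.
    """
    rng = random.Random(seed)
    data = list(sequences)
    rng.shuffle(data)
    n = len(data)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = data[:n_test]                       # final out-of-sample evaluation
    validation = data[n_test:n_test + n_val]   # non-greedy parameter tuning
    training = data[n_test + n_val:]           # model parameter estimation
    return training, validation, test
```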
8.3.3 Cross-validation
In many machine learning applications, so much data is available that it cannot be processed entirely. In this case, the issue is to determine the minimum size of the data sets needed to assure statistical significance. In the case of online failure prediction, the situation is different: failure data is always scarce and all available data must be used. It is even so scarce that after splitting the data into training, validation, and test data, the data sets become too small to yield statistically reliable results. To remedy this situation, m-fold cross-validation⁷ can be used, which exploits the limited amount of data available by cyclic reuse, each time holding out another portion of the data for validation / testing of performance. More precisely, for m-fold cross-validation the data is split randomly into m disjoint sets of equal size n/m, where n is the size of the data set. The training and testing procedure is repeated m times, each time holding out a different subset for testing. The remaining portion, which is of size n − n/m, is subsequently split further into training and validation data.
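The partitioning scheme can be sketched as follows; shuffling with a fixed seed and the round-robin fold assignment are implementation assumptions:

```python
import random

def m_fold_splits(data, m, seed=0):
    """Generate the m train/test partitions of m-fold cross-validation.

    The data is shuffled once and cut into m disjoint subsets of
    (roughly) equal size n/m; each subset serves as the held-out set
    exactly once.
    """
    rng = random.Random(seed)
    items = list(data)
    rng.shuffle(items)
    folds = [items[i::m] for i in range(m)]
    for i in range(m):
        held_out = folds[i]
        rest = [x for j, f in enumerate(folds) if j != i for x in f]
        # 'rest' would be split further into training and validation data
        yield rest, held_out
```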
A special form of cross-validation uses stratification, which means that the distribution of the classes NF and F remains the same in each subset. However, stratification can only be applied to validation, since it is one of the main characteristics of the training procedure to separate failure from non-failure sequences in order to deal with class imbalance. Hence, stratification has not been applied here.
A further variant is Monte Carlo cross-validation (Shao [236]), where the data set is repeatedly divided into a fraction β for testing and (1 − β) for training. The procedure has been shown to yield more stable results for selecting the number of kernels in a Gaussian mixture modeling problem (Smyth [245]). However, since, first, it is not clear upfront that Monte Carlo cross-validation also performs better for failure prediction, and, second, it adds another parameter (β) that needs to be determined, only standard m-fold cross-validation has been applied in this dissertation.
8.4 Statistical Confidence
In order to gain trust in the assessment of failure prediction quality, each evaluation metric
should be accompanied by confidence intervals. For the accuracy evaluation metric a
theoretical analysis is available. A second theoretical analysis for other metrics builds on
the assumption of a normal sampling distribution, which cannot be guaranteed. For this
reason, confidence intervals are obtained from a well-known resampling strategy called
“bootstrapping”.
8.4.1 Theoretical Assessment of Accuracy
Mitchell [184] provides an analysis of confidence intervals for the mean error rate observed from an experiment:

E_s = E_S\bigl[P(\hat{c}(s) \neq c(s))\bigr] = \frac{1}{n} \sum_{s \in S} \bigl(1 - \delta_{\hat{c}(s)\,c(s)}\bigr), \qquad (8.20)

where n denotes the size of the experiment’s data set S = {s}, c(s) is the true value for s, ĉ(s) the estimated value, and δ_{ij} is the Kronecker delta. E_s is also called the sample error rate.

⁷ According to Duda et al. [85], cross-validation was invented by Cover [66]. However, Yu [283] claims that cross-validation was first invented by Kurtz [151] and developed into multi-cross-validation by Krus & Fuller [148]. Even more confusingly, Bishop [30] mentions Stone [251] as its inventor.
Confidence intervals can be obtained from the fact that counting misclassifications within a test data set of size n is a Bernoulli experiment, and the probability of encountering k misclassifications in the test data set is

P(k) = \binom{n}{k}\, p^k (1 - p)^{n-k}, \qquad (8.21)
where p is the true yet unknown error rate. p can be estimated from the number of misclassified sequences in the test data set, which is k, since the maximum likelihood estimate

p \approx \frac{k}{n} = E_s \qquad (8.22)

is an unbiased estimator, given that the samples of the test data set have been drawn according to the prior distribution P(s). From the fact that p is estimated as a mean value and that k is binomially distributed, it follows that the standard deviation of the error rate is approximately

\sigma_{E_s} \approx \sqrt{\frac{E_s (1 - E_s)}{n}}. \qquad (8.23)
For n E_s(1 − E_s) ≥ 5, the binomial distribution can be well approximated by a normal distribution⁸ and confidence intervals can be obtained:

C_N(E_s) = \left[ E_s - z_N \sqrt{\frac{E_s (1 - E_s)}{n}},\; E_s + z_N \sqrt{\frac{E_s (1 - E_s)}{n}} \right], \qquad (8.24)

where z_N is the width of the smallest interval about the mean that includes N% of the total probability mass.
Finally, a confidence interval for accuracy can be obtained from the relation

\mathit{acc} = 1 - E_s. \qquad (8.25)
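For illustration, Equations 8.21–8.25 can be turned into a small calculator (z = 1.96 for a 95% interval is an assumed choice; as discussed in the text, this approach is not actually applied in the thesis):

```python
import math

def accuracy_confidence_interval(k, n, z=1.96):
    """Normal-approximation confidence interval for error rate / accuracy.

    k: number of misclassified samples, n: test set size,
    z: quantile width (1.96 for a 95% interval, an assumed default).
    Valid when n * Es * (1 - Es) >= 5, per the approximation condition.
    """
    es = k / n                                   # sample error rate (Eq. 8.22)
    half = z * math.sqrt(es * (1 - es) / n)      # std. dev. of Eq. 8.23, scaled by z
    error_ci = (es - half, es + half)            # Eq. 8.24
    accuracy_ci = (1 - error_ci[1], 1 - error_ci[0])  # via acc = 1 - Es (Eq. 8.25)
    return error_ci, accuracy_ci
```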
However, Duda et al. [85] show that unless n is fairly large, the maximum likelihood estimate of p must be interpreted with caution. Furthermore, the analysis is only applicable to error rate / accuracy, but confidence intervals are needed for all of the evaluation metrics presented. Hence this approach is not applied in this thesis.
8.4.2 Confidence Intervals by Assuming Normal Distributions
The central limit theorem states that any sum of independent and identically distributed random variables tends towards the normal distribution. From this it follows that statistics such as the mean, which is defined by a sum, also tend to be normally distributed, and hence confidence intervals can be obtained by

C = \left[ \bar{x} - \frac{s}{\sqrt{n}},\; \bar{x} + \frac{s}{\sqrt{n}} \right], \qquad (8.26)

where x̄ denotes the mean of the values observed in the test data set, s the standard deviation, and n the sample size.

⁸ Otherwise, the cumulative binomial distribution must be computed directly.
However, this parametric way of determining confidence intervals only works for statistics that yield normal sampling distributions, which is a strong assumption that does not hold in general. Furthermore, there is no way to correct for bias or skew of the estimator. For this reason, this approach is also not applied in this thesis.
8.4.3 Jackknife
Quenouille [207] invented an estimation procedure that is applicable to any statistic estimator θ̂. The principal idea of the method is to compute the statistic for a data set from which one single data point has been removed. This is repeated, removing each data point once, and the overall value of the statistic is finally obtained by the so-called leave-one-out mean:

\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} \hat{\theta}_{(i)}, \qquad (8.27)

where θ̂ denotes the estimate of statistic θ and θ̂_{(i)} is the statistic for the data set from which data point i has been removed.
The major benefit of this method is that bias and variance of the statistic can be estimated, even for statistics that resist theoretical analysis, such as the mode or median. Due to this versatility, the method also became known under the term jackknife.
Although this method could in principle be applied in this thesis, its major problem is that it processes exactly n subsets. Computational complexity can be limited by leaving out more than one single sequence (similar to m-fold cross-validation), but this in turn degrades the quality of the statistic’s estimate.
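A minimal leave-one-out sketch of Equation 8.27:

```python
def jackknife(data, statistic):
    """Leave-one-out jackknife for an arbitrary statistic.

    Returns the n leave-one-out values and their mean (Eq. 8.27),
    from which bias and variance of the estimator can be derived.
    """
    n = len(data)
    loo = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    return loo, sum(loo) / n
```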
8.4.4 Bootstrapping
Bootstrapping (Efron [87]) adds more flexibility to the estimation process and is currently seen as state of the art (at least in engineering disciplines). According to Moore & McCabe [187], bootstrapping should be applied when the sampling distribution is non-normal, biased, or skewed, or for the estimation of statistics for which parametric estimates of confidence intervals are not available (such as the well-known outlier-resistant 25% trimmed mean).
The basic idea of bootstrapping is that, based on one original sample, many so-called resamples are generated by randomly selecting n instances from the original sample with replacement. Similar to the jackknife, the desired statistic is computed for each resample. However, the number of resamples can be chosen arbitrarily, and the same data point may occur several times in one resample. One explanation of why this method works is that the resulting bootstrapping distribution, i.e., the distribution of the statistic among resamples, can be shown to approximate the true sampling distribution if the original sample represents reality rather well.
The statistic’s bias can be estimated by

\mathrm{bias} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}_{(b)} - \hat{\theta}, \qquad (8.28)

where B denotes the number of resamples, θ̂_{(b)} the statistic θ computed from the b-th resample, and θ̂ the statistic computed from the original sample. This estimate of bias can be used to yield more reliable confidence intervals even for biased and skewed sampling distributions. In this thesis, bootstrap bias-corrected accelerated (BCa) confidence intervals have been used, which require that the number of artificial resamples B be set to at least 5000.
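The resampling scheme can be sketched as follows. For brevity, the sketch computes simple percentile intervals instead of the BCa correction used in the thesis; the bias estimate follows Equation 8.28:

```python
import random

def bootstrap_ci(sample, statistic, B=5000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for an arbitrary statistic.

    Percentile method for simplicity; the thesis applies the more
    involved BCa correction on top of the same resampling scheme.
    """
    rng = random.Random(seed)
    n = len(sample)
    stats = sorted(
        statistic([sample[rng.randrange(n)] for _ in range(n)])  # resample with replacement
        for _ in range(B)
    )
    lo = stats[int((alpha / 2) * B)]
    hi = stats[int((1 - alpha / 2) * B) - 1]
    bias = sum(stats) / B - statistic(sample)   # Eq. 8.28
    return (lo, hi), bias
```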
8.4.5 Bootstrapping with Cross-validation
As stated before, failure data is scarce and cross-validation needs to be applied to fully exploit the available data. Each step in cross-validation could be analyzed separately and the results combined afterwards. However, bootstrapping cannot compensate for small original samples! Even if the resampling process is run many thousand times, resamples only consist of the few data points available in the original sample. For this reason, a combination of cross-validation and bootstrapping has been applied in this thesis (see Figure 8.12):
• The complete dataset is randomly divided into ten groups for 10-fold cross-validation.
• Each group is used once as the test group:
– The remaining nine data groups form the training / validation dataset.
– A model is trained / validated.
– The resulting model is applied to the data of the test group.
– Model outputs are stored in a test result dataset.
• After performing this for all ten groups, the evaluation metric (statistic) is computed from the (combined) test result dataset.
• Bootstrapping is applied to the test results, which means that the test results are resampled 5000 times in order to yield BCa confidence intervals for the evaluation metric.
Ten-fold cross-validation as described above implies that ten complete modeling procedures have to be performed. However, the procedure can be adapted to the computing power available: the number of folds can be increased up to n, which would result in the jackknife method with subsequent bootstrapping. Note that the bootstrapping procedure only operates on the result of the training and testing procedure, which is incomparably less laborious than training 5000 models. In summary, the number of data points in the result dataset from which the statistic is estimated is always n, and the bootstrapping tries to compensate for the reduced number of trainings. Please also note that
• cross-validation simulates the variability in selecting the training and test data, whereas
• bootstrapping simulates the sampling process in order to mimic the sampling distribution;
although the two are related, they are not exactly the same.
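The combined procedure of Figure 8.12 can be sketched as follows; `train` and `apply_model` stand in for the expensive HSMM training and prediction steps and are assumptions of this sketch:

```python
import random

def cv_bootstrap(sequences, train, apply_model, statistic, m=10, B=5000, seed=0):
    """Combine m-fold cross-validation with bootstrapping (cf. Figure 8.12).

    train(data) -> model and apply_model(model, seq) -> per-sequence
    test result are placeholders for the expensive modeling steps;
    only the cheap statistic is recomputed on the B resamples.
    """
    rng = random.Random(seed)
    data = list(sequences)
    rng.shuffle(data)
    folds = [data[i::m] for i in range(m)]
    results = []                                   # combined test result dataset
    for i in range(m):
        rest = [s for j, f in enumerate(folds) if j != i for s in f]
        model = train(rest)                        # training / validation on m-1 groups
        results += [apply_model(model, s) for s in folds[i]]
    point = statistic(results)
    n = len(results)
    boot = [statistic([results[rng.randrange(n)] for _ in range(n)])
            for _ in range(B)]                     # resample test results only
    return point, boot
```

Only m models are trained, while the B resamples touch just the stored test results, which mirrors the cost argument made above.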
Figure 8.12: Cross-validation and bootstrapping. The dataset contains three failure sequences (hatched boxes at the top) and seven non-failure sequences (shaded
boxes at the top). All sequences are randomly divided into ten groups. Each
group is used once as test data set. For each test group the remaining nine
groups are used as training / validation dataset. After training / validation, sequences of the test group are fed into the model and results are stored in the
test result dataset. The evaluation statistic is computed at the end from all test
results. In order to estimate confidence intervals, bootstrapping with 5000 resamples is applied.
8.4.6 Confidence Intervals for Plots
The estimation procedure shown above can be applied directly to contingency table-based metrics such as precision, recall, etc. However, plots such as ROC or precision / recall curves have two equally important dimensions. Fawcett [95] discusses the topic extensively for ROC curves and proposes to compute confidence intervals in both directions by fixing the threshold (see Figure 8.13). The same concept applies to precision / recall curves. Confidence intervals are not investigated for accumulated runtime cost, since this graph depends on one specific excerpt of the data (times of predictions and failures are shown on the x-axis). As AUC integrates over all threshold values, no threshold-based averaging can be applied either. Instead, its confidence intervals can be computed by the bootstrapping procedure directly.
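Threshold-fixed averaging can be sketched as follows, assuming that each fold's ROC curve is sampled at the same threshold values; the per-threshold spread merely stands in for a properly estimated confidence interval:

```python
def average_roc(curves):
    """Average ROC curves at fixed thresholds, in both directions.

    curves: list of ROC curves (e.g., one per cross-validation fold);
    each curve is a list of (fpr, tpr) points, where the k-th point of
    every curve is produced by the same threshold value.
    """
    averaged = []
    for points in zip(*curves):            # group points sharing one threshold
        fprs = [p[0] for p in points]
        tprs = [p[1] for p in points]
        n = len(points)
        averaged.append({
            "fpr": sum(fprs) / n, "tpr": sum(tprs) / n,
            "fpr_spread": max(fprs) - min(fprs),   # basis for a horizontal interval
            "tpr_spread": max(tprs) - min(tprs),   # basis for a vertical interval
        })
    return averaged
```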
8.5 Summary
In this chapter, the process of evaluating failure prediction methods has been discussed. Starting from an evaluation of clustering results, which is only relevant for the approach in this dissertation, the subsequent sections discussed failure prediction metrics. There are two principal groups:
1. Contingency table-based measures such as precision, recall, F-measure, false positive and true positive rate, or accuracy. These measures evaluate binary decisions and hence depend on one specific decision threshold.
Figure 8.13: Averaging ROC curves. For each value of the threshold (A, B, C, D, and E), confidence intervals are computed separately for the false positive and true positive rate. The ROC curve is then plotted through the average values.
2. Plots that account for various thresholds. In this thesis, precision / recall plots, ROC plots, detection error trade-off (DET) plots, cost curves, and accumulated runtime cost graphs have been presented.
AUC has a special place: although obtained from ROC plots, it is a single value and does not depend on a threshold.
The subsequent topic addressed by this chapter has been a description of the evaluation process. Three topics have been discussed: greedy vs. non-greedy parameter optimization, the distinction between training, validation, and test data sets, and cross-validation.
Evaluating failure prediction by the use of data naturally raises the question of statistical confidence. Several approaches to confidence estimation have been discussed, and it has been argued why most of them cannot be applied to the case of online failure prediction. The discussion concluded that bootstrapping is applied in this thesis, and a combination of cross-validation and bootstrapping has been proposed. Finally, it has been described how confidence intervals can be generated for plots having two equally important variables.
Contributions of this chapter.
• To the best of our knowledge, the first comprehensive overview of failure prediction evaluation metrics has been presented.
• A novel evaluation plot —the accumulated runtime cost graph— has been introduced. In comparison to other evaluation techniques, the graph can reveal whether a predictor operates very well most of the time but fails for a short period, or whether false predictions are distributed equally over time. Furthermore, the graph allows comparing the cost incurred by a predictor with the cost for an oracle, perfect, or worst predictor, and with the cost for a system without failure prediction in place. However, these comparisons only hold for predictors of the same class. A further drawback is that the graph is highly sensitive to the assignment of cost to true and false positive and negative predictions.
• To the best of our knowledge, this thesis presents a novel combination of m-fold cross-validation and bootstrapping: the number of runs of the computationally much more expensive task of model training is reduced, and this is at least partly compensated by bootstrapping with a large number of resamples. Furthermore, this approach allows the limited amount of data to be fully exploited while taking advantage of state-of-the-art confidence interval estimation offered by the bootstrap.
Relation to other chapters. This chapter has been the first of the third phase of the engineering cycle, in which the modeling methodology is applied to industrial data of the real system. Having defined the measures for evaluation as well as the procedure by which these measures are obtained, the whole approach will be applied to real data of the industrial telecommunication system in the next chapter.
Chapter 9
Experiments and Results Based on Industrial Data
The failure prediction approach proposed in this dissertation has been applied to industrial data of a commercial telecommunication system. In this chapter, detailed results are provided. The chapter is organized along the process of modeling: starting with the introduction of the case study (Section 9.1) and data preprocessing (Section 9.2), properties of the data set are presented in Section 9.3, and training of HSMMs is discussed in Section 9.4. The resulting failure predictor is analyzed in detail (Section 9.5), and its dependence on various parameters is investigated in Sections 9.6 and 9.7. Furthermore, a comparative analysis is provided by applying several different prediction techniques to the same data.
Note that for readability reasons, in this chapter, the term “model” is not only used to denote the class of hidden semi-Markov models, but also a concrete HSMM parametrization; for example, a “model with 50 states” denotes an instance of an HSMM that has 50 states.
9.1 Description of the Case Study
Although the telecommunication system has been briefly introduced in Section 2.2, the description is repeated here for convenience. The main purpose of the telecommunication system is to realize a Service Control Point (SCP) in an Intelligent Network (IN), providing Service Control Functions (SCF) for communication-related management such as billing, number translation, or prepaid functionality. Services are offered for Mobile Originated Calls (MOC), Short Message Service (SMS), or General Packet Radio Service (GPRS). Service requests are transmitted to the system using various communication protocols such as Remote Authentication Dial In User Service (RADIUS), Signaling System Number 7 (SS7), or Internet Protocol (IP). Since the system is an SCP, it cooperates closely with other telecommunication systems in the Global System for Mobile Communication (GSM); however, it does not switch calls itself. The system is realized as a multi-tier architecture employing a component-based software design. At the time when measurements were taken, the system consisted of more than 1.6 million lines of code and approximately 200 components realized by more than 2000 classes, running simultaneously in several containers, each replicated for fault tolerance.
The specification for the telecommunication system requires that within successive, non-overlapping five-minute intervals, the fraction of calls having a response time longer than 250 ms must not exceed 0.01%. This definition is equivalent to a required four-nines interval service availability (c.f., Equation 2.1 on Page 13). Hence the failures predicted in this work are performance failures.
The setup from which the data has been collected is depicted in Figure 9.1. A call tracker kept track of request response times and logged each request that showed a response time exceeding 250 ms. Furthermore, the call tracker provided information in five-minute intervals on whether call availability dropped below 99.99%. More specifically, the exact time of failure has been determined as the first failed request that caused interval availability to drop below the threshold. The telecommunication system consisted of two nodes connected by a high-speed local network. Error logs have been collected separately from both nodes and have been combined into a system-wide logfile by merging both logs based on timestamps (the system runs with synchronized clocks), treating the system as a whole.
Figure 9.1: Experiment setup. Call response times have been tracked from outside the system in order to identify failures. The telecommunication system consisted of two
computing nodes from which error logs have been collected.
We had access to data collected on 200 non-consecutive days spanning a period of 273 days. The entire dataset consists of the error logs of two machines comprising 12,377,877 + 14,613,437 = 26,991,314 log records, including 1,560 failures of two types: the first type (885 instances) relates to GPRS and the second (675 instances) to SMS and MOC services. Due to limited human resources, only the first failure type has been investigated.
Some notes on the procedure. As has been stated in Section 8.3, there are two strategies for setting parameters: greedy and non-greedy. Obviously, the best parameter setting would be found by trying all combinations of parameters and evaluating them with respect to failure prediction. However, such an approach is not feasible, and a different approach has been taken for the experiments: as long as there is a reasonable way to set parameters directly based on some local criterion or observation, parameters are set by this heuristic. This implies that once a parameter has been set by a “local” criterion or heuristic, its effect on overall failure prediction quality is not checked later, and hence it cannot be determined whether even better prediction results may be achievable with the method. However, since the results achieved by this strictly forward approach are already convincing, there is no need to do so —at least from an engineering point of view. For this reason, the following sections go through the entire data preprocessing and modeling process from the start and investigate each step one after another.
The implementation of the HSMM approach has been accomplished by modifying the General Hidden Markov Model (GHMM) [179] library developed by the Algorithmics group led by Dr. Alexander Schliep at the Max Planck Institute for Molecular Genetics, Berlin, Germany. The GHMM library and hence its modifications are written in C, wrapped by Python classes which in turn are controlled by shell scripts. Clustering, evaluation, and plotting have been performed using the R statistical language (see, e.g., Dalgaard [74]).
9.2 Data Preprocessing
As explained in Chapter 2, modeling first involves data preprocessing, which consists of
several steps. The following investigations will explain and analyze each step separately
in the order they have been performed on the data.
9.2.1 Making Logfiles Machine-Processable
System logfiles contain events of all architectural layers above the cluster management
layer including 55 different, partially non-numeric variables. Figure 9.2 shows one
(anonymized) log record consisting of three lines in the error log.

2004/04/09-19:26:13.634089-29846-00010-LIB_ABC_USER-AGOMP#020200034000060|
020101044430000|000000000000-020234f43301e000-2.0.1|020200003200060|00000001
2004/04/09-19:26:13.634089-29846-00010-LIB_ABC_USER-NOT: src=ERROR_APPLICATION
sev=SEVERITY_MINOR id=020d02222083730a
2004/04/09-19:26:13.634089-29846-00010-LIB_ABC_USER-unknown nature of address
value specified

Figure 9.2: Typical error log record consisting of three log lines (anonymized).

In order to obtain machine-readable logfiles, many steps had to be performed, and the tremendous effort
by Steffen Tschirpke, who has done most of the programming for these steps, should be
acknowledged at this point.
The major steps of logfile preprocessing include:
1. Eliminating logfile rotation. Many large systems perform logfile rotation, which means that logfiles are limited either in size or time span (or both), and once a logfile has reached the limit, logging is redirected to the next file. After logging to the n-th file, logging starts from the first file again in a ring-buffer fashion. This behavior led to duplicated log messages. The data has been reorganized to form one large, chronologically ordered logfile for each computing node.
2. Identifying borders between messages. While error messages “travel” through various modules and levels of the system, more and more information is accumulated until the resulting log record is written into the logfile. In our case, various delimiters between the pieces of information were used, and one log record could even span several lines in the logfile, sometimes quoting the error message several times. For this reason, the logfile had to be parsed in order to generate a log where each line corresponds to one log record, to employ a unique delimiter, and to assign pieces of information to fixed positions (columns) within the line.
3. Converting time. Timestamps in the original logfiles were tailored to be “processed” by humans and were of the form 2004/04/09-19:26:13.634089, stating that the log message occurred at 7 pm, 26 minutes and 13.634089 seconds on April 9th in the year 2004. In order to be able to, e.g., compute the length of the time interval between two successive error messages, time had to be transformed into a format that can be processed by computers. Real-valued UTC has been used for this purpose, which roughly corresponds to seconds since January 1st, 1970.
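Step 3 can be sketched in a few lines (interpreting the original timestamps as UTC is an assumption of this sketch):

```python
from datetime import datetime, timezone

def log_time_to_utc_seconds(stamp):
    """Convert a human-readable log timestamp to real-valued UTC seconds.

    The format follows the example in the text
    (2004/04/09-19:26:13.634089); treating the wall-clock time as UTC
    is an assumption made for illustration.
    """
    dt = datetime.strptime(stamp, "%Y/%m/%d-%H:%M:%S.%f")
    return dt.replace(tzinfo=timezone.utc).timestamp()
```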
9.2.2 Error-ID Assignment
After preprocessing, the next step involved the assignment of an error ID to each message as described in Section 5.1.1. In the case of the telecommunication data, there were originally 1,695,160 different log messages. By replacing numbers, etc., the number of different messages has been reduced to 12,533. By applying the Levenshtein distance metric to each pair (resulting in 157,063,556 distances), the log messages could be assigned to 1,435 groups using a constant similarity threshold. Table 9.1 summarizes the numbers.
Data                      No. of different messages   Reduction in %
Original                  1,695,160                   n/a
Without numbers           12,533                      99.26%
Levenshtein clustering    1,435                       88.55% / 99.92% (original)

Table 9.1: Number of different log messages in the original data, after substitution of numbers by placeholders, and after clustering by the Levenshtein distance metric.
In principle, the task of message grouping is a clustering problem. However, grouping 12,533 data points using a full-blown clustering algorithm is a considerably complex task. Furthermore, the application of such complex algorithms is not necessary. Figure 9.3 provides a plot where the gray value of each point indicates the distance of the corresponding message pair. Except for a few blocks in the middle of the plot, there are dark steps along the main descending diagonal, and the rest of the plot is rather light-colored. The plot has been created by putting messages next to each other if their Levenshtein distance was below some fixed threshold. Since plotting similarities is not possible for all messages, Figure 9.3 has been generated from a subset of the data. The figure indicates that strong similarity is only present within groups of log messages and not across message types. Hence a rather robust grouping can be achieved by one of the simplest clustering methods: grouping by a threshold on dissimilarity. The reason why this simple method works rather robustly is that (after replacement of numbers by a placeholder) messages with more or less the same text agree in most parts, and other messages are significantly different.
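Grouping by a dissimilarity threshold can be sketched as follows; the greedy first-fit assignment and the example threshold are assumptions of this sketch, not necessarily the exact rule used in the thesis:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def group_messages(messages, threshold):
    """Assign each message to the first group whose representative is
    closer than the threshold; otherwise found a new group."""
    groups = []
    for msg in messages:
        for g in groups:
            if levenshtein(msg, g[0]) < threshold:
                g.append(msg)
                break
        else:
            groups.append([msg])
    return groups
```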
Note that each error message type corresponds to one error symbol (indicated by A, B,
or C in previous chapters). Together with the number of failure types (which are at most two in our case study), the number of different error messages defines the size of the HSMM alphabet. Therefore, experiments in this case study had an alphabet of size 1,436 (1,435 errors plus one failure), since only one failure type has been investigated at a time.

Figure 9.3: Levenshtein similarity plot for a subset of message types. Each point represents the Levenshtein distance of one pair of error messages. Dark dots indicate similar messages (small distance), while lighter dots indicate a larger Levenshtein distance. Messages have been arranged such that sequences are next to each other if their Levenshtein distance is below some fixed threshold.
Please also note that the memory consumption of the observation symbol matrix B is determined by the number of states times the size of the alphabet. For these reasons, reducing the number of error messages is an important step in the failure prediction approach described in this thesis.
9.2.3 Tupling
As described in Section 5.1.2, tupling is a technique that combines several occurrences of
the same event in order to account for multiple reporting of the same problem. In order to
determine the optimal time window size ε, the heuristic shown in Figure 5.3 on Page 78
has been applied to the data. The size of the optimal time window is identified graphically
by plotting the number of resulting tuples over various values for ε. Figure 9.4 shows the
plot for a subset of one million log records for the cluster logfile (which has been obtained
by merging the error logs of both machines). The graph strongly supports the claim of Iyer &
Rosetti that a change point value for ε can be identified above which the number of tuples
decreases much more slowly. According to the heuristic, ε is chosen slightly above the change
point.
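A minimal sketch of the tupling operation and the ε-sweep underlying the heuristic; the per-type event layout is an assumption for illustration:

```python
def count_tuples(events_by_type, eps):
    """Merge occurrences of the same event type that follow each other
    within eps seconds and return the resulting number of tuples.
    events_by_type maps an event type to its sorted timestamps."""
    tuples = 0
    for times in events_by_type.values():
        last = None
        for t in times:
            if last is None or t - last > eps:
                tuples += 1  # a gap larger than eps starts a new tuple
            last = t
    return tuples

# Sweeping eps and plotting count_tuples(...) against it reproduces the
# kind of curve shown in Figure 9.4; eps is then chosen slightly above
# the change point where the curve flattens.
```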
In order to show that properties related to tupling do not change by merging the two
error logs, tupling analysis has also been performed for each machine error log separately.
As shown in Figure 9.5, the change point for both machines occurs at roughly the same
point. The most striking difference between Figure 9.5 and Figure 9.4 is that the number
of resulting tuples is smaller for single machine logfiles. This can be traced back to
the merging process: Tupling only lumps bursts of the same message.

Figure 9.4: Effect of tupling window size for the cluster-wide logfile. The graph shows the
resulting number of tuples depending on the tupling time window size ε (in seconds).

If a different message from the second machine is woven into the burst, the burst results in at least two
separate tuples. However, the main point of the analysis is that a change point exists, and,
furthermore, that it occurs roughly at the same value for ε in single machine logfiles.
Based on this analysis, a value of ε = 0.015s has been used for experiments.
9.2.4 Extracting Sequences
After tupling, sequences are extracted from error logs (c.f., Section 5.1.3 and especially
Figure 5.4 on Page 79). In order to decide whether a sequence is a failure sequence or not,
the failure log, which has been written by the call tracker, has been analyzed to extract
timestamps and types of failure occurrences. Three time intervals determine the process
of sequence extraction:
1. Lead-time ∆tl . If not specified explicitly, a lead-time of five minutes has been used,
although it is shown in Section 9.6.1 that prediction performance is comparably
good even for longer lead-times. However, since the lead-time experiments have been
carried out relatively late, previous experiments have not been repeated and
results are reported for a lead-time of five minutes. For large and complex computer
systems, it is assumed that proactive fault handling actions such as restart, garbage
collection, or checkpointing can be performed within five minutes, i.e., the warning time ∆tw is shorter than five minutes.
2. Data window size ∆td . Analyses presented in the next section are based on a data
window size of five minutes. An explicit analysis of ∆td is carried out in Section 9.6.2.
3. Margins for non-failure sequences ∆tm . This value is used to determine time intervals when no failure is imminent in the system. Since it cannot be measured
whether the system really is fault-free, a value of 20 minutes has been chosen arbitrarily. According to an analysis of failure data, it has been observed that failures
often occur in bursts, which are interpreted to be caused by the same instability
(fault). Employing a margin of 20 minutes seems to yield a stable separation. For
other systems that show long-range failure behavior (e.g., in the order of hours),
this value might be too small.

Figure 9.5: Effect of tupling window size for each individual machine.
Non-failure sequences have been generated using overlapping time windows, which simulates the case that failure prediction is performed each time an error occurs.
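The extraction of a single failure sequence from the three intervals above can be sketched as follows (timestamps in seconds; the event-list layout is an assumption):

```python
def extract_failure_sequence(events, t_failure, lead=300.0, window=300.0):
    """Collect the error symbols inside the data window of length
    `window` that ends `lead` seconds (the lead-time) before the
    failure. `events` is a time-sorted list of (timestamp, symbol)."""
    end = t_failure - lead      # lead-time Delta t_l
    start = end - window        # data window Delta t_d
    return [sym for t, sym in events if start <= t < end]
```

For non-failure sequences, the same windowing is slid over failure-free regions (respecting the margin ∆tm), with overlapping windows as described above.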
9.2.5 Grouping (Clustering) of Failure Sequences
The goal of failure sequence clustering is to identify failure mechanisms contained in the
training data set (c.f., Section 5.2). The approach builds on ergodic (fully connected)
HSMMs to determine the dissimilarity matrix that is subsequently analyzed by a clustering method. Clustering has been performed using the cluster library of the statistical
programming language R.1
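One way to turn cross log-likelihoods of per-sequence HSMMs into a symmetric dissimilarity matrix is sketched below. The symmetrized form is an illustrative assumption, not necessarily the exact formula of Section 5.2:

```python
def dissimilarity_matrix(loglik):
    """Build a symmetric dissimilarity matrix from cross
    log-likelihoods: loglik[i][j] is the log-likelihood of sequence j
    under the HSMM trained on sequence i. A sequence is dissimilar to
    another if each model rates the other's sequence much worse than
    its own training sequence."""
    n = len(loglik)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d[i][j] = 0.5 * ((loglik[i][i] - loglik[i][j]) +
                             (loglik[j][j] - loglik[j][i]))
    return d
```

The resulting matrix can be handed to a hierarchical clustering routine such as `agnes` or `diana` from R's cluster library.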
The approach involves several parameters, such as the size of the HSMMs. This section
explores their influence on sequence clustering. In order to do so, many
combinations of parameters have been tried. Although it is not possible to include all
plots here, key results are presented and visualized by selected plots. To support clarity
of the plots, a data excerpt of five successive days including 40 failure sequences has
been used.
The HSMMs used to compute sequence likelihoods had a topology as shown in Figure 9.6 and used exponential duration distributions mixed with a uniform background.
Figure 9.6: Topology of HSMMs used for computation of the dissimilarity matrix. The model
shown here has five states and an additional absorbing failure state
Results are presented for one failure type only. However, conclusions drawn from the
analysis also apply to the second failure type.
Clustering method. As explained in Section 5.2, several hierarchical clustering methods exist. In this thesis, one divisive and four agglomerative approaches have been applied
to the same data: The DIANA algorithm described in Kaufman & Rousseeuw [142] for
divisive clustering and agglomerative clustering using single linkage, average linkage,
complete linkage, and Ward’s procedure. The agglomerative clustering method is called
“AGNES”; hence this name is also used in the plots to indicate agglomerative clustering.

1 See http://www.r-project.org, or Dalgaard [74].
Figure 9.7 shows banner plots (c.f., Section 8.1.2) for all methods using a dissimilarity
matrix that has been generated using an HSMM with 20 states and a background level of
0.25. As will be shown next, the choice of the number of states and of the background level
has only very little impact on clustering results. Therefore, results look very similar if
the clustering methods are applied to dissimilarity matrices computed with another model
configuration. The plotting software could not include sequence labels on the y-axis in
the plots. However, checking the grouping by hand for some instances yielded similar
groupings.
Considering single linkage clustering first (second row, left), the typical chaining effect
can be observed. Since single linkage merges two clusters if they get close at only one
point, elongated clusters result. Although beneficial for some applications, this behavior
does not result in a good separation of failure sequences, yielding an agglomerative
coefficient of only 0.45. Hence, single linkage is not appropriate for this purpose.
Complete linkage (first row, right) performs better, resulting in a clear separation of
two groups and an agglomerative coefficient of 0.72. Not surprisingly, average linkage
(first row, left) resembles some mixture of single and complete linkage clustering. The
result is not convincing, with two single sequences left over. As was the case for complete
linkage, it cannot be clearly stated how many groups are in the data. Hence average
linkage also does not seem appropriate.
Divisive clustering (bottom row, left) divides the data into three groups at the beginning
but does not look consistent since groups are split up further rather quickly. The resulting
divisive coefficient is 0.69. Finally, agglomerative clustering using Ward’s method
(second row, right) results in the clearest separation, achieving an agglomerative coefficient of 0.85.
Considering other parameter settings, the picture is always the same: single linkage
fails and Ward’s method results in the clearest separation. For this reason, Ward’s method
is considered to be the most robust and most appropriate for failure sequence clustering
and has been used in all further experiments conducted in this dissertation. Nevertheless, there are other parameters to failure sequence clustering, such as the number of states of the
HSMMs, which are investigated in the following.
Number of states. Since it is not clear a priori how many states the HSMMs should
have, experiments have been conducted with model sizes ranging from five to 50 states.
Results for clustering using Ward’s procedure are shown in Figure 9.8. It can be observed
from the figure that the order in which clusters are merged is very similar for 20, 35, and
50 states, but is different for five states. Although not provable, the effect might be attributed
to the number of the model’s transitions. Let N denote the number of states (not counting
the absorbing failure state F); then the number of transitions equals N · (N − 1) + N = N².
Considering the empirical cumulative distribution function (ECDF) of the length of failure
sequences (c.f., Figure 9.18-b on Page 197), it can be observed that for N = 5 (i.e., 25
transitions) more than 60% of failure sequences have more symbols than there are transitions
in the model, whereas for N = 20 (i.e., 400 transitions) there is no failure sequence for which
this is the case. Although the number of transitions is not directly proportional to a model’s
recognition ability, it gives an indication. Note that the ergodic models used here can in
principle recognize sequences of arbitrary length, but if transitions have to be “reused”,
probabilities get blurred and the model loses discriminative power. Similar observations can
be made if clustering methods other than Ward’s procedure are used.

[Figure 9.7 banner plots: “agnes average” (agglomerative coefficient 0.57), “agnes complete”
(0.72), “agnes single” (0.45), “agnes ward” (0.85), and “diana standard” (divisive coefficient
0.69), all computed from the same dissimilarity matrix (20-state HSMM, background weight 0.25).]

Figure 9.7: Effect of clustering methods. Five different clustering methods are applied to the
same dissimilarity matrix, which has been generated by a 20-state HSMM with
0.25 background weight. The agglomerative clustering algorithm is called “agnes”
and the divisive algorithm “diana”. For agglomerative clustering, average linkage,
complete linkage, single linkage and Ward’s procedure have been used.
As a rule of thumb, the number of states for HSMMs used for failure sequence clustering
should be chosen such that N > √L for the majority of failure sequences, where
N denotes the number of states and L the length of the sequence.
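The rule of thumb amounts to the following check (a direct transcription for illustration, not part of the original implementation):

```python
import math

def min_states(seq_len):
    """Smallest ergodic-HSMM size N whose number of transitions
    N * (N - 1) + N = N**2 is at least the sequence length L,
    i.e. the smallest N with N >= sqrt(L)."""
    return math.ceil(math.sqrt(seq_len))
```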
Weight of background distributions. It has already been mentioned in Section 6.6 that
background distributions must be used with HSMMs since observation probabilities for
errors that do not occur in the (single) training sequence are set to zero by the Baum-Welch
training algorithm. Hence each failure sequence that contains at least one error message
not contained in the training sequence would receive a sequence likelihood of zero (or
−∞ in the case of log-likelihood) and no useful dissimilarity matrix would be obtained.
Using background distributions, a small probability is assigned to all observation symbols
resulting in non-zero sequence likelihoods. In the experiments, a uniform distribution of
all error symbols occurring in the entire set of failure sequences has been used. The effect
of background distributions on sequence clustering has been investigated by varying the
background distribution weighting factor ρi , which has been equal for all states i of the
HSMM (c.f., Equation 6.63 on Page 112). Figure 9.9 shows results for clustering with a
HSMM with 20 states using Ward’s method.
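The convex combination of a trained observation distribution with a uniform background (cf. Equation 6.63) can be sketched as:

```python
def mix_with_background(b_row, rho):
    """Mix one row of the observation matrix B with a uniform background
    over the M alphabet symbols: b'(k) = (1 - rho) * b(k) + rho / M.
    This guarantees a non-zero probability for every symbol, so no
    sequence receives a log-likelihood of -infinity."""
    m = len(b_row)
    return [(1.0 - rho) * p + rho / m for p in b_row]
```

With ρ = 0.1 and four symbols, a symbol with trained probability zero still receives probability 0.025.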
As can be seen from the plots, varying the background weight only slightly affects the
grouping. In fact, with increasing background weight more “chaining effects” can be observed and the agglomerative coefficient decreases. The explanation for this behavior
is that the single-sequence HSMMs become “more equal” with increasing ρi due to the
fact that the uniform background distribution supersedes the specialized output probabilities obtained from training. The more similar the models, the more equal are the sequence
likelihoods, resulting in less structure in the dissimilarity matrix. Nevertheless, all background values result in a grouping that is similar to the ones obtained by the majority
of clustering approaches. Analysis is based on Ward’s procedure here, but the same effect can be observed for other clustering methods as well. For some of the procedures,
clustering is affected if the background distribution weight gets too large. A plot for a
background weight of zero has not been included since it could not be used for clustering
due to sequence log-likelihoods of −∞. Hence, the conclusion from this analysis is that
the background weight does not have much influence on clustering but should neither be too
small nor too large. For the case study, a value of 0.1 has been used.
Summary of failure sequence grouping. From the experiments the following conclusions (regarding failure sequence clustering) can be drawn:
• Agglomerative clustering using Ward’s procedure yields the most robust and most
clear grouping
• The number of states of the HSMMs used to compute sequence likelihoods is not
critical; however, it should be chosen such that the number of transitions is larger
than the number of error symbols of the majority of failure sequences, hence the
number of states should be roughly equal to √L.
[Figure 9.8 banner plots: “agnes ward” applied to dissimilarity matrices from HSMMs with
5, 20, 35, and 50 states (background weight 0.05); the agglomerative coefficient is 0.89 in
all four cases.]
Figure 9.8: Effect of number of states. The plots show clustering results using agglomerative
clustering with Ward’s procedure for dissimilarity matrices computed by HSMMs
with 5, 20, 35, and 50 hidden states.
[Figure 9.9 banner plots: “agnes ward”, 20 states, with background weights 0.05, 0.25, and
0.45; agglomerative coefficients 0.89, 0.85, and 0.82, respectively.]
Figure 9.9: Effect of background distribution weight. One HSMM with 20 states has been
trained and dissimilarity matrices have been computed using three different values
of the background distribution weight ρi (denoted by “bg” in the plots). The banner
plots show results of agglomerative clustering using Ward’s procedure.
• Background distributions are necessary in order to obtain useful dissimilarity matrices, but the actual weight value is not critical. A value of 0.1 is used in the case
study.
9.2.6 Noise Filtering
The goal of the statistical test involved in noise filtering (c.f., Section 5.3) is to eliminate
error messages that are not indicative of failure sequences. The idea is to consider
only error messages that occur significantly more frequently in the failure sequences than
the expected number of occurrences in a given time frame would suggest. The decision
is based on a testing variable Xi (c.f., Equation 5.6 on Page 85), which involves the prior
probability p̂0i .
As described in Section 5.3, three variants exist to compute priors p̂0i :
1. p̂0i are estimated separately for each group of failure sequences.
2. p̂0i are estimated from all failure sequences —irrespective of the groups.
3. p̂0i are estimated from all sequences, containing failure and non-failure sequences.
Noise filtering has been implemented such that Xi values are stored for each symbol in
order to allow for filtering with various thresholds c. Experiments have been performed
on the dataset used previously for clustering analysis and six non-overlapping filtering
time windows of length 50 seconds have been analyzed.
Figures 9.10-9.12 show bar plots of Xi values for each symbol and time window. Figure 9.10 has been generated using group-based priors, Figure 9.11 using failure-sequence-based priors, and Figure 9.12 using a prior computed from the entire training dataset. Each
figure shows two plots: one for each group of failure sequences. The three figures are ordered by specificity of the priors: the group-wise prior is computed from the failure symbols
themselves (but without windowing), resulting in rather small values of Xi since the distribution
of failures in the time window is very close to the expected distribution. More general
priors result in larger values of Xi , as can be seen in Figures 9.10, 9.11, and 9.12.3
Regarding Figure 9.10, it can be observed that the distribution of symbols depends
on time before failure. The prior has been computed without time windows which can
be seen as the average over the entire length of failure sequences. Xi values mark the
difference to the prior for each time window. The figure shows that deviation from priors
is different for each window. This is an important finding: it is further evidence for one
of the principal assumptions of this thesis, namely that timing information
—at least time-before-failure— cannot be neglected in online failure prediction. Incidentally, Figure 9.10 supports the second principle mentioned by Levy & Chillarege in [162],
stating that the mix of errors changes prior to a failure.
Due to the fact that the prior is computed for each group separately, the sum of Xi
values over all time windows should be equal to zero. Although this is the case for most
of the symbols, some violate this equality. The explanation for this is that sequences of
length up to 300 seconds have been used, but only time windows up to 250s have been
plotted for readability reasons.
3 Note that y-axes have been scaled to fit all Xi values.
[Figure 9.10 bar plots: Xi values for group 1 and group 2 with group-based priors; filtering
interval centers range from −225 to −25 seconds before failure.]
Figure 9.10: Values of Xi for noise filtering with a prior computed from each cluster of failure
sequences. The upper plot is for the first group of failure sequences and the
lower for the second group. Within each plot, each group corresponds to one
time window. Within each group, each bar corresponds to one error symbol and
the y-axis displays the value of the testing variable Xi . Numbers below each
group denote the center of the time interval in seconds before failure occurrence.
[Figure 9.11 bar plots: Xi values for group 1 and group 2 with failure-sequence-based priors;
filtering interval centers range from −225 to −25 seconds before failure.]
Figure 9.11: Values of Xi for noise filtering with a prior computed from failure sequences.
[Figure 9.12 bar plots: Xi values for group 1 and group 2 with a prior computed from all
sequences; filtering interval centers range from −225 to −25 seconds before failure.]
Figure 9.12: Values of Xi for noise filtering with a prior computed from all sequences.
Regarding Figure 9.11, it can be observed that the distributions of Xi values are quite
different in the two groups. This is due to the fact that the prior has been computed from
all failure sequences (regardless of the group), which can be interpreted as an indication
that failure sequence grouping supports failure pattern recognition, since separate models
can be trained that are tailored towards the distributions in each group.
The third figure (Figure 9.12), which is based on a prior from failure and non-failure
sequences, supports the third principle described by Levy & Chillarege in [162], called
“clusters form early”: it can be observed, especially in the lower plot, that a few error symbols heavily exceed their expected number of occurrences. Furthermore, the effect becomes stronger
the closer the time window is to the occurrence of failures (the further right in the plot).
In order to investigate the effect of filtering on sequences, the number of symbols
within each sequence has been analyzed. Figure 9.13 plots the average number of symbols
in one group of failure sequences after filtering out all symbols with Xi < c for various
values of c. Again, all three types of priors have been investigated.
Considering first the “global” prior computed from all sequences (solid line), the resulting curve can be characterized as follows: for very small thresholds, all symbols pass the
filter and the average number of symbols in failure sequences equals the average number
without filtering. At some value of c, the length of sequences starts dropping quickly until
some point where sequence lengths stabilize for some range of c. With further increasing
c, the average sequence length drops again until finally not a single symbol passes the filter.
The supposed explanation for this behavior is that the first drop results from filtering
real noise. The middle plateau indicates some sort of a “gap”, which may result from some
significant difference in the data: this is the filtering range where error symbols relevant
to failure sequences still get through but background noise is eliminated. At some point c
becomes too large even for relevant error symbols to get through and the average number
of symbols in failure sequences drops to zero (the plateau around c = 40 is interpreted to
result from outliers).

Figure 9.13: For each filtering threshold value c, mean sequence length has been plotted.
The solid line shows values for a prior computed from all sequences, the dashed
line for a prior computed from all failure sequences and the dotted line for priors
computed individually for each group/cluster of failure sequences.
Comparing the “global” prior with the two other priors, it can be observed that the
curve for a cluster-based prior drops most quickly and the curve for the “global” prior drops
most slowly. The reason for this is again the specificity of the priors. For the more specific
priors, plateaus are at least not as obvious as for the global prior.
Summary of noise filtering. From this analysis it follows that a “global” prior computed
from all sequences (failure and non-failure) seems most appropriate. Therefore, further
experiments are based on data filtered using such a prior. Similar to the tupling heuristic
proposed by Tsao & Siewiorek [258], the filtering threshold c has been chosen such that
it is slightly above the beginning of the middle plateau.
9.3 Properties of the Preprocessed Dataset
Before going into the details of the modeling process, the preprocessed data has been analyzed. Later sections will then refer back to the properties described here. Additionally,
data analysis helps to better understand the system under investigation and may also help
others to judge whether results presented in this thesis can be transferred to their systems.
9.3.1 Error Frequency
One of the most straightforward methods for online failure prediction is to look at the
frequency of error occurrence and to warn about an upcoming failure once the frequency
starts to rise significantly. However, as Figure 9.14 shows, such a simple approach is not
effective when applied to the commercial telecommunication system.
Figure 9.14: Number of errors per five minutes in preprocessed data. Diamonds indicate
the occurrence of failures.
More specifically, the figure shows the number of errors per five-minute time interval.
The plot has been generated from data obtained after tupling. As can be seen from the plot,
the number of log records varies greatly, ranging from zero to 153 log records within five
minutes. Note that Figure 9.14 is only an excerpt of the data. The peak value observed
in the data of five successive days (the same data that has also been used in the previous
analyses) even reaches 267 log records within five minutes. Performing the same analysis
with time intervals of length one second reveals that there are up to eight messages per
second.
The figure shows that a straightforward counting method would not work well since
the pure number of errors seems quite unrelated to the occurrence of failures: Failures
occur at times with many and with few errors per time interval, and in sections where the
number of errors increases as well as decreases. There are time intervals with heavy error
reporting but only a few failures and time intervals with few errors but several failures.
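The counting underlying Figure 9.14 can be sketched as:

```python
from collections import Counter

def errors_per_window(timestamps, width=300.0):
    """Bucket error timestamps (in seconds) into windows of `width`
    seconds and return the number of errors per bucket index."""
    return Counter(int(t // width) for t in timestamps)
```

A frequency-threshold predictor would warn whenever a bucket count rises sharply; as argued above, this fails here because error counts and failures are largely unrelated.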
9.3.2 Distribution of Delays
The model for online failure prediction proposed in this dissertation builds on the timing
between error occurrences and hence uses probability distributions to handle the time between
successive errors (delays). This section provides an analysis of delays in error sequences.
The theory of HSMMs allows defining a unique convex combination of distributions for
each transition. However, it is not possible to determine upfront which transition should
have what type of distribution, and it is not practical for real applications. Therefore,
the same combination of distributions has been used for all transitions: each transition,
for example, consists of a convex combination of an exponential and a uniform distribution.
Note that this does not imply that the distributions are equal: the parameters of the distributions (e.g., the rate λ of exponential distributions, the combining weight, etc.) are initialized
randomly and then further adjusted by the Baum-Welch algorithm.
In order to get a picture of delay distributions, delays occurring in the entire dataset
have been analyzed. More precisely, a histogram and quantile-quantile-plots (QQ-plots)
are provided in Figure 9.15.
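A QQ-plot pairs empirical quantiles of the data with theoretical quantiles of a candidate distribution; a minimal sketch (the simple plotting-position estimator below is an illustrative choice):

```python
def qq_points(data, quantile_fn, n=20):
    """Pair empirical quantiles of `data` with theoretical quantiles
    obtained from `quantile_fn` (the inverse CDF of the candidate
    distribution). Points close to the diagonal indicate a good fit."""
    xs = sorted(data)
    pts = []
    for k in range(1, n + 1):
        p = (k - 0.5) / n
        emp = xs[min(int(p * len(xs)), len(xs) - 1)]
        pts.append((quantile_fn(p), emp))
    return pts
```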
The dataset used for analysis comprised 24,787 delays spanning a range from zero4 to
29.39 seconds with a mean of 1.404 seconds. The histogram shown at the top left of the
figure plots relative frequency of delays with a resolution of 1 second. The distribution of
data seems to resemble an exponential distribution except for the peak at 12-13 seconds.
It might be supposed that the peak results from some outliers. However, 1,048 delays
fall into this category, and hence the peak more likely results from some system-inherent
property. In order to further investigate which parametric distribution fits the data best,
QQ-plots have been generated, plotting quantiles of the observed delay distribution against
the parametric ones: the normal distribution (middle row, left) obviously fits very badly.
This is due to its property that the distribution can take on negative values, which is
inappropriate for delays. Exponential (top row, right) and lognormal (middle row, right)
fit much better. However, both distributions show a quite bad match for higher quantiles.
As HSMMs provide the possibility to mix distributions, the exponential and lognormal
distributions have been mixed with a uniform distribution, resulting in an improved fit
(except for very large delays), with the exponential being slightly better than the lognormal. However, further investigations have revealed that very long delays (> 12 s) occur
only in 0.41% of all cases, and a worse fit for such long delays can be accepted.
Based on this analysis, experiments have been performed using a convex combination
of exponential and uniform distribution.
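The resulting delay model, a convex combination of exponential and uniform densities, is then (parameter values below are purely illustrative):

```python
import math

def mixed_delay_density(t, lam, weight, t_max):
    """Density of a convex combination of an exponential distribution
    (rate lam) and a uniform distribution on [0, t_max]; `weight` is
    the uniform (background) share of the mixture."""
    uniform = 1.0 / t_max if 0.0 <= t <= t_max else 0.0
    expo = lam * math.exp(-lam * t) if t >= 0.0 else 0.0
    return (1.0 - weight) * expo + weight * uniform
```

In the HSMM, the rates and weights are per-transition parameters re-estimated by Baum-Welch rather than fixed by hand.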
9.3.3 Distribution of Failures
Assumptions on the distribution of failures (with respect to their time of occurrence) are
used in various areas of dependable computing research. For example, preventive maintenance, reliability engineering, and reliability modeling make use of them. As has been
described in Chapter 3, there are also online failure prediction approaches exploiting the
time of failure occurrence. Therefore, an analysis of the distribution of time-between-failures
(TBF) has been performed. However, since failures are not as common as errors,
the entire dataset of 200 days has been analyzed. More precisely, the dataset consisted of
885 timestamps of failures of one type. Figure 9.16 summarizes the results.
Similar to the analysis of inter-error-delays, a histogram is provided at the top left
of the figure. Note that the histogram might not fully represent reality for the first two
slots since failures occurring earlier than 20 minutes after a previous failure have been
considered as related to the previous one and have been eliminated from the dataset during
4 A delay of zero means that two log records occur with the same timestamp in the log. Technically, this
means that the two records have a delay lower than the minimum time resolution of the system, which is
about a millisecond in the telecommunication system.
Figure 9.15: Histogram and QQ-diagrams of delays between errors. QQ-plots plot the distribution of delays observed in the dataset versus several parametric distributions:
exponential, normal, log-normal, exponential mixed with uniform and log-normal
mixed with uniform. The straight line indicates a perfect match of quantiles. Parameters of parametric distributions have been estimated from the data (e.g.,
mean of the normal distribution has been set to the mean of the data)
Figure 9.16: Analysis of time-between-failures (TBF). The top left plot shows a histogram.
The five other graphs plot quantiles of the observed data against quantiles of
exponential, normal, log-normal, gamma, and Weibull distribution (QQ-plots).
data preprocessing. In addition to the histogram, QQ-plots are provided for the distributions most frequently used in reliability theory. Parameters for the gamma and Weibull distributions have been estimated by maximum likelihood. The interesting observation here is that the frequently used exponential distribution yields a relatively bad fit. Other frequently used distributions such as the gamma or Weibull do not really fit the data either. The best approximation is obtained by a log-normal distribution.
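The quantile comparison underlying these QQ-plots can be sketched numerically. The delays below are synthetic stand-ins (the industrial dataset is not reproduced here), and only the exponential and log-normal fits are compared, both via their closed-form maximum-likelihood estimates:

```python
import math
import random
from statistics import NormalDist, mean, stdev

# Synthetic stand-in for the observed inter-error delays: log-normally
# distributed delays in minutes (parameters chosen for illustration only).
random.seed(42)
delays = sorted(math.exp(random.gauss(3.0, 1.2)) for _ in range(5000))

# Closed-form maximum-likelihood fits, as in the figure caption.
lam = 1.0 / mean(delays)                      # exponential rate
logs = [math.log(d) for d in delays]
mu, sigma = mean(logs), stdev(logs)           # log-normal parameters

def q_expon(p):
    return -math.log(1.0 - p) / lam

def q_lognorm(p):
    return math.exp(mu + sigma * NormalDist().inv_cdf(p))

# Numeric analogue of a QQ-plot: compare empirical quantiles against each
# fitted distribution; a smaller discrepancy means a closer match.
probs = [i / 100 for i in range(5, 91)]
data_q = [delays[int(p * len(delays))] for p in probs]

err_expon = sum((d - q_expon(p)) ** 2 for d, p in zip(data_q, probs))
err_lognorm = sum((d - q_lognorm(p)) ** 2 for d, p in zip(data_q, probs))
# As in Figure 9.15, the log-normal quantiles track the data far more closely.
```

Plotting `data_q` against the two fitted quantile functions reproduces the QQ-plot view: points near the diagonal indicate a good fit.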
Results of a second analysis are provided in Figure 9.17. In order to investigate whether some periodicity is present in the data, the normalized autocorrelation of failure occurrence has been plotted. More specifically, the data has been divided into buckets of five-minute intervals and the autocorrelation has been computed for lags of up to 240 minutes. The observation is that there is almost no periodicity in failure occurrence, which is the reason why periodic prediction does not work for the case study (see Section 9.9.4).
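The bucketing-and-autocorrelation procedure can be sketched as follows; the failure timestamps are randomly generated stand-ins, so the near-zero coefficients merely illustrate what "no periodicity" looks like:

```python
import random

# Hypothetical reconstruction of the periodicity check: failure timestamps
# (in minutes over a 31.5-day run, loosely matching the 232 failures mentioned
# later) are grouped into five-minute buckets, and the normalized
# autocorrelation of the bucket counts is computed for lags up to 240 minutes.
random.seed(1)
minutes_total = 45_360                       # 31.5 days
failure_times = sorted(random.uniform(0, minutes_total) for _ in range(232))

bucket = 5                                   # bucket width in minutes
counts = [0] * (minutes_total // bucket)
for t in failure_times:
    counts[int(t // bucket)] += 1

def autocorr(x, lag):
    """Normalized autocorrelation of x at the given lag (in buckets)."""
    m = sum(x) / len(x)
    var = sum((v - m) ** 2 for v in x)
    cov = sum((x[i] - m) * (x[i + lag] - m) for i in range(len(x) - lag))
    return cov / var

acf = [autocorr(counts, lag) for lag in range(0, 49)]   # 48 lags = 240 minutes
# acf[0] is 1 by construction; for random failure times all remaining
# coefficients stay near zero, i.e., no periodicity is visible.
```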
Figure 9.17: Normalized autocorrelation of failure occurrence. Failure data has been grouped into buckets of five-minute intervals and autocorrelation has been computed for lags of up to 240 minutes.
9.3.4 Distribution of Sequence Lengths
Error sequences are delimited by time ∆td and hence an analysis of the length of sequences in terms of the number of errors is provided here. For the test dataset, a histogram
of the number of symbols is shown in Figure 9.18-a. Taking only failure sequences into
account, Figure 9.18-b plots the empirical cumulative distribution function.
Figure 9.18: (a) Histogram of length of all sequences. (b) Empirical cumulative distribution
function (ECDF) for the length of failure sequences.
The histogram of all sequences (Figure 9.18-a) shows two peaks, one around 50 and the other around 225 symbols. This means that a large number of sequences have either around 50 or around 225 symbols, although most of the sequences span a time interval of five minutes. An explanation for this phenomenon is that the system writes either a great many error log records or only a few, depending on the varying call-load on the system during the rather small excerpt of data (as can also be observed in Figure 9.14). An analysis of the entire data set showed a more even distribution. In order to be consistent with the other analyses presented in this section, the distribution has been plotted as is.
Regarding failure sequences (Figure 9.18-b), the empirical cumulative distribution function is presented since it is the appropriate visualization for the argumentation used in Section 9.2.5. The reason why the maximum length of failure sequences is smaller than for all sequences is simply random variability: a separate investigation has shown that there are also failure sequences with more than 200 symbols. Again, for reasons of consistency, the plot is provided for the same data that has been used in the investigations of previous sections. Comparing Figure 9.18-b to Figure 9.13, it might seem surprising that an average length of 25 can result from the ECDF shown in Figure 9.18. The explanation is that Figure 9.13 only plots the average length of sequences belonging to one failure group. The second group has an average length of 75.2 without noise filtering.
9.4 Training HSMMs
In the previous sections, data preprocessing has been explored, which is not necessarily specific to HSMM failure prediction. This section describes and analyzes the steps involved in training HSMMs for failure prediction. Note that for reasons of legibility the previous analysis was based on a small excerpt of data. In order to yield more reliable results, a larger data set has been used for the experiments described in the following sections.
9.4.1 Parameter Space
Many parameters are involved in modeling. Although most of them have already been mentioned and/or explained in previous chapters, an overview is provided here. The number of parameters and their possible values is too large to compare all combinations. Hence some parameters have been explored in detail, while reasonable values based on an “educated guess” have been assumed for the others (this approach has been termed greedy versus non-greedy in Section 8.3.1).
Parameters that have been set heuristically. No experiments have been performed
for the following parameters. Instead, values have been chosen according to the reasons
described.
• Intermediate probability mass and distribution. In the experiments, 10% of the probability mass of each transition has been distributed among intermediate states (c.f., Section 6.6). The transition distributions themselves have been chosen to be normal distributions since they are centered around the mean, which is useful for the requirement that the sum of mean intermediate durations should equal the mean duration between the original states that are extended.5 Since for uncorrelated random variables the following property holds:

Var(∑i Xi) = ∑i Var(Xi)    (9.1)

the variance of the intermediate distributions has been set to the variance of inter-error delays divided by the number of intermediate states plus one. The assumption that two successive delays are uncorrelated might not hold,6 however, it led to reasonably good prediction results.
• Number of tries in optimization. As stated before, the Baum-Welch algorithm converges to a local optimum starting from a random initialization. The problem is that it cannot be determined whether the local optimum is close to the global one or not. Ignoring more sophisticated techniques such as evolutionary strategies, the Baum-Welch algorithm has simply been performed 20 times and the best solution in terms of maximum overall training sequence likelihood has been chosen.
• Type of background distributions. In principle, the concept of background distributions for observation probabilities allows arbitrary distributions to be used. In this thesis, the distribution of symbols estimated from the entire training data set has been used since this reflects the overall frequency of error occurrence.

5 Due to the central limit theorem, if there are many intermediate distributions having finite variance, the sum approximates a normal distribution anyway.
6 E.g., due to the bursty behavior described in Section 9.3.4.
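The variance assignment implied by Equation (9.1) can be illustrated with a small numeric sketch; the delay mean and variance are invented values, and k denotes the number of intermediate states per transition:

```python
# Numeric sketch of the variance split implied by Equation (9.1); the delay
# statistics are invented illustrative values, not taken from the dataset.
k = 1                    # intermediate states per transition (setting used later)
mean_delay = 30.0        # mean inter-error delay (illustrative)
var_delay = 100.0        # variance of inter-error delays (illustrative)

# A transition with k intermediate states consists of k + 1 segments; each
# normal segment receives an equal share of the mean and of the variance.
seg_mean = mean_delay / (k + 1)
seg_var = var_delay / (k + 1)

# Under the uncorrelatedness assumption, means and variances add up again:
assert abs((k + 1) * seg_mean - mean_delay) < 1e-9
assert abs((k + 1) * seg_var - var_delay) < 1e-9
```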
Parameters that have been varied. Several experiments have been performed in order
to determine the effects of the following parameters. Results of these experiments are
provided in the next section.
• Number of states. As can be seen from the figures on the principal prediction approach (Figure 2.9 on Page 19 and Figure 2.10 on Page 20), u + 1 HSMMs are involved, where u is the number of groups obtained from failure sequence clustering. Each model consists of N states. The question is how the number of states affects the modeling process. Since the prediction models have a strict left-to-right structure (c.f., Figure 6.10 on Page 126), the maximum number of transitions is N − 1. From this one might conclude that the models should have as many states as there are symbols in the sequences. On the other hand, the larger the model, the more model parameters have to be estimated from the same limited amount of training data, resulting in worse estimates. Therefore, a better solution might be obtained if an HSMM with fewer states is used and some very long sequences are ignored.
• Maximum span of shortcuts. Figure 6.10 on Page 126 shows that there are shortcut transitions in the model bypassing several states. Increasing the maximum span of shortcuts increases the flexibility of the models but roughly doubles, triples, etc., the number of transitions and hence the number of transition parameters.
• Number of intermediate states. After training, intermediate states are added to the
model (c.f., Figure 6.11 on Page 127). The number of states added between each
pair of states affects generality of the model. If there are no intermediate states, the
models might be overfitted. If there are too many, the model is too general.
• Amount of background weight. Background distributions are an important way to
reduce variance of hidden Markov models. The weight ρi by which background
distributions are mixed with observation distributions obtained from training also
affects the bias-variance trade-off.
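The mixing of trained observation distributions with a background distribution can be sketched as follows; the symbol probabilities are made up for illustration, and rho plays the role of the weight ρi (0.05 was eventually chosen in the experiments):

```python
# Schematic of mixing a trained observation distribution with the background
# distribution; all probabilities here are invented toy values.
rho = 0.05
b_trained = {"A": 0.7, "B": 0.3, "C": 0.0}     # symbol C never seen in training
b_background = {"A": 0.5, "B": 0.3, "C": 0.2}  # overall symbol frequencies

b_mixed = {s: (1 - rho) * b_trained[s] + rho * b_background[s] for s in b_trained}

# The mixture is still a proper distribution, and previously unseen symbols
# get nonzero probability: variance is reduced at the price of a small bias.
assert abs(sum(b_mixed.values()) - 1.0) < 1e-9
assert b_mixed["C"] > 0.0
```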
9.4.2 Results for Parameter Investigation
Four parameters have been listed that need to be explored with respect to failure prediction performance. One way to investigate their effect would be to perform a separate experiment for each parameter. However, such an approach has two problems: first, it neglects interdependencies among parameters, and second, while testing one parameter, it is not clear what (fixed) values should be assumed for the others. An investigation reveals that there are two ways in which the parameters influence the model:
1. The number of states and the maximum span of shortcuts determine the number of parameters (i.e., the degrees of freedom) of the HSMM that need to be optimized from a fixed and finite amount of training data. The trade-off is that a higher degree of freedom in principle allows the model to better adapt to the data specifics; however, since more parameters need to be estimated from the same amount of data, the estimates get worse, resulting in worse adaptation to the data specifics.
2. The number of intermediate states and the amount of background weight affect the generality of the models after training. More general models can account for a larger variety of input data. On the other hand, models that are too general yield blurred sequence likelihoods, which in turn can result in worse classification results.
Therefore, the parameters have been investigated in two groups. First, models are trained for various combinations of the number of states and the maximum span of shortcuts. In a second step, each resulting model is altered by adding intermediate states and applying some amount of background weight. Tests are performed in order to evaluate the dependence of failure prediction quality on all four parameters. Additionally, failure prediction depends on the final classification threshold θ. In order to eliminate the dependence on θ, various values of θ have been investigated for each combination of the four parameters, and the maximum F-measure has been used to compare prediction results.
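The comparison criterion can be sketched as follows: sweep the classification threshold θ over the failure scores of the sequences and keep the maximum F-measure. The scores and labels below are invented toy values:

```python
# Toy data: per-sequence failure scores and ground truth (1 = failure followed).
scores = [0.9, 0.8, 0.75, 0.6, 0.4, 0.35, 0.2, 0.1]
labels = [1,   1,   0,    1,   0,   0,    0,   0]

def max_f_measure(scores, labels):
    """Sweep the classification threshold theta and return the best F-measure."""
    best = 0.0
    for theta in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if s >= theta and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= theta and l == 0)
        fn = sum(1 for s, l in zip(scores, labels) if s < theta and l == 1)
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

For this toy data, the best threshold is θ = 0.6 with F = 6/7 ≈ 0.857; applying the same sweep to every parameter combination makes the combinations comparable independently of θ.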
Training with varying number of states and maximum span of shortcuts. The number of states and the maximum span of shortcuts are integer variables. However, a complete enumeration of values is not possible. If, for example, the maximum span of shortcuts were varied from zero to five and the number of states from 20 to 500, 2886 combinations of model parameters would have to be tested, which is not feasible since preparation of the data, setup of models, training, testing, and evaluation of prediction results would be too time consuming. Hence, only some values for the maximum span of shortcuts and the number of states have been selected and all their combinations have been tested. More specifically, HSMMs with 20, 50, 100, and 200 states have been investigated. Larger models could not be considered due to requirements both in terms of memory and computing time. The maximum span of shortcuts has been varied from zero to three. This selection is based on the following reasons: shortcuts are introduced to account for missing errors in failure sequences (e.g., if a symptomatic pattern is B-A-A-B but one example sequence consists only of B-A-B, shortcuts make it possible to align both sequences). By limiting the maximum span of shortcuts to three, it is assumed that no more than three successive errors are missing. Even if this case occurs, the sequence with missing errors can still be aligned from the next state but one onwards. Furthermore, this limitation is sufficient since the best failure prediction results are achieved with a shorter maximum span of shortcuts, as is shown in the next paragraphs. Also note that shortcuts are not necessarily required to handle short sequences: due to the initial probabilities πi , a short sequence may start “in the middle” of the model.
In order to visualize the trade-off, the average training sequence log-likelihood is plotted. For legibility reasons, the negative of the sequence log-likelihood is shown in Figure 9.19. That means the higher the bar, the worse the training result, which can be seen as some sort of training error. The dataset used for these experiments consisted of 3650 sequences, among which are 278 failure sequences.
Looking at training likelihoods for a maximum shortcut span of zero (first column in Figure 9.19), it can be observed that adaptation to the training data improves for an increasing number of states up to a model with 100 states, but gets worse for a model with 200 states. Regarding the effect of the maximum span of shortcuts, it can be seen that incorporating shortcuts spanning one to three states deteriorates models with 20 and 50 states but improves training for models with 100 or 200 states. Overall, the best training result is achieved using a model with 100 states and a maximum shortcut span of one. The following conclusions can be drawn from these observations:
Figure 9.19: Average negative training sequence log-likelihood for several combinations of
the number of states and maximum span of shortcuts.
1. Models with 20 and 50 states seem to be inappropriately small since the number of states determines the maximum length of sequences that can be handled. Since shortcuts do not remedy this problem but only introduce additional parameters, training results get worse due to worse probability estimates.
2. As can be seen from the experiments without shortcuts, models with 200 states are too large. In the case of infinite training data one would expect the average negative training log-likelihood to be smaller than for 100 states since the model has more degrees of freedom and can hence better adapt to the training data. Therefore, the reason why the training likelihood is worse than for models with 100 states is attributed to worse parameter estimation from the limited amount of training data. Furthermore, since the Baum-Welch algorithm assigns some small fraction of the probability mass to all transitions, results also get blurred if there are too many.
3. Considering only models without shortcuts, models with 100 states achieve the minimum negative log-likelihood. However, by introducing shortcuts of length one, results can be further improved. The fact that the negative training log-likelihood increases if shortcuts spanning more states are included can be explained by the same effects as in 1.
Note that these investigations do not automatically allow the conclusion that models with 100 states and shortcuts spanning one state should be used for online failure prediction, since such models could be overfitted to the training data, as will be discussed in the next section.
Number of intermediate states and amount of background weight. Intermediate states and background distributions are applied after training to control the trained model’s generalization capabilities. However, overfitting can be reduced either by using fewer states and more background weight or vice versa. This is the principal reason why non-greedy parameter selection is necessary: a model with worse training sequence likelihood might, after the introduction of intermediate states and the application of background distributions, result in better failure prediction performance than the model with the best training results (see the discussion of bias and variance in Section 7.3). Hence, all 16 combinations of number of states and shortcuts have been combined with zero to three intermediate states per
transition, and with five levels of background distribution weight. This selection is based on the following considerations: similar to the introduction of shortcuts, the introduction of intermediate states aims at the alignment of sequences with additional errors in between symptomatic ones. For similar reasons, the introduction of up to three intermediate states per transition is sufficient. Background weight is a real-valued parameter and hence five values have been selected spanning a range from zero to 0.2.
In order to evaluate each combination, the maximum achievable F-measure with respect to out-of-sample prediction of validation sequences has been determined. Out-of-sample means that the validation sequences have not been available for training. Since it is not possible to present the results of all 320 combinations here, the three most important findings are described in the following:
1. The application of background distributions can increase failure prediction performance for all combinations. However, this is only true if the background distribution weight is rather small. Values that are too large quickly result in “random” models with worse prediction performance than models without background distributions. Hence, in later experiments, a background weight ρi of 0.05 has been applied.7
2. A similar effect can be observed for the introduction of intermediate states. Overall, the effect of adding intermediate states to the models did not meet expectations: failure prediction performance could only be improved slightly when one intermediate state per transition was added. This setting has been used for further experiments.
3. One setting for a model with 50 states and no shortcuts achieved roughly the same failure prediction quality as the model with 100 states and a maximum shortcut span of one, which supports the described observation that the model with the best training likelihood does not guarantee optimal prediction performance on test data. On the other hand, the model with the best training likelihood belongs to the set of models with the best failure prediction performance. Therefore, the model with 100 states has been used for further experiments since it can account for longer sequences.
Computation times. A theoretical analysis of the algorithm’s complexity has been provided in Chapter 6, but the analysis was rather coarse-grained and only took the number of states and the length of the sequence into account. Although the effect of the four parameters investigated in this section could in principle be traced back to the number of states and the number of edges, or even further to the number of multiplications and additions, such a full-fledged analysis is not provided here. Instead, the time needed to train the models and to classify test data has been measured several times on the same machine, which allows at least a relative estimate of the effect of the parameters. Training time is affected only by the number of states and the maximum span of shortcuts, while testing time is additionally influenced by the number of intermediate states. The amount of background weight has no influence on testing times since the output probabilities bi(Oj) are altered before testing starts. Figures 9.20 to 9.22 show the results.
In Figure 9.20, mean training time for all 16 combinations of parameters is shown.
Training time is determined by the time needed to train one model. Not surprisingly,
7 c.f., Equation 6.63 on Page 112
Figure 9.20: Mean training time depending on the number of states and maximum span of
shortcuts.
training time increases both with the number of states and the maximum span of shortcuts since both increase the number of parameters that need to be estimated from the training data set. The figure suggests that the number of states has a stronger influence than the maximum span of shortcuts. One reason is that the maximum span of shortcuts only increases the number of transition parameters, which are only a subset of all parameters that need to be determined. The configuration used in further experiments (100 states, maximum shortcut span of one) resulted in a mean training time of 1365 seconds.
With respect to testing, 75% trimmed mean testing times are plotted in Figure 9.21. Testing time is determined by the mean time needed to perform a prediction on one single sequence. In Figure 9.21-a, processing time is plotted depending on the number of states and the maximum span of shortcuts. The number of states clearly dominates testing time, which can again be explained by the fact that the maximum span of shortcuts only increases the number of transitions (and no state-dependent parameters) and hence only has effects in the innermost loops of the algorithm. In addition to the number of states and the maximum span of shortcuts, testing time is also influenced by the number of intermediate states. Figure 9.21-b shows 75% trimmed mean testing times depending on the number of states and the number of intermediate states for a maximum shortcut span of one. Surprisingly, computation time decreases with the introduction of intermediate states. An analysis has revealed that this is due to the fact that with intermediate states, probabilities decrease more quickly in the forward and backward algorithm, such that early-exit shortcuts implemented in the algorithm are triggered when probabilities fall below a certain threshold for some sequences.
Performing online failure prediction is a real-time application. However, no full-fledged real-time analysis can be presented here. Instead, Figure 9.22 shows the upper limits of 95% confidence intervals on mean testing time. This is obviously no guarantee that the algorithm can always be performed in real time. However, two things should also be taken into consideration: first, the algorithm operates on a lead time that is much larger (e.g., five minutes), hence there is some room for “buffering”. Second, errors occur in short bursts separated by longer time intervals with only very few errors. This means that there is some chance for the algorithm to catch up. If not, the algorithm could simply ignore
Figure 9.21: Computation time needed for testing a single sequence depending on the number of states and maximum span of shortcuts for one intermediate state (a) and depending on the number of states and number of intermediate states for a maximum shortcut span of one (b).
Figure 9.22: 95% upper confidence interval limits for mean testing times corresponding to
Figure 9.21.
some sequences.8
In the experiments, one non-failure model and two failure models have been used. With respect to testing, the effect of the number of groups is linear. With respect to training, however, the effect is more complex since with an increasing number of groups there are fewer training sequences in each group, partly compensating for the overhead of training more models. The number of groups is expected to reflect the number of failure mechanisms in the system and is determined during data preprocessing. Nevertheless, the effect has been analyzed: the same data has been processed with only one failure group. It has been found that the total training time with only one non-failure and one failure model is increased by approximately 20% since more iterations on a larger training data set have to be performed.
8 Note that processing times shown in Figures 9.21 and 9.22 refer to an entire sequence.
9.5 Detailed Analysis of Failure Prediction Quality
In the previous sections the parameters involved in setting up an HSMM-based failure predictor have been investigated. Although some model parameters have been assessed with respect to failure prediction, only the maximum F-measure has been used. In this section, the quality of failure prediction is assessed in more detail. Specifically, Section 9.5.1 focuses on precision, recall, and F-measure. In Section 9.5.2, ROC curves and related metrics are provided, while in Section 9.5.3 the evaluation deals with cost-based metrics. All experiments shown here have been performed using the parameter settings listed in Table 9.2.
lead time ∆tl: 5 min
data window length ∆td: 5 min
no. of states N: 100
max. span of shortcuts: 1
no. of intermediate states: 1
background weight: 0.05

Table 9.2: Experiment settings for detailed analysis.
With respect to data sets, the experiments performed in previous sections have been
evaluated using out-of-sample validation data, while results reported in this section refer to out-of-sample test data (c.f. Section 8.3.2). 95% confidence intervals have been
estimated by the procedure described in Section 8.4.5.
9.5.1 Precision, Recall, and F-measure
Precision, recall, and F-measure have been defined in Section 8.2.2. As they have been developed for information retrieval evaluation, their focus is on imbalanced class distributions, as is the case for failure prediction. However, precision, recall, and F-measure depend on the classification threshold θ (c.f., Equation 7.20 on Page 137), and hence precision/recall plots and a plot of the F-measure for a selection of eleven thresholds ranging from −∞ to ∞ are provided. At each of the eleven classification threshold levels θ, 95% confidence intervals have been computed.
At the threshold level for the maximum F-measure of 0.66, the corresponding values of precision and recall are 0.70 and 0.62, respectively, which means that failure warnings are correct in 70% of all cases and almost two thirds of all failures are caught by the prediction algorithm. Either value can be pushed towards 1.0 by adjusting the classification threshold θ, at the expense of the other. It depends on the methods and actions triggered by the prediction algorithm whether high precision or high recall is more important.
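As a quick consistency check, the reported precision and recall indeed combine to the stated maximum F-measure, since the F-measure is the harmonic mean of the two:

```python
# Reported operating point: precision 0.70, recall 0.62.
precision, recall = 0.70, 0.62
f_measure = 2 * precision * recall / (precision + recall)
# Rounding to two decimals reproduces the stated maximum F-measure of 0.66.
assert round(f_measure, 2) == 0.66
```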
9.5.2 ROC and AUC
Taking true negative predictions into account, ROC curves plot true positive rate versus
false positive rate (c.f., Section 8.2.2) and AUC is the area under the resulting curve as
estimated by integrating the piecewise linearly interpolated ROC curve.
Figure 9.23: Precision/Recall plot (a) and corresponding values of F-measure (b) for the HSMM failure prediction model. A selection of eleven thresholds ranging from −∞ to ∞ has been plotted, including 95% confidence intervals for precision and recall.

Figure 9.24 shows the ROC curve for HSMM failure prediction. Choosing the threshold yielding the maximum F-measure results in a false positive rate of 0.016 and a true positive rate (which is equal to the recall) of 0.62. The area under the ROC curve (AUC) equals 0.873.
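The AUC estimation described above (trapezoidal integration of the piecewise linearly interpolated ROC curve) can be sketched as follows; only the operating point (0.016, 0.62) is taken from the text, while the remaining (fpr, tpr) points are invented and the curve is anchored at (0, 0) and (1, 1):

```python
# ROC points sorted by false positive rate: (fpr, tpr) pairs.
roc_points = [(0.0, 0.0), (0.016, 0.62), (0.10, 0.80), (0.30, 0.92), (1.0, 1.0)]

def auc(points):
    """Area under the piecewise linearly interpolated ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0   # trapezoid between adjacent points
    return area
```

For these toy points the estimate is about 0.91; applying the same rule to the eleven measured threshold points yields the reported value of 0.873.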
9.5.3 Accumulated Runtime Cost
The plot showing accumulated runtime cost (c.f., Section 8.2.4) depends on the assignment of cost to each of the four cases that can occur in failure prediction:
• A true negative prediction has cost rF̄ F̄ . Since it is a negative prediction, no subsequent actions are performed. Furthermore, since it is a correct decision, rF̄ F̄ should be the smallest value. A value of 1 has been chosen arbitrarily.
• A true positive prediction has cost rF F . Since the occurrence of a failure is predicted, some actions are performed in order to deal with the upcoming failure, resulting in higher cost. However, it is a correct prediction and hence the cost should not be too high. A value of 10 has been chosen.
• A false positive prediction has cost rF̄ F . A failure is predicted and actions are performed as in the previous case; however, these actions are unnecessary since in truth no failure is imminent. Hence a value of 20 has been chosen.
• A false negative prediction has cost rF F̄ . From the point of view of computational workload, the cost should equal rF̄ F̄ . However, this is the worst case since an upcoming failure is not predicted and nothing is done about it. The system fails, which implies the highest cost. Therefore, a cost of 1000 has been assigned to this case.
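The accumulation of runtime cost under this assignment can be sketched as follows; the sequence of prediction outcomes is an invented toy run:

```python
# Cost assignment from the text: TN : TP : FP : FN = 1 : 10 : 20 : 1000.
COST = {"TN": 1, "TP": 10, "FP": 20, "FN": 1000}

# Invented toy run of prediction outcomes over time.
outcomes = ["TN", "TN", "TP", "FP", "TN", "FN", "TN", "TP"]

accumulated, total = [], 0
for outcome in outcomes:
    total += COST[outcome]
    accumulated.append(total)
# A single missed failure (FN) dominates the curve, which is why a predictor
# with a moderate false positive rate can still cut overall cost drastically.
```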
Figure 9.25 shows the accumulated runtime cost for a simulated run of 31.5 days. The figure includes boundary cost for:
• oracle predictor: this predictor issues only a true positive failure warning at the time of failure occurrence, setting the lower bound of overall achievable cost.
Figure 9.24: ROC plot for the HSMM failure prediction model applied to telecommunication
system data. A selection of 11 thresholds ranging from −∞ to ∞ has been
plotted. At each threshold level, 95% confidence intervals for true and false
positive rate are provided.
• perfect predictor: performs a prediction at each time instant an error message occurs. However, each prediction is correct, i.e., only true positive or true negative predictions occur.
• no prediction: if no predictor were in place, the cost for a false negative prediction would occur each time a failure occurs.
• maximum cost: a prediction is performed each time an error message occurs. However, each prediction is wrong and hence only false positive and false negative predictions are performed.
As can be seen from the plot, many failures occurred at the beginning of the run, followed by a “silent” period. However, due to the limited plotting resolution, it cannot be seen that some failures occurred quite close together in time, resulting in a total of 232 failures. By use of the HSMM failure predictor, accumulated runtime cost can be cut down to approximately one fifth of the cost without a failure predictor.
9.6 Dependence on Application Specific Parameters
The experiments conducted so far have analyzed the parameters involved in data preprocessing and modeling. These parameters were not specific to the application domain for which failure prediction is to be performed. This section investigates application-specific factors, i.e., restrictions or properties imposed by the application domain or the system.
9.6.1 Lead-Time
In Section 2.1, or more specifically in Figure 2.4 on Page 12, it is shown that lead-time
∆tl has a lower bound called warning-time ∆tw , which is determined by the time needed
to perform some action upon failure warning. In the experiments carried out so far a
lead-time ∆tl of five minutes has been used. In order to evaluate the effect of lead-time,
Figure 9.25: Accumulated runtime cost for the HSMM failure prediction model. A test run of 31.5 days has been plotted. A cost ratio of rF̄ F̄ : rF F : rF̄ F : rF F̄ = 1 : 10 : 20 : 1000 has been used. The plot also includes boundary cost for an oracle predictor, a perfect predictor, a system without prediction, and maximum cost. Triangles at the bottom indicate times of failure occurrence.
experiments with a lead-time ranging from ∆tl = 5 minutes to ∆tl = 30 minutes have been performed. Figure 9.26 summarizes the results in terms of maximum F-measure with 95% confidence intervals determined from out-of-sample test data. Although one might expect a rather linear decrease of failure prediction performance, the experiments indicate that performance stays more or less constant up to a lead-time of 20 minutes, after which the F-measure drops quickly. The rather sharp drop observed in the figure indicates that symptomatic manifestations of an upcoming failure are only observable up to 20 minutes⁹ before failure occurrence. Taking into account that errors occur late in the process from faults to failures, it can be concluded that the fine-grained detection mechanism in the telecommunication system is able to grasp the first misbehaviors up to 20 minutes before failure.
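Whether a failure warning counts as correct under a given lead-time can be sketched as follows. The window convention (a prediction period of length ∆tp starting ∆tl after the warning) is assumed from the evaluation setup of Section 2.1, and the function name is hypothetical.

```python
def is_true_positive(pred_time, failure_times, lead_time, prediction_period):
    """A failure warning issued at pred_time counts as a true positive if a
    failure occurs no earlier than lead_time after the warning, and within a
    prediction period of length prediction_period (window convention assumed
    from Section 2.1)."""
    window_start = pred_time + lead_time
    return any(window_start <= f <= window_start + prediction_period
               for f in failure_times)
```

With a longer lead-time, the window moves further away from the moment the symptomatic errors are observed, which is consistent with the performance drop beyond 20 minutes.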
9.6.2 Data Window Size
Training of HSMMs as well as of other failure prediction models is based on error sequences. The length of each sequence is determined by the data window size ∆td. Although ∆td is a data preprocessing parameter (c.f., Section 9.2.4), it is analyzed here since its effects on failure prediction quality have not been investigated in Section 9.2.4.
As is the case with many parameters, the effects of ∆td on failure prediction quality are manifold and can hardly be assessed analytically. In principle, longer sequences should result in a more precise classification. On the other hand, the farther sequences reach back into the past, the more likely it becomes that failure-unrelated errors are included in failure sequences, which deteriorates failure prediction. Figure 9.27 plots maximum F-value for five values of ∆td: data windows of length 1 minute, 5 minutes, 10 minutes, 15 minutes, and 20 minutes. Figure 9.27-a shows failure prediction quality in terms of
⁹ Plus the length of the data window ∆td.
Figure 9.26: Failure prediction performance for various lead-times ∆tl. The plot shows F-measure with 95% confidence intervals.
maximum F-value and 95% confidence intervals. As can be seen from the figure, with the exception of the data window size of ten minutes, failure prediction quality improves with larger data window sizes. The exception at ∆td = 10 min might be caused by random effects and the fact that the Baum-Welch algorithm only converges to a local maximum rather than a global one, even if it is repeated 20 times.
Improved prediction comes at the price of memory consumption and processing time. Figure 9.27-b shows mean processing time per sequence in seconds. Processing time increases heavily with increasing ∆td: with twenty-minute data windows, the average processing time reaches 2.34 seconds per sequence. This increase is caused by two effects: (a) the length of the sequences (L) increases with ∆td, and (b) HSMMs also need more states (N) in order to represent longer sequences. The reason why confidence intervals for processing time get wider with increasing ∆td is that the number of errors per sequence varies more: time windows of five minutes length are "filled with errors" in most cases, whereas time windows of 20 minutes length sometimes contain larger gaps, resulting in sequences with fewer errors.
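The extraction of error sequences from a data window of size ∆td can be sketched as follows. The placement of the window relative to the failure (ending lead-time ∆tl before it) is an assumption of this sketch, borrowed from the setup of Section 2.1.

```python
def failure_sequence(errors, failure_time, delta_td, delta_tl):
    """Collect the error events (time, symbol) that fall into a data window
    of length delta_td ending lead-time delta_tl before failure_time.
    Window placement relative to the failure is assumed."""
    window_end = failure_time - delta_tl
    return [(t, s) for (t, s) in errors if window_end - delta_td <= t < window_end]
```

A larger ∆td collects more (and possibly failure-unrelated) errors into each sequence, which is exactly the trade-off discussed above.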
9.7 Dependence on Data Specific Issues
Building a failure prediction model following a data-driven machine learning approach always depends on the quality and quantity of the data and —of course— on the system itself. This section investigates the sensitivity of failure prediction quality with respect to data-specific issues.
Figure 9.27: Experiments for various data window sizes ∆td. (a) Failure prediction performance reported as maximum F-value. (b) Mean processing times per sequence in seconds. 95% confidence intervals are shown in both plots.
9.7.1 Size of the Training Data Set
The objective of machine learning is to identify unobservable relationships from measured data, which is usually blurred by noise. Hence, one of the rules of thumb of machine learning is to use as many data points as available. While in many cases the size of the training data set is the limiting factor, the time needed for training may also be critical. Additionally, if very old data is included in the training data set, it might not precisely represent the relationships as they are present in the running system. In order to investigate the effect of the size of the data set, subsets of the training data of increasing size have been selected to train a model and have been tested on the same test data set (see Figure 9.28). More precisely, the experiments investigated the relationship between the amount of
Figure 9.28: Selection of training data sets for experiments on the effect of the amount of training data.
available training data and the resulting failure prediction quality as well as the time needed for training. In order to visualize the effect, two plots are presented: Figure 9.29-a plots maximum F-measure for the three data sets of different size and Figure 9.29-b shows the time needed to train the models. In failure prediction, the number of failures in the data set is usually the limiting factor. In the experiments, a small data
Figure 9.29: Effects of the size of the training data set. (a) shows maximum F-measure and (b) the time needed to train models. Data set one contained 72, data set two 134, and data set three 278 failure sequences.
set with only 72 failure sequences, a medium data set with 134 failure sequences, and a large data set with 278 failure sequences were used, which had been obtained by reducing the data set that was also used in previous experiments. Regarding Figure 9.29-a, the F-measure is roughly similar for the large data set (3) and the middle-sized data set (2). When the data set is further reduced (data set 1), the F-measure drops significantly. This is due to the fact that there are too few examples to learn from, or, more precisely, to robustly estimate all the parameters of the HSMMs. Figure 9.29-b shows the expected dependence of the time needed to train the models on the size of the data set.
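The selection of training sets of increasing size can be sketched as nested subsets; that the subsets are nested and aligned with the end of the training period is an assumption read off Figure 9.28.

```python
def nested_training_sets(sequences, fractions=(0.25, 0.5, 1.0)):
    """Return nested subsets of the training sequences, each covering the
    given fraction of the data and ending at the same point in time
    (selection scheme assumed from Figure 9.28)."""
    n = len(sequences)
    return [sequences[n - int(f * n):] for f in fractions]
```

Aligning all subsets with the end of the training period keeps the temporal gap to the test data identical for all models, so that only the amount of data varies.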
9.7.2 System Configuration and Model Aging
Complex computer systems involve a multitude of configuration parameters and are subject to patches and updates. In the case of the telecommunication system investigated in this thesis, the number of configuration parameters has been estimated by system experts to exceed 2,000. A separate configuration database is installed on the system, and the system is flexible enough that different versions or implementations of a component can be used simply by updating one value in the configuration database. Hence, a single change in the configuration may alter system failure behavior significantly. It is the goal of this section to investigate the sensitivity of the trained HSMM failure prediction models to changes in system configuration. However, the problem is that we had access neither to the configuration database nor to any logs indicating configuration changes. Therefore, sensitivity can only be investigated in terms of the temporal gap ∆tg between the training data set and the test data set (see Figure 9.30).
Figure 9.31 presents results from training a failure prediction model with five different
gaps ∆tg . More precisely, experiments have been performed with a gap of 13 days, 42
days, 91 days, 125 days and 152 days between the end of the training and beginning
of the test data. Since we had no access to the configuration database, these numbers
have been chosen from an in-depth analysis of the entire data set, which revealed, e.g.,
Figure 9.30: Selection of test data sets for experiments on the effect of changing system
configuration. ∆tg indicates the gap between the end of training and start of the
test data set.
changes in the log format. Two conclusions can be drawn from the figure: first, prediction quality decreases with an increasing temporal distance between training and application of the failure predictor; second, the 95% confidence intervals get larger with increasing gap size. Both characteristics can be interpreted against the background of continuous partial updates and patches: if only parts of the system are changed, some failure-indicating error sequences stay the same while others change. The HSMM recognizes known (old) error sequences well while it fails on new sequences. The increasing diversity of sequences is reflected in wider confidence intervals obtained by the bootstrapping procedure.
Besides the aspect of configuration, plotting failure prediction quality as a function of ∆tg brings up another aspect of machine learning. The training procedure applied in this thesis is called supervised offline batch learning, which means that first a batch of data is collected, which is then used entirely to train a model. In this context, offline means that training is performed not during operation but in the two-phase approach indicated by Figure 2.7 on Page 16. There are other machine learning approaches that try to continuously adapt the model in order to keep it up-to-date; however, in order to keep the approach simple, such techniques have not been investigated in this dissertation (see Chapter 12). The important thing to note here is that —assuming an ever-changing real system— the model is always outdated, even right after training. Hence, the question is how quickly key properties of the system change with respect to the prediction of upcoming failures. The gap ∆tg is one way to express the "age" of a model.
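The split of the data with a temporal gap ∆tg can be sketched as follows; day-granularity timestamps and the fixed test-set length are assumptions of this illustration.

```python
def split_with_gap(events, train_end, gap_days, test_days):
    """Split timestamped events into a training set and a test set that
    starts gap_days after the end of the training data (day-granularity
    timestamps assumed)."""
    train = [e for e in events if e[0] < train_end]
    test_start = train_end + gap_days
    test = [e for e in events if test_start <= e[0] < test_start + test_days]
    return train, test
```

Sliding `gap_days` from 13 to 152 while keeping the training data fixed mimics the aging experiment of Figure 9.30.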
9.8 Failure Sequence Grouping and Filtering
In this dissertation, two data preprocessing techniques have been proposed and used without further scrutiny. The experiments described in this section close this gap and investigate the effect of failure sequence grouping as well as noise filtering.
9.8.1 Failure Grouping
In order to obtain more consistent training data sets, failure sequence grouping intends to separate failure mechanisms in the data. However, this also decreases the number of sequences available for training of each model. In order to investigate the effects of failure grouping, prediction performance has been evaluated for a predictor with only one HSMM failure group model.¹⁰ Figure 9.32 presents results, which are intended to be compared to Figures 9.23 and 9.24, respectively. Results show that failure prediction performance without separating failure sequences into groups is worse: a maximum F-measure of
¹⁰ And, of course, a non-failure model.
Figure 9.31: Prediction quality (expressed as maximum F-measure) depending on the temporal gap ∆tg between training and test data. The gap is expressed in days.
0.5097 and an AUC of 0.7700 are achieved. This indicates that failure sequences are too diverse to be represented by one single HSMM. Since, by means of clustering, similar sequences are grouped and a separate model is trained for each group, the models can better adapt to the specifics of error sequences indicating an upcoming failure.
9.8.2 Sequence Filtering
In order to remove noise from failure sequences, a statistical filtering technique has been applied. To investigate its effects, a model with the same parameters as used in Section 9.5 has been trained and evaluated using unfiltered data. A maximum F-measure of 0.3601, resulting from a precision of 0.670 and a recall of 0.246 with a false positive rate of 0.0095, has been achieved. Hence, sequence filtering improves failure prediction performance slightly, at least for the parameter settings used previously.¹¹ Additionally, filtering removes symbols from sequences, which in turn has a positive effect on computation times: the average processing time for the prediction of a sequence without filtering is increased by 16.9%.
9.9 Comparative Analysis
In order to be able to judge the results presented in previous sections, the HSMM-based failure prediction approach has been compared to several published failure prediction approaches. As already explained in Section 3.2, the most promising and well-known approaches to error-driven failure prediction, as identified as subbranches of Category 1.3 in the failure prediction taxonomy (c.f., Figure 3.1 on Page 31), have been
¹¹ However, it cannot be excluded that other model parametrizations exist that achieve better prediction performance.
Figure 9.32: Precision/recall plot (a) and ROC plot (b) for prediction with a single failure group model.
selected. Additionally, results are provided for the simplest prediction method: periodic prediction based on mean-time-between-failures (MTBF). Since the HSMM-based failure prediction approach presented in this thesis extends standard HMMs, results for standard HMMs are provided, and a comparison to a random predictor and to the UBF approach proposed by Hoffmann is given. Even though the approaches have already been described in Section 3.2, their key ideas are repeated here for convenience. All experiments have been carried out on the same data set that has been used in Section 9.5 with a lead-time ∆tl of five minutes. Each model is discussed separately, and results are summarized at the end of the section, including computation times.
9.9.1 Dispersion Frame Technique (DFT)
DFT (c.f., Section 3.2.1) investigates the times of error occurrence by defining dispersion frames (DF) and computing the error dispersion index (EDI). A failure is predicted at the end of the DF if at least one out of five heuristic rules matches. Extending the original method, predictions that are closer to present time than the warning-time ∆tw of three minutes have not been considered.
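The frame bookkeeping behind DFT can be sketched as follows. The five heuristic rules themselves are omitted, and the exact half-frame convention for the EDI follows Lin & Siewiorek and is an assumption of this sketch.

```python
def dispersion_frames(error_times):
    """Dispersion frames (DF): intervals between successive error
    occurrences of one field replaceable unit."""
    return [b - a for a, b in zip(error_times, error_times[1:])]

def error_dispersion_index(error_times, frame_end, frame_length):
    """EDI: number of errors falling into half of a frame ending at
    frame_end (half-frame convention assumed from Lin & Siewiorek)."""
    half_start = frame_end - frame_length / 2
    return sum(1 for t in error_times if half_start <= t <= frame_end)
```

The heuristic rules then compare EDI values across successive frames and issue a warning at the end of a frame when one of them fires.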
Initial results for DFT using the parameter setting from the paper by Lin & Siewiorek [167] showed poor prediction performance. Parameters such as the thresholds for tupling and for the rules have been modified in order to improve prediction. However, even after investigating 540 different combinations of parameters, the best achievable result only obtained an F-measure of 0.115, resulting from a precision of 0.597 but a recall of only 0.063. The false positive rate equals 0.00352.
Comparing the results of DFT with the original work by Lin and Siewiorek, the achieved prediction performance is worse. The main reason for this seems to be the difference between the investigated systems: while the original paper investigated failures in the Andrew distributed file system based on the occurrence of host errors, our study applied the technique to errors that had been reported by software components in order to predict upcoming performance failures. In our study, intervals between errors of the same type are much
shorter. As software container IDs have been chosen as the entity corresponding to field replaceable units (FRUs), Figure 9.33 shows histograms of the time between errors for three different software containers. As can be seen from the histograms, for the leftmost
Figure 9.33: Histogram of time-between-errors for the dispersion frame technique. Since container IDs have been chosen to be the FRU equivalent, error messages of three containers have been analyzed. In order to obtain histograms in which details can be seen, after tupling only delays up to the 99% quantile have been used in order to remove very rare but extremely large values.
container ID, the vast majority of delays is below five seconds. Since DFT can at most
predict a failure half of the delay ahead, most of the failure predictions from this container
are dropped since they are closer to present time than the warning period of 100 seconds.
The same holds for the rightmost container ID. The fact that most of the predictions have
been dropped results in the low recall. However, if a failure warning is issued, it is correct
in almost 60% of all cases.
9.9.2 Eventset
The eventset method (c.f., Section 3.2.2) is based on data mining techniques that identify sets of error event types which are indicative of upcoming failures and which make up a rule database. Construction of the rule database includes the choice of four parameters:
• length of the data window
• level of minimum support
• level of confidence
• significance level for the statistical test
The training algorithm has been run for 64 combinations of parameter values and the best combination with respect to F-measure has been selected. Since the first part of the algorithm potentially needs to investigate the power set of all 1435 error symbols, which contains approximately 9.5 · 10⁴³¹ sets, a branch-and-bound algorithm called "apriori" has been used, as indicated in the paper by Vilalta & Ma [268].¹² Best results have been achieved for a window length of five minutes, a confidence of 10%, a support of 25%, and a significance level of 5%, yielding a precision of 0.465, a recall of 0.327, an F-measure of 0.3841, and a false positive rate of 0.0422.

¹² More specifically, the implementation of Christian Borgelt (see [34]).
9.9.3 SVD-SVM
Support Vector Machines (SVMs) are state-of-the-art classifiers showing various desirable properties such as convexity of the optimization criterion. The major problem when using SVM classifiers for failure prediction is the representation of error data. Domeniconi et al. [81] have used a bag-of-words representation together with latent semantic indexing techniques to solve this problem, resulting in the failure prediction approach described in Section 3.2.3. 90 different configurations have been tested and the configuration with maximum F-measure has been selected. In particular, configurations have been defined by the following parameters:
• length of the data window ∆td
• type of kernel function: linear, polynomial, and radial basis functions (c.f., e.g.,
Chen et al. [56])
• parameters controlling the kernels, such as γ for radial basis function kernel
• trade-off between training error and margin (parameter C, as in, e.g., Schölkopf
et al. [231])
• feature encoding: either existence, count, or temporal (c.f., Section 3.2.3)
The approach has been implemented using R and the free SVM toolkit "SVMlight" [135]. However, there is one difference to the algorithm as originally published by Domeniconi et al. [81]: since the output of SVMlight is not only a class label but a distance from the decision boundary, a precision/recall plot and an ROC plot can be drawn. The idea is to classify a sequence as failure-prone only if the SVM output is above some customizable threshold. Classification performance of the original algorithm hence corresponds to a threshold equal to zero. Figure 9.34 presents the results.
Best results have been achieved using a radial basis function kernel with γ = 0.6, error/margin trade-off C = 10, and count feature encoding. Using this setting, a maximum F-measure of 0.226, a precision of 0.182, a recall of 0.299, and a false positive rate of 0.1103 have been achieved.
The fact that encoding error messages by the count scheme rather than the temporal scheme yields better results might seem contradictory to one of the principal assumptions of this dissertation, namely that taking both type and time of error messages into account should improve failure prediction. However, this is not the case, since the way time is represented in the temporal scheme has a fundamental flaw: by representing the occurrences of each error type as a binary number, the temporal scheme encodes absolute time of error occurrence in the sequence rather than relative time, and discretizes time rather than treating it continuously (c.f., Section 4.2.1). As an example, assume that there is only one occurrence of one specific error message type in a sequence. If the error message appears only a little earlier, such that it falls into the next time slot, the magnitude along the error dimension is doubled in the bag-of-words representation.
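The flaw can be made concrete with a toy version of the binary time-slot encoding; the exact encoding used by SVD-SVM is assumed here for illustration.

```python
def temporal_feature(occupied_slots):
    """Binary-number encoding of one error type's occurrences over time
    slots: an occurrence in slot k contributes 2**k to the feature value
    (encoding assumed for illustration)."""
    return sum(2 ** slot for slot in occupied_slots)
```

A single occurrence shifting by one slot doubles the feature magnitude, even though the underlying event pattern is almost unchanged.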
Figure 9.34: Failure prediction results for the SVD-SVM failure prediction algorithm. Results are reported as precision/recall plot (a) and ROC plot (b).
9.9.4 Periodic Prediction Based on MTBF
The reliability model-based failure prediction approach is rather simple and has been included to show the prediction performance that can be achieved with the most straightforward prediction method. Not surprisingly, prediction performance is low: precision equals 0.054 and recall also equals 0.054, yielding an F-measure of 0.0541. Since the approach only issues failure warnings (there are no non-failure predictions), the false positive rate cannot be determined here.
The reason why this prediction method does not really work for the case study is that the prediction method is periodic: the next failure is predicted to occur at the median of a distribution that is not adapted during runtime. As can be seen from the histogram of time-between-failures (Figure 9.16), the distribution of time-between-failures is widely spread, and the auto-correlation of failure occurrence (Figure 9.17) shows that there is no evident periodicity in the failure data set.
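A minimal version of this periodic predictor: the median time-between-failures is estimated once from the training history and never adapted at runtime.

```python
import statistics

def periodic_predictions(failure_times, horizon):
    """Predict failures periodically, spaced by the median
    time-between-failures of the observed history; the estimate is not
    adapted during runtime."""
    tbf = [b - a for a, b in zip(failure_times, failure_times[1:])]
    period = statistics.median(tbf)
    predictions, t = [], failure_times[-1] + period
    while t <= horizon:
        predictions.append(t)
        t += period
    return predictions
```

With a widely spread time-between-failures distribution, such fixed-period warnings rarely coincide with actual failures, which explains the low precision and recall.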
9.9.5 Comparison with Standard HMMs
This dissertation is the first to apply hidden Markov models to the task of online failure prediction. In Section 4.2 it has been argued that standard HMMs are not well-suited for representing error logs due to insufficient capabilities to represent time (c.f., Figure 4.7 on Page 64). While this has been a claim based on theoretical analysis, this section provides experimental evidence for it.
Three experiments have been performed in which
1. no timing information is used,
2. time-slotting (c.f., Section 4.2.1) is used,
3. the model described in Salfner [223] is used.
The third case did not work out due to the fact that the process is forced to always traverse the same limited set of states and hence loses its pattern recognition potential. Due to this theoretical flaw, this approach was not investigated further. In the first two cases, the structure of the HMM was similar to the structure of the failure prediction HSMM. In the case of the time-slotting model, an extra observation symbol representing silence has been added. Theoretically, the time slot size should be set equal to the minimum delay between the errors, which is determined by the tupling parameter ε = 0.015 s. However, this would lead to huge models since:
#states = 5 min × 60/0.015 slots/min = 20000 slots = 20000 states .    (9.2)
However, 20000 states are far too many to be trained from limited data within reasonable time. Hence, a larger time-slotting interval of ε = 0.2 s has been used, resulting in a model with 1500 states. If two error symbols occurred within one time slot, one symbol has been chosen randomly, which treats such cases as noise. In contrast, the model without timing had 100 states.
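The state-count arithmetic of Equation 9.2, evaluated for both slot sizes:

```python
def n_states(window_minutes, slot_seconds):
    """Number of time slots (= HMM states) needed to cover a data window
    of window_minutes minutes at a slot resolution of slot_seconds seconds
    (Equation 9.2)."""
    return round(window_minutes * 60 / slot_seconds)
```

The jump from 1500 to 20000 states when moving from ε = 0.2 s to ε = 0.015 s shows how quickly time discretization inflates the model.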
The model without timing achieved a prediction performance of precision = 0.230 and recall = 0.176, and hence an F-measure of 0.1996. The false positive rate was equal to 0.049. The model with time-slotting achieved a prediction performance of precision = 0.079 and recall = 0.129, and hence an F-measure of 0.0982 with a false positive rate of 0.124.
Similar to the SVD-SVM method, this experiment also shows that the mere incorporation of time information does not automatically lead to good failure prediction results. Rather, in the case considered here, the incorporation of time by time-slotting rendered the prediction approach almost unusable.
9.9.6 Comparison with Random Predictor
The term "random predictor" denotes a predictor that, each time a prediction is to be performed, issues a failure warning with probability 0.5. Applied to the case study, such a predictor would result in a contingency table as shown in Table 9.3. From the table,
                          True Failure   True Non-failure    Sum
Prediction: Failure            139             1686          1825
Prediction: No Failure         139             1686          1825
Sum                            278             3372          3650

Table 9.3: Contingency table for a random predictor.
the following values for precision, recall, and false positive rate can be computed:
• precision = 139/1825 ≈ 0.076
• recall = 139/278 = 0.5
• fpr = 1686/3372 = 0.5
which results in an F-measure of approximately 0.1322. One might conclude from these considerations that any predictor with a recall of less than 50% is useless. However, this is not true: as precision and recall are in most cases inversely linked, many prediction methods trade recall for precision and are hence useful even though recall is below 50%.
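The arithmetic above generalizes to any contingency table; a small helper, applied here to the random predictor's numbers from Table 9.3:

```python
def prediction_metrics(tp, fp, fn, tn):
    """Precision, recall, false positive rate, and F-measure from the four
    cells of a contingency table."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, fpr, f_measure

# Random predictor of Table 9.3: TP = 139, FP = 1686, FN = 139, TN = 1686
p, r, fpr, f = prediction_metrics(139, 1686, 139, 1686)
```

This reproduces precision ≈ 0.076, recall = fpr = 0.5, and an F-measure of about 0.132.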
9.9.7 Comparison with UBF
HSMM-based failure prediction operates on event-triggered input data, and comparative approaches have been selected from this class, too. However, Günther Hoffmann has proposed a failure prediction technique called "Universal Basis Functions" (UBF) and has applied it to symptom monitoring data such as workload and memory consumption¹³ of the same telecommunication system (Hoffmann & Malek [121]). Therefore, results are outlined here for comparison. In Hoffmann [120], values for true/false positives/negatives have been published, shown here as contingency Table 9.4. From this table, a precision
                          True Failure   True Non-failure    Sum
Prediction: Failure              4               49            53
Prediction: No Failure           2              192           194
Sum                              6              241           247

Table 9.4: Contingency table for the UBF failure prediction approach.
of 0.076, recall of 0.667 and false positive rate of 0.2033 are computed. This yields an
F-measure of 0.13559. AUC is reported to be 0.846. It should be noted that the above
values are derived from a rather small data set containing only 247 predictions, among
which are only six failures.
Looking at precision and F-measure, it seems as if UBF were not much better than a random predictor. However, this is not true, since UBF operates on different data: a random predictor applied to the UBF data would only achieve a precision of 0.024, resulting in an F-measure of 0.0463. Furthermore, UBF achieves an AUC value that is similar to the AUC of the HSMM approach, but at considerably lower computational cost: in order to perform a UBF prediction, each kernel has to be evaluated only once¹⁴ and the results are linearly combined. Therefore, in scenarios where, e.g., false positive alarms do not incur high cost, UBF can achieve similar results at lower cost. However, if high precision is a requirement, HSMM outperforms UBF significantly.
9.9.8 Discussion and Summary of Comparative Approaches
In this section, HSMM-based failure prediction has been compared to other well-known prediction techniques: Dispersion Frame Technique (DFT), the Eventset method, and SVD-SVM. Additionally, periodic prediction has been investigated in order to show the performance of the most straightforward failure prediction approach.

¹³ More precisely, a variable selection technique called PWA has been applied, yielding the number of semaphore operations per second and the amount of allocated kernel memory as the most descriptive variables.
¹⁴ In case the embedding dimension is zero, which has been shown to yield best results for UBF.
Failure prediction quality has been expressed as precision, recall, F-measure, and false positive rate (FPR). Figure 9.35 summarizes the results for event-based failure predictors, including 95% BCa confidence intervals obtained from bootstrapping. In summary,
Figure 9.35: Summary of prediction results for comparative approaches. Results are reported
as mean values and 95% confidence intervals.
HSMM-based failure prediction outperforms the other techniques significantly in most of the metrics. The second-best technique is Eventset, which has been developed at IBM and has been used for failure prediction in large-scale parallel systems. However, improved prediction does not come for free: HSMM is by far the most complex failure prediction algorithm with respect to both time and memory consumption. More specifically, Table 9.5 lists training times and the time needed to perform one prediction, along with lower and upper bounds of 95% confidence intervals. It can be observed that, with respect to training, the HSMM approach takes approximately 2.4 times as long as SVD-SVM and almost 60 times as long as Eventset, which is the second-best prediction algorithm in this comparative analysis. Nevertheless, training has no tight real-time constraints and can still be performed within reasonable timescales. With respect to online prediction, HSMM takes much longer in comparison to the other techniques. However, computation times are still sufficiently small if seen in the context of a lead-time of at least five minutes.
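Confidence intervals of the kind reported throughout this chapter can be obtained by resampling per-prediction results. The sketch below uses the plain percentile bootstrap, a simplified stand-in for the BCa variant used in the thesis.

```python
import random

def bootstrap_ci(values, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for statistic(values).
    (The thesis reports BCa intervals; the percentile method shown here
    is a simpler stand-in.)"""
    rng = random.Random(seed)
    stats = sorted(
        statistic([rng.choice(values) for _ in values]) for _ in range(n_boot)
    )
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

Resampling whole test sequences rather than individual values would also capture the diversity effect that widens the intervals in the model aging experiments.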
HSMM-based failure prediction has also been compared to standard HMMs since
HSMMs are derived from HMMs. Prediction performance of a random predictor has been
computed for comparative reasons, and, finally, HSMM-based failure prediction has been
¹⁵ Time variances have been below system time resolution, hence no confidence intervals can be provided here.
Prediction technique          Training [s]                   Online Prediction [s]
Reliability-based periodic    n/a                            n/a
DFT                           n/a                            0.0007¹⁵
Eventset                      17.22 / 22.90 / 28.58          4.9767e-5 / 1.2857e-4 / 2.0728e-4
SVD-SVM (max. F-measure)      572.62 / 572.73 / 572.83       5.7566e-4 / 6.7781e-4 / 7.7995e-4
HSMM (max. F-measure)         1295.90 / 1365.00 / 1434.10    5.7761e-2 / 0.15715 / 0.25655

Table 9.5: Summary of average computation times for comparative approaches with 95% confidence intervals (lower bound / mean / upper bound). Time is reported in seconds.
compared to Universal Basis Functions (UBF), which is a very good failure prediction
technique for the analysis of periodic measurements such as system workload or memory
consumption.
9.10 Summary
In this chapter, the theory developed in previous chapters has been applied to industrial
data of a commercial telecommunication system in order to investigate how well failures
of a complex computing system can be predicted. All the steps from data preprocessing
to an evaluation of failure prediction quality have been thoroughly investigated, which
means that the effects of the various parameters involved have been assessed. In more
detail, the following issues have been covered:
• Data preprocessing consists of assignment of error-IDs to error messages, tupling,
failure sequence clustering, and noise filtering.
• Properties of the resulting data set have been investigated. This involved an analysis
of error frequency, the distribution of inter-error delays, the distribution of failures,
and length of the resulting error sequences.
• Modeling. Parameters involved in HSMM modeling include the number of states,
maximum span of shortcuts, number of intermediate states, intermediate probability mass and distribution, distribution type and amount of background weight,
and number of tries for the Baum-Welch algorithm. Some of these parameters
have been set heuristically while others, for which no values could be determined
upfront, have been investigated with respect to failure prediction performance on
out-of-sample validation data.
• For the given setting of parameters, failure prediction quality has been investigated
in more detail using out-of-sample test data: precision, recall, F-measure, false
positive rate, precision-recall plot, ROC plot, AUC and accumulated runtime cost
have been reported.
• Application-specific parameters, lead-time ∆tl and data window size ∆td, have been explored in order to determine their effect on failure prediction performance.
• Data-specific issues have been investigated in order to determine how failure
prediction depends on the size of the training data set and the temporal distance
between training and test dataset, which can be taken to give an indication of model
aging due to system configuration changes and updates.
• The effects of failure sequence clustering and noise filtering have been investigated.
• In order to show that the theory developed in this thesis really improves failure
prediction quality, a comparative analysis has been performed. The selection of
comparative approaches includes the best-known approaches to error-driven failure
prediction, as they have been identified as subbranches of Category 1.3 in the failure prediction taxonomy (c.f., Figure 3.1 on Page 31). Specifically, HSMM-based
failure prediction has been compared to the Dispersion Frame Technique (DFT) developed by Lin & Siewiorek [167], the Eventset method developed by Vilalta & Ma [268] at IBM, and singular value decomposition with support vector machines (SVD-SVM)
developed by Domeniconi et al. [81]. In order to provide a rough baseline for effortless prediction, periodic prediction on the basis of MTBF has also been applied to the same data. In a further experiment, HSMM-based prediction has been
compared to standard hidden Markov models, a random predictor and Universal
Basis Functions (UBF) developed by Hoffmann [120].
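The contingency-table metrics used throughout this evaluation (precision, recall, F-measure, false positive rate) derive directly from the counts of true/false positive and negative predictions. A minimal sketch in Python; the counts in the usage example are illustrative, not taken from the experiments:

```python
def prediction_metrics(tp, fp, tn, fn):
    """Derive the evaluation metrics from the four prediction-outcome counts."""
    precision = tp / (tp + fp)            # fraction of failure warnings that were correct
    recall = tp / (tp + fn)               # fraction of failures that were predicted
    f_measure = 2 * precision * recall / (precision + recall)
    false_positive_rate = fp / (fp + tn)  # warnings raised in failure-free situations
    return precision, recall, f_measure, false_positive_rate
```

For example, 8 true positives, 2 false positives, 85 true negatives, and 5 false negatives yield a precision of 0.8 and a recall of about 0.62.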
In summary, it has been shown that for industrial data of the commercial telecommunication system, HSMM-based prediction is superior to all failure prediction approaches
it has been compared with. Presumably, the main reasons for this are, first, the efficient exploitation of both the time and the type of error messages by treating them as a temporal sequence and, second, the modeling flexibility provided by HSMMs. For example,
one characteristic is that HSMMs can handle permutations of error symbols occurring
together within a short time interval (c.f., Page 100). This property is relevant for error
sequence-based failure prediction since ordering of error events occurring closely in time
cannot be guaranteed in complex environments such as the telecommunication system.
On the other hand, it must be conceded that modeling flexibility comes at the price of a
considerable number of parameters that need to be adjusted. Hence, applying HSMMs
as a modeling technique requires substantial investigation and experience. Additionally,
computational effort is increased: in comparison with, e.g., SVD-SVM, the HSMM approach takes approximately 2.4 times as long for training and 232 times as
long for online prediction. However, this is still not prohibitive against the background
that HSMM-based failure prediction is able to reliably predict the occurrence of failures
with a lead-time of up to 20 minutes.
Contributions of this chapter. Contributions of this chapter are two-fold. From an
engineering point of view the chapter has shown in detail how an industrial system can be
modelled and how the various parameters can be investigated and adjusted to the specifics
of a system. From a scientific point of view, the main contributions of this chapter are
an in-depth performance evaluation of the HSMM method and a comparative analysis
of the approach to other well-known prediction approaches. Furthermore, it has been
shown that extending standard HMMs to HSMMs is worth the effort since prediction
quality is significantly improved and that the proposed preprocessing techniques, i.e.,
failure sequence clustering and noise filtering, improve failure prediction results.
Relation to other chapters. It has been shown in this chapter that a prediction of upcoming failures is possible in complex computer systems. However, prediction alone does
not improve system dependability! Hence, the following fourth part of the thesis addresses the question of what to do about a failure that has been predicted. In terms of the engineering
cycle, the third phase has been completed and a solution for failure prediction has been
obtained. The next part will cover the fourth and last phase of the engineering cycle,
which focuses on system improvement.
Part IV
Improving Dependability, Conclusions,
and Outlook
Chapter 10
Assessing the Effect on Dependability
The last phase of the engineering cycle, named “improvement”, closes the loop: The goal
is to use the failure prediction solution developed in previous chapters in order to improve
the system with respect to system dependability. However, dependability improvement is
not the primary goal of this dissertation —it focuses mainly on failure prediction. That
is why this last part is shorter and investigations are not as detailed as in previous chapters. More specifically, in Section 10.1 proactive fault management is introduced, which
denotes the combination of online failure prediction and actions to improve system dependability. Related work on previous approaches to model proactive fault management
is provided in Section 10.2. In Sections 10.3 to 10.6, an availability model and a simplified reliability model are proposed, and closed form solutions for availability, reliability
and hazard rate are derived. The issue of parameter estimation from experimental data is
covered by Section 10.7 and some experiments that have been performed in the course of
a diploma thesis primarily supervised by the author are presented in Section 10.8.
10.1 Proactive Fault Management
System dependability cannot be improved solely by predicting failures —some actions
are necessary in order to do something about the failure that has been predicted. As
shown in Figure 1.1 on Page 4, online failure prediction and actions form a cycle where a
running system is continuously monitored in order to obtain data on the current status of
the system, and a prediction algorithm is applied, resulting in a classification of whether the current situation is failure-prone or not. If it is, a failure warning is raised and actions are performed to deal with the anticipated failure. This might include diagnosis to investigate the root cause of the imminent problem and a decision on which technique will
be most effective (see Chapter 12). However, there are two different classes of actions
that can be performed upon failure prediction (see Figure 10.1):
• Downtime avoidance (or failure avoidance) aims at circumventing the occurrence
of the failure such that the system continues to operate without interruption
• Downtime minimization (minimization of the impact of failures) involves downtime, but the goal is to reduce downtime by preparation for true upcoming failures or by intentionally bringing the system down in order to shift it from unplanned to forced downtime.

Figure 10.1: Proactive fault management combines failure prediction and proactive actions. Actions either try to avoid or to minimize downtime.
Although several systems combining failure prediction with actions have been described in the literature, there is no unified name for this approach. Following Castelli
et al. [49], the name proactive fault management (PFM) is used in this thesis.
Several examples of such systems employing PFM have been described in the literature. For example, Castelli et al. [49] describe a resource consumption trend estimation technique that has been implemented in IBM Director Management Software for
xSeries servers that can restart parts of the system. In Cheng et al. [57] a framework called
application cluster service is described that facilitates failover (both preventive and after
a failure) and state recovery services. Li & Lan [164] propose FT-Pro, which is a failure prediction-driven adaptive fault management system. It uses false positive error rate
and false negative rate of a failure predictor together with cost and expected downtime to
choose among the options to migrate processes, to trigger checkpointing or to do nothing.
The behavior of PFM can be described in more detail as follows: If the failure predictor’s analysis suggests that the system is running well and hence no failure is anticipated in
the near future (which is a negative prediction), no action occurs. If a failure is predicted
(a positive prediction), either downtime avoidance actions or downtime minimization actions are performed, or both. However, it is obvious that any failure predictor can make
wrong decisions: the predictor might forecast an upcoming failure even if this is not the
case, which is called a false positive, or the predictor might fail to predict a failure that is imminent in the system, which is called a false negative (c.f., Table 8.1 on Page 153 for an overview of all four cases that may occur). It follows that in case of a
false positive prediction (FP) actions are performed unnecessarily while in case of a false
negative prediction (FN), nothing is done about the failure that is imminent in the system.
Table 10.1 summarizes these cases.
Prediction        Downtime avoidance        Downtime minimization
True positive     Try to prevent failure    Prepare repair (recovery)
False positive    Unnecessary action        Unnecessary preparation
True negative     No action                 No action
False negative    No action                 Standard (unprepared) repair (recovery)

Table 10.1: Actions performed after prediction. For a definition of true/false positives/negatives, see Table 8.1 on Page 153.
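The case distinction of Table 10.1 can be expressed as a small dispatch function; a sketch in Python where the returned strings are shorthand for the table cells (note that at run time only the prediction is known, while the ground truth becomes apparent only in hindsight):

```python
def pfm_actions(positive_prediction, failure_imminent):
    """Map one prediction outcome to the (avoidance, minimization) actions of Table 10.1."""
    if positive_prediction:
        if failure_imminent:                     # true positive
            return ("try to prevent failure", "prepare repair")
        return ("unnecessary action", "unnecessary preparation")  # false positive
    if failure_imminent:                         # false negative
        return ("no action", "unprepared repair")
    return ("no action", "no action")            # true negative
```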
Especially in event-based failure prediction, there are situations where a failure occurs
and no prediction has been performed at all, since there was no triggering event prior to
the failure. However, this case can be easily incorporated by treating it as a false negative
prediction.
All mechanisms that can benefit from the knowledge about an upcoming failure can
be used within PFM. It is not the focus of this thesis to provide a detailed analysis of
all kinds of actions falling into this category, and hence only some major concepts are
described in the following.
10.1.1 Downtime Avoidance
Downtime avoidance actions are triggered by a failure predictor in order to prevent the
occurrence of a failure that seems to be imminent in the system but has not yet occurred.
Three categories of mechanisms can be identified:
• State clean-up tries to avoid failures by cleaning up resources. Examples include
garbage collection, clearance of queues, correction of corrupt data or elimination of
“hung” processes.
• Preventive failover techniques perform a preventive switch to some spare hardware
or software unit. Several variants of this technique exist, one of which is failure prediction-driven load balancing, accomplishing a gradual “failover” from a failure-prone to a failure-free component. For example, Chakravorty et al. [50] describe a
multiprocessor environment that is able to migrate processes in case of an imminent
failure.
• Lowering the load is a common way to prevent failures. For example, web-servers
reject connection requests in order not to become overloaded. Within proactive fault
management, the number of allowed connections is adaptive and would depend on
the risk of failure.
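A prediction-driven admission limit of the kind described in the last bullet could be sketched as follows; the linear scaling policy and the function name are illustrative assumptions, not part of the case study:

```python
def admission_limit(base_limit, failure_risk):
    """Scale the number of admitted connections down linearly with the
    predicted failure risk (0.0 = healthy, 1.0 = failure almost certain)."""
    risk = min(1.0, max(0.0, failure_risk))   # clamp the predictor's output
    return max(1, int(base_limit * (1.0 - risk)))
```

A healthy system keeps the full limit, while a high predicted risk throttles admission down to a minimum.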
10.1.2 Downtime Minimization
Repairing the system after failure occurrence is the classical way of failure handling.
Detection mechanisms such as coding checks, replication checks, timing checks or plausibility checks trigger the recovery. Within PFM, these actions still incur downtime, but
its occurrence is either anticipated or even intended in order to reduce time-to-repair.
More specifically, there are two types of downtime minimization methods:
1. techniques that react to the occurrence of failures, where the goal is to reduce time-to-repair by preparation for the failure. This is called reactive downtime minimization
2. techniques that intentionally bring the system down in order to cause less downtime
in comparison to downtime associated with unplanned failure occurrence. This
class of techniques is termed proactive downtime minimization
Figure 10.2: Improved time-to-repair for prediction-driven repair schemes. (a) sketches classical recovery and (b) improved recovery in case of preparation for an upcoming
failure. “Checkpoint” denotes the last checkpoint before failure, “Failure” the time
of failure occurrence, “Reconfigured” the time when reconfiguration has finished
and “Up” the time when the system is up again. In (b) time-to-repair is improved
since reconfiguration can start after prediction of an upcoming failure and the
prediction-triggered checkpoint is closer to the occurrence of the failure, which
leaves less computation to be redone after reconfiguration.
Reactive downtime minimization. The goal of such techniques can be summarized as bringing the system into a consistent fault-free state. If this state is a previous one (a so-called checkpoint), the action applies a roll-backward scheme (see, e.g.,
Elnozahy et al. [91] for a survey of roll-back recovery in message passing systems). In
this case, all computation from the last checkpoint up to the time of failure occurrence
has to be recomputed. Typical examples are recovery from a checkpoint or the recovery
block scheme introduced by Siewiorek & Swarz [241]. In case of a roll-forward scheme,
the system is moved forward to a consistent state by either dropping or approximating the
computations that have failed (see, e.g., Randell et al. [213]).
Both schemes may comprise reconfiguration such as switching to a hardware spare or
another version of a software program, changing network routing, etc. Reconfiguration
takes place before computations are redone or approximated.
In traditional fault-tolerant computing without PFM, checkpoints are saved independently of upcoming failures, e.g., periodically. When a failure occurs, first reconfiguration
takes place until the system is ready for recomputation / approximation and then all the
computations from the last checkpoint up to the time of failure occurrence are redone.
Time-to-repair (TTR) is determined by two factors: time needed for reconfiguration and
the time needed for recomputation or approximation of lost computations. In the case
of roll-backward strategies, recomputation time is determined by the length of the time
interval between the checkpoint and the time of failure occurrence (see Figure 10.2-a).
In some cases recomputation may take less time than originally but the implication still
holds. Note that not all types of repair actions exhibit both factors contributing to TTR.
A large variety of repair actions exist that can benefit from failure prediction. In
principle, coupling with a failure predictor can reduce both factors contributing to TTR
(see Figure 10.2-b):
• Time needed for reconfiguration can be reduced since reconfiguration can be prepared for an upcoming failure. Think, for example, of a cold spare: Booting the
spare machine can be started right after an upcoming failure has been predicted
(and hence before failure occurrence) such that reconfiguration is almost finished
when the failure occurs.
• Checkpoints may be saved upon failure prediction close to the failure, which reduces the amount of computation that needs to be repeated. This minimizes time
consumed by recomputation. On the other hand, it might not be wise to save a
checkpoint at a time when a failure can be anticipated since the system state might
already be corrupted. The question of whether such a scheme is applicable depends on
fault isolation between the system that is going to observe the failure and the state
that is included in the checkpoint. For example, if the amount of free memory is
monitored for failure prediction but the checkpoint comprises database tables of a
separate database server, it might be safe to rely on the correctness of the database
checkpoint. Additionally, an adaptive checkpointing scheme similar to the one described in Oliner & Sahoo [197] could be applied.
Leangsuksun et al. [157] describe that they have implemented predictive checkpointing
for a high-availability, high-performance Linux cluster.
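The two TTR factors and their reduction, as sketched in Figure 10.2, can be illustrated with a small calculation. All times below are illustrative assumptions; `speedup` models the observation that recomputation may take less time than the original run:

```python
def ttr_classical(reconfig_time, time_since_checkpoint, speedup=1.0):
    # reconfiguration first, then redo all work since the last periodic checkpoint
    return reconfig_time + time_since_checkpoint / speedup

def ttr_prediction_driven(reconfig_time, lead_time, checkpoint_to_failure, speedup=1.0):
    # reconfiguration starts at the failure warning, i.e. lead_time before the
    # failure, and the prediction-triggered checkpoint lies close to the failure
    remaining_reconfig = max(0.0, reconfig_time - lead_time)
    return remaining_reconfig + checkpoint_to_failure / speedup
```

With 120 s of reconfiguration, a periodic checkpoint 600 s before the failure, a 300 s lead-time, and a prediction-triggered checkpoint 60 s before the failure, TTR drops from 720 s to 60 s in this sketch.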
Proactive downtime minimization. Parnas [198] reported on an effect that he called
software aging, a collective name for effects such as memory leaks, unreleased file locks,
file descriptor leaking, or memory corruption. Based on these observations, Huang et al.
introduced a concept that the authors termed rejuvenation. The idea of rejuvenation is
to deal with problems related to software aging by restarting the system (or at least parts
of it). By this approach, unplanned / unscheduled / unprepared downtime incurred by
non-anticipated failures is replaced by forced / scheduled / anticipated downtime. The
authors have shown that —under certain assumptions— overall downtime and downtime
cost can be reduced by this approach. In Candea et al. [43] the approach is extended
by introducing recovery-oriented computing (see, e.g., Brown & Patterson [40]), where
restarting is organized recursively until the problem is solved.
10.2 Related Models
The objective of this chapter is a theoretical assessment of proactive fault management
with respect to system dependability, or more precisely steady-state system availability,
reliability and hazard rate. As is common in reliability theory, a model expressing the
relevant interrelations is used.
Proactive fault management is rooted in preventive maintenance, which has been a research issue for several decades (an overview can be found, e.g., in Gertsbakh [105]).
More specifically, proactive fault management belongs to the category of condition-based
preventive maintenance (c.f., e.g., Starr [250]). However, the majority of work has been
focused on industrial production systems such as heavy assembly line machines and, more recently, on computing hardware. With respect to software, preventive maintenance has
focused more on long-term software product aging such as software versions and updates
rather than short-term execution aging. The only exception is software rejuvenation, which
has been investigated heavily (c.f., e.g., Kajko-Mattson [139]).
Starting from general preventive maintenance theory, Kumar & Westberg [150] compute reliability of condition-based preventive maintenance. However, their approach is
based on a graphical analysis of so-called total time on test plots of singleton observation
variables such as temperature, rendering the approach inappropriate for application
to automatic proactive fault management in software systems. An approach better suited
to software has been presented by Amari & McLaughlin [9]. They use a continuous-time
Markov chain (CTMC) to model system deterioration, periodic inspection, preventive
maintenance and repair. However, one of the major disadvantages of their approach is
that they assume perfect periodic inspection, which does not reflect failure prediction
reality, as has been shown along with the case study presented in Chapter 9.
A significant body of work has been published addressing software rejuvenation. Initially, Huang et al. [126] have used a CTMC in order to compute steady-state availability
and expected downtime cost. In order to overcome various limitations of the model, e.g.,
that constant transition rates are not well-suited to model software aging, several variations to the original model of Huang et al. have been published over the years, some of
which are briefly discussed here. Dohi et al. have extended the model to a semi-Markov
process to deal more appropriately with the deterministic behavior of periodic restarting.
Furthermore, they have slightly altered the topology of the model since they assume that there
are cases where a repair does not result in a clean state and restart (rejuvenation) has to be
performed after repair. The authors have computed steady-state availability (Dohi et al.
[80]) and cost (Dohi et al. [79]) using this model. Cassady et al. [47] propose a slightly
different model and use Weibull distributions to characterize state transitions. However,
due to this choice, the model cannot be solved analytically and an approximate solution
from simulated data is presented.
Garg et al. [101] have used a three-state discrete-time Markov chain (DTMC) with
two subordinated non-homogeneous CTMCs to model rejuvenation in transaction processing systems. One subordinated CTMC models queuing behavior of transaction processing and the second models preventive maintenance. The authors compute steady-state
availability, probability of losing a transaction, and an upper bound on response time
for periodic rejuvenation. They model a more complex scheme that starts rejuvenation
when the processing queue is empty. The same three-state macro-model has been used in
Vaidyanathan & Trivedi [262], but here, time-to-failure is estimated using a monitoring-based subordinated semi-Markov reward model. However, for model solution, the authors
approximate time-to-failure with an increasing failure rate distribution.
Leangsuksun et al. [158] have presented a detailed stochastic reward net model of a
high-availability cluster system in order to model availability. The model differentiates
between servers, clients and network. Furthermore, it distinguishes permanent as well
as intermittent failures that are either covered (i.e., eliminated by reconfiguration) or uncovered (i.e., eliminated by rebooting the cluster). Again, the model is too complex to
be analyzed analytically and hence simulations are performed. An analytical solution for
computing the optimal rejuvenation schedule is provided by Andrzejak & Silva [10] who
use deterministic function approximation techniques to characterize the relationship between aging factors and work metrics. The optimal rejuvenation schedule can then be
found by an analytical solution to an optimization problem.
The key property of PFM is that it operates upon failure predictions rather than on a
purely time-triggered execution of fault-tolerance mechanisms. One of the first papers
to address this issue is Vaidyanathan et al. [261]. The authors propose several stochastic
reward nets (SRN), one of which explicitly models prediction-based rejuvenation. However, there are two limitations to this model: first, only one type of wrong prediction
is covered, and second, the model is tailored to rejuvenation —downtime avoidance or
reactive downtime minimization are not included. Furthermore, due to the complexity
of the model, no analytical solution for availability is presented. Focusing on service
degradation, Bao et al. [21] propose a CTMC that includes the number of service requests
in the system plus the amount of leaked memory. An adaptive rejuvenation scheme is
analyzed that is based on estimated resource consumption. Later, the model has been
combined with the three-state macro model in order to compute availability (Bao et al.
[22]). However, this model also does not investigate the effect of mispredictions.
Last but not least, the model presented in this dissertation is not the first attempt to assess the effects of proactive fault management. In Salfner & Malek [225], an approach has
been published that directly extends the well-known formula for steady-state availability:
A = MTTF / (MTTF + MTTR)    (10.1)
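As a numerical illustration with hypothetical values, MTTF = 1000 h and MTTR = 2 h give A = 1000/1002 ≈ 0.998:

```python
def steady_state_availability(mttf, mttr):
    # Equation 10.1: long-run fraction of time the system is up
    return mttf / (mttf + mttr)
```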
However, the approach proposed in Salfner & Malek [225] had three limitations:
1. It did not clearly distinguish between true and false positive and negative predictions. This flaw resulted in an inappropriate handling of prevented and induced
failures.
2. Only steady-state availability could be estimated. Other dependability metrics such
as reliability and hazard rate could not be computed.
3. The model is implicit and hence not transparent enough to help better understand the behavior of
proactive fault management.
In summary, to the best of our knowledge, no work has been published that captures downtime avoidance as well as reactive and proactive downtime minimization, and that incorporates all four cases of failure prediction: true and false positives as well as negatives.
10.3 The Availability Model
As is the case for many of the rejuvenation models mentioned before, the model developed
here is based on the CTMC originally published by Huang et al. [126]. First, the original
model is briefly presented and then the new model is introduced.
10.3.1 The Original Model for Software Rejuvenation by Huang et al.
As described by Parnas, software aging can be observed in long-running software. However, software aging does not cause the software to crash immediately but increases the
risk of failure. For example, if a memory leak is present, the amount of available memory
is continuously decreasing (in long-term behavior). Assuming that each service request
requires some (stochastically distributed) amount of memory, the risk that some service
request fails due to insufficient free memory is increasing over time. However, if the
maximum number of concurrent service requests and the maximum amount of memory
consumption of each service request are limited, software aging does not affect service
availability as long as the amount of free memory is above some threshold. This observation is one of the key concepts in the model for rejuvenation proposed by Huang et al.
[126]: there exists a failure-probable state SP that a running system eventually enters (see
234
10. Assessing the Effect on Dependability
Figure 10.3). In the example, the system transits into this state when the amount of free
memory drops below the described threshold. Rejuvenation is performed periodically in
order to clean up the system and to bring it back into the fault-free state S0.
The occurrence of forced downtime (e.g., incurred by rejuvenation) is known while
failures occur stochastically (unplanned downtime). The key notion of software rejuvenation is that both downtime and the associated downtime cost are less for forced downtime
than for unplanned downtime. Therefore, the model has two different down states: one for rejuvenation (SR) and one for failures (SF). Since the periodically triggered restarting process
is different from repair after failure, two transition rates r1 and r3 are used.
Figure 10.3: The original CTMC model as used by Huang et al. [126] to compute availability
of a system with rejuvenation. S0 denotes the state when everything is up and
running, SP the failure-probable state, SR the rejuvenation state and SF the
failed state with appropriate transition rates as used in the original paper
10.3.2 Availability Model for Proactive Fault Management
In order to develop an availability model for proactive fault management, three key differences are taken into account:
• In addition to rejuvenation, proactive fault management involves downtime avoidance techniques. In terms of the model, this means that there needs to be some way
to get from the failure probable state back to the S0 state without an intermediate
down state.
• Proactive fault management actions operate upon failure prediction rather than periodically. However, predictions can be correct or false. Moreover, it makes a
difference whether there really is a failure imminent in the system or not. Hence,
the single failure-probable state SP in Figure 10.3 needs to be split up into a more
fine-grained analysis: According to the four cases of prediction, there is a state
for true positive predictions (ST P ), false positive predictions (SF P ), true negative
predictions (ST N ) and false negative predictions (SF N ).
• Besides rejuvenation, which is a proactive downtime minimization technique,
proactive fault management also comprises reactive downtime minimization actions. However, both types of actions can be assessed in terms of their effect on
time-to-repair. Hence, it is sufficient to keep up two down states: one for prepared /
forced downtime (SR ) and one for unprepared / unplanned downtime (SF ).
The resulting CTMC is shown in Figure 10.4.
Figure 10.4: Availability CTMC for proactive fault management. State S0 is the fault-free state.
States ST P , SF P , ST N and SF N are failure-probable states corresponding to
the four cases of failure prediction correctness. States 5 and 6 are “down” states
where SR accounts for forced downtime caused by scheduled restart or prepared repair, and SF accounts for the unplanned counterpart.
In order to better explain the model, consider the following scenario: Starting from the
up-state S0 a failure prediction is performed at some point in time. If the predictor comes
to the conclusion that a failure is imminent, the prediction is a positive and a failure warning is raised. If this is true (something is really going wrong in the system) the prediction
is a true positive and a transition into ST P takes place. Due to the warning, some actions
are performed in order to either prevent the failure from occurring (downtime avoidance),
or to prepare for some forced downtime (downtime minimization). Assuming first that
some preventive actions are performed, let
PT P := P (failure | true positive prediction)
(10.2)
denote the probability that the failure occurs despite preventive actions. Hence,
with probability PT P a transition into failure state SR takes place, and with probability
(1 − PT P ) the failure can be avoided and the system returns to state S0 . Due to the fact
that a failure warning was raised (the prediction was a positive one), preparatory actions
have been performed and repair is quicker (on average), such that state S0 is entered with
rate rR .
If the failure warning is wrong (in truth the system is doing well) the prediction is
a false positive (state SF P ). In this case actions are performed unnecessarily. However,
although no failure was imminent in the system, there is some risk that a failure is caused
by the additional workload for failure prediction and subsequent actions. Hence, let
PF P := P (failure | false positive prediction)
(10.3)
denote the probability that an additional failure is induced. Since there was a failure warning, preparation
for an upcoming failure has been carried out and hence the system transits into state SR .
In case of a negative prediction (no failure warning is issued) no action is performed.
If the judgment that the current situation is not failure-prone is correct (there is no
failure imminent), the prediction is a true negative (state ST N ). In this case, one would
expect that nothing happens since no failure is imminent. However, depending on the
system, even failure prediction (without subsequent actions) may put additional load onto
the system which can lead to a failure although no failure was imminent at the time when
the prediction started. Hence there is also some small probability of failure occurrence in
the case of a true negative prediction:
PT N := P (failure | true negative prediction) .
(10.4)
Since no failure warning has been issued, the system is not prepared for the failure and
hence a transition to state SF rather than SR takes place. This implies that the transition
back to the fault-free state S0 occurs at rate rF , which takes longer (on average). If
no additional failure is induced, the system returns to state S0 directly with probability
(1 − PT N ).
If the predictor does not recognize that something is going wrong in the system and a failure comes up, the prediction is a false negative (state SF N ). Since nothing is done about
the failure that comes up there is no transition back to the up-state and the model transits
to the failure state SF without any preparation. The reason why there is an intermediate
state SF N originates from the way transition rates are computed, as explained in the next
section.
10.4 Computing the Rates of the Model
Reliability modeling is typically performed to investigate new techniques for systems that are under design¹ in order to determine their potential effect on system parameters such as availability. The model shown in Figure 10.4 comprises the following parameters:
• P_TP, P_FP, P_TN denote the probability of failure occurrence given a true positive, false positive, or true negative prediction.
• r_TP, r_FP, r_TN, and r_FN denote the rates of true / false positive and negative predictions.
• r_A denotes the action rate, which is determined by the average time from the start of the prediction to downtime or to the return to the fault-free state.
• r_R denotes the repair rate for forced / prepared downtime.
• r_F denotes the repair rate for unplanned downtime.
However, some of these parameters are difficult to determine. Therefore, more intuitive parameters are used, from which the rates of the CTMC model are computed. Usually, there are two groups of parameters:
1. fixed parameters that are estimated / measured from a given system or determined
by the application area
¹ As already mentioned, in the case of this dissertation it was not possible to try the methods on the commercial system.
2. parameters that shall be investigated / optimized in order to assess their effect on target metrics.

In the case considered here, it is assumed that a system without proactive fault management shall be extended by PFM, and the effect of PFM with respect to availability, reliability, and hazard rate shall be investigated. More specifically, it is assumed that the fixed parameters comprise mean-time-to-failure (MTTF), mean-time-to-repair (MTTR), lead-time ∆t_l, and prediction-period ∆t_p. The second group of parameters (those that shall be investigated) includes parameters evaluating the accuracy of failure prediction and parameters assessing the efficiency of actions. Table 10.2 summarizes the specific parameters that are used in the following. Note that, in contrast to the definition in Section 8.2.2, for readability reasons the single letter "f" is used to denote the false positive rate in this chapter.
Parameter                                      Symbol   Fixed   Investigated
Mean time to failure (system w/o PFM)          MTTF     X
Mean time to repair (system w/o PFM)           MTTR     X
Lead-time                                      ∆t_l     X
Prediction-period                              ∆t_p     X
Precision                                      p                X
Recall                                         r                X
False positive rate                            f                X
Failure probability given TP prediction        P_TP             X
Failure probability given FP prediction        P_FP             X
Failure probability given TN prediction        P_TN             X
Repair time improvement                        k                X

Table 10.2: Parameters used for modeling
In summary, it is intuitively clear that any proactive fault management technique should strive to achieve the following parameter values in order to minimize downtime:

1. Failure prediction should be as accurate as possible. This translates into high precision, high recall, and a low false positive rate.
2. Failure occurrence probabilities P_TP, P_FP, and P_TN should be as close to zero as possible.
3. Time to repair for forced downtime / prepared repair should be as small as possible
in comparison to repair time for unplanned / accidental downtime.
10.4.1 The Parameters in Detail
Parameters can be divided into three groups (see Table 10.2):
1. Precision, recall, and false positive rate specify failure prediction accuracy².

² In a general sense, not as strict as in Definition 8.10 on Page 156.
2. Failure probabilities P_TP, P_FP, P_TN assess the effectiveness of downtime avoidance and the risk of additional failures that are induced by the additional workload of prediction and actions.

3. Repair time improvement factor k determines the effectiveness of downtime minimization.
Failure prediction accuracy. Figure 10.5 visualizes all four cases of failure prediction correctness, including lead-time ∆t_l and prediction-period ∆t_p. The case that a failure occurs without any failure prediction being performed³ is mapped to a missing failure warning, which is a false negative prediction. Although the contingency table, precision, recall, and false positive rate have been defined in Chapter 8 (c.f., Table 8.1 on Page 153, Equations 8.3 and 8.4 on Page 155, and Equation 8.9 on Page 156), they are repeated here for convenience. Also, the notation is slightly changed in order to emphasize that the metrics are defined by numbers that have been counted during an experiment: For example, n_TP denotes the number of true positive predictions within one experiment with a total of n predictions.
Figure 10.5: A timeline showing failures (t) and all four types of predictions (P): true positive, false positive, false negative, and true negative. A failure is counted as predicted if it occurs within the prediction-period of length ∆t_p, which starts lead-time ∆t_l after the beginning of prediction.
Table 10.3 shows the modified version of the contingency table.

                           True Failure   True Non-failure   Sum
Prediction: Failure        n_TP           n_FP               n_POS
Prediction: No failure     n_FN           n_TN               n_NEG
Sum                        n_F            n_NF               n

Table 10.3: This contingency table is a simplified version of Table 8.1 on Page 153. It emphasizes that the fields consist of the number of true positive (n_TP), false positive (n_FP), etc. predictions from an experiment with a total of n predictions.

³ E.g., in error-based prediction, if no error occurs prior to a failure.

Using this notation, precision, recall, and false positive rate are defined as follows:

Precision             p = n_TP / (n_TP + n_FP) = n_TP / n_POS   (10.5)
Recall                r = n_TP / (n_TP + n_FN) = n_TP / n_F     (10.6)
False positive rate   f = n_FP / (n_FP + n_TN) = n_FP / n_NF .  (10.7)
Effectiveness of downtime avoidance and risk of induced failures. Preventive actions are applied in order to avoid an imminent failure, which affects time-to-failure (TTF). However, the opposite effect may also occur: due to the additional load generated by failure prediction and actions, failures can be provoked that would not have occurred if no PFM had been in place. In order to account for this effect, the model uses three probabilities corresponding to the types of failure prediction correctness:

P_TP is the probability that a failure occurs in case of a correct warning. This is the probability that the preventive action is not successful.

P_FP is the probability of failure occurrence in case of a false positive warning. Since no failure is imminent at the time of prediction, it corresponds to the probability that a failure is provoked by the extra load of failure prediction and subsequent actions.

P_TN is the probability that an extra failure is provoked by prediction alone: since it is a true negative prediction, a failure occurs although no failure is imminent in the system and no actions are performed.

There is no need to define a probability for false negative predictions, since nothing is done about the failure that will occur. The probability of failure occurrence is hence equal to one.
Effectiveness of downtime minimization. The effects of forced downtime / prepared repair on availability, reliability, and hazard rate are gauged by time-to-repair. More specifically, the effect is expressed by the mean relative improvement, i.e., how much faster the system is up in case of forced downtime / prepared repair in comparison to MTTR after an unanticipated failure:

k = MTTR / MTTR_p ,   (10.8)

which is the ratio of MTTR without preparation to MTTR for the forced / prepared case. Obviously, one would expect that preparation for upcoming failures improves MTTR, thus k > 1, but the definition also allows k < 1, corresponding to a change for the worse.
10.4.2 Computing the Rates from Parameters

CTMC models express temporal behavior using exponential distributions for the timing in a state before transitioning. Exponential distributions are determined by a single parameter: the transition rate. In this dissertation, only constant transition rates are considered, which are determined by the inverse of the mean time.
It is the objective of this section to relate the model's rates r_TP, r_FP, r_TN, r_FN, r_A, r_R, and r_F to the more intuitive parameters listed in Table 10.2. Using the formulas developed in the following, the rates of the CTMC can be computed from these more intuitive parameters. The text follows a bottom-up approach: basic relationships and equations are developed first, which are subsequently used to derive the equations for the CTMC rates given by Equations 10.30 to 10.33.

The starting point for the computations is to determine the distribution of predictions among true and false positives and negatives. This can be obtained using the prediction-related metrics precision, recall, and false positive rate. The distribution is expressed by the number of, e.g., true positive predictions divided by the total number of predictions. By reference to Table 10.3 and the definitions given by Equations 10.5 to 10.7, it can be derived that:
n = n_F + n_NF                                   (10.9)
  = n_TP / r + n_FP / f                          (10.10)
  = n_TP / r + (n_POS − n_TP) / f                (10.11)
  = n_TP / r + n_TP / (p f) − n_TP / f           (10.12)

⇒  n_TP / n = 1 / ( 1/r + 1/(p f) − 1/f ) ,      (10.13)

which is an equation to compute the fraction of true positive predictions n_TP (in comparison to the total number of predictions n) from the prediction-related parameters precision, recall, and false positive rate.

In order to compute the fractions of false positive, false negative, and true negative predictions, it is necessary to determine:
n_POS / n = (1/p) (n_TP / n)                     (10.14)
n_F / n   = (1/r) (n_TP / n) ,                   (10.15)

which leads to

n_FP / n = n_POS / n − n_TP / n                  (10.16)
n_FN / n = n_F / n − n_TP / n                    (10.17)
n_TN / n = n_NF / n − n_FP / n = 1 − n_F / n − n_FP / n .   (10.18)
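Equations 10.13 to 10.18 translate directly into a few lines of code. The following sketch (the function name and variable names are ours, not the dissertation's) derives all four fractions from precision, recall, and false positive rate:

```python
def prediction_fractions(p, r, f):
    """Fractions n_TP/n, n_FP/n, n_FN/n, n_TN/n of the four prediction
    types among all predictions, computed from precision p, recall r,
    and false positive rate f (Equations 10.13 to 10.18)."""
    tp = 1.0 / (1.0 / r + 1.0 / (p * f) - 1.0 / f)  # Eq. 10.13: n_TP / n
    pos = tp / p                                     # Eq. 10.14: n_POS / n
    fail = tp / r                                    # Eq. 10.15: n_F / n
    fp = pos - tp                                    # Eq. 10.16: n_FP / n
    fn = fail - tp                                   # Eq. 10.17: n_FN / n
    tn = 1.0 - fail - fp                             # Eq. 10.18: n_TN / n
    return tp, fp, fn, tn
```

As a consistency check, the four fractions sum to one, and re-deriving p, r, and f from them reproduces the inputs.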
Now that the relative distribution among true and false positive and negative predictions is known, the corresponding transition rates r_TP, r_FP, r_TN, r_FN can be computed. The approach is to first compute the overall prediction rate r_p, which determines the timing of the process once it has entered state S_0. The mean time is determined by mean-time-to-prediction (MTTP), which is computed in two steps: First, mean-time-between-predictions (MTBP) is computed from the temporal parameters MTTF, MTTR, lead-time ∆t_l, and prediction-period ∆t_p. MTTF and MTTR are assumed to be known from a system without PFM (that is why they are fixed parameters). Then, in a second step, MTTP is obtained from MTBP by subtracting the time needed for prediction, repair, etc.
The principal notion to compute MTBP is that there are x times as many predictions as true failures. By assuming the number of predictions to be n and the number of true failures to be n_F, x can be determined by expressing n in terms of n_F, as shown in the following:

n = n_F + n_NF                                   (10.19)
  = n_F + n_FP / f                               (10.20)
  = n_F + n_POS / f − n_TP / f                   (10.21)
  = n_F + n_TP / (p f) − n_TP / f                (10.22)
  = n_F + n_TP ( 1/(p f) − 1/f )                 (10.23)
  = n_F + n_F r ( 1/(p f) − 1/f )                (10.24)
  = n_F ( 1 + r (1−p)/(p f) ) .                  (10.25)

This means that there are 1 + r (1−p)/(p f) times as many predictions as failures. Hence it can be concluded that for the mean times it holds:

MTBP = MTBF / ( 1 + r (1−p)/(p f) ) ,            (10.26)

where MTBF denotes "mean-time-between-failures" for a system without proactive fault management, which can be computed from MTTF and MTTR by the standard formula

MTBF = MTTF + MTTR .                             (10.27)
As can be seen in Figure 10.6, MTTP can be computed from MTBP by subtracting lead-time ∆t_l and repair time R. Additionally, half of the prediction-period has to be subtracted, since a failure may occur at any time within the prediction-period and hence, on average, failures occur at half of the prediction-period.⁴ However, for a system with PFM, repair time R is not equal to MTTR, since there are two different repair times: one for prepared repair (or forced downtime) and one for the unprepared / unplanned case. But as we only consider mean values, mean repair time R is a combination of both cases, and the mixture is determined by the fraction of positive predictions in comparison to negative predictions. More specifically, MTTP is given by:

MTTP = MTBP − ∆t_l − ∆t_p/2 − ((n_TP + n_FP)/n) MTTR_p − ((n_TN + n_FN)/n) MTTR ,   (10.28)

where MTTR_p is the mean-time-to-repair for the case of forced / prepared downtime. It is related to the MTTR of unplanned downtime by the repair time improvement factor k (c.f., Equation 10.8). Finally, prediction rate r_p is computed by:

r_p = [ (MTTF + MTTR) / (1 + r (1−p)/(p f)) − ∆t_l − ∆t_p/2 − ( (n_TP + n_FP)/(k n) + (n_TN + n_FN)/n ) MTTR ]⁻¹ .   (10.29)

⁴ To be precise, a symmetric distribution centered around the middle of the prediction-period is assumed, i.e., a distribution with zero skewness and median equal to ∆t_p/2.

Figure 10.6: Time relations for prediction. Failures are indicated by t, predictions by P, and repair by R.
As already mentioned, the transition rates from S_0 to S_TP, S_FP, S_TN, and S_FN are determined by distributing r_p among true / false positive / negative predictions:

r_ij = (n_ij / n) · r_p    where i ∈ {T, F} and j ∈ {P, N} ,   (10.30)

where n_ij / n denotes the fractions given by Equations 10.13 to 10.18.
The three remaining rates are the action rate (r_A), the repair rate for forced downtime / prepared repair (r_R), and the repair rate for an unprepared failure (r_F). r_A is characterized by the average time from the beginning of the prediction to the occurrence of downtime or its prevention and can hence be computed from lead-time ∆t_l and prediction-period ∆t_p:

r_A = 1 / (∆t_l + ∆t_p/2) .                      (10.31)

Repair rate r_F is determined by the repair rate of a system without PFM, which is the inverse of MTTR:

r_F = 1 / MTTR                                   (10.32)

and the repair rate for forced downtime / prepared repair is determined by MTTR and k:

r_R = 1 / MTTR_p = k / MTTR = k r_F .            (10.33)
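Putting Equations 10.26 to 10.33 together, the complete set of CTMC rates can be computed from the intuitive parameters of Table 10.2. The following sketch is a direct transcription (the function name and dictionary keys are our naming, not the dissertation's):

```python
def ctmc_rates(mttf, mttr, dt_l, dt_p, p, r, f, k):
    """CTMC transition rates from the intuitive parameters of Table 10.2."""
    # Fractions of prediction types (Eqs. 10.13-10.18)
    tp = 1.0 / (1.0 / r + 1.0 / (p * f) - 1.0 / f)
    fp = tp / p - tp
    fn = tp / r - tp
    tn = 1.0 - tp / r - fp
    # Mean time between predictions (Eqs. 10.26-10.27)
    mtbp = (mttf + mttr) / (1.0 + r * (1.0 - p) / (p * f))
    # Mean time to prediction (Eq. 10.28, with MTTR_p = MTTR / k)
    mttp = mtbp - dt_l - dt_p / 2.0 - (tp + fp) * mttr / k - (tn + fn) * mttr
    r_p = 1.0 / mttp                          # Eq. 10.29
    rates = {ij: frac * r_p for ij, frac in   # Eq. 10.30
             zip(("TP", "FP", "FN", "TN"), (tp, fp, fn, tn))}
    rates["A"] = 1.0 / (dt_l + dt_p / 2.0)    # Eq. 10.31
    rates["F"] = 1.0 / mttr                   # Eq. 10.32
    rates["R"] = k * rates["F"]               # Eq. 10.33
    rates["p"] = r_p
    return rates
```

By construction, the four prediction rates sum to the overall prediction rate r_p, which is a useful sanity check on any implementation.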
10.5 Computing Availability

Steady-state availability is defined as the portion of uptime versus lifetime, which is equivalent to the portion of time the system is up. In terms of our CTMC model, this quantity can be determined by the equilibrium state distribution: It is the portion of probability mass in steady-state assigned to the non-failed states, which are S_0, S_TP, S_FP, S_TN, and S_FN. In order to simplify the presentation, numbers 0 to 6, as indicated in Figure 10.4, are used to identify the states of the CTMC.
The infinitesimal generator matrix Q of the CTMC model is:

        ⎡  −r_p           r_TP   r_FP   r_TN   r_FN    0          0        ⎤
        ⎢ (1−P_TP) r_A   −r_A    0      0      0       P_TP r_A   0        ⎥
        ⎢ (1−P_FP) r_A    0     −r_A    0      0       P_FP r_A   0        ⎥
    Q = ⎢ (1−P_TN) r_A    0      0     −r_A    0       0          P_TN r_A ⎥   (10.34)
        ⎢  0              0      0      0     −r_A     0          r_A      ⎥
        ⎢  r_R            0      0      0      0      −r_R        0        ⎥
        ⎣  r_F            0      0      0      0       0         −r_F      ⎦
The equilibrium state distribution of a CTMC can be determined by solving the global balance equations. This is equivalent to a solution of the following linear equation system (see, e.g., Kulkarni [149]):

π Q = 0                                          (10.35)
s.t.   Σ_{i=0}^{6} π_i = 1 .                     (10.36)

The way to a solution is based on the following observation: If π is a solution to Equation 10.35, then each scaling of π is also a solution, and hence an infinite number of solutions exists, one of which fulfills Equation 10.36. Therefore, π_6 is arbitrarily set to one and the inhomogeneous equation system π′ Q′ = b is solved by Gaussian elimination, yielding a single solution π′, where Q′ is

         ⎡  −r_p           r_TP   r_FP   r_TN   r_FN    0        ⎤
         ⎢ (1−P_TP) r_A   −r_A    0      0      0       P_TP r_A ⎥
    Q′ = ⎢ (1−P_FP) r_A    0     −r_A    0      0       P_FP r_A ⎥   (10.37)
         ⎢ (1−P_TN) r_A    0      0     −r_A    0       0        ⎥
         ⎢  0              0      0      0     −r_A     0        ⎥
         ⎣  r_R            0      0      0      0      −r_R      ⎦

and

b = [ −r_F   0   0   0   0   0 ] .               (10.38)
The final solution π is obtained by scaling the π′_i such that the sum equals one (c.f., Equation 10.36):

π_i = π′_i / ( Σ_{j=0}^{5} π′_j + 1 )    for i ∈ {0 … 5}
π_6 = 1 / ( Σ_{j=0}^{5} π′_j + 1 ) .             (10.39)
By exploiting that r_R = k r_F, the results can be further simplified. The equations for the π_i are provided in Table 10.4, where D denotes the common denominator

D = k r_F (r_A + r_p) + r_A (P_FP r_FP + P_TP r_TP + k P_TN r_TN + k r_FN) .

π_0 = k r_F r_A / D
π_1 = k r_F r_TP / D
π_2 = k r_F r_FP / D
π_3 = k r_F r_TN / D
π_4 = k r_F r_FN / D
π_5 = r_A (P_FP r_FP + P_TP r_TP) / D
π_6 = k r_A (P_TN r_TN + r_FN) / D

Table 10.4: Solution to the equation system defined by Equations 10.35 and 10.36. The π_i are the equilibrium (steady-state) probabilities of the states in the availability model.
Steady-state availability is determined by the portion of time the stochastic process stays in one of the up-states 0 to 4:

A = Σ_{i=0}^{4} π_i = 1 − π_5 − π_6

A = k r_F (r_A + r_p) / [ k r_F (r_A + r_p) + r_A (P_FP r_FP + P_TP r_TP + k P_TN r_TN + k r_FN) ] ,   (10.40)

yielding a closed-form solution for the steady-state availability of systems with PFM.
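As a sanity check of Equation 10.40, the balance equations can also be solved numerically for the full generator matrix of Equation 10.34. The sketch below uses purely illustrative parameter values (they are not taken from the dissertation):

```python
import numpy as np

# Illustrative parameter values only
r_p, r_A, r_F, k = 0.003, 0.5, 0.1, 2.0
r_R = k * r_F
P_TP, P_FP, P_TN = 0.3, 0.05, 0.01
# Distribute r_p among the prediction types (fractions are illustrative)
r_TP, r_FP, r_TN, r_FN = (x * r_p for x in (0.28, 0.07, 0.62, 0.03))

# Generator matrix Q of Eq. 10.34; states 0..6 = S0, STP, SFP, STN, SFN, SR, SF
Q = np.array([
    [-r_p,              r_TP,  r_FP,  r_TN,  r_FN,  0.0,        0.0],
    [(1 - P_TP) * r_A, -r_A,   0.0,   0.0,   0.0,   P_TP * r_A, 0.0],
    [(1 - P_FP) * r_A,  0.0,  -r_A,   0.0,   0.0,   P_FP * r_A, 0.0],
    [(1 - P_TN) * r_A,  0.0,   0.0,  -r_A,   0.0,   0.0,        P_TN * r_A],
    [0.0,               0.0,   0.0,   0.0,  -r_A,   0.0,        r_A],
    [r_R,               0.0,   0.0,   0.0,   0.0,  -r_R,        0.0],
    [r_F,               0.0,   0.0,   0.0,   0.0,   0.0,       -r_F],
])

# Solve pi Q = 0 s.t. sum(pi) = 1 by replacing one (redundant) balance
# equation with the normalization condition (Eqs. 10.35-10.36).
M = Q.T.copy()
M[-1, :] = 1.0
pi = np.linalg.solve(M, np.r_[np.zeros(6), 1.0])

A_numeric = pi[:5].sum()  # probability mass in the up-states 0..4
A_closed = (k * r_F * (r_A + r_p)) / (
    k * r_F * (r_A + r_p)
    + r_A * (P_FP * r_FP + P_TP * r_TP + k * P_TN * r_TN + k * r_FN))
```

Replacing one balance equation with the normalization works because the rows of Q are linearly dependent for an irreducible CTMC, so the remaining system is nonsingular.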
10.6 Computing Reliability

Reliability R(t) is defined as the probability that no failure occurs up to time t, given that the system is fully operational at t = 0. In terms of CTMC modeling, this is equivalent to a non-repairable system and the computation of the first passage time into the down-state.
10.6.1 The Reliability Model

Since a non-repairable system is to be modeled, the distinction between two down-states (S_R and S_F) is no longer required. Furthermore, there is no transition back to the up-state. That is why a simpler topology can be used, where the two failure states are merged into one absorbing state S_F, as shown in Figure 10.7.

Figure 10.7: CTMC model for reliability. Failure states 5 and 6 of Figure 10.4 have been merged into one absorbing state S_F.
The generator matrix for this model has the form:

    Q = ⎡ T   t_0 ⎤
        ⎣ 0    0  ⎦ ,   (10.41)

where T equals:

        ⎡  −r_p           r_TP   r_FP   r_TN   r_FN ⎤
        ⎢ (1−P_TP) r_A   −r_A    0      0      0    ⎥
    T = ⎢ (1−P_FP) r_A    0     −r_A    0      0    ⎥   (10.42)
        ⎢ (1−P_TN) r_A    0      0     −r_A    0    ⎥
        ⎣  0              0      0      0     −r_A  ⎦

and t_0 equals:

t_0 = [ 0   P_TP r_A   P_FP r_A   P_TN r_A   r_A ]ᵀ .   (10.43)

10.6.2 Reliability and Hazard Rate
The distribution of the probability to first reach the down-state S_F yields the cumulative distribution of time-to-failure. In terms of CTMCs, this quantity is called the first-passage-time distribution F(t). Reliability R(t) and hazard rate h(t) can be computed from F(t) in the following way:

R(t) = 1 − F(t)                                  (10.44)
h(t) = f(t) / (1 − F(t)) ,                       (10.45)

where f(t) denotes the corresponding probability density of F(t). F(t) and f(t) are the cumulative distribution and density of a phase-type exponential distribution defined by T and t_0 (see, e.g., Kulkarni [149]):

F(t) = 1 − α exp(t T) e                          (10.46)
f(t) = α exp(t T) t_0 ,                          (10.47)

where e is a vector of all ones, exp(t T) denotes the matrix exponential, and α is the initial state probability distribution. It can be determined from the fact that reliability is defined such that the system is fully operational at time t = 0. Hence:

α = [ 1   0   0   0   0 ] .                      (10.48)
Closed-form expressions exist and can be computed using a symbolic computer algebra tool. However, the solution would fill several pages⁵ and is hence not provided here.
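Even without the symbolic closed form, R(t) and h(t) can be evaluated numerically from Equations 10.44 to 10.48. The sketch below uses illustrative parameter values (not taken from the dissertation) and a hand-rolled matrix exponential to stay dependency-free; scipy.linalg.expm would do the same job:

```python
import numpy as np

def mat_exp(A, terms=30):
    """exp(A) via scaling-and-squaring with a truncated Taylor series
    (adequate for the small, well-scaled matrices used here)."""
    norm = np.abs(A).sum(axis=1).max()
    s = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    B, E, term = A / 2.0 ** s, np.eye(A.shape[0]), np.eye(A.shape[0])
    for i in range(1, terms):
        term = term @ B / i
        E = E + term
    for _ in range(s):
        E = E @ E
    return E

# Illustrative parameters only
r_p, r_A = 0.003, 0.5
P_TP, P_FP, P_TN = 0.3, 0.05, 0.01
r_TP, r_FP, r_TN, r_FN = (x * r_p for x in (0.28, 0.07, 0.62, 0.03))

# Transient part T and exit vector t0 of Eqs. 10.42-10.43
T = np.array([
    [-r_p,              r_TP,  r_FP,  r_TN,  r_FN],
    [(1 - P_TP) * r_A, -r_A,   0.0,   0.0,   0.0],
    [(1 - P_FP) * r_A,  0.0,  -r_A,   0.0,   0.0],
    [(1 - P_TN) * r_A,  0.0,   0.0,  -r_A,   0.0],
    [0.0,               0.0,   0.0,   0.0,  -r_A],
])
t0 = np.array([0.0, P_TP * r_A, P_FP * r_A, P_TN * r_A, r_A])
alpha = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # Eq. 10.48

def reliability(t):   # R(t) = alpha exp(tT) e, Eqs. 10.44/10.46
    return float(alpha @ mat_exp(t * T) @ np.ones(5))

def hazard(t):        # h(t) = f(t) / R(t), Eqs. 10.45/10.47
    E = mat_exp(t * T)
    return float(alpha @ E @ t0) / float(alpha @ E @ np.ones(5))
```

R(0) = 1 holds by construction, and R(t) decreases monotonically since S_F is absorbing.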
10.7 How to Estimate the Parameters from Experiments

The previous sections described how availability and reliability for systems with PFM can be determined as a function of eleven parameters: MTTF, MTTR, ∆t_l, ∆t_p, p, r, f, P_TP, P_FP, P_TN, and k (c.f., Table 10.2). MTTF and MTTR were assumed to be known from a system without proactive fault management and can hence be estimated from an existing system; ∆t_l and ∆t_p have been assumed to be application-specific. The remaining seven parameters refer to proactive fault management. If reliable estimates for these parameters are available from similar systems, the derived formulas can be applied directly. If not, however, it seems impossible to derive them analytically from system specifications. Therefore, they must be estimated by experiment in an environment similar to the production environment. In this section, an estimation procedure is described that separates the mutual influence of failure prediction and reaction schemes in order to determine all seven parameters.
10.7.1 Failure Prediction Accuracy

During the first experiment, only those parameters characterizing failure prediction (namely p, r, and f) are investigated, with as little feedback onto the system as possible. This can be accomplished either by performing predictions offline on previously recorded logfiles (as has been done in Chapter 9 for the telecommunication system) or by performing predictions on a separate machine. Side effects such as additional workload are incorporated in later experiments. The failure prediction experiment yields a sequence of predictions (either positive or negative) and a sequence of failures, by which predictions can be classified as true or false. Figure 10.8 shows all four cases that can occur. The figure is almost the same as Figure 10.5; however, it assigns situation IDs ① to ④, which are needed in later steps of the estimation procedure.
Starting from a timeline as in Figure 10.8, predictions can be assigned to be true positive (situation ①), false positive (situation ②), false negative (situation ③), or true negative (situation ④). From this assignment, p, r, and f can be computed according to
⁵ The solution found by Maple™ contains approximately 3000 terms.
Figure 10.8: A timeline obtained from an experiment, showing true failures (t) and prediction results. "!" indicates positive predictions (failure warnings) and "Ø" negative predictions. Four situations can occur, as indicated by ① to ④.
the definitions given in Equations 10.5 to 10.7:

p = count(①) / ( count(①) + count(②) )          (10.49)
r = count(①) / ( count(①) + count(③) )          (10.50)
f = count(②) / ( count(②) + count(④) ) ,        (10.51)
where count(x) denotes the number of times situation x has occurred in the experiment. Using p, r, and f, the expected ratios of true and false positive and negative predictions, n_TP/n, n_FP/n, n_TN/n, and n_FN/n, can be computed using Equations 10.13 to 10.18. In the following, N denotes the total number of predictions in the experiment's trace. From now on, n_TP/n, n_FP/n, n_TN/n, and n_FN/n are assumed to be known; they are used in later experiments.
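The offline estimation step is a simple counting exercise; a sketch of Equations 10.49 to 10.51 (function and argument names are ours; c1 to c4 are the counts of situations ① to ④):

```python
def estimate_accuracy(c1, c2, c3, c4):
    """Estimate precision, recall, and false positive rate from the
    counts of situations 1 (TP), 2 (FP), 3 (FN), and 4 (TN) observed
    in the experiment's timeline (Equations 10.49-10.51)."""
    p = c1 / (c1 + c2)   # precision, Eq. 10.49
    r = c1 / (c1 + c3)   # recall, Eq. 10.50
    f = c2 / (c2 + c4)   # false positive rate, Eq. 10.51
    return p, r, f
```

The resulting p, r, and f then feed into Equations 10.13 to 10.18 to obtain the expected prediction fractions.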
10.7.2 Failure Probabilities P_TP, P_FP, and P_TN

The goal of the second experiment is to assess the capability of downtime avoidance mechanisms. Since these mechanisms are run only in case of a positive prediction, and a failure can only be avoided if one is imminent, failure avoidance capability is gauged by probability P_TP. More precisely, P_TP is the probability that a failure occurs even though an upcoming true failure has been predicted and downtime avoidance mechanisms have been performed. To estimate it, failure predictions and downtime avoidance mechanisms have to be performed together on a test system that mimics key features of the modeled system as closely as possible. The outcome of the experiment is again a timeline as shown in Figure 10.8. However, the simple assignment of cases to true / false positives / negatives is no longer possible, due to the following observations:

• Situation ① (co-occurrence of failure warning and failure). This situation might be traced back to two scenarios: (a) the prediction was a true positive and the triggered downtime avoidance action could not prevent the occurrence of the failure, or (b) the prediction was a false positive that subsequently led to a failure induced by the prediction algorithm and / or the triggered action (e.g., due to additional load).

• Situation ② (failure warning without occurrence of a failure). This situation might be traced back to (a) a false positive prediction or (b) a true positive prediction with successful avoidance of the failure.

• Situation ③ (occurrence of a failure only). This situation can be caused by (a) a false negative prediction or (b) a true negative prediction where the execution of the failure prediction algorithm (no actions are performed upon negative predictions) caused the failure.
Table 10.5 provides a complete list of all these cases.

Situation   Comment                        Prediction   Failure   Prob. of occurrence
①           action not successful          TP           F         (n_TP/n) P_TP
①           failure caused by PFM          FP           F         (n_FP/n) P_FP
②           failure prevented              TP           NF        (n_TP/n) (1 − P_TP)
②           false positive prediction      FP           NF        (n_FP/n) (1 − P_FP)
------------------------------------------------------------------------------------
③           failure caused by prediction   TN           F         (n_TN/n) P_TN
③           false negative prediction      FN           F         n_FN/n
④           correctly no warning           TN           NF        (n_TN/n) (1 − P_TN)

Table 10.5: Mapping of cases to situations. Although only four different situations can be observed in the experiment's output (c.f., Figure 10.8), they can be traced back to seven different cases if downtime avoidance techniques are applied.

As indicated by the horizontal line in the table, there are two groups of non-overlapping parameters and situations: The first group comprises parameters P_TP and P_FP and situations ① and ②, while the second group comprises parameter P_TN and situations ③ and ④. Since handling of the second group is easier, it is discussed first.
Estimation of P_TN. By combining rows referring to the same situation in the second group of Table 10.5, the following linear equation system can be set up:

(n_TN/n) P_TN + n_FN/n = count(③) / N            (10.52)
(n_TN/n) (1 − P_TN) = count(④) / N .             (10.53)

Since there are two equations for one parameter (P_TN), it can be shown that a solution only exists if

count(③)/N + count(④)/N = n_TN/n + n_FN/n ,      (10.54)

expressing that the observed fraction of negative predictions (left-hand side) equals the expected fraction computed from precision, recall, and false positive rate, which have been estimated before (right-hand side of the equation). Assuming that this is the case, one of Equations 10.52 or 10.53 can be chosen. Since situation ④ is expected to occur more frequently, the estimation error is expected to be lower, and hence Equation 10.53 is solved for P_TN:

P_TN = 1 − ( count(④)/N ) / ( n_TN/n ) ,         (10.55)

which has an intuitive interpretation: if count(④)/N equals n_TN/n (expressing that all true negative predictions appear in situation ④), there are no true negative predictions that resulted in situation ③, which means that no failures are induced by the prediction algorithm, which in turn is consistent with P_TN being equal to zero.
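This estimation step, including the consistency check of Equation 10.54, can be sketched as follows (the function name and the deviation threshold are our choices, not the dissertation's):

```python
def estimate_p_tn(c3, c4, n_total, tn_frac, fn_frac, max_dev=0.05):
    """Estimate P_TN from the counts of situations 3 and 4 (Eq. 10.55).

    c3, c4:            observed counts of situations 3 and 4
    n_total:           total number N of predictions in the experiment
    tn_frac, fn_frac:  expected fractions n_TN/n and n_FN/n
                       (from Eqs. 10.13-10.18)
    """
    # Consistency condition of Eq. 10.54
    dev = abs(c3 / n_total + c4 / n_total - tn_frac - fn_frac)
    if dev > max_dev:
        raise ValueError("observed negatives deviate from expectation")
    return 1.0 - (c4 / n_total) / tn_frac    # Eq. 10.55
```

The returned estimate also satisfies Equation 10.52 whenever the consistency condition holds exactly.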
Estimation of P_TP and P_FP. The same procedure is applied to the first group in Table 10.5. The linear equation system is

(n_TP/n) P_TP + (n_FP/n) P_FP = count(①) / N                  (10.56)
(n_TP/n) (1 − P_TP) + (n_FP/n) (1 − P_FP) = count(②) / N ,    (10.57)

which are two equations for two variables. However, Equations 10.56 and 10.57 are not independent. Similar to the estimation of P_TN, a solution only exists if

count(①)/N + count(②)/N = n_TP/n + n_FP/n .                   (10.58)
Since there’s only one (independent) equation containing two variables, an additional,
independent equation involving PT P or PF P has be formed. The following options are
available:
1. Since PF P denotes the risk of failure induced by execution of failure prediction
algorithms and subsequent (unnecessary) actions, PF P could be set a-priori yielding
PF P = const.
(10.59)
2. In case of a true positive prediction, a failure may occur due to two reasons: The
action was not able to avoid the failure, or the action would have avoided the failure
that had been predicted, but due to additional load, another failure occurs. However,
the risk of inducing an additional failure is PF P (see above), and hence one could
assume that
PT P = P (failure cannot be avoided) + PF P .
(10.60)
So the difficulty is to determine the probability that a failure cannot be avoided.
3. A fixed ratio of PT P : PF P could be assumed. For example, a ratio of 10:1 would
express that the risk of failure occurrence after issuing of a failure warning is ten
times as high if the warning is correct as if it is a false warning. In general, this
leads to
PF P = c PT P ,
(10.61)
where c is a constant (ten in the example).
250
10. Assessing the Effect on Dependability
4. Either PT P or PF P can be determined in a separate experiment. This also results in
PF P = const.
or
PT P = const.
(10.62)
Solutions one to three involve assumptions that are vague and difficult to support by measurements. In contrast, solution four is based on experimental evidence. It might seem that it does not make a difference whether P_TP or P_FP is estimated, but this is not true: In order to estimate P_FP or P_TP, it must be known when a prediction is a false or true positive. In the false positive case, it must be proven that a failure would not have occurred if failure prediction and actions had not been in place, which seems infeasible. In the second case, however, it must be assured that a positive prediction is a true positive, which means that a failure really is imminent. This can be achieved by fault injection (see, e.g., Silva & Madeira [242] for an introduction), as explained in the following.

Once again, P_TP is the probability of failure occurrence given a true positive prediction. Applying a maximum likelihood estimator yields:

P_TP = P(F | TP) = count(F ∧ TP) / count(TP) = count(F ∧ TP) / ( count(F ∧ TP) + count(NF ∧ TP) ) ,   (10.63)
where count(F ∧ TP) denotes the number of true positive predictions where (despite all preventive actions) a failure has occurred, and count(NF ∧ TP) denotes the number of cases where a failure warning is raised correctly but is not followed by a failure. Fault injection is applied in order to know when a failure really is imminent in the system, and hence any positive prediction (failure warning) occurring within some time interval after fault injection is a true positive. The case that a true positive prediction is followed by a failure (F ∧ TP) can be identified directly in the log of a fault-injection experiment (c.f., situation ⑤ in Figure 10.9).

Figure 10.9: Identifying true positive predictions by fault injection. ⑤: If a failure (t) occurs within a given time interval after fault injection and the failure is preceded by a failure warning (exclamation mark), the situation is assumed to be a true positive prediction where the failure could not be prevented. ⑥: If no failure but a failure warning is observed after fault injection, this corresponds either to a false positive prediction, if fault injection was not successful, or to a true positive prediction where the failure has been prevented.

Identification of the case that no failure occurs after a true positive prediction (NF ∧ TP) is more complicated. The reason for this is that the injection of a fault does not always lead to a failure. Hence, situation ⑥ in Figure 10.9 can either be a true positive where the failure has been prevented (this is the case needed for Equation 10.63) or a false positive prediction in the case that fault injection did not succeed. However, these two cases can be distinguished by the relative frequencies of true positive and false positive predictions, which are determined by precision. But since a fault injector can in some cases change system behavior significantly, precision has to be estimated separately for the fault injection experiments, following the same offline procedure as described in the previous section (p′ is used to indicate precision in this case). This leads to the following formula for the maximum likelihood estimation of P_TP:

P_TP = count(⑤) / ( count(⑤) + p′ count(⑥) ) .   (10.64)
It should be noted that fault injection is a difficult issue, and care should be taken that a
broad range of faults is injected such that failures of different types occur. If downtime
avoidance techniques are only able to compensate for upcoming failures of certain classes,
P_TP equals one for failure types that are not taken care of. If the distribution of failure
types is known, the estimate given in Equation 10.64 can be improved.
By substituting the solution for P_TP (Equation 10.64) into either Equation 10.56
or 10.57, P_FP can be computed. Using the first equation yields
P_FP = ( count(①)/N − (n_TP/n) · P_TP ) / (n_FP/n) .   (10.65)
Measuring deviation. Since all experiments are finite samples, and since, if actions are
performed, failure prediction accuracy might deviate slightly from the values determined
by offline estimation, exact equalities in Equations 10.54 and 10.58 will be observed rather
rarely. If the deviation is sufficiently small, the equations can be used nonetheless. If not,
experiments have to be repeated with an increased sample size or with more similar environments
and conditions. The amount of deviation can be determined by Equations 10.54
and 10.58. It can be observed that the deviation is symmetric: if the observed fraction
of negative predictions is larger than expected (left-hand side > right-hand side in Equation 10.54),
the observed fraction of positive predictions is smaller than expected (left-hand
side < right-hand side in Equation 10.58), and vice versa. Therefore, either of the two can
be used to determine deviation from expectations. Since there usually are more negative
than positive predictions, the estimate is more reliable if negative predictions are used, and
deviation is defined as follows (c.f., Equation 10.54):
dev = count(③)/N + count(④)/N − n_TN/n − n_FN/n .   (10.66)

10.7.3 Repair Time Improvement k
In order to estimate the repair time improvement factor k, an experimental trace such as
the one in Figure 10.8 that additionally includes time-to-repair is needed. As k is the ratio of
MTTR for unplanned/unprepared downtime to MTTR for forced/prepared downtime (c.f., Equation 10.8),
mean values for both cases need to be computed. The distinction between
the two types of downtime is based on failure prediction: in case of a failure warning (situation
① in Figure 10.8), time to repair contributes to forced/prepared downtime; in case
of no failure warning (situation ③ in Figure 10.8), it contributes to unplanned/unprepared
downtime. Comparing the value of MTTR for the unpredicted case to the fixed
value known from a system without PFM yields a further indication of how representative
the estimate is.
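The estimation of k described above can be sketched as follows. The repair-time samples and their classification into prepared and unplanned repairs are made-up numbers for illustration, not data from the experiments:

```python
# Sketch of Section 10.7.3: k = MTTR (unplanned/unprepared downtime)
# divided by MTTR_p (forced/prepared downtime). Repairs preceded by a
# failure warning count as prepared; repairs without a warning as
# unplanned. All time-to-repair samples below are hypothetical.
def mean(xs):
    return sum(xs) / len(xs)

prepared_ttr = [1.2, 1.3, 1.4]    # seconds, repairs after a warning
unplanned_ttr = [2.0, 2.1, 1.9]   # seconds, repairs without a warning

mttr_p = mean(prepared_ttr)   # mean forced/prepared repair time
mttr = mean(unplanned_ttr)    # mean unplanned/unprepared repair time
k = mttr / mttr_p
print(round(k, 4))  # 2.0 / 1.3 ≈ 1.5385
```

Comparing `mttr` with the MTTR known from a system without PFM serves as the representativeness check mentioned in the text.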
10.7.4 Summary of the Estimation Procedure
Since the estimation procedure is quite complex, involving several experiments, a brief
summary of the procedure is provided in Figure 10.10.
10.8 A Case Study and an Example
In his diploma thesis [254], Olaf Tauber has set up an experimental environment in order
to explore the effects of proactive fault management on a real system. The case study
has been performed by extending the .NET Pet Shop application6 from Microsoft. This
section summarizes the work and highlights the main results. Since the results have not been
convincing, a more advanced example is also presented.
10.8.1 Experiment Description
.NET is a runtime environment developed by Microsoft that is able to execute software
components written in various programming languages. Furthermore, it provides ready-to-use
functionality to handle a multitude of tasks ranging from multi-threading to graphical
user interfaces. “Pet Shop” is a small, open source web-shop demo application that
has been built in order to demonstrate superiority to the Java-based “PetStore” demo application
developed by Sun Microsystems.
Running the Pet Shop application requires at least two additional components: a webserver
that handles HTTP requests from web-browsers (clients) and a database to store the
data in. In order to create an experimental environment for testing proactive fault management
techniques, several modules had to be added to the system (see Figure 10.11):
• Stressors. In order to simulate a real scenario, workload must be put onto the system.
An existing load generator called JMeter has been adapted to simulate a variety
of actions associated with shopping (e.g., logging in, browsing the catalog,
viewing and changing the shopping cart, payment, etc.). Activity patterns have
been executed randomly, obeying several boundary conditions such as that users
have to log in prior to payment. Furthermore, stressors have been replicated,
simulating a total of 70 users shopping concurrently. The second important part
of the stressors is response analysis: each response has been analyzed with respect to
response times and correctness of the returned web-page. For performance reasons,
relevant data has only been stored during runtime and has been analyzed offline after
each test run.
• Monitoring. Since proactive fault management is about acting upon an analysis of
the current state, runtime monitoring is necessary. In this case, a .NET component
has been used to report system-wide Windows performance counters such as the
number of active database transactions, size of the swap file, etc.
• Failure Prediction. Monitoring values have been transmitted over a network socket
to a failure predictor. It must be pointed out that the failure prediction algorithm
proposed in this dissertation has not been used. Instead, Olaf Tauber has developed a

6 See http://msdn2.microsoft.com/en-us/library/ms978487.aspx
1. Experiment 1: without feedback onto the system; either write logfiles or execute predictions on a separate computer.
   (a) Classify the resulting data into situations ① to ④ (c.f., Figure 10.8).
   (b) Compute precision p, recall r, and false positive rate f using Equations 10.49, 10.50, and 10.51.
   (c) Using p, r, and f, compute the expected ratios of true and false positives and negatives n_TP/n, n_FP/n, n_TN/n, and n_FN/n using Equations 10.13 to 10.18.
2. Experiment 2: with failure prediction and actions, similar to a production system.
   (a) Classify the resulting data into situations ① to ④ (c.f., Figure 10.8).
   (b) Determine the relative amount of negative predictions, which is count(③)/N + count(④)/N, where N is the total number of predictions. Compute deviation dev by Equation 10.66 using n_TN/n and n_FN/n from Experiment 1.
   (c) If the deviation is significant,a Experiments 1 and 2 have to be repeated, either with more samples to reduce sampling effects or such that computing environments and conditions are more similar.
   (d) Estimate P_TN using Equation 10.55.
   (e) From repair times occurring in the experiment, estimate MTTR and MTTR_p and compute the repair time improvement factor k as described in Section 10.7.3.
   (f) Compute the overall prediction rate r_p using Equation 10.29.
   (g) Using r_p, compute the prediction rates r_TP, r_FP, r_TN, and r_FN from Equation 10.30 using the values n_TP/n, n_FP/n, n_TN/n, and n_FN/n from Experiment 1(c) above.
   (h) Compute the rates r_A, r_F, and r_R using Equations 10.31 to 10.33.
3. Experiment 3: with fault injection but without feedback onto the system.
   (a) Identify occurrences of situations ① and ② for the fault-injection experiment (c.f., Figure 10.8).
   (b) Estimate p′ using Equation 10.49.
4. Experiment 4: with fault injection, prediction, and actions in place. By analyzing situations ⑤ and ⑥ (c.f., Figure 10.9), estimate P_TP using Equation 10.64.
5. Estimate P_FP by Equation 10.65 using the data of Experiment 2.

a The threshold is application specific and cannot be provided here.

Figure 10.10: Summary of the procedure to estimate model parameters
Figure 10.11: Overview of the case study
simple prediction algorithm that is based on weighted events generated from threshold
violations. The reason for this is that the implementation of HSMM-based failure
prediction had not been finished at the time Olaf Tauber carried out the experiments.
• Action. If a failure is predicted, some action is triggered. One downtime avoidance
and one downtime minimization technique have been implemented:
– Load lowering has been chosen for downtime avoidance. More specifically,
lowering of the load was achieved by displaying a web page stating that the
server is temporarily overloaded and clients should retry in a few seconds.
– A two-level hierarchical reboot strategy was used for downtime minimization.
The reboot strategy was able to either reboot the application layer in the .NET
runtime or to reboot the entire system.
• Fault Injection. One of the most effective fault injection techniques is to limit available
resources. Olaf Tauber has opted for allocating memory such that the rest of
the system (including the Pet Shop application, the webserver, and the database) has to
cope with a reduced amount of free memory. Specifically, fault injection has been
implemented by a multi-threaded process controlled7 from outside the system.
10.8.2 Results
At the time when Olaf Tauber carried out his experiments, the model proposed in this
chapter had not been developed, and hence he used the formulas and estimation
technique proposed in Salfner & Malek [225]. Fortunately, the supplemental DVD to
the diploma thesis contained the complete recordings collected during the experiments, and
the data could be analyzed with the estimation procedure described in Section 10.7. In
contrast to this procedure, the one that Olaf Tauber applied consisted of only two
phases; but since he applied fault injection in his experiments, the data could be split

7 This means specification of the start, duration, and amount of memory allocation.
further into time intervals with and without failure prediction, resulting in data for four
experiments. In order to clearly separate the parts, some time period after the end of each
fault injection has been removed from consideration.
Two proactive fault management techniques have been investigated by Olaf Tauber:
downtime minimization by restart, and downtime avoidance by presenting a static page
saying “server is busy”. Since the only type of failures observed were
singleton runtime failures, each only affecting a few requests, the restarting approach was
not at all successful: even in the case of application-level restarting, eleven times as many
service requests got lost during restart as by the failure itself. For this reason, only the
downtime avoidance technique is analyzed in the following.
The analysis has been performed with a lead-time ∆t_l of 60 s and a prediction-period ∆t_p of
five minutes. Table 10.6 shows the parameter values for the resulting model. Unfortunately,
the limited amount of data is not sufficient to yield a statistically reliable assessment of
the parameters. Hence, the results need to be interpreted with care. Deviation, as defined
by Equation 10.66, has been equal to 0.0164.
Fixed parameters   Value [s]   Estimated parameters   Value        Resulting rates   Value [1/s]
MTTF               25711       p                      0.167        r_TP              1.178169e-05
MTTR               2.00        r                      0.25         r_FP              5.876737e-05
∆t_l               60          f                      0.0617284    r_TN              0.000893264
∆t_p               300         P_TP                   0.5          r_FN              3.534508e-05
                               P_FP                   0.1463768    r_A               0.004761905
                               P_TN                   0.04366895   r_F               0.5
                               k                      1.5625       r_R               0.78125

Table 10.6: Resulting values for model parameters as estimated from data of the case study.
Fixed parameters refer to the parameters not depending on PFM. Estimated parameters
are those that are estimated from experiments as described in Section 10.7. The
rightmost column lists the resulting transition rates computed from the estimated parameters.
It might look surprising that k is not equal to one, since showing a “server is busy”
page aims at downtime avoidance rather than downtime minimization. The explanation
for this behavior is that MTTR as well as MTTR_p are determined by the first successful
request after a failed one. If only a static page is displayed, the first successful response
can be delivered earlier,8 and hence MTTR is reduced.
Using the estimated values for the model parameters, steady-state availability, reliability,
and hazard rate can be computed and plotted. In particular, the steady-state availability
of the system without proactive fault management was equal to A = 0.9999222
and of the system with PFM A_PFM = 0.9998618. This is a dramatic decrease! More
precisely,

(1 − A_PFM) / (1 − A) ≈ 1.78 ,   (10.67)
8 To be precise, after 1.28 seconds rather than 2.00 seconds, resulting in k = 1.5625.
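The availability figures of the case study can be checked with a few lines; the numbers below are the ones quoted in the text and in footnote 8:

```python
# Verifying Eq. 10.67 and footnote 8 with the reported case-study values.
A = 0.9999222       # steady-state availability without PFM
A_pfm = 0.9998618   # steady-state availability with PFM

# Ratio of unavailabilities (Eq. 10.67): unavailability roughly doubles.
ratio = (1 - A_pfm) / (1 - A)
print(round(ratio, 2))  # 1.78

# Footnote 8: MTTR 2.00 s unplanned vs. 1.28 s prepared gives k.
k = 2.00 / 1.28
print(round(k, 4))  # 1.5625
```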
Figure 10.12: Reliability for the case study, with and without PFM: (a) reliability R(t)
over the first 50000 s; (b) blow-up of the first 500 s of (a). The blow-up shows
the phase-type character of the reliability model.
which indicates that unavailability is approximately doubled. Regarding reliability, a
similar picture is observed: in Figure 10.12-a the reliability of the system with and without
proactive fault management is plotted. It can be observed that the PetShop
system without PFM shows better reliability than the altered PetShop system. A more
fine-grained analysis of the first few hundred seconds reveals that the reliability of the case
with PFM is slightly higher within the first 300 seconds (see Figure 10.12-b). However,
this most likely results from the simple model used to compute the reliability of the system
without PFM, which employs an exponential distribution of time-to-failure:
R(t) = 1 − F(t) = 1 − (1 − e^(−t/MTTF)) .   (10.68)
Nevertheless, the fine-grained analysis reveals the phase-type character of reliability as a
consequence of the modeling approach.
Hazard rates are shown in Figure 10.13. Using a single exponential distribution for the
system without PFM (c.f., Equation 10.68) results in a constant hazard rate:

h(t) = λ e^(−λt) / (1 − (1 − e^(−λt))) = λ = 1/MTTF .   (10.69)
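Equations 10.68 and 10.69 can be verified numerically; a small sketch using the MTTF value from Table 10.6:

```python
import math

# For the system without PFM, time-to-failure is exponential (Eq. 10.68),
# so the hazard rate h(t) = f(t)/R(t) is constant at 1/MTTF (Eq. 10.69).
MTTF = 25711.0  # seconds, from Table 10.6

def reliability(t):
    # R(t) = 1 - F(t) = 1 - (1 - exp(-t/MTTF)) = exp(-t/MTTF)
    return 1.0 - (1.0 - math.exp(-t / MTTF))

def hazard(t):
    lam = 1.0 / MTTF
    pdf = lam * math.exp(-lam * t)   # density of the exponential law
    return pdf / reliability(t)      # simplifies to lam for every t

print(hazard(0.0), hazard(1000.0))  # both equal 1/MTTF
```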
Regarding the hazard rate of the system with PFM, the characteristic that the hazard rate
is zero for t = 0 results from the fact that there is no direct transition from the initial
up-state to a failure. It can also be observed from Figure 10.13 that for t → ∞ the hazard
rate approaches a constant, which results from the CTMC settling into steady-state. As
could be expected from the worse steady-state availability and reliability, this constant value
is higher than for the case without PFM.
Looking at Table 10.6, the bad performance of the proactive fault management can
be traced back both to the low values for precision and recall and to the inefficiency of the
downtime avoidance technique:
• Low precision and recall express that the simplistic threshold-based failure
prediction method used is not able to achieve sufficiently accurate failure prediction: a
Figure 10.13: Hazard rate for the case study, with and without PFM
precision of 0.167 implies that about 83% of all failure warnings are false. Orthogonal
to that, only 25% of failures are caught by the prediction algorithm and
three fourths are missed. As a side remark, these values are a good example to show
that accuracy —as defined by Equation 8.10 on Page 156— is not an appropriate
metric to evaluate failure prediction: accuracy equals 90.59%! The explanation for
this discrepancy is that most of the predictions are true negatives, as can be seen
from Table 10.7, which lists the relative distribution among predictions as obtained from
Equations 10.13 to 10.18.
Type of prediction   Relative amount
True positives       1.18%
False positives      5.88%
True negatives       89.40%
False negatives      3.54%

Table 10.7: Relative amount of the four types of prediction
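The relative amounts in Table 10.7 suffice to recompute the metrics and to reproduce the misleading accuracy value. Note that the table entries are rounded, so the recomputed accuracy comes out as 90.58% rather than the 90.59% quoted in the text:

```python
# Recomputing the prediction metrics from the (rounded) relative amounts
# of Table 10.7 illustrates why accuracy is misleading here: almost all
# predictions are true negatives.
tp, fp, tn, fn = 0.0118, 0.0588, 0.8940, 0.0354

precision = tp / (tp + fp)   # fraction of warnings that are correct
recall = tp / (tp + fn)      # fraction of failures that are predicted
fpr = fp / (fp + tn)         # false positive rate
accuracy = tp + tn           # the four fractions sum to one

print(round(precision, 3), round(recall, 2), round(fpr, 4), round(accuracy, 4))
# 0.167 0.25 0.0617 0.9058
```

Precision, recall, and the false positive rate match the values of Table 10.6, while accuracy stays above 90% despite the predictor missing three fourths of all failures.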
• Poor ability to prevent failures: the probability that a failure occurs even if it is
predicted is P_TP = 0.5. This number is rather uncertain, since only a total of 36 predictions
containing five failures are available from the data. A value of P_TP of one
half indicates that after every second true positive prediction a failure occurs. Although
downtime is smaller for predicted outages (k > 1), this cannot compensate for the fact
that there are 1 + r(1−p)/(p·f) = 21.25 times as many predictions as occurrences of failures in
the Pet Shop without proactive fault management. Even that would be no problem if
most of the predictions were true negatives and P_TN were sufficiently small, which
is also not the case in this example.
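The ratio of predictions to failures can be reproduced assuming the tabulated values are rounded from p = 1/6, r = 1/4, and f = 5/81 (the latter matches the printed 0.0617284); these exact fractions are an inference from the rounded table entries:

```python
# Reproducing the predictions-per-failure ratio 1 + r(1-p)/(p*f) = 21.25.
# The fractions below are assumed to be the unrounded values behind the
# table entries p = 0.167, r = 0.25, f = 0.0617284.
p = 1 / 6    # precision
r = 1 / 4    # recall
f = 5 / 81   # false positive rate

ratio = 1 + r * (1 - p) / (p * f)
print(round(ratio, 2))  # 21.25
```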
In summary, the experiment has shown that the application of proactive fault management
can make a system worse. However, the applied failure prediction method was too simple
and downtime avoidance was far from being effective. The next section will demonstrate
the effects in a more sophisticated setting.

Parameter   Value
p           0.70
r           0.62
f           0.016
P_TP        0.25
P_FP        0.1
P_TN        0.001
k           2

Table 10.8: Parameters assumed for the sophisticated example
10.8.3 An Advanced Example
In order to show that proactive fault management can indeed improve steady-state system
availability, calculations have been carried out assuming parameter values from a
better failure predictor and more effective actions. More specifically, the values that have
been observed for HSMM-based failure prediction for the telecommunication system (c.f.,
Chapter 9) have been used: precision equal to 0.70, recall equal to 0.62, and a
false positive rate of 0.016. Also with respect to the effectiveness of actions and risk-induced
failures, slightly better values have been assumed. Exact values for P_TP, P_FP, P_TN, and
k are listed in Table 10.8. Values for MTTF and MTTR are the same as in the case
study by Olaf Tauber.
Using these values, a steady-state availability of A_PFM = 0.999962 has been computed.
The availability of a system without PFM is the same as in the previous
experiment. This results in cutting down unavailability approximately by half:

(1 − A_PFM) / (1 − A) ≈ 0.488 .   (10.70)
Reliability and hazard rate are also improved, as can be seen from Figures 10.14
and 10.15. This time, the constant limiting hazard rate is below the hazard rate of a
system without proactive fault management.
10.9 Summary
In this chapter, a model has been introduced in order to assess the effect of proactive fault
management, which denotes the approach of combining proactive techniques with a failure
predictor: each time an imminent failure is predicted, actions are triggered that
try either to avoid or to minimize the downtime incurred by failure occurrence. Examples have
been given for both types of actions.
The model presented is based on the well-known continuous-time Markov chain
model used by Huang et al. [126] to model software rejuvenation, which is a special
case of downtime minimization by periodic restarting. The model replaces the failure-probable
state of the original rejuvenation CTMC, which is one of its major drawbacks, by
Figure 10.14: Reliability for the sophisticated example, with and without PFM. Similar to
Figure 10.12, the first 500 seconds are magnified, showing the phase-type character
of the underlying distribution.

Figure 10.15: Hazard rate for the more sophisticated example.
four states representing the correctness of failure predictions. The model is based on eleven
parameters, of which four are determined by the boundary conditions of the system and
the remaining seven characterize the efficiency of proactive fault management:
• Precision, recall, and false positive rate are used for the assessment of failure prediction
accuracy.
• The probabilities of failure occurrence in case of true positive, false positive, or true negative
predictions are used to assess the success of downtime avoidance techniques as
well as to capture the probability of failures that are induced by failure prediction and
the actions themselves.
• A repair time improvement factor accounts for the effect of improved repair times
in case of forced versus unplanned downtime.
Closed-form solutions for steady-state availability, reliability, and hazard rate have
been developed, and a procedure describing how these seven parameters can be estimated from
experimental data has been presented.
Finally, a case study has been presented in which the Microsoft .NET demo web-shop
called “Pet Shop” has been extended in order to facilitate testing of simple proactive fault
management techniques. The case study is based on data gathered by Olaf Tauber in the
course of his diploma thesis, which has primarily been supervised by the author. However,
neither the failure prediction algorithm —which is not the HSMM-based algorithm described
in this thesis— nor the applied downtime minimization and avoidance techniques
have been convincing, such that availability, reliability, and hazard rate get worse if the
techniques are applied. For this reason, a second, more advanced example has been presented
using values for precision, recall, and false positive rate that have been achieved by
HSMM-based failure prediction for the telecommunication system case study. Regarding
the efficiency of methods, slightly better values than the ones estimated from Olaf Tauber’s
experiments have been used. In this setting, reliability was significantly improved and
unavailability was cut down by half. However, it should be noted that HSMM-based prediction,
if applied to the Pet Shop system, would not have reached results as good as for
the telecommunication case study. The reason for this is that no fine-grained fault detection
is built into the Pet Shop. Therefore, only very few indicative symptomatic errors are
reported prior to a failure.
One of the major limitations of the model is that it operates only on mean times, which
is a direct consequence of using continuous-time Markov chains. Other models such as
stochastic activity networks (SAN) can model more details. On the other hand, finding
closed-form solutions is rather difficult for these models.
A further limitation of the model presented here is that diagnosis and scheduling of
actions (see Chapter 12) are not explicitly modeled: if a PFM system comprises several
different actions, a decision is necessary about which action to trigger in a given situation.
This decision, too, can be correct or wrong. Although the decision accuracy of the
dispatcher is inherently contained in the probabilities P_TP, P_FP, P_TN, and k,9 a more detailed
modeling would be desirable. On the other hand, the introduction of even more states and
parameters makes the model more difficult to understand and results in more parameters that
need to be estimated from experimental data.
Contributions of this chapter. The main contribution of this chapter is the proposal of a
CTMC model to assess the effect of proactive fault management on availability, reliability,
and hazard rate. A brief survey of existing models that try to evaluate the effect
of proactive fault management has revealed that —to the best of our knowledge— the
proposed CTMC model is the first to
• clearly distinguish between all four types of failure predictions: true positives, true
negatives, false positives, and false negatives,
• handle both downtime minimization and downtime avoidance techniques,
9 Think, for example, of the case that the dispatcher chooses to prepare a repair action instead of triggering
a preventive action; then the probability of failure occurrence is increased while k is improved.
• incorporate the case that failure prediction plus triggered actions can induce failures, i.e., due to additional load caused by prediction or actions, a failure occurs
that would not have occurred if no proactive fault management was in place.
From a practical point of view, the three main contributions of the model are:
• It can help to decide whether the application of proactive fault management is useful for a
given system. In order to do so, MTTF and MTTR must be determined from the
current version of the system. The remaining parameters must be estimated from
experiments in similar environments, as done in Chapter 9 for the assessment of
failure prediction effectiveness.
• When analyzing a system that already employs proactive fault management techniques,
partial derivatives of the availability/reliability formulas may give an indication
of which of the seven parameters would be most effective at increasing availability.
For example, if a system’s engineer had $100,000 to spend on improved
proactive fault management, the derivatives of the formulas derived in this chapter indicate
whether it is, e.g., more effective to spend the money on improved failure
prediction methods or on a reduction of MTTR for forced/prepared outages.
• It can be used to determine the optimal trade-off between precision, recall, and false
positive rate. In order to do so, all parameters except precision, recall, and false
positive rate must be assumed to be fixed. Then, by Equation 10.40, availability
becomes a function of these three parameters. Hence, an availability value can be
assigned to each point of the trajectory through the space of precision/recall/false
positive rate, and the optimal combination can be chosen.
Relation to other chapters. This chapter is the first of the fourth phase of the engineering
cycle —and, since the main focus of this thesis is on failure prediction, it is also the
last. The remaining chapters conclude the thesis and provide an outlook on
further research.
Chapter 11
Summary and Conclusions
The initial spark that lit the fire providing the energy to write this dissertation was
the challenge to predict the failures of a commercial telecommunication platform from the
errors that occur. In this chapter, the essentials are summarized, major contributions are
pointed out, and remaining issues are discussed.
Beginning with the aim to improve a given system, a typical engineering approach can
be divided into four phases forming the “engineering cycle” (c.f., Figure 1.2 on Page 6).
The thesis has been structured along this concept and so is its summary.
11.1 Phase I: Problem Statement, Key Properties and Related Work
The ultimate goal addressed in this dissertation is to improve computer system dependability
by means of proactive management of faults. However, the thesis has focused
on the prerequisite first step, which is online failure prediction: the objective
is to predict the occurrence of failures in the near future based on the current state of the
system as observed by runtime monitoring. As a case study, failures of a commercial
telecommunication platform, for which industrial data has been available, were to be predicted.
A detailed analysis of the surrounding conditions of the case study has revealed
several key properties, for which the proposed approach to failure prediction has been
designed:
• The size of the system is so immense that detailed knowledge of complex relationships
is rare —it has at least not been overt to us. However, with the ever-growing
complexity of systems and the increasing use of commercial off-the-shelf components,
this assumption might also be valid for the companies themselves. For this reason,
a black-box approach has been applied. However, as is discussed in the outlook,
the model can be augmented by analytical knowledge, which would turn it into a
gray-box approach.
• A huge amount of data is available. Therefore, a data-driven approach from machine
learning has been chosen that aims at filtering relevant interrelations out of the
data rather than building on an analytical approach where interrelations are
extracted manually. This approach has a major consequence: only those types of
failures can be predicted that have occurred (frequently enough) in the training data.
Events that are really rare are not the focus here. However, as Levy & Chillarege
[162] have pointed out, failure types follow Zipf’s law, and targeting frequent
failures first results in the biggest impact. Regarding the telecommunication system
case study, the goal was to predict performance failures a few minutes ahead.
• Faults can become visible at four stages: by auditing, which means actively searching
for faults, by symptom monitoring, by error detection, or by observing a failure.
In this thesis, errors have been used as input data. There are reasons in
favor of and against this choice. The most important are:
– Errors occur late in the process from faults to failures. In order to be able to
predict failures with reasonable lead time, fine-grained fault detection must be
in place that is able to capture misbehavior early enough.
+ Due to the property of occurring late, input occurs only when something is
going wrong in the system. This alleviates the problem of class skewness: the
ratio of failure and non-failure data is more even than in symptom monitoring-based
approaches.
+ Since error reporting is inherently built into the majority of systems, error-based
prediction techniques are expected to have less effect on the production
system than monitoring-based approaches: as the case study in the
diploma thesis of Olaf Tauber has shown, system response times are dramatically
influenced by the amount as well as the frequency of collected monitoring data.
+ Quite a lot of symptom monitoring-based approaches have been published,
while the area of error event-based methods is not well explored. From a
scientific point of view, it has been alluring to explore this white spot among
prediction methods.
The experiments conducted in this thesis have been performed using previously
recorded logfiles. It should be noted that for real application to a running system, a
direct interface to error event reporting should be used.
• Component-based software architectures are common in large systems. The clear
structure of encapsulated entities advocates an approach that builds on interrelationships
and dependencies among components. From this it follows that the order
of error events is relevant. Moreover, an analysis has revealed that not only the
order matters but that the temporal delay between errors is even more decisive. Since errors
occur non-equidistantly and the type of each error belongs to a finite countable set,
temporal sequences are the input data for the failure prediction algorithm.
• Fault-tolerant systems can cope with many erroneous situations but fail under some
conditions. The principal assumption in this thesis is that erroneous situations
leading to failure can be distinguished from erroneous situations that do not lead to
failure by identifying patterns in temporal error sequences.
For this reason, a pattern recognition approach has been applied.
• It is a non-distributed system. Although the approach might in principle be applicable to distributed systems as well, such aspects have not been considered in this
thesis.
The resulting approach is divided into two major steps: first, models are adjusted to system
specifics from previously recorded training data. After training, error sequences occurring
at runtime are analyzed in order to classify the current status of the system as
failure-prone or not. In machine learning, such a procedure is called a supervised offline
batch learning approach.
In order to review existing approaches to online failure prediction, a taxonomy has
been developed and a comprehensive survey of major publications has been presented.
Additionally, related work on extending hidden Markov models to continuous time has
been presented.
11.2 Phase II: Data Preprocessing, the Model, and Classification
The second phase of the engineering cycle aims at synthesizing a problem-specific
methodology. In many cases, including this thesis, existing approaches need to be adapted
or a new model needs to be developed.
Online failure prediction is performed in three steps:
1. Error messages that have occurred within a given time window before present time
form an error sequence. The sequence is preprocessed, which includes assignment
of symbols, tupling and noise filtering. In the case of training, failure sequences are
additionally grouped by clustering.
2. Using extended hidden Markov models, similarity to failure and non-failure sequences is computed. Sequence likelihood is used as a measure for similarity between the observed sequence under investigation and the sequences of the training
data.
3. Applying Bayes decision theory, a final decision is made whether the current situation is failure-prone or not.
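The first step, assembling the temporal sequence from a time window of error events, can be sketched as follows. The function name `extract_sequence` and its (timestamp, error-ID) input format are illustrative assumptions, not the implementation used in the thesis:

```python
from bisect import bisect_left, bisect_right

def extract_sequence(events, t_now, window):
    """Return the temporal error sequence observed in [t_now - window, t_now].

    events: list of (timestamp, error_id) pairs, sorted by timestamp.
    Result: list of (delay, error_id) pairs, where delay is the time
    elapsed since the previous event inside the window (0 for the first).
    """
    times = [t for t, _ in events]
    lo = bisect_left(times, t_now - window)
    hi = bisect_right(times, t_now)
    seq, prev_t = [], None
    for t, eid in events[lo:hi]:
        seq.append((0.0 if prev_t is None else t - prev_t, eid))
        prev_t = t
    return seq
```

For events at times 0.0, 1.5, 3.0, and 10.0, a query at t_now = 4.0 with a window of length 3.5 yields the two events at 1.5 and 3.0 together with their mutual delay.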
11.2.1 Data Preprocessing
Data preprocessing consists of several steps, of which the assignment of error IDs, the
tupling technique by Iyer & Rosetti, and sequence extraction are of a technical rather
than conceptual nature and are hence not summarized here.
Failure sequence clustering. Due to the complexity of the system, it must be assumed
that several failure mechanisms exist and are hence present in the data. The term failure
mechanism is used to denote specific relations of faults and system states to a failure. In
this thesis, a technique has been developed that separates failure mechanisms by means
of clustering. The basic notion of failure sequence clustering is that a dissimilarity matrix is formed by training a small hidden semi-Markov model for each sequence and by
computing sequence likelihoods with each model for all failure sequences. Then, a standard clustering technique can be applied to identify groups of failure sequences that are
“close” in the sense of large mutual sequence likelihoods. An analysis using the telecommunication data has revealed that agglomerative clustering using Ward’s procedure yields
the most stable results. Other clustering parameters such as the number of states of the
models and the level of background distributions are not very decisive and default values
have been derived.
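Assuming the matrix of mutual sequence log-likelihoods has already been computed (one small model per failure sequence), the grouping step can be sketched with SciPy's hierarchical clustering; the symmetrization and the likelihood-to-distance conversion shown here are illustrative choices, not necessarily those of the thesis:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_failure_sequences(loglik, n_groups):
    """Group failure sequences from a matrix of mutual log-likelihoods.

    loglik[i, j] is the log-likelihood of sequence j under the small model
    trained on sequence i (larger = more similar).  A symmetric dissimilarity
    matrix is derived and fed to agglomerative clustering with Ward's method,
    which the analysis in the text found to give the most stable results.
    """
    sim = 0.5 * (loglik + loglik.T)   # symmetrize mutual likelihoods
    dis = sim.max() - sim             # large likelihood -> small distance
    np.fill_diagonal(dis, 0.0)
    z = linkage(squareform(dis, checks=False), method="ward")
    return fcluster(z, t=n_groups, criterion="maxclust")
```

Sequences that assign each other large likelihoods end up in the same group; the number of groups corresponds to the number of separated failure mechanisms.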
Noise filtering. The purpose of noise filtering is to remove non-failure-related errors
from failure sequences, both in the training data and during online prediction. Noise filtering
is based on a statistical test derived from the well-known χ2 test of goodness of fit. The
principal notion is that only symbols that are outstanding, i.e., occur more frequently than
expected at a given time, are considered. At least for the data of the case study, an analysis
has shown that computing the symbols' expected probabilities from all sequences yields
a rather clear separation between signal and noise.
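A per-symbol test in this spirit can be sketched as follows; the use of a binomial standard deviation and a fixed residual threshold is a simplifying assumption, not the exact statistic derived in the thesis:

```python
def filter_noise(counts, prior, total, threshold=2.0):
    """Keep only 'outstanding' symbols: those whose observed count exceeds
    the expectation derived from a global prior by more than `threshold`
    standard deviations (in the spirit of a chi-square goodness-of-fit test).

    counts: dict symbol -> observed count in the sequence
    prior:  dict symbol -> global probability estimated from all sequences
    total:  total number of symbols in the sequence
    """
    kept = set()
    for sym, obs in counts.items():
        p = prior.get(sym, 0.0)
        expected = total * p
        std = (total * p * (1.0 - p)) ** 0.5  # binomial standard deviation
        if std > 0 and (obs - expected) / std > threshold:
            kept.add(sym)
    return kept
```

A symbol expected five times that occurs twenty times is kept, while symbols occurring roughly as often as (or less often than) expected are filtered out as noise.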
Improved logfiles. Although not applied to the data of the case study, a principled investigation of logfiles has resulted in two proposals for how logfiles can be improved for
automatic processing:
1. Event type and event source should be clearly separated
2. A hierarchical numbering scheme should be used, which supports data investigation
by providing multiple levels of detail. Furthermore, a distance metric can be defined
that would facilitate clustering of error message types.
In order to quantify the quality of logfiles, logfile entropy has been defined. It is based on
Shannon’s information entropy but additionally incorporates the overlap of required and
given information in logfiles.
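The information-theoretic core of such a measure can be sketched as plain Shannon entropy over message types; the full logfile entropy of the thesis additionally weighs the overlap of required and given information, which is omitted here:

```python
import math
from collections import Counter

def logfile_entropy(message_types):
    """Shannon entropy (in bits) of the distribution of message types in a
    logfile.  This sketch covers only the information-theoretic core of the
    logfile entropy defined in the thesis."""
    counts = Counter(message_types)
    n = len(message_types)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A logfile consisting of a single repeated message type carries zero entropy, while a balanced mix of two types carries one bit per message.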
11.2.2 The Hidden Semi-Markov Model
In this thesis, a pattern recognition approach is applied to the task of online failure prediction. Hidden Markov models (HMMs) have been chosen as modeling formalism since,
first, HMMs have been used successfully in many advanced pattern recognition tasks,
and second, there is an appealing match of concepts from faults to hidden states and from
errors to observation symbols. However, temporal sequences, which are sequences in continuous time, are used as input data, whereas standard HMMs are not designed for continuous
time. Four ways in which standard HMMs can be used or extended to process continuous-time
sequences have been discussed. An extension of the stochastic process of hidden state
traversals seemed most promising due to a lossless representation of time and the power
to mimic the temporal behavior of the underlying stochastic process. In order to achieve
this, a new model has been proposed in this dissertation. Its key concepts and properties
are summarized in the following:
• HMMs have been combined with a semi-Markov process resulting in a hidden
semi-Markov model (HSMM). HSMMs combine the mature formalism and well-understood properties and algorithms of standard HMMs with a great flexibility to
specify the duration of transitions from one hidden state to the next.
• For sequence recognition the efficient forward algorithm has been adapted to
HSMMs. By this, sequence likelihood can be computed which is a probabilistic measure of similarity between the sequence under investigation and the set of
sequences the HSMM has been trained with. In order to find the most probable sequence of hidden states, the Viterbi algorithm has been adapted, as well. However,
this has not been of major concern for this thesis, although it might be of interest
for diagnosis.
• The HSMM can also be used for sequence prediction, which is not used for
online failure prediction here. However, this technique might be of interest for
diagnosis or other applications of the model.
• In order to train the model, the Baum-Welch algorithm used for standard HMMs
has been adapted to HSMMs. It belongs to the class of generalized expectation
maximization algorithms combining techniques from maximum likelihood estimation and gradient-based methods for optimization of transition duration distribution
parameters.
• Convergence of the training procedure has been proven based on the rather universal theory of EM algorithms, which employs lower bound optimization resulting in
a so-called Q-function. The specific Q-function for HSMMs has been derived and
by partial differentiation and application of Lagrange multipliers it has been shown
that the algorithm converges at least to a local maximum of training sequence likelihood.
• For the specific task of online failure prediction, a dedicated topology of HSMMs
is used. Failure prediction models employ a chain-like, or left-to-right structure.
However, in order to deal with missing errors in training sequences, shortcuts are
included in the model. In order to deal with additional error messages (noise) that
have not been present in the training data, intermediate states are added after completion
of the training procedure: By this, model flexibility is increased without affecting
complexity of the training procedure. Experiments with the telecommunication data
have shown that one intermediate state per transition is most effective, although the
benefit lags behind expectations.
• In order to assess complexity of the algorithm, two cases must be distinguished:
application of HSMMs for online failure prediction during runtime, and training
of HSMM parameters. Application complexity of general HSMMs belongs to the
class O(N 2 L), where N denotes the number of states and L the number of symbols in the error sequence. However, due to the left-to-right structure used for
online failure prediction, complexity actually belongs to class O(N L). Theoretically, training complexity is O(N 4 L). However, the chain-like structure reduces
complexity to O(N 3 L). In order to remedy the problem of convergence to local
maxima, the entire training procedure is repeated 20 times with varying random
initialization.
• The forward algorithm of the HSMM developed in this thesis is much more efficient than previous extensions to continuous time. The main reason for this is that
previous extensions have mainly been developed in the area of speech recognition.
An in-depth comparison of the task of failure prediction with speech recognition
has revealed that for failure prediction a one-to-one mapping between states and
observation symbols can be assumed and temporal properties are mainly included
in the stochastic process of hidden state traversals. This analysis allows a strict
enforcement of the Markov assumption, which results in a forward algorithm that
is almost as efficient as its discrete-time counterpart. Furthermore, this approach
allows modeling time as transition durations rather than state sojourn times, which
offers more modeling flexibility.
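For reference, the discrete-time forward recursion that the HSMM forward algorithm is derived from can be sketched as follows (with the usual rescaling to avoid numerical underflow); the duration distributions that additionally weigh each transition in the HSMM variant are omitted:

```python
import numpy as np

def log_sequence_likelihood(pi, A, B, obs):
    """Scaled forward algorithm for a discrete-time HMM.

    pi:  initial state distribution, shape (N,)
    A:   state transition matrix, shape (N, N)
    B:   emission probabilities, shape (N, M)
    obs: list of observation symbol indices
    Returns log P(obs | model), computed in O(N^2 L).
    """
    alpha = pi * B[:, obs[0]]
    scale = alpha.sum()
    log_lik = np.log(scale)
    alpha = alpha / scale
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate and emit
        scale = alpha.sum()
        log_lik += np.log(scale)        # accumulate log of scaling factors
        alpha = alpha / scale
    return float(log_lik)
```

The chain-like, left-to-right topology used for failure prediction makes A sparse, which is what reduces the effective complexity from O(N²L) to O(NL) as stated above.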
11.2.3 Sequence Classification
The final step in a pattern recognition approach to online failure prediction is to classify
whether the current runtime state is failure-prone or not. Bayes decision theory has been
used in order to derive classification rules. More specifically,
• An introduction to Bayes decision theory has been given, including the proof that
classification error rate is minimal if each sequence is classified according to maximum posterior probability, as well as a minimum-cost classification rule.
• Since for real applications of hidden Markov models1 only logarithmic sequence
likelihood can be used, the Bayesian decision rule has been extended to a multiclass classification rule for log-likelihoods.
• By introducing the bias-variance dilemma, it has been shown why it is important
to control the trade-off between bias (which is how closely training can adapt to
the training data) and variance (which is how much the resulting model is dependent on the selection of training data). Several techniques have been discussed with
respect to their applicability to online failure prediction with HSMMs. In this dissertation, model order selection and background distributions, in combination with a
maximum amount of training data, have been applied.
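A minimal sketch of the resulting decision rule on log-likelihoods follows; the class priors and the tunable threshold correspond to the Bayesian rule above, while the cost terms of the minimum-cost variant are omitted for brevity:

```python
def classify(log_liks, log_priors, threshold=0.0):
    """Bayes decision on log-likelihoods: compute the log posterior score
    log p(seq | class) + log P(class) for each class model and warn about a
    failure only if the failure score exceeds the best non-failure score by
    more than `threshold` (the customizable knob swept in the precision /
    recall and ROC analyses).

    log_liks, log_priors: dicts mapping class name -> value; one class
    must be named 'failure'.
    """
    scores = {c: log_liks[c] + log_priors[c] for c in log_liks}
    fail = scores.pop("failure")
    best_other = max(scores.values())
    return fail - best_other > threshold
```

Raising the threshold trades recall for precision, which is exactly the sweep that produces the precision/recall and ROC curves discussed in the evaluation.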
11.3 Phase III: Evaluation Methods and Results for Industrial Data
Having developed the theoretical methodology, the third phase of the engineering cycle
is concerned with implementing it and performing experiments with data. This leads to a
solution that can be applied to a running system.
11.3.1 Evaluation Methods
Many different metrics exist that capture various aspects of failure prediction. The comprehensive overview and discussion of characteristics is one of this thesis’ contributions.
Metrics for prediction quality. Many metrics for the evaluation of prediction are based
on the contingency table, which classifies each prediction as either a true positive, false
positive, true negative, or false negative. A table has been presented listing a great variety
of metrics and their synonymous names. In this thesis, precision, recall, false positive
and true positive rate have been used. Additionally, the F-measure is used in order to turn
precision and recall into a single real number.
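The metrics used in the thesis can be computed directly from the four contingency-table counts:

```python
def prediction_metrics(tp, fp, tn, fn):
    """Contingency-table metrics used in the thesis: precision, recall
    (= true positive rate), false positive rate, and the F-measure
    (harmonic mean of precision and recall)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "tpr": recall, "fpr": fpr, "f": f_measure}
```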
1 As well as of their hidden semi-Markov extensions.
One of the major drawbacks of contingency table-based methods is that they are defined on a binary basis: a prediction is either positive (a failure warning) or negative (no
warning). However, many prediction methods such as the HSMM approach employ a
customizable threshold upon which the decision is based, and each threshold value may
result in a different contingency table and subsequently in different values for the associated metrics. Several plots address this problem: Precision / recall curves plot precision
over recall for various values of the decision threshold. In addition to the F-measure, the
point where precision and recall are equal can be used to turn them into a single number. A second well-known plot is the receiver operating characteristic (ROC) curve, where true
positive rate is plotted versus false positive rate. In order to turn this graph into a single
number, the integral under the ROC curve is used, which is called “area under curve”
(AUC).
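Given (false positive rate, true positive rate) points obtained from sweeping the decision threshold, the AUC can be sketched as a trapezoidal integral:

```python
def roc_auc(points):
    """Area under the ROC curve by trapezoidal integration over
    (fpr, tpr) points obtained from sweeping the decision threshold.
    The points are sorted by false positive rate first."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A perfect predictor yields an AUC of 1.0; a predictor no better than chance lies on the diagonal with an AUC of 0.5.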
A new type of graph has been introduced in this thesis: accumulated runtime cost
graphs plot prediction cost as it accumulates over runtime. In contrast to contingency
table-based metrics, which imply mean values, accumulated runtime cost graphs reveal
a temporal aspect of prediction since it can be seen when, e.g., false positive predictions
have occurred. Furthermore, any predictor can be compared to an oracle predictor, a
perfect measurement-based predictor, a system without predictor, and maximum cost.
In summary, it should be pointed out that there is no single perfect evaluation metric.
For example, precision and recall do not account for true negative predictions. AUC
weights all threshold values equally, which can result in cases where a predictor with better
AUC incurs higher cost. Accumulated runtime cost graphs are sensitive to the relative
distribution of cost, which can be chosen such that the graph is altered significantly.
Evaluation process and statistical confidence. One of the major problems with many
machine learning approaches is that a lot of parameters are involved that are not directly
optimized by the training procedure. For example, the length of the data time window is
usually assumed to be fixed, but it is not clear what size of the window results in optimal
prediction quality. Moreover, several parameters are dependent so that each combination
of all values for all parameters would have to be tested and evaluated with respect to
final prediction performance. Since more than 15 parameters are involved, such an approach
would result in tremendous computation times. For this reason, a mixed approach has
been applied in this thesis: Parameters that could be set by separate experiments have
been optimized separately (greedy approach) while other parameters have been optimized
in combination with dependent ones that cannot be determined in a greedy way.
Three types of data sets have been used in the experiments:
• Training data is used as input data for the training procedure.
• Validation data is used to assess and control overfitting.
• Test data is used for final out-of-sample assessment of failure prediction performance.
Even though a lot of data has been available for the telecommunication system, the
amount of failure data is still limited. A fixed division into three data sets of equal size
would not result in a sufficient estimation of real prediction performance. The standard
solution to this kind of problem is called m-fold cross validation, where the data is divided
into m parts, and m − 1 parts are used for training / validation while the remaining part is
used for test. This procedure is repeated m times such that each of the m parts is used for
testing once. Training data is then further divided into training and validation data in the
same way.
In order to obtain an estimate of confidence intervals, cross validation is combined
with a technique called bootstrapping. Other confidence interval estimation techniques
have been investigated, too, but are either not applicable (such as assuming a Bernoulli experiment or normal distributions), or are not flexible enough to be applied to large
datasets (such as the jackknife).
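The m-fold splitting described above can be sketched as follows; the nested split of training data into training and validation parts works the same way and is omitted:

```python
import random

def m_fold_splits(items, m, seed=0):
    """Yield (train, test) index lists for m-fold cross validation: the data
    is divided into m parts, m-1 parts are used for training / validation
    and the remaining part for testing, rotating m times so that each part
    is used for testing exactly once."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::m] for k in range(m)]
    for k in range(m):
        test = folds[k]
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, test
```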
11.3.2 Results for the Telecommunication System Case Study
Industrial data of a commercial telecommunication system has been analyzed in order
to assess the potential to predict failures of the system. The entire modeling procedure
has been described and analyzed in detail, from the first steps of data preprocessing to a
detailed analysis of the influence of the modeling parameters on final prediction quality. The most relevant
results are provided here.
Data preprocessing. The main goal was to investigate whether the assumptions made
in the theoretical development of the methodology fit reality as observed in the industrial
data. In particular, findings included
1. The proposed procedure to assign error IDs to error messages is relatively robust.
The majority of assignments is unambiguous. The procedure reduced the number
of different message types from 1,695,160 to 1,435.
2. It seems safe to determine tupling window size by the procedure proposed by Iyer
& Rosetti. The expected bend in the number of resulting tuples can be identified
clearly.
3. Agglomerative clustering with Ward's method should be used to group failure sequences. The number of states for the HSMMs used to compute sequence likelihoods should be chosen to be approximately √L, where L denotes the maximum
length of the majority of failure sequences. The weight assigned to background
distributions should be chosen rather small; a value of 0.1 has been used in experiments.
4. Noise filtering works best if a global prior estimated from the entire training data
set is used. Experiments indicate that the proposed noise filtering mechanism can
distinguish between signal and noise. The filtering threshold should be chosen to
be slightly above a plateau in average sequence length that has been observed in
the data of the case study. Furthermore, experiments support two principles observed
by Levy & Chillarege: prior to a failure, the mix of errors changes and a few errors
outnumber their expected value heavily.
Analysis of the preprocessed dataset. After preprocessing, the resulting dataset has
been investigated with respect to the following characteristics:
• Error frequency varies heavily in the data set. However, no correlation between
the number of errors per time unit and the occurrence of failures can be observed.
Hence straightforward counting and thresholding techniques do not seem appropriate.
• Delays between errors can be approximated best by a mixed probability distribution
consisting of an exponential and a uniform distribution.
• An analysis of the distribution of time-between-failures revealed that frequently
used distributions such as exponential, gamma or Weibull do not fit the data very
well. For this reason, failure prediction or preventive maintenance techniques that
simply rely on lifetime distributions are most likely doomed to fail. Furthermore,
an autocorrelation analysis shows that there is no periodicity in the occurrence of
failures. That is why periodic techniques cannot achieve good prediction results.
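The exponential-plus-uniform mixture mentioned above has the density sketched below; the weight w, the exponential rate, and the uniform support are free parameters that would be fitted to the data:

```python
import math

def mixture_density(t, w, rate, t_max):
    """Density of an exponential + uniform mixture for inter-error delays:
    with weight w an Exp(rate) component, with weight 1-w a
    Uniform(0, t_max) component."""
    expo = rate * math.exp(-rate * t) if t >= 0 else 0.0
    unif = 1.0 / t_max if 0 <= t <= t_max else 0.0
    return w * expo + (1 - w) * unif
```

The exponential component captures the bursts of short delays, while the uniform component accounts for the heavy tail of long delays that a pure exponential fit underestimates.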
Model parameters. Quite a few parameters are involved in the modeling step. Parameters have been divided into two groups:
• Parameters that can be fixed heuristically in a greedy manner. This group includes
probability mass and distribution of intermediate states, the number of iterations of
the Baum-Welch algorithm, and the type of the background distribution.
• Parameters that can only be evaluated by training a model and testing prediction
performance on test data. This group includes the number of states of the HSMM,
the maximum number of states that are skipped by shortcuts, the number of intermediate states that are added to the model after training, and the amount of background
weight applied after training.
Parameters of the second group have been investigated with respect to F-measure and
their effect on computation times. Best results have been achieved with a model of 100
states, shortcuts bypassing one state, one intermediate state per transition, and a background weight of 0.05.
The optimal set of parameters has then been investigated further and precision / recall
plots, ROC curves, AUC, and cost curves have been provided. At the threshold value of
maximum F-measure (0.66), precision of 0.70, recall of 0.62, and false positive rate of
0.016 have been achieved. AUC was equal to 0.873.
Application-specific parameters. Two modeling parameters depend on the application
rather than the model itself: lead-time (i.e., how far in the future failures are
predicted) and data window size (i.e., how much data is used for prediction). An analysis
of these two parameters has shown that prediction quality stays approximately at the same level
for lead times of up to 20 minutes and drops quickly for longer lead
times. With respect to the size of the data window, model quality in general becomes better
if longer sequences are taken into account. However, mean processing time increases
heavily for longer sequences, putting a limit on the size of the data window.
Sensitivity analysis. Large-scale computer systems such as the telecommunication system are highly configurable and undergo repetitive updates. In order to assess sensitivity
to these issues, the approach has been tested in two ways:
• Dependence of prediction quality on the size of training dataset. Many stochastic
estimators such as the mean yield unreliable results if the number of data points is
decreased. A similar effect was observed in the case study. By reducing the size of
the training data set in two steps, results remained stable for the first step but failure
prediction quality broke down after the second reduction. Not surprisingly, mean
training time was also reduced for smaller training data sets.
• Dependence on changing system configurations and model aging. Since with offline
batch learning, parameters of the HSMM are trained once, behavior of the running
system will be increasingly different from system behavior at training time with
every change to configuration and every update. This effect has been simulated by
an increasing time gap between selected training and test data. Experiments have
shown that mean maximum F-measure decreases almost linearly with increasing
size of the gap. Additionally, it has been observed that confidence intervals obtained from bootstrapping get wider, which can be explained by the fact that with increasing
gap size more and more sequences are significantly different from the training data.
• Grouping of failure sequences has been applied in order to separate failure mechanisms. However, partitioning the set of failure sequences results in fewer training
sequences for each model, which in turn may deteriorate the HSMM parameter estimation involved in the training procedure. In order to check whether this is the case,
an HSMM failure predictor with only one failure group model has been trained. Results for this model have been significantly worse, supporting the assumption that the
HSMMs can adapt better to the training sequences if failure sequences are grouped
according to their similarity.
Comparative analysis. The HSMM-based failure prediction approach has been compared to the most promising and well-known failure prediction approaches in this area:
the dispersion frame technique (DFT) by Lin & Siewiorek, the Eventset method
by Vilalta & Ma, and SVD-SVM by Domeniconi et al. DFT only evaluates the time of
error occurrence, while Eventset and SVD-SVM only investigate the type of errors that
occur.2 In contrast, HSMM-based failure prediction investigates both the time of error occurrence and the error type: it treats input data as a temporal sequence. In order to provide a
comparison with a very simple prediction method, periodic prediction based on MTBF
has also been included in the comparison.
Standard, discrete-time HMMs can be used for failure prediction, too. In order to
assess the gain in prediction performance achieved by introducing a semi-Markov process,
the prediction performance of standard HMMs has been tested, too. Additionally, HSMM-based failure prediction has been compared to failure prediction based on universal basis
functions (UBF) developed by Günther Hoffmann, although UBF prediction belongs to
a different class of prediction algorithms operating on equidistant monitoring of system
variables.
In summary, it can be concluded from the comparative analysis that HSMM-based
failure prediction outperforms other failure prediction approaches significantly. However,
2 Although SVD-SVM can in principle incorporate both time and type of error messages, prediction actually deteriorates if time is included.
improved failure prediction comes at the price of computational complexity: Model training consumes 2.38 times and online prediction 224.5 times as much time as the slowest
comparative approach. Nevertheless, the approach demonstrates what prediction performance is achievable with error-event triggered online failure prediction.
11.4 Phase IV: Dependability Improvement
Failure prediction is not worth the effort if it does not help to improve system dependability. In order to improve dependability, failure prediction must be coupled with subsequent
actions that are performed once an upcoming failure has been predicted. This is called
proactive fault management. However, the focus of this thesis is on failure prediction
and therefore, only a theoretical analysis of the effect of proactive fault management on
system dependability has been provided.
11.4.1 Proactive Fault Management
Two strategies exist for improving system dependability in case of a predicted
upcoming failure:
• Downtime avoidance techniques try to prevent the failure. Their goal is to achieve
continuous operation. Three groups of downtime avoidance techniques have been
identified: state clean-up, preventive failover, and load lowering.
• Downtime minimization techniques can be further divided into two subgroups: reactive techniques let the predicted failure happen; however, the system is prepared
for its occurrence such that time-to-repair is reduced. This is achieved by either one
or both of two effects: (a) reconfiguration time can be shortened if an upcoming
failure is anticipated, and (b) the time needed for recomputation can be reduced.
On the other hand, proactive techniques actively trigger repair actions such as a
restart, turning unplanned downtime into planned downtime, which is expected to
be shorter or to incur less cost.
Several examples for all types of techniques have been given.
11.4.2 Models
Based on the continuous-time Markov chain (CTMC) model for software rejuvenation (i.e.,
preventive restart of components or the entire system) introduced by Huang et al., two
CTMC models have been developed: The first model is used to compute steady-state system availability, while the second simplified model is used to compute system reliability
and hazard rate.
It has been shown how the rates of the model can be computed from eleven parameters,
of which four are application-specific and hence assumed to be fixed. The remaining seven
modeling parameters are: precision, recall, false positive rate, failure probability given a
true positive, false positive, and true negative prediction, and repair time improvement
factor. Using these parameters, closed-form solutions for steady-state system availability,
reliability and hazard rate have been derived.
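Once a generator matrix has been assembled from those parameters, steady-state availability follows from the stationary distribution. The generic solver below is a sketch and does not reproduce the specific closed-form model of the thesis:

```python
import numpy as np

def steady_state_availability(Q, up_states):
    """Steady-state availability of a CTMC with generator matrix Q: solve
    pi Q = 0 subject to sum(pi) = 1 and add up the probability mass of the
    'up' states.  In the thesis, the rates in Q are derived from the seven
    modeling parameters (precision, recall, ...); here a generic generator
    is assumed."""
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])   # pi Q = 0  and  sum(pi) = 1
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(pi[up_states].sum())
```

For a two-state up/down chain with failure rate 1 and repair rate 9, this yields the familiar availability μ/(λ+μ) = 0.9.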
11.4.3 Parameter Estimation
A procedure has been described for how the various parameters can be estimated from experiments. The procedure consists of four experiments, two of which include fault injection
in order to ensure that the prediction of a failure is a true positive.
11.4.4 Case Study and an Advanced Example
As part of a diploma thesis, an experimental environment has been set up in which simple proactive fault
management techniques have been applied to an open web-shop demo application on
the basis of Microsoft .NET. Specifically, preventive restarts at the application as well as at the
system level have been used as downtime minimization techniques, and delivering a page
stating that the server is temporarily busy has been used as a technique to avoid downtime
by relieving the system.
The parameter estimation procedure has been applied to the data recorded in the experiments. However, neither of the two proactive techniques has been able to improve
system availability, reliability or hazard rate (in the long term). The main reason for
this is that the implemented failure prediction algorithm (which was not HSMM-based
prediction but a simple threshold-based method) has not been able to provide sufficiently
good predictions. Furthermore, neither type of action has proven successful:
Instead of a reduced downtime, restarting took eleven times as long as downtime incurred
by a failure, and at every second true positive prediction, a failure occurred even though
system relieving was in place.
Since the experiments have not resulted in improved system dependability, a more
sophisticated example has been provided. In this second example, the values estimated
from the telecommunication case study have been used for prediction quality. Additionally, better but still realistic values have been assumed for the other parameters. This
scenario resulted in a considerable improvement in dependability: unavailability was cut
by half and reliability as well as hazard rate have been significantly improved.
11.5 Main Contributions
In summary, a novel failure prediction approach has been developed that has strong foundations in stochastic pattern recognition rather than heuristics, and that outperforms well-known prediction techniques when applied to industrial data from a commercial telecommunication system of considerable size. On the way to this result, several contributions to the
state of the art have been achieved:
• In the fundamental relationship between faults, errors, and failures, side-effects of
faults are missing. In this dissertation, side-effects of faults are termed symptoms,
and the fundamental concept has been extended accordingly.
• A comprehensive taxonomy of online failure prediction methods has been introduced. Based on the taxonomy, an in-depth survey of online failure prediction techniques has been presented, including research areas that have not been explored for
the objective of failure prediction.
• The failure prediction method developed in this thesis is the first to apply pattern
recognition to error event-driven time sequences (temporal sequences).
• A novel extension of hidden Markov models to incorporate continuous time has
been developed. Since previous extensions to continuous time have focused on
equidistant time series, the extension presented here is the first to specifically address temporal sequences as input data.
• A novel model to theoretically assess dependability of proactive fault management,
which is prediction-driven fault tolerance, has been introduced. To our knowledge,
it is the first to incorporate correct and false predictions, as well as downtime avoidance and downtime minimization techniques. In addition to that, the model incorporates failures that are induced by proactive fault management itself, e.g., by the
additional load that is put onto the system.
Although not directly related to the failure prediction model, several other contributions
to the state of the art have been made:
• To the best of our knowledge, this thesis is the first to collect and discuss the various
evaluation metrics for prediction tasks.
• A novel methodology to identify failure mechanisms and to group failure sequences
has been developed. Although only used for data preprocessing in this thesis, the
approach might also be useful for diagnosis.
• To our knowledge the first measure to quantify the quality of logfiles has been
introduced. Due to its roots in information theory, the measure is called logfile
entropy.
11.6 Conclusions
In this dissertation, an effective online failure prediction approach has been proposed that
builds on the recognition of symptomatic patterns of error sequences. A novel continuous-time extension of hidden Markov models has been developed and the approach has been
applied to industrial data of a commercial telecommunication system. In comparison
to the best-known error-based failure prediction approaches, the proposed methodology
showed superior prediction accuracy. However, accuracy comes at the price of computational complexity. Although this is intuitively comprehensible, Legg [160] has investigated performance and complexity of prediction algorithms in a principled way. Based
on a universal formal theory for sequence prediction by Solomonoff [247, 248], which is
not computable in general, Legg has proven that predictors of a given predictive power
require some minimum computational complexity (see Figure 11.1). Another important
result of Legg’s work is that, although very powerful predictors exist for computable sequences, they are not provable due to Gödel incompleteness problems. In other words,
for provable algorithms, an upper bound with respect to predictive power exists. Hence,
maximum achievable predictive accuracy for the telecommunication case study might be
worse than 100% precision and 100% recall, and HSMM-based failure prediction is even
closer to the optimum than it appears.
Figure 11.1: Trade-off between predictive power and complexity. It can be shown that for a
given complexity, there is an upper bound on predictive power. Hence, there is
also an upper bound on predictive power achievable by algorithms with provable
complexity. However, it can also be shown that algorithms with better predictive
power exist but their complexity is unprovable. The HSMM-based prediction algorithm lies within the hatched area (Legg [160]).

The starting point of the model’s development was an analysis of key properties of complex, component-based, non-distributed software systems, and the failure prediction approach has been designed with these properties in mind. Hence, HSMMs should also
show very good prediction results if applied to other systems sharing the same properties. Additionally, HSMMs can be adapted to different situations by adjusting the various
parameters involved in modeling. Furthermore, since they are a general contribution to
event-driven temporal sequence processing, HSMMs might achieve similarly
outstanding results in other application domains beyond failure prediction as well.
Chapter 12
Outlook
As is the case with most projects, there is always room for further investigations and
improvements. In this chapter, some potential and promising directions are highlighted,
starting from technical issues concerning how the proposed hidden semi-Markov model
(HSMM) could be further improved and successively widening the scope.
12.1 Further Development of Prediction Models
The survey of online prediction models (see Chapter 3) has shown that quite a few prediction models have been developed in the past, but also that there are several areas that seem
promising to explore. The discussion along the branches of the taxonomy is not reiterated
here; rather, the focus is on more sophisticated machine learning techniques.
12.1.1 Improving the Hidden Semi-Markov Model
More sophisticated optimization techniques than the gradient-based approach could be used for
estimation of transition duration parameters in the Baum-Welch algorithm for HSMMs.
For example, second-order optimization algorithms such as Newton’s method or quasi-Newton methods such as Broyden-Fletcher-Goldfarb-Shanno (BFGS) and Davidon-Fletcher-Powell (DFP) could be applied. In this thesis, the problem of local maxima has
been addressed by simply running the Baum-Welch algorithm several times. A more sophisticated solution would, for example, apply an evolutionary optimization strategy. Additionally, the EM training algorithm used in this dissertation does not alter the structure
of the HSMM. Extended algorithms such as state pruning, which also alter the topology
of an HSMM, may be investigated.
The black-box approach can actually be turned into a gray-box approach by adding
failure group models that are constructed manually. More specifically, if it is known from
system design that activation of a special failure mechanism results in a unique sequence
of errors, an additional failure group model can be built that is specifically targeted to
this sequence. Variations and uncertainties in time as well as in error symbols can be
modeled by transition and observation probabilities. The resulting additional failure group
model can be seamlessly integrated with the models obtained from data-driven machine
learning. Referring to Figures 2.9 and 2.10 on Pages 19 and 20, respectively, the hand-built model would simply be added as model u + 1. By this procedure, the purely data-driven modeling approach described in this thesis is turned into a hybrid machine-learning /
analytic modeling approach.
12.1.2 Bias and Variance
Controlling bias and variance means controlling the trade-off between under- and overfitting. As has been mentioned in Chapter 7, algorithms such as bagging and boosting can
be applied to HSMMs as well.
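As a sketch of how bagging could be combined with HSMM-based prediction, the following Python fragment trains each ensemble member on a bootstrap resample of the training sequences and lets the members vote on a failure warning. The predictor and training interfaces are hypothetical stand-ins, not the ones used in this thesis.

```python
import random

def bagged_failure_warning(sequence, models, threshold=0.5):
    """Majority vote over an ensemble of failure predictors.

    Each model is a callable returning True (failure warning) or False
    for a given error sequence -- a hypothetical interface.
    """
    votes = sum(1 for m in models if m(sequence))
    return votes / len(models) >= threshold

def train_bagged_models(sequences, train_fn, n_models=5, seed=0):
    """Train each ensemble member on a bootstrap resample of the training set."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        resample = [rng.choice(sequences) for _ in sequences]
        models.append(train_fn(resample))
    return models
```

The vote threshold trades false positives against false negatives in the same way as the classification threshold discussed in Chapter 7.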
A further technique controlling the bias-variance trade-off is called regularization (see,
e.g., Bishop [30]). Regularization usually denotes a technique where the optimization objective is augmented by a term putting a penalty on model complexity or specificity, such
as curvature in regression problems. Regularization can in principle also be applied to
HSMMs. In order to do so, the Baum-Welch algorithm would have to be changed such
that the optimization objective, which is training sequence likelihood, is augmented by a
complexity / specificity term. For example, a penalty could be put on setting transition or
observation probabilities to zero. Another approach is to introduce a prior probability distribution over the values of model parameters, as has been done by Hughey & Krogh
[128]. However, regularization changes the model rather deeply at its core, and similar results can very likely be achieved by other techniques such as background distributions,
which have been used in this thesis.
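One possible way to write down such a regularized training objective (the notation below is ours, for illustration only) is

$$\hat{\theta} = \operatorname*{arg\,max}_{\theta}\; \Big[ \log P(O \mid \theta) \;-\; \nu\, C(\theta) \Big], \qquad C(\theta) = -\sum_{i,j} \log a_{ij},$$

where $\log P(O \mid \theta)$ is the training sequence log-likelihood maximized by the Baum-Welch algorithm, $C(\theta)$ grows without bound as any transition probability $a_{ij}$ approaches zero, and $\nu \ge 0$ controls the strength of the penalty and thereby the bias-variance trade-off. The prior-based approach of Hughey & Krogh [128] corresponds to the special case $C(\theta) = -\log P(\theta)$ for a prior distribution $P(\theta)$ over model parameters.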
It is common knowledge that every single modeling technique is well-suited for some
problems but performs worse on others. This is called the inductive bias of a modeling
technique. Meta-learning (see, e.g., Vilalta & Drissi [267]) makes use of the different inductive biases of several modeling techniques. For example, one meta-learning technique
assigns a new problem to the base-learner with the most appropriate inductive bias. This
has been shown to improve failure prediction significantly in Gujrati et al. [111], even
though a very simple meta-learning algorithm has been applied.
12.1.3 Online Learning
Systems undergo permanent updates and configuration changes. With each such
step, the failure behavior of the system might change. The consequence is that the models
obtained from training become more and more outdated. One solution to this problem
is online learning. In online learning, the model is permanently updated such that it adapts
to the changes in the system. A straightforward solution to online learning for HSMMs
would be to collect new failure and non-failure sequences at runtime and to periodically
train new models in the background. However, most likely more sophisticated approaches
can be applied.
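A minimal sketch of such periodic background retraining, with a stand-in `train_fn` in place of real HSMM training (all names and the batching policy are illustrative):

```python
from collections import deque

def online_retraining_loop(stream, train_fn, batch_size=100, max_history=1000):
    """Yield a freshly trained model whenever `batch_size` new labeled
    sequences have been collected from the running system; `train_fn` is a
    hypothetical stand-in for HSMM training, and sequences older than
    `max_history` are forgotten so the model tracks the current behavior."""
    history = deque(maxlen=max_history)
    since_last_training = 0
    for labeled_sequence in stream:
        history.append(labeled_sequence)
        since_last_training += 1
        if since_last_training >= batch_size:
            since_last_training = 0
            yield train_fn(list(history))
```

The bounded history implements the forgetting of outdated system behavior; more sophisticated schemes could weight recent sequences more strongly instead of discarding old ones outright.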
12.1.4 Further Issues
Prediction in continuous time. In this dissertation, failures have been predicted with
a fixed lead-time ∆tl . However, if the error sequence under investigation is assumed to
be the start of a temporal sequence, sequence prediction techniques (cf. Section 6.2.2)
can be used to determine the continuous cumulative probability of failure occurrence over
time. Such an approach is beneficial if several proactive actions are available in a system that
imply different warning times (i.e., minimum lead-times): failure prediction would have
to be performed only once rather than once for every lead-time.
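In notation loosely following the thesis (the symbols here are illustrative), the quantity to be computed would be the cumulative failure probability

$$F(\Delta t) = P\big(T_F \le t_0 + \Delta t \mid o_1, \dots, o_n\big), \qquad \Delta t \ge 0,$$

where $T_F$ denotes the random time of failure occurrence, $t_0$ the time of the last observed error, and $o_1, \dots, o_n$ the error sequence under investigation. Each proactive action with warning time $\Delta t_w$ could then be assessed by evaluating $F$ once at $\Delta t_w$.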
Conditional random fields. Markov models in general are subject to the so-called label
bias problem (see, e.g., Lafferty et al. [152]). The problem is that the entire probability
mass is distributed among successor states. Hence, if a state has only one successor, the
stochastic process transits with probability one to the next state. If there are two successors and both are equally likely, the process proceeds to either with a probability of roughly
one half. It follows that sequence likelihood depends on the number of outgoing transitions. This problem is not that urgent for HSMMs, since, first, the model topology
is rather symmetric (most states have the same number of successor states) and, furthermore, transition probability is also determined by the duration of the transition; nevertheless, the principal restriction still applies. In recent years, new stochastic models have been developed,
among which conditional random fields (CRF) are promising candidates. These models
have a second important advantage: the objective function is convex, which guarantees
that the training procedure converges to a global rather than a local maximum. However,
these models are rather new and experience with them is limited, which is why they have not been
considered in this thesis.
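The effect can be illustrated with a toy computation: scoring a state path by the product of its transition probabilities, a pure chain (one successor per state) assigns probability one to its only path, while a branching topology spreads the mass over alternatives. The topologies below are made up solely for this illustration.

```python
def path_probability(transitions, path):
    """Score a state path by the product of its transition probabilities."""
    p = 1.0
    for s, t in zip(path, path[1:]):
        p *= transitions[s][t]
    return p

# A chain where every state has exactly one successor: the only possible
# path always gets probability one, regardless of the observations.
chain = {0: {1: 1.0}, 1: {2: 1.0}}

# A topology where each state has two equally likely successors: a path of
# the same length now scores 0.5 * 0.5 = 0.25.
branching = {0: {1: 0.5, 2: 0.5}, 1: {3: 0.5, 4: 0.5}}

print(path_probability(chain, [0, 1, 2]))      # 1.0
print(path_probability(branching, [0, 1, 3]))  # 0.25
```

This is why path scores from topologies with different out-degrees are not directly comparable, which is the core of the label bias problem.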
Input variables. In this dissertation, error events have served as input data to the hidden
semi-Markov model. However, as the title of the thesis indicates, any event-based data
source may be used as well. For example, by defining a threshold, any (equidistant) monitoring of system variables such as memory consumption or workload can be turned into
an event-based data source. Although this has not been applied in this thesis, it might be
a valuable solution for systems that do not have fine-grained fault detection installed.
In many machine learning applications, the problem of variable selection is an important issue. For online failure prediction based on symptom monitoring, Hoffmann
et al. [122] have shown that a good selection of variables can be even more decisive than
a sophisticated choice of modeling technique. In the course of this thesis, some experiments with different sets of error-message variables have been performed. However,
results could not be improved. The main reason for this is that, in contrast to symptom
monitoring, not all variables are available all the time: each error message may contain
a different set of variables. Hence, in order to successfully apply variable selection techniques, extra care must be taken with missing variables. Since most existing variable
selection algorithms do not handle this case, further research is needed.
Mining rare events. A further issue is related to the problem that failure sequences
are rare in comparison to non-failure sequences. Weiss [276] has comprehensively investigated this topic, even though the main focus has been on data mining. Many of the
proposed techniques, e.g., training failure models on the rare class only and using evaluation metrics that are robust to rare classes, such as precision and recall, have been applied in this thesis. However, other techniques, such as advanced sampling methods, could additionally be
applied.
Distributed systems. This dissertation has focused on centralized systems only. However, distributed systems are also important and should be considered. Owing to its
ability to flexibly incorporate timing behavior, to model interdependencies of more or
less isolated entities, and to handle missing events or permutations in their order, HSMM-based failure prediction seems to be a good candidate for failure prediction in distributed
systems.
Design for predictability. It has been assumed throughout this thesis that the system is
fixed and given, and that failure prediction algorithms have to adapt to its specifics. However, in
the future, it may also be the other way round: “designing for predictability” may be considered from the very beginning of the software development process. At the current stage,
it is not yet clear what characteristics of a software design make failures predictable, and
further research is needed. However, it can be concluded from this dissertation that if
error event-based failure prediction is to be applied, fine-grained fault detection has to be
embedded throughout the system.
12.1.5 Further Application Domains for HSMMs
The hidden semi-Markov model developed in this dissertation has been designed for the
processing of event-triggered temporal sequences. Therefore, HSMMs can be applied to
other problem domains as well. The key prerequisite is that observations
(input data) must occur in an event-driven way and input values must belong to a finite
set (observation symbols). There are supposedly many areas where HSMMs
can be applied, among which are
• Web user profiling. The click stream of a web user navigating through a site forms
a temporal sequence: each click is an event and, e.g., the requested URL is the
observation symbol. HSMMs might be used to distinguish between various types
of users (sequence recognition) or to predict the most probable URL that the web
user will click next (sequence prediction). Both could be used to dynamically adjust
web pages to the user’s needs and preferences.
• Shopping tour prediction. In a retail store, each time a customer puts an item into
the (technically enhanced) cart, an event is generated. The type of the event is
defined, e.g., by the id of the item, its location, etc. Temporal sequence processing
based on HSMMs might be used to, e.g., display context-sensitive advertisements.
In contrast to existing data-mining approaches, not only the set of items is relevant
but also the time when the customer has put each item into the cart. This would
make it possible to present advertisements along the customer’s anticipated route
through the shop or to enable predictive planning of cash counter personnel.
• Failure prediction in critical infrastructures. Many infrastructures that are used
every day (such as electricity, telecommunication, water supply, food transport) can,
in case of a failure, impose drastic restrictions on daily life or even pose a severe
threat to the health of many people. Failure prediction may be used to predict
infrastructure failures such that appropriate actions can be undertaken to prevent
or at least to alleviate them. HSMM-based failure prediction might prove to be
especially successful for infrastructures where only critical events, but no continuous
monitoring, are available.
12.2 Proactive Fault Management
The essence of proactive fault management is to act proactively rather than reactively to
system failures. In the context of this dissertation, techniques are considered that rely on
the prediction of upcoming failures. Even though there are techniques such as checkpointing that can be triggered directly by a failure prediction algorithm, subsequent diagnosis is required in order to investigate what is going wrong in the system, i.e., what
caused the failure that is imminent. Based on failure prediction and diagnosis results, a
decision needs to be made which of the implemented downtime avoidance or downtime
minimization techniques should be applied and when it should be executed in order to
remedy the problem (see Figure 12.1).
Figure 12.1: The steps involved in proactive fault management. After prediction of an upcoming failure, diagnosis is needed in order to find the fault that causes the upcoming
failure. Failure prediction and diagnosis results are used to decide upon which
proactive method to apply and to schedule its execution.
Both diagnosis and choice / scheduling of actions are complex problems that need to
be solved for proactive fault management to be most effective. Nevertheless, the following
paragraphs will discuss some issues that are related to HSMM-based failure prediction as
proposed in this dissertation.
Diagnosis. The objective of diagnosis is to find out where the fault is located (e.g., at
which component) and sometimes what caused it. Note that, in contrast to traditional diagnosis, in proactive fault management diagnosis is invoked by failure prediction,
i.e., when the failure is imminent but has not yet occurred. One idea how diagnosis could
be accomplished is to analyze the hidden semi-Markov models used for failure prediction:
since the HSMM approach makes use of several HSMM instances (one for non-failure
and several others for failure sequences), and each failure group model is targeted at a
failure mechanism, sequence likelihoods of the failure group models can be compared
in order to analyze which failure mechanism might be active in the system. The fault
might then be determined by an analysis of characteristic error messages, which might
also include identification of the most probable sequence of hidden states by applying the
Viterbi algorithm. Some parts of this analysis could even be precomputed after clustering
of training failure sequences.
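A sketch of this idea in Python, reduced to its core: given the log-likelihood that each failure group model assigns to the observed error sequence, the best-scoring model indicates the suspected failure mechanism. Group names and log-likelihood values below are made up for illustration only.

```python
def suspected_failure_mechanism(log_likelihoods):
    """Rank failure group models by the log-likelihood they assign to the
    observed error sequence; the best-scoring model points to the failure
    mechanism that is most likely active right now."""
    ranking = sorted(log_likelihoods.items(), key=lambda kv: kv[1], reverse=True)
    return ranking[0][0], ranking

mechanism, ranking = suspected_failure_mechanism(
    {"memory-leak group": -42.7, "network group": -55.1, "database group": -61.3}
)
print(mechanism)  # memory-leak group
```

The full ranking, rather than only the top entry, could be handed to subsequent root-cause analysis, since closely-scoring groups indicate ambiguous diagnoses.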
Scheduling of actions. The investigation of dependability enhancement presented in
Chapter 10 has been based on a binary classification of whether a failure is imminent or not.
However, in general, the decision which proactive technique to apply should be based on
an objective function taking the cost of actions, confidence in the prediction, and effectiveness and
complexity of actions into account in order to determine the optimal trade-off. For example,
to trigger a rather costly technique such as a system restart, the scheduler should be almost
sure about an upcoming failure, whereas for a less expensive action such as writing a
supplemental checkpoint, less confidence in the correctness of the failure prediction is
required. In contrast to many other failure prediction approaches, HSMM-based failure
prediction can support the scheduler by reporting the posterior probability (cf. Equation 7.1) rather than a binary decision whether a failure is coming up or not.
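Such an objective function could, in its simplest form, select the action with minimum expected cost given the reported posterior failure probability. The actions and cost figures below are purely illustrative assumptions, not taken from the thesis.

```python
def choose_action(p_failure, actions):
    """Pick the proactive action with minimum expected cost.

    `actions` maps an action name to a tuple
    (cost_of_action, cost_if_failure_strikes_anyway), where the second
    entry reflects how effective the action is.  All numbers are
    illustrative; the thesis does not prescribe a specific cost model.
    """
    def expected_cost(costs):
        c_action, c_residual_failure = costs
        return c_action + p_failure * c_residual_failure

    return min(actions, key=lambda a: expected_cost(actions[a]))

actions = {
    "do nothing": (0.0, 100.0),        # failure hits with full cost
    "extra checkpoint": (1.0, 30.0),   # cheap, moderately effective
    "restart": (20.0, 2.0),            # expensive, very effective
}
print(choose_action(0.05, actions))  # extra checkpoint
print(choose_action(0.9, actions))   # restart
```

As the sketch shows, a continuous posterior probability naturally selects cheap actions under low confidence and expensive but effective actions under high confidence, which is exactly the behavior described above.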
Both topics, diagnosis and scheduling, are challenges, each worth a separate dissertation
and raising a manifold of scientific questions. The crucial issue, however, is to bring proactive
fault management into practical applications in order to prove that system dependability
can be boosted by up to an order of magnitude by the proactive fault handling toolbox,
which is a combination of effective downtime avoidance and downtime minimization
techniques, diagnosis, action scheduling, and, last but not least, accurate online failure
prediction.
Part V
Appendix
Derivatives with Respect to Parameters for Selected Distributions

In order to compute the gradient used in hidden semi-Markov model training (cf. Section 6.3.2), partial derivatives with respect to the parameters of the transition duration distributions are needed. Derivatives for some commonly used distributions are provided in the
following. Note that cumulative parametric probability distributions are used to specify a
hidden semi-Markov model’s transition durations.
Exponential distribution. The cumulative distribution is given by
κij,r = 1 − e−λij,r dk .
The derivative with respect to λij,r is hence:
∂ 1 − e−λij,r dk = dk e−λij,r dk .
∂ λij,r
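As a quick numerical sanity check, the analytic derivative can be compared against a central finite difference (the parameter values are arbitrary test inputs):

```python
import math

def cdf(lam, d):
    """Cumulative exponential transition duration, kappa = 1 - exp(-lambda * d)."""
    return 1.0 - math.exp(-lam * d)

def analytic_derivative(lam, d):
    """d/d(lambda) of the cumulative distribution: d * exp(-lambda * d)."""
    return d * math.exp(-lam * d)

lam, d, h = 0.7, 2.5, 1e-6
numeric = (cdf(lam + h, d) - cdf(lam - h, d)) / (2 * h)  # central difference
assert abs(numeric - analytic_derivative(lam, d)) < 1e-8
```

The same check applies, mutatis mutandis, to the derivatives of the other distributions below.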
Normal distribution. No closed-form representation for the cumulative normal distribution $\Phi_{\mu,\sigma}(t)$ is known. However, it can be expressed using the so-called error function $\operatorname{erf}(t)$:
$$\operatorname{erf}(t) = \frac{2}{\sqrt{\pi}} \int_0^t e^{-\tau^2} \, d\tau, \qquad \frac{\partial \operatorname{erf}}{\partial t} = \frac{2}{\sqrt{\pi}} \, e^{-t^2}.$$
The cumulative normal distribution is then given by:
$$\Phi_{\mu,\sigma}(t) = \frac{1}{2} \left[ 1 + \operatorname{erf}\!\left( \frac{t - \mu}{\sqrt{2}\,\sigma} \right) \right].$$
In order to compute the partial derivatives of $\Phi$ let:
$$f_{\mu,\sigma}(t) := \frac{t - \mu}{\sqrt{2}\,\sigma}, \qquad \frac{\partial f}{\partial \mu} = -\frac{1}{\sqrt{2}\,\sigma}, \qquad \frac{\partial f}{\partial \sigma} = \frac{\mu - t}{\sqrt{2}\,\sigma^2}$$
and hence,
$$\frac{\partial \Phi}{\partial \mu} = \frac{1}{2} \, \frac{2}{\sqrt{\pi}} \, e^{-f^2} \left( -\frac{1}{\sqrt{2}\,\sigma} \right) = -\frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{t^2 - 2t\mu + \mu^2}{2\sigma^2} \right)$$
$$\frac{\partial \Phi}{\partial \sigma} = \frac{\mu - t}{\sqrt{2\pi}\,\sigma^2} \exp\!\left( -\frac{t^2 - 2t\mu + \mu^2}{2\sigma^2} \right)$$
Log-normal distribution. Similar to the normal distribution, the cumulative log-normal distribution can be expressed using the error function:
$$\Psi_{\mu,\sigma}(t) = \frac{1}{2} \left[ 1 + \operatorname{erf}\!\left( \frac{\ln(t) - \mu}{\sqrt{2}\,\sigma} \right) \right].$$
Therefore, the derivatives are obtained analogously to the normal distribution:
$$\frac{\partial \Psi}{\partial \mu} = -\frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{\ln(t)^2 - 2\ln(t)\mu + \mu^2}{2\sigma^2} \right)$$
$$\frac{\partial \Psi}{\partial \sigma} = \frac{\mu - \ln(t)}{\sqrt{2\pi}\,\sigma^2} \exp\!\left( -\frac{\ln(t)^2 - 2\ln(t)\mu + \mu^2}{2\sigma^2} \right)$$
Pareto distribution. The cumulative distribution is determined by the location parameter $t_m$, which determines the minimum value for $t$, and a shape parameter $k$:
$$P_{t_m,k}(t) := 1 - \left( \frac{t_m}{t} \right)^k.$$
Using Maple™, the derivative with respect to $t_m$ can be determined, yielding
$$\frac{\partial P}{\partial t_m} = -\left( \frac{t_m}{t} \right)^k \frac{k}{t_m},$$
and the derivative with respect to $k$ is given by:
$$\frac{\partial P}{\partial k} = -\left( \frac{t_m}{t} \right)^k \ln\!\left( \frac{t_m}{t} \right).$$
Gamma distribution. The density of the gamma distribution is defined to be:
$$g_{k,\theta}(t) = t^{k-1} \, \frac{e^{-\frac{t}{\theta}}}{\theta^k \, \Gamma(k)},$$
where $\Gamma(k)$ denotes the gamma function. The cumulative distribution is given by:
$$G_{k,\theta}(t) = \frac{\gamma\!\left(k; \frac{t}{\theta}\right)}{\Gamma(k)},$$
where $\gamma$ denotes the (lower) incomplete gamma function. Differentiation of $G_{k,\theta}(t)$ with respect to $k$
as well as to $\theta$ is possible; however, the result comprises many different terms, which are
hence not displayed here. Rather, the result can be obtained by evaluating the following four
lines using Maple™:

gam := int(t^(a-1)*exp(-t), t=0..x);
CDFgam := subs(a=k,x=x/theta,gam) / GAMMA(k);
diff(CDFgam,k);
diff(CDFgam,theta);
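Alternatively, the θ-derivative can be cross-checked numerically in Python. The series helper below is our own standard expansion of the regularized lower incomplete gamma function (not part of the thesis), and the closed form in `dG_dtheta` follows from the chain rule:

```python
import math

def regularized_lower_gamma(a, x, terms=200):
    """Series expansion of P(a, x) = gamma(a; x) / Gamma(a); converges
    quickly for moderate x."""
    term = 1.0 / a
    total = term
    for n in range(1, terms):
        term *= x / (a + n)
        total += term
    return math.exp(a * math.log(x) - x - math.lgamma(a)) * total

def dG_dtheta(k, theta, t):
    """Chain-rule derivative of G(t) = P(k, t/theta) with respect to theta:
    -(t/theta)^k * exp(-t/theta) / (theta * Gamma(k))."""
    x = t / theta
    return -math.exp(k * math.log(x) - x - math.lgamma(k)) / theta

k, theta, t, h = 2.0, 1.5, 2.0, 1e-6
numeric = (regularized_lower_gamma(k, t / (theta + h))
           - regularized_lower_gamma(k, t / (theta - h))) / (2 * h)
assert abs(numeric - dG_dtheta(k, theta, t)) < 1e-6
```

Such a check is a cheap safeguard when the symbolic output of Maple™ is transcribed into training code.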
Erklärung

I hereby declare that
• I have written the present dissertation, “Event-based Failure Prediction: An Extended Hidden Markov Model Approach”, independently and without unauthorized assistance;
• I have not applied for a doctoral degree elsewhere, nor do I hold such a degree;
• I am aware of the doctoral degree regulations of the Mathematisch-Naturwissenschaftliche Fakultät II of Humboldt-Universität zu Berlin (published in Amtl. Mitteilungsblatt Nr. 34 / 2006).
Acronyms
AC – Agglomerative coefficient
AFS – Andrew file system
AGNES – Agglomerative nesting
ARMA – Auto-regressive moving average
ARX – Auto-regressive model with auxiliary input
AUC – Area under (ROC) curve
BCa – Bias corrected accelerated confidence intervals
BFGS – Broyden-Fletcher-Goldfarb-Shanno
BLAST – Basic local alignment search tool
CBE – Common base event
CDF – Cumulative distribution function
CHMM – Continuous hidden Markov model
CPU – Central processing unit
CRF – Conditional random fields
CTMC – Continuous time Markov chain
CT-HMM – Continuous time hidden Markov model
DC – Divisive coefficient
DET – Detection error tradeoff
DF – Dispersion frame
DFT – Dispersion frame technique
DIANA – Divisive analysis clustering
DTMC – Discrete time Markov chain
DVD – Digital versatile disk
ECDF – Empirical cumulative distribution function
ECG – Expectation conjugate gradient
EDI – Error dispersion index
EFDIA – Early failure detection and isolation arrangement
EM – Expectation maximization
ESHMM – Expanded state HMM
fMRI – Functional magnetic resonance imaging
FN – False negative
FOIL – First order ??? language (not spelled out in the paper)
FP – False positive
FPR – False positive rate
FRU – Field replaceable unit
FWN – Fuzzy wavelet network
GHMM – General hidden Markov model library
GPRS – General packet radio service
GSM – Global system for mobile communication
HMM – Hidden Markov model
HP – Hewlett-Packard
HSMM – Hidden semi-Markov model
HSMESM – Hidden semi-Markov event sequence model
HTTP – Hypertext transport protocol
IBM – International business machines
ID – Identifier
IHMM – Inhomogeneous hidden Markov model
IN – Intelligent network
IO – Input output
IP – Internet protocol
LDAP – Lightweight directory access protocol
LSI – Latent semantic indexing
MAP – Maximum a-posteriori
ML – Maximum likelihood
MOC – Mobile originated call
MSET – Multivariate state estimation technique
MTBF – Mean time between failures
MTBP – Mean time between predictions
MTTF – Mean time to failure
MTTP – Mean time to prediction
MTTR – Mean time to repair
NBEM – Naive Bayes expectation maximization
NF – Non-failure
OR – Odds ratio
PCA – Principal component analysis
PCF – Probability cost function
PCFG – Probabilistic context-free grammar
PFM – Proactive fault management
PR – Precision-recall
PWA – Probabilistic wrapper approach
QQ – Quantile quantile
RADIUS – Remote authentication dial in user service
RAID – Redundant array of independent disks
RBF – Radial basis function
RLC – Resistor inductor capacitor
ROC – Receiver operating characteristic
SAN – Stochastic activity network
SAP – Systems, applications, products
SAR – System activity reporter
SCF – Service control function
SCP – Service control point
SEP – Similar events prediction
SHIP – Software hardware interoperability people
SMART – Self-monitoring, analysis and reporting technology
SMP – Semi-Markov process
SMS – Short message service
SPRT – Sequential probability ratio test
SRN – Stochastic reward net
SSI – Stressor susceptibility interaction
STAR – Self-testing and repairing
SVD – Singular value decomposition
SVM – Support vector machine
TBF – Time between failures
TCP – Transmission control protocol
TN – True negative
TP – True positive
TTF – Time to failure
TTP – Time to prediction
TTR – Time to repair
UBF – Universal basis function
UML – Unified modeling language
UPGMA – Unweighted pair-group average method
URL – Uniform resource locator
UTC – Coordinated universal time
WSDM – Web services distributed management
Index
φ-coefficient, 166
a-priori algorithm, 49
abstraction, 6
accumulated runtime cost, 163
accuracy, 156
AdaBoost, 145
adaptive enterprise, 5
agglomerative coefficient, 151
aggregated models, 145
alarm, 41
alphabet, 56, 109
amount of background weight, 199
anomaly detectors, 33
approximation approach, 25
arcing, 145
area under curve (AUC), 164
autonomic computing, 5
background distribution, 81, 126, 145
backward algorithm
HMM, 59
HSMM, 105
bag-of-words, 50
bagging, 145, 278
banner plot, 151
Baum-Welch algorithm
HMM, 60
HSMM, 106
Bayes error rate, 135
Bayes prediction, 30
Bayesian prediction, 37
BCa, 171
bias and variance, 138
bias, 139
classification, 140
regression, 138
variance, 139
bias-variance dilemma, 140
boosting, 145, 278
bootstrapping, 170
boundary bias, 143
boundary error, 141
bug
Bohrbugs, 22
Heisenbugs, 22
Mandelbugs, 22
Schrödingbugs, 22
chaining effect, 83
checkpoint, 230
class skewness, 26
classification, 19
cost, 135
cost-based, 135
failure prediction, 136
likelihood ratio, 136
log-likelihood, 137
loss matrix, 135
multiclass log-likelihood, 138
rejection thresholds, 136
risk, 135
sequence likelihood, 136
clustering, 37
agglomerative, 81
complete linkage, 83
divisive, 81
failure sequences, 18
hierarchical, 81
nearest neighbor, 83
partitioning, 81
stopping rules, 82
unweighted pair-group average, 83
Ward’s method, 83
clusters, 83
collision, 77
common base event (CBE), 90
conditional random fields, 279
confidence, 49
confusion matrix, 153
containers, 15
contingency table, 152, 153
continuous output probability densities, 67
continuous time sequences, 63
convex combination, 98
cooperative checkpointing, 4
correct no-warning, 153
correct warning, 153
count encoding, 216
counting and thresholding prediction, 30
data mining, 41
data sets, 167
test, 167
training, 167
validation, 167
data window size ∆td , 136, 180, 208
decision
boundaries, 134
region, 134
surfaces, 134
defect trigger, 87
defect type, 87
delay symbols, 65
dendrogram, 149
detection error trade-off (DET), 159
diagnosis, 281
discrete time Markov chain (DTMC), 55
dispersion frame technique (DFT), 46
dissimilarity matrix, 80
distributed system, 16
divisive coefficient, 151
downtime avoidance, 227
downtime minimization, 227
duration, 97
early stopping, 144
engineering cycle, 6
entropy, 90
equilibrium state distribution, 243
ergodic topology, 81, 110
error, 10, 41
error function, 285
error patterns, 17
error type, 76
error-based failure prediction, 39
classifier, 45
frequency, 39
pattern recognition, 43
rule-based, 41
statistical tests, 45
event, 39
event type, 87
event-triggered temporal sequence, 46
eventset, 48
accurate, 49
frequent, 48
method, 42, 48
expectation conjugate gradient (ECG), 110
expectation maximization (EM), 61, 116
generalized, 110, 119
expected risk, 135
F-measure, 155
failure, 10
arbitrary, 14
computation, 14
crash, 14
omission, 14
performance, 14, 19
timing, 14
failure avoidance, 227
failure mechanism, 15, 18, 79
failure modes, 15
failure prediction, 11
online, 12
failure probable, 233
failure sequence, 79
clustering, 79, 182, 212
grouping, 79, 182, 212
failure warning, 153
failure windows, 48
false negative, 153, 228
false positive, 153, 228
false positive rate, 156
false warning, 153
fault, 10
auditing, 10
design, 21
detection, 10
intermittent, 21
monitoring, 10
permanent, 21
runtime, 21
transient, 21
fault injection, 250
fault intolerance, 23
fault model, 20
fault tolerance, 23
fault trees, 42
feature analysis, 38
feature selection, 36
first passage time distribution, 103
first step analysis, 104
forced downtime, 228
forward algorithm
HMM, 58
HSMM, 101
frequency of error occurrence, 39
function approximation-based prediction, 34
curve fitting, 34
genetic programming, 35
machine learning, 35
furthest neighbor, 83
G-measure, 165
generalized EM, 110, 119
Gini coefficient, 166
growing and pruning, 144
hidden Markov model (HMM), 56
basic problems of, 57
continuous (CHMM), 56, 67
continuous time (CT-HMM), 67
discrete, 56
hidden semi-Markov model (HSMM), 68, 95
complexity, 128
event sequence model (HSMESM), 69
expanded state (ESHMM), 69
exponentially-distributed durations, 68
Ferguson’s model, 68
gamma-distributed durations, 68
inhomogeneous (IHMM), 70, 116
Poisson-distributed durations, 68
proof of convergence, 116
reestimation formulas, 106
segmental, 69
structure of, 109
topology of, 109
Viterbi path constrained durations, 69
hierarchical numbering, 87
hybrid modeling approach, 278
incomplete data, 117
inductive bias, 278
information entropy of logfiles, 89
interdependencies, 15
intermediate states, 127, 145
jacknife, 170
kernels, 96, 98
label bias problem, 279
lead-time ∆tl , 180, 207
learning
batch, 25
offline, 25
online, 25
supervised, 25
leave-one-out, 170
lift, 166
likelihood, 133
logarithmic, 80, 101
load lowering, 229
logfile
entropy, 89
error ID assignment, 178
hierarchical numbering, 87
tupling, 179
type and source, 86
lower bound optimization, 117
m-fold cross validation, 143, 168
machine learning, 16
marginal, 118
margins for non-failure sequences, 180
Markov
assumptions, 56
properties, 56, 96
Markov renewal sequence, 95
kernel, 96
maximum cost, 164
maximum span of shortcuts, 199
median, 170
meta-learning, 278
minimal distance methods, 24
missing warning, 153
mixture of distributions, 98
mode, 170
model order selection, 144
monitoring-based classifiers, 29, 36
Bayesian classifier, 37
clustering, 37
statistical tests, 37
n-version programming, 4
no free lunch theorem, 25
noise filtering, 83, 188
non-failure sequences, 79
non-parametric prediction, 30
number of intermediates, 199
number of states, 199, 200
number of tries in each optimization step, 198
observation probabilities, 56
odds ratio, 157
online learning, 278
oracle, 163
out-of-sample, 167, 202
overfitting, 5, 140
pairwise alignment, 44
parameter setting
greedy, 166
non-greedy, 167
parameter tying, 145
pattern recognition-based prediction, 43
Markov models, 44
pairwise alignment, 44
probabilistic context-free grammar, 43
perfect predictor, 164
periodic prediction, 53, 217
Piatetsky-Shapiro, 166
positive, 153
posterior probability distribution, 133
precision, 154
precision recall break-even, 165
precision recall curves, 157
prediction
overview, 18
preparation, 227
preventive failover, 229
primal-dual method, 117
prior, 133
proactive downtime minimization, 229
proactive fault management, 5, 228
probabilistic context-free grammars (PCFG), 43
probabilistic wrapper approach (PWA), 36
properties of the data set, 221
Q-function, 118
quality of logfiles, 89
reactive downtime minimization, 229
recall, 154
receiver operating characteristics (ROC), 158
recovery oriented computing, 5
reestimation step, 129
regularization, 145, 278
rejuvenation, 4, 231
reliability model, 244
resamples, 170
responsive computing, 5
roll-backward scheme, 230
roll-forward scheme, 230
root cause, 10
analysis, 11
rule-based prediction, 41
data mining, 41
fault trees, 42
sample error rate, 169
SAR, 165
scaling, 101
self-* properties, 5
self-testing and repairing computer (STAR), 4
self-transitions, 64
semi-Markov process (SMP), 67, 95
sequence
generation, 25
likelihood, 18
prediction, 25
recognition, 25
sequence extraction, 79
sequence likelihood, 57
HSMM, 101
sequence prediction, 102
sequential decision making, 25
sequential pattern mining, 42
service degradation, 32
SHIP fault model, 23
shortcuts, 126, 145
signal processing, 39
similar events prediction (SEP), 5
single linkage, 83
singular value decomposition (SVD), 50
software aging, 5, 231
software components, 15
source, 87
speech recognition, 113
state clean-up, 229
state duration, 114
statistical confidence, 168
statistical methods, 25
steady-state availability, 243, 244
stratification, 168
structure, 109
supervised offline batch learning, 212
support, 49
support vector machines (SVM), 51
SVD-SVM, 50
symbol, 56, 76
symptoms, 10, 40
system configuration, 211
system model-based prediction, 32
anomaly detectors, 33
control theory, 33
stochastic, 32
temporal encoding, 216
temporal output, 66
temporal sequence, 15, 63, 115
temporal sequence pattern recognition, 53
test data, 167
test data set, 167
time series analysis, 38
feature analysis, 38
signal processing, 39
time series prediction, 38
time slotting, 64
time-varying internal process, 66
topology, 109
training
overview, 18
training data set, 167
training with noise, 143
transition
duration, 97
probability, 97
true negative, 153
true positive, 153
true positive rate, 156
truncation, 77
trustworthy computing, 5
tupling, 76, 77
two dimensional output, 66
Type I error, 153
Type II error, 153
type of background distributions, 198
underfitting, 140
universal basis functions (UBF), 219
unobservable data, 117
validation, 167
validation data set, 167
variable selection, 36, 279
Viterbi algorithm
HMM, 59
HSMM, 101
weighted relative accuracy, 165
Bibliography
[1] Abraham, A. & Grosan, C. Genetic programming approach for fault modeling of electronic hardware. In IEEE Proceedings Congress on Evolutionary Computation (CEC’05),
volume 2, 1563–1569. Edinburgh, UK, 2005
[2] Agrawal, R., Imieliński, T., & Swami, A. Mining association rules between sets of items
in large databases. In Proceedings of the 1993 ACM SIGMOD international conference on
Management of data (SIGMOD 93), 207–216. ACM Press, 1993
[3] Aitchison, J. & Dunsmore, I. R. Statistical Prediction Analysis. Cambridge University
Press, 1975
[4] Albin, S. & Chao, S. Preventive replacement in systems with dependent components. IEEE
Transactions on Reliability, volume 41(2): 230–238, 1992
[5] Aldenderfer, M. & Blashfield, R. Cluster Analysis. Sage Publications, Inc., Newbury Park
(CA,USA), 1984
[6] Alpaydin, E. Introduction To Machine Learning. MIT Press, 2004
[7] Altman, D. G. Practical Statistics for Medical Research. Chapman-Hall, 1991
[8] Altschul, S., Gish, W., Miller, W., Myers, E., & Lipman, D. Basic local alignment search
tool. Journal of Molecular Biology, volume 215(3): 403–410, 1990
[9] Amari, S. & McLaughlin, L. Optimal design of a condition-based maintenance model. In
IEEE Proceedings of Reliability and Maintainability Symposium (RAMS), 528–533. 2004
[10] Andrzejak, A. & Silva, L. Deterministic Models of Software Aging and Optimal Rejuvenation Schedules. In 10th IEEE/IFIP International Symposium on Integrated Network
Management (IM ’07), 159–168. 2007
[11] Apostolico, A. E. D. & Galil, Z. Pattern Matching Algorithms. Oxford University Press,
1997
[12] Ascher, H. E., Lin, T.-T. Y., & Siewiorek, D. P. Modification of: Error Log Analysis:
Statistical Modeling and Heuristic Trend Analysis. IEEE Transactions on Reliability, volume 41(4): 599–601, 1992
[13] Avižienis, A. Fault-tolerance and fault-intolerance: Complementary approaches to reliable
computing. In Proceedings of the international conference on Reliable software, 458–464.
ACM Press, New York, NY, USA, 1975
[14] Avižienis, A. The N-Version Approach to Fault-Tolerant Software. IEEE Transactions on
Software Engineering, volume SE-11(12): 1491–1501, 1985
[15] Avižienis, A., Gilley, G., Mathur, F., Rennels, D., Rohr, J., & Rubin, D. The STAR (Self-Testing And Repairing) Computer: An Investigation of the Theory and Practice of Fault-Tolerant Computer Design. IEEE Transactions on Computers, volume C-20(11): 1312–1321, 1971
[16] Avižienis, A. & Laprie, J.-C. Dependable computing: From concepts to design diversity.
Proceedings of the IEEE, volume 74(5): 629–638, 1986
[17] Avižienis, A., Laprie, J.-C., Randell, B., & Landwehr, C. Basic concepts and taxonomy of
dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, volume 1(1): 11–33, 2004
[18] Azimi, M., Nasiopoulos, P., & Ward, R. K. Offline and Online Identification of Hidden
Semi-Markov Models. IEEE Transactions on Signal Processing, volume 53(8): 2658–2663,
2005
[19] Babaoglu, O., Jelasity, M., Montresor, A., Fetzer, C., Leonardi, S., van Moorsel A., & van
Steen, M. (eds.). Self-Star Properties in Complex Information Systems, Lecture Notes in
Computer Science, volume 3460. Springer-Verlag, 2005
[20] Bai, C. G., Hu, Q. P., Xie, M., & Ng, S. H. Software failure prediction based on a Markov
Bayesian network model. Journal of Systems and Software, volume 74(3): 275–282, 2005
[21] Bao, Y., Sun, X., & Trivedi, K. Adaptive Software Rejuvenation: Degradation Model and
Rejuvenation Scheme. In Proceedings of the 2003 International Conference on Dependable
Systems and Networks (DSN’2003). IEEE Computer Society, 2003
[22] Bao, Y., Sun, X., & Trivedi, K. A workload-based analysis of software aging and rejuvenation. IEEE Transactions on Reliability, volume 54(3): 541–548, 2005
[23] Barborak, M., Dahbura, A., & Malek, M. The consensus problem in fault-tolerant computing. ACM Computing Surveys, volume 25(2): 171–220, 1993
[24] Basseville, M. & Nikiforov, I. Detection of abrupt changes: theory and application. Prentice Hall, 1993
[25] Baum, L. E. & Sell, G. R. Growth Transformations for Functions on Manifolds. Pacific
Journal of Mathematics, volume 27(2): 211–227, 1968
[26] Bazaraa, M. S. & Shetty, C. M. Nonlinear Programming. John Wiley and Sons, New York,
1979
[27] Berenji, H., Ametha, J., & Vengerov, D. Inductive learning for fault diagnosis. In IEEE
Proceedings of 12th International Conference on Fuzzy Systems (FUZZ’03), volume 1.
2003
[28] Bicego, M., Murino, V., & Figueiredo, M. A. T. A sequential pruning strategy for the
selection of the number of states in hidden Markov models. Pattern Recognition Letters,
volume 24(9–10): 1395–1407, 2003
[29] Bilmes, J. A. A Gentle Tutorial on the EM Algorithm and its Application to Parameter
Estimation for Gaussian Mixture and Hidden Markov Models. Tech. report ICSI-TR-97-021, U.C. Berkeley, International Computer Science Institute, Berkeley, CA, 1998
[30] Bishop, C. M. Neural Networks for Pattern Recognition. Oxford University Press, 1995
[31] Bland, J. M. & Altman, D. G. The odds ratio. British Medical Journal, volume 320(7247):
1468, 2000
[32] Blischke, W. R. & Murthy, D. N. P. Reliability: Modeling, Prediction, and Optimization.
Probability and Statistics. John Wiley and Sons, 2000
[33] Bonafonte, A., Vidal, J., & Nogueiras, A. Duration modeling with expanded HMM applied
to speech recognition. In IEEE Proceedings of the Fourth International Conference on
Spoken Language (ICSLP 96), volume 2, 1097–1100. 1996
[34] Borgelt, C. & Kruse, R. Induction of Association Rules: Apriori Implementation. In Proceedings of 15th Conference on Computational Statistics (Compstat 2002). Physica Verlag,
Heidelberg, Germany, 2002
[35] Bowles, J. A survey of reliability-prediction procedures for microelectronic devices. IEEE
Transactions on Reliability, volume 41(1): 2–12, 1992
[36] Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. Time Series Analysis: Forecasting and
Control. Prentice Hall, Englewood Cliffs, New Jersey, 3rd edition, 1994
[37] Bridgewater, D. Standardize Messages with the Common Base Event Model. 2004. URL
www-106.ibm.com/developerworks/autonomic/library/ac-cbe1/
[38] Brocklehurst, S. & Littlewood, B. Techniques for Prediction Analysis and Recalibration.
In Lyu, M. R. (ed.), Handbook of software reliability engineering, chapter 4, 119–166.
McGraw-Hill, 1996
[39] Bronstein, I. N., Semendjajew, K. A., Musiol, G., & Mühlig, H. Taschenbuch der Mathematik. Harri Deutsch, Frankfurt am Main, Germany, 6th edition, 2005
[40] Brown, A. & Patterson, D. Embracing Failure: A Case for Recovery-Oriented Computing
(ROC). In High Performance Transaction Processing Symposium. 2001
[41] Burckhardt, J. Griechische Kultur. Safari Verlag, Berlin, Germany, 1958
[42] Candea, G. The Enemies of Dependability I: Software. Technical Report CS444a, Stanford
University, CA, 2003
[43] Candea, G., Cutler, J., & Fox, A. Improving Availability with Recursive Microreboots: A
Soft-State System Case Study. Performance Evaluation Journal, volume 56(1-3), 2004
[44] Candea, G., Delgado, M., Chen, M., & Fox, A. Automatic Failure-Path Inference: A
Generic Introspection Technique for Internet Applications. In Proceedings of the 3rd IEEE
Workshop on Internet Applications (WIAPP). San Jose, CA, 2003
[45] Candea, G., Kiciman, E., Zhang, S., Keyani, P., & Fox, A. JAGR: An Autonomous Self-Recovering Application Server. In Proceedings of the 5th International Workshop on Active
Middleware Services. Seattle, WA, USA, 2003
[46] Caruana, R. & Niculescu-Mizil, A. Data mining in metric space: an empirical analysis
of supervised learning performance criteria. In Proceedings of the tenth ACM SIGKDD
international conference on Knowledge discovery and data mining (KDD 04), 69–78. ACM
Press, New York, NY, USA, 2004
[47] Cassady, C., Maillart, L., Bowden, R., & Smith, B. Characterization of optimal age-replacement policies. In IEEE Proceedings of Reliability and Maintainability Symposium,
170–175. 1998
[48] Cassidy, K. J., Gross, K. C., & Malekpour, A. Advanced Pattern Recognition for Detection of Complex Software Aging Phenomena in Online Transaction Processing Servers. In
Proceedings of Dependable Systems and Networks (DSN), 478–482. 2002
[49] Castelli, V., Harper, R. E., Heidelberger, P., Hunter, S. W., Trivedi, K., Vaidyanathan, K., & Zeggert, W.
Proactive management of software aging. IBM Journal of Research and Development,
volume 45(2): 311–332, 2001
[50] Chakravorty, S., Mendes, C., & Kale, L. Proactive fault tolerance in large systems. In
HPCRI Workshop in conjunction with HPCA 2005. 2005
[51] Chan, L. M., Comaromi, J. P., Mitchell, J. S., & Satija, M. Dewey Decimal Classification:
A Practical Guide. OCLC Forest Press, Albany, N.Y., 2nd edition, 1996
[52] Chen, M., Accardi, A., Lloyd, J., Kiciman, E., Fox, A., Patterson, D., & Brewer, E. Path-based Failure and Evolution Management. In Proceedings of USENIX/ACM Symposium on
Networked Systems Design and Implementation (NSDI). San Francisco, CA, 2004
[53] Chen, M., Kiciman, E., Fratkin, E., Fox, A., & Brewer, E. Pinpoint: Problem Determination in Large, Dynamic Internet Services. In Proceedings of 2002 International Conference
on Dependable Systems and Networks (DSN), IPDS track, 595–604. IEEE Computer Society, 2002
[54] Chen, M., Zheng, A., Lloyd, J., Jordan, M., & Brewer, E. Failure diagnosis using decision
trees. In IEEE Proceedings of International Conference on Autonomic Computing, 36–43.
2004
[55] Chen, M.-S., Park, J. S., & Yu, P. S. Efficient Data Mining for Path Traversal Patterns.
IEEE Transactions on Knowledge and Data Engineering, volume 10(2): 209–221, 1998.
URL citeseer.nj.nec.com/article/chen98efficient.html
[56] Chen, P., Lin, C. J., & Schoelkopf, B. A tutorial on ν-Support Vector Machines. Applied
Stochastic Models in Business and Industry, volume 21(2): 111–136, 2005
[57] Cheng, F., Wu, S., Tsai, P., Chung, Y., & Yang, H. Application Cluster Service Scheme for
Near-Zero-Downtime Services. In IEEE Proceedings of the International Conference on
Robotics and Automation, 4062–4067. 2005
[58] Chiang, F. & Braun, R. Intelligent Network Failure Domain Prediction in Complex
Telecommunication Systems with Hybrid Neural Rough Nets. In The Second International
Symposium on Neural Networks (ISNN 2005). Chongqing, China, 2005
[59] Chillarege, R., Bhandari, S., Chaar, J. K., Halliday, M. J., Moebus, D. S., Ray, B. K., &
Wong, M.-Y. Orthogonal Defect Classification - A Concept for In-Process Measurements.
IEEE Transactions on Software Engineering, volume 18(11): 943–955, 1992
[60] Chillarege, R., Biyani, S., & Rosenthal, J. Measurement of Failure Rate in Widely Distributed Software. In FTCS ’95: Proceedings of the Twenty-Fifth International Symposium
on Fault-Tolerant Computing, 424–432. IEEE Computer Society, 1995
[61] Cohen, W. W. Fast effective rule induction. In Proceedings of the Twelfth International
Conference on Machine Learning, 115–123. 1995
[62] Cole, R., Mariani, J., Uszkoreit, H., Varile, G. B., Zaenen, A., Zampolli, A., & Zue, V.
(eds.). Survey of the State of the Art in Human Language Technology. Cambridge University
Press and Giardini, 1997
[63] Coleman, D. & Thompson, C. Model Based Automation and Management for the Adaptive Enterprise. In Proceedings of the 12th Annual Workshop of HP OpenView University
Association, 171–184. 2005
[64] Commission, I. I. T. (ed.). Dependability and Quality of Service, chapter 191. IEC, 2nd
edition, 2002
[65] Cook, A. E. & Russell, M. J. Improved duration modeling in hidden Markov models using
series-parallel configurations of states. Proc. Inst. Acoust., volume 8: 299–306, 1986
[66] Cover, T. M. Learning in pattern recognition. In Watanabe, S. (ed.), Methodologies of
Pattern Recognition, 111–132. Academic Press, 1968
[67] Cox, D. R. & Miller, H. D. The Theory of Stochastic Processes. Chapman and Hall,
London, UK, 1st edition, 1965
[68] Cristian, F., Aghili, H., Strong, R., & Dolev, D. Atomic Broadcast: From Simple Message
Diffusion to Byzantine Agreement. In IEEE Proceedings of 15th International Symposium
on Fault Tolerant Computing (FTCS). 1985
[69] Cristian, F., Dancey, B., & Dehn, J. Fault-tolerance in the Advanced Automation System. In
IEEE Proceedings of 20th International Symposium on Fault-Tolerant Computing (FTCS20), 6–17. 1990
[70] Cristianini, N. & Shawe-Taylor, J. An introduction to Support Vector Machines and other
kernel-based learning methods. Cambridge University Press, 2000
[71] Crowell, J., Shereshevsky, M., & Cukic, B. Using fractal analysis to model software aging.
Technical report, West Virginia University, Lane Department of CSEE, Morgantown, WV,
2002
[72] Csenki, A. Bayes Predictive Analysis of a Fundamental Software Reliability Model. IEEE
Transactions on Reliability, volume 39(2): 177–183, 1990
[73] Daidone, A., Di Giandomenico, F., Bondavalli, A., & Chiaradonna, S. Hidden Markov
Models as a Support for Diagnosis: Formalization of the Problem and Synthesis of the
Solution. In IEEE Proceedings of the 25th Symposium on Reliable Distributed Systems
(SRDS 2006). Leeds, UK, 2006
[74] Dalgaard, P. Introductory Statistics with R. Springer, 2002
[75] Dempster, A., Laird, N., & Rubin, D. Maximum likelihood from incomplete data via the
EM algorithm. Journal of the Royal Statistical Society, volume 39(1): 1–38, 1977
[76] Dennis, J. E. J. & Moré, J. J. Quasi-Newton Methods, Motivation and Theory. SIAM
Review, volume 19(1): 46–89, 1977
[77] Denson, W. The history of reliability prediction. IEEE Transactions on Reliability, volume 47(3): 321–328, 1998
[78] Discenzo, F., Unsworth, P., Loparo, K., & Marcy, H. Self-diagnosing intelligent motors: a
key enabler for next-generation manufacturing systems. In IEE Colloquium on Intelligent
and Self-Validating Sensors. 1999
[79] Dohi, T., Goseva-Popstojanova, K., & Trivedi, K. S. Analysis of Software Cost Models
with Rejuvenation. In Proceedings of IEEE Intl. Symposium on High Assurance Systems
Engineering, HASE 2000. 2000
[80] Dohi, T., Goseva-Popstojanova, K., & Trivedi, K. S. Statistical Non-Parametric Algorithms
to Estimate the Optimal Software Rejuvenation Schedule. In Proceedings of the Pacific Rim
International Symposium on Dependable Computing (PRDC 2000). 2000
[81] Domeniconi, C., Perng, C.-S., Vilalta, R., & Ma, S. A Classification Approach for Prediction of Target Events in Temporal Sequences. In Elomaa, T., Mannila, H., & Toivonen,
H. (eds.), Proceedings of the 6th European Conference on Principles of Data Mining and
Knowledge Discovery (PKDD’02), LNAI, volume 2431, 125–137. Springer-Verlag, Heidelberg, 2002
[82] Domingos, P. A Unified Bias-Variance Decomposition for Zero-One and Squared Loss. In
Proceedings of the Seventeenth National Conference on Artificial Intelligence, 564–569.
2000
[83] Drummond, C. & Holte, R. C. Explicitly representing expected cost: an alternative to
ROC representation. In Proceedings of the sixth ACM SIGKDD international conference
on Knowledge discovery and data mining (KDD’00), 198–207. ACM Press, New York, NY,
USA, 2000
[84] Duda, R. O. & Hart, P. E. Pattern classification and scene analysis. John Wiley and Sons,
New York, London, Sydney, Toronto, 1973
[85] Duda, R. O., Hart, P. E., & Stork, D. G. Pattern Classification. Wiley-Interscience, 2nd
edition, 2000
[86] Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, UK,
1998
[87] Efron, B. Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics,
volume 7(1): 1–26, 1979
[88] Egan, J. P. Signal detection theory and ROC analysis. Academic Press New York, 1975
[89] Elbaum, S., Kanduri, S., & Amschler, A. Anomalies as precursors of field failures. In
IEEE Proceedings of the 14th International Symposium on Software Reliability Engineering
(ISSRE 2003), 108–118. 2003
[90] Elliott, R. J., Aggoun, L., & Moore, J. B. Hidden Markov Models: Estimation and Control,
Stochastic Modelling and Applied Probability, volume 29. Springer Verlag, 1st edition,
1995
[91] Elnozahy, E. N., Alvisi, L., Wang, Y., & Johnson, D. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, volume 34(3): 375–408,
2002
[92] Esary, J. D. & Proschan, F. The Reliability of Coherent Systems. In Wilcox & Mann (eds.),
Redundancy Techniques for Computing Systems, 47–61. Spartan Books, Washington, DC,
1962
[93] Faisan, S., Thoraval, L., Armspach, J., & Heitz, F. Unsupervised Learning and Mapping
of Brain fMRI Signals Based on Hidden Semi-Markov Event Sequence Models. In Goos,
G., Hartmanis, J., & van Leeuwen, J. (eds.), Medical Image Computing and Computer-Assisted Intervention (MICCAI 2003), Lecture Notes in Computer Science, volume 2879,
75–82. Springer, 2003
[94] Farr, W. Software Reliability Modeling Survey. In Lyu, M. R. (ed.), Handbook of software
reliability engineering, chapter 3, 71–117. McGraw-Hill, 1996
[95] Fawcett, T. ROC graphs: notes and practical considerations for data mining researchers.
Technical Report 2003-4, HP Laboratories, Palo Alto, CA, USA, 2003
[96] Ferguson, J. Variable duration models for speech. In Proceedings of the Symposium on the
Application of HMMs to Text and Speech, 143–179. 1980
[97] Flach, P. The geometry of ROC space: understanding machine learning metrics through
ROC isometrics. In Proceedings of the 20th International Conference on Machine Learning
(ICML’03), 194–201. AAAI Press, 2003
[98] Friedman, J. H. On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality. Data Mining and Knowledge Discovery, volume 1(1): 55–77, 1997
[99] Fu, S. & Xu, C.-Z. Quantifying Temporal and Spatial Fault Event Correlation for Proactive
Failure Management. In IEEE Proceedings of Symposium on Reliable and Distributed
Systems (SRDS 07). 2007
[100] Garg, S., van Moorsel, A., Vaidyanathan, K., & Trivedi, K. S. A Methodology for Detection
and Estimation of Software Aging. In Proceedings of the 9th International Symposium on
Software Reliability Engineering, ISSRE 1998. 1998
[101] Garg, S., Puliafito, A., Telek, M., & Trivedi, K. Analysis of Preventive Maintenance in
Transactions Based Software Systems. IEEE Trans. Comput., volume 47(1): 96–107, 1998
[102] Ge, X. Segmental semi-Markov models and applications to sequence analysis. Ph.D. thesis,
University of California, Irvine, 2002. Chair: Padhraic Smyth
[103] Gellert, W., Küstner, H., Hellwig, M., & Kästner, H. (eds.). Kleine Enzyklopädie Mathematik. VEB Bibliographisches Institut, Leipzig, Germany, 1965
[104] Geman, S., Bienenstock, E., & Doursat, R. Neural networks and the bias/variance dilemma.
Neural Computation, volume 4(1): 1–58, 1992
[105] Gertsbakh, I. Reliability Theory: with Applications to Preventive Maintenance. Springer-Verlag, Berlin, Germany, 2000
[106] Goldberg, D. E. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison Wesley, 1989
[107] Gray, J. Why do computers stop and what can be done about it? In Proceedings of
Symposium on Reliability in Distributed Software and Database Systems (SRDS-5), 3–12.
IEEE CS Press, Los Angeles, CA, 1986
[108] Gray, J. A census of tandem system availability between 1985 and 1990. IEEE Transactions
on Reliability, volume 39(4): 409–418, 1990
[109] Gray, J. & Reuter, A. Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992
[110] Gross, K. C., Bhardwaj, V., & Bickford, R. Proactive Detection of Software Aging Mechanisms in Performance Critical Computers. In SEW ’02: Proceedings of the 27th Annual
NASA Goddard Software Engineering Workshop (SEW-27’02). IEEE Computer Society,
Washington, DC, USA, 2002
[111] Gujrati, P., Li, Y., Lan, Z., Thakur, R., & White, J. A Meta-Learning Failure Predictor
for Blue Gene/L Systems. In IEEE Proceedings of International Conference on Parallel
Processing (ICPP 2007). 2007
[112] Hamerly, G. & Elkan, C. Bayesian approaches to failure prediction for disk drives. In
Proceedings of the Eighteenth International Conference on Machine Learning, 202–209.
Morgan Kaufmann Publishers Inc., 2001
[113] Hamming, R. W. Error Detecting and Error Correcting Codes. Bell Systems Technical
Journal, volume 29(2): 147–160, 1950
[114] Hansen, J. & Siewiorek, D. Models for time coalescence in event logs. In IEEE Proceedings
of International Symposium on Fault-Tolerant Computing (FTCS-22), 221–227. 1992
[115] Hastie, T., Tibshirani, R., & Friedman, J. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer Series in Statistics. Springer Verlag, 2001
[116] Hätönen, K., Klemettinen, M., Mannila, H., Ronkainen, P., & Toivonen, H. TASA: Telecommunication Alarm Sequence Analyzer, or: How to enjoy faults in your network. In IEEE
Proceedings of Network Operations and Management Symposium, volume 2, 520–529.
Kyoto, Japan, 1996
[117] Hellerstein, J. L., Zhang, F., & Shahabuddin, P. An approach to predictive detection for
service management. In IEEE Proceedings of Sixth International Symposium on Integrated
Network Management, 309–322. 1999
[118] Herodot. Historien. Kröner Verlag, Stuttgart, Germany, 1971
[119] Hestenes, M. R. & Stiefel, E. Methods of conjugate gradients for solving linear systems.
Journal of Research of the National Bureau of Standards, volume 49(6): 409–436, 1952
[120] Hoffmann, G. A. Failure Prediction in Complex Computer Systems: A Probabilistic Approach. Shaker Verlag, 2006
[121] Hoffmann, G. A. & Malek, M. Call Availability Prediction in a Telecommunication System: A Data Driven Empirical Approach. In Proceedings of the 25th IEEE Symposium on
Reliable Distributed Systems (SRDS 2006). Leeds, United Kingdom, 2006
[122] Hoffmann, G. A., Trivedi, K. S., & Malek, M. A Best Practice Guide to Resource Forecasting for Computing Systems. IEEE Transactions on Reliability, volume 56(4): 615–628,
2007
[123] Horn, P. Autonomic Computing: IBM’s perspective on the State of Information Technology. 2001. URL http://www.research.ibm.com/autonomic/manifesto/autonomic_computing.pdf
[124] Hotelling, H. Analysis of a complex of statistical variables into principal components.
Journal of Educational Psychology, volume 24: 417–441, 1933
[125] Huang, X., Acero, A., & Hon, H.-W. Spoken Language Processing: A Guide to Theory,
Algorithm, and System Development. Prentice Hall, Upper Saddle River, NJ, USA, 2001
[126] Huang, Y., Kintala, C., Kolettis, N., & Fulton, N. Software Rejuvenation: Analysis, Module
and Applications. In Proceedings of IEEE Intl. Symposium on Fault Tolerant Computing,
FTCS 25. 1995
[127] Hughes, G., Murray, J., Kreutz-Delgado, K., & Elkan, C. Improved disk-drive failure
warnings. IEEE Transactions on Reliability, volume 51(3): 350–357, 2002
[128] Hughey, R. & Krogh, A. Hidden Markov models for sequence analysis: extension and
analysis of the basic method. CABIOS, volume 12(2): 95–107, 1996
[129] Iyer, R. & Rosetti, D. A statistical load dependency of CPU errors at SLAC. In IEEE
Proceedings of 12th International Symposium on Fault Tolerant Computing (FTCS-12).
1982
[130] Iyer, R. K., Young, L. T., & Iyer, P. K. Automatic Recognition of Intermittent Failures:
An Experimental Study of Field Data. IEEE Transactions on Computers, volume 39(4):
525–537, 1990
[131] Iyer, R. K., Young, L. T., & Sridhar, V. Recognition of error symptoms in large systems.
In Proceedings of 1986 ACM Fall joint computer conference, 797–806. IEEE Computer
Society Press, Los Alamitos, CA, USA, 1986
[132] Jelinski, Z. & Moranda, P. Software reliability research. In Freiberger, W. (ed.), Statistical
computer performance evaluation. Academic Press, 1972
[133] Jensen, J. L. W. V. Sur les fonctions convexes et les inégalités entre les valeurs moyennes.
Acta Mathematica, volume 30(1): 175–193, 1906
[134] Jiménez, D. A. & Lin, C. Neural methods for dynamic branch prediction. ACM Transactions on Computer Systems, volume 20(4): 369–397, 2002
[135] Joachims, T. Making large-scale SVM Learning Practical. In Schölkopf, B., Burges, C., &
Smola, A. (eds.), Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999
[136] Joseph, D. & Grunwald, D. Prefetching Using Markov Predictors. IEEE Transactions on
Computers, volume 48(2): 121–133, 1999
[137] Juang, B. H., Levinson, S. E., & Sondhi, M. M. Maximum Likelihood Estimation for
Multivariate Mixture Observations of Markov Chains. IEEE Transactions on Information
Theory, volume 32(2): 307–309, 1986
[138] Juang, B.-H. & Rabiner, L. The segmental K-means algorithm for estimating parameters of
hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing,
volume 38(9): 1639–1641, 1990
[139] Kajko-Mattson, M. Can We Learn Anything from Hardware Preventive Maintenance? In
ICECCS ’01: Proceedings of the Seventh International Conference on Engineering of Complex Computer Systems, 106–111. IEEE Computer Society, 2001
[140] Kalman, R. E. & Bucy, R. S. New results in linear filtering and prediction theory. Transactions of the ASME, Series D, Journal of Basic Engineering, volume 83: 95–107, 1961
[141] Kapadia, N. H., Fortes, J. A. B., & Brodley, C. E. Predictive application-performance modeling in a computational grid environment. In IEEE Proceedings of the eighth International
Symposium on High Performance Distributed Computing, 47–54. 1999
[142] Kaufman, L. & Rousseeuw, P. J. Finding Groups in Data. John Wiley and Sons, New York,
1990
[143] Kelly, J. P. J., Avižienis, A., Ulery, B. T., Swain, B. J., Lyu, M. R., Tai, A., & Tso, K. S.
Multi-Version Software Development. In Proceedings IFAC Workshop SAFECOMP’86,
43–49. Sarlat, France, 1986
[144] Kiciman, E. & Fox, A. Detecting application-level failures in component-based Internet
services. IEEE Transactions on Neural Networks, volume 16(5): 1027–1041, 2005
[145] Kim, W.-G., Choi, J.-Y., & Youn, D. H. HMM with global path constraint in Viterbi decoding for isolated word recognition. In IEEE Proceedings of International Conference on
Acoustics, Speech, and Signal Processing (ICASSP-94), volume 1, 605–608. 1994
[146] Kohavi, R. & Provost, F. Glossary of terms. Machine Learning, volume 30(2/3): 271–274,
1998
[147] Korbicz, J., Kościelny, J. M., Kowalczuk, Z., & Cholewa, W. (eds.). Fault Diagnosis:
Models, Artificial Intelligence, Applications. Springer Verlag, 2004
[148] Krus, D. J. & Fuller, E. A. Computer Assisted Multicrossvalidation in Regression Analysis.
Educational and Psychological Measurement, volume 42(1): 187–193, 1982
[149] Kulkarni, V. G. Modeling and Analysis of Stochastic Systems. Chapman and Hall, London,
UK, 1st edition, 1995
[150] Kumar, D. & Westberg, U. Maintenance scheduling under age replacement policy using
proportional hazards model and TTT-plotting. European Journal of Operational Research,
volume 99(3): 507–515, 1997
[151] Kurtz, A. K. A research test of Rorschach test. Personnel Psychology, volume 1: 41–53,
1948
[152] Lafferty, J., McCallum, A., & Pereira, F. Conditional Random Fields: Probabilistic Models
for Segmenting and Labeling Sequence Data. In Proc. 18th International Conf. on Machine
Learning, 282–289. Morgan Kaufmann, San Francisco, CA, 2001. URL citeseer.ist.psu.edu/article/lafferty01conditional.html
[153] Lal, R. & Choi, G. Error and Failure Analysis of a UNIX Server. In IEEE Proceedings
of third International High-Assurance Systems Engineering Symposium (HASE), 232–239.
IEEE Computer Society, Washington, DC, USA, 1998
[154] Lance, G. N. & Williams, W. T. A general theory of classificatory sorting strategies, 1.
Hierarchical Systems. The Computer Journal, volume 9(4): 373–380, 1967
[155] Laprie, J.-C. & Kanoun, K. Software Reliability and System Reliability. In Lyu, M. R. (ed.),
Handbook of software reliability engineering, chapter 2, 27–69. McGraw-Hill, 1996
[156] Laranjeira, L., Malek, M., & Jenevein, R. On tolerating faults in naturally redundant
algorithms. In IEEE Proceedings of Tenth Symposium on Reliable Distributed Systems
(SRDS), 118–127. 1991
[157] Leangsuksun, C., Liu, T., Rao, T., Scott, S., & Libby, R. A Failure Predictive and Policy-Based High Availability Strategy for Linux High Performance Computing Cluster. In The
5th LCI International Conference on Linux Clusters: The HPC Revolution, 18–20. 2004
[158] Leangsuksun, C., Shen, L., Liu, T., Song, H., & Scott, S. Availability prediction and modeling of high mobility OSCAR cluster. In IEEE Proceedings of International Conference on
Cluster Computing, 380–386. 2003
[159] Lee, I. & Iyer, R. K. Software dependability in the Tandem GUARDIAN system. IEEE
Transactions on Software Engineering, volume 21(5): 455–467, 1995
[160] Legg, S. Is There an Elegant Universal Theory of Prediction? In Algorithmic Learning
Theory, Lecture Notes in Computer Science, volume 4264, 274–287. Springer Verlag, 2006
[161] Levinson, S. E. Continuously variable duration hidden Markov models for automatic
speech recognition. Computer Speech and Language, volume 1(1): 29–45, 1986
[162] Levy, D. & Chillarege, R. Early Warning of Failures through Alarm Analysis - A Case
Study in Telecom Voice Mail Systems. In ISSRE ’03: Proceedings of the 14th International
Symposium on Software Reliability Engineering. IEEE Computer Society, Washington, DC,
USA, 2003
[163] Li, L., Vaidyanathan, K., & Trivedi, K. S. An Approach for Estimation of Software Aging in
a Web Server. In Proceedings of the Intl. Symposium on Empirical Software Engineering,
ISESE 2002. Nara, Japan, 2002
[164] Li, Y. & Lan, Z. Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Computing. In IEEE Proceedings of the Sixth International Symposium on Cluster Computing
and the Grid (CCGRID’ 06), 531–538. IEEE Computer Society, Los Alamitos, CA, USA,
2006
[165] Liang, Y., Zhang, Y., Sivasubramaniam, A., Jette, M., & Sahoo, R. BlueGene/L Failure
Analysis and Prediction Models. In IEEE Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2006), 425–434. 2006
[166] Lin, T.-T. Y. Design and evaluation of an on-line predictive diagnostic system. Ph.D.
thesis, Department of Electrical and Computer Engineering, Carnegie-Mellon University,
Pittsburgh, PA, 1988
[167] Lin, T.-T. Y. & Siewiorek, D. P. Error log analysis: statistical modeling and heuristic trend
analysis. IEEE Transactions on Reliability, volume 39(4): 419–432, 1990
[168] Liporace, L. A. Maximum Likelihood Estimation for Multivariate Observations of Markov
Sources. IEEE Transactions on Information Theory, volume 28(5): 729–734, 1982
[169] Lunze, J. Automatisierungstechnik. Oldenbourg, 1st edition, 2003
[170] Lyu, M. R. (ed.). Handbook of Software Reliability Engineering. McGraw-Hill, 1996
[171] Magedanz, T. & Popescu-Zeletin, R. Intelligent networks: basic technology, standards and
evolution. Internat. Thomson Computer Press, London, UK, 1996
Bibliography
[172] Makhoul, J., Kubala, F., Schwartz, R., & Weischedel, R. Performance Measures for Information Extraction. In Proceedings of DARPA Broadcast News Workshop. Herndon, VA,
1999
[173] Malek, M. Responsive Systems: The challenge for the nineties. Microprocessing and
Microprogramming, volume 30: 9–16, 1990
[174] Malek, M. Personal communication. 2007
[175] Manning, C. D. & Schütze, H. Foundations of Statistical Natural Language Processing.
The MIT Press, Cambridge, Massachusetts, 1999
[176] Marciniak, A. & Korbicz, J. Pattern Recognition Approach to Fault Diagnostics. In Korbicz, J., Kościelny, J. M., Kowalczuk, Z., & Cholewa, W. (eds.), Fault Diagnosis: Models,
Artificial Intelligence, Applications, chapter 14, 557–590. Springer Verlag, 2004
[177] Martin, A., Doddington, G., Kamm, T., Ordowski, M., & Przybocki, M. The DET curve in
assessment of detection task performance. In Proceedings of the 5th European Conference
on Speech Communication and Technology, volume 4, 1895–1898. 1997
[178] Marzban, C. & Stumpf, G. J. A Neural Network for Damaging Wind Prediction. Weather
and Forecasting, volume 13(1): 151–163, 1998
[179] Max Planck Institute for Molecular Genetics. General Hidden Markov Model library. 2007.
URL http://www.ghmm.org, date: 06-12-07
[180] Melliar-Smith, P. M. & Randell, B. Software reliability: The role of programmed exception
handling. SIGPLAN Not., volume 12(3): 95–100, 1977
[181] Minka, T. Expectation-Maximization as lower bound maximization. Tutorial published
on the web at http://research.microsoft.com/users/minka/papers/minka-em-tut.ps.gz, 1998
[182] Mitchell, C., Harper, M., & Jamieson, L. On the Complexity of Explicit Duration HMM’s.
IEEE Transactions on Speech and Audio Processing, volume 3(3): 213–217, 1995
[183] Mitchell, C. & Jamieson, L. Modeling duration in a hidden Markov model with the exponential family. In IEEE Proceedings of the International Conference on Acoustics, Speech,
and Signal Processing (ICASSP-93), volume 2, 331–334. 1993
[184] Mitchell, T. M. Machine Learning. McGraw-Hill, international edition, 1997
[185] Mojena, R. Hierarchical grouping methods and stopping rules: An evaluation. The Computer Journal, volume 20(4): 359–363, 1977
[186] Moll, K. D. & Luebbert, G. M. Arms Race and Military Expenditure Models: A Review.
The Journal of Conflict Resolution, volume 24(1): 153–185, 1980
[187] Moore, D. S. & McCabe, G. P. Introduction to the Practice of Statistics. W. H. Freeman &
Co., New York, NY, USA, 5th edition, 2006
[188] Mundie, C., de Vries, P., Haynes, P., & Corwine, M. Trustworthy Computing. Technical
report, Microsoft Corp., 2002. URL http://www.microsoft.com/mscorp/twc/twc_whitepaper.mspx
[189] Musa, J. D., Iannino, A., & Okumoto, K. Software Reliability: Measurement, Prediction,
Application. McGraw-Hill, 1987
[190] Nassar, F. A. & Andrews, D. M. A Methodology for Analysis of Failure Prediction Data.
In IEEE Real-Time Systems Symposium, 160–166. 1985
[191] Needleman, S. B. & Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, volume 48(3): 443–453, 1970
[192] von Neumann, J. Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components. In Shannon, C. & McCarthy, J. (eds.), Automata Studies, 43–98.
Princeton University Press, Princeton, 1956
[193] Neville, S. W. Approaches for Early Fault Detection in Large Scale Engineering Plants.
Ph.D. thesis, University of Victoria, 1998
[194] Ning, M. H., Yong, Q., Di, H., Ying, C., & Zhong, Z. J. Software Aging Prediction Model
Based on Fuzzy Wavelet Network with Adaptive Genetic Algorithm. In 18th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’06), 659–666. IEEE Computer Society, Los Alamitos, CA, USA, 2006
[195] Noll, A. & Ney, H. Training of phoneme models in a sentence recognition system. In
IEEE Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’87), volume 12, 1277–1280. 1987
[196] Ogle, D., Kreger, H., Salahshour, A., Cornpropst, J., Labadie, E., Chessell, M., Horn,
B., & Gerken, J. Canonical Situation Data Format: The Common Base Event. IBM
Specification ACAB.BO0301.1.1, 2003. URL http://xml.coverpages.org/IBMCommonBaseEventV111.pdf
[197] Oliner, A. & Sahoo, R. Evaluating cooperative checkpointing for supercomputing systems.
In IEEE Proceedings of 20th International Parallel and Distributed Processing Symposium
(IPDPS 2006). 2006
[198] Parnas, D. L. Software aging. In IEEE Proceedings of the 16th international conference on
Software engineering (ICSE ’94), 279–287. IEEE Computer Society Press, Los Alamitos,
CA, USA, 1994
[199] Pawlak, Z., Wong, S. K. M., & Ziarko, W. Rough sets: Probabilistic versus deterministic
approach. International Journal of Man-Machine Studies, volume 29: 81–95, 1988
[200] Pena, J. M., Létourneau, S., & Famili, F. Application of Rough Sets Algorithms to Prediction of Aircraft Component Failure. In Advances in Intelligent Data Analysis: Third International Symposium (IDA-99), LNCS, volume 1642. Springer Verlag, Amsterdam, The
Netherlands, 1999
[201] Pepe, M. S., Janes, H., Longton, G., Leisenring, W., & Newcomb, P. Limitations of the
Odds Ratio in Gauging the Performance of a Diagnostic, Prognostic, or Screening Marker.
American Journal of Epidemiology, volume 159(9): 882–890, 2004
[202] Petsche, T., Marcantonio, A., Darken, C., Hanson, S. J., Kuhn, G. M., & Santoso, I. A Neural Network Autoassociator for Induction Motor Failure Prediction. In Touretzky, D. S.,
Mozer, M. C., & Hasselmo, M. E. (eds.), Advances in Neural Information Processing Systems, volume 8, 924–930. The MIT Press, 1996. URL citeseer.ist.psu.edu/petsche96neural.html
[203] Pfefferman, J. & Cernuschi-Frias, B. A nonparametric nonstationary procedure for failure
prediction. IEEE Transactions on Reliability, volume 51(4): 434–442, 2002
[204] Pielke, R. Mesoscale Meteorological Modeling, International Geophysics, volume 78. Elsevier, 2nd edition, 2001
[205] Pizza, M., Strigini, L., Bondavalli, A., & Di Giandomenico, F. Optimal Discrimination
between Transient and Permanent Faults. In IEEE Proceedings of Third International High-Assurance Systems Engineering Symposium (HASE’98), 214–223. IEEE Computer Society,
Los Alamitos, CA, USA, 1998
[206] Pylkkönen, J. Phone Duration Modeling Techniques in Continuous Speech Recognition.
Master’s thesis, Helsinki University of Technology, Department of Computer Science and
Engineering, Laboratory of Computer and Information Science, 2004
[207] Quenouille, M. H. Notes on Bias in Estimation. Biometrika, volume 43(3/4): 353–360,
1956
[208] Quinlan, J. Learning logical definitions from relations. Machine Learning, volume 5(3):
239–266, 1990
[209] Quinlan, J. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993
[210] Rabiner, L. R. A Tutorial on Hidden Markov Models and Selected Applications in Speech
Recognition. Proceedings of the IEEE, volume 77(2): 257–286, 1989
[211] Ramesh, P. & Wilpon, J. G. Modeling state durations in hidden Markov models for automatic speech recognition. In IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP-92), volume 1, 381–384. 1992
[212] Randell, B. System structure for software fault tolerance. IEEE Transactions on Software
Engineering, volume 1(2): 220–232, 1975
[213] Randell, B., Lee, P., & Treleaven, P. C. Reliability Issues in Computing System Design.
ACM Computing Survey, volume 10(2): 123–165, 1978
[214] van Rijsbergen, C. J. Information Retrieval. Butterworth, London, 2nd edition, 1979
[215] Rousseeuw, P. J. A visual display for hierarchical classification. In Diday, E., Escoufier, Y.,
Lebart, L., Pagès, J., Schektman, Y., & Tomassone, R. (eds.), Data Analysis and Informatics
IV, 743–748. North-Holland, Amsterdam, 1986
[216] Rovnyak, S., Kretsinger, S., Thorp, J., & Brown, D. Decision trees for real-time transient
stability prediction. IEEE Transactions on Power Systems, volume 9(3): 1417–1426, 1994
[217] Russell, M. A segmental HMM for speech pattern modelling. In IEEE Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP-93), volume 2,
499–502. 1993
[218] Russell, M. & Cook, A. Experimental evaluation of duration modelling techniques for automatic speech recognition. In IEEE Proceedings of International Conference on Acoustics,
Speech, and Signal Processing (ICASSP ’87), volume 12, 2376–2379. 1987
[219] Russell, M. J. & Moore, R. K. Explicit Modelling of State Occupancy in Hidden Markov
Models for Automatic Speech Recognition. In IEEE Proceedings of Int. Conf. on Acoustics,
Speech and Signal Processing, 5–8. 1985
[220] Sahoo, R. K., Oliner, A. J., Rish, I., Gupta, M., Moreira, J. E., Ma, S., Vilalta, R., &
Sivasubramaniam, A. Critical Event Prediction for Proactive Management in Large-scale
Computer Clusters. In Proceedings of the ninth ACM SIGKDD international conference on
Knowledge discovery and data mining (KDD ’03), 426–435. ACM Press, 2003
[221] Saks, S. Theory of the Integral. G. E. Stechert & Co, New York, USA, 1937
[222] Salakhutdinov, R., Roweis, S., & Ghahramani, Z. Expectation-Conjugate Gradient: An
Alternative to EM. IEEE Signal Processing Letters, volume 11(7), 2004
[223] Salfner, F. Predicting Failures with Hidden Markov Models. In Proceedings of 5th European Dependable Computing Conference (EDCC-5), 41–46. Budapest, Hungary, 2005.
Student forum volume
[224] Salfner, F., Hoffmann, G. A., & Malek, M. Prediction-Based Software Availability Enhancement. In Babaoglu, O., Jelasity, M., Montresor, A., Fetzer, C., Leonardi, S., van
Moorsel, A., & van Steen, M. (eds.), Self-Star Properties in Complex Information Systems,
Lecture Notes in Computer Science, volume 3460. Springer-Verlag, 2005
[225] Salfner, F. & Malek, M. Proactive Fault Handling for System Availability Enhancement. In
IEEE Proceedings of the 19th International Parallel and Distributed Processing Symposium
(IPDPS’05) - Workshop 16 IEEE Proceedings, DPDNS Workshop. Denver, CO, 2005
[226] Salfner, F., Schieschke, M., & Malek, M. Predicting Failures of Computer Systems: A
Case Study for a Telecommunication System. In Proceedings of IEEE International Parallel
and Distributed Processing Symposium (IPDPS 2006), DPDNS workshop. Rhodes Island,
Greece, 2006
[227] Salfner, F., Tschirpke, S., & Malek, M. Comprehensive Logfiles for Autonomic Systems. In IEEE Proceedings of International Parallel and Distributed Processing Symposium
(IPDPS), Workshop on Fault-Tolerant Parallel, Distributed and Network-Centric Systems
(FTPDS). IEEE Computer Society, Santa Fe, New Mexico, USA, 2004
[228] Salvador, S. & Chan, P. Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. In IEEE Proceedings of 16th International Conference on
Tools with Artificial Intelligence (ICTAI 2004), 576–584. 2004
[229] Salvo Rossi, P., Romano, G., Palmieri, F., & Iannello, G. A hidden Markov model for
Internet channels. In IEEE Proceedings of the 3rd International Symposium on Signal
Processing and Information Technology (ISSPIT 2003), 50–53. 2003
[230] Schlittgen, R. Einführung in die Statistik: Analyse und Modellierung von Daten.
Oldenbourg-Wissenschaftsverlag, München, Wien, 9th edition, 2000
[231] Schölkopf, B., Smola, A. J., Williamson, R. C., & Bartlett, P. L. New Support Vector
Algorithms. Neural Computation, volume 12(5): 1207–1245, 2000
[232] Scott, D. Making Smart Investments to Reduce Unplanned Downtime. Technical Report
Tactical Guidelines, TG-07-4033, GartnerGroup RAS Services, 1999
[233] Sen, P. K. Estimates of the Regression Coefficient Based on Kendall’s Tau. Journal of the
American Statistical Association, volume 63(324): 1379–1389, 1968
[234] Sfetsos, A. Short-term load forecasting with a hybrid clustering algorithm. IEE Proceedings of Generation, Transmission and Distribution, volume 150(3): 257–262, 2003
[235] Shannon, C. A Mathematical Theory of Communication. The Bell System Technical Journal, volume 27: 379–423, 623–656, 1948
[236] Shao, J. Linear Model Selection by Cross-Validation. Journal of the American Statistical
Association, volume 88(422): 486–494, 1993
[237] Shawe-Taylor, J. & Cristianini, N. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004
[238] Shereshevsky, M., Crowell, J., Cukic, B., Gandikota, V., & Liu, Y. Software aging and
multifractality of memory resources. In Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2003), 721–730. IEEE Computer Society, San
Francisco, CA, USA, 2003
[239] Shewchuk, J. An introduction to the conjugate gradient method without the agonizing pain.
Technical report, School of Computer Science, Carnegie Mellon University, Pittsburgh PA,
USA, 1994
[240] Shi, X. & Manduchi, R. Invariant operators, small samples, and the bias-variance dilemma.
In IEEE Proceedings of the Conference on Computer Vision and Pattern Recognition
(CVPR 2004), volume 2. 2004
[241] Siewiorek, D. P. & Swarz, R. S. Reliable Computer Systems. Digital Press, Bedford, MA,
2nd edition, 1992
[242] Silva, J. G. & Madeira, H. Experimental Dependability Evaluation. In Diab, H. B. &
Zomaya, A. Y. (eds.), Dependable Computing Systems, chapter 12, 327–355. John Wiley &
Sons, 2005
[243] Singer, R. M., Gross, K. C., Herzog, J. P., King, R. W., & Wegerich, S. Model-Based Nuclear Power Plant Monitoring and Fault Detection: Theoretical Foundations. In Proceedings of Intelligent System Application to Power Systems (ISAP 97), 60–65. Seoul, Korea,
1997
[244] Smith, T. & Waterman, M. Identification of Common Molecular Subsequences. Journal of
Molecular Biology, volume 147: 195–197, 1981
[245] Smyth, P. Clustering Using Monte Carlo Cross-Validation. In ACM proceedings of Knowledge Discovery and Data Mining (KDD 1996), 126–133. 1996
[246] Smyth, P. Clustering Sequences with Hidden Markov Models. In Mozer, M. C., Jordan,
M. I., & Petsche, T. (eds.), Advances in Neural Information Processing Systems, volume 9,
648. The MIT Press, 1997
[247] Solomonoff, R. J. A Formal Theory of Inductive Inference, Part 1. Information and Control,
volume 7(1): 1–22, 1964
[248] Solomonoff, R. J. A Formal Theory of Inductive Inference, Part 2. Information and Control,
volume 7(2): 224–254, 1964
[249] Srikant, R. & Agrawal, R. Mining Sequential Patterns: Generalizations and Performance
Improvements. In Apers, P. M. G., Bouzeghoub, M., & Gardarin, G. (eds.), Proc. 5th Int.
Conf. Extending Database Technology, EDBT, volume 1057, 3–17. Springer-Verlag, 1996.
URL citeseer.nj.nec.com/article/srikant96mining.html
[250] Starr, A. A structured approach to the selection of condition based maintenance. In IEE
Proceedings of Fifth International Conference on Factory 2000 - The Technology Exploitation Process. 1997
[251] Stone, M. Cross-Validatory Choice and Assessment of Statistical Predictions. Journal of
the Royal Statistical Society, volume 36(2): 111–147, 1974
[252] Sullivan, M. & Chillarege, R. Software defects and their impact on system availability - a
study of field failures in operating systems. 21st Int. Symp. on Fault-Tolerant Computing
(FTCS-21), 2–9, 1991. URL citeseer.ist.psu.edu/sullivan91software.html
[253] Sun, R. Introduction to Sequence Learning. In Sun, R. & Giles, C. L. (eds.), Sequence
Learning: Paradigms, Algorithms, and Applications, Lecture Notes in Computer Science,
volume 1828, 1–11. Springer, Berlin / Heidelberg, 2001
[254] Tauber, O. Einfluss vorhersagegesteuerter Restarts auf die Verfügbarkeit. Master’s thesis,
Humboldt-Universität zu Berlin, Berlin, Germany, 2006
[255] Thoraval, L. Hidden Semi-Markov Event Sequence Models. Technical report, Université
Louis Pasteur Strasbourg, France, 2002
[256] Todorovski, L., Flach, P., & Lavrac, N. Predictive performance of weighted relative accuracy. In Zighed, D. A., Komorowski, J., & Żytkow, J. (eds.), Proceedings of the Fourth
European Conference on Principles of Data Mining and Knowledge Discovery (PKDD
2000), Lecture Notes in Artificial Intelligence, volume 1910, 255–264. Springer, 2000
[257] Troudet, T. & Merrill, W. A real time neural net estimator
of fatigue life. In IEEE Proceedings of International Joint Conference on Neural Networks (IJCNN 90), 59–64. 1990
[258] Tsao, M. M. & Siewiorek, D. P. Trend Analysis on System Error Files. In Proc. 13th
International Symposium on Fault-Tolerant Computing, 116–119. Milano, Italy, 1983
[259] Turnbull, D. & Alldrin, N. Failure Prediction in Hardware Systems. Technical report,
University of California, San Diego, 2003. Available at http://www.cs.ucsd.edu/~dturnbul/Papers/ServerPrediction.pdf
[260] Ulerich, N. & Powers, G. On-line hazard aversion and fault diagnosis in chemical processes: the digraph+fault-tree method. IEEE Transactions on Reliability, volume 37(2):
171–177, 1988
[261] Vaidyanathan, K., Harper, R. E., Hunter, S. W., & Trivedi, K. S. Analysis and implementation of software rejuvenation in cluster systems. In Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, 62–71.
ACM Press, 2001
[262] Vaidyanathan, K. & Trivedi, K. A comprehensive model for software rejuvenation. IEEE
Transactions on Dependable and Secure Computing, volume 2: 124–137, 2005
[263] Vaidyanathan, K. & Trivedi, K. S. A Measurement-Based Model for Estimation of Resource Exhaustion in Operational Software Systems. In Proceedings of the International
Symposium on Software Reliability Engineering (ISSRE). 1999
[264] Vapnik, V. N. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995
[265] Vesely, W., Goldberg, F. F., Roberts, N. H., & Haasl, D. F. Fault Tree Handbook. Technical
Report NUREG-0492, U.S. Nuclear Regulatory Commission, Washington, DC, 1981
[266] Vilalta, R., Apte, C. V., Hellerstein, J. L., Ma, S., & Weiss, S. M. Predictive algorithms
in the management of computer systems. IBM Systems Journal, volume 41(3): 461–474,
2002
[267] Vilalta, R. & Drissi, Y. A perspective view and survey of meta-learning. Artificial Intelligence Review, volume 18(2): 77–95, 2002
[268] Vilalta, R. & Ma, S. Predicting Rare Events In Temporal Domains. In Proceedings of the
2002 IEEE International Conference on Data Mining (ICDM’02), 474–482. IEEE Computer Society, Washington, DC, USA, 2002
[269] Wahl, M., Howes, T., & Kille, S. Lightweight Directory Access Protocol (v3). RFC 2251,
1997. http://www.ietf.org/rfc/rfc2251.txt
[270] Wang, X. Durationally constrained training of HMM without explicit state durational PDF.
In Proceedings of the Institute of Phonetic Sciences, University of Amsterdam, volume 18,
111–130. 1994
[271] Ward, A., Glynn, P., & Richardson, K. Internet service performance failure detection.
SIGMETRICS Performance Evaluation Review, volume 26(3): 38–43, 1998
[272] Ward, A. & Whitt, W. Predicting response times in processor-sharing queues. In Glynn,
P. W., MacDonald, D. J., & Turner, S. J. (eds.), Proc. of the Fields Institute Conf. on Comm.
Networks. 2000
[273] Warrender, C., Forrest, S., & Pearlmutter, B. Detecting intrusions using system calls: alternative data models. In IEEE Proceedings of the 1999 Symposium on Security and Privacy,
133–145. 1999
[274] Wei, W., Wang, B., & Towsley, D. Continuous-time hidden Markov models for network
performance evaluation. Performance Evaluation, volume 49(1-4): 129–146, 2002
[275] Weiss, G. Timeweaver: A Genetic Algorithm for Identifying Predictive Patterns in Sequences of Events. In Proceedings of the Genetic and Evolutionary Computation Conference, 718–725. Morgan Kaufmann, San Francisco, CA, 1999
[276] Weiss, G. M. Mining with rarity: a unifying framework. SIGKDD Explor. Newsl., volume 6(1): 7–19, 2004
[277] Weiss, G. M. & Hirsh, H. Learning to Predict Rare Events in Event Sequences. In
Agrawal, R., Stolorz, P., & Piatetsky-Shapiro, G. (eds.), Proceedings of the Fourth International
Conference on Knowledge Discovery and Data Mining, 359–363. AAAI Press, Menlo Park,
California, 1998
[278] Williams, J., Davies, A., & Drake, P. (eds.). Condition-based Maintenance and Machine
Diagnostics. Springer Verlag, 1994
[279] Wilson, A. D. & Bobick, A. F. Recognition and interpretation of parametric gesture. In
IEEE Proceedings of Sixth International Conference on Computer Vision, 329–336. 1998
[280] Wolpert, D. H. The Mathematics of Generalization. Addison-Wesley, Reading, MA, 1995
[281] Wong, K. C. P., Ryan, H., & Tindle, J. Early Warning Fault Detection Using Artificial
Intelligent Methods. In Proceedings of the Universities Power Engineering Conference.
1996. URL citeseer.nj.nec.com/217993.html
[282] Yang, S. A condition-based failure-prediction and processing-scheme for preventive maintenance. IEEE Transactions on Reliability, volume 52(3): 373–383, 2003
[283] Yu, C. H. Resampling methods: concepts, applications, and justification. Practical Assessment, Research and Evaluation, volume 8(19), 2003
[284] Yu, S.-Z., Liu, Z., Squillante, M. S., Xia, C., & Zhang, L. A hidden semi-Markov model
for web workload self-similarity. In IEEE Proceedings of 21st International Performance,
Computing, and Communications Conference, 65–72. 2002
[285] Zipf, G. K. Human Behavior and the Principle of Least Effort: An Introduction to Human
Ecology. Addison-Wesley Press, Cambridge, Mass, 1949