IMPACT OF TYPE OF CONCEPT DRIFT ON
ONLINE ENSEMBLE LEARNING
A Project Report
Submitted by
Prashant M. Chaudhary
110703010
Gaurish S. Chaudhari
110703201
Sonali P. Rahagude
110703202
in partial fulfilment for the award of the degree
of
B.Tech Computer Engineering
Under the guidance of
Mrs. Vahida Z. Attar
College of Engineering, Pune
DEPARTMENT OF COMPUTER ENGINEERING AND
INFORMATION TECHNOLOGY,
COLLEGE OF ENGINEERING, PUNE-5
May, 2011
DEPARTMENT OF COMPUTER ENGINEERING AND
INFORMATION TECHNOLOGY,
COLLEGE OF ENGINEERING, PUNE
CERTIFICATE
Certified that this project, titled “IMPACT OF TYPE OF CONCEPT DRIFT ON
ONLINE ENSEMBLE LEARNING” has been successfully completed by
Prashant M. Chaudhary
110703010
Gaurish S. Chaudhari
110703201
Sonali P. Rahagude
110703202
and is approved for the partial fulfilment of the requirements for the degree of “B.Tech.
Computer Engineering”.
SIGNATURE
Mrs. Vahida Z. Attar
Project Guide
Department of Computer Engineering
and Information Technology,
College of Engineering Pune,
Shivajinagar, Pune - 5.

SIGNATURE
Dr. Jibi Abraham
Head
Department of Computer Engineering
and Information Technology,
College of Engineering Pune,
Shivajinagar, Pune - 5.
Abstract
Mining concept-drifting data streams is a challenging area of data mining research. Real-world data streams are not stable but change with time. Such changes, termed concept drifts, are categorized as gradual or abrupt based on the drifting time, i.e. the number of time steps taken for the new concept to completely replace the old one. Traditional online learning systems have not exploited this categorization when handling drifts in a data stream. The characteristics of the different drift types can be used to develop approaches that achieve better performance than the existing systems, so the issue of handling concept drifts in online data according to their type deserves further exploration.
Among the most popular and effective approaches to handling concept drift is ensemble learning, in which a set of models built over different time periods is maintained and the predictions of the models are combined, usually weighted by each model's expertise on the current concept. Diversity among the base classifiers of the ensemble is an important factor affecting the performance of an ensemble learning system. If early instances of the new concept are stored and used for ensemble learning once a drift is detected, the overall accuracy after the drift can improve. Moreover, if the ensemble learns instances of the new concept with zero diversity during the drifting period, it may learn the new concept faster, thus boosting recovery. This project presents the above approach for effectively handling various drifts according to their composition characteristics.
Acknowledgements
Apart from our efforts, the success of this project depends largely on the encouragement
and guidelines of many others. We take this opportunity to express our gratitude to the
people who have been instrumental in the successful completion of this project.
We are highly indebted to Mrs. Vahida Z. Attar for her guidance and constant
supervision as well as for providing necessary information regarding the project and also
for her tremendous support throughout the project.
We are very thankful to our Head of the Department, Dr. Jibi Abraham, who moulded us both technically and morally for achieving greater success in life.
We would also like to thank Mr. Leandro L. Minku for his valuable guidance in the starting phase of the project. His research papers and the source code he generously provided were of great use in the experimentation part of the project.
Lastly, we express our sincere gratitude towards our parents and the faculty members of the Department of Computer Engineering and Information Technology, College
of Engineering, Pune for their kind co-operation and encouragement which helped us in
completion of this project.
Prashant Chaudhary
Gaurish Chaudhari
Sonali Rahagude
Contents

List of Tables
List of Figures
List of Algorithms
1 Motivation
2 Introduction
3 Literature Survey
  3.1 Data Stream Mining
  3.2 Online Learning
  3.3 Concept Drift
  3.4 Drift Detection Techniques
    3.4.1 Drift Detection Method - DDM
    3.4.2 Early Drift Detection Method - EDDM
    3.4.3 Adaptive Windowing - ADWIN
  3.5 Drift Handling Techniques
    3.5.1 Pure Ensemble Learning
    3.5.2 Learning with Drift Detection
4 Existing Approach
  4.1 Learning System using EDDM
  4.2 Online Boosting
5 Proposed Approach
  5.1 Use of Instance Window
    5.1.1 Storing Instances in Instance Window
    5.1.2 Training from Instance Window
  5.2 Use of Zero Diversity
    5.2.1 Instances used for Training with Zero Diversity
    5.2.2 Implementation of Zero Diversity
  5.3 Switching to New Ensemble
6 Experimentation
  6.1 Artificial Datasets
  6.2 Real Datasets
    6.2.1 Spam Corpus
    6.2.2 Forest Cover (UCI Repository)
    6.2.3 ELEC2 Dataset
    6.2.4 Usenet
  6.3 Implementation Environment
  6.4 User Interface for GPS
  6.5 Determination of Parameters
    6.5.1 Size of Instance Window
    6.5.2 Ratio value for PostWarning Level
7 Results and Analysis
  7.1 Accuracy
    7.1.1 Tables of Results
    7.1.2 Graphs of Results
  7.2 Noise Sensitivity
  7.3 Memory and Time Bounds
8 Conclusion
9 Supplementary Work
  9.1 Hospitalization Record Analyzer
  9.2 Research Paper Publication
  9.3 Approaches of Drift Type Detection
    9.3.1 Approach 1 : Using Standard Deviation Measure
    9.3.2 Approach 2 : Using Error Rate
    9.3.3 Approach 3 : Generating Association Rules/Decision Trees for Drift Type
  9.4 ADWIN Integrated GPS Approach
10 Future Work
List of Tables

6.1 ARTIFICIAL DATASETS
6.2 AVG. ACCURACIES FOR DIFFERENT INSTANCE WINDOW SIZES. DATASET-SIZE : 50,000
6.3 AVG. ACCURACIES FOR DIFFERENT VALUES OF POST WARNING LEVEL. DATASET-SIZE : 50,000
7.1 AVG. ACCURACIES. DATASET-SIZE : 50000, NO. OF DRIFTS : 1, SEVERITY : HIGH
7.2 AVG. ACCURACIES. DATASET-SIZE : 2000, NO. OF DRIFTS : 1, SEVERITY : HIGH
7.3 AVG. ACCURACIES. DATASET-SIZE : 100000, NO. OF DRIFTS : 1, SEVERITY : HIGH
7.4 AVG. ACCURACIES. DATASET-SIZE : 100000, NO. OF DRIFTS : 3, SEVERITY : HIGH
7.5 AVG. ACCURACIES. DATASET-SIZE : 50000, NO. OF DRIFTS : 1, SEVERITY : MEDIUM
7.6 AVG. ACCURACIES. DATASET-SIZE : 2000, NO. OF DRIFTS : 1, SEVERITY : MEDIUM
7.7 AVG. ACCURACIES. DATASET-SIZE : 100000, NO. OF DRIFTS : 1, SEVERITY : MEDIUM
7.8 AVG. ACCURACIES. DATASET-SIZE : 100000, NO. OF DRIFTS : 3, SEVERITY : MEDIUM
7.9 AVG. ACCURACIES. DATASET-SIZE : 50000, NO. OF DRIFTS : 1, SEVERITY : LOW
7.10 AVG. ACCURACIES. DATASET-SIZE : 2000, NO. OF DRIFTS : 1, SEVERITY : LOW
7.11 AVG. ACCURACIES. DATASET-SIZE : 100000, NO. OF DRIFTS : 1, SEVERITY : LOW
7.12 AVG. ACCURACIES. DATASET-SIZE : 100000, NO. OF DRIFTS : 3, SEVERITY : LOW
7.13 AVG. ACCURACIES. MISCELLANEOUS DATASETS
7.14 AVG. ACCURACIES FOR DIFFERENT NOISE LEVELS. DATASET : PLANE
7.15 PROCESSING TIME (IN SECONDS). DATASET-SIZE : 50000
7.16 MEMORY (IN BYTES). DATASET-SIZE : 50000
9.1 AVERAGE ACCURACIES FOR GPS(EDDM) AND GPS(ADWIN). DATASET-SIZE : 50,000
List of Figures

3.1 Types of concept drifts in streams
6.1 User Interface for GPS in MOA
7.1 Dataset: Circle, Size: 50000, Drift: Abrupt, Severity: High
7.2 Dataset: SineH, Size: 50000, Drift: Gradual(0.25N), Severity: High
7.3 Dataset: Plane, Size: 50000, Drift: Gradual(0.50N), Severity: High
7.4 Dataset: Line, Size: 2000, Drift: Abrupt, Severity: Medium
7.5 Dataset: SineV, Size: 2000, Drift: Gradual(0.25N), Severity: Medium
7.6 Dataset: Plane, Size: 2000, Drift: Gradual(0.50N), Severity: Medium
7.7 Dataset: SineH, Size: 100000, Drift: Abrupt, Severity: Low, No. of Drifts: 1
7.8 Dataset: Circle, Size: 100000, Drift: Gradual(0.25N), Severity: Low, No. of Drifts: 1
7.9 Dataset: SineV, Size: 100000, Drift: Gradual(0.50N), Severity: Low, No. of Drifts: 1
7.10 Dataset: SineV, Size: 100000, Drift: Abrupt, Severity: High, No. of Drifts: 3
7.11 Dataset: Circle, Size: 100000, Drift: Gradual(0.25N), Severity: High, No. of Drifts: 3
7.12 Dataset: SineH, Size: 100000, Drift: Gradual(0.50N), Severity: High, No. of Drifts: 3
7.13 Dataset: Hyperplane, Size: 50000, No. of Drifts: 1
7.14 Dataset: Waveform, Size: 150000, No. of Drifts: 0
7.15 Dataset: Spam Corpus
7.16 Dataset: Forest Cover
9.1 Hospitalization Record Analyzer : Main Window
9.2 Hospitalization Record Analyzer : Analysis example (Syndrome Distribution)
9.3 Hospitalization Record Analyzer : Analysis example (City-wise Dead-infected)
9.4 Example of Association Rules created using JRIP
9.5 Drift Detection Framework
List of Algorithms

1 ADWIN : Adaptive Windowing Algorithm
2 Existing Approach : SingleClassifierDrift
3 Online Boosting
4 Training Algorithm for Ensembles used in Proposed Approaches
5 Proposed Approach : GPSGradual
6 Proposed Approach : GPSAbrupt
Chapter 1
Motivation
Data mining is the process of extracting patterns from data. It is one of the most rapidly advancing and challenging fields in the area of real-life applications, drawing on concepts from both computer science and mathematics. From computer science, data mining uses algorithm design and analysis, data structures, database management systems, high-performance computing, and so on; from mathematics, it makes extensive use of probability and statistical analysis. Hence, a project in data mining requires knowledge from all these fields, and applying that knowledge together yields richer learning and more fruitful results.
During the literature survey for the project, various domains within data mining were studied, such as preprocessing of data, feature selection, synopsis data structures, data streaming, classification algorithms, clustering, concept drifts, ensemble techniques, etc. The aim of the search was to find a domain in which we could contribute something new and innovative to the data mining community, while keeping the scope of the project feasible within the available time.
Concept drifts in data streams are usually of two types: abrupt and gradual. The existing approaches for handling concept drift in online learning systems do not take into consideration the type of the drift being handled, so we decided to use the characteristics of the drift type to improve system performance in online learning. It is observed that the accuracy of an online learning system is lower for gradual drifts than for abrupt drifts; moreover, recovery in the case of gradual drifts is weaker and slower. We therefore first chose to work on improving the performance of the online learning system on data streams with gradual drifts, and then on abrupt drifts. Thus, we decided to develop a framework which handles both abrupt and gradual drifts effectively.
Chapter 2
Introduction
Online learning has a wide variety of applications in which training data arrives continuously in time and there are time and space constraints: for example, web traffic monitoring, network security, sensor signal processing, and credit card fraud detection. Online learning algorithms process each training instance once on arrival, without the need for storage or reprocessing, and maintain a current hypothesis that reflects all the training instances seen so far [18]. In this way, the learning algorithm takes a single training instance and a hypothesis as input and outputs an updated hypothesis [10].
Online learning environments are often non-stationary, and the variables to be predicted by the learning machine may change with time; this change is referred to as concept drift. Concept drifts can be categorized by their speed. Speed is the inverse of drifting time, which can be measured as the number of time steps taken for a new concept to completely replace the old one [17]; a higher speed thus corresponds to a lower number of time steps, and a lower speed to a higher number. According to the speed, a drift is abrupt when the complete change occurs in only one time step, and gradual otherwise [17].
For example, a sudden change in the buying preferences of a share purchaser due to a dip in the stock price of a particular company is an abrupt concept drift, whereas the transition from an old mailing system to a new one, where people use both systems for some time initially, is a gradual concept drift.
Ensemble learning is among the most popular and effective approaches to handling concept drift: a set of concept descriptions built over different time intervals is maintained, and either their predictions are combined using a form of voting or the most relevant description is selected [23]. Ensembles of classifiers have been successfully used to improve the accuracy of single classifiers in online learning [18, 10, 21, 15]. Learning machines used to model non-stationary environments (concept drifts) should be able to adapt quickly and accurately to possible changes.
We propose a novel ensemble approach for handling various concept drifts by exploiting their composition characteristics. In this approach, early instances of the new concept are stored and used for ensemble learning whenever a drift occurs. Also, all the classifiers in the ensemble are trained on these new-concept instances during the drifting period. Experiments show that when a concept drift occurs, the proposed approach obtains better accuracy than the Early Drift Detection Method (EDDM) [14] approach, a system which adopts the strategy of learning a new classifier from scratch when a drift is detected.
Chapter 3
Literature Survey
3.1 Data Stream Mining
A data stream is an ordered sequence of instances that can be read only once or a small number of times using limited computing and storage capabilities. Examples of data streams include computer network traffic, phone conversations, ATM transactions, web searches, and sensor data. Data stream mining, the process of extracting knowledge structures from continuous, rapid data streams, can be considered a subfield of data mining, machine learning, and knowledge discovery.
The core assumption of data stream processing is that training instances can be briefly inspected only a single time: they arrive in a high-speed stream and must then be discarded to make room for subsequent instances. The algorithm processing the stream has no control over the order of the instances seen and must update its model incrementally as each example is inspected. An additional desirable property, the so-called anytime property, requires that the model be ready to be applied at any point between training instances [8].
3.2 Online Learning
Online learning is a continuous learning process in which instances arrive one at a time and are processed only once, due to time and space constraints. Online learning algorithms process each training instance once on arrival, without the need for storage or reprocessing, and maintain a current hypothesis that reflects all the training instances seen so far [18].
Learning proceeds in a sequence of trials. In each trial, the algorithm receives an instance from some fixed domain and is to produce a binary prediction. At the end of the trial, the algorithm receives a binary label, which can be viewed as the correct prediction for the instance.
3.3 Concept Drift
Concept refers to the target variable which the model is trying to predict, and concept change is the change of the underlying concept over time. The term concept drift can be formally defined as follows: concept drift is a change in the distribution of a problem [11], which is characterized by the joint distribution p(x, w), where x represents the input attributes and w represents the target classes [16].
Two kinds of concept drift normally occur in the real world: abrupt and gradual [25]. Let SI and SII be two sources which generate instances corresponding to the old and new concepts, respectively, and let t0 be the time at which the drift occurs, before which all instances are from source SI.
Abrupt Drift : The simplest pattern of change is abrupt drift, in which at time t0 source SI is suddenly replaced by source SII, which continues thereafter.
Gradual Drift : Another kind of change is gradual drift, which refers to a certain period after t0 during which both sources SI and SII are active. As time passes, the probability of sampling from source SI decreases while the probability of sampling from source SII increases. Note that at the beginning of a gradual drift, before more instances are seen, an instance from source SII might easily be mistaken for random noise.
The period during which both sources SI and SII are active is called the drifting period. The width of the drifting period is inversely related to the speed of the drift: the greater the width, the lower the speed, i.e. the more gradual the change. We adopt this definition of gradual drift as the basis for developing the proposed algorithm to handle gradual changes in data streams.
Figure 3.1: Types of concept drifts in streams
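The mixing of sources SI and SII during a drifting period can be sketched as a small stream generator. This is an illustrative Python sketch, not code from the project; the linear ramp in the sampling probability and the generator interface are assumptions made for the example:

```python
import random

def drifting_stream(source_old, source_new, t0, width, length, seed=0):
    """Generate a stream in which, between t0 and t0 + width, the probability
    of sampling from the new source rises linearly from 0 to 1 (a gradual
    drift). A width of 1 reduces to an abrupt drift at time t0."""
    rng = random.Random(seed)
    for t in range(length):
        if t < t0:
            p_new = 0.0              # before the drift: only source SI
        elif t >= t0 + width:
            p_new = 1.0              # after the drifting period: only SII
        else:
            p_new = (t - t0) / width # drifting period: both sources active
        yield source_new() if rng.random() < p_new else source_old()

# Example: the old concept always emits 0, the new concept always emits 1.
stream = list(drifting_stream(lambda: 0, lambda: 1, t0=100, width=50, length=200))
```

With width=1 the same generator produces an abrupt drift, matching the categorization above.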
3.4 Drift Detection Techniques
To deal with change over time, most previous work can be classified by whether it uses full, partial, or no example memory [14]. Partial-memory methods use variations of the sliding-window idea: at every moment, one window (or more) containing the most recently read instances is kept, and only those are considered relevant for learning. A critical point in any such strategy is the choice of a window size. The simplest strategy is to decide on (or ask the user for) a window size W and keep it fixed throughout the execution of the algorithm. To detect change, one can keep a reference window with data from the past, also of some fixed size, and decide that change has occurred if some statistical test indicates that the distributions in the reference and current windows differ. Another approach, using no example memory but only aggregates, applies a decay function to instances so that they become less important over time.
Yet another approach to detecting changes in the distribution of the training instances monitors the online error rate of the algorithm [25]. In this method, learning takes place in a sequence of trials: when a new training example is available, it is classified using the current model, and the method tracks the online error of the algorithm, defining a warning level and a drift level for the current context. This approach is used in DDM and EDDM.
3.4.1 Drift Detection Method - DDM
The drift detection method (DDM) [11] uses a binomial distribution. For each point i in the sequence being sampled, the error rate is the probability of misclassification (pi), with standard deviation si = √(pi(1 − pi)/i). A significant increase in the error of the algorithm suggests that the class distribution is changing and, hence, that the current decision model is no longer appropriate. DDM therefore stores the values of pi and si at the point where pi + si reaches its minimum during the process (obtaining pmin and smin), and checks when the following conditions trigger:
• pi + si ≥ pmin + 2smin for the warning level.
• pi + si ≥ pmin + 3smin for the drift level.
This approach behaves well when detecting abrupt changes and gradual changes that are not very slow, but it has difficulties when the change is slowly gradual.
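The warning and drift tests above can be sketched as follows. This is a hypothetical Python illustration of the DDM statistics only: the class interface is an assumption, the 30-example warm-up before tracking the minimum follows the DDM paper, and the reset of the minimum after a confirmed drift is omitted for brevity:

```python
import math

class DDM:
    """Sketch of the DDM test: track the error rate p_i and its standard
    deviation s_i, remember the minimum of p_i + s_i, and flag a warning at
    2 standard deviations above the minimum and a drift at 3."""
    def __init__(self):
        self.i = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, misclassified):
        self.i += 1
        self.errors += int(misclassified)
        p = self.errors / self.i
        s = math.sqrt(p * (1 - p) / self.i)
        # Track the minimum of p + s after a short warm-up period.
        if self.i >= 30 and p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s >= self.p_min + 3 * self.s_min:
            return "DRIFT"
        if p + s >= self.p_min + 2 * self.s_min:
            return "WARNING"
        return "STABLE"
```

Fed a stream with a stable 10% error rate, the detector stays stable; when the error rate jumps, the drift level triggers within a few dozen instances.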
3.4.2 Early Drift Detection Method - EDDM
We use EDDM [14] as the drift detection method in our approach. The Early Drift Detection Method (EDDM) was developed to improve detection in the presence of gradual concept drift while keeping good performance under abrupt concept drift. The basic idea is to consider the distance between two consecutive classification errors instead of only the number of errors, as in DDM: a significant decrease in the average distance between two consecutive errors suggests that the class distribution is changing.
EDDM calculates the average distance between two errors made by the classifier system (Pi) and its standard deviation (Si), and stores their maximum values so far (Pmax and Smax). If (Pi + 2Si) / (Pmax + 2Smax) < α, where α is a pre-defined parameter, a concept drift is suspected and a warning level is triggered. If the similarity between (Pi + 2Si) and (Pmax + 2Smax) starts to increase after a warning level is triggered, the warning is cancelled and the method returns to normality. If (Pi + 2Si) / (Pmax + 2Smax) < β, where β is a pre-defined parameter and α > β, a concept drift is confirmed. Thus,
• (Pi + 2Si) / (Pmax + 2Smax) < α for the warning level.
• (Pi + 2Si) / (Pmax + 2Smax) < β for the drift level.
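The ratio test can be sketched as below. This is a hypothetical Python illustration, not the EDDM reference implementation: the running mean and standard deviation of the distances between errors are maintained with Welford's method, and the defaults α = 0.95, β = 0.90 and a 30-error warm-up follow the values suggested for EDDM:

```python
import math

class EDDM:
    """Sketch of the EDDM test: track the mean distance P_i between two
    consecutive errors and its standard deviation S_i, remember the maximum
    of P_i + 2*S_i, and compare the current ratio against alpha (warning)
    and beta (drift)."""
    def __init__(self, alpha=0.95, beta=0.90, min_errors=30):
        self.t = 0            # instances seen so far
        self.last_error = 0   # time step of the previous error
        self.n_errors = 0
        self.mean = 0.0       # running mean of error distances
        self.m2 = 0.0         # running sum of squared deviations (Welford)
        self.max_val = 0.0    # maximum of mean + 2*std seen so far
        self.alpha, self.beta, self.min_errors = alpha, beta, min_errors

    def update(self, misclassified):
        self.t += 1
        if not misclassified:
            return "STABLE"
        dist = self.t - self.last_error
        self.last_error = self.t
        self.n_errors += 1
        delta = dist - self.mean
        self.mean += delta / self.n_errors
        self.m2 += delta * (dist - self.mean)
        std = math.sqrt(self.m2 / self.n_errors)
        val = self.mean + 2 * std
        if val > self.max_val:
            self.max_val = val
        if self.n_errors < self.min_errors or self.max_val == 0:
            return "STABLE"
        ratio = val / self.max_val
        if ratio < self.beta:
            return "DRIFT"
        if ratio < self.alpha:
            return "WARNING"
        return "STABLE"
```

When errors that used to arrive every 20 instances start arriving on every instance, the mean distance collapses, the ratio falls below β, and a drift is confirmed.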
3.4.3 Adaptive Windowing - ADWIN
ADWIN [7] is parameter- and assumption-free in the sense that it automatically detects and adapts to the current rate of change. It keeps a variable-length window of recently seen items, with the property that the window has the maximal length statistically consistent with the hypothesis that there has been no change in the average value inside the window. More precisely, an older fragment of the window is dropped if and only if there is enough evidence that its average value differs from that of the rest of the window. This has two consequences: first, change is reliably declared whenever the window shrinks; second, at any time the average over the existing window can be reliably taken as an estimate of the current average in the stream (barring a very small or very recent change that is still not statistically visible).
ADWIN's only parameter is a confidence bound δ, indicating how confident we want to be in the algorithm's output, a parameter inherent to all algorithms dealing with random processes [8]. The algorithm keeps a sliding window W with the most recently read xi. Let n denote the length of W, µ̂W the (observed) average of the elements in W, and µW the (unknown) average of µt for t ∈ W. Strictly speaking, these quantities should be indexed by t, but in general t will be clear from the context. Algorithm 1 describes the adaptive windowing algorithm.
Algorithm 1 ADWIN : Adaptive Windowing Algorithm
1: Initialize window W
2: for each t > 0 do
3:   W ← W ∪ {xt} (i.e. add xt to the head of W)
4:   repeat
5:     Drop elements from the tail of W
6:   until |µ̂W0 − µ̂W1| < εcut holds for every split of W into W = W0 · W1
7:   output µ̂W
8: end for
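A naive rendering of Algorithm 1 in Python might look as follows. This is an illustrative sketch only: it stores the window explicitly and rechecks every split from scratch, whereas the real ADWIN uses exponential histograms for efficiency; εcut follows the Hoeffding-style bound from the ADWIN paper with δ′ = δ/n:

```python
import math

def adwin_step(window, x, delta=0.01):
    """One step of a naive ADWIN sketch: append x, then repeatedly drop the
    oldest element while some split W0.W1 of the window has sub-averages
    differing by at least the cut threshold epsilon_cut."""
    window.append(x)
    while True:
        n = len(window)
        dropped = False
        for i in range(1, n):
            w0, w1 = window[:i], window[i:]
            # Harmonic mean of the two sub-window lengths.
            m = 1.0 / (1.0 / len(w0) + 1.0 / len(w1))
            eps_cut = math.sqrt((1.0 / (2 * m)) * math.log(4.0 * n / delta))
            if abs(sum(w0) / len(w0) - sum(w1) / len(w1)) >= eps_cut:
                del window[0]   # drop the oldest element (tail of W)
                dropped = True
                break
        if not dropped:
            return window

# Feeding a stream whose mean jumps from 0 to 1 shrinks the window so that
# its average tracks the new mean.
w = []
for x in [0.0] * 60 + [1.0] * 60:
    adwin_step(w, x)
```

A single outlier does not shrink the window (the bound is too large for one element), which is the behaviour that makes ADWIN robust to noise.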
3.5 Drift Handling Techniques
3.5.1 Pure Ensemble Learning
An ensemble consists of a set of individually trained classifiers whose predictions are combined when classifying novel instances. Bagging and boosting are well-known ensemble learning algorithms: bagging gives equal weight to all classifiers, whereas boosting weights each classifier according to its accuracy. Combining the output of several classifiers in an ensemble is useful only if there is disagreement among them [6]. Usually the implicit aim of such disagreement, or diversity, is to have at least one classifier in the ensemble trained for each distinct concept. Thus, when tackling non-stationary concepts, ensembles of classifiers have several advantages over single-classifier methods: they are easy to scale and parallelize, they can adapt to change quickly by pruning under-performing parts of the ensemble, and they therefore usually also generate more accurate concept descriptions.
3.5.2 Learning with Drift Detection
The problem with pure ensemble learning is that recovery time is very high because there is no drift detection, especially in the case of gradual drifts. In the learning-with-drift-detection approach, training instances are stored during the warning period of the drift detection method. If the warning level is cancelled, the stored instances are removed; if a drift is confirmed, the classifier system is reset and a new classifier system is created. The new classifier system first learns all the instances stored since the warning level was triggered and then starts to learn the new training instances. Here, using an ensemble of classifiers is preferred, as ensembles improve the accuracy of single classifiers.
In MOA this approach is implemented by the wrapper class SingleClassifierDrift [8]. Among the ensembles available as base classifiers for this wrapper, we have used OzaBag [19], OzaBoost [19] and OCBoost [20].
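The warning-buffer strategy described above can be sketched independently of any particular detector or base learner. This is a hypothetical Python sketch, not the MOA API: the detector is assumed to return "STABLE", "WARNING" or "DRIFT", and make_classifier, train and predict are assumed interfaces:

```python
class DriftHandlingWrapper:
    """Sketch of learning with drift detection: buffer instances during the
    warning period; on a confirmed drift, replace the classifier with a
    fresh one trained on the buffered (new-concept) instances; on a return
    to stability, discard the buffer as a false alarm."""
    def __init__(self, make_classifier, detector):
        self.make_classifier = make_classifier
        self.classifier = make_classifier()
        self.detector = detector
        self.buffer = []

    def process(self, x, y):
        correct = self.classifier.predict(x) == y
        level = self.detector.update(not correct)
        if level == "WARNING":
            self.buffer.append((x, y))         # store warning-period instances
        elif level == "DRIFT":
            self.classifier = self.make_classifier()  # learn from scratch
            for bx, by in self.buffer:                # replay the buffer
                self.classifier.train(bx, by)
            self.buffer = []
        else:  # STABLE: the warning was a false alarm
            self.buffer = []
        self.classifier.train(x, y)
        return correct
```

Note that the new classifier starts from the buffered instances rather than empty, which is exactly what shortens recovery after the drift.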
Chapter 4
Existing Approach
4.1 Learning System using EDDM
A learning system which uses EDDM behaves in the following way. During the warning period, a new classifier system is created and trained on all incoming instances until the drift level is detected by EDDM. If the warning level is cancelled, the new classifier system is discarded. If a drift is confirmed, p′max and s′max are reset and the original classifier system is replaced by the new one; new values of p′max and s′max are considered only after 30 errors have occurred. Thus, the EDDM approach adopts the strategy of learning a new classifier from scratch when a drift is detected. This approach is implemented by the SingleClassifierDrift class in MOA; Algorithm 2 describes it.
In the EDDM approach, we use online boosting as the learning technique for the classifier ensemble, with EDDM as the drift detection method.
Algorithm 2 Existing Approach : SingleClassifierDrift
Input: inst, currentEnsemble, newEnsemble
1: predictedClass ← PredictClassOfInstance(inst)
2: classification ← (predictedClass == inst.Class)
3: level ← computeEDDMLevel(inst, classification)
4: if level == WARNING then
5:   if newClassifierReset == TRUE then
6:     newEnsemble.reset()
7:     newClassifierReset ← FALSE
8:   end if
9:   newEnsemble.train(inst)
10: else if level == DRIFT then
11:   currentEnsemble ← newEnsemble
12:   newEnsemble.reset()
13: else if level == STABLE then
14:   newClassifierReset ← TRUE
15: end if
16: currentEnsemble.train(inst)
Output: Updated newEnsemble and currentEnsemble
4.2 Online Boosting
The boosting algorithm generates a sequence of base models h1, h2, ..., hM using weighted training sets (weighted by D1, D2, ..., DM) such that the training instances misclassified by hm−1 are given half the total weight when generating model hm, and the correctly classified instances are given the other half. When the base model learning algorithm cannot learn from weighted training sets, one can generate samples with replacement according to Dm [19].
The online boosting algorithm [18] simulates this sampling with replacement using the Poisson distribution. When a base classifier misclassifies a training instance, the Poisson distribution parameter lambda (λ) associated with the instance is increased when the instance is presented to the next base model, so that the instance is learnt by a larger number of base classifiers; otherwise λ is decreased. Algorithm 3 describes the online boosting algorithm.
Algorithm 3 Online Boosting
Input: inst, λ^sc_m, λ^sw_m, BaseLearner = Hoeffding Tree
 1: Set weight of example λd ← 1
 2: l ← ensembleLength
 3: for m = 1, 2, ..., l do
 4:   Set k ← Poisson(λd)
 5:   if k > 0 then
 6:     Update hm with the current instance inst
 7:   end if
 8:   if hm correctly classifies instance inst then
 9:     λ^sc_m ← λ^sc_m + λd
10:     λd ← λd (N / 2λ^sc_m)
11:   else
12:     λ^sw_m ← λ^sw_m + λd
13:     λd ← λd (N / 2λ^sw_m)
14:   end if
15: end for
Output: Updated ensemble
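The weight update above can be sketched in a few lines of Java. This is a minimal illustration, not the MOA implementation: `OnlineBoostSketch`, its method names, and the array-based running sums are assumptions made for clarity. It shows how λd shrinks after a correct prediction and grows after a misclassification, so later base models concentrate on hard instances.

```java
import java.util.Random;

// Minimal sketch of the Algorithm 3 weight update (names are illustrative).
class OnlineBoostSketch {
    static final Random RNG = new Random(42);

    // Knuth's method for sampling Poisson(lambda); adequate for small lambda.
    static int poisson(double lambda) {
        double limit = Math.exp(-lambda), p = 1.0;
        int k = 0;
        do { k++; p *= RNG.nextDouble(); } while (p > limit);
        return k - 1;
    }

    // Update model m's running correct/wrong weight sums and return the new
    // example weight lambda_d (lines 8-14 of Algorithm 3); n = instances seen.
    static double updateWeight(boolean correct, double lambdaD,
                               double[] lamSC, double[] lamSW, int m, double n) {
        if (correct) {
            lamSC[m] += lambdaD;
            return lambdaD * (n / (2 * lamSC[m])); // weight shrinks
        } else {
            lamSW[m] += lambdaD;
            return lambdaD * (n / (2 * lamSW[m])); // weight grows
        }
    }

    public static void main(String[] args) {
        double[] sc = {0.0}, sw = {0.0};
        double lamD = updateWeight(true, 1.0, sc, sw, 0, 1.0);
        System.out.println("lambda_d after a correct prediction: " + lamD);
    }
}
```

The instance is then shown to model m a number of times drawn from `poisson(lambdaD)`, mirroring line 4 of the pseudocode.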
Chapter 5
Proposed Approach
In the initial experiments on drifting data, it was observed that the accuracy of an online learning system for gradual drifts is lower than that for abrupt drifts; moreover, recovery from gradual drifts was weaker and slower. The lower accuracy for gradual drifts stems from the composition of the stream during the drift. As mentioned in Section 3, the data stream during the drifting period consists of instances from the old as well as the new concept, so the classifier system is trained on both concepts whenever a concept drift occurs. In an abrupt drift, by contrast, the data stream after the drift consists only of instances from the new concept. Since the classifier then learns only the new concept, it learns better, which leads to greater classification accuracy and faster recovery. Thus, the main aim was to make the classifier system learn only on instances from the new concept whenever a drift occurs.
To ensure the classifier is trained with enough instances from the new concept to attain greater classification accuracy, we can store instances of the new concept in advance, so that they are available to the new classifier when the drift is confirmed by the drift detection method. An instance window containing instances only from the new concept can thus improve the learning of the classifier system and thereby its recovery.
Also, in an ensemble classifier system, the classification of a given instance depends on the votes of all the individual classifiers. So, one way of making the classifier learn the new concept better after the drift is to make all the individual base classifiers learn on the instances from the new concept; the ensemble is then trained well on the new concept because every classifier learns it. We call this method of making all the base classifiers learn each instance zero diversity.
Based on the ideas described above, we propose two approaches, GPSGradual and GPSAbrupt, in an attempt to improve classifier accuracy and recovery for gradual and abrupt drifts respectively. Both approaches exploit the composition characteristics of the drifting streams.
We use EDDM as the drift detection method because it is a recent method which has been shown to attain accuracy similar to previous methods when drifts are abrupt, and better accuracy when drifts are gradual.
From the start of the data stream, we maintain two ensembles, currentEnsemble and newEnsemble. Both ensembles are trained in the same way on all instances before a drift occurs; thus, before the drift, newEnsemble is a copy of currentEnsemble. Maintaining both ensembles from the start not only helps to adapt to the change quickly but also improves performance in case of false drift detections.
5.1 Use of Instance Window
5.1.1 Storing Instances in Instance Window
An instance window of fixed size is maintained for storing the recent instances on which the new ensemble is to be trained. These instances are stored in first-in-first-out fashion. The window size was determined experimentally to be 50.
In a gradual drift stream, whenever a warning level is flagged for the possibility of a drift, instances misclassified by the currentEnsemble are likely to belong to the new concept, since the currentEnsemble has not yet learnt the new concept well. In the GPSGradual approach, we therefore store only those instances in the window that have been misclassified by the currentEnsemble, so that during the drift period the window fills with instances of the new concept.
In abrupt drift stream, all the instances after the drift occurs, belong to the new
concept. So in GPSAbrupt approach, all the instances from the stream are stored so
that, during the drifting period, all of the incoming instances belong to the new concept.
Thus, the instance window will contain instances of new concept.
The advantage of this window is that it contains instances belonging to the new concept, so training the new ensemble on these instances leads to faster recovery from concept drifts.
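The fixed-size FIFO window above can be sketched as a small Java class. This is an illustrative stand-in, not the actual implementation: the class name, methods, and the generic instance type are assumptions. When the window is full, the oldest instance is evicted, so the window always holds the most recent (and, during a drift, increasingly new-concept) instances.

```java
import java.util.ArrayDeque;

// Sketch of the fixed-size FIFO instance window (size 50 in the report).
class InstanceWindow<T> {
    private final int capacity;
    private final ArrayDeque<T> window = new ArrayDeque<>();

    InstanceWindow(int capacity) { this.capacity = capacity; }

    void add(T inst) {
        if (window.size() == capacity) window.pollFirst(); // drop oldest
        window.addLast(inst);
    }

    T oldest() { return window.peekFirst(); }

    int size() { return window.size(); }

    public static void main(String[] args) {
        InstanceWindow<Integer> w = new InstanceWindow<>(50);
        for (int i = 0; i < 55; i++) w.add(i); // the 5 oldest are evicted
        System.out.println(w.size() + " instances, oldest = " + w.oldest());
    }
}
```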
5.1.2 Training from Instance Window
The newEnsemble is trained on the instance window at different times for the two approaches (Algorithm 5 : GPSGradual and Algorithm 6 : GPSAbrupt).
In case of gradual drifts, when the warning level is detected, the data stream initially contains more instances of the old concept than of the new one. So, at the warning level, the instance window still contains instances from the old concept as well. Gradually, the probability of instances from the new concept increases, so at some later point after the warning level is triggered, the instance window will contain a majority of new-concept instances. We call this point the Post Warning level. It has been introduced in GPSGradual so that more relevant instances of the new concept are present in the instance window when the newEnsemble is trained on it; at this level, the new ensemble is trained on the instance window. The optimal value for this level has been determined experimentally as 0.925.
As mentioned previously, in an abrupt drift, only instances from the new concept are present in the stream after the drift occurs. Thus, when the warning level is detected in case of abrupt drifts, the instance window already contains instances from the new concept. This eliminates the need for a Post Warning level in the GPSAbrupt approach, so the newEnsemble is trained on the instances from the instance window at the warning level itself.
Whenever the respective level is triggered, the newEnsemble thus learns on all the instances in the instance window.
5.2 Use of Zero Diversity
5.2.1 Instances used for Training with Zero Diversity
In GPSGradual as well as GPSAbrupt approaches, whenever a warning level is triggered
by the drift detection method, we start training a new ensemble with zero diversity i.e.
all the individual base classifiers in the ensemble learn on the given instance. The reason
for training the new ensemble with zero diversity is to help the new ensemble learn the
new concept more efficiently and quickly. Hence, it leads to better accuracy and faster
recovery.
As mentioned in the previous section, a gradual drift stream consists of instances from
new as well as old concepts. Hence, when the warning level is triggered, it is necessary to
ensure that only instances from the new concept are given for training with zero diversity
to the new ensemble.
For the GPSGradual approach, the instances on which the new ensemble is trained with zero diversity after the warning level are the instances misclassified by the current ensemble. Whenever a warning level is flagged for the possibility of a drift, instances misclassified by the current ensemble are likely to belong to the new concept, since the current ensemble has not yet learnt the new concept well. Hence, training is done only on the misclassified instances. An abrupt drift data stream, on the other hand, consists of instances only from the new concept once the drift occurs, so after the warning level is flagged, all instances necessarily come from the new concept. Hence, in GPSAbrupt, all instances after the warning level are given to the new ensemble for training with zero diversity.
5.2.2 Implementation of Zero Diversity
Diversity of an ensemble is a measure of the disagreement among its classifiers. Several means can be used to control it: different presentations of the input data, variations in learner design, or adding a penalty to the outputs to encourage diversity.
In an online learning system, the Poisson distribution is used to decide whether a classifier in the ensemble will be presented an incoming instance for training [19], so not all classifiers in the ensemble are trained on the same instance. In our approach, we tune the Poisson parameter in order to obtain different diversities for the ensemble. As shown in Algorithm 4, the training method is passed a flag that determines the diversity for training the ensemble on the current instance. When the flag is normalDiverse, the value of k is determined by the Poisson parameter λ. When the flag is zeroDiverse, the value of k is set to 1, so that all the classifiers in the ensemble are trained on the current instance. Training all the classifiers on new-concept instances thus improves the accuracy of the ensemble.
Algorithm 4 Training Algorithm for Ensembles used in Proposed Approaches
Input: inst, flag, λ^sc_m, λ^sw_m, BaseLearner = Hoeffding Tree
 1: Set weight of example λd ← 1
 2: l ← ensembleLength
 3: for m = 1, 2, ..., l do
 4:   if flag == normalDiverse then
 5:     Set k ← Poisson(λd)
 6:   else if flag == zeroDiverse then
 7:     Set k ← 1
 8:   end if
 9:   if k > 0 then
10:     Update hm with the current instance inst
11:   end if
12:   if hm correctly classifies instance inst then
13:     λ^sc_m ← λ^sc_m + λd
14:     λd ← λd (N / 2λ^sc_m)
15:   else
16:     λ^sw_m ← λ^sw_m + λd
17:     λd ← λd (N / 2λ^sw_m)
18:   end if
19: end for
Output: Updated ensemble
5.3 Switching to New Ensemble
Once the drift is confirmed by the drift detection method, the newEnsemble becomes the currentEnsemble, which is then used for making further predictions and resumes learning with normal diversity. The newEnsemble is reset and kept ready for handling future drifts.
Algorithm 5 Proposed Approach: GPSGradual
Input: inst, currentEnsemble, newEnsemble
 1: predictedClass ← PredictClassOfInstance(inst)
 2: classification ← (predictedClass == inst.Class)
 3: if classification == false then
 4:   add inst to window[]
 5: end if
 6: level ← computeEDDMLevel(inst, classification)
 7: if level == STABLE then
 8:   newEnsemble.train(inst, normalDiverse)
 9: else if level == WARNING then
10:   if classification == false then
11:     newEnsemble.train(inst, zeroDiverse)
12:   end if
13: else if level == POSTWARNING then
14:   for instTemp IN window[] do
15:     newEnsemble.train(instTemp, zeroDiverse)
16:   end for
17:   if classification == false then
18:     newEnsemble.train(inst, zeroDiverse)
19:   end if
20: else if level == DRIFT then
21:   currentEnsemble ← newEnsemble
22:   newEnsemble.reset()
23: end if
24: currentEnsemble.train(inst, normalDiverse)
Output: Updated newEnsemble and currentEnsemble
Algorithm 6 Proposed Approach: GPSAbrupt
Input: inst, currentEnsemble, newEnsemble
 1: predictedClass ← PredictClassOfInstance(inst)
 2: classification ← (predictedClass == inst.Class)
 3: add inst to window[]
 4: level ← computeEDDMLevel(inst, classification)
 5: if level == STABLE then
 6:   newEnsemble.train(inst, normalDiverse)
 7: else if level == WARNING then
 8:   newEnsemble.train(inst, zeroDiverse)
 9:   for instTemp IN window[] do
10:     newEnsemble.train(instTemp, zeroDiverse)
11:   end for
12:   if classification == false then
13:     newEnsemble.train(inst, zeroDiverse)
14:   end if
15: else if level == DRIFT then
16:   currentEnsemble ← newEnsemble
17:   newEnsemble.reset()
18: end if
19: currentEnsemble.train(inst, normalDiverse)
Output: Updated newEnsemble and currentEnsemble
Chapter 6
Experimentation
6.1 Artificial Datasets
When working with real-world datasets, it is not possible to know exactly when a drift
starts to occur, which type of drift is present, or even if there really is a drift. So, it is
not possible to perform a detailed analysis of the behaviour of algorithms in the presence
of concept drifts using only pure real-world datasets. In order to analyze the strong and
weak points of a particular algorithm, it is necessary first to check its behaviour using
artificial datasets containing simulated drifts. Depending upon the type of drift in which
the algorithm is weak, it may be necessary to adopt a different strategy to improve it,
so that its performance is better when applied to real-world problems. To generate the
datasets with concept drift, we use the approach used by Minku L., 2008 [16].
We created the datasets for the following problems (Table 6.1): Circle, Sine, Line and
Plane2d.
In circle, line and plane2d, the parameters r, a0 and a0 respectively represent the concepts. For sine, both c and d represent concepts, thus generating two problems, SineH and SineV. The instances in the datasets contain x (or xi) and y as the input attributes and the concept (which can take the value 0 or 1) as the output attribute (see Table 6.1).
We chose these problems for generating datasets because they cover a wide variety: Circle represents second-degree problems, Sine represents trigonometric problems, Line represents two-dimensional linear problems and Plane represents three-dimensional problems.
Gradual drift is introduced into the datasets by decreasing the speed of the drift. The speed of the drift can be modelled by a degree-of-dominance function representing the probability that an instance of the old or new concept will be presented to the learning system. To present instances of both the new and old concepts for a certain period of time, we used the following linear degree-of-dominance functions:
v_n(t) = (t − N) / drifting_time,   N < t ≤ N + drifting_time   (6.1)

v_o(t) = 1 − v_n(t),   N < t ≤ N + drifting_time   (6.2)
where v_n(t) and v_o(t) are the degrees of dominance of the new and old concepts respectively; t is the current time step; N is the number of time steps before the drift starts to occur; and drifting_time is the number of time steps for a complete replacement of the old concept.
The first N instances were generated according to the old concept (vo (t) = 1, 1 ≤ t ≤
N ). The next drif ting time instances (N < t ≤ N + drif ting time) were generated
according to the degree of dominance functions vn (t) (Equation 6.1) and vo (t) (Equation
6.2). The remaining instances were generated according to the new concept (vn (t) = 1,
N + drif ting time < t ≤ 2N ).
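The generation scheme above can be expressed directly in code. This is a small sketch of Equations 6.1 and 6.2 (class and method names are illustrative): before time step N only the old concept is generated, during the drifting period the probability of the new concept rises linearly, and afterwards only the new concept remains.

```java
// Sketch of the linear degree-of-dominance functions (Eq. 6.1 and 6.2).
class DegreeOfDominance {
    // v_n(t): probability that the instance at time step t is from the new concept
    static double vNew(int t, int n, int driftingTime) {
        if (t <= n) return 0.0;                   // old concept only
        if (t > n + driftingTime) return 1.0;     // new concept only
        return (double) (t - n) / driftingTime;   // linear ramp (Eq. 6.1)
    }

    // v_o(t): probability of the old concept (Eq. 6.2)
    static double vOld(int t, int n, int driftingTime) {
        return 1.0 - vNew(t, n, driftingTime);
    }

    public static void main(String[] args) {
        // gradual drift with N = 25000 and drifting_time = 0.25N = 6250
        System.out.println(vNew(28125, 25000, 6250)); // halfway through the drift
    }
}
```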
The drifting_time is 1 for abrupt drifts, whereas for gradual drifts it has been set to 0.25N and 0.50N. Thus, for each of the 5 problems we have 3 datasets corresponding to 3 different drifts (1 abrupt and 2 gradual), giving 15 different datasets for a given size. We have created datasets of various sizes: 2000, 50000 and 100000 instances. In the datasets with 2000 and 50000 instances, drifts have been introduced at positions 1000 and 25000 respectively. In the datasets of size 100000 with 1 drift, the drift is at position 25000. Also, a dataset of size 100000 having 3 drifts at positions 20000, 40000 and 75000 has been created. This gives 60 datasets in all.
Severity has been defined as the percentage of the input space which has its target class changed after the drift is complete. Based on this, drifts are said to be of low severity (≈ 25%), medium severity (≈ 50%) or high severity (≈ 75%). The 60 datasets were generated for all three severities, giving 180 datasets in all [17].
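For a concrete (hypothetical) illustration of this definition: in the circle problem, the target class flips exactly in the annulus between the old and new radii, so the severity is the annulus area relative to the unit-square input space. The class and method names below are assumptions made for this sketch; the circle of radius 0.5 centred at (0.5, 0.5) still fits inside the square.

```java
// Severity of a circle-problem drift: fraction of the unit square whose
// class changes when the radius moves from rOld to rNew.
class SeveritySketch {
    static double circleSeverity(double rOld, double rNew) {
        double lo = Math.min(rOld, rNew), hi = Math.max(rOld, rNew);
        return Math.PI * (hi * hi - lo * lo); // area of the flipped annulus
    }

    public static void main(String[] args) {
        // first circle drift of Table 6.1: r goes from 0.2 to 0.5
        System.out.println(circleSeverity(0.2, 0.5));
    }
}
```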
Table 6.1 describes the various problems and values used for generating datasets from
these problems. Three ranges of concepts are given corresponding to each problem. Only
the first range is used for generating datasets with one drift whereas all three ranges are
used for generating datasets with three drifts.
Also, to increase the difficulty of the problems, we added 8 irrelevant attributes and 10% class noise to the plane2d datasets. The code for generating these datasets is written in the C language.
Apart from these, we have also used some standard datasets generated through MOA: SEA [22] (size 100000) with 3 drifts at 25000, 50000 and 75000; Waveform (size 100000) with 3 drifts at 25000, 50000 and 75000; Waveform [9] (size 150000) with no drift; and Hyperplane [24] (size 50000) with 1 drift at 25000.
The reason for using the datasets of 100000 instances with 1 drift at 25000 is to evaluate the behaviour of GPS long after the drift has occurred. Driftless datasets have been used to evaluate the performance of GPS in the absence of drifts in the data stream, and noisy datasets to check the noise sensitivity of the algorithm. Also, datasets with 3 drifts have been created to evaluate the performance in case of multiple drifts; moreover, the drifts have been positioned close together to check for good and quick recovery. Thus, we have covered a wide variety of situations that can appear in data streams.
Table 6.1: ARTIFICIAL DATASETS

Problem                        Fixed Values      Range of Attributes    Range of Concepts
Circle:                        a=0.5, b=0.5      x:[0,1], y:[0,1]       r: 0.2 → 0.5
  (x − a)² + (y − b)² ≤ r²                                              r: 0.5 → 0.1
                                                                        r: 0.1 → 0.5
SineV:                         a=1, b=1, c=0     x:[0,10], y:[−10,10]   d: −8 → 7
  y ≤ a·sin(bx + c) + d                                                 d: 7 → −7
                                                                        d: −7 → 6
SineH:                         a=5, d=5, b=1     x:[0,4π], y:[0,10]     c: 0 → −π
  y ≤ a·sin(bx + c) + d                                                 c: −π → 0
                                                                        c: 0 → −π
Line:                          a1=0.1            x:[0,1], y:[0,1]       a0: −0.1 → −0.8
  y ≤ −a0 + a1·x                                                        a0: −0.8 → −0.2
                                                                        a0: −0.2 → −0.8
Plane:                         a1=0.1, a2=0.1    x:[0,1], y:[0,5]       a0: −0.7 → −4.4
  y ≤ −a0 + a1·x1 + a2·x2                                               a0: −4.4 → −0.5
                                                                        a0: −0.5 → −4.3
6.2 Real Datasets
6.2.1 Spam Corpus
For testing on real-world data, we chose the spam corpus dataset [13], a real-world textual dataset built from the SpamAssassin data collection. It consists of 9324 instances with 40,000 attributes and exhibits gradual concept drift. There are two classes, legitimate and spam, with spam forming around 20% of the instances.
6.2.2 Forest Cover (UCI Repository)
The Forest Cover dataset [4] contains geo-spatial descriptions of different types of forests. It has 7 classes, 54 attributes and around 581,000 instances. We normalize the dataset and arrange the data so that in any chunk at most 3 and at least 2 classes co-occur, and new classes appear randomly.
6.2.3 ELEC2 Dataset
The data was collected from the Australian New South Wales Electricity Market. The ELEC2 dataset [14] contains 45312 instances dated from May 1996 to December 1998, each referring to a period of 30 minutes. Each example has the following fields: the day of the week, the time stamp, the NSW electricity demand, the Vic electricity demand, the scheduled electricity transfer between states and the class label.
6.2.4 Usenet
The usenet dataset [12] is based on the 20 newsgroups collection. It simulates a stream of messages from different newsgroups that are sequentially presented to a user, who labels them as interesting or junk according to his/her personal interests.
6.3 Implementation Environment
The proposed algorithms are implemented in the Java programming language on the Linux platform. We have used the MOA (Massive Online Analysis) tool [8] for all the experimentation. MOA is a software environment, written in Java, for implementing algorithms and running experiments for online learning from evolving data streams; it includes a collection of machine learning algorithms and evaluation tools for data stream mining.
MOA is concerned with the problem of classification, perhaps the most commonly
researched machine learning task. The goal of classification is to produce a model that
can predict the class of unlabeled instances, by training on instances whose label, or class,
is supplied.
We chose MOA because it is an open-source tool including a wide variety of stream generators, classifiers, evaluators and drift detection methods for analysis purposes. Also, new classifiers can easily be built and added to the MOA framework.
To build a picture of accuracy and time, we use the Interleaved Test-Then-Train evaluation model available in MOA. In this model, each individual instance is used to test the model before it is used for training, and from this the accuracy can be incrementally updated. Because testing always precedes training, the model is always tested on instances it has not seen. This scheme has the advantage that no holdout set is needed for testing, making maximum use of the available data. It also ensures a smooth plot of accuracy over time, as each individual instance becomes increasingly less significant to the overall average.
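The evaluation loop above can be sketched compactly. This is an illustrative stand-in, not the MOA API: the `Classifier` interface and `Prequential.run` are assumed names. Every instance first tests the current model, then trains it, so accuracy is computed only on unseen instances.

```java
import java.util.List;

// Stand-in classifier interface (not the MOA API).
interface Classifier<X> {
    int predict(X x);
    void train(X x, int label);
}

// Sketch of the Interleaved Test-Then-Train (prequential) evaluation loop.
class Prequential {
    static <X> double run(Classifier<X> model, List<X> xs, List<Integer> ys) {
        int correct = 0;
        for (int i = 0; i < xs.size(); i++) {
            if (model.predict(xs.get(i)) == ys.get(i)) correct++; // test first
            model.train(xs.get(i), ys.get(i));                    // then train
        }
        return (double) correct / xs.size();
    }

    public static void main(String[] args) {
        Classifier<Integer> alwaysZero = new Classifier<Integer>() {
            public int predict(Integer x) { return 0; }  // always predict class 0
            public void train(Integer x, int label) { }  // stateless stub
        };
        System.out.println(run(alwaysZero, List.of(1, 2, 3, 4), List.of(0, 0, 0, 1))); // prints 0.75
    }
}
```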
The experiments were performed on a 2.59 GHz Intel Core 2 Duo processor with 3 GB main memory, running Ubuntu 10.04. For performance comparison, different ensemble techniques (OzaBoost, OCBoost, OzaBag and OzaBagADWIN) are used; the first three are wrapped in SingleClassifierDrift (the EDDM approach as implemented in MOA), which includes the drift detection method.
Hoeffding tree is used as the base learner for all ensembles, and the size of each ensemble was 10.
For the artificial datasets, the accuracy presented in the next section is reset at time step N + 1, when the drift starts to happen. This was done to allow evaluation of the behaviour of the approach from the moment the drift starts to occur. For the real datasets, the accuracy is never reset.
For plotting the graphs of accuracy versus number of instances in order to compare
the recovery and accuracy of the system at various time steps of the proposed approaches
with existing ones, we used GNUplot version 4.2, which is a command-driven interactive
function plotting program.
6.4 User Interface for GPS
We implemented the proposed approaches, GPSGradual and GPSAbrupt, in MOA and developed a user interface for them. The interface offers various options: base learner, type of drift, window size and Post Warning level. MOA provides classes for developing a GUI, which greatly reduces the development workload. The base class for the approach is GPS, which instantiates a GPSGradual or GPSAbrupt object depending on the type of drift selected by the user. Fig. 6.1 shows the user interface for the GPS
algorithms.
Figure 6.1: User Interface for GPS in MOA
6.5 Determination of Parameters
6.5.1 Size of Instance Window
We experimented with window sizes ranging from 10 to 100 in steps of 10 and compared them for each dataset. It was observed that a window size of 50 gives the best results across all the datasets, so we set the optimal window size to 50 (Table 6.2).
6.5.2 Ratio value for PostWarning Level
The Post Warning level is a new level added to EDDM between the warning level and the drift level. The purpose of adding this level is to determine a time during the drift period when the instance window predominantly contains instances of the new concept. In EDDM,
Table 6.2: AVG. ACCURACIES FOR DIFFERENT INSTANCE WINDOW SIZES. DATASET-SIZE : 50,000

DATASETS   10     20     30     40     50     60     70     80     90     100
ABRUPT
circle     0.906  0.909  0.908  0.903  0.91   0.906  0.901  0.901  0.904  0.896
sineV      0.966  0.964  0.968  0.964  0.969  0.968  0.968  0.962  0.967  0.965
sineH      0.796  0.802  0.79   0.808  0.788  0.78   0.807  0.76   0.77   0.775
line       0.984  0.984  0.984  0.983  0.985  0.983  0.985  0.985  0.984  0.984
plane      0.82   0.821  0.817  0.826  0.828  0.819  0.827  0.823  0.826  0.822
GRADUAL : drifting_time = 0.25N
circle     0.866  0.865  0.866  0.865  0.87   0.869  0.866  0.861  0.854  0.852
sineV      0.916  0.915  0.905  0.914  0.919  0.91   0.905  0.909  0.917  0.916
sineH      0.819  0.817  0.816  0.813  0.823  0.825  0.818  0.824  0.81   0.824
line       0.93   0.935  0.927  0.932  0.937  0.924  0.932  0.932  0.918  0.932
plane      0.838  0.825  0.825  0.839  0.832  0.829  0.831  0.834  0.835  0.834
GRADUAL : drifting_time = 0.50N
circle     0.852  0.852  0.854  0.856  0.84   0.86   0.85   0.862  0.852  0.847
sineV      0.893  0.89   0.888  0.89   0.894  0.886  0.889  0.886  0.888  0.886
sineH      0.813  0.813  0.813  0.808  0.821  0.816  0.815  0.821  0.812  0.815
line       0.91   0.911  0.91   0.908  0.915  0.913  0.909  0.915  0.892  0.905
plane      0.827  0.817  0.803  0.826  0.82   0.816  0.82   0.824  0.824  0.824
the ratio (p_i + 2s_i)/(p_max + 2s_max) is checked against the predefined values α (0.95) and β (0.90) to detect the warning and drift levels respectively. We therefore needed to define a new threshold between α and β for detecting the Post Warning level, and experimented with values between the warning and drift thresholds of 0.95 and 0.90: 0.91, 0.92, 0.925, 0.93 and 0.94 (Table 6.3). The value 0.925 gave the best results for the training of the new ensemble, so the Post Warning level has been set to 0.925.
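One way the threshold scheme could be coded is sketched below. This is an assumption-laden illustration, not EDDM's implementation: the class, enum, and the exact boundary handling are ours; only the thresholds 0.95, 0.925 and 0.90 come from the text above.

```java
// Sketch: mapping the EDDM ratio (p_i + 2s_i)/(p_max + 2s_max) to levels,
// with the proposed Post Warning threshold of 0.925 inserted between
// EDDM's alpha = 0.95 (warning) and beta = 0.90 (drift).
class EddmLevels {
    enum Level { STABLE, WARNING, POST_WARNING, DRIFT }

    static Level level(double ratio) {
        if (ratio < 0.90)  return Level.DRIFT;
        if (ratio < 0.925) return Level.POST_WARNING;
        if (ratio < 0.95)  return Level.WARNING;
        return Level.STABLE;
    }

    public static void main(String[] args) {
        System.out.println(level(0.93)); // below 0.95 but above 0.925 -> WARNING
    }
}
```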
Table 6.3: AVG. ACCURACIES FOR DIFFERENT VALUES OF POST WARNING LEVEL. DATASET-SIZE : 50,000

DATASETS   0.91   0.92   0.925  0.93   0.94
GRADUAL : drifting_time = 0.25N
circle     0.865  0.870  0.870  0.869  0.834
sineV      0.908  0.906  0.916  0.915  0.914
sineH      0.807  0.787  0.823  0.816  0.82
line       0.933  0.932  0.930  0.931  0.932
plane      0.829  0.826  0.832  0.83   0.826
GRADUAL : drifting_time = 0.50N
circle     0.852  0.853  0.840  0.851  0.830
sineV      0.891  0.889  0.895  0.893  0.892
sineH      0.801  0.786  0.821  0.815  0.812
line       0.912  0.91   0.911  0.913  0.902
plane      0.822  0.803  0.824  0.822  0.816
Chapter 7
Results and Analysis
We compared the proposed approaches with different online learning algorithms: the EDDM approach with OzaBoost, OzaBag and OCBoost ensembles, and OzaBagADWIN. The base learner for the ensembles of all approaches was the Hoeffding Tree, and the size of each ensemble was kept at 10.
For the GPSGradual algorithm, we set the Post Warning level to 0.925 and the window size to 50, as mentioned previously. We performed comparisons for all the datasets with single as well as multiple gradual drifts. Also, the drifting_time (speed of the drift) was varied between 0.25N and 0.5N.
For the GPSAbrupt algorithm, we set the window size to 50 and performed comparisons for all the datasets with single and multiple abrupt drifts.
Moreover, to test the sensitivity of the algorithms, we experimented on driftless as well as noisy datasets. As mentioned previously, we have also used datasets having drifts of different severities.
7.1 Accuracy
It was observed that the algorithms GPSGradual and GPSAbrupt performed better in terms of accuracy compared to the other existing standard algorithms. The results for all the datasets used are tabulated below (Table 7.1 to Table 7.13). To show the improvement in the recovery time of the system, we have given comparison graphs for some of the datasets (Fig. 7.1 to Fig. 7.16).
7.1.1 Tables of Results
We have tabulated the average (overall) classification accuracy for all the different datasets described previously; the average accuracy gives a picture of the overall performance of the algorithms on a given dataset. The tables below are arranged according to the severity of the drifts: high, medium and low. For a given severity, tables are given for the different dataset sizes: 50000 (1 drift), 2000 (1 drift), 100000 (1 drift) and 100000 (3 drifts).
To analyze the performance of GPSAbrupt and GPSGradual, each table is divided into two parts, ABRUPT and GRADUAL; the GRADUAL part is further divided according to the speed of the drift, drifting_time = 0.25N and drifting_time = 0.50N. The table rows correspond to the different datasets and the columns to the different algorithms used for comparison, so each cell gives the average classification accuracy of an algorithm on a given dataset.
Table 7.13 contains the results for the artificial datasets SEA, Waveform and Hyperplane as well as for the real datasets spam corpus, forest cover (Covtype), electricity (ELEC2) and usenet. SEA, Hyperplane and Usenet contain abrupt drift, while Waveform and Spam Corpus contain gradual drift.
Table 7.1: AVG. ACCURACIES. DATASET-SIZE : 50000, NO. OF DRIFTS : 1, SEVERITY : HIGH

ABRUPT
Dataset   EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSAbrupt
circle    0.847          0.816         0.813        0.796        0.900
sineV     0.958          0.915         0.949        0.927        0.963
sineH     0.709          0.658         0.666        0.686        0.788
line      0.971          0.961         0.949        0.919        0.984
plane     0.845          0.787         0.839        0.748        0.821

GRADUAL
Dataset   EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSGradual
Gradual : drifting_time = 0.25N
circle    0.816          0.791         0.822        0.844        0.870
sineV     0.894          0.869         0.897        0.910        0.916
sineH     0.757          0.656         0.724        0.739        0.823
line      0.906          0.896         0.912        0.923        0.930
plane     0.826          0.760         0.818        0.830        0.832
Gradual : drifting_time = 0.50N
circle    0.797          0.779         0.819        0.849        0.840
sineV     0.866          0.842         0.882        0.891        0.894
sineH     0.745          0.653         0.745        0.742        0.821
line      0.885          0.857         0.888        0.908        0.911
plane     0.814          0.745         0.807        0.821        0.820
Table 7.2: AVG. ACCURACIES. DATASET-SIZE : 2000, NO. OF DRIFTS : 1, SEVERITY : HIGH

ABRUPT
Dataset   EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSAbrupt
circle    0.813          0.809         0.826        0.561        0.796
sineV     0.904          0.875         0.896        0.614        0.935
sineH     0.532          0.537         0.605        0.606        0.607
line      0.872          0.837         0.880        0.719        0.924
plane     0.765          0.784         0.782        0.681        0.751

GRADUAL
Dataset   EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSGradual
Gradual : drifting_time = 0.25N
circle    0.782          0.769         0.793        0.727        0.795
sineV     0.833          0.809         0.839        0.814        0.856
sineH     0.531          0.526         0.606        0.605        0.593
line      0.831          0.805         0.848        0.833        0.863
plane     0.726          0.740         0.754        0.708        0.767
Gradual : drifting_time = 0.50N
circle    0.788          0.764         0.792        0.737        0.794
sineV     0.808          0.781         0.820        0.823        0.844
sineH     0.587          0.529         0.610        0.621        0.610
line      0.803          0.780         0.818        0.839        0.857
plane     0.704          0.691         0.722        0.730        0.750
Table 7.3: AVG. ACCURACIES. DATASET-SIZE : 100000, NO. OF DRIFTS : 1, SEVERITY : HIGH

ABRUPT
Dataset   EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSAbrupt
circle    0.882          0.828         0.826        0.832        0.925
sineV     0.971          0.928         0.955        0.934        0.972
sineH     0.784          0.757         0.744        0.764        0.856
line      0.978          0.967         0.955        0.933        0.987
plane     0.848          0.843         0.848        0.803        0.851

GRADUAL
Dataset   EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSGradual
Gradual : drifting_time = 0.25N
circle    0.854          0.806         0.827        0.835        0.895
sineV     0.922          0.898         0.914        0.917        0.936
sineH     0.818          0.724         0.768        0.785        0.870
line      0.929          0.919         0.930        0.932        0.946
plane     0.836          0.801         0.826        0.839        0.838
Gradual : drifting_time = 0.50N
circle    0.825          0.792         0.818        0.829        0.855
sineV     0.889          0.867         0.895        0.893        0.909
sineH     0.781          0.678         0.779        0.770        0.857
line      0.904          0.878         0.904        0.912        0.922
plane     0.821          0.774         0.804        0.823        0.831
Table 7.4: AVG. ACCURACIES. DATASET-SIZE : 100000, NO. OF DRIFTS : 3, SEVERITY : HIGH

ABRUPT
Dataset   EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSAbrupt
circle    0.844          0.807         0.821        0.834        0.888
sineV     0.942          0.930         0.936        0.888        0.963
sineH     0.724          0.667         0.672        0.671        0.781
line      0.933          0.926         0.927        0.914        0.978
plane     0.808          0.802         0.801        0.748        0.795

GRADUAL
Dataset   EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSGradual
Gradual : drifting_time = 0.25N
circle    0.847          0.822         0.831        0.847        0.870
sineV     0.888          0.862         0.891        0.897        0.912
sineH     0.780          0.683         0.722        0.730        0.807
line      0.908          0.891         0.906        0.912        0.924
plane     0.819          0.783         0.808        0.814        0.818
Gradual : drifting_time = 0.50N
circle    0.833          0.814         0.828        0.837        0.862
sineV     0.863          0.849         0.866        0.878        0.889
sineH     0.757          0.678         0.735        0.736        0.810
line      0.879          0.887         0.876        0.894        0.898
plane     0.796          0.751         0.784        0.803        0.807
Table 7.5: AVG. ACCURACIES. DATASET-SIZE : 50000, NO. OF DRIFTS : 1, SEVERITY : MEDIUM

ABRUPT
Dataset   EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSAbrupt
circle    0.762          0.710         0.767        0.822        0.861
sineV     0.929          0.899         0.939        0.929        0.962
sineH     0.674          0.668         0.680        0.727        0.766
line      0.934          0.896         0.905        0.924        0.979
plane     0.849          0.822         0.843        0.785        0.845

GRADUAL
Dataset   EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSGradual
Gradual : drifting_time = 0.25N
circle    0.787          0.753         0.795        0.843        0.845
sineV     0.922          0.861         0.900        0.923        0.938
sineH     0.736          0.676         0.731        0.749        0.805
line      0.922          0.881         0.906        0.934        0.942
plane     0.837          0.819         0.840        0.846        0.856
Gradual : drifting_time = 0.50N
circle    0.802          0.782         0.826        0.866        0.860
sineV     0.902          0.864         0.882        0.918        0.922
sineH     0.760          0.679         0.752        0.762        0.831
line      0.873          0.861         0.888        0.925        0.931
plane     0.842          0.803         0.835        0.841        0.854
Table 7.6: AVG. ACCURACIES. DATASET-SIZE: 2000, NO. OF DRIFTS: 1, SEVERITY: MEDIUM

ABRUPT
Dataset | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPSAbrupt
circle  | 0.700 | 0.690 | 0.696 | 0.710 | 0.705
sineV   | 0.724 | 0.752 | 0.723 | 0.755 | 0.873
sineH   | 0.519 | 0.518 | 0.611 | 0.609 | 0.617
line    | 0.713 | 0.663 | 0.774 | 0.751 | 0.846
plane   | 0.655 | 0.636 | 0.740 | 0.672 | 0.706

GRADUAL
Dataset | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPSGradual
Gradual: drifting time = 0.25N
circle  | 0.766 | 0.753 | 0.771 | 0.778 | 0.779
sineV   | 0.759 | 0.662 | 0.744 | 0.810 | 0.845
sineH   | 0.540 | 0.523 | 0.605 | 0.604 | 0.595
line    | 0.739 | 0.642 | 0.670 | 0.788 | 0.841
plane   | 0.698 | 0.636 | 0.741 | 0.724 | 0.747
Gradual: drifting time = 0.50N
circle  | 0.796 | 0.771 | 0.794 | 0.804 | 0.796
sineV   | 0.780 | 0.663 | 0.775 | 0.833 | 0.844
sineH   | 0.546 | 0.539 | 0.614 | 0.613 | 0.614
line    | 0.733 | 0.592 | 0.659 | 0.811 | 0.839
plane   | 0.714 | 0.625 | 0.698 | 0.750 | 0.758
Table 7.7: AVG. ACCURACIES. DATASET-SIZE: 100000, NO. OF DRIFTS: 1, SEVERITY: MEDIUM

ABRUPT
Dataset | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPSAbrupt
circle  | 0.821 | 0.748 | 0.819 | 0.849 | 0.899
sineV   | 0.948 | 0.924 | 0.948 | 0.934 | 0.973
sineH   | 0.718 | 0.756 | 0.712 | 0.796 | 0.820
line    | 0.955 | 0.931 | 0.932 | 0.939 | 0.984
plane   | 0.858 | 0.844 | 0.838 | 0.814 | 0.857

GRADUAL
Dataset | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPSGradual
Gradual: drifting time = 0.25N
circle  | 0.824 | 0.765 | 0.807 | 0.843 | 0.866
sineV   | 0.940 | 0.894 | 0.920 | 0.927 | 0.951
sineH   | 0.747 | 0.751 | 0.740 | 0.795 | 0.839
line    | 0.944 | 0.919 | 0.934 | 0.941 | 0.956
plane   | 0.854 | 0.834 | 0.829 | 0.849 | 0.860
Gradual: drifting time = 0.50N
circle  | 0.826 | 0.781 | 0.828 | 0.855 | 0.871
sineV   | 0.916 | 0.891 | 0.900 | 0.917 | 0.933
sineH   | 0.774 | 0.755 | 0.758 | 0.788 | 0.862
line    | 0.901 | 0.898 | 0.914 | 0.928 | 0.941
plane   | 0.843 | 0.820 | 0.823 | 0.837 | 0.853
Table 7.8: AVG. ACCURACIES. DATASET-SIZE: 100000, NO. OF DRIFTS: 3, SEVERITY: MEDIUM

ABRUPT
Dataset | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPSAbrupt
circle  | 0.775 | 0.766 | 0.753 | 0.804 | 0.872
sineV   | 0.949 | 0.918 | 0.935 | 0.893 | 0.956
sineH   | 0.720 | 0.712 | 0.704 | 0.696 | 0.786
line    | 0.939 | 0.920 | 0.920 | 0.914 | 0.975
plane   | 0.854 | 0.820 | 0.815 | 0.788 | 0.834

GRADUAL
Dataset | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPSGradual
Gradual: drifting time = 0.25N
circle  | 0.838 | 0.782 | 0.817 | 0.842 | 0.873
sineV   | 0.890 | 0.905 | 0.922 | 0.905 | 0.933
sineH   | 0.775 | 0.752 | 0.748 | 0.743 | 0.820
line    | 0.897 | 0.904 | 0.926 | 0.911 | 0.945
plane   | 0.846 | 0.827 | 0.828 | 0.827 | 0.847
Gradual: drifting time = 0.50N
circle  | 0.832 | 0.808 | 0.815 | 0.854 | 0.875
sineV   | 0.917 | 0.892 | 0.912 | 0.910 | 0.937
sineH   | 0.777 | 0.758 | 0.759 | 0.767 | 0.825
line    | 0.906 | 0.878 | 0.911 | 0.912 | 0.928
plane   | 0.839 | 0.828 | 0.822 | 0.827 | 0.841
Table 7.9: AVG. ACCURACIES. DATASET-SIZE: 50000, NO. OF DRIFTS: 1, SEVERITY: LOW

ABRUPT
Dataset | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPSAbrupt
circle  | 0.835 | 0.773 | 0.822 | 0.882 | 0.879
sineV   | 0.902 | 0.910 | 0.929 | 0.934 | 0.970
sineH   | 0.754 | 0.696 | 0.713 | 0.778 | 0.862
line    | 0.923 | 0.896 | 0.933 | 0.938 | 0.976
plane   | 0.853 | 0.845 | 0.837 | 0.851 | 0.858

GRADUAL
Dataset | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPSGradual
Gradual: drifting time = 0.25N
circle  | 0.824 | 0.816 | 0.864 | 0.898 | 0.908
sineV   | 0.912 | 0.902 | 0.927 | 0.950 | 0.964
sineH   | 0.829 | 0.698 | 0.757 | 0.792 | 0.846
line    | 0.903 | 0.899 | 0.934 | 0.952 | 0.967
plane   | 0.858 | 0.836 | 0.841 | 0.862 | 0.871
Gradual: drifting time = 0.50N
circle  | 0.841 | 0.833 | 0.879 | 0.908 | 0.899
sineV   | 0.895 | 0.897 | 0.922 | 0.948 | 0.959
sineH   | 0.830 | 0.693 | 0.777 | 0.799 | 0.851
line    | 0.908 | 0.914 | 0.930 | 0.952 | 0.956
plane   | 0.860 | 0.834 | 0.843 | 0.866 | 0.875
Table 7.10: AVG. ACCURACIES. DATASET-SIZE: 2000, NO. OF DRIFTS: 1, SEVERITY: LOW

ABRUPT
Dataset | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPSAbrupt
circle  | 0.796 | 0.780 | 0.804 | 0.814 | 0.823
sineV   | 0.670 | 0.675 | 0.618 | 0.848 | 0.850
sineH   | 0.543 | 0.514 | 0.616 | 0.614 | 0.612
line    | 0.591 | 0.713 | 0.746 | 0.831 | 0.766
plane   | 0.614 | 0.667 | 0.659 | 0.752 | 0.767

GRADUAL
Dataset | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPSGradual
Gradual: drifting time = 0.25N
circle  | 0.831 | 0.817 | 0.834 | 0.841 | 0.834
sineV   | 0.523 | 0.540 | 0.771 | 0.864 | 0.909
sineH   | 0.551 | 0.547 | 0.625 | 0.625 | 0.635
line    | 0.574 | 0.722 | 0.704 | 0.840 | 0.841
plane   | 0.557 | 0.682 | 0.670 | 0.766 | 0.658
Gradual: drifting time = 0.50N
circle  | 0.844 | 0.830 | 0.844 | 0.853 | 0.858
sineV   | 0.527 | 0.541 | 0.773 | 0.869 | 0.873
sineH   | 0.564 | 0.543 | 0.618 | 0.617 | 0.609
line    | 0.521 | 0.733 | 0.728 | 0.846 | 0.860
plane   | 0.549 | 0.694 | 0.682 | 0.778 | 0.684
Table 7.11: AVG. ACCURACIES. DATASET-SIZE: 100000, NO. OF DRIFTS: 1, SEVERITY: LOW

ABRUPT
Dataset | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPSAbrupt
circle  | 0.882 | 0.797 | 0.850 | 0.899 | 0.908
sineV   | 0.933 | 0.935 | 0.944 | 0.935 | 0.976
sineH   | 0.819 | 0.780 | 0.755 | 0.835 | 0.912
line    | 0.954 | 0.928 | 0.953 | 0.945 | 0.982
plane   | 0.859 | 0.850 | 0.868 | 0.850 | 0.871

GRADUAL
Dataset | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPSGradual
Gradual: drifting time = 0.25N
circle  | 0.850 | 0.840 | 0.880 | 0.902 | 0.926
sineV   | 0.939 | 0.926 | 0.941 | 0.947 | 0.970
sineH   | 0.888 | 0.780 | 0.779 | 0.836 | 0.889
line    | 0.937 | 0.927 | 0.953 | 0.954 | 0.975
plane   | 0.854 | 0.847 | 0.873 | 0.861 | 0.872
Gradual: drifting time = 0.50N
circle  | 0.868 | 0.860 | 0.879 | 0.905 | 0.903
sineV   | 0.921 | 0.919 | 0.936 | 0.944 | 0.962
sineH   | 0.885 | 0.774 | 0.777 | 0.838 | 0.877
line    | 0.939 | 0.934 | 0.945 | 0.951 | 0.964
plane   | 0.859 | 0.844 | 0.871 | 0.864 | 0.875
Table 7.12: AVG. ACCURACIES. DATASET-SIZE: 100000, NO. OF DRIFTS: 3, SEVERITY: LOW

ABRUPT
Dataset | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPSAbrupt
circle  | 0.882 | 0.797 | 0.850 | 0.899 | 0.908
sineV   | 0.933 | 0.935 | 0.944 | 0.935 | 0.976
sineH   | 0.819 | 0.780 | 0.755 | 0.835 | 0.912
line    | 0.954 | 0.928 | 0.953 | 0.945 | 0.982
plane   | 0.859 | 0.850 | 0.868 | 0.850 | 0.871

GRADUAL
Dataset | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPSGradual
Gradual: drifting time = 0.25N
circle  | 0.850 | 0.840 | 0.880 | 0.902 | 0.926
sineV   | 0.939 | 0.926 | 0.941 | 0.947 | 0.970
sineH   | 0.888 | 0.780 | 0.779 | 0.836 | 0.889
line    | 0.937 | 0.927 | 0.953 | 0.954 | 0.975
plane   | 0.854 | 0.847 | 0.873 | 0.861 | 0.872
Gradual: drifting time = 0.50N
circle  | 0.868 | 0.860 | 0.879 | 0.905 | 0.903
sineV   | 0.921 | 0.919 | 0.936 | 0.944 | 0.962
sineH   | 0.885 | 0.774 | 0.777 | 0.838 | 0.877
line    | 0.939 | 0.934 | 0.945 | 0.951 | 0.964
plane   | 0.859 | 0.844 | 0.871 | 0.864 | 0.875
Table 7.13: AVG. ACCURACIES. MISCELLANEOUS DATASETS

OTHER ARTIFICIAL DATASETS
Dataset                   | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPS
SEA (100k, 3 Drifts)      | 0.796 | 0.717 | 0.794 | 0.798 | 0.829
Waveform (100k, 3 Drifts) | 0.522 | 0.341 | 0.555 | 0.544 | 0.577
Waveform (150k, 0 Drift)  | 0.628 | 0.380 | 0.649 | 0.647 | 0.666
Hyperplane (50k, 1 Drift) | 0.713 | 0.714 | 0.698 | 0.712 | 0.773

REAL DATASETS
Dataset     | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPS
Spam Corpus | 0.777 | 0.822 | 0.868 | 0.814 | 0.888
Covertype   | 0.666 | 0.711 | 0.725 | 0.667 | 0.733
ELEC2       | 0.654 | 0.753 | 0.647 | 0.795 | 0.755
Usenet      | 0.489 | 0.487 | 0.483 | 0.494 | 0.501
7.1.2
Graphs of Results
In addition to the tables of results, we also provide graphs of instantaneous accuracy
versus the number of instances for some of the datasets. These graphs help to visualize
the behaviour of the algorithms during the drifting period as well as the recovery of
the classification system. To analyze the recovery of the algorithms after a drift occurs,
we explicitly set the instantaneous accuracy to zero at the actual drift point. Some
of the plotted graphs are given below. Due to space limitations, all 180 graphs (one
corresponding to each dataset) could not be included. Hence, to cover the various possible
cases, graphs of different severities, dataset sizes, and types and numbers of drifts are given.
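The instantaneous-accuracy curves in these graphs can be reproduced with a short sliding-window sketch. The function below is illustrative only: the window size and the drift positions are assumptions of this sketch, not the settings used in the experiments.

```python
def instantaneous_accuracy(correct_flags, drift_points, window=100):
    """Sliding-window accuracy over a 0/1 correctness stream.

    The accuracy is forced to zero at each known drift point so that
    the recovery of the classifier after the drift is visible when
    the curve is plotted.
    """
    drift_points = set(drift_points)
    accs = []
    for i in range(len(correct_flags)):
        lo = max(0, i - window + 1)
        recent = correct_flags[lo:i + 1]
        acc = sum(recent) / len(recent)
        accs.append(0.0 if i in drift_points else acc)
    return accs
```

Plotting `accs` against the instance index gives curves of the shape shown in Figures 7.1 to 7.16.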
Figure 7.1: Dataset: Circle, Size: 50000, Drift: Abrupt, Severity: High
Figure 7.2: Dataset: SineH, Size: 50000, Drift: Gradual(0.25N ), Severity: High
Figure 7.3: Dataset: Plane, Size: 50000, Drift: Gradual(0.50N ), Severity: High
Figure 7.4: Dataset: Line, Size: 2000, Drift: Abrupt, Severity: Medium
Figure 7.5: Dataset: SineV, Size: 2000, Drift: Gradual(0.25N ), Severity: Medium
Figure 7.6: Dataset: Plane, Size: 2000, Drift: Gradual(0.50N ), Severity: Medium
Figure 7.7: Dataset: SineH, Size: 100000, Drift: Abrupt, Severity: Low, No. of Drifts: 1
Figure 7.8: Dataset: Circle, Size: 100000, Drift: Gradual(0.25N ), Severity: Low, No. of Drifts: 1
Figure 7.9: Dataset: SineV, Size: 100000, Drift: Gradual(0.50N ), Severity: Low, No. of Drifts: 1
Figure 7.10: Dataset: SineV, Size: 100000, Drift: Abrupt, Severity: High, No. of Drifts: 3
Figure 7.11: Dataset: Circle, Size: 100000, Drift: Gradual(0.25N ), Severity: High, No. of Drifts: 3
Figure 7.12: Dataset: SineH, Size: 100000, Drift: Gradual(0.50N ), Severity: High, No. of Drifts: 3
Figure 7.13: Dataset: Hyperplane, Size: 50000, No. of Drifts : 1
Figure 7.14: Dataset: Waveform, Size: 150000, No. of Drifts : 0
Figure 7.15: Dataset: Spam Corpus
Figure 7.16: Dataset: Forest Cover
7.2
Noise Sensitivity
The proposed algorithms work well for noiseless as well as noisy data streams.
Class noise is defined as the percentage of the total instances whose actual class labels
have been changed. Different amounts of class noise were added to the plane dataset to
test the sensitivity of the algorithms. It can be seen from Table 7.14 that the algorithms
perform well for noise levels up to 20%. However, their performance degrades for higher
percentages of noise in the data. This is because more noisy instances are present in the
instance window and are unnecessarily used for training during the drifting period.
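Class noise as defined above can be injected with a short helper. The function below is an illustrative sketch, not the generator used in the experiments; the seed parameter is only there to make the sketch reproducible.

```python
import random

def add_class_noise(labels, classes, noise_pct, seed=0):
    """Flip the labels of noise_pct percent of the instances.

    Each selected instance receives a class label drawn uniformly
    from the *other* classes, matching the definition of class noise
    given above.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = int(len(labels) * noise_pct / 100)
    for i in rng.sample(range(len(labels)), n_flip):
        noisy[i] = rng.choice([c for c in classes if c != noisy[i]])
    return noisy
```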
Table 7.14: AVG. ACCURACIES FOR DIFFERENT NOISE LEVELS. DATASET: PLANE

ABRUPT
Noise (%) | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPSAbrupt
0  | 0.965 | 0.938 | 0.953 | 0.940 | 0.984
5  | 0.875 | 0.844 | 0.882 | 0.775 | 0.885
10 | 0.845 | 0.787 | 0.839 | 0.748 | 0.821
15 | 0.780 | 0.748 | 0.782 | 0.688 | 0.792
20 | 0.737 | 0.720 | 0.739 | 0.697 | 0.718
25 | 0.694 | 0.688 | 0.706 | 0.695 | 0.677
30 | 0.650 | 0.632 | 0.660 | 0.664 | 0.647

GRADUAL
Noise (%) | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPSGradual
Gradual: drifting time = 0.25N
0  | 0.883 | 0.857 | 0.868 | 0.922 | 0.929
5  | 0.831 | 0.797 | 0.854 | 0.872 | 0.880
10 | 0.826 | 0.760 | 0.818 | 0.830 | 0.832
15 | 0.771 | 0.729 | 0.775 | 0.789 | 0.794
20 | 0.734 | 0.714 | 0.734 | 0.744 | 0.735
25 | 0.702 | 0.688 | 0.714 | 0.708 | 0.704
30 | 0.666 | 0.646 | 0.676 | 0.676 | 0.664
Gradual: drifting time = 0.50N
0  | 0.855 | 0.834 | 0.855 | 0.910 | 0.904
5  | 0.817 | 0.776 | 0.828 | 0.850 | 0.857
10 | 0.814 | 0.745 | 0.807 | 0.817 | 0.820
15 | 0.760 | 0.716 | 0.764 | 0.783 | 0.787
20 | 0.724 | 0.697 | 0.731 | 0.743 | 0.745
25 | 0.696 | 0.674 | 0.703 | 0.709 | 0.708
30 | 0.664 | 0.648 | 0.667 | 0.677 | 0.676
7.3
Memory and Time Bounds
The Interleaved Test-Then-Train evaluation model [8] has been used, as mentioned previously.
In this model, each example is used to test the classifier before it is used for training,
and the classification statistics are updated incrementally. Hence, the GPS algorithms are
single-pass and incremental. As each example is processed only once, as soon as it arrives,
the algorithms can ideally process an infinite stream of data.
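The evaluation loop just described can be sketched in a few lines. The `predict`/`learn` interface assumed for `model` is an illustration, not the MOA API.

```python
def prequential_accuracy(model, stream):
    """Interleaved Test-Then-Train: each example is first used to
    test the model and only then to train it, so the stream is
    processed in a single pass."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:   # test first ...
            correct += 1
        total += 1
        model.learn(x, y)           # ... then train
    return correct / total if total else 0.0
```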
Compared to the existing standard algorithms, the proposed algorithms use more memory,
as shown in Table 7.16. The reason is that, while all the other methods
(viz. EDDM-OzaBoost, EDDM-OzaBag, EDDM-OCBoost, OzaBagAdwin) maintain
only one ensemble during processing, the proposed algorithms (GPSGradual
and GPSAbrupt) maintain two ensembles as well as an additional instance window
throughout the classification process.
The time taken by these algorithms is also greater than that of the existing ones (Table 7.15),
because newEnsemble is trained from the very beginning of the data stream and, during the
drifting period, it is additionally trained on the instance window. The time requirements of
GPSGradual and GPSAbrupt can be substantially reduced by training the two ensembles
(currentEnsemble and newEnsemble) on parallel processors.
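The suggested parallelisation could be sketched as below. The `learn(x, y)` method is an assumed interface; in CPython a real speed-up would additionally require learners that release the GIL, or a process pool instead of threads.

```python
from concurrent.futures import ThreadPoolExecutor

def train_both(current_ensemble, new_ensemble, x, y):
    """Train currentEnsemble and newEnsemble on the same example in
    parallel, instead of one after the other."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(e.learn, x, y)
                   for e in (current_ensemble, new_ensemble)]
        for f in futures:
            f.result()  # propagate any training error
```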
Table 7.15: PROCESSING TIME (IN SECONDS). DATASET-SIZE: 50000

ABRUPT
Dataset | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPSAbrupt
circle  | 1.26 | 4.44  | 5.38  | 8.14  | 11.40
sineV   | 1.24 | 7.94  | 4.48  | 6.06  | 8.30
sineH   | 1.14 | 8.96  | 8.08  | 10.68 | 12.52
line    | 1.20 | 7.44  | 5.88  | 8.42  | 8.60
plane   | 1.94 | 16.36 | 19.16 | 22.44 | 26.66

GRADUAL
Dataset | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPSGradual
Gradual: drifting time = 0.25N
circle  | 1.24 | 4.98  | 5.34  | 6.06  | 11.36
sineV   | 1.32 | 8.44  | 5.44  | 5.60  | 12.60
sineH   | 1.24 | 8.66  | 7.70  | 10.78 | 11.68
line    | 1.18 | 7.66  | 6.68  | 7.56  | 8.62
plane   | 1.92 | 13.52 | 19.88 | 22.18 | 23.44
Gradual: drifting time = 0.50N
circle  | 1.24 | 5.66  | 5.38  | 6.36  | 10.50
sineV   | 1.32 | 8.54  | 4.94  | 5.58  | 11.50
sineH   | 1.20 | 8.94  | 10.44 | 12.14 | 10.68
line    | 1.20 | 7.74  | 6.96  | 5.56  | 8.42
plane   | 2.04 | 11.56 | 18.46 | 22.62 | 18.94
Table 7.16: MEMORY (IN BYTES). DATASET-SIZE: 50000

ABRUPT
Dataset | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPSAbrupt
circle  | 1408 | 14464  | 141232 | 147184 | 558856
sineV   | 1408 | 124384 | 79392  | 75240  | 393656
sineH   | 1408 | 161120 | 264928 | 187272 | 596232
line    | 1408 | 100096 | 80016  | 94984  | 329384
plane   | 5184 | 332160 | 507728 | 525048 | 1226424

GRADUAL
Dataset | EDDM-OzaBoost | EDDM-OCBoost | EDDM-OzaBag | OzaBagADWIN | GPSGradual
Gradual: drifting time = 0.25N
circle  | 1408 | 147952 | 123888 | 64944  | 452416
sineV   | 1408 | 109344 | 132960 | 70160  | 470656
sineH   | 1408 | 93504  | 252528 | 122104 | 606208
line    | 1408 | 92784  | 141264 | 73960  | 294576
plane   | 5184 | 399552 | 523280 | 254888 | 1245232
Gradual: drifting time = 0.50N
circle  | 1408 | 111248 | 107568 | 55656  | 314128
sineV   | 1408 | 121376 | 129232 | 67344  | 344032
sineH   | 1408 | 67456  | 264272 | 113800 | 475120
line    | 1408 | 108144 | 154864 | 70496  | 283360
plane   | 5184 | 337344 | 420240 | 218280 | 755344
Chapter 8
Conclusion
We studied various areas in the data mining domain, focusing specifically on handling
concept drift in online ensemble learning. It was found that little work has been done
to exploit the drift type in concept drift handling techniques. We therefore decided to
develop different approaches for abrupt and gradual concept drifts, depending on their
composition characteristics.
The existing EDDM-based drift handling approach was studied and improved upon to
develop two new approaches, GPSGradual and GPSAbrupt, to handle gradual and abrupt
drift data streams respectively. These classification approaches use zero-diversity and
instance-window techniques to improve the classification algorithm. Zero diversity lets
all the classifiers in the ensemble get trained on the new-concept instances during the
drifting period, hence adapting to the change quickly. The instance window stores early
instances of the new concept, which are used later during the drifting period for training
the newEnsemble. This instance-selection approach helps to improve the accuracy of the
classification system after the drift is detected. Because of the different composition
characteristics of the data stream during the drifting period, the instances selected for
training the newEnsemble differ between GPSGradual and GPSAbrupt. For the same reason,
the newEnsemble is trained on the instance window at different levels in the drifting
period for GPSGradual and GPSAbrupt.
For experimentation we considered various artificial as well as real datasets; thus, a
large number of datasets were covered for exhaustive testing of the proposed approaches.
The artificial datasets account for a wide variety of mathematical problems that can occur
in data streams. Experimental results show that the new approaches classify these datasets
more accurately than the existing approaches. The results also show that the new approaches
reduce the recovery time for drifts in data streams.
Chapter 9
Supplementary Work
9.1
Hospitalization Record Analyzer
As part of the summer work (May-Jun 2010), before the actual commencement of the project
work, we participated in the IEEE VAST 2010 Challenge [3] and submitted our solution
for Mini Challenge 2 : Hospitalization Records - Characterization of Pandemic
Spread.
The mini challenge required analyzing hospitalization records. The datasets provided
were the city-wise hospitalization records for a particular pandemic. The task was
to analyze these datasets and characterize the spread of the pandemic by taking into
account the symptoms of the disease, mortality rates, and the temporal patterns of the
onset, peak and recovery of the disease. A comparison of the outbreak of the pandemic
across cities was also required.
A visualization tool was to be developed for solving the above task, so we developed
a tool in Java titled "Hospitalization Record Analyzer". It gives the values of various
factors using filters of city and syndrome on the dataset, and also shows plots of the
processed data. These graphs are drawn using the open source graph plotting software
GNUplot (http://www.gnuplot.info/). The tool analyzes the preprocessed data in a variety
of ways, including city-wise analysis and overall analysis. A few screenshots of the tool
are given below (Fig. 9.1 to Fig. 9.3). Our entry received an average score of 17 out of
30 at the competition.
Figure 9.1: Hospitalization Record Analyzer : Main Window
Figure 9.2: Hospitalization Record Analyzer : Analysis example (Syndrome Distribution)
Figure 9.3: Hospitalization Record Analyzer : Analysis example (City-wise Dead-infected)
9.2
Research Paper Publication
Based on the research work carried out, we submitted a research paper titled "An
Instance-Window based Classification Algorithm for Handling Gradual Concept Drifts",
authored by Vahida Attar, Prashant Chaudhary, Sonali Rahagude,
Gaurish Chaudhari and Pradeep Sinha. This paper elaborates the approach for
handling gradual drift effectively, i.e. the GPSGradual approach.
The paper was submitted to the ADMI (Agents and Data Mining Interaction) workshop [2]
conducted at AAMAS 2011 (The Tenth International Conference on Autonomous Agents and
Multi-agent Systems) [1], held in Taipei, Taiwan.
The paper was accepted for publication in the following:
a) The ADMI-related Springer LNCS/LNAI volume.
b) The AAMAS USB proceedings.
Also, an extended version of the paper has been invited for publication in the JAAMAS
(Journal of Autonomous Agents and Multi-agent Systems) special issue on agent mining.
9.3
Approaches of Drift Type Detection
As mentioned previously, gradual and abrupt drift streams differ in their composition
characteristics. In the case of a gradual drift, the stream immediately after the drift
point consists of both old- and new-concept instances. On the other hand, for an abrupt
drift, the stream after the drift occurrence consists of new instances only. Thus, for an
abrupt drift, the classifier makes more classification errors immediately after the drift
point, since only new instances are present and the classifier is not yet trained on the
new concept. For a gradual drift, the errors are fewer, as there is a mixture of old and
new instances.
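The composition difference described above can be sketched with a toy generator. The linear mixing probability below is an assumption for illustration; the actual dataset generators may ramp between concepts differently.

```python
import random

def concept_at(t, drift_point, drifting_time, rng):
    """Return which concept ('old' or 'new') generates the instance
    at time step t.  With drifting_time == 0 the change is abrupt;
    with a positive drifting_time the probability of the new concept
    rises linearly, so old and new instances are mixed during the
    drifting period."""
    if t < drift_point:
        return 'old'
    if drifting_time == 0 or t >= drift_point + drifting_time:
        return 'new'
    p_new = (t - drift_point) / drifting_time
    return 'new' if rng.random() < p_new else 'old'
```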
9.3.1
Approach 1 : Using Standard Deviation Measure
The Early Drift Detection Method (EDDM) flags a warning level when the online error rate
of the classifier system crosses a certain bound, and subsequently detects the drift level.
In EDDM, the standard deviation s′_i is calculated on the distance between consecutive
errors. Based on the above discussion, we can infer that the values of the standard
deviation at the start of the warning and drift levels will differ between abrupt and
gradual drifts. This distinguishing factor could be used to detect the type of drift,
based on the difference between the standard deviations at the warning and drift levels.
We describe this approach below.
Let i be the first instance after the warning level and f be the final instance before
the drift level. We calculate δ = s′_f − s′_i. For an abrupt drift, δ will be small,
because the distance between errors is small as the number of errors is large. For a
gradual drift, δ will be large, because the distance between errors is larger as the
number of errors is smaller.
It was observed that the threshold on δ is dataset-dependent.
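A toy computation of the quantities involved could look as follows. EDDM maintains these statistics incrementally; the batch helpers below, and the error-position lists they take as input, are illustrative only.

```python
import math

def error_distance_std(error_positions):
    """Standard deviation s' of the distance between consecutive
    classification errors, the quantity EDDM tracks."""
    gaps = [b - a for a, b in zip(error_positions, error_positions[1:])]
    if len(gaps) < 2:
        return 0.0
    mean = sum(gaps) / len(gaps)
    return math.sqrt(sum((g - mean) ** 2 for g in gaps) / len(gaps))

def delta(warning_errors, drift_errors):
    """delta = s'_f - s'_i between the warning and drift levels:
    small for abrupt drifts (dense, regular errors), larger for
    gradual ones (sparser, more irregular errors)."""
    return error_distance_std(drift_errors) - error_distance_std(warning_errors)
```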
9.3.2
Approach 2 : Using Error Rate
In this approach we calculate the error rate between the warning level and the drift level
of EDDM. The error rate is calculated as the ratio of the number of errors (i.e. the number
of misclassified instances) to the total number of instances. As mentioned previously,
there are more errors in the case of an abrupt drift; hence the error rate is higher for
an abrupt drift and lower for a gradual drift. We define a threshold δ that differentiates
between abrupt and gradual drifts. For noisy datasets (class noise = 10%), the value of δ
we determined was 0.45.
if error_rate < δ then
    drift_type = gradual
else if error_rate > δ then
    drift_type = abrupt
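The rule above, written as a runnable sketch with the threshold δ = 0.45 determined for 10% class noise; the function name and interface are illustrative.

```python
def detect_drift_type(n_errors, n_instances, delta=0.45):
    """Classify the drift type from the error rate observed between
    the EDDM warning level and drift level."""
    error_rate = n_errors / n_instances
    return 'gradual' if error_rate < delta else 'abrupt'
```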
Drawback
The threshold δ depends on the amount of noise in the dataset: as the noise increases,
the number of errors also increases. Hence, different amounts of noise result in different
error rates for a given dataset.
9.3.3
Approach 3 : Generating Association Rules/Decision Trees for Drift
Type
EDDM maintains different parameters while detecting drift, viz. the mean of the distance
between errors (mean), its standard deviation (std), m2s (mean + 2 * std), etc. For
different types of drifts, these values differ due to the different composition
characteristics of the drifts. An approach can be devised to detect the type of drift by
creating decision trees or association rules from these parameters on processed data
streams where the type of drift is already known and is used as the class label. These
decision trees or association rules can then be used to determine the drift type of new
data streams with unknown drift types. Detecting the drift type is thus a binary
classification problem.
In order to implement the above approach, we generated data log files (containing the
parameters mentioned above) for various drift data streams. Batch-learning algorithms
from WEKA [5] were used to learn from these data log files, and association rules were
generated as shown in Fig. 9.4. The framework is shown in Fig. 9.5.
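The learning step could be sketched with a toy one-level decision tree, standing in for the WEKA learners (such as JRIP) actually used. The feature layout (mean, std, m2s) and the samples below are hypothetical.

```python
def best_stump(samples):
    """Fit a one-level decision tree (stump) on labelled EDDM
    parameter vectors.  Each sample is (features, label), where the
    features are e.g. (mean, std, m2s) logged at drift time and the
    label is 'abrupt' or 'gradual'."""
    best = None
    n_feat = len(samples[0][0])
    for f in range(n_feat):
        for thr in sorted({s[0][f] for s in samples}):
            for lo, hi in (('abrupt', 'gradual'), ('gradual', 'abrupt')):
                acc = sum((lo if s[0][f] <= thr else hi) == s[1]
                          for s in samples) / len(samples)
                if best is None or acc > best[0]:
                    best = (acc, f, thr, lo, hi)
    return best  # (training accuracy, feature, threshold, low label, high label)
```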
Figure 9.4: Example of Association Rules created using JRIP
Figure 9.5: Drift Detection Framework
9.4
ADWIN Integrated GPS Approach
The ADWIN drift detection method has several advantages over EDDM. ADWIN is a
windowing technique based on partial instance memory, whereas EDDM is based on the online
error rate of the classifier system and keeps no instances in memory. ADWIN detects fewer
false drifts than EDDM, and its performance on noisy data streams is also better. It is
therefore a better drift detection method. As our approach depends entirely on the drift
detection method used, using ADWIN instead of EDDM can help the system attain greater
accuracy. Hence, we tried replacing EDDM with ADWIN.
There were challenges in incorporating ADWIN into the proposed approach. The
proposed approach strictly requires a warning level and a drift level for training the
newEnsemble. However, ADWIN provides only a drift level and no warning level. Thus,
modifying ADWIN to determine a warning level, and a post-warning level for the GPSGradual
algorithm, was necessary. EDDM defines precise values for the warning and drift levels,
so deducing a post-warning level from these values was easy. ADWIN, however, does not
give any predefined values. As mentioned previously, ADWIN has a predefined parameter
ε to detect change. Hence, to introduce a warning level, we experimented with values close
to ε, viz. ε/2, 3ε/4, 5ε/6, 7ε/8 and 9ε/10. It was observed that the value 3ε/4
gave the best results for determining the warning level in ADWIN. Then, to introduce a
post-warning level in the GPSGradual algorithm, we experimented with values between 3ε/4
and ε, viz. 5ε/6, 7ε/8, 9ε/10 and 11ε/12. It was observed that the value 9ε/10 gave
the best results for determining the post-warning level. This modified ADWIN was
integrated into the proposed approaches, namely GPSGradual and GPSAbrupt.
Experiments reveal that using ADWIN as the drift detection method improves the
accuracy of the algorithms compared to EDDM (Table 9.1). Due to lack of time,
exhaustive testing of the ADWIN-integrated approaches could not be performed.
Table 9.1: AVERAGE ACCURACIES FOR GPS(EDDM) AND GPS(ADWIN). DATASET-SIZE: 50000

ABRUPT
ALGORITHM | Circle | SineH | SineV | Line | Plane
GPSAbrupt | 0.900 | 0.963 | 0.788 | 0.984 | 0.821
GPSADWIN  | 0.934 | 0.971 | 0.804 | 0.978 | 0.758

GRADUAL
ALGORITHM | Circle | SineH | SineV | Line | Plane
Gradual: drifting time = 0.25N
GPSGradual | 0.870 | 0.916 | 0.823 | 0.930 | 0.832
GPSADWIN   | 0.894 | 0.922 | 0.822 | 0.937 | 0.838
Gradual: drifting time = 0.50N
GPSGradual | 0.840 | 0.894 | 0.821 | 0.911 | 0.820
GPSADWIN   | 0.878 | 0.897 | 0.812 | 0.921 | 0.828
Chapter 10
Future Work
The integration of ADWIN with the GPS approaches gave promising results, so future
work may include rigorous testing towards estimating the warning level automatically in
the ADWIN drift detection method. This modified ADWIN algorithm can then replace the
EDDM method in the proposed approaches.
Currently, the values for the instance-window size and the post-warning level are
determined experimentally, so they remain fixed for any speed of drift in the data stream.
Future work may therefore also include determining the post-warning level and the window
size automatically according to the speed of the drift. This may improve the accuracy of
the classification system.
Future work may also include implementing the approaches on parallel processors in order
to speed up the system. Finally, a framework could be developed in which the type of drift
(abrupt or gradual) is first detected automatically and, based on this type, one of the
proposed approaches, GPSAbrupt or GPSGradual, is applied.
Bibliography
[1] AAMAS 2011 - The Tenth International Conference on Autonomous Agents and Multi-agent Systems. http://www.aamas2011.tw/.
[2] ADMI 2011 - The Seventh International Workshop on Agents and Data Mining Interaction. http://admi11.agentmining.org/.
[3] IEEE VAST Challenge 2010. http://hcil.cs.umd.edu/localphp/hcil/vast10/index.php.
[4] UCI repository, Covertype dataset. http://archive.ics.uci.edu/ml/datasets/Covertype.
[5] WEKA. http://www.cs.waikato.ac.nz/ml/weka/.
[6] R. A. Berk. An introduction to ensemble methods for data analysis. Sociological
Methods Research, 34(3):263–295, 2006.
[7] A. Bifet and R. Gavaldà. Learning from time-changing data with adaptive windowing.
In SIAM International Conference on Data Mining, 2007.
[8] A. Bifet and R. Kirkby. Data stream mining - a practical approach.
http://moa.cs.waikato.ac.nz/downloads/.
[9] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression
Trees. Wadsworth, 1984.
[10] A. Fern and R. Givan. Online ensemble learning: An empirical study. Machine
Learning, 53:71–109, 2003.
[11] J. Gama, P. Medas, G. Castillo, and P. Rodrigues. Learning with drift detection.
In Proceedings of the 7th Brazilian Symposium on Artificial Intelligence (SBIA '04),
Lecture Notes in Computer Science, volume 3171, pages 286–295, Sao Luiz do
Maranhao, Brazil, 2004. Springer.
[12] I. Katakis, G. Tsoumakas, and I. Vlahavas. An ensemble of classifiers for coping
with recurring contexts in data streams. In 18th European Conference on Artificial
Intelligence, Patras, Greece, 2008.
[13] I. Katakis, G. Tsoumakas, and I. Vlahavas. Tracking recurring contexts using ensemble classifiers: An application to email filtering. Knowledge and Information
Systems, 22:371–391, 2009.
[14] M. Baena-Garcia, J. Del Campo-Avila, R. Fidalgo, and A. Bifet. Early drift detection
method. In Proceedings of the 4th ECML PKDD International Workshop on Knowledge
Discovery From Data Streams (IWKDDS '06), pages 77–86, Berlin, Germany, 2006.
[15] F. L. Minku, H. Inoue, and X. Yao. Negative correlation in incremental learning.
Natural Computing Journal - Special Issue on Nature-inspired Learning and Adaptive
Systems, page 32, 2008.
[16] F. L. Minku and X. Yao. Using diversity to handle concept drift in on-line learning.
IEEE Transactions on Knowledge and Data Engineering, 2009.
[17] L. Minku, A. White, and X. Yao. The impact of diversity on on-line ensemble
learning in the presence of concept drift. IEEE Transactions on Knowledge and
Data Engineering, 2008.
[18] N. C. Oza and S. Russell. Experimental comparisons of on-line and batch versions
of bagging and boosting. In Proceedings of the seventh ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 359–364, San Francisco,
California, 2001.
[19] N. C. Oza and S. Russell. Online bagging and boosting. In Proceedings of the 2005
IEEE International Conference on Systems, Man and Cybernetics, volume 3, pages
2340–2345, New Jersey: Institute for Electrical and Electronics Engineers, 2005.
[20] R. Pelossof, M. Jones, I. Vovsha, and C. Rudin. Online coordinate boosting. On-line
Learning for Computer Vision Workshop (OLCV), 2009.
[21] R. Polikar, L. Udpa, S. S. Udpa, and V. Honavar. Learn++: An incremental learning
algorithm for supervised neural networks. IEEE Transactions on Systems, Man, and
Cybernetics - Part C, 31:497–508, 2001.
[22] W. Nick Street and Yongseog Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '01), pages 377–382, 2001.
[23] A. Tsymbal, M. Pechenizkiy, P. Cunningham, and S. Puuronen. Dynamic integration of classifiers for handling concept drift. Information Fusion, 9:56–68, 2008.
[24] Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han. Mining concept-drifting
data streams using ensemble classifiers. In Proceedings of the ninth ACM SIGKDD
international conference on Knowledge discovery and data mining (KDD '03), page 226, 2003.
[25] I. Zliobaite. Learning under concept drift: an overview. Technical Report, Faculty
of Mathematics and Informatics, Vilnius University, Vilnius, Lithuania, 2009.