IMPACT OF TYPE OF CONCEPT DRIFT ON ONLINE ENSEMBLE LEARNING

A Project Report Submitted by
Prashant M. Chaudhary (110703010)
Gaurish S. Chaudhari (110703201)
Sonali P. Rahagude (110703202)
in partial fulfilment for the award of the degree of B.Tech Computer Engineering
Under the guidance of Mrs. Vahida Z. Attar

Department of Computer Engineering and Information Technology, College of Engineering, Pune-5
May, 2011

CERTIFICATE

Certified that this project, titled "IMPACT OF TYPE OF CONCEPT DRIFT ON ONLINE ENSEMBLE LEARNING", has been successfully completed by Prashant M. Chaudhary (110703010), Gaurish S. Chaudhari (110703201) and Sonali P. Rahagude (110703202), and is approved in partial fulfilment of the requirements for the degree of "B.Tech. Computer Engineering".

SIGNATURE — Mrs. Vahida Z. Attar, Project Guide, Department of Computer Engineering and Information Technology, College of Engineering Pune, Shivajinagar, Pune - 5.
SIGNATURE — Dr. Jibi Abraham, Head, Department of Computer Engineering and Information Technology, College of Engineering Pune, Shivajinagar, Pune - 5.

Abstract

Mining concept-drifting data streams is a challenging area for data mining research. In the real world, data streams are not stable but change with time. Such changes, termed concept drifts, are categorized into gradual and abrupt based on the drifting time, i.e. the number of time steps taken to replace the old concept completely by the new one. In traditional online learning systems, this categorization has not been exploited for handling different drifts in the data stream. The characteristics of different drift types can be used to develop approaches that achieve better system performance than the existing ones. So, the issue of handling concept drifts in online data according to their type can be explored further.
Among the most popular and effective approaches to handling concept drift is ensemble learning, in which a set of models built over different time periods is maintained and the predictions of the models are combined, usually according to their level of expertise regarding the current concept. Diversity among the base classifiers of the ensemble is an important factor affecting the performance of an ensemble learning system. If early instances of the new concept are stored and used for ensemble learning once the drift is detected, this may help increase the overall accuracy after the drift. Moreover, if an ensemble learns with zero diversity on instances of the new concept during the drifting period, the ensemble may learn the new concept faster, thus boosting recovery. This project presents the above approach for effective handling of various drifts according to their composition characteristics.

Acknowledgements

Apart from our own efforts, the success of this project depended largely on the encouragement and guidance of many others. We take this opportunity to express our gratitude to the people who have been instrumental in the successful completion of this project. We are highly indebted to Mrs. Vahida Z. Attar for her guidance and constant supervision, for providing necessary information regarding the project, and for her tremendous support throughout. We are very thankful to our Head of Department, Dr. Jibi Abraham, who molded us both technically and morally for achieving greater success in life. We would also like to thank Mr. Leandro L. Minku for his valuable guidance in the starting phase of the project. His research papers, and the source code that he generously provided, were of great use in the experimentation part of the project.
Lastly, we express our sincere gratitude towards our parents and the faculty members of the Department of Computer Engineering and Information Technology, College of Engineering, Pune, for their kind co-operation and encouragement, which helped us in the completion of this project.

Prashant Chaudhary, Gaurish Chaudhari, Sonali Rahagude

Contents

1 Motivation
2 Introduction
3 Literature Survey
  3.1 Data Stream Mining
  3.2 Online Learning
  3.3 Concept Drift
  3.4 Drift Detection Techniques
    3.4.1 Drift Detection Method - DDM
    3.4.2 Early Drift Detection Method - EDDM
    3.4.3 Adaptive Windowing - ADWIN
  3.5 Drift Handling Techniques
    3.5.1 Pure Ensemble Learning
    3.5.2 Learning with Drift Detection
4 Existing Approach
  4.1 Learning System using EDDM
  4.2 Online Boosting
5 Proposed Approach
  5.1 Use of Instance Window
    5.1.1 Storing Instances in Instance Window
    5.1.2 Training from Instance Window
  5.2 Use of Zero Diversity
    5.2.1 Instances used for Training with Zero Diversity
    5.2.2 Implementation of Zero Diversity
  5.3 Switching to New Ensemble
6 Experimentation
  6.1 Artificial Datasets
  6.2 Real Datasets
    6.2.1 Spam Corpus
    6.2.2 Forest Cover (UCI Repository)
    6.2.3 ELEC2 Dataset
    6.2.4 Usenet
  6.3 Implementation Environment
  6.4 User Interface for GPS
  6.5 Determination of Parameters
    6.5.1 Size of Instance Window
    6.5.2 Ratio value for PostWarning Level
7 Results and Analysis
  7.1 Accuracy
    7.1.1 Tables of Results
    7.1.2 Graphs of Results
  7.2 Noise Sensitivity
  7.3 Memory and Time Bounds
8 Conclusion
9 Supplementary Work
  9.1 Hospitalization Record Analyzer
  9.2 Research Paper Publication
  9.3 Approaches of Drift Type Detection
    9.3.1 Approach 1: Using Standard Deviation Measure
    9.3.2 Approach 2: Using Error Rate
    9.3.3 Approach 3: Generating Association Rules/Decision Trees for Drift Type
  9.4 ADWIN Integrated GPS Approach
10 Future Work

List of Tables

6.1 Artificial datasets
6.2 Avg. accuracies for different instance window sizes (dataset size: 50,000)
6.3 Avg. accuracies for different values of post-warning level (dataset size: 50,000)
7.1 Avg. accuracies (dataset size: 50000, no. of drifts: 1, severity: high)
7.2 Avg. accuracies (dataset size: 2000, no. of drifts: 1, severity: high)
7.3 Avg. accuracies (dataset size: 100000, no. of drifts: 1, severity: high)
7.4 Avg. accuracies (dataset size: 100000, no. of drifts: 3, severity: high)
7.5 Avg. accuracies (dataset size: 50000, no. of drifts: 1, severity: medium)
7.6 Avg. accuracies (dataset size: 2000, no. of drifts: 1, severity: medium)
7.7 Avg. accuracies (dataset size: 100000, no. of drifts: 1, severity: medium)
7.8 Avg. accuracies (dataset size: 100000, no. of drifts: 3, severity: medium)
7.9 Avg. accuracies (dataset size: 50000, no. of drifts: 1, severity: low)
7.10 Avg. accuracies (dataset size: 2000, no. of drifts: 1, severity: low)
7.11 Avg. accuracies (dataset size: 100000, no. of drifts: 1, severity: low)
7.12 Avg. accuracies (dataset size: 100000, no. of drifts: 3, severity: low)
7.13 Avg. accuracies, miscellaneous datasets
7.14 Avg. accuracies for different noise levels (dataset: Plane)
7.15 Processing time in seconds (dataset size: 50000)
7.16 Memory in bytes (dataset size: 50000)
9.1 Average accuracies for GPS(EDDM) and GPS(ADWIN) (dataset size: 50,000)

List of Figures

3.1 Types of concept drifts in streams
6.1 User Interface for GPS in MOA
7.1 Dataset: Circle, Size: 50000, Drift: Abrupt, Severity: High
7.2 Dataset: SineH, Size: 50000, Drift: Gradual(0.25N), Severity: High
7.3 Dataset: Plane, Size: 50000, Drift: Gradual(0.50N), Severity: High
7.4 Dataset: Line, Size: 2000, Drift: Abrupt, Severity: Medium
7.5 Dataset: SineV, Size: 2000, Drift: Gradual(0.25N), Severity: Medium
7.6 Dataset: Plane, Size: 2000, Drift: Gradual(0.50N), Severity: Medium
7.7 Dataset: SineH, Size: 100000, Drift: Abrupt, Severity: Low, No. of Drifts: 1
7.8 Dataset: Circle, Size: 100000, Drift: Gradual(0.25N), Severity: Low, No. of Drifts: 1
7.9 Dataset: SineV, Size: 100000, Drift: Gradual(0.50N), Severity: Low, No. of Drifts: 1
7.10 Dataset: SineV, Size: 100000, Drift: Abrupt, Severity: High, No. of Drifts: 3
7.11 Dataset: Circle, Size: 100000, Drift: Gradual(0.25N), Severity: High, No. of Drifts: 3
7.12 Dataset: SineH, Size: 100000, Drift: Gradual(0.50N), Severity: High, No. of Drifts: 3
7.13 Dataset: Hyperplane, Size: 50000, No. of Drifts: 1
7.14 Dataset: Waveform, Size: 150000, No. of Drifts: 0
7.15 Dataset: Spam Corpus
7.16 Dataset: Forest Cover
9.1 Hospitalization Record Analyzer: Main Window
9.2 Hospitalization Record Analyzer: Analysis example (Syndrome Distribution)
9.3 Hospitalization Record Analyzer: Analysis example (City-wise Dead-infected)
9.4 Example of Association Rules created using JRIP
9.5 Drift Detection Framework

List of Algorithms

1 ADWIN: Adaptive Windowing Algorithm
2 Existing Approach: SingleClassifierDrift
3 Online Boosting
4 Training Algorithm for Ensembles used in Proposed Approaches
5 Proposed Approach: GPSGradual
6 Proposed Approach: GPSAbrupt

Chapter 1 Motivation

Data mining is the process of extracting patterns from data. Today, data mining is among the most rapidly advancing and challenging fields in the area of real-life applications. It is a vast field that draws on various concepts from computer science and mathematics.
As far as computer science is concerned, data mining uses various algorithms (design and analysis), data structures, database management systems, high-performance computing, etc. On the mathematics side, data mining makes extensive use of probability and statistical analysis. Hence, a project in data mining requires knowledge from all of these fields, and more can be learned, and more fruitful results obtained, by applying all of this knowledge.

During the literature survey for the project, various domains within data mining were studied, such as preprocessing of data, feature selection, synopsis data structures, data streaming, classification algorithms, clustering, concept drifts, ensemble techniques, etc. The aim of the search was to find a domain in which we could contribute something new and innovative to the data mining community. At the same time, the scope of the project had to be feasible, as the project is time-bounded.

Concept drifts in data streams are usually of two types: abrupt and gradual. The existing approaches for handling concept drift in online learning systems do not take into consideration the type of the drift being handled. So, we thought of using the characteristics of the type of drift to improve system performance in online learning. It is observed that the accuracy of an online learning system for gradual drifts is lower than that for abrupt drifts. Moreover, recovery in the case of gradual drifts is slower and less complete. So, we first chose to work on improving the performance of the online learning system on data streams with gradual drifts, and then on abrupt drifts. Thus, we decided to develop a framework which handles both abrupt and gradual drifts effectively.

Chapter 2 Introduction

Online learning has a wide variety of applications in which training data is available continuously in time and there are time and space constraints.
Examples include web traffic monitoring, network security, sensor signal processing, and credit card fraud detection. Online learning algorithms process each training instance once on arrival, without the need for storage or reprocessing, and maintain a current hypothesis that reflects all the training instances seen so far [18]. In this way, the learning algorithm takes a single training instance as well as a hypothesis as input and outputs an updated hypothesis [10]. Online learning environments are often non-stationary, and the variables to be predicted by the learning machine may change with time; this change is referred to as concept drift.

Concept drifts can be categorized based on their speed. Speed is the inverse of drifting time, which can be measured as the number of time steps taken for a new concept to completely replace the old one [17]. In this way, a higher speed corresponds to a lower number of time steps and a lower speed to a higher number of time steps. According to speed, drifts can be categorized as either abrupt, when the complete change occurs in only one time step, or gradual, otherwise [17]. For example, a sudden change in the buying preferences of a share purchaser due to a dip in the stock price of some particular company is an abrupt concept drift, whereas adapting from an old mailing system to a new one, where people use both systems for some time initially, is a gradual concept drift.

Ensemble learning is among the most popular and effective approaches to handle concept drift. A set of concept descriptions built over different time intervals is maintained, and their predictions are combined using a form of voting, or the most relevant description is selected [23]. Ensembles of classifiers have been successfully used to improve the accuracy of single classifiers in online learning [18, 10, 21, 15]. Learning machines used to model non-stationary environments (concept drifts) should be able to adapt quickly and accurately to possible changes.
We propose a novel ensemble approach for handling various concept drifts by exploiting their composition characteristics. In this approach, early instances of the new concept are stored and used for ensemble learning whenever a drift occurs. Also, all the classifiers in the ensemble are trained on these instances of the new concept during the drifting period. Experiments show that when a concept drift occurs, the proposed approach obtains better accuracy than the Early Drift Detection Method (EDDM) [14] approach, a system which adopts the strategy of learning a new classifier from scratch when a drift is detected.

Chapter 3 Literature Survey

3.1 Data Stream Mining

A data stream is an ordered sequence of instances that can be read only once, or a small number of times, using limited computing and storage capabilities. Examples of data streams include computer network traffic, phone conversations, ATM transactions, web searches, and sensor data. Data stream mining, the process of extracting knowledge structures from continuous, rapid data streams, can be considered a subfield of data mining, machine learning, and knowledge discovery. The core assumption of data stream processing is that training instances can be briefly inspected a single time only; that is, they arrive in a high-speed stream and must then be discarded to make room for subsequent instances. The algorithm processing the stream has no control over the order of the instances seen, and must update its model incrementally as each example is inspected. An additional desirable property, the so-called anytime property, requires that the model be ready to be applied at any point between training instances [8].

3.2 Online Learning

Online learning is a continuous learning process in which instances arrive one at a time and are processed only once due to time and space constraints.
Online learning algorithms process each training instance once "on arrival", without the need for storage or reprocessing, and maintain a current hypothesis that reflects all the training instances seen so far [18]. Learning proceeds in a sequence of trials. In each trial, the algorithm receives an instance from some fixed domain and is to produce a binary prediction. At the end of the trial, the algorithm receives a binary label, which can be viewed as the correct prediction for the instance.

3.3 Concept Drift

Concept refers to the target variable which the model is trying to predict. Concept change is the change of the underlying concept over time. The term concept drift can be formally defined as follows: concept drift is a change in the distribution of a problem [11], which is characterized by the joint distribution p(x, w), where x represents the input attributes and w represents the target classes [16]. Two kinds of concept drift normally occur in the real world: abrupt and gradual [25].

Let SI and SII be two sources which generate instances corresponding to the old and the new concept, respectively, and let t0 be the time at which the drift occurs, before which all instances come from source SI.

Abrupt Drift: The simplest pattern of change is abrupt drift, in which at time t0 source SI is suddenly replaced by source SII, which continues from then on.

Gradual Drift: Another kind of change is gradual drift, which refers to a certain period after t0 during which both sources SI and SII are active. As time passes, the probability of sampling from source SI decreases while the probability of sampling from source SII increases. Note that at the beginning of a gradual drift, before more instances have been seen, an instance from source SII might easily be mistaken for random noise. The period during which both sources SI and SII are active is called the drifting period. The width of the drifting period is inversely related to the speed of the drift.
The greater the width, the lower the speed, i.e. the more gradual the change. We adopt this definition of gradual drift as the basis for developing the proposed algorithm to handle gradual changes in data streams.

Figure 3.1: Types of concept drifts in streams

3.4 Drift Detection Techniques

To deal with change over time, most previous work can be classified according to whether it uses full, partial, or no instance memory [14]. The partial-memory methods use variations of the sliding-window idea: at every moment, one window (or more) containing the most recently read instances is kept, and only those are considered relevant for learning. A critical point in any such strategy is the choice of a window size. The easiest strategy is deciding on (or asking the user for) a window size W and keeping it fixed throughout the execution of the algorithm. In order to detect change, one can keep a reference window with data from the past, also of some fixed size, and decide that change has occurred if some statistical test indicates that the distributions in the reference and current windows differ. Another approach, using no instance memory, only aggregates: it applies a decay function to instances so that they become less important over time.

Yet another approach to detecting changes in the distribution of the training instances monitors the online error rate of the algorithm [25]. In this method, learning takes place in a sequence of trials. When a new training example is available, it is classified using the current model. The method tracks the trace of the online error of the algorithm and, for the actual context, defines a warning level and a drift level. This approach has been used in DDM and EDDM.

3.4.1 Drift Detection Method - DDM

The drift detection method (DDM) [11] uses a binomial distribution. For each point i in the sequence that is being sampled, the error rate is the probability of misclassification (pi), with standard deviation given by si = √(pi(1 − pi)/i).
A significant increase in the error of the algorithm suggests that the class distribution is changing and, hence, that the current decision model has become inappropriate. Thus, the values of pi and si are stored when pi + si reaches its minimum value during the process (obtaining pmin and smin), and the method checks when the following conditions trigger:
• pi + si ≥ pmin + 2smin for the warning level.
• pi + si ≥ pmin + 3smin for the drift level.
This approach behaves well when detecting abrupt changes, and gradual changes that are not very slow, but it has difficulties when the change is slowly gradual.

3.4.2 Early Drift Detection Method - EDDM

We use EDDM [14] as the drift detection method in our approach. The Early Drift Detection Method (EDDM) was developed to improve detection in the presence of gradual concept drift while keeping good performance under abrupt concept drift. The basic idea is to consider the distance between two classification errors instead of only the number of errors, as in DDM. A significant decrease in the average distance between two consecutive errors suggests that the class distribution is changing. EDDM calculates the average distance between two errors obtained by the classifier system (Pi) and its standard deviation (Si), and stores their maximum values so far (Pmax and Smax). If (Pi + 2Si) / (Pmax + 2Smax) < α, where α is a pre-defined parameter, a concept drift is suspected and a warning level is triggered. If the similarity between (Pi + 2Si) and (Pmax + 2Smax) starts to increase after a warning level is triggered, the warning level is cancelled and the method returns to normality. If (Pi + 2Si) / (Pmax + 2Smax) < β, where β is a pre-defined parameter and α > β, a concept drift is confirmed. Thus:
• (Pi + 2Si) / (Pmax + 2Smax) < α for the warning level.
• (Pi + 2Si) / (Pmax + 2Smax) < β for the drift level.
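The warning/drift logic described above can be sketched in Python as follows. This is a minimal, illustrative sketch, not EDDM's reference implementation: class and attribute names are our own, the running mean and standard deviation of the distances between errors are maintained with Welford's update, and (as in the published method) at least 30 errors are required before the test is considered reliable.

```python
class EDDMDetector:
    """Sketch of EDDM's warning/drift test on distances between errors.

    Tracks the running mean P_i and standard deviation S_i of the distance
    (in time steps) between consecutive misclassifications, remembers the
    maximum of P_i + 2*S_i seen so far, and flags WARNING / DRIFT when the
    current value falls below alpha / beta times that maximum.
    """

    def __init__(self, alpha=0.95, beta=0.90):
        self.alpha, self.beta = alpha, beta
        self.t = 0                 # time steps seen so far
        self.last_error_t = 0      # time of the previous misclassification
        self.n_errors = 0
        self.mean = 0.0            # running mean of error distances (P_i)
        self.m2 = 0.0              # running sum of squared deviations (Welford)
        self.max_level = 0.0       # maximum of P_i + 2*S_i seen so far

    def update(self, correct):
        """Feed one prediction outcome; returns 'STABLE', 'WARNING' or 'DRIFT'."""
        self.t += 1
        if correct:
            return "STABLE"
        # Welford's update for the distance between consecutive errors.
        distance = self.t - self.last_error_t
        self.last_error_t = self.t
        self.n_errors += 1
        delta = distance - self.mean
        self.mean += delta / self.n_errors
        self.m2 += delta * (distance - self.mean)
        std = (self.m2 / self.n_errors) ** 0.5
        level = self.mean + 2.0 * std          # P_i + 2*S_i
        self.max_level = max(self.max_level, level)
        if self.n_errors < 30 or self.max_level == 0.0:
            return "STABLE"                    # too few errors to be reliable
        ratio = level / self.max_level         # (P_i+2S_i) / (P_max+2S_max)
        if ratio < self.beta:
            return "DRIFT"
        if ratio < self.alpha:
            return "WARNING"
        return "STABLE"
```

When errors start arriving close together, the average distance Pi shrinks, the ratio drops below α (warning) and eventually below β (drift), matching the two conditions above.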
3.4.3 Adaptive Windowing - ADWIN

ADWIN [7] is parameter- and assumption-free in the sense that it automatically detects and adapts to the current rate of change. It keeps a variable-length window of recently seen items, with the property that the window has the maximal length statistically consistent with the hypothesis that there has been no change in the average value inside the window. More precisely, an older fragment of the window is dropped if and only if there is enough evidence that its average value differs from that of the rest of the window. This has two consequences: first, change is reliably declared whenever the window shrinks, and second, at any time the average over the existing window can be reliably taken as an estimate of the current average in the stream (barring a very small or very recent change that is still not statistically visible). ADWIN's only parameter is a confidence bound δ, indicating how confident we want to be in the algorithm's output, a parameter inherent to all algorithms dealing with random processes [8].

The algorithm keeps a sliding window W with the most recently read xi. Let n denote the length of W, µ̂W the (observed) average of the elements in W, and µW the (unknown) average of µt for t ∈ W. Strictly speaking, these quantities should be indexed by t, but in general t will be clear from the context. Algorithm 1 describes the adaptive windowing algorithm.

Algorithm 1 ADWIN: Adaptive Windowing Algorithm
1: Initialize window W
2: for each t > 0 do
3:   W ← W ∪ {xt} (i.e. add xt to the head of W)
4:   repeat: drop elements from the tail of W
5:   until |µ̂W0 − µ̂W1| < εcut holds for every split of W into W = W0 · W1
6:   output µ̂W
7: end for

3.5 Drift Handling Techniques

3.5.1 Pure Ensemble Learning

An ensemble consists of a set of individually trained classifiers whose predictions are combined when classifying novel instances.
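Before moving on to drift handling, the shrink test of Algorithm 1 (Section 3.4.3) can be sketched as follows. This is the plain quadratic-time version kept as an illustrative sketch; the actual ADWIN maintains exponential histograms so the check runs in logarithmic time and memory, and the εcut bound below follows the Hoeffding-style form from the ADWIN paper (function name and conventions are our own; new items are appended at the end of the list, so the "tail" of Algorithm 1 is the front here).

```python
import math

def adwin_cut(window, delta=0.002):
    """Simplified ADWIN shrink test: drop the oldest elements of `window`
    (a list of floats, oldest first) while some split W = W0 . W1 has
    |mean(W0) - mean(W1)| >= eps_cut.

    Returns the possibly shortened window and a flag saying whether change
    was detected, i.e. whether the window shrank.
    """
    changed = False
    while len(window) >= 2:
        n = len(window)
        total = float(sum(window))
        cut_at = None
        head_sum = 0.0
        for i in range(1, n):                  # split: W0 = window[:i], W1 = window[i:]
            head_sum += window[i - 1]
            n0, n1 = i, n - i
            mu0, mu1 = head_sum / n0, (total - head_sum) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)    # harmonic mean of n0 and n1
            # Hoeffding-style bound with delta' = delta / n.
            eps_cut = math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 * n / delta))
            if abs(mu0 - mu1) >= eps_cut:
                cut_at = i
                break
        if cut_at is None:
            break                              # every split is consistent: stop
        del window[:cut_at]                    # drop the stale, oldest fragment
        changed = True
    return window, changed
```

On a stream whose mean jumps, the oldest fragment's average eventually differs from the rest by more than εcut, so the window shrinks and change is declared; on a stable stream the window simply keeps growing.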
Bagging and Boosting are well-known ensemble learning algorithms. In bagging, all classifiers are given equal weight, whereas in boosting each classifier is weighted according to its accuracy. Combining the output of several classifiers in an ensemble is useful only if there is disagreement among them [6]. Usually, the implicit aim of having disagreement, or diversity, is to have at least one classifier in the ensemble trained for each distinct concept. Thus, when tackling non-stationary concepts, ensembles of classifiers have several advantages over single-classifier methods: they are easy to scale and parallelize, they can adapt to change quickly by pruning under-performing parts of the ensemble, and they therefore usually also generate more accurate concept descriptions.

3.5.2 Learning with Drift Detection

The problem with pure ensemble learning is that recovery time is very high because there is no drift detection, especially in the case of gradual drifts. In this approach, the training instances arriving during the warning period of the drift detection method are stored. If the warning level is cancelled, the stored instances are removed. If a drift is confirmed, the classifier system is reset and a new classifier system is created. The new classifier system first learns all the instances stored since the warning level was triggered and then starts to learn the new training instances. Here, using an ensemble of classifiers is preferred, as ensembles improve on the accuracy of single classifiers. In MOA, the above approach has been implemented using the wrapper class SingleClassifierDrift [8]. Some of the ensembles available as base classifiers for this wrapper that we have used are OzaBag [19], OzaBoost [19] and OCBoost [20].

Chapter 4 Existing Approach

4.1 Learning System using EDDM

A learning system which uses EDDM behaves in the following way.
During the warning period, a new classifier system is created and trained on all the incoming instances until the drift level is detected by EDDM. If the warning level is cancelled, the new classifier system is discarded. If a drift is confirmed, P′max and S′max are reset and the original classifier system is replaced by the new one. New values of P′max and S′max are considered only after 30 errors have occurred. Thus, the EDDM approach adopts the strategy of learning a new classifier from scratch when a drift is detected. This approach is implemented by the SingleClassifierDrift class in MOA, and Algorithm 2 describes it. In the EDDM approach, we use online boosting as the learning technique for the classifier ensemble, with EDDM as the drift detection method.

Algorithm 2 Existing Approach: SingleClassifierDrift
Input: inst, currentEnsemble, newEnsemble
1: predictedClass ← PredictClassOfInstance(inst)
2: classification ← (predictedClass == inst.Class)
3: level ← computeEDDMLevel(inst, classification)
4: if level == WARNING then
5:   if newClassifierReset == TRUE then
6:     newEnsemble.reset()
7:     newClassifierReset ← FALSE
8:   end if
9:   newEnsemble.train(inst)
10: else if level == DRIFT then
11:   currentEnsemble ← newEnsemble
12:   newEnsemble.reset()
13: else if level == STABLE then
14:   newClassifierReset ← TRUE
15: end if
16: currentEnsemble.train(inst)
Output: updated newEnsemble and currentEnsemble

4.2 Online Boosting

The boosting algorithm generates a sequence of base models h1, h2, ..., hM using weighted training sets (weighted by D1, D2, ..., DM) such that the training instances misclassified by hm−1 are given half the total weight when generating model hm, and the correctly classified instances are given the other half. When the base model learning algorithm cannot learn from weighted training sets, one can generate samples with replacement according to Dm [19].
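The control flow of Algorithm 2 can be sketched in Python as follows. All names here are illustrative rather than MOA's actual Java API: the detector is any object mapping a prediction outcome to STABLE/WARNING/DRIFT, and `copy_from` stands in for the ensemble reassignment on line 11 of the algorithm.

```python
def single_classifier_drift_step(inst, label, detector, current, candidate, state):
    """One step of a SingleClassifierDrift-style wrapper (Algorithm 2).

    `current` and `candidate` are classifier/ensemble objects exposing
    predict / train / reset / copy_from; `state` carries the
    `reset_pending` flag (newClassifierReset) between calls.
    """
    predicted = current.predict(inst)
    level = detector.update(predicted == label)
    if level == "WARNING":
        if state.get("reset_pending", True):
            candidate.reset()                  # start the candidate from scratch
            state["reset_pending"] = False
        candidate.train(inst, label)           # candidate learns the suspected new concept
    elif level == "DRIFT":
        current.copy_from(candidate)           # promote the candidate classifier
        candidate.reset()
        state["reset_pending"] = True
    else:                                      # STABLE: cancel any pending warning
        state["reset_pending"] = True
    current.train(inst, label)                 # the active classifier always trains
    return level
```

Note that, exactly as in Algorithm 2, the active classifier keeps training on every instance; only the candidate is restricted to the warning period, which is why a confirmed drift still forces learning the new concept essentially from scratch.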
The online boosting algorithm [18] simulates sampling with replacement using the Poisson distribution. When a base classifier misclassifies a training instance, the Poisson distribution parameter lambda (λ) associated with the instance is increased when the instance is presented to the next base model, so the instance will be learnt by more base classifiers; otherwise it is decreased. Algorithm 3 describes the online boosting algorithm.

Algorithm 3 Online Boosting
Input: inst, λsc_m, λsw_m, BaseLearner = Hoeffding Tree
1: Set weight of example λd ← 1
2: l ← ensembleLength
3: for m = 1, 2, ..., l do
4:   Set k ← Poisson(λd)
5:   if k > 0.0 then
6:     Update hm with the current instance inst
7:   end if
8:   if hm correctly classifies instance inst then
9:     λsc_m ← λsc_m + λd
10:    λd ← λd (N / 2λsc_m)
11:  else
12:    λsw_m ← λsw_m + λd
13:    λd ← λd (N / 2λsw_m)
14:  end if
15: end for
Output: updated ensemble

Chapter 5 Proposed Approach

In the initial experimentation on drifting data, it was observed that the accuracy of an online learning system for gradual drifts is lower than that for abrupt drifts. Moreover, recovery in the case of gradual drifts was slower and less complete. The lower accuracy in the case of gradual drifts is due to the composition of the stream during a gradual drift. As mentioned in Chapter 3, the data stream during the drifting period consists of instances from the old as well as the new concept. Due to this, the classifier system is trained on both concepts whenever a concept drift occurs. In contrast, in an abrupt drift the data stream after the drift consists only of instances from the new concept. Since the classifier then learns only the new concept, it learns better, which leads to greater classification accuracy and recovery. Thus, the main aim was to make the classifier system learn only on instances from the new concept whenever a drift occurs.
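The Poisson-based weight update of Algorithm 3 (Section 4.2) can be sketched in Python as follows. This is an illustrative sketch, not Oza's reference code: the function and parameter names (`online_boost_train`, `lam_sc`, `lam_sw`, `n_seen` standing in for N) are our own, the update follows the formula exactly as written in Algorithm 3, and the Poisson draw uses Knuth's method since the standard library has no Poisson sampler.

```python
import math
import random

def poisson(lam):
    """Draw from Poisson(lam) by Knuth's method (fine for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def online_boost_train(ensemble, inst, label, n_seen, lam_sc, lam_sw):
    """Online boosting update (Algorithm 3): each base model trains on the
    instance k ~ Poisson(lambda_d) times, and lambda_d grows after a
    misclassification so that later models focus on the hard instance.

    lam_sc / lam_sw accumulate the weight of correctly / wrongly classified
    instances per model; `n_seen` plays the role of N in Algorithm 3.
    Base models need only expose train / predict.
    """
    lam_d = 1.0
    for m, h in enumerate(ensemble):
        for _ in range(poisson(lam_d)):        # k > 0: update h_m with inst
            h.train(inst, label)
        if h.predict(inst) == label:
            lam_sc[m] += lam_d
            lam_d *= n_seen / (2.0 * lam_sc[m])   # down-weight: model got it right
        else:
            lam_sw[m] += lam_d
            lam_d *= n_seen / (2.0 * lam_sw[m])   # up-weight for the next model
    return lam_d
```

Because λd only changes through these multiplicative updates, an instance that every model classifies correctly carries progressively less weight down the ensemble, while a hard instance is replayed more often to the later models.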
For the classifier to be trained with enough instances from the new concept to attain greater classification accuracy, we can store instances of the new concept in advance, so that enough of them are available to the new classifier when the drift is confirmed by the drift detection method. Thus, an instance window containing instances only from the new concept can improve the learning of the classifier system and thereby its recovery.

Also, in the case of an ensemble classifier system, the classification of a given instance depends on the voting of all the individual classifiers. So, a way of making the classifier learn the new concept better after the drift is to make all the individual base classifiers learn on the instances from the new concept. The ensemble will then be trained well on the new concept, as every classifier learns it. We call this method of making all base classifiers learn zero diversity.

Based on the ideas described above, we propose two approaches, GPSGradual and GPSAbrupt, in an attempt to improve classifier accuracy and recovery for gradual and abrupt drifts respectively. The composition characteristics of the drift streams have been exploited for these approaches. We use EDDM as the drift detection method because it is a recent method which has been shown to attain accuracy similar to previous methods when the drifts are abrupt and better accuracy when the drifts are gradual.

From the start of the data stream, we maintain two ensembles, currentEnsemble and newEnsemble. Both ensembles are trained in the same way on all instances before a drift occurs; thus, before the drift, newEnsemble is a copy of currentEnsemble. Maintaining both ensembles from the start not only helps the system adapt to the change quickly but also improves performance in the case of false drift detections.
5.1 Use of Instance Window

5.1.1 Storing Instances in Instance Window

An instance window of fixed size is maintained for storing the recent instances on which the new ensemble is to be trained. These instances are stored in first-in-first-out fashion. The instance window size was experimentally determined to be 50.

In a gradual drift stream, whenever a warning level is flagged for the possibility of a drift, the instances misclassified by the currentEnsemble should belong to the new concept, as the currentEnsemble has not yet learnt the new concept well. So, in the GPSGradual approach, we store in the window only those instances that have been misclassified by the currentEnsemble; during the drift period the misclassified instances will be those of the new concept, and hence the instance window will increasingly contain instances from the new concept.

In an abrupt drift stream, all instances after the drift occurs belong to the new concept. So, in the GPSAbrupt approach, all instances from the stream are stored, so that during the drifting period all of the stored incoming instances belong to the new concept. Thus, the instance window will contain instances of the new concept.

The advantage of this window is that it contains instances belonging to the new concept, so training the new ensemble on the window instances leads to faster recovery from concept drifts.
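The two storage policies above can be sketched as follows. This is a minimal illustration, not MOA's implementation; the class and function names are our own, and the window size of 50 comes from Section 6.5.1.

```python
from collections import deque

class InstanceWindow:
    """Fixed-size FIFO buffer; the oldest instance is discarded on overflow."""
    def __init__(self, size=50):        # size 50 was determined experimentally
        self.buf = deque(maxlen=size)

    def add(self, inst):
        self.buf.append(inst)

    def __iter__(self):
        return iter(self.buf)

def store(window, inst, correctly_classified, mode):
    """GPSAbrupt stores every instance; GPSGradual stores only the
    instances misclassified by the current ensemble."""
    if mode == "abrupt" or not correctly_classified:
        window.add(inst)
```

`deque(maxlen=...)` gives the first-in-first-out behaviour for free: appending to a full window silently drops the oldest instance.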
5.1.2 Training from Instance Window

The newEnsemble is trained on the instance window at different times in the two approaches (Algorithm 5 : GPSGradual and Algorithm 6 : GPSAbrupt). In the case of gradual drifts, when the warning level is detected, the data stream initially contains more instances of the old concept and fewer of the new concept. So, at the warning level, the instance window still contains instances from the old concept as well. Gradually, the probability of instances from the new concept keeps increasing, so at some later point after the warning level is triggered, the instance window will contain a majority of new-concept instances. We call this point the Post Warning level. It has been introduced in GPSGradual so that more relevant instances of the new concept are present in the instance window when the newEnsemble is trained on it; at this level, the new ensemble is trained on the instance window, as the window then contains new-concept instances in the majority. The optimal value for this level has been determined experimentally as 0.925.

As mentioned previously, in an abrupt drift only instances from the new concept are present in the stream after the drift occurs. Thus, when the warning level is detected for abrupt drifts, the instance window will already contain instances from the new concept. This eliminates the need for a post-warning level in the GPSAbrupt approach, so the newEnsemble is trained on instances from the instance window at the warning level itself. Whenever the respective level is triggered, the newEnsemble thus learns on all the instances in the instance window.

5.2 Use of Zero Diversity

5.2.1 Instances used for Training with Zero Diversity

In the GPSGradual as well as the GPSAbrupt approach, whenever a warning level is triggered by the drift detection method, we start training a new ensemble with zero diversity, i.e. all the individual base classifiers in the ensemble learn on the given instance.
The reason for training the new ensemble with zero diversity is to help it learn the new concept more efficiently and quickly, which leads to better accuracy and faster recovery. As mentioned in the previous section, a gradual drift stream consists of instances from the new as well as the old concept. Hence, when the warning level is triggered, it is necessary to ensure that only instances from the new concept are given to the new ensemble for training with zero diversity. In the GPSGradual approach, the instances on which the new ensemble is trained with zero diversity after the warning level are those misclassified by the current ensemble: whenever a warning level is flagged for the possibility of a drift, the instances misclassified by the current ensemble should belong to the new concept, as the current ensemble has not yet learnt the new concept well. Hence, training is done only on the misclassified instances.

An abrupt drift data stream, on the other hand, consists of instances only from the new concept once the drift occurs. After the warning level is flagged, all instances are therefore necessarily from the new concept, so in GPSAbrupt all the instances after the warning level are given to the new ensemble for training with zero diversity.

5.2.2 Implementation of Zero Diversity

The diversity of an ensemble is a measure of the disagreement among the classifiers in the ensemble. Several means can be used to induce it: different presentations of the input data, variations in learner design, or adding a penalty to the outputs to encourage diversity. In an online learning system, the Poisson distribution is used to decide whether a classifier in the ensemble will be presented an incoming instance for training [19], so not all the classifiers in the ensemble are trained on the same given instance. In our approach, we tune the Poisson parameter in order to obtain different diversities for the ensemble.
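The Poisson-driven training loop of Algorithm 3, extended with the zero-diversity flag of Algorithm 4, can be sketched as below. This is an illustrative Python sketch, not the MOA/Java implementation: the report uses Hoeffding trees as base learners, while here a trivial majority-class learner stands in to keep the example self-contained, and N is taken as λsc_m + λsw_m, following Oza and Russell's online boosting.

```python
import math
import random
from collections import defaultdict

class MajorityLearner:
    """Stand-in base learner: predicts whichever class it has seen most."""
    def __init__(self):
        self.counts = defaultdict(float)
    def train(self, x, y):
        self.counts[y] += 1.0
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

def poisson(lam, rng):
    """Sample k ~ Poisson(lam) via Knuth's multiplication method."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while p > threshold:
        k += 1
        p *= rng.random()
    return k - 1

class OnlineBoostEnsemble:
    def __init__(self, n_models=10, seed=7):
        self.models = [MajorityLearner() for _ in range(n_models)]
        self.lam_sc = [0.0] * n_models  # weight classified correctly so far
        self.lam_sw = [0.0] * n_models  # weight misclassified so far
        self.rng = random.Random(seed)

    def train(self, x, y, flag="normalDiverse"):
        lam = 1.0                        # lambda_d in Algorithms 3 and 4
        for m, h in enumerate(self.models):
            # zeroDiverse: every base model trains exactly once (k = 1)
            k = 1 if flag == "zeroDiverse" else poisson(lam, self.rng)
            for _ in range(k):
                h.train(x, y)
            if h.predict(x) == y:        # easy instance: shrink lambda
                self.lam_sc[m] += lam
                n = self.lam_sc[m] + self.lam_sw[m]
                lam *= n / (2.0 * self.lam_sc[m])
            else:                        # hard instance: boost lambda
                self.lam_sw[m] += lam
                n = self.lam_sc[m] + self.lam_sw[m]
                lam *= n / (2.0 * self.lam_sw[m])
```

Misclassified instances thus carry a growing λ to the later base models, which mimics sampling with replacement under boosting weights, while the zeroDiverse flag bypasses the Poisson draw entirely.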
As shown in Algorithm 4, the training method is passed a flag that determines the diversity with which the ensemble is trained on the current instance. When the flag is normalDiverse, the value of k is determined by the Poisson parameter λ; when the flag is zeroDiverse, the value of k is set to 1, so that all the classifiers in the ensemble are trained on the current instance. Thus, for new-concept instances, training all the classifiers improves the accuracy of the ensemble.

Algorithm 4 Training Algorithm for Ensembles used in Proposed Approaches
Input: inst, flag, λsc_m, λsw_m, BaseLearner = Hoeffding Tree
 1: Set weight of example λd ← 1
 2: l ← ensembleLength
 3: for m = 1, 2, ..., l do
 4:   if flag == normalDiverse then
 5:     Set k ← Poisson(λd)
 6:   else if flag == zeroDiverse then
 7:     Set k ← 1
 8:   end if
 9:   if k > 0 then
10:     Update hm with the current instance inst
11:   end if
12:   if hm correctly classifies instance inst then
13:     λsc_m ← λsc_m + λd
14:     λd ← λd (N / 2λsc_m)
15:   else
16:     λsw_m ← λsw_m + λd
17:     λd ← λd (N / 2λsw_m)
18:   end if
19: end for
Output: Updated ensemble

5.3 Switching to New Ensemble

Once the drift is confirmed by the drift detection method, the newEnsemble is set as the currentEnsemble, which is then used for making further predictions. This now-current ensemble starts learning with normal diversity again, while the newEnsemble is reset and used for handling future drifts.
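Algorithms 5 and 6 below give the full pseudocode of the two approaches; their shared per-instance control flow can be condensed into a single sketch. The ensemble and window objects here are stand-ins with the obvious train/reset interface, and some bookkeeping from the pseudocode is glossed over; only the control flow is taken from the text.

```python
def process_instance(inst, level, correct, cur, new, window, mode):
    """One time step of GPSGradual (mode='gradual') or GPSAbrupt
    (mode='abrupt'); returns the (possibly swapped) ensembles."""
    if mode == "abrupt" or not correct:      # window-storage policy (5.1.1)
        window.append(inst)
    if level == "STABLE":
        new.train(inst, "normalDiverse")
    elif level == "WARNING" and mode == "abrupt":
        for w in list(window):               # replay window with zero diversity
            new.train(w, "zeroDiverse")
        if not correct:
            new.train(inst, "zeroDiverse")
    elif level == "WARNING":                 # gradual: misclassified only
        if not correct:
            new.train(inst, "zeroDiverse")
    elif level == "POSTWARNING":             # gradual only
        for w in list(window):
            new.train(w, "zeroDiverse")
        if not correct:
            new.train(inst, "zeroDiverse")
    elif level == "DRIFT":                   # promote new, reset the old one
        cur, new = new, cur
        new.reset()
    cur.train(inst, "normalDiverse")         # current ensemble always trains
    return cur, new
```

The DRIFT branch is the switch described in Section 5.3: the new ensemble takes over prediction, and the other ensemble is reset for future drifts.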
Algorithm 5 Proposed Approach : GPSGradual
Input: inst, currentEnsemble, newEnsemble
 1: predictedClass ← PredictClassOfInstance(inst)
 2: classification ← (predictedClass == inst.Class)
 3: if classification == false then
 4:   add inst to window[]
 5: end if
 6: level ← computeEDDMLevel(inst, classification)
 7: if level == STABLE then
 8:   newEnsemble.train(inst, normalDiverse)
 9: else if level == WARNING then
10:   if classification == false then
11:     newEnsemble.train(inst, zeroDiverse)
12:   end if
13: else if level == POSTWARNING then
14:   for instTemp IN window[] do
15:     newEnsemble.train(instTemp, zeroDiverse)
16:   end for
17:   if classification == false then
18:     newEnsemble.train(inst, zeroDiverse)
19:   end if
20: else if level == DRIFT then
21:   currentEnsemble ← newEnsemble
22:   newEnsemble.reset()
23: end if
24: currentEnsemble.train(inst, normalDiverse)
Output: Updated newEnsemble and currentEnsemble

Algorithm 6 Proposed Approach : GPSAbrupt
Input: inst, currentEnsemble, newEnsemble
 1: predictedClass ← PredictClassOfInstance(inst)
 2: classification ← (predictedClass == inst.Class)
 3: add inst to window[]
 4: level ← computeEDDMLevel(inst, classification)
 5: if level == STABLE then
 6:   newEnsemble.train(inst, normalDiverse)
 7: else if level == WARNING then
 8:   newEnsemble.train(inst, zeroDiverse)
 9:   for instTemp IN window[] do
10:     newEnsemble.train(instTemp, zeroDiverse)
11:   end for
12:   if classification == false then
13:     newEnsemble.train(inst, zeroDiverse)
14:   end if
15: else if level == DRIFT then
16:   currentEnsemble ← newEnsemble
17:   newEnsemble.reset()
18: end if
19: currentEnsemble.train(inst, normalDiverse)
Output: Updated newEnsemble and currentEnsemble

Chapter 6 Experimentation

6.1 Artificial Datasets

When working with real-world datasets, it is not possible to know exactly when a drift starts to occur, which type of drift is present, or even whether there really is a drift.
So, it is not possible to perform a detailed analysis of the behaviour of algorithms in the presence of concept drifts using only pure real-world datasets. In order to analyze the strong and weak points of a particular algorithm, it is necessary first to check its behaviour using artificial datasets containing simulated drifts. Depending on the type of drift on which the algorithm is weak, it may be necessary to adopt a different strategy to improve it, so that its performance is better when applied to real-world problems.

To generate the datasets with concept drift, we use the approach of Minku L., 2008 [16]. We created datasets for the following problems (Table 6.1): Circle, Sine, Line and Plane2d. In circle, line and plane2d, the parameters r, a0 and a0 respectively represent the concepts. For sine, both c and d represent concepts, thus generating two problems, viz. SineH and SineV. The instances in the datasets contain x (or xi) and y as the input attributes and the concept (which can assume the value 0 or 1) as the output attribute; see Table 6.1.

We chose these problems for generating datasets because of the wide variety of problems that gets covered: Circle represents second-degree problems, Sine represents trigonometric problems, Line represents two-dimensional linear problems and Plane represents three-dimensional problems. Gradual drift is introduced into the datasets by decreasing the speed of the drift. The speed of the drift can be modelled by a degree-of-dominance function representing the probability that an instance of the old or the new concept will be presented to the learning system.
To present instances of both the new and the old concept for a certain period of time, we used the following linear degree-of-dominance functions:

    v_n(t) = (t - N) / drifting_time,   N < t <= N + drifting_time    (6.1)
    v_o(t) = 1 - v_n(t),                N < t <= N + drifting_time    (6.2)

where v_n(t) and v_o(t) are the degrees of dominance of the new and old concepts, respectively; t is the current time step; N is the number of time steps before the drift started to occur; and drifting_time is the number of time steps for a complete replacement of the old concept. The first N instances were generated according to the old concept (v_o(t) = 1, 1 <= t <= N). The next drifting_time instances (N < t <= N + drifting_time) were generated according to the degree-of-dominance functions v_n(t) (Equation 6.1) and v_o(t) (Equation 6.2). The remaining instances were generated according to the new concept (v_n(t) = 1, N + drifting_time < t <= 2N).

The drifting_time is 1 for abrupt drifts, whereas for gradual drifts it has been set to 0.25N and 0.50N. Thus, for each of the 5 problems we have 3 datasets corresponding to 3 different drifts (1 abrupt and 2 gradual), giving 15 different datasets for a given size. We have created datasets of various sizes, such as 2000, 50000 and 100000 instances. In the datasets with 2000 and 50000 instances, drifts have been introduced at positions 1000 and 25000 respectively. In the datasets of size 100000 with 1 drift, the drift is at position 25000. Also, a dataset of size 100000 having 3 drifts at positions 20000, 40000 and 75000 has been created, giving 60 datasets in all.

Severity has been defined as the percentage of the input space which has its target class changed after the drift is complete. Based on this, drifts can be said to be of low severity (about 25%), medium severity (about 50%) or high severity (about 75%). The 60 datasets were generated for all three severities, thus giving 180 datasets in all [17].
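The mixing scheme of Equations 6.1 and 6.2 can be sketched as a small stream generator. The function name and the concrete values N = 1000 and drifting_time = 250 (i.e. 0.25N, as for the 2000-instance datasets) are illustrative.

```python
import random

def concept_at(t, N, drifting_time, rng):
    """Return 'old' or 'new' for time step t (1-based), per Eq. 6.1/6.2."""
    if t <= N:
        return "old"                          # v_o(t) = 1 before the drift
    if t > N + drifting_time:
        return "new"                          # v_n(t) = 1 after the drift
    v_new = (t - N) / drifting_time           # Equation 6.1
    return "new" if rng.random() < v_new else "old"

rng = random.Random(42)
stream = [concept_at(t, N=1000, drifting_time=250, rng=rng)
          for t in range(1, 2001)]
```

Setting drifting_time = 1 recovers the abrupt case: the old concept is replaced in a single time step.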
Table 6.1 describes the various problems and the values used for generating datasets from them. Three ranges of concepts are given for each problem. Only the first range is used for generating datasets with one drift, whereas all three ranges are used for generating datasets with three drifts. Also, to increase the difficulty of the problems, we added 8 irrelevant attributes and 10% class noise to the plane2d datasets. The code for generating these datasets is written in the C language.

Apart from these, we have also used some standard datasets generated through MOA: SEA [22] (size 100000) with 3 drifts at 25000, 50000 and 75000; Waveform (size 100000) with 3 drifts at 25000, 50000 and 75000; Waveform [9] (size 150000) with no drift; and Hyperplane [24] (size 50000) with 1 drift at 25000.

The reason for using the datasets of 100000 instances with 1 drift at 25000 is to evaluate the behaviour of GPS a long time after the drift has occurred. Driftless datasets have been used to evaluate the performance of GPS in the absence of drifts in the data stream. Noisy datasets are used to check the noise sensitivity of the algorithm. Also, datasets with 3 drifts have been created to evaluate the performance in the case of multiple drifts; moreover, the drifts have been positioned close together to check for good and quick recovery. Thus, we have covered a wide variety of situations that can appear in data streams.
Table 6.1: ARTIFICIAL DATASETS

Problem                               Fixed Values     Range of Attributes    Range of Concepts
Circle: (x-a)^2 + (y-b)^2 <= r^2      a=0.5, b=0.5     x:[0,1], y:[0,1]       r : 0.2 -> 0.5
                                                                              r : 0.5 -> 0.1
                                                                              r : 0.1 -> 0.5
SineV: y <= a sin(bx + c) + d         a=1, b=1, c=0    x:[0,10], y:[-10,10]   d : -8 -> 7
                                                                              d : 7 -> -7
                                                                              d : -7 -> 6
SineH: y <= a sin(bx + c) + d         a=5, d=5, b=1    x:[0,4pi], y:[0,10]    c : 0 -> -pi
                                                                              c : -pi -> 0
                                                                              c : 0 -> -pi
Line: y <= -a0 + a1*x                 a1=0.1           x:[0,1], y:[0,1]       a0 : -0.1 -> -0.8
                                                                              a0 : -0.8 -> -0.2
                                                                              a0 : -0.2 -> -0.8
Plane: y <= -a0 + a1*x1 + a2*x2       a1=0.1, a2=0.1   x:[0,1], y:[0,5]       a0 : -0.7 -> -4.4
                                                                              a0 : -4.4 -> -0.5
                                                                              a0 : -0.5 -> -4.3

6.2 Real Datasets

6.2.1 Spam Corpus

For testing on real-world data, we chose the spam corpus dataset [13]. This is a real-world textual dataset built from the SpamAssassin data collection. It consists of 9324 instances with 40,000 attributes and exhibits gradual concept drift. There are 2 classes, legitimate and spam, with a spam ratio of around 20%.

6.2.2 Forest Cover (UCI Repository)

The Forest Cover dataset [4] contains geo-spatial descriptions of different types of forests. It contains 7 classes, 54 attributes and around 581,000 instances. We normalize the dataset and arrange the data so that in any chunk at most 3 and at least 2 classes co-occur, and new classes appear randomly.

6.2.3 ELEC2 Dataset

The data was collected from the Australian New South Wales Electricity Market. The ELEC2 dataset [14] contains 45312 instances dated from May 1996 to December 1998. Each example refers to a period of 30 minutes and has 5 fields: the day of the week, the time stamp, the NSW electricity demand, the Victorian electricity demand, the scheduled electricity transfer between states, and the class label.

6.2.4 Usenet

The usenet dataset [12] is based on the 20 newsgroups collection.
It simulates a stream of messages from different newsgroups that are sequentially presented to a user, who then labels them as interesting or junk according to his/her personal interests.

6.3 Implementation Environment

The proposed algorithm is implemented in the Java programming language on the Linux platform. We have used MOA (Massive Online Analysis) [8] for all the experimentation on the proposed approach. MOA is a framework for data stream mining which includes a collection of machine learning algorithms and evaluation tools, and is written in Java. It is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA is concerned with the problem of classification, perhaps the most commonly researched machine learning task. The goal of classification is to produce a model that can predict the class of unlabeled instances by training on instances whose label, or class, is supplied. We chose MOA because it is an open-source tool including a wide variety of stream generators, classifiers, evaluators and drift detection methods for analysis purposes. Also, new classifiers can easily be built and added to the MOA framework.

To build a picture of accuracy and time, we use the Interleaved Test-Then-Train evaluation model available in MOA. In this model, each individual instance is used to test the model before it is used for training, and from this the accuracy can be incrementally updated. When performed in this order, the model is always being tested on instances it has not seen. This scheme has the advantage that no holdout set is needed for testing, making maximum use of the available data. It also ensures a smooth plot of accuracy over time, as each individual instance becomes increasingly less significant to the overall average.
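The Interleaved Test-Then-Train loop can be sketched in a few lines. The learner here is a stand-in with the obvious predict/train interface; the function name is ours (the scheme is also known as prequential evaluation).

```python
def interleaved_test_then_train(stream, learner):
    """For each (x, y): test first, then train; return running accuracies."""
    correct = seen = 0
    accuracies = []
    for x, y in stream:
        correct += (learner.predict(x) == y)   # test on the unseen instance...
        seen += 1
        learner.train(x, y)                    # ...then train on it
        accuracies.append(correct / seen)
    return accuracies
```

Because every instance is tested before being learnt, the running accuracy is an honest estimate, and later instances perturb it less and less, which is why the resulting plots are smooth.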
The experiments were performed on a 2.59 GHz Intel Core 2 Duo processor with 3 GB main memory, running Ubuntu 10.04. For comparison of performance, different ensemble techniques, viz. OzaBoost, OCBoost, OzaBag and OzaBagADWIN, are used, the first 3 with the wrapper class SingleClassifierDrift (the EDDM approach as implemented in MOA), which includes the drift detection method. The Hoeffding tree is used as the base learner for all ensembles, and the size of each ensemble was kept at 10.

For artificial datasets, the accuracy presented in the next section is reset at time step N + 1, when the drift starts to happen. This was done to allow evaluation of the behaviour of the approach from the moment the drift starts to occur. For real datasets, the accuracy is never reset. For plotting the graphs of accuracy versus number of instances, in order to compare the recovery and accuracy of the proposed approaches with the existing ones at various time steps, we used GNUplot version 4.2, a command-driven interactive function plotting program.

6.4 User Interface for GPS

We implemented the proposed approaches, viz. GPSGradual and GPSAbrupt, in MOA and developed a user interface for them. The interface shows various options, viz. base learner, type of drift, window size and post-warning level. MOA provides different classes for developing a GUI, which reduces the workload to a great extent. The base class for the approach is GPS, which instantiates objects of GPSGradual and GPSAbrupt depending upon the type of drift selected by the user. Fig. 6.1 shows the user interface for the GPS algorithms.

Figure 6.1: User Interface for GPS in MOA

6.5 Determination of Parameters

6.5.1 Size of Instance Window

We experimented with window sizes ranging from 10 to 100 in steps of 10 and compared them for each dataset. It was observed that a window size of 50 gives the best results over all the datasets, so we have set the optimal value of the window size to 50 (Table 6.2).
6.5.2 Ratio Value for Post Warning Level

The Post Warning level is a new level added to EDDM between the warning level and the drift level. The purpose of adding this level is to determine a time during the drift period when the instance window properly contains instances of the new concept. In EDDM, the ratio (p_i + 2s_i)/(p_max + 2s_max) is checked against the predefined values alpha (0.95) and beta (0.90) to detect the warning and drift levels respectively. Hence we needed to define a new value between alpha and beta for detecting the post-warning level, and we experimented with different values between the warning and drift thresholds, i.e. 0.95 and 0.90. The values used in the experimentation were 0.91, 0.92, 0.925, 0.93 and 0.94 (Table 6.3).

Table 6.2: AVG. ACCURACIES FOR DIFFERENT INSTANCE WINDOW SIZES. DATASET-SIZE : 50,000

DATASETS   10      20      30      40      50      60      70      80      90      100
ABRUPT
circle     0.906   0.909   0.908   0.903   0.91    0.906   0.901   0.901   0.904   0.896
sineV      0.966   0.964   0.968   0.964   0.969   0.968   0.968   0.962   0.967   0.965
sineH      0.796   0.802   0.79    0.808   0.788   0.78    0.807   0.76    0.77    0.775
line       0.984   0.984   0.984   0.983   0.985   0.983   0.985   0.985   0.984   0.984
plane      0.82    0.821   0.817   0.826   0.828   0.819   0.827   0.823   0.826   0.822
GRADUAL : drifting_time = 0.25N
circle     0.866   0.865   0.866   0.865   0.87    0.869   0.866   0.861   0.854   0.852
sineV      0.916   0.915   0.905   0.914   0.919   0.91    0.905   0.909   0.917   0.916
sineH      0.819   0.817   0.816   0.813   0.823   0.825   0.818   0.824   0.81    0.824
line       0.93    0.935   0.927   0.932   0.937   0.924   0.932   0.932   0.918   0.932
plane      0.838   0.825   0.825   0.839   0.832   0.829   0.831   0.834   0.835   0.834
GRADUAL : drifting_time = 0.50N
circle     0.852   0.852   0.854   0.856   0.84    0.86    0.85    0.862   0.852   0.847
sineV      0.893   0.89    0.888   0.89    0.894   0.886   0.889   0.886   0.888   0.886
sineH      0.813   0.813   0.813   0.808   0.821   0.816   0.815   0.821   0.812   0.815
line       0.91    0.911   0.91    0.908   0.915   0.913   0.909   0.915   0.892   0.905
plane      0.827   0.817   0.803   0.826   0.82    0.816   0.82    0.824   0.824   0.824
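The detection levels, including the new Post Warning level, can be sketched as follows. The thresholds (0.95, 0.925, 0.90) and the 30-error warm-up come from the text; the incremental mean/standard-deviation bookkeeping and the reset policy below are our own simplifications, not MOA's exact EDDM implementation.

```python
import math

class EDDMSketch:
    """Tracks the distance (in instances) between consecutive errors and
    compares p_i + 2*s_i against its observed maximum."""
    ALPHA, POST, BETA, MIN_ERRORS = 0.95, 0.925, 0.90, 30

    def __init__(self):
        self.t = 0
        self.last_error = 0
        self.n_errors = 0
        self.mean = 0.0        # running mean of error distances (p_i)
        self.m2 = 0.0          # running sum of squared deviations
        self.max_level = 0.0   # largest p_i + 2*s_i seen so far

    def update(self, correct):
        """Feed one prediction outcome; return the detection level."""
        self.t += 1
        if correct:
            return "STABLE"
        distance = self.t - self.last_error
        self.last_error = self.t
        self.n_errors += 1
        delta = distance - self.mean            # Welford's online mean/std
        self.mean += delta / self.n_errors
        self.m2 += delta * (distance - self.mean)
        std = math.sqrt(self.m2 / self.n_errors)
        level = self.mean + 2.0 * std
        if level >= self.max_level:             # errors growing sparser: fine
            self.max_level = level
            return "STABLE"
        if self.n_errors < self.MIN_ERRORS:     # too few errors to judge
            return "STABLE"
        ratio = level / self.max_level
        if ratio < self.BETA:                   # drift confirmed: reset maxima
            self.max_level = level
            return "DRIFT"
        if ratio < self.POST:
            return "POSTWARNING"
        return "WARNING" if ratio < self.ALPHA else "STABLE"
```

On a stream whose errors suddenly become denser, the ratio falls through the warning, post-warning and drift thresholds in turn.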
It was found that the value 0.925 gave the best results for the training of the new ensemble, so the Post Warning level has been set to 0.925.

Table 6.3: AVG. ACCURACIES FOR DIFFERENT VALUES OF POST WARNING LEVEL. DATASET-SIZE : 50,000

DATASETS   0.91    0.92    0.925   0.93    0.94
GRADUAL : drifting_time = 0.25N
circle     0.865   0.870   0.870   0.869   0.834
sineV      0.908   0.906   0.916   0.915   0.914
sineH      0.807   0.787   0.823   0.816   0.82
line       0.933   0.932   0.930   0.931   0.932
plane      0.829   0.826   0.832   0.83    0.826
GRADUAL : drifting_time = 0.50N
circle     0.852   0.853   0.840   0.851   0.830
sineV      0.891   0.889   0.895   0.893   0.892
sineH      0.801   0.786   0.821   0.815   0.812
line       0.912   0.91    0.911   0.913   0.902
plane      0.822   0.803   0.824   0.822   0.816

Chapter 7 Results and Analysis

We compared the proposed approaches with different online learning algorithms, viz. the EDDM approach (ensemble: OzaBoost), the EDDM approach (ensemble: OzaBag), the EDDM approach (ensemble: OCBoost) and OzaBagADWIN. The base learner for the ensembles of all the approaches was the Hoeffding Tree, and the size of each ensemble was kept at 10. For the GPSGradual algorithm, we set the Post Warning level to 0.925 and the window size to 50, as mentioned previously, and performed comparisons on all the datasets with single as well as multiple gradual drifts; the drifting_time (speed of the drift) was varied as 0.25N and 0.5N. For the GPSAbrupt algorithm, we set the window size to 50 and performed comparisons on all the datasets with single and multiple abrupt drifts. Moreover, to test the sensitivity of the algorithms, we experimented on driftless as well as noisy datasets. As mentioned previously, we have also used datasets having drifts of different severities.

7.1 Accuracy

It was observed that the algorithms GPSGradual and GPSAbrupt performed better in terms of accuracy than the other existing standard algorithms. The results for all the datasets used are tabulated in the tables below (Table 7.1 to Table 7.13).
To show the improvement in the recovery time of the system, we have given comparison graphs for some of the datasets (Fig. 7.1 to Fig. 7.16).

7.1.1 Tables of Results

We have tabulated the average (overall) classification accuracy for all the different datasets described previously. We specifically calculated the average accuracy to give a picture of the overall performance of the algorithms on a given dataset. The tables below are arranged according to the severity of the drifts, viz. high, medium and low. For a given severity, tables for the different dataset sizes are given, viz. 50000 (1 drift), 2000 (1 drift), 100000 (1 drift) and 100000 (3 drifts). To analyze the performance of GPSAbrupt and GPSGradual, each table is divided into two parts, ABRUPT and GRADUAL; the GRADUAL part is further divided into two parts according to the speed of the drift, viz. drifting_time = 0.25N and drifting_time = 0.50N. The table rows correspond to the different datasets and the table columns to the different algorithms used for comparison, so each table cell gives the average classification accuracy of an algorithm on a given dataset. Table 7.13 contains the results for the artificial datasets SEA, Waveform and Hyperplane, as well as for the real datasets spam corpus, forest cover (Covtype), electricity (ELEC2) and usenet. SEA, Hyperplane and Usenet contain abrupt drift, while Waveform and Spam Corpus contain gradual drift.

Table 7.1: AVG. ACCURACIES. DATASET-SIZE : 50000, NO.
OF DRIFTS : 1, SEVERITY : HIGH

ABRUPT
Dataset  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSAbrupt
circle   0.847          0.816         0.813        0.796        0.900
sineV    0.958          0.915         0.949        0.927        0.963
sineH    0.709          0.658         0.666        0.686        0.788
line     0.971          0.961         0.949        0.919        0.984
plane    0.845          0.787         0.839        0.748        0.821
GRADUAL
Dataset  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSGradual
drifting_time = 0.25N
circle   0.816          0.791         0.822        0.844        0.870
sineV    0.894          0.869         0.897        0.910        0.916
sineH    0.757          0.656         0.724        0.739        0.823
line     0.906          0.896         0.912        0.923        0.930
plane    0.826          0.760         0.818        0.830        0.832
drifting_time = 0.50N
circle   0.797          0.779         0.819        0.849        0.840
sineV    0.866          0.842         0.882        0.891        0.894
sineH    0.745          0.653         0.745        0.742        0.821
line     0.885          0.857         0.888        0.908        0.911
plane    0.814          0.745         0.807        0.821        0.820

Table 7.2: AVG. ACCURACIES. DATASET-SIZE : 2000, NO. OF DRIFTS : 1, SEVERITY : HIGH

ABRUPT
Dataset  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSAbrupt
circle   0.813          0.809         0.826        0.561        0.796
sineV    0.904          0.875         0.896        0.614        0.935
sineH    0.532          0.537         0.605        0.606        0.607
line     0.872          0.837         0.880        0.719        0.924
plane    0.765          0.784         0.782        0.681        0.751
GRADUAL
Dataset  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSGradual
drifting_time = 0.25N
circle   0.782          0.769         0.793        0.727        0.795
sineV    0.833          0.809         0.839        0.814        0.856
sineH    0.531          0.526         0.606        0.605        0.593
line     0.831          0.805         0.848        0.833        0.863
plane    0.726          0.740         0.754        0.708        0.767
drifting_time = 0.50N
circle   0.788          0.764         0.792        0.737        0.794
sineV    0.808          0.781         0.820        0.823        0.844
sineH    0.587          0.529         0.610        0.621        0.610
line     0.803          0.780         0.818        0.839        0.857
plane    0.704          0.691         0.722        0.730        0.750

Table 7.3: AVG. ACCURACIES. DATASET-SIZE : 100000, NO.
OF DRIFTS : 1, SEVERITY : HIGH

ABRUPT
Dataset  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSAbrupt
circle   0.882          0.828         0.826        0.832        0.925
sineV    0.971          0.928         0.955        0.934        0.972
sineH    0.784          0.757         0.744        0.764        0.856
line     0.978          0.967         0.955        0.933        0.987
plane    0.848          0.843         0.848        0.803        0.851
GRADUAL
Dataset  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSGradual
drifting_time = 0.25N
circle   0.854          0.806         0.827        0.835        0.895
sineV    0.922          0.898         0.914        0.917        0.936
sineH    0.818          0.724         0.768        0.785        0.870
line     0.929          0.919         0.930        0.932        0.946
plane    0.836          0.801         0.826        0.839        0.838
drifting_time = 0.50N
circle   0.825          0.792         0.818        0.829        0.855
sineV    0.889          0.867         0.895        0.893        0.909
sineH    0.781          0.678         0.779        0.770        0.857
line     0.904          0.878         0.904        0.912        0.922
plane    0.821          0.774         0.804        0.823        0.831

Table 7.4: AVG. ACCURACIES. DATASET-SIZE : 100000, NO. OF DRIFTS : 3, SEVERITY : HIGH

ABRUPT
Dataset  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSAbrupt
circle   0.844          0.807         0.821        0.834        0.888
sineV    0.942          0.930         0.936        0.888        0.963
sineH    0.724          0.667         0.672        0.671        0.781
line     0.933          0.926         0.927        0.914        0.978
plane    0.808          0.802         0.801        0.748        0.795
GRADUAL
Dataset  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSGradual
drifting_time = 0.25N
circle   0.847          0.822         0.831        0.847        0.870
sineV    0.888          0.862         0.891        0.897        0.912
sineH    0.780          0.683         0.722        0.730        0.807
line     0.908          0.891         0.906        0.912        0.924
plane    0.819          0.783         0.808        0.814        0.818
drifting_time = 0.50N
circle   0.833          0.814         0.828        0.837        0.862
sineV    0.863          0.849         0.866        0.878        0.889
sineH    0.757          0.678         0.735        0.736        0.810
line     0.879          0.887         0.876        0.894        0.898
plane    0.796          0.751         0.784        0.803        0.807

Table 7.5: AVG. ACCURACIES. DATASET-SIZE : 50000, NO.
OF DRIFTS : 1, SEVERITY : MEDIUM

ABRUPT
Dataset  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSAbrupt
circle   0.762          0.710         0.767        0.822        0.861
sineV    0.929          0.899         0.939        0.929        0.962
sineH    0.674          0.668         0.680        0.727        0.766
line     0.934          0.896         0.905        0.924        0.979
plane    0.849          0.822         0.843        0.785        0.845
GRADUAL
Dataset  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSGradual
drifting_time = 0.25N
circle   0.787          0.753         0.795        0.843        0.845
sineV    0.922          0.861         0.900        0.923        0.938
sineH    0.736          0.676         0.731        0.749        0.805
line     0.922          0.881         0.906        0.934        0.942
plane    0.837          0.819         0.840        0.846        0.856
drifting_time = 0.50N
circle   0.802          0.782         0.826        0.866        0.860
sineV    0.902          0.864         0.882        0.918        0.922
sineH    0.760          0.679         0.752        0.762        0.831
line     0.873          0.861         0.888        0.925        0.931
plane    0.842          0.803         0.835        0.841        0.854

Table 7.6: AVG. ACCURACIES. DATASET-SIZE : 2000, NO. OF DRIFTS : 1, SEVERITY : MEDIUM

ABRUPT
Dataset  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSAbrupt
circle   0.700          0.690         0.696        0.710        0.705
sineV    0.724          0.752         0.723        0.755        0.873
sineH    0.519          0.518         0.611        0.609        0.617
line     0.713          0.663         0.774        0.751        0.846
plane    0.655          0.636         0.740        0.672        0.706
GRADUAL
Dataset  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBagADWIN  GPSGradual
drifting_time = 0.25N
circle   0.766          0.753         0.771        0.778        0.779
sineV    0.759          0.662         0.744        0.810        0.845
sineH    0.540          0.523         0.605        0.604        0.595
line     0.739          0.642         0.670        0.788        0.841
plane    0.698          0.636         0.741        0.724        0.747
drifting_time = 0.50N
circle   0.796          0.771         0.794        0.804        0.796
sineV    0.780          0.663         0.775        0.833        0.844
sineH    0.546          0.539         0.614        0.613        0.614
line     0.733          0.592         0.659        0.811        0.839
plane    0.714          0.625         0.698        0.750        0.758

Table 7.7: AVG. ACCURACIES. DATASET-SIZE : 100000, NO.
OF DRIFTS : 1, SEVERITY : MEDIUM ABRUPT Dataset EDDM EDDM EDDM OzaBag- GPSAbrupt OzaBoost OCBoost OzaBag ADWIN circle 0.821 0.748 0.819 0.849 0.899 sineV 0.948 0.924 0.948 0.934 0.973 sineH 0.718 0.756 0.712 0.796 0.820 line 0.955 0.931 0.932 0.939 0.984 plane 0.858 0.844 0.838 0.814 0.857 GPSGradual GRADUAL Dataset EDDM EDDM EDDM OzaBag- OzaBoost OCBoost OzaBag ADWIN Gradual : drif ting time = 0.25N circle 0.824 0.765 0.807 0.843 0.866 sineV 0.940 0.894 0.920 0.927 0.951 sineH 0.747 0.751 0.740 0.795 0.839 line 0.944 0.919 0.934 0.941 0.956 plane 0.854 0.834 0.829 0.849 0.860 Gradual : drif ting time = 0.50N circle 0.826 0.781 0.828 0.855 0.871 sineV 0.916 0.891 0.900 0.917 0.933 sineH 0.774 0.755 0.758 0.788 0.862 line 0.901 0.898 0.914 0.928 0.941 plane 0.843 0.820 0.823 0.837 0.8531 Table 7.8: AVG. ACCURACIES. DATASET-SIZE : 100000, NO. OF DRIFTS : 3, SEVERITY : MEDIUM ABRUPT Dataset EDDM EDDM EDDM OzaBag- GPSAbrupt OzaBoost OCBoost OzaBag ADWIN circle 0.775 0.766 0.753 0.804 0.872 sineV 0.949 0.918 0.935 0.893 0.956 sineH 0.720 0.712 0.704 0.696 0.786 line 0.939 0.920 0.920 0.914 0.975 plane 0.854 0.820 0.815 0.788 0.834 GPSGradual GRADUAL Dataset EDDM EDDM EDDM OzaBag- OzaBoost OCBoost OzaBag ADWIN Gradual : drif ting time = 0.25N circle 0.838 0.782 0.817 0.842 0.873 sineV 0.890 0.905 0.922 0.905 0.933 sineH 0.775 0.752 0.748 0.743 0.820 line 0.897 0.904 0.926 0.911 0.945 plane 0.846 0.827 0.828 0.827 0.847 Gradual : drif ting time = 0.50N circle 0.832 0.808 0.815 0.854 0.875 sineV 0.917 0.892 0.912 0.910 0.937 sineH 0.777 0.758 0.759 0.767 0.825 line 0.906 0.878 0.911 0.912 0.928 plane 0.839 0.828 0.822 0.827 0.841 Table 7.9: AVG. ACCURACIES. DATASET-SIZE : 50000, NO. 
OF DRIFTS : 1, SEVERITY : LOW ABRUPT Dataset EDDM EDDM EDDM OzaBag- GPSAbrupt OzaBoost OCBoost OzaBag ADWIN circle 0.835 0.773 0.822 0.882 0.879 sineV 0.902 0.91 0.929 0.934 0.97 sineH 0.754 0.696 0.713 0.778 0.862 line 0.923 0.896 0.933 0.938 0.976 plane 0.853 0.845 0.837 0.851 0.858 GPSGradual GRADUAL Dataset EDDM EDDM EDDM OzaBag- OzaBoost OCBoost OzaBag ADWIN Gradual : drif ting time = 0.25N circle 0.824 0.816 0.864 0.898 0.908 sineV 0.912 0.902 0.927 0.95 0.964 sineH 0.829 0.698 0.757 0.792 0.846 line 0.903 0.899 0.934 0.952 0.967 plane 0.858 0.836 0.841 0.862 0.871 Gradual : drif ting time = 0.50N circle 0.841 0.833 0.879 0.908 0.899 sineV 0.895 0.897 0.922 0.948 0.959 sineH 0.83 0.693 0.777 0.799 0.851 line 0.908 0.914 0.93 0.952 0.956 plane 0.86 0.834 0.843 0.866 0.875 Table 7.10: AVG. ACCURACIES. DATASET-SIZE : 2000, NO. OF DRIFTS : 1, SEVERITY : LOW ABRUPT Dataset EDDM EDDM EDDM OzaBag- GPSAbrupt OzaBoost OCBoost OzaBag ADWIN circle 0.796 0.780 0.804 0.814 0.823 sineV 0.670 0.675 0.618 0.848 0.850 sineH 0.543 0.514 0.616 0.614 0.612 line 0.591 0.713 0.746 0.831 0.766 plane 0.614 0.667 0.659 0.752 0.767 GPSGradual GRADUAL Dataset EDDM EDDM EDDM OzaBag- OzaBoost OCBoost OzaBag ADWIN Gradual : drif ting time = 0.25N circle 0.831 0.817 0.834 0.841 0.834 sineV 0.523 0.540 0.771 0.864 0.909 sineH 0.551 0.547 0.625 0.625 0.635 line 0.574 0.722 0.704 0.840 0.841 plane 0.557 0.682 0.670 0.766 0.658 Gradual : drif ting time = 0.50N circle 0.844 0.830 0.844 0.853 0.858 sineV 0.527 0.541 0.773 0.869 0.873 sineH 0.564 0.543 0.618 0.617 0.609 line 0.521 0.733 0.728 0.846 0.860 plane 0.549 0.694 0.682 0.778 0.684 Table 7.11: AVG. ACCURACIES. DATASET-SIZE : 100000, NO. 
OF DRIFTS : 1, SEVERITY : LOW ABRUPT Dataset EDDM EDDM EDDM OzaBag- GPSAbrupt OzaBoost OCBoost OzaBag ADWIN circle 0.882 0.797 0.850 0.899 0.908 sineV 0.933 0.935 0.944 0.935 0.976 sineH 0.819 0.780 0.755 0.835 0.912 line 0.954 0.928 0.953 0.945 0.982 plane 0.859 0.850 0.868 0.850 0.871 GPSGradual GRADUAL Dataset EDDM EDDM EDDM OzaBag- OzaBoost OCBoost OzaBag ADWIN Gradual : drif ting time = 0.25N circle 0.850 0.840 0.880 0.902 0.926 sineV 0.939 0.926 0.941 0.947 0.970 sineH 0.888 0.780 0.779 0.836 0.889 line 0.937 0.927 0.953 0.954 0.975 plane 0.854 0.847 0.873 0.861 0.872 Gradual : drif ting time = 0.50N circle 0.868 0.860 0.879 0.905 0.903 sineV 0.921 0.919 0.936 0.944 0.962 sineH 0.885 0.774 0.777 0.838 0.877 line 0.939 0.934 0.945 0.951 0.964 plane 0.859 0.844 0.871 0.864 0.875 Table 7.12: AVG. ACCURACIES. DATASET-SIZE : 100000, NO. OF DRIFTS : 3, SEVERITY : LOW ABRUPT Dataset EDDM EDDM EDDM OzaBag- GPSAbrupt OzaBoost OCBoost OzaBag ADWIN circle 0.882 0.797 0.850 0.899 0.908 sineV 0.933 0.935 0.944 0.935 0.976 sineH 0.819 0.780 0.755 0.835 0.912 line 0.954 0.928 0.953 0.945 0.982 plane 0.859 0.850 0.868 0.850 0.871 GPSGradual GRADUAL Dataset EDDM EDDM EDDM OzaBag- OzaBoost OCBoost OzaBag ADWIN Gradual : drif ting time = 0.25N circle 0.850 0.840 0.880 0.902 0.926 sineV 0.939 0.926 0.941 0.947 0.970 sineH 0.888 0.780 0.779 0.836 0.889 line 0.937 0.927 0.953 0.954 0.975 plane 0.854 0.847 0.873 0.861 0.872 Gradual : drif ting time = 0.50N circle 0.868 0.860 0.879 0.905 0.903 sineV 0.921 0.919 0.936 0.944 0.962 sineH 0.885 0.774 0.777 0.838 0.877 line 0.939 0.934 0.945 0.951 0.964 plane 0.859 0.844 0.871 0.864 0.875 Table 7.13: AVG. ACCURACIES. 
MISCELLANEOUS DATASETS

OTHER ARTIFICIAL DATASETS
Dataset                    EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBag-ADWIN  GPS
SEA (100k, 3 Drifts)       0.796          0.717         0.794        0.798         0.829
Waveform (100k, 3 Drifts)  0.522          0.341         0.555        0.544         0.577
Waveform (150k, 0 Drift)   0.628          0.380         0.649        0.647         0.666
Hyperplane (50k, 1 Drift)  0.713          0.714         0.698        0.712         0.773

REAL DATASETS
Dataset      EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBag-ADWIN  GPS
Spam Corpus  0.777          0.822         0.868        0.814         0.888
Covertype    0.666          0.711         0.725        0.667         0.733
ELEC2        0.654          0.753         0.647        0.795         0.755
Usenet       0.489          0.487         0.483        0.494         0.501

7.1.2 Graphs of Results

In addition to the tables of results, we also provide graphs of instantaneous accuracy versus the number of instances for some of the datasets. These graphs help visualize the behaviour of the algorithms during the drifting period as well as the recovery of the classification system. To analyze the recovery of the algorithms after a drift occurs, we explicitly set the instantaneous accuracy to zero at the actual drift point. Some of the graphs plotted are given below. Due to space limitations, all 180 graphs (one corresponding to each dataset) could not be included. Hence, to cover the various possible cases, graphs of different severities, dataset sizes, and types and numbers of drifts are given.

Figure 7.1: Dataset: Circle, Size: 50000, Drift: Abrupt, Severity: High
Figure 7.2: Dataset: SineH, Size: 50000, Drift: Gradual (0.25N), Severity: High
Figure 7.3: Dataset: Plane, Size: 50000, Drift: Gradual (0.50N), Severity: High
Figure 7.4: Dataset: Line, Size: 2000, Drift: Abrupt, Severity: Medium
Figure 7.5: Dataset: SineV, Size: 2000, Drift: Gradual (0.25N), Severity: Medium
Figure 7.6: Dataset: Plane, Size: 2000, Drift: Gradual (0.50N), Severity: Medium
Figure 7.7: Dataset: SineH, Size: 100000, Drift: Abrupt, Severity: Low, No. of Drifts: 1
Figure 7.8: Dataset: Circle, Size: 100000, Drift: Gradual (0.25N), Severity: Low, No. of Drifts: 1
Figure 7.9: Dataset: SineV, Size: 100000, Drift: Gradual (0.50N), Severity: Low, No. of Drifts: 1
Figure 7.10: Dataset: SineV, Size: 100000, Drift: Abrupt, Severity: High, No. of Drifts: 3
Figure 7.11: Dataset: Circle, Size: 100000, Drift: Gradual (0.25N), Severity: High, No. of Drifts: 3
Figure 7.12: Dataset: SineH, Size: 100000, Drift: Gradual (0.50N), Severity: High, No. of Drifts: 3
Figure 7.13: Dataset: Hyperplane, Size: 50000, No. of Drifts: 1
Figure 7.14: Dataset: Waveform, Size: 150000, No. of Drifts: 0
Figure 7.15: Dataset: Spam Corpus
Figure 7.16: Dataset: Forest Cover

7.2 Noise Sensitivity

The proposed algorithms work for noiseless as well as noisy data streams. Class noise is defined as the percentage of the total instances whose actual class labels have been changed. Different amounts of class noise were added to the plane dataset to test the sensitivity of the algorithms. It can be seen from Table 7.14 that the algorithms perform well for noise levels up to 20%. However, their performance degrades for higher percentages of noise in the data. This is because more noisy instances are present in the instance window, and the ensemble gets trained on them unnecessarily during the drifting period.

Table 7.14: AVG. ACCURACIES FOR DIFFERENT NOISE LEVELS.
DATASET : PLANE

ABRUPT
Noise (%)  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBag-ADWIN  GPSAbrupt
0          0.965          0.938         0.953        0.940         0.984
5          0.875          0.844         0.882        0.775         0.885
10         0.845          0.787         0.839        0.748         0.821
15         0.780          0.748         0.782        0.688         0.792
20         0.737          0.720         0.739        0.697         0.718
25         0.694          0.688         0.706        0.695         0.677
30         0.650          0.632         0.660        0.664         0.647

GRADUAL : drifting time = 0.25N
Noise (%)  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBag-ADWIN  GPSGradual
0          0.883          0.857         0.868        0.922         0.929
5          0.831          0.797         0.854        0.872         0.880
10         0.826          0.760         0.818        0.830         0.832
15         0.771          0.729         0.775        0.789         0.794
20         0.734          0.714         0.734        0.744         0.735
25         0.702          0.688         0.714        0.708         0.704
30         0.666          0.646         0.676        0.676         0.664

GRADUAL : drifting time = 0.50N
Noise (%)  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBag-ADWIN  GPSGradual
0          0.855          0.834         0.855        0.910         0.904
5          0.817          0.776         0.828        0.850         0.857
10         0.814          0.745         0.807        0.817         0.820
15         0.760          0.716         0.764        0.783         0.787
20         0.724          0.697         0.731        0.743         0.745
25         0.696          0.674         0.703        0.709         0.708
30         0.664          0.648         0.667        0.677         0.676

7.3 Memory and Time Bounds

The Interleaved Test-Then-Train evaluation model [8] has been used, as mentioned previously. In this model, each individual example is used to test the classifier before it is used for training, and the classification state is updated incrementally. Hence, the GPS algorithms are single-pass and incremental. As each example is processed only once, as soon as it arrives, the algorithms can in principle process an infinite stream of data. Compared to the existing standard algorithms, the proposed algorithms take more memory, as shown in Table 7.16. The reason is that, while all the other methods (viz. EDDM-OzaBoost, EDDM-OzaBag, EDDM-OCBoost, OzaBagAdwin) maintain only one ensemble during the processing period, the proposed algorithms (GPSGradual and GPSAbrupt) maintain two ensembles as well as an additional instance window throughout the classification process.
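The Interleaved Test-Then-Train loop described above can be sketched as follows. This is a minimal illustration, not the actual evaluation code used in the experiments; the `MajorityClass` learner and the `predict`/`learn` method names are hypothetical placeholders for any incremental classifier.

```python
def prequential_evaluate(model, stream):
    """Interleaved Test-Then-Train: each arriving instance is first used
    to test the model, then to train it, in a single pass over the stream."""
    correct = total = 0
    for x, y in stream:              # the stream may be unbounded
        if model.predict(x) == y:    # test on the incoming instance first
            correct += 1
        total += 1
        model.learn(x, y)            # then train on that same instance
    return correct / total           # prequential (interleaved) accuracy

# Trivial incremental learner, only to make the sketch self-contained:
# it always predicts the majority class seen so far.
class MajorityClass:
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

# 90 instances of class 0 followed by 10 of class 1 (an abrupt change).
stream = [(i, 0) for i in range(90)] + [(i, 1) for i in range(10)]
acc = prequential_evaluate(MajorityClass(), stream)   # -> 0.89
```

Because testing happens before training on every instance, no separate holdout set is needed and the accuracy estimate is updated incrementally, which is why the scheme suits single-pass stream processing.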
The time taken by these algorithms is more than that of the existing ones (Table 7.15), because newEnsemble is trained right from the beginning of the data stream and, during the drifting period, it is additionally trained on the instance window. The time requirements of GPSGradual and GPSAbrupt can be substantially reduced by training the two ensembles (currentEnsemble and newEnsemble) on parallel processors.

Table 7.15: PROCESSING TIME (IN SECONDS). DATASET-SIZE : 50000

ABRUPT
Dataset  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBag-ADWIN  GPSAbrupt
circle   1.26           4.44          5.38         8.14          11.40
sineV    1.24           7.94          4.48         6.06          8.30
sineH    1.14           8.96          8.08         10.68         12.52
line     1.20           7.44          5.88         8.42          8.60
plane    1.94           16.36         19.16        22.44         26.66

GRADUAL : drifting time = 0.25N
Dataset  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBag-ADWIN  GPSGradual
circle   1.24           4.98          5.34         6.06          11.36
sineV    1.32           8.44          5.44         5.60          12.60
sineH    1.24           8.66          7.70         10.78         11.68
line     1.18           7.66          6.68         7.56          8.62
plane    1.92           13.52         19.88        22.18         23.44

GRADUAL : drifting time = 0.50N
Dataset  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBag-ADWIN  GPSGradual
circle   1.24           5.66          5.38         6.36          10.50
sineV    1.32           8.54          4.94         5.58          11.50
sineH    1.20           8.94          10.44        12.14         10.68
line     1.20           7.74          6.96         5.56          8.42
plane    2.04           11.56         18.46        22.62         18.94

Table 7.16: MEMORY (IN BYTES).
DATASET-SIZE : 50000

ABRUPT
Dataset  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBag-ADWIN  GPSAbrupt
circle   1408           14464         141232       147184        558856
sineV    1408           124384        79392        75240         393656
sineH    1408           161120        264928       187272        596232
line     1408           100096        80016        94984         329384
plane    5184           332160        507728       525048        1226424

GRADUAL : drifting time = 0.25N
Dataset  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBag-ADWIN  GPSGradual
circle   1408           147952        123888       64944         452416
sineV    1408           109344        132960       70160         470656
sineH    1408           93504         252528       122104        606208
line     1408           92784         141264       73960         294576
plane    5184           399552        523280       254888        1245232

GRADUAL : drifting time = 0.50N
Dataset  EDDM-OzaBoost  EDDM-OCBoost  EDDM-OzaBag  OzaBag-ADWIN  GPSGradual
circle   1408           111248        107568       55656         314128
sineV    1408           121376        129232       67344         344032
sineH    1408           67456         264272       113800        475120
line     1408           108144        154864       70496         283360
plane    5184           337344        420240       218280        755344

Chapter 8 Conclusion

We studied various areas in the data mining domain, focusing specifically on handling concept drifts in online ensemble learning. It was found that little work has been done to exploit the drift type in concept drift handling techniques. We therefore decided to develop different approaches for abrupt and gradual concept drifts, depending on their composition characteristics. The existing EDDM-based drift handling approach was studied and improved upon to develop two new approaches, GPSGradual and GPSAbrupt, which handle gradual and abrupt drift data streams respectively. These classification approaches use the zero-diversity and instance-window techniques to improve the classification algorithm. Zero diversity helps all the classifiers in the ensemble get trained on instances of the new concept during the drifting period, hence adapting to the change quickly. The instance window stores early instances of the new concept, which are used later during the drifting period for training the newEnsemble. This approach of instance selection helps to improve the accuracy of the classification system after the drift is detected.
Because of the different composition characteristics of the data stream during the drifting period, the instances selected for training the newEnsemble differ between GPSGradual and GPSAbrupt. For the same reason, the newEnsemble is trained on the instance window at different levels in the drifting period for GPSGradual and GPSAbrupt. For experimentation we considered various artificial as well as real datasets; thus, a large number of datasets were covered for exhaustive testing of the proposed approaches. The artificial datasets accounted for a wide variety of mathematical problems that can occur in data streams. Experimental results show that the new approaches perform better with respect to accuracy in classifying these datasets compared to the existing approaches. The results also show that the new approaches reduce the recovery time for drifts in data streams.

Chapter 9 Supplementary Work

9.1 Hospitalization Record Analyzer

As part of the summer work (May-Jun 2010) before the actual commencement of the project work, we participated in the IEEE VAST 2010 Challenge [3] and submitted our solution for Mini Challenge 2: Hospitalization Records - Characterization of Pandemic Spread. The mini challenge required analyzing hospitalization records. The datasets provided were the city-wise hospitalization records for a particular pandemic. The task was to analyze these datasets and characterize the spread of the pandemic by taking into account the symptoms of the disease, mortality rates, and the temporal patterns of the onset, peak and recovery of the disease. A comparison of the outbreak of the pandemic across cities was also required, and a visualization tool was to be developed for solving the above task. We therefore developed a tool in Java titled "Hospitalization Record Analyzer". It gives the values of various factors using filters of city and syndrome on the data set, and also shows plots of the processed data.
These graphs are drawn using the open-source graph plotting software GNUplot (http://www.gnuplot.info/). The tool analyzes the preprocessed data in a variety of ways, including city-wise analysis and overall analysis. A few screenshots of the tool are given below (Fig. 9.1 to Fig. 9.3). Our entry received an average score of 17 out of 30 at the competition.

Figure 9.1: Hospitalization Record Analyzer : Main Window
Figure 9.2: Hospitalization Record Analyzer : Analysis example (Syndrome Distribution)
Figure 9.3: Hospitalization Record Analyzer : Analysis example (City-wise Dead-infected)

9.2 Research Paper Publication

Based on the research work carried out, we submitted a research paper titled "An Instance-Window based Classification Algorithm for Handling Gradual Concept Drifts", authored by Vahida Attar, Prashant Chaudhary, Sonali Rahagude, Gaurish Chaudhari and Pradeep Sinha. The paper elaborates the approach for handling gradual drift effectively, i.e. the GPSGradual approach. It was submitted to the ADMI (Agents and Data Mining Interaction) workshop [2] conducted at AAMAS 2011 (The Tenth International Conference on Autonomous Agents and Multi-agent Systems) [1], held in Taipei, Taiwan. The paper was accepted for publication in: a) the ADMI-related Springer LNCS/LNAI volume, and b) the AAMAS USB proceedings. An extended version of the paper has also been invited for publication in JAAMAS (Journal of Autonomous Agents and Multi-agent Systems), special issue on agent mining.

9.3 Approaches to Drift Type Detection

As mentioned previously, gradual and abrupt drift streams differ in their composition characteristics. In the case of a gradual drift, the stream immediately after the drift consists of both old and new instances. On the other hand, for an abrupt drift, the stream after the drift occurrence consists of new instances only.
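This composition difference can be illustrated with a small sketch. The linear mixing ramp during the drifting period is an assumption made here for illustration only, and all names in the snippet are hypothetical rather than part of the implemented stream generators.

```python
import random

def concept_source(drift_point, drifting_time, t):
    """Return which concept ('old' or 'new') generates instance t.
    drifting_time = 0 models an abrupt drift: the new concept takes over
    instantly at the drift point.  A positive drifting_time models a
    gradual drift: old and new instances are mixed, with the probability
    of the new concept assumed to rise linearly from 0 to 1 over the
    drifting period."""
    if t < drift_point:
        return 'old'
    if drifting_time == 0 or t >= drift_point + drifting_time:
        return 'new'
    p_new = (t - drift_point) / drifting_time   # linear mixing ramp
    return 'new' if random.random() < p_new else 'old'

random.seed(0)
N = 1000
# Abrupt: every instance after the drift point comes from the new concept.
abrupt = [concept_source(500, 0, t) for t in range(N)]
# Gradual with drifting time 0.5N: a mixture during the drifting period.
gradual = [concept_source(500, N // 2, t) for t in range(N)]
```

With these two traces, `abrupt[500:]` contains only new-concept instances, while `gradual[500:]` contains both kinds, which is exactly the property the drift-type detection approaches below try to exploit.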
Thus, for an abrupt drift, the classification errors immediately after the drift point are more numerous, since only new instances are present and the classifier is not yet trained on the new concept. In the case of a gradual drift, however, the errors are fewer, as there is a mixture of old and new instances.

9.3.1 Approach 1 : Using Standard Deviation Measure

The Early Drift Detection Method (EDDM) flags a warning level when the online error rate of the classifier system crosses a certain bound, and subsequently detects the drift level. The standard deviation s'_i in EDDM is calculated on this error rate. Based on the above discussion, we can infer that the values of the standard deviation at the start of the warning and drift levels will differ between abrupt and gradual drifts. This distinguishing factor could be used to detect the type of drift from the difference between the standard deviations at the warning and drift levels. We describe this approach below. Let i be the first instance after the warning level and f be the final instance before the drift level. We calculate δ = s'_f - s'_i. For an abrupt drift, δ will be small, because the distance between errors is short when the number of errors is high. For a gradual drift, δ will be large, because the distance between errors is greater when the number of errors is lower. It was seen that the threshold on δ is dataset dependent.

9.3.2 Approach 2 : Using Error Rate

In this approach we calculate the error rate between the warning level and the drift level of EDDM. The error rate is calculated as the ratio of the number of errors (i.e. the number of misclassified instances) to the total number of instances. As mentioned previously, there are more errors in the case of an abrupt drift; hence the error rate is higher for an abrupt drift, and correspondingly lower for a gradual drift. We define a threshold δ that differentiates between abrupt and gradual drifts. For noisy datasets (class noise = 10%), the value of δ we determined was 0.45.
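A minimal sketch of this error-rate test follows. It assumes the warning-level and drift-level instance indices are supplied by the detector (EDDM itself is not simulated here), and it uses the δ = 0.45 threshold quoted above; the function name and the example error traces are hypothetical.

```python
def drift_type(errors, warning_idx, drift_idx, delta=0.45):
    """Classify a detected drift as abrupt or gradual from the error rate
    observed between the warning level and the drift level.
    `errors` is a 0/1 sequence, where 1 marks a misclassified instance."""
    window = errors[warning_idx:drift_idx]
    error_rate = sum(window) / len(window)
    # An abrupt drift replaces the concept at once, so almost every
    # instance between the two levels is misclassified at first; a
    # gradual drift still carries old-concept instances, so its error
    # rate between the two levels stays lower.
    return 'gradual' if error_rate < delta else 'abrupt'

# Hypothetical error traces between warning and drift levels:
abrupt_errors  = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]   # error rate 0.8
gradual_errors = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]   # error rate 0.3
abrupt_type  = drift_type(abrupt_errors, 0, 10)    # -> 'abrupt'
gradual_type = drift_type(gradual_errors, 0, 10)   # -> 'gradual'
```

As the drawback noted below explains, the fixed threshold is fragile: added class noise raises the error rate for both drift types, so δ would have to be re-tuned per noise level.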
if error rate < δ then drift type = gradual
else if error rate > δ then drift type = abrupt

Drawback: The threshold δ depends on the amount of noise in the dataset, because as the noise increases, the number of errors also increases. Hence, different amounts of noise result in different error rates for a given dataset.

9.3.3 Approach 3 : Generating Association Rules/Decision Trees for Drift Type

EDDM maintains different parameters while detecting drift, viz. the mean distance between errors (mean), its standard deviation (std), m2s (mean + 2 * std), etc. For different types of drifts, these values differ, due to the different composition characteristics of the drifts. An approach can therefore be devised to detect the type of drift by building decision trees or association rules from these parameters, learned from processed data streams whose drift type is already known and used as the class label. These decision trees or association rules can then be used to determine the drift type of new data streams with unknown drift types. Detecting the type in this case is thus a binary classification problem. To implement this approach, we generated data log files (containing the parameters mentioned above) for various drift data streams. Batch-learning algorithms from WEKA [5] were used to learn from these data log files, and association rules were generated as shown in Fig. 9.4. The framework is shown in Fig. 9.5.

Figure 9.4: Example of Association Rules created using JRIP
Figure 9.5: Drift Detection Framework

9.4 ADWIN Integrated GPS Approach

The ADWIN drift detection method has several advantages over EDDM. ADWIN is a windowing technique based on partial example memory, whereas EDDM is based on the online error rate of the classifier system, with no example memory. ADWIN detects fewer false drifts than EDDM, and its performance on noisy data streams is also better than that of EDDM. It is therefore a better drift detection method.
As our approach is entirely dependent on the drift detection method used, using ADWIN instead of EDDM can help attain greater accuracy for the system. Hence, we tried replacing EDDM with ADWIN. There were challenges in incorporating ADWIN into the proposed approach. The proposed approach strictly requires a warning level and a drift level for training the newEnsemble. However, ADWIN provides only a drift level and no warning level. Thus, modifying ADWIN to determine a warning level and a post-warning level for the GPSGradual algorithm was necessary. EDDM defines precise values for the warning and drift levels, from which deducing a post-warning level was easy; ADWIN gives no such predefined values. As mentioned previously, ADWIN has a predefined parameter ε to detect change. Hence, to introduce a warning level, we experimented with values close to ε, viz. ε/2, 3ε/4, 5ε/6, 7ε/8 and 9ε/10. It was observed that the value 3ε/4 gave the best results for determining the warning level in ADWIN. To introduce the post-warning level in the GPSGradual algorithm, we then experimented with values between 3ε/4 and ε, viz. 5ε/6, 7ε/8, 9ε/10 and 11ε/12. It was observed that the value 9ε/10 gave the best results for determining the post-warning level in ADWIN. This modified ADWIN was integrated into the proposed approaches, GPSGradual and GPSAbrupt. Experiments reveal that using ADWIN as the drift detection method improves the accuracy of the algorithms compared to EDDM (Table 9.1). Due to lack of time, exhaustive testing of the ADWIN-integrated approaches could not be performed.

Table 9.1: AVERAGE ACCURACIES FOR GPS(EDDM) AND GPS(ADWIN).
DATASET-SIZE : 50,000

ABRUPT
ALGORITHM   Circle  SineH  SineV  Line   Plane
GPSAbrupt   0.900   0.963  0.788  0.984  0.821
GPSADWIN    0.934   0.971  0.804  0.978  0.758

GRADUAL

Gradual : drifting time = 0.25N
ALGORITHM   Circle  SineH  SineV  Line   Plane
GPSGradual  0.870   0.916  0.823  0.930  0.832
GPSADWIN    0.894   0.922  0.822  0.937  0.838

Gradual : drifting time = 0.50N
ALGORITHM   Circle  SineH  SineV  Line   Plane
GPSGradual  0.840   0.894  0.821  0.911  0.820
GPSADWIN    0.878   0.897  0.812  0.921  0.828

Chapter 10 Future Work

The integration of ADWIN with the GPS approaches gave promising results. Future work may therefore include rigorous testing for estimating the warning level automatically in the ADWIN drift detection method; this modified ADWIN algorithm can then replace the EDDM method in the proposed approaches. Currently, the values for the instance-window size and the post-warning level are determined experimentally, so for any speed of drift in the data stream these values remain fixed. Future work may thus also include determining the post-warning level and the window size automatically according to the speed of drift, which may improve the accuracy of the classification system. Future work may further include implementing the approaches on parallel processors in order to speed up the system. Finally, a framework could be developed in which the type of drift (abrupt or gradual) is first detected automatically and, based on this type, one of the proposed approaches, GPSAbrupt or GPSGradual, is applied.

Bibliography

[1] AAMAS 2011 - The Tenth International Conference on Autonomous Agents and Multiagent Systems. http://www.aamas2011.tw/.
[2] ADMI 2011 - The Seventh International Workshop on Agents and Data Mining Interaction. http://admi11.agentmining.org/.
[3] IEEE VAST Challenge 2010. http://hcil.cs.umd.edu/localphp/hcil/vast10/index.php.
[4] UCI repository Covertype dataset. http://archive.ics.uci.edu/ml/datasets/Covertype.
[5] Weka. http://www.cs.waikato.ac.nz/ml/weka/.
[6] R. A. Berk. An introduction to ensemble methods for data analysis.
Sociological Methods & Research, 34(3):263–295, 2006.
[7] A. Bifet and R. Gavalda. Learning from time-changing data with adaptive windowing. In SIAM International Conference on Data Mining, 2007.
[8] A. Bifet and R. Kirkby. Data stream mining - a practical approach. http://moa.cs.waikato.ac.nz/downloads/.
[9] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984.
[10] A. Fern and R. Givan. Online ensemble learning: An empirical study. Machine Learning, 53:71–109, 2003.
[11] J. Gama, P. Medas, G. Castillo, and P. Rodrigues. Learning with drift detection. In Proceedings of the 7th Brazilian Symposium on Artificial Intelligence (SBIA '04) - Lecture Notes in Computer Science, volume 3171, pages 286–295, Sao Luiz do Maranhao, Brazil, 2004. Springer.
[12] I. Katakis, G. Tsoumakas, and I. Vlahavas. An ensemble of classifiers for coping with recurring contexts in data streams. In 18th European Conference on Artificial Intelligence, Patras, Greece, 2008.
[13] I. Katakis, G. Tsoumakas, and I. Vlahavas. Tracking recurring contexts using ensemble classifiers: An application to email filtering. Knowledge and Information Systems, 22:371–391, 2009.
[14] M. Baena-Garcia, J. del Campo-Avila, R. Fidalgo, and A. Bifet. Early drift detection method. In Proceedings of the 4th ECML PKDD International Workshop on Knowledge Discovery From Data Streams (IWKDDS '06), pages 77–86, Berlin, Germany, 2006.
[15] F. L. Minku, H. Inoue, and X. Yao. Negative correlation in incremental learning. Natural Computing Journal - Special Issue on Nature-inspired Learning and Adaptive Systems, page 32, 2008.
[16] F. L. Minku and X. Yao. Using diversity to handle concept drift in on-line learning. IEEE Transactions on Knowledge and Data Engineering, 2009.
[17] L. Minku, A. White, and X. Yao. The impact of diversity on on-line ensemble learning in the presence of concept drift. IEEE Transactions on Knowledge and Data Engineering, 2008.
[18] N. C. Oza and S. Russell. Experimental comparisons of online and batch versions of bagging and boosting. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 359–364, San Francisco, California, 2001.
[19] N. C. Oza and S. Russell. Online bagging and boosting. In Proceedings of the 2005 IEEE International Conference on Systems, Man and Cybernetics, volume 3, pages 2340–2345. Institute for Electrical and Electronics Engineers, New Jersey, 2005.
[20] R. Pelossof, M. Jones, I. Vovsha, and C. Rudin. Online coordinate boosting. In On-line Learning for Computer Vision Workshop (OLCV), 2009.
[21] R. Polikar, L. Udpa, S. S. Udpa, and V. Honavar. Learn++: An incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man, and Cybernetics - Part C, 31:497–508, 2001.
[22] W. N. Street and Y. Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pages 377–382, 2001.
[23] A. Tsymbal, M. Pechenizkiy, P. Cunningham, and S. Puuronen. Dynamic integration of classifiers for handling concept drift. Information Fusion, 9:56–68, 2008.
[24] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), page 226, 2003.
[25] I. Zliobaite. Learning under concept drift: an overview. Technical Report, Faculty of Mathematics and Informatics, Vilnius University, Vilnius, Lithuania, 2009.