Dynamic Integration of Classifiers for Handling Concept Drift

IEEE CBMS’06, DM Track, Salt Lake City, Utah, USA, June 21-23, 2006 (slides dated 22.06.06)

Alexey Tsymbal, Department of Computer Science, Trinity College Dublin, Ireland
Mykola Pechenizkiy, Dept. of Mathematical IT, University of Jyväskylä, Finland
Pádraig Cunningham, Department of Computer Science, Trinity College Dublin, Ireland
Seppo Puuronen, Dept. of CS and IS, University of Jyväskylä, Finland

Outline
- Introduction
  • Supervised Learning
  • The Problem of Concept Drift (CD)
- Approaches to Handle CD
  • Instance selection, instance weighting, and ensemble learning
- Dynamic Integration of Classifiers for Handling CD
  • Dynamic Selection, Dynamic Integration, and their mix
- Domain of Antibiotic Resistance
  • How resistance occurs, concept drift context
- Experiment Design
  • C4.5 ensembles with static and dynamic integration
- Results and Conclusion

The Task of Classification
J classes, n training observations, p features.
- Given: a training set of n instances (x_i, y_i), where x_i are the attribute values and y_i is the class.
- Goal: given a new instance x_0, predict its class y_0.
- (Diagram: training set and new instance to be classified -> CLASSIFICATION -> class membership of the new instance.)
- Examples: diagnosis of thyroid diseases; heart attack prediction, etc.

The Task of Classification: Predicting Antibiotic Resistance
- Predict the sensitivity of a pathogen to an antibiotic based on data about the antibiotic, the isolated pathogen, and the demographic and clinical features of the patient.

The Problem of Concept Drift
Changes in the hidden context can induce more or less radical (gradual or abrupt) changes in the target concept.
- A typical example is antibiotic resistance:
  • pathogen sensitivity may change over time as new pathogen strains develop resistance to antibiotics that were previously effective.
- Even in the most strictly controlled environments, unexpected changes may happen and make it necessary to change the model, due to:
  • failure and replacement of some medical equipment, or
  • changes in personnel.
- The need to change the current model because of a change in the data distribution is called virtual concept drift.
An effective learner should be able to track such changes and adapt to them quickly.

Approaches to Handle Concept Drift
Instance selection:
- select instances relevant to the current concept;
- generalize from a moving window and use the learnt concepts for prediction only in the immediate future;
- case-base editing strategies in CBR that delete noisy, irrelevant, and redundant cases.
Instance weighting:
- weight instances according to their “age” and their competence with respect to the current concept;
- weighting techniques handle CD worse than analogous instance selection techniques (due to overfitting the data).
Ensemble learning:
- maintain a set of concept descriptions, the predictions of which are combined using, e.g., a form of voting;
- divide the data into sequential blocks of fixed size and build an ensemble on them.

Handling Concept Drift with Ensembles
An ensemble is constructed as a set of concept descriptions corresponding to different time intervals.
(Diagram: a time axis, with the training set for the next base classifier marked.)
- Usually simple voting is used for model combination:
  • it does not work in complex domains with local concept drift.
- Our basic idea: use local accuracies for model combination in order to handle local concept drift:
  • this adapts to concept drift better (e.g. with antibiotic resistance data).

Local Concept Drift
In the real world, concept drift may often be local:
- changes in the concept or data distribution occur in some regions of the instance space only;
  • only particular bacteria may develop resistance to certain antibiotics, while resistance to the others could remain the same;
- the type and severity of changes may depend on the location in the instance space.
Local CD: changes in concept and data distribution that occur at an instance level rather than at a data set level.
- Local CD occurs between two consecutive time points
  • if there is a sub-space of the whole instance space whose changes in concept and/or data distribution differ from those in the rest of the data.
- This is reflected by a different change in the (local) predictive performance of the currently used model in this sub-space.
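
The definition of local concept drift above says that drift shows up as a different change in local predictive performance in some sub-space. A minimal sketch of that idea follows, assuming the sub-spaces are given by a hypothetical `region_of` function (for example, grouping by pathogen name) and that a fixed model is evaluated on two consecutive data blocks. None of these names come from the slides; they are illustrative only.

```python
from collections import defaultdict


def accuracy_by_region(instances, model, region_of):
    """Return {region: accuracy} of `model` on a list of (x, y) pairs."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for x, y in instances:
        region = region_of(x)
        counts[region] += 1
        hits[region] += int(model(x) == y)
    return {region: hits[region] / counts[region] for region in counts}


def local_accuracy_change(block_t1, block_t2, model, region_of):
    """Per-region change in accuracy of a fixed model between two blocks.

    Regions whose change differs clearly from the rest hint at local drift.
    """
    before = accuracy_by_region(block_t1, model, region_of)
    after = accuracy_by_region(block_t2, model, region_of)
    return {r: after[r] - before[r] for r in before if r in after}


# Toy data (hypothetical): x is a 1-tuple holding a pathogen label.
always_sensitive = lambda x: "sensitive"
block_t1 = [(("A",), "sensitive"), (("B",), "sensitive")]
block_t2 = [(("A",), "sensitive"), (("B",), "resistant")]
change = local_accuracy_change(
    block_t1, block_t2, always_sensitive, region_of=lambda x: x[0]
)
```

In this toy case only region "B" loses accuracy while region "A" stays stable, which is the situation the slides describe as local drift.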

Stability of Regions: Rotating Hyperplane
[Figure: stability of regions in the rotating hyperplane problem at time points t1, t2, t3, t4.]
Base models of an ensemble should not be discarded if
- their global accuracy on the current block of data falls, but they are still good experts in the stable parts of the data.
One solution to this problem is the use of DIC (dynamic integration of classifiers):
- the models are integrated at an instance level according to their local accuracies.

Local Concept Drift
Most gradual CDs may be considered local if:
- the velocity of change is small relative to the rate at which instances arrive in the data stream;
- most regions of the data remain stable.
Most abrupt CDs are
- not local, unless substantial sub-areas remain stable between the two changing concepts;
- local, if the change relates to a subgroup of the whole population.
CD may also be complex:
- the concept or data distribution may change in different clusters, and potentially in different ways;
- in the antibiotic resistance (AR) problem, changes in AR and in the data distribution are usually different for different bacteria.
Local CD occurs at an instance level, so its treatment should be at that level as well!
Potential approaches to handle local CD:
- CBR: a case base is updated at an instance level;
- a hybrid of ensemble learning and instance selection;
- ensemble integration based on local accuracies.

How Antibiotic Resistance Happens
- In spontaneous DNA mutation, bacterial DNA may mutate spontaneously.
  Drug-resistant tuberculosis arises this way.
- In a form of microbial sex called transformation, one bacterium may take up DNA from another bacterium. Penicillin-resistant gonorrhea results from transformation.
- Resistance can be acquired from a small circle of DNA called a plasmid, which can flit from one type of bacterium to another.
  • A single plasmid can provide a slew of different resistances.
  • In 1968, 12,500 people in Guatemala died in an epidemic of Shigella diarrhea. The microbe harbored a plasmid carrying resistances to four antibiotics!

Data Collection & Organization
N.N. Burdenko Institute of Neurosurgery
- Bacterial analyzer “Vitek-60” (by “bioMérieux”)
- Information systems: "Microbiologist" & "Microbe"
Each instance is one sensitivity test:
- the pathogen that is isolated during the bacterial identification analysis;
- the antibiotic that is used in the sensitivity test;
- the result of the sensitivity test itself (sensitive, resistant, or intermediate), obtained from “Vitek” according to the NCCLS guidelines;
- the above information is connected with the patient, his or her demographic data (sex, age), and hospitalization in the Institute (main department, days spent in ICU, days spent in the hospital before the test, etc.).
4430 sensitivity tests corresponding to a single specimen type (liquor), including the meningitis cases of the years 2002-2004.

Classification over Sequential Data Blocks
[Figure: accuracy of C4.5 ensembles over sequential data blocks 1-27; series: v, wv, ds, dv, dvs; accuracy axis from 0.3 to 0.9.]

Weighted Average of Classification Accuracy
[Figure: min, average, and max accuracy of C4.5 ensembles for v, wv, ds, dv, dvs; accuracy axis from 0.4 to 0.9.]

Summary and Conclusions
- In the real world, concepts are often not stable but change with time, which is known as the problem of concept drift (CD).
- Among the most popular and effective approaches to handling CD is ensemble learning:
  • a set of concept descriptions built on data blocks corresponding to different time intervals is maintained, and
  • the final prediction is the aggregated prediction of the ensemble members.
- We suggested a dynamic integration approach for ensembles (DIC) used in handling CD:
  • it integrates the base classifiers at an instance level, assigning them weights proportional to their local accuracy on each instance considered.
- We considered an example of CD from the area of antibiotic resistance.
- We demonstrated that DIC often results in better accuracy on the considered data set than the more commonly used weighted voting:
  • this supports our hypothesis that favors DIC for handling CD, especially in the presence of local CD.

Contact Info
MS PowerPoint slides of this and other recent talks, and full texts of selected publications, are available online at: http://www.cs.jyu.fi/~mpechen
Mykola Pechenizkiy
Department of Mathematical Information Technology, University of Jyväskylä, FINLAND
E-mail: [email protected]
THANK YOU!

Additional Slides …

Antibiotic Resistance in Nosocomial Infections
- 3-40% of patients admitted to hospital acquire an infection during their stay, and the risk of hospital-acquired (nosocomial) infection has risen steadily in recent decades.
- The frequency depends mostly on the type of operation conducted: it is greater for “dirty” operations (10-40%) and smaller for “pure” operations (3-7%).
- For example, a serious infectious complication such as postoperative meningitis is often the result of nosocomial infection.
- Antibiotics are the drugs commonly used to fight infections caused by bacteria.
- According to Centers for Disease Control and Prevention (CDC) statistics, more than 70% of the bacteria that cause hospital-acquired infections are resistant to at least one of the antibiotics most commonly used to treat infections.
- Analysis of the microbiological data included in antibiograms, collected in different institutions over different periods of time, is considered one of the most important activities for restraining the spread of antibiotic resistance and avoiding the negative consequences of this phenomenon.

Antibiotic Sensitivity of Different Bacteria
[Figure: comparing the antibiotic sensitivity of different bacteria. © Jim Deacon, Institute of Cell and Molecular Biology, The University of Edinburgh]

The Emergence of Antibiotic Resistance
[Figure: effects of different antibiotics on growth of a Bacillus strain. The right-hand image shows a close-up of the novobiocin disk (marked by an arrow on the whole plate). In this case some individual mutant cells in the bacterial population were resistant to the antibiotic and have given rise to small colonies in the zone of inhibition. © Jim Deacon, Institute of Cell and Molecular Biology, The University of Edinburgh]

How Antibiotic Resistance Happens
[Figure: Horizontal Gene Transfer. © Grace Yim and Fan Sozzi]

Mechanisms of Antibiotic Resistance
[Figure: mechanisms of antibiotic resistance. © Grace Yim and Fan Sozzi]

Mechanisms of Antibiotic Resistance
Antibiotic | Method of resistance
Chloramphenicol | reduced uptake into cell
Tetracycline | active efflux from the cell
β-lactams, Erythromycin, Lincomycin | eliminates or reduces binding of antibiotic to target
β-lactams, Erythromycin | hydrolysis
Aminoglycosides, Chloramphenicol, Fosfomycin, Lincomycin | inactivation of antibiotic by enzymatic modification
β-lactams, Fusidic Acid | sequestering of the antibiotic by protein binding
Sulfonamides, Trimethoprim | metabolic bypass of inhibited reaction
Sulfonamides, Trimethoprim | overproduction of antibiotic target (titration)
Bleomycin | binding of specific immunity protein to antibiotic

Dataset Characteristics
Patient and hospitalization related:
Sex | {Male, Female}
Age | Integer
Recurring stay | {True, False}
Days of stay in NSI | Integer
Days of stay in ICU | Integer
Days of stay in NSI before specimen was received | Integer
Bacterium is isolated when patient is in ICU | {True, False}
Main department | {1, …, 10}
Department of stay (departments + ICU) | {1, …, 11}
Pathogen and pathogen groups:
Pathogen name | {Pat_name1, …, Pat_name17}
Gram(+/-) | {True, False}
Staphylococcus | {True, False}
Enterococcus | {True, False}
Enterobacteria | {True, False}
Nonfermenters | {True, False}
Antibiotic and antibiotic groups:
Antibiotic name | {Ant_name1, …, Ant_name39}
Group1 | {True, False}
… | …
Group15 | {True, False}
Sensitivity | {Sensitive, Intermediate, Resistant}

Experiment Design
- In Naïve Bayes, a normal distribution was assumed for numeric features, and the Laplace correction with a multiplicative factor of 1 was used in probability estimation for categorical features.
- C4.5 decision trees were built using 0.25 as the confidence factor for pruning and 2 as the minimum number of instances per leaf.
- With all ensembles considered here we use the simple, so-called “replace the loser” ensemble pruning strategy:
  • if the ensemble size is greater than or equal to 25, the worst classifier, according to the current validation estimates, is replaced with a new one trained on the most recent data.
- We experimented with 5 different sizes of neighbourhood k: 7, 15, 31, 63, and 127.
  • Naturally, accuracy usually decreases as the size of the neighbourhood increases, becoming closer to that of static voting.
  • Our experiments demonstrated that DIC was not very sensitive to the size of the neighbourhood.
  • A reason for that is the locally weighted learning scheme used: the more distant an instance is from the current test instance, the less influence it has on the prediction of local performance.
  • However, the smaller neighbourhoods (7 and 15) sometimes result in noisy performance estimates and inferior accuracies (especially with DS).
  • We continue our analysis of the experimental results focusing on a neighbourhood size of 31, as it usually gives the best improvement due to DIC in the problems considered.
- WEKA3 environment (Data Mining Software in Java): http://www.cs.waikato.ac.nz/ml/weka/
  • Default settings were used in the WEKA learning algorithms used in our experiments.
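
The experiment design above combines three ingredients: a distance-weighted estimate of each base classifier's local accuracy in a neighbourhood of size k, integration of the base classifiers by that local accuracy, and “replace the loser” pruning. The sketch below illustrates them under stated assumptions: Euclidean distance and the weight 1/(1 + d) are illustrative choices, not the scheme used in the experiments, and all function names are hypothetical. Only dynamic selection and dynamic voting are shown.

```python
import math
from collections import defaultdict


def nearest(x, points, k):
    """Return (distance, index) pairs for the k points closest to x."""
    dists = sorted(
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(x, p))), i)
        for i, p in enumerate(points)
    )
    return dists[:k]


def local_accuracies(x, val_x, val_correct, k=31):
    """Distance-weighted accuracy of each base classifier around x.

    val_correct[m][i] is 1 if classifier m was right on validation
    instance i, else 0. Weight 1/(1+d) is an assumed decay function.
    """
    neigh = nearest(x, val_x, k)
    weights = [1.0 / (1.0 + d) for d, _ in neigh]
    total = sum(weights)
    return [
        sum(w * row[i] for w, (_, i) in zip(weights, neigh)) / total
        for row in val_correct
    ]


def dynamic_selection(predictions, local_acc):
    """Use the prediction of the locally most accurate classifier."""
    best = max(range(len(predictions)), key=lambda m: local_acc[m])
    return predictions[best]


def dynamic_voting(predictions, local_acc):
    """Vote with weights proportional to local accuracy."""
    votes = defaultdict(float)
    for label, acc in zip(predictions, local_acc):
        votes[label] += acc
    return max(votes, key=votes.get)


def replace_the_loser(ensemble, scores, new_model, max_size=25):
    """Add new_model; if the ensemble is full, drop the worst member first."""
    if len(ensemble) >= max_size:
        worst = min(range(len(ensemble)), key=lambda m: scores[m])
        ensemble = ensemble[:worst] + ensemble[worst + 1:]
    return ensemble + [new_model]


# Toy validation data: classifier 0 is right near (0, 0), classifier 1 near (5, 5).
val_x = [(0, 0), (0, 1), (5, 5), (5, 6)]
val_correct = [[1, 1, 0, 0], [0, 0, 1, 1]]
acc_near_origin = local_accuracies((0, 0.5), val_x, val_correct, k=2)
choice = dynamic_selection(["sensitive", "resistant"], acc_near_origin)
```

With a large k the neighbourhood covers most of the validation set, so the local accuracies approach the global ones and the combination behaves more like static weighted voting, which matches the observation on the slide.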