Data Stream Mining: Building Decision Trees and Evaluation

Background: Learning and mining
• Finding in what sense data is not random. For example: frequently repeated patterns, correlations among attributes, one attribute being predictable from others, ...
• Fix a description language a priori, and show that the data has a description more concise than the data itself.

Data stream classification cycle
1. Process an example at a time, and inspect it only once (at most)
2. Use a limited amount of memory
3. Work in a limited amount of time
4. Be ready to predict at any point

Background: Learning and mining
[Diagram: DATA → PROCESS → MODEL / PATTERNS → ACTIONS]

Mining in Data Streams: What's new?
• Make one pass over the data
• Use low memory: certainly sublinear in the data size; in practice, fitting in main memory, with no disk accesses
• Use low processing time per item
• Data evolves over time: the distribution is nonstationary

Two main approaches
1. The learner builds a model, perhaps batch style; when change is detected, revise the model or rebuild it from scratch.
2. Keep accurate statistics of recent / relevant data, e.g. with "intelligent counters"; the learner keeps the model in sync with these statistics.

Decision Tree Learning

Background: Decision Trees
• Powerful, intuitive classifiers
• Good induction algorithms since the mid-1980s (C4.5, CART)
• Many algorithms and variants exist now

Decision Trees
A decision tree is a classification model whose basic structure is a general tree:
• Internal node: test on an example's attribute value
• Leaf node: class label
Key idea:
1. Pick an attribute to test at the root.
2. Divide the training data into subsets Di, one for each value the attribute can take.
3. Build the tree for each Di and splice it in under the appropriate branch at the root.

Decision Trees
[Figure: example decision tree]

Top-down Induction, C4.5 Style
• Algorithm to build a decision tree [figure]

Top-down Induction, C4.5 Style
• Limitations
  • Classic decision tree learners assume all training data can be stored simultaneously in main memory.
  • Disk-based decision tree learners repeatedly read the training data from disk, sequentially.
• Goal
  • Design decision tree learners that read each example at most once and use a small constant time to process it.

VFDT
P. Domingos and G. Hulten: "Mining high-speed data streams", KDD 2000. A very influential paper.
• Very Fast induction of Decision Trees, a.k.a. Hoeffding trees
• Algorithm for inducing decision trees in a data stream setting
• Does not deal with time change
• Does not store examples: memory is independent of the data size

VFDT
In order to find the best attribute at a node, it may be sufficient to consider only a small subset of the training examples that pass through that node.
• Given a stream of examples, use the first ones to choose the root attribute.
• Once the root attribute is chosen, the successive examples are passed down to the corresponding leaves and used to choose the attribute there, and so on recursively.
• Use the Hoeffding bound to decide how many examples are enough at each node.
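As a quick, hedged illustration of the bound ε(n, δ) just mentioned, here is a small Python sketch of the standard Hoeffding bound as it is applied in VFDT; the function name hoeffding_bound, the range R = 1 (information gain in bits for a two-class problem), and the δ and n values are illustrative assumptions, not values taken from these slides.

import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    # With probability at least 1 - delta, the true mean of a random variable
    # with range value_range differs from the mean observed over n independent
    # examples by at most the returned epsilon.
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Illustrative use: information gain (in bits) for a two-class problem lies in [0, 1],
# so R = 1; epsilon shrinks as more examples arrive at the leaf.
delta = 1e-7
for n in (100, 1_000, 10_000):
    print(n, round(hoeffding_bound(1.0, delta, n), 4))

The smaller ε becomes, the easier it is for the observed gap between the best and second-best attributes to exceed it, which is exactly the split criterion described next.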
VFDT
Crucial observation [DH00]: an almost-best attribute can be pinpointed quickly.
• Evaluate the gain function G(A, S) on the examples S seen so far, then use the Hoeffding bound.
• Criterion: if Ai satisfies G(Ai, S) > G(Aj, S) + ε(|S|, δ) for every j ≠ i, conclude "Ai is best" with probability 1 − δ.

VFDT
• Calculate the information gain for the attributes and determine the best two attributes, Xa and Xb.
• At each node, check the condition ΔG = G(Xa) − G(Xb) > ε.
• If the condition is satisfied, create child nodes based on the test at the node.
• If not, stream in more examples and repeat the calculation until the condition is satisfied.

VFDT
[Figure: VFDT example]

VFDT-like Algorithm
T := leaf with empty statistics
For t = 1, 2, ... do VFDT_Grow(T, xt)

VFDT_Grow(Tree T, example x):
  run x from the root of T to a leaf L
  update the statistics on attribute values at L using x
  evaluate G(Ai, SL) for all i from the statistics at L
  if there is an i such that, for all j ≠ i, G(Ai, SL) > G(Aj, SL) + ε(|SL|, δ) then
    turn leaf L into a node labelled with Ai
    create children of L for all values of Ai
    make each child a leaf with empty statistics

(A Python sketch of this leaf-level split check is given after the CVFDT comparison below.)

Extensions of VFDT
• IADEM [G. Ramos, J. del Campo, R. Morales-Bueno 2006]
  • Better splitting and expanding criteria
  • Margin-driven growth
• VFDTc [J. Gama, R. Fernandes, R. Rocha 2006], UFFT [J. Gama, P. Medas 2005]
  • Continuous attributes
  • Naive Bayes at inner nodes and leaves
  • Short-term memory window for detecting concept drift
  • Converts inner nodes back to leaves and fills them with window data
  • Different splitting and expanding criteria
• CVFDT [G. Hulten, L. Spencer, P. Domingos 2001]

CVFDT
G. Hulten, L. Spencer, P. Domingos, "Mining time-changing data streams", KDD 2001
• Concept-adapting VFDT
• Updates statistics at leaves and inner nodes
• Main idea: when change is detected at a subtree, grow a candidate subtree; eventually, either the current subtree or the candidate subtree is dropped
• Classification at leaves is based on the most frequent class in a window of examples; decisions at a leaf use a window of recent examples

CVFDT
CVFDT (Concept-adapting Very Fast Decision Tree learner)
• Extends VFDT
• Maintains VFDT's speed and accuracy
• Detects and responds to changes in the example-generating process

CVFDT
• With a time-changing concept, the current splitting attribute of some nodes may no longer be the best.
• An outdated subtree may still be better than the best single leaf, particularly if it is near the root.
• Grow an alternative subtree with the new best attribute at its root when the old attribute seems out of date.
• Periodically use a batch of samples to evaluate the quality of the trees.
• Replace the old subtree when the alternate one becomes more accurate.

CVFDT
[Figures: CVFDT growing and swapping alternate subtrees]

CVFDT
No theoretical guarantees on the error rate of CVFDT.
CVFDT parameters:
• W: the example window size
• T0: number of examples used to check, at each node, whether the splitting attribute is still the best
• T1: number of examples used to build the alternate tree
• T2: number of examples used to test the accuracy of the alternate tree

CVFDT vs VFDT
• VFDT: no concept drift; no example memory; no parameters but δ; rigorous performance guarantees
• CVFDT: concept drift; window of examples; several parameters besides δ; no performance guarantees
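Before moving on to adaptive trees, the following minimal Python sketch shows the leaf-level split check that VFDT_Grow performs, using information gain and the Hoeffding bound as in the criterion above. It is an illustrative reconstruction under simplifying assumptions (discrete attribute values, gain range taken as 1 bit for two classes); the class HoeffdingLeaf, its parameters, and the toy stream are mine, not code from the VFDT paper.

import math
from collections import defaultdict

class HoeffdingLeaf:
    # One leaf: keeps (attribute, value, class) counts and checks the VFDT split criterion.

    def __init__(self, n_attrs, delta=1e-7):
        self.n_attrs = n_attrs
        self.delta = delta
        self.n = 0
        self.class_counts = defaultdict(int)
        # stats[attribute][value][class] -> count
        self.stats = [defaultdict(lambda: defaultdict(int)) for _ in range(n_attrs)]

    def update(self, x, y):
        # Update sufficient statistics with one example (attribute tuple x, class label y).
        self.n += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.stats[i][v][y] += 1

    @staticmethod
    def _entropy(counts):
        total = sum(counts)
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    def _info_gain(self, attr):
        base = self._entropy(list(self.class_counts.values()))
        remainder = 0.0
        for value_counts in self.stats[attr].values():
            counts = list(value_counts.values())
            remainder += (sum(counts) / self.n) * self._entropy(counts)
        return base - remainder

    def try_split(self):
        # Return the attribute to split on if G(best) - G(second best) > epsilon, else None.
        if self.n == 0:
            return None
        gains = sorted(((self._info_gain(a), a) for a in range(self.n_attrs)), reverse=True)
        best_gain, best_attr = gains[0]
        second_gain = gains[1][0] if len(gains) > 1 else 0.0
        # Assumes the gain is measured in bits for a two-class problem, so its range is 1.
        eps = math.sqrt(math.log(1.0 / self.delta) / (2.0 * self.n))
        return best_attr if best_gain - second_gain > eps else None

# Toy stream in which attribute 0 perfectly predicts the class and attribute 1 is useless.
leaf = HoeffdingLeaf(n_attrs=2, delta=0.05)
for _ in range(500):
    leaf.update(x=("a", "p"), y=0)
    leaf.update(x=("b", "p"), y=1)
print(leaf.try_split())  # prints 0 once enough examples have been seen

In a full VFDT implementation this check runs at every leaf as examples stream in, and a successful check replaces the leaf by an internal node, as in the pseudocode above.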
Adaptive Hoeffding Trees
Remember ADWIN [Bifet, G. 09]
• Recall: the ADWIN algorithm detects change in the mean of a data stream of numbers
• It keeps a window W whose mean approximates the current mean
• Memory and time are O(log W)

Hoeffding Adaptive Trees
Hoeffding Adaptive Tree:
• replace the frequency-statistics counters by estimators
• no window to store examples is needed, because the statistics required are maintained by the estimators
• change the way the substitution of alternate subtrees is checked, using a change detector with theoretical guarantees
Theoretical guarantees. No parameters.

HAT has no parameters!
• When to start growing alternate trees? When ADWIN says "the error rate is increasing", or when the ADWIN for a counter says "attribute statistics are changing"
• How to start growing the new tree? Use the accurate estimates from the ADWINs at the parent: no window
• When to tell that the alternate tree is better? Use ADWIN's error estimates to decide
• How to answer at leaves? Use the accurate estimates from the ADWINs at the leaf: no window

Experiments
[Figures: experimental results]

Hoeffding Adaptive Trees: Summary
• No "magic" parameters; self-adapts to change
• Always as accurate as CVFDT, and sometimes much better
• Less memory: no example window
• Moderate overhead in time (<50%); working on it
• Rigorous guarantees possible

Evaluation
1. Error estimation: hold-out or prequential
2. Evaluation performance measures: accuracy or κ-statistic
3. Statistical significance validation: McNemar or Nemenyi test

Error Estimation (Holdout)
• Data is available for testing: hold out an independent test set
• Apply the current decision model to the test set at regular time intervals
• The loss estimated on the holdout set is an unbiased estimator

Error Estimation (Prequential)
• No data is available for testing: the error of a model is computed from the sequence of examples itself
• For each example in the stream, the current model first makes a prediction, and then the example is used to update the model

Error Estimation
Hold-out or prequential? Hold-out is more accurate, but needs data for testing.

Evaluation performance measures

                    Correct Class+   Correct Class-   Total
  Predicted Class+        75                7            82
  Predicted Class-         8               10            18
  Total                   83               17           100

Table: Simple confusion matrix example

Performance Measures with Unbalanced Classes
Compare the simple confusion matrix above with the one a chance predictor with the same marginals would produce:

                    Correct Class+   Correct Class-   Total
  Predicted Class+      68.06            13.94          82
  Predicted Class-      14.94             3.06          18
  Total                 83               17             100

Table: Confusion matrix for chance predictor

Performance Measures with Unbalanced Classes
Kappa Statistic
• p0: the classifier's prequential accuracy
• pc: the probability that a chance classifier makes a correct prediction
• κ statistic: κ = (p0 − pc) / (1 − pc)
• κ = 1 if the classifier is always correct
• κ = 0 if its predictions coincide with the correct ones only as often as those of the chance classifier
• Forgetting mechanism for estimating prequential kappa: a sliding window of size w containing the most recent observations

Cost Evaluation Example

                  Accuracy   Time   Memory
  Classifier A       70%      100      20
  Classifier B       80%       20      40

Which classifier is performing better?

RAM-Hours
• One RAM-Hour: every GB of RAM deployed for 1 hour
• Based on cloud computing rental cost options

Cost Evaluation Example

                  Accuracy   Time   Memory   RAM-Hours
  Classifier A       70%      100      20       2,000
  Classifier B       80%       20      40         800

Which classifier is performing better?
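A short sketch of how the RAM-Hours column above can be derived and used to answer that question; it assumes the Time column is measured in hours and the Memory column in GB, since the slide does not state the units, and the dictionary layout is illustrative.

def ram_hours(memory_gb: float, hours: float) -> float:
    # One RAM-Hour is 1 GB of RAM deployed for 1 hour.
    return memory_gb * hours

# Figures from the cost evaluation example (units assumed: hours and GB).
classifiers = {
    "A": {"accuracy": 0.70, "hours": 100, "memory_gb": 20},
    "B": {"accuracy": 0.80, "hours": 20, "memory_gb": 40},
}
for name, c in classifiers.items():
    print(name, c["accuracy"], ram_hours(c["memory_gb"], c["hours"]))

Under this cost model Classifier B is preferable: it is more accurate (80% vs 70%) and cheaper to run (800 vs 2,000 RAM-Hours).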
Evaluation
• Error estimation: hold-out or prequential
• Evaluation performance measures: accuracy or κ-statistic
• Resources needed: time and memory, or RAM-Hours
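To tie these evaluation pieces together, here is a hedged Python sketch of a prequential (test-then-train) loop that reports accuracy and the κ-statistic over a sliding window of the most recent observations. The synthetic stream, the window size w, and the trivial majority-class learner are placeholders introduced for illustration; they are not part of the original material.

import random
from collections import Counter, deque

class MajorityClassLearner:
    # Placeholder incremental learner: always predicts the most frequent class seen so far.
    def __init__(self):
        self.counts = Counter()
    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else 0
    def learn(self, x, y):
        self.counts[y] += 1

def prequential_eval(stream, learner, w=1000):
    # Test-then-train: predict on each example before learning from it;
    # accuracy and kappa are computed over the last w (prediction, label) pairs.
    window = deque(maxlen=w)
    for x, y in stream:
        window.append((learner.predict(x), y))
        learner.learn(x, y)
    n = len(window)
    p0 = sum(p == t for p, t in window) / n
    pred_freq = Counter(p for p, _ in window)
    true_freq = Counter(t for _, t in window)
    # pc: accuracy a chance classifier with the same marginals would achieve.
    pc = sum(pred_freq[c] * true_freq[c] for c in true_freq) / (n * n)
    kappa = (p0 - pc) / (1 - pc) if pc < 1 else 1.0
    return p0, kappa

# Synthetic, class-imbalanced stream (about 80% class 0): accuracy looks high,
# but kappa stays near 0 because the majority-class learner does no better than chance.
random.seed(0)
stream = [((random.random(),), 0 if random.random() < 0.8 else 1) for _ in range(5000)]
print(prequential_eval(stream, MajorityClassLearner(), w=1000))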