Data Stream Classifiers
CS240B Notes
by
Carlo Zaniolo
UCLA CSD
With slides from an ICDE 2005 tutorial by
Haixun Wang, Jian Pei & Philip Yu
Classifiers
• The batch classification problem:
– Given a finite training set D = {(x, y)}, where y ∈ {y1, y2, …, yk} and |D| = n, find a
function y = f(x) that can predict the y value for an unseen instance x
• The data stream classification problem:
– Given an infinite sequence of pairs of the form (x, y), where y ∈ {y1, y2, …, yk},
find a function y = f(x) that can predict the y value for an unseen instance x
(a minimal sketch of the two settings follows the list of example applications below)
• Example applications:
– Fraud detection in credit card transactions
– Churn prediction in a telecommunication company
– Sentiment classification in the Twitter stream
– Topic classification in a news aggregation site, e.g., Google News
– …
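To make the contrast concrete, here is a minimal, illustrative sketch of the two interfaces in plain Python. BatchClassifier and StreamClassifier are hypothetical names used only for illustration, not part of any library.

```python
class BatchClassifier:
    def fit(self, D):
        """Train on the complete, finite training set D = [(x, y), ...]."""
        raise NotImplementedError

    def predict(self, x):
        raise NotImplementedError


class StreamClassifier:
    def learn_one(self, x, y):
        """Update the model with a single labeled example from the stream."""
        raise NotImplementedError

    def predict_one(self, x):
        raise NotImplementedError


# Batch setting: all n examples are available before training starts.
#   model = BatchClassifier(); model.fit(D); y_hat = model.predict(x_new)
#
# Stream setting: examples arrive forever; the model must stay usable at any
# point in time and cannot store or rescan the whole stream.
#   model = StreamClassifier()
#   for x, y in stream:                  # potentially infinite
#       y_hat = model.predict_one(x)     # predict before seeing the label
#       model.learn_one(x, y)            # then update incrementally
```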
Batch Classifiers: Splitting Attributes
• Basic algorithm (ID3, Quinlan 1986)
– The tree is constructed in a top-down, recursive, divide-and-conquer manner
– At the start, all the training examples are at the root node
– Then select the best splitting attribute using an impurity measure (e.g.,
information gain or the Gini index), and recurse until the node is sufficiently pure
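As an illustration of the splitting step, the sketch below scores candidate categorical attributes by information gain on a small in-memory dataset. It is a simplified fragment, not the full ID3 recursion, and the helper names and the tiny dataset are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on categorical attribute `attr`."""
    n = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attr] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def best_split(rows, labels, attributes):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

# Tiny illustrative dataset: "outlook" perfectly separates the classes.
rows = [{"outlook": "sunny", "windy": "yes"}, {"outlook": "rain", "windy": "yes"},
        {"outlook": "sunny", "windy": "no"},  {"outlook": "rain", "windy": "no"}]
labels = ["no", "yes", "no", "yes"]
print(best_split(rows, labels, ["outlook", "windy"]))   # -> "outlook"
```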
Decision Tree Classifiers
A divide-and-conquer approach
Simple algorithm, intuitive model
Typically a decision tree grows one level for each scan of the data
Multiple scans are therefore required
But if small samples suffice, this problem disappears
However, the tree structure is not ‘stable’:
subtle changes in the data can cause global changes in the tree
Challenge #1 in on-line classifiers
How many samples do we need to build, in constant time, a tree that is
nearly identical to the tree a batch learner (C4.5, SPRINT, ...) would produce?
Nearly identical?
Categorical attributes:
with high probability, the attribute we choose for the split is
the same attribute as would be chosen by a batch learner,
hence an identical decision tree
Continuous attributes:
discretize them into categorical ones
...Forget concept shift/drift for now
Hoeffding Bound I
Also known as the additive Chernoff bound
Given
– r : a real-valued random variable with range R
– n : the number of independent observations of r
– r_avg : the observed mean of those n observations
The true mean of r does not differ from r_avg by more than ε,
with probability 1 - δ, where:
    ε = sqrt( R² ln(1/δ) / (2n) )
• This bound holds regardless of the distribution generating the values,
and depends only on the range of the values, the number of observations,
and the desired confidence.
– A disadvantage of being so general is that it is
more conservative than a distribution-dependent bound
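A small numerical sketch of the bound in plain Python (the function name is mine):

```python
import math

def hoeffding_epsilon(value_range, n, delta):
    """Hoeffding bound: with probability 1 - delta, the observed mean of n
    independent observations of a variable with range `value_range` is within
    epsilon of the true mean."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Example: information gain lies in [0, 1] for two classes, so R = 1.
for n in (100, 1_000, 10_000):
    print(n, round(hoeffding_epsilon(1.0, n, delta=1e-7), 4))
# epsilon shrinks as n grows, so waiting for more samples tightens the estimate
```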
Hoeffding Bound
Properties:
The Hoeffding bound is independent of the data distribution
The error ε decreases as n (the number of samples) increases
At each node, we therefore accumulate enough samples (n) before we make a split
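The sketch below shows how this is typically used to decide when a leaf has seen enough samples to split. The structure follows the standard Hoeffding-tree split test; the tie-breaking threshold tau is a common addition, and all names and numbers are illustrative.

```python
import math

def should_split(gain_best, gain_second, n, delta=1e-7, value_range=1.0, tau=0.05):
    """Split a leaf only when the observed gap between the best and the
    second-best attribute exceeds the Hoeffding epsilon, or when epsilon
    itself is below a small tie-breaking threshold tau."""
    eps = math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))
    gap = gain_best - gain_second
    return gap > eps or eps < tau

# e.g. after 800 samples at a leaf, with observed gains 0.32 vs 0.21:
print(should_split(0.32, 0.21, n=800))   # True: the gap (0.11) exceeds epsilon (~0.10)
```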
Hoeffding Trees
Scale better than traditional decision-tree algorithms
Incremental: nodes are created incrementally as new samples stream in
Sub-linear with sampling
Small memory requirement
Cons:
Only the top two attributes are compared at each leaf
Tie breaking takes time
Growing a deep tree takes time
Discrete attributes only
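A hedged usage sketch with the open-source river library, assuming its tree.HoeffdingTreeClassifier and the learn_one/predict_one API; the synthetic stream and its concept are invented for illustration only.

```python
import random
from river import tree   # assumes the `river` package is installed

model = tree.HoeffdingTreeClassifier()

def synthetic_stream(n):
    """Toy stream: label is 1 when x1 + x2 > 1.0 (an invented concept)."""
    for _ in range(n):
        x = {"x1": random.random(), "x2": random.random()}
        yield x, int(x["x1"] + x["x2"] > 1.0)

correct, seen = 0, 0
for x, y in synthetic_stream(20_000):
    y_pred = model.predict_one(x)      # predict before seeing the label...
    correct += int(y_pred == y)
    seen += 1
    model.learn_one(x, y)              # ...then update the tree incrementally

print(f"prequential accuracy: {correct / seen:.3f}")
```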
Nearly Identical?
Categorical attributes:
with high probability, the attribute we choose for the split
is the same attribute as would be chosen by a batch learner,
thus we obtain an identical decision tree
Continuous attributes: discretize them into categorical ones.
Concept Drift/Shift
Time-changing data streams:
incorporate new samples and eliminate the effect of old samples
Naïve approach:
place a sliding window on the stream and
reapply C4.5 or VFDT whenever the window moves
Time consuming!
[VFDT] Very Fast Decision Tree [Domingos, Hulten 2000]
Several improvements: faster and less memory
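A sketch of the naïve sliding-window approach described above, in plain Python; train_batch_tree stands in for any batch learner such as C4.5 and is purely hypothetical.

```python
from collections import deque

WINDOW_SIZE = 10_000
SLIDE = 1_000          # retrain after this many new examples

def train_batch_tree(window):
    """Placeholder for a batch learner (e.g., C4.5) retrained from scratch."""
    raise NotImplementedError

def naive_sliding_window(stream):
    window = deque(maxlen=WINDOW_SIZE)   # old examples fall off automatically
    model, new_since_retrain = None, 0
    for x, y in stream:
        window.append((x, y))
        new_since_retrain += 1
        if new_since_retrain >= SLIDE:
            model = train_batch_tree(list(window))   # full retrain: the costly part
            new_since_retrain = 0
        yield model
```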
CVFDT
Concept-adapting VFDT
Hulten, Spencer, Domingos, 2001
Goal
Classifying concept-drifting data streams
Approach
Make use of Hoeffding bound
Incorporate “windowing”
Monitor changes in the information gain of the attributes.
If the change reaches a threshold, grow an alternate subtree
with the new “best” attribute, but keep it in the background.
Replace the old subtree if the alternate becomes more accurate.
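A highly simplified sketch of the alternate-subtree bookkeeping (not the actual CVFDT code): both subtrees are scored on recent examples and the background one is promoted when it proves more accurate. The class, the predict(x) method of the subtrees, and the evaluation window are illustrative.

```python
class AlternateSubtreeMonitor:
    """Compares a current subtree with a background alternate on recent
    examples and promotes the alternate when it becomes more accurate."""

    def __init__(self, current, alternate, eval_window=1_000):
        self.current = current
        self.alternate = alternate
        self.eval_window = eval_window
        self.err_current = 0
        self.err_alternate = 0
        self.seen = 0

    def observe(self, x, y):
        """Score both subtrees on one labeled example; swap if warranted."""
        self.err_current += int(self.current.predict(x) != y)
        self.err_alternate += int(self.alternate.predict(x) != y)
        self.seen += 1
        if self.seen == self.eval_window:
            if self.err_alternate < self.err_current:
                self.current, self.alternate = self.alternate, self.current
            self.err_current = self.err_alternate = 0
            self.seen = 0
        return self.current
```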
Classifier Ensembles for Data Streams
Fast and light classifiers:
Naïve Bayesian: one pass to count occurrences (see the sketch below)
Sliding windows, tumbles and slides
Adaptive Nearest Neighbor Classification Algorithm (ANNCAD)
Ensembles of classifiers (decision trees or others):
Bagging ensembles and
Boosting ensembles
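As referenced above, here is a minimal sketch of a count-based streaming Naïve Bayes classifier for categorical features. The counters and Laplace smoothing are the standard construction; the class name is mine.

```python
import math
from collections import defaultdict

class StreamingNaiveBayes:
    """Categorical Naïve Bayes maintained with simple counters, so each
    example is processed in one pass and in constant time."""

    def __init__(self):
        self.class_counts = defaultdict(int)      # y -> count
        self.feature_counts = defaultdict(int)    # (y, attr, value) -> count
        self.total = 0

    def learn_one(self, x, y):
        self.class_counts[y] += 1
        self.total += 1
        for attr, value in x.items():
            self.feature_counts[(y, attr, value)] += 1

    def predict_one(self, x):
        best, best_score = None, float("-inf")
        for y, cy in self.class_counts.items():
            score = math.log(cy / self.total)     # log prior
            for attr, value in x.items():
                # Add-one smoothing; the "+ 2" assumes roughly binary attributes
                # (use the attribute's domain size in general).
                score += math.log((self.feature_counts[(y, attr, value)] + 1) / (cy + 2))
            if score > best_score:
                best, best_score = y, score
        return best
```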
Basic Ideas
The stream is partitioned into sequential chunks
Train a classifier from each chunk
The accuracy of voting ensembles is normally better than that of a single classifier
Method 1: Bagging
Weighted voting: weights are assigned to classifiers based
on their recent performance on the current test examples
Only the top K classifiers are used
Method 2: Boosting
Majority voting
Classifiers retired by age
Boosting used in training
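A sketch of the chunk-based, weighted-voting (bagging-style) scheme, assuming a generic batch base learner with a predict method. Weighting by recent accuracy and keeping the top K follow the description above; the helper names and the self-scoring of the newest member are simplifications of mine.

```python
def accuracy(model, chunk):
    """Fraction of examples in `chunk` that `model` predicts correctly."""
    return sum(model.predict(x) == y for x, y in chunk) / len(chunk)

def chunk_ensemble(stream_of_chunks, train_base_learner, k=10):
    """Train one classifier per chunk; weight each by its accuracy on the
    most recent chunk and keep only the top-k. Yields (classifiers, weights)."""
    classifiers = []
    for chunk in stream_of_chunks:
        # Score the existing members on the newest data before training on it.
        weights = [accuracy(c, chunk) for c in classifiers]
        classifiers.append(train_base_learner(chunk))
        weights.append(accuracy(classifiers[-1], chunk))   # optimistic self-score
        ranked = sorted(zip(weights, classifiers), key=lambda p: p[0], reverse=True)[:k]
        weights, classifiers = [w for w, _ in ranked], [c for _, c in ranked]
        yield classifiers, weights

def weighted_vote(classifiers, weights, x):
    """Predict the class with the largest total weight among the members' votes."""
    totals = {}
    for c, w in zip(classifiers, weights):
        label = c.predict(x)
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)
```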
Bagging Ensemble Method
Mining Streams with Concept Changes
Changes are detected by a drop in accuracy or by other methods
Build new classifiers on the new windows
Search among the old classifiers for those that have now become accurate again
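A short sketch of the "revive old classifiers" idea: retired members are re-scored on the newest window and brought back if they have become accurate again. The threshold and helper names are illustrative.

```python
def revive_old_classifiers(retired, recent_window, min_accuracy=0.75):
    """Return the retired classifiers whose accuracy on the most recent
    window is above `min_accuracy`; useful when an old concept recurs."""
    def acc(model):
        return sum(model.predict(x) == y for x, y in recent_window) / len(recent_window)
    return [m for m in retired if acc(m) >= min_accuracy]
```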
Boosting Ensembles for
Adaptive Mining of Data Streams
Andrea Fang Chu, Carlo Zaniolo
[PAKDD2004]
Mining Data Streams: Desiderata
Fast learning (preferably in one pass over the data)
Light requirements (low time complexity, low memory requirements)
Adaptation (the model always reflects the time-changing concept)
Adaptive Boosting Ensembles
• The training stream is split into blocks (i.e., windows)
• Each individual classifier is learned from a block
• A boosting ensemble of 7 to 19 members is maintained over time
• Decisions are taken by simple majority
• As the (N+1)-th classifier is built, boost the weight of the
tuples misclassified by the first N
• Change detection is used to achieve adaptation
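A simplified sketch of this training scheme (not the exact algorithm of the PAKDD 2004 paper): for each new block, the examples misclassified by the current ensemble get their sample weights boosted before the next weak learner is trained. The weak-learner factory, the boost factor, and the ensemble size are placeholders.

```python
def majority_vote(ensemble, x):
    """Unweighted majority vote of the ensemble members."""
    votes = {}
    for member in ensemble:
        label = member.predict(x)
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get) if votes else None

def train_adaptive_boosting(blocks, train_weak_learner, max_members=11, boost_factor=2.0):
    """Train one weak learner (e.g., a very shallow tree) per block, boosting
    the weights of the tuples the current ensemble gets wrong."""
    ensemble = []
    for block in blocks:                       # block = list of (x, y) pairs
        weights = [1.0] * len(block)
        for i, (x, y) in enumerate(block):
            if ensemble and majority_vote(ensemble, x) != y:
                weights[i] *= boost_factor     # emphasize the hard tuples
        ensemble.append(train_weak_learner(block, weights))
        if len(ensemble) > max_members:        # retire the oldest member
            ensemble.pop(0)
        yield ensemble
```

Keeping each weak learner shallow and each block small is what makes the scheme fast and light, as the next slide notes.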
Fast and Light
Experiments show that boosting ensembles of
“weak learners” provide accurate predictions
Weak learners:
• An aggressively pruned decision tree, e.g., a shallow tree (this means fast!)
• Trained on a small set of examples (this means light memory requirements!)
Adaptation: detect changes that cause significant drops
in ensemble performance
gradual changes: concept drift
abrupt changes: concept shift
Adaptability
The error rate is viewed as a random variable
When it deviates significantly from its recent average,
the whole ensemble is dropped
and a new one is quickly re-learned
(see the change-detection sketch below)
The cost/performance of boosting ensembles is better
than that of bagging ensembles [KDD04]
BUT ???
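A sketch of this kind of change detector: the per-block error rate is tracked as a running mean and variance, and a change is flagged when the newest error deviates from the recent average by more than a few standard deviations. The class name, the history length, and the 3-sigma threshold are illustrative choices, not the paper's exact test.

```python
import math
from collections import deque

class ErrorRateChangeDetector:
    """Flags a concept change when the latest block error rate deviates from
    the mean of the recent blocks by more than `k` standard deviations."""

    def __init__(self, history=20, k=3.0):
        self.errors = deque(maxlen=history)
        self.k = k

    def update(self, block_error_rate):
        """Return True if a change is detected on this block."""
        if len(self.errors) >= 5:                      # need a little history first
            mean = sum(self.errors) / len(self.errors)
            var = sum((e - mean) ** 2 for e in self.errors) / len(self.errors)
            if abs(block_error_rate - mean) > self.k * math.sqrt(var) + 1e-9:
                self.errors.clear()                    # restart statistics after a change
                self.errors.append(block_error_rate)
                return True
        self.errors.append(block_error_rate)
        return False

# Usage: when update(...) returns True, drop the ensemble and re-learn it
# from the next few blocks, as described above.
```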
References
Haixun Wang, Wei Fan, Philip S. Yu, Jiawei Han. Mining Concept-Drifting Data Streams
Using Ensemble Classifiers. ACM International Conference on Knowledge Discovery and
Data Mining (SIGKDD), 2003.
Pedro Domingos, Geoff Hulten. Mining High-Speed Data Streams. ACM International
Conference on Knowledge Discovery and Data Mining (SIGKDD), 2000.
Geoff Hulten, Laurie Spencer, Pedro Domingos. Mining Time-Changing Data Streams.
ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2001.
Wei Fan, Yi-an Huang, Haixun Wang, Philip S. Yu. Active Mining of Data Streams.
SIAM International Conference on Data Mining (SIAM DM), 2004.
Fang Chu, Yizhou Wang, Carlo Zaniolo. An Adaptive Learning Approach for Noisy Data
Streams. 4th IEEE International Conference on Data Mining (ICDM), 2004.
Fang Chu, Carlo Zaniolo. Fast and Light Boosting for Adaptive Mining of Data Streams.
PAKDD 2004: 282-292.
Yan-Nei Law, Carlo Zaniolo. An Adaptive Nearest Neighbor Classification Algorithm for
Data Streams. ECML/PKDD 2005, Porto, Portugal, October 3-7, 2005.