Learning from Data Streams
CS240B Notes by Carlo Zaniolo, UCLA CSD
With slides from an ICDE 2005 tutorial by Haixun Wang, Jian Pei & Philip Yu

What are the Challenges?
- Data volume: it is impossible to mine the entire data at one time; we can only afford constant memory per data sample.
- Concept drifts: previously learned models become invalid.
- Cost of learning: model updates can be costly; we can only afford constant time per data sample.

On-Line Learning
- Learning (training):
  - Input: a data set of pairs (a, b), where a is a feature vector and b a class label.
  - Output: a model (e.g., a decision tree).
- Testing:
  - Input: a test sample (x, ?).
  - Output: a class label prediction for x.
- When mining data streams, the two phases are combined.

Mining Data Streams: Challenges
- On-line response (NB), limited memory, only the most recent windows.
- Fast & light algorithms are needed that:
  - minimize usage of memory and CPU;
  - require only one (or a few) passes through the data.
- Concept shift/drift: changes in the statistics of the mined data render previously learned models inaccurate or invalid.
- Robustness and adaptability: quickly recover/adjust after concept changes.
- Popular machine learning algorithms are no longer effective:
  - Neural nets: slow learners that require many passes.
  - Support Vector Machines (SVM): computationally too expensive.

Classifier Algorithms: from Databases to Data Streams
- New algorithms have emerged: Bloom filters, ANNCAD.
- Existing algorithms have been adapted:
  - NBC survives with only minor changes.
  - Decision trees require significant adaptation.
  - Classifier ensembles remain effective, after significant changes.
- Popular algorithms are no longer effective:
  - Neural nets: slow learners that require many passes.
  - Support Vector Machines (SVM): computationally too expensive.

Decision Tree Classifiers
- A divide-and-conquer approach: simple algorithm, intuitive model.
- Typically a decision tree grows one level for each scan of the data, so multiple scans are required.
- If we can use small samples, this problem disappears.
- But the data structure is not 'stable': subtle changes in the data can cause global changes in the tree.

Challenge #1
- How many samples do we need to build, in constant time, a tree that is nearly identical to the tree produced by a batch learner (C4.5, SPRINT, ...)?
- Nearly identical?
  - Categorical attributes: with high probability, the attribute we choose for a split is the same attribute that would be chosen by a batch learner, yielding an identical decision tree.
  - Continuous attributes: discretize them into categorical ones.
- ...Forget concept shift/drift for now.

Hoeffding Bound
- Also known as the additive Chernoff bound.
- Given:
  - r: a real-valued random variable;
  - n: the number of independent observations of r;
  - R: the range of r.
- The mean of r is at least ravg − ε with probability 1 − δ. That is, P(μr ≥ ravg − ε) = 1 − δ, where ε = sqrt( R² · ln(1/δ) / (2n) ).

Hoeffding Bound Properties
- The Hoeffding bound is independent of the data distribution.
- The error ε decreases as n (the number of samples) increases.
- At each node, we accumulate enough samples n before we make a split.

Nearly Identical?
- Categorical attributes: with high probability, the attribute we choose for a split is the same attribute that would be chosen by a batch learner; thus we obtain an identical decision tree.
- Continuous attributes: discretize them into categorical ones.
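
To make the bound concrete, here is a small Python sketch (not part of the original slides; the function name hoeffding_epsilon is mine) that evaluates ε = sqrt(R² · ln(1/δ) / (2n)) and shows how the error shrinks as more samples arrive.

import math

def hoeffding_epsilon(value_range, delta, n):
    # With probability 1 - delta, the true mean of a random variable whose
    # values span `value_range` lies within epsilon of the average of n
    # independent observations.
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Information gain lies in [0, 1], so R = 1; with delta = 0.001:
for n in (100, 1000, 10000):
    print(n, round(hoeffding_epsilon(1.0, 0.001, n), 4))
# epsilon shrinks roughly as 1/sqrt(n): ~0.1859, ~0.0588, ~0.0186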
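
The slides say we accumulate enough samples at a node before splitting. In Hoeffding-tree style learners (e.g., VFDT) the usual test is that the observed gain of the best attribute exceeds the runner-up's by more than ε; the sketch below illustrates that test under the same assumptions as above. It is an illustration, not the course's reference implementation, and should_split and its parameters are hypothetical names.

import math

def should_split(gain_best, gain_second, value_range, delta, n):
    # Commit to splitting on the best attribute once its observed advantage
    # over the runner-up exceeds the Hoeffding epsilon: then, with
    # probability 1 - delta, a batch learner seeing the full data would
    # choose the same attribute.
    eps = math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))
    return (gain_best - gain_second) > eps

# After 2000 samples, a 0.30 vs. 0.25 gain gap beats epsilon (~0.0416):
print(should_split(0.30, 0.25, value_range=1.0, delta=0.001, n=2000))  # True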
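
The claim that "NBC survives with only minor changes" reflects the fact that Naive Bayes only needs class counts and (attribute, value, class) counts, which can be updated in constant time per arriving sample. Below is a minimal streaming sketch of that idea, assuming categorical attributes and add-one (Laplace) smoothing; the class and method names are illustrative, not an existing library API.

import math
from collections import defaultdict

class StreamingNaiveBayes:
    # Training is a counter update per sample, so it costs constant time per
    # sample; memory grows with the number of distinct (attribute, value,
    # class) triples, not with the length of the stream.
    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(int)   # (attr index, value, label) -> count
        self.attr_values = defaultdict(set)      # attr index -> values seen so far
        self.total = 0

    def update(self, features, label):
        self.total += 1
        self.class_counts[label] += 1
        for i, v in enumerate(features):
            self.feature_counts[(i, v, label)] += 1
            self.attr_values[i].add(v)

    def predict(self, features):
        best_label, best_score = None, float("-inf")
        for label, c in self.class_counts.items():
            score = math.log(c / self.total)          # log prior
            for i, v in enumerate(features):
                # Add-one (Laplace) smoothed conditional probability.
                num = self.feature_counts[(i, v, label)] + 1
                den = c + len(self.attr_values[i])
                score += math.log(num / den)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = StreamingNaiveBayes()
nb.update(("sunny", "hot"), "no")
nb.update(("rainy", "mild"), "yes")
nb.update(("rainy", "hot"), "yes")
print(nb.predict(("rainy", "hot")))   # "yes"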