Learning from Data Streams
CS240B Notes by Carlo Zaniolo, UCLA CSD
With slides from an ICDE 2005 tutorial by Haixun Wang, Jian Pei & Philip Yu

What are the Challenges?
- Data volume: it is impossible to mine the entire data set at one time; we can only afford constant memory per data sample.
- Concept drifts: previously learned models become invalid.
- Cost of learning: model updates can be costly; we can only afford constant time per data sample.

On-Line Learning
- Learning (training): Input: a data set of pairs (a, b), where a is a feature vector and b a class label. Output: a model (e.g., a decision tree).
- Testing: Input: a test sample (x, ?). Output: a class-label prediction for x.
- When mining data streams, the two phases are combined.

Mining Data Streams: Challenges
- On-line response (NB), limited memory, most recent windows only.
- Fast & light algorithms are needed that minimize usage of memory and CPU and require only one (or a few) passes through the data.
- Concept shift/drift: changes in the statistics of the mined data render previously learned models inaccurate or invalid.
- Robustness and adaptability: quickly recover/adjust after concept changes.
- Popular machine learning algorithms are no longer effective. Neural nets: slow learners that require many passes. Support Vector Machines (SVM): computationally too expensive.

Classifier Algorithms: from Databases to Data Streams
- New algorithms have emerged, e.g., Bloom filters, ANNCAD.
- Existing algorithms have been adapted: NBC survives with only minor changes (a minimal streaming sketch appears after these notes); decision trees require significant adaptation; classifier ensembles remain effective, but only after significant changes.
- Popular algorithms are no longer effective. Neural nets: slow learners that require many passes. Support Vector Machines (SVM): computationally too expensive.

Decision Tree Classifiers
- A divide-and-conquer approach: simple algorithm, intuitive model.
- Typically a decision tree grows one level for each scan of the data, so multiple scans are required; but if small samples suffice, this problem disappears.
- The data structure is not 'stable': subtle changes in the data can cause global changes in the tree.

Challenge #1
- How many samples do we need to build, in constant time, a tree that is nearly identical to the tree a batch learner (C4.5, SPRINT, ...) would build?
- Nearly identical? For categorical attributes: with high probability, the attribute we choose for a split is the same attribute a batch learner would choose, hence an identical decision tree. For continuous attributes: discretize them into categorical ones.
- ...Forget concept shift/drift for now.

Hoeffding Bound
- Also known as the additive Chernoff bound.
- Given:
  – r : a real-valued random variable
  – n : the number of independent observations of r
  – R : the range of r
- Then, with probability 1 - δ, the true mean of r is at least r_avg - ε; that is, P(μ_r ≥ r_avg - ε) ≥ 1 - δ, where ε = sqrt(R² ln(1/δ) / (2n)).

Hoeffding Bound Properties
- The Hoeffding bound is independent of the data distribution.
- The error ε decreases as n (the number of samples) increases.
- At each node, we accumulate enough samples (n) before we make a split (a small sketch of this split test appears after these notes).

Nearly Identical?
- Categorical attributes: with high probability, the attribute we choose for the split is the same attribute a batch learner would choose; thus we obtain an identical decision tree.
- Continuous attributes: discretize them into categorical ones.
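The following is a minimal sketch (not from the notes) of how the Hoeffding bound above can drive the split decision in a streaming decision tree, in the spirit of Hoeffding trees such as VFDT: split only when the observed gain of the best attribute exceeds that of the runner-up by more than ε. The function names, the choice of information gain as the split measure, and the numbers in the example are illustrative assumptions.

```python
import math

def hoeffding_epsilon(R, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2 * n)); with probability 1 - delta
    the observed mean of r is within epsilon of its true mean."""
    return math.sqrt((R * R * math.log(1.0 / delta)) / (2.0 * n))

def should_split(gain_best, gain_second, R, delta, n):
    """Split when the best attribute's observed information gain beats the
    runner-up's by more than epsilon, so that with probability >= 1 - delta
    a batch learner would pick the same attribute."""
    return (gain_best - gain_second) > hoeffding_epsilon(R, delta, n)

# epsilon shrinks as n grows; for a two-class problem information gain lies
# in [0, 1], so R = 1 (values below are illustrative).
for n in (100, 1000, 10000):
    print(n, round(hoeffding_epsilon(R=1.0, delta=1e-6, n=n), 4))

print(should_split(gain_best=0.30, gain_second=0.20, R=1.0, delta=1e-6, n=1000))  # True
```

With δ = 10^-6 and R = 1, ε falls from about 0.26 at n = 100 to about 0.026 at n = 10,000, which is the "accumulate enough samples before splitting" rule in action.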
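Earlier, the notes observe that NBC "survives with only minor changes" in the streaming setting. The sketch below is my own illustration of why: the sufficient statistics of a Naive Bayes classifier are simple counts, so each sample can be absorbed in constant time and constant memory with a single pass. The class name, the dictionary-of-attributes input format, and the add-one smoothing are assumptions made for the example.

```python
import math
from collections import defaultdict

class StreamingNBC:
    """Illustrative streaming Naive Bayes: only counts are kept, updated in
    constant time and constant memory per sample (one pass over the stream)."""

    def __init__(self):
        self.total = 0
        self.class_counts = defaultdict(int)   # N(c)
        self.value_counts = defaultdict(int)   # N(attr = value, c)
        self.attr_values = defaultdict(set)    # distinct values seen per attribute

    def update(self, x, label):
        """Absorb one labeled sample x = {attr: value}; no raw data is stored."""
        self.total += 1
        self.class_counts[label] += 1
        for attr, value in x.items():
            self.value_counts[(attr, value, label)] += 1
            self.attr_values[attr].add(value)

    def predict(self, x):
        """Return the class maximizing log P(c) + sum log P(value | c),
        with add-one (Laplace) smoothing."""
        best_class, best_score = None, float("-inf")
        for c, nc in self.class_counts.items():
            score = math.log(nc / self.total)
            for attr, value in x.items():
                v = len(self.attr_values[attr])
                score += math.log((self.value_counts[(attr, value, c)] + 1) / (nc + v))
            if score > best_score:
                best_class, best_score = c, score
        return best_class

# Tiny usage example with made-up data.
nbc = StreamingNBC()
nbc.update({"color": "red", "shape": "round"}, "apple")
nbc.update({"color": "yellow", "shape": "long"}, "banana")
print(nbc.predict({"color": "red", "shape": "round"}))  # -> "apple"
```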