Learning from Data Streams
CS240B Notes
by
Carlo Zaniolo
UCLA CSD
With slides from an ICDE 2005 tutorial by
Haixun Wang, Jian Pei & Philip Yu
1
What are the Challenges?
 Data Volume
  impossible to mine the entire data set at one time
  can only afford constant memory per data sample
 Concept Drifts
  previously learned models become invalid
 Cost of Learning
  model updates can be costly
  can only afford constant time per data sample
2
On-Line Learning
 Learning (Training):
  Input: a data set of pairs (a, b), where a is a feature vector and b a class label
  Output: a model (e.g., a decision tree)
 Testing:
  Input: a test sample (x, ?)
  Output: a class label prediction for x
 When mining data streams, the two phases are combined, as in the sketch below
3
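A minimal sketch of this combined, predict-then-train loop (the stream format and the toy model are illustrative assumptions, not part of the slides):

```python
from collections import defaultdict

class MajorityClassModel:
    """Trivial incremental model: always predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = defaultdict(int)

    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

    def update(self, x, y):
        self.counts[y] += 1          # constant time and memory per sample

def mine_stream(stream, model):
    """Interleave testing and training: predict on each sample, then learn from it."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:
            correct += 1
        model.update(x, y)
        total += 1
    return correct / total if total else 0.0

# Example with a tiny synthetic stream of (feature_vector, label) pairs.
stream = [([0.1, 1.0], "a"), ([0.3, 0.8], "a"), ([0.9, 0.2], "b")]
print(mine_stream(stream, MajorityClassModel()))
```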
Mining Data Streams: Challenges
 On-line response (NB), limited memory, most recent windows only
 Fast & Light algorithms needed that
  minimize usage of memory and CPU
  require only one (or a few) passes through the data
 Concept shift/drift: changes in the statistics of the mined data
  render previously learned models inaccurate or invalid
 Robustness and Adaptability: quickly recover/adjust after concept changes
 Popular machine learning algorithms no longer effective:
  Neural nets: slow learners requiring many passes
  Support Vector Machines (SVM): computationally too expensive
4
Classifier Algorithms:
from databases to data streams
 New algorithms have emerged:
  Bloom Filters
  ANNCAD
 Existing algorithms have been adapted:
  NBC survives with only minor changes
  Decision Trees require significant adaptation
  Classifier ensembles remain effective, after significant changes
 Popular algorithms no longer effective:
  Neural nets: slow learners requiring many passes
  Support Vector Machines (SVM): computationally too expensive
5
Decision Tree Classifiers
 A divide-and-conquer approach
  Simple algorithm, intuitive model
 Typically a decision tree grows one level for each scan of the data
  Multiple scans are required
  But if we can use small samples, this problem disappears
 However, the data structure is not 'stable'
  Subtle changes in the data can cause global changes in the tree structure
6
Challenge #1
How many samples do we need to build, in constant time, a tree that is nearly identical to the tree a batch learner (C4.5, SPRINT, ...) would build?
Nearly identical?
 Categorical attributes:
  with high probability, the attribute we choose for the split is the same attribute as would be chosen by a batch learner
  hence an identical decision tree
 Continuous attributes:
  discretize them into categorical ones
...Forget concept shift/drift for now
7
Hoeffding Bound
 Also known as the additive Chernoff Bound
 Given
– r : real-valued random variable
– n : # of independent observations of r
– R : range of r
 The true mean of r is at least r_avg − ε, with probability 1 − δ
 Or: P(μ_r ≥ r_avg − ε) = 1 − δ, where ε = √( R² ln(1/δ) / (2n) ) (see the sketch below)
8
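To make the bound concrete, here is a minimal sketch (variable names are ours, not from the slides) showing how ε is computed and how it shrinks as n grows:

```python
import math

def hoeffding_epsilon(R, delta, n):
    """With probability 1 - delta, the true mean of a random variable with
    range R lies within epsilon of its average over n independent observations."""
    return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Example: information gain over two classes has range R = log2(2) = 1 bit.
for n in (100, 1_000, 10_000):
    print(n, round(hoeffding_epsilon(R=1.0, delta=1e-7, n=n), 4))
# epsilon decreases as n increases, so larger samples give tighter estimates.
```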
Hoeffding Bound
Properties:
 The Hoeffding bound is independent of the data distribution
 The error ε decreases as n (# of samples) increases
At each node, we accumulate enough samples (n) before we make a split, as sketched below.
9
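A minimal sketch of how such a split test is typically applied in Hoeffding-tree style learners (the gain values, delta, and range R below are illustrative assumptions, not the slides' algorithm):

```python
import math

def hoeffding_epsilon(R, delta, n):
    return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(gains, n, R=1.0, delta=1e-7):
    """Split a leaf only when the best attribute beats the runner-up by more
    than epsilon, so that with probability 1 - delta it is the same attribute
    a batch learner would choose on the full data.

    gains: observed information gain for each candidate attribute at this leaf.
    n: number of samples accumulated at this leaf.
    """
    ranked = sorted(gains.values(), reverse=True)
    if len(ranked) < 2:
        return False
    return (ranked[0] - ranked[1]) > hoeffding_epsilon(R, delta, n)

# Example: after 2000 samples, attribute A clearly dominates attribute B.
print(should_split({"A": 0.30, "B": 0.12}, n=2000))   # True
```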
Nearly Identical?
 Categorical attributes:
  with high probability, the attribute we choose for the split is the same attribute as would be chosen by a batch learner
  thus we obtain an identical decision tree
 Continuous attributes:
  discretize them into categorical ones, as in the binning sketch below
10
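A minimal sketch of one common way to discretize a continuous attribute (equal-width binning; the bin count is an arbitrary choice for illustration):

```python
def equal_width_bins(values, num_bins=4):
    """Map each continuous value to a categorical bin label 0..num_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0   # guard against a constant attribute
    return [min(int((v - lo) / width), num_bins - 1) for v in values]

# Example: a continuous attribute becomes a small set of categories.
print(equal_width_bins([0.1, 0.4, 0.45, 0.9, 1.3, 2.0]))   # [0, 0, 0, 1, 2, 3]
```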