Download slides - UCLA Computer Science

An Adaptive Nearest Neighbor Classification Algorithm for Data Streams Yan-Nei Law & Carlo Zaniolo University of California, Los Angeles PKDD, Porto, 2005 1 Outline     Related Work ANNCAD Properties of ANNCAD Conclusion 2 Classifying Data Streams  Problem Statement: We seek an algorithm for classifying data streams with numerical attributes--will work for totally ordered domains too.  Desiderata:  Fast update speed for newly arriving records.  Only require single pass of data.  Incremental algorithms are needed.  Coping with concept changes.  Classical mining algorithms were not designed for data streams and need to replaced or modified. 3 Classifying Data Streams: Related Work  Hoeffding trees:     VFDT and CVFDT: build decision tree incrementally. Require a large amount of examples to obtain a fair performance classifier. Unsatisfied performance when training set is small. Ensemble:    Combine base models by voting technique. Suitable for coping with concept drift. Fail to provide a simple model and understanding of the problem. 4 State of the Art: NearestNeighborhood Classifiers  Pros and cons:     +: Strong intuitive appeal and simple to implement -: Fail to provide simple models/rules -: Expensive Computations ANN: Approximate Nearest Neighborhood with error guarantee 1+ε:    Idea: pre-processing the data by devising a data structure (e.g. ring-cover tree) to speed up the searchings. Designed for stored data only. Time for update the pre-processing step depends on size of data set which may be infinite. 5 Our Algorithm: ANNCAD Adaptive NN Classification Algorithm for Data Streams  Model building:     Pre-assign classes to obtain an approximate result and provide simple models/rules. Decompose the feature space to make classification decisions. Akin to wavelets. Classification:   Find NN for classification adaptively. progressively expand the searching of nearby area of a test point (star). 6 Quantize Feature Space and Compute Multi-resolution Coefficients (8+9+10+0)/4 A set of 100 twoclass training points Quantize Feature Space and record information into data arrays Multi-resolution representation of a two-class data set. 7 Building a Classifier B=6.75; R=0.6 B=2; R=4.25  Blue  M(ix) Label each block with its majority class B=3; R=3.25  M(ix) Label block only if |C1st|-|C2nd| > 80% Hierarchical structure of ANNCAD Classifier 8 Decision Algorithm on the ANNCAD Hierarchy Classified Classifiedblock block  Label LabelUnclassified class classIII block, go to next level. The combined classifier over multiple levels Block with tag “M”, go back to prev. level. Compute the distance between the test point and the center of every nonempty neighboring block. 9 Incremental Update 0 8 8 0 6.75 10 9 0 0 10 2 1 0 0 0 0 0 1 2 3.0625 3 3 0.25 0.5 New training point 10 Concept Drift: Adaptation by Exponential Forgetting  Data Array , Factor 01: new   old    No effect if no concept changes Adapt quickly (exponentially) if concept changes No extra memory needed (sliding window required.) 11 Grid Position and Resolution   Problem: Neighborhood decision strongly depends on grid position Solution: Build several classifiers by shifting grid position by 1/n. Then combine the results by voting. Example: 4 different grids for building 4 classifiers.  Thm. x: test point, nd classifiers, b(x): Blocks containing x, then:  zb(x), yb(x): dist(x,y)<(1+1/n-1)*dist(x,z).  In practice, only 2-3 classifiers can achieve a good result. 12 Properties of ANNCAD     Compact support: locality property allows fast update Dealing with noise: can set a threshold for classification decision Multi-resolution: to control the fineness of the result, or optimize the system resources. Low complexity (gd = total number of cells)  Building classifier: O(min(N,gd))  Testing: O(log2(g)+2d).  Updating: log2(g)+1. 13 Experiments   Synthetic Data  3-d unit cube:  Class distribution: class 0 inside sphere with radius 0.5 class 1 outside  3000 training examples  1000 test examples (a) different initial resolutions. Exact ANN:   Expand the searching area by double the radius until reaching some training point. Classify the test point with the majority class. (b) different # ensembles. 14 Experiments (Cont’)  Real Data 1 -- Letter Recognition        Objective: identify a pixel displays as one of the 26 letter. 16 numerical attributes to describe its pixel displays. 15,000 training examples 5,000 test examples Add 5 % noise by randomly assign class. Grid size: 16 units #Classifiers: 2 15 ANNCAD Vs VFDT  Real Data 2 – Forest Cover Type       Objective: predict forest cover type. 10 numerical attributes. 12,000 training examples 9,000 test examples Grid size: 32 unit #Classifiers: 2 16 Concept Shift: ANNCAD vs CVFDT  Real Data 3 – Adult      Objective: determine a person with salary>50K Concept Shift Simulation: Group by races  = 0.98 Grid Size: 64 #Classifier: 2 17 Conclusion and Future Work  ANNCAD     an incremental classification algorithm to find adaptive NN Suitable for mining data streams: fast update speed Exponential forgetting for concept shift/drift. Future Work: Detect concept shift/drift by changes in class label of blocks. 18 THANK YOU! 19

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download slides - UCLA Computer Science