Incremental Generalized Eigenvalue Classification on Data Streams
[email protected]
High Performance Computing and Networking Institute
National Research Council – Naples, ITALY
International Workshop on Data Stream Management and Mining, Beijing, October 27-29, 2008

Acknowledgements
Panos Pardalos, Onur Seref, Claudio Cifarelli, Davide Feminiano, Salvatore Cuciniello, Rosanna Verde

Outline
• Introduction: challenges, applications and existing methods
• ReGEC (Regularized Generalized Eigenvalue Classifier)
• I-ReGEC (Incremental ReGEC)
• I-ReGEC on data streams
• Experiments

Supervised learning
• Supervised learning refers to the capability of a system to learn from examples.
• The trained system is able to answer new questions.
• "Supervised" means that the desired answers for the training set are provided by an external teacher.
• Binary classification is among the most successful methods for supervised learning.

Supervised learning
• The Incremental Regularized Generalized Eigenvalue Classifier (I-ReGEC) is a supervised learning algorithm that uses a subset of the training data.
• Its advantage is that the classification model can be updated incrementally: the algorithm decides online which points bring new information and updates the model accordingly.
• Experiments and comparisons assess I-ReGEC's classification accuracy and processing speed.

The challenges
• Applications on massive data sets are emerging with increasing frequency, and the data have to be analyzed as soon as they are produced.
• Legacy databases and warehouses with petabytes of data cannot be loaded into main memory, so their data are accessed as streams.
• Classification algorithms therefore have to deal with large amounts of data delivered in the form of streams.

The challenges
• Classification has to be performed online: data are processed on the fly, at transmission speed.
• Data sampling is needed: classification models must not overfit the data, yet must be detailed enough to describe the phenomena.
• Training behavior changes over time: the nature of the data changes as the stream evolves.

Applications on data streams
• Sensor networks: power grids, telecommunications, bio, seismic, security, …
• Computer network traffic: spam, intrusion detection, IP traffic logs, …
• Bank transactions: financial data, frauds, credit cards, …
• Web: browser clicks, user queries, link rating, …

Support Vector Machines
• The SVM algorithm is among the most successful methods for classification, and its variations have been applied to data streams.
• General-purpose methods are only suitable for small problems.
• For large problems, chunking, subset selection and decomposition methods work on subsets of points.
• SVM-Light and libSVM are among the most widely used implementations based on chunking, subset selection and decomposition.

Support Vector Machines
Find two parallel hyperplanes with maximum margin that leave all points of each class on one side; the points that realize the margin are the support vectors. With the two classes stacked in the matrices A and B, and e a vector of ones:

  min_{w,b} ‖w‖²
  s.t.  Aw + eb ≥ e
        Bw + eb ≤ −e
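To make the formulation concrete, here is a minimal sketch (mine, not from the talk) that solves the hard-margin primal above on a toy 2-D problem with scipy's SLSQP solver; the names A, B, w and b mirror the slide's notation, and the data are synthetic.

```python
# Minimal sketch of the hard-margin linear SVM primal from the slide:
#   min ||w||^2   s.t.   Aw + eb >= e,   Bw + eb <= -e
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
A = rng.normal(loc=[2, 2], scale=0.5, size=(40, 2))    # class +1
B = rng.normal(loc=[-2, -2], scale=0.5, size=(40, 2))  # class -1

def objective(v):
    w = v[:-1]
    return w @ w                                # ||w||^2

def feasibility(v):
    w, b = v[:-1], v[-1]
    # SLSQP expects constraints in the form g(v) >= 0:
    return np.concatenate([A @ w + b - 1.0,     # Aw + eb >= e
                           -(B @ w + b) - 1.0]) # Bw + eb <= -e

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": feasibility}])
w, b = res.x[:-1], res.x[-1]
print("w =", w, "b =", b)
# A new point x falls on the A side if x.w + b > 0:
print("label of [1.5, 1.8]:", np.sign(np.array([1.5, 1.8]) @ w + b))
```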
SVM for data streams
• The batch technique applies SVM to the complete data set.
• The error-driven technique randomly stores k samples in a training set and the others in a test set; if a point is classified correctly, it remains in the training set.
• The fixed-partition technique divides the training set into batches of fixed size; this partitioning permits adding points to the current SVM according to the ones loaded in memory.

SVM for data streams
• The exceeding-margin technique checks, at time t and for each new point, whether the new data exceed the margin evaluated by the SVM; if so, the point is added to the incremental training set, otherwise it is discarded.
• The fixed-margin + errors technique adds a new point to the training set if it either exceeds the margin or is misclassified.

Pros of SVM-based methods
• They require a minimal computational burden to build classification models.
• In the case of kernel-based nonlinear classification, they reduce the size of the training set and thus of the related kernel matrix.
• All of these methods show that a sensible data reduction is possible while maintaining a comparable level of classification accuracy.

A different approach: ReGEC
Find two hyperplanes, each the closest to one set and the furthest from the other:

  min_{(w,γ)≠0} ‖Aw − eγ‖² / ‖Bw − eγ‖²

M.R. Guarracino, C. Cifarelli, O. Seref, P. Pardalos. A Classification Method Based on Generalized Eigenvalue Problems. OMS, 2007.

ReGEC formulation
Let G = [A −e]'[A −e], H = [B −e]'[B −e], and z = [w' γ]'. The problem above becomes

  min_{z≠0} (z'Gz) / (z'Hz),

the Rayleigh quotient of the generalized eigenvalue problem Gx = λHx.

ReGEC classification of new points
A new point x is assigned to the class described by the closest hyperplane:

  dist(x, P_i) = |x'w_i − γ_i| / ‖w_i‖

ReGEC
Let [w1, γ1] and [w2, γ2] be the eigenvectors associated with the minimum and maximum eigenvalues of Gx = λHx. Then
• a ∈ A is closer to x'w1 − γ1 = 0 than to x'w2 − γ2 = 0,
• b ∈ B is closer to x'w2 − γ2 = 0 than to x'w1 − γ1 = 0.

Incremental learning
• The purpose of incremental learning is to find a small and robust subset of the training set that provides comparable accuracy results.
• A smaller set of points reduces the probability of overfitting the problem.
• A classification model built from a smaller subset is computationally more efficient in predicting new points.
• As new points become available, the cost of retraining the algorithm decreases if the influence of the new points is evaluated only against the small subset.
C. Cifarelli, M.R. Guarracino, O. Seref, S. Cuciniello, P.M. Pardalos. Incremental Classification with Generalized Eigenvalues. JoC, 2007.

Incremental learning algorithm
(C is the full training set, C0 an initial incremental set, Γk the remaining candidates, Mk the misclassified points; a Python sketch of this selection loop follows below.)
1: Γ0 = C \ C0
2: {M0, Acc0} = Classify(C, C0)
3: k = 1
4: while |Γk| > 0 do
5:   xk = arg max_{x ∈ Mk−1 ∩ Γk−1} dist(x, P_class(x))
6:   {Mk, Acck} = Classify(C, Ck−1 ∪ {xk})
7:   if Acck > Acck−1 then
8:     Ck = Ck−1 ∪ {xk}
9:   end if
10:  Γk = Γk−1 \ {xk}
11:  k = k + 1
12: end while

Incremental classification on data streams
• Find a small and robust subset of the training set while accessing the data available in a window of size wsize.
• When the window is full, all its points are processed by the classifier.
M.R. Guarracino, S. Cuciniello, D. Feminiano. Incremental Generalized Eigenvalue Classification on Data Streams. MODULAD, 2007.
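Since the whole classifier reduces to one generalized eigenvalue problem, the ReGEC training step fits in a few lines. The following is a hedged sketch, not the authors' code: it builds G and H exactly as on the formulation slide, adds a small Tikhonov term delta·I as an assumption (standing in for the regularization that gives ReGEC its name, whose exact form is in the OMS 2007 paper), and classifies by distance from the two hyperplanes.

```python
# Sketch of linear ReGEC via scipy's generalized eigensolver.
# Notation follows the slides: G = [A -e]'[A -e], H = [B -e]'[B -e],
# and the planes x'w - gamma = 0 come from the extreme eigenvectors.
import numpy as np
from scipy.linalg import eig

def regec_train(A, B, delta=1e-4):
    GA = np.hstack([A, -np.ones((A.shape[0], 1))])
    GB = np.hstack([B, -np.ones((B.shape[0], 1))])
    G, H = GA.T @ GA, GB.T @ GB
    n = G.shape[0]
    # Assumed regularization: a small ridge keeps the pencil nonsingular.
    vals, vecs = eig(G + delta * np.eye(n), H + delta * np.eye(n))
    vals = vals.real
    z1 = vecs[:, np.argmin(vals)].real   # plane closest to A, furthest from B
    z2 = vecs[:, np.argmax(vals)].real   # plane closest to B, furthest from A
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])

def regec_classify(x, plane_a, plane_b):
    # dist(x, P_i) = |x'w - gamma| / ||w||, as on the classification slide
    d = [abs(x @ w - g) / np.linalg.norm(w) for w, g in (plane_a, plane_b)]
    return "A" if d[0] < d[1] else "B"

rng = np.random.default_rng(1)
A = rng.normal([2, 2], 0.6, (100, 2))
B = rng.normal([-2, -2], 0.6, (100, 2))
pa, pb = regec_train(A, B)
print(regec_classify(np.array([1.8, 2.1]), pa, pb))  # expected: A
```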
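And here is one way the selection loop from the algorithm slide can be realized. This is a sketch under assumptions: fit_score is a hypothetical helper that retrains a classifier on the candidate set and reports accuracy, misclassified points and distances (playing the role of Classify, Mk, Acck and dist in the pseudocode), and the accuracy bookkeeping keeps the best value so far, a simplification of the pseudocode's update. A toy nearest-centroid fit_score is included only to make the sketch run; in practice ReGEC would take its place.

```python
# Sketch of the incremental selection loop (algorithm slide above).
import numpy as np

def incremental_select(X, y, init_idx, fit_score):
    kept = set(init_idx)                      # C0: initial incremental set
    gamma = set(range(len(X))) - kept         # Γ0 = C \ C0: candidates
    acc, miss, dist = fit_score(X[list(kept)], y[list(kept)], X, y)
    while gamma:
        cand = list(gamma & miss)             # misclassified candidates
        if not cand:
            break
        xk = max(cand, key=lambda i: dist[i]) # furthest from its class
        trial = list(kept | {xk})
        new_acc, new_miss, new_dist = fit_score(X[trial], y[trial], X, y)
        if new_acc > acc:                     # keep xk only if it helps
            kept.add(xk)
            acc, miss, dist = new_acc, new_miss, new_dist
        gamma.discard(xk)
    return sorted(kept)

def toy_fit_score(tX, ty, X, y):
    # Nearest-centroid stand-in for ReGEC, just to exercise the loop.
    c0, c1 = tX[ty == 0].mean(axis=0), tX[ty == 1].mean(axis=0)
    d0 = np.linalg.norm(X - c0, axis=1)
    d1 = np.linalg.norm(X - c1, axis=1)
    pred = (d1 < d0).astype(int)
    dist = np.where(y == 0, d0, d1)           # distance to own centroid
    miss = set(np.flatnonzero(pred != y))
    return (pred == y).mean(), miss, dist

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(incremental_select(X, y, init_idx=[0, 50], fit_score=toy_fit_score))
```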
I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams
• The stream is read through a window of wsize points.
• At each step, the data in the window (new points together with old ones) are processed with the incremental learning classifier, and the hyperplanes are built.
• As new points are processed step by step, I-ReGEC updates the hyperplane configuration.
• Not all points are kept: a point is discarded if its information contribution is useless.
• New, unknown incoming points are classified by their distance from the hyperplanes.
• The result is an incremental learning technique based on ReGEC that determines the classification model from a very small sample of the data stream (a driver sketch is given at the end).

Experiments
Large-noisy-crossed-norm data set:
• 200,000 points with 20 features, equally divided into 2 classes
• 100,000 training points and 100,000 test points
• Each class is drawn from a multivariate normal distribution

Misclassification results
I-ReGEC has the lowest error and uses the smallest incremental set.
Legend: B = batch SVM; ED = error-driven KNN; FP = fixed-partition SVM; EM = exceeding-margin SVM; EM+E = fixed margin + errors.

Window size
• Larger windows lead to smaller training subsets, while execution time increases as the window grows.
• One billion elements per day can be processed on standard hardware.

Conclusion and future work
• The classification accuracy of I-ReGEC compares well with that of other methods.
• I-ReGEC produces small incremental training sets.
• Future work: investigate how to dynamically adapt the window size to the stream rate, and how to handle nonstationary data streams.
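Finally, a hedged sketch of the window mechanism described on the stream slides: a driver that buffers wsize points and hands each full window to an incremental update step. update_model is a hypothetical stand-in for the I-ReGEC update (for instance, the selection loop sketched earlier applied to the window plus the current incremental set); it is not an API from the talk.

```python
# Sketch of the stream driver: fill a window of wsize points, then let
# the incremental classifier decide which of them enter the model.
from collections import deque
from typing import Any, Callable, Iterable, Tuple

def run_stream(stream: Iterable[Tuple[Any, int]],
               update_model: Callable,
               wsize: int = 100):
    window, model = deque(), None
    for point, label in stream:
        window.append((point, label))
        if len(window) == wsize:            # window full: process it
            model = update_model(model, list(window))
            window.clear()
    if window:                              # flush the last partial window
        model = update_model(model, list(window))
    return model

# Usage sketch (names are hypothetical):
#   model = run_stream(sensor_readings(), update_model=iregec_update, wsize=500)
```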