Classification - SVM
CS 685: Special Topics in Data Mining
Jinze Liu
The UNIVERSITY of KENTUCKY
CS685: Special Topics in Data Mining, UKY
Copyright © 2001, 2003, Andrew W. Moore

Linear Classifiers
Data points are labeled +1 or -1. A linear classifier is
  f(x, w, b) = sign(w . x - b)
How would you classify this data? Many separating lines would do: any of these would be fine... but which is best?

Classifier Margin
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Maximum Margin
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM): the Linear SVM.
Support Vectors are those datapoints that the margin pushes up against.
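The decision function f(x, w, b) = sign(w . x - b) can be written down directly; a minimal sketch, where the weight vector and bias are made-up illustrative values:

```python
import numpy as np

def linear_classify(x, w, b):
    """Linear classifier: return +1 or -1 depending on which side
    of the hyperplane w . x - b = 0 the point x falls on."""
    return 1 if np.dot(w, x) - b >= 0 else -1

# Illustrative line x1 + x2 = 1, i.e. w = [1, 1], b = 1 (made-up values)
w, b = np.array([1.0, 1.0]), 1.0
```

Any point with w . x >= b is classified +1, anything else -1.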
Why Maximum Margin?
1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
3. LOOCV is easy, since the model is immune to removal of any non-support-vector datapoints.
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very very well.

Specifying a line and margin
Plus-Plane, Classifier Boundary, Minus-Plane. How do we represent this mathematically? ...in m input dimensions?
• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
Classify as +1 if w . x + b >= 1, as -1 if w . x + b <= -1; the universe explodes if -1 < w . x + b < 1.

Computing the margin width
M = Margin Width. How do we compute M in terms of w and b?
Claim: the vector w is perpendicular to the Plus Plane. Why? Let u and v be two vectors on the Plus Plane. What is w . (u - v)? And so of course the vector w is also perpendicular to the Minus Plane.
• Let x- be any point on the minus plane.
• Let x+ be the closest plus-plane point to x-. (These are any locations in R^m, not necessarily datapoints.)
Claim: x+ = x- + λw for some value of λ. Why? The line from x- to x+ is perpendicular to the planes, so to get from x- to x+ we travel some distance in direction w.
What we know:
• w . x+ + b = +1
• w . x- + b = -1
• x+ = x- + λw
• |x+ - x-| = M
It's now easy to get M in terms of w and b.
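As a quick numerical check of the claim that w is perpendicular to the plus plane, a sketch with made-up w and b:

```python
import numpy as np

# Hypothetical 2-D weight vector and bias (illustrative values only)
w, b = np.array([3.0, 4.0]), -2.0

def point_on_plus_plane(t):
    """Return a point x satisfying w . x + b = +1.
    x0 is one particular solution; adding any multiple of a vector
    orthogonal to w stays on the plane."""
    x0 = (1 - b) * w / np.dot(w, w)    # particular solution: w . x0 + b = 1
    tangent = np.array([-w[1], w[0]])  # orthogonal to w in 2-D
    return x0 + t * tangent

u, v = point_on_plus_plane(0.7), point_on_plus_plane(-1.3)
# u and v both lie on the plus plane, and w . (u - v) = 0,
# i.e. w is perpendicular to the plane.
on_plane = (np.dot(w, u) + b, np.dot(w, v) + b)  # both ~ 1.0
perp = np.dot(w, u - v)                          # ~ 0.0
```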
Computing the margin width (cont.)
Substituting x+ = x- + λw into w . x+ + b = 1:
  w . (x- + λw) + b = 1
  => w . x- + b + λ(w . w) = 1
  => -1 + λ(w . w) = 1
  => λ = 2 / (w . w)
So
  M = |x+ - x-| = |λw| = λ|w| = λ sqrt(w . w) = 2 sqrt(w . w) / (w . w) = 2 / sqrt(w . w)

Learning the Maximum Margin Classifier
Given a guess of w and b we can
• compute whether all data points are in the correct half-planes, and
• compute the width of the margin, M = 2 / sqrt(w . w).
So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How? Gradient descent? Simulated annealing? Matrix inversion? EM? Newton's method?

Estimate the Margin
What is the distance from a point x to the line w x + b = 0 (in d dimensions)?
  d(x) = |x . w + b| / sqrt(sum_{i=1..d} w_i^2)
What is the expression for the margin? The distance from the boundary to the closest point:
  margin = min_{x in D} d(x) = min_{x in D} |x . w + b| / sqrt(sum_{i=1..d} w_i^2)

Maximize Margin
  argmax_{w,b} margin(w, b, D)
  = argmax_{w,b} min_{x_i in D} d(x_i)
  = argmax_{w,b} min_{x_i in D} |x_i . w + b| / sqrt(sum_{i=1..d} w_i^2)
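The derivation gives M = 2 / sqrt(w . w), i.e. twice the reciprocal of the norm of w. A one-line sanity check (the example w is made up):

```python
import numpy as np

def margin_width(w):
    """M = |x+ - x-| = |lambda * w| with lambda = 2 / (w . w),
    so M = 2 / sqrt(w . w) = 2 / ||w||."""
    return 2.0 / np.linalg.norm(w)

# Example: w = [3, 4] has ||w|| = 5, so the margin is 2/5 = 0.4
M = margin_width(np.array([3.0, 4.0]))
```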
Maximize Margin (cont.)
  argmax_{w,b} min_{x_i in D} |x_i . w + b| / sqrt(sum_{i=1..d} w_i^2)
  subject to: for all x_i in D, y_i (x_i . w + b) >= 0
This is a min-max problem, i.e. a game problem.
Strategy: fix the scale of w and b so that the closest points satisfy |x_i . w + b| = 1. Then the problem becomes
  argmin_{w,b} sum_{i=1..d} w_i^2
  subject to: for all x_i in D, y_i (x_i . w + b) >= 1

Learning via Quadratic Programming
QP is a well-studied class of optimization algorithms for optimizing a quadratic function of some real-valued variables subject to linear constraints.

Suppose we're in 1 dimension
What would SVMs do with this data (points on a line, one class on each side of x = 0)?
Not a big surprise: the positive "plane" and negative "plane" are just the two points bracketing the widest gap.

Harder 1-dimensional dataset
Now the classes interleave around x = 0, with one class between two clusters of the other. That's wiped the smirk off SVM's face. What can be done about this?
Remember how permitting nonlinear basis functions made linear regression so much nicer? Let's permit them here too:
  z_k = (x_k, x_k^2)
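The quadratic basis expansion z_k = (x_k, x_k^2) shows why the interleaved 1-d data becomes linearly separable; a small sketch (the example points are made up):

```python
import numpy as np

def expand(x):
    """Map each 1-d point x_k to z_k = (x_k, x_k^2).
    Points that straddle the origin (e.g. -2 and 2 in one class, 0 in
    the other) are not linearly separable on the line, but are in the
    (x, x^2) plane."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x, x ** 2])

# Class +1 at x = -2 and x = 2, class -1 at x = 0:
Z = expand([-2.0, 0.0, 2.0])
# In z-space the horizontal line x^2 = 1 separates the two classes.
```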
Common SVM basis functions
• z_k = (polynomial terms of x_k of degree 1 to q)
• z_k = (radial basis functions of x_k):  z_k[j] = φ_j(x_k) = KernelFn(|x_k - c_j| / KW)
• z_k = (sigmoid functions of x_k)
This is sensible. Is that the end of the story? No... there's one more trick!

SVM Performance
• Anecdotally they work very very well indeed.
• Example: they are currently the best-known classifier on a well-studied handwritten-character recognition benchmark.
• Another example: Andrew knows several reliable people doing practical real-world work who claim that SVMs have saved them when their other favorite classifiers did poorly.
• There is a lot of excitement and religious fervor about SVMs as of 2001.

Doing multi-class classification
• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2). What can be done?
• Answer: with output arity N, learn N SVMs:
  - SVM 1 learns "Output == 1" vs "Output != 1"
  - SVM 2 learns "Output == 2" vs "Output != 2"
  - ...
  - SVM N learns "Output == N" vs "Output != N"
• Then, to predict the output for a new input, just predict with each SVM and find out which one predicts its class most confidently.

SVM Related Links
• http://svm.dcs.rhbnc.ac.uk/
• http://www.kernel-machines.org/
• C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
• SVMlight, software (in C): http://ais.gmd.de/~thorsten/svm_light
• Book: An Introduction to Support Vector Machines, N. Cristianini and J. Shawe-Taylor, Cambridge University Press.
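The one-vs-rest scheme for multi-class classification can be sketched as follows. Here `toy_learner` is a hypothetical stand-in for a real binary SVM trainer, used only to make the sketch runnable; any learner returning a scoring function would slot in:

```python
import numpy as np

def train_one_vs_rest(X, y, classes, train_binary):
    """Learn one binary classifier per class: model k learns
    'Output == k' vs 'Output != k'. train_binary is any two-class
    learner returning a scoring function f(x) (larger = more positive)."""
    return {k: train_binary(X, np.where(y == k, 1, -1)) for k in classes}

def predict_one_vs_rest(models, x):
    """Predict with each model and pick the class whose model
    scores x highest (most confidently positive)."""
    return max(models, key=lambda k: models[k](x))

def toy_learner(X, y_pm):
    """Stand-in learner (NOT an SVM): score by negative distance
    to the mean of the positive class."""
    mean_pos = X[y_pm == 1].mean(axis=0)
    return lambda x: -np.linalg.norm(np.asarray(x) - mean_pos)

# Toy 2-class data (made up):
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y = np.array([0, 0, 1, 1])
models = train_one_vs_rest(X, y, [0, 1], toy_learner)
```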
Classification - CBA
CS 685: Special Topics in Data Mining, Spring 2008
Jinze Liu
The UNIVERSITY of KENTUCKY

Association Rules
  Transaction id | Items bought
  100 | f, a, c, d, g, i, m, p
  200 | a, b, c, f, l, m, o
  300 | b, f, h, j, o
  400 | b, c, k, s, p
  500 | a, f, c, e, l, p, m, n
(Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both.)
An itemset is X = {x1, ..., xk}. Find all the rules X -> Y with minimum support and confidence:
• support, s, is the probability that a transaction contains X ∪ Y
• confidence, c, is the conditional probability that a transaction having X also contains Y
Example (on a small toy dataset): let supmin = 50%, confmin = 50%. Association rules: A -> C (60%, 100%), C -> A (60%, 75%).

Classification based on Association
Classification rule mining versus association rule mining:
• Aim: a small set of rules as a classifier, versus all rules satisfying minsup and minconf
• Syntax: X -> y (y a class label), versus X -> Y (Y an itemset)

Why & How to Integrate
• Both classification rule mining and association rule mining are indispensable to practical applications.
• The integration is done by focusing on a special subset of association rules whose right-hand side is restricted to the classification class attribute: CARs, class association rules.

CBA: Three Steps
• Discretize continuous attributes, if any.
• Generate all class association rules (CARs).
• Build a classifier based on the generated CARs.

Our Objectives
• To generate the complete set of CARs that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf) constraints.
• To build a classifier from the CARs.
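Support and confidence as defined above can be computed directly. The toy transactions below are made up, chosen so that the example's A -> C (60%, 100%) and C -> A (60%, 75%) numbers come out:

```python
def support_confidence(transactions, X, Y):
    """support(X -> Y)    = fraction of transactions containing X and Y;
    confidence(X -> Y) = fraction of X-containing transactions that
    also contain Y."""
    X, Y = set(X), set(Y)
    n_x = sum(1 for t in transactions if X <= set(t))
    n_xy = sum(1 for t in transactions if (X | Y) <= set(t))
    return n_xy / len(transactions), n_xy / n_x

# Made-up transactions: A appears in 3 of 5, C in 4 of 5, both in 3 of 5
T = [{'A', 'C'}, {'A', 'C', 'D'}, {'A', 'C', 'E'}, {'C', 'B'}, {'B', 'F'}]
```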
Rule Generator: Basic Concepts
• Ruleitem <condset, y>: condset is a set of items, y is a class label. Each ruleitem represents a rule condset -> y.
• condsupCount: the number of cases in D that contain condset.
• rulesupCount: the number of cases in D that contain condset and are labeled with class y.
• Support = (rulesupCount / |D|) * 100%
• Confidence = (rulesupCount / condsupCount) * 100%

RG: Basic Concepts (cont.)
• Frequent ruleitems: a ruleitem is frequent if its support is above minsup.
• Accurate rule: a rule is accurate if its confidence is above minconf.
• Possible rule: of all ruleitems that have the same condset, the ruleitem with the highest confidence is the possible rule of this set of ruleitems.
• The set of class association rules (CARs) consists of all the possible rules (PRs) that are both frequent and accurate.

RG: An Example
A ruleitem <{(A,1),(B,1)}, (class,1)>. Assume that:
• the support count of the condset (condsupCount) is 3,
• the support count of this ruleitem (rulesupCount) is 2, and
• |D| = 10.
Then for the rule (A,1),(B,1) -> (class,1):
• support = 20%  ((rulesupCount / |D|) * 100%)
• confidence = 66.7%  ((rulesupCount / condsupCount) * 100%)

RG: The Algorithm
 1  F_1 = {large 1-ruleitems};
 2  CAR_1 = genRules(F_1);
 3  prCAR_1 = pruneRules(CAR_1);    // count the item and class occurrences to determine the frequent 1-ruleitems, and prune
 4  for (k = 2; F_{k-1} != Ø; k++) do
 5      C_k = candidateGen(F_{k-1});    // generate the candidate ruleitems C_k using the frequent ruleitems F_{k-1}
 6      for each data case d in D do    // scan the database
 7          C_d = ruleSubset(C_k, d);   // find all the ruleitems in C_k whose condsets are supported by d
 8          for each candidate c in C_d do
 9              c.condsupCount++;
10              if d.class = c.class then c.rulesupCount++;    // update the support counts of the candidates in C_k
11          end
12      end
RG: The Algorithm (cont.)
13      F_k = {c in C_k | c.rulesupCount >= minsup};    // select the new frequent ruleitems to form F_k
14      CAR_k = genRules(F_k);    // select the ruleitems that are both accurate and frequent
15      prCAR_k = pruneRules(CAR_k);
16  end
17  CARs = ∪_k CAR_k;
18  prCARs = ∪_k prCAR_k;
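The database-scan counting step of the algorithm (lines 6-12) can be sketched as follows. The representation of data cases as (item-set, class) pairs and of candidates as dicts is an assumption made for illustration, not the paper's data structures:

```python
def count_ruleitems(D, candidates):
    """One database scan of the CBA rule-generator counting step:
    for each case d, increment condsupCount for every candidate
    whose condset d supports, and rulesupCount when the class label
    also matches. Each candidate is a dict with 'condset' (a set of
    (attribute, value) items), 'cls', and the two counters."""
    for items, cls in D:                      # scan the database
        for c in candidates:
            if c['condset'] <= items:         # condset supported by d
                c['condsupCount'] += 1
                if cls == c['cls']:           # class label matches too
                    c['rulesupCount'] += 1
    return candidates

# Reproduce the counts from the RG example: condsupCount = 3, rulesupCount = 2
cand = {'condset': {('A', 1), ('B', 1)}, 'cls': ('class', 1),
        'condsupCount': 0, 'rulesupCount': 0}
D = [({('A', 1), ('B', 1)}, ('class', 1)),
     ({('A', 1), ('B', 1)}, ('class', 1)),
     ({('A', 1), ('B', 1)}, ('class', 0)),
     ({('A', 1)},           ('class', 1))]
count_ruleitems(D, [cand])
# With |D| = 10 as in the example, support = 2/10 = 20% and
# confidence = 2/3 ~ 66.7%.
```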