COSC 526 Class 8
Classification & Regression Part II
Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: [email protected]

Last Class
• We saw different techniques for handling large-scale classification:
  – Decision trees
  – K Nearest Neighbors (k-NN)
• Modified decision trees to handle streaming datasets
• For k-NN, we discussed efficient data structures to handle large-scale datasets

This class…
• More classification…
• Support vector machines:
  – Basic outline of the algorithm
  – Modifications for large-scale datasets
  – Online approaches

Class Logistics

10% Grade…
• 40% – assignments (2 instead of 3)
• 50% – class project
• What do we do with the 5%?
  – Class participation
  – Select papers updated on the class website
  – Present your take on it (critique)
  – 2 presentations per class
  – 15 minutes each
  – Starting Tue (Feb 17)

Critique Structure
• 2-3 slides: what is the paper all about?
• 2-4 slides: key results presented
• 2-3 slides: key limitations/shortcomings?
• 2-3 slides: what could have been done better?
• A discussion-oriented presentation rather than an in-depth view of the paper
• Peer review of critiques due every week…

Support Vector Machines (SVM)

Support Vector Machines (SVM)
• Easiest explanation:
  – Find the “best linear separator” for a dataset
• Training examples:
  – $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
  – Each data point $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(d)})$
  – $y_i \in \{-1, +1\}$
• In higher-dimensional datasets we want to find the “best hyperplane”
[Figure: classifier diagram with labels α, x, f(x, w), y]

Classifier Margin…
• Margin: the width by which the boundary could be increased before hitting a data point
• Interested in the maximum margin:
  – Simplest is the maximum margin SVM classifier
  – Support vectors
[Figure: separating boundary with the support vectors marked]

Why are we interested in Max Margin SVM?
• Intuitively this feels right:
  – we need the maximum margin to fit the data correctly
• Robust to location of the boundary:
  – even if we have made a small error in the location of the boundary, it has little effect on classification
• The model obtained is tolerant to removal of any non-support-vector data points
  – Validation works well!
• Works very well in practice

Why is the maximum margin a good thing?
• Theoretical convenience and the existence of generalization error bounds that depend on the value of the margin

Why is maximizing $\gamma$ a good idea?
• We all know what the dot product means: $A \cdot B = \|A\| \, \|B\| \cos\theta$
[Figure: vectors A and B, with the projection labeled $\|A\| \cos\theta$]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Why is maximizing $\gamma$ a good idea?
• Dot product: $A \cdot B = \|A\| \, \|B\| \cos\theta$
• What is $w \cdot x_1$, $w \cdot x_2$?
[Figure: three panels showing the points $x_1$, $x_2$ and the vector $w$]
  – In this case $\gamma_1 \approx \|w\|^2$
  – In this case $\gamma_2 \approx 2\|w\|^2$
• So, $\gamma$ roughly corresponds to the margin
  – Bigger $\gamma$, bigger the separation

What is the margin? Distance from a point to a line
[Figure: point A, its projection H onto line L, a point M on L, the normal vector w, and the origin (0,0)]
• Let (note: we assume $\|w\|_2 = 1$):
  – Line L: $w \cdot x + b = w^{(1)} x^{(1)} + w^{(2)} x^{(2)} + b = 0$
  – $w = (w^{(1)}, w^{(2)})$
  – Point $A = (x_A^{(1)}, x_A^{(2)})$
  – Point M on the line $= (x_M^{(1)}, x_M^{(2)})$
• $d(A, L) = |AH| = |(A - M) \cdot w| = |(x_A^{(1)} - x_M^{(1)})\, w^{(1)} + (x_A^{(2)} - x_M^{(2)})\, w^{(2)}| = |x_A^{(1)} w^{(1)} + x_A^{(2)} w^{(2)} + b| = |w \cdot A + b|$
• Remember: $x_M^{(1)} w^{(1)} + x_M^{(2)} w^{(2)} = -b$, since M belongs to line L

Largest Margin
• Prediction $= \mathrm{sign}(w \cdot x + b)$
• “Confidence” $= (w \cdot x + b)\, y$
• For the i-th data point: $\gamma_i = (w \cdot x_i + b)\, y_i$
• Want to solve: $\max_{w,b} \min_i \gamma_i$
[Figure: + and − points on either side of a line with normal vector $w$]

Support Vector Machine
• Maximize the margin:
  – Good according to intuition, theory (VC dimension) & practice
  – $\max_{w,\gamma} \gamma \quad \text{s.t.} \quad \forall i,\; y_i (w \cdot x_i + b) \ge \gamma$
  – $\gamma$ is the margin … distance from the separating hyperplane
[Figure: maximizing the margin around the hyperplane $w \cdot x + b = 0$]

How do we derive the margin?
• Separating hyperplane is defined by the support vectors
  – Points on +/- planes from the solution
  – If you knew these points, you could ignore the rest
  – Generally, d+1 support vectors (for d-dimensional data)

Canonical Hyperplane: Problem
• Let $(w \cdot x + b)\, y = \gamma$
• Now, $(2w \cdot x + 2b)\, y = 2\gamma$
• Scaling $w$ increases the margin!
• Solution:
  – Work with normalized $w$: $\gamma = \left( \frac{w}{\|w\|} \cdot x + b \right) y$
  – Also require support vectors $x_j$ to be on the plane defined by: $w \cdot x_j + b = \pm 1$
[Figure: points $x_1$, $x_2$ on the two margin planes and the unit normal $\frac{w}{\|w\|}$]

Canonical Hyperplane: Solution
• Want to maximize margin $\gamma$
• What is the relation between $x_1$ and $x_2$?
  – $x_1 = x_2 + 2\gamma \frac{w}{\|w\|}$
• We also know:
  – $w \cdot x_1 + b = +1$
  – $w \cdot x_2 + b = -1$
• So: $w \cdot x_1 + b = w \cdot x_2 + b + 2\gamma \frac{w \cdot w}{\|w\|} = +1$, which gives $\gamma = \frac{\|w\|}{w \cdot w} = \frac{1}{\|w\|}$
• Note: $w \cdot w = \|w\|^2$

Maximizing the Margin
• We started with
  $\max_{w,\gamma} \gamma \quad \text{s.t.} \quad \forall i,\; y_i (w \cdot x_i + b) \ge \gamma$
  But $w$ can be arbitrarily large!
• We normalized and...
  $\arg\max \gamma = \arg\max \frac{1}{\|w\|} = \arg\min \|w\| = \arg\min \frac{1}{2} \|w\|^2$
• Then:
  $\min_{w} \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad \forall i,\; y_i (w \cdot x_i + b) \ge 1$
• This is called SVM with “hard” constraints

Non-linearly Separable Data
• If data is not separable introduce a penalty:
  $\min_{w} \frac{1}{2} \|w\|^2 + C \cdot (\text{number of mistakes}) \quad \text{s.t.} \quad \forall i,\; y_i (w \cdot x_i + b) \ge 1$
  – Minimize $\|w\|^2$ plus the number of training mistakes
  – Set C using cross validation
• How to penalize mistakes?
  – All mistakes are not equally bad!
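
As a rough illustration of the canonical-hyperplane result ($\gamma = 1/\|w\|$ once the closest points satisfy $y_i (w \cdot x_i + b) = 1$), the sketch below checks the hard constraints and computes the geometric margin for a hand-picked line. The four data points, the values of `w` and `b`, and the helper names are assumptions made up for this example; they are not from the slides.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def functional_margins(w, b, X, y):
    """y_i * (w . x_i + b) for every training point."""
    return [yi * (dot(w, xi) + b) for xi, yi in zip(X, y)]

def geometric_margin(w, b, X, y):
    """Smallest distance from a point to the hyperplane w . x + b = 0."""
    return min(functional_margins(w, b, X, y)) / math.sqrt(dot(w, w))

# Toy, linearly separable data (assumed for illustration).
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [+1, +1, -1, -1]

# Hand-picked canonical hyperplane: the closest points have margin exactly 1.
w = (0.5, 0.5)
b = -1.0

margins = functional_margins(w, b, X, y)
gamma = geometric_margin(w, b, X, y)

print("hard constraints satisfied:", all(m >= 1.0 for m in margins))
print("geometric margin:", gamma, " 1/||w||:", 1.0 / math.sqrt(dot(w, w)))
```

Because the smallest functional margin is exactly 1 here, the geometric margin coincides with $1/\|w\|$, which is why minimizing $\frac{1}{2}\|w\|^2$ under the constraints $y_i (w \cdot x_i + b) \ge 1$ maximizes the margin.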
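[Figure: + and − points that cannot be separated by a line without mistakes]

Support Vector Machines
• Introduce slack variables $\xi_i$:
  $\min_{w, b, \xi_i \ge 0} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \forall i,\; y_i (w \cdot x_i + b) \ge 1 - \xi_i$
• If point $x_i$ is on the wrong side of the margin then it gets penalty $\xi_i$
• For each data point:
  – If margin $\ge 1$, don’t care
  – If margin $< 1$, pay linear penalty
[Figure: + and − points with slacks $\xi_i$, $\xi_j$ for the points on the wrong side of the margin]

Slack Penalty C
  $\min_{w} \frac{1}{2} \|w\|^2 + C \cdot (\text{number of mistakes}) \quad \text{s.t.} \quad \forall i,\; y_i (w \cdot x_i + b) \ge 1$
• What is the role of the slack penalty C?
  – $C = \infty$: only want $w, b$ that separate the data
  – $C = 0$: can set $\xi_i$ to anything, then $w = 0$ (basically ignores the data)
[Figure: decision boundaries for small C, “good” C, and big C]

Support Vector Machines
• SVM in the “natural” form:
  $\arg\min_{w,b} \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} \max\{0,\; 1 - y_i (w \cdot x_i + b)\}$
  – $\frac{1}{2} w \cdot w$: margin
  – $C$: regularization parameter
  – The sum: empirical loss L (how well we fit training data)
• SVM uses “Hinge Loss”:
  $\min_{w,b} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \forall i,\; y_i (w \cdot x_i + b) \ge 1 - \xi_i$
[Figure: penalty as a function of $z = y_i (x_i \cdot w + b)$, comparing the 0/1 loss with the hinge loss $\max\{0, 1 - z\}$]

SVM: How to estimate w?
  $\min_{w,b} \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \forall i,\; y_i (x_i \cdot w + b) \ge 1 - \xi_i$
• Want to estimate $w$ and $b$!
  – Standard way: Use a solver!
    • Solver: software for finding solutions to “common” optimization problems
• Use a quadratic solver:
  – Minimize a quadratic function
  – Subject to linear constraints
• Problem: Solvers are inefficient for big data!

SVM: How to estimate w?
• Want to estimate $w, b$!
  $\min_{w,b} \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \forall i,\; y_i (x_i \cdot w + b) \ge 1 - \xi_i$
• Alternative approach:
  – Want to minimize $f(w,b)$:
    $f(w,b) = \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} \max\left\{0,\; 1 - y_i \left( \sum_{j=1}^{d} w^{(j)} x_i^{(j)} + b \right)\right\}$
• Side note:
  – How to minimize convex functions $g(z)$?
  – Use gradient descent: $\min_z g(z)$
  – Iterate: $z_{t+1} \leftarrow z_t - \eta \nabla g(z_t)$

SVM: How to estimate w?
• Want to minimize $f(w,b)$:
  $f(w,b) = \frac{1}{2} \sum_{j=1}^{d} \left(w^{(j)}\right)^2 + C \sum_{i=1}^{n} \max\left\{0,\; 1 - y_i \left( \sum_{j=1}^{d} w^{(j)} x_i^{(j)} + b \right)\right\}$
• Compute the gradient $\nabla f^{(j)}$ w.r.t.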
$w^{(j)}$, writing the empirical loss as $L(x_i, y_i) = \max\{0,\; 1 - y_i (w \cdot x_i + b)\}$:
  $\nabla f^{(j)} = \frac{\partial f(w,b)}{\partial w^{(j)}} = w^{(j)} + C \sum_{i=1}^{n} \frac{\partial L(x_i, y_i)}{\partial w^{(j)}}$
  $\frac{\partial L(x_i, y_i)}{\partial w^{(j)}} = 0$ if $y_i (w \cdot x_i + b) \ge 1$, and $-y_i x_i^{(j)}$ otherwise

SVM: How to estimate w?
• Gradient descent: iterate until convergence:
  • For j = 1 … d
    • Evaluate: $\nabla f^{(j)} = \frac{\partial f(w,b)}{\partial w^{(j)}} = w^{(j)} + C \sum_{i=1}^{n} \frac{\partial L(x_i, y_i)}{\partial w^{(j)}}$
    • Update: $w^{(j)} \leftarrow w^{(j)} - \eta \nabla f^{(j)}$
  – $\eta$ … learning rate parameter
  – $C$ … regularization parameter
• Problem:
  – Computing $\nabla f^{(j)}$ takes O(n) time!
    • n … size of the training dataset

SVM: How to estimate w?
• Stochastic Gradient Descent
  – We just had: $\nabla f^{(j)} = w^{(j)} + C \sum_{i=1}^{n} \frac{\partial L(x_i, y_i)}{\partial w^{(j)}}$
  – Instead of evaluating the gradient over all examples, evaluate it for each individual training example:
    $\nabla f^{(j)}(x_i) = w^{(j)} + C \frac{\partial L(x_i, y_i)}{\partial w^{(j)}}$
  – Notice: no summation over i anymore
• Stochastic gradient descent: iterate until convergence:
  • For i = 1 … n
    • For j = 1 … d
      • Compute: $\nabla f^{(j)}(x_i)$
      • Update: $w^{(j)} \leftarrow w^{(j)} - \eta \nabla f^{(j)}(x_i)$

Making SVMs work with Big Data

Optimization problem set up
[Equation: the optimization problem, with the kernel function labeled]
• This is a good optimization problem for computers to solve, called quadratic programming

Problems with SVMs on Big Data
• SVM complexity: quadratic programming (QP)
  – Time: $O(m^3)$
  – Space: $O(m^2)$
  – m: size of training data
• Two types of approaches:
  – Modify the SVM algorithm to work with large datasets
  – Select representative training data to use a normal SVM

Reducing the Training Set
• How do we reduce the size of the training set so that we can reduce the time?
  – Combine (a large number of) ‘small’ SVMs together to obtain the final SVM
  – Reduced SVM: use a random rectangular subset of the kernel matrix
  – All techniques only find an approximation to the optimal solution by an iterative approach
    • why not exploit this?

Problematic datasets: even in 1D!
• How will an SVM handle this dataset?
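
To make the 1D difficulty concrete, here is a small sketch on a made-up dataset in which the positive points sit between the negative ones, so no single threshold on x separates the classes. Mapping each point to $(x, x^2)$ is one possible way to lift the data into two dimensions, where a line does separate it. The data, the feature map `lift`, and the hand-picked `w` and `b` are assumptions for illustration only.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def separable_by_threshold(xs, ys):
    """True if one threshold on x splits the classes (assumes distinct x values)."""
    labels = [label for _, label in sorted(zip(xs, ys))]
    sign_changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return sign_changes <= 1

def lift(x):
    """Hypothetical feature map from 1-D to 2-D: x -> (x, x^2)."""
    return (x, x * x)

def predict(w, b, z):
    """Linear classifier sign(w . z + b) in the lifted space."""
    return 1 if dot(w, z) + b >= 0 else -1

# Positives in the middle, negatives on both sides (assumed toy data).
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [-1, -1, +1, +1, +1, -1, -1]

# In the lifted space the line x^2 = 2.5 separates the classes.
w = (0.0, -1.0)
b = 2.5

print("separable in 1-D:", separable_by_threshold(xs, ys))
print("separable after lifting:",
      all(predict(w, b, lift(x)) == y for x, y in zip(xs, ys)))
```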
The Kernel Trick: Using higher dimensions to separate data
• Project data into a higher dimension
• Now find the separation in this space
• This is a common trick in ML for many algorithms

Commonly used SVM basis functions
• $z_k$ = (polynomial terms of $x_k$ of degree 1 to q)
• $z_k$ = (radial basis functions of $x_k$)
• $z_k$ = (sigmoid functions of $x_k$)
• These and many other functions are valid

How to tackle training set sizes?
• We need:
  – An efficient and effective method for selecting a “working” set
  – To “shrink” the optimization problem:
    • Far fewer support vectors than training examples
    • Many support vectors have an $\alpha_i$ at the upper bound C
  – Computational improvements including caching and incremental updates of the gradient

What does this algorithm look like?
• In each iteration of our optimization, we will split the $\alpha_i$ into two categories:
  – B: set of free variables, updated in the current iteration
  – N: set of fixed variables, temporarily fixed in the current iteration
• Then divide and solve the optimization:
  – While optimality constraints are violated:
    • Select q variables for the working set B
    • Remaining l-q variables are part of the fixed set N
    • Solve the QP sub-problem with W(α) on B
Joachims, T., Making large-scale SVM practical, Large-scale Machine Learning

Now, how do we select a good “working” set?
• Select the set of variables such that the current iteration will make progress towards the minimum of W(α)
  – Use a first-order approximation, i.e., the steepest direction d of descent which has only q non-zero elements
• Convergence:
  – Terminate only when the optimal solution is found
  – If not, take a step towards the optimum…

Other techniques to identify appropriate (and reduced) training sets
• Minimum enclosing ball (MEB) for selecting “core” sets
[Figure: enclosing ball of radius R and its expansion by εR]
• After core sets are selected, solve the same optimization problem…
• Complexity:
  – Time: $O(m/\varepsilon^2 + 1/\varepsilon^4)$
  – Space: $O(1/\varepsilon^8)$
Tsang, I.W., Kwok, J.T., and Cheung, P.-M., JMLR (6): 363-392 (2005).
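
The basis functions listed earlier (polynomial, radial basis, sigmoid) are usually used through kernel functions, which compute the dot product in the higher-dimensional space without building it explicitly. The sketch below uses the common textbook forms of these kernels; the parameter names (`degree`, `gamma`, `coef0`), their default values, and the explicit map `phi_quadratic` are assumptions for illustration and are not taken from the slides.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """K(x, z) = (x . z + coef0)^degree."""
    return (dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(x, z, gamma=0.1, coef0=0.0):
    """K(x, z) = tanh(gamma * x . z + coef0)."""
    return math.tanh(gamma * dot(x, z) + coef0)

def phi_quadratic(x):
    """Explicit degree-2 feature map for 2-D input: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x = (1.0, 2.0)
z = (3.0, -1.0)

# The kernel value equals the dot product of the explicitly lifted vectors.
implicit = polynomial_kernel(x, z, degree=2, coef0=0.0)
explicit = dot(phi_quadratic(x), phi_quadratic(z))
print("kernel:", implicit, " explicit dot product:", explicit)
```

The comparison at the end is the point of the trick: the kernel only needs the original 2-D vectors, while the explicit route first builds the 3-D features.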
Incremental (and Decremental) SVM Learning
• Solving this QP formulation, we find this:
[Equation shown on the slide]

Example of an SVM working with Big Datasets
• Example by Leon Bottou:
  – Reuters RCV1 document corpus
    • Predict a category of a document
    • One vs. the rest classification
  – m = 781,000 training examples (documents)
  – 23,000 test examples
  – d = 50,000 features
    • One feature per word
    • Remove stop-words
    • Remove low frequency words

Text categorization
• Questions:
  – (1) Is SGD successful at minimizing f(w,b)?
  – (2) How quickly does SGD find the min of f(w,b)?
  – (3) What is the error on a test set?
[Table: training time, value of f(w,b), and test error for Standard SVM, “Fast SVM”, and SGD SVM]
• (1) SGD-SVM is successful at minimizing the value of f(w,b)
• (2) SGD-SVM is super fast
• (3) SGD-SVM test set error is comparable

Optimization “Accuracy”
[Figure: SGD SVM vs. conventional SVM; optimization quality $|f(w,b) - f(w_{opt}, b_{opt})|$]
• For optimizing f(w,b) within reasonable quality, SGD-SVM is super fast

SGD vs. Batch Conjugate Gradient
• SGD on the full dataset vs. Conjugate Gradient on a sample of n training examples
• Theory says: gradient descent converges in time linear in $k$; conjugate gradient converges in $\sqrt{k}$
  – $k$ … condition number
• Bottom line: doing a simple (but fast) SGD update many times is better than doing a complicated (but slow) CG update a few times

Practical Considerations
• Need to choose learning rate $\eta$ and $t_0$:
  $w_{t+1} \leftarrow w_t - \frac{\eta}{t + t_0} \left( w_t + C \frac{\partial L(x_i, y_i)}{\partial w} \right)$
• Leon suggests:
  – Choose $t_0$ so that the expected initial updates are comparable with the expected size of the weights
  – Choose $\eta$:
    • Select a small subsample
    • Try various rates $\eta$ (e.g., 10, 1, 0.1, 0.01, …)
    • Pick the one that most reduces the cost
    • Use $\eta$ for the next 100k iterations on the full dataset

Advanced Topics…

Sparse Linear SVMs
• Feature vector $x_i$ is sparse (contains many zeros)
• Do not do: $x_i$ = [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,…]
• But represent $x_i$ as a sparse vector $x_i$ = [(4,1), (9,5), …]
• Can we do the SGD update more efficiently?
  $w \leftarrow w - \eta \left( w + C \frac{\partial L(x_i, y_i)}{\partial w} \right)$
  – Approximated in 2 steps:
    • (1) $w \leftarrow w - \eta C \frac{\partial L(x_i, y_i)}{\partial w}$ (cheap: $x_i$ is sparse and so few coordinates j of w will be updated)
    • (2) $w \leftarrow w (1 - \eta)$ (expensive: w is not sparse, all coordinates need to be updated)

Sparse Linear SVMs: Practical Considerations
• Two-step update procedure:
  – (1) $w \leftarrow w - \eta C \frac{\partial L(x_i, y_i)}{\partial w}$
  – (2) $w \leftarrow w (1 - \eta)$
• Solution 1:
  – Represent vector w as the product of scalar s and vector v
  – Then the update procedure is:
    • (1) applied to the vector v, touching only the non-zero coordinates of $x_i$
    • (2) applied to the scalar s: $s \leftarrow s (1 - \eta)$
• Solution 2:
  – Perform only step (1) for each training example
  – Perform step (2) with lower frequency and higher $\eta$

Practical Considerations
• Stopping criteria: How many iterations of SGD?
  – Early stopping with cross validation
    • Create a validation set
    • Monitor cost function on the validation set
    • Stop when loss stops decreasing
  – Early stopping
    • Extract two disjoint subsamples A and B of training data
    • Train on A, stop by validating on B
    • Number of epochs is an estimate of k
    • Train for k epochs on the full dataset

What about multiple classes?
• Idea 1: One against all. Learn 3 classifiers
  – + vs. {o, -}
  – - vs. {o, +}
  – o vs. {+, -}
  – Obtain: $w_+, b_+$, $w_-, b_-$, $w_o, b_o$
• How to classify?
• Return class $c = \arg\max_c \; w_c \cdot x + b_c$

Learn 1 classifier: Multiclass SVM
• Idea 2: Learn 3 sets of weights simultaneously!
  – For each class c estimate $w_c, b_c$
  – Want the correct class to have the highest margin:
    $w_{y_i} \cdot x_i + b_{y_i} \ge 1 + w_c \cdot x_i + b_c \quad \forall c \ne y_i,\; \forall i$

Multiclass SVM
• Optimization problem:
  $\min_{w,b} \frac{1}{2} \sum_{c} \|w_c\|^2 + C \sum_{i=1}^{n} \xi_i$
  $\text{s.t.} \quad w_{y_i} \cdot x_i + b_{y_i} \ge w_c \cdot x_i + b_c + 1 - \xi_i \quad \forall c \ne y_i,\; \forall i$
  $\xi_i \ge 0 \quad \forall i$
  – To obtain parameters $w_c, b_c$ (for each class c) we can use similar techniques as for the 2-class SVM

SVM: what you must know
• One of the most successful ML algorithms
• Modifications for handling big datasets:
  – Reduce the training set (“core set”)
  – Modify the SVM training algorithm
  – Incremental algorithm
• Multi-class modifications are more complex
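
Pulling the pieces of this class together, the per-example update $w \leftarrow w - \frac{\eta}{t + t_0} \left( w + C \frac{\partial L(x_i, y_i)}{\partial w} \right)$ with the hinge-loss gradient can be sketched in a few lines of plain Python. This is a minimal illustration, not Bottou's implementation: the toy data, the function names, the default hyperparameters, the fixed random seed, and the choice to update the bias b with the same loss gradient (the slides only spell out the update for w) are all assumptions.

```python
import random

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def objective(w, b, X, y, C):
    """f(w, b) = 1/2 w . w + C * sum_i max(0, 1 - y_i (w . x_i + b))."""
    hinge = sum(max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y))
    return 0.5 * dot(w, w) + C * hinge

def sgd_svm(X, y, C=1.0, eta=1.0, t0=10.0, epochs=50, seed=0):
    """Stochastic gradient descent on the hinge-loss SVM objective."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    t = 0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)                  # visit examples in random order
        for i in order:
            step = eta / (t + t0)           # decaying learning rate
            margin = y[i] * (dot(w, X[i]) + b)
            # Regularization part of the gradient: w <- w (1 - step).
            w = [wj * (1.0 - step) for wj in w]
            if margin < 1.0:
                # Loss part: dL/dw = -y_i x_i, applied only when margin < 1.
                w = [wj + step * C * y[i] * xj for wj, xj in zip(w, X[i])]
                b += step * C * y[i]        # assumed bias update (dL/db = -y_i)
            t += 1
    return w, b

# Toy, linearly separable data (assumed for illustration).
X = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
y = [+1, +1, +1, -1, -1, -1]

w, b = sgd_svm(X, y)
print("w =", w, "b =", b)
print("f at start:", objective([0.0, 0.0], 0.0, X, y, 1.0))
print("f after SGD:", objective(w, b, X, y, 1.0))
```

On data like this the learned w should point from the negative class toward the positive class. Each update costs O(d) instead of the O(n) per coordinate needed by the full gradient, which is the reason SGD scales to large training sets.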