CS246: Mining Massive Datasets
Jure Leskovec, Stanford University, http://cs246.stanford.edu

Support Vector Machines

Motivation
* Overfitting; need for regularization.
* If the data is not separable, the weights dance around.
* Mediocre generalization: we find a "barely" separating solution.

Which is the best linear separator?
* Data: examples (x_1, y_1), ..., (x_n, y_n).
* Example i: x_i = (x_i^(1), ..., x_i^(d)) with real-valued components, and label y_i ∈ {-1, +1}.
* Inner product: w · x = Σ_j w^(j) x^(j).

Question: why define the margin as the distance to the closest example, rather than the average distance or something else?

Margin and confidence
* Prediction = sign(w · x + b).
* "Confidence" = (w · x + b) y.
* For the i-th datapoint: γ_i = (w · x_i + b) y_i.
* Want to solve: max_w min_i γ_i.
* Can rewrite as: max_{w,γ} γ  s.t. ∀i, y_i (w · x_i + b) ≥ γ.

Maximizing the margin
* Maximizing the margin is good according to intuition, theory, and practice.
* max_{w,γ} γ  s.t. ∀i, y_i (w · x_i + b) ≥ γ, with separating hyperplane w · x + b = 0.

Support vectors
* The separating hyperplane is defined by the support vectors: the points lying on the +1/-1 planes of the solution.
* If you knew these points, you could ignore the rest of the data.
* If there are no degeneracies, there are d + 1 support vectors (for d-dimensional data).

Problem: scaling w increases the margin
* If (w · x + b) y = γ, then (2w · x + 2b) y = 2γ.
* Solution: normalize w, i.e. work with w / ‖w‖, where ‖w‖ = sqrt(Σ_{j=1}^d (w^(j))^2) and w^(j) is the j-th component of w.
* Also require the hyperplanes touching the support vectors to satisfy w · x + b = ±1.

What is the relation between x_1 and x_2 (support vectors on the +1 and -1 planes)?
* x_1 = x_2 + 2γ w / ‖w‖.
* We also know: w · x_1 + b = +1 and w · x_2 + b = -1.
* Then w · (x_2 + 2γ w / ‖w‖) + b = +1, i.e. (w · x_2 + b) + 2γ (w · w) / ‖w‖ = +1.
* Since w · w = ‖w‖^2, this gives -1 + 2γ ‖w‖ = +1, and therefore γ = 1 / ‖w‖ (illustrated numerically in the sketch below).
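To make these quantities concrete, here is a minimal numpy sketch (not from the original slides; the toy data, the weight vector, and the offset are made-up illustrative values). It computes the prediction sign(w · x + b) and the confidence y_i (w · x_i + b), and checks that for a solution scaled so the closest points satisfy w · x + b = ±1, the geometric margin equals 1 / ‖w‖.

```python
import numpy as np

# Toy 2-D data: two linearly separable classes (made-up values for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])

# A candidate separator, scaled so that the closest points satisfy w.x + b = +/-1.
w = np.array([0.25, 0.25])
b = 0.0

prediction = np.sign(X @ w + b)      # prediction = sign(w.x + b)
confidence = y * (X @ w + b)         # gamma_i = y_i * (w.x_i + b)
geometric_margin = confidence.min() / np.linalg.norm(w)

print(prediction)        # [ 1.  1. -1. -1.]
print(confidence)        # [1.  1.5 1.  1. ]
print(geometric_margin)  # ~2.83, equal to 1/||w||
```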
Hard-margin SVM
* We started with max_{w,γ} γ  s.t. ∀i, y_i (w · x_i + b) ≥ γ, but w can be arbitrarily large.
* After normalizing: max γ = max 1/‖w‖ = min ‖w‖ = min (1/2)‖w‖^2.
* So: min_w (1/2)‖w‖^2  s.t. ∀i, y_i (w · x_i + b) ≥ 1.
* This is called the SVM with "hard" constraints.

Non-separable data
* If the data is not separable, introduce a penalty:
  min_w (1/2)‖w‖^2 + C · (number of training mistakes)   s.t. ∀i, y_i (w · x_i + b) ≥ 1
* That is, minimize ‖w‖^2 plus the number of training mistakes; set C using cross-validation.
* How should we penalize mistakes? All mistakes are not equally bad!

Slack variables
* Introduce slack variables ξ_i:
  min_{w, b, ξ_i ≥ 0}  (1/2)‖w‖^2 + C Σ_{i=1}^n ξ_i   s.t. ∀i, y_i (w · x_i + b) ≥ 1 - ξ_i
* If point x_i is on the wrong side of the margin, it incurs penalty ξ_i.
* For each datapoint: if its margin is ≥ 1, we don't care; if its margin is < 1, we pay a linear penalty.
* Slack penalty C:
  * C = ∞: we only accept w, b that separate the data.
  * C = 0: ξ_i can be set to anything, so w = 0 (the data is basically ignored).

SVM in its "natural" form
* arg min_{w,b}  (1/2) w · w + C Σ_{i=1}^n max{0, 1 - y_i (w · x_i + b)}
  * (1/2) w · w: margin / regularization term.
  * C: regularization parameter.
  * Σ_i max{0, 1 - y_i (w · x_i + b)}: empirical loss (how well we fit the training data).
* SVM uses the hinge loss max{0, 1 - z}, with z = y_i (x_i · w + b), instead of the 0/1 loss.
* Equivalent constrained form: min_{w,b} (1/2)‖w‖^2 + C Σ_{i=1}^n ξ_i  s.t. ∀i, y_i (w · x_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0.

Estimating w and b
* Standard way: use a solver, i.e. software for finding solutions to "common" optimization problems.
* Here, a quadratic solver: minimize a quadratic function subject to linear constraints.
* Problem: solvers are inefficient for big data!

Alternative approach: minimize f(w, b) directly
* f(w, b) = (1/2) Σ_{j=1}^d (w^(j))^2 + C Σ_{i=1}^n max{0, 1 - y_i (Σ_{j=1}^d w^(j) x_i^(j) + b)}
* How do we minimize a convex function? Gradient descent: to solve min_z f(z), iterate z_{t+1} ← z_t - η f'(z_t).

Gradient of the objective
* Take the gradient with respect to w^(j):
  ∇_j f(w, b) = ∂f(w, b) / ∂w^(j) = w^(j) + C Σ_{i=1}^n ∂L(x_i, y_i) / ∂w^(j)
* where ∂L(x_i, y_i) / ∂w^(j) = 0 if y_i (w · x_i + b) ≥ 1, and -y_i x_i^(j) otherwise.

Gradient descent
* Iterate until convergence:
  * For j = 1 ... d: evaluate ∇_j f(w, b), then update w^(j) ← w^(j) - η ∇_j f(w, b).
* Problem: computing ∇_j f takes O(n) time, where n is the size of the training dataset!

Stochastic gradient descent (SGD)
* We just had: ∇_j f = w^(j) + C Σ_{i=1}^n ∂L(x_i, y_i) / ∂w^(j).
* Instead of evaluating the gradient over all examples, evaluate it for each individual example:
  ∇_{j,i} f = w^(j) + C ∂L(x_i, y_i) / ∂w^(j)
* SGD: iterate until convergence:
  * For i = 1 ... n: for j = 1 ... d: evaluate ∇_{j,i} f and update w^(j) ← w^(j) - η ∇_{j,i} f.
* (n = number of training examples, d = number of features.)
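Below is a short, self-contained sketch of this SGD loop for the hinge-loss objective f(w, b) = (1/2)‖w‖^2 + C Σ_i max{0, 1 - y_i (w · x_i + b)}. It is not code from the course; the constant learning rate eta, the value C = 1, the random shuffling of examples each epoch, and the toy dataset are assumptions made for illustration.

```python
import numpy as np

def svm_sgd(X, y, C=1.0, eta=0.01, epochs=100):
    """SGD on the primal soft-margin SVM objective
    f(w, b) = 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b))."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):   # visit examples in random order
            if y[i] * (X[i] @ w + b) >= 1:   # hinge inactive: dL/dw = 0
                grad_w, grad_b = w, 0.0
            else:                            # hinge active: dL/dw = -y_i * x_i
                grad_w = w - C * y[i] * X[i]
                grad_b = -C * y[i]
            w = w - eta * grad_w             # w^(j) <- w^(j) - eta * grad_j
            b = b - eta * grad_b
    return w, b

# Tiny illustrative run on made-up separable data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w, b = svm_sgd(X, y)
print(np.sign(X @ w + b))  # should reproduce y on this easy dataset
```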
Example (Leon Bottou): Reuters RCV1 document corpus
* Task: predict the category of a document; one vs. the rest classification.
* n = 781,000 training examples, 23,000 test examples, d = 50,000 features.
* One feature per word; stop-words and low-frequency words are removed.

Questions
* Is SGD successful at minimizing f(w, b)?
* How quickly does SGD find the minimum of f(w, b)?
* What is the error on a test set?

(Results table from the slides: training time, value of f(w, b), and test error for a standard SVM solver, a "fast SVM" solver, and SGD SVM.)
(Plot from the slides: | f(w, b) - f_optimal(w, b) |.)

Subsampling vs. SGD
* What if we subsample the training data and solve exactly?
* Compare SGD on the full dataset with batch conjugate gradient on a sample of n training examples.
* The point: methods that converge in fewer iterations do not necessarily run faster; a simple step that you can do fast (but many times) wins.

Choosing the learning rate
* Need to choose the learning rate η and t_0:
  w_{t+1} ← w_t - (η / (t + t_0)) · (w_t + C ∂L(x_i, y_i)/∂w)
* Leon Bottou suggests:
  * Choose t_0 so that the expected initial updates are comparable with the expected size of the weights.
  * Choose η: select a small subsample, try various rates (e.g., 10, 1, 0.1, 0.01, ...), pick the one that most reduces the cost, and use it for the next 100k iterations on the full dataset.

Sparse linear SVM
* The feature vector x_i is sparse (contains many zeros).
* Do not store x_i densely as [0, 0, 0, 1, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, ...]; represent it as a sparse vector x_i = [(4, 1), (9, 5), ...].
* The update w ← w - η (w + C ∂L(x_i, y_i)/∂w) can be computed in two steps:
  * (1) w ← w - η C ∂L(x_i, y_i)/∂w   (cheap: x_i is sparse, so only a few coordinates j of w are updated).
  * (2) w ← w (1 - η)   (expensive: w is not sparse, so all coordinates need to be updated).
* The two steps do not exactly decompose the joint update, but since η is close to 0 the difference is only a low-order η^2 term.
* Solution 1: represent w as the product of a scalar s and a vector v, w = s · v; perform step (1) by updating v and step (2) by updating s (see the sketch below).
* Solution 2: perform only step (1) for each training example, and perform step (2) with lower frequency and a higher η.
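A minimal sketch of Solution 1: keep w as a scalar s times a dense vector v, so the rescaling step (2) only touches s while the sparse step (1) only touches the coordinates of v that appear in x_i. This is not code from the course; the class name, the (index, value) sparse format, and the omission of the bias term are assumptions made for illustration.

```python
import numpy as np

class ScaledWeights:
    """Represent w = s * v so that the step w <- w * (1 - eta) is O(1):
    only the scalar s changes, the dense vector v is untouched."""
    def __init__(self, d):
        self.s = 1.0
        self.v = np.zeros(d)

    def dot_sparse(self, x_sparse):
        # x_sparse is a list of (index, value) pairs, e.g. [(4, 1.0), (9, 5.0)].
        return self.s * sum(self.v[j] * xj for j, xj in x_sparse)

    def sparse_step(self, x_sparse, y, C, eta):
        # Step (1): w <- w - eta * C * dL/dw with dL/dw = -y * x_i (hinge active),
        # so w gains eta * C * y * x_i.  Only coordinates present in x_i change;
        # since w = s * v, the increment to v is divided by s.
        for j, xj in x_sparse:
            self.v[j] += eta * C * y * xj / self.s

    def scale_step(self, eta):
        # Step (2): w <- w * (1 - eta), done by rescaling the scalar only.
        self.s *= (1.0 - eta)

# One SGD step for a single sparse example (bias term omitted for brevity).
w = ScaledWeights(d=50000)
x_i, y_i = [(4, 1.0), (9, 5.0)], +1      # sparse example, as in the slides
if y_i * w.dot_sparse(x_i) < 1:          # hinge loss is active for this example
    w.sparse_step(x_i, y_i, C=1.0, eta=0.01)
w.scale_step(eta=0.01)
```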
Stopping criteria: how many iterations of SGD?
* Early stopping with cross-validation: create a validation set, monitor the cost function on the validation set, and stop when the loss stops decreasing.
* Early stopping: extract two disjoint subsamples A and B of the training data; train on A and stop by validating on B; the number of epochs this takes is an estimate of k; then train for k epochs on the full dataset.
* (Question: why would training on A and stopping on B take roughly the same time as training on the full dataset?)

Multiclass classification: one against all
* Learn 3 classifiers: + vs. {o, -}, - vs. {o, +}, and o vs. {+, -}.
* Obtain w_+, b_+; w_-, b_-; w_o, b_o.
* How to classify? Return the class c* = arg max_c (w_c · x + b_c).

Multiclass SVM
* Learn the 3 sets of weights simultaneously.
* For each class c, estimate w_c and b_c.
* Want the correct class to have the highest margin:
  w_{y_i} · x_i + b_{y_i} ≥ 1 + w_c · x_i + b_c   ∀c ≠ y_i, ∀i

Multiclass SVM: optimization problem
* min_{w,b}  (1/2) Σ_c ‖w_c‖^2 + C Σ_{i=1}^n ξ_i
  s.t. ∀i, ∀c ≠ y_i:  w_{y_i} · x_i + b_{y_i} ≥ w_c · x_i + b_c + 1 - ξ_i,   with ξ_i ≥ 0 ∀i
* To obtain the parameters w_c, b_c for each class c, we can use techniques similar to those for the 2-class SVM.
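However the per-class weights are obtained (one against all, or the joint multiclass SVM above), classification uses the arg max rule. A small sketch, not from the slides; the function name and the made-up weights for the three classes ('+', 'o', '-') are illustrative assumptions.

```python
import numpy as np

def one_vs_rest_predict(W, b, X):
    """W: (num_classes, d) matrix of per-class weights w_c; b: (num_classes,) offsets b_c.
    For each row x of X, return arg max_c (w_c . x + b_c)."""
    scores = X @ W.T + b              # (n, num_classes) matrix of w_c . x + b_c
    return np.argmax(scores, axis=1)

# Three made-up 2-D classifiers, in the order ['+', 'o', '-'].
W = np.array([[ 1.0,  1.0],   # w_+
              [ 0.0,  1.0],   # w_o
              [-1.0, -1.0]])  # w_-
b = np.array([0.0, 0.0, 0.0])
X = np.array([[2.0, 2.0], [-2.0, -2.0]])
print(one_vs_rest_predict(W, b, X))   # [0 2] -> class '+' then class '-'
```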