CS246: Mining Massive Datasets
Jure Leskovec, Stanford University, http://cs246.stanford.edu

Support Vector Machines

Motivation
* Overfitting; need for regularization.
* If the data is not separable, the weights dance around.
* Mediocre generalization: we find a "barely" separating solution.

Which is the best linear separator?
* Data: examples (x_1, y_1), ..., (x_n, y_n).
* Example i: x_i = (x_i^(1), ..., x_i^(d)) with real-valued components, and label y_i ∈ {-1, +1}.
* Inner product: w · x = Σ_j w^(j) x^(j).

Question: why define the margin as the distance to the closest example, rather than the average distance or something else?

Margin and confidence
* Prediction = sign(w · x + b).
* "Confidence" = (w · x + b) y.
* For the i-th datapoint: γ_i = (w · x_i + b) y_i.
* Want to solve: max_w min_i γ_i.
* Can rewrite as: max_{w,γ} γ  s.t. ∀i, y_i (w · x_i + b) ≥ γ.

Maximizing the margin
* Maximizing the margin is good according to intuition, theory, and practice.
* max_{w,γ} γ  s.t. ∀i, y_i (w · x_i + b) ≥ γ, with separating hyperplane w · x + b = 0.

Support vectors
* The separating hyperplane is defined by the support vectors: the points lying on the +1/-1 planes of the solution.
* If you knew these points, you could ignore the rest of the data.
* If there are no degeneracies, there are d + 1 support vectors (for d-dimensional data).

Problem: scaling w increases the margin
* If (w · x + b) y = γ, then (2w · x + 2b) y = 2γ.
* Solution: normalize w, i.e. work with w / ‖w‖, where ‖w‖ = sqrt(Σ_{j=1}^d (w^(j))^2) and w^(j) is the j-th component of w.
* Also require the hyperplanes touching the support vectors to satisfy w · x + b = ±1.

What is the relation between x_1 and x_2 (support vectors on the +1 and -1 planes)?
* x_1 = x_2 + 2γ w / ‖w‖.
* We also know: w · x_1 + b = +1 and w · x_2 + b = -1.
* Then w · (x_2 + 2γ w / ‖w‖) + b = +1, i.e. (w · x_2 + b) + 2γ (w · w) / ‖w‖ = +1.
* Since w · w = ‖w‖^2, this gives -1 + 2γ ‖w‖ = +1, and therefore γ = 1 / ‖w‖ (illustrated numerically in the sketch below).
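To make these quantities concrete, here is a minimal numpy sketch (not from the original slides; the toy data, the weight vector, and the offset are made-up illustrative values). It computes the prediction sign(w · x + b) and the confidence y_i (w · x_i + b), and checks that for a solution scaled so the closest points satisfy w · x + b = ±1, the geometric margin equals 1 / ‖w‖.

```python
import numpy as np

# Toy 2-D data: two linearly separable classes (made-up values for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])

# A candidate separator, scaled so that the closest points satisfy w.x + b = +/-1.
w = np.array([0.25, 0.25])
b = 0.0

prediction = np.sign(X @ w + b)      # prediction = sign(w.x + b)
confidence = y * (X @ w + b)         # gamma_i = y_i * (w.x_i + b)
geometric_margin = confidence.min() / np.linalg.norm(w)

print(prediction)        # [ 1.  1. -1. -1.]
print(confidence)        # [1.  1.5 1.  1. ]
print(geometric_margin)  # ~2.83, equal to 1/||w||
```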
Hard-margin SVM
* We started with max_{w,γ} γ  s.t. ∀i, y_i (w · x_i + b) ≥ γ, but w can be arbitrarily large.
* After normalizing: max γ = max 1/‖w‖ = min ‖w‖ = min (1/2)‖w‖^2.
* So: min_w (1/2)‖w‖^2  s.t. ∀i, y_i (w · x_i + b) ≥ 1.
* This is called the SVM with "hard" constraints.

Non-separable data
* If the data is not separable, introduce a penalty:
  min_w (1/2)‖w‖^2 + C · (number of training mistakes)   s.t. ∀i, y_i (w · x_i + b) ≥ 1
* That is, minimize ‖w‖^2 plus the number of training mistakes; set C using cross-validation.
* How should we penalize mistakes? All mistakes are not equally bad!

Slack variables
* Introduce slack variables ξ_i:
  min_{w, b, ξ_i ≥ 0}  (1/2)‖w‖^2 + C Σ_{i=1}^n ξ_i   s.t. ∀i, y_i (w · x_i + b) ≥ 1 - ξ_i
* If point x_i is on the wrong side of the margin, it incurs penalty ξ_i.
* For each datapoint: if its margin is ≥ 1, we don't care; if its margin is < 1, we pay a linear penalty.
* Slack penalty C:
  * C = ∞: we only accept w, b that separate the data.
  * C = 0: ξ_i can be set to anything, so w = 0 (the data is basically ignored).

SVM in its "natural" form
* arg min_{w,b}  (1/2) w · w + C Σ_{i=1}^n max{0, 1 - y_i (w · x_i + b)}
  * (1/2) w · w: margin / regularization term.
  * C: regularization parameter.
  * Σ_i max{0, 1 - y_i (w · x_i + b)}: empirical loss (how well we fit the training data).
* SVM uses the hinge loss max{0, 1 - z}, with z = y_i (x_i · w + b), instead of the 0/1 loss.
* Equivalent constrained form: min_{w,b} (1/2)‖w‖^2 + C Σ_{i=1}^n ξ_i  s.t. ∀i, y_i (w · x_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0.

Estimating w and b
* Standard way: use a solver, i.e. software for finding solutions to "common" optimization problems.
* Here, a quadratic solver: minimize a quadratic function subject to linear constraints.
* Problem: solvers are inefficient for big data!

Alternative approach: minimize f(w, b) directly
* f(w, b) = (1/2) Σ_{j=1}^d (w^(j))^2 + C Σ_{i=1}^n max{0, 1 - y_i (Σ_{j=1}^d w^(j) x_i^(j) + b)}
* How do we minimize a convex function? Gradient descent: to solve min_z f(z), iterate z_{t+1} ← z_t - η f'(z_t).

Gradient of the objective
* Take the gradient with respect to w^(j):
  ∇_j f(w, b) = ∂f(w, b) / ∂w^(j) = w^(j) + C Σ_{i=1}^n ∂L(x_i, y_i) / ∂w^(j)
* where ∂L(x_i, y_i) / ∂w^(j) = 0 if y_i (w · x_i + b) ≥ 1, and -y_i x_i^(j) otherwise.

Gradient descent
* Iterate until convergence:
  * For j = 1 ... d: evaluate ∇_j f(w, b), then update w^(j) ← w^(j) - η ∇_j f(w, b).
* Problem: computing ∇_j f takes O(n) time, where n is the size of the training dataset!

Stochastic gradient descent (SGD)
* We just had: ∇_j f = w^(j) + C Σ_{i=1}^n ∂L(x_i, y_i) / ∂w^(j).
* Instead of evaluating the gradient over all examples, evaluate it for each individual example:
  ∇_{j,i} f = w^(j) + C ∂L(x_i, y_i) / ∂w^(j)
* SGD: iterate until convergence:
  * For i = 1 ... n: for j = 1 ... d: evaluate ∇_{j,i} f and update w^(j) ← w^(j) - η ∇_{j,i} f.
* (n = number of training examples, d = number of features.)
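Below is a short, self-contained sketch of this SGD loop for the hinge-loss objective f(w, b) = (1/2)‖w‖^2 + C Σ_i max{0, 1 - y_i (w · x_i + b)}. It is not code from the course; the constant learning rate eta, the value C = 1, the random shuffling of examples each epoch, and the toy dataset are assumptions made for illustration.

```python
import numpy as np

def svm_sgd(X, y, C=1.0, eta=0.01, epochs=100):
    """SGD on the primal soft-margin SVM objective
    f(w, b) = 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b))."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):   # visit examples in random order
            if y[i] * (X[i] @ w + b) >= 1:   # hinge inactive: dL/dw = 0
                grad_w, grad_b = w, 0.0
            else:                            # hinge active: dL/dw = -y_i * x_i
                grad_w = w - C * y[i] * X[i]
                grad_b = -C * y[i]
            w = w - eta * grad_w             # w^(j) <- w^(j) - eta * grad_j
            b = b - eta * grad_b
    return w, b

# Tiny illustrative run on made-up separable data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w, b = svm_sgd(X, y)
print(np.sign(X @ w + b))  # should reproduce y on this easy dataset
```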
Example (Leon Bottou): Reuters RCV1 document corpus
* Task: predict the category of a document; one vs. the rest classification.
* n = 781,000 training examples, 23,000 test examples, d = 50,000 features.
* One feature per word; stop-words and low-frequency words are removed.

Questions
* Is SGD successful at minimizing f(w, b)?
* How quickly does SGD find the minimum of f(w, b)?
* What is the error on a test set?

(Results table from the slides: training time, value of f(w, b), and test error for a standard SVM solver, a "fast SVM" solver, and SGD SVM.)
(Plot from the slides: | f(w, b) - f_optimal(w, b) |.)

Subsampling vs. SGD
* What if we subsample the training data and solve exactly?
* Compare SGD on the full dataset with batch conjugate gradient on a sample of n training examples.
* The point: methods that converge in fewer iterations do not necessarily run faster; a simple step that you can do fast (but many times) wins.

Choosing the learning rate
* Need to choose the learning rate η and t_0:
  w_{t+1} ← w_t - (η / (t + t_0)) · (w_t + C ∂L(x_i, y_i)/∂w)
* Leon Bottou suggests:
  * Choose t_0 so that the expected initial updates are comparable with the expected size of the weights.
  * Choose η: select a small subsample, try various rates (e.g., 10, 1, 0.1, 0.01, ...), pick the one that most reduces the cost, and use it for the next 100k iterations on the full dataset.

Sparse linear SVM
* The feature vector x_i is sparse (contains many zeros).
* Do not store x_i densely as [0, 0, 0, 1, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, ...]; represent it as a sparse vector x_i = [(4, 1), (9, 5), ...].
* The update w ← w - η (w + C ∂L(x_i, y_i)/∂w) can be computed in two steps:
  * (1) w ← w - η C ∂L(x_i, y_i)/∂w   (cheap: x_i is sparse, so only a few coordinates j of w are updated).
  * (2) w ← w (1 - η)   (expensive: w is not sparse, so all coordinates need to be updated).
* The two steps do not exactly decompose the joint update, but since η is close to 0 the difference is only a low-order η^2 term.
* Solution 1: represent w as the product of a scalar s and a vector v, w = s · v; perform step (1) by updating v and step (2) by updating s (see the sketch below).
* Solution 2: perform only step (1) for each training example, and perform step (2) with lower frequency and a higher η.
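A minimal sketch of Solution 1: keep w as a scalar s times a dense vector v, so the rescaling step (2) only touches s while the sparse step (1) only touches the coordinates of v that appear in x_i. This is not code from the course; the class name, the (index, value) sparse format, and the omission of the bias term are assumptions made for illustration.

```python
import numpy as np

class ScaledWeights:
    """Represent w = s * v so that the step w <- w * (1 - eta) is O(1):
    only the scalar s changes, the dense vector v is untouched."""
    def __init__(self, d):
        self.s = 1.0
        self.v = np.zeros(d)

    def dot_sparse(self, x_sparse):
        # x_sparse is a list of (index, value) pairs, e.g. [(4, 1.0), (9, 5.0)].
        return self.s * sum(self.v[j] * xj for j, xj in x_sparse)

    def sparse_step(self, x_sparse, y, C, eta):
        # Step (1): w <- w - eta * C * dL/dw with dL/dw = -y * x_i (hinge active),
        # so w gains eta * C * y * x_i.  Only coordinates present in x_i change;
        # since w = s * v, the increment to v is divided by s.
        for j, xj in x_sparse:
            self.v[j] += eta * C * y * xj / self.s

    def scale_step(self, eta):
        # Step (2): w <- w * (1 - eta), done by rescaling the scalar only.
        self.s *= (1.0 - eta)

# One SGD step for a single sparse example (bias term omitted for brevity).
w = ScaledWeights(d=50000)
x_i, y_i = [(4, 1.0), (9, 5.0)], +1      # sparse example, as in the slides
if y_i * w.dot_sparse(x_i) < 1:          # hinge loss is active for this example
    w.sparse_step(x_i, y_i, C=1.0, eta=0.01)
w.scale_step(eta=0.01)
```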
Stopping criteria: how many iterations of SGD?
* Early stopping with cross-validation: create a validation set, monitor the cost function on the validation set, and stop when the loss stops decreasing.
* Early stopping: extract two disjoint subsamples A and B of the training data; train on A and stop by validating on B; the number of epochs this takes is an estimate of k; then train for k epochs on the full dataset.
* (Question: why would training on A and stopping on B take roughly the same time as training on the full dataset?)

Multiclass classification: one against all
* Learn 3 classifiers: + vs. {o, -}, - vs. {o, +}, and o vs. {+, -}.
* Obtain w_+, b_+; w_-, b_-; w_o, b_o.
* How to classify? Return the class c* = arg max_c (w_c · x + b_c).

Multiclass SVM
* Learn the 3 sets of weights simultaneously.
* For each class c, estimate w_c and b_c.
* Want the correct class to have the highest margin:
  w_{y_i} · x_i + b_{y_i} ≥ 1 + w_c · x_i + b_c   ∀c ≠ y_i, ∀i

Multiclass SVM: optimization problem
* min_{w,b}  (1/2) Σ_c ‖w_c‖^2 + C Σ_{i=1}^n ξ_i
  s.t. ∀i, ∀c ≠ y_i:  w_{y_i} · x_i + b_{y_i} ≥ w_c · x_i + b_c + 1 - ξ_i,   with ξ_i ≥ 0 ∀i
* To obtain the parameters w_c, b_c for each class c, we can use techniques similar to those for the 2-class SVM.
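However the per-class weights are obtained (one against all, or the joint multiclass SVM above), classification uses the arg max rule. A small sketch, not from the slides; the function name and the made-up weights for the three classes ('+', 'o', '-') are illustrative assumptions.

```python
import numpy as np

def one_vs_rest_predict(W, b, X):
    """W: (num_classes, d) matrix of per-class weights w_c; b: (num_classes,) offsets b_c.
    For each row x of X, return arg max_c (w_c . x + b_c)."""
    scores = X @ W.T + b              # (n, num_classes) matrix of w_c . x + b_c
    return np.argmax(scores, axis=1)

# Three made-up 2-D classifiers, in the order ['+', 'o', '-'].
W = np.array([[ 1.0,  1.0],   # w_+
              [ 0.0,  1.0],   # w_o
              [-1.0, -1.0]])  # w_-
b = np.array([0.0, 0.0, 0.0])
X = np.array([[2.0, 2.0], [-2.0, -2.0]])
print(one_vs_rest_predict(W, b, X))   # [0 2] -> class '+' then class '-'
```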