COSC 526 Class 8
Classification & Regression Part II
Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: [email protected]
Last Class
• We saw different techniques for handling large-scale classification:
– Decision trees
– k-Nearest Neighbors (k-NN)
• Modified decision trees to handle streaming datasets
• For k-NN, we discussed efficient data structures to handle large-scale datasets
This class…
• More classification…
• Support vector machines:
– Basic outline of the algorithm
– Modifications for large-scale datasets
– Online approaches
Class Logistics
10% Grade…
• 40% – assignments (2 instead of 3)
• 50% - class project
• What do we do with the remaining 10%?
– Class participation
– Selected papers will be posted on the class website
– Present your take on a paper (the critique)
– 2 presentations per class – 15 minutes each
– Starting Tue (Feb 17)
Critique Structure
• 2-3 slides: what is the paper all about?
• 2-4 slides: key results presented
• 2-3 slides: key limitations/shortcomings
• 2-3 slides: what could have been done better?
• A discussion-oriented presentation rather than an in-depth review of the paper
• Peer review of critiques due every week…
Support Vector Machines (SVM)
Support Vector Machines (SVM)
• Easiest explanation:
– Find the “best linear separator” for a dataset
• Training examples:
– {(x1, y1), (x2, y2), …, (xn, yn)}
– Each data point xi = (xi(1), xi(2), …, xi(d))
– yi ∈ {-1, +1}
• In higher-dimensional datasets we want to find the “best hyperplane”
[Figure: the learning setup – an input x is mapped by the classifier f(x, w) to a predicted label y]
Classifier Margin…
• Margin 𝜸: the width by which the boundary could be increased before hitting a data point
• We are interested in the maximum margin:
– The simplest case is the maximum-margin SVM classifier
– The data points that the margin touches are the support vectors
[Figure: a linear separator with its margin band touching the support vectors]
Why are we interested in Max Margin SVM?
• Intuitively this feels right:
– we need the maximum margin to fit the data correctly
• Robust to the location of the boundary:
– even if we have made a small error in the location of the boundary, there is little effect on classification
• The model obtained is tolerant to removal of any non-support-vector data points
– Validation works well!
• Works very well in practice
Why is the maximum margin a good thing?
• Theoretical convenience and the existence of generalization error bounds that depend on the value of the margin
Why is maximizing 𝜸 a good idea?
• Recall what the dot product means: A ⋅ B = ‖A‖‖B‖ cos θ, i.e., ‖A‖ cos θ is the length of the projection of A onto B
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Why is maximizing 𝜸 a good idea?
• Dot product: A ⋅ B = ‖A‖‖B‖ cos θ
• What is w ⋅ x1? What is w ⋅ x2?
[Figure: two positive points x1 and x2 at increasing distance from the separator with normal vector w; in the first case 𝜸1 ≈ ‖w‖₂, in the second case 𝜸2 ≈ 2‖w‖₂]
• So, the dot product w ⋅ x roughly corresponds to the margin 𝜸
– The bigger the dot product, the bigger the separation
What is the margin?
Distance from a point to a line
[Figure: line L with normal vector w; point A = (xA(1), xA(2)) and its projection H onto L; M is a point on L]
• Let:
– Line L: w ⋅ x + b = w(1)x(1) + w(2)x(2) + b = 0 (note: we assume ‖w‖₂ = 1)
– w = (w(1), w(2))
– Point A = (xA(1), xA(2))
– Point M on the line = (xM(1), xM(2))
• Then:
d(A, L) = |AH|
= |(A − M) ⋅ w|
= |(xA(1) − xM(1)) w(1) + (xA(2) − xM(2)) w(2)|
= xA(1) w(1) + xA(2) w(2) + b
= w ⋅ A + b
(Remember xM(1)w(1) + xM(2)w(2) = −b since M belongs to line L)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
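A minimal sketch of this computation in Python (assuming NumPy; the function name and example values are hypothetical, not part of the lecture). For a general w the distance divides by ‖w‖₂, which reduces to |w ⋅ A + b| when ‖w‖₂ = 1 as on the slide.

    import numpy as np

    def distance_to_hyperplane(a, w, b):
        # Distance from point a to the hyperplane w.x + b = 0;
        # equals |w.a + b| when ||w||_2 = 1, as assumed on the slide.
        return abs(np.dot(w, a) + b) / np.linalg.norm(w)

    # Example: the line x(1) + x(2) - 1 = 0, normalized so that ||w||_2 = 1
    w = np.array([1.0, 1.0]) / np.sqrt(2.0)
    b = -1.0 / np.sqrt(2.0)
    A = np.array([1.0, 1.0])
    print(distance_to_hyperplane(A, w, b))   # ~0.707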
Largest Margin
• Prediction = sign(w ⋅ x + b)
• Confidence = (w ⋅ x + b) y
• For the i-th data point: 𝜸i = yi (w ⋅ xi + b)
• Want to solve: maximize over w the smallest 𝜸i, i.e., max over w of min over i of 𝜸i
[Figure: positively and negatively labeled points separated by a hyperplane with normal vector w]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
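A minimal sketch of these three quantities in Python (assuming NumPy; the function names are hypothetical):

    import numpy as np

    def predict(w, b, x):
        # Predicted label in {-1, +1}
        return np.sign(np.dot(w, x) + b)

    def confidence(w, b, x, y):
        # Positive when the point is on the correct side; larger means more confident
        return (np.dot(w, x) + b) * y

    def margin(w, b, X, y):
        # gamma = min_i y_i (w . x_i + b): the worst-case confidence over the dataset
        return np.min(y * (X @ w + b))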
Support Vector Machine
• Maximize the margin:
– Good according to intuition, theory (VC dimension) & practice
– 𝜸 is the margin … distance from the separating hyperplane
• The optimization problem:
max over (w, 𝜸) of 𝜸
s.t. ∀i, yi (w ⋅ xi + b) ≥ 𝜸
[Figure: the separating hyperplane w ⋅ x + b = 0 with the margin 𝜸 on either side – maximizing the margin]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
How do we derive the margin?
• The separating hyperplane is defined by the support vectors
– Points on the +/- margin planes from the solution
– If you knew these points, you could ignore the rest
– Generally, d+1 support vectors (for d-dimensional data)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Canonical Hyperplane: Problem
• Let w ⋅ x + b = 0 be the separating hyperplane
• Now, scaling w (and b) leaves the hyperplane unchanged
• So scaling w can make the margin 𝜸 in the constraint yi (w ⋅ xi + b) ≥ 𝜸 arbitrarily large
• Solution:
– Work with the normalized weight vector w / ‖w‖
– Also require the support vectors xj to lie on the planes defined by: w ⋅ xj + b = ±1
[Figure: points x1 and x2 on the two margin planes, with the normalized vector w / ‖w‖]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Canonical Hyperplane: Solution
• Want to maximize the margin 𝜸
• What is the relation between x1 and x2?
– x1 = x2 + 2𝜸 ⋅ w / ‖w‖
• We also know:
– w ⋅ x1 + b = +1
– w ⋅ x2 + b = −1
• Substituting, and noting that w ⋅ w = ‖w‖², gives 𝜸 = 1 / ‖w‖
[Figure: support vectors x1 and x2 on the two margin planes, separated by 2𝜸 along w / ‖w‖]
Maximizing the Margin
• We started with:
max over (w, 𝜸) of 𝜸
s.t. ∀i, yi (w ⋅ xi + b) ≥ 𝜸
But w can be arbitrarily large!
• We normalized and obtained:
arg max 𝜸 = arg max 1/‖w‖ = arg min ‖w‖ = arg min ½‖w‖²
• Then:
min over w of ½‖w‖²
s.t. ∀i, yi (w ⋅ xi + b) ≥ 1
• This is called SVM with “hard” constraints
[Figure: support vectors x1 and x2 on the margin planes]
Non-linearly Separable Data
• If the data is not separable, introduce a penalty:
min over w of ½‖w‖² + C ⋅ (# of mistakes)
s.t. ∀i, yi (w ⋅ xi + b) ≥ 1
– Minimize ‖w‖² plus the number of training mistakes
– Set C using cross-validation
• How to penalize mistakes?
– All mistakes are not equally bad!
[Figure: positive and negative points that are not linearly separable]
Support Vector Machines
• Introduce slack variables ξi:
min over (w, b, ξi ≥ 0) of ½‖w‖² + C Σi ξi
s.t. ∀i, yi (w ⋅ xi + b) ≥ 1 − ξi
• If point xi is on the wrong side of the margin, it gets penalty ξi
• For each data point:
– If margin ≥ 1, don’t care
– If margin < 1, pay a linear penalty
[Figure: two points on the wrong side of the margin with slacks ξi and ξj]
Slack Penalty
min over w of ½‖w‖² + C ⋅ (# of mistakes)
s.t. ∀i, yi (w ⋅ xi + b) ≥ 1
• What is the role of the slack penalty C?
– C = ∞: only want w, b that separate the data
– C = 0: can set ξi to anything, then w = 0 (basically ignores the data)
[Figure: decision boundaries for small C, “good” C, and big C]
Support Vector Machines
• SVM in the “natural” form:
arg min over (w, b) of ½ w ⋅ w + C Σi max{0, 1 − yi (w ⋅ xi + b)}
– ½ w ⋅ w is the margin (regularization) term, C is the regularization parameter, and the sum is the empirical loss L (how well we fit the training data)
• SVM uses the “hinge loss” max{0, 1 − z}, where z = yi (xi ⋅ w + b); equivalently:
min over (w, b) of ½‖w‖² + C Σi ξi
s.t. ∀i, yi (w ⋅ xi + b) ≥ 1 − ξi
[Figure: penalty as a function of z = yi (xi ⋅ w + b) – the 0/1 loss vs. the hinge loss max{0, 1 − z}]
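A minimal sketch of the hinge loss and the “natural” form objective in Python (assuming NumPy; the function names are hypothetical and are reused in the later sketches):

    import numpy as np

    def hinge_loss(w, b, X, y):
        # Per-example hinge loss: max{0, 1 - y_i (w . x_i + b)}
        z = y * (X @ w + b)
        return np.maximum(0.0, 1.0 - z)

    def svm_objective(w, b, X, y, C):
        # "Natural" form: 1/2 w.w + C * sum_i max{0, 1 - y_i (w . x_i + b)}
        return 0.5 * np.dot(w, w) + C * hinge_loss(w, b, X, y).sum()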
SVM: How to estimate w?
min over (w, b) of ½ w ⋅ w + C Σi ξi
s.t. ∀i, yi (xi ⋅ w + b) ≥ 1 − ξi
• Want to estimate w and b!
– Standard way: use a solver!
• Solver: software for finding solutions to “common” optimization problems
• Use a quadratic solver:
– Minimize a quadratic function
– Subject to linear constraints
• Problem: solvers are inefficient for big data!
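To make the “use a quadratic solver” idea concrete, here is a minimal sketch (not the lecture’s code) that feeds the soft-margin primal above to a generic QP solver; it assumes NumPy and the cvxopt package, and the variable ordering z = [w, b, ξ] is an arbitrary choice:

    import numpy as np
    from cvxopt import matrix, solvers

    def svm_qp(X, y, C=1.0):
        n, d = X.shape
        m = d + 1 + n                                        # variables: w (d), b (1), slacks (n)
        P = np.zeros((m, m))
        P[:d, :d] = np.eye(d)                                # quadratic term: 1/2 ||w||^2
        q = np.hstack([np.zeros(d + 1), C * np.ones(n)])     # linear term: C * sum_i xi_i
        # y_i (w.x_i + b) >= 1 - xi_i  rewritten as  -y_i (w.x_i + b) - xi_i <= -1
        G1 = np.hstack([-y[:, None] * X, -y[:, None], -np.eye(n)])
        # xi_i >= 0  rewritten as  -xi_i <= 0
        G2 = np.hstack([np.zeros((n, d + 1)), -np.eye(n)])
        G = np.vstack([G1, G2])
        h = np.hstack([-np.ones(n), np.zeros(n)])
        sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
        z = np.array(sol['x']).ravel()
        return z[:d], z[d]                                   # w, b

Note how the constraint matrices already have one row (or two) per training example: for large n this is exactly where generic QP solvers become too slow and too memory-hungry.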
SVM: How to estimate w?
• Want to estimate w, b!
min over (w, b) of ½ w ⋅ w + C Σi ξi
s.t. ∀i, yi (xi ⋅ w + b) ≥ 1 − ξi
• Alternative approach:
– Want to minimize f(w, b):
f(w, b) = ½ Σj (w(j))² + C Σi max{0, 1 − yi (Σj w(j) xi(j) + b)}
• Side note: how do we minimize a convex function g(z)?
– Use gradient descent: min over z of g(z)
– Iterate: zt+1 ← zt − η ∇g(zt)
[Figure: a convex function g(z) with gradient descent steps moving toward its minimum]
SVM: How to estimate w?
• Want to minimize f(w, b):
f(w, b) = ½ Σj (w(j))² + C Σi max{0, 1 − yi (Σj w(j) xi(j) + b)}
• Compute the gradient ∇f(j) with respect to w(j):
∇f(j) = ∂f(w, b)/∂w(j) = w(j) + C Σi ∂L(xi, yi)/∂w(j)
where L(xi, yi) is the empirical (hinge) loss and
∂L(xi, yi)/∂w(j) = 0 if yi (w ⋅ xi + b) ≥ 1
∂L(xi, yi)/∂w(j) = −yi xi(j) otherwise
SVM: How to estimate w?
• Gradient descent (a sketch in code follows below):
Iterate until convergence:
• For j = 1 … d
• Evaluate: ∇f(j) = ∂f(w, b)/∂w(j) = w(j) + C Σi ∂L(xi, yi)/∂w(j)
• Update: w(j) ← w(j) − η ∇f(j)
η … learning rate parameter
C … regularization parameter
• Problem:
– Computing ∇f(j) takes O(n) time!
• n … size of the training dataset
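A minimal batch gradient descent sketch in Python (assuming NumPy; the function name, the fixed iteration count, and the bias update are assumptions, since the slide only shows the w update):

    import numpy as np

    def svm_gd(X, y, C=1.0, eta=0.01, iters=1000):
        # Full (batch) gradient descent on f(w, b): every iteration touches all n examples.
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        for _ in range(iters):
            z = y * (X @ w + b)
            active = z < 1                                        # examples with margin < 1
            grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
            grad_b = -C * y[active].sum()                         # analogous rule for b (assumption)
            w -= eta * grad_w
            b -= eta * grad_b
        return w, b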
SVM: How to estimate w?
• Stochastic Gradient Descent (SGD)
• We just had:
∇f(j) = w(j) + C Σi ∂L(xi, yi)/∂w(j)
– Instead of evaluating the gradient over all examples, evaluate it for each individual training example:
∇f(j)(xi) = w(j) + C ∂L(xi, yi)/∂w(j)
(Notice: no summation over i anymore)
• Stochastic gradient descent (sketched in code below):
Iterate until convergence:
• For i = 1 … n
• For j = 1 … d
• Compute: ∇f(j)(xi)
• Update: w(j) ← w(j) − η ∇f(j)(xi)
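A minimal SGD sketch in Python (assuming NumPy; the function name, epoch count, per-epoch shuffling, and the bias update are assumptions, not part of the slide):

    import numpy as np

    def svm_sgd(X, y, C=1.0, eta=0.01, epochs=10, seed=0):
        # One epoch visits each example once; each update uses a single (x_i, y_i).
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        for _ in range(epochs):
            for i in rng.permutation(n):
                if y[i] * (X[i] @ w + b) < 1:        # hinge loss active for this example
                    grad_w = w - C * y[i] * X[i]
                    grad_b = -C * y[i]               # analogous rule for b (assumption)
                else:
                    grad_w, grad_b = w, 0.0
                w -= eta * grad_w
                b -= eta * grad_b
        return w, b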
Making SVMs work with Big Data
Optimization problem setup
• In its dual form, SVM training amounts to maximizing
W(α) = Σi αi − ½ Σi Σj αi αj yi yj K(xi, xj)
s.t. 0 ≤ αi ≤ C and Σi αi yi = 0
where K(xi, xj) is the kernel function
• This is a good optimization problem for computers to solve, called quadratic programming
Problems with SVMs on Big Data
• SVM complexity: quadratic programming (QP)
– Time: O(m³)
– Space: O(m²)
– m: size of the training data
• Two types of approaches:
– Modify the SVM algorithm to work with large datasets
– Select representative training data and use a normal SVM
Reducing the Training Set
• How do we reduce the size of the training set so that we can reduce the training time?
– Combine (a large number of) ‘small’ SVMs together to obtain the final SVM
– Reduced SVM: use a random rectangular subset of the kernel matrix
– All of these techniques only find an approximation to the optimal solution by an iterative approach
• Why not exploit this?
Problematic datasets: even in 1D!
How will an SVM handle this dataset?
The Kernel Trick: Using higher dimensions to separate data
• Project the data into a higher dimension
• Now find the separation in this space
• This is a common trick in ML for many algorithms
Commonly used SVM basis functions
• zk = (polynomial terms of xk of degree 1 to q)
• zk = (radial basis functions of xk)
• zk = (sigmoid functions of xk)
• These and many other functions are valid
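A minimal sketch of two such basis/kernel functions in Python (assuming NumPy; the function names and parameter defaults are hypothetical):

    import numpy as np

    def rbf_kernel(X, Z, gamma=1.0):
        # K[i, j] = exp(-gamma * ||x_i - z_j||^2): the radial basis function kernel,
        # i.e., an implicit mapping into a much higher-dimensional feature space.
        sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq)

    def polynomial_kernel(X, Z, q=3, c=1.0):
        # K[i, j] = (x_i . z_j + c)^q: corresponds to polynomial terms up to degree q.
        return (X @ Z.T + c) ** q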
How to tackle training set sizes?
• We need:
– An efficient and effective method for selecting a “working” set
– To “shrink” the optimization problem:
• There are many fewer support vectors than training examples
• Many support vectors have an αi at the upper bound C
– Computational improvements, including caching and incremental updates of the gradient
What does this algorithm look like?
• In each iteration of the optimization, we split the αi into two categories:
– B: the set of free variables, updated in the current iteration (the working set)
– N: the set of fixed variables, temporarily fixed in the current iteration
• While the optimality constraints are violated:
– Select q variables for the working set B; the remaining l − q variables form the fixed set N
– Solve the QP sub-problem W(α) on B
• In this way we repeatedly divide and solve the optimization
Joachims, T., Making large-scale SVM practical, Large-scale Machine Learning
Now, how do we select a good “working” set?
• Select the set of variables such that the current iteration will make progress towards the minimum of W(α)
– Use a first-order approximation, i.e., the steepest direction d of descent that has only q non-zero elements
• Convergence:
– Terminate only when the optimal solution is found
– If not, take a step towards the optimum…
Other techniques to identify appropriate (and reduced) training sets
• Minimum enclosing ball (MEB) for selecting “core” sets
[Figure: a minimum enclosing ball of radius R and its expansion by εR]
• After the core sets are selected, solve the same optimization problem…
• Complexity:
– Time: O(m/ε² + 1/ε⁴)
– Space: O(1/ε⁸)
Tsang, I.W., Kwok, J.T., and Cheung, P.-M., JMLR (6): 363-392 (2005).
Incremental (and Decremental) SVM Learning
• Solving this QP formulation, we find this:
Example of an SVM working with Big Datasets
• Example by Leon Bottou:
– Reuters RCV1 document corpus
• Predict the category of a document
– One vs. the rest classification
– m = 781,000 training examples (documents)
– 23,000 test examples
– d = 50,000 features
• One feature per word
• Remove stop words
• Remove low-frequency words
Text categorization
• Questions:
– (1) Is SGD successful at minimizing f(w, b)?
– (2) How quickly does SGD find the minimum of f(w, b)?
– (3) What is the error on a test set?
[Table: training time, value of f(w, b), and test error for a standard SVM, a “Fast SVM”, and SGD-SVM]
• (1) SGD-SVM is successful at minimizing the value of f(w, b)
• (2) SGD-SVM is super fast
• (3) SGD-SVM test set error is comparable
Optimization “Accuracy”
[Figure: training time vs. optimization quality |f(w, b) − f(wopt, bopt)| for SGD-SVM and a conventional SVM]
• For optimizing f(w, b) to within reasonable quality, SGD-SVM is super fast
SGD vs. Batch Conjugate Gradient
• SGD on the full dataset vs. Conjugate Gradient (CG) on a sample of n training examples
• Theory says: gradient descent converges in time linear in the condition number 𝒌, while conjugate gradient converges in time proportional to √𝒌
• Bottom line: doing a simple (but fast) SGD update many times is better than doing a complicated (but slow) CG update a few times
Practical Considerations
• Need to choose the learning rate η and t0:
wt+1 ← wt − (η / (t + t0)) (wt + C ∂L(xi, yi)/∂w)
• Leon suggests:
– Choose t0 so that the expected initial updates are comparable with the expected size of the weights
– Choose η:
• Select a small subsample
• Try various rates η (e.g., 10, 1, 0.1, 0.01, …)
• Pick the one that most reduces the cost
• Use that η for the next 100k iterations on the full dataset
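A minimal sketch of the rate-selection recipe in Python (assuming NumPy; it reuses the hypothetical svm_sgd and svm_objective sketches from earlier and keeps the learning rate fixed rather than using the η/(t + t0) schedule above):

    import numpy as np

    def pick_learning_rate(X, y, C=1.0, rates=(10, 1, 0.1, 0.01, 0.001),
                           sample_size=1000, epochs=1, seed=0):
        # Try each candidate rate on a small random subsample and keep the one
        # that most reduces the SVM objective on that subsample.
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(y), size=min(sample_size, len(y)), replace=False)
        Xs, ys = X[idx], y[idx]
        best_rate, best_cost = None, np.inf
        for eta in rates:
            w, b = svm_sgd(Xs, ys, C=C, eta=eta, epochs=epochs, seed=seed)
            cost = svm_objective(w, b, Xs, ys, C)
            if cost < best_cost:
                best_rate, best_cost = eta, cost
        return best_rate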
Advanced Topics…
Sparse Linear SVMs
• The feature vector xi is sparse (contains many zeros)
– Do not store xi = [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,…]
– Instead, represent xi as a sparse vector xi = [(4,1), (9,5), …]
• Can we do the SGD update more efficiently?
w ← w − η (w + C ∂L(xi, yi)/∂w)
– Approximate it in 2 steps:
(1) w ← w − η C ∂L(xi, yi)/∂w
(cheap: xi is sparse, so only a few coordinates j of w will be updated)
(2) w ← w (1 − η)
(expensive: w is not sparse, so all coordinates need to be updated)
Sparse Linear SVMs: Practical Considerations
Two-step update procedure:
(1) w ← w − η C ∂L(xi, yi)/∂w
(2) w ← w (1 − η)
• Solution 1 (sketched in code below):
– Represent the vector w as the product of a scalar s and a vector v
– Then the update procedure becomes:
(1) v ← v − (η C / s) ∂L(xi, yi)/∂w (touches only the non-zero coordinates of xi)
(2) s ← s (1 − η) (a single scalar update instead of updating all of w)
• Solution 2:
– Perform only step (1) for each training example
– Perform step (2) with lower frequency and higher η
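A minimal sketch of Solution 1 for a single SGD step in Python (no external libraries needed; the function name and the (index, value) representation follow the sparse-vector example above, and the bias update is omitted for brevity):

    def sparse_sgd_step(s, v, x_sparse, yi, b=0.0, C=1.0, eta=0.01):
        # w is represented implicitly as w = s * v (initialize with s = 1, v = zeros);
        # x_sparse is a list of (index, value) pairs for the non-zero features.
        wx = sum(val * s * v[j] for j, val in x_sparse)      # w . x using only non-zeros
        if yi * (wx + b) < 1:                                # hinge loss active
            for j, val in x_sparse:                          # step (1): cheap, sparse
                v[j] += (eta * C / s) * yi * val
        s *= (1.0 - eta)                                     # step (2): O(1) instead of O(d)
        return s, v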
Practical Considerations
• Stopping criteria: how many iterations of SGD?
– Early stopping with cross-validation
• Create a validation set
• Monitor the cost function on the validation set
• Stop when the loss stops decreasing
– Early stopping
• Extract two disjoint subsamples A and B of the training data
• Train on A, stop by validating on B
• The number of epochs is an estimate of k
• Train for k epochs on the full dataset
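A minimal sketch of the second (A/B) early-stopping recipe in Python (assuming NumPy; it reuses the hypothetical svm_objective sketch from earlier, and the 50/50 split and 3-epoch patience are arbitrary choices):

    import numpy as np

    def estimate_epochs(X, y, C=1.0, eta=0.01, max_epochs=100, seed=0):
        # Train with SGD on subsample A and stop when the objective on the
        # disjoint subsample B stops decreasing; return the epoch count k.
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))
        A, B = idx[:len(y) // 2], idx[len(y) // 2:]
        w, b = np.zeros(X.shape[1]), 0.0
        best_cost, best_k = np.inf, 0
        for epoch in range(1, max_epochs + 1):
            for i in rng.permutation(A):                 # one SGD epoch over A
                if y[i] * (X[i] @ w + b) < 1:
                    w -= eta * (w - C * y[i] * X[i])
                    b += eta * C * y[i]
                else:
                    w -= eta * w
            cost = svm_objective(w, b, X[B], y[B], C)    # validate on B
            if cost < best_cost:
                best_cost, best_k = cost, epoch
            elif epoch - best_k >= 3:                    # no improvement for 3 epochs
                break
        return best_k                                    # then train for k epochs on all data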
What about multiple classes?
• Idea 1: One against all
– Learn 3 classifiers:
• + vs. {o, -}
• - vs. {o, +}
• o vs. {+, -}
– Obtain: w+ b+, w- b-, wo bo
• How to classify?
– Return the class c with the largest score: arg maxc wc ⋅ x + bc
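A minimal one-against-all sketch in Python (assuming NumPy; it reuses the hypothetical svm_sgd sketch from earlier, and the function names are not part of the lecture):

    import numpy as np

    def one_vs_rest_train(X, y, classes, C=1.0, eta=0.01, epochs=10):
        # Train one binary SVM per class c: class c vs. everything else.
        models = {}
        for c in classes:
            y_c = np.where(y == c, 1.0, -1.0)
            models[c] = svm_sgd(X, y_c, C=C, eta=eta, epochs=epochs)
        return models

    def one_vs_rest_predict(models, x):
        # Return the class whose classifier gives the largest score w_c . x + b_c.
        return max(models, key=lambda c: models[c][0] @ x + models[c][1])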
Learn 1 classifier: Multiclass SVM
• Idea 2: Learn 3 sets of weights simultaneously!
– For each class c, estimate wc, bc
– Want the correct class to have the highest margin:
wyi ⋅ xi + byi ≥ 1 + wc ⋅ xi + bc   ∀c ≠ yi, ∀i
Multiclass SVM
• Optimization problem:
min over (w, b) of ½ Σc ‖wc‖² + C Σi ξi
s.t. wyi ⋅ xi + byi ≥ wc ⋅ xi + bc + 1 − ξi   ∀c ≠ yi, ∀i
ξi ≥ 0, ∀i
– To obtain the parameters wc, bc (for each class c), we can use similar techniques as for the 2-class SVM
SVM: what you must know
• One of the most successful ML algorithms
• Modifications for handling big datasets:
– Reduce the training set (“core sets”)
– Modify the SVM training algorithm
– Incremental algorithms
• Multi-class modifications are more complex