Support Vector Machines
for Data Fitting and Classification
David R. Musicant
with Olvi L. Mangasarian
UW-Madison Data Mining Institute
Annual Review
June 2, 2000
Overview

• Regression and its role in data mining
• Robust support vector regression
  – Our general formulation
• Tolerant support vector regression
  – Our contributions
  – Massive support vector regression
  – Integration with data mining tools
• Active support vector machines
• Other research and future directions
What is regression?

• Regression forms a rule for predicting an unknown numerical feature from known ones.
• Example: Predicting purchase habits.
• Can we use...
  – age, income, level of education
• To predict...
  – purchasing patterns?
• And simultaneously...
  – avoid the “pitfalls” that standard statistical regression falls into?
Regression example

Can we use:

  Age   Income         Years of Education   $ spent on software
  30    $56,000 / yr   16                   $800
  50    $60,000 / yr   12                   $0
  16    $2,000 / yr    11                   $200

To predict:

  Age   Income         Years of Education   $ spent on software
  40    $48,000 / yr   17                   ?
  29    $60,000 / yr   18                   ?
Role in data mining

• Goal: Find new relationships in data
  – e.g. customer behavior, scientific experimentation
• Regression explores the importance of each known feature in predicting the unknown one.
  – Feature selection
• Regression is a form of supervised learning
  – Use data where the predicted value is known for given instances to form a rule
• Massive datasets

Regression is a fundamental task in data mining.
Part I:
Robust Regression
a.k.a. Huber Regression
[Plot: the Huber loss function, quadratic on [−g, g] and linear outside.]
“Standard” Linear Regression

[Figure: data points A and fitted line ŷ = Aw + be, with intercept b.]

Find w, b such that:
    Aw + be ≈ y
Optimization problem

• Find w, b such that:
    Aw + be ≈ y
• Bound the error by s:
    −s ≤ Aw + be − y ≤ s
• Minimize the error:
    min Σᵢ sᵢ²
    s.t. −s ≤ Aw + be − y ≤ s

Traditional approach: minimize squared error.
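As an illustration only, the squared-error fit can be computed directly with NumPy's least-squares routine in place of the constrained problem above; the toy data echoes the earlier purchase-habits example, and with only three points the system is underdetermined, so lstsq returns the minimum-norm fit.

    import numpy as np

    # Toy data from the earlier example slide:
    # (age, income, years of education) -> $ spent on software.
    A = np.array([[30.0, 56000.0, 16.0],
                  [50.0, 60000.0, 12.0],
                  [16.0,  2000.0, 11.0]])
    y = np.array([800.0, 0.0, 200.0])

    # Append a column of ones so the intercept b is fitted along with w,
    # i.e. solve min over (w, b) of ||A w + b e - y||^2.
    A_e = np.hstack([A, np.ones((A.shape[0], 1))])
    coeffs, *_ = np.linalg.lstsq(A_e, y, rcond=None)
    w, b = coeffs[:-1], coeffs[-1]
    print("w =", w, "b =", b)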
Examining the loss function

• Standard regression uses a squared error loss function.
  – Points which are far from the predicted line (outliers) are overemphasized.

[Plot: the squared-error loss function.]
Alternative loss function

• Instead of squared error, try the absolute value of the error:
    min Σᵢ |sᵢ|
    s.t. −s ≤ Aw + be − y ≤ s

[Plot: the absolute-value loss function.]

This is called the 1-norm loss function.
1-Norm Problems And Solution

• Problem: the 1-norm loss overemphasizes error on points close to the predicted line
• Solution: Huber loss function, a hybrid approach

[Plot: the Huber loss function, quadratic near zero and linear in the tails.]

Many practitioners prefer the Huber loss function.
Mathematical Formulation

• g indicates the switchover from quadratic to linear:

    ρ(t) = t²/2           if |t| ≤ g
           g·|t| − g²/2   if |t| > g

[Plot: the Huber loss ρ(t), quadratic on [−g, g] and linear outside.]

Larger g means “more quadratic.”
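A minimal sketch of this loss in Python (the function name is ours; g = 1.345 is the setting that appears in the experiments later in the talk):

    import numpy as np

    def huber_loss(t, g):
        """Huber loss: quadratic for |t| <= g, linear for |t| > g (formula above)."""
        t = np.asarray(t, dtype=float)
        return np.where(np.abs(t) <= g, t**2 / 2.0, g * np.abs(t) - g**2 / 2.0)

    # Small residuals are penalized quadratically, large ones only linearly.
    print(huber_loss([-3.0, -0.5, 0.0, 0.5, 3.0], g=1.345))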
Regression Approach Summary

• Quadratic Loss Function
  – Standard method in statistics
  – Over-emphasizes outliers
• Linear Loss Function (1-norm)
  – Formulates well as a linear program
  – Over-emphasizes small errors
• Huber Loss Function (hybrid approach)
  – Appropriate emphasis on large and small errors

[Plots: the quadratic, 1-norm, and Huber loss functions.]
Previous attempts complicated

• Earlier efforts to solve Huber regression:
  – Huber: Gauss-Seidel method
  – Madsen/Nielsen: Newton Method
  – Li: Conjugate Gradient Method
  – Smola: Dual Quadratic Program
• Our new approach: a convex quadratic program

    min over w ∈ Rᵈ, z ∈ Rˡ, t ∈ Rˡ:  Σᵢ zᵢ²/2 + g·Σᵢ |tᵢ|
    s.t. z − t ≤ Aw + be − y ≤ z + t

Our new approach is simpler and faster.
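As a sketch only, here is the quadratic program above written with the cvxpy modeling library; the original experiments used CPLEX and the authors' own code, so cvxpy, its default solver, and the random toy data are stand-ins for illustration.

    import numpy as np
    import cvxpy as cp

    def huber_regression_qp(A, y, g):
        """Sketch of the convex QP above:
           min  sum(z_i^2)/2 + g * sum(|t_i|)
           s.t. z - t <= A w + b e - y <= z + t."""
        m, d = A.shape
        w, b = cp.Variable(d), cp.Variable()
        z, t = cp.Variable(m), cp.Variable(m)
        resid = A @ w + b * np.ones(m) - y
        problem = cp.Problem(
            cp.Minimize(cp.sum_squares(z) / 2 + g * cp.sum(cp.abs(t))),
            [z - t <= resid, resid <= z + t])
        problem.solve()
        return w.value, b.value

    # Illustrative random data only.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 3))
    y = A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
    print(huber_regression_qp(A, y, g=1.345))

At the optimum, t absorbs |Aw + be − y − z|, so minimizing over z recovers exactly the Huber loss of each residual.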
Experimental Results: Census20k

20,000 points, 11 features

[Bar chart: time to solution (CPU sec, 0–600) for Li, Madsen/Nielsen, Huber, Smola, and MM at g = 1.345, 1, and 0.1. MM is faster.]
Experimental Results: CPUSmall

8,192 points, 12 features

[Bar chart: time to solution (CPU sec, 0–200) for Li, Madsen/Nielsen, Huber, Smola, and MM at g = 1.345, 1, and 0.1. MM is faster.]
Introduce nonlinear kernel!

• Begin with the previous formulation:

    min over w ∈ Rᵈ, z ∈ Rˡ, t ∈ Rˡ:  Σᵢ zᵢ²/2 + g·Σᵢ |tᵢ|
    s.t. z − t ≤ Aw + be − y ≤ z + t

• Substitute w = A′α and minimize over α instead:

    z − t ≤ AA′α + be − y ≤ z + t

• Substitute K(A, A′) for AA′:

    z − t ≤ K(A, A′)α + be − y ≤ z + t

A kernel is a nonlinear function.
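The slides do not spell out the kernel's form at this point, but the Gaussian kernel named in the results table below is a standard choice; a minimal sketch (the function name and the bandwidth mu are ours, not values from the talk):

    import numpy as np

    def gaussian_kernel(A, B, mu=0.1):
        """K(A, B')_ij = exp(-mu * ||A_i - B_j||^2); mu is a chosen bandwidth."""
        sq_dists = (np.sum(A**2, axis=1)[:, None]
                    + np.sum(B**2, axis=1)[None, :]
                    - 2.0 * A @ B.T)
        return np.exp(-mu * np.maximum(sq_dists, 0.0))

With w = A′α, the term Aw = AA′α in the constraints becomes K(A, A′)α, so the same quadratic-programming machinery fits a nonlinear surface.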
Nonlinear results

  Dataset    Kernel     Training Accuracy   Testing Accuracy
  CPUSmall   Linear     94.50%              94.06%
             Gaussian   97.26%              95.90%
  Boston     Linear     85.60%              83.81%
  Housing    Gaussian   92.36%              88.15%

Nonlinear kernels improve accuracy.
Part II:
Tolerant Regression
a.k.a. Tolerant Training
Regression Approach Summary

• Quadratic Loss Function
  – Standard method in statistics
  – Over-emphasizes outliers
• Linear Loss Function (1-norm)
  – Formulates well as a linear program
  – Over-emphasizes small errors
• Huber Loss Function (hybrid approach)
  – Appropriate emphasis on large and small errors

[Plots: the quadratic, 1-norm, and Huber loss functions.]
Optimization problem (1-norm)

• Find w, b such that:
    Aw + be ≈ y
• Bound the error by s:
    −s ≤ Aw + be − y ≤ s
• Minimize the error:
    min Σᵢ |sᵢ|
    s.t. −s ≤ Aw + be − y ≤ s

Minimize the magnitude of the error.
The overfitting issue

• Noisy training data can be fitted “too well”
  – leads to poor generalization on future data

[Figure: data points A and fitted line ŷ = Aw + be.]

• Prefer simpler regressions, i.e. where
  – some w coefficients are zero
  – the line is “flatter”
Reducing overfitting

• To achieve both goals, also minimize the magnitude of the w vector:

    min Σᵢ |wᵢ| + C·Σᵢ sᵢ
    s.t. −s ≤ Aw + be − y ≤ s

• C is a parameter that balances the two goals
  – Chosen by experimentation
• Reduces overfitting due to points far from the surface
Overfitting again: “close” points

• “Close” points may be wrong due to noise only
  – The line should be influenced by “real” data, not noise

[Figure: data points A and fitted line ŷ = Aw + be, with a band of width ε around the line.]

• Ignore errors from those points which are close!
Tolerant regression

• Allow an interval of size ε with uniform error:

    min Σᵢ |wᵢ| + C·Σᵢ sᵢ
    s.t. −s ≤ Aw + be − y ≤ s,  εe ≤ s

• How large should ε be?
  – As large as possible, while preserving accuracy:

    min Σᵢ |wᵢ| + C·Σᵢ sᵢ − C·μ·ε
    s.t. −s ≤ Aw + be − y ≤ s,  εe ≤ s
How about a nonlinear surface?
Introduce nonlinear kernel!

• Begin with the previous formulation:

    min Σᵢ |wᵢ| + C·Σᵢ sᵢ − C·μ·ε
    s.t. −s ≤ Aw + be − y ≤ s,  εe ≤ s

• Substitute w = A′α and minimize over α instead:

    −s ≤ AA′α + be − y ≤ s

• Substitute K(A, A′) for AA′:

    −s ≤ K(A, A′)α + be − y ≤ s

A kernel is a nonlinear function.
Our improvements

• This formulation and interpretation are new!
  – Improves intuition from prior results
  – Uses fewer variables
  – Solves faster!
• Computational tests run on DMI Locop2
  – Dell PowerEdge 6300 server with four gigabytes of memory and 36 gigabytes of disk space
  – Windows NT Server 4.0
  – CPLEX 6.5 solver
  – Donated to UW by Microsoft Corporation
Comparison Results

  Dataset     Measure             μ = 0     μ = 0.1   ...   Total   Time Improvement
  Census      Tuning set error    5.10%     4.74%     ...
              ε                   0.00      0.02      ...           Max: 79.7%
              SSR time (sec)      980       935       ...   5086
              MM time (sec)       199       294       ...   3765    Avg: 26.0%
  CompActiv   Tuning set error    6.60%     6.32%     ...
              ε                   0.00      3.09      ...           Max: 65.7%
              SSR time (sec)      1364      1286      ...   7604
              MM time (sec)       468       660       ...   6533    Avg: 14.1%
  Boston      Tuning set error    14.69%    14.62%    ...
  Housing     ε                   0.00      0.42      ...           Max: 52.0%
              SSR time (sec)      36        34        ...   170
              MM time (sec)       17        23        ...   140     Avg: 17.6%
Problem size concerns

• How does the problem scale?
  – m = number of points
  – n = number of features
• For linear kernel: problem size is O(mn)
    −s ≤ Aw + be − y ≤ s
• For nonlinear kernel: problem size is O(m²)
    −s ≤ K(A, A′)α + be − y ≤ s

Thousands of data points ==> massive problem!
Need an algorithm that will scale well.
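As a worked instance of the O(m²) growth: the 16,000-point Census run on the chunking results slide later produces a kernel LP with 32,000 nonsparse rows and columns, and 32,000 × 32,000 = 1.024 billion nonzero values.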
Chunking approach

• Idea: Use a chunking method
  – Bring as much into memory as possible
  – Solve this subset of the problem
  – Retain solution and integrate into next subset
• Explored in depth by Paul Bradley and O.L. Mangasarian for linear kernels

Solve in pieces, one chunk at a time.
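A rough sketch of the row-chunking loop in Python, under the slides' description; row_chunking and solve_subproblem are our placeholder names, and solve_subproblem stands for an LP/QP solve (such as the tolerant-regression LP above) that returns the positions of the rows active in its solution. This is not the authors' exact procedure.

    import numpy as np

    def row_chunking(A, y, chunk_rows, solve_subproblem):
        """Repeatedly solve on a chunk that fits in memory, keep the rows that
        are active in that solution (the support vectors), and merge them
        into the next chunk."""
        m = A.shape[0]
        carried = np.array([], dtype=int)        # support-vector rows retained so far
        for start in range(0, m, chunk_rows):
            chunk = np.arange(start, min(start + chunk_rows, m))
            rows = np.union1d(carried, chunk)    # previous SVs + new chunk
            active = solve_subproblem(A[rows], y[rows])    # positions within `rows`
            carried = rows[np.asarray(active, dtype=int)]  # keep only active rows
        return carried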
Row-Column Chunking

• Why column chunking also?
  – If a nonlinear kernel is used, chunks are very wide.
  – A wide chunk must have a small number of rows to fit in memory.

Both these chunks use the same memory!
Chunking Experimental Results

  Dataset:                   16,000-point subset of Census in R¹¹, plus noise
  Kernel:                    Gaussian radial basis kernel
  LP size:                   32,000 nonsparse rows and columns
  Problem size:              1.024 billion nonzero values
  Time to termination:       18.8 days
  Number of SVs:             1,621 support vectors
  Solution variables:        33 nonzero components
  Final tuning set error:    9.8%
  Tuning set error on
  first chunk (1,000 points): 16.2%
Objective Value & Tuning Set Error
for Billion-Element Matrix
Tuning Set Error
25000
20%
20000
18%
Tuning Set Error
Objective Value
Objective Value
15000
10000
5000
16%
14%
12%
10%
8%
0
0
0
500
5
1000
9
1500
13
2000
18
Row-Column Chunk Iteration Number
Time in Days
0
0
500
5
1000
9
1500
13
2000
18
Row-Column Chunk Iteration Number
Time in Days
Given enough time, we find the right answer!
Integration into data mining tools

• Method runs as a stand-alone application, with data resident on disk
• With minimal effort, it could sit on top of an RDBMS to manage data input/output
  – Queries select a subset of the data, which is easily expressed in SQL
• Database queries occur “infrequently”
  – Data mining can be performed on a different machine from the one maintaining the DBMS
• Licensing of a linear program solver is necessary

Algorithm can integrate with data mining tools.
Part III:
Active Support Vector Machines
a.k.a. ASVM
The Classification Problem

[Figure: two point classes A+ and A−, bounding planes x′w = γ + 1 and x′w = γ − 1, and separating surface x′w = γ with normal vector w.]

Find surface to best separate two classes.
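The picture corresponds to the standard soft-margin linear SVM; as a generic illustration only (this is not ASVM, which, as the next slide notes, avoids LP/QP packages entirely), here is a cvxpy sketch with an illustrative C:

    import numpy as np
    import cvxpy as cp

    def linear_svm(A, d, C=1.0):
        """Generic soft-margin linear SVM (illustration only, not ASVM):
           min  0.5*||w||^2 + C*sum(xi)
           s.t. d_i * (A_i'w - gamma) >= 1 - xi_i,  xi >= 0,
        where d is the vector of +/-1 class labels, so x'w = gamma + 1 and
        x'w = gamma - 1 bound the classes A+ and A-."""
        m, n = A.shape
        w, gamma = cp.Variable(n), cp.Variable()
        xi = cp.Variable(m, nonneg=True)   # slack for points violating the margin
        margins = cp.multiply(d, A @ w - gamma * np.ones(m))
        problem = cp.Problem(
            cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)),
            [margins >= 1 - xi])
        problem.solve()
        return w.value, gamma.value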
Active Support Vector Machine

• Features
  – Solves classification problems
  – No special software tools necessary! No LP or QP!
  – FAST. Works on very large problems.
  – Web page: www.cs.wisc.edu/~musicant/asvm
    • Available for download and can be integrated into data mining tools
    • MATLAB integration already provided

  # of points   Features   Iterations   Time (CPU min)
  4 million     32         5            38.04
  7 million     32         5            95.57
Summary and Future Work

• Summary
  – Robust regression can be modeled simply and efficiently as a quadratic program
  – Tolerant regression can be used to solve massive regression problems
  – ASVM can solve massive classification problems quickly
• Future work
  – Parallel approaches
  – Distributed approaches
  – ASVM for various types of regression

Questions?