Support Vector Machine Classification
Computation & Informatics in Biology & Medicine
Madison Retreat, November 15, 2002
Olvi L. Mangasarian
with
G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg
& Collaborators at ExonHit – Paris
Data Mining Institute
University of Wisconsin - Madison
What is a Support Vector Machine?
 An optimally defined surface
 Linear or nonlinear in the input space
 Linear in a higher dimensional feature space
 Implicitly defined by a kernel function $K(A, B)$ that maps a pair of data matrices to a matrix of pairwise similarities
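For example, one commonly used kernel (given here only as an illustration; the slide does not fix a particular choice) is the Gaussian kernel, which for $A \in R^{m \times n}$ and $B \in R^{n \times k}$ produces

$$K(A,B)_{ij} = \exp\!\left(-\mu\,\|A_i{}' - B_{\cdot j}\|_2^2\right), \qquad i = 1,\dots,m,\; j = 1,\dots,k,$$

where $A_i$ is the $i$-th row of $A$, $B_{\cdot j}$ is the $j$-th column of $B$, and $\mu > 0$ is a kernel parameter.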
What are Support Vector Machines
Used For?
 Classification
 Regression & Data Fitting
 Supervised & Unsupervised Learning
Principal Topics
Proximal support vector machine classification
Classify by proximity to planes instead of halfspaces
Massive incremental classification
Classify by retiring old data & adding new data
Knowledge-based classification
Incorporate expert knowledge into a classifier
Fast Newton method classifier
Finitely terminating fast algorithm for classification
Breast cancer prognosis & chemotherapy
Classify patients on the basis of distinct survival curves
 Isolate a class of patients that may benefit from
chemotherapy
Principal Topics
Proximal support vector machine classification
Support Vector Machines
Maximize the Margin between Bounding Planes
[Figure: bounding planes $x'w = \gamma + 1$ and $x'w = \gamma - 1$, with the points of A+ on one side and the points of A− on the other; the margin between the planes is $2/\|w\|_2$.]
Proximal Support Vector Machines
Maximize the Margin between Proximal Planes
[Figure: proximal planes $x'w = \gamma + 1$ and $x'w = \gamma - 1$, around which the points of A+ and A− cluster respectively; the distance between the planes is $2/\|w\|_2$.]
Standard Support Vector Machine
Algebra of 2-Category Linearly Separable Case
 Given m points in n-dimensional space
 Represented by an m-by-n matrix A
 Membership of each point $A_i$ in class +1 or −1 specified by:
 An m-by-m diagonal matrix D with +1 and −1 entries
 Separate by two bounding planes, $x'w = \gamma \pm 1$:

$$A_i w \ge \gamma + 1 \ \text{ for } D_{ii} = +1; \qquad A_i w \le \gamma - 1 \ \text{ for } D_{ii} = -1.$$

 More succinctly:

$$D(Aw - e\gamma) \ge e,$$

where e is a vector of ones.
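As a minimal check of this condition (an illustrative sketch, not from the talk; the four points and the candidate plane below are made up):

% Minimal sketch: check the separation condition D(A*w - e*gamma) >= e
% for a small hand-made dataset and a candidate plane.
A = [2 2; 3 1; -1 -2; -2 -1];              % rows of A are 4 points in R^2
d = [1; 1; -1; -1];                        % class labels, d = diag(D)
w = [1; 1]; gamma = 0;                     % candidate plane x'*w = gamma
e = ones(size(A,1),1);
separated = all(d.*(A*w - gamma*e) >= e)   % 1 if every point lies outside its bounding plane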
Standard Support Vector Machine
Formulation
 Solve the quadratic program for some $\nu > 0$:

$$\min_{y,\,w,\,\gamma}\ \frac{\nu}{2}\,\|y\|_2^2 + \frac{1}{2}\,\|w,\gamma\|_2^2 \quad \text{s.t.}\quad D(Aw - e\gamma) + y \ge e \qquad \text{(QP)}$$

where $D_{ii} = \pm 1$ denotes A+ or A− membership.
 Margin is maximized by minimizing $\frac{1}{2}\,\|w,\gamma\|_2^2$
Proximal SVM Formulation
(PSVM)
Standard SVM formulation (QP), with the inequality constraint replaced by an equality:

$$\min_{y,\,w,\,\gamma}\ \frac{\nu}{2}\,\|y\|_2^2 + \frac{1}{2}\,\|w,\gamma\|_2^2 \quad \text{s.t.}\quad D(Aw - e\gamma) + y = e$$

Solving for $y$ in terms of $w$ and $\gamma$ gives:

$$\min_{w,\,\gamma}\ \frac{\nu}{2}\,\|e - D(Aw - e\gamma)\|_2^2 + \frac{1}{2}\,\|w,\gamma\|_2^2$$
This simple but critical modification changes the nature of the optimization problem tremendously!
(Regularized Least Squares or Ridge Regression)
Advantages of New Formulation
 Objective function remains strongly convex.
 An explicit exact solution can be written in terms
of the problem data.
 PSVM classifier is obtained by solving a single system of linear equations in the usually small-dimensional input space.
 Exact leave-one-out-correctness can be obtained in
terms of problem data.
Linear PSVM
We want to solve:
$$\min_{w,\,\gamma}\ \frac{\nu}{2}\,\|e - D(Aw - e\gamma)\|_2^2 + \frac{1}{2}\,\|w,\gamma\|_2^2$$
Setting the gradient equal to zero gives a nonsingular system of linear equations. The solution of this system gives the desired PSVM classifier.
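Concretely, writing $z = [w;\,\gamma]$ and $H = [A\;\; -e]$ (the notation of the next slide), the objective is $\frac{\nu}{2}\|e - DHz\|_2^2 + \frac{1}{2}\|z\|_2^2$, and setting its gradient to zero (using $DD = I$) gives

$$-\nu H'D(e - DHz) + z = 0 \quad\Longleftrightarrow\quad \left(\frac{I}{\nu} + H'H\right) z = H'De.$$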
Linear PSVM Solution
$$\begin{bmatrix} w \\ \gamma \end{bmatrix} = \left(\frac{I}{\nu} + H'H\right)^{-1} H'De$$

Here, $H = [A\;\; -e]$
 The linear system to solve depends on $H'H$, which is of size $(n + 1) \times (n + 1)$
 $n$ is usually much smaller than $m$
Linear & Nonlinear PSVM MATLAB Code
function [w, gamma] = psvm(A,d,nu)
% PSVM: linear and nonlinear classification
% INPUT:  A (m-by-n data matrix), d = diag(D) (m-by-1 vector of +1/-1 labels), nu > 0
% OUTPUT: w (normal to the plane), gamma (threshold)
% USAGE:  [w, gamma] = psvm(A,d,nu);
[m,n] = size(A); e = ones(m,1); H = [A -e];
v = (d'*H)';                       % v = H'*D*e
r = (speye(n+1)/nu + H'*H)\v;      % solve (I/nu + H'*H) r = v
w = r(1:n); gamma = r(n+1);        % extract w and gamma from r = [w; gamma]
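A minimal usage sketch (not from the talk; the two Gaussian point clouds below are synthetic data chosen only for illustration):

% Train psvm on two shifted Gaussian clouds and report training-set correctness.
m = 200; n = 2; nu = 1;
A = [randn(m/2,n) + 2; randn(m/2,n) - 2];   % class +1 shifted up-right, class -1 shifted down-left
d = [ones(m/2,1); -ones(m/2,1)];
[w, gamma] = psvm(A, d, nu);
correctness = mean(sign(A*w - gamma) == d)  % fraction of training points on the correct side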
Numerical experiments
One-Billion Two-Class Dataset
 Synthetic dataset consisting of 1 billion points in 10-dimensional input space
 Generated by NDC (Normally Distributed Clustered) dataset generator
 Dataset divided into 500 blocks of 2 million points each
 Solution obtained in less than 2 hours and 26 minutes on a 400 MHz machine
 About 30% of the time was spent reading data from disk
 Testing set correctness: 90.79%
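A sketch of how such blocked, incremental processing might look (an illustration under assumed block sizes, not the code used in the experiment): only the (n+1)-by-(n+1) matrix H'H and the (n+1)-vector H'De are kept in memory, and each block contributes to them before being discarded; the synthetic in-loop data below merely stands in for reading a block from disk.

n = 10; nu = 1; numBlocks = 50; blockSize = 2000;    % small stand-in sizes for illustration
HtH = zeros(n+1); v = zeros(n+1,1);
for block = 1:numBlocks
    % Stand-in for reading one block from disk: half the rows in class +1, half in class -1.
    db = [ones(blockSize/2,1); -ones(blockSize/2,1)];
    Ab = randn(blockSize,n) + db*ones(1,n);          % class +1 shifted by +1, class -1 by -1
    Hb  = [Ab -ones(blockSize,1)];
    HtH = HtH + Hb'*Hb;                              % accumulate H'*H
    v   = v + (db'*Hb)';                             % accumulate H'*D*e
end
r = (eye(n+1)/nu + HtH)\v;                           % same linear system as in psvm
w = r(1:n); gamma = r(n+1);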
Principal Topics
Knowledge-based classification (NIPS*2002)
Conventional Data-Based SVM
Knowledge-Based SVM
via Polyhedral Knowledge Sets
Incorporating Knowledge Sets
Into an SVM Classifier
 Suppose that the knowledge set $\{x \mid Bx \le b\}$ belongs to the class A+. Hence it must lie in the halfspace:

$$\{x \mid x'w \ge \gamma + 1\}$$

 We therefore have the implication:

$$Bx \le b \;\Longrightarrow\; x'w \ge \gamma + 1$$
 This implication is equivalent to a set of
constraints that can be imposed on the classification
problem.
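One way to see this, sketched here via the linear-programming duality argument used in knowledge-based SVMs (and assuming the knowledge set $\{x \mid Bx \le b\}$ is nonempty): the implication holds exactly when there exists a multiplier vector $u$ with

$$u \ge 0, \qquad B'u + w = 0, \qquad b'u + \gamma + 1 \le 0,$$

and these linear conditions in $(u, w, \gamma)$ are the constraints that can be added to the classification problem.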
Numerical Testing
The Promoter Recognition Dataset
 Promoter: Short DNA sequence that
precedes a gene sequence.
 A promoter consists of 57 consecutive
DNA nucleotides belonging to {A,G,C,T} .
 Important to distinguish between
promoters and nonpromoters
 This distinction identifies starting locations
of genes in long uncharacterized DNA
sequences.
The Promoter Recognition Dataset
Numerical Representation
 Simple “1 of N” mapping scheme for converting nominal attributes into a real-valued representation
 Not the most economical representation, but commonly used.
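A minimal sketch of such a 1-of-N (here 1-of-4) mapping for a short nucleotide string; the particular ordering A, G, C, T of the indicator positions is an arbitrary choice for illustration, not taken from the talk.

% 1-of-4 mapping of a DNA string into a real-valued vector (illustration only).
seq = 'GATC';                          % a short example sequence
alphabet = 'AGCT';
x = zeros(1, 4*length(seq));           % 4 indicator entries per nucleotide
for i = 1:length(seq)
    j = find(alphabet == seq(i));      % which of the 4 letters this position is
    x(4*(i-1) + j) = 1;                % set the corresponding indicator to 1
end
x                                      % e.g. G -> 0 1 0 0, A -> 1 0 0 0, ...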
The Promoter Recognition Dataset
Numerical Representation
 Feature space mapped from the 57-dimensional nominal space to a real-valued 57 × 4 = 228 dimensional space.
[Figure: 57 nominal values → 57 × 4 = 228 binary values]
Promoter Recognition Dataset
Prior Knowledge Rules
 Prior knowledge consists of the following 64 rules:

$$\begin{bmatrix} R_1 \\ \text{or} \\ R_2 \\ \text{or} \\ R_3 \\ \text{or} \\ R_4 \end{bmatrix} \;\wedge\; \begin{bmatrix} R_5 \\ \text{or} \\ R_6 \\ \text{or} \\ R_7 \\ \text{or} \\ R_8 \end{bmatrix} \;\wedge\; \begin{bmatrix} R_9 \\ \text{or} \\ R_{10} \\ \text{or} \\ R_{11} \\ \text{or} \\ R_{12} \end{bmatrix} \;\Longrightarrow\; \text{PROMOTER}$$
Promoter Recognition Dataset
Sample Rules
$$R_4: \ (p_{-36} = T) \wedge (p_{-35} = T) \wedge (p_{-34} = G) \wedge (p_{-33} = A) \wedge (p_{-32} = C)$$
$$R_8: \ (p_{-12} = T) \wedge (p_{-11} = A) \wedge (p_{-7} = T)$$
$$R_{10}: \ (p_{-45} = A) \wedge (p_{-44} = A) \wedge (p_{-41} = A)$$

where $p_j$ denotes the position of a nucleotide with respect to a meaningful reference point, starting at position $p_{-50}$ and ending at position $p_{7}$.
Then:

$$R_4 \wedge R_8 \wedge R_{10} \;\Longrightarrow\; \text{PROMOTER}$$
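To connect these rules with the knowledge-based SVM above: under the 1-of-N encoding, a conjunction such as $R_8$ describes a polyhedral set of the form $Bx \le b$. As an illustrative sketch (the exact encoding used in the talk is not shown, and the feature names below are hypothetical), writing $x_{T,-12}$, $x_{A,-11}$, $x_{T,-7}$ for the binary features that $R_8$ asserts, the rule corresponds to the knowledge set

$$\left\{\, x \;\middle|\; -x_{T,-12} \le -1,\;\; -x_{A,-11} \le -1,\;\; -x_{T,-7} \le -1,\;\; 0 \le x \le e \,\right\},$$

which is of the form $\{x \mid Bx \le b\}$ and is asserted to belong to the class PROMOTER.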
The Promoter Recognition Dataset
Comparative Algorithms
 KBANN: knowledge-based artificial neural network [Shavlik et al]
 BP: standard back propagation for neural networks [Rumelhart et al]
 O’Neill’s Method: empirical method suggested by biologist O’Neill [O’Neill]
 NN: nearest neighbor with k=3 [Cost et al]
 ID3: Quinlan’s decision tree builder [Quinlan]
 SVM1: standard 1-norm SVM [Bradley et al]
The Promoter Recognition Dataset
Comparative Test Results
Wisconsin Breast Cancer Prognosis Dataset
Description of the data
 110 instances corresponding to 41 patients whose cancer
had recurred and 69 patients whose cancer had not recurred
 32 numerical features
 The domain theory: two simple rules used by doctors:
Wisconsin Breast Cancer Prognosis Dataset
Numerical Testing Results
 The doctors’ rules are applicable to only 32 of the 110 patients.
 Only 22 of those 32 patients are classified correctly by the rules (20% correctness over all 110 patients).
 KSVM linear classifier applicable to all
patients with correctness of 66.4%.
 Correctness comparable to best available
results using conventional SVMs.
 KSVM can generate classifiers based on knowledge alone, without using any data.
Principal Topics
Fast Newton method classifier
Fast Newton Algorithm for Classification
The standard quadratic programming (QP) formulation of the SVM is solved as the equivalent unconstrained minimization of:

$$f(z) = f(w, \gamma) = \frac{\nu}{2}\,\big\|\big(e - D(Aw - e\gamma)\big)_+\big\|_2^2 + \frac{1}{2}\,\|w,\gamma\|_2^2,$$

where $(\cdot)_+$ replaces negative components by zero.
Newton Algorithm:

$$z_{i+1} = z_i - \partial^2 f(z_i)^{-1}\,\nabla f(z_i)$$
Newton algorithm terminates in a finite number of steps
Termination at global minimum
Error rate decreases linearly
Can generate complex nonlinear classifiers
By using nonlinear kernels: K(x,y)
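A rough sketch of one such Newton iteration for the unconstrained objective above (illustration only: the plus function is not twice differentiable, so an indicator of its active part stands in for the Hessian, and the safeguards of the finitely terminating algorithm, such as a stepsize rule, are omitted here).

% Generalized Newton iteration for
%   f(z) = nu/2*||(e - D*H*z)_+||^2 + 1/2*||z||^2,  H = [A -e], z = [w; gamma].
function z = newton_svm_sketch(A, d, nu, iters)
[m,n] = size(A); e = ones(m,1); H = [A -e]; D = spdiags(d,0,m,m);
z = zeros(n+1,1);                               % z = [w; gamma], start at the origin
for i = 1:iters
    p    = max(e - D*(H*z), 0);                 % plus function (e - D(Aw - e*gamma))_+
    grad = z - nu*(H'*(D*p));                   % gradient of f at z
    S    = spdiags(double(p > 0), 0, m, m);     % indicator of the active components
    hess = speye(n+1) + nu*(H'*(S*H));          % generalized Hessian of f at z
    z    = z - hess\grad;                       % Newton step
end
end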
Nonlinear Spiral Dataset
94 Red Dots & 94 White Dots
Principal Topics
Breast cancer prognosis & chemotherapy
Kaplan-Meier Curves for Overall Patients:
With & Without Chemotherapy
Breast Cancer Prognosis & Chemotherapy
Good, Intermediate & Poor Patient Groupings
(6 Input Features : 5 Cytological, 1 Histological)
(Grouping: Utilizes 2 Histological Features & Chemotherapy)
Kaplan-Meier Survival Curves
for Good, Intermediate & Poor Patients
82.7% Classifier Correctness via 3 SVMs
Kaplan-Meier Survival Curves for Intermediate Group
Note Reversed Role of Chemotherapy
Conclusion
New methods for classification
All based on rigorous mathematical foundation
Fast computational algorithms capable of classifying
massive datasets
Classifiers based on both abstract prior knowledge as well
as conventional datasets
Identification of breast cancer patients that can benefit from
chemotherapy
Future Work
Extend proposed methods to broader optimization problems
 Linear & quadratic programming
 Preliminary results beat state-of-the-art software
Incorporate abstract concepts into optimization problems as
constraints
Develop fast online algorithms for intrusion and fraud
detection
 Classify the effectiveness of new drug cocktails in
combating various forms of cancer
Encouraging preliminary results for breast cancer
Breast Cancer Treatment Response
Joint with ExonHit (French BioTech)
35 patients treated by a drug cocktail
9 partial responders; 26 nonresponders
25 gene expression measurements made on each patient
1-Norm SVM classifier selected: 12 out of 25 genes
Combinatorially selected 6 genes out of 12
Separating plane obtained:
2.7915 T11 + 0.13436 S24 -1.0269 U23 -2.8108 Z23 -1.8668 A19 -1.5177 X05 +2899.1 = 0.
 Leave-one-out-error: 1 out of 35 (97.1% correctness)
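As a small illustration of how such a plane is used (the expression values below are hypothetical, and which sign corresponds to partial responders is an assumption, not stated on the slide):

% Evaluate the separating plane above for one patient's six selected genes.
coef  = [2.7915 0.13436 -1.0269 -2.8108 -1.8668 -1.5177];  % T11 S24 U23 Z23 A19 X05
genes = [10 500 20 15 30 25];                              % hypothetical expression values
side  = sign(coef*genes' + 2899.1)                         % which side of the plane the patient falls on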
Detection of Alternative RNA Isoforms via DATAS
(Levels of mRNA that Correlate with Sensitivity to Chemotherapy)
[Figure: DNA with exons E1–E5 and introns I1–I4 is transcribed into pre-mRNA (5' to 3'); alternative RNA splicing then yields different mRNA isoforms (e.g. E1 E2 E3 E4 E5 versus E1 E2 E4 E5), which are translated into chemo-sensitive versus chemo-resistant proteins; DATAS detects the transcript differences (e.g. E3).]
DATAS: Differential Analysis of Transcripts with Alternative Splicing
Talk Available
www.cs.wisc.edu/~olvi