Support Vector Machine Data Mining
Olvi L. Mangasarian
with
Glenn M. Fung, Jude W. Shavlik
& Collaborators at ExonHit – Paris
Data Mining Institute
University of Wisconsin - Madison
What is a Support Vector Machine?
 An optimally defined surface
 Linear or nonlinear in the input space
 Linear in a higher dimensional feature space
 Implicitly defined by a kernel function

K(A, B) → C
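As an illustration of the K(A, B) → C notation, the sketch below computes a Gaussian (RBF) kernel matrix. The Gaussian form, the width parameter mu, and the convention that A holds points as rows and B holds points as columns are assumptions made for this example; the slides do not fix a particular kernel.

import numpy as np

def gaussian_kernel(A, B, mu=0.1):
    """Gaussian (RBF) kernel: C[i, j] = exp(-mu * ||A[i, :] - B[:, j]||^2).

    A is m x n, B is n x k, so C is m x k, matching the K(A, B) -> C notation above.
    """
    Bt = B.T                                   # rows of B.T are the k points
    sq_dists = (
        (A ** 2).sum(axis=1)[:, None]          # ||A_i||^2
        + (Bt ** 2).sum(axis=1)[None, :]       # ||B_j||^2
        - 2.0 * A @ Bt.T                       # -2 A_i . B_j
    )
    return np.exp(-mu * sq_dists)

# Example: 5 points in R^3 against 2 points in R^3
A = np.random.rand(5, 3)
B = np.random.rand(3, 2)   # columns are the second set of points
print(gaussian_kernel(A, B).shape)   # (5, 2)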
What are Support Vector Machines
Used For?
 Classification
 Regression & Data Fitting
 Supervised & Unsupervised Learning
Principal Topics
Knowledge-based classification
Incorporate expert knowledge into a classifier
Breast cancer prognosis & chemotherapy
Classify patients on basis of distinct survival curves
 Isolate a class of patients that may benefit from
chemotherapy
Multiple Myeloma detection via gene expression
measurements
Drug discovery based on gene macroarray expression
Joint work with ExonHit
Support Vector Machines
Maximize the Margin between Bounding Planes
Figure: the bounding planes $x'w = \gamma + 1$ (bounding the class A+) and $x'w = \gamma - 1$ (bounding the class A−), with normal vector $w$; the margin between them is $\frac{2}{\|w\|}$.
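A standard way to write this margin maximization as an optimization problem (a sketch of the usual soft-margin formulation in the notation above; the slides do not spell it out, so the slack vector $y$ and penalty parameter $\nu$ are standard additions):

\[
\min_{w,\,\gamma,\,y \ge 0} \;\; \nu\, e^\top y + \tfrac{1}{2}\, \|w\|^2
\quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e,
\]

where the rows of $A$ are the data points, $D$ is a diagonal matrix of $\pm 1$ class labels, and $e$ is a vector of ones; minimizing $\|w\|^2$ is equivalent to maximizing the margin $2/\|w\|$.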
Principal Topics
Knowledge-based classification (NIPS*2002)
Conventional Data-Based SVM
Knowledge-Based SVM
via Polyhedral Knowledge Sets
Incorporating Knowledge Sets
Into an SVM Classifier
 Suppose that the knowledge set $\{x \mid Bx \le b\}$ belongs to the class A+. Hence it must lie in the halfspace:
$\{x \mid x'w \ge \gamma + 1\}$
 We therefore have the implication:
$Bx \le b \;\Longrightarrow\; x'w \ge \gamma + 1$
 This implication is equivalent to a set of constraints that can be imposed on the classification problem.
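One concrete way to obtain those constraints (a sketch of the linear-programming-duality argument behind the knowledge-based SVM; the multiplier vector $u$ is introduced here for exposition): assuming the knowledge set $\{x \mid Bx \le b\}$ is nonempty, the implication $Bx \le b \Rightarrow x'w \ge \gamma + 1$ holds if and only if there exists $u \ge 0$ with

\[
B^\top u + w = 0, \qquad b^\top u + \gamma + 1 \le 0.
\]

These are linear conditions in $(u, w, \gamma)$ and can therefore be appended to the SVM optimization problem, in practice with slack variables so that imperfect knowledge is tolerated.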
Numerical Testing
The Promoter Recognition Dataset
 Promoter: Short DNA sequence that
precedes a gene sequence.
 A promoter consists of 57 consecutive
DNA nucleotides belonging to {A,G,C,T} .
 Important to distinguish between
promoters and nonpromoters
 This distinction identifies starting locations
of genes in long uncharacterized DNA
sequences.
The Promoter Recognition Dataset
Numerical Representation
 Simple "1 of N" mapping scheme for converting nominal attributes into a real-valued representation
 Not the most economical representation, but commonly used
The Promoter Recognition Dataset
Numerical Representation
 Feature space mapped from the 57-dimensional nominal space to a real-valued 57 × 4 = 228 dimensional space: each of the 57 nominal values becomes 4 binary values, for 57 × 4 = 228 binary features (a minimal encoding sketch follows).
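A minimal sketch of the "1 of N" mapping for this dataset; the helper name, the sample window, and the A/G/C/T ordering are illustrative choices, not taken from the slides.

import numpy as np

NUCLEOTIDES = "AGCT"   # order of the 4 indicator positions is an arbitrary choice

def one_of_n_encode(sequence):
    """Map a 57-character window over {A, G, C, T} to a 0/1 vector of length 57 * 4 = 228."""
    assert len(sequence) == 57
    x = np.zeros(57 * 4)
    for i, nucleotide in enumerate(sequence):
        x[4 * i + NUCLEOTIDES.index(nucleotide)] = 1.0
    return x

# Example: encode an arbitrary (made-up) 57-nucleotide window
window = ("ACGT" * 15)[:57]
print(one_of_n_encode(window).shape)   # (228,)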
Promoter Recognition Dataset
Prior Knowledge Rules
 Prior knowledge consists of the following 64 rules:
\[
(R_1 \vee R_2 \vee R_3 \vee R_4) \;\wedge\; (R_5 \vee R_6 \vee R_7 \vee R_8) \;\wedge\; (R_9 \vee R_{10} \vee R_{11} \vee R_{12}) \;\Longrightarrow\; \text{PROMOTER}
\]
(Expanding the three disjunctions gives 4 × 4 × 4 = 64 conjunctive rules.)
Promoter Recognition Dataset
Sample Rules
$R_4: (p_{-36} = \text{T}) \wedge (p_{-35} = \text{T}) \wedge (p_{-34} = \text{G}) \wedge (p_{-33} = \text{A}) \wedge (p_{-32} = \text{C})$
$R_8: (p_{-12} = \text{T}) \wedge (p_{-11} = \text{A}) \wedge (p_{-07} = \text{T})$
$R_{10}: (p_{-45} = \text{A}) \wedge (p_{-44} = \text{A}) \wedge (p_{-41} = \text{A})$
where $p_j$ denotes the position of a nucleotide with respect to a meaningful reference point, starting at position $p_{-50}$ and ending at position $p_7$. Then:
$R_4 \wedge R_8 \wedge R_{10} \Longrightarrow \text{PROMOTER}$
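To connect these rules with the knowledge sets $\{x \mid Bx \le b\}$ used earlier, here is one way to express a conjunction such as R4 as linear inequalities over the 228-dimensional binary encoding: requiring the indicator bit for nucleotide N at position p to be on is the constraint $-x_i \le -1$ on that coordinate. The position-to-index convention (no position 0) and the helper names are assumptions made for this sketch; in the actual knowledge-based formulation one would typically also add the box constraints $0 \le x \le 1$ to keep the knowledge set bounded.

import numpy as np

NUCLEOTIDES = "AGCT"   # same (assumed) ordering as in the encoding sketch above

def position_index(p):
    """Map a position label in {-50, ..., -1, 1, ..., 7} to a 0-based slot in the 57-long window.

    Assumption: there is no position 0; the slides only say the window runs from p_-50 to p_7.
    """
    return p + 50 if p < 0 else p + 49

def rule_to_knowledge_set(rule):
    """Turn a conjunction rule, e.g. {-36: 'T', -35: 'T'}, into (B, b) such that
    {x : B x <= b} forces the required one-hot bits to be >= 1."""
    rows = []
    for p, nucleotide in rule.items():
        row = np.zeros(57 * 4)
        row[4 * position_index(p) + NUCLEOTIDES.index(nucleotide)] = -1.0   # -x_i <= -1  <=>  x_i >= 1
        rows.append(row)
    B = np.vstack(rows)
    b = -np.ones(len(rows))
    return B, b

# R4 from the slide
R4 = {-36: 'T', -35: 'T', -34: 'G', -33: 'A', -32: 'C'}
B, b = rule_to_knowledge_set(R4)
print(B.shape, b.shape)   # (5, 228) (5,)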
The Promoter Recognition Dataset
Comparative Algorithms
 KBANN: Knowledge-based artificial neural network [Shavlik et al.]
 BP: Standard back propagation for neural networks [Rumelhart et al.]
 O'Neill's Method: Empirical method suggested by biologist O'Neill [O'Neill]
 NN: Nearest neighbor with k = 3 [Cost et al.]
 ID3: Quinlan's decision tree builder [Quinlan]
 SVM1: Standard 1-norm SVM [Bradley et al.]
The Promoter Recognition Dataset
Comparative Test Results
Principal Topics
Breast cancer prognosis & chemotherapy
Kaplan-Meier Curves for Overall Patients:
With & Without Chemotherapy
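A Kaplan-Meier plot like the one on this slide can be produced with the lifelines library; the sketch below uses made-up survival times and a hypothetical chemo flag purely to show the mechanics, not the Wisconsin patient data.

import pandas as pd
from lifelines import KaplanMeierFitter

# Hypothetical survival data: time to event (months), event indicator, chemo flag
df = pd.DataFrame({
    "months": [12, 34, 60, 7, 45, 80, 23, 66],
    "event":  [1, 0, 1, 1, 0, 0, 1, 0],      # 1 = death observed, 0 = censored
    "chemo":  [1, 1, 1, 1, 0, 0, 0, 0],
})

kmf = KaplanMeierFitter()
ax = None
for flag, label in [(1, "Chemo"), (0, "No Chemo")]:
    group = df[df["chemo"] == flag]
    kmf.fit(group["months"], event_observed=group["event"], label=label)
    ax = kmf.plot(ax=ax)                      # overlay both curves on one axis
ax.set_xlabel("Months")
ax.set_ylabel("Estimated survival probability")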
Breast Cancer Prognosis & Chemotherapy
Good, Intermediate & Poor Patient Groupings
(6 Input Features : 5 Cytological, 1 Histological)
(Clustering: Utilizes 2 Histological Features & Chemotherapy)
253 Patients (113 NoChemo, 140 Chemo)
 Seed groups defined from the input features:
Good1: Lymph = 0 AND Tumor < 2
Poor1: Lymph >= 5 OR Tumor >= 4
 Compute the medians of Good1 and Poor1 (using the 6 features) as initial cluster centers
 Cluster the 113 NoChemo patients and the 140 Chemo patients separately, using the k-Median algorithm initialized at the medians of Good1 & Poor1
 Resulting groupings:
Good: 69 NoChemo Good
Intermediate: 44 NoChemo Poor + 67 Chemo Good
Poor: 73 Chemo Poor
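The k-Median clustering step can be sketched as follows: a minimal numpy implementation under the usual definition (assign points to the nearest center in the 1-norm, then update each center to the coordinate-wise median of its points). The toy data and initial centers are placeholders, not the patient data.

import numpy as np

def k_median(X, centers, max_iter=100):
    """k-Median clustering: 1-norm assignment + coordinate-wise median update."""
    centers = centers.copy()
    for _ in range(max_iter):
        # Assign each point to the nearest center in the 1-norm
        dists = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)   # (n_points, k)
        labels = dists.argmin(axis=1)
        # Update each center to the coordinate-wise median of its cluster
        new_centers = np.array([
            np.median(X[labels == j], axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy example with 2 clusters in a 6-dimensional feature space
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 6)), rng.normal(4, 1, (50, 6))])
init = np.array([X[0], X[-1]])          # stand-ins for the Good1 / Poor1 medians
labels, centers = k_median(X, init)
print(np.bincount(labels))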
Kaplan-Meier Survival Curves
for Good, Intermediate & Poor Patients
82.7% Classifier Correctness via 3 SVMs
Kaplan-Meier Survival Curves for Intermediate Group
Note Reversed Role of Chemotherapy
Multiple Myeloma Detection
Multiple Myeloma is cancer of the plasma cell
Plasma cells normally produce antibodies
Out-of-control plasma cells produce tumors
When tumors appear in multiple sites, the disease is called Multiple Myeloma
Dataset
105 patients: 74 with MM, 31 healthy
Each patient is represented by 7008 gene measurements
taken from plasma cell samples
For each of the 7008 gene measurements:
Absolute Call (AC):
Absent (A), Marginal (M) or Present (P)
Average Difference (AD):
Positive or negative number
Multiple Myeloma Data Representation
A1 0 0
M0 1 0
P 0 0 1
AMP  7008 X 3 = 21024
AD  7008
Total = 28,032 per patient
104 Patients: 74 MM + 31 Healthy
104 X 28,032 Data Matrix A
Multiple Myeloma 1-Norm SVM Linear Classifier
Leave-one-out-correctness (looc) = 100%
Average number of features used = 7 per fold
Total computing time for 105 folds = 7892 sec.
 Overall number of features used in 105 folds = 7
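For readers who want to see how a 1-norm SVM performs this kind of automatic feature selection, here is a sketch that solves the standard 1-norm SVM linear program with scipy. The formulation (minimize ν e'y + e's subject to D(Aw − eγ) + y ≥ e, −s ≤ w ≤ s, y ≥ 0) follows the usual form of the 1-norm SVM cited in the slides [Bradley et al.]; the toy data and the value of ν are placeholders.

import numpy as np
from scipy.optimize import linprog

def one_norm_svm(A, d, nu=1.0):
    """Solve min nu*e'y + e's  s.t.  D(Aw - e*gamma) + y >= e, -s <= w <= s, y >= 0.

    A: m x n data matrix, d: +/-1 labels. Returns (w, gamma); many entries of w
    come out exactly zero, which is the feature-selection effect.
    """
    m, n = A.shape
    # Variable order: [w (n), gamma (1), y (m), s (n)]
    c = np.concatenate([np.zeros(n), [0.0], nu * np.ones(m), np.ones(n)])
    DA = d[:, None] * A
    # -D A w + d*gamma - y <= -1
    block1 = np.hstack([-DA, d[:, None], -np.eye(m), np.zeros((m, n))])
    #  w - s <= 0   and   -w - s <= 0   encode |w| <= s
    block2 = np.hstack([np.eye(n), np.zeros((n, 1)), np.zeros((n, m)), -np.eye(n)])
    block3 = np.hstack([-np.eye(n), np.zeros((n, 1)), np.zeros((n, m)), -np.eye(n)])
    A_ub = np.vstack([block1, block2, block3])
    b_ub = np.concatenate([-np.ones(m), np.zeros(2 * n)])
    bounds = [(None, None)] * (n + 1) + [(0, None)] * (m + n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n], res.x[n]

# Toy problem: only 3 of 50 features actually carry the class signal
rng = np.random.default_rng(0)
A = rng.normal(size=(60, 50))
d = np.sign(A[:, 0] + A[:, 1] - A[:, 2])
w, gamma = one_norm_svm(A, d, nu=1.0)
print("nonzero weights:", np.flatnonzero(np.abs(w) > 1e-6))
print("training correctness:", (np.sign(A @ w - gamma) == d).mean())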
Breast Cancer Treatment Response
Joint with ExonHit - Paris (Curie Dataset)
35 patients treated by a drug cocktail
9 partial responders; 26 nonresponders
25 gene expressions out of 692, selected by Arnaud Zeboulon
Most patients had 3 replicate measurements
1-Norm SVM classifier selected 14 out of 25 gene expressions
 Leave-one-out correctness was 80%
Greedy combinatorial approach selected 5 genes out of 14
Separating plane obtained in 5-dimensional gene-expression
space
 Replicates of all patients except one used in training
Average of replicates of patient left out used for testing
Leave-one-out correctness was 33 out of 35, or 94.2%
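The leave-one-patient-out protocol described above can be sketched as follows; scikit-learn's LinearSVC stands in here for the 1-norm SVM classifier of the slides, and the data, labels, and replicate counts are synthetic placeholders.

import numpy as np
from sklearn.svm import LinearSVC

def leave_one_patient_out(X, y, patient_ids):
    """Leave-one-patient-out evaluation with replicate measurements.

    Train on all replicates of the other patients; test on the average of the
    held-out patient's replicates (all replicates of a patient share one label).
    """
    correct = 0
    patients = np.unique(patient_ids)
    for p in patients:
        held_out = patient_ids == p
        clf = LinearSVC()
        clf.fit(X[~held_out], y[~held_out])
        test_point = X[held_out].mean(axis=0, keepdims=True)
        correct += int(clf.predict(test_point)[0] == y[held_out][0])
    return correct / len(patients)

# Toy stand-in: 35 "patients", 3 replicates each, 5 "gene" features
rng = np.random.default_rng(1)
patient_ids = np.repeat(np.arange(35), 3)
labels = np.where(np.arange(35) < 9, 1, -1)           # 9 responders, 26 nonresponders
y = labels[patient_ids]
X = rng.normal(size=(len(y), 5)) + y[:, None] * 0.8   # replicates share the patient's signal
print("leave-one-patient-out correctness:", leave_one_patient_out(X, y, patient_ids))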
Separation of Convex Hull of Replicates of:
10 Synthetic Nonresponders & 4 Synthetic Partial Responders
Linear Classifier in 3-Gene Space
35 Patients with 93 Replicates
26 Nonresponders & 9 Partial Responders
Conclusion
New approaches for SVM-based classification
Algorithms capable of classifying data with few examples in
very large dimensional spaces
Typical of microarray classification problems
Classifiers based on abstract prior knowledge as well as on conventional datasets
Identification of breast cancer patients who may benefit from chemotherapy
 Useful tool for drug discovery