Download Mathematical Programming in Support Vector Machines

Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University of Wisconsin - Madison What is a Support Vector Machine?  An optimally defined surface  Linear or nonlinear in the input space  Linear in a higher dimensional feature space  Implicitly defined by a kernel function  K(A,B)  C What are Support Vector Machines Used For?  Classification  Regression & Data Fitting  Supervised & Unsupervised Learning Principal Topics Knowledge-based classification Incorporate expert knowledge into a classifier Breast cancer prognosis & chemotherapy Classify patients on basis of distinct survival curves  Isolate a class of patients that may benefit from chemotherapy Multiple Myeloma detection via gene expression measurements Drug discovery based on gene macroarray expression Joint work with ExonHit Support Vector Machines Maximize the Margin between Bounding Planes w x 0w = í + 1 A+ A- x 0w = í à 1 2 jj wjj Principal Topics Knowledge-based classification (NIPS*2002) Conventional Data-Based SVM Knowledge-Based SVM via Polyhedral Knowledge Sets Incoporating Knowledge Sets Into an SVM Classifier è ? é  Suppose that the knowledge set: x ? Bx 6 b belongs to the class A+. Hence it must lie in the halfspace : è é x j x 0w> í + 1  We therefore have the implication: Bx 6 b ) x w> í + 1 0  This implication is equivalent to a set of constraints that can be imposed on the classification problem. Numerical Testing The Promoter Recognition Dataset  Promoter: Short DNA sequence that precedes a gene sequence.  A promoter consists of 57 consecutive DNA nucleotides belonging to {A,G,C,T} .  Important to distinguish between promoters and nonpromoters  This distinction identifies starting locations of genes in long uncharacterized DNA sequences. The Promoter Recognition Dataset Numerical Representation  Simple “1 of N” mapping scheme for converting nominal attributes into a real valued representation:  Not most economical representation, but commonly used. The Promoter Recognition Dataset Numerical Representation  Feature space mapped from 57-dimensional nominal space to a real valued 57 x 4=228 dimensional space. 57 nominal values 57 x 4 =228 binary values Promoter Recognition Dataset Prior Knowledge Rules  Prior knowledge consist of the following 64 rules: 2 3 R1 6 or 7 6 7 6 R2 7 V 6 7 6 or 7 6 7 6 R3 7 4 5 or R4 2 3 R5 6 or 7 6 7 6 R6 7 V 6 7 6 or 7 6 7 6 R7 7 4 5 or R8 2 3 R9 6 or 7 6 7 6 R10 7 6 7 = ) PROM OTER 6 or 7 6 7 6 R11 7 4 5 or R12 Promoter Recognition Dataset Sample Rules R4 : (pà 36 = T) ^ (pà 35 = T) ^ (pà 34 = G) ^ (pà 33 = A) ^ (pà 32 = C); R8 : (pà 12 = T) ^ (pà 11 = A) ^ (pà 07 = T); R10 : (pà 45 = A) ^ (pà 44 = A) ^ (pà 41 = A); where pj denotes position of a nucleotide, with respect to a meaningful reference point starting at position pà 50 and ending at position p7: Then: R4 ^ R8 ^ R10 =) PROM OTER The Promoter Recognition Dataset Comparative Algorithms  KBANN Knowledge-based artificial neural network [Shavlik et al]  BP: Standard back propagation for neural networks [Rumelhart et al]  O’Neill’s Method Empirical method suggested by biologist O’Neill [O’Neill]  NN: Nearest neighbor with k=3 [Cost et al]  ID3: Quinlan’s decision tree builder[Quinlan]  SVM1: Standard 1-norm SVM [Bradley et al] The Promoter Recognition Dataset Comparative Test Results Principal Topics Breast cancer prognosis & chemotherapy Kaplan-Meier Curves for Overall Patients: With & Without Chemotherapy Breast Cancer Prognosis & Chemotherapy Good, Intermediate & Poor Patient Groupings (6 Input Features : 5 Cytological, 1 Histological) (Clustering: Utilizes 2 Histological Features &Chemotherapy) 253 Patients (113 NoChemo, 140 Chemo) Good1: Lymph=0 AND Tumor<2 Compute Median Using 6 Features Compute Initial Cluster Centers Poor1: Lymph>=5 OR Tumor>=4 Compute Median Using 6 Features Cluster 113 NoChemo Patients Cluster 140 Chemo Patients Use k-Median Algorithm with Initial Centers: Use k-Median Algorithm with Initial Centers: Medians of Good1 & Poor1 Medians of Good1 & Poor1 69 NoChemo Good Good 44 NoChemo Poor 67 Chemo Good Intermediate 73 Chemo Poor Poor Kaplan-Meier Survival Curves for Good, Intermediate & Poor Patients 82.7% Classifier Correctness via 3 SVMs Kaplan-Meier Survival Curves for Intermediate Group Note Reversed Role of Chemotherapy Multiple Myeloma Detection Multiple Myeloma is cancer of the plasma cell Plasma cells normally produce antibodies Out of control plasma cells produce tumors When tumors appear in multiple sites they are called Multiple Myeloma Dataset 105 patients: 74 with MM, 31 healthy Each patient is represented by 7008 gene measurements taken from plasma cell samples For each one of the 7008 gene measurements Absolute Call (AC): Absent (A), Marginal (M) or Present (P) Average Difference (AD): Positive or negative number Multiple Myeloma Data Representation A1 0 0 M0 1 0 P 0 0 1 AMP  7008 X 3 = 21024 AD  7008 Total = 28,032 per patient 104 Patients: 74 MM + 31 Healthy 104 X 28,032 Data Matrix A Multiple Myeloma 1-Norm SVM Linear Classifier Leave-one-out-correctness (looc) = 100% Average number of features used = 7 per fold Total computing time for 105 folds = 7892 sec.  Overall number of features used in 105 folds= 7 Breast Cancer Treatment Response Joint with ExonHit - Paris (Curie Dataset) 35 patients treated by a drug cocktail 9 partial responders; 26 nonresponders 25 gene expressions out of 692, selected by Arnaud Zeboulon Most patients had 3 replicate measurements 1-Norm SVM classifier selected 14 out of 25 gene expressions  Leave-one-out correctness was 80% Greedy combinatorial approach selected 5 genes out of 14 Separating plane obtained in 5-dimensional gene-expression space  Replicates of all patients except one used in training Average of replicates of patient left out used for testing Leave-one-out correctness was 33 out of 35, or 94.2% Separation of Convex Hull of Replicates of: 10 Synthetic Nonresponders & 4 Synthetic Partial Responders Linear Classifier in 3-Gene Space 35 Patients with 93 Replicates 26 Nonresponders & 9 Partial Responders Conclusion New approaches for SVM-based classification Algorithms capable of classifying data with few examples in very large dimensional spaces Typical of microarray classification problems Classifiers based on both abstract prior knowledge as well as conventional datasets Identification of breast cancer patients that can benefit from chemotherapy  Useful tool for drug discovery

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Mathematical Programming in Support Vector Machines