Support Vector Machine Classification
Computation & Informatics in Biology & Medicine, Madison Retreat, November 15, 2002
Olvi L. Mangasarian
with G. M. Fung, Y.-J. Lee, J. W. Shavlik, W. H. Wolberg & collaborators at ExonHit - Paris
Data Mining Institute, University of Wisconsin - Madison

What is a Support Vector Machine?
* An optimally defined surface
* Linear or nonlinear in the input space
* Linear in a higher-dimensional feature space
* Implicitly defined by a kernel function K(A, B)

What are Support Vector Machines Used For?
* Classification
* Regression & data fitting
* Supervised & unsupervised learning

Principal Topics
* Proximal support vector machine classification: classify by proximity to planes instead of halfspaces
* Massive incremental classification: classify by retiring old data & adding new data
* Knowledge-based classification: incorporate expert knowledge into a classifier
* Fast Newton method classifier: finitely terminating fast algorithm for classification
* Breast cancer prognosis & chemotherapy: classify patients on the basis of distinct survival curves; isolate a class of patients that may benefit from chemotherapy

Principal Topics: Proximal Support Vector Machine Classification

Support Vector Machines: Maximize the Margin between Bounding Planes
* Bounding planes: x'w = γ + 1 (bounding A+) and x'w = γ − 1 (bounding A−)
* Margin between the planes: 2/||w||

Proximal Support Vector Machines: Maximize the Margin between Proximal Planes
* The two classes cluster around the proximal planes x'w = γ + 1 and x'w = γ − 1
* Margin between the planes: 2/||w||

Standard Support Vector Machine: Algebra of the 2-Category Linearly Separable Case
* Given m points in n-dimensional space, represented by an m-by-n matrix A
* Membership of each point A_i in class +1 or −1 is specified by an m-by-m diagonal matrix D with +1 and −1 entries
* Separate by two bounding planes x'w = γ ± 1:
      A_i w ≥ γ + 1, for D_ii = +1;
      A_i w ≤ γ − 1, for D_ii = −1.
* More succinctly: D(Aw − eγ) ≥ e, where e is a vector of ones.

Standard Support Vector Machine Formulation
* Solve the quadratic program, for some ν > 0:
      min_{w,γ,y}  (ν/2)||y||² + (1/2)||(w, γ)||²
      s.t.  D(Aw − eγ) + y ≥ e,                       (QP)
  where D_ii = ±1 denotes A+ or A− membership.
* The margin is maximized by minimizing (1/2)||(w, γ)||².

Proximal SVM Formulation (PSVM)
* Standard SVM: inequality constraint D(Aw − eγ) + y ≥ e.
* PSVM: replace the inequality by the equality D(Aw − eγ) + y = e.
* Solving for y in terms of w and γ gives the unconstrained problem:
      min_{w,γ}  (ν/2)||e − D(Aw − eγ)||² + (1/2)||(w, γ)||²
* This simple but critical modification changes the nature of the optimization problem tremendously (regularized least squares, or ridge regression).

Advantages of the New Formulation
* The objective function remains strongly convex.
* An explicit exact solution can be written in terms of the problem data.
* The PSVM classifier is obtained by solving a single system of linear equations in the usually small-dimensional input space.
* Exact leave-one-out correctness can be obtained in terms of the problem data.

Linear PSVM
* We want to solve:
      min_{w,γ}  (ν/2)||e − D(Aw − eγ)||² + (1/2)||(w, γ)||²
* Setting the gradient equal to zero gives a nonsingular system of linear equations whose solution is the desired PSVM classifier.

Linear PSVM Solution
      [w; γ] = (I/ν + H'H)⁻¹ H'De,  where H = [A  −e]
* The linear system to solve depends on H'H, which is of size (n+1)-by-(n+1); n is usually much smaller than m.

Linear & Nonlinear PSVM MATLAB Code

      function [w, gamma] = psvm(A,d,nu)
      % PSVM: linear and nonlinear classification
      % INPUT: A, d=diag(D), nu. OUTPUT: w, gamma
      % [w, gamma] = psvm(A,d,nu);
      [m,n]=size(A); e=ones(m,1); H=[A -e];
      v=(d'*H)';                    % v = H'*D*e
      r=(speye(n+1)/nu+H'*H)\v;     % solve (I/nu+H'*H)r=v
      w=r(1:n); gamma=r(n+1);       % extract w, gamma from r

Numerical Experiments: One-Billion-Point Two-Class Dataset
* Synthetic dataset consisting of 1 billion points in 10-dimensional input space
* Generated by the NDC (Normally Distributed Clustered) dataset generator
* Dataset divided into 500 blocks of 2 million points each
* Solution obtained in less than 2 hours and 26 minutes on a 400 MHz machine
* About 30% of the time was spent reading data from disk
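The MATLAB routine above translates directly to other array languages. Here is a NumPy sketch of the same linear-system solve (a dense identity stands in for speye, since n+1 is small; the toy data at the bottom is illustrative, not from the talk):

```python
import numpy as np

def psvm(A, d, nu):
    """Linear proximal SVM: solve (I/nu + H'H) r = H'De for r = [w; gamma].

    A  : (m, n) data matrix, one point per row
    d  : (m,) vector of +1/-1 class labels (the diagonal of D)
    nu : positive regularization parameter
    """
    m, n = A.shape
    e = np.ones(m)
    H = np.hstack([A, -e[:, None]])   # H = [A  -e]
    v = H.T @ d                       # v = H'*D*e, and D*e = d
    r = np.linalg.solve(np.eye(n + 1) / nu + H.T @ H, v)
    return r[:n], r[n]                # w, gamma

# Two well-separated clusters in the plane
A = np.array([[2.0, 2.0], [3.0, 2.5], [2.5, 3.0],
              [-2.0, -2.0], [-3.0, -2.5], [-2.5, -3.0]])
d = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
w, gamma = psvm(A, d, nu=1.0)
pred = np.sign(A @ w - gamma)         # classify by the sign of x'w - gamma
```

A point x is assigned to class +1 when x'w − γ > 0 and to class −1 otherwise; on this toy data the classifier separates both clusters.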
* Testing set correctness: 90.79%

Principal Topics: Knowledge-Based Classification (NIPS*2002)

Conventional Data-Based SVM

Knowledge-Based SVM via Polyhedral Knowledge Sets

Incorporating Knowledge Sets into an SVM Classifier
* Suppose that the knowledge set {x | Bx ≤ b} belongs to the class A+. Hence it must lie in the halfspace {x | x'w ≥ γ + 1}.
* We therefore have the implication:  Bx ≤ b  ⇒  x'w ≥ γ + 1.
* This implication is equivalent to a set of constraints that can be imposed on the classification problem.

Numerical Testing: The Promoter Recognition Dataset
* Promoter: a short DNA sequence that precedes a gene sequence.
* A promoter consists of 57 consecutive DNA nucleotides belonging to {A, G, C, T}.
* It is important to distinguish between promoters and nonpromoters: this distinction identifies the starting locations of genes in long uncharacterized DNA sequences.

The Promoter Recognition Dataset: Numerical Representation
* Simple "1 of N" mapping scheme for converting nominal attributes into a real-valued representation
* Not the most economical representation, but commonly used
* Feature space mapped from the 57-dimensional nominal space to a real-valued 57 × 4 = 228-dimensional space.
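The "1 of N" mapping just described can be sketched in a few lines of Python (a hypothetical helper for illustration, not the original preprocessing code):

```python
# "1 of N" mapping: each of the 57 nucleotides becomes 4 binary features,
# giving a 57 x 4 = 228-dimensional real-valued vector.
NUCLEOTIDES = "AGCT"

def one_of_n(sequence):
    """Map a 57-character DNA string over {A,G,C,T} to a 228-dim 0/1 vector."""
    vec = []
    for base in sequence:
        vec.extend(1.0 if base == nt else 0.0 for nt in NUCLEOTIDES)
    return vec

x = one_of_n("A" * 28 + "T" * 29)   # a toy 57-mer, not a real promoter
```

Each position contributes exactly one 1 among its four slots, so the mapped vector always sums to 57.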
57 nominal values → 57 × 4 = 228 binary values

Promoter Recognition Dataset: Prior Knowledge Rules
* Prior knowledge consists of the following 64 rules:
      (R1 or R2 or R3 or R4) and (R5 or R6 or R7 or R8) and (R9 or R10 or R11 or R12)  ⇒  PROMOTER

Promoter Recognition Dataset: Sample Rules
      R4:  (p−36 = T) ∧ (p−35 = T) ∧ (p−34 = G) ∧ (p−33 = A) ∧ (p−32 = C)
      R8:  (p−12 = T) ∧ (p−11 = A) ∧ (p−07 = T)
      R10: (p−45 = A) ∧ (p−44 = A) ∧ (p−41 = A)
  where p_j denotes the position of a nucleotide with respect to a meaningful reference point, starting at position p−50 and ending at position p7. Then:
      R4 ∧ R8 ∧ R10  ⇒  PROMOTER

The Promoter Recognition Dataset: Comparative Algorithms
* KBANN: knowledge-based artificial neural network [Shavlik et al.]
* BP: standard back propagation for neural networks [Rumelhart et al.]
* O'Neill's method: empirical method suggested by biologist O'Neill [O'Neill]
* NN: nearest neighbor with k = 3 [Cost et al.]
* ID3: Quinlan's decision tree builder [Quinlan]
* SVM1: standard 1-norm SVM [Bradley et al.]

The Promoter Recognition Dataset: Comparative Test Results

Wisconsin Breast Cancer Prognosis Dataset: Description of the Data
* 110 instances corresponding to 41 patients whose cancer had recurred and 69 patients whose cancer had not recurred
* 32 numerical features
* The domain theory: two simple rules used by doctors

Wisconsin Breast Cancer Prognosis Dataset: Numerical Testing Results
* The doctors' rules are applicable to only 32 of the 110 patients, and only 22 of those 32 are classified correctly (20% correctness).
* The KSVM linear classifier is applicable to all patients, with a correctness of 66.4%.
* Correctness comparable to the best available results using conventional SVMs.
* KSVM can produce classifiers based on knowledge alone, without using any data.
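The knowledge-set implication Bx ≤ b ⇒ x'w ≥ γ + 1 used in the knowledge-based classifier above can also be sanity-checked numerically. A minimal sketch, assuming the special case of a box-shaped knowledge set l ≤ x ≤ u (expressible as Bx ≤ b): over a box, the minimum of x'w has a closed coordinatewise form, and the implication holds exactly when that minimum is at least γ + 1.

```python
import numpy as np

def box_implies_halfspace(l, u, w, gamma):
    """Check the implication  l <= x <= u  =>  x'w >= gamma + 1.

    Over the box [l, u], min x'w is attained coordinatewise: take l_i
    where w_i > 0, else u_i. The implication holds iff that minimum
    is >= gamma + 1.
    """
    x_min = np.where(w > 0, l, u)
    return float(w @ x_min) >= gamma + 1

l = np.array([2.0, 2.0])
u = np.array([3.0, 3.0])
w = np.array([1.0, 1.0])

ok = box_implies_halfspace(l, u, w, gamma=2.0)    # min x'w = 4 >= 3
bad = box_implies_halfspace(l, u, w, gamma=4.0)   # min x'w = 4 <  5
```

In the KSVM itself the implication is not checked after the fact; it is converted into an equivalent set of linear constraints that are imposed on the classification problem, as the slides state.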
Principal Topics: Fast Newton Method Classifier

Fast Newton Algorithm for Classification
* Start from the standard quadratic programming (QP) formulation of the SVM and minimize the equivalent unconstrained objective, where (·)+ = max{·, 0} denotes the plus function:
      f(z) = (ν/2)||(e − D(Aw − eγ))+||² + (1/2)||(w, γ)||²,  z = (w, γ)

Newton Algorithm
      z_{i+1} = z_i − ∂²f(z_i)⁻¹ ∇f(z_i)
* The Newton algorithm terminates in a finite number of steps, at the global minimum
* The error rate decreases linearly
* Can generate complex nonlinear classifiers by using nonlinear kernels K(x, y)

Nonlinear Spiral Dataset: 94 Red Dots & 94 White Dots

Principal Topics: Breast Cancer Prognosis & Chemotherapy

Kaplan-Meier Curves for Overall Patients: With & Without Chemotherapy

Breast Cancer Prognosis & Chemotherapy: Good, Intermediate & Poor Patient Groupings
(6 input features: 5 cytological, 1 histological; grouping utilizes 2 histological features & chemotherapy)

Kaplan-Meier Survival Curves for Good, Intermediate & Poor Patients
* 82.7% classifier correctness via 3 SVMs

Kaplan-Meier Survival Curves for the Intermediate Group: Note the Reversed Role of Chemotherapy

Conclusion
* New methods for classification, all based on a rigorous mathematical foundation
* Fast computational algorithms capable of classifying massive datasets
* Classifiers based on both abstract prior knowledge and conventional datasets
* Identification of breast cancer patients who can benefit from chemotherapy

Future Work
* Extend the proposed methods to broader optimization problems: linear & quadratic programming; preliminary results beat state-of-the-art software
* Incorporate abstract concepts into optimization problems as constraints
* Develop fast online algorithms for intrusion and fraud detection
* Classify the effectiveness of new drug cocktails in combating various forms of cancer; encouraging preliminary results for breast cancer

Breast Cancer Treatment Response (Joint with ExonHit, a French biotech company)
* 35 patients treated by a drug cocktail: 9 partial responders, 26 nonresponders
* 25 gene expression measurements made on each patient
* A 1-norm SVM classifier selected 12
out of 25 genes.
* 6 genes were then selected combinatorially out of the 12.
* Separating plane obtained:
      2.7915 T11 + 0.13436 S24 − 1.0269 U23 − 2.8108 Z23 − 1.8668 A19 − 1.5177 X05 + 2899.1 = 0
* Leave-one-out error: 1 out of 35 (97.1% correctness)

Detection of Alternative RNA Isoforms via DATAS
(Levels of mRNA that Correlate with Sensitivity to Chemotherapy)
[Figure: DNA is transcribed to pre-mRNA (exons E1-E5, introns I1-I4); alternative RNA splicing yields different mRNAs, translated into chemo-sensitive vs. chemo-resistant proteins; DATAS compares the transcripts]
* DATAS: Differential Analysis of Transcripts with Alternative Splicing

Talk Available: www.cs.wisc.edu/~olvi
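As a closing footnote, the 6-gene separating plane reported above is straightforward to apply to a new expression profile. A sketch (the gene ordering and sign convention are assumptions; the slides do not say which side corresponds to responders):

```python
# Coefficients of the separating plane from the slide, keyed by gene:
# 2.7915 T11 + 0.13436 S24 - 1.0269 U23 - 2.8108 Z23 - 1.8668 A19
#   - 1.5177 X05 + 2899.1 = 0
COEF = {"T11": 2.7915, "S24": 0.13436, "U23": -1.0269,
        "Z23": -2.8108, "A19": -1.8668, "X05": -1.5177}
INTERCEPT = 2899.1

def decision_value(expr):
    """Evaluate the plane at an expression profile (dict gene -> level).

    Positive and negative values fall on opposite sides of the
    separating plane; zero means the profile lies on the plane.
    """
    return sum(COEF[g] * expr[g] for g in COEF) + INTERCEPT

v = decision_value({g: 0.0 for g in COEF})   # all-zero profile -> 2899.1
```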