A Short and Simple Introduction to Linear Discriminants (with almost no math)
Jennifer Listgarten, November 2002.

Introduction
• A linear discriminant is a family of mathematical models that allows us to classify data (such as microarray data) into preset groups (e.g. cancer vs. non-cancer, metastatic vs. non-metastatic, responds well to a drug vs. responds poorly).
• 'Discriminant' simply means that it has the ability to discriminate between two classes.
• The meaning of the word 'linear' will become clearer later.

Motivation I
• We previously spoke at great length about common clustering methods for microarray data (unsupervised learning).
• Supervised techniques are much more powerful/useful.
• Linear discriminants (a supervised method) are among the older, well-studied supervised techniques, in both traditional statistics and machine learning.

Motivation II
• Linear discriminants are widely used today in many application domains, including the modeling of various types of biological data.
• Many classes or sub-classes of techniques are actually linear discriminants (e.g. artificial neural networks, the Fisher discriminant, support vector machines, and many more).
• They provide a very general framework upon which much has been built, i.e. one that can be extended into very sophisticated, robust techniques.

E.g.: Classifying Cancer Patients vs. Healthy Patients from Microarray Data
• Each patient is a vector of expression levels: Patient_X = (gene_1, gene_2, gene_3, ..., gene_N).
• N (the number of dimensions) is normally much larger than 2, so we can't visualize the data.
• For simplicity, pretend that we are only looking at the expression levels of 2 genes.
[Scatter plot: Gene_1 vs. Gene_2 expression levels, each running from -5 (down-regulated) to 5 (up-regulated); cancerous samples cluster in the upper right, healthy samples in the lower left.]
Question: how can we build a classifier for this data?

A Simple Classification Rule
IF gene_1 < 0 AND gene_2 < 0 THEN person = healthy
IF gene_1 > 0 AND gene_2 > 0 THEN person = cancerous
If we move away from our simple example with 2 genes to a realistic case with, say, 5000 genes, the rules become
IF gene_1 < 0 AND gene_2 < 0 AND ... AND gene_5000 < Y THEN person = healthy
IF gene_1 > 0 AND gene_2 > 0 AND ... AND gene_5000 > W THEN person = cancerous
1. What will these rules look like?
2. How will we find them?
It gets a little complicated and unwieldy...

Reformulating the Previous Rule as a Line
SIMPLE RULE:
• If the data point lies to the 'left' of the line, then 'healthy'.
• If the data point lies to the 'right' of the line, then 'cancerous'.
It is easier to generalize this line to 5000 genes than it is to generalize a list of rules, and it is also easier to handle mathematically.

More Than 2 Genes (Dimensions)? Easy to Extend
• Line in 2D: x1C1 + x2C2 = T
• If we had 3 genes and needed to build a 'line' in 3-dimensional space, we would be seeking a plane. Plane in 3D: x1C1 + x2C2 + x3C3 = T
• In more than 3 dimensions, the 'plane' is called a hyperplane: simply the generalization of a plane to dimensions higher than 3. Hyperplane in N dimensions: x1C1 + x2C2 + x3C3 + ... + xNCN = T
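To make the side-of-hyperplane rule concrete, here is a minimal Python sketch. The weight vector C and threshold T below are made-up illustrative values, not taken from the talk:

```python
import numpy as np

def classify(x, C, T):
    """Side-of-hyperplane rule: compare x1*C1 + ... + xN*CN against T."""
    return "cancerous" if np.dot(x, C) > T else "healthy"

# Hypothetical 2-gene example matching the scatter plot above:
# cancerous samples sit in the upper right, healthy in the lower left.
C = np.array([1.0, 1.0])                       # one weight per gene
T = 0.0                                        # threshold
print(classify(np.array([2.5, 3.1]), C, T))    # -> cancerous
print(classify(np.array([-1.2, -0.7]), C, T))  # -> healthy
```

The same function works unchanged for N = 5000 genes; only the lengths of x and C grow, which is exactly why the line generalizes more gracefully than a list of IF/THEN rules.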
Why Is It Called 'Linear'?
The rule of 'which side of the line is the point on' looks, mathematically, like:
• gene1*C1 + gene2*C2 > T, then cancerous
• gene1*C1 + gene2*C2 < T, then healthy
It is linear in the input (the gene expression levels).

Linear vs. Non-Linear
Linear:
• gene1*C1 + gene2*C2 > T vs. < T
• 1/[1 + exp(−(gene1*C1 + gene2*C2 + T))] > 0.5 vs. < 0.5, the 'logistic' linear discriminant (the function is non-linear, but its decision boundary is still a line)
Non-linear:
• gene1²*C1 + gene2*C2 > T vs. < T
• gene1*gene2*C > T vs. < T
Mathematically, linear problems are generally much easier to solve than non-linear problems.

Back to Our Linear Discriminant
• There are actually many (infinitely many) lines that 'properly' divide the points. Which is the correct one?
• One solution (the one SVMs use):
1. Find a line that has all the data points on the proper side.
2. Of all lines that satisfy (1), find the one that maximizes the 'margin' (the smallest distance between any point and the line).
3. This is called 'constrained optimization' in mathematics.
[Figures: the same separable data divided by a largest-margin line and by several smaller-margin lines.]

Obtaining Different 'Lines': Objective Functions
• In general, the line that you end up with depends on some criterion, defined by the 'objective function' (for the SVM, the margin).
• An objective function is chosen by the modeler and varies depending on exactly what the modeler is trying to achieve or thinks will work well (e.g. margin, posterior probabilities, sum-of-squares error, a small weight vector).
• The function usually has a theoretical foundation (e.g. risk minimization, maximum likelihood/Gaussian processes/zero-mean Gaussian noise).

What If the Data Looked Like This?
[Scatter plot: cancerous and healthy points partly intermingled.]
How could we build a suitable line that divides the data nicely? It depends...
• Is it just a few points that are small 'outliers'?
• Or is the data simply not amenable to this kind of classification?
[Three figures:]
• Linearly separable data: can make a great classifier.
• Almost linearly separable data: a few outliers; we can probably still find a 'good' line.
• Not linearly separable data: inherently, the data cannot be separated by any one line.

Not Linearly Separable Data
• If we allow the model to have more than one line (or hyperplane), then maybe we can still form a nice model.
• Much more complicated.
• This is one thing that neural networks allow us to do: combine linear discriminants together to form a single classifier (which is no longer a linear classifier).
• No time to delve further during this talk.
• Sometimes even that is not enough: for some data (e.g. one class forming a ring around the other), even with many lines it would be extremely difficult to build a good classifier. Now what?

Sometimes We Need to Transform the Data
• Data that is not linearly separable in the original coordinates may become linearly separable after a change of coordinates: for concentric classes, the distance from the center (the radius in polar coordinates) separates them.
• Candidate transformations: polar coordinates, principal-components coordinates, or a kernel transformation into a higher-dimensional space (support vector machines).
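As a concrete illustration of the polar-coordinate trick, here is a small Python sketch with two made-up concentric classes; the radii, noise level, and threshold are all assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(radius, n=100, noise=0.1):
    """Sample n 2-D points scattered around a circle of the given radius."""
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    r = radius + noise * rng.standard_normal(n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

inner, outer = ring(1.0), ring(3.0)  # two classes: no single line separates them
radius = lambda xy: np.hypot(xy[:, 0], xy[:, 1])  # polar transform (keep only r)

# In the transformed 1-D coordinate, a single threshold (a 'hyperplane')
# separates the classes perfectly.
threshold = 2.0
print((radius(inner) < threshold).all())  # True
print((radius(outer) > threshold).all())  # True
```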
Caveats
• We may need to find a subset of the data that is linearly separable (this is called feature selection).
• Feature selection is what we call, in computer science, an NP-complete problem, which means, in layman's terms, that it is believed to be computationally infeasible to solve exactly. Feature selection is an open research problem.
• There is a spate of techniques that give you approximate solutions to feature selection.
• Feature selection is mandatory in microarray expression experiments because there is so much noisy, irrelevant data.
• Also, with microarray data there is much missing data, which introduces difficulties.

Other Biological Applications
• Gene finding in DNA: the input is part of a DNA strand; the output is whether or not the nucleotide at the centre is inside a gene.
• Sequence-based gene classification: the input is a gene sequence; the output is a functional class.
• Protein secondary structure prediction: the input is a sequence of amino acids; the output is the local secondary structure.
• Protein localization in the cell: the input is an amino acid sequence; the output is the position in the cell (e.g. nucleus, membrane, etc.).
(Taken from Introduction to Support Vector Machines and Applications to Computational Biology, Jean-Philippe Vert.)

Wrap-Up
• An intuitive feel for linear discriminants.
• A widely applicable technique, for many problems in PolyomX and many other areas.
• Difficulties: missing data, feature selection.
• We have used linear discriminants for our SNP data and microarray data.
If interested in knowing more, a great book is Neural Networks for Pattern Recognition, Christopher Bishop, 1999.

Finding the Equation of the Linear Discriminant (How a Single-Layer Neural Network Might Do It)
The discriminant function is $y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$: classify one way where $y(\mathbf{x}) > 0$, the other where $y(\mathbf{x}) < 0$, with the decision boundary at $y(\mathbf{x}) = 0$.
E.g. the sum-of-squares error function (more common for regression):
$$E(\mathbf{w}) = \sum_{n=1}^{N} \big( (\mathbf{w}^T \mathbf{x}_n + w_0) - t_n \big)^2, \qquad t_n \in \{-1, 1\}.$$
Minimize the objective function:
1. Exact solution via matrix algebra, since here E is convex.
2. Iterative algorithms (gradient descent, conjugate gradient, Newton's method, etc.) for cases where E may not be convex.
In either case we are driving the gradient to zero: $\nabla E(\mathbf{w}) = \left( \partial E/\partial w_1, \partial E/\partial w_2, \ldots, \partial E/\partial w_K \right) = \mathbf{0}$.
We can regularize by adding $\|\mathbf{w}\|^2$ to E. (Both fitting routes are sketched in code below.)

Finding the Equation of the Linear Discriminant (How an SVM Would Do It)
The discriminant function is again $y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$, with decision boundary $y(\mathbf{x}) = 0$.
The distance from a point $\mathbf{x}$ to that boundary is $\frac{y(\mathbf{x})}{\|\mathbf{w}\|} = \frac{\mathbf{w}^T \mathbf{x} + w_0}{\|\mathbf{w}\|}$; the margin is this distance for the closest point.
Minimize $\|\mathbf{w}\|^2$ subject to the constraints
$$t_i(\mathbf{w}^T \mathbf{x}_i + w_0) - 1 \geq 0, \qquad i = 1 \ldots N, \quad t_i \in \{-1, 1\},$$
using Lagrange multipliers:
$$L(\mathbf{w}, w_0, \boldsymbol{\lambda}) = \|\mathbf{w}\|^2 - \sum_{i=1}^{N} \lambda_i \big[ t_i(\mathbf{w}^T \mathbf{x}_i + w_0) - 1 \big].$$
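Returning to the sum-of-squares route above, here is a minimal NumPy sketch of option 1, the exact matrix-algebra solution; the four data points and targets are made up purely for illustration:

```python
import numpy as np

# Tiny made-up training set: two 'cancerous' points (t = +1) in the upper
# right, two 'healthy' points (t = -1) in the lower left.
X = np.array([[2.1, 1.8], [1.5, 2.2], [-1.0, -1.3], [-2.2, -0.9]])
t = np.array([1.0, 1.0, -1.0, -1.0])

Xb = np.hstack([X, np.ones((len(X), 1))])  # absorb the bias w0 into w
w = np.linalg.pinv(Xb) @ t                 # closed-form least-squares minimizer of E(w)

print(np.sign(Xb @ w))  # predicted sides of the hyperplane; matches t on this toy data
```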
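And a sketch of the SVM route on the same toy data. The talk describes the optimization but names no library, so scikit-learn is an assumption here; a very large C approximates the hard-margin problem stated above:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.1, 1.8], [1.5, 2.2], [-1.0, -1.3], [-2.2, -0.9]])
t = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, t)  # large C approximates a hard margin
w, w0 = clf.coef_[0], clf.intercept_[0]

# With the constraints t_i (w^T x_i + w0) >= 1, the closest points satisfy
# equality, so the margin (point-to-hyperplane distance) is 1 / ||w||.
print("margin:", 1.0 / np.linalg.norm(w))
```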