Supervised Learning I
BME 230

Supervised (hypothesis-driven) learning
• In clustering we only had the expression data available.
• In supervised learning, a "response", "label", or "class" is also known (e.g. group of related genes, tissue of origin, drug treatment, concentration, temperature, ...).

Supervised learning in gene expression analysis
• Use information about known biology.
• Usually predict arrays based on their expression vectors (columns of the matrix):
– samples of different cell types
– healthy patients versus sick
– drug treatments
• Find genes that are informative.
• Use the expression of the informative genes to predict new samples.

Example: predicting cancer prognosis
• Somatic chromosomal alterations frequently occur in cancer.
• Some may be neutral, while others contribute to pathogenesis.
• Goal: identify aberrations that recur in tumors.

Array comparative genomic hybridization (aCGH)
• Chin et al. (2007): 171 primary breast tumors and 41 breast cancer cell lines.

Cancer classification challenge
• Given genomic features such as copy-number variation (x), predict clinical outcomes (y).
• Classification if y is discrete:
– cancer vs. normal
– HER2 status
• Regression if y is continuous:
– genome instability index (GII)

Clustering the CNV data
[Figure: patients clustered on their CNV predictors x, annotated with a discrete response y (grade) and a continuous response y (GII).]

Classifying leukemia (Golub et al. 1999)
• Goal: target the cancer with the right drug.
• Leukemias are usually classified by morphology.
• Acute leukemia:
– known to have variable clinical outcome
– nuclear morphologies differ
– 1960s: enzyme assays (myeloperoxidase positive)
– 1970s: antibodies to lymphoid vs. myeloid markers
– 1995: specific chromosomal translocations

Classifying leukemia (Golub et al. 1999)
• 38 bone marrow samples taken at the time of diagnosis:
– 27 patients with acute lymphoblastic leukemia (ALL)
– 11 patients with acute myeloblastic leukemia (AML)
• Class discovery: can clustering "find" the distinction between the leukemias?
[Figures: neighborhood analysis and class prediction results from Golub et al. (1999).]

Supervised learning
• Train a model with labelled examples x1, ..., xn and labels y1, ..., yn.
• Find a function h(x) → y, so that the class y can be predicted for new observations.

Copy-number variation data
• A matrix of genes (rows) by patient genomic samples (columns), for example:

            sample1  sample2  sample3  sample4  sample5  …
            Normal   Normal   Normal   Cancer   Cancer
   gene 1    0.46     0.30     0.80     1.51     0.90   …
   gene 2   -0.10     0.49     0.24     0.06     0.46   …
   gene 3    0.15     0.74     0.04     0.10     0.20   …
   gene 4   -0.45    -1.03    -0.79    -0.56    -0.32   …
   gene 5   -0.06     1.06     1.35     1.09    -1.09   …

• A low value indicates fewer copies of gene i in sample j; the class labels (Normal/Cancer) are the response y, the copy-number values are the predictors x.

Tumor classification with genomic data
Three main types of statistical problems are associated with high-throughput genomic data:
• identification of "marker" genes that characterize the different tumor classes (feature or variable selection);
• identification of new or unknown tumor classes from gene expression profiles (unsupervised learning: clustering);
• classification of samples into known classes (supervised learning: classification).

Classification
• The same matrix, with one additional unlabelled sample: X_new = (0.34, 0.43, -0.23, -0.91, 1.23), class Y_new unknown.
• Each object (e.g. each array, or column) is associated with a class label (response) Y ∈ {1, 2, …, K} and a feature vector of G predictor variables X = (X1, …, XG).
• Aim: predict Y_new from X_new.
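As a concrete illustration of this setup, here is a minimal sketch that fits a classifier h on the toy five-sample matrix above and predicts Y_new for X_new. It assumes numpy and scikit-learn are available; LogisticRegression is only a stand-in for any classifier h, and the numbers are the illustrative values from the slide, arranged here with samples as rows.

```python
# Minimal sketch of the h(x) -> y workflow (assumes numpy + scikit-learn;
# LogisticRegression is a stand-in for any classifier h).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows = samples, columns = genes 1..5 (the slide draws the transpose, samples as columns).
X = np.array([
    [ 0.46, -0.10,  0.15, -0.45, -0.06],   # sample1, Normal
    [ 0.30,  0.49,  0.74, -1.03,  1.06],   # sample2, Normal
    [ 0.80,  0.24,  0.04, -0.79,  1.35],   # sample3, Normal
    [ 1.51,  0.06,  0.10, -0.56,  1.09],   # sample4, Cancer
    [ 0.90,  0.46,  0.20, -0.32, -1.09],   # sample5, Cancer
])
y = np.array(["Normal", "Normal", "Normal", "Cancer", "Cancer"])

h = LogisticRegression().fit(X, y)              # learn h from the labelled data

X_new = np.array([[0.34, 0.43, -0.23, -0.91, 1.23]])
print("predicted Y_new:", h.predict(X_new)[0])  # predict the class of the new sample
```

Any of the classifiers covered in the rest of the lecture (KNN, LDA, SVM) could be dropped in for LogisticRegression with the same fit/predict pattern.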
Supervised learning methods
• Neighbor-based methods:
– k-nearest neighbors (KNN)
– Parzen windows and kernel density estimation
• Discriminating hyperplanes:
– linear discriminants (LDA/QDA/PDA)
– support vector machines (SVMs)
– neural networks and perceptrons (ANNs)
• Decision trees (CART)
• Aggregating classifiers

Neighbor-based methods (guilt by association)
• The function of a gene should be similar to the functions of its neighbors.
• Neighbors are found in the predictor space X.
• The neighbors vote for the function of the gene.

Nearest neighbor classification
• Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation).
• The k-nearest neighbor rule (Fix and Hodges, 1951) classifies an observation X as follows:
– find the k closest observations in the training data;
– predict the class by majority vote, i.e. choose the class that is most common among those k neighbors.
• k is a parameter whose value is determined by minimizing the cross-validation error (discussed later).
• E. Fix and J. Hodges. Discriminatory analysis. Nonparametric discrimination: Consistency properties. Tech. Report 4, USAF School of Aviation Medicine, Randolph Field, Texas, 1951.

Neighbor-based methods: worked example
• A "new", unclassified gene with an expression vector: does it encode a protein involved in degradation?
[Figure: the new gene's expression vector next to genes with known labels (1 = degradation, 0 = not degradation).]
• Step 1: find its closest neighbors, ranked from most similar to least similar (similarity scores 1, 0.8, 0.7, 0.65, 0.6, 0.6, 0.6, 0.54, 0.3, 0.1, 0.05, 0.03, 0.01).
• Step 2: let the closest neighbors vote on the function by counting the number of 1's versus the number of 0's.

k-nearest neighbors
• The k closest neighbors get to vote, no matter how far away they are (e.g. k = 3).
• In the worked example with k = 5, four of the five nearest neighbors are labelled degradation, so 4/5 of the votes say degradation.

Parzen windows
• All neighbors within distance d get to vote, no matter how many there are.
• In the worked example, taking all neighbors with similarity > 0.1 gives nine voters, and 6/9 of the votes say degradation.

KNN for missing data imputation
• Microarrays have *tons* of missing data, and some methods won't work with NAs. What can you do?
• Use k-nearest neighbors to fill in the missing values (Troyanskaya et al. 2002), as sketched below.
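The sketch below shows both uses of KNN mentioned above, assuming numpy and scikit-learn; the expression matrix is randomly generated stand-in data rather than a real compendium, and KNNImputer plays the role of the Troyanskaya-style imputation.

```python
# Minimal KNN sketch (assumes numpy + scikit-learn; data are synthetic stand-ins).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# 40 genes x 10 arrays of synthetic expression values; the first 20 genes are
# labelled 1 ("degradation"), the rest 0, and the positive class gets a shared shift.
expr = rng.normal(size=(40, 10))
expr[:20] += 1.0
labels = np.array([1] * 20 + [0] * 20)

# k-nearest-neighbor rule: the k = 5 closest genes vote on the unknown function
# (Euclidean distance here; one minus correlation would also be reasonable).
knn = KNeighborsClassifier(n_neighbors=5).fit(expr, labels)
new_gene = rng.normal(size=(1, 10)) + 1.0
print("predicted label for the new gene:", knn.predict(new_gene)[0])

# KNN for missing-data imputation: each NA is replaced using the k nearest
# rows (genes) that do have a value for that column.
expr_with_nas = expr.copy()
expr_with_nas[rng.random(expr.shape) < 0.1] = np.nan   # knock out ~10% of values
expr_filled = KNNImputer(n_neighbors=5).fit_transform(expr_with_nas)
print("any NAs left after imputation?", bool(np.isnan(expr_filled).any()))
```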
Hyperplane discriminants
• Search for a single partition that places all positives on one side and all negatives on the other side.
[Figures: a discriminant line ("decision boundary", separating hyperplane) drawn through cancer vs. non-cancer samples plotted against a predictor such as PCNA expression; some training points and, on held-out data, some test points fall on the wrong side (training and testing misclassifications).]
• When Y is continuous (e.g. survival), the same picture becomes a hyperplane fit to Y against the predictors.
• With a set of predictors X (e.g. X1 = PCNA expression, X2 = a cell-surface marker), the separating hyperplane lives in the predictor space (X1, X2).

Classify new cases with selected features
• f(Xi) → y: the selected features vote,
f(Xi) = ∑j wj Xij = Xi β.
• We want a β that minimizes error. On the training data the least-squares error is
∑i (f(Xi) - Yi)² = (Xβ - y)ᵀ(Xβ - y).
• β* = argmin_β (Xβ - y)ᵀ(Xβ - y)
• β* = (XᵀX)⁻¹Xᵀy

Fisher linear discriminant analysis
• In a two-class classification problem we are given n samples in a d-dimensional feature space, n1 in class 1 and n2 in class 2.
• Goal: find a vector w and project the n samples onto the axis y = w'x so that the projected samples are well separated.
[Figure: projecting onto w1 gives poor separation; projecting onto w2 gives good separation.]

Fisher linear discriminant analysis
• The sample mean vector for the ith class is mi and the sample covariance matrix for the ith class is Si.
• The between-class scatter matrix is SB = (m1 - m2)(m1 - m2)'.
• The within-class scatter matrix is Sw = S1 + S2.
• The sample mean of the projected points in the ith class is
m̃i = (1/ni) ∑ x∈class i w'x = w'mi.
• The scatter of the projected points in the ith class is
s̃i² = ∑ x∈class i (w'x - w'mi)² = w'Si w.

Fisher linear discriminant analysis
• Fisher's linear discriminant chooses the w that maximizes
|m̃1 - m̃2|² / (s̃1² + s̃2²) = (w'SB w) / (w'Sw w),
i.e. the between-class distance should be as large as possible while the within-class scatter is as small as possible.

Maximum likelihood discriminant rule
• A maximum likelihood (ML) classifier chooses the class that makes the observations most probable.
• The ML discriminant rule predicts the class of an observation x as the class that gives the largest likelihood to x, i.e.
h(x) = argmax_k P(x | y = k).

Gaussian ML discriminant rules
• Assume the conditional density for each class is multivariate Gaussian (normal), P(X | Y = k) ~ N(μk, Σk).
• Then the ML discriminant rule is
h(x) = argmin_k { (x - μk) Σk⁻¹ (x - μk)' + log |Σk| }.
• In general this is a quadratic rule (quadratic discriminant analysis, or QDA in R).
• In practice the population mean vectors μk and covariance matrices Σk are estimated from the training set by the sample means x̄k and sample covariances Sk, giving
h(x) = argmin_k { (x - x̄k) Sk⁻¹ (x - x̄k)' + log |Sk| }.

Gaussian ML discriminant rules
• When all class densities share the same covariance matrix, Σk = Σ, the discriminant rule is linear (linear discriminant analysis, or LDA; FLDA for K = 2):
h(x) = argmin_k (x - μk) Σ⁻¹ (x - μk)'.
• In practice μk and the common Σ are estimated from the learning set L by x̄k and the pooled covariance S = ∑k (nk - 1) Sk / (n - K), giving
h(x) = argmin_k (x - x̄k) S⁻¹ (x - x̄k)'.

Gaussian ML discriminant rules
• When the class densities have diagonal covariance matrices Δk = diag(σk1², …, σkG²), the rule is a sum of quadratic contributions from each variable (diagonal quadratic discriminant analysis, DQDA):
h(x) = argmin_k ∑ i=1..G [ (xi - μki)² / σki² + log σki² ].
• When all class densities share the same diagonal covariance matrix Δ = diag(σ1², …, σG²), the rule is again linear (diagonal linear discriminant analysis, or DLDA in R):
h(x) = argmin_k ∑ i=1..G (xi - μki)² / σi².
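Below is a minimal hand-rolled sketch of the DLDA rule above, assuming only numpy; the ALL/AML-style data are synthetic, and the shared per-gene variance is estimated as the average of the per-class variances (which equals the pooled estimate when the classes have equal size).

```python
# Hand-rolled DLDA sketch (numpy only; synthetic two-class data, rows = samples).
import numpy as np

def dlda_fit(X, y):
    """Estimate per-class means and one shared diagonal (per-feature) variance."""
    classes = np.unique(y)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    var = np.mean([X[y == k].var(axis=0, ddof=1) for k in classes], axis=0)
    return classes, means, var

def dlda_predict(x, classes, means, var):
    """h(x) = argmin_k sum_i (x_i - mean_ki)^2 / var_i  (the DLDA rule)."""
    scores = ((x - means) ** 2 / var).sum(axis=1)
    return classes[np.argmin(scores)]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 50)),    # 20 "ALL"-like samples
               rng.normal(1.0, 1.0, size=(20, 50))])   # 20 "AML"-like samples
y = np.array(["ALL"] * 20 + ["AML"] * 20)

classes, means, var = dlda_fit(X, y)
x_new = rng.normal(1.0, 1.0, size=50)                  # drawn from the "AML" regime
print("predicted class:", dlda_predict(x_new, classes, means, var))
```

For the full (non-diagonal) rules, scikit-learn's LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis implement LDA and QDA directly.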
Application of an ML discriminant rule
• Weighted gene voting method (Golub et al. 1999):
– one of the first applications of an ML discriminant rule to gene expression data;
– the method turns out to be a minor variant of the sample diagonal linear discriminant rule.
• Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. Oct 15;286(5439):531-537.

Support vector machines
• Find the separating hyperplane with maximum margin.
• In practice the classes can overlap, so an error term measures how far onto the wrong side a point may fall. (Hastie, Tibshirani, Friedman 2001)

Support vector machines
• To find the hyperplane, maximize the Lagrangian
LD = ∑i αi - (1/2) ∑i ∑j αi αj yi yj (xi·xj), subject to 0 ≤ αi ≤ C and ∑i αi yi = 0,
• which gives the solution β = ∑i αi yi xi.
• Any point with αi > 0 is part of the support (a support vector).

Kernels in SVMs
• The dot product xi·xj measures how close points i and j are.
• It can be replaced with any appropriate distance (similarity) measure; such a measure is called a kernel.
• Changing the kernel effectively changes the space in which we search for the separating hyperplane. (Hastie et al. 2001)
[Figure: SVM decision boundaries from Hastie et al. (2001), including one obtained with a 4th-degree polynomial kernel.]

SVMs in microarray analysis
• The first application tried to predict gene categories from the Eisen yeast compendium.
• SVMs were trained to distinguish: ribosome vs. not, TCA cycle vs. not, respiration vs. not, proteasome vs. not, histones vs. not, and helix-turn-helix vs. not. (Brown et al. 2000)
[Figures: SVM predictions for the TCA, ribosome, and helix-turn-helix classes (Brown et al. 2000).]

"False" predictions
• False positives: genes newly predicted to be in a functional group, though thought to belong to another, may in fact be coregulated with the new group.
• False negatives: genes thought to belong to a functional group may not be coregulated with that group.
• Inspecting the "errors" often leads to the most interesting findings!

Looking closely at "false" predictions with the SVM
• RPN1: a regulatory-particle subunit that interacts with the DNA-damage protein Rad23 (Elsasser S et al. 2002); it shouldn't have been in the list. (Brown et al. 2000)
• Example of a "mis-annotated" gene: EGD1's expression profile closely resembles other ribosomal subunits. (Brown et al. 2000)
• The SVM predictions also suggested new functional annotations. (Brown et al. 2000)
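To make this concrete, the following sketch sets up a Brown et al.-style one-vs-rest task ("ribosome vs. not") on synthetic stand-in data, assuming scikit-learn; SVC's C parameter controls how much overlap (slack) is tolerated, and kernel="poly", degree=4 gives the 4th-degree polynomial kernel mentioned above.

```python
# SVM sketch for a one-vs-rest functional-class task (assumes numpy + scikit-learn;
# the expression matrix is a synthetic stand-in for the yeast compendium).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 80))    # 200 genes x 80 conditions (toy values)
y = np.zeros(200, dtype=int)
y[:40] = 1                        # pretend the first 40 genes are "ribosome"
X[:40] += 0.75                    # ...and give them a shared expression signal

# Maximum-margin classifiers; C controls the penalty for points on the wrong side.
classifiers = {
    "linear kernel": SVC(kernel="linear", C=1.0),
    "4th-degree polynomial kernel": SVC(kernel="poly", degree=4, C=1.0),
}
for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: cross-validated accuracy {acc:.2f}")

# After fitting, the support vectors are the training points with alpha > 0.
fitted = classifiers["linear kernel"].fit(X, y)
print("support vectors per class:", fitted.n_support_)
```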