Supervised Learning I
BME 230

Supervised (hypothesis-driven) learning
• In clustering we only had the expression data available.
• In supervised learning, a "response", "label", or "class" is also known (e.g. group of related genes, tissue of origin, drug treatment, concentration, temperature, ...).

Supervised learning in gene expression analysis
• Use information about known biology.
• Usually predict arrays based on their expression vectors (columns of the matrix):
– samples of different cell types
– healthy patients versus sick
– drug treatments
• Find genes that are informative.
• Use the expression of the informative genes to predict new samples.

Example: predicting cancer prognosis
• Somatic chromosomal alterations frequently occur in cancer.
• Some may be neutral, while others contribute to pathogenesis.
• Goal: identify aberrations that recur in tumors.

Array comparative genomic hybridization (aCGH)
• Chin et al. (2007): 171 primary breast tumors and 41 breast cancer cell lines.

Cancer classification challenge
• Given genomic features such as copy-number variation (x), predict clinical outcomes (y).
• Classification if y is discrete:
– cancer vs. normal
– HER2 status
• Regression if y is continuous:
– genome instability index (GII)

Clustering the CNV data
[Figure: patients clustered on their CNV predictors x, annotated with a discrete response y (grade) and a continuous response y (GII).]

Classifying leukemia (Golub et al. 1999)
• Goal: target the cancer with the right drug.
• Leukemias are usually classified by morphology.
• Acute leukemia:
– known to have variable clinical outcome
– nuclear morphologies differ
– 1960s: enzyme assays (myeloperoxidase positive)
– 1970s: antibodies to lymphoid vs. myeloid markers
– 1995: specific chromosomal translocations

Classifying leukemia (Golub et al. 1999)
• 38 bone marrow samples taken at the time of diagnosis:
– 27 patients with acute lymphoblastic leukemia (ALL)
– 11 patients with acute myeloblastic leukemia (AML)
• Class discovery: can clustering "find" the distinction between the leukemias?
[Figures: neighborhood analysis and class prediction results from Golub et al. (1999).]

Supervised learning
• Train a model with labelled examples x1, ..., xn and labels y1, ..., yn.
• Find a function h(x) → y, so that the class y can be predicted for new observations.

Copy-number variation data
• A matrix of genes (rows) by patient genomic samples (columns), for example:

            sample1  sample2  sample3  sample4  sample5  …
            Normal   Normal   Normal   Cancer   Cancer
   gene 1    0.46     0.30     0.80     1.51     0.90   …
   gene 2   -0.10     0.49     0.24     0.06     0.46   …
   gene 3    0.15     0.74     0.04     0.10     0.20   …
   gene 4   -0.45    -1.03    -0.79    -0.56    -0.32   …
   gene 5   -0.06     1.06     1.35     1.09    -1.09   …

• A low value indicates fewer copies of gene i in sample j; the class labels (Normal/Cancer) are the response y, the copy-number values are the predictors x.

Tumor classification with genomic data
Three main types of statistical problems are associated with high-throughput genomic data:
• identification of "marker" genes that characterize the different tumor classes (feature or variable selection);
• identification of new or unknown tumor classes from gene expression profiles (unsupervised learning: clustering);
• classification of samples into known classes (supervised learning: classification).

Classification
• The same matrix, with one additional unlabelled sample: X_new = (0.34, 0.43, -0.23, -0.91, 1.23), class Y_new unknown.
• Each object (e.g. each array, or column) is associated with a class label (response) Y ∈ {1, 2, …, K} and a feature vector of G predictor variables X = (X1, …, XG).
• Aim: predict Y_new from X_new.
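As a concrete illustration of this setup, here is a minimal sketch that fits a classifier h on the toy five-sample matrix above and predicts Y_new for X_new. It assumes numpy and scikit-learn are available; LogisticRegression is only a stand-in for any classifier h, and the numbers are the illustrative values from the slide, arranged here with samples as rows.

```python
# Minimal sketch of the h(x) -> y workflow (assumes numpy + scikit-learn;
# LogisticRegression is a stand-in for any classifier h).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows = samples, columns = genes 1..5 (the slide draws the transpose, samples as columns).
X = np.array([
    [ 0.46, -0.10,  0.15, -0.45, -0.06],   # sample1, Normal
    [ 0.30,  0.49,  0.74, -1.03,  1.06],   # sample2, Normal
    [ 0.80,  0.24,  0.04, -0.79,  1.35],   # sample3, Normal
    [ 1.51,  0.06,  0.10, -0.56,  1.09],   # sample4, Cancer
    [ 0.90,  0.46,  0.20, -0.32, -1.09],   # sample5, Cancer
])
y = np.array(["Normal", "Normal", "Normal", "Cancer", "Cancer"])

h = LogisticRegression().fit(X, y)              # learn h from the labelled data

X_new = np.array([[0.34, 0.43, -0.23, -0.91, 1.23]])
print("predicted Y_new:", h.predict(X_new)[0])  # predict the class of the new sample
```

Any of the classifiers covered in the rest of the lecture (KNN, LDA, SVM) could be dropped in for LogisticRegression with the same fit/predict pattern.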
Supervised learning methods
• Neighbor-based methods:
– k-nearest neighbors (KNN)
– Parzen windows and kernel density estimation
• Discriminating hyperplanes:
– linear discriminants (LDA/QDA/PDA)
– support vector machines (SVMs)
– neural networks and perceptrons (ANNs)
• Decision trees (CART)
• Aggregating classifiers

Neighbor-based methods (guilt by association)
• The function of a gene should be similar to the functions of its neighbors.
• Neighbors are found in the predictor space X.
• The neighbors vote for the function of the gene.

Nearest neighbor classification
• Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation).
• The k-nearest neighbor rule (Fix and Hodges, 1951) classifies an observation X as follows:
– find the k closest observations in the training data;
– predict the class by majority vote, i.e. choose the class that is most common among those k neighbors.
• k is a parameter whose value is determined by minimizing the cross-validation error (discussed later).
• E. Fix and J. Hodges. Discriminatory analysis. Nonparametric discrimination: Consistency properties. Tech. Report 4, USAF School of Aviation Medicine, Randolph Field, Texas, 1951.

Neighbor-based methods: worked example
• A "new", unclassified gene with an expression vector: does it encode a protein involved in degradation?
[Figure: the new gene's expression vector next to genes with known labels (1 = degradation, 0 = not degradation).]
• Step 1: find its closest neighbors, ranked from most similar to least similar (similarity scores 1, 0.8, 0.7, 0.65, 0.6, 0.6, 0.6, 0.54, 0.3, 0.1, 0.05, 0.03, 0.01).
• Step 2: let the closest neighbors vote on the function by counting the number of 1's versus the number of 0's.

k-nearest neighbors
• The k closest neighbors get to vote, no matter how far away they are (e.g. k = 3).
• In the worked example with k = 5, four of the five nearest neighbors are labelled degradation, so 4/5 of the votes say degradation.

Parzen windows
• All neighbors within distance d get to vote, no matter how many there are.
• In the worked example, taking all neighbors with similarity > 0.1 gives nine voters, and 6/9 of the votes say degradation.

KNN for missing data imputation
• Microarrays have *tons* of missing data, and some methods won't work with NAs. What can you do?
• Use k-nearest neighbors to fill in the missing values (Troyanskaya et al. 2002), as sketched below.
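The sketch below shows both uses of KNN mentioned above, assuming numpy and scikit-learn; the expression matrix is randomly generated stand-in data rather than a real compendium, and KNNImputer plays the role of the Troyanskaya-style imputation.

```python
# Minimal KNN sketch (assumes numpy + scikit-learn; data are synthetic stand-ins).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# 40 genes x 10 arrays of synthetic expression values; the first 20 genes are
# labelled 1 ("degradation"), the rest 0, and the positive class gets a shared shift.
expr = rng.normal(size=(40, 10))
expr[:20] += 1.0
labels = np.array([1] * 20 + [0] * 20)

# k-nearest-neighbor rule: the k = 5 closest genes vote on the unknown function
# (Euclidean distance here; one minus correlation would also be reasonable).
knn = KNeighborsClassifier(n_neighbors=5).fit(expr, labels)
new_gene = rng.normal(size=(1, 10)) + 1.0
print("predicted label for the new gene:", knn.predict(new_gene)[0])

# KNN for missing-data imputation: each NA is replaced using the k nearest
# rows (genes) that do have a value for that column.
expr_with_nas = expr.copy()
expr_with_nas[rng.random(expr.shape) < 0.1] = np.nan   # knock out ~10% of values
expr_filled = KNNImputer(n_neighbors=5).fit_transform(expr_with_nas)
print("any NAs left after imputation?", bool(np.isnan(expr_filled).any()))
```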
Hyperplane discriminants
• Search for a single partition that places all positives on one side and all negatives on the other side.
[Figures: a discriminant line ("decision boundary", separating hyperplane) drawn through cancer vs. non-cancer samples plotted against a predictor such as PCNA expression; some training points and, on held-out data, some test points fall on the wrong side (training and testing misclassifications).]
• When Y is continuous (e.g. survival), the same picture becomes a hyperplane fit to Y against the predictors.
• With a set of predictors X (e.g. X1 = PCNA expression, X2 = a cell-surface marker), the separating hyperplane lives in the predictor space (X1, X2).

Classify new cases with selected features
• f(Xi) → y: the selected features vote,
f(Xi) = ∑j wj Xij = Xi β.
• We want a β that minimizes error. On the training data the least-squares error is
∑i (f(Xi) - Yi)² = (Xβ - y)ᵀ(Xβ - y).
• β* = argmin_β (Xβ - y)ᵀ(Xβ - y)
• β* = (XᵀX)⁻¹Xᵀy

Fisher linear discriminant analysis
• In a two-class classification problem we are given n samples in a d-dimensional feature space, n1 in class 1 and n2 in class 2.
• Goal: find a vector w and project the n samples onto the axis y = w'x so that the projected samples are well separated.
[Figure: projecting onto w1 gives poor separation; projecting onto w2 gives good separation.]

Fisher linear discriminant analysis
• The sample mean vector for the ith class is mi and the sample covariance matrix for the ith class is Si.
• The between-class scatter matrix is SB = (m1 - m2)(m1 - m2)'.
• The within-class scatter matrix is Sw = S1 + S2.
• The sample mean of the projected points in the ith class is
m̃i = (1/ni) ∑ x∈class i w'x = w'mi.
• The scatter of the projected points in the ith class is
s̃i² = ∑ x∈class i (w'x - w'mi)² = w'Si w.

Fisher linear discriminant analysis
• Fisher's linear discriminant chooses the w that maximizes
|m̃1 - m̃2|² / (s̃1² + s̃2²) = (w'SB w) / (w'Sw w),
i.e. the between-class distance should be as large as possible while the within-class scatter is as small as possible.

Maximum likelihood discriminant rule
• A maximum likelihood (ML) classifier chooses the class that makes the observations most probable.
• The ML discriminant rule predicts the class of an observation x as the class that gives the largest likelihood to x, i.e.
h(x) = argmax_k P(x | y = k).

Gaussian ML discriminant rules
• Assume the conditional density for each class is multivariate Gaussian (normal), P(X | Y = k) ~ N(μk, Σk).
• Then the ML discriminant rule is
h(x) = argmin_k { (x - μk) Σk⁻¹ (x - μk)' + log |Σk| }.
• In general this is a quadratic rule (quadratic discriminant analysis, or QDA in R).
• In practice the population mean vectors μk and covariance matrices Σk are estimated from the training set by the sample means x̄k and sample covariances Sk, giving
h(x) = argmin_k { (x - x̄k) Sk⁻¹ (x - x̄k)' + log |Sk| }.

Gaussian ML discriminant rules
• When all class densities share the same covariance matrix, Σk = Σ, the discriminant rule is linear (linear discriminant analysis, or LDA; FLDA for K = 2):
h(x) = argmin_k (x - μk) Σ⁻¹ (x - μk)'.
• In practice μk and the common Σ are estimated from the learning set L by x̄k and the pooled covariance S = ∑k (nk - 1) Sk / (n - K), giving
h(x) = argmin_k (x - x̄k) S⁻¹ (x - x̄k)'.

Gaussian ML discriminant rules
• When the class densities have diagonal covariance matrices Δk = diag(σk1², …, σkG²), the rule is a sum of quadratic contributions from each variable (diagonal quadratic discriminant analysis, DQDA):
h(x) = argmin_k ∑ i=1..G [ (xi - μki)² / σki² + log σki² ].
• When all class densities share the same diagonal covariance matrix Δ = diag(σ1², …, σG²), the rule is again linear (diagonal linear discriminant analysis, or DLDA in R):
h(x) = argmin_k ∑ i=1..G (xi - μki)² / σi².
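Below is a minimal hand-rolled sketch of the DLDA rule above, assuming only numpy; the ALL/AML-style data are synthetic, and the shared per-gene variance is estimated as the average of the per-class variances (which equals the pooled estimate when the classes have equal size).

```python
# Hand-rolled DLDA sketch (numpy only; synthetic two-class data, rows = samples).
import numpy as np

def dlda_fit(X, y):
    """Estimate per-class means and one shared diagonal (per-feature) variance."""
    classes = np.unique(y)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    var = np.mean([X[y == k].var(axis=0, ddof=1) for k in classes], axis=0)
    return classes, means, var

def dlda_predict(x, classes, means, var):
    """h(x) = argmin_k sum_i (x_i - mean_ki)^2 / var_i  (the DLDA rule)."""
    scores = ((x - means) ** 2 / var).sum(axis=1)
    return classes[np.argmin(scores)]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 50)),    # 20 "ALL"-like samples
               rng.normal(1.0, 1.0, size=(20, 50))])   # 20 "AML"-like samples
y = np.array(["ALL"] * 20 + ["AML"] * 20)

classes, means, var = dlda_fit(X, y)
x_new = rng.normal(1.0, 1.0, size=50)                  # drawn from the "AML" regime
print("predicted class:", dlda_predict(x_new, classes, means, var))
```

For the full (non-diagonal) rules, scikit-learn's LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis implement LDA and QDA directly.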
Application of an ML discriminant rule
• Weighted gene voting method (Golub et al. 1999):
– one of the first applications of an ML discriminant rule to gene expression data;
– the method turns out to be a minor variant of the sample diagonal linear discriminant rule.
• Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. Oct 15;286(5439):531-537.

Support vector machines
• Find the separating hyperplane with maximum margin.
• In practice the classes can overlap, so an error term measures how far onto the wrong side a point may fall. (Hastie, Tibshirani, Friedman 2001)

Support vector machines
• To find the hyperplane, maximize the Lagrangian
LD = ∑i αi - (1/2) ∑i ∑j αi αj yi yj (xi·xj), subject to 0 ≤ αi ≤ C and ∑i αi yi = 0,
• which gives the solution β = ∑i αi yi xi.
• Any point with αi > 0 is part of the support (a support vector).

Kernels in SVMs
• The dot product xi·xj measures how close points i and j are.
• It can be replaced with any appropriate distance (similarity) measure; such a measure is called a kernel.
• Changing the kernel effectively changes the space in which we search for the separating hyperplane. (Hastie et al. 2001)
[Figure: SVM decision boundaries from Hastie et al. (2001), including one obtained with a 4th-degree polynomial kernel.]

SVMs in microarray analysis
• The first application tried to predict gene categories from the Eisen yeast compendium.
• SVMs were trained to distinguish: ribosome vs. not, TCA cycle vs. not, respiration vs. not, proteasome vs. not, histones vs. not, and helix-turn-helix vs. not. (Brown et al. 2000)
[Figures: SVM predictions for the TCA, ribosome, and helix-turn-helix classes (Brown et al. 2000).]

"False" predictions
• False positives: genes newly predicted to be in a functional group, though thought to belong to another, may in fact be coregulated with the new group.
• False negatives: genes thought to belong to a functional group may not be coregulated with that group.
• Inspecting the "errors" often leads to the most interesting findings!

Looking closely at "false" predictions with the SVM
• RPN1: a regulatory-particle subunit that interacts with the DNA-damage protein Rad23 (Elsasser S et al. 2002); it shouldn't have been in the list. (Brown et al. 2000)
• Example of a "mis-annotated" gene: EGD1's expression profile closely resembles other ribosomal subunits. (Brown et al. 2000)
• The SVM predictions also suggested new functional annotations. (Brown et al. 2000)
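To make this concrete, the following sketch sets up a Brown et al.-style one-vs-rest task ("ribosome vs. not") on synthetic stand-in data, assuming scikit-learn; SVC's C parameter controls how much overlap (slack) is tolerated, and kernel="poly", degree=4 gives the 4th-degree polynomial kernel mentioned above.

```python
# SVM sketch for a one-vs-rest functional-class task (assumes numpy + scikit-learn;
# the expression matrix is a synthetic stand-in for the yeast compendium).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 80))    # 200 genes x 80 conditions (toy values)
y = np.zeros(200, dtype=int)
y[:40] = 1                        # pretend the first 40 genes are "ribosome"
X[:40] += 0.75                    # ...and give them a shared expression signal

# Maximum-margin classifiers; C controls the penalty for points on the wrong side.
classifiers = {
    "linear kernel": SVC(kernel="linear", C=1.0),
    "4th-degree polynomial kernel": SVC(kernel="poly", degree=4, C=1.0),
}
for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: cross-validated accuracy {acc:.2f}")

# After fitting, the support vectors are the training points with alpha > 0.
fitted = classifiers["linear kernel"].fit(X, y)
print("support vectors per class:", fitted.n_support_)
```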