Knowledge-based Analysis of Microarray Gene Expression Data using Support Vector Machines
Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Sugnet, Terrence S. Furey, Manuel Ares, Jr., David Haussler
Proceedings of the National Academy of Sciences, 2000

Overview
• Objective: classify genes based on functionality
• Observation: genes of similar function yield similar expression patterns in microarray hybridization experiments
• Method: use SVMs to build classifiers from microarray gene expression data

Previous Methods
• At the time of publication, most methods employed unsupervised learning
• Genes are grouped using clustering algorithms based on a distance measure
  • Hierarchical clustering
  • Self-organizing maps

DNA Microarray Data
• Each data point represents the ratio of expression levels of a particular gene in an experimental condition and a reference condition
• n genes on a single chip
• m experiments performed
• The result is an n-by-m matrix of expression-level ratios
• Each row of the matrix is the m-element expression vector for a single gene

DNA Microarray Data
Normalized logarithmic ratio
For gene X, in experiment i, define:
• $E_i$ is the expression level in the experiment
• $R_i$ is the expression level in the reference state
• $X_i$ is the normalized logarithmic ratio of $E_i$ to $R_i$; the expression vector of the gene is $X = (X_1, X_2, \ldots, X_m)$
• $X_i$ is positive when the gene is induced (turned up)
• $X_i$ is negative when the gene is repressed (turned down)

Support Vector Machines
Searches for a hyperplane that
• Maximizes the margin
• Minimizes the violation of the margin
* Edda Leopold and Jörg Kindermann

Linear Inseparability
What if data points are not linearly separable?
* Andrew W. Moore

Linear Inseparability
Map the data to a higher-dimensional space
* Andrew W. Moore

Linear Inseparability
Problems with mapping data to a higher-dimensional space
1. Overfitting
   • SVM chooses the maximum margin, and deals well with overfitting
2. High computational cost
   • SVM kernels only involve dot products between points (cheap!)

SVM Kernels
K(X, Y) is a function that calculates a measure of similarity between X and Y
• Dot product
  • $K(X, Y) = X \cdot Y$
  • Simplest kernel; gives a linear hyperplane
• Degree d polynomials
  • $K(X, Y) = (X \cdot Y + 1)^d$
• Gaussian
  • $K(X, Y) = \exp(-\|X - Y\|^2 / (2\sigma^2))$

Experimental Dataset
• Expression data from the budding yeast
  • 2467 genes (n)
  • 79 experiments (m)
  • Dataset available on the Stanford web site
• Six functional classes
  • From the Munich Information Centre for Protein Sequences Yeast Genome Database
  • Class definitions come from biochemical and genetic studies
• Training data
  • Positive labels: set of genes that have a common function
  • Negative labels: set of genes known not to be members of this functional class

Experimental Design
Compare the performance of
• SVM (with degree 1 kernel, i.e. linear)
• SVM (with degree 2 kernel)
• SVM (with degree 3 kernel)
• SVM (Gaussian)
• Parzen Windows
• Fisher's Linear Discriminant
• C4.5 Decision Trees
• MOC1 Decision Trees

Experimental Design
• Define the cost of method M: $C(M) = fp(M) + 2 \cdot fn(M)$
• False negatives are weighted higher because the number of true negatives is larger
• The cost of each method is compared to $C(N)$, the cost of classifying everything as negative
• The cost saving of method M is $S(M) = C(N) - C(M)$

Experimental Results
Cost savings by functional class (rows) and method (columns):

Class   SVM (d=1)   SVM (d=2)   SVM (d=3)   SVM (Gauss)   Parzen Windows   Fisher's LD   C4.5   MOC1
TCA     6           9           12          11            6                5             -7     -1
Resp    31          39          38          33            18               30            8      -4
Ribo    224         229         229         226           220              217           169    164
Prot    35          48          51          52            39               39            33     26
Hist    18          18          18          18            14               16            16     10
HTH     -56         -3          -1          0             -14              -14           -2     -6

• SVMs outperform the other methods
• All classifiers fail to recognize the HTH proteins
  • This is expected: members of this class are not "similarly regulated"

Consistently Misclassified Genes
20 genes are consistently misclassified by the 4 SVM kernels, in different experiments
The errors reflect differences between the expression data and class definitions based on protein structures.
Many of the false positives are known to be important for the functional class (even though they are not included as part of the class)
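
The kernels and the cost-saving measure from the slides can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: all function names are hypothetical, the unit-length scaling in `normalized_log_ratios` is an assumption (the slides only say "normalized logarithmic ratio" without giving the formula), and the counts in the usage note are made up for illustration.

```python
import math

def normalized_log_ratios(expr, ref):
    """Log ratios log(E_i / R_i) for one gene across experiments.

    ASSUMPTION: the vector is scaled to unit Euclidean length; the
    slides do not spell out the normalization.
    """
    logs = [math.log(e / r) for e, r in zip(expr, ref)]
    norm = math.sqrt(sum(v * v for v in logs))
    return [v / norm for v in logs] if norm > 0 else logs

def dot(x, y):
    """Dot product of two equal-length expression vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    """K(X, Y) = X . Y"""
    return dot(x, y)

def polynomial_kernel(x, y, d=2):
    """K(X, Y) = (X . Y + 1)^d"""
    return (dot(x, y) + 1) ** d

def gaussian_kernel(x, y, sigma=1.0):
    """K(X, Y) = exp(-|X - Y|^2 / (2 sigma^2))"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def cost(fp, fn):
    """C(M) = fp(M) + 2 * fn(M): false negatives count double."""
    return fp + 2 * fn

def cost_saving(fp, fn, n_positive):
    """S(M) = C(N) - C(M).

    C(N) is the cost of labelling every gene negative: no false
    positives, and every true class member is a false negative.
    """
    return cost(0, n_positive) - cost(fp, fn)
```

For example, with hypothetical counts of 18 false positives and 5 false negatives on a class of 17 genes, `cost_saving(18, 5, 17)` gives 34 - 28 = 6. A negative saving means the method did worse than labelling every gene negative, which is how the negative entries in the HTH row should be read.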