Part B - Bioinformatics
Pictorial Demonstration
Rescale features to minimize the LOO bound R²/M². (Figure: two plots in the (x1, x2) feature plane; before rescaling, the radius-margin ratio R²/M² > 1; after rescaling, R²/M² = 1, i.e., M = R.)

SVM Functional
To the SVM classifier we add extra scaling parameters σ for feature selection; the parameters α, b are computed by maximizing the standard SVM dual functional on the rescaled inputs σ ∗ x, which is equivalent to maximizing the margin.

Radius Margin Bound / Jaakkola-Haussler Bound / Span Bound / The Algorithm / Computing Gradients
(Slides with the corresponding bound formulas and gradient computations.)

Toy Data
- Linear problem with 6 relevant dimensions out of 202.
- Nonlinear problem with 2 relevant dimensions out of 52.

Face Detection
On the CMU test set consisting of 479 faces and 57,000,000 non-faces, we compare ROC curves obtained for different numbers of selected features. We see that using more than 60 features does not help.

Molecular Classification of Cancer

Dataset                        Total Samples   Class 0        Class 1
Leukemia Morphology (train)    38              27 ALL         11 AML
Leukemia Morphology (test)     34              20 ALL         14 AML
Leukemia Lineage (ALL)         23              15 B-Cell      8 T-Cell
Leukemia Outcome (AML)         15              8 Low risk     7 High risk
Lymphoma Morphology            77              19 FSC         58 DLCL
Lymphoma Outcome               58              20 Low risk    14 High risk
Brain Morphology               41              14 Glioma      27 MD
Brain Outcome                  50              38 Low risk    12 High risk

Morphology Classification

Dataset                                 Algorithm   Total Samples   Total errors   Class 1 errors   Class 0 errors   Number of Genes
Leukemia Morphology (test) AML vs ALL   SVM         35              0/35           0/21             0/14             40
                                        WV          35              2/35           1/21             1/14             50
                                        k-NN        35              3/35           1/21             2/14             10
Leukemia Lineage (ALL) B vs T           SVM         23              0/23           0/15             0/8              10
                                        WV          23              0/23           0/15             0/8              9
                                        k-NN        23              0/23           0/15             0/8              10
Lymphoma FS vs DLCL                     SVM         77              4/77           2/32             2/35             200
                                        WV          77              6/77           1/32             5/35             30
                                        k-NN        77              3/77           1/32             2/35             250
Brain MD vs Glioma                      SVM         41              1/41           1/27             0/14             100
                                        WV          41              1/41           1/27             0/14             3
                                        k-NN        41              0/41           0/27             0/14             5

Outcome Classification

Dataset                          Algorithm   Total Samples   Total errors   Class 1 errors   Class 0 errors   Number of Genes
Lymphoma LBC treatment outcome   SVM         58              13/58          3/32             10/26            100
                                 WV          58              15/58          5/32             10/26            12
                                 k-NN        58              15/58          8/32             7/26             15
Brain MD treatment outcome       SVM         50              7/50           6/12             1/38             50
                                 WV          50              13/50          6/12             7/38             6
                                 k-NN        50              10/50          6/12             4/38             5

Outcome Classification (continued)
Error rates ignore temporal information such as when a patient dies. Survival analysis takes temporal information into account. The Kaplan-Meier survival plots and statistics for the above predictions show significance. (Figure: Kaplan-Meier survival curves for Lymphoma, p-val = 0.0015, and Medulloblastoma, p-val = 0.00039.)

Part 4 Clustering Algorithms: Hierarchical Clustering

Hierarchical clustering
Step 1: Transform the genes × experiments matrix into a genes × genes distance matrix.

          Gene A   Gene B   Gene C
Gene A    0        ?        ?
Gene B             0        ?
Gene C                      0

Step 2: Cluster genes based on the distance matrix and draw a dendrogram until a single node remains.

Hierarchical clustering (continued)
To transform the genes × experiments matrix into a genes × genes matrix, use a gene similarity metric (Eisen et al. 1998 PNAS 95:14863-14868):

S(X, Y) = (1/N) Σ_i [ (X_i − X_offset) / Φ_X ] · [ (Y_i − Y_offset) / Φ_Y ],   where Φ_G = sqrt( (1/N) Σ_i (G_i − G_offset)² )

This is exactly the same as Pearson's correlation except that G_offset replaces the mean. Here G_i is the (log-transformed) primary data for gene G in condition i, X and Y are any two genes observed over a series of N conditions, and G_offset is set to 0, corresponding to a fluorescence ratio of 1.0.

Hierarchical clustering (continued)
Pearson's correlation example. What if genome expression is clustered based on negative correlation?
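To make the two steps concrete, here is a minimal Python sketch (numpy/scipy) of the Eisen-style similarity followed by agglomeration. The 3 × 4 toy matrix and the choice of average linkage are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def eisen_similarity(x, y, offset=0.0):
    """Eisen et al. (1998) similarity: Pearson correlation with the
    mean replaced by a fixed offset (0 for log-ratio data)."""
    phi_x = np.sqrt(np.mean((x - offset) ** 2))
    phi_y = np.sqrt(np.mean((y - offset) ** 2))
    return np.mean((x - offset) * (y - offset)) / (phi_x * phi_y)

# Toy expression matrix: 3 genes x 4 experiments (log-transformed ratios).
genes = np.array([[ 0.5,  1.0, -0.5, -1.0],
                  [ 0.4,  0.9, -0.6, -1.1],
                  [-0.5, -1.0,  0.5,  1.0]])

n = genes.shape[0]
# Step 1: genes x genes distance matrix (distance = 1 - similarity).
dist = np.array([[1.0 - eisen_similarity(genes[i], genes[j])
                  for j in range(n)] for i in range(n)])

# Step 2: agglomerative clustering on the condensed distance vector.
condensed = dist[np.triu_indices(n, k=1)]
tree = linkage(condensed, method="average")
print(tree)  # each row: merged clusters, merge distance, cluster size
```

Note that with the distance 1 − S, negatively correlated genes end up far apart; clustering on 1 − |S| instead would group them together, which is one way to read the question above.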
Hierarchical clustering (continued)
A worked example with five genes (distances between merged groups are taken as the maximum of the member distances). Initial distance matrix:

      G1   G2   G3   G4   G5
G1    0    2    6    10   9
G2         0    5    9    8
G3              0    4    5
G4                   0    3
G5                        0

Merge the closest pair G1, G2 into G(12):

        G(12)   G3   G4   G5
G(12)   0       6    10   9
G3              0    4    5
G4                   0    3
G5                        0

Merge the closest pair G4, G5 into G(45):

        G(12)   G3   G(45)
G(12)   0       6    10
G3              0    5
G(45)                0

Stage   Groups
P5      [1], [2], [3], [4], [5]
P4      [1 2], [3], [4], [5]
P3      [1 2], [3], [4 5]
P2      [1 2], [3 4 5]
P1      [1 2 3 4 5]

Part 5 Clustering Algorithms: k-means Clustering

K-means clustering
This method differs from hierarchical clustering in many ways. In particular:
- There is no hierarchy; the data are partitioned. You are presented only with the final cluster membership for each case.
- There is no role for the dendrogram in k-means clustering.
- You must supply the number of clusters (k) into which the data are to be grouped.

K-means clustering (continued)
Step 1: Transform the n (genes) × m (experiments) matrix into an n (genes) × n (genes) distance matrix.

          Gene A   Gene B   Gene C
Gene A    0        ?        ?
Gene B             0        ?
Gene C                      0

Step 2: Cluster genes based on a k-means clustering algorithm.

K-means clustering (continued)
To transform the n × m matrix into an n × n matrix, use a similarity (distance) metric (Tavazoie et al. Nature Genetics. 1999 Jul;22(3):281-5). Euclidean distance:

D(X, Y) = sqrt( Σ_{i=1..M} (X_i − Y_i)² )

where X and Y are any two genes observed over a series of M conditions.

K-means clustering (continued)
(Figure: worked example showing pairwise distances among Gene 1 through Gene 4 and the genes plotted as points in two dimensions.)

K-means clustering algorithm
Step 1: Suppose the genes' expression patterns are positioned in a two-dimensional space based on the distance matrix.
Step 2: The first cluster center (red) is chosen randomly, and subsequent centers are chosen by finding the data point farthest from the centers already chosen. In this example, k = 3.

K-means clustering algorithm (continued)
Step 3: Each point is assigned to the cluster associated with the closest representative center.
Step 4: Minimize the within-cluster sum of squared distances from the cluster mean by moving the centroid (star points), that is, by computing a new cluster representative.

K-means clustering algorithm (continued)
Step 5: Repeat steps 3 and 4 with the new representatives. Run steps 3, 4, and 5 until no further changes occur. (A code sketch of this loop appears at the end of this part.)

Part 6 Clustering Algorithms: Principal Component Analysis

Principal component analysis (PCA)
PCA is a variable reduction procedure. It is useful when you have obtained data on a large number of variables and believe that there is some redundancy in those variables.

PCA (continued)
- Items 1-4 are collapsed into a single new variable that reflects the employees' satisfaction with supervision, and items 5-7 are collapsed into a single new variable that reflects satisfaction with pay.
- General form of the formula to compute scores on the first component:

C1 = b11(X1) + b12(X2) + … + b1p(Xp)

where
C1 = the subject's score on principal component 1
b1p = the regression coefficient (or weight) for observed variable p, as used in creating principal component 1
Xp = the subject's score on observed variable p.

PCA (continued)
For example, you could determine each subject's score on principal component 1 (satisfaction with supervision) and principal component 2 (satisfaction with pay) by

C1 = .44(X1) + .40(X2) + .47(X3) + .32(X4) + .02(X5) + .01(X6) + .03(X7)
C2 = .01(X1) + .04(X2) + .02(X3) + .02(X4) + .48(X5) + .31(X6) + .39(X7)

These weights can be calculated using a special type of equation called an eigenequation.
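As referenced at the end of Part 5, here is a minimal numpy sketch of the k-means loop (Steps 2-5 above). The toy 2-D points, k = 3, and the iteration cap are made-up illustration values; libraries such as sklearn.cluster.KMeans provide production implementations.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Sketch of the procedure from the slides: farthest-point
    initialization, then alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    # Step 2: first center random, the rest farthest from chosen centers.
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Step 3: assign each point to the closest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Step 4: move each centroid to the mean of its cluster.
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # Step 5: stop when stable.
            break
        centers = new_centers
    return labels, centers

points = np.array([[1.0, 1.0], [1.2, 0.9], [4.0, 4.1], [3.9, 4.0], [8.0, 1.0]])
labels, centers = kmeans(points, k=3)
print(labels)
print(centers)
```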
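Continuing from the eigenequation remark: a short numpy sketch of one standard way such weights are obtained, taking the eigenvectors of the correlation matrix as the b coefficients and the projections as the component scores C. The 10 × 4 random toy data set is an assumption for illustration.

```python
import numpy as np

# Toy data: 10 subjects x 4 observed variables (made-up responses).
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))

# Standardize, then form the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
R = (Z.T @ Z) / len(Z)

# Solve the eigenequation R v = lambda v; columns of V are the weight
# vectors (b11 .. b1p, etc.), eigenvalues give the variance explained.
eigvals, V = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]   # largest variance first
eigvals, V = eigvals[order], V[:, order]

# Component scores: C = Z V, i.e. C1 = b11*Z1 + ... + b1p*Zp per subject.
C = Z @ V
print(eigvals / eigvals.sum())      # proportion of variance per component
print(C[:, :2])                     # scores on the first two components
```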
PCA (continued)
Application to genome-wide expression data (Alter et al., PNAS, 2000, 97(18):10101-10106).

Part 7 Clustering Algorithms: Self-Organizing Maps

Clustering Goals
• Find natural classes in the data
• Identify new classes / gene correlations
• Refine existing taxonomies
• Support biological analysis / discovery
• Different methods: hierarchical clustering, SOMs, etc.

Self organizing maps (SOM)
- A data visualization technique invented by Professor Teuvo Kohonen, which reduces the dimensions of data through the use of self-organizing neural networks.
- A method for producing ordered low-dimensional representations of an input data space.
- Typically such input data is complex and high-dimensional, with data elements being related to each other in a nonlinear fashion.

SOM (continued)
- The cerebral cortex of the brain is arranged as a two-dimensional plane of neurons, and spatial mappings are used to model complex data structures.
- Topological relationships in external stimuli are preserved, and complex multi-dimensional data can be represented in a lower (usually two) dimensional space.

SOM (continued) (Tamayo et al., 1999 PNAS 96:2907-2912)
- One chooses a geometry of "nodes", for example, a 3 × 2 grid.
- The nodes are mapped into k-dimensional space, initially at random, and then iteratively adjusted.
- Each iteration involves randomly selecting a data point P and moving the nodes in the direction of P.

SOM (continued)
- The closest node N_P is moved the most, whereas other nodes are moved by smaller amounts depending on their distance from N_P in the initial geometry.
- In this fashion, neighboring points in the initial geometry tend to be mapped to nearby points in k-dimensional space. The process continues for 20,000-50,000 iterations.

SOM (continued)
Yeast Cell Cycle SOM: the 828 genes that passed the variation filter were grouped into 30 clusters.

SOM analysis of data of yeast gene expression during the diauxic shift [2]. Data were analyzed by a prototype of GenePoint software.
• a: Genes with a similar expression profile are clustered in the same neuron of a 16 × 16 matrix SOM, and genes with closely related profiles are in neighboring neurons. Neurons contain between 10 and 49 genes.
• b: Magnification of four neurons similarly colored in a. The bar graph in each neuron displays the average expression of genes within the neuron at 2-h intervals during the diauxic shift.
• c: SOM modified with Sammon's mapping algorithm. The distance between two neurons corresponds to the difference in gene expression pattern between the two neurons, and the circle size to the number of genes included in the neuron. Neurons marked in green, yellow (upper left) …

Result of SOM clustering of Dictyostelium expression data with a 6 × 4 structure of centroids. The 6 × 4 = 24 clusters are the minimum number of centroids needed to resolve the three clusters revealed by percolation clustering (encircled, from top to bottom: down-regulated genes, early upregulated genes, and late upregulated genes). The remaining 21 clusters are formed by forceful partitioning of the remaining noninformative noisy data. Similarity of expression within these 21 clusters is random and biologically meaningless.

SOM clustering
• SOM: self-organizing maps
• Preprocessing (see the sketch below):
  – filter away genes with insufficient biological variation
  – normalize gene expression (across samples) to mean 0, st. dev. 1, for each gene separately
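A minimal numpy sketch of the preprocessing above together with the node-update rule from the Tamayo et al. description. The grid size, learning-rate schedule, and Gaussian neighborhood here are illustrative assumptions, not the settings of any particular SOM package.

```python
import numpy as np

def train_som(data, grid=(3, 2), n_iter=20000, seed=0):
    """Nodes on a small grid are mapped into k-dimensional expression
    space and pulled toward randomly chosen data points; the winning
    node moves the most, its grid neighbors move less."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    nodes = rng.normal(size=(rows * cols, data.shape[1]))   # random initial map
    for t in range(n_iter):
        lr = 0.1 * (1.0 - t / n_iter)                       # decaying learning rate
        radius = max(rows, cols) * (1.0 - t / n_iter) + 0.5  # shrinking neighborhood
        p = data[rng.integers(len(data))]                   # random data point P
        winner = np.argmin(np.linalg.norm(nodes - p, axis=1))  # closest node N_P
        grid_dist = np.linalg.norm(coords - coords[winner], axis=1)
        influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
        nodes += lr * influence[:, None] * (p - nodes)      # move nodes toward P
    return nodes

# Preprocessing per the slides: normalize each gene to mean 0, st. dev. 1.
expr = np.random.default_rng(1).normal(size=(100, 8))       # 100 genes x 8 samples
expr = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)
nodes = train_som(expr, grid=(3, 2), n_iter=5000)
print(nodes.shape)  # one reference expression profile per grid node
```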
• Run SOM for many iterations
• Plot the results

SOM results
(Figures: a large 10 × 10 grid; three example cells; clustering visualization; 2-D SOM visualization; SOM output visualization; the Y-Cluster.)

Part 8 Beyond Clustering

Support vector machines
Used for classification of genes according to function:
1) Choose positive and negative examples (label +/−)
2) Transform input space to feature space
3) Construct maximum margin hyperplane
4) Classify new genes as members / non-members

Support vector machines (continued) (Brown et al., 2000 PNAS 97(1), 262-267)
- Using the class definitions made by the MIPS yeast genome database, SVMs were trained to recognize six functional classes: tricarboxylic acid (TCA) cycle, respiration, cytoplasmic ribosomes, proteasome, histones, and helix-turn-helix proteins.

Support vector machines (continued)
Examples of predicted functional classifications for previously unannotated genes by the SVMs (classes: TCA, Resp, Ribo, Prot):

Gene       Locus   Comments
YHR188C            Conserved in worm, Schizosaccharomyces pombe, human
YKL039W    PTM1    Major transport facilitator family; likely integral membrane protein
YKR016W            Not highly conserved, possible homolog in S. pombe
YKR046C            No convincing homologs
YKL056C            Homolog of translationally controlled tumor protein, abundant
YNL053W    MSG5    Protein-tyrosine phosphatase, bypasses growth arrest by mating factor
YDR330W            Ubiquitin regulatory domain protein, S. pombe homolog
YJL036W            Member of sorting nexin family
YDL053C            No convincing homologs
YLR387C            Three C2H2 zinc fingers, similar to YBR267W (not coregulated)

Automatic discovery of regulatory patterns in promoter regions (Juhl and Knudsen, 2000 Bioinformatics, 16:326-333)
From SGD:
- All 6269 ORFs: up- and downstream 200 bp.
- 5097 ORFs: upstream 500 bp.
- DNA chip: 91 data sets. These data sets consist of the 500 bp upstream regions and the red-green ratios.

Automatic discovery of regulatory patterns in promoter regions (continued)
- Sequence patterns correlated to whole-cell expression data were found by Kolmogorov-Smirnov tests.
- Regulatory elements were identified by systematic calculations of the significance of correlation between words found in functional annotation of genes and DNA words occurring in their promoter regions.

Bayesian networks analysis (Friedman et al. 2000 J. Comp. Biol., 7:601-620)
- Graph-based model of joint multivariate probability distributions.
- The model can capture properties of conditional independence between variables.
- Can describe complex stochastic processes.
- Provides clear methodologies for learning from (noisy) observations.

Bayesian networks analysis (continued)
- 76 gene expression measurements of 6177 yeast ORFs.
- 800 genes whose expression varied over cell-cycle stages were selected.
- Learned networks whose variables were the expression levels of each of these 800 genes.

Movie
http://www.dkfz-heidelberg.de/abt0840/whuber/mamovie.html

Part 9 Concluding Remarks

Future directions
• Algorithms optimized for small samples (the no. of samples will remain small for many tasks)
• Integration with other data
  – biological networks
  – medical text
  – protein data
• Cost-sensitive classification algorithms
  – error cost depends on outcome (don't want to miss treatable cancer), treatment side effects, etc.

Summary
• Microarray data analysis -- a revolution in life sciences!
• Beware of false positives
• Principled methodology can produce good results