Project Phase I

• Due on 9/22; send it to me by email
• 2-10 pages
• Free style in writing (use 11pt font or larger)
• Project description:
  o Overview
  o Problem definition
  o Why it is important
  o Some review of existing work
  o Objectives to achieve

Gene Expression Data Analyses

Dong Xu
Computer Science Department
109 Engineering Building West
E-mail: [email protected]
573-882-7064
http://digbio.missouri.edu

Lecture Outline

• Gene expression
• Similarity between gene expression profiles
• Concept of clustering
• K-Means clustering
• Hierarchical clustering
• Minimum spanning tree-based clustering

Gene expression profiles

[Figure: expression profiles. Vertical axis: expression (levels relative to a reference point at 0); horizontal axis: time/condition]

Goal of Microarray Experiments

• Regulation/function in pathway/cellular state/phenotype
• Disease diagnosis / disease gene identification

[Figure labels: gene expression, microarray data, biological pathway]

What Microarray Can Tell Us

• Differentially expressed genes
  o Under different conditions
  o Different genotypes (mutant vs. wild type)
• Co-expression and gene function inference
• Regulatory network inference

Regulatory Networks

Which gene controls what?

Current methods for network reconstruction:
• Boolean networks
  o qualitative representation (on/off relationship)
  o computationally more manageable
• Differential equations
  o give "detailed" dynamic properties of networks
  o mathematically/computationally more problematic
• Bayesian networks
  o define regulatory relationship
  o widely used
• E-Cell Project (http://www.e-cell.org/): network modeling

Similarity between Profiles

Similarity measure:
• Euclidean distance
• Correlation coefficient
• Trend
• …

Correlation coefficient often works better.

[Figure: an expression profile plotted as expression versus time]

Pearson Correlation Coefficient

• Compares scaled profiles!
• Can detect inverse relationships
• Most commonly used

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

n = number of conditions
$\bar{x}$ = average expression of gene x in all n conditions
$\bar{y}$ = average expression of gene y in all n conditions
$s_x$ = standard deviation of x
$s_y$ = standard deviation of y

Correlation Pitfalls

[Figure: two panels comparing Gene A and Gene B across chips 1-7. "Raw Data" panel: correlation = 0.97. "Normalized Data" panel: the same two genes after normalization.]

Euclidean Distance

• Scaled versus unscaled
• Cannot detect inverse relationships

For Gene X = (x1, x2, …, xn) and Gene Y = (y1, y2, …, yn):

$$ d(X, Y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2} $$

Data-Mining through Clustering

Assumptions for clustering analysis:
• The expression level of a gene reflects the gene's activity.
• Genes involved in the same biological process exhibit a statistical relationship in their expression profiles.

[Figure: clusters labeled Degradation, Synthesis, Chromatin, Glycolysis]

Idea of Clustering

Clustering: group objects into clusters so that
o objects in each cluster have "similar" features;
o objects of different clusters have "dissimilar" features

Methods of Clustering

• discriminant analysis (Fisher, 1936)
• K-means (Lloyd, 1957)
• hierarchical clustering
• self-organizing maps (Kohonen, 1982)
• support vector machines (Vapnik, 1995)
• single linkage (dendrogram)
• minimum spanning tree based clustering

Issues in Cluster Analysis

• A lot of clustering algorithms
• A lot of distance/similarity metrics
• Which clustering algorithm runs faster and uses less memory?
• How many clusters after all?
• Are the clusters stable?
• Are the clusters meaningful?

Which Clustering Method Should I Use?

• What is the biological question?
• Do I have a preconceived notion of how many clusters there should be?
• How strict do I want to be? Split or join?
• Can a gene be in multiple clusters?
• Hard or soft boundaries between clusters

K-means clustering for expression profiles

Step 1: Transform the n (genes) × m (experiments) matrix into an n (genes) × n (genes) distance matrix. To transform the n × m matrix into the n × n matrix, use a similarity (distance) metric.

Expression matrix:

          Exp 1   Exp 2   Exp 3   Exp 4
Gene A
Gene B
Gene C

Distance matrix:

          Gene A  Gene B  Gene C
Gene A    0       ?       ?
Gene B            0       ?
Gene C                    0

Step 2: Cluster genes based on a k-means clustering algorithm.

K-means algorithm

The most popular algorithm for clustering. What is so attractive?
• Simple
• Fast
• Mathematically correct
• Invariant to dimension
• Easy to implement

K-Means Clustering

Basic ideas:
• Use the cluster centre (mean) to represent each cluster.
• Assign data elements to the closest cluster (centre).

Goal: minimize the square error (intra-class dissimilarity)

$$ \sum_i d^2(x_i, C(x_i)) $$

where C(x_i) is the centre of the cluster to which x_i is assigned.

• There is no hierarchy.
• Must supply the number of clusters (k) into which the data are to be grouped.

K-means Clustering: Procedure (1)

Initialization 1: specify the number of clusters k, for example k = 4.

[Figure: expression matrix (genes × conditions) plotted as points; each point is called a "gene"]

K-means Clustering: Procedure (2)

Initialization 2: genes are randomly assigned to one of the k clusters, or random starting centers are chosen.

K-means Clustering: Procedure (3)

Calculate the mean of each cluster:

$$ m_c = \frac{1}{N_c} \sum_{i=1}^{N_c} g_i $$

Example for the blue cluster:

$$ m_{BLUE} = \frac{1}{4} \left[ (6,7) + (3,4) + \dots \right] $$

[Figure: example points labeled (6,7), (1,2), (3,4), (3,2)]

K-means Clustering: Procedure (4)

Each gene is reassigned to the nearest cluster. Gene i goes to cluster c, where

$$ c = \arg\min_j \, | m_j - g_i |^2 $$

K-means Clustering: Procedure (5)

Iterate until the means have converged.

Convergence of K-means algorithm

• For each set of starting centers we'll get a local minimum
• Increase the number of starts!
Example: 111 data points in 9-dimensional space

N = # of starts for achieving global solution

# of clusters    N
2                1000
3                10000
4                30000
20               40000
30               1000000

Hierarchical clustering (1)

Step 1: Transform the genes × experiments matrix into a genes × genes distance matrix.

          Exp 1   Exp 2   Exp 3   Exp 4
Gene A
Gene B
Gene C

          Gene A  Gene B  Gene C
Gene A    0       ?       ?
Gene B            0       ?
Gene C                    0

Step 2: Cluster genes based on the distance matrix and draw a dendrogram until a single node remains.

Hierarchical clustering (2)

        G1   G2   G3   G4   G5
G1      0    2    6    10   9
G2           0    5    9    8
G3                0    4    5
G4                     0    3
G5                          0

        G(12)  G3   G4   G5
G(12)   0      6    10   9
G3             0    4    5
G4                  0    3
G5                       0

        G(12)  G3   G(45)
G(12)   0      6    10
G3             0    5
G(45)               0

Stage   Groups
P5      [1], [2], [3], [4], [5]
P4      [1 2], [3], [4], [5]
P3      [1 2], [3], [4 5]
P2      [1 2], [3 4 5]
P1      [1 2 3 4 5]

[Figure: dendrogram of the merges above]

Hierarchical Clustering Results

[Figure]

K-Means vs Hierarchical Clustering

[Figure]

Graph Representation

Represent a set of n-dimensional points as a graph
o each data point (gene) represented as a node
o each pair of genes represented as an edge with a weight defined by the "dissimilarity" between the two genes

Distance matrix (first rows):

0     1     1.5   2     5     6     7     9
1     0     2     1     6.5   6     8     8
1.5   2     0     1     4     4     6     5.5
. . .
[Figure labels: n-D data points, graph representation, distance matrix]

Minimum Spanning Tree

• Spanning tree: a sub-graph that has all nodes connected and has no cycles
• Minimum spanning tree (MST): a spanning tree with the minimum total distance

[Figure: panels (a), (b), (c)]

How to Construct Minimum Spanning Tree

Prim's algorithm and Kruskal's algorithm

Kruskal's algorithm:
step 1: select the edge with the smallest distance from the graph
step 2: add it to the tree as long as no cycle is formed
step 3: remove the edge from the graph
step 4: repeat steps 1-3 until all nodes are connected in the tree

[Figure: panels (a)-(e), Kruskal's algorithm applied step by step to a small weighted graph]

Foundation of MST Approach

• Significantly simplifies the data clustering problem, while losing very little essential information for clustering.
• We have mathematically proved: a multi-dimensional clustering problem is equivalent to a tree-partitioning problem!

Clustering by Cutting Long Edge

Hierarchical cutting
• 1st cut: longest edge
• 2nd cut: second longest edge
• …

Works well for "easy" cases. Produces many single-element clusters for some "difficult" cases.

Tree-Based Clustering

• For each edge, calculate the assessment value
• Find the edge that gives the minimum assessment value as the place to cut
• Clustering using an iterative method
• Guaranteed to find the global optimum using tree-based dynamic programming

Automated Selection of Number of Clusters

Select the "transition point" in the assessment value as the "correct" number of clusters.

Transition Profiles

indicator[n] = (A[n-1] - A[n]) / (A[n] - A[n+1])

A[k] is the assessment value for the partition with k clusters.

Our clustering of yeast data

[Figure]

Reading Assignments (1)

Suggested reading:
• Chapter 10 in Neil C. Jones and Pavel A. Pevzner. An Introduction to Bioinformatics Algorithms (Computational Molecular Biology). MIT Press, 2004.
• Chapter 11 in Current Topics in Computational Molecular Biology, edited by Tao Jiang, Ying Xu, and Michael Zhang. MIT Press, 2002.

Reading Assignments (2)

Optional reading:
1. Ying Xu, Victor Olman, and Dong Xu. Clustering Gene Expression Data Using a Graph-Theoretic Approach: An Application of Minimum Spanning Trees. Bioinformatics. 18:526-535, 2002.
2. Dong Xu, Victor Olman, Li Wang, and Ying Xu. EXCAVATOR: a computer program for gene expression data analysis. Nucleic Acids Research. 31:5582-5589, 2003.

Project Assignment

Develop a program that implements the K-means clustering algorithm.
1. Allow several random initializations, and compare their clustering results. Choose the one that has the best value for the objective function $\sum_i d^2(x_i, C(x_i))$.
2. Test the program using the gene expression data sent to the mailing list.
3. Output gene IDs for each cluster.
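
A starting point for the assignment

The following is only a minimal sketch of the procedure on the K-means slides (random starting centers, reassignment to the nearest centre, recomputing means, several restarts), not a reference solution. It assumes squared Euclidean distance as d², plain Python lists as profiles, and that an empty cluster simply keeps its previous centre. All function names, the gene IDs and the toy profiles are hypothetical and chosen for illustration; the format of the data sent to the mailing list is not assumed.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_profile(profiles):
    """Component-wise mean of a non-empty list of profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def kmeans_once(profiles, k, rng, max_iter=100):
    """One K-means run from random starting centres.

    Returns (labels, objective), where objective is the sum of squared
    distances from each profile to the centre of its cluster.
    """
    centres = rng.sample(profiles, k)  # k distinct genes as starting centres
    labels = None
    for _ in range(max_iter):
        # Reassign each gene to the nearest centre.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centres[j]))
            for p in profiles
        ]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        # Recompute each centre as the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centre
                centres[j] = mean_profile(members)
    objective = sum(
        squared_distance(p, centres[lab]) for p, lab in zip(profiles, labels)
    )
    return labels, objective


def kmeans(profiles, k, n_starts=10, seed=0):
    """Run K-means n_starts times and keep the run with the lowest objective."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        labels, objective = kmeans_once(profiles, k, rng)
        if best is None or objective < best[1]:
            best = (labels, objective)
    return best


def clusters_by_id(gene_ids, labels):
    """Group gene IDs by cluster label."""
    groups = {}
    for gene_id, lab in zip(gene_ids, labels):
        groups.setdefault(lab, []).append(gene_id)
    return list(groups.values())


if __name__ == "__main__":
    # Made-up gene IDs and profiles, for illustration only.
    gene_ids = ["g1", "g2", "g3", "g4"]
    profiles = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
    labels, objective = kmeans(profiles, k=2, n_starts=5)
    print("objective:", objective)
    for members in clusters_by_id(gene_ids, labels):
        print(members)
```

Keeping the single run (`kmeans_once`) separate from the restart loop (`kmeans`) makes it easy to print the objective of every start and compare the results, as item 1 of the assignment asks. Since each run only reaches a local minimum, the number of starts is the main parameter to experiment with.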