Aspects of microarray gene expression analysis
Project 786 - 102, Spring 2002
Antoaneta Vladimirova

Parts of the talk:
1. Why microarray expression experiments?
2. Basic steps of the microarray experiment
3. Data collection and normalization
4. Analysis of expression data - Clustering algorithms
5. “Extraction of correlated gene clusters by multiple graph comparison” - an algorithm that allows integration of the expression data with other existing biological knowledge

Reductionism in Biology
Studies in biology until now: whole ---> parts
Assumption: knowledge derived from the parts will enable us to understand the whole
organism --> organs --> cells --> molecules
Consequences of this approach:
- incomplete knowledge
- isolated studies: individual genes or gene products
- databases: inconsistent annotations, lack of integration

Biological information flow
In general: DNA ----> RNA ----> protein
* DNA: copy the genetic information; genome: the collection of all genes (DNA) of an organism
* Transcription (gene expression): RNA is a messenger molecule from genetic information to functional unit
* Protein: the gene product, which carries the function of the gene

Why study gene expression?
* every cell in an organism has the same set of genes. So, what makes the liver cells and the brain cells different?
* cells in different tissues or in different stages of development express different sets of genes and consequently have different characteristics
* gene expression is the intermediary process between the maintenance of the genetic information in the form of DNA in the chromosomes and the production of protein, which carries most of the functions in a cell
* our interest is in understanding the functions, properties and inter-relations among proteins; however, studying gene expression is technologically more accessible and cheaper, and is assumed to give us a good approximation of the quantity of the corresponding protein product

Why design and analyze microarray experiments?
* allow simultaneous assessment of multiple genes
* generate expression levels of thousands of genes in parallel
* the expression level of a gene is taken as an approximation of the protein level and, in turn, of the function the gene product carries
* expression level changes due to environmental conditions, developmental stage, diseased state
* genes that share expression patterns are assumed to be co-regulated and to be functionally related
* gene expression data might eventually allow us to reverse-engineer (reconstruct) gene regulation networks and biological processes

Synthetic approach
* advances in technology allow us to move on to a synthetic approach: take all pieces together and integrate, rather than disassemble, the biological system
* towards reconstruction of the whole cell/organism
* need to study a gene/gene product not in isolation, but in relation to all other genes/products and the environment: networks of components and interactions between them

Microarray System
Adapted from “Ratio-based decisions and the quantitative analysis of cDNA microarray images” - Chen, Dougherty and Bittner (1997), J Biomed Opt 2(4)

Microarray images
a. Oligonucleotide array synthesized in situ with photochemical technology by Affymetrix.
b. Oligonucleotide array synthesized in situ with ink-jet technology (Rosetta Inpharmatics).
c. DNA microarray printed on a glass slide (Corning, Inc).
Adapted from “Biomedical Discovery Review with DNA Arrays” - Richard A.
Young

Color-coded expression
* Each dot on the microarray is read through two independent channels (green and red)
* Green color - query expression is lower than the control expression: Query signal of gene x / Control signal of gene x < 1
* Red color - query expression is higher than the control expression: Query signal of gene x / Control signal of gene x > 1
* Yellow color - query expression and control expression are equal: Query signal of gene x / Control signal of gene x = 1
* Black color - neither control nor query bound to the slide

Why is signal normalization necessary?
Assumptions:
* the quantity of initial RNA from both samples is equal
* some genes are up-regulated, others are down-regulated, but overall these changes should balance out so that the total quantity from each sample that hybridizes to the array is equal; therefore the total intensity read through the red and green channels should be the same
The relative fluorescence intensities need to be normalized because:
* we need to adjust for differences in labeling and detection efficiencies of the different fluorescent labels
* we need to adjust for differences in the quantity of initial RNA isolated from the query and control samples
* we need to compensate for experimental variability
* a normalization factor is computed and applied for each gene

How to interpret the raw microarray data?
* Microarray 1 (gene 1 - gene m): experimental condition 1 (e.g. nutrient withdrawal)
* Microarray i (gene 1 - gene m): experimental condition i (e.g. gene disruption)
* Microarray n (gene 1 - gene m): experimental condition n (e.g. drug treatment)
* Fluorescence intensities are translated to a ratio Q/C (query/control)
* Data is organized into an expression matrix
[Figure: expression matrix with genes 1 to m as rows and experimental conditions as columns; the row for gene 1 reads 0.55, 0.40, 2.34.]
Gene 1 is differentially expressed in exp.
condition 3

Expression signal is represented as a ratio
* Red color -> Q/C > 1. If Q/C is in the range of 1.5 - 2.0, the particular gene in the query cell is considered up-regulated. In theory, Q/C can go to infinity.
* Green color -> Q/C < 1. The particular gene in the query cell is considered down-regulated. In theory, Q/C will range from zero to one.
[Scale: 0 | green | 1 | red | +infinity]
* To correct for this asymmetry, log2(Q/C) is used.
If Q/C = 2 ==> log2(Q/C) = 1
If Q/C = 1 ==> log2(Q/C) = 0
If Q/C = 1/2 ==> log2(Q/C) = -1
* The values of over-expressed and inhibited genes are now equally distributed around zero
[Scale: -infinity | green | 0 | red | +infinity]

Expression vectors and expression space
2D expression space
Expression matrix:
           Exp. 1   Exp. 2
Gene 1     0.55     0.33
Gene 2     0.45     0.55
Gene 3     0.23     0.76
Gene 4     0.24     0.34
Gene 5     0.11     0.77
Gene 6     0.67     0.45
Gene 7     0.9      0.33
Gene 8     0.4      0.12
Gene 9     0.77     0.37
Gene 10    0.02     0.33
[Figure: two scatter plots of the ten expression vectors in 2D expression space, both axes running from 0 to 1; the second shows the grouping.]
We want to group (cluster) the expression vectors based on their “similarity”
Assumption: Genes in the same group are functionally related

Experiment vectors and experiment space
5D experiment space
Expression matrix:
           Exp. 1   Exp. 2
Gene 1     0.55     0.44
Gene 2     0.45     0.33
Gene 3     0.23     0.37
Gene 4     0.24     0.29
Gene 5     0.11     0.88
Experiment vectors: (.55, .45, .23, .24, .11) and (.44, .33, .37, .29, .88), drawn from the origin (0, 0, 0, 0, 0)
We want to group (cluster) the expression vectors based on their “similarity”
Assumption: Genes in the same group are functionally related
[Figure: vectors grouped into Cluster 1 and Cluster 2.]

How to define similarity between expression vectors?
Distance measure: the distance between two objects (e.g. expression vectors)
[Figure: three objects i, j and k with pairwise distances dij, dik and dkj.]
Metric:
1. dij must be positive or zero (dij >= 0)
2. Must be symmetric (dij = dji)
3. An object is zero distance from itself (dii = 0)
4.
When considering three objects i, j and k, the distance from i to k is always less than or equal to the sum of the distance from i to j and the distance from j to k: dik <= dij + djk (the triangle rule)

Example: the Euclidean distance between two 3D points X(x1, x2, x3) and Y(y1, y2, y3) is
$d_{12} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2}$
For an n-dimensional space:
$d_{12} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Semi-metric: obeys the first three rules but not the triangle rule

Clustering analysis of gene expression
Idea: cluster together genes with similar expression patterns
Underlying assumptions:
* Genes that share expression patterns are co-regulated and participate in functionally related processes
* Unknown genes that are clustered together with known genes might have similar or related functions
Categories of clustering methods:
I. Unsupervised
II. Supervised
A. Agglomerative
B. Divisive

Two major clustering algorithm categories
* Agglomerative: start with individual gene clusters and gradually accommodate more genes in a cluster; clusters are eventually joined in one huge cluster. Usually represented by a tree structure resembling phylogenetic trees
* Divisive: start with all points in one cluster, then gradually form new clusters and distribute the data points among them
* Unsupervised: no prior knowledge is assumed when forming the clusters
* Supervised: existing biological knowledge is used to guide the clustering process

Clustering methods to be discussed:
* Hierarchical clustering
* k-means clustering
* Principal Component Analysis
* Supervised clustering (classifiers)

Hierarchical Clustering
* One of the most frequently used techniques
* simple and can be easily visualized as a tree similar to phylogenetic trees
* an agglomerative approach: single expression profiles are joined to form groups, and the process is repeated until all expression profiles have been joined in one cluster
* first, the pair-wise distances are calculated for all the genes to be
clustered; initially each gene is a cluster itself (Cl1 ... Cl5)

Distance matrix:
           Gene 1   Gene 2   Gene 3   Gene 4   Gene 5
Gene 1     0        0.45     0.23     0.24     0.11
Gene 2              0        0.76     0.34     0.77
Gene 3                       0        0.44     0.36
Gene 4                                0        0.77
Gene 5                                         0

* The distance matrix is searched for the clusters with the minimum distance between them (here 0.11, between Gene 1 and Gene 5)
* The two selected clusters are joined to form a new cluster, which now contains 2 objects (Cluster A)
* The distances are recalculated from this new cluster to the rest of the clusters; the distance matrix now has one dimension (cluster) less

             Gene 2   Gene 3   Gene 4   Cluster A
Gene 2       0        0.76     0.34     0.56
Gene 3                0        0.44     0.56
Gene 4                         0        0.45
Cluster A                               0

Building the hierarchical tree: Cl1 and Cl5 form Cluster A

* The process is repeated and Cl2 and Cl4 are joined into Cluster B
* The distances are recalculated and clusters formed until only one cluster is left that accommodates all the objects

             Gene 3   Cluster A   Cluster B
Gene 3       0        0.56        0.44
Cluster A             0           0.67
Cluster B                         0

Building the hierarchical tree: Cl1 and Cl5 form Cluster A; Cl2 and Cl4 form Cluster B

             Cluster A   Cluster C
Cluster A    0           0.56
Cluster C                0

Building the hierarchical tree: Cl3 and Cluster B form Cluster C

             Cluster D
Cluster D    0

Building the hierarchical tree: Cluster A and Cluster C form Cluster D, which contains all the objects

Hierarchical Clustering - Tree Representation
Example (partial tree)
Hierarchical clustering of gene expression matrices. The image shows an average linkage (UPGMA) clustering of 505 yeast genes during three different cell cycle studies with a total of 60 different time points analyzed. The color image on the left shows the numerical values by color according to the method introduced by Mike Eisen. Red is used to represent the positive values and green the negative values. Blue shows the missing values in the respective experiments.
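
The merge-and-recalculate loop walked through above can be sketched in Python. This is only a minimal illustration, not the implementation behind the slides: the pairwise distances are the values read from the first distance matrix (an assumed transcription of the flattened table), the names `agglomerate` and `cluster_distance` are chosen here, and distances to a merged cluster are recalculated with the minimum member-to-member distance (single linkage). The recalculated values printed on the slides appear to be illustrative, so the merge order of this sketch need not match the slide tree after the first step.

```python
from itertools import combinations

GENES = ["Gene 1", "Gene 2", "Gene 3", "Gene 4", "Gene 5"]

# Pairwise distances as read from the first distance matrix (assumption).
PAIR_DISTANCES = {
    ("Gene 1", "Gene 2"): 0.45, ("Gene 1", "Gene 3"): 0.23,
    ("Gene 1", "Gene 4"): 0.24, ("Gene 1", "Gene 5"): 0.11,
    ("Gene 2", "Gene 3"): 0.76, ("Gene 2", "Gene 4"): 0.34,
    ("Gene 2", "Gene 5"): 0.77, ("Gene 3", "Gene 4"): 0.44,
    ("Gene 3", "Gene 5"): 0.36, ("Gene 4", "Gene 5"): 0.77,
}
DIST = {frozenset(pair): d for pair, d in PAIR_DISTANCES.items()}


def cluster_distance(c1, c2, linkage=min):
    """Distance between two clusters: min = single linkage, max = complete linkage."""
    return linkage(DIST[frozenset((a, b))] for a in c1 for b in c2)


def agglomerate(names, linkage=min):
    """Repeatedly join the two closest clusters; return the list of merges."""
    clusters = [frozenset([name]) for name in names]  # each gene starts alone
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], linkage),
        )
        merges.append((sorted(c1), sorted(c2), cluster_distance(c1, c2, linkage)))
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(c1 | c2)  # the joined cluster replaces its two parts
    return merges


merges = agglomerate(GENES)
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

Passing `linkage=max` switches the same loop to complete linkage; the first merge (Gene 1 with Gene 5 at 0.11) is the same under either rule because it only involves single genes.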
The clustering and the image in the tree representation are produced using WWW-based tools in Expression Profiler (http://www.ebi.ac.uk/microarray/). The interface is interactive and further information about the genes in each subtree is available by clicking on the respective nodes in the tree.
Adapted from “Gene expression data analysis” - A. Brazma and J. Vilo (2000), FEBS Letters

Hierarchical Clustering Algorithms
* Single-linkage clustering: the distance between two clusters i and j is calculated as the minimum distance between a member of cluster i and a member of cluster j
* Complete-linkage clustering: the distance between two clusters i and j is calculated as the maximum distance between a member of cluster i and a member of cluster j
* Average-linkage clustering: the distance between two clusters i and j is calculated from average values (e.g. the average of the distances between members of cluster i and members of cluster j)

K-means Clustering
* Partitions data into groups with similar expression
* the number of clusters must either be known in advance or k must be chosen arbitrarily; objects are partitioned into a fixed number of clusters, such that clusters are internally similar but externally dissimilar
* the same k might produce slightly different clustering results each time
* the process is conceptually simple, but can be computationally intensive
* first, the objects are randomly partitioned into k user-specified clusters
* an average expression vector is computed to represent each cluster, and is used to compute the distances between each point and each average cluster vector
* if a given object is closer to a different cluster than to the one it is assigned to, it is reassigned to the closest cluster and the average expression vectors for the clusters are recalculated
* the process is repeated until no re-assignments are necessary

K-means Clustering
* Let’s have m objects in n-dimensional space (e.g.
five genes in 2D expression space)
[Figure: five objects, numbered 1 to 5, in 2D expression space.]
* Let k = 2
* Let’s partition arbitrarily into 2 clusters
* Then calculate the average expression vector for each cluster
* Calculate distances from each object to the average vector
* Re-assign object 3 to Cluster A
* Recalculate average expression vectors for the clusters
* Re-calculate the distances from all objects to all average expression vectors
* No further re-assignments are necessary
* These are the final 2 clusters

Principal Component Analysis (PCA)
* Principal Component Analysis or Singular Value Decomposition is a mathematical technique that picks up patterns in data while reducing dimensionality
* reduction of dimensionality might be necessary when some of the data contain redundant information, e.g. if a group of experiments are more closely related than initially expected
* “projects” complex data onto a reduced, easily visualized space
* analogy: a 3D cloud of data points that is rotated so that one can see it from different perspectives; some views might allow a better separation of the data into groups than other views; PCA finds the best views to separate the data
* in most implementations of PCA it is difficult to define the precise boundaries of distinct clusters in the data, or to define the genes (or experiments) belonging to each cluster
* however, when combined with another clustering technique such as k-means, it becomes a very powerful technique

Analysis of a demonstration data set
* the performance of the various algorithms is compared
* the analysis can help to provide an understanding of how the data are handled and interpreted by the different methods
A synthetic gene-expression data set. This data set provides an opportunity to evaluate how various clustering algorithms reveal different features of the data.
A. Nine distinct gene-expression patterns were created with log2(ratio) expression measures defined for ten experiments.
B.
For each expression pattern, 50 additional genes were generated, representing variations on the basic patterns.
Adapted from “Computational analysis of microarray data” - J. Quackenbush (2001), Nature Reviews Genetics, vol 2

Hierarchical Clustering Algorithms
Genes in the demonstration data set were subjected to
a. average-linkage
b. complete-linkage
c. single-linkage
hierarchical clustering using a Euclidean distance metric and gene-expression families (A–J) that were color coded for comparison. Genes that are up-regulated appear in red, and those that are down-regulated appear in green, with the relative log2(ratio) reflected by the intensity of the color. This method of clustering groups genes by reordering the expression matrix, which allows patterns to be easily visualized.
Adapted from “Computational analysis of microarray data” - J. Quackenbush (2001), Nature Reviews Genetics, vol 2

Hierarchical Clustering and PCA
Principal component analysis. The same demonstration data set was analyzed using
a. hierarchical (average-linkage) clustering and
b. principal component analysis
using Euclidean distance, to show how each treats the data, with genes color coded on the basis of hierarchical clustering results for comparison.
Adapted from “Computational analysis of microarray data” - J. Quackenbush (2001), Nature Reviews Genetics, vol 2

Data Filtering by Mean Centering
* Why filtering? To enhance certain features of the patterns we are looking for
* Mean centering removes “constant” expression by subtracting the average across all experiments from each data point
* genes with similar changes relative to their baseline expression pattern are grouped
* A, B and C have “constant” expression - grouped together (originally B were up-regulated and C were down-regulated)
* D and G are grouped together - expression changes in the same fashion (up and down)
* E and F are grouped together - expression changes in the same fashion (down and up)
Adapted from “Computational analysis of microarray data” - J.
Quackenbush (2001), Nature Reviews Genetics, vol 2

The effect of Data Filtering by Mean Centering
The effect of data filtering. Application of various data filters or changes in the distance metric can change the results derived from any clustering algorithm.
a. Mean centering of the data removes ‘constant’ expression, which reveals changes in expression patterns for the nine gene families across the ten experiments.
The changes can be seen in the results of
b. principal component analysis
c. average-linkage hierarchical clustering.
Adapted from “Computational analysis of microarray data” - J. Quackenbush (2001), Nature Reviews Genetics, vol 2

Supervised clustering (classifiers)
* Supervised methods can be applied if one has some previous knowledge of which genes are expected to cluster together
* Support Vector Machine (SVM) - a widely used technique
* an SVM uses a training set of genes known to be related, e.g. functionally; these genes are provided as positive examples, and genes known not to be related are used as negative examples
* this training allows the SVM to distinguish between members and non-members of the group based on the expression data; the SVM uses existing biological relationships to determine expression features that are characteristic for a group
* the SVM is then used to recognize and classify the genes in the data set into the established groups on the basis of their expression
* the SVM can also identify genes in the training set that are outliers or that have previously been assigned to the incorrect class
* an application of potentially great impact is classification of samples from patients affected by some disease; if there is information on expression patterns that is already correlated with survival data, disease stage or disease type, it can be used to train the SVM to classify samples for cancer diagnostics, for example.
In many cases samples look the same histologically, but their “expression fingerprint” is different. A certain “expression fingerprint” might be correlated with different rates of progression of the disease or with its response to treatment with various drugs

Clustering/Classification of expression data
Problems:
* how to normalize expression values?
* what distance metric to use?
* results are very much dependent on the approach taken for the analysis
Algorithm limitations:
* take into account only expression profiles
* do not incorporate all the available biological information
* clustering unrelated data will still produce clusters!
* there is no “best” or “correct” clustering technique - the results have to be evaluated in the context of the existing biological knowledge

“Extraction of Correlated Gene Clusters by Multiple Graph Comparison”
Akihiro Nakaya, Susumu Goto, Minoru Kanehisa
Bioinformatics Center, Kyoto University, Japan
Genome Informatics 12: 44-53 (2001)

Graphs
G = (V, E), where G - graph, V - vertices (nodes), E - edges
[Figure: example graph with vertices A to F connected by edges.]

Comparisons of Graphs (common sub-graph)
[Figure: example graphs over nodes A to G, labeled Genome, Linear genome and Pathway.]

Comparisons of Graphs
[Figure: example graphs over nodes A to G, labeled Cluster, Linear genome and Pathway.]

The KEGG Databases
Database: data object (graph); node; edge; content
* GENES: Genome; Gene; Adjacency; Gene catalogs of completely sequenced genomes and some partial genomes
* SSDB: Protein Universe; Protein; Sequence similarity; Ortholog/Paralog relations of all protein-coding genes in complete genomes
* PATHWAY: Network; Gene product or subnetwork; Generalized protein interaction; Generalized protein interaction network (pathways and complexes) involving various cellular processes
* LIGAND: Chemical Universe; Compound; Reaction; Chemical compounds and chemical reactions that are relevant to cellular processes
* EXPRESSION: Transcriptome; Gene; Expression similarity; Microarray gene expression profiles
* BRITE: Proteome; Protein; Direct interaction; Protein-protein interactions and relations

Pathway Database
* network of gene products (nodes) with three types of interactions or relations (edges)
- enzyme-enzyme relations (catalyzing successive reaction steps in the metabolic pathway)
- protein-protein interactions (e.g. binding, phosphorylation)
- gene expression relations (transcription factors and target gene products)
* 5761 entries (as of Sept 2001)
- 201 reference pathway diagrams
- 83 ortholog group tables
- 960 enzyme-enzyme relations

One of the really fundamental problems in biology:
* there is a fraction of genes with known functions; however, the majority of genes have not been assigned a function even if the particular genome has already been sequenced.
* how to find gene functions or genes/gene products with related functions from all the information obtained from sequencing, expression profiling or protein-protein interaction assays?
* Techniques:
- Clustering of expression microarray data
- Classification of expression microarray data
- Multiple graph comparison

Goal:
* extract a set of correlated genes with respect to multiple biological features
Method:
Relationships among genes on a specific feature are encoded as a graph structure where nodes correspond to genes (or gene products). This might suggest a functional link between genes.
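
The encoding described in the Method can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the genes A to E and both edge lists are invented for the example, and `undirected` and `shortest_path_length` are names chosen here. Each biological feature becomes its own graph over the same genes, and a short path between two genes in more than one graph is the kind of evidence that might suggest a functional link. Breadth-first search is used because every edge counts as one step.

```python
from collections import deque


def undirected(edges):
    """Build an adjacency map from a list of (gene, gene) relations."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def shortest_path_length(adjacency, start, goal):
    """Number of edges on the shortest path, or None if goal is unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical feature graphs over the same five genes.
# Genome graph: edges join genes that are adjacent on the chromosome.
genome = undirected([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")])
# Pathway graph: edges join gene products in successive reaction steps.
pathway = undirected([("A", "C"), ("C", "B"), ("D", "E")])

genome_dist = shortest_path_length(genome, "A", "C")
pathway_dist = shortest_path_length(pathway, "A", "C")
unreachable = shortest_path_length(pathway, "A", "E")
print(genome_dist, pathway_dist, unreachable)
```

Genes A and C are close in both toy graphs, while A and E are connected in the genome graph only.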
[Figure: three graphs over the same genes: Genome (gene cluster), Pathway (enzyme cluster), Expression (co-expressed genes).]

Correlated Gene Clusters
* if all or most of the genes preserve their mutual relationships in multiple graphs, the biological relevance among these genes is considered to be supported with high probability
* can be used to characterize, classify or predict activities of genes
* finding clusters in different graphs is actually finding common sub-graphs among them
* this belongs to the category of NP-complete problems (non-deterministic polynomial time complete), a class of extremely hard problems with enormous computational complexity
* real problems are solved by heuristic algorithms

Algorithm
Heuristics:
* given the correspondences of nodes (vertices) in two graphs, we want to identify whether the two graphs contain locally related regions
* when two graphs are viewed as being linked by correspondences (additional edges), the problem becomes finding clusters of those correspondences
[Figure: correspondences between graphs G1 and G2 grouped into clusters C1 and C2 by the clustering algorithm.]
* if the set contains n correspondences (virtual edges), the problem is to cluster these n data points according to a certain distance measure
* each data point represents a correspondence (binary relationship, virtual edge) between a node in G1 = (V1, E1) and a node in G2 = (V2, E2)
* the distance between two data points i and j may be defined by two distances:
- d1(i, j) - for the shortest path between nodes v1i and v1j in graph G1
- d2(i, j) - for the shortest path between nodes v2i and v2j in graph G2
* first, each correspondence is considered as an individual cluster, so initially there are n clusters
* then single linkage clustering is performed according to the following criterion for whether to merge two clusters Ci and Cj:
$d(i, j) = \begin{cases} 1 & \text{if } \min\{d_1(r, s) \mid r \in C_i,\ s \in C_j\} \le 1 + \mathrm{Gap}_1 \text{ and } \min\{d_2(r', s') \mid r' \in C_i,\ s' \in C_j\} \le 1 + \mathrm{Gap}_2 \\ 0 & \text{otherwise} \end{cases}$
where Gap1 and Gap2 are non-negative gap parameters
* if d(i, j) = 1, the clusters Ci and Cj are merged
* extend the problem to finding a correlation of sub-graphs in more than two graphs (additional graphs provide information about gene-gene relations that cannot be found in the two graphs)
* correlated gene clusters are connected by links (hyperedges) that link genes from the corresponding clusters
* the distance between hyperedges reflects the shortest path length between the nodes in the graphs
* correlated gene clusters: we can find sets of tightly coupled nodes in the graphs by gathering hyperedges based on their distance
[Figure: clusters C1 and C2 of hyperedges across three graphs: G1 (Genome), G2 (Pathway), G3 (Similarity).]

Input datasets:
* n graphs: $G = \{G_1, \ldots, G_n\}$
* m hyperedges: $H = \{h_1, \ldots, h_m\}$
* denote a hyperedge with an n-tuple: $h_i = (x_{1,i_1}, \ldots, x_{n,i_n})$
* the kth element $h_i^k = x_{k,i_k}$ is the node of $G_k$ that constitutes the hyperedge ($1 \le k \le n$) (assume a hyperedge has exactly n nodes)
* set of hyperedges C1: $C_1 = \{h_{s_1}, \ldots, h_{s_p}\}$
* set of hyperedges C2: $C_2 = \{h_{t_1}, \ldots, h_{t_q}\}$
* set of kth elements of hyperedges in C1: $C_1^k = \{h_{s_1}^k, \ldots, h_{s_p}^k\}$
* set of kth elements of hyperedges in C2: $C_2^k = \{h_{t_1}^k, \ldots, h_{t_q}^k\}$
* d(x, y) is the length of the shortest path between nodes x and y in graph $G_s$ (can be calculated by Dijkstra’s algorithm)
* distance $dis(C_1^s, C_2^s) = \max\{d(x, y) \mid x \in C_1^s,\ y \in C_2^s\}$ for complete linkage clustering
* distance between two sets of hyperedges C1 and C2: $D(C_1, C_2) = \sum_{s=1}^{n} dis(C_1^s, C_2^s)$
In our case: H = {h1, h2, h3, h4, h5, h6}, C1 = {h1, h2, h3}, C2 = {h4, h5, h6}
$D(C_1, C_2) = \sum_{s=1}^{3} dis(C_1^s, C_2^s) = dis(C_1^1, C_2^1) + dis(C_1^2, C_2^2) + dis(C_1^3, C_2^3) = 8 + 8 + 8 = 24$

Clustering of hyperedges
* using the
distance D, we cluster the hyperedges, starting from an initial set of clusters, each of which consists of a single hyperedge only
* we iterate the procedure to pick the two clusters between which the distance is the smallest
* merge them into a new cluster (hierarchical clustering using distance D)
* in order to merge, the distance must be under a given threshold pi for graph Gi
* if the path length between two nodes x and y is larger than pi, set d(x, y) to infinity to eliminate that path; clusters with infinite distance are not merged
* when there are no more clusters between which the distance is finite, the clustering is done

Visualizing the clusters
* if the clusters were visualized in 2D, then the distance limit pi would correspond to a radius pi within which nodes of one graph can be clustered (to avoid merging distant genes in the same graph)
* only nodes within circles that intersect can potentially form a bigger cluster
[Figure: nodes x, y and z with circles of radius pi around them, and clusters C1, C2 and C3, shown in four steps.]
* Initial clusters: C1, C2, C3
* Merge C1 and C2
* Set distance to infinity between C1 and C3 and C1 and C2
* The final clusters: there are no more clusters to join since all the distances between clusters are now infinity

Homologous gene clusters in the genomes of E. coli and H. influenzae

Applications of the algorithm
* recent high-throughput technologies provide vast amounts of biological data, which contain unknown, hypothetical or erroneous relationships among genes
* standard approaches cluster data only according to one biological parameter (e.g.
microarray data are clustered by expression patterns only)
* may uncover links between known and unknown genes
* advantage of the correlated gene clusters: multiple biological criteria (graphs) are incorporated in the analysis; if relationships among genes/gene products cannot be explained or do not make sense in a single dataset, multiple datasets will increase the likelihood of deducing the potentially biologically significant relationships. The algorithm, alternatively, can emphasize a relationship that might have been uncovered by clustering techniques
* next step - find relations among genes in the correlated gene clusters

Summary:
* Microarray system basics
* Data collection, normalization, similarity measures
* Expression matrix and expression vectors
* Analysis of expression data - Clustering algorithms
- Hierarchical clustering
- k-means clustering
- Principal Component Analysis
- Supervised clustering
* “Extraction of correlated gene clusters by multiple graph comparison”
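
As a closing illustration of the last item, the distance D between two clusters of hyperedges can be sketched in Python. This is a sketch under stated assumptions, not the published implementation: the two chain-shaped graphs, the node names and the helper names (`path_lengths`, `path_graph`, `cluster_distance`) are invented here, and breadth-first search stands in for Dijkstra’s algorithm because every edge in the toy graphs has length one. A hyperedge is a tuple holding one node per graph; path lengths above the per-graph threshold are set to infinity so that such clusters are never merged.

```python
import math
from collections import deque


def path_lengths(adjacency, source):
    """Shortest path lengths (in edges) from source to every reachable node."""
    lengths = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in lengths:
                lengths[neighbour] = lengths[node] + 1
                queue.append(neighbour)
    return lengths


def path_graph(nodes):
    """Toy undirected chain n0 - n1 - n2 - ... (hypothetical data)."""
    adjacency = {n: set() for n in nodes}
    for a, b in zip(nodes, nodes[1:]):
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def cluster_distance(graphs, limits, c1, c2):
    """D(C1, C2): sum over graphs of the largest member-to-member path length."""
    total = 0
    for k, (graph, limit) in enumerate(zip(graphs, limits)):
        worst = 0  # complete linkage: keep the maximum over all pairs
        for h1 in c1:
            lengths = path_lengths(graph, h1[k])
            for h2 in c2:
                d = lengths.get(h2[k], math.inf)
                if d > limit:  # beyond the threshold for this graph
                    d = math.inf
                worst = max(worst, d)
        total += worst
    return total


graphs = [path_graph(["g1", "g2", "g3", "g4"]), path_graph(["p1", "p2", "p3", "p4"])]
limits = [2, 2]  # per-graph thresholds, chosen arbitrarily for the example
h1, h2, h3 = ("g1", "p1"), ("g2", "p2"), ("g4", "p4")

near = cluster_distance(graphs, limits, [h1], [h2])
far = cluster_distance(graphs, limits, [h1], [h3])
print(near, far)
```

Here `near` is finite, so the clusters {h1} and {h2} may be merged, while `far` is infinite because the nodes of h1 and h3 lie more than two steps apart in each graph.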