* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download J-Clustering - Hennig
Genetic engineering wikipedia , lookup
X-inactivation wikipedia , lookup
Quantitative trait locus wikipedia , lookup
Pathogenomics wikipedia , lookup
Essential gene wikipedia , lookup
Long non-coding RNA wikipedia , lookup
Public health genomics wikipedia , lookup
Epigenetics of neurodegenerative diseases wikipedia , lookup
Gene therapy of the human retina wikipedia , lookup
Gene therapy wikipedia , lookup
Polycomb Group Proteins and Cancer wikipedia , lookup
History of genetic engineering wikipedia , lookup
Vectors in gene therapy wikipedia , lookup
Epigenetics of diabetes Type 2 wikipedia , lookup
Gene nomenclature wikipedia , lookup
Minimal genome wikipedia , lookup
Gene desert wikipedia , lookup
Genomic imprinting wikipedia , lookup
Biology and consumer behaviour wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Genome evolution wikipedia , lookup
Nutriepigenomics wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Genome (book) wikipedia , lookup
Epigenetics of human development wikipedia , lookup
Microevolution wikipedia , lookup
Mir-92 microRNA precursor family wikipedia , lookup
Ridge (biology) wikipedia , lookup
Gene expression programming wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
The following slides have been adapted from http://www.tm4.org/ to be presented at the Follow-up course on Microarray Data Analysis (Nov 20-24 2006, PICB Shanghai) by Peter Serocka Analysis of Multiple Experiments TIGR Multiple Experiment Viewer (MeV) The Expression Matrix is a representation of data from multiple microarray experiments. Each element is a log ratio (usually log 2 (Cy5 / Cy3) ) Exp 1 Exp 2 Exp 3 Exp 4 Exp 5 Exp 6 Gene 1 Gene 2 Gene 3 Gene 4 Black indicates a log ratio of zero, i. e., Cy5 and Cy3 are very close in value Gene 5 Gene 6 Gray indicates missing data Green indicates a negative log ratio , i.e., Cy5 < Cy3 Red indicates a positive log ratio, i.e, Cy5 > Cy3 Expression Vectors -Gene Expression Vectors encapsulate the expression of a gene over a set of experimental conditions or sample types. 2 Log2(cy5/cy3) 0 1 2 3 4 5 6 7 8 -2 -0.8 1.5 1.8 0.5 -0.4 -1.3 0.8 1.5 Expression Vectors As Points in ‘Expression Space’ Exp 1 Exp 2 Exp 3 G1 G2 G3 G4 G5 -0.8 -0.4 -0.6 0.9 1.3 -0.3 -0.8 -0.8 1.2 0.9 -0.7 -0.7 -0.4 1.3 -0.6 Similar Expression Experiment 3 Experiment 2 Experiment 1 Distance and Similarity -the ability to calculate a distance (or similarity, it’s inverse) between two expression vectors is fundamental to clustering algorithms -distance between vectors is the basis upon which decisions are made when grouping similar patterns of expression -selection of a distance metric defines the concept of distance Distance: a measure of similarity between genes. Exp 1 Exp 2 Gene A x1A x2A Gene B x1B x2B Exp 3 Exp 4 x3A x3B x4A x4B Exp 5 Exp 6 x5A x6A x5B x6B p1 Some distances: (MeV provides 11 metrics) 1. Euclidean: i6= 1 (xiA - xiB)2 2. Manhattan: i = 1 |xiA – xiB| 6 3. Pearson correlation p0 log2(cy5/cy3) Distance is Defined by a Metric 2 0 -2 Distance Metric: Euclidean Pearson(r*-1) D 1.4 -0.90 D 4.2 -1.00 Algorithms… Hierarchical Clustering (HCL) HCL is an agglomerative clustering method which joins similar genes into groups. The iterative process continues with the joining of resulting groups based on their similarity until all groups are connected in a hierarchical tree. (HCL-1) Hierarchical Clustering g1 g2 g3 g4 g5 g6 g7 g8 g1 is most like g8 g1 g8 g2 g3 g4 g5 g6 g7 g4 is most like {g1, g8} g1 g8 g4 g2 g3 g5 g6 g7 (HCL-2) Hierarchical Clustering g1 g8 g4 g2 g3 g5 g6 g7 g5 is most like g7 g1 g8 g4 g2 g3 g5 g7 g6 {g5,g7} is most like {g1, g4, g8} g1 g8 g4 g5 g7 g2 g3 g6 (HCL-3) Hierarchical Tree g1 g8 g4 g5 g7 g2 g3 g6 (HCL-4) Hierarchical Clustering During construction of the hierarchy, decisions must be made to determine which clusters should be joined. The distance or similarity between clusters must be calculated. The rules that govern this calculation are linkage methods. (HCL-5) Agglomerative Linkage Methods Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked. Three linkage methods that are commonly used are: • Single Linkage • Average Linkage • Complete Linkage (HCL-6) Single Linkage Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of the another cluster. Single linkage tends to create ‘elongated’ clusters with individual genes chained onto clusters. DAB = min ( d(ui, vj) ) where u A and v B for all i = 1 to NA and j = 1 to NB DAB (HCL-7) Average Linkage Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster. Average linkage has a slight tendency to produce clusters of similar variance. DAB = 1/(NANB) ( d(ui, vj) ) where u A and v B for all i = 1 to NA and j = 1 to NB DAB (HCL-8) Complete Linkage Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of the another cluster. Complete linkage tends to create clusters of similar size and variability. DAB = max ( d(ui, vj) ) where u A and v B for all i = 1 to NA and j = 1 to NB DAB (HCL-9) Comparison of Linkage Methods Single Ave. Complete (HCL-10) K-Means / K-Medians Clustering (KMC)– 1 1. Specify number of clusters, e.g., 5. 2. Randomly assign genes to clusters. G1 G2 G3 G4 G5 G6 G7 G8 G9 G10 G11 G12 G13 K-Means Clustering – 2 3. Calculate mean / median expression profile of each cluster. 4. Shuffle genes among clusters such that each gene is now in the cluster whose mean / median expression profile (calculated in step 3) is the closest to that gene’s expression profile. G3 G11 G6 G1 G8 G4 G7 G5 G2 G10 G9 G12 G13 5. Repeat steps 3 and 4 until genes cannot be shuffled around any more, OR a user-specified number of iterations has been reached. K-Means / K-Medians is most useful when the user has an a-priori hypothesis about the number of clusters the genes should group into. Cluster Affinity Search Technique (CAST) -uses an iterative approach to segregate elements with ‘high affinity’ into a cluster -the process iterates through two phases -addition of high affinity elements to the cluster being created -removal or clean-up of low affinity elements from the cluster being created Clustering Affinity Search Technique (CAST)-1 Affinity = a measure of similarity between a gene, and all the genes in a cluster. Threshold affinity = user-specified criterion for retaining a gene in a cluster, defined as %age of maximum affinity at that point 1. Create a new empty cluster C1. 2. Set initial affinity of all genes to zero 3. Move the two most similar genes into the new cluster. G3 G2 G1 Empty cluster C1 G6 G13 G8 G14 G4 G12 G5 G9 G7 G15 G11 G10 Unassigned genes 4. Update the affinities of all the genes (new affinity of a gene = its previous affinity + its similarity to the gene(s) newly added to the cluster C1) ADD GENES: 5. While there exists an unassigned gene whose affinity to the cluster C1 exceeds the user-specified threshold affinity, pick the unassigned gene whose affinity is the highest, and add it to cluster C1. Update the affinities of all the genes accordingly. CAST – 2 REMOVE GENES: 6. When there are no more unassigned high-affinity genes, check to see if cluster C1 contains any elements whose affinity is lower than the current threshold. If so, remove the lowest-affinity gene from C1. Update the affinities of all genes by subtracting from each gene’s affinity, its similarity to the removed gene. 7. Repeat step 6 while C1 contains a low-affinity gene. G6 G14 G12 Current cluster C1 G1 G15 G3 G8 G13 G2 G4 G5 G9 G11 G7 G10 Unassigned genes 8. Repeat steps 5-7 as long as changes occur to the cluster C1. 9. Form a new cluster with the genes that were not assigned to cluster C1, repeating steps 1-8. 10. Keep forming new clusters following steps 1-9, until all genes have been assigned to a cluster QT-Clust (from Heyer et. al. 1999) (HJC) -1 1. Compute a jackknifed distance between all pairs of genes (Jackknifed distance: The data from one experiment are excluded from both genes, and the distance is calculated. Each experiment is thus excluded in turn, and the maximum distance between the two genes (over all exclusions) is the jackknifed distance. This is a conservative estimate of distance that accounts for bias that might be introduced by single outlier experiments.) 2. Choose a gene as the seed for a new cluster. Add the gene which increases cluster diameter the least. Continue adding genes until additional genes will exceed the specified cluster diameter limit. G6 G8 G4 G2 G5 G3 G7 G9 G10 Currently unassigned genes G11 G1 G11 “Seed” gene G12 Current cluster 3. Repeat step 2 for every gene, so that each gene has the chance to be the seed of a new cluster. All clusters are provisional at this point. QT-Clust – 2 4. Choose the largest cluster obtained from steps 2 and 3. In case of a tie, pick one of the largest clusters at random. G1 “Seed” gene G7 G11 G8 G12 G7 G11 G4 G1 G8 G2 G10 G9 “Seed” gene G3 G5 G4 G3 G9 “Seed” gene Pick this cluster 5. All genes that are not in the cluster selected above are treated as currently unassigned. Repeat steps 2-4 on these unassigned genes. 6. Stop when the last cluster thus formed has fewer genes than a user-specified number. All genes that are not in a cluster at this point are treated as unassigned. SOTA - 1 Self Organizing Tree Algorithm • Dopazo, J. , J.M Carazo, Phylogenetic reconstruction using and unsupervised growing neural network that adopts the topology of a phylogenetic tree. J. Mol. Evol. 44:226-233, 1997. • Herrero, J., A. Valencia, and J. Dopazo. A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17(2):126-136, 2001. SOTA Characteristics SOTA - 2 • Divisive clustering, allowing high level hierarchical structure to be revealed without having to completely partition the data set down to single gene vectors • Data set is reduced to clusters arranged in a binary tree topology • The number of resulting clusters is not fixed before clustering • Neural network approach which has advantages similar to SOMs such as handling large data sets that have large amounts of ‘noise’ SOTA - 3 SOTA Topology Centroid Vector Parent Node ap Members aw Winning Cell as Sister Cell a*=migration factor (as < ap < aw) SOTA - 4 Adaptation Overview -each gene vector associated with the parent is compared to the centroid vector of its offspring cells. -the most similar cell’s centroid and its neighboring cells are adapted using the appropriate migration weights. SOTA - 5 -following the presentation of all genes to the system a measure of system diversity is used to determine if training has found an optimal position for the offspring. -if the system diversity improves (decreases) then another training epoch is started otherwise training ends and a new cycle starts with a cell division. SOTA - 6 The most ‘diverse’ cell is selected for division at the start of the next training cycle. SOTA - 7 Growth Termination Expansion stops when the most diverse cell’s diversity falls below a threshold. SOTA - 8 Each training cycle ends when the overall tree diversity ‘stabilizes’. This triggers a cell division and possibly a new training cycle. 0.2 Tree Diversity 0.15 0.1 0.05 0 0 100 200 300 400 Adaptation Epoch Number 500 Self-organizing maps (SOMs) – 1 1. Specify the number of nodes (clusters) desired, and also specify a 2-D geometry for the nodes, e.g., rectangular or hexagonal G7 G1 G6 G2 G3 G5 N = Nodes G = Genes G8 G9 G4 G10 N1 N2 N3 N4 G11 G12 G14 G26 G27 N5 G13 G15 N6 G28 G29 G16 G17 G23 G25 G24 G18 G19 G20 G22 G21 SOMs – 2 2. Choose a random gene, e.g., G9 3. Move the nodes in the direction of G9. The node closest to G9 (N2) is moved the most, and the other nodes are moved by smaller varying amounts. The further away the node is from N2, the less it is moved. G7 G2 G1 G3 G6 G5 G9 G4 G26 G27 N1 N2 N3 N4 N5 N6 G8 G10 G11 G12 G13 G14 G15 G28G29 G16 G17 G18G19 G23 G24 G25 G20 G21 G22 SOM Neighborhood Options Bubble Neighborhood radius N1 N2 N3 N5 G7 G8 G9 G10 G11 Gaussian Neighborhood N1 N2 N4 N3 N4 N6 N5 N6 Some move, alpha is constant. G7 G8 G9 G10 G11 All move, alpha is scaled. SOMs – 3 4. Steps 2 and 3 (i.e., choosing a random gene and moving the nodes towards it) are repeated many (usually several thousand) times. However, with each iteration, the amount that the nodes are allowed to move is decreased. 5. Finally, each node will “nestle” among a cluster of genes, and a gene will be considered to be in the cluster if its distance to the node in that cluster is less than its distance to any other node G7 G2 G1 G6 N1 G3 G4 G5 G8 G9N2 G10 G11 G12N4G13 G14 G15 G26 G27 N3 G28G29 G16 G17 G18G19 G23 N5 G24 G25 G20N6 G21 G22 Gene Shaving Compute first principle component of expression matrix Shave off a% (default 10%) of genes with lowest values of dot product with 1st principal component Results in a series of nested clusters Repeat until only one gene remains Choose cluster of appropriate size as determined by gap statistic calculation Orthogonalize expression matrix with respect to the average gene in the cluster and repeat shaving procedure Gene Shaving Gap statistic calculation (choosing cluster size) Quality measure for clusters: R2 Large implies a tight cluster of coherent genes R2 = between variance within variance between variance of mean gene across experiments Create random permutations of the expression matrix and calculate R2 for each Compare R2 of each cluster to that of the entire expression matrix Choose the cluster whose R2 is furthest from the average R2 of the permuted expression matrices. within variance of each gene about the cluster average The final cluster contains a set of genes that are greatly affected by the experimental conditions in a similar way. Relevance Networks Set of genes whose expression profiles are predictive of one another. Can be used to identify negative correlations between genes Genes with low entropy (least variable across experiments) are excluded from analysis. 10 H = -p(x)log2(p(x)) x=1 Relevance Networks A .92 A .75 .15 .37 E B B E .02 .28 .63 .51 .40 D C D .11 The expression pattern of each gene compared to that of every other gene. The ability of each gene to predict the expression of each other gene is assigned a correlation coefficient Tmin = 0.50 Tmax = 0.90 Correlation coefficients outside the boundaries defined by the minimum and maximum thresholds are eliminated. C The remaining relationships between genes define the subnets