Comparative Gene Expression Analysis: Data Analysis Issues and Solutions
Vipin Kumar
William Norris Professor and Head, Department of Computer Science

Problem Definition
• Goal: gain biological insights by analyzing which genes have the same or divergent behavior across the two organisms
• Techniques can identify pairs of orthologous genes between two organisms
  – C. albicans and S. cerevisiae have 4000 such pairs

5/23/2017

One Approach (Judith Berman, et al.)
• Step 1: Identify clusters of functionally related orthologous genes within one organism
  – Select a functionally related group of genes
  – Find clusters using similarities computed from the gene expression data of the organism
• Step 2: Split each cluster into two clusters
  – Use the similarities computed from the gene expression data of the second organism
  – Analyze for similarities and differences

Problems With Step 1
• Clustering techniques may produce incorrect clusters due to
  – Noise
  – Varying cluster sizes
  – Varying cluster density
  – Non-globular cluster shape
  – High-dimensional data
  – Clusters that exist in subsets of the attributes
  – Clusters may be overlapping
  – Normalization
  – Choice of similarity measure

Problems With Step 2
• Given a decomposition of genes into functionally coherent clusters for two organisms, A and B, there is a wide variety of relationships between the clusters of the two organisms
  – Some relationships are not captured by the current approach
  – Example: a cluster of genes in organism A may (1) be split into two standalone clusters, or (2) be split into two groups that are just a part of larger clusters
• Focusing on one cluster at a time does not take into account cross-talk between functional categories

Alternative #1: Similarity-Based Approach
• Directly compare the pattern of similarities of a gene g in both organisms
  – The idea is that the function of a gene is conserved if its relationship to other genes is similar in both organisms
• Degree of similarity reflects the degree of overlap
• Assign a value between 0 and 1 to each pair that indicates the divergence or conservation of functionality
  – A value of 0 implies divergence of function
  – A value of 1 implies conservation of function
  – Intermediate values indicate intermediate degrees of conservation/divergence
[Figure: orthologous pair of genes]

Shared Nearest Neighbor Approach
The idea is that the function of a gene is conserved if its relationship to other genes is similar in both organisms.
[Figure: two gene-by-gene similarity matrices for genes 1 to 10, drawn as heat maps with a color scale from 0.1 to 1. Each matrix is listed below together with the five nearest neighbors of every gene, in decreasing order of similarity.]

Matrix 1
Gene     1     2     3     4     5     6     7     8     9    10   Nearest neighbors
   1  1.00  0.63  0.22  0.39  0.34  0.48  0.15  0.33  0.58  0.49   2 9 10 6 4
   2  0.63  1.00  0.48  0.86  0.14  0.32  0.93  0.68  0.69  0.33   7 4 9 8 1
   3  0.22  0.48  1.00  0.69  0.71  0.55  0.45  0.28  0.46  0.27   5 4 6 2 9
   4  0.39  0.86  0.69  1.00  0.55  0.60  0.54  0.41  0.69  0.07   2 9 3 6 5
   5  0.34  0.14  0.71  0.55  1.00  0.67  0.31  0.42  0.40  0.65   3 6 10 4 8
   6  0.48  0.32  0.55  0.60  0.67  1.00  0.69  0.60  0.66  0.77   10 7 5 9 4
   7  0.15  0.93  0.45  0.54  0.31  0.69  1.00  0.35  0.81  0.53   2 9 6 4 10
   8  0.33  0.68  0.28  0.41  0.42  0.60  0.35  1.00  0.70  0.78   10 9 2 6 5
   9  0.58  0.69  0.46  0.69  0.40  0.66  0.81  0.70  1.00  0.72   7 10 8 4 2
  10  0.49  0.33  0.27  0.07  0.65  0.77  0.53  0.78  0.72  1.00   8 6 9 5 7

Matrix 2
Gene     1     2     3     4     5     6     7     8     9    10   Nearest neighbors
   1  1.00  0.56  0.67  0.39  0.26  0.48  0.33  0.33  0.31  0.49   3 2 10 6 4
   2  0.56  1.00  0.21  0.51  0.74  0.87  0.46  0.48  0.70  0.51   6 5 9 1 10
   3  0.67  0.21  1.00  0.69  0.62  0.73  0.27  0.51  0.19  0.47   6 4 1 5 8
   4  0.39  0.51  0.69  1.00  0.94  0.61  0.43  0.36  0.85  0.07   5 9 3 6 2
   5  0.26  0.74  0.62  0.94  1.00  0.24  0.51  0.35  0.54  0.51   4 2 3 9 10
   6  0.48  0.87  0.73  0.61  0.24  1.00  0.23  0.53  0.38  0.78   2 10 3 4 8
   7  0.33  0.46  0.27  0.43  0.51  0.23  1.00  0.82  0.30  0.42   8 5 2 4 10
   8  0.33  0.48  0.51  0.36  0.35  0.53  0.82  1.00  0.88  0.36   9 7 6 3 2
   9  0.31  0.70  0.19  0.85  0.54  0.38  0.30  0.88  1.00  0.20   8 4 2 5 6
  10  0.49  0.51  0.47  0.07  0.51  0.78  0.42  0.36  0.20  1.00   6 5 2 1 3

Shared Nearest Neighbor Approach
• For each pair of orthologues of a gene g in organisms A and B
  – Assign a measure based on the overlap of the k nearest neighbor lists
• Various possibilities
  – Fraction of overlap in the k nearest neighbor lists (0 indicates no overlap, 1 indicates complete overlap)
  – Use a weighted measure (high weight for high ranks)
  – A pair of orthologues with a high value of the measure is likely to have conserved behavior

Alternative #2: Contrast Sets (motivated by Bay and Pazzani, KDD 99)
A contrast set is a set of genes that have very high similarity (in expression patterns) for one organism and low similarity for the other organism.
[Figure: Genes × Conditions matrices (conditions C1 to Cm and C1 to Cn) and Genes × Genes matrices]
• Contrast sets can be overlapping
• The set of candidates is exponentially large
• Recent advances make it possible to prune the search space and compute them efficiently

Alternatives for Step 2
• Assume that the output of step 1 is accurate
• Could apply statistical tests for comparing distributions
  – t-test commonly used for comparing individual genes
  – Issues for comparing clusters using this scheme
    • Need to define a multi-dimensional version of the t-test
    • Only tests equality of the sample means
    • Assumes that the conditions are the same for the samples
• Could apply techniques developed for comparing partitions (Strehl and Ghosh, 2002)
  – Measures of distance between partitions
  – Evaluate which clusters contribute most to the distance
  – Catch: works only for the same data set (correlation matrices for the two organisms in this case)
• Need a more general solution

General Solution to Step 2
• Compare sets of clusters derived from two different but related data sets
• Biologically inspired overlap-based approach:
  – Consider cluster C1 of genes for the first organism and C2 for the second
  – |C1 ∩ C2| / |C2| > α1 implies genes in C2 are still working together for a function similar to C1
  – Else, |C1 ∩ C2| / |C2| < α2 implies genes in C2 have diverged into some other functional category
• Guidelines for choosing the α's:
  – Ideally, α1 → 1 and α2 → 0
  – α1 should be small enough to allow splits into more than two clusters
  – Similarly, α2 should be just high enough to be able to identify outliers
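
Sketch: shared nearest neighbor overlap
The two measures listed on the "Shared Nearest Neighbor Approach" slide (fraction of overlap, and a rank-weighted variant) can be illustrated roughly as follows. This is a minimal sketch, not the authors' implementation. The function names, the toy matrices SIM_A and SIM_B, the tie-breaking rule, and the particular weighting (a neighbor at rank r gets weight k - r) are all assumptions made for illustration; the slides only say that high ranks should receive high weight.

```python
def knn_list(sim, g, k):
    """Return the k genes most similar to gene g (excluding g), best first."""
    others = [j for j in range(len(sim)) if j != g]
    # Sort by decreasing similarity; ties broken by gene index (arbitrary choice).
    others.sort(key=lambda j: (-sim[g][j], j))
    return others[:k]


def overlap_fraction(sim_a, sim_b, g, k):
    """Fraction of genes shared by the two k-nearest-neighbor lists of gene g."""
    shared = set(knn_list(sim_a, g, k)) & set(knn_list(sim_b, g, k))
    return len(shared) / k


def weighted_overlap(sim_a, sim_b, g, k):
    """Rank-weighted overlap in [0, 1].

    Hypothetical weighting: a shared neighbor at 0-based rank r in a list
    contributes k - r for that list, so top-ranked neighbors count most.
    """
    list_a = knn_list(sim_a, g, k)
    rank_b = {gene: r for r, gene in enumerate(knn_list(sim_b, g, k))}
    score = sum((k - r) + (k - rank_b[gene])
                for r, gene in enumerate(list_a) if gene in rank_b)
    max_score = 2 * sum(k - r for r in range(k))  # reached on complete overlap
    return score / max_score


# Toy symmetric similarity matrices for four genes in organisms A and B
# (invented values, not taken from the slides).
SIM_A = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]
SIM_B = [
    [1.0, 0.9, 0.2, 0.8],
    [0.9, 1.0, 0.7, 0.1],
    [0.2, 0.7, 1.0, 0.3],
    [0.8, 0.1, 0.3, 1.0],
]

if __name__ == "__main__":
    for gene in range(4):
        print(gene,
              overlap_fraction(SIM_A, SIM_B, gene, k=2),
              round(weighted_overlap(SIM_A, SIM_B, gene, k=2), 3))
```

In this toy data, gene 1 keeps the same two neighbors in both organisms and scores 1 on both measures, while gene 0 shares only one of its two neighbors and scores 0.5 on the plain fraction.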
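
Sketch: checking candidate contrast sets
The contrast set definition on the "Alternative #2" slide (very high similarity in one organism, low similarity in the other) might be checked for a candidate gene set as shown below. This is only a brute-force illustration under stated assumptions: "very high" and "low" are read as every pairwise similarity being at least `high` or at most `low`, both thresholds and the toy matrices are invented, and the pruning techniques mentioned on the slide are not implemented, so the enumeration is exponential in the number of genes.

```python
from itertools import combinations


def is_contrast_set(sim_a, sim_b, genes, high, low):
    """True if all pairs in `genes` are highly similar in one organism
    and weakly similar in the other (assumed reading of the definition)."""
    pairs = list(combinations(sorted(genes), 2))
    if not pairs:
        return False  # a single gene has no pairwise similarities to contrast

    def all_high(sim):
        return all(sim[i][j] >= high for i, j in pairs)

    def all_low(sim):
        return all(sim[i][j] <= low for i, j in pairs)

    return (all_high(sim_a) and all_low(sim_b)) or \
           (all_high(sim_b) and all_low(sim_a))


def brute_force_contrast_sets(sim_a, sim_b, size, high, low):
    """Enumerate every gene subset of the given size and keep the contrast sets.

    No pruning is done here, so this is only practical for tiny examples.
    """
    n = len(sim_a)
    return [set(c) for c in combinations(range(n), size)
            if is_contrast_set(sim_a, sim_b, c, high, low)]


# Toy symmetric similarity matrices (invented values).
CS_SIM_A = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.85, 0.2],
    [0.8, 0.85, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]
CS_SIM_B = [
    [1.0, 0.2, 0.1, 0.7],
    [0.2, 1.0, 0.15, 0.1],
    [0.1, 0.15, 1.0, 0.3],
    [0.7, 0.1, 0.3, 1.0],
]

if __name__ == "__main__":
    print(brute_force_contrast_sets(CS_SIM_A, CS_SIM_B, size=3, high=0.8, low=0.2))
```

Here genes 0, 1 and 2 are tightly related in the first matrix and only weakly related in the second, so they form the single contrast set of size 3 under these thresholds.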
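
Sketch: overlap-based comparison of two clusters
The overlap rule on the "General solution to step 2" slide, |C1∩C2|/|C2| compared against α1 and α2, could be applied along these lines. This is a minimal sketch: the function name, the gene labels, the threshold values 0.7 and 0.2, and the "undetermined" label for ratios between α2 and α1 are assumptions, since the slides do not say how that middle range is treated.

```python
def classify_overlap(c1, c2, alpha1, alpha2):
    """Classify cluster c2 (second organism) relative to c1 (first organism).

    Uses the ratio |c1 & c2| / |c2| from the slides:
      ratio > alpha1  -> genes in c2 still work together for a function similar to c1
      ratio < alpha2  -> genes in c2 have diverged into another functional category
    Anything in between is reported as "undetermined" (an assumption).
    """
    if not c2:
        raise ValueError("c2 must contain at least one gene")
    if alpha2 > alpha1:
        raise ValueError("expected alpha2 <= alpha1")
    ratio = len(c1 & c2) / len(c2)
    if ratio > alpha1:
        return "conserved"
    if ratio < alpha2:
        return "diverged"
    return "undetermined"


# Hypothetical clusters of orthologous genes, identified by shared labels.
C1 = {"g1", "g2", "g3", "g4"}
C2_SIMILAR = {"g1", "g2", "g3", "g9"}    # 3 of 4 genes also in C1 -> ratio 0.75
C2_HALF = {"g1", "g2", "g8", "g9"}       # 2 of 4 genes also in C1 -> ratio 0.5
C2_DIFFERENT = {"g7", "g8", "g9"}        # no genes in common      -> ratio 0.0

if __name__ == "__main__":
    for name, c2 in [("similar", C2_SIMILAR), ("half", C2_HALF),
                     ("different", C2_DIFFERENT)]:
        print(name, classify_overlap(C1, c2, alpha1=0.7, alpha2=0.2))
```

Lowering alpha1 makes it easier for several smaller clusters of the second organism to count as parts of a split C1, which matches the guideline that α1 should be small enough to allow splits into more than two clusters.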