Comparative Gene Expression Analysis:
Data Analysis Issues and Solutions
Vipin Kumar
William Norris Professor and
Head, Department of Computer Science
Problem Definition
• Goal: gain biological insights by analyzing which
genes have the same or divergent behavior
across the two organisms
• Techniques can identify pairs of orthologous
genes between two organisms
– C. albicans and S. cerevisiae have 4000 such pairs
5/23/2017
One Approach
(Judith Berman, et al.)
• Step 1: Identify clusters of functionally related
orthologous genes within one organism
– Select a functionally related group of genes
– Find clusters using similarities computed from the
gene expression data of the organism
• Step 2: Split each cluster into two clusters
– Use the similarities computed from the gene
expression data of the second organism
– Analyze for similarities and differences
Problems With Step 1
• Clustering techniques may produce incorrect
clusters due to
– Noise
– Varying cluster sizes
– Varying cluster density
– Non-globular cluster shape
– High-dimensional data
– Clusters that exist in subsets of the attributes
– Overlapping clusters
– Normalization
– Choice of similarity measure
Problems With Step 2
• Given a decomposition of genes into functionally
coherent clusters for two organisms, A and B, there are a
wide variety of relationships between the clusters of the
two organisms
– Some relationships are not captured by the current approach
– Example: a cluster of genes in organism A may, in organism B,
(1) be split into two standalone clusters, or
(2) be split into two groups that are each just a part of a larger cluster
• Focusing on one cluster at a time does not take into
account cross-talk between functional categories
Alternative #1: Similarity-Based Approach
• Directly compare the pattern of similarities of a gene g in both
organisms
[Figure label: orthologous pair of genes]
– Idea is that the function of a gene is conserved if its relationship to other
genes is similar in both organisms
• Degree of similarity reflects the degree of overlap
• Assign a value between 0 and 1 to each pair that indicates the
divergence or conservation of functionality
– A value of 0 implies divergence of function
– A value of 1 implies conservation of function
– Intermediate values indicate intermediate degrees of conservation/divergence
Shared Nearest Neighbor Approach
Gene-to-gene similarity matrix, first organism (genes 1-10):
Gene     1     2     3     4     5     6     7     8     9    10
1     1.00  0.63  0.22  0.39  0.34  0.48  0.15  0.33  0.58  0.49
2     0.63  1.00  0.48  0.86  0.14  0.32  0.93  0.68  0.69  0.33
3     0.22  0.48  1.00  0.69  0.71  0.55  0.45  0.28  0.46  0.27
4     0.39  0.86  0.69  1.00  0.55  0.60  0.54  0.41  0.69  0.07
5     0.34  0.14  0.71  0.55  1.00  0.67  0.31  0.42  0.40  0.65
6     0.48  0.32  0.55  0.60  0.67  1.00  0.69  0.60  0.66  0.77
7     0.15  0.93  0.45  0.54  0.31  0.69  1.00  0.35  0.81  0.53
8     0.33  0.68  0.28  0.41  0.42  0.60  0.35  1.00  0.70  0.78
9     0.58  0.69  0.46  0.69  0.40  0.66  0.81  0.70  1.00  0.72
10    0.49  0.33  0.27  0.07  0.65  0.77  0.53  0.78  0.72  1.00
Five nearest neighbors of each gene, first organism (most similar first):
Gene  Nearest neighbors
1     2, 9, 10, 6, 4
2     7, 4, 9, 8, 1
3     5, 4, 6, 2, 9
4     2, 9, 3, 6, 5
5     3, 6, 10, 4, 8
6     10, 7, 5, 9, 4
7     2, 9, 6, 4, 10
8     10, 9, 2, 6, 5
9     7, 10, 8, 4, 2
10    8, 6, 9, 5, 7
Gene-to-gene similarity matrix, second organism (genes 1-10):
Gene     1     2     3     4     5     6     7     8     9    10
1     1.00  0.56  0.67  0.39  0.26  0.48  0.33  0.33  0.31  0.49
2     0.56  1.00  0.21  0.51  0.74  0.87  0.46  0.48  0.70  0.51
3     0.67  0.21  1.00  0.69  0.62  0.73  0.27  0.51  0.19  0.47
4     0.39  0.51  0.69  1.00  0.94  0.61  0.43  0.36  0.85  0.07
5     0.26  0.74  0.62  0.94  1.00  0.24  0.51  0.35  0.54  0.51
6     0.48  0.87  0.73  0.61  0.24  1.00  0.23  0.53  0.38  0.78
7     0.33  0.46  0.27  0.43  0.51  0.23  1.00  0.82  0.30  0.42
8     0.33  0.48  0.51  0.36  0.35  0.53  0.82  1.00  0.88  0.36
9     0.31  0.70  0.19  0.85  0.54  0.38  0.30  0.88  1.00  0.20
10    0.49  0.51  0.47  0.07  0.51  0.78  0.42  0.36  0.20  1.00
Five nearest neighbors of each gene, second organism (most similar first):
Gene  Nearest neighbors
1     3, 2, 10, 6, 4
2     6, 5, 9, 1, 10
3     6, 4, 1, 5, 8
4     5, 9, 3, 6, 2
5     4, 2, 3, 9, 10
6     2, 10, 3, 4, 8
7     8, 5, 2, 4, 10
8     9, 7, 6, 3, 2
9     8, 4, 2, 5, 6
10    6, 5, 2, 1, 3
The idea is that the function of a gene is conserved if its
relationship to other genes is similar in both organisms
Shared Nearest Neighbor Approach
• For each pair of orthologues of a gene g in
organisms A and B
– Assign a measure based on the overlap of the two k-nearest-neighbor
lists
• Various possibilities
– Fraction of overlap in the k-nearest-neighbor lists (0 indicates no
overlap, 1 indicates complete overlap)
– Use a weighted measure (high weight for high ranks)
– A pair of orthologues that has a high value of the
measure is likely to have conserved behavior
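The two measures above can be sketched in a few lines of Python. This is only an illustration, not the presenter's implementation: the function names are made up, the two input rows are the gene 1 rows of the similarity matrices shown on the preceding slide, and the rank weighting (weight k for the closest neighbor down to 1 for the k-th) is just one assumed way to give "high weight for high ranks".

```python
def k_nearest(sim_row, self_id, k):
    """Return the k gene ids (1-based) most similar to gene self_id."""
    others = [g for g in range(1, len(sim_row) + 1) if g != self_id]
    others.sort(key=lambda g: sim_row[g - 1], reverse=True)
    return others[:k]


def overlap_fraction(nn_a, nn_b):
    """Fraction of genes shared by two k-nearest-neighbor lists (0 to 1)."""
    return len(set(nn_a) & set(nn_b)) / len(nn_a)


def weighted_overlap(nn_a, nn_b):
    """Rank-weighted overlap (assumed scheme): a shared gene contributes
    its rank weight from both lists; identical sets give 1."""
    k = len(nn_a)
    weight_a = {g: k - rank for rank, g in enumerate(nn_a)}
    weight_b = {g: k - rank for rank, g in enumerate(nn_b)}
    shared = set(nn_a) & set(nn_b)
    return sum(weight_a[g] + weight_b[g] for g in shared) / (k * (k + 1))


# Similarities of gene 1 to genes 1..10 in each organism (from the slide).
row_a = [1.00, 0.63, 0.22, 0.39, 0.34, 0.48, 0.15, 0.33, 0.58, 0.49]
row_b = [1.00, 0.56, 0.67, 0.39, 0.26, 0.48, 0.33, 0.33, 0.31, 0.49]

nn_a = k_nearest(row_a, self_id=1, k=5)   # [2, 9, 10, 6, 4]
nn_b = k_nearest(row_b, self_id=1, k=5)   # [3, 2, 10, 6, 4]
plain = overlap_fraction(nn_a, nn_b)      # 4 shared genes out of 5
weighted = weighted_overlap(nn_a, nn_b)
```

For gene 1 the two lists share four of five neighbors, so the plain measure is 0.8; the weighted variant comes out lower because the top-ranked neighbor in the second organism (gene 3) is not shared.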
Alternative #2: Contrast Sets
(motivated by Bay and Pazzani, KDD 99)
A contrast set is a set of genes that have very high similarity (in expression
patterns) in one organism and low similarity in the other organism
[Figure: genes-by-conditions matrices for the two organisms, with conditions C1 to Cm and C1 to Cn]
• Contrast sets can be overlapping
• The set of candidates is exponentially
large
• Recent advances make it possible
to prune the search space and
compute them efficiently
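A brute-force sketch can make the definition concrete. Everything specific here is an assumption for illustration: the thresholds `hi` and `lo`, the rule "every pair at least `hi` in one organism and at most `lo` in the other", and the two tiny similarity matrices are hypothetical, and the exhaustive enumeration is not the pruned search of Bay and Pazzani.

```python
from itertools import combinations


def is_contrast_set(genes, sim_high, sim_low, hi, lo):
    """True if every gene pair is similar in one organism (>= hi)
    and dissimilar in the other (<= lo)."""
    pairs = list(combinations(genes, 2))
    return (all(sim_high[i][j] >= hi for i, j in pairs)
            and all(sim_low[i][j] <= lo for i, j in pairs))


def contrast_sets(sim_high, sim_low, hi=0.8, lo=0.3, min_size=2):
    """Enumerate all gene subsets (exponentially many) and keep the contrast sets."""
    n = len(sim_high)
    found = []
    for size in range(min_size, n + 1):
        for genes in combinations(range(n), size):
            if is_contrast_set(genes, sim_high, sim_low, hi, lo):
                found.append(genes)
    return found


# Hypothetical similarity matrices for 4 orthologous genes (indices 0..3).
sim_a = [[1.0, 0.9, 0.9, 0.2],
         [0.9, 1.0, 0.9, 0.1],
         [0.9, 0.9, 1.0, 0.3],
         [0.2, 0.1, 0.3, 1.0]]
sim_b = [[1.0, 0.1, 0.2, 0.1],
         [0.1, 1.0, 0.1, 0.2],
         [0.2, 0.1, 1.0, 0.8],
         [0.1, 0.2, 0.8, 1.0]]

high_in_a = contrast_sets(sim_a, sim_b)   # coherent in A, not in B
high_in_b = contrast_sets(sim_b, sim_a)   # coherent in B, not in A
```

Under this particular definition, a set that fails the test cannot be rescued by adding genes, since the minimum pairwise similarity can only drop and the maximum can only rise. That is the kind of property a level-wise search could use to prune candidates instead of enumerating every subset.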
Alternatives for Step 2
• Assume that the output of step 1 is accurate
• Could apply statistical tests for comparing distributions
– The t-test is commonly used for comparing individual genes
– Issues for comparing clusters using this scheme
• Need to define a multi-dimensional version of the t-test
• Only tests equality of the sample means
• Assumes that the conditions are the same for the samples
• Could apply techniques developed for comparing
partitions (Strehl and Ghosh, 2002)
– Measures of distance between partitions
– Evaluate which clusters contribute most to the distance
– Catch: Works only for the same data set (Correlation matrices for
the two organisms in this case)
• Need a more general solution
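Both alternatives can be illustrated with standard-library Python. The sketch below is hedged in two ways. First, `welch_t` computes a two-sample t statistic for a single gene using the unequal-variance (Welch) form, which is an assumed choice since the slide does not say which variant is meant. Second, `nmi` computes normalized mutual information, the partition-agreement measure used by Strehl and Ghosh, and it requires both label lists to index the same items, which is exactly the catch noted above. All sample values are hypothetical.

```python
from collections import Counter
from math import log, sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Two-sample t statistic (Welch form); tests only equality of means."""
    standard_error = sqrt(variance(sample_a) / len(sample_a)
                          + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error


def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions of the same items."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mutual_info = sum((c / n) * log(c * n / (count_a[a] * count_b[b]))
                      for (a, b), c in count_ab.items())
    entropy_a = -sum((c / n) * log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * log(c / n) for c in count_b.values())
    if entropy_a == 0 or entropy_b == 0:
        return 0.0  # assumed convention when one partition is a single cluster
    return mutual_info / sqrt(entropy_a * entropy_b)


# Hypothetical expression values of one gene under two sets of conditions.
t_stat = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])

# Hypothetical cluster labels for the same four genes.
same = nmi([0, 0, 1, 1], [0, 0, 1, 1])         # identical partitions, near 1
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])    # independent partitions, 0
```

Neither function addresses the remaining issues listed above: the t statistic is one-dimensional and only compares means, and the partition measure does not apply directly when the clusterings come from two different data sets.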
General Solution to Step 2
• Compare sets of clusters derived from two different but
related data sets
• Biologically-inspired overlap-based approach:
– Consider cluster C1 of genes for first organism and C2 for second
– |C1∩C2|/|C2|>α1 implies genes in C2 still working together for a
function similar to C1
– Else, |C1∩C2|/|C2|<α2 implies genes in C2 have diverged into some
other functional category
• Guidelines for choosing the α’s:
– Ideally, α1→1 and α2→0
– α1 should be small enough to allow splits into more than two
clusters
– Similarly, α2 should be just high enough to be able to identify
outliers
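The overlap rule above translates directly into code. In this minimal sketch the function name, the default thresholds (0.8 and 0.2), and the "undetermined" label for ratios that fall between α2 and α1 are all assumptions for illustration; genes are represented by hypothetical identifiers of orthologous pairs so that clusters from the two organisms can be intersected.

```python
def classify_cluster(c1, c2, alpha1=0.8, alpha2=0.2):
    """Compare cluster c2 (second organism) against cluster c1 (first organism)
    using the ratio |C1 ∩ C2| / |C2|."""
    if not c2:
        raise ValueError("c2 must contain at least one gene")
    ratio = len(c1 & c2) / len(c2)
    if ratio > alpha1:
        return "conserved"      # genes in C2 still work together as in C1
    if ratio < alpha2:
        return "diverged"       # genes in C2 moved to another category
    return "undetermined"       # assumed label for the in-between case


# Hypothetical clusters, named by orthologous-pair identifiers.
c1 = {"g1", "g2", "g3", "g4", "g5"}
subset = classify_cluster(c1, {"g1", "g2", "g3", "g4"})                  # ratio 1.0
stray = classify_cluster(c1, {"g5", "g6", "g7", "g8", "g9", "g10"})      # ratio 1/6
mixed = classify_cluster(c1, {"g1", "g2", "g7", "g8"})                   # ratio 0.5
```

Because the ratio is normalized by |C2|, a small C2 that lies entirely inside a larger C1 still counts as conserved, which is what allows C1 to split into several clusters in the second organism.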