The Statistical Significance of Max-gap Clusters
Rose Hoberman, David Sankoff, Dannie Durand

[Figures: Glycolysis Pathway; Glycolysis Clusters; Clostridium acetobutylicum]

Gene Clustering for Functional Inference in Bacterial Genomes
The Use of Gene Clusters to Infer Functional Coupling, Overbeek et al., PNAS 96: 2896-2901, 1999.

[Figure: original genome; large scale duplication or speciation event; rearrangement, mutation]
• Gene content and order are preserved
• Similarity in gene content
• Neither content nor order is strictly preserved

[Figures from "Evolution of gene order conservation in prokaryotes", Tamames, Genome Biology 2, 2001: Gene insertion/loss; Local rearrangement]

Two Possible Questions
1. Given a set of genes that we believe are functionally related, determine if they cluster together spatially more than we would expect by chance (reference set scenario)
2. Identify all significantly conserved gene clusters as a starting point for making functional inferences (whole genome comparison)

Reference Set Scenario
• Model of a genome
 – G = 1, …, n; an ordered set of n unique genes
 – assume genes do not overlap
 – chromosome breaks ignored
• Reference gene scenario:
 – m genes of interest (in red) are pre-specified
 – want to find clusters of (a subset of) these genes

Whole Genome Scenario
Given two genomes, G = 1, …, n and H = 1, …, n, find all significant clusters of at least k homologs in close proximity in both genomes.

Outline
• What formalisms do we need to address these questions?
 – Definitions: formulate a cluster definition
 – Algorithms: identifying clusters in real data
 – Statistics: assess the significance of one or more clusters
• Reference set scenario
• Whole genome comparison
• Conclusion

Why develop a formal statistical model?
• Understand trends and verify that they match our expectations
• Choose parameters effectively
• Statistical tests for data analysis
Typically researchers use randomization tests to estimate statistical significance.

Cluster Definitions
• An intuitive notion of a cluster is a group of genes
 – occurring in close proximity
 – neither gene content nor order is strictly conserved
• Algorithms and statistics require a formal definition.
 – What properties are desirable?
 – Do existing definitions have these properties?
Possible Cluster Parameters
– size: number of red genes in the cluster
 • Example: cluster size ≥ 3 (figure label: size = 3 genes)
– length: number of genes between first and last red genes
 • Example: cluster length ≤ 6 (figure label: length = 6)
– density: proportion of red genes (size/length)
 • Example: density ≥ 0.5 (figure label: density = 6/11)
– compactness: maximum gap between adjacent red genes
 • Figure label: gap ≤ 4 genes

Max-Gap Cluster
[Figure: gap g]
• Commonly used in analysis of genomic data
• Desirable properties
 – Ensures minimum local density
 – Extensible: doesn't artificially limit cluster length
 – Disjoint: clusters will not overlap

Outline
• Formalisms
• Reference set scenario
• Whole genome comparison
• Conclusion

Formalisms
• Definitions: formulate a cluster definition
• Algorithms: identify clusters in real data
• Statistics: assess the significance of a cluster

A Statistical Model
• Given
 – a genome: G = 1, …, n unique genes
 – a set of m reference genes
 – a maximum-gap size g
• Null hypothesis:
 – Random gene order
• Alternate hypotheses:
 – Evolutionary history
 – Functional selection

Statistics of Max-Gap Gene Clusters
• We provide
 – analytical and dynamic programming solutions
 – to determine cluster significance exactly
 – for the reference set scenario
Hoberman, Sankoff and Durand. In "Proceedings of the RECOMB Satellite Workshop on Comparative Genomics", J. Lagergren, ed., Lecture Notes in Bioinformatics, Springer Verlag, in press.
Hoberman, Sankoff, Durand. Submitted to RECOMB 2005.

Test Statistic: Complete Clusters
The probability of observing all m reference genes in a max-gap cluster in G

Test Statistic: Incomplete Clusters
The probability of observing at least h of the m reference genes in a max-gap cluster in G

Cluster significance
[Plots: n = 1000, m = 50; n = 500, h = m/2]
• n = number of genes in each genome
• m = number of genes shared between the two genomes
• g = maximum allowed gap size
• h = size of cluster (e.g. number of red genes)

Significant Parameter Values (α = 0.0001)
[Plot: n = 500]

Outline
• Formalisms
• Reference set scenario
• Whole genome comparison
• Conclusion

Formalisms
• Definitions: formulate a cluster definition
• Algorithms: identify clusters in real data
• Statistics: assess the significance of one or more clusters

Whole genome comparison
[Figure: two genomes, each labeled "g 10"]
Find all sets of genes that form max-gap clusters in both genomes.

Properties of Max-Gap Clusters for Whole Genome Comparison
• Clusters are locally dense in both genomes
• Clusters are still guaranteed to be disjoint
• The definition is symmetric with respect to genome
Most existing cluster algorithms are not symmetric!
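
The incomplete-cluster test statistic from the reference set scenario (the probability that at least h of the m reference genes fall in a single max-gap cluster under random gene order) can also be approximated by the kind of randomization test mentioned earlier. The following is a minimal Python sketch under stated assumptions, not the exact analytical or dynamic programming solution the authors describe: the gap between two adjacent reference genes is taken to be the number of genes strictly between them, and the function names (`max_gap_clusters`, `largest_cluster_size`, `estimate_p_value`) are hypothetical.

```python
import random


def max_gap_clusters(positions, g):
    """Split reference-gene positions into maximal runs in which adjacent
    reference genes have at most g other genes between them (assumed gap
    convention: gap = position difference - 1)."""
    clusters, current = [], []
    for p in sorted(positions):
        if current and p - current[-1] - 1 > g:
            clusters.append(current)  # gap too large: close the current run
            current = []
        current.append(p)
    if current:
        clusters.append(current)
    return clusters


def largest_cluster_size(positions, g):
    """Number of reference genes in the largest max-gap cluster."""
    return max((len(c) for c in max_gap_clusters(positions, g)), default=0)


def estimate_p_value(n, m, g, h, trials=10000, seed=0):
    """Monte Carlo estimate of P(some max-gap cluster holds >= h of the m
    reference genes) when the m genes are placed uniformly at random among
    n positions (the random gene order null hypothesis)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        positions = rng.sample(range(n), m)
        if largest_cluster_size(positions, g) >= h:
            hits += 1
    return hits / trials


print(max_gap_clusters([1, 2, 5, 20, 22], g=2))  # [[1, 2, 5], [20, 22]]
print(estimate_p_value(n=500, m=50, g=5, h=25, trials=2000))
```

Because the reference genes sit on a line, the maximal runs found by a single left-to-right scan are exactly the disjoint clusters; this shortcut applies to one genome only and does not carry over to the two-genome case.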
Algorithms: Finding Max-Gap Clusters
[Figure: example with g = 2]
• There is no valid max-gap cluster of size two or three
• There is a valid max-gap cluster of size four
• A consequence of this is that a greedy iterative approach will not find all max-gap clusters
 – Specifically, larger clusters that don't contain smaller ones will not be found
• There is an efficient divide-and-conquer algorithm to find all max-gap clusters (Bergeron et al, 2002)
Since algorithms are generally not stated formally in application papers, we don't know whether people are actually getting what they think they're getting.

Formalisms
• Definitions: formulate a cluster definition
• Algorithms: identify clusters in real data
• Statistics: assess the significance of one or more clusters

Work in Progress…

Statistics: Whole genome comparison
[Figure: two genomes, each labeled "g 10"]
What is the probability that at least k genes form a max-gap cluster in both genomes?
Assuming identical gene content, the probability of finding a max-gap cluster of size at least k is always one!

An Example
[Figure: example with g = 1]
• A cluster of size k does not necessarily contain a cluster of size k-1
• When gene content is identical, there will always be a cluster of size n
• Therefore, for all k, there will always be a cluster of size at least k
• Therefore, the probability of finding a cluster of size at least k is always one!
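
Both observations above (a cluster of size four with no sub-cluster of size two or three, and the guaranteed cluster of size n under identical gene content) can be checked by brute force on small inputs. The sketch below is illustrative only: the two genomes `G` and `H` are a hypothetical example constructed for g = 2, not the one from the slide figure; the gap is assumed to count every gene lying between two cluster members; and `is_chain` and `common_chains` are made-up names. It tests only the gap constraint in both genomes, not maximality, and it is exponential in the number of shared genes, unlike the divide-and-conquer algorithm of Bergeron et al.

```python
from itertools import combinations
import random


def is_chain(genes, genome, g):
    """True if, reading the chosen genes in genome order, no two neighbours
    have more than g other genes between them."""
    pos = sorted(genome.index(x) for x in genes)
    return all(b - a - 1 <= g for a, b in zip(pos, pos[1:]))


def common_chains(G, H, g, size):
    """All sets of `size` shared genes that satisfy the max-gap constraint
    in both genomes (brute force over every subset of that size)."""
    shared = sorted(set(G) & set(H))
    return [set(s) for s in combinations(shared, size)
            if is_chain(s, G, g) and is_chain(s, H, g)]


# Hypothetical genomes: a, b, c, d are shared; x* and y* are unshared filler.
G = ["a", "x1", "x2", "b", "x3", "x4", "c", "x5", "x6", "d"]
H = ["b", "y1", "y2", "d", "y3", "y4", "a", "y5", "y6", "c"]

# With g = 2 no pair or triple qualifies, yet all four genes together do.
print([len(common_chains(G, H, 2, k)) for k in (2, 3, 4)])  # [0, 0, 1]

# Identical gene content: the full gene set always qualifies, for any order.
P = list(range(12))
Q = P[:]
random.Random(1).shuffle(Q)
print(len(common_chains(P, Q, 1, len(P))))  # 1
```

A greedy procedure that grows clusters from qualifying pairs would report nothing for `G` and `H`, which is the failure mode described above.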
Relaxing the Assumption of Identical Gene Content
• Assume only m of the n genes in each genome are shared
• If the longest run of "non-shared" genes is less than g then we are still guaranteed to find a complete cluster
More generally… Simulations of randomly ordered genomes show that large clusters may be very likely to occur merely by chance.

Unexpected Statistical Trends
[Plot: n = 1000, m = 250, g = 20; probability of a cluster of size 250 ~ 50%]
• There can be a significant probability of finding a cluster that includes all homologous gene pairs
• The significance of a cluster of size k can be less than that of a cluster of size k-1
• Probabilities are not monotonic
• Large clusters may not be significant

Outline
• Formalisms
• Reference set scenario
• Whole genome comparison
• Conclusion

Clusters Are Used in Many Other Applications
• Inferring functional coupling of genes in bacteria (Overbeek et al 1999)
• Recent polyploidy in Arabidopsis (Blanc et al 2003)
• Sequence of the human genome (Venter et al 2001)
• Duplications in Arabidopsis through comparison with rice (Vandepoele et al 2002)
• Duplications in Eukaryotes (Vision et al 2000)
• Identification of horizontal transfers (Lawrence and Roth 1996)
• Evolution of gene order conservation in prokaryotes (Tamames 2001)
• Ancient yeast duplication (Wolfe and Shields 1997)
• Genomic duplication during early chordate evolution (McLysaght et al 2002)
• Comparing rates of rearrangements (Coghlan and Wolfe 2002)
• Genome rearrangements after duplication in yeast (Seoighe and Wolfe 1998)
• Operon prediction in newly sequenced bacteria (Chen et al 2004)
• Breakpoints as phylogenetic features (Blanchette et al 1999)
• ...
Max-Gap Clusters are Especially Common
(same list of applications as on the previous slide)

Formal statistical models allow us to
 – understand trends and verify that they match our expectations
 – choose parameters effectively
 – conduct statistical tests for data analysis
Formal statistical models require
 – a formal cluster definition
 – a search procedure to find clusters
These issues are more complicated than they might seem!

Summary
Results: statistical tests of significance for max-gap clusters
• Reference set scenario
• Genome comparison (work in progress)
We need to
• explicitly consider the cluster properties we would like our definitions to satisfy
• rigorously evaluate whether our definition meets these requirements
• carefully prove that our search procedures match our stated definitions

Thank You
