Phylogenomics and the evolution of gene repertoires in bacteria
Paris, MEP, June 18th 2005
Vincent Daubin
Bioinformatique et Génomique Evolutive

Menu
• Introduction: phylogenomics
  - A neologism and an old quote.
• Phylogenomics in Bacteria/Prokaryotes
  - What phylogenetic framework?
• Approaches for finding the Tree (if there is one)
  - Results obtained from different methods
• Reconstructing the history of complete genomes
• Conclusion

Phylogenomics
• Nothing makes sense in genomics except in a phylogenetic framework
  - Understanding the organization of genomes, the evolution of functions, the histories of duplications, etc.
• Numerous prokaryotic genomes (relatively small, dense in genes…)
• But what phylogenetic framework for prokaryotes?

[Figure: SSU rRNA phylogeny (Woese, 1987)]

From one gene…
ATTTGAC…
ACTTGAC…
ATTCGCC…
ATTCGCC…
… to another
TTTAGAC…
TCTAGAC…
TTACGCC…
TTACGAC…

Evidence for Lateral Gene Transfer
[Figure: UMP-Kinase gene tree, scale bar 0.5; legend: Bacteria, Archaea, Eukaryota; labels "Green plant" and "cyanobacteria"]

Multiple LGT or … ?
[Figure: Orotate Phosphoribosyltransferase gene tree, scale bar 0.5; legend: Bacteria, Archaea, Eukaryota]

Lateral gene transfer in bacteria
• Transduction
• Conjugation
• Transformation

Acquisition of function via LGT
[Figure: Ochman et al., 2000]

Massive gene “exchanges”!
[Figure: Ochman et al., 2000]

The alternative to the tree?
[Figure: Zhaxybayeva and Gogarten, 2002]

Methods used to reconstruct the tree of life using complete genomes
• Oligonucleotides/peptides (words) frequency in genome/proteome
• Global index of similarity (BLAST)
• Gene content
• Gene order
• Gene concatenation
• Supertrees
Annotations on the slide: statistics on genomes; whole genomes (proteome); hypothesis of homology not always clear; mostly gene homology; mostly gene orthology; gene orthology (alignments); finding xenology.

Word frequency
[Diagram: whole genomes (proteome) → table of word counts → tree]
Word    sp1   sp2   sp3   …
AAAA    104   63    307   …
AAAC    …     …     …
AAAG
AAAT
…
• Count words (correct for % of letters)
• Compute distances (= differences in word usage)
• Build a tree
• Re-sample words for support
• Hypothesis of homology?

Statistics on genomes
• Pride et al., 2003
• Based on tetranucleotide frequency in 27 genomes
• Distance ~ differences in usage
• Relatively little signal for resolving the tree of bacteria, BUT resolves recent and very deep nodes (i.e., domains)

Statistics on genomes
• Qi et al., 2004
• K-strings in proteins (i.e., words of K letters); here, K=6
• 109 genomes
• Gets better with longer strings (relationship to gene homology?)
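The word-frequency recipe above (count words, turn differences in usage into distances) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the Euclidean distance is a stand-in rather than the measure used by Pride et al. or Qi et al., and the correction for base composition mentioned on the slide is left out.

```python
from collections import Counter
from itertools import product
import math


def kmer_frequencies(seq, k=4):
    """Relative frequency of every overlapping DNA word of length k."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    # Words containing ambiguous bases (e.g. N) are ignored.
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def usage_distance(seq_a, seq_b, k=4):
    """Euclidean distance between two word-usage profiles (an assumption)."""
    fa = kmer_frequencies(seq_a, k)
    fb = kmer_frequencies(seq_b, k)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))


def distance_matrix(genomes, k=4):
    """Pairwise distances for a dict {species name: sequence}."""
    names = sorted(genomes)
    matrix = [[usage_distance(genomes[a], genomes[b], k) for b in names]
              for a in names]
    return names, matrix


# Toy sequences, not real genomes.
names, matrix = distance_matrix(
    {"sp1": "ACGTACGTACGT", "sp2": "ACGTACGTACGT", "sp3": "AAAAAAAATTTT"}, k=2
)
```

The resulting matrix would then be passed to a distance-based tree-building method, and support could be estimated by re-sampling words, as the slide suggests.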
Blast scores
[Diagram: compare proteomes (BLASTP…) → distance matrix → tree]
       sp1   sp2   sp3   …
sp1    0     0.5   …
sp2    0.5   0     …
sp3    …     …     0
…
• Average % identity, normalized BLAST scores… (restrict to orthologous genes)
• Transform into distance
• Build a tree
• Re-sample pairs of matching genes for support (remove discordant matches)

Blast scores
• Clarke et al., 2002
• Normalized BLASTP scores (= match/self_match)
• 37 genomes (3 domains)
• Finds most of the phyla defined by rRNA
• Remove phylogenetically discordant matches (little effect)

Gene content
[Diagram: compare proteomes (BLASTP…) → parsimony matrix → tree]
        sp1   sp2   sp3   …
Gene1   0     1     0     …
Gene2   1     1     1     …
…
• Code presence/absence of:
  - Orthologs (reciprocal best matches)
  - Homologs (families)
  - Domains, folds (superfamilies)
• Compute distance (correct for genome size), Dollo parsimony…
• Build a tree
• Sample subset of genes for statistical support

Gene content
• Yang et al., 2004
• Folds (= superfamilies) in 119 bacterial genomes
• Distance method
• Finds a few phyla defined by rRNA

Gene content
• Snel et al., 1999
• Orthologs in 23 genomes
• Finds most of the phyla defined by rRNA

Gene order
[Diagram: compare proteomes (BLASTP…) → order of genes a to e in sp1, sp2, sp3 → tree]
• Assign orthologs
  - keep those present in ≥ 2
  - keep those present in all
• Compute distances based on:
  - conservation of pairs of neighbors
  - number of breakpoints
  - sequence of inversions…

Gene order
• Wolf et al., 2001
• Based on conservation of pairs of neighbors
• Finds most of the phyla defined by rRNA + suggests some non-trivial groups
[Figure: Wolf et al., 2001]

Gene concatenation
[Diagram: gene alignments → super-alignment]
• Select genes that can be concatenated:
  - reduce missing data
  - analyze congruence (… or not)
• Bootstrap: re-sample sites (re-sample genes)

Gene concatenation
• Brochier et al., 2002
• 57 genes in 45 species (8857 positions)
• Unrooted tree of bacteria
• Finds all phyla defined by rRNA + suggests some non-trivial groupings

Whatever distance
Comparison of (some) phylogenomic distances
[Figure: phylogenomic distances (y-axis, 0 to 1.8) plotted against 16S rRNA divergence (F84; x-axis, 0 to 0.6). Fits reported on the plot:]
Distance                                 R²
Concatenated proteins (9 genes, JTT)     0.6477
Presence/absence = (s-1)                 0.1849
Gene order = (s-1)                       0.132
Gene order = -ln(s)                      0.0913
Presence/absence = -ln(s)                0.0756

Supertrees
[Diagram: gene trees with overlapping sets of taxa (A to G) → combination of trees]
• Select trees that can be combined = analyze congruence (… or not)
• Bootstrap: re-sample sites (MRP) (re-sample trees)

Supertree of bacteria
• Daubin et al., 2002
• Bacterial supertree based on the combination of 121 gene trees with 7 ≤ number of species ≤ 32
• Matrix Representation with Parsimony
• Finds all phyla defined by rRNA + suggests some non-trivial groupings
[Figure: bacterial supertree with support values; labelled groups: Low G+C Gram-positives, Mycoplasmas, High G+C Gram-positives, Proteobacteria, Chlamydiales, Spirochaetes]

A tree of bacteria?
[Figure: the supertree (Daubin et al., 2002; 121 genes) shown with the concatenation of ribosomal proteins (Brochier et al., 2002; 57 genes); labelled groups: Low G+C Gram-positives, Mycoplasmas, High G+C Gram-positives, Proteobacteria, Chlamydiales, Spirochaetes]

A consensus for the tree of life
• Black: already known from rRNA
• Red: established from complete genome analysis (congruence among methods)
• Dashed red: suggested by complete genome analysis
[Figure: Wolf et al., 2002]

Phylogenomics in bacteria
• Nature of gene innovation in bacteria?
• In eukaryotes: mainly duplication
• What about bacteria?
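As a side note on the supertree slides: Matrix Representation with Parsimony turns each source gene tree into binary characters, one per clade, which are then analyzed together by parsimony. The sketch below shows only that coding step, using the standard convention (1 = taxon in the clade, 0 = taxon in the tree but outside the clade, ? = taxon absent from that tree). The input format (each tree given as a set of taxa plus a list of clades) and the function name are assumptions made for illustration, not the pipeline of Daubin et al. (2002).

```python
def mrp_matrix(trees):
    """Build an MRP character matrix from source trees.

    Each source tree is a pair (taxa, clades): `taxa` is the set of species
    present in that tree, `clades` a list of sets, one per internal clade.
    Returns {taxon: character string}, one character per clade.
    """
    all_taxa = sorted(set().union(*(taxa for taxa, _ in trees)))
    rows = {t: [] for t in all_taxa}
    for taxa, clades in trees:
        for clade in clades:
            for t in all_taxa:
                if t not in taxa:
                    rows[t].append("?")   # taxon missing from this gene tree
                elif t in clade:
                    rows[t].append("1")   # taxon belongs to the clade
                else:
                    rows[t].append("0")   # taxon in the tree, outside the clade
    return {t: "".join(chars) for t, chars in rows.items()}


# Two toy gene trees with overlapping taxon sets.
toy_trees = [
    ({"A", "B", "C", "D"}, [{"A", "B"}]),
    ({"B", "C", "E"}, [{"B", "C"}]),
]
toy_matrix = mrp_matrix(toy_trees)
```

The parsimony search on the resulting matrix is left to dedicated phylogenetics software; re-sampling source trees before coding would give the tree-level bootstrap mentioned on the slide.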
The origin of “duplicates” in bacterial genomes
- Intra-genomic duplication (diagram: a → a + a’): PARALOGS
- LGT of a gene that already has a homolog in the genome (diagram: b, incoming bx → b + bx): XENOLOGS
Calling these genes “duplicates” or “paralogs” is an overstatement: “SYNOLOGS” = PARALOGS || XENOLOGS

Phylogenomics of Gammaproteobacteria (13 species)
• Ancient group (0.5-1 billion years; May et al., 2001)
• Model of bacterial diversification:
  - Escherichia coli K12: commensal
  - Salmonella typhimurium LT2: human pathogen
  - Buchnera aphidicola AP: endosymbiont of aphids
  - Haemophilus influenzae: commensal
  - Pasteurella multocida: human pathogen
  - Yersinia pestis (CO92 and KIM): animal pathogen (agent of plague)
  - Pseudomonas aeruginosa PAO1: human opportunistic pathogen
  - Xanthomonas axonopodis: plant pathogen (Citrus)
  - Xanthomonas campestris: plant pathogen (crucifers)
  - Xylella fastidiosa: plant pathogen (Citrus)
  - Wigglesworthia brevipalpis: endosymbiont of tsetse fly
  - Vibrio cholerae: human pathogen (agent of cholera)
• High rate of LGT reported (e.g., E. coli)

Gene families in γ-proteobacteria
[Figure: number of families (y-axis, 0 to 8000) by number of species in which the family is found (x-axis, 1 to 13). Annotations: "Genes unique to a genome" (families found in 1 species) and "minimal core of genes in γ-proteobacteria" (families found in all 13 species).]
Number of species    1     2     3    4    5    6    7    8    9    10   11   12   13
Number of families   8035  2693  988  552  332  224  266  205  145  127  174  142  275

The core of genes
• Among the 275 families that group genes from the 13 species: 205 families with 1 gene per species, i.e. true orthologs.
• Do these genes have the same history?

ML tests (ELW, SH, KH…)
[Flowchart:]
• Sequence alignment → ML tree (likelihood Ln1)
• Phylogenetic hypothesis to test (e.g., “species phylogeny”) → likelihood LnX
• Are Ln1 and LnX significantly different?
  - NO: accept the phylogenetic hypothesis
  - YES: possible LGT
[Figure: bar chart over tested topologies 1 to 13 (labels: SSU rRNA, concatenated proteins, best topology, other hypotheses); legend: not different from the ML tree / different from the ML tree]

The organismal phylogeny
[Figure: tree based on the concatenation of 203 genes (Lerat et al., 2003); scale bar 0.2; support values shown: 100. Number shown beside each taxon:]
E. coli            4183
S. typhimurium     4203
Y. pestis CO92     3599
Y. pestis KIM      3879
B. aphidicola      564
W. brevipalpis     653
H. influenzae      1709
P. multocida       2015
V. cholerae        2724+1081
P. aeruginosa      5540
X. fastidiosa      2680
X. axonopodis      4192
X. campestris      4029

Example: Maximum likelihood test with one synolog
[Diagram: a family with a synolog in sp A; the ML trees are compared with the species topology (test on ΔL)]
- Allows detection of possible LGT and identification of the true ortholog
- Caution: incongruence can result from duplication + loss (results need to be checked manually)

Results for the phylogenetically “informative” fraction of the genomes
[Figure: percentage of LGT (y-axis, 0 to 80) by number of species (x-axis, 6 to 13), for families with 0, 1, 2 or >2 synologs]
• Synology is associated with a high frequency of LGT
• Many of the so-called duplicates in bacterial genomes in fact arise by LGT
Lerat et al., 2005

But families having synologs are rare
[Figure: number of families (y-axis, 0 to 250; off-scale bars labelled 7655, 2429, 835, 457) by number of species (x-axis, 1 to 13), split by number of synologs (# genes - # genomes): 0, 1, 2, …, 10, >10]
• Families having synologs represent less than 2% of the total

The auxiliary genome of bacteria is an ORFanage
[Figure: Welch et al., 2002]
Genes unique to a genome
- Some genes are annotated as phages or secretion proteins…
- Most have unknown function
- Most are ORFans (no homolog known in databases)

What are ORFans?
• Rapidly evolving genes, or possibly pseudogenes (cf. Amiri et al., 2003)
• Genes produced de novo from non-coding sequences
• Artifacts resulting from the algorithms used to recognize coding sequences in genomes
• Genes transferred from organisms that have no representatives in the databases

How can we understand ORFans?
• By definition, no comparative study is possible (evolutionary rate, structure determination by homology…)
• But… if the mechanism producing ORFans is continuous over time, we can find ORFans for every node in a tree
• Search for ORFans in the lineage leading to E. coli MG1655 (K12)

Examine genes restricted to each clade at increasing phylogenetic depths (n0, n1, n2, etc.) as well as those ancestral to all taxa (native).
At each node, define two types of genes:
- ORFans: genes restricted to a clade and having no other homologs
- HOPs (Heterogeneous Occurrence in Prokaryotes): genes restricted to a clade but with matches in some distantly related organism (LGT events)
This approach allows:
1. comparisons of the sequence features of ORFans of different ages
2. comparisons of ORFans to acquired & ancestral genes
3. use of comparative methods to obtain information about evolutionary rate and functional status of ORFans (e.g., n2: E. coli vs. Salmonella)
Daubin & Ochman, 2004

Length of ORFans and HOPs
[Figure: length in bp (y-axis, 400 to 1200) of HOPs and ORFans by phylogenetic depth, from n0 (younger) through n1, n2, n3, n4 to native (older). Clade labels on the plot include E. coli, E. coli + Shigella, S. enterica, enterics, Vibrio/Haem and γ-proteobacteria.]

Evolutionary rates of ORFans and HOPs
[Figure: Escherichia-Salmonella Ka/Ks (y-axis, 0.06 to 0.14) of HOPs and ORFans for n2, n3, n4 and native genes]
• Ka/Ks is low, indicating that ORFans encode proteins; however, both Ka & Ks are elevated

G+C content of ORFans and HOPs
[Figure: % G+C3 (y-axis, 42 to 58) of HOPs and ORFans by phylogenetic depth, from n0 (younger) to native (older)]

ORFans in A+T rich genomes
[Figure: G+C3 of “natives” and ORFans in Helicobacter pylori and Streptococcus pneumoniae]
Daubin et al., 2003

Features of ORFans
• ORFans arise quickly in genomes & can be strain-specific
  - Do not originate from native DNA that is shared among strains
• ORFans are short and very A+T-rich
  - Consistently A+T-richer donor?
• Average Ka/Ks of ORFans is much less than 1 (often < 0.2)
  - Most ORFans are functional (although functions are unassigned)
• ORFans evolve faster than other genes in the genome
  - Under fewer constraints or possibly due to positive selection
• ORFans originate by lateral gene transfer but by different vehicles, mechanisms or processes than HOPs (which are present in other Bacteria, Archaea or Eukaryotes)
• Given their base compositions, lack of homologs & functional status, ORFans most likely derive from DNA phages (which are poorly represented in the databases)
Rocha and Danchin, 2002

And if you are not yet convinced,
1. Younger ORFans tend to be clustered (as if co-inherited in a single event), whereas older ORFans are dispersed (by rearrangements & deletions). ORFan cluster sizes average 2.1 genes in n0/n1 and 1.3 genes in n4.
2. Genes in DNA phage genomes are short. Average is 615 bp, and only 471 bp for those encoding ‘hypothetical’ proteins.
3. ORFans often occur at tRNA genes or near translocatable sequences.
4. ORFans in E. coli have di-nucleotide frequencies close to those of coliphages.

ORFans are conserved through time and may assume key functions
- An ORFan from n2 (only in E. coli and Salmonella) is the ribosomal protein S22, expressed in stationary phase
- Some ORFans from n3 (restricted to the enterics) have been retained in the highly reduced genome of Buchnera: e.g., dnaT and dnaC, which are essential to E. coli
Daubin & Ochman, 2004

The genealogy of bacterial genomes
[Figure: tree of the 13 genomes, with a number beside each: S. typhimurium (4206), E. coli (4187), Y. pestis KIM (3883), Y. pestis CO92 (3599), W. brevipalpis (653), B. aphidicola (564), P. multocida (2015), H. influenzae (1709), V. cholerae (3805), P. aeruginosa (5540), X. fastidiosa (2680), X. campestris (4030), X. axonopodis (4193)]
• Ubiquitous genes are rare and show little evidence of LGT
• Genes seem to be acquired continuously
• Most of the acquired genes are completely new for the genome (no homologs)
• A lot of them are even ORFans
• ORFan genes appear as a contribution of phage to bacterial evolution
• Because genomes are not increasing in size, non-homologous replacement may play a major role

Acknowledgements
• Emmanuelle Lerat (esp. LGT in Gamma-proteobacteria!)
• Manolo Gouy
• Guy Perrière
• Howard Ochman
• Nancy Moran