Construction of Genome Trees from Conservation Profiles of Proteins
Fredj Tekaia, Edouard Yeramian
Institut Pasteur

Outline
• Species tree construction and difficulties;
• Post-genome era species tree construction;
• Conservation profiles;
• Genome tree construction based on conservation profiles;
• Conclusions;
• References.

Species tree - Tree Of Life
• 16S/18S rRNA tree (Woese 1990);
Woese and others have used rRNA comparisons to construct a “Tree Of Life” showing the evolutionary relationships of a wide variety of organisms. The “Tree Of Life” has long served as a useful tool for describing the history and relationships of organisms over evolutionary time. Each species is represented as a branching point, or node, on the tree, and the branches represent paths of descent from a parental node.

Martin & Embley. Nature 431:152-5. (2004)
• The three-domain proposal based on the ribosomal RNA tree. Woese et al. PNAS 87:4576-4579. (1990)
• The three-domain proposal, with continuous lateral gene transfer among domains. Doolittle. Science 284:2124-8. (1999)
• The two-empire proposal, separating eukaryotes from prokaryotes and eubacteria from archaebacteria. Mayr, E. PNAS 95:9720-23. (1998)
• The ring of life, incorporating lateral gene transfer but preserving the prokaryote-eukaryote divide. Rivera & Lake. Nature 431:152-5. (2004)

• Genomic Databases and the Tree of Life. Keith A. Crandall and Jennifer E. Buhay. Science 306:1144-1145. (2004)
• Prospects for Building the Tree of Life from Large Sequence Databases. Driskell et al. Science 306:1172-1174. (2004)
• The 1.2-Megabase Genome Sequence of Mimivirus. Raoult et al. Science 306:1344-1350. (2004)

Pennisi, E. (1998). Genome data shake tree of life. Science 280:672-4.
New genome sequences are mystifying evolutionary biologists by revealing unexpected connections between microbes thought to have diverged hundreds of millions of years ago.
Such findings suggest constructing species trees from whole gene content.

Genome phylogeny based on gene content
• Genome phylogeny based on gene content. Snel, Bork, Huynen (1999). Nature Genetics 21:108-110.
• Tekaia, Lazcano & Dujon (1999). Genome Research 9:550-7.
[Figures: genome trees with branches labelled B, A and E.]

Complete genomes (http://www.genomesonline.org/)
• 2208 projects
• 460 published (14-11-2006)
• 1054 prokaryotes
• 631 eukaryotes
[Figure: chart with values 387, 29 and 44.]

Gene tree - Species tree
[Figure: a gene tree with duplication and speciation events along a time axis, compared with the species tree for A, B and C. Source: T.A. Brown, Genomes, 2nd edition, 2002.]

Problems with species tree construction
• The main difficulties in species tree construction include extensive incongruence between alternative phylogenies generated from single-gene data sets:
  - genes do not evolve at the same rate nor in the same way;
  - the evolutionary history inferred from one gene may differ from what another gene appears to show.

Alternative solutions: integrative methods
• “supertree”: the supertree approach estimates phylogenies for subsets of genes with good overlap, then combines these subtree estimates into a supertree.
  - It depends on the ability to distinguish between orthologs and paralogs;
  - Supertree approaches are controversial, in part because the methodology results in a degree of disconnection between the underlying genetic data and the final tree produced (Bininda-Emonds et al. 2002).
• “phylogenomic tree” (based on concatenation of a gene sample common to the considered species S1, ..., Sn):
  - genes do not evolve at the same rate nor in the same way;
  - a limited number of genes are shared among all species (The tree of one percent. Dagan and Martin (2006). Genome Biology 7:118).

More generally, these methods suffer from difficulties related to phylogenetic tree construction:
• global sequence alignment (quality, gaps, ...);
• different evolutionary histories of genes;
• substitution saturation; ...
• and, more seriously, gene sampling difficulties.
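The limitation that only a few genes are shared among all species can be illustrated with a toy presence/absence table. This is a minimal sketch: the family names, species names and values are hypothetical, and the function `universal_families` is introduced here only for illustration.

```python
# Toy presence/absence table: gene family -> set of species carrying it.
# All names and values are hypothetical, for illustration only.
families = {
    "fam1": {"S1", "S2", "S3", "S4"},
    "fam2": {"S1", "S2"},
    "fam3": {"S2", "S3", "S4"},
    "fam4": {"S1", "S2", "S3", "S4"},
    "fam5": {"S4"},
}
species = {"S1", "S2", "S3", "S4"}


def universal_families(families, species):
    """Return the families present in every species.

    Only these families can enter a concatenated alignment that
    covers all of the considered species.
    """
    return sorted(name for name, present in families.items() if species <= present)


shared = universal_families(families, species)
fraction = len(shared) / len(families)
print(shared, f"{fraction:.0%} of families usable for concatenation")
```

As more species are added, the set of universally shared families can only shrink or stay the same, which is the sampling concern raised above.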
Gene tree - Species tree: The gene sampling problem
Adapted from: Linder, Moret, Nakhleh, Warnow.
[Figures: true species tree for A, B and C, with blue and red gene copies.]
• Blue is lost in A and B; red is lost in C: gene tree ≠ species tree.
• All red orthologs have been lost in the 3 species. Luckily, sampling gives the blue orthologs: the true species tree is reconstructed.
• All versions of the gene are in the 3 species: gene trees are the same as the species tree.

The genome tree is another alternative for constructing the species tree.
• The concept of the genome tree is based on overall gene content similarity (it considers more than single-gene information).

Methodology
[Figure: data matrix T (entries kij > 0, with supplementary elements marked "sup") analysed by Correspondence Analysis (factorial axes F1 ... Fp), followed by Classification.]
• orthogonal system;
• use of Euclidean distance;

Systematic Analysis of Completely Sequenced Organisms
• In silico species-specific comparisons (Tekaia & Dujon. J. Mol. Evol. 1999) (27 eukaryal, 19 archaeal and 33 bacterial species: 541880 proteins);
[Figure: Proteome 1 ... Proteome n compared with blastp (pam250, SEG filter).]
• 99 species (B: 33; A: 19; E: 27)
• total of 541880 proteins
• Degree of ancestral duplication and of ancestral conservation between pairs of species;
• Families of paralogs (Partition-MCL);
• Families of orthologs (Partition-MCL);
• Distribution of orthologous families according to the three domains of life;
• Determination of the protein dictionary (orthologs);
• Determination of protein conservation profiles;

Note on: Homologs - Paralogs - Orthologs
[Figure: an ancestral gene duplicates into A and B; after speciation, Species-1 carries A1 and B1 and Species-2 carries A2 and B2.]
• Homologs: A1, B1, A2, B2
• Paralogs: A1 vs B1 and A2 vs B2
• Orthologs: A1 vs A2 and B1 vs B2

• Large-scale comparative analysis of predicted proteomes revealed significant evolutionary processes.
[Figure: evolutionary processes between an ancestor and a species genome. Labels: Phylogeny*, Expansion*, Exchange*, Deletion*, selection*, genesis, duplication, HGT, loss.]
Expansion, Exchange and Deletion are noise. They should be eliminated or at least reduced.
To overcome some of these limitations, we consider genome tree construction from “Protein Conservation Profiles” and attempt to reduce noisy evolutionary processes.

Conservation profiles
• 99 species (B: 33; A: 19; E: 27); 541880 proteins
p   0111111000111111111000110110111101001111101111
• A “conservation profile” is an n-component binary vector describing a protein's conservation pattern across n species. Each component is 1 or 0, according to the presence or absence of homologs in the corresponding species.
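The definition above can be sketched in a few lines. This is a minimal illustration, assuming the set of species in which a protein has a homolog is already known (for example from the blastp comparisons); the species names, the `hits` set and the function name `conservation_profile` are hypothetical.

```python
def conservation_profile(species_with_homolog, species_order):
    """Build the binary conservation profile of one protein.

    species_with_homolog: set of species in which a homolog was detected
                          (assumed to be given; how hits are called is not shown here).
    species_order:        fixed ordering of the n species, shared by all proteins.
    Returns a string of n characters, '1' for presence and '0' for absence.
    """
    return "".join("1" if s in species_with_homolog else "0" for s in species_order)


# Hypothetical example with n = 5 species.
species_order = ["S1", "S2", "S3", "S4", "S5"]
hits = {"S1", "S3", "S4"}
profile = conservation_profile(hits, species_order)
print(profile)  # 10110
```

Keeping the species ordering fixed across all proteins is what makes two profiles comparable position by position.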
Main interesting properties of conservation profiles:
• Conservation profiles are signatures of evolutionary relationships;
• A conservation profile is the trace of protein evolutionary histories jointly captured in a set of n species (multidimensional feature);

Protein conservation profiles
        E                A              B
        S1..............|.............|................Sn
G1,1    100000000000000000000000000000000000000000000000
G2,1    111111111111111111111111111111111111111111111111
G3,1    111111110011111111111111011101110101111111101111
        ................................................
Gn1,1   100001110001000000000000000000000000000000000000
G1,2    010000000000000000010100000000000111000011100011
G2,2    010000000000000000010100000000000111000011100011
        ................................................
Gn2,2   111111110011111111111111011101110101111111101111
        ................................................
G1,n    011110100000000000000000001000000000000000000001
G2,n    111111110011111111100011011101110101111111101111
G3,n    111111110011111111100011011101110101111111101111
        ................................................
Gnp,n   100110000000000000000000000000000000000000000001
Table: 541880 proteins x 99 species
• Different conservation profiles represent different evolutionary histories

Distinct conservation profiles
• 541880 original total proteins (99 species)
• 442460 non-specific proteins, i.e. conservation profiles (82% of all proteins)
• 184130 distinct conservation profiles (42% of the conservation profiles)
        100000000000000000000000000000000000000000000000
        111111111111111111111111111111111111111111111111
        111111110011111111111111011101110101111111101111
        010000000000000000010100000000000111000011100011
        100110000000000000000000000000000000000000000001
        ................................................
(one representative from each set of identical conservation profiles)
• Effect of the duplication process is reduced
• This set is indicative of the various observed evolutionary histories.
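Keeping one representative from each set of identical conservation profiles amounts to a simple deduplication. The sketch below uses short hypothetical profiles (4 species instead of 99) and a hypothetical helper name, `distinct_profiles`; it does not reproduce any other filtering step of the original analysis.

```python
def distinct_profiles(profiles):
    """Keep one representative per set of identical profiles.

    dict.fromkeys preserves first-seen order, so the result is
    deterministic for a given input order.
    """
    return list(dict.fromkeys(profiles))


# Hypothetical toy input: 7 proteins over 4 species, with duplicates.
profiles = ["1000", "1111", "1101", "1111", "0101", "1101", "1111"]
distinct = distinct_profiles(profiles)
print(distinct)
print(f"{len(distinct)} distinct profiles out of {len(profiles)}")
```

Because proteins from a duplicated family tend to share the same profile, collapsing identical profiles is one way the effect of duplication is reduced.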
[Figure: fractions (×10000) of distinct conservation profiles (axis from 0 to 250) against conservation weights, i.e. the sum of "1" (presence) in a profile (classes c01 to c99).]
Presence in the 184130 distinct conservation profiles: Mean=32.2; SD=23.3; min=1; Max=99.

Genome tree construction: data matrices
• 184130 distinct conservation profiles (various evolutionary histories), compared between species columns i and j:
        100000000000000000000000000000000000000000000000
        111111111111111111111111111111111111111111111111
        111111110011111111111111011101110101111111101111
        010000000000000000010100000000000111000011100011
        100110000000000000000000000000000000000000000001
        ................................................
• Jaccard similarity scores between species:
        sij = N11 / (N11 + N01 + N10)
  where N11, N01 and N10 are respectively the total occurrences of (1,1), (0,1) and (1,0) between species i and j.
• T = { Tij = sij ; i = 1, ..., n ; j = 1, ..., n } gives the profiles tree.
Tekaia F, Yeramian E. (2005). PLoS Comput Biol. 1(7):e75

Conclusions: Methodology
• Species classification is not an easy task!
• Species tree construction should take into account the whole information included in the genomes;
• Methods that take into account whole-genome information are still needed;
• Correspondence analysis might be helpful in revealing evolutionary trends embedded in the multidimensional relationships obtained from large-scale genome comparisons;

Conclusions...
• Conservation profiles represent the most conserved and meaningful evolutionary signals jointly captured in a set of species;
• Thus they should correspond to the most accurate type of markers for species classification;
• In principle, the profiles tree derived from distinct conservation profiles should considerably minimize genome acquisition effects and should reflect less noisy phylogenetic signals;
• The profiles tree presents evidence of conservation of stable phylogenetic relationships and reveals unconventional species clustering;
• The profiles tree corresponds to the classification of the evolutionary scenarios.

Acknowledgments: The support of:
• The Institut Pasteur (Strategic Horizontal Programme on Anopheles gambiae)
• The Ministère de la Recherche Scientifique (France): ACI-IMPBIO-2004–98-GENEPHYS program.
• Bernard Dujon (Institut Pasteur).

References:
• Tekaia, F. and Dujon, B. (1999). Pervasiveness of gene conservation and persistence of duplicates in cellular genomes. Journal of Molecular Evolution 49:591-600.
• Tekaia, F., Lazcano, A. and Dujon, B. (1999). The genomic tree as revealed from whole proteome comparisons. Genome Res. 9:550-557.
• Tekaia, F., Yeramian, E. and Dujon, B. (2002). Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene 297:51-60.
• Tekaia, F. and Yeramian, E. (2005). Genome Trees from Conservation Profiles. PLoS Comput Biol. 1(7):e75.
• Tekaia, F. and Yeramian, E. (2006). Evolution of Proteomes: Fundamental signatures and global trends in amino acid composition. BMC Genomics 7:307.
• Tekaia, F. and Latgé, J.P. (2005). Aspergillus fumigatus: saprophyte or pathogen? Curr Opin Microbiol. 8:385-92. Review.
• Systematic analysis of completely sequenced organisms: http://www.pasteur.fr/~tekaia/sacso.html

References:
• Bininda-Emonds, O.R.P. (2005). Supertree Construction in the Genomic Age. Methods in Enzymology 395:745-757.
• Bininda-Emonds, O.R.P., Gittleman, J.L. and Steel, M.A. (2002). The (super)Tree Of Life: Procedures, Problems, and Prospects. Annual Review of Ecology and Systematics 33:265-289.
• Dagan, T. and Martin, W. (2006). The tree of one percent. Genome Biology 7:118.
• Delsuc, F., Brinkmann, H. and Philippe, H. (2005). Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 6:361-75. Review.
• Doolittle (1999). Science 284:2124-8.
• Driskell et al. (2004). Science 306:1172-1174.
• http://www.genomesonline.org/gold.cgi (list of genome projects)
• Crandall, K.A. and Buhay, J.E. (2004). Science 306:1144-1145.
• Linder, Moret, Nakhleh and Warnow: http://compbio.unm.edu/networks1.ppt
• Martin & Embley (2004). Nature 431:152-5.
• MCL: a cluster algorithm for graphs: http://micans.org/mcl/
• Pennisi, E. (1998). Genome data shake tree of life. Science 280:672-4.
• Rivera & Lake (2004). Nature 431:152-5.
• Raoult et al. (2004). Science 306:1344-1350.
• Snel, Bork, Huynen (1999). Genome phylogeny based on gene content. Nature Genetics 21:108-110.
• Snel, B., Huynen, M.A. and Dutilh, B.E. (2005). Genome trees and the nature of genome evolution. Annu Rev Microbiol. 59:191-209. Review.
• Woese et al. (1990). PNAS 87:4576-4579.
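The Jaccard similarity scores used for the profiles tree, sij = N11 / (N11 + N01 + N10), can be sketched as follows. This is a minimal pure-Python illustration on hypothetical toy profiles (3 species); the function name `jaccard_matrix` is an assumption, and the step from the matrix T to a tree is not shown.

```python
def jaccard_matrix(profiles):
    """Compute the species-by-species Jaccard similarity matrix T.

    profiles: list of equal-length '0'/'1' strings, one per distinct
              conservation profile; position k refers to species k.
    T[i][j] = N11 / (N11 + N01 + N10), counted over all profiles.
    """
    n = len(profiles[0])
    n11 = [[0] * n for _ in range(n)]   # co-presence counts for each species pair
    ones = [0] * n                      # number of profiles where species k is present
    for p in profiles:
        present = [k for k, c in enumerate(p) if c == "1"]
        for i in present:
            ones[i] += 1
            for j in present:
                n11[i][j] += 1
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # N11 + N10 + N01 equals presence in i or j (size of the union).
            union = ones[i] + ones[j] - n11[i][j]
            sim[i][j] = n11[i][j] / union if union else 0.0
    return sim


# Hypothetical toy data: 4 distinct profiles over 3 species.
toy_profiles = ["110", "111", "011", "100"]
T = jaccard_matrix(toy_profiles)
for row in T:
    print(["%.3f" % v for v in row])
```

Joint absences (0,0) do not enter the score, so two species are not counted as similar merely because both lack a protein.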