Introduction to phylogenomics
Julie Thompson
Laboratory of Integrative Bioinformatics and Genomics
IGBMC, Strasbourg, France
[email protected]

Phylogenomics
A combination of:
• genomics (study of the function and structure of genes and genomes)
• molecular phylogenetics (study of evolutionary relationships among organisms)
Two different aspects:
• using phylogenetic data to infer functions for DNA and protein sequences (Eisen. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 1998)
• using genomic data to infer phylogenetic relationships (species trees) and to gain insights into the mechanisms of molecular evolution (O'Brien and Stanyon. Phylogenomics. Ancestral primate viewed. Nature 1999)

1. Phylogeny-based functional inference
• Homology based methods
• Non-homology based methods

Phylogeny-based functional inference
Used in molecular biology, genetics, development, behavior, epidemiology, ecology, systematics, conservation biology, forensics…
• draw structural/functional inferences from the structure of the tree or from the way the character states map onto the tree
• use these clues to build hypotheses and models of important events and systems, predict behavior, etc.
Reviewed in: Brown & Sjölander, PLoS Comput Biol. 2006; Levasseur et al, Evolutionary Bioinformatics, 2008

Phylogeny-based functional inference
A two-step process:
[Figure: two-step diagram, steps labelled 1 and 2. Data and information from model systems (human, mouse, drosophila, yeast…) are linked to new systems (human, partial experimental data) through propagation, modelling, phylogenetic inference and knowledge extraction. Levels shown: sequence, structure, function, evolution. Labelled analyses: Complexes & Networks (copresence/coabsence, fusion/fission); Interactome (interologous approach); Interfaceome (conserved residues, interactions); Promotome (phylogenetic footprint); Transcriptome, proteome… (similar expression)]

Propagation from model systems
Classical method: similarity-based functional annotation (from Blast best hit)
• Perform Blast search to detect similar sequences (figure: human, mouse1, mouse2, worm, yeast)
• Transfer function from the highest scoring sequence with known function (figure: human, mouse1)
Errors:
• gene duplications (ortholog/paralog)
• multi-domain proteins
• existing database errors

Propagation from model systems
Classical method: similarity-based functional annotation (from Blast best hit)
Problems:
• distantly related sequences may have different functions
• spurious hits in low complexity regions
• propagation of existing database annotation errors
Example: complex domain organisation

Problems: domain organisation
[Figure: domain organisation of SW:Y449_MYCGE, SW:Y663_MYCPN, SW:SYFB_IDILO and SPT:A5IAL4_LEGPC; labelled domains: RNA binding domain, phenylalanyl-tRNA synthetase]

Annotation errors
Sequence prediction errors:
• 65% of the sequences are in silico predictions
• 44% of eukaryote predicted proteins are partially incorrect: at least one suspicious indel or divergent segment (Bianchetti et al, 2005)
Function annotation errors:
• 66% of sequences in the UniProt database have GO annotations, but only 3% have evidence codes indicating experimental support (Krishnamurthy et al, 2006)
• 10-30% of genome functional annotations are erroneous (Devos, Valencia, 2000)

Propagation from model systems
Phylogeny-based inference
• Perform Blast search to detect similar sequences (figure: human, mouse1, mouse2, worm, yeast)
• Perform multiple alignment of sequences representing potential homologs
• Construct phylogenetic tree and identify orthologs (figure: trees of human, mouse1, mouse2, worm, yeast with duplication and fusion events marked)
• Infer function from the set of orthologs, domain organisation, conserved motifs (also 3D structure, etc.)

Assumption
• We can identify a set of homologous sequences and differentiate orthologs from paralogs
• Orthologous sequences (diverged by speciation) are more reliable predictors of protein function than paralogous sequences (diverged by gene duplication)
[Figure: ancestor gene A; speciation gives mouse gene A and human gene A (orthologs); duplication gives human gene A’ (paralog of human gene A)]

Define orthologous groups
Pairwise orthology: reciprocal best hits (RBH)
• Inparanoid (Remm et al., 2001)
• COGs: Clusters of Orthologous Groups (Tatusov et al., 1997)
• orthoMCL (Li et al., 2003)
• EggNOG (Jensen et al., 2008)
Problems leading to wrong orthology assumptions:
• sequencing errors, non-predicted sequences
• gene duplication followed by differential gene loss
• varying rates of evolution
[Figure: two gene trees (rat, mouse, human, worm, fly) in which a duplication produces sub-families (A and B; X and Y), with a gene loss in human. Labels: RBH: human B / mouse A; RBH: human X / rat Y]

Define orthologous groups
Tree-based orthology: build a phylogenetic tree of a group of genes and compare the gene tree to the species tree to define speciation and duplication events
• Resampled Inference of Orthologs (RIO) (Zmasek and Eddy, 2002)
• Orthostrapper (Storm and Sonnhammer, 2002)
• Levels Of Orthology From Trees (LOFT) (van der Heijden et al, 2007)
Example: G protein-coupled receptors
[Figure: GPCR tree with two unknown sequences. Prediction for one: opioid receptor. More general prediction for the other: GPCR of unknown specificity]

Large scale analysis pipelines
• FIGENIX (Gouret et al, 2005): automatic pipeline for structural/functional annotation and phylogeny
• SIFTER (Engelhardt et al, 2005): statistical inference algorithms to propagate function annotations within a phylogeny
• PhyloFacts (Krishnamurthy et al, 2006): database of protein families, integrating different predictions and experimental data in a phylogeny
• MACSIMS (Thompson et al, 2006): information management system, to propagate structural/functional data within a multiple alignment

Large scale analysis: example
Phylogenies of peroxisomal proteins (yeast, rat) were reconstructed to determine their origin: eukaryotic, bacterial or archaeal
• 39–58% were of eukaryotic origin (biogenesis or maintenance)
• 13–18% were of bacterial origin (enzymes), by recruitment of proteins originally targeted to mitochondria
[Figure labels: bacteria, archaea]
Gabaldón et al. Biology Direct, 2006.

Large scale analysis: example
Figenix functional analysis of genes lost in mammals/vertebrates, but present in other animals
• More than 50% of lost genes are involved in biomolecular metabolism/catabolism
• e.g. TPS synthesizes trehalose 6P from UDP-glucose; trehalose is a disaccharide crucial for the survival of species in dry and freezing periods and other stress conditions
Danchin, Gouret and Pontarotti. BMC Evolutionary Biology 2006

Online resource: PhyloFacts
The Berkeley Phylogenomics Group: a phylogenomic encyclopedia containing 10,000 'books' for protein families, with pre-calculated structural, functional and evolutionary analyses.
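
The pairwise orthology definition from the "Define orthologous groups" slide (reciprocal best hits) can be sketched as follows. This is a minimal illustration, not the procedure of any tool listed on the slides: the gene names and similarity scores are hypothetical, and scores are assumed to be symmetric, which real Blast scores are not guaranteed to be.

```python
def best_hits(scores, queries, targets):
    """Map each query gene to its highest-scoring target gene."""
    best = {}
    for q in queries:
        hits = [(scores[(q, t)], t) for t in targets if (q, t) in scores]
        if hits:
            best[q] = max(hits)[1]
    return best


def reciprocal_best_hits(scores, genome_a, genome_b):
    """Return gene pairs (a, b) that are each other's best hit."""
    # Assumption: one score per gene pair, usable in both directions.
    sym = dict(scores)
    sym.update({(b, a): s for (a, b), s in scores.items()})
    a_to_b = best_hits(sym, genome_a, genome_b)
    b_to_a = best_hits(sym, genome_b, genome_a)
    return {(a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a}


# Hypothetical data: human_A2 stands for a paralog created by duplication.
human = ["human_A", "human_A2"]
mouse = ["mouse_A"]
scores = {
    ("human_A", "mouse_A"): 950.0,
    ("human_A2", "mouse_A"): 610.0,
}
pairs = reciprocal_best_hits(scores, human, mouse)
print(pairs)
```

In this toy case the paralog human_A2 is excluded because mouse_A's best hit is human_A. If human_A were missing (gene loss or a non-predicted sequence), the same code would pair human_A2 with mouse_A, which is the failure mode listed under "Problems" on that slide.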
[Figure: PhyloFacts analysis pipeline; labelled components: FlowerPower, SAM, Blast to PDB, MSA analysis, MUSCLE, NJ, MP, ML, SCI-PHY, PFAM]
http://phylogenomics.berkeley.edu/phylofacts

Online resource: PhyloFacts
Search with fasta sequence

Phylogeny-based inference
Warning: inference accuracy depends on evolutionary distance and on the particular functional attribute under consideration
• Some attributes of protein families, such as the 3D structure, are conserved across large evolutionary distances
• Other attributes, such as substrate specificity, can be modified by a few amino acid substitutions in critical positions

MACSIMS: Information Management System
http://bips.u-strasbg.fr/MACSIMS/
Data collection:
• creation of a relational database (BIRD, H. Nguyen)
Information management:
• data validation
• reliable propagation
Efficient exploitation:
• automatic, high-throughput processing (XML format)
• visualisation (JalView editor)
Thompson et al, 2006

MACSIMS: Information Management System
[Figure: substrate specificity in rhodocoxin reductase and thioredoxin reductase; *** marks FAD binding]

MACSIMS: Information Management System
Sulfatase protein family: GALNS
Mutations in the GALNS gene are implicated in Morquio A syndrome:
• mutation C79Y -> severe phenotype
• others -> milder phenotypes

"Non-homology" based methods
When no characterised homologs are available, 'non-homology' methods can be used to analyze other patterns:
• gene co-inheritance (phylogenetic profiling)
• gene context
• domain fusion
• gene neighborhood (operon, synteny, …)
• gene regulation (phylogenetic footprinting / shadowing)
They predict functional associations between proteins:
• physical interactions
• co-membership in pathways, regulons or other cellular processes

Phylogenetic profiling
Joint presence or joint absence of two traits across large numbers of species can be used to infer a biological connection, e.g. involvement of two different proteins in the same biological pathway (Pellegrini et al., 1999)
Hypothesis:
• A biological process (photosynthesis, methanogenesis, histidine biosynthesis, …) may require the concerted action of many proteins
• If some protein critical to a process is lost, other proteins dedicated to that process become useless; natural selection makes it unlikely that they will be retained over evolutionary time
• Therefore, genes that are functionally related should be gained and lost together from genomes during evolution, which results in a correlation of their occurrence vectors

Phylogenetic profiling
• For each gene, code Presence (1) or Absence (0) in each species
• Group genes with the same or similar profiles
• Genes with similar profiles are likely to be functionally related

Phylogenetic profiling: example
Comparative Genomics Identifies a Flagellar and Basal Body Proteome that Includes the BBS5 Human Disease Gene (Li et al, Cell, 2004)

Phylogenetic profiling
Other methods:
• Similarity-based methods (correlating rates of evolution) (e.g. Marcotte, 2000)
• Comparison of trees, rather than simple co-presence/co-absence (e.g. for the STRING database, von Mering et al, 2003)
Problems/limitations:
• Need to include a large number of genomes
• Genes may not be predicted (or may be badly predicted)
• A functional link is inferred, but there are no clues to the exact gene functions

Domain fusion (Rosetta stone)
Hypothesis: some pairs of interacting proteins have homologs in another organism that are fused into a single protein chain
• A comparison of sequence homologs from multiple organisms can reveal these fused sequences, called Rosetta Stone sequences because they decipher the interactions between the protein pairs (Marcotte et al, 1999)
Example: [figure]

Rosetta stone: genome analysis
Prediction of E. coli genome-wide gene network

Significance score threshold   Number of functional links   Number of proteins (% of genome)
1                              4613                         1124 (26%)
1x10^-6                        111                          583 (14%)
1x10^-10                       854                          475 (11%)

Problems:
• The networks generated are sparse, but begin to define cellular systems
• May not be scalable to higher eukaryotes due to large numbers of duplicate genes and promiscuous domains

Gene neighborhood methods
Genes that frequently co-occur in the same operon (genomic region) in a diverse set of species are more likely to physically interact or be involved in the same pathway (Dandekar et al, 1998; Huynen et al, 2000;…)
Example:
[Figure: gene neighborhood containing fatty acid biosynthesis genes, fatty acid degradation genes and a predicted transcription factor; the TF may regulate fatty acid degradation and biosynthesis]
From Harrington et al, PNAS 2007

Protein function prediction using combined methods
E.g. PLEX (Date and Marcotte, 2005): mySQL relational database with gene sequences, chromosomal positions, pre-computed phylogenetic profiles and Rosetta Stone linkages, accessible via a web-based interface
http://bioinformatics.icmb.utexas.edu/plex/

Protein function prediction using combined methods
• Study of protein function prediction in genomes and metagenomes
• Combination of homology and non-homology approaches
[Figure legend: specific function; non-specific function; conserved protein; singleton]
From Harrington et al, PNAS 2007

Phylogenetic footprinting
• Used to identify Transcription Factor Binding Sites (TFBS) within a non-coding region of DNA
• Hypothesis: selective pressure causes regulatory elements to evolve at a slower rate than the non-functional surrounding sequence
• Phylogenetic shadowing: a related technique used with closely related species
Tagle et al, 1988

Phylogenetic footprinting
Protocol:
• Carefully choose species with orthologous genes to provide enough sequence divergence
• Decide on the length of the upstream / downstream region to be analysed
• Align the sequences
• Look for conserved regions and analyse them
Example: [figure] From Zhang and Gerstein, Journal of Biology 2003

Footprinting programs…
• Multiple alignment of genomic regions: PipMaker, AVID, Multiz
• Experimentally validated motif databases: DBTSS, EPD
• Motif prediction: FirstEF, Eponine and GenScan
• Integrated systems: CONREAL, ConSite, Footer, PHYLONET, PromAn, PhyloScan
Problems:
• Species specific binding sites
• Very short binding sites
• Less specific binding factors
• Compound binding regions

2. Construction of species trees
Problem: phylogenetic trees based on single gene families may show conflict due to a variety of causes (gene duplication, loss, horizontal transfer, convergent or parallel evolution…)
Solution: integrate the phylogenetic information from the different gene families to form a single species phylogeny

Construction of species trees
Define groups of orthologous sequences, then use:
• Whole genome features (complete genome alignment, gene content)
• Supermatrix (simultaneous-analysis, combined-analysis)
• Supertree (separate analysis)
Delsuc et al, Nature reviews, 2005

Gene content
• No multiple alignments, but sequence information is used to determine the orthologous genes
• Build a matrix indicating the presence or absence of orthologous groups (OGs) in all species (phylogenetic profile)
• The binary matrix can be treated in the same way as a multiple sequence alignment (sequence alignment: 4 states, ACGT; gene content: 2 states, P present, A absent)
• Infer a phylogenomic tree from the matrix (alignment)

Gene content
Methods: distance methods, maximum parsimony
References:
• Snel, Bork & Huynen. (1999) Nature Genet.
• Tekaia, Lazcano & Dujon. (1999) Genome Res.
• Lin & Gerstein. (2000) Genome Res.
• Wolf, Rogozin, Grishin, Tatusov & Koonin. (2001) BMC Evol. Biol.
• Fitz-Gibbon & House. (1999) Nucleic Acids Res.

Gene order (synteny)
• Estimate evolutionary distance from the number of rearrangements necessary to transform one genome into another (computationally complex)
• Construct phylogenetic trees by minimizing the number of breakpoints between genomes (Blanchette et al 1999)
• More practical solution: simply score the presence or absence of pairs of orthologous genes (Korbel et al. 2002, Wolf et al 2001)

Gene content / gene order
Problems:
• Orthology assessment
• 'Big genome attraction': distantly related species with large genomes may share more genes than more closely related species with small genomes
• Sequence information is lost

Superalignments (supermatrix)
• Multiple alignments for each gene are concatenated to form a superalignment
• Use conventional phylogenetic reconstruction methods (e.g. distance or MP) (Brown et al. 2001, Wolf et al 2001)

Superalignments
Example: RibAlign
• Analysis of 16S ribosomal RNA (rRNA) sequences has been the de facto gold standard for the assessment of phylogenetic relationships among prokaryotes
• Concatenation of ribosomal protein sequences (MAFFT, Phylip: ProML, MrBayes)

Superdistance (supermatrix)
Superdistance methods first calculate distance matrices for all gene families. The phylogenomic distance between two species is then defined as the average distance between all the shared gene families (Kunin et al., 2005)

Supertree
• Reconstruct phylogenetic trees for each gene family separately
• Combine the multiple gene family trees to form a single phylogenomic tree (Gene Tree Reconciliation) (Bininda-Emonds, 2004; Daubin et al., 2002)

Gene tree reconciliation methods
Consensus tree methods are used to combine fully overlapping source trees (strict, majority consensus rules, …)
(e.g. MinCut: Semple and Steel, 2000)
From de Queiroz and Gatesy, Trends Ecol Evol, 2007

Gene tree reconciliation methods
Indirect supertree construction represents individual source trees as matrices, then combines them using an optimization criterion:
• Matrix representation using parsimony (MRP)
• "flip" supertrees
• Average consensus procedure
• Most Similar Supertree (MSSA)
• Maximum Quartet Fit (QFIT)
• Maximum Splits Fit (SFIT)
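
To make "represents individual source trees as matrices" concrete, here is a minimal sketch of the matrix-coding step of MRP, assuming rooted source trees written as nested Python tuples. Each clade of each source tree becomes one binary character: 1 for taxa inside the clade, 0 for other taxa in that tree, ? for taxa missing from that tree. The taxa and trees are hypothetical, and the parsimony search that would follow on the matrix is not shown.

```python
def leaves(tree):
    """Return the set of taxon names below a node (a name or a tuple of children)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Leaf sets of every internal node except the root, in preorder."""
    found = []

    def visit(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.append(leaves(node))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return found


def mrp_matrix(trees):
    """Code each clade of each source tree as one column of 1 / 0 / ?."""
    all_taxa = sorted(frozenset().union(*(leaves(t) for t in trees)))
    rows = {taxon: "" for taxon in all_taxa}
    for tree in trees:
        present = leaves(tree)
        for clade in clades(tree):
            for taxon in all_taxa:
                if taxon in clade:
                    rows[taxon] += "1"   # member of the clade
                elif taxon in present:
                    rows[taxon] += "0"   # in this tree, outside the clade
                else:
                    rows[taxon] += "?"   # taxon not sampled in this tree
    return rows


# Two hypothetical, partially overlapping source trees.
source_trees = [
    ((("human", "chimp"), "mouse"), "chicken"),
    (("mouse", "rat"), "chicken"),
]
matrix = mrp_matrix(source_trees)
for taxon, row in matrix.items():
    print(f"{taxon:8s} {row}")
```

The ? entries are what allow source trees with different taxon sets to be combined in a single matrix.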
From Bininda-Emonds et al, 2002
Software: Clann, http://bioinf.may.ie/software/clann/

Problems
• Large amounts of data: need automatic pipelines
• Need a reliable method to identify genuine orthologues
• Missing data: some genes missing from some species (incomplete sequencing)
• Factors leading to an incorrect tree, even with the use of genome-scale data:
  • nucleotide or amino acid compositional bias
  • long-branch attraction caused by unequal evolutionary rates among lineages
  • sparse taxon sampling
  • heterotachy (the shift of position specific evolutionary rates)

Comparison supermatrix/supertree
Supermatrix methods:
• Include all sequence information (reduces noise)
• Can yield relationships that are not present in the set of source trees
• Ignore differences in rates or modes of evolution
• More sensitive to missing data
• Computationally expensive
Supertree methods:
• Relatively efficient => allow construction of large trees
• Estimate an independent set of parameters for every gene
• Allow incorporation of diverse kinds of data, e.g. characters from fossils, morphobank
• Less sensitive to missing data
• Use heuristic algorithms that cannot be justified rigorously on a statistical basis
• Ignore uncertainties in the subtrees (bootstrap values, Bayesian posterior probabilities,…), but some recent algorithms may solve this problem (Burleigh, 2006; Moore, 2006)
• May over-fit the data and cause large variances in the estimates

Statistical modelling approach
• Statistical likelihood provides a framework for combining information from different experiments
• Combine data from multiple genes while accommodating differences in the evolutionary process
• Define a model that estimates the probability of obtaining a series of subtree topologies, given a hypothesized supertree
• Select the supertree that maximizes the likelihood (product of the likelihoods of all subtrees)
Ren, F., et al. A likelihood look at the supermatrix–supertree controversy, Gene (2008)

Applications: tree of life
Mammalian tree topology
• 70 mammalian species, plus Marsupialia and Monotremata as outgroups
• Supermatrix approach using 1st and 2nd codon positions of mitochondrial protein-coding genes and MrBayes
Reyes, et al. Mol. Biol. Evol. 2004.

Applications: tree of life
3 hypotheses for the root of the eutherian tree:
• basal Afrotheria
• basal Xenarthra
• basal Boreotheria, or Afrotheria/Xenarthra clade
Concatenated dataset of the 2,789 gene sequences:
• Supermatrix (ML analyses) robustly supported the monophyly of Boreotheria (cow, dog, mouse, rat, human, chimpanzee, and macaque), but the root was sensitive to the substitution model
• Supertree (ML analyses) takes account of variations in the rates and modes of evolution among genes by assigning different parameters to different genes => all models support tree 1
Nishihara et al, Genome Biol. 2007. Rooting the eutherian tree: the power and pitfalls of phylogenomics

Applications: tree of life
Current status and future challenges
[Figure legend: Identified or confirmed by phylogenomics; Hypothetical relationships; Hypothetical relationships from phylogenomics; Main uncertainties]

Extinct species
Ancestral sequence reconstruction
• Putative archosaur rhodopsin visual pigment synthesised and tested for function using biochemical methods
• Archosaurs may have had visual pigments that would support dim-light vision
• Were ancestors nocturnal or diurnal?
Chang, et al. Mol Biol Evol 2002

Perspectives: Jurassic genome?
Zimmer C. Evolution. Jurassic genome. Science. 2007
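
As a small illustration of the idea behind the "Ancestral sequence reconstruction" slide, the sketch below infers the candidate ancestral states at the root of a tree for a single alignment column using Fitch parsimony. Fitch parsimony is chosen here only because it is short; the slides do not state which method the cited study used, and this is not presented as that method. The tree, tip names and residues are hypothetical.

```python
def fitch_root_states(tree, tip_states):
    """Bottom-up Fitch pass for one site on a rooted binary tree.

    tree: a tip name (str) or a pair (left, right) of subtrees.
    tip_states: dict mapping tip name -> observed character at this site.
    Returns (set of candidate states at this node, minimum number of changes).
    """
    if isinstance(tree, str):
        return {tip_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_root_states(left, tip_states)
    right_set, right_cost = fitch_root_states(right, tip_states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # Children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical four-taxon tree and one alignment column.
tree = (("sp1", "sp2"), ("sp3", "sp4"))
column = {"sp1": "T", "sp2": "T", "sp3": "S", "sp4": "T"}
states, changes = fitch_root_states(tree, column)
print(states, changes)
```

Repeating this over every column of an alignment gives a candidate ancestral sequence; sites where the returned set has more than one member are ambiguous under parsimony.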