Pathway Tools Meeting - December 1, 2005, Geneva (SIB)

Putting together synteny and metabolic information to achieve relevant expert annotation of microbial genomes
Dr Claudine Médigue

What is MaGe?
Yet another bacterial annotation platform!
• Its development started in Oct. 2002
• Context: the Acinetobacter sp. ADP1 genome annotation (Summer 2004)
Shares functionalities with other existing annotation systems:
• An automatic annotation process: syntactic and functional annotations; functional annotation and classification inferences
• A relational database (MySQL) used to store the sequences and the analysis results
• A Web interface allowing multiple users to annotate a genome simultaneously
• Connectivity to other databases or systems
Developed by biologists involved in manual expert annotation:
• Graphical interface which focuses on gene context and synteny results with available bacterial proteomes

Introduction to the Prokaryotic Genome DataBase (PkGDB)
Purpose: storage of ‘clean’ and complete annotation data which are subsequently used in the genomic comparative analysis.
Relational DBMS (MySQL)
• Complete bacterial genomes (RefSeq NCBI and Genome Reviews EBI)
  Integration in PkGDB: correction of obvious errors; management of frameshifts; syntactic re-annotation (NAR (WS), 2003); addition of missing gene annotations (NAR (WS), 2005)
• New bacterial genomes (annotation projects)
• Annotation tool results:
  Intrinsic: genes, signals, repeats, …
  Extrinsic: BLAST, InterPro, COG, synteny, …

Simplified structure of PkGDB
[Diagram labels: Re-annotation project; Annotation project; Published genomes (NCBI RefSeq, Genome Reviews); Newly sequenced genomes; Gene prediction (AMIGene); Project customization; Reference annotation for model organisms (Ecogene, Geneprotec, Subtilist); Annotation management; Sequence updates and annotation transfer; Genomic Objects; Automatic and manual functional assignations; Annotation history; Annotator management; Functional Classification (MultiFun, GeneOntology); Functional predictions; Protein similarities; helixes and signal peptides; Enzymatic functions (KEGG, BioCyc); Domains and motifs; Uniprot; Interpro; COG; Specific regions; Orthologs & Paralogs; Syntenies]
Syntenies:
• Multiple correspondences
• Local rearrangements (ins/del)
Boyer et al. Bioinformatics (Nov 2005)

How to read the synteny maps?
[Figure: synteny map around ACIAD0574 (hutH)]
• Two ‘homologs’ to ACIAD0574 on the P. aeruginosa genome
• These two P. syringae genes (PSPTO5274/hutH-2 and 5276/hutH-3) are similar to ACIAD0574 (putative paralogs of PSPTO0599)
• This P. syringae gene (PSPTO0599/hutH-1) is a putative ‘ortholog’ to ACIAD0574 and is involved in a synteny group containing 17 genes (in green)

A larger view of the previous Acinetobacter ADP1 region
[Figure labels: 0562, 0574, 0582-0583; hisS, hutH, fabG-fabF]
• 4 of 138 genomes in PkGDB
• 9 of 284 complete microbial proteomes (RefSeq section)

How are genes organized in a synteny group?
[Figure panels: Synteny with Ralstonia solanacearum chromosome; Synteny with Ralstonia solanacearum Mega Plasmid]

Synteny maps are useful to annotate gene fusion/fission
Fusion of genes involved in DNA replication:
• dnaQ (DNA polIII, epsilon subunit + proofreading 3’-5’ exonuclease)
• rnhA (degradation of Okazaki fragments)
[Figure labels: dnaQ: YPO1082, STM0264, NMB1514, PA1816, PSPTO3711; rnhA: YPO1081, STM0263, NMB1618, PA1815, PSPTO3712]
Colored rectangles represent the part of the protein which aligns with the corresponding Acinetobacter protein.

Simplified structure of PkGDB
[Same diagram as before, with these annotations on the enzymatic function resources:]
• PRIAM (http://bioinfo.genopoletoulouse.prd.fr/priam/): position-specific scoring matrices ('profiles') built with SwissProt proteins
• KEGG (www.genome.jp/kegg/): dynamic requests
• BioCyc (http://www.biocyc.org/): local installation

Setting up a new annotation project: an example
Available related sequences -> automatic syntactic annotations (in some cases, functional annotations)
• Rhizobium leguminosarum (Sanger Center)
• Rhodobacter sphaeroides (DOE/JGI)
• Rhodospirillum rubrum (DOE/JGI)
Genomes in public databanks -> re-annotation process (pseudogenes, missing genes)
• Mesorhizobium loti (00)
• Sinorhizobium meliloti (01)
• Bradyrhizobium japonicum (02)
• Rhodopseudomonas palustris (03)
Newly sequenced genomes
• Bradyrhizobium sp. ORS278 (Genoscope) -> 1 chr (7.5 Mb)
• Bradyrhizobium sp.
BTAi (DOE/JGI) -> 1 chr (8.5 Mb)
Complete pipeline of automatic annotations
Searching for synteny groups with complete proteomes available in RefSeq section (NCBI, 284 to date) and in PkGDB (curated genomes, 138 to date)

PkGDB -> Pathway Tools: metabolic pathway reconstruction
[Diagram labels: Ocelot object model; BioWareHouse relational model; BrajapCyc, YersiniaScope, AcinetoScope, ColiScope, RhizoScope, BradyBTCyc, BradyORCyc, FrankiaScope, CloacaScope, RhizoCyc]

Comparative Metabolic Capabilities: an example
Reaction content comparisons between the 3 Bradyrhizobium organisms (BioWareHouse SQL query on reactions having gene -> protein -> reaction correspondences)
[Venn diagram of reaction counts]
• Bradyrhizobium sp. ORS278: 830 reactions (14 specific)
• Bradyrhizobium sp. BTAi: 873 reactions (43 specific)
• Bradyrhizobium japonicum USDA 110: 897 reactions (127 specific)
• Shared by all three organisms: 724
• Shared by ORS278 and BTAi only: 76; by BTAi and USDA 110 only: 30; by ORS278 and USDA 110 only: 16

ORS278 gene   BTAi genes coding the same reaction                     Pathway                         Reaction
BRAOR5732     BRABT1389, BRABT0754, BRABT0723, BRABT0755, BRABT0724   protocatechuate degradation I   PROTOCATECHUATE-4,5-DIOXYGENASE-RXN
BRAOR5733     BRABT1389, BRABT0754, BRABT0723, BRABT0755, BRABT0724   protocatechuate degradation I   PROTOCATECHUATE-4,5-DIOXYGENASE-RXN
BRAOR5771     BRABT1389, BRABT0754, BRABT0723, BRABT0755, BRABT0724   protocatechuate degradation I   PROTOCATECHUATE-4,5-DIOXYGENASE-RXN
BRAOR5772     BRABT1389, BRABT0754, BRABT0723, BRABT0755, BRABT0724   protocatechuate degradation I   PROTOCATECHUATE-4,5-DIOXYGENASE-RXN
BRAOR5776     BRABT0759                                               protocatechuate degradation I   RXN-2463

Bradyrhizobium ORS278 region containing CDS 5771 & 5772
[Figure labels: BRAOR5771-5772, 5773]
15277747: “Cloning and Characterization of the Genes Encoding Enzymes for the Protocatechuate Meta-degradation Pathways of Pseudomonas ochraceae NGJ1”, Maruyama et al. (2004) Biosci. Biotechnol. Biochem., 68, 1434-1441.

AUTOmatic vs EXPert annotation of the region
(for each gene: product; EC number; gene name; evidence)
BRAOR5770
  AUTO: 4-carboxy-2-hydroxymuconate-6-semialdehyde dehydrogenase; EC 1.1.1.18; ligC; BLAST R. palus, PRIAM (medium)
  EXP:  4-carboxy-2-hydroxymuconate-6-semialdehyde dehydrogenase; EC 1.2.1.45; ligC; BLAST P. testosteroni, Publication + Enzyme
BRAOR5771
  AUTO = EXP: Protocatechuate 4,5-dioxygenase, alpha subunit; EC 1.13.11.8; ligB; BLAST R. palus, PRIAM (high)
BRAOR5772
  AUTO = EXP: Protocatechuate 4,5-dioxygenase, beta subunit; EC 1.13.11.8; ligA; BLAST R. palus, PRIAM (high)
BRAOR5773
  AUTO: 2-pyrone-4,6-dicarboxylic acid hydrolase; EC none; ligI; BLAST R. palus
  EXP:  2-pyrone-4,6-dicarboxylic acid hydrolase; EC 3.1.1.57; ligI; BLAST R. palus, Publication + Enzyme
BRAOR5774
  AUTO: Putative dehydrogenase; EC none; gene name none; BLAST R. palus
  EXP:  Putative dehydrogenase with NAD binding protein; EC 1.1.1.-; gene name none; BLAST R. palus, InterproScan
BRAOR5775
  AUTO: Putative acyl transferase; EC none; fidZ; BLAST R. palus
  EXP:  4-hydroxy-4-methyl-2-oxoglutarate aldolase; EC 4.1.3.17; ligK; BLAST P. ochraceae, Publication + Enzyme
BRAOR5776
  AUTO: 4-oxalomesaconate hydratase; EC none; ligJ; BLAST R. palus
  EXP:  4-oxalomesaconate hydratase; EC 4.2.1.83; ligJ; BLAST R. palus, Publication + Enzyme

Bradyrhizobium ORS278 region after expert annotation
• BRAOR5770: ligC, 1.2.1.45
• BRAOR5771-72: ligBA, 1.13.11.8
• BRAOR5773: ligI, 3.1.1.57
• BRAOR5775: ligK, 4.1.3.17
• BRAOR5776: ligJ, 4.2.1.83
• BRAOR5777, BRAOR5778

Connectivity to KEGG database
[KEGG pathway map; color legend:]
• Enzymes encoded by genes in the MaGe region
• Enzymes encoded by genes elsewhere in the Bradyrhizobium genome
• Additional enzymes in E. coli
• 4.2.1.83 ?

Bradyrhizobium ORS278 region after expert annotation
[Figure labels: 5770, 5771, 5772, 5773, 5775, 5776, BRAOR5777, BRAOR5778, ligR; Probable protocatechuate transporter; Probable transcriptional regulator of protocatechuate degradation]
• BRAOR5770_ligC: 4-carboxy-2-hydroxymuconate-6-semialdehyde dehydrogenase, 1.2.1.45
• BRAOR5776_ligJ: 4-oxalomesaconate hydratase, 4.2.1.83
The reactions catalyzed by 1.2.1.45 and 4.2.1.83 exist in MetaCyc but they are not involved in a pathway.
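
The AUTO vs EXP comparison of the BRAOR5770-5776 region can be sketched in a few lines of Python. This is only an illustration, not the MaGe implementation: the function name `compare_ec` and the category names are hypothetical, the EC numbers are transcribed from the annotation table of the region, and the "3 digits" rule (agreement on the first three EC fields only) is an assumption about how partial matches are counted.

```python
from collections import Counter
from typing import Optional


def compare_ec(auto: Optional[str], expert: Optional[str]) -> str:
    """Classify one gene by comparing its automatic and expert EC numbers.

    Hypothetical helper: None means that no EC number was assigned.
    """
    if auto is None and expert is None:
        return "no_ec"
    if expert is None:
        return "auto_only"
    if auto is None:
        return "expert_only"
    if auto == expert:
        return "equal"
    # Assumption: a partial match means identical first three EC fields.
    if auto.split(".")[:3] == expert.split(".")[:3]:
        return "equal_3_digits"
    return "different"


# (automatic EC, expert EC) per gene, transcribed from the region table.
region = {
    "BRAOR5770": ("1.1.1.18", "1.2.1.45"),
    "BRAOR5771": ("1.13.11.8", "1.13.11.8"),
    "BRAOR5772": ("1.13.11.8", "1.13.11.8"),
    "BRAOR5773": (None, "3.1.1.57"),
    "BRAOR5774": (None, "1.1.1.-"),
    "BRAOR5775": (None, "4.1.3.17"),
    "BRAOR5776": (None, "4.2.1.83"),
}

categories = {gene: compare_ec(auto, exp) for gene, (auto, exp) in region.items()}
summary = Counter(categories.values())

for gene, category in categories.items():
    print(gene, category)
```

On this region the sketch gives two exact matches (BRAOR5771, BRAOR5772), one disagreement (BRAOR5770) and four genes where only the expert annotation carries an EC number.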
Enzymatic activity predictions (PRIAM): some results
Comparison of PRIAM predictions [P] and Expert annotations [E]

                           Acinetobacter   Pseudoalteromonas   Frankia        Pseudomonas
                           ADP1            haloplanktis        alni           entomophila
Total genes                3325            3514                6861           5182
Nb EC_[P] vs EC_[E]        1012 / 947      927 / 993           1729 / 1498    1455 / 1232
EC_[P] = EC_[E]            632 (62.5%)     697 (75.2%)         912 (52.8%)    820 (56.3%)
EC_[P](3 digit) = EC_[E]   47 (4.6%)       23 (2.5%)           68 (3.9%)      46 (3.2%)
EC_[P] <> EC_[E]           131 (12.9%)     102 (11.0%)         401 (23.2%)    285 (19.6%)
EC_[P] & (NO EC_[E])       202 (20.0%)     105 (11.3%)         348 (20.1%)    304 (20.9%)
EC_[E] & (NO EC_[P])       111 (11.7%)     152 (15.3%)         111 (7.4%)     90 (7.3%)

Limitations of PRIAM sequence-based enzyme prediction
• Availability of at least one UniProt/SwissProt sequence in the Enzyme entry!
• Existence of closely related enzymes with different substrate specificity
• Relaxed substrate specificity exhibited by some enzymes
• Several wrong predictions in case of Medium/Low PRIAM confidence

PGDBs built at Genoscope
• Our PGDBs are currently available in MaGe’s interface. HomePage: http://www.genoscope.cns.fr/agc/mage/
• NO curation to date (Tier 3* Databases) (except for Acinetobacter ADP1 -> Metabolic Thesaurus project)
• MaGe’s training courses include a quick overview of how to explore PathoLogic results to perform relevant expert annotation
• Automatic updates of PathoLogic predictions: every week
To date: about 60 Tier 3 PGDBs
16 PGDBs common to SRI/EBI PGDBs Tier 3* (and 4 with Tier 2*): «Expansion of the BioCyc collection of pathway/genome databases to 160 genomes» Karp et al. Nucleic Acids Research, 2005, 33: 6083-6089.
• The number of enzymes and pathways is slightly greater in our PGDBs (source of annotations + process of PathoLogic file format generation)
• Important discrepancies with Sinorhizobium meliloti (44 predicted pathways in the SRI/EBI PGDB vs 259 in the Genoscope PGDB)
18 PGDBs: other published bacterial genomes
25 PGDBs for newly sequenced and annotated bacterial genomes
*Tier 3: Computationally-Derived Databases Subject to No Curation
*Tier 2: Computationally-Derived Databases Subject to Moderate Curation

Some Questions / Perspectives
Better correspondences between BioCyc and MaGe
• Optional fields in the PathoLogic file format (PubMedID, Funcat, …)
How to tackle the pseudogene information?
• Pathway X doesn’t exist because: no enzyme has been found, or some enzymes correspond to pseudogenes
Curation of PGDB?
• Integration and evaluation of Pathway Hole Filler
Remove false-positive pathways (Tier 3 -> Tier 2)
• Automatic reduction of false positive pathway predictions stored in the PGDBs
• Finding a way to get a list of false positive pathways at the end of the manual process of annotation
Tier 2 -> Tier 1*, especially creation of new metabolic pathways: !!! Not an easy task !!! (a strong knowledge of metabolism is required)
• PGDBs freely available for «adoption» by biologists
*Tier 1: Intensively Curated Databases

Metabolic Thesaurus project at Genoscope
[Diagram labels: Véronique de Berardinis’s team; Knock-out collection (2240 ADP1 genes knocked out); Biological evidence; Systematic phenotyping; Accurate phenotyping; Biochemical studies; Functional complementation; Annotation (3325 Acinetobacter ADP1 annotated genes); Vincent Schächter’s bioinformatic team; Model; Network reconstruction; Flux Models; Metabolism prediction; Transcriptome analyses]

Metabolic Pathway Reconstruction / Experimental Data
Metabolic Thesaurus: Acinetobacter ADP1 KO collection
ColiScope: Sequencing of 2 commensal and 4 pathogenic E.
coli strains
Phenotypic analysis: growth assay on different nutrient sources + Metabolome analysis: LC/MS and CE/MS
Data Integration and Comparative Analysis
• Linked enzymatic activity to genes of unknown function
• Evolution of metabolic capabilities => adaptation of microorganisms: commensalism / virulence emergence

Participating teams
AGC team: Zoé Rouy, David Vallenet, Aurélie Lajus, Stéphane Cruveiller, Claudine Médigue
Genoscope informatic system team: Claude Scarpelli, Laurent Sainte-Marthe, Sylvain Bonneval
… and with the help of: François Lefèvre (V. Schächter team)
Feedback from MaGe’s users helps in improving many functionalities of our system!