ITS accuracy at GenBank
Conrad Schoch, Barbara Robbertse

Improving accuracy
• Barcode tag in GenBank
  – Barcode submission tool
  – Standards
• RefSeq Targeted Loci
  – Well validated sequences already in GenBank
    • Bacteria: all type sequences
    • Limited fungal sequences

Formal selection of the fungal DNA barcode
Schoch et al. 2012.

ITS sequence standards
• 1. Standardized sequence title should be "Fungal ITS barcode".
• 2. Annotation
• 3. Length
• 4. Quality of sequence
• 5. Unique or not?
• 6. Meta data

Difference Between GenBank and RefSeq Targeted Loci

GenBank                                  RefSeq
Not curated                              Curated
Author submits                           NCBI creates from existing data
Only author can revise                   NCBI revises as new data emerge
Multiple records for same loci common    Single records for each molecule of major organisms
Records can contradict each other
No limit to species included             Limited to model organisms
Data exchanged among INSDC members       Exclusive NCBI database
Akin to primary literature               Akin to review articles
Proteins identified and linked           Proteins and transcripts identified and linked
Access via NCBI Nucleotide databases     Access via Nucleotide & Protein databases

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.section.GenBank_ASM

150 Yeast sequences
1. The same ITS accession is associated with different species/strains in the list.
2. The taxon name associated with an accession in the provided list is not found in NCBI. The strain name indicated on the GenBank accession is not found in the culture collection database (wrong strain name in GenBank?).
3. Incomplete accession identifiers in the list.
4. A few accessions in the list do not exist in GenBank.

Checklist for ITS accessions added to the Targeted Loci RefSeq project:
1) Sourced from a type specimen.
2) Primary GenBank name and current name at CBS are the same.
3) Strand in the correct orientation.
4) Type info added from CBS to /note.
5) Added feature /culture_collection.
6) Added feature /identified_by (source CBS).
7) Moved information in note to /isolation_source.
8) All 26S labeled 28S.
9) Reannotated (used 5.8S Rfam borders; used 3’ 18S boundaries (CATTA motif) and 5’ 28S border (GACCT motif) as a guide in an alignment).
10) Added PMID if available.
11) Checked hits with MOLE-BLAST.
12) Example defline (note it has no strain info):
    DEFINITION Trichosporon veenhuisii 18S ribosomal RNA gene, partial sequence; internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence.
13) Features used: /rRNA for 18S and 28S and /misc_RNA for ITS1 and ITS2.
14) Standardized names used in product qualifier: 18S ribosomal RNA, 5.8S ribosomal RNA, 28S ribosomal RNA, internal transcribed spacer 1, internal transcribed spacer 2.

Annotation of 150 ITS records in GenBank

#records  Features                            Annotation in note or product
127       /misc_RNA                           contains 18S ribosomal RNA, internal transcribed spacer 1, 5.8S ribosomal RNA, internal transcribed spacer 2, and 28S ribosomal RNA (or 26S ribosomal RNA)
4         /misc_RNA or /misc_feature          contains internal transcribed spacer 1, 5.8S ribosomal RNA and internal transcribed spacer 2
8         /misc_RNA and /rRNA                 internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2
2         /misc_feature and /gene and /rRNA   internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2
6         /rRNA and /gene and /misc_feature   18S ribosomal RNA; internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2; 26S ribosomal RNA
3         /rRNA and /misc_RNA                 18S ribosomal RNA; internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2; 28S ribosomal RNA (or 26S ribosomal RNA)

Annotation of 150 ITS records in RefSeq

#records  Features                            Annotation in product
138       /rRNA and /misc_RNA                 18S ribosomal RNA; internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2; 28S ribosomal RNA
2         /rRNA and /misc_RNA                 18S ribosomal RNA; internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2
10        /misc_RNA and /rRNA                 internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2

[Figure: example RefSeq record with callouts for the RefSeq accession number, project number, RefSeq references, and expanded qualifiers]

MOLE-BLAST
• Algorithm:
  – 1. For each query, run a BLAST search against nr and collect the top five hits.
  – 2. Cluster query sequences into groups corresponding to different loci.
  – 3. For each locus:
    • Compute a multiple alignment for the queries and their top five BLAST hits
    • Compute a phylogenetic tree based on the multiple alignment

Adding microbial type strain data to the taxonomy database
• Upload all types together with names in NCBI Taxonomy
• Cross reference this as a property in Entrez
• Enable search restricted to ex-type sequences
• Start with Euzeby list

What next?
Expand other markers for the ‘known universe’
• Secondary barcode-type markers
  – list and communicate resources
• Highlight ‘problematic’ ITS taxa
• Provide barcodes for all genomes
• Ensure genome samples are correctly identified
• Integrate sequences with fungal names

BaG (Barcode all genera) of Fungi, proposed goals
Sequence for more than 3000 genera in GenBank
• Compare GenBank and MycoBank taxonomies
• Highlight types in GenBank taxonomy
Target lists for all fungal genera
  – focused on type species
• 16 000 genera (5000 with full meta-data in MycoBank)
One name one fungus = opportunity

Acknowledgments
• ITS meta data: Centraalbureau voor Schimmelcultures (CBS)
• MOLE-BLAST: Grzegorz Boratyn, Tom Madden
• Taxonomy type updates: Scott Federhen
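Two of the problems reported for the 150 yeast sequences (one accession tied to several taxa, and incomplete accession identifiers) can be screened for automatically before any curation starts. The sketch below is only an illustration: the function name `check_accession_list` is hypothetical, the accession pattern is an assumption (one letter plus five digits, or two letters plus six or eight digits, with an optional version suffix), and the sample rows are made up.

```python
import re
from collections import defaultdict

# Assumed nucleotide accession shapes: 1 letter + 5 digits, or 2 letters +
# 6 or 8 digits, optionally followed by ".version".
ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}(?:\d{6}|\d{8}))(?:\.\d+)?$")


def check_accession_list(rows):
    """Screen (accession, taxon_name) pairs from a submitted list.

    Returns (conflicts, malformed):
      conflicts: accession -> sorted taxon names, for accessions that are
                 paired with more than one taxon name
      malformed: accessions that do not match the assumed pattern
    """
    taxa_by_accession = defaultdict(set)
    malformed = []
    for accession, taxon in rows:
        acc = accession.strip().upper()
        if not ACCESSION_RE.match(acc):
            malformed.append(accession)
            continue
        # Ignore the version suffix when grouping.
        taxa_by_accession[acc.split(".")[0]].add(taxon.strip())
    conflicts = {
        acc: sorted(taxa)
        for acc, taxa in taxa_by_accession.items()
        if len(taxa) > 1
    }
    return conflicts, malformed


# Made-up rows for illustration only.
rows = [
    ("AB000001", "Taxon A"),
    ("AB000001.1", "Taxon B"),   # same accession, different taxon
    ("AF12345", "Taxon C"),      # too short: incomplete identifier
    ("U12345.1", "Taxon D"),
]
conflicts, malformed = check_accession_list(rows)
print(conflicts)
print(malformed)
```

Checking whether an accession actually exists in GenBank, or whether a strain name is present in a culture collection, needs a lookup against those databases and is left out here.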
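The checklist's naming rules (26S relabeled as 28S, a fixed set of product names, /rRNA for the ribosomal genes and /misc_RNA for the spacers) lend themselves to a small lookup. The following is a minimal sketch, not the curation code used at NCBI: `standardize_product` is a hypothetical helper, and assigning 5.8S to /rRNA is an assumption, since the checklist names only 18S and 28S explicitly.

```python
# Standardized product names from the checklist, mapped to the feature key
# each one is expected to sit on.
STANDARD_PRODUCTS = {
    "18S ribosomal RNA": "rRNA",
    "5.8S ribosomal RNA": "rRNA",  # assumption: not stated in the checklist
    "28S ribosomal RNA": "rRNA",
    "internal transcribed spacer 1": "misc_RNA",
    "internal transcribed spacer 2": "misc_RNA",
}

# Case-insensitive index of the standardized names.
_BY_LOWER = {name.lower(): name for name in STANDARD_PRODUCTS}


def standardize_product(name):
    """Return (standard_name, feature_key) for a product string.

    Collapses whitespace, ignores case, and relabels 26S as 28S.
    Raises ValueError for names outside the standardized set.
    """
    cleaned = " ".join(name.split()).lower()
    if cleaned == "26s ribosomal rna":
        cleaned = "28s ribosomal rna"
    if cleaned not in _BY_LOWER:
        raise ValueError(f"not a standardized product name: {name!r}")
    standard = _BY_LOWER[cleaned]
    return standard, STANDARD_PRODUCTS[standard]


print(standardize_product("26S ribosomal RNA"))
print(standardize_product("Internal  transcribed spacer 1"))
```

Raising on unknown names, rather than guessing, keeps records such as the /misc_feature annotations in the GenBank table visible for manual review.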
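The first two MOLE-BLAST steps (collect the top five hits per query, then group queries by locus) can be illustrated on already-parsed BLAST results. This is a toy sketch under stated assumptions, not the MOLE-BLAST implementation: the slides do not say how queries are clustered, so grouping queries that share at least one top hit is an assumption made here, and the function names and input tuples `(query, subject, score)` are hypothetical. The alignment and tree steps are not shown.

```python
from collections import defaultdict


def top_hits(hits, n=5):
    """Keep the n best-scoring subjects for each query.

    hits: iterable of (query, subject, score) tuples.
    """
    by_query = defaultdict(list)
    for query, subject, score in hits:
        by_query[query].append((score, subject))
    return {
        query: [s for _, s in sorted(scored, key=lambda pair: -pair[0])[:n]]
        for query, scored in by_query.items()
    }


def cluster_by_shared_hits(top):
    """Group queries that share at least one top hit (assumed criterion)."""
    parent = {q: q for q in top}

    def find(q):
        # Union-find lookup with path halving.
        while parent[q] != q:
            parent[q] = parent[parent[q]]
            q = parent[q]
        return q

    first_query_for_subject = {}
    for query, subjects in top.items():
        for subject in subjects:
            if subject in first_query_for_subject:
                # Merge this query's group with the group that saw the hit first.
                parent[find(query)] = find(first_query_for_subject[subject])
            else:
                first_query_for_subject[subject] = query

    groups = defaultdict(list)
    for query in top:
        groups[find(query)].append(query)
    return sorted(sorted(members) for members in groups.values())


# Made-up hits for illustration only.
hits = [
    ("q1", "s1", 100.0),
    ("q2", "s1", 90.0),
    ("q2", "s2", 80.0),
    ("q3", "s9", 70.0),
]
print(cluster_by_shared_hits(top_hits(hits)))
```

Each resulting group would then be passed, together with its top hits, to a multiple aligner and a tree builder of the user's choice.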