ITS and Barcoding at GenBank

ITS accuracy at GenBank
Conrad Schoch
Barbara Robbertse
Improving accuracy
• Barcode tag in GenBank
– Barcode submission tool
– Standards
• RefSeq Targeted Loci
– Well validated sequences already in GenBank
• Bacteria all type sequences
• Limited fungal sequences
Formal selection of the fungal DNA barcode
Schoch et al. 2012.
ITS sequence standards
• 1. Standardized sequence title should be "Fungal ITS barcode".
• 2. Annotation
• 3. Length
• 4. Quality of sequence
• 5. Unique or not?
• 6. Meta data
Difference Between GenBank and RefSeq Targeted Loci
GenBank | RefSeq
Not curated | Curated
Author submits | NCBI creates from existing data
Only author can revise | NCBI revises as new data emerge
Multiple records for same loci common | Single records for each molecule of major organisms
Records can contradict each other |
No limit to species included | Limited to model organisms
Data exchanged among INSDC members | Exclusive NCBI database
Akin to primary literature | Akin to review articles
Proteins identified and linked | Proteins and transcripts identified and linked
Access via NCBI Nucleotide databases | Access via Nucleotide & Protein databases
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.section.GenBank_ASM
150 Yeast sequences
1. The same ITS accession is associated with different species/strains in the list.
2. The taxon name associated with an accession in the provided list is not found in NCBI. The strain name indicated on the GenBank accession is not found in the culture collection database (wrong strain name in GenBank?).
3. Incomplete accession identifiers in the list.
4. A few accessions in the list do not exist in GenBank.
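Two of the problems listed above (incomplete accession identifiers, and one accession tied to several species) can be screened for offline before any record is looked up. The following is a minimal sketch, not the procedure used in the talk: the function name `audit_accession_list` and the sample rows are hypothetical, and the pattern covers only the classic nucleotide accession shapes (one letter plus five digits, or two letters plus six or eight digits). Checking whether an accession actually exists in GenBank would still require a database query.

```python
import re
from collections import defaultdict

# Classic nucleotide accession shapes: 1 letter + 5 digits, or 2 letters +
# 6 or 8 digits, optionally followed by a ".version" suffix.
ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})(?:\.\d+)?$")


def audit_accession_list(rows):
    """Flag malformed accessions and accessions tied to more than one taxon.

    rows: iterable of (accession, taxon_name) pairs from a curated list.
    Returns (malformed, conflicting), where conflicting maps an accession
    to the sorted taxon names it was listed under.
    """
    malformed = []
    taxa_by_accession = defaultdict(set)
    for accession, taxon in rows:
        acc = accession.strip().upper()
        if not ACCESSION_RE.match(acc):
            malformed.append(accession)
            continue
        # Drop the version so AB000001 and AB000001.1 are compared as one record.
        taxa_by_accession[acc.split(".")[0]].add(taxon.strip())
    conflicting = {a: sorted(t) for a, t in taxa_by_accession.items() if len(t) > 1}
    return malformed, conflicting


# Hypothetical rows, invented for illustration only.
rows = [
    ("AB000001", "Species one"),
    ("AB000001.1", "Species two"),   # same accession, different taxon
    ("AF12345", "Species three"),    # one digit short: incomplete identifier
    ("KC123456", "Species four"),
]
malformed, conflicting = audit_accession_list(rows)
print(malformed)
print(conflicting)
```

Malformed entries are reported with their original spelling so they can be traced back to the submitted list.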
Checklist for ITS accessions added to the Targeted Loci RefSeq project:
----------------------------------------------------------------------------------------
1) Sourced from a type specimen.
2) Primary GenBank name and current name at CBS are the same.
3) Strand in the correct orientation.
4) Type info added from CBS to /note.
5) Added feature /culture_collection.
6) Added feature /identified_by (source CBS).
7) Moved information in note to /isolation_source.
8) All 26S labeled 28S.
9) Reannotated (used 5.8S Rfam borders; used 3’ 18S boundaries (CATTA motif) and 5’ 28S border (GACCT motif) as guide in an alignment).
10) Added PMID if available.
11) Checked hits with MOLE-BLAST.
12) Example defline (note it has no strain info):
DEFINITION Trichosporon veenhuisii 18S ribosomal RNA gene, partial sequence;
internal transcribed spacer 1, 5.8S ribosomal RNA gene, and
internal transcribed spacer 2, complete sequence; and 28S ribosomal
RNA gene, partial sequence.
13) Features used: /rRNA for 18S and 28S and /misc_RNA for ITS1 and ITS2.
14) Standardized names used in product qualifier: 18S ribosomal RNA, 5.8S ribosomal RNA, 28S ribosomal RNA, internal transcribed spacer 1, internal transcribed spacer 2.
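Some checklist items lend themselves to simple automated checks: relabeling 26S as 28S, comparing product names against the standardized list, and locating the CATTA and GACCT motifs that the checklist uses as boundary guides. The sketch below is an illustration under stated assumptions, not the curation code itself. All function names are hypothetical, the toy sequence is invented, and because five-base motifs can occur by chance, the motif search only returns candidate positions to be reviewed in an alignment.

```python
# Standardized product names, taken from the checklist.
STANDARD_PRODUCTS = {
    "18S ribosomal RNA",
    "5.8S ribosomal RNA",
    "28S ribosomal RNA",
    "internal transcribed spacer 1",
    "internal transcribed spacer 2",
}


def normalize_product(name):
    """Collapse whitespace and relabel a leading '26S' as '28S'."""
    cleaned = " ".join(name.split())
    if cleaned.startswith("26S "):
        cleaned = "28S " + cleaned[len("26S "):]
    return cleaned


def nonstandard_products(products):
    """Return the product names that are not standard even after normalization."""
    return [p for p in products if normalize_product(p) not in STANDARD_PRODUCTS]


def motif_positions(seq, motif):
    """Return the 0-based start of every occurrence of motif in seq."""
    seq = seq.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


# Invented toy sequence: an 18S-like tail, a short spacer, a 28S-like head.
toy = "GGATCATTA" + "ACGTACGT" + "TTGACCTCGG"

print(normalize_product("26S  ribosomal RNA"))
print(nonstandard_products(["18S ribosomal RNA", "26S ribosomal RNA", "ITS1"]))
print(motif_positions(toy, "CATTA"), motif_positions(toy, "GACCT"))
```

Returning every candidate position, rather than picking one, keeps the final boundary decision with the curator.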
Annotation of 150 ITS records in GenBank
#records | Features | Annotation in note or product
127 | /misc_RNA | contains 18S ribosomal RNA, internal transcribed spacer 1, 5.8S ribosomal RNA, internal transcribed spacer 2, and 28S ribosomal RNA (or 26S ribosomal RNA)
4 | /misc_RNA or /misc_feature | contains internal transcribed spacer 1, 5.8S ribosomal RNA and internal transcribed spacer 2
8 | /misc_RNA and /rRNA | internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2
2 | /misc_feature and /gene and /rRNA | internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2
6 | /rRNA and /gene and /misc_feature | 18S ribosomal RNA; internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2; 26S ribosomal RNA
3 | /rRNA and /misc_RNA | 18S ribosomal RNA; internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2; 28S ribosomal RNA (or 26S ribosomal RNA)
Annotation of 150 ITS records in RefSeq
#records | Features | Annotation in product
138 | /rRNA and /misc_RNA | 18S ribosomal RNA; internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2; 28S ribosomal RNA
2 | /rRNA and /misc_RNA | 18S ribosomal RNA; internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2
10 | /misc_RNA and /rRNA | internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2
RefSeq Accession number
Project number
RefSeq references
Expanded qualifiers
MOLE-BLAST
• Algorithm:
– 1. For each query, run a BLAST search against nr and collect the top five hits.
– 2. Cluster query sequences into groups corresponding to different loci.
– 3. For each locus:
• Compute a multiple alignment for the queries and their top five BLAST hits.
• Compute a phylogenetic tree based on the multiple alignment.
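The control flow of the three steps above can be sketched in a few lines. This is a toy illustration, not MOLE-BLAST itself. A k-mer Jaccard score stands in for the BLAST search, an in-memory dictionary stands in for nr, and the grouping rule (queries that share at least one hit belong to the same locus) is an assumption, since the slides do not say how the clustering is done. The alignment and tree steps are not implemented; the sketch only assembles the sequences each locus would hand to an aligner. All names and sequences are hypothetical.

```python
def kmers(seq, k=4):
    """Return the set of k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def top_hits(query, database, n=5):
    """Stand-in for a BLAST search: rank entries by k-mer Jaccard similarity."""
    q = kmers(query)
    scored = []
    for name, seq in database.items():
        d = kmers(seq)
        union = q | d
        score = len(q & d) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, name))
    # Best score first; ties broken by name so the result is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:n]]


def group_by_locus(hits_by_query):
    """Merge queries that share at least one hit (assumed clustering rule)."""
    groups = []  # list of (query ids, pooled hit ids); hit sets stay disjoint
    for query_id, hits in hits_by_query.items():
        members, pooled = {query_id}, set(hits)
        remaining = []
        for other_members, other_hits in groups:
            if pooled & other_hits:
                members |= other_members
                pooled |= other_hits
            else:
                remaining.append((other_members, other_hits))
        groups = remaining + [(members, pooled)]
    return groups


def mole_blast_sketch(queries, database, n=5):
    """Steps 1 and 2, plus the per-locus input that step 3 would consume."""
    hits_by_query = {qid: top_hits(seq, database, n) for qid, seq in queries.items()}
    loci = []
    for members, hits in group_by_locus(hits_by_query):
        # A real run would align these sequences and build a tree per locus.
        loci.append({"queries": sorted(members), "hits": sorted(hits)})
    return sorted(loci, key=lambda locus: locus["queries"])


# Invented toy data: A/C-only and G/T-only sequences play two different loci.
database = {
    "ref_ac_1": "AACCACACCAACC",
    "ref_ac_2": "CCACACCAAC",
    "ref_gt_1": "GGTTGTGTTGGT",
}
queries = {"q1": "AACCACACCAAC", "q2": "ACCACACCAACA", "q3": "GGTTGTGTTGG"}
loci = mole_blast_sketch(queries, database)
for locus in loci:
    print(locus)
```

With the toy data, `q1` and `q2` fall into one group because they share reference hits, while `q3` forms its own group.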
Adding microbial type strain data to the taxonomy database
• Upload all types together with names in NCBI Taxonomy
• Cross reference this as a property in Entrez
• Enable search restricted to ex-type sequences
• Start with Euzeby list
What next?
Expand other markers for the ‘known universe’
• Secondary barcode-type markers – list and communicate resources
• Highlight ‘problematic’ ITS taxa
• Provide barcodes for all genomes
• Ensure genome samples are correctly identified
• Integrate sequences with fungal names
BaG (Barcode all genera) of Fungi, proposed goals
Sequence for more than 3000 genera in GenBank
• Compare GenBank and MycoBank taxonomies
• Highlight types in GenBank taxonomy
Target lists for all fungal genera – focused on type species
• 16 000 Genera (5000 with full meta-data in MycoBank)
One name one fungus = opportunity
Acknowledgments
ITS Meta data: Centraalbureau voor Schimmelcultures (CBS)
MOLE-BLAST: Grzegorz Boratyn, Tom Madden
Taxonomy type updates: Scott Federhen