NGS Journal Club

Introduction to the CGE servers
Center for Genomic Epidemiology
Aim:
• To provide the scientific foundation for future internet-based
solutions, where a central database will enable simplification of
total genome sequence information and comparison to all other
sequenced isolates including spatial-temporal analysis.
• To develop algorithms for rapid analyses of whole genome DNA
sequences, tools for analyses and extraction of information from
the sequence data and internet/web-interfaces for using the
tools in the global scientific and medical community.
Tools for species identification
Name of Service | Description | URL (cge.cbs.dtu.dk/services/) | Status | Publication
SpeciesFinder | Species identification using 16S rRNA | SpeciesFinder | Online | Published Feb 2014, PMID: 24574292
KmerFinder | Species identification using overlapping 16mers | KmerFinder | Online | Published Jan 2014, PMID: 24172157
TaxonomyFinder | Taxonomy identification using functional protein domains | TaxonomyFinder | | Published in PMID: 24574292 + Oksana's PhD thesis
Reads2Type | Species identification on client computer | Reads2Type | Online | Published Feb 2014, PMID: 24574292
Benchmarking of Methods for
Bacterial Species Identification
PMID: 24574292
Training data
• 1,647 completed / almost completed genomes downloaded from NCBI in 2011 (1,009 different species)
Evaluation data
• NCBI draft genomes
– 695 isolates from species that overlap with training set (151 species)
• SRA draft genomes
– 10,407 sets of short reads from Illumina (168 species)
– 10,407 draft genomes from Illumina data (168 species)
16S rRNA
• 16S rRNA sequencing has dominated molecular taxonomy of
prokaryotes for more than 30 years (Fox et al., Int. J. Syst. Bacteriol., 1977)
• Tremendous amounts of 16S rRNA sequence data are available
in databases
Concerns:
• Low resolution
• Some genomes contain several copies of the 16S rRNA gene
with inter-gene variation
• The 16S rRNA gene represents only about 0.1% of the coding
part of a microbial genome
CGE implementation of
16S species identification
SpeciesFinder
Reference database
• 16S rRNA genes are isolated from genomes in training data using
RNAmmer (Lagesen, NAR, 2007).
Method
• Input genomes are BLASTed against the 16S rRNA genes in the reference database.
• The best hit is selected based on a combination of coverage, % identity,
bitscore, number of mismatches and number of gaps in the alignments.
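A minimal sketch of how such a best-hit selection could look. The slides do not say how the five criteria are combined, so the lexicographic ordering below (coverage, then identity, then bitscore, then fewer mismatches, then fewer gaps) is an assumption, and the `Hit` class, `rank_key`, `best_hit` and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject: str      # species label of the reference 16S rRNA gene
    coverage: float   # fraction of the reference gene covered (0-1)
    identity: float   # percent identity of the alignment
    bitscore: float
    mismatches: int
    gaps: int


def rank_key(hit):
    # Assumed ordering, not taken from the slides: higher coverage,
    # identity and bitscore are better; fewer mismatches and gaps are better.
    return (hit.coverage, hit.identity, hit.bitscore, -hit.mismatches, -hit.gaps)


def best_hit(hits):
    """Return the top-ranked hit, or None when there are no hits."""
    return max(hits, key=rank_key) if hits else None


# Made-up alignment values, for illustration only.
hits = [
    Hit("Escherichia coli", 1.00, 99.8, 2780.0, 3, 0),
    Hit("Shigella flexneri", 1.00, 99.8, 2775.0, 3, 1),
    Hit("Salmonella enterica", 0.98, 97.1, 2500.0, 40, 2),
]
print(best_hit(hits).subject)
```

The two top hits tie on coverage and identity and are separated only by bitscore, which illustrates the low resolution of 16S rRNA noted under Concerns.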
KmerFinder
• Genomes in the training data are chopped into overlapping 16mers:
A T G A C G T A T G A T T G A T G A C G T A G T A G T C C
9mer
• Immune system inspired downsampling
• Only 16mers with specific prefix are kept
MHC-I
16mer database
ATGAATGTGTGAGTGA
ATGACTGTGCCCCTGA
Unknown isolate
Unique 16 mers:
ATGAATGTGTGAGTGA
ATGACTGTGCCCCTGA
CP001921 (Acinetobacter baumannii)
CP000521 (Acinetobacter baumannii)
CP002522 (Acinetobacter baumannii)
CP001921 (Acinetobacter baumannii)
CP002301 (Buchnera aphidicola)
Species | Match | No. of Kmer hits
Acinetobacter baumannii | CP001921 | 2
Acinetobacter baumannii | CP000521 | 1
Acinetobacter baumannii | CP002521 | 1
Buchnera aphidicola | CP002301 | 1
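The lookup illustrated above can be sketched in a few lines. This is an illustration under stated assumptions, not the KmerFinder implementation: the prefix `ATGA` is chosen only because both example 16mers start with it, the reference names `ref_A`, `ref_B` and `ref_C` are hypothetical, and the toy "genomes" are built from the two example 16mers.

```python
from collections import Counter, defaultdict

K = 16


def kmers(seq, k=K, prefix=""):
    """Yield the unique overlapping k-mers of seq that start with prefix."""
    seq = seq.upper()
    seen = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer.startswith(prefix) and kmer not in seen:
            seen.add(kmer)
            yield kmer


def build_index(genomes, k=K, prefix=""):
    """Map every retained k-mer to the set of references containing it."""
    index = defaultdict(set)
    for name, seq in genomes.items():
        for kmer in kmers(seq, k, prefix):
            index[kmer].add(name)
    return index


def count_hits(query, index, k=K, prefix=""):
    """Count, per reference, how many unique query k-mers it shares."""
    hits = Counter()
    for kmer in kmers(query, k, prefix):
        for name in index.get(kmer, ()):
            hits[name] += 1
    return hits


KMER_1 = "ATGAATGTGTGAGTGA"
KMER_2 = "ATGACTGTGCCCCTGA"
# Toy references (hypothetical names and contents).
genomes = {"ref_A": KMER_1 + KMER_2, "ref_B": KMER_1, "ref_C": KMER_2}

index = build_index(genomes, prefix="ATGA")
hits = count_hits(KMER_1 + KMER_2, index, prefix="ATGA")
print(hits.most_common())
```

Keeping only k-mers with a fixed prefix shrinks the database while still sampling positions across the whole genome; the reference with the most shared k-mers is reported as the best match.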
ATGAAAAAAAAAAAA
KmerFinder is very robust – it only needs one 16mer!
Desulfovibrio piger GOR1 SRR097356
>NODE 4 length 92 cov 23.119566
TAGGACGTGGAATATGGCAAGAAAACTGAAAATCATGGAAAATGAGAAACATCCACTTGA
CGACTTGAAAAATGACGAAATCACTAAAAAACGTGAAAAATGAGAAATGC
>NODE 15 length 82 cov 2.792683
AGCGAAAAATGTCATAACAACGATCACGACCGATAACCATCTTTGGTCCAAACTTACTCA
CGCAGCAGGCGTATAACTCGCGCATACCAGCTTTGGGCAT
N50 = 110
Total no. of bp: 210
Prediction
Species | Match | No. of Kmer hits
Flavobacterium psychrophilum | AM398681 | 1
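The assembly statistics quoted for the Desulfovibrio piger example can be checked with a short N50 calculation. The two contig sequences shown contain 110 and 100 bases, which agrees with the reported total of 210 bp (the `length` fields in the headers are smaller, as is usual for assemblers that report node length in k-mers). The function name `n50` is just a label for this sketch.

```python
def n50(lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches half of the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


contig_lengths = [110, 100]  # bases counted in the two contigs shown above
print(sum(contig_lengths), n50(contig_lengths))
```

With only two short contigs the largest one already covers more than half of the 210 bp, so N50 equals 110.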
TaxonomyFinder
Reads2Type
• Definition: Quick & dirty taxonomy identification of single isolates
• 50-mer database of marker genes
– 16S rRNA: Training data genomes -> RNAmmer (other)
– ITS: Training data (Mycobacterium)
– GyrB: Training data (Enterobacteriaceae)
– Resulting database ~5 MB
• Reads2Type pushes analysis to user, server provides 50-mer database
• SuffixTree: efficient data structure for string matching
• Narrow Down Approach:
– Reads2Type compares 50-mers of combined marker genes against raw reads
– Shared Probes vs Unique Probe
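One possible reading of the narrow-down approach, sketched under assumptions: probes shared by several species shrink the candidate set, and a probe unique to one species settles the call. The slides give no further detail, so `narrow_down`, the species labels and the probes are hypothetical; the probes are 10-mers rather than 50-mers to keep the example short, and plain substring search stands in for the suffix tree.

```python
def narrow_down(reads, probes):
    """probes maps a probe sequence to the set of species carrying it.

    Each probe found in a read restricts the candidates to the species
    sharing that probe; the search stops once one candidate remains.
    """
    candidates = set().union(*probes.values())
    for read in reads:
        for probe, species in probes.items():
            if probe in read:  # substring search instead of a suffix tree
                candidates &= species
                if len(candidates) == 1:
                    return candidates
    return candidates


# Hypothetical probes: two shared, one unique to sp_A.
probes = {
    "ACGTACGTAC": {"sp_A", "sp_B", "sp_C"},
    "TTGACCGGTA": {"sp_A", "sp_B"},
    "GGCATTCCAG": {"sp_A"},
}
reads = ["GGACGTACGTACTT", "CATTGACCGGTAGG", "AAGGCATTCCAGTC"]
print(narrow_down(reads, probes))
```

Because reads are processed one by one and the loop exits as soon as a single species is left, only a fraction of the raw reads needs to be inspected on the client computer.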
rMLST
Jolley KA, Bliss CM, Bennett JS, Bratcher HB, Brehony C, Colles FM, Wimalarathna
H, Harrison OB, Sheppard SK, Cody AJ, Maiden MC. Ribosomal multilocus sequence
typing: universal characterization of bacteria from domain to strain.
Microbiology. 2012 Apr;158(Pt 4):1005-15.
CGE implementation
• For each genome in the training data, the 53 ribosomal genes were extracted.
• Genomes in the evaluation sets were aligned to each gene collection using BLAT (only hits
with at least 95% identity and 95% coverage were considered as potential matches).
• The closest match among the training genomes was selected based on a combination of
coverage, % identity, bitscore, number of mismatches and number of gaps in the
alignments across all genes.
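The filtering and per-genome aggregation described above might look roughly like this. The 95% thresholds come from the slide; the way evidence is combined across genes (number of genes matched, then summed bitscore) is an assumption, and `closest_genome`, the genome labels and the alignment values are hypothetical.

```python
from collections import defaultdict

MIN_IDENTITY = 95.0  # percent, from the slide
MIN_COVERAGE = 95.0  # percent, from the slide


def closest_genome(hits):
    """hits: iterable of (genome, gene, identity_pct, coverage_pct, bitscore)."""
    best_per_gene = defaultdict(dict)  # genome -> gene -> best bitscore
    for genome, gene, identity, coverage, bitscore in hits:
        if identity >= MIN_IDENTITY and coverage >= MIN_COVERAGE:
            previous = best_per_gene[genome].get(gene, 0.0)
            best_per_gene[genome][gene] = max(previous, bitscore)
    if not best_per_gene:
        return None
    # Assumed ranking: most ribosomal genes matched, then highest summed bitscore.
    return max(
        best_per_gene,
        key=lambda g: (len(best_per_gene[g]), sum(best_per_gene[g].values())),
    )


# Made-up alignment results for two training genomes and two ribosomal genes.
hits = [
    ("genome_1", "rpsA", 99.0, 100.0, 900.0),
    ("genome_1", "rplB", 98.5, 99.0, 700.0),
    ("genome_2", "rpsA", 99.5, 100.0, 910.0),
    ("genome_2", "rplB", 94.0, 99.0, 650.0),  # below the identity threshold
]
print(closest_genome(hits))
```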
Results
(16s rRNA)
Overlap in predictions
Isolates in the NCBIdrafts set for which all four methods predict the species to be different
from the annotated one.
* NZAEPO00000000 has been re-annotated as S. oralis since we downloaded the data.
Speed
Method | Estimated speed (mm:ss)
16S | 00:13*
KmerFinder | 00:09*
TaxonomyFinder | 11:33*
rMLST | 00:45*
Reads2Type | 00:55**
*Estimation based on draft genomes
**Estimation based on short reads
Summary of taxonomy
benchmark study
• KmerFinder had the highest accuracy and was
the fastest method.
• SpeciesFinder (16S rRNA-based) had the
lowest accuracy.
• Methods that only sample genomic loci (16S,
Reads2Type, rMLST) had difficulty distinguishing species that diverged only recently,
especially when the main difference is a plasmid.
Tools for further typing
Name of Service | Description | URL (https://cge.cbs.dtu.dk/services/) | Publication
MLST | Multilocus sequence typing | MLST | Published Apr 2012, PMID: 22238442
PlasmidFinder | Identification of plasmids in Enterobacteriaceae | PlasmidFinder | Published Apr 2014, PMID: 24777092
pMLST | pMLST of plasmids in Enterobacteriaceae | pMLST | Published Apr 2014, PMID: 24777092
Multilocus Sequence Typing (MLST)
First developed in 1998 for Neisseria meningitidis
(Maiden et al. PNAS 1998. 95:3140-3145)
• The nucleotide sequences of internal regions of
approx. 7 housekeeping genes are determined by PCR
followed by Sanger sequencing
• Different alleles are each assigned an arbitrary number
• The unique combination of alleles is the sequence type (ST)
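The last two points amount to a table lookup: the allele numbers at the scheme's loci form a profile, and each known profile has an ST. A minimal sketch, in which the locus names, allele numbers and ST values are invented for illustration and do not come from any real MLST scheme:

```python
def sequence_type(alleles, loci, profiles):
    """Return the ST for an allele profile, or None if the profile is unknown."""
    key = tuple(alleles[locus] for locus in loci)
    return profiles.get(key)


# Hypothetical seven-locus scheme.
LOCI = ("geneA", "geneB", "geneC", "geneD", "geneE", "geneF", "geneG")
PROFILES = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (1, 2, 1, 1, 3, 1, 1): 2,
}

isolate = {"geneA": 1, "geneB": 2, "geneC": 1, "geneD": 1,
           "geneE": 3, "geneF": 1, "geneG": 1}
print(sequence_type(isolate, LOCI, PROFILES))
```

A profile that is absent from the table, for instance because one locus carries a new allele, has no ST yet, which is why the function returns `None` in that case.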
Using WGS data for MLST
www.cbs.dtu.dk/services/MLST

Input data types:
• Assembled genome
• 454 – single end reads
• 454 – paired end reads
• Illumina – single end reads
• Illumina – paired end reads
• Ion Torrent
• SOLiD – single end reads
• SOLiD – mate pair reads

MLST schemes:
Acinetobacter baumannii #1
Acinetobacter baumannii #2
Arcobacter
Borrelia burgdorferi
Bacillus cereus
Brachyspira hyodysenteriae
Bifidobacterium
Brachyspira intermedia
Bordetella
Burkholderia pseudomallei
Brachyspira
Burkholderia cepacia complex
Campylobacter jejuni
Clostridium botulinum
Clostridium difficile #1
Clostridium difficile #2
Campylobacter helveticus
Campylobacter insulaenigrae
Clostridium septicum
C. diphtheriae
Campylobacter fetus
Chlamydiales
Campylobacter lari
Cronobacter
C. upsaliensis
Escherichia coli #1
Escherichia coli #2
Enterococcus faecalis
Enterococcus faecium
F. psychrophilum
Haemophilus influenzae
Haemophilus parasuis
Helicobacter pylori
Klebsiella pneumoniae
Lactobacillus casei
Lactococcus lactis
Leptospira
Listeria
Listeria monocytogenes
Moraxella catarrhalis
Mannheimia haemolytica
Neisseria
P. gingivalis
P. acnes
Pseudomonas aeruginosa
Pasteurella multocida
Pasteurella multocida
Staphylococcus aureus
Streptococcus agalactiae
Salmonella enterica
Staphylococcus epidermidis
S. maltophilia
Streptococcus pneumoniae
Streptococcus oralis
S. zooepidemicus
Streptococcus pyogenes
Streptococcus suis
Streptococcus thermophilus
Streptomyces
Streptococcus uberis
Vibrio parahaemolyticus
Vibrio vulnificus
Wolbachia
Xylella fastidiosa
Y. pseudotuberculosis
Extended Output
aro: WARNING, Identity: 100%, HSP/Length: 349/498, Gaps: 0, aro_122 is the best match for aro
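In this example the alignment (HSP) spans only 349 of the 498 bases of the allele, so the hit is flagged even though identity is 100%. A small sketch of such a check follows; the exact rule used by the web-service is not given in the slides, so the condition below and the label `PERFECT MATCH` are assumptions.

```python
def allele_status(identity_pct, hsp_length, allele_length, gaps):
    """Flag any hit that is not a full-length, gap-free, 100% identical match."""
    perfect = identity_pct == 100.0 and hsp_length == allele_length and gaps == 0
    return "PERFECT MATCH" if perfect else "WARNING"


# Values from the extended output line above.
status = allele_status(100.0, 349, 498, 0)
coverage_pct = 100.0 * 349 / 498
print(status, round(coverage_pct, 1))
```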
What is the MLST web-service used for?
PlasmidFinder and pMLST
The PlasmidFinder database contains replicons, not entire plasmids.
Tools for phenotyping
Name of Service | Description | URL (https://cge.cbs.dtu.dk/services/) | Publication
ResFinder | Identification of acquired antibiotic resistance genes | ResFinder | Published Nov 2012, PMID: 22782487
VirulenceFinder | Identification of virulence genes in E. coli (and S. aureus and Enterococcus) | VirulenceFinder | E. coli published Feb 2014, PMID: 24574290
MyDbFinder | Identification of genes from the user's own database | MyDbFinder | Will be published in book chapter
PathogenFinder | Prediction of pathogenic potential | PathogenFinder | Published Oct 2013, PMID: 24204795
ResFinder
Workflow (diagram):
• NGS reads (Illumina, Ion Torrent, 454, ...) -> Assembly pipeline -> ResFinder (BLAST) -> Resistance gene profile
• Fasta / Sanger sequences -> ResFinder (BLAST) -> Resistance gene profile
• Resistance gene profile: list of genes, accession numbers, theoretical resistance phenotype
• 200 isolates from 4 different species (Salmonella Typhimurium,
Escherichia coli, Enterococcus faecalis and Enterococcus faecium)
• ResFinder, 98% ID, 60% length coverage
• Phenotypic tests, 3,051 in total
– 482 Resistant
– 2,569 Susceptible
=> 99.74% of the results were in agreement between ResFinder and
the phenotypic tests
23 discrepancies -> 16, typically in relation to spectinomycin in E. coli
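The thresholds used in this evaluation (98% identity, 60% length coverage) and the three parts of the resistance gene profile can be illustrated with a short filter. This is a sketch, not ResFinder's code: `resistance_profile`, the accession placeholders and the alignment values are hypothetical, and the small gene-to-class map is only an illustrative excerpt.

```python
MIN_IDENTITY = 98.0  # percent identity threshold quoted above
MIN_COVERAGE = 60.0  # percent of the reference gene length covered

# Illustrative excerpt of a gene -> antibiotic class map.
PHENOTYPE = {
    "blaTEM-1": "beta-lactam",
    "tet(A)": "tetracycline",
    "aadA1": "aminoglycoside",
}


def resistance_profile(hits):
    """hits: iterable of (gene, accession, identity_pct, hsp_len, gene_len).

    Returns (gene, accession, theoretical phenotype) for every hit that
    passes both thresholds.
    """
    profile = []
    for gene, accession, identity, hsp_len, gene_len in hits:
        coverage = 100.0 * hsp_len / gene_len
        if identity >= MIN_IDENTITY and coverage >= MIN_COVERAGE:
            profile.append((gene, accession, PHENOTYPE.get(gene, "unknown")))
    return profile


# Made-up BLAST results; accession strings are placeholders.
hits = [
    ("blaTEM-1", "ACC0001", 100.0, 861, 861),
    ("tet(A)", "ACC0002", 97.2, 1200, 1200),  # identity too low
    ("aadA1", "ACC0003", 99.0, 400, 792),     # coverage too low
]
print(resistance_profile(hits))
```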
Alternatives to ResFinder
Unpublished or uncategorized
Name of Service | Description | URL (https://cge.cbs.dtu.dk/services/) | Status | Publication
PanFunPro | Groups homologous proteins based on functional domain content | PanFunPro | Online | Published in F1000Research 2013, 2:265
SerotypeFinder | Identification of serotypes | SerotypeFinder-1.0 | Online | Not yet published
Restriction-ModificationFinder | Identification of RM system genes | RestrictionModificationFinder | Online | Will only be published in book chapter
HostPhinder | Prediction of the host of a bacteriophage | HostPhinder | Online, but under development | Not yet published
MetaVirFinder | Identification of virus in metagenomic data | MetaVirFinder | Online, but under development | Not yet published
MGmapper | Identifies the content of metagenomic samples | MGmapper | Online, but under development | Not yet published
Tools for phylogeny
Name of Service | Description | URL (cge.cbs.dtu.dk/services) | Status | Publication
SnpTree | Creation of phylogenetic trees based on SNPs | snpTree | Online | Published Dec 2012, PMID: 23281601
CSIPhylogeny | Creation of phylogenetic trees based on SNPs | CSIPhylogeny | Online | Planned
NDtree | Creation of phylogenetic trees | NDtree | Online | Published in Feb 2014, PMID: 24505344
Web-service usage
Type of data uploaded to MLST web-service
454, single reads
454, paired-end
Ion torrent
Illumina, single reads
Illumina, paired-end reads
Assembled draft genomes