Introduction to the CGE servers
Center for Genomic Epidemiology

Aim:
• To provide the scientific foundation for future internet-based solutions, in which a central database will simplify the handling of total genome sequence information and enable comparison to all other sequenced isolates, including spatial-temporal analysis.
• To develop algorithms for rapid analysis of whole-genome DNA sequences, tools for analysing and extracting information from the sequence data, and internet/web interfaces for using the tools in the global scientific and medical community.

Tools for species identification

Name of Service | Description | URL (cge.cbs.dtu.dk/services/) | Status | Publication
SpeciesFinder | Species identification using 16S rRNA | SpeciesFinder | Online | Published Feb 2014, PMID: 24574292
KmerFinder | Species identification using overlapping 16mers | KmerFinder | Online | Published Jan 2014, PMID: 24172157
TaxonomyFinder | Taxonomy identification using functional protein domains | TaxonomyFinder | | Published in PMID: 24574292 + Oksana's PhD thesis
Reads2Type | Species identification on client computer | Reads2Type | Online | Published Feb 2014, PMID: 24574292

Benchmarking of Methods for Bacterial Species Identification
PMID: 24574292

Training data
• 1,647 completed / almost completed genomes downloaded from NCBI in 2011 (1,009 different species)

Evaluation data
NCBI draft genomes
• 695 isolates from species that overlap with the training set (151 species)
SRA draft genomes
• 10,407 sets of short reads from Illumina (168 species)
• 10,407 draft genomes from Illumina data (168 species)

16S rRNA
• 16S rRNA sequencing has dominated molecular taxonomy of prokaryotes for more than 30 years (Fox et al., Int. J. Syst. Bacteriol., 1977)
• Tremendous amounts of 16S rRNA sequence data are available in databases
Concerns:
• Low resolution
• Some genomes contain several copies of the 16S rRNA gene with inter-gene variation
• The 16S rRNA gene represents only about 0.1% of the coding part of a microbial genome

SpeciesFinder: CGE implementation of 16S species identification
Reference database
• 16S rRNA genes are isolated from the genomes in the training data using RNAmmer (Lagesen, NAR, 2007).
Method
• Input genomes are BLASTed against the 16S rRNA genes in the reference database.
• The best hit is selected based on a combination of coverage, % identity, bitscore, number of mismatches and number of gaps in the alignments.

KmerFinder
• Genomes in the training data are chopped into 16mers
• Immune system inspired downsampling
• Only 16mers with a specific prefix are kept
[Figure: example sequence ATGACGTATGATTGATGACGTAGTAGTCC; figure labels: 9mer, MHC-I]

16mer database
16mer | Genomes containing the 16mer
ATGAATGTGTGAGTGA | CP001921 (Acinetobacter baumannii), CP000521 (Acinetobacter baumannii), CP002522 (Acinetobacter baumannii)
ATGACTGTGCCCCTGA | CP001921 (Acinetobacter baumannii), CP002301 (Buchnera aphidicola)

Unknown isolate
Unique 16mers: ATGAATGTGTGAGTGA, ATGACTGTGCCCCTGA

Species | Match | No. of Kmer hits
Acinetobacter baumannii | CP001921 | 2
Acinetobacter baumannii | CP000521 | 1
Acinetobacter baumannii | CP002521 | 1
Buchnera aphidicola | CP002301 | 1

ATGAAAAAAAAAAAA
KmerFinder is very robust – it only needs one 16mer!

Desulfovibrio piger GOR1, SRR097356
>NODE 4 length 92 cov 23.119566
TAGGACGTGGAATATGGCAAGAAAACTGAAAATCATGGAAAATGAGAAACATCCACTTGA
CGACTTGAAAAATGACGAAATCACTAAAAAACGTGAAAAATGAGAAATGC
>NODE 15 length 82 cov 2.792683
AGCGAAAAATGTCATAACAACGATCACGACCGATAACCATCTTTGGTCCAAACTTACTCA
CGCAGCAGGCGTATAACTCGCGCATACCAGCTTTGGGCAT
N50 = 110
Total no. of bp: 210

Prediction
Species | Match | No. of Kmer hits
Flavobacterium psychrophilum | AM398681 | 1

TaxonomyFinder

Reads2Type
• Definition: quick & dirty taxonomy identification of single isolates
• 50-mer database of marker genes:
  – 16S rRNA: training data genomes, RNAmmer (other)
  – ITS: training data (Mycobacterium)
  – GyrB: training data (Enterobacteriaceae)
  – Resulting database ~5 MB
• Reads2Type pushes the analysis to the user; the server provides the 50-mer database
• SuffixTree: efficient data structure for string matching
• Narrow Down Approach:
  – Reads2Type compares 50-mers of combined marker genes against raw reads
  – Shared Probes vs Unique Probe

rMLST
Jolley KA, Bliss CM, Bennett JS, Bratcher HB, Brehony C, Colles FM, Wimalarathna H, Harrison OB, Sheppard SK, Cody AJ, Maiden MC. Ribosomal multilocus sequence typing: universal characterization of bacteria from domain to strain. Microbiology. 2012 Apr;158(Pt 4):1005-15.

CGE implementation
• For each genome in the training data, the 53 ribosomal genes were extracted.
• Genomes in the evaluation sets were aligned to each gene collection using BLAT (only hits with at least 95% identity and 95% coverage were considered as a potential match).
• The closest match among the training genomes was selected based on a combination of coverage, % identity, bitscore, number of mismatches and number of gaps in the alignments across all genes.

Results (16S rRNA)

Overlap in predictions
Isolates in the NCBI drafts set for which all four methods predict the species to be different from the annotated one.
* NZAEPO00000000 has been re-annotated as S. oralis since we downloaded the data.

Speed
Method | Estimated speed (mm:ss)
16S | 00:13*
KmerFinder | 00:09*
TaxonomyFinder | 11:33*
rMLST | 00:45*
Reads2Type | 00:55**
* Estimation based on draft genomes
** Estimation based on short reads

Summary of taxonomy benchmark study
• KmerFinder had the highest accuracy and was the fastest method.
• SpeciesFinder (16S rRNA-based) had the lowest accuracy.
• Methods that only sample genomic loci (16S, Reads2Type, rMLST) had difficulty distinguishing species that diverged only recently, especially when the main difference is a plasmid.

Tools for further typing

Name of Service | Description | URL (https://cge.cbs.dtu.dk/services/) | Publication
MLST | Multilocus sequence typing | MLST | Published Apr 2012, PMID: 22238442
PlasmidFinder | Identification of plasmids in Enterobacteriaceae | PlasmidFinder | Published Apr 2014, PMID: 24777092
pMLST | pMLST of plasmids in Enterobacteriaceae | pMLST | Published Apr 2014, PMID: 24777092

Multilocus Sequence Typing (MLST)
• First developed in 1998 for Neisseria meningitidis (Maiden et al. PNAS 1998. 95:3140-3145)
• The nucleotide sequences of internal regions of approximately 7 housekeeping genes are determined by PCR followed by Sanger sequencing
• Each distinct allele is assigned an arbitrary number
• The unique combination of alleles is the sequence type (ST)

Using WGS data for MLST
www.cbs.dtu.dk/services/MLST

MLST schemes:
Acinetobacter baumannii #1, Acinetobacter baumannii #2, Arcobacter, Borrelia burgdorferi, Bacillus cereus, Brachyspira hyodysenteriae, Bifidobacterium, Brachyspira intermedia, Bordetella, Burkholderia pseudomallei, Brachyspira, Burkholderia cepacia complex, Campylobacter jejuni, Clostridium botulinum, Clostridium difficile #1, Clostridium difficile #2, Campylobacter helveticus, Campylobacter insulaenigrae, Clostridium septicum, C. diphtheriae, Campylobacter fetus, Chlamydiales, Campylobacter lari, Cronobacter, C. upsaliensis, Escherichia coli #1, Escherichia coli #2, Enterococcus faecalis, Enterococcus faecium, F. psychrophilum, Haemophilus influenzae, Haemophilus parasuis, Helicobacter pylori, Klebsiella pneumoniae, Lactobacillus casei, Lactococcus lactis, Leptospira, Listeria, Listeria monocytogenes, Moraxella catarrhalis, Mannheimia haemolytica, Neisseria, P. gingivalis, P. acnes, Pseudomonas aeruginosa, Pasteurella multocida, Pasteurella multocida, Staphylococcus aureus, Streptococcus agalactiae, Salmonella enterica, Staphylococcus epidermidis, S. maltophilia, Streptococcus pneumoniae, Streptococcus oralis, S. zooepidemicus, Streptococcus pyogenes, Streptococcus suis, Streptococcus thermophilus, Streptomyces, Streptococcus uberis, Vibrio parahaemolyticus, Vibrio vulnificus, Wolbachia, Xylella fastidiosa, Y. pseudotuberculosis

Input data types:
• Assembled genome
• 454 – single end reads
• 454 – paired end reads
• Illumina – single end reads
• Illumina – paired end reads
• Ion Torrent
• SOLiD – single end reads
• SOLiD – mate pair reads

Extended Output
aro: WARNING, Identity: 100%, HSP/Length: 349/498, Gaps: 0, aro_122 is the best match for aro

What is the MLST web-service used for?

PlasmidFinder and pMLST
The PlasmidFinder database contains replicons, not entire plasmids.

Tools for phenotyping

Name of Service | Description | URL (https://cge.cbs.dtu.dk/services/) | Publication
ResFinder | Identification of acquired antibiotic resistance genes | ResFinder | Published Nov 2012, PMID: 22782487
VirulenceFinder | Identification of virulence genes in E. coli (and S. aureus and Enterococcus) | VirulenceFinder | E. coli published Feb 2014, PMID: 24574290
MyDbFinder | Identification of genes from the user's own database | MyDbFinder | Will be published in book chapter
PathogenFinder | Prediction of pathogenic potential | PathogenFinder | Published Oct 2013, PMID: 24204795

ResFinder
[Flowchart: NGS data (Illumina, Ion Torrent, 454, ..) -> Assembly pipeline -> ResFinder (BLAST); Fasta (Sanger) -> ResFinder (BLAST); ResFinder (BLAST) -> Resistance gene profile: list of genes, accession numbers, theoretical resistance phenotype]

• 200 isolates from 4 different species (Salmonella Typhimurium, Escherichia coli, Enterococcus faecalis and Enterococcus faecium)
• ResFinder, 98% ID, 60% length coverage
• Phenotypic tests, 3,051 in total
  – 482 Resistant
  – 2,569 Susceptible
=> 99.74% of the results were in agreement between ResFinder and the phenotypic tests
23 discrepancies -> 16, typically in relation to spectinomycin in E. coli

Alternatives to ResFinder

Unpublished or uncategorized

Name of Service | Description | URL (https://cge.cbs.dtu.dk/services/) | Status | Publication
PanFunPro | Groups homologous proteins based on functional domain content | PanFunPro | Online | Published in F1000Research 2013, 2:265
SerotypeFinder | Identification of serotypes | SerotypeFinder-1.0 | Online | Not yet published
Restriction-ModificationFinder | Identification of RM system genes | RestrictionModificationFinder | Online | Will only be published in book chapter
HostPhinder | Prediction of the host of a bacteriophage | HostPhinder | Online, but under development | Not yet published
MetaVirFinder | Identification of virus in metagenomic data | MetaVirFinder | Online, but under development | Not yet published
MGmapper | Identifies the content of metagenomic samples | MGmapper | Online, but under development | Not yet published

Tools for phylogeny

Name of Service | Description | URL (cge.cbs.dtu.dk/services) | Status | Publication
SnpTree | Creation of phylogenetic trees based on SNPs | snpTree | Online | Published Dec 2012, PMID: 23281601
CSIPhylogeny | Creation of phylogenetic trees based on SNPs | CSIPhylogeny | Online | Planned
NDtree | Creation of phylogenetic trees | NDtree | Online | Published in Feb 2014, PMID: 24505344

Web-service usage
Type of data uploaded to MLST web-service:
• 454, single reads
• 454, paired-end
• Ion Torrent
• Illumina, single reads
• Illumina, paired-end reads
• Assembled draft genomes
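
The MLST principle from the typing slides (each allele of a locus gets a number, and the combination of allele numbers defines the sequence type) can be illustrated with a minimal Python sketch. Everything in it is a hypothetical toy: the locus names, the allele sequences, the profile table and the function names are invented for illustration, and the exact substring matching is a simplification. It is not the CGE MLST web-service, whose extended output (Identity, HSP/Length, Gaps) reports alignment-based matches.

```python
from typing import Dict, Optional, Tuple

# Hypothetical toy scheme with three loci (real schemes use about seven
# housekeeping genes). Sequences and allele numbers are invented.
ALLELES: Dict[str, Dict[str, int]] = {
    "locusA": {"ATGGCA": 1, "ATGGCT": 2},
    "locusB": {"TTGACC": 1, "TTGATC": 2},
    "locusC": {"GGCATA": 1, "GGCGTA": 2},
}

# Hypothetical profile table: combination of allele numbers -> sequence type.
PROFILES: Dict[Tuple[int, ...], int] = {
    (1, 1, 1): 10,
    (2, 1, 2): 11,
}

LOCI = sorted(ALLELES)  # fixed locus order used to build the profile tuple


def call_alleles(genome: str) -> Dict[str, Optional[int]]:
    """Return the allele number per locus, or None if no known allele
    occurs as an exact substring of the genome."""
    calls: Dict[str, Optional[int]] = {}
    for locus in LOCI:
        calls[locus] = next(
            (num for seq, num in ALLELES[locus].items() if seq in genome),
            None,
        )
    return calls


def sequence_type(genome: str) -> Optional[int]:
    """Return the ST for the genome, or None if a locus is missing or the
    allele combination is not in the profile table."""
    calls = call_alleles(genome)
    if any(num is None for num in calls.values()):
        return None
    return PROFILES.get(tuple(calls[locus] for locus in LOCI))
```

With this toy data, `sequence_type("CCATGGCAAATTGACCGGGGCATACC")` returns 10, while a genome carrying a known allele at every locus in a combination that is absent from the profile table returns `None`, which corresponds to an unassigned ST. A tool working on real draft genomes or reads would need alignment and identity/coverage thresholds instead of exact matching.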