Methods for phylogenetic inference from whole-genome sequences and their use in Prokaryote taxonomy
M. Göker, A.F. Auch, H.-P. Klenk
Contents
1. Short introduction to the role of DNA-DNA hybridization (DDH) in Prokaryote taxonomy
2. Current developments in genome sequencing techniques and projects
3. Correlation between in vitro DDH data and in silico whole-genome distances: can we replace DDH?
4. Outlook: the impact of genome sequencing on Prokaryote taxonomy
Genome-based taxonomy
Traditional Methods for Polyphasic Taxonomy

laborious

time-consuming

technically demanding

standardization required

not suitable for all organisms
Picture from Vandamme et al. 1996
DNA-DNA hybridization ultimately decides on the acceptance of new species

To establish a strain as the type strain of a novel species, the similarity between its genomic DNA and the DNA of closely related reference species must be sufficiently low

Measured using DNA-DNA hybridization (DDH) with a threshold of 70%

Several wet-lab methods available: free solution (photometry, nuclease, thermal stability) or membrane-bound (binding, thermal stability)

Techniques are cumbersome and cannot easily be made reproducible; different methods yield different results

Currently carried out in only a few specialized molecular labs in the world
=> Can DDH be replaced by comparing sequenced genomes?
Genome sequencing gets easier and cheaper each day

March 29th 2010: 1117 published archaeal and bacterial genomes from >250 genera

4045 ongoing bacterial genome sequencing projects (data from www.genomesonline.org/gold)
Plot of genomes and genera, modified from www.ncbi.nlm.nih.gov/genomes/MICROBES/microbial_growth.html
Growth of microbial genome sequences
July 1995 to December 2008
Most phyla with cultured species are poorly sampled in genomics ...
... lineages with no cultured taxa are even more poorly sampled!
Current genomic coverage of type strains is unsatisfactory
Situation in April 2009
Picture from Göker & Klenk, in press
The GEBA project aims at filling the gaps
GEBA: Genomic Encyclopedia of Bacteria and Archaea
Approach:
• Take rRNA Tree of Life
• Overlay finished and in-progress genome sequencing projects onto it
• Rank phylotypes by how well they fill gaps in the tree
• Use phylogenetic diversity as measure for strain selection
Strain selection for GEBA main project: Wu et al., Phylogenetic Distance (PD) procedure
Several approaches to directly compare whole genomes are available
“Direct” methods:

HSP-based (e.g., GBDP)

Based on word-count frequencies

Complexity-based (e.g., OS distance)

Based on common substrings (e.g., ACS)
=> all use alignment-free distance calculation to infer genome-to-genome distances (GGD)
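As an illustration of the word-count idea listed above, the following minimal Python sketch compares two sequences by their k-mer frequency profiles. The function names, the default k = 4 and the use of the Euclidean distance are assumptions made for this example; the sketch does not reproduce any specific published method.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(seq, k=4):
    """Return relative frequencies of all overlapping k-mers in seq."""
    seq = seq.upper()
    if len(seq) < k:
        raise ValueError("sequence is shorter than the word length k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def word_frequency_distance(seq_x, seq_y, k=4):
    """Euclidean distance between two k-mer frequency profiles."""
    fx = kmer_frequencies(seq_x, k)
    fy = kmer_frequencies(seq_y, k)
    # Words missing from one profile count as frequency 0.
    words = set(fx) | set(fy)
    return sqrt(sum((fx.get(w, 0.0) - fy.get(w, 0.0)) ** 2 for w in words))
```

Identical sequences yield a distance of 0; sequences sharing no k-mer at all yield the largest values.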
More widespread alternative:

Extraction methods, starting with gene finding and orthology determination
GBDP is based on high-scoring segment pairs (HSPs)
• GBDP = Gen(om)e Blast Distance Phylogeny
• Pairwise homology search via BLAST or related algorithms
• Calculate coverage of genomes
• Apply specific distance functions, e.g.:
d(x,y) := distance between genomes x and y
Ixy := number of identical bases/amino acids in HSPs
Hxy := total HSP length using x as query and y as subject
λ(x,y) := sum of genome lengths (alternative: 2 * minimum length)
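To make the quantities defined above concrete, here is a minimal Python sketch of two distance functions of the form "one minus a similarity ratio". It assumes that HSP lengths and identities are summed over both search directions (x as query, then y as query) and normalised either by the sum of genome lengths or by twice the minimum length. The function names and this symmetric form are assumptions for illustration and are not necessarily the exact functions implemented in GBDP.

```python
def normaliser(len_x, len_y, use_min=False):
    """Sum of genome lengths, or 2 * the shorter length if use_min is set."""
    return 2 * min(len_x, len_y) if use_min else len_x + len_y


def coverage_distance(h_xy, h_yx, len_x, len_y, use_min=False):
    """1 - (total HSP length / normaliser): share of the genomes not in HSPs."""
    return 1.0 - (h_xy + h_yx) / normaliser(len_x, len_y, use_min)


def identity_distance(i_xy, i_yx, h_xy, h_yx):
    """1 - (identical positions / total HSP length) within the HSPs."""
    if h_xy + h_yx == 0:
        raise ValueError("no HSPs found; identity distance is undefined")
    return 1.0 - (i_xy + i_yx) / (h_xy + h_yx)
```

For two hypothetical genomes of 1,000 bp each with 800 bp of HSPs in either direction, `coverage_distance(800, 800, 1000, 1000)` gives approximately 0.2.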
Several methods to obtain HSPs have been described
• Blast (Altschul et al. 1990): probably the most frequently used algorithm in bioinformatics; about 50x faster than exhaustive sequence alignment
• Blat (Kent 2002): faster than Blast when it was introduced (while using more memory)
• Blastz (Schwartz et al. 2003): developed for whole-genome human-mouse alignments
• Mummer (Kurtz et al. 2004): even faster alternative; obtains only exact matches of a specified minimum length (MUMs instead of HSPs)
Empirical assessment of whole-genome comparisons
• Goris et al. (2007) published DDH values for 62 pairs of sequenced genomes for comparison with a genome-sequence- and Blast-based method called "ANI"
• A search in GOLD and in the taxonomic literature revealed an additional 31 genome pairs for which DDH values have been published
• Covering 15 "hybridization groups" (sets of potentially closely related strains, e.g. from the same genus)
• Dataset includes fully sequenced as well as draft genomes
• Correlation between DDH values obtained in vitro and pairwise genome distances calculated in silico was determined
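The comparison described above boils down to correlating two paired lists: DDH values measured in vitro and distances calculated in silico. A minimal Python sketch follows. The Pearson coefficient is assumed here as the correlation measure, and the numbers in the usage lines are made-up toy values, not data from the study.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy values only (hypothetical, for illustration):
toy_distances = [0.05, 0.15, 0.25, 0.40]
toy_ddh = [95.0, 80.0, 70.0, 45.0]
r = pearson_correlation(toy_distances, toy_ddh)  # negative: DDH falls as distance grows
```

Because a distance is being compared with a similarity, the coefficient is negative when the two agree; its magnitude is what matters when ranking methods.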
Blat all in all performs best, followed by NCBI blastn
Analogue of 70% DDH determined via regression or minimum error ratio
s(d) = -122.5151 * d + 100.6280
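The regression above can be read in both directions: to estimate DDH from a distance, and to find the distance that corresponds to the 70% DDH cut-off. The Python sketch below uses the two coefficients given above. The function names are hypothetical, treating the cut-off as inclusive is an assumption, and the coefficients hold only for the particular search program and distance function for which they were fitted.

```python
SLOPE = -122.5151      # regression slope from the formula above
INTERCEPT = 100.6280   # regression intercept from the formula above


def ddh_estimate(distance):
    """Estimated DDH value (in %) for a genome-to-genome distance."""
    return SLOPE * distance + INTERCEPT


def distance_at_ddh(ddh=70.0):
    """Distance at which the regression line reaches the given DDH value."""
    return (ddh - INTERCEPT) / SLOPE


def reaches_cutoff(distance, ddh_cutoff=70.0):
    """True if the estimated DDH reaches the cut-off (taken here as inclusive)."""
    return ddh_estimate(distance) >= ddh_cutoff
```

With these coefficients, a distance of about 0.25 corresponds to a DDH estimate of 70%.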
Mummer performs better for larger minimum exact match lengths
GBDP slightly outperforms ANI
Method                                          Best correlation obtained   Minimal error ratio
GBDP/NCBI Blastn                                0.771                       0.065
GBDP/Blat                                       0.760                       0.065
ANI (Goris et al. 2007)                         0.717                       0.065
"Percentage conserved DNA" (Goris et al. 2007)  0.686                       0.081
• As a first step, ANI splits the genome sequence into 1,000 bp chunks; this seems to be unnecessary
• GBDP includes a richer set of distance functions and HSP/MUM determination programs
Some distance functions accurately predict 70% DDH using only 20% of the genome sequence
DDH correlates less well with 16S rDNA distances than genome-to-genome distances do
Picture from Stackebrandt & Ebers 2006
Method             Best correlation with 16S rRNA distances obtained
DDH                0.533
GBDP/NCBI Blastn   0.602
GBDP/Blat          0.612
GBDP/MUMmer        0.611
Whole-genome comparison available as web service: http://ggdc.gbdp.org/
Genomes received either through upload or from GenBank
...
Results are received via e-mail
One entry per reference genome and distance formula
Dear User,
please find below the results of your GGDC job. A short explanation
is found at the end of this e-mail. Further information is
provided on our website at http://www.gbdp.org/species/. If you use this
service in a publication, please cite the appropriate references listed
there. Also, please report any apparent calculation errors or other
anomalous results to the authors.
Sincerely,
Your GGDC team
========================================
Submission: 09-11-23-15-38-4417682
Query: taxon1
Reference: CP001407.1
Program: NCBI-BLAST
Formula: 1 (HSP length / total length)
Distance: 0.1661101
DDH estimate (regression-based): 76.7180478
Estimate of DDH <=70% (threshold-based): no (threshold=0.2676)
========================================
Submission: 09-11-23-15-38-4417682
Query: taxon1
Reference: CP001407.1
...
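Because each result consists of plain "Key: value" lines separated by rows of "=" characters, the e-mail body lends itself to simple parsing. The Python sketch below assumes exactly the layout of the sample above. The function name is hypothetical, and real result e-mails may contain further fields or a different layout.

```python
def parse_ggdc_results(text):
    """Split a GGDC result e-mail into one dict per '=====' delimited record."""
    records = []
    current = None  # stays None until the first separator row has been seen
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("====="):
            # A separator row closes the record collected so far.
            if current:
                records.append(current)
            current = {}
        elif current is not None and ":" in line:
            # Split at the first colon only; values may contain further colons.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records
```

Text before the first separator row (the greeting) is ignored, so a URL in the greeting is not mistaken for a field. Values are returned as strings; a caller would convert, for example, the `Distance` field with `float()`.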
A standard operating procedure (SOP) for the use of the server is available
Standards in Genomic Sciences (2010) 2: 142-148 DOI:10.4056/sigs.541628
The Genomic Standards Consortium
Standard operating procedure for calculating genome-to-genome distances based on
high-scoring segment pairs
Alexander F. Auch1, Hans-Peter Klenk2*, Markus Goeker2
1 Center for Bioinformatics Tübingen, Eberhard-Karls-Universitaet, Tübingen, Germany
2 DSMZ – German Collection of Microorganisms and Cell Cultures GmbH, Braunschweig, Germany.
* Corresponding author : Hans-Peter Klenk.
Keywords : BLAST, GBDP, GGDC web server, genomics, MUMmer, phylogeny, species delineation, microbial taxonomy.
Standards in Genomic Sciences (2010) 2: 117-134 DOI:10.4056/sigs.531120
The Genomic Standards Consortium
Digital DNA-DNA hybridization for microbial species delineation by means of
genome-to-genome sequence comparison
Alexander F. Auch1, Mathias von Jan2, Hans-Peter Klenk2*, Markus Goeker2
1 Center for Bioinformatics Tübingen, Eberhard-Karls-Universitaet, Tübingen, Germany
2 DSMZ – German Collection of Microorganisms and Cell Cultures GmbH, Braunschweig, Germany
* Corresponding author : Hans-Peter Klenk.
Keywords : Archaea , Bacteria , BLAST, GBDP, genomics, MUMmer, phylogeny, species concept, taxonomy.
Next steps in genome-based taxonomy
Mid-term goals:
• Enlargement of reference datasets for empirical tests and testing of more distance functions
• Automatically updated server for the genomes of type strains (GEBA etc.)
• Genome sequencing as a routine part of the type strain accession process in culture collections
Prognosis:
• Significantly increased impact of genome sequencing on Prokaryote classification
GBDP can be used to infer whole-genome phylogenetic trees
Neighbour-net phylogenetic network
Henz et al. 2005
En route to a genome-based taxonomy
• Analysis of genome sequences can be applied at all taxonomic levels
• Techniques will soon be sufficiently fast and affordable
Picture from Vandamme et al. 1996, modified