Download PhylOligo: a package to identify contaminant or untargeted organism

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Bioinformatics
doi.10.1093/bioinformatics/xxxxxx
Advance Access Publication Date: Day Month Year
Application note
OUP_First_SBk_Bot_8401-eps-c
Application note
PhylOligo: a package to identify contaminant or
untargeted organism sequences in genome
assemblies.
†
†
Ludovic Mallet 1 ∗ , Tristan Bitard-Feildel 2 , Franck Cerutti 1 and Hélène
Chiapello 1
1
2
INRA UR875, Unité Mathématiques et Informatique Appliquées de Toulouse (MIAT), Auzeville, 31326 Castanet-Tolosan, France
CNRS UMR7590, Sorbonne Universités, Université Pierre et Marie Curie – Paris 6 – MNHN – IRD – IUC, Paris, France.
∗ To
whom correspondence should be addressed.
authors contributed equally to this work.
† These
Associate Editor: Dr. John Hancock
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Abstract
Motivation: Genome sequencing projects sometimes uncover more organisms than expected, especially
for complex and/or non-model organisms. It is therefore useful to develop software to identify mix of
organisms from genome sequence assemblies. Here we present PhylOligo, a new package including tools
to explore, identify and extract organism-specific sequences in a genome assembly using the analysis of
their DNA compositional characteristics.
Availability: The tools are written in Python3 and R under the GPLv3 Licence and can be found at
https://github.com/itsmeludo/Phyloligo/.
Contact: [email protected]
Supplementary information: Supplementary data and examples are available at Bioinformatics online.
1 Introduction
The development of sequencing technologies has enabled to target the
genome of complex non-model organisms and communities of organisms.
Some of these non-model organisms can be challenging to isolate from
their environment or cannot be cloned in vitro. They might alternatively be
compulsorily associated with cognate commensal or parasitic organisms,
or even embedded within a host. Consequently, genome assembly
datasets sometimes include DNA from unexpected sources like mixture
of untargeted species, but may also contain organelles or even laboratory
contaminants. The presence of additional untargeted species was indeed
reported in several recent genome assemblies,for instance in the draft
assembly of domestic cow (Merchant et al., 2014), in several isolates
of the phytopathogenic fungi Magnaporthe oryzae (Chiapello et al., 2015)
or recently in the tardigrade genome (Delmont and Eren, 2016). Such
mixed assemblies may produce several biases and problems in downstream
bioinformatics analyses and raise the need for tools able to deal with
mixed-organism DNA assemblies.
Several tools were recently designed or used to detect and filter
untargeted organisms from sequence datasets. A first type of approach,
used by khmer software (Crusoe et al., 2015), is to compute kmer
frequencies on short reads to pre-process and filter read data sets prior
to de novo sequence assembly. Other packages like Blobtools (Kumar
et al., 2013) and Anvi’o (Eren et al., 2015) combine sequence properties
(GC content, oligonucleotide profile) and additional information such as
depth of coverage, similarity to public databases and reference gene sets
to identify untargeted species using both raw reads and assembled contigs
of a genomic dataset. Finally, a last type of approach is to use software
dedicated to metagenomic species read binning, such as CONCOCT
(Alneberg et al., 2014) that use sequence composition and coverage across
multiple samples to automatically cluster contigs into genomes or Kraken
(Wood and Salzberg, 2014) that relies on exact alignment of k-mers to
a k-mer reference database to assign taxonomic labels to metagenomic
DNA.
© The Author(s) 2017 . Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License
(http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any
medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
3
PhylOligo
Kmer pattern
Species mix
111 1111 11111
3.3 PhylOligo performances
11001 110101
111001
0.39
B. cereus in C. canadensis
0.99
B. mallei in H. sapiens
1.00
A. fulgidus in P. tetraurelia
0.99
A. fumigatus in P. tigris
0.96
G. intestinalis in X. tropicalis 0.95
S. enterica in T. vaginalis
0.41
B. cereus in A. australis
0.73
S. cerevisiae in T. vaginalis
0.60
S. pombe in T. vaginalis
0.65
0.79
0.99
0.99
0.99
0.96
0.99
0.50
0.73
0.56
0.61
0.94
0.98
0.99
0.99
0.93
0.95
0.70
0.71
0.49
0.43
0.45
0.99
0.99
0.99
0.95
0.99
0.72
0.72
0.57
0.58
0.93
0.98
0.99
0.99
0.95
0.96
0.71
0.71
0.51
0.47
0.97
0.99
0.99
0.99
0.95
0.98
0.72
0.72
0.58
0.58
Mean
Median
Min
Max
0.74
0.79
0.01
0.99
0.75
0.93
0.15
0.99
0.81
0.95
0.45
0.99
0.83
0.95
0.47
0.99
0.78
0.95
0.05
0.99
S. enterica in A. fumigatus
0.69
0.73
0.01
1.00
Table 1. Impact of kmer pattern on the hybrid score (best computed value of
the product of cluster specificity and sensitivity) for 10 pairs of
of simulated data.
on all the combinations of the 32 simulated genomes assemblies. We also
evaluated the impact of the kmer parameter by panelling continuous and
spaced-pattern kmers on a focus subset of ten pairs (see results in Table 1).
Complete results are detailed in section 6.2 of Supplementary Material.
Overall, the benchmark demonstrates a great ability to discriminate
contaminant clusters with very high specificity and good sensitivity, suited
with the requirements for supervised learning and partitioning. Concerning
k parameter impact, we obtained best results according to our hybrid score
for two spaced patterns: 11001 (mean:0.8133, median: 0.9499) and 110101
(mean:0.8344, median:0.9459). Continuous kmers of length 4 and 5 also
performed well but with slightly lower scores (median scores of 0.7909
and 0.9305 for k=4 and 5 respectively).
3.2 Real datasets
PhylOligo has been successfully applied to identify untargeted large
bacterial regions in four out of nine fungal genomic datasets of
Magnaporthe (Chiapello et al., 2015). GOHTAM (Ménigaud et al.,
2012) taxonomical assignment of these additional regions confirmed their
homogeneity and origin from Burkholderiales. Targeted Blast comparisons
indicated that some of these supplementary regions were almost identical
to Burkholderia fungorum sequences (100 % identity for 16S, recA
and gyrB genes) suggesting an origin or relatedness to one or several
bacterial isolate(s) of this species. PhylOligo was applied to filter genome
assemblies, validated with BUSCO (Simão et al., 2015) and DOGMA
(Dohmen et al., 2016) (see Supplementary Material) and allowed to
continue further bioinformatics analyses without rebuilding the costly
initial genome assembly and annotation processes.
PhylOligo was also used to explore the scaffolds of the tardigrade
assembly (Boothby et al., 2015), for which a multiple contamination
was previously proposed (Delmont and Eren, 2016; Koutsovoulos et al.,
2016). We compared the the topology of the compositional cladograms
established with PhylOligo on both the initial and the filtered assembly
obtained with Anvi’o (Eren et al., 2015). Our results showed that the
cladogram produced with PhylOligo exhibited a topology where the
curated assembly was monophyletic, with a sequence subset and topology
highly concordant with the results of Anvi’o (see Supplementary Materials
Section 5.2).
PhylOligo handles assembly contigs up to a count of several dozen
thousand on a modern workstation within minutes and up to few hundred
thousand on a high-memory server. Input sequences can be qualityfiltered or sub-sampled with a preprocessing tool to allow for improved
signal and quick tests. Several parallel computation optimisations and
data compression methods including HDF5 are available to improve
performance on larger datasets.
Acknowledgements
We thank the INRA bioinformatics platforms MIGALE and Genotoulbioinfo for resources and Thomas Schiex and Natalie Villa-Vialaneix for
their input.
References
Alneberg, J. et al. (2014). Binning metagenomic contigs by coverage and
composition. Nat Meth, 11(11), 1144–1146. Brief Communication.
Angly, F. E., Willner, D., Rohwer, F., Hugenholtz, P., and Tyson, G. W.
(2012). Grinder: a versatile amplicon and shotgun sequence simulator.
Nucleic Acids Research, 40(12), e94.
Boothby, T. C. et al. (2015). Evidence for extensive horizontal gene
transfer from the draft genome of a tardigrade. Proceedings of the
National Academy of Sciences, 112(52), 15976–15981.
Břinda, K., Sykulski, M., and Kucherov, G. (2015). Spaced seeds improve
k-mer-based metagenomic classification. Bioinformatics, 31(22), 3584.
Campello, R. J. G. B., Moulavi, D., and Sander, J. (2013). Density-Based
Clustering Based on Hierarchical Density Estimates, pages 160–172.
Springer Berlin Heidelberg, Berlin, Heidelberg.
Chiapello, H. et al. (2015). Deciphering genome content and evolutionary
relationships of isolates from the fungus Magnaporthe oryzae attacking
different host plants. Genome Biology and Evolution, 7(10), 2896–2912.
Crusoe, M. R. et al. (2015). The khmer software package: enabling
efficient nucleotide sequence analysis. F1000Research, 4(900).
Delmont, T. O. and Eren, A. M. (2016). Identifying contamination with
advanced visualization and analysis practices: metagenomic approaches
for eukaryotic genome assemblies. PeerJ, 4, e1839.
Dohmen, E., Kremer, L. P., Bornberg-Bauer, E., and Kemena, C. (2016).
Dogma: domain-based transcriptome and proteome quality assessment.
Bioinformatics, 32(17), 2577.
Eren, A. M. et al. (2015). Anvi’o: an advanced analysis and visualization
platform for ’omics data. PeerJ, 3, e1319.
Koutsovoulos, G. et al. (2016). No evidence for extensive horizontal gene
transfer in the genome of the tardigrade hypsibius dujardini. Proceedings
of the National Academy of Sciences, 113(18), 5053–5058.
Kumar, S., Jones, M., Koutsovoulos, G., Clarke, M., and Blaxter, M.
(2013). Blobology: exploring raw genome data for contaminants,
symbionts and parasites using taxon-annotated gc-coverage plots.
Frontiers in Genetics, 4, 237.
Leimeister, C.-A., Boden, M., Horwege, S., Lindner, S., and Morgenstern,
B. (2014). Fast alignment-free sequence comparison using spaced-word
frequencies. Bioinformatics, 30(14), 1991.
Merchant, S., Wood, D. E., and Salzberg, S. L. (2014). Unexpected
cross-species contamination in genome sequencing projects. PeerJ, 2,
e675.
Ménigaud, S. et al. (2012). Gohtam: a website for ‘genomic origin of
horizontal transfers, alignment and metagenomics’. Bioinformatics,
28(9), 1270–1271.
Noé, L. and Martin, D. E. K. (2014). A Coverage Criterion for Spaced
Seeds and Its Applications to Support Vector Machine String Kernels
and k-Mer Distances. Journal of Computational Biology, 21(12), 28.
Paradis, E., Claude, J., and Strimmer, K. (2004). Ape: Analyses of
phylogenetics and evolution in r language. Bioinformatics, 20(2), 289.
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V.,
and Zdobnov, E. M. (2015). Busco: assessing genome assembly and
annotation completeness with single-copy orthologs. Bioinformatics,
31(19), 3210.
van der Maaten, L. and Hinton, G. E. (2008). Visualizing high-dimensional
data using t-sne. Journal of Machine Learning Research, 9, 2579–2605.
Wood, D. E. and Salzberg, S. L. (2014). Kraken: ultrafast metagenomic
sequence classification using exact alignments. Genome Biology, 15(3),
R46.