Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Bioinformatics doi.10.1093/bioinformatics/xxxxxx Advance Access Publication Date: Day Month Year Application note OUP_First_SBk_Bot_8401-eps-c Application note PhylOligo: a package to identify contaminant or untargeted organism sequences in genome assemblies. † † Ludovic Mallet 1 ∗ , Tristan Bitard-Feildel 2 , Franck Cerutti 1 and Hélène Chiapello 1 1 2 INRA UR875, Unité Mathématiques et Informatique Appliquées de Toulouse (MIAT), Auzeville, 31326 Castanet-Tolosan, France CNRS UMR7590, Sorbonne Universités, Université Pierre et Marie Curie – Paris 6 – MNHN – IRD – IUC, Paris, France. ∗ To whom correspondence should be addressed. authors contributed equally to this work. † These Associate Editor: Dr. John Hancock Received on XXXXX; revised on XXXXX; accepted on XXXXX Abstract Motivation: Genome sequencing projects sometimes uncover more organisms than expected, especially for complex and/or non-model organisms. It is therefore useful to develop software to identify mix of organisms from genome sequence assemblies. Here we present PhylOligo, a new package including tools to explore, identify and extract organism-specific sequences in a genome assembly using the analysis of their DNA compositional characteristics. Availability: The tools are written in Python3 and R under the GPLv3 Licence and can be found at https://github.com/itsmeludo/Phyloligo/. Contact: [email protected] Supplementary information: Supplementary data and examples are available at Bioinformatics online. 1 Introduction The development of sequencing technologies has enabled to target the genome of complex non-model organisms and communities of organisms. Some of these non-model organisms can be challenging to isolate from their environment or cannot be cloned in vitro. They might alternatively be compulsorily associated with cognate commensal or parasitic organisms, or even embedded within a host. Consequently, genome assembly datasets sometimes include DNA from unexpected sources like mixture of untargeted species, but may also contain organelles or even laboratory contaminants. The presence of additional untargeted species was indeed reported in several recent genome assemblies,for instance in the draft assembly of domestic cow (Merchant et al., 2014), in several isolates of the phytopathogenic fungi Magnaporthe oryzae (Chiapello et al., 2015) or recently in the tardigrade genome (Delmont and Eren, 2016). Such mixed assemblies may produce several biases and problems in downstream bioinformatics analyses and raise the need for tools able to deal with mixed-organism DNA assemblies. Several tools were recently designed or used to detect and filter untargeted organisms from sequence datasets. A first type of approach, used by khmer software (Crusoe et al., 2015), is to compute kmer frequencies on short reads to pre-process and filter read data sets prior to de novo sequence assembly. Other packages like Blobtools (Kumar et al., 2013) and Anvi’o (Eren et al., 2015) combine sequence properties (GC content, oligonucleotide profile) and additional information such as depth of coverage, similarity to public databases and reference gene sets to identify untargeted species using both raw reads and assembled contigs of a genomic dataset. Finally, a last type of approach is to use software dedicated to metagenomic species read binning, such as CONCOCT (Alneberg et al., 2014) that use sequence composition and coverage across multiple samples to automatically cluster contigs into genomes or Kraken (Wood and Salzberg, 2014) that relies on exact alignment of k-mers to a k-mer reference database to assign taxonomic labels to metagenomic DNA. © The Author(s) 2017 . Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] 3 PhylOligo Kmer pattern Species mix 111 1111 11111 3.3 PhylOligo performances 11001 110101 111001 0.39 B. cereus in C. canadensis 0.99 B. mallei in H. sapiens 1.00 A. fulgidus in P. tetraurelia 0.99 A. fumigatus in P. tigris 0.96 G. intestinalis in X. tropicalis 0.95 S. enterica in T. vaginalis 0.41 B. cereus in A. australis 0.73 S. cerevisiae in T. vaginalis 0.60 S. pombe in T. vaginalis 0.65 0.79 0.99 0.99 0.99 0.96 0.99 0.50 0.73 0.56 0.61 0.94 0.98 0.99 0.99 0.93 0.95 0.70 0.71 0.49 0.43 0.45 0.99 0.99 0.99 0.95 0.99 0.72 0.72 0.57 0.58 0.93 0.98 0.99 0.99 0.95 0.96 0.71 0.71 0.51 0.47 0.97 0.99 0.99 0.99 0.95 0.98 0.72 0.72 0.58 0.58 Mean Median Min Max 0.74 0.79 0.01 0.99 0.75 0.93 0.15 0.99 0.81 0.95 0.45 0.99 0.83 0.95 0.47 0.99 0.78 0.95 0.05 0.99 S. enterica in A. fumigatus 0.69 0.73 0.01 1.00 Table 1. Impact of kmer pattern on the hybrid score (best computed value of the product of cluster specificity and sensitivity) for 10 pairs of of simulated data. on all the combinations of the 32 simulated genomes assemblies. We also evaluated the impact of the kmer parameter by panelling continuous and spaced-pattern kmers on a focus subset of ten pairs (see results in Table 1). Complete results are detailed in section 6.2 of Supplementary Material. Overall, the benchmark demonstrates a great ability to discriminate contaminant clusters with very high specificity and good sensitivity, suited with the requirements for supervised learning and partitioning. Concerning k parameter impact, we obtained best results according to our hybrid score for two spaced patterns: 11001 (mean:0.8133, median: 0.9499) and 110101 (mean:0.8344, median:0.9459). Continuous kmers of length 4 and 5 also performed well but with slightly lower scores (median scores of 0.7909 and 0.9305 for k=4 and 5 respectively). 3.2 Real datasets PhylOligo has been successfully applied to identify untargeted large bacterial regions in four out of nine fungal genomic datasets of Magnaporthe (Chiapello et al., 2015). GOHTAM (Ménigaud et al., 2012) taxonomical assignment of these additional regions confirmed their homogeneity and origin from Burkholderiales. Targeted Blast comparisons indicated that some of these supplementary regions were almost identical to Burkholderia fungorum sequences (100 % identity for 16S, recA and gyrB genes) suggesting an origin or relatedness to one or several bacterial isolate(s) of this species. PhylOligo was applied to filter genome assemblies, validated with BUSCO (Simão et al., 2015) and DOGMA (Dohmen et al., 2016) (see Supplementary Material) and allowed to continue further bioinformatics analyses without rebuilding the costly initial genome assembly and annotation processes. PhylOligo was also used to explore the scaffolds of the tardigrade assembly (Boothby et al., 2015), for which a multiple contamination was previously proposed (Delmont and Eren, 2016; Koutsovoulos et al., 2016). We compared the the topology of the compositional cladograms established with PhylOligo on both the initial and the filtered assembly obtained with Anvi’o (Eren et al., 2015). Our results showed that the cladogram produced with PhylOligo exhibited a topology where the curated assembly was monophyletic, with a sequence subset and topology highly concordant with the results of Anvi’o (see Supplementary Materials Section 5.2). PhylOligo handles assembly contigs up to a count of several dozen thousand on a modern workstation within minutes and up to few hundred thousand on a high-memory server. Input sequences can be qualityfiltered or sub-sampled with a preprocessing tool to allow for improved signal and quick tests. Several parallel computation optimisations and data compression methods including HDF5 are available to improve performance on larger datasets. Acknowledgements We thank the INRA bioinformatics platforms MIGALE and Genotoulbioinfo for resources and Thomas Schiex and Natalie Villa-Vialaneix for their input. References Alneberg, J. et al. (2014). Binning metagenomic contigs by coverage and composition. Nat Meth, 11(11), 1144–1146. Brief Communication. Angly, F. E., Willner, D., Rohwer, F., Hugenholtz, P., and Tyson, G. W. (2012). Grinder: a versatile amplicon and shotgun sequence simulator. Nucleic Acids Research, 40(12), e94. Boothby, T. C. et al. (2015). Evidence for extensive horizontal gene transfer from the draft genome of a tardigrade. Proceedings of the National Academy of Sciences, 112(52), 15976–15981. Břinda, K., Sykulski, M., and Kucherov, G. (2015). Spaced seeds improve k-mer-based metagenomic classification. Bioinformatics, 31(22), 3584. Campello, R. J. G. B., Moulavi, D., and Sander, J. (2013). Density-Based Clustering Based on Hierarchical Density Estimates, pages 160–172. Springer Berlin Heidelberg, Berlin, Heidelberg. Chiapello, H. et al. (2015). Deciphering genome content and evolutionary relationships of isolates from the fungus Magnaporthe oryzae attacking different host plants. Genome Biology and Evolution, 7(10), 2896–2912. Crusoe, M. R. et al. (2015). The khmer software package: enabling efficient nucleotide sequence analysis. F1000Research, 4(900). Delmont, T. O. and Eren, A. M. (2016). Identifying contamination with advanced visualization and analysis practices: metagenomic approaches for eukaryotic genome assemblies. PeerJ, 4, e1839. Dohmen, E., Kremer, L. P., Bornberg-Bauer, E., and Kemena, C. (2016). Dogma: domain-based transcriptome and proteome quality assessment. Bioinformatics, 32(17), 2577. Eren, A. M. et al. (2015). Anvi’o: an advanced analysis and visualization platform for ’omics data. PeerJ, 3, e1319. Koutsovoulos, G. et al. (2016). No evidence for extensive horizontal gene transfer in the genome of the tardigrade hypsibius dujardini. Proceedings of the National Academy of Sciences, 113(18), 5053–5058. Kumar, S., Jones, M., Koutsovoulos, G., Clarke, M., and Blaxter, M. (2013). Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated gc-coverage plots. Frontiers in Genetics, 4, 237. Leimeister, C.-A., Boden, M., Horwege, S., Lindner, S., and Morgenstern, B. (2014). Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics, 30(14), 1991. Merchant, S., Wood, D. E., and Salzberg, S. L. (2014). Unexpected cross-species contamination in genome sequencing projects. PeerJ, 2, e675. Ménigaud, S. et al. (2012). Gohtam: a website for ‘genomic origin of horizontal transfers, alignment and metagenomics’. Bioinformatics, 28(9), 1270–1271. Noé, L. and Martin, D. E. K. (2014). A Coverage Criterion for Spaced Seeds and Its Applications to Support Vector Machine String Kernels and k-Mer Distances. Journal of Computational Biology, 21(12), 28. Paradis, E., Claude, J., and Strimmer, K. (2004). Ape: Analyses of phylogenetics and evolution in r language. Bioinformatics, 20(2), 289. Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V., and Zdobnov, E. M. (2015). Busco: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 31(19), 3210. van der Maaten, L. and Hinton, G. E. (2008). Visualizing high-dimensional data using t-sne. Journal of Machine Learning Research, 9, 2579–2605. Wood, D. E. and Salzberg, S. L. (2014). Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15(3), R46.