Download Consensus Clustering for Binning Metagenome Sequences

Consensus Clustering for Binning Metagenome Sequences Isis Bonet1,*, Adriana Escobar1, Andrea Mesa-Múnera1, Juan Fernando Alzate2 1Universidad Escuela de Ingeniería de Antioquia, Envigado, Antioquia, Colombia [email protected], [email protected], [email protected] 2 Centro Nacional de Secuenciación Genómica-CNSG, Facultad de Medicina, Universidad de Antioquia, Medellín, Colombia [email protected] Abstract. The advances in next-generation sequencing technologies allow researchers to sequence in parallel millions of microbial organisms directly from environmental samples. The result of this “shotgun” sequencing are many short DNA fragments of different organisms, which constitute the basis for the field of metagenomics. Although there are big databases with known microbial DNA that allow us classify some fragments, these databases only represent around 1% of all the species existing in the entire world. For this reason, it is important to use unsupervised methods to group the fragments with the same taxonomic levels. In this paper we focus on the binning step in metagenomics in an unsupervised way. We propose a consensus clustering method based on an iterative clustering process using different lengths of sequences in the databases and a mixture of distance as approach to finding the consensus clustering. The final performance clustering is evaluated according with the purity of clusters. The results achieved by the proposed method outperforms results obtained by simple methods and iterative methods. Keywords: Metagenomics, consensus clustering, sequences binning, k-means 1 Introduction During the last years, the development of next-generation sequencing technologies allows researchers to sequence multiple genomes of different organisms within an environmental sample. These sequencing methods have the capacity to sequence uncultivable organisms, which have lead a revolution in genetics, taking into account, in many environments, as many as 99% of the microorganisms cannot be cultured by standard techniques [1]. Shotgun sequence, as is also called this kind of sequencing, enable researchers to analyze several types of ecosystems, including extreme environments, with known and unknown microorganism. Moreover, the sequenced genomes provide valuable insights about the microbial community and answers to a wide range of questions [2]. A Sequencing run using such technologies generates hundreds of thousands or millions of DNA fragments, also known as reads. The handling of this result has driven the development of new computational methods and technologies, which rises big and new challenges. To deal with these challenges, emerge metagenomics as a new science. Metagenomics aims to study genomes of many microbial organisms from a specific environment, without a prior need for isolation and cultivation of individual genome in a lab [3]. The objective of metagenomics, based on shotgun sequencing results, is to reconstruct and identify the whole genomes of species within an environmental community under study. In a single genome sequencing, the subsequent processes are assembly of sequence reads, gene prediction, functional annotation and metabolic pathway construction. Additionally, a binning process is required in metagenomics[4]. Usually, the first process, before binning, is an assembly of overlapping shorter reads obtained from the sequencing, in order to provide a consensus sequence (contigs and scaffolds) [2]. Binning methods has the task to group (bin) reads or contigs into their corresponding phylogenetic group. It is can divided into two categories based on the information to group the sequences: composition-based and similarity-based methods. Similarity- or homology-based binning use alignment tools as BLAST [5], MEGAN [6]. From the point of view of machine learning, similarity-based binning is a supervised method supported by a database of known species genome. On the other hand, compositionbased binning made analyzes of genomes features, such as GC content, codon usage or oligonucleotide frequencies to describe the sequences. There are supervised algorithms based on composition features as Phylopythia [7], TACOA [8] and NBC classifier [9] which based the classification in a similarity. Another kind of binning algorithm is referred to as unsupervised methods based on composition features. Unlike the previous algorithms, unsupervised binning is taxonomy independent [10]. Although supervised methods are more accurate than unsupervised methods, the limitation of unknown the majority of the species leads the use of unsupervised methods or the combination of both methods. There are some unsupervised binning reported in the literature, differing in the clustering method, distance measure and the features. For example, TETRA [11] and MetaCAA [12] use the k-mers feature, with k=4 also known as tetranucleotide frequencies. In [13] a Self-Organizing Maps (SOM) method was used for efficiently cluster complex data using the oligonucleotide frequencies calculation, while in [14] Growing Self-organizing maps was used. In [15] the authors use a fuzzy k-means based on GC percentage and oligonucleotides frequencies. MetaCluster is another method that use k-median algorithm and k-mers to represent the features [16, 17]. Others researchers use clustering methods based on expectation maximization (EM) [18] [19]. Also, some authors are reported hybrid algorithms that combine the compositionbased methodology along with an alignment-based methods as PhymmBL [20] and new versions of MetaCluster[17]. In [21] a comparison of some clustering methods is done. Different problematics can arise with a binning process: 1) fragments cover a great range of possible lengths, 2) the amount of fragments which belong to each specie is very different, resulting in an unbalanced database, 3) very large amount of data, and 4) unknown number of organisms, as a classical unsupervised problem. These issues are the cause of the complexity of the unsupervised binning, and lead to a search of good features to represent the DNA fragments and complex algorithms that can hand complex and big data. In this paper we show an ensemble of cluster based on k-means and sequence-based measures, such as GC content and k-mers frequencies. As comparison, we compare with simple clustering algorithm and evaluate according with the sensitivity of clusters. 2 Methods and Data 2.1 Data Assembled genomic sequences at contig level of different organisms including viruses, bacteria and eukaryotes were downloaded from the FTP site of the Sanger institute. Table 1has the description of each organism including in the database. It illustrate the number of contigs representing the organism and the range of minimum and maximum lengths for each one. Table 1. Organisms in the database Organism Ascaris Aspergillus_fumigatus Bacteroides_dorei Bifidobacterium_longum Bos_taurus Candida_parasilopsis Chikung Dengue Ebola Glossina_morsitans HIV Influenza Malus_domestica Manihot_esculenta Pantholops_hodgsonii Zea_mays Contigs 137650 295 1928 18 315841 1540 1 64 1 20334 1 8 66739 7192 159729 161235 Min Length 50 1001 500 540 101 1003 11826 10392 18957 101 9181 853 102 1998 50 102 Max Length 30000 29660 29906 26797 5000 29956 11826 10785 18957 29996 9181 2309 5000 4998 5000 5000 872576 50 30000 In order to have representation of different groups of domains, but also a variety in each group, database consists of 9 eukaryotes, 2 bacteria and 5 virus. 2.2 Features Considering that the sequences have very different length ranging from 50 to 30000 nucleotides bases (Table 1), it is clear how important is the use of composition-based feature to represent the DNA fragments. Based on good results obtained by previous authors, we select k-mer (k=4) as the features to represent the DNA fragments, that means 256 possible tetranucleotides (256 features). It was compute as the number of each tetranucleotide and normalized with the total of tetranucleotides in the sequence. The features and the amount of instances in the database are the basis to perform the bases clustering methods for the ensemble clustering. We use another representative and supervised database with an information gain measure to select the more representative features. The features with highest score are: TCGA, TTCG, CGAA, CGAT, and ATCG. 2.3 Clustering Methods We test different clustering algorithms as SOM, EM and k-means, but we report the results obtained by k-means because it get the best performance. We also proposed a consensus of clustering with k-means as the base clustering method. Despite the problem to estimate the parameter k (number of cluster), k-means is one of the most popular clustering methods. This algorithm finds a set of k centroids, and associates each instance in the data to the nearest centroid, based on a distance function [22]. Here we used a variant of k-means, called k-means++ [23]. As distance functions to compare the contigs we used Euclidean and Cosine distance (Equation 1). n Cosine( X , Y )  1   x  y  i i 1 n  x  i 1 2 i  i (1) n  y  i 1 2 i Where X and Y are the instance to compare, with dimension N (features number), and xi and yi denote the ith feature of X and Y respectively. For the implementation of the clustering methods, we used Weka 3.9 [24], which is a free machine learning package that has implemented k-means++. Furthermore, it has the advantage that it is easy to add a new clustering method. 3 Consensus Clustering Method One of the problematics in the binning process is the different length of fragments. Some binning algorithms have been built for a specific range of lengths. Based on the variability of the clustering algorithms according with the length of contigs we propose train clustering methods with different range of length. Figure 1 describes the steps of our algorithm, including the initial pre-processing data. The input is a fasta file with contigs information which is convert in a weka file (arff). The representation of sequences is based on GC and k-mers as we explain before. Here, the selection of feature is based on the scores obtained with the gain information algorithm. After data pre-processing, the data is ready to be used. Fig. 1. Consensus Clustering Method. The first part of the proposed approach is focused on an iterative clustering where each run process is based on the error of previous run [25]. This means that, a first clustering methods is trained and the results clusters are evaluated based the distance within each cluster and between the centroids of clusters. From this run the best clusters are separated and the rest are used as the new database to train the new cluster method. This process is repeated over several iteration until a number N of run or all resulting clusters achieve a good evaluation value, selecting the compactness clusters for each iteration. Once the iterative clustering is performed we can obtain a large number of clusters. The consensus clustering approach is based on compute the distance between the centroids with Euclidean and Cosine distance. The average of these distances are the measure to join closest clusters and decrease the number of them. In that problem the priority is to group fragments of the same species, even when the organism is represented by more than one cluster. What is important is to cluster groups, which, at least, make possible assembly groups of fragments to build longer DNA sequences. Longer sequences, as will be shown below, can improve performance in a new clustering process. On the other hand, using longer sequences in supervised databases may be more likely to succeed. That means, the key is to obtain clusters with high value of purity. In that spirit, we focused the algorithm to separate the organisms as much as possible, even if this involves a very large number of clusters generated. Taking into account the problematic about the diversity of lengths in the databases, the idea in this method is to split the database based on length of sequences, for this reason the database is divided in two. One database contains the sequences that have length superior to 10000 and the other one with length inferior to 10000. The algorithm describe above is applied to this two database. 3.1 Performance measures There are some measures in the literature to evaluate performance of clusters. Here we use two kind of measures. The first one is to evaluate the cluster formed in each step of the proposed method and join similar clusters evaluate, based on the pairwise difference of between and within-cluster distances. As explained before, our aim is to obtain pure clusters despite some organism can be divided in different clusters. For this reason, the second measure is the purity of the clusters. The aim of this measure is quantify the purity of each cluster, computing how similar the clusters are to the benchmark classifications. To compute purity, each cluster is assigned to the class which is most frequent in the cluster that means, some cluster can be assigned to the same class. The purity of a cluster j is defined in equation 3. 𝑃𝑢𝑟𝑖𝑡𝑦 = 𝑚𝑎𝑥(𝑛𝑖𝑗 ) 𝑛𝑗 (2) where nj is the number of organisms in cluster j and nij is the number of organisms of class i in cluster j. High purity is easy to achieve when the number of clusters is large. 4 Results and Discussion Firstly we train an iterative k-means++ as described above, with Cosine distance and different number of clusters, k between 15 to 2500, keeping the clusters with higher distance inter-cluster, that is the distance between the centroids. A metagenome database built from 16 different organisms is used to evaluate the method. GC content and tetranucleotides are the attributes used to describe the sequences. Euclidean and Cosine distances were used for the k-means algorithms. Fig. 2. Cluster Purity with different min of contigs lengths. We test simple k-means++ algorithms with different size of data. We decrease the database on the lengths of the organisms, in order to know the influence of diversity in contigs lengths. Figure 2 illustrates the average of purity with respect to the minimum length of the organisms included, showing a significant increase in the purity of cluster obtained when larger sequences are used. The first step was to select the best features to describe the data, in order to reduce dimensionality. We use internationals database to adjust the best features, using the algorithm based on gain information measure. We select the 17 features best ranked. Hence, we divided the database in two: one based on the sequences with lengths lower than 10000 and another with lengths greater than or equal to 10000. For each partition of database, we build the clustering method proposed. Afterward, we use Cosine and Euclidean distance to measure the distance between the centroids and regroup the cluster with lowest distances, i.e. those whose average of distances is lower than a threshold (here we use 0.5). The best result was obtained with the increment of number of clusters. For example the Figure 3 shows the results with k1=15, k2=30 for the iterative clustering of two iteration, where ki is the number of cluster of iteration i. The left part of the figure represents the number of clusters, the organisms assigned and the number of fragments associated with each organism. It can be seen most of clusters have a percentage relative to the predominant organism superior of 90%. The average of purity was 92.85 the clusters only represent 8 organisms. With a more deep analysis of the results, we can see that bifidobacterium_longum and the all virus are distributed into all clusters. On the other hand, Ascaris, Pantholops, Malus, Bos Taurus, Bacteroides dorei and Zea mays are the organisms grouped in clusters with high purity. We increase the number of cluster in order to separate the more difficult organisms, in that case the virus because of the low number of contigs. Fig. 3. Results of proposed clustering methods with k1=15, k2=20 Fig. 4. Results of proposed clustering methods with k1=15, k2=2500 and lengths superior to 10000 Figure 4 shows two results of clustering, with k1=15 for both runs, k2=2500 for sequences with max length of 10000 (at the left of figure) and k=250 for lengths superior to 10000 (at the right of figure). In second database, the half of species are not considerate because all genomic fragments of them are smaller than 10000 bases. Bars represent the number of cluster that represent each organism. Although the organisms are very scattered, we obtain a high purity and we can separated the virus in independent clusters. Using lengths larger than 10000, the algorithm achieves 100% of purity for all clusters, and 98.11% when the lengths are shorter than 10000. The average between Euclidean and Cosine distance was used to compare the distance between the centroids of clusters obtained. The closest clusters were grouped together based on a threshold of 0.5. The number of clusters is reduce from 2715 to 122, having a least one cluster for each specie, yielding a 99% of purity. The number of clusters oscillate into 37 to 1 cluster by specie, as it is shown in figure 5. We obtain 117 clusters with 100% of purity. The species with cannot be complete separate from the rest are Bifidobacterium, Influenza and Aspergilus. In most of case are sequence with lengths lower than 10000. Fig. 5. Consensus clustering In short, the results presented with consensus of clusters based on iterative clustering and consensus of distances measures improve the results of clustering in metagenomics. Taking into account the lengths of sequences we can divided the problem and create models focused on short sequences and model based on large sequences. On the other hand, the combination of different distance can generate a significant change in the separation of the space. 5 Conclusions In this paper we present a model based on consensus of different clustering models by the combination of different distances measures. The difference in the models are referred to the data use to train them. The data are reconstructed using different lengths of sequences. The proposed method is applied to a metagenome dataset composed of 16 different organisms. The result achieved by the proposed method, in line with the objective of obtaining clusters with high purity, outperforms result obtained with a simple k-means and also compared with an iterative k-means. Taking into account the purity, even the number of cluster, the proposed method provide pure clusters by organisms, reaching 100% of purity in all cluster when the lengths of contigs is greater than 10000, and 99% for all possible lengths. This paper is not intended to show the best clustering method for metagenomics, but rather to show a promising method to bear in mind in order to build larger sequences or as a prior step in the binning process. Longer DNA fragments can improve performance in a new binning process. This consensus clustering can be used with other base clustering method such as SOM or Expectation Maximization. In future work we expect compare the proposed method with other base methods and other metagenome databases. 6 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. References Riesenfeld, C.S., P.D. Schloss, and J. Handelsman, Metagenomics: genomic analysis of microbial communities. Annu Rev Genet, 2004. 38: p. 525-52. Oulas, A., et al., Metagenomics: Tools and Insights for Analyzing Next-Generation Sequencing Data Derived from Biodiversity Studies, in Bioinform Biol Insights. 2015. p. 75-88. Council, N.R., The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet. 2007: The National Academies Press. Chan, C.-K., et al., Binning sequences using very sparse labels within a metagenome. BMC Bioinformatics, 2008. 9(1): p. 215. Camacho, C., et al., BLAST+: architecture and applications. BMC Bioinformatics, 2009. 10(1): p. 421. Huson, D.H., et al., MEGAN analysis of metagenomic data. Genome Research, 2007. 17(3): p. 377-386. McHardy, A.C., et al., Accurate phylogenetic classification of variable-length DNA fragments. Nat Meth, 2007. 4(1): p. 63-72. Diaz, N.N., et al., TACOA – Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach. BMC Bioinformatics, 2009. 10: p. 56-56. Rosen, G.L., E. Reichenberger, and A. Rosenfeld, NBC: The Naïve Bayes Classification Tool Webserver for Taxonomic Classification of Metagenomic Reads. Bioinformatics, 2010. Mande, S.S., M.H. Mohammed, and T.S. Ghosh, Classification of metagenomic sequences: methods and challenges. Brief Bioinform, 2012. 13(6): p. 669-81. Teeling, H., et al., TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics, 2004. 5(1): p. 163. Reddy, R.M., M.H. Mohammed, and S.S. Mande, MetaCAA: A clustering-aided methodology for efficient assembly of metagenomic datasets. Genomics, 2014. 103(2– 3): p. 161-168. Abe, T., et al., Informatics for Unveiling Hidden Genome Signatures. Genome Research, 2003. 13(4): p. 693-702. Chan, C.K.K., et al., Using Growing Self-Organising Maps to Improve the Binning Process in Environmental Whole-Genome Shotgun Sequencing. J Biomed Biotechnol, 2008. 2008. Sara Nasser, A.B., Frederick C. Harris Jr., Monica Nicolescu, University of Nevada Reno. A Fuzzy Classifier to Taxonomically Group DNA Fragments within a 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. Metagenome. 2016; Available from: http://www.cse.unr.edu/~monica/Research/Publications/nafips2008.pdf. Leung, H.C., et al., A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio. Bioinformatics, 2011. 27(11): p. 1489-95. Wang, Y., et al., MetaCluster-TA: taxonomic annotation for metagenomic data based on assembly-assisted binning. BMC Genomics, 2014. 15(1): p. 1-9. Siegel, K., et al., Puzzlecluster: A novel unsupervised clustering algorithm for binning dna fragments in metagenomics. 2016. Wu, Y.W. and Y. Ye, A novel abundance-based algorithm for binning metagenomic sequences using l-tuples. J Comput Biol, 2011. 18(3): p. 523-34. Brady, A. and S.L. Salzberg, Phymm and PhymmBL: Metagenomic Phylogenetic Classification with Interpolated Markov Models. Nature methods, 2009. 6(9): p. 673676. Li, W., et al., Ultrafast clustering algorithms for metagenomic sequence analysis. Briefings in Bioinformatics, 2012. 13(6): p. 656-668. MacQueen, J., Some methods for classification and analysis of multivariate observations, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics. 1967, University of California Press: Berkeley, Calif. p. 281-297. Arthur, D. and S. Vassilvitskii. K-Means ++: The Advantages of Careful Seeding. in 8th Annual ACM-SIAM Symposium on Discrete Algorithms. 2007. New Orleans. Witten, I. and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed, ed. M.R. Jim Gray. 2005, San Francisco: Morgan Kaufmann. 525. Bonet, I., et al., Iterative Clustering Method for Metagenomic Sequences, in Mining Intelligence and Knowledge Exploration, R. Prasath, P. O’Reilly, and T. Kathirvalavakumar, Editors. 2014, Springer International Publishing. p. 145-154.

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Consensus Clustering for Binning Metagenome Sequences