Consensus Clustering for Binning Metagenome Sequences
Isis Bonet1,*, Adriana Escobar1, Andrea Mesa-Múnera1, Juan Fernando Alzate2
1 Universidad Escuela de Ingeniería de Antioquia, Envigado, Antioquia, Colombia
[email protected], [email protected], [email protected]
2 Centro Nacional de Secuenciación Genómica-CNSG, Facultad de Medicina, Universidad de Antioquia, Medellín, Colombia
[email protected]
Abstract. Advances in next-generation sequencing technologies allow researchers to sequence millions of microbial organisms in parallel, directly from environmental samples. The result of this “shotgun” sequencing is a large number of short DNA fragments from different organisms, which constitute the basis for the field of metagenomics. Although there are large databases of known microbial DNA that allow us to classify some fragments, these databases represent only around 1% of all existing species. For this reason, it is important to use unsupervised methods to group the fragments that belong to the same taxonomic level. In this paper we focus on the binning step in metagenomics, approached in an unsupervised way. We propose a consensus clustering method based on an iterative clustering process that uses different sequence lengths in the databases, and on a mixture of distances to find the consensus clustering. The final clustering performance is evaluated according to the purity of the clusters. The results achieved by the proposed method outperform those obtained by simple methods and iterative methods.
Keywords: Metagenomics, consensus clustering, sequences binning, k-means
1 Introduction
In recent years, the development of next-generation sequencing technologies has allowed researchers to sequence multiple genomes of different organisms within an environmental sample. These sequencing methods can sequence uncultivable organisms, which has led to a revolution in genetics, given that in many environments as many as 99% of the microorganisms cannot be cultured by standard techniques [1]. Shotgun sequencing, as this kind of sequencing is also called, enables researchers to analyze several types of ecosystems, including extreme environments, with known and unknown microorganisms. Moreover, the sequenced genomes provide valuable insights into the microbial community and answers to a wide range of questions [2]. A sequencing run using such technologies generates hundreds of thousands or millions of DNA fragments, also known as reads. Handling this output has driven the development of new computational methods and technologies, which raises new and significant challenges. Metagenomics emerged as a new science to deal with these challenges. It aims to study the genomes of many microbial organisms from a specific environment, without the prior need to isolate and cultivate individual genomes in a lab [3].
The objective of metagenomics, based on shotgun sequencing results, is to reconstruct and identify the whole genomes of the species within the environmental community under study. In single-genome sequencing, the subsequent processes are assembly of sequence reads, gene prediction, functional annotation and metabolic pathway construction. In metagenomics, a binning process is additionally required [4].
Usually, the first process, before binning, is an assembly of the overlapping shorter reads obtained from the sequencing, in order to provide consensus sequences (contigs and scaffolds) [2].
Binning methods have the task of grouping (binning) reads or contigs into their corresponding phylogenetic group. They can be divided into two categories according to the information used to group the sequences: composition-based and similarity-based methods. Similarity- or homology-based binning uses alignment tools such as BLAST [5] and MEGAN [6]. From the point of view of machine learning, similarity-based binning is a supervised method supported by a database of genomes of known species. Composition-based binning, on the other hand, analyzes genome features such as GC content, codon usage or oligonucleotide frequencies to describe the sequences. There are supervised algorithms based on composition features, such as Phylopythia [7], TACOA [8] and the NBC classifier [9], which base the classification on a similarity measure. Another kind of binning algorithm comprises unsupervised methods based on composition features. Unlike the previous algorithms, unsupervised binning is taxonomy independent [10].
Although supervised methods are more accurate than unsupervised methods, the fact that the majority of species are unknown leads to the use of unsupervised methods or a combination of both.
Several unsupervised binning methods have been reported in the literature, differing in the clustering method, the distance measure and the features. For example, TETRA [11] and MetaCAA [12] use k-mer features with k=4, also known as tetranucleotide frequencies. In [13] a Self-Organizing Map (SOM) method was used to cluster complex data efficiently using oligonucleotide frequencies, while in [14] Growing Self-Organizing Maps were used. In [15] the authors use fuzzy k-means based on GC percentage and oligonucleotide frequencies. MetaCluster is another method, which uses the k-median algorithm and k-mers to represent the features [16, 17]. Other researchers use clustering methods based on expectation maximization (EM) [18, 19].
Some authors have also reported hybrid algorithms that combine the composition-based methodology with alignment-based methods, such as PhymmBL [20] and new versions of MetaCluster [17]. A comparison of some clustering methods is presented in [21].
Several problems can arise in a binning process: 1) fragments cover a wide range of possible lengths, 2) the number of fragments belonging to each species varies widely, resulting in an unbalanced database, 3) the amount of data is very large, and 4) the number of organisms is unknown, as in a classical unsupervised problem. These issues are the cause of the complexity of unsupervised binning, and lead to a search for good features to represent the DNA fragments and for complex algorithms that can handle complex and big data.
In this paper we present an ensemble of clusterings based on k-means and sequence-based measures, such as GC content and k-mer frequencies. We compare it with a simple clustering algorithm and evaluate the results according to the sensitivity of the clusters.
2 Methods and Data
2.1 Data
Assembled genomic sequences at contig level of different organisms, including viruses, bacteria and eukaryotes, were downloaded from the FTP site of the Sanger Institute. Table 1 describes each organism included in the database. It gives the number of contigs representing each organism and the minimum and maximum contig lengths for each one.
Table 1. Organisms in the database
Organism                  Contigs   Min Length   Max Length
Ascaris                    137650           50        30000
Aspergillus_fumigatus         295         1001        29660
Bacteroides_dorei            1928          500        29906
Bifidobacterium_longum         18          540        26797
Bos_taurus                 315841          101         5000
Candida_parasilopsis         1540         1003        29956
Chikung                         1        11826        11826
Dengue                         64        10392        10785
Ebola                           1        18957        18957
Glossina_morsitans          20334          101        29996
HIV                             1         9181         9181
Influenza                       8          853         2309
Malus_domestica             66739          102         5000
Manihot_esculenta            7192         1998         4998
Pantholops_hodgsonii       159729           50         5000
Zea_mays                   161235          102         5000
Total                      872576           50        30000
To represent different domains, and also variety within each group, the database consists of 9 eukaryotes, 2 bacteria and 5 viruses.
2.2 Features
Considering that the sequences have very different lengths, ranging from 50 to 30000 nucleotide bases (Table 1), composition-based features are important for representing the DNA fragments.
Based on the good results obtained by previous authors, we select k-mers (k=4) as the features to represent the DNA fragments, which gives 256 possible tetranucleotides (256 features). Each feature is computed as the number of occurrences of the tetranucleotide, normalized by the total number of tetranucleotides in the sequence.
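The feature computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes overlapping windows on a single strand, and it skips windows containing ambiguous bases such as N. The names `tetranucleotide_frequencies` and `gc_content` are hypothetical.

```python
from itertools import product

# All 256 tetranucleotides over the DNA alphabet, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(TETRAMERS)}


def tetranucleotide_frequencies(sequence):
    """Return 256 relative frequencies of overlapping 4-mers in `sequence`."""
    seq = sequence.upper()
    counts = [0] * len(TETRAMERS)
    total = 0
    for start in range(len(seq) - 3):
        idx = INDEX.get(seq[start:start + 4])
        if idx is not None:  # assumption: windows with N or other codes are skipped
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(TETRAMERS)
    # Normalize by the total number of tetranucleotides counted.
    return [c / total for c in counts]


def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases of `sequence`."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


features = tetranucleotide_frequencies("ACGTACGTTTGA")
print(len(features), gc_content("ACGTACGTTTGA"))
```

Because the counts are divided by the number of windows, contigs of 50 and 30000 bases are mapped to vectors on the same scale, which is the reason for the normalization.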
The features and the number of instances in the database are the basis for the base clustering methods of the ensemble clustering.
We use another representative, supervised (labeled) database with an information gain measure to select the most representative features. The features with the highest scores are: TCGA, TTCG, CGAA, CGAT, and ATCG.
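As an illustration of ranking features by information gain, the sketch below scores each numeric feature by the entropy reduction obtained when it is split at its median. The median split is an assumption made only to keep the example short; the paper does not state how the features were discretized. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0


def information_gain(values, labels):
    """Gain from splitting one numeric feature at its median (assumed discretization)."""
    threshold = sorted(values)[len(values) // 2]
    low = [lab for v, lab in zip(values, labels) if v < threshold]
    high = [lab for v, lab in zip(values, labels) if v >= threshold]
    n = len(labels)
    remainder = (len(low) / n) * entropy(low) + (len(high) / n) * entropy(high)
    return entropy(labels) - remainder


def rank_features(rows, labels, names):
    """Return (name, gain) pairs sorted from the most to the least informative."""
    gains = [information_gain([row[j] for row in rows], labels)
             for j in range(len(names))]
    return sorted(zip(names, gains), key=lambda pair: pair[1], reverse=True)


# Toy labeled data: the first feature separates the classes, the second does not.
rows = [[0.1, 0.5], [0.2, 0.1], [0.8, 0.6], [0.9, 0.2]]
labels = ["a", "a", "b", "b"]
print(rank_features(rows, labels, ["TCGA", "AAAA"]))
```

Keeping only the top-ranked entries of the returned list reduces the dimensionality of the 256-feature representation.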
2.3 Clustering Methods
We tested different clustering algorithms, such as SOM, EM and k-means, but we report only the results obtained with k-means because it achieved the best performance. We also propose a consensus clustering with k-means as the base clustering method.
Despite the difficulty of estimating the parameter k (number of clusters), k-means is one of the most popular clustering methods. The algorithm finds a set of k centroids and associates each instance in the data with the nearest centroid, based on a distance function [22]. Here we used a variant of k-means called k-means++ [23]. As distance functions to compare the contigs we used the Euclidean and Cosine distances (Equation 1).
\[
\mathrm{Cosine}(X, Y) = 1 - \frac{\sum_{i=1}^{n} x_i \, y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}} \; \sqrt{\sum_{i=1}^{n} y_i^{2}}} \tag{1}
\]
where X and Y are the instances to compare, of dimension n (the number of features), and x_i and y_i denote the i-th feature of X and Y, respectively.
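The two distance functions can be written directly from their definitions. The sketch below is illustrative only; returning 1.0 for a zero vector in the Cosine distance is an assumption, since the formula is undefined in that case.

```python
from math import sqrt


def cosine_distance(x, y):
    """1 minus the cosine of the angle between feature vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # assumption: a zero vector is treated as maximally distant
    return 1.0 - dot / (norm_x * norm_y)


def euclidean_distance(x, y):
    """Straight-line distance between feature vectors x and y."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


print(cosine_distance([1.0, 0.0], [0.0, 1.0]))    # orthogonal vectors
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
```

The Cosine distance depends only on the direction of the vectors, whereas the Euclidean distance also depends on their magnitude, so the two can rank pairs of contigs differently.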
For the implementation of the clustering methods we used Weka 3.9 [24], a free machine learning package that implements k-means++. It also has the advantage that new clustering methods are easy to add.
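For readers who want to see the mechanics of k-means with k-means++ seeding, a small self-contained sketch follows. It is not Weka's implementation; the function names, the fixed random seed and the choice to keep the previous centroid when a cluster becomes empty are all assumptions of this example. The distance function is passed as a parameter so that either distance from this section could be plugged in.

```python
import random
from math import sqrt


def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def kmeanspp_seeds(points, k, dist, rng):
    """Pick k initial centroids; each new one is drawn with probability
    proportional to the squared distance to the nearest centroid so far."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        weights = [min(dist(p, c) for c in centroids) ** 2 for p in points]
        if sum(weights) == 0:  # every point coincides with a chosen centroid
            centroids.append(list(rng.choice(points)))
        else:
            centroids.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centroids


def kmeans(points, k, dist=euclidean, max_iter=100, seed=0):
    """Return (assignment, centroids) after Lloyd iterations from k-means++ seeds."""
    rng = random.Random(seed)
    centroids = kmeanspp_seeds(points, k, dist, rng)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda j: dist(p, centroids[j]))
                          for p in points]
        if new_assignment == assignment:
            break  # no instance changed cluster: converged
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment, centroids = kmeans(points, k=2)
print(assignment)
```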
3 Consensus Clustering Method
One of the problems in the binning process is the varying length of the fragments. Some binning algorithms have been built for a specific range of lengths. Because the behavior of clustering algorithms varies with the length of the contigs, we propose to train clustering methods on different length ranges.
Figure 1 describes the steps of our algorithm, including the initial data pre-processing. The input is a FASTA file with the contig information, which is converted into a Weka file (ARFF). The representation of the sequences is based on GC content and k-mers, as explained before. Here, feature selection is based on the scores obtained with the information gain algorithm. After pre-processing, the data is ready to be used.
Fig. 1. Consensus Clustering Method.
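The pre-processing step, from FASTA input to an ARFF file with numeric attributes, might look like the sketch below. It is an illustration under assumptions: the helper names are hypothetical, the FASTA text is read from a string rather than a file, and the two attributes in the usage example (GC fraction and length) stand in for the selected features.

```python
def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def to_arff(relation, attribute_names, rows):
    """Build the text of an ARFF file with one numeric attribute per feature."""
    lines = ["@relation " + relation, ""]
    lines += ["@attribute %s numeric" % name for name in attribute_names]
    lines += ["", "@data"]
    lines += [",".join("%.6f" % value for value in row) for row in rows]
    return "\n".join(lines) + "\n"


fasta = ">c1 example contig\nACGT\nGG\n>c2\nTTAA\n"
records = list(parse_fasta(fasta))
rows = [[(seq.count("G") + seq.count("C")) / len(seq), float(len(seq))]
        for _, seq in records]
print(to_arff("contigs", ["gc", "length"], rows))
```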
The first part of the proposed approach is an iterative clustering in which each run is based on the error of the previous run [25]. That is, a first clustering method is trained and the resulting clusters are evaluated based on the distance within each cluster and between the cluster centroids. From this run the best clusters are set aside, and the remaining instances are used as the new database to train the next clustering method. This process is repeated until a number N of runs is reached or all resulting clusters achieve a good evaluation value, selecting the most compact clusters at each iteration.
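The control flow of this iterative scheme can be sketched independently of the base clusterer. In the example below, `cluster_fn` and `is_good` are hypothetical callables supplied by the caller; the toy clusterer (contiguous chunks of sorted one-dimensional values) and the compactness test (value range of at most 1.0) are stand-ins chosen only to make the loop runnable, not the evaluation used in the paper. Returning the clusters rejected in the last run separately is also an assumption.

```python
def iterative_clustering(points, cluster_fn, is_good, ks):
    """Cluster repeatedly, keeping good clusters and re-clustering the rest.

    cluster_fn(points, k) returns a list of clusters (lists of points);
    is_good(cluster) decides whether a cluster is kept;
    ks gives the number of clusters for each run, e.g. (k1, k2).
    """
    kept, leftover = [], []
    remaining = list(points)
    for k in ks:
        if not remaining:
            break
        clusters = [c for c in cluster_fn(remaining, k) if c]
        leftover = [c for c in clusters if not is_good(c)]
        kept.extend(c for c in clusters if is_good(c))
        # The instances of rejected clusters form the database of the next run.
        remaining = [p for c in leftover for p in c]
    return kept, leftover


def chunk_clusterer(values, k):
    """Toy stand-in for a clusterer: k contiguous chunks of the sorted values."""
    ordered = sorted(values)
    size = -(-len(ordered) // k)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


data = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 9.0, 9.1]
kept, leftover = iterative_clustering(
    data, chunk_clusterer, lambda c: max(c) - min(c) <= 1.0, ks=(2, 2))
print(kept, leftover)
```

In the toy run, the first pass keeps the compact low-valued group and rejects the mixed group, which the second pass then splits into two compact clusters.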
Once the iterative clustering has been performed, we may obtain a large number of clusters. The consensus clustering approach computes the distance between the centroids with both the Euclidean and Cosine distances. The average of these two distances is the measure used to join the closest clusters and reduce their number.
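One way to realize this merging step is sketched below: centroids whose averaged Euclidean and Cosine distance falls below a threshold are joined. The threshold of 0.5 is the value reported in the results section; merging transitively with a union-find structure is an assumption of this sketch, as the paper does not specify how chains of close clusters are handled. All names are hypothetical.

```python
from math import sqrt


def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norms = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return 1.0 - dot / norms if norms else 1.0


def mixed_distance(x, y):
    """Average of the Euclidean and Cosine distances between two centroids."""
    return (euclidean(x, y) + cosine(x, y)) / 2.0


def merge_close_clusters(centroids, threshold=0.5):
    """Group centroid indices whose mixed distance is below the threshold."""
    parent = list(range(len(centroids)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if mixed_distance(centroids[i], centroids[j]) < threshold:
                parent[find(j)] = find(i)  # assumption: merging is transitive

    groups = {}
    for i in range(len(centroids)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


centroids = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(merge_close_clusters(centroids))
```

Each returned group lists the indices of the original clusters that would be joined into one consensus cluster.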
In this problem the priority is to group fragments of the same species, even when an organism is represented by more than one cluster. What matters is to obtain groups that at least make it possible to assemble fragments into longer DNA sequences. Longer sequences, as will be shown below, can improve performance in a new clustering process. In addition, longer sequences may be more likely to be matched successfully against supervised databases. The key, therefore, is to obtain clusters with high purity. In that spirit, we designed the algorithm to separate the organisms as much as possible, even if this generates a very large number of clusters.
Given the diversity of lengths in the databases, the idea of this method is to split the database by sequence length; for this reason the database is divided in two. One database contains the sequences longer than 10000 and the other the sequences shorter than 10000. The algorithm described above is applied to these two databases.
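The split itself is a simple partition on sequence length. In the sketch below, sequences of exactly 10000 bases are placed in the long partition, following the "greater than or equal to 10000" convention used in the results section; the function name and the toy contigs are hypothetical.

```python
def split_by_length(contigs, cutoff=10000):
    """Partition (name, sequence) pairs into short (< cutoff) and long (>= cutoff)."""
    short, long_ = [], []
    for name, seq in contigs:
        (long_ if len(seq) >= cutoff else short).append((name, seq))
    return short, long_


contigs = [("a", "A" * 9999), ("b", "C" * 10000), ("c", "G" * 50)]
short, long_ = split_by_length(contigs)
print([name for name, _ in short], [name for name, _ in long_])
```

Each partition would then be clustered separately with the procedure of this section.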
3.1 Performance measures
There are several measures in the literature to evaluate the performance of clusters. Here we use two kinds of measures. The first is used to evaluate the clusters formed in each step of the proposed method and to join similar clusters, based on the pairwise difference of between-cluster and within-cluster distances.
As explained before, our aim is to obtain pure clusters, even though some organisms may be divided among different clusters. For this reason, the second measure is the purity of the clusters. This measure quantifies the purity of each cluster by computing how similar the clusters are to the benchmark classification. To compute purity, each cluster is assigned to the class that is most frequent in it, which means that several clusters can be assigned to the same class. The purity of a cluster j is defined in Equation 2.
\[
\mathrm{Purity}_j = \frac{\max_i (n_{ij})}{n_j} \tag{2}
\]
where n_j is the number of sequences in cluster j and n_ij is the number of sequences of class i in cluster j.
High purity is easy to achieve when the number of clusters is large.
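Purity is straightforward to compute from the true organism label of every sequence in a cluster. The sketch below is illustrative; averaging the per-cluster values without weighting by cluster size is an assumption, since the paper reports an average purity without giving the formula, and the toy labels are made up.

```python
from collections import Counter


def cluster_purity(labels):
    """Share of the most frequent class among the labels of one cluster."""
    if not labels:
        raise ValueError("purity is undefined for an empty cluster")
    return max(Counter(labels).values()) / len(labels)


def average_purity(clusters):
    """Unweighted mean of the per-cluster purities (assumed averaging scheme)."""
    return sum(cluster_purity(c) for c in clusters) / len(clusters)


clusters = [
    ["Ascaris"] * 9 + ["Zea_mays"],  # 9 of 10 sequences from one organism
    ["Ebola"],                       # a singleton cluster is always pure
    ["HIV", "Dengue"],
]
print([cluster_purity(c) for c in clusters], average_purity(clusters))
```

The singleton cluster illustrates why purity rises with the number of clusters: in the limit of one sequence per cluster, every cluster is trivially pure.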
4 Results and Discussion
First we train an iterative k-means++ as described above, with the Cosine distance and different numbers of clusters, with k between 15 and 2500, keeping the clusters with the highest inter-cluster distance, that is, the distance between the centroids.
A metagenome database built from 16 different organisms is used to evaluate the method. GC content and tetranucleotide frequencies are the attributes used to describe the sequences. Euclidean and Cosine distances were used for the k-means algorithms.
Fig. 2. Cluster purity for different minimum contig lengths.
We test the simple k-means++ algorithm with data of different sizes. We reduce the database according to the contig lengths of the organisms, in order to assess the influence of the diversity of contig lengths. Figure 2 shows the average purity with respect to the minimum contig length of the organisms included, showing a significant increase in cluster purity when longer sequences are used.
The first step was to select the best features to describe the data, in order to reduce dimensionality. We use international databases to tune the feature selection, using the algorithm based on the information gain measure. We select the 17 best-ranked features.
We then divided the database in two: one with the sequences of length lower than 10000 and another with the sequences of length greater than or equal to 10000. For each partition of the database, we build the proposed clustering method. Afterward, we use the Cosine and Euclidean distances to measure the distance between the centroids and regroup the clusters with the lowest distances, i.e. those whose average distance is lower than a threshold (here we use 0.5).
The best results were obtained as the number of clusters increased. For example, Figure 3 shows the results with k1=15, k2=30 for an iterative clustering of two iterations, where ki is the number of clusters in iteration i. The left part of the figure shows the clusters, the organisms assigned to them and the number of fragments associated with each organism. Most clusters have a proportion of the predominant organism above 90%. The average purity was 92.85%, but the clusters represent only 8 organisms.
A deeper analysis of the results shows that Bifidobacterium_longum and all the viruses are distributed across all clusters. On the other hand, Ascaris, Pantholops, Malus, Bos taurus, Bacteroides dorei and Zea mays are the organisms grouped in clusters with high purity.
We increase the number of clusters in order to separate the more difficult organisms, in this case the viruses, which are difficult because of their low number of contigs.
Fig. 3. Results of proposed clustering methods with k1=15, k2=20
Fig. 4. Results of proposed clustering methods with k1=15, k2=2500 and lengths superior to 10000
Figure 4 shows two clustering results, with k1=15 for both runs: k2=2500 for sequences with a maximum length of 10000 (left of the figure) and k2=250 for lengths above 10000 (right of the figure). In the second database, half of the species are not considered because all their genomic fragments are shorter than 10000 bases. Bars represent the number of clusters that represent each organism. Although the organisms are very scattered, we obtain a high purity and we can separate the viruses into independent clusters. Using lengths greater than 10000, the algorithm achieves 100% purity for all clusters, and 98.11% when the lengths are shorter than 10000.
The average of the Euclidean and Cosine distances was used to compare the centroids of the clusters obtained. The closest clusters were grouped together based on a threshold of 0.5. The number of clusters is reduced from 2715 to 122, with at least one cluster for each species, yielding a purity of 99%. The number of clusters per species ranges from 1 to 37, as shown in Figure 5.
We obtain 117 clusters with 100% purity. The species that cannot be completely separated from the rest are Bifidobacterium, Influenza and Aspergillus. In most cases these are sequences with lengths lower than 10000.
Fig. 5. Consensus clustering
In short, the consensus of clusters based on iterative clustering and on a consensus of distance measures improves clustering results in metagenomics. By taking the lengths of the sequences into account, we can divide the problem and create models focused on short sequences and models focused on long sequences. In addition, the combination of different distances can produce a significant change in the separation of the space.
5 Conclusions
In this paper we present a model based on a consensus of different clustering models, obtained by combining different distance measures. The models differ in the data used to train them. The data are reconstructed using different sequence lengths.
The proposed method is applied to a metagenome dataset composed of 16 different organisms. In line with the objective of obtaining clusters with high purity, the proposed method outperforms both a simple k-means and an iterative k-means. In terms of purity, even considering the number of clusters, the proposed method provides pure clusters by organism, reaching 100% purity in all clusters when the contig lengths are greater than 10000, and 99% for all possible lengths.
This paper is not intended to present the best clustering method for metagenomics, but rather a promising method to consider for building longer sequences or as a prior step in the binning process. Longer DNA fragments can improve performance in a new binning process.
This consensus clustering can be used with other base clustering methods, such as SOM or Expectation Maximization. In future work we expect to compare the proposed method with other base methods and on other metagenome databases.
6 References
1. Riesenfeld, C.S., P.D. Schloss, and J. Handelsman, Metagenomics: genomic analysis of microbial communities. Annu Rev Genet, 2004. 38: p. 525-52.
2. Oulas, A., et al., Metagenomics: Tools and Insights for Analyzing Next-Generation Sequencing Data Derived from Biodiversity Studies, in Bioinform Biol Insights. 2015. p. 75-88.
3. Council, N.R., The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet. 2007: The National Academies Press.
4. Chan, C.-K., et al., Binning sequences using very sparse labels within a metagenome. BMC Bioinformatics, 2008. 9(1): p. 215.
5. Camacho, C., et al., BLAST+: architecture and applications. BMC Bioinformatics, 2009. 10(1): p. 421.
6. Huson, D.H., et al., MEGAN analysis of metagenomic data. Genome Research, 2007. 17(3): p. 377-386.
7. McHardy, A.C., et al., Accurate phylogenetic classification of variable-length DNA fragments. Nat Meth, 2007. 4(1): p. 63-72.
8. Diaz, N.N., et al., TACOA – Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach. BMC Bioinformatics, 2009. 10: p. 56-56.
9. Rosen, G.L., E. Reichenberger, and A. Rosenfeld, NBC: The Naïve Bayes Classification Tool Webserver for Taxonomic Classification of Metagenomic Reads. Bioinformatics, 2010.
10. Mande, S.S., M.H. Mohammed, and T.S. Ghosh, Classification of metagenomic sequences: methods and challenges. Brief Bioinform, 2012. 13(6): p. 669-81.
11. Teeling, H., et al., TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics, 2004. 5(1): p. 163.
12. Reddy, R.M., M.H. Mohammed, and S.S. Mande, MetaCAA: A clustering-aided methodology for efficient assembly of metagenomic datasets. Genomics, 2014. 103(2–3): p. 161-168.
13. Abe, T., et al., Informatics for Unveiling Hidden Genome Signatures. Genome Research, 2003. 13(4): p. 693-702.
14. Chan, C.K.K., et al., Using Growing Self-Organising Maps to Improve the Binning Process in Environmental Whole-Genome Shotgun Sequencing. J Biomed Biotechnol, 2008. 2008.
15. Sara Nasser, A.B., Frederick C. Harris Jr., Monica Nicolescu, University of Nevada Reno. A Fuzzy Classifier to Taxonomically Group DNA Fragments within a Metagenome. 2016; Available from: http://www.cse.unr.edu/~monica/Research/Publications/nafips2008.pdf.
16. Leung, H.C., et al., A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio. Bioinformatics, 2011. 27(11): p. 1489-95.
17. Wang, Y., et al., MetaCluster-TA: taxonomic annotation for metagenomic data based on assembly-assisted binning. BMC Genomics, 2014. 15(1): p. 1-9.
18. Siegel, K., et al., Puzzlecluster: A novel unsupervised clustering algorithm for binning dna fragments in metagenomics. 2016.
19. Wu, Y.W. and Y. Ye, A novel abundance-based algorithm for binning metagenomic sequences using l-tuples. J Comput Biol, 2011. 18(3): p. 523-34.
20. Brady, A. and S.L. Salzberg, Phymm and PhymmBL: Metagenomic Phylogenetic Classification with Interpolated Markov Models. Nature methods, 2009. 6(9): p. 673-676.
21. Li, W., et al., Ultrafast clustering algorithms for metagenomic sequence analysis. Briefings in Bioinformatics, 2012. 13(6): p. 656-668.
22. MacQueen, J., Some methods for classification and analysis of multivariate observations, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics. 1967, University of California Press: Berkeley, Calif. p. 281-297.
23. Arthur, D. and S. Vassilvitskii. K-Means ++: The Advantages of Careful Seeding. in 8th Annual ACM-SIAM Symposium on Discrete Algorithms. 2007. New Orleans.
24. Witten, I. and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed, ed. M.R. Jim Gray. 2005, San Francisco: Morgan Kaufmann. 525.
25. Bonet, I., et al., Iterative Clustering Method for Metagenomic Sequences, in Mining Intelligence and Knowledge Exploration, R. Prasath, P. O’Reilly, and T. Kathirvalavakumar, Editors. 2014, Springer International Publishing. p. 145-154.