Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Hybrid (biology) wikipedia , lookup
Bisulfite sequencing wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Microevolution wikipedia , lookup
Multiple sequence alignment wikipedia , lookup
Koinophilia wikipedia , lookup
Sequence alignment wikipedia , lookup
Computational phylogenetics wikipedia , lookup
Pathogenomics wikipedia , lookup
Microbial Community Analysis Technical Background English Version www.chunlab.com Microbial Community Analysis Technical Background ChunLab, Inc. Seoul National University Bldg. 105-1, Suite #307 1 Gwanak-ro, Gwanak-gu, Seoul, Korea 151-742 Tel: 82-2-875-2501 Fax: 82-2-875-7250 E-mail: [email protected] E100 MC Technical Background February 2014 Microbial Community Analysis DATABASES RELATED TO TAXONOMIC CONCEPTS The purpose of microbial community analysis is to investigate the composition of a microbial population, the proportion of each species present, and the shift of microbial composition due to some factor(s). For these analyses, we need to understand the concept of microbial taxonomy. THE CONCEPT OF MICROBIAL SPECIES Microbes can be divided into Bacteria, Archaea, Fungi and Protist. Although Fungi and Protist are dependent on their forms or reproductive behavior, Bacteria and Archaea are mainly defined through molecular phylogenetic analysis. Species of the Bacteria and Archaea domains were determined by the International Committee on Systematics of Prokaryotes (ICSP) in 1987. {Wayne, 1987 #160}. © 2013 ChunLab, Inc. All rights reserved. 1 E100 MC Technical Background February2014 The concept of the Bacteria and Archaea species in the present day {Tindall, 2010 #235} For two species to be considered different, they must have over a certain level of difference in their genome sequence. Currently, the following two criteria are widely used. (i) Two species should be hybridized at less than 70% when using the indirect comparison of DNA-DNA hybridization method. (ii) The similarity rate of the species’ 16S rRNA sequences should be less than 97% {Stackebrandt, 1994 #161}. Although the former conditions coincide with the comparison of two strains, two strains cannot be divided into different species if their morphology and physiology characteristics are not distinguished. DNA-DNA hybridization requires a lot of time and money, and it is impossible to construct a database with the data generated. For example, even if the value of the DNA-DNA hybridization is 50% between strains A and B, and the value between B and C is also 50%, another experiment must be conducted in order to obtain the hybridization value between A and C. However, we can determine the relationship of the target strains with all the microbes in a database by using the 16S rRNA gene sequence. © 2013 ChunLab, Inc. All rights reserved. 2 E100 MC Technical Background February2014 OFFICIAL CLASSIFICATION OF BACTERIA AND ARCHAEA The classification of Bacteria and Archaea is supervised by the International Committee on Systematics of Prokaryotes (ICSP; http://www.the-icsp.org/). Official changes in the taxonomy of prokaryotes (including new species) are also supervised by the ICSP, and are announced by the International Journal of Systematic and Evolutionary Microbiology (IJSEM; http://ijs.sgmjournals.org/), the journal issued by the SGM (Society for General Microbiology). All official prokaryotes names are available on the List of Prokaryotic Names with Standing in Nomenclature website (http://www.bacterio.cict.fr/), which is maintained by Jean Euzeby. IDENTIFICATION USING THE 16S rRNA About 4,000 genes are present in E. coli, but only a few of them are used in phylogenetic research. The 16S rRNA gene is used as a standard for classification and identification of microbes as it exists in most microbes and shows constant changes over time. Type strains of 16S rRNA gene sequences for most prokaryotes are available on public databases such as NCBI. However, in the case of NCBI, the quality of the sequences available cannot be verified due to a lack of review and most of the sequences found in the database are not accurately identified. Therefore, secondary databases with only 16S © 2013 ChunLab, Inc. All rights reserved. 3 E100 MC Technical Background February2014 rRNA sequences are widely used. The following 16S rRNA databases are the most widely used. Database Description Reference URL Ribosomal Data provided in the FASTA and {Cole, 2009 http://rdp.cme.msu. #172} edu/ Provides various tools for {DeSantis, http://greengenes.l sequence analyses 2006 #234} bl.gov/ Provides search engines for the {Chun, 2007 http://www.eztaxon RNA Database Genbank formats Project (RDP) Greengenes EzTaxon-e identification of microbes through #170} .org/ a type strain sequence database Table 1. Major 16S rRNA gene sequence databases © 2013 ChunLab, Inc. All rights reserved. 4 E100 MC Technical Background February2014 The following is a flow chart for how prokaryotes are identified using the 16S rRNA gene. 16S Figure 1. Identification of prokaryotes using 16S rRNA gene sequences THE EzTaxon & EzTaxon-e DATABASE/SERVER The EzTaxon database was developed in 2007 in the laboratory of Professor Jongsik Chun of Seoul National University. Professor Chun is also the founder and CEO of ChunLab, Inc. EzTaxon includes 16S rRNA sequences of type strains for all currently existing species of Bacteria and Archaea. EzTaxon-e, an extended version of EzTaxon, was created to include uncultured sequences, candidates, and newly predicted species © 2013 ChunLab, Inc. All rights reserved. 5 E100 MC Technical Background February2014 sequences. The EzTaxon-e database is available on the EzBioCloud website (http://eztaxon-e.ezbiocloud.net). All sequences inside this database has a hierarchical taxonomic structure (from phylum to species), which can be applied to the study of microbial communities of various environments. The algorithm used for identification is shown in the picture below. First, a search is done to find similar sequences in the database using two search engines – BLASTN and Mega BLAST. Then, the sequence similarity is obtained using a global pairwise alignment of each similar sequence. The alignment algorithm used here {Myers, 1988 #265} is the same as the one used in ClustalW. The EzTaxon server is the only website in the world that provides sequence similarities based on global pairwise alignment. The EzTaxon database is updated monthly providing the latest taxonomical data. According to various recent indicators, the EzTaxon database/server has emerged as the global standard for the identification of Bacteria and Archaea. In its guideline for classifying new species of prokaryotes issued in 2009, the IJSEM (International Journal of Systematic and Evolutionary Microbiology), which is the most authoritative journal on classifying microbes, recommended the use of EzTaxon {Tindall, 2009 #223}. © 2013 ChunLab, Inc. All rights reserved. 6 E100 MC Technical Background February2014 Figure 2. Identification algorithm used by the EzTaxon-e Server Worldwide Use of the EzTaxon and EzTaxon-e Database/Server Distribution by country (currently over 50 countries) Registered users: over 5,800 Number of citations: over 1800 times as of Feb . 2013 (source: Google Scholar) Most cited paper among those published in IJSEM (the most authoritative journal on the classification of microbes) after 2002. © 2013 ChunLab, Inc. All rights reserved. 7 E100 MC Technical Background February2014 For fungi, the Internal Transcribed Spacer (ITS) sequence is used for identification. ITS is a piece of non-functional RNA that is situated between structural ribosomal RNAs. Molecular studies of fungi in many different environments have been performed using ITS sequences as a molecular marker. Aside from the ITS region, Cytochrome c oxidase subunit 1 (CO1), 18S, and 28S rRNA have been used as molecular markers. However, CO1 is available only in specific genera (Dentiger et al. 2011) and 18s rRNA is known for having a low variable degree in fungi. Recently, the ITS region has been recommended as the universal fungal barcode sequence (School et al, 2012), and there are over 172,000 full length ITS sequences registered on public databases. By using these sequences as references, researchers can obtain a higher degree of accuracy in their analyses. MOLECULAR BIOLOGICAL METHODS FOR ANALYZING MICROBIAL COMMUNITIES Over the past 20 years, various molecular methods have been developed for the analysis of microbial communities (the composition and relative abundance of microbial populations) using the metagenome. Most of these methods amplify the 16S rRNA genes using PCR reaction, and are categorized as shown in Table 2. © 2013 ChunLab, Inc. All rights reserved. 8 E100 MC Technical Background Method DGGE Category Fingerprint (Denaturing Gradient Gel Fingerprint (Temperature Gradient Gel Database Cost Results Construction Gel/band No Low No Low Gel/band image Electrophoresis) T-RFLP (Terminal Restriction Format of image Electrophoresis) TGGE February2014 Fingerprint Chromatogram No Low Fingerprint Gel/band No Low Fragment Length Polymorphism) SSCP (Single Strand Conformation image Polymorphism) Clone Library Sequencing Sequences Yes High Pyrosequencing Sequencing Sequences Yes Medium Table 2. Molecular analysis methods for microbial community analysis DGGE, TGGE, T-RFLP, SSCP are fingerprinting methods that use electrophoresis after the 16S rRNA genes have been amplified through PCR. After electrophoresis, DNAs with different nucleotide sequences show different bands (or chromatogram) on gels according to the length or characteristics of their sequences; these methods are used to observe changes in microbial communities. The advantage of these methods is that multiple samples can be analyzed at a low cost, but the downside is that it would be difficult to construct a database with these results. In addition, these methods are useful to quickly © 2013 ChunLab, Inc. All rights reserved. 9 E100 MC Technical Background February2014 observe differences in the microbial communities, but it is impossible to obtain detailed information on the changes of the microbes and their distribution at the species level. The clone library method is able to directly and accurately investigate the composition of a community using sequences of 16S rRNA amplicons cloned to E. coli. But the number of clones available for analysis is limited due the high cost of this method. This disadvantage of the clone library method, which uses Sanger DNA sequencing, can be overcome by the use of pyrosequencing (a type of NGS method). © 2013 ChunLab, Inc. All rights reserved. 10 E100 MC Technical Background February2014 Microbial Community Analysis Using Pyrosequencing AN INTRODUCTION TO PYROSEQUENCING The development of NGS (next generation sequencing) has led to a new era in microbial ecology. Pyrosequencing, developed by Roche 454, is widely used in various fields of metagenomics since it provides a sufficient DNA sequence read length (about 400 bp on the 454 Titanium platform) for analysis. Pyrosequencing does not require cloning, which is essential in the clone library method, and the cost of pyrosequencing is 1/100 of the cost of the Sanger sequencing method. The current cost of pyrosequencing is reasonable for research development, but it is still considered expensive compared to the current microbiological analysis methods used in environmental or clinical researches. Nevertheless, the cost of NGS sequencing has been declining sharply year over year, thus it is predicted that it will be at a competitiveness price level in 3 or 4 years. PYROSEQUENCING FACTORS THAT NEED TO BE OVERCOME The biggest problem of microbial community analysis using the pyrosequencing method is the lack of biological information techniques or computing infrastructures available for the analysis of large amounts of sequence data that is generated. This problem applies to © 2013 ChunLab, Inc. All rights reserved. 11 E100 MC Technical Background February2014 most biological fields that use NGS. For example, if 10,000 sequences are obtained from one sample, multiple sequence alignments (which is the first step of the statistical analysis) would require an enormous amount time to analyze. Even the most popular methods, such as ClustalW or ClustalX, require a considerable amount of time. BIOINFORMATIC ANALYSIS PIPELINE The figure below depicts the microbial community analysis pipeline using pyrosequencing. Figure 3. Bioinformatics pipeline for microbial community analysis at ChunLab, Inc. © 2013 ChunLab, Inc. All rights reserved. 12 E100 MC Technical Background February2014 The following steps outline how NGS data is analyzed at ChunLab. Steps Barcode Sorting Analysis Algorithm/Method Sequences are classified by barcodes that are included in the PCR primers. This step is necessary to distinguish the different samples. The 2 bp linker is also removed in this step. Quality Short sequences or those with unambiguous bases (Ns) are removed. Prescreening According to research we conducted, sequencing errors are more likely to occur when there are short sequences (less than 300 bp) or when there are more than 2 Ns. Therefore, these sequences are discarded. In general, these types of sequences represent less than 10% of the total reads. Trimming Primer Primers used in amplification are removed in this step. Primers should Sequences be discarded before any analysis is done as they are not obtained from direct sequencing. This would be clear if you recall that amplicons could be generated even when 1 or 2 bases in the sequence are different from the primer. Primers are recognized and removed using the pairwise alignment algorithm {Myers, 1988 #265}. The primer-linker-barcode can be sequenced on the opposite side (3’ end) when the sequence is long and when the analyzed base sequence is short. We also remove primers from the opposite side, © 2013 ChunLab, Inc. All rights reserved. 13 E100 MC Technical Background February2014 but occasionally, the opposite primers and barcodes may not be removed in case of sequencing errors. Removing Non-target genes (target gene: 16S rRNA gene) found among the Non-target reads are removed in this process. Using BLASTN, we select reads Sequences that meet Alignment Score (at least 100) and E-value (max 1.0) standards. For this process, EzTaxon-e (http://eztaxone.ezbiocloud.net) is used as the BLAST database. Although most of the sequenced reads are target genes, non-specific amplification could occasionally occur. If non-target sequences generated from non-specific amplifications are not removed, they could be recognized as novel species. In general, the rate of non-specific amplification generation is less than 1%. For example, genomic DNA of mice can be amplified when amplifying bacteria obtained from the oral cavity of mice. Assemble Errors due to homopolymers commonly occur in pyrosequencing. To Sequence for fix these errors, identical repeated sequences are united, then De-noise clustered and assembled. A representative sequence is then selected to reconstruct the form of the original base sequence. Errors due to homopolymers are corect in this process. Taxonomic In this step, the similarity search is used to identify the taxonomy of Assignment the contig sequences created in the previous assembly step. The © 2013 ChunLab, Inc. All rights reserved. 14 E100 MC Technical Background February2014 EzTaxon-e (http://eztaxon-e.ezbiocloud.net) database is used for the identification of prokaryotes (Chun et al., 2007). In brief, the top five sequences with the highest similarity are extracted from the database using BLASTN. Then, the similarity value of each sequence is obtained using global pairwise sequence alignment {Myers, 1988 #265}. Species identification is then performed according to their similarity values. The method we currently use is the Simple Identification Scheme, but various new algorithms will be applied in the future. [Simple Identification Scheme] x = sequence similarity to type strain species ( x ≥ 97%), genus (97> x ≥94%), family (94> x ≥90%), order (90> x ≥85%), class (85> x ≥80%), and phylum (80> x ≥75%). ChunLab provides one or two files, which includes each raw read and the result of the taxonomic assignments obtained from the assembled contig of the read. These files which contain information about the community are delivered through our CLcommunity™ program. Chimera Check Chimera is a genomic DNA with a template of mixed species. It is an artificial fragment that could be generated from PCR reactions. This is because a sequence that does not actually exist on the template is © 2013 ChunLab, Inc. All rights reserved. 15 E100 MC Technical Background February2014 amplified from the multi-template. These artificial fragments could cause an over-estimation of the diversity, therefore it is necessary to remove them for a more accurate analysis. The Bellerophon method is used to remove chimeras. Table 3. Step by step analysis procedures SPECIES RICHNESS Species richness indicates the number of species that exists in a sample. This is the first index that is searched for in the analysis when using pyrosequencing. However, there are various methods for estimating species richness, and each researcher might use a different statistical method when writing his/her paper. Therefore, it is necessary to understand each method used in species richness estimation. An Operational Taxonomic Unit (OTU) is an artificial taxon or group, which is the subject of classification. It could be an individual, or it could be a taxon such as species or genus. Recently, OTU is widely used as the term to mean species classified at a 97% sequence similarity, thus OTU will be defined as such in this document. The term M-OTU (Molecular OTU) can also be used as a classification that uses molecular systematical methods. © 2013 ChunLab, Inc. All rights reserved. 16 E100 MC Technical Background February2014 After pyrosequencing, the first step in an analysis is to obtain the number of species present in the sample. All the sequence reads are clustered at a 3% difference cutoff {Tindall, 2010 #235}. The following are the methods that are commonly used for this purpose. Method Computer Programs Characteristics Multiple sequence alignment DOTUR or Accurate but very slow; several + clustering based on MOTHUR different options produce several distances/similarities Massive clustering different results CD-HIT Moderately accurate and fast BlastClust Accuracy is not estimated, but it is based on CD-HIT BlastClust fast Table 4. Clustering methods We obtain OTUs using the CD-HIT program, and results of CD-HIT are provided on our website as cm files. If we obtain a number of sampled OTUs from a sample, we can estimate the number of OTUs that actually exists. The exact value is unknown, only estimations can be made for this value. The methods used for OTU estimation also vary as outlined below. Each method might suggest different values of species richness. © 2013 ChunLab, Inc. All rights reserved. 17 E100 Species Richness MC Technical Background February2014 Description Index Rarefaction A rarefaction curve results from averaging randomizations of the {Heck, 1975 #278} observed accumulation curve. The variance around the repeated randomizations allows one to compare the observed richness among samples, but it is distinct from a measure of confidence about the actual richness in the communities. Rarefaction curves are useful for comparing multiple samples in a graphical manner. See http://en.wikipedia.org/wiki/Rarefaction_(ecology) for more information. Chao1 Estimator A nonparametric estimator is adapted from mark-release- {Chao, 1984 #274} recapture (MRR) statistics. Sobs is the number of observed species, n1 is the number of singletons (species captured once), and n2 is the number of doubletons (species captured twice). Chao {Chao, 1984 #274} noted that this index is particularly useful for data sets skewed toward the low-abundance classes, as is likely to be the case with microbes. © 2013 ChunLab, Inc. All rights reserved. 18 E100 MC Technical Background February2014 ACE Estimator ACE incorporates data from all species with fewer than 10 {Chao, 1992 #275} individuals, rather than just singletons and doubletons (as in Chao1). ACE estimates species richness as where Srare is the number of rare samples (sampled abundances ≤10) and Sabund is the number of abundant species (sampled abundances >10). Note that Srare + Sabund equals the total number of species observed. CACE = 1 − F1/Nrare estimates the sample coverage, where F1 is the number of species with i individuals and Finally, ICE (Incidence Coverage-based Estimator) estimates the coefficient of variation of the Fi's (R. Colwell, User's Guide to EstimateS 5 [http://viceroy.eeb.uconn.edu/estimates]). Table 5. Species richness estimates calculated using various OTU calculations {Hughes, 2001 #276}. © 2013 ChunLab, Inc. All rights reserved. 19 E100 MC Technical Background February2014 With the Chao1 and ACE estimators, species richness can be overestimated when it comes to small-sized samples {Hughes, 2001 #276}. The estimation value of Chao1, in particular, can be increased in samples that include a large amount of singletons due to pyrosequencing error. CD-HIT Program The CD-HIT {Li, 2006 #267} program clusters similar sequences from huge amounts of protein or DNA based on sequence similarity. It is appropriate for clustering pyrosequencing data due to its fast speed. It is available on the web ({Huang, 2010 #266}; http://cd-hit.org/). The CD-HIT program is commonly used for clustering large amounts of protein sequences, but it can also be applied to the clustering of 16S rRNA sequences through proper similarity. For instance, if 16S rRNA gene sequences are clustered at a 3% level, the cluster, that is the OTU, would be the species. The CD-HIT program had been used in studies about the human gut microbiome {Turnbaugh, 2009 #270}, and fungal microbiome analysis of oral cavities {Ghannoum, 2010 #279}. BLASTCLUST Program BlastClust is part of the BLAST software package provided by NCBI. It automatically and systematically clusters protein or DNA sequences based on pairwise matches using the BLAST algorithm for proteins or the Mega BLAST algorithm for DNA. © 2013 ChunLab, Inc. All rights reserved. 20 E100 MC Technical Background February2014 OTHER DIVERSITY INDICES Species richness only represents the number of species, therefore it is necessary to analyze different diversity indices that indicate an overall species distribution. SIMPSON'S DIVERSITY INDEX If pi is the fraction of all organisms which belong to the i-th species, then Simpson's diversity index is most commonly defined as the statistic: A perfectly homogeneous population would have a diversity index score of 0. A perfectly heterogeneous population would have a diversity index score of 1. Mothur Program The Mothur (Schloss, 2009) program provides rarefaction information and diversity indices such as Chao, Ace, Shannon, Simpson, etc. In addition, various analyses can be conducted with the Mothur program. © 2013 ChunLab, Inc. All rights reserved. 21 E100 MC Technical Background February2014 UNIFRAC DISTANCE UniFrac is a method used to measure the difference between communities and is widely used in metagenomic analysis. It calculates the distance using the phylogenetic distance between community members as well as their abundance information. CLcommunity™ provides files that contain community analysis results based on all of the methods mentioned above. These files can be visualized using various options. © 2013 ChunLab, Inc. All rights reserved. 22