Microbial Community Analysis
Technical Background
English Version
www.chunlab.com
ChunLab, Inc.
Seoul National University
Bldg. 105-1, Suite #307
1 Gwanak-ro, Gwanak-gu,
Seoul, Korea 151-742
Tel: 82-2-875-2501
Fax: 82-2-875-7250
E-mail: [email protected]
E100
MC Technical Background
February 2014
Microbial Community Analysis
DATABASES RELATED TO TAXONOMIC CONCEPTS
The purpose of microbial community analysis is to investigate the composition of a
microbial population, the proportion of each species present, and the shift of microbial
composition due to some factor(s). For these analyses, we need to understand the
concept of microbial taxonomy.
THE CONCEPT OF MICROBIAL SPECIES
Microbes can be divided into Bacteria, Archaea, Fungi and Protists. Fungi and Protists are
classified mainly by their morphology or reproductive behavior, whereas Bacteria and Archaea
are defined mainly through molecular phylogenetic analysis. The species concept for the
Bacteria and Archaea domains was defined by the International Committee on Systematics of
Prokaryotes (ICSP) in 1987 (Wayne, 1987).
© 2013 ChunLab, Inc. All rights reserved.
The concept of the Bacteria and Archaea species in the present day (Tindall, 2010)
• For two strains to be considered different species, their genome sequences must differ by
more than a certain level. Currently, the following two criteria are widely used.
(i) The DNA-DNA hybridization value between the two strains, an indirect comparison of
their genomes, should be less than 70%. (ii) The similarity of their 16S rRNA gene
sequences should be less than 97% (Stackebrandt, 1994).
• Even when two strains satisfy the criteria above, they cannot be assigned to different
species unless their morphological or physiological characteristics can also be
distinguished.
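The two genomic criteria and the phenotypic requirement above can be written as a small check. This is only an illustrative sketch: the function names are hypothetical, and treating either genomic criterion as sufficient on its own is an assumption, since the text does not state how the two criteria are combined.

```python
DDH_THRESHOLD = 70.0  # percent; criterion (i), DNA-DNA hybridization
SSU_THRESHOLD = 97.0  # percent; criterion (ii), 16S rRNA similarity


def genomic_criteria_met(ddh=None, ssu_similarity=None):
    """Return the names of the genomic criteria that point to different species.

    Either value may be None when that measurement is not available.
    """
    met = []
    if ddh is not None and ddh < DDH_THRESHOLD:
        met.append("DDH")
    if ssu_similarity is not None and ssu_similarity < SSU_THRESHOLD:
        met.append("16S")
    return met


def can_split_species(ddh=None, ssu_similarity=None, phenotype_differs=False):
    """Assumption: one genomic criterion plus a phenotypic difference is required."""
    return bool(genomic_criteria_met(ddh, ssu_similarity)) and phenotype_differs
```

For example, `can_split_species(ddh=50.0, phenotype_differs=False)` is false: the hybridization value is low enough, but the strains cannot be told apart phenotypically.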
DNA-DNA hybridization requires a lot of time and money, and because each value describes
only one pair of strains, the results cannot be compiled into a searchable database. For
example, even if the DNA-DNA hybridization value is 50% between strains A and B, and the
value between B and C is also 50%, another experiment must be conducted to obtain the
hybridization value between A and C. In contrast, a 16S rRNA gene sequence can be compared
with every sequence in a database, so the relationship of a target strain to all of the
microbes in the database can be determined.
OFFICIAL CLASSIFICATION OF BACTERIA AND ARCHAEA
The classification of Bacteria and Archaea is supervised by the International Committee on
Systematics of Prokaryotes (ICSP; http://www.the-icsp.org/). Official changes in the
taxonomy of prokaryotes (including new species) are also supervised by the ICSP, and are
announced in the International Journal of Systematic and Evolutionary Microbiology
(IJSEM; http://ijs.sgmjournals.org/), the journal issued by the SGM (Society for General
Microbiology). All official prokaryote names are available on the List of Prokaryotic
Names with Standing in Nomenclature website (http://www.bacterio.cict.fr/), which is
maintained by Jean Euzeby.
IDENTIFICATION USING THE 16S rRNA
About 4,000 genes are present in E. coli, but only a few of them are used in phylogenetic
research. The 16S rRNA gene is used as a standard for classification and identification of
microbes because it is present in all Bacteria and Archaea and changes at a relatively
constant rate over time. The 16S rRNA gene sequences of type strains for most prokaryotes
are available on public databases such as NCBI. However, sequences are deposited in NCBI
without review, so their quality cannot be verified, and most of the sequences found in the
database are not accurately identified. Therefore, secondary databases that contain only
16S rRNA sequences are widely used. The following 16S rRNA databases are the most widely
used.
Database | Description | Reference | URL
Ribosomal Database Project (RDP) | Data provided in the FASTA and GenBank formats | (Cole, 2009) | http://rdp.cme.msu.edu/
Greengenes | Provides various tools for sequence analyses | (DeSantis, 2006) | http://greengenes.lbl.gov/
EzTaxon-e | Provides search engines for the identification of microbes through a type strain sequence database | (Chun, 2007) | http://www.eztaxon.org/
Table 1. Major 16S rRNA gene sequence databases
The following is a flow chart for how prokaryotes are identified using the 16S rRNA gene.
Figure 1. Identification of prokaryotes using 16S rRNA gene sequences
THE EzTaxon & EzTaxon-e DATABASE/SERVER
The EzTaxon database was developed in 2007 in the laboratory of Professor Jongsik
Chun of Seoul National University. Professor Chun is also the founder and CEO of
ChunLab, Inc. EzTaxon includes 16S rRNA sequences of type strains for all currently
recognized species of Bacteria and Archaea. EzTaxon-e, an extended version of EzTaxon,
was created to also include sequences from uncultured organisms, candidate taxa, and
newly predicted species. The EzTaxon-e database is available on the EzBioCloud website
(http://eztaxon-e.ezbiocloud.net).
All sequences in this database have a hierarchical taxonomic structure (from phylum to
species), which can be applied to the study of microbial communities in various
environments. The algorithm used for identification is shown in Figure 2. First, a
search is done to find similar sequences in the database using two search engines,
BLASTN and MegaBLAST. Then, the similarity to each of these sequences is calculated from
a global pairwise alignment. The alignment algorithm used here (Myers, 1988) is the same
as the one used in ClustalW. The EzTaxon server is the only
website in the world that provides sequence similarities based on global pairwise
alignment. The EzTaxon database is updated monthly to provide the latest taxonomic
data. According to various recent indicators, the EzTaxon database/server has emerged
as the global standard for the identification of Bacteria and Archaea. In its guideline for
classifying new species of prokaryotes issued in 2009, the IJSEM (International Journal of
Systematic and Evolutionary Microbiology), which is the most authoritative journal on
classifying microbes, recommended the use of EzTaxon (Tindall, 2009).
Figure 2. Identification algorithm used by the EzTaxon-e Server
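The second stage of the identification algorithm, computing similarity from a global pairwise alignment, can be sketched as follows. The sketch uses a plain Needleman-Wunsch alignment as a stand-in for the linear-space algorithm of Myers (1988). The scoring values and the definition of similarity as matches divided by alignment length are assumptions, not the EzTaxon server's actual parameters, and the BLAST search that selects the candidate sequences is omitted.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_similarity(a, b):
    """Percentage of alignment columns in which both sequences carry the same base."""
    aligned_a, aligned_b = global_alignment(a, b)
    if not aligned_a:
        raise ValueError("cannot compare two empty sequences")
    matches = sum(x == y for x, y in zip(aligned_a, aligned_b) if x != "-")
    return 100.0 * matches / len(aligned_a)
```

Because the alignment is global, every base of both sequences contributes to the similarity value, which is the property that distinguishes this step from the local matches reported by BLAST.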
Worldwide Use of the EzTaxon and EzTaxon-e Database/Server
• Distribution by country: currently over 50 countries
• Registered users: over 5,800
• Number of citations: over 1,800 as of Feb. 2013 (source: Google Scholar)
• Most cited paper among those published in IJSEM (the most authoritative journal
on the classification of microbes) after 2002.
For fungi, the Internal Transcribed Spacer (ITS) sequence is used for identification. ITS is
a non-functional RNA region situated between the structural ribosomal RNAs.
Molecular studies of fungi in many different environments have been performed using ITS
sequences as a molecular marker. Aside from the ITS region, Cytochrome c oxidase
subunit 1 (CO1), 18S, and 28S rRNA have been used as molecular markers. However,
CO1 is usable only in specific genera (Dentinger et al., 2011) and 18S rRNA is known to
show low variability in fungi. Recently, the ITS region has been recommended as
the universal fungal barcode sequence (Schoch et al., 2012), and there are over 172,000
full length ITS sequences registered on public databases. By using these sequences as
references, researchers can obtain a higher degree of accuracy in their analyses.
MOLECULAR BIOLOGICAL METHODS FOR ANALYZING MICROBIAL
COMMUNITIES
Over the past 20 years, various molecular methods have been developed for the analysis
of microbial communities (the composition and relative abundance of microbial populations)
using the metagenome. Most of these methods amplify the 16S rRNA genes by PCR
and are categorized as shown in Table 2.
Method | Category | Format of Results | Database Construction | Cost
DGGE (Denaturing Gradient Gel Electrophoresis) | Fingerprint | Gel/band image | No | Low
TGGE (Temperature Gradient Gel Electrophoresis) | Fingerprint | Gel/band image | No | Low
T-RFLP (Terminal Restriction Fragment Length Polymorphism) | Fingerprint | Chromatogram | No | Low
SSCP (Single Strand Conformation Polymorphism) | Fingerprint | Gel/band image | No | Low
Clone Library | Sequencing | Sequences | Yes | High
Pyrosequencing | Sequencing | Sequences | Yes | Medium
Table 2. Molecular analysis methods for microbial community analysis
DGGE, TGGE, T-RFLP, and SSCP are fingerprinting methods that use electrophoresis after
the 16S rRNA genes have been amplified through PCR. After electrophoresis, DNAs with
different nucleotide sequences show different bands (or chromatogram peaks) according
to the length or characteristics of their sequences; these methods are used to observe
changes in microbial communities. The advantage of these methods is that multiple
samples can be analyzed at a low cost, but the downside is that it is difficult to
construct a database from the results. In addition, although these methods are useful for
quickly observing differences between microbial communities, they cannot provide detailed
information on the changes of the microbes and their distribution at the species level.
The clone library method can directly and accurately investigate the composition of a
community using sequences of 16S rRNA amplicons cloned into E. coli. However, the number of
clones available for analysis is limited due to the high cost of this method. This disadvantage
of the clone library method, which uses Sanger DNA sequencing, can be overcome by the
use of pyrosequencing (a type of NGS method).
Microbial Community Analysis Using Pyrosequencing
AN INTRODUCTION TO PYROSEQUENCING
The development of NGS (next generation sequencing) has led to a new era in microbial
ecology. Pyrosequencing, developed by Roche 454, is widely used in various fields of
metagenomics since it provides a sufficient DNA sequence read length (about 400 bp on
the 454 Titanium platform) for analysis. Pyrosequencing does not require cloning, which is
essential in the clone library method, and the cost of pyrosequencing is 1/100 of the cost
of the Sanger sequencing method. The current cost of pyrosequencing is reasonable for
research and development, but it is still considered expensive compared to the
microbiological analysis methods currently used in environmental or clinical research.
Nevertheless, the cost of NGS sequencing has been declining sharply year over year, so
it is predicted to reach a competitive price level in 3 or 4 years.
PYROSEQUENCING FACTORS THAT NEED TO BE OVERCOME
The biggest problem of microbial community analysis using the pyrosequencing method is
the lack of bioinformatics techniques and computing infrastructure available for the
analysis of the large amounts of sequence data that are generated. This problem applies to
most biological fields that use NGS. For example, if 10,000 sequences are obtained from
one sample, multiple sequence alignment (which is the first step of the statistical analysis)
would require an enormous amount of time. Even the most popular programs, such
as ClustalW or ClustalX, require a considerable amount of time.
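A rough count shows why alignment becomes the bottleneck. Progressive aligners such as ClustalW begin by comparing every pair of sequences, so the number of pairwise alignments grows quadratically with the number of reads. A minimal sketch of that arithmetic (the function name is hypothetical):

```python
def pairwise_comparisons(n_sequences):
    """Number of distinct sequence pairs among n_sequences reads: n(n-1)/2."""
    return n_sequences * (n_sequences - 1) // 2


# The 10,000-read sample from the text needs roughly 50 million pairwise alignments.
print(pairwise_comparisons(10_000))  # 49995000
```

Doubling the number of reads roughly quadruples this count, which is why faster clustering approaches are preferred for pyrosequencing data.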
BIOINFORMATIC ANALYSIS PIPELINE
The figure below depicts the microbial community analysis pipeline using pyrosequencing.
Figure 3. Bioinformatics pipeline for microbial community analysis at ChunLab, Inc.
The following steps outline how NGS data is analyzed at ChunLab.
Steps | Analysis Algorithm/Method
Barcode Sorting | Sequences are sorted by the barcodes that are included in the PCR primers. This step is necessary to distinguish the different samples. The 2 bp linker is also removed in this step.
Quality Prescreening | Short sequences or those with ambiguous bases (Ns) are removed. According to research we conducted, sequencing errors are more likely to occur in short sequences (less than 300 bp) or in sequences with more than 2 Ns. Therefore, these sequences are discarded. In general, these types of sequences represent less than 10% of the total reads.
Trimming Primer Sequences | Primers used in amplification are removed in this step. Primers should be discarded before any analysis is done as they are not obtained from direct sequencing of the template. This is clear if you recall that amplicons can be generated even when 1 or 2 bases in the template differ from the primer. Primers are recognized and removed using the pairwise alignment algorithm (Myers, 1988). The primer-linker-barcode can be sequenced on the opposite side (3' end) when the read is long and the amplified sequence is short. We also remove primers from the opposite side, but occasionally, the opposite primers and barcodes may not be removed because of sequencing errors.
Removing Non-target Sequences | Non-target genes (target gene: 16S rRNA gene) found among the reads are removed in this process. Using BLASTN, we select reads that meet the Alignment Score (at least 100) and E-value (at most 1.0) standards. For this process, EzTaxon-e (http://eztaxon-e.ezbiocloud.net) is used as the BLAST database. Although most of the sequenced reads are target genes, non-specific amplification can occasionally occur. If non-target sequences generated by non-specific amplification are not removed, they could be recognized as novel species. In general, the rate of non-specific amplification is less than 1%. For example, genomic DNA of mice can be amplified when amplifying bacteria obtained from the oral cavity of mice.
Assemble Sequence for De-noise | Errors due to homopolymers commonly occur in pyrosequencing. To fix these errors, identical repeated sequences are merged, then clustered and assembled. A representative sequence is then selected to reconstruct the original base sequence. Errors due to homopolymers are corrected in this process.
Taxonomic Assignment | In this step, a similarity search is used to identify the taxonomy of the contig sequences created in the previous assembly step. The EzTaxon-e (http://eztaxon-e.ezbiocloud.net) database is used for the identification of prokaryotes (Chun et al., 2007). In brief, the five sequences with the highest similarity are extracted from the database using BLASTN. Then, the similarity value of each sequence is obtained using global pairwise sequence alignment (Myers, 1988). Taxonomic identification is then performed according to the similarity values. The method we currently use is the Simple Identification Scheme, but various new algorithms will be applied in the future. [Simple Identification Scheme] x = sequence similarity to type strain: species (x ≥ 97%), genus (97% > x ≥ 94%), family (94% > x ≥ 90%), order (90% > x ≥ 85%), class (85% > x ≥ 80%), and phylum (80% > x ≥ 75%). ChunLab provides one or two files, which include each raw read and the taxonomic assignment obtained from the assembled contig of the read. These files, which contain information about the community, are delivered through our CLcommunity™ program.
Chimera Check | A chimera is an artificial DNA fragment assembled from templates of more than one species, which can be generated during PCR. A sequence that does not actually exist in any template is amplified from the mixed templates. These artificial fragments can cause an overestimation of diversity, so it is necessary to remove them for a more accurate analysis. The Bellerophon method is used to remove chimeras.
Table 3. Step by step analysis procedures
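Three of the steps in Table 3 are simple enough to sketch in code: barcode sorting, quality prescreening, and the Simple Identification Scheme. The following is a minimal illustration, not ChunLab's implementation. It assumes exact barcode matches at the 5' end of each read, followed directly by the 2 bp linker, and all function names are hypothetical.

```python
def sort_by_barcode(reads, barcodes, linker_len=2):
    """Group reads by their leading barcode and strip the barcode plus linker.

    barcodes maps a barcode sequence to a sample name. Reads that match no
    barcode are dropped (an assumption; they could also be reported).
    """
    samples = {name: [] for name in barcodes.values()}
    for read in reads:
        for barcode, name in barcodes.items():
            if read.startswith(barcode):
                samples[name].append(read[len(barcode) + linker_len:])
                break
    return samples


def passes_prescreen(seq, min_length=300, max_ns=2):
    """Keep reads of at least 300 bp that contain no more than 2 ambiguous bases."""
    return len(seq) >= min_length and seq.upper().count("N") <= max_ns


# Similarity cutoffs (percent) of the Simple Identification Scheme, highest first.
RANK_CUTOFFS = [
    ("species", 97.0),
    ("genus", 94.0),
    ("family", 90.0),
    ("order", 85.0),
    ("class", 80.0),
    ("phylum", 75.0),
]


def assign_rank(similarity):
    """Return the most specific rank supported by a similarity value, or None."""
    for rank, cutoff in RANK_CUTOFFS:
        if similarity >= cutoff:
            return rank
    return None
```

For example, a contig that is 95.0% similar to its closest type strain would be assigned at the genus level, while one below 75% would be left unassigned.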
SPECIES RICHNESS
Species richness indicates the number of species that exist in a sample. This is the first
index examined in an analysis that uses pyrosequencing. However, there are
various methods for estimating species richness, and each researcher might use a
different statistical method when writing a paper. Therefore, it is necessary to
understand each method used in species richness estimation. An Operational Taxonomic
Unit (OTU) is an artificial taxon or group that is the subject of classification. It could be
an individual, or it could be a taxon such as a species or genus. Recently, OTU has been
widely used to mean a species-level group classified at 97% sequence similarity, and OTU
is defined as such in this document. The term M-OTU (Molecular OTU) can also be used
for a classification based on molecular systematic methods.
After pyrosequencing, the first step in an analysis is to obtain the number of species
present in the sample. All the sequence reads are clustered at a 3% difference cutoff
(Tindall, 2010). The following are the methods that are commonly used for this
purpose.
Method | Computer Programs | Characteristics
Multiple sequence alignment + clustering based on distances/similarities | DOTUR or MOTHUR | Accurate but very slow; several different options produce several different results
Massive clustering based on CD-HIT | CD-HIT | Moderately accurate and fast
Massive clustering based on BlastClust | BlastClust | Accuracy is not estimated, but it is fast
Table 4. Clustering methods
We obtain OTUs using the CD-HIT program, and results of CD-HIT are provided on our
website as cm files.
Once the number of OTUs observed in a sample is known, we can estimate the number of
OTUs that actually exist. The exact value is unknown, so it can only be estimated. The
methods used for OTU estimation also vary, as outlined below, and each
method might suggest a different value of species richness.
Species Richness Index | Description
Rarefaction (Heck, 1975) | A rarefaction curve results from averaging randomizations of the observed accumulation curve. The variance around the repeated randomizations allows one to compare the observed richness among samples, but it is distinct from a measure of confidence about the actual richness in the communities. Rarefaction curves are useful for comparing multiple samples in a graphical manner. See http://en.wikipedia.org/wiki/Rarefaction_(ecology) for more information.
Chao1 Estimator (Chao, 1984) | A nonparametric estimator adapted from mark-release-recapture (MRR) statistics. Chao1 estimates species richness as $S_{Chao1} = S_{obs} + \frac{n_1^2}{2 n_2}$, where $S_{obs}$ is the number of observed species, $n_1$ is the number of singletons (species captured once), and $n_2$ is the number of doubletons (species captured twice). Chao (Chao, 1984) noted that this index is particularly useful for data sets skewed toward the low-abundance classes, as is likely to be the case with microbes.
ACE Estimator (Chao, 1992) | ACE incorporates data from all species with 10 or fewer individuals, rather than just singletons and doubletons (as in Chao1). ACE estimates species richness as $S_{ACE} = S_{abund} + \frac{S_{rare}}{C_{ACE}} + \frac{F_1}{C_{ACE}} \gamma_{ACE}^2$, where $S_{rare}$ is the number of rare species (sampled abundances ≤10) and $S_{abund}$ is the number of abundant species (sampled abundances >10). Note that $S_{rare} + S_{abund}$ equals the total number of species observed. $C_{ACE} = 1 - F_1/N_{rare}$ estimates the sample coverage, where $F_i$ is the number of species with $i$ individuals and $N_{rare} = \sum_{i=1}^{10} i F_i$. Finally, $\gamma_{ACE}^2 = \max\left[\frac{S_{rare}}{C_{ACE}} \cdot \frac{\sum_{i=1}^{10} i(i-1) F_i}{N_{rare}(N_{rare}-1)} - 1,\ 0\right]$ estimates the coefficient of variation of the $F_i$'s (R. Colwell, User's Guide to EstimateS 5 [http://viceroy.eeb.uconn.edu/estimates]). ICE (Incidence Coverage-based Estimator) is the corresponding estimator for incidence (presence/absence) data.
Table 5. Species richness estimates calculated using various OTU calculations
(Hughes, 2001).
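Rarefaction can be illustrated with a short calculation. Instead of averaging many random subsamples, the sketch below computes the value those averages converge to, using the exact expectation for sampling n reads without replacement (Hurlbert's formula). This is a swapped-in analytic shortcut, not the randomization procedure described in the table, and the function names are hypothetical.

```python
from math import comb


def rarefied_richness(otu_counts, n):
    """Expected number of OTUs observed in a random subsample of n reads.

    otu_counts holds the number of reads assigned to each OTU. For each OTU,
    comb(total - c, n) / comb(total, n) is the chance that the subsample misses it.
    """
    total = sum(otu_counts)
    if not 0 <= n <= total:
        raise ValueError("subsample size must be between 0 and the number of reads")
    return sum(1 - comb(total - c, n) / comb(total, n) for c in otu_counts)


def rarefaction_curve(otu_counts, step=1):
    """(subsample size, expected richness) pairs for plotting a rarefaction curve."""
    total = sum(otu_counts)
    return [(n, rarefied_richness(otu_counts, n)) for n in range(0, total + 1, step)]
```

Plotting the curves of several samples on the same axes, and comparing them at a common subsample size, gives the graphical comparison the table refers to.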
With the Chao1 and ACE estimators, species richness can be overestimated for
small samples (Hughes, 2001). The Chao1 estimate, in
particular, can be inflated in samples that include a large number of singletons caused by
pyrosequencing errors.
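The effect of spurious singletons can be checked numerically. The sketch below implements Chao1 and ACE in their standard forms, with S_obs, n1, n2, S_rare, S_abund, and F_i as defined in Table 5 and a rare-species threshold of 10 reads. The fallback used when there are no doubletons is the common bias-corrected form of Chao1, which is an assumption beyond the text, and the function names are hypothetical.

```python
from collections import Counter


def chao1(otu_counts):
    """Chao1 richness estimate: S_obs + n1^2 / (2 * n2)."""
    s_obs = sum(1 for c in otu_counts if c > 0)
    n1 = sum(1 for c in otu_counts if c == 1)  # singletons
    n2 = sum(1 for c in otu_counts if c == 2)  # doubletons
    if n2 == 0:
        # The classic formula is undefined without doubletons; use the
        # bias-corrected form S_obs + n1 * (n1 - 1) / (2 * (n2 + 1)) instead.
        return s_obs + n1 * (n1 - 1) / 2
    return s_obs + n1 ** 2 / (2 * n2)


def ace(otu_counts, rare_threshold=10):
    """Abundance-based Coverage Estimator of richness."""
    counts = [c for c in otu_counts if c > 0]
    freq = Counter(c for c in counts if c <= rare_threshold)  # F_i for rare OTUs
    s_rare = sum(freq.values())
    s_abund = len(counts) - s_rare
    n_rare = sum(i * f for i, f in freq.items())
    if n_rare == 0:
        return float(s_abund)
    f1 = freq.get(1, 0)
    c_ace = 1 - f1 / n_rare  # estimated sample coverage
    if c_ace == 0:
        raise ValueError("ACE is undefined when every rare OTU is a singleton")
    weighted = sum(i * (i - 1) * f for i, f in freq.items())
    gamma2 = max(s_rare / c_ace * weighted / (n_rare * (n_rare - 1)) - 1, 0)
    return s_abund + s_rare / c_ace + f1 / c_ace * gamma2


sample = [20, 15, 5, 2, 2, 1, 1, 1]
with_errors = sample + [1] * 5  # five extra singletons, as sequencing errors might add
print(chao1(sample), chao1(with_errors))
```

In this toy sample, five extra singletons raise the observed OTU count from 8 to 13 but push the Chao1 estimate from about 10 to 29, which illustrates why de-noising before richness estimation matters.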
CD-HIT Program
The CD-HIT (Li, 2006) program clusters huge numbers of protein or DNA sequences
based on sequence similarity. It is appropriate for clustering
pyrosequencing data due to its fast speed. It is available on the web (Huang, 2010;
http://cd-hit.org/). The CD-HIT program is commonly used for clustering large
numbers of protein sequences, but it can also be applied to the clustering of 16S rRNA
sequences with an appropriate similarity threshold. For instance, if 16S rRNA gene
sequences are clustered at a 3% difference level, each cluster, that is, each OTU,
corresponds to a species. The CD-HIT
program has been used in studies of the human gut microbiome (Turnbaugh, 2009)
and in fungal microbiome analysis of oral cavities (Ghannoum, 2010).
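The greedy incremental strategy that CD-HIT is known for can be sketched in a few lines: sequences are processed from longest to shortest, and each one either joins the first cluster whose representative it matches at the threshold or founds a new cluster. The identity function below is a toy position-by-position comparison that assumes reads start at the same primer position. It is not CD-HIT's word-filter and banded-alignment procedure, and all names are hypothetical.

```python
def ungapped_identity(a, b):
    """Fraction of matching positions over the shorter sequence (toy measure)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_cluster(seqs, threshold=0.97, identity=ungapped_identity):
    """Greedy incremental clustering; returns a list of (representative, members)."""
    clusters = []
    # Longest sequences first, so each representative is the longest in its cluster.
    for seq in sorted(seqs, key=len, reverse=True):
        for representative, members in clusters:
            if identity(representative, seq) >= threshold:
                members.append(seq)
                break
        else:
            # No existing representative is similar enough: start a new OTU.
            clusters.append((seq, [seq]))
    return clusters
```

With `threshold=0.97`, the number of clusters returned corresponds to the number of OTUs at the 3% difference level. Each read is compared only with cluster representatives rather than with every other read, which is where the speed advantage over all-against-all alignment comes from.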
BLASTCLUST Program
BlastClust is part of the BLAST software package provided by NCBI. It automatically
and systematically clusters protein or DNA sequences based on pairwise matches
using the BLAST algorithm for proteins or the MegaBLAST algorithm for DNA.
OTHER DIVERSITY INDICES
Species richness only represents the number of species, so it is also necessary to
analyze diversity indices that reflect the overall distribution of species.
SIMPSON'S DIVERSITY INDEX
If p_i is the fraction of all organisms which belong to the i-th species, then Simpson's
diversity index is most commonly defined as the statistic:
$$D = 1 - \sum_{i=1}^{S} p_i^2$$
where S is the number of species. A perfectly homogeneous population would have a
diversity index score of 0. A perfectly heterogeneous population would have a diversity
index score approaching 1.
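A minimal sketch of the calculation, assuming the index is taken in the form D = 1 − Σ p_i², which is the form consistent with the scores of 0 and 1 described above (the function name is hypothetical):

```python
def simpson_diversity(otu_counts):
    """Simpson's diversity index, 1 - sum(p_i^2), from per-OTU read counts."""
    total = sum(otu_counts)
    if total == 0:
        raise ValueError("at least one read is required")
    return 1.0 - sum((c / total) ** 2 for c in otu_counts)
```

A sample made up of a single OTU scores 0, two equally abundant OTUs score 0.5, and the score moves toward 1 as reads are spread evenly over more OTUs.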
Mothur Program
The Mothur program (Schloss, 2009) provides rarefaction information and diversity
indices such as Chao1, ACE, Shannon, and Simpson. In addition, various other analyses can
be conducted with the Mothur program.
UNIFRAC DISTANCE
UniFrac is a method used to measure the difference between communities and is widely
used in metagenomic analysis. It calculates the distance using the phylogenetic distance
between community members as well as their abundance information.
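The idea can be sketched for a pair of communities on a small tree. In the sketch below the tree is given as a list of branches, each with a length and the set of leaves (OTUs) beneath it. Unweighted UniFrac is the fraction of observed branch length that leads to only one of the two communities, and the weighted (non-normalized) form sums each branch length times the difference in the proportion of reads beneath it. The data layout and the function name are assumptions made for illustration.

```python
def unifrac(branches, comm_a, comm_b):
    """Return (unweighted, weighted) UniFrac distances between two communities.

    branches: list of (branch_length, set_of_leaf_names_below_the_branch)
    comm_a, comm_b: dicts mapping leaf name -> read count
    """
    total_a = sum(comm_a.values())
    total_b = sum(comm_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both communities must contain reads")
    unique = shared = weighted = 0.0
    for length, leaves in branches:
        a = sum(comm_a.get(leaf, 0) for leaf in leaves)
        b = sum(comm_b.get(leaf, 0) for leaf in leaves)
        if a > 0 and b > 0:
            shared += length      # branch leads to members of both communities
        elif a > 0 or b > 0:
            unique += length      # branch leads to members of only one community
        weighted += length * abs(a / total_a - b / total_b)
    return unique / (unique + shared), weighted


# Toy tree ((t1:1,t2:1):1,(t3:1,t4:1):1) written out branch by branch.
tree = [
    (1.0, {"t1"}), (1.0, {"t2"}), (1.0, {"t1", "t2"}),
    (1.0, {"t3"}), (1.0, {"t4"}), (1.0, {"t3", "t4"}),
]
```

Two communities that occupy entirely separate clades of this tree get an unweighted distance of 1.0, while identical communities get 0.0 under both measures.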
CLcommunity™ provides files that contain community analysis results based on all of the
methods mentioned above. These files can be visualized using various options.