RESEARCH LETTER
The ITS region as a target for characterization of fungal communities
using emerging sequencing technologies
Rolf Henrik Nilsson1, Martin Ryberg1, Kessy Abarenkov2, Elisabet Sjökvist1 & Erik Kristiansson3
1Department of Plant and Environmental Sciences, University of Gothenburg, Göteborg, Sweden; 2Department of Botany, Institute of Ecology and Earth
Sciences, University of Tartu, Tartu, Estonia; and 3Department of Zoology, University of Gothenburg, Göteborg, Sweden
Correspondence: Rolf Henrik Nilsson,
Department of Plant and Environmental
Sciences, University of Gothenburg, PO Box
461, 405 30 Göteborg, Sweden. Tel.: +46 31
786 2623; fax: +46 31 786 2560; e-mail:
[email protected]
Received 13 December 2008; accepted 7 April
2009.
Final version published online 1 May 2009.
DOI:10.1111/j.1574-6968.2009.01618.x
Editor: Jan Dijksterhuis
Keywords
massively parallel sequencing; community
profiling; sequence identification.
Abstract
The advent of new high-throughput DNA-sequencing technologies promises to
redefine the way in which fungi and fungal communities – as well as other groups
of organisms – are studied in their natural environment. With read lengths of some
few hundred base pairs, massively parallel sequencing (pyrosequencing) stands out
among the new technologies as the most apt for large-scale species identification in
environmental samples. Although parallel pyrosequencing can generate hundreds
of thousands of sequences at an exceptional speed, the limited length of the reads
may pose a problem to the species identification process. This study explores
whether the discrepancy in read length between parallel pyrosequencing and
traditional (Sanger) sequencing will have an impact on the perceived taxonomic
affiliation of the underlying species. Based on all 39 200 publicly available fungal
environmental DNA sequences representing the nuclear ribosomal internal
transcribed spacer (ITS) region, the results show that the two approaches give rise
to quite different views of the diversity of the underlying samples. Standardization
of which subregion from the ITS region should be sequenced, as well as a
recognition that the composition of fungal communities as depicted through
different sequencing methods need not be directly comparable, appear crucial to
the integration of the new sequencing technologies with current mycological
praxis.
Introduction
Mycologists face the daunting task of characterizing the very
large and unwieldy fungal kingdom in a taxonomic context.
Estimated at 1.5 million species and reported from little
short of all biota on Earth, fungi are thought to be
responsible for many key ecological functions such as wood
and litter decomposition, mycorrhizal associations, and
other forms of nutrient recycling (Hawksworth, 2001).
Inconspicuous by default, fungi are typically noticed only
when they form above-ground fruiting bodies or other
propagules. The study of fungi is thus plagued by a reliance
on ephemeral structures whose presence or absence is only
weakly correlated with the actual mycoflora of the collection
site (Porter et al., 2008). Adding to the complexity, even
outwardly very similar or identical fruiting bodies often
prove to represent several distinct (cryptic) species (Geml
et al., 2006; Paulus et al., 2007). These observations make a
FEMS Microbiol Lett 296 (2009) 97–101
compelling case for DNA sequences as a vital information
source in contemporary mycology, and the scientific study
of fungi is indeed as much a molecular as a morphological
enterprise today (Blackwell et al., 2006; Hibbett, 2007).
The last few years have witnessed a surge in the interest in
characterizing the mycoflora of entire localities and ecosystems (Bruns et al., 2008; Taylor, 2008). The desire to
sequence whole communities of fungi from any given
study site imposes such high demands in terms of high-throughput sequencing as to question the use of the
presently popular capillary (Sanger)-based techniques in
the first place (c.f. Metzker, 2005; Kahvejian et al., 2008).
Indeed, emerging sequencing technologies with the capacity
to generate hundreds of thousands of limited-length sequences within a matter of hours promise to take over the
sequencing role for environmental studies (Strausberg et al.,
2008). Three major platforms (Applied Biosystems SOLiD,
Illumina Sequencing, and 454 Life Science/Roche massively
© 2009 Federation of European Microbiological Societies
Published by Blackwell Publishing Ltd. All rights reserved
parallel pyrosequencing) are presently in use for high-throughput sequencing, but only pyrosequencing yields
long enough DNA templates to be considered for rigid use
in a species-level classification framework (SOLiD, 31 bp;
Illumina, 36 bp; pyrosequencing, c. 250 bp; Shendure & Ji,
2008). Although the current pyrosequencer GS FLX Standard is bound by an upper sequence length of about 250 bp,
pyrosequencing of target genes and regions known to be
sufficiently variable should in theory yield enough information to allow identification to the species level (Liu et al.,
2008).
In mycology, the internal transcribed spacer (ITS) region
of the nuclear ribosomal repeat unit is by far the most
commonly sequenced region for queries of systematics and
taxonomy at and below the genus level. Although the ITS
region is not entirely unproblematic (Feliner & Rosselló,
2007), > 100 000 fungal ITS sequences have been deposited
in the International Nucleotide Sequence Databases (INSD;
Benson et al., 2008) since the early 1990s (Nilsson et al.,
2008). The roughly 650-bp region is normally obtainable in
a single round of Sanger DNA sequencing, and of its three
subregions (the spacers ITS1 and ITS2 and the 5.8S gene),
two (ITS1 and ITS2) show a high rate of evolution and are
typically species specific (Bruns & Shefferson, 2004; Kõljalg
et al., 2005). The large number of ITS copies per cell
(upwards of 250; Vilgalys & Gonzalez, 1990) makes the
region an appealing target for sequencing substrates where
the initial amount of DNA is low, such as in environmental
samples from soil and wood. Jointly, these observations
make a compelling case for the ITS region as a prime target
for pyrosequencing – targeted at either the ITS1 or ITS2 – of
environmental samples of fungi. Based on the 39 200 available environmental ITS sequences of fungi, the present study
investigates the ramifications of choosing either of these two
subregions over the other, as well as over the whole ITS
region, for purposes of molecular characterization of fungal
communities. Questions of how to make the most of the
data from high-throughput sequencing of environmental
samples are cast in a taxonomic perspective.
Materials and methods
All fungal ITS sequences annotated as such in INSD as of
November 2008 were downloaded and divided into two
datasets: those that were identified to the species level (fully
identified sequences, FIS) and those that were not (insufficiently identified sequences, IIS) following the procedure of
Nilsson et al. (2005). The fungus-specific Hidden Markov
Models of Ryberg et al. (2009) were used to locate and
extract the ITS1 and ITS2 from the sequences, and all entries
were stored in a local MySQL database (http://www.mysql.com).
The IIS are, to a large degree, obtained through
environmental sampling such that their nature makes them
attractive as query sequences in studies addressing the
properties of environmental sequencing. Thus, to simulate
the authentic situation where unidentified sequences have
been obtained through sequencing of environmental samples and are queried against the INSD for taxonomic
affiliation using BLAST (Altschul et al., 1997), all IIS featuring
both the ITS1 and the ITS2 (in full or in part; defined as
> 40 bp) were compared in full against the FIS dataset using
BLAST 2.2.18. These comparisons were repeated using only
the ITS1, and then the ITS2, of these IIS to mimic limited-length pyrosequencing data. All entries were tagged with the
best BLAST match to the FIS dataset for the complete sequence
data as well as for each of the ITS1 and ITS2. To minimize
the impact of questionable matches and technical artefacts
(Nilsson et al., 2006), only sequences where both the ITS1
and ITS2 found relevant matches (BLAST E-value threshold,
10⁻¹⁰) among the FIS were used for comparison. To
examine the impact of partial vs. full ITS1 and ITS2 data,
respectively, the results from BLAST analysis of the entire ITS1
region of four sets of 10 000 ITS sequences were contrasted
with the results obtained through analysing only the first
100 bp of the same sequences (and similarly for the ITS2;
Supporting Information, Appendix S1). The IIS for which
one or both of the ITS1 and ITS2 were missing are not
treated any further in this study and are excluded from the
statistics reported below. Synonyms and anamorph–
teleomorph relationships were established through the
Centraalbureau voor Schimmelcultures databases (Crous
et al., 2004; http://www.cbs.knaw.nl/databases/) and are
accounted for in the following.
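The best-match tagging described above can be illustrated with a minimal sketch. It assumes the BLAST results were exported in the standard 12-column tabular layout (query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score); the study does not state which output format was parsed. The function name `best_hits`, the choice of the highest bit score as the "best" match, and the reading of the threshold as "E-value no larger than 10⁻¹⁰" are all assumptions made for illustration, and the identifiers in the example are hypothetical.

```python
import csv
import io

# E-value threshold stated in the text; treating it as "<=" is an assumption.
E_VALUE_THRESHOLD = 1e-10


def best_hits(tabular_text, max_evalue=E_VALUE_THRESHOLD):
    """Return {query_id: (subject_id, evalue, bitscore)} for the top hit per query.

    Expects 12-column tab-separated BLAST output. Hits whose E-value exceeds
    max_evalue are ignored, so a query with no acceptable hit is absent
    from the result.
    """
    best = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue
        current = best.get(query)
        # Keep the hit with the highest bit score seen so far for this query.
        if current is None or bitscore > current[2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical example: q1 has two acceptable hits, q2 has none.
example = "\n".join([
    "q1\tFIS_A\t99.0\t180\t2\t0\t1\t180\t5\t184\t1e-90\t330",
    "q1\tFIS_B\t95.0\t180\t9\t0\t1\t180\t5\t184\t1e-70\t270",
    "q2\tFIS_C\t80.0\t60\t12\t0\t1\t60\t1\t60\t1e-05\t50",
])
hits = best_hits(example)
print(hits)
```

Running the same function on the output for the full sequence, the ITS1, and the ITS2 of each query would give the three best matches that are compared in the Results. Queries lacking an acceptable hit for either subregion would then be dropped, in line with the filtering step described above.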
Results
A total of 100 639 fungal ITS sequences from 1992 and
onwards were downloaded from INSD. Sixty-one percent
(61 471 sequences) were identified to species level, leaving
39% (39 168 sequences) insufficiently identified. A complete
or partial ITS1 was extracted and found to have a sufficiently
good match to the FIS dataset for 77% of the IIS; the
corresponding value for ITS2 was 80%. A total of 26 577
sequences (68% of the IIS) fulfilled all the criteria as to have
ITS1 and ITS2 of sufficient length and to produce sufficiently good matches to the FIS dataset for both ITS1 and
ITS2; these were designated as the query sequences of the
study. The average length of the full IIS under scrutiny
(including all three subregions and any part of the flanking
ribosomal subunit genes) was 646 bp; that of the ITS1 was
182 bp; and that of the ITS2 was 183 bp.
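The proportions in the paragraph above follow directly from the reported counts. The short check below recomputes them; the variable names are illustrative only, and the ratio of IIS to FIS is included because it corresponds to the value of 0.64 given for the whole dataset in Table 1.

```python
def pct(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


# Counts as reported in the text.
total_its = 100_639   # fungal ITS sequences downloaded from INSD
fis = 61_471          # identified to species level
iis = 39_168          # insufficiently identified
query_set = 26_577    # IIS meeting all criteria for ITS1 and ITS2

# The two subsets partition the download.
assert fis + iis == total_its

print(round(pct(fis, total_its)))   # text reports 61%
print(round(pct(iis, total_its)))   # text reports 39%
print(round(pct(query_set, iis)))   # text reports 68% of the IIS
print(round(iis / fis, 2))          # Table 1 reports 0.64 (IIS/FIS)
```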
A moderate 22% of the entries were found to have the
same INSD entry (accession number) as their best BLAST
match regardless of which one of the regions (full sequence,
ITS1, or ITS2) was used as a query (Table 1). The choice of
region had a clear impact on the perceived taxonomic
affiliation of the sequence, with not less than 51% of the IIS
showing not just another INSD entry but another species
altogether as their best match (and in 21% of the total
number of cases even a different genus) depending on which
one was used for comparison. The three subregions disagreed completely on the species level in 14% of the cases
(but in only 4% on the genus level). Thus, with respect to
taxonomic affiliation, only in 49% of the cases did the choice
of target region not matter at all. The full ITS region yielded
the same BLAST results in terms of taxonomic affiliations as
one, but not both, of its ITS1 and ITS2 26% of the time, with
ITS2 (14%) concurring more often than the ITS1 (12%)
with the taxonomic affiliation suggested by the entire ITS
region. The ITS1 and ITS2 reported the same species, which
was not suggested by the complete sequence, as their best
BLAST match in a total of 11% of the cases. Roughly 20% of
the ITS1 sequences under examination were assigned a
different taxonomic affiliation by BLAST depending on
whether the full ITS1 data or only the first 100 bp of the
ITS1 were used (ITS2, 22%; Appendix S1).

Table 1. Summary statistics for the fungal ITS sequences in INSD as of December 2008 and the results from their analysis (in full and as broken up into
constituent subregions) using BLAST
Number of ITS sequences in INSD: 100 639
Number of ITS sequences with > 40 bp ITS1: 90 200
Number of ITS sequences with > 40 bp ITS2: 93 655
Number of ITS sequences with > 40 bp of both ITS1 and ITS2: 85 914
Percentage of cases where the whole ITS region, its ITS1 and ITS2 are best matched by the same INSD entry (accession number): 22%
Percentage of cases where the whole ITS region, its ITS1 and ITS2 are best matched by the same species: 49%
Percentage of cases where the whole ITS region, its ITS1, and its ITS2 each are matched by different species: 14%
Percentage of cases where the ITS1 and ITS2 are best matched by the same species, but the whole region is best matched by another species: 11%
Percentage of cases where the ITS1 and ITS2 are best matched by different species: 40%
Total number of species in the whole FIS dataset: 13 351
Total number of species in the FIS ITS1 dataset: 12 699
Total number of species in the FIS ITS2 dataset: 13 103
Proportion of IIS/FIS in the whole dataset: 0.64
Proportion of IIS/FIS in the ITS1 dataset: 0.60
Proportion of IIS/FIS in the ITS2 dataset: 0.61

Discussion
Present pyrosequencing methods yield read lengths up to
about 250 bp, a marked improvement over the 80–100 bp
generated by the first generation of pyrosequencing machines, but only a third or perhaps half of both the length of
a typical capillary sequencing round and the length needed
to cover the ITS region in full for a wide selection of fungi.
Improvements in the length of pyrosequencing reads are
anticipated over time, but, at present, the user interested in
sequencing the ITS region with pyrosequencing technology
has to make a choice as to what part of the ITS to target. As if
to underline the dangers of taking this decision too lightly,
the present study shows that the choice of target region will
have an effect on one’s perception of the taxonomic diversity
in the sample at hand. This is, at some level, expected due to
the variable nature of the ITS1 and ITS2, which is made full
justice to only when compared separately from the very
conserved flanking and intercalary genes. Furthermore, the
partial state of some ITS sequences in INSD, with either the
ITS1 or the ITS2 missing entirely from a proportion of the
sequences (10% of the FIS and 21% of the IIS), can be
expected to introduce a degree of bias in such comparisons.
Even so, the magnitude of the discrepancies is such that it is
likely to find its way into large pyrosequencing datasets
where automated processing of the output is the only
feasible approach to species identification. More worrying
still is the observation that ITS1 and ITS2 disagree over the
taxonomic affiliation of the underlying query sequence in no
less than 40% of the cases (Table 1), although this figure is in
part explained by the presence of species groups with no or
little interspecific variation. The BLAST output order for hits
with identical match statistics – even though the species
names may differ – is for all practical purposes random.
Incorrectly annotated sequences, too, are likely to have
influenced these estimates somewhat (c.f. Bidartondo et al.,
2008).
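The agreement figures discussed above can be read as four mutually exclusive outcomes for each query: all three regions give the same species (49%), the ITS1 and ITS2 agree with each other but not with the whole region (11%), the whole region agrees with exactly one subregion (26%), and all three differ (14%). These sum to 100%, and the last two together give the 40% of cases in which the ITS1 and ITS2 disagree. The sketch below shows one way such a tally could be computed from best-match species names. It is not the authors' code: the function and category names are hypothetical, the example uses placeholder species labels, and synonyms and anamorph–teleomorph pairs are assumed to have been reconciled beforehand.

```python
from collections import Counter

CATEGORIES = (
    "all_same",
    "subregions_agree_whole_differs",
    "whole_matches_one_subregion",
    "all_different",
)


def classify(whole, its1, its2):
    """Assign one query to an agreement category from its three best-match species."""
    if whole == its1 == its2:
        return "all_same"
    if its1 == its2:
        return "subregions_agree_whole_differs"
    if whole == its1 or whole == its2:
        return "whole_matches_one_subregion"
    return "all_different"


def summarize(records):
    """Return the percentage of queries in each category.

    `records` is a list of (whole, its1, its2) species-name tuples.
    """
    if not records:
        raise ValueError("no records to summarize")
    counts = Counter(classify(*record) for record in records)
    return {c: 100.0 * counts[c] / len(records) for c in CATEGORIES}


def its1_its2_disagreement(summary):
    """Percentage of queries whose ITS1 and ITS2 best matches differ in species."""
    return summary["whole_matches_one_subregion"] + summary["all_different"]


# Placeholder data: one query per category.
records = [
    ("sp. A", "sp. A", "sp. A"),
    ("sp. A", "sp. B", "sp. B"),
    ("sp. A", "sp. A", "sp. B"),
    ("sp. A", "sp. B", "sp. C"),
]
summary = summarize(records)
print(summary)
print(its1_its2_disagreement(summary))
```

Because BLAST may list equally scoring hits in an effectively arbitrary order, a comparison on the single top hit, as here, would count some ties between near-identical species as disagreements. Comparing the sets of all top-scoring species instead would be one way to separate such ties from genuine conflicts.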
These results show that species-oriented ecosystem studies based on the whole of the ITS region – as is normally
done today – and those based on pyrosequencing of either
the ITS1 or the ITS2 – an approach expected to gain
popularity rapidly over the next few years – may portray
different pictures of the fungal diversity under scrutiny, a
fact that strongly militates against ecological conclusions
based on a direct comparison of such sets of results. The
present study, along with others, testifies to the benefits of
analysing the ITS1/ITS2 in isolation (i.e. with the flanking
and intercalary, highly conserved genes removed), at least if
the goal is to identify the sequences to the species or the
genus – as opposed to ordinal or phylum – level (c.f. Bruns
& Shefferson, 2004). In the interest of comparison of
ecosystems from different studies, the mycological community would be best served if it standardized on one of the two
subregions of the ITS as the basis of such pyrosequence-based studies of environmental samples. The two subregions
are roughly equal in terms of variability and length, but
there are more ITS2 than ITS1 available for comparison in
INSD (Table 1). More importantly, the gene in the downstream region of the ITS2 (encoding the ribosomal large
subunit, or the 25/28S) is known to be substantially more
useful for species identification and phylogenetic inference
up to the ordinal level than the gene downstream of the
ITS1, i.e. the very conserved 5.8S. Thus, any additional
downstream region retrieved while sequencing the ITS2
may contribute a further signal to the identification procedure while those downstream of the ITS1 are less likely to be
helpful. These observations, together with the wide selection
of auxiliary resources available for the ITS2 (e.g. Selig et al.,
2008; Coleman, 2009; Keller et al., 2009), make a case for the
ITS2 as the better choice for parallel sequencing, although
the issue of primer optimization in the fungal ITS region
needs further attention.
The data presented above leave little room for interpretation on one pressing issue: the largest obstacle to routine,
en masse identification of fungal sequences to the species
level is the striking paucity of well-identified, extensively
annotated, and sequence coverage-wise complete reference
sequences, preferably stemming from vouchered specimens
kept in public herbaria, in INSD (c.f. Brock et al., 2009).
Indeed, the sheer number of sequences from any pyrosequencing study is likely to further dilute the already limited
presence of FIS in the BLAST hit lists so as to complicate any
identification procedure even more. The present study
shows the INSD to contain FIS from the ITS region –
regardless of their suitability as reference sequences – for
about 13 350 species, a very modest 0.9% of the estimated
number of extant fungal species. Of the many issues
elaborated on in the barcoding and molecular identification
debate, taxonomy may well be the least considered and
furthermore the one where progress is the slowest and most
painstaking. The mycological community will soon be
awash in data in the form of unidentified – and often
unidentifiable – fungal ITS sequences from an abundance
of study sites and ecosystems, data with which taxonomy in
its current practice cannot be expected to keep pace. It
would be a severe set-back for mycology if such unidentified
taxa were to be given a different, ad hoc name in each study
in which they were recovered, as this would effectively close the route
to interpreting the taxa in the light of other studies. A
temporary system for formalizing clusters of unidentifiable
and to all appearances conspecific sequences into standardized and nonarbitrarily named molecular species pending
formal taxonomic interpretation and assignment is likely to
prove to be the only sustainable way to maintain data
comparability across studies and sites (c.f. Ryberg et al.,
2008; Horton et al., 2009). Any other, nonstandardized way
of delimiting and referring to such sequence clusters will
only serve to add further to the mounting burden of the
progressively fewer, and severely underfinanced, still active
fungal taxonomists. High-throughput sequencing represents an amazing technological feat that promises to reshape
mycology, but unless a unified infrastructure for processing
and interpretation of the results in a taxonomic context can
be agreed upon and implemented, the benefits of community profiling may well come at the price of the integrative nature of current public sequence repositories.
Acknowledgement
R.H.N. and K.A. acknowledge infrastructural support from
the Fungi in Boreal Forest Soils network.
References
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W
& Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids
Res 25: 3389–3402.
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J & Wheeler
DL (2008) GenBank. Nucleic Acids Res 36: D25–D30.
Bidartondo MI, Bruns TD, Blackwell M et al. (2008) Preserving
accuracy in GenBank. Science 319: 1616.
Blackwell M, Hibbett DS, Taylor JW & Spatafora JW (2006)
Research coordination networks: a phylogeny for kingdom
Fungi (Deep Hypha). Mycologia 98: 829–837.
Brock PM, Döring H & Bidartondo MI (2009) How to know
unknown fungi: the role of a herbarium. New Phytol 181:
719–724.
Bruns TD & Shefferson RP (2004) Evolutionary studies of
ectomycorrhizal fungi: milestones and future directions. Can J
Bot 82: 1122–1132.
Bruns TD, Arnold AE & Hughes KW (2008) Fungal networks
made of humans: UNITE, FESIN, and frontiers in fungal
ecology. New Phytol 177: 586–588.
Coleman AW (2009) Is there a molecular key to the level of
‘‘biological species’’ in eukaryotes? A DNA guide. Mol
Phylogenet Evol 50: 197–203.
Crous PW, Gams W, Stalpers JA, Robert V & Stegehuis G (2004)
MycoBank: an online initiative to launch mycology into the
21st century. Stud Mycol 50: 19–22.
Feliner GN & Rosselló JA (2007) Better the devil you know?
Guidelines for insightful utilization of nrDNA ITS in species-level evolutionary studies in plants. Mol Phylogenet Evol 44:
911–919.
Geml J, Laursen GA, O’Neill K, Nusbaum HC & Taylor DL (2006)
Beringian origins and cryptic speciation events in the fly agaric
(Amanita muscaria). Mol Ecol 15: 225–239.
Hawksworth DL (2001) The magnitude of fungal diversity: the
1.5 million species estimate revisited. Mycol Res 105:
1422–1432.
Hibbett DS (2007) After the gold rush, or before the flood?
Evolutionary morphology of mushroom-forming fungi
(Agaricomycetes) in the early 21st century. Mycol Res 111:
1001–1018.
Horton TR, Arnold AE & Bruns TD (2009) FESIN workshops at
ESA – the mycelial network grows. Mycorrhiza 19: 283–285.
Kahvejian A, Quackenbush J & Thompson JF (2008) What would
you do if you could sequence everything? Nat Biotechnol 26:
1125–1133.
Keller A, Schleicher T, Schultz J, Müller T, Dandekar T & Wolf M
(2009) 5.8S–28S rRNA interaction and HMM-based ITS2
annotation. Gene 430: 50–57.
Kõljalg U, Larsson K-H, Abarenkov K et al. (2005) UNITE: a
database providing web-based methods for the molecular
identification of ectomycorrhizal fungi. New Phytol 166:
1063–1068.
Liu Z, DeSantis TZ, Andersen GL & Knight R (2008) Accurate
taxonomy assignments from 16S rRNA sequences produced by
highly parallel pyrosequencers. Nucleic Acids Res 36: e120.
Metzker M (2005) Emerging technologies in DNA sequencing.
Genome Res 15: 1767–1776.
Nilsson RH, Kristiansson E, Ryberg M & Larsson K-H (2005)
Approaching the taxonomic affiliation of unidentified
sequences in public databases – an example from the
mycorrhizal fungi. BMC Bioinformatics 6: 178.
Nilsson RH, Ryberg M, Kristiansson E, Abarenkov K, Larsson K-H & Kõljalg U (2006) Taxonomic reliability of DNA sequences
in public sequence databases: a fungal perspective. PLoS ONE
1: e59.
Nilsson RH, Kristiansson E, Ryberg M, Hallenberg N & Larsson
K-H (2008) Intraspecific ITS variability in the kingdom Fungi
as expressed in the international sequence databases and its
implications for molecular species identification. Evol
Bioinform 8: 193–201.
Paulus B, Nilsson RH & Hallenberg N (2007) Phylogenetic
studies in Hypochnicium (Basidiomycota), with special
emphasis on species from New Zealand. New Zeal J Bot 45:
139–150.
Porter TM, Skillman JE & Moncalvo JM (2008) Fruiting body
and soil rDNA sampling detects complementary assemblage of
Agaricomycotina (Basidiomycota, Fungi) in a hemlock-dominated forest plot in southern Ontario. Mol Ecol 17:
3037–3050.
Ryberg M, Nilsson RH, Kristiansson E, Jacobsson S & Larsson E
(2008) Mining metadata from unidentified ITS sequences in
GenBank: a case study in Inocybe (Basidiomycota). BMC Evol
Biol 8: 50.
Ryberg M, Kristiansson E, Sjökvist E & Nilsson RH (2009) An
outlook on the fungal ITS sequences in GenBank and the
introduction of a web-based tool for the exploration of fungal
diversity. New Phytol 181: 471–477.
Selig C, Wolf M, Müller T, Dandekar T & Schultz J (2008) The
ITS2 Database II: homology modelling RNA structure for
molecular systematics. Nucleic Acids Res 36: D377–D380.
Shendure J & Ji H (2008) Next-generation DNA sequencing. Nat
Biotechnol 26: 1135–1145.
Strausberg RL, Levy S & Rogers Y-H (2008) Emerging DNA
sequencing technologies for human genomic medicine. Drug
Discov Today 13: 569–577.
Taylor AFS (2008) Recent advances in our understanding of
fungal ecology. Coolia 52: 197–212.
Vilgalys R & Gonzalez D (1990) Organization of ribosomal DNA
in the basidiomycete Thanatephorus praticola. Curr Genet 18:
277–280.
Supporting Information
Additional Supporting Information may be found in the
online version of this article:
Appendix S1. Additional statistics pertaining to the IIS and
FIS datasets.
Please note: Wiley-Blackwell is not responsible for the
content or functionality of any supporting materials supplied by the authors. Any queries (other than missing
material) should be directed to the corresponding author
for the article.