RESEARCH LETTER

The ITS region as a target for characterization of fungal communities using emerging sequencing technologies

Rolf Henrik Nilsson1, Martin Ryberg1, Kessy Abarenkov2, Elisabet Sjökvist1 & Erik Kristiansson3

1 Department of Plant and Environmental Sciences, University of Gothenburg, Göteborg, Sweden; 2 Department of Botany, Institute of Ecology and Earth Sciences, University of Tartu, Tartu, Estonia; and 3 Department of Zoology, University of Gothenburg, Göteborg, Sweden

Correspondence: Rolf Henrik Nilsson, Department of Plant and Environmental Sciences, University of Gothenburg, PO Box 461, 405 30 Göteborg, Sweden. Tel.: +46 31 786 2623; fax: +46 31 786 2560; e-mail: [email protected]

Received 13 December 2008; accepted 7 April 2009. Final version published online 1 May 2009.

DOI:10.1111/j.1574-6968.2009.01618.x

Editor: Jan Dijksterhuis

Keywords
massively parallel sequencing; community profiling; sequence identification.

Abstract
The advent of new high-throughput DNA-sequencing technologies promises to redefine the way in which fungi and fungal communities – as well as other groups of organisms – are studied in their natural environment. With read lengths of some few hundred base pairs, massively parallel sequencing (pyrosequencing) stands out among the new technologies as the most apt for large-scale species identification in environmental samples. Although parallel pyrosequencing can generate hundreds of thousands of sequences at an exceptional speed, the limited length of the reads may pose a problem to the species identification process. This study explores whether the discrepancy in read length between parallel pyrosequencing and traditional (Sanger) sequencing will have an impact on the perceived taxonomic affiliation of the underlying species.
Based on all 39 200 publicly available fungal environmental DNA sequences representing the nuclear ribosomal internal transcribed spacer (ITS) region, the results show that the two approaches give rise to quite different views of the diversity of the underlying samples. Standardization of which subregion from the ITS region should be sequenced, as well as a recognition that the composition of fungal communities as depicted through different sequencing methods need not be directly comparable, appear crucial to the integration of the new sequencing technologies with current mycological praxis.

FEMS Microbiol Lett 296 (2009) 97–101

Introduction
Mycologists face the daunting task of characterizing the very large and unwieldy fungal kingdom in a taxonomic context. Estimated at 1.5 million species and reported from little short of all biota on Earth, fungi are thought to be responsible for many key ecological functions such as wood and litter decomposition, mycorrhizal associations, and other forms of nutrient recycling (Hawksworth, 2001). Inconspicuous by default, fungi are typically noticed only when they form above-ground fruiting bodies or other propagules. The study of fungi is thus plagued by a reliance on ephemeral structures whose presence or absence is only weakly correlated with the actual mycoflora of the collection site (Porter et al., 2008). Adding to the complexity, even outwardly very similar or identical fruiting bodies often prove to represent several distinct (cryptic) species (Geml et al., 2006; Paulus et al., 2007). These observations make a compelling case for DNA sequences as a vital information source in contemporary mycology, and the scientific study of fungi is indeed as much a molecular as a morphological enterprise today (Blackwell et al., 2006; Hibbett, 2007). The last few years have witnessed a surge in the interest in characterizing the mycoflora of entire localities and ecosystems (Bruns et al., 2008; Taylor, 2008).
The desire to sequence whole communities of fungi from any given study site imposes such high demands in terms of high-throughput sequencing as to question the use of the presently popular capillary (Sanger)-based techniques in the first place (c.f. Metzker, 2005; Kahvejian et al., 2008). Indeed, emerging sequencing technologies with the capacity to generate hundreds of thousands of limited-length sequences within a matter of hours promise to take over the sequencing role for environmental studies (Strausberg et al., 2008). Three major platforms (Applied Biosystems SOLiD, Illumina Sequencing, and 454 Life Science/Roche massively parallel pyrosequencing) are presently in use for high-throughput sequencing, but only pyrosequencing yields long enough DNA templates to be considered for rigid use in a species-level classification framework (SOLiD, 31 bp; Illumina, 36 bp; pyrosequencing, c. 250 bp; Shendure & Ji, 2008). Although the current pyrosequencer GS FLX Standard is bound by an upper sequence length of about 250 bp, pyrosequencing of target genes and regions known to be sufficiently variable should in theory yield enough information to allow identification to the species level (Liu et al., 2008).

© 2009 Federation of European Microbiological Societies. Published by Blackwell Publishing Ltd. All rights reserved.

In mycology, the internal transcribed spacer (ITS) region of the nuclear ribosomal repeat unit is by far the most commonly sequenced region for queries of systematics and taxonomy at and below the genus level. Although the ITS region is not entirely unproblematic (Feliner & Rosselló, 2007), > 100 000 fungal ITS sequences have been deposited in the International Nucleotide Sequence Databases (INSD; Benson et al., 2008) since the early 1990s (Nilsson et al., 2008).
The roughly 650-bp region is normally obtainable in a single round of Sanger DNA sequencing, and of its three subregions (the spacers ITS1 and ITS2 and the 5.8S gene), two (ITS1 and ITS2) show a high rate of evolution and are typically species specific (Bruns & Shefferson, 2004; Kõljalg et al., 2005). The large number of ITS copies per cell (upwards of 250; Vilgalys & Gonzalez, 1990) makes the region an appealing target for sequencing substrates where the initial amount of DNA is low, such as in environmental samples from soil and wood. Jointly, these observations make a compelling case for the ITS region as a prime target for pyrosequencing – targeted at either the ITS1 or ITS2 – of environmental samples of fungi. Based on the 39 200 available environmental ITS sequences of fungi, the present study investigates the ramifications of choosing either of these two subregions over the other, as well as over the whole ITS region, for purposes of molecular characterization of fungal communities. Questions of how to make the most of the data from high-throughput sequencing of environmental samples are cast in a taxonomic perspective.

Materials and methods
All fungal ITS sequences annotated as such in INSD as of November 2008 were downloaded and divided into two datasets: those that were identified to the species level (fully identified sequences, FIS) and those that were not (insufficiently identified sequences, IIS) following the procedure of Nilsson et al. (2005). The fungus-specific Hidden Markov Models of Ryberg et al. (2009) were used to locate and extract the ITS1 and ITS2 from the sequences, and all entries were stored in a local MySQL database (http://www.mysql.com). The IIS are, to a large degree, obtained through environmental sampling such that their nature makes them
attractive as query sequences in studies addressing the properties of environmental sequencing. Thus, to simulate the authentic situation where unidentified sequences have been obtained through sequencing of environmental samples and are queried against the INSD for taxonomic affiliation using BLAST (Altschul et al., 1997), all IIS featuring both the ITS1 and the ITS2 (in full or in part; defined as > 40 bp) were compared in full against the FIS dataset using BLAST 2.2.18. These comparisons were repeated using only the ITS1, and then the ITS2, of these IIS to mimic limited-length pyrosequencing data. All entries were tagged with the best BLAST match to the FIS dataset for the complete sequence data as well as for each of the ITS1 and ITS2. To minimize the impact of questionable matches and technical artefacts (Nilsson et al., 2006), only sequences where both the ITS1 and ITS2 found relevant matches (BLAST E-value threshold, 10^-10) among the FIS were used for comparison. To examine the impact of partial vs. full ITS1 and ITS2 data, respectively, the results from BLAST analysis of the entire ITS1 region of four sets of 10 000 ITS sequences were contrasted with the results obtained through analysing only the first 100 bp of the same sequences (and similarly for the ITS2; Supporting Information, Appendix S1). The IIS for which one or both of the ITS1 and ITS2 were missing are not treated any further in this study and are excluded from the statistics reported below. Synonyms and anamorph–teleomorph relationships were established through the Centraalbureau voor Schimmelcultures databases (Crous et al., 2004; http://www.cbs.knaw.nl/databases/) and are accounted for in the following.

Results
A total of 100 639 fungal ITS sequences from 1992 and onwards were downloaded from INSD. Sixty-one percent (61 471 sequences) were identified to species level, leaving 39% (39 168 sequences) insufficiently identified.
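The comparison described under Materials and methods, in which each query sequence is tagged with its best match for the full sequence, the ITS1, and the ITS2, can be illustrated with a minimal sketch. The function names, category labels, and input layout below are hypothetical and are not taken from the original analysis; the sketch also assumes that species names have already been normalized for synonyms and anamorph–teleomorph pairs.

```python
from collections import Counter


def classify(full, its1, its2):
    """Return an agreement category for three best-match species names.

    full, its1, its2: species name of the best match obtained with the
    whole ITS region, the ITS1 only, and the ITS2 only (hypothetical input).
    """
    if full == its1 == its2:
        return "all_agree"
    if its1 == its2:
        # The two spacers agree, but the whole region points elsewhere.
        return "subregions_agree_only"
    if full == its2:
        return "full_agrees_with_its2"
    if full == its1:
        return "full_agrees_with_its1"
    return "all_differ"


def summarize(records):
    """Return the percentage of records falling into each category.

    records: iterable of (full, its1, its2) tuples of species names.
    """
    counts = Counter(classify(*record) for record in records)
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}
```

Applied to a list of (full, ITS1, ITS2) best-match tuples, `summarize` returns the share of sequences in each category, which is the kind of breakdown reported in Table 1.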
A complete or partial ITS1 was extracted and found to have a sufficiently good match to the FIS dataset for 77% of the IIS; the corresponding value for ITS2 was 80%. A total of 26 577 sequences (68% of the IIS) fulfilled all the criteria, having ITS1 and ITS2 of sufficient length and producing sufficiently good matches to the FIS dataset for both ITS1 and ITS2; these were designated as the query sequences of the study. The average length of the full IIS under scrutiny (including all three subregions and any part of the flanking ribosomal subunit genes) was 646 bp; that of the ITS1 was 182 bp; and that of the ITS2 was 183 bp. A moderate 22% of the entries were found to have the same INSD entry (accession number) as their best BLAST match regardless of which one of the regions (full sequence, ITS1, or ITS2) was used as a query (Table 1). The choice of region had a clear impact on the perceived taxonomic affiliation of the sequence, with not less than 51% of the IIS showing not just another INSD entry but another species altogether as their best match (and in 21% of the total number of cases even a different genus) depending on which one was used for comparison. The three subregions disagreed completely on the species level in 14% of the cases (but in only 4% on the genus level). Thus, with respect to taxonomic affiliation, only in 49% of the cases did the choice of target region not matter at all. The full ITS region yielded the same BLAST results in terms of taxonomic affiliations as one, but not both, of its ITS1 and ITS2 26% of the time, with ITS2 (14%) concurring more often than the ITS1 (12%) with the taxonomic affiliation suggested by the entire ITS region. The ITS1 and ITS2 reported the same species, which was not suggested by the complete sequence, as their best BLAST match in a total of 11% of the cases. Roughly 20% of the ITS1 sequences under examination were assigned a different taxonomic affiliation by BLAST depending on whether the full ITS1 data or only the first 100 bp of the ITS1 were used (ITS2, 22%; Appendix S1).

Table 1. Summary statistics for the fungal ITS sequences in INSD as of December 2008 and the results from their analysis (in full and as broken up into constituent subregions) using BLAST
Number of ITS sequences in INSD: 100 639
Number of ITS sequences with > 40 bp ITS1: 90 200
Number of ITS sequences with > 40 bp ITS2: 93 655
Number of ITS sequences with > 40 bp of both ITS1 and ITS2: 85 914
Percentage of cases where the whole ITS region, its ITS1 and ITS2 are best matched by the same INSD entry (accession number): 22%
Percentage of cases where the whole ITS region, its ITS1 and ITS2 are best matched by the same species: 49%
Percentage of cases where the whole ITS region, its ITS1, and its ITS2 each are matched by different species: 14%
Percentage of cases where the ITS1 and ITS2 are best matched by the same species, but the whole region is best matched by another species: 11%
Percentage of cases where the ITS1 and ITS2 are best matched by different species: 40%
Total number of species in the whole FIS dataset: 13 351
Total number of species in the FIS ITS1 dataset: 12 699
Total number of species in the FIS ITS2 dataset: 13 103
Proportion of IIS/FIS in the whole dataset: 0.64
Proportion of IIS/FIS in the ITS1 dataset: 0.60
Proportion of IIS/FIS in the ITS2 dataset: 0.61

Discussion
Present pyrosequencing methods yield read lengths up to about 250 bp, a marked improvement over the 80–100 bp generated by the first generation of pyrosequencing machines, but only a third or perhaps half of both the length of a typical capillary sequencing round and the length needed to cover the ITS region in full for a wide selection of fungi. Improvements in the length of pyrosequencing reads are anticipated over time, but, at present, the user interested in sequencing the ITS region with pyrosequencing technology has to make a choice as to what part of the ITS to target. As if to underline the dangers of taking this decision too lightly, the present study shows that the choice of target region will have an effect on one’s perception of the taxonomic diversity in the sample at hand. This is, at some level, expected due to the variable nature of the ITS1 and ITS2, which is done full justice only when they are compared separately from the very conserved flanking and intercalary genes. Furthermore, the partial state of some ITS sequences in INSD, with either the ITS1 or the ITS2 missing entirely from a proportion of the sequences (10% of the FIS and 21% of the IIS), can be expected to introduce a degree of bias in such comparisons. Even so, the magnitude of the discrepancies is such that it is likely to find its way into large pyrosequencing datasets where automated processing of the output is the only feasible approach to species identification.
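As a consistency check, the percentages reported in the Results and in Table 1 can be recombined as follows. This is a minimal sketch; the variable names are illustrative and not part of the original analysis, and the grouping of categories is an interpretation of the reported figures.

```python
# Percentages of query sequences, as reported in the Results and Table 1.
all_agree = 49              # whole region, ITS1 and ITS2: same species
all_differ = 14             # whole region, ITS1 and ITS2: three different species
subregions_agree_only = 11  # ITS1 and ITS2 agree, whole region differs
full_agrees_with_its2 = 14  # whole region agrees with ITS2 only
full_agrees_with_its1 = 12  # whole region agrees with ITS1 only

full_agrees_with_one = full_agrees_with_its2 + full_agrees_with_its1

# The four mutually exclusive outcomes should cover all query sequences.
partition_total = (all_agree + all_differ
                   + subregions_agree_only + full_agrees_with_one)

# ITS1 and ITS2 point to different species whenever all three regions
# differ or the whole region sides with exactly one of the two spacers.
its1_its2_disagree = all_differ + full_agrees_with_one

# Cases where the choice of region changes the best-matching species.
any_disagreement = 100 - all_agree

print(partition_total, its1_its2_disagree, any_disagreement)
```

Under this reading the categories add up to the full set of query sequences, and the share of cases in which ITS1 and ITS2 disagree matches the corresponding row of Table 1.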
More worrying still is the observation that ITS1 and ITS2 disagree over the taxonomic affiliation of the underlying query sequence in no less than 40% of the cases (Table 1), although this figure is in part explained by the presence of species groups with no or little interspecific variation. The BLAST output order for hits with identical match statistics – even though the species names may differ – is for all practical purposes random. Incorrectly annotated sequences, too, are likely to have influenced these estimates somewhat (c.f. Bidartondo et al., 2008). These results show that species-oriented ecosystem studies based on the whole of the ITS region – as is normally done today – and those based on pyrosequencing of either the ITS1 or the ITS2 – an approach expected to gain popularity rapidly over the next few years – may portray different pictures of the fungal diversity under scrutiny, a fact that strongly militates against ecological conclusions based on a direct comparison of such sets of results. The present study, along with others, testifies to the benefits of analysing the ITS1/ITS2 in isolation (i.e. with the flanking and intercalary, highly conserved genes removed), at least if the goal is to identify the sequences to the species or the genus – as opposed to ordinal or phylum – level (c.f. Bruns & Shefferson, 2004). In the interest of comparison of ecosystems from different studies, the mycological community would be best off if it standardized on one of the two subregions of the ITS as the basis of such pyrosequence-based studies of environmental samples. The two subregions are roughly equal in terms of variability and length, but there are more ITS2 than ITS1 available for comparison in INSD (Table 1).
More importantly, the gene in the downstream region of the ITS2 (encoding the ribosomal large subunit, or the 25/28S) is known to be substantially more useful for species identification and phylogenetic inference up to the ordinal level than the gene downstream of the ITS1, i.e. the very conserved 5.8S. Thus, any additional downstream region retrieved while sequencing the ITS2 may contribute a further signal to the identification procedure while those downstream of the ITS1 are less likely to be helpful. These observations, together with the wide selection of auxiliary resources available for the ITS2 (e.g. Selig et al., 2008; Coleman, 2009; Keller et al., 2009), make a case for the ITS2 as the better choice for parallel sequencing, although the issue of primer optimization in the fungal ITS region needs further attention. The data presented above leave little room for interpretation on one pressing issue: the largest obstacle to routine, en masse identification of fungal sequences to the species level is the striking paucity of well-identified, extensively annotated, and sequence coverage-wise complete reference sequences, preferably stemming from vouchered specimens kept in public herbaria, in INSD (c.f. Brock et al., 2009). Indeed, the sheer number of sequences from any pyrosequencing study is likely to further dilute the already limited presence of FIS in the BLAST hit lists so as to complicate any identification procedure even more. The present study shows the INSD to contain FIS from the ITS region – regardless of their suitability as reference sequences – for about 13 350 species, a very modest 0.9% of the estimated number of extant fungal species. Of the many issues elaborated on in the barcoding and molecular identification debate, taxonomy may well be the least considered and furthermore the one where progress is the slowest and most painstaking. 
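The reference coverage figure quoted above, together with the dataset proportions given in the Results and Table 1, can be reproduced with a few lines of arithmetic. This is a minimal sketch; the variable names are illustrative, and the species estimate is the 1.5 million figure cited in the Introduction (Hawksworth, 2001).

```python
# Figures taken from the Results, Table 1 and the Introduction.
fis_species = 13_351           # species represented in the whole FIS dataset
estimated_species = 1_500_000  # estimated number of extant fungal species
fis_sequences = 61_471         # fully identified sequences
iis_sequences = 39_168         # insufficiently identified sequences

# Share of the estimated fungal diversity with an identified ITS sequence.
coverage_percent = 100 * fis_species / estimated_species

# Ratio of insufficiently identified to fully identified sequences.
iis_to_fis = iis_sequences / fis_sequences

print(f"coverage: {coverage_percent:.1f}%")
print(f"IIS/FIS: {iis_to_fis:.2f}")
```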
The mycological community will soon be awash in data in the form of unidentified – and often unidentifiable – fungal ITS sequences from an abundance of study sites and ecosystems, data with which taxonomy in its current practice cannot be expected to keep pace. It would be a severe set-back for mycology if such unidentified taxa were to be given a different, ad hoc name in each study in which they were recovered, as this would effectively close the route to interpreting the taxa in the light of other studies. A temporary system for formalizing clusters of unidentifiable and to all appearances conspecific sequences into standardized and nonarbitrarily named molecular species pending formal taxonomic interpretation and assignment is likely to prove to be the only sustainable way to maintain data comparability across studies and sites (c.f. Ryberg et al., 2008; Horton et al., 2009). Any other, nonstandardized way of delimiting and referring to such sequence clusters will only serve to add further to the mounting burden of the progressively fewer, and severely underfinanced, still active fungal taxonomists. High-throughput sequencing represents an amazing technological feat that promises to reshape mycology, but unless a unified infrastructure for processing and interpretation of the results in a taxonomic context can be agreed upon and implemented, the benefits of community profiling may well come at the price of the integrative nature of current public sequence repositories.

Acknowledgement
R.H.N. and K.A. acknowledge infrastructural support from the Fungi in Boreal Forest Soils network.

References
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W & Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J & Wheeler DL (2008) GenBank. Nucleic Acids Res 36: D25–D30.
Bidartondo MI, Bruns TD, Blackwell M et al. (2008) Preserving accuracy in GenBank. Science 319: 1616.
Blackwell M, Hibbett DS, Taylor JW & Spatafora JW (2006) Research coordination networks: a phylogeny for kingdom Fungi (Deep Hypha). Mycologia 98: 829–837.
Brock PM, Döring H & Bidartondo MI (2009) How to know unknown fungi: the role of a herbarium. New Phytol 181: 719–724.
Bruns TD & Shefferson RP (2004) Evolutionary studies of ectomycorrhizal fungi: milestones and future directions. Can J Bot 82: 1122–1132.
Bruns TD, Arnold AE & Hughes KW (2008) Fungal networks made of humans: UNITE, FESIN, and frontiers in fungal ecology. New Phytol 177: 586–588.
Coleman AW (2009) Is there a molecular key to the level of "biological species" in eukaryotes? A DNA guide. Mol Phylogenet Evol 50: 197–203.
Crous PW, Gams W, Stalpers JA, Robert V & Stegehuis G (2004) MycoBank: an online initiative to launch mycology into the 21st century. Stud Mycol 50: 19–22.
Feliner GN & Rosselló JA (2007) Better the devil you know? Guidelines for insightful utilization of nrDNA ITS in species-level evolutionary studies in plants. Mol Phylogenet Evol 44: 911–919.
Geml J, Laursen GA, O’Neill K, Nusbaum HC & Taylor DL (2006) Beringian origins and cryptic speciation events in the fly agaric (Amanita muscaria). Mol Ecol 15: 225–239.
Hawksworth DL (2001) The magnitude of fungal diversity: the 1.5 million species estimate revisited. Mycol Res 105: 1422–1432.
Hibbett DS (2007) After the gold rush, or before the flood? Evolutionary morphology of mushroom-forming fungi (Agaricomycetes) in the early 21st century. Mycol Res 111: 1001–1018.
Horton TR, Arnold EA & Bruns TD (2009) FESIN workshops at ESA – the mycelial network grows. Mycorrhiza 19: 283–285.
Kahvejian A, Quackenbush J & Thompson JF (2008) What would you do if you could sequence everything? Nat Biotechnol 26: 1125–1133.
Keller A, Schleicher T, Schultz J, Müller T, Dandekar T & Wolf M (2009) 5.8S–28S rRNA interaction and HMM-based ITS2 annotation. Gene 430: 50–57.
Kõljalg U, Larsson K-H, Abarenkov K et al. (2005) UNITE: a database providing web-based methods for the molecular identification of ectomycorrhizal fungi. New Phytol 166: 1063–1068.
Liu Z, DeSantis TZ, Andersen GL & Knight G (2008) Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Res 36: e120.
Metzker M (2005) Emerging technologies in DNA sequencing. Genome Res 15: 1767–1776.
Nilsson RH, Kristiansson E, Ryberg M & Larsson K-H (2005) Approaching the taxonomic affiliation of unidentified sequences in public databases – an example from the mycorrhizal fungi. BMC Bioinformatics 6: 178.
Nilsson RH, Ryberg M, Kristiansson E, Abarenkov K, Larsson K-H & Kõljalg U (2006) Taxonomic reliability of DNA sequences in public sequence databases: a fungal perspective. PLoS ONE 1: e59.
Nilsson RH, Kristiansson E, Ryberg M, Hallenberg N & Larsson K-H (2008) Intraspecific ITS variability in the kingdom Fungi as expressed in the international sequence databases and its implications for molecular species identification. Evol Bioinform 8: 193–201.
Paulus B, Nilsson RH & Hallenberg N (2007) Phylogenetic studies in Hypochnicium (Basidiomycota), with special emphasis on species from New Zealand. New Zeal J Bot 45: 139–150.
Porter TM, Skillman JE & Moncalvo JM (2008) Fruiting body and soil rDNA sampling detects complementary assemblage of Agaricomycotina (Basidiomycota, Fungi) in a hemlock-dominated forest plot in southern Ontario. Mol Ecol 17: 3037–3050.
Ryberg M, Nilsson RH, Kristiansson E, Jacobsson S & Larsson E (2008) Mining metadata from unidentified ITS sequences in GenBank: a case study in Inocybe (Basidiomycota). BMC Evol Biol 8: 50.
Ryberg M, Kristiansson E, Sjökvist E & Nilsson RH (2009) An outlook on the fungal ITS sequences in GenBank and the introduction of a web-based tool for the exploration of fungal diversity. New Phytol 181: 471–477.
Selig C, Wolf M, Muller T, Dandekar T & Schultz J (2008) The ITS2 Database II: homology modelling RNA structure for molecular systematics. Nucleic Acids Res 36: D377–D380.
Shendure J & Ji H (2008) Next-generation DNA sequencing. Nat Biotechnol 26: 1135–1145.
Strausberg RL, Levy S & Rogers Y-H (2008) Emerging DNA sequencing technologies for human genomic medicine. Drug Discov Today 13: 569–577.
Taylor AFS (2008) Recent advances in our understanding of fungal ecology. Coolia 52: 197–212.
Vilgalys R & Gonzalez D (1990) Organization of ribosomal DNA in the basidiomycete Thanatephorus praticola. Curr Genet 18: 277–280.

Supporting Information
Additional Supporting Information may be found in the online version of this article:
Appendix S1. Additional statistics pertaining to the IIS and FIS datasets.
Please note: Wiley-Blackwell is not responsible for the content or functionality of any supporting materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the article.