Application Note: Metagenomics Analysis

Metagenomics Sequence Analysis

Real Time Genomics has created a comprehensive analysis platform that delivers fast and accurate compositional analysis of microbial communities.

Analysis Highlights
• Highly Accurate: Delivers precise abundance estimates and more accurate reporting of translated nucleotide homology
• Fast: Orders of magnitude faster at similar sensitivity thresholds
• Easy-to-Use: Seamlessly integrates into existing workflows
• Extensible: Works with Illumina, Ion Torrent, and 454 data; handles single sample or community-scale datasets

Introduction

Research and clinical studies are leveraging the power of next-generation sequencing for evaluating metagenomic data across microbial habitats from large populations. Whole Genome Sequencing (WGS) provides unprecedented opportunities for discovering novel organisms and evolving pathogens, for discerning organismal composition at the species and strain level, and for adding a new dimension with functional data. The analysis of gene content in microbiomes has revealed the potential of WGS approaches for exploring metabolic processing potential and pathway representation in health and disease.

Using Real Time Genomics’ unique metagenomics pipeline, researchers are able to leverage WGS data to assess the genetic content of their samples and describe the known metabolic pathways for discovery of novel genes and associated variants. Understanding the metabolic capabilities of microbiomes will provide the basis for new diagnostics and novel therapeutics and help to uncover aspects of drug metabolism that are affected directly and indirectly by gut microbiota. For these breakthroughs, Real Time Genomics provides fast and highly accurate mapping and species abundance capabilities that allow for the analysis and comparison of hundreds of WGS samples in a robust and quantitative manner.
Microbial Community Structure

Real Time Genomics’ metagenomic pipeline was developed to estimate the abundance or frequency of a particular genome (typically a bacterial species) in a complex metagenomic sample. The calculations are performed on standard SAM files after reads have been mapped to a reference genome set containing thousands of genomes, many of which are highly related at the nucleotide level.

Figure 1: Bacterial Community Composition. Fractional representation of the top 20 species identified in a Human Stool Microbiome.

Species composition is calculated in two ways. First, an estimate is derived for the proportion of the organisms in the sample. Second, we produce an estimate of the species’ fractional abundance in the DNA of the sample by normalizing according to the genome length of each species (fractionindividuals). Breadth of coverage is the number of bases in the species’ genome covered by at least one mapped read, divided by the total reference length. Depth of coverage is the total number of mapped bases divided by the total reference length. Ambiguous reads are weighted by their number of occurrences in the SAM records across the reference genomes: a read with n records contributes 1/n of its bases per record. The result is a highly accurate and comprehensive estimate of microbial community composition.

Shotgun Metagenomics

The ability to exploit the entire genomic content of a microbial environment is a distinguishing characteristic of a WGS-based approach as compared to the single-gene, 16S rRNA survey. Reference databases with sufficiently broad phylogenetic distributions of known organisms can be interrogated to gain species or strain-level resolution versus typical genera-level abundance estimates from 16S data. One of the biggest obstacles for the quantitative analysis of WGS involves the utilization of reads that map to multiple related genomes.
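The breadth, depth, and 1/n weighting described under Microbial Community Structure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Real Time Genomics’ implementation: each alignment is a simplified stand-in for a SAM record, given as a hypothetical tuple (read_id, ref_name, start, length) with a 0-based start and no gaps or clipping. The function name `coverage_stats` and the output keys `base_share` and `depth_share` are invented for this example, and how those two shares correspond to the two species composition estimates in the note is itself an assumption.

```python
from collections import Counter, defaultdict


def coverage_stats(alignments, ref_lengths):
    """Per-genome breadth, weighted depth, and two normalized shares.

    alignments: iterable of (read_id, ref_name, start, length) tuples,
                a simplified stand-in for SAM records (assumption).
    ref_lengths: dict mapping ref_name to genome length in bases.
    """
    alignments = list(alignments)
    # n = number of records each read has across all reference genomes.
    records_per_read = Counter(read_id for read_id, _, _, _ in alignments)

    covered = defaultdict(set)            # positions with >= 1 mapped read
    weighted_bases = defaultdict(float)   # mapped bases, each record scaled by 1/n

    for read_id, ref, start, length in alignments:
        covered[ref].update(range(start, start + length))
        weighted_bases[ref] += length / records_per_read[read_id]

    stats = {}
    for ref, ref_len in ref_lengths.items():
        stats[ref] = {
            "breadth": len(covered[ref]) / ref_len,
            "depth": weighted_bases[ref] / ref_len,
        }

    total_bases = sum(weighted_bases.values())
    total_depth = sum(s["depth"] for s in stats.values())
    for ref, s in stats.items():
        # Share of all weighted mapped bases (not length-normalized).
        s["base_share"] = weighted_bases[ref] / total_bases if total_bases else 0.0
        # Share after dividing by genome length (length-normalized).
        s["depth_share"] = s["depth"] / total_depth if total_depth else 0.0
    return stats


ref_lengths = {"genomeA": 100, "genomeB": 200}
alignments = [
    ("r1", "genomeA", 0, 10),
    ("r2", "genomeA", 5, 10),
    ("r3", "genomeA", 50, 10),   # r3 is ambiguous: it has two records,
    ("r3", "genomeB", 20, 10),   # so each contributes 1/2 of its bases
    ("r4", "genomeB", 100, 10),
]
stats = coverage_stats(alignments, ref_lengths)
for name, values in sorted(stats.items()):
    print(name, values)
```

Storing covered positions in a set keeps the sketch short but would not scale to real genomes; an interval merge or a per-base array would be the usual choice for production data.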
Previous approaches avoid representing ambiguously mapped reads in calculations, and in doing so significantly underestimate coverage of a proportion of the species in the metagenome. The result of Real Time Genomics’ WGS-based approach is higher species specificity both for highly abundant and low abundance species within the metagenome, as described in Table 1.

Table 1: Specificity of RTG’s Metagenomics Results. Table 1 presents results of Real Time Genomics’ Metagenomics pipeline run on a synthetic sample containing ten genomes. The sample had 1,000,000 Illumina reads, 100bp PE, with 1% introduced error (90% SNP/10% indel). Results show species abundance was reported very close to actual over a wide dynamic range and with highly similar strains. The results show more accurate reporting than mapping-based approaches.

Community Metabolic Function

As part of the end-to-end metagenomics product, Real Time Genomics allows for the identification and characterization of the metabolic functions of microbial communities. In humans, normal and pathogenic states of dynamic habitats such as the gut or oral cavity may be a consequence of altered metabolic processes. In taking a metagenomics approach to metabolic profiling, the most important analytical component is accurately identifying homologous matches to translated short reads in a protein database.

Real Time Genomics takes short read nucleotide sequence data and performs robust similarity searches of the six-frame translations against a protein database. Using optimized data structures, the pipeline builds a set of indexes from the translated reads and scans each database subject for matches according to user-specified sensitivity settings. Index sets can be constructed by varying the number and placement of n-mers across each translated query, with the result being a highly sensitive and flexible homology search engine.

Table 2. Accuracy of Homology Detection vs.
BLASTX. Percent homology corresponds to the fraction of the sequenced read that matches a corresponding translated short read in the NCBI nr database. Results as compared with BLASTX.

However, a primary trade-off of high sensitivity when using existing homology search techniques is a steep cost in processing time. To assess the speed of the Real Time Genomics pipeline, 10 million 100bp Illumina reads were searched against the NCBI nr protein database. Table 3 shows the processing time as compared to NCBI BLASTX using similar multicore threaded processing, with Real Time Genomics several orders of magnitude faster.

Table 3. Speed of Translated Protein Search vs. BLASTX. Processing time using multicore threaded processing on 10M, 100bp Illumina reads against the NCBI nr database. Speed factor as compared to BLASTX 2.2.26 on similar multicore threaded configuration.

Community Relationships

Following the accurate detection of translated sequence in a metagenomics sample, Real Time Genomics identifies genes and metabolic pathways present and then produces estimates of gene and pathway abundances within the community. Thus WGS data from metagenomes can be used to infer relationships between multiple communities or analyse dynamics within a single community, including differential abundance across multiple samples. The most common techniques employ either a distance matrix method or parsimony, Bayesian or likelihood methods to perform phylogeny reconstruction. Alternatively, community analysis based solely on sequence composition (as opposed to pairwise sequence alignment) can be attempted to provide an absolute view of community relatedness.

Real Time Genomics has developed a sequence-based community comparison protocol that uses Principal Components Analysis methods to analyse the computed kmer-based data matrix across multiple samples. An example from the Human Microbiome Project provided by Washington University, St.
Louis is shown in Figure 2.

Figure 2. Sequence-based Clustering with Real Time Genomics’ Similarity Report. Plot showing 3D visualization of 628 HMP samples taken from oral (buccal mucosa, supragingival plaque and tongue dorsum), nasal (anterior nares), gut (stool), and vaginal (posterior fornix) body habitats (see Huttenhower, et al. Nature 2012). The similarity report produces a kmer-based matrix which is then analyzed by a singular value decomposition technique.

Real Time Genomics, Inc. • 999 Bayhill Drive, Suite 101, San Bruno, CA 94066 USA • tel 1.415.441.2466 • [email protected] • realtimegenomics.com
FOR RESEARCH USE ONLY
© 2012 Real Time Genomics, Inc. All rights reserved. Real Time Genomics is a trademark of Real Time Genomics, Inc. All other brands and names contained herein are the property of their respective owners. Current as of 22 October, 2012.
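As a rough illustration of the kmer-based matrix and decomposition step described for Figure 2, the sketch below builds a relative k-mer frequency profile per sample and projects the samples onto their leading principal component. It is an assumption-laden example, not the RTG similarity report: the function names `kmer_profile` and `leading_component_scores`, the choice of k = 2, and the toy reads are all invented here. A pure-Python power iteration on the sample-by-sample Gram matrix of the centered profiles stands in for a full singular value decomposition; its leading eigenvector, scaled by the square root of its eigenvalue, gives the first principal component scores up to sign.

```python
import math
import random
from itertools import product


def kmer_profile(reads, k):
    """Relative frequency of every ACGT k-mer across one sample's reads."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:          # skips k-mers containing N etc.
                counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[kmer] / total for kmer in kmers]


def leading_component_scores(matrix, iterations=200, seed=0):
    """Score of each row (sample) on the first principal component."""
    n, width = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(width)]
    centered = [[row[j] - means[j] for j in range(width)] for row in matrix]
    # Gram matrix of the centered samples (n x n).
    gram = [[sum(x * y for x, y in zip(r1, r2)) for r2 in centered]
            for r1 in centered]
    rng = random.Random(seed)           # fixed seed for repeatable output
    vec = [rng.random() - 0.5 for _ in range(n)]
    for _ in range(iterations):
        nxt = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:                 # all samples identical
            return [0.0] * n
        vec = [x / norm for x in nxt]
    eigenvalue = sum(v * sum(g * w for g, w in zip(row, vec))
                     for v, row in zip(vec, gram))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [v * scale for v in vec]


# Toy samples (hypothetical): two AT-rich and two GC-rich read sets.
samples = [
    ["ATATATTAAT", "TTAATATA"],
    ["AAATTTATAT", "TATAATTA"],
    ["GCGCGGCCGC", "CCGGCGCG"],
    ["GGGCCCGCGC", "CGCGGCCG"],
]
profiles = [kmer_profile(reads, k=2) for reads in samples]
scores = leading_component_scores(profiles)
print(scores)
```

In this toy input the two AT-rich samples receive scores of one sign and the two GC-rich samples the other, which is the kind of separation a composition-based comparison relies on. For real datasets a numerical library's SVD routine would replace the power iteration and return several components at once.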