Application Note: Metagenomics Analysis
Metagenomics Sequence Analysis
Real Time Genomics has created a comprehensive analysis platform that delivers fast and
accurate compositional analysis of microbial communities.
Analysis Highlights
• Highly Accurate: Delivers precise
abundance estimates and more accurate
reporting of translated nucleotide homology
• Fast: Orders of magnitude faster at similar
sensitivity thresholds
• Easy-to-Use: Seamlessly integrates into
existing workflows
• Extensible: Works with Illumina, Ion Torrent,
and 454 data; handles single sample or
community-scale datasets
Introduction
Research and clinical studies are leveraging
the power of next-generation sequencing for
evaluating metagenomic data across
microbial habitats from large populations.
Whole Genome Sequencing (WGS)
provides unprecedented opportunities for
discovering novel organisms and evolving
pathogens, for discerning organismal
composition at the species and strain level,
as well as adding a new dimension with
functional data. The analysis of gene
content in microbiomes has revealed the
potential of WGS approaches for exploring
metabolic processing potential and pathway
representation in health and disease. Using
Real Time Genomics’ unique metagenomics
pipeline, researchers are able to leverage
WGS data to assess the genetic content of
their samples and describe the known
metabolic pathways for discovery of novel
genes and associated variants.
Understanding the metabolic capabilities of
microbiomes will provide the basis for new
diagnostics and novel therapeutics and help
to uncover aspects of drug metabolism that
are affected directly and indirectly by gut
microbiota. For these breakthroughs, Real
Time Genomics provides fast and highly
accurate mapping and species abundance
capabilities that allow for the analysis and
comparison of hundreds of WGS samples in
a robust and quantitative manner.
Microbial Community Structure
Real Time Genomics’ metagenomic pipeline
was developed to estimate the abundance
or frequency of a particular genome
(typically a bacterial species) in a complex
metagenomic sample. The calculations are
performed on standard SAM files after
reads have been mapped to a reference
genome set containing thousands of
genomes, many of which are highly related
at the nucleotide level.
Figure 1: Bacterial Community Composition
Fractional representation of the top 20 species identified in a
Human Stool Microbiome.
Species composition is calculated in two
ways. First, an estimate is derived for the
proportion of the organisms in the sample.
Second, we produce an estimate of the
species’ fractional abundance in the DNA of
the sample by normalizing according to the
genome length of each species (fraction-individuals). Breadth of coverage is
calculated by taking the sum over the total
number of bases in the species’ genome
where at least 1 mapped read is present
and dividing by total reference length. For
depth of coverage, all mapped bases are
divided by total reference length. We handle
ambiguous reads by weighting them by their number
of occurrences in the SAM records across
the reference genomes, in effect multiplying
by 1/n for a read that occurs n times. The result is a highly
accurate and comprehensive estimate of
microbial community composition.
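The calculations described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Real Time Genomics' implementation. It assumes the SAM records have already been reduced to hypothetical (read_id, genome, start, length) tuples, and it assumes the 1/n weight is applied to mapped bases when computing depth, which the text does not state. The names estimate_abundance, base_share and length_normalized_share are invented for this example; how the two shares correspond to the two reported abundance estimates is likewise an assumption.

```python
from collections import defaultdict


def _shares(values):
    """Scale a dict of non-negative numbers so that it sums to 1."""
    total = sum(values.values())
    return {k: (v / total if total else 0.0) for k, v in values.items()}


def estimate_abundance(alignments, genome_lengths):
    """Summarize mapped reads per reference genome.

    alignments: iterable of (read_id, genome, start, length) tuples, one per
        alignment record, with a 0-based start (hypothetical simplified input).
    genome_lengths: dict mapping genome name to its length in bases.
    """
    alignments = list(alignments)

    # n = number of alignment records in which each read occurs.
    occurrences = defaultdict(int)
    for read_id, _genome, _start, _length in alignments:
        occurrences[read_id] += 1

    weighted_bases = {g: 0.0 for g in genome_lengths}
    covered = {g: set() for g in genome_lengths}
    for read_id, genome, start, length in alignments:
        # An ambiguous read contributes 1/n of its bases to each genome.
        weight = 1.0 / occurrences[read_id]
        weighted_bases[genome] += weight * length
        # Record positions with at least one mapped read (fine for a toy
        # example; real genomes would need an interval-based structure).
        end = min(start + length, genome_lengths[genome])
        covered[genome].update(range(start, end))

    depth = {g: weighted_bases[g] / genome_lengths[g] for g in genome_lengths}
    breadth = {g: len(covered[g]) / genome_lengths[g] for g in genome_lengths}
    base_share = _shares(weighted_bases)        # share of mapped DNA
    length_normalized_share = _shares(depth)    # share after dividing by genome length

    return {
        g: {
            "breadth": breadth[g],
            "depth": depth[g],
            "base_share": base_share[g],
            "length_normalized_share": length_normalized_share[g],
        }
        for g in genome_lengths
    }


if __name__ == "__main__":
    demo = estimate_abundance(
        [("r1", "A", 0, 10), ("r2", "A", 5, 10), ("r2", "B", 0, 10), ("r3", "B", 50, 10)],
        {"A": 100, "B": 200},
    )
    for genome, stats in demo.items():
        print(genome, stats)
```

In the demo, read r2 maps to both genomes, so each genome receives half of its bases. Both genomes then hold the same share of mapped bases, but the shorter genome A receives the larger length-normalized share.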
Shotgun Metagenomics
The ability to exploit the entire genomic
content of a microbial environment is a
distinguishing characteristic of a WGS-based
approach as compared to the single-gene 16S rRNA survey. Reference
databases with sufficiently broad
phylogenetic distributions of known
organisms can be interrogated to gain
species or strain-level resolution versus
typical genera-level abundance estimates
from 16S data.
One of the biggest obstacles for the
quantitative analysis of WGS involves the
utilization of reads that map to multiple
related genomes. Previous approaches
avoid representing ambiguously mapped
reads in calculations, and in doing so
significantly underestimate coverage of a
proportion of the species in the
metagenome. The result of Real Time
Genomics’ WGS-based approach is higher
species specificity both for highly abundant
and low abundance species within the
metagenome, as described in Table 1.
Table 1: Specificity of RTG’s Metagenomics Results
Table 1 presents results of Real Time Genomics' Metagenomics pipeline run on a synthetic sample containing ten genomes. The sample had
1,000,000 Illumina reads, 100bp PE, with 1% introduced error (90% SNP/10% indel). Results show species abundance was reported very close to
actual over a wide dynamic range and with highly similar strains. The results show more accurate reporting than mapping-based approaches.
Community Metabolic Function
As part of the end-to-end metagenomics
product, Real Time Genomics allows for the
identification and characterization of the
metabolic functions of microbial
communities. In humans, normal and
pathogenic states of dynamic habitats such
as the gut or oral cavity may be a
consequence of altered metabolic
processes. In taking a metagenomics
approach to metabolic profiling, the most
important analytical component is
accurately identifying homologous matches
to translated short reads in a protein
database. Real Time Genomics takes short
read nucleotide sequence data and
performs robust similarity searches of the
six-frame translations against a protein
database. Using optimized data structures,
the pipeline builds a set of indexes from the
translated reads and scans each database
subject for matches according to user-specified sensitivity settings. Index sets can
be constructed by varying the number and
placement of n-mers across each translated
query, with the result being a highly
sensitive and flexible homology search
engine.
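The general idea of indexing translated reads and scanning protein subjects against that index can be illustrated with a small Python sketch. It is not the Real Time Genomics search engine: it uses the standard genetic code, indexes every overlapping k-mer of each translated frame rather than a tuned set of n-mer placements, and only counts exact k-mer hits. The function names (six_frame_translations, build_index, scan_subject) and the choice of k are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_translations(read):
    """Translate a nucleotide read in all six reading frames.

    Returns a dict keyed by (strand, offset); codons containing characters
    other than A, C, G, T are translated as 'X'.
    """
    read = read.upper()
    reverse = read.translate(COMPLEMENT)[::-1]
    frames = {}
    for strand, seq in (("+", read), ("-", reverse)):
        for offset in range(3):
            codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
            frames[(strand, offset)] = "".join(CODON_TABLE.get(c, "X") for c in codons)
    return frames


def build_index(reads, k):
    """Map each amino-acid k-mer to the (read_id, frame) pairs containing it."""
    index = defaultdict(list)
    for read_id, seq in reads.items():
        for frame, peptide in six_frame_translations(seq).items():
            for i in range(len(peptide) - k + 1):
                index[peptide[i:i + k]].append((read_id, frame))
    return index


def scan_subject(index, protein, k):
    """Count k-mer hits between one protein subject and the indexed reads."""
    hits = defaultdict(int)
    for i in range(len(protein) - k + 1):
        for key in index.get(protein[i:i + k], ()):
            hits[key] += 1
    return dict(hits)


if __name__ == "__main__":
    idx = build_index({"r1": "ATGGCCAAAGGT"}, k=3)
    print(scan_subject(idx, "XXMAKGXX", k=3))
```

A production search would follow candidate hits with alignment and scoring. In this sketch, a larger k makes the index more selective and a smaller k makes it more sensitive, which loosely mirrors the sensitivity settings described in the text.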
Table 2. Accuracy of Homology Detection vs. BLASTX
Percent homology corresponds to the fraction of the sequenced
read that matches a corresponding translated short read in
the NCBI nr database. Results as compared with BLASTX.
However, a primary trade-off of high
sensitivity when using existing homology
search techniques is a steep cost on
processing time. To assess the speed of
the Real Time Genomics pipeline, 10 million
100bp Illumina reads were searched against
the NCBI nr protein database. Table 3
shows the processing time as compared to
NCBI BLASTX using similar multicore
threaded processing, with Real Time
Genomics several orders of magnitude
faster.
Table 3. Speed of Translated Protein Search vs. BLASTX
Processing Time using multicore threaded processing on 10M,
100bp Illumina reads against the NCBI nr database. Speed
factor as compared to BLASTX 2.2.26 on similar multicore
threaded configuration.
Community Relationships
Following the accurate detection of
translated sequence in a metagenomics
sample, Real Time Genomics identifies
genes and metabolic pathways present and
then produces estimates of gene and
pathway abundances within the community.
Thus WGS data from metagenomes can be
used to infer relationships between multiple
communities or analyse dynamics within a
single community, including differential
abundance across multiple samples.
Most common are techniques that employ
either a distance matrix method or
parsimony, Bayesian or likelihood methods
to perform phylogeny reconstruction.
Alternatively, community analysis based
solely on sequence composition (as
opposed to pairwise sequence alignment)
can be attempted to provide an absolute
view of community relatedness.
Real Time Genomics has developed a
sequence-based community comparison
protocol that applies Principal
Component Analysis methods to analyse
the computed kmer-based data matrix
across multiple samples. An example from
the Human Microbiome Project provided by
Washington University, St. Louis is shown in
Figure 2.
Figure 2. Sequence-based Clustering with Real Time Genomics’ Similarity Report
Plot showing 3D visualization of 628 HMP samples taken from oral (buccal mucosa, supragingival plaque and tongue
dorsum), nasal (anterior nares), gut (stool), and vaginal (posterior fornix) body habitats (see Huttenhower, et al. Nature 2012). The
similarity report produces a kmer-based matrix which is then analyzed by a singular value decomposition technique.
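A composition-based comparison of this kind can be sketched in plain Python. The sketch is an assumption-laden illustration, not the similarity report itself: it assumes the kmer-based matrix holds per-sample k-mer frequencies, and instead of a full singular value decomposition it uses power iteration on the centered sample-by-sample Gram matrix, which recovers only the first principal component (the same scores a singular value decomposition would give for that component, up to sign). The names kmer_profile and first_component_scores are hypothetical.

```python
import math
from collections import Counter


def kmer_profile(sequences, k):
    """Relative k-mer frequencies over all reads of one sample."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}


def first_component_scores(profiles, iterations=200):
    """Project samples onto the first principal component of their profiles."""
    kmers = sorted(set().union(*profiles))
    matrix = [[p.get(km, 0.0) for km in kmers] for p in profiles]
    n = len(matrix)

    # Center each k-mer column so that the component describes variation
    # between samples rather than the average composition.
    means = [sum(col) / n for col in zip(*matrix)]
    centered = [[v - m for v, m in zip(row, means)] for row in matrix]

    # Gram matrix of the centered samples (n x n).
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered] for r1 in centered]

    # Power iteration from a fixed, non-constant start vector.
    vec = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        nxt = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            return [0.0] * n
        vec = [x / norm for x in nxt]

    # Rayleigh quotient gives the leading eigenvalue (squared singular value).
    eigenvalue = sum(v * sum(g * w for g, w in zip(row, vec)) for v, row in zip(vec, gram))
    return [math.sqrt(max(eigenvalue, 0.0)) * v for v in vec]


if __name__ == "__main__":
    samples = [["ATATATAT"], ["ATATATTA"], ["GCGCGCGC"], ["GCGCGCCG"]]
    profiles = [kmer_profile(reads, k=2) for reads in samples]
    print(first_component_scores(profiles))
```

In the toy input, the two AT-rich samples receive scores of one sign and the two GC-rich samples scores of the opposite sign, so they separate along the first component. A three-dimensional plot such as Figure 2 would need the first three components, which a numerical library's singular value decomposition routine provides directly.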
Real Time Genomics, Inc. • 999 Bayhill Drive, Suite 101, San Bruno, CA 94066 USA • tel 1.415.441.2466 •
realtimegenomics.com
FOR RESEARCH USE ONLY
© 2012 Real Time Genomics, Inc. All rights reserved. Real Time Genomics is a trademark of Real Time
Genomics, Inc. All other brands and names contained herein are the property of their respective
owners. Current as of 22 October, 2012.