An ORFome Assembly Approach to Metagenomics Sequence Analysis

Computational Approaches to
Metagenomic Sequences Analysis
Yuzhen Ye
School of Informatics
Indiana University, Bloomington
[email protected]
Metagenomics
• The term “metagenomics” was first used in 1998 (Handelsman et al. Chemistry & Biology 1998, 5:R245-R249)
• A methodology that applies genome sequencing to the culture-independent analysis of complex and diverse (“meta”) environmental populations of microbes
• Metagenomics projects: Global Ocean Sampling (GOS), Acid Mine Drainage (AMD), Human Microbiome Project, etc.
• Getting broader! Functional metagenomics
The Acid Mine Drainage (AMD) project
Biofilms growing on the surface of flowing AMD in the five-way region of the Richmond mine at Iron Mountain, California; sampled in 2000
Acid is produced by oxidation of sulfide minerals that are exposed to air as a result of mining activity
[Figure: an acid mine drainage site]
Ref: Tyson G et al. Nature 2004, 428:37–43
DNA Sequencing in AMD
• A small insert plasmid library (average insert size 3.2 kb)
• Shotgun sequencing yielded 72.6 million bp, averaging 737 bp per read
Reads could come from different individuals, different strains of the same species, and different species
Human microbiome projects
• To characterize the human microbiome (the totality of microbes living on and within the human body) and its role in health and disease
• An often-asked question: is there a core human microbiome?
• NIH HMP website: http://nihroadmap.nih.gov/hmp/
rRNA-based or metagenomic
• Small-subunit ribosomal RNA (rRNA) studies for microbial community profiling (involving PCR of 16S rRNA genes)
→ 16S rRNA for Bacteria and Archaea; 18S rRNA for eukaryotes
→ rRNAs are used as phylogenetic markers to define which lineages are present in a community
→ Barcoded pyrosequencing allows a deeper view of a microbial community
• Metagenomic studies for functional profiling: community DNA is subjected to shotgun sequencing
→ Commonly used sequencing techniques: 454 pyrosequencing and Solexa/Illumina
→ Metagenomic studies are usually more expensive than rRNA-based studies, but they are essential for understanding the functions encoded in a metagenome (collection of genomes)
Ref: Genome Res. 2009. 19: 1141-1152
Computational problems and challenges
• Problems
→ Assembly
→ Identification of community species: phylotyping versus binning
→ Function annotation
→ Comparative analysis, e.g., UniFrac and SONS for comparing microbial communities
• Challenges
→ Scale: development of computational tools that can handle input on a “metagenomic” scale
→ Complexity: a metagenome contains genetic elements from various genomes (could be huge)
Genome- or gene-centric approaches
• Genome-centric analyses
→ Similar to traditional genome projects
→ Worked in the AMD project, a lucky pick with low species diversity: reconstruction of near-complete genomes of Leptospirillum group II and Ferroplasma type II, and partial recovery of 3 other genomes; JAZZ was used
→ Do not work well on datasets from samples with high species diversity and/or low sequencing coverage
• Gene-centric analyses
→ Environmental gene tags (EGTs): short DNA sequences that contain fragments of functional genes
→ EGT “fingerprints” can be compared across multiple sites or habitats, or over time in the same environment
→ Overrepresented or underrepresented EGTs can provide insights into unique metabolic capabilities associated with a particular environment even if it is not possible to assign a particular EGT to a particular organism
Ref: Eisen JA. Environmental shotgun sequencing: its potential and challenges for studying the hidden world of microbes. PLoS Biol. 2007, 5(3):e82
Gene-centric approaches need to be improved
• Partial (fragmental) genes/proteins
→ Application of next-generation sequencing technologies (e.g., many of the metagenomic projects applied Roche/454 and Solexa/Illumina sequencing technologies for WGS, producing even shorter reads)
• Difficulties in analyzing fragmental genes
→ It is difficult to correctly predict partial genes from DNA fragments
→ And gene length does matter (Ref: Wommack et al. Appl Environ Microbiol. 2008 Mar;74(5):1453-63)
What’s covered in this talk
• MetaORFA: ORFome assembler
• MinPath: a parsimony approach to biological pathway reconstruction
MetaORFA: ORFome assembler
• ORFome: all the ORFs (open reading frames) found in a given set of DNA sequences
• ORFome assembly: assemble ORFs into longer peptides (so that similarity search using assembled peptides may achieve higher sensitivity and specificity)
• References
→ Accepted paper in CSB 2008; JBCB. 2009, 7(3):455-71
[Figure: an assembled peptide built from 18 ORFs]
ORF assembly versus genome assembly
Whole genome assembly: assembly at DNA level, then ORF prediction
ORFome assembly: ORF prediction, then assembly at protein level
Why assemble at the protein level?
• When whole genome assembly is difficult (short reads, low coverage, high species complexity, and repeat-like DNA sequences, i.e., shared DNA elements among different species)
• Metagenomic DNA sequences could be from different individuals (so there will be mutations that further complicate DNA-level assembly)
• Many mutations (hopefully) are synonymous (do not change the amino acid)
• We can assemble proteins first!
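The point about synonymous mutations can be illustrated with a small sketch. The two reads below are made-up sequences (not taken from any dataset in this talk); they differ at three third-codon positions, yet translate to the same peptide under the standard genetic code, so they would agree at the protein level while mismatching at the DNA level.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA string in frame 1; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


# Hypothetical reads from two individuals of the same population.
read_a = "ATGCTGTCTGAT"
read_b = "ATGCTCTCCGAC"

dna_mismatches = sum(a != b for a, b in zip(read_a, read_b))
peptide_a = translate(read_a)
peptide_b = translate(read_b)

print(dna_mismatches, peptide_a, peptide_b)
```

A DNA-level assembler has to tolerate or resolve the three mismatches, whereas a peptide-level assembler sees two identical fragments.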
ORF identification
• The typical strategy “start from a start codon and stop at a stop codon” won’t work (because of the fragmental nature of the metagenomic sequences)
• We use all potential ones
[Figure: potential ORFs in the six reading frames (1, 2, 3, -1, -2, -3) of a read]
• And we masked the DNA sequences prior to ORF identification using MDUST and Tandem Repeats Finder
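A minimal sketch of the "use all potential ORFs" idea, assuming a potential ORF is any stop-free stretch in one of the six reading frames that is at least `min_len` residues long. The function name `six_frame_orfs`, the `min_len` parameter, and the toy read are illustrative assumptions, not MetaORFA's actual code, and the masking step is not modelled here.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate a DNA string in frame 1; unknown codons (e.g. with N) become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def six_frame_orfs(read, min_len=2):
    """Return {frame: [peptides]} for frames 1, 2, 3, -1, -2, -3.

    No start codon is required: every stop-free stretch counts,
    because a read may begin or end in the middle of a gene.
    """
    read = read.upper()
    revcomp = read.translate(COMPLEMENT)[::-1]
    orfs = {}
    for strand, seq in ((1, read), (-1, revcomp)):
        for offset in range(3):
            frame = strand * (offset + 1)
            pieces = translate(seq[offset:]).split("*")
            orfs[frame] = [p for p in pieces if len(p) >= min_len]
    return orfs


orfs = six_frame_orfs("ATGAAATAGGGCCCC")  # toy read, not real data
for frame in (1, 2, 3, -1, -2, -3):
    print(frame, orfs[frame])
```

Raising `min_len` is the simplest way to discard short, probably spurious ORFs before assembly.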
ORF assembly algorithm: Eulerian path approach
Fragments = {ATG, TGC, GCG, CGT, GTG, GCT, CTG, GCA}
Vertices correspond to (l–1)-mers: {AT, TG, GC, CG, GT, CT, CA}
Edges correspond to l-mers from fragments (e.g., TGC; we used l = 10)
Assembled sequences: paths that visit every EDGE, e.g., ATGCGTGCTGCA and ATGCTGCGTGCA
[Figure: De Bruijn graph with vertices AT, TG, GC, CG, GT, CT, CA]
Repetitive sequences are represented by a single edge (TGC)
Ref: Pevzner, Tang and Waterman (2001), An Eulerian path approach to DNA fragment assembly. PNAS 98:9748-9753
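The toy example above can be sketched in a few lines. This is an illustrative implementation of the general technique (a De Bruijn graph plus Hierholzer's algorithm for an Eulerian path), not the MetaORFA code. It uses l = 3 on DNA rather than l = 10 on peptides, and it keeps one edge per l-mer occurrence, so the repeated TGC appears three times and an Eulerian path exists. The names `kmers` and `eulerian_assembly` are assumptions made for this sketch.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """All overlapping k-mers of seq, in order, with repeats."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def eulerian_assembly(kmer_list):
    """Spell a sequence that uses every k-mer (edge) exactly once."""
    graph = defaultdict(list)
    indeg, outdeg = Counter(), Counter()
    for km in kmer_list:
        u, v = km[:-1], km[1:]  # vertices are (k-1)-mers
        graph[u].append(v)
        outdeg[u] += 1
        indeg[v] += 1
    # Start at the vertex with one more outgoing than incoming edge, if any.
    nodes = set(indeg) | set(outdeg)
    start = next((n for n in nodes if outdeg[n] - indeg[n] == 1), None)
    if start is None:
        start = next(iter(graph))
    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])


source = "ATGCGTGCTGCA"
assembled = eulerian_assembly(kmers(source, 3))
print(assembled)
```

Because TGC is traversed three times, more than one Eulerian path exists; the sketch returns one of the two sequences listed on the slide, and both have the same 3-mer content.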
ORFome assembly reports family graph
Protein I: MLSDFPVSTLIARCVLNSTY
Protein II: MRSNFPVSTVFAKTTLNSTY
Sequencing and ORF identification give peptides from reads:
Protein I: MLSDFPVS, FPVSTLI, TLIARCV, RCVLNSTY
Protein II: MRSNFP, PVSTV, STVFAKTTL, TTLNST, LNSTY
De Bruijn graph construction gives the protein family graph:
[Figure: MLSD and MRSN both lead into the shared segment FPVST, which branches into LIARCV and VFAKTT; both branches rejoin at LNSTY]
Currently the sequences corresponding to the edges are used in the search!
MetaORFA
• Input: predicted ORFs
• Output: assembled peptides
• MetaORFA runs fast
→ but the downstream analysis of the ORFs/assembled peptides (similarity search and family annotation) may be time consuming
Test of ORF length cutoff
• Short ORFs may not be real
• Too many short ORFs slow down the assembly
Test on real metagenomic datasets
• Four datasets, each containing metagenomic sequences from a major oceanic region community (Sargasso Sea, Coast of British Columbia, Gulf of Mexico, and Arctic Ocean), referred to as the Ocean Virus datasets.
• The reads were acquired with 454 sequencing machines and are typically very short.
• All the metagenomic sequences were downloaded from the CAMERA website (http://camera.calit2.net/)
ORFome assembly results
Table: Statistics of the ORFs and ORFome assembly results for Ocean
Virus datasets
A-Pep: assembled peptide
More reads hit similar sequences
total reads=688590
searched against IMG database (the integrated microbial genomes system) version 2.4
More functional categories are identified
Table: Summary of the family annotation of assembled peptides versus
unassembled reads for the four ocean virus datasets
PTHR22748, AP endonuclease (E-value = 2.5e-12);
PTHR11527 (subfamily SF15), heat shock protein 16 (E-value = 1.5e-07);
PTHR21535 (subfamily SF1), magnesium and cobalt transport protein (E-value = 8e-09);
PTHR17630 (subfamily SF20), carboxymethylenebutenolidase (E-value = 4.7e-08)
PANTHER family classification was used for family (subfamily) annotation
PANTHER HMM library was downloaded from ftp://ftp.pantherdb.org and associated HMM
searching tool (pantherScore.pl) was used
Longer peptides carry more specific information
• For the Arctic Ocean dataset
• Assembled peptides add 113 subfamilies to the annotation obtained using unassembled short ORFs (1524 subfamilies)
• An example with “mis-annotation” at subfamily level
Assembled peptide
PTHR11935:SF11
(Glyoxalase II)
SCUMS_READ_Arctic2924400-r2
SCUMS_READ_Arctic2876600-r1
SCUMS_READ_Arctic2285121-f2
SCUMS_READ_Arctic2455735-f0
SCUMS_READ_Arctic2538177-f0
SCUMS_READ_Arctic2813824-r18
PTHR11935:SF10
(Beta lactamase domain)
Assembled peptides may involve synonymous mutations
New/future developments
• Improve assembly algorithm (A-Bruijn graph algorithm)
• Improve ORF identification
→ Current prediction may include many false ORFs
• Systematically study DNA polymorphism in metagenomic sequences
• Apply ORF assembly results to improve metagenomic sequence annotation
• Utilize ORF assembly results to assist DNA-level assembly
MinPath: a parsimony approach to
biological pathway reconstruction
for genomes and metagenomes
Biological pathways are key to the
understanding of biological functions
Smaller units (e.g.,
KEGG pathways) are
extremely important
for the understanding
of biological functions
Genome of an endosymbiont coupling N2 fixation to
cellulolysis within protist cells in termite gut
Image from: http://www.sciencemag.org/cgi/content/full/322/5904/1108/DC1
Ref: Science 322(5904): 1108 – 1109, 2008
MinPath: a parsimony approach to biological pathway reconstruction
The naïve mapping approach collects all pathways with one or more associated families annotated
MinPath keeps only the minimal set of pathways that explain all the functions annotated
[Figure: mapping between annotated families f1 to f6 and pathways p1 to p4, shown for the naïve mapping approach and for MinPath]
Reference: PLoS Computational Biology (to appear)
Why MinPath
• Pathway reconstruction based on some new high-throughput techniques (e.g., proteomics and metagenomics) must draw conclusions from explicitly incomplete information (metagenomes, unlike genomes, are most likely incomplete). There will be missing enzymes in reconstructed pathways: are they really missing, or simply not sampled?
• Existing methods of pathway reconstruction or inference (e.g., the naïve mapping approach shown in the previous slide) may over-estimate the number of pathways because of redundancy in the protein-pathway mapping (e.g., different pathways may share the same biological functions, and some proteins carry out multiple biological functions).
• A more conservative pathway reconstruction (such as the minimal pathway formulation used in MinPath) may actually work better
Minimal pathway reconstruction problem solved by integer programming algorithm
The goal is to find the minimum number of pathways that can explain all the functions carried by at least one protein from a dataset
$$\min \sum_{j=1}^{p} P_j \quad \text{s.t.} \quad \sum_{j=1}^{p} M_{ij} P_j \ge 1, \ \forall i \in [1, n]$$
M is the mapping of protein functions to pathways, where M_ij = 1 if function i is involved in pathway j, and 0 otherwise (note that one function may map to multiple pathways or subsystems).
Denote whether pathway j is selected in the final list as P_j, with P_j = 1 if selected and P_j = 0 otherwise.
The set of pathways with P_j = 1 composes the minimal set of pathways that can explain all the functions that are annotated for a dataset.
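To make the formulation concrete, here is a small sketch that solves the same minimisation on a toy mapping. It swaps the integer programming solver for an exhaustive search over pathway subsets, which is only practical for a handful of pathways; the function names and the f1 to f6 / p1 to p4 mapping are hypothetical and are not taken from the MinPath paper or software.

```python
from itertools import combinations


def naive_pathways(function_to_pathways):
    """Naive mapping: every pathway touched by at least one annotated function."""
    return {p for pathways in function_to_pathways.values() for p in pathways}


def minimal_pathways(function_to_pathways):
    """Smallest pathway set covering every function (sum_j M_ij * P_j >= 1).

    Exhaustive search by increasing subset size; a stand-in for the
    integer program, exponential in the number of pathways.
    """
    candidates = sorted(naive_pathways(function_to_pathways))
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            if all(chosen & set(ps) for ps in function_to_pathways.values()):
                return chosen
    return None  # some function maps to no pathway at all


# Hypothetical annotation: M_ij = 1 where pathway j is listed for function i.
mapping = {
    "f1": ["p1"],
    "f2": ["p1", "p2"],
    "f3": ["p1", "p3"],
    "f4": ["p3", "p4"],
    "f5": ["p4"],
    "f6": ["p4"],
}

naive = naive_pathways(mapping)
minimal = minimal_pathways(mapping)
print(sorted(naive), sorted(minimal))
```

In this toy case p2 and p3 are supported only by functions that p1 and p4 already explain, so the parsimony criterion drops them while the naïve mapping keeps all four.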
Protein function and function annotation
• K numbers or FIG families are used for functions, depending on which pathway database is used
→ K numbers (or KO families) for KEGG pathways
→ FIG families for SEED subsystems
• Function annotations are based on similarity search
→ The FIG family release comes with a script for FIG family annotation (the FIG annotations used in the MinPath paper were downloaded from the MG-RAST server)
→ KO families were downloaded from the KEGG server, or predicted based on the best BLAST hits with an E-value cutoff of 1e-5
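A best-hit annotation step of this kind could look like the sketch below. It assumes BLAST tabular output (`-outfmt 6`, where columns 11 and 12 are the E-value and bit score) and a lookup table from database gene identifiers to KO families; the function name `best_hit_ko`, the `gene_to_ko` table, and all identifiers in the example are hypothetical placeholders, and choosing the best hit by bit score is an assumption of this sketch.

```python
def best_hit_ko(blast_lines, gene_to_ko, evalue_cutoff=1e-5):
    """Assign each query the KO family of its best-scoring significant hit.

    blast_lines: iterable of tab-separated lines in BLAST -outfmt 6 order
    (qseqid, sseqid, pident, length, mismatch, gapopen,
     qstart, qend, sstart, send, evalue, bitscore).
    """
    best = {}  # query id -> (bit score, subject id)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > evalue_cutoff:
            continue  # hit is not significant enough
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, subject)
    # Queries whose best hit has no KO assignment stay unannotated.
    return {q: gene_to_ko[s] for q, (_, s) in best.items() if s in gene_to_ko}


# Made-up hits and made-up gene-to-KO table, for illustration only.
hits = [
    "read1\tgeneA\t90.0\t50\t5\t0\t1\t50\t1\t50\t1e-20\t100",
    "read1\tgeneB\t80.0\t50\t10\t0\t1\t50\t1\t50\t1e-10\t60",
    "read2\tgeneC\t40.0\t30\t18\t0\t1\t30\t1\t30\t0.5\t20",
]
gene_to_ko = {"geneA": "KO_A", "geneB": "KO_B", "geneC": "KO_C"}

annotation = best_hit_ko(hits, gene_to_ko)
print(annotation)
```

Here read1 takes the KO of its stronger hit and read2 stays unannotated because its only hit fails the 1e-5 cutoff.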
Pathway reconstructions of genomes by MinPath
MinPath gives an estimation of functional diversity of
various genomes (measured by the number of pathways
constructed) that is closer to the curated KEGG database
as compared to the naïve mapping approach
Selected pathways eliminated by MinPath
(Human genome)
An example of pathway eliminated by MinPath:
the ascorbate and aldarate metabolism pathway
Only three enzymes are
annotated in the human
genome (highlighted in
green), none of which are
unique to this pathway.
Pathway reconstructions by MinPath for
metagenomes
In a single sequence set from the coral biome (4440319.3.dna.fa), the naïve
mapping approach identified 224 KEGG pathways, whereas MinPath identified
only 143 KEGG pathways. The pathways eliminated by MinPath include the
inositol metabolism pathway, the androgen and estrogen metabolism pathway,
and the caffeine metabolism pathway. The metagenomes used here are from
Dinsdale EA et al. 2008, Nature 452: 629-632
Potential applications of MinPath
• To improve function annotation of metagenomic sequences with more carefully constructed biological pathways
• To give a more reliable estimation of the functional ability of a community, which is essential for understanding a community, and for comparing the functional diversity of different communities
Acknowledgements
• Yu-Wei Wu, Tom Doak, Mina Rho, and Quan Zhang
• Colleagues at the School of Informatics and Computing, Indiana University, Bloomington
→ Drs. Haixu Tang, Sun Kim, Mehmet Dalkilic, Matthew Hahn and Predrag Radivojac
• NIH grant 1R01HG004908-01 (Fragment Assembly and Metabolic/Species Diversity Analysis for HMP)
• MetaCyt Initiative at Indiana University, funded by Lilly Endowment
The game has just begun