An ORFome Assembly Approach to Metagenomics Sequence Analysis

Computational Approaches to Metagenomic Sequences Analysis
Yuzhen Ye
School of Informatics, Indiana University, Bloomington
[email protected]

Metagenomics
• The term “metagenomics” was first used in 1998 (Handelsman et al. Chemistry & Biology 1998, 5:R245-R249)
• A methodology that applies genome sequencing to the culture-independent analysis of complex and diverse (“meta”) environmental populations of microbes
• Metagenomics projects: Global Ocean Sampling (GOS), Acid Mine Drainage (AMD), Human Microbiome Project, etc.
• Getting broader! Functional metagenomics

The Acid Mine Drainage (AMD) project
• Biofilms growing on the surface of flowing AMD in the five-way region of the Richmond mine at Iron Mountain, California; sampled in 2000
• Acid is produced by oxidation of sulfide minerals that are exposed to air as a result of mining activity
[Figure: an acid mine drainage site]
Ref: Tyson G et al. Nature 2004, 428:37–43

DNA sequencing in AMD
• A small-insert plasmid library (average insert size 3.2 kb)
• Shotgun sequencing yielded 72.6 million bp, averaging 737 bp per read
• Reads could come from different individuals, different strains of the same species, and different species

Human microbiome projects
• To characterize the human microbiome (the totality of microbes living on and within the human body) and its role in health and disease
• An often-asked question: is there a core human microbiome?
• NIH HMP website: http://nihroadmap.nih.gov/hmp/

rRNA-based or metagenomic
• Small-subunit ribosomal RNA (rRNA) studies for microbial community profiling (involving PCR of 16S rRNAs)
  → 16S rRNA for Bacteria and Archaea; 18S rRNA for eukaryotes
  → rRNAs are used as phylogenetic markers to define which lineages are present in a community
  → Barcoded pyrosequencing allows a deeper view of a microbial community
• Metagenomic studies for functional profiling: community DNA is subjected to shotgun sequencing
  → Often-used sequencing techniques: 454 pyrosequencing and Solexa/Illumina
  → Metagenomic studies are usually more expensive than rRNA-based studies, but they are essential for understanding the functions encoded in a metagenome (a collection of genomes)
Ref: Genome Res. 2009, 19:1141-1152

Computational problems and challenges
• Problems
  → Assembly
  → Identification of community species: phylotyping versus binning
  → Function annotation
  → Comparative analysis, e.g., UniFrac and SONS for comparing microbial communities
• Challenges
  → Scale: development of computational tools that can handle input on a “metagenomic” scale
  → Complexity: a metagenome contains genetic elements from various genomes (could be huge)

Genome- or gene-centric approaches
• Genome-centric analyses
  → Similar to traditional genome projects
  → Worked in the AMD project, a lucky pick with low species diversity: near-complete genomes of Leptospirillum group II and Ferroplasma type II were reconstructed, and 3 other genomes were partially recovered; JAZZ was used
  → Do not work well on datasets from samples with high species diversity and/or low sequencing coverage
• Gene-centric analyses
  → Environmental gene tags (EGTs): short DNA sequences that contain fragments of functional genes
  → EGT “fingerprints” can be compared across multiple sites or habitats, or over time in the same environment
  → Overrepresented or underrepresented EGTs can provide insights into unique metabolic capabilities associated with a particular environment even if it is not possible to assign a particular EGT to a particular environment
Ref: Eisen JA. Environmental shotgun sequencing: its potential and challenges for studying the hidden world of microbes. PLoS Biol. 2007, 5(3):e82

Gene-centric approaches need to be improved
• Partial (fragmental) genes/proteins
  → Application of next-generation sequencing technologies (e.g., many metagenomic projects applied Roche/454 and Solexa/Illumina sequencing technologies for WGS, producing even shorter reads)
• Difficulties in analyzing fragmental genes
  → It is difficult to correctly predict partial genes from DNA fragments
  → And gene length does matter (Ref: Wommack et al. Appl Environ Microbiol. 2008 Mar;74(5):1453-63)

What’s covered in this talk
• MetaORFA: ORFome assembler
• MinPath: a parsimony approach to biological pathway reconstruction

MetaORFA: ORFome assembler
• ORFome: all the ORFs (open reading frames) found in a given set of DNA sequences
• ORFome assembly: assemble ORFs into longer peptides (so that similarity search using assembled peptides may achieve higher sensitivity and specificity)
• References
  → Accepted paper in CSB 2008; JBCB. 2009, 7(3):455-71
[Figure: an assembled peptide built from 18 ORFs]

ORF assembly versus genome assembly
• Whole genome assembly: assembly at DNA level, then ORF prediction
• ORFome assembly: ORF prediction, then assembly at protein level

Why assemble at the protein level?
• When whole genome assembly is difficult (short reads, low coverage, high species complexity, and repeat-like DNA sequences, i.e., DNA elements shared among different species)
• Metagenomic DNA sequences could be from different individuals (so there will be mutations that further complicate DNA-level assembly)
• Many mutations (hopefully) are synonymous (they do not change the amino acid)
• We can assemble proteins first!
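The point about synonymous mutations can be illustrated with a minimal sketch. The two reads below are hypothetical sequences invented for this example, and `translate` is an illustrative helper, not part of MetaORFA; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA string in frame 0; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Hypothetical reads covering the same gene region in two individuals.
read_a = "ATGCTTTCTGAT"
read_b = "ATGCTCTCAGAC"  # differs from read_a at three third-codon positions

dna_mismatches = sum(a != b for a, b in zip(read_a, read_b))
print(dna_mismatches, translate(read_a), translate(read_b))
```

The two reads disagree at the DNA level but translate to the same peptide, so an overlap that a DNA-level assembler might reject is an exact match at the protein level.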
ORF identification
• The typical strategy “start from a start codon and stop at a stop codon” won’t work (because of the fragmental nature of the metagenomic sequences)
• We use all potential ones
[Figure: the six reading frames (1, 2, 3, -1, -2, -3) of a DNA fragment]
• And we masked the DNA sequences prior to the ORF identification using MDUST and Tandem Repeats Finder

ORF assembly algorithm: Eulerian path approach
• Fragments = {ATG, TGC, GCG, CGT, GTG, GCT, CTG, GCA}
• Vertices correspond to (l–1)-mers: {AT, TG, GC, CG, GT, CT, CA}
• Edges correspond to l-mers from fragments (e.g., TGC; we used l = 10)
• Assembled sequences: paths that visit every EDGE, e.g., ATGCGTGCTGCA and ATGCTGCGTGCA
• Repetitive sequences are represented by a single edge (TGC)
[Figure: de Bruijn graph on the vertices AT, TG, GC, CG, GT, CT, CA]
Ref: Pevzner, Tang and Waterman (2001), An Eulerian path approach to DNA fragment assembly. PNAS 98:9748-9753

ORFome assembly reports family graph
• Protein I: MLSDFPVSTLIARCVLNSTY
• Protein II: MRSNFPVSTVFAKTTLNSTY
• Sequencing & ORF identification give peptides from reads: MLSDFPVS, FPVSTLI, TLIARCV, RCVLNSTY, MLSNFP, PVSTV, STVFAKTTL, TTLNST, LNSTY
• De Bruijn graph construction gives the protein family graph
[Figure: protein family graph with segments MLSD, LIARCV, FPVST, MRSN, LNSTY, VFAKTT]
• Currently the sequences corresponding to the edges are used in the search!

MetaORFA
• Input: predicted ORFs
• Output: assembled peptides
• MetaORFA runs fast, but the downstream analysis (similarity search and family annotation) of the ORFs/assembled peptides may be time-consuming

Test of ORF length cutoff
• Short ORFs may not be real
• Too many short ORFs slow down the assembly

Test on real metagenomic datasets
• Four datasets, each containing metagenomic sequences of a major oceanic region community (the four regions are Sargasso Sea, Coast of British Columbia, Gulf of Mexico, and Arctic Ocean), referred to as the Ocean Virus datasets.
• The reads were acquired by a 454 sequencing machine, and they are typically very short.
• All the metagenomic sequences were downloaded from the CAMERA website (http://camera.calit2.net/)

ORFome assembly results
[Table: Statistics of the ORFs and ORFome assembly results for the Ocean Virus datasets. A-Pep: assembled peptide]

More reads hit similar sequences
[Figure: total reads = 688590; searched against the IMG database (the Integrated Microbial Genomes system), version 2.4]

More functional categories are identified
[Table: Summary of the family annotation of assembled peptides versus unassembled reads for the four Ocean Virus datasets]
• PTHR22748, AP endonuclease (E-value = 2.5e-12)
• PTHR11527 (subfamily SF15), heat shock protein 16 (E-value = 1.5e-07)
• PTHR21535 (subfamily SF1), magnesium and cobalt transport protein (E-value = 8e-09)
• PTHR17630 (subfamily SF20), carboxymethylenebutenolidase (E-value = 4.7e-08)
• PANTHER family classification was used for family (subfamily) annotation
• The PANTHER HMM library was downloaded from ftp://ftp.pantherdb.org, and the associated HMM searching tool (pantherScore.pl) was used

Longer peptides carry more specific information
• For the Arctic Ocean dataset, assembled peptides add 113 subfamilies to the annotation using unassembled short ORFs (1524 subfamilies)
• An example with “mis-annotation” at subfamily level
[Figure: assembled peptide, PTHR11935:SF11 (Glyoxalase II); reads SCUMS_READ_Arctic2924400-r2, SCUMS_READ_Arctic2876600-r1, SCUMS_READ_Arctic2285121-f2, SCUMS_READ_Arctic2455735-f0, SCUMS_READ_Arctic2538177-f0, SCUMS_READ_Arctic2813824-r18; PTHR11935:SF10 (Beta lactamase domain)]

Assembled peptide may involve synonymous mutations

New/future developments
• Improve assembly algorithm (A-Bruijn graph algorithm)
• Improve ORF identification
  → Current prediction may include many false ORFs
• Systematically study DNA polymorphism in metagenomic sequences
• Apply ORF assembly results to improve metagenomic sequence annotation
• Utilize ORF assembly results to assist DNA-level assembly

MinPath: a parsimony approach to biological pathway reconstruction
for genomes and metagenomes

Biological pathways are key to the understanding of biological functions
• Smaller units (e.g., KEGG pathways) are extremely important for the understanding of biological functions
[Figure: Genome of an endosymbiont coupling N2 fixation to cellulolysis within protist cells in termite gut. Image from: http://www.sciencemag.org/cgi/content/full/322/5904/1108/DC1]
Ref: Science 322(5904):1108-1109, 2008

MinPath: a parsimony approach to biological pathway reconstruction
• The naïve mapping approach collects all pathways with one or more associated families annotated
• MinPath keeps only the minimal set of pathways that explain all the functions annotated
[Figure: families f1 to f6 mapped to pathways p1 to p4, naïve mapping versus MinPath]
Reference: PLoS Computational Biology (to appear)

Why MinPath
• Pathway reconstruction based on some new high-throughput techniques (e.g., proteomics and metagenomics) must provide conclusions from explicitly incomplete information (metagenomes, unlike genomes, are most likely incomplete). There will be missing enzymes in reconstructed pathways: are they really missing, or simply not sampled?
• Existing methods of pathway reconstruction or inference (e.g., the naïve mapping approach) may overestimate the number of pathways because of redundancy in the protein-pathway mapping (e.g., different pathways may share the same biological functions, and some proteins carry out multiple biological functions).
• A more conservative pathway reconstruction (such as the minimal pathway formulation used in MinPath) may actually work better

Minimal pathway reconstruction problem solved by integer programming algorithm
The goal is to find the minimum number of pathways that can explain all the functions carried by at least one protein from a dataset:

\[
\min \sum_{j=1}^{p} P_j \quad \text{s.t.} \quad \sum_{j=1}^{p} M_{ij} P_j \ge 1, \quad \forall i \in [1, n]
\]

Here p is the number of candidate pathways and n is the number of annotated functions. Denote the mapping of protein functions to pathways as $M$, where $M_{ij} = 1$ if function i is involved in pathway j, and 0 otherwise (note that one function may map to multiple pathways or subsystems). Denote whether pathway j is selected in the final list as $P_j$, with $P_j = 1$ if selected and $P_j = 0$ otherwise. The set of pathways with $P_j = 1$ composes the minimal set of pathways that can explain all the functions that are annotated for a dataset.

Protein function and function annotation
• K numbers or fig families are used for functions, depending on which pathway database is used
  → K numbers (or KO families) for KEGG pathways
  → fig families for SEED subsystems
• Function annotations are based on similarity search
  → The fig family release comes with a script for fig family annotation (the fig annotations used in the MinPath paper were downloaded from the MG-RAST server)
  → KO families were downloaded from the KEGG server, or predicted based on the best BLAST hits with an E-value cutoff of 1e-5

Pathway reconstructions of genomes by MinPath
• MinPath gives an estimation of the functional diversity of various genomes (measured by the number of pathways constructed) that is closer to the curated KEGG database than the estimation given by the naïve mapping approach

Selected pathways eliminated by MinPath (Human genome)
• An example of a pathway eliminated by MinPath: the ascorbate and aldarate metabolism pathway
• Only three enzymes are annotated in the human genome (highlighted in green), none of which is unique to this pathway.

Pathway reconstructions by MinPath for metagenomes
• In a single sequence set from the coral biome (4440319.3.dna.fa), the naïve mapping approach identified 224 KEGG pathways, whereas MinPath identified only 143 KEGG pathways.
• The pathways eliminated by MinPath include the inositol metabolism pathway, the androgen and estrogen metabolism pathway, and the caffeine metabolism pathway.
The metagenomes used here are from Dinsdale EA et al.
2008, Nature 452:629-632

Potential applications of MinPath
• To improve function annotation of metagenomic sequences with more carefully constructed biological pathways
• To give a more reliable estimation of the functional ability of a community, which is essential for understanding a community and for comparing the functional diversity of different communities

Acknowledgements
• Yu-Wei Wu, Tom Doak, Mina Rho, and Quan Zhang
• Colleagues at the School of Informatics and Computing, Indiana University, Bloomington
  → Drs. Haixu Tang, Sun Kim, Mehmet Dalkilic, Matthew Hahn and Predrag Radivojac
• NIH grant 1R01HG004908-01 (Fragment Assembly and Metabolic/Species Diversity Analysis for HMP)
• MetaCyt Initiative at Indiana University, funded by Lilly Endowment

The game has just begun
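
Worked example for the Eulerian path slide: the toy fragment set can be checked with a short sketch. It builds the de Bruijn graph from the eight 3-mers and confirms that the two assembled sequences quoted on the slide contain exactly those fragments. The helper names `build_de_bruijn` and `lmers` are illustrative choices, not MetaORFA functions, and the sketch uses l = 3 as in the toy example rather than the l = 10 used in practice.

```python
from collections import defaultdict


def build_de_bruijn(fragments):
    """Map each (l-1)-mer vertex to the vertices reached through one l-mer edge."""
    graph = defaultdict(list)
    for frag in fragments:
        graph[frag[:-1]].append(frag[1:])  # edge: prefix -> suffix
    return dict(graph)


def lmers(sequence, l):
    """Return the set of l-mers occurring in a sequence."""
    return {sequence[i:i + l] for i in range(len(sequence) - l + 1)}


fragments = ["ATG", "TGC", "GCG", "CGT", "GTG", "GCT", "CTG", "GCA"]
graph = build_de_bruijn(fragments)
vertices = set(graph) | {v for targets in graph.values() for v in targets}
print(sorted(vertices))

# Each assembled sequence from the slide should use every fragment (edge).
for candidate in ("ATGCGTGCTGCA", "ATGCTGCGTGCA"):
    print(candidate, lmers(candidate, 3) == set(fragments))
```

Both candidate sequences pass through TGC three times, which is why that repeat appears as a single shared edge in the graph and why more than one assembly is consistent with the same fragments.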

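Worked example for the minimal pathway formulation: the sketch below contrasts naïve mapping with a minimal pathway set on a toy instance. The function-to-pathway mapping is hypothetical (it is not taken from the slides or from KEGG), and exhaustive search over pathway subsets stands in for the integer programming solver, which is only reasonable at this toy scale.

```python
from itertools import combinations


def minimal_pathways(function_to_pathways):
    """Return a smallest set of pathways that covers every annotated function.

    Subsets are tried from smallest to largest, so the first feasible
    subset is a minimum. Exponential time: for toy inputs only.
    """
    pathways = sorted({p for ps in function_to_pathways.values() for p in ps})
    for size in range(len(pathways) + 1):
        for chosen in combinations(pathways, size):
            selected = set(chosen)
            # Constraint: every function i has at least one selected pathway j.
            if all(selected & set(ps) for ps in function_to_pathways.values()):
                return selected
    return set()  # reached only if some function maps to no pathway


# Hypothetical mapping of annotated functions to the pathways they occur in.
mapping = {
    "f1": ["p1"],
    "f2": ["p1", "p2"],
    "f3": ["p1", "p3"],
    "f4": ["p4"],
    "f5": ["p3", "p4"],
    "f6": ["p4"],
}

naive = {p for ps in mapping.values() for p in ps}
minimal = minimal_pathways(mapping)
print(sorted(naive), sorted(minimal))
```

In this instance the naïve approach reports all four pathways, while p1 and p4 alone explain every function: p2 and p3 are supported only by functions that p1 or p4 already account for.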