Advancing Science with DNA Sequence

IMG/M and metagenome analysis
Natalia Ivanova
MGM Workshop, February 5, 2009

Outline
1. Problems of metagenomic data
2. IMG/M features
3. Analysing metagenomic data: flowcharts

1. Problems of metagenomic data
(metagenomic data is the problem)
(see IMG/M -> Using IMG/M -> About IMG/M -> Background for definitions)

Metagenomic data are noisy
• Definition of a high-quality genome sequence, using “finished” JGI genomes as an example: each base is covered by at least two Sanger reads in each direction with a quality of at least Q20
• Definition of a “high-quality” metagenome? Too many variables:
  - species composition/abundance
  - amount of DNA available
  - average GC content of each species (applies to 454 Titanium as well)
  - “clonability” of the DNA of each species (or biases of 454 libraries)
  - amount of sequence allocated
  - no clear sequencing goal
  - …

Metagenomic data are noisy
• Sequence coverage of metagenomes is low

  US Sludge, Phrap assembly        # of scaffolds    % total scaffolds
  Scaffolds, coverage > 2.0                  2954                  9.3
  Scaffolds, coverage 1.03-2.0               8158                 25.7
  Unassembled reads                         20630                 65

• Rate of sequencing artifacts is high
• Frameshifts are the most unpleasant artifacts; they lead to errors in gene prediction

Metagenomic data are highly fragmented
• Median scaffold length in 56 GEBA genomes – 28,179 bp
• Median scaffold length in US Sludge, Phrap assembly – 1,157 bp
• Many more gene fragments in metagenomes (median protein size in GEBA genomes – 252 aa, median protein size in US Sludge, Phrap – 195 aa)
• Problems with assignment to protein families and functional annotation

Metagenomic datasets are large (or huge)

  # of CDSs    GEBA genomes    Samples in IMG            Projects in IMG
  minimal             1,375    2,331 (mouse gut ob2)     2,386 (AMO community)
  maximal             9,433    185,274 (soil)            333,301 (Lake Washington sediment)
  median              3,562    16,053                    83,662

• No manual annotation (functional annotations in metagenomes should be taken with a grain of salt)
• “Divide and conquer” approach

2. IMG/M features
(see also IMG/M -> Using IMG/M -> Using IMG/M -> IMG User Guide and IMG/M Addendum)

IMG/M User Interface Map

Dividing the genes phylogenetically
• Bins
  Microbiome Details -> Microbiome Information -> Bins (of scaffolds)
• Phylogenetic Distribution of Genes
  Microbiome Details -> Phylogenetic Distribution of Genes
  Components:
  - histograms
  - Protein Recruitment Plots
  - summary statistics tables
  - lists of genes
  [Diagram labels: histogram (phylum/class); histogram (family); histogram (species); gene counts, gene lists, summary statistics; recruitment plots]

Dividing the genes by abundance / by function
• Abundance Profiles
  Compare Genomes -> Abundance Profiles Tools
  Components:
  - Abundance Profile Overview
  - Abundance Profile Search
  - Function Comparisons
  - Function Category Comparisons
  Common parameters:
  - Normalization (none/scale for size)
  - Type of count (raw counts/estimated gene copies)
  - Type of protein family (COG, Pfam, Enzyme, TIGRfam)

3. Analysing metagenomic data: flowcharts

Sanger metagenomes
[Flowchart boxes, in the order extracted]
• Sanger library
• 16S sequences
• 10-plate QC
• raw read QC:
  - GC content
  - insert-less clones
  - contamination
• taxonomic analysis (MEGAN)
• Full sequence
• vector and quality trimming
• assembly
• annotation
• binning
• loading to IMG/M-ER (upon request)
• manual analysis (protein families, etc.)
• loading to IMG/M-ER

454 Titanium metagenomes
[Flowchart boxes, in the order extracted]
• Titanium library
• 16S pyrotags
• ¼ run QC (100 Mb)
• Full sequence (1 run, ~500 Mb)
• raw read QC; initial assembly?
• loading to IMG/M-ER (upon request)
• taxonomic analysis (MEGAN)
• dereplication
• quality trimming?
• assembly?
• manual analysis (protein families, etc.)
• annotation?
• binning?
• loading to IMG/M-ER

Sanger/Titanium metagenomes: unassembled data
unassembled metagenomes
• taxonomic analysis using Phylogenetic Distribution of genes
  - gross counts of hits to taxa
  - hits to housekeeping genes at different % identity
  - compare to 16S and MEGAN results
• abundance analysis using Function Comparisons and Function Category Comparisons
  - compare to relevant metagenomes (ecology/taxonomy)
  - compare to relevant genomes (ecology/taxonomy)
  - check “Genes in internal clusters”
• abundance analysis of custom function categories using Function Profiles
  - find the relevant genes and reference sequences in the literature
  - identify relevant protein families
  - add them to Function Cart, run Function Profiles, compare sums of counts

Sanger/Titanium metagenomes: assembled data
assembled metagenomes
• taxonomic analysis using Phylogenetic Distribution of genes
  - look for reference genomes
  - try to select a training set for binning
  - binning
• abundance analysis using Function Comparisons and Function Category Comparisons
  - compare to relevant metagenomes (ecology/taxonomy)
  - compare to relevant genomes (ecology/taxonomy)
  - check “Genes in internal clusters”
• abundance analysis of custom function categories using Function Profiles
  - find the relevant genes and reference sequences in the literature
  - identify relevant protein families
  - add them to Function Cart, run Function Profiles, compare sums of counts

Sanger/Titanium metagenomes: assembled and binned data
assembled and binned metagenomes
• QC analysis of bins
  - check the genes on the scaffolds with lowest confidence
  - analysis of bin coverage: check the presence of COGs in biosynthetic pathways, ribosomal proteins, etc.
• metabolic reconstruction on bins
  - COG Pathways and Functional Categories
  - KEGG maps
  - custom pathways
  - keep in mind bin coverage
  - analyze gene presence/absence in pathway context
  - be careful with unique proteins: they may be errors of gene prediction
• compare bin content using Phylogenetic Profiles
• analyze recombination within populations using SNP VISTA
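
The bin coverage check mentioned for binned data (presence of COGs for ribosomal proteins and similar near-universal genes) can be approximated outside IMG/M with a few lines of Python. This is only a minimal sketch under stated assumptions: the function name `marker_coverage`, the short marker list, and the example bin are hypothetical and chosen for illustration; they are not part of IMG/M, and a real analysis would substitute a curated marker set and the COG assignments exported for the bin.

```python
def marker_coverage(bin_cogs, marker_cogs):
    """Return (fraction of marker COGs found in the bin, sorted missing markers).

    bin_cogs    -- iterable of COG IDs assigned to genes in one bin
    marker_cogs -- iterable of COG IDs expected in a complete genome
    """
    markers = set(marker_cogs)
    if not markers:
        raise ValueError("marker_cogs must not be empty")
    present = markers & set(bin_cogs)
    missing = sorted(markers - present)
    return len(present) / len(markers), missing


# Illustrative marker IDs only (assumption): replace with a curated list of
# COGs for ribosomal proteins or other near-universal single-copy genes.
MARKERS = ["COG0051", "COG0087", "COG0088", "COG0090"]

# Hypothetical bin: COG IDs assigned to the genes on its scaffolds.
bin_a = {"COG0051", "COG0087", "COG0090", "COG1234"}

fraction, missing = marker_coverage(bin_a, MARKERS)
print(f"marker coverage: {fraction:.2f}; missing: {missing}")
# prints "marker coverage: 0.75; missing: ['COG0088']"
```

A low fraction suggests that a pathway gene absent from the bin may simply not have been sequenced or binned, which is why the slide advises keeping bin coverage in mind during metabolic reconstruction.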
