Advancing Science with DNA Sequence

IMG/M and metagenome analysis
Natalia Ivanova
MGM Workshop, February 5, 2009

Outline
1. Problems of metagenomic data
2. IMG/M features
3. Analysing metagenomic data: flowcharts

1. Problems of metagenomic data
(metagenomic data is the problem)
(see IMG/M -> Using IMG/M -> About IMG/M -> Background for definitions)

Metagenomic data are noisy
• Definition of a high-quality genome sequence: for example, in "finished" JGI genomes each base is covered by at least two Sanger reads in each direction with a quality of at least Q20
• Definition of a "high-quality" metagenome? Too many variables:
    species composition/abundance
    amount of DNA available
    average GC content of each species (applies to 454 Titanium as well)
    "clonability" of the DNA of each species (or biases of 454 libraries)
    amount of sequence allocated
    no clear sequencing goal
    …

Metagenomic data are noisy
• Sequence coverage of metagenomes is low

    US Sludge, Phrap assembly       # of scaffolds   % total scaffolds
    Scaffolds, coverage > 2.0       2954             9.3
    Scaffolds, coverage 1.03-2.0    8158             25.7
    Unassembled reads               20630            65

• Rate of sequencing artifacts is high
• Frameshifts are the most unpleasant artifacts; they lead to errors in gene prediction

Metagenomic data are highly fragmented
• Median scaffold length in 56 GEBA genomes: 28,179 bp
• Median scaffold length in US Sludge, Phrap assembly: 1,157 bp
• Many more gene fragments in metagenomes (median protein size in GEBA genomes: 252 aa; median protein size in US Sludge, Phrap: 195 aa)
• Problems with assignment to protein families and functional annotation

Metagenomic datasets are large (or huge)

    # of CDSs   GEBA genomes   Samples in IMG          Projects in IMG
    minimal     1,375          2,331 (mouse gut ob2)   2,386 (AMO community)
    maximal     9,433          185,274 (soil)          333,301 (Lake Washington sediment)
    median      3,562          16,053                  83,662

• No manual annotation (functional annotations in metagenomes should be taken with a grain of salt)
• "Divide and conquer" approach

2. IMG/M features
(see also IMG/M -> Using IMG/M -> Using IMG/M -> IMG User Guide and IMG/M Addendum)

IMG/M User Interface Map

Dividing the genes phylogenetically
• Bins
    Microbiome Details -> Microbiome Information -> Bins (of scaffolds)
• Phylogenetic Distribution of Genes
    Microbiome Details -> Phylogenetic Distribution of Genes
    Components (diagram): histograms (phylum/class, family, species), summary statistics tables, Protein Recruitment Plots, gene counts, lists of genes

Dividing the genes by abundance / by function
• Abundance Profiles
    Compare Genomes -> Abundance Profiles Tools
    Components:
        Abundance Profile Overview
        Abundance Profile Search
        Function Comparisons
        Function Category Comparisons
    Common parameters:
        Normalization (none / scale for size)
        Type of count (raw counts / estimated gene copies)
        Type of protein family (COG, Pfam, Enzyme, TIGRfam)

3. Analysing metagenomic data: flowcharts

Sanger metagenomes
• Sanger library
• 16S sequences
• 10 plate QC
• raw read QC:
    GC content
    insert-less clones
    contamination
• taxonomic analysis (MEGAN)
• Full sequence
• vector and quality trimming
• assembly
• annotation
• binning
• loading to IMG/M-ER (upon request)
• manual analysis (protein families, etc.)
• loading to IMG/M-ER

454 Titanium metagenomes
• Titanium library
• 16S pyrotags
• ¼ run QC (100 Mb)
• Full sequence (1 run, ~500 Mb)
• raw read QC; initial assembly?
• loading to IMG/M-ER (upon request)
• taxonomic analysis (MEGAN)
• dereplication
• quality trimming?
• assembly?
• manual analysis (protein families, etc.)
• annotation?
• binning?
• loading to IMG/M-ER

Sanger/Titanium metagenomes: unassembled data
unassembled metagenomes
• taxonomic analysis using Phylogenetic Distribution of genes
    gross counts of hits to taxa
    hits to housekeeping genes at different % identity
    compare to 16S and MEGAN results
• abundance analysis using Function Comparisons and Function Category Comparisons
    compare to relevant metagenomes (ecology/taxonomy)
    compare to relevant genomes (ecology/taxonomy)
    check "Genes in internal clusters"
• abundance analysis of custom function categories using Function Profiles
    find the relevant genes and reference sequences in the literature
    identify relevant protein families
    add them to Function Cart, run Function Profiles, compare sums of counts

Sanger/Titanium metagenomes: assembled data
assembled metagenomes
• taxonomic analysis using Phylogenetic Distribution of genes
    look for reference genomes
    try to select a training set for binning
• binning
• abundance analysis using Function Comparisons and Function Category Comparisons
    compare to relevant metagenomes (ecology/taxonomy)
    compare to relevant genomes (ecology/taxonomy)
    check "Genes in internal clusters"
• abundance analysis of custom function categories using Function Profiles
    find the relevant genes and reference sequences in the literature
    identify relevant protein families
    add them to Function Cart, run Function Profiles, compare sums of counts

Sanger/Titanium metagenomes: assembled and binned data
assembled and binned metagenomes
• QC analysis of bins
    check the genes on the scaffolds with lowest confidence
    analysis of bin coverage: check the presence of COGs in biosynthetic pathways, ribosomal proteins, etc.
• metabolic reconstruction on bins
    COG Pathways and Functional Categories
    KEGG maps
    custom pathways
    keep in mind bin coverage
    analyze gene presence/absence in pathway context
• compare bin content using Phylogenetic Profiles
    be careful with unique proteins: they may be errors of gene prediction
• analyze recombination within populations using SNP VISTA
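
The "analysis of bin coverage" step (checking whether expected protein families such as ribosomal proteins are present in a bin) can be illustrated with a minimal sketch. This is not an IMG/M function: the names `bin_marker_coverage`, `missing_markers`, `EXPECTED_MARKERS`, and `BINS`, as well as the marker identifiers, are hypothetical placeholders, and the input is assumed to be a set of protein family IDs already exported for each bin.

```python
from typing import Dict, Iterable, List, Set


def bin_marker_coverage(
    bin_families: Dict[str, Set[str]], expected: Iterable[str]
) -> Dict[str, float]:
    """Return, for each bin, the fraction of expected marker families found in it."""
    expected_set = set(expected)
    if not expected_set:
        raise ValueError("expected marker set must not be empty")
    return {
        bin_name: len(families & expected_set) / len(expected_set)
        for bin_name, families in bin_families.items()
    }


def missing_markers(families: Set[str], expected: Iterable[str]) -> List[str]:
    """List the expected marker families that are absent from one bin."""
    return sorted(set(expected) - set(families))


# Hypothetical marker set; in practice this would be a curated list of
# family IDs (e.g. ribosomal proteins) chosen by the analyst.
EXPECTED_MARKERS = ["markerA", "markerB", "markerC", "markerD"]

# Hypothetical bins: bin name -> protein family IDs observed on its scaffolds.
BINS = {
    "bin1": {"markerA", "markerB", "markerC", "markerD", "other1"},
    "bin2": {"markerA", "other2"},
}

if __name__ == "__main__":
    coverage = bin_marker_coverage(BINS, EXPECTED_MARKERS)
    for name in sorted(coverage):
        absent = missing_markers(BINS[name], EXPECTED_MARKERS)
        print(name, coverage[name], absent)
```

A low fraction suggests an incomplete bin, which is why gene absence in a pathway should be read with bin coverage in mind rather than as evidence that the organism lacks the function.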