MEGAN analysis of metagenomic data
Daniel H. Huson, Alexander F. Auch, Ji Qi, et al. Genome Res. 2007

Early metagenomics
• Known phylogenetic markers and subsequent sequencing of clones
• Analysis of paired-end reads
• Complete sequences of environmental fosmid and BAC clones
• Environmental assemblies
• Rough annotation of the metabolic capacity
• Distinguish between discrete species and populations of closely related biotypes
• Problem of using proven phylogenetic markers (ribosomal genes, coding sequences)
• Slow-evolving genes: distinguishing between species at large evolutionary distances

What is MEGAN?
• Metagenome Analyzer (MEGAN)
• Free software
• Deviates from the analytical pattern of previous studies
• Built on the statistical analysis of comparing random sequence intervals with unspecified phylogenetic properties against databases
• Provides filters to adjust the level of stringency later to an appropriate level
• Laptop analysis
• Depends on the related sequences in the databases
• Comparison result (BLAST) -> laptop (MEGAN)
• Graphical and statistical output

Pipeline
• Compare against databases: BLAST
• Compute and explore taxonomical content: NCBI taxonomy
• Lowest common ancestor (LCA) algorithm
• Data sets (Sargasso Sea, mammoth bone, short E. coli K12 & B. bacteriovorus HD100)

What we can do with MEGAN
• Species and strain identification through species-specific genes
• Searching for species or taxa with the find tool
• Distribution of strains of a species
• Underlying sequence alignments

Experiment 1: Sargasso Sea data set
• Sanger sequencing
• Samples 1-4 from DDBJ/EMBL/GenBank
• BLASTX -> NCBI-NR
• 10,000 reads from sample 1
• Randomly selected pooled set of 10,000 reads from samples 2-4
• 1% no hits from sample 1, <3% no hits from samples 2-4
• Filters
  • Min-score: bit-score threshold of 100
  • Top-percent: bit scores lie within 5% of the best score
  • Min-support: isolated assignments (hit by one read) discarded

Analysis: Sargasso Sea data
• 1.66M reads, avg. 818 bp, by Sanger sequencing
• Species profile of 16 taxonomical groups
• Environmental assemblies
• By analyzing six specific phylogenetic markers: rRNA, RecA/RadA, HSP70, RpoB, EF-Tu, and EF-G

Result
• Sample 1
  • ~83% of reads were assigned to taxa more specific than the kingdom level
  • The majority (8298) were assigned to bacterial groups
• Samples 2-4
  • ~59% of reads were assigned to taxa more specific than the kingdom level
  • The majority (5709) were assigned to bacterial groups
• Alphaproteobacteria, Gammaproteobacteria: more abundant by a factor of 2-4 than the remaining 14 taxonomic groups
• Eukaryotes & viruses: size filtering
• Archaea: there may be 10 times as much bacterial sequence information in the public databases
• MEGAN vs. previous analysis (Venter et al. 2004)
• Specific assignment information: LCA

Result (cont.)
• Averaged weighted percentage of the six phylogenetic markers for each of the 16 taxonomic groups
• Easily detects sampling bias between sample 1 and pooled samples 2-4

Experiment 2: Mammoth bone
• Data set
  • Roche GS20 sequencing (sequencing-by-synthesis)
  • Sample from 1 g of mammoth bone, 28,000 years old
  • ~300,000 reads, 95 bp
• BLASTZ -> genome sequences (elephant, human, dog)
• 45.4% of the reads are mammoth DNA; the others come from environmental organisms (bacteria, fungi, amoeba, nematodes)
• BLASTX -> NCBI-NR for environmental sequences
• Filters: bit-score threshold 30, isolated assignments discarded (filtered 2086 reads)
• Result
  • 19841 reads assigned to Eukaryota, of which 7969 to Gnathostomata
  • Bacteria: 16972, Archaea: 761, Viruses: 152

Experiment 3: Identifying species from various read lengths
• Short E. coli K12 & B. bacteriovorus HD100 simulation
• 5000 random shotgun reads
• BLASTX -> NCBI-NR
• Filters
  • Bit-score threshold 35
  • Top-percent: 20% of the best hit
  • Isolated assignments discarded
• Result: no false-positive assignments; short reads can be used for metagenomic analysis, albeit at the cost of a high rate of underprediction

Experiment 3 (cont.)
• Roche GS20 sequencing
• Data set
  • 2000 reads from random positions in E. coli K12
  • ~100 bp
• BLASTX -> NCBI-NR
• Filters
  • Bit-score threshold 35
  • Top-percent: 20% of the best hit
  • Isolated assignments discarded
• Result

Experiment 3 (cont.)
• Roche GS20 sequencing
• Data set
  • 2000 reads from random positions in B. bacteriovorus HD100
  • ~100 bp
• BLASTX -> NCBI-NR: A in figure
• BLASTX -> NCBI-NR without B. bacteriovorus HD100: B in figure
• Filters
  • Bit-score threshold 35
  • Top-percent: 20% of the best hit
  • Isolated assignments discarded
• Result

MEGAN 3 (June 2009)
• Suitable for very large data sets
• Interests changed
  • Advances in the throughput and cost-efficiency of sequencing technology
  • From "Which species are present?" to "What's different?"
• Features
  • Visualization technique for multiple databases
  • New statistical method for highlighting the differences in a pairwise comparison

MEGAN 3 (cont.)
• Comparing 6 mouse gut with human gut
• Clickable, collapsible
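
Sketch: LCA assignment with the three filters

The pipeline and filter slides above describe how each read is placed on the NCBI taxonomy: keep hits above a bit-score threshold (Min-score), keep only hits close to the best score (Top-percent), assign the read to the lowest common ancestor of the remaining taxa, and discard taxa supported by too few reads (Min-support). The following is a minimal Python sketch of that idea, not MEGAN's actual code. The toy taxonomy `PARENT`, the function names, and the input layout (a list of `(taxon, bit_score)` pairs per read) are all assumptions made for illustration; the default thresholds are the ones quoted for the Sargasso Sea experiment.

```python
from collections import Counter

# Hypothetical toy taxonomy (child -> parent); a real run would use the NCBI taxonomy.
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Gammaproteobacteria": "Proteobacteria",
    "Escherichia coli": "Gammaproteobacteria",
    "Salmonella enterica": "Gammaproteobacteria",
    "Deltaproteobacteria": "Proteobacteria",
    "Bdellovibrio bacteriovorus": "Deltaproteobacteria",
}


def lineage(taxon):
    """Return the path from the root down to `taxon`."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path[::-1]


def lca(taxa):
    """Lowest common ancestor: the deepest node shared by all lineages."""
    common = None
    for level in zip(*(lineage(t) for t in taxa)):
        if len(set(level)) != 1:
            break
        common = level[0]
    return common


def assign_read(hits, min_score=100.0, top_percent=5.0):
    """Assign one read from its (taxon, bit_score) hits; None means unassigned."""
    # Min-score: drop hits below the bit-score threshold.
    kept = [(t, s) for t, s in hits if s >= min_score]
    if not kept:
        return None
    # Top-percent: keep hits whose score is within top_percent of the best score.
    best = max(s for _, s in kept)
    cutoff = best * (1.0 - top_percent / 100.0)
    return lca([t for t, s in kept if s >= cutoff])


def assign_reads(reads, min_support=2, **filters):
    """Assign all reads, then discard taxa hit by fewer than min_support reads."""
    assigned = {rid: assign_read(hits, **filters) for rid, hits in reads.items()}
    support = Counter(t for t in assigned.values() if t is not None)
    return {
        rid: (t if t is not None and support[t] >= min_support else None)
        for rid, t in assigned.items()
    }


if __name__ == "__main__":
    reads = {
        "r1": [("Escherichia coli", 200), ("Salmonella enterica", 195)],
        "r2": [("Escherichia coli", 150), ("Salmonella enterica", 120)],
        "r3": [("Escherichia coli", 300)],
        "r4": [("Bdellovibrio bacteriovorus", 90)],
    }
    for rid, taxon in assign_reads(reads).items():
        print(rid, taxon)
```

In this toy input, read r1 has two near-equal hits and is therefore placed at their common ancestor rather than at either species, which is the conservative behaviour behind the "no false positives, but underprediction" result reported for short reads. Setting `min_support=1` disables the last filter.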