3. Sequence preprocessing
Most pipelines work the same way!

[Pipeline diagram: Metagenomics processing. Labels: Preprocessing, Quality control – Prinseq, Contamination removal, Taxonomic assignments, Binning reads, Contig clustering, Population genomes, Annotation, Statistics. Tools shown: Deconseq, STAMP, crAss, FOCUS, metabat, Real time metagenomics, mg-rast, Super FOCUS]

Metagenomics Processing
Contig clustering: AbundanceBin, CompostBin, concoct, crAss, tetra
Preprocessing: FASTQC, FastX Toolkit, fitGCP, NGS QC Toolkit, Non-pareil, Prinseq, QC-Chain, Streaming Trim
Taxonomic assignment: CARMA, FOCUS, KRAKEN, LMAT, MEGAN, MetaPhlAn, myTaxa, PhylopythiaS, phymmbl, RAIphy, TACOA, Taxy
Gene prediction: FragGeneScan, GlimmerMG, MetaGeneAnnotator, MetaGeneMark, MetaGun, Orphelia, Prodigal
Functional assignment: CLAMS, Sequedex, DiScRIBinATE, SORTITEMS, genometa, SPANNER, GSMer, SPHINX, PPLACER, TaxSOM, RTMg, Treephyler

Preprocessing Data
Rob Schmieder

[Diagram: Data → Bad data analysis / Good data analysis. New dataset → Quality control & Preprocessing → Assembly, Similarity search]

3 Tools for metagenomic data
http://prinseq.sourceforge.net
http://tagcleaner.sourceforge.net
http://deconseq.sourceforge.net

Quality control and data preprocessing
http://edwards.sdsu.edu/prinseq

Number and length of sequences
[Figure panels: Bad, Good]
Reads should be approx. the same length (same number of cycles)
→ Short reads are likely lower quality

Linearly degrading quality across the read
Trim low quality ends

High quality throughout the sequence
Good quality through the length of the sequence

Sequence quality falls off quickly
→ Bad sequence data

Ion quality scores

Low quality sequence issues
Most assemblers or aligners do not take quality scores into account
Errors in reads complicate assembly, might cause misassembly, or make assembly impossible

What if quality scores are not available?
Alternative: Infer quality from the percent of Ns found in the sequence
Removes regions with a high number of Ns
Huse et al. found that the presence of any ambiguous base calls was a sign of overall poor sequence quality
Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)

Ambiguous bases
If you can afford the loss, filter out all reads containing Ns
Assemblers (e.g. Velvet) and aligners (SSAHA2, BWA, …) use a 2-bit encoding system for nucleotides
– some replace Ns with a random base, some with a fixed base (e.g. SSAHA2 & Velvet = A)
2-bit example: 00 – A, 01 – C, 10 – G, 11 – T

Quality filtering
Any region with a homopolymer will tend to have a lower quality score
Huse et al. found that sequences with an average score below 25 had more errors than those with higher averages
Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)

Sequence duplicates
Real or artificial duplicate?
Metagenomics = random sampling of genomic material
Why do reads start at the same position?
Why do these reads have the same errors?
No specific pattern or location on sequencing plate
Gomez-Alvarez et al.: Systematic artifacts in metagenomes from complex microbial communities. ISME (2009)

One micro-reactor – Many beads
Martine Yerle (Laboratory of Cellular Genetics, INRA, France)

Impacts of duplicates
False variant (SNP) calling
Require more computing resources
– Find similar database sequences for same query sequence
– Assembly process takes longer
– Increase in memory requirements
Abundance or expression measures can be wrong

Example (reads aligned to the reference):
Reference  ...ACCACACGTGTTGTGTACATGAACACAGTATATGAGCATACAGAT...
                     GTGTTGTGTACATGAACACAGTATATGAGCATACAGAT...
                          GTGTACATGAACACAGTATATGAGCATACAGAT...
                                 TGAACACAGTCTATGAGCATACAGAT...
                                 TGAACACAGTCTATGAGCATACAGAT...
                                 TGAACACAGTCTATGAGCATACAGAT...
                                 TGAACACAGTCTATGAGCATACAGAT...
                                 TGAACACAGTCTATGAGCATACAGAT...
The five identical reads start at the same position and carry the same C where the reference has A, which a variant caller could report as a false SNP.

Detect and remove tag sequences
http://edwards.sdsu.edu/tagcleaner
[Figure labels: No tag, MID tag, WTA tags, Imperfect primer annealing, Fragment-to-fragment concatenations]
[Web interface steps: Data upload, Tag sequence definition, Tag sequence prediction, Parameter definition, Download results]

Identification and removal of sequence contamination
http://edwards.sdsu.edu/deconseq

Contaminant identification
Previous methods had critical limitations
Dinucleotide relative abundance uses information content in sequences
– cannot identify single contaminant sequences
Sequence similarity seems to be the only reliable option to identify single contaminant sequences
– BLAST against human reference genome is slow and lacks corresponding regions (gaps, variants, …)
– Novel sequences in every new human genome sequenced*
Faster algorithms for Next-gen data
* Li et al.: Building the sequence map of the human pan-genome. Nature Biotechnology (2010)

Principal component analysis (PCA) of dinucleotide relative abundance
[Figure panels: Microbial metagenomes, Viral metagenomes]

DeconSeq web interface
Two types of reference databases
– Remove
– Retain

DeconSeq
Identity = How similar is the query sequence to the reference sequence
Coverage = How much of the query sequence is similar to the reference sequence

DeconSeq
Blue = More similar to “retain”
Red = More similar to “remove”
Human DNA contamination identified in 145 out of 202 metagenomes

Pairing Data

Two types of paired ends
– Mate pairs
– Paired end reads

[Figure: Repeats. Contigs A, B, C linked by paired end reads or mate pairs]

Mate pair sequencing
[Figure labels: Add linkers, Nick, Sequencing, migration]

Paired end sequencing
Tagmentation: Biological fragmentation using a transposon and discontinuous DNA primers
Covaris ultrasonicator: Physical fragmentation of DNA using sonication

Paired end sequencing
[Figure panels: Short reads, Long reads]

Joining Paired Ends
Counting abundance: Join, but keep one end of singletons
Assembling: Do not join (assembler will do it)
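
The filtering rules from the quality control slides (drop reads containing Ns, drop reads whose average quality is below 25, collapse duplicates) can be sketched in a few lines of Python. This is a minimal illustration, not how PRINSEQ implements these steps. It assumes that quality strings are Phred+33 encoded and that a "duplicate" means an exact sequence match. The names `n_fraction`, `mean_quality`, `filter_reads`, and `example_reads` are hypothetical and chosen only for this example.

```python
from typing import Iterable, Iterator, Tuple

# Assumed read layout for this sketch: (read id, sequence, Phred+33 quality string).
Read = Tuple[str, str, str]


def n_fraction(seq: str) -> float:
    """Fraction of ambiguous base calls (N) in a sequence."""
    if not seq:
        return 1.0  # treat an empty read as fully ambiguous
    return seq.upper().count("N") / len(seq)


def mean_quality(qual: str, offset: int = 33) -> float:
    """Average Phred score of a quality string (Phred+33 assumed by default)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_reads(
    reads: Iterable[Read],
    max_n_fraction: float = 0.0,     # 0.0 drops every read that contains an N
    min_mean_quality: float = 25.0,  # threshold mentioned on the slides (Huse et al.)
) -> Iterator[Read]:
    """Yield reads that pass the N, mean-quality, and exact-duplicate checks."""
    seen = set()
    for read_id, seq, qual in reads:
        if n_fraction(seq) > max_n_fraction:
            continue  # too many ambiguous bases
        if mean_quality(qual) < min_mean_quality:
            continue  # low average quality
        key = seq.upper()
        if key in seen:
            continue  # exact duplicate of a read that was already kept
        seen.add(key)
        yield read_id, seq, qual


example_reads = [
    ("r1", "ACGTACGT", "IIIIIIII"),  # mean quality 40, kept
    ("r2", "ACGTNCGT", "IIIIIIII"),  # contains an N, dropped
    ("r3", "ACGTACGA", "55555555"),  # mean quality 20, dropped
    ("r4", "ACGTACGT", "IIIIIIII"),  # exact duplicate of r1, dropped
    ("r5", "TTGACCGA", "::::::::"),  # mean quality 25, kept
]

kept_ids = [read_id for read_id, _, _ in filter_reads(example_reads)]
print(kept_ids)  # → ['r1', 'r5']
```

Exact-match deduplication also removes real duplicates, so for abundance or expression counts it may be preferable to record how many copies were collapsed rather than discard them silently.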