Data analysis methods for next-generation sequencing technologies
Gabor T. Marth
Boston College Biology Department
Epigenomics & Sequencing Meeting
July 14-15, 2008, Boston, MA

T1. Roche / 454 FLX system
• pyrosequencing technology
• variable read-length
• the only new technology with >100bp reads
• tested in many published applications
• supports paired-end read protocols with up to 10kb separation size

T2. Illumina / Solexa Genome Analyzer
• fixed-length short-read sequencer
• read properties are very close to those of traditional capillary sequences
• very low INDEL error rate
• tested in many published applications
• paired-end read protocols support short (<600bp) separation

T3. AB / SOLiD system
• fixed-length short-read sequencer
• employs a 2-base encoding system that can be used for error reduction and improving SNP calling accuracy
• requires color-space informatics
• published applications underway / in review
• paired-end read protocols support up to 10kb separation size

2-base color encoding (rows: 1st base; columns: 2nd base)

        A   C   G   T
    A   0   1   2   3
    C   1   0   3   2
    G   2   3   0   1
    T   3   2   1   0

T4. Helicos / Heliscope system
• experimental short-read sequencer system
• single molecule sequencing
• no amplification
• variable read-length
• error rate reduced with 2-pass template sequencing

A1. Variation discovery: SNPs and short-INDELs
1. sequence alignment
2. dealing with non-unique mapping
3. looking for allelic differences

A2. Structural variation detection
• structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations
• copy number (for amplifications, deletions) from depth of read coverage

A3. Identification of protein-bound DNA
• Chromatin structure (CHIP-SEQ) (Mikkelsen et al. Nature 2007)
• Transcription factor binding sites (Robertson et al. Nature Methods, 2007)
[Figure labels: genome sequence, aligned reads]

A4. Novel transcript discovery (genes)
Mortazavi et al. Nature Methods

A5. Novel transcript discovery (miRNAs)
Ruby et al. Cell, 2006

A6. Expression profiling by tag counting
Jones-Rhoades et al. PLoS Genetics, 2007
[Figure labels: gene, aligned reads]

A7. De novo organismal genome sequencing
Lander et al. Nature 2001
[Figure labels: short reads, read pairs, longer reads, assembled sequence contigs]

C1. Read length
[Figure: read length [bp] by technology, axis 0 to 400; bar labels: 20-35 (var), 25-35 (fixed), 25-40 (fixed), ~200-450 (var)]

When does read length matter?
• short reads are often sufficient where the entire read length can be used for mapping:
  SNPs, short-INDELs, SVs
  CHIP-SEQ
  short RNA discovery
  counting (mRNA, miRNA)
• longer reads are needed where one must use parts of reads for mapping:
  de novo sequencing
  novel transcript discovery
[Figure: overlapping reads aacttagacttaca / gacttacatacgta; read accgattactatacta spanning Known exon 1 and Known exon 2]

C2. Read error rate
• error rate typically 0.4 - 1%
• error rate dictates the stringency of the read mapper
• the more errors the aligner must tolerate, the lower the fraction of the reads that can be uniquely aligned
[Figure: fraction of genome (0.00 to 0.40) vs. number of mismatches allowed (0 to 2)]

Error rate grows with each cycle
• this phenomenon limits useful read length
[Figure: error rate (0% to 10%) and measured QV (0 to 40) vs. position on read (0 to 40)]

Substitutions vs. INDEL errors

C3. Representational biases / library complexity
• fragmentation biases
• PCR amplification biases
• sequencing biases
[Figure labels: sequencing; low/no representation; high representation]

Dispersal of read coverage
• this affects variation discovery (deeper starting read coverage is needed)
• its major impact should be on counting applications

Amplification errors
• an early amplification error gets propagated onto every clonal copy
• many reads come from clonal copies of a single fragment
• early PCR errors in “clonal” read copies lead to false positive allele calls

C4. Paired-end reads
• fragment amplification: fragment length 100 - 600 bp
• fragment length limited by amplification efficiency
• circularization: 500bp - 10kb (sweet spot ~3kb)
• fragment length limited by library complexity
Korbel et al. Science 2007
• paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if the fragment length constraint is used to rescue non-uniquely mapping ends)

Technologies / properties / applications

Read properties             Roche/454               Illumina/Solexa    AB/SOLiD
Read length                 200-450bp               20-50bp            25-50bp
Error rate                  <0.5%                   <1.0%              <0.5%
Dominant error type         INDEL                   SUB                SUB
Quality values available    yes                     yes                not really
Paired-end separation       <10kb (3kb optimal)     100 - 600bp        500bp - 10kb (3kb optimal)

Applications                Roche/454               Illumina/Solexa    AB/SOLiD
SV discovery                ○                       ○                  ●
CHIP-SEQ                    ○                       ●                  ●
small RNA/gene discovery    ○                       ●                  ●
mRNA Xcript discovery       ●                       ○                  ○
Expression profiling        ○                       ●                  ●
De novo sequencing          ●                       ?                  ?
[SNP discovery and short-INDEL discovery rows: ratings as extracted are ● ● ○ ● ○; the per-technology assignment is not recoverable]

Resequencing-based SNP discovery
(i) base calling
(ii) micro-repeat analysis
(iii) read mapping (pair-wise alignment to genome reference)
(iv) read assembly
(v) SNP calling
(vi) SNP validation
(vii) data viewing, hypothesis generation
[Figure labels: REF, IND]

The “toolbox”
• base callers
• microrepeat finders
• read mappers
• SNP callers
• structural variation callers
• assembly viewers

Reference guided read mapping
Reference-sequence guided mapping: …you get the pieces… …AND they give you the cover on the box

Some pieces are more unique than others

MOSAIK: an anchored aligner / assembler
Step 1. initial short-hash scan for possible read locations
Step 2. evaluation of candidate locations with SW method
Michael Stromberg

Non-unique mapping, gapped alignments
1. Non-unique read mapping: optionally either report only uniquely mapped reads or report all map locations for each read (mapping quality values for all mapped reads are being implemented)
2. Gapped alignments: allow for mapping reads with insertion or deletion sequencing errors, and reads with bona fide INDEL alleles

Read types aligned, paired-end read strategy
3. Aligns and co-assembles customary read types:
   ABI/capillary
   Illumina/Solexa
   AB/SOLiD
   Roche/454
   Helicos/Heliscope
4. Paired-end read alignments
[Figure labels: ABI/capillary, 454 FLX, 454 GS20, Illumina]

Other mainstream read mappers
• ELAND (Tony Cox, Illumina) -- the “official” read mapper supplied by Illumina, fast
• MAQ (Li Heng + Richard Durbin, Sanger) -- the most widely used read mapper, low RAM footprint
• SOAP (Beijing Genomics Institute) -- a new mapper developed for human next-gen reads
• SHRIMP (Michael Brudno, University of Toronto) -- full Smith-Waterman

Speed

Polymorphism / mutation detection
[Figure labels: sequencing error, polymorphism]

Determining genotype directly from sequence

individual 1    AACGTTAGCATA
                AACGTTAGCATA
                AACGTTCGCATA
                AACGTTCGCATA    A/C

individual 2    AACGTTCGCATA
                AACGTTCGCATA
                AACGTTCGCATA
                AACGTTCGCATA    C/C

individual 3    AACGTTAGCATA
                AACGTTAGCATA    A/A

Software
GigaBayes

\[
P(\mathrm{SNP}) =
\frac{\displaystyle \sum_{\text{all variable } S}
      \frac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots
      \frac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)}
      \, P_{\mathrm{Prior}}(S_1, \ldots, S_N)}
     {\displaystyle \sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]}
      \frac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots
      \frac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})}
      \, P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})}
\]

[Figure labels: SNP, INS]

Data visualization
1. aid software development: integration of trace data viewing, fast navigation, zooming/panning
2. facilitate data validation (e.g. SNP validation): co-viewing of multiple read types, quality value displays
3. promote hypothesis generation: integration of annotation tracks
Weichun Huang

Applications
1. SNP discovery in shallow, single-read 454 coverage (Drosophila melanogaster)
2. SNP and INDEL discovery in deep Illumina short-read coverage (Caenorhabditis elegans)
3. Mutational profiling in deep 454 and Illumina read data (Pichia stipitis)
(image from Nature Biotech.)
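The P(SNP) expression on the GigaBayes slide can be illustrated with a small brute-force sketch. This is not the GigaBayes implementation; it is a minimal example of the same kind of sum for a single alignment column with a handful of reads. The names `base_posteriors` and `snp_probability` are hypothetical, and three modelling choices are assumptions made here: a uniform per-base prior of 1/4, a miscall probability spread evenly over the three other bases, and a joint prior that gives total weight `snp_rate` to variable configurations and `1 - snp_rate` to the four monomorphic ones.

```python
from itertools import product

BASES = "ACGT"


def base_posteriors(called_base, error_prob):
    """P(S | R) for one read: the called base gets 1 - e, the others e / 3.

    Spreading the error evenly over the other bases is an assumption.
    """
    return {
        b: (1.0 - error_prob) if b == called_base else error_prob / 3.0
        for b in BASES
    }


def snp_probability(calls, snp_rate=0.001):
    """Probability that the true bases under the reads are not all identical.

    calls    -- list of (called_base, error_prob), one entry per read
    snp_rate -- assumed total prior weight of all variable configurations
    Enumerates all 4**N configurations, so it is only meant for small N.
    """
    n = len(calls)
    if n < 2:
        return 0.0  # a single read cannot show a difference between reads

    per_read = [base_posteriors(base, err) for base, err in calls]
    base_prior = 0.25                          # assumed uniform P_Prior(S_i)
    prior_mono = (1.0 - snp_rate) / 4.0        # each of AAAA.., CCCC.., ...
    prior_variable = snp_rate / (4 ** n - 4)   # each variable configuration

    variable_sum = 0.0
    total_sum = 0.0
    for config in product(BASES, repeat=n):
        term = 1.0
        for posterior, s in zip(per_read, config):
            term *= posterior[s] / base_prior
        is_variable = len(set(config)) > 1
        term *= prior_variable if is_variable else prior_mono
        total_sum += term
        if is_variable:
            variable_sum += term
    return variable_sum / total_sum


if __name__ == "__main__":
    # Column of individual 1 on the genotype slide: two A reads, two C reads.
    mixed = [("A", 0.001), ("A", 0.001), ("C", 0.001), ("C", 0.001)]
    print(snp_probability(mixed))
```

With high-quality calls, the mixed A/A/C/C column scores close to 1, while four concordant reads score close to 0. Lowering the quality of the discordant reads pulls the score down, which is why well-calibrated quality values matter for this kind of caller.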
Our software is available for testing
http://bioinformatics.bc.edu/marthlab/Beta_Release

Credits
Elaine Mardis (Washington University)
Andy Clark (Cornell University)
Doug Smith (Agencourt)

Research supported by:
NHGRI (G.T.M.)
BC Presidential Scholarship (A.R.Q.)

Michael Stromberg
Michele Busby
Aaron Quinlan
Chip Stewart
Damien Croteau-Chonka
Eric Tsung
Derek Barnett
Weichun Huang
http://bioinformatics.bc.edu/marthlab

Accuracy
• As is the case for all heuristic alignment algorithms, accuracy and speed are option- and parameter-dependent

C3. Quality values are important for allele calling
• PHRED base quality values represent the estimated likelihood of sequencing error and help us pick out true alternate alleles
• inaccurate or poorly calibrated base quality values hinder allele calling
Q-values should be accurate … and high!

Software tools for next-gen sequence analysis

Next-generation sequencing technologies and applications
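The quality-value slide relies on the standard PHRED definition, Q = -10 log10(P), where P is the estimated probability that a base call is wrong. A minimal sketch of the conversion in both directions follows; the function names are hypothetical and chosen for illustration only.

```python
import math


def phred_to_error_prob(q):
    """Convert a PHRED quality value Q to an error probability: P = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to PHRED: Q = -10 log10(P)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

On this scale, the typical read error rates of 0.4 - 1% quoted on the error-rate slide correspond to roughly Q24 down to Q20.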