Next-generation sequencing – the informatics angle
Gabor T. Marth
Boston College Biology Department
AGBT 2008, Marco Island, FL
February 6, 2008

T1. Roche / 454 FLX system
• pyrosequencing technology
• variable read length
• the only new technology with >100bp reads
• tested in many published applications
• supports paired-end read protocols with up to 10kb separation size

T2. Illumina / Solexa Genome Analyzer
• fixed-length short-read sequencer
• read properties are very close to those of traditional capillary sequences
• very low INDEL error rate
• tested in many published applications
• paired-end read protocols support short (<600bp) separation

T3. AB / SOLiD system
2-base color encoding (rows: 1st base; columns: 2nd base)
          A  C  G  T
      A   0  1  2  3
      C   1  0  3  2
      G   2  3  0  1
      T   3  2  1  0
• fixed-length short-read sequencer
• employs a 2-base encoding system that can be used for error reduction and improving SNP calling accuracy
• requires color-space informatics
• published applications underway / in review
• paired-end read protocols support up to 10kb separation size

T4. Helicos / Heliscope system
• experimental short-read sequencer system
• single-molecule sequencing
• no amplification
• variable read length
• error rate reduced with 2-pass template sequencing

A1. Variation discovery: SNPs and short-INDELs
1. sequence alignment
2. dealing with non-unique mapping
3. looking for allelic differences

A2. Structural variation detection
• structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations
• copy number (for amplifications, deletions) from depth of read coverage

A3. Identification of protein-bound DNA
[Figure: reads aligned to the genome sequence]
• Chromatin structure (CHIP-SEQ) (Mikkelsen et al. Nature 2007)
• Transcription factor binding sites (Robertson et al. Nature Methods 2007)

A4. Novel transcript discovery (genes)
• novel genes / exons
• novel transcripts in known genes
[Figure: inferred exon 1, inferred exon 2; known exon 1, known exon 2]

A5.
Novel transcript discovery (miRNAs)
Ruby et al. Cell 2006

A6. Expression profiling by tag counting
[Figure: aligned reads counted per gene]
Jones-Rhoades et al. PLoS Genetics 2007

A7. De novo organismal genome sequencing
[Figure: short reads, read pairs and longer reads; assembled sequence contigs]
Lander et al. Nature 2001

C1. Read length
[Figure: read length [bp] by platform, axis 0-300: 20-35 (variable), 25-35 (fixed), 25-40 (fixed), ~250 (variable)]

When does read length matter?
• short reads are often sufficient where the entire read length can be used for mapping:
  SNPs, short-INDELs, SVs
  CHIP-SEQ
  short RNA discovery
  counting (mRNA, miRNA)
• longer reads are needed where one must use parts of reads for mapping:
  de novo sequencing
  novel transcript discovery
[Figure: read sequences overlapping each other and spanning known exon 1 and known exon 2]

C2. Read error rate
• error rate typically 0.4 - 1%
• error rate dictates how many errors the aligner should tolerate
• the more errors the aligner must tolerate, the lower the fraction of the reads that can be uniquely aligned
• in applications where specific alleles are also essential, error rate is even more important
[Figure: fraction of genome (0.00-0.40) vs. number of mismatches allowed (0-2)]

C3. Error rate grows with each cycle
• this phenomenon limits useful read length
[Figure: error rate (0-10%) and measured QV (0-40) vs. position on read (0-40)]

C4. Substitutions vs. INDEL errors
• SNP discovery may require higher coverage for allele confirmation
• INDELs can be discovered with very high confidence!
• gapped alignment necessary
• good SNP discovery accuracy
• short-INDEL discovery difficult

C5.
Quality values are important for allele calling
• PHRED base quality values represent the estimated likelihood of sequencing error and help us pick out true alternate alleles
• inaccurate or poorly calibrated base quality values hinder allele calling

Q-values should be accurate … and high!

Quality values should be well-calibrated
• the assigned base quality value should be calibrated to represent the actual base quality in every sequencing cycle

C6. Representational biases / library complexity
[Figure: fragmentation biases, PCR amplification biases and sequencing biases lead to low/no representation of some regions and high representation of others]

Dispersal of read coverage
• this affects variation discovery (deeper starting read coverage is needed)
• it has a major impact on counting applications

Amplification errors
[Figure: an early amplification error gets propagated onto every clonal copy; many reads come from clonal copies of a single fragment]
• early PCR errors in “clonal” read copies lead to false positive allele calls

C7. Paired-end reads
• fragment amplification: fragment length 100 - 600bp
• fragment length limited by amplification efficiency
• circularization: 500bp - 10kb (sweet spot ~3kb)
• fragment length limited by library complexity
Korbel et al. Science 2007
• paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if the fragment length constraint is used to rescue non-uniquely mapping ends)

Paired-end reads for SV discovery
• longer fragments increase the chance of spanning SV breakpoints and/or entire events
• longer fragments tend to have wider fragment length distributions
• SV breakpoint detection sensitivity & resolution depend on the width of the fragment length distribution (most 2kb deletions would be detected at 10% std but missed at 30% std)

C8.
Technologies / properties / applications

Technology                    Roche/454               Illumina/Solexa   AB/SOLiD

Read properties
  Read length                 250bp                   20-40bp           25-35bp
  Error rate                  <0.5%                   <1.0%             <0.5%
  Dominant error type         INDEL                   SUB               SUB
  Paired-end reads available  yes                     yes               yes
  Paired-end separation       < 10kb (3kb optimal)    100 - 600bp       500bp - 10kb (3kb optimal)

Applications
  SNP discovery, short-INDEL discovery, SV discovery
    [ratings for these three rows are garbled in the source; symbols as extracted: ○ ● ● ● ○ and ○ ○ ●]
  CHIP-SEQ                    ○                       ●                 ●
  small RNA/gene discovery    ○                       ●                 ●
  mRNA transcript discovery   ●                       ○                 ○
  Expression profiling        ○                       ●                 ●
  De novo sequencing          ●                       ?                 ?

Thanks
Michael Egholm
Clive Brown
David Bentley
Elaine Mardis
Francisco de la Vega
Kristen Stoops
Ed Thayer

MOSAIK talk Thursday, 7:40PM

Michael Stromberg
Michele Busby
Aaron Quinlan
Eric Tsung
Derek Barnett
Chip Stewart
Damien Croteau-Chonka
Weichun Huang

http://bioinformatics.bc.edu/marthlab
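
Supplementary sketch for slide C5 (quality values): the slides do not spell out the PHRED scale, so the snippet below assumes the standard definition Q = -10 · log10(P_error). It shows how a quality value maps to an error probability, and how one might check calibration by comparing an assigned Q with the quality implied by observed errors. The function names and the example counts are hypothetical, chosen for illustration only, and are not taken from the talk or from any particular base caller.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a PHRED quality value (assumed standard scale)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """PHRED quality value implied by an error probability in (0, 1]."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def empirical_phred(n_errors, n_bases):
    """Quality implied by observed errors among bases that share one assigned Q.

    Hypothetical calibration check: requires at least one observed error.
    """
    if n_errors <= 0 or n_bases <= 0 or n_errors > n_bases:
        raise ValueError("need 0 < n_errors <= n_bases")
    return error_prob_to_phred(n_errors / n_bases)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")

    # Made-up counts: bases assigned Q30 that show 10 errors per 1000 bases
    # behave like Q20 bases, i.e. the assigned values are overconfident.
    assigned_q = 30
    observed_q = empirical_phred(n_errors=10, n_bases=1000)
    print(f"assigned Q{assigned_q}, observed about Q{observed_q:.1f}")
```

A well-calibrated sequencing cycle would show assigned and observed values that roughly agree; a large gap in either direction is the kind of miscalibration that slide C5 warns will hinder allele calling.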