Download deschamp_2009_sequencing

Overview and Applications of Next-Generation Sequencing Technologies Stéphane Deschamps Analytical & Genomic Technologies DuPont Agriculture & Nutrition Pioneer Hi-Bred International Outline 1. Next-Generation Sequencing Platforms 1. 454 FLX technology 2. Solexa/Illumina technology 2. Applications of Next-Generation Sequencing Technologies 1. Overview 2. Variant detection with Illumina platform 3. Open-source tools for bioinformatics 3. Third-Generation Sequencing technologies: what’s next? Sanger sequencing Successive improvements now allows 96 800-900 base reads to be sequenced in less than 2h Sanger sequencing Sanger sequencing has been, and still is, very useful... ...but it remains slow and expensive Sequencing Platform Comparisons ABI 3730xl 454 FLX Titanium Illumina GA II Read Length ~750bps ~450bps 18-75bps Number of reads/run 96 500K 100MM Max Yield/run ~70Kbps ~1Gbp ~10Gbps Cost/1Gbp $3.5MM $7,000 $1,000 Run time/machine to 1Gbp 8 years 1 day <1 day Next-Generation Sequencing Second-generation platforms: Third-generation platforms: •454/Roche •Solexa/Illumina •SOLiD/ABI •Helicos BioSciences •Dover Systems •Complete Genomics •BioNanomatrix •VisiGen •Pacific Biosciences •Intelligent Bio-Systems •ZS Genetics •Reveo •LightSpeed Genomics •NABsys •Oxford Nanopore Technologies 454 FLX Titanium • First next-generation sequencing platform launched (October 2005) • Titanium chemistry for the 454 FLX launched in September 2008 • Sequencing By Synthesis – Pyrosequencing – Chemiluminescent signal • Long read technology (~450 nucleotides) • Possibility of sequencing both ends of DNA fragments (FLX platform) • Generates up to 0.5Gbps per run • Max cost is ~$10,000/run 454 FLX Titanium • DNA Library Construction • Emulsion PCR • Sequencing DNA Library Construction • DNA fragmentation via nebulization • Size-selection • Ligation of adapters A & B • Selection of A/B fragments via biotin selection • Denaturation to select single-stranded A/B fragments • No cloning! (B/B) A/B ss DNA End repair Streptavidin Denaturation Streptavidin + (A/B) + (A/A) Emulsion PCR Emulsion PCR • Add DNA to capture beads (needs titration) • Add PCR reagents to DNA and capture beads • Transfer sample to oil tube or cup • Emulsify DNA capture beads in PCR reagents to form water-in-oil “microreactors” – Emulsion with Qiagen TissueLyser (highspeed shaker) • Clonal amplification in microreactors – Careful not to break the emulsion! – ~10MM copies per capture bead • Break emulsion and enrich for DNA positive beads – Use biotinylated oligo to capture enriched beads then denature www.roche-applied-science.com Bead deposition into plates • Deposition of enriched beads into PicoTiter plate • Well diameter = 29uM allowing for a single bead (20uM diameter) per well • Chambers are filled with enzyme beads, DNA beads and packing beads. www.roche-applied-science.com Pyrosequencing 1. Polymerase add nucleotide (sequential flow of dNTPs) 2. PPi is released 3. Sulfurylase creates ATP from PPi 4. Luciferase hydrolyzes ATP and use luciferin to make light www.roche-applied-science.com Image and signal processing 1. Raw data is series of images (one image per base per cycle). 2. Data are extracted, quantified and normalized. 3. Read data are converted into “flowgrams”. Post-processing 1. Output = flowgrams, basecalls, Phred-equivalent scores 2. Basecall & Flowgrams can be used in the following applications: 1. De novo assembler – consensus sequences assembled into contigs with quality scores and ACE file (works best with genomic DNA). 2. Reference mapper – contigs mapped to reference sequence + list of high-confidence mutations 3. Amplicon variant analyzer – identification of sequence variants in amplicon libraries Illumina Genome Analyzer • Successor to MPSS (Massively Parallel Signature Sequencing) • Single molecule array (“flow cell”) with millions of amplified clusters • Sequencing By Synthesis – Removable fluorescence – Reversible terminators • Short read technology (16 - 75 nucleotides) • Possibility of sequencing both ends of DNA fragments • Generates up to 20Gbps per run • Max cost is ~$10,000/run = $500/Gbp! Illumina Genome Analyzer Sample Prep Prepare DNA fragments + Ligate adapters Cluster Station Cluster Synthesis Genome Analyzer Sequencing Analysis Pipeline Cluster Station Genome Analyzer Fluidics and Electronics Flow Cell & Detection Laser Optics Cluster Generation or RNA - anneal Cluster Generation - extension DNA Clusters • ~1,000 copies of DNA in each cluster • 1-2 microns in diameter Reversible Terminator Chemistry Sequencing by Synthesis (SBS) 5’ 3’ Cycle 1: First base incorporated G T Remove unincorporated bases A C A T G C C G T T A C A Add sequencing reagents C G A T T A G A C T C C G A G C T C G A 5’ T Detect signal Deblock (removal of fluorescent dye and protecting group) Cycle 2 - n: Add sequencing reagents and repeat Sequencing by Synthesis (SBS) Data Analysis Workflow - Illumina Images (.tif) Illumina Analysis Pipeline Data transfer and Storage Image Analysis Base calling Sequence Analysis alignment (ELAND), filtering (chastity) •Cluster Intensities •Cluster Noise •Cluster Sequence •Cluster Probabilities (Scores) •Corrected Cluster Intensities • cross-talk correction • phasing correction 1 image per dye 4 dyes/cycle 75 cycles 50 tiles/column 2 columns/lane 8 lanes/flowcell 240,000 images per flowcell x8 MB per image 1.92 TB of image data •Image analysis module is Firecrest x2 for PE run 3.8 TB of image data •Base calling module is Bustard •Sequence analysis module is Gerald Alignments, Assemblies, Normalization, Annotations & Post-processing Evaluations Other platforms Sequencing Sequencing Run Read Reads per Throughput per Platform Chemistry Time Length (bp) Run (million) Run (Gbp) Roche 454 FLX Pyrosequencing 10h 400-500 ~1 0.4-0.5 9.5 days 100 250 25 8 days 50 400 20 8 days 25-55 600-800 15-45 80h 28 300-400 10 Sequencing by Synthesis Sequencing by ABI SOLiD Ligation Sequencing by Helicos HeliScope Synthesis Sequencing by Polonator Ligation Illumina GAIIx Data Storage & Quality Images? ~Phred 20 Phred score 20 = 1% error rate Quality vs. Read Length? Trimming? Lower sequence quality than Sanger sequencing but offset by deeper coverage Single short read uniqueness Illumina 35 base reads aligned to A. thaliana genome 10000000 ~4MM reads 1 location 2 locations 3 locations 4 locations 5 locations 6 locations 7 locations 8 locations 9 locations 10 locations 100000 10000 1000 100 10 1 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 Number of Reads 1000000 Read length (bps) Applications of Next-Generation Sequencing Gene Expression Profiling – Tag count & Alignments – Digital Gene Expression Tag Profiling • Short cDNA fragments mapping to 3’ ends of transcripts • SAGE-like approach (1 short tag/transcript) • 20 base tag output (RE site + 16 bases) aligned to a reference genome • Identify, quantify and annotate expressed genes – Transcriptome Profiling (RNA-Seq) • cDNA fragments generated via random priming • 36-75 base output aligned to a reference genome • Assemble entire transcript sequence • Identify, quantify and annotate expressed genes • Identify SNPs, alleles and alternative splice variants Tag Profiling – Sample Prep (Illumina) Total RNA (5ug) mRNA isolation AAAAA 1st and 2nd Strand cDNA Synthesis AAAAA TTTTT-bio Restriction Enzyme Digestion (DpnII or NlaIII) AAAAA TTTTT-bio CATG GEX Adaptor 1 Ligation MmeI CATG GTAC AAAAA TTTTT_bio MmeI CATG GTAC MmeI digestion NN GEX Adaptor 2 Ligation NN NN CATG GTAC PCR Amplification PCR Primer 2 PCR Primer 1 CATG GTAC sequencing primer TAG Cluster Generation Transcriptome Profiling – Sample Prep (Illumina) Tissue Total RNA isolation (10ug) mRNA isolation AAAAA Fragmentation (random) AAAAA 1st and 2nd Strand cDNA Synthesis (N6 primer) AAAAA TTTTT Adaptor Ligations PCR Amplification PCR Primer 2 PCR Primer 1 sequencing primer 1 sequencing primer 2 Cluster Generation Novel Transcript Discovery – Small RNA Identification and Profiling • – Small RNA size is suitable to discovery with next-generation sequencing Deep assessment of alternative splicing isoforms • Deep coverage allows discovery of rare isoforms Mortazavi et al. (2008), Nat. Methods De novo Sequencing – Whole Genome Sequencing • Small genomes that are not too complex (microbial) • The longer the reads, the better – 454 chemistry most suitable • Paired-End sequencing – Whole Transcriptome Sequencing – Targeted Sequencing • Pooled PCR products – Raindance Technologies (~4,000 amplicons in one tube) – Padlock probes • Pooled BAC clones • Sequence Capture (Solid phase, Liquid phase) – Agilent, Febit & Nimblegen – Metagenomics & Microbial diversity Gene Regulation – ChIP-Seq (immunoprecipitate sequencing) • Capture regions of the genome bound by proteins (transcription factors, histones) • Sequences need to be aligned to a reference sequence • Requires complex algorithm to determine differential levels of coverage throughout the genome – Methyl-Seq (methylation status) – Bisulfite Sequencing • – Sequences aligned and compared to reference sequence DNAseI Hypersensitivity Site Sequencing Mikkelsen et al. (2007), Nature Variant & Structural Variation – Coverage & Alignment – Paired-End Sequencing – Whole Genome Resequencing • Small genomes that are not too complex (repeats, duplications...) • The longer the reads, the better – Targeted Resequencing • Complex genomes (crops) – Reduced representation libraries (methyl-sensitive enzymes) – Transcriptome • Sequence Capture (Microarrays) » Agilent, Febit & Nimblegen » CGH arrays Challenges in variant discovery 1. Base quality & filtering (scoring threshold) 2. Sequencing errors vs. SNPs 1. To differentiate true polymorphisms from sequencing errors 2. Coverage of a given SNP region and redundancy of reads (coverage vs. number of samples) 3. Availability of a reference sequence (genome) 1. To separate unique vs. duplicated sequences 2. Duplication in one line but not another 3. Polymorphism rate in one line vs. another = need to set conditions for alignment 4. Paired-end sequencing can help unique read placement 5. Complex genomes = need to reduce complexity prior to sequencing 1. High repeat content (ex: ~80% in maize, ~70% in soy, 90% in sunflower…) 2. Gene duplications and genome plasticity (polyploidy, partial or whole genome duplications...) Reduced-representation libraries 1. DNA methylation in plants occurs at 5-methyl cytosine within CpG dinucleotides and CpNpG trinucleotides 2. Transposons and other repeats comprise the largest fraction of methylated DNA. Studies in Arabidopsis have shown that CG sites in the 3’ end of the transcribed regions of more than one third of all genes also are methylated (Zhang, Science, 320, 489, 2008). 3. Methylation is critically important in silencing transposons and regulating plant development (methylation in promoters appears to reduce transcription) PstI sites transposon transposon transposon PstI digestion Recover digested fraction (gel, column) P P P PP P P P Library Construction Genomic DNA Digestion with one methyl-sensitive restriction enzyme (RE) and fractionation Ligation of biotinylated RE-specific adapters 1 B B Digestion with 4-bp cutter (DpnII) GATC B Ligation of DpnII-specific adapter GATC CTAG B Binding to streptavidin column and digestion with RE GATC CTAG Ligation of RE-specific adapters 2 GATC CTAG PCR enrichment, gel purification, size selection (150-500bp fragments), cluster synthesis and sequencing (36 cycles) Deschamps et al. The Plant Genome (in press) SNP detection flowchart Basecalling, cropping last 4 bases & initial base-quality filter (for individual tags) Filtering and Condensing Condensing & optional consensus base-quality filter (for unitags sequences) Creating HQ unitag datasets (removing singlets) Comparing HQ unitag datasets from genotype “A” and genotype “B” using Vmatch Comparing two genotypes Filtering, to accept clusters with only two members (A, B) with exactly one mismatch Recovering matched HQ unitag sequences and SNP sites from Vmatch alignments Mapping to genome Mapping SNP-containing HQ unitags to reference sequence (genome), using a k-mer table (k=length of trimmed tags), and find copy numbers and locations. Capturing single-copy HQ unitags with up to a single-base mismatch to the reference sequence at the exact location of the putative SNP site for one or both genotypes. Example: one flow cell in soybean (Williams82 vs. Pintado) Run Metrics Williams82 Pintado Number of total reads generated (after initial basecalling) 37,666,279 38,000,474 Number of filtered total reads † 24,519,484 23,101,973 Number of unitags (generated from filtered total reads) 965,610 885,429 Number of high quality (HQ) unitags ‡ 255,918 246,102 Scatter Plot 100,000 Frequency Alignment of HQ unitags against the reference sequence: 10,000 1,000 100 10 Zero mismatch § 208,923 197,015 One mismatch § 27,770 27,699 Two or more mismatches § 1 10 19,225 21,388 152,185 144,559 100 log10(Depth) 1,000 10,000 Depth HQ unitags aligning uniquely to the reference sequence with zero mismatch † Filtered total reads defined as having a quality value for individual base greater than or equal to 15 ‡ HQ unitag reads defined as having a quality value for each base greater than or equal to 15, and with an individual read count greater than or equal to 2. § Best match to reference sequence of HQ unitag reads aligning uniquely or multiple times to the reference sequence 100,000 Results & Validation Experiments Q Score threshold: 15 Soy: Williams82 vs. Pintado Rice: Kasalath vs. Taichung65 Q Score threshold: 25 Soy: Williams82 vs. Pintado Rice: Kasalath vs. Taichung65 * Not Confirmed * Putative SNPs Confirmed Validation rate 1,682 2,618 163 162 5 6 97.0% 96.4% 702 2,148 168 174 2 1 98.8% 99.4% *SNPs confirmed/not confirmed via Sanger sequencing of PCR products for both genotypes Distribution of HQ unitags & SNPs related to annotated gene density (soybean) Gene Density (excluding TEs) in 500Kb window Coverage by HQ unitags in 70Kb window SNP Density in 70Kb window Distribution of HQ unitags & SNPs related to distance to annotated genes (excluding TEs) in soybean 38 .3 50.0 .6 21 .4 SNPs 18 .3 20 16 .2 20.0 24 .4 23 .8 22 .8 21 .9 24 .1 30.0 34 .5 33 .7 40.0 Pintado Williams82 10.0 0.0 intron CDS within_5k intergenic Intron, CDS and UTR coordinates determined from GFF annotation files Bioinformatic tools Alignment and Polymorphism Detection 1. SOAP – Short Oligonucleotide Alignment Program • Ruiqiang Li, Beijing Genomics Institute • http://soap.genomics.org.cn 2. MAQ – Mapping and Assembly with Quality • Heng Li, Sanger Centre • http://maq.sourceforge.net/maq-man.shtml 3. Bowtie - An ultrafast memory-efficient short read aligner • Ben Langmead and Cole Trapnell, University of Maryland • http://bowtie-bio.sourceforge.net/ 4. ssahaSNP – Tool to detect homozygous SNPs and indels • Adam Spargo and Zemin Ning, Sanger Centre • http://www.sanger.ac.uk/Software/analysis/ssahaSNP Bioinformatic tools Genomic Assembly 1. Velvet – De novo assembly of short reads • Daniel Zerbino and Ewan Birney, EMBL-EBI • http://www.ebi.ac.uk/~zerbino/velvet/ 2. SSAKE – Assembly of short reads • Rene Warren, et al, British Columbia Cancer Agency • http://bioinformatics.oxfordjournals.org/cgi/content/full/23/4/500 3. Euler – Genomic Assembly • Pavel Pevzner and Mark Chaisson, University of California, San Diego • http://nbcr.sdsc.edu/euler/ www.illumina.com Bioinformatic tools ChIP Sequencing 1. ChIP-Seq Peak Finder • Barbara Wold, Cal Tech and Rick Meyers, Stanford University • http://woldlab.caltech.edu/html/software/ Digital Gene Expression 1. Comparative Count Display • Alex Lash, NIH • ftp://ftp.ncbi.nlm.nih.gov/pub/sage/obsolete/bin/ccd/ 2. SAGE DGED Tool • Cancer Genome Anatomy Project • http://cgap.nci.nih.gov/SAGE/SDGED_Wizard?METHOD=SS10,LS10&O RG=Hs www.illumina.com Bioinformatic tools - Illumina Overview 1. Obtain Bustard reads and align against Genome with Eland 2. Aggregate and SNP call data with CASAVA 3. GenomeStudio™ wizard import of data 4. Examine coverage and quality in stacked alignment graphs for a selected region/chromosome 5. Export table of SNPs and consensus sequence Bioinformatic tools - Illumina Third-Generation Sequencing technologies: what’s next? Next-Generation Sequencing Second-generation platforms: Third-generation platforms: •454/Roche •Solexa/Illumina •SOLiD/ABI •Helicos BioSciences •Dover Systems •Complete Genomics •BioNanomatrix •VisiGen •Pacific Biosciences •Intelligent Bio-Systems •ZS Genetics •Reveo •LightSpeed Genomics •NABsys •Oxford Nanopore Technologies Pacific Biosciences • SMRT™ Technology (to be commercially launched Fall 2010) • Single DNA polymerase attached at bottom surface of nanometer-scale hole, incorporates in real-time fashion fluorescently labeled nucleotides to elongated strand of DNA • Elongated strand can be several thousands of nucleotides in length www.pacificbiosciences.com Pacific Biosciences 1. Small size of the hole favors rapid in-and-out diffusion of nucleotides and dye following their cleavage. Meanwhile, incorporated nucleotide is held within the detection volume for 10’s of milliseconds, order of magnitude longer than the time it takes for nucleotides to diffuse in and out of the hole, therefore decreasing background noise 2. Fluorescent dye is attached to the phosphate chain, rather than the base, and is cleaved when the nucleotide is incorporated to the DNA strand. => Decreased background noise and use of phospholinked nucleotides circumvents the need for successive cycles of incorporation, washing, scanning and removal of the label, therefore optimizing processivity of the enzyme and allowing longer read lengths => No need for washing decreases the consumption of reagents Nanopore Sequencing = the real $100 genome? 1. Sequencing-by-Synthesis requires lots of preparation, lots of reagents (polymerase, nucleotides, fluorescent labels...) and expensive detection systems. 2. Nanopore sequencing does not rely on amplification or labeling, and provides a direct electrical signal for base calling. It is based on a simple idea of “passing” DNA fragments through a nanometer-scale pore and detecting in a real-time fashion signal due to the DNA blocking the electrical current that runs through the pore 3. Oxford Nanopore: Protein nanopore 1. Long read lengths (1000’s) 2. High read accuracy 3. Current technical issues: • Exonuclease Alpha-hemolysin Attachment of the exonuclease to the pore • Parallelization (1,000’s of pores per chips) www.nanoporetech.com Cyclodextrin (encapsulate nucleotide) Questions?

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download deschamp_2009_sequencing