The Loblolly Pine Genome, v1

University of California, Davis: Kristian Stevens, Jill L. Wegrzyn, Marc Crepeau, Charis Cardeno, John Liechty, Pedro J. Martinez-Garcia, Hans Vasquez-Gross, L. L. Zieve, Matt Dougherty, Brian Y. Lin, Pat McGuire, David Neale, Charles H. Langley
Johns Hopkins University S.M.: Daniela Puiu, Steven L. Salzberg
University of Maryland: Aleksey Zimin, G. Marçais, M. Roberts, James A. Yorke
Children's Hospital Oakland Res. Inst.: Ann Holtz-Morris, Maxim Koriabine, Pieter J. deJong
Texas A&M University: Carol Loopstra
Washington State University: Dorrie Main
Indiana University: Keithanne Mockaitis, S. Fuentes-Soriano, L. Wu, D. Gilbert
University of Utah: M. Yandell, C. Holt
University of Florida: J.M. Davis, K. Smith
University of Georgia: J.F.D. Dean, W.W. Lorenz
North Carolina State University: R.W. Whetten, R. Sederoff
Pennsylvania State University: Nicholas Wheeler

Background: huge and repetitive
• Genome size: ≈22,000,000,000 bp
• Pine karyotype is highly conserved, n=12
• Genome is mostly repetitive
• But the repeats are ancient and diverged
• Deep coverage is feasible with NGS
• Technical challenges:
  – Low "error" rate -> local assembly
  – Linking at scales >> 150 bp

Outline – theme: reduction in complexity
• Genomic DNA -> sequence
• Error correction
• Super-reads: local contigs
• Linking sequences
• Assembly
• Scaffolding
• Validation
• Annotation
• Application

Abstract
The size and complexity of conifer genomes (c. 20 to 40 Gb, 2n = 24) pose a formidable challenge to full genome sequencing and assembly. We developed a new approach to sequence the genome of loblolly pine (Pinus taeda L.). It leveraged unique aspects of pine reproductive biology and deployed recent advances in genome assembly methodology.
We used whole genome shotgun sequencing based primarily on next-generation sequence data generated from a single haploid seed megagametophyte (conifer seed endosperm) from the loblolly pine tree genotype 20-1010, which is used in industrial forest tree breeding and research. The resulting sequence and assembly led to a draft genome spanning 23.2 billion base pairs and containing 20.1 billion sequenced bases, with an N50 scaffold size of 66.9 Kbp.

Overview of Approach to Sequencing and Assembly
[Flow diagram]
• Target megagametophyte (haploid, 1N): 11 paired-end libraries -> 7.5B x 2 reads -> paired-end filters -> super-read reduction -> 150M super-reads
• Parental needles (diploid, 2N): 48 mate-pair + 9 DiTag libraries -> 900M x 2 reads -> mate-pair filters -> 300M x 2 reads
• CABOG OLC assembly -> gap closing -> V1.0 -> more scaffolding -> V1.01

The conifer life cycle offers a haploid source of genomic DNA, removing the complexity inherent in assembling one genome from two.

Distribution of megagametophyte DNA yields
[Figure: histogram of DNA yields from P. taeda 20-1010 megagametophytes, 2010-2012 crops. x-axis: microgram DNA yield (0 to 4.5); y-axis: count (0 to 45).]

Short-insert paired-end libraries from the single megagametophyte
[Figure: panel A) size markers from 100 to 900; panel B) fragment length distributions for libraries of 220, 234, 246, 260, 273 and 285 bp. x-axis: fragment length (bp), 210 to 300; y-axis: fraction of fragments sampled, 0 to 0.14.]

Library complexity: diminishing returns of sequencing deeper
[Figure: curves labeled "Ideal" and "273 bp".]

Overview of Approach to Sequencing and Assembly
[Flow diagram, as above]

Genome size estimation from k-mer depth distributions: 20 Gbp
[Figure: histogram of k-mer depth for haploid data, k = 24 and k = 31. x-axis: k-mer depth (0 to 250); y-axis: number of distinct k-mers (0 to 5.0E+08). Regions labeled: erroneous k-mers, unique (haploid) k-mers, genomically repeated k-mers.]

Paired-end filtering & super-reads
Using the QuORUM tool, the haploid short-insert paired-end reads were:
• Corrected: each singleton 24-mer was dropped from the list of "good" k-mers.
• Filtered: reads with singleton indels were discarded.
Reads containing known "contaminant" sequences were truncated, as were large-insert DiTag and mate-pair reads containing junctions.

Super-reads
• Based on the observation that most of the sequence in genomes is locally unique; branches are relatively rare.
• We can efficiently count k-mers in the data set of all reads with Jellyfish, e.g. consider 10-mers (we use much longer k, 76, of course):
    AGCTGACTGACTGGTAACAA
    AGCTGACTGA
     GCTGACTGAC
• Use all k-mers with counts > threshold T (e.g. T=1).
• The idea is to make reads longer instead of breaking them into k-mers.

100 Times Fewer Super-Reads than Reads
Many read extensions stop at the same branch points.
• Starting with 15 x 10^9 paired-end reads, average 120 bp
• We produced ~150 x 10^6 super-reads: 100 times fewer reads!
• The super-reads contain 52 x 10^9 bp of sequence.
• 50% of that sequence is in super-reads of 500 bp or longer.
• These are few enough and long enough to be assembled by an Overlap-Layout-Consensus assembler, CABOG (son of the Celera assembler).

Overview of Approach to Sequencing and Assembly
[Flow diagram, as above]

Yield from long-insert mate-pair libraries

Estimated jump   Library   Reads sequenced   After error corr.   Non-junction   Redundant
size (bp)        count     [x10^6]           and mapping*        pairs          reads
1000-1999        5         127.3             67%                 7%             12%
2000-2999        16        651.9             66%                 26%            11%
3000-3999        18        705.4             67%                 6%             20%
4000-4999        6         186.6             69%                 5%             11%
5000-5500        3         55.3              69%                 15%            61%

• Yielding 37X clone coverage.
• And 46 M fosmid DiTags, which yielded 4.5 M distinctly mapping read pairs.

Overview of Assembly
[Flow diagram: 7.5B x 2 Illumina reads -> filters -> super-read reduction -> 150M super-reads; 900M x 2 reads -> filters -> 300M x 2 reads; CABOG OLC assembly -> gap closing -> V1.0 -> more scaffolding (with the transcriptome assembly) -> V1.01]
• V1.0 (MaSuRCA output): 20.1 Gbp spanning 22.6 Gbp; contig N50 = 8206 bp; scaffold N50 = 30.7 Kbp
• SOAPdenovo assembly: scaffold N50 = 54.7 Kbp; contig N50 = 687 bp
• V1.01: 20.1 Gbp spanning 23.2 Gbp; contig N50 = 8206 bp; scaffold N50 = 66.9 Kbp

Rescaffolding using the SOAPdenovo2 scaffolder and assembled transcripts

                                   P. taeda 1.0      P. taeda 1.01
Total sequence in contigs (bp)     20,148,103,497    20,148,103,497
Total span of scaffolds (bp)       22,564,679,219    23,180,477,227
N50 contig size (bp)               8,206             8,206
N50 scaffold size (bp)             30,681            66,920
Number of contigs > 500 bp         4,047,642         4,047,642
Number of scaffolds > 500 bp       2,319,749         2,158,326

Validation
• Comparison to an assembly from a pool of 5500 fosmids: 109 Mbp; 0.5X; 98.6% aligned.
• Representation of highly conserved proteins using the CEGMA pipeline
  – 248 protein families aligned to the genome
  – P. taeda v1.0: 45% full-length & 79% partial
  – P. taeda v1.01: 75% full-length & 82% partial
  – The fraction of all alignments that are full-length jumps from 57% to 91%.
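
The contig and scaffold N50 values quoted in these slides follow the usual definition: the length L such that sequences of length L or longer contain at least half of the total bases. A minimal sketch of that calculation is below. The helper name `n50` and the toy lengths are invented for illustration; this is not code from the MaSuRCA or SOAPdenovo pipelines.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the largest length L such that sequences of length >= L
    together cover at least half of the total bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)   # longest first
    half = sum(ordered) / 2                   # half of the total bases
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Toy example with invented scaffold lengths (not loblolly pine data).
toy_scaffolds = [100, 80, 60, 40, 20]
print(n50(toy_scaffolds))  # prints 80: 100 + 80 = 180 >= 150
```

Because the statistic depends only on the sorted lengths, joining scaffolds (as in the V1.0 to V1.01 rescaffolding) can raise the scaffold N50 while leaving the contig N50 unchanged.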
Comparison to recently published conifer genomes

                                              Loblolly pine    Norway spruce    White spruce
                                              (Pinus taeda)    (Picea abies)    (Picea glauca)
Cytometrically estimated genome size (Gbp)    21.6§            19.6†            15.8§§
Total scaffold span (Gbp)                     22.6             12.3             20.8
Total contig span* (Gbp)                      20.1             12.0             20.8
Referenced genome size estimate (Gbp)         22               18               20
N50 contig size (Kbp)                         8.2              0.6              5.4
N50 scaffold size (Kbp)                       66.9             0.72             22.9
Number of scaffolds                           14,412,985       10,253,693       7,084,659
CEGMA (90) annotation of the 248 conserved genes:
  complete                                    74%              50%              38%
  partial                                     8%               26%              74%
  annotated full-length                       91%              66%              52%

Cumulative distribution of scaffold size
[Figure]

The End