The Loblolly Pine Genome, v1

University of California – Davis: Kristian Stevens, Jill L. Wegrzyn, Marc Crepeau, Charis Cardeno, John Liechty, Pedro J Martinez-Garcia, Hans Vasquez-Gross, L. L. Zieve, Matt Dougherty, Brian Y. Lin, Pat McGuire, David Neale, Charles H. Langley
Johns Hopkins University S.M.: Daniela Puiu, Steven L. Salzberg
University of Maryland: Aleksey Zimin, G. Marçais, M. Roberts, James A. Yorke
Children's Hospital Oakland Res. Inst.: Ann Holtz-Morris, Maxim Koriabine, Pieter J. deJong
Texas A&M University: Carol Loopstra
Washington State University: Dorrie Main
Indiana University: Keithanne Mockaitis, S. Fuentes-Soriano, L. Wu, D. Gilbert
University of Utah: M. Yandell, C. Holt
University of Florida: J.M. Davis, K. Smith
University of Georgia: J.F.D. Dean, W.W. Lorenz
North Carolina State University: R.W. Whetten, R. Sederoff
Pennsylvania State University: Nicholas Wheeler

Background: huge and repetitive
•  Genome Size: ≈22,000,000,000 bp
•  Pine karyotype is highly conserved, n=12
•  Genome is mostly repetitive
•  But the repeats are ancient and diverged
•  Deep coverage is feasible with NGS
•  Technical challenges:
–  Low “error” rate -> local assembly
–  Linking at scales >> 150 bp

Outline – theme: reduction in complexity
•  Genomic DNA -> sequence
•  Error correction
•  Super-reads: local contigs
•  Linking sequences
•  Assembly
•  Scaffolding
•  Validation
•  Annotation
•  Application

Abstract
The size and complexity of conifer genomes (c. 20 to 40 Gb, 2n = 24) pose a formidable challenge to full genome sequencing and assembly. We developed a new approach to sequence the genome of loblolly pine (Pinus taeda L.). It leveraged unique aspects of pine reproductive biology and deployed recent advances in genome assembly methodology. We used whole genome shotgun sequencing based primarily on next generation sequence generated from a single haploid seed megagametophyte (conifer seed endosperm) from the loblolly pine tree genotype 20-1010 used in industrial forest tree breeding and research. The resulting sequence and assembly led to a draft genome spanning 23.2 billion base pairs and containing 20.1 billion sequenced bases with an N50 scaffold size of 66.9 Kbp.

Overview of Approach to Sequencing and Assembly
[Flow diagram]
•  Target megagametophyte (haploid, 1N): 11 paired-end libraries, 7.5B x 2 reads -> paired-end filters -> super-read reduction -> 150M super-reads
•  Parental needles (diploid, 2N): 48 mate-pair + 9 DiTag libraries, 900M x 2 reads -> mate-pair filters -> 300M x 2 reads
•  CABOG OLC assembly -> gap closing -> V1.0 -> more scaffolding -> V1.01

The conifer life cycle offers a haploid source of genomic DNA, thus removing the complexity inherent in assembling one genome from two.

Distribution of megagametophyte DNA yields
[Figure: histogram of DNA yields from P. taeda 20-1010 megagametophytes, 2010-2012 crops. x-axis: Microgram DNA Yield (0 to 4.5); y-axis: Count (0 to 45).]

Short-insert paired-end libraries from the single megagametophyte
[Figure, panels A and B: fragment-length distributions for libraries of 220, 234, 246, 260, 273 and 285 bp. x-axis: Fragment Length (bp), 210 to 300; y-axis: Fraction of Fragments Sampled, 0 to 0.14. Size markers: 100 to 900.]

Library complexity: diminishing returns of sequencing deeper
[Figure: curves labeled "Ideal" and "273bp".]

Genome size estimation from k-mer depth distributions: 20 Gbp
[Figure: Histogram of k-mer Depth for Haploid Data, for k = 24 and k = 31. x-axis: k-mer depth (0 to 250); y-axis: Number of distinct k-mers (0 to 5.0E+08). Annotated regions: erroneous k-mers, unique (haploid) k-mers, genomically repeated k-mers.]

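A common way to turn a k-mer depth histogram into a genome size estimate is to drop the low-depth error k-mers, then divide the total number of remaining k-mer occurrences by the depth of the main (unique, haploid) peak. The sketch below illustrates that idea only; the function name, the valley-finding rule and the toy histogram are assumptions made for illustration, not the procedure or the data behind the 20 Gbp figure.

```python
def estimate_genome_size(histogram):
    """Estimate genome size from a k-mer depth histogram.

    histogram maps depth -> number of distinct k-mers seen at that depth.
    Assumption: low-depth k-mers before the first valley are sequencing errors.
    """
    depths = sorted(histogram)
    # Valley: the last depth before the counts first start rising again.
    valley = depths[0]
    for prev, cur in zip(depths, depths[1:]):
        if histogram[cur] > histogram[prev]:
            valley = prev
            break
    # Keep only k-mers at or beyond the valley.
    solid = {d: n for d, n in histogram.items() if d >= valley}
    # Depth of the main peak, taken as the coverage of unique sequence.
    peak_depth = max(solid, key=solid.get)
    # Total k-mer occurrences divided by per-copy coverage.
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers / peak_depth


# Toy histogram (hypothetical numbers): error peak at depth 1-2,
# unique peak at depth 10, a few repeated k-mers at depth 20.
toy = {1: 900, 2: 100, 3: 10, 9: 50, 10: 100, 11: 50, 20: 5}
print(estimate_genome_size(toy))  # 213.0
```

Repeated k-mers contribute in proportion to their depth, which is why a repeat at twice the peak depth counts as two genomic copies.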
Paired-end filtering & super-reads
Using the QuORUM tool, the haploid short-insert paired-end reads were:
•  Corrected: each singleton 24-mer was dropped from the list of “good” k-mers.
•  Filtered: reads with singleton indels were discarded.
Reads containing known “contaminant” sequences were truncated, as were large-insert DiTag and mate-pair reads containing junctions.

Super-reads
•  Based on the observation that most of the sequence in
genomes is locally unique – branches are relatively rare
•  We can efficiently count k-mers in the data set of all reads
with Jellyfish, e.g. consider 10-mers (we use much longer k,
76, of course):
AGCTGACTGACTGGTAACAA
AGCTGACTGA
 GCTGACTGAC
•  Use all k-mers with counts > threshold T (e.g. T=1)
•  The idea is to make reads longer instead of breaking them
into k-mers.
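
The bullets above can be sketched in a few lines: count k-mers, keep those with counts above T, then extend each read base by base for as long as exactly one "good" k-mer continues it. This is a simplified illustration, not MaSuRCA or Jellyfish code: it ignores reverse complements and mate pairs, uses an in-memory Counter in place of Jellyfish, and the names `good_kmers` and `super_read` are hypothetical.

```python
from collections import Counter


def good_kmers(reads, k, threshold=1):
    """Return the set of k-mers whose count across all reads exceeds threshold."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer for kmer, n in counts.items() if n > threshold}


def super_read(read, good, k):
    """Extend a read in both directions while the extension is unique."""
    seq = read
    # k-mers already used; stops the loop if an extension cycles back.
    seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    # Extend to the right until a branch (several options) or a dead end.
    while True:
        tail = seq[-(k - 1):]
        options = [tail + b for b in "ACGT" if tail + b in good]
        if len(options) != 1 or options[0] in seen:
            break
        seen.add(options[0])
        seq += options[0][-1]
    # Extend to the left in the same way.
    while True:
        head = seq[:k - 1]
        options = [b + head for b in "ACGT" if b + head in good]
        if len(options) != 1 or options[0] in seen:
            break
        seen.add(options[0])
        seq = options[0][0] + seq
    return seq


# Three overlapping reads, each seen twice, drawn from the slide's sequence.
genome = "AGCTGACTGACTGGTAACAA"
reads = [genome[0:14], genome[3:17], genome[6:20]] * 2
good = good_kmers(reads, k=10, threshold=1)
print({super_read(r, good, 10) for r in reads})
```

In this toy case all six reads extend to the same sequence, which is the effect that lets many reads collapse into one super-read.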
100 Times Fewer Super-Reads than Reads
Many read extensions stop at the same branch points
•  Starting with 15 x 10^9 paired-end reads, average 120 bp
•  We produced ~150 x 10^6 super-reads – 100 times fewer
reads!
•  The super-reads contain 52 x 10^9 bp of sequence.
•  50% of that sequence is in super-reads of 500 bp or longer.
•  These are few enough and long enough to be assembled by an
Overlap-Layout-Consensus assembler, CABOG (son of the
Celera Assembler).
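
As a quick check on the scale of this reduction, the figures on this slide can be combined with the ≈22 Gbp genome size from the background slide. The variable names are illustrative and the derived values are simple arithmetic on the slide's rounded numbers, not results reported by the authors.

```python
# Figures quoted on the slides (rounded).
n_reads = 15 * 10**9           # paired-end reads
mean_read_len = 120            # bp
n_super_reads = 150 * 10**6    # super-reads
super_read_bases = 52 * 10**9  # bp contained in super-reads
genome_size = 22 * 10**9       # bp, from the background slide

# How many times fewer sequences the assembler has to handle.
reduction = n_reads / n_super_reads

# Raw sequencing depth implied by the read count and mean length.
raw_bases = n_reads * mean_read_len
raw_coverage = raw_bases / genome_size

# Average super-read length and depth of the super-read set.
mean_super_read_len = super_read_bases / n_super_reads
super_read_coverage = super_read_bases / genome_size

print(f"{reduction:.0f}x fewer sequences")
print(f"raw coverage ~{raw_coverage:.0f}x, super-read coverage ~{super_read_coverage:.1f}x")
print(f"mean super-read length ~{mean_super_read_len:.0f} bp")
```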
Yield from long-insert mate-pair libraries

Estimated jump   Library   Reads sequenced   After error corr.   Non-junction   Redundant
size (bp)        count     [x10^6]           and mapping*        pairs          reads
1000-1999        5         127.3             67%                 7%             12%
2000-2999        16        651.9             66%                 26%            11%
3000-3999        18        705.4             67%                 6%             20%
4000-4999        6         186.6             69%                 5%             11%
5000-5500        3         55.3              69%                 15%            61%

Yielding 37X clone coverage.
And 46 M fosmid DiTags, which yielded 4.5 M distinctly mapping read pairs.

Overview of Assembly
[Flow diagram: 7.5B x 2 Illumina reads -> filters -> reduction -> 150M super-reads; 900M x 2 reads -> filters -> 300M x 2 reads; CABOG OLC assembly -> gap closing -> V1.0 -> more scaffolding -> V1.01]

V1.0: MaSuRCA Output
•  20.1 Gbp spanning 22.6 Gbp
•  Contig N50 = 8206 bp
•  Scaffold N50 = 30.7 Kbp

SOAP Denovo Assembly
•  Scaffold N50 = 54.7 Kbp
•  Contig N50 = 687 bp

V1.01:
•  20.1 Gbp spanning 23.2 Gbp
•  Contig N50 = 8206 bp
•  Scaffold N50 = 66.9 Kbp

More scaffolding (transcriptome assembly): rescaffolding using the SOAPdenovo2 scaffolder and assembled transcripts

                                  P. taeda 1.0      P. taeda 1.01
Total sequence in contigs (bp)    20,148,103,497    20,148,103,497
Total span of scaffolds (bp)      22,564,679,219    23,180,477,227
N50 contig size (bp)              8,206             8,206
N50 scaffold size (bp)            30,681            66,920
Number of contigs > 500 bp        4,047,642         4,047,642
Number of scaffolds > 500 bp      2,319,749         2,158,326
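
For readers unfamiliar with the N50 statistic used throughout these slides: it is the length L such that sequences of length L or longer contain at least half of the total bases. The sketch below is a minimal illustration with made-up lengths. The gap fraction at the end is plain arithmetic on the v1.01 totals quoted in the slides, under the assumption that the difference between scaffold span and contig sequence is gap.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50() needs at least one length")


# Made-up lengths: total 54, so the half-way point (27) is reached at length 8.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8

# Share of the v1.01 scaffold span that is not covered by contig sequence.
contig_bases = 20_148_103_497
scaffold_span = 23_180_477_227
gap_fraction = (scaffold_span - contig_bases) / scaffold_span
print(f"gap fraction: {gap_fraction:.1%}")
```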
Validation
•  Comparison to an assembly from a pool of
5500 fosmids: 109 Mbp; 0.5X; 98.6% aligned.
•  Representation of highly conserved proteins
using the CEGMA pipeline
–  248 protein families aligned to the genome
–  P. taeda v1.0: 45% full-length & 79% partial
–  P. taeda v1.01: 75% full-length & 82% partial
–  Fraction of all alignments that are full-length jumps
from 57% to 91%.
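
The 57% and 91% figures follow from the two lines above them if the "partial" percentage is read as counting every aligned family, full-length ones included. That reading is an assumption; the check below only reproduces the arithmetic under it.

```python
# CEGMA results quoted above, as percentages of the 248 protein families.
cegma = {
    "P. taeda v1.0": {"full_length": 45, "partial": 79},
    "P. taeda v1.01": {"full_length": 75, "partial": 82},
}

# Fraction of all alignments that are full-length, per assembly version.
full_length_share = {
    version: round(100 * pct["full_length"] / pct["partial"])
    for version, pct in cegma.items()
}

for version, share in full_length_share.items():
    print(f"{version}: {share}% of alignments are full-length")
```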
Comparison to recently published conifer genomes

                                              Loblolly pine    Norway spruce    White spruce
Species                                       Pinus taeda      Picea abies      Picea glauca
Cytometrically estimated genome size (Gbp)    21.6§            19.6†            15.8§§
Total scaffold span (Gbp)                     22.6             12.3             20.8
Total contig span* (Gbp)                      20.1             12.0             20.8
Referenced genome size estimate (Gbp)         22               18               20
N50 contig size (Kbp)                         8.2              0.6              5.4
N50 scaffold size (Kbp)                       66.9             0.72             22.9
Number of scaffolds                           14412985         10253693         7084659
CEGMA (90): complete                          74%              50%              38%
CEGMA (90): partial                           8%               26%              74%
CEGMA (90): annotated full-length             91%              66%              52%

CEGMA (90): annotation of the 248 conserved genes.

Cumulative distribution of scaffold size
The End