Data analysis methods for next-generation sequencing technologies
Gabor T. Marth
Boston College Biology Department
Epigenomics & Sequencing Meeting
July 14-15, 2008, Boston, MA
T1. Roche / 454 FLX system
• pyrosequencing technology
• variable read-length
• the only new technology with >100bp reads
• tested in many published applications
• supports paired-end read protocols with up to 10kb
separation size
T2. Illumina / Solexa Genome Analyzer
• fixed-length short-read sequencer
• read properties are very close to those of traditional capillary
sequences
• very low INDEL error rate
• tested in many published applications
• paired-end read protocols support short (<600bp)
separation
T3. AB / SOLiD system
• fixed-length short-read sequencer
• employs a 2-base encoding system
that can be used for error reduction
and improving SNP calling accuracy
• requires color-space informatics
• published applications underway /
in review
• paired-end read protocols support
up to 10kb separation size
2-base encoding (rows: 1st Base, columns: 2nd Base):
      A  C  G  T
  A   0  1  2  3
  C   1  0  3  2
  G   2  3  0  1
  T   3  2  1  0
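The 2-base encoding can be sketched in a few lines of Python. This is an illustrative sketch only, not vendor software: the function names are hypothetical, and it relies on the observation that, with A, C, G, T numbered 0 to 3, each color in the encoding table equals the XOR of the two base indices.

```python
# Illustrative color-space (2-base) encoding sketch; names are hypothetical.
BASES = "ACGT"

def color(first: str, second: str) -> int:
    """Color of a dinucleotide: XOR of the 2-bit indices of its bases."""
    return BASES.index(first) ^ BASES.index(second)

def encode(seq: str) -> list:
    """Translate a base sequence into its list of colors."""
    return [color(a, b) for a, b in zip(seq, seq[1:])]

def decode(first_base: str, colors: list) -> str:
    """Rebuild the bases from the known first base and the colors."""
    out = [first_base]
    for c in colors:
        out.append(BASES[BASES.index(out[-1]) ^ c])
    return "".join(out)

reference = "ATGGC"
variant = "ATCGC"  # one substituted base (G -> C)
changed = [i for i, (x, y) in enumerate(zip(encode(reference), encode(variant))) if x != y]
print(encode(reference), encode(variant), changed)
```

A true single-base substitution changes two adjacent colors, whereas an isolated color-call error changes only one, which is the property that color-space informatics can use for error reduction.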
T4. Helicos / Heliscope system
• experimental short-read
sequencer system
• single molecule sequencing
• no amplification
• variable read-length
• error rate reduced with 2-pass template sequencing
A1. Variation discovery: SNPs and short-INDELs
1. sequence alignment
2. dealing with non-unique mapping
3. looking for allelic differences
A2. Structural variation detection
• structural variations (deletions, insertions, inversions and translocations) from
paired-end read map locations
• copy number (for amplifications, deletions) from depth of read coverage
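As a rough illustration of the depth-of-coverage idea, the sketch below averages per-base read depth in fixed windows and scales it against the depth expected for two copies. The function name, the window size and the assumption of a known diploid depth are all hypothetical; real callers also correct for mappability and representational biases.

```python
# Minimal sketch: copy number from windowed depth of read coverage.
# Assumes per-base depths and a known expected depth for 2 copies.
def windowed_copy_number(depths, window, diploid_depth):
    """Return one copy-number estimate per non-overlapping window."""
    estimates = []
    for start in range(0, len(depths) - window + 1, window):
        mean_depth = sum(depths[start:start + window]) / window
        estimates.append(2.0 * mean_depth / diploid_depth)
    return estimates

# Toy data: normal region, heterozygous deletion, amplification.
depths = [30] * 10 + [15] * 10 + [60] * 10
print(windowed_copy_number(depths, window=10, diploid_depth=30))
```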
A3. Identification of protein-bound DNA
[Figure: reads aligned to the genome sequence]
Chromatin structure (CHIP-SEQ): Mikkelsen et al. Nature 2007
Transcription factor binding sites: Robertson et al. Nature Methods 2007
A4. Novel transcript discovery (genes)
Mortazavi et al. Nature Methods
A5. Novel transcript discovery (miRNAs)
Ruby et al. Cell, 2006
A6. Expression profiling by tag counting
[Figure: aligned reads counted over each gene]
Jones-Rhoades et al. PLoS Genetics, 2007
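Tag counting reduces to counting aligned reads per annotated gene. The sketch below is a deliberately naive version: the gene names and coordinates are made up, reads are represented only by their mapped start position, and intervals are half-open.

```python
# Naive tag-counting sketch; gene names and coordinates are hypothetical.
def count_tags(read_starts, genes):
    """Count reads whose mapped start falls inside each gene interval."""
    counts = {name: 0 for name in genes}
    for pos in read_starts:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
    return counts

genes = {"geneA": (100, 200), "geneB": (500, 800)}
print(count_tags([120, 150, 199, 200, 510, 900], genes))
```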
A7. De novo organismal genome sequencing
Lander et al. Nature 2001
[Figure: short reads, read pairs and longer reads combined into assembled sequence contigs]
C1. Read length
[Figure: read length ranges by technology, x-axis: read length [bp], 0 - 400; ranges shown: 20-35 (var), 25-35 (fixed), 25-40 (fixed), ~200-450 (var)]
When does read length matter?
• short reads often sufficient where the
entire read length can be used for mapping:
SNPs, short-INDELs, SVs
CHIP-SEQ
short RNA discovery
counting (mRNA, miRNA)
• longer reads are needed where one must use parts of reads for mapping:
de novo sequencing
novel transcript discovery
[Figure: example reads aacttagacttaca, gacttacatacgta and accgattactatacta, with Known exon 1 and Known exon 2]
C2. Read error rate
• error rate typically 0.4 - 1%
• error rate dictates the stringency of the read mapper
• the more errors the aligner must tolerate, the lower the fraction of the
reads that can be uniquely aligned
[Figure: fraction of genome (0.00 - 0.40) vs. number of mismatches allowed (0 - 2)]
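To see why the error rate dictates mapper stringency, one can estimate how many reads would survive a given mismatch limit. The sketch below assumes errors are independent and uniform along the read (a simplification) and uses a binomial model; the function name is hypothetical.

```python
# Sketch: fraction of reads with at most k sequencing errors,
# assuming independent errors at a constant per-base rate.
from math import comb

def prob_at_most_k_errors(read_length, error_rate, k):
    """Binomial probability that a read carries k or fewer errors."""
    return sum(
        comb(read_length, i) * error_rate ** i * (1 - error_rate) ** (read_length - i)
        for i in range(k + 1)
    )

for k in range(3):
    print(k, round(prob_at_most_k_errors(35, 0.01, k), 4))
```

Under these assumptions, at a 1% error rate a mapper that allows no mismatches would lose roughly 30% of 35bp reads, while allowing two mismatches retains more than 99% of them.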
Error rate grows with each cycle
• this phenomenon limits useful read length
[Figure: error rate (0 - 10%) and measured QV (0 - 40) vs. position on read (0 - 40)]
Substitutions vs. INDEL errors
C3. Representational biases / library complexity
[Figure: fragmentation biases, PCR amplification biases and sequencing biases
lead to low/no representation of some regions and high representation of others]
Dispersal of read coverage
• this affects variation discovery (deeper starting read coverage is needed)
• its major impact should be on counting applications
Amplification errors
early amplification error gets propagated onto every clonal copy
many reads from clonal copies of a single fragment
• early PCR errors in “clonal” read copies lead to false positive allele calls
C4. Paired-end reads
• fragment amplification:
fragment length 100 - 600 bp
• fragment length limited by
amplification efficiency
• circularization: 500bp - 10kb (sweet spot ~3kb)
• fragment length limited by library complexity
Korbel et al. Science 2007
• paired-end read can improve read mapping accuracy (if unique map positions
are required for both ends) or efficiency (if fragment length constraint is used to
rescue non-uniquely mapping ends)
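The "rescue" use of the fragment length constraint can be sketched as follows. This is a simplified illustration: positions are plain coordinates on one chromosome, strand and read length are ignored, and the function name and the 100 - 600bp bounds in the example are assumptions for the sketch.

```python
# Sketch: rescue a non-uniquely mapping end using its uniquely mapped mate.
def rescue_mate(anchor_pos, mate_candidates, min_frag, max_frag):
    """Keep only mate positions consistent with the fragment-length range."""
    return [
        pos for pos in mate_candidates
        if min_frag <= abs(pos - anchor_pos) <= max_frag
    ]

# One end maps uniquely at 1000; the other has three candidate locations.
print(rescue_mate(1000, [1250, 50000, 900000], min_frag=100, max_frag=600))
```

If exactly one candidate survives, the ambiguous end can be placed; if none or several survive, the pair stays unresolved.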
Technologies / properties / applications
Technology                   Roche/454              Illumina/Solexa   AB/SOLiD
Read properties
  Read length                200-450bp              20-50bp           25-50bp
  Error rate                 <0.5%                  <1.0%             <0.5%
  Dominant error type        INDEL                  SUB               SUB
  Quality values available   yes                    yes               not really
  Paired-end separation      < 10kb (3kb optimal)   100 - 600bp       500bp - 10kb (3kb optimal)
Applications
  CHIP-SEQ                   ○                      ●                 ●
  small RNA/gene discovery   ○                      ●                 ●
  mRNA Xcript discovery      ●                      ○                 ○
  Expression profiling       ○                      ●                 ●
  De novo sequencing         ●                      ?                 ?
  SNP discovery, short-INDEL discovery, SV discovery: ● ● ○ ● ○ ○ ○ ● (symbols in transcript order; one cell is missing, so the per-technology assignment cannot be recovered)
Resequencing-based SNP discovery
(i) base calling
(ii) micro-repeat analysis
(iii) read mapping (pair-wise alignment to genome reference)
(iv) read assembly
(v) SNP calling
(vi) SNP validation
(vii) data viewing, hypothesis generation
[Figure labels: REF, IND]
The “toolbox”
• base callers
• microrepeat finders
• read mappers
• SNP callers
• structural variation callers
• assembly viewers
Reference guided read mapping
Reference-sequence guided mapping:
…you get the pieces…
…AND they give you the
cover on the box
Some pieces are more unique than others
MOSAIK: an anchored aligner / assembler
Step 1. initial short-hash scan for possible read locations
Step 2. evaluation of candidate locations with the Smith-Waterman (SW) method
Michael Stromberg
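The two steps can be illustrated with a toy seed-and-extend mapper. This is a sketch of the general technique, not MOSAIK's implementation: the function names, the k-mer size, the padding and the scoring values are all assumptions chosen for the example.

```python
# Toy seed-and-extend mapper: k-mer hash scan, then Smith-Waterman scoring.
from collections import defaultdict

def build_hash(reference, k):
    """Index every k-mer of the reference by its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_starts(read, index, k):
    """Step 1: implied read start positions from shared k-mers."""
    starts = set()
    for j in range(len(read) - k + 1):
        for i in index.get(read[j:j + k], ()):
            starts.add(i - j)
    return sorted(starts)

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Step 2: best local alignment score (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

def map_read(read, reference, k=4, pad=2):
    """Return (score, start) of the best-scoring candidate, or None."""
    index = build_hash(reference, k)
    hits = []
    for s in candidate_starts(read, index, k):
        lo = max(0, s - pad)
        hi = min(len(reference), s + len(read) + pad)
        hits.append((smith_waterman_score(read, reference[lo:hi]), s))
    return max(hits) if hits else None

reference = "GGGTTTACGATCCGTAAGGCTTTGGG"
read = "ACGATCCGTTAGGC"  # copy of reference[6:20] with one mismatch
print(map_read(read, reference))
```

The hash scan is cheap and tolerant of a mismatch as long as some k-mer in the read is error-free; the more expensive Smith-Waterman step is then run only on the few candidate windows.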
Non-unique mapping, gapped alignments
1. Non-unique read mapping: optionally
either only report uniquely mapped reads
or report all map locations for each read
(mapping quality values for all mapped
reads are being implemented)
2. Gapped alignments: allow for mapping
reads with insertion or deletion sequencing
errors, and reads with bona fide INDEL
alleles
Read types aligned, paired-end read strategy
3. Aligns and co-assembles customary read types:
ABI/capillary
Illumina/Solexa
AB/SOLiD
Roche/454
Helicos/Heliscope
[Figure: co-assembly of ABI/capillary, 454 FLX, 454 GS20 and Illumina reads]
4. Paired-end read alignments
Other mainstream read mappers
• ELAND (Tony Cox, Illumina)
-- the “official” read mapper supplied by Illumina, fast
• MAQ (Li Heng + Richard Durbin, Sanger)
-- the most widely used read mapper, low RAM footprint
• SOAP (Beijing Genomics Institute)
-- a new mapper developed for human next-gen reads
• SHRIMP (Michael Brudno, University of Toronto)
-- full Smith-Waterman
Speed
Polymorphism / mutation detection
sequencing
error
polymorphism
Determining genotype directly from sequence
individual 1
AACGTTAGCATA
AACGTTAGCATA
AACGTTCGCATA
AACGTTCGCATA
A/C
individual 2
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
C/C
individual 3
AACGTTAGCATA
AACGTTAGCATA
A/A
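The example above can be reproduced with a naive allele-counting caller. This is only a sketch with a hypothetical fixed-fraction threshold; it is not the Bayesian method used by the software described next, and it ignores base quality values.

```python
# Naive genotype call from the reads of one individual at one position.
from collections import Counter

def call_genotype(reads, pos, min_fraction=0.2):
    """Report the one or two best-supported alleles at a 0-based position."""
    counts = Counter(read[pos] for read in reads)
    total = sum(counts.values())
    alleles = [b for b, n in counts.most_common(2) if n / total >= min_fraction]
    alleles.sort()
    if len(alleles) == 1:
        alleles = alleles * 2  # homozygous call
    return "/".join(alleles)

individual_1 = ["AACGTTAGCATA", "AACGTTAGCATA", "AACGTTCGCATA", "AACGTTCGCATA"]
individual_2 = ["AACGTTCGCATA"] * 4
individual_3 = ["AACGTTAGCATA"] * 2
for reads in (individual_1, individual_2, individual_3):
    print(call_genotype(reads, pos=6))
```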
Software
GigaBayes
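\[
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \, P_{\mathrm{Prior}}(S_1, \ldots, S_N)}
{\sum\limits_{S_{i_1} \in [A,C,G,T]} \cdots \sum\limits_{S_{i_N} \in [A,C,G,T]} \dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \, P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})}
\]
[Figure panels: SNP, INS]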
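For a handful of sequences, the P(SNP) expression can be evaluated by brute-force enumeration of all base configurations. The sketch below is an assumption-laden illustration rather than GigaBayes itself: it uses a uniform per-base prior of 0.25 and a hypothetical joint prior in which monomorphic configurations share probability 1 - theta and polymorphic configurations share theta.

```python
# Brute-force sketch of the Bayesian P(SNP) computation for small N.
from itertools import product

BASES = "ACGT"

def config_prior(config, theta):
    """Assumed joint prior P_Prior(S_1, ..., S_N); theta is hypothetical."""
    if len(set(config)) == 1:
        return (1.0 - theta) / 4
    return theta / (4 ** len(config) - 4)

def p_snp(posteriors, theta=0.001, base_prior=0.25):
    """posteriors[i][b] is P(S_i = b | R_i); returns P(SNP)."""
    total = 0.0
    variable = 0.0
    for config in product(BASES, repeat=len(posteriors)):
        weight = config_prior(config, theta)
        for post, base in zip(posteriors, config):
            weight *= post[base] / base_prior
        total += weight
        if len(set(config)) > 1:  # "all variable" configurations
            variable += weight
    return variable / total

def confident(base, p=0.997):
    """Posterior strongly favoring one base (illustrative values)."""
    return {b: (p if b == base else (1 - p) / 3) for b in BASES}

print(p_snp([confident("A")] * 4))
print(p_snp([confident("A"), confident("A"), confident("C"), confident("C")]))
```

Enumeration grows as 4^N, so this form is only practical for very small N; it is meant to make the structure of the numerator and the normalizing sum concrete.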
Data visualization
1. aid software development: integration of trace data viewing, fast
navigation, zooming/panning
2. facilitate data validation (e.g. SNP validation): co-viewing of multiple
read types, quality value displays
3. promote hypothesis generation: integration of annotation tracks
Weichun Huang
Applications
1. SNP discovery in shallow, single-read 454
coverage
(Drosophila melanogaster)
2. SNP and INDEL discovery in deep Illumina
short-read coverage
(Caenorhabditis elegans)
3. Mutational profiling in deep 454 and Illumina
read data
(Pichia stipitis)
(image from Nature Biotech.)
Our software is available for testing
http://bioinformatics.bc.edu/marthlab/Beta_Release
Credits
Elaine Mardis (Washington University)
Andy Clark (Cornell University)
Doug Smith (Agencourt)
Research supported by: NHGRI (G.T.M.)
BC Presidential Scholarship (A.R.Q.)
Michael Stromberg
Michele Busby
Aaron Quinlan
Chip Stewart
Damien Croteau-Chonka
Eric Tsung
Derek Barnett
Weichun Huang
http://bioinformatics.bc.edu/marthlab
Accuracy
• As is the case for all heuristic alignment algorithms, accuracy and speed are
option- and parameter-dependent
C3. Quality values are important for allele calling
• PHRED base quality values represent the estimated likelihood of sequencing
error and help us pick out true alternate alleles
• inaccurate or poorly calibrated base quality values hinder allele calling
Q-values should be accurate … and high!
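The PHRED scale relates a quality value Q to the estimated error probability p by Q = -10 log10(p). The short sketch below converts in both directions and adds a simple calibration check; the function names and the toy counts are illustrative assumptions.

```python
# PHRED quality values: Q = -10 * log10(error probability).
import math

def phred_to_error_prob(q):
    """Estimated probability that a base call with quality q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Quality value corresponding to error probability p."""
    return -10 * math.log10(p)

def observed_quality(n_errors, n_bases):
    """Empirical quality of a bin of bases, for checking calibration."""
    return error_prob_to_phred(n_errors / n_bases)

print(phred_to_error_prob(20))       # Q20: 1 error in 100 bases
print(observed_quality(10, 10000))   # compare with the assigned Q of the bin
```

If bases assigned Q30 show an observed quality well below 30, the quality values are over-optimistic and allele calls that rely on them will include more false positives.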
Software tools for next-gen sequence analysis
Next-generation sequencing technologies and applications