Estimation of alternative splicing isoform frequencies from RNA-Seq data
Ion Mandoiu
Computer Science and Engineering Department
University of Connecticut
Joint work with Marius Nicolae, Serghei Mangul, and Alex Zelikovsky
Outline
• Introduction
• EM Algorithm
• Results
• Conclusions and future work
Alternative Splicing
[Nilsen & Graveley 10]
RNA-Seq
Make cDNA & shatter into fragments
Sequence fragment ends
Map reads
[Figure: exon diagrams (exons A to E) for the three analysis tasks below]
Gene Expression (GE)
Isoform Discovery (ID)
Isoform Expression (IE)
Gene Expression Challenges
• Read ambiguity (multireads)
• What is the gene length?
[Figure: gene with exons A to E]
Previous approaches to GE
• Ignore multireads
• [Mortazavi et al. 08]
– Fractionally allocate multireads based on unique read estimates
• [Pasaniuc et al. 10]
– EM algorithm for solving ambiguities
• Gene length: sum of lengths of exons that appear in at least one isoform
– Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]
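The gene-length convention above (total length of exons appearing in at least one isoform) can be sketched in Python as an interval union. This is an illustrative sketch only: the function name `union_exon_length` and the half-open `(start, end)` exon representation are assumptions, not part of the talk.

```python
def union_exon_length(isoforms):
    """Length covered by at least one exon of any isoform.

    isoforms: list of isoforms, each a list of (start, end) exon
    intervals in half-open coordinates (an assumed representation).
    """
    intervals = sorted(exon for iso in isoforms for exon in iso)
    total = 0
    cur_start, cur_end = None, None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            # Close the previous merged interval, if any, and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent exon: extend the current merged interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Hypothetical gene: the second isoform skips the middle exon.
iso_long = [(0, 100), (200, 300), (400, 500)]   # isoform length 300
iso_short = [(0, 100), (400, 500)]              # isoform length 200
gene_length = union_exon_length([iso_long, iso_short])  # 300
```

In this toy gene the union length equals the longer isoform's length, so if most reads come from the shorter isoform, dividing read counts by the union length yields a lower per-base rate than dividing by the shorter isoform's own length. That is one plausible way to read the underestimation noted above.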
Read Ambiguity in IE
[Figure: read ambiguity among isoforms composed of exons A to E]
Previous approaches to IE
• [Jiang & Wong 09]
– Poisson model, single reads only
• [Li et al. 10]
– EM Algorithm, single reads only
• [Feng et al. 10]
– Convex quadratic program, pairs used only for ID
• [Trapnell et al. 10]
– Extends Jiang’s model to paired reads
– Fragment length distribution
Our contribution
• EM Algorithm for IE
– Single and paired reads
– Fragment length distribution
– Strand information
– Base quality scores
Read-Isoform Compatibility
$w_{r,i} = \sum_{a} O_a Q_a F_a$
(sum over the alignments a of read r to isoform i)
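A minimal Python sketch of how such a read-isoform weight might be assembled as a sum over alignments of the product of three factors. Reading O_a as a strand/orientation factor, Q_a as a base-quality factor, and F_a as a fragment-length factor is inferred from the contribution list and is an assumption here. The helper names, the Phred-based mismatch model, and the normal fragment-length density (mean 250, std. dev. 25, borrowed from the experimental setup) are likewise illustrative choices.

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def quality_factor(read, ref, quals):
    """Assumed Q_a: product over bases of P(observed base | reference base)."""
    p = 1.0
    for base, ref_base, q in zip(read, ref, quals):
        err = phred_to_error(q)
        # A match has probability 1 - err; a mismatch is assumed to be
        # equally likely among the three other bases.
        p *= (1.0 - err) if base == ref_base else err / 3.0
    return p


def fragment_factor(length, mean=250.0, sd=25.0):
    """Assumed F_a: normal density of the implied fragment length."""
    z = (length - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def compatibility_weight(alignments):
    """w_{r,i}: sum over alignments of O_a * Q_a * F_a.

    alignments: list of (O_a, Q_a, F_a) tuples for one read and one isoform.
    """
    return sum(o * q * f for o, q, f in alignments)


# Hypothetical read with two candidate alignments to the same isoform.
alignments = [
    (1.0, quality_factor("ACGT", "ACGT", [30, 30, 30, 30]), fragment_factor(250)),
    (1.0, quality_factor("ACGT", "ACGA", [30, 30, 30, 30]), fragment_factor(300)),
]
w = compatibility_weight(alignments)
```

The perfect-match alignment at the mean fragment length dominates the sum; the mismatching alignment with an unlikely fragment length contributes very little.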
Fragment length distribution
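• Paired reads
[Figure: paired read aligned to isoforms over exons A, B, C]
• Single reads
[Figure: single read aligned to isoforms over exons A, B, C]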
IsoEM algorithm
E-step
M-step
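The slide's E-step and M-step formulas did not survive extraction, so the following is only a generic EM sketch for isoform frequencies under stated assumptions, not the authors' exact update rules. It assumes read-isoform weights w_{r,i} are given. The E-step fractionally assigns each read to its compatible isoforms in proportion to weight times current frequency; the M-step divides expected read counts by an assumed effective isoform length and renormalizes. All names (`em_isoform_frequencies`, `eff_lengths`) are hypothetical.

```python
def em_isoform_frequencies(weights, eff_lengths, n_iter=200, tol=1e-9):
    """Estimate isoform frequencies by EM.

    weights: one dict per read, mapping isoform index -> w_{r,i} (> 0).
    eff_lengths: assumed effective length of each isoform.
    """
    m = len(eff_lengths)
    freq = [1.0 / m] * m  # start from uniform frequencies
    for _ in range(n_iter):
        # E-step: expected number of reads emitted by each isoform.
        counts = [0.0] * m
        for w in weights:
            denom = sum(w_ri * freq[i] for i, w_ri in w.items())
            if denom == 0.0:
                continue  # read is incompatible with all expressed isoforms
            for i, w_ri in w.items():
                counts[i] += w_ri * freq[i] / denom
        # M-step: length-normalize expected counts and renormalize.
        rates = [counts[i] / eff_lengths[i] for i in range(m)]
        total = sum(rates)
        if total == 0.0:
            return freq  # no usable reads; keep the current estimate
        new_freq = [r / total for r in rates]
        delta = max(abs(a - b) for a, b in zip(freq, new_freq))
        freq = new_freq
        if delta < tol:
            break
    return freq


# Hypothetical toy input: two isoforms of equal effective length,
# three unambiguous reads and one read compatible with both isoforms.
reads = [{0: 1.0}, {0: 1.0}, {1: 1.0}, {0: 1.0, 1: 1.0}]
freq = em_isoform_frequencies(reads, eff_lengths=[1000.0, 1000.0])
```

In the toy input the ambiguous read is split according to the current frequency estimates, so the estimate for isoform 0 settles at the fixed point of f = (2 + f) / 4, that is, 2/3.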
Experimental setup
• Human genome UCSC known isoforms
[Figure: number of isoforms vs. isoform length (log scale, 10 to 100000)]
[Figure: number of genes vs. number of isoforms (0 to 55)]
• GNFAtlas2 gene expression levels
– Uniform/geometric expression of gene isoforms
• Normally distributed fragment lengths
– Mean 250, std. dev. 25
Accuracy measures
• Error Fraction (EF)
– Percentage of isoforms (or genes) with relative
error larger than given threshold t
• Median Percent Error (MPE)
– Threshold t for which EF is 50%
• r²
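The first two measures can be sketched in Python as follows. Relative error is taken here as |estimate - truth| / truth; the handling of zero true values and the function names are assumptions made for illustration. Because MPE is the threshold at which EF reaches 50%, it is computed as the median relative error, expressed as a percentage.

```python
import statistics


def relative_errors(estimated, truth):
    """Per-isoform (or per-gene) relative error |est - true| / true."""
    errs = []
    for est, true in zip(estimated, truth):
        if true == 0:
            # Assumed convention: exact zero is no error, anything else is infinite.
            errs.append(0.0 if est == 0 else float("inf"))
        else:
            errs.append(abs(est - true) / true)
    return errs


def error_fraction(estimated, truth, threshold):
    """EF: percentage of items whose relative error exceeds the threshold."""
    errs = relative_errors(estimated, truth)
    return 100.0 * sum(e > threshold for e in errs) / len(errs)


def median_percent_error(estimated, truth):
    """MPE: median relative error, as a percentage."""
    return 100.0 * statistics.median(relative_errors(estimated, truth))


# Hypothetical estimates for four isoforms.
truth = [1.0, 2.0, 3.0, 4.0]
estimated = [1.1, 2.0, 3.9, 8.0]
ef = error_fraction(estimated, truth, threshold=0.2)
mpe = median_percent_error(estimated, truth)
```

Sweeping `threshold` from 0 to 1 and plotting `error_fraction` gives an error fraction curve of the kind shown in the results.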
Isoform Error Fraction Curves
• 30M single reads of length 25
[Figure: % of isoforms over threshold vs. relative error threshold (0 to 1), for Uniq, Rescue, UniqLN, RSEM, and IsoEM]
Gene Error Fraction Curves
• 30M single reads of length 25
[Figure: % of genes over threshold vs. relative error threshold (0 to 1), for Uniq, Rescue, GeneEM, RSEM, and IsoEM]
Read Length Effect
• Fixed sequencing throughput (750Mb)
[Figure, two panels: r² vs. read length and Median Percent Error vs. read length (25 to 95), each for paired reads and single reads]
Effect of Pairs & Strand Information
• 1-60M 75bp reads
[Figure: r² vs. number of reads (0 to 60M), for RandomStrand-Pairs-PerfectMapping, RandomStrand-Pairs, CodingStrand-pairs, RandomStrand-Single, and CodingStrand-single]
• [Trapnell et al. 10] r² = 0.95 for 13M PE reads
Conclusions & Future Work
• Presented EM algorithm for estimating
isoform/gene expression levels
– Integrates fragment length distribution, base qualities,
pair and strand info
– http://dna.engr.uconn.edu/software/IsoEM/
• Ongoing work
– Confidence intervals
– Allele-specific expression
– Integration with isoform discovery
– Reconstruction & frequency estimation for virus quasispecies
Acknowledgments

NSF awards IIS-0546457, IIS-0916401, and IIS-0916948