Estimation of alternative splicing isoform frequencies from RNA

Marius Nicolae
Computer Science and Engineering Department
University of Connecticut
Joint work with Serghei Mangul, Ion Mandoiu and Alex Zelikovsky




Introduction
EM Algorithm
Results
Conclusions and future work
Make cDNA & shatter into fragments
Sequence fragment ends
Map reads
◦ Gene Expression (GE)
◦ Isoform Discovery (ID)
◦ Isoform Expression (IE)
[Figure: mapped reads over regions labeled A to E, one panel each for GE, ID and IE]

Read ambiguity (multireads)
What is the gene length?
[Figure: reads mapped to regions labeled A to E]


Ignore multireads
[Mortazavi et al. 08]
◦ Fractionally allocate multireads based on unique read estimates

[Pasaniuc et al. 10]
◦ EM algorithm for solving ambiguities

Gene length: sum of lengths of exons that appear in at least one isoform
◦ Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]
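A minimal sketch of the gene-length definition above, assuming exons are given as half-open (start, end) coordinate pairs. The function name and the two-isoform gene are hypothetical. It illustrates one reading of the [Trapnell et al. 10] remark: the exon union is at least as long as any single isoform, so normalizing read counts by it can understate expression.

```python
def union_length(exons):
    """Number of bases covered by at least one half-open (start, end) interval."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(exons):
        if cur_end is None or start > cur_end:
            # Disjoint from the current run: close it and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Hypothetical gene with two isoforms that share some exons.
isoform_exons = {
    "iso1": [(0, 100), (200, 300), (400, 500)],
    "iso2": [(0, 100), (400, 550)],
}

all_exons = [exon for exons in isoform_exons.values() for exon in exons]
gene_length = union_length(all_exons)
isoform_lengths = {
    name: sum(end - start for start, end in exons)
    for name, exons in isoform_exons.items()
}

# gene_length is 350, longer than either isoform (300 and 250).
print(gene_length, isoform_lengths)
```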

[Jiang & Wong 09]
◦ Poisson model, single reads only

[Li et al. 10]
◦ EM Algorithm, single reads only

[Feng et al. 10]
◦ Convex quadratic program, pairs used only for ID

[Trapnell et al. 10]
◦ Extends Jiang’s model to paired reads
◦ Fragment length distribution

EM Algorithm for IE
◦ Single and paired reads
◦ Fragment length distribution
◦ Strand information
◦ Base quality scores

Solving GE by adding isoform levels

EM Algorithm


[Figure: paired reads and single reads mapped to regions labeled A, B, C]
E-step
M-step
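As an illustration of the two steps, here is a generic EM sketch for isoform frequencies under assumed inputs. Each read carries a weight for every isoform it is compatible with (for example, the probability of the implied fragment length), and reads are assumed to be sampled from isoforms in proportion to frequency times length. All function and variable names are hypothetical, and this is not claimed to be the exact formulation used in the talk.

```python
def em_isoform_frequencies(read_weights, lengths, n_iter=200, tol=1e-9):
    """Estimate isoform frequencies by EM.

    read_weights: list of dicts {isoform: weight}, one dict per read,
                  listing only the isoforms the read is compatible with.
    lengths:      dict {isoform: length in bases}.
    """
    isoforms = list(lengths)
    f = {j: 1.0 / len(isoforms) for j in isoforms}  # uniform start
    for _ in range(n_iter):
        # E-step: expected number of reads n[j] from each isoform,
        # splitting every read in proportion to f[j] * weight.
        n = {j: 0.0 for j in isoforms}
        for w in read_weights:
            total = sum(f[j] * wj for j, wj in w.items())
            if total == 0.0:
                continue  # read cannot be explained by current estimates
            for j, wj in w.items():
                n[j] += f[j] * wj / total
        # M-step: frequency proportional to expected reads per base.
        rate = {j: n[j] / lengths[j] for j in isoforms}
        s = sum(rate.values())
        new_f = {j: rate[j] / s for j in isoforms}
        delta = max(abs(new_f[j] - f[j]) for j in isoforms)
        f = new_f
        if delta < tol:
            break
    return f


# Hypothetical data: 30 reads unique to A, 30 unique to B, 40 ambiguous.
lengths = {"A": 1000, "B": 500}
reads = [{"A": 1.0}] * 30 + [{"B": 1.0}] * 30 + [{"A": 1.0, "B": 1.0}] * 40
freqs = em_isoform_frequencies(reads, lengths)
print(freqs)
```

On this toy input the iteration settles at f(A) = 0.25 and f(B) = 0.75. Summing the estimates of a gene's isoforms then gives a gene-level estimate in the same units.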

Results

Human genome UCSC known isoforms
[Figure, two histograms: number of isoforms vs. isoform length (log-scale length axis, 10 to 100000); number of genes vs. number of isoforms (0 to 55)]
GNFAtlas2 gene expression levels
◦ Uniform/geometric expression of gene isoforms

Normally distributed fragment lengths
◦ Mean 250, std. dev. 25
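A minimal sketch of how fragments with these length parameters could be drawn from one isoform. The uniform start position, the rounding, and the rejection of fragments longer than the isoform are assumptions made for illustration, and `sample_fragment` is a hypothetical name.

```python
import random


def sample_fragment(isoform_length, rng, mean=250.0, sd=25.0):
    """Return (start, fragment_length) for one simulated fragment."""
    # Redraw until the fragment fits inside the isoform (assumed policy).
    while True:
        frag_len = round(rng.gauss(mean, sd))
        if 1 <= frag_len <= isoform_length:
            break
    # Start position chosen uniformly among positions where the fragment fits.
    start = rng.randrange(isoform_length - frag_len + 1)
    return start, frag_len


rng = random.Random(0)  # fixed seed so the run is repeatable
fragments = [sample_fragment(1000, rng) for _ in range(10000)]
mean_len = sum(length for _, length in fragments) / len(fragments)
print(round(mean_len, 1))
```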

Error Fraction (EF)
◦ Percentage of isoforms (or genes) with relative error
larger than given threshold t

Median Percent Error (MPE)
◦ Threshold t for which EF is 50%

r2
◦ Coefficient of determination
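The three measures can be sketched as follows, assuming strictly positive true values and taking r2 as the squared Pearson correlation between true and estimated values (an assumption; the slides do not say which definition is used). Function names and the toy data are hypothetical.

```python
import statistics


def relative_errors(true, est):
    # Relative error per isoform (or gene); assumes every true value is > 0.
    return [abs(e - t) / t for t, e in zip(true, est)]


def error_fraction(true, est, threshold):
    """Percentage of items whose relative error exceeds `threshold`."""
    errs = relative_errors(true, est)
    return 100.0 * sum(e > threshold for e in errs) / len(errs)


def median_percent_error(true, est):
    """Threshold at which EF is 50%: the median relative error, in percent."""
    return 100.0 * statistics.median(relative_errors(true, est))


def r_squared(true, est):
    """Squared Pearson correlation between true and estimated values."""
    n = len(true)
    mean_t, mean_e = sum(true) / n, sum(est) / n
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(true, est))
    var_t = sum((t - mean_t) ** 2 for t in true)
    var_e = sum((e - mean_e) ** 2 for e in est)
    return cov * cov / (var_t * var_e)


# Toy data: relative errors are 0.1, 0.1, 0.0 and 0.25.
true_levels = [10.0, 20.0, 40.0, 80.0]
est_levels = [11.0, 18.0, 40.0, 100.0]
ef = error_fraction(true_levels, est_levels, 0.05)  # 3 of 4 exceed 0.05
mpe = median_percent_error(true_levels, est_levels)
r2 = r_squared(true_levels, est_levels)
print(ef, mpe, r2)
```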

30M single reads of length 25
[Figure: % of isoforms over threshold vs. relative error threshold (0 to 1), for Uniq, Rescue, UniqLN, RSEM, IsoEM]
Main difference between IsoEM and RSEM is fragment length modeling
30M single reads of length 25
[Figure: % of genes over threshold vs. relative error threshold (0 to 1), for Uniq, Rescue, GeneEM, RSEM, IsoEM]

Fixed sequencing throughput (750Mb)
[Figure, two panels: r2 vs. read length and Median Percent Error vs. read length (25 to 95), each for paired reads and single reads]
50bp reads better than 100bp!
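A small worked calculation of what a fixed 750Mb budget means in reads. It assumes throughput counts every sequenced base and that a paired read spends two read lengths per fragment; both are assumptions, and `reads_for_budget` is a hypothetical helper. At read length 25 the single-read count matches the 30M reads used in the earlier comparisons.

```python
THROUGHPUT_BASES = 750_000_000  # 750Mb budget


def reads_for_budget(read_length, paired=False):
    """Number of reads (or read pairs) that fit in the fixed budget."""
    bases_per_fragment = read_length * (2 if paired else 1)
    return THROUGHPUT_BASES // bases_per_fragment


for read_length in (25, 50, 75, 100):
    single = reads_for_budget(read_length)
    pairs = reads_for_budget(read_length, paired=True)
    print(read_length, single, pairs)
```

Longer reads therefore mean fewer reads under the same budget, which is the trade-off the plots explore.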

1-60M 75bp reads
[Figure: r2 vs. # reads (0 to 60M), for RandomStrand-Pairs-PerfectMapping, RandomStrand-Pairs, CodingStrand-pairs, RandomStrand-Single, CodingStrand-single]
Pairs help, strand info doesn’t
[Trapnell et al. 10] r2=.95 for 13M PE reads

Conclusions and future work



Presented EM algorithm for isoform frequency
estimation that exploits fragment length
distribution for both single and paired reads
◦ Significant accuracy improvement over existing
methods
◦ Code and datasets to be released publicly soon

Ongoing extensions
◦ Confidence intervals
◦ Allele-specific isoform expression
◦ Testing for novel isoforms
◦ Integration with isoform discovery