Estimation of alternative splicing isoform frequencies from RNA-Seq data
Ion Mandoiu
Computer Science and Engineering Department, University of Connecticut
Joint work with Marius Nicolae, Serghei Mangul, and Alex Zelikovsky

Outline
• Introduction
• EM Algorithm
• Results
• Conclusions and future work

Alternative Splicing
[Figure: alternative splicing, from Nilsen & Graveley 10]

RNA-Seq
• Make cDNA & shatter into fragments
• Sequence fragment ends
• Map reads
[Figure: reads mapped to exons A to E, feeding three analyses: Gene Expression (GE), Isoform Discovery (ID), Isoform Expression (IE)]

Gene Expression Challenges
• Read ambiguity (multireads)
• What is the gene length?
[Figure: gene with exons A to E]

Previous approaches to GE
• Ignore multireads
• [Mortazavi et al. 08]
  – Fractionally allocate multireads based on unique read estimates
• [Pasaniuc et al. 10]
  – EM algorithm for solving ambiguities
• Gene length: sum of lengths of exons that appear in at least one isoform
  – Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]

Read Ambiguity in IE
[Figure: isoforms sharing exons A to E]

Previous approaches to IE
• [Jiang&Wong 09]
  – Poisson model, single reads only
• [Li et al. 10]
  – EM Algorithm, single reads only
• [Feng et al. 10]
  – Convex quadratic program, pairs used only for ID
• [Trapnell et al. 10]
  – Extends Jiang’s model to paired reads
  – Fragment length distribution

Our contribution
• EM Algorithm for IE
  – Single and paired reads
  – Fragment length distribution
  – Strand information
  – Base quality scores

Read-Isoform Compatibility
$w_{r,i} = \sum_{a} O_a Q_a F_a$

Fragment length distribution
• Paired reads
• Single reads
[Figure: paired and single reads aligned to isoforms with exons A, B, C]

IsoEM algorithm
• E-step
• M-step

Experimental setup
• Human genome UCSC known isoforms
[Figure: number of isoforms vs. isoform length; number of genes vs. number of isoforms]
• GNFAtlas2 gene expression levels
  – Uniform/geometric expression of gene isoforms
• Normally distributed fragment lengths
  – Mean 250, std. dev. 25

Accuracy measures
• Error Fraction (EF)
  – Percentage of isoforms (or genes) with relative error larger than given threshold t
• Median Percent Error (MPE)
  – Threshold t for which EF is 50%
• r2

Isoform Error Fraction Curves
• 30M single reads of length 25
[Plot: % of isoforms over threshold vs. relative error threshold (0 to 1); methods: Uniq, Rescue, UniqLN, RSEM, IsoEM]

Gene Error Fraction Curves
• 30M single reads of length 25
[Plot: % of genes over threshold vs. relative error threshold (0 to 1); methods: Uniq, Rescue, GeneEM, RSEM, IsoEM]

Read Length Effect
• Fixed sequencing throughput (750Mb)
[Plots: r2 and Median Percent Error vs. read length (25 to 95); series: paired reads, single reads]

Effect of Pairs & Strand Information
• 1-60M 75bp reads
[Plot: r2 vs. # reads; series: RandomStrand-Pairs-PerfectMapping, RandomStrand-Pairs, CodingStrand-pairs, RandomStrand-Single, CodingStrand-single]
• [Trapnell et al. 10] r2=.95 for 13M PE reads

Conclusions & Future Work
• Presented EM algorithm for estimating isoform/gene expression levels
  – Integrates fragment length distribution, base qualities, pair and strand info
  – http://dna.engr.uconn.edu/software/IsoEM/
• Ongoing work
  – Confidence intervals
  – Allelic specific expression
  – Integration with isoform discovery
  – Reconstruction & frequency estimation for virus quasispecies

Acknowledgments
NSF awards IIS-0546457, IIS-0916401, and IIS-0916948
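
The slides name an E-step and an M-step under "IsoEM algorithm" without spelling them out. The following is a minimal sketch of a generic EM iteration of that kind, not the IsoEM implementation itself. It assumes that read-isoform compatibility weights are already computed, and that the M-step normalizes expected read counts by an effective isoform length; the function name `em_isoform_frequencies` and its parameters are hypothetical.

```python
def em_isoform_frequencies(weights, eff_lengths, n_iter=200, tol=1e-9):
    """Estimate isoform frequencies with a generic EM loop (illustrative sketch).

    weights: one dict per read, mapping isoform index -> compatibility weight
             (assumed precomputed; reads compatible with no isoform are skipped).
    eff_lengths: effective length of each isoform (an assumption of this sketch).
    """
    k = len(eff_lengths)
    f = [1.0 / k] * k  # start from uniform frequencies
    for _ in range(n_iter):
        # E-step: expected number of reads originating from each isoform,
        # splitting every read among its compatible isoforms.
        n = [0.0] * k
        for w in weights:
            total = sum(w_ri * f[i] for i, w_ri in w.items())
            if total == 0.0:
                continue
            for i, w_ri in w.items():
                n[i] += w_ri * f[i] / total
        # M-step: length-normalized expected counts, rescaled to sum to 1.
        density = [n[i] / eff_lengths[i] for i in range(k)]
        s = sum(density)
        if s == 0.0:
            raise ValueError("no read is compatible with any isoform")
        new_f = [d / s for d in density]
        delta = max(abs(a - b) for a, b in zip(f, new_f))
        f = new_f
        if delta < tol:
            break
    return f


if __name__ == "__main__":
    # 30 reads unique to isoform 0, 10 unique to isoform 1, 20 ambiguous.
    reads = [{0: 1.0}] * 30 + [{1: 1.0}] * 10 + [{0: 1.0, 1: 1.0}] * 20
    print(em_isoform_frequencies(reads, [1000.0, 1000.0]))
```

In the toy input, the ambiguous reads end up split in proportion to the frequencies supported by the unique reads, which is the behavior that distinguishes EM from simply ignoring multireads.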
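
The accuracy measures defined in the slides (Error Fraction and Median Percent Error) can be illustrated with a short sketch. It assumes relative error is |estimate − truth| / truth, treats a zero truth value as zero error when the estimate is also zero and as infinite error otherwise, and takes MPE as the median relative error expressed in percent. These conventions and the function names are assumptions of this sketch, not taken from the slides.

```python
import math
from statistics import median


def relative_errors(true_vals, est_vals):
    """Per-isoform (or per-gene) relative error; zero-truth handling is an assumption."""
    errs = []
    for t, e in zip(true_vals, est_vals):
        if t == 0:
            errs.append(0.0 if e == 0 else math.inf)
        else:
            errs.append(abs(e - t) / t)
    return errs


def error_fraction(true_vals, est_vals, threshold):
    """Percentage of items whose relative error exceeds the threshold."""
    errs = relative_errors(true_vals, est_vals)
    return 100.0 * sum(e > threshold for e in errs) / len(errs)


def median_percent_error(true_vals, est_vals):
    """Median relative error in percent: the threshold at which EF is about 50%."""
    return 100.0 * median(relative_errors(true_vals, est_vals))


if __name__ == "__main__":
    truth = [10, 20, 40, 50]
    estimate = [10, 25, 20, 100]
    print(error_fraction(truth, estimate, 0.3))
    print(median_percent_error(truth, estimate))
```

Sweeping `threshold` from 0 to 1 and plotting `error_fraction` gives a curve of the kind shown on the error fraction slides.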