Marius Nicolae
Computer Science and Engineering Department, University of Connecticut
Joint work with Serghei Mangul, Ion Mandoiu and Alex Zelikovsky

Outline
  ◦ Introduction
  ◦ EM Algorithm
  ◦ Results
  ◦ Conclusions and future work

Introduction

RNA-Seq workflow
  ◦ Make cDNA & shatter into fragments
  ◦ Sequence fragment ends
  ◦ Map reads
  ◦ Gene Expression (GE)
  ◦ Isoform Discovery (ID)
  ◦ Isoform Expression (IE)
[Figure: reads mapped to a gene with exons A, B, C, D, E and its isoforms]

Challenges
  ◦ Read ambiguity (multireads)
  ◦ What is the gene length?
[Figure: isoforms over exons A, B, C, D, E]

Previous approaches to GE
Ignore multireads
[Mortazavi et al. 08]
  ◦ Fractionally allocate multireads based on unique read estimates
[Pasaniuc et al. 10]
  ◦ EM algorithm for solving ambiguities
Gene length: sum of lengths of exons that appear in at least one isoform
  ◦ Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]

Previous approaches to IE
[Jiang&Wong 09]
  ◦ Poisson model, single reads only
[Li et al. 10]
  ◦ EM algorithm, single reads only
[Feng et al. 10]
  ◦ Convex quadratic program, pairs used only for ID
[Trapnell et al. 10]
  ◦ Extends Jiang’s model to paired reads
  ◦ Fragment length distribution

Our contribution
EM algorithm for IE
  ◦ Single and paired reads
  ◦ Fragment length distribution
  ◦ Strand information
  ◦ Base quality scores
Solving GE by adding isoform levels

EM Algorithm

[Figure: paired reads and single reads aligned to isoforms over exons A, B, C]
  ◦ E-step
  ◦ M-step

Results

Human genome UCSC known isoforms
[Figure: number of isoforms vs. isoform length; number of genes vs. number of isoforms]

Simulation setup
GNFAtlas2 gene expression levels
  ◦ Uniform/geometric expression of gene isoforms
Normally distributed fragment lengths
  ◦ Mean 250, std. dev. 25

Accuracy measures
Error Fraction (EF)
  ◦ Percentage of isoforms (or genes) with relative error larger than a given threshold t
Median Percent Error (MPE)
  ◦ Threshold t for which EF is 50%
r²
  ◦ Coefficient of determination

30M single reads of length 25 (isoforms)
[Figure: % of isoforms over threshold vs. relative error threshold, for Uniq, Rescue, UniqLN, RSEM, IsoEM]
  ◦ Main difference between IsoEM and RSEM is fragment length modeling

30M single reads of length 25 (genes)
[Figure: % of genes over threshold vs. relative error threshold, for Uniq, Rescue, GeneEM, RSEM, IsoEM]

Fixed sequencing throughput (750Mb)
[Figure: r² and Median Percent Error vs. read length, for paired reads and single reads]
  ◦ 50bp reads better than 100bp!

1-60M 75bp reads
[Figure: r² vs. number of reads, for RandomStrand-Pairs-PerfectMapping, RandomStrand-Pairs, CodingStrand-Pairs, RandomStrand-Single, CodingStrand-Single]
  ◦ Pairs help, strand info doesn’t
  ◦ [Trapnell et al. 10] r² = 0.95 for 13M PE reads

Conclusions and future work

Presented EM algorithm for isoform frequency estimation that exploits fragment length distribution for both single and paired reads
  ◦ Significant accuracy improvement over existing methods
  ◦ Code and datasets to be released publicly soon
Ongoing extensions
  ◦ Confidence intervals
  ◦ Allele-specific isoform expression
  ◦ Testing for novel isoforms
  ◦ Integration with isoform discovery
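
As a rough illustration of the accuracy measures used in the results (Error Fraction and Median Percent Error), a minimal Python sketch might look like the following. The function names, the treatment of isoforms with zero true expression, and the toy numbers are assumptions made for illustration only; they are not taken from the IsoEM code. MPE is computed here as the median relative error expressed in percent, which is one reading of "threshold t for which EF is 50%".

```python
from statistics import median


def relative_errors(true_vals, est_vals):
    """Relative error |estimate - truth| / truth per isoform (or gene)."""
    errors = []
    for truth, estimate in zip(true_vals, est_vals):
        if truth == 0:
            # Assumption: a zero true level counts as exact only if the
            # estimate is also zero; otherwise the error is unbounded.
            errors.append(0.0 if estimate == 0 else float("inf"))
        else:
            errors.append(abs(estimate - truth) / truth)
    return errors


def error_fraction(true_vals, est_vals, threshold):
    """Percentage of items whose relative error exceeds the threshold."""
    errors = relative_errors(true_vals, est_vals)
    return 100.0 * sum(err > threshold for err in errors) / len(errors)


def median_percent_error(true_vals, est_vals):
    """Median relative error in percent (the threshold at which EF is 50%)."""
    return 100.0 * median(relative_errors(true_vals, est_vals))


# Toy data (hypothetical expression levels, not from the talk).
true_levels = [10, 20, 40, 80, 100]
estimated_levels = [11, 15, 40, 100, 150]

ef_at_20_percent = error_fraction(true_levels, estimated_levels, 0.2)
mpe = median_percent_error(true_levels, estimated_levels)
print(ef_at_20_percent, mpe)
```

Sweeping `threshold` over a range of values and plotting `error_fraction` against it would give a curve of the same kind as the "% of isoforms over threshold" plots in the results.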