Gene expression from RNA-Seq

Once sequenced, the problem becomes computational
[Figure: pipeline from cells through cDNA/ChIP, the sequencer and sequenced reads to alignment and read coverage on the genome]

Considerations and assumptions
• High library complexity
  • #molecules in library >> #sequenced molecules
• Short reads
  • Read length << sequenced molecule length
Not all applications satisfy this:
• miRNA sequencing
• Small input sequencing (e.g. single cell sequencing)

Corollaries
• Libraries satisfying assumptions 1 & 2 only measure relative abundance
• Key quantity: # fragments sequenced for each transcript. We need to determine:
  • Which transcript generated the observed read?

Isn't this easy?
• Reads do not uniquely map
• Transcripts or genes have different isoforms
• Sequencing has a ~1% error rate
• Transcripts are not uniformly sequenced

The RNA-Seq quantification problem (simple case)
• Start with a set of previous gene/transcript annotations
• Assume only one isoform per gene
• Assume 1-1 read to transcript correspondence
• With sequencing depth $N$ (total reads) and transcript frequency $\theta_t$, the read count $n_t$ of transcript $t$ is binomial. Using the Poisson approximation to the binomial, $n_t \sim \mathrm{Poisson}(N\theta_t)$
• We seek to maximize the likelihood of the transcript frequencies given the data, $L(\theta) = \prod_t \mathrm{Poisson}(n_t;\, N\theta_t)$
• Which, of course, has MLE $\hat{\theta}_t = n_t / N$

The process of RNA-Seq quantification
• Sequenced reads are aligned to a reference sequence:
  • the species genome, or
  • its transcriptome
• Transcript abundance is measured:
  • By counting reads mapped to each transcript (not accurate when multiple isoforms share sequence)
  • By maximizing the likelihood of the observed mapping given transcript abundance
• To compare samples, counts need to be normalized:
  • Libraries have different sequencing depth
  • Sample composition may be different
  • Most standard normalization: Transcripts per Million (TPM) units

The gene expression table
Genes are quantified. Each gene or isoform has:
• A TPM value
• An (expected) fragment count value
All samples were quantified in the same fashion and arranged into a table of genes (22,000) x samples (24).
• Row i gives the expression of gene i across all samples
• Column j gives the expression of all genes in sample j

gene      LD1,2.rep1  LD1,2.rep2  LD1,2.rep3  LD1.rep1  LD1.rep2  LD1.rep3  LD2.rep1  LD2.rep2  LD2.rep3
Mir301    0           0           0           0         0         0         0         0         0
Cpne2     157         158.98      88.04       69        111.99    114.33    93        208       140
Capn5     36          65          46          46        69        42        33        58        59.01
Lage3     313.06      241.23      276.23      218.9     285.19    359.65    269.7     359.04    417.47
Brd7      379         358.58      390         336       357.26    368.08    264       564.07    476
Dimt1     77          68          58          54        62        60        54        76        97.03
AK017068  0           0           0           0         0         0         0         0         0

But, how are these quantities computed?
[Figure 3: An overview of gene expression quantification with RNA-seq. (a) Illustration of transcripts of different lengths with different read coverage levels (left) as well as total read counts observed for each transcript (middle) and FPKM-normalized read counts (right). (b) Reads from alternatively spliced genes may be attributable to a single isoform or more than one isoform. Reads are color-coded when their isoform of origin is clear. Black reads indicate reads with uncertain origin. 'Isoform expression methods' estimate isoform abundances that best explain the observed read counts under a generative model. Panel labels: Isoform 1, Isoform 2, exon union method, exon intersection method, confidence interval; axis: true FPKM.]

• Start with a set of previous gene/transcript annotations
• Assume only one isoform per gene
• Assume 1-1 read (fragment) to transcript correspondence
• Reads are now short: one transcript generates many fragments
• Change: transcripts of different lengths generate different numbers of fragments
• This is captured by the transcript effective length
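The model and MLE formulas for this fragment-level case were images on the slides and are not preserved. A common textbook form is sketched below; it is an assumption, not necessarily the notation the lecture used. All symbols ($\ell_t$, $\mu$, $\tilde{\ell}_t$, $\alpha_t$, $\theta_t$, $n_t$, $N$) are introduced here for illustration, and the effective length is taken in its simplest form (transcript length minus mean fragment length plus one).

```latex
% Assumed notation (not from the slides):
%   \ell_t    length of transcript t
%   \mu       mean fragment length
%   n_t       fragments assigned to transcript t,  N = \sum_t n_t
%   \theta_t  relative abundance of transcript t,  \sum_t \theta_t = 1
\begin{align}
  \tilde{\ell}_t &= \ell_t - \mu + 1
      && \text{(effective length)} \\
  \alpha_t &= \frac{\theta_t \, \tilde{\ell}_t}{\sum_r \theta_r \, \tilde{\ell}_r}
      && \text{(fraction of fragments from } t\text{)} \\
  n_t &\sim \mathrm{Poisson}(N \alpha_t)
      && \text{(model)} \\
  \hat{\alpha}_t &= \frac{n_t}{N}, \qquad
  \hat{\theta}_t = \frac{n_t / \tilde{\ell}_t}{\sum_r n_r / \tilde{\ell}_r}
      && \text{(MLE)}
\end{align}
```

Under these assumptions a transcript twice as long produces twice as many fragments at equal abundance, which is why counts are divided by effective length before being compared; multiplying $\hat{\theta}_t$ by $10^6$ gives a TPM value.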
[Figure 3, continued: (b) Samples near the original maximum likelihood estimate (dashed line) improve the robustness of the estimate and provide a confidence interval around each isoform's abundance. (c) For a gene with two expressed isoforms, exons are colored according to the isoform of origin. Two simplified gene models used for quantification purposes, spliced transcripts from each model and their associated lengths, are shown to the right. The 'exon union model' (top) uses exons from all …]

The RNA-Seq quantification problem. Isoform deconvolution
[Figure: reads aligned to Isoform 1 and Isoform 2]
• Main difference: quantification involves read assignment. Our model must capture read assignment uncertainty.
• Parameters: transcript relative abundance
• Latent variables: fragment alignment source
• Observed variables: N fragment alignments, transcripts, fragment length distribution

[Figure: excerpt from the accompanying review, clipped on the slide. Recoverable text: isoform-expression methods such as Cufflinks and MISO (mixture of isoforms) handle read-assignment uncertainty by constructing a likelihood function that models the sequencing process and identify the isoform abundances that best explain the reads obtained (Fig. 3b). The estimate that maximizes the likelihood function is the maximum likelihood estimate (MLE). Where the MLE is not an accurate expression estimate, sampling alternative abundance estimates improves robustness and also provides a confidence measure. Second panel: overview of RNA-seq differential expression analysis, detected change between Condition 1 and Condition 2; axes: expression estimator value vs. transcript expression level.]

We can estimate the insert size distribution
• Get all single isoform reconstructions
• Splice and compute insert distance
• Estimate insert size empirical distribution
[Figure: paired reads P1 and P2 with insert distances d1 and d2; empirical insert size density (0 to 0.004) over insert sizes 0 to 700]

… and use it for probabilistic read assignment
[Figure: insert distances d1 and d2 on Isoform 1, Isoform 2 and Isoform 3; P(d > di) read off the insert size distribution]
• For methods such as MISO, Cufflinks and RSEM, it is critical to have paired-end data

The RNA-Seq quantification problem. Isoform deconvolution (continued)
• The model gives the probability of a fragment alignment originating from transcript t
• The likelihood can be shown to be concave, and hence solvable by expectation maximization
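The isoform deconvolution model above (parameters: relative abundance; latent variables: fragment source; solved by expectation maximization) can be illustrated with a minimal sketch. This is not the implementation used by MISO, Cufflinks or RSEM. The function name `em_fragment_fractions`, the input layout `cond_prob[f][t]` (probability of observing fragment f given that it came from isoform t, zero when incompatible) and the toy numbers are all assumptions made for illustration. In a real tool the conditional probability would also include the insert size term described above.

```python
def em_fragment_fractions(cond_prob, n_iter=1000, tol=1e-12):
    """EM for the fraction of fragments generated by each isoform.

    cond_prob[f][t] is an assumed input: P(fragment f | isoform t),
    and 0.0 when fragment f is incompatible with isoform t.
    """
    n_frag = len(cond_prob)
    n_iso = len(cond_prob[0])
    alpha = [1.0 / n_iso] * n_iso  # start from equal fractions
    for _ in range(n_iter):
        totals = [0.0] * n_iso
        for row in cond_prob:
            # E-step: posterior probability that this fragment came from t.
            weighted = [a * p for a, p in zip(alpha, row)]
            norm = sum(weighted)
            if norm <= 0.0:
                raise ValueError("fragment is incompatible with every isoform")
            for t, w in enumerate(weighted):
                totals[t] += w / norm
        # M-step: new fractions are the expected fragment counts over N.
        new_alpha = [x / n_frag for x in totals]
        converged = max(abs(a - b) for a, b in zip(alpha, new_alpha)) < tol
        alpha = new_alpha
        if converged:
            break
    return alpha


# Toy gene with two isoforms (made-up numbers): 60 fragments fall in shared
# exons, 30 are unique to isoform 1 and 10 are unique to isoform 2.
eff_len = [1000.0, 500.0]
compat = [[1, 1]] * 60 + [[1, 0]] * 30 + [[0, 1]] * 10
# Simplest choice: a compatible fragment is uniform over the effective length.
cond_prob = [[c / l for c, l in zip(row, eff_len)] for row in compat]

alpha = em_fragment_fractions(cond_prob)
# Convert fragment fractions to relative transcript abundance.
weights = [a / l for a, l in zip(alpha, eff_len)]
theta = [w / sum(weights) for w in weights]
print("fragment fractions:", alpha)
print("relative abundance:", theta)
```

When every fragment maps uniquely, the E-step assigns each fragment to one isoform with probability one and the result reduces to simple read counting, which matches the simple case described earlier.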
Summary: Current quantification models are complex
• In its simplest form we assume that reads can be unequivocally mapped. This allows:
  • Read counts to be modeled as multinomial, with rates estimated from the observed counts
• When this assumption breaks, the multinomial is no longer appropriate.
• More general models use:
  • Base quality scores
  • Sequence mappability
  • Protocol biases (e.g. 3' bias)
  • Sequence biases (e.g. GC)
• Handling each of these involves a more complex model where reads are assigned probabilistically not only to an isoform but also to different loci

RNA-Seq libraries revisited: End-sequence libraries
• Target the start or end of transcripts.
• Source: End-enriched RNA
  • Fragmented then selected
  • Fragmented then enzymatically purified
• Uses:
  • Annotation of transcriptional start sites
  • Annotation of 3' UTRs
  • Quantification and gene expression
  • Depth required: 3-8 million reads
  • Low quality RNA samples
  • Single cell RNA sequencing

RNA-Seq libraries: Summary
[Figure: End-sequencing solution]

Analysis of counting data requires 3 broad tasks
• Read mapping (alignment): placing short reads in the genome
• Quantification:
  • Transcript relative abundance estimation
  • Determining whether a gene is expressed
  • Normalization
  • Finding genes/transcripts that are differentially represented between two or more samples
• Reconstruction: finding the regions that originated the reads

What are we normalizing?
[Figure: a typical replicate scatter plot]
TPM normalization
• Accounts for:
  • Differences in sequencing depth
  • Differences in the number of reads generated by transcripts of different length
• Terms in the formula:
  • Estimated reads/fragments for the gene
  • Total reads/fragments
  • Length of the transcript

Sample composition impacts transcript relative abundance
[Figure: Cell type I vs. Cell type II]
• Normalizing by total reads does not work well for samples with very different RNA composition

Example normalization techniques
$s_j = \operatorname{median}_i \dfrac{k_{ij}}{\left(\prod_{v=1}^{m} k_{iv}\right)^{1/m}}$
• Numerator: counts for gene i in experiment j
• Denominator: geometric mean for that gene over ALL experiments
• i runs through all n genes, j through all m samples
• $k_{ij}$ is the observed count for gene i in sample j
• $s_j$ is the normalization constant
(Anders and Huber, 2010)

Let's do an experiment
• Similar read number, one transcript many-fold changed
• Size normalization results in 2-fold changes in all transcripts

When everything changes: Spike-ins
(Lovén et al., Cell 2012)
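The two normalizations discussed above can be sketched in a few lines. This is an illustrative sketch, not the code of any particular package. The function names `tpm` and `size_factors`, the `counts[i][j]` layout and the toy numbers are assumptions. The TPM function uses the standard definition (fragments per base, rescaled to sum to one million), and the size factors follow the median-of-ratios idea attributed above to Anders and Huber, with the median taken on the log scale.

```python
import math
from statistics import median


def tpm(counts, eff_lengths):
    """Transcripts per million from fragment counts and effective lengths."""
    # Fragments per base of transcript: removes the length effect.
    rates = [c / l for c, l in zip(counts, eff_lengths)]
    total = sum(rates)
    # Rescale so one sample sums to one million: removes the depth effect.
    return [1e6 * r / total for r in rates]


def size_factors(counts):
    """Median-of-ratios size factors; counts[i][j] is gene i in sample j."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        # A zero count makes the geometric mean zero, so skip such genes.
        if any(k <= 0 for k in row):
            continue
        log_geo_mean = sum(math.log(k) for k in row) / n_samples
        for j, k in enumerate(row):
            log_ratios[j].append(math.log(k) - log_geo_mean)
    # Median taken on the log scale, then mapped back to a ratio.
    return [math.exp(median(r)) for r in log_ratios]


# Toy experiment (made-up numbers): both samples have 1000 reads, but in
# sample B a single transcript takes more than half of them.
sample_a = [100] * 10
sample_b = [50] * 9 + [550]
counts = [list(pair) for pair in zip(sample_a, sample_b)]

s_a, s_b = size_factors(counts)
# Totals are equal, so total-count scaling leaves the raw ratios unchanged.
by_total = [b / a for a, b in zip(sample_a, sample_b)]
by_ratio = [(b / s_b) / (a / s_a) for a, b in zip(sample_a, sample_b)]
print("fold change, total-count scaling:", by_total[:3], by_total[-1])
print("fold change, median-of-ratios:", by_ratio[:3], by_ratio[-1])
```

In this toy case, scaling by total reads makes the nine unchanged transcripts look two-fold lower in sample B, while the median-of-ratios factors treat them as unchanged. When most transcripts change in the same direction, neither approach can detect it from counts alone, which is the situation the spike-in slide addresses.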