Lecture 12: RNA-seq analysis

Some background
• RNA-seq (RNA sequencing), also called whole transcriptome shotgun sequencing (WTSS), is a technology that uses the capabilities of next-generation sequencing to reveal a snapshot of the presence and quantity of RNA at a given moment in time.
• RNA sequencing is a high-throughput tool for investigating gene expression, made possible by rapid advances in the speed and efficiency of sequencing technologies.
• Unlike microarrays, RNA-seq benefits from a wide dynamic range of signal detection, identifying both rare and common transcripts with no a priori knowledge of the organism's genome or transcriptome.
• The additional information captured in RNA-seq libraries has revolutionized our understanding of cancer, stem cell differentiation, and plant genetics.

Next Generation Sequencing
• Next-generation sequencing (NGS), also known as high-throughput sequencing, is the catch-all term used to describe a number of different modern sequencing technologies, including Illumina (Solexa) sequencing, Roche 454 sequencing, Ion Torrent (Proton / PGM) sequencing, and SOLiD sequencing.
• NGS has truly revolutionized genome sequencing, because far more sequences can be read in far less time.

More Background
• An RNA-seq run reads and quantifies the transcriptome (the complete set of mRNA) in a single sequencing run.
• RNA is extracted from tissue, cleaved into fragments a few hundred nucleotides long, and then converted to a complementary DNA (cDNA) library (Wilhelm & Landry, 2009).
• Sequencing adaptors are ligated to both ends of each fragment, and the products are sequenced using any high-throughput method such as 454, SOLiD, or Ion Torrent.

Comparison with Microarrays: advantages
• New sequences can be discovered.
• Microarrays can only detect sequences for which probes have been designed. RNA-seq, on the other hand, determines all sequences empirically.
• This has proved invaluable in non-model species with large genomes.
• False positives from cross-hybridization are not an issue in RNA-seq.
• Quantification is possible even at extremely low and extremely high expression levels.
• Whereas microarrays have a dynamic range of one to a few hundred fold, RNA-seq boasts a dynamic range of >8,000 fold (Wang, Gerstein, & Snyder, 2009).

Comparison with Microarrays: disadvantages
• Considerably more processing power is required to handle millions of RNA-seq reads, and chemical manipulation of RNA and cDNA can introduce artifacts.
• Slower than microarrays when the genome is known.
But as sequencing costs have plummeted and computing power has increased, RNA-seq is now the transcriptomics method of choice for most applications.

Data structure
• Here you have a set of sequences, and for each sequence you have the number of READS.
• The data is the COUNT of the sequences read.
• It is NOT continuous like microarray expression data.
• So the normal and other related distributions cannot be used.
• General modeling is done using the Poisson distribution.

Poisson Distribution
• Generally used to model count data.
• The mass function is given by $P(Y=y) = f(y) = \frac{e^{-\mu}\mu^{y}}{y!}$ for $y = 0, 1, 2, \dots$
Properties:
• Has a range from 0 to positive infinity.
• Mean: $E(Y) = \mu$
• Variance: $\mathrm{Var}(Y) = \mu$
• Hence, the mean and variance are the same.

Issues with Poisson
• The property that the mean and variance are the same is problematic for RNA-seq data, where the variance is often much larger than the mean.
• This is called the over-dispersion problem.
• It is also common in litter studies, where over-dispersion is induced by auto-correlation.

Solutions: The NB Distribution
• To address this problem, one distribution that is used is the Negative Binomial (NB) distribution.
• It models the number of failures observed before the r-th success in a sequence of independent trials, and is related to the geometric distribution (the case $r = 1$).
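The over-dispersion issue described above can be checked numerically. The sketch below uses only the Python standard library: it evaluates the Poisson mass function, draws plain Poisson counts, and then draws counts whose mean varies from unit to unit through a gamma-distributed multiplier. The values of `mu`, `sigma2`, and `n`, and the choice of a gamma multiplier, are assumptions made for illustration and are not taken from any real RNA-seq data set.

```python
import math
import random
import statistics


def poisson_pmf(y, mu):
    """Poisson mass function: exp(-mu) * mu**y / y!."""
    return math.exp(-mu) * mu ** y / math.factorial(y)


def sample_poisson(mu, rng):
    """Draw one Poisson(mu) count with Knuth's method (fine for small mu)."""
    limit = math.exp(-mu)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def variance_to_mean(counts):
    """Sample variance divided by sample mean; close to 1 for Poisson data."""
    return statistics.variance(counts) / statistics.mean(counts)


rng = random.Random(0)   # fixed seed so the run is repeatable
mu = 4.0                 # hypothetical mean count
sigma2 = 0.5             # hypothetical variance of the gamma multiplier
n = 20000                # hypothetical number of simulated counts

# The pmf sums to 1 and its mean equals mu (terms beyond y = 99 are negligible).
pmf_total = sum(poisson_pmf(y, mu) for y in range(100))
pmf_mean = sum(y * poisson_pmf(y, mu) for y in range(100))

# Plain Poisson counts: the variance is about equal to the mean.
poisson_counts = [sample_poisson(mu, rng) for _ in range(n)]

# Over-dispersed counts: each unit gets its own multiplier with mean 1
# and variance sigma2 (gamma with shape 1/sigma2 and scale sigma2).
mixed_counts = [
    sample_poisson(mu * rng.gammavariate(1.0 / sigma2, sigma2), rng)
    for _ in range(n)
]

poisson_ratio = variance_to_mean(poisson_counts)
mixed_ratio = variance_to_mean(mixed_counts)
print(f"Poisson variance/mean: {poisson_ratio:.2f}")
print(f"Mixture variance/mean: {mixed_ratio:.2f}")
```

With these settings the first ratio should come out near 1, while the second should be clearly larger (in theory $1 + \sigma^2\mu$), which is the pattern a Poisson model cannot reproduce.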
Model: $P(Y=y) = f(y) = \binom{r+y-1}{y}\, p^{r} (1-p)^{y}$ for $y = 0, 1, 2, \dots$

Properties of the NB distribution
• Mean: $E(Y) = \frac{r(1-p)}{p}$
• Variance: $\mathrm{Var}(Y) = \frac{r(1-p)}{p^{2}}$
• So the mean and variance are related by a proportionality constant: $\mathrm{Var}(Y) = E(Y)/p$. Since $p < 1$, the variance exceeds the mean.

Theoretical Background
• To model over-dispersion in Poisson regression, one generally adds a random effect $\theta_i$ to represent the unobserved heterogeneity.
• So the conditional distribution of $Y_i$ given $\theta_i$ is indeed Poisson, with mean and variance $\mu_i\theta_i$.
• The idea is: if we knew and observed $\theta_i$, the data would be Poisson. But we do not observe it, so we assume that $\theta_i$ has a gamma distribution with both parameters $\alpha = \beta = 1/\sigma^{2}$. Then $\theta_i$ has mean 1 and variance $\sigma^{2}$, which represents the variance of the unobserved heterogeneity.
• Then the unconditional distribution is given by: $P[Y=y] = \frac{(\alpha+y-1)!}{y!\,(\alpha-1)!} \cdot \frac{\beta^{\alpha}\mu^{y}}{(\mu+\beta)^{\alpha+y}}$

Theory:
• This form is an NB distribution with $r = \alpha$ and $p = \beta/(\mu+\beta)$.
• The mean and variance are related by a proportionality constant: the mean is $\mu$ and the variance is $\mu/p = \mu(1+\sigma^{2}\mu)$.
• This is the form used in the Anders and Huber paper laying out the basic theory for DESeq.

DESeq Theory
• The library DESeq2 uses empirical Bayes ideas for differential expression, that is, for looking at differences in the genes across conditions.
• The idea: let $K_{ij}$ be the count associated with the i-th gene and the j-th sample.
• The assumption is: $K_{ij} \sim \mathrm{NB}(\mu_{ij}, \alpha_i)$
• where $\mu_{ij} = s_j q_{ij}$
• and $\log_2(q_{ij}) = x_j \beta_i$.
• Here $x_j$ is the sample-specific design and $\beta_i$ holds the gene-specific parameters. $s_j$ is a sample-specific size factor, and $\alpha_i$ is a gene-specific dispersion, so that $\mathrm{Var}(K_{ij}) = \mu_{ij} + \alpha_i\mu_{ij}^{2}$. This $\alpha_i$ plays the role of $\sigma^{2}$ above, not of the gamma parameter $\alpha$.

DESeq2 package: contrasts
• Contrasts can be calculated for a DESeqDataSet object for which the GLM coefficients have already been fit using the Wald test steps (DESeq with test="Wald" or using nbinomWaldTest).
• The vector of coefficients is left-multiplied by the contrast vector c to form the numerator of the test statistic.
• The denominator is formed by multiplying the covariance matrix for the coefficients on either side by the contrast vector c.
• The square root of this product is an estimate of the standard error for the contrast.
• The contrast statistic is then compared to a normal distribution as are the Wald statistics for the DESeq2 package.
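
The contrast calculation described in these bullets can be sketched in a few lines of Python. This is not DESeq2 code (DESeq2 is an R package); it only mirrors the arithmetic: numerator $c^{T}\beta$, standard error $\sqrt{c^{T}\Sigma c}$, and a two-sided p-value from the standard normal distribution. The function name `wald_contrast`, the coefficient values, and the covariance matrix are all hypothetical and chosen only for illustration.

```python
import math


def wald_contrast(beta, cov, c):
    """Return (estimate, standard_error, z, two_sided_p) for contrast c."""
    k = len(c)
    # Numerator: contrast vector times the coefficient vector.
    estimate = sum(c[i] * beta[i] for i in range(k))
    # Denominator: square root of c^T * Sigma * c.
    variance = sum(c[i] * cov[i][j] * c[j] for i in range(k) for j in range(k))
    standard_error = math.sqrt(variance)
    z = estimate / standard_error
    # Two-sided p-value under a standard normal reference distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return estimate, standard_error, z, p_value


# Hypothetical log2-scale coefficients for one gene:
# intercept, condition B vs reference, condition C vs reference.
beta = [5.0, 1.2, 0.2]

# Hypothetical covariance matrix of those coefficients.
cov = [
    [0.04, -0.02, -0.02],
    [-0.02, 0.05, 0.02],
    [-0.02, 0.02, 0.05],
]

# Contrast comparing condition B with condition C.
c = [0.0, 1.0, -1.0]

estimate, standard_error, z, p_value = wald_contrast(beta, cov, c)
print(f"log2 fold change B vs C: {estimate:.3f}")
print(f"standard error: {standard_error:.3f}")
print(f"Wald statistic: {z:.3f}, p-value: {p_value:.2e}")
```

Note that the covariance between the two coefficients enters the standard error, so the contrast cannot be tested from the two individual standard errors alone.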