Genome-wide analysis
1-11-2005
• We calculated a t-test for 30,000 genes at once
• How do we handle the results and present the data and results?
• Normalization of the data as a means of removing biases and reducing experimental variability
• Two basic questions in the normalization process:
  • Are we attenuating the signal?
  • Are we compromising the independence of our measurements?
• Outliers: part of the quality control
  • Exclude an observation if we can identify a physical reason for doing so (e.g. a scratch on the slide)
  • Such physical problems are usually "flagged" in the process of quantifying fluorescence intensities
  • The question of excluding a whole array from the analysis is particularly tricky; we will discuss it further later

Randomization Issue
The Problem: Identify genes whose expression in a target organ (lung) of a model organism (rat) is affected by an environmental toxicant (W).
Population: All model organisms of this type (rats).
Sample: 12 randomly selected rats from the population of all rats. ("Randomly" means that all rats in the population have an equal chance of being selected.)
Randomization: Randomly select 6 of the 12 rats to be treated with the toxicant. "Randomly" is the key word here: it is what allows us to ascribe observed changes to the treatment alone.
Prepare samples and extract RNA from all 12 rats.
Randomly assign labeled RNA to different microarrays.
Process microarrays in a random order.

Single Channel Microarrays: Each Sample Assigned to a Different Microarray
• 12 microarrays, 12 samples (C1,...,C6,W1,...,W6)
• Randomly assign samples to different microarrays
• In terms of a single gene, 12 different "spots"
[Figure: the 12 samples on 12 microarrays in the order W3, W5, W6, W1, W2, W4, C5, C1, C2, C4, C6, C3; each array is scanned (labels: "Green Channel" (X_G), "Red Channel" (X_R)) and the log intensities log(X_G), log(X_R) are recorded]
• Proceed with a two-sample t-test as we have done so far

Two-Channel Microarrays: One C and One W Sample Assigned to Each Microarray
• 6 microarrays, 12 samples (C1,...,C6,W1,...,W6)
• Randomly select pairs and assign them to different microarrays
• In terms of a single gene, 6 different "spots"
[Figure: six microarrays carrying the pairs (W3, C5), (W6, C1), (W2, C2), (W5, C6), (W4, C4), (W1, C3); each array is scanned in the "Green Channel" (X_G) and the "Red Channel" (X_R), giving log(X_R) and log(X_G) per array]
• Individual samples are no longer "free" to be assigned to any microarray: this is a restriction on the randomization process
• Measurements are "blocked" within a microarray (terminology)
• We could still randomly assign samples and not have the treatment and the control on each microarray, but this would be unreasonable (arguments to come)
• Need to use a paired t-test

Paired t-test
• For a specific gene, $r_i = x_{iW} - x_{iC}$ is the $i$th difference, $i = 1, \ldots, 6$
• Statistical model of the observed data: $r_i \sim N(\mu, \sigma^2)$
• Differential expression: $\mu \neq 0$
• Estimating parameters: $\hat{\mu} = \bar{r} = \frac{1}{n} \sum_{i=1}^{n} r_i$ and $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (r_i - \bar{r})^2$
• Calculating the t-statistic: $t^* = \frac{\bar{r}}{s \sqrt{1/n}}$
• The "null distribution" is the t-distribution with $n-1$ degrees of freedom
[Figure: density of the t-distribution; x-axis: t-statistics, from -4 to 4]

Two-sample t-test vs paired t-test
                          Two-sample t-test                                    Paired t-test
• t-statistic             $t^* = \frac{\bar{x}_2 - \bar{x}_1}{s_p \sqrt{2/n}}$   $t^* = \frac{\bar{r}}{s \sqrt{1/n}}$
• Reference distribution  $t_{2n-2}$                                           $t_{n-1}$
• Denominator             1.51                                                 0.04
• p-value                 0.870                                                0.002
[Figure: scatter plot of W against C for one gene; both axes run from 6 to 11]

Two-sample t-test vs paired t-test
[Figure: "Standard Deviations", two panels comparing the standard deviation for "Raw" and "Paired TTest"]

Two-sample t-test vs paired t-test
[Figure: "P-values", two panels comparing -log10(p-value) for "Raw" and "Paired TTest"]

Two-sample t-test vs paired t-test
[Figure: "Two-sample vs Paired t-test" scatter plot; x-axis: T statistic, y-axis: Paired t-statistic; the plot carries an annotation involving $s_p$, $s$, $n$, $t^*$ and $t^*_{paired}$]

Two-sample t-test vs paired t-test
[Figure: "Two-sample vs Paired t-test" scatter plot; x-axis: Two-sample t-test p-value, y-axis: Paired t-test p-value, both from 0.0 to 1.0]
• Small advantage for the two-sample t-test purely due to degrees of freedom
• Bigger possible advantage for the paired t-test due to the smaller denominator (standard error)

When is the t-test "better" than the paired t-test?
                Two-sample t    Paired t
Denominator     0.56            0.64
p-value         0.0008          0.0097
[Figure: scatter plot of W (6.0 to 8.5) against C (8.8 to 9.6) for one gene]
• Q: Can we use the two-sample t-test in this case, since it gives us a smaller p-value?
• A: NO! Randomization and non-independence issues remain

Multiple Factor Experiments - Incomplete Block Design
[Figure: array layout by channel (Cy 3, Cy 5); labels: Control, Treatment, Treatment 1, Treatment 2]

Multiple Factor Experiments - Incomplete Block Design
[Figure: candidate designs, labeled in order of appearance]
• No color effect
• Homogeneous variance
• Optimal
• Homogeneous color effect
• Homogeneous variance
• No color effect
• Homogeneous variance
• Sub-Optimal
• Homogeneous variance

Multiple Factor Experiments - Incomplete Block Design
• Homogeneous variance
[Figure: two design diagrams connecting the conditions C, T1, T2 and T1 & T2]

limma
... is a package for the analysis of microarray data, especially the use of linear models for analyzing designed experiments and the assessment of differential expression.
• Specially constructed data objects to represent various aspects of microarray data
• Specially constructed "object methods" for importing, normalizing, displaying and analyzing microarray data
• Unique in the implementation of the empirical Bayes procedure for identifying differentially expressed genes by "borrowing" information from different genes (everything so far has been gene by gene)
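
The comparison between the two-sample and the paired t-statistic made in the earlier slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `paired_t` and `two_sample_t` and the six control/treated log-intensities are made up for this example and do not come from the lecture data. The formulas are the ones from the slides, with equal group sizes and a pooled variance for the two-sample case.

```python
import math
import statistics


def paired_t(treated, control):
    """Return (t*, standard error, degrees of freedom) for a paired t-test."""
    if len(treated) != len(control):
        raise ValueError("paired samples must have the same length")
    diffs = [w - c for w, c in zip(treated, control)]  # r_i = x_iW - x_iC
    n = len(diffs)
    r_bar = statistics.mean(diffs)
    s = statistics.stdev(diffs)  # sample SD, n - 1 in the denominator
    se = s * math.sqrt(1 / n)    # denominator of the paired t-statistic
    return r_bar / se, se, n - 1


def two_sample_t(treated, control):
    """Return (t*, standard error, degrees of freedom) for a pooled-variance
    two-sample t-test with equal group sizes."""
    if len(treated) != len(control):
        raise ValueError("this sketch assumes equal group sizes")
    n = len(treated)
    # With equal group sizes the pooled variance is the average of the two.
    pooled_var = (statistics.variance(treated) + statistics.variance(control)) / 2
    se = math.sqrt(pooled_var) * math.sqrt(2 / n)
    t = (statistics.mean(treated) - statistics.mean(control)) / se
    return t, se, 2 * n - 2


# Hypothetical log-intensities for one gene on six two-channel arrays.
# The array-to-array variation is large, the within-array difference is small.
control = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
treated = [6.1, 7.2, 8.1, 9.3, 10.2, 11.1]

t_pair, se_pair, df_pair = paired_t(treated, control)
t_two, se_two, df_two = two_sample_t(treated, control)

print(f"paired:     t* = {t_pair:.3f}, denominator = {se_pair:.3f}, df = {df_pair}")
print(f"two-sample: t* = {t_two:.3f}, denominator = {se_two:.3f}, df = {df_two}")
```

With data like these, where most of the variability is between arrays, the paired denominator is much smaller than the two-sample one, which mirrors the pattern in the first comparison table. Turning either statistic into a p-value requires the t-distribution with the stated degrees of freedom, which this sketch leaves out.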