MSCBIO 2070/02-710: Computational Genomics, Spring 2016
HW2: Gene expression analysis
Due: 24:00 EST, Mar 14, 2016 by autolab

Your goals in this assignment are to
1. Understand the basics of multiple hypothesis testing
2. Explore properties of linear models
3. Investigate moderated T statistics
4. Understand data structure and apply principal component analysis
5. Compute permutation statistics

What to hand in.
• One report (in pdf format) addressing each of the following questions, including the figures generated by R when appropriate.
• All source code for the R exercises. We should be able to run the source code and produce the figures requested.

Submit a zip file containing the completed code (if any) and the pdf file (if any) to autolab. The zip file should have the following structure:

    ./S2016HW2.pdf
    ./Q3/   put all code related to Q3 here, if any
    ./Q4/   put all code related to Q4 here, if any
    ./Q5/   put all code related to Q5 here, if any

1. [8 points] Hypothesis Testing

Suppose you will test 20,000 six-sided dice in search of dice that have a probability of rolling 6 that is greater than 1/6. Your plan is to roll each die four times and declare any die that rolls 6 all four times to be a die that has probability of rolling 6 that is greater than 1/6. Suppose that, unknown to you, one die will roll 6 with probability 1, 10 dice will roll 6 with probability 0.5, 20 dice will roll 6 with probability 0.4, and 100 dice will roll 6 with probability 0.2. The other 20,000 − (1 + 10 + 20 + 100) dice are regular six-sided dice that roll 6 with probability 1/6. Use the definitions given in the notes on mixture modeling of the p-value distribution to compute the following quantities for this die-rolling scenario.
Of course, you will need to draw an analogy between this hypothetical die testing problem and the testing for differential expression in order for this problem to make sense (e.g., regular dice are like equivalently expressed genes, dice with greater than 1/6 probability of rolling 6 are like differentially expressed genes, etc.).

Your task

(a) (2 points) What are the null and alternative hypotheses in this case?

The null hypothesis $H_0$: the die is a regular die with probability exactly $\frac{1}{6}$ of rolling 6.
The alternative hypothesis $H_1$: the die has probability greater than $\frac{1}{6}$ of rolling 6.

(b) (3 points) Write down the expression for FWER in terms of the quantities given. Do not evaluate the actual value as it is very close to 1.

$V$ is the number of false positive cases.

\[
\mathrm{FWER} = P(V \ge 1) = 1 - P(V = 0)
= 1 - \left(1 - \left(\tfrac{1}{6}\right)^4\right)^{20000 - (1 + 10 + 20 + 100)}
= 1 - \left(1 - \tfrac{1}{1296}\right)^{19869}
\]

(c) (3 points) Write down the expression and evaluate the FDR for this scenario.

First we apply the approximation

\[
\mathrm{FDR} = E\left(\frac{V}{R}\right) \approx \frac{E(V)}{E(R)}
\]

Here, we have
• $V \sim \mathrm{Binomial}(19869, (\tfrac{1}{6})^4)$
• $R \sim \mathrm{Binomial}(19869, (\tfrac{1}{6})^4) + \mathrm{Binomial}(100, (\tfrac{1}{5})^4) + \mathrm{Binomial}(20, (\tfrac{2}{5})^4) + \mathrm{Binomial}(10, (\tfrac{1}{2})^4) + \mathrm{Binomial}(1, 1^4)$

Thus,

\[
\mathrm{FDR} = \frac{E(V)}{E(R)}
= \frac{(\tfrac{1}{6})^4 \times 19869}{(\tfrac{1}{6})^4 \times 19869 + (\tfrac{1}{5})^4 \times 100 + (\tfrac{2}{5})^4 \times 20 + (\tfrac{1}{2})^4 \times 10 + 1^4 \times 1}
= 0.87
\]

• If you are interested in why the approximation may apply, please see the reference and related material.
• Another way to think of FDR is as follows, where EE denotes an equivalently expressed gene (here, a regular die):

\[
\mathrm{FDR} = P(\mathrm{EE} \mid \text{four 6s})
= \frac{P(\text{four 6s} \mid \mathrm{EE})\, P(\mathrm{EE})}{P(\text{four 6s})}
= \frac{(\tfrac{1}{6})^4 \times \tfrac{19869}{20000}}{\dfrac{(\tfrac{1}{6})^4 \times 19869 + (\tfrac{1}{5})^4 \times 100 + (\tfrac{2}{5})^4 \times 20 + (\tfrac{1}{2})^4 \times 10 + 1^4 \times 1}{20000}}
= 0.87
\]

Reference: Storey, J. D., Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 100(16), 9440-9445.

2. [8 points] Linear model

Recall from the lecture that given a gene expression experiment with two groups of two samples each, we can specify a linear model for the expression of a single gene as

\[
y = X\beta + \epsilon, \qquad
y = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 1 \\ 1 & 1 \end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + \epsilon
\]

Your task

Use the formula for the solution to least squares regression to show that if we rewrite this as

\[
y = X'\beta' + \epsilon', \qquad
y = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} \beta'_1 \\ \beta'_2 \end{pmatrix} + \epsilon'
\]

then we have $\epsilon = \epsilon'$. Start by writing $X'$ as $XA$. What is $A$? What is the relationship between $\beta$ and $\beta'$? Formulate a general statement/theorem about the equivalence of different model specifications.

Given the linear model $y = X\beta + \epsilon$, apply least squares regression to get the formula for $\beta$:

\[
\beta = (X^T X)^{-1} X^T y
\]

It is not hard to find a linear transformation $A$ which satisfies $X' = XA$. To avoid symbol clutter, $X$ and $X'$ are switched.

\[
X' = X \begin{pmatrix} 1 & 0 \\ -1 & 1 \end{pmatrix}
\]

Then we can prove that $\epsilon$ and $\epsilon'$ are in fact the same.

\[
\begin{aligned}
\epsilon' &= y - X'(X'^T X')^{-1} X'^T y \\
&= y - XA\left((XA)^T (XA)\right)^{-1} (XA)^T y \\
&= y - XA\left(A^T (X^T X) A\right)^{-1} A^T X^T y \\
&= y - XA A^{-1} (X^T X)^{-1} (A^T)^{-1} A^T X^T y \\
&= y - X (X^T X)^{-1} X^T y \\
&= \epsilon
\end{aligned}
\]

We cannot distribute the inverse over the factors of $X'^T X'$, since $X'$ is not square and hence not invertible, but all steps are valid as long as $A$ is invertible. The same algebra gives $\beta' = A^{-1}\beta$, so here $\beta'_1 = \beta_1$ and $\beta'_2 = \beta_1 + \beta_2$. In general, for any invertible $A$, the design matrices $X$ and $XA$ specify equivalent models: they give the same residuals, and the coefficients are related by $\beta' = A^{-1}\beta$.

3. [16 points] Moderated T statistic

Here we will write some custom R code to calculate moderated T statistics.

Your task

(a) (2 points) Begin by writing a simple function to calculate a T statistic using the equal variance assumption. Your function should take 2 inputs: a vector of data and a vector specifying the group membership numerically as 1, 2, as in myttest<-function(x, grp){...}. The formula for an equal-variance T statistic is

\[
t = \frac{\bar{X}_1 - \bar{X}_2}{s_{X_1 X_2} \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}
\]

where

\[
s_{X_1 X_2} = \sqrt{\frac{(n_1 - 1)\, s_{X_1}^2 + (n_2 - 1)\, s_{X_2}^2}{n_1 + n_2 - 2}}
\]

Include your function in the report.

See (c).

(b) (3 points) As a sanity check, you can simulate some data and check that your code produces the same T statistic as the built-in R function.
You can simulate the data as

    x = c(rnorm(20), rnorm(20) + 1);
    grp = rep(c(1, 2), each = 20);

The T test can then be executed as t.test(x~grp, var.equal=T). Make sure you get the same value when you execute myttest(x,grp) and t.test(x~grp, var.equal=T).

If you execute the following code block, the output T statistic would be -2.845004.

    set.seed(1)
    x <- c(rnorm(20), rnorm(20) + 1)
    grp <- rep(c(1, 2), each = 20)
    t.test(x ~ grp, var.equal = TRUE)
    myttest(x, grp, 0)

(c) (1 point) Define an additional parameter to be added to the denominator of the equation as myttest<-function(x, grp, s0){...}; a value of 0 should leave the function result unchanged. This is the "fudge factor" in SAM analysis. Include your updated function in the report.

    # Equal-variance T statistic with fudge factor s0 added to the denominator.
    # s0 defaults to 0 so that myttest(x, grp) gives the ordinary T statistic.
    myttest <- function(x, grp, s0 = 0) {
      data = data.frame(x = x, grp = grp)
      # mean, standard deviation and size of group 1
      m1 = mean(data$x[data$grp == "1"])
      s1 = sd(data$x[data$grp == "1"])
      n1 = length(data$x[data$grp == "1"])
      # mean, standard deviation and size of group 2
      m2 = mean(data$x[data$grp == "2"])
      s2 = sd(data$x[data$grp == "2"])
      n2 = length(data$x[data$grp == "2"])
      # pooled standard error plus the fudge factor
      se <- sqrt((1/n1 + 1/n2) * ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)) + s0
      tval <- (m1 - m2) / se
      return(tval)
    }

(d) (3 points) We can simulate gene expression data with variance drawn from an inverse $\chi^2$ distribution as

    simData <- function() {
      data = matrix(nrow = 5000, ncol = 40)
      # one standard deviation per gene, from a scaled inverse chi-squared variance
      sd = sqrt(5 / rchisq(5000, df = 3))
      for (i in 1:5000) {
        data[i, ] = rnorm(40, 0, sd[i])
      }
      # shift the first 500 genes in the first 20 samples
      data[1:500, 1:20] = data[1:500, 1:20] + 1
      data
    }

A dataset simulated with the code above is provided and you can load it with load('simData.RData', verbose=T), which will put simData and simData.grp into your workspace. We have 40 samples with 5000 genes, of which the first 500 are differentially expressed. Complete your T-stat function by allowing x to be a matrix and computing a single T statistic per row. You can use the built-in apply() function here. Generate a boxplot figure for these T statistics.
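A minimal sketch of the row-wise computation asked for in part (d). It uses a small simulated matrix rather than the provided simData.RData, and the wrapper name rowTstat is a hypothetical choice, not something the assignment prescribes.

```r
# Equal-variance T statistic for one vector, with optional fudge factor s0.
myttest <- function(x, grp, s0 = 0) {
  x1 <- x[grp == 1]
  x2 <- x[grp == 2]
  n1 <- length(x1)
  n2 <- length(x2)
  pooled <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
  (mean(x1) - mean(x2)) / (pooled * sqrt(1 / n1 + 1 / n2) + s0)
}

# Hypothetical wrapper: one T statistic per row of a genes-by-samples matrix.
rowTstat <- function(mat, grp, s0 = 0) {
  apply(mat, 1, myttest, grp = grp, s0 = s0)
}

# Small simulated example (100 genes, 40 samples) standing in for simData.
set.seed(1)
grp <- rep(c(1, 2), each = 20)
mat <- matrix(rnorm(100 * 40), nrow = 100)
mat[1:10, 1:20] <- mat[1:10, 1:20] + 1
tstats <- rowTstat(mat, grp)

# Spot check the first gene against the built-in equal-variance T test.
y1 <- mat[1, ]
builtin <- unname(t.test(y1 ~ grp, var.equal = TRUE)$statistic)

boxplot(tstats, ylab = "T statistic")
```

With s0 = 0 the first entry of tstats should agree with builtin up to floating-point error; passing a positive s0 shrinks every statistic toward zero.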
[Figure: boxplot of the T statistics]

(e) (3 points) Given that we know the first 500 genes are simulated to have different means, we can test the performance of various statistics at distinguishing these genes from the rest. Write a function that computes the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC). Your function should take 2 inputs: the statistics for each gene and the labels (whether or not the gene is differentially expressed), as in AUC<-function(values, labels){...}. Feel free to use any R functions or packages to perform the computation, though it may just be easier and faster to do this from scratch. Include your function in the report.

pROC is a widely used package. You could call the auc() function directly, or use the roc() function with auc=TRUE.

(f) (3 points) Now we will see if the moderated T statistic gives us better performance. Compute T statistics using at least 50 equally spaced s0 values in the range [0,3] and plot the AUC results relative to the value of s0.

[Figure: AUC versus s0]

(g) (1 point) Which s0 value achieved the best performance?

The maximum AUC value (AUCmax = 0.9130569) is obtained at s0 = 0.244898. This is just for reference, since everybody uses slightly different simulated data.

4. [12 points] Principal Component Analysis

Here you will analyze real gene expression data and investigate the molecular differences between two types of leukaemia.

Your task

(a) (6 points) Load the provided gene expression data with load('HumanData.RData', verbose=T). This will put data and data.grp into your workspace. Use the svd() function to perform singular value decomposition on the row-centered and scaled (to have variance of 1) gene expression matrix. Create a complete pairwise plot of the first 5 eigengenes (principal components) using the pairs() function and label the samples with the leukaemia type by setting col=data.grp.
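A minimal sketch of the decomposition in part (a), run on a simulated stand-in because HumanData.RData is not reproduced here. The names expr, expr.grp, exprScaled and varExplained are illustrative assumptions; with the real data you would use data and data.grp instead.

```r
# Simulated stand-in for the genes-by-samples matrix and the group labels.
set.seed(1)
expr <- matrix(rnorm(200 * 12, mean = 8), nrow = 200)
expr.grp <- rep(c(1, 2), each = 6)

# scale() works on columns, so transpose to center and scale each gene (row).
exprScaled <- t(scale(t(expr)))

# SVD: the columns of v are the eigengenes, with one value per sample.
s <- svd(exprScaled)

# Fraction of variance explained by each component.
varExplained <- s$d^2 / sum(s$d^2)

# Pairwise plot of the first 5 eigengenes, colored by group.
pairs(s$v[, 1:5], labels = paste0("PC", 1:5), col = expr.grp)
```

Because svd() returns singular values in decreasing order, varExplained is also decreasing, which is worth keeping in mind when a later component turns out to separate the groups best.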
[Figure: pairwise plot of PC1 to PC5, samples colored by leukaemia type]

• The scale function scales columns by default, thus a standard way to scale the data is as follows: dataScaled <- t(scale(t(data)))
• Here you are asked to plot the first 5 eigengenes (principal components), so you need to plot svd$v instead of svd$u.

(b) (1 point) Which principal component contains the most information about the difference among samples?

The 5th principal component contains the most information to distinguish samples, with the 4th principal component adding some information.

(c) (5 points) Repeat the decomposition using only the genes whose mean expression level is > 8. Again, create a complete pairwise plot of the first 5 eigengenes (principal components). How are the new PCA results different? Recall that the SVD will return principal components in the order of decreasing singular values and consequently decreasing "variance explained". What is the potential explanation for why the results on this "filtered" dataset are different?

[Figure: pairwise plot of PC1 to PC5 for the filtered dataset]

The 4th principal component seems to have the most discriminatory power. By removing the genes whose mean expression level is ≤ 8, we remove some noise embedded in the data and change the variation structure.

5. [36 points] Differential Expression

Use the same leukaemia data (Question 4) and your custom T stat function (Question 3) to calculate the regular (non-moderated) T statistic for differential expression across the two leukaemia types.

Your task

(a) (5 points) Use the T distribution probability function pt() to compute the corresponding p-values (check the function help by typing ?pt for available options). What is the degree of freedom here?
Since we have no specific hypothesis about genes being up or down regulated, we will use the two-tailed T-test, which considers both tails of the distribution (note that the distribution is symmetric). As a sanity check, make sure you get values in the range [0, 1] and that larger T statistics (in absolute value) produce smaller p-values. Plot a histogram of the resulting p-values. Hint: You can also use the built-in t.test() function with var.equal=T to spot check the results.

The degree of freedom here is 45 (n1 + n2 − 2 = 47 − 2 = 45).

[Figure: p-values versus abs(T statistics)]
[Figure: histogram of p-values]

(b) (2 points) Calculate the FDR using the Benjamini-Hochberg method with the function p.adjust.

[Figure: histogram of corrected p-values]

(c) (2 points) Recall that the q-value FDR control method multiplies the above corrected p-values by $\pi_0$. Calculate $\pi_0$ for the p-value distribution using $\lambda = 0.5$. You can find the formula to calculate $\pi_0$ in the reference.

\[
\pi_0 = \frac{\#\{p_i > \lambda \mid i = 1, \dots, n\}}{n \times (1 - \lambda)} = \frac{3603}{9012 \times 0.5} = 0.7996
\]

[Figure: histogram of q-values]

Reference: Storey, J. D., Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 100(16), 9440-9445.

(d) (2 points) How many genes are differentially expressed at a BH FDR <0.1? What about the q-value FDR < 0.1?

There are 721 genes which are differentially expressed at a BH FDR <0.1. There are 817 genes which are differentially expressed at a q-value FDR < 0.1.

(e) (6 points) Permutation strategy 1: Use set.seed(1) to make sure your sampling is "nonrandom". Generate 10 sets of permutation p-values by running the T-test with permuted data.grp labels. You can simply set grp=grp[sample(1:length(grp))]. Plot the resulting p-value histograms as 10 panels on the same plot. Do these p-values follow a uniform distribution?
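The steps in parts (a) to (e) can be sketched as follows, under stated assumptions: the data are simulated rather than loaded from HumanData.RData, the helpers rowT and tToP are hypothetical names, and the q-values are computed as $\pi_0$ times the BH-adjusted p-values, as described in part (c).

```r
# Hypothetical helper: row-wise equal-variance T statistics for a matrix.
rowT <- function(mat, grp, s0 = 0) {
  a <- mat[, grp == 1, drop = FALSE]
  b <- mat[, grp == 2, drop = FALSE]
  n1 <- ncol(a)
  n2 <- ncol(b)
  v1 <- apply(a, 1, var)
  v2 <- apply(b, 1, var)
  pooled <- sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
  (rowMeans(a) - rowMeans(b)) / (pooled * sqrt(1 / n1 + 1 / n2) + s0)
}

# Two-tailed p-value from the T distribution with df degrees of freedom.
tToP <- function(tval, df) 2 * pt(abs(tval), df = df, lower.tail = FALSE)

# Simulated stand-in: 500 genes, 22 samples, first 50 genes shifted in group 2.
set.seed(1)
grp <- rep(c(1, 2), times = c(10, 12))
mat <- matrix(rnorm(500 * length(grp)), nrow = 500)
mat[1:50, grp == 2] <- mat[1:50, grp == 2] + 1.5
df <- length(grp) - 2  # n1 + n2 - 2

tvals <- rowT(mat, grp)
pvals <- tToP(tvals, df)

# Benjamini-Hochberg FDR, pi0 estimate at lambda = 0.5, and q-values.
bh <- p.adjust(pvals, method = "BH")
lambda <- 0.5
pi0 <- sum(pvals > lambda) / (length(pvals) * (1 - lambda))
qvals <- pi0 * bh

# Permutation strategy 1: shuffle the group labels, same shuffle for all genes.
set.seed(1)
permP <- replicate(10, tToP(rowT(mat, grp[sample(1:length(grp))]), df))

# Ten histograms on one page, one per permutation.
par(mfrow = c(2, 5))
for (k in 1:10) {
  hist(permP[, k], main = paste("Permutation", k), xlab = "p-values")
}
```

Each column of permP holds the p-values from one label permutation, so the ten histograms can be compared directly with the histogram of pvals from the unpermuted labels.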
[Figure: ten p-value histograms, one per permutation of the group labels]

In some permutations, the p-value distribution is close to a uniform distribution.

(f) (6 points) Permutation strategy 2: Use set.seed(1) to make sure your sampling is "nonrandom". Now, instead of permuting the group labels, we will permute each gene separately. We can define a new dataset as datar=t(apply(data,1,sample)), which has the effect of applying a different permutation to each row. Repeat the p-value calculation and plot the histogram for 10 permuted datasets. Are the resulting p-values closer to a uniform distribution? Explain why the two permutation strategies produce different results.

[Figure: ten p-value histograms, one per row-wise permuted dataset]

Compared with permutation strategy 1, the p-value distributions in nearly all permutations are closer to a uniform distribution. Under permutation strategy 1, all genes receive the same permuted group labels, so similar genes still share the same group labels.
If the permuted group labels happen to be close to the original labels, the resulting p-value distribution is skewed rather than uniform. Under permutation strategy 2, each gene is permuted independently, so the group labels are completely randomized. There is little chance that similar genes share similar group labels.

(g) (4 points) Use the group permutations (Permutation strategy 1) to calculate an empirical FDR for the T statistics from part (a). You can combine up and down-regulated genes and simply use absolute values of the T statistics. Use set.seed(1) if sampling is required.

[Figure: histogram of the empirical FDR]

(h) (3 points) Plot the empirical FDR against the BH FDR. Which one is more conservative? How many genes are differentially expressed with an empirical FDR <0.1?

[Figure: empirical FDR versus BH FDR]

There are 803 genes with an empirical FDR <0.1. There are 721 genes with a BH FDR <0.1. The BH FDR is more conservative than the empirical FDR.

(i) (6 points) Since the true differential expression status of the genes is unknown, we will use the number of genes with empirical FDR of <0.1 as a performance metric. Using this metric, make a plot of moderated T statistic performance for different values of s0. The moderated T statistic will have a different distribution, so make sure to rerun the permutation analysis every time. Find the optimal s0 constant for this dataset. Use set.seed(1) if sampling is required.

[Figure: number of genes with empirical FDR <0.1 versus s0]

We choose 10 permutations each time and there are 50 equally spaced s0 values in the range [0,3]. The optimal s0 constant is 0.06122449 and correspondingly there are 876 genes with an empirical FDR <0.1.
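A minimal sketch of parts (g) to (i), assuming one common definition of the empirical FDR that the assignment itself does not spell out: for each observed |T|, the average number of permuted |T| values at least that large, divided by the number of observed |T| values at least that large, capped at 1. The data are simulated, and the names rowT, empiricalFDR and countSignificant are hypothetical.

```r
# Hypothetical helper: row-wise equal-variance T statistics with fudge factor s0.
rowT <- function(mat, grp, s0 = 0) {
  a <- mat[, grp == 1, drop = FALSE]
  b <- mat[, grp == 2, drop = FALSE]
  n1 <- ncol(a)
  n2 <- ncol(b)
  pooled <- sqrt(((n1 - 1) * apply(a, 1, var) + (n2 - 1) * apply(b, 1, var)) /
                   (n1 + n2 - 2))
  (rowMeans(a) - rowMeans(b)) / (pooled * sqrt(1 / n1 + 1 / n2) + s0)
}

# Assumed definition: obs is a vector of observed statistics, perm is a
# genes-by-B matrix of statistics computed with permuted group labels.
empiricalFDR <- function(obs, perm) {
  obsAbs <- abs(obs)
  permAbs <- abs(as.vector(perm))
  B <- ncol(perm)
  # average number of permuted statistics at least as extreme, per permutation
  nullCount <- sapply(obsAbs, function(t) sum(permAbs >= t)) / B
  # number of observed statistics at least as extreme
  obsCount <- sapply(obsAbs, function(t) sum(obsAbs >= t))
  pmin(nullCount / obsCount, 1)
}

# Number of genes below the FDR cutoff for a given s0, rerunning the
# permutations (with the same seed) for every s0.
countSignificant <- function(mat, grp, s0, B = 10, cutoff = 0.1) {
  set.seed(1)
  obs <- rowT(mat, grp, s0)
  perm <- replicate(B, rowT(mat, sample(grp), s0))
  sum(empiricalFDR(obs, perm) < cutoff)
}

# Simulated stand-in: 300 genes, 20 samples, first 30 genes shifted in group 2.
set.seed(1)
grp <- rep(c(1, 2), each = 10)
mat <- matrix(rnorm(300 * 20), nrow = 300)
mat[1:30, grp == 2] <- mat[1:30, grp == 2] + 2

s0grid <- seq(0, 3, length.out = 50)
nSig <- sapply(s0grid, function(s0) countSignificant(mat, grp, s0))
bestS0 <- s0grid[which.max(nSig)]

plot(s0grid, nSig, type = "b", xlab = "s0",
     ylab = "genes with empirical FDR < 0.1")
```

The sapply-based counting is quadratic in the number of genes; it is written for clarity rather than speed. This estimate is also not forced to be monotone in |T|, so a different but equally reasonable definition could give somewhat different gene counts.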