MSCBIO 2070/02-710: Computational Genomics, Spring 2016
HW2: Gene expression analysis
Due: 24:00 EST, Mar 14, 2016 by autolab
Your goals in this assignment are to
1. Understand the basics of multiple hypothesis testing
2. Explore properties of linear models
3. Investigate moderated T statistics
4. Understand data structure and apply principal component analysis
5. Compute permutation statistics
What to hand in.
• One report (in pdf format) addressing each of the following questions, including the figures generated
by R when appropriate.
• All source code for the R exercises. We should be able to run the source code and produce the figures
requested.
Submit a zip file containing the completed code (if any) and the pdf file (if any) to autolab. The zip file
should have the following structure
./S2016HW2.pdf
./Q3/
put all code related to Q3 here, if any
./Q4/
put all code related to Q4 here, if any
./Q5/
put all code related to Q5 here, if any
1. [8 points] Hypothesis Testing
Suppose you will test 20,000 six-sided dice in search of dice that have a probability of rolling 6 that is
greater than 1/6. Your plan is to roll each die four times and declare any die that rolls 6 all four times
to be a die that has probability of rolling 6 that is greater than 1/6. Suppose that, unknown to you,
one die will roll 6 with probability 1, 10 dice will roll 6 with probability 0.5, 20 dice will roll 6 with
probability 0.4, and 100 dice will roll 6 with probability 0.2. The other 20,000-(1+10+20+100) dice
are regular six-sided dice that roll 6 with probability 1/6. Use the definitions given in the notes on
mixture modeling of the p-value distribution to compute the following quantities for this die-rolling
scenario. Of course, you will need to draw an analogy between this hypothetical die testing problem
and the testing for differential expression in order for this problem to make sense (e.g., regular dice
are like equivalently expressed genes, dice with a greater than 1/6 probability of rolling 6 are like
differentially expressed genes, etc.).
Your task
(a) (2 points) What are the null and alternative hypotheses in this case?
The null hypothesis H0: the die is a regular die with probability exactly 1/6 of rolling 6.
The alternative hypothesis H1: the die has a probability greater than 1/6 of rolling 6.
(b) (3 points) Write down the expression for FWER in terms of the quantities given. Do not evaluate
the actual value as it is very close to 1.
V is the number of false positive cases.
FWER = P(V ≥ 1)
     = 1 − P(V = 0)
     = 1 − (1 − (1/6)^4)^(20000 − (1+10+20+100))
     = 1 − (1 − 1/1296)^19869
(c) (3 points) Write down the expression and evaluate the FDR for this scenario.
First we apply the approximation
FDR = E(V/R) ≈ E(V)/E(R)
Here, we have
• V ∼ Binomial(19869, (1/6)^4)
• R ∼ Binomial(19869, (1/6)^4) + Binomial(100, (1/5)^4) + Binomial(20, (2/5)^4) + Binomial(10, (1/2)^4) + Binomial(1, 1^4)
Thus,
FDR = E(V)/E(R)
    = (1/6)^4 × 19869 / [ (1/6)^4 × 19869 + (1/5)^4 × 100 + (2/5)^4 × 20 + (1/2)^4 × 10 + 1^4 × 1 ]
    = 0.87
• If you are interested in why the approximation may apply, please see the reference and
related material.
• Another way to think of FDR is as follows,
FDR = P(EE | four 6s)
    = P(four 6s | EE) P(EE) / P(four 6s)
    = [ (1/6)^4 × (19869/20000) ] / [ ( (1/6)^4 × 19869 + (1/5)^4 × 100 + (2/5)^4 × 20 + (1/2)^4 × 10 + 1^4 × 1 ) / 20000 ]
    = 0.87
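As a quick arithmetic check, the value can be reproduced in R (a small sketch using only the numbers above):
num <- (1/6)^4 * 19869
den <- (1/6)^4 * 19869 + (1/5)^4 * 100 + (2/5)^4 * 20 + (1/2)^4 * 10 + 1^4 * 1
num / den   # approximately 0.87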
Reference:
Storey, J. D., Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings
of the National Academy of Sciences, 100(16), 9440-9445.
2. [8 points] Linear model
Recall from the lecture that given a gene expression experiment with two groups of two samples each,
we can specify a linear model for the expression of a single gene as,
y = Xβ + ε,
where X is the 4 × 2 design matrix with rows (1, 0), (1, 0), (1, 1), (1, 1), β = (β1, β2)^T, and
ε = (ε1, ε2, ε3, ε4)^T is the error vector.
Your task
Use the formula for the solution to least squares regression to show that if we rewrite this as
y = X'β' + ε',
where X' is the 4 × 2 design matrix with rows (1, 0), (1, 0), (0, 1), (0, 1) and β' = (β1', β2')^T,
then ε = ε'. Start by writing X' as XA. What is A? What is the relationship between β and
β'? Formulate a general statement/theorem about the equivalence of different model specifications.
Given the linear model
y = Xβ + ε,
least squares regression gives the estimate
β = (X^T X)^−1 X^T Y.
It is not hard to find an invertible linear transformation A satisfying X' = XA; here
A = [ 1 0 ; −1 1 ]   (rows (1, 0) and (−1, 1)),
so that X' = XA.
Then we can show that the residuals ε and ε' are in fact the same:
ε' = Y − X'(X'^T X')^−1 X'^T Y
   = Y − XA((XA)^T (XA))^−1 (XA)^T Y
   = Y − XA(A^T (X^T X) A)^−1 A^T X^T Y
   = Y − XA A^−1 (X^T X)^−1 (A^T)^−1 A^T X^T Y
   = Y − X(X^T X)^−1 X^T Y
   = ε.
We cannot distribute the inverse over (X'^T X') directly, since X' is not invertible; the step above is
valid because A and X^T X are invertible. The coefficients are related by β = Aβ', i.e. β' = A^−1 β.
In general, two design matrices related by an invertible transformation (X' = XA) specify the same
model: they give identical fitted values and residuals, and their coefficients are related through A.
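As a numerical sanity check, one can verify in R that the two parameterizations give identical residuals; a minimal sketch with made-up data (the design matrices follow the problem setup):
set.seed(1)
y  <- rnorm(4)
X  <- cbind(1, c(0, 0, 1, 1))        # original design matrix
A  <- matrix(c(1, -1, 0, 1), 2, 2)   # so that X' = X %*% A
Xp <- X %*% A                        # reparameterized design matrix
res  <- y - X  %*% solve(t(X)  %*% X)  %*% t(X)  %*% y
resp <- y - Xp %*% solve(t(Xp) %*% Xp) %*% t(Xp) %*% y
all.equal(as.vector(res), as.vector(resp))   # TRUE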
3. [16 points] Moderated T statistic
Here we will write some custom R code to calculate moderated T statistics.
Your task
(a) (2 points) Begin by writing a simple function to calculate a T statistic using equal variance
assumption. Your function should take 2 inputs: a vector of data and a vector specifying the
group membership numerically as 1, 2 as in myttest<-function(x, grp){...}
The formula for an equal-variance T statistic is
t = (X̄1 − X̄2) / ( s_{X1,X2} · sqrt(1/n1 + 1/n2) )
where the pooled standard deviation is
s_{X1,X2} = sqrt( ( (n1 − 1) s_{X1}^2 + (n2 − 1) s_{X2}^2 ) / (n1 + n2 − 2) )
Include your function in the report.
See (c).
(b) (3 points) As a sanity check, you can simulate some data and check that your code produces
the same T statistic as the built-in R function. You can simulate the data as
x = c(rnorm(20), rnorm(20) + 1)
grp = rep(c(1, 2), each = 20)
The T test can then be executed as t.test(x ~ grp, var.equal=T).
Make sure you get the same value from myttest(x, grp) and t.test(x ~ grp, var.equal=T).
If you execute the following code block, the output T statistic would be -2.845004.
set.seed(1)
x <- c(rnorm(20), rnorm(20) + 1)
grp <- rep(c(1, 2), each = 20)
t.test(x ~ grp, var.equal = T)
myttest(x, grp, 0)
(c) (1 point) Define an additional parameter to be added to the denominator of the equation,
as myttest<-function(x, grp, s0){...}; a value of 0 should leave the function result unchanged.
This is the "fudge factor" in SAM analysis. Include your updated function in the report.
myttest <- function(x, grp, s0) {
  data <- data.frame(x = x, grp = grp)
  m1 <- mean(data$x[data$grp == "1"])
  s1 <- sd(data$x[data$grp == "1"])
  n1 <- length(data$x[data$grp == "1"])
  m2 <- mean(data$x[data$grp == "2"])
  s2 <- sd(data$x[data$grp == "2"])
  n2 <- length(data$x[data$grp == "2"])
  # pooled standard error plus the SAM "fudge factor" s0
  se <- sqrt((1/n1 + 1/n2) * ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)) + s0
  tval <- (m1 - m2) / se
  return(tval)
}
(d) (3 points) We can simulate gene expression data with variance drawn from an inverse χ2 distribution as
simData <- function() {
  data = matrix(nrow = 5000, ncol = 40)
  sd = sqrt(5 / rchisq(5000, df = 3))
  for (i in 1:5000) {
    data[i, ] = rnorm(40, 0, sd[i])
  }
  data[1:500, 1:20] = data[1:500, 1:20] + 1
  data
}
A dataset simulated with the code above is provided and you can load it with load(’simData.RData’,
verbose=T) which will put simData and simData.grp into your workspace. We have 40 samples with 5000 genes of which the first 500 are differentially expressed. Complete your T-stat
function by allowing x to be a matrix and compute a single T statistic per row. You can use
the built-in apply() function here. Generate a boxplot figure for these T statistics.
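A minimal sketch of this step (assuming the vector version of myttest from (c) and the provided simData.RData):
load("simData.RData", verbose = TRUE)                            # provides simData and simData.grp
tstats <- apply(simData, 1, myttest, grp = simData.grp, s0 = 0)  # one T statistic per gene (row)
boxplot(tstats, ylab = "T statistic")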
[Figure: boxplot of the per-gene T statistics]
(e) (3 points) Given that we know the first 500 genes are simulated to have different means, we can
test the performance of various statistics at distinguishing these genes from the rest. Write a
function that computes the Area Under Receiver Operating Characteristic (ROC) Curve (AUC).
Your function should take 2 inputs: the statistics for each gene and the labels (whether or not
the gene is differentially expressed), as in AUC<-function(values, labels){...}. Feel free to
use any R functions or packages to perform the computation though it may just be easier and
faster to do this from scratch. Include your function in the report.
pROC is a widely used package: you can call its auc() function directly, or use the roc() function
with auc=TRUE.
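A minimal from-scratch sketch based on the rank-sum (Mann-Whitney) identity, assuming larger values indicate differential expression (so pass absolute T statistics) and labels coded 0/1:
AUC <- function(values, labels) {
  pos <- values[labels == 1]
  neg <- values[labels == 0]
  # P(a randomly chosen positive outranks a randomly chosen negative); ties count 1/2
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}
labels <- rep(c(1, 0), c(500, 4500))   # first 500 genes are differentially expressed
AUC(abs(tstats), labels)               # or via the pROC package mentioned above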
(f) (3 points) Now we will see if the moderated T statistic gives us better performance. Compute T
statistics using at least 50 equally spaced s0 values in the range [0,3] and plot the AUC results
relative to the value of s0.
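One possible sketch of this sweep (reusing myttest, AUC, and labels from above):
s0.grid <- seq(0, 3, length.out = 50)
aucs <- sapply(s0.grid, function(s0) {
  t.mod <- apply(simData, 1, myttest, grp = simData.grp, s0 = s0)
  AUC(abs(t.mod), labels)
})
plot(s0.grid, aucs, type = "b", xlab = "s0", ylab = "AUC")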
[Figure: AUC of the moderated T statistic as a function of s0 over the range 0 to 3]
(g) (1 point) Which s0 value achieved the best performance?
The maximum AUC value (AUCmax = 0.9130569) is obtained at s0 = 0.244898. This is just for
reference, since everyone's simulated data will differ slightly.
4. [12 points] Principal Component Analysis
Here you will analyze real gene expression data and investigate the molecular differences between two
types of leukaemia.
Your task
(a) (6 points) Load the provided gene expression data with load(’HumanData.RData’, verbose=T).
This will put data and data.grp into your workspace. Use the svd() function to perform an SVD
decomposition of the row-centered and scaled (to have variance 1) gene expression matrix. Create
a complete pairwise plot of the first 5 eigengenes (principal components) using the pairs()
function and label the samples with the leukaemia type by setting col=data.grp.
[Figure: pairs plot of the first 5 principal components (PC1–PC5), samples coloured by leukaemia type]
• The scale() function scales columns by default, so a standard way to scale each gene (row) is
dataScaled <- t(scale(t(data)))
• Here you are asked to plot the first 5 eigengenes (principal components), so you need to plot
the columns of the v matrix returned by svd() rather than u.
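Putting the whole step together, a possible sketch (assuming data has genes in rows and samples in columns, as above):
load("HumanData.RData", verbose = TRUE)   # provides data and data.grp
dataScaled <- t(scale(t(data)))           # centre and scale each gene (row)
s <- svd(dataScaled)
pairs(s$v[, 1:5], labels = paste0("PC", 1:5), col = data.grp)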
(b) (1 point) Which principal component contains the most information about the difference among
samples?
The 5th principal component contains the most information to distinguish samples, with the
4th principal component adding some information.
(c) (5 points) Repeat the decomposition using only the genes whose mean expression level is > 8.
Also you need to create a complete pairwise plot of the first 5 eigengenes (principal components).
How are the new PCA results different? Recall that the SVD will return principal components in
the order of decreasing singular values and consequently decreasing ”variance explained”. What
is the potential explanation for why the results on this "filtered" dataset are different?
[Figure: pairs plot of the first 5 principal components for the filtered data (genes with mean expression > 8), samples coloured by leukaemia type]
The 4th principal component now seems to have the most discriminatory power. By removing the genes
whose mean expression level is ≤ 8, we remove many noisy, low-expression genes whose variance reflects
measurement noise rather than biology, which changes the variance structure and the ordering of the
components.
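The filtered analysis can be sketched in the same way (again assuming genes are rows):
keep <- rowMeans(data) > 8                    # genes with mean expression > 8
s.filtered <- svd(t(scale(t(data[keep, ]))))
pairs(s.filtered$v[, 1:5], labels = paste0("PC", 1:5), col = data.grp)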
5. [36 points] Differential Expression
Use the same leukaemia data (Question 4) and your custom T stat function (Question 3) to calculate
the regular (non-moderated) T statistic for differential expression across the two leukaemia types.
Your task
(a) (5 points) Use the T distribution probability function pt() to compute the corresponding p-values
(check the function help by typing ?pt for available options). What are the degrees of
freedom here? Since we have no specific hypothesis about genes being up or down regulated, we
will use the two-tailed T-test which considers both tails of the distribution (Note the distribution
is symmetric). As a sanity check, make sure you get values in the range [0, 1] and larger T
statistics (in absolute value) produce smaller p-values. Plot a histogram of the resulting p-values.
Hint: You can also use the built-in t.test() function with var.equal=T to spot check the
results.
There are 45 degrees of freedom (n1 + n2 − 2 = 47 − 2 = 45).
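A possible sketch of this step (assuming data and data.grp from Question 4, with groups coded 1 and 2 as expected by myttest from Question 3):
t.leuk <- apply(data, 1, myttest, grp = data.grp, s0 = 0)    # per-gene T statistics
pvals <- 2 * pt(abs(t.leuk), df = 45, lower.tail = FALSE)    # two-tailed, df = n1 + n2 - 2
hist(pvals, breaks = 20, xlab = "p-values")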
[Figure: p-values plotted against |T statistic| (sanity check)]
[Figure: histogram of the p-values]
(b) (2 points) Calculate the FDR using the Benjamini-Hochberg method with the function p.adjust.
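A one-line sketch (assuming pvals from part (a)):
pvals.bh <- p.adjust(pvals, method = "BH")
hist(pvals.bh, breaks = 20, xlab = "corrected p-values")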
[Figure: histogram of the BH-corrected p-values]
(c) (2 points) Recall that the q-value FDR control method multiplies the above corrected p-values
by π0 . Calculate π0 for the p-value distribution using λ = 0.5.
You could find the formula to calculate π0 from the reference.
π0 = #{ p_i > λ, i = 1, ..., n } / ( n × (1 − λ) )
   = 3603 / (9012 × 0.5)
   = 0.7996
[Figure: histogram of the q-values]
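A possible sketch of this calculation (assuming pvals and pvals.bh from the previous parts):
lambda <- 0.5
pi0 <- sum(pvals > lambda) / (length(pvals) * (1 - lambda))
qvals <- pi0 * pvals.bh
hist(qvals, breaks = 20, xlab = "q-values")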
Reference:
Storey, J. D., Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings
of the National Academy of Sciences, 100(16), 9440-9445.
(d) (2 points) How many genes are differentially expressed at a BH FDR <0.1? What about the
q-value FDR < 0.1?
There are 721 genes which are differentially expressed at a BH FDR <0.1. There are 817 genes
which are differentially expressed at a q-value FDR < 0.1.
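These counts can be obtained directly from the adjusted values above, for example:
sum(pvals.bh < 0.1)   # genes with BH FDR < 0.1
sum(qvals < 0.1)      # genes with q-value FDR < 0.1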
(e) (6 points) Permutation strategy 1: Use set.seed(1) to make sure your sampling is ”nonrandom”. Generate 10 sets of permutation p-values by running the T-test with permuted
data.grp labels. You can simply set grp=grp[sample(1:length(grp))]. Plot the resulting p-value histograms as 10 panels on the same plot. Do these p-values follow a uniform
distribution?
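A possible sketch of strategy 1 (reusing myttest and the p-value calculation from part (a)):
set.seed(1)
par(mfrow = c(2, 5))
for (k in 1:10) {
  grp.perm <- data.grp[sample(1:length(data.grp))]             # permute group labels
  t.perm <- apply(data, 1, myttest, grp = grp.perm, s0 = 0)
  p.perm <- 2 * pt(abs(t.perm), df = 45, lower.tail = FALSE)
  hist(p.perm, xlab = "p-values", main = paste("Permutation", k))
}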
[Figure: 10 panels of p-value histograms, one per group-label permutation]
Only in some permutations is the p-value distribution close to uniform; in others it remains noticeably skewed.
(f) (6 points) Permutation strategy 2: Use set.seed(1) to make sure your sampling is "nonrandom".
Now, instead of permuting the group labels, we will permute each gene separately:
we can define a new dataset as datar=t(apply(data,1,sample)), which has the effect of
applying a different permutation to each row. Repeat the p-value calculation and plot the
histogram for 10 permuted datasets. Are the resulting p-values closer to a uniform distribution?
Explain why the two permutation strategies produce different results.
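A possible sketch of strategy 2 (each gene, i.e. each row, is permuted independently):
set.seed(1)
par(mfrow = c(2, 5))
for (k in 1:10) {
  datar <- t(apply(data, 1, sample))                           # permute each row separately
  t.perm <- apply(datar, 1, myttest, grp = data.grp, s0 = 0)
  p.perm <- 2 * pt(abs(t.perm), df = 45, lower.tail = FALSE)
  hist(p.perm, xlab = "p-values", main = paste("Dataset", k))
}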
[Figure: 10 panels of p-value histograms, one per gene-wise permuted dataset]
Compared with permutation strategy 1, the p-value distributions in nearly all permutations are
closer to uniform. Under permutation strategy 1, genes with similar expression patterns still share
the same permuted group labels; if the permuted labels happen to be close to the original labels, the
resulting p-value distribution is skewed rather than uniform. Under permutation strategy 2, each gene
is permuted independently, so the group structure is completely broken and there is little chance that
similar genes share similar label assignments.
(g) (4 points) Use the group permutations (Permutation strategy 1) to calculate an empirical
FDR for the T statistics from part (a). You can combine up and down-regulated genes and
simply use absolute values of the T statistics. Use set.seed(1) if sampling is required.
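One way to sketch this (assuming t.leuk from part (a); simple but not optimized, since it compares every observed |T| against all permuted statistics):
set.seed(1)
B <- 10
t.obs <- abs(t.leuk)
t.null <- replicate(B, {
  grp.perm <- data.grp[sample(1:length(data.grp))]
  abs(apply(data, 1, myttest, grp = grp.perm, s0 = 0))
})
# empirical FDR at each gene's own |T| cutoff:
# (average number of permuted statistics >= cutoff) / (number of observed statistics >= cutoff)
emp.fdr <- sapply(t.obs, function(cut) (sum(t.null >= cut) / B) / sum(t.obs >= cut))
sum(emp.fdr < 0.1)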
[Figure: histogram of the empirical FDR values]
(h) (3 points) Plot the empirical FDR against the BH FDR. Which one is more conservative? How
many genes are differentially expressed with an empirical FDR <0.1?
[Figure: empirical FDR plotted against BH FDR]
There are 803 genes with an empirical FDR < 0.1 and 721 genes with a BH FDR < 0.1, so the BH FDR is
more conservative than the empirical FDR.
(i) (6 points) Since the true differential expression status of the genes is unknown, we will use the
number of genes with an empirical FDR < 0.1 as a performance metric. Using this metric, make
a plot of moderated T statistic performance for different values of s0. The moderated T statistic
will have a different distribution, so make sure to rerun the permutation analysis every time.
Find the optimal s0 constant for this dataset. Use set.seed(1) if sampling is required.
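A possible sketch of the sweep, assuming a hypothetical helper nSigAtFDR(s0) that repeats the part (g) computation with the moderated statistic (myttest with the given s0) and returns the number of genes with empirical FDR < 0.1:
s0.grid <- seq(0, 3, length.out = 50)
n.sig <- sapply(s0.grid, nSigAtFDR)   # nSigAtFDR is the hypothetical wrapper described above
plot(s0.grid, n.sig, type = "b", xlab = "s0", ylab = "# genes with empirical FDR < 0.1")
s0.grid[which.max(n.sig)]             # s0 maximizing the number of discoveries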
[Figure: number of genes with empirical FDR < 0.1 as a function of s0]
We use 10 permutations for each of 50 equally spaced s0 values in the range [0, 3]. The optimal s0
constant is 0.06122449, and correspondingly there are 876 genes with an empirical FDR < 0.1.