Bioinformatics II
Fall 2006
Homework 4
Due Oct. 19, 2006
email to: [email protected]
Email the R code and hand in the output. All plots should have titles.
We are going to do a few statistical tests using the Golub data, which is included in
the Biobase package.
library(Biobase)
data(golubMergeSub)
To extract the data about the samples:
SampleInfo=pData(golubMergeSub)
To extract the probeset summaries (in log2)
golubExpr=log2(exprs(golubMergeSub))
The normalization used produces both negative and zero expression summaries, so
taking log2 results in NaN ("not a number") and -Inf, which we will have to remove.
Print the first few lines of SampleInfo and of golubExpr to see what you have. (Do not
turn this in.)
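As a rough illustration of how such values can be located and set aside, here is a minimal sketch on a small made-up matrix (the matrix raw and its numbers are hypothetical and stand in for exprs(golubMergeSub)). Marking the values as NA is only one possible way of removing them.

```r
# Toy matrix standing in for the expression summaries: 3 probesets x 4 samples
raw <- matrix(c(120, -5,  0, 300,
                 80, 40, 60,  20,
                 10,  0, 15,  25),
              nrow = 3, byrow = TRUE)

# log2 of a negative number gives NaN (with a warning); log2 of zero gives -Inf
logged <- suppressWarnings(log2(raw))

ok <- is.finite(logged)          # TRUE only for ordinary numbers
rowSums(!ok)                     # number of unusable values per probeset

logged[!ok] <- NA                # mark the unusable values as missing
rowMeans(logged, na.rm = TRUE)   # summaries that skip the missing values
```

Many R functions, t.test among them, drop NA values on their own, so marking them is often enough.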
A. Testing differential expression when the samples are independent.
We will use probeset 4. (Row 4 of golubExpr)
It is a bit inconvenient to show the differences between the methods using all 72 patients,
so I have chosen 12 patients
samp=c(13, 42, 6, 37, 15, 51, 45, 68, 62, 27, 34, 71)   # saves the patient numbers
gene4=golubExpr[4,samp]      # saves the 12 gene expression values
ALtype=SampleInfo[samp,2]    # saves ALL or AML for each patient
We will also look at what happens when the data have outliers.
y1=gene4
y1[1]=14
y2=y1
y2[12]=3
1. Get boxplots of gene4, y1 and y2 by cancer type.
Note: If x is a factor and y is a vector of numbers, boxplot(y~x) produces a boxplot for
each level of x.
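A small sketch of the formula interface, using made-up values rather than the Golub data (the vectors x and y below are hypothetical). The main= argument supplies the plot title.

```r
# Hypothetical data: 6 ALL and 6 AML patients
x <- factor(rep(c("ALL", "AML"), each = 6))
y <- c(9.1, 9.4, 8.8, 9.6, 9.0, 9.3, 8.7, 8.9, 8.5, 9.2, 8.6, 9.5)

# One box per level of x; boxplot also returns the plotted summaries invisibly
b <- boxplot(y ~ x,
             main = "Hypothetical expression by cancer type",
             xlab = "Cancer type", ylab = "log2 expression")
b$names   # the factor levels, one per box
b$n       # number of observations in each box
```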
2. a) Do a two-sample t-test for differential expression using gene4, y1, y2.
t.test(y~x)
# does a 2-sample t-test of the equality of the mean of y broken into 2
# groups by the values of x.
b) Notice that the p-value decreased as single points were moved to more extreme
separation of the ALL and AML samples. But even an extremely high ALL value and an
extremely low AML value did not lead to statistical significance. Why is this? (Look at
the formula for the t-test, or have a look at the 95% confidence interval for the difference
in means that is printed by R.)
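To look at the formula in action, the sketch below computes the pieces of the Welch statistic by hand, t = (difference in means) / sqrt(s1^2/n1 + s2^2/n2), before and after one value is made extreme. The vectors a and b and the helper welch.parts are hypothetical and are not the Golub data.

```r
# Hypothetical values: two groups of 6
a <- c(9.1, 9.4, 8.8, 9.6, 9.0, 9.3)
b <- c(8.7, 8.9, 8.5, 9.2, 8.6, 8.8)

# Difference in means, its standard error, and their ratio (the t-value)
welch.parts <- function(u, v) {
  d  <- mean(u) - mean(v)
  se <- sqrt(var(u) / length(u) + var(v) / length(v))
  c(diff = d, se = se, t = d / se)
}

a.out <- a
a.out[1] <- 14                      # push one value of the first group far upward

before <- welch.parts(a, b)
after  <- welch.parts(a.out, b)
rbind(before, after)                # compare numerator and denominator side by side

# The hand computation agrees with t.test (Welch is the default, var.equal = FALSE)
all.equal(unname(before["t"]), unname(t.test(a, b)$statistic))
```

Comparing the diff and se columns shows how both the numerator and the denominator react to a single extreme point.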
3. a) Do a two-sample Wilcoxon test for differential expression using gene4, y1, y2.
wilcox.test
b) Notice that the p-values for gene4 and y1 are identical. Why is this? (Look at how the
Wilcoxon test statistic is computed.)
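One way to explore this is to compare the ranks of the data before and after a value is moved. The sketch below uses made-up values with no ties (y, g and y.out are hypothetical) and makes the largest observation far more extreme.

```r
# Hypothetical data: 12 distinct values, 6 per group
y <- c(9.1, 9.4, 8.8, 9.6, 9.0, 9.3, 8.7, 8.9, 8.5, 9.2, 8.6, 9.5)
g <- factor(rep(c("ALL", "AML"), each = 6))

y.out <- y
y.out[4] <- 14                        # the largest value becomes much larger

identical(rank(y), rank(y.out))       # the ordering of the values is unchanged

w1 <- wilcox.test(y ~ g)
w2 <- wilcox.test(y.out ~ g)
c(w1$statistic, w2$statistic)         # test statistics
c(w1$p.value, w2$p.value)             # p-values
```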
4. a) 1:12 generates the numbers 1,2,3,...,12. What does sample(1:12,12) do?
b) We will do a permutation test (just for gene4) by computing 100 permutations of the
cancer types, and extracting the t-value. To start, you need to find the component of the
t.test output that holds the t-value.
t.out=t.test(y~x)   # saves the output
unclass(t.out)      # prints the output object to the screen without formatting.
t.out$compname      # extracts the data from component "compname"
What is the name of the component that holds the computed t value?
c) Here is a function you can write that will do the permutation test.
perm.t = function(y,x,nperms){
  out=numeric(nperms)       #create an empty vector to store the t-values
  for (i in 1:nperms){
    ptype=sample(x,12)      #permute the cancer types
    t.out=t.test(y~ptype)
    out[i]=t.out$compname   #replace with the right component
  }
  out
}
d) Run your function with nperms=100, using gene4 for y and ALtype for x, and obtain a
histogram of the t-values. This is an estimate of the null distribution of the t-statistic.
e) Use the 100 t-values to estimate the percentage of permutations for which |t| is greater
than the |t| value you obtained from gene4 with the original labels.
f) In perm.t we have permuted the values of the disease classification. Suppose we
permuted the data instead. Would this have the same effect?
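For orientation, here is a minimal sketch of the whole permutation procedure on simulated data. The vectors x and y, the seed, and the variable names are assumptions for illustration only; it uses the statistic component of the t.test result and is not meant as the required solution.

```r
set.seed(1)                                          # reproducible permutations
x <- factor(rep(c("ALL", "AML"), each = 6))          # hypothetical labels
y <- c(rnorm(6, mean = 9.5), rnorm(6, mean = 9.0))   # simulated expression values

t.obs <- unname(t.test(y ~ x)$statistic)             # t-value with the true labels

nperms <- 100
t.perm <- numeric(nperms)
for (i in seq_len(nperms)) {
  ptype <- sample(x)                                 # shuffle labels, group sizes stay 6 and 6
  t.perm[i] <- t.test(y ~ ptype)$statistic
}

hist(t.perm, main = "Permutation t-values (simulated data)", xlab = "t")
abline(v = c(-1, 1) * abs(t.obs), lty = 2)           # mark the observed |t|

p.perm <- mean(abs(t.perm) >= abs(t.obs))            # share of permutations at least as extreme
p.perm
```

With only 100 permutations the estimated proportion is coarse (steps of 0.01), so it will vary from run to run unless the seed is fixed.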
5. a) 1:12 generates the numbers 1,2,3,...,12. What does sample(1:12,12,replace=T) do?
b) We will do a bootstrap test (just for gene4) by computing 100 samples from the
combined ALL and AML data, and extracting the t-value. Here is a function you can
write that will do the bootstrap test.
boot.t = function(y,x,nboot){
out=numeric(nboot)
#create an empty vector to store the t-values
for (i in 1:nboot){
yboot=sample(y,12,replace=T) #sample from the combined sample
t.out=t.test(yboot~x)
out[i]=t.out$compname
#replace with the right component
}
out
}
c) Run your function with nboot=100, using gene4 for y and ALtype for x, and obtain a
histogram of the t-values. This is an estimate of the null distribution of the t-statistic.
d) Use the 100 t-values to estimate the percentage of bootstrap samples for which |t| is
greater than the |t| value you obtained from gene4.
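The sketch below contrasts the bootstrap with the permutation scheme on simulated data: the labels stay fixed, and the values are redrawn with replacement from the pooled sample. The data, the seed, and the variable names are hypothetical.

```r
set.seed(2)
x <- factor(rep(c("ALL", "AML"), each = 6))          # hypothetical labels, kept fixed
y <- c(rnorm(6, mean = 9.5), rnorm(6, mean = 9.0))   # simulated expression values

idx <- sample(1:12, 12, replace = TRUE)              # indices drawn with replacement
table(idx)                                           # some appear more than once, some never

t.obs <- unname(t.test(y ~ x)$statistic)

nboot <- 100
t.boot <- replicate(nboot, {
  yboot <- sample(y, length(y), replace = TRUE)      # resample from the pooled values
  unname(t.test(yboot ~ x)$statistic)
})

hist(t.boot, main = "Bootstrap t-values (simulated data)", xlab = "t")
p.boot <- mean(abs(t.boot) >= abs(t.obs))
p.boot
```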
B. Testing differential expression when the samples are NOT independent.
We will look at the difference in gene expression for genes 1 and 4 in the AML patients.
The patients with valid data for both genes are:
AML=c(21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 32, 33, 34, 62, 63, 64, 65, 66, 67, 68, 69,
70, 71, 72)
AMLG1=golubExpr[1,AML]
AMLG4=golubExpr[4,AML]
M=AMLG4 - AMLG1
6. Obtain a boxplot of AMLG1 and AMLG4 and compute the correlation in expression.
boxplot(list(AMLG1,AMLG4), names=c("Gene 1","Gene 4"))
cor        #computes the correlation
cor.test   #tests if the correlation is 0
The correlation is not high, but because we are measuring gene expression on genes in
the same individuals, we should handle the data as dependent.
7. Do a one-sample t-test to test if the mean of M is 0, and a paired t-test to determine if
the mean of AMLG1 is the same as the mean of AMLG4.
?t.test will give the options.
If you pick the option paired=T, you should get the same value of the test statistic as in
the one-sample test.
Now try omitting paired=T. The p-value may be bigger or smaller than the value from
the paired test, depending on the correlation.
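A short sketch of the three calls on simulated data may help when checking your own output. The vectors g1 and g4 are hypothetical stand-ins for AMLG1 and AMLG4, built so that they are correlated.

```r
set.seed(3)
g1 <- rnorm(24, mean = 9)                  # hypothetical "gene 1" values
g4 <- 0.5 * g1 + rnorm(24, mean = 5)       # hypothetical "gene 4" values, correlated with g1
M  <- g4 - g1                              # within-patient differences

one      <- t.test(M)                      # one-sample test that mean(M) is 0
paired   <- t.test(g4, g1, paired = TRUE)  # paired test on the same patients
unpaired <- t.test(g4, g1)                 # two-sample test that ignores the pairing

# The one-sample and paired tests give the same t-value
all.equal(unname(one$statistic), unname(paired$statistic))

c(one = one$p.value, paired = paired$p.value, unpaired = unpaired$p.value)
cor(g1, g4)                                # the correlation that drives the difference
```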
8. Permutation test: We do not want to break the pairing. The simplest way to do the test
is to save abs(M) and sign(M). Then permute the signs, and compute the t-tests on the
product of abs(M) and the permuted signs. Write a function to do this, and turn in a
histogram of 100 t-values.
9. Bootstrap test: Under the null hypothesis of no difference, M and -M are equally
likely. Save the vector c(M,-M) and pick samples of size length(M) with replacement from
this vector. Write a function to do this and turn in a histogram of 100 t-values.
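The two paired resampling schemes described above can be sketched as follows, on simulated differences. The vector M here is hypothetical, and the function names sign.perm.t and sym.boot.t are made up for illustration; treat this as one possible layout, not the required answer.

```r
set.seed(4)
M <- rnorm(24, mean = 0.3)                 # hypothetical paired differences

t.obs <- unname(t.test(M)$statistic)       # observed one-sample t-value

# Permutation: keep abs(M), shuffle the signs among the pairs
sign.perm.t <- function(M, nperms) {
  absM <- abs(M)
  sgn  <- sign(M)
  out  <- numeric(nperms)
  for (i in seq_len(nperms)) {
    out[i] <- t.test(absM * sample(sgn))$statistic
  }
  out
}

# Bootstrap: draw length(M) values with replacement from c(M, -M)
sym.boot.t <- function(M, nboot) {
  pool <- c(M, -M)
  out  <- numeric(nboot)
  for (i in seq_len(nboot)) {
    out[i] <- t.test(sample(pool, length(M), replace = TRUE))$statistic
  }
  out
}

t.perm <- sign.perm.t(M, 100)
t.boot <- sym.boot.t(M, 100)

hist(t.perm, main = "Sign-permutation t-values (simulated M)", xlab = "t")
hist(t.boot, main = "Symmetric bootstrap t-values (simulated M)", xlab = "t")

c(perm = mean(abs(t.perm) >= abs(t.obs)),
  boot = mean(abs(t.boot) >= abs(t.obs)))
```

Shuffling the existing signs keeps the number of positive differences fixed. A common variant instead draws each sign at random with sample(c(-1,1), length(M), replace=TRUE).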