Obligatory assignments, week 6
1. Assume the following is the result of an experiment in which you measure how many seconds a program spends writing a certain amount
of data from memory.
76.3 82.5 83.9 82.0 82.8 84.3 95.3 90.1 80.9 83.3
89.2 82.8 87.0 84.9 72.4 76.6 70.7 88.3 63.8 74.0
Everything is kept constant during the experiment and you assume the
variance of the data is due to random errors. Present the result of your
experiment to someone who would like to know how efficient the program’s
memory usage is.
Hint: When reading data from a file containing a long list of numbers (one
per line or on one long line), you can use scan:
> x = scan("file")
If the file contains several columns, you should use read.table(). You
may parse a long string of numbers into R and convert it to a vector of
numbers by
x = as.numeric(unlist(strsplit("76.3 82.5 83.9 82.0 82.8 84.3 95.3 90.1 80.9 83.3"," ")))
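As a minimal sketch of the hint above, assuming the 20 measurements are pasted directly into the script instead of being read from a file, scan() can also parse a string through its text argument. The names raw, times and times2 are illustrative choices, not part of the assignment.

```r
# The 20 measurements (seconds) copied from the assignment text.
raw <- "76.3 82.5 83.9 82.0 82.8 84.3 95.3 90.1 80.9 83.3
89.2 82.8 87.0 84.9 72.4 76.6 70.7 88.3 63.8 74.0"

# scan() splits on any whitespace, including the line break.
times <- scan(text = raw, quiet = TRUE)

# Equivalent parsing with strsplit(), splitting on runs of whitespace.
times2 <- as.numeric(unlist(strsplit(raw, "[[:space:]]+")))

length(times)               # number of measurements read
all.equal(times, times2)    # both approaches give the same vector
```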
2. Assume the same data were the result of measuring the percentage of CPU usage at 12:00 on each of the last 20 days. Present the results of
your data to a sysadmin responsible for the server in question. Keep in
mind that you might need to show similar data for 96 other times of the
day.
3. Run the following R commands:
N = 10
mu = 0.5
sigma = 1
x = rnorm(N,mu,sigma)
t.test(x,conf.level=0.9)
Calculate the confidence interval given by the output of t.test using R
functions like sd(), mean() and pt().
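One possible way to check such a calculation is sketched below. It assumes a fixed seed (added here only for reproducibility) and uses qt(), the quantile function that inverts pt(), to obtain the critical value. The names se, tcrit and ci_manual are illustrative.

```r
set.seed(1)                        # assumption: fixed seed, only for reproducibility
N <- 10
mu <- 0.5
sigma <- 1
x <- rnorm(N, mu, sigma)

conf_level <- 0.9
alpha <- 1 - conf_level
se <- sd(x) / sqrt(N)                     # estimated standard error of the mean
tcrit <- qt(1 - alpha / 2, df = N - 1)    # two-sided critical value (inverse of pt)
ci_manual <- mean(x) + c(-1, 1) * tcrit * se

ci_ttest <- t.test(x, conf.level = conf_level)$conf.int
print(ci_manual)
print(as.numeric(ci_ttest))               # should agree with ci_manual
```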
4. Study and run the following R-code and try to figure out what it does.
The R scripts of this week’s assignments are collected in
http://www.iu.hio.no/~haugerud/r/assignments/
withinKnownSigma=0
N=10
Ntotal=200000
mu=1
sd=1
sKn = sd/sqrt(N)
for (i in 1:Ntotal){
    set=rnorm(N,mu,sd)
    x = mean(set)
    # Sigma known. Same CI each time, width = 2 sKn.
    if(mu > x - sKn & mu < x + sKn){withinKnownSigma=withinKnownSigma+1}
}
pG=withinKnownSigma/Ntotal
cat("Experiment, known sigma, within CI:",pG,"\n")
Try to compute using R the theoretical value this experiment should give.
That is, the theoretical value of the fraction of times the true mean µ (mu)
is within the CI.
What percentage is this CI?
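A sketch of one way to obtain the theoretical value, under the assumption that the sample mean is exactly normally distributed with standard deviation sd/sqrt(N) (true here because the data come from rnorm). The name p_theory is illustrative.

```r
# With sigma known, (x - mu) / sKn is standard normal, so the interval
# x +/- sKn covers mu with probability P(-1 < Z < 1).
p_theory <- pnorm(1) - pnorm(-1)
cat("Theoretical, known sigma, within CI:", p_theory, "\n")
```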
5. Study and run the following R-code and try to figure out what it does.
N=10
Ntotal=200000
mu=1
sd=1
within=0
averCI=0
for (i in 1:Ntotal){
    set=rnorm(N,mu,sd)
    x = mean(set)
    s = sd(set)/sqrt(N)
    averCI = averCI + 2*s
    # Sigma unknown. Creating new CI each time, width = 2s is varying
    if(mu > x - s & mu < x + s){within=within+1}
}
p=within/Ntotal
cat("\n\nExperiment, unknown sigma, within CI: ",p,"\n")
# The random variable (x-mu)/s follows student’s t distribution
averCI = averCI/Ntotal
cat("\n\nKnown sigma CI-length:",2*sd/sqrt(N))
cat("\nAverage CI-length: ",averCI)
Try to compute using R the theoretical value this experiment should give.
That is, the theoretical value of the fraction of times the true mean µ (mu)
is within the CI.
What percentage is this CI?
Finally try to relate the average CI-length (or width) to the results of
these two experiments.
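For the unknown-sigma experiment, the theoretical coverage can be sketched in the same way, now assuming that (x - mu)/s follows Student's t distribution with N - 1 degrees of freedom, as the comment in the script states. The names p_t and p_z are illustrative.

```r
N <- 10

# With sigma estimated from each sample, x +/- s covers mu with
# probability P(-1 < T < 1), where T has N - 1 degrees of freedom.
p_t <- pt(1, df = N - 1) - pt(-1, df = N - 1)

# Known-sigma coverage of x +/- sKn, for comparison.
p_z <- pnorm(1) - pnorm(-1)

cat("Theoretical, unknown sigma:", p_t, "\n")
cat("Theoretical, known sigma:  ", p_z, "\n")
```

The t distribution has heavier tails than the normal distribution, so the same "one standard error" interval covers the true mean slightly less often when the standard error is estimated.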
6. What is the main and crucial difference between the way the confidence
interval CI is calculated in the two previous assignments? If one of the
data sets generated by
set=rnorm(N,mu,sd)
had come not from R but from an experiment that you had performed,
which of the two methods for calculating the CI would you then have to
use?
7. Study and run the following R-code and try to figure out what it does.
N=5
heading="Comparing normal and student’s t distribution"
plot.new()
Tsample=vector()
CLTsample=vector()
CLTscaledSample=vector()
Nr=50000
x = seq(-7,7,length=500)
sd=2
mu=1
for (i in 1:Nr){
    set = rnorm(N,mu,sd)
    xbar = mean(set) # mean, -> norm dist for large N
    CLTsample <- c(CLTsample,xbar) # according to CLT
    CLTscaledSample <- c(CLTscaledSample,(xbar-mu)/(sd/sqrt(N))) # scaled normal distribution
    t = (xbar-mu)/(sd(set)/sqrt(N)) # This variable t follows the student’s t distribution
    Tsample <- c(Tsample,t)
}
br=c(min(Tsample),seq(-6.5,6.5,length=400),max(Tsample))
hCLT=hist(CLTscaledSample,breaks=br,freq=FALSE,main=heading,xlim=c(-4,4),col=rgb(1,0,0,1/4),border=rgb(1,0,0,1/4))
#lines(x,dFUNK?(x,???),type="l",col="red",lwd=3)
legend(2,0.4,legend=paste("normal,sd = ",1),fill="red")
legend(-4,0.35,legend=paste("N = ",N))
hT=hist(Tsample,breaks=br,freq=FALSE,xlim=c(-4,4),col=rgb(0,0,1,1/4),border=rgb(0,0,1,1/4),add=TRUE)
#lines(x,dFUNK?(x,???),type="l",col="blue",lwd=3)
legend(2,0.35,legend="student’s t, df=N-1",fill="blue")
hT=hist(CLTsample,breaks=br,freq=FALSE,xlim=c(-4,4),col=rgb(0,1,0,1/4),border=rgb(0,1,0,1/4),add=TRUE)
#lines(x,dFUNK?(x,???),type="l",col="green",lwd=3)
legend(2,0.3,legend=sprintf("normal,sd = %.2f",sd/sqrt(N)),fill="green")
Uncomment the three lines containing dFUNK? and find the distribution functions that fit the data when you replace ??? by
the correct set of parameters for each distribution function.
8. In statistical significance testing, the p-value is the probability of obtaining
a test statistic at least as extreme as the one that was actually observed,
assuming that the null hypothesis is true. One rejects the null hypothesis
when the p-value is less than the significance level α, which is often set
to be 0.05. When the null hypothesis is rejected, the result is said to
be statistically significant. Consider the two samples x and y in the files
below:
http://www.iu.hio.no/teaching/materials/MS007A/html/x
http://www.iu.hio.no/teaching/materials/MS007A/html/y
Run a one-sample t-test using R on each set and explain the meaning
of the p-value and the confidence interval. Draw conclusions based on
the tests in each of the cases. Are you able to reject or prove any of the
hypotheses?
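The sketch below illustrates how the pieces of a one-sample t-test relate to each other. It does not use the course files; the sample is simulated stand-in data (an assumption made so the example runs on its own), and the names res, reject and zero_outside are illustrative.

```r
set.seed(42)                         # assumption: stand-in data, not the course files
x <- rnorm(100, mean = 0.3, sd = 1)

res <- t.test(x)                     # H0: the true mean is 0 (default mu = 0)
res$statistic                        # observed t value
res$p.value                          # P(|T| >= |observed t|) if H0 holds
res$conf.int                         # 95% CI for the true mean

# For the two-sided test, H0 is rejected at level alpha exactly when
# 0 lies outside the (1 - alpha) confidence interval.
alpha <- 0.05
reject <- res$p.value < alpha
zero_outside <- res$conf.int[1] > 0 || res$conf.int[2] < 0
c(reject = reject, zero_outside = zero_outside)
```

With the real data, x could be read with scan() from a downloaded copy of the file.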
9. Run a Welch Two Sample t-test on the two sets, explain what H0 and H1
are and estimate how probable it is that the two means are not equal.
Comparing performance
10. You are asked to compare the performance of two different computer-system installations using a standard benchmark program. These are
large, complex systems that require a significant amount of time and effort on your part to make the benchmark program run. By the time you
have the benchmark running on both systems, you have time to make only
n1 = 8 measurements on the first system and n2 = 5 measurements on
the second system. Your measurements, in seconds, for each system are
shown in the following table:
System 1: 1011 998 1113 1008 1100 1039 1003 1098
System 2: 894 963 1098 982 1046
Using statistical methods, try to find out if there is a statistically significant difference between the two systems. Present your results using
both a confidence interval and a p-value and explain how to interpret
your results. Hint: in R, x <- c(894,963,1098) puts these three
values into the vector x.
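One possible starting point is sketched below, assuming that a Welch two-sample t-test (R's default when t.test() is given two samples) is an acceptable method for this comparison. The names system1, system2, se and t_manual are illustrative.

```r
system1 <- c(1011, 998, 1113, 1008, 1100, 1039, 1003, 1098)
system2 <- c(894, 963, 1098, 982, 1046)

# With two samples, t.test() defaults to var.equal = FALSE (Welch's test).
res <- t.test(system1, system2)
res$conf.int     # 95% CI for mean(system1) - mean(system2)
res$p.value      # p-value for H0: the two true means are equal

# The same t statistic computed by hand from the sample variances.
se <- sqrt(var(system1) / length(system1) + var(system2) / length(system2))
t_manual <- (mean(system1) - mean(system2)) / se
c(manual = t_manual, t.test = unname(res$statistic))
```

If the confidence interval for the difference contains 0, the data do not show a significant difference at the corresponding level.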
11. A few weeks after performing the comparison of the two systems, you
are provided with some additional time to complete your measurements on
the second system. You measure three more values for the second system
obtaining the values of 1002, 989 and 994 seconds. Find out if this changes
your conclusions on the difference of the two systems.
12. Using R, construct a box plot of the following data. These are lifetimes
in hours of fifty 40-watt lamps taken from forced life tests.
919,1067,1045,956,765,958,1196,1092,855,1102,958,1311,785,1162,1195,1157,902,
1037,1126,1170,1195,978,1022,702,936,929,1340,832,1333,923,918,950,1122,1009,
811,1156,905,938,1157,1217,920,972,970,1151,1085,948,1035,1237,1009,896
Explain the information the boxplot gives and compare it to just giving
the mean and standard deviation.
13. Assume a typo was made, changing the last lifetime value to 1896.
Explain how this changes the boxplot and compare it to the change of
mean and standard deviation.
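A sketch covering the two lamp-lifetime assignments above: it draws the box plots and puts the robust summaries (median, IQR) next to the mean and standard deviation. The helper summarise() and the variable names are illustrative, and the typo is assumed to replace the final value 896.

```r
lifetimes <- c(919,1067,1045,956,765,958,1196,1092,855,1102,958,1311,785,1162,1195,1157,902,
               1037,1126,1170,1195,978,1022,702,936,929,1340,832,1333,923,918,950,1122,1009,
               811,1156,905,938,1157,1217,920,972,970,1151,1085,948,1035,1237,1009,896)

typo <- lifetimes
typo[length(typo)] <- 1896          # last value 896 mistyped as 1896

# Hypothetical helper: classical and robust summaries side by side.
summarise <- function(v) c(mean = mean(v), sd = sd(v), median = median(v), IQR = IQR(v))
rbind(original = summarise(lifetimes), typo = summarise(typo))

# Box plots of both versions; outliers are drawn as individual points.
boxplot(list(original = lifetimes, typo = typo), ylab = "Lifetime (hours)")
boxplot.stats(typo)$out             # values flagged as outliers in the typo data
```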
14. Calculate a 95% confidence interval for the same set of lifetimes. How
probable is it that the lifetime of a single new lamp, when tested, falls
within this 95% confidence interval?
15. In order to be able to run SIGN.test() in R, you might need to install an
additional package like this:
> install.packages("BSDA")
> library(BSDA)
Run the following test:
x <- rnorm(10,1,1)
t.test(x)
SIGN.test(x)
Find another way to calculate the p-value for each of the tests. Is your
result exactly the same in any of the cases? Explain.
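As a hedged sketch of "another way", both p-values can be computed with base R only. It assumes that SIGN.test() tests the hypothesis median = 0 by default, in which case the number of positive values is Binomial(n, 1/2) under H0. The seed and the names t_obs, p_t and p_sign are illustrative.

```r
set.seed(7)                          # assumption: fixed seed for reproducibility
x <- rnorm(10, 1, 1)
n <- length(x)

# t-test by hand: t = mean / (sd / sqrt(n)), two-sided tail area from pt().
t_obs <- mean(x) / (sd(x) / sqrt(n))
p_t <- 2 * pt(-abs(t_obs), df = n - 1)

# Sign test by hand: count the positive values and use the exact
# two-sided binomial test with success probability 1/2.
p_sign <- binom.test(sum(x > 0), n, p = 0.5)$p.value

c(t_manual = p_t, t_test = t.test(x)$p.value, sign_manual = p_sign)
```

The sign test only uses the signs of the observations, so its p-value can take only a few distinct values for n = 10.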
16. Run the following tests five times.
x <- rnorm(10,1,1)
t.test(x)
SIGN.test(x)
Compare the calculated p-values each time and comment on the results.
17. Run the SIGN.test on the two data sets of assignment 8 above. Compare
the results to the results of a Welch Two Sample t-test on the two
sets.
Extra challenge (not compulsory)
18. Run the following R commands:
N = 10
mu = 0.5
sigma = 1
x = rnorm(N,mu,sigma)
t.test(x,conf.level=0.9)
Calculate all the numbers given by the output of t.test using R functions
like sd() and pt(). Remember: the p-value is the probability of obtaining a
result at least as extreme as the mean value you have obtained, assuming
the true mean is equal to 0. And the CI is calculated based on the student
t distribution (pt()).
19. Repeat the calculation of all the numbers assuming the values behind your average
come from a normal distribution with a known sigma = 1 (which they indeed do
in this case) using R functions like mean() and pnorm(). Comment on the
differences in the results. Since you know the 10 numbers are drawn from
a normal distribution with standard deviation equal to 1, you may say
that the CI and the p-value calculated based on the normal distribution
instead of the student t distribution, are the exact ones. Explain in what
way they are exact.
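A minimal sketch of the known-sigma calculation, assuming a fixed seed and a 90% confidence level to match the t.test() call above. It uses qnorm(), the inverse of pnorm(), for the critical value. The names se_known, z_obs, p_z and ci_z are illustrative.

```r
set.seed(1)                        # assumption: fixed seed, only for reproducibility
N <- 10
mu <- 0.5
sigma <- 1
x <- rnorm(N, mu, sigma)

se_known <- sigma / sqrt(N)        # standard error of the mean when sigma is known
z_obs <- (mean(x) - 0) / se_known  # test statistic under H0: true mean = 0
p_z <- 2 * pnorm(-abs(z_obs))      # two-sided p-value from the normal distribution
ci_z <- mean(x) + c(-1, 1) * qnorm(0.95) * se_known   # 90% CI

# Side-by-side comparison with the Student t results (lower, upper, p-value).
res <- t.test(x, conf.level = 0.9)
rbind(normal  = c(ci_z, p_z),
      student = c(as.numeric(res$conf.int), res$p.value))
```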
20. Assume you know that the 10 values in the assignment above have been
drawn from a normal distribution, but you do not know the value of sigma
(nor mu). What is the most correct way to calculate CIs and p-values:
based on the student t distribution (as with t.test()) or based on the
normal distribution?
21. Study and run the following R-code and try to figure out what it does.
z=uniroot(function(z) (pnorm(z) - pnorm(-z) - 0.95), lower = 0, upper = 4,tol = 0.00001)$root
withinKnownSigma=0
N=5
Ntotal=200000
mu=0
sd=1
sKn = sd/sqrt(N)
for (i in 1:Ntotal){
    set=rnorm(N,mu,sd)
    x = mean(set)
    # Sigma known. Same CI each time, width = 2 z sKn.
    if(mu > x - z*sKn & mu < x + z*sKn){withinKnownSigma=withinKnownSigma+1}
}
pG=withinKnownSigma/Ntotal
cat("Experiment, known sigma, within CI:",pG,"\n")
Try to compute using R the theoretical value this experiment should give.
That is, the theoretical value of the fraction of times the true mean µ (mu)
is within the CI. What percentage is this CI?
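The z found by uniroot() in this script can be cross-checked against the normal quantile function. The sketch below assumes the same 0.95 target as the script; the names z_root, z_q and coverage are illustrative.

```r
# Numerical solution, as in the script: find z with P(-z < Z < z) = 0.95.
z_root <- uniroot(function(z) pnorm(z) - pnorm(-z) - 0.95,
                  lower = 0, upper = 4, tol = 0.00001)$root

# Closed form: by symmetry, z is the upper 2.5% point of N(0, 1).
z_q <- qnorm(0.975)

# Theoretical coverage of the interval x +/- z * sKn.
coverage <- pnorm(z_q) - pnorm(-z_q)
c(z_root = z_root, z_q = z_q, coverage = coverage)
```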
22. Study and run the following R-code and try to figure out what it does.
N=5
z=uniroot(function(z) (pt(z,N-1) - pt(-z,N-1) - 0.95), lower = 0, upper = 4,tol = 0.00001)$root
Ntotal=200000
mu=0
sd=1
within=0
averCI=0
for (i in 1:Ntotal){
    set=rnorm(N,mu,sd)
    x = mean(set)
    s = sd(set)/sqrt(N)
    averCI = averCI + 2*s*z
    # Sigma unknown. Creating new CI each time, width = 2 z s is varying
    # Should be within with probability pt(z,N-1) - pt(-z,N-1)
    if(mu > x - z*s & mu < x + z*s){within=within+1}
}
p=within/Ntotal
cat("\n\nExperiment, unknown sigma, within CI: ",p,"\n")
# The random variable (x-mu)/s follows student’s t distribution
cat("Theoretical, Student’s t distribution: ",pt(z,N-1) - pt(-z,N-1))
averCI = averCI/Ntotal
z=uniroot(function(z) (pnorm(z) - pnorm(-z) - 0.95), lower = 0, upper = 4,tol = 0.00001)$root
cat("\n\nKnown sigma CI-length:",2*sd*z/sqrt(N))
cat("\nAverage CI-length: ",averCI)
Try to compute using R the theoretical value this experiment should give.
That is, the theoretical value of the fraction of times the true mean µ (mu)
is within the CI. What percentage is this CI?
Finally try to relate the average CI-length (or width) to the results of
these two experiments.
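To relate the two CI lengths, the sketch below uses a standard result that is not stated in the assignment: for normal data, the expected sample standard deviation is c4 * sigma, with c4 = sqrt(2/(N-1)) * gamma(N/2) / gamma((N-1)/2). The names sd_true, t_q, z_q, c4, len_known and len_avg are illustrative.

```r
N <- 5
sd_true <- 1                       # the script's sd, renamed to avoid masking sd()

t_q <- qt(0.975, df = N - 1)       # critical value when sigma is estimated
z_q <- qnorm(0.975)                # critical value when sigma is known
coverage_t <- pt(t_q, N - 1) - pt(-t_q, N - 1)   # theoretical coverage

# Expected value of sd(set) for normal data is c4 * sigma (assumed result).
c4 <- sqrt(2 / (N - 1)) * gamma(N / 2) / gamma((N - 1) / 2)

len_known <- 2 * z_q * sd_true / sqrt(N)        # fixed length, sigma known
len_avg   <- 2 * t_q * c4 * sd_true / sqrt(N)   # expected average length
c(coverage_t = coverage_t, len_known = len_known, len_avg = len_avg)
```

Both intervals are built to cover the true mean 95% of the time, but the interval based on the estimated standard deviation has to be wider on average to achieve it.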
23. Using R, try to establish how many times a random sample of 100 numbers
from a normal distribution with mean value zero and standard deviation
equal to 1 would result in a mean value as far from or further from zero
as the x-set. Or in other words, how probable it is that the sample x is
from a normal distribution with mean value zero and standard deviation
equal to 1 and that the mean value just by chance deviates that much
from zero.
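A simulation along these lines could look as follows. The observed mean m_obs is a hypothetical placeholder (the real value would be mean(x) for the sample x), and the seed and the number of repetitions Nsim are likewise assumptions.

```r
set.seed(123)                      # assumption: fixed seed for reproducibility
n <- 100
m_obs <- 0.2                       # hypothetical observed mean; use mean(x) for real data
Nsim <- 20000

# Means of Nsim random samples of size n from N(0, 1).
sim_means <- replicate(Nsim, mean(rnorm(n, 0, 1)))

# Fraction of simulated means at least as far from zero as m_obs.
p_sim <- mean(abs(sim_means) >= abs(m_obs))

# Theoretical value: the mean of n standard normals is N(0, 1/sqrt(n)).
p_theory <- 2 * pnorm(-abs(m_obs) * sqrt(n))
c(simulated = p_sim, theoretical = p_theory)
```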
24. Verify Theorem 1 of the lecture notes in the case where N = 2 and the
two random variables X1 and X2 are discrete Bernoulli distributed random
variables with p = 1/2 (corresponding to two coins).