Obligatory assignments, week 6

1. Assume the following is the result of an experiment where you want to measure how many seconds a program spends when writing a certain amount of data from memory.

    76.3 82.5 83.9 82.0 82.8 84.3 95.3 90.1 80.9 83.3
    89.2 82.8 87.0 84.9 72.4 76.6 70.7 88.3 63.8 74.0

Everything is kept constant during the experiment and you assume the variance of the data is due to random errors. Present the result of your experiment to someone who would like to know how efficient the program's memory usage is.

Hint: When reading data from a file containing a long list of numbers (one per line or on one long line) you may do it using scan:

    > x = scan("file")

If the file contains several columns, you should use read.table(). You may parse a long string of numbers into R and convert it to a vector of numbers by

    x = as.numeric(unlist(strsplit("76.3 82.5 83.9 82.0 82.8 84.3 95.3 90.1 80.9 83.3"," ")))

2. Assume the same data were the result of your measurement of the percentage of CPU usage at 12.00 for the last 20 days. Present the results of your data to a sysadmin responsible for the server in question. Keep in mind that you might need to show similar data for 96 other times of the day.

3. Run the following R commands:

    N = 10
    mu = 0.5
    sigma = 1
    x = rnorm(N,mu,sigma)
    t.test(x,conf.level=0.9)

Calculate the confidence interval given by the output of t.test using R functions like sd(), mean() and pt().

4. Study and run the following R-code and try to figure out what it does. The R scripts of this week's assignments are collected in http://www.iu.hio.no/~haugerud/r/assignments/

    withinKnownSigma=0
    N=10
    Ntotal=200000
    mu=1
    sd=1
    sKn = sd/sqrt(N)
    for (i in 1:Ntotal){
       set=rnorm(N,mu,sd)
       x = mean(set)
       # Sigma known. Same CI width each time, width = 2 sKn.
       if(mu > x - sKn & mu < x + sKn){withinKnownSigma=withinKnownSigma+1}
    }
    pG=withinKnownSigma/Ntotal
    cat("Experiment, known sigma, within CI:",pG,"\n")

Try to compute using R the theoretical value this experiment should give. That is, the theoretical value of the fraction of times the true mean µ (mu) is within the CI. What percentage is this CI?

5. Study and run the following R-code and try to figure out what it does.

    N=10
    Ntotal=200000
    mu=1
    sd=1
    within=0
    averCI=0
    for (i in 1:Ntotal){
       set=rnorm(N,mu,sd)
       x = mean(set)
       s = sd(set)/sqrt(N)
       averCI = averCI + 2*s
       # Sigma unknown. Creating new CI each time, width = 2s varies
       if(mu > x - s & mu < x + s){within=within+1}
    }
    p=within/Ntotal
    cat("\n\nExperiment, unknown sigma, within CI: ",p,"\n")
    # The random variable (x-mu)/s follows Student's t distribution
    averCI = averCI/Ntotal
    cat("\n\nKnown sigma CI-length:",2*sd/sqrt(N))
    cat("\nAverage CI-length: ",averCI)

Try to compute using R the theoretical value this experiment should give. That is, the theoretical value of the fraction of times the true mean µ (mu) is within the CI. What percentage is this CI? Finally, try to relate the average CI-length (or width) to the results of these two experiments.

6. What is the main and crucial difference between the way the confidence interval (CI) is calculated in the two previous assignments? Suppose one of the data sets generated by set=rnorm(N,mu,sd) had not been generated by R, but by an experiment which you had performed. Which of the two methods for calculating the CI would you then have to use?

7. Study and run the following R-code and try to figure out what it does.

    N=5
    heading="Comparing normal and Student's t distribution"
    plot.new()
    Tsample=vector()
    CLTsample=vector()
    CLTscaledSample=vector()
    Nr=50000
    x = seq(-7,7,length=500)
    sd=2
    mu=1
    for (i in 1:Nr){
       set = rnorm(N,mu,sd)
       xbar = mean(set)                  # mean, -> norm dist for large N
       CLTsample <- c(CLTsample,xbar)    # according to CLT
       CLTscaledSample <- c(CLTscaledSample,(xbar-mu)/(sd/sqrt(N)))  # scaled normal distribution
       t = (xbar-mu)/(sd(set)/sqrt(N))   # This variable t follows
       Tsample <- c(Tsample,t)           # the Student's t distribution
    }
    br=c(min(Tsample),seq(-6.5,6.5,length=400),max(Tsample))
    hCLT=hist(CLTscaledSample,breaks=br,freq=FALSE,main=heading,xlim=c(-4,4),col=rgb(1,0,0,1/4),border=rgb(1,0,0,1/4))
    #lines(x,dFUNK?(x,???),type="l",col="red",lwd=3)
    legend(2,0.4,legend=paste("normal,sd = ",1),fill="red")
    legend(-4,0.35,legend=paste("N = ",N))
    hT=hist(Tsample,breaks=br,freq=FALSE,xlim=c(-4,4),col=rgb(0,0,1,1/4),border=rgb(0,0,1,1/4),add=TRUE)
    #lines(x,dFUNK?(x,???),type="l",col="blue",lwd=3)
    legend(2,0.35,legend="Student's t, df=N-1",fill="blue")
    hT=hist(CLTsample,breaks=br,freq=FALSE,xlim=c(-4,4),col=rgb(0,1,0,1/4),border=rgb(0,1,0,1/4),add=TRUE)
    #lines(x,dFUNK?(x,???),type="l",col="green",lwd=3)
    legend(2,0.3,legend=sprintf("normal,sd = %.2f",sd/sqrt(N)),fill="green")

Uncomment the three lines containing dFUNK? and find the distribution functions that fit the data when you replace ??? by the correct set of parameters for each distribution function.

8. In statistical significance testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. One rejects the null hypothesis when the p-value is less than the significance level α, which is often set to 0.05. When the null hypothesis is rejected, the result is said to be statistically significant.
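As a small illustration of the definition of the p-value, the following R sketch computes a two-sided p-value by hand for a one-sample t-test and compares it with the value reported by t.test(). It is only a sketch under stated assumptions: the sample is simulated with a fixed seed and is not one of the assignment files, the null hypothesis is taken to be a true mean of 0, and the names sample_data, t_stat, p_manual and p_builtin are made up for this example.

```r
# Illustrative only: a simulated sample, not the x or y files of the assignment.
set.seed(1)                                   # fixed seed so the run is reproducible
sample_data <- rnorm(20, mean = 0.3, sd = 1)  # assumed example data
n <- length(sample_data)

# Test statistic for H0: the true mean equals 0
t_stat <- mean(sample_data) / (sd(sample_data) / sqrt(n))

# Two-sided p-value: probability, under H0, of a statistic at least this extreme
p_manual <- 2 * pt(-abs(t_stat), df = n - 1)

# The same quantity as reported by R's built-in test
p_builtin <- t.test(sample_data, mu = 0)$p.value

alpha <- 0.05                                 # significance level
cat("t =", t_stat, " p (by hand) =", p_manual, " p (t.test) =", p_builtin, "\n")
cat("Reject H0 at alpha =", alpha, "?", p_manual < alpha, "\n")
```

The factor 2 reflects the two-sided alternative: values of the statistic far below zero count as just as extreme as values far above it.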
Consider the two samples x and y in the files below:

    http://www.iu.hio.no/teaching/materials/MS007A/html/x
    http://www.iu.hio.no/teaching/materials/MS007A/html/y

Run a one-sample t-test using R on each set and explain the meaning of the p-value and the confidence interval. Draw conclusions based on the tests in each of the cases. Are you able to reject or prove any of the hypotheses?

9. Run a Welch Two Sample t-test on the two sets, explain what H0 and H1 are and estimate how probable it is that the two means are not equal.

Comparing performance

10. You are asked to compare the performance of two different computer system installations using a standard benchmark program. These are large, complex systems that require a significant amount of time and effort on your part to make the benchmark program run. By the time you have the benchmark running on both systems, you have time to make only n1 = 8 measurements on the first system and n2 = 5 measurements on the second system. Your measurements, in seconds, for each system are shown in the following table:

    System 1   1011  998 1113 1008 1100 1039 1003 1098
    System 2    894  963 1098  982 1046

Using statistical methods, try to find out if there is a statistically significant difference between the two systems. Present your results using both a confidence interval and a p-value and explain how to interpret your results.

Hint: using R, x <- c(894,963,1098) will put these three values into the set x.

11. A few weeks after performing the comparison of the two systems, you are provided with some additional time to complete your measurements on the second system. You measure three more values for the second system, obtaining 1002, 989 and 994 seconds. Find out if this changes your conclusions on the difference between the two systems.

12. Using R, construct a box plot of the following data. These are lifetimes in hours of fifty 40-watt lamps taken from forced life tests.

    919,1067,1045,956,765,958,1196,1092,855,1102,958,1311,785,1162,1195,1157,902,
    1037,1126,1170,1195,978,1022,702,936,929,1340,832,1333,923,918,950,1122,1009,
    811,1156,905,938,1157,1217,920,972,970,1151,1085,948,1035,1237,1009,896

Explain the information the boxplot gives and compare it to just giving the mean and standard deviation.

13. Assume a typo was made, changing the last lifetime value to 1896. Explain how this changes the boxplot and compare it to the change of mean and standard deviation.

14. Calculate a 95% confidence interval for the same set of lifetimes. How probable is it that the lifetime of a single new lamp, when tested, falls within this 95% confidence interval?

15. In order to be able to run SIGN.test() in R, you might need to install an additional package like this:

    > install.packages("BSDA")
    > library(BSDA)

Run the following test:

    x <- rnorm(10,1,1)
    t.test(x)
    SIGN.test(x)

Find another way to calculate the p-value for each of the tests. Is your result exactly the same in any of the cases? Explain.

16. Run the following tests five times.

    x <- rnorm(10,1,1)
    t.test(x)
    SIGN.test(x)

Each time, compare the calculated p-values and comment on the results.

17. Run the SIGN.test on the two data sets of assignment 8 above. Compare the results to the results of the Welch Two Sample t-test on the two sets.

Extra challenge (not compulsory)

18. Run the following R commands:

    N = 10
    mu = 0.5
    sigma = 1
    x = rnorm(N,mu,sigma)
    t.test(x,conf.level=0.9)

Calculate all the numbers given by the output of t.test using R functions like sd() and pt(). Remember: the p-value is the probability of obtaining a result at least as extreme as the mean value you have obtained, assuming the true mean is equal to 0. And the CI is calculated based on the Student's t distribution (pt()).

19. Repeat the calculation of all the numbers assuming your data come from a normal distribution with a known sigma = 1 (which they indeed do in this case) using R functions like mean() and pnorm(). Comment on the differences in the results. Since you know the 10 numbers are drawn from a normal distribution with standard deviation equal to 1, you may say that the CI and the p-value calculated based on the normal distribution, instead of the Student's t distribution, are the exact ones. Explain in what way they are exact.

20. Assume you know that the 10 values in the assignment above have been drawn from a normal distribution, but you do not know the value of sigma (nor mu). What is the most correct way to calculate the CI and the p-value: based on the Student's t distribution (as with t.test()) or based on the normal distribution?

21. Study and run the following R-code and try to figure out what it does.

    z=uniroot(function(z) (pnorm(z) - pnorm(-z) - 0.95), lower = 0, upper = 4,tol = 0.00001)$root
    withinKnownSigma=0
    N=5
    Ntotal=200000
    mu=0
    sd=1
    sKn = sd/sqrt(N)
    for (i in 1:Ntotal){
       set=rnorm(N,mu,sd)
       x = mean(set)
       # Sigma known. Same CI width each time, width = 2 z sKn.
       if(mu > x - z*sKn & mu < x + z*sKn){withinKnownSigma=withinKnownSigma+1}
    }
    pG=withinKnownSigma/Ntotal
    cat("Experiment, known sigma, within CI:",pG,"\n")

Try to compute using R the theoretical value this experiment should give. That is, the theoretical value of the fraction of times the true mean µ (mu) is within the CI. What percentage is this CI?

22. Study and run the following R-code and try to figure out what it does.

    N=5   # N must be defined before it is used in pt(z,N-1) below
    z=uniroot(function(z) (pt(z,N-1) - pt(-z,N-1) - 0.95), lower = 0, upper = 4,tol = 0.00001)$root
    Ntotal=200000
    mu=0
    sd=1
    within=0
    averCI=0
    for (i in 1:Ntotal){
       set=rnorm(N,mu,sd)
       x = mean(set)
       s = sd(set)/sqrt(N)
       averCI = averCI + 2*s*z
       # Sigma unknown. Creating new CI each time, width = 2*z*s varies
       # Should be within with probability pt(z,N-1) - pt(-z,N-1)
       if(mu > x - z*s & mu < x + z*s){within=within+1}
    }
    p=within/Ntotal
    cat("\n\nExperiment, unknown sigma, within CI: ",p,"\n")
    # The random variable (x-mu)/s follows Student's t distribution
    cat("Theoretical, Student's t distribution: ",pt(z,N-1) - pt(-z,N-1))
    averCI = averCI/Ntotal
    z=uniroot(function(z) (pnorm(z) - pnorm(-z) - 0.95), lower = 0, upper = 4,tol = 0.00001)$root
    cat("\n\nKnown sigma CI-length:",2*sd*z/sqrt(N))
    cat("\nAverage CI-length: ",averCI)

Try to compute using R the theoretical value this experiment should give. That is, the theoretical value of the fraction of times the true mean µ (mu) is within the CI. What percentage is this CI? Finally, try to relate the average CI-length (or width) to the results of these two experiments.

23. Using R, try to establish how often a random sample of 100 numbers from a normal distribution with mean value zero and standard deviation equal to 1 would result in a mean value as far from or further from zero as that of the x-set. Or, in other words, how probable it is that the sample x is from a normal distribution with mean value zero and standard deviation equal to 1 and that the mean value just by chance deviates that much from zero.

24. Verify Theorem 1 of the lecture notes in the case where N = 2 and the two random variables X1 and X2 are discrete Bernoulli distributed random variables with p = 1/2 (corresponding to two coins).