Obligatory assignments, solution notes, week 6

1.

2.

3.
> N = 10
> mu = 0.5
> sigma = 1
> x = rnorm(N,mu,sigma)
> t.test(x,conf.level=0.9)

        One Sample t-test

data:  x
t = 1.8071, df = 9, p-value = 0.1042
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 -0.008222728  1.151773396
sample estimates:
mean of x
0.5717753

> solution = uniroot(function(z) (pt(z,N-1) - pt(-z,N-1) - 0.9), lower = 0, upper = 4, tol = 0.00001)
> z = solution$root
> left = mean(x) - z*sd(x)/sqrt(N)
> right = mean(x) + z*sd(x)/sqrt(N)
> cat("CI: <",left,",",right,">\n")
CI: < -0.008222728 , 1.151773 >

4.
withinKnownSigma=0
N=10
Ntotal=200000
mu=1
sd=1
sKn = sd/sqrt(N)
for (i in 1:Ntotal){
  set=rnorm(N,mu,sd)
  x = mean(set)
  if(mu > x - sKn & mu < x + sKn){withinKnownSigma=withinKnownSigma+1}
  # Sigma known. Same CI each time, width = 2 sKn.
  # Should give pnorm(1) - pnorm(-1) = 0.6826895
}
pG=withinKnownSigma/Ntotal
cat("Experiment, known sigma, within CI:",pG,"\n")
cat("Theoretical, normal distribution:  ",pnorm(1) - pnorm(-1))

5.
within=0
averCI=0
for (i in 1:Ntotal){
  set=rnorm(N,mu,sd)
  x = mean(set)
  s = sd(set)/sqrt(N)
  averCI = averCI + 2*s
  if(mu > x - s & mu < x + s){within=within+1}
  # Sigma unknown. A new CI each time; the width = 2s varies.
  # Should be within with probability pt(1,N-1) - pt(-1,N-1)
}
p=within/Ntotal
cat("\n\nExperiment, unknown sigma, within CI: ",p,"\n")
# The random variable (x-mu)/s follows Student's t distribution
cat("Theoretical, Student's t distribution: ",pt(1,N-1) - pt(-1,N-1))
averCI = averCI/Ntotal
cat("\n\nKnown sigma CI-length:",2*sd/sqrt(N))
cat("\nAverage CI-length:     ",averCI)

6.
In the first case the value of σ is assumed known and the CI is constructed using this value. In the second case the CI is constructed using the sample standard deviation s. In an experiment the value of σ would not be known, and you would have to use the second method based on s.

7.
N=5
heading="Comparing normal and Student's t distribution"
plot.new()
Tsample=vector()
CLTsample=vector()
CLTscaledSample=vector()
Nr=50000
x = seq(-7,7,length=500)
sd=2
mu=1
for (i in 1:Nr){
  set = rnorm(N,mu,sd)
  xbar = mean(set)                               # mean -> normal dist for large N
  CLTsample <- c(CLTsample,xbar)                 # according to the CLT
  CLTscaledSample <- c(CLTscaledSample,(xbar-mu)/(sd/sqrt(N)))  # scaled normal distribution
  t = (xbar-mu)/(sd(set)/sqrt(N))                # this variable t follows
  Tsample <- c(Tsample,t)                        # the Student's t distribution
}
br=c(min(Tsample),seq(-6.5,6.5,length=400),max(Tsample))
hCLT=hist(CLTscaledSample,breaks=br,freq=FALSE,main=heading,xlim=c(-4,4),
          col=rgb(1,0,0,1/4),border=rgb(1,0,0,1/4))
lines(x,dnorm(x,0,1),type="l",col="red",lwd=3)
legend(2,0.4,legend=paste("normal, sd =",1),fill="red")
legend(-4,0.35,legend=paste("N =",N))
hT=hist(Tsample,breaks=br,freq=FALSE,xlim=c(-4,4),col=rgb(0,0,1,1/4),
        border=rgb(0,0,1,1/4),add=T)
lines(x,dt(x,N-1),type="l",col="blue",lwd=3)
legend(2,0.35,legend="Student's t, df=N-1",fill="blue")
hT=hist(CLTsample,breaks=br,freq=FALSE,xlim=c(-4,4),col=rgb(0,1,0,1/4),
        border=rgb(0,1,0,1/4),add=T)
lines(x,dnorm(x,mu,sd/sqrt(N)),type="l",col="green",lwd=3)
legend(2,0.3,legend=sprintf("normal, sd = %.2f",sd/sqrt(N)),fill="green")

8.
> t.test(x)

        One Sample t-test

data:  x
t = 2.3892, df = 99, p-value = 0.01878
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.03778428 0.40807200
sample estimates:
mean of x
0.2229281

This result shows that one should reject the null hypothesis. A p-value of 0.02 means that if the null hypothesis mean = 0 were true, a result as extreme as mean = 0.22 would occur only about 2 times out of 100. The 95 percent confidence interval < 0.038, 0.408 > means that there is a 95% probability that the true mean is within this interval. That is, if you keep drawing samples of 100 numbers from the same distribution, the CI would on average include the true mean 95 out of 100 times.
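The numbers printed by t.test can also be reproduced by hand. A small verification sketch (not part of the original solution; it reuses the summary values printed above, with sd(x) = 0.9330827 as reported in assignment 23):

> 2*pt(-2.3892, df=99)          # ~0.0188, the reported p-value
> 0.2229281 + c(-1,1)*qt(0.975,99)*0.9330827/sqrt(100)
                                # ~ < 0.0378 , 0.4081 >, the reported 95% CI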
> t.test(y)

        One Sample t-test

data:  y
t = 0.054, df = 99, p-value = 0.957
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.1632961  0.1724413
sample estimates:
  mean of x
0.004572609

This result shows that the null hypothesis can't and shouldn't be rejected. In fact, a mean of x this close to zero or closer would occur only about 4 times out of 100 if the null hypothesis were true. But this does not mean that there is a probability of 96% that the true mean is zero! The 95 percent confidence interval < -0.163, 0.172 > means that there is a 95% probability that the true mean is within this interval. That is, if you keep drawing samples of 100 numbers from the same distribution, the CI would on average include the true mean 95 out of 100 times. So in this case the CI is far more valuable than the p-value.

9.
The H0 hypothesis is that the true difference in means is zero, just as it would be if the samples were drawn from the same distribution. H1 is the opposite: the means are unequal.

> t.test(x,y)

        Welch Two Sample t-test

data:  x and y
t = 1.7336, df = 196.13, p-value = 0.08455
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.03003875  0.46674981
sample estimates:
  mean of x   mean of y
0.222928138 0.004572609

This result shows that the null hypothesis can't be rejected, as the result is not statistically significant. A p-value of 0.08 means that if the null hypothesis of equal means were true, a result as extreme as a difference in means of 0.218 would occur only about 8 times out of 100. As this is not too unlikely, the result is not significant. The CI tells us that there is a 95% probability that the true difference in means is within the interval < -0.03, 0.47 >. Again, the CI gives us the most precise information. However, quite often only the p-value is used in applied statistics. If a 90 percent confidence interval is calculated instead, the result is < 0.01, 0.42 >: with 90% confidence the true difference in means lies in an interval that does not include zero.

10.
A Welch Two Sample t-test gives a p-value = 0.2517 and a 95 percent confidence interval < -46, 145 >. The p-value is much larger than 0.05, which means that this is a quite probable outcome if the null hypothesis is true, so one cannot conclude that there is a difference in performance between the two systems. From the CI we see that there is a 95% probability that the true difference of the means is in this interval. The interval includes zero, and again we cannot conclude that the systems are different. If we calculate a 74.83 percent confidence interval (1 - 0.2517 = 0.7483) we get the CI < 0.004, 99.3 > (why is zero just at the edge of this CI?). This is just another way to interpret the p-value: 74.83% of the times one makes such a test, the mean difference would be in this range, not including zero. See the sketch below.
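The question above has a precise answer, sketched here (not part of the original solution; x and y stand for the two systems' measurement vectors). A two-sided t-test rejects at level α exactly when the (1-α) CI excludes zero, because both are built from the same t statistic, so setting the confidence level to 1 minus the p-value puts zero right at the CI boundary:

res <- t.test(x,y)                                 # p-value 0.2517 here
t.test(x,y,conf.level = 1 - res$p.value)$conf.int  # one endpoint is ~0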
11.
Now the Welch Two Sample t-test gives a p-value = 0.08744. If we decided before doing the experiments to use a significance level of 0.05, we still cannot conclude that there is a significant difference between the two systems. The 95 percent confidence interval is < -8.4, 109 >, so the 95% CI just includes zero. Still, there is not enough evidence to reject the null hypothesis. If we compute a 91.256% CI we get the interval < 0.00, 100.5 >: in more than 90% of the times one makes such a test, the mean difference would be in this range, not including zero. If a significance level of 0.1 had been chosen before we did the experiment, we could have rejected the null hypothesis.

12.
Boxplot. The bottom and top of the box are the 25th and 75th percentiles, and the line near the middle of the box is the median. The default value of the range is 1.5, so the whiskers are placed at the most extreme points of the sample, but at most 1.5 times the interquartile range (= the height of the box) beyond the box.

[Figure 1: Boxplot of lamp lifetimes]

The boxplot shows roughly how skewed the distribution is and where the central half of the data points is located. The standard deviation, on the other hand, is symmetric and will show no skewness. Additionally, the boxplot shows the extrema, if necessary as outliers. Since there are no outliers in this case, all points are within the whiskers. Both the standard deviation and the mean might change a lot with large outliers, while the median is less influenced by such extreme points. The following shows that both max and min are within 1.5 times the interquartile range outside the box, so the whiskers are positioned at min and max:

> max(x)
[1] 1340
> min(x)
[1] 702
> median(x)
[1] 1009
> quantile(x)
     0%     25%     50%     75%    100%
 702.00  924.50 1009.00 1154.75 1340.00
> IQR(x)   # interquartile range
[1] 230.25
> 1154.75-924.50
[1] 230.25
> quantile(x,probs=0.75)
    75%
1154.75
> quantile(x,probs=0.75) + 1.5*IQR(x)
     75%
1500.125
> quantile(x,probs=0.25) - 1.5*IQR(x)
    25%
579.125

This is because 702 is larger than 579 and 1340 is smaller than 1500. The boxplot shows that the distribution is somewhat skewed; the lower 25% quantile is much closer to the median than the upper.

13.
This extreme point is now outside the box by more than 1.5 times the interquartile range:

> quantile(x,probs=0.75) + 1.5*IQR(x)
    75%
1495.75

Since the extreme value 1896 is larger than the 75% quantile + 1.5 times the interquartile range = 1495.75, it is plotted as an outlier. The upper whisker is then placed at the next largest point, 1340, since this is below 1495.75.

[Figure 2: Boxplot of lamp lifetimes]

Comparing the median and the mean, we see that the mean and the standard deviation change a lot more than the median and the quantiles because of this single error:

measure     mean     sd     median   25% quantile   75% quantile
correct     1027.08  148.4  1009     924.5          1154.75
with typo   1047.08  191.5  1015.5   930.75         1156.75

The boxplot is almost unchanged (if you zoom in on it), except that it shows the single outlier. It is thus a method which is more robust to individual error points and gives more information than just the mean and the standard deviation. An illustration follows below.
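The robustness argument can be demonstrated with synthetic data (a sketch only; the actual lamp-lifetime sample is not reproduced here, so these numbers are stand-ins):

x <- rnorm(50, mean=1027, sd=148)   # stand-in for the lifetime sample
y <- x
y[1] <- y[1] + 800                  # simulate a typo in one data point
c(mean(x), mean(y))                 # mean shifts noticeably
c(sd(x), sd(y))                     # sd grows a lot
c(median(x), median(y))             # median is nearly unchanged
boxplot(x, y, names=c("correct","with typo"))  # typo shows up as an outlier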
14.
The 95 percent confidence interval is < 985, 1069 >, but this is a measure of how accurate the calculated sample mean is: there is a probability of 95% that the true mean is within this range. This says nothing about the probability of individual data points, so that probability is NOT 95%! The probability of a lifetime within this interval can only be estimated, and a reasonable estimate is found by counting the number of data points within this interval in the current sample. Running sort(x), one finds that only

1009 1009 1022 1035 1037 1045 1067

are within this range, or 7 out of 50, and thus 14%. This is of course just a rough estimate of the probability, but it is certainly far from 95%. If we assume that the distribution is Gaussian with a mean of 1027 and a standard deviation of 148, the probability is

> 1 - 2*pnorm(985,mean=1027.08,sd=148.4)
[1] 0.2232508

which gives a somewhat larger probability, 22%. However, there is no strong evidence for a Gaussian distribution since the data are so skewed.

15.
x <- rnorm(10,1,1)
t.test(x)
One Sample t-test, p-value = 0.02408

2*(pnorm(-mean(x),0,sd(x)/sqrt(10))) = 0.006771427
# Much smaller; N is just 10 and sd(x) = 1.19 is far from 1

SIGN.test(x)
One-sample Sign-Test, p-value = 0.1094

> x
 [1]  2.15935869 -0.36024036  2.36360647  0.00839999  1.18033842  2.10846273 -1.07201314
 [8]  1.52291449  0.48742194  1.79825463

pbinom(2,10,0.5)*2 = 0.109375
# This is the exact answer. There were 2 values with a minus sign.
# p is the probability for 0,1,2 or 8,9,10 minus signs.

16.
> x <- rnorm(10,1,1)
> t.test(x)$p.value
[1] 0.0007673417
> SIGN.test(x)
s = 10, p-value = 0.001953
> x <- rnorm(10,1,1)
> t.test(x)$p.value
[1] 0.01786454
> SIGN.test(x)
s = 8, p-value = 0.1094
> x <- rnorm(10,1,1)
> t.test(x)$p.value
[1] 0.02395486
> SIGN.test(x)
s = 8, p-value = 0.1094
> x <- rnorm(10,1,1)
> t.test(x)$p.value
[1] 0.007523077
> SIGN.test(x)
s = 9, p-value = 0.02148
> x <- rnorm(10,1,1)
> t.test(x)$p.value
[1] 0.03139396
> SIGN.test(x)
s = 8, p-value = 0.1094

Each time the t.test p-value is smaller. This is as expected: it is a more powerful test and potentially gives more precise results. However, it depends on the data being normally distributed when N is so small; otherwise the estimate of the p-value might be wrong. The sign test has no such dependency.

17.
> y <- c(894,963,1098,982,1046,1002,989,994)
> x <- c(1011,998,1113,1008,1100,1039,1003,1098)
> t.test(x,y)
t = 1.8423, df = 13.537, p-value = 0.08744
> SIGN.test(x,y)
Dependent-samples Sign-Test
S = 8, p-value = 0.007812

In this case the SIGN.test gives a smaller p-value. We see that, by chance (that is, the way they are paired), all the values of the x set are larger than those of the y set, giving a small p-value for the sign test. But note that if one of the arrays is shuffled, you might get another result:

> w = sample(y)
> SIGN.test(x,w)
S = 6, p-value = 0.2891
> x
[1] 1011  998 1113 1008 1100 1039 1003 1098
> w
[1]  989  963  982 1046  894 1098  994 1002

The t.test is of course independent of such a permutation of the sample.

18.
> N = 10
> mu = 0.5
> sigma = 1
> x = rnorm(N,mu,sigma)
> t.test(x,conf.level=0.9)

        One Sample t-test

data:  x
t = 1.8071, df = 9, p-value = 0.1042
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 -0.008222728  1.151773396
sample estimates:
mean of x
0.5717753

> t = mean(x)/(sd(x)/sqrt(N)); df = N-1
> solution = uniroot(function(z) (pt(z,N-1) - pt(-z,N-1) - 0.9), lower = 0, upper = 4, tol = 0.00001)
> z = solution$root
> p = 2*pt(-t,N-1)
> left = mean(x) - z*sd(x)/sqrt(N)
> right = mean(x) + z*sd(x)/sqrt(N)
> cat("t =",t,"df =",df,"p-value =",p,"\n")
t = 1.807125 df = 9 p-value = 0.1042085
> cat("CI: <",left,",",right,">\n")
CI: < -0.008222728 , 1.151773 >
> cat("mean: ",mean(x),"\n")
mean:  0.5717753
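A side note (not in the original solution): the critical value found with uniroot in assignments 3 and 18 can be read directly off the quantile function of the t distribution, since pt(z,N-1) - pt(-z,N-1) = 0.9 is solved by the 95% quantile:

> qt(0.95, df=9)   # ~1.833113, the same value as solution$root above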
19.
> # Assuming a normal distribution
> solution = uniroot(function(z) (pnorm(z) - pnorm(-z) - 0.90), lower = 0, upper = 4, tol = 0.00001)
> z = solution$root
> p = 2*pnorm(-t)
> left = mean(x) - z*1/sqrt(N)       # now we know that sd = 1
> right = mean(x) + z*1/sqrt(N)
> #left = mean(x) - z*sd(x)/sqrt(N)
> #right = mean(x) + z*sd(x)/sqrt(N)
> t.test(x,conf.level=0.9)

        One Sample t-test

data:  x
t = 1.8071, df = 9, p-value = 0.1042
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 -0.008222728  1.151773396
sample estimates:
mean of x
0.5717753

> cat("t =",t,"df =",df,"p-value =",p,"\n")
t = 1.807125 df = 9 p-value = 0.07074286
> cat("CI: <",left,",",right,">\n")
CI: < 0.05162674 , 1.091924 >
> cat("mean: ",mean(x),"\n")
mean:  0.5717753

Since we know that σ = 1, the numbers calculated based on the normal distribution follow from Theorem 2 and hence are exact. If we draw numbers over and over from such a distribution, the true mean, µ = 0.5, would on average be within this interval 90% of the time. However, the 90% CI of the t-test is also exact, in the sense that if this procedure was followed over and over again, the true mean would be within the CI 90% of the time. But since that CI also depends on the sample standard deviation, its size would be different every time and could be really large or really small.

20.
If we did not know σ, the CI based on the Student's t distribution would be the correct one, because the calculation of the CI would have to depend on the sample standard deviation, which also varies. So if such a calculation was performed over and over again, 90% of the time the true mean would be within the CI calculated each time. The size of the CI would vary for each new measurement, but on average the true mean, µ = 0.5, would be within this interval 90% of the time.

21.
z=uniroot(function(z) (pnorm(z) - pnorm(-z) - 0.95), lower = 0, upper = 4, tol = 0.00001)$root
withinKnownSigma=0
N=5
Ntotal=200000
mu=0
sd=1
sKn = sd/sqrt(N)
for (i in 1:Ntotal){
  set=rnorm(N,mu,sd)
  x = mean(set)
  if(mu > x - z*sKn & mu < x + z*sKn){withinKnownSigma=withinKnownSigma+1}
  # Sigma known. Same CI each time, width = 2 z sKn.
  # Should give pnorm(z) - pnorm(-z) = 0.95
}
pG=withinKnownSigma/Ntotal
cat("Experiment, known sigma, within CI:",pG,"\n")
cat("Theoretical, normal distribution:  ",pnorm(z) - pnorm(-z))

22.
z=uniroot(function(z) (pt(z,N-1) - pt(-z,N-1) - 0.95), lower = 0, upper = 4, tol = 0.00001)$root
within=0
averCI=0
for (i in 1:Ntotal){
  set=rnorm(N,mu,sd)
  x = mean(set)
  s = sd(set)/sqrt(N)
  averCI = averCI + 2*s*z
  if(mu > x - z*s & mu < x + z*s){within=within+1}
  # Sigma unknown. A new CI each time; the width = 2 z s varies.
  # Should be within with probability pt(z,N-1) - pt(-z,N-1) = 0.95
}
p=within/Ntotal
cat("\n\nExperiment, unknown sigma, within CI: ",p,"\n")
# The random variable (x-mu)/s follows Student's t distribution
cat("Theoretical, Student's t distribution: ",pt(z,N-1) - pt(-z,N-1))
averCI = averCI/Ntotal
z=uniroot(function(z) (pnorm(z) - pnorm(-z) - 0.95), lower = 0, upper = 4, tol = 0.00001)$root
cat("\n\nKnown sigma CI-length:",2*sd*z/sqrt(N))
cat("\nAverage CI-length:     ",averCI)
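The simulation in assignment 22 can be checked against theory (a supplementary sketch, not part of the original solution). For a normal sample the expected sample standard deviation is E[s] = σ·√(2/(N-1))·Γ(N/2)/Γ((N-1)/2), so the average length of the t-based CI can also be computed directly:

N <- 5; sd <- 1
zt <- qt(0.975, N-1)                                  # t critical value
Es <- sd*sqrt(2/(N-1))*gamma(N/2)/gamma((N-1)/2)      # E[s] for normal data
cat("Expected t-CI length:  ", 2*zt*Es/sqrt(N), "\n")           # ~2.33
cat("Known-sigma CI length: ", 2*qnorm(0.975)*sd/sqrt(N), "\n") # ~1.75

This agrees with the simulation: the unknown-sigma CI is on average longer than the known-sigma one.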
23.
We use the t-test of assignment 8 on the set x of 100 numbers as an example:

t.test(x)

        One Sample t-test

data:  x
t = 2.3892, df = 99, p-value = 0.01878
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.03778428 0.40807200
sample estimates:
mean of x
0.2229281

The following R script simply draws a random sample of 100 numbers from a normal distribution with mean zero and standard deviation 1, ten million times:

N = 0
total = 10000000
for (i in 1:total){
  x <- rnorm(100,sd=1.0)
  m = mean(x)
  if(m > 0.2229281 | m < -0.2229281){
    N = N + 1
  }
}
P = N/(1.0*total)
cat("P: ",P,"\n")

When run, it gives

P:  0.0258132

which is a bit larger than the result from the t-test, p-value = 0.019. Our result in this experiment is correct, since it actually carries out the experiment that the p-value asked for in the assignment corresponds to: how probable is it that the sample x is from a normal distribution with mean zero and standard deviation 1, and that the mean just by chance deviates this much from zero? However, the t-test result is also correct in the case where we do not know the value of the standard deviation σ, as explained below.

In the present case we know that the numbers are drawn from a normal distribution, and the p-value equals the area outside the extreme values -0.2229281 and 0.2229281 of the Gaussian function with mean zero and standard deviation 1/√N = 0.1, as Theorem 2 states. And indeed, R gives the following result:

> 2*pnorm(-0.2229281,sd=0.1)
[1] 0.02579521

consistent with the result of the experiment.

However, is this p-value really what we would want to calculate if we wanted to test a hypothesis based on the results of the sample x? No, because then we would not know that the true standard deviation of the random distribution the sample x was drawn from is σ = 1. If we knew, the p-value calculated above would be the one we wanted. But if we do not know the value of σ, we have to rely on the sample standard deviation s calculated from the N = 100 sample measurements. So we want to find the p-value which answers: how probable is it that the sample x is from a normal distribution with mean zero, and that the mean just by chance deviates this much from zero?

The result given by the t-test is based on the single sample of N = 100 measurements only, with no knowledge of the true value of the standard deviation σ = 1. The Student's t distribution is the distribution of the variable

    t = (x̄ - µ)/(s/√N)

Assuming µ = 0, and calculating sd(x) = 0.9330827, the recorded result is t = 0.2229281/0.09330827 = 2.3891569 ≈ 2.4. Based on this single sample average and sample standard deviation, the p-value is calculated as

> 2*pt(-t,99)
[1] 0.01878143

The p-value calculated is the probability of obtaining a result as extreme as t or more extreme, given that the underlying distribution is normal but σ is unknown. The way to perform an experiment which corresponds to this p-value is to repeat the experiment over and over again and record how often the new result

    t_new = (x̄ - µ)/(s/√N)

is larger in magnitude than the value we started out to compare with, t = 2.4. If this experiment is carried out as follows using R:

N = 0
total = 10000000
t = 0.2229281/0.09330827
for (i in 1:total){
  x <- rnorm(100,sd=1)
  m = mean(x)/(sd(x)/10)
  if(m > t | m < -t){
    N = N + 1
  }
}
P = N/(1.0*total)
cat("P: ",P,"\n")

the result is

P:  0.0187898

and indeed, it fits the result of the t-test.
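As a final aside (not from the original notes), the same experiment can be written more compactly with replicate; a sketch with the same parameters but fewer repetitions for speed:

total <- 1e6
t0 <- 0.2229281/0.09330827
m <- replicate(total, { x <- rnorm(100, sd=1); mean(x)/(sd(x)/10) })
cat("P:", mean(abs(m) > t0), "\n")   # should be close to 2*pt(-t0, 99)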