6 Lecture, week 6

6.1 Confidence intervals and the Student t distribution

When calculating confidence intervals, we have so far considered two cases:

• The distribution of the measured data points is known to be Gaussian/normal and the variance σ² is known. Then the confidence interval CI given as x̄ ± z·σ/√N is exact.

• The distribution of the measured data points is unknown, but the variance σ² is known. Then the confidence interval CI given as x̄ ± z·σ/√N is exact in the limit N → ∞ according to the CLT, and the approximation is good for large N, in most cases for N > 30.

By adjusting the parameter z we may construct the CI we like. For instance, choosing z = 1 gives a 68% CI while z = 1.96 gives a 95% CI.

Quite often the distribution of the measured data points is Gaussian, but it is almost never the case that we know the standard deviation σ in advance. Usually the variance is unknown and must be approximated by the sample variance

    s² = 1/(N − 1) · Σᵢ₌₁ᴺ (xᵢ − x̄)²

However, in such a case, when constructing the confidence interval CI as x̄ ± s/√N, it turns out that this will not be a 68% CI, as it is when the standard deviation is known. The reason is that the sample standard deviation will vary when repeatedly taking samples, so each time a different CI will be calculated. It turns out that the average width of the CIs calculated from the sample standard deviation, 2s/√N, is a bit smaller than the width used when σ is known, 2σ/√N. The result is that this instead gives a 63% CI when N = 5. When N is increased, s gets closer to the true σ and the CI gets closer to the 68% CI of the normal distribution.

We would like to construct a CI given by

    x̄ ± z·s/√N

but the probability that the true µ will be within this CI cannot be estimated from the normal distribution, because the width of the CI varies with the varying sample standard deviation s.
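The 63% figure can be checked with a quick simulation. A Monte Carlo sketch in Python (the notes otherwise use R; the trial count and seed here are arbitrary choices):

```python
import math
import random
import statistics

def coverage(n=5, trials=20000, mu=0.0, sigma=1.0, z=1.0, seed=1):
    """Fraction of trials in which the CI  x_bar +/- z*s/sqrt(n)  contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.mean(sample)
        s = statistics.stdev(sample)          # sample standard deviation (N-1 divisor)
        half = z * s / math.sqrt(n)
        if xbar - half <= mu <= xbar + half:
            hits += 1
    return hits / trials

print(coverage(n=5))    # about 0.63, not 0.68
print(coverage(n=100))  # close to 0.68, since s approaches sigma
```

For N = 5 the estimate lands near 0.63, and increasing N moves it towards the 68% of the known-σ case, as described above.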
The percentage of the CI can be estimated by including the sample standard deviation s as another variable, along with the random variables Xᵢ. Given that the sampled data points are normally distributed, it can be shown that the variable

    t = (x̄ − µ) / (s/√N)

will be distributed as the Student t distribution f(t, N − 1) with degrees of freedom df = N − 1, where x̄ is the sample mean and s is the sample standard deviation. Note that the distribution function f(t, N − 1) does not depend on µ or σ. The function depends on N, but the Student t density function gets closer to the normal distribution as N increases, as shown in Fig. 1. The exact definition of f(t, N − 1) is given in the last section. The fact that the variable t follows the Student t distribution is analogous to the CLT stating that the variable x̄ will be distributed as a normal distribution with mean µ and standard deviation σ/√N.

Figure 1: The Student t density function approaches the normal distribution rapidly as the degrees of freedom increase. (The figure shows the density N(x; µ = 0, σ = 1) together with f(x, df) for df = 1, 2, 3, 5, 10 and 20.)

Very similarly to the case of the normal distribution, one can show that the probability of drawing a number between −z and z from the Student t distribution is P(−z ≤ t ≤ z) = 0.626 in the case z = 1 and N = 5. From this we may calculate the probability for µ to be within a confidence interval x̄ ± z·s/√N using standard rules for inequalities:

    P(−z ≤ t ≤ z) = P(−z ≤ (x̄ − µ)/(s/√N) ≤ z)
                  = P(−z·s/√N ≤ x̄ − µ ≤ z·s/√N)
                  = P(−x̄ − z·s/√N ≤ −µ ≤ −x̄ + z·s/√N)
                  = P(x̄ + z·s/√N ≥ µ ≥ x̄ − z·s/√N)
                  = P(x̄ − z·s/√N ≤ µ ≤ x̄ + z·s/√N)

From this we may conclude that

    x̄ ± z·s/√N

is a 63% confidence interval in the case z = 1 and N = 5.
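The claim that f(t, N − 1) is free of µ and σ can be checked by simulation: drawing from N(0, 1) and from, say, N(5, 3²) gives the same distribution for t. A Python sketch (trial count and seed are arbitrary; with a fixed seed the two estimates coincide essentially exactly, since t is invariant under the location–scale shift):

```python
import math
import random
import statistics

def p_t_in(mu, sigma, z=1.0, n=5, trials=40000, seed=3):
    """Monte Carlo estimate of P(-z <= t <= z) for t = (x_bar - mu)/(s/sqrt(n)),
    with the data drawn from a normal distribution with mean mu and sd sigma."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        t = (statistics.mean(xs) - mu) / (statistics.stdev(xs) / math.sqrt(n))
        if -z <= t <= z:
            hits += 1
    return hits / trials

print(p_t_in(0.0, 1.0))  # about 0.626 for N = 5
print(p_t_in(5.0, 3.0))  # essentially the same value: t does not depend on mu or sigma
```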
Just as when calculating the CI for the normal distribution, we need to find a factor z such that the probability of the Student t distribution satisfies

    P(x̄ − z·s/√N < µ < x̄ + z·s/√N) = Percent/100
    ⇓
    P(−z ≤ t ≤ z) = Percent/100                              (1)

in order to find a Percent confidence interval. For small N there will be some difference, but even then the result will not differ a lot from the results calculated using the normal distribution.

6.1.1 R's t.test

When running a one sample t-test using R, the calculation of a confidence interval is by default based on the Student t distribution:

> t.test(x)

        One Sample t-test

t = 0.3891, df = 9, p-value = 0.7062
95 percent confidence interval:
 -0.4967945  0.7032099
mean of x
 0.1032077

> sd(x)
[1] 0.8387453

and Percent = 95 in this case. In order to show how R calculates this CI, we start by solving Eq. (1) to find the correct z. Using R, this can be done by

> uniroot(function(z) (pt(z,N-1) - pt(-z,N-1) - 0.95), lower = 0, upper = 4, tol = 0.00001)
$root
[1] 2.262157

giving the result z = 2.262157. The confidence interval may then be calculated as

> mean(x) - 2.262157*sd(x)/sqrt(N)
[1] -0.4967945
> mean(x) + 2.262157*sd(x)/sqrt(N)
[1] 0.7032099

just like the output of the t-test. Since in this case we know that the distribution the numbers are drawn from is the normal distribution with σ = 1, the exactly correct 95% confidence interval can be calculated as follows:

> uniroot(function(z) (pnorm(z) - pnorm(-z) - 0.95), lower = 0, upper = 4, tol = 0.00001)
$root
[1] 1.959963

giving the result z = 1.96.
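What uniroot does here can be mimicked with a few lines of bisection. A Python sketch for the normal case (the Student t CDF is not in Python's standard library, so only pnorm's role is reproduced, via math.erf):

```python
import math

def phi(z):
    """Standard normal CDF: Phi(z) = (1 + erf(z/sqrt(2))) / 2, R's pnorm(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def find_z(conf=0.95, lower=0.0, upper=4.0, tol=1e-10):
    """Bisection on Phi(z) - Phi(-z) - conf, mimicking R's uniroot call above."""
    f = lambda z: phi(z) - phi(-z) - conf  # monotonically increasing in z
    while upper - lower > tol:
        mid = 0.5 * (lower + upper)
        if f(mid) < 0:
            lower = mid
        else:
            upper = mid
    return 0.5 * (lower + upper)

print(round(find_z(0.95), 6))  # 1.959964, matching uniroot's root above
```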
The confidence interval may then be calculated as

> mean(x) - 1.96*1/sqrt(N)
[1] -0.5165987
> mean(x) + 1.96*1/sqrt(N)
[1] 0.7230141

The 95% confidence interval given by the t-test is also correct, in the sense that if we do this operation over and over again, extracting N data points and calculating the sample mean, the sample standard deviation and the CI based on the Student t distribution, then on average 95% of these intervals will indeed contain the true mean. But the variation of the sample standard deviation itself leads, on average, to a slightly broader CI compared to the one calculated when knowing the true σ. Quite seldom, however, is the true σ known in advance.

Finally there is the most usual case, where both the underlying distribution and its variance σ² are unknown. For N larger than 30 the distribution of the averages will usually be close to Gaussian, and the sample variance s² also gets closer to the true variance σ² as N increases. This means that t = (x̄ − µ)/(s/√N) will not be exactly distributed as the Student t distribution, but it will approach it as the distribution of the means approaches the normal distribution. If the distribution of the individual measurements is close to normal, both will be good approximations. The Student t distribution is usually the best approximation when N is small, as it takes into account that the sample standard deviation will vary from the true one. As N grows larger, the CLT means that the distribution of the means will approach the normal distribution, but so will the Student t distribution, as the sample standard deviation s becomes a better approximation of the true standard deviation σ.

6.1.2 Welch's t-test

This test is used when comparing two different samples of data. Typically it could be the two sets of data resulting from testing the performance of two similar systems. One example could be to test the speed of writing to disk when running VMware and KVM.
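Under the hood, Welch's test is based on the statistic t = (x̄ − ȳ)/√(s_x²/N_x + s_y²/N_y), with the Welch–Satterthwaite approximation for the degrees of freedom. A small Python sketch of these two standard formulas (the notes use R; this is only an illustration, and the example data are made up):

```python
import math
import statistics

def welch_t(x, y):
    """Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    # Sample variances with the N-1 divisor, like R's var().
    vx, vy = statistics.variance(x), statistics.variance(y)
    se2x, se2y = vx / nx, vy / ny          # squared standard errors of the two means
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(se2x + se2y)
    df = (se2x + se2y) ** 2 / (se2x ** 2 / (nx - 1) + se2y ** 2 / (ny - 1))
    return t, df

# Made-up data; with equal sample variances and sizes, df reduces to 2(N - 1).
t, df = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 3.0, 4.0, 5.0, 6.0])
print(t, df)  # -1.0 8.0
```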
When trying to decide which system is fastest, one should run an R t-test with the two sets of data as arguments. The result could be something like this:

> x=rnorm(10,1,1)
> y=rnorm(10,0,1)
> t.test(x,y)

        Welch Two Sample t-test

data: x and y
t = 2.7995, df = 14.615, p-value = 0.01374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.2850111 2.1212203
sample estimates:
mean of x mean of y
0.7649797 -0.4381361

In this example we know the distributions the data are drawn from, but usually we would not know this. The 95 percent confidence interval means that there is a probability of 95% that the difference µx − µy is within the interval (0.3, 2.1). And this is correct in this case, as we know that µx − µy = 1. However, if we repeated this experiment over and over again, in 5% of the cases the interval would not include 1.

6.2 Hypothesis testing

Calculating a confidence interval is a very precise way of presenting the results of an experiment, claiming how confident one is that the true mean is within this interval. Depending a bit on what the final goal of the experiment is, a related way to present results is through hypothesis testing. Normally a significance level α, which is often set to be 0.05, is chosen before the experiment is performed. Roughly speaking, a significance level of 0.05 means that you will recognize your results as significant if there is a probability of less than 5% that the outcome was just due to chance and not due to an intrinsic difference you claim to have measured by the experiment.

After determining a significance level, the hypotheses must be established. A common approach is to define two hypotheses:

H0 The null hypothesis
H1 The alternative hypothesis

Next one carries out the experiments and tests the null hypothesis by computing the p-value.
This value is the probability of obtaining a test statistic (result) at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. If the p-value is less than the significance level α, often set to be 0.05, the null hypothesis is rejected. When the null hypothesis is rejected, the result is said to be statistically significant. Rejecting the null hypothesis does not mean that the alternative hypothesis has been proved, but it is more likely true than what we thought before carrying out the experiment. And if the p-value is larger than the significance level α, this by no means proves that the null hypothesis is true, no matter how close the p-value is to one. If H0 is true, the experiment is most likely to end up with a p-value larger than α, but a single such observation does not prove it or even indicate it.

Say your hypotheses were the following:

H0 Men in Stockholm and men in Oslo are equally tall
H1 Men in Stockholm and men in Oslo are not equally tall

If you measured the average height of ten men in Oslo and ten men in Stockholm and the averages were the same, this would not prove anything. However, if the difference between the averages you measured was 10 cm, this might be a statistically significant difference. The p-value would indicate how probable such a measurement would be just by chance, given that men in Oslo and Stockholm are equally tall.

The same ideas as when calculating confidence intervals may be used when performing hypothesis testing. The Central Limit Theorem states that the means of measurements will tend to be normally distributed, and the p-value may be calculated based on this. Say you draw N = 10 numbers from some distribution or population. Then your hypotheses might be defined as:

H0 Null hypothesis: the true mean µ = 0
H1 Alternative hypothesis: the true mean µ ≠ 0

Next you calculate the mean x̄ and the sample standard deviation s.
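Given x̄ and s, the p-value under the normal approximation is the two-sided tail probability of a normal distribution with mean 0 and standard deviation s/√N, which the next paragraph works out with R. A Python sketch of the same computation (math.erf provides the normal CDF; the sample summary values here are made up for illustration):

```python
import math

def phi(z):
    """Standard normal CDF: Phi(z) = (1 + erf(z/sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_value_normal(xbar, s, n, mu0=0.0):
    """Two-sided p-value P(X < -|x_bar|) + P(X > |x_bar|) for X ~ N(mu0, s/sqrt(n))."""
    z = abs(xbar - mu0) / (s / math.sqrt(n))
    return 2.0 * phi(-z)

# Made-up sample summary with N = 10:
print(p_value_normal(xbar=0.5, s=1.0, n=10))  # roughly 0.11
```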
Assuming the null hypothesis, true mean µ = 0, and assuming that the means are normally distributed with standard deviation s/√N, the probability of a result as large as x̄ is given as

    P(X < −|x̄|) + P(X > |x̄|)                                (2)

and this is given by the corresponding areas of the density function of the normal distribution. Using R, this could be calculated as in the following example:

> x = rnorm(10)
> mean(x)
[1] 0.6756075
> sd(x)
[1] 1.101143
> 2*pnorm(-mean(x),0,sd(x)/sqrt(10))
[1] 0.05235299

since this is the probability given above, the area below −|x̄| and above |x̄| for a normal distribution with µ = 0 and σ = s/√10. When performing a t-test, the result is as follows:

> t.test(x)

        One Sample t-test

t = 1.9402, df = 9, p-value = 0.08428
alternative hypothesis: true mean is not equal to 0

and we observe that the p-value is a bit larger than the one calculated. Again, when we use the sample standard deviation in the calculations and do not know the true σ, the Student t distribution is the correct choice, and we indeed get the same and correct result:

> t = mean(x)/(sd(x)/sqrt(10))
> 2*pt(-t,N-1)
[1] 0.0842796

The following shows that calculating a confidence interval is another way to express some of the same result:

> t.test(x,conf.level=0.91572)

        One Sample t-test

t = 1.9402, df = 9, p-value = 0.08428
alternative hypothesis: true mean is not equal to 0
91.572 percent confidence interval:
 1.026891e-06 1.351214e+00

When calculating a (1 − p-value) confidence interval, one side of the confidence interval exactly touches zero, the mean value of the null hypothesis. This is not a coincidence; why does this happen?

6.2.1 Sign test

When calculating confidence intervals and p-values as described above, there are several assumptions.
Most importantly, the distribution of the individual measurements of the experiments must either not be far from a normal distribution, or the number of experiments N must be larger than 30 so that the distribution of the means is close to a normal one due to the Central Limit Theorem. However, there are tests which do not depend on such criteria at all. These methods are generally less precise in their predictions, but their results are valid no matter what the true distribution leading to the results looks like.

The sign test is one such test, for which the null hypothesis H0 is that the median µ̃ is zero. Given a random variable X, the median µ̃ is defined by

    P(X > µ̃) = 0.5 = P(X < µ̃)

In order to test this hypothesis one may view a series of experiments as a series of trials flipping a coin with probability p = 1/2 for heads and tails. If the result of an experiment is less than zero a minus is recorded, and if it is larger than zero a plus is recorded. It is then straightforward to calculate the p-value, which equals the probability in such a coin-flipping experiment of obtaining the result observed or a more extreme result.

6.3 Descriptive statistics

Until now we have studied statistical inference, also called statistical induction, which is the science of drawing conclusions based on data which have some kind of random variation, for instance measurement errors or variation due to random sampling from some population. The aim is usually to be able to describe how accurate the results of the measurements or experiments are, or to do hypothesis testing, making it possible to reject a hypothesis at a given significance level. In other cases you might just want to describe your data, for instance if you want to show how the number of processes running on a server at a given time of a weekday is distributed, when sampling data over a year. Or you could want to show the distribution of the individual data points of your measured data.
These are not necessarily normally distributed (remember, it is the distribution of the means which tends to be Gaussian) and could be skewed, so that if you just show the average and standard deviation, you will not give a precise picture of your data. In connection with the CLT we have used histograms, which are very useful when describing data; another valuable method is the box plot. The following R code

x = rnorm(100,5,2)
y = rexp(100,0.5)
boxplot(x,y,ylim=c(-3,13),names=c("normal","exponential"))

produces the box plot of Fig. 2. The bottom and top of the box are the 25th and 75th percentiles, and the line near the middle of the box is the median. For R box plots the default value of the range parameter is 1.5, meaning that the whiskers are placed at the most extreme points of the sample, but at most at 1.5 times the interquartile range (= height of the box) from the box. The box plot shows roughly how skewed the distribution is; in the figure you can see that the exponential distribution is skewed, as the points below the median lie closer together. It also shows where the central half of the data points is located. The standard deviation, on the other hand, is symmetric and will show no skewness. Additionally, the box plot shows the extrema, if necessary as outliers. In the case of the exponential sample, the box plot shows two outliers outside the whiskers. Both the standard deviation and the mean might change a lot with large outliers, while the median is less influenced by such extreme points.

Figure 2: Box plot of 100 numbers drawn from a normal and an exponential distribution.

6.4 The Student t distribution

The density function of the Student t distribution may be written as

    f(x, N − 1) = Γ(N/2) / (√(π(N − 1)) · Γ((N − 1)/2)) · (1 + x²/(N − 1))^(−N/2)

where the gamma function satisfies

    Γ(N) = (N − 1)!
    Γ(N + 1/2) = (2N)!·√π / (4ᴺ · N!)

For large N this function approaches the normal distribution, and

    lim (N → ∞) f(x, N − 1) = N(x; µ = 0, σ² = 1)
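These definitions are easy to verify numerically. A Python sketch using math.gamma, checking both the half-integer gamma identity and the convergence of the t density towards the standard normal density:

```python
import math

def t_density(x, df):
    """Student t density with df degrees of freedom (the f(x, N-1) above, df = N-1)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(math.pi * df) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def normal_density(x):
    """Standard normal density N(x; mu = 0, sigma^2 = 1)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# The half-integer identity: Gamma(N + 1/2) = (2N)! * sqrt(pi) / (4^N * N!)
N = 5
lhs = math.gamma(N + 0.5)
rhs = math.factorial(2 * N) * math.sqrt(math.pi) / (4 ** N * math.factorial(N))
print(abs(lhs - rhs) < 1e-9 * lhs)  # True

# f(x, df) approaches the normal density as df grows:
print(t_density(1.0, 2), t_density(1.0, 200), normal_density(1.0))
```

With df = 1 the density at x = 0 is 1/π, and for df around 200 the values are already within a fraction of a percent of the normal density.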