6 Lecture, week 6
6.1 Confidence intervals and the Student t distribution
When calculating confidence intervals, we have so far considered two cases:
• The distribution of the measured data points is known to be Gaussian/normal and the variance σ² is known. Then the confidence interval CI given as x̄ ± z·σ/√N is exact.
• The distribution of the measured data points is unknown, but the variance σ² is known. Then the confidence interval CI given as x̄ ± z·σ/√N is exact in the limit N → ∞ according to the CLT, and the approximation is good for large N, in most cases already for N > 30.
By adjusting the parameter z we may construct the CI we like. For instance, choosing z = 1 gives a 68% CI, while z = 1.96 gives a 95% CI.
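In R the factor z for a given confidence level can be found directly from the quantiles of the normal distribution (a small illustration, not part of the original notes):

qnorm(0.975)         # 1.96, the z giving a 95% CI
pnorm(1) - pnorm(-1) # 0.68, the coverage corresponding to z = 1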
Quite often the distribution of the measured data points is Gaussian, but it is almost never the case that we know the standard deviation σ in advance. Usually the variance is unknown and must be approximated by the sample variance

s² = 1/(N−1) · Σᵢ (xᵢ − x̄)²

where the sum runs over i = 1, …, N.
However, in such a case, when constructing the confidence interval CI as x̄ ± s/√N, it turns out that this will not be a 68% CI as it is when the standard deviation is known. The reason is that the sample standard deviation varies when repeatedly taking samples, so each time a different CI is calculated. It turns out that the average width of the CIs calculated from the sample standard deviation, 2s/√N, is a bit smaller than the width used when σ is known, 2σ/√N. The result is that this instead gives a 63% CI when N = 5. When N is increased, s gets closer to the true σ and the CI gets closer to the 68% CI of the normal distribution.
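The 63% figure can be checked with a small simulation (a sketch, not from the original notes; it assumes samples drawn from a standard normal, so the true µ = 0):

N = 5
hits = replicate(100000, {
  x = rnorm(N)                  # true mu = 0, sigma = 1
  abs(mean(x)) < sd(x)/sqrt(N)  # does the z = 1 interval contain mu?
})
mean(hits)                      # close to 0.63 rather than 0.68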
We would like to construct a CI given by

x̄ ± z·s/√N

but the probability that the true µ will be within this CI cannot be estimated from the normal distribution, because the width of the CI varies with the varying sample standard deviation s. The coverage of the CI can be estimated by including the sample standard deviation s as another variable, along with the random variables Xᵢ. Given that the sampled data points are normally distributed, it can be shown that the variable

t = (x̄ − µ)/(s/√N)

will be distributed as the Student t distribution f(t, N−1) with degrees of freedom df = N − 1, where x̄ is the sample mean and s is the sample standard deviation. Note that the distribution function f(t, N−1) does not depend on µ or σ. The function depends on N, but the Student t density function gets closer to the normal distribution as N increases, as shown in Fig. 1. The exact definition of f(t, N−1) is given in the last section.
The fact that the variable t follows the Student t distribution is analogous to the CLT stating that the variable x̄ will be distributed as a normal distribution with mean µ and standard deviation σ/√N.
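This can be visualized by simulating the t statistic directly (a sketch, not from the original notes; it assumes N = 5 draws from a standard normal):

N = 5
tstat = replicate(100000, { x = rnorm(N); mean(x)/(sd(x)/sqrt(N)) })  # t with true mu = 0
hist(tstat, breaks = 100, freq = FALSE, xlim = c(-5, 5))
curve(dt(x, df = N - 1), add = TRUE)  # matches the histogram
curve(dnorm(x), add = TRUE, lty = 2)  # the normal density is too narrow in the tails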
[Figure: the normal density N(x; µ = 0, σ = 1) together with the Student t densities f(x, df) for df = 20, 10, 5, 3, 2, 1, plotted for x between −4 and 4.]
Figure 1: The Student t density function approaches the normal distribution rapidly as the degrees of freedom increase.
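A figure like this can be reproduced with a few lines of R (a sketch; the plotting choices are assumptions):

curve(dnorm(x), from = -4, to = 4, ylab = "density")
for (df in c(20, 10, 5, 3, 2, 1)) curve(dt(x, df), add = TRUE, lty = 2)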
Very similarly to the case of the normal distribution, one can show that the probability of drawing a number between −z and z from the Student t distribution is

P(−z ≤ t ≤ z) = 0.626

in the case z = 1 and N = 5. From this we may calculate the probability for µ to be within a confidence interval x̄ ± z·s/√N using standard rules for inequalities:
P(−z ≤ t ≤ z) = P(−z ≤ (x̄ − µ)/(s/√N) ≤ z)
             = P(−z·s/√N ≤ x̄ − µ ≤ z·s/√N)
             = P(−x̄ − z·s/√N ≤ −µ ≤ −x̄ + z·s/√N)
             = P(x̄ + z·s/√N ≥ µ ≥ x̄ − z·s/√N)
             = P(x̄ − z·s/√N ≤ µ ≤ x̄ + z·s/√N)
From this we may conclude that

x̄ ± z·s/√N

is a 63% confidence interval in the case z = 1 and N = 5.
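The value 0.626 can be verified directly in R (a quick check, not part of the original derivation):

pt(1, df = 4) - pt(-1, df = 4)  # approximately 0.626, using df = N - 1 = 4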
Just as when calculating the CI for the normal distribution, we need to find a factor z such that the probability of the Student t distribution satisfies

P(x̄ − z·s/√N < µ < x̄ + z·s/√N) = Percent/100
⇓
P(−z ≤ t ≤ z) = Percent/100        (1)

in order to find a Percent confidence interval. For small N the factor z will differ somewhat from the one of the normal distribution, but even then the result will not differ a lot from the results calculated using the normal distribution.
6.1.1 R's t.test
When running a one sample t-test using R, the calculation of a confidence interval is by default based on the Student t distribution (here x contains N = 10 numbers which, as noted below, were drawn from a normal distribution with σ = 1):
> t.test(x)
One Sample t-test
t = 0.3891, df = 9, p-value = 0.7062
95 percent confidence interval:
-0.4967945 0.7032099
mean of x
0.1032077
> sd(x)
[1] 0.8387453
and Percent = 95 in this case. In order to show how R calculates this CI, we start by solving Eq. (1) and finding the correct z. Using R, this can be done by
> uniroot(function(z) (pt(z,N-1) - pt(-z,N-1) - 0.95), lower = 0, upper = 4, tol = 0.00001)
$root
[1] 2.262157
giving the result z = 2.262157. The confidence interval may then be calculated as
> mean(x) - 2.262157*sd(x)/sqrt(N)
[1] -0.4967945
> mean(x) + 2.262157*sd(x)/sqrt(N)
[1] 0.7032099
just like the output of the t-test.
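The same factor can also be read off directly from the quantile function of the t distribution, so the uniroot step above is mainly for illustration (a side note, not from the original text):

qt(0.975, df = 9)  # 2.262157, the same z as found by uniroot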
Since in this case we know that the numbers are drawn from the normal distribution with σ = 1, the exact 95% confidence interval can be calculated as follows.
> uniroot(function(z) (pnorm(z) - pnorm(-z) - 0.95), lower = 0, upper = 4,tol = 0.00001)
$root
[1] 1.959963
giving the result z = 1.96. The confidence interval may then be calculated as
> mean(x) - 1.96*1/sqrt(N)
[1] -0.5165987
> mean(x) + 1.96*1/sqrt(N)
[1] 0.7230141
The 95% confidence interval given by the t-test is also correct, in the sense that if we do this operation over and over again, extracting N data points and calculating the sample mean, the sample standard deviation and the CI based on the Student t distribution, then on average the true mean will indeed lie within 95% of these intervals. But the variation of the sample standard deviation itself leads on average to a slightly broader CI compared to the one calculated when knowing the true σ. Quite seldom, however, is the true σ known in advance.
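The broadening can be seen in a small simulation (a sketch, not from the original notes; it assumes N = 10 draws from a standard normal):

N = 10
widths = replicate(100000, 2*qt(0.975, N-1)*sd(rnorm(N))/sqrt(N))
mean(widths)            # average width of the t-based 95% CI
2*qnorm(0.975)/sqrt(N)  # the fixed, slightly smaller width when sigma = 1 is known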
Finally there is the most usual case, where both the underlying distribution and its variance σ² are unknown. For N larger than 30 the distribution of the averages will usually be close to Gaussian, and the sample variance s² also gets closer to the true variance σ² as N increases. But this means that t = (x̄ − µ)/(s/√N) will not be exactly distributed as the Student t distribution; it will approach it as the distribution of the means approaches the normal distribution. If the distribution of the individual measurements is close to normal, both will be good approximations. The Student t distribution is usually the better approximation when N is small, as it takes into account that the sample standard deviation will vary from the true one. As N grows larger, the CLT means that the distribution of the means will approach the normal distribution, but so will the Student t distribution, as the sample standard deviation s becomes a better approximation of the true standard deviation σ.
6.1.2 Welch's t-test
This test is used when comparing two different samples of data. Typically these could be the two sets of data resulting from testing the performance of two similar systems; one example could be testing the speed of writing to disk when running VMware and KVM. When trying to decide which system is faster, one should run an R t-test with the two sets of data as arguments. The result could be something like this:
> x=rnorm(10,1,1)
> y=rnorm(10,0,1)
> t.test(x,y)
Welch Two Sample t-test
data: x and y
t = 2.7995, df = 14.615, p-value = 0.01374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2850111 2.1212203
sample estimates:
mean of x mean of y
0.7649797 -0.4381361
In this example we know the distributions the data are drawn from, but usually we would not know this. The 95 percent confidence interval means that there is a probability of 95% that the difference µx − µy is within the interval (0.3, 2.1). And this is correct in this case, as we know that µx − µy = 1. However, if we repeated this experiment over and over again, in 5% of the cases the interval would not include 1.
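How this interval is constructed can be sketched by hand (a sketch, not from the original notes; it assumes the vectors x and y from above): the standard error of the difference is combined with the Welch-Satterthwaite approximation for the degrees of freedom.

se = sqrt(var(x)/length(x) + var(y)/length(y))    # standard error of mean(x) - mean(y)
df = se^4 / ((var(x)/length(x))^2/(length(x)-1) +
             (var(y)/length(y))^2/(length(y)-1))  # Welch-Satterthwaite degrees of freedom
mean(x) - mean(y) + c(-1, 1)*qt(0.975, df)*se     # reproduces the t.test interval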
6.2 Hypothesis testing
Calculating a confidence interval is a very precise way of presenting the results of an experiment, claiming how confident one is that the true mean is within this interval. Depending a bit on what the final goal of the experiment is, a related way to present results is through hypothesis testing. Normally a significance level α, which is often set to 0.05, is chosen before the experiment is performed. Roughly speaking, a significance level of 0.05 means that you will recognize your results as significant if there is a probability of less than 5% that the outcome was just due to chance and not due to an intrinsic difference you claim to have measured by the experiment.
After determining a significance level, the hypotheses must be established. A common approach is to define two hypotheses:
H0 The null hypothesis
H1 The alternative hypothesis
Next one carries out the experiments and tests the null hypothesis by computing the p-value. This value is the probability of obtaining a test statistic (result) at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. If the p-value is less than the significance level α, often set to 0.05, the null hypothesis is rejected. When the null hypothesis is rejected, the result is said to be statistically significant. Rejecting the null hypothesis does not mean that the alternative hypothesis has been proved, but it is more likely true than what we thought before carrying out the experiment. And if the p-value is larger than the significance level α, this by no means proves that the null hypothesis is true, no matter how close the p-value is to one. If H0 is true, the experiment is most likely to end up with a p-value larger than α, but a single such observation does not prove it or even indicate it. Say your hypotheses were the following:
H0 Men in Stockholm and men in Oslo are equally tall
H1 Men in Stockholm and men in Oslo are not equally tall
If you measured the average height of ten men in Oslo and ten men in Stockholm and the averages were the same, this would not prove anything. However, if the difference between the averages you measured was 10 cm, this might be a statistically significant difference. The p-value would indicate how probable such a measurement would be just by chance, given that men in Oslo and Stockholm are equally tall.
The same ideas as when calculating confidence intervals may be used when performing hypothesis testing. The Central Limit Theorem states that the means of measurements will tend to be normally distributed, and the p-value may be calculated based on this. Say you draw N = 10 numbers from some distribution or population. Then your hypotheses might be defined as:
H0 Null hypothesis: The true mean µ = 0
H1 Alternative hypothesis: The true mean µ ≠ 0
Next you calculate the mean x̄ and the sample standard deviation s. Assuming the null hypothesis, true mean µ = 0, and assuming that the means are normally distributed with standard deviation s/√N, the probability for a result as large as x̄ is given as

P(X < −|x̄|) + P(X > |x̄|)        (2)

and this is given by the corresponding areas of the density function of the normal distribution.
Using R, this could then be calculated as in the following example:
> x = rnorm(10)
> mean(x)
[1] 0.6756075
> sd(x)
[1] 1.101143
> 2*pnorm(-mean(x),0,sd(x)/sqrt(10))
[1] 0.05235299
since this is the probability given above: the areas below −|x̄| and above |x̄| for a normal distribution with µ = 0 and σ = s/√10. When performing a t-test, the result is as follows:
> t.test(x)
One Sample t-test
t = 1.9402, df = 9, p-value = 0.08428
alternative hypothesis: true mean is not equal to 0
and we observe that the p-value is a bit larger than the one calculated. Again, when we use the sample standard deviation in the calculations and do not know the true σ, the Student t distribution is the correct choice, and we indeed get the same, correct result:
> t = mean(x)/(sd(x)/sqrt(10))
> 2*pt(-t,N-1)
[1] 0.0842796
The following shows that calculating a confidence interval at a suitable confidence level is another way to express some of the same result:
> t.test(x,conf.level=0.91572)
One Sample t-test
t = 1.9402, df = 9, p-value = 0.08428
alternative hypothesis: true mean is not equal to 0
91.572 percent confidence interval:
1.026891e-06 1.351214e+00
When calculating a confidence interval at the (1 − p-value) confidence level, one side of the interval exactly touches zero, the mean value of the null hypothesis. This is not a coincidence; why does this happen?
6.2.1 Sign test
When calculating confidence intervals and p-values as described above, several assumptions are made. Most importantly, the distribution of the individual measurements of the experiments must either be not far from a normal distribution, or the number of experiments N must be larger than 30, so that the distribution of the means is close to a normal one due to the Central Limit Theorem. However, there are tests which do not depend on such criteria at all. These methods are generally less precise in their predictions, but their results are valid no matter what the true distribution behind the results looks like. The sign test is one such test, for which the null hypothesis H0 is that the median µ̃ is zero. Given a random variable X, the median µ̃ is defined by

P(X > µ̃) = 0.5 = P(X < µ̃)

In order to test this hypothesis, one may view a series of experiments as a series of trials flipping a coin with probability p = 1/2 for heads and tails. If the result of an experiment is less than zero a minus is recorded, and if it is larger than zero a plus is recorded. It is then straightforward to calculate the p-value, which equals the probability of obtaining the observed result, or a more extreme one, in such a coin-flipping experiment.
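In R the sign test can be carried out with the built-in binomial test (a minimal sketch, not from the original notes; the data vector x is an assumed example):

x = rnorm(20, 0.5)                          # assumed example data, true median 0.5
binom.test(sum(x > 0), length(x), p = 0.5)  # p-value for H0: the median is zero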
6.3 Descriptive statistics
Until now we have studied statistical inference, also called statistical induction, which is the science of drawing conclusions based on data which has some kind of random variation, for instance measurement errors or variation due to random sampling from some population. The aim is usually to be able to describe how accurate the results of an experiment are, or to do hypothesis testing, making it possible to reject a hypothesis at a given significance level. In other cases you might just want to describe your data, for instance if you want to show how the number of processes running on a server at a given time of a weekday is distributed, when sampling data over a year. Or you could want to show the distribution of the individual data points of your measured data. These are not necessarily normally distributed (remember, it is the distribution of the means which tends to be Gaussian) and could be skewed, so that if you just show the average and standard deviation, you will not give a precise picture of your data. In connection with the CLT we have used histograms, which are very useful when describing data; another valuable method is the box plot. The following R code
x = rnorm(100,5,2)
y = rexp(100,0.5)
boxplot(x,y,ylim=c(-3,13),names=c("normal","exponential"))
produces the boxplot of Fig. 2.
The bottom and top of the box are the 25th and 75th percentiles, and the line near the middle of the box is the median. For R boxplots the default value of the range parameter is 1.5, meaning that the whiskers are placed at the most extreme points of the sample, but at most at 1.5 times the interquartile range (= the height of the box) from the box. The boxplot shows roughly how skewed the distribution is; in the figure you can see that the exponential distribution is skewed, as the points below the median lie closer together. It also shows where the central half of the data points is located. The standard deviation, on the other hand, is symmetric and will show no skewness. Additionally, the boxplot shows the extrema, if necessary as outliers; in the case of the exponential sample, the boxplot shows two outliers outside the whiskers. Both the standard deviation and the mean might change a lot with large outliers, while the median is less influenced by such extreme points.
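The numbers behind the plot can be inspected with a built-in helper (a side note, not from the original text; y is the exponential sample from above):

b = boxplot.stats(y)  # uses the same default range coefficient of 1.5
b$stats               # lower whisker, lower quartile, median, upper quartile, upper whisker
b$out                 # the points drawn as outliers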
[Figure: side-by-side boxplots of the two samples, labelled "normal" and "exponential".]
Figure 2: Boxplot of 100 numbers drawn from a normal and an exponential distribution.
6.4 The Student t distribution
The density function of the Student t distribution may be written as

f(x, N−1) = Γ(N/2) / ( √(π(N−1)) · Γ((N−1)/2) ) · (1 + x²/(N−1))^(−N/2)
where the gamma function is given by

Γ(N) = (N−1)!
Γ(N + 1/2) = (2N)!·√π / (4^N·N!)
For large N this function approaches the normal distribution, and

lim (N → ∞) f(x, N−1) = N(x; µ = 0, σ² = 1)
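The density formula can be checked against R's built-in dt (a quick verification sketch, not from the original notes):

f = function(x, N) gamma(N/2) / (sqrt(pi*(N-1))*gamma((N-1)/2)) * (1 + x^2/(N-1))^(-N/2)
f(1.3, 5)   # the formula with N = 5, i.e. df = 4
dt(1.3, 4)  # the same value from R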