MT2004
Olivier GIMENEZ
Telephone: 01334 461827
E-mail: [email protected]
Website: http://www.creem.st-and.ac.uk/olivier/OGimenez.html
9. Distributions derived from normal distributions
In the previous section, we assumed that the variance of the whole
population was known
Unlikely to be the case…
So we need methods that deal with the case where both the mean and
the variance of the whole population are unknown
To develop the theory underlying such methods, we first need to
introduce some other distributions related to the normal
distribution
Namely, the χ², t and F distributions
9.1 χ² distributions
If Z1,…,Zn are independent N(0,1) random variables, then Z1² + … + Zn²
follows a χ² distribution with n degrees of freedom, written χ²n
Upper quantile = value above which some specified proportion of
the area of a p.d.f. lies
The 5% upper quantile of a χ²5 is the value x such that Pr(χ²5 ≥ x) = 0.05, or
alternatively Pr(χ²5 ≤ x) = 0.95, i.e. x is also the lower 95% quantile
Pr(χ²5 ≤ x) = 0.95 (the lower 95% quantile) is obtained using the R
command:
> qchisq(0.95,5) # quantile function (inverse cumulative d.f.)
[1] 11.07050
Example: Suppose that X, Y, and Z are coordinates in 3-dimensional space which
are independently distributed as N(0,1), with all measurements in cm. What is
the probability that the point (X,Y,Z) lies more than 3 cm from the origin?
The squared distance from the origin is X² + Y² + Z², the sum of the squares
of three independent N(0,1) variables, so X² + Y² + Z² ~ χ²3
Hence Pr(distance > 3) = Pr(X² + Y² + Z² > 9) = Pr(χ²3 > 9) ≈ 0.0293
(see the R sketch below)
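This calculation is easy to confirm in R; a minimal sketch, with the Monte Carlo sample size chosen arbitrarily:
> # exact: Pr(chi-squared with 3 d.f. exceeds 9), about 0.0293
> 1 - pchisq(9, 3)
> # Monte Carlo check: simulate standard-normal points in 3 dimensions
> coords = matrix(rnorm(3 * 100000), ncol = 3)
> mean(rowSums(coords^2) > 9)  # proportion of points further than 3 cm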
9.2 The F distributions
If U ~ χ²df1 and V ~ χ²df2 are independent, then (U/df1)/(V/df2) follows an
F distribution with df1 and df2 degrees of freedom, written Fdf1,df2
The 5% upper quantile of an Fdf1,df2 is the value x such that Pr(Fdf1,df2 ≥ x) = 0.05
Use Tables or the R command qf(0.95,df1,df2) (the lower 95% quantile)
So if we have a table with the upper quantiles, we can also get the
lower quantiles as follows.
Remember that:
Upper quantile = value above which some specified proportion
of the area of a p.d.f. lies
Lower quantile = value below which some specified proportion
of the area of a p.d.f. lies
If W ~ Fn,k then 1/W ~ Fk,n, so the upper (1−α) quantile of Fn,k (i.e. its
lower α quantile) is the reciprocal of the upper α quantile of Fk,n
Example: Given that F3,2;0.025 = 39.17, find F2,3;0.975 (i.e. the lower 0.025
= 1 − 0.975 quantile of the F2,3 distribution)
F2,3;0.975 = 1/F3,2;0.025 = 1/39.17 = 0.0255
R commands
> x = seq(0, 6, length = 200)  # grid of values on which to plot
> par(mfrow=c(2,1))
> plot(x,df(x,2,3),xlab="",ylab="",type='l')
> title("pdf F(2,3)")
> plot(x,df(x,3,2),xlab="",ylab="",type='l')
> title("pdf F(3,2)")
9.3 The t distributions
If Z ~ N(0,1) and U ~ χ²n are independent, then Z/√(U/n) follows a t
distribution with n degrees of freedom, written tn
The shape of the p.d.f. of tn depends on n
Looks like a normal distribution, but with more of the probability in
the tails, see the graph for t1 e.g. (top left)
tn;α is the upper α quantile of the t distribution with n degrees of
freedom
Use tables or R, e.g. qt(0.95,8) (= 1.859548) gives the lower
95% quantile of the t distribution with 8 degrees of freedom
(the upper 5% quantile); note that qt(0.95,5000) = 1.645158…,
close to the corresponding N(0,1) quantile
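A one-line illustration in R of how the t quantiles approach the normal quantile as the degrees of freedom grow:
> qt(0.95, c(1, 8, 30, 5000))   # 6.314, 1.860, 1.697, 1.645
> qnorm(0.95)                   # normal limit: 1.644854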
10 Using t distributions
To derive the distribution of the statistic testing hypotheses about
the mean of a normal population with unknown variance, we
need a key result on the joint distribution of the sample mean and
the sample variance
Remember that: if X1,…,Xn are independent N(μ,σ²), then
X̄ ~ N(μ,σ²/n), (n−1)S²/σ² ~ χ²n−1, and X̄ and S² are independent
Combining these results, define
T = (X̄ − μ)/(S/√n)
which is the ratio of (X̄ − μ)/(σ/√n) ~ N(0,1) to √[((n−1)S²/σ²)/(n−1)]:
the σ cancels, so T ~ tn−1
The quantity T depends on the population mean μ but not on the
unknown variance σ².
So this statistic will be useful to test hypotheses about the mean
of normal populations with unknown variance
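A small simulation sketch in R (the true μ and σ are illustrative values) showing that T follows a tn−1 distribution whatever σ² is:
> n = 10; mu = 5; sigma = 3   # illustrative true parameters
> T = replicate(10000, {
+   x = rnorm(n, mu, sigma)
+   (mean(x) - mu)/(sd(x)/sqrt(n))
+ })
> quantile(T, 0.95)   # should be close to the t quantile below
> qt(0.95, n - 1)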
10.2 One-sample t-tests and confidence intervals
One sample t-tests:
39 observations on pulse rates (heart beats/minute) of Indigenous
Peruvians had sample mean 70.31 and sample variance 90.219.
We assume normality.
Question: at the 1% significance level, could this data set be
considered as a random sample from a population with mean 75?
In other words (Step 1 of hypothesis testing strategy):
H0: μ = 75 against H1: μ ≠ 75
Your turn. Perform step 2 (find a ‘good test statistic’) and step 3
(derive its distribution)
Step 2: X̄ − μ0 = ΣXi/n − μ0 is a good candidate since it takes ‘extreme’
values if H1 is true, and moderate values if H0 is true.
Step 3: under H0, T = (X̄ − μ0)/(S/√n) follows a tn−1 distribution
Step 4: it’s a 2-sided test, so we will reject H0 if
tobs ≤ −tn−1;α/2 or tobs ≥ tn−1;α/2 (graphical representation)
If one-sided test, H1: μ < μ0, we reject if tobs ≤ −tn−1;α
If one-sided test, H1: μ > μ0, we reject if tobs ≥ tn−1;α
Here n = 39 and α = 0.01, so tn−1;α/2 = t38;0.005 = qt(0.995,38) = 2.7116
So we will reject if tobs ≥ 2.7116 or if tobs ≤ −2.7116
tobs = (70.31 − 75)/√(90.219/39) = −3.08, so we reject H0
P-value using R:
> 2*pt(tobs,38) # (tobs<0 so need to double the c.d.f. at tobs – 2-sided test)
[1] 0.003799049
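The whole test can be reproduced in R from the summary statistics alone (the variable names are illustrative):
> n = 39; xbar = 70.31; s2 = 90.219; mu0 = 75
> tobs = (xbar - mu0)/sqrt(s2/n)   # about -3.08
> 2*pt(tobs, n - 1)                # two-sided p-value, 0.003799049
> abs(tobs) > qt(0.995, n - 1)     # TRUE, so reject H0 at the 1% level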
Confidence interval:
We’d like to build up a 99% confidence interval for μ; we’re
looking for the values of μ for which we would accept H0
We know that:
Pr(−t38;0.005 ≤ (X̄ − μ)/(S/√39) ≤ t38;0.005) = 0.99
So we would accept any value of μ such that
X̄ − t38;0.005 S/√39 ≤ μ ≤ X̄ + t38;0.005 S/√39
i.e. 70.31 ± 2.7116 × √(90.219/39), giving the interval (66.19, 74.43)
75 is outside the confidence interval, so we would reject H0 at the 1% significance level
With R, a 95% confidence interval is obtained as follows:
> ciu = 70.31 + qt(0.975,38)*sqrt(90.219/39)
> cil = 70.31 - qt(0.975,38)*sqrt(90.219/39)
> c(cil,ciu)
[1] 67.23099 73.38901
And the 99% confidence interval is obtained as
> c(70.31 - qt(0.995,38)*sqrt(90.219/39), 70.31 + qt(0.995,38)*sqrt(90.219/39))
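Note that if the raw data were available as a vector (say pulse, a hypothetical name), R would produce the test, the p-value and the confidence interval in one call:
> t.test(pulse, mu = 75, conf.level = 0.99)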
10.3 Paired t-tests
Consider two samples of observations (Xi,Yi)
Consider the case: the two measurements (Xi,Yi) are made on the
same unit i
We wish to test if the two population means are equal
Example: measurement of left and right wing length of birds
Should not be treated as independent!!!!!
Obviously, length of left wing and length of right wing both tend to
be large for large birds: dependent measurements
Idea: work with the differences between the two measurements on
each unit, i.e. Xi−Yi, in order to go back to a one-sample t-test
Example: corneal thickness in microns for both eyes of patients
who have glaucoma in one eye
Glaucoma 488 478 480 426 440 410 458 460
Healthy  484 478 492 444 436 398 464 476
Obviously, the corneal thickness is likely to be similar in the two
eyes of any patient – dependent observations
Consider di = glaucomai – healthyi. We will assume that this new
random sample is drawn from a normal distribution N(μd,σ²), and
we wish to test: H0: μd = 0 vs H1: μd ≠ 0
Σdi = −32 ; Σdi² = 936, so d̄ = −32/8 = −4 and
s² = (936 − (−32)²/8)/7 = 808/7 = 115.43
tobs = d̄/(s/√8) = −4/√(115.43/8) = −1.053, and t7;0.025 = 2.3646 (see Tables)
tobs > −t7;0.025 and tobs < t7;0.025, meaning that tobs is in the region of
acceptance of H0
At the 5% significance level, we fail to reject H0, so there is
apparently no difference between the good eye and the diseased eye
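The same analysis with R's built-in paired test, entering the data from the table above:
> glaucoma = c(488, 478, 480, 426, 440, 410, 458, 460)
> healthy = c(484, 478, 492, 444, 436, 398, 464, 476)
> t.test(glaucoma, healthy, paired = TRUE)  # tobs about -1.05, p-value about 0.33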
10.4 Two-sample t-tests
Now, we want to deal with two sets of data and compare, e.g., their
means
We consider that the two random samples are drawn from normal
distributions with unknown but equal variances.
More formally: X1,…,Xn i.i.d. N(μX,σ²), independent of Y1,…,Ym i.i.d. N(μY,σ²)
We know that the distributions of the sample means of the two
samples are:
X̄ ~ N(μX, σ²/n) and Ȳ ~ N(μY, σ²/m)
so that (using results on sums of normal r.v.’s)
X̄ − Ȳ ~ N(μX − μY, σ²(1/n + 1/m))
As usual, we’d like to relate this distribution to a standard normal
random variable…
We have that:
Z = [(X̄ − Ȳ) − (μX − μY)] / [σ√(1/n + 1/m)] ~ N(0,1)
Obviously, if we assume that σ is known, we can test hypotheses
about the difference in means between the two groups (see the
one-sample case – z-test).
But we assume that σ is unknown. So we need to do again what
we’ve done for the t-test (one-sample test about the mean with
unknown variance).
More precisely, first find the distribution of:
[(n−1)S²X + (m−1)S²Y] / σ²
We note that:
(n−1)S²X/σ² ~ χ²n−1, where S²X = Σ(Xi − X̄)²/(n−1)
Similarly, we have that:
(m−1)S²Y/σ² ~ χ²m−1, where S²Y = Σ(Yj − Ȳ)²/(m−1)
Putting the two latter results together, we have that, using the
additivity of χ² r.v.’s:
[(n−1)S²X + (m−1)S²Y] / σ² ~ χ²n+m−2
Note that the above quantity can be written as (n+m−2)S²p/σ², where:
S²p = [(n−1)S²X + (m−1)S²Y] / (n+m−2)
is called the pooled sample variance.
Remember that we have:
[(X̄ − Ȳ) − (μX − μY)] / [σ√(1/n + 1/m)] ~ N(0,1)
So let the test statistic T be
T = [(X̄ − Ȳ) − (μX − μY)] / [Sp√(1/n + 1/m)]
which is actually the ratio of the following distributions:
a N(0,1) over an independent √[χ²n+m−2/(n+m−2)],
i.e. a t distribution with n+m−2 degrees of freedom!
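A simulation sketch in R (all settings illustrative) confirming the tn+m−2 distribution of T under equal variances:
> n = 5; m = 4; sigma = 2   # illustrative settings; here muX = muY = 0
> T = replicate(10000, {
+   x = rnorm(n, 0, sigma); y = rnorm(m, 0, sigma)
+   sp = sqrt(((n-1)*var(x) + (m-1)*var(y))/(n+m-2))
+   (mean(x) - mean(y))/(sp*sqrt(1/n + 1/m))
+ })
> quantile(T, 0.975)      # close to the t quantile below
> qt(0.975, n + m - 2)    # 2.3646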
Now we can see that T can be re-written as follows:
T = [(X̄ − Ȳ) − (μX − μY)] / √[S²p(1/n + 1/m)]
The quantity T depends on the population means μX and μY but not
on the unknown variance σ².
This statistic is thus useful to test hypotheses about the difference in
means between the 2 populations.
Example: Consider two random samples from 2 normal
distributions:
x = 11 10 14 12 13 and y = 8 3 4 9
Test the hypothesis that the two population means are equal against
the alternative hypothesis that they are not.
We wish to test H0: μX = μY against H1: μX ≠ μY
x̄ = Σxi/n = 12, ȳ = Σyj/m = 6, and s²p = (10 + 26)/7 = 36/7
tobs = (12 − 6)/√[(36/7)(1/5 + 1/4)] = 3.944 > t7;0.025 = 2.3646
There is evidence to reject H0 at the 5% significance level.
In other words, the two population means are different
Using R:
> x=c(11,10,14,12,13)
> y=c(8,3,4,9)
> # pooled standard deviation:
> pooledsd=sqrt(((5-1)*var(x)+(4-1)*var(y))/(5+4-2))
> # observed value of the test statistic:
> tobs=(mean(x)-mean(y))/(pooledsd*sqrt(1/5+1/4))
> tobs
[1] 3.944053
> # p-value of the 2-sided test
> 2*(1-pt(tobs,5+4-2))
[1] 0.005574311
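R's built-in two-sample test gives the same result in one call; var.equal = TRUE requests the pooled-variance version used here:
> t.test(x, y, var.equal = TRUE)   # same tobs = 3.944 and p-value = 0.0056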
10.5 Testing equality of variances
Motivation: to apply the two-sample t-test of Section 10.4, we need
to check that the two samples come from normal distributions with
the same variance
Consider X1,…,Xn and Y1,…,Ym two random samples drawn from
normal distributions. We also assume independence.
Let σ²X and σ²Y be the population variances of the two random
samples.
Remember the strategy of hypothesis testing:
Step 1: We wish to test H0: σ²X = σ²Y vs H1: σ²X ≠ σ²Y
Step 2: We need to find a ‘good’ test statistic, i.e. a function of the
data that takes ‘extreme’ values if H1 is true, and moderate values if
H0 is true.
We’ve seen that:
(n−1)S²X/σ²X ~ χ²n−1 and (m−1)S²Y/σ²Y ~ χ²m−1
So what about the ratio S²X/S²Y?
If you work it out a little bit, you get, under H0: σ²X = σ²Y = σ², the
following test statistic: S²X/S²Y
Under the null hypothesis the terms involving σ² cancel.
If the alternative hypothesis is true, i.e. if σ²X ≠ σ²Y, then the value
of the test statistic above will be small or large depending on
whether σ²X < σ²Y or σ²X > σ²Y.
Step 3: Now we need the distribution of this test statistic under H0.
By definition of an F distribution, we have that:
[χ²n−1/(n−1)] / [χ²m−1/(m−1)] ~ Fn−1,m−1
that is:
S²X/S²Y ~ Fn−1,m−1 under H0
or S²Y/S²X ~ Fm−1,n−1,
using the main property of F distributions.
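A quick simulation sketch in R (sample sizes illustrative) of the null distribution of the variance ratio:
> n = 11; m = 16
> ratios = replicate(10000, var(rnorm(n))/var(rnorm(m)))
> quantile(ratios, 0.975)   # close to the F quantile below
> qf(0.975, n - 1, m - 1)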
Step 4: We will reject the null hypothesis if the observed value of
this test statistic is greater than the upper quantile of the appropriate
F distribution (using the Tables or program R).
Note that it is enough to compare the larger of the two test statistics
described above with the upper quantile of the appropriate
distribution.
Example: consider two samples, one of size 11 and the other of
size 16, from two normal distributions. The sample variance of the
first is 20 and the sample variance of the second is 30. At the 5%
level, is there evidence to reject the hypothesis that the two
populations have the same variance? Note that F15,10;0.025 = 3.522
1) We wish to test H0: σ²X = σ²Y vs H1: σ²X ≠ σ²Y, where X has sample
size 11 and Y has sample size 16, with respectively s²X = 20 and
s²Y = 30. This is a test of equality of variances.
2) To perform it, we calculate the observed value of the test
statistic (the larger one): fobs = s²Y/s²X = 30/20 = 1.5
3) We need to compare this observed value to the 2.5% upper
quantile of an F distribution with 15 and 10 degrees of freedom,
i.e. F15,10;0.025, which is equal to 3.522
4) fobs = 1.5 < F15,10;0.025 = 3.522
5) So there is no evidence to reject the null hypothesis. We fail to
reject the equality of variances.
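With only the summary statistics, the comparison in steps 2–5 can be done directly in R (with raw data, var.test(x, y) would do the same job):
> fobs = 30/20
> fobs > qf(0.975, 15, 10)      # FALSE: no evidence against H0
> 2*(1 - pf(fobs, 15, 10))      # two-sided p-value, well above 0.05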
Note: We might now consider testing whether the two population
means are different.