Basics - Statistical Methods
Sarah Filippi, University of Oxford
17 October 2014
Michaelmas Term 2014
Two equivalent ways
In the last lecture we saw how to compute p-values for two-sided hypothesis testing problems. In general, given a test statistic T and an observed value t_obs, proceed in either of two equivalent ways:
1. Calculate the p-value
   p = P(|T| ≥ t_obs | θ = θ0),
   the probability under the null hypothesis of observing a value of the test statistic at least as extreme as t_obs. Reject the null hypothesis at level α if and only if the p-value is less than or equal to α.
2. Set a significance level α for the test and determine the critical value c such that
   P(|T| > c | θ = θ0) ≤ α.
   Reject H0 if the observed value of |T| is greater than c.
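The equivalence of the two approaches can be checked directly in R; the numbers below are purely illustrative (a t statistic with 8 degrees of freedom, not taken from the lecture):

```r
# Illustrative sketch: p-value route vs critical-value route for a
# two-sided t test; tobs and df are made-up values.
tobs  <- 2.1                         # observed test statistic
df    <- 8                           # degrees of freedom
alpha <- 0.05

pval <- 2 * (1 - pt(abs(tobs), df))  # way 1: two-sided p-value
crit <- qt(1 - alpha / 2, df)        # way 2: critical value c

# the two decision rules always agree
c(reject_by_pvalue = pval <= alpha,
  reject_by_critical_value = abs(tobs) > crit)
```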
Normality assumption
Recall that one assumption for the two-sample t-test is the normality of the data. You can check this assumption visually. It is not a serious issue if the sample size is large enough (see CLT). But what is large enough? Some aspects you should consider are:
• Do the two samples have the same sample size?
• Do the two samples have the same standard deviations and the same shapes?
• If the skewness of the two samples is very different, t-tests can be very misleading at any sample size.
• If the two sample sizes are roughly equal, and so is the skewness, t-tests are generally OK.
See p. 61 and Display 3.4 of Ramsey, F.L., Schafer, D.W. (2002, 2013).
Normality assumption
Simulation exercise:
1. Choose a probability distribution, and denote its mean by µ0.
2. Randomly generate a sample y of size n, y = (y1, . . . , yn).
3. Compute
   t_obs = √n (ȳ − µ0) / s.
4. Repeat 1-3, N times.
5. Plot the histogram of t_obs together with the theoretical distribution T ∼ t_{n−1}.
In the following, we tried it for Exponential(1), U[0,1] and N(0,1), with n = 5, 20, 50, 500 and N = 25000. You can try other simulation settings. What can you conclude?
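The simulation exercise above can be sketched in R; here for the Exponential(1) case, whose mean is µ0 = 1, with n = 5:

```r
# Simulated sampling distribution of the one-sample t statistic
# when the data are Exponential(1) rather than normal.
set.seed(42)
n   <- 5
N   <- 25000
mu0 <- 1                               # mean of Exponential(1)

tobs <- replicate(N, {
  y <- rexp(n, rate = 1)               # step 2: generate a sample
  sqrt(n) * (mean(y) - mu0) / sd(y)    # step 3: compute t_obs
})

# step 5: histogram with the theoretical t_{n-1} density overlaid
hist(tobs, breaks = 100, freq = FALSE, xlim = c(-6, 4))
curve(dt(x, df = n - 1), add = TRUE, lwd = 2)
```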
[Figures: histograms of t_obs with the t_{n−1} density overlaid, for Exponential(1), Uniform[0,1] and Gaussian samples.]
Equal Variances Assumption
What happens when the variances of the two samples are very different?
• If the two sample sizes are roughly the same, the t-test is OK even if the variances are not the same.
• If the two sample sizes are very different and so are the variances, t-tests are unreliable (for example, n1 = 100, n2 = 400 and σ2/σ1 = 1/4).
See p. 62 and Display 3.5 of Ramsey, F.L., Schafer, D.W. (2002, 2013).
Example of different variances
[Figure: normal densities N(0, 1), N(0, 0.25) and N(0, 0.05) plotted for x from −4 to 4.]
Comparing two samples with unequal variances
It is possible to obtain a test analogous to the two-sample t-test when the variances of the two samples are different.
By default, the command t.test performs the two-sample t-test for samples with unequal variances (the Welch test).
If you specify the option var.equal=TRUE, it performs the two-sample t-test assuming equal variances.
In R
# chemical experiment
exp1<-c(22,19,35,11,21,10)
exp2<-c(33,11,20,38)
> t.test(exp1,exp2,var.equal=T)
Two Sample t-test
data: exp1 and exp2
t = -0.8694, df = 8, p-value = 0.4099
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-21.305463   9.638796
sample estimates:
mean of x mean of y
19.66667 25.50000
> t.test(exp1,exp2,var.equal=F)
Welch Two Sample t-test
data: exp1 and exp2
t = -0.8132, df = 5.166, p-value = 0.452
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-24.09685 12.43018
sample estimates:
mean of x mean of y
19.66667 25.50000
Test of equal variances - var.test()
Consider two samples with X1 ∼ N(µ1, σ1²) and Y1 ∼ N(µ2, σ2²).
We wish to test H0 : σ1² = σ2² against H1 : σ1² ≠ σ2².
Use the test statistic
   t_obs = s1² / s2².
Under H0,
   S1² / S2² ∼ F_{n1−1, n2−1}.
The further the value of the test statistic is from 1, the stronger the evidence against equal variances.
In R
Returning to the chemical example, test
H0 : σ1² = σ2² against H1 : σ1² ≠ σ2².
The test statistic is t_obs = s1²/s2² = 0.545, and comparing this with an F_{5,3} distribution gives a p-value of 0.52.
> var.test(exp1,exp2)
F test to compare two variances
data: exp1 and exp2
F = 0.5448, num df = 5, denom df = 3, p-value = 0.5157
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.03660187 4.22969952
sample estimates:
ratio of variances
0.5448124
Independence Assumption
• Do the samples have any cluster effect? (Were the units
selected in distinct groups?)
• What about serial effects? (Were the data collected at close
time points or locations?)
• In general it is better not to use t-tests if any of the above is suspected (you will study different statistical methods for these types of problems).
Graphical methods over formal tests of model adequacy
• Tests for normality and tests for equality of variances are available in R.
• But... they are often not very robust against their own model assumptions.
• Graphical displays are usually more informative, so try to rely more on them.
• For example, for a two-sample t-test you can look at histograms, boxplots and probability plots of the two samples to check the shape, skewness, variance and normality assumptions.
Example
A study of the 24-hour total energy expenditure (MJ/day) in a group of lean (n = 13) and obese (n = 9) women (Prentice et al., 1986).
lean<-c(6.13,7.05,7.48,7.48,7.53,7.58,7.90,8.08,8.09,8.11,8.40,10.15,10.88)
obese<-c(8.79,9.19,9.21,9.68,9.69,9.97,11.51,11.85,12.79)
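The graphical checks described above can be produced with a few lines of R:

```r
lean  <- c(6.13,7.05,7.48,7.48,7.53,7.58,7.90,8.08,8.09,8.11,8.40,10.15,10.88)
obese <- c(8.79,9.19,9.21,9.68,9.69,9.97,11.51,11.85,12.79)

par(mfrow = c(1, 2))
hist(lean,  freq = FALSE, main = "lean")     # shape and skewness
hist(obese, freq = FALSE, main = "obese")
qqnorm(lean);  qqline(lean)                  # normality
qqnorm(obese); qqline(obese)
par(mfrow = c(1, 1))
boxplot(lean, obese, names = c("lean", "obese"))  # centre and spread
```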
Looking at Histograms
Can we see any difference in shape?
[Figure: histograms of energy expenditure for the lean and obese samples.]
Looking at probability plots
What about the normality of the data?
[Figure: normal Q-Q plots of the lean and obese samples.]
We cannot say much about the distributions as the sample sizes are very small.
Looking at Boxplots
Is there any difference in the centre of location?
[Figure: side-by-side boxplots of the lean and obese samples.]
The plot seems to suggest a difference in the centre of location, but we need to test this formally.
Two-sample t-test in R
> t.test(lean,obese, var.equal=T)
Two Sample t-test
data: lean and obese
t = -3.9456, df = 20, p-value = 0.000799
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.411451 -1.051796
sample estimates:
mean of x mean of y
8.066154 10.297778
> t.test(lean,obese)
Welch Two Sample t-test
data: lean and obese
t = -3.8555, df = 15.919, p-value = 0.001411
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.459167 -1.004081
sample estimates:
mean of x mean of y
8.066154 10.297778
The conclusion is the same in both cases: there is strong evidence of a difference in means.
Robustness for one-sample and paired t-tests
Recall that the assumptions are independence and normality of the data.
• If the sample size is small and the distribution of the sample is skewed, the one-sample t-test can give problems.
• If the sample size is large, the one-sample t-test is OK.
• Cluster or serial effects can be a problem.
• The one-sample t-test is sometimes used after a log-transformation of the data (also in the paired case). The confidence intervals then need to be transformed back to the original scale.
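A minimal sketch of the back-transformation, with a hypothetical data vector y; note that exponentiating the interval for the mean of log(y) gives an interval for the geometric mean (the population median, under a log-normal model), not for the mean, on the original scale.

```r
# Hypothetical positive, right-skewed data (not from the lecture)
y <- c(1.2, 0.8, 3.5, 2.1, 0.6, 5.2, 1.9, 0.9)

ci_log <- t.test(log(y))$conf.int  # one-sample t interval on the log scale
exp(ci_log)                        # back-transformed to the original scale
```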
Robustness versus Resistance
A statistical procedure is
• Robust with respect to a particular assumption if it is valid even when the assumption is not met.
• Resistant if its value does not change very much when a small part of the data changes.
For example, the sample median is a resistant statistic.
Since t-tests are based on sample means, they are NOT resistant.
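A quick numerical illustration of resistance, with made-up numbers: corrupting a single observation moves the sample mean dramatically but leaves the sample median untouched.

```r
x  <- c(1, 2, 3, 4, 5)
x2 <- c(1, 2, 3, 4, 500)   # one observation corrupted

mean(x);   mean(x2)        # 3 versus 102: the mean is not resistant
median(x); median(x2)      # 3 versus 3: the median is resistant
```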
Non-parametric tests
Advantages:
• valid for data from any distribution
Disadvantages:
• parametric tests are more efficient if the data permit
• exact p-values are difficult to compute for large samples
Some popular non-parametric tests:
• Sign test and Wilcoxon signed rank test (for one sample or for paired data)
• Wilcoxon rank sum test (for two independent samples)
The R function wilcox.test can be used to perform the following non-parametric tests:
• Wilcoxon signed rank test
• Wilcoxon rank sum test (also called the Mann-Whitney test)
Tests of Location Zero (Sign test)
• It is a non-parametric alternative to the one-sample t-test or
the t-test for paired data.
• Consider pairs of data (X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn ).
• Assume that the differences Xi − Yi are iid and come from a
continuous distribution.
• The null hypothesis is that the true median of the differences is θ = 0.
• Test statistic: the number of differences greater than 0.
• Under the null hypothesis, positive and negative differences
are equally likely, so the number of positive values follows a
binomial distribution with parameters n and 0.5.
• Hence the p-value can be easily calculated.
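Since the test statistic is binomial under the null, the sign test can be carried out in R with binom.test; the paired differences below are illustrative.

```r
d <- c(-8, 1, 14, 2, 9, -5, -2, 1)  # illustrative paired differences
k <- sum(d > 0)                     # number of positive differences
n <- sum(d != 0)                    # zero differences are discarded

# Under H0, k ~ Binomial(n, 0.5)
binom.test(k, n, p = 0.5)
```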
Wilcoxon Signed Rank test wilcox.test()
• It is another non-parametric alternative to the one-sample
t-test or the t-test for paired data.
• Consider pairs of data (X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn ).
• Assume that the differences Xi − Yi are iid and come from a
continuous distribution which is symmetric about θ.
• The null hypothesis is that the median difference between the
pairs θ = 0.
• The test statistic is the sum of the ranks for the differences
with positive sign.
• Extreme values of this statistic (large or small) indicate
departure from the null hypothesis.
• The p-value can be calculated exactly for small samples using
the permutation distribution, whilst for large samples a normal
approximation to the sampling distribution is used.
Example IQ
To test the hypothesis that IQ is not intrinsic, a group of students
were instructed to take an IQ test before and after completing an
“IQ test training course”.
Test the null hypothesis that the differences come from a
distribution symmetric about θ = 0 against the alternative θ > 0.
The data are presented in the table below.

IQ before    118   121    96   102    93   110   117   131
IQ after     110   122   110   104   102   105   115   132
abs(diff)      8     1    14     2     9     5     2     1
sign(diff)     −     +     +     +     +     −     −     +
rank           6   1.5     8   3.5     7     5   3.5   1.5

The test statistic (the sum of the ranks of the positive differences) is 1.5 + 8 + 3.5 + 7 + 1.5 = 21.5.
In R
> IQbefore<-c(118,121,96,102,93,110,117,131)
> IQafter<-c(110,122,110,104,102,105,115,132)
# IQafter-IQbefore
> wilcox.test(IQafter, IQbefore,paired=T,alternative="greater",exact=F)
Wilcoxon signed rank test with continuity correction
data: IQafter and IQbefore
V = 21.5, p-value = 0.3368
alternative hypothesis: true location shift is greater than 0
# Note that we don't want IQbefore-IQafter, which would produce:
> wilcox.test(IQbefore, IQafter,paired=T,alternative="greater",exact=F)
Wilcoxon signed rank test with continuity correction
data: IQbefore and IQafter
V = 14.5, p-value = 0.7128
alternative hypothesis: true location shift is greater than 0
Wilcoxon Rank Sum test (Mann-Whitney test)
• It is a non-parametric alternative to the two-sample t-test, valid for data from any distribution. It performs better than the two-sample t-test if there are extreme outliers.
• It is a test of identical distributions. In particular the test tries
to detect location shifts between two independent samples.
• If the difference in shapes of the two distributions is only
given by a location shift, then the test is very powerful, but if
there are other differences in shapes, then it can lose power.
• It can be seen also as a test of median differences. The
estimator for the difference in location parameters does not
estimate the difference in medians but rather the median of
the difference.
• Assumptions:
• Independence, random samples
• Populations are continuous
Wilcoxon Rank Sum test
H0 : the two distributions differ by a location shift of µ.
H1 : the two distributions differ by some other location shift.
• The rank sum statistic is the sum of the ranks of the observations from one of the samples.
• If there are ties, each tied observation receives the average of the ranks of the tied observations.
• If the samples are of size n1 and n2 respectively, then the test statistic is the sum of the ranks of the observations from the first sample [minus n1(n1 + 1)/2].
• For small samples without ties the distribution of the test statistic can be computed exactly and is tabulated; otherwise a large-sample normal approximation is used.
Example
For the chemical data, we have

ordered results   10   11   11   19   20   21   22   33   35   38
experiment         1    1    2    1    2    1    1    2    1    2
rank               1  2.5  2.5    4    5    6    7    8    9   10

The sum of the ranks of the first sample is 1 + 2.5 + 4 + 6 + 7 + 9 = 29.5, hence the value of the test statistic is 29.5 − 21 = 8.5.
The p-value is 0.52, so we do not reject the null hypothesis.
In R
> wilcox.test(exp1,exp2,exact=FALSE)
Wilcoxon rank sum test with continuity correction
data: exp1 and exp2
W = 8.5, p-value = 0.5212
alternative hypothesis: true location shift is not equal to 0
> wilcox.test(exp1,exp2,exact=TRUE)
Wilcoxon rank sum test with continuity correction
data: exp1 and exp2
W = 8.5, p-value = 0.5212
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(exp1, exp2, exact = TRUE) :
cannot compute exact p-value with ties
The distribution of the test statistic can be computed via a permutation approach or via a normal approximation. In this case, the exact p-value (based on the permutation distribution of the statistic) cannot be computed because there are ties in the observations.
Permutation tests
A permutation test calculates the p-value as the proportion of
re-labellings of the groups which produce a test statistic at least as
extreme as that observed. They do not require any distributional
assumptions to be made, but exact calculation can be
computationally infeasible.
A permutation test can be carried out as follows:
1. Calculate the value of the test statistic on the observed sample.
2. Compute the value of the test statistic on all possible re-groupings of the sample.
3. The p-value is the proportion of re-groupings which resulted in a value of the test statistic at least as extreme as that observed in step 1.
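For all but tiny samples, enumerating every re-grouping is infeasible, so in practice one samples re-labellings at random; a sketch for the difference in means of the two chemical samples:

```r
# Approximate (Monte Carlo) permutation test for a difference in means,
# using random re-labellings instead of all possible re-groupings.
set.seed(1)
exp1 <- c(22, 19, 35, 11, 21, 10)
exp2 <- c(33, 11, 20, 38)

pooled <- c(exp1, exp2)
n1     <- length(exp1)
obs    <- mean(exp1) - mean(exp2)        # step 1: observed statistic

perm <- replicate(10000, {               # step 2: re-group at random
  idx <- sample(length(pooled), n1)
  mean(pooled[idx]) - mean(pooled[-idx])
})

mean(abs(perm) >= abs(obs))              # step 3: two-sided p-value
```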
Goodness of fit tests
These are used to test if the data came from some hypothesized
distribution.
For continuous data one can use the Kolmogorov-Smirnov test
(ks.test), the Anderson-Darling test or the Shapiro-Wilk test
(shapiro.test) for normality.
However, visual inspection is generally adequate.
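For example, shapiro.test can be applied directly to one of the energy-expenditure samples (though with n = 13 such tests have little power, which is one reason visual inspection is preferred):

```r
lean <- c(6.13,7.05,7.48,7.48,7.53,7.58,7.90,8.08,8.09,8.11,8.40,10.15,10.88)

shapiro.test(lean)   # Shapiro-Wilk test of normality

# Kolmogorov-Smirnov against a fitted normal; note that estimating the
# parameters from the same data makes the stated p-value only approximate
ks.test(lean, "pnorm", mean(lean), sd(lean))
```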
Errors in hypothesis testing
Type I error: reject H0 when H0 is true.
Type II error: accept H0 when H0 is false.
So far we have focused on
   P(Type I error) = P(reject H0 | H0 true) = 1 − P(accept H0 | H0 true) = α.
What about
   P(Type II error) = P(accept H0 | H0 false) = β ?
Power of a statistical test = P(reject H0 | H0 false) = 1 − β.
Statistical power
Factors influencing power:
• The statistical significance criterion used in the test (α): there is a trade-off between the Type I error rate (α) and the Type II error rate (β).
• The sample size: all other things being equal, the greater the sample size, the greater the power.
• The "true" value of the parameter being tested: the greater the difference between the "true" value and the value specified in the null hypothesis, the greater the power of the test.
Example
Suppose y1, . . . , yn is a random sample from Y1, . . . , Yn iid with Y1 ∼ N(µ, σ²), σ² known.
We want to test H0 : µ = µ0 versus H1 : µ > µ0.
• Under H0, Z = √n(Ȳ − µ0)/σ ∼ N(0, 1).
• For a given significance level α: reject H0 if z_obs > z_{1−α}.
• Imagine the true mean is µa > µ0:

Power = P(reject H0 | µa)
      = P( √n(Ȳ − µ0)/σ > z_{1−α} | µa )
      = P( √n(Ȳ − µa)/σ > z_{1−α} + √n(µ0 − µa)/σ | µa )
      = 1 − Φ( z_{1−α} + √n(µ0 − µa)/σ )
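The final expression can be evaluated directly to see how power grows with n; a sketch with α = 0.05, µ0 = 0, µa = 1 and σ = 1:

```r
alpha <- 0.05
mu0   <- 0
mua   <- 1
sigma <- 1
n     <- 1:50

# Power = 1 - Phi( z_{1-alpha} + sqrt(n) * (mu0 - mua) / sigma )
power <- 1 - pnorm(qnorm(1 - alpha) + sqrt(n) * (mu0 - mua) / sigma)

plot(n, power, type = "l", ylab = "power")
```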
Power and sample size
Fix α = 0.05, µ0 = 0, µa = 1 and σ = 1.
[Figure: power of the test as a function of the sample size n, for n from 0 to 50.]
Further reading
For this lecture, the suggested readings are
• Ramsey, F.L., Schafer, D.W. (2002, 2013). The Statistical Sleuth: A Course in Methods of Data Analysis. Duxbury Press. (Chapters 3 and 4)
• Venables, W.N., Ripley, B.D. (2002). Modern Applied Statistics with S. Springer. (Chapter 5)
Check also the R help pages: type ?t.test and ?wilcox.test in R to understand their use.