Basics - Statistical Methods
Sarah Filippi, University of Oxford
17 October 2014, Michaelmas Term 2014

Two equivalent ways
In the last lecture we saw how to compute p-values for two-sided hypothesis testing problems. In general, given a test statistic T and an observed value tobs, proceed in either of two equivalent ways:
1. Calculate the p-value p = P(|T| > |tobs| | θ = θ0), the probability under the null hypothesis of observing a value of the test statistic at least as extreme as tobs. Reject the null hypothesis at level α if and only if the p-value is less than or equal to α.
2. Set a significance level α for the test and determine the critical value c such that P(|T| > c | θ = θ0) ≤ α. Reject H0 if the observed value of |T| is greater than c.

Normality assumption
Recall that one assumption of the two-sample t-test is the normality of the data. You can check this assumption visually. It is not a serious issue if the sample size is large enough (see the CLT), but what is large enough? Some aspects you should consider are:
• Do the two samples have the same sample size?
• Do the two samples have the same standard deviations and the same shapes?
• If the skewness of the two samples is very different, t-tests can be very misleading at any sample size.
• If the two sample sizes are roughly equal and so is the skewness, t-tests are generally OK.
See p. 61 and Display 3.4 of Ramsey, F.L., Schafer, D.W. (2002, 2013).

Normality assumption
Simulation exercise:
1. Choose a probability distribution, and denote its mean by µ0.
2. Randomly generate a sample y = (y1, ..., yn) of size n.
3. Compute tobs = √n (ȳ − µ0) / s.
4. Repeat steps 1-3 N times.
5. Plot the histogram of the tobs values together with the theoretical distribution T ∼ t(n−1).
In the following, we tried it for Exponential(1), U[0,1] and N(0,1), with n = 5, 20, 50, 500 and N = 25000. You can try other simulation settings. What can you conclude?
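The simulation exercise above can be sketched in a few lines of R. This is a minimal version for the Exponential(1) case only (whose mean is µ0 = 1), with n = 20 and N = 25000 as suggested in the slide:

```r
# Simulate the sampling distribution of the one-sample t statistic
# when the data are Exponential(1), whose true mean is mu0 = 1.
set.seed(1)
n <- 20       # sample size
N <- 25000    # number of replicates
mu0 <- 1      # mean of Exponential(1)

tobs <- replicate(N, {
  y <- rexp(n, rate = 1)
  sqrt(n) * (mean(y) - mu0) / sd(y)
})

# Compare the simulated statistics with the theoretical t(n-1) density
hist(tobs, breaks = 100, freq = FALSE, xlim = c(-5, 5),
     main = "t statistic, Exponential(1) data")
curve(dt(x, df = n - 1), add = TRUE, col = "red", lwd = 2)
```

With skewed data such as Exponential(1) the histogram is visibly asymmetric for small n and only approaches the t(n−1) curve as n grows, which is the point of the exercise.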
[Figures: histograms of the simulated t statistics for Exponential(1), Uniform[0,1] and Gaussian data.]

Equal Variances Assumption
What happens when the variances of the two samples are very different?
• If the two sample sizes are roughly the same, the t-test is OK even if the variances are not the same.
• If the two sample sizes are very different and so are the variances, t-tests are unreliable (for example, n1 = 100, n2 = 400 and σ2/σ1 = 1/4).
See p. 62 and Display 3.5 of Ramsey, F.L., Schafer, D.W. (2002, 2013).

[Figure: example of different variances — normal densities N(0, 1), N(0, 0.25) and N(0, 0.05).]

Comparing two samples with unequal variances
It is possible to obtain a test analogous to the two-sample t-test when the variances of the two samples are different. By default, the command t.test performs the two-sample t-test for samples with unequal variances. If you specify the option var.equal=T, then it performs the two-sample t-test with equal variances.

In R
# chemical experiment
exp1<-c(22,19,35,11,21,10)
exp2<-c(33,11,20,38)
> t.test(exp1,exp2,var.equal=T)
        Two Sample t-test
data:  exp1 and exp2
t = -0.8694, df = 8, p-value = 0.4099
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -21.305463   9.638796
sample estimates:
mean of x mean of y
 19.66667  25.50000

> t.test(exp1,exp2,var.equal=F)
        Welch Two Sample t-test
data:  exp1 and exp2
t = -0.8132, df = 5.166, p-value = 0.452
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -24.09685  12.43018
sample estimates:
mean of x mean of y
 19.66667  25.50000

Test of equal variances - var.test()
Consider two samples with X1 ∼ N(µ1, σ1²) and Y1 ∼ N(µ2, σ2²). We wish to test H0: σ1² = σ2² against H1: σ1² ≠ σ2². Use the test statistic tobs = s1²/s2². Under H0,
    S1²/S2² ∼ F(n1−1, n2−1).
The further the value of the test statistic is from 1, the stronger the evidence against equal variances.

In R
Returning to the chemical example, test H0: σ1 = σ2 against H1: σ1 ≠ σ2.
The test statistic is tobs = s1²/s2² = 0.545, and comparing this with an F(5, 3) distribution gives a p-value of 0.52.

> var.test(exp1,exp2)
        F test to compare two variances
data:  exp1 and exp2
F = 0.5448, num df = 5, denom df = 3, p-value = 0.5157
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.03660187 4.22969952
sample estimates:
ratio of variances
         0.5448124

Independence Assumption
• Do the samples have any cluster effect? (Were the units selected in distinct groups?)
• What about serial effects? (Were the data collected at close time points or locations?)
• In general, it is better not to use t-tests if any of the above is suspected (you will study different statistical methods for these types of problems).

Graphical methods over formal tests of model adequacy
• Tests for normality and tests for equality of variances are available in R.
• But they are often not very robust against their own model assumptions.
• Graphical displays are usually more informative, so try to rely more on graphical displays.
• For example, for a two-sample t-test you can look at histograms, boxplots and probability plots of the two conditions to assess the shape, skewness, variance and normality of the two samples.

Example
A study of the 24-hour total energy expenditure (MJ/day) in groups of lean (n = 13) and obese (n = 9) women (Prentice et al, 1986).
lean<-c(6.13,7.05,7.48,7.48,7.53,7.58,7.90,8.08,8.09,8.11,8.40,10.15,10.88)
obese<-c(8.79,9.19,9.21,9.68,9.69,9.97,11.51,11.85,12.79)

Looking at Histograms
Can we see any difference in shape?
[Figure: histograms of the lean and obese samples.]

Looking at probability plots
What about the normality of the data?
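As a sketch, normal probability plots like the ones referred to here can be produced with qqnorm() and qqline(), using the lean and obese vectors defined above:

```r
lean <- c(6.13,7.05,7.48,7.48,7.53,7.58,7.90,8.08,8.09,8.11,8.40,10.15,10.88)
obese <- c(8.79,9.19,9.21,9.68,9.69,9.97,11.51,11.85,12.79)

# Side-by-side normal Q-Q plots: points lying close to the reference
# line are consistent with normality.
par(mfrow = c(1, 2))
qqnorm(lean,  main = "lean");  qqline(lean)
qqnorm(obese, main = "obese"); qqline(obese)
```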
[Figure: normal Q-Q plots of the lean and obese samples.]
We cannot say much about the distributions, as the sample sizes are very small.

Looking at Boxplots
Is there any difference in the centre of location?
[Figure: side-by-side boxplots of the two samples.]
The plot seems to suggest a difference in the centre of location, but we need to test this formally.

Two-sample t-test in R
> t.test(lean,obese, var.equal=T)
        Two Sample t-test
data:  lean and obese
t = -3.9456, df = 20, p-value = 0.000799
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.411451 -1.051796
sample estimates:
mean of x mean of y
 8.066154 10.297778

> t.test(lean,obese)
        Welch Two Sample t-test
data:  lean and obese
t = -3.8555, df = 15.919, p-value = 0.001411
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.459167 -1.004081
sample estimates:
mean of x mean of y
 8.066154 10.297778

The answer is the same in both cases.

Robustness for one-sample and paired t-tests
Recall that the assumptions are independence and normality of the data.
• If the sample size is small and the distribution of the sample is skewed, the one-sample t-test can give problems.
• If the sample size is large, the one-sample t-test is OK.
• Cluster or serial effects can be a problem.
• The one-sample t-test is sometimes used after a log-transformation of the data (also in the paired case). The confidence intervals then need to be transformed back to the original scale.

Robustness versus Resistance
A statistical procedure is
• Robust with respect to a particular assumption if it is valid even when the assumption is not met.
• Resistant if it does not change very much when a small part of the data changes.
For example, the sample median is a resistant statistic. Since t-tests are based on sample means, they are NOT resistant.
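The back-transformation of a confidence interval after a log-transform, mentioned above, can be sketched as follows. The data vector y here is an illustrative lognormal sample, not data from the lecture:

```r
# One-sample t-test on the log scale, then back-transform the CI.
set.seed(42)
y <- rlnorm(30, meanlog = 1, sdlog = 0.5)  # hypothetical skewed data

res <- t.test(log(y))
ci_log  <- res$conf.int   # CI for the mean of log(y)
ci_orig <- exp(ci_log)    # back-transformed to the original scale
ci_orig
```

Note that exponentiating the endpoints gives an interval for the median (geometric mean) of y on the original scale, not for the mean of y.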
Non-parametric tests
Advantages: valid for data from any distribution.
Disadvantages:
• Parametric tests are more efficient if the data permit.
• Exact versions can be difficult to compute for large samples.
Some popular non-parametric tests:
• Sign test or Wilcoxon signed rank test (for one sample or for paired data)
• Wilcoxon rank sum test (for two independent samples)
The R function wilcox.test can be used to perform the following non-parametric tests:
• Wilcoxon signed rank test
• Wilcoxon rank sum test (also called the Mann-Whitney test)

Tests of Location Zero (Sign test)
• It is a non-parametric alternative to the one-sample t-test or the t-test for paired data.
• Consider pairs of data (X1, Y1), (X2, Y2), ..., (Xn, Yn).
• Assume that the differences Xi − Yi are iid and come from a continuous distribution.
• The null hypothesis is that the true median θ = 0.
• Statistic: the number of differences greater than θ = 0.
• Under the null hypothesis, positive and negative differences are equally likely, so the number of positive values follows a binomial distribution with parameters n and 0.5.
• Hence the p-value can be easily calculated.

Wilcoxon Signed Rank test - wilcox.test()
• It is another non-parametric alternative to the one-sample t-test or the t-test for paired data.
• Consider pairs of data (X1, Y1), (X2, Y2), ..., (Xn, Yn).
• Assume that the differences Xi − Yi are iid and come from a continuous distribution which is symmetric about θ.
• The null hypothesis is that the median difference between the pairs is θ = 0.
• The test statistic is the sum of the ranks of the differences with positive sign.
• Extreme values of this statistic (large or small) indicate a departure from the null hypothesis.
• The p-value can be calculated exactly for small samples using the permutation distribution, whilst for large samples a normal approximation to the sampling distribution is used.
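The binomial calculation behind the sign test can be carried out directly. As a sketch, here it is applied to the IQ data used later in the lecture, with a one-sided alternative of a positive shift:

```r
# Sign test by hand: count positive differences and use Binomial(n, 0.5).
IQbefore <- c(118,121,96,102,93,110,117,131)
IQafter  <- c(110,122,110,104,102,105,115,132)

d <- IQafter - IQbefore
npos <- sum(d > 0)   # number of positive differences (here 5)
n <- sum(d != 0)     # zero differences are dropped (here n = 8)

# One-sided p-value: P(X >= npos) under X ~ Binomial(n, 0.5)
pval <- pbinom(npos - 1, size = n, prob = 0.5, lower.tail = FALSE)
pval                 # 93/256, about 0.363
```

The same p-value can be obtained with binom.test(npos, n, alternative = "greater").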
Example IQ
To test the hypothesis that IQ is not intrinsic, a group of students were instructed to take an IQ test before and after completing an "IQ test training course". Test the null hypothesis that the differences come from a distribution symmetric about θ = 0 against the alternative θ > 0. The data are presented in the table below.

IQ before    118   121    96   102    93   110   117   131
IQ after     110   122   110   104   102   105   115   132
abs(diff)      8     1    14     2     9     5     2     1
sign(diff)     −     +     +     +     +     −     −     +
rank           6   1.5     8   3.5     7     5   3.5   1.5

The test statistic (the sum of the ranks of the positive differences) is 21.5.

In R
> IQbefore<-c(118,121,96,102,93,110,117,131)
> IQafter<-c(110,122,110,104,102,105,115,132)
# IQafter-IQbefore
> wilcox.test(IQafter, IQbefore,paired=T,alternative="greater",exact=F)
        Wilcoxon signed rank test with continuity correction
data:  IQafter and IQbefore
V = 21.5, p-value = 0.3368
alternative hypothesis: true location shift is greater than 0

# Note that we don't want IQbefore-IQafter. It would produce
> wilcox.test(IQbefore, IQafter,paired=T,alternative="greater",exact=F)
        Wilcoxon signed rank test with continuity correction
data:  IQbefore and IQafter
V = 14.5, p-value = 0.7128
alternative hypothesis: true location shift is greater than 0

Wilcoxon Rank Sum test (Mann-Whitney test)
• It is a non-parametric alternative to the two-sample t-test, valid for data from any distribution. It performs better than the two-sample t-test if there are extreme outliers.
• It is a test of identical distributions. In particular, the test tries to detect location shifts between two independent samples.
• If the difference in shape between the two distributions is only a location shift, then the test is very powerful; but if there are other differences in shape, it can lose power.
• It can also be seen as a test of median differences. However, the estimator for the difference in location parameters does not estimate the difference in medians but rather the median of the differences.
• Assumptions:
  • Independence, random samples
  • Populations are continuous

Wilcoxon Rank Sum test
H0: the two distributions differ by a location shift of µ.
H1: the two distributions differ by some other location shift.
• The rank sum statistic is the sum of the ranks of the observations from one of the samples.
• If there are ties, the rank of each tied observation is the average of the ranks of the tied observations.
• If the samples are of size n1 and n2 respectively, then the test statistic is the sum of the ranks of the observations from the first sample [minus n1(n1 + 1)/2].
• For small samples without ties the distribution of the test statistic can be computed exactly and is tabulated; otherwise a large-sample normal approximation is used.

Example
For the chemical data, we have

ordered result    10    11    11    19    20    21    22    33    35    38
experiment         1     1     2     1     2     1     1     2     1     2
rank               1   2.5   2.5     4     5     6     7     8     9    10

Hence the value of the test statistic is 29.5 − 21 = 8.5. The p-value is 0.52, so we do not reject the null hypothesis.

In R
> wilcox.test(exp1,exp2,exact=FALSE)
        Wilcoxon rank sum test with continuity correction
data:  exp1 and exp2
W = 8.5, p-value = 0.5212
alternative hypothesis: true location shift is not equal to 0

> wilcox.test(exp1,exp2,exact=TRUE)
        Wilcoxon rank sum test with continuity correction
data:  exp1 and exp2
W = 8.5, p-value = 0.5212
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(exp1, exp2, exact = TRUE) :
  cannot compute exact p-value with ties

The distribution of the test statistic can be computed via a permutation approach or via a normal approximation. In this case, the exact p-value (based on the permutation distribution of the statistic) cannot be computed because there are ties in the observations.

Permutation tests
A permutation test calculates the p-value as the proportion of re-labellings of the groups which produce a test statistic at least as extreme as that observed.
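This re-labelling idea can be sketched in R for the chemical data, using the difference in sample means as the test statistic. The sketch uses random re-labellings (a Monte Carlo approximation) rather than enumerating all possible re-groupings:

```r
# Approximate permutation test for a difference in means,
# using random re-labellings of the pooled chemical data.
set.seed(1)
exp1 <- c(22,19,35,11,21,10)
exp2 <- c(33,11,20,38)

pooled <- c(exp1, exp2)
n1 <- length(exp1)
t_obs <- mean(exp1) - mean(exp2)        # observed statistic

B <- 10000
t_perm <- replicate(B, {
  idx <- sample(length(pooled), n1)     # random re-labelling of the groups
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided p-value: proportion of re-labellings at least as extreme
pval <- mean(abs(t_perm) >= abs(t_obs))
pval
```

With only 10 observations, full enumeration of all choose(10, 6) = 210 re-groupings would also be feasible; the Monte Carlo version is shown because it scales to larger samples.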
They do not require any distributional assumptions to be made, but exact calculation can be computationally infeasible. A permutation test can be carried out as follows:
1. Calculate the value of the test statistic on the observed sample.
2. Compute the value of the test statistic on all possible re-groupings of the sample.
3. The p-value is the proportion of re-groupings which resulted in a value of the test statistic at least as extreme as that observed in step 1.

Goodness of fit tests
These are used to test whether the data came from some hypothesized distribution. For continuous data one can use the Kolmogorov-Smirnov test (ks.test), the Anderson-Darling test or, for normality, the Shapiro-Wilk test (shapiro.test). However, visual inspection is generally adequate.

Errors in hypothesis testing
Type I error: reject H0 when H0 is true.
Type II error: accept H0 when H0 is false.
So far we have focused on
    P(Type I error) = P(reject H0 | H0 true) = 1 − P(accept H0 | H0 true) = α.
What about
    P(Type II error) = P(accept H0 | H0 false) = β ?
The power of a statistical test is P(reject H0 | H0 false) = 1 − β.

Statistical power
Factors influencing power:
• The statistical significance criterion used in the test (α): there is a trade-off between the Type I error (α) and the Type II error (β).
• Sample size: all other things being equal, the greater the sample size, the greater the power.
• The "true" value of the parameter being tested: the greater the difference between the "true" value and the value specified in the null hypothesis, the greater the power of the test.

Example
Suppose y1, ..., yn is a random sample from Y1, ..., Yn iid with Y1 ∼ N(µ, σ²), σ² known. We want to test H0: µ = µ0 versus H1: µ > µ0.
• Under H0, Z = √n (Ȳ − µ0) / σ ∼ N(0, 1).
• For a given significance level α: reject H0 if zobs > z(1−α).
• Imagine the true mean is µa > µ0:

  Power = P(reject H0 | µa)
        = P( √n (Ȳ − µ0) / σ > z(1−α) | µa )
        = P( √n (Ȳ − µa) / σ > z(1−α) + √n (µ0 − µa) / σ | µa )
        = 1 − Φ( z(1−α) + √n (µ0 − µa) / σ )

Power and sample size
Fix α = 0.05, µ0 = 0, µa = 1 and σ = 1.
[Figure: power of the test as a function of the sample size n, for n from 0 to 50.]

Further reading
For this lecture, the suggested readings are
• Ramsey, F.L., Schafer, D.W. (2002, 2013). The Statistical Sleuth: A Course in Methods of Data Analysis. Duxbury Press. (Chapters 3 and 4)
• Venables, W.N., Ripley, B.D. (2002). Modern Applied Statistics with S. Springer. (Chapter 5)
Check also the use of R: type ?t.test and ?wilcox.test in R to understand their use.
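Finally, the power formula derived above can be evaluated numerically; this is a minimal R sketch of the power-versus-sample-size curve for α = 0.05, µ0 = 0, µa = 1, σ = 1:

```r
# Power of the one-sided z-test as a function of n:
# power(n) = 1 - pnorm(qnorm(1 - alpha) + sqrt(n) * (mu0 - mua) / sigma)
alpha <- 0.05; mu0 <- 0; mua <- 1; sigma <- 1

power <- function(n) {
  1 - pnorm(qnorm(1 - alpha) + sqrt(n) * (mu0 - mua) / sigma)
}

n <- 1:50
plot(n, power(n), type = "l", xlab = "n", ylab = "power")
```

For t-tests (σ unknown), the built-in function power.t.test() performs a related power/sample-size calculation.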