Hypothesis Testing - University of Strathclyde
Hypothesis Testing
David Young
Department of Statistics and Modelling Science, University of Strathclyde
Royal Hospital for Sick Children, Yorkhill NHS Trust

2 Statistics and Probability
• statistical analysis considers the probability of an event being due to chance
• can never be 100% certain, for example, that one treatment is better than another
• can say mathematically how sure we are that a result is true

3 Hypothesis Testing
• a statistical test is designed to ‘prove’ a hypothesis held by the researcher
• it starts by assuming the contrary view to the researcher’s and only comes down in support of the researcher’s hypothesis if the data are sufficiently unlikely to have been generated under the contrary view
• the ‘contrary view’ is known as the null hypothesis
• the research hypothesis of interest is called the alternative hypothesis

4 Probability
• in statistical testing, it is impossible to ‘prove’ a hypothesis beyond all reasonable doubt
• decision processes must be able to deal with the problems of uncertainty
• modelling of uncertainty is impossible with standard mathematical tools, and a whole branch of mathematics called Probability Theory has been developed to deal with it
• most people have a good grasp of probability through card games, board games, betting odds, etc.

5 Probability Theory
• suppose that the proportion (p) of defective items in a large batch is 0.1
• in a sample of size 100 taken from this batch, we would expect to get (100 × 0.1) = 10 defective items
• a single sample may contain any number of defective items ‘close’ to 10
• e.g.
samples may have 8, 11, 9, 10 or 12 defectives
• probability theory enables us to calculate the probability or chance of getting a given number of defectives

6 Hypothesis Testing
• statistical inference is the procedure whereby inferences about a population are made on the basis of the results obtained from a sample drawn from that population
• inference may be divided into two categories …
  • estimation
  • hypothesis testing
• basically, hypothesis testing is a test of the validity of some claim or theory about a population, e.g. students have debts of >£4000 upon graduating, aspirin is a more effective pain-killer than paracetamol, a new HIV medication delays the onset of AIDS, etc.

7 Comparing Two Samples of Data
• there are several factors which affect the choice of statistical hypothesis test
• in comparing two sample means, the procedure depends on whether the data are paired (as in a cross-over experiment or when comparing ‘before’ and ‘after’ measurements)
• it depends on whether the data are quantitative or qualitative
• it also depends on the distribution of the sample data (are the data normal?)

8 Checking the Assumption of Normality
• the simplest way to check the normality assumption for a variable is by plotting a histogram and assessing visually if the distribution is bell-shaped
• normality tests are available with most statistical packages
• e.g. in MINITAB the normality test generates a normal probability plot and performs a hypothesis test to examine whether or not the observations follow a normal distribution
• for data which are normally distributed, parametric tests can be applied

9 Distribution Free Tests
• occasionally it will not be possible to make this assumption, e.g.
when the data are clearly skewed or there are too few data points to determine the approximate distribution
• a group of tests has been devised for which no assumptions are made about the distribution of the observations – these are called distribution-free tests
• since distributions are compared without the use of parameters, they can be referred to as non-parametric tests

10 Comparing Unpaired Samples
• in a sense we wish to compare the ‘average’ values for the two underlying populations, e.g. does the average blood pressure differ in two groups treated with a different drug?
• if the samples are normally distributed, use a t-test and the corresponding confidence intervals to compare the means

11 Example: RCT
Old Treatment   New Treatment   P-value
71              70              0.921
71              68              0.893
71              62              0.538
71              53              0.376
71              42              0.112
71              29              0.032

12 Additional Points
• Errors in hypothesis testing – p<0.01!
• Null and alternative hypotheses
• Cranberry juice – randomisation http://www.ncbi.nlm.nih.gov/pubmed/22961092

13 Additional Points
• Double blind studies http://www.theguardian.com/society/2005/jan/17/health.medicineandhealth
• Placebo trials
• Comparison of baseline characteristics
• Intention-to-treat and per-protocol – weight loss example
• Tests for correlation, regression and normality testing

14 Example
• Comparison of transit times (hours) using two different bran preparations ...
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1410956/
• Bran preparation A: 44 51 52 55 60 62 66 68 69 71 71 76 82 91 108
• Bran preparation B: 52 64 68 77 79 83 84 88 95 97 101 116
• null hypothesis – no difference in transit times for A and B
• alternative hypothesis – some difference in transit times

15 Descriptive Statistics
Descriptive Statistics: Bran A, Bran B
Variable   Bran A   Bran B
N          15       12
Mean       68.40    83.67
StDev      16.47    17.51
Minimum    44.00    52.00
Q1         55.00    70.25
Median     68.00    83.50
Q3         76.00    96.50
Maximum    108.00   116.00

16 Histograms
(figure not reproduced)

17 The P-value
• the p-value is the probability of getting data at least as extreme as those actually observed in the experiment if the null hypothesis were true
• the lower the p-value, the more evidence there is against the null hypothesis (i.e. in favour of the study hypothesis)
• the conventional cut-off for significance is p<0.05

18 Two Sample T-test
Two Sample T-Test and Confidence Interval
Two sample T for Bran A vs Bran B
          N    Mean   StDev   SE Mean
Bran A    15   68.4   16.5    4.3
Bran B    12   83.7   17.5    5.1
95% CI for mean A - mean B: (-28.9, -1.6)
T-Test mean A = mean B (vs not =): T = -2.31  P = 0.030  DF = 23

19 Interpretation
• the p-value from the t-test comparing the transit times in both groups is 0.03
• since this is less than 0.05, reject the null hypothesis and accept the alternative
• conclude that there is a significant difference between the two groups
• conclusion – the transit time for Bran A is significantly lower than it is for Bran B

20 Choice of Test
• the choice of statistical test to use depends mainly on two things …
  – the type of data (categorical or numerical)
  – the distribution of the data (normal or non-normal)
• if the data are normally distributed, parametric tests are used
• if the data are not normally distributed, non-parametric tests are appropriate

21 Tests for comparing two group means
• if the data are quantitative (i.e.
numerical) and normally distributed, use a t-test (sometimes referred to as a two-sample t-test)
• this is known as a parametric test
• if the data are quantitative and not normally distributed, the appropriate test is a Mann-Whitney test
• this is a non-parametric test
• for qualitative data, non-parametric tests are generally used

22 Non-normal data
• if the data are not normally distributed, either look for a transformation which does normalise the distributions (e.g. log, square root) or use a Mann-Whitney test (the non-parametric equivalent to the t-test)
• using a transformation is more sensitive but might lead to results, and particularly confidence intervals, which are difficult to interpret
• using a non-parametric test is less efficient but does lead to an easily interpretable confidence interval for the difference between two medians
• if sample sizes are too small to determine if the distribution is normal, use the non-parametric approach

23 Qualitative Data
• this involves comparing the proportion of cases who have a certain characteristic of interest in the two groups, e.g. do the proportions of cases suffering from a breast cancer recurrence differ for pre- and post-menopausal women?
• with decent sample sizes, use a chi-squared test along with a confidence interval for the difference or ratio of the two proportions

24 Obesity and breast-feeding
• Does Breastfeeding Help to Reduce the Risk of Childhood Overweight and Obesity?
• http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4374721/
• Results: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4374721/table/pone.0122534.t001/
• Of 5650 breast-fed children, 658 (11.6%) were overweight vs.
1304/7513 (17.4%) of those not breast-fed

25 Results
• Use Stat > Basic Statistics > 2 Proportions to get:
Test and CI for Two Proportions
Sample   X      N      Sample p
1        658    5650   0.116460
2        1304   7513   0.173566
Difference = p (1) - p (2)
Estimate for difference: -0.0571056
95% CI for difference: (-0.0690766, -0.0451347)
Test for difference = 0 (vs ≠ 0): Z = -9.35  P-Value = 0.000

26 Comparing Paired Samples
• the same issues must be addressed when deciding upon the analysis method for a given set of paired data
• problem types are essentially the same, only in this case the same individual has been measured twice
• before, we made assumptions about the distributions in the separate groups, whereas here the assumptions relate to the within-individual differences

27 Quantitative Data
• if the differences between the two samples follow a normal distribution (possibly after transformation), then use a paired t-test and compute a confidence interval to compare the two means
• if the differences are not normal, then use a Wilcoxon signed rank test (the non-parametric equivalent of the paired t-test)

28 Example
• data below show two measurements of pulse rates in 15 patients
• each measurement was made by the same observer, under the same circumstances, one minute apart
• objective of gathering these data was to determine if the 30 second pulse rates were the same both times
• since data are paired, appropriate test is the paired t-test
Pulse 1: 46 50 39 40 41 35 31 43 47 48 32 36 37 34 38
Pulse 2: 44 29 36 43 43 37 43 43 48 40 45 42 35 28 42

29 Stat > Basic Statistics > Paired t …
Paired T-Test and CI: Pulse 1, Pulse 2
Paired T for Pulse 1 - Pulse 2
          Pulse 1   Pulse 2   Difference
N         15        15        15
Mean      39.80     39.87     -0.07
StDev     5.94      5.76      8.20
SE Mean   1.53      1.49      2.12
95% CI for mean difference: (-4.61, 4.47)
T-Test of mean difference = 0 (vs not = 0): T-Value = -0.03  P-Value = 0.975

30 Conclusion
• Paired t-test was performed since the differences were normally distributed
• p-value from the test was
0.975
• this is not significant, therefore do not reject the null hypothesis
• conclude that there is no evidence to suggest that there is a significant difference in the average pulse rates on the two occasions
• methodology applies to cross-over trials

31 Summary
• the set-up for a hypothesis test is always the same …
  • determine the null and alternative hypotheses
  • choose the appropriate test based on the type and distribution of the data
  • if the p-value is less than 0.05, reject the null hypothesis and conclude that there is evidence to support the alternative hypothesis
  • if the p-value is not significant (i.e. >0.05), conclude there is no evidence to reject the null hypothesis

32 Errors in Statistical Tests
• Type I Error: a false positive result – the study finds a significant difference but that difference does not really exist (i.e. reject the null hypothesis when it is true)
• Type II Error: a false negative result – the study finds no significant difference between groups which are in fact different (i.e. accept the null hypothesis when it is false)

33 Errors in Statistical Tests
• the conventional cut-off for significance is p<0.05
• i.e. accept a 1 in 20 chance that a Type I error may occur
• a 5% chance of finding a significant result which does not really exist every time a statistical test is carried out
• may sometimes want to set a more stringent cut-off (e.g. p<0.01 if testing the effect of a very toxic therapy)

34 Confidence Intervals
• the sample mean is only an estimate of the population mean
• estimates depend on the sample from which they are calculated
• a range of plausible values of the mean can be computed
• this gives an interval in which we can be relatively sure the true population parameter value lies
• these intervals are known as confidence intervals

35 Example (cont.)
• part of the computer output from the t-test for the bran example gave the 95% confidence interval for the mean difference in transit times:
95% CI for mean A - mean B: (-28.9, -1.6)
• therefore we can be 95% sure that the true population mean difference in transit time between these two bran preparations lies within this interval
• i.e. we can be 95% confident that the mean transit time with bran A is between 1.6 and 28.9 hours shorter than with bran B

36 Example
• Does playing music to dairy cattle increase their milk production?
• An experiment was conducted where a group of dairy cattle was divided into two groups. Music was played to one group; the control group did not have music played. The average increase in production was 2.5 l/cow over the time period in question.
• A 95% confidence interval for the difference (treatment - control) in the mean production was computed to be (1.5, 3.5) l/cow.
• What does this mean?
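
Worked check (a sketch, not part of the original slides)
The MINITAB figures quoted for the bran and breast-feeding examples can be checked by hand. The Python sketch below uses only the standard library and rests on some assumptions: that the two-sample t-test did not pool the variances (Welch's form, which matches the reported DF = 23), that the two-proportion test used the unpooled standard error, and that the t critical value 2.069 (23 degrees of freedom) is taken from standard tables rather than computed. The function names are illustrative only.

```python
import math
import statistics

# Transit times (hours) copied from the bran example in the slides.
bran_a = [44, 51, 52, 55, 60, 62, 66, 68, 69, 71, 71, 76, 82, 91, 108]
bran_b = [52, 64, 68, 77, 79, 83, 84, 88, 95, 97, 101, 116]


def welch_t(x, y):
    """Return (t statistic, standard error, Welch df) for mean(x) - mean(y)."""
    vx = statistics.variance(x) / len(x)  # squared standard error of mean(x)
    vy = statistics.variance(y) / len(y)  # squared standard error of mean(y)
    se = math.sqrt(vx + vy)
    t = (statistics.mean(x) - statistics.mean(y)) / se
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, se, df


def two_proportion_z(x1, n1, x2, n2):
    """Return (difference, z, two-sided p-value) using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return p1 - p2, z, p_value


# Bran example: t statistic, degrees of freedom and 95% confidence interval.
t_stat, se_diff, df = welch_t(bran_a, bran_b)
T_CRIT = 2.069  # assumption: 97.5th percentile of t with 23 df, from tables
mean_diff = statistics.mean(bran_a) - statistics.mean(bran_b)
ci = (mean_diff - T_CRIT * se_diff, mean_diff + T_CRIT * se_diff)
print(f"T = {t_stat:.2f}, DF = {math.floor(df)}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")

# Breast-feeding example: 658/5650 overweight vs 1304/7513.
prop_diff, z_stat, p_value = two_proportion_z(658, 5650, 1304, 7513)
print(f"difference = {prop_diff:.4f}, Z = {z_stat:.2f}, P = {p_value:.3g}")
```

The p-value for the t-test is not computed here because the standard library has no t-distribution; a statistics package would be needed for that step.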