Unit 10: Statistics for experimental design

10.3 Making statistical decisions using significance testing

In this topic guide you will find out about the different types of test used in hypothesis testing and how decisions can be made from these. You will see the mathematical techniques used to carry out one and two sample tests on a given population sample, and find out how to use multiple sample tests in experimental design. You will also look at the use of categorical data to test hypotheses.

On successful completion of this topic you will:
• be able to make statistical decisions using significance testing (LO3).

To achieve a Pass in this unit you need to show that you can:
• carry out one and two sample tests on a given population sample (3.1)
• use multiple sample tests in experimental design (3.2)
• use categorical data to test hypotheses (3.3).

1 Tests of significance

A statistical hypothesis test is used to make decisions using data from a scientific study. A result is called statistically significant if it is unlikely to have occurred by chance alone, according to a pre-determined threshold probability, the significance level. Tests of significance are used to determine which outcomes of a study would lead to a rejection of the null hypothesis at a pre-specified level of significance. This tells us whether the results contain enough information to cast doubt on conventional wisdom, given that conventional wisdom has been used to establish the null hypothesis.

The critical region of a hypothesis test is the set of outcomes that causes the rejection of the null hypothesis in favour of the alternative hypothesis.

A one sample test, also known as a goodness of fit test, shows whether the collected data are useful in making a prediction about the population. A two sample test, also called a test of independence, is used to determine whether one variable is dependent on another.

z-tests

The z-score describes a value in terms of standard deviations from the mean. A z-score of 2 is equivalent to saying 'two standard deviations from the mean'; a z-score of 1.5 means 1.5 standard deviations from the mean; a z-score of 0 means the value is equal to the mean (zero standard deviations away). In a one-tailed test, deviations on only one known side of the mean (either above or below) are considered; in a two-tailed test, deviations on both sides of the mean are considered.

The standard error of the mean tells us how much the sample mean varies from sample to sample (it is the standard deviation of the sample mean for a given sample size). The empirical rule tells us that 95% of the time the sample mean will fall within two standard errors of the population mean.

Take it further
Z-score online calculator: http://www.danielsoper.com/statcalc3/calc.aspx?id=22
Z-score to percentile calculator: http://www.measuringusability.com/pcalcz.php
Further information about the applications of z-scores can be found at the following link: https://statistics.laerd.com/statistical-guides/standard-score.php

The z-score for a value x, given a mean µ and a standard deviation σ, is:

$$z = \frac{x - \mu}{\sigma}$$

To find the probability associated with this z-score we look the z-value up in a table of normal probabilities or use an online calculator. This is necessary because the probability corresponds to an area under the bell curve of the normal distribution, and this integral has no closed-form solution, so pre-computed tables (or numerical methods) are used. A common use is to find at what percentile a certain value sits in the population, as the sketch below shows.
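As a quick check on a table lookup, the standard normal cumulative distribution function can be evaluated directly. The following minimal sketch uses Python with scipy (an assumption of these examples, not part of the guide); the mean, standard deviation and data value are invented illustrative numbers.

```python
from scipy.stats import norm

# Illustrative values (not from the guide): population mean and standard deviation
mu, sigma = 100.0, 15.0
x = 130.0  # the value whose percentile we want

z = (x - mu) / sigma       # z-score: distance from the mean in standard deviations
percentile = norm.cdf(z)   # area under the standard normal curve to the left of z

print(f"z = {z:.2f}")                    # z = 2.00
print(f"percentile = {percentile:.4f}")  # ~0.9772, i.e. about the 98th percentile
```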
If, instead of having a single datum, we have a sample data set of size n, the same idea can be extended to find the probability of observing any given sample mean, using a statistical test called the one sample z-test:

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

t-testing distributions

Taking many samples from the same population and calculating the mean of each sample gives a new distribution that has the same mean as the population distribution but less variance. The width of this distribution depends on the sizes of the samples taken: the means of large samples are less likely to fall in the tails of the distribution of the population as a whole than are the means of small samples.

'Student' developed a way of analysing the distribution of the deviations of the sample means from the population mean relative to the standard error (see the case study below). This t-statistic gives the probability of finding a difference between a sample mean and the true population mean greater than any chosen t.

Standard error of a sample:

$$s_{\bar{x}} = \sqrt{\frac{s^2}{n}}$$

t-statistic of a sample:

$$t = \frac{\bar{x} - \mu}{s_{\bar{x}}}$$

Case study: Guinness and the 'Student' t-test
The t-statistic was developed by William Sealy Gosset, a chemist working for the Guinness brewery in Dublin. He devised the t-test to monitor the quality of stout. The test was published in Biometrika in 1908, but Gosset was forced to use a pen name (he chose 'Student') by Guinness, which regarded the fact that they were using statistics as a trade secret.

Sign tests

In statistics, the sign test can be used to test the hypothesis that there is 'no difference in medians' between the distributions of two random variables X and Y. It requires that we can draw paired samples from X and Y. It is a non-parametric test, making few assumptions about the distributions being tested. In this test the null hypothesis is that the median of the differences X − Y is zero. For each pair, we calculate the sign of X − Y; if the null hypothesis is true, we expect the numbers of positive and negative signs to follow a binomial distribution centred at equal proportions (p = 0.5).

Independent two sample t-test

The two sample t-test is useful for comparing the means of two samples. The exact formula depends on whether the sample sizes and variances are the same or different in the two samples. For equal sample sizes n with equal variance, the t-statistic used to test whether the means are different is:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_{x_1 x_2}\sqrt{2/n}}$$

where the following is an estimator of the common standard deviation of the two samples:

$$s_{x_1 x_2} = \sqrt{\tfrac{1}{2}\left[(s_{x_1})^2 + (s_{x_2})^2\right]}$$

If the two sample sizes are unequal, then the t-statistic is:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_{x_1 x_2}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

$$s_{x_1 x_2} = \sqrt{\frac{(n_1 - 1)(s_{x_1})^2 + (n_2 - 1)(s_{x_2})^2}{n_1 + n_2 - 2}}$$

For significance testing, the number of degrees of freedom for this test is n₁ + n₂ − 2 (equal to 2n − 2 when both groups contain n participants).

Dependent t-test for paired samples

This test can be used when the samples are dependent; that is, when there is only one sample that has been tested twice (repeated measures) or when there are two samples that have been matched or 'paired'. This is a paired difference test:

$$t = \frac{\bar{x}_d - \mu_0}{s_d / \sqrt{n}}$$

For this equation, the differences between all pairs must be calculated: x̄_d and s_d are the mean and standard deviation of those differences, and µ₀ is the hypothesised mean difference (usually zero).
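All of the t-tests above are available in scipy.stats. The sketch below is illustrative only: the two small samples are invented numbers, and each call returns a t-statistic together with a two-tailed p-value.

```python
import numpy as np
from scipy import stats

# Invented example data: measurements from two groups of equal size
group1 = np.array([5.1, 4.9, 5.6, 5.2, 4.8, 5.4])
group2 = np.array([4.4, 4.7, 4.1, 4.5, 4.9, 4.3])

# One sample t-test: is the mean of group1 different from a hypothesised mean of 5.0?
t1, p1 = stats.ttest_1samp(group1, popmean=5.0)

# Independent two sample t-test (equal variances assumed, as in the pooled formula above)
t2, p2 = stats.ttest_ind(group1, group2, equal_var=True)

# Dependent (paired) t-test: treats the arrays as repeated measures on the same units
t3, p3 = stats.ttest_rel(group1, group2)

print(f"one sample:  t = {t1:.3f}, p = {p1:.3f}")
print(f"independent: t = {t2:.3f}, p = {p2:.3f}")
print(f"paired:      t = {t3:.3f}, p = {p3:.3f}")
```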
Mann-Whitney U or Wilcoxon rank-sum test

This is a non-parametric statistical hypothesis test for assessing whether one of two samples of independent observations tends to have larger values than the other. It is one of the best-known non-parametric significance tests. It is performed by arranging all the observations into a single ranked series (it does not matter which sample they are from):
• Choose the sample for which the ranks seem to be smaller, as this makes computation easier. Call this 'sample 1', and call the other sample 'sample 2'.
• For each observation in sample 1, count the number of observations in sample 2 that have a smaller rank (count a half for any that are equal to it).
• The sum of these counts is U.

Wilcoxon signed-rank test

The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test. It is used to compare two related samples to assess whether their population mean ranks differ (it is a paired difference test). It can be used as an alternative to the paired Student's t-test (the t-test for dependent samples) when the population cannot be assumed to be normally distributed. The test statistic is typically called W. A worked sketch of both rank-based tests follows below.

Take it further
Information on how histograms are compared in the high energy physics experiments at CERN can be found in the slides at: http://indico.cern.ch/contributionDisplay.py?contribId=35&confId=217511.
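As a sketch of how these rank-based tests are run in practice (again Python with scipy, with invented data), note that scipy computes U and W and their p-values directly, so the manual ranking procedure above is only needed for hand calculation.

```python
import numpy as np
from scipy import stats

# Invented example data: two small samples of six observations each
sample1 = np.array([1.8, 2.7, 2.1, 3.0, 2.4, 1.9])
sample2 = np.array([2.9, 3.3, 3.8, 2.6, 3.1, 3.5])

# Mann-Whitney U test: do the two independent samples come from the same distribution?
u_stat, p_u = stats.mannwhitneyu(sample1, sample2, alternative="two-sided")

# Wilcoxon signed-rank test: treats the samples as paired and tests the ranked differences
w_stat, p_w = stats.wilcoxon(sample1, sample2)

print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_u:.3f}")
print(f"Wilcoxon     W = {w_stat:.1f}, p = {p_w:.3f}")
```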
Portfolio activity (1.2)
Find two sets of data from different sources that are from the same population. A suitable example could be the UK earnings by industry data in the Topic guide 10.2 data sheet. The null hypothesis is that there is no difference between the two samples, only the standard error caused by the variance of the population as a whole. In the UK earnings example, this could be that there is no statistical difference in earnings between the public and private sector, or between working in manufacturing or the arts, entertainment and recreation sector. In your answer you should:
• calculate the mean, sample size, variance, standard deviation and standard error for both data sets
• apply an appropriate (one or two tailed) test to determine whether the null hypothesis is valid with a 95% confidence level
• apply an appropriate (one or two tailed) test to determine whether the null hypothesis is valid with a 99% confidence level.

2 Multiple sample tests

Errors in multiple hypothesis testing

When two groups, such as a treatment group and a control group, are compared, 'multiple comparisons' arise when the statistical analysis encompasses a number of formal comparisons, with attention focusing on the strongest differences. Failure to compensate for multiple comparisons can have important real-world consequences, because as the number of comparisons increases, it becomes increasingly likely that the groups being compared will appear to differ on at least one attribute. Confidence that a result will generalise to independent data should therefore generally be weaker if it is observed as part of an analysis involving multiple comparisons, rather than an analysis involving only a single comparison.

For example, if one test is performed at the 5% level, there is only a 5% chance of a type I error, that is, of incorrectly rejecting a true null hypothesis. However, for 50 tests with true null hypotheses, the expected number of type I errors is 50 × 0.05 = 2.5, and if all of the tests are independent, the probability of at least one false positive is 1 − 0.95⁵⁰ ≈ 92%.

ANOVA

The analysis of variance (ANOVA) is a collection of statistical methods in which the observed variance of a single variate is separated into components associated with different sources of variation. ANOVA provides a statistical test of whether or not the means of several groups are all equal, and so is the equivalent of the t-test for more than two groups (a short sketch of a one-way ANOVA follows Table 10.3.1 below). A statistically significant result (one whose p-value is less than the significance level) justifies the rejection of the null hypothesis. In standard ANOVA, the null hypothesis is that all groups are simply random samples from the same population; this implies that all treatments have the same effect (perhaps none). Rejecting the null hypothesis implies that different treatments result in different effects.

ANOVA is the synthesis of several ideas and is used for multiple purposes; as a consequence, it is difficult to define concisely or precisely. Its sums of squares indicate the variance of each component of the decomposition, and comparisons of mean squares allow testing of a nested sequence of models. Closely related to ANOVA is a linear model fit with coefficient estimates and standard errors.

Randomisation

Randomisation is the assignment of multiple treatments to experimental units so that all units have an equal chance of receiving each treatment. Randomisation helps to assure unbiased estimates of treatment means and experimental error. In assigning treatments to experimental units such as field plots, the plots can be numbered and the treatments assigned to them at random using a table of random numbers.

Randomised block design

A randomised block design divides subjects into subgroups called blocks, such that the variability within each block is less than the variability between blocks. This is very useful if there are other factors that might affect the effectiveness of the treatment; for example, if it is known that men and women respond differently to medicines. Subjects within each block are then randomly assigned to treatment conditions. Compared to a completely randomised design, this design reduces variability within treatment conditions (and with it the risk of misleading results), producing a better estimate of treatment effects.

Table 10.3.1 shows a randomised block design for a medical experiment. Subjects are assigned to blocks based on gender; then, within each block, subjects are randomly assigned to treatments (either a placebo or the new drug). This design ensures that each treatment condition has an equal proportion of men and women, so that differences between treatment conditions cannot be attributed to gender.

Table 10.3.1: Randomised block design for a medical experiment.

Gender  | New drug | Placebo
--------|----------|--------
Male    | 1000     | 1000
Female  | 1000     | 1000
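The following minimal sketch shows the one-way ANOVA mentioned above (Python with scipy; the three treatment groups are invented numbers): f_oneway returns the F-statistic and its p-value.

```python
import numpy as np
from scipy import stats

# Invented yields from three treatment groups (e.g. three fertilisers on field plots)
treatment_a = np.array([55.2, 57.1, 54.8, 56.3, 55.9])
treatment_b = np.array([58.4, 59.0, 57.7, 58.9, 59.3])
treatment_c = np.array([55.0, 54.1, 56.2, 55.5, 54.7])

# One-way ANOVA: the null hypothesis is that all three group means are equal
f_stat, p_value = stats.f_oneway(treatment_a, treatment_b, treatment_c)

print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A p-value below the chosen significance level (e.g. 0.05) would justify
# rejecting the null hypothesis that the treatments all have the same effect.
```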
Latin squares

A type of randomised block experiment is known as a Latin square. A Latin square is an n × n array of blocks filled with n different symbols, in which each symbol appears exactly once in each column and each row. A 4 × 4 Latin square is shown in Table 10.3.2.

Table 10.3.2: A 4 × 4 Latin square.

A B C D
D A B C
C D A B
B C D A

The Kruskal-Wallis test

The Kruskal-Wallis test is a non-parametric test that is used to compare three or more samples. It tests the null hypothesis that all populations have identical distribution functions against the alternative hypothesis that at least two of the samples differ, only with respect to location (median), if at all. It is analogous to ANOVA, but while ANOVA assumes that all of the tested populations are normally distributed, the Kruskal-Wallis test imposes no such restriction. It is the logical extension of the Mann-Whitney U (Wilcoxon rank-sum) test to multiple samples.

Take it further
Much more information about statistical hypothesis testing can be found at: http://en.wikipedia.org/wiki/Statistical_hypothesis_testing.

3 Categorical data

A categorical distribution is the extension of the binomial distribution covered in Topic guide 10.1, Section 3 to more than two outcomes (categories). For this reason it is sometimes called a multinomial distribution. The distribution is normalised so that the sum of the probabilities of all possible outcomes is one.

Categorical variables often represent types of data that may be divided into groups; examples include race, gender and age group. While age may also be treated numerically by using exact values, it is often more informative to categorise such variables into a relatively small number of groups. You have probably seen this done when filling in surveys and other forms.

Analysis of categorical data generally involves the use of data tables. A two-way table presents categorical data by counting the number of observations that fall into each group for two variables, one divided into rows and the other divided into columns.

The chi-squared test of goodness-of-fit

The chi-squared (χ²) test is applied when you have one categorical variable taken from a single population. Its function is to determine whether the sample data are consistent with a hypothesised distribution. For a chi-squared goodness-of-fit test, the hypotheses take the following form:
• null hypothesis: the data are consistent with the specified distribution
• alternative hypothesis: the data are not consistent with the specified distribution.

To perform the chi-squared analysis, first calculate the expected frequency of observation of each level of the categorical variable:

$$E_i = n p_i$$

where E_i is the expected frequency count for the ith level of the categorical variable, n is the total sample size, and p_i is the hypothesised proportion of observations in level i. The χ² test statistic is then defined as:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where O_i is the observed frequency count for the ith level of the categorical variable.

The probability of observing a sample statistic as extreme as the test statistic is given by a p-value, which depends on both χ² and the number of degrees of freedom. If the category has L levels, then there are L − 1 degrees of freedom. The p-value must be looked up in a chi-squared distribution table or found using a chi-squared distribution calculator. An example table can be found at: http://www.medcalc.org/manual/chi-square-table.php.

To interpret the goodness-of-fit it is necessary to compare the p-value to the significance level: the null hypothesis is rejected if the p-value is less than the significance level.

The chi-squared test of association

The chi-squared test also provides a method for testing the association between the row and column variables in a two-way table. The null hypothesis assumes that there is no association between the variables (in other words, that one variable does not vary according to the other), while the alternative hypothesis, H1, claims that some association does exist. The alternative hypothesis does not specify the type of association, only that some association exists. A sketch of both chi-squared tests follows below.
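As a minimal sketch of both chi-squared tests (Python with scipy; all counts are invented), scipy.stats provides chisquare for the goodness-of-fit test and chi2_contingency for a two-way table of counts.

```python
import numpy as np
from scipy import stats

# Goodness-of-fit: invented die-roll counts (n = 120) against a fair-die hypothesis
observed = np.array([25, 17, 15, 23, 24, 16])
expected = np.full(6, 120 / 6)   # E_i = n * p_i with p_i = 1/6 for every face
chi2, p = stats.chisquare(observed, f_exp=expected)  # 6 levels -> 5 degrees of freedom
print(f"goodness-of-fit: chi2 = {chi2:.2f}, p = {p:.3f}")

# Test of association: invented 2x2 two-way table (rows = gender, columns = preference)
table = np.array([[30, 20],
                  [25, 45]])
chi2, p, dof, expected_counts = stats.chi2_contingency(table)
print(f"association:     chi2 = {chi2:.2f}, p = {p:.3f}, dof = {dof}")
```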
Further reading

Boslaugh, S. (2012) Statistics in a Nutshell, O'Reilly Media
Ellison, S. et al. (2009) Practical Statistics for the Analytical Scientist, RSC
Larsen, R. and Fox Stroup, D. (1976) Statistics in the Real World, Macmillan
Miller, J. and Miller, J. (2010) Statistics and Chemometrics for Analytical Chemistry, Prentice Hall
Samuels, M. et al. (2010) Statistics for the Life Sciences, Pearson
Swartz, M. and Krull, I. (2012) Handbook of Analytical Validation, CRC Press

Statistical calculators online:
http://www.danielsoper.com/statcalc3/
http://www.measuringusability.com/calc.php

Checklist

At the end of this topic guide, you should be familiar with the following ideas:
• the different types of test used in hypothesis testing and how decisions can be made from these
• the mathematical techniques used to carry out one and two sample tests on a given population sample
• the use of categorical data to test hypotheses.

You should be able to:
• make statistical decisions using significance testing
• carry out one and two sample tests on a given population sample (3.1)
• use multiple sample tests in experimental design (3.2)
• use categorical data to test hypotheses (3.3).

Acknowledgements

The publisher would like to thank the following for their kind permission to reproduce their photographs:
Shutterstock.com: Sofiaworld
Every effort has been made to trace the copyright holders and we apologise in advance for any unintentional omissions. We would be pleased to insert the appropriate acknowledgement in any subsequent edition of this publication.