Unit 10: Statistics for experimental design

10.3 Making statistical decisions using significance testing
In this topic guide you will find out about the different types of test used in
hypothesis testing and how decisions can be made from these. You will see
the mathematical techniques used to carry out one and two sample tests on
a given population sample and find out how to use multiple sample tests in
experimental design. You will also look at the use of categorical data to
test hypotheses.
On successful completion of this topic you will:
•• be able to make statistical decisions using significance testing (LO3).
To achieve a Pass in this unit you need to show that you can:
•• carry out one and two sample tests on a given population sample (3.1)
•• use multiple sample tests in experimental design (3.2)
•• use categorical data to test hypotheses (3.3).
1 Tests of significance
A statistical hypothesis test is used to make decisions using data from a scientific
study. A result is called statistically significant if it is unlikely to have occurred
by chance alone, according to a pre-determined threshold probability, the
significance level. These tests of significance are used in determining what
outcomes of a study would lead to a rejection of the null hypothesis for a
pre-specified level of significance. This gives information on whether results
contain enough information to cast doubt on conventional wisdom, given that
conventional wisdom has been used to establish the null hypothesis.
The critical region of a hypothesis test is the set of outcomes that cause the
rejection of the null hypothesis in favour of the alternative hypothesis.
A one sample test (for example, a goodness of fit test) shows whether the
collected data are useful in making a prediction about the population. A two sample
test (for example, a test of independence) is used to determine whether one
variable is dependent on another.
z-tests
The z-score expresses how many standard deviations a value lies from the mean.
A z-score of 2 means the value is two standard deviations above the mean; a
z-score of −1.5 means it is 1.5 standard deviations below the mean. A z-score
of 0 means the score is equal to the mean (zero standard deviations).
In a one-tailed test the z-score is tested against a known side of the mean (either
above or below); in a two-tailed test deviations on either side of the mean are counted.
The standard error of the mean tells us how much the sample mean varies from
sample to sample (it is the standard deviation of the sample mean for a
particular sample size). The empirical rule tells us that approximately 95% of the
time the sample mean will fall within two standard errors of the population mean.
Take it further
z-score online calculator: http://www.danielsoper.com/statcalc3/calc.aspx?id=22
z-score to percentile calculator: http://www.measuringusability.com/pcalcz.php
Further information about the applications of z-scores can be found at the
following link: https://statistics.laerd.com/statistical-guides/standard-score.php
The z-score for a value x, given a mean µ and a standard deviation σ, is:
z = (x − µ) / σ
To find the probability associated with this z-score we can look up the z-value
in a table of normal probabilities or use an online calculator. This is necessary
because the probability is the area under the bell curve of the normal distribution,
and this area has no simple closed-form expression, so the table must be generated
numerically. A common use is to find at what percentile a certain value sits in the population.
If, instead of having a single datum, we have a sample data set of sample size n,
then this can be expanded to find the probability of any given sample mean x̄ using
a statistical test called the one sample z-test:
z = (x̄ − µ) / (σ / √n)
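A minimal sketch of the one sample z-test in Python, using hypothetical data and assuming the population standard deviation σ is known and the SciPy library is available:

```python
# One sample z-test: z = (x̄ − µ) / (σ/√n), with a two-tailed p-value taken
# from the standard normal distribution. Data values are illustrative only.
from math import sqrt
from statistics import mean
from scipy.stats import norm

sample = [5.1, 4.9, 5.3, 5.2, 4.8, 5.0, 5.4, 5.1]  # hypothetical measurements
mu = 5.0      # hypothesised population mean
sigma = 0.2   # known population standard deviation

n = len(sample)
z = (mean(sample) - mu) / (sigma / sqrt(n))
p_two_tailed = 2 * norm.sf(abs(z))  # area in both tails of the bell curve
print(f"z = {z:.3f}, two-tailed p = {p_two_tailed:.3f}")
```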
t-testing distributions
Taking many samples from the same population and calculating the mean for
each sample gives a new distribution that has the same mean as the population
distribution but with less variance. The width of the distribution depends on the
sizes of the samples taken. The means of large samples are less likely to be outliers,
in the tails of the distribution of the population as a whole, than are the means of
small samples.
‘Student’ developed a way of analysing the distribution of the deviations of the
sample means from the population mean relative to the standard error (see Case
study below). The resulting t-statistic gives the probability of finding a difference
between a sample mean and the true population mean greater than any chosen value of t.
Standard error of a sample: sx̄ = √(s² / n)
t-statistic of a sample: t = (x̄ − µ) / sx̄
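A minimal sketch of the one sample t-test in Python, with hypothetical data and assuming SciPy is available:

```python
# One sample t-test: uses the sample standard deviation in place of a
# known population sigma. Data values are illustrative only.
from scipy.stats import ttest_1samp

sample = [5.1, 4.9, 5.3, 5.2, 4.8, 5.0, 5.4, 5.1]  # hypothetical measurements
t_stat, p_value = ttest_1samp(sample, popmean=5.0)
print(f"t = {t_stat:.3f}, two-tailed p = {p_value:.3f}")
```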
Case study: Guinness and the ‘Student’ t-test
The t-statistic was developed by William Sealy Gosset, a chemist working for the Guinness
brewery in Dublin. He devised the t-test to monitor the quality of stout. The test was published
in Biometrika in 1908, but Gosset was forced to publish under a pen name (he chose ‘Student’)
by Guinness, which regarded its use of statistics as a trade secret.
Sign tests
In statistics, the sign test can be used to test the hypothesis that there is ‘no
difference in medians’ between the distributions of two random variables X
and Y. It requires that we can draw paired samples from X and Y. It is a
non-parametric test with few assumptions about the distributions being tested.
In this test the null hypothesis is that the median of the differences X − Y is zero.
For each pair, we record the sign of X − Y. If the null hypothesis is true, positive and
negative differences are equally likely, so the number of positive differences follows
a binomial distribution with p = 0.5.
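A minimal sketch of the sign test in Python, using hypothetical paired data and assuming SciPy 1.7 or later (for binomtest):

```python
# Sign test: under H0 the count of positive differences is Binomial(n, 0.5);
# ties (zero differences) are discarded. Data values are illustrative only.
from scipy.stats import binomtest

x = [12.1, 11.8, 12.6, 12.0, 12.4, 11.9, 12.3, 12.5]  # hypothetical pairs
y = [11.9, 11.9, 12.1, 11.7, 12.0, 12.2, 12.0, 12.1]

diffs = [a - b for a, b in zip(x, y) if a != b]  # drop ties
n_pos = sum(d > 0 for d in diffs)
result = binomtest(n_pos, n=len(diffs), p=0.5)  # two-sided by default
print(f"positive differences: {n_pos}/{len(diffs)}, p = {result.pvalue:.3f}")
```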
Independent two sample t-test
The two sample t-test is useful for comparing the means of two samples. The
exact formula depends on whether the sample sizes and variance are the same or
different in the two samples.
For equal sample sizes with equal variance the t-statistic to test whether the
means are different can be calculated as follows:
t = (x̄1 − x̄2) / (sx1x2 √(2/n))
The following equation is an estimator of the common standard deviation
of the two samples:
sx1x2 = √([(sx1)² + (sx2)²] / 2)
10.3: Making statistical decisions using significance testing
3
Unit 10: Statistics for experimental design
If the two sample sizes are unequal, then the t-statistic is:
t = (x̄1 − x̄2) / (sx1x2 √(1/n1 + 1/n2))
where
sx1x2 = √([(n1 − 1)(sx1)² + (n2 − 1)(sx2)²] / (n1 + n2 − 2))
For significance testing, the degrees of freedom for this test is n1 + n2 − 2,
which equals 2n − 2 when each group contains n participants.
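A minimal sketch of the independent two sample t-test in Python, with hypothetical data and assuming SciPy is available:

```python
# Independent two sample t-test. equal_var=True pools the variances as in
# the formulas above; equal_var=False would give Welch's t-test instead.
# Data values are illustrative only.
from scipy.stats import ttest_ind

group1 = [23.1, 24.5, 22.8, 25.0, 23.7, 24.2]  # hypothetical measurements
group2 = [21.9, 22.4, 23.0, 21.5, 22.8, 22.1]

t_stat, p_value = ttest_ind(group1, group2, equal_var=True)
print(f"t = {t_stat:.3f}, two-tailed p = {p_value:.3f}")
```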
Dependent t-test for paired samples
This test can be used when the samples are dependent; that is, when there is only
one sample that has been tested twice (repeated measures) or when there are two
samples that have been matched or ‘paired’. This is a paired difference test.
t-statistic of a sample: t = (x̄d − µ) / (sd / √n)
For this equation, the differences between all pairs must first be calculated;
x̄d and sd are the mean and standard deviation of those differences, n is the
number of pairs, and µ is the mean difference under the null hypothesis (usually zero).
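A minimal sketch of the dependent (paired) t-test in Python, with hypothetical repeated-measures data and assuming SciPy is available:

```python
# Dependent t-test for paired samples: ttest_rel works on the pairwise
# differences, as in the formula above. Data values are illustrative only.
from scipy.stats import ttest_rel

before = [140, 152, 138, 147, 155, 143, 149]  # hypothetical first measurement
after  = [135, 150, 134, 141, 152, 140, 143]  # same subjects, measured again

t_stat, p_value = ttest_rel(before, after)
print(f"t = {t_stat:.3f}, two-tailed p = {p_value:.3f}")
```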
Mann-Whitney U or Wilcoxon rank-sum test
This is a non-parametric statistical hypothesis test for assessing whether one of
two samples of independent observations tends to have larger values than the
other. It is one of the most well-known non-parametric significance tests.
It is performed by arranging all the observations into a single ranked series (it does
not matter which sample they are from):
•• Choose the sample for which the ranks seem to be smaller as this makes
computation easier. Call this ‘sample 1,’ and call the other sample ‘sample 2’.
•• For each observation in sample 1, count the number of observations in
sample 2 that have a smaller rank (count a half for any that are equal to it).
•• The sum of these counts is U.
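A minimal sketch of this counting procedure in Python, with hypothetical data; the SciPy call (assumed available) is shown for comparison and also returns a p-value:

```python
# Mann-Whitney U by direct counting: for each observation in sample 1,
# count the sample-2 observations below it, scoring a half for ties.
# Data values are illustrative only.
from scipy.stats import mannwhitneyu

sample1 = [1.83, 2.05, 1.91, 2.20, 1.76]  # hypothetical observations
sample2 = [2.10, 2.31, 2.25, 2.40, 2.02]

U = sum(sum(1.0 if y < x else 0.5 if y == x else 0.0 for y in sample2)
        for x in sample1)
print(f"U = {U}")

u_stat, p_value = mannwhitneyu(sample1, sample2, alternative="two-sided")
print(f"U = {u_stat}, two-tailed p = {p_value:.3f}")
```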
Wilcoxon signed-rank test
The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test. It is
used to compare two related samples to assess whether their population mean
ranks differ (it is a paired difference test). It can be used as an alternative to the
paired Student’s t-test or the t-test for dependent samples when the population
cannot be assumed to be normally distributed. The test statistic is typically called
W in this test.
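A minimal sketch in Python, reusing the hypothetical paired data from above and assuming SciPy is available:

```python
# Wilcoxon signed-rank test: ranks the absolute pairwise differences and
# compares the rank sums of positive and negative differences.
# Data values are illustrative only.
from scipy.stats import wilcoxon

before = [140, 152, 138, 147, 155, 143, 149]  # hypothetical paired samples
after  = [135, 150, 134, 141, 152, 140, 143]

w_stat, p_value = wilcoxon(before, after)
print(f"W = {w_stat}, two-tailed p = {p_value:.3f}")
```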
Take it further
Information on how histograms are compared in the high energy physics experiments
at CERN can be found in the slides at: http://indico.cern.ch/contributionDisplay.py?contribId=35&confId=217511.
Portfolio activity (1.2)
Find two sets of data from different sources that are from the same population. A suitable
example could be the UK earnings by industry data in Topic guide 10.2 data sheet. The null
hypothesis is that there is no difference in the two samples, only the standard error caused by the
variance of the population as a whole. In the UK earnings example, this could be that there is no
statistical difference in earnings between the public and private sector, or between working in
manufacturing or the arts, entertainment and recreation sector. Apply an appropriate (one- or
two-tailed) test to determine whether the null hypothesis is valid at a 95% confidence level and at a
99% confidence level. In your answer you should:
•• calculate the mean, sample size, variance, standard deviation and standard error for both
data sets
•• apply an appropriate (one- or two-tailed) test to determine whether the null hypothesis is valid
at a 95% confidence level
•• apply an appropriate (one- or two-tailed) test to determine whether the null hypothesis is valid
at a 99% confidence level.
2 Multiple sample tests
Errors in multiple hypothesis testing
When two groups, such as a treatment group and a control group, are compared,
‘multiple comparisons’ arise when the statistical analysis encompasses a number
of formal comparisons, with the attention focusing on the strongest differences.
Failure to compensate for multiple comparisons can have important real-world
consequences, because as the number of comparisons increases, it becomes
increasingly likely that the groups being compared will appear to differ in terms
of at least one attribute. Therefore confidence that a result will generalise to
independent data should generally be weaker if it is observed as part of an
analysis that involves multiple comparisons, rather than an analysis that involves
only a single comparison.
For example, if one test is performed at the 5% level, there is only a 5% chance of
having a type I error and incorrectly rejecting the null hypothesis. However, for
50 tests with true null hypotheses, the expected number of type I errors is 2.5. If all
of the tests are independent, the probability of at least one false positive is
1 − (0.95)^50 ≈ 92%.
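A quick check of these figures in Python (the familywise error rate formula 1 − (1 − α)^m assumes the m tests are independent):

```python
# Familywise error rate for m independent tests at significance level alpha.
alpha, m = 0.05, 50
expected_type_I_errors = m * alpha       # 50 * 0.05 = 2.5
p_at_least_one = 1 - (1 - alpha) ** m    # 1 - 0.95**50 ≈ 0.923
print(f"expected type I errors: {expected_type_I_errors}")
print(f"P(at least one false positive) = {p_at_least_one:.3f}")
```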
ANOVA
The analysis of variance (ANOVA) is a collection of statistical methods whereby the
observed variance of a single variate is separated into components associated with
different sources of variation. ANOVA provides a statistical test of whether or not
the means of several groups are all equal, and so is the equivalent of the t-test for
more than two groups. A statistically significant result (when a probability (p-value)
is less than the significance level) justifies the rejection of the null hypothesis.
In standard ANOVA analysis, the null hypothesis is that all groups are simply
random samples from within the same population. This implies that all treatments
have the same effect (perhaps none). Rejecting the null hypothesis implies that
different treatments result in altered effects.
ANOVA is the synthesis of several ideas and it is used for multiple purposes. As
a consequence, it is difficult to define concisely or precisely. Its sums of squares
quantify the variation attributable to each component of the decomposition.
Comparisons of mean squares allow testing of a nested sequence of models.
Closely related to ANOVA is a linear model fit with coefficient estimates and standard errors.
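A minimal sketch of a one-way ANOVA in Python, with hypothetical data for three treatment groups and assuming SciPy is available:

```python
# One-way ANOVA: tests whether the group means are all equal.
# Data values are illustrative only.
from scipy.stats import f_oneway

group_a = [24.1, 25.3, 23.8, 24.9, 25.0]  # hypothetical treatment groups
group_b = [26.0, 27.2, 25.8, 26.5, 27.0]
group_c = [24.8, 25.1, 24.3, 25.6, 24.9]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")  # reject H0 if p < significance level
```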
Randomisation
Randomisation is the assignment of multiple treatments to experimental units
so that each unit has an equal chance of receiving any given treatment.
Randomisation helps to ensure unbiased estimates of treatment means and
experimental error.
In assigning treatments to experimental units such as field plots, the plots can be
assigned numbers and the treatments assigned to them at random using a table
of random numbers.
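A minimal sketch of random treatment assignment in Python (random.shuffle stands in for the printed table of random numbers; the plot and treatment labels are hypothetical):

```python
# Randomly assign 4 treatments (3 replicates each) to 12 numbered field plots.
import random

plots = list(range(1, 13))             # hypothetical plot numbers 1-12
treatments = ["A", "B", "C", "D"] * 3  # 4 treatments, 3 replicates each

random.shuffle(treatments)             # put the treatments in random order
for plot, treatment in zip(plots, treatments):
    print(f"plot {plot:2d} -> treatment {treatment}")
```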
Randomised block design
A randomised block design allows subjects to be divided into subgroups called
blocks, such that the variability within one block is less than the variability
between blocks. This is very useful if there are other factors that might affect the
effectiveness of the treatment; for example, if it is known that men and women
respond differently to medicines. Then, subjects within each block are randomly
assigned to treatment conditions.
Compared to a completely randomised design, this design reduces variability
within treatment conditions and the risk of misleading results, producing a better
estimate of treatment effects.
Table 10.3.1 shows a randomised block design for a medical experiment. Subjects
are assigned to blocks, based on gender. Then, within each block, subjects are
randomly assigned to treatments (either a placebo or the new drug). This design
ensures that each treatment condition has an equal proportion of men and
women, so that differences between treatment conditions cannot be
attributed to gender.
Table 10.3.1: Randomised block design for a medical experiment.

              Treatment
Gender        New drug    Placebo
Male          1000        1000
Female        1000        1000
Latin squares
A type of randomised block experiment is known as a Latin square. A Latin square
is an n × n array of blocks filled with n different symbols. Each symbol appears
exactly once in each column and each row.
A 4 × 4 Latin square is shown in Table 10.3.2.
Table 10.3.2: A 4 × 4 Latin square.

A  B  C  D
D  A  B  C
C  D  A  B
B  C  D  A
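A minimal sketch in Python generating such a square by cyclic shifts (one simple construction; a randomised design would also shuffle rows, columns and symbols):

```python
# Build an n x n Latin square: each row shifts the previous row one place
# to the right, so every symbol appears once per row and once per column.
def latin_square(symbols):
    n = len(symbols)
    return [[symbols[(j - i) % n] for j in range(n)] for i in range(n)]

for row in latin_square(["A", "B", "C", "D"]):
    print("  ".join(row))  # reproduces the pattern of Table 10.3.2
```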
The Kruskal-Wallis test
The Kruskal-Wallis test is a non-parametric test that is used to compare three
or more samples. It tests the null hypothesis that all populations have identical
distribution functions against the alternative hypothesis that at least two
of the samples differ with respect to location (median).
It is analogous to ANOVA; however, while ANOVA assumes that all of the tested
populations have normal distributions, the Kruskal-Wallis test imposes no such
restriction. It is the logical extension of the Mann-Whitney U or Wilcoxon rank-sum
test to multiple samples.
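A minimal sketch in Python, reusing the hypothetical three-group data from the ANOVA example and assuming SciPy is available:

```python
# Kruskal-Wallis H-test: a rank-based analogue of one-way ANOVA.
# Data values are illustrative only.
from scipy.stats import kruskal

group_a = [24.1, 25.3, 23.8, 24.9, 25.0]
group_b = [26.0, 27.2, 25.8, 26.5, 27.0]
group_c = [24.8, 25.1, 24.3, 25.6, 24.9]

h_stat, p_value = kruskal(group_a, group_b, group_c)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```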
Take it further
Much more information about statistical hypothesis testing can be found at:
http://en.wikipedia.org/wiki/Statistical_hypothesis_testing.
3 Categorical data
A categorical distribution is the extension of the binomial distribution covered in
Topic guide 10.1, Section 3 to more than two outcomes (categories). For this reason
it is sometimes called a multinomial distribution. The distribution is normalised so
that the sum of the probabilities of all possible outcomes is one.
Categorical variables often represent types of data that may be divided into groups.
Examples of categorical variables are race, gender, and age group. While age may
also be considered in a numerical manner by using exact values, it is often more
informative to categorise such variables into a relatively small number of groups.
You have probably seen this done when filling in surveys and other forms.
Analysis of categorical data generally involves the use of data tables. A two-way
table presents categorical data by counting the number of observations that fall
into each group for two variables, one divided into rows and the other divided
into columns.
The chi-squared test of goodness-of-fit
The chi-squared test (or χ² test) is applied when you have one categorical variable
taken from a single population. Its function is to determine whether the sample
data are consistent with a hypothesised distribution. For a chi-squared
goodness-of-fit test, the hypotheses take the following form:
•• null hypothesis: the data are consistent with the specified distribution
•• alternative hypothesis: the data are not consistent with the specified
distribution.
To perform the chi-squared analysis, first the expected frequency of observation of
each level of the categorical variable must be calculated:
Expected frequency of ith level: Ei = npi
where Ei is the expected frequency count for the ith level of the categorical
variable, n is the total sample size, and pi is the hypothesised proportion of
observations in level i. The χ² test statistic is then defined as:
χ² = Σ [ (Oi − Ei)² / Ei ]
where Oi is the observed frequency count for the ith level of the categorical
variable. The probability of observing a sample statistic as extreme as the test
statistic is defined by a p-value, which depends on both χ² and the degrees of
freedom. If the category has L levels, then there are L − 1 degrees of freedom.
The particular p-value must be looked up in a chi-square distribution table or
using a chi-square distribution calculator. An example table can be found at:
http://www.medcalc.org/manual/chi-square-table.php.
To interpret the goodness-of-fit it is necessary to compare the p-value to the
significance level. Generally the null hypothesis is rejected if the p-value is less
than the significance level.
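A minimal sketch of a χ² goodness-of-fit test in Python, using hypothetical die-roll counts tested against a fair (uniform) distribution and assuming SciPy is available:

```python
# Chi-squared goodness-of-fit: compares observed counts Oi with expected
# counts Ei = n * pi. Counts below are illustrative only.
from scipy.stats import chisquare

observed = [22, 17, 19, 25, 16, 21]   # Oi: counts of die faces 1-6
n = sum(observed)
expected = [n / 6] * 6                # Ei under H0: pi = 1/6 for each face

chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2_stat:.3f}, p = {p_value:.3f}")  # df = 6 - 1 = 5
```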
The chi-squared test of association
The chi-squared test also provides a method for testing the association between
the row and column variables in a two-way table. The null hypothesis assumes
that there is no association between the variables (in other words, one variable
does not vary according to the other variable), while the alternative hypothesis,
H1, claims that some association does exist. The alternative hypothesis does not
specify the type of association, only that some association is present.
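A minimal sketch of the test of association in Python on a hypothetical two-way table, assuming SciPy is available:

```python
# Chi-squared test of association (independence) on a two-way table.
# Counts are illustrative only: rows = gender, columns = product preference.
from scipy.stats import chi2_contingency

table = [[30, 10],
         [20, 40]]

chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2_stat:.3f}, df = {dof}, p = {p_value:.4f}")
```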
Further reading
Boslaugh, S. (2012) Statistics in a Nutshell, O’Reilly Media
Ellison, S. et al. (2009) Practical Statistics for the Analytical Scientist, RSC
Larsen, R. and Fox Stroup, D. (1976) Statistics in the Real World, Macmillan
Miller, J. and Miller, J. (2010) Statistics and Chemometrics for Analytical Chemistry, Prentice Hall
Samuels, M. et al. (2010) Statistics for the Life Sciences, Pearson
Swartz, M. and Krull, I. (2012) Handbook of Analytical Validation, CRC Press
Statistical calculators online:
http://www.danielsoper.com/statcalc3/
http://www.measuringusability.com/calc.php
Checklist
At the end of this topic guide, you should be familiar with the following ideas:
•• the different types of test used in hypothesis testing and how decisions can be made
from these
•• the mathematical techniques used to carry out one and two sample tests on a given
population sample and the use of categorical data to test hypotheses.
You should be able to:
•• make statistical decisions using significance testing
•• carry out one and two sample tests on a given population sample (3.1)
•• use multiple sample tests in experimental design (3.2)
•• use categorical data to test hypotheses (3.3).
Acknowledgements
The publisher would like to thank the following for their kind permission to reproduce their
photographs:
Shutterstock.com: Sofiaworld
Every effort has been made to trace the copyright holders and we apologise in advance for any
unintentional omissions. We would be pleased to insert the appropriate acknowledgement in any
subsequent edition of this publication.