Hypothesis Testing
David Young
Department of Statistics and Modelling Science,
University of Strathclyde
Royal Hospital for Sick Children,
Yorkhill NHS Trust
2
Statistics and Probability
• statistical analysis considers the probability of an event being
due to chance
• can never be 100% certain for example that one treatment is
better than another
• can say mathematically how sure we are that a result is true
3
Hypothesis Testing
• a statistical test is designed to ‘prove’ a hypothesis held by the
researcher
• it starts by assuming the contrary view to the researcher’s and
only comes down in support of the researcher’s hypothesis if
the data are sufficiently unlikely to have been generated by the
contrary view
• the ‘contrary view’ is known as the null hypothesis
• the research hypothesis of interest is called the alternative
hypothesis
4
Probability
• in statistical testing, it is impossible to ‘prove’ a hypothesis
beyond all reasonable doubt
• decision processes must be able to deal with the problems of
uncertainty
• uncertainty cannot be handled with deterministic mathematical
tools alone, and a whole branch of mathematics called
Probability Theory has been developed to deal with it
• most people have a good grasp of probability through card
games, board games, betting odds, etc.
5
Probability Theory
• suppose that the proportion (p) of defective items in a large
batch is 0.1
• in a sample of size 100 taken from this batch, we would expect to
get (100 × 0.1) = 10 defective items
• a single sample may contain any number of defective items
‘close’ to 10
• e.g. samples may have 8, 11, 9, 10 or 12 defectives
• probability theory enables us to calculate the probability or
chance of getting a given number of defectives
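As a sketch (not from the slides), these probabilities come from the binomial distribution; the snippet below uses Python's SciPy to compute the chance of seeing exactly k defectives in a sample of 100 when the defective proportion is 0.1.

```python
from scipy.stats import binom

n, p = 100, 0.1  # sample size and defective proportion from the slide

# probability of exactly k defectives, for values 'close' to the expected 10
for k in (8, 9, 10, 11, 12):
    print(f"P(X = {k}) = {binom.pmf(k, n, p):.4f}")

# expected number of defectives: n * p = 10
print("Expected defectives:", n * p)
```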
6
Hypothesis Testing
• statistical inference is the procedure whereby inferences about
a population are made on the basis of the results obtained
from a sample drawn from that population
• inference may be divided into two categories …
• estimation
• hypothesis testing
• basically, hypothesis testing is a test of the validity of some
claim or theory about a population e.g. students have debts of
>£4000 upon graduating, aspirin is a more effective pain-killer
than paracetamol, a new HIV medication delays the onset of
AIDS, etc.
7
Comparing Two Samples of Data
• there are several factors which affect the choice of statistical
hypothesis test
• in comparing two sample means the procedure depends on
whether the data are paired (as in a cross-over experiment or
when comparing ‘before’ and ‘after’ measurements)
• whether the data are quantitative or qualitative
• it also depends on the distribution of the sample data (are the
data normal?)
8
Checking the Assumption of Normality
• the simplest way to check the normality assumption for a
variable is by plotting a histogram and assessing visually if the
distribution is bell-shaped
• normality tests are available with most statistical packages
• e.g. in MINITAB the normality test generates a normal
probability plot and performs a hypothesis test to examine
whether or not the observations follow a normal distribution
• for data which are normally distributed, parametric tests can
be applied
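The notes use MINITAB for this check; as a hedged alternative sketch, the same visual inspection and a Shapiro-Wilk normality test can be run in Python (the data here reuse the bran A transit times from the later example purely as a placeholder):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# placeholder data: replace with the variable you want to check
x = np.array([44, 51, 52, 55, 60, 62, 66, 68, 69, 71, 71, 76, 82, 91, 108])

# visual check: is the histogram roughly bell-shaped?
plt.hist(x, bins=8)
plt.title("Histogram for normality check")
plt.show()

# formal check: Shapiro-Wilk test (null hypothesis: the data are normal)
stat, p = stats.shapiro(x)
print(f"Shapiro-Wilk W = {stat:.3f}, p = {p:.3f}")
# a small p-value (e.g. < 0.05) is evidence against normality
```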
9
Distribution Free Tests
• occasionally it will not be possible to make this assumption e.g.
when the data are clearly skewed or there are too few data
points to determine the approximate distribution
• a group of tests have been devised for which no assumptions
are made about the distribution of the observations – these are
called distribution-free tests
• since distributions are compared without the use of
parameters they can be referred to as non-parametric tests
10
Comparing Unpaired Samples
• in a sense we wish to compare the ‘average’ values for the two
underlying populations e.g. does the average blood pressure
differ in two groups treated with a different drug?
• if the samples are normally distributed, use a t-test and the
corresponding confidence intervals to compare the means
11
Example: RCT
Old Treatment   New Treatment   P-value
71              70              0.921
71              68              0.893
71              62              0.538
71              53              0.376
71              42              0.112
71              29              0.032
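A minimal sketch of the pattern in this table: for a fixed spread, the further the new-treatment mean moves from 71, the smaller the p-value. The standard error (18) and degrees of freedom (4) below are illustrative assumptions, not values from the trial, so the printed p-values only track the trend, not the exact figures above.

```python
from scipy import stats

se, df = 18, 4        # assumed standard error of the difference and degrees of freedom
old_mean = 71

for new_mean in (70, 68, 62, 53, 42, 29):
    t = (old_mean - new_mean) / se
    p = 2 * stats.t.sf(abs(t), df)   # two-sided p-value from the t distribution
    print(f"new mean {new_mean}: t = {t:.2f}, p = {p:.3f}")
```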
12
Additional Points
• Errors in hypothesis testing – p<0.01!
• Null and alternative hypotheses
• Cranberry juice – randomisation
http://www.ncbi.nlm.nih.gov/pubmed/22961092
13
Additional Points
• Double blind studies
http://www.theguardian.com/society/2005/jan/17/health.medicineandhealth
• Placebo trials
• Comparison of baseline characteristics
• Intention-to-treat and per-protocol – weight loss example
• Tests for correlation, regression and normality testing
14
Example
• Comparison of transit times (hours) using two different bran
preparations ...
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1410956/
• Bran preparation A:
44 51 52 55 60 62 66 68 69 71 71 76 82 91 108
• Bran preparation B:
52 64 68 77 79 83 84 88 95 97 101 116
• null hypothesis – no difference in transit times for A and B
• alternative hypothesis – some difference in transit times
15
Descriptive Statistics
Descriptive Statistics: Bran A, Bran B

Variable   N   Mean   StDev  Minimum  Q1     Median  Q3     Maximum
Bran A     15  68.40  16.47  44.00    55.00  68.00   76.00  108.00
Bran B     12  83.67  17.51  52.00    70.25  83.50   96.50  116.00
16
Histograms
17
The P-value
• the p-value is the probability of getting data at least as extreme
as those actually observed in the experiment if the null
hypothesis were true
• the lower the p-value, the more evidence there is against the
null hypothesis (i.e. in favour of the study hypothesis)
• the conventional cut-off for significance is p<0.05
18
Two Sample T-test
Two Sample T-Test and Confidence Interval

Two sample T for Bran A vs Bran B

         N   Mean  StDev  SE Mean
Bran A   15  68.4  16.5   4.3
Bran B   12  83.7  17.5   5.1

95% CI for mean A - mean B: (-28.9, -1.6)
T-Test mean A = mean B (vs not =): T = -2.31  P = 0.030  DF = 23
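The MINITAB output above can be checked with a short Python sketch using the transit times listed earlier; SciPy's `ttest_ind` with `equal_var=False` performs the unequal-variance (Welch) two-sample t-test, which matches the DF = 23 shown above.

```python
from scipy import stats

bran_a = [44, 51, 52, 55, 60, 62, 66, 68, 69, 71, 71, 76, 82, 91, 108]
bran_b = [52, 64, 68, 77, 79, 83, 84, 88, 95, 97, 101, 116]

# Welch two-sample t-test (does not assume equal variances)
t, p = stats.ttest_ind(bran_a, bran_b, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")   # expected: t ≈ -2.31, p ≈ 0.030
```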
19
Interpretation
• the p-value from the t-test comparing the transit times in both
groups is 0.03
• since this is less than 0.05, reject the null hypothesis and
accept the alternative
• conclude that there is a significant difference between the two
groups
• conclusion – the transit time for Bran A is significantly lower
than it is for Bran B
20
Choice of Test
• the choice of statistical test to use depends mainly on two
things …
– the type of data (categorical or numerical)
– the distribution of the data (normal or non-normal)
• if the data are normally distributed, parametric tests are used
• if the data are not normally distributed, non-parametric tests
are appropriate
21
Tests for comparing two group means
• if the data are quantitative (i.e. numerical) and normally
distributed use a t-test (sometimes referred to as a two sample
t-test)
• this is known as a parametric test
• if the data are quantitative and not normally distributed, the
appropriate test is a Mann-Whitney test
• this is a non-parametric test
• for qualitative data, non-parametric tests are generally used
22
Non-normal data
• if the data are not normally distributed either look for a
transformation which does normalise the distributions (e.g. log,
square root) or use a Mann-Whitney test (the non-parametric
equivalent to the t-test)
• using a transformation is more sensitive but might lead to
results and particularly confidence intervals which are difficult
to interpret
• using a non-parametric test is less efficient but does lead to an
easily interpretable confidence interval for the difference
between two medians
• if sample sizes are too small to determine if the distribution is
normal, use the non-parametric approach
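A hedged sketch of the two options, reusing the bran transit times purely to illustrate the calls (those data are in fact reasonably close to normal): a log transformation followed by a t-test, and the Mann-Whitney test on the untransformed values.

```python
import numpy as np
from scipy import stats

bran_a = [44, 51, 52, 55, 60, 62, 66, 68, 69, 71, 71, 76, 82, 91, 108]
bran_b = [52, 64, 68, 77, 79, 83, 84, 88, 95, 97, 101, 116]

# option 1: transform towards normality, then use the parametric test
t, p_t = stats.ttest_ind(np.log(bran_a), np.log(bran_b), equal_var=False)
print(f"t-test on log scale: p = {p_t:.3f}")

# option 2: Mann-Whitney test, the non-parametric equivalent of the t-test
u, p_u = stats.mannwhitneyu(bran_a, bran_b, alternative="two-sided")
print(f"Mann-Whitney: p = {p_u:.3f}")
```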
23
Qualitative Data
• this involves comparing the proportion of cases who have a
certain characteristic of interest in the two groups e.g. do the
proportions of cases suffering from a breast cancer recurrence
differ for pre and post-menopausal women?
• with decent sample sizes use a chi-squared test along with a
confidence interval for the difference or ratio of the two
proportions
24
Obesity and breast-feeding
• Does Breastfeeding Help to Reduce the Risk of Childhood
Overweight and Obesity?
• http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4374721/
• Results:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4374721/table/pone.0122534.t001/
• Of 5650 breast-fed children, 658 (11.6%) were overweight vs.
1304/7513 (17.4%) of those not breast-fed
25
Results
• Use Stat > Basic Statistics > 2 Proportions to get:
Test and CI for Two Proportions

Sample   X     N     Sample p
1        658   5650  0.116460
2        1304  7513  0.173566

Difference = p (1) - p (2)
Estimate for difference: -0.0571056
95% CI for difference: (-0.0690766, -0.0451347)
Test for difference = 0 (vs ≠ 0): Z = -9.35  P-Value = 0.000
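A hedged Python sketch of the same two-proportion comparison, using the normal approximation with an unpooled standard error (which reproduces the MINITAB figures above):

```python
import numpy as np

x = np.array([658, 1304])      # overweight children: breast-fed, not breast-fed
n = np.array([5650, 7513])     # group sizes

p1, p2 = x / n
diff = p1 - p2

# z-test for a difference in proportions, unpooled standard error
se = np.sqrt(p1 * (1 - p1) / n[0] + p2 * (1 - p2) / n[1])
z = diff / se
print(f"difference = {diff:.4f}, z = {z:.2f}")   # expected: z ≈ -9.35

# 95% confidence interval for the difference in proportions
print("95% CI:", (diff - 1.96 * se, diff + 1.96 * se))
```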
26
Comparing Paired Samples
• the same issues must be addressed when deciding upon the
analysis method for a given set of paired data
• problem types are essentially the same, only in this case the
same individual has been measured twice
• before, we made assumptions about the distributions in the
separate groups, whereas here the assumptions relate to the
within-individual differences
27
Quantitative Data
• if the differences between the two samples follow a normal
distribution (possibly after transformation) then use a paired
t-test and compute a confidence interval to compare the two
means
• if the differences are not normal then use a Wilcoxon signed
rank test (the non-parametric equivalent of the paired t-test)
28
Example
• data below show two measurements of pulse rates in 15
patients
• each measurement was made by the same observer, under the
same circumstances, one minute apart
• objective of gathering this data was to determine if the 30
second pulse rates were the same both times
• since data are paired, appropriate test is the paired t-test
Pulse 1: 46 50 39 40 41 35 31 43 47 48 32 36 37 34 38
Pulse 2: 44 29 36 43 43 37 43 43 48 40 45 42 35 28 42
29
Stat > Basic Statistics > Paired t …
Paired T-Test and CI: Pulse 1, Pulse 2

Paired T for Pulse 1 - Pulse 2

             N   Mean   StDev  SE Mean
Pulse 1      15  39.80  5.94   1.53
Pulse 2      15  39.87  5.76   1.49
Difference   15  -0.07  8.20   2.12

95% CI for mean difference: (-4.61, 4.47)
T-Test of mean difference = 0 (vs not = 0): T-Value = -0.03  P-Value = 0.975
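A minimal Python sketch of the same paired analysis on the pulse data listed earlier; SciPy's `ttest_rel` performs the paired t-test on the within-subject differences.

```python
from scipy import stats

pulse1 = [46, 50, 39, 40, 41, 35, 31, 43, 47, 48, 32, 36, 37, 34, 38]
pulse2 = [44, 29, 36, 43, 43, 37, 43, 43, 48, 40, 45, 42, 35, 28, 42]

# paired t-test: tests whether the mean within-subject difference is zero
t, p = stats.ttest_rel(pulse1, pulse2)
print(f"t = {t:.2f}, p = {p:.3f}")   # expected: p ≈ 0.975
```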
30
Conclusion
• Paired t-test was performed since the differences were
normally distributed
• p-value from the test was 0.975
• this is not significant, therefore do not reject the null
hypothesis
• conclude that there is no evidence to suggest that there is a
significant difference in the average pulse rates on the two
occasions
• methodology applies to cross-over trials
31
Summary
• the set-up for a hypothesis test is always the same …
• determine the null and alternative hypotheses
• choose the appropriate test based on the type and distribution
of the data
• if the p-value is less than 0.05, reject the null hypothesis and
conclude that there is evidence to support the alternative
hypothesis
• if the p-value is not significant (i.e. >0.05), conclude there is no
evidence to reject the null hypothesis
32
Errors in Statistical Tests
• Type I Error: a false positive result
– the study finds a significant difference but that difference does not
really exist (i.e. reject the null hypothesis when it is true)
• Type II Error: a false negative result
– the study finds no significant difference between groups which are
in fact different (i.e. accept the null hypothesis when it is false)
33
Errors in Statistical Tests
• the conventional cut-off for significance is p<0.05
• i.e. accept a 1 in 20 chance that a Type I error may occur
• i.e. every time a statistical test is carried out, there is a 5%
chance of finding a significant result which does not really exist
(illustrated by the sketch below)
• may sometimes want to set a more stringent p-value (e.g.
p=0.01 if testing the effect of a very toxic therapy)
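A small simulation sketch (not from the slides) of the 1-in-20 false positive rate: both groups are drawn from the same population, so the null hypothesis is true and every ‘significant’ result is a Type I error.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, false_positives = 10_000, 0

for _ in range(n_tests):
    # both samples come from the same population, so the null hypothesis is true
    a = rng.normal(0, 1, 30)
    b = rng.normal(0, 1, 30)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

# roughly 5% of the tests are (falsely) significant at the 0.05 level
print("Type I error rate ≈", false_positives / n_tests)
```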
34
Confidence Intervals
• the sample mean is only an estimate of the population mean
• estimates depend on the sample from which they are
calculated
• a range of plausible values of the mean can be computed
• this gives an interval in which we can be relatively sure the true
population parameter value lies
• these intervals are known as confidence intervals
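A brief sketch of how such an interval is computed for a single mean, using the t distribution: mean ± t* × SE, where SE = s/√n (the bran A transit times are reused here purely as example data).

```python
import numpy as np
from scipy import stats

x = np.array([44, 51, 52, 55, 60, 62, 66, 68, 69, 71, 71, 76, 82, 91, 108])

mean = x.mean()
se = x.std(ddof=1) / np.sqrt(len(x))          # standard error of the mean
t_crit = stats.t.ppf(0.975, df=len(x) - 1)    # t critical value for 95% confidence

# 95% confidence interval for the population mean
print(f"mean = {mean:.1f}, 95% CI = ({mean - t_crit * se:.1f}, {mean + t_crit * se:.1f})")
```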
35
Example (cont.)
• part of the computer output from the t-test for the bran
example gave the 95% confidence interval for the mean
difference in transit times:
95% CI for mean A - mean B: (-28.9,-1.6)
• therefore we can be 95% sure that the true population mean
difference in transit time between these two bran preparations
lies within this interval
• i.e. we can be 95% confident that the mean transit time with
bran A is between 1.6 and 28.9 hours shorter than with bran B
36
Example
• Does playing music to dairy cattle increase their milk
production?
• An experiment was conducted where a group of dairy cattle
was divided into two groups. Music was played to one group;
the control group did not have music played. The average
increase in production was 2.5 l/cow over the time period in
question.
• A 95% confidence interval for the difference (treatment - control) in the mean production was computed to be (1.5, 3.5) l/cow.
• What does this mean?