More Statistics
Measures of uncertainty and small number issues
Fiona Reid
Acknowledgements
• This presentation has been adapted from
the original presentation provided by the
following contributors
• Mark Dancox
• Shelley Bradley
• Jacq Clarkson
Things to be covered
• Summarising data
– Common measures
– Correlation
• Common Distributions
• Measuring uncertainty
– Standard error
– Confidence Intervals
• Significance and p-values
• Small samples
Summary Statistics
Summarising data
• Impractical to look at every single piece of
data so need to use summary measures
• Need to reduce a lot of information into
compact measures
• Look at location and spread
How do you describe an elephant?
How big is it…..? LOCATION
How varied…? SPREAD
Measures of Location
• Mean
– Commonly used measure of location
– Is the sum of values divided by the number of values
• The sample mean is given by:
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}\,[\,x_1 + x_2 + \dots + x_n\,]$
• Can be drastically affected by unusual observations (called
‘outliers’) so it is not very robust.
• Excel function: average()
Measures of location
• Median: is a value for which 50% of the data lie above (or
below) “middle value”
– For an odd number of observations, the median is the
observation exactly in the middle of the ordered list
– For an even number of observations, the median is the
mean of the two middle observations in the ordered list
• Less sensitive to outliers and gives a ‘real’ value (unlike the
mean) but does ‘throw away’ a lot of information in the sample
• Excel function: Median()
• Mode: The mode is the most frequently occurring value
• Sometimes too simplistic and not always unique
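As an illustration, a minimal Python sketch of these three measures of location, using the standard-library statistics module and a small made-up sample:

import statistics

values = [2, 3, 3, 5, 7, 8, 41]           # made-up sample; 41 is an outlier

mean = statistics.mean(values)            # sum of values divided by the number of values
median = statistics.median(values)        # middle value of the ordered list
mode = statistics.mode(values)            # most frequently occurring value

print(mean)    # about 9.86, pulled upwards by the outlier 41
print(median)  # 5, barely affected by the outlier
print(mode)    # 3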
Measures of Spread
• Variance: the sum of the squared deviations of each sample value from the sample mean, divided by n − 1
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{X})^2$
• Excel function: Vara ()
• Standard deviation: is the square root of the
sample variance
• Excel function: Stdeva()
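A minimal Python sketch of the sample variance and standard deviation (n − 1 denominator), assuming numpy is available and using made-up data:

import numpy as np

x = np.array([4.0, 8.0, 6.0, 5.0, 3.0, 7.0])   # made-up sample

var_manual = ((x - x.mean()) ** 2).sum() / (len(x) - 1)   # the formula above
var_np = x.var(ddof=1)                                     # ddof=1 gives the n-1 denominator
sd = x.std(ddof=1)                                         # square root of the sample variance

print(var_manual, var_np, sd)   # the two variances agree; sd is the square root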
Measures of Spread
• We can describe the spread of a distribution by using
percentiles.
• The pth percentile of a distribution is the value such that
approximately p percent of the values are equal to or less
than that number.
• Excel function: percentile ()
• Quartiles divide data into four equal parts.
– First quartile (Q1) 25th Percentile
• 25% of observations are below Q1 and 75% above Q1
– Second quartile (Q2) 50th Percentile
• 50% of observations are below Q2 and 50% above Q2
– Third quartile (Q3) 75th Percentile
• 75% of observations are below Q3 and 25% above Q3
Measures of Spread
• Range: the difference between the largest
and smallest values
• Can be misleading if the data set contains
outliers.
• The interquartile range is the difference
between the third and the first quartiles in a
dataset (50% of the data lie in this range).
• Interquartile range more robust to outliers.
[Diagram: for approximately normal data, the interquartile range (Q1 to Q3) covers 50% of the values, the interval x̄ ± 2s covers about 95%, and the range covers 100%.]
If the data are approximately normal, we can use the mean and the standard deviation s to find intervals within which a given percentage of the data lie.
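A minimal Python sketch of percentiles, quartiles, the range and the interquartile range (numpy assumed; different percentile methods can give slightly different quartiles for small samples):

import numpy as np

x = np.array([12, 15, 17, 19, 22, 24, 25, 28, 31, 90])   # made-up data with one outlier (90)

q1, q2, q3 = np.percentile(x, [25, 50, 75])   # first quartile, median, third quartile
data_range = x.max() - x.min()                # misleading here because of the outlier
iqr = q3 - q1                                 # covers the middle 50% of the data

print(q1, q2, q3)        # e.g. 17.5, 23.0, 27.25 with numpy's default method
print(data_range, iqr)   # 78 versus a much smaller IQR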
Skewness
• Values in a distribution may not be spread evenly about the centre. This affects symmetry.
• Skewness is a measure of this (a)symmetry.
– If skewness = 0 the distribution is symmetrical
– If skewness > 0 the distribution has a longer right tail (some unusually large values)
– If skewness < 0 the distribution has a longer left tail (some unusually small values)
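A minimal Python sketch (scipy assumed, made-up samples) comparing skewness for a symmetric and a right-skewed sample:

from scipy.stats import skew

symmetric = [1, 2, 3, 4, 5, 6, 7]        # evenly spread about the centre
right_skewed = [1, 1, 2, 2, 3, 3, 20]    # a few unusually large values

print(skew(symmetric))     # 0.0 for a perfectly symmetric sample
print(skew(right_skewed))  # positive, reflecting the long right tail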
Some skewed distributions
[Histogram: count of persons killed or seriously injured in road traffic accidents, 2003-2005, all LAs; x-axis: rate of killed or seriously injured road traffic casualties per 100,000; y-axis: number of LAs.]
Rate of killed or seriuosly iinjured road traffic accidents, per 100,000
Some skewed distributions
Prevalence ofHistogram
diabetes
in ofall
Local
Authorities
2005/6
of prevalence
diabetes,
2005/06,
all LAs
90
80
70
Number of LAs
60
50
40
30
20
10
0
2.5
2.75
3
3.25
3.5
3.75
4
4.25
4.5
Prevalence of diabetes, percentage
4.75
5
5.25
5.5
5.75
Some skewed distributions
[Histogram: under-18 conception rate, 2003-2005, all LAs; y-axis: number of LAs.]
Relative positions of the mean and median for (a)
right-skewed, (b) symmetric, and
(c) left-skewed distributions
Note: the mean is only a good summary when the data are roughly symmetric (e.g. normally distributed). If this is not the case it is better to report the median as the measure of location.
Skewness
• The degree of skewness affects measures
of location.
• If no skew
– Mean = Median
• If skew > 0 (right or +ve skew)
– Mean > Median
• If skew < 0 (left or -ve skew)
– Mean < Median
Exercise 1
• Calculate some summary statistics for the
class size data in sheet one of the
exercises.
Correlation
• Correlation is a measure of association between two continuous variables
• Correlation is best visualised graphically, plotting one variable (Y) against the other (X):
Positive Correlation
• One variable increases with the other
Negative Correlation
• One variable increases as the other decreases
No Correlation
• Y neither increases nor decreases with X
Correlation coefficient
• Correlation coefficients measure strength of (linear)
association between continuous variables
• Pearson’s correlation coefficient r measures linear
association i.e. Do the points lie on a straight line?
• If the points form a perfect straight line, then we have
perfect correlation. The closer r is to 0, the weaker the
correlation
– r = 1: perfect positive correlation
– r = -1: perfect negative correlation
– r = 0: no correlation
$r = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{(n-1)\, s_x s_y}$
where $s_x$ and $s_y$ denote the standard deviations of x and y. Excel function: PEARSON()
Example
[Scatterplot panels: (a) r = +1, (b) r = -1, (c) r = 0.3, (d) r = 0, (e) r = -0.5, (f) r = 0.7]
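A minimal Python sketch (scipy and numpy assumed, made-up data) showing that the formula above agrees with scipy's Pearson coefficient:

import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # roughly a straight line, so r should be close to +1

n = len(x)
r_formula = (np.sum(x * y) - n * x.mean() * y.mean()) / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))
r_scipy, p_value = pearsonr(x, y)

print(r_formula, r_scipy)   # the two values agree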
Spearman’s rank correlation coefficient measures association whenever one or both variables are on an ordinal scale.
The association does not need to be linear: does one variable increase/decrease with the other?
$\rho = 1 - \frac{6\sum d_i^2}{n^3 - n}$
where d is the difference between the rank orderings of the data; $0 < \rho \le 1$ indicates positive correlation and $-1 \le \rho < 0$ indicates negative correlation. Not an inbuilt function in Excel.
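A minimal Python sketch (scipy assumed, made-up data) applying the rank formula above, which is valid when there are no tied ranks, alongside scipy's built-in version:

import numpy as np
from scipy.stats import spearmanr, rankdata

x = np.array([10, 20, 30, 40, 50])
y = np.array([1, 2, 4, 3, 6])             # increases with x, but not linearly

d = rankdata(x) - rankdata(y)             # differences between the rank orderings
n = len(x)
rho_formula = 1 - 6 * np.sum(d ** 2) / (n ** 3 - n)   # the formula above (no tied ranks here)
rho_scipy, p_value = spearmanr(x, y)

print(rho_formula, rho_scipy)   # both give 0.9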
WARNING
Spurious correlations can arise from:
• Change of direction of association
• Subgroups
• Outliers
Exercise
• Produce scatterplots of life expectancy for
both deprivation measures.
• Calculate the correlation coefficient for LE
and the deprivation measures
• Which measure of deprivation shows the strongest association with life expectancy? Is this the same for both men and women?
Ecological Fallacy
• When correlations based on grouped data are
incorrectly assumed to hold for individuals.
• E.g. investigating the relationship between food
consumption and cancer risk.
• One way to begin such an investigation would
be to look at data on the country level and
construct a plot of overall cancer risk against per
capita daily caloric intake.
• But it is people, not countries, who get cancer.
• It could be that within countries those who eat
more are less likely to develop cancer.
On the country level, per capita food intake may
just be an indicator of overall wealth and
industrialization.
The ecological fallacy was in studying countries
when one should have been studying people.
Distributions for Public Health
Analysts
Types of distributions
• Normal distribution
– Used for continuous measures such as height,
weight, blood pressure
• Poisson distribution
– Used for discrete counts of things: violent crimes,
number of serious accidents, number of horse kicks
• Binomial distribution
– Used to analyse data where the response is a discrete count of a category: success/failure, response/non-response
Normal Distribution
• Distribution of natural phenomena
• Continuous
• Family of distributions with the same shape
• Area under the curve is the same (= 1)
• Symmetrical
• Defined by mean (μ) and standard deviation (σ)
• Widely assumed for statistical inference
Normal Distribution, changes in mean (μ)
[Plot: normal probability density curves with σ = 1 and μ = 0, 0.7 and 2.]
Keeping the standard deviation constant, changing the mean of a distribution moves the distribution to the left or right…
Normal Distribution, changes in standard deviation (σ)
[Plot: normal probability density curves with μ = 0 and σ = 0.3, 0.6 and 1.]
Keeping the mean constant but changing the standard deviation affects the ‘narrowness’ of the curve…
Distribution of values
[Plots: for a normal distribution, about 68% of values lie within one standard deviation of the mean (16% in each tail), and about 95% lie within two standard deviations (2.5% in each tail).]
Poisson
• A discrete distribution taking on the values
X= 0,1,2,3,…
• Often used to model the number of events
occurring in a defined period of time.
• Determined by a single parameter,
lambda (λ), which is the mean of the
process.
• Shape of distribution changes with the
value of λ
Example of Poisson distribution
[Bar chart: Poisson distribution with mean = 5; probability of 0 to 15 events.]
Binomial Distribution
• Used to analyse discrete counts
• Used when we are interested in a count expressed as a proportion of a total sample size.
– “the proportion of brown-eyed persons in a
building”
• Defined by the probability of an outcome
(p) and the sample size (n)
Binomial Distribution
[Bar chart: binomial distribution with n = 10, p = 0.5; probability of 0 to 10 successes.]
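A minimal Python sketch (scipy assumed) of the probabilities behind the two charts above, the Poisson with mean 5 and the binomial with n = 10 and p = 0.5:

from scipy.stats import poisson, binom

# Poisson: probability of k events when the mean number of events is 5
for k in range(6):
    print(k, poisson.pmf(k, mu=5))

# Binomial: probability of k successes out of n = 10 trials with p = 0.5
for k in range(11):
    print(k, binom.pmf(k, n=10, p=0.5))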
Which distribution would best
describe?
• Number of abortions by gestational age
• Percentage of patients with diabetes
mellitus treated with ACE inhibitor therapy
for Acute sickness
• Number of Adults on prescribed
medication
• Proportion of Adults who are overweight
• Average weekly alcohol units consumed
Choice of distribution
• No hard and fast rules about which
distribution should be used.
• If the sample size is big enough, the choice of distribution may be less important
– “everything tends to normality”
– Normal distribution will be a good
approximation to Poisson and Binomial
distributions given big enough sample.
Standard Error
• Summary statistics – such as the mean – are based on samples
• Different samples from the same
population give rise to different values
• This variation is quantified by the standard
error
An example….
Population of nine values:
68 102 51
46 69 35
114 171 130
Population mean = 87.33
An example….continued….
Different samples of four values drawn from this population give different sample means:
Sample 1 mean = 71.25
Sample 2 mean = 101.25
Sample 3 mean = 100
Standard Errors for some common
distributions
• Normal distribution
$se = \frac{s}{\sqrt{n}}$, where s = standard deviation, n = sample size
• Poisson distribution
$se = \sqrt{\text{mean}}$
• Binomial distribution
$se(p) = \sqrt{\frac{p(1-p)}{n}}$
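A minimal Python sketch of these three standard errors, with made-up inputs:

import math

# Normal: standard error of a sample mean
s, n = 12.7, 43
se_mean = s / math.sqrt(n)

# Poisson: standard error of a count
count = 25
se_count = math.sqrt(count)

# Binomial: standard error of a proportion
p, n_trials = 0.20, 400
se_prop = math.sqrt(p * (1 - p) / n_trials)

print(se_mean, se_count, se_prop)   # roughly 1.94, 5.0, 0.02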
Confidence Intervals
• How to calculate
• How to interpret
Confidence Intervals
• Summary statistics are point estimates
based on samples
• Confidence Intervals quantify the degree
of uncertainty in these estimates
• Quoted as a lower limit and an upper limit
which provide a range of values within
which the population value is likely to lie
Calculating Confidence Intervals
• General form of any 95% C.I.:
Point Estimate ± 1.96*(Estimated SE)
• For 99% CIs we use 2.58
• For 90% CIs we use 1.645
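An illustrative Python helper (scipy assumed; the function name is just for this sketch) for the general form, where the multiplier comes from the normal quantile for the chosen confidence level:

from scipy.stats import norm

def confidence_interval(point_estimate, se, level=0.95):
    """Point estimate +/- z * SE, where z is the normal quantile for the level."""
    z = norm.ppf(1 - (1 - level) / 2)   # 1.96 for 95%, ~2.58 for 99%, ~1.645 for 90%
    return point_estimate - z * se, point_estimate + z * se

print(confidence_interval(81.4, 1.94))              # 95% CI
print(confidence_interval(81.4, 1.94, level=0.99))  # the 99% CI is wider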
Interpretation
• A 95% Confidence Interval is a random interval, such that in repeated sampling 95 out of every 100 intervals succeed in covering the parameter
• Loose interpretation
– “95% chance true value inside interval”
[Diagram: confidence intervals from many repeated samples plotted against the true value; about 5% of the intervals fail to cover it.]
Interpretation of confidence
intervals
• Non-overlapping intervals are indicative of real differences
• Overlapping intervals need to be
considered with caution
• Need to be careful about using confidence
intervals as a means of testing.
• The smaller the sample size, the wider the
confidence interval
Example
If the mean weight (kg) for a given sample
of 43 men aged 55 is 81.4kg and the
standard deviation is 12.7 kg….
Then,
A 95% confidence interval is given by
81.4 − 1.96 × (s/√n), 81.4 + 1.96 × (s/√n)
= 81.4 ± 1.96 × (12.7/√43)
which evaluates to (77.6, 85.2) kg
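The same calculation as a short Python check:

import math

mean_weight = 81.4   # kg
sd = 12.7            # kg
n = 43

se = sd / math.sqrt(n)
lower = mean_weight - 1.96 * se
upper = mean_weight + 1.96 * se

print(round(lower, 1), round(upper, 1))   # 77.6 85.2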
Exercise 2
Using the CI calculator provided
1.Calculate the 95%CI for the mean class
size from exercise 1.
2.If 20% in a sample of 400 are smokers,
calculate a 95% confidence interval
around this proportion
Exercise 3, which areas are significantly higher than England?
[Bar chart: life expectancy at birth (years), 2002-04, for local authorities in the South West (Bristol, Plymouth, Gloucester, Swindon, Penwith, Torbay, … , East Dorset), male and female, with the England male and female values shown for comparison. Source: ONS 2004]
Measuring uncertainty
Types of hypotheses
• Null Hypothesis (H0)
– The hypothesis under consideration
– “there is no difference between groups”
– The accused is innocent
• Alternative Hypothesis (Ha)
– The hypothesis we accept if we reject the null
hypothesis
– “there is a difference between groups”
– Or the accused is guilty
Hypothesis Testing
• Inferences about a population are often
based upon a sample.
• Want to be able to use sample statistics to
draw conclusions about the underlying
population values
• Hypothesis testing provides some criteria
for reaching these conclusions
General principles of hypothesis
testing
• Formulate null (H0) and alternative (Ha)
hypotheses (simple or composite)
• Choose test statistic
• Decide on rule for choosing between the
null and alternative hypotheses
• Calculate test statistic and compare
against the decision rule.
Illustration of acceptance regions
Principles of Testing
0.45
0.4
0.3
0.25
0.2
0.15
reject null
hypothesis
reject null
hypothesis
0.1
accept null hypothesis
0.05
μ0
3
2.77
2.55
2.32
2.1
1.9
1.67
1.45
1.22
1
0.77
0.55
0.32
0.1
-0.1
-0.3
-0.6
-0.8
-1
-1.2
-1.5
-1.7
-1.9
-2.1
-2.3
-2.6
-2.8
0
-3
Probability density
0.35
Significance Levels
• Used as the criterion to accept or reject H0
• A P-value < 0.05 (or 0.01) indicates that the observed result would be unlikely if H0 were true
• Usually 5% or 1%
• Chosen a priori
P-values
• Criteria to judge statistical significance of results. Quoted as values between 0 and 1
• The probability of a result at least as extreme as that observed, assuming H0 is true
• Values less than 0.05 (or 0.01) indicate an observation that is unlikely under the assumption that H0 is true
Illustration of P-value under H0
[Plot: the distribution assumed under H0, with the observed value marked; the P-value is the probability of a value at least as extreme as that observed, i.e. the tail area beyond it.]
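A minimal Python sketch of this idea (scipy assumed), using a one-sample z-test with the earlier weight example and an illustrative null value of 80 kg:

import math
from scipy.stats import norm

mu0 = 80.0                   # value of the mean under the null hypothesis (illustrative)
xbar, s, n = 81.4, 12.7, 43  # observed sample mean, standard deviation and size

z = (xbar - mu0) / (s / math.sqrt(n))   # how extreme the observed mean is under H0
p_value = 2 * norm.sf(abs(z))           # two-sided tail area beyond the observed value

print(z, p_value)   # p is well above 0.05 here, so we would not reject H0 at the 5% level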
Sample size
• Results may indicate no difference
between groups
• This may be because there is truly no
difference between groups or because
there was an insufficiently large sample
size for this to be detected
Determining Sample Size
• Choice of sample size depends on:
– Anticipated size of effect/ required precision
– Variability in measurement
– Power
– Significance levels
Sample size formula
• It is possible to combine information on variability, significance and power with the size of the effect we are trying to detect.
Example from clinical trials:
$N = \frac{2\sigma^2 (z_{\alpha/2} + z_\beta)^2}{(\mu_0 - \mu_a)^2}$
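A minimal Python sketch of this formula (scipy assumed; the inputs are illustrative, and N is commonly read as the required size of each group when comparing two means):

from scipy.stats import norm

sigma = 10.0              # anticipated standard deviation of the measurement
mu0, mua = 50.0, 45.0     # means under the null and alternative (effect size of 5)
alpha, power = 0.05, 0.90

z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for a two-sided 5% significance level
z_beta = norm.ppf(power)            # about 1.28 for 90% power

N = 2 * sigma**2 * (z_alpha + z_beta)**2 / (mu0 - mua)**2
print(N)   # about 84, i.e. round up to 85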
Small samples
• The smaller the sample, the higher the degree of uncertainty in results.
• Increased variability in small samples
• Confidence Intervals for estimates are
wider
• Low numbers may affect the calculation of
directly standardised rates (for instance)
• Distribution assumptions may be affected.
Dealing with small numbers
• Confidentiality can be an issue
• Can combine several years of data
– Mortality pooled over several years for rare
conditions
• Suicide, infant mortality, cancers in the young
• Combine counts across categories of data
– Low cell counts in cross-classifications of the
data
• Exact methods may be needed.
Problems with small numbers
[Chart: trends in directly standardised rates (DSR per 100,000) of deaths from accidents, 1993-2005, for England, a region and a PCT, together with the observed numbers of deaths. The England rate is based on several hundred deaths a year and changes smoothly; the PCT rate is based on only a handful of deaths a year (between 0 and 7) and fluctuates widely, from 0 to over 12 per 100,000.]
Finding out more: APHO
http://www.apho.org.uk/resource/item.aspx?RID=48457
Finding out more
• Lots of useful information can be found at
the HealthKnowledge website…
Finding out more
• The NCHOD website also contains useful
information on methodology…
http://www.nchod.nhs.uk/
Finding out more
• Some further references of interest:
– Bland, M. Introduction to Medical Statistics.
Third Edition. Oxford University Press, 2000.
– Hennekens CH, Buring JE. Epidemiology in
Medicine, Lippincott Williams & Wilkins, 1987.
– Larson, H.J. Introduction to Probability Theory and Statistical Inference. Third Edition. Wiley, 1982.