university of copenhagen
Faculty of Health Sciences
Basic statistical concepts
Susanne Rosthøj
Section of Biostatistics
Institute of Public Health
University of Copenhagen
Department of Biostatistics
Structure of teaching sessions
[Diagram: elements shown for one topic: teaching session (twice), introduction to training activities, videos, reading, training exercises, automated feedback, discussions, collective feedback]
Why all these exercises?
Learning R (and statistics) requires training!
Gains of training:
• focus on statistical rather than technical issues
• increase learning
Several hours of studying and training are necessary before each
session.
You need to complete 80% of the tests for the training exercises
(you don’t need 80% correct answers!).
Outline day 1
• Discussion of the training activities
• Descriptive statistics
• Inferential statistics
• Means and confidence intervals
• Test of hypotheses
Applied statistics
The basis for statistics is data / observations observed with
random variation.
We want to quantify the variation in the observations due to
• systematic variation
• random variation due to factors we cannot control
We need to
• summarize many observations as simply as possible
• quantify that conclusions based on many observations are
more precise than conclusions based on few observations
Statistical approaches
Descriptive statistics :
• Summarizing observations
• Typically represented
• graphically
• in tables
• as summary statistics (single values)
Inferential statistics :
• Procedures allowing us to infer / generalize / conclude on
observations
• Typically based on
• models, confidence intervals, hypotheses, tests
• need mathematical results and assumptions
Descriptive statistics
Descriptive statistics is the discipline of quantitatively
describing the main features of a collection of data.
Descriptive statistics tell us about the distribution of data points
in a data set.
Quantitative data are summarized by :
Median, range, quartiles, inter quartile range (IQR), mean
and standard deviation.
Graphics: Histograms and scatter plots.
Categorical data are summarized by :
Tables, proportions
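As a rough illustration of the summaries listed above, the following R sketch computes them for a small made-up data set. The vectors `weight` and `sex` are hypothetical and not part of the course data.

```r
# Hypothetical data, for illustration only (not from the slides)
weight <- c(61.2, 70.5, 58.9, 82.1, 75.0, 66.3, 90.4, 68.8)  # quantitative
sex    <- c("F", "M", "F", "M", "M", "F", "M", "F")          # categorical

# Quantitative data: median, range, quartiles, IQR, mean and standard deviation
summary(weight)   # min, quartiles, median, mean, max
range(weight)
IQR(weight)
sd(weight)

# Categorical data: table of counts and proportions
counts <- table(sex)
props  <- prop.table(counts)
counts
props
```

A histogram could be drawn with hist(weight) in an interactive session.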
Small exercise (1/2)
Data:
$X = (X_1, \ldots, X_n)$, $n = 10$
X = (50, 52, 56, 61, 64, 71, 72, 73, 75, 79)
The k'th percentile is the point below which k% of the values of
the distribution lie:
k'th percentile = the $\frac{k \times (n + 1)}{100}$'th value of the ordered observations.
By hand: Find the median (k=50) and the lower and upper quartiles (k=25, 75).
• median (middle or $\frac{50 \times (n+1)}{100}$'th observation):
• min: , max: , range:
• lower and upper quartiles:
• IQR (Inter Quartile Range):
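One way to check a hand calculation is to write the percentile rule out in R. This is only a sketch: the helper `percentile_by_rank` is a hypothetical name, and the linear interpolation between neighbouring observations (used when the position is not a whole number) is an assumption, since the slide does not say how to handle fractional positions.

```r
x <- c(50, 52, 56, 61, 64, 71, 72, 73, 75, 79)

# Hypothetical helper implementing the rule on the slide:
# the k'th percentile is the (k * (n + 1) / 100)'th ordered value.
# Assumption: interpolate linearly when that position is not a whole number.
percentile_by_rank <- function(x, k) {
  xs  <- sort(x)
  n   <- length(xs)
  pos <- k * (n + 1) / 100
  pos <- min(max(pos, 1), n)   # keep the position inside 1..n
  lo  <- floor(pos)
  hi  <- ceiling(pos)
  xs[lo] + (pos - lo) * (xs[hi] - xs[lo])
}

med <- percentile_by_rank(x, 50)
q1  <- percentile_by_rank(x, 25)
q3  <- percentile_by_rank(x, 75)

c(median = med, lower = q1, upper = q3, IQR = q3 - q1)
c(min = min(x), max = max(x), range = max(x) - min(x))
```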
Small exercise (2/2)
Now find median and inter quartiles using R:
First enter the data in a vector named x in R.
x <- c(50,52,56,61,64,71,72,73,75,79)
Experiment with the commands and explain output from
median(x)
quantile(x)
quantile(x, type=6)
How can you ask R to find the 20% quantile?
Further determine
• mean = $\bar{X} = \frac{1}{10} \sum_{i=1}^{10} X_i =$
• standard deviation = sd = $\sqrt{\frac{1}{10-1} \sum_{i=1}^{10} (X_i - \bar{X})^2} =$
Use google to find out how to determine the standard
deviation in R.
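The commands in this exercise could be explored along the following lines. This is a sketch rather than a model answer: the variable names `x_bar` and `sd_manual` are made up here, and the comments describe R's documented defaults for `quantile()`.

```r
x <- c(50, 52, 56, 61, 64, 71, 72, 73, 75, 79)

median(x)                   # middle of the ordered observations
quantile(x)                 # R's default interpolation rule (type = 7)
quantile(x, type = 6)       # uses positions p * (n + 1)
quantile(x, probs = 0.20)   # one way to ask for the 20% quantile

# Mean and standard deviation written out from the formulas
n         <- length(x)
x_bar     <- sum(x) / n
sd_manual <- sqrt(sum((x - x_bar)^2) / (n - 1))

c(mean = x_bar, sd = sd_manual)

# Cross-check against the built-in functions
all.equal(x_bar, mean(x))
all.equal(sd_manual, sd(x))
```

The two `quantile()` calls differ only in how they interpolate between ordered observations, which is why the quartiles need not agree for a sample this small.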
Inferential statistics
What can we say about the rest of the world using the
observations we have seen?
[Diagram: Population, Sample, Estimates]
Assumption: Unknown parameters describe the population.
Sample estimates are guesses of population parameters.
The normal distribution
The normal distribution is the most important distribution for
describing continuous variables.
Examples:
• Body temperature
• Hemoglobin level
• Weight
It is widely used in statistical inference because
• it has many mathematically convenient properties
• the Central Limit Theorem :
The average of a sufficiently large number of independent
variables with the same distribution will be approximately
normally distributed.
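The Central Limit Theorem can be illustrated by simulation. The sketch below averages draws from a clearly skewed distribution; the choice of the exponential distribution, the sample size of 50 and the number of replications are all assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n_obs <- 50      # observations per average (assumed value)
n_rep <- 10000   # number of simulated averages (assumed value)

# Each row is one sample from a skewed distribution (exponential, mean 1, SD 1)
samples  <- matrix(rexp(n_obs * n_rep, rate = 1), nrow = n_rep)
averages <- rowMeans(samples)

mean(averages)   # close to 1
sd(averages)     # close to 1 / sqrt(n_obs)

# Share of averages within 1.96 standard errors of the true mean;
# roughly 0.95 if the averages are approximately normal
within <- mean(abs(averages - 1) < 1.96 / sqrt(n_obs))
within

# In an interactive session: hist(averages, breaks = 50)
```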
The 95% reference interval
Hemoglobin levels for adult women are normally distributed with a mean of
14 g/100ml and an SD of 1 g/100ml.
Reference range / normal range: µ ± 1.96 · SD
[Figure: normal density of hemoglobin; x-axis: Hemoglobin, 11 to 17; y-axis: Density]
Suppose we measure the Hgb in a group of women. What happens
to the SD of the sample distribution as n is increased?
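The reference range for the hemoglobin example can be computed directly in R. A small sketch, using the mean and SD stated on the slide; the variable names are made up here.

```r
mu    <- 14   # mean hemoglobin, g/100ml
sigma <- 1    # SD, g/100ml

z         <- qnorm(0.975)                 # about 1.96
ref_range <- mu + c(-1, 1) * z * sigma    # lower and upper limit
ref_range

# Proportion of women expected to fall inside the reference range
inside <- pnorm(ref_range[2], mean = mu, sd = sigma) -
          pnorm(ref_range[1], mean = mu, sd = sigma)
inside
```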
Mean and standard deviation of an average
The mean of an average equals the mean of the variable we
are averaging:
$\text{mean}(\bar{X}) = \text{mean}(X) \; (= \mu)$.
The standard deviation of the average is the standard
deviation of the variable divided by the square root of the number of
observations:
$SD(\bar{X}) = \frac{SD(X)}{\sqrt{n}} \; \left(= \frac{\sigma}{\sqrt{n}}\right)$
This is termed the standard error of the mean (SE or SEM)
and measures the amount of variation in averages of size n.
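The relation between the SD of a variable and the SD of its average can be checked by simulation. The sketch below assumes normally distributed observations with the hemoglobin values µ = 14 and σ = 1; the helper `sd_of_average`, the sample sizes and the number of replications are hypothetical choices.

```r
set.seed(2)  # arbitrary seed so the sketch is reproducible

mu    <- 14
sigma <- 1

# Hypothetical helper: SD of many simulated averages of size n
sd_of_average <- function(n, n_rep = 5000) {
  averages <- replicate(n_rep, mean(rnorm(n, mean = mu, sd = sigma)))
  sd(averages)
}

sizes     <- c(4, 16, 64)
simulated <- sapply(sizes, sd_of_average)
theory    <- sigma / sqrt(sizes)   # the standard error of the mean

data.frame(n = sizes, simulated = simulated, theory = theory)
```

Quadrupling n halves the standard error, while the SD of the individual observations stays at σ.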
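The distribution of the sample mean ($\bar{X}$)
[Figure: normal density of $\bar{X}$ centred at µ; the central 95% lies between $\mu - 1.96 \frac{\sigma}{\sqrt{n}}$ and $\mu + 1.96 \frac{\sigma}{\sqrt{n}}$, with 2.5% in each tail]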
The 95% confidence interval
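The 95% confidence interval (CI) for µ:
$\bar{X} \pm 1.96 \cdot SE$
[Figure: density of $\bar{X}$ with the central 95% between $\mu - 1.96 \frac{\sigma}{\sqrt{n}}$ and $\mu + 1.96 \frac{\sigma}{\sqrt{n}}$ and 2.5% in each tail; two example values of $\bar{X}$ are marked]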
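For a concrete sample, the interval could be computed as in the sketch below. The sample `hgb` is simulated and purely hypothetical (100 values with mean 14 and SD 1), and the standard error is estimated from the sample SD.

```r
set.seed(3)  # arbitrary seed so the sketch is reproducible

# Hypothetical sample: 100 simulated hemoglobin values
hgb <- rnorm(100, mean = 14, sd = 1)

n     <- length(hgb)
x_bar <- mean(hgb)
se    <- sd(hgb) / sqrt(n)   # estimated standard error of the mean

ci <- x_bar + c(-1, 1) * 1.96 * se   # lower and upper limit
ci
```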
Understanding confidence intervals
The population mean µ is a fixed unknown number: it is the
confidence interval that will vary between samples.
[Figure: Mean and 95% confidence interval for 20 samples of size 100 from a population with mean 24.2 and s.d. 5.9; x-axis: Sample (1 to 20); y-axis from 21 to 27. The sample means vary around the population mean µ. One of the twenty 95% C.I.s does not contain µ.]
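The figure can be imitated by simulation. The sketch below uses the stated population mean 24.2 and s.d. 5.9 with samples of size 100, but assumes a normal population, which the slide does not specify; the helper `one_ci` is a hypothetical name.

```r
set.seed(4)  # arbitrary seed so the sketch is reproducible

mu    <- 24.2
sigma <- 5.9
n     <- 100

# Hypothetical helper: draw one sample and return its 95% CI
one_ci <- function() {
  s  <- rnorm(n, mean = mu, sd = sigma)   # assumption: normal population
  se <- sd(s) / sqrt(n)
  mean(s) + c(-1, 1) * 1.96 * se
}

# 20 intervals, as in the figure
cis_20    <- t(replicate(20, one_ci()))
covers_20 <- cis_20[, 1] < mu & mu < cis_20[, 2]
sum(covers_20)   # how many of the 20 intervals contain mu

# Many intervals: the long-run proportion should be near 95%
cis_many <- t(replicate(5000, one_ci()))
coverage <- mean(cis_many[, 1] < mu & mu < cis_many[, 2])
coverage
```

The number of intervals out of 20 that miss µ varies from run to run; only the long-run proportion is fixed at about 5%.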
Interpretation of CI
A 95% CI ranges from 3 to 4.
How many of the following statements are true?
A. The probability that the population mean is greater than 0 is at
least 95%.
B. The probability that the population mean equals 0 is smaller than
5%.
C. There is a 95% probability that the population mean lies
between 3 and 4.
D. We may be 95% confident that the interval (3;4) contains the
population mean.
E. If we were to repeat the experiment over and over, then 95% of
the time the population mean falls between 3 and 4.
Why do we need confidence intervals?
We want to estimate a parameter, e.g.
• the mean value for dice rolling
• the mean effect of a sleeping pill.
• the mean IQ
Based on a sample we suggest a qualified guess (estimate)
• we are uncertain about the guess and suggest an interval
of plausible values
• the interval has to be narrow
• we want a large probability (95%) of guessing right.
Small sample confidence intervals
For small samples (n ≤ 60) the CIs are better approximated by
the t-distribution with df = n − 1.
The 95% CI for µ is
$\bar{X} \pm t_{0.025,\,df=n-1} \cdot SE$
with $t_{0.025,\,df=n-1}$ being the lower 2.5% quantile of the t-distribution.
See a selection of quantiles in KS table A3 or calculate
the quantile in R with qt(0.025, df=n-1).
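As a sketch of the small-sample interval, the ten observations from the small exercise can be reused. The cross-check uses `t.test()`, whose `conf.int` component holds the 95% interval by default; the variable names are made up here.

```r
x <- c(50, 52, 56, 61, 64, 71, 72, 73, 75, 79)

n     <- length(x)
se    <- sd(x) / sqrt(n)
t_low <- qt(0.025, df = n - 1)   # lower 2.5% quantile (a negative number)

# mean + t_low * se is the lower limit, mean - t_low * se the upper limit
ci_manual <- mean(x) + c(1, -1) * t_low * se
ci_manual

# Cross-check against R's built-in one-sample t procedure
ci_builtin <- as.numeric(t.test(x)$conf.int)
ci_builtin
```

With only nine degrees of freedom the multiplier is about 2.26 rather than 1.96, so the interval is wider than the large-sample version.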
How to make conclusions based on data?
The purpose of most experiments is to prove or disprove a
hypothesis.
This is done by collecting data, analyzing it and drawing a
conclusion.
The original hypothesis is tested against the data to find out
whether or not it is right.
Example of a hypothesis
636 children from Peru had their lung capacity examined.
Response: FEV (Forced Expiratory Volume, L/1s).
Scientific question:
Do boys and girls have different lung capacity?
Hypothesis:
H0 : There is no difference in lung capacity for boys and girls.
We observe:
Girls : mean(FEV) = 1.54
Boys : mean(FEV) = 1.66.
Observed difference = 0.12. What can we conclude?
Examples of hypotheses
We always formulate hypotheses as no difference or no
association.
Investigation of a single population (one group):
• H0 : The mean is equal to a specific number
(e.g. mean FEV for boys is µ1 = 1.5)
HA : The mean is not equal to a specific number.
Comparison of two populations (two groups):
• H0 : The means are equal (i.e. µ1 − µ2 = 0)
HA : The means are not equal.
If there is sufficient evidence against the hypothesis, we reject H0.
Test statistics
We use test statistics to find evidence against the hypothesis.
Often test statistics are given by
$\frac{\text{estimate} - \text{hypothetical value}}{SD(\text{estimate})}$
We expect the test statistic to be
• numerically small if the hypothesis is true
• numerically large if the hypothesis is false.
Example : Lung capacity
Parameters:
µ1 : mean FEV for girls
µ2 : mean FEV for boys
Do boys and girls have different lung capacity?
Hypothesis:
H0 : µ2 = µ1.
µ2 − µ1 is the parameter we investigate.
0 is the hypothetical value.
Two sample t-test
Can be used when data arise from two groups, the variances
in the two groups are equal and all observations are
independent.
Hypothesis:
The population means in the two groups are equal
H0 : µ1 = µ2
against
HA : µ1 ≠ µ2
Summary data from the two groups:
Group 1 : $n_1$, $\bar{X}_1$, $sd_1$
Group 2 : $n_2$, $\bar{X}_2$, $sd_2$
Two sample t-test
First calculate a 'pooled' SD:
$SD = \sqrt{\frac{(n_1 - 1) SD_1^2 + (n_2 - 1) SD_2^2}{(n_1 - 1) + (n_2 - 1)}}$
Standard error of the difference in means ($\bar{X}_1 - \bar{X}_2$):
$SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \cdot SD$
Test statistic:
t = ??
If H0 is true, t ∼ t(df = n1 + n2 − 2).
Example : Lung capacity
        n     mean    SD
Girls   335   1.538   0.291
Boys    301   1.657   0.308
An estimate of the difference: $\bar{X}_2 - \bar{X}_1 = 0.119$.
Pooled SD assuming equal variances:
$SD = \sqrt{\frac{(335 - 1) \times 0.291^2 + (301 - 1) \times 0.308^2}{(335 - 1) + (301 - 1)}} = 0.299$
The test statistic
$T = \frac{0.119}{0.299 \times \sqrt{\frac{1}{335} + \frac{1}{301}}} = 5.01$.
Small or large???
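The calculation for the lung capacity example can be reproduced in R from the summary data alone. This is a sketch with made-up variable names; the difference is taken as boys minus girls, matching the estimate of 0.119.

```r
# Summary data from the slide
n1 <- 335; mean1 <- 1.538; sd1 <- 0.291   # girls
n2 <- 301; mean2 <- 1.657; sd2 <- 0.308   # boys

diff_means <- mean2 - mean1   # boys minus girls

# Pooled SD assuming equal variances in the two groups
sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) /
                  ((n1 - 1) + (n2 - 1)))

# Standard error of the difference in means, and the test statistic
se_diff <- sd_pooled * sqrt(1 / n1 + 1 / n2)
t_stat  <- diff_means / se_diff
df      <- n1 + n2 - 2

round(c(difference = diff_means, pooled_sd = sd_pooled,
        se = se_diff, t = t_stat, df = df), 3)
```

With the individual observations at hand, the same test would normally be run with `t.test()` and the argument `var.equal = TRUE`.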
P values
We use p values to assess the size of test statistics.
If the hypothesis is true and we replicate the sampling many
times:
How often will we obtain a test statistic numerically larger
than the observed test statistic?
The p-value
P(|test statistic| > |observed test statistic|)
is calculated assuming that the hypothesis is true.
A small p-value corresponds to the observed test statistic being
unlikely if the hypothesis is true.
Example : Lung capacity
Under the null, T ∼ t(df = 335 + 301 − 2).
P-value:
$P(|T| > 5.01) = P(T < -5.01) + P(T > 5.01)$
$= 2 \cdot 3.54 \times 10^{-7} = 7.09 \times 10^{-7}$
Conclusion:
The observed test statistic of 5.01 would be unlikely if there were no
difference between boys and girls with respect to lung capacity.
NB : If we instead use the normal distribution to calculate the p
value, the test is termed a Wald test.
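The two-sided p-value can be obtained from the t distribution with `pt()`, and the Wald version from the normal distribution with `pnorm()`. A minimal sketch, taking the observed test statistic 5.01 as given; the variable names are made up here.

```r
t_obs <- 5.01
df    <- 335 + 301 - 2

# Two-sided p-value from the t distribution: both tails beyond |t_obs|
p_t <- 2 * pt(-abs(t_obs), df = df)

# The same calculation with the normal distribution (Wald test)
p_wald <- 2 * pnorm(-abs(t_obs))

c(t_test = p_t, wald = p_wald)
```

With 634 degrees of freedom the t distribution is close to the normal, so both p-values are tiny and lead to the same conclusion.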
Interpretation of P values
The smaller the P-value, the lower the chance of getting a
difference as big as the one observed if the null hypothesis is
true.
Large P-value:
• The difference between observed and hypothetical value is
small compared to the statistical uncertainty.
• The observed difference may well be due to chance.
Small P-value:
• The difference between observed and hypothetical value is
large compared to the statistical uncertainty.
• A difference this large would be unlikely to arise by chance alone if the hypothesis were true.