Faculty of Life Sciences
Statistical inference
Statistical models, estimation and confidence intervals
Ib Skovgaard & Claus Ekstrøm
E-mail: [email protected]

Program
• Distribution of a sample mean
• Statistical inference for a single sample
  • statistical model
  • estimation and precision of estimates
  • the t-distribution
  • confidence intervals
• Statistical inference for linear regression

Slide 2 — Statistics for Life Science (Week 3-2 2010) — Statistical inference
The sample mean

Weights of crabs:
• We have: a sample of n = 162 weights: y1 , . . . , y162 .
• Wanted: the mean weight in the population — µ
• Sample statistics: ȳ = 12.76 and s = 2.25.
• Estimate of µ is µ̂ = ȳ = 12.76
• But how precise is it? How large can we expect µ̂ − µ to be?

To answer this we make a confidence interval for µ. This requires a statistical model.

Slide 3 — Statistics for Life Science (Week 3-2 2010) — Statistical inference

Distribution of a sample mean

[Figure: histograms of the sample mean of n independent N(0, 1) variables, for n = 10 and n = 25.]

Mean? — Standard deviation? — Distribution?

Slide 4 — Statistics for Life Science (Week 3-2 2010) — Statistical inference
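The two histograms can be reproduced by simulation. A minimal Python sketch (the function name is mine): draw many samples of n independent N(0, 1) variables and check that the standard deviation of the sample mean is close to 1/√n.

```python
import random
import statistics

random.seed(1)

def sample_mean_sd(n, reps=20000):
    """Empirical standard deviation of `reps` simulated sample means,
    each the mean of n independent N(0, 1) variables."""
    means = [statistics.fmean(random.gauss(0.0, 1.0) for _ in range(n))
             for _ in range(reps)]
    return statistics.stdev(means)

sd10 = sample_mean_sd(10)   # theory: 1/sqrt(10) ≈ 0.316
sd25 = sample_mean_sd(25)   # theory: 1/sqrt(25) = 0.200
```

As n grows the histogram of the mean narrows at the rate 1/√n, which is exactly what the next slides compute analytically.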
Distribution of a sample mean

In practice we only observe one sample mean, so how can we find its distribution?
• Answer: mathematical computation!
• A mean of n independent N(µ, σ²)-variables is normal with mean µ and standard deviation σ/√n
• . . . and σ can be estimated from the sample.

Slide 5 — Statistics for Life Science (Week 3-2 2010) — Statistical inference

Statistical model

[Figure: histogram of the 162 crab weights with a fitted normal density, and a QQ-plot of sample quantiles against theoretical quantiles.]

Statistical model:
y1 , . . . , y162 are independent and yi ∼ N(µ, σ²)

In words: the observations are normally distributed, have the same mean and the same standard deviation, and are independent.

Slide 6 — Statistics for Life Science (Week 3-2 2010) — Statistical inference

Estimation

Statistical model:
y1 , . . . , y162 ∼ N(µ, σ²), independent

Parameters in the model:
• mean µ — in the population
• standard deviation σ — in the population

Estimation: the population parameters are estimated by the sample statistics:
• µ̂ = ȳ
• σ̂ = s

Slide 7 — Statistics for Life Science (Week 3-2 2010) — Statistical inference

Precision of µ̂

The estimate µ̂ by itself tells nothing about its precision. But we know that
• sd(ȳ) = σ/√n
• ȳ is within µ ± 1.96 · σ/√n with 95% probability.

But we don't know σ, just the estimate, s.
• Standard error of ȳ — the estimated standard deviation: SE(ȳ) = s/√n
• ȳ is within µ ± ??? · s/√n with probability 95%.

Slide 8 — Statistics for Life Science (Week 3-2 2010) — Statistical inference
The t-distribution

Standardization:
z = √n (ȳ − µ) / σ ∼ N(0, 1)

When the estimate, s, of σ is inserted, the distribution changes from a normal distribution to a t-distribution:
T = √n (ȳ − µ) / s ∼ tn−1

[Figure: densities of the t-distributions with df = 1 and 4, together with N(0, 1).]

The t-distribution with n − 1 degrees of freedom:
• Thicker tails than N(0, 1)
• Resembles N(0, 1) more and more as df increases.

Slide 9 — Statistics for Life Science (Week 3-2 2010) — Statistical inference

Confidence interval for µ

If tn−1,0.975 is the 97.5%-quantile in the tn−1-distribution:
P( −tn−1,0.975 < √n (ȳ − µ) / s < tn−1,0.975 ) = 0.95

These two inequalities can be rearranged to give two inequalities for µ:
P( ȳ − tn−1,0.975 · s/√n < µ < ȳ + tn−1,0.975 · s/√n ) = 0.95

This interval contains the population mean, µ, with probability 95%. The interval is called a 95% confidence interval for µ.
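The construction can be sketched directly from the t-quantile (a Python sketch, assuming SciPy is available; the function name is hypothetical):

```python
from scipy import stats

def t_confidence_interval(ybar, s, n, level=0.95):
    """Interval ybar ± t_{n-1, 1-(1-level)/2} · s / sqrt(n)."""
    tq = stats.t.ppf(1 - (1 - level) / 2, df=n - 1)
    se = s / n ** 0.5
    return ybar - tq * se, ybar + tq * se

# With ybar = 0, s = 1, n = 25: t_{24,0.975} ≈ 2.064,
# so the half-width is about 2.064 · 0.2 ≈ 0.413.
lo, hi = t_confidence_interval(0.0, 1.0, 25)
```

`stats.t.ppf` plays the role of R's `qt`; for n = 162 it reproduces the quantile 1.974808 used on the next slide.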
Slide 10 — Statistics for Life Science (Week 3-2 2010) — Statistical inference

Confidence intervals: weights of crabs
Recall: n = 162, ȳ = 12.76 and s = 2.25.

Quantiles:
> qt(0.975,161)
[1] 1.974808
> qt(0.95,161)
[1] 1.654373

Compute
• Standard error, SE(µ̂)?
• 95% confidence interval?
• 90% confidence interval?

Slide 11 — Statistics for Life Science (Week 3-2 2010) — Statistical inference

Confidence intervals: interpretation

95%-confidence interval for µ:
ȳ ± tn−1,0.975 · s/√n = µ̂ ± tn−1,0.975 · SE(µ̂)

Interpretation: with probability 95%, the interval contains the population mean, µ.

What happens when the sample size, n, increases? Does the 95% confidence interval become wider or narrower?

Slide 12 — Statistics for Life Science (Week 3-2 2010) — Statistical inference
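The computations requested for the crab weights can be sketched as follows (Python, assuming SciPy; the inputs are the slide's n, ȳ and s):

```python
from math import sqrt
from scipy import stats

n, ybar, s = 162, 12.76, 2.25          # crab-weight sample statistics
se = s / sqrt(n)                       # standard error of the mean, about 0.177

intervals = {}
for level in (0.95, 0.90):
    tq = stats.t.ppf(1 - (1 - level) / 2, df=n - 1)
    intervals[level] = (ybar - tq * se, ybar + tq * se)

print(se)                  # about 0.177
print(intervals[0.95])     # roughly (12.41, 13.11)
print(intervals[0.90])     # roughly (12.47, 13.05)
```

Note that the 90% interval is narrower than the 95% interval: less confidence, tighter bounds.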
Confidence intervals: interpretation

If we repeated the experiment, then in the long run 95% of the confidence intervals would contain the population mean.

[Figure: confidence intervals for 50 data sets from N(0, 1); panels show 95% intervals with n = 10, 75% intervals with n = 10, and 95% intervals with n = 40, plotted around the true mean µ.]

Slide 13 — Statistics for Life Science (Week 3-2 2010) — Statistical inference

The central limit theorem

The main reason that the normal distribution is so important.

The central limit theorem: Assume that Y1 , . . . , Yn are independent random variables with the same distribution, with mean µ and standard deviation σ. Then their mean
Ȳ = (1/n) ∑ Yi
has a distribution which approaches the normal distribution as n increases, approximately Ȳ ∼ N(µ, σ²/n). More precisely,
P( (Ȳ − µ) / (σ/√n) ≤ z ) → Φ(z)

Hence, the confidence interval for the mean may be OK, even if the population is not normal.

Slide 14 — Statistics for Life Science (Week 3-2 2010) — Statistical inference

Summary: a single sample

• Statistical model: y1 , . . . , y162 independent and yi ∼ N(µ, σ²)
• Parameters, µ and σ: mean and standard deviation in the population.
• Estimates: µ̂ = ȳ and σ̂ = s
• Distribution of the estimate: µ̂ is normal with mean µ and standard deviation σ/√n
• Standard error is an estimate of the standard deviation of an estimate: SE(µ̂) = s/√n
• 95%-confidence interval: ȳ ± tn−1,0.975 · s/√n = µ̂ ± tn−1,0.975 · SE(µ̂)

Slide 15 — Statistics for Life Science (Week 3-2 2010) — Statistical inference
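The long-run interpretation of a confidence interval can be checked by simulation, in the spirit of the 50-data-set figure: generate many data sets, build a 95% t-interval for each, and count how often the true mean is covered (a Python sketch, assuming SciPy; variable names are mine).

```python
import random
from math import sqrt
from statistics import fmean, stdev
from scipy import stats

random.seed(2)
n, reps, mu = 10, 5000, 0.0
tq = stats.t.ppf(0.975, n - 1)         # 97.5% quantile of t_{n-1}

covered = 0
for _ in range(reps):
    y = [random.gauss(mu, 1.0) for _ in range(n)]
    ybar, s = fmean(y), stdev(y)
    half = tq * s / sqrt(n)            # half-width of the t-interval
    covered += ybar - half < mu < ybar + half

coverage = covered / reps
print(coverage)                        # close to 0.95
```

The empirical coverage fluctuates around 0.95; any single interval either contains µ or it does not, which is what the probability statement is really about.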
Statistical model and parameters

Statistical model: the deviations from the straight line are normally distributed and independent:
yi = α + β · xi + ei ,    e1 , . . . , en ∼ N(0, σ²) independent

In words: the mean of yi is α + β · xi , and the remainders (or residuals) are normal and independent with the same standard deviation.

Parameters (population constants)
• Intercept α and slope β
• Standard deviation σ for the deviations from the line

Slide 16 — Statistics for Life Science (Week 3-2 2010) — Statistical inference
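The parameters of this model are estimated by least squares. A self-contained Python sketch (the data are made up for illustration; the formulas for s, SE(β̂) and SE(α̂) are the standard ones with n − 2 degrees of freedom):

```python
from math import sqrt
from scipy import stats

# Hypothetical data for illustration (not the stearic acid values)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
ssx = sum((xi - xbar) ** 2 for xi in x)                  # SSx = sum (xi - x̄)²

# Least-squares estimates of slope and intercept
beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ssx
alpha = ybar - beta * xbar

# Residual standard deviation, using n - 2 degrees of freedom
s = sqrt(sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))

# Standard errors of the estimates
se_beta = s / sqrt(ssx)
se_alpha = s * sqrt(1 / n + xbar ** 2 / ssx)

# 95% confidence interval for the slope: t with n - 2 degrees of freedom
tq = stats.t.ppf(0.975, n - 2)
ci_beta = (beta - tq * se_beta, beta + tq * se_beta)
```

These are the same quantities R's `lm` reports in `summary`, as in the stearic acid example later in the deck.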
Estimates and distribution of the estimates

Estimates β̂ and α̂ were shown earlier (Chapter 2).

Estimate of the residual standard deviation:
s = √( (1/(n−2)) ∑ (yi − α̂ − β̂ · xi )² ) = √( (1/(n−2)) ∑ ri² )

β̂ and α̂ are normally distributed:
β̂ ∼ N( β , σ²/SSx ),    α̂ ∼ N( α , σ² (1/n + x̄²/SSx) ),
where SSx = ∑ (xi − x̄)².

The statistical experiment is an instrument that “measures” the values α and β with a precision given by the standard errors.

Slide 17 — Statistics for Life Science (Week 3-2 2010) — Statistical inference

Standard errors and confidence intervals

Standard errors — estimates of the standard deviations:
SE(β̂) = s/√SSx ,    SE(α̂) = s · √( 1/n + x̄²/SSx )

95% confidence intervals:
β̂ ± tn−2,0.975 · SE(β̂) ,    α̂ ± tn−2,0.975 · SE(α̂)

Note: the t-distribution with n − 2 degrees of freedom is used.

Slide 18 — Statistics for Life Science (Week 3-2 2010) — Statistical inference

Stearic acid example

> model1 = lm(digest~st.acid)
> summary(model1)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 96.53336    1.67518   57.63 1.24e-10 ***
st.acid     -0.93374    0.09262  -10.08 2.03e-05 ***

Residual standard error: 2.97 on 7 degrees of freedom

• Statistical model? Interpretation of the model?
• Estimates? Confidence intervals?

Slide 19 — Statistics for Life Science (Week 3-2 2010) — Statistical inference

Reflection: What is a statistical model?

• A statistical model describes the probability distribution of the population from which our sample is drawn.
• But how can we know that? We can’t, but a model is just a rough picture displaying the important features.
• Some of these features are not known. This is why we measure a sample.
• Therefore a statistical model is not complete; some aspects have to be estimated from the sample.
• These aspects may be given as a number of parameters, such as the mean and standard deviation.
• The remaining part of the model is assumed and should be validated as well as possible.

Without a model we have no basis for probability calculations.

Slide 20 — Statistics for Life Science (Week 3-2 2010) — Statistical inference
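From the stearic acid `summary(model1)` output, a 95% confidence interval for the slope can be recomputed by hand (a Python sketch, assuming SciPy; the numbers are read off the R output above):

```python
from scipy import stats

# Values read from the summary(model1) output for st.acid
beta_hat, se_beta = -0.93374, 0.09262
df = 7                                 # residual degrees of freedom

tq = stats.t.ppf(0.975, df)            # 97.5% quantile of t_7, about 2.365
ci = (beta_hat - tq * se_beta, beta_hat + tq * se_beta)
print(ci)                              # roughly (-1.15, -0.71)
```

The interval lies entirely below zero, consistent with the very small p-value for the slope.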
A typical statistical model

Many statistical models consist of two parts:
observation = fixed part + random part = predictable part + unpredictable part

Predictable means that it depends on factors we know (type of antibiotics, amount of stearic acid, age, treatment, etc.).

The random part is defined by the equation above as the remainder (or residual):
random part = observation − fixed part

The random part is often assumed to be normally distributed.

Slide 21 — Statistics for Life Science (Week 3-2 2010) — Statistical inference

Main points from this lecture

• Statistical model and parameters
• Estimates, distribution of estimates, standard error
• Confidence intervals: estimate ± t-quantile · SE(estimate), and their interpretation

Slide 22 — Statistics for Life Science (Week 3-2 2010) — Statistical inference