Mathematical Statistics
Motoya Machida
April 12, 2006
This material is designed to introduce basic concepts and fundamental theory of mathematical statistics. A review of basic concepts will include likelihood functions, sufficient statistics, and exponential families of distributions. Then point estimation will be discussed, including minimum variance unbiased estimates, the Cramér-Rao inequality, maximum likelihood estimates, and asymptotic theory. Topics in the general theory of statistical tests include the Neyman–Pearson theorem, uniformly most powerful tests, and likelihood ratio tests.
1 Point estimates
A random sample X1, . . . , Xn is regarded as independent and identically distributed (iid) random variables governed by an underlying probability density function f(x; θ). A value θ represents the characteristics of this underlying distribution, and is called a parameter. Suppose, for example, that the underlying distribution is the normal distribution with parameter (µ, σ²). Then the values µ and σ² are the parameters. Since X1, . . . , Xn are random variables, the sample mean X̄ also becomes a random variable. In general, a random variable u(X) constructed from the random vector X = (X1, . . . , Xn) is called a statistic. For example, the sample mean X̄ is a statistic.
A point estimate is a statistic u(X) which is a “best guess” for the true value θ. Suppose that the underlying distribution is the normal distribution with (µ, σ²). Then the sample mean X̄ is in some sense a best guess of the parameter µ.
Mean square-error. Let u(X) be a point estimate for θ. Then the functional R(θ, u) = E[(u(X) − θ)²] of u is called the mean square-error risk function. We can immediately observe that

R(θ, u) = Var(u(X)) + [E(u(X)) − θ]² = Var(u(X)) + [b(θ, u)]²,

where b(θ, u) = E(u(X)) − θ is called the bias of u(X).
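As a quick illustration of this decomposition, here is a small Monte Carlo sketch in Python (using NumPy; the parameter values are hypothetical). It estimates the risk of the biased variance estimate σ̂² directly and via the variance-plus-squared-bias decomposition.

```python
# Monte Carlo check of R(theta, u) = Var(u(X)) + [b(theta, u)]^2 for the
# estimate u(X) = sigma-hat^2 under a normal sample; values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 0.0, 2.0, 10, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
u = samples.var(axis=1)                               # sigma-hat^2 (divides by n)

mse = np.mean((u - sigma**2) ** 2)                    # direct estimate of the risk
decomposed = u.var() + (u.mean() - sigma**2) ** 2     # Var(u) + bias^2
print(mse, decomposed)                                # agree up to simulation error
```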
2 Maximum likelihood estimate
Having observed a random sample (X1, . . . , Xn) = (x1, . . . , xn) from an underlying pdf f(x; θ), we can construct the likelihood function

L(θ, x) = Π_{i=1}^n f(xi; θ),
and consider it as a function of θ. Then the maximum likelihood estimate (MLE) θ̂ is the value of θ which
“maximizes” the likelihood function L(θ, x). It is usually easier to maximize the log likelihood
ln L(θ, x) = Σ_{i=1}^n ln f(xi; θ).
2.1 Bernoulli trials
Let f(x; θ) = θ^x (1 − θ)^{1−x} be the Bernoulli frequency function with success probability θ. By solving the equation

∂ ln L(θ, x)/∂θ = (Σ_{i=1}^n xi)/θ − (n − Σ_{i=1}^n xi)/(1 − θ) = 0,

we obtain θ* = (1/n) Σ_{i=1}^n xi, which maximizes ln L(θ, x). Therefore, θ̂ = X̄ is the MLE of θ.
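The conclusion can be checked numerically; the following Python sketch (using SciPy, with a made-up sample) maximizes the Bernoulli log likelihood and compares the maximizer with the sample mean.

```python
# Numerical check that the Bernoulli MLE is the sample mean X-bar.
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1, 0, 1, 1, 0, 1, 0, 1])        # illustrative Bernoulli data

def neg_log_likelihood(theta):
    s = x.sum()
    return -(s * np.log(theta) + (len(x) - s) * np.log(1 - theta))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x, x.mean())                     # both are 0.625
```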
2.2 Normal distribution
Let X1, . . . , Xn be a random sample from a normal distribution with parameter (µ, σ²). Then the log likelihood function of the parameter (µ, σ²) is given by

ln L(µ, σ²) = −(n/2) ln σ² − (1/(2σ²)) Σ_{i=1}^n (xi − µ)² − (n/2) ln 2π.
By solving
∂ ln L(µ, σ²)/∂µ = (1/σ²) Σ_{i=1}^n (xi − µ) = 0;
∂ ln L(µ, σ²)/∂σ² = −n/(2σ²) + (1/(2(σ²)²)) Σ_{i=1}^n (xi − µ)² = 0,
we can obtain the MLEs µ̂ and σ̂² as follows:

µ̂ = (1/n) Σ_{i=1}^n Xi = X̄;    σ̂² = (1/n) Σ_{i=1}^n (Xi − µ̂)².
Although the MLE σ̂² of σ² is consistent, it should be noted that this point estimate σ̂² is not an unbiased one.
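A short sketch (illustrative data, using NumPy) computes both MLEs and shows the bias factor (n − 1)/n of σ̂² discussed in Section 3.1.

```python
# The normal MLEs mu-hat and sigma-hat^2, next to the unbiased sample variance.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=50)

mu_hat = x.mean()                  # MLE of mu
sigma2_hat = x.var()               # MLE of sigma^2 (ddof=0 divides by n)
s2 = x.var(ddof=1)                 # unbiased version divides by n - 1

print(mu_hat, sigma2_hat, s2)      # sigma2_hat = (n-1)/n * s2 < s2
```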
2.3 Poisson distribution
Let X1 , . . . , Xn be a random sample from a Poisson distribution with parameter λ. Then the log likelihood
function of parameter λ is given by
ln L(λ) = (ln λ) Σ_{i=1}^n xi − nλ − Σ_{i=1}^n ln(xi!).
By solving
d ln L(λ)/dλ = (1/λ) Σ_{i=1}^n xi − n = 0,
we can obtain the MLE

λ̂ = (1/n) Σ_{i=1}^n Xi = X̄.
2.4 Exercise
Let X1, . . . , Xn be a random sample from each of the following density functions f(x; θ). Find the MLE θ̂ of θ.

1. f(x; θ) = (1 + θ)x^θ, 0 ≤ x ≤ 1, where θ > −1.

2. f(x; θ) = θe^{−θx}, x ≥ 0, where θ > 0.

3. f(x; θ) = θL^θ x^{−θ−1}, x ≥ L, where L > 0 is given and θ > 1 (Pareto distribution).
2.5 Solutions to exercise
1. ln L(θ) = θ Σ_{i=1}^n ln xi + n ln(1 + θ). By solving (d/dθ) ln L(θ) = Σ_{i=1}^n ln xi + n/(1 + θ) = 0, we obtain θ̂ = −n / (Σ_{i=1}^n ln xi) − 1.

2. ln L(θ) = −θ Σ_{i=1}^n xi + n ln θ. By solving (d/dθ) ln L(θ) = −Σ_{i=1}^n xi + n/θ = 0, we obtain θ̂ = n / Σ_{i=1}^n xi.

3. ln L(θ) = −(θ + 1) Σ_{i=1}^n ln xi + n ln θ + nθ ln L. By solving (d/dθ) ln L(θ) = −Σ_{i=1}^n ln xi + n/θ + n ln L = 0, we obtain θ̂ = n / (Σ_{i=1}^n ln xi − n ln L).
3 Properties of MLE

3.1 Consistency
One of the important attributes of a point estimate is unbiasedness. Since a statistic θ̂ is a random variable, we can consider the expectation E(θ̂) of θ̂. Then the point estimate θ̂ of θ is unbiased if it satisfies E(θ̂) = θ. In the case of the normal distribution, the MLE X̄ for µ is unbiased since E(X̄) = µ. However, the MLE σ̂² for σ² is not unbiased, since

E[(1/n) Σ_{i=1}^n (Xi − X̄)²] = ((n − 1)/n) σ².

Note that the point estimate θ̂ of θ also depends on the sample size n. We say that θ̂ is consistent if θ̂ converges in probability to θ as n → ∞. For example, the above MLEs X̄ and σ̂² are both consistent by the weak law of large numbers. In general, the MLE is consistent under appropriate conditions.
3.2 Invariance
Suppose that h(θ) is a one-to-one function of θ. Then it is clearly seen that θ̂ is the MLE for θ if and only if
h(θ̂) is the MLE for h(θ). Even if h(θ) is not one-to-one, h(θ̂) will be viewed as the MLE which corresponds to
the maximum likelihood.
3.3 Asymptotic normality
Suppose that θ̂ is the MLE and consistent, and that it satisfies ∂ ln L(θ̂, X)/∂θ = 0. By the Taylor expansion we have

∂ ln L(θ̂, X)/∂θ ≈ ∂ ln L(θ, X)/∂θ + (θ̂ − θ) ∂² ln L(θ, X)/∂θ² = 0.
Since θ̂ is close to θ by consistency, the approximation is valid. Furthermore, we can make the following observations:

1. The random variables ∂ ln f(Xi; θ)/∂θ are iid with mean 0 and variance I1(θ) = Var(∂ ln f(X1; θ)/∂θ). By the central limit theorem,

Zn = (∂ ln L(θ, X)/∂θ) / √(n I1(θ)) = (Σ_{i=1}^n ∂ ln f(Xi; θ)/∂θ) / √(n I1(θ))

converges to N(0, 1) in distribution as n → ∞.

2. The random variables ∂² ln f(Xi; θ)/∂θ² are iid with mean (−I1(θ)) and finite variance. By the weak law of large numbers,

Wn = (1/n) ∂² ln L(θ, X)/∂θ² = (1/n) Σ_{i=1}^n ∂² ln f(Xi; θ)/∂θ²

converges to −I1(θ) in probability as n → ∞.
Together with Slutsky’s theorem we can find that

√(n I1(θ)) (θ̂ − θ) ≈ Zn / (Wn / (−I1(θ)))

converges to N(0, 1) in distribution as n → ∞. Hence, the MLE θ̂ has “approximately” a normal distribution with mean θ and variance 1/(n I1(θ)) = 1/I(θ) if n is large, where I(θ) is the Fisher information for the random sample X. This suggests that

n × Var(θ̂) → 1/I1(θ) as n → ∞.

Then we call θ̂ asymptotically efficient (cf. Bickel and Doksum, “Mathematical Statistics,” Chapter 4).
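A simulation sketch (Bernoulli case, hypothetical settings; NumPy only) illustrates this: for Bernoulli(θ) we have I1(θ) = 1/(θ(1 − θ)), and the standardized MLE should look approximately standard normal.

```python
# Checking that sqrt(n I1(theta)) (theta-hat - theta) is approximately N(0, 1).
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 0.3, 500, 100_000
I1 = 1.0 / (theta * (1 - theta))          # Fisher information per observation

theta_hat = rng.binomial(n, theta, size=reps) / n
z = np.sqrt(n * I1) * (theta_hat - theta)
print(z.mean(), z.std())                  # close to 0 and 1
```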
4 Confidence interval
Let X = (X1 , . . . , Xn ) be a random sample from f (x; θ). Let u1 (X) and u2 (X) be statistics satisfying u1 (X) ≤
u2 (X). If
P (u1 (X) < θ < u2 (X)) = 1 − α for every θ,
then the random interval (u1 (X), u2 (X)) is called a confidence interval of level (1 − α).
4.1 Population mean
Let X1, . . . , Xn be iid random variables from N(µ, σ²). The sample mean X̄ is an unbiased estimate of the parameter µ. Then the random variable (X̄ − µ)/(S/√n) has the t-distribution with (n − 1) degrees of freedom.
Thus, by using the critical point t_{α/2, n−1} we obtain

P( |X̄ − µ| / (S/√n) < t_{α/2, n−1} ) = P( X̄ − t_{α/2, n−1} S/√n < µ < X̄ + t_{α/2, n−1} S/√n ) = 1 − α.
This implies that the parameter µ is in the interval

( X̄ − t_{α/2, n−1} S/√n, X̄ + t_{α/2, n−1} S/√n )

with probability (1 − α). The interval is also known as the t-interval.
Example. A random sample of n milk containers is selected, and their milk contents are weighed. The data

X1, . . . , Xn    (1)

can be used to investigate the unknown population mean of the milk container weights. The random selection of the sample should ensure that the above data can be assumed to be iid. Suppose that we have calculated X̄ = 2.073 and S = 0.071 from the actual data with n = 30. Then by choosing α = 0.05, we have the critical point t_{0.025,29} = 2.045, and therefore obtain the confidence interval

( 2.073 − 2.045 × 0.071/√30, 2.073 + 2.045 × 0.071/√30 ) = (2.046, 2.100)

of level 0.95 (or, of level 95%).

Even if the data (1) are not normally distributed, the central limit theorem says that the estimate X̄ is approximately distributed as N(µ, σ²/n). In either case it is sensible to use critical points from the t-distribution.
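The milk-container interval can be reproduced with a few lines of Python (a sketch using SciPy):

```python
# The 95% t-interval from the summary statistics n = 30, X-bar = 2.073, S = 0.071.
import numpy as np
from scipy.stats import t

n, xbar, s, alpha = 30, 2.073, 0.071, 0.05
crit = t.ppf(1 - alpha / 2, df=n - 1)        # t_{0.025,29} = 2.045
half = crit * s / np.sqrt(n)
print((xbar - half, xbar + half))            # about (2.046, 2.100)
```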
4.2 Population proportion
Let X1, . . . , Xn be iid Bernoulli random variables with success probability p. The sample mean X̄ is the MLE of the parameter p, and unbiased. By the central limit theorem, the random variable (X̄ − p)/√(p(1 − p)/n) is approximately N(0, 1) as n gets larger (at least np > 5 and n(1 − p) > 5 by rule of thumb). Here we define the critical point zα for the standard normal distribution by P(X > zα) = α with a standard normal random variable X. Thus, we have

P( |X̄ − p| / √(p(1 − p)/n) < z_{α/2} ) = P( X̄ − z_{α/2} √(p(1 − p)/n) < p < X̄ + z_{α/2} √(p(1 − p)/n) ) ≈ 1 − α.

Here we can use √(X̄(1 − X̄)/n) as an estimate for √(p(1 − p)/n). (We will see later that it is the MLE via the invariance property since X̄ is the MLE of p.) Together we obtain the confidence interval

( X̄ − z_{α/2} √(X̄(1 − X̄)/n), X̄ + z_{α/2} √(X̄(1 − X̄)/n) )

of level (1 − α).
There is an alternative and possibly more accurate method to derive a confidence interval. Here we observe that

P( |X̄ − p| / √(p(1 − p)/n) < z_{α/2} ) = P( (n + z_{α/2}²) p² − 2(nX̄ + z_{α/2}²/2) p + nX̄² ≤ 0 ) ≈ 1 − α.

This implies that the parameter p is in the interval (p̂−, p̂+) with probability (1 − α), where

p̂± = [ nX̄ + z_{α/2}²/2 ± z_{α/2} √( nX̄(1 − X̄) + z_{α/2}²/4 ) ] / (n + z_{α/2}²).
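A sketch comparing the two intervals (the simple one and the more accurate quadratic form) for hypothetical values X̄ = 0.2 and n = 50:

```python
# Simple (Wald) interval versus the quadratic interval (p-hat_-, p-hat_+).
import numpy as np
from scipy.stats import norm

n, xbar, alpha = 50, 0.2, 0.05
z = norm.ppf(1 - alpha / 2)                            # z_{0.025} = 1.96

half = z * np.sqrt(xbar * (1 - xbar) / n)
simple = (xbar - half, xbar + half)

center = n * xbar + z**2 / 2
spread = z * np.sqrt(n * xbar * (1 - xbar) + z**2 / 4)
quadratic = ((center - spread) / (n + z**2), (center + spread) / (n + z**2))

print(simple, quadratic)          # the quadratic interval is pulled toward 1/2
```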
5 Exploratory Data Analysis
The data values recorded x1 , . . . , xn are typically considered as the observed values of random variables X1 , . . . , Xn
having a common probability distribution f (x). To judge the quality of data, it is useful to envisage a population
from which the sample should be drawn. A random sample is chosen at random from the population to ensure
that the sample is representative of the population. Once a data set has been collected, it is useful to find an
informative way of presenting it. Graphical representations of data in various forms can be quite informative.
5.1 Relative frequency histogram
Given the number of observations fi, called the frequency, in the i-th interval, the height hi of the i-th rectangle above the i-th interval is represented by

hi = fi / (n × (width of the i-th interval)).

When the width of each interval is equally chosen, the common width w is called the bandwidth, and the height hi becomes

hi = fi / (n × w).
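This is exactly the convention behind density histograms in standard software; for example (a NumPy sketch with made-up data):

```python
# numpy's density histogram computes h_i = f_i / (n * w) for equal-width bins.
import numpy as np

x = np.array([1.2, 1.9, 2.1, 2.4, 3.3, 3.5, 3.7, 4.8])
heights, edges = np.histogram(x, bins=4, density=True)

w = edges[1] - edges[0]                      # common bandwidth w
freqs, _ = np.histogram(x, bins=edges)       # frequencies f_i
print(heights, freqs / (len(x) * w))         # identical values
```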
5.2 Stem and leaf plot
This is much like a histogram except it portrays a data set itself.

Data set
20.5 20.7 20.8 21.0 21.0 21.4
21.5 22.0 22.1 22.5 22.6 22.6
22.7 22.7 22.9 22.9 23.1 23.3
23.4 23.5 23.6 23.6 23.6 23.9
24.1 24.3 24.5 24.5 24.8 24.8
24.9 24.9 25.1 25.1 25.2 25.6
25.8 25.9 26.1 26.7

Stem-and-leaf
20 | 578
21 | 0045
22 | 015667799
23 | 13456669
24 | 13558899
25 | 112689
26 | 17
5.3 Boxplot
The sample median is the value of the “middle” data point. When there is an odd number of observations, the median is simply the middle value. For example, the median of 2, 4, and 7 is 4. When there is an even number of observations, the median is the mean of the two middle values. Thus, the median of the numbers 2, 4, 7, 12 is (4 + 7)/2 = 5.5. The 25th sample percentile is the value below which 25% of the observations fall. Similarly, we can define the 50th percentile, the 75th percentile, and so on. Note that the 50th percentile is the median. We call the 25th percentile the lower sample quartile and the 75th percentile the upper sample quartile.

A box is drawn stretching from the lower sample quartile (the 25th percentile) to the upper sample quartile (the 75th percentile). The median is shown as a line across the box. Therefore, 1/4 of the distribution lies between this line and the right end of the box, and 1/4 of the distribution lies between this line and the left end of the box. Dotted lines, called “whiskers,” stretch out from the ends of the box to the largest and smallest data values.
5.4 Outliers
Graphical presentations can be used to identify an “odd-looking” value which does not fit in with the rest of the data. Such a value is called an outlier. In many cases an outlier turns out to be a misrecorded data value, or represents some special condition that was not in effect when the data were collected.
[Figures: histogram and boxplot of the data]
In the above histogram and boxplot, the value at the far right appears to be quite separate from the rest of the data, and can be considered to be an outlier.
6 Tests of statistical hypotheses
Suppose that a researcher is interested in whether a new drug works. The process of determining whether the outcome of the experiment points to “yes” or “no” is called hypothesis testing. A widely used formalization of this process is due to Neyman and Pearson. Our hypothesis is then the null hypothesis that the new drug has no effect. The null hypothesis is often the reverse of what we actually believe. Why? Because the researcher hopes to reject the hypothesis and announce that the new drug leads to significant improvements. If the hypothesis is not rejected, the researcher announces nothing and goes on to a new experiment.
6.1 Hypothesis testing of population mean
Hospital workers are subject to radiation exposure emanating from the skin of patients. A researcher is interested in the plausibility of the statement that the population mean µ of the radiation level is µ0, the researcher's hypothesis. Then the null hypothesis is

H0 : µ = µ0.
The “opposite” of the null hypothesis, called the alternative hypothesis, becomes

HA : µ ≠ µ0.

Thus, the hypothesis testing problem “H0 versus HA” is formed. The problem here is whether or not to reject H0 in favor of HA.
To assess this hypothesis, the radiation levels X1, . . . , Xn are measured from n patients who had been injected with a radioactive tracer, and assumed to be independent and normally distributed with mean µ. Under the null hypothesis, the random variable

T = (X̄ − µ0) / (S/√n)

has the t-distribution with (n − 1) degrees of freedom. Thus, we obtain the exact probability

P( |T| ≥ t_{α/2, n−1} ) = α.

When α is chosen to be a small value (0.05 or 0.01, for example), it is unlikely that the absolute value |T| is larger than the critical point t_{α/2, n−1}. Then we say that the null hypothesis H0 is rejected with significance level α (or, size α) when the observed value t of T satisfies |t| > t_{α/2, n−1}.
Example. We have µ0 = 5.4 for the hypothesis, and decided to give a test with significance level α = 0.05. Suppose that we have obtained X̄ = 5.145 and S = 0.7524 from the actual data with n = 28. Then we can compute

T = (5.145 − 5.4) / (0.7524/√28) ≈ −1.79.

Since |T| = 1.79 ≤ t_{0.025,27} = 2.052, the null hypothesis cannot be rejected. Thus, the evidence against the null hypothesis is not persuasive.
6.2 p-value
The above random variable T is called the t-statistic. Having observed “T = t,” we can calculate the p-value

p* = P(|Y| ≥ |t|) = 2 × P(Y ≥ |t|),

where the random variable Y has a t-distribution with (n − 1) degrees of freedom. Then we have the relation “p* < α ⇔ |t| > t_{α/2, n−1}.” Thus, we reject H0 with significance level α when p* < α. In the above example, we can compute the p-value p* = 2 × P(Y ≥ 1.79) ≈ 0.0847 ≥ 0.05; thus, we cannot reject H0.
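The whole computation for this example fits in a short Python sketch (using SciPy):

```python
# Two-sided t-test from the summary data n = 28, X-bar = 5.145, S = 0.7524.
import numpy as np
from scipy.stats import t

n, xbar, s, mu0, alpha = 28, 5.145, 0.7524, 5.4, 0.05
T = (xbar - mu0) / (s / np.sqrt(n))          # about -1.79
crit = t.ppf(1 - alpha / 2, df=n - 1)        # t_{0.025,27} = 2.052
p_star = 2 * t.sf(abs(T), df=n - 1)          # about 0.0847
print(T, crit, p_star)                       # |T| < crit and p* >= alpha: keep H0
```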
6.3 One-sided hypothesis testing
In the same case of hospital workers subject to radiation exposure, this time the researcher is interested in the plausibility of the statement that the population mean µ is greater than µ0. Then the hypothesis testing problem is

H0 : µ ≥ µ0 versus HA : µ < µ0.

1. The same t-statistic T = (X̄ − µ0)/(S/√n) is used as a test statistic. And we reject H0 with significance level α when we find that t < −t_{α, n−1} for the observed value t of T.
2. Alternatively we can construct the p-value

p* = P(Y ≤ t),

where the random variable Y has a t-distribution with (n − 1) degrees of freedom. Because of the relation “p* < α ⇔ t < −t_{α, n−1},” we can reject H0 with significance level α when p* < α.
Example. We use the same µ0 = 5.4 for the hypothesis and the same significance level α = 0.05, but use the one-sided test. Recall that X̄ = 5.145 and S = 0.7524 were obtained from the data with n = 28.

1. Then we compute

T = (5.145 − 5.4) / (0.7524/√28) ≈ −1.79.

Since T = −1.79 < −t_{0.05,27} = −1.703, the null hypothesis H0 is rejected. Thus, the outcome is statistically significant, so that the population mean µ is smaller than 5.4.

2. Alternatively, we can find the p-value p* = P(Y ≤ −1.79) ≈ 0.0423 < 0.05; thus, the null hypothesis should be rejected.
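The one-sided version of the same computation, as a sketch:

```python
# One-sided t-test (H0: mu >= mu0) from the same summary data.
import numpy as np
from scipy.stats import t

n, xbar, s, mu0, alpha = 28, 5.145, 0.7524, 5.4, 0.05
T = (xbar - mu0) / (s / np.sqrt(n))          # about -1.79
crit = t.ppf(1 - alpha, df=n - 1)            # t_{0.05,27} = 1.703
p_star = t.cdf(T, df=n - 1)                  # P(Y <= t), about 0.0423
print(T < -crit, p_star)                     # True and p* < alpha: reject H0
```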
We can also consider the hypothesis testing problem

H0 : µ ≤ µ0 versus HA : µ > µ0.

1. Using the t-statistic T = (X̄ − µ0)/(S/√n), we can reject H0 with significance level α when the observed value t of T satisfies t > t_{α, n−1}.

2. Alternatively we can construct the p-value p* = P(Y ≥ t), where the random variable Y has a t-distribution with (n − 1) degrees of freedom. Because of the relation “p* < α ⇔ t > t_{α, n−1},” we can reject H0 when p* < α.
6.4 Summary
When the null hypothesis H0 is rejected, it is reasonable to find the confidence interval of the population mean µ. The following table shows the confidence interval we can construct when the null hypothesis is rejected. Here T = (X̄ − µ0)/(S/√n) is the test statistic, and α is the significance level of your choice.

Hypothesis testing | When can we reject H0? | (1 − α)-level confidence interval
H0 : µ = µ0 versus HA : µ ≠ µ0 | |T| > t_{α/2, n−1} | (X̄ − t_{α/2, n−1} S/√n, X̄ + t_{α/2, n−1} S/√n)
H0 : µ ≤ µ0 versus HA : µ > µ0 | T > t_{α, n−1} | (X̄ − t_{α, n−1} S/√n, ∞)
H0 : µ ≥ µ0 versus HA : µ < µ0 | T < −t_{α, n−1} | (−∞, X̄ + t_{α, n−1} S/√n)
6.5 Exercises
1. An experimenter is interested in the hypothesis testing problem

H0 : µ = 3.0 mm versus HA : µ ≠ 3.0 mm,

where µ is the population mean of the thickness of glass sheets. Suppose that a sample of n = 21 glass sheets is obtained and their thicknesses are measured.

(a) For what values of the t-statistic does the experimenter accept the null hypothesis with a size α = 0.10?
(b) For what values of the t-statistic does the experimenter reject the null hypothesis with a size α = 0.01?
Suppose that the sample mean is X̄ = 3.04 mm and the sample standard deviation is S = 0.124 mm. Is the null hypothesis accepted or rejected with α = 0.10? With α = 0.01?
2. A machine is set to cut metal plates to a length of 44.350 mm. The lengths of a random sample of 24 metal plates have a sample mean of X̄ = 44.364 mm and a sample standard deviation of S = 0.019 mm. Is there any evidence that the machine is miscalibrated?
3. An experimenter is interested in the hypothesis testing problem

H0 : µ ≤ 0.065 versus HA : µ > 0.065,

where µ is the population mean of the density of a chemical solution. Suppose that a sample of n = 31 bottles of the chemical solution is obtained and their densities are measured.

(a) For what values of the t-statistic does the experimenter accept the null hypothesis with a size α = 0.10?
(b) For what values of the t-statistic does the experimenter reject the null hypothesis with a size α = 0.01?

Suppose that the sample mean is X̄ = 0.0768 and the sample standard deviation is S = 0.0231. Is the null hypothesis accepted or rejected with α = 0.10? With α = 0.01?
4. A chocolate bar manufacturer claims that at the time of purchase by a consumer the average age of its product is no more than 120 days. In an experiment to test this claim, a random sample of 26 chocolate bars is found to have ages at the time of purchase with a sample mean of X̄ = 122.5 days and a sample standard deviation of S = 13.4 days. With this information, how do you feel about the manufacturer's claim?
7 Power of test
We define a function K(θ) of parameter θ by the probability that H0 is rejected given µ = θ.
K(θ) = P (“Reject H0 ” | µ = θ)
Then K(θ) is called the power function.
7.1 Type I error
What is the probability that we incorrectly reject H0 when it is actually true? Such an error is called a type I error, and the probability of a type I error is exactly the significance level α, as explained in the following:

1. The probability of type I error for the two-sided hypothesis test is given by K(µ0). Then we have K(µ0) = P( |T| ≥ t_{α/2, n−1} ) = α.

2. In a one-sided hypothesis test, the probability of type I error is the worst (that is, largest possible) probability max_{θ≥µ0} K(θ) of type I error. Given µ = θ, the random variable

(X̄ − θ)/(S/√n) = T − (θ − µ0)/(S/√n) = T − δ

has the t-distribution with (n − 1) degrees of freedom, where δ = (θ − µ0)/(S/√n). By observing that δ ≥ 0 if θ ≥ µ0, we obtain

K(θ) = P(T ≤ −t_{α, n−1}) = P(T − δ ≤ −t_{α, n−1} − δ) ≤ P(T − δ ≤ −t_{α, n−1}) = α.

Thus, we obtain max_{θ≥µ0} K(θ) = α.
7.2 Power of test
What is the probability that we incorrectly accept H0 when it is actually false? Such a probability β is called the probability of type II error. Then the value (1 − β) is known as the power of the test, indicating how correctly we can reject H0 when it is actually false. Again, consider the case of hospital workers subject to radiation exposure. Given the current estimate S = s of the standard deviation and the current sample size n = n1, the t-statistic T = (X̄ − µ0)/(S/√n) can be approximated by N(δ, 1) with δ = (µ − µ0)/(s/√n1).
Example. Suppose that the true population mean is µ = 5.1 (versus the value µ0 = 5.4 in our hypotheses). Then we can calculate the power of the test with δ ≈ −2.11 as follows.

1. In the two-sided hypothesis testing, we reject H0 when |T| > t_{0.025,27} = 2.052. Therefore, the power of the test is K(5.1) = P(|T| > 2.052 | µ = 5.1) ≈ 0.523.

2. In the one-sided hypothesis testing, we reject H0 when T < −t_{0.05,27} = −1.703. Therefore, the power of the test is K(5.1) = P(T < −1.703 | µ = 5.1) ≈ 0.658.

This explains why we could not reject H0 in the two-sided hypothesis testing. Our chance of detecting the falsehood of H0 is only 52%, while we have a 66% chance in the one-sided hypothesis testing.
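These power values come from the normal approximation T ≈ N(δ, 1); a sketch of the calculation:

```python
# Power of the two-sided and one-sided tests at mu = 5.1 via T ~ N(delta, 1).
import numpy as np
from scipy.stats import norm, t

n, s, mu0, mu = 28, 0.7524, 5.4, 5.1
delta = (mu - mu0) / (s / np.sqrt(n))                   # about -2.11

crit2 = t.ppf(0.975, df=n - 1)                          # 2.052
power_two = norm.cdf(-crit2, loc=delta) + norm.sf(crit2, loc=delta)

crit1 = t.ppf(0.95, df=n - 1)                           # 1.703
power_one = norm.cdf(-crit1, loc=delta)

print(power_two, power_one)                             # about 0.523 and 0.658
```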
7.3 Effect of sample size
For a fixed significance level α of your choice, the power of the test increases as the sample size n increases. In the two-sided hypothesis testing discussed above, we could recommend collecting additional data to increase the power of the test. But how many additional data do we need? Here is one possible way to calculate a desirable sample size n: In the two-sided hypothesis testing, the power (1 − β) of the test is approximated by

P(|T| > t_{α/2, n−1}) ≈ P(Y < −t_{α/2, n−1} − δ) + P(Y > t_{α/2, n−1} − δ) ≥ P(Y > t_{α/2, n−1} − |δ|)

with a random variable Y having the t-distribution with (n − 1) degrees of freedom. Given the current estimate S = s of the standard deviation and the current sample size n1, we can achieve the power (1 − α/2) of the test by increasing the total sample size n so as to satisfy |δ| ≥ 2 t_{α/2, n1−1}. In the above example of radiation exposure of hospital workers, such a size n can be calculated as

n ≥ ( 2 t_{α/2, n1−1} s / |µ − µ0| )² = ( 2 t_{0.025,27} × 0.7524 / |5.1 − 5.4| )² = 105.9.
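The rule is easy to evaluate; a sketch:

```python
# Sample size so that |delta| >= 2 t_{alpha/2, n1-1}: n >= (2 t s / |mu - mu0|)^2.
import numpy as np
from scipy.stats import t

n1, s, mu0, mu, alpha = 28, 0.7524, 5.4, 5.1, 0.05
crit = t.ppf(1 - alpha / 2, df=n1 - 1)        # t_{0.025,27} = 2.052
n_required = (2 * crit * s / abs(mu - mu0)) ** 2
print(np.ceil(n_required))                    # 106 observations in total
```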
8 Comparison of two populations
We often want to compare two populations on the basis of an experiment. For example, a researcher wants to test the effect of his drug on blood pressure. In any treatment, an improvement could have been due to the placebo effect, when the subject believes that he or she has been given an effective treatment. To protect against such biases, the study should consider (i) the use of a control group in which the subjects are given a placebo, and an experimental group in which the subjects are treated with the new drug, (ii) randomization by assigning the subjects between the control and the experimental groups randomly, and (iii) a double-blind experiment by concealing the nature of the treatment from the subjects and the person taking measurements. Then it becomes the hypothesis testing problem

H0 : µ1 = µ2 versus HA : µ1 ≠ µ2,

where µ1 and µ2 are the respective population means of the control and the experimental groups.
As a result of the experiment, we typically obtain the measurements X1, . . . , Xn of the subjects from the control group, and the measurements Y1, . . . , Ym of the subjects from the experimental group. Then it is usually assumed that X1, . . . , Xn and Y1, . . . , Ym are independent and normally distributed with (µ1, σ1²) and (µ2, σ2²), respectively. Even when they are not normally distributed, large sample sizes (n, m ≥ 30) ensure that the tests are appropriate via the central limit theorem.
8.1 Pooled variance procedure
Let Sx and Sy be the sample standard deviations constructed from X1, . . . , Xn and Y1, . . . , Ym, respectively. When it is reasonable to assume “σ1² = σ2²,” we can construct the pooled sample variance

Sp² = [ (n − 1)Sx² + (m − 1)Sy² ] / (n + m − 2).

The test statistic

T = (X̄ − Ȳ) / ( Sp √(1/n + 1/m) )

has the t-distribution with (n + m − 2) degrees of freedom under the null hypothesis H0. Thus, we reject the null hypothesis H0 with significance level α when the observed value t of T satisfies |t| > t_{α/2, n+m−2}. Or, equivalently, we can compute the p-value

p* = 2 × P(Y ≥ |t|)

with Y having a t-distribution with (n + m − 2) degrees of freedom, and reject H0 when p* < α.
Confidence interval. The following table shows the corresponding confidence interval of the population mean difference µ1 − µ2, when the null hypothesis H0 is rejected.

Hypothesis testing | (1 − α)-level confidence interval
H0 : µ1 = µ2 vs. HA : µ1 ≠ µ2 | (X̄ − Ȳ − t_{α/2, n+m−2} Sp √(1/n + 1/m), X̄ − Ȳ + t_{α/2, n+m−2} Sp √(1/n + 1/m))
H0 : µ1 ≤ µ2 vs. HA : µ1 > µ2 | (X̄ − Ȳ − t_{α, n+m−2} Sp √(1/n + 1/m), ∞)
H0 : µ1 ≥ µ2 vs. HA : µ1 < µ2 | (−∞, X̄ − Ȳ + t_{α, n+m−2} Sp √(1/n + 1/m))
Example. Suppose that we consider the significance level α = 0.01, and that we have obtained X̄ = 80.02 and Sx = 0.024 from the control group of size n = 13, and Ȳ = 79.98 and Sy = 0.031 from the experimental group of size m = 8. Here we have assumed that σ1² = σ2². Then we can compute the square root Sp = 0.027 of the pooled sample variance Sp², and the test statistic

T = (80.02 − 79.98) / ( 0.027 √(1/13 + 1/8) ) ≈ 3.33.

Thus, we can obtain p* = 2 × P(Y ≥ 3.33) ≈ 0.0035 < 0.01, and reject H0. We conclude that the two population means are significantly different. And the 99% confidence interval for the mean difference is (0.006, 0.074).
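A sketch of the pooled procedure from these summary statistics:

```python
# Pooled two-sample t-test and 99% confidence interval for mu1 - mu2.
import numpy as np
from scipy.stats import t

n, xbar, sx = 13, 80.02, 0.024
m, ybar, sy = 8, 79.98, 0.031
alpha = 0.01

sp = np.sqrt(((n - 1) * sx**2 + (m - 1) * sy**2) / (n + m - 2))
T = (xbar - ybar) / (sp * np.sqrt(1 / n + 1 / m))       # about 3.33
p_star = 2 * t.sf(abs(T), df=n + m - 2)                 # about 0.0035

half = t.ppf(1 - alpha / 2, df=n + m - 2) * sp * np.sqrt(1 / n + 1 / m)
print(p_star, (xbar - ybar - half, xbar - ybar + half)) # CI about (0.006, 0.074)
```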
8.2 General procedure
When “σ1² ≠ σ2²,” under the null hypothesis H0 the test statistic

T = (X̄ − Ȳ) / √(Sx²/n + Sy²/m)

has approximately the t-distribution with ν degrees of freedom, where ν is the nearest integer to

(Sx²/n + Sy²/m)² / [ Sx⁴/(n²(n − 1)) + Sy⁴/(m²(m − 1)) ].

Thus, we reject the null hypothesis H0 with significance level α when the observed value t of T satisfies |t| > t_{α/2, ν}. Or, equivalently, we can compute the p-value

p* = 2 × P(Y ≥ |t|)

with Y having a t-distribution with ν degrees of freedom, and reject H0 when p* < α.
Confidence interval. The following table shows the corresponding confidence interval of the population mean difference µ1 − µ2, when the null hypothesis H0 is rejected.

Hypothesis testing | (1 − α)-level confidence interval
H0 : µ1 = µ2 versus HA : µ1 ≠ µ2 | (X̄ − Ȳ − t_{α/2, ν} √(Sx²/n + Sy²/m), X̄ − Ȳ + t_{α/2, ν} √(Sx²/n + Sy²/m))
H0 : µ1 ≤ µ2 versus HA : µ1 > µ2 | (X̄ − Ȳ − t_{α, ν} √(Sx²/n + Sy²/m), ∞)
H0 : µ1 ≥ µ2 versus HA : µ1 < µ2 | (−∞, X̄ − Ȳ + t_{α, ν} √(Sx²/n + Sy²/m))
Example. Suppose that we consider the significance level α = 0.01, and that we have obtained X̄ = 80.02 and Sx = 0.024 from the control group of size n = 13, and Ȳ = 79.98 and Sy = 0.031 from the experimental group of size m = 8 as before. Then the test statistic is T ≈ 3.12, and ν = 12. Thus, we can obtain p* = 2 × P(Y ≥ 3.12) ≈ 0.0089 < 0.01, and still reject H0.
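The same data under the general (unequal variance) procedure, as a sketch:

```python
# Welch-type statistic and approximate degrees of freedom nu.
import numpy as np
from scipy.stats import t

n, xbar, sx = 13, 80.02, 0.024
m, ybar, sy = 8, 79.98, 0.031

se2 = sx**2 / n + sy**2 / m
T = (xbar - ybar) / np.sqrt(se2)                                     # about 3.12
nu = round(se2**2 / (sx**4 / (n**2 * (n - 1)) + sy**4 / (m**2 * (m - 1))))  # 12
p_star = 2 * t.sf(abs(T), df=nu)                                     # about 0.0089
print(T, nu, p_star)
```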
9 Inference on proportions
In experiments on pea breeding, Mendel observed the different kinds of seeds obtained by crosses from plants with round yellow seeds and plants with wrinkled green seeds. Possible types of progeny were: “round yellow”, “wrinkled yellow”, “round green”, and “wrinkled green.” When the data values recorded x1, . . . , xn take several types, or categories, we call them categorical data.
9.1 Point estimate
Let X be the number of observations for a particular type in categorical data of size n, and let p be the population proportion of this type (that is, the probability of occurrence of this type). Then the random variable X has the binomial distribution with parameter (n, p). And the point estimate of the population proportion p is

p̂ = X/n.

We can easily see that

E(p̂) = E(X/n) = (1/n) E(X) = p.

Thus, p̂ is an unbiased estimate of p. Furthermore, recall by the central limit theorem that we have approximately

X ∼ N(np, np(1 − p))

when n is large. Then the point estimate p̂ is approximately distributed as the normal distribution with parameter (p, p(1 − p)/n).
9.2 Hypothesis test
Suppose that the vaccine can be approved for widespread use if it can be established that the probability p of a serious adverse reaction is less than p0. Then the hypothesis testing problem becomes

H0 : p ≥ p0 versus HA : p < p0.    (2)

Let X be the number of participants who suffer an adverse reaction among n participants. Then the random variable X has the binomial distribution with parameter (n, p) and is approximated by the normal distribution with parameter (np, np(1 − p)) when n is large [that is, to satisfy np > 5 and n(1 − p) > 5].

Critical point. The critical point of the standard normal distribution, denoted by zα, is defined as the value satisfying P(Z > zα) = α, where Z is a standard normal random variable. Since the normal distribution is symmetric, it implies that P(Z < −zα) = α.
Testing procedure. When np0 > 5 and n(1 − p0) > 5,

Z = (X − np0) / √(np0(1 − p0))    (3)

is used as the test statistic. Then we can reject H0 in (2) with significance level α if the value z of the test statistic Z satisfies z < −zα. Equivalently, we can construct the p-value p* = Φ(z), and reject H0 when p* < α. Since a continuity correction improves the accuracy, the alternative test statistic

Z = (X − np0 + 0.5) / √(np0(1 − p0))

may also be used.
Confidence interval. When H0 is rejected, we want to further investigate the confidence interval for the population proportion p which corresponds to the result of the hypothesis test. We have the point estimate p̂ = X/n. Then the two different formulas

( 0, [X + zα²/2 + zα √( X(n − X)/n + zα²/4 )] / (n + zα²) )    (4)

( 0, p̂ + zα √(p̂(1 − p̂)/n) )    (5)

can be used for the confidence interval of level (1 − α). Although Formula (4) is known to be more accurate, Formula (5) may be used in most of our problems since it is easier to calculate.
Example. Suppose that p0 = 0.05 is required, and that the significance level α = 0.05 is chosen. And the study shows that X = 4 adverse reactions are found out of n = 155 participants. Note that (0.05)(155) = 7.75 > 5 and (0.95)(155) = 147.25 > 5. Thus, we have

Z = [4 − (155)(0.05) + 0.5] / √((155)(0.05)(0.95)) ≈ −1.20 and p* = Φ(−1.20) ≈ 0.115.

We can also obtain the point estimate p̂ ≈ 0.0258 and the 95% confidence interval (0, 0.0562) by using (4) [we get (0, 0.0467) if we use (5)]. Since p* ≥ 0.05, we cannot reject the null hypothesis. Thus, it is not advisable that the vaccine be approved as the result of this study.
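A sketch of the vaccine example, including both confidence-interval formulas:

```python
# Continuity-corrected test statistic, p-value, and formulas (4) and (5).
import numpy as np
from scipy.stats import norm

n, X, p0, alpha = 155, 4, 0.05, 0.05
z = (X - n * p0 + 0.5) / np.sqrt(n * p0 * (1 - p0))      # about -1.20
p_star = norm.cdf(z)                                     # about 0.115

za = norm.ppf(1 - alpha)                                 # z_{0.05} = 1.645
phat = X / n
upper4 = (X + za**2 / 2 + za * np.sqrt(X * (n - X) / n + za**2 / 4)) / (n + za**2)
upper5 = phat + za * np.sqrt(phat * (1 - phat) / n)
print(p_star, upper4, upper5)                            # 0.115, ~0.0562, ~0.0467
```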
9.3 Sample size calculations
We always control the probability of incorrectly rejecting H0 when H0 is true (a type I error), say, to be less than 5%. But at the same time we may sacrifice the power of detecting the falsehood of H0 when H0 is false, that is, the power of the test. In order for the hypothesis testing problem

H0 : p ≥ p0 versus HA : p < p0

to achieve the power (1 − β) of the test, we need a sample of size

n ≥ ( [zα √(p0(1 − p0)) + zβ √(p(1 − p))] / (p − p0) )².    (6)

In the above vaccine experiment, if the true population proportion p is 0.025, then the power of the test is only 0.18. To increase the power of the test to at least 0.8, we need a sample size of at least n = 464.
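Formula (6) can be evaluated directly, as in the sketch below (inputs match the vaccine setting); continuity-corrected variants of the calculation give somewhat larger sample sizes.

```python
# Minimal sample size by Formula (6) for H0: p >= p0 versus HA: p < p0.
import numpy as np
from scipy.stats import norm

p0, p, alpha, beta = 0.05, 0.025, 0.05, 0.2        # power 1 - beta = 0.8
za, zb = norm.ppf(1 - alpha), norm.ppf(1 - beta)
n = ((za * np.sqrt(p0 * (1 - p0)) + zb * np.sqrt(p * (1 - p))) / (p - p0)) ** 2
print(np.ceil(n))
```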
9.4 Summary
Possible null hypotheses for the inference on a population proportion are “H0 : p = p0”, “H0 : p ≤ p0”, and “H0 : p ≥ p0”. In each case we can use the test statistic Z in (3) if we do not make a continuity correction. Then the corresponding testing procedures are summarized in the following table.

Null hypothesis | When we reject it | (1 − α)-level confidence interval
H0 : p = p0 | |Z| > z_{α/2} | (p̂ − z_{α/2} √(p̂(1 − p̂)/n), p̂ + z_{α/2} √(p̂(1 − p̂)/n))
H0 : p ≤ p0 | Z > zα | (p̂ − zα √(p̂(1 − p̂)/n), 1)
H0 : p ≥ p0 | Z < −zα | (0, p̂ + zα √(p̂(1 − p̂)/n))
For the sample size calculation, use Formula (6) if the null hypothesis is either “H0 : p ≤ p0” or “H0 : p ≥ p0”. When the null hypothesis is “H0 : p = p0”, the sample size n can be computed as

n ≥ ( [z_{α/2} √(p0(1 − p0)) + zβ √(p(1 − p))] / (p − p0) )².
9.5 Comparison of two proportions
A researcher is interested in whether there is discrimination against women in a university. In terms of statistics this is the hypothesis testing problem

H0 : pA ≤ pB versus HA : pA > pB,

where pA and pB are the respective population proportions of men and women who are admitted to the university. The researcher decided to collect the data for a graduate program in the university. Let X and Y be the respective
numbers of men and women who are admitted to the graduate school, which are summarized in the following table:

       | Men   | Women
Admit  | X     | Y
Deny   | n − X | m − Y
Total  | n     | m
The test statistic is given by

Z = (p̂A − p̂B) / √( p̂(1 − p̂)(1/n + 1/m) ),

where p̂A = X/n and p̂B = Y/m are the point estimates of pA and pB, and

p̂ = (X + Y) / (n + m)

is called a pooled estimate of the common population proportion. Under the null hypothesis, the probability that Z > zα becomes approximately less than α. Thus, we reject H0 when the observed value z of Z satisfies z > zα. Or, equivalently, we can reject H0 when p* = 1 − Φ(z) < α.
Confidence interval. We may want to further investigate the confidence interval for the difference pA − pB. Having constructed the hypothesis testing problems “H0 : pA = pB”, “H0 : pA ≤ pB”, or “H0 : pA ≥ pB”, the following table shows the corresponding testing procedure and the confidence interval.

Null hypothesis | When we reject it | (1 − α)-level confidence interval for pA − pB
H0 : pA = pB | |z| > z_{α/2} [that is, p* = 2 × (1 − Φ(|z|)) < α] | (p̂A − p̂B − z_{α/2} √(p̂A(1 − p̂A)/n + p̂B(1 − p̂B)/m), p̂A − p̂B + z_{α/2} √(p̂A(1 − p̂A)/n + p̂B(1 − p̂B)/m))
H0 : pA ≤ pB | z > zα [that is, p* = 1 − Φ(z) < α] | (p̂A − p̂B − zα √(p̂A(1 − p̂A)/n + p̂B(1 − p̂B)/m), 1)
H0 : pA ≥ pB | z < −zα [that is, p* = Φ(z) < α] | (−1, p̂A − p̂B + zα √(p̂A(1 − p̂A)/n + p̂B(1 − p̂B)/m))
Example. The following table classifies the applications for the graduate school according to admission status and sex.

       | Men | Women | Total
Admit  | 97  | 40    | 137
Deny   | 263 | 42    | 305
Total  | 360 | 82    | 442

Then we have p̂A = 97/360 ≈ 0.269, p̂B = 40/82 ≈ 0.488, and p̂ = 137/442 ≈ 0.310. And we can obtain

Z = (0.269 − 0.488) / √( (0.31)(0.69)(1/360 + 1/82) ) ≈ −3.87 and p* = 1 − Φ(−3.87) ≈ 0.9999.

Thus, we cannot reject H0, indicating that there is no discrimination against women in this particular graduate program. In fact, the null hypothesis “H0 : pA ≥ pB” will be rejected in this example.
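A sketch of the two-proportion computation on this admissions table:

```python
# Pooled two-proportion test statistic and one-sided p-values.
import numpy as np
from scipy.stats import norm

X, n = 97, 360     # men: admitted, total
Y, m = 40, 82      # women: admitted, total

pA, pB = X / n, Y / m
p_pool = (X + Y) / (n + m)
Z = (pA - pB) / np.sqrt(p_pool * (1 - p_pool) * (1 / n + 1 / m))   # about -3.87

print(1 - norm.cdf(Z))   # p* for H0: pA <= pB, about 0.9999 (not rejected)
print(norm.cdf(Z))       # p* for H0: pA >= pB, about 0.0001 (rejected)
```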
10 Chi-square test
In the experiment on pea breeding, Mendel observed the different kinds of seeds obtained by crosses from plants with round yellow seeds and plants with wrinkled green seeds. Possible types of progeny were: “round yellow”, “wrinkled yellow”, “round green”, and “wrinkled green.” And Mendel's theory predicted the associated probabilities of occurrence as follows.

              | Round yellow | Wrinkled yellow | Round green | Wrinkled green
Probabilities | 9/16         | 3/16            | 3/16        | 1/16

We want to test whether the data from n observations are consistent with his theory, a goodness of fit test, in which the statement of the null hypothesis becomes “the model is valid.”
10.1 Chi-square test
In general, each observation is classified into one of k categories or “cells,” which results in the cell frequencies X1, . . . , Xk. The goodness of fit to a particular model can be assessed by comparing the observed cell frequencies X1, . . . , Xk with the expected cell frequencies E1, . . . , Ek, which are predicted from the model. The discrepancy between the data and the model can be measured by Pearson's chi-square statistic

χ² = Σ_{i=1}^k (Xi − Ei)² / Ei.

Under the null hypothesis (that is, assuming that the model is correct), the distribution of Pearson's chi-square χ² is approximated by the chi-square distribution with

df = (number of cells) − 1 − (number of parameters in the model)

degrees of freedom. Therefore, if we observe that χ² = x and x > χ²_{α, df}, then we can reject the null hypothesis, casting doubt on the validity of the model. Or, by computing the p-value

p* = P(X > x)

with a random variable X having the chi-square distribution with df degrees of freedom, we can equivalently reject the null hypothesis when p* < α.
Example. In the experiment of pea breeding, we have obtained the data as in the following table.

            | Round yellow | Wrinkled yellow | Round green | Wrinkled green
Frequencies | 315          | 101             | 108         | 32

With the total number of observations n = 556, the expected cell frequencies from Mendel's theory can be calculated as

                     | Round yellow | Wrinkled yellow | Round green | Wrinkled green
Expected frequencies | 312.75       | 104.25          | 104.25      | 34.75

We can compute Pearson's chi-square χ² = 0.47. Since Mendel's model has no parameter, the chi-square distribution has 3 = (4 − 1) degrees of freedom, and we get the p-value p* = 0.925. Thus, there is little reason to doubt Mendel's theory on the basis of Pearson's chi-square test.
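SciPy performs this goodness-of-fit computation directly, as in the sketch below:

```python
# Pearson's chi-square test of Mendel's 9:3:3:1 model.
import numpy as np
from scipy.stats import chisquare

observed = np.array([315, 101, 108, 32])
expected = observed.sum() * np.array([9, 3, 3, 1]) / 16

chi2_stat, p_star = chisquare(observed, f_exp=expected)
print(chi2_stat, p_star)    # about 0.47 and 0.925: little reason to doubt the model
```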
10.2 Test of independence
Consider again the study of discrimination against women in university admission. In the study, there are two characteristics: men or women; admitted or denied. The researcher wanted to know whether such characteristics are linked or independent. For such a study, we take a random sample of size n from the population, which is summarized in the contingency table

       | Men | Women | Total
Admit  | X11 | X12   | X1·
Deny   | X21 | X22   | X2·
Total  | X·1 | X·2   | n = X··
The statement of the null hypothesis becomes “the two characteristics are independent.” Under the null hypothesis, the expected frequencies for the contingency table can be given by

       | Men   | Women | Total
Admit  | np1q1 | np1q2 | np1
Deny   | np2q1 | np2q2 | np2
Total  | nq1   | nq2   | n

The point estimates of p1, p2, q1, and q2 are p̂1 = X1·/n, p̂2 = X2·/n, q̂1 = X·1/n, and q̂2 = X·2/n. With these point estimates, the chi-square statistic is

χ² = Σ_{i=1}^2 Σ_{j=1}^2 (Xij − Xi·X·j/n)² / (Xi·X·j/n),

and the degrees of freedom is (4 − 1 − 2) = 1.
Example. By using the same data as before, we can obtain the chi-square statistic

χ² = [97 − (137)(360)/442]² / [(137)(360)/442] + [40 − (137)(82)/442]² / [(137)(82)/442]
   + [263 − (305)(360)/442]² / [(305)(360)/442] + [42 − (305)(82)/442]² / [(305)(82)/442] ≈ 14.89,

and the p-value p* = 0.0001. Thus, the null hypothesis is rejected at any reasonable level, indicating that the two characteristics are somewhat dependent.
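The same result follows from SciPy's contingency-table routine (with the continuity correction turned off, to match the hand computation):

```python
# Chi-square test of independence on the 2x2 admissions table.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[97, 40],      # admitted: men, women
                  [263, 42]])    # denied:   men, women

chi2_stat, p_star, df, expected = chi2_contingency(table, correction=False)
print(chi2_stat, p_star, df)     # about 14.89, 0.0001, df = 1
```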
11 Minimum variance unbiased estimator
One of the important attributes of a point estimate is unbiasedness. Since a statistic u(X) is a random variable, we can consider the expectation E[u(X)]. Then the point estimate u(X) of θ is called an unbiased estimator if it satisfies E[u(X)] = θ. For example, the sample mean X̄ is an unbiased estimate of the mean µ, since E(X̄) = µ. Furthermore, u(X) is called a minimum variance unbiased estimator if (i) it is unbiased, and (ii) for every other unbiased estimator s(X) for θ we have

R(θ, u) = Var(u(X)) ≤ Var(s(X)) = R(θ, s) for all θ.
11.1 Sufficient statistics
Let f(x; θ) be a joint density function of x = (x1, . . . , xn) for a random vector X = (X1, . . . , Xn). Note that f(x1, . . . , xn; θ) is of the form f(x1; θ) · · · f(xn; θ) if (X1, . . . , Xn) is a random sample. Suppose that s(X) is a statistic having the pdf g(s; θ). Then s(X) is called a sufficient statistic if

f(x; θ) / g(s(x); θ)

is a function of x and does not depend on θ.
Factorization theorem. s(X) is a sufficient statistic if and only if the joint density f(x; θ) can be expressed in the form

f(x; θ) = k(s(x); θ) · h(x).

Furthermore, the pdf g(s; θ) for the statistic s(X) is proportional to k(s; θ), or possibly a multiple c(s) × k(s; θ) with a function c(s) of s.
11.2 Rao-Blackwell’s theorem
Let X = (X1, . . . , Xn) be a random sample from f(x; θ). Suppose that s(X) is a sufficient statistic, and that u(X) is an unbiased estimator for θ. Then we can construct a new statistic

ϕ(s(X)) = E(u(X) | s(X)),

which is a function of the sufficient statistic s(X). Furthermore, we obtain E[ϕ(s(X))] = θ by the law of total expectation, and

Var(ϕ(s(X))) ≤ Var(u(X))

via the conditional variance formula.
11.3 Lehmann-Scheffé’s theorem
Let s(X) be a statistic having the pdf g(s; θ). Then s(X) is called a complete statistic if for any function φ,

E[φ(s(X))] = 0 for all θ

suffices to imply that φ(s(X)) ≡ 0.

Lehmann-Scheffé’s theorem. Suppose that s(X) is a complete and sufficient statistic. Then the statistic

ϕ*(s(X)) = E(u(X) | s(X))

is well-defined with any choice of unbiased statistic u(X). Moreover, ϕ*(s(X)) is the minimum variance unbiased estimator for θ, which is unique among functions of s(X).
11.4 Exponential families
Let A be an interval (or a subset) on R, and let

f(x; θ) = exp[ c(θ)k(x) + h(x) + d(θ) ], x ∈ A; otherwise, f(x; θ) = 0 for all x ∉ A,    (7)

be a probability density function with parameter θ. Here an interval A can be (−∞, ∞), [0, ∞), or [0, 1], or a subset A can be {0, 1, . . .} or {0, . . . , n}, for example; but A should not depend on the parameter θ. Then we say that the pdf f(x; θ) is of a one-parameter exponential family. For example,

1. an exponential density,
2. a normal density with known σ², and
3. a Poisson frequency function

are of the one-parameter exponential family.
Natural sufficient statistics and completeness. Let X1, . . . , Xn be a random sample from (7). Then the joint density

f(x; θ) = exp[ c(θ) Σ_{i=1}^n k(xi) + Σ_{i=1}^n h(xi) + n d(θ) ], x ∈ Aⁿ,

is of the exponential family. By the factorization theorem, the statistic s(X) = Σ_{i=1}^n k(Xi) is sufficient, which we call a natural sufficient statistic. Furthermore, the statistic s(X) has a pdf of the exponential family

f(s; θ) = exp[ c(θ)s + h*(s) + n d(θ) ], s ∈ A*,

and is complete.
12 Efficient estimator
We call a statistic u(X) efficient if u(X) achieves the Cramér-Rao lower bound.
12.1 Fisher information
Let f(x; θ) be a joint density function for a random sample X = (X1, . . . , Xn). Furthermore, we assume that (i) A = {x : f(x; θ) > 0} does not depend on θ, and (ii) (∂/∂θ) ln f(x; θ) exists; we will call the conditions in (i)–(ii) the regularity assumptions. For example, a pdf of an exponential family satisfies the regularity assumptions. By observing that E[(∂/∂θ) ln f(X; θ)] = 0, we can define the Fisher information I(θ) by

I(θ) = E[ ((∂/∂θ) ln f(X; θ))² ] = Var( (∂/∂θ) ln f(X; θ) ).

Now suppose that f(x; θ) is of the form f(x1; θ) · · · f(xn; θ). By setting I1(θ) = Var( (∂/∂θ) ln f(X1; θ) ), we obtain

I(θ) = n × Var( (∂/∂θ) ln f(X1; θ) ) = n I1(θ).

Exercise. Show that I(θ) = −E[ (∂²/∂θ²) ln f(X; θ) ].
12.2 Cramér-Rao lower bound
Let X and Y be random variables, and let a be a real number. Then we have Var(aX − Y) = a² Var(X) − 2a Cov(X, Y) + Var(Y) ≥ 0. By substituting a = Cov(X, Y)/Var(X), we can find the Cauchy-Schwarz inequality

[Cov(X, Y)]² ≤ Var(X) · Var(Y).

Let u(X) be a statistic. Then E[u(X)] is a function of θ, say ψ(θ) = E[u(X)]. Here we can show that

ψ′(θ) = Cov( u(X), (∂/∂θ) ln f(X; θ) ).

By applying the Cauchy-Schwarz inequality we obtain the Cramér-Rao inequality

Var(u(X)) ≥ [ψ′(θ)]² / I(θ).    (8)

When u(X) is an unbiased statistic, the right-hand side of (8) becomes 1/I(θ), and is called the Cramér-Rao lower bound.
If a statistic u(X) achieves the Cramér-Rao lower bound, we call u(X) an efficient estimator. Clearly an efficient and unbiased statistic is a minimum variance unbiased estimator. Now let f(x; θ) be a joint density which satisfies the regularity assumptions, and let u(X) be an unbiased statistic for θ. If the joint density is of the exponential family

f(x; θ) = exp[ c(θ)u(x) + h(x) + d(θ) ], x ∈ A*; otherwise, f(x; θ) = 0 for all x ∉ A*,    (9)

then the natural sufficient statistic u(X) is an efficient estimator. Conversely, if u(X) is an efficient estimator, then the joint density f(x; θ) is given in the form (9).

Remark. The notion of an efficient statistic is somewhat stronger than that of minimum variance. Even worse, there exist minimum variance unbiased estimators which do not achieve their respective Cramér-Rao lower bound.
13 Hypothesis testing
Let θ be a parameter of an underlying probability density function f(x; θ) for a certain population. The hypothesis “H0 : θ = θ0” is called a simple hypothesis, since it completely specifies the underlying distribution. In contrast, the hypothesis “H0 : θ ∈ Θ0” with a set Θ0 of parameters is called a composite hypothesis if the set Θ0 contains more than one element. The “opposite” of the null hypothesis is called the alternative hypothesis, and is similarly expressed as “HA : θ ∈ Θ1”, where Θ1 is another set of parameters such that Θ0 ∩ Θ1 = ∅. The set Θ1 is typically (but not necessarily) chosen to be the complement of Θ0. Thus, the hypothesis testing problem “H0 versus HA” can be formed as

H0 : θ ∈ Θ0 versus HA : θ ∈ Θ1.    (10)

The problem stated above is whether or not to reject H0 in favor of HA.
13.1 Test statistic
Given a random sample X = (X1, . . . , Xn), a function

δ(X) = 1 if H0 is rejected; 0 otherwise,    (11)

is called a test function. Given the test (11), we can define the power function by

K(θ0) = P(“Reject H0” | θ = θ0) = E(δ(X) | θ = θ0).

A typical test, however, is presented in the form “H0 is rejected if T(X) ≥ c.” Here T(X) is called a test statistic, and c is called a critical value. Then the test function can be expressed as

δ(X) = 1 if T(X) ≥ c; 0 otherwise.    (12)

Thus, we obtain K(θ0) = P(T(X) ≥ c | θ = θ0). The probability of type I error (i.e., “H0 is incorrectly rejected when H0 is true”) is defined by

α = sup_{θ0∈Θ0} K(θ0),

which is also known as the size of the test. Having calculated the size α of the test, (11) or (12) is said to be a level α test, or a test with significance level α.
13.2 Uniformly most powerful test
What is the probability that we incorrectly accept H0 when it is actually false? Such a probability β is called the probability of type II error. Then the value (1 − β) is known as the power of the test, indicating how correctly we can reject H0 when it is actually false. Suppose that H0 is in fact false, say θ = θ1 for some θ1 ∈ Θ1. Then the power of the test is calculated by K(θ1).

Suppose that the test (11) has size α. This test is said to be uniformly most powerful if it satisfies

K(θ1) ≥ K′(θ1) for all θ1 ∈ Θ1

for the power function K′ of every other level α test. Furthermore, if this test is given in the form (12) with test statistic T(X), then the test statistic T(X) is said to be optimal.
14 Likelihood ratio test
Consider the testing problem with simple (null and alternative) hypotheses:

H0 : θ = θ0 versus HA : θ = θ1.

Let L(θ; x) = Π_{i=1}^n f(xi; θ) be the likelihood function, and let

L(θ0, θ1; x) = L(θ1, x) / L(θ0, x)

be the likelihood ratio. Then

δ(X) = 1 if L(θ0, θ1; X) ≥ c; 0 otherwise,    (13)

becomes a uniformly most powerful test, and is called the Neyman–Pearson test.

The test function (13) has the following property: For any function ψ(x) satisfying 0 ≤ ψ(x) ≤ 1,

E(ψ(X) | θ = θ1) − E(δ(X) | θ = θ1) ≤ c [E(ψ(X) | θ = θ0) − E(δ(X) | θ = θ0)].
14.1 Monotone likelihood ratio family
Let f(x; θ) be a joint density function with parameter θ, and let L(θ0, θ1; x) be the likelihood ratio. Suppose that T(X) is a statistic and does not depend on the parameter θ. Then f(x; θ) is called a monotone likelihood ratio family in T(X) if

1. f(x; θ0) and f(x; θ1) are distinct for θ0 ≠ θ1;
2. L(θ0, θ1; x) is a non-decreasing function of T(x) whenever θ0 < θ1.

Now consider the following test problem:

H0 : θ ≤ θ0 (or H0 : θ = θ0) versus HA : θ > θ0.    (14)

If f(x; θ) is a monotone likelihood ratio family in T(X), then the test functions (12) and (13) are equivalent whenever θ0 < θ1, and the power function K(θ) for these tests becomes an increasing function. Furthermore, T(X) is an optimal test statistic, and the size of the test is simply given by α = K(θ0).

Suppose that f(x; θ) is of the exponential family f(x; θ) = exp[c(θ)u(x) + h(x) + d(θ)], x ∈ A*, and that c(θ) is a strictly increasing function. Then f(x; θ) is a monotone likelihood ratio family in u(X). And the natural sufficient statistic u(X) becomes an optimal test statistic.

Remark. Essentially, uniformly most powerful tests exist only for the test problem (14).
14.2 Test procedure
The Neyman–Pearson test (13) can be generalized for the composite hypotheses in (10): (i) obtain the maximum likelihood estimate (MLE) θ̂ of θ, (ii) calculate also the MLE θ̂0 restricted to θ ∈ Θ0, and (iii) construct the likelihood ratio

λ(X) = L(θ̂; X) / L(θ̂0; X) = sup_θ L(θ; X) / sup_{θ∈Θ0} L(θ; X) = max{ sup_{θ∈Θ1} L(θ; X) / sup_{θ∈Θ0} L(θ; X), 1 }.

The test statistic λ(X) yields an excellent test procedure in many practical applications, though it is not an optimal test in general.
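As a concrete sketch (Bernoulli data with a simple null, so that the supremum over Θ0 is just L(θ0); the sample is hypothetical):

```python
# Generalized likelihood ratio lambda(X) for H0: theta = theta0 with Bernoulli data.
import numpy as np

def log_likelihood(theta, x):
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

x = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])    # illustrative sample
theta0 = 0.5

theta_hat = x.mean()                            # unrestricted MLE
lam = np.exp(log_likelihood(theta_hat, x) - log_likelihood(theta0, x))
print(lam)        # large lambda(X) is evidence against H0
```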
15 Bayesian theory
Let f(x; θ) be a density function with parameter θ ∈ Ω. In a Bayesian model the parameter space Ω has a distribution π(θ), called a prior distribution. Furthermore, f(x; θ) is viewed as the conditional distribution of X given θ. By Bayes' rule the conditional density π(θ | x) can be derived as

π(θ | x) = π(θ)f(x; θ) / Σ_{θ∈Ω} π(θ)f(x; θ)   if Ω is discrete;
π(θ | x) = π(θ)f(x; θ) / ∫_Ω π(θ)f(x; θ) dθ   if Ω is continuous.

The distribution π(θ | x) is called the posterior distribution. Whether Ω is discrete or continuous, the posterior distribution π(θ | x) is proportional to π(θ)f(x; θ) up to a constant. Thus, we write

π(θ | x) ∝ π(θ)f(x; θ).
15.1 Conjugate family
It is often the case that both the prior density function π(θ) and the posterior density function π(θ | x) belong to the same family of density functions π(θ; η) with parameter η. Then π(θ; η) is called conjugate to f(x; θ). Let

f(x; θ) = exp[ n c0(θ) + Σ_{j=1}^m cj(θ)kj(x) + h(x) ];
π(θ; η0, η1, . . . , ηm) = exp[ c0(θ)η0 + Σ_{j=1}^m cj(θ)ηj + w(η0, η1, . . . , ηm) ].

Suppose that a prior distribution is given by π(θ) = π(θ; η0, η1, . . . , ηm). Then we obtain the posterior density

π(θ | x) = π(θ; η0 + n, η1 + k1(x), . . . , ηm + km(x)).

Thus, the family of π(θ; η0, η1, . . . , ηm) is conjugate to f(x; θ).
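The standard Beta-Bernoulli pair is an instance of this updating rule; a sketch (using SciPy, with made-up data):

```python
# Beta(a, b) is conjugate to Bernoulli sampling: the posterior update is just
# parameter addition, as in the general formula above.
import numpy as np
from scipy.stats import beta

a, b = 2.0, 2.0                       # prior Beta(a, b) for theta
x = np.array([1, 0, 1, 1, 0, 1, 1])   # observed Bernoulli data

a_post = a + x.sum()                  # add the number of successes
b_post = b + len(x) - x.sum()         # add the number of failures
print(beta.mean(a_post, b_post))      # posterior mean of theta (= 7/11 here)
```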
15.2 Decision model
Given a random sample X from f(x; θ), we can introduce a decision function δ(X), and incur a loss l(θ, δ(X)) associated with the state of θ. Together we calculate the risk by

R(θ, δ) = ∫ l(θ, δ(x)) f(x; θ) dx.

If the decision function δ is strictly dominated by no other decision function δ′, that is, if no decision function δ′ satisfies R(θ, δ′) ≤ R(θ, δ) for all θ ∈ Ω with strict inequality for some θ ∈ Ω, then δ is called admissible.

In the Bayesian model where the parameter θ has the prior distribution π(θ), we can define the Bayes risk by

r(δ) = ∫_Ω R(θ, δ) π(θ) dθ.

Then the decision function δ* is called a Bayes solution if δ* minimizes the Bayes risk r(δ). When the parameter space Ω is an interval on R, π(θ) > 0, and R(θ, δ) is continuous at every point θ and every decision function δ, the Bayes solution δ* is admissible.

Having observed X = x, we can compute the posterior density π(θ | x), and construct the posterior risk by

r(δ(x) | x) = ∫_Ω l(θ, δ(x)) π(θ | x) dθ.

If there is a decision function δ0 such that δ0(x) minimizes the posterior risk r(δ(x) | x) for every x, then δ0 is a Bayes solution.