SOLUTION FOR HOMEWORK 3, STAT 4352
Welcome to your third homework. We finish point estimation; your Exam 1 is next
week and it will be close to HW1-HW3.
Recall that X n := (X1 , . . . , Xn ) denotes the vector of n observations.
Try to find mistakes (and get extra points) in my solutions. Typically they are silly
arithmetic mistakes (not methodological ones). They allow me to check that you did your
HW on your own. Please do not e-mail me about your findings — just mention them on the
first page of your solution and count extra points.
Now let us look at your problems.
1. Problem 10.51. Let X1 , . . . , Xn be iid according to Expon(θ), so
fθX (x) = (1/θ)e−x/θ I(x > 0), θ ∈ Ω := (0, ∞).
Please note that it is important to write this density with indicator function showing its
support. In some cases the support may depend on a parameter of interest, and then this
fact is always very important. We shall see such an example in this homework.
For the exponential distribution we know that Eθ (X) = θ (you may check this by a direct
calculation), so we get a simple method of moments estimator
Θ̂_MME = X̄.
This is the answer. But I would like to continue a bit. The method of moments estimator
(or a generalized one) allows you to work with any moment (or any function). Let us consider
the second moment and equate the sample second moment to the theoretical one. Recall that
Var_θ(X) = θ^2, and thus

E_θ(X^2) = Var_θ(X) + (E_θ(X))^2 = 2θ^2.
The sample second moment is n^{-1} Σ_{i=1}^n X_i^2, and we get another method of moments estimator

Θ̃_MME = [n^{-1} Σ_{i=1}^n X_i^2 / 2]^{1/2}.
Note that these MM estimators are different, and this is OK. Then a statistician should
choose a better one. Which one do you think is better? You may use the notion of efficiency
to resolve the issue (compare their MSEs (mean squared errors) E(θ̂ − θ)2 and choose an
estimator with the smaller MSE). By the way, which estimator is based on the sufficient
statistic?
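To see how such a comparison might look in practice, here is a small simulation sketch in Python (the true θ, the sample size, and the number of replications are my own choices, not part of the problem):

    import numpy as np

    rng = np.random.default_rng(0)
    theta, n, reps = 2.0, 50, 20000

    x = rng.exponential(scale=theta, size=(reps, n))
    mme1 = x.mean(axis=1)                      # Theta-hat_MME = X-bar
    mme2 = np.sqrt((x**2).mean(axis=1) / 2.0)  # Theta-tilde_MME from the second moment

    print("MSE of the first-moment MME: ", ((mme1 - theta)**2).mean())
    print("MSE of the second-moment MME:", ((mme2 - theta)**2).mean())

Comparing the two printed MSEs answers the efficiency question empirically.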
2. Problem 10.53. Here X_1, . . . , X_n are Poisson(λ). Recall that E_λ(X) = λ and
Var_λ(X) = λ.
The MME is easy to get via the first moment, and we have
λ̂_MME = X̄.
This is the answer. But again, as an extra example, I can suggest an MME based on the
second moment. Indeed, E_λ(X^2) = Var_λ(X) + (E_λ X)^2 = λ + λ^2, and this yields that
λ̃_MME + λ̃_MME^2 = n^{-1} Σ_{i=1}^n X_i^2.
Then you need to solve this equation to get the MME. Obviously it is a more complicated
estimator, but it is yet another MME.
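For completeness: this is just a quadratic equation in λ̃_MME. Writing m_2 := n^{-1} Σ_{i=1}^n X_i^2 for the sample second moment, the equation λ̃^2 + λ̃ − m_2 = 0 has the positive root

λ̃_MME = [−1 + (1 + 4 m_2)^{1/2}] / 2.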
3. Problem 10.56. Let X1 , . . . , Xn be iid according to the pdf
gθ (x) = θ−1 e−(x−δ)/θ I(x > δ).
Please note that this is a location-exponential family because
X = δ + Z,
where Z is a classical exponential RV with f_θ^Z(z) = θ^{-1} e^{−z/θ} I(z > 0). I can go even further
by saying that we are dealing with a location-scale family because
X = δ + θZ0 ,
where f Z0 (z) = e−z I(z > 0).
So now we know the meaning of parameters δ and θ: the former is the location (shift)
and the latter is the scale (multiplier).
Note that this understanding simplifies all calculations because you can easily figure out
(otherwise do calculations) that
E_{δ,θ}(X) = δ + θ,    Var_{δ,θ}(X) = θ^2.
These two familiar results yield E_{δ,θ}(X^2) = θ^2 + (δ + θ)^2, and we get the following system of
two equations to find the pair of MMEs:

δ̂ + θ̂ = X̄,

2θ̂^2 + 2δ̂θ̂ + δ̂^2 = n^{-1} Σ_{i=1}^n X_i^2.
To solve this system, we square both sides of the first equality and then subtract the
result from the second equality. We get a new system
δ̂ + θ̂ = X̄,

θ̂^2 = n^{-1} Σ_{i=1}^n X_i^2 − X̄^2.
This, together with simple algebra, yields the answer

δ̂_MME = X̄ − [n^{-1} Σ_{i=1}^n X_i^2 − X̄^2]^{1/2},

θ̂_MME = [n^{-1} Σ_{i=1}^n X_i^2 − X̄^2]^{1/2}.
Remark: We need to check that n^{-1} Σ_{i=1}^n X_i^2 − X̄^2 ≥ 0 for the estimator to be well
defined. This may be done via the famous Hölder inequality

(Σ_{j=1}^m a_j)^2 ≤ m Σ_{j=1}^m a_j^2.

Indeed, applying it with a_j = X_j and m = n gives (Σ_{j=1}^n X_j)^2 ≤ n Σ_{j=1}^n X_j^2, that is, X̄^2 ≤ n^{-1} Σ_{j=1}^n X_j^2.
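As a quick numerical sanity check of these two formulas, here is a Python sketch with made-up values of δ, θ, and n (my own choices, purely for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    delta, theta, n = 3.0, 2.0, 10000

    x = delta + rng.exponential(scale=theta, size=n)  # location-exponential sample
    m2 = (x**2).mean()                                # sample second moment
    theta_hat = np.sqrt(m2 - x.mean()**2)             # theta-hat_MME
    delta_hat = x.mean() - theta_hat                  # delta-hat_MME
    print(delta_hat, theta_hat)                       # should be close to (3, 2)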
4. Problem 10.59. Here X_1, . . . , X_n are Poisson(λ), λ ∈ Ω = (0, ∞). Recall that
E_λ(X) = λ and Var_λ(X) = λ. Then, by definition of the MLE:

λ̂_MLE := arg max_{λ∈Ω} Π_{l=1}^n f_λ(X_l) =: arg max_{λ∈Ω} L_{X^n}(λ)
        = arg max_{λ∈Ω} Σ_{l=1}^n ln(f_λ(X_l)) =: arg max_{λ∈Ω} ln L_{X^n}(λ).
For the Poisson pmf f_λ(x) = e^{−λ} λ^x / x! we get

ln L_{X^n}(λ) = −nλ + Σ_{l=1}^n X_l ln(λ) − Σ_{l=1}^n ln(X_l!).
Now we need to find λ̂_MLE at which the above loglikelihood attains its maximum over all
λ ∈ Ω. You can do this in the usual way: take the derivative with respect to λ (that is, calculate
∂ ln L_{X^n}(λ)/∂λ), equate it to zero, solve with respect to λ, and then check that the
solution indeed maximizes the loglikelihood. Here equating the derivative to zero yields

−n + Σ_{l=1}^n X_l / λ = 0,

and we get

λ̂_MLE = X̄.
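A quick way to verify that this critical point is the maximizer: the second derivative of the loglikelihood is

∂^2 ln L_{X^n}(λ)/∂λ^2 = −Σ_{l=1}^n X_l / λ^2 ≤ 0,

so the loglikelihood is concave in λ and the critical point X̄ is indeed its maximum (provided Σ_{l=1}^n X_l > 0).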
Note that for the Poisson setting the MME and MLE coincide; in general they may be
different.
5. Problem 10.62. Here X1 , . . . , Xn are iid N(µ, σ 2 ) with the mean µ being known and
the parameter of interest being the variance σ 2 . Note that σ 2 ∈ Ω = (0, ∞). Then we are
interested in the MLE. Write:
σ̂^2_MLE = arg max_{σ^2 ∈ Ω} ln L_{X^n}(σ^2).

Here

ln L_{X^n}(σ^2) = Σ_{l=1}^n ln([2πσ^2]^{-1/2} e^{−(X_l − µ)^2/(2σ^2)}) = −(n/2) ln(2πσ^2) − (1/(2σ^2)) Σ_{l=1}^n (X_l − µ)^2.

This expression takes on its maximum at

σ̂^2_MLE = n^{-1} Σ_{l=1}^n (X_l − µ)^2.

Note that this is also the MME.
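As a check of the maximization step, differentiate the loglikelihood with respect to σ^2:

∂ ln L_{X^n}(σ^2)/∂σ^2 = −n/(2σ^2) + (1/(2σ^4)) Σ_{l=1}^n (X_l − µ)^2,

which is positive for σ^2 < n^{-1} Σ_{l=1}^n (X_l − µ)^2 and negative afterwards, so the loglikelihood is indeed maximized at σ̂^2_MLE.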
6. Problem 10.66. Let X1 , . . . , Xn be iid according to the pdf
gθ (x) = θ−1 e−(x−δ)/θ I(x > δ).
Then
L_{X^n}(δ, θ) = θ^{-n} e^{−Σ_{l=1}^n (X_l − δ)/θ} I(X(1) > δ).
Recall that X(1) = min(X1 , . . . , Xn ) is the minimal observation [the first ordered observation].
This is the case that I wrote you about earlier: it is absolutely crucial to take into account
the indicator function (the support) because here the parameter δ defines the support.
By its definition,
(δ̂_MLE, θ̂_MLE) := arg max_{δ ∈ (−∞,∞), θ ∈ (0,∞)} ln(L_{X^n}(δ, θ)).
Note that
L(δ, θ) := ln(L_{X^n}(δ, θ)) = −n ln(θ) − θ^{-1} Σ_{l=1}^n (X_l − δ) + ln I(X(1) ≥ δ).
Now the crucial step: you should graph the loglikelihood L as a function of δ and visualize
that it takes on its maximum at δ = X(1): the sum −θ^{-1} Σ_{l=1}^n (X_l − δ) increases in δ, while the
loglikelihood equals −∞ for δ > X(1). So we get δ̂_MLE = X(1). Then, taking the derivative
with respect to θ, we get θ̂_MLE = n^{-1} Σ_{l=1}^n (X_l − X(1)).
Answer: (δ̂_MLE, θ̂_MLE) = (X(1), n^{-1} Σ_{l=1}^n (X_l − X(1))). Please note that δ̂_MLE is a biased
estimator; this is a rather typical outcome.
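Here is a small Python simulation sketch illustrating that bias (the values of δ, θ, n, and the number of replications are my own choices):

    import numpy as np

    rng = np.random.default_rng(2)
    delta, theta, n, reps = 3.0, 2.0, 20, 50000

    x = delta + rng.exponential(scale=theta, size=(reps, n))
    delta_hat = x.min(axis=1)                          # delta-hat_MLE = X_(1)
    theta_hat = (x - delta_hat[:, None]).mean(axis=1)  # theta-hat_MLE

    print("average delta-hat:", delta_hat.mean())  # close to delta + theta/n = 3.1, not 3
    print("average theta-hat:", theta_hat.mean())  # close to theta*(n-1)/n = 1.9, not 2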
7. Problem 10.73. Consider iid uniform observations X1 , . . . , Xn with the parametric pdf
fθ (x) = I(θ − 1/2 < x < θ + 1/2).
As soon as the parameter appears in the indicator function you should be very cautious: typically
a graph, and not differentiation, will help you to find the MLE. Also, it is very
helpful to figure out the nature of the parameter. Here it is obviously a location parameter,
and you can write

X = θ + Z,   Z ∼ Uniform(−1/2, 1/2).
The latter helps you to guess a correct estimator, to check a suggested one and, if
necessary, to simplify calculations of descriptive characteristics (mean, variance, etc.).
Well, now we need to write down the likelihood function (recall that this is just the joint
density, only considered as a function of the parameter given a vector of observations):

L_{X^n}(θ) = Π_{l=1}^n I(θ − 1/2 < X_l < θ + 1/2) = I(θ − 1/2 < X(1) ≤ X(n) < θ + 1/2).
Note that the latter expression implies that (X(1) , X(n) ) is a sufficient statistic (due to the
Factorization Theorem). As a result, any good estimator, and the MLE in particular, must
be a function of only these two statistics. Another remark: it is possible to show (there is a
technique for doing this, which is beyond the objectives of this class) that this pair of
extreme observations is also the minimal sufficient statistic. Please look at the situation: we
have 1 parameter and need 2 univariate statistics (X(1), X(n)) to have a sufficient statistic;
this is the limit of data reduction here. Nonetheless, this is a huge data reduction whenever
n is large. Just think about this: to estimate θ you do not need any observation that lies
between the two extreme ones! This is not a trivial assertion.
Well, now let us return to the problem at hand. If you look at the graph of the likelihood
function as a function of θ, then you may conclude that it attains its maximum at every θ such
that

X(n) − 1/2 < θ < X(1) + 1/2.    (1)
As a result, we get a very curious MLE: any point within this interval can be declared as
the MLE (the MLE is not unique!).
Now we can consider the particular questions at hand.
(a). Let Θ̂1 = (1/2)(X(1) + X(n)). We need to check that this estimator satisfies (1). We
just plug this estimator into (1) and get
X(n) − 1/2 < (1/2)(X(1) + X(n) ) < X(1) + 1/2.
The latter relation is true because it is equivalent to the following valid inequality:
X(n) − X(1) < 1.
(b) Let Θ̂2 = (1/3)(X(1) + 2X(n) ) be another candidate for the MLE. Then it should
satisfy (1). In particular, if this is the MLE then
(1/3)(X(1) + 2X(n) ) < X(1) + 1/2
should hold. The latter inequality is equivalent to
X(n) − X(1) < 3/4
which obviously may fail to hold. This contradiction shows that this estimator, despite being
a function of the sufficient statistic, is not the MLE.
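The conclusions of parts (a) and (b) can also be seen in a simulation sketch (the true θ, the sample size, and the number of replications are my own choices); it checks how often each candidate falls inside the MLE interval (1):

    import numpy as np

    rng = np.random.default_rng(3)
    theta, n, reps = 0.0, 5, 10000

    x = rng.uniform(theta - 0.5, theta + 0.5, size=(reps, n))
    x1, xn = x.min(axis=1), x.max(axis=1)

    t1 = 0.5 * (x1 + xn)       # Theta-hat_1
    t2 = (x1 + 2 * xn) / 3.0   # Theta-hat_2

    inside = lambda t: ((xn - 0.5 < t) & (t < x1 + 0.5)).mean()
    print("Theta-hat_1 inside (1):", inside(t1))  # always 1.0
    print("Theta-hat_2 inside (1):", inside(t2))  # typically well below 1.0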
8. Problem 10.74. Here we are exploring the Bayesian approach, where the parameter of
interest is treated as a realization of a random variable. For the problem at hand, X ∼ Binom(n, θ)
and θ is a realization (which we do not directly observe) of a beta RV Θ ∼ Beta(α, β).
[Please note that here your knowledge of basic/classical distributions becomes absolutely
crucial: you cannot solve any problem without knowing formulae for pmf/pdf; so it is time
to refresh them.]
In other words, here we are observing a binomial random variable whose parameter
(the probability of success) has a beta prior.
To find a Bayesian estimator, we need to find a posterior distribution of the parameter
of interest and then calculate its mean. [Please note that your knowledge of the means of classical
distributions becomes very handy here: as soon as you recognize the underlying posterior
distribution, you can use a formula for calculating its mean.]
Given this information, the posterior distribution of Θ given the observation X is

f^{Θ|X}(θ|x) = f^Θ(θ) f^{X|Θ}(x|θ) / f^X(x)
             = [Γ(n + α + β)/(Γ(x + α) Γ(n − x + β))] θ^{x+α−1} (1 − θ)^{(n−x+β)−1}.
The algebra leading to the last equality is explained on page 345.
Now you can see that the posterior distribution is again Beta(x + α, n − x + β). There
are two consequences of this fact. First, by definition, if a prior density and the corresponding
posterior density are from the same family of distributions, then the prior is called conjugate. This
is the case that Bayesian statisticians like a lot because it methodologically supports the
Bayesian approach and also simplifies formulae. Second, we know a formula for the mean of
a beta RV, and using it we get the Bayesian estimator
Θ̂_B = E(Θ|X) = (X + α)/[(α + X) + (n − X + β)] = (X + α)/(α + n + β).
Now we actually can consider the exercise at hand. A general remark: a Bayesian estimator
is typically a linear combination of the prior mean and the MLE, with weights
depending on the variances of these two estimates. In general, as n → ∞, the Bayesian estimator
approaches the MLE.
Let us check that this is the case for the problem at hand. Write

Θ̂_B = (X/n) · n/(α + β + n) + [α/(α + β)] · (α + β)/(α + β + n).
Now, if we denote

w := n/(α + β + n),
we get the desired representation

Θ̂_B = w X̄ + (1 − w) θ_0,

where X̄ := X/n and θ_0 = E(Θ) = α/(α + β) is the prior mean of Θ.
Now, the problem at hand asks us to work a bit further on the weight. The variance of
the beta RV Θ is

Var(Θ) := σ_0^2 = αβ/[(α + β)^2 (α + β + 1)].

Well, it is plain to see that

θ_0(1 − θ_0) = αβ/(α + β)^2.
Then simple algebra yields

σ_0^2 = θ_0(1 − θ_0)/(α + β + 1),

which in its turn yields

α + β = θ_0(1 − θ_0)/σ_0^2 − 1.
Using this we get the desired

w = n/[n + θ_0(1 − θ_0) σ_0^{-2} − 1].
Problem is solved.
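As a numerical sanity check of this representation, here is a short Python sketch (the values of α, β, n, and X are arbitrary choices of mine):

    # check that (X + alpha)/(alpha + beta + n) equals w*(X/n) + (1 - w)*theta_0
    alpha, beta, n, X = 2.0, 3.0, 30, 18

    theta0 = alpha / (alpha + beta)                                      # prior mean
    sigma0_sq = alpha * beta / ((alpha + beta)**2 * (alpha + beta + 1))  # prior variance
    w = n / (n + theta0 * (1 - theta0) / sigma0_sq - 1)                  # weight via theta_0 and sigma_0^2

    direct = (X + alpha) / (alpha + beta + n)
    weighted = w * (X / n) + (1 - w) * theta0
    print(direct, weighted)  # the two numbers agree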
9. Problem 10.76. Here X ∼ N(µ, σ 2 ) with σ 2 being known. A sample of size n is given.
The parameter of interest is the population mean µ, and a Bayesian approach is considered
with the Normal prior M ∼ N(µ_0, σ_0^2). In other words, the Bayesian approach suggests
thinking of the estimated µ as a realization of a random variable M which has a normal
distribution with the given mean and variance.
As a result, we know that the Bayesian estimator is the mean of the posterior distribution.
The posterior distribution is calculated in Th. 10.6, and it is again normal, N(µ_1, σ_1^2), where

µ_1 = X̄ · nσ_0^2/(nσ_0^2 + σ^2) + µ_0 · σ^2/(nσ_0^2 + σ^2);    1/σ_1^2 = n/σ^2 + 1/σ_0^2.
Note that this theorem implies that the normal distribution is the conjugate prior: the
prior is normal and the posterior is normal as well.
We can conclude that the Bayesian estimator is
M̂B = E(M|X̄) = w X̄ + (1 − w)µ0 ,
that is, the Bayesian estimator is a linear combination of the MLE (here X̄) and
the prior mean (the pure Bayesian estimate when no observations are available). Recall that
this is a rather typical outcome, and the Bayesian estimator approaches the MLE as n → ∞.
A direct (simple) calculation shows that
w = n/[n + σ 2 /σ02 ].
Problem is solved.
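A minimal Python sketch of these posterior formulas, with made-up numbers for the prior and the data summary (not taken from the problem):

    import math

    n, xbar = 25, 10.3          # sample size and sample mean (made up)
    sigma_sq = 4.0              # known data variance sigma^2 (made up)
    mu0, sigma0_sq = 9.0, 1.0   # prior mean and prior variance (made up)

    w = n / (n + sigma_sq / sigma0_sq)                  # weight on the sample mean
    mu1 = w * xbar + (1 - w) * mu0                      # posterior mean = Bayes estimate
    sigma1_sq = 1.0 / (n / sigma_sq + 1.0 / sigma0_sq)  # posterior variance
    print(mu1, math.sqrt(sigma1_sq))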
10. Problem 10.77. Here a Poisson RV X with an unknown intensity λ is observed. The
problem is to estimate λ. A Bayesian approach is suggested with the prior distribution for
the intensity Λ being Gamma(α, β). In other words, X ∼ Poisson(Λ) and Λ ∼ Gamma(α, β).
To find a Bayesian estimator, we need to evaluate the posterior distribution of Λ given X
and then calculate its mean; that mean will be the Bayesian estimator. We do this in two
steps.
(a) To find the posterior distribution we begin with the joint pdf

f^{Λ,X}(λ, x) = f^Λ(λ) f^{X|Λ}(x|λ)
             = [1/(Γ(α) β^α)] λ^{α−1} e^{−λ/β} e^{−λ} λ^x [x!]^{-1} I(λ > 0) I(x ∈ {0, 1, . . .}).
Then the posterior pdf is

f^{Λ|X}(λ|x) = f^{Λ,X}(λ, x) / f^X(x) = λ^{(α+x)−1} e^{−λ(1+1/β)} / [Γ(α) β^α f^X(x) x!] I(λ > 0).    (2)
Now let me explain what smart Bayesian statisticians do. They do not calculate f^X(x) or
try to simplify (2); instead they look at (2) as a density in λ and try to guess what family it
is from. Here it is plain to see that the posterior pdf is again Gamma; more exactly, it is
Gamma(α + x, β/(1 + β)). Note that the Gamma prior for the Poisson intensity parameter
is the conjugate prior because the posterior is from the same family.
As soon as you recognize the posterior distribution, you know what the Bayesian estimator
is: it is the expected value of this Gamma RV, namely

Λ̂_B = E(Λ|X) = (α + X)[β/(1 + β)] = β(α + X)/(1 + β).
The problem is solved.
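As a sanity check of the posterior-mean formula, here is a Python sketch that integrates the unnormalized posterior on a grid (the values of α, β, and x are arbitrary choices of mine):

    import numpy as np
    from scipy.stats import gamma, poisson

    alpha, beta, x = 2.0, 1.5, 4

    # unnormalized posterior on a grid: Gamma prior density times Poisson likelihood
    lam = np.linspace(1e-6, 60, 200000)
    post = gamma.pdf(lam, a=alpha, scale=beta) * poisson.pmf(x, lam)

    numeric_mean = (lam * post).sum() / post.sum()        # posterior mean on the grid
    print(numeric_mean, beta * (alpha + x) / (1 + beta))  # the two numbers agree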
11. Problem 10.94. This is a curious problem on the application and analysis of the Bayesian
approach. It is given that the observation X is a binomial RV Binom(n = 30, θ) and
someone believes that the probability of success θ is a realization of a Beta random variable
Θ ∼ Beta(α, β). Parameters α and β are not given; instead it is given that EΘ = θ0 = .74
and Var(Θ) = σ_0^2 = 3^2 = 9. [Do you think that this information is enough to find the
parameters of the underlying beta distribution? If “yes”, then what are they?]
Now we are in a position to answer the questions.
(a). Using only the prior information (that is, no observation is available), the best MSE
estimate is the prior mean
Θ̂prior = EΘ = .74.
(b) Based on the direct information, the MLE and the MME estimators are the same
and they are
Θ̂_MLE = Θ̂_MME = X̄ = X/n = 18/30.
[Please compare answers in (a) and (b) parts. Are they far enough?]
(c) The Bayesian estimator with Θ ∼ Beta(α, β) is (see p. 345)

Θ̂_B = (X + α)/(α + β + n).
Now, we can either find α and β from the mean and variance information, or use results of
our homework problem 10.74 and get
Θ̂_B = w X̄ + (1 − w) E(Θ),

where

w = n/[n + θ_0(1 − θ_0)/σ_0^2 − 1] = 30/[30 + (.74)(.26)/9 − 1].
12. Problem 10.96. Let X be a grade, and assume that X ∼ N(µ, σ^2) with σ^2 = (7.4)^2.
The professor believes, based on prior knowledge, that the mean M ∼ N(µ_0 = 65.2, σ_0^2 = (1.5)^2).
After the exam, the observation is X̄ = 72.9.
(a) Denote by Z the standard normal random variable. Then using z-scoring yields
P(63.0 < M < 68.0) = P((63.0 − µ_0)/σ_0 < (M − µ_0)/σ_0 < (68.0 − µ_0)/σ_0)
                   = P((63 − 65.2)/1.5 < Z < (68 − 65.2)/1.5) = P(−2.2/1.5 < Z < 2.8/1.5).
Then you use the Table; I skip this step here.
(b) As we know from Theorem 10.6, M|X̄ is normally distributed with

µ_1 = (n X̄ σ_0^2 + µ_0 σ^2)/(n σ_0^2 + σ^2),    σ_1^2 = σ^2 σ_0^2/(σ^2 + n σ_0^2).

Here: n = 40, X̄ = 72.9, σ_0^2 = (1.5)^2, σ^2 = (7.4)^2, µ_0 = 65.2. Plug in these numbers and
then

P(63 < M < 68 | X̄ = 72.9) = P((63 − µ_1)/σ_1 < Z < (68 − µ_1)/σ_1).
Find the numbers and use the Table.
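If you prefer to check the Table lookups numerically, here is a minimal Python sketch that evaluates both probabilities with scipy (the numbers are the ones given in the problem):

    from scipy.stats import norm

    # (a) prior probability, M ~ N(65.2, 1.5^2)
    mu0, sigma0 = 65.2, 1.5
    p_prior = norm.cdf(68, mu0, sigma0) - norm.cdf(63, mu0, sigma0)

    # (b) posterior probability, M | X-bar ~ N(mu1, sigma1^2) with n = 40, X-bar = 72.9
    n, xbar, sigma_sq = 40, 72.9, 7.4**2
    sigma0_sq = sigma0**2
    mu1 = (n * xbar * sigma0_sq + mu0 * sigma_sq) / (n * sigma0_sq + sigma_sq)
    sigma1 = (sigma_sq * sigma0_sq / (sigma_sq + n * sigma0_sq)) ** 0.5
    p_post = norm.cdf(68, mu1, sigma1) - norm.cdf(63, mu1, sigma1)

    print(p_prior, p_post)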