Chapter 2
Elements of Statistical Inference

2.1 Principles of Data Reduction
Example 2.1. Cholesterol levels continued.
Suppose we want to make inference on the mean cholesterol level on the second day after a heart attack in such a population of people in a northeastern American state. We have data from 28 patients, which are a realization of a random sample of size n = 28. One of the first things we can do is to calculate estimates of the population mean (µ) and of the population variance (σ²). To do this we can use functions of the random sample, such as the sample mean ($\overline{X}$) and the sample variance ($S^2$), respectively, where
$$\overline{X} = \frac{1}{28}\sum_{i=1}^{28} X_i, \qquad S^2 = \frac{1}{27}\sum_{i=1}^{28}\left(X_i - \overline{X}\right)^2.$$
Here we have $\bar{x} = 257$ and $s^2 = 32$.
Note that $\overline{X}$ and $S^2$ are random variables, as they are functions of random variables, while $\bar{x}$ and $s^2$ are their values obtained for particular values of the rvs; in this case, for the 28 patients who took part in the study. A different group of 28 patients in the state who suffered a heart attack would give different values of $\overline{X}$ and $S^2$.
A random variable which is a function of a random sample, $T(X_1, \ldots, X_n)$, is called an estimator of a population parameter ϑ, while its value is called an estimate of the population parameter ϑ.
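As a small illustration (a sketch, not part of the original study), the estimates could be computed in Python from a vector of 28 measurements; the numbers generated below are hypothetical placeholders, not the cholesterol data referred to above.

    import numpy as np

    # Hypothetical measurements for 28 patients (illustrative values only,
    # not the data from the study described above).
    rng = np.random.default_rng(1)
    x = rng.normal(loc=257, scale=6, size=28)

    x_bar = x.mean()        # estimate of the population mean mu
    s2 = x.var(ddof=1)      # estimate of the population variance sigma^2 (divisor n - 1 = 27)

    print(x_bar, s2)

Running this with the actual 28 observations in place of the simulated vector would reproduce the estimates $\bar{x} = 257$ and $s^2 = 32$ quoted above.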
Notation
Special symbols, such as $\overline{X}$ or $S^2$, are used to denote estimators of some common parameters, in these cases the population mean and variance. Their “small letter” counterparts denote the respective estimates.
A hat over the symbol of the parameter of interest, for example $\hat{\vartheta}$, indicates an estimator of ϑ. Here we have to be careful, because $\hat{\vartheta}$ is often used to denote an estimate of ϑ as well. It is better to write $\hat{\vartheta}_{\mathrm{obs}}$ to indicate that this is the value of the estimator $\hat{\vartheta}$ obtained for the observed values of a random sample.
A random sample will often be denoted by a bold capital letter and its realization by a bold lowercase letter. For example, $\boldsymbol{X}$ denotes a random sample $(X_1, \ldots, X_n)$ and $\boldsymbol{x}$ denotes its realization $(x_1, \ldots, x_n)$.
We often know the family of distributions related to the variable of interest, which we denote by $\mathcal{P} = \{P_\vartheta : \vartheta \in \Theta \subset \mathbb{R}^s\}$ (see Definition 1.10). The family depends on a parameter vector ϑ which belongs to a parameter space Θ. For example, for the family of normal distributions $\mathcal{P} = \{f(y; \mu, \sigma^2) : -\infty < \mu < \infty,\ \sigma^2 > 0\}$, the parameter vector is two-dimensional, $\vartheta = (\mu, \sigma^2)$, and the parameter space is $\Theta = \mathbb{R} \times \mathbb{R}^+ \setminus \{0\}$.
Properties of the estimators of the unknown parameters and the methods of finding
them will be our topics for the next several lectures.
2.1.1 Sufficient statistics
A statistic $T(\boldsymbol{Y})$ represents a random sample $\boldsymbol{Y} = (Y_1, \ldots, Y_n)$ and serves as an estimator of a population parameter ϑ or of a function of ϑ. For inference based on the random sample it is often enough to consider $T(\boldsymbol{Y})$, or a function of it, rather than the whole random sample. Hence, the statistic should represent the data well and should retain all the information about the parameter that is contained in the original observations.
Such a statistic has the property of sufficiency. Any additional information in the
sample, besides the value of the sufficient statistic, does not contain any more
information about the parameter of interest. The principle of sufficiency says that
if T (Y ) is a sufficient statistic for ϑ then any inference about ϑ should depend on
the sample through the value of the statistic only. That is, if y1 and y2 are two
realizations of the random sample Y such that T (y1 ) = T (y2 ), then the inference
about ϑ should be the same whichever of the realizations, y1 or y2 , is used.
Definition 2.1. Let X = (X1 , . . . , Xn ) denote a random sample from a probability distribution with unknown parameters ϑ = (ϑ1 , . . . , ϑp ). Then the statistics
T = (T1 , . . . , Tq ) are said to be jointly sufficient for ϑ if the conditional distribution of X given T does not depend on ϑ.
Example 2.2. Let $\boldsymbol{X} = (X_1, X_2, \ldots, X_n)$ be a random sample from a Bernoulli distribution with probability of success equal to p, i.e.,
$$P(X_i = x_i) = p^{x_i}(1 - p)^{1 - x_i}, \quad i = 1, 2, \ldots, n,$$
where $x_i \in \{0, 1\}$, $p \in (0, 1)$. Then the joint mass function of $\boldsymbol{X}$ is
$$P(\boldsymbol{X} = \boldsymbol{x}) = \prod_{i=1}^{n} p^{x_i}(1 - p)^{1 - x_i} = p^{\sum_{i=1}^{n} x_i}(1 - p)^{n - \sum_{i=1}^{n} x_i}.$$
Let us define a statistic $T = T(\boldsymbol{X})$ as
$$T(\boldsymbol{X}) = \sum_{i=1}^{n} X_i.$$
This is a random variable which denotes the number of successes in n trials and it has a binomial distribution, that is,
$$P(T = t) = \binom{n}{t} p^t (1 - p)^{n - t}, \quad t = 0, 1, \ldots, n.$$
The joint probability of $\boldsymbol{X}$ and $T$ is zero unless $\sum_{i=1}^{n} x_i = t$. That is, for $\sum_{i=1}^{n} x_i = t$ we can write
$$P(\boldsymbol{X} = \boldsymbol{x}, T = t) = P(\boldsymbol{X} = \boldsymbol{x}) = p^{\sum_{i=1}^{n} x_i}(1 - p)^{n - \sum_{i=1}^{n} x_i} = p^t (1 - p)^{n - t}.$$
Hence, the conditional distribution of $\boldsymbol{X} \mid T$ is
$$P(\boldsymbol{X} = \boldsymbol{x} \mid T = t) = \frac{P(\boldsymbol{X} = \boldsymbol{x}, T = t)}{P(T = t)} = \frac{p^t (1 - p)^{n - t}}{\binom{n}{t} p^t (1 - p)^{n - t}} = \frac{1}{\binom{n}{t}}.$$
The conditional distribution does not depend on the parameter p, so the function $T(\boldsymbol{X}) = \sum_{i=1}^{n} X_i$ is a sufficient statistic for p.
This can be interpreted as follows: if we know the number of successes in n trials, $t = \sum_{i=1}^{n} x_i$, then information about the order in which the successes occurred in the Bernoulli experiments gives no additional information about the probability of success, which could be estimated as $\sum_{i=1}^{n} x_i / n$.
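A small simulation sketch (illustrative only, not part of the notes) of this interpretation: conditionally on the total number of successes T = t, every arrangement of successes and failures is equally likely, with probability $1/\binom{n}{t}$, whatever the value of p. The sample size, p and t below are arbitrary choices.

    import numpy as np
    from math import comb

    rng = np.random.default_rng(0)
    n, p, t = 5, 0.3, 2                     # arbitrary illustrative choices

    # Simulate Bernoulli samples and keep only those whose sum equals t.
    samples = rng.binomial(1, p, size=(200000, n))
    kept = samples[samples.sum(axis=1) == t]

    # Empirical frequency of one particular ordering with t successes, e.g. (1, 1, 0, 0, 0).
    pattern = np.array([1, 1, 0, 0, 0])
    freq = np.mean(np.all(kept == pattern, axis=1))

    print(freq, 1 / comb(n, t))             # both close to 1/C(5, 2) = 0.1

Changing p changes how many simulated samples are kept, but not the conditional frequency of any particular ordering, which is exactly the sufficiency property derived above.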
We can identify sufficient statistics more easily using the following result.
Theorem 2.1. Neyman’s factorization.
The statistic $T(\boldsymbol{Y})$ is sufficient for ϑ if and only if the joint pdf (pmf) of $\boldsymbol{Y}$ can be factorized as
$$f_{\boldsymbol{Y}}(\boldsymbol{y}; \vartheta) = g[T(\boldsymbol{y}); \vartheta]\, h(\boldsymbol{y}),$$
where $h(\boldsymbol{y})$ does not depend on the parameters and the function g, which does depend on the parameters, depends on the values of $\boldsymbol{Y}$ only through the statistic T.
Proof. Discrete case for a single parameter.
First, assume that $T(\boldsymbol{Y})$ is a sufficient statistic for ϑ. If $T(\boldsymbol{Y}) = t$, then
$$P(\boldsymbol{Y} = \boldsymbol{y}) = P(\boldsymbol{Y} = \boldsymbol{y}, T(\boldsymbol{Y}) = t) = \underbrace{P(T(\boldsymbol{Y}) = t)}_{=\, g_\vartheta(T(\boldsymbol{y}))}\; \underbrace{P(\boldsymbol{Y} = \boldsymbol{y} \mid T(\boldsymbol{Y}) = t)}_{=\, h(\boldsymbol{y}) \text{ if } T \text{ is a suff. stat.}}.$$
So we get the factorization.
Conversely, let the factorization hold, that is,
$$P(\boldsymbol{Y} = \boldsymbol{y}) = g[T(\boldsymbol{y}); \vartheta]\, h(\boldsymbol{y}).$$
Then,
$$P(\boldsymbol{Y} = \boldsymbol{y} \mid T(\boldsymbol{Y}) = t) = \frac{P(\boldsymbol{Y} = \boldsymbol{y}, T(\boldsymbol{Y}) = t)}{P(T(\boldsymbol{Y}) = t)}
= \begin{cases} 0, & \text{if } T(\boldsymbol{y}) \neq t; \\[4pt] \dfrac{P(\boldsymbol{Y} = \boldsymbol{y})}{P(T(\boldsymbol{Y}) = t)}, & \text{if } T(\boldsymbol{y}) = t \end{cases}$$
$$= \begin{cases} 0, & \text{if } T(\boldsymbol{y}) \neq t; \\[4pt] \dfrac{P(\boldsymbol{Y} = \boldsymbol{y})}{\sum_{\boldsymbol{y}: T(\boldsymbol{y}) = t} P(\boldsymbol{Y} = \boldsymbol{y})}, & \text{if } T(\boldsymbol{y}) = t \end{cases}
= \begin{cases} 0, & \text{if } T(\boldsymbol{y}) \neq t; \\[4pt] \dfrac{g[T(\boldsymbol{y}); \vartheta]\, h(\boldsymbol{y})}{g[T(\boldsymbol{y}); \vartheta] \sum_{\boldsymbol{y}: T(\boldsymbol{y}) = t} h(\boldsymbol{y})}, & \text{if } T(\boldsymbol{y}) = t \end{cases}
= \begin{cases} 0, & \text{if } T(\boldsymbol{y}) \neq t; \\[4pt] \dfrac{h(\boldsymbol{y})}{\sum_{\boldsymbol{y}: T(\boldsymbol{y}) = t} h(\boldsymbol{y})}, & \text{if } T(\boldsymbol{y}) = t. \end{cases}$$
This does not depend on the parameter ϑ, hence T is a sufficient statistic for ϑ.
As we will later see, sufficient statistics provide optimal estimators and pivots for
confidence intervals and test statistics.
Example 2.3. Suppose that Y = (Y1 , . . . , Yn ) is a random sample from a Poisson(λ)
distribution. Then we may write
$$P(\boldsymbol{Y} = \boldsymbol{y}; \lambda) = \prod_{i=1}^{n} \frac{\lambda^{y_i} e^{-\lambda}}{y_i!} = \frac{\lambda^{\sum_{i=1}^{n} y_i}\, e^{-n\lambda}}{\prod_{i=1}^{n} y_i!} = \lambda^{\sum_{i=1}^{n} y_i}\, e^{-n\lambda} \times \frac{1}{\prod_{i=1}^{n} y_i!}.$$
It follows that $\sum_{i=1}^{n} Y_i$ is a sufficient statistic for λ.
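A brief numerical sketch (illustrative, not from the notes) of what the factorization means in practice: for Poisson data the likelihood depends on λ only through $\sum y_i$, so two hypothetical samples with the same sum give likelihoods whose ratio does not change with λ.

    import numpy as np
    from scipy.stats import poisson

    # Two hypothetical samples with the same total (12) but different values.
    y1 = np.array([1, 3, 2, 4, 2])
    y2 = np.array([2, 2, 2, 2, 4])

    for lam in [0.5, 1.0, 2.0, 5.0]:
        L1 = poisson.pmf(y1, lam).prod()    # joint pmf evaluated at y1
        L2 = poisson.pmf(y2, lam).prod()    # joint pmf evaluated at y2
        print(lam, L1 / L2)                 # the ratio is the same for every lambda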
Exercise 2.1. Suppose that Y = (Y1 , . . . , Yn ) is a random sample from an Exp(λ)
distribution. Obtain a sufficient statistic for λ.
Example 2.4. Suppose that Y = (Y1 , . . . , Yn ) is a random sample from a N (µ, σ 2 )
distribution. Then we may write
$$\begin{aligned}
f_{\boldsymbol{Y}}(\boldsymbol{y}; \mu, \sigma^2) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(y_i - \mu)^2}{2\sigma^2}\right\} \\
&= \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{n} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2\right\} \\
&= (2\pi)^{-\frac{n}{2}} (\sigma^2)^{-\frac{n}{2}} \exp\left\{-\frac{1}{2\sigma^2}\left(\sum_{i=1}^{n} y_i^2 - 2\mu\sum_{i=1}^{n} y_i + n\mu^2\right)\right\} \\
&= (\sigma^2)^{-\frac{n}{2}} \exp\left\{-\frac{1}{2\sigma^2}\left(\sum_{i=1}^{n} y_i^2 - 2\mu\sum_{i=1}^{n} y_i + n\mu^2\right)\right\} \times (2\pi)^{-\frac{n}{2}}.
\end{aligned}$$
Hence, $\sum_{i=1}^{n} Y_i$ and $\sum_{i=1}^{n} Y_i^2$ are jointly sufficient statistics for µ and σ².
Example 2.5. Suppose that Y = (Y1 , . . . , Yn ) is a random sample from a U[0, ϑ]
distribution. Then we may write
$$f_{\boldsymbol{Y}}(\boldsymbol{y}; \vartheta) = \prod_{i=1}^{n} \frac{1}{\vartheta} = \frac{1}{\vartheta^n}, \qquad \max_i y_i \le \vartheta.$$
Now, this may be rewritten as
$$f_{\boldsymbol{Y}}(\boldsymbol{y}; \vartheta) = \frac{1}{\vartheta^n}\, I_{\{\max_i y_i \le \vartheta\}} \times 1.$$
It follows that $\max_i Y_i$ is a sufficient statistic for ϑ.
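Similarly, a small sketch (illustrative only) showing that for U[0, ϑ] data the likelihood, viewed as a function of ϑ, depends on the sample only through $\max_i y_i$; the two samples below are hypothetical.

    import numpy as np

    def unif_lik(y, theta):
        # Likelihood of a U[0, theta] sample: theta**(-n) if max(y) <= theta, else 0.
        y = np.asarray(y)
        return (np.max(y) <= theta) / theta ** len(y)

    # Two hypothetical samples with the same maximum (0.9) but different other values.
    y1 = [0.2, 0.5, 0.9]
    y2 = [0.7, 0.1, 0.9]

    for theta in [0.8, 1.0, 1.5, 2.0]:
        print(theta, unif_lik(y1, theta), unif_lik(y2, theta))   # identical likelihoods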
There can be many sufficient statistics for a parameter of interest. Hence, the question is which one to use. As the purpose of a statistic is to reduce the data without loss of information about the parameter, we would like to find a function of the random sample that achieves the greatest data reduction while still retaining all the information about the parameter.
Definition 2.2. A sufficient statistic T (Y ) is called a minimal sufficient statistic
if, for any other sufficient statistic T ′ (Y ), T (Y ) is a function of T ′ (Y ).
Lemma 2.1. Let $f(\boldsymbol{y}; \vartheta)$ be the pdf of a random sample $\boldsymbol{Y}$. If there exists a function $T(\boldsymbol{Y})$ such that, for any two sample points $\boldsymbol{x}$ and $\boldsymbol{y}$, the ratio $f(\boldsymbol{x}; \vartheta)/f(\boldsymbol{y}; \vartheta)$ is constant as a function of ϑ if and only if $T(\boldsymbol{x}) = T(\boldsymbol{y})$, then $T(\boldsymbol{Y})$ is a minimal sufficient statistic for ϑ.
An analogous lemma holds for discrete rvs.
Example 2.6. Suppose that $\boldsymbol{Y} = (Y_1, \ldots, Y_n)$ is a random sample from a Poisson(λ) distribution. From the factorization theorem we know that $T(\boldsymbol{Y}) = \sum_{i=1}^{n} Y_i$ is a sufficient statistic for λ. The ratio of the joint probability mass functions is
$$\frac{P(\boldsymbol{Y} = \boldsymbol{x}; \lambda)}{P(\boldsymbol{Y} = \boldsymbol{y}; \lambda)} = \frac{\prod_{i=1}^{n} y_i!\; \lambda^{\sum_{i=1}^{n} x_i}\, e^{-n\lambda}}{\lambda^{\sum_{i=1}^{n} y_i}\, e^{-n\lambda}\, \prod_{i=1}^{n} x_i!} = \prod_{i=1}^{n} \frac{y_i!}{x_i!}\; \lambda^{\sum_{i=1}^{n} x_i - \sum_{i=1}^{n} y_i}.$$
This is constant as a function of λ iff $\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} x_i$, that is, $T(\boldsymbol{Y}) = \sum_{i=1}^{n} Y_i$ is a minimal sufficient statistic.
Note that $\overline{Y}$ is also a minimal sufficient statistic for λ. In general, a minimal sufficient statistic is not unique. Any one-to-one function of a minimal sufficient statistic is also a minimal sufficient statistic.
Exercise 2.2. Suppose that $\boldsymbol{Y} = (Y_1, \ldots, Y_n)$ is a random sample from an Exp(λ) distribution. Show that $T(\boldsymbol{Y}) = \sum_{i=1}^{n} Y_i$ is a minimal sufficient statistic for λ.
Example 2.7. Suppose that $\boldsymbol{Y} = (Y_1, \ldots, Y_n)$ is a random sample from a $N(\mu, \sigma^2)$ distribution. Then $T_1 = \sum_{i=1}^{n} Y_i$ and $T_2 = \sum_{i=1}^{n} Y_i^2$ are jointly sufficient statistics for µ and σ². The ratio of the joint densities is
$$\frac{f_{\boldsymbol{Y}}(\boldsymbol{x}; \mu, \sigma^2)}{f_{\boldsymbol{Y}}(\boldsymbol{y}; \mu, \sigma^2)} = \exp\left\{-\frac{1}{2\sigma^2}\left[\sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} y_i^2 - 2\mu\left(\sum_{i=1}^{n} x_i - \sum_{i=1}^{n} y_i\right)\right]\right\}.$$
The ratio does not depend on the parameters µ, σ² iff $\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2$ and $\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$. Hence, $\left(\sum_{i=1}^{n} Y_i, \sum_{i=1}^{n} Y_i^2\right)$ are jointly minimal sufficient statistics for these parameters.
Note that $\overline{Y}$ and $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \overline{Y})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} Y_i^2 - n\overline{Y}^2\right)$ are one-to-one functions of $\left(\sum_{i=1}^{n} Y_i, \sum_{i=1}^{n} Y_i^2\right)$. Hence, $(\overline{Y}, S^2)$ are also jointly minimal sufficient statistics for µ and σ².
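As a quick numerical illustration (a sketch with made-up data, not part of the notes) of this one-to-one relationship, the pair $(\overline{Y}, S^2)$ can be recovered from $(\sum Y_i, \sum Y_i^2)$ alone:

    import numpy as np

    rng = np.random.default_rng(2)
    y = rng.normal(loc=10, scale=3, size=20)    # hypothetical normal sample
    n = len(y)

    t1, t2 = y.sum(), (y ** 2).sum()            # the jointly minimal sufficient statistics

    y_bar = t1 / n
    s2 = (t2 - n * y_bar ** 2) / (n - 1)        # S^2 computed from (t1, t2) only

    print(np.isclose(y_bar, y.mean()), np.isclose(s2, y.var(ddof=1)))   # True True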
2.2 Point Estimation
Suppose that we have data y1 , y2 , . . . , yn which are a realization of Y1 , Y2 , . . . , Yn .
We are interested in the population from which the data are sampled. Suppose that
the probability model is that Yi ∼ N (µ, σ 2 ) independently for i = 1, 2, . . . , n.
Then the probability model for the population is completely specified, except for
the unknown parameters µ and σ 2 .
In general, $Y_1, Y_2, \ldots, Y_n$ have a joint distribution which is specified except for the unknown parameters $\vartheta = (\vartheta_1, \vartheta_2, \ldots, \vartheta_p)^T$. We are interested in estimating some function of the parameters $\varphi = g(\vartheta_1, \ldots, \vartheta_p)$. Note that φ could be just a single parameter, such as µ or σ². We will use a sample statistic $T(Y_1, \ldots, Y_n)$ as a point estimator of φ. For example, $T(Y_1, \ldots, Y_n) = \overline{Y}$ is an estimator of µ. The observed value of T, $T(y_1, \ldots, y_n)$, is the point estimate of φ.
2.2.1 Properties of Point Estimators
An estimator $T(\boldsymbol{Y})$ is a random variable (a function of the rvs $Y_1, \ldots, Y_n$) and the estimate is a single value taken from the distribution of $T(\boldsymbol{Y})$. Since we want our estimate, t, to be close to φ, the random variable T should be centred close to φ and have small variance. Also, we would want our estimator to be such that, as $n \to \infty$, $T \to \varphi$ with probability one.
Below are definitions of some measures of goodness of the estimators.
Definition 2.3. Let Y = (Y1 , Y2 , . . . , Yn )T be a random sample from a distribution Pϑ and let φ = g(ϑ) be a function of the parameters and T (Y ) its point
estimator. Then,
1. The error in estimating φ by estimator T (Y ) is T − φ;
2. The bias of estimator T (Y ) for φ is
bias(T ) = E(T ) − φ,
that is, the expected error;
3. Estimator T (Y ) is unbiased for φ if
E(T ) = φ,
that is, it has zero bias.
A weaker version of unbiasedness is defined below.
Definition 2.4. The estimator T (Y ) is asymptotically unbiased for φ if E(T ) → φ
as n → ∞, that is, the bias tends to zero as n → ∞.
Another common measure of goodness of an estimator is given in the following
definition.
Definition 2.5. The mean square error of $T(\boldsymbol{Y})$ as an estimator of φ is
$$\mathrm{MSE}(T) = E\big[(T - \varphi)^2\big].$$
Note: The mean square error can be written as the sum of the variance and the squared bias of the estimator:
$$\begin{aligned}
\mathrm{MSE}(T) &= E\big[\{T - E(T)\} + \{E(T) - \varphi\}\big]^2 \\
&= E\big[\{T - E(T)\}^2\big] + 2\,E\big[\{T - E(T)\}\{E(T) - \varphi\}\big] + \{E(T) - \varphi\}^2 \\
&= E\big[\{T - E(T)\}^2\big] + 2\,\{E(T) - E(T)\}\{E(T) - \varphi\} + \{E(T) - \varphi\}^2 \\
&= \mathrm{var}(T) + \{\mathrm{bias}(T)\}^2.
\end{aligned}$$
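A short simulation sketch (illustrative only) of this decomposition. The estimator used below, the divisor-n variance estimator for normal data, is a hypothetical example chosen because it has non-zero bias; the decomposition itself holds for any estimator with finite second moment.

    import numpy as np

    rng = np.random.default_rng(3)
    mu, sigma2, n, reps = 0.0, 4.0, 10, 200000

    samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
    T = samples.var(axis=1, ddof=0)             # biased variance estimator (divisor n)

    mse = np.mean((T - sigma2) ** 2)            # Monte Carlo estimate of MSE(T)
    var_T = T.var()                             # Monte Carlo estimate of var(T)
    bias = T.mean() - sigma2                    # Monte Carlo estimate of bias(T)

    print(mse, var_T + bias ** 2)               # the two agree up to Monte Carlo error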
It seems natural that if we increase the sample size n we should get a more precise estimate of φ. The estimators which have this property are called consistent. In the following definition we use the notion of convergence in probability:
A sequence of rvs $X_1, X_2, \ldots$ converges in probability to a random variable X if, for every $\varepsilon > 0$,
$$\lim_{n\to\infty} P\big(|X_n - X| < \varepsilon\big) = 1.$$
Definition 2.6. A sequence $T_n = T_n(Y_1, Y_2, \ldots, Y_n)$, $n = 1, 2, \ldots$, is called a consistent sequence of estimators for φ if $T_n$ converges in probability to φ, that is, for all $\varepsilon > 0$ and for all $\vartheta \in \Theta$,
$$\lim_{n\to\infty} P\big(|T_n - \varphi| < \varepsilon\big) = 1.$$
Note: Consistency means that the probability of our estimator being within some
small ε of φ can be made as close to one as we like by making the sample size n
sufficiently large.
The following Lemma gives a useful tool for establishing consistency of estimators.
Lemma 2.2. A sufficient condition for consistency is that MSE(T ) → 0 as n →
∞, or, equivalently, var(T ) → 0 and bias(T ) → 0 as n → ∞.
Example 2.8. Suppose that $Y_1, \ldots, Y_n$ are independent Poisson(λ) random variables and consider using $\overline{Y}$ to estimate λ. Then we have
$$E(\overline{Y}) = \frac{1}{n}\, E(Y_1 + \ldots + Y_n) = \frac{1}{n}\{E(Y_1) + \ldots + E(Y_n)\} = \frac{1}{n}\, n\lambda = \lambda,$$
and so $\overline{Y}$ is an unbiased estimator of λ. Next, by independence, we have
$$\mathrm{var}(\overline{Y}) = \frac{1}{n^2}\{\mathrm{var}(Y_1) + \ldots + \mathrm{var}(Y_n)\} = \frac{1}{n^2}\, n\lambda = \frac{\lambda}{n},$$
so that $\mathrm{MSE}(\overline{Y}) = \lambda/n$. Thus, $\mathrm{MSE}(\overline{Y}) \to 0$ as $n \to \infty$, and so $\overline{Y}$ is a consistent estimator of λ.
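A brief simulation sketch (illustrative only) of this consistency result: the Monte Carlo estimate of $\mathrm{MSE}(\overline{Y})$ shrinks like λ/n as n grows. The value of λ and the sample sizes below are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(4)
    lam, reps = 3.0, 20000

    for n in [5, 50, 500]:
        y_bar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
        mse = np.mean((y_bar - lam) ** 2)
        print(n, mse, lam / n)                  # simulated MSE versus the theoretical lambda/n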
A more general result about the mean is given in the following theorem, which states that the sample mean approaches the population mean as $n \to \infty$; it holds for any distribution with finite mean and variance.
Theorem 2.2. Weak Law of Large Numbers
Let $X_1, X_2, \ldots$ be IID random variables with $E(X_i) = \mu$ and $\mathrm{var}(X_i) = \sigma^2 < \infty$. Define $\overline{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then, for every $\varepsilon > 0$,
$$\lim_{n\to\infty} P\big(|\overline{X}_n - \mu| < \varepsilon\big) = 1,$$
that is, $\overline{X}_n$ converges in probability to µ.
Proof. The proof is a straightforward application of Markov’s Inequality, which says the following: Let X be a random variable and let g(·) be a non-negative function. Then, for any k > 0,
$$P\big(g(X) \ge k\big) \le \frac{E[g(X)]}{k}. \qquad (2.1) \quad \text{[Markov’s Inequality]}$$
Proof of Markov’s inequality (continuous case):
For any rv X and a non-negative function g(·) we can write
$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx \ge \int_{x:\, g(x) \ge k} g(x) f_X(x)\, dx \ge \int_{x:\, g(x) \ge k} k\, f_X(x)\, dx = k\, P\big(g(X) \ge k\big).$$
Hence (2.1) holds. Then,
$$\begin{aligned}
P\big(|\overline{X}_n - \mu| < \varepsilon\big) &= 1 - P\big(|\overline{X}_n - \mu| \ge \varepsilon\big) = 1 - P\big((\overline{X}_n - \mu)^2 \ge \varepsilon^2\big) \\
&\ge 1 - \frac{E(\overline{X}_n - \mu)^2}{\varepsilon^2} = 1 - \frac{\mathrm{var}(\overline{X}_n)}{\varepsilon^2} = 1 - \frac{\sigma^2}{n\varepsilon^2} \to 1 \quad \text{as } n \to \infty.
\end{aligned}$$
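A small simulation sketch (illustrative only) of the conclusion of the theorem: for Exp(1) data, where µ = 1, the probability that the sample mean lies within ε of µ approaches one as n grows. The tolerance ε and the sample sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(6)
    eps, reps = 0.1, 10000                      # arbitrary tolerance and replication count

    for n in [10, 100, 1000]:
        x_bar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
        print(n, np.mean(np.abs(x_bar - 1.0) < eps))   # P(|X_bar - mu| < eps) -> 1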
Markov’s inequality is a generalization of Chebyshev’s inequality. If we take $g(X) = (X - \mu)^2$ and $k = \sigma^2 t^2$ for some positive t, then Markov’s inequality states that
$$P\big((X - \mu)^2 \ge \sigma^2 t^2\big) \le \frac{E[(X - \mu)^2]}{\sigma^2 t^2} = \frac{\sigma^2}{\sigma^2 t^2} = \frac{1}{t^2}.$$
This is equivalent to
$$P\big(|X - \mu| \ge \sigma t\big) \le \frac{1}{t^2},$$
which is known as Chebyshev’s inequality.
Example 2.9. The three-sigma rule.
This rule says that the chance that an rv takes a value outside the interval $(\mu - 3\sigma, \mu + 3\sigma)$ is close to zero. Chebyshev’s inequality, however, says only that for any random variable the probability that it deviates from its mean by more than three standard deviations is at most 1/9, i.e.,
$$P\big(|X - \mu| \ge 3\sigma\big) \le \frac{1}{9}.$$
This suggests that the rule should be used with caution. The three-sigma rule works well for normal rvs or for variables whose distributions are close to normal. Let $X \sim N(\mu, \sigma^2)$. Then
$$\begin{aligned}
P(|X - \mu| \ge 3\sigma) &= 1 - P(|X - \mu| \le 3\sigma) = 1 - P(\mu - 3\sigma \le X \le \mu + 3\sigma) \\
&= 1 - P(-3\sigma \le X - \mu \le 3\sigma) = 1 - P\left(-3 \le \frac{X - \mu}{\sigma} \le 3\right) \\
&= 1 - [\Phi(3) - \Phi(-3)] = 1 - [\Phi(3) - (1 - \Phi(3))] \\
&= 2(1 - \Phi(3)) = 2(1 - 0.99865) \approx 0.0027.
\end{aligned}$$
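A quick numerical check of the two statements above (a sketch using scipy, not part of the notes): the exact normal tail probability 2(1 − Φ(3)) versus the distribution-free Chebyshev bound 1/9.

    from scipy.stats import norm

    exact_normal = 2 * (1 - norm.cdf(3))   # approximately 0.0027 for a normal rv
    chebyshev_bound = 1 / 9                # approximately 0.111, valid for any rv with finite variance

    print(exact_normal, chebyshev_bound)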
Another important result, which is widely used in statistics, is the Central Limit Theorem (CLT); it uses the notion of convergence in distribution. We have already used this notion, but here we give its formal definition.
Definition 2.7. A sequence of random variables $X_1, X_2, \ldots$ converges in distribution to a random variable X if
$$\lim_{n\to\infty} F_{X_n}(x) = F_X(x)$$
at all points x where the cdf $F_X(x)$ is continuous.
The following theorem describes the large-sample behaviour of the sample mean (compare the Weak Law of Large Numbers).
Theorem 2.3. The Central Limit Theorem
Let $X_1, X_2, \ldots$ be a sequence of independent, identically distributed random variables such that $E(X_i) = \mu$ and $\mathrm{var}(X_i) = \sigma^2 < \infty$. Define
$$U_n = \frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}}.$$
Then the sequence of random variables $U_1, U_2, \ldots$ converges in distribution to a random variable $U \sim N(0, 1)$, that is,
$$\lim_{n\to\infty} F_{U_n}(u) = F_U(u),$$
where $F_U(u)$ is the cdf of a standard normal rv.
Proof. The proof is based on the mgf and so it is valid only in cases where the mgf exists. Here we assume that $M_{X_i}(t)$ exists in some neighbourhood of zero, say for $|t| < h$ for some positive h.
We have
$$U_n = \frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} = \frac{\frac{1}{n}\sum_{i=1}^{n} X_i - \mu}{\sigma/\sqrt{n}} = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \frac{X_i - \mu}{\sigma} = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} Y_i,$$
where $Y_i = \frac{X_i - \mu}{\sigma}$. The variables $X_i$ are IID, hence by Theorem 1.15 the variables $Y_i$ are also IID. Hence,
$$M_{U_n}(t) = M_{\frac{1}{\sqrt{n}}\sum_{i=1}^{n} Y_i}(t) = E\left(e^{\frac{t}{\sqrt{n}}\sum_{i=1}^{n} Y_i}\right) = E\left(\prod_{i=1}^{n} e^{\frac{t}{\sqrt{n}} Y_i}\right) = \prod_{i=1}^{n} M_{Y_i}\!\left(\frac{t}{\sqrt{n}}\right) = \left[M_{Y_i}\!\left(\frac{t}{\sqrt{n}}\right)\right]^n.$$
The common mgf of $Y_i$ can be expanded in a Taylor series around zero, that is,
$$M_{Y_i}\!\left(\frac{t}{\sqrt{n}}\right) = \sum_{k=0}^{\infty} M_{Y_i}^{(k)}(0)\, \frac{(t/\sqrt{n})^k}{k!},$$
where $M_{Y_i}^{(k)}(0) = \frac{d^k}{dt^k} M_{Y_i}(t)\big|_{t=0}$. Since the mgfs exist for $|t| < h$, for some positive h, the Taylor series expansion is valid if $|t| < \sqrt{n}\, h$.
Now, $M_{Y_i}^{(1)}(0) = E(Y_i) = 0$, $M_{Y_i}^{(2)}(0) = E(Y_i^2) = \mathrm{var}(Y_i) = 1$ and $M_{Y_i}^{(0)}(0) = M_{Y_i}(0) = E(e^{tY_i})\big|_{t=0} = E(1) = 1$. Hence, the series can be written as
$$M_{Y_i}\!\left(\frac{t}{\sqrt{n}}\right) = 1 + 0 + \frac{(t/\sqrt{n})^2}{2!} + \sum_{k=3}^{\infty} M_{Y_i}^{(k)}(0)\, \frac{(t/\sqrt{n})^k}{k!} = 1 + \frac{(t/\sqrt{n})^2}{2!} + R_{Y_i}(t/\sqrt{n}),$$
where the remainder term $R_{Y_i}(t/\sqrt{n})$ tends to zero faster than the highest-order explicit term, that is,
$$\lim_{n\to\infty} \frac{R_{Y_i}(t/\sqrt{n})}{(t/\sqrt{n})^2} = 0.$$
This means that also $\lim_{n\to\infty} \frac{R_{Y_i}(t/\sqrt{n})}{1/n} = \lim_{n\to\infty} n\, R_{Y_i}(t/\sqrt{n}) = 0$ (as t is fixed here and can be taken equal to 1).
Thus,
$$\lim_{n\to\infty} M_{U_n}(t) = \lim_{n\to\infty}\left[M_{Y_i}\!\left(\frac{t}{\sqrt{n}}\right)\right]^n = \lim_{n\to\infty}\left[1 + \frac{(t/\sqrt{n})^2}{2!} + R_{Y_i}(t/\sqrt{n})\right]^n = \lim_{n\to\infty}\Bigg[1 + \frac{1}{n}\underbrace{\left(\frac{t^2}{2} + n\, R_{Y_i}(t/\sqrt{n})\right)}_{\to\, t^2/2}\Bigg]^n = e^{t^2/2} = M_U(t),$$
which is the mgf of a standard normal rv.
The CLT means that the sample mean is asymptotically normally distributed whatever the (finite-variance) distribution of the original rvs. However, we have no general way of knowing how good the approximation is for a finite sample; this depends on the original distribution of the $X_i$.
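A simulation sketch (illustrative only) of both points: the standardized mean of Exp(1) data, a skewed distribution chosen arbitrarily, approaches N(0, 1), and the quality of the approximation improves with n.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(5)
    reps = 100000
    mu, sigma = 1.0, 1.0                        # mean and standard deviation of Exp(1)

    for n in [2, 10, 100]:
        x_bar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
        u = (x_bar - mu) / (sigma / np.sqrt(n))
        # Compare a tail probability of U_n with the standard normal one; agreement improves with n.
        print(n, np.mean(u > 1.96), 1 - norm.cdf(1.96))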
Exercise 2.3. Let $\boldsymbol{Y} = (Y_1, \ldots, Y_n)^T$ be a random sample from the distribution with the pdf given by
$$f(y; \vartheta) = \begin{cases} \dfrac{2}{\vartheta^2}(\vartheta - y), & y \in [0, \vartheta], \\[4pt] 0, & \text{elsewhere.} \end{cases}$$
For the estimator $T(\boldsymbol{Y}) = 3\overline{Y}$, calculate its bias and variance and check whether it is a consistent estimator for ϑ.
Exercise 2.4. Let $X_i \overset{\mathrm{iid}}{\sim} \mathrm{Bernoulli}(p)$. Denote by $\overline{X}$ the sample mean.
(a) Show that $\hat{p} = \overline{X}$ is a consistent estimator of p.
(b) Show that $\widehat{pq} = \overline{X}(1 - \overline{X})$ is an asymptotically unbiased estimator of pq, where q = 1 − p.