Chapter 2

Weak Convergence

Chapter 1 discussed limits of sequences of constants, either scalar-valued or vector-valued. Chapters 2 and 3 extend this notion by defining what it means for a sequence of random variables to have a limit. As it turns out, there is more than one sensible way to do this. Chapters 2 and 4 (and, to a lesser extent, Chapter 3) lay the theoretical groundwork for nearly all of the statistical topics that will follow. While the material in Chapter 2 is essential, readers may wish to skip Chapter 3 on a first reading. As is common throughout the book, some of the proofs here have been relegated to the exercises.

2.1 Modes of Convergence

Whereas the limit of a sequence of real numbers is unequivocally expressed by Definition 1.32, in the case of random variables there are several ways to define the convergence of a sequence. This section discusses three such definitions, or modes, of convergence; Section 3.1 presents a fourth. Because it is often easier to understand these concepts in the univariate case than the multivariate case, we only consider univariate random variables here, deferring the analogous multivariate topics to Section 2.3.

2.1.1 Convergence in Probability

What does it mean for the sequence X1, X2, . . . of random variables to converge to, say, the random variable X? Under what circumstances should one write Xn → X? We begin by considering a definition of convergence that requires that Xn and X be defined on the same sample space. For this form of convergence, called convergence in probability, the absolute difference |Xn − X|, itself a random variable, should be arbitrarily close to zero with probability arbitrarily close to one. More precisely, we make the following definition.

Definition 2.1 Let {Xn}n≥1 and X be defined on the same probability space. We say that Xn converges in probability to X, written Xn →P X, if for any ε > 0,

    P(|Xn − X| < ε) → 1 as n → ∞.    (2.1)
It is very common that the X in Definition 2.1 is a constant, say X ≡ c. In such cases, we simply write Xn →P c. When we replace X by c in Definition 2.1, we do not need to concern ourselves with the question of whether X is defined on the same sample space as Xn because any constant may be defined as a random variable on any sample space. In the most common statistical usage of convergence to a constant c, we take c to be some parameter θ and Xn to be an estimator of θ:

Definition 2.2 If Xn →P θ, Xn is said to be consistent (or weakly consistent) for θ.
As the name suggests, weak consistency is weaker than (i.e., implied by) a condition called
“strong consistency,” which will be defined in Chapter 3. “Consistency,” used without the
word “strong” or “weak,” generally refers to weak consistency. Throughout this book, we
shall refer repeatedly to (weakly) consistent estimators, whereas strong consistency plays a
comparatively small role.
Example 2.3 Suppose that X1, X2, . . . are independent and identically distributed (i.i.d.) uniform (0, θ) random variables, where θ is an unknown positive constant. For n ≥ 1, let X(n) be defined as the largest value among X1 through Xn: That is, X(n) = max1≤i≤n Xi. Then we may show that X(n) is a consistent estimator of θ as follows:

By Definition 2.1, we wish to show that for an arbitrary ε > 0, P(|X(n) − θ| < ε) → 1 as n → ∞. In this particular case, we can evaluate P(|X(n) − θ| < ε) directly by noting that X(n) cannot possibly be larger than θ, so that

    P(|X(n) − θ| < ε) = P(X(n) > θ − ε) = 1 − P(X(n) ≤ θ − ε).

The maximum X(n) is less than some constant if and only if each of the random variables X1, . . . , Xn is less than that constant. Therefore, since the Xi are i.i.d.,

    P(X(n) ≤ θ − ε) = [P(X1 ≤ θ − ε)]^n = { [1 − (ε/θ)]^n   if 0 < ε < θ,
                                            0               if ε ≥ θ.

Since 1 − (ε/θ) is strictly less than 1, we conclude that no matter what positive value ε takes, P(X(n) ≤ θ − ε) → 0 as desired.
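The argument in Example 2.3 is easy to check numerically. The following Python sketch (the values θ = 2, ε = 0.05, and the sample sizes are illustrative choices, not from the text) estimates P(|X(n) − θ| < ε) by Monte Carlo for increasing n; the exact probability 1 − (1 − ε/θ)^n tends to 1.

```python
import random

def max_of_uniforms(n, theta, rng):
    """Simulate X_(n) = max(X_1, ..., X_n) for X_i ~ Uniform(0, theta)."""
    return max(rng.uniform(0.0, theta) for _ in range(n))

rng = random.Random(0)
theta, eps, reps = 2.0, 0.05, 500

# Monte Carlo estimate of P(|X_(n) - theta| < eps) for increasing n;
# the exact value is 1 - (1 - eps/theta)**n, which tends to 1.
probs = {}
for n in (10, 100, 1000):
    hits = sum(abs(max_of_uniforms(n, theta, rng) - theta) < eps for _ in range(reps))
    probs[n] = hits / reps
print(probs)
```

For n = 1000 virtually every replication lands within ε of θ, while for n = 10 only about a fifth do, matching the exact formula.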
2.1.2 OP Notation

There are probabilistic analogues of the o and O notations of Section 1.3 that apply to random variable sequences instead of real number sequences.

Definition 2.4 We write Xn = oP(Yn) if Xn/Yn →P 0.

In particular, oP(1) is shorthand notation for a sequence of random variables that converges to zero in probability, as illustrated in Equation (2.2) below.

Definition 2.5 We write Xn = OP(Yn) if for every ε > 0, there exist M and N such that

    P(|Xn/Yn| < M) > 1 − ε  for all n > N.

As a special case of Definition 2.5, we refer to any OP(1) sequence as a bounded in probability sequence:

Definition 2.6 We say that X1, X2, . . . is bounded in probability if Xn = OP(1), i.e., if for every ε > 0, there exist M and N such that P(|Xn| < M) > 1 − ε for n > N.

Definition 2.6 is primarily useful because of the properties of bounded in probability sequences established in Exercise 2.2.
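Definitions 2.4 and 2.6 can be made concrete with a small simulation. In the sketch below (every numerical choice is illustrative), a sequence in which each Xn is standard normal is bounded in probability, since a single M works for every n, whereas Xn = Zn/n is oP(1).

```python
import random

rng = random.Random(1)
trials = 20000
draws = [rng.gauss(0.0, 1.0) for _ in range(trials)]

# If every X_n ~ N(0, 1), then M = 3 works for all n in Definition 2.6:
# P(|X_n| < 3) is about 0.997, which exceeds 1 - eps for, say, eps = 0.01.
p_bounded = sum(abs(z) < 3.0 for z in draws) / trials

# In contrast, X_n = Z_n / n is o_P(1): P(|X_n| < 0.1) climbs toward 1 as n grows.
p_small = {n: sum(abs(z / n) < 0.1 for z in draws) / trials for n in (1, 10, 100)}
print(p_bounded, p_small)
```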
Example 2.7 In Example 2.3, we showed that if X1, X2, . . . are independent and identically distributed uniform (0, θ) random variables, then

    max1≤i≤n Xi →P θ  as n → ∞.

Equivalently, we may say that

    max1≤i≤n Xi = θ + oP(1)  as n → ∞.    (2.2)

It is also technically correct to write

    max1≤i≤n Xi = θ + OP(1)  as n → ∞,    (2.3)

though Statement (2.3) is less informative than Statement (2.2). On the other hand, we will see in Example 6.1 that Statement (2.3) may be sharpened considerably, and made more informative than Statement (2.2), by writing

    max1≤i≤n Xi = θ + OP(1/n)  as n → ∞.

Using the oP notation defined above, it is possible to rewrite Taylor's theorem 1.18 in a form involving random variables. This theorem will prove to be useful in later chapters; for instance, it is used to prove the result known as the delta method in Section 5.1.

Theorem 2.8 Suppose that Xn →P θ0 for a sequence of random variables X1, X2, . . . and a constant θ0. Furthermore, suppose that f(x) has d derivatives at the point θ0. Then there is a random variable Yn such that

    f(Xn) = f(θ0) + (Xn − θ0)f′(θ0) + · · · + ((Xn − θ0)^d / d!) [f^(d)(θ0) + Yn]    (2.4)

and Yn = oP(1) as n → ∞.

The proof of Theorem 2.8 is a useful example of an "epsilon-delta" proof (named for the ε and δ in Definition 1.11).

Proof: Let

    Yn = { (d!/(Xn − θ0)^d) [f(Xn) − f(θ0) − (Xn − θ0)f′(θ0) − · · · − ((Xn − θ0)^(d−1)/(d−1)!) f^(d−1)(θ0)] − f^(d)(θ0)   if Xn ≠ θ0,
           0                                                                                                               if Xn = θ0.

Then Equation (2.4) is trivially satisfied. We will show that Yn = oP(1), which means Yn →P 0, by demonstrating that for an arbitrary ε > 0, there exists N such that P(|Yn| < ε) > 1 − ε for all n > N. By Taylor's Theorem 1.18, there exists some δ > 0 such that |Xn − θ0| < δ implies |Yn| < ε (that is, the event {ω : |Xn(ω) − θ0| < δ} is contained in the event {ω : |Yn(ω)| < ε}). Furthermore, because Xn →P θ0, we know that there exists some N such that P(|Xn − θ0| < δ) > 1 − ε for all n > N. Putting these facts together, we conclude that for all n > N,

    P(|Yn| < ε) ≥ P(|Xn − θ0| < δ) > 1 − ε,

which proves the result.

In later chapters, we will generally write simply

    f(Xn) = f(θ0) + (Xn − θ0)f′(θ0) + · · · + ((Xn − θ0)^d / d!) [f^(d)(θ0) + oP(1)]  as n → ∞    (2.5)

when referring to the result of Theorem 2.8. A technical quibble with Expression (2.5) is that it suggests that any random variable Yn satisfying (2.4) must also be oP(1). This is not quite true: Since Yn may be defined arbitrarily in the event that Xn = θ0 and still satisfy (2.4), if

    P(Xn = θ0) > c  for all n

for some positive constant c, then Yn ≠ oP(1) may still satisfy (2.4). However, as long as one remembers what Theorem 2.8 says, there is little danger in using Expression (2.5).
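The behavior of the remainder Yn in Theorem 2.8 can be checked numerically. The sketch below specializes, purely as an illustration, to f(x) = exp(x), θ0 = 0, and d = 2 (so f′(0) = f″(0) = 1) and evaluates Yn along a deterministic sequence xn = 1/n → θ0; the remainder vanishes, here at roughly the rate xn/3.

```python
import math

# Remainder Y_n from the proof of Theorem 2.8, specialized (illustratively)
# to f(x) = exp(x), theta0 = 0, d = 2, where f'(0) = f''(0) = 1.
def remainder(x):
    if x == 0.0:
        return 0.0
    # Y_n = (d!/(x - theta0)^d) [f(x) - f(theta0) - (x - theta0) f'(theta0)] - f''(theta0)
    return 2.0 / x**2 * (math.exp(x) - 1.0 - x) - 1.0

# As x_n -> theta0, the remainder shrinks toward zero (here |Y_n| ~ x_n / 3).
ys = [abs(remainder(1.0 / n)) for n in (1, 10, 100, 1000)]
print(ys)
```

Replacing the deterministic xn with draws from a sequence satisfying Xn →P 0 would exhibit Yn = oP(1) in the sense of Definition 2.4.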
2.1.3 Convergence in Distribution
As the name suggests, convergence in distribution (also known as convergence in law) has
to do with convergence of the distribution functions (or “laws”) of random variables. Given
a random variable X, the distribution function of X is the function
    F(x) = P(X ≤ x).    (2.6)
Any distribution function F (x) is nondecreasing and right-continuous, and it has limits
limx→−∞ F (x) = 0 and limx→∞ F (x) = 1. Conversely, any function F (x) with these properties is a distribution function for some random variable.
It is not enough to define convergence in distribution as simple pointwise convergence of a
sequence of distribution functions; there are technical reasons that such a simplistic definition
fails to capture any useful concept of convergence of random variables. These reasons are
illustrated by the following two examples.
Example 2.9 Let Xn be normally distributed with mean 0 and variance n. Then the distribution function of Xn is Fn(x) = Φ(x/√n), where Φ(z) denotes the standard normal distribution function. Because Φ(0) = 1/2, we see that for any fixed x, Fn(x) → 1/2 as n → ∞. But the function that is constant at 1/2 is not a distribution function. This example shows that not all convergent sequences of distribution functions have limits that are distribution functions.
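The limit in Example 2.9 is easy to verify numerically. The sketch below (the value of n and the x grid are arbitrary illustrative choices) computes Φ(x/√n) via the error function for a large n; every fixed x gives a value near 1/2.

```python
import math

def Phi(z):
    """Standard normal distribution function, written via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# F_n(x) = Phi(x / sqrt(n)) for X_n ~ N(0, n): as n grows, every fixed x
# yields a value near 1/2, and the constant 1/2 is not a distribution function.
n = 10**6
vals = {x: Phi(x / math.sqrt(n)) for x in (-5.0, 0.0, 5.0)}
print(vals)
```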
Example 2.10 By any sensible definition of convergence, 1/n should converge to 0
as n → ∞. But consider the distribution functions Fn (x) = I{x ≥ 1/n} and
F (x) = I{x ≥ 0} corresponding to the constant random variables 1/n and 0. We
do not have pointwise convergence of Fn (x) to F (x), since Fn (0) = 0 for all n but
F(0) = 1. However, Fn(x) → F(x) is true for all x ≠ 0. Not coincidentally, the
point x = 0 where convergence of Fn (x) to F (x) fails is the only point at which
the function F (x) is not continuous.
To write a sensible definition of convergence in distribution, Example 2.9 demonstrates that
we should require that the limit of distribution functions be a distribution function itself,
say F (x), while Example 2.10 suggests that we should exclude points where F (x) is not
continuous. We therefore arrive at the following definition:
Definition 2.11 Suppose that X has distribution function F(x) and that Xn has distribution function Fn(x) for each n. Then we say Xn converges in distribution to X, written Xn →d X, if Fn(x) → F(x) as n → ∞ for all x at which F(x) is continuous. Convergence in distribution is sometimes called convergence in law and written Xn →L X.
The notation of Definition 2.11 may be stretched a bit; sometimes the expressions on either side of the →d symbol may be distribution functions or other notations indicating certain distributions, rather than actual random variables as in the definition. The meaning is always clear even if the notation is not consistent.

However, one common mistake should be avoided at all costs: If →d (or →P or any other "limit arrow") indicates that n → ∞, then n must never appear on the right side of the arrow. See Expression (2.8) in Example 2.12 for an example of how this rule is sometimes violated.
Example 2.12 The Central Limit Theorem for i.i.d. sequences: Let X1, . . . , Xn be independent and identically distributed (i.i.d.) with mean µ and finite variance σ². Then by a result that will be covered in Chapter 4 (but which is perhaps already known to the reader),

    √n ( (1/n) Σ_{i=1}^n Xi − µ ) →d N(0, σ²),    (2.7)

where N(0, σ²) denotes a normal distribution with mean 0 and variance σ². [N(0, σ²) is not actually a random variable; this is an example of "stretching the →d notation" referred to above.]

Because Equation (2.7) may be interpreted as saying that the sample mean X̄n has approximately a N(µ, σ²/n) distribution, it may seem tempting to "rewrite" Equation (2.7) as

    (1/n) Σ_{i=1}^n Xi →d N(µ, σ²/n).    (2.8)
Resist the temptation to do this! As pointed out above, n should never appear
on the right side of a limit arrow (as long as that limit arrow expresses the idea
that n is tending to ∞).
By Definition 2.5, the right side of (2.7) is OP(1). We may therefore write (after dividing through by √n and adding µ)

    (1/n) Σ_{i=1}^n Xi = µ + OP(1/√n)  as n → ∞.    (2.9)

Unlike Expression (2.8), Equation (2.9) is perfectly legal; and although it is less specific than Expression (2.7), it expresses at a glance the √n-rate of convergence of the sample mean to µ.
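A simulation makes (2.7) concrete. The sketch below (Exponential(1) data, so µ = σ² = 1, with illustrative sample sizes) draws many standardized sample means √n(X̄n − µ); their empirical mean and variance land close to 0 and σ², consistent with the N(0, σ²) limit.

```python
import random, statistics

rng = random.Random(2)
n, reps, mu = 400, 2000, 1.0

# Standardized sample means sqrt(n) * (Xbar_n - mu) for Exponential(1) data
# (mean 1, variance 1); by (2.7) these are approximately N(0, 1).
zs = []
for _ in range(reps):
    xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
    zs.append(n ** 0.5 * (xbar - mu))

print(statistics.mean(zs), statistics.variance(zs))
```

A histogram of zs would show the familiar bell shape even though the underlying exponential data are highly skewed.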
Unlike Xn →P X, the expression Xn →d X does not require Xn − X to be a random variable; in fact, Xn →d X is possible even if Xn and X are not defined on the same sample space. Even if Xn and X do have a joint distribution, it is easy to construct an example in which Xn →d X but Xn does not converge to X in probability: Take Z1 and Z2 to be independent and identically distributed standard normal random variables, then let Xn = Z1 for all n and X = Z2. Since Xn and X have exactly the same distribution by construction, Xn →d X in this case. However, since Xn − X is a N(0, 2) random variable for all n, we do not have Xn →P X.
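The gap between the two modes shows up clearly in simulation. The sketch below (sample sizes and ε are illustrative) draws independent pairs (Z1, Z2): the difference Xn − X = Z1 − Z2 has variance near 2 for every n, so P(|Xn − X| > ε) stays far from zero even though Xn and X share one distribution.

```python
import random

rng = random.Random(3)
trials, eps = 5000, 0.5

# X_n = Z_1 and X = Z_2 with Z_1, Z_2 i.i.d. standard normal: X_n ->d X
# trivially, but X_n - X ~ N(0, 2), so P(|X_n - X| > eps) does not vanish.
diffs = [rng.gauss(0.0, 1.0) - rng.gauss(0.0, 1.0) for _ in range(trials)]
frac_far = sum(abs(d) > eps for d in diffs) / trials   # roughly 0.72 for eps = 0.5
var_diff = sum(d * d for d in diffs) / trials          # roughly 2
print(frac_far, var_diff)
```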
We conclude that Xn →d X cannot possibly imply Xn →P X (but see Theorem 2.14 for a special case in which it does). However, the implication in the other direction is always true:

Theorem 2.13 If Xn →P X, then Xn →d X.

Proof: Let Fn(x) and F(x) denote the distribution functions of Xn and X, respectively. Assume that Xn →P X. We need to show that Fn(t) → F(t), where t is any point of continuity of F(x).

Choose any ε > 0. Whenever Xn ≤ t, it must be true that either X ≤ t + ε or |Xn − X| > ε. This implies that

    Fn(t) ≤ F(t + ε) + P(|Xn − X| > ε).

Similarly, whenever X ≤ t − ε, either Xn ≤ t or |Xn − X| > ε, implying

    F(t − ε) ≤ Fn(t) + P(|Xn − X| > ε).

We conclude that for arbitrary n and ε > 0,

    F(t − ε) − P(|Xn − X| > ε) ≤ Fn(t) ≤ F(t + ε) + P(|Xn − X| > ε).    (2.10)

Taking both the lim inf_n and the lim sup_n of the above inequalities, we conclude [since Xn →P X implies P(|Xn − X| > ε) → 0] that

    F(t − ε) ≤ lim inf_n Fn(t) ≤ lim sup_n Fn(t) ≤ F(t + ε)

for all ε. Since t is a continuity point of F(x), letting ε → 0 implies

    F(t) = lim inf_n Fn(t) = lim sup_n Fn(t),

so we conclude Fn(t) → F(t) and the theorem is proved.

We remarked earlier that Xn →d X could not possibly imply Xn →P X because the latter expression requires that Xn and X be defined on the same sample space for every n. However, a constant c may be considered to be a random variable defined on any sample space; thus, it is reasonable to ask whether Xn →d c implies Xn →P c. The answer is yes:

Theorem 2.14 Xn →P c if and only if Xn →d c.

Proof: We only need to prove that Xn →d c implies Xn →P c, since the other direction is a special case of Theorem 2.13. If F(x) is the distribution function I{x ≥ c} of the constant random variable c, then c + ε and c − ε are points of continuity of F(x) for any ε > 0. Therefore, Xn →d c implies that Fn(c − ε) → F(c − ε) = 0 and Fn(c + ε) → F(c + ε) = 1 as n → ∞. We conclude that

    P(−ε < Xn − c ≤ ε) = Fn(c + ε) − Fn(c − ε) → 1,

which means Xn →P c.

When we speak of convergence of random variables to a constant in this book, most commonly we refer to convergence in probability, which (according to Theorem 2.14) is equivalent to convergence in distribution. On the other hand, when we speak of convergence to a random variable, we nearly always refer to convergence in distribution. Therefore, in a sense, Theorem 2.14 makes convergence in distribution the most important form of convergence in this book. This type of convergence is often called "weak convergence".

2.1.4 Convergence in Mean

The third and final mode of convergence in this chapter is useful primarily because it is sometimes easy to verify and thus gives a quick way to prove convergence in probability, as Theorem 2.17 below implies.

Definition 2.15 Let a be a positive constant. We say that Xn converges in ath mean to X, written Xn →a X, if

    E |Xn − X|^a → 0  as n → ∞.    (2.11)

Two specific cases of Definition 2.15 deserve special mention. When a = 1, we normally omit mention of the a and simply refer to the condition E |Xn − X| → 0 as convergence in mean. Convergence in mean is not equivalent to E Xn → E X: For one thing, E Xn → E X is possible without any regard to the joint distribution of Xn and X, whereas E |Xn − X| → 0 clearly requires that Xn − X be a well-defined random variable.

Even more important than a = 1 is the special case a = 2:

Definition 2.16 We say that Xn converges in quadratic mean to X, written Xn →qm X, if

    E |Xn − X|^2 → 0  as n → ∞.
Convergence in quadratic mean is important for two reasons. First, it is often quite easy to check; in Exercise 2.6, you are asked to prove that Xn →qm c if and only if E Xn → c and Var Xn → 0 for some constant c. Second, quadratic mean convergence (indeed, ath mean convergence for any a > 0) is stronger than convergence in probability, which means that weak consistency of an estimator may be established by checking that it converges in quadratic mean. This latter property is a corollary of the following result:

Theorem 2.17 (a) For a constant c, Xn →qm c if and only if E Xn → c and Var Xn → 0.

(b) For fixed a > 0, Xn →a X implies Xn →P X.

Proof: Part (a) is the subject of Exercise 2.6. Part (b) relies on Markov's inequality (1.35), which states that

    P(|Xn − X| ≥ ε) ≤ (1/ε^a) E |Xn − X|^a    (2.12)

for an arbitrary fixed ε > 0. If Xn →a X, then by definition the right hand side of inequality (2.12) goes to zero as n → ∞, so the left side also goes to zero and we conclude that Xn →P X by definition.
Example 2.18 Any unbiased estimator is consistent if its variance goes to zero. This
fact follows directly from Theorem 2.17(a) and (b). As an example, consider a
sequence of independent and identically distributed random variables X1 , X2 , . . .
with mean µ and finite variance σ². The sample mean

    X̄n = (1/n) Σ_{i=1}^n Xi

has mean µ and variance σ²/n. Therefore, X̄n is unbiased and its variance goes to zero, so we conclude that it is consistent; i.e., X̄n →P µ. This fact is the Weak Law of Large Numbers (see Theorem 2.19) for the case of random variables with finite variance.
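The two conditions of Theorem 2.17(a) can be watched directly in a simulation. In the sketch below (N(5, 4) data and all sample sizes are illustrative choices), the Monte Carlo estimate of Var X̄n shrinks like σ²/n while E X̄n stays at µ.

```python
import random, statistics

rng = random.Random(4)
mu, sigma, reps = 5.0, 2.0, 2000

# Monte Carlo check of Theorem 2.17(a) for the sample mean with N(5, 4) data:
# E Xbar_n stays at mu while Var Xbar_n = sigma^2 / n -> 0.
var_of_mean = {}
for n in (5, 50, 500):
    means = [sum(rng.gauss(mu, sigma) for _ in range(n)) / n for _ in range(reps)]
    var_of_mean[n] = statistics.variance(means)   # close to sigma**2 / n
print(var_of_mean)
```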
Exercises for Section 2.1

Exercise 2.1 For each of the three cases below, prove that Xn →P 1:

(a) Xn = 1 + nYn, where Yn is a Bernoulli random variable with mean 1/n.

(b) Xn = Yn / log n, where Yn is a Poisson random variable with mean Σ_{i=1}^n (1/i).

(c) Xn = (1/n) Σ_{i=1}^n Yi², where the Yi are independent standard normal random variables.
Exercise 2.2 This exercise deals with bounded in probability sequences; see Definition 2.6.

(a) Prove that if Xn →d X for some random variable X, then Xn is bounded in probability.

Hint: You may use the fact that any interval of real numbers must contain a point of continuity of F(x). Also, recall that F(x) → 1 as x → ∞.

(b) Prove that if Xn is bounded in probability and Yn →P 0, then Xn Yn →P 0.

Hint: For fixed ε > 0, argue that there must be M and N such that P(|Xn| < M) > 1 − ε/2 and P(|Yn| < ε/M) > 1 − ε/2 for all n > N. What is then the smallest possible value of P(|Xn| < M and |Yn| < ε/M)? Use this result to prove Xn Yn →P 0.
Exercise 2.3 The Poisson approximation to the binomial:
(a) Suppose that Xn is a binomial random variable with n trials, where the probability of success on each trial is λ/n. Let X be a Poisson random variable with the same mean as Xn, namely λ. Prove that Xn →d X.

Hint: Argue that it suffices to show that P(Xn = k) → P(X = k) for all nonnegative integers k. Then use Stirling's formula (1.19).
(b) Part (a) can be useful in approximating binomial probabilities in cases where
the number of trials is large but the success probability is small: Simply consider
a Poisson random variable with the same mean as the binomial variable. Assume
that Xn is a binomial random variable with parameters n and 2/n. Create a plot
on which you plot P (X10 = k) for k = 0, . . . , 10. On the same set of axes, plot
the same probabilities for X20 , X50 , and the Poisson variable we’ll denote by X∞ .
Try looking at the same plot but with the probabilities transformed using the
logit (log-odds) transformation logit(t) = log(t) − log(1 − t). Which plot makes
it easier to characterize the trend you observe?
Exercise 2.4 Suppose that X1, . . . , Xn are independent and identically distributed Uniform(0, 1) random variables. For a real number t, let

    Gn(t) = Σ_{i=1}^n I{Xi ≤ t}.

(a) What is the distribution of Gn(t) if 0 < t < 1?

(b) Suppose c > 0. Find the distribution of a random variable X such that Gn(c/n) →d X. Justify your answer.
(c) How does your answer to part (b) change if X1 , . . . , Xn are from a standard exponential distribution instead of a uniform distribution? The standard
exponential distribution function is F (t) = 1 − e−t .
Exercise 2.5 For each of the three examples in Exercise 2.1, does Xn →qm 1? Justify your answers.
Exercise 2.6 Prove Theorem 2.17(a).
Exercise 2.7 The converse of Theorem 2.17(b) is not true. Construct a counterexample in which Xn →P 0 but E Xn = 1 for all n (by Theorem 2.17, if E Xn = 1, then Xn cannot converge in quadratic mean to 0).
Hint: The mean of a random variable may be strongly influenced by a large
value that occurs with small probability (and if this probability goes to zero,
then the mean can be influenced in this way without destroying convergence in
probability).
Exercise 2.8 Prove or disprove this statement: If there exists M such that P(|Xn| < M) = 1 for all n, then Xn →P c implies Xn →qm c.
Exercise 2.9 (a) Prove that if 0 < a < b, then convergence in bth mean is stronger than convergence in ath mean; i.e., Xn →b X implies Xn →a X.

Hint: Use Exercise 1.40 with α = b/a.

(b) Prove by counterexample that the conclusion of part (a) is not true in general if 0 < b < a.

2.2 The Weak Law of Large Numbers

We begin with a formal statement of the weak law of large numbers for an independent and identically distributed sequence. Later, we discuss some cases in which the sequence of random vectors is not independent and identically distributed.

2.2.1 Consistent Estimates of the Mean

For a sequence of random vectors X1, X2, . . ., we denote the nth sample mean by

    X̄n = (1/n) Σ_{i=1}^n Xi.

Theorem 2.19 Weak Law of Large Numbers (univariate version): Suppose that X1, X2, . . . are independent and identically distributed and have finite mean µ. Then X̄n →P µ.

The proof of Theorem 2.19 in its full generality is beyond the scope of this chapter, though it may be proved using the tools in Section 4.1. However, by tightening the assumptions a bit, a proof can be made simple. For example, if the Xi are assumed to have finite variance (not a terribly restrictive assumption), the weak law may be proved in a single line: Chebyshev's inequality (1.36) implies that

    P(|X̄n − µ| ≥ ε) ≤ Var X̄n / ε² = Var X1 / (nε²) → 0,

so X̄n →P µ follows by definition. (This is an alternative proof of the same result in Example 2.18.)

Example 2.20 If X ∼ binomial(n, p), then X/n →P p. Although we could prove this fact directly using the definition of convergence in probability, it follows immediately from the Weak Law of Large Numbers due to the fact that X/n is the sample mean of n independent and identically distributed Bernoulli random variables, each with mean p.

Example 2.21 Suppose that X1, X2, . . . are independent and identically distributed with mean µ and finite variance σ². Then the estimator

    σ̂n² = (1/n) Σ_{i=1}^n (Xi − µ)²

is consistent for σ² because of the Weak Law of Large Numbers: The (Xi − µ)² are independent and identically distributed and they have mean σ². Ordinarily, of course, we do not know µ, so we replace µ by X̄n (and often replace 1/n by 1/(n−1)) to obtain the sample variance. We need a bit more theory to establish the consistency of the sample variance, but we will revisit this topic later.

2.2.2 Independent but not Identically Distributed Variables

Let us now generalize the conditions of the previous section: Suppose that X1, X2, . . . are independent but not necessarily identically distributed but have at least the same mean, so