Chapter 2

Weak Convergence

Chapter 1 discussed limits of sequences of constants, either scalar-valued or vector-valued. Chapters 2 and 3 extend this notion by defining what it means for a sequence of random variables to have a limit. As it turns out, there is more than one sensible way to do this. Chapters 2 and 4 (and, to a lesser extent, Chapter 3) lay the theoretical groundwork for nearly all of the statistical topics that will follow. While the material in Chapter 2 is essential, readers may wish to skip Chapter 3 on a first reading. As is common throughout the book, some of the proofs here have been relegated to the exercises.

2.1 Modes of Convergence

Whereas the limit of a sequence of real numbers is unequivocally expressed by Definition 1.32, in the case of random variables there are several ways to define the convergence of a sequence. This section discusses three such definitions, or modes, of convergence; Section 3.1 presents a fourth. Because it is often easier to understand these concepts in the univariate case than the multivariate case, we only consider univariate random variables here, deferring the analogous multivariate topics to Section 2.3.

2.1.1 Convergence in Probability

What does it mean for the sequence X1, X2, ... of random variables to converge to, say, the random variable X? Under what circumstances should one write Xn → X? We begin by considering a definition of convergence that requires that Xn and X be defined on the same sample space. For this form of convergence, called convergence in probability, the absolute difference |Xn − X|, itself a random variable, should be arbitrarily close to zero with probability arbitrarily close to one. More precisely, we make the following definition.

Definition 2.1 Let {Xn}n≥1 and X be defined on the same probability space. We say that Xn converges in probability to X, written Xn →P X, if for any ε > 0,

    P(|Xn − X| < ε) → 1 as n → ∞.    (2.1)

It is very common that the X in Definition 2.1 is a constant, say X ≡ c. In such cases, we simply write Xn →P c. When we replace X by c in Definition 2.1, we do not need to concern ourselves with the question of whether X is defined on the same sample space as Xn, because any constant may be defined as a random variable on any sample space. In the most common statistical usage of convergence to a constant c, we take c to be some parameter θ and Xn to be an estimator of θ:

Definition 2.2 If Xn →P θ, Xn is said to be consistent (or weakly consistent) for θ.

As the name suggests, weak consistency is weaker than (i.e., implied by) a condition called "strong consistency," which will be defined in Chapter 3. "Consistency," used without the word "strong" or "weak," generally refers to weak consistency. Throughout this book, we shall refer repeatedly to (weakly) consistent estimators, whereas strong consistency plays a comparatively small role.

Example 2.3 Suppose that X1, X2, ... are independent and identically distributed (i.i.d.) uniform (0, θ) random variables, where θ is an unknown positive constant. For n ≥ 1, let X(n) be defined as the largest value among X1 through Xn: that is, X(n) = max{1≤i≤n} Xi. Then we may show that X(n) is a consistent estimator of θ as follows: By Definition 2.1, we wish to show that for an arbitrary ε > 0, P(|X(n) − θ| < ε) → 1 as n → ∞. In this particular case, we can evaluate P(|X(n) − θ| < ε) directly by noting that X(n) cannot possibly be larger than θ, so that

    P(|X(n) − θ| < ε) = P(X(n) > θ − ε) = 1 − P(X(n) ≤ θ − ε).

The maximum X(n) is at most some constant if and only if each of the random variables X1, ..., Xn is at most that constant. Therefore, since the Xi are i.i.d.,

    P(X(n) ≤ θ − ε) = [P(X1 ≤ θ − ε)]^n = [1 − (ε/θ)]^n  if 0 < ε < θ,
                                         = 0              if ε ≥ θ.
Since 1 − (ε/θ) is strictly less than 1, we conclude that no matter what positive value ε takes, P(X(n) ≤ θ − ε) → 0, as desired.

2.1.2 OP Notation

There are probabilistic analogues of the o and O notations of Section 1.3 that apply to sequences of random variables instead of sequences of real numbers.

Definition 2.4 We write Xn = oP(Yn) if Xn/Yn →P 0.

In particular, oP(1) is shorthand notation for a sequence of random variables that converges to zero in probability, as illustrated in Equation (2.2) below.

Definition 2.5 We write Xn = OP(Yn) if for every ε > 0, there exist M and N such that

    P(|Xn/Yn| < M) > 1 − ε for all n > N.

As a special case of Definition 2.5, we refer to any OP(1) sequence as a bounded in probability sequence:

Definition 2.6 We say that X1, X2, ... is bounded in probability if Xn = OP(1), i.e., if for every ε > 0, there exist M and N such that P(|Xn| < M) > 1 − ε for n > N.

Definition 2.6 is primarily useful because of the properties of bounded in probability sequences established in Exercise 2.2.

Example 2.7 In Example 2.3, we showed that if X1, X2, ... are independent and identically distributed uniform (0, θ) random variables, then

    max{1≤i≤n} Xi →P θ as n → ∞.

Equivalently, we may say that

    max{1≤i≤n} Xi = θ + oP(1) as n → ∞.    (2.2)

It is also technically correct to write

    max{1≤i≤n} Xi = θ + OP(1) as n → ∞,    (2.3)

though Statement (2.3) is less informative than Statement (2.2). On the other hand, we will see in Example 6.1 that Statement (2.3) may be sharpened considerably, and made more informative than Statement (2.2), by writing

    max{1≤i≤n} Xi = θ + OP(1/n) as n → ∞.

Using the oP notation defined above, it is possible to rewrite Taylor's theorem 1.18 in a form involving random variables. This theorem will prove to be useful in later chapters; for instance, it is used to prove the result known as the delta method in Section 5.1.

Theorem 2.8 Suppose that Xn →P θ0 for a sequence of random variables X1, X2, ... and a constant θ0. Furthermore, suppose that f(x) has d derivatives at the point θ0. Then there is a random variable Yn such that

    f(Xn) = f(θ0) + (Xn − θ0)f′(θ0) + ··· + [(Xn − θ0)^d / d!] [f^(d)(θ0) + Yn]    (2.4)

and Yn = oP(1) as n → ∞.

The proof of Theorem 2.8 is a useful example of an "epsilon-delta" proof (named for the ε and δ in Definition 1.11).

Proof: Let

    Yn = [d!/(Xn − θ0)^d] [f(Xn) − f(θ0) − (Xn − θ0)f′(θ0) − ··· − ((Xn − θ0)^(d−1)/(d−1)!) f^(d−1)(θ0)] − f^(d)(θ0)  if Xn ≠ θ0,
       = 0                                                                                                            if Xn = θ0.

Then Equation (2.4) is trivially satisfied. We will show that Yn = oP(1), which means Yn →P 0, by demonstrating that for an arbitrary ε > 0, there exists N such that P(|Yn| < ε) > 1 − ε for all n > N.

By Taylor's Theorem 1.18, there exists some δ > 0 such that |Xn − θ0| < δ implies |Yn| < ε (that is, the event {ω : |Xn(ω) − θ0| < δ} is contained in the event {ω : |Yn(ω)| < ε}). Furthermore, because Xn →P θ0, we know that there exists some N such that P(|Xn − θ0| < δ) > 1 − ε for all n > N. Putting these facts together, we conclude that for all n > N,

    P(|Yn| < ε) ≥ P(|Xn − θ0| < δ) > 1 − ε,

which proves the result.

In later chapters, we will generally write simply

    f(Xn) = f(θ0) + (Xn − θ0)f′(θ0) + ··· + [(Xn − θ0)^d / d!] [f^(d)(θ0) + oP(1)] as n → ∞    (2.5)

when referring to the result of Theorem 2.8. A technical quibble with Expression (2.5) is that it suggests that any random variable Yn satisfying (2.4) must also be oP(1). This is not quite true: Since Yn may be defined arbitrarily in the event that Xn = θ0 and still satisfy (2.4), if

    P(Xn = θ0) > c for all n

for some positive constant c, then a random variable Yn that is not oP(1) may still satisfy (2.4). However, as long as one remembers what Theorem 2.8 says, there is little danger in using Expression (2.5).
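The claims of Examples 2.3 and 2.7 are easy to check by simulation. The sketch below is illustrative only; the values of θ, ε, the sample sizes, and the replication count are arbitrary choices, not taken from the text. It estimates P(|X(n) − θ| < ε) and compares with the exact value, then examines the scaled error n(θ − X(n)), which should remain bounded in probability if max Xi = θ + OP(1/n).

```python
import random

# Simulation sketch for Examples 2.3 and 2.7.  The values of theta, eps,
# the sample sizes n, and the replication count are arbitrary choices made
# for illustration; they do not come from the text.
random.seed(0)
theta, eps, reps = 2.0, 0.1, 5000

def sample_max(n):
    """One draw of X_(n), the maximum of n uniform(0, theta) variables."""
    return max(random.uniform(0, theta) for _ in range(n))

def coverage(n):
    """Estimate P(|X_(n) - theta| < eps); the exact value is 1 - (1 - eps/theta)^n."""
    return sum(abs(sample_max(n) - theta) < eps for _ in range(reps)) / reps

for n in (10, 50, 200):
    print(n, coverage(n), 1 - (1 - eps / theta) ** n)

# The scaled error n * (theta - X_(n)) should stay bounded in probability
# as n grows, consistent with max X_i = theta + O_P(1/n).
for n in (10, 1000):
    scaled = sorted(n * (theta - sample_max(n)) for _ in range(reps))
    print(n, round(scaled[int(0.95 * reps)], 2))  # empirical 95th percentile
```

Running this, the estimated coverage probabilities track the exact values 1 − (1 − ε/θ)^n closely and approach 1, while the empirical 95th percentile of n(θ − X(n)) stabilizes rather than growing with n.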
2.1.3 Convergence in Distribution

As the name suggests, convergence in distribution (also known as convergence in law) has to do with convergence of the distribution functions (or "laws") of random variables. Given a random variable X, the distribution function of X is the function

    F(x) = P(X ≤ x).    (2.6)

Any distribution function F(x) is nondecreasing and right-continuous, and it has limits lim{x→−∞} F(x) = 0 and lim{x→∞} F(x) = 1. Conversely, any function F(x) with these properties is a distribution function for some random variable.

It is not enough to define convergence in distribution as simple pointwise convergence of a sequence of distribution functions; there are technical reasons that such a simplistic definition fails to capture any useful concept of convergence of random variables. These reasons are illustrated by the following two examples.

Example 2.9 Let Xn be normally distributed with mean 0 and variance n. Then the distribution function of Xn is Fn(x) = Φ(x/√n), where Φ(z) denotes the standard normal distribution function. Because Φ(0) = 1/2, we see that for any fixed x, Fn(x) → 1/2 as n → ∞. But the function that is constant at 1/2 is not a distribution function. This example shows that not all convergent sequences of distribution functions have limits that are distribution functions.

Example 2.10 By any sensible definition of convergence, 1/n should converge to 0 as n → ∞. But consider the distribution functions Fn(x) = I{x ≥ 1/n} and F(x) = I{x ≥ 0} corresponding to the constant random variables 1/n and 0. We do not have pointwise convergence of Fn(x) to F(x), since Fn(0) = 0 for all n but F(0) = 1. However, Fn(x) → F(x) is true for all x ≠ 0. Not coincidentally, the point x = 0 where convergence of Fn(x) to F(x) fails is the only point at which the function F(x) is not continuous.
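Example 2.9 can be verified numerically using only the standard library, since Φ(z) is expressible through the error function. The particular x and n values below are arbitrary illustration choices.

```python
import math

# Numerical check of Example 2.9 (the x and n values are arbitrary):
# if X_n ~ N(0, n), then F_n(x) = Phi(x / sqrt(n)) -> 1/2 for every
# fixed x, and the constant function 1/2 is not a distribution function.
def phi(z):
    """Standard normal distribution function, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def f_n(x, n):
    """Distribution function of a N(0, n) random variable, evaluated at x."""
    return phi(x / math.sqrt(n))

for x in (-3.0, 0.0, 3.0):
    print(x, [round(f_n(x, n), 4) for n in (1, 100, 10000)])
```

For every fixed x, the printed values drift toward 1/2 as n increases, exactly as the example claims.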
To write a sensible definition of convergence in distribution, Example 2.9 demonstrates that we should require that the limit of distribution functions be a distribution function itself, say F(x), while Example 2.10 suggests that we should exclude points where F(x) is not continuous. We therefore arrive at the following definition:

Definition 2.11 Suppose that X has distribution function F(x) and that Xn has distribution function Fn(x) for each n. Then we say Xn converges in distribution to X, written Xn →d X, if Fn(x) → F(x) as n → ∞ for all x at which F(x) is continuous. Convergence in distribution is sometimes called convergence in law and written Xn →L X.

The notation of Definition 2.11 may be stretched a bit; sometimes the expressions on either side of the →d symbol may be distribution functions or other notations indicating certain distributions, rather than actual random variables as in the definition. The meaning is always clear even if the notation is not consistent. However, one common mistake should be avoided at all costs: If →d (or →P or any other "limit arrow") indicates that n → ∞, then n must never appear on the right side of the arrow. See Expression (2.8) in Example 2.12 for an example of how this rule is sometimes violated.

Example 2.12 The Central Limit Theorem for i.i.d. sequences: Let X1, ..., Xn be independent and identically distributed (i.i.d.) with mean µ and finite variance σ². Then by a result that will be covered in Chapter 4 (but which is perhaps already known to the reader),

    √n [ (1/n) Σ{i=1 to n} Xi − µ ] →d N(0, σ²),    (2.7)

where N(0, σ²) denotes a normal distribution with mean 0 and variance σ². [N(0, σ²) is not actually a random variable; this is an example of "stretching the →d notation" referred to above.]
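The statement (2.7) is easy to see in simulation. In the sketch below, the choice of distribution (standard exponential, so µ = 1 and σ² = 1), the sample size, and the replication count are all arbitrary illustration choices: the standardized sample means √n(X̄n − µ) should behave approximately like N(0, σ²) draws.

```python
import math
import random

# Monte Carlo sketch of the CLT statement (2.7).  The choice of
# distribution (exponential, so mu = 1 and sigma^2 = 1), the sample size
# n, and the replication count are all arbitrary illustration choices.
random.seed(1)
n, reps = 400, 4000

z = []
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    z.append(math.sqrt(n) * (xbar - 1.0))  # sqrt(n) * (sample mean - mu)

mean_z = sum(z) / reps
sd_z = math.sqrt(sum((v - mean_z) ** 2 for v in z) / reps)
frac_within_1 = sum(abs(v) < 1.0 for v in z) / reps  # N(0,1) gives about 0.683
print(round(mean_z, 3), round(sd_z, 3), round(frac_within_1, 3))
```

The empirical mean is near 0, the empirical standard deviation near σ = 1, and roughly 68% of the standardized values fall in (−1, 1), as a N(0, 1) limit predicts.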
Because Equation (2.7) may be interpreted as saying that the sample mean X̄n has approximately a N(µ, σ²/n) distribution, it may seem tempting to "rewrite" Equation (2.7) as

    (1/n) Σ{i=1 to n} Xi →d N(µ, σ²/n).    (2.8)

Resist the temptation to do this! As pointed out above, n should never appear on the right side of a limit arrow (as long as that limit arrow expresses the idea that n is tending to ∞). By Definition 2.5, the right side of (2.7) is OP(1). We may therefore write (after dividing through by √n and adding µ)

    (1/n) Σ{i=1 to n} Xi = µ + OP(1/√n) as n → ∞.    (2.9)

Unlike Expression (2.8), Equation (2.9) is perfectly legal; and although it is less specific than Expression (2.7), it expresses at a glance the √n-rate of convergence of the sample mean to µ.

Unlike Xn →P X, the expression Xn →d X does not require Xn − X to be a random variable; in fact, Xn →d X is possible even if Xn and X are not defined on the same sample space. Even if Xn and X do have a joint distribution, it is easy to construct an example in which Xn →d X but Xn does not converge to X in probability: Take Z1 and Z2 to be independent and identically distributed standard normal random variables, then let Xn = Z1 for all n and X = Z2. Since Xn and X have exactly the same distribution by construction, Xn →d X in this case. However, since Xn − X is a N(0, 2) random variable for all n, we do not have Xn →P X.

We conclude that Xn →d X cannot possibly imply Xn →P X (but see Theorem 2.14 for a special case in which it does). However, the implication in the other direction is always true:

Theorem 2.13 If Xn →P X, then Xn →d X.

Proof: Let Fn(x) and F(x) denote the distribution functions of Xn and X, respectively. Assume that Xn →P X. We need to show that Fn(t) → F(t), where t is any point of continuity of F(x). Choose any ε > 0. Whenever Xn ≤ t, it must be true that either X ≤ t + ε or |Xn − X| > ε. This implies that Fn(t) ≤ F(t + ε) + P(|Xn − X| > ε). Similarly, whenever X ≤ t − ε, either Xn ≤ t or |Xn − X| > ε, implying F(t − ε) ≤ Fn(t) + P(|Xn − X| > ε). We conclude that for arbitrary n and ε > 0,

    F(t − ε) − P(|Xn − X| > ε) ≤ Fn(t) ≤ F(t + ε) + P(|Xn − X| > ε).    (2.10)

Taking both the lim inf over n and the lim sup over n of the above inequalities, we conclude [since Xn →P X implies P(|Xn − X| > ε) → 0] that

    F(t − ε) ≤ lim inf Fn(t) ≤ lim sup Fn(t) ≤ F(t + ε)

for all ε. Since t is a continuity point of F(x), letting ε → 0 implies

    F(t) = lim inf Fn(t) = lim sup Fn(t),

so we conclude Fn(t) → F(t) and the theorem is proved.

We remarked earlier that Xn →d X could not possibly imply Xn →P X because the latter expression requires that Xn and X be defined on the same sample space for every n. However, a constant c may be considered to be a random variable defined on any sample space; thus, it is reasonable to ask whether Xn →d c implies Xn →P c. The answer is yes:

Theorem 2.14 Xn →P c if and only if Xn →d c.

Proof: We only need to prove that Xn →d c implies Xn →P c, since the other direction is a special case of Theorem 2.13. If F(x) is the distribution function I{x ≥ c} of the constant random variable c, then c + ε and c − ε are points of continuity of F(x) for any ε > 0. Therefore, Xn →d c implies that Fn(c − ε) → F(c − ε) = 0 and Fn(c + ε) → F(c + ε) = 1 as n → ∞. We conclude that

    P(−ε < Xn − c ≤ ε) = Fn(c + ε) − Fn(c − ε) → 1,

which means Xn →P c.

When we speak of convergence of random variables to a constant in this book, most commonly we refer to convergence in probability, which (according to Theorem 2.14) is equivalent to convergence in distribution. On the other hand, when we speak of convergence to a random variable, we nearly always refer to convergence in distribution. Therefore, in a sense, Theorem 2.14 makes convergence in distribution the most important form of convergence in this book. This type of convergence is often called "weak convergence."

2.1.4 Convergence in Mean

The third and final mode of convergence in this chapter is useful primarily because it is sometimes easy to verify and thus gives a quick way to prove convergence in probability, as Theorem 2.17 below implies.

Definition 2.15 Let a be a positive constant. We say that Xn converges in ath mean to X, written Xn →a X, if

    E|Xn − X|^a → 0 as n → ∞.    (2.11)

Two specific cases of Definition 2.15 deserve special mention. When a = 1, we normally omit mention of the a and simply refer to the condition E|Xn − X| → 0 as convergence in mean. Convergence in mean is not equivalent to E Xn → E X: For one thing, E Xn → E X is possible without any regard to the joint distribution of Xn and X, whereas E|Xn − X| → 0 clearly requires that Xn − X be a well-defined random variable. Even more important than a = 1 is the special case a = 2:

Definition 2.16 We say that Xn converges in quadratic mean to X, written Xn →qm X, if

    E|Xn − X|² → 0 as n → ∞.

Convergence in quadratic mean is important for two reasons. First, it is often quite easy to check; in Exercise 2.6, you are asked to prove that Xn →qm c if and only if E Xn → c and Var Xn → 0 for some constant c. Second, quadratic mean convergence (indeed, ath mean convergence for any a > 0) is stronger than convergence in probability, which means that weak consistency of an estimator may be established by checking that it converges in quadratic mean. This latter property is a corollary of the following result:

Theorem 2.17 (a) For a constant c, Xn →qm c if and only if E Xn → c and Var Xn → 0.
(b) For fixed a > 0, Xn →a X implies Xn →P X.

Proof: Part (a) is the subject of Exercise 2.6. Part (b) relies on Markov's inequality (1.35), which states that

    P(|Xn − X| ≥ ε) ≤ (1/ε^a) E|Xn − X|^a    (2.12)

for an arbitrary fixed ε > 0.
If Xn →a X, then by definition the right-hand side of inequality (2.12) goes to zero as n → ∞, so the left side also goes to zero and we conclude that Xn →P X by definition.

Example 2.18 Any unbiased estimator is consistent if its variance goes to zero. This fact follows directly from Theorem 2.17(a) and (b). As an example, consider a sequence of independent and identically distributed random variables X1, X2, ... with mean µ and finite variance σ². The sample mean

    X̄n = (1/n) Σ{i=1 to n} Xi

has mean µ and variance σ²/n. Therefore, X̄n is unbiased and its variance goes to zero, so we conclude that it is consistent; i.e., X̄n →P µ. This fact is the Weak Law of Large Numbers (see Theorem 2.19) for the case of random variables with finite variance.

Exercises for Section 2.1

Exercise 2.1 For each of the three cases below, prove that Xn →P 1:
(a) Xn = 1 + nYn, where Yn is a Bernoulli random variable with mean 1/n.
(b) Xn = Yn / log n, where Yn is a Poisson random variable with mean Σ{i=1 to n} (1/i).
(c) Xn = (1/n) Σ{i=1 to n} Yi², where the Yi are independent standard normal random variables.

Exercise 2.2 This exercise deals with bounded in probability sequences; see Definition 2.6.
(a) Prove that if Xn →d X for some random variable X, then Xn is bounded in probability.
Hint: You may use the fact that any interval of real numbers must contain a point of continuity of F(x). Also, recall that F(x) → 1 as x → ∞.
(b) Prove that if Xn is bounded in probability and Yn →P 0, then Xn Yn →P 0.
Hint: For fixed ε > 0, argue that there must be M and N such that P(|Xn| < M) > 1 − ε/2 and P(|Yn| < ε/M) > 1 − ε/2 for all n > N. What is then the smallest possible value of P(|Xn| < M and |Yn| < ε/M)? Use this result to prove Xn Yn →P 0.

Exercise 2.3 The Poisson approximation to the binomial:
(a) Suppose that Xn is a binomial random variable with n trials, where the probability of success on each trial is λ/n.
Let X be a Poisson random variable with the same mean as Xn, namely λ. Prove that Xn →d X.
Hint: Argue that it suffices to show that P(Xn = k) → P(X = k) for all nonnegative integers k. Then use Stirling's formula (1.19).
(b) Part (a) can be useful in approximating binomial probabilities in cases where the number of trials is large but the success probability is small: Simply consider a Poisson random variable with the same mean as the binomial variable. Assume that Xn is a binomial random variable with parameters n and 2/n. Create a plot on which you plot P(X10 = k) for k = 0, ..., 10. On the same set of axes, plot the same probabilities for X20, X50, and the Poisson variable we'll denote by X∞. Try looking at the same plot but with the probabilities transformed using the logit (log-odds) transformation logit(t) = log(t) − log(1 − t). Which plot makes it easier to characterize the trend you observe?

Exercise 2.4 Suppose that X1, ..., Xn are independent and identically distributed Uniform(0, 1) random variables. For a real number t, let

    Gn(t) = Σ{i=1 to n} I{Xi ≤ t}.

(a) What is the distribution of Gn(t) if 0 < t < 1?
(b) Suppose c > 0. Find the distribution of a random variable X such that Gn(c/n) →d X. Justify your answer.
(c) How does your answer to part (b) change if X1, ..., Xn are from a standard exponential distribution instead of a uniform distribution? The standard exponential distribution function is F(t) = 1 − e^(−t).

Exercise 2.5 For each of the three examples in Exercise 2.1, does Xn →qm 1? Justify your answers.

Exercise 2.6 Prove Theorem 2.17(a).

Exercise 2.7 The converse of Theorem 2.17(b) is not true. Construct a counterexample in which Xn →P 0 but E Xn = 1 for all n (by Theorem 2.17, if E Xn = 1, then Xn cannot converge in quadratic mean to 0).
Hint: The mean of a random variable may be strongly influenced by a large value that occurs with small probability (and if this probability goes to zero, then the mean can be influenced in this way without destroying convergence in probability).

Exercise 2.8 Prove or disprove this statement: If there exists M such that P(|Xn| < M) = 1 for all n, then Xn →P c implies Xn →qm c.

Exercise 2.9 (a) Prove that if 0 < a < b, then convergence in bth mean is stronger than convergence in ath mean; i.e., Xn →b X implies Xn →a X.
Hint: Use Exercise 1.40 with α = b/a.
(b) Prove by counterexample that the conclusion of part (a) is not true in general if 0 < b < a.

2.2 The Weak Law of Large Numbers

We begin with a formal statement of the weak law of large numbers for an independent and identically distributed sequence. Later, we discuss some cases in which the sequence of random variables is not independent and identically distributed.

2.2.1 Consistent Estimates of the Mean

For a sequence of random vectors X1, X2, ..., we denote the nth sample mean by

    X̄n = (1/n) Σ{i=1 to n} Xi.

Theorem 2.19 Weak Law of Large Numbers (univariate version): Suppose that X1, X2, ... are independent and identically distributed and have finite mean µ. Then X̄n →P µ.

The proof of Theorem 2.19 in its full generality is beyond the scope of this chapter, though it may be proved using the tools in Section 4.1. However, by tightening the assumptions a bit, a proof can be made simple. For example, if the Xi are assumed to have finite variance (not a terribly restrictive assumption), the weak law may be proved in a single line: Chebyshev's inequality (1.36) implies that

    P(|X̄n − µ| ≥ ε) ≤ Var X̄n / ε² = Var X1 / (nε²) → 0,

so X̄n →P µ follows by definition. (This is an alternative proof of the same result in Example 2.18.)

Example 2.20 If X ∼ binomial(n, p), then X/n →P p. Although we could prove this fact directly using the definition of convergence in probability, it follows immediately from the Weak Law of Large Numbers due to the fact that X/n is the sample mean of n independent and identically distributed Bernoulli random variables, each with mean p.

Example 2.21 Suppose that X1, X2, ... are independent and identically distributed with mean µ and finite variance σ². Then the estimator

    σ̂n² = (1/n) Σ{i=1 to n} (Xi − µ)²

is consistent for σ² because of the Weak Law of Large Numbers: The (Xi − µ)² are independent and identically distributed and they have mean σ². Ordinarily, of course, we do not know µ, so we replace µ by X̄n (and often replace 1/n by 1/(n−1)) to obtain the sample variance. We need a bit more theory to establish the consistency of the sample variance, but we will revisit this topic later.

2.2.2 Independent but not Identically Distributed Variables

Let us now generalize the conditions of the previous section: Suppose that X1, X2, ... are independent but not necessarily identically distributed, though they have at least the same mean, so