SUFFICIENT STATISTICS

1. Introduction

Let $X = (X_1, \ldots, X_n)$ be a random sample from $f_\theta$, where $\theta \in \Theta$ is unknown. We are interested in using $X$ to estimate $\theta$. In the simple case where $X_i \sim \mathrm{Bern}(p)$, we found that the sample mean was an efficient estimator for $p$. Thus, if we observe a finite sequence of coin flips, then in order to have an efficient estimate of the probability $p$ that heads occurs in a single flip, we need only count the number of times we see heads (and divide by the total number of flips); we need not worry about the order in which the heads and tails occurred. Note that the sequences 100011 and 111000 lead to the same estimate when we use the sample mean. In what follows, we want to study the following question: do we get any additional information about $p$ by making use of the order in which the heads and tails occurred? The sample mean does not make use of the order, and it does give us an efficient estimator, so, in short, the answer in this case is no. Thus in this example, it appears that we can greatly simplify and reduce the amount of data without affecting our ability to find good estimators.

2. Sufficient statistics

Let $X = (X_1, \ldots, X_n)$ be a random sample from $f_\theta$, where $\theta \in \Theta$ is unknown. Recall that $T$ is a statistic if $T = T(X) = u(X)$ for some deterministic function $u$. We will assume that $u$ does not depend on $\theta$. Some examples that you are familiar with are the sample mean, the sample variance, and the maximum. Let us remark that although in many important examples $T$ is a one-dimensional point estimator, it need not be; for example, $T(X) = X$ is a statistic.

We say that $T$ is sufficient for $\theta$ if the conditional distribution of $X$ given $T$ does not depend on $\theta$. In the case where the random variables involved are not discrete, even this definition requires somewhat advanced mathematics, since we might have that $P(T = t) = 0$, in which case it is not immediate how one can make sense of $P(X \in \cdot \mid T = t)$. We will first discuss the discrete case, and then we will extend our discussion to the continuous case.

3. The discrete case

Exercise 1. Let $X = (X_1, \ldots, X_n)$ be a random sample, where $X_i \sim \mathrm{Bern}(p)$. Show that the sample sum given by $T = X_1 + \cdots + X_n$ is a sufficient statistic for $p$.

Solution. Let $x \in \{0,1\}^n$ and $t \in \{0, 1, \ldots, n\}$. We need to show that $P(X = x \mid T = t)$ does not depend on $p$. In fact, you already did this computation in the first homework! By definition,
$$P(X = x \mid T = t) = \frac{P(X = x, T = t)}{P(T = t)}.$$
We may assume that $t = t(x) = x_1 + \cdots + x_n$; otherwise, $P(X = x, T = t) = 0$. Thus $\{X = x\} = \{X = x\} \cap \{T = t\}$, and $P(X = x, T = t) = P(X = x)$. We have that
$$P(X = x) = L(x; p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^t (1-p)^{n-t}.$$
We also know that $T \sim \mathrm{Bin}(n, p)$, so that
$$P(T = t) = \binom{n}{t} p^t (1-p)^{n-t}.$$
Hence we obtain that
$$P(X = x \mid T = t) = \frac{1}{\binom{n}{t}},$$
which does not depend on $p$.

Exercise 2. Discuss why you should expect that the final answer we obtained in Exercise 1 is
$$P(X = x \mid T = t) = \frac{1}{\binom{n}{t}}.$$

In the discrete setting, we have that $T$ is sufficient for $\theta$ if and only if for all $x$ and $t = t(x)$, we have
$$P(X = x \mid T = t) = \frac{P(X = x, T = t)}{P(T = t)} = \frac{P(X = x)}{P(T = t)} = H(x),$$
for some function $H(x)$ which does not depend on $\theta$.
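The conclusion of Exercise 1 lends itself to a quick simulation check. The sketch below is illustrative and not part of the notes; the parameter values, sample size, and function names are arbitrary choices. It draws Bernoulli samples for two different values of $p$, keeps only those with $T = t$, and tabulates sequence frequencies; every sequence with sum $t$ appears with frequency near $1/\binom{n}{t}$, regardless of $p$.

```python
# Simulation sketch for Exercise 1 (all names and parameters are
# illustrative choices): conditional on T = t, each 0/1 sequence with
# sum t should occur with probability 1/C(n, t), whatever p is.
import random
from collections import Counter
from math import comb

def conditional_freqs(p, n=4, t=2, n_draws=200_000, seed=0):
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_draws):
        x = tuple(int(rng.random() < p) for _ in range(n))
        if sum(x) == t:                 # condition on the event {T = t}
            counts[x] += 1
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()}

for p in (0.3, 0.7):
    print(f"p = {p}, target 1/C(4,2) = {1 / comb(4, 2):.4f}")
    for x, freq in sorted(conditional_freqs(p).items()):
        print(f"  {x}: {freq:.4f}")
```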
Exercise 3. Let $X = (X_1, \ldots, X_n)$ be a random sample, where $X_i$ is a discrete random variable that is uniformly distributed in $\{1, 2, \ldots, \theta\}$. Show that $M = \max\{X_1, \ldots, X_n\}$ is a sufficient statistic for $\theta$.

Solution. Let $m \in \{1, 2, \ldots, \theta\}$. Note that
$$\{M \le m\} = \bigcap_{i=1}^n \{X_i \le m\} \quad \text{and} \quad \{M = m\} = \{M \le m\} \setminus \{M \le m-1\}.$$
Hence
$$P(M = m) = \frac{1}{\theta^n}\bigl(m^n - (m-1)^n\bigr).$$
Let $x \in \{1, 2, \ldots, \theta\}^n$ and $m = \max\{x_1, \ldots, x_n\}$. Then
$$P(X = x \mid M = m) = \frac{P(X = x, M = m)}{P(M = m)} = \frac{P(X = x)}{P(M = m)} = \frac{\tfrac{1}{\theta^n}}{\tfrac{1}{\theta^n}(m^n - (m-1)^n)} = \frac{1}{m^n - (m-1)^n},$$
so we are done.

Exercise 4. Let $X = (X_1, \ldots, X_n)$ be a random sample, where $X_i$ is a Poisson random variable with mean $\lambda$. Show that the sample sum given by $T = X_1 + \cdots + X_n$ is a sufficient statistic for $\lambda$.

In order to have some more examples to discuss, recall that $X$ is a geometric random variable with parameter $p \in (0,1)$ if $P(X = k) = p(1-p)^{k-1}$ for $k = 1, 2, \ldots$. Thus $X$ is the number of Bernoulli $p$ trials required to get a success. Here, $EX = 1/p$. Let us remark that sometimes geometric random variables are defined so that $P(X = k) = p(1-p)^k$ for $k = 0, 1, 2, \ldots$; in this case $X$ is the number of failures before a success, and $EX = (1-p)/p$. Before we find a sufficient statistic for $p$, we do a couple of preliminary exercises.

Exercise 5. Let $X = (X_1, \ldots, X_n)$ be a random sample, where $X_i$ is a geometric random variable with parameter $p \in (0,1)$ and mean $1/p$. Show that the mle for $p$ is given by $1/\bar{X}$.

Exercise 6. Referring to Exercise 5, let $T = X_1 + \cdots + X_n$. Show that for $k = n, n+1, \ldots$, we have
$$P(T = k) = \binom{k-1}{n-1} p^n (1-p)^{k-n}.$$

Solution. Note that $T$ is the number of trials required to get $n$ successes. By counting we obtain the required formula: the last ($k$th) trial is a success, and of the remaining $k-1$ trials, exactly $n-1$ must be successes.

Exercise 7. Referring to Exercise 5, show that the sample sum given by $T = X_1 + \cdots + X_n$ is a sufficient statistic for $p$.

Solution. Let $x \in \{1, 2, 3, \ldots\}^n$ and $t = x_1 + \cdots + x_n$. We have that
$$P(X = x \mid T = t) = \frac{P(X = x)}{P(T = t)} = \frac{\prod_{i=1}^n p(1-p)^{x_i - 1}}{\binom{t-1}{n-1} p^n (1-p)^{t-n}} = \frac{1}{\binom{t-1}{n-1}},$$
which does not depend on $p$.

4. The continuous case

In the continuous case, as in the case of likelihoods, we work with the density functions instead of the probabilities directly. Let $X = (X_1, \ldots, X_n)$ be a random sample from $f_\theta$, where $\theta \in \Theta$ is unknown. Let $T = u(X)$ be a statistic with density function $q(t)$. Then $T$ is a sufficient statistic for $\theta$ if for all $x$ and $t = t(x)$, we have
$$\frac{L(x; \theta)}{q(t(x))} = \frac{\prod_{i=1}^n f(x_i; \theta)}{q(t(x))} = H(x),$$
for some function $H$ which does not depend on $\theta$.

Exercise 8. Let $X = (X_1, \ldots, X_n)$ be a random sample, where $X_i \sim \mathrm{Unif}(0, \theta)$, where $\theta$ is unknown. Show that $M = \max\{X_1, \ldots, X_n\}$ is a sufficient statistic for $\theta$.

Exercise 9. Let $X = (X_1, \ldots, X_n)$ be a random sample, where $X_i \sim N(\mu, 1)$, where $\mu$ is unknown. Show that the sample mean is a sufficient statistic for $\mu$.

Solution. Luckily, we know the distribution of $\bar{X}$: we have that $\bar{X} \sim N(\mu, 1/n)$. However, even with this piece of knowledge, this is a tricky exercise. First, we need the following observation. Note that
$$\sum_{i=1}^n (x_i - \bar{x}) = 0.$$
Thus
$$\sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i - \bar{x} + \bar{x} - \mu)^2 = \sum_{i=1}^n \bigl[(x_i - \bar{x})^2 + 2(x_i - \bar{x})(\bar{x} - \mu) + (\bar{x} - \mu)^2\bigr] = \sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2.$$
With this algebra in hand, we have that
$$\frac{\prod_{i=1}^n \frac{1}{\sqrt{2\pi}} e^{-\frac{(x_i - \mu)^2}{2}}}{\frac{\sqrt{n}}{\sqrt{2\pi}} e^{-\frac{n(\bar{x} - \mu)^2}{2}}} = \frac{1}{\sqrt{n}\,(2\pi)^{(n-1)/2}}\, e^{-\frac{1}{2}\sum_{i=1}^n (x_i - \bar{x})^2},$$
which does not depend on $\mu$.
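As a numerical sanity check on Exercise 9, the following sketch computes the ratio $L(x; \mu)/q(\bar{x}; \mu)$ for one fixed sample $x$ (an arbitrary made-up data vector, not from the notes) and several values of $\mu$; by the computation above, the printed ratio should be the same constant every time. Only the two density formulas from the solution are assumed.

```python
# Sketch for Exercise 9: the ratio L(x; mu) / q(xbar; mu) should not
# depend on mu.  The data vector x is an arbitrary illustrative sample.
from math import exp, pi, sqrt

def likelihood(x, mu):
    # Product of N(mu, 1) densities over the sample.
    out = 1.0
    for xi in x:
        out *= exp(-((xi - mu) ** 2) / 2) / sqrt(2 * pi)
    return out

def density_xbar(xbar, mu, n):
    # Density of X-bar ~ N(mu, 1/n) evaluated at xbar.
    return sqrt(n / (2 * pi)) * exp(-n * (xbar - mu) ** 2 / 2)

x = [0.2, -1.1, 0.9, 1.7, -0.4]
xbar = sum(x) / len(x)
for mu in (-2.0, 0.0, 0.5, 3.0):
    print(mu, likelihood(x, mu) / density_xbar(xbar, mu, len(x)))
```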
5. Fisher-Neyman factorization

We saw in the previous exercises that proving that a statistic is sufficient from the definition can be quite challenging. The following factorization theorem makes life easier.

Theorem 10. Let $X = (X_1, \ldots, X_n)$ be a random sample from the pdf $f_\theta$, where $\theta \in \Theta$ is unknown. A statistic $T$ is sufficient for $\theta$ if and only if there exist nonnegative functions $g(t; \theta)$ and $h(x)$ (which does not depend on $\theta$) such that for all points $x$ and all $\theta \in \Theta$, we have
$$L(x; \theta) = \prod_{i=1}^n f(x_i; \theta) = g(t(x); \theta)\, h(x).$$

Clearly, by definition, a factorization holds if $T$ is sufficient, so one direction of the proof is trivial. It is also immediate from Theorem 10 that a 1-1 function of a sufficient statistic is again sufficient. Let us also remark that in Theorem 10, $g(t; \theta)$ does not have to be the density function of $T(X)$, and in the discrete case we do not require that $g(t) = P(T = t)$. The factorization of Theorem 10 is not unique. The utility of Theorem 10 lies in the fact that we do not need to identify the distribution of $T$. Before we prove the non-trivial direction of Theorem 10, let us apply it to Exercise 9.

Exercise 11. Apply Theorem 10 to solve Exercise 9.

Solution (Solution to Exercise 9). The difference here is that we still need the somewhat tricky algebra, but we no longer need to know that a sum of independent normals is again normal. We have
$$L(x; \mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} e^{-\frac{(x_i - \mu)^2}{2}} = (2\pi)^{-n/2} \cdot e^{-\frac{1}{2}\sum_{i=1}^n (x_i - \bar{x})^2} \cdot e^{-\frac{n}{2}(\bar{x} - \mu)^2}.$$
Thus we choose $g(\bar{x}; \mu) = e^{-\frac{n}{2}(\bar{x} - \mu)^2}$ and $h(x) = (2\pi)^{-n/2} \cdot e^{-\frac{1}{2}\sum_{i=1}^n (x_i - \bar{x})^2}$.

Exercise 12. Let $X = (X_1, \ldots, X_n)$ be a random sample, where $X_i \sim N(0, \theta)$, where the variance $\theta$ is unknown. Show that $T = \sum_{i=1}^n X_i^2$ is a sufficient statistic for $\theta$.

Exercise 13. Let $X = (X_1, \ldots, X_n)$ be a random sample, where $X_i \sim N(\mu, \sigma^2)$, where both $\mu$ and $\sigma^2$ are unknown. Set $\theta = (\mu, \sigma^2)$. Let $T = (\bar{X}, S^2)$, where $\bar{X}$ is the usual sample mean and $S^2$ is the usual sample variance. Show that $L(x; \theta) = g(t(x); \theta) h(x)$ for some functions $g$ and $h$, so that $T$ is a sufficient statistic for $\theta$.

Exercise 14. Apply Theorem 10 to solve Exercise 4.

Solution. Let $x \in \{0, 1, 2, \ldots\}^n$, and $t = t(x) = x_1 + \cdots + x_n$. We have that
$$P(X = x) = \prod_{i=1}^n e^{-\lambda} \frac{\lambda^{x_i}}{x_i!} = \lambda^t e^{-n\lambda} \prod_{i=1}^n \frac{1}{x_i!}.$$
Thus we choose $g(t; \lambda) = \lambda^t e^{-n\lambda}$ and $h(x) = \prod_{i=1}^n \frac{1}{x_i!}$.

Exercise 15. Let $X = (X_1, \ldots, X_n)$ be a random sample, where $X_1$ is a real-valued continuous random variable with a pdf given by
$$f(x_1; \theta) = h(x_1)\, c(\theta)\, e^{w(\theta) u(x_1)}.$$
Show that $T = \sum_{i=1}^n u(X_i)$ is a sufficient statistic for $\theta$.
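Exercise 14's factorization is easy to verify numerically; note also that the Poisson pmf is an instance of the exponential-family form of Exercise 15, with $u(x) = x$, $c(\lambda) = e^{-\lambda}$, $w(\lambda) = \log \lambda$, and $h(x) = 1/x!$. The sketch below (the sample and the $\lambda$ values are arbitrary illustrative choices) checks that $\prod_i P(X_i = x_i)$ agrees with $g(t; \lambda)\, h(x)$.

```python
# Sketch for Exercise 14: check that the joint Poisson pmf equals
# g(t; lam) * h(x), with g(t; lam) = lam^t e^{-n lam} and
# h(x) = prod_i 1/x_i!.  Sample and lambda values are illustrative.
from math import exp, factorial, prod

def joint_pmf(x, lam):
    return prod(exp(-lam) * lam ** xi / factorial(xi) for xi in x)

def g(t, lam, n):
    return lam ** t * exp(-n * lam)

def h(x):
    return prod(1 / factorial(xi) for xi in x)

x = [2, 0, 3, 1]
t = sum(x)
for lam in (0.5, 1.0, 4.2):
    print(lam, joint_pmf(x, lam), g(t, lam, len(x)) * h(x))
```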
Proof of Theorem 10 (discrete case). Let $t = t(x)$. We have by assumption that
$$\frac{P(X = x)}{P(T = t)} = \frac{g(t; \theta) \cdot h(x)}{P(T = t)}.$$
Let us remark that we do not have that $g(t; \theta) = P(T = t)$. Of course, $P(T = t) = P_\theta(T = t)$ depends on $\theta$, and the claim is that the $\theta$'s in $g(t; \theta)$ cancel out the $\theta$'s in $P(T = t)$. To see why, let $A := \{y : t(y) = t(x)\}$. Of course $x \in A$, but there could be other elements; think of $t$ as the sample sum: if $t(x) = t$, then $t(y) = t$ for any permutation $y$ of $x$. Thus,
$$P(T = t) = P(A) = \sum_{y \in A} P(X = y) = g(t; \theta) \sum_{y \in A} h(y).$$
Hence
$$\frac{P(X = x)}{P(T = t)} = \frac{h(x)}{\sum_{y \in A} h(y)},$$
which does not depend on $\theta$.

The proof in the continuous case is more technical; your text has a proof of a special case of the continuous case. The above proof is similar to the proof of the following elementary fact.

Theorem 16. Let $X$ be a discrete random variable with pmf $f$. If $g : \mathbb{R} \to \mathbb{R}$, then
$$Eg(X) = \sum_x g(x) f(x),$$
whenever the sum is absolutely convergent.

Proof. We have that
$$Eg(X) = \sum_y y\, P(g(X) = y).$$
Suppose $X$ takes values in the set $A$. Let $A_y := \{x \in A : g(x) = y\}$. Note that the sets $A_y$ partition the set $A$. Thus
$$P(g(X) = y) = P(A_y) = \sum_{x \in A_y} f(x),$$
and
$$Eg(X) = \sum_y \sum_{x \in A_y} y f(x) = \sum_y \sum_{x \in A_y} g(x) f(x) = \sum_{x \in A} g(x) f(x).$$

End of Midterm 1 coverage
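Aside: Theorem 16 is also easy to confirm numerically. The sketch below uses a toy pmf (an arbitrary illustrative choice, not from the notes) and $g(x) = x^2$, and computes $Eg(X)$ both ways: by grouping outcomes on the value of $g(X)$, and by summing $g(x) f(x)$ directly.

```python
# Sketch of Theorem 16 with a toy pmf: E[g(X)] computed by grouping on
# the values y of g(X) matches the direct sum of g(x) f(x) over x.
from collections import defaultdict

f = {-2: 0.1, -1: 0.2, 0: 0.3, 1: 0.2, 2: 0.2}   # toy pmf of X
g = lambda x: x * x                               # g(X) = X^2

# Right-hand side: sum over x of g(x) f(x).
direct = sum(g(x) * px for x, px in f.items())

# Left-hand side: sum over y of y * P(g(X) = y), grouping x by g(x).
p_of_y = defaultdict(float)
for x, px in f.items():
    p_of_y[g(x)] += px
grouped = sum(y * py for y, py in p_of_y.items())

print(direct, grouped)   # both equal 1.6 for this pmf
```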