Mathematical Statistics
Motoya Machida
April 12, 2006
This material is designed to introduce basic concepts and fundamental theory of mathematical statistics. A review of basic concepts will include likelihood functions, sufficient statistics, and exponential families of distributions. Then point estimation will be discussed, including minimum variance unbiased estimates, the Cramér-Rao inequality, maximum likelihood estimates, and asymptotic theory. Topics in the general theory of statistical tests include the Neyman–Pearson theorem, uniformly most powerful tests, and likelihood ratio tests.
1 Point estimates
A random sample X1, . . . , Xn is regarded as independent and identically distributed (iid) random variables governed by an underlying probability density function f(x; θ). A value θ represents the characteristics of this underlying distribution, and is called a parameter. Suppose, for example, that the underlying distribution is the normal distribution with parameter (µ, σ²). Then the values µ and σ² are the parameters. Since X1, . . . , Xn are random variables, the sample mean X̄ also becomes a random variable. In general, a random variable u(X) constructed from the random vector X = (X1, . . . , Xn) is called a statistic. For example, the sample mean X̄ is a statistic.
A point estimate is a statistic u(X) which is a “best guess” for the true value θ. Suppose that the underlying distribution is the normal distribution with (µ, σ²). Then the sample mean X̄ is in some sense a best guess of the parameter µ.
Mean square-error. Let u(X) be a point estimate for θ. Then the functional R(θ, u) = E[(u(X) − θ)²] of u is called the mean square-error risk function. We can immediately observe that

R(θ, u) = Var(u(X)) + [E(u(X)) − θ]² = Var(u(X)) + [b(θ, u)]²,

where b(θ, u) = E(u(X)) − θ is called the bias of u(X).
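As a quick illustration of this decomposition, here is a small Monte Carlo sketch in Python (using NumPy; the parameter values are hypothetical). It estimates the risk of the biased variance estimate σ̂² directly and via the variance-plus-squared-bias decomposition.

```python
# Monte Carlo check of R(theta, u) = Var(u(X)) + [b(theta, u)]^2 for the
# estimate u(X) = sigma-hat^2 under a normal sample; values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 0.0, 2.0, 10, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
u = samples.var(axis=1)                               # sigma-hat^2 (divides by n)

mse = np.mean((u - sigma**2) ** 2)                    # direct estimate of the risk
decomposed = u.var() + (u.mean() - sigma**2) ** 2     # Var(u) + bias^2
print(mse, decomposed)                                # agree up to simulation error
```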
2 Maximum likelihood estimate
Having observed a random sample (X1, . . . , Xn) = (x1, . . . , xn) from an underlying pdf f(x; θ), we can construct the likelihood function

L(θ, x) = Π_{i=1}^n f(xi; θ),
and consider it as a function of θ. Then the maximum likelihood estimate (MLE) θ̂ is the value of θ which
“maximizes” the likelihood function L(θ, x). It is usually easier to maximize the log likelihood
ln L(θ, x) = Σ_{i=1}^n ln f(xi; θ).
2.1 Bernoulli trials
Let f(x; θ) = θ^x (1 − θ)^{1−x} be the Bernoulli frequency function with success probability θ. By solving the equation

∂ ln L(θ, x)/∂θ = (Σ_{i=1}^n xi)/θ − (n − Σ_{i=1}^n xi)/(1 − θ) = 0,

we obtain θ* = (1/n) Σ_{i=1}^n xi, which maximizes ln L(θ, x). Therefore, θ̂ = X̄ is the MLE of θ.
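The conclusion can be checked numerically; the following Python sketch (using SciPy, with a made-up sample) maximizes the Bernoulli log likelihood and compares the maximizer with the sample mean.

```python
# Numerical check that the Bernoulli MLE is the sample mean X-bar.
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1, 0, 1, 1, 0, 1, 0, 1])        # illustrative Bernoulli data

def neg_log_likelihood(theta):
    s = x.sum()
    return -(s * np.log(theta) + (len(x) - s) * np.log(1 - theta))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x, x.mean())                     # both are 0.625
```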
2.2 Normal distribution
Let X1, . . . , Xn be a random sample from a normal distribution with parameter (µ, σ²). Then the log likelihood function of the parameter (µ, σ²) is given by

ln L(µ, σ²) = −(n/2) ln σ² − (1/(2σ²)) Σ_{i=1}^n (xi − µ)² − (n/2) ln 2π.
By solving
∂ ln L(µ, σ²)/∂µ = (1/σ²) Σ_{i=1}^n (xi − µ) = 0;
∂ ln L(µ, σ²)/∂σ² = −n/(2σ²) + (1/(2(σ²)²)) Σ_{i=1}^n (xi − µ)² = 0,
we can obtain the MLEs µ̂ and σ̂² as follows:

µ̂ = (1/n) Σ_{i=1}^n Xi = X̄;    σ̂² = (1/n) Σ_{i=1}^n (Xi − µ̂)².
Although the MLE σ̂² of σ² is consistent, it should be noted that this point estimate σ̂² is not an unbiased one.
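A short sketch (illustrative data, using NumPy) computes both MLEs and shows the bias factor (n − 1)/n of σ̂² discussed in Section 3.1.

```python
# The normal MLEs mu-hat and sigma-hat^2, next to the unbiased sample variance.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=50)

mu_hat = x.mean()                  # MLE of mu
sigma2_hat = x.var()               # MLE of sigma^2 (ddof=0 divides by n)
s2 = x.var(ddof=1)                 # unbiased version divides by n - 1

print(mu_hat, sigma2_hat, s2)      # sigma2_hat = (n-1)/n * s2 < s2
```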
2.3 Poisson distribution
Let X1 , . . . , Xn be a random sample from a Poisson distribution with parameter λ. Then the log likelihood
function of parameter λ is given by
ln L(λ) = (ln λ) Σ_{i=1}^n xi − nλ − Σ_{i=1}^n ln(xi!).
By solving
d ln L(λ)/dλ = (1/λ) Σ_{i=1}^n xi − n = 0,
we can obtain the MLE

λ̂ = (1/n) Σ_{i=1}^n Xi = X̄.
2.4 Exercise
Let X1, . . . , Xn be a random sample from each of the following density functions f(x; θ). Find the MLE θ̂ of θ.

1. f(x; θ) = (1 + θ)x^θ, 0 ≤ x ≤ 1, where θ > −1.

2. f(x; θ) = θe^{−θx}, x ≥ 0, where θ > 0.

3. f(x; θ) = θL^θ x^{−θ−1}, x ≥ L, where L > 0 is given and θ > 1 (Pareto distribution).
2.5 Solutions to exercise
1. ln L(θ) = θ Σ_{i=1}^n ln xi + n ln(1 + θ). By solving (d/dθ) ln L(θ) = Σ_{i=1}^n ln xi + n/(1 + θ) = 0, we obtain θ̂ = −n / (Σ_{i=1}^n ln xi) − 1.

2. ln L(θ) = −θ Σ_{i=1}^n xi + n ln θ. By solving (d/dθ) ln L(θ) = −Σ_{i=1}^n xi + n/θ = 0, we obtain θ̂ = n / Σ_{i=1}^n xi.

3. ln L(θ) = −(θ + 1) Σ_{i=1}^n ln xi + n ln θ + nθ ln L. By solving (d/dθ) ln L(θ) = −Σ_{i=1}^n ln xi + n/θ + n ln L = 0, we obtain θ̂ = n / (Σ_{i=1}^n ln xi − n ln L).
3 Properties of MLE

3.1 Consistency
One of the important attributes of a point estimate is unbiasedness. Since a statistic θ̂ is a random variable, we can consider the expectation E(θ̂) of θ̂. Then the point estimate θ̂ of θ is unbiased if it satisfies E(θ̂) = θ. In the case of the normal distribution, the MLE X̄ for µ is unbiased since E(X̄) = µ. However, the MLE σ̂² for σ² is not unbiased, since

E[(1/n) Σ_{i=1}^n (Xi − X̄)²] = ((n − 1)/n) σ².

Note that the point estimate θ̂ of θ also depends on the sample size n. We say that θ̂ is consistent if θ̂ converges in probability to θ as n → ∞. For example, the above MLEs X̄ and σ̂² are both consistent by the weak law of large numbers. In general, the MLE is consistent under appropriate conditions.
3.2 Invariance
Suppose that h(θ) is a one-to-one function of θ. Then it is clearly seen that θ̂ is the MLE for θ if and only if
h(θ̂) is the MLE for h(θ). Even if h(θ) is not one-to-one, h(θ̂) will be viewed as the MLE which corresponds to
the maximum likelihood.
3.3 Asymptotic normality
Suppose that θ̂ is the MLE and consistent, and that it satisfies ∂ ln L(θ̂, X)/∂θ = 0. By the Taylor expansion we have

∂ ln L(θ̂, X)/∂θ ≈ ∂ ln L(θ, X)/∂θ + (θ̂ − θ) ∂² ln L(θ, X)/∂θ² = 0.
Since θ̂ is close to θ by consistency, the approximation is valid. Furthermore, we can make the following observations:

1. The random variables ∂ ln f(Xi; θ)/∂θ are iid with mean 0 and variance I1(θ) = Var(∂ ln f(X1; θ)/∂θ). By the central limit theorem,

Zn = (∂ ln L(θ, X)/∂θ) / √(n I1(θ)) = (Σ_{i=1}^n ∂ ln f(Xi; θ)/∂θ) / √(n I1(θ))

converges to N(0, 1) in distribution as n → ∞.

2. The random variables ∂² ln f(Xi; θ)/∂θ² are iid with mean (−I1(θ)) and finite variance. By the weak law of large numbers,

Wn = (1/n) ∂² ln L(θ, X)/∂θ² = (1/n) Σ_{i=1}^n ∂² ln f(Xi; θ)/∂θ²

converges to −I1(θ) in probability as n → ∞.
Together with Slutsky’s theorem we can find that

√(n I1(θ)) (θ̂ − θ) ≈ Zn / (Wn / (−I1(θ)))

converges to N(0, 1) in distribution as n → ∞. Hence, the MLE θ̂ has “approximately” a normal distribution with mean θ and variance 1/(n I1(θ)) = 1/I(θ) if n is large, where I(θ) is the Fisher information for the random sample X. This suggests that

n × Var(θ̂) → 1/I1(θ) as n → ∞.

Then we call θ̂ asymptotically efficient (cf. Bickel and Doksum, “Mathematical Statistics,” Chapter 4).
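A simulation sketch (Bernoulli case, hypothetical settings; NumPy only) illustrates this: for Bernoulli(θ) we have I1(θ) = 1/(θ(1 − θ)), and the standardized MLE should look approximately standard normal.

```python
# Checking that sqrt(n I1(theta)) (theta-hat - theta) is approximately N(0, 1).
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 0.3, 500, 100_000
I1 = 1.0 / (theta * (1 - theta))          # Fisher information per observation

theta_hat = rng.binomial(n, theta, size=reps) / n
z = np.sqrt(n * I1) * (theta_hat - theta)
print(z.mean(), z.std())                  # close to 0 and 1
```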
4 Confidence interval
Let X = (X1 , . . . , Xn ) be a random sample from f (x; θ). Let u1 (X) and u2 (X) be statistics satisfying u1 (X) ≤
u2 (X). If
P (u1 (X) < θ < u2 (X)) = 1 − α for every θ,
then the random interval (u1 (X), u2 (X)) is called a confidence interval of level (1 − α).
4.1 Population mean
Let X1, . . . , Xn be iid random variables from N(µ, σ²). The sample mean X̄ is an unbiased estimate of the parameter µ. Then the random variable (X̄ − µ)/(S/√n) has the t-distribution with (n − 1) degrees of freedom.
Thus, by using the critical point t_{α/2, n−1} we obtain

P( |X̄ − µ| / (S/√n) < t_{α/2, n−1} ) = P( X̄ − t_{α/2, n−1} S/√n < µ < X̄ + t_{α/2, n−1} S/√n ) = 1 − α.
This implies that the parameter µ is in the interval

( X̄ − t_{α/2, n−1} S/√n, X̄ + t_{α/2, n−1} S/√n )

with probability (1 − α). The interval is also known as the t-interval.
Example. A random sample of n milk containers is selected, and their milk contents are weighed. The data

X1, . . . , Xn    (1)

can be used to investigate the unknown population mean of the milk container weights. The random selection of the sample should ensure that the above data can be assumed to be iid. Suppose that we have calculated X̄ = 2.073 and S = 0.071 from the actual data with n = 30. Then by choosing α = 0.05, we have the critical point t_{0.025,29} = 2.045, and therefore obtain the confidence interval

( 2.073 − 2.045 × 0.071/√30, 2.073 + 2.045 × 0.071/√30 ) = (2.046, 2.100)

of level 0.95 (or, of level 95%).

Even if the data (1) are not normally distributed, the central limit theorem says that the estimate X̄ is approximately distributed as N(µ, σ²/n). In either case it is sensible to use critical points from the t-distribution.
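The milk-container interval can be reproduced with a few lines of Python (a sketch using SciPy):

```python
# The 95% t-interval from the summary statistics n = 30, X-bar = 2.073, S = 0.071.
import numpy as np
from scipy.stats import t

n, xbar, s, alpha = 30, 2.073, 0.071, 0.05
crit = t.ppf(1 - alpha / 2, df=n - 1)        # t_{0.025,29} = 2.045
half = crit * s / np.sqrt(n)
print((xbar - half, xbar + half))            # about (2.046, 2.100)
```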
4.2 Population proportion
Let X1, . . . , Xn be iid Bernoulli random variables with success probability p. The sample mean X̄ is the MLE of the parameter p, and unbiased. By the central limit theorem, the random variable (X̄ − p)/√(p(1 − p)/n) is approximately N(0, 1) as n gets larger (at least np > 5 and n(1 − p) > 5 by rule of thumb). Here we define the critical point zα for the standard normal distribution by P(X > zα) = α with a standard normal random variable X. Thus, we have

P( |X̄ − p| / √(p(1 − p)/n) < z_{α/2} ) = P( X̄ − z_{α/2} √(p(1 − p)/n) < p < X̄ + z_{α/2} √(p(1 − p)/n) ) ≈ 1 − α.

Here we can use √(X̄(1 − X̄)/n) as an estimate for √(p(1 − p)/n). (We will see later that it is the MLE via the invariance property since X̄ is the MLE of p.) Together we obtain the confidence interval

( X̄ − z_{α/2} √(X̄(1 − X̄)/n), X̄ + z_{α/2} √(X̄(1 − X̄)/n) )

of level (1 − α).
There is an alternative and possibly more accurate method to derive a confidence interval. Here we observe that

P( |X̄ − p| / √(p(1 − p)/n) < z_{α/2} ) = P( (n + z_{α/2}²) p² − 2(nX̄ + z_{α/2}²/2) p + nX̄² ≤ 0 ) ≈ 1 − α.

This implies that the parameter p is in the interval (p̂−, p̂+) with probability (1 − α), where

p̂± = [ nX̄ + z_{α/2}²/2 ± z_{α/2} √( nX̄(1 − X̄) + z_{α/2}²/4 ) ] / (n + z_{α/2}²).
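A sketch comparing the two intervals (the simple one and the more accurate quadratic form) for hypothetical values X̄ = 0.2 and n = 50:

```python
# Simple (Wald) interval versus the quadratic interval (p-hat_-, p-hat_+).
import numpy as np
from scipy.stats import norm

n, xbar, alpha = 50, 0.2, 0.05
z = norm.ppf(1 - alpha / 2)                            # z_{0.025} = 1.96

half = z * np.sqrt(xbar * (1 - xbar) / n)
simple = (xbar - half, xbar + half)

center = n * xbar + z**2 / 2
spread = z * np.sqrt(n * xbar * (1 - xbar) + z**2 / 4)
quadratic = ((center - spread) / (n + z**2), (center + spread) / (n + z**2))

print(simple, quadratic)          # the quadratic interval is pulled toward 1/2
```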
5 Exploratory Data Analysis
The data values recorded x1 , . . . , xn are typically considered as the observed values of random variables X1 , . . . , Xn
having a common probability distribution f (x). To judge the quality of data, it is useful to envisage a population
from which the sample should be drawn. A random sample is chosen at random from the population to ensure
that the sample is representative of the population. Once a data set has been collected, it is useful to find an
informative way of presenting it. Graphical representations of data in various forms can be quite informative.
5.1 Relative frequency histogram
Given the number of observations fi, called the frequency, in the i-th interval, the height hi of the i-th rectangle above the i-th interval is represented by

hi = fi / (n × (width of the i-th interval)).

When the width of each interval is equally chosen, the common width w is called the bandwidth, and the height hi becomes

hi = fi / (n × w).
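This is exactly the convention behind density histograms in standard software; for example (a NumPy sketch with made-up data):

```python
# numpy's density histogram computes h_i = f_i / (n * w) for equal-width bins.
import numpy as np

x = np.array([1.2, 1.9, 2.1, 2.4, 3.3, 3.5, 3.7, 4.8])
heights, edges = np.histogram(x, bins=4, density=True)

w = edges[1] - edges[0]                      # common bandwidth w
freqs, _ = np.histogram(x, bins=edges)       # frequencies f_i
print(heights, freqs / (len(x) * w))         # identical values
```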
5.2 Stem and leaf plot
This is much like a histogram except it portrays a data set itself.

Data set
20.5 20.7 20.8 21.0 21.0 21.4
21.5 22.0 22.1 22.5 22.6 22.6
22.7 22.7 22.9 22.9 23.1 23.3
23.4 23.5 23.6 23.6 23.6 23.9
24.1 24.3 24.5 24.5 24.8 24.8
24.9 24.9 25.1 25.1 25.2 25.6
25.8 25.9 26.1 26.7

Stem-and-leaf
20 | 578
21 | 0045
22 | 015667799
23 | 13456669
24 | 13558899
25 | 112689
26 | 17
5.3 Boxplot
The sample median is the value of the “middle” data point. When there is an odd number of observations, the median is simply the middle value. For example, the median of 2, 4, and 7 is 4. When there is an even number of observations, the median is the mean of the two middle values. Thus, the median of the numbers 2, 4, 7, 12 is (4 + 7)/2 = 5.5. The 25th sample percentile is the value below which 25% of the observations fall. Similarly, we can define the 50th percentile, the 75th percentile, and so on. Note that the 50th percentile is the median. We call the 25th percentile the lower sample quartile and the 75th percentile the upper sample quartile.

A box is drawn stretching from the lower sample quartile (the 25th percentile) to the upper sample quartile (the 75th percentile). The median is shown as a line across the box. Therefore, 1/4 of the distribution lies between this line and the right end of the box, and 1/4 of the distribution lies between this line and the left end of the box. Dotted lines, called “whiskers,” stretch out from the ends of the box to the largest and smallest data values.
5.4 Outliers
Graphical presentations can be used to identify an “odd-looking” value which does not fit in with the rest of the data. Such a value is called an outlier. In many cases an outlier turns out to be a misrecorded data value, or represents some special condition that was not in effect when the data were collected.
[Figures: histogram and boxplot of the data]
In the above histogram and boxplot, the value at the far right appears to be quite separate from the rest of the data, and can be considered to be an outlier.
6 Tests of statistical hypotheses
Suppose that a researcher is interested in whether a new drug works. The process of determining whether the outcome of the experiment points to “yes” or “no” is called hypothesis testing. A widely used formalization of this process is due to Neyman and Pearson. Our hypothesis is then the null hypothesis that the new drug has no effect. The null hypothesis is often the reverse of what we actually believe. Why? Because the researcher hopes to reject the hypothesis and announce that the new drug leads to significant improvements. If the hypothesis is not rejected, the researcher announces nothing and goes on to a new experiment.
6.1 Hypothesis testing of population mean
Hospital workers are subject to radiation exposure emanating from the skin of patients. A researcher is interested in the plausibility of the statement that the population mean µ of the radiation level is µ0, the researcher's hypothesis. Then the null hypothesis is

H0 : µ = µ0.
The “opposite” of the null hypothesis, called the alternative hypothesis, becomes

HA : µ ≠ µ0.

Thus, the hypothesis testing problem “H0 versus HA” is formed. The problem here is whether or not to reject H0 in favor of HA.
To assess this hypothesis, the radiation levels X1, . . . , Xn are measured from n patients who had been injected with a radioactive tracer, and assumed to be independent and normally distributed with mean µ. Under the null hypothesis, the random variable

T = (X̄ − µ0) / (S/√n)

has the t-distribution with (n − 1) degrees of freedom. Thus, we obtain the exact probability

P( |T| ≥ t_{α/2, n−1} ) = α.

When α is chosen to be a small value (0.05 or 0.01, for example), it is unlikely that the absolute value |T| is larger than the critical point t_{α/2, n−1}. Then we say that the null hypothesis H0 is rejected with significance level α (or, size α) when the observed value t of T satisfies |t| > t_{α/2, n−1}.
Example. We have µ0 = 5.4 for the hypothesis, and decided to give a test with significance level α = 0.05. Suppose that we have obtained X̄ = 5.145 and S = 0.7524 from the actual data with n = 28. Then we can compute

T = (5.145 − 5.4) / (0.7524/√28) ≈ −1.79.

Since |T| = 1.79 ≤ t_{0.025,27} = 2.052, the null hypothesis cannot be rejected. Thus, the evidence against the null hypothesis is not persuasive.
6.2 p-value
The above random variable T is called the t-statistic. Having observed “T = t,” we can calculate the p-value

p* = P(|Y| ≥ |t|) = 2 × P(Y ≥ |t|),

where the random variable Y has a t-distribution with (n − 1) degrees of freedom. Then we have the relation “p* < α ⇔ |t| > t_{α/2, n−1}.” Thus, we reject H0 with significance level α when p* < α. In the above example, we can compute the p-value p* = 2 × P(Y ≥ 1.79) ≈ 0.0847 ≥ 0.05; thus, we cannot reject H0.
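The whole computation for this example fits in a short Python sketch (using SciPy):

```python
# Two-sided t-test from the summary data n = 28, X-bar = 5.145, S = 0.7524.
import numpy as np
from scipy.stats import t

n, xbar, s, mu0, alpha = 28, 5.145, 0.7524, 5.4, 0.05
T = (xbar - mu0) / (s / np.sqrt(n))          # about -1.79
crit = t.ppf(1 - alpha / 2, df=n - 1)        # t_{0.025,27} = 2.052
p_star = 2 * t.sf(abs(T), df=n - 1)          # about 0.0847
print(T, crit, p_star)                       # |T| < crit and p* >= alpha: keep H0
```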
6.3 One-sided hypothesis testing
In the same case of hospital workers subject to radiation exposure, this time the researcher is interested in the plausibility of the statement that the population mean µ is greater than µ0. Then the hypothesis testing problem is

H0 : µ ≥ µ0 versus HA : µ < µ0.

1. The same t-statistic T = (X̄ − µ0)/(S/√n) is used as a test statistic. And we reject H0 with significance level α when we find that t < −t_{α, n−1} for the observed value t of T.
2. Alternatively we can construct the p-value

p* = P(Y ≤ t),

where the random variable Y has a t-distribution with (n − 1) degrees of freedom. Because of the relation “p* < α ⇔ t < −t_{α, n−1},” we can reject H0 with significance level α when p* < α.
Example. We use the same µ0 = 5.4 for the hypothesis and the same significance level α = 0.05, but use the one-sided test. Recall that X̄ = 5.145 and S = 0.7524 were obtained from the data with n = 28.

1. Then we compute

T = (5.145 − 5.4) / (0.7524/√28) ≈ −1.79.

Since T = −1.79 < −t_{0.05,27} = −1.703, the null hypothesis H0 is rejected. Thus, the outcome is statistically significant, so that the population mean µ is smaller than 5.4.

2. Alternatively, we can find the p-value p* = P(Y ≤ −1.79) ≈ 0.0423 < 0.05; thus, the null hypothesis should be rejected.
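The one-sided version of the same computation, as a sketch:

```python
# One-sided t-test (H0: mu >= mu0) from the same summary data.
import numpy as np
from scipy.stats import t

n, xbar, s, mu0, alpha = 28, 5.145, 0.7524, 5.4, 0.05
T = (xbar - mu0) / (s / np.sqrt(n))          # about -1.79
crit = t.ppf(1 - alpha, df=n - 1)            # t_{0.05,27} = 1.703
p_star = t.cdf(T, df=n - 1)                  # P(Y <= t), about 0.0423
print(T < -crit, p_star)                     # True and p* < alpha: reject H0
```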
We can also consider the hypothesis testing problem

H0 : µ ≤ µ0 versus HA : µ > µ0.

1. Using the t-statistic T = (X̄ − µ0)/(S/√n), we can reject H0 with significance level α when the observed value t of T satisfies t > t_{α, n−1}.

2. Alternatively we can construct the p-value p* = P(Y ≥ t), where the random variable Y has a t-distribution with (n − 1) degrees of freedom. Because of the relation “p* < α ⇔ t > t_{α, n−1},” we can reject H0 when p* < α.
6.4 Summary
When the null hypothesis H0 is rejected, it is reasonable to find the confidence interval of the population mean µ. The following table shows the confidence interval we can construct when the null hypothesis is rejected. Here T = (X̄ − µ0)/(S/√n) is the test statistic, and α is the significance level of your choice.

Hypothesis testing | When can we reject H0? | (1 − α)-level confidence interval
H0 : µ = µ0 versus HA : µ ≠ µ0 | |T| > t_{α/2, n−1} | (X̄ − t_{α/2, n−1} S/√n, X̄ + t_{α/2, n−1} S/√n)
H0 : µ ≤ µ0 versus HA : µ > µ0 | T > t_{α, n−1} | (X̄ − t_{α, n−1} S/√n, ∞)
H0 : µ ≥ µ0 versus HA : µ < µ0 | T < −t_{α, n−1} | (−∞, X̄ + t_{α, n−1} S/√n)
6.5 Exercises
1. An experimenter is interested in the hypothesis testing problem

H0 : µ = 3.0 mm versus HA : µ ≠ 3.0 mm,

where µ is the population mean of the thickness of glass sheets. Suppose that a sample of n = 21 glass sheets is obtained and their thicknesses are measured.

(a) For what values of the t-statistic does the experimenter accept the null hypothesis with a size α = 0.10?
(b) For what values of the t-statistic does the experimenter reject the null hypothesis with a size α = 0.01?
Suppose that the sample mean is X̄ = 3.04 mm and the sample standard deviation is S = 0.124 mm. Is the null hypothesis accepted or rejected with α = 0.10? With α = 0.01?
2. A machine is set to cut metal plates to a length of 44.350 mm. The lengths of a random sample of 24 metal plates have a sample mean of X̄ = 44.364 mm and a sample standard deviation of S = 0.019 mm. Is there any evidence that the machine is miscalibrated?
3. An experimenter is interested in the hypothesis testing problem

H0 : µ ≤ 0.065 versus HA : µ > 0.065,

where µ is the population mean of the density of a chemical solution. Suppose that a sample of n = 31 bottles of the chemical solution is obtained and their densities are measured.

(a) For what values of the t-statistic does the experimenter accept the null hypothesis with a size α = 0.10?
(b) For what values of the t-statistic does the experimenter reject the null hypothesis with a size α = 0.01?

Suppose that the sample mean is X̄ = 0.0768 and the sample standard deviation is S = 0.0231. Is the null hypothesis accepted or rejected with α = 0.10? With α = 0.01?
4. A chocolate bar manufacturer claims that at the time of purchase by a consumer the average age of its product is no more than 120 days. In an experiment to test this claim, a random sample of 26 chocolate bars is found to have ages at the time of purchase with a sample mean of X̄ = 122.5 days and a sample standard deviation of S = 13.4 days. With this information, how do you feel about the manufacturer's claim?
7 Power of test
We define a function K(θ) of parameter θ by the probability that H0 is rejected given µ = θ.
K(θ) = P (“Reject H0 ” | µ = θ)
Then K(θ) is called the power function.
7.1 Type I error
What is the probability that we incorrectly reject H0 when it is actually true? Such an error is called a type I error, and the probability of a type I error is exactly the significance level α, as explained in the following:

1. The probability of type I error for the two-sided hypothesis test is given by K(µ0). Then we have K(µ0) = P( |T| ≥ t_{α/2, n−1} ) = α.

2. In a one-sided hypothesis test, the probability of type I error is the worst (that is, largest possible) probability max_{θ≥µ0} K(θ) of type I error. Given µ = θ, the random variable

(X̄ − θ)/(S/√n) = T − (θ − µ0)/(S/√n) = T − δ

has the t-distribution with (n − 1) degrees of freedom, where δ = (θ − µ0)/(S/√n). By observing that δ ≥ 0 if θ ≥ µ0, we obtain

K(θ) = P(T ≤ −t_{α, n−1}) = P(T − δ ≤ −t_{α, n−1} − δ) ≤ P(T − δ ≤ −t_{α, n−1}) = α.

Thus, we obtain max_{θ≥µ0} K(θ) = α.
7.2 Power of test
What is the probability that we incorrectly accept H0 when it is actually false? Such a probability β is called the probability of type II error. Then the value (1 − β) is known as the power of the test, indicating how correctly we can reject H0 when it is actually false. Again, consider the case of hospital workers subject to radiation exposure. Given the current estimate S = s of the standard deviation and the current sample size n = n1, the t-statistic T = (X̄ − µ0)/(S/√n) can be approximated by N(δ, 1) with δ = (µ − µ0)/(s/√n1).
Example. Suppose that the true population mean is µ = 5.1 (versus the value µ0 = 5.4 in our hypotheses). Then we can calculate the power of the test with δ ≈ −2.11 as follows.

1. In the two-sided hypothesis testing, we reject H0 when |T| > t_{0.025,27} = 2.052. Therefore, the power of the test is K(5.1) = P(|T| > 2.052 | µ = 5.1) ≈ 0.523.

2. In the one-sided hypothesis testing, we reject H0 when T < −t_{0.05,27} = −1.703. Therefore, the power of the test is K(5.1) = P(T < −1.703 | µ = 5.1) ≈ 0.658.

This explains why we could not reject H0 in the two-sided hypothesis testing. Our chance of detecting the falsehood of H0 is only 52%, while we have a 66% chance in the one-sided hypothesis testing.
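These power values come from the normal approximation T ≈ N(δ, 1); a sketch of the calculation:

```python
# Power of the two-sided and one-sided tests at mu = 5.1 via T ~ N(delta, 1).
import numpy as np
from scipy.stats import norm, t

n, s, mu0, mu = 28, 0.7524, 5.4, 5.1
delta = (mu - mu0) / (s / np.sqrt(n))                   # about -2.11

crit2 = t.ppf(0.975, df=n - 1)                          # 2.052
power_two = norm.cdf(-crit2, loc=delta) + norm.sf(crit2, loc=delta)

crit1 = t.ppf(0.95, df=n - 1)                           # 1.703
power_one = norm.cdf(-crit1, loc=delta)

print(power_two, power_one)                             # about 0.523 and 0.658
```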
7.3 Effect of sample size
For a fixed significance level α of your choice, the power of the test increases as the sample size n increases. In the two-sided hypothesis testing discussed above, we could recommend collecting additional data to increase the power of the test. But how many additional data do we need? Here is one possible way to calculate a desirable sample size n: In the two-sided hypothesis testing, the power (1 − β) of the test is approximated by

P(|T| > t_{α/2, n−1}) ≈ P(Y < −t_{α/2, n−1} − δ) + P(Y > t_{α/2, n−1} − δ) ≥ P(Y > t_{α/2, n−1} − |δ|)

with a random variable Y having the t-distribution with (n − 1) degrees of freedom. Given the current estimate S = s of the standard deviation and the current sample size n1, we can achieve the power (1 − α/2) of the test by increasing the total sample size n so as to satisfy |δ| ≥ 2 t_{α/2, n1−1}. In the above example of radiation exposure of hospital workers, such a size n can be calculated as

n ≥ ( 2 t_{α/2, n1−1} s / |µ − µ0| )² = ( 2 t_{0.025,27} × 0.7524 / |5.1 − 5.4| )² = 105.9.
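The rule is easy to evaluate; a sketch:

```python
# Sample size so that |delta| >= 2 t_{alpha/2, n1-1}: n >= (2 t s / |mu - mu0|)^2.
import numpy as np
from scipy.stats import t

n1, s, mu0, mu, alpha = 28, 0.7524, 5.4, 5.1, 0.05
crit = t.ppf(1 - alpha / 2, df=n1 - 1)        # t_{0.025,27} = 2.052
n_required = (2 * crit * s / abs(mu - mu0)) ** 2
print(np.ceil(n_required))                    # 106 observations in total
```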
8 Comparison of two populations
We often want to compare two populations on the basis of an experiment. For example, a researcher wants to test the effect of his drug on blood pressure. In any treatment, an improvement could have been due to the placebo effect, when the subject believes that he or she has been given an effective treatment. To protect against such biases, the study should consider (i) the use of a control group in which the subjects are given a placebo, and an experimental group in which the subjects are treated with the new drug, (ii) randomization by assigning the subjects between the control and the experimental groups randomly, and (iii) a double-blind experiment by concealing the nature of the treatment from the subjects and the person taking measurements. Then it becomes the hypothesis testing problem

H0 : µ1 = µ2 versus HA : µ1 ≠ µ2,

where µ1 and µ2 are the respective population means of the control and the experimental groups.
As a result of the experiment, we typically obtain the measurements X1, . . . , Xn of the subjects from the control group, and the measurements Y1, . . . , Ym of the subjects from the experimental group. Then it is usually assumed that X1, . . . , Xn and Y1, . . . , Ym are independent and normally distributed with (µ1, σ1²) and (µ2, σ2²), respectively. Even when they are not normally distributed, large sample sizes (n, m ≥ 30) ensure that the tests are appropriate via the central limit theorem.
8.1 Pooled variance procedure
Let Sx and Sy be the sample standard deviations constructed from X1, . . . , Xn and Y1, . . . , Ym, respectively. When it is reasonable to assume “σ1² = σ2²,” we can construct the pooled sample variance

Sp² = [ (n − 1)Sx² + (m − 1)Sy² ] / (n + m − 2).

The test statistic

T = (X̄ − Ȳ) / ( Sp √(1/n + 1/m) )

has the t-distribution with (n + m − 2) degrees of freedom under the null hypothesis H0. Thus, we reject the null hypothesis H0 with significance level α when the observed value t of T satisfies |t| > t_{α/2, n+m−2}. Or, equivalently, we can compute the p-value

p* = 2 × P(Y ≥ |t|)

with Y having a t-distribution with (n + m − 2) degrees of freedom, and reject H0 when p* < α.
Confidence interval. The following table shows the corresponding confidence interval of the population mean difference µ1 − µ2, when the null hypothesis H0 is rejected.

Hypothesis testing | (1 − α)-level confidence interval
H0 : µ1 = µ2 vs. HA : µ1 ≠ µ2 | (X̄ − Ȳ − t_{α/2, n+m−2} Sp √(1/n + 1/m), X̄ − Ȳ + t_{α/2, n+m−2} Sp √(1/n + 1/m))
H0 : µ1 ≤ µ2 vs. HA : µ1 > µ2 | (X̄ − Ȳ − t_{α, n+m−2} Sp √(1/n + 1/m), ∞)
H0 : µ1 ≥ µ2 vs. HA : µ1 < µ2 | (−∞, X̄ − Ȳ + t_{α, n+m−2} Sp √(1/n + 1/m))
Example. Suppose that we consider the significance level α = 0.01, and that we have obtained X̄ = 80.02 and Sx = 0.024 from the control group of size n = 13, and Ȳ = 79.98 and Sy = 0.031 from the experimental group of size m = 8. Here we have assumed that σ1² = σ2². Then we can compute the square root Sp = 0.027 of the pooled sample variance Sp², and the test statistic

T = (80.02 − 79.98) / ( 0.027 √(1/13 + 1/8) ) ≈ 3.33.

Thus, we can obtain p* = 2 × P(Y ≥ 3.33) ≈ 0.0035 < 0.01, and reject H0. We conclude that the two population means are significantly different. And the 99% confidence interval for the mean difference is (0.006, 0.074).
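A sketch of the pooled procedure from these summary statistics:

```python
# Pooled two-sample t-test and 99% confidence interval for mu1 - mu2.
import numpy as np
from scipy.stats import t

n, xbar, sx = 13, 80.02, 0.024
m, ybar, sy = 8, 79.98, 0.031
alpha = 0.01

sp = np.sqrt(((n - 1) * sx**2 + (m - 1) * sy**2) / (n + m - 2))
T = (xbar - ybar) / (sp * np.sqrt(1 / n + 1 / m))       # about 3.33
p_star = 2 * t.sf(abs(T), df=n + m - 2)                 # about 0.0035

half = t.ppf(1 - alpha / 2, df=n + m - 2) * sp * np.sqrt(1 / n + 1 / m)
print(p_star, (xbar - ybar - half, xbar - ybar + half)) # CI about (0.006, 0.074)
```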
8.2 General procedure
When “σ1² ≠ σ2²,” under the null hypothesis H0 the test statistic

T = (X̄ − Ȳ) / √(Sx²/n + Sy²/m)

has approximately the t-distribution with ν degrees of freedom, where ν is the nearest integer to

(Sx²/n + Sy²/m)² / [ Sx⁴/(n²(n − 1)) + Sy⁴/(m²(m − 1)) ].

Thus, we reject the null hypothesis H0 with significance level α when the observed value t of T satisfies |t| > t_{α/2, ν}. Or, equivalently, we can compute the p-value

p* = 2 × P(Y ≥ |t|)

with Y having a t-distribution with ν degrees of freedom, and reject H0 when p* < α.
Confidence interval. The following table shows the corresponding confidence interval of the population mean difference µ1 − µ2, when the null hypothesis H0 is rejected.

Hypothesis testing | (1 − α)-level confidence interval
H0 : µ1 = µ2 versus HA : µ1 ≠ µ2 | (X̄ − Ȳ − t_{α/2, ν} √(Sx²/n + Sy²/m), X̄ − Ȳ + t_{α/2, ν} √(Sx²/n + Sy²/m))
H0 : µ1 ≤ µ2 versus HA : µ1 > µ2 | (X̄ − Ȳ − t_{α, ν} √(Sx²/n + Sy²/m), ∞)
H0 : µ1 ≥ µ2 versus HA : µ1 < µ2 | (−∞, X̄ − Ȳ + t_{α, ν} √(Sx²/n + Sy²/m))
Example. Suppose that we consider the significance level α = 0.01, and that we have obtained X̄ = 80.02 and Sx = 0.024 from the control group of size n = 13, and Ȳ = 79.98 and Sy = 0.031 from the experimental group of size m = 8 as before. Then the test statistic is T ≈ 3.12, and ν = 12. Thus, we can obtain p* = 2 × P(Y ≥ 3.12) ≈ 0.0089 < 0.01, and still reject H0.
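The same data under the general (unequal variance) procedure, as a sketch:

```python
# Welch-type statistic and approximate degrees of freedom nu.
import numpy as np
from scipy.stats import t

n, xbar, sx = 13, 80.02, 0.024
m, ybar, sy = 8, 79.98, 0.031

se2 = sx**2 / n + sy**2 / m
T = (xbar - ybar) / np.sqrt(se2)                                     # about 3.12
nu = round(se2**2 / (sx**4 / (n**2 * (n - 1)) + sy**4 / (m**2 * (m - 1))))  # 12
p_star = 2 * t.sf(abs(T), df=nu)                                     # about 0.0089
print(T, nu, p_star)
```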
9 Inference on proportions
In experiments on pea breeding, Mendel observed the different kinds of seeds obtained by crosses from plants with round yellow seeds and plants with wrinkled green seeds. Possible types of progeny were: “round yellow”, “wrinkled yellow”, “round green”, and “wrinkled green.” When the data values recorded x1, . . . , xn take several types, or categories, we call them categorical data.
9.1 Point estimate
Let X be the number of observations for a particular type in categorical data of size n, and let p be the population proportion of this type (that is, the probability of occurrence of this type). Then the random variable X has the binomial distribution with parameter (n, p). And the point estimate of the population proportion p is

p̂ = X/n.

We can easily see that

E(p̂) = E(X/n) = (1/n) E(X) = p.

Thus, p̂ is an unbiased estimate of p. Furthermore, recall by the central limit theorem that we have approximately

X ∼ N(np, np(1 − p))

when n is large. Then the point estimate p̂ is approximately distributed as the normal distribution with parameter (p, p(1 − p)/n).
9.2 Hypothesis test
Suppose that the vaccine can be approved for widespread use if it can be established that the probability p of a serious adverse reaction is less than p0. Then the hypothesis testing problem becomes

H0 : p ≥ p0 versus HA : p < p0.    (2)

Let X be the number of participants who suffer an adverse reaction among n participants. Then the random variable X has the binomial distribution with parameter (n, p) and is approximated by the normal distribution with parameter (np, np(1 − p)) when n is large [that is, to satisfy np > 5 and n(1 − p) > 5].

Critical point. The critical point of the standard normal distribution, denoted by zα, is defined as the value satisfying P(Z > zα) = α, where Z is a standard normal random variable. Since the normal distribution is symmetric, it implies that P(Z < −zα) = α.
Testing procedure. When np0 > 5 and n(1 − p0) > 5,

Z = (X − np0) / √(np0(1 − p0))    (3)

is used as the test statistic. Then we can reject H0 in (2) with significance level α if the value z of the test statistic Z satisfies z < −zα. Equivalently, we can construct the p-value p* = Φ(z), and reject H0 when p* < α. Since a continuity correction improves the accuracy, the alternative test statistic

Z = (X − np0 + 0.5) / √(np0(1 − p0))

may also be used.
Confidence interval. When H0 is rejected, we want to further investigate the confidence interval for the population proportion p which corresponds to the result of the hypothesis test. We have the point estimate p̂ = X/n. Then the two different formulas

( 0, [X + zα²/2 + zα √( X(n − X)/n + zα²/4 )] / (n + zα²) )    (4)

( 0, p̂ + zα √(p̂(1 − p̂)/n) )    (5)

can be used for the confidence interval of level (1 − α). Although Formula (4) is known to be more accurate, Formula (5) may be used in most of our problems since it is easier to calculate.
Example. Suppose that p0 = 0.05 is required, and that the significance level α = 0.05 is chosen. And the study shows that X = 4 adverse reactions are found out of n = 155 participants. Note that (0.05)(155) = 7.75 > 5 and (0.95)(155) = 147.25 > 5. Thus, we have

Z = [4 − (155)(0.05) + 0.5] / √((155)(0.05)(0.95)) ≈ −1.20 and p* = Φ(−1.20) ≈ 0.115.

We can also obtain the point estimate p̂ ≈ 0.0258 and the 95% confidence interval (0, 0.0562) by using (4) [we get (0, 0.0467) if we use (5)]. Since p* ≥ 0.05, we cannot reject the null hypothesis. Thus, it is not advisable that the vaccine be approved as the result of this study.
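A sketch of the vaccine example, including both confidence-interval formulas:

```python
# Continuity-corrected test statistic, p-value, and formulas (4) and (5).
import numpy as np
from scipy.stats import norm

n, X, p0, alpha = 155, 4, 0.05, 0.05
z = (X - n * p0 + 0.5) / np.sqrt(n * p0 * (1 - p0))      # about -1.20
p_star = norm.cdf(z)                                     # about 0.115

za = norm.ppf(1 - alpha)                                 # z_{0.05} = 1.645
phat = X / n
upper4 = (X + za**2 / 2 + za * np.sqrt(X * (n - X) / n + za**2 / 4)) / (n + za**2)
upper5 = phat + za * np.sqrt(phat * (1 - phat) / n)
print(p_star, upper4, upper5)                            # 0.115, ~0.0562, ~0.0467
```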
9.3 Sample size calculations
We always control the probability of incorrectly rejecting H0 when H0 is true (a type I error), say, to be less than 5%. But at the same time we may sacrifice the power of detecting the falsehood of H0 when H0 is false, that is, the power of the test. In order for the hypothesis testing problem

H0 : p ≥ p0 versus HA : p < p0

to achieve the power (1 − β) of the test, we need a sample of size

n ≥ ( [zα √(p0(1 − p0)) + zβ √(p(1 − p))] / (p − p0) )².    (6)

In the above vaccine experiment, if the true population proportion p is 0.025, then the power of the test is only 0.18. To increase the power of the test to at least 0.8, we need a sample size of at least n = 464.
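Formula (6) can be evaluated directly, as in the sketch below (inputs match the vaccine setting); continuity-corrected variants of the calculation give somewhat larger sample sizes.

```python
# Minimal sample size by Formula (6) for H0: p >= p0 versus HA: p < p0.
import numpy as np
from scipy.stats import norm

p0, p, alpha, beta = 0.05, 0.025, 0.05, 0.2        # power 1 - beta = 0.8
za, zb = norm.ppf(1 - alpha), norm.ppf(1 - beta)
n = ((za * np.sqrt(p0 * (1 - p0)) + zb * np.sqrt(p * (1 - p))) / (p - p0)) ** 2
print(np.ceil(n))
```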
9.4 Summary
Possible null hypotheses for the inference on a population proportion are “H0 : p = p0”, “H0 : p ≤ p0”, and “H0 : p ≥ p0”. In each case we can use the test statistic Z in (3) if we do not make a continuity correction. Then the corresponding testing procedures are summarized in the following table.

Null hypothesis | When we reject it | (1 − α)-level confidence interval
H0 : p = p0 | |Z| > z_{α/2} | (p̂ − z_{α/2} √(p̂(1 − p̂)/n), p̂ + z_{α/2} √(p̂(1 − p̂)/n))
H0 : p ≤ p0 | Z > zα | (p̂ − zα √(p̂(1 − p̂)/n), 1)
H0 : p ≥ p0 | Z < −zα | (0, p̂ + zα √(p̂(1 − p̂)/n))
For the sample size calculation, use Formula (6) if the null hypothesis is either “H0 : p ≤ p0” or “H0 : p ≥ p0”. When the null hypothesis is “H0 : p = p0”, the sample size n can be computed as

n ≥ ( [z_{α/2} √(p0(1 − p0)) + zβ √(p(1 − p))] / (p − p0) )².
9.5 Comparison of two proportions
A researcher is interested in whether there is discrimination against women in a university. In terms of statistics this is the hypothesis testing problem

H0 : pA ≤ pB versus HA : pA > pB,

where pA and pB are the respective population proportions of men and women who are admitted to the university. The researcher decided to collect the data for a graduate program in the university. Let X and Y be the respective
numbers of men and women who are admitted to the graduate school, which are summarized in the following table:

       | Men   | Women
Admit  | X     | Y
Deny   | n − X | m − Y
Total  | n     | m
The test statistic is given by

Z = (p̂A − p̂B) / √( p̂(1 − p̂)(1/n + 1/m) ),

where p̂A = X/n and p̂B = Y/m are the point estimates of pA and pB, and

p̂ = (X + Y) / (n + m)

is called a pooled estimate of the common population proportion. Under the null hypothesis, the probability that Z > zα becomes approximately less than α. Thus, we reject H0 when the observed value z of Z satisfies z > zα. Or, equivalently, we can reject H0 when p* = 1 − Φ(z) < α.
Confidence interval. We may want to further investigate the confidence interval for the difference pA − pB. Having constructed the hypothesis testing problems “H0 : pA = pB”, “H0 : pA ≤ pB”, or “H0 : pA ≥ pB”, the following table shows the corresponding testing procedure and the confidence interval.

Null hypothesis | When we reject it | (1 − α)-level confidence interval for pA − pB
H0 : pA = pB | |z| > z_{α/2} [that is, p* = 2 × (1 − Φ(|z|)) < α] | (p̂A − p̂B − z_{α/2} √(p̂A(1 − p̂A)/n + p̂B(1 − p̂B)/m), p̂A − p̂B + z_{α/2} √(p̂A(1 − p̂A)/n + p̂B(1 − p̂B)/m))
H0 : pA ≤ pB | z > zα [that is, p* = 1 − Φ(z) < α] | (p̂A − p̂B − zα √(p̂A(1 − p̂A)/n + p̂B(1 − p̂B)/m), 1)
H0 : pA ≥ pB | z < −zα [that is, p* = Φ(z) < α] | (−1, p̂A − p̂B + zα √(p̂A(1 − p̂A)/n + p̂B(1 − p̂B)/m))
Example. The following table classifies the applications for the graduate school according to admission status and sex.

       | Men | Women | Total
Admit  | 97  | 40    | 137
Deny   | 263 | 42    | 305
Total  | 360 | 82    | 442

Then we have p̂A = 97/360 ≈ 0.269, p̂B = 40/82 ≈ 0.488, and p̂ = 137/442 ≈ 0.310. And we can obtain

Z = (0.269 − 0.488) / √( (0.31)(0.69)(1/360 + 1/82) ) ≈ −3.87 and p* = 1 − Φ(−3.87) ≈ 0.9999.

Thus, we cannot reject H0, indicating that there is no discrimination against women in this particular graduate program. In fact, the null hypothesis “H0 : pA ≥ pB” will be rejected in this example.
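A sketch of the two-proportion computation on this admissions table:

```python
# Pooled two-proportion test statistic and one-sided p-values.
import numpy as np
from scipy.stats import norm

X, n = 97, 360     # men: admitted, total
Y, m = 40, 82      # women: admitted, total

pA, pB = X / n, Y / m
p_pool = (X + Y) / (n + m)
Z = (pA - pB) / np.sqrt(p_pool * (1 - p_pool) * (1 / n + 1 / m))   # about -3.87

print(1 - norm.cdf(Z))   # p* for H0: pA <= pB, about 0.9999 (not rejected)
print(norm.cdf(Z))       # p* for H0: pA >= pB, about 0.0001 (rejected)
```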
10 Chi-square test
In the experiment on pea breeding, Mendel observed the different kinds of seeds obtained by crosses from plants with round yellow seeds and plants with wrinkled green seeds. Possible types of progeny were: “round yellow”, “wrinkled yellow”, “round green”, and “wrinkled green.” And Mendel's theory predicted the associated probabilities of occurrence as follows.

              | Round yellow | Wrinkled yellow | Round green | Wrinkled green
Probabilities | 9/16         | 3/16            | 3/16        | 1/16

We want to test whether the data from n observations are consistent with his theory, a goodness of fit test, in which the statement of the null hypothesis becomes “the model is valid.”
10.1 Chi-square test
In general, each observation is classified into one of k categories or “cells,” which results in the cell frequencies X1, . . . , Xk. The goodness of fit to a particular model can be assessed by comparing the observed cell frequencies X1, . . . , Xk with the expected cell frequencies E1, . . . , Ek, which are predicted from the model. The discrepancy between the data and the model can be measured by Pearson's chi-square statistic

χ² = Σ_{i=1}^k (Xi − Ei)² / Ei.

Under the null hypothesis (that is, assuming that the model is correct), the distribution of Pearson's chi-square χ² is approximated by the chi-square distribution with

df = (number of cells) − 1 − (number of parameters in the model)

degrees of freedom. Therefore, if we observe that χ² = x and x > χ²_{α, df}, then we can reject the null hypothesis, casting doubt on the validity of the model. Or, by computing the p-value

p* = P(X > x)

with a random variable X having the chi-square distribution with df degrees of freedom, we can equivalently reject the null hypothesis when p* < α.
Example. In the experiment of pea breeding, we have obtained the data as in the following table.

            | Round yellow | Wrinkled yellow | Round green | Wrinkled green
Frequencies | 315          | 101             | 108         | 32

With the total number of observations n = 556, the expected cell frequencies from Mendel's theory can be calculated as

                     | Round yellow | Wrinkled yellow | Round green | Wrinkled green
Expected frequencies | 312.75       | 104.25          | 104.25      | 34.75

We can compute Pearson's chi-square χ² = 0.47. Since Mendel's model has no parameter, the chi-square distribution has 3 = (4 − 1) degrees of freedom, and we get the p-value p* = 0.925. Thus, there is little reason to doubt Mendel's theory on the basis of Pearson's chi-square test.
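SciPy performs this goodness-of-fit computation directly, as in the sketch below:

```python
# Pearson's chi-square test of Mendel's 9:3:3:1 model.
import numpy as np
from scipy.stats import chisquare

observed = np.array([315, 101, 108, 32])
expected = observed.sum() * np.array([9, 3, 3, 1]) / 16

chi2_stat, p_star = chisquare(observed, f_exp=expected)
print(chi2_stat, p_star)    # about 0.47 and 0.925: little reason to doubt the model
```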
10.2 Test of independence
Consider again the study of discrimination against women in university admission. In the study, there are two characteristics: men or women; admitted or denied. The researcher wanted to know whether such characteristics are linked or independent. For such a study, we take a random sample of size n from the population, which is summarized in the contingency table

       | Men | Women | Total
Admit  | X11 | X12   | X1·
Deny   | X21 | X22   | X2·
Total  | X·1 | X·2   | n = X··
The statement of the null hypothesis becomes “the two characteristics are independent.” Under the null hypothesis, the expected frequencies for the contingency table can be given by

       | Men   | Women | Total
Admit  | np1q1 | np1q2 | np1
Deny   | np2q1 | np2q2 | np2
Total  | nq1   | nq2   | n

The point estimates of p1, p2, q1, and q2 are p̂1 = X1·/n, p̂2 = X2·/n, q̂1 = X·1/n, and q̂2 = X·2/n. With these point estimates, the chi-square statistic is

χ² = Σ_{i=1}^2 Σ_{j=1}^2 (Xij − Xi·X·j/n)² / (Xi·X·j/n),

and the degrees of freedom is (4 − 1 − 2) = 1.
Example. By using the same data as before, we can obtain the chi-square statistic

χ² = [97 − (137)(360)/442]² / [(137)(360)/442] + [40 − (137)(82)/442]² / [(137)(82)/442]
   + [263 − (305)(360)/442]² / [(305)(360)/442] + [42 − (305)(82)/442]² / [(305)(82)/442] ≈ 14.89,

and the p-value p* = 0.0001. Thus, the null hypothesis is rejected at any reasonable level, indicating that the two characteristics are somewhat dependent.
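The same result follows from SciPy's contingency-table routine (with the continuity correction turned off, to match the hand computation):

```python
# Chi-square test of independence on the 2x2 admissions table.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[97, 40],      # admitted: men, women
                  [263, 42]])    # denied:   men, women

chi2_stat, p_star, df, expected = chi2_contingency(table, correction=False)
print(chi2_stat, p_star, df)     # about 14.89, 0.0001, df = 1
```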
11 Minimum variance unbiased estimator
One of the important attributes of a point estimate is unbiasedness. Since a statistic u(X) is a random variable, we can consider the expectation E[u(X)]. Then the point estimate u(X) of θ is called an unbiased estimator if it satisfies E[u(X)] = θ. For example, the sample mean X̄ is an unbiased estimate of the mean µ, since E(X̄) = µ. Furthermore, u(X) is called a minimum variance unbiased estimator if (i) it is unbiased, and (ii) for every other unbiased estimator s(X) for θ we have

R(θ, u) = Var(u(X)) ≤ Var(s(X)) = R(θ, s) for all θ.
11.1 Sufficient statistics
Let f(x; θ) be a joint density function of x = (x1, . . . , xn) for a random vector X = (X1, . . . , Xn). Note that f(x1, . . . , xn; θ) is of the form f(x1; θ) · · · f(xn; θ) if (X1, . . . , Xn) is a random sample. Suppose that s(X) is a statistic having the pdf g(s; θ). Then s(X) is called a sufficient statistic if

f(x; θ) / g(s(x); θ)

is a function of x and does not depend on θ.
Factorization theorem. s(X) is a sufficient statistic if and only if the joint density f(x; θ) can be expressed in the form

f(x; θ) = k(s(x); θ) · h(x).

Furthermore, the pdf g(s; θ) for the statistic s(X) is proportional to k(s; θ), or possibly a multiple c(s) × k(s; θ) with a function c(s) of s.
11.2 Rao-Blackwell’s theorem
Let X = (X1, . . . , Xn) be a random sample from f(x; θ). Suppose that s(X) is a sufficient statistic, and that u(X) is an unbiased estimator for θ. Then we can construct a new statistic

ϕ(s(X)) = E(u(X) | s(X)),

which is a function of the sufficient statistic s(X). Furthermore, we obtain E[ϕ(s(X))] = θ by the law of total expectation, and

Var(ϕ(s(X))) ≤ Var(u(X))

via the conditional variance formula.
11.3 Lehmann-Scheffé’s theorem
Let s(X) be a statistic having the pdf g(s; θ). Then s(X) is called a complete statistic if for any function φ,

E[φ(s(X))] = 0 for all θ

suffices to imply that φ(s(X)) ≡ 0.

Lehmann-Scheffé’s theorem. Suppose that s(X) is a complete and sufficient statistic. Then the statistic

ϕ*(s(X)) = E(u(X) | s(X))

is well-defined with any choice of unbiased statistic u(X). Moreover, ϕ*(s(X)) is the minimum variance unbiased estimator for θ, which is unique among functions of s(X).
11.4 Exponential families
Let A be an interval (or a subset) on R, and let

f(x; θ) = exp[ c(θ)k(x) + h(x) + d(θ) ], x ∈ A; otherwise, f(x; θ) = 0 for all x ∉ A,    (7)

be a probability density function with parameter θ. Here an interval A can be (−∞, ∞), [0, ∞), or [0, 1], or a subset A can be {0, 1, . . .} or {0, . . . , n}, for example; but A should not depend on the parameter θ. Then we say that the pdf f(x; θ) is of a one-parameter exponential family. For example,

1. an exponential density,
2. a normal density with known σ², and
3. a Poisson frequency function

are of the one-parameter exponential family.
Natural sufficient statistics and completeness. Let X1, . . . , Xn be a random sample from (7). Then the joint density

f(x; θ) = exp[ c(θ) Σ_{i=1}^n k(xi) + Σ_{i=1}^n h(xi) + n d(θ) ], x ∈ Aⁿ,

is of the exponential family. By the factorization theorem, the statistic s(X) = Σ_{i=1}^n k(Xi) is sufficient, which we call a natural sufficient statistic. Furthermore, the statistic s(X) has a pdf of the exponential family

f(s; θ) = exp[ c(θ)s + h*(s) + n d(θ) ], s ∈ A*,

and is complete.
12 Efficient estimator
We call a statistic u(X) efficient if u(X) achieves the Cramér-Rao lower bound.
12.1 Fisher information
Let f(x; θ) be a joint density function for a random sample X = (X1, . . . , Xn). Furthermore, we assume that (i) A = {x : f(x; θ) > 0} does not depend on θ, and (ii) (∂/∂θ) ln f(x; θ) exists; we will call the conditions in (i)–(ii) the regularity assumptions. For example, a pdf of an exponential family satisfies the regularity assumptions. By observing that E[(∂/∂θ) ln f(X; θ)] = 0, we can define the Fisher information I(θ) by

I(θ) = E[ ((∂/∂θ) ln f(X; θ))² ] = Var( (∂/∂θ) ln f(X; θ) ).

Now suppose that f(x; θ) is of the form f(x1; θ) · · · f(xn; θ). By setting I1(θ) = Var( (∂/∂θ) ln f(X1; θ) ), we obtain

I(θ) = n × Var( (∂/∂θ) ln f(X1; θ) ) = n I1(θ).

Exercise. Show that I(θ) = −E[ (∂²/∂θ²) ln f(X; θ) ].
12.2 Cramér-Rao lower bound
Let X and Y be random variables, and let a be a real number. Then we have Var(aX − Y) = a² Var(X) − 2a Cov(X, Y) + Var(Y) ≥ 0. By substituting a = Cov(X, Y)/Var(X), we can find the Cauchy-Schwarz inequality

[Cov(X, Y)]² ≤ Var(X) · Var(Y).

Let u(X) be a statistic. Then E[u(X)] is a function of θ, say ψ(θ) = E[u(X)]. Here we can show that

ψ′(θ) = Cov( u(X), (∂/∂θ) ln f(X; θ) ).

By applying the Cauchy-Schwarz inequality we obtain the Cramér-Rao inequality

Var(u(X)) ≥ [ψ′(θ)]² / I(θ).    (8)

When u(X) is an unbiased statistic, the right-hand side of (8) becomes 1/I(θ), and is called the Cramér-Rao lower bound.
If a statistic u(X) achieves the Cramér-Rao lower bound, we call u(X) an efficient estimator. Clearly an efficient and unbiased statistic is a minimum variance unbiased estimator. Now let f(x; θ) be a joint density which satisfies the regularity assumptions, and let u(X) be an unbiased statistic for θ. If the joint density is of the exponential family

f(x; θ) = exp[ c(θ)u(x) + h(x) + d(θ) ], x ∈ A*; otherwise, f(x; θ) = 0 for all x ∉ A*,    (9)

then the natural sufficient statistic u(X) is an efficient estimator. Conversely, if u(X) is an efficient estimator, then the joint density f(x; θ) is given in the form (9).

Remark. The notion of an efficient statistic is somewhat stronger than that of minimum variance. Even worse, there exist minimum variance unbiased estimators which do not achieve their respective Cramér-Rao lower bound.
13 Hypothesis testing
Let θ be a parameter of an underlying probability density function f(x; θ) for a certain population. The hypothesis “H0 : θ = θ0” is called a simple hypothesis, since it completely specifies the underlying distribution. In contrast, the hypothesis “H0 : θ ∈ Θ0” with a set Θ0 of parameters is called a composite hypothesis if the set Θ0 contains more than one element. The “opposite” of the null hypothesis is called the alternative hypothesis, and is similarly expressed as “HA : θ ∈ Θ1”, where Θ1 is another set of parameters such that Θ0 ∩ Θ1 = ∅. The set Θ1 is typically (but not necessarily) chosen to be the complement of Θ0. Thus, the hypothesis testing problem “H0 versus HA” can be formed as

H0 : θ ∈ Θ0 versus HA : θ ∈ Θ1.    (10)

The problem stated above is whether or not to reject H0 in favor of HA.
13.1 Test statistic
Given a random sample X = (X1, . . . , Xn), a function

δ(X) = 1 if H0 is rejected; 0 otherwise,    (11)

is called a test function. Given the test (11), we can define the power function by

K(θ0) = P(“Reject H0” | θ = θ0) = E(δ(X) | θ = θ0).

A typical test, however, is presented in the form “H0 is rejected if T(X) ≥ c.” Here T(X) is called a test statistic, and c is called a critical value. Then the test function can be expressed as

δ(X) = 1 if T(X) ≥ c; 0 otherwise.    (12)

Thus, we obtain K(θ0) = P(T(X) ≥ c | θ = θ0). The probability of type I error (i.e., “H0 is incorrectly rejected when H0 is true”) is defined by

α = sup_{θ0∈Θ0} K(θ0),

which is also known as the size of the test. Having calculated the size α of the test, (11) or (12) is said to be a level α test, or a test with significance level α.
13.2 Uniformly most powerful test
What is the probability that we incorrectly accept H0 when it is actually false? Such a probability β is called the probability of type II error. Then the value (1 − β) is known as the power of the test, indicating how correctly we can reject H0 when it is actually false. Suppose that H0 is in fact false, say θ = θ1 for some θ1 ∈ Θ1. Then the power of the test is calculated by K(θ1).

Suppose that the test (11) has size α. This test is said to be uniformly most powerful if it satisfies

K(θ1) ≥ K′(θ1) for all θ1 ∈ Θ1

for the power function K′ of every other level α test. Furthermore, if this test is given in the form (12) with test statistic T(X), then the test statistic T(X) is said to be optimal.
14 Likelihood ratio test
Consider the testing problem with simple (null and alternative) hypotheses:

H0 : θ = θ0 versus HA : θ = θ1.

Let L(θ; x) = Π_{i=1}^n f(xi; θ) be the likelihood function, and let

L(θ0, θ1; x) = L(θ1, x) / L(θ0, x)

be the likelihood ratio. Then

δ(X) = 1 if L(θ0, θ1; X) ≥ c; 0 otherwise,    (13)

becomes a uniformly most powerful test, and is called the Neyman–Pearson test.

The test function (13) has the following property: For any function ψ(x) satisfying 0 ≤ ψ(x) ≤ 1,

E(ψ(X) | θ = θ1) − E(δ(X) | θ = θ1) ≤ c [E(ψ(X) | θ = θ0) − E(δ(X) | θ = θ0)].
14.1 Monotone likelihood ratio family
Let f(x; θ) be a joint density function with parameter θ, and let L(θ0, θ1; x) be the likelihood ratio. Suppose that T(X) is a statistic and does not depend on the parameter θ. Then f(x; θ) is called a monotone likelihood ratio family in T(X) if

1. f(x; θ0) and f(x; θ1) are distinct for θ0 ≠ θ1;
2. L(θ0, θ1; x) is a non-decreasing function of T(x) whenever θ0 < θ1.

Now consider the following test problem:

H0 : θ ≤ θ0 (or H0 : θ = θ0) versus HA : θ > θ0.    (14)

If f(x; θ) is a monotone likelihood ratio family in T(X), then the test functions (12) and (13) are equivalent whenever θ0 < θ1, and the power function K(θ) for these tests becomes an increasing function. Furthermore, T(X) is an optimal test statistic, and the size of the test is simply given by α = K(θ0).

Suppose that f(x; θ) is of the exponential family f(x; θ) = exp[c(θ)u(x) + h(x) + d(θ)], x ∈ A*, and that c(θ) is a strictly increasing function. Then f(x; θ) is a monotone likelihood ratio family in u(X). And the natural sufficient statistic u(X) becomes an optimal test statistic.

Remark. Essentially, uniformly most powerful tests exist only for the test problem (14).
14.2 Test procedure
The Neyman–Pearson test (13) can be generalized for the composite hypotheses in (10): (i) obtain the maximum likelihood estimate (MLE) θ̂ of θ, (ii) calculate also the MLE θ̂0 restricted to θ ∈ Θ0, and (iii) construct the likelihood ratio

λ(X) = L(θ̂; X) / L(θ̂0; X) = sup_θ L(θ; X) / sup_{θ∈Θ0} L(θ; X) = max{ sup_{θ∈Θ1} L(θ; X) / sup_{θ∈Θ0} L(θ; X), 1 }.

The test statistic λ(X) yields an excellent test procedure in many practical applications, though it is not an optimal test in general.
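As a concrete sketch (Bernoulli data with a simple null, so that the supremum over Θ0 is just L(θ0); the sample is hypothetical):

```python
# Generalized likelihood ratio lambda(X) for H0: theta = theta0 with Bernoulli data.
import numpy as np

def log_likelihood(theta, x):
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

x = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])    # illustrative sample
theta0 = 0.5

theta_hat = x.mean()                            # unrestricted MLE
lam = np.exp(log_likelihood(theta_hat, x) - log_likelihood(theta0, x))
print(lam)        # large lambda(X) is evidence against H0
```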
15 Bayesian theory
Let f(x; θ) be a density function with parameter θ ∈ Ω. In a Bayesian model the parameter space Ω has a distribution π(θ), called a prior distribution. Furthermore, f(x; θ) is viewed as the conditional distribution of X given θ. By Bayes' rule the conditional density π(θ | x) can be derived as

π(θ | x) = π(θ)f(x; θ) / Σ_{θ∈Ω} π(θ)f(x; θ)   if Ω is discrete;
π(θ | x) = π(θ)f(x; θ) / ∫_Ω π(θ)f(x; θ) dθ   if Ω is continuous.

The distribution π(θ | x) is called the posterior distribution. Whether Ω is discrete or continuous, the posterior distribution π(θ | x) is proportional to π(θ)f(x; θ) up to a constant. Thus, we write

π(θ | x) ∝ π(θ)f(x; θ).
15.1 Conjugate family
It is often the case that both the prior density function π(θ) and the posterior density function π(θ | x) belong to the same family of density functions π(θ; η) with parameter η. Then π(θ; η) is called conjugate to f(x; θ). Let

f(x; θ) = exp[ n c0(θ) + Σ_{j=1}^m cj(θ)kj(x) + h(x) ];
π(θ; η0, η1, . . . , ηm) = exp[ c0(θ)η0 + Σ_{j=1}^m cj(θ)ηj + w(η0, η1, . . . , ηm) ].

Suppose that a prior distribution is given by π(θ) = π(θ; η0, η1, . . . , ηm). Then we obtain the posterior density

π(θ | x) = π(θ; η0 + n, η1 + k1(x), . . . , ηm + km(x)).

Thus, the family of π(θ; η0, η1, . . . , ηm) is conjugate to f(x; θ).
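The standard Beta-Bernoulli pair is an instance of this updating rule; a sketch (using SciPy, with made-up data):

```python
# Beta(a, b) is conjugate to Bernoulli sampling: the posterior update is just
# parameter addition, as in the general formula above.
import numpy as np
from scipy.stats import beta

a, b = 2.0, 2.0                       # prior Beta(a, b) for theta
x = np.array([1, 0, 1, 1, 0, 1, 1])   # observed Bernoulli data

a_post = a + x.sum()                  # add the number of successes
b_post = b + len(x) - x.sum()         # add the number of failures
print(beta.mean(a_post, b_post))      # posterior mean of theta (= 7/11 here)
```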
15.2 Decision model
Given a random sample X from f(x; θ), we can introduce a decision function δ(X), and incur a loss l(θ, δ(X)) associated with the state of θ. Together we calculate the risk by

R(θ, δ) = ∫ l(θ, δ(x)) f(x; θ) dx.

If the decision function δ is strictly dominated by no other decision function δ′, that is, if no decision function δ′ satisfies R(θ, δ′) ≤ R(θ, δ) for all θ ∈ Ω with strict inequality for some θ ∈ Ω, then δ is called admissible.

In the Bayesian model where the parameter θ has the prior distribution π(θ), we can define the Bayes risk by

r(δ) = ∫_Ω R(θ, δ) π(θ) dθ.

Then the decision function δ* is called a Bayes solution if δ* minimizes the Bayes risk r(δ). When the parameter space Ω is an interval on R, π(θ) > 0, and R(θ, δ) is continuous at every point θ and every decision function δ, the Bayes solution δ* is admissible.

Having observed X = x, we can compute the posterior density π(θ | x), and construct the posterior risk by

r(δ(x) | x) = ∫_Ω l(θ, δ(x)) π(θ | x) dθ.

If there is a decision function δ0 such that δ0(x) minimizes the posterior risk r(δ(x) | x) for every x, then δ0 is a Bayes solution.