LECTURE NOTES ON STATISTICAL INFERENCE

KRZYSZTOF PODGÓRSKI
Department of Mathematics and Statistics, University of Limerick, Ireland

November 23, 2009

Contents

1 Introduction
1.1 Models of Randomness and Statistical Inference
1.2 Motivating Example
1.2.1 Probability vs. likelihood
1.2.2 More data
1.3 Likelihood and theory of statistics
1.4 Computationally intensive methods of statistics
1.4.1 Monte Carlo methods – studying statistical methods using computer generated random samples
1.4.2 Bootstrap – performing statistical inference using computers

2 Review of Probability
2.1 Expectation and Variance
2.2 Distribution of a Function of a Random Variable
2.3 Transform Methods: Characteristic, Probability Generating and Moment Generating Functions
2.4 Random Vectors
2.4.1 Sums of Independent Random Variables
2.4.2 Covariance and Correlation
2.4.3 The Bivariate Change of Variables Formula
2.5 Discrete Random Variables
2.5.1 Bernoulli Distribution
2.5.2 Binomial Distribution
2.5.3 Negative Binomial and Geometric Distribution
2.5.4 Hypergeometric Distribution
2.5.5 Poisson Distribution
2.5.6 Discrete Uniform Distribution
2.5.7 The Multinomial Distribution
2.6 Continuous Random Variables
2.6.1 Uniform Distribution
2.6.2 Exponential Distribution
2.6.3 Gamma Distribution
2.6.4 Gaussian (Normal) Distribution
2.6.5 Weibull Distribution
2.6.6 Beta Distribution
2.6.7 Chi-square Distribution
2.6.8 The Bivariate Normal Distribution
2.6.9 The Multivariate Normal Distribution
2.7 Distributions – further properties
2.7.1 Sum of Independent Random Variables – special cases
2.7.2 Common Distributions – Summarizing Tables

3 Likelihood
3.1 Maximum Likelihood Estimation
3.2 Multi-parameter Estimation
3.3 The Invariance Principle

4 Estimation
4.1 General properties of estimators
4.2 Minimum-Variance Unbiased Estimation
4.3 Optimality Properties of the MLE
5 The Theory of Confidence Intervals
5.1 Exact Confidence Intervals
5.2 Pivotal Quantities for Use with Normal Data
5.3 Approximate Confidence Intervals

6 The Theory of Hypothesis Testing
6.1 Introduction
6.2 Hypothesis Testing for Normal Data
6.3 Generally Applicable Test Procedures
6.4 The Neyman-Pearson Lemma
6.5 Goodness of Fit Tests
6.6 The χ2 Test for Contingency Tables

Chapter 1 Introduction

"Everything existing in the universe is the fruit of chance."
Democritus, the 5th century BC

1.1 Models of Randomness and Statistical Inference

Statistics is a discipline that provides a methodology for making inferences from real random data about the parameters of the probabilistic models that are believed to generate such data. The position of statistics in relation to real-world data and the corresponding mathematical models of probability theory is presented in the following diagram.

[Figure 1.1: Position of statistics in the context of real world phenomena and mathematical models representing them: Real World → Random Phenomena → Data (Samples) → Statistics → Prediction and Discovery, in parallel with Science & Mathematics → Probability Theory → Models → Statistical Inference.]

The following is a short list of the many phenomena to which randomness is attributed.

• Games of chance
– Tossing a coin
– Rolling a die
– Playing poker
• Natural sciences
– Physics (notably quantum physics)
– Genetics
– Climate
• Engineering
– Risk and safety analysis
– Ocean engineering
• Economics and social sciences
– Currency exchange rates
– Stock market fluctuations
– Insurance claims
– Polls and election results
• etc.

1.2 Motivating Example

Let X denote the number of particles that will be emitted from a radioactive source in the next one-minute period. We know that X will turn out to be equal to one of the non-negative integers but, apart from that, we know nothing about which of the possible values are more or less likely to occur. The quantity X is said to be a random variable.

Suppose we are told that the random variable X has a Poisson distribution with parameter θ = 2. Then, if x is some non-negative integer, we know that the probability that the random variable X takes the value x is given by the formula
$$P(X = x) = \frac{\theta^x \exp(-\theta)}{x!},$$
where θ = 2. So, for instance, the probability that X takes the value x = 4 is
$$P(X = 4) = \frac{2^4 \exp(-2)}{4!} = 0.0902.$$
We have here a probability model for the random variable X. Note that we are using upper case letters for random variables and lower case letters for the values taken by random variables. We shall persist with this convention throughout the course.

Let us still assume that the random variable X has a Poisson distribution with parameter θ, but where θ is some unspecified positive number. Then, if x is some non-negative integer, we know that the probability that the random variable X takes the value x is given by the formula
$$P(X = x\,|\,\theta) = \frac{\theta^x \exp(-\theta)}{x!}, \qquad (1.1)$$
for θ ∈ R+. However, we cannot calculate probabilities such as the probability that X takes the value x = 4 without knowing the value of θ.
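As an aside (an added illustration, not part of the original notes), the dependence of such probabilities on the unknown θ is easy to see numerically; R's dpois evaluates the Poisson probability mass function, and the values of P(X = 4 | θ) below change substantially with θ.

# Probability of observing x = 4 under a Poisson model, for several values of theta
theta <- c(1, 2, 4, 8)
p4 <- dpois(4, lambda = theta)          # theta^4 * exp(-theta) / 4!
round(data.frame(theta = theta, P_X_eq_4 = p4), 4)
# For theta = 2 this reproduces P(X = 4) = 0.0902 computed above.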
Suppose that, in order to learn something about the value of θ, we decide to measure the value of X for each of the next 5 one-minute time periods. Let us use the notation X1 to denote the number of particles emitted in the first period, X2 to denote the number emitted in the second period, and so forth. We shall end up with data consisting of a random vector X = (X1, X2, ..., X5). Consider x = (x1, x2, x3, x4, x5) = (2, 1, 0, 3, 4). Then x is a possible value for the random vector X. We know that the probability that X1 takes the value x1 = 2 is given by the formula
$$P(X_1 = 2\,|\,\theta) = \frac{\theta^2 \exp(-\theta)}{2!}$$
and similarly that the probability that X2 takes the value x2 = 1 is given by
$$P(X_2 = 1\,|\,\theta) = \frac{\theta^1 \exp(-\theta)}{1!}$$
and so on. However, what about the probability that X takes the value x? In order for this probability to be specified we need to know something about the joint distribution of the random variables X1, ..., X5. A simple assumption to make is that the random variables X1, ..., X5 are mutually independent. (Note that this assumption may not be correct, since X2 may tend to be more similar to X1 than it would be to X5.) However, with this assumption we can say that the probability that X takes the value x is given by
$$P(X = x\,|\,\theta) = \prod_{i=1}^{5} \frac{\theta^{x_i}\exp(-\theta)}{x_i!} = \frac{\theta^2 e^{-\theta}}{2!}\cdot\frac{\theta^1 e^{-\theta}}{1!}\cdot\frac{\theta^0 e^{-\theta}}{0!}\cdot\frac{\theta^3 e^{-\theta}}{3!}\cdot\frac{\theta^4 e^{-\theta}}{4!} = \frac{\theta^{10}\exp(-5\theta)}{288}.$$
In general, if x = (x1, x2, x3, x4, x5) is any vector of 5 non-negative integers, then the probability that X takes the value x is given by
$$P(X = x\,|\,\theta) = \prod_{i=1}^{5}\frac{\theta^{x_i}\exp(-\theta)}{x_i!} = \frac{\theta^{\sum_{i=1}^{5}x_i}\exp(-5\theta)}{\prod_{i=1}^{5}x_i!}.$$
We have here a probability model for the random vector X. Our plan is to use the value x of X that we actually observe to learn something about the value of θ. The ways and means to accomplish this task make up the subject matter of this course. The central tool for various statistical inference techniques is the likelihood method. Below we present a simple introduction to it, using the Poisson model for radioactive decay.

1.2.1 Probability vs. likelihood

In the introduced Poisson model, for a given θ, say θ = 2, we can consider the function p(x) that assigns probabilities to the observable values x = 0, 1, 2, .... This function is referred to as the probability mass function. Its graph is presented below.

[Figure 1.2: Probability mass function for the Poisson model with θ = 2.]

Such a function can be used, for example, when betting on the outcome of a future experiment: if one wants to optimize the chances of correctly predicting the number of recorded particles, the choice would be either 1 or 2.

So far, we have been told that the random variable X has a Poisson distribution with parameter θ, where θ is some positive number, and there are physical reasons to assume that such a model is correct. However, we have arbitrarily set θ = 2, and this is more questionable. How can we know that it is a correct value of the parameter? Let us analyze this issue in detail. If x is some non-negative integer, we know that the probability that the random variable X takes the value x is given by the formula
$$P(X = x\,|\,\theta) = \frac{\theta^x e^{-\theta}}{x!},$$
for θ > 0. But without knowing the true value of θ, we cannot calculate probabilities such as the probability that X takes the value x = 1. Suppose that, in order to learn something about the value of θ, an experiment is performed and a value of X = 5 is recorded.
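The probability mass function in Figure 1.2 (and the θ = 7 counterpart discussed next) can be reproduced with a few lines of R; this sketch is an added illustration rather than code from the original notes.

# Poisson probability mass function for theta = 2 (cf. Figure 1.2)
x <- 0:10
barplot(dpois(x, lambda = 2), names.arg = x,
        xlab = "Number of particles", ylab = "Probability")
# The same call with lambda = 7 gives the second panel of Figure 1.3.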
Let us take a look at the probability mass function for θ = 2 in Figure 1.2. What is the probability that X takes the observed value 5? Do we like what we see? Why? Would you bet on 1 or 2 in the next experiment? We certainly have some serious doubt about our choice of θ = 2, which was arbitrary anyway. One can consider, for example, θ = 7 as an alternative to θ = 2. Here are the graphs of the pmf for the two cases.

[Figure 1.3: The probability mass function for the Poisson model with θ = 2 vs. the one with θ = 7.]

Which of the two choices do we like? Since it was more probable to get X = 5 under the assumption θ = 7 than when θ = 2, we say that θ = 7 is more likely to produce X = 5 than θ = 2. Based on this observation we can develop a general strategy for choosing θ.

Let us summarize our position. So far we know (or assume) that the radioactive emission follows a Poisson model with some unknown θ > 0, and the value x = 5 has been observed once. Our goal is somehow to utilize this knowledge. First, we note that the Poisson model is in fact a function not only of x but also of θ:
$$p(x\,|\,\theta) = \frac{\theta^x e^{-\theta}}{x!}.$$
Let us plug in the observed x = 5, so that we get a function of θ that is called the likelihood function,
$$l(\theta) = \frac{\theta^5 e^{-\theta}}{120}.$$
Its graph is presented in the next figure. Can you locate on this graph the values of the probabilities that were used to choose θ = 7 over θ = 2? What value of θ appears to be the most preferable if the same argument is extended to all possible values of θ? We observe that the value θ = 5 is most likely to produce the value x = 5. As a result of our likelihood approach we have used the data x = 5 and the Poisson model to make an inference – an example of statistical inference.

[Figure 1.4: Likelihood function for the Poisson model when the observed value is x = 5.]

Likelihood – the Poisson model backward

The Poisson model can be stated as a probability mass function that maps possible values x into probabilities p(x), or, if we emphasize the dependence on θ, into p(x|θ), given by
$$p(x\,|\,\theta) = l(\theta\,|\,x) = \frac{\theta^x e^{-\theta}}{x!}.$$
• With the Poisson model with given θ one can compute the probabilities that various possible numbers x of emitted particles are recorded, i.e. we consider x ↦ p(x|θ) with θ fixed. We get the answer to how probable the various outcomes x are.
• With the Poisson model where x is observed and thus fixed, one can evaluate how likely it would be to get x under various values of θ, i.e. we consider θ ↦ l(θ|x) with x fixed. We get the answer to how likely it is that various θ could have produced the observed x.

Exercise 1. For the general Poisson model
$$p(x\,|\,\theta) = l(\theta\,|\,x) = \frac{\theta^x e^{-\theta}}{x!},$$
1. for a given θ > 0 find the most probable value of the observation x;
2. for a given observation x find the most likely value of θ.
Give a mathematical argument for your claims.
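The likelihood curve in Figure 1.4 can be reproduced numerically; the following R sketch is an added illustration (the grid of θ values is an arbitrary choice).

# Likelihood of theta for a single Poisson observation x = 5 (cf. Figure 1.4)
theta <- seq(0.01, 15, by = 0.01)
lik <- dpois(5, lambda = theta)          # equals theta^5 * exp(-theta) / 120
plot(theta, lik, type = "l", xlab = "theta", ylab = "Likelihood")
theta[which.max(lik)]                     # the maximum is attained near theta = 5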
1.2.2 More data

Suppose that we perform another measurement of the number of emitted particles. Let us use the notation X1 to denote the number of particles emitted in the first period and X2 to denote the number emitted in the second period. We shall end up with data consisting of a random vector X = (X1, X2). The second measurement yielded x2 = 2, so that x = (x1, x2) = (5, 2). We know that the probability that X1 takes the value x1 = 5 is given by the formula
$$P(X_1 = 5\,|\,\theta) = \frac{\theta^5 e^{-\theta}}{5!}$$
and similarly that the probability that X2 takes the value x2 = 2 is given by
$$P(X_2 = 2\,|\,\theta) = \frac{\theta^2 e^{-\theta}}{2!}.$$
However, what about the probability that X takes the value x = (5, 2)? In order for this probability to be specified we need to know something about the joint distribution of the random variables X1, X2. A simple assumption to make is that the random variables X1, X2 are mutually independent. In such a case the probability that X takes the value x = (x1, x2) is given by
$$P(X = (x_1, x_2)\,|\,\theta) = \frac{\theta^{x_1} e^{-\theta}}{x_1!}\cdot\frac{\theta^{x_2} e^{-\theta}}{x_2!} = \frac{\theta^{x_1+x_2}}{x_1!\,x_2!}\, e^{-2\theta}.$$
After a little algebra we easily find the likelihood function of observing X = (5, 2) to be
$$l(\theta\,|\,(5,2)) = e^{-2\theta}\,\frac{\theta^7}{240},$$
and its graph is presented in Figure 1.5 in comparison with the previous likelihood for a single observation.

[Figure 1.5: Likelihood of observing (5, 2) (top) vs. the one of observing 5 (bottom).]

Two important effects of adding the extra information should be noted:
• We observe that the location of the maximum shifted from 5 to 3.5 compared to the single observation.
• We also note that the range of likely values for θ has diminished.

Let us suppose that eventually we decide to measure three more values of X. Let us use the vector notation X = (X1, X2, ..., X5) to denote the observable random vector. Assume that the three extra measurements yielded 3, 7, 7, so that we have x = (x1, x2, x3, x4, x5) = (5, 2, 3, 7, 7). Under the assumption of independence the probability that X takes the value x is given by
$$P(X = x\,|\,\theta) = \prod_{i=1}^{5}\frac{\theta^{x_i} e^{-\theta}}{x_i!}.$$
The likelihood function of observing X = (5, 2, 3, 7, 7) under independence is easily derived to be
$$l(\theta\,|\,(5,2,3,7,7)) = \frac{\theta^{24}\, e^{-5\theta}}{5!\,2!\,3!\,7!\,7!}.$$
In general, if x = (x1, ..., xn) is any vector of n non-negative integers, then the likelihood is given by
$$l(\theta\,|\,(x_1,\dots,x_n)) = \frac{\theta^{\sum_{i=1}^{n} x_i}\, e^{-n\theta}}{\prod_{i=1}^{n} x_i!}.$$
The value θ̂ that maximizes this likelihood is called the maximum likelihood estimator of θ. In order to find values that effectively maximize the likelihood, the methods of calculus can be employed. We note that in our example we deal with only one variable θ, and the computation of the derivative is rather straightforward.

Exercise 2. For the general case of the likelihood based on the Poisson model,
$$l(\theta\,|\,x_1,\dots,x_n) = \frac{\theta^{\sum_{i=1}^{n} x_i}\, e^{-n\theta}}{\prod_{i=1}^{n} x_i!},$$
using methods of calculus derive a general formula for the maximum likelihood estimator of θ. Using the result, find θ̂ for (x1, x2, x3, x4, x5) = (5, 2, 3, 7, 7).

Exercise 3. It is generally believed that the time X that passes until only half of the original radioactive material remains follows an exponential distribution
$$f(x\,|\,\theta) = \theta e^{-\theta x}, \qquad x > 0.$$
For beryllium-11, five experiments have been performed and the values 13.21, 13.12, 13.95, 13.54, 13.88 seconds have been obtained. Find and plot the likelihood function for θ and, based on this, determine the most likely θ.
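Returning to the Poisson data (5, 2, 3, 7, 7) above, the likelihood can also be examined numerically. The sketch below is an added illustration, intended only as a numerical companion to Exercise 2: it plots the likelihood and locates its maximizer with R's optimize.

# Likelihood for five independent Poisson observations
x <- c(5, 2, 3, 7, 7)
loglik <- function(theta) sum(dpois(x, lambda = theta, log = TRUE))
theta <- seq(0.1, 15, by = 0.01)
plot(theta, exp(sapply(theta, loglik)), type = "l",
     xlab = "theta", ylab = "Likelihood")
optimize(loglik, interval = c(0.1, 15), maximum = TRUE)$maximum  # close to mean(x)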
1.3 Likelihood and theory of statistics

The strategy of making statistical inference based on the likelihood function, as described above, is a recurrent theme in mathematical statistics and thus in our lectures. Using mathematical arguments we will compare various strategies for inferring about parameters, and often we will demonstrate that likelihood-based methods are optimal. The likelihood will also show its strength as a criterion for deciding between various claims about the parameters of a model, which is the leading story of testing hypotheses.

In modern days, the role of computers in statistical methodology has increased. New computationally intensive methods of data exploration have become one of the central areas of modern statistics. Even there, methods that refer to likelihood play dominant roles, in particular in Bayesian methodology. Despite this extensive penetration of statistical methodology by likelihood techniques, by no means can statistics be reduced to the analysis of likelihood. In every area of statistics there are important aspects that require reaching beyond likelihood; in many cases, likelihood is not even a focus of study and development. The purpose of this course is to present the importance of the likelihood approach across statistics, but also to present topics for which likelihood plays a secondary role, if any.

1.4 Computationally intensive methods of statistics

The second part of our presentation of modern statistical inference is devoted to computationally intensive statistical methods. The area of data exploration is rapidly growing in importance due to
• common access to inexpensive but advanced computing tools,
• the emergence of new challenges associated with massive, high-dimensional data far exceeding the traditional assumptions on which classical methods of statistics have been based.
In this introduction we give two examples that illustrate the power of modern computers and computing software, both in the analysis of statistical models and in performing actual statistical inference. We start with analyzing the performance of a statistical procedure using random sample generation.

1.4.1 Monte Carlo methods – studying statistical methods using computer generated random samples

Randomness can be used to study the properties of a mathematical model. The model itself may be probabilistic or not, but here we focus on the probabilistic ones. Essentially, the approach is based on repeated simulation of random samples corresponding to the model and observing the behavior of the objects of interest. An example of a Monte Carlo method is approximating the area of a circle by tossing a point at random (typically computer generated) onto the paper on which the circle is drawn. The percentage of points that fall inside the circle represents (approximately) the percentage of the area covered by the circle, as illustrated in Figure 1.6.

[Figure 1.6: Monte Carlo study of the circle area – the approximation for a sample size of 10000 is 3.1248, which compares to the true value of π = 3.141593.]

Exercise 4. Write R code that would explore the area of an ellipsoid using the Monte Carlo method.

Below we present an application of the Monte Carlo approach to studying fitting methods for the Poisson model.

Deciding for the Poisson model

Recall that the Poisson model is given by
$$P(X = x\,|\,\theta) = \frac{\theta^x e^{-\theta}}{x!}.$$
It is relatively easy to demonstrate that the mean value of this distribution is equal to θ and that its variance is also equal to θ.

Exercise 5. Present a formal argument showing that for a Poisson random variable X with parameter θ, EX = θ and Var X = θ.

Thus for a sample of observations x = (x1, ..., xn) it is reasonable to consider both
$$\hat\theta_1 = \bar{x}, \qquad \hat\theta_2 = \overline{x^2} - \bar{x}^2$$
as estimators of θ. We want to employ the Monte Carlo method to decide which one is better. In the process we generate many samples from the Poisson distribution and check which of the two estimators performs better.
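A minimal R sketch of such a Monte Carlo study is given below (an added illustration; the sample size n = 25 and the number of replications are arbitrary choices, while θ = 4 matches Figure 1.7).

# Monte Carlo comparison of two estimators of theta for Poisson data (theta = 4)
set.seed(1)
R <- 1000; n <- 25; theta <- 4
means <- numeric(R); vars <- numeric(R)
for (r in 1:R) {
  x <- rpois(n, lambda = theta)
  means[r] <- mean(x)                 # theta-hat_1: sample mean
  vars[r]  <- mean(x^2) - mean(x)^2   # theta-hat_2: uncentered sample variance
}
par(mfrow = c(2, 1))
hist(means, main = "Histogram of means")
hist(vars,  main = "Histogram of vars")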
[Figure 1.7: Monte Carlo results of comparing estimation of θ = 4 by the sample mean (left) vs. estimation using the sample variance (right).]

The resulting histograms of the values of the estimators are presented in Figure 1.7. It is quite clear from the graphs that the estimator based on the mean is better than the one based on the variance.

1.4.2 Bootstrap – performing statistical inference using computers

Bootstrap (resampling) methods are one example of Monte Carlo based statistical analysis. The methodology can be summarized as follows:
• Collect a statistical sample, i.e. the same type of data as in classical statistics.
• Use properly chosen Monte Carlo resampling from the data (with a random number generator) to create so-called bootstrap samples.
• Analyze the bootstrap samples to draw conclusions about the random mechanism that produced the original statistical data.
In this way randomness is used to analyze statistical samples that, by the way, are themselves a result of randomness. An example illustrating the approach is presented next.

Estimating nitrate ion concentration

Nitrate ion concentration measurements have been collected in a certain chemical lab and their results are given in the following table. The goal is to estimate, based on these values, the actual nitrate ion concentration.

0.51 0.51 0.51 0.50 0.51 0.49 0.52 0.53 0.50 0.47
0.51 0.52 0.53 0.48 0.49 0.50 0.52 0.49 0.49 0.50
0.49 0.48 0.46 0.49 0.49 0.48 0.49 0.49 0.51 0.47
0.51 0.51 0.51 0.48 0.50 0.47 0.50 0.51 0.49 0.48
0.51 0.50 0.50 0.53 0.52 0.52 0.50 0.50 0.51 0.51

Table 1.1: Results of 50 determinations of nitrate ion concentration in µg per ml.

The overall mean of all observations is 0.4998. It is natural to ask what the error of this determination of the nitrate concentration is. If we were to repeat our experiment of collecting 50 nitrate concentration measurements many times, we would see the range of errors that are made. However, this would be a waste of resources and not a viable method at all. Instead we resample 'new' data from our data and use the samples so obtained to assess the error, comparing the resulting means (bootstrap means) with the original one. The differences between them represent the bootstrap "estimation" errors, and their distribution is viewed as a good representation of the distribution of the true error. In Figure 1.8 we see the bootstrap counterpart of the distribution of the estimation error. Based on this we can safely say that the nitrate concentration is 0.4998 ± 0.005.

[Figure 1.8: Bootstrap estimation error distribution.]

Exercise 6. Consider a sample of the daily number of buyers in a furniture store:
8, 5, 2, 3, 1, 3, 9, 5, 5, 2, 3, 3, 8, 4, 7, 11, 7, 5, 12, 5.
Consider the two estimators of θ for a Poisson distribution as discussed in the previous section. Describe formally the procedure (in steps) for obtaining a bootstrap confidence interval for θ using each of the discussed estimators, and provide 95% bootstrap confidence intervals for each of them.
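The resampling scheme described above can be sketched in a few lines of R (an added illustration; the number of bootstrap samples B is an arbitrary choice, and the data are the 50 determinations of Table 1.1).

# Bootstrap distribution of the estimation error for the mean nitrate concentration
set.seed(1)
nitrate <- c(0.51, 0.51, 0.51, 0.50, 0.51, 0.49, 0.52, 0.53, 0.50, 0.47,
             0.51, 0.52, 0.53, 0.48, 0.49, 0.50, 0.52, 0.49, 0.49, 0.50,
             0.49, 0.48, 0.46, 0.49, 0.49, 0.48, 0.49, 0.49, 0.51, 0.47,
             0.51, 0.51, 0.51, 0.48, 0.50, 0.47, 0.50, 0.51, 0.49, 0.48,
             0.51, 0.50, 0.50, 0.53, 0.52, 0.52, 0.50, 0.50, 0.51, 0.51)
B <- 2000
boot_err <- replicate(B, mean(sample(nitrate, replace = TRUE)) - mean(nitrate))
hist(boot_err, main = "Histogram of bootstrap")   # cf. Figure 1.8
quantile(boot_err, c(0.025, 0.975))               # rough 95% range of the error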
Chapter 2 Review of Probability

2.1 Expectation and Variance

The expected value E[Y] of a random variable Y is defined as
$$E[Y] = \sum_{i=0}^{\infty} y_i P(y_i)$$
if Y is discrete, and
$$E[Y] = \int_{-\infty}^{\infty} y f(y)\,dy$$
if Y is continuous, where f(y) is the probability density function. The variance Var[Y] of a random variable Y is defined as Var[Y] = E(Y − E[Y])², that is,
$$\mathrm{Var}[Y] = \sum_{i=0}^{\infty} (y_i - E[Y])^2 P(y_i)$$
if Y is discrete, and
$$\mathrm{Var}[Y] = \int_{-\infty}^{\infty} (y - E[Y])^2 f(y)\,dy$$
if Y is continuous. When there is no ambiguity we often write EY for E[Y], and Var Y for Var[Y].

A function of a random variable is itself a random variable. If h(Y) is a function of the random variable Y, then the expected value of h(Y) is given by
$$E[h(Y)] = \sum_{i=0}^{\infty} h(y_i) P(y_i)$$
if Y is discrete, and, if Y is continuous,
$$E[h(Y)] = \int_{-\infty}^{\infty} h(y) f(y)\,dy.$$
It is relatively straightforward to derive the following results for the expectation and variance of a linear function of Y:
$$E[aY + b] = aE[Y] + b, \qquad \mathrm{Var}[aY + b] = a^2\,\mathrm{Var}[Y],$$
where a and b are constants. Also
$$\mathrm{Var}[Y] = E[Y^2] - (E[Y])^2. \qquad (2.1)$$
For expectations, it can be shown more generally that
$$E\left[\sum_{i=1}^{k} a_i h_i(Y)\right] = \sum_{i=1}^{k} a_i E[h_i(Y)],$$
where a_i, i = 1, 2, ..., k are constants and h_i(Y), i = 1, 2, ..., k are functions of the random variable Y.

2.2 Distribution of a Function of a Random Variable

If Y is a random variable, then for any regular function g the quantity X = g(Y) is also a random variable. The cumulative distribution function of X is given by
$$F_X(x) = P(X \le x) = P\big(Y \in g^{-1}((-\infty, x])\big).$$
The density function of X, if it exists, can be found by differentiating the right hand side of the above equality.

Example 1. Let Y have a density f_Y and X = Y². Then
$$F_X(x) = P(Y^2 \le x) = P(-\sqrt{x} \le Y \le \sqrt{x}) = F_Y(\sqrt{x}) - F_Y(-\sqrt{x}).$$
By taking the derivative in x we obtain
$$f_X(x) = \frac{1}{2\sqrt{x}}\big(f_Y(\sqrt{x}) + f_Y(-\sqrt{x})\big).$$
If additionally the distribution of Y is symmetric around zero, i.e. f_Y(y) = f_Y(−y), then
$$f_X(x) = \frac{1}{\sqrt{x}}\, f_Y(\sqrt{x}).$$

Exercise 7. Let Z be a random variable with the density $f_Z(z) = e^{-z^2/2}/\sqrt{2\pi}$, the so-called standard normal (Gaussian) random variable. Show that Z² is a Gamma(1/2, 1/2) random variable, i.e. that it has the density
$$\frac{1}{\sqrt{2\pi}}\, x^{-1/2} e^{-x/2}.$$
The distribution of Z² is also called the chi-square distribution with one degree of freedom.

Exercise 8. Let F_Y(y) be the cumulative distribution function of some random variable Y that with probability one takes values in a set R_Y. Assume that there is an inverse function $F_Y^{-1}: [0,1] \to R_Y$, so that $F_Y(F_Y^{-1}(u)) = u$ for u ∈ [0, 1]. Check that for U ∼ Unif(0, 1) the random variable $\tilde{Y} = F_Y^{-1}(U)$ has F_Y as its cumulative distribution function.

The density of g(Y) is particularly easy to express if g is strictly monotone, as shown in the next result.

Theorem 2.2.1. Let Y be a continuous random variable with probability density function f_Y. Suppose that g(y) is a strictly monotone (increasing or decreasing), differentiable (and hence continuous) function of y. The random variable Z defined by Z = g(Y) has probability density function given by
$$f_Z(z) = f_Y\big(g^{-1}(z)\big)\left|\frac{d}{dz}\, g^{-1}(z)\right|, \qquad (2.2)$$
where g^{-1}(z) is defined to be the inverse function of g(y).

Proof. Let g(y) be a monotone increasing (decreasing) function and let F_Y(y) and F_Z(z) denote the probability distribution functions of the random variables Y and Z. Then
$$F_Z(z) = P(Z \le z) = P(g(Y) \le z) = P\big(Y \le (\ge)\, g^{-1}(z)\big) = (1-)\,F_Y(g^{-1}(z)).$$
By the chain rule,
$$f_Z(z) = \frac{d}{dz} F_Z(z) = (-)\frac{d}{dz} F_Y(g^{-1}(z)) = f_Y(g^{-1}(z))\left|\frac{d g^{-1}}{dz}(z)\right|.$$
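A quick numerical check of Theorem 2.2.1 (an added sketch; the particular choice Y ∼ Exp(1) with g(y) = √y is arbitrary): formula (2.2) gives f_Z(z) = 2 z e^{−z²} for z > 0, which should match the histogram of the transformed sample.

# Numerical check of the monotone change-of-variables formula (Theorem 2.2.1)
set.seed(1)
y <- rexp(1e5, rate = 1)          # Y ~ Exp(1), f_Y(y) = exp(-y)
z <- sqrt(y)                      # Z = g(Y) with g(y) = sqrt(y)
hist(z, breaks = 60, freq = FALSE, main = "Z = sqrt(Y)")
curve(2 * x * exp(-x^2), add = TRUE, lwd = 2)   # density from formula (2.2)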
Exercise 9 (The Log-Normal Distribution). Suppose Z is a standard normal random variable and g(z) = e^{az+b}. Then Y = g(Z) is called a log-normal random variable. Demonstrate that the density of Y is given by
$$f_Y(y) = \frac{1}{\sqrt{2\pi a^2}}\, y^{-1}\exp\left(-\frac{\log^2(y/e^{b})}{2a^2}\right).$$

2.3 Transform Methods: Characteristic, Probability Generating and Moment Generating Functions

The probability generating function of a random variable Y is a function denoted by G_Y(t) and defined by
$$G_Y(t) = E(t^Y),$$
for those t ∈ R for which the above expectation is convergent. The expectation defining G_Y(t) converges absolutely if |t| ≤ 1. As the name implies, the p.g.f. generates the probabilities associated with a discrete distribution P(Y = j) = p_j, j = 0, 1, 2, ...:
$$G_Y(0) = p_0, \qquad G'_Y(0) = p_1, \qquad G''_Y(0) = 2!\,p_2.$$
In general the kth derivative of the p.g.f. of Y satisfies $G^{(k)}_Y(0) = k!\,p_k$.

The p.g.f. can be used to calculate the mean and variance of a random variable Y. Note that in the discrete case $G'_Y(t) = \sum_{j=1}^{\infty} j p_j t^{j-1}$ for −1 < t < 1. Let t approach one from the left, t → 1⁻, to obtain
$$G'_Y(1) = \sum_{j=1}^{\infty} j p_j = E(Y) = \mu_Y.$$
The second derivative of G_Y(t) satisfies
$$G''_Y(t) = \sum_{j=1}^{\infty} j(j-1) p_j t^{j-2},$$
and consequently
$$G''_Y(1) = \sum_{j=1}^{\infty} j(j-1) p_j = E(Y^2) - E(Y).$$
The variance of Y therefore satisfies
$$\sigma_Y^2 = E[Y^2] - E[Y] + E[Y] - E^2[Y] = G''_Y(1) + G'_Y(1) - \big(G'_Y(1)\big)^2.$$
The moment generating function (m.g.f.) of a random variable Y is denoted by M_Y(t) and defined as
$$M_Y(t) = E\big(e^{tY}\big),$$
for some t ∈ R. The moment generating function generates the moments E[Y^k]:
$$M_Y(0) = 1, \qquad M'_Y(0) = \mu_Y = E(Y), \qquad M''_Y(0) = E[Y^2],$$
and, in general, $M^{(k)}_Y(0) = E[Y^k]$.

The characteristic function (ch.f.) of a random variable Y is defined by
$$\phi_Y(t) = E\, e^{itY},$$
where i = √−1. A very important result concerning generating functions states that the moment generating function uniquely defines the probability distribution (provided it exists in an open interval around zero). The characteristic function also uniquely defines the probability distribution.

Property 1. If Y has the characteristic function φ_Y(t) and the moment generating function M_Y(t), then for X = a + bY
$$\phi_X(t) = e^{ait}\phi_Y(bt), \qquad M_X(t) = e^{at} M_Y(bt).$$

2.4 Random Vectors

2.4.1 Sums of Independent Random Variables

Suppose that Y_1, Y_2, ..., Y_n are independent random variables. Then the moment generating function of the linear combination $Z = \sum_{i=1}^{n} a_i Y_i$ is the product of the individual moment generating functions:
$$M_Z(t) = E\, e^{t\sum a_i Y_i} = E\, e^{a_1 t Y_1}\, E\, e^{a_2 t Y_2}\cdots E\, e^{a_n t Y_n} = \prod_{i=1}^{n} M_{Y_i}(a_i t).$$
The same argument also gives $\phi_Z(t) = \prod_{i=1}^{n}\phi_{Y_i}(a_i t)$.

When X and Y are discrete random variables, the condition of independence is equivalent to p_{X,Y}(x, y) = p_X(x) p_Y(y) for all x, y. In the jointly continuous case the condition of independence is equivalent to f_{X,Y}(x, y) = f_X(x) f_Y(y) for all x, y.

Consider random variables X and Y with probability densities f_X(x) and f_Y(y) respectively. We seek the probability density of the random variable X + Y. Our general result follows from
$$F_{X+Y}(a) = P(X+Y < a) = \iint_{x+y<a} f_X(x) f_Y(y)\,dx\,dy = \int_{-\infty}^{\infty}\!\int_{-\infty}^{a-y} f_X(x)\,dx\; f_Y(y)\,dy = \int_{-\infty}^{\infty}\!\int_{-\infty}^{a} f_X(z-y)\,dz\; f_Y(y)\,dy = \int_{-\infty}^{a}\left(\int_{-\infty}^{\infty} f_X(z-y) f_Y(y)\,dy\right)dz. \qquad (2.3)$$
Thus the density function is $f_{X+Y}(z) = \int_{-\infty}^{\infty} f_X(z-y) f_Y(y)\,dy$, which is called the convolution of the densities f_X and f_Y.

2.4.2 Covariance and Correlation

Suppose that X and Y are real-valued random variables for some random experiment. The covariance of X and Y is defined by
$$\mathrm{Cov}(X, Y) = E[(X - EX)(Y - EY)]$$
and (assuming the variances are positive) the correlation of X and Y is defined by
$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)}\,\sqrt{\mathrm{Var}(Y)}}.$$
Note that the covariance and correlation always have the same sign (positive, negative, or 0).
When the sign is positive the variables are said to be positively correlated, when the sign is negative the variables are said to be negatively correlated, and when the sign is 0 the variables are said to be uncorrelated. For an intuitive understanding of correlation, suppose that we run the experiment a large number of times and that for each run we plot the values (X, Y) in a scatterplot. The scatterplot for positively correlated variables shows a linear trend with positive slope, while the scatterplot for negatively correlated variables shows a linear trend with negative slope. For uncorrelated variables, the scatterplot should look like an amorphous blob of points with no discernible linear trend.

Property 2. You should satisfy yourself that the following are true:
$$\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y]$$
$$\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$$
$$\mathrm{Cov}(Y, Y) = \mathrm{Var}(Y)$$
$$\mathrm{Cov}(aX + bY + c,\, Z) = a\,\mathrm{Cov}(X, Z) + b\,\mathrm{Cov}(Y, Z)$$
$$\mathrm{Var}\left(\sum_{i=1}^{n} Y_i\right) = \sum_{i,j=1}^{n}\mathrm{Cov}(Y_i, Y_j)$$
If X and Y are independent, then they are uncorrelated. The converse is not true, however.

2.4.3 The Bivariate Change of Variables Formula

Suppose that (X, Y) is a random vector taking values in a subset S of R² with probability density function f. Suppose that U and V are random variables that are functions of X and Y:
$$U = U(X, Y), \qquad V = V(X, Y).$$
If these functions have derivatives, there is a simple way to get the joint probability density function g of (U, V). First, we will assume that the transformation (x, y) ↦ (u, v) is one-to-one and maps S onto a subset T of R². Thus the inverse transformation (u, v) ↦ (x, y) is well defined and maps T onto S. We will assume that the inverse transformation is "smooth", in the sense that the partial derivatives
$$\frac{\partial x}{\partial u}, \quad \frac{\partial x}{\partial v}, \quad \frac{\partial y}{\partial u}, \quad \frac{\partial y}{\partial v}$$
exist on T, and the Jacobian
$$\frac{\partial(x, y)}{\partial(u, v)} = \begin{vmatrix}\partial x/\partial u & \partial x/\partial v\\ \partial y/\partial u & \partial y/\partial v\end{vmatrix} = \frac{\partial x}{\partial u}\frac{\partial y}{\partial v} - \frac{\partial x}{\partial v}\frac{\partial y}{\partial u}$$
is nonzero on T. Now let B be an arbitrary subset of T. The inverse transformation maps B onto a subset A of S. Therefore
$$P((U, V) \in B) = P((X, Y) \in A) = \iint_A f(x, y)\,dx\,dy.$$
But, by the change of variables formula for double integrals, this can be written as
$$P((U, V) \in B) = \iint_B f\big(x(u, v), y(u, v)\big)\left|\frac{\partial(x, y)}{\partial(u, v)}\right| du\,dv.$$
By the very meaning of density, it follows that the probability density function of (U, V) is
$$g(u, v) = f\big(x(u, v), y(u, v)\big)\left|\frac{\partial(x, y)}{\partial(u, v)}\right|, \qquad (u, v) \in T.$$
The change of variables formula generalizes to R^n.

Exercise 10. Let U1 and U2 be independent random variables with density equal to one over [0, 1], i.e. standard uniform random variables. Find the density of the following vector of variables:
$$(Z_1, Z_2) = \left(\sqrt{-2\log U_1}\,\cos(2\pi U_2),\; \sqrt{-2\log U_1}\,\sin(2\pi U_2)\right).$$

2.5 Discrete Random Variables

2.5.1 Bernoulli Distribution

A Bernoulli trial is a probabilistic experiment which can have one of two outcomes, success (Y = 1) or failure (Y = 0), and in which the probability of success is θ. We refer to θ as the Bernoulli probability parameter. The value of the random variable Y is used as an indicator of the outcome, which may also be interpreted as the presence or absence of a particular characteristic. A Bernoulli random variable Y has probability mass function
$$P(Y = y\,|\,\theta) = \theta^y(1-\theta)^{1-y} \qquad (2.4)$$
for y = 0, 1 and some θ ∈ (0, 1). The notation Y ∼ Ber(θ) should be read as "the random variable Y follows a Bernoulli distribution with parameter θ". A Bernoulli random variable Y has expected value
$$E[Y] = 0\cdot P(Y = 0) + 1\cdot P(Y = 1) = 0\cdot(1-\theta) + 1\cdot\theta = \theta$$
and variance
$$\mathrm{Var}[Y] = (0-\theta)^2(1-\theta) + (1-\theta)^2\theta = \theta(1-\theta).$$
2.5.2 Binomial Distribution

Consider independent repetitions of Bernoulli experiments, each with probability of success θ. Next consider the random variable Y, defined as the number of successes in a fixed number n of independent Bernoulli trials. That is,
$$Y = \sum_{i=1}^{n} X_i,$$
where X_i ∼ Bernoulli(θ) for i = 1, ..., n. Each sequence of length n containing y "ones" and (n − y) "zeros" occurs with probability $\theta^y(1-\theta)^{n-y}$. The number of sequences with y successes, and consequently (n − y) failures, is
$$\binom{n}{y} = \frac{n!}{y!(n-y)!}.$$
The random variable Y can take on the values y = 0, 1, 2, ..., n with probabilities
$$P(Y = y\,|\,\theta) = \binom{n}{y}\theta^y(1-\theta)^{n-y}. \qquad (2.5)$$
The notation Y ∼ Bin(n, θ) should be read as "the random variable Y follows a binomial distribution with parameters n and θ." Finally, using the fact that Y is the sum of n independent Bernoulli random variables, we can calculate the expected value as
$$E[Y] = E\left[\sum X_i\right] = \sum E[X_i] = \sum\theta = n\theta$$
and the variance as
$$\mathrm{Var}[Y] = \mathrm{Var}\left[\sum X_i\right] = \sum\mathrm{Var}[X_i] = \sum\theta(1-\theta) = n\theta(1-\theta).$$

2.5.3 Negative Binomial and Geometric Distribution

Instead of fixing the number of trials, suppose now that the number of successes, r, is fixed, and that the sample size required in order to reach this fixed number is the random variable N. This is sometimes called inverse sampling. In the case r = 1, using the independence argument again leads to the geometric distribution
$$P(N = n\,|\,\theta) = \theta(1-\theta)^{n-1}, \qquad n = 1, 2, \dots \qquad (2.6)$$
which is the geometric probability function with parameter θ. The distribution is so named because the successive probabilities form a geometric series. The notation N ∼ Geo(θ) should be read as "the random variable N follows a geometric distribution with parameter θ."

Write (1 − θ) = q. Then
$$E[N] = \sum_{n=1}^{\infty} n q^{n-1}\theta = \theta\sum_{n=1}^{\infty}\frac{d}{dq}(q^n) = \theta\frac{d}{dq}\left(\sum_{n=0}^{\infty} q^n\right) = \theta\frac{d}{dq}\frac{1}{1-q} = \frac{\theta}{(1-q)^2} = \frac{1}{\theta}.$$
Also,
$$E[N^2] = \sum_{n=1}^{\infty} n^2 q^{n-1}\theta = \theta\frac{d}{dq}\left(\sum_{n=1}^{\infty} n q^{n}\right) = \theta\frac{d}{dq}\big(q(1-q)^{-2}\big) = \theta\left(\frac{1}{\theta^2} + \frac{2(1-\theta)}{\theta^3}\right) = \frac{2}{\theta^2} - \frac{1}{\theta}.$$
Using Var[N] = E[N²] − (E[N])², we get Var[N] = (1 − θ)/θ².

Consider now sampling that continues until a total of r successes are observed. Again, let the random variable N denote the number of trials required. If the rth success occurs on the nth trial, then a total of (r − 1) successes are observed by the (n − 1)th trial. The probability of this happening can be calculated using the binomial distribution as
$$\binom{n-1}{r-1}\theta^{r-1}(1-\theta)^{n-r}.$$
The probability that the nth trial is a success is θ. As these two events are independent, we have that
$$P(N = n\,|\,r, \theta) = \binom{n-1}{r-1}\theta^{r}(1-\theta)^{n-r} \qquad (2.7)$$
for n = r, r + 1, .... The notation N ∼ NegBin(r, θ) should be read as "the random variable N follows a negative binomial distribution with parameters r and θ." This is also known as the Pascal distribution. For the kth moment,
$$E[N^k] = \sum_{n=r}^{\infty} n^k\binom{n-1}{r-1}\theta^r(1-\theta)^{n-r} = \frac{r}{\theta}\sum_{n=r}^{\infty} n^{k-1}\binom{n}{r}\theta^{r+1}(1-\theta)^{n-r} \quad\text{since } n\binom{n-1}{r-1} = r\binom{n}{r},$$
$$= \frac{r}{\theta}\sum_{m=r+1}^{\infty}(m-1)^{k-1}\binom{m-1}{r}\theta^{r+1}(1-\theta)^{m-(r+1)} = \frac{r}{\theta}\,E\big[(X-1)^{k-1}\big],$$
where X ∼ NegBin(r + 1, θ). Setting k = 1 we get E(N) = r/θ. Setting k = 2 gives
$$E[N^2] = \frac{r}{\theta}\,E(X-1) = \frac{r}{\theta}\left(\frac{r+1}{\theta} - 1\right).$$
Therefore Var[N] = r(1 − θ)/θ².

2.5.4 Hypergeometric Distribution

The hypergeometric distribution is used to describe sampling without replacement. Consider an urn containing b balls, of which w are white and b − w are red. We intend to draw a sample of size n from the urn. Let Y denote the number of white balls selected.
Then, for y = 0, 1, 2, ..., n, we have
$$P(Y = y\,|\,b, w, n) = \frac{\binom{w}{y}\binom{b-w}{n-y}}{\binom{b}{n}}. \qquad (2.8)$$
The jth moment of a hypergeometric random variable is
$$E[Y^j] = \sum_{y=0}^{n} y^j P(Y = y) = \sum_{y=1}^{n} y^j\,\frac{\binom{w}{y}\binom{b-w}{n-y}}{\binom{b}{n}}.$$
The identities
$$y\binom{w}{y} = w\binom{w-1}{y-1}, \qquad n\binom{b}{n} = b\binom{b-1}{n-1}$$
can be used to obtain
$$E[Y^j] = \frac{nw}{b}\sum_{y=1}^{n} y^{j-1}\,\frac{\binom{w-1}{y-1}\binom{b-w}{n-y}}{\binom{b-1}{n-1}} = \frac{nw}{b}\sum_{x=0}^{n-1}(x+1)^{j-1}\,\frac{\binom{w-1}{x}\binom{b-w}{n-1-x}}{\binom{b-1}{n-1}} = \frac{nw}{b}\,E[(X+1)^{j-1}],$$
where X is a hypergeometric random variable with parameters n − 1, b − 1, w − 1. From this it is easy to establish that E[Y] = nθ and Var[Y] = nθ(1 − θ)(b − n)/(b − 1), where θ = w/b is the fraction of white balls in the population.

2.5.5 Poisson Distribution

Certain problems involve counting the number of events that have occurred in a fixed time period. A random variable Y, taking on one of the values 0, 1, 2, ..., is said to be a Poisson random variable with parameter θ if for some θ > 0,
$$P(Y = y\,|\,\theta) = \frac{\theta^y}{y!}\, e^{-\theta}, \qquad y = 0, 1, 2, \dots \qquad (2.9)$$
The notation Y ∼ Pois(θ) should be read as "the random variable Y follows a Poisson distribution with parameter θ." Equation 2.9 defines a probability mass function, since
$$\sum_{y=0}^{\infty}\frac{\theta^y}{y!}\, e^{-\theta} = e^{-\theta}\sum_{y=0}^{\infty}\frac{\theta^y}{y!} = e^{-\theta} e^{\theta} = 1.$$
The expected value of a Poisson random variable is
$$E[Y] = \sum_{y=0}^{\infty} y\, e^{-\theta}\frac{\theta^y}{y!} = \theta e^{-\theta}\sum_{y=1}^{\infty}\frac{\theta^{y-1}}{(y-1)!} = \theta e^{-\theta}\sum_{j=0}^{\infty}\frac{\theta^{j}}{j!} = \theta.$$
To get the variance we first compute the second moment:
$$E[Y^2] = \sum_{y=0}^{\infty} y^2 e^{-\theta}\frac{\theta^y}{y!} = \theta\sum_{y=1}^{\infty} y\, e^{-\theta}\frac{\theta^{y-1}}{(y-1)!} = \theta\sum_{j=0}^{\infty}(j+1)e^{-\theta}\frac{\theta^{j}}{j!} = \theta(\theta+1).$$
Since we already have E[Y] = θ, we obtain Var[Y] = E[Y²] − (E[Y])² = θ.

Suppose that Y ∼ Binomial(n, p), and let θ = np. Then
$$P(Y = y\,|\,n, p) = \binom{n}{y}p^y(1-p)^{n-y} = \binom{n}{y}\left(\frac{\theta}{n}\right)^y\left(1-\frac{\theta}{n}\right)^{n-y} = \frac{n(n-1)\cdots(n-y+1)}{n^y}\,\frac{\theta^y}{y!}\,\frac{(1-\theta/n)^n}{(1-\theta/n)^y}.$$
For n large and θ "moderate", we have that
$$\left(1-\frac{\theta}{n}\right)^n \approx e^{-\theta}, \qquad \frac{n(n-1)\cdots(n-y+1)}{n^y} \approx 1, \qquad \left(1-\frac{\theta}{n}\right)^y \approx 1.$$
Our result is that a binomial random variable Bin(n, p) is well approximated by a Poisson random variable Pois(θ = np) when n is large and p is small. That is,
$$P(Y = y\,|\,n, p) \approx e^{-np}\frac{(np)^y}{y!}.$$

2.5.6 Discrete Uniform Distribution

The discrete uniform distribution with integer parameter N has a random variable Y that can take the values y = 1, 2, ..., N with equal probability 1/N. It is easy to show that the mean and variance of Y are E[Y] = (N + 1)/2 and Var[Y] = (N² − 1)/12.

2.5.7 The Multinomial Distribution

Suppose that we perform n independent and identical experiments, where each experiment can result in any one of r possible outcomes, with respective probabilities p_1, p_2, ..., p_r, where $\sum_{i=1}^{r} p_i = 1$. If we denote by Y_i the number of the n experiments that result in outcome number i, then
$$P(Y_1 = n_1, Y_2 = n_2, \dots, Y_r = n_r) = \frac{n!}{n_1!\,n_2!\cdots n_r!}\,p_1^{n_1}p_2^{n_2}\cdots p_r^{n_r}, \qquad (2.10)$$
where $\sum_{i=1}^{r} n_i = n$. Equation 2.10 is justified by noting that any sequence of outcomes that leads to outcome i occurring n_i times for i = 1, 2, ..., r will, by the assumption of independence of the experiments, have probability $p_1^{n_1}p_2^{n_2}\cdots p_r^{n_r}$ of occurring. As there are $n!/(n_1!\,n_2!\cdots n_r!)$ such sequences of outcomes, equation 2.10 is established.
2.6 Continuous Random Variables

2.6.1 Uniform Distribution

A random variable Y is said to be uniformly distributed over the interval (a, b) if its probability density function is given by
$$f(y\,|\,a, b) = \frac{1}{b-a}, \qquad a < y < b,$$
and equals 0 for all other values of y. Since $F(u) = \int_{-\infty}^{u} f(y)\,dy$, the distribution function of a uniform random variable on the interval (a, b) is
$$F(u) = \begin{cases}0, & u \le a,\\ (u-a)/(b-a), & a < u \le b,\\ 1, & u > b.\end{cases}$$
The expected value of a uniform random variable turns out to be the mid-point of the interval, that is,
$$E[Y] = \int_{-\infty}^{\infty} y f(y)\,dy = \int_a^b\frac{y}{b-a}\,dy = \frac{b^2-a^2}{2(b-a)} = \frac{b+a}{2}.$$
The second moment is calculated as
$$E[Y^2] = \int_a^b\frac{y^2}{b-a}\,dy = \frac{b^3-a^3}{3(b-a)} = \frac{1}{3}(b^2+ab+a^2),$$
hence the variance is
$$\mathrm{Var}[Y] = E[Y^2] - (E[Y])^2 = \frac{(b-a)^2}{12}.$$
The notation Y ∼ U(a, b) should be read as "the random variable Y follows a uniform distribution on the interval (a, b)".

2.6.2 Exponential Distribution

A random variable Y is said to be an exponential random variable if its probability density function is given by
$$f(y\,|\,\theta) = \theta e^{-\theta y}, \qquad y > 0,\ \theta > 0.$$
The cumulative distribution function of an exponential random variable is given by
$$F(a) = \int_0^a\theta e^{-\theta y}\,dy = -e^{-\theta y}\Big|_0^a = 1 - e^{-\theta a}, \qquad a > 0.$$
The expected value $E[Y] = \int_0^{\infty} y\theta e^{-\theta y}\,dy$ requires integration by parts, yielding
$$E[Y] = -y e^{-\theta y}\Big|_0^{\infty} + \int_0^{\infty} e^{-\theta y}\,dy = -\frac{e^{-\theta y}}{\theta}\Big|_0^{\infty} = \frac{1}{\theta}.$$
Integration by parts can be used to verify that E[Y²] = 2θ⁻². Hence Var[Y] = 1/θ². The notation Y ∼ Exp(θ) should be read as "the random variable Y follows an exponential distribution with parameter θ".

Exercise 11. Let U ∼ U[0, 1]. Find the distribution of Y = − log U. Can you identify it as one of the common distributions?

2.6.3 Gamma Distribution

A random variable Y is said to have a gamma distribution if its density function is given by
$$f(y\,|\,\alpha, \theta) = \theta^{\alpha} e^{-\theta y} y^{\alpha-1}/\Gamma(\alpha), \qquad y > 0,\ \alpha > 0,\ \theta > 0, \qquad (2.11)$$
where Γ(α) is called the gamma function and is defined by
$$\Gamma(\alpha) = \int_0^{\infty} e^{-u} u^{\alpha-1}\,du. \qquad (2.12)$$
Integration by parts of Γ(α) yields the recursive relationship
$$\Gamma(\alpha) = -e^{-u}u^{\alpha-1}\Big|_0^{\infty} + \int_0^{\infty} e^{-u}(\alpha-1)u^{\alpha-2}\,du = (\alpha-1)\int_0^{\infty} e^{-u}u^{\alpha-2}\,du = (\alpha-1)\Gamma(\alpha-1).$$
For integer values α = n, this recursive relationship reduces to Γ(n + 1) = n!. Note that by setting α = 1 the gamma distribution reduces to an exponential distribution. The expected value of a gamma random variable is given by
$$E[Y] = \frac{\theta^{\alpha}}{\Gamma(\alpha)}\int_0^{\infty} y^{\alpha} e^{-\theta y}\,dy = \frac{1}{\theta\,\Gamma(\alpha)}\int_0^{\infty} u^{\alpha} e^{-u}\,du,$$
after the change of variable u = θy. Hence E[Y] = Γ(α + 1)/(Γ(α)θ) = α/θ. Using the same substitution,
$$E[Y^2] = \frac{\theta^{\alpha}}{\Gamma(\alpha)}\int_0^{\infty} y^{\alpha+1} e^{-\theta y}\,dy = \frac{(\alpha+1)\alpha}{\theta^2},$$
so that Var[Y] = α/θ². The notation Y ∼ Gamma(α, θ) should be read as "the random variable Y follows a gamma distribution with parameters α and θ".

Exercise 12. Let Y ∼ Gamma(α, θ). Show that the moment generating function of Y is given, for t ∈ (−θ, θ), by
$$M_Y(t) = \frac{1}{(1-t/\theta)^{\alpha}}.$$

2.6.4 Gaussian (Normal) Distribution

A random variable Z is a standard normal (or Gaussian) random variable if the density of Z is specified by
$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}. \qquad (2.13)$$
It is not immediately obvious that (2.13) specifies a probability density. To show that this is the case we need to prove that
$$\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\,dz = 1,$$
or, equivalently, that $I = \int_{-\infty}^{\infty} e^{-z^2/2}\,dz = \sqrt{2\pi}$. This is a "classic" result and so is well worth confirming. Consider
$$I^2 = \int_{-\infty}^{\infty} e^{-z^2/2}\,dz\int_{-\infty}^{\infty} e^{-w^2/2}\,dw = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(z^2+w^2)/2}\,dz\,dw.$$
The double integral can be evaluated by a change of variables to polar coordinates. Substituting z = r cos θ, w = r sin θ, and dz dw = r dθ dr, we get
$$I^2 = \int_0^{\infty}\int_0^{2\pi} e^{-r^2/2}\, r\,d\theta\,dr = 2\pi\int_0^{\infty} r e^{-r^2/2}\,dr = -2\pi e^{-r^2/2}\Big|_0^{\infty} = 2\pi.$$
Taking the square root we get I = √(2π). The result I = √(2π) can also be used to establish that Γ(1/2) = √π. To prove that this is the case, note that
$$\Gamma(1/2) = \int_0^{\infty} e^{-u} u^{-1/2}\,du = 2\int_0^{\infty} e^{-z^2}\,dz = \sqrt{\pi}.$$
The expected value of Z equals zero because $z e^{-z^2/2}$ is an integrable odd function of z. The variance of Z is given by
$$\mathrm{Var}[Z] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} z^2 e^{-z^2/2}\,dz.$$
Integrating by parts,
$$\mathrm{Var}[Z] = \frac{1}{\sqrt{2\pi}}\left(-z e^{-z^2/2}\Big|_{-\infty}^{\infty} + \int_{-\infty}^{\infty} e^{-z^2/2}\,dz\right) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-z^2/2}\,dz = 1.$$
If Z is a standard normal random variable then Y = µ + σZ is called a general normal (Gaussian) random variable with parameters µ and σ. The density of Y is given by
$$f(y\,|\,\mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}.$$
We have, obviously, E[Y] = µ and Var[Y] = σ². The notation Y ∼ N(µ, σ²) should be read as "the random variable Y follows a normal distribution with mean parameter µ and variance parameter σ²". From the definition of Y it follows immediately that a + bY, where a and b are known constants, again has a normal distribution.

Exercise 13. Let Y ∼ N(µ, σ²). What is the distribution of X = a + bY?

Exercise 14. Let Y ∼ N(µ, σ²). Show that the moment generating function of Y is given by
$$M_Y(t) = e^{\mu t + \sigma^2 t^2/2}.$$
Hint: Consider first the standard normal variable and then apply Property 1.

2.6.5 Weibull Distribution

The Weibull distribution function has the form
$$F(y) = 1 - \exp\left(-\left(\frac{y}{b}\right)^a\right), \qquad y > 0.$$
The Weibull density can be obtained by differentiation as
$$f(y\,|\,a, b) = \frac{a}{b}\left(\frac{y}{b}\right)^{a-1}\exp\left(-\left(\frac{y}{b}\right)^a\right).$$
To calculate the expected value
$$E[Y] = \int_0^{\infty} y\,\frac{a}{b}\left(\frac{y}{b}\right)^{a-1}\exp\left(-\left(\frac{y}{b}\right)^a\right)dy,$$
we use the substitutions u = (y/b)^a and du = a b^{−a} y^{a−1} dy. These yield
$$E[Y] = b\int_0^{\infty} u^{1/a} e^{-u}\,du = b\,\Gamma\left(\frac{a+1}{a}\right).$$
In a similar manner it is straightforward to verify that
$$E[Y^2] = b^2\,\Gamma\left(\frac{a+2}{a}\right),$$
and thus
$$\mathrm{Var}[Y] = b^2\left(\Gamma\left(\frac{a+2}{a}\right) - \Gamma^2\left(\frac{a+1}{a}\right)\right).$$

2.6.6 Beta Distribution

A random variable is said to have a beta distribution if its density is given by
$$f(y\,|\,a, b) = \frac{1}{B(a, b)}\, y^{a-1}(1-y)^{b-1}, \qquad 0 < y < 1.$$
Here the function
$$B(a, b) = \int_0^1 u^{a-1}(1-u)^{b-1}\,du$$
is the "beta" function, and is related to the gamma function through
$$B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}.$$
Proceeding in the usual manner, we can show that
$$E[Y] = \frac{a}{a+b}, \qquad \mathrm{Var}[Y] = \frac{ab}{(a+b)^2(a+b+1)}.$$

2.6.7 Chi-square Distribution

Let Z ∼ N(0, 1), and let Y = Z². Then the cumulative distribution function is
$$F_Y(y) = P(Y \le y) = P(Z^2 \le y) = P(-\sqrt{y} \le Z \le \sqrt{y}) = F_Z(\sqrt{y}) - F_Z(-\sqrt{y}),$$
so that by differentiating in y we arrive at the density
$$f_Y(y) = \frac{1}{2\sqrt{y}}\big[f_Z(\sqrt{y}) + f_Z(-\sqrt{y})\big] = \frac{1}{\sqrt{2\pi y}}\, e^{-y/2},$$
in which we recognize Gamma(1/2, 1/2). Suppose that $Y = \sum_{i=1}^{n} Z_i^2$, where the Z_i ∼ N(0, 1) for i = 1, ..., n are independent. From results on the sum of independent Gamma random variables, Y ∼ Gamma(n/2, 1/2). This density has the form
$$f_Y(y\,|\,n) = \frac{e^{-y/2}\, y^{n/2-1}}{2^{n/2}\Gamma(n/2)}, \qquad y > 0, \qquad (2.14)$$
and is referred to as the chi-squared distribution on n degrees of freedom. The notation Y ∼ Chi(n) should be read as "the random variable Y follows a chi-squared distribution with n degrees of freedom". Later we will show that if X ∼ Chi(u) and Y ∼ Chi(v), then X + Y ∼ Chi(u + v).

2.6.8 The Bivariate Normal Distribution

Suppose that U and V are independent random variables, each with the standard normal distribution. We will need the following parameters: µX, µY, σX > 0, σY > 0, ρ ∈ [−1, 1]. Now let X and Y be new random variables defined by
$$X = \mu_X + \sigma_X U, \qquad Y = \mu_Y + \rho\sigma_Y U + \sigma_Y\sqrt{1-\rho^2}\, V.$$
Using basic properties of the mean, variance, covariance, and the normal distribution, satisfy yourself of the following.

Property 3. The following properties hold:
1. X is normally distributed with mean µX and standard deviation σX,
2. Y is normally distributed with mean µY and standard deviation σY,
3. Corr(X, Y) = ρ,
4. X and Y are independent if and only if ρ = 0.

The inverse transformation is
$$u = \frac{x-\mu_X}{\sigma_X}, \qquad v = \frac{y-\mu_Y}{\sigma_Y\sqrt{1-\rho^2}} - \frac{\rho(x-\mu_X)}{\sigma_X\sqrt{1-\rho^2}},$$
so that the Jacobian of this transformation is
$$\frac{\partial(u, v)}{\partial(x, y)} = \frac{1}{\sigma_X\sigma_Y\sqrt{1-\rho^2}}.$$
Since U and V are independent standard normal variables, their joint probability density function is
$$g(u, v) = \frac{1}{2\pi}\, e^{-\frac{u^2+v^2}{2}}.$$
Using the bivariate change of variables formula, the joint density of (X, Y) is
$$f(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}}\exp\left[-\frac{1}{2(1-\rho^2)}\left(\frac{(x-\mu_X)^2}{\sigma_X^2} - \frac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y} + \frac{(y-\mu_Y)^2}{\sigma_Y^2}\right)\right].$$

Bivariate Normal Conditional Distributions

In the last section we derived the joint probability density function f of the bivariate normal random variables X and Y. The marginal densities are known. Then
$$f_{Y|X}(y\,|\,x) = \frac{f_{Y,X}(y, x)}{f_X(x)} = \frac{1}{\sqrt{2\pi\sigma_Y^2(1-\rho^2)}}\exp\left(-\frac{\big(y - (\mu_Y + \rho\sigma_Y(x-\mu_X)/\sigma_X)\big)^2}{2\sigma_Y^2(1-\rho^2)}\right).$$
Thus the conditional distribution of Y given X = x is also Gaussian, with
$$E(Y\,|\,X = x) = \mu_Y + \rho\sigma_Y\,\frac{x-\mu_X}{\sigma_X}, \qquad \mathrm{Var}(Y\,|\,X = x) = \sigma_Y^2(1-\rho^2).$$

2.6.9 The Multivariate Normal Distribution

Let Σ denote the 2 × 2 symmetric matrix
$$\Sigma = \begin{pmatrix}\sigma_X^2 & \sigma_X\sigma_Y\rho\\ \sigma_Y\sigma_X\rho & \sigma_Y^2\end{pmatrix}.$$
Then
$$\det\Sigma = \sigma_X^2\sigma_Y^2 - (\sigma_X\sigma_Y\rho)^2 = \sigma_X^2\sigma_Y^2(1-\rho^2)$$
and
$$\Sigma^{-1} = \frac{1}{1-\rho^2}\begin{pmatrix}1/\sigma_X^2 & -\rho/(\sigma_X\sigma_Y)\\ -\rho/(\sigma_X\sigma_Y) & 1/\sigma_Y^2\end{pmatrix}.$$
Hence the bivariate normal density of (X, Y) can be written in matrix notation as
$$f_{(X,Y)}(x, y) = \frac{1}{2\pi\sqrt{\det\Sigma}}\exp\left(-\frac{1}{2}\begin{pmatrix}x-\mu_X\\ y-\mu_Y\end{pmatrix}^{T}\Sigma^{-1}\begin{pmatrix}x-\mu_X\\ y-\mu_Y\end{pmatrix}\right).$$
Let Y = (Y_1, ..., Y_p) be a random vector. Let E(Y_i) = µ_i, i = 1, ..., p, and define the p-length vector µ = (µ_1, ..., µ_p). Define the p × p matrix Σ through its elements Cov(Y_i, Y_j) for i, j = 1, ..., p. Then the random vector Y has a p-dimensional multivariate Gaussian distribution if its density function is specified by
$$f_Y(y) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(y-\mu)^{T}\Sigma^{-1}(y-\mu)\right). \qquad (2.15)$$
The notation Y ∼ MVN_p(µ, Σ) should be read as "the random variable Y follows a multivariate Gaussian (normal) distribution with p-vector mean µ and p × p variance-covariance matrix Σ."

2.7 Distributions – further properties

2.7.1 Sum of Independent Random Variables – special cases

Poisson variables

Suppose X ∼ Pois(θ) and Y ∼ Pois(λ). Assume that X and Y are independent. Then
$$P(X+Y = n) = \sum_{k=0}^{n} P(X = k, Y = n-k) = \sum_{k=0}^{n} P(X = k)P(Y = n-k) = \sum_{k=0}^{n} e^{-\theta}\frac{\theta^k}{k!}\, e^{-\lambda}\frac{\lambda^{n-k}}{(n-k)!} = \frac{e^{-(\theta+\lambda)}}{n!}\sum_{k=0}^{n}\frac{n!}{k!(n-k)!}\theta^k\lambda^{n-k} = e^{-(\theta+\lambda)}\frac{(\theta+\lambda)^n}{n!}.$$
That is, X + Y ∼ Pois(θ + λ).

Binomial Random Variables

We seek the distribution of X + Y, where Y ∼ Bin(n, θ) and X ∼ Bin(m, θ). Since X + Y models the situation where the total number of trials is fixed at n + m and the probability of a success in a single trial equals θ, we expect, without performing any calculations, to find that X + Y ∼ Bin(n + m, θ). To verify this, note that X = X_1 + ··· + X_m, where the X_i are independent Bernoulli variables with parameter θ, while Y = Y_1 + ··· + Y_n, where the Y_i are also independent Bernoulli variables with parameter θ. Assuming that the X_i's are independent of the Y_i's, we obtain that X + Y is the sum of n + m independent Bernoulli random variables with parameter θ, i.e. X + Y has a Bin(n + m, θ) distribution.

Gamma, Chi-square, and Exponential Random Variables

Let X ∼ Gamma(α, θ) and Y ∼ Gamma(β, θ) be independent.
Then the moment generating function of X + Y is given by
$$M_{X+Y}(t) = M_X(t)M_Y(t) = \frac{1}{(1-t/\theta)^{\alpha}}\cdot\frac{1}{(1-t/\theta)^{\beta}} = \frac{1}{(1-t/\theta)^{\alpha+\beta}}.$$
But this is the moment generating function of a Gamma random variable distributed as Gamma(α + β, θ). The result that X + Y ∼ Chi(u + v), where X ∼ Chi(u) and Y ∼ Chi(v), follows as a corollary.

Let Y_1, ..., Y_n be n independent exponential random variables, each with parameter θ. Then Z = Y_1 + Y_2 + ··· + Y_n is a Gamma(n, θ) random variable. To see that this is indeed the case, write Y_i ∼ Exp(θ), or alternatively Y_i ∼ Gamma(1, θ). Then Y_1 + Y_2 ∼ Gamma(2, θ), and by induction $\sum_{i=1}^{n} Y_i \sim \mathrm{Gamma}(n, \theta)$.

Gaussian Random Variables

Let X ∼ N(µX, σX²) and Y ∼ N(µY, σY²) be independent. Then the moment generating function of X + Y is given by
$$M_{X+Y}(t) = M_X(t)M_Y(t) = e^{\mu_X t + \sigma_X^2 t^2/2}\, e^{\mu_Y t + \sigma_Y^2 t^2/2} = e^{(\mu_X+\mu_Y)t + (\sigma_X^2+\sigma_Y^2)t^2/2},$$
which proves that X + Y ∼ N(µX + µY, σX² + σY²).
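The first of these results is easy to check by simulation; the following R sketch is an added illustration (θ = 2 and λ = 3 are arbitrary choices): the empirical distribution of X + Y is compared with the Pois(θ + λ) mass function.

# Monte Carlo check that the sum of independent Poissons is Poisson
set.seed(1)
x <- rpois(1e5, lambda = 2)
y <- rpois(1e5, lambda = 3)
s <- x + y
emp <- table(factor(s, levels = 0:15)) / length(s)
round(rbind(empirical = as.numeric(emp), theoretical = dpois(0:15, 5)), 3)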
2.7.2 Common Distributions – Summarizing Tables

Discrete Distributions

Bernoulli(θ)
pmf: $P(Y = y\,|\,\theta) = \theta^y(1-\theta)^{1-y}$, y = 0, 1, 0 ≤ θ ≤ 1
mean/variance: E[Y] = θ, Var[Y] = θ(1 − θ)
mgf: $M_Y(t) = \theta e^t + (1-\theta)$

Binomial(n, θ)
pmf: $P(Y = y\,|\,\theta) = \binom{n}{y}\theta^y(1-\theta)^{n-y}$, y = 0, 1, ..., n, 0 ≤ θ ≤ 1
mean/variance: E[Y] = nθ, Var[Y] = nθ(1 − θ)
mgf: $M_Y(t) = [\theta e^t + (1-\theta)]^n$

Discrete uniform(N)
pmf: P(Y = y | N) = 1/N, y = 1, 2, ..., N
mean/variance: E[Y] = (N + 1)/2, Var[Y] = (N + 1)(N − 1)/12
mgf: $M_Y(t) = \frac{1}{N}\, e^t\,\frac{1-e^{Nt}}{1-e^t}$

Geometric(θ)
pmf: $P(Y = y\,|\,\theta) = \theta(1-\theta)^{y-1}$, y = 1, 2, ..., 0 ≤ θ ≤ 1
mean/variance: E[Y] = 1/θ, Var[Y] = (1 − θ)/θ²
mgf: $M_Y(t) = \theta e^t/[1-(1-\theta)e^t]$, t < − log(1 − θ)
notes: The random variable X = Y − 1 is NegBin(1, θ).

Hypergeometric(b, w, n)
pmf: $P(Y = y\,|\,b, w, n) = \binom{w}{y}\binom{b-w}{n-y}\Big/\binom{b}{n}$, with max(0, n − (b − w)) ≤ y ≤ min(n, w), and b, w, n ≥ 0
mean/variance: E[Y] = nw/b, Var[Y] = nw(b − w)(b − n)/(b²(b − 1))

Negative binomial(r, θ)
pmf: $P(Y = y\,|\,r, \theta) = \binom{r+y-1}{y}\theta^r(1-\theta)^{y}$, y = 0, 1, 2, ..., 0 < θ ≤ 1
mean/variance: E[Y] = r(1 − θ)/θ, Var[Y] = r(1 − θ)/θ²
mgf: $M_Y(t) = \big[\theta/(1-(1-\theta)e^t)\big]^r$, t < − log(1 − θ)
notes: An alternative form of the pmf, used in the derivation in our notes, is $P(N = n\,|\,r, \theta) = \binom{n-1}{r-1}\theta^r(1-\theta)^{n-r}$, n = r, r + 1, ..., where the random variable N = Y + r. The negative binomial can also be derived as a mixture of Poisson random variables.

Poisson(θ)
pmf: $P(Y = y\,|\,\theta) = \theta^y e^{-\theta}/y!$, y = 0, 1, 2, ..., θ > 0
mean/variance: E[Y] = θ, Var[Y] = θ
mgf: $M_Y(t) = e^{\theta(e^t-1)}$

Continuous Distributions

Uniform U(a, b)
pdf: f(y | a, b) = 1/(b − a), a < y < b
mean/variance: E[Y] = (b + a)/2, Var[Y] = (b − a)²/12
mgf: $M_Y(t) = (e^{bt} - e^{at})/((b-a)t)$
notes: A uniform distribution with a = 0 and b = 1 is a special case of the beta distribution (α = β = 1).

Exponential E(θ)
pdf: $f(y\,|\,\theta) = \theta e^{-\theta y}$, y > 0, θ > 0
mean/variance: E[Y] = 1/θ, Var[Y] = 1/θ²
mgf: $M_Y(t) = 1/(1-t/\theta)$
notes: Special case of the gamma distribution. $X = Y^{1/\gamma}$ is Weibull, $X = \sqrt{2\theta Y}$ is Rayleigh, $X = \alpha - \gamma\log(Y/\beta)$ is Gumbel.

Gamma G(λ, θ)
pdf: $f(y\,|\,\lambda, \theta) = \theta^{\lambda} e^{-\theta y} y^{\lambda-1}/\Gamma(\lambda)$, y > 0, λ, θ > 0
mean/variance: E[Y] = λ/θ, Var[Y] = λ/θ²
mgf: $M_Y(t) = 1/(1-t/\theta)^{\lambda}$
notes: Includes the exponential (λ = 1) and chi-squared (λ = n/2, θ = 1/2) distributions.

Normal N(µ, σ²)
pdf: $f(y\,|\,\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y-\mu)^2/(2\sigma^2)}$, σ > 0
mean/variance: E[Y] = µ, Var[Y] = σ²
mgf: $M_Y(t) = e^{\mu t + \sigma^2 t^2/2}$
notes: Often called the Gaussian distribution.

Transforms

The generating functions of the discrete and continuous random variables discussed thus far are given in Table 2.1.

Bi(n, θ): p.g.f. $(\theta t+\bar\theta)^n$; m.g.f. $(\theta e^t+\bar\theta)^n$; ch.f. $(\theta e^{it}+\bar\theta)^n$
Geo(θ): p.g.f. $\theta t/(1-\bar\theta t)$; m.g.f. $\theta/(e^{-t}-\bar\theta)$; ch.f. $\theta/(e^{-it}-\bar\theta)$
NegBin(r, θ): p.g.f. $\theta^r(1-\bar\theta t)^{-r}$; m.g.f. $\theta^r(1-\bar\theta e^t)^{-r}$; ch.f. $\theta^r(1-\bar\theta e^{it})^{-r}$
Poi(θ): p.g.f. $e^{-\theta(1-t)}$; m.g.f. $e^{\theta(e^t-1)}$; ch.f. $e^{\theta(e^{it}-1)}$
Unif(α, β): m.g.f. $e^{\alpha t}(e^{\beta t}-1)/(\beta t)$; ch.f. $e^{i\alpha t}(e^{i\beta t}-1)/(i\beta t)$
Exp(θ): m.g.f. $(1-t/\theta)^{-1}$; ch.f. $(1-it/\theta)^{-1}$
Ga(c, θ): m.g.f. $(1-t/\theta)^{-c}$; ch.f. $(1-it/\theta)^{-c}$
N(µ, σ²): m.g.f. $\exp(\mu t+\sigma^2 t^2/2)$; ch.f. $\exp(i\mu t-\sigma^2 t^2/2)$

Table 2.1: Transforms of distributions. In the formulas θ̄ = 1 − θ.

Chapter 3 Likelihood

3.1 Maximum Likelihood Estimation

Let x be a realization of the random variable X with probability density f_X(x|θ), where θ = (θ_1, θ_2, ..., θ_m)^T is a vector of m unknown parameters to be estimated. The set of allowable values for θ, denoted by Ω, or sometimes by Ω_θ, is called the parameter space. Define the likelihood function
$$l(\theta\,|\,x) = f_X(x\,|\,\theta). \qquad (3.1)$$
It is crucial to stress that the argument of f_X(x|θ) is x, while the argument of l(θ|x) is θ. It is therefore convenient to view the likelihood function l(θ) as the probability of the observed data x considered as a function of θ. Usually it is convenient to work with the natural logarithm of the likelihood, called the log-likelihood and denoted by log l(θ|x).

When θ ∈ R¹ we can define the score function as the first derivative of the log-likelihood,
$$S(\theta) = \frac{\partial}{\partial\theta}\log l(\theta).$$
The maximum likelihood estimate (MLE) θ̂ of θ is the solution of the score equation S(θ) = 0. At the maximum, the second partial derivative of the log-likelihood is negative, so we define the curvature at θ̂ as I(θ̂), where
$$I(\theta) = -\frac{\partial^2}{\partial\theta^2}\log l(\theta).$$
We can check that a solution θ̂ of the equation S(θ) = 0 is actually a maximum by checking that I(θ̂) > 0. A large curvature I(θ̂) is associated with a tight or strong peak, intuitively indicating less uncertainty about θ.

The likelihood function l(θ|x) supplies an order of preference or plausibility among possible values of θ based on the observed x. It ranks the plausibility of possible values of θ by how probable they make the observed x. If P(x|θ = θ_1) > P(x|θ = θ_2), then the observed x makes θ = θ_1 more plausible than θ = θ_2, and consequently, from (3.1), l(θ_1|x) > l(θ_2|x). The likelihood ratio l(θ_1|x)/l(θ_2|x) = f(x|θ_1)/f(x|θ_2) is a measure of the plausibility of θ_1 relative to θ_2 based on the observed x. The relative likelihood l(θ_1|x)/l(θ_2|x) = k means that the observed value x will occur k times more frequently in repeated samples from the population defined by the value θ_1 than from the population defined by θ_2. Since only ratios of likelihoods are meaningful, it is convenient to standardize the likelihood with respect to its maximum.

When the random variables X_1, ..., X_n are mutually independent we can write the joint density as
$$f_X(x) = \prod_{j=1}^{n} f_{X_j}(x_j),$$
where x = (x_1, ..., x_n)^T is a realization of the random vector X = (X_1, ..., X_n)^T, and the likelihood function becomes
$$L_X(\theta\,|\,x) = \prod_{j=1}^{n} f_{X_j}(x_j\,|\,\theta).$$
When the densities f_{X_j}(x_j) are identical, we unambiguously write f(x_j).

Example 2 (Bernoulli Trials). Consider n independent Bernoulli trials. The jth observation is either a "success" or a "failure", coded x_j = 1 and x_j = 0 respectively, and
$$P(X_j = x_j) = \theta^{x_j}(1-\theta)^{1-x_j}$$
for j = 1, ..., n. The vector of observations y = (x_1, x_2, ..., x_n)^T is a sequence of ones and zeros, and is a realization of the random vector Y = (X_1, X_2, ..., X_n)^T.
As the Bernoulli outcomes are assumed to be independent we can write the joint probability mass function of Y as the product of the marginal probabilities, that is l(θ) = = n Y j=1 n Y j=1 P = θ P (Xj = xj ) θxj (1 − θ)1−xj xj P (1 − θ)n− xj = θr (1 − θ)n−r where r = Pn i=1 xj is the number of observed successes (1’s) in the vector y. The log-likelihood function is then log l(θ) = r log θ + (n − r) log(1 − θ), and the score function is S(θ) = r (n − r) ∂ log l(θ) = − . ∂θ θ 1−θ Solving for S(θ̂) = 0 we get θ̂ = r/n. We also have I(θ) = r n−r + >0 θ2 (1 − θ)2 ∀ θ, guaranteeing that θ̂ is the MLE. Each Xi is a Bernoulli random variable and has expected value E(Xi ) = θ, and variance Var(Xi ) = θ(1 − θ). The MLE θ̂(y) is itself a random variable and has expected value Pn n n r 1X 1X i=1 Xi E(θ̂) = E =E = E (Xi ) = θ = θ. n n n i=1 n i=1 If an estimator has on average the value of the parameter that it is intended to estimate than we call it unbiased, i.e. if Eθb = θ. From the above calculation it follows that θ̂(y) is an unbiased estimator of θ. The variance of θ̂(y) is Pn n n 1 X 1 X (1 − θ)θ i=1 Xi Var(θ̂) = Var = 2 Var (Xi ) = 2 (1 − θ)θ = . n n i=1 n i=1 n 2 50 Example 3 (Binomial sampling). The number of successes in n Bernoulli trials is a random variable R taking on values r = 0, 1, . . . , n with probability mass function n r P (R = r) = θ (1 − θ)n−r . r This is the exact same sampling scheme as in the previous example except that instead of observing the sequence y we only observe the total number of successes r. Hence the likelihood function has the form n r LR (θ|r) = θ (1 − θ)n−r . r The relevant mathematical calculations are as follows n log lR (θ|r) = log + r log(θ) + (n − r) log(1 − θ) r r n−r r S (θ) = + ⇒ θ̂ = n 1−θ n n−r r I (θ) = + >0 ∀θ θ2 (1 − θ)2 E(r) nθ E(θ̂) = = =θ ⇒ θ̂ unbiased n n Var(r) nθ(1 − θ) θ(1 − θ) Var(θ̂) = = = . 2 2 n n n 2 Example 4 (Prevalence of a Genotype). Geneticists interested in the prevalence of a certain genotype, observe that the genotype makes its first appearance in the 22nd subject analysed. If we assume that the subjects are independent, the likelihood function can be computed based on the geometric distribution, as l(θ) = (1 − θ)n−1 θ. The score function is then S(θ) = θ−1 − (n − 1)(1 − θ)−1 . Setting S(θ̂) = 0 we get θ̂ = n−1 = 22−1 . Moreover I(θ) = θ−2 + (n − 1)(1 − θ)−2 and is greater than zero for all θ, implying that θ̂ is MLE. Suppose that the geneticists had planned to stop sampling once they observed r = 10 subjects with the specified genotype, and the tenth subject with the genotype was the 100th subject anaylsed overall. The likelihood of θ can be computed based on the negative binomial distribution, as l(θ) = n−1 r θ (1 − θ)n−r r−1 51 2 for n = 100, r = 5. The usual calculation will confirm that θ̂ = r/n is MLE. Example 5 (Radioactive Decay). In this classic set of data Rutherford and Geiger counted the number of scintillations in 72 second intervals caused by radioactive decay of a quantity of the element polonium. Altogether there were 10097 scintillations during 2608 such intervals Count 0 1 2 3 4 5 6 7 Observed 57 203 383 525 532 408 573 139 Count 8 9 10 11 12 13 14 Observed 45 27 10 4 1 0 1 The Poisson probability mass function with mean parameter θ is θx exp (−θ) . x! fX (x|θ) = The likelihood function equals l(θ) = Y θxi exp (−θ) xi ! = θ P xi exp (−nθ) Q . xi ! The relevant mathematical calculations are (Σxi ) log (θ) − nθ − log [Π(xi !)] P xi −n S(θ) = θ P xi ⇒ θ̂ = = x̄ n Σxi > 0, ∀ θ I(θ) = θ2 P P implying θ̂ is MLE. 
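The following short R computation is not part of the original derivation; it only illustrates the result numerically. Using the totals quoted above (10097 scintillations over 2608 intervals), it evaluates the MLE θ̂ = x̄ and the fitted Poisson frequencies, which should agree, up to rounding, with the expected counts compared with the observations below.

total = 10097; n.intervals = 2608                 # totals taken from the text
theta.hat = total / n.intervals                   # MLE = sample mean, about 3.87
theta.hat
i = 0:14
round(n.intervals * dpois(i, lambda = theta.hat)) # fitted frequencies for counts 0,...,14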
Also E(θ̂) = E(xi ) = n1 θ = θ, so θ̂ is an unbiased estimator. P Next Var(θ̂) = n12 Var(xi ) = n1 θ. It is always useful to compare the fitted values log l(θ) = from a model against the observed values. i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Oi 57 203 383 525 532 408 573 139 45 27 10 4 1 0 1 Ei 54 211 407 525 508 393 254 140 68 29 11 4 1 0 0 +3 -8 -24 0 +24 +15 +19 -1 -23 -2 -1 0 -1 +1 +1 The Poisson law agrees with the observed variation within about one-twentieth of its 2 range. 52 Example 6 (Exponential distribution). Suppose random variables X1 , . . . , Xn are i.i.d. as Exp(θ). Then l(θ) = n Y θ exp (−θxi ) i=1 X = θn exp −θ xi X log l(θ) = n log θ − θ xi n S(θ) = θ̂ = I(θ) = ⇒ n X − xi θ i=1 n P xi n >0 ∀ θ. θ2 Exercise 15. Demonstrate that the expectation and variance of θ̂ are given as follows n θ n−1 n2 θ2 . Var[θ̂] = (n − 1)2 (n − 2) E[θ̂] = Hint Find the probability distribution of Z = n P Xi , where Xi ∼ Exp(θ). i=1 n−1 n θ̂. Exercise 16. Propose the alternative estimator θ̃ = Show that θ̃ is unbiased estimator of θ with the variance Var[θ̃] = θ2 . n−2 As this example demonstrates, maximum likelihood estimation does not automatically produce unbiased estimates. If it is thought that this property is (in some sense) desirable, then some adjustments to the MLEs, usually in the form of scaling, may be required. Example 7 (Gaussian Distribution). Consider data X1 , X2 . . . , Xn distributed as N(µ, υ). Then the likelihood function is l(µ, υ) = 1 √ πυ n exp 53 n P (xi − µ)2 − i=1 2υ and the log-likelihood function is log l(µ, υ) = − n n n 1 X log (2π) − log (υ) − (xi − µ)2 2 2 2υ i=1 (3.2) Unknown mean and known variance As υ is known we treat this parameter as a constant when differentiating wrt µ. Then S(µ) = n 1X (xi − µ), υ i=1 n µ̂ = 1X xi , n i=1 and I(θ) = n > 0 ∀ µ. υ Also, E[µ̂] = nµ/n = µ, and so the MLE of µ is unbiased. Finally " n # X 1 υ −1 Var[µ̂] = 2 Var xi = = (E[I(θ)]) . n n i=1 Known mean and unknown variance Differentiating (3.2) wrt υ returns S(υ) = − n n 1 X + 2 (xi − µ)2 , 2υ 2υ i=1 and setting S(υ) = 0 implies n υ̂ = 1X (xi − µ)2 . n i=1 Differentiating again, and multiplying by −1 yields I(υ) = − n 1 X n (xi − µ)2 . + 2υ 2 υ 3 i=1 Clearly υ̂ is the MLE since I(υ̂) = n > 0. 2υ 2 Define √ Zi = (Xi − µ)2 / υ, so that Zi ∼ N(0, 1). From the appendix on probability n X Zi2 ∼ χ2n , i=1 implying E[ P P Zi2 ] = n, and Var[ Zi2 ] = 2n. The MLE υ̂ = (υ/n) n X i=1 54 Zi2 . Then " # n υX 2 E[υ̂] = E Z = υ, n i=1 i and Var[υ̂] = υ 2 n Var " n X # Zi2 = i=1 2υ 2 . n 2 Our treatment of the two parameters of the Gaussian distribution in the last example was to (i) fix the variance and estimate the mean using maximum likelihood; and then (ii) fix the mean and estimate the variance using maximum likelihood. In practice we would like to consider the simultaneous estimation of these parameters. In the next section of these notes we extend MLE to multiple parameter estimation. 3.2 Multi-parameter Estimation Suppose that a statistical model specifies that the data y has a probability distribution f (y; α, β) depending on two unknown parameters α and β. In this case the likelihood function is a function of the two variables α and β and having observed the value y is defined as l(α, β) = f (y; α, β) with log l(α, β) = log l(α, β). The MLE of (α, β) is a value (α̂, β̂) for which l(α, β) , or equivalently log l(α, β) , attains its maximum value. Define S1 (α, β) = ∂ log l/∂α and S2 (α, β) = ∂ log l/∂β. 
The MLEs (α̂, β̂) can be obtained by solving the pair of simultaneous equations S1 (α, β) = 0 S2 (α, β) = 0 Let us consider the matrix I(α, β) I11 (α, β) I12 (α, β) = − I(α, β) = I21 (α, β) I22 (α, β) ∂2 ∂α2 ∂2 ∂β∂α log l log l ∂2 ∂α∂β ∂2 ∂β 2 log l log l The conditions for a value (α0 , β0 ) satisfying S1 (α0 , β0 ) = 0 and S2 (α0 , β0 ) = 0 to be a MLE are that I11 (α0 , β0 ) > 0, I22 (α0 , β0 ) > 0, 55 and det(I(α0 , β0 ) = I11 (α0 , β0 )I22 (α0 , β0 ) − I12 (α0 , β0 )2 > 0. This is equivalent to requiring that both eigenvalues of the matrix I(α0 , β0 ) be positive. Example 8 (Gaussian distribution). Let X1 , X2 . . . , Xn be iid observations from a N (µ, σ 2 ) density in which both µ and σ 2 are unknown. The log likelihood is log l(µ, σ 2 ) n X 1 1 exp [− 2 (xi − µ)2 ] log √ 2σ 2πσ 2 i=1 n X 1 1 1 = − log [2π] − log [σ 2 ] − 2 (xi − µ)2 2 2 2σ i=1 = = − n n n 1 X log [2π] − log [σ 2 ] − 2 (xi − µ)2 . 2 2 2σ i=1 Hence for v = σ 2 n 1X ∂ log l = (xi − µ) = 0 S1 (µ, v) = ∂µ v i=1 which implies that n µ̂ = Also S2 (µ, v) = 1X xi = x̄. n i=1 (3.3) n ∂ log l n 1 X =− + 2 (xi − µ)2 = 0 ∂v 2v 2v i=1 implies that n σ̂ 2 = v̂ = n 1X 1X (xi − µ̂)2 = (xi − x̄)2 . n i=1 n i=1 Calculating second derivatives and multiplying by −1 gives that I(µ, v) equals n P n 1 (x − µ) i v v2 i=1 I(µ, v) = n n P P 1 n 1 2 (x − µ) − + (x − µ) i i v2 2v 2 v3 i=1 i=1 Hence I(µ̂, v̂) is given by n v̂ 0 0 n 2v 2 56 (3.4) Clearly both diagonal terms are positive and the determinant is positive and so (µ̂, v̂) are, indeed, the MLEs of (µ, v). Go back to equation (3.3), and X̄ ∼ N (µ, v/n). Clearly E(X̄) = µ (unbiased) and Var(X̄) = v/n. Go back to equation (3.4). Then from Lemma 1 that is proven below we have nv̂ ∼ χ2n−1 v so that E ⇒ nv̂ v n−1 n−1 v n = = E(v̂) Instead, propose the (unbiased) estimator of σ 2 n S 2 = ṽ = n 1 X v̂ = (xi − x̄)2 n−1 n − 1 i=1 (3.5) Observe that E(ṽ) = n n−1 E(v̂) = n n−1 n−1 n v = v and ṽ is unbiased as suggested. We can easily show that Var(ṽ) = 2v 2 (n − 1) 2 Lemma 1 (Joint distribution of the sample mean and sample variance). If X1 , . . . , Xn are iid N (µ, v) then the sample mean X̄ and sample variance S 2 are independent. Also X̄ is distributed N (µ, v/n) and (n − 1)S 2 /v is a chi-squared random variable with n − 1 degrees of freedom. Proof. Define W = n X (Xi − X̄)2 = i=1 ⇒ W (X̄ − µ) + v v/n n X (Xi − µ)2 − n(X̄ − µ)2 i=1 2 = n X (Xi − µ)2 i=1 57 v The RHS is the sum of n independent standard normal random variables squared, and so is distributed χ2n . Also, X̄ ∼ N (µ, v/n), therefore (X̄ − µ)2 /(v/n) is the square of a standard normal and so is distributed χ21 These Chi-Squared random variables have moment generating functions (1 − 2t)−n/2 and (1 − 2t)−1/2 respectively. Next, W/v and (X̄ − µ)2 /(v/n) are independent Cov(Xi − X̄, X̄) = = = = Cov(Xi , X̄) − Cov(X̄, X̄) 1X Cov Xi , Xj − Var(X̄) n v 1X Cov(Xi , Xj ) − n j n v v − = 0 n n But, Cov(Xi − X̄, X̄ − µ) = Cov(Xi − X̄, X̄) = 0 , hence X ! X Cov(Xi − X̄, X̄ − µ) = Cov (Xi − X̄), X̄ − µ = 0 i i As the moment generating function of the sum of independent random variables is equal to the product of their individual moment generating functions, we see h i E et(W/v) (1 − 2t)−1/2 h i ⇒ E et(W/v) = (1 − 2t)−n/2 = (1 − 2t)−(n−1)/2 But (1 − 2t)−(n−1)/2 is the moment generating function of a χ2 random variables with (n−1) degrees of freedom, and the moment generating function uniquely characterizes the random variable S = (W/v). 
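Lemma 1 can also be checked empirically. The following R sketch (parameter values and number of replications are arbitrary illustrative choices, not taken from the notes) simulates many normal samples and confirms that X̄ behaves like a N(μ, v/n) variable, that (n − 1)S²/v behaves like a chi-squared variable with n − 1 degrees of freedom, and that X̄ and S² are uncorrelated, as their independence requires.

set.seed(1)
n = 10; mu = 15; v = 36; nsim = 20000             # illustrative choices
samples = matrix(rnorm(nsim * n, mean = mu, sd = sqrt(v)), nrow = nsim)
xbar = rowMeans(samples)
s2 = apply(samples, 1, var)                       # sample variance with divisor n - 1
c(mean(xbar), mu, var(xbar), v/n)                 # X-bar: moments compared with N(mu, v/n)
ks.test((n - 1) * s2 / v, "pchisq", df = n - 1)   # compared with chi-squared(n - 1)
cor(xbar, s2)                                     # close to zero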
Suppose that a statistical model specifies that the data x has a probability distribution f (x; θ) depending on a vector of m unknown parameters θ = (θ1 , . . . , θm ). In this case the likelihood function is a function of the m parameters θ1 , . . . , θm and having observed the value of x is defined as l(θ) = f (x; θ) with log l(θ) = log l(θ). The MLE of θ is a value θ̂ for which l(θ), or equivalently log l(θ), attains its maximum value. For r = 1, . . . , m define Sr (θ) = ∂ log l/∂θr . Then we can (usually) find the MLE θ̂ by solving the set of m simultaneous equations Sr (θ) = 0 for r = 58 1, . . . , m. The matrix I(θ) is defined to be the m × m matrix whose (r, s) element is given by Irs where Irs = −∂ 2 log l/∂θr ∂θs . The conditions for a value θ̂ satisfying Sr (θ̂) = 0 for r = 1, . . . , m to be a MLE are that all the eigenvalues of the matrix I(θ̂) are positive. 3.3 The Invariance Principle How do we deal with parameter transformation? We will assume a one-to-one transformation, but the idea applied generally. Consider a binomial sample with n = 10 independent trials resulting in data x = 8 successes. The likelihood ratio of θ1 = 0.8 versus θ2 = 0.3 is θ8 (1 − θ1 )2 l(θ1 = 0.8) = 18 = 208.7 , l(θ2 = 0.3) θ2 (1 − θ2 )2 that is, given the data θ = 0.8 is about 200 times more likely than θ = 0.3. Suppose we are interested in expressing θ on the logit scale as ψ ≡ log{θ/(1 − θ)} , then ‘intuitively’ our relative information about ψ1 = log(0.8/0.2) = 1.29 versus ψ2 = log(0.3/0.7) = −0.85 should be L∗ (ψ1 ) l(θ1 ) = = 208.7 . L∗ (ψ2 ) l(θ2 ) That is, our information should be invariant to the choice of parameterization. ( For the purposes of this example we are not too concerned about how to calculate L∗ (ψ). ) Theorem 3.3.1 (Invariance of the MLE). If g is a one-to-one function, and θ̂ is the MLE of θ then g(θ̂) is the MLE of g(θ). Proof. This is trivially true as we let θ = g −1 (µ) then f {y|g −1 (µ)} is maximized in µ exactly when µ = g(θ̂). When g is not one-to-one the discussion becomes more subtle, but we simply choose to define ĝMLE (θ) = g(θ̂) It seems intuitive that if θ̂ is most likely for θ and our knowledge (data) remains unchanged then g(θ̂) is most likely for g(θ). In fact, we would find it strange if θ̂ is an 59 estimate of θ, but θ̂2 is not an estimate of θ2 . In the binomial example with n = 10 and x = 8 we get θ̂ = 0.8, so the MLE of g(θ) = θ/(1 − θ) is g(θ̂) = θ̂/(1 − θ̂) = 0.8/0.2 = 4. 60 Chapter 4 Estimation In the previous chapter we have seen an approach to estimation that is based on the likelihood of observed results. Next we study general theory of estimation that is used to compare between different estimators and to decide on the most efficient one. 4.1 General properties of estimators Suppose that we are going to observe a value of a random vector X. Let X denote the set of possible values X can take and, for x ∈ X , let f (x|θ) denote the probability that X takes the value x where the parameter θ is some unknown element of the set Θ. The problem we face is that of estimating θ. An estimator θ̂ is a procedure which for each possible value x ∈ X specifies which element of Θ we should quote as an estimate of θ. When we observe X = x we quote θ̂(x) as our estimate of θ. Thus θ̂ is a function of the random vector X. Sometimes we write θ̂(X) to emphasise this point. Given any estimator θ̂ we can calculate its expected value for each possible value of θ ∈ Θ. 
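In practice this expected value can be approximated by simulation: generate many samples with a fixed θ, evaluate the estimator on each, and average. The following R sketch (sample size, parameter value and number of replications are arbitrary illustrative choices) does this for the MLE θ̂ = 1/X̄ of an exponential rate, whose exact expectation nθ/(n − 1) was given in Exercise 15.

set.seed(1)
n = 10; theta = 2; nsim = 50000                   # illustrative choices
x = matrix(rexp(nsim * n, rate = theta), nrow = nsim)
theta.hat = 1 / rowMeans(x)                       # the estimator evaluated on each sample
c(monte.carlo = mean(theta.hat), exact = n * theta/(n - 1))   # about 2.22 here, not theta = 2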
As we have already mentioned when discussing the maximum likelihood estimation, an estimator is said to be unbiased if this expected value is identically equal to θ. If an estimator is unbiased then we can conclude that if we repeat the experiment an infinite number of times with θ fixed and calculate the value of the estimator each time then the average of the estimator values will be exactly equal to θ. To evaluate 61 the usefulness of an estimator θ̂ = θ̂(x) of θ, examine the properties of the random variable θ̂ = θ̂(X). Definition 1 (Unbiased estimators). An estimator θ̂ = θ̂(X) is said to be unbiased for a parameter θ if it equals θ in expectation E[θ̂(X)] = E(θ̂) = θ. Intuitively, an unbiased estimator is ‘right on target’. 2 Definition 2 (Bias of an estimator). The bias of an estimator θ̂ = θ̂(X) of θ is defined 2 as bias(θ̂) = E[θ̂(X) − θ]. Note that even if θ̂ is an unbiased estimator of θ, g(θ̂) will generally not be an unbiased estimator of g(θ) unless g is linear or affine. This limits the importance of the notion of unbiasedness. It might be at least as important that an estimator is accurate in the sense that its distribution is highly concentrated around θ. Exercise 17. Show that for an arbitrary distribution the estimator S 2 as defined in (3.5) is an unbiased estimator of the variance of this distribution. Exercise 18. Consider the estimator S 2 of variance σ 2 in the case of the normal distribution. Demonstrate that although S 2 is an unbiased estimator of σ 2 , S is not an unbiased estimator of σ. Compute its bias. Definition 3 (Mean squared error). The mean squared error of the estimator θ̂ is defined as MSE(θ̂) = E(θ̂ − θ)2 . Given the same set of data, θ̂1 is “better” than θ̂2 if M SE(θ̂1 ) ≤ MSE(θ̂2 ) (uniformly better if true ∀ θ). Lemma 2 (The MSE variance-bias tradeoff). The MSE decomposes as MSE(θ̂) = Var(θ̂) + bias(θ̂)2 . 62 2 Proof. We have MSE(θ̂) = E(θ̂ − θ)2 = E{ [ θ̂ − E(θ̂) ] + [ E(θ̂) − θ ]}2 = E[θ̂ − E(θ̂)]2 + E[E(θ̂) − θ]2 n o +2 E [θ̂ − E(θ̂)][E(θ̂) − θ] | {z } =0 = E[θ̂ − E(θ̂)]2 + E[E(θ̂) − θ]2 = Var(θ̂) + [E(θ̂) − θ]2 . | {z } 2 bias(θ̂) NOTE This lemma implies that the mean squared error of an unbiased estimator is equal to the variance of the estimator. Exercise 19. Consider X1 , . . . , Xn where Xi ∼ N(θ, σ 2 ) and σ is known. Three Pn estimators of θ are θ̂1 = X̄ = n1 i=1 Xi , θ̂2 = X1 , and θ̂3 = (X1 + X̄)/2. Discuss their properties which one you would recommend and why. Example 9. Consider X1 , . . . , Xn to be independent random variables with means E(Xi ) = µ and variances Var(Xi ) = σi2 . Consider pooling the estimators of µ into a common estimator using the linear combination µ̂ = w1 X1 + w2 X2 + · · · + wn Xn . We will see that the following is true (i) The estimator µ̂ is unbiased if and only if P wi = 1. (ii) The estimator µ̂ has minimum variance among this class of estimators when the weights are inversely proportional to the variances σi2 . (iii) The variance of µ̂ for optimal weights wi is Var(µ̂) = 1/ P i σi−2 . P P Indeed, we have E(µ̂) = E(w1 X1 + · · · + wn Xn ) = i wi E(Xi ) = i wi µ = P P µ i wi so µ̂ is unbiased if and only if i wi = 1. The variance of our estimator is P P Var(µ̂) = i wi2 σi2 , which should be minimized subject to the constraint i wi = 1. P P Differentiating the Lagrangian L = i wi2 σi2 − λ ( i wi − 1) with respect to wi and 63 P setting equal to zero yields 2wi σi2 = λ ⇒ wi ∝ σi−2 so that wi = σi−2 /( j σj−2 ). 
P P P Then, for optimal weights we get Var(µ̂) = i wi2 σi2 = ( i σi−4 σi2 )/( i σi−2 )2 = P 1/( i σi−2 ). Assume now that the instead of Xi we observe biased variable X̂i = Xi + β for some β 6= 0. When σi2 = σ 2 we have that Var(µ̂) = σ 2 /n which tends to zero for n → ∞ whereas bias(µ̂) = βand MSE(µ̂) = σ 2 /n + β 2 . Thus in the general case when the bias is present it tends to dominate the variance as n gets larger, which is very unfortunate. Exercise 20. Let X1 , . . . , Xn be an independent sample of size n from the uniform distribution on the interval (0, θ), with density for a single observation being f (x|θ) = θ−1 for 0 < x < θ and 0 otherwise, and consider θ > 0 unknown. (i) Find the expected value and variance of the estimator θ̂ = 2X̄. (ii) Find the expected value of the estimator θ̃ = X(n) , i.e. the largest observation. (iii) Find an unbiased estimator of the form θ̌ = cX(n) and calculate its variance. (iv) Compare the mean square error of θ̂ and θ̌. 4.2 Minimum-Variance Unbiased Estimation Getting a small MSE often involves a tradeoff between variance and bias. For unbiased estimators, the MSE obviously equals the variance, MSE(θ̂) = Var(θ̂), so no tradeoff can be made. One approach is to restrict ourselves to the subclass of estimators that are unbiased and minimum variance. Definition 4 (Minimum-variance unbiased estimator). If an unbiased estimator of g(θ) has minimum variance among all unbiased estimators of g(θ) it is called a minimum 2 variance unbiased estimator (MVUE). We will develop a method of finding the MVUE when it exists. When such an estimator does not exist we will be able to find a lower bound for the variance of an unbiased estimator in the class of unbiased estimators, and compare the variance of our unbiased estimator with this lower bound. 64 Definition 5 (Score function). For the (possibly vector valued) observation X = x to be informative about θ, the density must vary with θ. If f (x|θ) is smooth and differentiable, then for finding MLE we have used the score function S(θ) = S(θ|x) = ∂ ∂f (x|θ)/∂θ log f (x|θ) ≡ . ∂θ f (x|θ) 2 Under suitable regularity conditions (differentiation wrt θ and integration wrt x can be interchanged), we have for X distributed according to f (x|θ): Z Z ∂f (x|θ)/∂θ E{S(θ|X)} = f (x|θ)dx = ∂f (x|θ)/∂θdx , f (x|θ) Z ∂ ∂ 1 = 0. = f (x|θ)dx = ∂θ ∂θ Thus the score function has expectation zero. The score function S(θ|x) is a random variable if for x we substitute X – a random variable with f (x|θ) distribution. In this case we often drop explicit dependence on X from the notation by simply writing S(θ). The negative derivative of the score function measure how concave down is the likelihood around value θ. Definition 6 (Fisher information). The Fisher information is defined as the average value of the negative derivative of the score function ∂ S(θ) . I(θ) ≡ −E ∂θ The negative derivative of the score function I(θ), which is a random variable dependent on X, is sometimes referred to as empirical or observed information about θ. Lemma 3. The variance of S(θ) is equal to the Fisher information about θ ( 2 ) ∂ 2 I(θ) = E{S(θ) } ≡ E log f (X|θ) ∂θ Proof. Using the chain rule ∂2 log f ∂θ2 = = = ∂ 1 ∂f ∂θ f ∂θ 2 1 ∂f 1 ∂2f − 2 + f ∂θ f ∂θ2 2 ∂ log f 1 ∂2f − + ∂θ f ∂θ2 65 2 If integration and differentiation can be interchanged Z Z 1 ∂2f ∂2 ∂2f ∂2 E = dx = f dx = 2 1 = 0, 2 2 2 f ∂θ ∂θ X ∂θ X ∂θ thus " 2 # ∂ ∂2 log f (X|θ) = E = I(θ). −E log f (X|θ) ∂θ2 ∂θ (4.1) Theorem 4.2.1 (Cramér Rao lower bound). Let θ̂ be an unbiased estimator of θ. 
Then Var(θ̂) ≥ { I(θ) }−1 . Proof. Unbiasedness, E(θ̂) = θ, implies Z θ̂(x)f (x|θ)dx = θ. Assume we can differentiate wrt θ under the integral, then Z o ∂ n θ̂(x)f (x|θ) dx = 1. ∂θ The estimator θ̂(x) can’t depend on θ, so Z ∂ θ̂(x) f (x|θ) dx = 1. ∂θ Since ∂f ∂ =f (log f ) , ∂θ ∂θ so that now Z θ̂(x)f ∂ (log f ) dx = 1. ∂θ Thus ∂ E θ̂(x) (log f ) = 1. ∂θ Define random variables U = θ̂(x), and S= ∂ (log f ) . ∂θ 66 Then E (U S) = 1. We already know that the score function has expectation zero, E (S) = 0. Consequently Cov(U, S) = E(U S) − E(U )E(S) = E(U S) = 1. By the well-known property of correlations (that follows from the Schwartz’s inequality) we have 2 2 {Corr(U, S)} = {Cov(U, S)} ≤ 1 Var(U )Var(S) Since, as we mentioned, Cov(U, S) = 1 we get Var(U )Var(S) ≥ 1 This implies Var(θ̂) ≥ 1 I(θ) which is our main result. We call { I(θ) }−1 the Cramér Rao lower bound (CRLB). Why information? Variance measures lack of knowledge. Reasonable that the reciprocal of the variance should be defined as the amount of information carried by the (possibly vector valued) random observation X about θ. Sufficient conditions for the proof of CRLB are that all the integrands are finite, within the range of x. We also require that the limits of the integrals do not depend on θ. That is, the range of x, here f (x|θ), cannot depend on θ. This second condition is violated for many density functions, i.e. the CRLB is not valid for the uniform distribution. We can have absolute assessment for unbiased estimators by comparing their variances to the CRLB. We can also assess biased estimators. If its variance is lower than CRLB then it can be indeed a very good estimate, although it is bias. Example 10. Consider IID random variables Xi , i = 1, . . . , n, with 1 1 fXi (xi |µ) = exp − xi . µ µ Denote the joint distribution of X1 , . . . , Xn by ! n n n Y 1 1X f= fXi (xi |µ) = exp − xi , µ µ i=1 i=1 so that n log f = −n log(µ) − 67 1X xi . µ i=1 The score function is the partial derivative of log f wrt the unknown parameter µ, S(µ) = and n ∂ n 1 X log f = − + 2 xi ∂µ µ µ i=1 ( n n 1 X E {S(µ)} = E − + 2 Xi µ µ i=1 ) ( n ) X 1 n Xi = − + 2E µ µ i=1 For X ∼ Exp(1/µ), we have E(X) = µ implying E(X1 + · · · + Xn ) = E(X1 ) + · · · + E(Xn ) = nµ and E {S(µ)} = 0 as required. ( !) n ∂ n 1 X Xi I(θ) = −E − + 2 ∂µ µ µ i=1 ( = −E n n 2 X − Xi µ2 µ3 i=1 ) ( n ) X 2 n Xi = − 2 + 3E µ µ i=1 = − n 2nµ n + 3 = 2 µ2 µ µ Hence CRLB = µ2 . n Let us propose µ̂ = X̄ as an estimator of µ. Then ( n ( n ) ) X 1X 1 E(µ̂) = E Xi = E Xi = µ, n i=1 n i=1 verifying that µ̂ = X̄ is indeed an unbiased estimator of µ. For X ∼ Exp(1/µ), we p have E(X) = µ = Var(X), implying n 1 X nµ2 µ2 Var(µ̂) = 2 Var(Xi ) = 2 = . n i=1 n n −1 We have already shown that Var(µ̂) = { I(θ) } , and therefore conclude that the 2 unbiased estimator µ̂ = x̄ achieves its CRLB. Definition 7 (Efficiency ). Define the efficiency of the unbiased estimator θ̂ as eff(θ̂) = CRLB Var(θ̂) 68 , where CRLB = { I(θ) }−1 . Clearly 0 < eff(θ̂) ≤ 1. An unbiased estimator θ̂ is said 2 to be efficient if eff(θ̂) = 1. Exercise 21. Consider the MLE θ̂ = r/n for the binomial distribution that was considered in Example 3. Show that for this estimator efficiency is 100%, i.e. its variance attains CRLB. Exercise 22. Consider the MLE for the Poisson distribution that was considered in Example 5. Show that also in this case the MLE is 100% efficient. Definition 8 (Asymptotic efficiency ). The asymptotic efficiency of an unbiased estimator θ̂ is the limit of the efficiency as n → ∞. 
An unbiased estimator θ̂ is said to be asymptotically efficient if its asymptotic efficiency is equal to 1. 2 Exercise 23. Consider the MLE θ̂ for the exponential distribution with parameter θ that was considered in Exercise 16. Find its variance, and its mean square error. Consider also θ̃ that was considered in this example. Which of the two has smaller variance and which has smaller mean square error? Is θ̃ asymptotically efficient? Exercise 24. Discuss efficiency of the estimator of variance in the normal distribution in the case when the mean is known (see Example 7). 4.3 Optimality Properties of the MLE Suppose that an experiment consists of measuring random variables X1 , X2 , . . . , Xn which are iid with probability distribution depending on a parameter θ. Let θ̂ be the MLE of θ. Define W1 W2 = p I(θ)(θ̂ − θ) = p I(θ)(θ̂ − θ) q W3 = q W4 = I(θ̂)(θ̂ − θ) I(θ̂)(θ̂ − θ). Then, W1 , W2 , W3 , and W4 are all random variables and, as n → ∞, the probabilistic behaviour of each of W1 , W2 , W3 , and W4 is well approximated by that of a N (0, 1) random variable. 69 Since E[W1 ] ≈ 0, we have that E[θ̂] ≈ θ and so θ̂ is approximately unbiased. Also Var[W1 ] ≈ 1 implies that Var[θ̂] ≈ (I(θ))−1 and so θ̂ is asymptotically efficient. The above properties of the MLE estimators carry to the multivariate case. Here is a brief account of these properties. Let the data X have probability distribution g(X; θ ) where θ = (θ1 , θ2 , . . . , θm ) is a vector of m unknown parameters. Let I(θθ ) be the m × m observed information matrix and let I(θθ ) be the m × m Fisher’s information matrix obtained by replacing the elements of I(θθ ) by their expected values. Let θ̂θ be the MLE of θ . Let CRLBr be the rth diagonal element of the √ Fisher’s information matrix. For r = 1, 2, . . . , m, define W1r = (θ̂r − θr )/ CRLBr . Then, as n → ∞, W1r behaves like a standard normal random variable. Suppose we define W2r by replacing CRLBr by the rth diagonal element of the matrix I(θθ )−1 , W3r by replacing CRLBr by the rth diagonal element of the matrix I(θ̂θ )−1 and W4r by replacing CRLBr by the rth diagonal element of the matrix I(θ̂θ )−1 . Then it can be shown that as n → ∞, W2r , W3r , and W4r all behave like standard normal random variables. 70 Chapter 5 The Theory of Confidence Intervals 5.1 Exact Confidence Intervals Suppose that we are going to observe the value of a random vector X. Let X denote the set of possible values that X can take and, for x ∈ X , let g(x|θ) denote the probability that X takes the value x where the parameter θ is some unknown element of the set Θ. Consider the problem of quoting a subset of θ values which are in some sense plausible in the light of the data x. We need a procedure which for each possible value x ∈ X specifies a subset C(x) of Θ which we should quote as a set of plausible values for θ. Definition 9. Let X1 , . . . , Xn be a sample form a distribution that is parameterized by some parameter θ. A random set C(X1 , . . . , Xn ) of possible values for θ that is computable from the sample is called a confidence region at confidence level 1 − α if P(θ ∈ C(X1 , . . . , Xn )) = 1 − α. If the set C(X1 , . . . , Xn ) has the form of an interval, then we call it a confidence interval. Example 11. Suppose we are going to observe data x where x = (x1 , x2 , . . . , xn ), and 71 x1 , x2 , . . . , xn are the observed values of random variables X1 , X2 , . . . , Xn which are thought to be iid N (θ, 1) for some unknown parameter θ ∈ (−∞, ∞) = Θ. 
Consider √ √ the subset C(x) = [x̄ − 1.96/ n, x̄ + 1.96/ n]. If we carry out an infinite sequence of independent repetitions of the experiment then we will get an infinite sequence of x values and thereby an infinite sequence of subsets C(x). We might ask what proportion of this infinite sequence of subsets actually contain the fixed but unknown value of θ? Since C(x) depends on x only through the value of x̄ we need to know how x̄ behaves in the infinite sequence of repetitions. This follows from the fact that X̄ has a √ N (θ, n1 ) density and so Z = X̄−θ = n(X̄ − θ) has a N (0, 1) density. Thus even√1 n though θ is unknown we can calculate the probability that the value of Z will exceed 2.78, for example, using the standard normal tables. Remember that the probability is the proportion of experiments in the infinite sequence of repetitions which produce a value of Z greater than 2.78. In particular we have that P [|Z| ≤ 1.96] = 0.95. Thus 95% of the time Z will lie between −1.96 and +1.96. But −1.96 ≤ Z ≤ +1.96 √ −1.96 ≤ n(X̄ − θ) ≤ +1.96 √ √ ⇒ −1.96/ n ≤ X̄ − θ ≤ +1.96/ n √ √ ⇒ X̄ − 1.96/ n ≤ θ ≤ X̄ + 1.96/ n ⇒ ⇒ θ ∈ C(X) Thus we have answered the question we started with. The proportion of the infinite sequence of subsets given by the formula C(X) which will actually include the fixed but unknown value of θ is 0.95. For this reason the set C(X) is called a 95% confidence set or confidence interval for the parameter θ. 2 It is well to bear in mind that once we have actually carried out the experiment and observed our value of x, the resulting interval C(x) either does or does not contain the unknown parameter θ. We do not know which is the case. All we know is that the procedure we used in constructing C(x) is one which 95% of the time produces an interval which contains the unknown parameter. 72 25 20 15 5 10 c(0, mu) 0 20 40 60 80 100 Index Figure 5.1: One hundred confidence intervals for the mean of a normal variable with “unknown” mean and variance for sample size of ten. In fact the samples have been drawn from the normal distribution with the mean 15 and standard deviation 6. The crucial step in the last example was finding the quantity Z = √ n(X̄ − θ) whose value depended on the parameter of interest θ but whose distribution was known to be that of a standard normal variable. This leads to the following definition. Definition 10 (Pivotal Quantity). A pivotal quantity for a parameter θ is a random variable Q(X|θ) whose value depends both on (the data) X and on the value of the unknown parameter θ but whose distribution is known. 2 The quantity Z in the example above is a pivotal quantity for θ. The following lemma provides a method of finding pivotal quantities in general. Lemma 4. Let X be a random variable and define F (a) = P [X ≤ a]. Consider the random variable U = −2 log [F (X)]. Then U has a χ22 density. Consider the random variable V = −2 log [1 − F (X)]. Then V has a χ22 density. 73 Proof. Observe that, for a ≥ 0, P [U ≤ a] Hence, U has density 1 2 = P [F (X) ≥ exp (−a/2)] = 1 − P [F (X) ≤ exp (−a/2)] = 1 − P [X ≤ F −1 (exp (−a/2))] = 1 − F [F −1 (exp (−a/2))] = 1 − exp (−a/2). exp (−a/2) which is the density of a χ22 variable as required. The corresponding proof for V is left as an exercise. This lemma has an immediate, and very important, application. Suppose that we have data X1 , X2 , . . . , Xn which are iid with density f (x|θ). DeRa fine F (a|θ) = −∞ f (x|θ)dx and, for i = 1, 2, . . . , n, define Ui = −2 log[F (Xi |θ)]. Pn Then U1 , U2 , . . . 
, Un are iid each having a χ22 density. Hence Q1 (X, θ) = i=1 Ui has a χ22n density and so is a pivotal quantity for θ. Another pivotal quantity ( also having Pn a χ22n density ) is given by Q2 (X, θ) = i=1 Vi where Vi = −2 log[1 − F (Xi |θ)]. Example 12. Suppose that we have data X1 , X2 , . . . , Xn which are iid with density f (x|θ) = θ exp (−θx) for x ≥ 0 and suppose that we want to construct a 95% confidence interval for θ. We need to find a pivotal quantity for θ. Observe that Z a F (a|θ) = f (x|θ)dx −∞ Z a = θ exp (−θx)dx 0 = Hence Q1 (X, θ) = −2 1 − exp (−θa). n X log [1 − exp (−θXi )] i=1 is a pivotal quantity for θ having a χ22n density. Also Q2 (X, θ) = −2 n X log [exp (−θXi )] = 2θ i=1 n X i=1 74 Xi is another pivotal quantity for θ having a χ22n density. Using the tables, find A < B such that P [χ22n < A] = P [χ22n > B] = 0.025. Then 0.95 P [A ≤ Q2 (X, θ) ≤ B] n X = P [A ≤ 2θ Xi ≤ B] = i=1 B ≤ θ ≤ Pn ] = P [ Pn 2 i=1 Xi 2 i=1 Xi A and so the interval B A , Pn ] [ Pn 2 i=1 Xi 2 i=1 Xi is a 95% confidence interval for θ. 5.2 Pivotal Quantities for Use with Normal Data Many exact pivotal quantities have been developed for use with Gaussian data. Exercise 25. Suppose that we have data X1 , X2 , . . . , Xn which are iid observations from a N (θ, σ 2 ) density where σ is known. Define √ n(X̄ − θ) . Q= σ Show that the defined random variable is pivotal for µ. Construct confidence intervals for µ based on this pivotal quantity. Example 13. Suppose that we have data X1 , X2 , . . . , Xn which are iid observations from a N (θ, σ 2 ) density where θ is known. Define n P Q= We can write Q = n P i=1 (Xi − θ)2 i=1 σ2 Zi2 where Zi = (Xi − θ)/σ. If Zi has a N (0, 1) density then Zi2 has a χ21 density. Hence, Q has a χ2n density and so is a pivotal quantity for σ. If n = 20 then we can be 95% sure that n P 9.591 ≤ (Xi − θ)2 i=1 σ2 75 ≤ 34.170 which is equivalent to v v u u n n u 1 X u 1 X 2 t (Xi − θ) ≤ σ ≤ t (Xi − θ)2 . 34.170 i=1 9.591 i=1 The R command qchisq(p=c(.025,.975),df=20) returns the values 9.590777 and 34.169607 as the 2 12 % and 97 21 % quantiles from a Chi-squared distribution on 20 2 degrees of freedom. Lemma 5 (The Student t-distribution). Suppose the random variables X and Y are independent, and X ∼ N (0, 1) and Y ∼ χ2n . Then the ratio X T =p Y /n has pdf 1 Γ([n + 1]/2) fT (t|n) = √ Γ(n/2) πn t2 1+ n −(n+1)/2 , and is known as Student’s t-distribution on n degrees of freedom. Proof. The random variables X and Y are independent and have joint density 1 2−n/2 −x2 /2 n/2−1 −y/2−1 −y/2 e y e e fX,Y (x, y) = √ 2π Γ(n/2) for y > 0. The Jacobian ∂(t, u)/∂(x, y) of the change of variables t= p x y/n and equals ∂t ∂(t, u) ≡ ∂x ∂(x, y) ∂u ∂x p n/y = ∂u 0 ∂y ∂t ∂y u=y √ x n − 12 (y) 3/2 = (n/y)1/2 1 and the inverse Jacobian ∂(x, y)/∂(t, u) = (u/n)1/2 . 76 Then Z fT (t) ∞ = 0 = = u 1/2 fX,Y t(u/n)1/2 , u du n 1 2−n/2 √ 2π Γ(n/2) ∞ Z 2 e−t u/2n n/2−1 −u/2 u e u 1/2 n 0 1 2−n/2 √ 2π Γ(n/2)n1/2 Z ∞ 2 e−(1+t /n)u/2 (n+1)/2−1 u du du . 0 The last integrand comes from the pdf of a Gam((n + 1)/2, 1/2 + t2 /(2n)) random variable. Hence 1 Γ([n + 1]/2) fT (t) = √ Γ(n/2) πn 1 1 + t2 /n (n+1)/2 , which gives the above formula. Example 14. Suppose that we have data X1 , X2 , . . . , Xn which are iid observations from a N (θ, σ 2 ) density where both θ and σ are unknown. Define √ n(X̄ − θ) Q= s where n P 2 s = (Xi − X̄)2 i=1 n−1 . We can write Q= p where Z W/(n − 1) √ Z= n(X̄ − θ) σ has a N (0, 1) density and n P W = (Xi − X̄)2 i=1 σ2 has a χ2n−1 density ( see lemma 1 ). 
It follows immediately that W is a pivotal quantity for σ. If n = 31 then we can be 95% sure that n P (Xi − X̄)2 i=1 16.79077 ≤ ≤ 46.97924 σ2 77 which is equivalent to v v u u n n X X u u 1 1 2 t (Xi − X̄) ≤ σ ≤ t (Xi − X̄)2 . 46.97924 i=1 16.79077 i=1 (5.1) The R command qchisq(p=c(.025,.975),df=30) returns the values 16.79077 and 46.97924 as the 2 12 % and 97 12 % quantiles from a Chi-squared distribution on 30 degrees of freedom. In lemma 5 we show that Q has a tn−1 density, and so is a pivotal quantity for θ. If n = 31 then we can be 95% sure that √ n(X̄ − θ) ≤ +2.042 −2.042 ≤ s which is equivalent to s s X̄ − 2.042 √ ≤ θ ≤ X̄ + 2.042 √ . n n (5.2) The R command qt(p=.975,df=30) returns the value 2.042272 as the 97 21 % quantile from a Student t-distribution on 30 degrees of freedom. ( It is important to point out that although a probability statement involving 95% confidence has been attached the two intervals (5.2) and (5.1) separately, this does not imply that both intervals si2 multaneously hold with 95% confidence. ) Example 15. Suppose that we have data X1 , X2 , . . . , Xn which are iid observations from a N (θ1 , σ 2 ) density and data Y1 , Y2 , . . . , Ym which are iid observations from a N (θ2 , σ 2 ) density where θ1 , θ2 , and σ are unknown. Let δ = θ1 − θ2 and define (X̄ − Ȳ ) − δ Q= q 1 s2 ( n1 + m ) where Pn 2 s = i=1 (Xi − X̄)2 + Pm j=1 (Yj − Ȳ )2 n+m−2 2 . 2 We know that X̄ has a N (θ1 , σn ) density and that Ȳ has a N (θ2 , σm ) density. Then the difference X̄ − Ȳ has a N (δ, σ 2 [ n1 + 1 m ]) density. Hence X̄ − Ȳ − δ Z=q 1 σ 2 [ n1 + m ] 78 Pn has a N (0, 1) density. Let W1 = i=1 (Xi − X̄)2 /σ 2 and let W2 = Pm j=1 (Yj − Ȳ )2 /σ 2 . Then, W1 has a χ2n−1 density and W2 has a χ2m−1 density, and W = W1 +W2 has a χ2n+m−2 density. We can write p Q1 = Z/ W/(n + m − 2) and so, Q1 has a tn+m−2 density and so is a pivotal quantity for δ. Define Pn Pm 2 2 i=1 (Xi − X̄) + j=1 (Yj − Ȳ ) Q2 = . σ2 Then Q2 has a χ2n+m−2 density and so is a pivotal quantity for σ. 2 Lemma 6 (The Fisher F-distribution). Let X1 , X2 , . . . , Xn and Y1 , Y2 , . . . , Ym be iid N (0, 1) random variables. The ratio n P Z= i=1 m P i=1 Xi2 /n Yi2 /m has the distribution called Fisher, or F distribution with parameters (degrees of freedom) n, m, or the Fn,m distribution for short. The corresponding pdf fFn,m is concentrated on the positive half axis fFn,m (z) = Γ((n + m)/2) n n/2 n/2−1 n −(n+m)/2 z 1+ z Γ(n/2)Γ(m/2) m m for z > 0. Observe that if T ∼ tm , then T 2 = Z ∼ F1,m , and if Z ∼ Fn,m , then Z −1 ∼ Fm,n . If W1 ∼ χ2n and W2 ∼ χ2m , then Z = (mW1 )/(nW2 ) ∼ Fn,m . 2 Example 16. Suppose that we have data X1 , X2 , . . . , Xn which are iid observations 2 from a N (θX , σX ) density and data Y1 , Y2 , . . . , Ym which are iid observations from a N (θY , σY2 ) density where θX , θY , σX , and σY are all unknown. Let λ = σX /σY and define F∗ = ŝ2X = ŝ2Y Pn − X̄)2 (m − 1) Pm . 2 (n − 1) j=1 (Yj − Ȳ ) i=1 (Xi 79 Let WX = n X 2 (Xi − X̄)2 /σX i=1 and let WY = m X (Yj − Ȳ )2 /σY2 . j=1 Then, WX has a χ2n−1 density and WY has a χ2m−1 density. Hence, by lemma 6, Q= F∗ WX /(n − 1) ≡ 2 WY /(m − 1) λ has an F density with n−1 and m−1 degrees of freedom and so is a pivotal quantity for λ. Suppose that n = 25 and m = 13. Then we can be 95% sure that 0.39 ≤ Q ≤ 3.02 which is equivalent to r F∗ ≤λ≤ 3.02 r F∗ . 
0.39 To see how this might work in practice try the following R commands one at a time x = rnorm(25, mean = 0, sd = 2) y = rnorm(13, mean = 1, sd = 1) Fstar = var(x)/var(y); Fstar CutOffs = qf(p=c(.025,.975), df1=24, df2=12) CutOffs; rev(CutOffs) Fstar / rev(CutOffs) var.test(x, y) 2 The search for a nice pivotal quantity for δ = θ1 − δ2 continues and is one of the unsolved problems in Statistics - referred to as the Behrens-Fisher Problem. 5.3 Approximate Confidence Intervals Let X1 , X2 , . . . , Xn be iid with density f (x|θ). Let θ̂ be the MLE of θ. We saw before q p p that the quantities W1 = I(θ)(θ̂ − θ), W2 = I(θ)(θ̂ − θ), W3 = I(θ̂)(θ̂ − q θ), and W4 = I(θ̂)(θ̂ − θ) all had densities which were approximately N (0, 1). 80 Hence they are all approximate pivotal quantities for θ. W3 and W4 are the simplest to use in general. For W3 the approximate 95% confidence interval is given by [θ̂ − q q 1.96/ I(θ̂), θ̂ + 1.96/ I(θ̂)]. For W4 the approximate 95% confidence interval is q q q q given by [θ̂ − 1.96/ I(θ̂), θ̂ + 1.96/ I(θ̂)]. The quantity 1/ I(θ̂) ( or 1/ I(θ̂)) is often referred to as the approximate standard error of the MLE θ̂. Let X1 , X2 , . . . , Xn be iid with density f (x|θ) where θ = (θ1 , θ2 , . . . , θm ) consists of m unknown parameters. Let θ = (θ̂1 , θ̂2 , . . . , θ̂m ) be the MLE of θ. We saw √ before that for r = 1, 2, . . . , m the quantities W1r = (θ̂r − θr )/ CRLBr where CRLBr is the lower bound for Var(θ̂r ) given in the generalisation of the Cramer-Rao theorem had a density which was approximately N (0, 1). Recall that CRLBr is the rth diagonal element of the matrix [I(θ)]−1 . In certain cases CRLBr may depend on the values of unknown parameters other than θr and in those cases W1r will not be an approximate pivotal quantity for θr . We also saw that if we define W2r by replacing CRLBr by the rth diagonal element of the matrix [I(θ)]−1 , W3r by replacing CRLBr by the rth diagonal element of the matrix [I(θ̂)]−1 and W4r by replacing CRLBr by the rth diagonal element of the matrix [I(θ̂)]−1 we get three more quantities all of whom have a density which is approximately N (0, 1). W3r and W4r only depend on the unknown parameter θr and so are approximate pivotal quantities for θr . However in certain cases the rth diagonal element of the matrix [I(θ)]−1 may depend on the values of unknown parameters other than θr and in those cases W2r will not be an approximate pivotal quantity for θr . Generally W3r and W4r are most commonly used. We now examine the use of approximate pivotal quantities based on the MLE in a series of examples Example 17 (Poisson sampling continued). Recall that θ̂ = x̄ and I(θ) = Pn i=1 xi /θ2 = nθ̂/θ2 with E[I(θ)] = n/θ. Hence I(θ̂) = I(θ̂) = n/θ̂ and the usual approximate 95% confidence interval is given by s [ θ̂ − 1.96 s θ̂ θ̂ , θ̂ + 1.96 ]. n n 2 81 Example 18 (Bernoulli trials continued). Recall that θ̂ = x̄ and Pn Pn xi n − i=1 xi I(θ) = i=1 + θ2 (1 − θ)2 with I(θ) = EI(θ) = n . θ(1 − θ) Hence I(θ̂) = I(θ̂) = n θ̂(1 − θ̂) and the usual approximate 95% confidence interval is given by s s θ̂(1 − θ̂) θ̂(1 − θ̂) , θ̂ + 1.96 ]. [ θ̂ − 1.96 n n 2 Example 19. Let X1 , X2 , . . . , Xn be iid observations from the density f (x|α, β) = αβxβ−1 exp [−αxβ ] for x ≥ 0 where both α and β are unknown. In can be verified by straightforward calculations that the information matrix I(α, β) is given by Pn β n/α2 x log[x ] i i=1 i P Pn n β β 2 2 x log[x ] n/β + α x log[x ] i i i=1 i i=1 i Let V11 and V22 be the diagonal elements of the matrix [I(α̂, β̂)]−1 . 
Then the approximate 95% confidence interval for α is [α̂ − 1.96 p p V11 , α̂ + 1.96 V11 ] and the approximate 95% confidence interval for β is p p [β̂ − 1.96 V22 , β̂ + 1.96 V22 ]. Finding α̂ and β̂ is an interesting exercise that you can try to do on your own. 82 2 Exercise 26. Components are produced in an industrial process and the number of flaws indifferent components are independent and identically distributed with probability mass function p(x) = θ(1 − θ)x , x = 0, 1, 2, . . ., where 0 < θ < 1. A random sample of n components is inspected; n0 components are found to have no flaws, n1 components are found to have two or more flaws. 1. Show that the likelihood function is l(θ) = θn0 +n1 (1 − θ)2n−2n0 −n1 . 2. Find the MLE of θ and the sample information in terms of n, n0 and n1 . 3. Hence calculate an approximate 90% confidence interval for θ where 90 out of 100 components have no flaws, and seven have exactly one flaw. Exercise 27. Suppose that X1 , X2 , . . . , Xn is a random sample from the shifted exponential distribution with probability density function f (x|θ, µ) = 1 −(x−µ)/θ e , θ µ < x < ∞, where θ > 0 and −∞, µ < ∞. Both θ and µ are unknown, and n > 1. 1. The sample range W is defined as W = X(n) − X(1) , where X(n) = maxi Xi and X(1) = mini Xi . It can be shown that the joint probability density function of X(1) and W is given by fX(1) ,W (x(1) , w) = n(n − 1)θ−2 e−n(x(1) −µ)/θ e−w/θ (1 − e−w/θ )n−2 , for x(1) > µ and w > 0. Hence obtain the marginal density function of W and show that W has distribution function P (W ≤ w) = (1 − e−w/θ )n−1 , w > 0. 2. Show that W/θ is a pivotal quantity for θ. Without carrying out any calculations, explain how this result may be used to construct a 100(1 − α)% confidence interval for θ for 0, α < 1. Exercise 28. Let X have the logistic distribution with probability density function f (x) = ex−θ , (1 + ex−θ )2 where −∞ < θ < ∞ is an unknown parameter. 83 −∞ < x < ∞, 1. Show that X − θ is a pivotal quantity and hence, given a single observation X, construct an exact 100(1 − α)% confidence interval for θ. Evaluate the interval when α = 0.05 and X = 10. 2. Given a random sample X1 , X2 , . . . , Xn from the above distribution, briefly explain how you would use the central limit theorem to construct an approximate 95% confidence interval for θ. Hint E(X) = θ and Var(X) = π 2 /3. Exercise 29. Let X1 , . . . , Xn be iid with density fX (x|θ) = θ exp (−θx) for x ≥ 0. 1. Show that Rx 0 f (u|θ)du = 1 − exp (−θx). 2. Use the result in (a) to establish that Q = 2θ Pn i=1 Xi is a pivotal quantity for θ and explain how to use Q to find a 95% confidence interval for θ. 3. Derive the information I(θ). Suggest an approximate pivotal quantity for θ involving I(θ) and another approximate pivotal quantity involving I(θ̂) where θ̂ = 1/x̄ is the maximum likelihood estimate of θ. Show how both approximate pivotal quantities may be used to find approximate 95% confidence intervals for θ. Prove that the approximate confidence interval calculated using the approximate pivotal quantity involving I(θ̂) is always shorter than the approximate confidence interval calculated using the approximate pivotal quantity involving I(θ) but that the ratio of the lengths converges to 1 as n → ∞. 4. Suppose n = 25 and P20 i=1 xi = 250. Use the method explained in (b) to calculate a 95% confidence interval for θ and the two methods explained in (c) to calculate approximate 95% confidence intervals for θ. Compare the three intervals obtained. Exercise 30. Let X1 , X2 , . . . 
, Xn be iid with density f (x|θ) = θ (x + 1)θ+1 for x ≥ 0. 1. Derive an exact pivotal quantity for θ and explain how it may be used to find a 95% confidence interval for θ. 84 2. Derive the information I(θ). Suggest an approximate pivotal quantity for θ involving I(θ) and another approximate pivotal quantity involving I(θ̂) where θ̂ is the maximum likelihood estimate of θ. Show how both approximate pivotal quantities may be used to find approximate 95% confidence intervals for θ. 3. Suppose n = 25 and P25 i=1 log [xi + 1] = 250. Use the method explained in (a) to calculate a 95% confidence interval for θ and the two methods explained in (b) to calculate approximate 95% confidence intervals for θ. Compare the three intervals obtained. Exercise 31. Let X1 , X2 , . . . , Xn be iid with density f (x|θ) = θ2 x exp (−θx) for x ≥ 0. 1. Show that Rx 0 f (u|θ)du = 1 − exp (−θx)[1 + θx]. 2. Describe how the result from (a) can be used to construct an exact pivotal quantity for θ. 3. Construct FOUR approximate pivotal quantities for θ based on the MLE θ̂. 4. Suppose that n = 10 and the data values are 1.6, 2.5, 2.7, 3.5, 4.6, 5.2, 5.6, 6.4, 7.7, 9.2. Evaluate the 95% confidence interval corresponding to ONE of the exact pivotal quantities ( you may need to use a computer to do this ). Compare your answer to the 95% confidence intervals corresponding to each of the FOUR approximate pivotal quantities derived in (c). Exercise 32. Let X1 , X2 , . . . , Xn be iid each having a Poisson density f (x|θ) = θx exp (−θ) x! for x = 0, 1, 2, . . . ., ∞. Construct FOUR approximate pivotal quantities for θ based on the MLE θ̂. Show how each may be used to construct an approximate 95% confidence interval for θ. Evaluate the four confidence intervals in the case where the data consist of n = 64 observations with an average value of x̄ = 4.5. 85 Exercise 33. Let X1 , X2 , . . . , Xn be iid with density f1 (x|θ) = 1 exp [−x/θ] θ for 0 ≤ x < ∞. Let Y1 , Y2 , . . . , Ym be iid with density f2 (y|θ, λ) = λ exp [−λy/θ] θ for 0 ≤ y < ∞. 1. Derive approximate pivotal quantities for each of the parameters θ and λ. 2. Suppose that n = 10 and the average of the 10 x values is 24.0. Suppose that m = 40 and the average of the 40 y values is 12.0. Calculate approximate 95% confidence intervals for both θ and λ. 86 Chapter 6 The Theory of Hypothesis Testing 6.1 Introduction Suppose that we are going to observe the value of a random vector X. Let X denote the set of possible values that X can take and, for x ∈ X , let f (x, θ) denote the density (or probability mass function) of X where the parameter θ is some unknown element of the set Θ. A hypothesis specifies that θ belongs to some subset Θ0 of Θ. The question arises as to whether the observed data x is consistent with the hypothesis that θ ∈ Θ0 , often written as H0 : θ ∈ Θ0 . The hypothesis H0 is usually referred to as the null hypothesis. The null hypothesis is contrasted with the so-called alternative hypothesis H1 : θ ∈ Θ1 , where Θ0 ∩ Θ1 = ∅. The testing hypothesis is aiming at finding in the data x enough evidence to reject the null hypothesis: H0 : θ ∈ Θ0 in favor of the alternative hypothesis H1 : θ ∈ Θ1 . 87 Due to the focus on rejecting H0 and control of the error rate for such a decision, the set up in the role of the hypotheses is not exchangeable. In a hypothesis testing situation, two types of error are possible. • The first type of error is to reject the null hypothesis H0 : θ ∈ Θ0 as being inconsistent with the observed data x when, in fact, θ ∈ Θ0 i.e. 
when, in fact, the null hypothesis happens to be true. This is referred to as Type I Error. • The second type of error is to fail to reject the null hypothesis H0 : θ ∈ Θ0 as being inconsistent with the observed data x when, in fact, θ ∈ Θ1 i.e. when, in fact, the null hypothesis happens to be false. This is referred to as Type II Error. The goal is to propose a procedure that for given data X = x would automatically point which of the hypothesis is more favorable and in such a way that chances of making Type I Error are some prescribed small α ∈ (0, 1), that is refered to as the significance level of a test. More precisely for given data x we evaluate a certain numerical characteristics T (x) that is called a test statistic and if it falls in a certain critical region Rα (often also called rejection region), we reject H0 in the favor of H1 . We demand that T (x) and Rα are chosen in such a way that for θ ∈ Θ0 P(T (X) ∈ Rα |θ) ≤ α, i.e. Type I Error is at most α. Therefore the test procedure can be identified with a test statistic T (x) and a rejection region Rα . It is quite natural to expected that Rα is decreasing with α (it should be harder to reject H0 if error 1 is smaller). Thus for a given sample x, there should be an α̂ such that for α > α̂ we have T (x) ∈ Rα and for α < α̂ the test statistics T (x) is outside Rα . The value α̂ is called the p-value for a given test. While the focus in setting a testing hypothesis problem is on Type I Error so it is controlled by the significance level, it is also important to have chances of Type II Error as small as possible. For a given testing procedure smaller chances of Type I Error are at the cost of bigger chances of Type II Error. However, the chances of Type II Error can serve for comparison of testing procedures for which the significance level 88 is the same. For this reason, the concept of the power of a test has been introduced. In general, the power of a test is a function p(θ) of θ ∈ Θ1 and equals the probability of rejecting H0 while the true parameter is θ, i.e. under the alternative hypothesis. Among two tests in the same problem and at the same significance level, the one with larger power for all θ ∈ Θ1 is considered better. The power of a given procedure is increasing with the sample size of data, therefore it is often used to determine a sample size so that not rejecting H0 will represent a strong support for H0 not only a lack of evidence for the alternative. Example 20. Suppose the data consist of a random sample X1 , X2 , . . . , Xn from a N (θ, 1) density. Let Θ = (−∞, ∞) and Θ0 = (−∞, 0] and consider testing H0 : θ ∈ Θ0 , i.e. H0 : θ ≤ 0. The standard estimate of θ for this example is X̄. It would seem rational to consider that the bigger the value of X̄ that we observe the stronger is the evidence against the null hypothesis that θ ≤ 0, in favor of the alternative θ > 0. Thus we decide to use T (X) = X̄ as our test statistics. How big does X̄ have to be in order for us to reject H0 ? In other words we want to determine the rejection region Rα . It is quite natural to consider Rα = [aα , ∞), so we reject H0 if X̄ is too large, i.e. X̄ ≥ aα . To determine aα we recall that controlling Type I Error means that P(X̄ ≥ aα |θ) ≤ α, where θ ≤ 0. For such θ, we clearly have P(X̄ ≥ aα |θ) ≤ P(X̄ ≥ aα |θ = 0) √ = 1 − Φ(aα n), √ from which we get that aα = z1−α / n assures that Type I Error is controlled at α level. Suppose that n = 25 and we observe x̄ = 0.32. 
Finding the p-value is then equivalent to determining the chances of getting such a large value for x̄ by a random variable that has the distribution of X̄, i.e. N (0, n1 ). In our particular case, it is a N (θ, 0.04), the probability of getting a value for X̄ as large as 0.32 is the area under a N (0, 0.04) curve 89 between 0.32 and ∞ which is the area under a N (0, 1) curve between 0.32 0.20 = 1.60 and ∞ or 0.0548. This quantity is called the p-value. The p-value is used to measure the strength of the evidence against H0 : θ ≤ 0 and H0 is rejected if the p-value is less than some small number such as 0.05. You might like to try the R commands 1-pnorm(q=0.32,mean=0,sd=sqrt(.04)) and 1-pnorm(1.6). √ Consider the test statistic T (X) = nX̄ and suppose we observe T (x) = t. A rejection region that results in the significance level α can be defined as Rα = [z1−α , ∞). In order to calculate the p-value we need to find α̂ such that t = z1−α̂ which is equiva2 lent to α̂ = P(T > t). Exercise 34. Since the images on the two sides of coins are made of raised metal, the toss may slightly favor one face or the other, if the coin is allowed to roll on one edge upon landing. For the same reason coin spinning is much more likely to be biased than flipping. Conjurers trim the edges of coins so that when spun they usually land on a particular face. To investigate this issue a strict method of coin spinning has been designed and the results of it recorded for various coins. We assume that the number of considered tosses if fairly large (bigger than 100). Formulate a testing hypothesis problem for this situation and in the process answer the following questions. 1. Formulate the null and alternative hypotheses. 2. Propose a test statistic used to decide for one of the hypotheses. 3. Derive a rejection region that guarantees the chances of Type I Error to be at most α. 4. Explain how for an observed proportions p̂ of “Heads” one could obtain p-value for the proposed test. 5. Derive a formula for the power of the test. 6. Study how the power depends on the sample size. In particular, it is believed that a certain coin is tinted toward “Heads” and that the true chances of landing 90 “Heads” are at least 0.51. Design an experiment in which the chances of making a correct decision using your procedure are 95%. 7. If one hundred thousands spins of a coin will be made. What are chances that the procedure will lead to the correct decision? 8. Suppose that one hundred thousands spins of a coin have been made and the coin landed “Heads” 50877 times. Find the p-value and report a conclusion. Example 21 (The power function). Suppose our rule is to reject H0 : θ ≤ 0 in favor H1 : θ > 0 if the p-value is less than 0.05. In order for the p-value to be less than 0.05 √ √ we require nt > 1.65 and so we reject H0 if x̄ > 1.65/ n. What are the chances of rejecting H0 if θ = 0.2 ? If θ = 0.2 then X̄ has the N [0.2, 1/n] distribution and so the probability of rejecting H0 is √ 1 1.65 P N 0.2, ≥ √ = P N (0, 1) ≥ 1.65 − 0.2 n . n n For n = 25 this is given by P {N (0, 1) ≥ 0.65} = 0.2578. This calculation can be verified using the R command 1-pnorm(1.65-0.2*sqrt(25)). The following table gives the results of this calculation for n = 25 and various values of θ ∈ Θ1 . θ: 0.00 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Prob: 0.50 .125 .258 .440 .637 .802 .912 .968 .991 .998 .999 This is called the power function of the test. The R command Ns=seq(from=(-1),to=1, by=0.1) generates and stores the sequence −1.0, −0.9, . . . 
, +1.0 and the probabilities in the table were calculated using 1-pnorm(1.65-Ns*sqrt(25)). The graph of the 2 power function is presented in Figure 6.1. Example 22 (Sample size). How large would n have to be so that the probability of √ rejecting H0 when θ = 0.2 is 0.90 ? We would require 1.65 − 0.2 n = −1.28 which √ implies that n = (1.65 + 1.28)/0.2 or n = 215. 2 91 1.0 0.8 0.6 0.0 0.2 0.4 y -1.0 -0.5 0.0 0.5 1.0 Ns Figure 6.1: The power function of the test that a normal sample of size 25 has the mean value bigger than zero. So the general plan for testing a hypothesis is clear: choose a test statistic T , observe the data, calculate the observed value t of the test statistic T , calculate the p-value as the maximum over all values of θ in Θ0 of the probability of getting a value for T as large as t, and reject H0 : θ ∈ Θ0 if the p-value so obtained is too small. 6.2 Hypothesis Testing for Normal Data Many standard test statistics have been developed for use with normally distributed data. Example 23. Suppose that we have data X1 , X2 , . . . , Xn which are iid observations from a N (µ, σ 2 ) density where both µ and σ are unknown. Here θ = (µ, σ) and Θ = {(µ, σ) : −∞ < µ < ∞, 0 < σ < ∞}. Define Pn PN (Xi − X̄)2 2 i=1 Xi X̄ = and s = i=1 . n n−1 92 (a) Suppose that for a fixed value µ0 we consider Θ0 = {(µ, σ) : −∞ < µ ≤ µ0 , 0 < σ < ∞}, which can be simply reported as H0 : µ ≤ µ0 . Define √ T = n(X̄ − µ0 )/s. Let t denote the observed value of T . Then the rejection region at the level α is defined as Rα = [t1−α,n−1 , ∞), where tp,k is, as usual, the p-quantile of Student t-distribution with k degrees of freedom. It is clear that the p-value is α̂ that is determined from the equality t1−α̂,n−1 = t, which equivlant to α̂ = P(T > t) = 1 − F (t), where F is the cdf of Student t-distribution with n − 1 degrees of freedom. (b) Suppose H0 : µ ≥ µ0 . Let T be as before and t denote the observed value of T . By the analogy with the previous case Rα = (−∞, tα,n−1 ], and the p-value is given by α̂ = P(T < t) = F (t). (c) Suppose H0 : µ = µ0 . Define T as before and t denote the observed value of T . Then Then a rejection region at the level α can be defined as Rα = (−∞, tα/2,n−1 ] ∪ [t1−α/2,n−1 , ∞), . It is clear that the p-value is can be obtained as α̂ from the equation |t| = t1−α̂/2,n−1 or equivalently α̂ = 2P(T > |t|) = 2(1 − F (|t|)). (d) Suppose H0 : σ ≤ σ0 . Define T = Pn i=1 (Xi − X̄)2 /σ02 . Let t denote the observed value of T . Then the rejection region can be set as Rα = [χ21−α,n−1 , ∞), 93 where χ2p,k is as usually the p-quantile of the chi-squared distribution with kdegrees of freedom. Let us verify that the test statistic with the rejection region gives indeed the significance at the level α. We have P(T ≥ χ21−α,n−1 |σ n X ≤ σ0 ) = P( (Xi − X̄)2 /σ 2 ≥ χ21−α,n−1 σ02 /σ 2 |σ ≤ σ0 ) i=1 n X ≤ P( (Xi − X̄)2 /σ 2 ≥ χ21−α,n−1 ) = α. i=1 The p-value is α̂ obtained from t = χ21−α̂,n−1 which is equivalent to α̂ = P(T > t) = 1−F (t) where F is the cdf of the chi-squared distribution with n−1 degrees of freedom. (e) The case H0 : σ 2 ≥ σ02 can be treated analogously, so that Rα = [0, χ2α,n−1 ], and the p-value is obtained as α̂ = P(T < t) = F (t). (f) Finally for H0 : σ = σ0 and T defined as before we consider a rejection region Rα = [0, χ2α/2,n−1 ] ∪ [χ21−α/2,n−1 , ∞), It is easy to see that Rα and T gives the significance level α. 
Exercise 35. The following data have been obtained by measuring the temperature at noon on the first of December, for ten consecutive years, at a certain location in Ireland: 3.7, 6.6, 8.0, 2.5, 4.5, 4.5, 3.5, 7.7, 4.0, 5.7. Using this data set perform tests for the following problems:

• H0: µ = 6 vs. H1: µ ≠ 6,
• H0: σ ≤ 1 vs. H1: σ > 1.

Report p-values and write conclusions.

In the following we examine two-sample problems under the normal distribution assumption.

Example 24. Suppose that we have data x1, x2, …, xn which are iid observations from a N(µ1, σ²) density and data y1, y2, …, ym which are iid observations from a N(µ2, σ²) density, where µ1, µ2 and σ are unknown. Here θ = (µ1, µ2, σ) and Θ = {(µ1, µ2, σ) : −∞ < µ1 < ∞, −∞ < µ2 < ∞, 0 < σ < ∞}. Recall the pooled estimator of the common variance,

s² = ( Σ_{i=1}^n (xi − x̄)² + Σ_{j=1}^m (yj − ȳ)² ) / (n + m − 2).

(a) Suppose Θ0 = {(µ1, µ2, σ) : −∞ < µ1 < ∞, µ1 ≤ µ2 < ∞, 0 < σ < ∞}, which can be simply expressed by H0: µ1 ≤ µ2 vs. H1: µ1 > µ2. Define

T = (x̄ − ȳ) / √( s² (1/n + 1/m) ).

Let t denote the observed value of T. Then the rejection region Rα = [t_{1−α, n+m−2}, ∞) for T defines a test at significance level α. It is clear, by the same arguments as before, that α̂ = P(T > t) = 1 − F(t) is the p-value for the discussed procedure. Here F is the cdf of the Student t-distribution with n + m − 2 degrees of freedom.

(b) The case symmetric to the previous one, H0: µ1 ≥ µ2, can be treated by taking Rα = (−∞, t_{α, n+m−2}], with the p-value α̂ = P(T < t) = F(t).

(c) Two-sided testing of H0: µ1 = µ2 vs. H1: µ1 ≠ µ2 is addressed by Rα = (−∞, t_{α/2, n+m−2}] ∪ [t_{1−α/2, n+m−2}, ∞), with the p-value given by α̂ = P(|T| > |t|) = 2 P(T > |t|) = 2 (1 − F(|t|)).

(d) Suppose that we have data x1, x2, …, xn which are iid observations from a N(µ1, σ1²) density and data y1, y2, …, ym which are iid observations from a N(µ2, σ2²) density, where µ1, µ2, σ1 and σ2 are all unknown. Here θ = (µ1, µ2, σ1, σ2) and Θ = {(µ1, µ2, σ1, σ2) : −∞ < µ1 < ∞, −∞ < µ2 < ∞, 0 < σ1 < ∞, 0 < σ2 < ∞}. Define

s1² = Σ_{i=1}^n (xi − x̄)² / (n − 1)   and   s2² = Σ_{j=1}^m (yj − ȳ)² / (m − 1).

Suppose Θ0 = {(µ1, µ2, σ, σ) : −∞ < µ1 < ∞, −∞ < µ2 < ∞, 0 < σ < ∞}, or simply H0: σ1 = σ2 vs. H1: σ1 ≠ σ2. Define

T = s1² / s2².

Let t denote the observed value of T. Define the rejection region by

Rα = [0, F_{n−1, m−1}(α/2)] ∪ [F_{n−1, m−1}(1 − α/2), ∞),

where F_{k,l}(p) is the p-quantile of the Fisher F distribution with k and l degrees of freedom. We note that F_{n−1, m−1}(α/2) = 1/F_{m−1, n−1}(1 − α/2). The p-value can be obtained by taking

α̂ = 2 ( P(T < t) ∧ P(T > t) ) = 2 ( F(t) ∧ F̃(1/t) ),

where F is the cdf of the F distribution with n − 1 and m − 1 degrees of freedom and F̃ is the one with m − 1 and n − 1 degrees of freedom. □
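A sketch of how the two-sample comparisons of Example 24 can be run in R: var.test carries out the F test of case (d) and t.test with var.equal=TRUE the pooled t-tests of (a)–(c). The vectors x and y below are placeholder samples, not data from the notes.

    x <- rnorm(8, mean = 0.4, sd = 0.1)     # placeholder first sample
    y <- rnorm(4, mean = 0.3, sd = 0.1)     # placeholder second sample

    var.test(x, y)                           # case (d): H0: sigma1 = sigma2
    t.test(x, y, var.equal = TRUE)           # case (c): H0: mu1 = mu2, pooled variance
    t.test(x, y, var.equal = TRUE, alternative = "greater")   # case (a): H1: mu1 > mu2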
Exercise 36. The following table gives the concentration of norepinephrine (µmol per gram creatinine) in the urine of healthy volunteers in their early twenties.

Male:    0.48  0.36  0.20  0.55  0.45  0.46  0.47  0.23
Female:  0.35  0.37  0.27  0.29

The problem is to determine whether there is evidence that the concentration of norepinephrine differs between genders.

1. Testing for the difference between the means in the two-sample normal problem is the main testing procedure. However, it requires checking whether the variances in the two samples are the same. Carry out a test that checks if there is a significant difference between the variances. Evaluate the p-value and make a conclusion.

2. If the above procedure did not reject the equality-of-variances assumption, carry out a procedure that examines the equality of the mean concentrations between genders. Report the p-value and write down a conclusion.

6.3 Generally Applicable Test Procedures

Suppose that we observe the value of a random vector X whose probability density function is g(x|θ) for x in the sample space X, where the parameter θ = (θ1, θ2, …, θp) is some unknown element of the set Θ ⊆ R^p. Let Θ0 be a specified subset of Θ. Consider the hypotheses H0: θ ∈ Θ0 vs. H1: θ ∈ Θ1. In this section we consider three ways in which good test statistics may be found for this general problem.

The Likelihood Ratio Test: This test statistic is based on the idea that the maximum of the log-likelihood over the subset Θ0 should not be too much less than the maximum over the whole set Θ if, in fact, the parameter θ actually does lie in the subset Θ0. Let log l(θ) denote the log-likelihood function. The test statistic is

T1(x) = 2 log( l(θ̂)/l(θ̂0) ) = 2 [ log l(θ̂) − log l(θ̂0) ],

where θ̂ is the value of θ in the set Θ for which log l(θ) is a maximum and θ̂0 is the value of θ in the set Θ0 for which log l(θ) is a maximum.

The Maximum Likelihood Test Statistic: This test statistic is based on the idea that θ̂ and θ̂0 should be close to one another. Let I(θ) be the p × p information matrix and let B = I(θ̂). The test statistic is

T2(x) = (θ̂ − θ̂0)^T B (θ̂ − θ̂0).

Other forms of this test statistic follow from other choices of B, for example B = I(θ̂0).

The Score Test Statistic: This test statistic is based on the idea that θ̂0 should almost solve the likelihood equations. Let S(θ) be the p × 1 vector whose r-th element is given by ∂ log l/∂θr. Let C be the inverse of I(θ̂0), i.e. C = I(θ̂0)^{−1}. The test statistic is

T3(x) = S(θ̂0)^T C S(θ̂0).

In order to calculate p-values we need to know the probability distribution of the test statistic under the null hypothesis. Deriving the exact probability distribution may be difficult, but approximations suitable for situations in which the sample size is large are available in the special case where Θ is a p-dimensional set and Θ0 is a q-dimensional subset of Θ for q < p. It can be shown that, when H0 is true, the probability distributions of T1(x), T2(x) and T3(x) are all approximated by a χ²_{p−q} density.

Example 25. Let X1, X2, …, Xn be iid, each having a Poisson distribution with parameter θ. Consider testing H0: θ = θ0, where θ0 is some specified constant. Recall that

log l(θ) = ( Σ_{i=1}^n xi ) log θ − nθ − log( Π_{i=1}^n xi! ).

Here Θ = [0, ∞) and the value of θ ∈ Θ for which log l(θ) is a maximum is θ̂ = x̄. Also Θ0 = {θ0}, and so trivially θ̂0 = θ0. We saw also that

S(θ) = Σ_{i=1}^n xi / θ − n   and   I(θ) = Σ_{i=1}^n xi / θ².

Suppose that θ0 = 2, n = 40 and that when we observe the data we get x̄ = 2.50, hence Σ_{i=1}^n xi = 100. Then

T1 = 2 [ log l(2.5) − log l(2.0) ] = 200 log(2.5) − 200 − 200 log(2.0) + 160 = 4.62.

The information is B = I(θ̂) = 100/2.5² = 16, hence T2 = (θ̂ − θ̂0)² B = 0.25 × 16 = 4. We have S(θ0) = S(2.0) = 10 and I(θ0) = 25, and so T3 = 10²/25 = 4. Here p = 1 and q = 0, implying p − q = 1. Since P[χ²1 ≥ 3.84] = 0.05, all three test statistics produce a p-value less than 0.05 and lead to the rejection of H0: θ = 2. □
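A sketch of these calculations in R, using the numbers of Example 25 (θ0 = 2, n = 40, x̄ = 2.5); the additive constant involving the xi! cancels in the likelihood ratio and is omitted from the log-likelihood.

    n      <- 40
    xbar   <- 2.5
    theta0 <- 2
    sumx   <- n * xbar                                    # = 100
    loglik <- function(theta) sumx * log(theta) - n * theta
    T1 <- 2 * (loglik(xbar) - loglik(theta0))             # likelihood ratio statistic
    T2 <- (xbar - theta0)^2 * (sumx / xbar^2)             # maximum likelihood statistic
    T3 <- (sumx / theta0 - n)^2 / (sumx / theta0^2)       # score statistic
    round(c(T1 = T1, T2 = T2, T3 = T3), 2)
    pchisq(c(T1, T2, T3), df = 1, lower.tail = FALSE)     # approximate p-values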
Example 26. Let X1, X2, …, Xn be iid with density f(x|α, β) = αβ x^{β−1} exp(−α x^β) for x ≥ 0. Consider testing H0: β = 1. Here Θ = {(α, β) : 0 < α < ∞, 0 < β < ∞} and Θ0 = {(α, 1) : 0 < α < ∞} is a one-dimensional subset of the two-dimensional set Θ. Recall that

log l(α, β) = n log α + n log β + (β − 1) Σ_{i=1}^n log xi − α Σ_{i=1}^n xi^β.

Hence the vector S(α, β) has the components

n/α − Σ_{i=1}^n xi^β   and   n/β + Σ_{i=1}^n log xi − α Σ_{i=1}^n xi^β log xi,

and the matrix I(α, β) has the rows

( n/α² ,  Σ_{i=1}^n xi^β log xi )   and   ( Σ_{i=1}^n xi^β log xi ,  n/β² + α Σ_{i=1}^n xi^β (log xi)² ).

We have that θ̂ = (α̂, β̂) requires numerical methods for its calculation, which is discussed in the sample of exam problems. Also θ̂0 = (α̂0, 1), where α̂0 = 1/x̄. Suppose that the observed value of T1(x) is 3.20. Then the p-value is P[T1(x) ≥ 3.20] ≈ P[χ²1 ≥ 3.20] = 0.0736.

In order to get the maximum likelihood test statistic, plug the values α̂, β̂ for α, β into the formula for I(α, β) to get the matrix B, then calculate T2(x) = (θ̂ − θ̂0)^T B (θ̂ − θ̂0) and use the χ²1 tables to calculate the p-value.

Finally, to calculate the score test statistic, note that the vector S(θ̂0) has the components

0   and   n + Σ_{i=1}^n log xi − Σ_{i=1}^n xi log xi / x̄,

and the matrix I(θ̂0) has the rows

( n x̄² ,  Σ_{i=1}^n xi log xi )   and   ( Σ_{i=1}^n xi log xi ,  n + Σ_{i=1}^n xi (log xi)² / x̄ ).

Since T3(x) = S(θ̂0)^T C S(θ̂0), where C = I(θ̂0)^{−1}, and the first component of S(θ̂0) is zero, T3(x) equals

[ n + Σ_{i=1}^n log xi − Σ_{i=1}^n xi log xi / x̄ ]²

multiplied by the lower diagonal element of C, which is given by

n x̄² / ( n x̄² [ n + Σ_{i=1}^n xi (log xi)² / x̄ ] − [ Σ_{i=1}^n xi log xi ]² ).

Hence we get

T3(x) = n x̄² [ n + Σ_{i=1}^n log xi − Σ_{i=1}^n xi log xi / x̄ ]² / ( n x̄² [ n + Σ_{i=1}^n xi (log xi)² / x̄ ] − [ Σ_{i=1}^n xi log xi ]² ).

No numerical techniques are needed to calculate the value of T3(x), and for this reason the score test is often preferred to the other two. However, there is some evidence that the likelihood ratio test is more powerful, in the sense that it has a better chance of detecting departures from the null hypothesis. □
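As an illustration, the closed-form score statistic above can be evaluated directly. In the following sketch the sample x is a placeholder generated under H0 (β = 1, for which the density reduces to an exponential with rate α, here taken to be 1).

    set.seed(2)
    x    <- rexp(50, rate = 1)      # placeholder sample generated under H0: beta = 1
    n    <- length(x)
    xbar <- mean(x)

    num <- n * xbar^2 * (n + sum(log(x)) - sum(x * log(x)) / xbar)^2
    den <- n * xbar^2 * (n + sum(x * log(x)^2) / xbar) - sum(x * log(x))^2
    T3  <- num / den                          # the score statistic in closed form
    T3
    pchisq(T3, df = 1, lower.tail = FALSE)    # approximate p-value for H0: beta = 1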
Exercise 37. Suppose that household incomes in a certain country have a Pareto distribution with probability density function

f(x) = θ v^θ / x^{θ+1},   v ≤ x < ∞,

where θ > 0 is unknown and v > 0 is known. Let x1, x2, …, xn denote the incomes for a random sample of n such households. We wish to test the null hypothesis θ = 1 against the alternative that θ ≠ 1.

1. Derive an expression for θ̂, the MLE of θ.
2. Show that the generalised likelihood ratio test statistic, λ(x), satisfies ln{λ(x)} = n − n ln(θ̂) − n/θ̂.
3. Show that the test accepts the null hypothesis if k1 < Σ_{i=1}^n ln(xi) < k2, and state how the values of k1 and k2 may be determined. Hint: find the distribution of ln(X), where X has a Pareto distribution.

Exercise 38. A Geiger counter (radioactivity meter) is calibrated using a source of known radioactivity. The counts recorded by the counter, xi, over 200 one-second intervals are recorded:

8 12 6 11 3 9 9 8 5 4 6 11 6 14 3 5 15 11 7 6 9 9 14 13 6 11 … 9 8 5 8 9 14 14

The sum of the counts is Σ_{i=1}^{200} xi = 1800. The counts can be treated as observations of iid Poisson random variables with parameter µ, with p.m.f.

f(xi; µ) = µ^{xi} e^{−µ} / xi!,   xi = 0, 1, …;  µ > 0.

If the Geiger counter is functioning correctly then µ = 10, and to check this we would test H0: µ = 10 versus H1: µ ≠ 10. Suppose that we choose to test at a significance level of 5%. The test can be performed using a generalized likelihood ratio test. Carry out such a test. What does this imply about the Geiger counter? Finally, given the form of the MLE, what was the point of recording the counts in 200 one-second intervals rather than recording the count in one 200-second interval?

6.4 The Neyman-Pearson Lemma

Suppose we are testing a simple null hypothesis H0: θ = θ′ against a simple alternative H1: θ = θ′′, where θ is the parameter of interest, and θ′, θ′′ are particular values of θ. Observed values of the i.i.d. random variables X1, X2, …, Xn, each with p.d.f. fX(x|θ), are available. We are going to reject H0 if (x1, x2, …, xn) ∈ Rα, where Rα is a region of the n-dimensional space called the critical or rejection region. The critical region Rα is determined so that the probability of a Type I error is α:

P[ (X1, X2, …, Xn) ∈ Rα | H0 ] = α.

Definition 11. We call the test defined through Rα the most powerful test at the significance level α for testing H0: θ = θ′ against the alternative H1: θ = θ′′ if no other test of this problem at the same significance level has larger power.

The Neyman-Pearson lemma provides us with a way of finding most powerful tests for the above problem. It demonstrates that the likelihood ratio test is the most powerful one for this problem. To avoid the distracting technicalities of the non-continuous case, we formulate and prove it for continuous distributions.

Lemma 7 (The Neyman-Pearson lemma). Let Rα be the subset of the sample space defined by

Rα = { x : l(θ′|x)/l(θ′′|x) ≤ k },

where k is uniquely determined from the equality α = P[X ∈ Rα | H0]. Then Rα defines the most powerful test at the significance level α for testing the simple hypothesis H0: θ = θ′ against the simple alternative hypothesis H1: θ = θ′′.

Proof. For any region R of the n-dimensional space, we will denote the probability that X ∈ R, when θ is the true value of the parameter, by ∫_R l(θ). The full notation, omitted to save space, would be

P(X ∈ R | θ) = ∫ … ∫_R l(θ | x1, …, xn) dx1 … dxn.

We need to prove that if A is another critical region of size α, then the power of the test associated with Rα is at least as great as the power of the test associated with A, or, in the present notation, that

∫_A l(θ′′) ≤ ∫_{Rα} l(θ′′).     (6.1)

By the definition of Rα we have, writing A′ and Rα′ for the complements of A and Rα,

∫_{A′ ∩ Rα} l(θ′′) ≥ (1/k) ∫_{A′ ∩ Rα} l(θ′).     (6.2)

On the other hand,

∫_{A ∩ Rα′} l(θ′′) ≤ (1/k) ∫_{A ∩ Rα′} l(θ′).     (6.3)

We now establish (6.1), thereby completing the proof:

∫_A l(θ′′) = ∫_{A ∩ Rα} l(θ′′) + ∫_{A ∩ Rα′} l(θ′′)
 = ∫_{Rα} l(θ′′) − ∫_{A′ ∩ Rα} l(θ′′) + ∫_{A ∩ Rα′} l(θ′′)
 ≤ ∫_{Rα} l(θ′′) − (1/k) ∫_{A′ ∩ Rα} l(θ′) + (1/k) ∫_{A ∩ Rα′} l(θ′)     ( see (6.2), (6.3) )
 = ∫_{Rα} l(θ′′) − (1/k) [ ∫_{Rα} l(θ′) − ∫_{A ∩ Rα} l(θ′) ] + (1/k) [ ∫_A l(θ′) − ∫_{A ∩ Rα} l(θ′) ]
 = ∫_{Rα} l(θ′′) − (1/k) ∫_{Rα} l(θ′) + (1/k) ∫_A l(θ′)
 = ∫_{Rα} l(θ′′) − α/k + α/k
 = ∫_{Rα} l(θ′′),

since both Rα and A have size α.

Example 27. Suppose X1, …, Xn are iid N(θ, 1) and we want to test H0: θ = θ′ versus H1: θ = θ′′, where θ′′ > θ′. According to the Z-test, we should reject H0 if Z = √n (X̄ − θ′) is large, or equivalently if X̄ is large. We can now use the Neyman-Pearson lemma to show that the Z-test is "best". The likelihood function is

L(θ) = (2π)^{−n/2} exp{ − Σ_{i=1}^n (xi − θ)²/2 }.

According to the Neyman-Pearson lemma, a best critical region is given by the set of (x1, …, xn) such that L(θ′)/L(θ′′) ≤ k1, or equivalently such that (1/n) ln[ L(θ′′)/L(θ′) ] ≥ k2. But

(1/n) ln[ L(θ′′)/L(θ′) ] = (1/n) Σ_{i=1}^n [ (xi − θ′)²/2 − (xi − θ′′)²/2 ]
 = (1/(2n)) Σ_{i=1}^n [ (xi² − 2θ′xi + θ′²) − (xi² − 2θ′′xi + θ′′²) ]
 = (1/(2n)) Σ_{i=1}^n [ 2(θ′′ − θ′)xi + θ′² − θ′′² ]
 = (θ′′ − θ′) x̄ + (θ′² − θ′′²)/2.

So the best test rejects H0 when x̄ ≥ k, where k is a constant. But this is exactly the form of the rejection region for the Z-test. Therefore, the Z-test is "best". □
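This conclusion can also be checked numerically. The following is a minimal simulation sketch (the values θ′ = 0, θ′′ = 0.5, n = 25, α = 0.05 and the number of replications are arbitrary illustrative choices) verifying that the rule x̄ ≥ z_{1−α}/√n has the nominal level and the power predicted by the normal calculation.

    set.seed(1)
    n <- 25; alpha <- 0.05
    theta1 <- 0; theta2 <- 0.5                  # illustrative values of theta' and theta''
    crit <- qnorm(1 - alpha) / sqrt(n)          # reject H0 when xbar >= crit

    xbar.H0 <- replicate(10000, mean(rnorm(n, theta1, 1)))
    xbar.H1 <- replicate(10000, mean(rnorm(n, theta2, 1)))
    mean(xbar.H0 >= crit)                       # close to the nominal level alpha
    mean(xbar.H1 >= crit)                       # simulated power at theta''
    1 - pnorm(qnorm(1 - alpha) - theta2 * sqrt(n))   # exact power, for comparison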
Exercise 39. A random sample of n flowers is taken from a colony and the numbers X, Y and Z of the three genotypes AA, Aa and aa are observed, where X + Y + Z = n. Under the hypothesis of random cross-fertilisation, each flower has probabilities θ², 2θ(1 − θ) and (1 − θ)² of belonging to the respective genotypes, where 0 < θ < 1 is an unknown parameter.

1. Show that the MLE of θ is θ̂ = (2X + Y)/(2n).
2. Consider the test statistic T = 2X + Y. Given that T has a binomial distribution with parameters 2n and θ, obtain a critical region of approximate size α based on T for testing the null hypothesis that θ = θ0 against the alternative that θ = θ1, where θ1 < θ0 and 0 < α < 1.
3. Show that the above test is the most powerful of size α.
4. Deduce approximately how large n must be to ensure that the power is at least 0.9 when α = 0.05, θ0 = 0.4 and θ1 = 0.3.

Definition 12. Consider a general testing problem H0: θ ∈ Θ0 vs. H1: θ ∈ Θ1. A test at significance level α is called uniformly most powerful if, at each θ ∈ Θ1, its power is at least as large as the power of any other test of the same problem at the same significance level.

It is easy to note that if the test (rejection region) derived from the Neyman-Pearson lemma does not depend on θ′′ ∈ Θ1, then it is uniformly most powerful for the problem H0: θ = θ′ vs. H1: θ ∈ Θ1.

Exercise 40. Let X1, X2, …, Xn be a random sample from the Weibull distribution with probability density function f(x) = θλ x^{λ−1} exp(−θ x^λ) for x > 0, where θ > 0 is unknown and λ > 0 is known.

1. Find the form of the most powerful test of the null hypothesis that θ = θ0 against the alternative hypothesis that θ = θ1, where θ0 > θ1.
2. Find the distribution function of X^λ and deduce that this random variable has an exponential distribution.
3. Find the critical region of the most powerful test at the 1% level when n = 50, θ0 = 0.05 and θ1 = 0.025. Evaluate the power of this test.
4. Explain what is meant by the power of a test and describe how the power may be used to determine the most appropriate size of a sample. Use this approach in the situation described in the previous item to determine the minimal sample size for a test that would have chances of either kind of error smaller than 1%.

Exercise 41. In a particular set of Bernoulli trials, it is widely believed that the probability of a success is θ = 3/4. However, an alternative view is that θ = 2/3. In order to test H0: θ = 3/4 against H1: θ = 2/3, n independent trials are to be observed. Let θ̂ denote the proportion of successes in these trials.

1. Show that the likelihood ratio approach leads to a size α test in which H0 is rejected in favour of H1 when θ̂ < k for some suitable k.
2. By applying the central limit theorem, write down the large-sample distributions of θ̂ when H0 is true and when H1 is true.
3. Hence find an expression for k in terms of n when α = 0.05.
4. Find n so that this test has power 0.95.

6.5 Goodness of Fit Tests

Suppose that we have a random experiment with a random variable Y of interest. Assume additionally that Y is discrete with density function f on a finite set S. We repeat the experiment n times to generate a random sample Y1, Y2, …, Yn from the distribution of Y. These are independent variables, each with the distribution of Y. In this section, we assume that the distribution of Y is unknown.
For a given probability mass function f0, we will test the hypotheses H0: f = f0 versus H1: f ≠ f0. The test that we will construct is known as the goodness of fit test for the conjectured density f0. As usual, our challenge in developing the test is to find an appropriate test statistic: one that gives us information about the hypotheses and whose distribution, under the null hypothesis, is known, at least approximately.

Suppose that S = {y1, y2, …, yk}. To simplify the notation, let pj = f0(yj) for j = 1, 2, …, k. Now let Nj = #{ i ∈ {1, 2, …, n} : Yi = yj } for j = 1, 2, …, k. Under the null hypothesis, (N1, N2, …, Nk) has the multinomial distribution with parameters n and p1, p2, …, pk, with E(Nj) = n pj and Var(Nj) = n pj (1 − pj). This result indicates how we might begin to construct our test: for each j we can compare the observed frequency of yj (namely Nj) with the expected frequency of yj (namely n pj) under the null hypothesis. Specifically, our test statistic will be

X² = (N1 − np1)²/(np1) + (N2 − np2)²/(np2) + ⋯ + (Nk − npk)²/(npk).

Note that the test statistic is based on the squared errors (the differences between the expected frequencies and the observed frequencies). The reason that the squared errors are scaled as they are is the following crucial fact, which we will accept without proof: under the null hypothesis, as n increases to infinity, the distribution of X² converges to the chi-square distribution with k − 1 degrees of freedom.

For m > 0 and r in (0, 1), we will let χ²_{m,r} denote the quantile of order r of the chi-square distribution with m degrees of freedom. Then the following test has approximate significance level α: reject H0: f = f0 versus H1: f ≠ f0 if and only if X² > χ²_{k−1, 1−α}. The test is an approximate one and works best when n is large. Just how large n needs to be depends on the pj. One popular rule of thumb proposes that the test will work well if all the expected frequencies satisfy npj ≥ 1 and at least 80% of the expected frequencies satisfy npj ≥ 5.

Example 28 (Genetical inheritance). In crosses between two types of maize four distinct types of plants were found in the second generation. In a sample of 1301 plants there were 773 green, 231 golden, 238 green-striped and 59 golden-green-striped. According to a simple theory of genetical inheritance the probabilities of obtaining these four types of plants are 9/16, 3/16, 3/16 and 1/16 respectively. Is the theory acceptable as a model for this experiment? Formally, we will consider the hypotheses:

H0: p1 = 9/16, p2 = 3/16, p3 = 3/16 and p4 = 1/16;
H1: not all the above probabilities are correct.

The expected frequency for each type of plant under H0 is npi = 1301 pi. We therefore calculate the following table:

Observed counts Oi    Expected counts Ei    Contributions (Oi − Ei)²/Ei
773                   731.8125              2.318
231                   243.9375              0.686
238                   243.9375              0.145
59                    81.3125               6.123

which gives X² = 9.272. Since X² embodies the differences between the observed and expected values, we can say that if X² is large there is a big difference between what we observe and what we expect, so the theory does not seem to be supported by the observations. If X² is small, the observations apparently conform to the theory and act as support for it. The test statistic X² is approximately distributed as χ² with 3 degrees of freedom. In order to define what we would consider to be an unusually large value of X², we will choose a significance level of α = 0.05. The R command qchisq(p=0.05,df=3,lower.tail=FALSE) calculates the 5% critical value for the test as 7.815. Since our value of X² is greater than the critical value 7.815, we reject H0 and conclude that the theory is not a good model for these data. The R command pchisq(q=9.272,df=3,lower.tail=FALSE) calculates the p-value of the test, equal to 0.026. (These data are examined further in Chapter 9 of Snedecor and Cochran.) □
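The same calculation is available through R's built-in chi-squared goodness of fit test:

    obs   <- c(773, 231, 238, 59)
    probs <- c(9, 3, 3, 1) / 16
    chisq.test(obs, p = probs)   # should reproduce X² = 9.272, df = 3 and the p-value 0.026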
Very often we do not have a list of probabilities to specify our hypothesis as we had in the above example. Rather, our hypothesis relates to the probability distribution of the counts without necessarily specifying the parameters of the distribution. For instance, we might want to test that the number of male babies born on successive days in a maternity hospital follows a binomial distribution, without specifying the probability that any given baby will be male. Or we might want to test that the number of defective items in large consignments of spare parts for cars follows a Poisson distribution, again without specifying the parameter of the distribution. The X² test is applicable when the probabilities depend on unknown parameters, provided that the unknown parameters are replaced by their maximum likelihood estimates and provided that one degree of freedom is deducted for each parameter estimated.

Example 29. Feller reports an analysis of flying-bomb hits in the south of London during World War II. Investigators partitioned the area into 576 sectors, each of 1/4 km². The following table gives the resulting data:

No. of hits (x)              0    1    2   3   4  5
No. of sectors with x hits   229  211  93  35  7  1

If the hit pattern is random, in the sense that the probability that a bomb will land in any particular sector is constant irrespective of the landing places of previous bombs, a Poisson distribution might be expected to model the data.

x   P(x) = θ̂^x e^{−θ̂}/x!   Expected 576 × P(x)   Observed   (Oi − Ei)²/Ei
0   0.395                    227.53                229        0.0095
1   0.367                    211.34                211        0.0005
2   0.170                    98.15                 93         0.2702
3   0.053                    30.39                 35         0.6993
4   0.012                    7.06                  7          0.0005
5   0.002                    1.31                  1          0.0734

which gives X² = 1.0534. The MLE of θ was calculated as θ̂ = 535/576 = 0.9288, that is, the total number of observed hits divided by the number of sectors. We carry out the chi-squared test as before, except that we now subtract one additional degree of freedom because we had to estimate θ. The test statistic X² is therefore approximately distributed as χ² with 4 degrees of freedom. The R command qchisq(p=0.05,df=4,lower.tail=FALSE) calculates the 5% critical value for the test as 9.488. Alternatively, the R command pchisq(q=1.0534,df=4,lower.tail=FALSE) calculates the p-value of the test, equal to 0.90. The result of the chi-squared test is not statistically significant, indicating that the divergence between the observed and expected counts can be regarded as random fluctuation about the expected values. Feller comments: "It is interesting to note that most people believed in a tendency of the points of impact to cluster. If this were true, there would be a higher frequency of sectors with either many hits or no hits and a deficiency in the intermediate classes. The above table indicates perfect randomness and homogeneity of the area; we have here an instructive illustration of the established fact that to the untrained eye randomness appears as regularity or tendency to cluster." □
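A sketch of this fit in R (the degrees of freedom are adjusted by hand, since chisq.test does not know that θ was estimated from the data):

    sectors <- c(229, 211, 93, 35, 7, 1)            # numbers of sectors with 0,...,5 hits
    hits    <- 0:5
    theta   <- sum(hits * sectors) / sum(sectors)   # MLE: total hits / number of sectors
    expected <- sum(sectors) * dpois(hits, lambda = theta)
    X2 <- sum((sectors - expected)^2 / expected)
    X2                                              # close to the value 1.0534 above
    pchisq(X2, df = length(sectors) - 2, lower.tail = FALSE)   # one extra df lost for theta-hat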
6.6 The χ² Test for Contingency Tables

Let X and Y be a pair of categorical variables and suppose there are r possible values for X and c possible values for Y. Examples of categorical variables are Religion, Race, Social Class, Blood Group, Wind Direction, Fertiliser Type, etc. The random variables X and Y are said to be independent if

P[X = a, Y = b] = P[X = a] P[Y = b]

for all possible values a of X and b of Y. In this section we consider how to test the null hypothesis of independence using data consisting of a random sample of N observations from the joint distribution of X and Y.

Example 30. A study was carried out to investigate whether hair colour (columns) and eye colour (rows) were genetically linked. A genetic link would be supported if the proportions of people having various eye colourings varied from one hair colour grouping to another. 1000 people were chosen at random and their hair colour and eye colour recorded. The data are summarised in the following table:

Oij     Black   Brown   Fair   Red   Total
Brown   60      110     42     30    242
Green   67      142     28     35    272
Blue    123     248     90     25    486
Total   250     500     160    90    1000

The proportion of people with red hair is 90/1000 = 0.09 and the proportion having blue eyes is 486/1000 = 0.486. So if eye colour and hair colour were truly independent, we would expect the proportion of people having both red hair and blue eyes to be approximately equal to (0.090)(0.486) = 0.04374, or equivalently we would expect the number of people having both red hair and blue eyes to be close to (1000)(0.04374) = 43.74; the observed number is 25. We can do similar calculations for all other combinations of hair colour and eye colour to derive the following table of expected counts:

Eij     Black   Brown   Fair     Red     Total
Brown   60.5    121     38.72    21.78   242
Green   68.0    136     43.52    24.48   272
Blue    121.5   243     77.76    43.74   486
Total   250.0   500     160.00   90.00   1000

In order to test the null hypothesis of independence we need a test statistic which measures the magnitude of the discrepancy between the observed table and the table that would be expected if independence were in fact true. In the early part of the twentieth century, long before the invention of maximum likelihood or the formal theory of hypothesis testing, Karl Pearson (one of the founding fathers of Statistics) proposed the following method of constructing such a measure of discrepancy: for each cell in the table calculate (Oij − Eij)²/Eij, where Oij is the observed count and Eij is the expected count, and add the resulting values across all cells of the table. The resulting total is called the χ² test statistic, which we will denote by W. The null hypothesis of independence is rejected if the observed value of W is surprisingly large. In the hair and eye colour example the discrepancies are as follows:

(Oij − Eij)²/Eij   Black   Brown   Fair    Red
Brown              0.004   1.000   0.278   3.102
Green              0.015   0.265   5.535   4.521
Blue               0.019   0.103   1.927   8.029

so that

W = Σ_{i=1}^r Σ_{j=1}^c (Oij − Eij)²/Eij = 24.796.

What we would now like to calculate is the p-value, which is the probability of getting a value for W as large as 24.796 if the hypothesis of independence were in fact true. Fisher showed that, when the hypothesis of independence is true, W behaves approximately like a χ² random variable with degrees of freedom given by (r − 1)(c − 1), where r is the number of rows in the table and c is the number of columns. In our example r = 3, c = 4, so (r − 1)(c − 1) = 6, and the p-value is P[W ≥ 24.796] ≈ P[χ²6 ≥ 24.796] = 0.0004. Hence we reject the independence hypothesis. □
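In R the whole calculation is available through chisq.test applied to the matrix of observed counts:

    O <- matrix(c( 60, 110, 42, 30,
                   67, 142, 28, 35,
                  123, 248, 90, 25),
                nrow = 3, byrow = TRUE,
                dimnames = list(eye  = c("Brown", "Green", "Blue"),
                                hair = c("Black", "Brown", "Fair", "Red")))
    chisq.test(O)            # reports W on (r-1)(c-1) = 6 degrees of freedom
    chisq.test(O)$expected   # the table of expected counts Eij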
Exercise 42. It is believed that the number of breakages in a damaged chromosome, X, follows a truncated Poisson distribution with probability mass function

P(X = k) = e^{−λ} λ^k / ( (1 − e^{−λ}) k! ),   k = 1, 2, …,

where λ > 0 is an unknown parameter. The frequency distribution of the number of breakages in a random sample of 33 damaged chromosomes was as follows:

Breakages     1   2  3  4  5  6  7  8  9  10  11  12  13  Total
Chromosomes   11  6  4  5  0  1  0  2  1  0   1   1   1   33

1. Find an equation satisfied by λ̂, the MLE of λ.
2. Discuss approximations of λ̂. Show that the observed data give the estimate λ̂ = 3.6.
3. Using this value of λ̂, test the null hypothesis that the number of breakages in a damaged chromosome follows a truncated Poisson distribution. The categories 6 to 13 should be combined into a single category in the goodness-of-fit test.