University of Sheffield
School of Mathematics and Statistics
Introduction to Probability and Statistics: Semester 2
MAS113, Spring 2016

Chapter 5: Sums of Random Variables

In this chapter we continue the development of probability theory, which we started in Semester 1. We will work towards an understanding of the central limit theorem, which will be vital for our later work on statistics.

5.1 Sums of Independent and Identically Distributed Random Variables

Suppose we have n random variables X_1, X_2, ..., X_n. If these each have the same probability distribution, i.e.

    P(X_1 ≤ a) = P(X_2 ≤ a) = ... = P(X_n ≤ a),

then we say that they are identically distributed. We are particularly interested in the case where X_1, X_2, ..., X_n are not only identically distributed, but also independent, so that for all 1 ≤ i, j ≤ n with i ≠ j,

    P{(X_i ≤ a) ∩ (X_j ≤ b)} = P(X_i ≤ a) P(X_j ≤ b).    (5.1)

We then say that X_1, X_2, ..., X_n are independent and identically distributed, or i.i.d. for short. We can, if we like, regard X_1, X_2, ..., X_n as independent copies of some given random variable. I.i.d. random variables are very important in applications, as they describe repeated experiments that are carried out under identical conditions, in which the outcome of each experiment does not affect the others.

Now define S(n) to be the sum and X̄(n) to be the mean:

    S(n) = Σ_{i=1}^n X_i,    (5.2)

    X̄(n) = S(n)/n.    (5.3)

Both S(n) and X̄(n) are also random variables, as they are functions of the random variables X_1, ..., X_n. If we write E(X_i) = µ and Var(X_i) = σ² (for all i, as the variables have the same distribution), it is straightforward to derive the mean and variance of S(n) and X̄(n) in terms of µ and σ². Firstly,

    E(S(n)) = E(X_1 + X_2 + ... + X_n)
            = E(X_1) + E(X_2) + ... + E(X_n)
            = µ + µ + ... + µ
            = nµ.
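The computation E(S(n)) = nµ can be checked exactly in code for a small case. A minimal Python sketch (the choice of three rolls of a fair die is a hypothetical example, not from the notes; here µ = 7/2, so E(S(3)) should be 21/2):

```python
# Exact check of E(S(n)) = n*mu for a toy example: n = 3 rolls of a fair
# die, X_i uniform on {1,...,6}, so mu = E(X_i) = 7/2.
from itertools import product
from fractions import Fraction

faces = range(1, 7)
mu = Fraction(sum(faces), 6)                       # E(X_i) = 7/2

# Enumerate all 6^3 equally likely outcomes and average the sum S(3).
outcomes = list(product(faces, repeat=3))
E_S3 = Fraction(sum(sum(o) for o in outcomes), len(outcomes))

print(E_S3, 3 * mu)   # both are 21/2
```

Exact rational arithmetic (Fraction) is used so the identity holds exactly rather than up to floating-point error.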
Also, since Var(X + Y) = Var(X) + Var(Y) if X and Y are independent (without independence, this is not true in general), we also have

    Var(S(n)) = Var(X_1 + X_2 + ... + X_n)
              = Var(X_1) + Var(X_2) + ... + Var(X_n)
              = σ² + σ² + ... + σ²
              = nσ².

We then have

    E(X̄(n)) = E(S(n)/n) = (1/n) E(S(n)) = (1/n) × nµ = µ,    (5.4)

and also

    Var(X̄(n)) = Var(S(n)/n) = (1/n²) Var(S(n)) = (1/n²) × nσ² = σ²/n.    (5.5)

The standard deviation of X̄(n) plays an important role. It is called the standard error, and we denote it by SE(X̄(n)), so that

    SE(X̄(n)) = σ/√n.    (5.6)

These results have important applications in statistics. Suppose we are able to observe i.i.d. random variables X_1, ..., X_n, but we don't know the value of E(X_i) = µ. Equation (5.5) tells us that as n increases, the variance of X̄(n) gets smaller, and the smaller the variance is, the closer we expect X̄(n) to be to its mean value. Equation (5.4) tells us that the mean value of X̄(n) is µ (for any value of n). In other words, as n gets larger, we expect X̄(n) to be increasingly close to the unknown quantity µ, so we can use the observed value of X̄(n) to estimate µ.

We illustrate this in Figure 5.1. The four plots show the density functions of X̄(n) for n = 1, 10, 20 and 100. In each case X_1, ..., X_n ~ N(0, 1), so E(X_i) = µ = 0. We can see that the density function of X̄(n) becomes more tightly concentrated about the value 0 as we increase n, and that the observed value of X̄(n) is increasingly likely to be close to 0.

[Figure 5.1: The density function of X̄(n), when X_i ~ N(0, 1), for four choices of n (n = 1, 10, 20, 100).]

If we didn't know that µ = E(X_i) = 0, but we could observe X̄(n), then using the observed value of X̄(n) for large n is likely to give us a good estimate of µ, as X̄(n) is likely to be close to µ.
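Equation (5.5) can likewise be verified exactly by enumeration. A Python sketch with n = 2 fair dice (an illustrative choice, not from the notes; for one die, σ² = 91/6 − (7/2)² = 35/12):

```python
# Exact check of Var(Xbar(n)) = sigma^2 / n  (equation 5.5) for n = 2 fair
# dice, X_i uniform on {1,...,6}.
from itertools import product
from fractions import Fraction

faces = range(1, 7)
mu = Fraction(7, 2)
sigma2 = Fraction(sum(f * f for f in faces), 6) - mu ** 2   # 35/12

# All 36 equally likely outcomes of (X_1, X_2); compute the mean of each.
outcomes = list(product(faces, repeat=2))
means = [Fraction(a + b, 2) for a, b in outcomes]

E_xbar = sum(means) / len(means)
Var_xbar = sum((m - E_xbar) ** 2 for m in means) / len(means)

print(Var_xbar, sigma2 / 2)   # both are 35/24
```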
Here are two key examples of sums of i.i.d. random variables:

• If X_i ~ Bernoulli(p) for 1 ≤ i ≤ n, then S(n) ~ Bin(n, p).
• If X_i ~ N(µ, σ²) for 1 ≤ i ≤ n, then S(n) ~ N(nµ, nσ²).

We will prove these two facts later on, using moment generating functions.

5.2 Laws of Large Numbers

We will now derive an important result regarding the behaviour of X̄(n) for large n. We first prove a useful inequality. It is true whether X is discrete or continuous; we'll just prove the continuous case.

5.2.1 Chebyshev's inequality

Let X be a random variable for which E(X) = µ and Var(X) = σ². Then for any c > 0,

    P(|X − µ| ≥ c) ≤ σ²/c².    (5.7)

Proof. Let A = {x ∈ ℝ : |x − µ| ≥ c}. From the definition of variance, and using the fact that ℝ = A ∪ Ā,

    σ² = ∫_{−∞}^{∞} (x − µ)² f_X(x) dx
       = ∫_A (x − µ)² f_X(x) dx + ∫_Ā (x − µ)² f_X(x) dx
       ≥ ∫_A (x − µ)² f_X(x) dx
       ≥ c² ∫_A f_X(x) dx
       = c² P(X ∈ A)
       = c² P(|X − µ| ≥ c),

and the result follows.

The same inequality holds if P(|X − µ| ≥ c) is replaced by P(|X − µ| > c) in (5.7), since P(|X − µ| > c) ≤ P(|X − µ| ≥ c).

What does Chebyshev's inequality tell us? We expect the probability of finding the value of a random variable far from the mean to be smaller, the further from the mean we go; but a large variance may counteract this a little, as it tells us that the values which have high probability are more spread out. Chebyshev's inequality makes this precise.

Example. A random variable X has mean 1 and variance 0.5. What can you say about P(X > 6)?

Solution.

    P(X > 6) = P(X − 1 > 5)
             ≤ P(|X − 1| > 5)
             ≤ 0.5/25 = 1/50,   by Chebyshev's inequality.

5.2.2 The Weak Law of Large Numbers

Let X_1, X_2, ... be a sequence of i.i.d. random variables, each with mean µ and variance σ². Then for all ε > 0,

    P(|X̄(n) − µ| > ε) ≤ σ²/(ε²n).    (5.8)

This follows from Chebyshev's inequality, and the fact that E(X̄(n)) = µ and Var(X̄(n)) = σ²/n. From (5.8), we have

    lim_{n→∞} P(|X̄(n) − µ| > ε) = 0.    (5.9)

We can choose ε to be as small as we like, so as n increases, it becomes increasingly unlikely that X̄(n) will differ from µ by any amount ε. Equation (5.9) is known as the weak law of large numbers. It is possible to prove a stronger result, which is that

    P(lim_{n→∞} X̄(n) = µ) = 1,    (5.10)

but the proof is outside the scope of this module. This result is known as the strong law of large numbers.

5.3 Moment Generating Functions

Let X be a random variable. Two important associated numerical quantities are the expected values E(X) and E(X²), from which we can compute the variance, using Var(X) = E(X²) − E(X)². More generally we might try to find the nth moment E(X^n) for n ∈ ℕ. For example, E(X³) and E(X⁴) give information about the shape of the distribution and are used to calculate quantities called the skewness and kurtosis. We can try to calculate moments directly, but there is also a useful shortcut. Define the moment generating function (or mgf) for all t ∈ ℝ by

    M_X(t) = E(e^{tX}).    (5.11)

So

    M_X(t) = Σ_{i=1}^N e^{t x_i} p_X(x_i),   if X is discrete, and

    M_X(t) = ∫_{−∞}^{∞} e^{tx} f_X(x) dx,   if X is continuous.

e.g. If X ~ Bernoulli(p), it takes only two values: 1 with probability p, and 0 with probability 1 − p, and so

    M_X(t) = p e^{t·1} + (1 − p) e^{t·0} = p e^t + 1 − p.

Now differentiate (5.11) to get

    (d/dt) M_X(t) = E(X e^{tX}),

and so

    (d/dt) M_X(t) |_{t=0} = E(X).

Similarly

    (d²/dt²) M_X(t) = E(X² e^{tX}),   and hence   (d²/dt²) M_X(t) |_{t=0} = E(X²).

In fact you can find all the moments by this procedure:

    (d^n/dt^n) M_X(t) |_{t=0} = E(X^n).

You can explore this result for some known distributions in the exercises. Another way of seeing how the mgf of X contains information about all the moments comes from writing the series expansion for the exponential function,

    e^{tX} = Σ_{n=0}^∞ (t^n/n!) X^n.

It then follows from (5.11) that

    M_X(t) = Σ_{n=0}^∞ (t^n/n!) E(X^n).    (5.12)
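The claim that derivatives of the mgf at t = 0 recover the moments can be sanity-checked numerically. A minimal Python sketch using finite differences on the Bernoulli mgf (the parameter p = 0.3 and step h are illustrative choices, not from the notes; for Bernoulli(p), E(X) = E(X²) = p):

```python
# Numerical check that M'(0) = E(X) and M''(0) = E(X^2) for X ~ Bernoulli(p),
# whose mgf is M(t) = p*e^t + 1 - p.  Central differences with step h.
import math

p, h = 0.3, 1e-4   # illustrative values

def M(t):
    return p * math.exp(t) + 1 - p

d1 = (M(h) - M(-h)) / (2 * h)             # approximates M'(0)  = E(X)   = p
d2 = (M(h) - 2 * M(0) + M(-h)) / h**2     # approximates M''(0) = E(X^2) = p

print(round(d1, 6), round(d2, 6))   # both approximately 0.3
```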
Our main purpose in introducing mgfs at this stage is not to use them to calculate moments, but to give a rough idea of the proof of the central limit theorem. The next result is very useful for that.

Theorem 1. If X and Y are independent, then for all t ∈ ℝ,

    M_{X+Y}(t) = M_X(t) M_Y(t).

Proof.

    M_{X+Y}(t) = E(e^{t(X+Y)})
               = E(e^{tX} e^{tY})
               = E(e^{tX}) E(e^{tY})   (by independence)
               = M_X(t) M_Y(t).

It follows (by using mathematical induction) that if S(n) = X_1 + X_2 + ... + X_n is a sum of i.i.d. random variables having common mgf M_X, then

    M_{S(n)}(t) = M_X(t)^n.    (5.13)

Here is the key example that we need.

Example. If X ~ N(µ, σ²), then

    M_X(t) = e^{µt + σ²t²/2}.    (5.14)

To establish this we use the definition of M_X(t) and the formula for the normal density:

    M_X(t) = (1/(σ√(2π))) ∫_{−∞}^{∞} e^{xt} exp{−(1/2)((x − µ)/σ)²} dx.

Substitute z = (x − µ)/σ to obtain

    M_X(t) = (1/√(2π)) e^{µt} ∫_{−∞}^{∞} e^{σtz} exp(−z²/2) dz
           = (1/√(2π)) exp(µt + σ²t²/2) ∫_{−∞}^{∞} exp{−(z − σt)²/2} dz,

after completing the square. Now substitute y = z − σt and use the fact that (1/√(2π)) ∫_{−∞}^{∞} e^{−x²/2} dx = 1 to find

    M_X(t) = exp(µt + σ²t²/2).

An important special case is when X is the standard normal Z ~ N(0, 1):

    M_Z(t) = e^{t²/2}.    (5.15)

In fact it can be shown that the mgf uniquely determines the distribution of a random variable, so X ~ N(µ, σ²) is the only random variable with the mgf (5.14). Using this fact we can show that the sum of n i.i.d. normal random variables is itself normal. From (5.13) and (5.14), we have

    M_{S(n)}(t) = [e^{µt + σ²t²/2}]^n = e^{nµt + nσ²t²/2},

so S(n) ~ N(nµ, nσ²), and X̄(n) ~ N(µ, σ²/n).

5.4 The Central Limit Theorem (CLT)

We finish with another important result, which tells us about the distribution of X̄(n) for large n. It could be argued that this is the most important result in the whole of probability (and statistics). The law of large numbers tells us that X̄(n) tends to µ as n → ∞.
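A quick simulation illustrates this convergence. The notes' simulations use R; the following Python sketch (seed and sample sizes are arbitrary illustrative choices) shows running means of Uniform(0, 1) draws settling near µ = 1/2:

```python
# Law of large numbers, numerically: the running mean of i.i.d. Uniform(0,1)
# draws (mu = 0.5) gets closer to mu as n grows.  Seeded for reproducibility.
import random

random.seed(42)
xs = [random.random() for _ in range(100_000)]

for n in (10, 1_000, 100_000):
    xbar = sum(xs[:n]) / n
    print(n, round(xbar, 4))   # approaches 0.5 as n increases
```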
But the central limit theorem gives far more information: it tells us about the behaviour of the distribution of the fluctuations of X̄(n) around µ, as n → ∞. These are always normally distributed!

The central limit theorem. Let X_1, X_2, ... be a sequence of i.i.d. random variables, each with mean µ and variance σ². For any −∞ ≤ a < b ≤ ∞,

    lim_{n→∞} P(a ≤ (X̄(n) − µ)/(σ/√n) ≤ b) = (1/√(2π)) ∫_a^b exp(−z²/2) dz.    (5.16)

In other words, the distribution of X̄(n) tends to a normal distribution with mean µ and variance σ²/n, as n → ∞. So for large n, we have, approximately,

    X̄(n) ~ N(µ, σ²/n),    S(n) ~ N(nµ, nσ²).

Notice that the right hand side of (5.16) is P(a ≤ Z ≤ b), where Z ~ N(0, 1) is the standard normal. We can also rewrite the left hand side in terms of S(n) by multiplying top and bottom by n. Then we get another equivalent form of the central limit theorem (CLT), which is often seen in books:

    lim_{n→∞} P(a ≤ (S(n) − nµ)/(σ√n) ≤ b) = P(a ≤ Z ≤ b).    (5.17)

As long as X_1, X_2, ... have the same distribution (and are independent), (5.17) is valid; it doesn't matter what that distribution is. It can be discrete or continuous – uniform, Poisson, Bernoulli, exponential, etc.

Proof (outline only). We will aim to establish (5.17). Define

    Y(n) = (S(n) − nµ)/(σ√n).

Then E(Y(n)) = 0 and Var(Y(n)) = 1. (You check this.) Now consider i.i.d. random variables defined by

    T_j = X_j − µ   for j = 1, 2, ...

Clearly, we have

    E(T_j) = 0   and   E(T_j²) = Var(T_j) = σ²;    (5.18)

    Y(n) = (1/(σ√n))(T_1 + T_2 + ... + T_n).

By (5.13),

    M_{Y(n)}(t) = M(t/(σ√n))^n   for all t ∈ ℝ,

where M on the right-hand side denotes the common moment generating function of the T_j's. Then expanding (5.12) and using (5.18),

    M(t/(σ√n)) = 1 + (t/(σ√n)) × 0 + (1/2)(t²/(σ²n)) σ² + α(n)/n
               = 1 + t²/(2n) + α(n)/n,

where α(n) is a remainder term that satisfies lim_{n→∞} α(n) = 0.
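The limit that drives the final step of the argument can be seen numerically. A Python sketch of (1 + t²/(2n))^n approaching e^{t²/2}, with t = 1 as an arbitrary illustrative value:

```python
# Numerical look at the limit (1 + t^2/(2n))^n -> exp(t^2/2) as n -> infinity,
# which is the key step in the outline proof of the CLT.
import math

t = 1.0                          # arbitrary illustrative value
target = math.exp(t**2 / 2)      # e^{1/2} ≈ 1.6487

for n in (10, 100, 10_000):
    approx = (1 + t**2 / (2 * n)) ** n
    print(n, approx)             # the gap to target shrinks roughly like 1/n
```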
Thus

    M_{Y(n)}(t) = (1 + t²/(2n) + α(n)/n)^n.

From MAS110, we know that e^y = lim_{n→∞} (1 + y/n)^n and, furthermore, since lim_{n→∞} α(n) = 0, it can also be shown that

    e^y = lim_{n→∞} (1 + y/n + α(n)/n)^n.

Hence, using (5.15), we find that

    lim_{n→∞} M_{Y(n)}(t) = exp(t²/2) = M_Z(t),

and the result follows, since the mgf determines the distribution.

As a first application of this result we can investigate the normal approximation to the binomial distribution. If we take X_1, X_2, ... to be Bernoulli random variables with common parameter p, then S(n) is binomial with parameters n and p (see Problem 8). Since E(S(n)) = np and Var(S(n)) = np(1 − p), we get the following:

Corollary 5.4.1 (de Moivre–Laplace central limit theorem). If X_1, X_2, ... are Bernoulli with common parameter p, then

    lim_{n→∞} P(a ≤ (S(n) − np)/√(np(1 − p)) ≤ b) = (1/√(2π)) ∫_a^b exp(−z²/2) dz.    (5.19)

This was historically the first central limit theorem to be established. It is named after Abraham de Moivre (1667–1754), who first used integrals of exp(−z²/2) to approximate probabilities, and Pierre-Simon Laplace (1749–1827), who obtained a formula quite close in spirit to (5.19). Laplace was studying the dynamics of the solar system, and was motivated in these calculations by the need to estimate the probability that the inclination of cometary orbits to a given plane was between certain limits.

Corollary 5.4.1 is sometimes referred to as the normal approximation to the binomial distribution. Because the binomial distribution is discrete, while the normal distribution is continuous, we can get greater accuracy in the CLT by using a continuity correction. If X is binomial and we want to use the CLT to approximate P(m ≤ X ≤ n), where m and n are positive integers, we get greater accuracy by applying the CLT to P(m − 0.5 ≤ X ≤ n + 0.5).

Example. A woman claims to have psychic abilities, in that she can predict the outcome of a coin toss.
If she is tested 100 times, but she is really just guessing, what is the probability that she will be right 60 or more times?

Let X be the number of correct predictions. If the woman is guessing, and the coin is fair, the probability that she is right on any single occasion is 0.5. Then X ~ Bin(100, 0.5). Approximately, using the CLT, X ~ N(50, 100 × 0.5 × (1 − 0.5)), so

    P(X ≥ 60) ≈ 1 − Φ((60 − 50)/5) = 0.02275013,

using the command 1-pnorm(60,50,5) in R. If we apply the continuity correction,

    P(X ≥ 60) ≈ 1 − Φ((59.5 − 50)/5) = 0.02871656,

using the command 1-pnorm(59.5,50,5) in R. (Using the command 1-pbinom(59,100,0.5), we calculate the exact probability to be 0.02844397.)

5.4.1 Example: Sums of Exponential Random Variables

Here we illustrate the CLT by a computer experiment. Let X_1, X_2, ... be a sequence of i.i.d. exponential random variables, each with rate parameter 1. Note that the shape of the Exp(rate = 1) distribution is very 'non-normal' (compare the pdf of an exponential random variable with that of a normal random variable). Now µ = E(X_i) = 1 and σ² = Var(X_i) = 1. For large n, approximately, we should have X̄(n) ~ N(1, 1/n).

We now do a simulation experiment in R to compare this approximation with what we get if we simulate random values of X̄(n) lots of times, for different values of n. The results are shown in Figure 5.2. In the top left plot, we have the case n = 1. The solid line shows the N(1, 1) density function, and the histogram represents the distribution of X̄(1) (which is just a single value of X), based on what we see when we simulate random values of X̄(1) many times. The histogram doesn't match the shape of the density function, which is to be expected, as we know that the density of a single exponential random variable looks nothing like a 'bell-shaped' curve. In the top right plot, we repeat the experiment, but now with n = 10.
Now we can see that the histogram of simulated values of X̄(10) is closer in shape to the N(1, 1/10) density function, so even with small n, the CLT is giving a good approximation for the distribution of X̄(n). Simulations for larger values of n are shown in the remaining two plots.

[Figure 5.2: Testing the CLT approximation for n = 1, 10, 30 and 100. The solid line shows the density function under the CLT approximation; the histogram represents the distribution of simulated values of X̄(n), from a simulation in R.]

5.4.2 Application to Stirling's Approximation

In this subsection we will use the probability theory that we've just studied to gain insight into Stirling's approximation, which states that, as n → ∞,

    n! ∼ √(2π) n^{n + 1/2} e^{−n}.    (5.20)

The precise meaning of (5.20) is that

    lim_{n→∞} n! / (√(2π) n^{n + 1/2} e^{−n}) = 1.

The approximation goes back to Abraham de Moivre (1667–1754), who showed that n! ∼ C n^{n + 1/2} e^{−n}, where C > 0. It was James Stirling (1692–1770) who proved that C = √(2π). Stirling's approximation is very widely used in both pure and applied mathematics. It is even reasonably accurate for small n – for example, 3! = 6 and the approximation gives 5.84; 5! = 120 and you get 118.02 when you use the approximation.

There are many proofs of (5.20) in the literature. Here we will give a "rough proof", based on the work of this section. Let X_1, X_2, ... be a sequence of i.i.d. random variables, each having a Poisson distribution with mean (and variance) λ = 1. Then by Problem 11, S(n) = X_1 + X_2 + ... + X_n also has a Poisson distribution, with mean (and variance) λ = n. It follows that

    P(S(n) = n) = e^{−n} n^n / n!.    (5.21)
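The accuracy claims quoted above (5.84 for 3!, 118.02 for 5!) are easy to reproduce. A small Python check of (5.20):

```python
# Stirling's approximation (5.20): n! ≈ sqrt(2*pi) * n^(n + 1/2) * e^(-n).
# Checking it against the exact factorials for a few small n.
import math

def stirling(n):
    return math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)

for n in (3, 5, 10):
    print(n, math.factorial(n), round(stirling(n), 2))
# The ratio n! / stirling(n) tends to 1 as n grows.
```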
But

    P(S(n) = n) = P(n − 1 < S(n) ≤ n) = P(−1/√n < (S(n) − n)/√n ≤ 0),    (5.22)

and by the central limit theorem (5.17), where we take µ = σ = 1, we find that for large n, P(−1/√n < (S(n) − n)/√n ≤ 0) is well-approximated by

    (1/√(2π)) ∫_{−1/√n}^{0} e^{−x²/2} dx.

Now if n is very large, 1/√n is very small, and so for −1/√n < x ≤ 0 we can make the approximation e^{−x²/2} ≅ 1. (The symbol ≅ means "approximately equal to"; it is far less precise than ∼.) But then

    (1/√(2π)) ∫_{−1/√n}^{0} e^{−x²/2} dx ≅ (1/√(2π)) ∫_{−1/√n}^{0} 1 dx = 1/√(2πn).

When we combine this with (5.22) and (5.21), we conclude that for sufficiently large n,

    e^{−n} n^n / n! ≅ 1/√(2πn),

which after rearrangement gives

    n! ≅ √(2π) n^{n + 1/2} e^{−n}.

Before we leave the central limit theorem, here is a wonderful quote from Sir Francis Galton (1822–1911), one of the founders of statistics:

"I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the 'Law of Frequency of Error' [another name for the CLT]. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshaled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along."

Chapter 6: Introducing Statistics

6.1 Introduction

So far in this module, we have focussed entirely on probability theory – the mathematical theory of chance. It is a distinct branch of mathematics in its own right; from a pure mathematical point of view, it is a special case of a more general subject called measure theory; but probability has many important areas of application, e.g.
to studying the bulk properties of ensembles of particles in physics (statistical mechanics), to modelling random mutations in genetics, and to describing the random behaviour of the stock market in financial economics.

For the rest of this module, we will be studying the application of probability theory to statistics. But first we should ask the question "What is statistics?" At the beginning of Randall Pruim's book "Foundations and Applications of Statistics", five alternative definitions are listed, each given by a different expert. Here's the list:

• a collection of procedures and principles for gaining information in order to make decisions when faced with uncertainty,
• a way of taming uncertainty, of turning raw data into arguments that can resolve profound questions,
• the science of drawing conclusions from data with the aid of the mathematics of probability,
• the explanation of variation in the context of what remains unexplained,
• the mathematics of the collection, organisation, and interpretation of numerical data, especially the analysis of a population's characteristics by inference from sampling.

Although not quite the same, these all capture something important. We have collected some data which tells us something, but not everything, about its source. So there is uncertainty, as we do not have the whole picture. We wish to gain as much information as we can, and probability will be the mathematical tool that we use to help us get it. For example, suppose that a new drug is given to 20 patients with some disease, and 12 recover. Can we deduce that the probability of recovery is 0.6? What if we gave the same drug to another group of 20 patients, and only 5 recovered?

In the next two sections (6.2 and 6.3) we will look at some real-life examples.

6.2 Is smoking related to lung cancer?

It is well known that smoking can increase the risk of lung cancer.
Of course there are other risk factors for this type of cancer (such as family history), but there are now several studies linking smoking to lung cancer. The 1982 USA Surgeon General's Report stated that: "Cigarette smoking is the major single cause of cancer mortality in the United States." Lung cancer accounts for 5% of UK deaths today.

Below is Müller's data (Table 3 of Doll, 1988). The data classifies 86 sampled lung cancer patients and 86 healthy individuals according to the intensity of their smoking, for 5 categories of smokers (from none to extreme):

    Smoking       Lung cancer   Healthy
    Extreme            25            4
    Very heavy         18            5
    Heavy              13           22
    Moderate           27           41
    Non                 3           14
    Total              86           86

The question is whether smoking increases the risk of getting lung cancer. The table indicates clearly that the number of lung cancer sufferers appears to increase with the intensity of smoking, while the pattern in the healthy group runs the opposite way (e.g. 18 very heavy smokers have lung cancer while only 5 are healthy, and the inverse relationship applies for non-smokers). This suggests that there may be a link between smoking and lung cancer.

However, if we are going to trust this data, we need to know that it is representative of the general population, so that by studying this data set we can draw conclusions, not just for the people used in that sample, but for any member of the general public. It is reasonable to think that if we picked another sample of people in a similar study, the results would be different (maybe only slightly, but perhaps more substantially). In other words, there is some uncertainty in our conclusion that smoking causes lung cancer. We need to have a more complex mechanism that gives us information about the level of uncertainty we have in trusting the above study. One option would be to conduct many similar studies and to have some rules to compare them.
This is impractical because in many real-life situations such studies are lengthy and expensive. We should be able to use the same single data set and say how "confident" we are in the conclusions drawn. In addition, we should be able to address the following questions:

• Is this sample of 86 people representative?
• Is 86 a large enough sample to draw "good" conclusions? For comparison, we would surely believe that asking only 5 people would not be enough. What about 15, or 50, or 500?
• How can we "estimate" the effects of this alleged relationship between smoking and cancer? Can we estimate how certain or how uncertain we believe we are?
• Finally, can we design a model that describes this relationship beyond the limitations of this data set? Can we make predictions based on this model?

6.3 Do mobile phones cause brain tumours?

Over the past 20 years, there has been a growing interest in establishing whether mobile phones are linked to brain cancer. Some studies have found a positive link. However, in a recent study (BBC, 21 October 2011), this link is rejected; you can read the related article at http://www.bbc.co.uk/news/health-15387297. The discussion below is based on this article.

The study looked at brain cancer incidence, for a sample of 358,403 mobile users, over a period of 18 years. The results are summarized below:

    no. of mobile users   gliomas (a type of brain cancer)   other cancers
    358,400               356                                846

At first glance we might conclude that the glioma counts are small for a sample of this size. We can calculate the relevant proportions: 0.00099 (gliomas) and 0.00236 (other cancers). We may think that these proportions look small, but for a more conclusive comparison one should know how many brain cancer counts would be expected among non-mobile-users.
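The proportions just quoted follow directly from the counts in the table. A trivial Python check (the notes would do this in R):

```python
# Proportions of gliomas and other cancers among the sampled mobile users,
# computed from the counts given in the study's summary table.
gliomas, other, users = 356, 846, 358_400

p_gliomas = gliomas / users
p_other = other / users

print(round(p_gliomas, 5), round(p_other, 5))   # 0.00099 0.00236
```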
For example, if we knew that among non-mobile-users, 20 gliomas would be expected (for the same sample size of 358,400), we might believe that 356 is a significant number. Furthermore, the following points are important to consider:

• Can we estimate the true proportion of gliomas among mobile users? If we calculate (as above) this proportion as 0.00099 (based on the above data set), can we know whether this is generally true? What if we conduct another study and calculate another proportion? How different is this likely to be? Based on this data set, can we give a measure of how certain or how confident we are about these proportions?
• As mentioned above, we should be able to establish a "gold standard" to which we can compare the cancer counts of our study. In this case, this could be brain cancer proportions, or risk, among non-mobile-users. In statistics it is very common to compare the results of a study with a gold standard; the group providing it is usually called a control group.
• The selected sample should be representative of the general population, otherwise the findings are likely to be biased and limited in scope. By "representative" we mean that if the general population shares some characteristics (relevant to the link of cancer with mobile phones), then any sample we pick should share these too.
• An important point to consider is that a number of mobile users in this study could have developed early-stage brain cancer which is not detected, or be at risk (with some non-zero probability) of developing brain cancer in the future. Although the study treats such cases as risk-free, they may not be, and this could lead to a different estimate of the overall risk and safety of mobile phones.

6.4 Data and Summary Statistics

In statistics, the word population is used to describe the largest group of "things" that we are interested in. The "things" might be people, animals, plants, machines, numbers etc., e.g.
registered voters in the UK, cod in the English Channel, breast cancer sufferers in Yorkshire, weights of Mars bars in a supermarket, etc.

The population as a whole is generally too big or inaccessible for us to investigate fully. This is one of the most important reasons why we need statistics. What we do instead is to obtain a representative sample taken from the population, and then use our knowledge of the sample, together with probability theory, to make inferences about the population as a whole. We will say more about the important problem of inference in the next chapter.

Suppose that we are interested in the weight (in kg, say) of cod in the North Sea. Let us suppose that there are N cod altogether. This is already problematic; no-one knows how to count all the cod in the sea at any one time! We could label the weights of the cod y_1, y_2, ..., y_N. Then we would love to know the population mean

    µ = (1/N) Σ_{i=1}^N y_i,

and the population variance

    σ² = (1/N) Σ_{i=1}^N (y_i − µ)².

But we cannot ever measure these numbers. They are inaccessible to us. On the other hand, we can go out and catch some cod from different parts of the sea and obtain a sample of weights x_1, x_2, ..., x_n. These are the data that we will work with. Here n is the number of fish in the sample. It may be e.g. 50, or 100, or 1000. But it is much smaller than N, and we write n << N to indicate this (the symbol << means "much smaller than"). This time we can calculate the sample mean

    x̄ = (1/n) Σ_{i=1}^n x_i,    (6.1)

and the sample variance

    s² = (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)².    (6.2)

Notice how in (6.2) we divide by n − 1 rather than by n. One way of justifying this is to argue that although s² appears to depend on n unknowns x_1, x_2, ..., x_n, once we know x̄, we have an equation x_1 + x_2 + ... + x_n = nx̄ which allows us to eliminate one of the unknowns, so there are only n − 1 unknowns altogether. The sample standard deviation is
    s = √(s²) = √{(1/(n − 1)) Σ_{i=1}^n (x_i − x̄)²}.

Just as in probability theory, the sample mean x̄ measures the average of the sample, and the variance s² and standard deviation s tell us about the spread of the numbers in the sample about the mean. We tend to prefer to work with s² as it has nicer mathematical properties; s has the advantage of being measured in the same units as the data. If there are many data points that differ from x̄ by a relatively large amount, then s and s² will be large, but if all the data points are tightly clustered around x̄ then these numbers are much smaller.

The following alternative formula for the sample variance is sometimes useful:

Proposition 6.4.1.

    s² = (1/(n − 1)) Σ_{i=1}^n x_i² − (n/(n − 1)) x̄².

Proof. Expanding (6.2) we get

    (n − 1)s² = Σ_{i=1}^n (x_i² − 2x_i x̄ + x̄²)
              = Σ_{i=1}^n x_i² − 2x̄ Σ_{i=1}^n x_i + Σ_{i=1}^n x̄²
              = Σ_{i=1}^n x_i² − 2nx̄² + nx̄²
              = Σ_{i=1}^n x_i² − nx̄²,

and the result follows when we divide both sides by n − 1. Notice that to get to the third line, we used Σ_{i=1}^n x_i = nx̄ from (6.1). We also used the very useful fact that if c is any constant (which doesn't depend on i), then Σ_{i=1}^n c = nc, which we applied with c = x̄².

Note. It is very important in statistics to reserve the symbols µ and σ² for the population mean and variance, and to use x̄ and s² for the sample mean and variance.

Example 6.1. A sample of 24 cod taken from the North Sea had weights in kg for which Σ_{i=1}^{24} x_i = 207 and Σ_{i=1}^{24} x_i² = 2230. Calculate x̄ and s to 3 d.p.

    x̄ = 207/24 = 8.625, and s² = 2230/23 − (24 × 8.625²)/23 = 19.332. Hence s = 4.397 to 3 d.p.

Numerical data can be easily entered into R using the command

    > x = c(x1, x2, ..., xn)

where here, of course, x1, x2, ..., xn are the numerical values of the data. The sample mean, variance, and standard deviation are then obtained using the commands mean(x), var(x) and sd(x).

Exercise.
Enter the data 1, 3, 5, 5, 6, 8, 9, 14, 14, 20 into R and check that the mean, variance and standard deviation are respectively 8.5, 34.5 and 5.87367.

Chapter 7: Estimation and Uncertainty

7.1 Populations and random samples

As was discussed in Chapter 6, statistics is concerned with observational studies where we seek information about the whole from knowledge of a part. A typical example involves the study of a particular data set which represents a sample taken from a larger population. In the smoking/cancer example of section 6.2, the data consisted of the sampled people and their related health state, and the population comprised all the people in a particular region of interest. Statistics is concerned with studying the properties of the particular sample we observe, and making inferences, or drawing conclusions, about the population. This is necessary because, for example, it is usually not possible to observe and study all elements of the population; in the above study it may be impossible to ask all people about their smoking habits, and very costly to examine their state of health. Thus we obtain the sample, a small part of the population, and we aim to use this sample to draw conclusions for the entire population. The way that we extract information from the sample, and use it to make estimates about the population, is through probability theory.

Suppose that we are interested in learning about the mean height µ of the students taking MAS113 in a particular academic year. Suppose that there are 240 students; then to obtain the mean we might measure the height of all students. However, this could be quite time-consuming. Instead, we may decide to take a sample of these students, e.g. a sample of 10 students, and measure their heights:

    Student          1    2    3    4    5    6    7    8    9    10
    Sample element   x1   x2   x3   x4   x5   x6   x7   x8   x9   x10
    Height (in cm)   167  165  178  190  176  184  189  191  182  169
We may believe that if these 10 students are "representative" of the population, their average height x̄ = (1/10)(x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10) will be a "good estimate" of the population mean µ. However, it is highly unlikely that x̄ will be exactly the same as the mean of all 240 student heights. So the question is: how close is x̄ to µ? We could obtain another sample x1*, ..., x10*, and again compute x̄* = (1/10) Σᵢ₌₁¹⁰ xᵢ*. Each time we obtain a different sample, we are likely to obtain a different estimate of µ. Thus there is uncertainty around the values of x̄ that we compute from the data: we do not know how closely they agree with the true population mean (of heights, in this example).

7.2 Confidence Intervals for the Mean I

Let us assume that we have a population of size N and we randomly choose a sample of size n, with n much smaller than N. The population has some "true distribution" which we don't know; e.g. if it is the heights of people in the U.K., it may be that 63% are between 5′8″ and 6′2″. Let

    X1 be the random variable that selects the first point x1 in the sample,
    X2 be the random variable that selects the second point x2 in the sample,
    ...
    Xn be the random variable that selects the nth point xn in the sample.

We will make the following assumptions:

(I) Each of X1, X2, ..., Xn has the same distribution, which is the distribution of the population.
(II) The random variables X1, X2, ..., Xn are independent.

So X1, X2, ..., Xn are i.i.d. random variables as studied in section 5.1. You may criticize both of these assumptions, but we have to start somewhere. All models of the real world make simplifying assumptions in order to make progress. Later on we may try to refine them.

Let µ and σ² be the population mean and variance, so that E(Xi) = µ and Var(Xi) = σ² for i = 1, ..., n. A key topic in statistics is called estimation of parameters.
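As a quick numerical aside, the point estimate x̄ for the ten heights tabulated earlier is just their average. A one-line check (Python here, though the notes use R):

```python
from statistics import mean

heights = [167, 165, 178, 190, 176, 184, 189, 191, 182, 169]  # cm, from the table
xbar = mean(heights)
print(xbar)  # 179.1
```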
The numbers µ and σ² are unknown parameters, and we try to get good "estimates" of them based on the data we obtain, and probabilistic reasoning. This is also a part of statistical inference. In this section we seek to estimate the parameter µ. At this stage we will assume that σ is known to us. As we discussed previously, we will assume that we have collected a random sample x1, x2, ..., xn from the population. We can then calculate the sample mean:

    x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ.

The number x̄ is called a point estimate of µ. We have already pointed out that it may not be very accurate (but it would become more so if n were to get close to N). We can do better by using our knowledge of probability theory to construct an interval estimate, i.e. an interval in which we have "high confidence" that µ will lie. Let us consider the random variable

    X̄ = (X1 + X2 + ... + Xn)/n.

From section 5.1, we know that

    E(X̄) = µ,   Var(X̄) = σ²/n.

The values taken by the random variable X̄ consist of all possible sample means of size n taken from the population.

Step 1. Assume for now that Xi ∼ N(µ, σ²), so the underlying population is normally distributed. In this case

    Z = (X̄ − µ)/(σ/√n)

is a standard normal: Z ∼ N(0, 1). We need to make a choice on the size of the confidence interval we're going to construct. To do this we choose a small number α, which is reasonably close to zero, and find the corresponding critical value z_{α/2} so that

    P(Z < −z_{α/2}) = P(Z > z_{α/2}) = α/2.   (7.1)

In practice, given α, we find z_{α/2} from R using the command qnorm(1 − α/2, 0, 1). Common choices are (check), to 4 s.f.:

    α = 0.1,   z_{α/2} = 1.645
    α = 0.05,  z_{α/2} = 1.960
    α = 0.01,  z_{α/2} = 2.576.

Now we reason as follows: from (7.1), we have P(−z_{α/2} ≤ Z ≤ z_{α/2}) = 1 − α, i.e.

    P( −z_{α/2} ≤ (X̄ − µ)/(σ/√n) ≤ z_{α/2} ) = 1 − α.

Rearranging the inequality, we conclude that

    P( X̄ − z_{α/2} σ/√n ≤ µ ≤ X̄ + z_{α/2} σ/√n ) = 1 − α.   (7.2)

The equation (7.2) gives a mathematically precise probability interval for the mean µ. But it is not useful practically as it stands. To make it useful, we must replace the random variable X̄ with the point estimate x̄ that we've obtained from the data. Then we can no longer speak the language of probability, but we have what is called, in statistics, a confidence interval. The key conclusion is:

We are 100(1 − α)% confident that the true value of µ lies in the interval

    ( x̄ − z_{α/2} σ/√n , x̄ + z_{α/2} σ/√n ).   (7.3)

The result stated in (7.3) is very important. It can also be summed up as: we are 100(1 − α)% confident that the true value of the mean is in the interval given by

    point estimate ± (critical value × standard error).

What does a confidence interval mean? The parameter µ is not a random quantity. So when we calculate our interval it is either "true" or "false" that µ lies in it. Suppose α = 0.05. Then if we generated many different samples of size n from our population, and computed the interval (7.3) for the different point estimates x̄ that we'd obtain, we would expect that approximately 95% of these intervals would contain µ. The other 5% would not.

Step 2. From now on, assume that the Xi are arbitrary random variables, each having mean µ and variance σ². We are now in a much more general situation. But the good news is that we can proceed exactly as in Step 1, and still use (7.3) to calculate confidence intervals, provided we have a large enough sample. Why? The reason is that we have the central limit theorem, which tells us that for sufficiently large n,

    (X̄ − µ)/(σ/√n)  is approximately N(0, 1)

(c.f. (5.16)), and that is all we need to, at least approximately, carry out the same reasoning as before. How large should n be? In practice, statisticians usually require at least 30. If n < 30, we have a small sample, and we will see how to calculate confidence intervals in this case later on.
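The interpretation above ("approximately 95% of such intervals would contain µ") can be checked by simulation. The following sketch, in Python rather than the R used in the notes and with an arbitrary choice of population (standard normal) and seed, repeatedly draws samples of size n = 30 and counts how often the interval (7.3) covers the true mean:

```python
import random
from statistics import NormalDist, mean

random.seed(1)                       # arbitrary seed, for reproducibility
mu, sigma, n = 0.0, 1.0, 30          # population N(0, 1), sample size 30
z = NormalDist().inv_cdf(0.975)      # critical value z_{alpha/2} for alpha = 0.05

covered = 0
reps = 1000
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = mean(sample)
    half_width = z * sigma / n ** 0.5   # critical value x standard error
    if xbar - half_width <= mu <= xbar + half_width:
        covered += 1

print(covered / reps)  # close to 0.95
```

With 1000 repetitions the observed coverage fluctuates slightly around 0.95, as the discussion above predicts.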
Example 7.1 (Temperature measurement). A chemist wants to know the melting point of a newly developed compound, under specific laboratory conditions. Temperature is measured using an instrument which is known to have measurement errors with standard deviation 0.3°C. There is good theoretical evidence to support the belief that the melting points are normally distributed. She repeats an experiment to measure the melting point nine times, and obtains the following values (in °C):

    82.38, 82.57, 82.23, 82.05, 82.21, 82.03, 82.04, 82.35, 81.60.

Although the sample size is smaller than 30, because the distribution of the population is normal (and σ is known) we can still use (7.3).

Solution. The estimated melting point in °C is computed using R to give x̄ = 82.16. The standard error is 0.3/√9 = 0.1. If we calculate a 95% confidence interval for the melting point, we obtain

    (82.16 − 1.96 × 0.1, 82.16 + 1.96 × 0.1), or (81.96, 82.36).

We might argue that 95% is insufficiently precise, and ask for a 99% confidence interval. Then we would get

    (82.16 − 2.576 × 0.1, 82.16 + 2.576 × 0.1), or (81.90, 82.42).

So we have increased our confidence, but at the price of increased uncertainty, as the confidence interval for µ is wider.

Note 1. In practice, it is unlikely that we will know the value of σ². In that case, provided n > 30, we can still derive confidence intervals by replacing it in (7.3) by the sample variance:

    s² = (1/(n−1)) Σᵢ₌₁ⁿ (xᵢ − x̄)².

Of course, this reduces the accuracy of the calculation. We will come back to this point later on, when we investigate the role of the t-distribution.

7.2.1 Confidence Intervals for a Proportion

Frequently we may want to estimate the proportion of individuals in a population that have a certain property. Then for each member of the sample, we might ask: is it a "success" or a "failure"? (or alternatively, a "yes" or a "no"). Let p be the true proportion that has the property.
This is the parameter that we seek to estimate. In this case, X1, X2, ..., Xn are Bernoulli(p) and S(n) = X1 + X2 + ... + Xn is Binom(n, p), and so has mean np and standard deviation √(np(1 − p)). So S(n)/n has mean p with standard error √(p(1 − p)/n). We seek a confidence interval for p, and instead of x̄, it is common to use p̂ for the sample proportion that is obtained from the data. We use the same machinery as we developed above (see also (5.19)) to deduce that we are 100(1 − α)% confident that the true value of p lies in the interval

    [ p̂ − z_{α/2} √(p(1 − p)/n) , p̂ + z_{α/2} √(p(1 − p)/n) ].

But we cannot use this in practice, as the interval depends on the quantity we are trying to estimate. To get a usable formula, we apply the philosophy of Note 1, and replace p in √(p(1 − p)/n) with the point estimate p̂, to obtain the 100(1 − α)% confidence interval:

    [ p̂ − z_{α/2} √(p̂(1 − p̂)/n) , p̂ + z_{α/2} √(p̂(1 − p̂)/n) ].   (7.4)

The interval in (7.4) is called the Wald interval, in honour of Abraham Wald (1902–50). The approximation of p by p̂ in the standard error means that (7.4) is not always very accurate. In the next chapter we'll look at a way of improving it.

Example 7.2 (Opinion polling). A random sample of 150 UK university students has been asked "Will you vote Labour at the next election?". Of those polled, 68 say "Yes". Using this sample, we can give a 99% confidence interval for the proportion p of students (in the population of UK students) who we expect to vote Labour at the next election. We have p̂ = 68/150 ≈ 0.453, and the estimated standard error is √(p̂(1 − p̂)/150) ≈ 0.0406. An approximate 99% confidence interval is given by

    (0.453 − 2.58 × 0.0406, 0.453 + 2.58 × 0.0406), i.e. (0.348, 0.558).

7.3 Unbiased and Consistent Estimators

In the last section we learned how to find confidence intervals for the mean and for a proportion, by using the normal distribution and/or the CLT.
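The arithmetic in Example 7.2 can be reproduced in a few lines. This is a Python sketch (the notes use R); the variable names are illustrative:

```python
from math import sqrt

n, successes = 150, 68
phat = successes / n                       # sample proportion
se = sqrt(phat * (1 - phat) / n)           # estimated standard error
z = 2.58                                   # 99% critical value, as in the notes
low, high = phat - z * se, phat + z * se   # Wald interval (7.4)
print(round(phat, 3), round(se, 4), (round(low, 3), round(high, 3)))
# 0.453 0.0406 (0.348, 0.558)
```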
Later on we will want to find confidence intervals for the variance. At this stage, it's a good idea to develop some more theoretical insight. Let θ be some parameter of the population that we seek to estimate; e.g. we could have θ = µ, or θ = σ². Let X1, X2, ..., Xn be, as in the last section, the i.i.d. random variables that randomly select the sample of size n. An estimator of θ is defined to be a function θ̂ of the random variables X1, X2, ..., Xn. So we could write:

    θ̂ = f(X1, X2, ..., Xn).

Of course there are many such functions, and we need to impose some more conditions to get a "good" estimator. Whenever we choose a random sample x1, x2, ..., xn, we can calculate a value of the estimator using f(x1, x2, ..., xn). We might expect some values to be accurate, and others to be less so, but "on average" we would expect to get quite close to the true value θ. To make this intuition more precise, we define the bias to be the real number

    B(θ̂, θ) = E(θ̂) − θ,

and we say that an estimator is unbiased if it has zero bias, i.e. if E(θ̂) = θ. The distribution of θ̂ is called the sampling distribution, and its standard deviation is called the standard error, denoted SE(θ̂).

Example 7.3. Suppose that we seek to estimate the mean µ. We define the estimator

    µ̂ = X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ.

We have already seen that E(X̄) = µ (recall (5.4)), and by (5.6), SE(X̄) = σ/√n. If the population is normally distributed, then the sampling distribution is N(µ, σ²/n).

There may be many good unbiased estimators of the same parameter. Another desirable property is that they should not fluctuate too far from the mean, and if we have two or more competing unbiased estimators, then we should always choose the one with the smallest variance. There has been a great deal of theoretical work on minimal variance unbiased estimators.

Example 7.4
Suppose that for an unknown mean µ, we have X ∼ N(µ, 2), Y ∼ N(µ, 1) and X, Y are independent. We define the estimators

    Z = 2X − Y  and  W = (1/2)X + (1/2)Y.

Are Z, W unbiased estimators? Which one would you prefer?

E(Z) = E(2X − Y) = 2E(X) − E(Y) = 2µ − µ = µ, and so Z is unbiased.
E(W) = E(0.5X + 0.5Y) = 0.5E(X) + 0.5E(Y) = 0.5µ + 0.5µ = µ, and so W is also unbiased.

The variances are

    Var(Z) = Var(2X − Y) = 4Var(X) + Var(Y) = 4 × 2 + 1 = 9,
    Var(W) = Var(0.5X + 0.5Y) = 0.25Var(X) + 0.25Var(Y) = 0.25 × 2 + 0.25 × 1 = 0.75.

Then W is the better estimator: both Z and W are unbiased for µ, but W has the smaller variance.

But having a small variance does not in itself create a good estimator. In Figure 7.1, we return to the data on student height that we examined at the beginning of this section. Let us suppose, just for illustrative purposes, that the true mean of the population is µ = 178, and that we select an estimator X ∼ N(175, 0.1). Since E(X) = 175, this is a biased estimator; although it has a small variance, it is a poor estimator, as the probability of obtaining a good approximation to the true mean is quite small (as can be seen from the figure).

Another property of estimators that we'll briefly look at is consistency. Suppose that we have a sequence θ̂n of estimators of θ, where the label n corresponds to increasing sample size. We say that these are consistent if the probability that θ̂n is close to θ is close to 1, for large n. More precisely, we require that for any ε > 0 (no matter how small it is)

    lim_{n→∞} P(|θ̂n − θ| < ε) = 1.   (7.5)

The next result gives us a convenient method for finding consistent estimators.

Theorem 2. Suppose that (θ̂n) are unbiased estimators of θ. If

    lim_{n→∞} Var(θ̂n) = 0,

then the θ̂n are consistent.
Figure 7.1: Distribution of the estimator X̄ ∼ N(175, 0.1); the true mean of the population is µ = 178 and is indicated with the solid vertical line.

Proof. From Chebyshev's inequality (5.7),

    P(|θ̂n − θ| ≥ ε) ≤ Var(θ̂n)/ε²,

and so, since Var(θ̂n) → 0 as n → ∞, it follows that P(|θ̂n − θ| ≥ ε) → 0. Hence

    P(|θ̂n − θ| < ε) = 1 − P(|θ̂n − θ| ≥ ε) → 1,

as was required.

This theorem enables us to check quickly whether an unbiased estimator is consistent, without needing to directly evaluate the limit of P(|θ̂n − θ| < ε). For example, if we return to Example 7.3, we see that the estimator X̄ of the mean µ is consistent by applying Theorem 2; indeed, we have already seen that X̄ is unbiased, and its variance σ²/n converges to 0 as n → ∞. Be aware that, in general, it is possible to find consistent estimators that are biased, and unbiased estimators that fail to be consistent.

Figure 7.2: Distribution of the estimator X̄ ∼ N(178, 40/n), for n = 10 (solid line), n = 40 (dashed line) and n = 100 (dotted line).

Figure 7.2 illustrates consistency for the student height data, under the assumption that the true mean height is 178, using a sequence of estimators which are N(178, 40/n).

7.3.1 The variance estimator S²

In this subsection we will aim to find an unbiased estimator for the population variance σ². If µ were known, a candidate for an estimator of σ² is:

    S1² = (1/n) Σᵢ₌₁ⁿ (Xᵢ − µ)².   (7.6)

We can see that S1² is unbiased for σ². Indeed,

    E(S1²) = (1/n) Σᵢ₌₁ⁿ E[(Xᵢ − µ)²] = (1/n) Σᵢ₌₁ⁿ σ² = nσ²/n = σ².

However, in most applications we will not know µ.
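The unbiasedness of the estimator (7.6) can be seen numerically: averaging its value over many simulated samples gets close to σ². A Python sketch (arbitrary seed and population, illustrative names), in the same spirit as the R examples elsewhere in the notes:

```python
import random

random.seed(2)                 # arbitrary seed, for reproducibility
mu, sigma2, n, reps = 0.0, 1.0, 5, 4000

def s1_squared(xs, mu):
    """The estimator (7.6): average squared deviation from the known mean mu."""
    return sum((x - mu) ** 2 for x in xs) / len(xs)

vals = []
for _ in range(reps):
    xs = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    vals.append(s1_squared(xs, mu))

print(sum(vals) / reps)  # close to sigma2 = 1
```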
One idea is to replace µ by its estimator X̄, and so to consider as an estimator for σ²

    S2² = (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)².

It turns out that S2² is not an unbiased estimator for σ². The reason is that we have lost some information by estimating µ. We slightly modify S2² to

    S² = (1/(n−1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)².   (7.7)

We will show that this is an unbiased estimator for σ². Note the difference between S² and S2². In order to prove that S² is unbiased, we first write

    S² = SXX/(n−1),  where  SXX = Σᵢ₌₁ⁿ (Xᵢ − X̄)².   (7.8)

Then we have

    E(SXX) = E( Σᵢ₌₁ⁿ (Xᵢ − X̄)² )
           = Σᵢ₌₁ⁿ E[(Xᵢ − X̄)²]
           = Σᵢ₌₁ⁿ E[{(Xᵢ − µ) − (X̄ − µ)}²]
           = Σᵢ₌₁ⁿ E[(Xᵢ − µ)²] + Σᵢ₌₁ⁿ E[(X̄ − µ)²] − 2E( Σᵢ₌₁ⁿ (Xᵢ − µ)(X̄ − µ) )
           = Σᵢ₌₁ⁿ E[(Xᵢ − µ)²] + Σᵢ₌₁ⁿ E[(X̄ − µ)²] − 2nE[(X̄ − µ)²]
           = nσ² + n(σ²/n) − 2n(σ²/n)
           = (n − 1)σ²,

where in the fifth line we used Σᵢ₌₁ⁿ (Xᵢ − µ) = n(X̄ − µ), and in the sixth line we used E[(Xᵢ − µ)²] = Var(Xᵢ) = σ² and E[(X̄ − µ)²] = Var(X̄) = σ²/n. Thus

    E(S²) = E(SXX)/(n−1) = (n−1)σ²/(n−1) = σ²,

and so S² is unbiased for σ². It can also be shown to be a consistent estimator.

7.4 Confidence Intervals for the Mean II

We now return to the problem of using confidence intervals for the mean µ. In particular, we will want to deal with the situation where σ² is unknown and the sample is small (n < 30), so it is not reasonable to approximate σ² by the sample variance s² and proceed as we did in section 7.2. Throughout this section we will assume that the population is normally distributed, so our sample random variables X1, X2, ..., Xn are N(µ, σ²). We will use S = √(SXX/(n−1)) as our estimator of σ. It is possible to show that the distribution of S/σ depends on n, but not on µ or σ, so the distribution of

    (X̄ − µ)/(S/√n)

also depends on n but not on µ or σ. This becomes clearer if we write
    (X̄ − µ)/(S/√n) = (X̄ − µ)/(σ/√n) ÷ S/σ,

since (X̄ − µ)/(σ/√n) ∼ N(0, 1), which is independent of both µ and σ.

This gives rise to a new family of distributions that we have not met before. If X1, ..., Xn are independent N(µ, σ²) random variables, then the distribution of

    T = (X̄ − µ)/(S/√n)

is called the Student t distribution with n − 1 degrees of freedom. We write T ∼ t_{n−1}. The degrees-of-freedom parameter determines the shape of the distribution. The fact that the distribution of T depends on n, but not on µ or σ, gives us a way to calculate confidence intervals, as we'll see below.

Properties of the t distribution

If a random variable T has a t distribution with ν degrees of freedom, that is if T ∼ t_ν, then T has the density function

    f_ν(t) = k(ν)(1 + t²/ν)^{−(ν+1)/2},  −∞ < t < ∞,

for an appropriate constant k(ν), which depends only on ν.¹ This will be proved in MAS223. The pdf f_ν is symmetric about zero. For large values of ν, it is very similar to the standard normal density N(0, 1). (Why?)

Figure 7.3 shows the central part of the t-distribution density with 2 degrees of freedom. Note that the t-distribution looks quite similar to the standard normal near to zero, but the t-distribution has heavier tails. To see more, draw the graphs of the t distribution with 5 degrees of freedom and the standard normal; for the t-distribution, use the command:

    plot(function(x) dt(x, df=5), -5,5, main="", xlab="Values of t", ylab="Probability density")

¹ In fact k(ν) = (1/√(πν)) · Γ((ν+1)/2)/Γ(ν/2), where the gamma function is Γ(ν) = ∫₀^∞ e^{−x} x^{ν−1} dx. Note that Γ(n) = (n − 1)! if n is a natural number, but Γ(1/2) = √π.
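The density formula and the footnote's constant k(ν) can be checked numerically, which also answers the "(Why?)" above: as ν grows, f_ν approaches the standard normal density. A Python sketch using the standard library's math.gamma (function names here are illustrative):

```python
from math import gamma, pi, sqrt, exp

def t_pdf(t, nu):
    """Density f_nu(t) = k(nu) (1 + t^2/nu)^(-(nu+1)/2), with k(nu) from the footnote."""
    k = gamma((nu + 1) / 2) / (sqrt(pi * nu) * gamma(nu / 2))
    return k * (1 + t * t / nu) ** (-(nu + 1) / 2)

def normal_pdf(t):
    return exp(-t * t / 2) / sqrt(2 * pi)

print(t_pdf(0, 2))                    # about 0.354, as in Figure 7.3
print(t_pdf(0, 200), normal_pdf(0))   # nearly equal for large nu
print(t_pdf(4, 2) > normal_pdf(4))    # heavier tails: True
```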
Figure 7.3: Graph of the t-distribution with 2 degrees of freedom.

For the normal, use the command

    plot(function(x) dnorm(x), -5,5, main="", xlab="Values of z", ylab="Probability density")

Cumulative probabilities from t-distributions can be calculated in R using the pt function, which works in a similar way to the pnorm function; but note that the value of the degrees-of-freedom parameter needs to be specified.

Confidence intervals using the t distribution

For our purposes, the importance of the t-distribution lies in its use for constructing confidence intervals when σ² is unknown and has to be effectively replaced by s², which is estimated from the data. To recap, let X1, ..., Xn be independent N(µ, σ²) random variables; we use X̄ as an estimator for µ, with σ² unknown. Then

    T = (X̄ − µ)/(S/√n) ∼ t_{n−1}.   (7.9)

As in (7.1) we choose a small number α and (using the symmetry of the t-distribution) seek a critical value t_{n−1;α/2} so that

    P(T < −t_{n−1;α/2}) = P(T > t_{n−1;α/2}) = α/2.

Then we have

    P(−t_{n−1;α/2} ≤ T ≤ t_{n−1;α/2}) = 1 − α.

Now substitute for T from (7.9) and rearrange to obtain

    P( X̄ − (S/√n) t_{n−1;α/2} ≤ µ ≤ X̄ + (S/√n) t_{n−1;α/2} ) = 1 − α.

Replacing the random variables X̄ and S by their observed values x̄ and s, we find that a 100(1 − α)% confidence interval for µ is given by:

    x̄ ± (s/√n) t_{n−1;α/2}.   (7.10)

You should compare (7.10) with (7.3). For the most common case, where we want a 95% confidence interval (α = 0.05), some examples of critical values are shown in the following table. (What happens as n → ∞?)

    Sample size n:                  3     6     10    25    50    100
    Degrees of freedom ν:           2     5     9     24    49    99
    Critical value t_{n−1;0.025}:   4.30  2.57  2.26  2.06  2.01  1.98

Note that these values can be obtained easily in R. The last one, for example, is produced using the command qt(0.975,99) and rounding.
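The table suggests the answer to the question above: as n → ∞, the critical values decrease towards the normal critical value 1.96. Computing t quantiles themselves outside R would need a t-distribution library (such as scipy, which we don't assume here), but the limiting value can be checked with the Python standard library:

```python
from statistics import NormalDist

# Limiting critical value: the t_{n-1;0.025} column tends to z_{0.025}
z = NormalDist().inv_cdf(0.975)
print(round(z, 2))  # 1.96, matching the tail of the table
```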
In R, a calculation of a 95% confidence interval for the mean, based on estimating the sample standard deviation and using the t-distribution, can be obtained as follows.

• Make sure that the data are collected in an object, say dat.
• Get estimates of the mean (µ̂ = mean(dat)) and of the variance (s² = var(dat)) of dat.
• Then the interval is
  (mean(dat)-qt(0.975,n-1)*sqrt(var(dat))/sqrt(n), mean(dat)+qt(0.975,n-1)*sqrt(var(dat))/sqrt(n)),
  where n is the size of dat, which can be determined in R by length(dat).

Example 7.4 (Density of the Earth). Twelve observations were made on the density of the earth relative to water. We require a 95% confidence interval for the true value. Here are the data:

    5.10, 5.39, 5.47, 5.34, 5.30, 5.68, 5.27, 5.42, 5.63, 5.46, 5.75, 5.85.

Let Xi be the ith measurement of the density of the earth relative to water, and suppose Xi ∼ N(µ, σ²), where µ is the unknown true value and σ² is the unknown variance of the measurements around the true value. Here we have n = 12, s = 0.2183 and x̄ = 5.4717. From R, t_{11;0.025} = 2.201. So by (7.10) the required confidence interval is

    (5.4717 − 2.201 × 0.2183/√12, 5.4717 + 2.201 × 0.2183/√12),

or 5.4717 ± 0.1387. Thus a 95% confidence interval for the density of the earth relative to water is (5.333, 5.610). In R, following the above steps, we get the following interval (unrounded!): (5.33297, 5.610363).

7.5 Estimating Variance

We have seen that S² = SXX/(n−1) is a useful estimator of the variance of a normal distribution, based on observations X1, ..., Xn. In this section, we look at the sampling distribution of S², and at interval estimation for σ².

The distribution of S²

We start with a basic result. If the random variable X follows the standard normal distribution X ∼ N(0, 1), then the p.d.f. of Y = X² is given by

    f_Y(y) = (1/√(2π)) y^{1/2−1} exp(−y/2).   (7.11)

To see this, we denote by f_X(x) the p.d.f.
of X, by F_X(x) the c.d.f. of X and by F_Y(y) the c.d.f. of Y. Recall that

    f_X(x) = (1/√(2π)) e^{−x²/2}.   (7.12)

Then for y ≥ 0,

    F_Y(y) = P(Y ≤ y)
           = P(X² ≤ y)
           = P(−√y ≤ X ≤ √y)
           = P(X ≤ √y) − P(X ≤ −√y)
           = F_X(√y) − F_X(−√y)
           = F_X(√y) − (1 − F_X(√y)) = 2F_X(√y) − 1.

Now from the definition of the p.d.f. we have

    f_Y(y) = d F_Y(y)/dy = 2 f_X(√y) · 1/(2√y) = f_X(√y)/√y,

and using (7.12), we obtain

    f_Y(y) = (1/√(2πy)) exp(−y/2).

The above distribution of Y is known as the chi-square distribution with 1 degree of freedom. The general form of the chi-square distribution with n degrees of freedom is given by the p.d.f.

    f_Y(y) = (1/(2^{n/2} Γ(n/2))) y^{n/2−1} exp(−y/2),  y ≥ 0.   (7.13)

(If y < 0, then f_Y(y) = 0.) Here Γ denotes the gamma function. Since Γ(1/2) = √π, if we put n = 1 in (7.13), we obtain (7.11). In terms of notation, we write Y ∼ χ²_n to mean that the random variable Y has a chi-squared distribution with n degrees of freedom.

An important property of the chi-square distribution is that if the independent random variables Y1, Y2 follow the chi-square distributions Y1 ∼ χ²₁ and Y2 ∼ χ²₁, then Y1 + Y2 ∼ χ²₂. It is beyond the scope of this module to derive this result, or to give a more detailed account of the gamma function. That is postponed to Year 2. The above property can easily be generalized for a finite number of random variables, to give the following important result:

If Z1, ..., Zn are n independent random variables, each with the standard normal distribution, then Σⱼ₌₁ⁿ Zⱼ² ∼ χ²_n. Another way of writing this is that if X1, ..., Xn are n independent random variables, each with the same normal distribution N(µ, σ²), then

    Σⱼ₌₁ⁿ (Xⱼ − µ)²/σ² ∼ χ²_n.

Figure 7.4 gives the pdf of the chi-squared distribution with 10 degrees of freedom. A closely related result is that

    SXX/σ² = Σⱼ₌₁ⁿ (Xⱼ − X̄)²/σ² ∼ χ²_{n−1}.   (7.14)

Once again, a proof of this is beyond the scope of this module, but the "n − 1" relates to the fact that the terms in the sum cannot be regarded as n distinct sources of information; there is one constraint on them, namely that ΣXⱼ = nX̄. We'll see another example of this, arising in a very different way, later in the module. For our present purposes, this helps because, combining (7.14) with (7.8), we can then write

    ((n − 1)/σ²) S² ∼ χ²_{n−1}.   (7.15)

In the following we need critical values for the chi-square distribution. To be precise, if Q has a chi-square distribution with ν degrees of freedom, we define the values χ²_{ν,α} by

    P(Q ≤ χ²_{ν,α}) = α.   (7.16)

It follows that

    P(χ²_{ν,α/2} ≤ Q ≤ χ²_{ν,1−α/2}) = 1 − α.   (7.17)

To obtain values of χ²_{ν,α}, we use the R command qchisq(α, ν).

Confidence intervals using the χ² distribution

In this case, the idea is to seek confidence intervals, wherein we find a, b ≥ 0 so that

    P(aS² ≤ σ² ≤ bS²) = 1 − α.   (7.18)

Now by (7.17) and using (7.15), we have

    P( χ²_{n−1,α/2} ≤ (n − 1)S²/σ² ≤ χ²_{n−1,1−α/2} ) = 1 − α.

Now since (n − 1)S²/σ² ≤ χ²_{n−1,1−α/2} if and only if σ² ≥ (n − 1)S²/χ²_{n−1,1−α/2}, and χ²_{n−1,α/2} ≤ (n − 1)S²/σ² if and only if σ² ≤ (n − 1)S²/χ²_{n−1,α/2}, we conclude that a 100(1 − α)% confidence interval for σ² is given by

    (n − 1)s²/χ²_{n−1,1−α/2} ≤ σ² ≤ (n − 1)s²/χ²_{n−1,α/2},   (7.19)

where the sample variance s² is computed from the data. So we have

    a = (n − 1)/χ²_{n−1,1−α/2},  b = (n − 1)/χ²_{n−1,α/2}.

For the most common case, where we want a 95% confidence interval (α = 0.05), some examples of critical values are shown in the following table.
    Sample size n:          3       6      10     25     50     100
    Degrees of freedom ν:   2       5      9      24     49     99
    χ²_{ν,0.975}:           7.38    12.8   19.0   39.4   70.2   128
    χ²_{ν,0.025}:           0.0506  0.831  2.70   12.4   31.6   73.4
    Multiplier a:           0.271   0.390  0.473  0.610  0.698  0.771
    Multiplier b:           39.50   6.015  3.333  1.935  1.553  1.349

So if we calculate a sample variance s² based on 25 normal observations, then a 95% confidence interval for the true variance would be roughly (0.6s², 1.9s²). Note that if we have

    P(aS² ≤ σ² ≤ bS²) = 1 − α,

then we also have

    P(√a S ≤ σ ≤ √b S) = 1 − α.

So confidence intervals for σ can be calculated in the obvious way from corresponding intervals for σ².

Example 7.5 (Density of the Earth). In Example 7.4, twelve observations were made on the density of the earth relative to water; the sample variance was s² = (0.2183)² = 0.04765. From R, χ²_{11,0.025} = 3.816 and χ²_{11,0.975} = 21.92. We get a = 11/21.92 = 0.502 and b = 11/3.816 = 2.88, so a 95% confidence interval for σ² is (0.024, 0.137). A 95% confidence interval for σ, the standard deviation of the measurement error on individual observations, is (0.15, 0.37).

Figure 7.4: Chi-squared distribution for ten degrees of freedom. The shaded area on the left is P(Q < χ²_{10,0.025}) and that on the right is P(Q > χ²_{10,0.975}).

Chapter 8
Hypothesis Testing

8.1 Introduction

In Chapter 7 we were concerned with estimating the parameters of a model, allowing for the uncertainty in them, and relating the values to an underlying real-world problem. Often we want to investigate whether or not the data are consistent with a particular model, or with a particular value of a parameter—usually one that has a specific real-world interpretation. For this chapter, we will concentrate on the latter case, investigating the plausibility of a particular parameter value; later in the course, we will see some examples of more general model-checking.
There are many real-world questions that can be framed as: "is this parameter equal to a particular fixed value?"

• Is the mean measurement error equal to zero? For example, if the measurement errors from an instrument follow a N(µ, σ²) distribution, is µ = 0?
• Is the probability of success equal to a half? Is this coin fair? In a series of Bernoulli(p) trials, is p = 1/2?
• Are these two variables unrelated? Is the correlation between them zero?

The formal terminology is hypothesis testing: we have a hypothesis, which is effectively a statement about certain parameter values, and we wish to test it by looking at the data we have.

For example, in medical research we may be interested in assessing the effectiveness of a treatment, or the effectiveness of a drug. We may choose some patients who did receive the treatment and some others who did not, and we may be interested in whether the group with the treatment have shown enough evidence of improvement. Here the hypothesis we want to test is that the treatment has no effect.

We cannot prove that a hypothesis is true. Generally, any data we obtain can be explained by, or are consistent with, a range of possible models. However, we can have evidence against a hypothesis; the data we see can effectively rule out certain possibilities. We can talk about 'rejecting' a hypothesis. Of course, in many cases this is only because the observed data would be very unlikely if a particular hypothesis were true, not (usually) because the data would be impossible. In the medical example above, we may reject the hypothesis of no effect of the treatment; in that case the data we have obtained provide evidence in favour of the effectiveness of the treatment.

The hypothesis that we test is known as the null hypothesis. We need to be able to describe, statistically, what would happen if the null hypothesis were true.
Because we cannot prove a hypothesis, but only reject it or not, the null hypothesis often has a negative form. For instance, if we are interested in showing that there is a relationship between two variables, we can take our null hypothesis to be that there is no relationship; if the data seem very unlikely given that hypothesis, then we may be able to reject the hypothesis of 'no relationship', which tells us that there is some sort of relationship. In the medical example above, our null hypothesis is that the treatment has no effect.

Sometimes a hypothesis is described as 'accepted'; this does not imply that it has been proven to be true, only that it has not been rejected. This terminology is potentially misleading, and best avoided.

To carry out a formal test of a null hypothesis, we need to specify what the alternative is. Often this is obvious, since it may for example consist of all possible parameter values except the one specified by the null hypothesis; but this need not always be the case. We do need to know what the alternative hypothesis is, but it turns out that we do not need to know what the data would look like if the alternative hypothesis were true, at least not in as much detail as under the null hypothesis. So the alternative can include a whole range of parameter values without creating technical problems.

The shorthand H0 is often used to denote the null hypothesis, and HA (or sometimes H1) to denote the alternative hypothesis. The same subscript zero is used to denote the particular value of a parameter specified by a null hypothesis; we might for example set µ0 = 0 if we are investigating whether a particular mean is zero.

To carry out a test of a hypothesis, the basic idea is to look at some summary of the data—which should be simple enough that we can find its
distribution given that the null hypothesis is true, and assess whether the observed summary is consistent with what we would expect to see, or whether it differs from what we'd expect in a way that supports the alternative hypothesis.

8.2 Examples

Example 8.1: The lady tasting tea

This is a famous study by R.A. Fisher (1890–1962), which revolutionized statistics and opened the path to randomization and designed experiments. Around 1919, a lady, Dr Muriel Bristol, claimed that she could determine whether the milk had been added to a cup before or after the tea, just by taste. Fisher designed an experiment to test this claim. Eight cups of tea were prepared in random order, 4 of which had the milk added before the tea and 4 after. The objective was to see whether the lady could correctly classify the two groups of 4 cups. The null hypothesis here is that the lady has no such ability, i.e. that she has probability 0.5 of giving the right answer: H0: p = 0.5, with the alternative HA: p ≠ 0.5, where p denotes the probability that she gives the right answer. Fisher found the probability function of the number of correct answers under the null hypothesis H0, and compared this with the evidence of the lady's actual answers.

Example 8.2: Fairness of a coin

It is claimed that the following sequence of results comes from independent tosses of a fair coin (where H denotes 'heads' and T denotes 'tails'):

T, T, T, H, T, T, T, T, T, T, T, T, H, T, H, T, T, T, H, T.

We wish to test the null hypothesis of independent tosses of a fair coin (H0) against the alternative hypothesis of independent tosses of an unfair coin (HA), using the number of heads observed, Y, as a test statistic. If the null hypothesis is true, what is the distribution of Y? What form would the distribution of the test statistic take if the alternative hypothesis were true?
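Under H0 these questions involve Binom(20, 0.5) probabilities. As a quick numerical check alongside the R commands quoted in these notes, here is a sketch using only Python's standard library; it counts the heads in the sequence above and computes the two probabilities asked for:

```python
from math import comb

# The claimed sequence of 20 tosses from Example 8.2.
tosses = "TTTHTTTTTTTTHTHTTTHT"
y_obs = tosses.count("H")  # number of heads observed

def binom_pmf(k, n, p):
    """P(Y = k) for Y ~ Binom(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_exact = binom_pmf(y_obs, 20, 0.5)                             # P(Y = y_obs)
p_low = sum(binom_pmf(j, 20, 0.5) for j in range(y_obs + 1))    # P(Y <= y_obs)
print(y_obs, round(p_exact, 5), round(p_low, 5))  # 4 0.00462 0.00591
```

These match the values produced by R's dbinom and pbinom in the worked answer below.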
Calculate (i) the probability of getting exactly the number of heads observed, and (ii) the probability of a number of heads as low as that observed or lower, if the null hypothesis is true.

If the null hypothesis is true, then the distribution of the number of heads Y is Binom(20, 0.5). Under the alternative hypothesis of an unfair coin, Y ∼ Binom(20, p) with p ≠ 0.5.

Just 4 heads were observed. (i) The probability of getting exactly 4 heads under the null hypothesis is P(Y = 4) with Y ∼ Binom(20, 0.5), which is 0.00462 (using dbinom(x=4, size=20, prob=0.5)). (ii) The probability of getting 4 or fewer heads is P(Y ≤ 4) with Y ∼ Binom(20, 0.5), which is 0.00591 (using pbinom(q=4, size=20, prob=0.5)).

Example 8.3: Temperature measurement

The chemist introduced in Chapter 7 is investigating the effect of a new way of producing a material. She thinks that the new technique may change the melting point of the material from the well-known value it has when the standard production technique is used, which is 50°C. She measures the melting point of 25 samples of the material, using an instrument known to have measurement errors with standard deviation 0.3°C, and is happy to assume that observations are independent and normally distributed. We can formalize this by modelling the measured temperatures as X1, ..., Xn ∼ N(µ, σ²), with n = 25, σ = 0.3, and µ unknown. The null hypothesis is that µ = 50; the alternative hypothesis is that µ ≠ 50. If H0 is true, then the data should come from the N(µ0, σ²), that is N(50, 0.3²), distribution. If HA is true, then the data will be systematically larger or smaller than under H0, depending on whether µ is larger or smaller than µ0. Because a departure from H0 of this sort will tend to affect all the observations equally, looking at the sample mean is likely to be a good way of checking for consistency with H0.
Given H0, we know that the distribution of X̄ would be N(µ0, σ²/n), which is N(50, 0.06²); given HA, X̄ would be N(µ, 0.06²) for some µ ≠ 50. So a possible test of H0 is to look at the sample mean and compare it with the distribution it should have under H0. Any value of X̄ is possible under H0, but very large or very small values suggest that we should reject H0 in favour of HA. If we observe x̄ = 50.1, say, there is no reason to doubt the null hypothesis, since that is a perfectly reasonable value from N(50, 0.06²). This does not prove that H0 is true; 50.1 is also a perfectly reasonable value from N(50.2, 0.06²), say. On the other hand, if we observe x̄ = 51.2, that is a very unlikely value from N(50, 0.06²), although of course not impossible; since some alternative values of µ would explain that observation much better, we should regard this as a strong reason to reject H0 in favour of HA. In the next section, we will formalise and quantify these ideas.

8.3 Overview of Hypothesis Testing

In general we have a model with a parameter θ, a null hypothesis H0: θ = θ0, and an alternative hypothesis HA. We find a random variable which will behave differently depending on whether H0 or HA is true: a test statistic. We need to be able to work out its distribution if H0 is true; this is known as the null distribution. Based on the form of the hypotheses, we decide what sort of values of the test statistic count as evidence against H0: very large values, or very small ones, or both. We then quantify the strength of the evidence by calculating the probability, given H0, of getting the observed value of the test statistic or a more extreme value. This probability is known as the p-value of the test, and is a measure of the evidence against H0.
A very small p-value means that what we actually saw would have been very unlikely under H0 (and is better explained by HA); a p-value that is not small means that what we saw was just the kind of thing we would expect to see if H0 were true. Note that in this approach, all probability calculations are done under the assumption that the null hypothesis is true; the alternative hypothesis affects only the choice of test statistic, and the choice of which extremes of the null distribution count against H0. This means that the alternative hypothesis can be less precisely specified than the null.

8.4 The Z test

As a first example of a hypothesis test in detail, we will consider independent observations X1, ..., Xn ∼ N(µ, σ²), with σ² known. We test H0: µ = µ0 against HA: µ ≠ µ0. We will use the sample mean X̄ to summarize the sample; the test statistic we use is

Z = (X̄ − µ0) / (σ/√n).

We know that if H0 is true, then

X̄ ∼ N(µ0, σ²/n),
X̄ − µ0 ∼ N(0, σ²/n),
Z ∼ N(0, 1).

A value of Z that is very large (judged against Z ∼ N(0, 1)) suggests that µ > µ0, and a value of Z that is very small suggests that µ < µ0; either case counts as evidence against H0 and in favour of HA. We quantify this by calculating the probability of getting a value as unfavourable as the one we actually got, or more unfavourable, under the assumption that H0 holds: the p-value. If we write zobs for the actual value we obtain for Z, i.e.

zobs = (x̄ − µ0) / (σ/√n),

then the p-value is

P(Z > |zobs| given H0) + P(Z < −|zobs| given H0).

Because of the symmetry of the normal distribution about zero, this is simply 2P(Z > |zobs| given H0). See Figure 8.1. Since Z ∼ N(0, 1) given H0, the p-value is

2(1 − Φ(|zobs|)).    (8.1)

The p-value from a Z-test can therefore be looked up in standard normal tables, or calculated easily in R.

Example 8.3 continued

In the temperature example, we had 25 normal observations, with standard deviation 0.3.
We test H0: µ = 50 against HA: µ ≠ 50. So the test statistic is

Z = (X̄ − 50) / (0.3/√25).

If we observed x̄ = 50.1, this would give

zobs = (50.1 − 50)/(0.3/5) = 0.1/0.06 = 1.667.

Using (8.1), the p-value is 2(1 − Φ(1.667)) = 0.096 (where we use the R command pnorm(1.667, 0, 1) = 0.9522 to find Φ(1.667)). This is quite a small number and gives some evidence against H0.

If instead we observed x̄ = 50.3, this would give zobs = 0.3/0.06 = 5. In this case, the p-value is 2(1 − Φ(5)) = 5.7 × 10⁻⁷. This gives considerable evidence against H0.

Figure 8.1: Calculation of the p-value for a Z-test. For zobs > 0, the p-value is given by (1 − Φ(zobs)) + Φ(−zobs), which simplifies to 2(1 − Φ(zobs)).

8.5 Interpretation of p-Values

8.5.1 General points

Firstly, note that the p-value is not the probability of the null hypothesis being true. Nor is it the probability of the null hypothesis being false. In the current framework, we cannot make probability statements about hypotheses, just as we cannot make them directly about parameters. The p-value is a measure of the evidence against the null hypothesis, calculated as the probability of seeing certain kinds of data given that H0 is true. If the p-value is very small, it indicates that either H0 is false, or something very unusual has happened. A very small p-value is evidence against H0, although of course it can never absolutely prove that H0 is false; something unusual could have happened. If the p-value is not small, all that we can say is that there is no evidence against H0. Remember that this does not prove that H0 is true. Note also that there is no difference in interpretation between, say, p ≈ 1, p ≈ 0.5 and p ≈ 0.2; all indicate that we have seen something that gives us no reason to doubt H0 in favour of HA.
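The Z-test p-value (8.1) can also be computed outside R; Python's standard library provides Φ through statistics.NormalDist. A sketch reproducing the numbers from Example 8.3:

```python
from math import sqrt
from statistics import NormalDist

phi = NormalDist().cdf  # standard normal cdf Φ

def z_test_p_value(xbar, mu0, sigma, n):
    """Two-sided Z-test: returns (zobs, p) with p = 2(1 - Φ(|zobs|))."""
    z_obs = (xbar - mu0) / (sigma / sqrt(n))
    return z_obs, 2 * (1 - phi(abs(z_obs)))

# Example 8.3: n = 25 measurements, sigma = 0.3, testing H0: mu = 50.
z1, p1 = z_test_p_value(50.1, 50, 0.3, 25)
z2, p2 = z_test_p_value(50.3, 50, 0.3, 25)
print(round(z1, 3), round(p1, 3))  # 1.667 0.096
print(round(z2, 3), p2)            # 5.0, with p around 5.7e-07
```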
8.5.2 Thresholds for p-values

The most common convention is that a p-value less than 0.05 is regarded as 'good' evidence to reject a null hypothesis. Of course, this threshold is somewhat arbitrary; a p-value just above it says more or less the same as one just below it. A p-value less than 0.01 is regarded as 'strong' evidence to reject the null hypothesis, and smaller and smaller p-values represent stronger and stronger evidence against it. This is not the same thing as evidence that H0 is 'more and more wrong', i.e. that θ is further and further from θ0. Any value above 0.1 is regarded as no evidence against the null hypothesis. As already mentioned, you should not attach any meaning to the difference between p-values of, say, 0.2 and 0.9; in both cases, there is simply no evidence against H0. A value between 0.05 and 0.1 is conventionally 'weak' or 'some' evidence against H0; sometimes it is treated as essentially no evidence.

Note that there is a connection between these conventional thresholds for p-values and the values for α usually used in constructing confidence intervals. We will explore the connection between hypothesis testing and interval estimation in more detail later in the module.

It is also important to appreciate that p-values, although extremely useful and widely used, are not infallible and can lead to misunderstandings if misused in conjunction with badly structured experiments. For this and other reasons, the journal "Basic and Applied Social Psychology" banned them in February 2015. This was a highly controversial move. To find out more about it, go to http://www.nature.com/news/psychology-journal-bans-p-values-1.17001 and/or http://www.statslife.org.uk/news/2116-academic-journal-bans-p-value-significance

An alternative approach to hypothesis testing called Bayesian statistics is becoming increasingly popular. This uses conditional probability and Bayes' theorem as inferential tools.
It is assumed that you have some prior probability P(H) that a hypothesis of interest is valid. You then gather data and use this as evidence E to update to a new posterior probability P(H | E), which is the conditional probability that the hypothesis is valid, given the weight of evidence that you have obtained. You can learn more about this approach in MAS364.

8.6 Testing a Normal Mean: Variance Unknown

If X1, ..., Xn are independent N(µ, σ²) random variables and σ² is not known, Z is no longer suitable as a test statistic; in fact it is no longer a statistic, since it depends on the unknown value of σ. There is a simple but important variant of the Z-test that can be used to test hypotheses about µ in this situation. Define

T = (X̄ − µ0) / (S/√n).

Then if H0: µ = µ0 is true, we know from section 7.4 that T ∼ tn−1. The hypothesis test based on this result is known as the t-test. More precisely, it is called the one-sample t-test, since it deals with a single sample of independent observations. The reason for the terminology is to distinguish this test from an important, closely related test called the two-sample t-test, based on a different structure for the data. Later in this module, we will look at the two-sample t-test in detail, along with a wider range of uses for the one-sample test.

In the most common case, where we have HA: µ ≠ µ0, we regard very large or very small values of T as evidence against H0. So the p-value is

P(T more extreme than tobs given H0)
  = P(T > |tobs| given H0) + P(T < −|tobs| given H0)
  = 2P(T > |tobs| given H0),

by the symmetry of the t distribution about zero. Since T ∼ tn−1 given H0, the p-value is 2(1 − Ftn−1(|tobs|)), where Ftn−1 is the cdf for the t distribution having n − 1 degrees of freedom. The cumulative probabilities for t distributions that are needed here can be found easily using R.
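The statistic itself needs nothing beyond arithmetic; it is only the t cdf that calls for R or tables. A sketch in Python of the test statistic, using the Earth-density summary values x̄ = 5.4717, s = 0.2183, n = 12 that appear in Example 8.4 below:

```python
from math import sqrt

def t_statistic(xbar, s, n, mu0):
    """One-sample t statistic T = (xbar - mu0)/(s/sqrt(n))."""
    return (xbar - mu0) / (s / sqrt(n))

# Earth-density data (Example 8.4): testing H0: mu = 5.517.
t_obs = t_statistic(5.4717, 0.2183, 12, 5.517)
print(round(t_obs, 3))  # -0.719; compare with the t distribution on 11 df in R
```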
Example 8.4 [Density of the Earth]

The accepted value for the density of the earth relative to water is 5.517. Are the sample data in Example 7.4 consistent with this hypothesis? We have H0: µ = 5.517 and HA: µ ≠ 5.517, and we have already seen that x̄ = 5.4717, s = 0.2183, n = 12. So

T = (X̄ − µ0) / (S/√n) = (X̄ − 5.517) / (S/√12),

and inserting the observed values for x̄ and s we get tobs = −0.719. We need to compare this with the t distribution with n − 1 = 11 degrees of freedom. We have |tobs| = 0.719. From R, pt(-0.719,11) or 1-pt(0.719,11) gives 0.2436 to 4 d.p., so the p-value is 0.487 to 3 d.p., and there is no evidence against H0.

This example can also be carried out entirely within R. The command t.test(earth,mu=5.517) produces the following output.

One-sample t-Test

data: earth
t = -0.7194, df = 11, p-value = 0.4869
alternative hypothesis: mean is not equal to 5.517
95 percent confidence interval:
 5.332970 5.610363
sample estimates:
mean of x
 5.471667

Note that R does not give any interpretation. Also, R does need to be told µ0 = 5.517 here, or it will assume µ0 = 0 by default. The test can also be carried out from the menus; again, the null mean 5.517 has to be filled in.

8.7 Fixed significance levels and error probabilities

8.7.1 Testing with a fixed significance level

Rather than simply quantifying evidence, often we need explicitly to decide whether or not to reject H0. This amounts to fixing a threshold α, the significance level of the test, and then rejecting H0 if the p-value is less than α. Note that this can equally well be done by calculating the corresponding threshold level of the test statistic, known as the critical value, and comparing the observed value with that. This is essentially the same notion of critical value that was used when we calculated confidence intervals. For example, in the Z-test with a two-sided HA, the p-value is 2(1 − Φ(|zobs|)).
To carry out a test with a fixed significance level of α, we reject H0 if

2(1 − Φ(|zobs|)) < α
  ⇔ 1 − Φ(|zobs|) < α/2
  ⇔ Φ(|zobs|) > 1 − α/2
  ⇔ |zobs| > Φ⁻¹(1 − α/2).

So we calculate Φ⁻¹(1 − α/2) and compare zobs with that. In R, Φ⁻¹(1 − α/2) can be found using the command qnorm(1 − α/2, 0, 1).

For the Z-test at the 5% level, the critical value is Φ⁻¹(1 − 0.05/2) = 1.96, so we reject the null hypothesis at the 5% level if |zobs| > 1.96 (without calculating the precise p-value; we know that p < 0.05). Similarly, at the 1% level, the critical value is Φ⁻¹(1 − 0.01/2) = 2.58. It is not quite so simple for the t-test; since there is a different distribution for each value of the degrees of freedom parameter, there is also a different critical value in each case.

The advantages of a test at a fixed level are that it is slightly easier, and that it gives a clear decision. The disadvantages are that it gives less information, and that it depends, of course, on the choice of the significance level. How we make that choice depends on the probabilities of different outcomes of the test and on their consequences.

Testing hypotheses at a fixed significance level uses a similar way of thinking to that we employed to calculate confidence intervals. In fact, if we are using the normal distribution, then for significance level α, we compute the critical value zα/2. If zobs is in the interval [−zα/2, zα/2], we do not reject H0. If zobs is outside the interval, we reject H0.

e.g. In Example 8.3, if we tested at the 5% significance level, zα/2 = 1.96. Observing x̄ = 50.1 gives zobs = 1.667, which is in the interval [−1.96, 1.96], so we do not reject H0. Observing x̄ = 50.3 gives zobs = 5, which is not in the interval [−1.96, 1.96], so we reject H0. These are the same conclusions that we reached before using p-values.
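The fixed-level decision rule can be checked numerically; Python's standard-library statistics.NormalDist supplies Φ⁻¹ directly, playing the role of qnorm in R (a sketch, not part of the notes' R workflow):

```python
from statistics import NormalDist

std_normal = NormalDist()  # N(0, 1)

def z_critical(alpha):
    """Two-sided critical value inv_Phi(1 - alpha/2) for a fixed-level Z-test."""
    return std_normal.inv_cdf(1 - alpha / 2)

def reject(z_obs, alpha):
    """Fixed-level decision rule: reject H0 iff |zobs| exceeds the critical value."""
    return abs(z_obs) > z_critical(alpha)

print(round(z_critical(0.05), 2))  # 1.96
print(round(z_critical(0.01), 2))  # 2.58
# Example 8.3 at the 5% level: zobs = 1.667 retains H0, zobs = 5 rejects it.
print(reject(1.667, 0.05), reject(5.0, 0.05))  # False True
```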
8.7.2 Types and Probabilities of Errors

With a test at a fixed significance level, we have two clear outcomes: either reject H0, or don't reject it. An important issue in choosing the significance level of a test, and in other choices such as sample size, is the probability that the test gives the wrong conclusion. There are two kinds of error that we can make. Suppose that we have a given significance level α.

• Type I error: reject the null hypothesis when it is in fact true.
• Type II error: fail to reject the null hypothesis when the alternative is true.

The first case is easier to deal with.

P(Type I error) = P(p-value < α given H0 true)
  = P(observed value is in the region leading to p < α, given H0 true)
  = α.

We can control the probability of a Type I error by choosing α. Note that it is not the probability that we are wrong on a given occasion, or even in the long run; it is a measure of the proportion of true null hypotheses that we would reject. Obviously, all other things being equal, the lower the significance level the better.

The second type of error is more complicated to quantify.

P(Type II error) = P(p-value > α given HA true).

It depends on the actual parameter value within HA. So usually, this error probability is a function of the parameter, θ say, as well as of α. In general, the higher the significance level, the less likely we are to fail to reject H0, for any given alternative. This is the argument for a higher significance level. In practice, choosing a significance level for a test represents a trade-off between the two sorts of errors. In fact there is a theoretical result, the Neyman-Pearson lemma, which, in principle, enables the probability of a Type II error to be minimised for a fixed probability of Type I error.

Sometimes it is more natural to think about the probability of successfully rejecting H0 when it is false.
This is just 1 − P(Type II error), so is again a function of the particular parameter value within HA, and is known as the power of the test. Another way of stating the trade-off is that we want high power but low Type I error probability.

Consider a Z-test with H0: µ = µ0 and HA: µ ≠ µ0. If we choose to carry out a test at the 5% level, we know that we reject H0 if |zobs| > 1.96. The probability of this is 0.05 if H0 is true; if HA is true, the probability of making a Type II error by not rejecting H0 depends on the actual value of the population mean. Given the actual value µ, for all i = 1, 2, ..., n,

Xi ∼ N(µ, σ²),
X̄ ∼ N(µ, σ²/n),
X̄ − µ0 ∼ N(µ − µ0, σ²/n),
Y ∼ N((µ − µ0)/(σ/√n), 1),

where

Y = (X̄ − µ0)/(σ/√n).

So the power of the test, for a particular µ ≠ µ0, is

P(Reject H0 given µ)
  = P(|Y| > 1.96 given µ)
  = P(|Y| > 1.96 given Y ∼ N((µ − µ0)/(σ/√n), 1))
  = P(Y < −1.96 given Y ∼ N((µ − µ0)/(σ/√n), 1)) + P(Y > 1.96 given Y ∼ N((µ − µ0)/(σ/√n), 1))
  = Φ(−1.96 − (µ − µ0)/(σ/√n)) + 1 − Φ(1.96 − (µ − µ0)/(σ/√n)).

Note the lack of symmetry; the two terms are not the same. In the above calculation, we chose α = 0.05, and so the critical value zcrit = 1.96. In the general case where α is not specified, we have

P(Reject H0 given µ) = Φ(−zcrit − (µ − µ0)/(σ/√n)) + 1 − Φ(zcrit − (µ − µ0)/(σ/√n)),

where zcrit is the critical value for the particular significance level we want.

Example 16: Temperature Measurement Revisited

In the temperature measurement example (Example 8.3), we have X1, ..., Xn ∼ N(µ, σ²), n = 25, σ = 0.3, H0: µ = 50, HA: µ ≠ 50. So

Z = (X̄ − 50)/(0.3/√25) = (X̄ − 50)/0.06,

and we will reject H0 at the 5% level if |zobs| > 1.96. If H0 is true, the probability of Type I error is 5%.
If instead HA is true, and the actual value is µ = 50.2, then

Y ∼ N((µ − µ0)/(σ/√n), 1) = N(0.2/0.06, 1) = N(3.333, 1),

and so the power of the test for µ = 50.2 is

P(Reject H0 given µ = 50.2)
  = P(|Y| > 1.96 given Y ∼ N(3.333, 1))
  = P(Y < −1.96 given Y ∼ N(3.333, 1)) + P(Y > 1.96 given Y ∼ N(3.333, 1))
  = Φ(−1.96 − 3.333) + 1 − Φ(1.96 − 3.333)
  = 6 × 10⁻⁸ + 0.915
  = 0.915 to 3 d.p.

So the power of the test, that is the probability of rejecting H0 when HA is true and µ = 50.2, is about 0.9; the Type II error probability is about 0.1.

If instead the actual value is 49.9, then

Y ∼ N((µ − µ0)/(σ/√n), 1) = N(−0.1/0.06, 1) = N(−1.667, 1),

and so

P(Reject H0 given µ = 49.9) = Φ(−1.96 − (−1.667)) + 1 − Φ(1.96 − (−1.667)) = 0.385 + 0.00014 = 0.385 to 3 d.p.

So when the true mean is µ = 49.9, the probability of rejecting H0 is much lower, around 0.4. The probability of a Type II error is more than 0.6. Alternative hypotheses close to µ0 are harder to detect.

We can repeat this calculation for a wide range of values of µ. Ideally a test should have a power function that is always high and a low significance level, so that all error probabilities are small; in practice, this is not possible, as the Neyman-Pearson lemma tells us.

As well as depending on the significance level of the test, the power function also depends on the sample size. Larger sample sizes give a better chance of rejecting the null hypothesis when it is false, for a test of a given significance level. If we carried out a Z-test at the 5% level based on just 9 temperature measurements, we would have

Z = (X̄ − 50)/(0.3/√9) = (X̄ − 50)/0.1,

still rejecting H0 for |zobs| > 1.96.
If instead HA is true, and the actual value is 50.2, then

Y ∼ N((µ − µ0)/(σ/√n), 1) = N(0.2/0.1, 1) = N(2, 1),

and so the power of the test for µ = 50.2 is

P(Reject H0 given µ = 50.2)
  = P(|Y| > 1.96 given Y ∼ N(2, 1))
  = P(Y < −1.96 given Y ∼ N(2, 1)) + P(Y > 1.96 given Y ∼ N(2, 1))
  = Φ(−1.96 − 2) + 1 − Φ(1.96 − 2)
  = 4 × 10⁻⁵ + 0.516
  = 0.516 to 3 d.p.

The power when µ = 50.2 is much lower with n = 9 than with n = 25.

Figure 8.2: The distribution of Z under HA with a particular value of µ. Compare this with the distribution of Z under H0, as in Figure 8.1.

Figure 8.3: The power function of the Z-test in the worked example, with sample size 9.

8.8 Two-sample problems

So far, we have concentrated on situations involving just a single sample of independent observations from some distribution. Much of the rest of the module is about relaxing these assumptions. We start by looking at problems involving two samples, where we may be interested in measuring the same quantity, but the distributions of the two populations may be different. Often we will be interested in estimating the difference between corresponding parameters in two distributions, or in testing whether or not there is a difference. Examples might be assessing the difference in effect of a medical treatment in men and in women, the relative popularities of two policies, or the effect of physical environment on how well individuals carry out certain tasks.

8.8.1 Estimators and Standard Errors

Firstly, let X1, ..., XnX and Y1, ..., YnY be i.i.d. random variables representing independent samples from two populations, with means µX and µY, and variances σ²X and σ²Y, respectively.
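Before moving to two samples, the power calculations of Section 8.7.2 can be reproduced numerically. A sketch of the power function of the two-sided Z-test, using the standard-library normal cdf; the three calls reproduce the n = 25 and n = 9 figures from Example 16:

```python
from math import sqrt
from statistics import NormalDist

phi = NormalDist().cdf  # standard normal cdf

def z_test_power(mu, mu0, sigma, n, z_crit=1.96):
    """Power Phi(-zcrit - delta) + 1 - Phi(zcrit - delta), delta = (mu - mu0)/(sigma/sqrt(n))."""
    delta = (mu - mu0) / (sigma / sqrt(n))
    return phi(-z_crit - delta) + 1 - phi(z_crit - delta)

# Example 16: H0: mu = 50, sigma = 0.3, test at the 5% level.
print(round(z_test_power(50.2, 50, 0.3, 25), 3))  # 0.915
print(round(z_test_power(49.9, 50, 0.3, 25), 3))  # 0.385
print(round(z_test_power(50.2, 50, 0.3, 9), 3))   # 0.516
```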
We already know that X̄ is an unbiased estimator for µX, with standard error σX/√nX, and similarly for Ȳ. If we are interested in the difference µX − µY, then the obvious estimator is X̄ − Ȳ. We have

E[X̄ − Ȳ] = E[X̄] − E[Ȳ] = µX − µY,

so this is an unbiased estimator of the difference. Since X1, ..., XnX and Y1, ..., YnY are all independent,

Var[X̄ − Ȳ] = Var[X̄ + (−1)Ȳ] = Var[X̄] + (−1)² Var[Ȳ] = σ²X/nX + σ²Y/nY.

So X̄ − Ȳ has standard error

√(σ²X/nX + σ²Y/nY).

As usual, when σ²X and σ²Y are not known, we can use the estimated standard error

√(s²X/nX + s²Y/nY).

On the other hand, if we have two samples that are clearly linked, as in Example 17 below, we may have nX = nY = n, say, but the two samples are not independent. In that example Xi and Yi are dependent, for any i; however, we can often assume that if i ≠ j, then Xi and Xj are independent, Yi and Yj are independent, and Xi and Yj are independent. The trick is to define new random variables

Di = Xi − Yi, i = 1, ..., n.

Note that these are signed differences, not absolute differences. Then Di and Dj are independent, for i ≠ j, and

E[Di] = E[Xi − Yi] = E[Xi] − E[Yi] = µX − µY.

So we can use D̄ (which is, of course, equal to X̄ − Ȳ) as an estimator of the difference µX − µY. It has standard error σD/√n, where σ²D = Var[Di]. The variance σ²D cannot be calculated in terms of σ²X and σ²Y, since it is affected by the dependence between Xi and Yi, but it can be estimated from the data. The estimated standard error of D̄ is sD/√n. This is not the same as in the case of independent samples.

Example 17: Reaction times

For a sample of people, reaction times to a flashing light are measured in ordinary daylight and in darkness, giving the following values (in seconds).
Reaction time in daylight:   2.3   2.2   2.4   2.3   2.0   2.3   2.8   1.7   2.4   2.5
Reaction time in the dark:   1.8   2.3   2.0   1.6   1.8   2.1   2.1   1.9   2.4   2.1
Difference:                  0.5  -0.1   0.4   0.7   0.2   0.2   0.7  -0.2   0.0   0.4

Note that each column corresponds to two measurements (and their difference) for a single person; the two measurements are definitely not independent. Write Xi for the reaction time in daylight of the ith person, Yi for that person's reaction time in darkness, Di for the (signed) difference, µX, σ²X etc. for the population means and variances, x̄, s²X etc. for the sample means and variances, and d̄ = x̄ − ȳ.

Our estimate of the population mean difference µD is d̄ = 0.28, with estimated standard error sD/√n = 0.316/√10 = 0.100. So the estimated mean difference in reaction times between daylight and darkness is 0.28 seconds, with estimated standard error 0.10 seconds.

8.8.2 Paired samples

We would like to be able to give interval estimates and carry out hypothesis tests in the case of paired observations. This would enable us to quantify the difference in a more satisfactory way, and to test (for example) whether there really is a difference between the population means. Again, the trick is to work with the signed differences Di. Under the assumptions mentioned above, the Di's are a random sample from the population of differences; we can often use familiar methods for inference on µD, provided that the other assumptions are met.

Example 18: Reactions Revisited

Assume that the differences in reaction times in Example 17 are normally distributed. Then, since the true standard deviation of the Di's is unknown, we can use the method of section 7.4, based on the t distribution, to get an interval estimate for the mean difference. The 95% interval is

d̄ ± t9,0.025 sD/√10 = 0.28 ± 2.262 × 0.316/√10 = (0.054, 0.506),

where we found t9,0.025 by using the R command qt(0.975, 9).
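The summary values d̄ = 0.28 and sD/√n ≈ 0.100 used above can be verified with Python's standard statistics module. A sketch of the paired-difference calculation for the Example 17 data:

```python
from math import sqrt
from statistics import mean, stdev

# Example 17: reaction times (seconds) in daylight and in the dark.
daylight = [2.3, 2.2, 2.4, 2.3, 2.0, 2.3, 2.8, 1.7, 2.4, 2.5]
dark     = [1.8, 2.3, 2.0, 1.6, 1.8, 2.1, 2.1, 1.9, 2.4, 2.1]

# Signed differences Di = Xi - Yi for each person.
d = [x - y for x, y in zip(daylight, dark)]
n = len(d)

d_bar = mean(d)          # estimate of the mean difference
se = stdev(d) / sqrt(n)  # estimated standard error sD/sqrt(n)
print(round(d_bar, 2), round(se, 3))  # 0.28 0.1
```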
If we wanted to test the null hypothesis that there is no difference, that is H0: µD = 0 against HA: µD ≠ 0, say, we could use a t-test on the Di's; this is known as a paired t-test. We define

T = (D̄ − 0) / (SD/√n),

which gives

tobs = 0.28/(0.316/√10) = 2.80.

There is good evidence against H0; from R, the p-value is 2(1 − pt(2.8, 9)) = 0.021.

Example 19: Wearing Shoes

In an experiment to compare two materials, 'X' and 'Y', for making shoes, one material is assigned at random to the left shoe of a child and the other to the right shoe. The amount of wear is measured as follows¹. In a real experiment there are likely to be many more observations.

Child                                  1     2     3     4     5
Wear with material 'X' (xi)          4.7  15.4   9.2   7.4   2.1
Wear with material 'Y' (yi)          3.0  15.2   7.8   5.4   1.6
Difference (di)                      1.7   0.2   1.4   2.0   0.5

We can test whether there is a real difference in wear by testing H0: µD = 0 against the alternative HA: µD ≠ 0. Again, provided the differences are normally distributed, we can use a paired t-test. We have

T = (D̄ − 0) / (SD/√n),

and under H0, T ∼ t4. Here

tobs = 1.16/(0.777/√5) = 3.34.

The p-value is

2P(T > |tobs| given H0)
  = 2(1 − P(T < |tobs| given H0))
  = 2(1 − P(T < 3.34 given T ∼ t4))
  = 2(1 − 0.9856)
  = 0.0288.

So there is good evidence of a difference in wear between the two materials. We can give a confidence interval for the mean difference in wear; a 95% interval is

d̄ ± t4,0.025 sD/√5 = 1.16 ± 2.776 × 0.777/√5 = (0.20, 2.12).

Interpretation of this is limited in this example, since the units are unknown; clearly the wear with material 'X' is greater than with 'Y', but the magnitude of the difference is very uncertain.

¹ This is based on a classic example originally from Box, G.E.P., W.G. Hunter and J.S. Hunter (1978) Statistics for Experimenters: an Introduction to Design, Data Analysis, and Model Building. New York: John Wiley. Units are not specified.
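The paired t statistics quoted in Examples 18 and 19 follow the same recipe. A sketch for the shoe-wear data; only tobs is computed, since the t cdf itself is the part that needs R or tables:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(x, y):
    """Paired t statistic tobs = dbar/(sD/sqrt(n)) for differences di = xi - yi."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

# Example 19: wear with materials 'X' and 'Y' on the same child's feet.
wear_x = [4.7, 15.4, 9.2, 7.4, 2.1]
wear_y = [3.0, 15.2, 7.8, 5.4, 1.6]
print(round(paired_t_statistic(wear_x, wear_y), 2))  # 3.34; compare with t4
```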
8.8.3 Independent samples

For independent samples, we can work directly with the raw data, or with summaries of each sample, rather than needing to calculate differences. Again, we will concentrate here on the case where the observations are normally distributed, so

Xi ∼ N(µX, σ²X), i = 1, ..., nX,

and

Yi ∼ N(µY, σ²Y), i = 1, ..., nY.

In this case, it follows that

X̄ − Ȳ ∼ N(µX − µY, σ²X/nX + σ²Y/nY).

If both σ²X and σ²Y are known, then this immediately allows us to construct confidence intervals and carry out hypothesis tests; we can use the fact that

((X̄ − Ȳ) − (µX − µY)) / √(σ²X/nX + σ²Y/nY) ∼ N(0, 1).

However, the much more common case is that the variances σ²X and σ²Y are unknown. We can replace them by their estimates from the samples, to get the estimated standard error, which suggests using

((X̄ − Ȳ) − (µX − µY)) / √(S²X/nX + S²Y/nY).

Unfortunately, this statistic does not have a t distribution, but under suitable conditions (provided neither nX nor nY is less than 5) its distribution can be approximated by a suitably chosen t distribution, tν. It can be shown that the degrees of freedom parameter, ν, lies between min{nX − 1, nY − 1} and nX + nY − 2, depending on the relative variances and sample sizes. The best approximation is given by

ν = (s²X/nX + s²Y/nY)² / [ (s²X/nX)²/(nX − 1) + (s²Y/nY)²/(nY − 1) ],

the Welch approximation, and this is what R uses. Note that ν is not necessarily an integer in this case. When doing calculations by hand, the simpler 'approximation' ν = min{nX − 1, nY − 1} is often used. We have, approximately,

((X̄ − Ȳ) − (µX − µY)) / √(S²X/nX + S²Y/nY) ∼ tν,

so a 100(1 − α)% confidence interval for µX − µY is given by

(x̄ − ȳ) ± tν,α/2 √(s²X/nX + s²Y/nY),

and a test of H0: µX − µY = µ0 against HA: µX − µY ≠ µ0 involves comparing

T = ((X̄ − Ȳ) − µ0) / √(S²X/nX + S²Y/nY)

with the tν distribution it would have if H0 were true.
Note that here, $\mu_0$ is the hypothesised value of $\mu_X - \mu_Y$; very often $\mu_0 = 0$, since the hypothesis of 'no difference' is naturally of interest.

Example 20: Mathematics teaching. An eight-week trial of teaching mathematics to children aged 6 years has been carried out. Those in Group 1 were regularly praised, whilst those in Group 2 were not. At the end of the trial, all children took an examination. A summary of the examination results is as follows, where $X_1, \ldots, X_{n_X}$ represent results obtained by children in Group 1, and $Y_1, \ldots, Y_{n_Y}$ those obtained by Group 2.

Group   Sample size   Sample mean        Sample standard deviation
1       $n_X = 21$    $\bar{x} = 51.48$  $s_X = 11.022$
2       $n_Y = 23$    $\bar{y} = 41.52$  $s_Y = 17.152$

To estimate the effect of the difference in teaching method, that is, the difference between $\mu_X$ and $\mu_Y$, we can use the estimate $\bar{x} - \bar{y} = 9.96$. The estimated standard error is
\[
\sqrt{s_X^2/n_X + s_Y^2/n_Y} = \sqrt{11.022^2/21 + 17.152^2/23} = 4.309.
\]
We can approximate the degrees of freedom by $\nu = \min\{n_X - 1, n_Y - 1\} = 20$. So a 95% confidence interval for $\mu_X - \mu_Y$ is
\[
9.96 \pm t_{20, 0.025} \times 4.309 = 9.96 \pm 2.086 \times 4.309 = (0.97, 18.95).
\]
A test of the hypothesis $H_0: \mu_X - \mu_Y = 0$, at a fixed level of 5%, can be carried out simply by noting that the confidence interval does not contain 0. To assess the evidence against $H_0$ more usefully, we can calculate a p-value. We have
\[
T = \frac{(\bar{X} - \bar{Y}) - 0}{\sqrt{S_X^2/n_X + S_Y^2/n_Y}}, \qquad t_{obs} = 9.96/4.309 = 2.31,
\]
and
\[
2P(T > |t_{obs}| \text{ given } H_0) = 2(1 - F_{t_{20}}(2.31)) = 0.0317.
\]
So there is good evidence ($p = 0.032$) that the difference in teaching methods does lead to a difference in performance; a 95% confidence interval for the mean difference goes from about 1 mark to 19 marks. Note that the two-sample $t$-test described here is sometimes called the Welch corrected, or Welch modified, (two-sample) $t$-test, and does not assume equal variances or equal sample sizes.
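A quick Python check of the hand arithmetic in Example 20 (again avoiding any packages; in R the whole analysis is a single t.test call on the raw data):

```python
import math

# sample summaries from Example 20
n_x, xbar, s_x = 21, 51.48, 11.022
n_y, ybar, s_y = 23, 41.52, 17.152

se = math.sqrt(s_x ** 2 / n_x + s_y ** 2 / n_y)  # estimated standard error
t_obs = (xbar - ybar) / se                        # test statistic under H0: no difference
```

This gives $se \approx 4.31$ and $t_{obs} \approx 2.31$; comparing 2.31 with the $t_{20}$ distribution then gives the quoted p-value of about 0.032.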
In special cases, the calculation of the test statistic or the degrees of freedom differs slightly. In R, the Welch modified version is carried out by the function t.test; for example, t.test(x, y). The equal-variance option (var.equal) defaults to FALSE, so the Welch version is used and no change is needed; if you do change this option, make sure you have good reason to assume equal variances. For more information see the R help pages, e.g. type ?t.test.

8.9 Confidence Intervals for a Proportion, Revisited

In this small final section, we revisit the work of section 7.2.1. Recall that we obtained a confidence interval (7.4) for an unknown proportion $p$ of the following form:
\[
\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}.
\]
Here we have a sample of size $n$, the sample proportion is $\hat{p}$ and $z_{\alpha/2}$ is the critical value. Now let's take another perspective. We ask the question: what values of $p_0$ would not be rejected if we tested the hypothesis $H_0: p = p_0$ against the alternative $H_A: p \neq p_0$? Given the level of significance $\alpha$, the test statistic takes values
\[
\frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}},
\]
and we will not reject $H_0$ if
\[
-z_{\alpha/2} \leq \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}} \leq z_{\alpha/2}.
\]
Consider the case where equality holds. Then we obtain the following:
\[
(\hat{p} - p_0)^2 = \frac{z_{\alpha/2}^2\, p_0(1 - p_0)}{n},
\]
and when we expand this and rearrange, we obtain a quadratic equation for the unknown $p_0$:
\[
p_0^2 \left(1 + \frac{z_{\alpha/2}^2}{n}\right) - p_0 \left(2\hat{p} + \frac{z_{\alpha/2}^2}{n}\right) + \hat{p}^2 = 0.
\]
When we solve this, after some algebraic manipulation (and you should check this), we get
\[
p_0 = \frac{\hat{p} + \dfrac{z_{\alpha/2}^2}{2n} \pm z_{\alpha/2} \sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n} + \dfrac{z_{\alpha/2}^2}{4n^2}}}{1 + \dfrac{z_{\alpha/2}^2}{n}}, \tag{8.2}
\]
and this gives the two limits for our required confidence interval. When $n$ is very large, some of the terms in (8.2) are negligible. Also, in the case where we seek a 95% confidence interval, $z_{\alpha/2} = 1.96 \approx 2$.
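Equation (8.2) can be checked numerically: the two limits it produces are exactly the values of $p_0$ at which $(\hat p - p_0)^2 = z_{\alpha/2}^2\, p_0(1-p_0)/n$, i.e. at which the test statistic sits on the boundary $\pm z_{\alpha/2}$. The data below ($\hat p = 0.25$, $n = 100$) are made up purely for illustration.

```python
import math

def interval_limits(phat, n, z=1.96):
    # the two roots p0 of (phat - p0)^2 = z^2 p0 (1 - p0) / n, as in (8.2)
    centre = phat + z * z / (2 * n)
    half = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - half) / denom, (centre + half) / denom

lo, hi = interval_limits(0.25, 100)
# each limit satisfies the boundary equation (up to floating-point error)
for p0 in (lo, hi):
    assert abs((0.25 - p0) ** 2 - 1.96 ** 2 * p0 * (1 - p0) / 100) < 1e-12
```

Note that the interval straddles $\hat p$ but, unlike the Wald interval (7.4), is not centred at it: its centre is pulled towards 1/2.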
Then, as you can check in the exercises, a good approximation to (8.2) is
\[
\tilde{p} \pm 1.96 \sqrt{\frac{\tilde{p}(1 - \tilde{p})}{n + 4}}, \tag{8.3}
\]
where if $\hat{p} = \frac{x}{n}$, then $\tilde{p} = \frac{x + 2}{n + 4}$. Note that (8.3) is the same as the Wald interval (7.4), but with two additional successes and two additional failures added to the data. For this reason (8.3) is called the plus four confidence interval for a proportion; the exact interval (8.2) that it approximates is known as the Wilson interval, after its creator.

Chapter 9
Count Data, Contingency Tables and Goodness of Fit

9.1 Hypothesis Tests on Proportions

Count data arise when we are interested in counting some observed aspect of an experiment, e.g. the number of Ebola virus outbreaks in a population, or the number of votes obtained by a candidate in an election. We might suppose, for example, that the population from which our count data are taken has a binomial, Poisson or multinomial distribution.

Suppose that each observation is either a "success" or a "failure", and we are counting the number of successes. If we have a sample of size $n$, and there are $x$ successes, then we may calculate the sample proportion $\hat{p} = x/n$. If $p$ is the true probability of success in the population, then we can test the hypothesis $H_0: p = p_0$ against $H_A: p \neq p_0$. When testing proportions it is sometimes more realistic to use a one-sided test and take either $H_A: p < p_0$ or $H_A: p > p_0$. If $n$ is large, then we can use the normal approximation to the binomial as our test statistic. In this case we should use a continuity correction, as described in section 5.4. So $Z \sim N(0, 1)$, where
\[
Z = \frac{X \pm \frac{1}{2} - np_0}{\sqrt{np_0(1 - p_0)}}, \tag{9.1}
\]
and $X$ is the random variable: the number of successes in $n$ trials.

Example 21. The manufacturers of "Happy Crunch" claim that at most 20% of all breakfast cereal buyers purchase the rival brand "Sweet Bliss". Test this claim at $\alpha = 0.01$ if a random check at several supermarket outlets found 58 purchases of "Sweet Bliss" among 200 cereal buyers.

Solution.
We test $H_0: p = 0.2$ against $H_A: p > 0.2$. As this is a one-sided test, we compare with $z_\alpha$ and not $z_{\alpha/2}$. Using the R command qnorm(0.99, 0, 1), we have $z_\alpha = 2.326$. Substituting $x = 58$, $n = 200$ and $p_0 = 0.2$ into (9.1), we get
\[
z_{obs} = \frac{57.5 - 40}{\sqrt{200(0.2)(0.8)}} = 3.09.
\]
Since $3.09 > 2.326$, we reject $H_0$, and conclude that the evidence suggests that more than 20% of cereal buyers purchase "Sweet Bliss".

In more complicated situations we may want to test for differences between proportions. For example, suppose that we have samples of voters in 12 different cities and we want to know if the proportion favouring a given political party is the same in each. To model this and similar problems, we suppose that we have $r$ independent random variables $X_1, X_2, \ldots, X_r$, where $X_k \sim \text{Binom}(p_k, n_k)$ for $k = 1, 2, \ldots, r$. In the example just mentioned, we'd have $r = 12$; $p_k$ is the probability of a randomly chosen person in the $k$th city supporting the party, and $n_k$ is the number of people sampled in city number $k$. If the $n_k$'s are sufficiently large, we can use the CLT to approximate each of these by a standard normal $Z_k \sim N(0, 1)$, where
\[
Z_k = \frac{X_k - n_k p_k}{\sqrt{n_k p_k (1 - p_k)}}.
\]
But it is not convenient for hypothesis testing to deal with $r$ different test statistics. It is better to use just one random variable. In fact, from section 7.5 we know that $Q \sim \chi^2_r$, where
\[
Q = \sum_{k=1}^{r} \frac{(X_k - n_k p_k)^2}{n_k p_k (1 - p_k)}, \tag{9.2}
\]
and this is the test statistic that we need. When we replace the random variables $X_k$ by the observed values $x_k$ in (9.2), we will write $q$ instead of $Q$. There is a convenient way to think about this problem, which has far-reaching generalisations. Suppose that we arrange the data in a table as follows:

             Successes   Failures
Sample 1     $x_1$       $n_1 - x_1$
Sample 2     $x_2$       $n_2 - x_2$
...          ...         ...
Sample $r$   $x_r$       $n_r - x_r$

The numbers that appear in this table are called observed cell frequencies and are denoted $f_{ij}$ for $i = 1, 2, \ldots$,
$r$ and $j = 1, 2$, so $f_{i1} = x_i$ and $f_{i2} = n_i - x_i$. Now suppose that we test the null hypothesis $H_0: p_1 = p_2 = \cdots = p_r$ against $H_A$: at least two of the $p_k$'s are different. If $H_0$ is true, and $p_0$ is the common value of the $p_k$'s, then the expected cell frequencies are the numbers $e_{ij}$ for $i = 1, 2, \ldots, r$ and $j = 1, 2$, where $e_{i1} = n_i p_0$ and $e_{i2} = n_i(1 - p_0)$. The following result has important consequences.

Proposition 9.1.1.
\[
q = \sum_{i=1}^{r} \sum_{j=1}^{2} \frac{(f_{ij} - e_{ij})^2}{e_{ij}}. \tag{9.3}
\]

Proof. Since we don't yet know it is $q$, let $q'$ denote the right hand side of (9.3). Noting that $n_i - x_i - n_i(1 - p_0) = -(x_i - n_i p_0)$, so that both squared terms are equal, we have
\begin{align*}
q' &= \sum_{i=1}^{r} \left\{ \frac{(x_i - n_i p_0)^2}{n_i p_0} + \frac{[n_i - x_i - n_i(1 - p_0)]^2}{n_i(1 - p_0)} \right\} \\
&= \sum_{i=1}^{r} (x_i - n_i p_0)^2 \left[ \frac{1}{n_i p_0} + \frac{1}{n_i(1 - p_0)} \right] \\
&= \sum_{i=1}^{r} \frac{(x_i - n_i p_0)^2}{n_i p_0 (1 - p_0)} = q,
\end{align*}
by (9.2), given that $H_0$ is true.

9.2 Contingency Tables

In the last section we dealt with binomial random variables, where there are only two possible outcomes, "success" or "failure". The multinomial distribution generalises this to the case where there are $c \geq 2$ possible outcomes. It was covered in detail in Section 3.5, Chapter 3 of the Semester 1 notes. Here we give a brief reminder. We have a sequence of $n$ independent trials, each of which has $c \geq 2$ possible outcomes, with probabilities $\theta_1, \theta_2, \ldots, \theta_c$ respectively, where $\theta_1 + \cdots + \theta_c = 1$. Define $X_1$ to be the number of times outcome 1 occurs, $X_2$ to be the number of times outcome 2 occurs, ..., and $X_c$ to be the number of times outcome $c$ occurs. The joint distribution of the random variables $X_1, X_2, \ldots, X_c$ is called the multinomial distribution with parameters $n, \theta_1, \theta_2, \ldots, \theta_c$, and it is denoted by $\text{Mn}(n; \theta_1, \theta_2, \ldots, \theta_c)$. The joint probability function of $X_1, X_2, \ldots, X_c$ is
\[
p_{X_1 \cdots X_c}(x_1, \ldots, x_c) =
\begin{cases}
\dfrac{n!}{x_1! \cdots x_c!}\, \theta_1^{x_1} \cdots \theta_c^{x_c} & \text{if } x_1 + \cdots + x_c = n, \\[2mm]
0 & \text{otherwise.}
\end{cases}
\]
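Proposition 9.1.1 is also easy to verify numerically: with any counts, the two-column cell form (9.3) and the original form (9.2) agree exactly. The counts and $p_0$ below are made up for illustration.

```python
xs = [12, 18, 9]    # observed successes in each sample (hypothetical)
ns = [40, 50, 30]   # sample sizes (hypothetical)
p0 = 0.35           # common success probability under H0 (hypothetical)

# original form (9.2)
q = sum((x - n * p0) ** 2 / (n * p0 * (1 - p0)) for x, n in zip(xs, ns))

# cell form (9.3): one term for successes, one for failures, in each sample
q_cells = sum((x - n * p0) ** 2 / (n * p0)
              + ((n - x) - n * (1 - p0)) ** 2 / (n * (1 - p0))
              for x, n in zip(xs, ns))
```

The two expressions agree to floating-point precision, as the proof requires.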
Using this distribution, we can generalise the table we compiled in the last section, where we had $r$ different samples, each observation being a success or a failure, to the case where there are $r$ samples, each with $c$ possible outcomes. The resulting table (we'll see an example below) is called an $r \times c$ contingency table.

             Type 1     Type 2     ...   Type $c$
Sample 1     $f_{11}$   $f_{12}$   ...   $f_{1c}$
Sample 2     $f_{21}$   $f_{22}$   ...   $f_{2c}$
...          ...        ...              ...
Sample $r$   $f_{r1}$   $f_{r2}$   ...   $f_{rc}$

For example, suppose that we visit 12 different cities during the UK General Election, and ask voters in each city if they plan to vote Conservative, Labour, Liberal Democrat, Green, Ukip, Other, or are Undecided. Then we will compile a 12 × 7 contingency table of data. When we have a contingency table, there are many different hypotheses that we might test. Here we will just look at one of these, and test for independence of rows and columns. To be precise, let's suppose that $p_{ij}$ is the joint probability that the $i$th population has outcome $j$. Then we test the hypothesis $H_0: p_{ij} = p_i u_j$ for all $i, j$ against $H_A: p_{ij} \neq p_i u_j$ for at least one value of $i$ and $j$. Here $p_i$ and $u_j$ are the marginal probabilities: $p_i = \sum_{j=1}^{c} p_{ij}$ for $i = 1, 2, \ldots, r$, and $u_j = \sum_{i=1}^{r} p_{ij}$ for $j = 1, 2, \ldots, c$. Generalising (9.3), our test statistic is
\[
q = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(f_{ij} - e_{ij})^2}{e_{ij}}, \tag{9.4}
\]
where, as before, $e_{ij}$ are the expected cell frequencies if $H_0$ were true, and $f_{ij}$ are the observed cell frequencies. The next point is very important. We reject $H_0$ if $q > \chi^2_{\nu, 1-\alpha}$, where $\nu = (r - 1)(c - 1)$ and $\alpha$ is a given significance level. Alternatively (as in Example 22 below), we reject $H_0$ if the p-value $P(Q > q \mid H_0)$ is smaller than 0.05, where the random variable $Q$ has a chi-squared distribution with $\nu$ degrees of freedom. Why is $\nu = (r - 1)(c - 1)$?
In general, when we use the chi-squared distribution for count data, we have $\nu = s - t - 1$, where $s$ is the total number of terms in the summation, and $t$ is the number of independent parameters that are estimated from the data. In our case, $s = rc$. We must estimate the $r$ parameters $p_i$ by $f_i = \sum_{j=1}^{c} f_{ij}/f$, where $f = \sum_{i=1}^{r} \sum_{j=1}^{c} f_{ij}$ is the total sum of all the count data points. Since $\sum_{i=1}^{r} f_i = \frac{1}{f} \sum_{i=1}^{r} \sum_{j=1}^{c} f_{ij} = 1$, only $r - 1$ of these are independent. Similarly, we must estimate the $c$ parameters $u_j$ by $g_j = \sum_{i=1}^{r} f_{ij}/f$, and only $c - 1$ of these are independent. Hence
\[
\nu = s - t - 1 = rc - (r - 1) - (c - 1) - 1 = (r - 1)(c - 1).
\]

Example 22: Restaurant Types. The following table presents data on restaurants in the UK. In this case $r = c = 3$; the rows label the type of ownership, while the columns describe the nature of the food served. The table gives the count data along with row and column totals for 259 restaurants. Is there any evidence of a relationship between owner type and food type?

                               Food type
Owner                    1: Fast food   2: Ethnic   3: Other   Total
1: Sole proprietorship   42             30          32         104
2: Partnership           8              9           7          24
3: Corporation           59             35          37         131
Total                    109            74          76         259

The estimates of the marginal probabilities for "owner type" are
\[
f_1 = 104/259 \approx 0.402, \quad f_2 = 24/259 \approx 0.093, \quad f_3 = 131/259 \approx 0.506.
\]
Similarly, for "food type" we have
\[
g_1 = 109/259 \approx 0.421, \quad g_2 = 74/259 \approx 0.286, \quad g_3 = 76/259 \approx 0.293.
\]
The expected counts $e_{ij} = n f_i g_j$ (with $n = 259$) are then as shown in the following table (to 1 d.p.).

                               Food type
Owner                    1: Fast food   2: Ethnic   3: Other   Total
1: Sole proprietorship   43.8           29.7        30.5       104
2: Partnership           10.1           6.9         7.0        24
3: Corporation           55.1           37.4        38.4       131
Total                    109            74          76         259

The $\chi^2$ statistic is calculated using (9.4) to give
\[
q = (42 - 43.8)^2/43.8 + (30 - 29.7)^2/29.7 + \cdots,
\]
and we get $q = 1.736$ (3 d.p.). We then compare $q$ with the $\chi^2$ distribution with $(r - 1)(c - 1) = 4$ degrees of freedom.
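The whole of Example 22 can be reproduced in a few lines. The sketch below is pure Python (R's chisq.test does all of this in one call); it uses the fact that for an even number of degrees of freedom $2k$, the chi-squared upper-tail probability has the closed form $e^{-q/2}\sum_{i=0}^{k-1}(q/2)^i/i!$, which for 4 degrees of freedom is $e^{-q/2}(1 + q/2)$.

```python
import math

# observed counts from Example 22 (rows: owner type, columns: food type)
obs = [[42, 30, 32],
       [8, 9, 7],
       [59, 35, 37]]

row = [sum(r) for r in obs]                       # 104, 24, 131
col = [sum(r[j] for r in obs) for j in range(3)]  # 109, 74, 76
total = sum(row)                                  # 259

# q from (9.4), with expected counts e_ij = (row total)(column total)/total
q = sum((obs[i][j] - row[i] * col[j] / total) ** 2 / (row[i] * col[j] / total)
        for i in range(3) for j in range(3))

# upper-tail probability of chi-squared with 4 df: exp(-q/2)(1 + q/2)
p = math.exp(-q / 2) * (1 + q / 2)
```

This reproduces $q = 1.736$ and the p-value $0.784$ quoted from 1-pchisq(1.736, 4).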
The p-value for testing the null hypothesis of independence against the alternative of some form of dependence is 1 - pchisq(1.736, 4) = 0.784 (3 d.p.), and so we conclude that there is no evidence against the null hypothesis.

For theoretical reasons that we won't go into here, we should not use the chi-squared test arbitrarily. One rule of thumb is to require that none of the $e_{ij}$'s is smaller than 5. A weaker one that is also used is that all $e_{ij} > 1$ and at least 80% of the $e_{ij}$'s are larger than 5. This also applies to the work in the next section. If some of the $e_{ij}$'s do turn out to be smaller than 5, then we should combine cells together (which also reduces the number of degrees of freedom).

9.3 Goodness of Fit

This applies to situations where we want to determine whether a given data set is a random sample from a population that has a hypothesised probability distribution. Suppose that we have $m$ count data points, and that the observed frequencies are $O_1, O_2, \ldots, O_m$. We want to compare these with the expected frequencies $E_1, E_2, \ldots, E_m$ if the hypothesis were true. We compute the test statistic
\[
q = \sum_{i=1}^{m} \frac{(O_i - E_i)^2}{E_i},
\]
and we reject $H_0$ if $q$ exceeds the appropriate critical value $\chi^2_{m-t-1, 1-\alpha}$, where $t$ is the number of independent parameters that are estimated from the data.

Example 23: Poisson counts. In a long sequence of observations on the behaviour of a laboratory animal, it is hypothesised that the number of actions categorised as 'grooming' in a day should follow a Poisson distribution (see Section 3.4.3 of Chapter 3) with unknown mean. Actual numbers of actions, over 60 days, are as shown in the following table. Are these consistent with a Poisson distribution?

Number of actions   0    1   2    3    4   5   6   7 or more
Frequency           13   9   10   16   5   5   2   0

Estimating the Poisson mean as the mean of the observations gives a value of
\[
\frac{9 + (10 \times 2) + (16 \times 3) + (5 \times 4) + (5 \times 5) + (2 \times 6)}{13 + 9 + 10 + 16 + 5 + 5 + 2} = \frac{134}{60} = 2.233 \text{ (to 3 d.p.)}.
\]
The probabilities of the different possible values are then 0.107, 0.239, 0.267, 0.199, 0.111, 0.050, 0.018, 0.006, 0.002, ... (to 3 d.p.), using dpois(0:8, 2.233) in R. The expected numbers of observations taking these values are obtained by multiplying by $n = 60$, giving 6.4, 14.4, 16.0, 11.9, 6.7, 3.0, 1.1, 0.4, 0.1, .... To give high enough expected values in all classes, we combine values of 4 and over into a single class.

Class             0     1      2      3      4 or more   Total
Observed number   13    9      10     16     12          60
Expected number   6.4   14.4   16.0   11.9   11.2        60
$\chi^2$ term     6.7   2.0    2.3    1.4    0.1         12.4

The value $q = 12.4$ should then be compared with the distribution of $Q$ under the null hypothesis; bearing in mind that one parameter has been estimated (using extra information: the original data, not just the counts in the final classes), that distribution is approximately $\chi^2_{5-1-1} = \chi^2_3$ (and, more precisely, lies between $\chi^2_3$ and $\chi^2_4$). From R, 1-pchisq(12.4, 3) is 0.0061, and so there is strong evidence that the data do not come from a Poisson distribution. It is clear from comparing observed and expected numbers that there are more zeroes in the data than would be expected from a Poisson distribution with the appropriate mean, and that the rest of the data are rather higher than expected.
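The goodness-of-fit calculation of Example 23 can also be reproduced end to end. The sketch below is again dependency-free Python (in R the final step is 1-pchisq(q, 3)); the `chi2_sf` helper approximates the chi-squared upper-tail probability by numerically integrating the density with math.gamma for the normalising constant.

```python
import math

# observed grooming counts over 60 days, from Example 23
counts = {0: 13, 1: 9, 2: 10, 3: 16, 4: 5, 5: 5, 6: 2, 7: 0}
n = sum(counts.values())                          # 60
mean = sum(k * c for k, c in counts.items()) / n  # 134/60 = 2.233...

pois = lambda k: math.exp(-mean) * mean ** k / math.factorial(k)

# expected counts for 0..3, with values of 4 and over pooled into one class
probs = [pois(k) for k in range(4)]
expected = [n * p for p in probs] + [n * (1 - sum(probs))]
observed = [counts[0], counts[1], counts[2], counts[3],
            counts[4] + counts[5] + counts[6] + counts[7]]

q = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_sf(x, df, steps=20000):
    # upper-tail probability of chi-squared, by trapezoidal integration of the density
    c = 1 / (2 ** (df / 2) * math.gamma(df / 2))
    pdf = lambda t: c * t ** (df / 2 - 1) * math.exp(-t / 2)
    h = x / steps
    area = 0.5 * (pdf(0) + pdf(x)) * h + sum(pdf(i * h) for i in range(1, steps)) * h
    return 1 - area

p = chi2_sf(q, 3)
```

This recovers $q \approx 12.4$ and $p \approx 0.006$, matching the values in the text.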