MSB (ST217: Mathematical Statistics B)

Aim
To review, expand & apply the ideas from MSA. In particular, MSA mainly studied one unknown quantity at a time. In MSB we'll study interrelationships.

Lectures & Classes
  Monday     12–1    R0.21
  Wednesday  10–11   R0.21
  Thursday   1–2     PLT
Examples classes will begin in week 3.

Style
• Lectures will be supplemented (NOT replaced!!) with printed notes. Please take care of these notes; duplicates may not be readily available.
• I shall teach mainly by posing problems (both theoretical and applied) and working through them.

Contents
1. Overview of MSA.
2. Bivariate & Multivariate Probability Distributions. Joint distributions, conditional distributions, marginal distributions; conditional expectation. The χ², t, F and multivariate Normal distributions and their interrelationships.
3. Inference for Multiparameter Models. Likelihood, frequentist and Bayesian inference, prediction and decision-making. Comparison between various approaches. Point and interval estimation. Classical simple and composite hypothesis testing, likelihood ratio tests, asymptotic results.
4. Linear Statistical Models. Linear regression, multiple regression & analysis of variance models. Model choice, model checking and residuals.
5. Further Topics (time permitting). Nonlinear models, problems & paradoxes, etc.

Books
The books recommended for MSA are also useful for MSB. Excellent books on mathematical statistics are:
1. 'Statistical Inference' by George Casella & Roger L. Berger [C&B], Duxbury Press (1990),
2. 'Probability and Statistics' by Morris DeGroot, Addison-Wesley (2nd edition 1989).
A good book discussing the application and interpretation of statistical methods is 'Introduction to the Practice of Statistics' by Moore & McCabe [M&M], Freeman (3rd edition 1998). Many of the data sets considered below come from the 'Handbook of Small Data Sets' [HSDS] by Hand et al., Chapman & Hall, London (1994). There are many other useful references on mathematical statistics available in the library, including books by Hogg & Craig [H&C], Lindgren, Mood, Graybill & Boes [MG&B], and Rice.

These notes are copyright © 1998, 1999, 2000, 2001 by J. E. H. Shaw.

Chapter 1  Overview of MSA

1.1 Basic Ideas

1.1.1 What is 'Statistics'?
Statistics may be defined as:
'The study of how information should be employed to reflect on, and give guidance for action in, a practical situation involving uncertainty.' [italics by JEHS]
Vic Barnett, Comparative Statistical Inference

Figure 1.1: A practical situation involving uncertainty

1.1.2 Statistical Modelling
The emphasis of modern statistics is on modelling the patterns and interrelationships in the existing data, and then applying the chosen model(s) to predict future data. Typically there is a measurable response (for example, reduction Y in a patient's blood pressure) that is thought to be related to explanatory variables x_j (for example, treatment applied, dose, patient's age, weight, etc.). We seek a formula that relates the observed responses to the corresponding explanatory variables, and that can be used to predict future responses in terms of their corresponding explanatory variables:

  Observed Response = Fitted Value + Residual,
  Future Response = Predicted Value + Error.

Here the fitted values should take account of all the consistent patterns in the data, and the residuals represent the remaining random variation.
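To make the 'Observed Response = Fitted Value + Residual' decomposition concrete, here is a minimal sketch in Python (not part of the original notes; the simulated data, sample size and coefficients are purely illustrative) that fits a straight line relating a response to a single explanatory variable, splits each observation into a fitted value plus a residual, and uses the same formula to predict a future response.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative simulated data: response y (e.g. reduction in blood pressure)
# assumed to depend roughly linearly on one explanatory variable x (e.g. dose).
n = 50
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 1.5 * x + rng.normal(0.0, 2.0, size=n)   # consistent pattern + random variation

# Fit a straight line by least squares.
slope, intercept = np.polyfit(x, y, deg=1)

fitted = intercept + slope * x      # the consistent pattern captured by the model
residuals = y - fitted              # the remaining random variation

# Observed response = fitted value + residual, exactly.
assert np.allclose(y, fitted + residuals)

# A future response at a new x is predicted from the same fitted formula.
x_new = 7.0
predicted = intercept + slope * x_new
print(f"fitted line: y = {intercept:.2f} + {slope:.2f} x; prediction at x = {x_new}: {predicted:.2f}")
```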
1.1.3 Prediction and Decision-Making Always remember that the main aim in modelling as above is to predict (for example) the effects of different medical treatments, and hence to decide which treatment to use, and in what circumstances. The fundamental assumption is that the future data will be in some sense similar to existing data. The ideas of exchangeability and conditional independence are crucial. The following notation is useful: X⊥ ⊥Y X⊥ ⊥ Y |Z ‘X is independent of Y ’, i.e. Y gives you no information about X, ‘X is conditionally independent of Y given Z’, i.e. if you know the value taken by the RV Z, then Y gives you no further information about X. Most methods of statistical inference proceed indirectly from what we know (the observed data and any other relevant information) to what we really want to know (future, as yet unobserved, data), by assuming that the random variation in the observed data can be thought of as a sample from an underlying population, and learning about the properties of this population. 1.1.4 Known and Unknown Features of a Statistical Problem A statistic is a property of a sample, whereas a parameter is a property of a population. Often it’s natural to estimate a parameter θ (such as the population mean µ) by the corresponding property of the sample (here the sample mean X). Note that θ may be a vector or more complicated object. Unobserved quantities are treated mathematically as random variables. Potentially observable quantities are usually denoted by capital letters (Xi , X, Y etc.) Once the data have been observed, the values taken by these random variables are known (Xi = xi , X = x etc.) Unobservable or hypothetical quantities are usually denoted by Greek letters (θ, µ, σ 2 etc.), and estimators are often denoted by putting a hat on the b µ corresponding symbol (θ, b, σ b2 etc.) Nearly all statistics books use the above style of notation, so it will be adopted in these notes. However, sometimes I shall wish to distinguish carefully between knowns and unknowns, and shall denote all unknowns by capitals. Thus Θ represents an unknown parameter vector, and θ represents a particular assumed value of Θ. This is especially useful when considering probability distributions for parameters; one can then write fΘ (θ) and Pr(Θ = θ) by exact analogy with fX (x) and Pr(X = x). The set of possible values for a RV X is called its sample space ΩX . Similarly the parameter space ΩΘ is the set of possible values for the parameter Θ. 3 1.1.5 Likelihood In general, we can infer properties θ of the population by comparing how compatible are the various possible values of θ with the observed data. This motivates the idea of likelihood (equivalently, log-likelihood or support). We need a probability model for the data, in which the probability distribution of the random variation is a member of a (realistic but mathematically tractable) family of probability distributions, indexed by a parameter θ. Likelihood-based approaches have both advantages and disadvantages— Advantages Disadvantages Unified theory (many practical problems can be tackled in essentially the same way). Is the theory directly relevant? (is likelihood alone enough? and how do we balance realism and tractability?) Often get simple sufficient statistics (hence we can summarise a huge data set by a few simple properties). If the probability model is wrong, then results can be misleading (e.g. if one assumes a Normal distribution when the true distribution is Cauchy). 
CLT suggests likelihood methods work well when there's loads of data. One seldom has loads of data!

1.1.6 Where Will We Go from Here?
• MSA provided the mathematical toolbox (e.g. probability theory and the idea of random variables) for studying random variation.
• MSB will add to this toolbox and study interrelationships between (random) variables.
• We shall also consider some important general forms for the fitted/predicted values, in particular linear models and their generalizations.

1.2 Sampling Distributions
Statistical analysis involves calculating various statistics from the data, for example the maximum likelihood estimator (MLE) θ̂ for θ. We want to understand the properties of these statistics; hence the importance of the central limit theorem (CLT) & its generalizations, and of studying the probability distributions of transformed random variables.
If we have a formula for a summary statistic S, e.g. S = Σ X_i / n = X̄, and are prepared to make certain assumptions about the original random variables X_i, then we can say things about the probability distribution of S.
The probability distribution of a statistic S, i.e. the pattern of values S would take if it were calculated in successive samples similar to the one we actually have, is called its sampling distribution.

1.2.1 Typical Assumptions
1. Standard Assumption (IID RVs): the X_i are IID (independent and identically distributed) with (unknown) mean µ and variance σ². This implies
   (a) E[X̄] = E[X_i] = µ, and
   (b) Var[X̄] = (1/n) Var[X_i] = σ²/n.
   (c) If we define the standardised random variables Z_n = (X̄ − µ) / √(σ²/n), then as n → ∞, the distribution of Z_n tends to the standard Normal N(0, 1) distribution.
2. Additional Assumption (Normality): the X_i are IID Normal: X_i ~ N(µ, σ²). This implies that X̄ ∼ N(µ, σ²/n).

1.2.2 Further Uses of Sampling Distributions
We can also
• compare various plausible estimators (e.g. to estimate the centre of symmetry of a supposedly symmetric distribution we might use the sample mean, median, or something more exotic),
• obtain interval estimates for unknown quantities (e.g. 95% confidence intervals, HPD intervals, support intervals),
• test hypotheses about unknown quantities.

Comments
1. Note the importance of expectations of (possibly transformed) random variables:
   E[X] = µ (measure of location),
   E[(X − µ)²] = σ² (measure of scale),
   E[e^{sX}] = moment generating function,
   E[e^{itX}] = characteristic function.
2. We must always consider whether the assumptions made are reasonable, both from general considerations (e.g.: is independence reasonable? is the assumption of identical distributions reasonable? is it reasonable to assume that the data follow a Poisson distribution? etc.) and with reference to the observed set of data (e.g. are there any 'outliers', that is, unreasonably extreme values, or unexpected patterns?)
3. Likelihood and other methods suggest estimators for unknown quantities of interest (parameters etc.) under certain specified assumptions. Even if these assumptions are invalid (and in practice they always will be to some extent!) we may still want to use summary statistics as estimators of properties of the underlying population. Therefore
   (a) We'll want to investigate the properties of estimators under various relaxed assumptions, for example partially specified models that use only the first and second moments of the unknown quantities.
   (b) It's useful if the calculated statistics (e.g. MLEs) have an intuitive interpretation (like 'sample mean' or 'sample variance').

1.3 (Revision?)
Problems 1. First-year students attending a statistics course were asked to carry out the following procedure: Toss two coins, without showing anyone else the results. If the first coin showed ‘Heads’ then answer the following question: “Did the second coin show ‘Heads’ ? (Yes or No)” If the first coin showed ‘Tails’ then answer the following question: “Have you ever watched a complete episode of ‘Teletubbies’ ? (Yes or No)” The following results were recorded: Males Females Yes 84 23 No 48 24 For each sex, and for both sexes combined, estimate the proportion who have watched a complete episode of ‘Teletubbies’. Using a chi-squared test, or otherwise, test whether the proportions differ between the sexes. Discuss the assumptions you have made in carrying out your analysis. 2. Let X and Y be IID RVs with a standard Normal N (0, 1) distribution, and define Z = X/Y . (a) Write down the lower quartile, median and upper quartile of Z, i.e. the points z25 , z50 & z75 such that Pr(Z < zk ) = k/100. (b) Show that Z has a Cauchy distribution, with PDF 1/π(z 2 + 1). HINT : consider the transformation Z = X/Y and W = |Y |. 3. Let X1 , . . . Xn be mutually independent RVs, with respective MGFs (moment generating functions) MX1 (t), . . . , MXn (t), and let a1 , . . . , an and b1 , . . . , bn be fixed constants. Show that the MGF of Z = (a1 X1 + b1 ) + (a2 X2 + b2 ) + · · · + (an Xn + bn ) is P MZ (t) = exp t bi MX1 (a1 t) × · · · × MXn (an t). Hence or otherwise show that any linear combination of independent Normal RVs is itself Normally distributed. 4. A workman has to move a rectangular stone block a short distance, but doesn’t want to strain himself. He rapidly estimates: • height of block = 10 cm, with standard deviation 1 cm. • width of block = 20 cm, with standard deviation 3 cm. • length of block = 25 cm, with standard deviation 4 cm. • density of block = 4.0 g/cc, with standard deviation 0.5 g/cc. Assuming these estimates are mutually independent, calculate his estimates of the volume V (cc) and total weight W (Kg) of the block, and their standard deviations. The workman fears that he might hurt his back if W ≥ 30. Using Chebyshev’s inequality, give an upper bound for his probability Pr(W ≥ 30). [Chebyshev’s inequality states that if X has mean µ & variance σ 2 , then Pr(|X − µ| ≥ c) ≤ σ 2 /c2 —see MSA]. What is the workman’s value for Pr(W > 30) under the additional assumption that W is Normally distributed? Compare this value with the bound found earlier. How reasonable are the independence and Normality assumptions used in the above analysis? 6 5. Calculate the MLE of the centre of symmetry θ, given IID RVs X1 , X2 , . . . , Xn , where the common PDF fX (x) of the Xi s is (a) Normal (or Gaussian): fX (x|θ, σ) = √ 2 1 exp − 12 (x − θ)/σ 2πσ (b) Laplacian (or Double Exponential ): fX (x|θ, σ) = 1 exp |x − θ|/σ 2σ (c) Uniform (or Rectangular ): fX (x|θ) = 1 if θ − 21 < x < θ + 12 0 otherwise. Do you consider these MLEs to be intuitively reasonable? 6. Calculate E[X], E[X 2 ], E[X 3 ] and E[X 4 ] under each of the following assumptions: (a) X ∼ Poi (λ), i.e. X has PMF (probability mass function) Pr(X = x|λ) = λx exp(−λ) x! (x = 0, 1, 2, . . . ) (b) X ∼ Exp(β), i.e. X has PDF (probability density function) βe−βx if x > 0 fX (x|β) = 0 otherwise. (c) X ∼ N µ, σ 2 , i.e. X has PDF fX (x|µ, σ) = √ 2 1 exp − 12 (x − µ)/σ 2πσ 7. 
Describe briefly how, and under what circumstances, you might approximate (a) a binomial distribution by a Normal distribution, (b) a binomial distribution by a Poisson distribution, (c) a Poisson distribution by a Normal distribution. Suppose X ∼ Bin(100, 0.1), Y ∼ Poi (10), and Z ∼ N 10, 32 . Calculate, or look up in tables, (i) (iv) Pr(X ≥ 6), Pr(X > 16), (ii) (v) Pr(Y ≥ 6), Pr(Y > 16), (iii) (vi) Pr(Z > 5.5), Pr(Z > 16.5), and comment on the accuracy of the approximations here. 8. The t distribution with n degrees of freedom, denoted tn or t(n), has the PDF Γ 21 (n + 1) 1 1 √ , −∞ < t < ∞, f (t) = nπ 1 + t2 /n(n+1)/2 Γ 12 n and the F distribution with m and n degrees of freedom, denoted Fm,n or F (m, n), has PDF Γ 12 (m + n) x(m/2)−1 m/2 n/2 , 0 < x < ∞, m n f (x) = (mx + n)(m+n)/2 Γ 12 m Γ 12 n with f (x) = 0 for x ≤ 0. Show that if T ∼ tn and X ∼ Fm,n , then T 2 and X −1 both have F distributions. 7 9. Table 1.1 shows the estimated total resident population (thousands) of England and Wales at 30 June 1993: Age Persons Males Females <1 1–14 15–44 45–64 65–74 ≥ 75 669.6 9,268.0 21,875.0 11,435.8 4,595.9 3,594.9 343.1 4,756.9 11,115.6 5,676.6 2,081.7 1,224.5 326.5 4,511.1 10,759.4 5,759.2 2,514.2 2,370.4 Total 51,439.2 25,198.4 26,240.8 Table 1.1: Estimated resident population of England & Wales, mid 1993, by sex and age-group (simplified from Table 1 of the 1993 mortality tables ) Table 1.2, also extracted from the published 1993 Mortality Statistics, shows the number of deaths in 1993 among the resident population of England and Wales, categorised by sex, age-group and underlying cause of death. Assume that the rates observed in Tables 1.1 and 1.2 hold exactly, and suppose that an individual I is chosen at random from the population. Define the random variables S (sex), A (age group), D (death) and C (cause) as follows: S = 0 1 if I is male, if I is female, A = 1 2 3 if I is under 1 year old, if I is aged 1–14, if I is aged 15–44, D = 0 1 if I survives the year, if I dies, C = cause of death (0–17). 4 5 6 if I is aged 45–64, if I is aged 65–74, if I is 75 years old or over, For example, Pr(S=0) = 25198.4/51439.2, Pr(S=0 & A=6) = 1224.5/51439.2, Pr(D=0|S=0 & A=6) = 1 − 138.239/1224.5, Pr(C=8|S=0 & A=6) = 28.645/1224.5, etc. (a) Calculate Pr(D=1|S=0), and Pr(D=1|S=0 & A=a) for a = 1, 2, 3, 4, 5, 6. Also calculate Pr(S=0|D=1), and Pr(S=0|D=1 & A=a) for a = 1, 2, 3, 4, 5, 6. If you were an actuary, and were asked by a non-expert “is the death rate for males higher or lower than that for females?”, how would you respond based on the above calculations? Justify your answer. (b) Similarly, explain how you would respond to the questions i. “is the death rate from neoplasms higher for males or for females?” ii. “is the death rate from mental disorders higher for males or for females?” iii. “is the death rate from diseases of the circulatory system higher for males or for females?” iv. 
“is the death rate from diseases of the respiratory system higher for males or for females?” 8 Sex All ages <1 1–14 0 Deaths below 28 days (no cause specified) M F 1,603 1,192 1,603 1,192 − − − − − − − − − − 1 Infectious & parasitic diseases M F 1,954 1,452 60 46 79 44 565 169 390 193 346 283 514 717 2 Neoplasms M F 74,480 67,966 16 8 195 138 2,000 2,551 16,372 15,026 25,644 19,141 30,253 31,102 3 Endocrine, nutritional & metabolic diseases and immunity disorders M F 3,515 4,403 28 17 43 37 208 153 639 474 959 901 1,638 2,821 4 Diseases of blood and blood-forming organs M F 897 1,084 5 3 12 14 62 28 106 73 204 163 508 803 5 Mental disorders M F 2,530 5,189 − − 8 1 281 83 169 99 334 297 1,738 4,709 6 Diseases of the nervous system and sense organs M F 4,403 4,717 59 42 136 118 530 313 675 546 890 809 2,113 2,889 7 Diseases of the circulatory system M F 123,717 134,439 41 44 66 45 1,997 834 20,682 7,783 37,195 23,185 63,736 102,548 8 Diseases of the respiratory system M F 41,802 49,068 86 59 79 74 608 322 3,157 2,145 9,227 6,602 28,645 39,866 9 Diseases of the digestive system M F 7,848 10,574 10 20 27 14 511 298 1,706 1,193 2,058 1,921 3,536 7,128 10 Diseases of the genitourinary system M F 3,008 3,710 4 4 6 7 57 55 215 219 676 535 2,050 2,890 11 Complications of pregnancy, childbirth and the puerperium M F − 27 − − − − − 27 − − − − − − 12 Diseases of the skin and subcutaneous tissue M F 269 748 1 − 1 − 7 15 22 30 62 80 176 623 13 Diseases of the musculoskeletal system and connective tissue M F 785 2,639 1 − 5 5 28 43 106 173 151 385 494 2,033 14 Congenital anomalies M F 660 675 131 136 114 116 158 133 118 101 58 87 81 102 15 Certain conditions originating in the perinatal period M F 186 114 93 60 8 5 13 3 18 4 16 10 38 32 16 Signs, symptoms and ill-defined conditions M F 1,642 5,146 238 171 17 17 126 50 111 53 72 75 1,078 4,780 17 External causes of injury and poisoning M F 9,859 5,869 34 30 311 162 4,749 1,240 2,183 882 941 731 1,641 2,824 M F 279,158 299,012 2,410 1,832 1,107 797 11,900 6,317 46,669 28,994 78,833 55,205 138,239 205,867 Cause of death Total Age at death (years) 15–44 45–64 65–74 Table 1.2: Deaths in England & Wales, 1993, by underlying cause, sex and age-group (extracted from Table 2 of the 1993 mortality tables ) 9 ≥ 75 (c) Now treat the data in Tables 1.1 & 1.2 as subject to statistical fluctuations. One can still estimate psac = Pr(S=s & A=a & C=c), p·ac = Pr(A=a & C=c), ps · · = Pr(S=s) etc. from the data, for example pb0,·,14 = 660/25198400 = 2.62×10−5 . Similarly estimate p1,·,14 and p·,a,14 for a = 1 . . . 6. Using a chi-squared test or otherwise, investigate whether the relative risk of death from a congenital anomaly between males and females is the same at all ages, i.e. whether it reasonable to assume that ps, a,14 = ps, ·,14 × p·, a,14 . 10. Data were collected on litter size and sex ratios for a large number of litters of piglets. The following table gives the data for all litters of size between four and twelve: Number of males Litter size 7 8 9 4 5 6 0 1 2 3 4 5 6 7 8 9 10 11 12 1 14 23 14 1 2 20 41 35 14 4 3 16 53 78 53 18 0 0 21 63 117 104 46 21 2 1 8 37 81 162 77 30 5 1 Total 53 116 221 374 402 10 11 12 0 2 23 72 101 83 46 12 7 0 0 7 8 19 79 82 48 24 10 0 0 0 1 3 15 15 33 13 12 8 1 1 0 0 0 1 8 4 9 18 11 15 4 0 0 0 346 277 102 70 (a) Discuss briefly what sort of probability distributions it might be reasonable to assume for the total size N of a litter, and for the number M of males in a litter of size N = n. 
(b) Suppose now that the litter size N follows a Poisson distribution with mean λ. Write down an expression for Pr(N = n|4 ≤ N ≤ 12). Hence or otherwise give an expression for the log-likelihood `(λ; . . .) given the above table of data. (c) Evaluate `(λ; . . .) at λ = 7.5, 8 and 8.5. By fitting a quadratic to these values, provide point and interval estimates of λ. (d) Using a chi-squared test or otherwise, check how well your model fits the data. (e) Comment on the following argument: ‘Provided λ isn’t too small, we could approximate the Poisson distribution Poi (λ) by the Normal distribution N (λ, λ). This is symmetric, so we may simply estimate the mean λ by the mode of the data (8 in our case). The standard deviation is therefore nearly 3, and so we would expect the counts at litter size 8 ± 3 to be nearly 60% the count at 8 (note that for a standard Normal, φ(1)/φ(0) = exp(−0.5) l 0.6). Since there are far fewer litters of size 5 & 11 than this, the Poisson distribution must be a poor fit.’ Data from HSDS, set 176 Education is what survives when what has been learnt has been forgotten. Burrhus Frederoc Skinner 10 Chapter 2 Bivariate & Multivariate Distributions MSA largely concerned IID (independent & identically distributed) random variables. However in practice we are usually most interested in several random variables simultaneously, and their interrelationships. Therefore we need to consider the probability distributions of random vectors, i.e. the joint distribution of the individual random variables. Bivariate Examples A. (X1 , X2 ), the number of male & female pigs in a litter. B. (X, Y ), the systolic and diastolic blood pressure of an individual. C. (X, Y ), the age and height of an individual. D. (X, Y ), the height and weight of an individual. E. (b µ, σ b2 ), the estimated common mean and variance of n IID random variables X1 , . . . , Xn . F. (Θ, X) where Θ ∼ U (0, 1) and X|Θ ∼ Bin(n, Θ), i.e. 1 if 0 < x < 1 fΘ (θ) = 0 otherwise, fX (x|Θ = θ) = n x θx (1 − θ)n−x x = 0, 1, . . . , n. Definition 2.1 (Bivariate CDF) The joint cumulative distribution function of 2 RVs X & Y is the function FX,Y (x, y) = Pr(X ≤ x & Y ≤ y), (x, y) ∈ R2 . (2.1) Comments 1. The joint cumulative distribution function (or joint CDF) may also be called the ‘joint distribution function’ or ‘joint DF’. 2. If there’s no ambiguity, then we may simply write F (x, y) for FX,Y (x, y). 11 2.1 Discrete Bivariate Distributions If RVs X & Y are discrete, then they have a discrete joint distribution and a probability mass function (PMF) that, similarly to the univariate case, is usually written fX,Y (x, y) or more simply f (x, y): Definition 2.2 (Bivariate PMF) The joint probability mass function of discrete RVs X and Y is f (x, y) = Pr(X = x & Y = y). Exercise 2.1 Suppose that the numbers X1 and X2 of male and female piglets follow independent Poisson distributions with means λ1 & λ2 respectively. Find the joint PMF. k Exercise 2.2 Now assume the model N ∼ Poi (λ), (X1 |N ) ∼ Bin(N, θ), i.e. the total number N of piglets follows a Poisson distribution, and, conditional on N = n, X has a Bin(n, θ) distribution (in particular θ = 0.5 if the sexes are equally likely). Again find the joint PMF. k Exercise 2.3 Verify that the two models given in Exercises 2.1 & 2.2 give identical fitted values, and are therefore in practice indistinguishable. k 2.1.1 Manipulation A discrete RV has a countable sample space, which without loss of generality can be represented as N = {0, 1, 2, . . .}. 
Values of a discrete joint distribution f (x, y) can therefore be tabulated: 0 X 0 1 .. . f00 f10 .. . 1 Y 2 3 f01 f11 .. . f02 f12 .. . ... ... .. . ... and the probability of any event E obtained by simple summation: X Pr (X, Y ) ∈ E = f (xi , yi ). (xi ,yi )∈E Exercise 2.4 Continuing Exercise 2.2, find the PMF of X1 , and hence identify the distribution of X1 . k Exercise 2.5 The RV Q is defined on the rational numbers in [0, 1] by Q = X/Y , where f (x, y) = (1 − α)αy−1 /(y + 1), 0 < α < 1, y = {1, 2, . . .}, x = {0, 1, . . . , y}. Show that Pr(Q = 0) = (α − 1) α + log(1 − α) /α2 . k 12 2.2 Continuous Bivariate Distributions Definition 2.3 (Continuous bivariate distribution) Random variables X & Y have a continuous joint distribution if there exists a function f from R2 to [0, ∞) such that ZZ Pr (X, Y ) ∈ A = f (x, y) dx dy ∀A ⊆ R2 . (2.2) A Definition 2.4 (Bivariate PDF) The function f (x, y) defined by Equation 2.2 is called the joint probability density function of X & Y . Comments 1. f (x, y) may be written more explicitly as fX,Y (x, y). Z ∞Z ∞ 2. f (x, y) dx dy = 1. −∞ −∞ 3. f (x, y) is not unique—it could be arbitrarily defined at a countable RR set of points (xi , yi ) (more generally, any ‘set with measure zero’) without changing the value of A f (x, y) dx dy for any set A. 4. f (x, y) ≥ 0 at all continuity points (x, y) ∈ R2 . Examples 1. As in Example E from page 11, we will want to know properties of the joint distribution of (b µ, σ b2 ), IID 2 2 the MLEs of µ and σ respectively given X1 , . . . , Xn ∼ N (µ, σ ). 2. In the situation of Example B from page 11, where X is the systolic blood pressure and Y the diastolic blood pressure of an individual, it might be reasonable to assume that X Y |X ∼ N (µS , σS2 ), ∼ N (α + βX, σD2 ), and hence obtain fX,Y (x, y) = fX (x) fY |X (y|x). Comment As in Exercise 2.2, a family of multivariate distributions is most easily built up hierarchically using simple univariate distributions and conditional distributions like that of Y |X. Conditional distributions are considered formally in Section 2.4. 2.2.1 Visualising and Displaying a Continuous Joint Distribution A continuous bivariate distribution can be represented by a contour or other plot of its joint PDF (Fig. 2.1). Comments 1. The joint distribution of X and Y may be neither discrete nor continuous, for example: • Either X or Y may have both continuous and discrete components, • One of X and Y may have a continuous distribution, the other discrete (like Example F on page 11). 2. Higher dimensional joint distributions are obviously much more difficult to interpret and to represent graphically, with or without computer help. 13 Figure 2.1: Contour and perspective plots of a bivariate distribution 2.3 Marginal Distributions Given a joint CDF FX,Y (x, y), the distributions defined by the CDFs FX (x) = limy→∞ FX,Y (x, y) and FY (y) = limx→∞ FX,Y (x, y) are called the marginal distributions of X and Y respectively: Definition 2.5 (Marginal CDF, PMF and PDF—bivariate case) FX (x) = limy→∞ FX,Y (x, y) is the marginal CDF of X. If X has a discrete distribution, then fX (x) = Pr(X = x) is the marginal PMF of X. d If X has a continuous distribution, then fX (x) = FX (x) is the marginal PDF of X. dx Marginal CDFs and PDFs of Y , and of other RVs for higher-dimensional joint distributions, are defined similarly. Exercise 2.6 Suppose that you are given a bag containing five coins: 1 double-tailed, 1 with Pr(head) = 1/4, 2 fair, 1 double-headed. 
You pick one coin at random (each with probability 1/5), then toss it twice. By finding the joint distribution of Θ = Pr(head) and X = number of heads, or otherwise, calculate the distribution of the number of heads obtained. k Comments 1. If you’ve tabulated Pr(Θ = θ & X = x), then it’s simple to find FΘ (θ) and FX (x) by writing the row sums and column sums in the margins of the table of Pr(Θ = θ & X = x)—hence the name ‘marginal distribution’. 2. Although the most satisfactory general definition of marginal distributions is in terms of their CDFs, in practice it’s usually easiest to work with PMFs or PDFs 2.4 2.4.1 Conditional Distributions Discrete Case If X and Y are discrete RVs then, by definition, Pr(Y =y|X=x) = Pr(X=x & Y =y)/ Pr(X=x). 14 (2.3) In other words (or, more accurately, in other symbols): Definition 2.6 (Conditional PMF—bivariate case) If X and Y have a discrete joint distribution with PMF fX,Y (x, y), then the conditional PMF fY |X of Y given X = x is fX,Y (x, y) fY |X (y|x) = (2.4) fX (x) P where fX (x) = y fX,Y (x, y) is the marginal PMF of X. Exercise 2.7 Continuing Exercise 2.6, what are the conditional distributions of [X |Θ = 1/4] and [Θ|X = 0]? k 2.4.2 Continuous Case Now suppose that X and Y have a continuous joint distribution. If we observe X = x, then we will want to know the conditional CDF FY |X (y|X = x). But we CAN’T use Equation 2.3 directly, which would entail dividing by zero. Therefore, by analogy with Equation 2.4, we adopt the following definition: Definition 2.7 (Conditional PDF—bivariate case) If X and Y have a continuous joint distribution with PDF fX,Y (x, y), then the conditional PDF fY |X of Y given that X = x is fX,Y (x, y) fY |X (y|x) = , (2.5) fX (x) defined for all x ∈ R such that fX (x) > 0. 2.4.3 Independence Recall that two RVs X and Y are independent (X ⊥ ⊥ Y ) if, for any two sets A, B ∈ R, Pr(X∈A & Y ∈B) = Pr(X ∈ A) Pr(Y ∈ B) (2.6) Exercise 2.8 Show that X and Y are independent according to Formula 2.6 if and only if FX,Y (x, y) = FX (x)FY (y) − ∞ < x, y < ∞, (2.7) fX,Y (x, y) = fX (x)fY (y) − ∞ < x, y < ∞, (2.8) or equivalently if and only if (where the functions f are interpreted as PMFs or PDFs in the discrete or continuous case respectively). k 15 2.5 Problems 1. Let the function f (x, y) be defined by 6xy 2 f (x, y) = 0 if 0 < x < 1 and 0 < y < 1, otherwise. (a) Show that f (x, y) is a probability density function. (b) If X and Y have the joint PDF f (x, y) above, show that Pr(X + Y ≥ 1) = 9/10. (c) Find the marginal PDF fX (x) of X. (d) Show that Pr(0.5 < X < 0.75) = 5/16. 2. Suppose that the random vector (X, Y ) takes values in the region A = {(x, y)|0 ≤ x ≤ 2, 0 ≤ y ≤ 2}, and that its CDF within A is given by FX,Y (x, y) = xy(x + y)/16. (a) Find FX,Y (x, y) for values of (X, Y ) outside A. (b) Find the marginal CDF FX (x) of X. (c) Find the joint PDF fX,Y (x, y). 3. Suppose that X and Y are RVs with joint PDF cx2 y f (x, y) = 0 if x2 ≤ y ≤ 1, otherwise. (a) Find the value of c. (b) Find Pr(X ≥ Y ). (c) Find the marginal PDFs fX (x) & fY (y) 4. For each of the following joint PDFs f of X and Y , determine the constant c, find the marginal PDFs of X and Y , and determine whether or not X and Y are independent. (a) f (x, y) = ce−(x+2y) , for x, y ≥ 0, 0 otherwise. (b) f (x, y) = cy 2 /2, for 0 ≤ x ≤ 2 and 0 ≤ y ≤ 1, 0 otherwise. (c) f (x, y) = cxe−y , for 0 ≤ x ≤ 1 and 0 ≤ y < ∞, 0 otherwise. (d) f (x, y) = cxy, for x, y ≥ 0 and x + y ≤ 1, 0 otherwise. 5. 
Suppose that X and Y are continuous RVs with joint PDF f (x, y) = e−y on 0 < x < y < ∞. (a) Find Pr(X + Y ≥ 1) [HINT : write this as 1 − Pr(X + Y < 1)]. (b) Find the marginal distribution of X. (c) Find the conditional distribution of Y given that X = x. 16 6. Assume that X and Y are random variables each taking values in [0, 1]. For each of the following CDFs, show that the marginal distribution of X and Y are both uniform U (0, 1), and determine the conditional CDF FX|Y (x|Y = 0.5) in each case: (a) F (x, y) = xy, (b) F (x, y) = min(x, y), 0, if x + y < 1, (c) F (x, y) = x + y − 1 if x + y ≥ 1. 7. Suppose that Θ is a random variable uniformly distributed on (0, 1), i.e. Θ ∼ U (0, 1), and that, once Θ = θ has been observed, the random variable X is drawn from a binomial distribution [X|θ] ∼ Bin(2, θ). (a) Find the joint CDF F (θ, x). (b) How might you display the joint distribution of Θ and X graphically? (c) What (as simply as you can express them) are the marginal CDFs F1 (θ) of Θ and F2 (x) of X? 8. Suppose that X and Y are two RVs having a continuous joint distribution. Show that X and Y are independent if and only if fX|Y (x|y) = fX (x) for each value of y such that fY (y) > 0, and for all x. 9. Suppose that X ∼ U (0, 1) and [Y |X = x] ∼ U (0, x). Find the marginal PDFs of X and Y . 2.6 2.6.1 Multivariate Distributions Introduction Given a random vector X = (X1 , X2 , . . . , Xn )T , the joint distribution of the random variables X1 , X2 , . . . , Xn is called a multivariate distribution. Definition 2.8 (Joint CDF) The joint cumulative distribution function of RVs X1 , X2 , . . . , Xn is the function FX (x1 , x2 , . . . , xn ) = Pr(Xk ≤ xk ∀ k = 1, 2, . . . , n). (2.9) Comments 1. Formula 2.9 can be written succinctly as FX (x) = Pr(X ≤ x), in an ‘obvious’ vector notation. 2. FX (x) can be called simply the CDF of the random vector X. 3. Properties of FX are similar to the bivariate case. Unfortunately the notation is messier, particularly for the things we’re generally most interested in for statistical inference, such as (a) marginal distributions of unknown quantities and vectors, (b) conditional distributions of unknown quantities and vectors, given what we know. 4. It’s often simpler to blur the distinction between row and column vectors, i.e. to let X denote either (X1 , X2 , . . . , Xn ) or (X1 , X2 , . . . , Xn )T , depending on context. 17 Definition 2.9 (Discrete multivariate distribution) The RV X ∈ Rn has a discrete distribution if it can take only a countable number of possible values. Definition 2.10 (Multivariate PMF) If X has a discrete distribution, then its probability mass function (PMF) is x ∈ Rn f (x) = Pr(X = x), (2.10) [i.e. the RVs X1 . . . Xn have joint PMF f (x1 . . . xn ) = Pr(X1 = x1 & · · · &Xn = xn )]. Definition 2.11 (Continuous multivariate distribution) The RV X = (X1 , X2 , . . . , Xn ) has a continuous distribution if there is a nonnegative function f (x), where x = (x1 , x2 , . . . , xn ), such that for any subset A ⊂ Rn , Z Z Pr (X1 , X2 , . . . , Xn ) ∈ A = . . . f (x1 , x2 , . . . xn ) dx1 dx2 . . . dxn . (2.11) A Definition 2.12 (Multivariate PDF) The function f in 2.11 is the (joint) probability density function of X. Comments 1. Without loss of generality, if X is discrete, then we can take its possible values to be Nn (i.e. each coordinate Xi of X is a nonnegative integer). 2. Equation 2.11 could be simply written Z Pr X ∈ A) = f (x)dx (2.12) A 3. As usual, f (·) may be written more explicitly fX (·), etc. 4. 
By the fundamental theorem of calculus, fX (x1 , . . . , xn ) = ∂ n FX (x1 , . . . , xn ) ∂x1 · · · ∂xn (2.13) at all points (x1 , . . . , xn ) where this derivative exists i.e. fX (x) = ∂ n FX (x)/∂x . 5. Mixed distributions (neither continuous nor discrete) can be handled using appropriate combinations of summation and integration. 2.6.2 Useful Notation for Marginal & Conditional Distributions We’ll sometimes adopt the following notation from DeGroot, particularly when the components Xi of X are in some way similar, as in the multivariate Normal distribution (see later). F (x) denotes the CDF of X = (X1 , X2 , . . . , Xn ) at x = (x1 , x2 , . . . , xn ), f (x) denotes the corresponding joint PMF (discrete case) or PDF (continuous case), fj (xj ) denotes the marginal PMF (PDF) of Xj (integrating over x1 . . . xj−1 , xj+1 . . . xn ), fjk (xj , xk ) denotes the marginal joint PDF of Xj & Xk (integrating over the remaining xi s), gj (xj |x1 . . . xj−1 , xj+1 . . . xn ) denotes the conditional PMF (PDF) of Xj given Xi = xi , i 6= j, Fj (xj ) denotes the marginal CDF of Xj , Gjk denotes the conditional CDF of (Xj , Xk ) given the values xi of all Xi , i 6= j, k, etc. 18 2.7 Expectation 2.7.1 Introduction The following are important definitions and properties involving expectations, variances and covariances: Var(X) = E (X − µ)2 where µ = EX 2 2 = E X −µ , E[aX + b] = aEX + b where a and b are constants, 2 2 2 E (aX + b) = a E X + 2abEX + b2 , Var(aX + b) = a2 Var(X), E[X1 X2 ] = (EX1 )(EX2 ) Cov(X1 , X2 ) = E(X1 − µ1 )(X2 − µ2 ) = E[X1 X2 ] − µ1 µ2 , p Var(X), = Cov(X1 , X2 ) = ρ(X1 , X2 ) = . SD(X1 )SD(X2 ) SD(X) corr(X1 , X2 ) if X1 ⊥ ⊥ X2 , Note that the definition of expectation applies directly in the multivariate case: Definition 2.13 (Multivariate expectation) P if X is discrete, x h(x) f (x) Z E[h(X)] = h(x) f (x) dx if X is continuous. Rn For example, if X = (X1 , X2 , X3 ) has a continuous distribution, then Z ∞Z ∞Z ∞ E[X1 ] = x1 f (x1 , x2 , x3 ) dx1 dx2 dx3 −∞ −∞ −∞ Exercise 2.9 Let X and Y be independent continuous RVs. Prove that, for arbitrary functions g(·) and h(·), E g(X)h(Y ) = E g(X) E h(Y ) . k Exercise 2.10 Let X, Y and Z have independent Poisson distributions with means λ, µ, ν respectively. Find E[X 2 Y Z]. k Exercise 2.11 [Cauchy-Schwartz] By considering E (tX − Y )2 , or otherwise, prove the Cauchy Schwartz inequality for 2 expectations, i.e. for any two RVs X and Y with finite second moments, E(XY ) ≤ E X 2 E Y 2 , with equality if and only if Pr(Y = cX) = 1 for some constant c. Hence or otherwise prove that the correlation ρX,Y between X and Y satisfies |ρX,Y | ≤ 1. Under what circumstances does ρX,Y = 1? k 19 2.8 Approximate Moments of Transformed Distributions The moments of a transformed RV g(X) can often be well approximated via a Taylor series: Exercise 2.12 [delta method] Let X1 , X2 , . . . , Xn be independent, each with mean µ and variance σ 2 , and let g(·) be a function with a continuous derivative g 0 (·). By considering a Taylor series expansion involving X −µ , Zn = p σ 2 /n show that E g(X) = g(µ) + O(n−1 ), Var g(X) = n−1 σ 2 g 0 (µ)2 + O(n−3/2 ). (2.14) (2.15) k Comments 1. There is similarly a multivariate delta method, outside the scope of this course. 2. Important uses of expansions like the delta method include identifying useful transformations g(·), for example to remove skewness or, when Var(X) is a function of µ, to make Var g(X) (approximately) independent of µ. practice applied to the original RVs onP the (often 3. 
A useful transformation g(X) is sometimes in P reasonable) assumption that the properties of g(Xi ) /n will be similar to those of g Xi )/n . Exercise 2.13 [Variance stabilising transformations] Suppose that X1 , X2 , . . . , Xn are IID and that the (common) variance of each Xi is a function of the (common) mean µ = EXi . Show that the variance of g(X) is approximately constant if p g 0 (µ) = 1/ Var(µ). If X ∼ Poi (µ), show that Y = √ X has approximately constant variance. k 20 2.9 Problems 1. The discrete random vector (X1 , X2 , X3 ) has the following PMF: (X1 = 1) X2 1 2 3 1 .02 .04 .02 X3 2 .03 .06 .03 (a) Calculate the marginal PMFs: (X1 = 2) 3 .05 .10 .05 1 2 3 X2 f1 (x1 ), f2 (x2 ), f3 (x3 ) 1 .08 .12 .05 X3 2 .04 .11 .05 3 .03 .07 .05 and f12 (x1 , x2 ). (b) Are X1 and X2 independent? (c) What are the conditional PMFs: g1 (x1 |X2 = 1, X3 = 3), g3 (x3 |X1 = 1, X2 = 3), and g12 (x1 , x2 |X3 = 3) ? g2 (x2 |X1 = 1, X3 = 3), 2. The RVs A, B, C etc. count the number of times the corresponding letter appears when a word is chosen at random from the following list (each being chosen with probability 1/16): MASCARA, MOVIE, RITE, SQUID, MASK, PREY, SEAT, TENDER, MERCY, REPLICA, SNAKE, TIME, MONSTER, REPTILES, SOMBRE, TROUT. (a) Complete the following table of the joint distribution of E, M and R: E=0 R=0 M =0 M =1 1/16 1/16 R=1 E=1 M =0 R=0 E=2 M =1 M =0 2/16 M =1 R=0 R=1 R=1 (b) Calculate all three bivariate marginal distributions, and hence find which of the following statements are true: (a) E ⊥ ⊥ M, (b) E ⊥ ⊥ R, (c) M ⊥ ⊥ R. (c) Similarly discover which of the following statements are true: (d) M ⊥ ⊥ R|E=0, (g) M ⊥ ⊥ R|E, (e) M ⊥ ⊥ R|E=1, (h) E ⊥ ⊥ R|M , (f) M ⊥ ⊥ R|E=2, (i) E ⊥ ⊥ M |R. 3. Find variance stabilizing transformations for (a) the exponential distribution, (b) the binomial distribution. 4. Let Z ∼ N (0, 1) and define the RV X by √ Pr(X = − 3) = 1/6, Pr(X = 0) = 4/6, √ Pr(X = + 3) = 1/6. (a) Show that X has the same mean and variance as Z, and that X 2 has the same mean and variance as Z 2 . (b) Suppose the RV Y has mean µ and variance σ 2 . Compare the delta method for estimating the mean and variance of the RV T = g(Y ) with the alternative estimates µ b(T ) l E g(µ + σX) , d ) l Var g(µ + σX) . [Try a few simple distributions for Y and transformations g(·)]. Var(T 21 2.10 Conditional Expectation 2.10.1 Introduction A common practical problem arises when X1 and X2 aren’t independent, we observe X2 = x2 , and we want to know the mean of the resulting conditional distribution of X1 . Definition 2.14 (Conditional expectation) The conditional expectation of X1 given X2 is denoted E[X1 |X2 ]. If X2 = x2 then Z E[X1 |x2 ] ∞ x1 g1 (x1 |x2 ) dx1 = (continuous case) (2.16) −∞ = X x1 g1 (x1 |x2 ) (discrete case) (2.17) x1 where g1 (x1 |x2 ) is the conditional PDF or PMF respectively. Comment Note that before X2 is known to take the value x2 , E[X1 |X2 ] is itself a random variable, being a function of the RV X2 . We’ll be interested in the distribution of the RV E[X1 |X2 ], and (for example) comparing it with the unconditional expectation EX1 . The following is an important result: Theorem 2.1 (Marginal expectation) For any two RVs X1 & X2 , E E[X1 |X2 ] = EX1 . (2.18) Exercise 2.14 Prove Equation 2.18 (i) for continuous RVs X1 and X2 , (ii) for discrete RVs X1 and X2 . k Exercise 2.15 Suppose that the RV X has a uniform distribution, X ∼ U (0, 1), and that, once X = x has been observed, the conditional distribution of Y is [Y |X = x] ∼ U (x, 1). 
Find E[Y |x] and hence, or otherwise, show that EY = 3/4. k Exercise 2.16 Suppose that Θ ∼ U (0, 1) and (X|Θ) ∼ Bin(2, Θ). Find E[X |Θ] and hence or otherwise show that EX = 1. 2.10.2 k Conditional Expectations of Functions of RVs By extending Theorem 2.1, we can relate the conditional and marginal expectations of functions of RVs (in particular, their variances). Theorem 2.2 (Marginal expectation of a transformed RV) For any RVs X1 & X2 , and for any function h(·), E E[h(X1 )|X2 ] = E[h(X1 )]. 22 (2.19) Exercise 2.17 Prove Equation 2.19 (i) for discrete RVs X1 and X2 , (ii) for continuous RVs X1 and X2 . k An important consequence of Equation 2.19 is the following theorem relating marginal variance to conditional variance and conditional expectation: Theorem 2.3 (Marginal variance) For any RVs X1 & X2 , Var(X1 ) = E Var(X1 |X2 ) + Var E[X1 |X2 ] . (2.20) Comments 1. Equation 2.20 is easiest to remember in English: ‘marginal variance = expectation of conditional variance + variance of conditional expectation’. 2. A useful interpretation of Equation 2.20 is: Var(X1 ) = average random variation inherent in X1 even if X2 were known + random variation due to not knowing X2 and hence not knowing EX1 . i.e. the uncertainty involved in predicting the value x1 taken by a random variable X1 splits into two components. One component is the unavoidable uncertainty due to random variation in X1 , but the other can be reduced by observing quantities (here the value x2 of X2 ) related to X1 . Exercise 2.18 [Proof of Theorem 2.3] Expand E Var(X1 |X2 ) and Var E[X1 |X2 ] . Hence show that Var(X1 ) = E Var(X1 |X2 ) + Var E[X1 |X2 ] . k Exercise 2.19 Continuing Exercise 2.16, in which Θ ∼ U (0, 1), (X|Θ) ∼ Bin(2, Θ), and E[X |Θ] = 2Θ, find Var E[X |Θ] and E Var(X |Θ) . Hence or otherwise show that VarX = 2/3, and comment on the effect on the uncertainty in X of observing Θ. k 23 2.11 Problems 1. Two fair coins are tossed independently. Let A1 , A2 and A3 be the following events: A1 A2 A3 = = = ‘coin 1 comes down heads’ ‘coin 2 comes down heads’ ‘results of both tosses are the same’. (a) Show that A1 , A2 and A3 are pairwise independent (i.e. A1 ⊥ ⊥ A2 , A1 ⊥ ⊥ A3 and A2 ⊥ ⊥ A3 ) but not mutually independent. (b) Hence or otherwise construct three random variables X1 , X2 , X3 such that E[X3 |X1 = x1 ] and E[X3 |X2 = x2 ] are constant, but E[X3 |X1 = x1 &X2 = x2 ] isn’t. 2. Construct three random variables X1 , X2 , X3 with continuous distributions such that X1 ⊥ ⊥ X2 , X1 ⊥ ⊥ X3 and X2 ⊥ ⊥ X3 , but any two Xi ’s determine the remaining one. 3. (a) Show that for any random variables X and Y , i. E[Y ] = E E[Y |X] , ii. Var[Y ] = E Var[Y |X] + Var E[Y |X] . (b) Suppose that the random variables Xi and Pi , i = 1, . . . , n, have the following distributions: 1 with probability Pi , Xi = 0 with probability 1 − Pi , IID Pi Beta(α, β), ∼ i.e. Pi has density f (p) = Γ(α + β) α−1 p (1 − p)β−1 Γ(α) Γ(β) with mean µ and variance σ 2 given by µ = E[Pi ] = α , α+β σ 2 = Var[Pi ] = αβ , (α + β)2 (α + β + 1) and Xi has a Bernoulli (Pi ) distribution. Find i. ii. iii. iv. E[X1 |P1 ], Var[X1 |P1 ], Var E[X1 |P1 ] , and E Var[X1 |P1 ] . Hence find E[Y ] where Y = Pn i=1 Xi , and show that Var[Y ] = nαβ/(α + β)2 . (c) Express E[Y ] and Var[Y ] in terms of µ and σ 2 , and comment on the result. From Warwick ST217 exam 1998 4. Suppose that the number N of bye-elections occurring in Government-held seats over a 12-month period follows a Poisson distribution with mean 10. 
Suppose also that, independently for each such bye-election, the probability that the Government hold onto the seat is 1/4. The number X of seats retained in the N bye-elections therefore follows a binomial distribution: [X|N ] ∼ Bin(N, 0.25). (a) What are E[N ], Var[N ], E[X|N ] and Var[X|N ]? (b) What are E[X] and Var[X]? (c) What is the distribution of X? [HINT : try using generating functions—see MSA] 24 5. (a) For continuous random variables X and Y , define i. ii. iii. iv. the the the the marginal density fX (x) of X, conditional density fY |X (y|x) of Y given X = x, conditional expectation E[Y |X] of Y given X, and conditional variance Var[Y |X] of Y given X. (b) Show that i. E[g(Y )] = E E[g(Y )|X] , for an arbitrary function g(·), and ii. Var[Y ] = E Var[Y |X] + Var E[Y |X] . (c) Suppose that the random variables X and Y have a continuous joint distribution, with PDF 2 f (x, y), means µX & µY respectively, variances σX & σY2 respectively, and correlation ρ. Also suppose the conditional mean of Y given X = x is a linear function of x: E[Y |x] = β0 + β1 x. Show that R∞ i. −∞ yf (x, y)dy = (β0 + β1 x)fX (x), ii. µY = β0 + β1 µX , and 2 iii. ρσX σY + µX µY = β0 µX + β1 (σX + µ2X ). (Hint: use the fact that E[XY ] = E[E[XY |X]]). (d) Hence or otherwise express β0 and β1 in terms of µX , µY , σX , σY & ρ, and write down (or derive) the maximum likelihood estimates of β0 & β1 under the assumption that the data (x1 , y1 ), . . . , (xn , yn ) are i.i.d observations from a bivariate Normal distribution. From Warwick ST217 exam 1997 6. For discrete random variables X and Y , define: (i) The conditional expectation of Y given X, E[Y |X], and (ii) The conditional variance of Y given X, Var[Y |X]. Show that (iii) E[Y ] = E E[Y |X] , and (iv) Var[Y ] = E Var[Y |X] + Var E[Y |X] . (v) Show also that if E[Y |X] = β0 + β1 X for some constants β0 and β1 , then E[XY ] = β0 E[X] + β1 E[X 2 ]. The random variable X denotes the number of leaves on a certain plant at noon on Monday, Y denotes the number of greenfly on the plant at noon on Tuesday, and Z denotes the number of ladybirds on the plant at noon on Wednesday. Suppose that, given X = x, Y has a Poisson distributions with mean µx. If X has a Poisson distribution with mean λ, show that E[Y ] = λµ and Var[Y ] = λµ(1 + µ), (you may assume that for a Poisson distribution the mean and variance are equal). Suppose further that, given Y = y, Z has a Poisson distributions with mean νy. Find E[Z], Var[Z], and the correlation between X and Z. From Warwick ST217 exam 1996 25 7. Using the relationship E E[h(X1 )|X2 ] = E[h(X1 )], where h(x1 ) = (x1 − E[X1 |x2 ] + E[X1 |x2 ] − EX1 )2 , prove that Var(X1 ) = E Var(X1 |X2 ) + Var E[X1 |X2 ] for any two random variables X1 & X2 . 8. Prove that, for any three RVs X, Y and Z for which the various expectations exist, (a) X and Y − E(Y |X) are uncorrelated, (b) Var Y − E(Y |X) = E Var(Y |X) , (c) if X and Y are uncorrelated then E Cov(X, Y |Z) = −Cov E(X|Z), E(Y |Z) , (d) Cov Z, E(Y |Z) = Cov(Z, Y ). In scientific thought we adopt the simplest theory which will explain all the facts under consideration and enable us to predict new facts of the same kind. The catch in this criterion lies in the word ‘simplest’. It is really an aesthetic canon such as we find implicit in our criticisms of poetry or painting. J. B. S. Haldane All models are wrong, some models are useful. G. E. P. Box A child of five would understand this. Send somebody to fetch a child of five. 
Groucho Marx 26 Chapter 3 The Multivariate Normal Distribution 3.1 Motivation A Normally distributed RV X ∼ N (µ, σ 2 ) has PDF 1 (x − µ)2 f (x; µ, σ 2 ) = constant × exp − 2 σ2 (3.1) where µ σ2 ‘constant’ is the mean of X, is the variance of X, and is there to make f integrate to 1. The P Normal distribution is important because, by the CLT, as n → ∞, the CDF of a MLE such as P θb = Xi /n or θb = (Xi − ΣXj /n)2 / n, tends uniformly (under reasonable conditions) to the CDF of a Normal RV with the appropriate mean and variance. i.e. the log-likelihood tends to a quadratic in θ. Similarly it can be shown that, for a model with parameter vector θ = (θ1 , . . . , θp )T , under reasonable conditions the log-likelihood will tend to a quadratic in (θ1 , . . . , θp ). Therefore, by analogy with Equation 3.1, we will want to define a distribution with PDF 1 f (x; µ, V) = constant × exp − (x − µ)T V−1 (x − µ) 2 where µ V ‘constant’ is a (p × 1) matrix or column vector, is a (p × p) matrix, and is again there to make f integrate to 1. 27 (3.2) As an example of a PDF of this form, if X1 , X2 , . . . , Xp IID ∼ N (0, 1), then f (x) = f1 (x1 ) × f2 (x2 ) × · · · × fp (xp ) by independence 1 1 = exp − 12 Σx2i = exp − 12 xT x . p/2 p/2 (2π) (2π) (3.3) Definition 3.1 (Multivariate standard Normal) The distribution with PDF f (z) = f (z1 , z2 , . . . , zp ) = 1 exp − 21 zT z p/2 (2π) is called the multivariate standard Normal distribution. The statement ‘Z has a multivariate standard Normal distribution’ is often written Z ∼ N (0, I), Z ∼ MVN (0, I), Z ∼ N p (0, I), or Z ∼ MVN p (0, I), and the CDF and PDF of Z are often written Φ(z) and φ(z), or Φp (z) and φp (z), respectively. In the more general case, where the component RVs X1 , X2 , . . . , Xp in Equation 3.2 aren’t independent, we need an expression for the constant term. 3.2 Digression: Transforming a Random Vector Exercise 3.1 Suppose that the RVs Z1 , Z2 , . . . , Zn have a continuous joint distribution, with joint PDF fZ (z). Consider a 1-1 transformation (i.e. a bijection between the corresponding sample spaces) to new RVs X1 , X2 , . . . , Xn . What is the PDF fX (x) of the transformed RVs? Solution: Because the transformation is 1-1 we can invert it and write Z = u(X) i.e. a given point (z1 , . . . , zn ) transforms to (x1 , . . . , xn ), where z1 z2 = u1 (x1 , . . . , xn ), = u2 (x1 , . . . , xn ), .. . zn = un (x1 , . . . , xn ). Now assume that each function ui (·) is continuous and differentiable. Then we can form the following matrix: ∂u1 ∂u1 ∂u1 ∂x1 ∂x2 . . . ∂xn ∂u ∂u2 ∂u2 2 ∂u ... = ∂x1 ∂x2 ∂xn . ∂x .. .. .. .. . . . ∂un ∂un ∂un ... ∂x1 ∂x2 ∂xn (3.4) (3.5) and its determinant J, which is called the Jacobian of the transformation u [i.e. of the joint transformation (u1 , . . . , un )]. Then it can be shown that fX (x) = |J| × fZ (z) at all points in the ‘sample space’ (i.e. set of possible values) of X. k 28 z + δ2 z + δ1 + δ2 z infinitesimal δ1 × δ2 rectangle density = fZ (z) area = δ1 δ2 z + δ1 ∴ probability content = δ1 δ2 fZ (z) 6 u infinitesimal parallelogram area = δ1 δ2 /|J| probability content = δ1 δ2 fZ (z) u−1 (z + δ 1 + δ 2 ) u−1 (z + δ 2 ) −1 u (z + δ 1 ) x = u−1 (z) ∴ density = |J| × fZ (z) Figure 3.1: Bivariate Parameter Transformation 3.3 The Bivariate Normal Distribution Suppose that Z1 and Z2 are IID with N (0, 1) distributions, i.e. (as in Equation 3.3): fZ (z1 , z2 ) = 1 exp − 12 (z12 + z22 ) . 
2π Now let µ1 , µ2 ∈ (−∞, ∞), σ1 , σ2 ∈ (0, ∞) & ρ ∈ (−1, 1), and define (as in DeGroot §5.12): X1 X2 = σ1 Z1 + µ1p , = σ2 ρZ1 + 1 − ρ2 Z2 + µ2 . (3.6) Then the Jacobian of the transformation from Z to X is given by σ1 p 0 = 1 − ρ2 σ 1 σ 2 . p J = ρ σ2 1 − ρ2 σ2 p Therefore the Jacobian of the inverse transformation from X to Z is 1/ 1 − ρ2 σ1 σ2 , and the PDF of X is given by Equations 3.7 & 3.8 below. Definition 3.2 (Bivariate Normal Distribution) The continuous bivariate distribution with PDF fX (x) 1 fZ (z) |J| = ! 1 1 Q , p × × exp − 2π 2 1 − ρ2 1 − ρ2 σ 1 σ 2 1 = (3.7) where Q= x1 − µ1 σ1 2 − 2ρ x1 − µ1 σ1 is called the bivariate Normal distribution. 29 x2 − µ2 σ2 + x2 − µ2 σ2 2 . (3.8) Exercise 3.2 If the RV X = (X1 , X2 ) has PDF given by Equations 3.7 & 3.8, then show by substituting v= x2 − µ2 σ2 followed by w = v − ρ(x1 − µ1 )/σ1 p , 1 − ρ2 or otherwise, that X1 ∼ N (µ1 , σ12 ). Hence or otherwise show that the conditional distribution of X1 given X2 = x2 is Normal with mean µ1 + (ρσ1 /σ2 )(x2 − µ2 ) and variance σ12 1 − ρ2 . k Comments 1. It’s easy to show (problem 3.4.2, page 31) that EXi = µi , VarXi = σi2 and corr(X1 , X2 ) = ρ. This suggests that we will be able to write X = (X1 , X2 )T ∼ MVN (µ, V), T µ = V (µ1 , µ2 ) σ12 = ρ σ1 σ 2 where is the ‘mean vector ’ of X, and ρ σ1 σ 2 is the ‘variance-covariance matrix ’ of X. σ22 2. The ‘level curves’ (i.e. contours in 2-d) of the bivariate Normal PDF are given by Q = constant in formula 3.8; i.e. ellipses provided the discriminant is negative: ρ σ1 σ2 2 − 1 1 ρ2 − 1 = 2 2 < 0. 2 2 σ 1 σ2 σ 1 σ2 This holds as we are only considering ‘nonsingular’ bivariate Normal distributions with ρ 6= ±1. 3. PLEASE MAKE NO ATTEMPT TO MEMORISE FORMULAE 3.7 & 3.8!! Exercise 3.3 Show that the inverse of the variance-covariance matrix V = V −1 1 = 1 − ρ2 1/σ12 −ρ/σ1 σ2 σ12 ρ σ1 σ 2 −ρ/σ1 σ2 1/σ22 ρ σ1 σ 2 σ22 is . k 30 3.4 Problems 1. Suppose that the RVs X1 , X2 , . . . , Xn have a continuous joint distribution with PDF fX (x), and that the RVs Y1 , Y2 , . . . , Yn are defined by Y = AX, where the (n × n) matrix A is nonsingular. Show that the joint density of the Yi s is given by 1 fX A−1 y for y ∈ Rn . fY (y) = | det A| Hence or otherwise show carefully that if X1 and X2 are independent RVs with PDFs f1 and f2 respectively, then the PDF of Y = X1 + X2 is given by Z ∞ fY (y) = f1 (y − z)f2 (z)dz for −∞ < y < ∞ −∞ or equivalently by Z ∞ f1 (z)f2 (y − z)dz fY (y) = for −∞ < y < ∞ −∞ If Xi IID ∼ Exp(1), i = 1, 2, then what is the distribution of X1 + X2 ? 2. Suppose that Z1 and Z2 are i.i.d. random variables with standard Normal N (0, 1) distributions. Define the random vector (X1 , X2 ) by: X1 = µ1 + σ1 Z1 , i h p X2 = µ2 + σ2 ρZ1 + 1 − ρ2 Z2 , where σ1 , σ2 > 0 and −1 ≤ ρ ≤ 1. Show that E[X1 ] = µ1 , E[X2 ] = µ2 , Var[X1 ] = σ12 , Var[X2 ] = σ22 , and corr[X1 , X2 ] = ρ. Find E[X2 |X1 ] and Var[X2 |X1 ]. Derive the joint PDF f (x1 , x2 ). Find the distribution of [X2 |X1 ]. Hence or otherwise show that two r.v.s. with a joint bivariate Normal distribution are independent if and only if they are uncorrelated. (e) Now suppose that σ1 = σ2 . Show that the RVs Y1 = X1 +X2 and Y2 = X1 −X2 are independent. (a) (b) (c) (d) 3. Suppose that X and Y have the joint density 1 p fX,Y (x, y) = 2π σX σY 1 − ρ2 " 2 2 #! 1 x − µX x − µX y − µY y − µY × exp − − 2ρ + . σX σX σY σY 2 1 − ρ2 p (a) Show by substituting u = (x − µX )/σX and v = (y − µY )/σY followed by w = (u − ρv)/ 1 − ρ2 , or otherwise, that fX,Y does indeed integrate to 1. 
(b) Show that the ‘joint MGF’ MX,Y (s, t) = E exp(sX + tY ) is given by 2 2 s + 2ρσX σY st + σY2 t2 ) . MX,Y (s, t) = exp µX s + µY t + 12 (σX (c) Show that ∂MX,Y ∂s s,t=0 ∂ 2 MX,Y ∂s2 = µX , 2 = µ2X + σX , s,t=0 & ∂ 2 MX,Y ∂s∂t s,t=0 = µX µY + ρσX σY . (d) Guess the formula for the MGF MX (s) of X, where X ∼ MVN (µ, V). 4. Suppose that (X1 , X2 ) have a bivariate Normal distribution. Show that any linear combination Y = a0 + a1 X1 + a2 X2 has a univariate Normal distribution. 31 3.5 The Multivariate Normal Distribution Definition 3.3 (Multivariate Normal distribution) Let µ = (µ1 , µ2 , . . . , µp ) be a p-vector, and let V be a symmetric positive-definite (p × p) matrix. Then the multivariate probability density defined by fX (x; µ, V) = 1 p (2π)p |V| exp − 12 (x − µ)T V−1 (x − µ) (3.9) is called a multivariate Normal PDF with mean vector µ and variance-covariance matrix V. Comments 1. Expression 3.9 is a natural generalisation of the univariate Normal density, with V taking the rôle of σ 2 in the exponent, and its determinant |V| taking the rôle of σ 2 in the ‘normalising constant’ that makes the whole thing integrate to 1. Many of the properties of the MVN distribution are guessable from properties of the univariate Normal distribution—in particular, it’s helpful to think of 3.9 as ‘exponential of a quadratic’. 2. The statement ‘X = (X1 , X2 , . . . , Xp ) has a multivariate Normal distribution with mean vector µ and variance-covariance matrix V’ may be written X ∼ N (µ, V), X ∼ MVN(µ, V), X ∼ N p (µ, V), or X ∼ MVNp (µ, V). 3. The mean vector µ is sometimes called just the mean, and the variance-covariance matrix V is sometimes called the dispersion matrix, or simply the variance matrix or covariance matrix. 4. µ = EX, (or equivalently, componentwise, EXi = µi , i = 1, 2, . . . , p). This fact should be obvious from the name ‘mean vector’, and can be proved in various ways, e.g. by differentiating a multivariate generalization of the MGF, or simply by symmetry. 5. V = E (X − µ)(X − µ)T = E(XXT ) − µµT , i.e. µ21 µ1 µ2 . . . µ1 µp X12 X 1 X 2 . . . X1 X p X2 X1 . . . µ2 µp µ22 X22 . . . X2 X p µ2 µ1 − E(XXT ) − µµT = E . . .. .. .. . .. .. .. .. .. . . . . . Xp X1 = ... Xp2 µp µ1 E X12 − µ21 E(X2 X1 ) − µ2 µ1 .. . E(X1 X2) − µ1 µ2 E X22 − µ22 .. . E(Xp X1 ) − µp µ1 E(Xp X2 ) − µp µ2 = V Xp X2 = v11 v21 .. . v12 v22 .. . ... ... .. . v1p v2p .. . vp1 vp2 ... vpp µp µ2 ... µ2p . . . E(X1 Xp ) − µ1 µp . . . E(X2 Xp ) − µ2 µp .. .. . . ... E Xp2 − µ2p , say, from which it follows that V= σ12 ρ12 σ1 σ2 .. . ρ12 σ1 σ2 σ22 .. . . . . ρ1p σ1 σp . . . ρ2p σ2 σp .. .. . . ρ1p σ1 σp ρ2p σ2 σp ... 32 σp2 , (3.10) where σi is the standard deviation of Xi and ρij is the correlation between Xi and Xj . Again these results can be proved using a multivariate generalization of the MGF. 6. The p-dimensional MVN p (µ, V) distribution can therefore be parametrised by— p p 1 p(p − 1) 2 means µi , variances σi2 , and correlations ρij NB. —a total of 21 p(p + 3) parameters. 7. Given n random vectors Xi = (Xi1 , Xi2 , . . . , Xip ) IID ∼ MVN (µ, V), i = 1, 2, . . . , n, a set of minimal sufficient statistics for the unknown parameters is given by: n X Xij j = 1, . . . , p, i=1 n X 2 Xij j = 1, . . . , p, i=1 n X & Xij Xik j = 2, . . . , p, k = 1, . . . 
, (j − 1), (3.11) i=1 and MLEs for µ and V are given by: µ bj = σ bj2 = ρbjk = 1X Xij , n i 1X (Xij − µ bj )2 , n i P 1 bj )(Xik − µ bk ) i (Xij − µ n , σ bj σ bk (3.12) (3.13) (3.14) or, in matrix notation, n b = µ 1X Xi , n i=1 b V = 1X b )(Xi − µ b )T (Xi − µ n i=1 = 1X bµ bT . Xi XTi − µ n i=1 (3.15) n (3.16) n (3.17) 8. The fact that V is positive-definite implies various (messy!) constraints on the correlations ρij . 9. Surfaces of constant density form concentric (hyper-)ellipsoids (concentric hyper-spheres in the case of the standard MVN distribution). In particular, the contours of a bivariate Normal density form concentric ellipses (or concentric circles for the standard bivariate Normal). 10. It can be proved that all conditional and marginal distributions of a MVN are themselves MVN. The proof of this important fact is quite straightforward, quite tedious, and mercifully omitted from this course. 33 3.6 Distributions Related to the MVN Because of the CLT, the MVN distribution is important throughout statistics. For example, the joint distribution of the MLEs θb1 , θb2 , . . . , θbp of unknown parameters θ1 , θ2 , . . . , θp will under reasonable conditions b = (θb1 , θb2 , . . . , θbp )T was calculated increases. tend to a MVN as the size of the sample from which θ Therefore various distributions arising from the MVN by transformation are also important. Throughout this Section we shall usually denote independent standard Normal RVs by Zi , i.e.: Zi IID ∼ N (0, 1), i.e. Z = (Z1 , Z2 , . . . , Zn )T i = 1, 2, . . . ∼ MVN (0, I). Exercise 3.4 Show that if a is a constant (n × 1) column vector, B is a constant nonsingular (n × n) matrix, and Z = (Z1 , Z2 , . . . , Zn )T is a random n-vector with a MVN (0, I) distribution, then Y = a+BZ ∼ MVN a, BBT . k 3.6.1 The Chi-squared Distribution Definition 3.4 (Chi-squared Distribution) If Zi IID ∼ N(0, 1) for i = 1, 2, . . . , n, then the distribution of X = Z12 + Z22 + · · · + Zn2 is called a Chi-squared distribution on n degrees of freedom, and we write X ∼ χ2n . Comments 1. In particular, if Z ∼ N (0, 1), then Z 2 ∼ χ21 . 2. The above construction of the χ2n distribution shows that if X ∼ χ2m , Y ∼ χ2n , and X ⊥ ⊥Y , then (X + Y ) ∼ χ2m+n . This summation property accounts for the importance and usefulness of the χ2 distribution: essentially a squared length is split into two orthogonal components, as in Pythagoras’ theorem. 3. If X ∼ χ2n , then the (unmemorable) density of X can be shown to be fX (x) = 1 x(n/2)−1 e−x/2 2n/2 Γ(n/2) for x > 0, (3.18) with fX (x) = 0 for x ≤ 0. Comparing this with the definition of a Gamma distribution (MSA) shows that a Chi-squared distribution on n degrees of freedom is just a Gamma distribution with α = n/2 and β = 1/2 (in the usual parametrisation). 4. It can be shown that if X ∼ χ2n then EX = n and VarX = 2n. Note that this implies that E[X/n] = 1 and Var[X/n] = 2/n. 5. The χ2 distributions are positively skewed—for example, χ22 is just an exponential distribution with mean 2. However, because of the CLT, the χ2n distribution tends (slowly!) to Normality as n → ∞. 6. The PDF 3.18 cannot be integrated analytically except for the special case n = 2. Therefore the CDFs of χ2n distributions for various n are given in standard Statistical Tables. 34 Figure 3.2: Chi-squared distributions for 1, 2, 5 & 20 d.f. Vertical lines show the 2.5%, 16%, 50%, 84% and 97.5% points (which for N (0, 1) are at −2, −1, 0, 1, 2). 
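The facts above are easy to check by simulation. The following is a minimal sketch in Python (my own illustration, assuming numpy and scipy are available; the sample size, degrees of freedom and seed are arbitrary choices) that builds chi-squared variables directly from squared standard Normals, as in Definition 3.4, and compares them with the stated mean n, variance 2n, and summation property.

```python
# Sketch: check chi-squared facts by simulation (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 5, 100_000

Z = rng.standard_normal((reps, n))
X = (Z ** 2).sum(axis=1)            # each row gives one draw of X ~ chi^2_n

print(X.mean(), n)                  # empirical mean vs theoretical mean n
print(X.var(), 2 * n)               # empirical variance vs theoretical 2n

# Compare the empirical CDF with the chi^2_n CDF at a few points.
for x in (1.0, 5.0, 11.07):
    print(x, (X <= x).mean(), stats.chi2.cdf(x, df=n))

# Summation property: chi^2_2 + chi^2_3 should behave like chi^2_5.
Y = (stats.chi2.rvs(2, size=reps, random_state=rng)
     + stats.chi2.rvs(3, size=reps, random_state=rng))
print(stats.ks_2samp(X, Y).pvalue)  # typically not small: consistent with
                                    # the two samples having the same distribution
```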
3.6.2 Student’s t Distribution Definition 3.5 (t Distribution) If Z ∼ N(0, 1), Y ∼ χ2n and Y ⊥ ⊥ Z, then the distribution of Z X=p Y /n is called a (Student’s) t distribution on n degrees of freedom, and we write X ∼ tn . Comments 1. The shape of the t distribution is like that of a Normal, but with heavier tails (since there is variability in the denominator of t as well as in the Normally-distributed numerator Z). However, as n → ∞, the denominator becomes more and more concentrated around 1, so (loosely speaking!) ‘tn → N (0, 1) as n → ∞’. 2. The (highly unmemorable) PDF of X ∼ tn can be shown to be −(n+1)/2 Γ (n + 1)/2 fX (x) = √ 1 + x2 /n for −∞ < x < ∞. nπ Γ(n/2) 35 (3.19) Figure 3.3: t distributions for 1, 2, 5 & 20 d.f. Vertical lines show the 2.5%, 16%, 50%, 84% and 97.5% points. 3. The t distribution on 1 degree of freedom is also called the Cauchy distribution—note that it arises as the distribution of Z1 /Z2 where Zi IID ∼ N (0, 1). The Cauchy distribution is infamous for not having a mean. More generally, only the first n − 1 moments of the tn distribution exist. 2 4. Note that if Xi IID ∼ N (0, σ ), then the RV T = pPn i=2 X1 Xi2 /(n − 1) has a tn−1 distribution, and is a measure of the length of X1 compared to the root mean square length of the other Xi s. i.e. if X has a spherical MVN (0, σ 2 I) distribution, then we would expect T not to be too large. This is, in effect, how the t distribution usually arises in practice. 5. The PDF 3.19 cannot be integrated analytically in general (exception: n = 1 d.f.). The CDF must be looked up in Statistical Tables or approximated using a computer. 36 3.6.3 Snedecor’s F Distribution Definition 3.6 (F Distribution) If Y ∼ χ2m , Z ∼ χ2n and Y ⊥ ⊥ Z, then the distribution of X= Y /m Z/n is called an F distribution on m & n degrees of freedom, and we write X ∼ Fm,n . Figure 3.4: F distributions for selected d.f. Vertical lines show the 2.5%, 16%, 50%, 84% and 97.5% points. Comments 1. Note that the numerator Y /m and denominator Z/n of X both have mean 1. Therefore, provided both m and n are large, X will usually take values around 1. 2. If X ∼ Fm,n , then the (extraordinarily unmemorable) density of X can be shown to be Γ (m + n)/2 mm/2 nn/2 x(m/2)−1 fX (x) = × for x > 0, Γ(m/2) Γ(n/2) (mx + n)(m+n)/2 with fX (x) = 0 for x ≤ 0. 37 (3.20) 3.7 Problems 1. Let Z ∼ N (0, 1) & Y = Z 2 , and let φ(·) & Φ(·) denote the PDF & CDF respectively of the standard Normal N (0, 1) distribution. √ √ (a) Show that FY (y) = Φ( y) − Φ(− y). √ (b) Express fY (y) in terms of φ( y). (c) Hence show that 1 fY (y) = √ y −1/2 e−y/2 2π for y > 0. (d) Find the MGF of Y . 2. Using Formula 3.18 for the PDF of the χ2 distribution, show that if X ∼ χ2n then the MGF of X is MX (t) = (1 − 2t)−n/2 . (3.21) Deduce that if X ∼ χ2m & Y ∼ χ2n with X ⊥ ⊥ Y , then (X + Y ) ∼ χ2m+n . 3. Given Z1 , Z2 IID ∼ N (0, 1), what is the probability that the point (Z1 , Z2 ) lies (a) in the square {(z1 , z2 ) | −1 < z1 < 1 & −1 < z2 < 1}, (b) in the circle {(z1 , z2 ) | (z12 + z22 ) < 1}? 4. Let Z1 , Z2 , . . . be independent random variables, each with mean 0 and variance 1, and let µi , σi and ρij be constants with −1 ≤ ρij ≤ 1. Let Y1 = Z1 , Y2 = ρ12 Z1 + q 1 − ρ212 Z2 , and define Xi = µi + σi Yi , i = 1, 2. (a) Show that E[Xi ] = µi , Var[Xi ] = σi2 (i = 1, 2), and that ρ12 is the correlation between X1 and X2 . (b) Find constants c0 , c1 , c2 and c3 such that Y3 = c0 + c1 Z1 + c2 Z2 + c3 Z3 has mean 0, variance 1, and correlations ρ13 & ρ23 with Y1 and Y2 respectively. 
(c) Hence show that the random vector Z = (Z1 , Z2 , Z3 )T with zero mean vector and identity variance-covariance matrix can be transformed to give a random vector X = (X1 , X2 , X3 )T with specified first and second moments, subject to constraints on the correlations corr[Xi , Xj ] = ρij including ρ212 + ρ213 + ρ223 ∈ [0, 1 + 2ρ12 ρ13 ρ23 ]. (d) What can you say about the distribution of X when Z has a standard trivariate Normal distribution and ρ212 + ρ213 + ρ223 is at one of the extremes of its allowable range (i.e. 0 or 1 + 2ρ12 ρ13 ρ23 )? From Warwick ST217 exam 2001 5. Let Z = (Z1 , Z2 , . . . , Zm+n )T ∼ MVN m+n (0, I). qP m+n 2 Zi . (a) Describe the distribution of Y = Z/ 1 Pm 2 Pm+n 2 (b) Show that the RV X = (n 1 Yi )/(m m+1 Yi ) has an Fm,n distribution. (c) Hence show that if Y = (Y1 , Y2 , . . . , Ym+nP )T has any continuous spherically symmetric distriPm+n m bution centred at the origin, then X = (n 1 Yi2 )/(m m+1 Yi2 ) has an Fm,n distribution. 6. Suppose that X has a χ2n distribution with PDF given by Formula 3.18. Find the mean, mode & variance of X, and an approximate variance-stabilising transformation. 38 7. Suppose that Yi are independent RVs with Poisson distributions: Yi ∼ Poi (λi ), i = 1, . . . , k. √ (a) Assuming that λi is large, what is the approximate distribution of Zi = (Yi − λi )/ λi ? Pk (b) Hence or otherwise show that if all the λi s are large, then the RV X = i=1 (Yi − λi )2 /λi has approximately a χ2k distribution. 8. Suppose that the RVs Oi have independent Poisson distributions: Oi ∼ Poi (npi ), i = 1, . . . , k, where Pk i=1 pi = 1. (a) Find EOi and Var Oi . Hence or otherwise show that E[Oi − npi ] = 0 and Var[Oi − npi ] = npi . Pk (b) Define the RV N by N = i=1 Oi . What is the distribution of N ? (c) Define the RVs Ei = N pi , i = 1, . . . , k. Show that EEi = npi and VarEi = np2i . Pk (d) By writing E[O1 E1 ] = p1 E[O12 ] + E[O1 i=2 Oi ] , or otherwise, show that Cov(O1 , E1 ) = np21 . (e) Deduce that the RV (Oi − Ei ) has mean 0 and variance npi (1 − pi ) for i = 1, . . . , k. 9. (a) Define a multivariate standard Normal distribution N (0, I), where I denotes the identity matrix. Given Z = (Z1 , Z2 , . . . , Zn )T ∼ N (0, I), write down functions of Z (i.e. transformed random variables) having i. a chi-squared distribution on (n − 1) degrees of freedom, and ii. a t distribution on (n − 1) degrees of freedom. T (b) Let Pn Z = (Z1 , Z2 , . . . , Zn ) have a multivariate standard Normal distribution, and let Z = i=1 Zi /n. Also let A = (aij ) be an n × n orthogonal matrix, i.e. AAT = I, and define the random vector Y = (Y1 , Y2 , . . . , Yn )T by Y = AZ. Quoting any properties of probability distributions that you require, show the following: Pn Pn i. Show that i=1 Yi2 = i=1 Zi2 . ii. Show that Y ∼ N (0, I). iii. Show that for suitable choices of ki , i = 1, . . . , n (where ki > 0 for all i), the following matrix A is orthogonal, and find ki : k1 −k1 0 ... 0 0 k2 k2 −2k2 . . . 0 0 .. .. .. . . . . . . . . . . . . A= . kn−2 kn−2 kn−2 . . . −(n − 2)kn−2 0 kn−1 kn−1 kn−1 . . . kn−1 −(n − 1)kn−1 kn kn kn ... kn kn Pn−1 Pn √ iv. With the above definition of A, show that i=1 Yi2 = i=1 (Zi − Z)2 and that Yn = n Z. Pn v. Hence show that the RVs Z and i=1 (Zi − Z)2 are independent and have N (0, 1/n) and χ2n−1 distributions respectively. Pn 2 vi. Hence or otherwise show that if X1 , X2 , . . . , Xn IID ∼ N (µ, σ ), and X = i=1 Xi /n, then the random variable X T =q P n 1 2 i=1 (Xi − X) n(n−1) has a t distribution on n − 1 degrees of freedom. 
From Warwick ST217 exam 2000 10. Let z(m, n, P ) denote the P % point of the Fm,n distribution. Without looking in statistical tables, what can you say about the relationships between the following values: (a) z(2, 2, 50) and z(20, 20, 50), (c) z(2, 20, 16) and z(20, 2, 84), (b) z(2, 20, 50) and z(20, 2, 50), (d) z(20, 20, 2.5) and z(20, 20, 97.5)? 39 11. Suppose that Zi IID ∼ N (0, 1), i = 1, 2, . . . What is the distribution of the following RVs? (a) X1 = Z1 + Z2 − Z3 (b) X2 = Z1 + Z2 Z1 − Z2 (c) X3 = (Z1 − Z2 )2 (Z1 + Z2 )2 (d) X4 = (Z1 + Z2 )2 + (Z1 − Z2 )2 2 (e) 2Z1 X5 = p 2 Z2 + Z32 + Z42 + Z52 (f) (Z1 + Z2 + Z3 ) X6 = p 2 Z4 + Z52 + Z62 (g) X7 = 3(Z1 + Z2 + Z3 + Z4 )2 (Z1 + Z2 − Z3 − Z4 )2 + (Z1 − Z2 + Z3 − Z4 )2 + (Z1 − Z2 − Z3 + Z4 )2 (h) X8 = 2Z12 + (Z2 + Z3 )2 12. For each of the RVs Xi defined in the previous question, use Statistical Tables to find ci (i = 1 . . . 8) such that Pr(Xi > ci ) = 0.95. 13. Show that the PDFs of the t and F distributions (definitions 3.5 & 3.6) are indeed given by formulae 3.19 & 3.20. 14. (a) Define the Standard Multivariate Normal distribution MVN (0, I). (b) Given Z = (Z1 , Z2 , . . . , Zm+n )T ∼ MVN (0, I), write down transformed random variables X(Z), T (Z) and Y (Z) with the following distributions: i. X ∼ χ2n , ii. T ∼ tn , iii. Y ∼ Fm,n . (c) Given that the PDF of X ∼ χ2n is fX (x) = 2n/2 1 x(n/2)−1 e−x/2 Γ(n/2) and fX (x) = 0 elsewhere, show that i. E[X] = n, ii. E[X 2 ] = n2 + 2n, and 40 for x > 0, iii. E[1/X] = 1/(n − 2) (provided n > 2). (d) Hence or otherwise find 2 i. the variance σX of X ∼ χ2n , ii. the mean µY of Y ∼ Fm,n and iii. the mean µT and variance σT2 of T ∼ tn , 2 stating under what conditions σX , µY , µT and σT2 exist. From Warwick ST217 exam 1998 Theory is often just practice with the hard bits left out. J. M. Robson Get a bunch of those 3–D glasses and wear them at the same time. Use enough to get it up to a good, say, 10– or 12–D. Rod Schmidt The Normal . . . is the Ordinary made beautiful; it is also the Average made lethal. Peter Shaffer Symmetry, as wide or as narrow as you define is meaning, is one idea by which man through the ages has tried to comprehend and create order, beauty and perfection. Hermann Weyl 41 This page intentionally left blank (except for this sentence). 42 Chapter 4 Inference for Multiparameter Models 4.1 4.1.1 Introduction: General Concepts Modelling Given a random vector X = (X1 , X2 , . . . , Xp ), we can describe the joint distribution of the Xi s by the CDF FX (x) or, usually more conveniently, by the PMF or PDF fX (x). Interrelationships between the Xi s can be described using 1. marginals Fi (xi ), fi (xi ), Fij (xi , xj ), etc., 2. conditionals Gi (xi |xj , j 6= i), gi (xi |xj , j 6= i), Gij (xi , xj |xk , k 6= i, j), etc., 3. conditional expectations E[Xi |Xj ], Var[Xi |Xj ], etc. Often FX (x) is assumed to lie in a family of probability distributions: F = {F (x|θ) | θ ∈ ΩΘ } (4.1) where ΩΘ is the ‘parameter space’. The process of formulating, choosing within, & checking the reasonableness of, such families F, is called statistical modelling (or probability modelling, or just modelling). Exercise 4.1 The data-set in Table 4.1, plotted in Figure 1.1 (page 2), shows patients’ blood pressures before and after treatment. Suggest some reasonable models for the data. k 4.1.2 Data In practice, we typically have a set of data in which d variables are measured on each of n ‘cases’ (or ‘individuals’ or ‘units’): D = case.1 case.2 .. . case.n var.1 var.2 ··· var.d x11 x21 .. . x12 x22 .. . 
··· ··· .. . x1d x2d .. . xn1 xn2 ··· xnd 43 (4.2) Patient Number before Systolic after change before Diastolic after change 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 210 169 187 160 167 176 185 206 173 146 174 201 198 148 154 201 165 166 157 147 145 168 180 147 136 151 168 179 129 131 -9 -4 -21 -3 -20 -31 -17 -26 -26 -10 -23 -33 -19 -19 -23 130 122 124 104 112 101 121 124 115 102 98 119 106 107 100 125 121 121 106 101 85 98 105 103 98 90 98 110 103 82 -5 -1 -3 2 -11 -16 -23 -19 -12 -4 -8 -21 4 -4 -18 Table 4.1: Supine systolic and diastolic blood pressures of 15 patients with moderate hypertension (high blood pressure), immediately before and 2 hours after taking 25mg of the drug captopril. Data from HSDS, set 72 Definition 4.1 (Data Matrix) A set of data D arranged in the form of 4.2 is called a data matrix or a cases-by-variables array. The data-set D is assumed to be a representative sample (of size n) from an underlying population of potential cases. This population may be actual, e.g. the resident population of England & Wales at noon on June 30th 1993, or purely theoretical/hypothetical, e.g. MVN(µ, V). Exercise 4.2 Table 4.2 presents data on ten asthmatic subjects, each tested with 4 drugs. Describe various ways that the data might be set out as a data matrix for analysis by a statistical computing package. k 4.1.3 Statistical Inference Statistical inference is the art/science of using the sample to learn about the population (and hence, implicitly, about future samples). Typically we use statistics (properties of the sample) to learn about parameters (properties of the population). This activity might be: 1. Part of analysing a formal probability model, b of θ, after making an assumption as in Expression 4.1, or e.g. calculating the MLEs θ 2. Purely to summarise the data as a part of ‘data analysis’ (Section 4.2), For example, given X1 , X2 , . . . , Xn IID ∼ FX (unknown), the statistics S1 = 1X Xi = X, n S2 = 1X (Xi − X)2 , n 44 S3 = 1X (Xi − X)3 n Patient number 5 6 7 Drug Time 1 2 3 4 8 9 10 P −5 mins +15 mins 0.0 3.8 2.3 9.2 2.4 5.4 1.9 3.3 1.6 4.2 4.8 15.1 0.6 1.3 2.7 6.7 0.9 4.2 1.3 3.1 C −5 mins +15 mins 0.5 2.0 1.0 5.3 2.0 7.5 1.1 6.4 2.1 4.1 6.8 9.1 0.6 0.6 3.1 14.8 1.5 2.4 3.0 2.3 D −5 mins +15 mins 0.8 2.4 2.3 4.8 0.8 2.4 0.8 1.9 1.2 1.2 9.6 12.5 1.1 1.7 9.7 12.5 0.8 4.3 4.9 8.1 K −5 mins +15 mins 0.2 0.4 1.7 3.4 2.2 2.0 0.1 1.3 1.7 3.4 9.2 6.7 0.6 1.1 12.7 12.5 1.1 2.7 2.8 5.7 Table 4.2: NCF (Neutrophil Chemotactic Factor) of ten individuals, each tested with 4 drugs: P (Placebo), C (Clemastine), D (DSCG), K (Ketotifen). On a given day, an individual was administered the chosen drug, and his NCF measured 5 minutes before, and 15 minutes after, being given a ‘challenge’ of allergen. Data from Dr. R. Morgan of Bart’s Hospital provide measures of location, scale and skewness. Note that here we’re implicitly estimating the corresponding population quantities µX = EX, E (X − µX )2 , E (X − µX )3 , and using these as measures of population location, scale and skewness. Without a formal probability model, it can be hard to judge whether these or some other measures may be most appropriate. In both cases, the CLT & its generalisations (to higher dimensions and to ‘near-independence’) show that, b or (S1 , S2 , S3 ), is under reasonable conditions, the joint distribution of the statistics of interest, such as θ approximately MVN. This approximation improves if 1. the sample size n → ∞, and/or 2. the joint distribution of the random variables being summed (e.g. 
the original random vectors X1 , X2 , . . . , Xn ) is itself close to MVN. QUESTIONS: How should we interpret this? How should we try to link probability models to reality? 4.2 Data Analysis Data analysis is the art of summarising data while attempting to avoid probability theory. For example, you can calculate summary statistics such as means, medians, modes, ranges, standard deviations etc., thus summarising in a few numbers the main features of a possibly huge data-set. For example, the (0%, 25%, 50%, 75%, 100%) points of the data distribution (i.e. minimum, lower quartile, median, upper quartile and maximum) form the five-number summary, and the inter-quartile range (IQR = upper quartile - lower quartile) is a measure of spread, containing the ‘middle 50%’ of the data. These summaries can be formalised as follows Definition 4.2 (Order statistics) Given RVs X1 , X2 , . . . , Xn , one can order them and denote the smallest of the Xi s by X(1) , the second smallest by X(2) , etc. Then X(k) is called the kth order statistic. 45 Thus X(1) , X(2) , . . . , X(n) are a permutation of X1 , X2 , . . . , Xn , and x(n) , the observed value of X(n) , denotes the largest observed value in a sample of size n. Given ordered data x(1) ≤ x(2) ≤ · · · ≤ x(n) , one can define: Definition 4.3 (Sample median) xM We can always write xM = x n 2 x = 1 2 x + 1 2 n 2 n+1 2 +x if n is odd, n 2 if n is even. +1 , provided we adopt the following convention: 1. If the number in brackets is exactly half-way between two integers, then take the average of the two corresponding order statistics. 2. Otherwise round the bracketed subscript to the nearest integer, and take the corresponding order statistic. Similarly the quartiles etc. can be formally defined as follows: Definition 4.4 (Sample lower quartile) xL = x n 4 + 1 2 , Definition 4.5 (Sample upper quartile) xU = x 3n 4 + 1 2 , Definition 4.6 (100p th sample percentile) x100p% = x pn 100 + 1 2 , Definition 4.7 (Five number summary) x(1) , xL , xM , xU , x(n) . 4.3 Classical Inference 4.3.1 Introduction In ‘classical statistical inference’, the typical procedure is: 1. Choose a family F of models indexed by θ (formula 4.1). 2. Assume temporarily that the true distribution lies in F i.e. data D ∼ F (d|θ) for some true but unknown parameter vector θ ∈ ΩΘ . 3. Compare possible models according to some criterion of compatibility between the model & the data (equivalently, between the population & the sample). 4. Assess the chosen model(s), and go back to step (1) or (2) if the model proves inadequate. 46 Comments 1. Step 1 is a compromise between (a) what we believe is the true underlying mechanism that produced the data, and (b) what we can do mathematically. If in doubt, keep it simple. 2. Step 2, by assuming a true θ exists, implicitly interprets probability as a property of Nature e.g. a ‘fair’ coin is assumed to have an intrinsic property: if you toss it n times, then the proportion of ‘heads’ tends to 1/2 as n → ∞. Thus probability represents a ‘long-run relative frequency’. 3. Most statistical computer packages currently use the classical approach, and we’ll mainly be using classical inference in MSB. 4. There are many possible criteria at step 3. For example, hypothesis-testing and likelihood approaches are both discussed briefly below. 4.3.2 Point Estimation (Univariate) Given RVs X = (X1 , X2 , . . . , Xn ), a point estimator for an unknown parameter Θ ∈ ΩΘ is simply a function b Θ(X) taking values in the parameter space ΩΘ . 
Once data X = x are obtained, one can calculate the b corresponding point estimate θb = Θ(x). b to be considered a ‘good’ estimator of Θ. For example: There are many plausible criteria for Θ 1. Mean Squared Error b to be small whatever the true value θ of Θ, where One would like the mean squared error (MSE) of Θ b = E (Θ b − θ)2 . MSE(Θ) (4.3) b has minimum mean squared error if In particular, an estimator Θ b = min MSE(Θ b 0 ). MSE(Θ) b0 Θ 2. Unbiasedness Definition 4.8 (Bias) b is The bias of an estimator Θ b = E[Θ b − θ|Θ = θ]. Bias(Θ) (4.4) Exercise 4.3 b = Var(Θ) b + (Bias Θ) b 2. Show that MSE(Θ) k Definition 4.9 (Unbiasedness) b for a parameter Θ is called unbiased if E[Θ|Θ b An estimator Θ = θ] = θ for all possible true values θ of Θ. 47 Example Given a random sample X1 , X2 , . . . , Xn , i.e. Xi IID ∼ FX (x), where FX is a member of some family F of probability distributions, Pn (a) X = i=1 Xi /n is an unbiased estimate of the mean µX = EX of FX . Pn Pn (b) More generally, any statistic of the form i=1 wi Xi , where i=1 wi = 1, is an unbiased estimate of µX . Pn 2 of FX , but (c) σ b12 = i=1 (Xi − X)2 /(n − 1) is an unbiased estimate of the variance σX P n 2 2 2 (d) σ b2 = i=1 (Xi − X) /n is NOT an unbiased estimate of the variance σX of FX . 3. Efficiency & Minimum Variance Unbiased Estimation b1 & Θ b 2 for a parameter Θ, the efficiency of Θ b 1 relative to Θ b 2 is Given two unbiased estimators Θ defined by b b 1, Θ b 2 ) = Var(Θ1 ) . Eff(Θ (4.5) b 2) Var(Θ Definition 4.10 (MVUE) b out The Minimum Variance Unbiased Estimator of a parameter Θ is the unbiased estimator Θ, of all possible unbiased estimators, that has minimum variance. Example Given Xi IID ∼ FX (x) ∈ F, the family of all probability distributions with finite mean & variance, it can be shown that (a) X is the MVUE of the mean µX = EX of FX , and Pn 2 2 (b) i=1 (Xi − X) /(n − 1) is the MVUE of the variance σX of FX . Note that there are major problems with using MVUE as a criterion for estimation: (a) The MVUE may not exist (e.g. in general there is no unbiased estimator for the underlying standard deviation σX of X). (b) The MVUE may exist but be nonsensical (see Problems). (c) Even if the MVUE exists and appears reasonable, other (biased) estimators may be better by other criteria, for example by having smaller mean squared error, which is much more important in practice than being unbiased. 4. Consistency Definition 4.11 (Consistency) b 1, Θ b 2 , . . . is consistent for Θ ∈ ΩΘ if, for all > 0 and for all θ ∈ ΩΘ , A sequence of estimators Θ b n − θ| > |Θ = θ) = 0. lim Pr(|Θ n→∞ 5. Sufficiency b 1 , . . . Xn ) is sufficient for Θ if the conditional distribution of (X1 , . . . Xn ) given Θ b = θb & Θ = θ Θ(X does not depend on θ. See MSA. 6. Maximum likelihood See MSA. 7. Invariance See Casella & Berger, page 300. 48 8. The ‘plug in’ property If θ is a specified property of the CDF F (x), then θb is the corresponding property of the empirical CDF 1 Fb(x) = × (number of Xi ≤ xi ). (4.6) n For example (assuming the named quantities exist): P (a) the sample mean θb = x = xi /n is the plug-in estimate of the population mean θ = EX, (b) the sample median is the plug-in estimate of the population median F −1 (0.5). 4.3.3 Hypothesis Testing (Introduction) In this approach you 1. Choose a statistic T that has a known distribution F0 (t) if the true parameter value is θ = θ 0 (for some particular parameter value θ 0 of interest). 
The statistic T should provide a measure of the discrepancy of the data D from what would be reasonable if θ = θ 0 . 2. Test the hypothesis ‘θ = θ 0 ’ using the tail probabilities of F0 . An example is the ‘Chi-squared’ statistic used in MSA. Hypothesis testing will be covered in more detail in chapter 5. Some problems with the standard hypothesis testing approach are: 1. In practice, we don’t really believe that θ = θ 0 is ‘true’ and all other possible values of θ are ‘false’; instead we just wish to adopt ‘θ = θ 0 ’ as a convenient assumption, because it’s as good as, and simpler than, other models. 2. If we really do want to make a decision [e.g. to give drug ‘A’ or drug ‘B’ to a particular patient], then we should weigh up the possible consequences. 3. It’s hard to create appropriate hypothesis tests in complex situations, such as to test whether or not θ lies in a particular subset Ω0 of the parameter space Ω. Unfortunately, real life is a complex situation. 4.3.4 Likelihood Methods Use the likelihood function L(θ; D) = Pr(D|θ) (constant) × f (D|θ) (discrete case) (continuous case), (4.7) or equivalently the log-likelihood or ‘support’ function `(θ; D) = log L(θ; D) (4.8) as a measure of the compatibility between data D and parameter θ. In particular, the MLE corresponds to the particular F b ∈ F that is most compatible with the data D. θ Likelihood underlies the most useful general approaches to statistics: 1. It can handle several parameters simultaneously. 2. The CLT implies that in many cases the log-likelihood will be approximately quadratic in θ (at least near the MLE). This makes both theory and numerical computation easier. 49 However, there are difficulties with basing inference solely on likelihood: 1. How should we handle ‘nuisance parameters’ (i.e. components θi that we’re not interested in)? Note that it makes no sense to integrate over values of θi to get a ‘marginal likelihood’ for the other θj s, since L(θ; d) is NOT a probability density or probability function—we would get a different marginal likelihood if we reparametrised say by θi 7→ log θi . 2. A more fundamental problem is that likelihood takes no account of how far-fetched the model might be (‘high likelihood’ does NOT mean ‘likely’ !) This suggests that in practice we may wish to incorporate information not contained in the likelihood: 1. Prior information/Expert opinion: Are there external reasons for doubting some values of θ more than others? 2. For decision-making: How relatively important are the possible consequences of our inferences? [e.g. an innocent person is punished / a murderer walks free]. 4.4 Problems 1. How might the mortality data in Tables 1.1 and 1.2 (pages 8 & 9) be set out as a data matrix? b 1 , . . . Xn ) is unbiased. Show that θb is consistent iff limn→∞ Var(θ(X b 1 , . . . Xn )) = 0. 2. Suppose that θ(X 3. Given Xi IID ∼ FX (x), where FX is a member of some family F of probability distributions, show that Pn Pn (a) Any statistic of the form i=1 wi Xi , where i=1 wi = 1, is an unbiased estimate of µX = EX, Pn (b) The mean X = i=1 Xi /n is the unique UMVUE of this form, Pn 2 (c) σ b2 = i=1 (Xi − X)2 /(n − 1) is an unbiased estimate of the variance σX of FX . 4. The number of mistakes made each lecture by a certain lecturer follow independent Poisson distributions, each with mean λ > 0. You decide to attend the Monday lecture, note the number of mistakes X, and use X to estimate the probability p that there will be no mistakes in the remaining two lectures that week. 
(a) Show that p = exp(−2λ). (b) Show that the only unbiased estimator of p (and hence, trivially, the MVUE), is 1 if X is even, pb = −1 if X is odd. (c) What is the maximum likelihood estimator of p? (d) Discuss (briefly) the relative merits of the MLE and the MVUE in this case. 5. Let T be an unbiased estimator for g(θ), let S be a sufficient statistic for θ, and let φ(S) = E[T |S]. Prove the Rao-Blackwell theorem: φ(S) is also an unbiased estimator of g(θ), and Var[φ(S)|θ] ≤ Var[T |θ], for all θ, and interpret this result. 50 6. (a) Explain what is meant by an unbiased estimator for an unknown parameter θ. (b) Show, using moment generating functions or otherwise, that if X1 & X2 have independent Poisson distributions with means λ1 & λ2 respectively, then their sum (X1 + X2 ) follows a Poisson distribution with mean (λ1 + λ2 ). (c) A particular sports game comprises four ‘quarters’, each lasting 15 minutes, and a statistician attending the game wishes to predict the probability p that no further goals will be scored before full time. The statistician assumes that the numbers Xk of goals scored in the kth quarter follow independent Poisson distributions, each with (unknown) mean λ, so that Pr(Xk = x) = λx −λ e x! (k = 1, 2, 3, 4; x = 0, 1, 2, . . .). Suppose that the statistician makes his prediction halfway through the match (i.e. after observing X1 = x1 & X2 = x2 ). Show that an unbiased estimator of p is 1 if (x1 + x2 ) = 0, T = 0 otherwise. (d) Suppose the statistician also made a prediction after 15 minutes. Show that in this case the ONLY unbiased estimator of p given X1 = x1 is x 2 1 if x1 is even, T = −2x1 if x1 is odd. (e) What are the maximum likelihood estimators of p after 15 and after 30 minutes? (f) Briefly compare the advantages of maximum likelihood and unbiased estimation for this situation. From Warwick ST217 exam 1997 7. (a) Explain what is meant by a minimum variance unbiased estimator (MVUE). (b) Let X and Y be random variables. Write down (without proof) expressions relating E[Y ] and Var[Y ] to the conditional moments E[Y |X] and Var[Y |X]. (c) Let S be a sufficient statistic for a parameter θ, let T be an unbiased estimator for τ (θ), and define W = E[T |S]. Show that i. W is an unbiased estimator for τ (θ), and ii. Var[W ] ≤ Var[T ] for all θ. Deduce that a MVUE, if one exists, must be a function of a sufficient statistic. (d) Let X1 , X2 , . . . , Xn be IID Bernoulli random variables, i.e. Pr(Xi = 1) = θ i = 1, 2, . . . , n. Pr(Xi = 0) = 1 − θ i. Show that S = ii. Define T by Pn i=1 Xi is a sufficient statistic for θ. T = 1 0 if X1 = 1 and X2 = 0, otherwise. What is E[T ]? iii. Find E[T |S], and hence show that S(n − S)/(n − 1) is an MVUE of Var[S] = nθ(1 − θ). From Warwick ST217 exam 1999 51 8. Given Xi IID ∼ Poi (θ), compare the following possible estimators for θ in terms of unbiasedness, consistency, relative efficiency, etc. n θb1 1X Xk , = X= n k=1 θb2 = θb3 = θb4 = 1 n 100 + n X ! Xk , k=1 1 (X2 − X1 )2 , 2 n 1X (Xk − X)2 , n k=1 n = 1 X (Xk − X)2 , n−1 θb6 = (θb1 + θb5 )/2, θb7 = median(X1 , X2 , . . . , Xn ), θb8 = mode(X1 , X2 , . . . , Xn ), θb9 = X 2 kXk , n(n + 1) θb5 k=1 n k=1 θb10 9. [Light relief] = 1 n−1 n X Xk . k=2 Discuss the following possible defence submission at a murder trial: ‘The supposed DNA match placing the defendant at the scene of the crime would have arisen with even higher probability if the defendant had a secret identical twin [the more people with that DNA, the more chances of getting a match at the crime scene]. 
‘Now assume that my client has been cloned θ times, θ ∈ {0, 1, . . . , n} for some n > 0. Clearly the larger the value of θ, the higher the probability of obtaining the observed DNA results [every increase in θ means another clone who might have been at the scene of the crime]. ‘Therefore the MLE of θ is n. ‘But then, even assuming somebody with my client’s DNA committed this terrible crime, the probability that it was my client is only 1/(n + 1) (under reasonable assumptions). ‘Therefore you cannot say that my client is, beyond a reasonable doubt, guilty. ‘The defence rests.’ 4.5 4.5.1 Bayesian Inference Introduction Classical inference regards probability as a property of physical objects (e.g. a ‘fair coin’). An alternative interpretation uses probability to represent an individual’s (lack of) understanding of an uncertain situation. 52 Examples 1. ‘I have no reason to suspect that “heads” or “tails” are more likely. Therefore, by symmetry, my current probability for this particular coin’s coming down “heads” is 1/2.’ 2. ‘I doubt the accused has any previously-unknown identical siblings. I’d bet 100,000 to 1 against’ (i.e. if θ is the number of identical siblings, then my probability for θ > 0 is 1/100001). Different people, with different knowledge, can legitimately have different probabilities for real-world events (therefore it’s good discipline to say ‘my probability for. . . ’ rather than ‘the probability of. . . ’). As you learn. your probabilities can be continually updated using Bayes’ theorem, i.e. Pr(A|B) = Pr(B|A) × Pr(A) Pr(B) (4.9) assuming Pr(B) is positive, and using the fact that Pr(A&B) = Pr(A|B) Pr(B) = Pr(B|A) Pr(A) . The Bayesian approach to statistical inference treats all uncertainty via probability, as follows: 1. You have a probability model for the data, with PMF p(D|Θ). 2. Your prior PMF for Θ (i.e. your PMF for Θ based on a combination of expert opinion, previous experience, and your own prejudice), is p(θ). 3. Then Bayes’ theorem says p(θ|D) = p(D|θ) p(θ) p(D) or, since once the data have been obtained p(D) is a constant, p(θ|D) ∝ p(D|θ) p(θ) ∝ L(θ; D) p(θ) i.e. ‘posterior probability ∝ ‘likelihood’ × ‘prior’ (4.10) Formula 4.10 also applies in the continuous case, in which case p(·) represents a PDF. Comments 1. Further applications to decision theory are given in the third year course ST301. 2. Note that if θ = (θ1 , θ2 , . . . , θp ), then p(θ|D) is a p-dimensional function, and may prove difficult to manipulate, summarise or visualise. 3. Treating all uncertainty via probability has the advantage that one-off events (e.g. management decisions, or the results of horse races) can be handled. However, it’s not at all obvious that all uncertainty can be treated via probability! 4. As with Classical inference, a Bayesian analysis of a problem should involve checking whether the assumptions underlying p(D|θ) and p(θ) are reasonable, and rethinking & reanalysing the model if necessary. Exercise 4.4 Describe the Bayesian approach to statistical inference, denoting the data by x, the prior by fΘ (θ), and the likelihood by L(θ; x) = fX|Θ (x|θ). k 53 4.6 Nonparametric Methods Standard Classical and Bayesian methods make strong assumptions, e.g. Xi IID ∼ F (x|θ) for some θ ∈ Ω. Assumptions of independence are critical (what aspects of the problem provide information about other aspects?) Assumptions about the form of probability distributions are often less important, at least provided the sample size n is large. However, there are exceptions to this: 1. 
It might be that the probability distribution encountered in practice is fundamentally different from the form assumed in our model. For example, some probability distributions are so ‘heavy-tailed’ that their means don’t exist e.g. the Cauchy distribution with f (x) = 1/π(1 + x2 ), x ∈ R . 2. Some data may be recorded incorrectly, or there may be a few atypically large/small data values (‘outliers’), etc. 3. In any case, what if n is small and the CLT can’t be invoked? ‘Nonparametric’ methods don’t assume that the actual probability distribution F (·|θ) lies in a particular parametric family F; instead they make more general assumptions, for example 1. ‘F (x) is symmetric about some unknown value Θ’. Note that this may be a reasonable assumption even if EX doesn’t exist. Θ is the (unknown) median of the population, i.e. Pr(X < Θ) = Pr(X > Θ). Therefore one could estimate Θ by the median of the data (though better methods may exist). 2. ‘F (x, y) is such that if (Xi , Yi ) IID ∼ F , (i = 1, 2), then Pr(Y1 < Y2 |X1 < X2 ) = 1/2’. This is a nonparametric version of the statement ‘X & Y are uncorrelated’. Many statistical methods involve estimating means, as we’ll see in the rest of the course (t-tests, linear regression, many MLEs etc.) Corresponding nonparametric methods typically involve medians—or equivalently, various probabilities. Exercise 4.5 Suppose that X has a continuous distribution. Show that a test of the statement ‘median of X is θ0 ’ is equivalent to a test of the statement ‘Pr(X < θ0 ) = 1/2’. If Xi are IID, what is the distribution of R = (number of Xi < θT ), where θT is the true value of θ? k Other nonparametric methods involve ranking the data Xi : replacing the smallest Xi by 1, the next smallest by 2, etc. Classical statistical methods can then be applied to the ranks. Note that the effect of outliers will be reduced. Example Given data (Xi , Yi ), i = 1, . . . , n from a continuous bivariate distribution, ‘Spearman’s rank correlation’ (often written ρS ) can be calculated as follows: 1. replace the Xi values by their ranks Ri , 2. similarly replace the Yi values by their ranks Si , 3. calculate the usual (‘product-moment’ or ‘Pearson’s’) correlation between the Ri s and Si s. 54 Comments 1. If the distribution of the original RVs is not continuous, then some data values may be repeated (‘tied ranks’). Repeated Xi s are given averaged ranks (for example, if there are two Xi with the smallest value, then they are each given rank 1.5 = (1 + 2)/2). 2. If X ⊥ ⊥ Y , so the ‘true’ ρS is zero,Pthen the distribution of the calculated ρS is easily approximated n (using the standard formulae for i=1 ik ). 3. ‘Easily approximated’ does not necessarily mean ‘well approximated’ ! 4. Most books give another formula for ρS , which is equivalent unless there are tied ranks, but which obscures the relationship with the standard product-moment correlation P (xi − x)(yi − y) ρ = pP . P (xi − x)2 (yi − y)2 5. Other, perhaps better, types of nonparametric correlation have been defined (‘Kendall’s τ ’). 4.7 Graphical Methods A vital part of data analysis is to plot the data using bar-charts, histograms, scatter diagrams etc. Plotting the data is important no matter what further formal statistical methods will be used: 1. It enables you to ‘get a feel for’ the data, 2. It helps you look for patterns and anomalies, 3. It helps in checking assumptions (such as independence, linearity or Normality). 
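As a minimal illustration of how such plots might be produced (a sketch in Python assuming numpy, scipy and matplotlib; the simulated data and the choice of plotting positions are my own, purely for illustration), the code below draws a histogram and a hand-rolled Normal probability plot of the kind used in Exercise 4.6 below.

```python
# Sketch: histogram and Normal probability plot for checking Normality.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=27, scale=5, size=66)        # stand-in for a real data set

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3.5))

ax1.hist(x, bins=12)
ax1.set_title("Histogram")

n = len(x)
probs = (np.arange(1, n + 1) - 0.5) / n          # one common choice of plotting positions
z = norm.ppf(probs)                              # standard Normal quantiles z_i
ax2.plot(z, np.sort(x), "o")                     # sorted data y_(i) against z_i
ax2.set_xlabel("z_i")
ax2.set_ylabel("sorted data y_(i)")
ax2.set_title("Normal probability plot")

plt.tight_layout()
plt.show()
```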
Many useful plots can be easily churned out using a computer, though sometimes you have to devise original plots to display the data in the most appropriate way. Exercise 4.6 The following table shows 66 measurements on the speed of light, made by S. Newcomb in 1882. Values are the times in nanoseconds (ns), less 24,800 ns, for light to travel from his laboratory to a mirror and back. Values are to be read row-by-row, thus the first to observations are 24,828 ns and 24,826 ns. 28 29 24 37 36 26 29 26 22 20 25 23 32 27 33 24 36 28 27 32 28 24 21 32 26 27 24 29 34 25 36 30 28 39 16 -44 30 28 32 27 28 23 27 23 25 36 31 24 16 29 21 26 27 25 40 31 28 30 26 32 -2 19 29 22 33 25 Produce a histogram, a Normal probability plot and a time plot of Newcomb’s data. Decide which (if any) observations to ignore, and produce a normal probability plot of the remaining reduced data set. Finally compare the mean of this reduced data set with (i) the mean and (ii) the 10% trimmed mean of the original data. Solution: Plots are shown in Figure 4.1. There are clearly 2 large outliers, but the time plot also suggests that the 6th to 10th observations are unusually variable, and that the last two observations are atypically low (both being lower than the previous 20 observations). The Normal probability plot is calculated by calculating y(i) (the sorted data) and zi as follows, and plotting y(i) against zi . 55 i y(i) xi = (i+0.5)/(n+1) zi = Φ(xi ) 1 2 3 4 .. . −44 −2 16 16 .. . 0.0075 0.0224 0.0373 0.0522 .. . −2.434 −2.007 −1.783 −1.624 .. . 65 66 39 40 0.9776 0.9925 2.007 2.434 Omitting the first 10 and the last 2 recorded observations leaves a data-set where the Normality and independence assumptions are much more reasonable—see plot (d) of Figure 4.1. Location estimates are (i) 26.2, (ii) 27.4, (iii) 27.9. The trimmed mean is reasonably close to the mean of observations 11–64. Figure 4.1: Plots of Newcomb’s data: (a) histogram, (b) Normal probability plot, (c) time plot, (d) Normal probability plot of data after excluding the first 10 and last 2 observations. k 4.8 Bootstrapping ‘Bootstrap’ methods have become increasingly used over the past few years. They address the general question: 56 b given that the underlying ‘What are the properties of the calculated statistics (e.g. MLEs θ) distributional assumptions may be false (and, in reality, will be false)?’ Bootstrapping uses the observed data directly as an estimate of the underlying population, then uses ‘plug-in’ estimation, and typically involves computer simulation. Several other computer-intensive approaches to statistical inference have also become very popular recently. 4.9 Problems 1. [Light relief] Discuss the following quote: ‘As a statistician, I want to use mathematics to help deal with practical uncertainty. The natural mathematical way to handle uncertainty is via probability. ‘About the simplest practical probability statement I can think of is “The probability that a fair coin, tossed at random, will come down ‘heads’ is 1/2”. ‘Now try to define “fair coin”, “at random” and “probability 1/2” without using subjective probability or circular definitions. ‘Summary: if a practical probability statement is not subjective, then it must be tautologous, illdefined, or useless. ‘Of course, for balance, some of the time I teach subjective methods, and some of the time I teach useless methods :-).’ Ewart Shaw (Internet posting 13–Aug–1993). 2. (a) Plot the captopril data (Table 4.1), and suggest what sort of models seem reasonable. 
(b) Roughly estimate from your graph(s) the effect of captopril (C) on systolic and diastolic blood pressure (SBP & DBP). (c) Suggest a single summary measure (SBP, DBP or a combination of the two) to quantify the effect of treatment. (d) Do you think a transformation of the data would be appropriate? (e) Comment on the number of parameters in your model(s). (f) Calculate ρS and ρ between ∆S , the change (after-before) in SBP, and ∆D , the change (afterbefore) in DBP. Suggest some advantages and disadvantages in using ρS and ρ here. (g) Calculate some further summary statistics such as means, variances, correlations and fivenumber summaries, and comment on how useful they are as summaries of the data. (h) Are there any problems in using the data to estimate the effect of captopril? What further information would be useful? (i) What advantages/disadvantages would there be in using bootstrapping here, i.e. using the discrete distribution that assigns probability 1/15 to each of the 15 points x1 = (210, 201, 130, 125), x2 = (169, 165, 122, 121), . . . , x15 = (154, 131, 100, 82) as an estimate of the underlying population, and working out the properties of ρS , ρ, etc. based on that assumption? 57 This page intentionally left blank (except for this sentence). 58 Chapter 5 Hypothesis Testing 5.1 Introduction A hypothesis is a claim about the real world; statisticians will be interested in hypotheses like: 1. ‘The probabilities of a male panda or a female panda being born are equal’, 2. ‘The number of flying bombs falling on a given area of London during World War II follows a Poisson distribution’, 3. ‘The mean systolic blood pressure of 35-year-old men is no higher than that of 40-year-old women’, 4. ‘The mean value of Y = log(systolic blood pressure) is independent of X = age’ (i.e. E[Y |X = x] = constant). These hypotheses can be translated into statements about parameters within a probability model: 1. ‘p1 = p2 ’, n 2. ‘N ∼ Poi (λ) for some λ > 0’, Pi.e.: pn = Pr(N = n) = λ exp(−λ)/n! (within the general probability model pn ≥ 0 ∀n = 0, 1, . . .; pn = 1), 3. ‘θ1 ≤ θ2 ’ and 4. ‘β1 = 0’ (assuming the linear model E[Y |x] = β0 + β1 x). Definition 5.1 (Hypothesis test) A hypothesis test is a procedure for deciding whether to accept a particular hypothesis as a reasonable simplifying assumption, or to reject it as unreasonable in the light of the data. Definition 5.2 (Null hypothesis) The null hypothesis H0 is the simplifying assumption we are considering making. Definition 5.3 (Alternative hypothesis) The alternative hypothesis H1 is the alternative explanation(s) we are considering for the data. Definition 5.4 (Type I error) A type I error is made if H0 is rejected when H0 is true. Definition 5.5 (Type II error) A type II error is made if H0 is accepted when H0 is false. 59 Comments 1. In the first example above (pandas) the null hypothesis is H0 : p1 = p2 . 2. The alternative hypothesis in the first example would usually be H1 : p1 6= p2 , though it could also be (for example) (a) H1 : p1 < p2 , (b) H1 : p1 > p2 , or (c) H1 : p1 − p2 = δ for some specified δ 6= 0. 5.2 Simple Hypothesis Tests The simplest type of hypothesis testing occurs when the probability distribution giving rise to the data is specified completely under the null and alternative hypotheses. Definition 5.6 (Simple hypotheses) A simple hypothesis is of the form Hk : θ = θk , i.e. the probability distribution of the data is specified completely. 
Definition 5.7 (Composite hypotheses) A composite hypothesis is of the form Hk : θ ∈ Ωk , i.e. the parameter θ lies in a specified subset Ωk of the parameter space ΩΘ . Definition 5.8 (Simple hypothesis test) A simple hypothesis test tests a simple null hypothesis H0 : θ = θ0 against a simple alternative H1 : θ = θ1 , where θ parametrises the distribution of our experimental random variables X = X 1 , X 2 , . . . Xn . There may be many seemingly sensible approaches to testing a given hypothesis. A reasonable criterion for choosing between them is to attempt to minimise the chance of making a mistake: incorrectly rejecting a true null hypothesis, or incorrectly accepting a false null hypothesis. Definition 5.9 (Size) A test of size α is one which rejects the null hypothesis H0 : θ = θ0 in favour of the alternative H1 : θ = θ1 iff X ∈ Cα where Pr(X ∈ Cα | θ = θ0 ) = α for some subset Cα of the sample space S of X. Definition 5.10 (Critical region) The set Cα in Definition 5.9 is called the critical region or rejection region of the test. Definition 5.11 (Power & power function) The power function of a test with critical region Cα is the function β(θ) = Pr(X ∈ Cα | θ), and the power is β = β(θ1 ), i.e. the probability that we reject H0 in favour of H1 when H1 is true. A hypothesis test typically uses a test statistic T (X), whose distribution is known under H0 , and such that extreme values of T (X) are more compatible with H1 that H0 . Many useful hypothesis tests have the following form: 60 Definition 5.12 (Simple likelihood ratio test) A simple likelihood ratio test (SLRT) of H0 : θ = θ0 against H1 : θ = θ1 rejects H0 iff n L(θ ; x) o 0 ≤ Aα X ∈ Cα∗ = x L(θ1 ; x) where L(θ; x) is the likelihood of θ given the data x, and the number Aα is chosen so that the size of the test is α. Exercise 5.1 Suppose that X1 , X2 , . . . , Xn IID ∼ N (θ, 1). Show that the likelihood ratio for testing H0 : θ = 0 against H1 : θ = 1 can be written λ(x) = exp n x − 12 . Hence show that √ the corresponding SLRT of size α rejects H0 when the test statistic T (X) = X satisfies T > Φ−1 (1 − α)/ n. k Comments 1. For a simple hypothesis test, both H0 and H1 are ‘point hypotheses’, each specifying a particular value for the parameter θ rather than a region of the parameter space. 2. The size α is the probability of rejecting H0 when H0 is in fact true; clearly we want α to be small (α = 0.05, say). 3. Clearly for a fixed size α of test, the larger the power β of a test the better. However, there is an inevitable trade-off between small size and high power (as in a jury trial: the more careful one is not to convict an innocent defendant, the more likely one is to free a guilty one by mistake). 4. In practice, no hypothesis will be precisely true, so the whole foundation of classical hypothesis testing seems suspect! 5. Regarding likelihood as a measure of compatibility between data and model, an SLRT compares the compatibility of θ0 and θ1 with the observed data x, and accepts H0 iff the ratio is sufficiently large. 6. One reason for the importance of likelihood ratio tests is the following theorem, which shows that out of all tests of a given size, an SLRT (if one exists) is ‘best’ in a certain sense. Theorem 5.1 (The Neyman-Pearson lemma) Given random variables X1 , X2 , . . . , Xn , with joint density f (x|θ), the simple likelihood ratio test of a fixed size α for testing H0 : θ = θ0 against H1 : θ = θ1 is at least as powerful as any other test of the same size. 
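Before the proof, the test found in Exercise 5.1 can be checked numerically. The sketch below (Python, my own illustration, assuming numpy and scipy; the sample size, size α and number of replications are arbitrary choices) estimates by simulation the size and power of the test that rejects H0 when the sample mean exceeds Φ−1(1 − α)/√n.

```python
# Sketch: size and power of the SLRT in Exercise 5.1 (illustrative only).
# X_1,...,X_n IID N(theta, 1); H0: theta = 0 versus H1: theta = 1.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, alpha, reps = 10, 0.05, 100_000
crit = norm.ppf(1 - alpha) / np.sqrt(n)       # critical value for the sample mean

def rejection_rate(theta):
    """Proportion of simulated samples falling in the critical region when the
    true mean is theta: the size when theta = 0, the power when theta = 1."""
    xbar = rng.normal(theta, 1, size=(reps, n)).mean(axis=1)
    return (xbar > crit).mean()

print("estimated size :", rejection_rate(0.0))   # should be close to alpha
print("estimated power:", rejection_rate(1.0))
# Exact power for comparison: under theta = 1, Xbar ~ N(1, 1/n).
print("exact power    :", 1 - norm.cdf((crit - 1.0) * np.sqrt(n)))
```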
Exercise 5.2 [Proof of Theorem 5.1] Prove the Neyman-Pearson lemma. Solution: Fix the size of the test to be α. Let A be a positive constant and C0 a subset of the sample space satisfying 1. Pr(X ∈ C0 | θ = θ0 ) = α, 2. X ∈ C0 ⇐⇒ L(θ0 ; x) f (x|θ0 ) = ≤ A. L(θ1 ; x) f (x|θ1 ) Suppose that there exists another test of size α, defined by the critical region C1 , i.e. 61 C0 C1 B2 B1 B3 ΩX Figure 5.1: Proof of Neyman-Pearson lemma Reject H0 iff x ∈ C1 , where Pr(x ∈ C1 |θ = θ0 ) = α. Let B1 = C0 ∩ C1 , B2 = C0 ∩ C1c , B3 = C0c ∩ C1 . Note that B1 ∪ B2 = C0 , B1 ∪ B3 = C1 , and B1 , B2 & B3 are disjoint. Let the power of the likelihood ratio test be I0 = Pr(X ∈ C0 | θ = θ1 ), and the power of the other test be I1 = Pr(X ∈ C1 | θ = θ1 ). We want to show that I0 − I1 ≥ 0. But I0 − I1 = R f (x|θ1 )dx − R f (x|θ1 )dx R = B1 ∪B2 f (x|θ1 )dx − B1 ∪B3 f (x|θ1 )dx R R = B2 f (x|θ1 )dx − B3 f (x|θ1 )dx. C0 C1 R Also B2 ⊆ C0 , so f (x|θ1 ) ≥ A−1 f (x|θ0 ) for x ∈ B2 , similarly B3 ⊆ C0c , so f (x|θ1 ) ≤ A−1 f (x|θ0 ) for x ∈ B3 , Therefore I0 − I1 i f (x|θ )dx 0 B3 i hR R −1 = A f (x|θ )dx − f (x|θ )dx 0 0 C0 C1 ≥ A−1 hR f (x|θ0 )dx − B2 = A−1 [α − α] = R 0 as required. k 5.3 Simple Null, Composite Alternative Suppose that we wish to test the simple null hypothesis H0 : θ = θ0 against the composite alternative hypothesis H1 : θ ∈ Ω1 . The easiest way to investigate this is to imagine the collection of simple hypothesis tests with null hypothesis H0 : θ = θ0 and alternative H1 : θ = θ1 , where θ1 ∈ Ω1 . Then, for any given θ1 , an SLRT is the most powerful test for a given size α. The only problem would be if different values of θ1 result in different SLRTs. 62 Definition 5.13 (UMP Tests) A hypothesis test is called a uniformly most powerful test of H0 : θ = θ0 against H1 : θ = θ1 , θ1 ∈ Ω1 , if 1. There exists a critical region Cα corresponding to a test of size α not depending on θ1 , 2. For all values of θ1 ∈ Ω1 , the critical region Cα defines a most powerful test of H0 : θ = θ0 against H1 : θ = θ1 . Exercise 5.3 2 Suppose that X1 , X2 , . . . , Xn IID ∼ N (0, σ ). 1. Find the UMP test of H0 : σ 2 = 1 against H1 : σ 2 > 1. 2. Find the UMP test of H0 : σ 2 = 1 against H1 : σ 2 < 1. 3. Show that no UMP test of H0 : σ 2 = 1 against H1 : σ 2 6= 1 exists. k Comments 1. If a UMP test exists, then it is clearly the appropriate test to use. 2. Often UMP tests don’t exist! 3. A UMP test involves the data only via a likelihood ratio, so is a function of the sufficient statistics. 4. The critical region Cα therefore often has a simple form, and is usually easily found once the distribution of the sufficient statistics have been determined (hence the importance of the χ2 , t and F distributions). 5. The above three examples illustrate how important is the form of alternative hypothesis being considered. The first two are one-sided alternatives whereas H1 : σ 2 6= 1 is a two-sided alternative hypothesis, since σ 2 could lie on either side of 1. 5.4 Composite Hypothesis Tests The most general situation we’ll consider is where the parameter space Ω is divided into two subsets: Ω = Ω0 ∪ Ω1 , where Ω0 ∩ Ω1 = ∅, and the hypotheses are H0 : θ ∈ Ω0 , H1 : θ ∈ Ω1 . For example, one may want to test the null hypothesis that the data come from an exponential distribution against the alternative that the data come from a more general gamma distribution. Note that here, as in many other cases, dim(Ω0 ) < dim(Ω1 ) = dim(Ω). 
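As a concrete, purely illustrative version of the exponential-versus-gamma example, the Python sketch below (my own; it assumes numpy and scipy are available, and anticipates the likelihood ratio test of Definition 5.14 and the asymptotic result of Theorem 5.2 below) maximises the likelihood over the full gamma family and over the exponential sub-family, then refers twice the log likelihood ratio to a chi-squared distribution on dim(Ω) − dim(Ω0) = 1 degree of freedom.

```python
# Sketch: generalized LRT of exponential (null) vs gamma (full) model.
import numpy as np
from scipy.stats import gamma, expon, chi2

rng = np.random.default_rng(3)
x = rng.gamma(shape=1.0, scale=2.0, size=50)   # data generated under the null
                                               # (gamma with shape 1 = exponential)

# MLE over the full gamma family Omega (location fixed at 0): shape and scale.
a_hat, _, scale_hat = gamma.fit(x, floc=0)
loglik_full = gamma.logpdf(x, a_hat, loc=0, scale=scale_hat).sum()

# MLE under H0 (exponential): the scale MLE is the sample mean.
loglik_null = expon.logpdf(x, scale=x.mean()).sum()

r = loglik_full - loglik_null                  # log LRT statistic r(x) >= 0
                                               # (the models are nested)
print("2 r(x)        :", 2 * r)
print("approx p-value:", chi2.sf(2 * r, df=1)) # typically not small here,
                                               # since the null model is true
```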
One possible approach to this situation is to regard the maximum possible likelihood over θ ∈ Ωi as a measure of compatibility between the data and the hypothesis Hi (i = 0, 1). It’s therefore convenient to define the following: b θ b θ0 b1 θ is the MLE of θ over the whole parameter space Ω, is the MLE of θ over Ω0 , i.e. under the null hypothesis H0 , and is the MLE of θ over Ω1 , i.e. under the alternative hypothesis H1 . b must therefore be the same as either θ b0 or θ b1 , since Ω = Ω0 ∪ Ω1 . Note that θ b1 ; x)/L(θ b0 ; x), by direct analogy with the SLRT. One might consider using the likelihood ratio criterion L(θ b b0 ; x): However, it’s generally easier to use the equivalent ratio L(θ; x)/L(θ 63 Definition 5.14 (Likelihood Ratio Test (LRT)) b ∈ Ω0 in favour of the alternative H1 : θ b ∈ Ω1 = Ω \ Ω0 iff A likelihood ratio test rejects H0 : θ λ(x) = b x) L(θ; ≥ λ, b L(θ 0 ; x) (5.1) b is the MLE of θ over the whole parameter space Ω, θ b0 is the MLE of θ over Ω0 , and the where θ value λ is fixed so that sup Pr(λ(X) ≥ λ|θ) = α θ∈Ω0 where α, the size of the test, is some chosen value. Equivalently, the test criterion uses the log LRT statistic: b x) − `(θ b0 ; x) ≥ λ0 , r(x) = `(θ; (5.2) where `(θ; x) = log L(θ; x), and λ0 is chosen to give chosen size α = supθ∈Ω0 Pr(r(X) ≥ λ0 |θ). Comments 1. The size α is typically chosen by convention to be 0.05 or 0.01. 2. Note that high values of the test statistic λ(x), or equivalently of r(x), are taken as evidence against the null hypothesis H0 . 3. The test given in Definition 5.14 is sometimes referred to as a generalized likelihood ratio test, and Equation 5.1 a generalized likelihood ratio test statistic. 4. Equation 5.2 is often easier to work with than Equation 5.1—see the exercises and problems. Exercise 5.4 P P [Paired t-test] Suppose that X1 , X2 , . . . , Xn IID N (µ, σ 2 ), and let X = Xi /n, S 2 = (Xi − X)2 /(n − 1). √ ∼ What is the distribution of T = X/(S/ n)? Is the test based on rejecting H0 : µ = 0 for large T a likelihood ratio test? Assuming that the observed differences in diastolic blood pressure (after–before) are IID and Normally distributed with mean δD , use the captopril data (4.1) to test the null hypothesis H0 : δD = 0 against the alternative hypothesis H1 : δD 6= 0. Comment: this procedure is called the paired t test k Exercise 5.5 IID 2 2 [Two sample t-test] Suppose X1 , X2 , . . . , Xm IID ∼ N (µX , σ ) and Y1 , Y2 , . . . , Yn ∼ N (µY , σ ). 1. Derive the LRT for testing H0 : µX = µY versus H1 : µX 6= µY . 2. Show that the LRT can be based on the test statistic T = where Sp2 = Pm i=1 (Xi X −Y q 1 Sp m + . (5.3) Pn − X)2 + i=1 (Yi − Y )2 . m+n−2 (5.4) 3. Show that, under H0 , T ∼ tm+n−2 . 64 1 n 4. Two groups of female rats were placed on diets with high and low protein content, and the gain in weight (grammes) between the 28th and 84th days of age was measured for each rat, with the following results: High protein diet 134 146 104 119 124 161 107 83 113 129 97 123 Low protein diet 70 118 101 85 107 132 94 Using the test statistic T above, test the null hypothesis that the mean weight gain is the same under both diets. Comment: this is called the two sample t-test, and Sp2 is the pooled estimate of variance. k Exercise 5.6 IID 2 2 [F -test] Suppose X1 , X2 , . . . , Xm IID ∼ N (µX , σX ) and Y1 , Y2 , . . . , Yn ∼ N (µY , σY ), where µX , µY , σX and σY are all unknown. 2 2 Suppose we wish to test the hypothesis H0 : σX = σY2 against the alternative H1 : σX 6= σY2 . Pn Pm 2 1. 
Let SX = i=1 (Xi − X)2 and SY2 = i=1 (Yi − Y )2 . 2 2 What are the distributions of SX /σX and SY2 /σY2 ? 2. Under H0 , what is the distribution of the statistic V = 2 SX /(m − 1) ? SY2 /(n − 1) 3. Taking values of V much or smaller than P P larger P P 1 2as evidence against H0 , and given data with m = 16, n = 16, xi = 84, yi = 18, x2i = 563, yi = 72, test the null hypothesis H0 . 2 Comment: with the alternative hypothesis H1 : σX > σY2 , the above procedure is called an F test. k Even in simple cases like this, the null distribution of the log likelihood ratio test statistic r(x) (5.2) can be difficult or impossible to find analytically. Fortunately, there is a very powerful and very general theorem that gives the approximate distribution of r(x): Theorem 5.2 (Wald’s Theorem) Let X1 , X2 , . . . , Xn IID ∼ f (x|θ) where θ ∈ Ω, and let r(x) denote the log likelihood ratio test statistic b x) − `(θ b0 ; x), r(x) = `(θ; b is the MLE of θ over Ω and θ b0 is the MLE of θ over Ω0 ⊂ Ω. where θ Then under reasonable conditions on the PDF (or PMF) f (·|·), the distribution of 2r(x) converges to a χ2 distribution on dim(Ω) − dim(Ω0 ) degrees of freedom as n → ∞. Comments 1. A proof is beyond the scope of this course, but may be found in e.g. Kendall & Stuart, ‘The Advanced Theory of Statistics’, Vol. II. 2. Wald’s theorem implies that, provided the sample size is large, you only need tables of the χ2 distribution to find the critical regions for a wide range of hypothesis tests. 65 Another important theorem, see Problem 3.7.9, page 39, is the following: 2 Theorem 5.3 (Sample Mean and Variance of Xi IID ∼ N (µ, σ )) IID 2 Let X1 , X2 , . . . , Xn ∼ N(µ, σ ). Then P P 1. X = Xi /n and Y = (Xi − X)2 are independent RVs, 2. X has a N(µ, σ 2 /n) distribution, 3. Y /σ 2 has a χ2n−1 distribution. Exercise 5.7 Suppose X1 , X2 , . . . , Xn IID ∼ N (θ, 1), with hypotheses H0 : θ = 0 and H1 : θ arbitrary. Show that 2r(x) = nx2 , and hence that Wald’s theorem holds exactly in this case. k Exercise 5.8 Suppose now that Xi ∼ N (θi , 1), i = 1, . . . , n are independent, with null hypothesis H0 : θi = θ ∀i and alternative hypothesis H1 : θi arbitrary. Pn Show that 2r(x) = i=1 (xi − x)2 . and hence (quoting any other theorems you need) that Wald’s theorem again holds exactly. k 5.5 Problems 1. Suppose that X ∼ Bin(n, p). Under the null hypothesis H0 : p = p0 , what are EX and VarX? Show that if n is large and p0 is not too close to 0 or 1, then X/n − p0 p p0 (1 − p0 )/n ∼ N (0, 1) approximately. Out of 1000 tosses of a given coin, 560 were heads and 440 were tails. Is it reasonable to assume that the coin is fair? Justify your answer. 2. Out of 370 new-born babies at a Hospital, 197 were male and 173 female. Test the null hypothesis H0 : p < 1/2 versus H1 : p ≥ 1/2, where p denotes the probability that a baby born at the Hospital will be male. Discuss any assumptions you make. 3. X is a single observation whose density is given by (1 + θ)xθ f (x) = 0 if 0 < x < 1, otherwise. Find the most powerful size α test of H0 : θ = 0 against H1 : θ = 1. Is there a U.M.P. test of H0 : θ ≤ 0 against H1 : θ > 0? If so, what is it? 2 2 2 4. Suppose X1 , X2 , . . . , Xn IID ∼ N (µ, σ ) with null hypothesis H0 : σ = 1 and alternative H1 : σ is arbitrary. Show v −1−log vb), Pn that the LRT will reject H0 for large values of the test statistic r(x) = n(b where vb = i=1 (xi − x)2 /n. 66 5. Let X1 , . . . , Xn be independent each with density λx−2 e−λ/x f (x) = 0 if x > 0, otherwise, where λ is an unknown parameter. 
(a) Show that the UMP test of H0 : λ = 12 against H1 : λ > 12 is of the form: Pn ‘reject H0 if i=1 Xi−1 ≤ A∗ ’, where A∗ is chosen to fix the size of the test. Pn (b) Find the distribution of i=1 Xi−1 under the null & alternative hypotheses. (c) You observe values 0.59, 0.36, 0.71, 0.86, 0.13, 0.01, 3.17, 1.18, 3.28, 0.49 for X1 , . . . , X10 . Test H0 against H1 , & comment on the test in the light of any assumptions made. 6. (a) Define the size and power of a hypothesis test of a simple null hypothesis H0 : θ = θ0 against a simple alternative hypothesis H1 : θ = θ1 . (b) State and prove the Neyman-Pearson Lemma for continuous random variables X1 , . . . , Xn when testing the null hypothesis H0 : θ = θ0 against the alternative H1 : θ = θ1 . (c) Assume that a particular bus service runs at regular intervals of θ minutes, but that you do not know θ. Assume also that the times you find you have to wait for a bus on n occasions, X1 , . . . , Xn , are independent and identically distributed with density −1 θ if 0 ≤ x ≤ θ, f (x|θ) = 0 otherwise. i. Discuss briefly when the above assumptions would be reasonable in practice. ii. Find the likelihood L(θ; x) for θ given the data (X1 , . . . , Xn ) = x = (x1 , . . . , xn ). iii. Find the most powerful test of size α of the hypothesis H0 : θ = θ0 = 20 against the alternative H1 : θ = θ1 > 20. From Warwick ST217 exam 1997 7. The following problem is quoted verbatim from Osborn (1979), ‘Statistical Exercises in Medical Research’ : A study of immunoglobulin levels in mycetoma patients in the Sudan involved 22 patients to be compared to 22 normal individuals. The levels of IgG recorded for the 22 mycetoma patients are shown below. The mean level for the normal individuals was calculated to be 1,477 mg/100ml before the data for this group was lost overboard from a punt on the river Nile. Use the data below to estimate the within group variance and hence perform a ‘t’ test to investigate the significance of the difference between the mean levels of IgG in mycetoma patients and normals. IgG levels (mg/100ml) in 22 mycetoma patients 1,047 1,377 1,210 1,103 1,270 1,135 1,375 1,067 907 1,230 1,350 804 1,032 960 1,122 1,062 1,002 960 1,345 1,204 1,053 936 Osborn (1979) 4.6.16 8. Let X1 , X2 , . . . , Xn ∼ Exp(θ), i.e. f (x|θ) = θe−θx for θ ∈ (0, ∞). IID Show that a likelihood ratio test for H0 : θ ≤ θ0 versus H1 : θ > θ0 has the form: Z ‘Reject H0 iff θ0 x < k, where k is given by α = 0 nk 1 n−1 −z z e dz’. Γ(n) Show that a test of this form is UMP for testing H0 : θ = θ0 versus H1 : θ > θ0 . 67 9. (a) Define the size and power function of a hypothesis test procedure. (b) State and prove the Neyman-Pearson lemma in the case of a test statistic that has a continuous distribution. 2 2 (c) Let X1 , X2 , . . . , Xn IID ∼ N (µ, σ ), where σ is known. Find the likelihood ratio fX (x|µ1 )/fX (x|µ0 ) and hence show that the most powerful test of size α for testing the null hypothesis H0 : µ = µ0 against the alternative H1 : µ = µ1 , for some µ1 < µ0 , has the form: √ ‘Reject H0 if X < µ0 + σ Φ−1 (α)/ n ’, Pn where X = i=1 Xi /n is the sample mean, and Φ−1 (α) is the 100 α% point of the standard Normal N (0, 1) distribution. (d) Define a uniformly most powerful (UMP) test, and show that the above test is UMP for testing H0 : µ = µ0 against H1 : µ < µ0 . (e) What is the UMP test of H0 : µ = µ0 against H1 : µ > µ0 ? (f) Deduce that no UMP test of size α exists for testing H0 : µ = µ0 against H1 : µ 6= µ0 . 
(g) What test would you choose to test H0 : µ = µ0 against H1 : µ 6= µ0 , and why? From Warwick ST217 exam 1999 10. A group of clinicians wish to study survival after heart attack, by classifying new heart attack patients according to (a) whether they survive at least 7 days after admission, and (b) whether they currently smoke 10 or more cigarettes per day. From previous experience, the clinicians predict that after N days the observed counts Smoker Non-smoker Survive Die R1 R3 R2 R4 will follow independent Poisson distributions with means Smoker Non-smoker Survive Die N r1 N r3 N r2 N r4 The clinicians intend to estimate the population log-odds ratio ` = log(r1 r4 /r2 r3 ) by the sample value L = log(R1 R4 /R2 R3 ), and they wish to choose N to give a probability 1 − β of being able to reject the hypothesis H0 : ` = 0 at the 100α% significance level, when the true value of ` is `0 > 0. 2 Using the formula Var f (X) ≈ f 0 (EX) Var(X), show that L has approximate variance 1 1 1 1 + + + , N r1 N r2 N r3 N r4 and hence, assuming a Normal approximation to the distribution of L, that the required number of days is roughly 2 1 1 1 1 1 −1 N= 2 + + + Φ (α/2) + Φ−1 (β) , `0 r1 r2 r3 r4 where Φ is the standard Normal cumulative distribution function. Comment critically on the clinicians’ method for choosing N . From Warwick ST332 exam 1988 68 11. (a) Define the size and power of a hypothesis test, and explain what is meant by a simple likelihood ratio test and by a uniformly most powerful test. (b) Let X1 , X2 , . . . , Xn be independent random variables, each having a Poisson distribution with mean λ. Find the likelihood ratio test for testing H0 : λ = λ0 against H1 : λ = λ1 , where λ1 > λ 0 . Show also that this test is uniformly most powerful. (c) Twenty-five leaves were selected at random from each of six similar apple trees. The number of adult female European red mites on each was counted, with the following results: No. of mites Frequency 0 70 1 38 2 17 3 10 4 9 5 3 6 2 7 1 Assuming that the number of mites per leaf follow IID Poisson distributions, and using a Normal approximation to the Poisson distribution, carry out a test of size 0.05 of the null hypothesis H0 that the mean number of mites per leaf is 1.0, against the alternative H1 that it is greater than 1.0. Discuss briefly whether the assumptions you have made in testing H0 appear reasonable here. From Warwick ST217 exam 2000 12. Hypothesis test procedures can be inverted to produce confidence intervals or more generally confidence regions. Thus, given a size α test of the null hypothesis H0 : θ = θ0 , the set of all values θ0 that would NOT be rejected forms a ‘100(1 − α)% confidence interval for θ’. An amateur statistician argues as follows: Suppose something starts at time t0 and ends at time t1 . Then at time t ∈ (t0 , t1 ), the ratio r of its remaining lifetime (t1 − t) to its current age (t − t0 ), i.e. r(t) = t1 − t , t − t0 is clearly a monotonic decreasing function of t. Also it is easy to check that r = 39 after (1/40)th of the total lifetime, and that r = 1/39 after (39/40)th of the total lifetime. Therefore, for 95% of something’s existence, its remaining lifetime lies in the interval (t − t0 )/39, 39(t − t0 ) , where t is the time under consideration, and t0 is the time the thing came into existence. The statistician is also an amateur theologian, and firmly believes that the World came into existence 6006 year ago. 
Using his pet procedure outlined above, he says he is ‘95% confident that the World will end sometime between 154 years hence, and 234234 years hence’. His friend, also an amateur statistician, says she has an even more general procedure to produce confidence intervals: In any situation I simply roll an icosahedral (fair 20-sided) die. If the die shows ‘13’ then I quote the empty set ∅ as a 95% confidence interval, otherwise I quote the whole real line R. She rolls the die, which comes up 13. She therefore says she is ‘95% confident that the World ended before it even began (although presumably no-one has noticed yet).’ Discuss. 69 The Multinomial Distribution and χ2 Tests 5.6 5.6.1 Multinomial Data Definition 5.15 (Multinomial Distribution) The multinomial distribution Mn(n, θ) is a probability distribution on points y = (y1 , y2 , . . . , yk ), Pk where yi ∈ {0, 1, 2, . . .}, i = 1, 2, . . . , k, and i=1 yi = n, with PMF f (y1 , y2 , . . . , yk ) = where θi > 0 for i = 1, . . . , k, and Pk i=1 θi k Y n! θ yi y1 !y2 ! · · · yk ! i=1 i (5.5) = 1. Comments 1. The multinomial distribution arises when one has n independent observations, each classified in one of k ways (e.g. ‘eye colour’ classified as ‘Brown’, ‘Blue’ or ‘Other’; here k = 3). Let θi denote the probability that any given observation lies in category number i, and let Yi denote the number of observations falling in category i. Then the random vector Y = (Y1 , Y2 , . . . , Yk ) has a Mn(n, θ) distribution. 2. A binomial distribution is the special case k = 2, and is usually parametrised by p = θ1 (so θ2 = 1−p). Exercise 5.9 By partial differentiation of the likelihood function, show that the MLEs θbi of the parameters θi of the Mn(n, θ) satisfy the equations yi yk − = 0, (i = 1, . . . , k − 1) P k−1 b θbi 1− θj j=1 and hence that θbi = yi /n for i = 1, . . . , k. k 5.6.2 Chi-Squared Tests Suppose one wishes to test the null hypothesis H0 that, in the multinomial distribution 5.5, θ is some function θ(φ) of another parameter φ. The alternative hypothesis H1 is that θ is arbitrary. Exercise 5.10 Suppose H0 is that X1 , X2 , . . . Xn IID ∼ Bin(3, φ). Let Yi (for i = 1, 2, 3, 4) denote the number of observations Xj taking value i − 1. What is the null distribution of Y = (Y1 , Y2 , Y3 , Y4 )? k The log likelihood ratio test statistic r(X) is given by r(X) = k X Yi log θbi − i=1 k X b Yi log θi (φ) (5.6) i=1 where θbi = yi /n for i = 1, . . . , k. By Walds theorem, under H0 , 2r(X) has approximately a χ2 distribution: 2 k X b Yi [log θbi − log θi (φ)] i=1 where 70 ∼ χ2k1 −k0 (5.7) θbi = Yi /n, k0 is the dimension of the parameter φ, and k1 = k − 1 is the dimension of θ under the constraint Pk i=1 θi = 1. Comments b = X = P4 Yi . 1. in Example 5.10, k = 4, k0 = 1, k1 = 3 and φ i=1 We would reject H0 , that the sample comes from a Bin(3, φ) distribution for some φ, if 2r(x) is greater than the 95% point of the χ22 distribution, where r(x) is given in Formula 5.6. 2. It is straightforward to check, using a Taylor series expansion of the log function, that provided EYi is large ∀ i, k k X X (Yi − µi )2 b l , (5.8) 2 Yi [log θbi − log θi (φ)] µi i=1 i=1 b is the expected number of individuals (under H0 ) in the ith category. where µi = nθi (φ) Definition 5.16 (Chi-squared Goodness of Fit Statistic) X2 = k X (oi − ei )2 i=1 ei , (5.9) where oi is the observed count in the ith category and ei is the corresponding expected count under the null hypothesis, is called the χ2 goodness-of-fit statistic. Comments 1. 
Under H0 , X 2 has approximately a χ2 distribution with number of degrees of freedom being (number of categories) - 1 - (number of parameters estimated under H0 ). This approximation works well provided all the expected counts are reasonably large (say all are at least 5). 2. This χ2 test was suggested by Karl Pearson before the theory of hypothesis testing was fully developed. 71 5.7 Problems 1. In a genetic experiment, peas were classified according to their shape (‘round’ or ‘angular’) and colour (‘yellow’ or ‘green’). Out of 556 peas, 315 were round+yellow, 108 were round+green, 101 were angular+yellow and 32 were angular+green. Test the null hypothesis that the probabilities of these four types are 9/16, 3/16, 3/16 and 1/16 respectively. 2. A sample of 300 people was selected from a population, and classified into blood type (O/A/B/AB, and Rhesus positive/negative), as shown in the following table: O 82 13 Rh positive Rh negative A 89 27 B 54 7 AB 19 9 The null hypothesis H0 is that being Rhesus negative is independent of whether an individual’s blood group is O, A, B or AB. Estimate the probabilities under H0 of falling into each of the 8 categories, and hence test the hypothesis H0 . P 3. The random variables X1 , X2 , . . . , Xn are IID with Pr(Xi = j) = pj for j = 1, 2, 3, 4, where pj = 1 and pj > 0 for each j = 1, 2, 3, 4. Interest centres on the hypothesis H0 that p1 = p2 and simultaneously p3 = p4 . (a) Define the following terms i. a hypothesis test, ii. simple and composite hypotheses, and iii. a likelihood ratio test. (b) Letting θ = (p1 , p2 , p3 , p4 ), X = (X1 , . . . , Xn )T with observed values x = (x1 , . . . , xn )T , and letting yj denote the number of x1 , x2 , . . . , xn equal to j, what is the likelihood L(θ|x)? (c) Assume the usual regularity conditions, i.e. that the distribution of −2 log L(θ|x) tends to χ2ν as the sample size n → ∞. What are the dimension of the parameter space Ωθ and the number of degrees of freedom ν of the asymptotic chi-squared distribution? (d) By partial differentiation of the log-likelihood, or otherwise, show that the maximum likelihood estimator of pj is yj /n. (e) Hence show that the asymptotic test statistic of H0 : p1 = p2 and p3 = p4 is −2 log L(x) = 2 4 X yj log(yj /mj ), j=1 where m1 = m2 = (y1 + y2 )/2 and m3 = m4 = (y3 + y4 )/2. (f) In a hospital casualty unit, the numbers of limb fractures seen over a certain period of time are: Arm Leg Left Side Right 46 22 49 32 Using the test developed above, test the hypothesis that limb fractures are equally likely to occur on the right side as on the left side. Discuss briefly whether the assumptions underlying the test appear reasonable here. From Warwick ST217 exam 1998 72 Prudens quaestio dimidium scientiae. Half of science is asking the right questions. Roger Bacon We all learn by experience, and your lesson this time is that you should never lose sight of the alternative. Sir Arthur Conan Doyle One forms provisional theories and then waits for time or fuller knowledge to explode them. Sir Arthur Conan Doyle What used to be called prejudice is now called a null hypothesis. A. W. F. Edwards The conventional view serves to protect us from the painful job of thinking. John Kenneth Galbraith Science must begin with myths, and with the criticism of myths. Sir Karl Raimund Popper 73 This page intentionally left blank (except for this sentence). 
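A computational aside on the χ2 goodness-of-fit statistic (Definition 5.16): the short Python sketch below applies Formula 5.9 to the pea data of Problem 1 in Section 5.7 (observed counts 315, 108, 101, 32 against hypothesised probabilities 9/16, 3/16, 3/16, 1/16). This is only an illustration of the arithmetic, not part of the course material; in particular, the use of scipy for the χ2 tail probability is an assumption, and the same figure can equally be read from χ2 tables.

from scipy.stats import chi2

observed = [315, 108, 101, 32]        # round+yellow, round+green, angular+yellow, angular+green
probs = [9/16, 3/16, 3/16, 1/16]      # category probabilities under H0
n = sum(observed)
expected = [n * p for p in probs]     # expected counts e_i = n * theta_i under H0

# X^2 = sum_i (o_i - e_i)^2 / e_i     (Formula 5.9)
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1                # (number of categories) - 1 - (parameters estimated) = 4 - 1 - 0
tail_prob = chi2.sf(x2, df)           # upper tail probability of the chi-squared(3) distribution
print(f"X^2 = {x2:.3f} on {df} df, tail probability = {tail_prob:.3f}")

Large values of X2 would be evidence against the hypothesised 9:3:3:1 ratio. Note that all four expected counts here comfortably exceed 5, so the chi-squared approximation referred to in the Comments above should be adequate.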
74 Chapter 6 Linear Statistical Models 6.1 Introduction Definition 6.1 (Response Variable) a response variable is a random variable Y whose value we wish to predict. Definition 6.2 (Explanatory Variable) An explanatory variable is a random variable X whose values can be used to predict Y . Definition 6.3 (Linear Model) A linear model is a prediction function for Y in terms of the values x1 , x2 , . . . , xk of X1 , X2 , . . . , Xk of the form E[Y |x1 , x2 , . . . , xk ] = β0 + β1 x1 + β2 x2 + · · · + βk xk (6.1) Thus if Y1 , Y2 , . . . , Yn are the responses for cases 1, 2, . . . , n, and xij is the value of Xj (j = 1, . . . , k) for case i, then E[Y|X] = Xβ (6.2) where Y1 Y2 .. . Y= Yn is the vector of responses, X = (xij ) where xi0 = 1 for i = 1, . . . n, is the matrix of explanatory variables, and β0 β1 .. . β= βk is the (unknown) parameter vector. 75 Examples Consider the captopril data (page 44), and let X1 X3 Z1 = = = Diastolic BP before treatment, Diastolic BP after treatment, 2X1 + X2 , X2 X4 Z2 = = = Systolic BP before treatment, Systolic BP after treatment, 2X3 + X4 . Some possible linear models of interest are: 1. Response Y = X4 , (a) explanatory variable X2 (this is a ‘simple linear regression model ’, with just 1 explanatory variable), (b) explanatory variable X3 (c) explanatory variables X1 and X2 (a ‘multiple regression model ’). 2. Response Y = Z2 , (a) explanatory variable Z1 (b) explanatory variables Z1 and Z12 (a ‘quadratic regression model ’). Note how new explanatory variables may be obtained by transforming and/or combining old ones. 3. Looking just at the interrelationship between SBP and DBP at a given time: (a) response Y = X2 , explanatory variable X1 , (b) response Y = X1 , explanatory variable X2 , (c) response Y = X4 , explanatory variable X3 , etc. Comments 1. A linear relationship is the simplest possible relationship between response variables and explanatory variables, so linear models are easy to understand, interpret and also to check for plausibility. 2. One can (in theory) approximate an arbitrarily complicated relationship by a linear model, for example quadratic regression can obviously be extended to ‘polynomial regression’ E[Y |x] = β0 + β1 x + β2 x2 + · · · + βm xm . 3. Linear models have nice links with • geometry, • linear algebra, • conditional expectations and variances, • the Normal distribution. 4. Distributional assumptions (if any!) will typically be made ONLY about the response variable Y , NOT about the explanatory variables. Therefore the model makes sense even if the Xi s are chosen nonrandomly (‘designed experiments’). 5. The response variable Y is sometimes called the ‘dependent variable’, and the explanatory variables are sometimes called ‘predictor variables’, ‘regressor variables’, or (very misleadingly) ‘independent variables’. 76 6.2 Simple Linear Regression Definition 6.4 A simple linear regression model is a linear model with one response variable Y and one explanatory variable X, i.e. a model of the form E[Y |x1 ] = β0 + β1 x1 . (6.3) Typically in practice we have n data points (xi , yi ) for i = 1, . . . , n, and we want to predict a future response Y from the corresponding observed value x of X. Often there’s a natural candidate for which variable should be treated as the response: 1. X may precede Y in time, for example (a) X is BP before treatment and Y is BP after treatment, or (b) X is number of hours revision and Y is exam mark; 2. 
X may be in some way more fundamental, for example (a) X is age and Y is height or (b) X is height and Y is weight; 3. X may be easier or cheaper to observe, so we hope in future to estimate Y without measuring it. In simple linear regression we don’t know β0 or β1 , but need to estimate them in order to predict Y by Yb = βb0 + βb1 x. To make accurate predictions we require the prediction error Y − Yb = Y − βb0 + βb1 x to be small. This suggests that, given data (xi , yi ) for i = 1, . . . , n, we should fit βb0 and βb1 by simultaneously making all the vertical deviations of the observed data points from the fitted line y = βb0 + βb1 x small. P The easiest way to do this is to minimise the sum of squared deviations (yi − ybi )2 , i.e. to use the ‘least squares’ criterion. 6.3 Method of Least Squares For simple linear regression, ybi = β0 + β1 xi (i = 1, . . . , n) (6.4) Therefore to estimate β0 and β1 by least squares, we need to minimise Q= n X [yi − (β0 + β1 xi )]2 . (6.5) i=1 Exercise 6.1 Show that Q in equation 6.5 is minimised at values β0 and β1 satisfying the simultaneous equations βP 0n β0 xi P + β1 P xi + β1 x2i 77 P = P yi , = xi yi , (6.6) and hence that P xi yi − n x y P 2 , xi − nx2 βb1 = βb0 = y − βb1 x. (6.7) (6.8) k Comments b 1. Forming ∂ 2 Q/∂β02 , ∂ 2 Q/∂β12 and ∂ 2 Q/∂β0 β1 verifies that Q is minimised at β = β. 2. Equations 6.6 are called the ‘normal equations’ for β0 and β1 (‘normal’ as in ‘perpendicular’ rather than as in ‘standard’ or as in ‘Normal distribution’). 3. y = βb0 + βb1 x is called the ‘least squares fit’ to the data. 4. From equations 6.7 and 6.8, the least squares fitted line passes through (x, y), the centroid of the data points. b rather than on memorising 5. Concentrate on understanding and remembering the method for finding β, the formulae 6.7 and 6.8 for βb1 and βb1 . 6. Geometrical interpretation We have a vector y = (y1 , y2 , . . . , yn )T of observed responses, i.e. a point in n-dimensional space, together with a surface S representing possible joint predicted values under the model (for simple linear regression, it’s the 2-dimensional surface β0 + β1 x for real values of β0 and β1 ). P Minimising (yi − ybi )2 is equivalent to dropping a perpendicular from the point y to the surface S; b . Thus we are literally finding the model closest to the data. the perpendicular hits the surface at y 6.4 Problems P 1. P Show that the expression xi yi −Pn x y occurring in the formula for βb1 could also be written as P (xi − x)(yi − y), (xi − x)yi , or xi (yi − y). Pn 2. Show that the ‘residual sum of squares’, i=1 (yi − ybi )2 , satisfies the following identity: n X (yi − ybi )2 = i=1 n X (yi − βb0 − βb1 xi )2 = i=1 n X (yi − y)2 − βb1 i=1 n X (xi − x)(yi − y). i=1 3. For the captopril data, find the least squares lines (a) to predict SBP before captopril from DBP before captopril, (b) to predict SBP after captopril from DBP after captopril, (c) to predict DBP before captopril from SBP before captopril. Compare these three lines. Discuss whether it is sensible to combine the before and after measurements in order to obtain a better prediction of SBP at a given time from DBP measured at that time. 4. Illustrate the geometrical interpretation of least squares (see above comments) in the following two cases (a) model E[Y |x] = β0 + β1 x with 3 data points (x1 , y1 ), (x2 , y2 ) and (x3 , y3 ), (b) model E[Y |x] = βx with 2 data points (x1 , y1 ) and (x2 , y2 ). What does Pythagoras’ theorem tell us in the second case? 
78 6.5 The Normal Linear Model (NLM) 6.5.1 Introduction Definition 6.5 (NLM) Given n response RVs Yi (i = 1, 2, . . . , n), with corresponding values of explanatory variables xTi , the NLM makes the following assumptions: 1. (Conditional) Independence The Yi are mutually independent given the xTi . 2. Linearity The expected value of the response variable is linearly related to the unknown parameters β: EYi = xTi β. 3. Normality The random variation Yi |xi is Normally distributed. 4. Homoscedasticity (Equal Variances) i.e. Yi |xi ∼ N(xTi β, σ 2 ). 6.5.2 Matrix Formulation of NLM The NLM for responses y = (y1 , y2 , . . . , yn )T can be recast as follows 1. E[Y] = Xβ for some parameter vector β = (β1 , β2 , . . . , βp )T , 2. = Y − E[Y] ∼ MVN(0, σ 2 I), where I is the (n × n) identity matrix. It can be shown that the least squares estimates of β are given by solving the simultaneous linear equations XT y = XT Xβ (6.9) (the normal equations), with solution (assuming that XT X is nonsingular) b = (XT X)−1 XT y, β (6.10) Comments 1. Note that, by formula 6.10, each estimator βbj is a linear combination of the Yi s. b has a MVN distribution. Therefore under the NLM, β 2. Even if the Normality assumption doesn’t hold, the CLT implies that, provided the number n of b will still be approximately MVN. cases is large, the distribution of the estimator β 3. The most important assumption is independence, since it’s relatively easy to modify the standard NLM to account for • nonlinearity: transform the data, or include e.g. x2ij as an explanatory variable, • unequal variances (‘heteroscedasticity’): e.g. transform from yi − ybi to zi = (yi − ybi )/b σi . • non-Normality: transform, or simply get more data! 79 4. In the general formulation the constant term β0 is omitted, though in practice the first column of the matrix X will often contain 1’s and the corresponding parameter β1 will be the ‘constant term’. b and the vector of residuals is r = y − y b = XT β, b, 5. The corresponding fitted values are y Pp Tb i.e. ri = yi − ybi , where ybi = xi β = j=1 xij βbj . Definition 6.6 (RSS) The residual sum of squares (RSS) in the fitted NLM is s2 = Pn = b T (y − Xβ) b (y − Xβ) i=1 (yi − ybi )2 (6.11) Important Fact about the RSS Considering the RSS s2 to be the observed value of a corresponding RV S 2 , it can be shown that • S 2 /σ 2 ∼ χ2(n−p) , b • S 2 is independent of β. Exercise 6.2 1. Show that the log-likelihood function for the NLM is (constant) − n 1 log(σ 2 ) − 2 (y − Xβ)T (y − Xβ). 2 2σ (6.12) 2. Show that the maximum likelihood estimate of β is identical to the least squares estimate. b What is the distribution of β? 3. Show that the MLE σ b2 of σ 2 is σ b2 = s2 . n (6.13) What are the mean and variance of σ b2 ? 4. Show that an unbiased estimator of σ 2 is given by the formula Residual Sum of Squares Residual Degrees of Freedom 6.5.3 k Examples of the NLM 1. Simple Linear Regression (again) Yi = β0 + β1 xi + i , (6.14) 2 where i IID ∼ N (0, σ ). 2. Two-sample t-test x1 x2 .. . y= xm y1 . .. yn , X= 1 1 .. . 0 0 .. . 1 0 .. . 0 1 .. . 0 1 , and we’re interested in the hypothesis H0 : (β0 − β1 ) = 0. 80 β= β0 β1 , (6.15) 3. Paired t-test Some quantity Y is measured on each of n individuals under 2 different conditions (e.g. drugs A and B), and we want to test whether the mean of Y can be assumed equal in both circumstances. 1 0 ··· 0 0 y11 0 1 ··· 0 0 y21 α1 .. .. . . .. .. .. . . . . . . α2 0 0 ··· 1 0 yn1 .. , , X = (6.16) y= β = , 1 0 ··· 0 1 y12 . αn 0 1 ··· 0 1 y22 δ . . . . . . . . .. .. .. .. 
.. yn2 0 0 ··· 1 1 where δ is the difference between the expected responses under the two conditions, and the αi are ‘nuisance parameters’ representing the overall level of response for the ith individual. The null hypothesis is H0 : δ = 0. 4. Multiple Regression (example thereof) Y = SBP after captopril, x1 = SBP before captopril, 1 210 201 1 169 165 166 1 187 1 160 157 1 167 147 1 176 145 1 185 168 1 206 , 180 X = y= 1 173 147 1 146 136 1 174 151 1 201 168 1 198 179 1 148 129 1 154 131 x2 = DBP before captopril, 130 122 124 104 112 101 β0 121 124 β = β1 , , 115 β2 102 98 119 106 107 100 (6.17) where (roughly speaking) β1 represents the increase in EY per unit increase in SBP before captopril (x1 ), allowing for the fact that EY also depends partly on DBP before captopril (x2 ), and β2 has a similar interpretation in terms of the effect of x2 allowing for x1 . b = (XT X)−1 XT y, and also (for example) to In all the above examples, it’s straightforward to calculate β calculate the sampling distribution of βbi under the null hypothesis H0 : βi = 0. Exercise 6.3 Verify the following calculations from the data given in 6.17 above: 15 2654 1685 XT X = 2654 475502 300137 , XT y = 1685 300137 190817 8.563 −0.009165 −0.06120 0.0003026 −0.0003951 , (XT X)−1 = −0.009165 −0.06120 −0.0003951 0.001167 2370 424523 , 268373 −20.7 b = 0.724 . β 0.450 k 81 6.6 Checking Assumptions of the NLM Clearly it’s very important in practice to check that your assumptions seem reasonable; there are various ways to do this 6.6.1 Formal hypothesis testing 2 χ tests are not very powerful, but are simple and general: count the number of data points satisfying various (exhaustive & mutually exclusive) conditions, and compare with the expected counts under your assumptions. Other tests, for example to test for Normality, have been devised. However, a general problem with statistical tests is that they don’t usually suggest what to do if your null hypothesis is rejected. Exercise 6.4 How might you use a χ2 test to check whether SBP after captopril is independent of SBP before captopril? k Exercise 6.5 A possible test for linearity in the simple Normal linear regression model (i.e. the NLM with just one explanatory variable x) is to fit the quadratic NLM EY = β0 + β1 x + β2 x2 (6.18) and test the null hypothesis H0 : β2 = 0. Suppose that Y is SBP and x is dose of drug, and that you have rejected the above null hypothesis. Comment on the advisability of using Formula 6.18 for predicting Y given x. k 6.6.2 Graphical Methods and Residuals If all the assumptions of the NLM are valid, then the residuals ri = yi − ybi b = yi − xTi β (6.19) should resemble observations on IID Normal random variables. Therefore plots of ri against ANYTHING should be patternless SEE LECTURE Comments 1. Before fitting a formal statistical model (including e.g. performing a t-test), you should plot the data, particularly the response variable against each explanatory variable. 2. After fitting a model, produce several residual plots. The computer is your friend! 3. Note that it’s the residual plots that are most informative. For example, the NLM DOESN’T assume that the Yi are Normally distributed about µY , but DOES assume that each Yi is Normally distributed about EYi |xi . i.e. it’s the conditional distributions, not the marginal distributions, that are important. 82 6.7 Problems 1. Show that the following is an equivalent formulation of the two-sample t-test to that given above in Formulae 6.15 x1 1 0 x2 1 0 .. .. .. . . . 
β0 Y = xm , β= , (6.20) X = 1 0 , β1 y1 1 1 . . . .. .. .. yn 1 1 with null hypothesis H0 : β1 = 0. 2. Independent samples of 10 U.S. men aged 25–34 years, and 15 U.S. men aged 45–54 years were taken. Their heights (in inches) were as follows: (a) Age 25–34 73.3 64.8 72.1 68.9 68.7 70.4 66.8 70.7 74.4 71.8 (b) Age 45–54 73.2 68.5 62.4 65.5 71.3 69.5 74.5 70.6 69.3 67.1 64.7 73.0 66.7 68.1 64.3 Use a two-sample t-test to test the hypothesis that the population means of the two age-groups are equal (the 90%, 95%, 97.5%, and 99% points of the t23 distribution are 1.319, 1.714, 2.069 and 2.500 respectively). Comment on whether the underlying assumptions of the two-sample t-test appear reasonable for this set of data. Comment also on whether the data can be used to suggest that the population of the U.S. has (or hasn’t) tended to get taller over the last 20 years. 3. Verify that the least squares estimates in simple linear regression P xi yi − n x y βb1 = P 2 , βb0 = y − βb1 x, xi − nx2 b = (XT X)−1 XT y. are a special case of the general formula β 4. The following data-set shows average January minimum temperature in degrees Fahrenheit (y), together with Latitude (x1 ) and Longitude (x2 ) for 28 US cities. Plot y against x1 , and comment on what this plot suggests about the reasonableness of the various assumptions underlying the NLM for predicting y from x1 and x2 . y x1 x2 y x1 x2 y x1 x2 44 31 15 30 58 19 22 12 21 8 31.2 35.4 40.7 39.7 26.3 42.3 38.1 44.2 43.1 47.1 88.5 92.8 105.3 77.5 80.7 88.0 97.6 70.5 83.9 112.4 38 47 22 45 37 21 27 25 2 32.9 34.3 41.7 31.0 33.9 39.8 39.0 39.7 45.9 86.8 118.7 73.4 82.3 85.0 86.9 86.5 77.3 93.9 35 42 26 65 22 11 45 23 24 33.6 38.4 40.5 25.0 43.7 41.8 30.8 42.7 39.3 112.5 123.0 76.3 82.0 117.1 93.6 90.2 71.4 90.5 Data from HSDS, set 262 83 5. (a) Assuming the model E[Y |x] = β0 + β1 x, Var[Y |x] = σ 2 independently of x, derive formulae for the least squares estimates βb0 and βb1 from data (xi , yi ), i = 1, . . . , n. What advantages are gained if the corresponding random variables Yi |xi can be assumed to be independently Normally distributed? (b) The following table shows the tensile strength (y) of different batches of cement after being ‘cured’ (dried) for various lengths of time x: 3 batches were cured for 1 day, 3 for 2 days, 5 for 3 days, etc. The batch means and standard deviations (s.d.) are also given. Curing time Tensile strength 2 (kg/cm ) y (days) x 1 2 3 7 28 13.0 21.9 29.8 32.4 41.8 13.3 24.5 28.0 30.4 42.6 11.8 24.7 24.1 34.5 40.3 24.1 33.1 35.7 26.2 35.7 37.3 mean s.d. 12.7 23.7 26.5 33.2 40.0 0.8 1.6 2.5 2.0 3.0 Plot y against x and discuss briefly how reasonable seem each of the following assumptions: (i) linearity: E[Yi |xi ] = β0 + β1 xi for some constants β0 and β1 . (ii) independence: the Yi are mutually independent given the xi . If conditional independence (ii) is assumed true, then how reasonable here are the further assumptions: (iii) homoscedasticity: Var[Yi |xi ] = σ 2 for all i = 1, . . . , n, (iv) Normality: the random variables Yi are each Normally distributed. Say briefly whether you consider any of the above assumptions (i)–(iv) would be more plausible following (A) transforming from y to y 0 = loge (y), and/or (B) transforming x in an appropriate way. NOTE: you do not need to carry out numerical calculations such as finding the least-squares fit explicitly. From Warwick ST217 exam 2000 84 6. 
To monitor an industrial process for converting ammonia to nitric acid, the percentage of ammonia lost (y) was measured on each of 21 consecutive days, together with explanatory variables representing air flow (x1 ), cooling water temperature (x2 ) and acid concentration (x3 ). The data, together with the residuals after fitting the model yb = 3.614 + 0.072 x1 + 0.130 x2 − 0.152 x3 , are given in the following table: Day 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 y Air Flow (x1 ) Water Temp. (x2 ) Acid Conc. (x3 ) Resid. 4.2 3.7 3.7 2.8 1.8 1.8 1.9 2.0 1.5 1.4 1.4 1.3 1.1 1.2 0.8 0.7 0.8 0.8 0.9 1.5 1.5 80 80 75 62 62 62 62 62 58 58 58 58 58 58 50 50 50 50 50 56 70 27 27 25 24 22 23 24 24 23 18 18 17 18 19 18 18 19 19 20 20 20 58.9 58.8 59.0 58.7 58.7 58.7 59.3 59.3 58.7 58.0 58.9 58.8 58.2 59.3 58.9 58.6 57.2 57.9 58.0 58.2 59.1 0.323 −0.192 0.456 0.570 −0.171 −0.301 −0.239 −0.139 −0.314 0.127 0.264 0.278 −0.143 −0.005 0.236 0.091 −0.152 −0.046 −0.060 0.141 −0.724 Some residual plots are shown on the next page (Fig. 6.1). (a) Discuss whether the pattern of residuals casts doubt on any of the assumptions underlying the Normal Linear Model (NLM). Describe any further plots or calculations that you think would help you assess whether the fitted NLM is appropriate here. Continued. . . 85 (b) Various suggestions could be made for improving the model, such as i. ii. iii. iv. v. vi. vii. viii. transforming the response (e.g. to log y or to y/x1 ), transforming some or all of the explanatory variables, deleting outliers, including quadratic or even higher-order terms (e.g. x22 ), including interaction terms (e.g. x1 x3 ), carrying out a nonparametric analysis of the data, applying a bootstrap procedure, fitting a nonlinear model. Outline the merits and disadvantages of each of these suggestions here. What would be your next step in analysing this data-set? Figure 6.1: Residual plots From Warwick ST217 exam 1999 86 7. Table 6.1, originally from Narula & Wellington (1977), shows data on selling prices of 28 houses in Erie, Pennsylvania, together with explanatory variables that could be used to predict the selling price. The variables are: X1 X2 X3 X4 X5 X6 X7 X8 X9 Y = = = = = = = = = = current taxes (local, school and county) ÷ 100, number of bathrooms, lot size ÷ 1000 (square feet), living space ÷ 1000 (square feet), number of garage spaces, number of rooms, number of bedrooms, age of house (years), number of fireplaces, actual sale price ÷ 1000 (dollars). Find a function of X1 –X9 that predicts Y reasonably accurately (such functions are used to fix property taxes, which should be based on the current market value of each property). 
X1 X2 X3 X4 X5 X6 X7 X8 X9 Y 4.9176 5.0208 4.5429 4.5573 5.0597 3.8910 5.8980 5.6039 15.4202 14.4598 5.8282 5.3003 6.2712 5.9592 5.0500 8.2464 6.6969 7.7841 9.0384 5.9894 7.5422 8.7951 6.0931 8.3607 8.1400 9.1416 12.0000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.5 2.5 1.0 1.0 1.0 1.0 1.0 1.5 1.5 1.5 1.0 1.0 1.5 1.5 1.5 1.5 1.0 1.5 1.5 3.4720 3.5310 2.2750 4.0500 4.4550 4.4550 5.8500 9.5200 9.8000 12.8000 6.4350 4.9883 5.5200 6.6660 5.0000 5.1500 6.9020 7.1020 7.8000 5.5200 4.0000 9.8900 6.7265 9.1500 8.0000 7.3262 5.0000 0.9980 1.5000 1.1750 1.2320 1.1210 0.9880 1.2400 1.5010 3.4200 3.0000 1.2250 1.5520 0.9750 1.1210 1.0200 1.6640 1.4880 1.3760 1.5000 1.2560 1.6900 1.8200 1.6520 1.7770 1.5040 1.8310 1.2000 1.0 2.0 1.0 1.0 1.0 1.0 1.0 0.0 2.0 2.0 2.0 1.0 1.0 2.0 0.0 2.0 1.5 1.0 1.5 2.0 1.0 2.0 1.0 2.0 2.0 1.5 2.0 7 7 6 6 6 6 7 6 10 9 6 6 5 6 5 8 7 6 7 6 6 8 6 8 7 8 6 4 4 3 3 3 3 3 3 5 5 3 3 2 3 2 4 3 3 3 3 3 4 3 4 3 4 3 42 62 40 54 42 56 51 32 42 14 32 30 30 32 46 50 22 17 23 40 22 50 44 48 3 31 30 0 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 1 0 0 1 0 1 0 1 0 0 1 25.9 29.5 27.9 25.9 29.9 29.9 30.9 28.9 84.9 82.9 35.9 31.5 31.0 30.9 30.0 36.9 41.9 40.5 43.9 37.5 37.9 44.5 37.9 38.9 36.9 45.8 41.0 Table 6.1: House price data Weisberg (1980) 87 8. The number of ‘hits’ recorded on J.E.H.Shaw’s WWW homepage in late 1999 are given below. ‘Local’ means the homepage was accessed from within Warwick University, ‘Remote’ means it was accessed from outside. Data for the week beginning 7–Nov–1999 were unavailable. Note that there was an exam on Wednesday 8–Dec–1999 for the course ST104, taught by J.E.H.Shaw. Week Beginning Number of Hits Local Remote Total 26 Sept 3 Oct 10 Oct 17 Oct 24 Oct 31 Oct 7 Nov 14 Nov 21 Nov 28 Nov 5 Dec 12 Dec 19 Dec 0 35 901 641 1549 823 — 1136 2114 2097 3732 5 0 182 253 315 443 525 344 — 383 584 536 461 352 296 182 288 1216 1084 2074 1167 — 1519 2698 2633 4193 357 296 (a) Fit a linear least-squares regression line to predict the number of remote hits (Y ) in a week from the observed number x of local hits. (b) Calculate the residuals and plot them against date. Does the plot give any evidence that the interrelationship between X and Y changes over time? (c) Using both general considerations and residual plots, comment on how reasonable here are the assumptions underlying the simple Normal linear regression model, and suggest possible ways to improve the prediction of Y . 9. The following table shows the assets x (billions of dollars) and net income y (millions of dollars) for the 20 largest US banks in 1973. Bank x y Bank x y Bank x y Bank x y 1 2 3 4 5 49.0 42.3 36.3 16.4 14.9 218.8 265.6 170.9 85.9 88.1 6 7 8 9 10 14.2 13.5 13.4 13.2 11.8 63.6 96.9 60.9 144.2 53.6 11 12 13 14 15 11.6 9.5 9.4 7.5 7.2 42.9 32.4 68.3 48.6 32.2 16 17 18 19 20 6.7 6.0 4.6 3.8 3.4 42.7 28.9 40.7 13.8 22.2 (a) Plot income (y) against assets (x), and also log(income) against log(assets). (b) Verify that the least squares fit regression lines are fit 1: fit 2: y = 4.987 x + 7.57, log(y) = 0.963 log(x) + 1.782 (Note: logs to base e), and show the fitted lines on your plots. (c) Produce Normal probability plots of the residuals from each fit. (d) Which (if either) of these models would you use to describe the relationship between total assets and net income? Why? (e) Bank number 19 (the Franklin National Bank) failed in 1974, and was the largest ever US bank to fail. 
Identify the point representing this bank on each of your plots, and discuss briefly whether, from the data presented, one might have expected beforehand that the Franklin National Bank was in trouble. 88 10. The following data show the blood alcohol levels (mg/100ml) at post mortem for traffic accident victims. Blood samples in each case were taken from the leg (A) and from the heart (B). Do these results indicate that blood alcohol levels differ systematically between samples from the leg and the heart? Case A B Case A B 1 2 3 4 5 6 7 8 9 10 44 265 250 153 88 180 35 494 249 204 44 269 256 154 83 185 36 502 249 208 11 12 13 14 15 16 17 18 19 20 265 27 68 230 180 149 286 72 39 272 277 39 84 228 187 155 290 80 50 290 Osborn (1979) 4.6.5 11. (a) Assume the linear model E[Y|X] = Xβ, Var[Y|X] = σ 2 In , where In denotes the n × n identity matrix, and XT X is nonsingular. By writing Y − Xβ = b + X(β b − β), or otherwise, show that for this model, the residual sum of squares (Y − Xβ) (Y − Xβ)T (Y − Xβ) b = (XT X)−1 XT Y. is minimised at β = β b = β and that Var[β] b = σ 2 (XT X)−1 . (b) Show that E[β] (c) Let A = X(XT X)−1 XT . Show that A and In − A are both idempotent, i.e. AA = A and (In − A)(In − A) = In − A. (d) For the particular case of a Normal linear model, find the joint distribution of the fitted values b and show that Y − Y b = Xβ, b is independent of Y. b Quote carefully any properties of the Y Normal distribution you use. (e) For the simple linear regression model (EYi = β0 + β1 xi ), write down the corresponding matrix X and vector Y, find (XT X)−1 , and hence find the least squares estimates βb0 and βb1 and their variances. From Warwick ST217 exam 2001 89 6.8 The Analysis of Variance (ANOVA) 6.8.1 One-Way Analysis of Variance: Introduction This is a generalization of the two-sample t-test to p > 2 groups. Suppose there are observations yij (j = 1, 2, . . . , ni ) in the ith group (i = 1, 2, . . . , p), and let n = n1 + n2 + · · · + np denote the total number of observations. Denote the corresponding RVs by Yij , and assume that Yij ∼ N (βi , σ 2 ) independently. Traditionally the main aim has been to test the null hypothesis H0 : β1 = β2 = . . . = βp i.e. : β = β 0 = (β0 , β0 , . . . , β0 ) b and β b and apply a likelihood ratio test, i.e. test whether the ratio The idea is to fit MLEs β 0 change in RSS RSS b to y b0 squared distance from y b squared distance from y to y = b and y b0 are the corresponding fitted values) is larger than would be expected by chance. (where y A useful notation for group means etc. uses overbars and ‘+’ suffices as follows: ! p ni p ni 1 X 1 XX 1X y i+ = yij , y ++ = yij = ni y i+ , ni j=1 n i=1 j=1 n i=1 etc. The underlying models fit naturally in the NLM framework: Definition 6.7 (One-Way ANOVA) The one-way ANOVA model is a NLM of the form Y ∼ MVN(Xβ, σ 2 I), where Y= Y1 Y2 .. . Yn , X= 0 ··· 0 ··· .. . . . . 0 ··· 0 ··· .. . . . . 1 1 .. . 0 0 .. . 0 0 .. . 1 0 .. . 0 1 .. . 0 0 .. . 1 0 .. . 0 .. . 0 .. . 0 .. . 0 .. . 0 ··· 0 1 ··· 0 .. .. .. . . . 1 ··· 0 .. .. .. . . . 0 ··· 1 .. . . . . .. . 0 0 0 ··· 1 0 0 .. . , β= β1 β2 .. . , (6.21) βp where X has n1 rows of the first type, . . . np rows of the last type, and n1 + n2 + · · · + np = n. Exercise 6.6 b = (Y 1+ , Y 2+ , . . . , Y p+ )T . Show that for one-way ANOVA, XT X = diag(n1 , n2 , . . . , np ), and hence β k 90 6.8.2 One-Way Analysis of Variance: ANOVA Table Let p β0 = E[Y ++ ] αi = βi − β0 p n i 1 XX EYij n i=1 j=1 = 1X ni βi , n i=1 = (i = 1, 2, . . . , p). 
Typically the p groups correspond to p different treatments, and αi is then called the ith treatment effect. We’re interested in the hypotheses H0 H1 : αi = 0 (i = 1, 2, . . . , p), : the αi are arbitrary. Note that 1. Y ++ is the MLE of β = β0 under H0 , 2. Y i+ is the MLE of β + αi , i.e. the mean response given the i treatment. Hence the fitted values under H0 and H1 are given by Y ++ and Y i+ respectively. If we also include the ‘null model’ that all the βi are zero, then the possible models of interest are: Model βi = 0 ∀ i # params i.e. ybij = 0 βi = β0 ∀ i 0 i.e. ybij = y ++ βi arbitrary, DF n 1 i.e. ybij = y i+ RSS P p i,j n−1 P n−p P i,j (yij i,j (yij 2 yij (1) 2 − y ++ ) (2) − y i+ )2 (3) The calculations needed to test H0 , involving the RSS formulae given above, can be conveniently presented in an ‘ANOVA table’: Source of variation Degrees of freedom (DF) Overall mean 1 Sum of squares (SS) ny 2++ (1)–(2) = Treatment p−1 (2)–(3) = Residual n−p (3) = Total n (1) = Mean square (MS) = SS/DF ni (y i+ − y ++ )2 P 2 i,j (yij − y i+ ) P 2 i,j yij P i ny 2++ ni (y i+ − y ++ )2 (p − 1) P 2 (n − p) i,j (yij − y i+ ) P i Finally, calculate the ‘F ratio’ F = Treatment MS Treatment SS/(p − 1) = Residual MS Residual SS/(n − p) (6.22) which, under H0 , has an F distribution on (p − 1) and (n − p) d.f. Large values of F are evidence against H0 . Note: DON’T try too hard to remember formulae for sums of squares in an ANOVA table. Instead THINK OF THE MODELS BEING FITTED. The ‘lack of fit’ of each model is given by the corresponding RSS, & the formulae for the differences in RSS simplify. 91 6.9 Problems 1. Show that the formulae for sums of squares in one-way ANOVA simplify: p X ni (Y i+ − Y ++ )2 = i=1 p X 2 2 ni Y i+ − nY ++ , i=1 p X ni X (Yij − Y i+ )2 = i=1 j=1 p X ni X Yij2 − i=1 j=1 p X 2 ni Y i+ . i=1 2. (a) Define the Normal Linear Model, and describe briefly how each of its assumptions may be informally checked by plotting residuals. (b) The following data summarise the number of days survived by mice inoculated with three strains of typhoid (31 mice with ‘9D’, 60 mice with ‘11C’ and 133 mice with ‘DSCI’). Days to Death 2 3 4 5 6 7 8 9 10 11 12 13 14 Total P P X2i Xi Numbers of Mice Inoculated with. . . 9D 11C DSCI Total 6 4 9 8 3 1 1 3 3 6 6 14 11 4 6 2 3 1 3 5 5 8 19 23 22 14 14 7 8 4 1 10 12 17 22 28 38 33 18 20 9 11 5 1 31 125 561 60 442 3602 133 1037 8961 224 1604 13124 (Xi is the survival time of the ith mouse in the given group). Without carrying out any calculations, discuss briefly how reasonable seem the assumptions underlying a one-way ANOVA on the data, and whether a transformation of the data may be appropriate. (c) Carry out a one-way ANOVA on the untransformed data. What do you conclude about the responses to the three strains of typhoid? From Warwick ST217 exam 1997 3. The amount of nitrogen-bound bovine serum albumin produced by three groups of mice was measured. The groups were: normal mice treated with a placebo (i.e. an inert substance), alloxan-diabetic mice treated with a placebo, and alloxan-diabetic mice treated with insulin. 
The resulting data are shown in the following table: 92 Normal + placebo Alloxan-diabetic + placebo Alloxan-diabetic + insulin 156 282 197 297 116 127 119 29 253 122 349 110 143 64 26 86 122 455 655 14 391 46 469 86 174 133 13 499 168 62 127 276 176 146 108 276 50 73 82 100 98 150 243 68 228 131 73 18 20 100 72 133 465 40 46 34 44 (a) Produce appropriate graphical display(s) and numerical summaries of these data, and comment on what can be learnt from these. (b) Carry out a one-way analysis of variance on the three groups. You may feel it necessary to transform the data first. Data from HSDS, set 304 4. The following table shows measurements of the steady-state haemoglobin levels for patients with different types of sickle-cell anaemia (‘HB SS’, ‘HB S/-thalassaemia’ and ‘HB SC’). Construct an ANOVA table and hence test whether the steady-state haemoglobin levels differ between the three types. HB SS HB S/-thalassaemia HB SC 7.2 7.7 8.0 8.1 8.3 8.4 8.4 8.5 8.6 8.7 9.1 9.1 9.1 9.8 10.1 10.3 8.1 9.2 10.0 10.4 10.6 10.9 11.1 11.9 12.0 12.1 10.7 11.3 11.5 11.6 11.7 11.8 12.0 12.1 12.3 12.6 12.6 13.3 13.3 13.8 13.9 Data from HSDS, set 310 93 5. The data in Table 6.2, collected by Brian Everitt, are described in HSDS as being the ‘weights, in kg, of young girls receiving three different treatments for anorexia over a fixed period of time with the control group receiving the standard treatment’. (a) Using a one-way ANOVA on the weight gains, compare the three methods of treatment. (b) Plot the data so as to clarify the effects of the three treatments, and discuss whether the above formal analysis was appropriate. Cognitive behavioural treatment Control Weight before after Weight before after 80.5 84.9 81.5 82.6 79.9 88.7 94.9 76.3 81.0 80.5 85.0 89.2 81.3 81.3 76.5 70.0 80.4 83.3 83.0 87.7 84.2 86.4 76.5 80.2 87.8 83.3 79.7 84.5 80.8 87.4 82.2 85.6 81.4 81.9 76.4 103.6 98.4 93.4 73.4 82.1 96.7 95.3 82.4 82.4 72.5 90.9 71.3 85.4 81.6 89.1 83.9 82.7 75.7 82.6 100.4 85.2 83.6 84.6 96.2 86.7 80.7 89.4 91.8 74.0 78.1 88.3 87.3 75.1 80.6 78.4 77.6 88.7 81.3 81.3 78.1 70.5 77.3 85.2 86.0 84.1 79.7 85.5 84.4 79.6 77.5 72.3 89.0 Family therapy Weight before after 80.2 80.1 86.4 86.3 76.1 78.1 75.1 86.7 73.5 84.6 77.4 79.5 89.6 89.6 81.4 81.8 77.3 84.2 75.4 79.5 73.0 88.3 84.7 81.4 81.2 88.2 78.8 83.8 83.3 86.0 82.5 86.7 79.6 76.9 94.2 73.4 80.5 81.6 82.1 77.6 77.6 83.5 89.9 86.0 87.3 95.2 94.3 91.5 91.9 100.3 76.7 76.8 101.6 94.9 75.2 77.8 95.5 90.7 90.7 92.5 93.8 91.7 98.0 Table 6.2: Anorexia data Data from HSDS, set 285 94 6. The following data come from a study of pollution in inland waterways. In each of seven localities, five pike were caught and the log concentration of copper in their livers measured. Locality 1. 2. 3. 4. 5. 6. 7. Windermere Grassmere River Stour Wimbourne St Giles River Avon River Leam River Kennett Log concentration of copper (ppm) 0.187 0.449 0.628 0.412 0.243 0.134 0.471 0.836 0.769 0.193 0.286 0.258 0.281 0.371 0.704 0.301 0.810 0.497 -0.276 0.529 0.297 0.938 0.045 0.000 0.417 -0.538 0.305 0.691 0.124 0.846 0.855 0.337 0.041 0.459 0.535 (a) The data are plotted in Figure 6.2. Discuss briefly what the plot suggests about the relative copper pollution in the various localities. Figure 6.2: Concentration of copper in pike livers (b) Carry out a one-way analysis of variance to test for differences between the data between localities. Do the results of the formal analysis agree with your subjective impressions from Figure 6.2? 
95 6.10 Two-Way Analysis of Variance Here there are two factors (e.g. two treatments, or patient number and treatment given) that can be varied independently. Factor A has I ‘levels’ 1, 2, . . . , I, and factor B has J ‘levels’ 1, 2, . . . , J. For example: (a) A is patient number 1, 2, . . . , I, every patient receiving each treatment j = 1, 2, . . . , J in turn, (b) A is treatment number 1, 2, . . . , I, and B is one of J possible supplementary treatments. Data can be conveniently tabulated: Factor A 1 2 3 .. . 1 Y11 Y21 Y31 .. . Factor B 2 ... Y12 . . . Y22 . . . Y32 . . . .. .. . . J Y1J Y2J Y3J .. . I YI1 YI2 YIJ ... i.e. there is precisely one observation Yij at each (i, j) combination of factor levels. Again assume the NLM with E[Yij ] = θi + φj for i = 1 . . . I and j = 1 . . . J. i.e. Yij ∼ N (θi + φj , σ 2 ) independently. (6.23) A problem here is that one could transform θi → 7 θi + c and φj 7→ φj − c for each i and j, where c is arbitrary. Therefore for identifiability one needs to impose some (arbitrary) constraints. The simplest and most symmetrical reformulation for the two-way ANOVA model is Yij ∼ N (µ + αi + βj , σ 2 ), PI αi = 0, PJ βj = i=1 j=1 where (6.24) 0. Exercise 6.7 What is the matrix formulation of the model 6.24? k Particular models of interest within the framework of Formulae 6.24 are: (1) Yij ∼ N (0, σ 2 ), X RSS = Yij2 , DF = n = IJ. i,j (2) Yij ∼ N (µ, σ 2 ), X RSS = (Yij − Y ++ )2 , DF = n − 1 = IJ − 1. i,j (3) Yij ∼ N (µ + αi , σ 2 ), Ybij = µ b+α bi = Y i+ . X Therefore RSS = (Yij − Y i+ )2 , DF = n − I = I(J − 1). i,j 96 (4) Yij ∼ N (µ + βj , σ 2 ), Ybij = µ b + βbj = Y +j . X Therefore RSS = (Yij − Y +j )2 , DF = n − J = (I − 1)J. i,j (5) Yij ∼ N (µ + αi + βj , σ 2 ), Ybij = µ b+α bi + βbj = Y i+ + Y +j − Y ++ . X Therefore RSS = (Yij − Y i+ − Y +j + Y ++ )2 , DF = n − I − J + 1 = (I−1)(J−1). i,j Again, we can form an ANOVA table summarising the independent ‘sources of variation’. The degrees of freedom are the differences between the DFs associated with the various models. The sums of squares are the differences between the SSs associated with the various models. Source of variation Degrees of freedom (DF) Sum of squares (SS) Mean square (MS) Overall mean Effect of Factor A Effect of Factor B Residuals 1 I−1 J−1 (I−1)(J−1) (1)−(2) (2)−(3) (2)−(4) (5) (2)−(3) (I−1) (2)−(4) (J−1) (5) (I−1)(J−1) Total IJ = n (1) Table 6.3: Two-way ANOVA table Comments 1. DeGroot gives a more general version. 2. As with one-way ANOVA, one can test H0 : αi = 0, i = 1 . . . I, by comparing (SS due to A)/(I − 1) (Residual SS)/([I − 1][J − 1]) with the 95% point of F(I−1),([I−1][J−1]) . 3. Similarly one can test H0 : βj = 0, j = 1 . . . J, by comparing (SS due to B)/(J − 1) (Residual SS)/([I − 1][J − 1]) with the 95% point of F(J−1),([I−1][J−1]) . 4. The above two F tests are using completely separate aspects of the data (row sums of the Yij table, column sums of the Yij table). 5. The case J = 2 is equivalent to the paired t-test (Exercise 5.4). 6. As for one-way ANOVA, the formulae for sums of squares simplify: ‘sum over each observation the squared difference between the fitted values under the two models being considered’. The residual SS is then most easily obtained by subtraction. See problem 6.11.1 97 6.11 Problems 1. For the two-way analysis of variance (Table 6.3, page 97), find simplified formulae for the sums of squares analogous to those found for the one-way ANOVA (exercise 6.9.1). 2. Three pertussis vaccines were tested on each of ten days. 
The following table shows estimates of the log doses of vaccine (in millions of organisms) required to protect 50% of mice against a subsequent infection with pertussis organisms. Day A Vaccine B C Total 1 2 3 4 5 6 7 8 9 10 2.64 2.00 3.04 2.07 2.54 2.76 2.03 2.20 2.38 2.42 2.93 2.52 3.05 2.97 2.44 3.18 2.30 2.56 2.99 3.20 2.93 2.56 3.35 2.55 2.45 3.25 2.17 2.18 2.74 3.14 8.50 7.08 9.44 7.59 7.43 9.19 6.50 6.94 8.11 8.76 Total 24.08 28.14 27.32 79.54 Test the statistical significance of the differences between days and between vaccines. Osborn (1979) 8.1.2 3. (a) Explain what is meant by the Normal Linear Model (NLM), and show how the two-way analysis of variance may be formulated in this way. (b) The following table gives the average UK cereal yield (tonnes per hectare) from 1994 to 1998, together with the row, column, and overall totals. Wheat Barley Oats Other cereal Total 1994 1995 1996 1997 1998 Total 7.35 5.37 5.50 5.65 7.70 5.73 5.52 5.52 8.15 6.14 6.14 5.86 7.38 5.76 5.78 5.52 7.56 5.29 6.00 5.04 38.14 28.29 28.94 27.59 23.87 24.47 26.29 24.44 23.89 122.96 Calculate the fitted yields and residuals for Wheat in each of the five years i. under the NLM assuming no column effect, and ii. under the NLM assuming that row & column effects are additive. (c) Describe briefly how to test the null hypothesis that there is no column effect (i.e. no consistent change in yield from year to year). You do not need to carry out the numerical calculations. (d) A nonparametric test of the above hypothesis may be carried out as follows: rank the data for each row from lowest to highest (thus for Wheat the values 7.35, 7.70, 8.15, 7.38 and 7.56 are replaced by 1, 4, 5, 2 and 3 respectively), then sum the four ranks for each year, and finally carry out a one-way analysis of variance on the five sums of ranks. Comment on the advantages and disadvantages of applying this procedure, rather than the standard two-way ANOVA, to the above data. From Warwick ST217 exam 2001 98 4. The following table gives the estimated hospital waiting lists (000s) by month & region, throughout the years 2000 & 2001. 
Month Year 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 NY T E L SE SW 137.6 132.0 125.1 129.5 129.9 128.9 127.9 126.7 124.6 123.4 121.0 121.3 122.4 121.6 119.6 122.6 124.0 124.2 123.2 123.2 123.0 124.1 123.0 124.3 105.5 103.2 98.7 99.5 99.5 99.9 99.0 98.9 98.2 97.1 97.2 99.0 97.7 96.9 95.3 96.5 97.2 98.0 98.9 99.6 99.4 99.0 99.6 100.7 121.6 118.7 111.9 114.3 114.6 114.4 113.6 113.2 111.9 112.0 111.7 112.7 113.3 113.0 109.7 110.7 111.0 111.8 111.4 111.9 111.6 111.8 113.2 115.8 173.3 167.9 162.3 163.5 163.2 163.3 160.8 159.6 158.1 156.0 155.7 158.0 159.5 159.2 156.3 158.9 160.0 160.6 160.9 161.4 159.1 156.7 155.4 159.1 192.6 190.4 184.0 186.9 186.4 183.7 183.9 183.4 183.8 183.4 183.4 188.2 188.8 186.7 181.1 184.1 185.9 187.2 187.4 185.7 185.2 184.6 183.7 186.7 111.0 107.7 100.7 101.5 100.6 100.2 99.6 100.1 99.6 98.8 98.9 99.4 99.8 100.1 97.1 99.3 100.3 99.7 100.2 100.0 100.2 101.1 102.0 103.6 WM 99.4 95.8 89.6 92.1 92.4 91.9 90.7 90.2 91.0 91.1 92.1 92.8 93.4 92.1 87.2 88.8 90.1 90.8 91.0 91.6 91.0 90.7 89.8 92.1 NW 177.6 172.2 164.8 166.5 166.2 165.5 164.9 165.6 164.6 163.1 161.2 162.9 164.0 163.3 160.5 162.7 164.4 165.6 165.5 166.2 165.8 165.5 164.7 168.0 Key NY E SW Northern & Yorkshire Eastern L South West WM London West Midlands T SE NW Trent South East North West [Data extracted from archived Press Releases at http://tap.ccta.gov.uk/doh/intpress.nsf] Fit a two-way ANOVA model, possibly after transforming the data, and address (briefly) the following questions: (a) Does the pattern of change in waiting lists differ across the regions? (b) Is there a simple (but not misleading) description of the overall change in waiting lists over the two years? (c) Predict the values for the eight regions in March 2002 (to the nearest 100, as in the Table). (d) The set of figures for March 2001 were the latest available at the time of the General Election in May 2001. A cynical acquaintance suggests to you that the March 2001 waiting lists were ‘unusually good’. What do you think? 99 5. Table 4.2, page 45, presented data on the preventive effect of four different drugs on allergic response in ten patients. A simple way to analyse the patient √ response, √ data is via a two-way ANOVA on a suitable measure of√ such as√the increase in NCF, which is tabulated below (for example, 1.95 = 3.8 − 0.0 and √ 1.52 = 9.2 − 2.3). Drug 1 2 3 4 P C D K 1.95 0.71 0.65 0.19 1.52 1.30 0.67 0.54 0.77 1.32 0.65 −0.07 0.44 1.48 0.48 0.82 Patient number 5 6 0.78 0.58 0.00 0.54 1.69 0.41 0.44 −0.44 7 8 9 0.37 0.00 0.26 0.27 0.95 2.09 0.42 −0.03 1.10 0.32 1.18 0.59 10 0.62 −0.22 0.63 0.71 (a) Test the statistical significance of the differences between drugs and between patients. (b) Plot the original data (Table 4.2) in a way that would help you assess whether the assumptions underlying the above two-way ANOVA are reasonable. (c) Comment on the analysis you have made suggesting possible improvements where appropriate. You do NOT need to carry out any further complicated calculations. 6. Table 6.4 shows purported IQ scores of identical twins, one raised in a foster home (Y ), and the other raised by natural parents (X). The data are also categorised according to the social class of the natural parents (upper, middle, low). The data come from Burt (1966), and are also available in Weisberg (1980). 
upper class Case Y X 1 2 3 4 5 6 7 82 80 88 108 116 117 132 82 90 91 115 115 129 131 middle class Case Y X 8 9 10 11 12 13 71 75 93 95 88 111 78 79 82 97 100 107 lower class Case Y X 14 15 16 17 18 19 20 21 22 23 24 25 26 27 63 77 86 83 93 97 87 94 96 112 113 106 107 98 68 73 81 85 87 87 93 94 95 97 97 103 106 111 Table 6.4: Burt’s twin IQ data (a) Plot the data. (b) Fit simple linear regression models to predict Y from X within each social class. (c) Fit parallel lines predicting Y from X within each social class (i.e. fit regression models with the same slope in each of the three classes, but possibly different intercepts). (d) Produce an ANOVA table and an F -test to test whether the parallelism assumption is reasonable. Comment on the calculated F ratio. 100 For we know in part, and we prophesy in part. But when that which is perfect is come, then that which is in part shall be done away. 1 Corinthians 13:9–10 Everything should be made as simple as possible, but not simpler. Albert Einstein A theory is a good theory if it satisfies two requirements: it must accurately describe a large class of observations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations. Stephen William Hawking The purpose of models is not to fit the data but to sharpen the question. Samuel Karlin Science may be described as the art of systematic oversimplification. Sir Karl Raimund Popper 101 This page intentionally left blank (except for this sentence). 102 Chapter 7 Further Topics 7.1 Generalisations of the Linear Model You can generalise the systematic part of the linear model, i.e. the formula for E[Y |x] and/or the random part, i.e. the distribution of Y − E[Y |x]. 7.1.1 Nonlinear Models These are models of the form E[Y |x] = g(x, β) (7.1) T where Y is the response, x is a vector of explanatory variables, β = (β1 . . . βp ) is a parameter vector, and the function g is nonlinear in the βi s. Examples 1. Asymptotic regression: Yi = i IID ∼ α − βγ xi + i (i = 1, 2, . . . , n), 2 N (0, σ ). There are four parameters to be estimated: β = (α, β, γ, σ 2 )T . Assuming that 0 < γ < 1, we have: (a) E[Y |x] is monotonic increasing in x, (b) E[Y |x = 0] = α − β, (c) as x → ∞, E[Y |x] → α. This ‘asymptotic regression’ model might be appropriate, for example, if (a) x = age of an animal, y = height or weight, or (b) x = time spent training, y = height jumped (for n people of similar build). 2. The ‘Michaelis-Menten’ equation in enzyme kinetics E[Y |x] = β1 x β2 + x with various possible distributional assumptions, the simplest of which is [Y |x] ∼ N (β1 x/(β2 +x), σ 2 ). 103 Comments 1. Nonlinear models can be fitted, in principle, by maximum likelihood. 2. In practice one needs computers and iteration. 3. Even if the random variation is assumed to be Normal, the likelihood may have a very non-Normal shape. 7.1.2 Generalised Linear Models Definition 7.1 (GLM) A generalized linear model (GLM) has a random part and a systematic part: Random Part 1. The ith response Yi has a probability distribution with mean µi . 2. The distributions are all of the same form (e.g. all Normal with variance σ 2 , or all Poisson, etc.) 3. The Yi s are independent. Systematic Part g(µi ) = xTi β = p X βj xij , where j=1 1. xi = (xi1 . . . xip )T is a vector of explanatory variables, 2. β = (β1 . . . βp )T is a parameter vector, and 3. g(·) is a monotonic function called the link function. Comments 1. 
7.1.2 Generalised Linear Models

Definition 7.1 (GLM) A generalized linear model (GLM) has a random part and a systematic part:

Random Part
1. The ith response Yi has a probability distribution with mean µi.
2. The distributions are all of the same form (e.g. all Normal with variance σ², or all Poisson, etc.).
3. The Yi's are independent.

Systematic Part

    g(µi) = xiᵀβ = Σ_{j=1}^p βj xij ,

where
1. xi = (xi1 . . . xip)ᵀ is a vector of explanatory variables,
2. β = (β1 . . . βp)ᵀ is a parameter vector, and
3. g(·) is a monotonic function called the link function.

Comments

1. If Yi ~ N(µi, σ²) and g(·) is the identity function, then we have the NLM.
2. Other GLMs typically must have their parameters estimated by maximising the likelihood numerically (iteratively in a computer).
3. The principles behind fitting GLMs are similar to those for fitting NLMs.

Example: 'logistic regression'

1. Random part: binary response, e.g.
       Yi | xi = 1 if individual i survived, 0 if individual i died
   (and all Yi's are conditionally independent given the corresponding xi's). Note that µi = E[Yi | xi] is here the probability of surviving given explanatory variables xi, and is usually written pi or πi.
2. Systematic part:
       g(πi) = log( πi / (1 − πi) ).

Exercise 7.1 Show that under the logistic regression model, if n patients have identical explanatory variables x say, then
1. each of these n patients has probability of survival given by
       π = exp(xᵀβ) / (1 + exp(xᵀβ)),
2. the number R surviving out of n has expected value nπ and variance nπ(1 − π).

7.2 Simpson's Paradox

Simpson's paradox occurs when there are three RVs X, Y and Z, such that the conditional distributions [X, Y | Z] show a relationship between [X | Z] and [Y | Z], but the marginal distribution [X, Y] apparently shows a very different relationship between X and Y. For example,
1. X (Y) = male (female) death rate, Z = age,
2. X (Y) = male (female) admission rate to University, Z = admission rate for student's chosen course.

7.3 Problems

1. (a) Explain what is meant by
       i. the Normal linear model,
       ii. simple linear regression, and
       iii. nonlinear regression.
   (b) For simple linear regression applied to data (xi, yi), i = 1, . . . , n, show that the maximum likelihood estimators β̂0 and β̂1 of the intercept β0 and slope β1 satisfy the simultaneous equations

           β̂0 n + β̂1 Σ_{i=1}^n xi = Σ_{i=1}^n yi
       and
           β̂0 Σ_{i=1}^n xi + β̂1 Σ_{i=1}^n xi² = Σ_{i=1}^n xi yi .

       Hence find β̂0 and β̂1.
   (c) The following table shows Y, the survival time (weeks) of leukaemia patients, and x, the corresponding log of initial white blood cell count.

             x      Y        x      Y        x      Y
           3.36    65      4.00   121      4.54    22
           2.88   156      4.23     4      5.00     1
           3.63   100      3.73    39      5.00     1
           3.41   134      3.85   143      4.72     5
           3.78    16      3.97    56      5.00    65
           4.02   108      4.51    26

       Plot the data and, without carrying out any calculations, discuss how reasonable the assumptions underlying simple linear regression are in this case.
   From Warwick ST217 exam 1998

2. Because of concerns about sex discrimination, a study was carried out by the Graduate Division at the University of California, Berkeley. In fall 1973, there were 8,442 male applications and 4,321 female applications to graduate school. It was found that about 44% of the men and 35% of the women were admitted. When the data were investigated further, it was found that just 6 of the more than 100 majors accounted for over one-third of the total number of applicants. The data for these six majors (which Berkeley forbids identifying by name) are summarized in the table below.

                          Men                             Women
   Major   Number of applicants  Percent admitted   Number of applicants  Percent admitted
     A             825                  62                   108                 82
     B             560                  63                    25                 68
     C             325                  37                   593                 34
     D             417                  33                   375                 35
     E             191                  28                   393                 24
     F             373                   6                   341                  7

   Discuss the possibility of sex discrimination in admission, with particular reference to explanatory variables, conditional probability, independence and Simpson's paradox.
   Data from Freedman et al. (1991), page 17
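A quick numerical check of Simpson's paradox in the six-major table above (a sketch, not part of the original notes). The admitted counts are reconstructed from the published percentages by rounding, which is close enough for the qualitative point.

```python
# Sketch: per-major vs aggregate admission rates for the six Berkeley majors above.
majors     = ['A', 'B', 'C', 'D', 'E', 'F']
men_apps   = [825, 560, 325, 417, 191, 373]
men_pct    = [ 62,  63,  37,  33,  28,   6]
women_apps = [108,  25, 593, 375, 393, 341]
women_pct  = [ 82,  68,  34,  35,  24,   7]

# Reconstruct admitted counts from applicants x percentage (rounded)
men_adm   = [round(a * p / 100) for a, p in zip(men_apps, men_pct)]
women_adm = [round(a * p / 100) for a, p in zip(women_apps, women_pct)]

# Within most majors the women's admission rate is comparable to or higher than the men's ...
for m, mp, wp in zip(majors, men_pct, women_pct):
    print(f'Major {m}: men {mp}%, women {wp}%')

# ... yet aggregated over the six majors the men's overall rate is much higher,
# because women applied mainly to the majors that admit few applicants of either sex.
print(f'Overall: men {100 * sum(men_adm) / sum(men_apps):.1f}%, '
      f'women {100 * sum(women_adm) / sum(women_apps):.1f}%')
```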
3. (a) At a party, the POTAS (Person Of The Appropriate Sex) of your dreams approaches you, and says by way of introduction:

       Hi—I'm working on a study of human pheromones, and need some statistical help. Can you explain to me what's meant by 'logistic regression', and why the idea's important?

   Give a brief verbal explanation of logistic regression, without
   (i) using any formulae,
   (ii) saying anything that's technically incorrect,
   (iii) boring the other person senseless and ruining a potentially beautiful friendship,
   (iv) otherwise embarrassing yourself.

   (b) Repeat the exercise, replacing logistic regression successively with: Bayesian inference, a multinomial distribution, nuisance parameters, the Poisson distribution, statistical independence, conditional expectation, multiple regression, one-way ANOVA, a linear model, a t-test, likelihood, the Neyman–Pearson lemma, order statistics, and size & power.

   (c) Suddenly, a somewhat inebriated student (SIS) appears and interrupts your rather impressive explanation with the following exchange:

       SIS:        Think of a number from 1 to 10.
       POTASOYD:   Erm—seven?
       SIS:        Wrong. Get your clothes off.

   You then watch aghast while he starts introducing himself in the same way to everyone in the room. As a statistician, you of course note down the numbers xi he is given, namely
       7, 2, 3, 1, 5, 2, 10, 10, 7, 3, 9, 1, 2, 2, 7, 10, 5, 8, 5, 7, 3, 10, 6, 1, 5, 3, 2, 7, 8, 5, 7.
   His response yi is 'Wrong' in each case, and you formulate the hypotheses
       H0: yi = 'Wrong' irrespective of xi;
       H1: for some x0 ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, yi = 'Right' if xi = x0 and yi = 'Wrong' if xi ≠ x0.
   How might you test the null hypothesis H0 against the alternative H1?

4. (i) Explain what is meant by:
       (a) a generalised linear model,
       (b) a nonlinear model.
   (ii) Discuss the models you would most likely consider for the following data sets:
       (a) Data on the age, sex, and weight of 100 people who suffered a heart attack (for the first time), and whether or not they were still alive two years later.
       (b) Data on the age, sex and weight of 100 salmon in a fish farm.
   From Warwick ST217 exam 1996
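For data like those in Problem 4(ii)(a), one natural candidate is the logistic-regression GLM of Section 7.1.2. The following is a minimal sketch (not part of the original notes); the data are simulated inside the script so that it runs on its own, and the column names are hypothetical.

```python
# Sketch: logistic regression for binary two-year survival given age, sex and weight.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 100
df = pd.DataFrame({
    'age':    rng.integers(40, 85, n),
    'sex':    rng.choice(['F', 'M'], n),
    'weight': rng.normal(75, 12, n),
})
# Simulate the binary response from a logistic model, purely so the example is self-contained
eta = 0.04 * (70 - df['age']) + 0.3 * (df['sex'] == 'M')
df['alive2yr'] = (rng.random(n) < 1 / (1 + np.exp(-eta))).astype(int)

# Random part: alive2yr_i ~ Bernoulli(pi_i), independent given the covariates.
# Systematic part: log(pi_i / (1 - pi_i)) = beta0 + beta1*age_i + beta2*weight_i + beta3*[sex_i = M]
fit = smf.logit('alive2yr ~ age + weight + C(sex)', data=df).fit()
print(fit.summary())
print(fit.predict(df).head())   # fitted survival probabilities exp(x'b) / (1 + exp(x'b))
```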
I have yet to see any problem, however complicated, which, when you looked at it the right way, did not become still more complicated.
    Poul Anderson

The manipulation of statistical formulas is no substitute for knowing what one is doing.
    Hubert M. Blalock, Jr.

A judicious man uses statistics, not to get knowledge, but to save himself from having ignorance foisted upon him.
    Thomas Carlyle

The best material model of a cat is another, or preferably the same, cat.
    A. Rosenblueth & Norbert Wiener

A little inaccuracy sometimes saves tons of explanation.
    Saki (Hector Hugh Munro)

karma police arrest this man he talks in maths he buzzes like a fridge he's like a detuned radio
    Thom Yorke

Better is the end of a thing than the beginning thereof.
    Ecclesiastes 7:8

Bibliography

[1] V. Barnett. Comparative Statistical Inference. John Wiley and Sons, New York, second edition, 1982.
[2] C. Burt. The genetic determination of differences in intelligence: A study of monozygotic twins reared together and apart. Brit. J. Psych., 57:137–153, 1966.
[3] G. Casella and R. L. Berger. Statistical Inference. Wadsworth & Brooks/Cole, Pacific Grove, CA, 1990.
[4] G. Casella and R. L. Berger. Statistical Inference. Wadsworth & Brooks/Cole, Pacific Grove, CA, second edition, 2001.
[5] M. H. DeGroot. Probability and Statistics. Addison-Wesley, Reading, Mass., second edition, 1989.
[6] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap, volume 57 of Monographs on Statistics and Applied Probability. Chapman and Hall, New York, 1993.
[7] D. Freedman, R. Pisani, R. Purves, and A. Adhikari. Statistics. W. W. Norton, New York, second edition, 1991.
[8] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice. Chapman and Hall, London, 1996.
[9] D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski, editors. A Handbook of Small Data Sets. Chapman and Hall, London, 1994.
[10] R. V. Hogg and A. T. Craig. Introduction to Mathematical Statistics. MacMillan, New York, 1970.
[11] B. W. Lindgren. Statistical Theory. Chapman and Hall, London, fourth edition, 1994.
[12] A. M. Mood, F. A. Graybill, and D. C. Boes. Introduction to the Theory of Statistics. McGraw-Hill, New York, third edition, 1974.
[13] D. S. Moore and G. S. McCabe. Introduction to the Practice of Statistics. W. H. Freeman & Company Limited, Oxford, UK, third edition, 1998.
[14] S. C. Narula and J. F. Wellington. Prediction, linear regression and minimum sum of relative errors. Technometrics, 19:185–190, 1977.
[15] O.P.C.S. 1993 Mortality Statistics, volume 20 of DH2. Her Majesty's Stationery Office, London, 1995.
[16] J. F. Osborn. Statistical Exercises in Medical Research. Blackwell Scientific Publications, Oxford, UK, 1979.
[17] J. A. Rice. Mathematical Statistics and Data Analysis. Wadsworth, Pacific Grove, CA, second edition, 1995.
[18] P. Sprent. Data Driven Statistical Methods. Chapman and Hall, London, 1998.
[19] S. Weisberg. Applied Linear Regression. John Wiley and Sons, New York, 1980.