Regression Analysis
Demetris Athienitis
Department of Statistics,
University of Florida
Contents

0 Review
  0.1 Random Variables and Probability Distributions
    0.1.1 Expected value and variance
    0.1.2 Covariance
    0.1.3 Mean and variance of linear combinations
  0.2 Central Limit Theorem
  0.3 Inference for Population Mean
    0.3.1 Confidence intervals
    0.3.2 Hypothesis tests
  0.4 Inference for Two Population Means
    0.4.1 Independent samples
    0.4.2 Paired data

1 Simple Linear Regression
  1.1 Model
  1.2 Parameter Estimation
    1.2.1 Regression function
    1.2.2 Variance

2 Inferences in Regression
  2.1 Inferences concerning β0 and β1
  2.2 Inferences involving E(Y) and Ŷpred
    2.2.1 Confidence interval on the mean response
    2.2.2 Prediction interval
    2.2.3 Confidence Band for Regression Line
  2.3 Analysis of Variance Approach
    2.3.1 F-test for β1
    2.3.2 Goodness of fit
  2.4 Normal Correlation Models

3 Diagnostics and Remedial Measures
  3.1 Diagnostics for Predictor Variable
  3.2 Checking Assumptions
    3.2.1 Graphical methods
    3.2.2 Significance tests
  3.3 Remedial Measures
    3.3.1 Box-Cox (Power) transformation
    3.3.2 Lowess (smoothed) plots

4 Simultaneous Inference and Other Topics
  4.1 Controlling the Error Rate
    4.1.1 Simultaneous estimation of mean responses
    4.1.2 Simultaneous predictions
  4.2 Regression Through the Origin
  4.3 Measurement Errors
    4.3.1 Measurement error in the dependent variable
    4.3.2 Measurement error in the independent variable
  4.4 Inverse Prediction
  4.5 Choice of Predictor Levels

5 Matrix Approach to Simple Linear Regression
  5.1 Special Types of Matrices
  5.2 Basic Matrix Operations
    5.2.1 Addition and subtraction
    5.2.2 Multiplication
  5.3 Linear Dependence and Rank
  5.4 Matrix Inverse
  5.5 Useful Matrix Results
  5.6 Random Vectors and Matrices
    5.6.1 Mean and variance of linear functions of random vectors
    5.6.2 Multivariate normal distribution
  5.7 Estimation and Inference in Regression
    5.7.1 Estimating parameters by least squares
    5.7.2 Fitted values and residuals
    5.7.3 Analysis of variance
    5.7.4 Inference

6 Multiple Regression I
  6.1 Model
  6.2 Special Types of Variables
  6.3 Matrix Form

7 Multiple Regression II
  7.1 Extra Sums of Squares
    7.1.1 Definition and decompositions
    7.1.2 Inference with extra sums of squares
  7.2 Other Linear Tests
  7.3 Coefficient of Partial Determination
  7.4 Standardized Regression Model
  7.5 Multicollinearity

9 Model Selection and Validation
  9.1 Data Collection Strategies
  9.2 Reduction of Explanatory Variables
  9.3 Model Selection Criteria
  9.4 Regression Model Building
    9.4.1 Backward elimination
    9.4.2 Forward selection
    9.4.3 Stepwise regression
  9.5 Model Validation

10 Diagnostics
  10.1 Outlying Y observations
  10.2 Outlying X-Cases
  10.3 Influential Cases
    10.3.1 Fitted values
    10.3.2 Regression coefficients
  10.4 Multicollinearity

11 Remedial Measures

12 Autocorrelation in Time Series
Chapter 0
Review
In regression the emphasis is on finding links/associations between two or
more variables. For two variables a scatterplot can help in visualizing the
association.
Example 0.1. A small study with 7 subjects on the pharmacodynamics of LSD,
examining how LSD tissue concentration affects the subjects' math scores,
yielded the following data.
Score  78.93  58.20  67.47  37.47  45.65  32.92  29.97
Conc.   1.17   2.97   3.26   4.69   5.83   6.00   6.41

Table 1: Math score with LSD tissue concentration
Figure 1: Scatterplot of Math score vs. LSD tissue concentration
http://www.stat.ufl.edu/~athienit/STA4210/scatterplot.R
Before we begin we will need to grasp some basic concepts.
0.1 Random Variables and Probability Distributions
Definition 0.1. A random variable is a function that assigns a numerical
value to each outcome of an experiment. It is a measurable function from a
probability space into a measurable space known as the state space.
It is an outcome characteristic that is unknown prior to the experiment.
For example, an experiment may consist of tossing two dice. One potential random variable could be the sum of the outcome of the two dice, i.e.
X= sum of two dice. Then, X is a random variable. Another experiment
could consist of applying different amounts of a chemical agent and a potential random variable could consist of measuring the amount of final product
created in grams.
Quantitative random variables can either be discrete, in which case they have
a countable set of possible values, or continuous, in which case the set of
possible values is uncountably infinite.
Notation: For a discrete random variable (r.v.) X, the probability distribution is the probability of a certain outcome occurring, denoted as
P (X = x) = pX (x).
This is also called the probability mass function (p.m.f.).
Notation: For a continuous random variable (r.v.) X, the probability density function (p.d.f.), denoted by fX (x), models the relative frequency of X.
Since there are infinitely many outcomes within an interval, the probability
evaluated at a singularity is always zero, e.g. P (X = x) = 0, ∀x, X being a
continuous r.v.
Conditions for a function to be a:
• p.m.f.: 0 ≤ p(x) ≤ 1 and Σ∀x p(x) = 1
• p.d.f.: f(x) ≥ 0 and ∫_{−∞}^{∞} f(x) dx = 1
Example 0.2. (Discrete) Suppose a storage tray contains 10 circuit boards,
of which 6 are type A and 4 are type B, but they both appear similar. An
inspector selects 2 boards for inspection. He is interested in X = number of
type A boards. What is the probability distribution of X?
The sample space of X is {0, 1, 2}. We can calculate the following:
p(2) = P (A on first)P (A on second|A on first)
= (6/10)(5/9) = 0.3333
p(1) = P (A on first)P (B on second|A on first)
+ P (B on first)P (A on second|B on first)
= (6/10)(4/9) + (4/10)(6/9) = 0.5333
p(0) = P (B on first)P (B on second|B on first)
= (4/10)(3/9) = 0.1334
Consequently,
X = x   0       1       2       Total
p(x)    0.1334  0.5333  0.3333  1.0

Table 2: Probability Distribution of X
Example 0.3. (Continuous) The lifetime of a certain battery (in hundreds of hours) has a distribution that can be approximated by f(x) = 0.5e^(−0.5x), x > 0.

Figure 2: Probability density function of battery lifetime.
Normal
The normal distribution (Gaussian distribution) is by far the most important
distribution in statistics. The normal distribution is identified by a location
parameter µ and a scale parameter σ 2 (> 0). A normal r.v. X is denoted as
X ∼ N(µ, σ 2 ) with p.d.f.
f(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)),   −∞ < x < ∞
Figure 3: Density function of N(0, 1).
It is symmetric, unimodal, bell shaped with E(X) = µ and V (X) = σ 2 .
Notation: A normal random variable with mean 0 and variance 1 is called a
standard normal r.v. It is usually denoted by Z ∼ N(0, 1). The c.d.f. of a
standard normal is given at the end of the textbook, is available online, but
most importantly has a built in function in software. Note that probabilities,
which can be expressed in terms of c.d.f, can be conveniently obtained.
Example 0.4. Find P (−2.34 < Z < −1). From the relevant remark,
P (−2.34 < Z < −1) = P (Z < −1) − P (Z < −2.34)
= 0.1587 − 0.0096
= 0.1491
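In R, for instance, the same probability is obtained from the built-in c.d.f. (a one-line sketch):
pnorm(-1) - pnorm(-2.34)   # 0.1491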
Notation: You may recall that ∫ f(t) dt arises as the limit of Σ f(ti)∆i. Hence,
for the following definitions and expressions we will only be using the integral
notation for continuous variables; wherever you see "∫" simply replace it with
"Σ" for the discrete case.
0.1.1 Expected value and variance
The expected value of a r.v. is thought of as the long term average for that
variable. Similarly, the variance is thought of as the long term average of the
squared deviations of the r.v. from its expected value.
Definition 0.2. The expected value (or mean) of a r.v. X is
µX := E(X) = ∫_{−∞}^{∞} x f(x) dx   (= Σ∀x x p(x) in the discrete case).
In actuality, this definition is a special case of a much broader statement.
Definition 0.3. The expected value (or mean) of a function h(·) of a r.v. X is
E(h(X)) = ∫_{−∞}^{∞} h(x) f(x) dx.
Due to this last definition, if the function h performs a simple linear
transformation, such as h(t) = at + b, for constants a and b, then
E(aX + b) = ∫ (ax + b) f(x) dx = a ∫ x f(x) dx + b ∫ f(x) dx = aE(X) + b.
Example 0.5. Referring back to Example 0.2, the expected value of the
number of type A boards (X) is
E(X) = Σ∀x x p(x) = 0(0.1334) + 1(0.5333) + 2(0.3333) = 1.1999.
We can also calculate the expected value of (i) 5X + 3 and (ii) 3X².
(i) 5(1.1999) + 3 = 8.9995.
(ii) 3(0²)(0.1334) + 3(1²)(0.5333) + 3(2²)(0.3333) = 5.5995
Definition 0.4. The variance of a r.v. X is
σ²X := V(X) = E[(X − µX)²]
  = ∫ (x − µX)² f(x) dx
  = ∫ (x² − 2xµX + µ²X) f(x) dx
  = ∫ x² f(x) dx − 2µX ∫ x f(x) dx + µ²X ∫ f(x) dx
  = E(X²) − 2E²(X) + E²(X)
  = E(X²) − E²(X)
Example 0.6. This refers to Example 0.2. We know that E(X) = 1.1999
and E(X²) = 0²(0.1334) + 1²(0.5333) + 2²(0.3333) = 1.8665. Thus,
V(X) = E(X²) − E²(X) = 1.8665 − 1.1999² = 0.42674
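A minimal R sketch of these finite-sum calculations (Examples 0.5 and 0.6), using the p.m.f. from Table 2:
x <- 0:2
p <- c(0.1334, 0.5333, 0.3333)
sum(x * p)                    # E(X)      = 1.1999
sum(x^2 * p)                  # E(X^2)    = 1.8665
sum(x^2 * p) - sum(x * p)^2   # V(X)      = 0.42674
sum((5 * x + 3) * p)          # E(5X + 3) = 8.9995
sum(3 * x^2 * p)              # E(3X^2)   = 5.5995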
Example 0.7. This refers to Example 0.3. If we were to do this by hand we
would need to do integration by parts (multiple times). However, we can use
software such as Wolfram Alpha.
1. Find E(X), so in Wolfram Alpha simply input:
integrate x*0.5*e^(-0.5*x) dx from 0 to infinity
So E(X) = 2.
2. Find E(X 2 ), so input:
integrate x^2*0.5*e^(-0.5*x) dx from 0 to infinity
So, E(X 2 ) = 8.
3. V(X) = E(X²) − E²(X) = 8 − 2² = 4.
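Equivalently, the integrals can be evaluated numerically in base R (a sketch; integrate() returns a numerical approximation):
f <- function(x) 0.5 * exp(-0.5 * x)                     # p.d.f. from Example 0.3
EX  <- integrate(function(x) x   * f(x), 0, Inf)$value   # E(X)   = 2
EX2 <- integrate(function(x) x^2 * f(x), 0, Inf)$value   # E(X^2) = 8
EX2 - EX^2                                               # V(X)   = 4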
Definition 0.5. The variance of a function h of a r.v. X is
V(h(X)) = ∫ [h(x) − E(h(X))]² f(x) dx = E(h²(X)) − E²(h(X)).
Notice that if h stands for a linear transformation function then,
V(aX + b) = E[(aX + b − E(aX + b))²] = a² E[(X − E(X))²] = a² V(X).
If Z is standard normal then it has mean 0 and variance 1. Now if we
take a linear transformation of Z, say X = aZ + b, then
E(X) = E(aZ + b) = aE(Z) + b = b
and
V (X) = V (aZ + b) = a2 V (Z) = a2 .
This fact together with the following proposition allows us to express any
normal r.v. as a linear transformation of the standard normal r.v. Z by
setting a = σ and b = µ.
Proposition 0.1. The r.v. X that is expressed as the linear transformation
σZ + µ is also a normal r.v., with E(X) = µ and V(X) = σ².
Linear transformations are completely reversible, so given a normal r.v.
X with mean µ and variance σ 2 we can revert back to a standard normal by
Z = (X − µ)/σ.
As a consequence any probability statements made about an arbitrary normal
r.v. can be reverted to statements about a standard normal r.v.
Example 0.8. Let X ∼ N(15, 7). Find P (13.4 < X < 19.0).
We begin by noting
P(13.4 < X < 19.0) = P((13.4 − 15)/√7 < (X − 15)/√7 < (19.0 − 15)/√7)
  = P(−0.6047 < Z < 1.5119)
  = P(Z < 1.5119) − P(Z < −0.6047)
  = 0.6620312
If one is using a computer there is no need to revert back and forth from a
standard normal, but it is always useful to standardize concepts. You could
find the answer by using
pnorm(1.5119)-pnorm(-0.6047)
or
pnorm(19,15,sqrt(7))-pnorm(13.4,15,sqrt(7))
Example 0.9. The height of males in inches is assumed to be normally distributed with mean of 69.1 and standard deviation 2.6. Let X ∼ N(69.1, 2.62 ).
Find the 90th percentile for the height of males.
Figure 4: N(69.1, 2.6²) distribution
First we find the 90th percentile of the standard normal which is qnorm(0.9)=
1.281552. Then we transform to
2.6(1.281552) + 69.1 = 72.43204.
Or, just input into R: qnorm(0.9,69.1,2.6).
0.1.2 Covariance
The population covariance is a measure of the strength of the linear
relationship between two variables.
Definition 0.6. Let X and Y be two r.vs. The population covariance of X
and Y is
Cov(X, Y ) = E [(X − E(X)) (Y − E(Y ))]
= E(XY ) − E(X)E(Y )
Remark 0.1. If X and Y are independent, then
E(XY) = ∫∫ xy f(x, y) dx dy = ∫∫ xy fX(x) fY(y) dx dy = (∫ x fX(x) dx)(∫ y fY(y) dy) = E(X)E(Y),
and consequently Cov(X, Y) = 0. This is because under independence
f(x, y) = fX(x)fY(y). However, the converse is not true. Think of X and Y
constrained to lie on a circle, so that X² + Y² = 1. Obviously, X and Y are
dependent but they have no linear relationship. Hence, Cov(X, Y) = 0.
The covariance is not unitless, so a measure called the population correlation,
ρXY = Cov(X, Y) / (√V(X) √V(Y)),
is used to describe the strength of the linear relationship; it is
• unitless
• ranges from −1 to 1
A negative relationship implies a negative covariance and consequently a
negative correlation.
Moving from the population parameters to the sample statistics, the sample
covariance and sample correlation are computed as
σ̂XY := Ĉov(X, Y) = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) = (1/(n − 1)) [(Σ_{i=1}^{n} xi yi) − n x̄ ȳ].
Therefore,
rXY := ρ̂XY = [(Σ_{i=1}^{n} xi yi) − n x̄ ȳ] / [(n − 1) sX sY].   (1)
Example 0.10. Let’s assume that we want to look at the relationship between two variables, height (in inches) and self esteem for 20 individuals.
Height  68  71  62  75  58  60  67  68  71  69  68  67  63  62  60  63  65  67  63  61
Esteem 4.1 4.6 3.8 4.4 3.2 3.1 3.8 4.1 4.3 3.7 3.5 3.2 3.7 3.3 3.4 4.0 4.1 3.8 3.4 3.6

Table 3: Height to self esteem data

Hence,
rXY = [4937.6 − 20(65.4)(3.755)] / [19(4.406)(0.426)] = 0.731,
so there is a moderate to strong positive linear relationship.
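A minimal R sketch of the same computation, typing in the pairs from Table 3:
height <- c(68, 71, 62, 75, 58, 60, 67, 68, 71, 69,
            68, 67, 63, 62, 60, 63, 65, 67, 63, 61)
esteem <- c(4.1, 4.6, 3.8, 4.4, 3.2, 3.1, 3.8, 4.1, 4.3, 3.7,
            3.5, 3.2, 3.7, 3.3, 3.4, 4.0, 4.1, 3.8, 3.4, 3.6)
cov(height, esteem)   # sample covariance
cor(height, esteem)   # r = 0.73, a moderate to strong positive linear relationship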
0.1.3 Mean and variance of linear combinations
Let X and Y be two r.vs. Then for (aX + b) + (cY + d), for constants a, b, c and d,
E(aX + b + cY + d) = aE(X) + cE(Y) + b + d
V(aX + b + cY + d) = Cov(aX, aX) + Cov(cY, cY) + Cov(aX, cY) + Cov(cY, aX)
  = a²V(X) + c²V(Y) + 2ac Cov(X, Y)
Example 0.11. Let X be a r.v. with E(X) = 3 and V (X) = 2, and Y be
another r.v. independent of X with E(Y ) = −5 and V (Y ) = 6. Then,
E(X − 2Y ) = E(X) − 2E(Y ) = 3 − 2(−5) = 13
and
V (X − 2Y ) = V (X) + 4V (Y ) = 2 + 4(6) = 26
Now we extend these two concepts to more than two r.vs. Let X1, . . . , Xn
be a sequence of r.vs and a1, . . . , an a sequence of constants. Then the r.v.
Σ_{i=1}^{n} ai Xi has mean and variance
E(Σ_{i=1}^{n} ai Xi) = Σ_{i=1}^{n} ai E(Xi)
and
V(Σ_{i=1}^{n} ai Xi) = Σ_{i=1}^{n} Σ_{j=1}^{n} ai aj Cov(Xi, Xj)   (2)
  = Σ_{i=1}^{n} ai² V(Xi) + 2 ΣΣ_{i<j} ai aj Cov(Xi, Xj)   (3)
Example 0.12. Assume a random sample, i.e. independent identically
distributed (i.i.d.) r.vs X1, . . . , Xn, is to be obtained, and of interest will
be the specific linear combination corresponding to the sample mean
X̄ = (1/n) Σ_{i=1}^{n} Xi. Since the r.vs are i.i.d., let E(Xi) = µ and V(Xi) = σ²
∀i = 1, . . . , n. Then,
E((1/n) Σ_{i=1}^{n} Xi) = (1/n) Σ_{i=1}^{n} E(Xi) = (1/n) nµ = µ
and, by independence,
V((1/n) Σ_{i=1}^{n} Xi) = (1/n²) Σ_{i=1}^{n} V(Xi) = (1/n²) nσ² = σ²/n.
Remark 0.2. As the sample size increases, the variance of the sample mean
decreases with limn→∞ V (X̄) = 0.
A very useful theorem (whose proof is beyond the scope of this class) is the
following.
Proposition 0.2. A linear combination of (independent) normal random
variables is a normal random variable.
0.2 Central Limit Theorem
The Central Limit Theorem (C.L.T.) is a powerful statement concerning
the mean of a random sample. There are three versions, the classical, the
Lyapunov and the Lindeberg, but in effect they all make the same statement:
the asymptotic distribution of the sample mean X̄ is normal, irrespective
of the distribution of the individual r.vs X1, . . . , Xn.
Proposition 0.3. (Central Limit Theorem)
Let X1, . . . , Xn be a random sample, i.e. i.i.d., with E(Xi) = µ < ∞ and
V(Xi) = σ² < ∞. Then, for X̄ = (1/n) Σ_{i=1}^{n} Xi,
(X̄ − µ)/(σ/√n) →d N(0, 1)   as n → ∞.
Although the central limit theorem is an asymptotic statement, i.e. as the
sample size goes to infinity, we can in practice implement it for sufficiently
large sample sizes n > 30 as the distribution of X̄ will be approximately
normal with mean and variance derived from Example 0.12.
X̄ approx.∼ N(µ, σ²/n)
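A small simulation sketch of this approximation (not from the notes): sample means of n = 30 draws from a clearly non-normal (exponential) distribution already look roughly normal with the stated mean and variance.
set.seed(1)
n <- 30
xbar <- replicate(5000, mean(rexp(n, rate = 0.5)))   # 5000 simulated sample means
c(mean(xbar), var(xbar))   # close to mu = 2 and sigma^2/n = 4/30
hist(xbar, breaks = 40)    # roughly bell shaped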
0.3 Inference for Population Mean
When a population parameter is estimated by a sample statistic such as
µ̂ = x̄, the sample statistic is a point estimate of the parameter. Due to
sampling variability the point estimate will vary from sample to sample.
The fact that the sample estimate is not 100% accurate has to be taken into
account.
0.3.1 Confidence intervals
An alternative or complementary approach is to report an interval of plausible
values based on the point estimate sample statistic and its standard deviation
(a.k.a. standard error). A confidence interval (C.I.) is calculated by first
selecting the confidence level, the degree of reliability of the interval. A
100(1 − α)% C.I. means that the method by which the interval is calculated
will contain the true population parameter 100(1 − α)% of the time. That
is, if a sample is replicated multiple times, the proportion of times that the
C.I. will not contain the population parameter is α.
For example, assume that we know the (in practice unknown) population
parameter µ is 0 and from multiple samples, multiple C.Is are created.
Figure 5: Multiple confidence intervals from different samples
Known population variance
Let X1 , . . . , Xn be i.i.d. from some distribution with finite unknown mean µ
and known variance σ 2 . The methodology will require that X̄ ∼ N(µ, σ 2 /n).
This can occur in the following ways:
• X1 , . . . , Xn be i.i.d. from a normal distribution, so that by Proposition
0.2, X̄ ∼ N(µ, σ 2 /n)
• n > 30 and the C.L.T. is invoked.
Let zc stand for the value of Z ∼ N(0, 1) such that P (Z ≤ zc ) = c.
Due to the symmetry of the normal distribution, z1−α/2 = |zα/2| and
zα/2 = −z1−α/2.
Note: Some books may define zc such that P(Z > zc) = c, i.e. c referring to
the area to the right.
Hence, the proportion of C.I.s containing the population parameter is
1 − α = P(−z1−α/2 < (X̄ − µ)/(σ/√n) < z1−α/2)
      = P(X̄ − z1−α/2 σ/√n < µ < X̄ + z1−α/2 σ/√n),   (4)
and the probability that (in the long run) the random C.I.,
X̄ ∓ z1−α/2 σ/√n,
contains the true value of µ is 1 − α. When a C.I. is constructed from a
single sample we can no longer talk about a probability as there is no long
run temporal concept but we can say that we are 100(1 − α)% confident that
the methodology by which the interval was contrived will contain the true
population parameter.
Example 0.13. A forester wishes to estimate the average number of count
trees per acre on a plantation. The variance is assumed to be known as 12.1.
A random sample of n = 50 one acre plots yields a sample mean of 27.3.
A 95% C.I. for the true mean is then
27.3 ∓ z1−0.025 √(12.1/50) = 27.3 ∓ 1.96 √(12.1/50) → (26.33581, 28.26419)
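In R, a sketch of the same interval from the summary statistics:
27.3 + c(-1, 1) * qnorm(0.975) * sqrt(12.1 / 50)   # approximately (26.34, 28.26)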
Unknown population variance
In practice the population variance is unknown, that is σ is unknown. A
large sample size implies that the sample variance s2 is a good estimate for σ 2
and you will find that many simply replace it in the C.I. calculation. However,
there is a technically “correct” procedure for when variance is unknown.
Note that s2 is calculated from data, so just like x̄, there is a corresponding random variable S 2 to denote the theoretical properties of the sample
variance. In higher level statistics the distribution of S 2 is found, as once
again, it is a statistic that depends on the random variables X1 , . . . , Xn . It
is shown that
(X̄ − µ)/(S/√n) ∼ tn−1   (5)
where tn−1 stands for Student’s-t distribution with parameter degrees of freedom ν = n−1. A Student’s-t distribution is “similar” to the standard normal
except that it places more “weight” to extreme values as seen in Figure 6.
Figure 6: Standard normal and t4 probability density functions
It is important to note that Student’s-t is not just “similar” to the standard normal but asymptotically (as n → ∞) is the standard normal. One
just needs to view the t-table to see that under infinite degrees of freedom the
values in the table are exactly the same as the ones found for the standard
normal. Intuitively then, using Student’s-t when σ 2 is unknown makes sense
as it adds more probability to extreme values due to the uncertainty placed
by estimating σ 2 .
The 100(1 − α)% C.I. for µ is then
x̄ ∓ t1−α/2,n−1 s/√n.   (6)
Example 0.14. In a packaging plant, the sample mean and standard deviation for the fill weight of 100 boxes are x̄ = 12.05 and s = 0.1. The 95% C.I.
for the mean fill weight of the boxes is
12.05 ∓ t1−0.025,99 (0.1/√100) = 12.05 ∓ 1.984(0.01) → (12.03016, 12.06984),   (7)
Remark 0.3. If we wanted a 90% C.I. we would simply replace t1−0.05/2,99
with t1−0.10/2,99 = 1.660, which leads to a C.I. of (12.0334, 12.0666), a
narrower interval. Thus, as α ↑, 100(1 − α)% ↓, which implies a narrower
interval.
Example 0.15. Suppose that a sample of 36 resistors is taken with x̄ = 10
and s2 = 0.7. A 95% C.I. for µ is
10 ∓ t1−0.025,35 √(0.7/36) = 10 ∓ 2.03 √(0.7/36) → (9.71693, 10.28307)
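A sketch of the t-based intervals of Examples 0.14 and 0.15 from their summary statistics:
12.05 + c(-1, 1) * qt(0.975, df = 99) * 0.1 / sqrt(100)   # Example 0.14
10    + c(-1, 1) * qt(0.975, df = 35) * sqrt(0.7 / 36)    # Example 0.15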
Remark 0.4. So far we have only discussed two-sided confidence intervals,
as in equation (4). However, one-sided confidence intervals might be more
appropriate in certain circumstances. For example, when one is interested
in the minimum breaking strength, or the maximum current in a circuit. In
these instances we are not interested in an upper and lower limit but only in
a lower or only in an upper limit. Then we simply replace zα/2 or tα/2,n−1 by
zα or tα,n−1, e.g. a 100(1 − α)% C.I. for µ is
(x̄ − t1−α,n−1 s/√n, ∞)   or   (−∞, x̄ + t1−α,n−1 s/√n).

0.3.2 Hypothesis tests
A statistical hypothesis is a claim about a population characteristic (and on
occasion more than one). An example of a hypothesis is the claim that the
population is some value, e.g. µ = 0.75.
Definition 0.7. The null hypothesis, denoted by H0 , is the hypothesis that
is initially assumed to be true.
The alternative hypothesis, denoted by Ha or H1 , is the complementary
assertion to H0 and is usually the hypothesis, the new statement that we
wish to test.
A test procedure is created under the assumption of H0 and then it is
determined how likely that assumption is compared to its complement Ha .
The decision will be based on
• Test statistic, a function of the sampled data.
• Rejection region/criteria, the set of all test statistic values for which
H0 will be rejected.
The basis for choosing a particular rejection region lies in an understanding
of the errors that can be made.
Definition 0.8. A type I error consists of rejecting H0 when it is actually
true.
A type II error consists of failing to reject H0 when in actuality H0 is
false.
The type I error is generally considered to be the most serious one, and
due to limitations, we can only control for one, so the rejection region is
chosen based upon the maximum P (type I error) = α that a researcher is
willing to accept.
Known population variance
We motivate the test procedure by an example whereby the drying time
of a certain type of paint, under fixed environmental conditions, is known
to be normally distributed with mean 75 min. and standard deviation 9
min. Chemists have added a new additive that is believed to decrease drying
time and have obtained a sample of 35 drying times and wish to test their
assertion. Hence,
H0 : µ ≥ 75 (or µ = 75)
Ha : µ < 75
Since we wish to control for the type I error, we set P (type I error) = α.
The default value of α is usually taken to be 5%.
An obvious candidate for a test statistic, that is an unbiased estimator
of the population mean, is X̄ which is normally distributed. If the data
were not known to be normally distributed the normality of X̄ can also be
confirmed by the C.L.T. Thus, under the null assumption H0
X̄ ∼H0 N(75, 9²/35),
or equivalently
(X̄ − 75)/(9/√35) ∼H0 N(0, 1).
The test statistic will be
T.S. = (x̄ − 75)/(9/√35),
and assuming that x̄ = 70.8 from the 35 samples, then, T.S. = −2.76. This
implies that 70.8 is 2.76 standard deviations below 75. Although this appears
to be far, we need to use the p-value to reach a formal conclusion.
Definition 0.9. The p-value of a hypothesis test is the probability of observing the specific value of the test statistic, T.S., or a more extreme value,
under the null hypothesis. The direction of the extreme values is indicated
by the alternative hypothesis.
Therefore, in this example values more extreme than -2.76 are
{x|x ≤ −2.76},
as indicated by the alternative, Ha : µ < 75. Thus,
p-value = P (Z ≤ −2.76) = 0.0029.
The criterion for rejecting the null is p-value < α. Here the null hypothesis is
rejected in favor of the alternative hypothesis, as the probability of observing
the test statistic value of −2.76 or more extreme (as indicated by Ha) is smaller
than the probability of the type I error we are willing to undertake.
Figure 7: Rejection region and p-value.
If we can assume that X̄ is normally distributed and σ² is known then, to test
(i) H0 : µ ≤ µ0 vs Ha : µ > µ0
(ii) H0 : µ ≥ µ0 vs Ha : µ < µ0
(iii) H0 : µ = µ0 vs Ha : µ ≠ µ0
at the α significance level, compute the test statistic
T.S. = (x̄ − µ0)/(σ/√n).   (8)
Reject the null if the p-value < α, i.e.
(i) P (Z ≥ T.S.) < α (area to the right of T.S. < α)
(ii) P (Z ≤ T.S.) < α (area to the left of T.S. < α)
(iii) P (|Z| ≥ |T.S.|) < α (area to the right of |T.S.| plus area to the left of
−|T.S.| < α)
Example 0.16. A scale is to be calibrated by weighing a 1000g weight 60
times. From the sample we obtain x̄ = 1000.6 and s = 2. Test whether the
scale is calibrated correctly.
H0 : µ = 1000 vs Ha : µ ≠ 1000
T.S. = (1000.6 − 1000)/(2/√60) = 2.32379
Hence, the p-value is 0.02013675, and we reject the null hypothesis and conclude that the true mean is not 1000.
Figure 8: p-value.
Since 1000.6 is 2.32379 standard deviations greater than 1000, we can
conclude that not only is the true mean not 1000, but it is greater than 1000.
Example 0.17. A company representative claims that the number of calls
arriving at their center is no more than 15/week. To investigate the claim, 36
random weeks were selected from the company’s records with a sample mean
of 17 and sample standard deviation of 3. Do the sample data contradict
this statement?
First we begin by stating the hypotheses
H0 : µ ≤ 15 vs Ha : µ > 15
The test statistic is
T.S. = (17 − 15)/(3/√36) = 4
The conclusion is that there is significant evidence to reject H0, as the
p-value (the area to the right of 4 under the standard normal) is very close to 0.
Unknown population variance
If σ is unknown, which is usually the case, we replace it by its sample
estimate s. Consequently,
(X̄ − µ0)/(S/√n) ∼H0 tn−1,
and for an observed value X̄ = x̄, the test statistic becomes
T.S. = (x̄ − µ0)/(s/√n).
At the α significance level, for the same hypothesis tests as before, we reject
H0 if
(i) p-value= P (tn−1 ≥ T.S.) < α
(ii) p-value= P (tn−1 ≤ T.S.) < α
(iii) p-value= P (|tn−1 | ≥ |T.S.|) < α
Example 0.18. In an ergonomic study, 5 subjects were chosen to study the
maximum acceptable weight of lift (MAWL) for a frequency of 4 lifts/min. Assuming the
MAWL values are normally distributed, do the following data suggest that
the population mean of MAWL exceeds 25?
25.8, 36.6, 26.3, 21.8, 27.2
H0 : µ ≤ 25 vs Ha : µ > 25
T.S. = (27.54 − 25)/(5.47/√5) = 1.03832
The p-value is the area to the right of 1.03832 under the t4 distribution,
which is 0.1788813. Hence, we fail to reject the null hypothesis. In R input:
t.test(c(25.8, 36.6, 26.3, 21.8, 27.2),mu=25,alternative="greater")
Remark 0.5. The values contained within a two-sided 100(1 − α)% C.I. are
precisely those values that, when used in the null hypothesis, result in the
p-value of a two-sided hypothesis test being greater than α.
For the one-sided case, an interval that only uses the
• upper limit, contains precisely those values for which the p-value of
a one-sided hypothesis test, with alternative less than, will be greater
than α.
• lower limit, contains precisely those values for which the p-value of a
one-sided hypothesis test, with alternative greater than, will be greater
than α.
Example 0.19. The lifetime of a single-cell organism is believed to be on
average 257 hours. A small preliminary study was conducted to test whether
the average lifetime was different when the organism was placed in a certain
medium. The measurements are assumed to be normally distributed and
turned out to be 253, 261, 258, 255, and 256. The hypothesis test is
H0 : µ = 257 vs. Ha : µ ≠ 257
With x̄ = 256.6 and s = 3.05, the test statistic value is
T.S. = (256.6 − 257)/(3.05/√5) = −0.293.
The p-value is P(t4 < −0.293) + P(t4 > 0.293) = 0.7839. Hence, since the
p-value is large (> 0.05) we fail to reject H0 and conclude that the population
mean is not statistically different from 257.
Instead of a hypothesis test, if a two-sided 95% C.I. was constructed,
256.6 ∓ t1−0.025,4 (3.05/√5) = 256.6 ∓ 2.776(3.05/√5) → (252.81, 260.39),
it is clear that the null hypothesis value of µ = 257 is a plausible value and
consequently H0 is plausible, so it is not rejected.
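Since the five measurements are given, the test and the interval can also be reproduced directly in R (a sketch):
x <- c(253, 261, 258, 255, 256)
t.test(x, mu = 257)   # two-sided t-test: p-value = 0.78 and 95% CI (252.8, 260.4), approximately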
0.4 Inference for Two Population Means
0.4.1 Independent samples
There are instances when a C.I. for the difference between two means is of
interest when one wishes to compare the sample mean from one population
to the sample mean of another.
Known population variances
Let X1, . . . , XnX and Y1, . . . , YnY represent two independent random samples
with means µX, µY and variances σ²X, σ²Y respectively. Once again the
methodology will require X̄ and Ȳ to be normally distributed. This can occur by:
• X1, . . . , XnX being i.i.d. from a normal distribution, so that by Proposition
0.2, X̄ ∼ N(µX, σ²X/nX)
• nX > 40 and the C.L.T. is invoked.
Similarly for Ȳ. Note that if the C.L.T. is to be invoked we require a more
conservative criterion of nX > 40, nY > 40, as we are using the theorem (and
hence an approximation) twice.
To compare two population means µX and µY we find it easier to work
with a new parameter, the difference µK := µX − µY. Then K := X̄ − Ȳ is a
normal random variable (by Proposition 0.2) with
E(K) = E(X̄ − Ȳ) = µX − µY,
and
V(K) = V(X̄ − Ȳ) = σ²X/nX + σ²Y/nY.
Therefore,
K := X̄ − Ȳ ∼ N(µX − µY, σ²X/nX + σ²Y/nY),
and hence a 100(1 − α)% C.I. for the difference µK = µX − µY is
x̄ − ȳ ∓ z1−α/2 √(σ²X/nX + σ²Y/nY).
Example 0.20. In an experiment, 50 observations of soil NO3 concentration
(mg/L) were taken at each of two (independent) locations X and Y . We have
that x̄ = 88.5, σX = 49.4, ȳ = 110.6 and σY = 51.5. Construct a 95% C.I.
for the difference in means and interpret.
88.5 − 110.6 ∓ 1.96 √(49.4²/50 + 51.5²/50) → (−41.880683, −2.319317)
Note that 0 is not in the interval as a plausible value. This implies that
µX − µY < 0 is plausible. In fact µX is less than µY by at least 2.32 units
and at most 41.88.
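A sketch of the calculation in R:
(88.5 - 110.6) + c(-1, 1) * qnorm(0.975) * sqrt(49.4^2 / 50 + 51.5^2 / 50)   # approx. (-41.88, -2.32)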
Unknown population variances
As in equation (5),
(X̄ − Ȳ − (µX − µY)) / √(s²X/nX + s²Y/nY) ∼ tν
where
ν = (s²X/nX + s²Y/nY)² / [ (s²X/nX)²/(nX − 1) + (s²Y/nY)²/(nY − 1) ].   (9)
Hence the 100(1 − α)% C.I. for µX − µY is
x̄ − ȳ ∓ t1−α/2,ν √(s²X/nX + s²Y/nY).
Example 0.21. Two methods are considered standard practice for surface
hardening. For Method A there were 15 specimens with a mean of 400.9
(N/mm2 ) and standard deviation 10.6. For Method B there were also 15
specimens with a mean of 367.2 and standard deviation 6.1. Assuming the
samples are independent and from a normal distribution the 98% C.I. for
µA − µB is
400.9 − 367.2 ∓ t1−0.01,ν √(10.6²/15 + 6.1²/15)
where
ν = (10.6²/15 + 6.1²/15)² / [ (10.6²/15)²/14 + (6.1²/15)²/14 ] = 22.36,
and hence t1−0.01,22.36 = 2.5052, giving a 98% C.I. for the difference µA − µB
of (25.7892, 41.6108).
Notice that 0 is not in the interval so we can conclude that the two means
are different. In fact the interval is purely positive so we can conclude that
µA is at least 25.7892 N/mm2 larger than µB and at most 41.6108 N/mm2 .
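A sketch of the Welch (Satterthwaite) calculation from the summary statistics:
mA <- 400.9; sA <- 10.6; nA <- 15
mB <- 367.2; sB <-  6.1; nB <- 15
se <- sqrt(sA^2 / nA + sB^2 / nB)
nu <- se^4 / ((sA^2 / nA)^2 / (nA - 1) + (sB^2 / nB)^2 / (nB - 1))   # 22.36
(mA - mB) + c(-1, 1) * qt(0.99, df = nu) * se                        # approx. (25.79, 41.61)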
0.4.2 Paired data
There are instances when two samples are not independent, when a relationship exists between the two. For example, before treatment and after
treatment measurements made on the same experimental subject are dependent on each other through the experimental subject. This is a common event
in clinical studies where the effectiveness of a treatment, that may be quantified by the difference in the before and after measurements, is dependent
upon the individual undergoing the treatment. Then, the data is said to be
paired.
Consider the data in the form of the pairs (X1 , Y1), (X2 , Y2 ), . . . , (Xn , Yn ).
We note that the pairs, i.e. two dimensional vectors, are independent as the
experimental subjects are assumed to be independent with marginal expectations E(Xi ) = µX and E(Yi ) = µY for all i = 1, . . . , n. By defining,
D1 = X1 − Y1,  D2 = X2 − Y2,  . . . ,  Dn = Xn − Yn,
a two sample problem has been reduced to a one sample problem. Inference
for µX − µY is equivalent to one sample inference on µD, as was done in
Section 0.3. This holds since,
µD := E(D̄) = E((1/n) Σ_{i=1}^{n} Di) = E((1/n) Σ_{i=1}^{n} (Xi − Yi)) = E(X̄ − Ȳ) = µX − µY.
In addition we note that the variance of D̄ does incorporate the covariance
between the two samples and does have to be calculated separately as
σ²D := V(D̄) = V((1/n) Σ_{i=1}^{n} Di) = (1/n²) Σ_{i=1}^{n} V(Di) = (σ²X + σ²Y − 2σXY)/n.
Example 0.22. A new and an old type of rubber compound can be used in
tires. A researcher is interested in the compound/type that does not wear
easily. Ten cars were chosen at random to go around a track a predetermined
number of times. Each car did this twice, once for each tire type, and the
depth of the tread was then measured.
Car    1     2     3     4     5     6     7     8     9    10
New  4.35  5.00  4.21  5.03  5.71  4.61  4.70  6.03  3.80  4.70
Old  4.19  4.62  4.04  4.72  5.52  4.26  4.27  6.24  3.46  4.50
D    0.16  0.38  0.17  0.31  0.19  0.35  0.43 -0.21  0.34  0.20

With d̄ = 0.232 and sD = 0.183, and assuming that the data are normally
distributed, a 95% C.I. for µnew − µold = µD is
0.232 ∓ t1−0.025,9 (0.183/√10) = 0.232 ∓ 2.262(0.183/√10) → (0.101, 0.363),
and we note that the interval is strictly greater than 0, implying that the
difference is positive, i.e. that µnew > µold. In fact we can conclude that
µnew is larger than µold by at least 0.101 units and at most 0.363 units.
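Using the data in the table, a paired analysis sketch in R:
new <- c(4.35, 5.00, 4.21, 5.03, 5.71, 4.61, 4.70, 6.03, 3.80, 4.70)
old <- c(4.19, 4.62, 4.04, 4.72, 5.52, 4.26, 4.27, 6.24, 3.46, 4.50)
t.test(new, old, paired = TRUE)   # 95% CI approx. (0.101, 0.363)
t.test(new - old)                 # equivalent one-sample analysis of the differences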
Chapter 1
Simple Linear Regression
In this chapter we hypothesize a linear relationship between the two variables,
estimate and draw inference about the model parameters.
1.1 Model
The simplest deterministic mathematical relationship between two variables x and y is a linear relationship
y = β0 + β1 x,
where the coefficients
• β0 represents the y-axis intercept, the value of y when x = 0,
• β1 represents the slope, interpreted as the amount of change in the
value of y for a 1 unit increase in x.
To this model we add variability by introducing the random variables
ǫi ∼ N(0, σ²), i.i.d. for each observation i = 1, . . . , n. Hence, the statistical
model by which we wish to model one random variable using known values
of some predictor variable becomes
Yi = β0 + β1 xi + ǫi,   i = 1, . . . , n,   (1.1)
where β0 + β1 xi is the systematic component, Yi represents the r.v.
corresponding to the response, i.e. the variable we wish to model, and xi
stands for the observed value of the predictor.
Therefore we have that
Yi ∼ind. N(β0 + β1 xi, σ²).   (1.2)
Notice that the Y's are no longer identically distributed since their mean
depends on the value of xi.

Figure 1.1: Regression model.
Remark 1.1. An alternate form with centered predictor is
Yi = β0 + β1(xi − x̄) + β1 x̄ + ǫi = (β0 + β1 x̄) + β1(xi − x̄) + ǫi = β0⋆ + β1(xi − x̄) + ǫi,
where β0⋆ := β0 + β1 x̄.
In order to fit a regression line one needs to find estimates for the coefficients
β0 and β1 that determine the mean line
ŷi = β̂0 + β̂1 xi.
1.2 Parameter Estimation
1.2.1 Regression function
The goal is to have this line as "close" to the data points as possible. The
concept is to minimize the error from the actual data points to the predicted
points (in the direction of Y, i.e. vertically):
min Σ_{i=1}^{n} (Yi − E(Yi))²  →  min Σ_{i=1}^{n} (Yi − (β0 + β1 xi))².
Hence, the goal is to find the values of β0 and β1 that minimize the sum of
the squared distances between the points and their expected value under the
model. This is done by the following steps:
1. Take the partial derivatives with respect to β0 and β1.
2. Equate the two resulting equations to 0.
3. Solve the simultaneous equations for β0 and β1.
4. (Optional) Take second partial derivatives to show that the solution is in
fact a minimum, not a maximum.
Therefore,
b1 := β̂1 = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)²
  = [(Σ_{i=1}^{n} xi yi) − n x̄ ȳ] / [(Σ_{i=1}^{n} xi²) − n x̄²]
  = Σ_{i=1}^{n} ki yi,   where ki := (xi − x̄)/Σ_{j=1}^{n} (xj − x̄)²,
and
b0 := β̂0 = ȳ − b1 x̄ = Σ_{i=1}^{n} li yi,   where li := 1/n − x̄(xi − x̄)/Σ_{j=1}^{n} (xj − x̄)².   (1.3)
Hence both b1 and b0 are linear estimators, as they are linear combinations
of the responses.
Remark 1.2. Do not extrapolate the model to values of the predictor x that were
not in the data, as it is not clear how the model may behave for other values.
Also, do not fit a linear regression to data that do not appear to be linear.
Definition 1.1. The ith residual is defined to be the difference between the
observed and fitted value of the response for point i.
ei = yi − ŷi
Notable Properties:
• Σ ei = 0
• Σ xi ei = 0
• Σ ŷi ei = 0
• Σ yi = Σ ŷi
• The regression line always goes through (x̄, ȳ)
1.2.2 Variance
The variance term in the model is
σ² = V(ǫ) = E(ǫ²).
Hence, to estimate it, the "sample mean" of the squared residuals e²i seems a
reasonable estimate:
s² = MSE = σ̂² = Σ_{i=1}^{n} (yi − ŷi)²/(n − 2) = Σ_{i=1}^{n} e²i/(n − 2) = SSE/(n − 2),
where MSE stands for Mean Squared Error and SSE for Sum of Squares
Error. Note that in the denominator we have n − 2, as we lose 2 degrees of
freedom since we had to estimate two parameters, β0 and β1, when estimating
our center, ŷi.
Remark 1.3. Estimation of model parameters can also be done via maximum
likelihood, which yields exactly the same estimates of the parameters of the
systematic component, β0 and β1, but the estimate of σ² is slightly biased:
σ̂² = Σ_{i=1}^{n} (yi − ŷi)²/n,  so  MSE = (n/(n − 2)) σ̂².
Example 1.1. Let x be the number of copiers serviced and Y be the time
spent (in minutes) by the technician for a known manufacturer.
Obs          1   2  · · ·  44  45
Time (y)    20  60  · · ·  61  77
Copiers (x)  2   4  · · ·   4   5

Table 1.1: Quantity of copiers and service time
The complete dataset can be found at
http://www.stat.ufl.edu/~athienit/STA4210/Examples/copiers.csv
Figure 1.2: Scatterplot of Time vs Copiers.
The scatterplot shows that there is a strong positive relationship between
the two variables. Below is the R output.
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.5802     2.8039  -0.207    0.837
Copiers      15.0352     0.4831  31.123   <2e-16 ***
---
Residual standard error: 8.914 on 43 degrees of freedom
Multiple R-squared: 0.9575, Adjusted R-squared: 0.9565
F-statistic: 968.7 on 1 and 43 DF,  p-value: < 2.2e-16
http://www.stat.ufl.edu/~athienit/STA4210/Examples/copier.R
The estimated equation is
ŷ = −0.5802 + 15.0352x
We note that the slope b1 = 15.0352 implies that for each unit increase in
copier quantity, the service time increases by 15.0352 minutes (for quantity
values between 1 and 10).
If we wish to estimate the time needed for a service call for 5 copiers that
would be
−0.5802 + 15.0352(5) = 74.5958 minutes
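For completeness, a sketch of the R commands behind this output (the CSV column names Time and Copiers are assumed; the fitted object reg is reused in Chapter 2):
copier <- read.csv("http://www.stat.ufl.edu/~athienit/STA4210/Examples/copiers.csv")
reg <- lm(Time ~ Copiers, data = copier)          # simple linear regression fit
summary(reg)                                      # reproduces the coefficient table above
predict(reg, newdata = data.frame(Copiers = 5))   # estimated mean time for 5 copiers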
Example 1.2. Data on lot size (x) and work hours (y) was obtained from
25 recent runs of a manufacturing process. (See example on page 19 of
textbook). A simple linear regression model was fit in R yielding
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   62.366     26.177   2.382   0.0259 *
lotsize        3.570      0.347  10.290 4.45e-10 ***

Residual standard error: 48.82 on 23 degrees of freedom
Multiple R-squared: 0.8215, Adjusted R-squared: 0.8138
F-statistic: 105.9 on 1 and 23 DF, p-value: 4.449e-10
Figure 1.3: Scatterplot of Work Hours vs Lot Size.
We can obtain the residuals, but note that from their magnitude in hours it
may not be easy to determine whether a value is large or small in the context
of the problem. Later we shall discuss standardized residuals.
> round(resid(toluca.reg),1)
    1     2     3     4     5     6     7     8     9    10    11    12    13
 51.0 -48.5 -19.9  -7.7  48.7 -52.6  55.2   4.0 -66.4 -83.9 -45.2 -60.3   5.3
   14    15    16    17    18    19    20    21    22    23    24    25
-20.8 -20.1   0.6  42.5  27.1  -6.7 -34.1 103.5  84.3  38.8  -6.0  10.7
> round(rstandard(toluca.reg),1)
   1    2    3    4    5    6    7    8    9   10   11   12   13
 1.1 -1.1 -0.4 -0.2  1.0 -1.1  1.2  0.1 -1.4 -1.8 -1.0 -1.3  0.1
  14   15   16   17   18   19   20   21   22   23   24   25
-0.5 -0.4  0.0  0.9  0.6 -0.1 -0.7  2.3  1.8  0.8 -0.1  0.2
Note that the first residual implies that the actual observed value of work
hours was 51 hours greater than the model estimates. However, this difference
is only 1.1 standard deviations.
http://www.stat.ufl.edu/~athienit/STA4210/Examples/toluca.R
Chapter 2
Inferences in Regression
2.1 Inferences concerning β0 and β1

The coefficients b0 and b1 of equation (1.3) are linear combinations of the
responses. Therefore, they have corresponding r.vs B0 and B1 and, since the
Y's are independent normal r.vs (see (1.1)), by Proposition 0.2 they are
themselves normal r.vs. Re-expressing the r.v. B1,
B1 = Σ_{i=1}^{n} (xi − x̄)(Yi − Ȳ) / Σ_{i=1}^{n} (xi − x̄)² = · · · = Σ_{i=1}^{n} ki Yi,
where ki = (xi − x̄)/Σ_{j=1}^{n} (xj − x̄)².
Some notable properties are:
• Σ ki = 0
• Σ ki² = 1/Σ (xi − x̄)²
• Σ ki xi = 1
This implies
E(B1) = Σ_{i=1}^{n} ki E(Yi) = Σ_{i=1}^{n} ki (β0 + β1 xi) = β0 Σ_{i=1}^{n} ki + β1 Σ_{i=1}^{n} ki xi = β1
and
V(B1) = Σ_{i=1}^{n} ki² V(Yi) = σ² Σ_{i=1}^{n} ki² = σ² / Σ_{j=1}^{n} (xj − x̄)².
Thus,
B1 ∼ N(β1, σ² / Σ_{i=1}^{n} (xi − x̄)²).
Remark 2.1. The larger the spread in the values of the predictor, the larger
the Σ_{i=1}^{n} (xi − x̄)² value will be, and hence the smaller the variances of B0
and B1. Also, since the (xi − x̄)² are nonnegative terms, when we have more
data points, i.e. larger n, we are summing more nonnegative terms and
Σ_{i=1}^{n} (xi − x̄)² gets larger.
Remark 2.2. The intercept term is not of much practical importance as it
is the value of the response when the predictor value is 0 and is included to
provide us with a “nice” model whether significant or not. Hence, inference
is omitted. It can be shown, in similar fashion, that
B0 ∼ N(β0, [1/n + x̄²/Σ_{i=1}^{n} (xi − x̄)²] σ²).
Remark 2.3. The r.vs B0 and B1 are not independent and their covariance is not 0:
Cov(B0, B1) = Cov(Σ li Yi, Σ ki Yi) = Σ li ki V(Yi),
since
Cov(li Yi, kj Yj) = li ki V(Yi) if i = j, and 0 if i ≠ j.
In practice, σ² is not known and is replaced by its estimate, MSE. This is a
scenario that we are all too familiar with; similar to equation (5), we use a
Student's t distribution instead of the normal,
(B1 − β1) / (s/√(Σ (xi − x̄)²)) ∼ tn−2.
This is because (not proven in this class)
(n − 2)s²/σ² = SSE/σ² ∼ χ²n−2   (2.1)
is independent of B1, and the ratio of a standard normal to the square root of
an independent chi-square divided by its degrees of freedom has a t-distribution.
Important to note is the fact that the degrees of freedom are n − 2, as 2 were
lost due to the estimation of β0 and β1 in the mean.
Therefore, a 100(1 − α)% C.I. for β1 is
β̂1 ∓ t1−α/2,n−2 sb1,
where sb1 = s/√(Σ_{i=1}^{n} (xi − x̄)²). Similarly, for a null hypothesis value
H0 : β1 = β10, the test statistic is
T.S. = (β̂1 − β10)/sb1 ∼H0 tn−2,
and p-values and conclusions are made in the standard way, see Section 0.3.
We have not yet learned to perform inference on all parameters in the model
Yi ∼ind. N(β0 + β1 xi, σ²). We can perform inference on the parameters
associated with the mean, i.e. β1 (and β0), but not yet σ². From (2.1) we have that
1 − α = P(χ²(α/2,n−2) < SSE/σ² < χ²(1−α/2,n−2)) = P(SSE/χ²(1−α/2,n−2) < σ² < SSE/χ²(α/2,n−2)),
and hence the 100(1 − α)% C.I. for σ² is
(SSE/χ²(1−α/2,n−2), SSE/χ²(α/2,n−2)).   (2.2)
Example 2.1. Back to the copier example 1.1, a 95% C.I. for
• β1 is 15.0352 ∓ t1−0.025,43 (0.4831) = 15.0352 ∓ 2.016692(0.4831) → (14.061010, 16.009486).
• σ² (here using the Toluca fit of Example 1.2, for which SSE = 23(48.82²) with 23 degrees of freedom) is
(23(48.82²)/38.076, 23(48.82²)/11.689).
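A sketch of these intervals in R, assuming the fitted objects reg (Example 1.1) and toluca.reg (Example 1.2) are available:
confint(reg, "Copiers", level = 0.95)                        # C.I. for beta_1
SSE <- sum(resid(toluca.reg)^2)                              # error sum of squares
SSE / qchisq(c(0.975, 0.025), df = toluca.reg$df.residual)   # C.I. for sigma^2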
2.2 Inferences involving E(Y) and Ŷpred
2.2.1 Confidence interval on the mean response
The mean is no longer a constant but is in fact a “mean line”.
µY |X=xobs := E(Y |X = xobs ) = β0 + β1 xobs
Hence, we can create an interval for the mean at a specific value of the
predictor xobs . We simply need to find a statistic to estimate the mean and
find its distribution. The sample statistic used is
ŷ = b0 + b1 xobs
and the corresponding r.v. is
Ŷ = B0 + B1 xobs = Σ_{i=1}^{n} [1/n + (xobs − x̄)(xi − x̄)/Σ_{j=1}^{n} (xj − x̄)²] Yi.   (2.3)
Note that this can be expressed as a linear combination of the independent
normal r.vs Yi whose distribution is known to be normal (equation (1.2)).
Therefore, Ŷ is also a normal r.v. with mean
E(Ŷ) = E(B0) + E(B1)xobs = β0 + β1 xobs
and variance
V(Ŷ) = V(B0 + B1 xobs)
  = V[Ȳ + B1(xobs − x̄)]   since B0 = Ȳ − B1 x̄
  = V(Ȳ) + (xobs − x̄)² V(B1) + 2(xobs − x̄) Cov(Ȳ, B1)
  = σ²/n + (xobs − x̄)² σ²/Σ_{i=1}^{n} (xi − x̄)²,
since Cov(Ȳ, B1) = (1/n)σ² Σ ki = 0. Hence,
Ŷ ∼ N(β0 + β1 xobs, [1/n + (xobs − x̄)²/Σ_{j=1}^{n} (xj − x̄)²] σ²).
Thus, a 100(1 − α)% C.I. for the mean response µY|X=xobs is
ŷ ∓ t1−α/2,n−2 sŶ,   where sŶ = s √(1/n + (xobs − x̄)²/Σ_{j=1}^{n} (xj − x̄)²).
Example 2.2. Refer back to Example 1.1. Assume we are interested in a
95% C.I. for the mean time value when the quantity of copiers is 5.
74.59608 ∓ t1−0.025,43 (1.329831) = 74.59608 ∓ 2.016692(1.329831) → (71.91422, 77.27794)
In R,
> newdata=data.frame(Copiers=5)
> predict.lm(reg,se.fit=TRUE,newdata,interval="confidence",level=0.95)
$fit
fit
lwr
upr
1 74.59608 71.91422 77.27794
$se.fit
[1] 1.329831
$df
[1] 43
2.2.2 Prediction interval
Once a regression model is fitted, after obtaining data (x1 , y1 ), . . . , (xn , yn ),
it may be of interest to predict a future value of the response. From equation
(1.1), we have some idea where this new prediction value will lie, somewhere
around the mean response
β0 + β1 x new
However, according to the model, equation (1.1), we do not expect new
predictions to fall exactly on the mean response, but close to them. Hence,
42
the r.v. corresponding to the statistic we plan to use is the same as equation
(2.3) with the addition of the error term ǫ ∼ N(0, σ 2 )
Ŷpred = B0 + B1 xnew + ǫ.
Therefore,
Ŷpred ∼ N(β0 + β1 xnew, [1 + 1/n + (xnew − x̄)²/Σ_{j=1}^{n} (xj − x̄)²] σ²),
and a 100(1 − α)% prediction interval (P.I.), for a value of the predictor that
is unobserved, i.e. not in the data, is
ŷpred ∓ t1−α/2,n−2 spred,   where spred = s √(1 + 1/n + (xnew − x̄)²/Σ_{j=1}^{n} (xj − x̄)²).
Example 2.3. Refer back to Example 1.1. Let us estimate the future service
time value when copier quantity is 7 and create an interval around it. The
predicted value is
−0.5802 + 15.0352(7) = 104.6666 minutes
and a 95% P.I. around the predicted value is
104.6666 ∓ t1−0.025,43 (9.058051) = 104.6666 ∓ 2.016692(9.058051) → (86.399, 122.9339)
In R
> newdata=data.frame(Copiers=7)
> predict.lm(reg,se.fit=TRUE,newdata,interval="prediction",level=0.95)
$fit
fit
lwr
upr
1 104.6666 86.39922 122.9339
$se.fit
[1] 1.6119
$df
[1] 43
Note that the se.fit provided is the value for the C.I., not the P.I. However,
in the calculation of the P.I. the correct standard error term is used.
http://www.stat.ufl.edu/~athienit
Example 2.4. Also see confidence and prediction intervals for Example 1.2:
http://www.stat.ufl.edu/~athienit/STA4210/Examples/toluca.R
2.2.3 Confidence Band for Regression Line
If we wish to create a simultaneous estimate of the population mean for all
predictor values x, that is, a 100(1 − α)% simultaneous C.I. for β0 + β1 x, we use
ŷ ∓ W sŶ,
known as the Working-Hotelling confidence band, where
W = √(2 F1−α;2,n−2).
Example 2.5. Continuing from example 1.2 (Toluca) we can not only evaluate the band at specific points but at all points and plot it with the script
found in
http://www.stat.ufl.edu/~athienit/STA4210/Examples/toluca.R
CI=predict(toluca.reg,se.fit=TRUE)
W=sqrt(2*qf(0.95,length(toluca.reg$coefficients),toluca.reg$df.residual))
Band=cbind( CI$fit - W * CI$se.fit, CI$fit + W * CI$se.fit )
points(sort(toluca$lotsize), sort(Band[,1]), type="l", lty=2)
points(sort(toluca$lotsize), sort(Band[,2]), type="l", lty=2)
legend("topleft",legend=c("Mean Line","95% CB"),col=c(1,1),
+ lty=c(1,2),bg="gray90")
Figure 2.1: Working-Hotelling 95% confidence band.
2.3 Analysis of Variance Approach
Next we introduce some notation that will be useful in conducting inference
of the model. In order to determine whether a regression model is adequate
we must compare it to the most naive model which uses the sample mean
Ȳ as its prediction, i.e. Ŷ = Ȳ . This model does not take into account any
predictors as the prediction is the same for all values of x. Then, the total
distance of a point yi to the sample mean ȳ can be broken down into two
components, one measuring the error of the model for that point, and one
measuring the “improvement” distance accounted by the regression model.
(yi − ȳ) = (yi − ŷi) + (ŷi − ȳ),   i.e.   Total = Error + Regression.
Looking back at Figure 1.1 and singling out a point we have that,
Figure 2.2: Sum of Squares breakdown.
Summing over all observations we have that
Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} (yi − ŷi)² + Σ_{i=1}^{n} (ŷi − ȳ)²,   i.e.   SST = SSE + SSR,   (2.4)
since the cross-product term
Σ_{i=1}^{n} (yi − ŷi)(ŷi − ȳ) = Σ ei(ŷi − ȳ) = Σ ei ŷi − ȳ Σ ei = 0 − 0 = 0.
Remark 2.4. A useful result is
SSR = Σ (ŷi − ȳ)² = Σ (b0 + b1 xi − ȳ)² = Σ (ȳ − b1 x̄ + b1 xi − ȳ)² = b1² Σ (xi − x̄)² = b1² (n − 1)s²x.
Each sum of squares term has an associated degrees of freedom value:
SSR has 1, SSE has n − 2, and SST has n − 1 (with 1 + (n − 2) = n − 1).
We can summarize this information in an ANOVA table:

Source   df      MS            E(MS)
Reg      1       SSR/1         σ² + β1² Σ (xi − x̄)²
Error    n − 2   SSE/(n − 2)   σ²
Total    n − 1

Table 2.1: ANOVA table
Note that
SSE/σ² ∼ χ²n−2 ⇒ E(SSE/σ²) = n − 2 ⇒ E(SSE/(n − 2)) = σ²,
and that
MSR = SSR = b1² Σ (xi − x̄)² ⇒ E(MSR) = Σ (xi − x̄)² E(B1²) = Σ (xi − x̄)² [V(B1) + E²(B1)] = σ² + β1² Σ (xi − x̄)².
2.3.1 F-test for β1
In Section 2.1 we saw a t-test for testing the significance of β1, but now we
introduce a different test that will be especially useful later in testing multiple
β's simultaneously. In Table 2.1 we notice that
E(MSR)/E(MSE) = 1 if β1 = 0,   and   E(MSR)/E(MSE) > 1 if β1 ≠ 0.
By Cochran's theorem it has been shown that under H0 : β1 = 0,
• SSR/σ² ∼ χ²1 and SSE/σ² ∼ χ²n−2, and the two are independent,
• (χ²1/1) / (χ²n−2/(n − 2)) ∼ F1,n−2.
Hence, we have that
T.S. = [(SSR/σ²)/1] / [(SSE/σ²)/(n − 2)] = MSR/MSE ∼H0 F1,n−2.
The null is rejected if the p-value P(F1,n−2 > T.S.) < α, the area to the right
being less than α.
Figure 2.3: F1,n−2 distribution and p-value.
Remark 2.5. The F-test and t-test for H0 : β1 = 0 vs. Ha : β1 ≠ 0 are
equivalent since
MSR/MSE = b1² Σ (xi − x̄)²/MSE = b1²/[MSE/Σ (xi − x̄)²] = b1²/s²b1 = (b1/sb1)².
Example 2.6. Continuing from Example 1.2, note that t² = 10.290² = 105.9 = F, with the same p-value.
2.3.2 Goodness of fit
A goodness of fit statistic is a quantity that measures how well a model
explains a given set of data. For regression, we will use the coefficient of
determination
R² = SSR/SST = 1 − SSE/SST,
which is the proportion of variability in the response (about its naive mean ȳ)
that is explained by the regression model, and R² ∈ [0, 1].
Remark 2.6. For simple linear regression with (only) one predictor, the
coefficient of determination is the square of the correlation coefficient, with
the sign matching that of the slope, i.e.
r = +√R² if b1 > 0,   r = −√R² if b1 < 0,   r = 0 if b1 = 0.
Example 2.7. In the output of Example 1.2 we have R² = 0.8215, implying
that 82.15% of the (naive) variability in the work hours can now be explained
by the regression model that incorporates lot size as the only predictor.
2.4 Normal Correlation Models
Normal correlation models are useful when instead of a random normal response and a fixed predictor, there are two random normal variables and one
will be used to model the other.
Let (Y1, Y2) have a bivariate normal distribution with p.d.f.
f(y1, y2) = [1/(2πσ1σ2√(1 − ρ²12))] exp{ −[((y1 − µ1)/σ1)² − 2ρ12((y1 − µ1)/σ1)((y2 − µ2)/σ2) + ((y2 − µ2)/σ2)²] / (2(1 − ρ²12)) },
where ρ12 is the correlation coefficient σ12 /(σ1 σ2 ). It can be shown that
marginally Y1 ∼ N(µ1 , σ12 ) and Y2 ∼ N(µ2 , σ22 ). Hence, the conditional
density of (Y1 |Y2 = y2 ), and similarly of (Y2|Y1 = y1 ), can be found as
f(y1|y2) = f(y1, y2)/f(y2) = [1/(√(2π) σ1|2)] exp{ −½ [(y1 − α1|2 − β1|2 y2)/σ1|2]² },
where α1|2 = µ1 − µ2 ρ12(σ1/σ2), β1|2 = ρ12(σ1/σ2), and σ²1|2 = σ²1(1 − ρ²12).
Thus,
Y1|Y2 = y2 ∼ N(α1|2 + β1|2 y2, σ²1|2)
and we can “model” or make educated guesses as to the values of variable
Y1 given Y2 (where Y2 is random).
To determine if Y2 is an adequate “predictor” for Y1 , all we need to do is
test H0 : ρ12 = 0, since under the null, (Y1 |Y2 ) ≡ Y1 . The sample estimate is
the same as in equation (1). The test statistic is
T.S. = r12 √(n − 2) / √(1 − r²12) ∼H0 tn−2,
with p-values for two- and one-sided tests found in the usual way. However,
working with confidence intervals is more practical and even easier if we apply
Fisher's transformation to the sample correlation,
z′ = ½ log[(1 + r12)/(1 − r12)].
If the sample size is large, i.e. n ≥ 25, then
z′ approx.∼ N(ζ, 1/(n − 3)),   where ζ = ½ log[(1 + ρ12)/(1 − ρ12)],
and a 100(1 − α)% C.I. for ζ is
z′ ∓ z1−α/2 √(1/(n − 3)) → (L, U),
and hence a 100(1 − α)% C.I. for ρ12 (after back-transforming ζ) is
((e^(2L) − 1)/(e^(2L) + 1), (e^(2U) − 1)/(e^(2U) + 1)).
Non-normal data: When the data are not normal then we must implement a nonparametric procedure such as Spearman Rank Correlation coefficient.
1. Rank (y11 , . . . , yn1 ) from 1 to n and label as (R11 , . . . , Rn1 ).
2. Rank (y12 , . . . , yn2 ) from 1 to n and label as (R12 , . . . , Rn2 ).
3. Compute

   rs = Σ_{i=1}^n (Ri1 − R̄1)(Ri2 − R̄2) / √[ Σ_{i=1}^n (Ri1 − R̄1)² Σ_{i=1}^n (Ri2 − R̄2)² ]
To test the null hypothesis of no association between Y1 and Y2 use the test
statistic

   T.S. = rs √(n − 2) / √(1 − rs²) ∼ t_{n−2} under H0.

Reject if the p-value < α.
Example 2.8. Consider the Muscle mass problem 1.27 and let Y1 =muscle
mass, Y2 =age and we wish to model (Y1 |Y2)
> muscle=read.table("http://www.stat.ufl.edu/~rrandles/sta4210/
+ Rclassnotes/data/textdatasets/KutnerData/
+ Chapter%20%201%20Data%20Sets/CH01PR27.txt",col.names=c("Y1","Y2"))
> attach(muscle)
> n=length(Y1)
> r=cor(Y1,Y2);r
[1] -0.866064
> b1=r*sd(Y1)/sd(Y2);b1
[1] -1.189996
> b0=mean(Y1)-mean(Y2)*b1;b0
[1] 156.3466
> s2=var(Y1)*(1-r^2);s2
[1] 65.6686
Hence the estimated model is

   Y1 | Y2 = y2 ∼ N(156.35 − 1.19 y2, 65.67),

with r12 = −0.866.
To test H0 : ρ12 = 0
> TS=(r*sqrt(n-2))/sqrt(1-r^2)
> 2*pt(-abs(TS),n-2) #2 sided pvalue
[1] 4.123987e-19
we reject the null due to the extremely small p-value. We can also create a
95% C.I. for ρ12
> zp=0.5*log((1+r)/(1-r))
> LU=zp+c(1,-1)*qnorm(0.025)*1/sqrt(n-3)
> (exp(2*LU)-1)/(exp(2*LU)+1)
[1] -0.9180874 -0.7847085
and conclude that there is a significant negative relationship.
Obviously, before performing any of these procedures we need to be able to
assume that both variables are normal, which we will see how to assess later. If we cannot
assume normality then we need to use Spearman's correlation:
> rs=cor(Y1,Y2,method="spearman");rs # default method is pearson
[1] -0.8657217
> TSs=(rs*sqrt(n-2))/sqrt(1-rs^2)
> 2*pt(-abs(TSs),n-2) #2 sided pvalue
[1] 4.418881e-19
and reach the same conclusion.
http://www.stat.ufl.edu/~ athienit/STA4210/Examples/corr_model.R
Chapter 3

Diagnostics and Remedial Measures

3.1 Diagnostics for Predictor Variable
The goal is to identify any outlying values that could affect the appropriateness of the linear model. More information about influential cases will be
covered in Chapter 10. The two main issues are:
• Outliers.
• The levels of the predictor are associated with the run order when the
experiment is run sequentially.
To check these we use
• Histogram and/or Boxplot
• Sequence Plot
Example 3.1. Continuing from example 1.2 we see that there do not appear
to be any outliers
[Figure: histogram and box plot of lot size]
and no pattern/dependency between the values of the predictor and the run order.
[Figure: sequence plot of lot size versus run order]
3.2 Checking Assumptions

Recall that for the simple linear regression model

   Yi = β0 + β1 xi + ǫi,   i = 1, . . . , n,

we assume that ǫi i.i.d. ∼ N(0, σ²) for i = 1, . . . , n. However, once a model is
fit, before any inference or conclusions are made based upon a fitted model,
the assumptions of the model need to be checked.
These are:
1. Normality
2. Homogeneity of variance
3. Model fit/Linearity
4. Independence
with components of model fit being checked simultaneously within the first
three. The assumptions are checked using the residuals ei := yi − ŷi for
i = 1, . . . , n, or the standardized residuals, which are the residuals standardized
so that their standard deviation should be 1.
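As a minimal sketch (the fitted object below uses the built-in cars data purely for illustration), both kinds of residuals are readily available in R:

```r
fit <- lm(dist ~ speed, data = cars)  # built-in cars data, purely for illustration
e  <- resid(fit)                      # ordinary residuals  e_i = y_i - yhat_i
re <- rstandard(fit)                  # internally standardized residuals
c(sd(e), sd(re))                      # the standardized ones have spread close to 1
```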
3.2.1 Graphical methods
Normality
The simplest way to check for normality is with two graphical procedures:
• Histogram
• P-P or Q-Q plot
A probability plot is a graphical technique for comparing two data sets,
either two sets of empirical observations, or one empirical set against a theoretical set.
Definition 3.1. The empirical distribution function, or empirical c.d.f., is
the cumulative distribution function associated with the empirical measure
of the sample. This c.d.f. is a step function that jumps up by 1/n at each of
the n data points,

   F̂n(x) = (number of elements ≤ x)/n = (1/n) Σ_{i=1}^n I{xi ≤ x}.
Example 3.2. Consider the sample: 1, 5, 7, 8. The empirical c.d.f. is

   F̂4(x) = 0      if x < 1
           0.25   if 1 ≤ x < 5
           0.50   if 5 ≤ x < 7
           0.75   if 7 ≤ x < 8
           1      if x ≥ 8
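In R the empirical c.d.f. can be obtained directly with ecdf(); a quick sketch checking it against the values above:

```r
x  <- c(1, 5, 7, 8)
Fn <- ecdf(x)          # step function that jumps by 1/4 at each data point
Fn(c(0, 1, 6, 7.5, 9)) # 0.00 0.25 0.50 0.75 1.00
plot(Fn, main = "Empirical c.d.f.")
```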
Figure 3.1: Empirical c.d.f.
The normal probability plot is a graphical technique for normality testing
by assessing whether or not a data set is approximately normally distributed.
The data are plotted against a theoretical normal distribution in such a way
that the points should form an approximate straight line. Departures from
this straight line indicate departures from normality.
There are two types of plots commonly used to compare the empirical c.d.f.
to the normal theoretical one (G(·)):

• the P-P plot, which plots (F̂n(x), G(x)) (with scales changed to look linear),

• the Q-Q plot, which plots the quantile functions (F̂n⁻¹(x), G⁻¹(x)).
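A minimal normal Q-Q plot in R (the vector x below is simulated right-skewed data, purely to mimic the shape of the example that follows):

```r
set.seed(1)
x <- rexp(37)          # simulated right-skewed data
qqnorm(x)              # sample quantiles against theoretical normal quantiles
qqline(x)              # reference line through the first and third quartiles
```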
Example 3.3. An experiment of lead concentrations (mg/kg dry weight)
from 37 stations, yielded 37 observations. Of interest is to determine if the
data are normally distributed (of more practical use if sample sizes are small,
e.g. < 30).
[Figure: normal Q-Q plot and smoothed histogram (normal versus data) of the lead concentrations]
Note that the data appears to be skewed right, with a lighter tail on the
left and a heavier tail on the right (as compared to the normal).
http://www.stat.ufl.edu/~ athienit/IntroStat/QQ.R
With the vertical axis being the theoretical quantiles and the horizontal
axis being the sample quantiles, the interpretation of P-P plots and Q-Q plots
is equivalent. Compared to the straight line that corresponds to the distribution
you wish to compare your data against, here is a quick guideline for the tails:

                Left tail   Right tail
   Above line   Heavier     Lighter
   Below line   Lighter     Heavier
A histogram of the residuals is plotted and we try to determine if the
histogram is symmetric and bell shaped like a normal distribution is. In
addition, to check the model fit, we assume the observed response values
yi are centered around the regression line ŷ. Hence, the histogram of the
residuals should be centered at 0.
Example 3.4. Referring to Example 1.1, we obtain the following

[Figure: normal Q-Q plot and histogram of the standardized residuals]
Homogeneity of variance/Fit of model
Recall that the regression model assumes that the errors ǫi have constant
variance σ 2 . In order to check this assumption a plot of the residuals (ei )
versus the fitted values (ŷi ) is used. If the variance is constant, one expects
to see a constant spread/distance of the residuals to the 0 line across all the
ŷi values of the horizontal axis. Referring to Example 1.1, we see that this
assumption does not appear to be violated.
[Figure: standardized residuals versus fitted values ŷ]
Figure 3.2: Residual versus fitted values plot.
In addition, the same plot can be used to check the fit of the model.
If the model is a good fit, one expects to see the residuals evenly spread
on either side of the 0 line. For example, if we observe residuals that are
more heavily sided above the 0 line for some interval of ŷi, then this is an
indication that the regression line is not "moving" through the center of the
data points for that section. By construction, the regression line does "move"
through the center of the data overall, i.e. for the whole big picture. So if it
is underestimating (or overestimating) for some portion then it will overestimate
(or underestimate) for some other. This is an indication that there is
some curvature and that perhaps some polynomial terms should be added.
(To be discussed in the next chapter.)
Independence
To check for independence a time series plot of the residuals/standardized
residuals is used, i.e. a plot of the value of the residual versus the value of
its position in the data set. For example, the first data point (x1 , y1 ) will
yield the residual e1 = y1 − ŷ1 . Hence, the order of e1 is 1, and so forth.
Independence is graphically checked if there is no discernible pattern in the
plot. That is, one cannot predict the next ordered residual by knowing
a few previous ordered residuals. Referring to Example 1.1, we obtain the
following plot where there does not appear to be any discernible pattern.
[Figure: standardized residuals versus order]
Figure 3.3: Time series plot of residuals.
Remark 3.1. Note that when creating this plot, the order in which the data
were obtained must be the same as the order in which they appear in the
datasheet. For example, assume that each person in a group is asked a
question one at a time. Then possibly the second person might be influenced
by the first person's response, and so forth. If the data were then sorted,
e.g. alphabetically, this order may be lost.
It is also important to note that this graph is heavily influenced by the
validity of the model fit. Here is an example we will actually be addressing
later in example 3.13
[Figure: histogram of std res, normal Q-Q plot, homogeneity/fit plot and independence plot for the memory-recall data of Example 3.13]
http://www.stat.ufl.edu/~ athienit/STA4210/Examples/copier.R
3.2.2 Significance tests

Independence

• Runs Test (presumes data are in time order)

  – Write out the sequence of +/− signs of the residuals.

  – Count n1 = number of +ve residuals, n2 = number of −ve residuals.

  – Count u = number of "runs" of +ve and −ve residuals. So what is a run?
    For example, if we have the following 9 residuals:

       −   +++   −−   +   −−
       1    2     3   4    5

    then we have in fact u = 5 runs with n1 = 4 and n2 = 5.
The null hypothesis is that the data are independent (random placement). We will
use the exact sampling distribution of u to determine the p-value. The p.m.f.
of the corresponding r.v. U is

   p(u) = 2 C(n1−1, k−1) C(n2−1, k−1) / C(n1+n2, n1)                                u = 2k, k ∈ N (u is even)

   p(u) = [ C(n1−1, k−1) C(n2−1, k) + C(n1−1, k) C(n2−1, k−1) ] / C(n1+n2, n1)      u = 2k + 1, k ∈ N (u is odd)

where C(a, b) denotes the binomial coefficient "a choose b". Then, the p-value is
defined as P(U ≤ u). Luckily, there is no need to do this by hand.
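Still, purely as an illustrative sketch, the exact lower-tail p-value can be computed directly from the p.m.f. above; the numbers below match the 9-residual illustration:

```r
# exact p.m.f. of the number of runs U, as given above
runs.pmf <- function(u, n1, n2) {
  if (u %% 2 == 0) {                         # u = 2k (even)
    k <- u / 2
    2 * choose(n1 - 1, k - 1) * choose(n2 - 1, k - 1) / choose(n1 + n2, n1)
  } else {                                   # u = 2k + 1 (odd)
    k <- (u - 1) / 2
    (choose(n1 - 1, k - 1) * choose(n2 - 1, k) +
     choose(n1 - 1, k) * choose(n2 - 1, k - 1)) / choose(n1 + n2, n1)
  }
}
# exact p-value P(U <= 5) for n1 = 4 positives and n2 = 5 negatives
sum(sapply(2:5, runs.pmf, n1 = 4, n2 = 5))
```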
Example 3.5. Continuing from example 1.2, we run the ”Runs” test
on the standardized residuals in R
> library(lawstat) #may need to install package
> runs.test(re,plot.it=TRUE)
Runs Test - Two sided
data: re
Standardized Runs Statistic = -1.015, p-value = 0.3101
and note that we fail to reject the null due to the large p-value. We have 11
runs out of a maximum of 25. There is also another runs.test in the randtests
package (which actually provides the value of u).

[Figure: runs of the standardized residuals (labelled A/B) plotted against order]
Remark 3.2. It is notable that for n1 + n2 > 20,

   U approx. ∼ N( µu, σu² ),   with
   µu = 2 n1 n2/(n1 + n2) + 1   and
   σu² = 2 n1 n2 (2 n1 n2 − n1 − n2) / [ (n1 + n2)² (n1 + n2 − 1) ],

and a test statistic can be used,

   T.S. = (u − µu + 0.5)/σu,

to calculate the p-value P(Z ≤ T.S.).
• Durbin-Watson Test. For this test we assume that the error term in
  equation (1.1) is of the form

     ǫi = ρ ǫi−1 + ui,   ui i.i.d. ∼ N(0, σ²),  |ρ| < 1.

  That is, the error term at a certain time period i is correlated with the
  error term at time i − 1.
  The null hypothesis is H0 : ρ = 0, i.e. uncorrelated. The test statistic is

     T.S. = Σ_{i=2}^n (ei − ei−1)² / Σ_{i=1}^n ei²,

  where the denominator is SSE. Once the sampling distribution of the test
  statistic is determined, p-values can be obtained. However, the density
  function of this statistic is not easy to work with, so we leave the heavy
  lifting to software.
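The statistic itself, though, is simple to compute from the residuals; a minimal sketch assuming the fitted object toluca.reg from the example that follows:

```r
e  <- resid(toluca.reg)              # residuals in time (run) order
DW <- sum(diff(e)^2) / sum(e^2)      # sum of (e_i - e_{i-1})^2 divided by SSE
DW                                   # should agree with durbinWatsonTest()/dwtest()
```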
Example 3.6. Continuing from example 1.2,
> library(car)
> durbinWatsonTest(toluca.reg)
 lag Autocorrelation D-W Statistic p-value
   1       0.2593193       1.43179   0.166
Alternative hypothesis: rho != 0
> library(lmtest)
> dwtest(toluca.reg,alternative="two.sided")
Durbin-Watson test
data: toluca.reg
DW = 1.4318, p-value = 0.1616
alternative hypothesis: true autocorrelation is not 0
The p-value is large, i.e. greater than 5%, and hence we fail to reject
the null, and conclude independence.
Remark 3.3. The book suggests that in business and economics the
correlation tends to be positive and hence a one sided test should be
performed. However, this decision is context specific and left to the
researcher.
Normality test
As expected there are many tests for normality. For a current list visit
https://en.wikipedia.org/wiki/Normality_test. For now, we will discuss the Shapiro-Wilk Test.
The null hypothesis is that normality holds (for the data entered; here we
will use the standardized residuals).
Example 3.7. Continuing from example 1.2,
> shapiro.test(re)
Shapiro-Wilk normality test
data: re
W = 0.97917, p-value = 0.8683
and hence we fail to reject the assumption of normality.
Homogeneity of variance
• If the response can be split into t distinct groups, i.e. the predictor(s)
are categorical, then use the Brown-Forsythe/Levene Test. This test is
used to test whether multiple populations have the same variance.
The null hypothesis is that

   H0 : V(ǫi) = σ²  ∀i,

or equivalently

   H0 : σ1² = · · · = σt².
Remark 3.4. If the data cannot be split into distinct groups, this can be
done artificially by separating the responses based on their predictor
values or fitted values. For example we can create two groups, data
with “small” fitted values and data with “large” fitted values. Much in
the same way we create bins for a histogram.
The test statistic is tedious to calculate and left for software. However,

   T.S. ∼ F_{t−1,n−t} under H0,

where t is the number of groups and n is the grand total number of
observations. The p-value is P(F_{t−1,n−t} ≥ T.S.). Reject the null if the p-value < α.
Example 3.8. Continuing with example 1.2, assume we wish to split
the data into two groups depending on whether the lot size is greater
than 75 or not.
> ind=I(toluca$lotsize>75)
> temp=cbind(toluca$lotsize,re,ind);temp
            re ind
1   80  1.07281843   1
2   30 -1.06174371   0
3   50 -0.41228961   0
4   90 -0.15886988   1
5   70  1.01932471   0
6   60 -1.10742255   0
7  120  1.25374204   1
8   80  0.08237676   1
9  100 -1.45603607   1
10  50 -1.86517337   0
11  40 -0.96611354   0
12  70 -1.27729740   0
13  90  0.10987623   1
14  20 -0.45782448   0
15 110 -0.43096533   1
16 100  0.01286011   1
17  30  0.92610180   0
18  50  0.56452124   0
19  90 -0.13817529   1
20 110 -0.73719055   1
21  30  2.50810852   0
22  90  1.87651913   1
23  40  0.82578984   0
24  80 -0.12266660   1
25  70  0.21940878   0
> leveneTest(temp[,2],ind) # fcn in car library
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  1  1.6553  0.211
      23
Warning message:
In leveneTest.default(temp[, 2], ind) : ind coerced to factor.

With a p-value greater than 0.05 we fail to reject the null.
• Breusch-Pagan/Cook-Weisberg Test. It tests whether the estimated
  variance of the residuals from a regression depends on the values of the
  independent/predictor variables, in which case heteroskedasticity is present:

     σ² = E(ǫ²) = γ0 + γ1 x1 + · · · + γp xp.

  The null hypothesis is that the variance does not depend on the
  independent/predictor variables. Although we will usually have software
  calculate the test statistic, the process is fairly simple.

  1. Obtain SSE = Σ_{i=1}^n ei² from the original equation.

  2. Fit a regression with ei² as the response using the same predictor(s),
     and obtain SSR⋆.

  3. Compute

        T.S. = (SSR⋆/2) / (SSE/n)² ∼ χ²_p under H0,

     where p is the number of predictors in the model. The null is
     rejected if the p-value P(χ²_p ≥ T.S.) < α.
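A sketch of the hand computation, assuming the toluca data frame and fitted object toluca.reg used in the surrounding examples:

```r
e2   <- resid(toluca.reg)^2                    # squared residuals from the original fit
SSE  <- sum(e2)                                # step 1
aux  <- lm(e2 ~ lotsize, data = toluca)        # step 2: regress e^2 on the same predictor
SSRs <- sum((fitted(aux) - mean(e2))^2)        # SSR* from the auxiliary regression
n <- nrow(toluca); p <- 1
TS <- (SSRs / 2) / (SSE / n)^2                 # step 3
pchisq(TS, df = p, lower.tail = FALSE)         # p-value
```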
Example 3.9. Continuing with example 1.2,
> ncvTest(toluca.reg) # fcn in car library
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.8209192
Df = 1
p = 0.3649116
Hence, we fail to reject the null since the p-value > α.
Linearity of regression

We will perform an F-test for Lack-of-Fit if there are t distinct levels of the
predictor(s). It is not a valid test if the number of distinct levels is large, i.e.
t ≈ n.

   H0 : E(Yi) = β0 + β1 xi   vs   Ha : E(Yi) ≠ β0 + β1 xi

1. For each distinct level compute ŷj and ȳj, j = 1, . . . , t.

2. Compute SSLF = Σ_{j=1}^t Σ_{i=1}^{nj} (ȳj − ŷj)² = Σ_{j=1}^t nj (ȳj − ŷj)²,
   with degrees of freedom t − 2.

3. Compute SSPE = Σ_{j=1}^t Σ_{i=1}^{nj} (yij − ȳj)², with degrees of freedom n − t.

4. Compute

      T.S. = [ SSLF/(t − 2) ] / [ SSPE/(n − t) ] ∼ F_{t−2,n−t} under H0.

The null is rejected if the p-value P(F_{t−2,n−t} ≥ T.S.) < α.
In R there is a work around where you do not have to compute these SS
explicitly, as illustrated in the following example.
Example 3.10. Continuing with example 1.2, we note that there are 11
distinct levels of lot size in the 25 observations.
> length(unique(toluca$lotsize));length(toluca$lotsize)
[1] 11
[1] 25
> Reduced=toluca.reg # fit reduced model
> Full=lm(workhrs~0+as.factor(lotsize),data=toluca) # fit full model
> anova(Reduced, Full) # get lack-of-fit test
Analysis of Variance Table

Model 1: workhrs ~ lotsize
Model 2: workhrs ~ 0 + as.factor(lotsize)
  Res.Df   RSS Df Sum of Sq      F Pr(>F)
1     23 54825
2     14 37581  9     17245 0.7138 0.6893
The p-value is greater than 0.05 so we fail to reject the null and conclude
that the model is an adequate (linear) fit.
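Equivalently, the sums of squares can be computed explicitly; a sketch using the same toluca objects:

```r
ybar <- tapply(toluca$workhrs, toluca$lotsize, mean)       # group means per distinct level
yhat <- tapply(fitted(toluca.reg), toluca$lotsize, mean)   # fitted value at each level
nj   <- table(toluca$lotsize)
SSLF <- sum(nj * (ybar - yhat)^2)                          # lack-of-fit SS, df = t - 2
SSPE <- sum((toluca$workhrs - ybar[as.character(toluca$lotsize)])^2)  # pure error SS, df = n - t
t.lev <- length(nj); n <- nrow(toluca)
TS <- (SSLF / (t.lev - 2)) / (SSPE / (n - t.lev))
pf(TS, t.lev - 2, n - t.lev, lower.tail = FALSE)           # matches the anova() output above
```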
3.3 Remedial Measures
• Nonlinear Relation: Add polynomials, fit exponential regression function,
or transform x and/or y (more emphasis on x).
• Non-Constant Variance: Weighted Least Squares, transform y and/or
x, or fit Generalized Linear Model.
• Non-Independence of Errors: Transform y or use Generalized Least
Squares, or fit Generalized Linear Model with correlated errors.
• Non-Normality of Errors: Box-Cox transformation, or fit Generalized
Linear Model.
• Omitted Predictors: Include important predictors in a multiple regression model.
• Outlying Observations: Robust Estimation or Nonparametric regression.
3.3.1 Box-Cox (Power) transformation
In the event that the model assumptions appear to be violated to a significant degree, then a linear regression model on the available data is not valid.
However, have no fear, your friendly statistician is here. The data can be
transformed, in an attempt to fit a valid regression model to the new transformed data set. Both the response and the predictor can be transformed
but there is usually more emphasis on the response.
Remark 3.5. However, when we apply such a transformation, call it g(·), we
are in fact fitting the mean line
E(g(Y )) = β0 + β1 x1 + . . .
As a result we cannot back-transform, i.e. apply the inverse transformation,
to make inference on E(Y), since

   g⁻¹[E(g(Y))] ≠ E(Y).
A common transformation mechanism is the Box-Cox transformation
(also known as Power transformation). This transformation mechanism when
applied to the response variable will attempt to remedy the “worst” of the
assumptions violated, i.e. to reach a compromise. A word of caution, is
that in an attempt to remedy the worst it may worsen the validity of one
of the other assumptions. The mechanism works by trying to identify the
(minimum or maximum depending on software) value of a parameter λ that
will be used as the power to which the responses will be transformed. The
transformation is
 λ

 yi − 1
if λ 6= 0
λ−1
(λ)
yi = λGy

G log(y ) if λ = 0
y
i
Q
where Gy = ( ni=1 yi )1/n denotes the geometric mean of the responses. Note
that a value of λ = 1 effectively implies no transformation is necessary. There
are many software packages that can calculate an estimate for λ, and if the
sample size is large enough even create a C.I. around the value. Referring to
Example 1.1, we see that λ̂ = 1.11.
Figure 3.4: Box-Cox plot (profile log-likelihood versus λ with a 95% interval).
However, one could argue that the value is close to 1 and that a transformation may not necessarily improve the overall validity of the assumptions,
so no transformation is necessary. In addition, we know that linear regression is somewhat robust to deviations from the assumptions, and it is more
practical to work with the untransformed data that are in the original units
of measurements. For example, if the data is in miles and a transformation
is used on the response, inference will be on log(miles).
Example 3.11. Continuing from example 1.1, we use the following R script:
http://www.stat.ufl.edu/~ athienit/STA4210/Examples/boxcox.R
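The script is not reproduced here, but a minimal sketch of what such a Box-Cox analysis typically looks like in R, shown on the toluca fit from the earlier examples purely for illustration (note that MASS's boxcox() omits the geometric-mean scaling used above; the maximizing λ is unaffected):

```r
library(MASS)
bc <- boxcox(toluca.reg, lambda = seq(-2, 2, 0.1))  # profile log-likelihood over a grid of lambda
bc$x[which.max(bc$y)]                               # the lambda that maximizes it
```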
Example 3.12. http://www.stat.ufl.edu/~ athienit/STA4210/Examples/diagnostic&BoxCox
If the model fit assumption is the major culprit violated, a transformation of the predictor(s) will often resolve the issue without
having to transform the response and consequently changing its scale.
Example 3.13. In an experiment, 13 subjects were asked to memorize a list of
disconnected items and then asked to recall them at various times up to a week later.
• Response = proportion of items recalled correctly.
• Predictor = time, in minutes, since initially memorized the list.
   Time   1     5     15    30    60    120   240   480   720   1440  2880  5760  10080
   Prop   0.84  0.71  0.61  0.56  0.54  0.47  0.45  0.38  0.36  0.26  0.20  0.16  0.08
[Figure: histogram of std res, normal Q-Q plot, homogeneity/fit plot and independence plot for the model with time as predictor, together with the scatterplot of prop versus time]

bcPower Transformation to Normality
         Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
dat$time    0.0617   0.1087          -0.1514           0.2748

Likelihood ratio tests about transformation parameters
                            LRT df         pval
LR test, lambda = (0)  0.327992  1 5.668439e-01
LR test, lambda = (1) 46.029370  1 1.164935e-11
It seems that a decent choice for λ is 0, i.e. a log transformation for time.
[Figure: diagnostic plots after the log transformation of time (histogram of std res, normal Q-Q plot, homogeneity/fit and independence), together with the scatterplot of prop versus l.time]
http://www.stat.ufl.edu/~ athienit/STA4210/Examples/diagnostic&Linearity.R
Remark 3.6. When creating these graphs and checking for "patterns", try to
keep the axis for the standardized residuals ranging from −3 to 3, that is,
3 standard deviations below 0 to 3 standard deviations above 0. Software has
a tendency to "zoom" in.
Is glass smooth? If you are viewing by eye then yes. If you are viewing
via an electron microscope then no.
In R just add plot(....., ylim=c(-3,3))
3.3.2 Lowess (smoothed) plots
• Nonparametric method of obtaining a smooth plot of the regression
relation between y and x.
• Fits regression in small neighborhoods around points along the regression line on the horizontal axis.
• Weights observations closer to the specific point higher than more distant points.
• Re-weights after fitting, putting lower weights on larger residuals (in
absolute value).
• Obtains fitted value for each point after “final” regression is fit.
• The lowess curve is plotted along with the linear fit and confidence bands;
the linear fit is good if the lowess curve lies within the bands.
Example 3.14. For Example 1.2, assume we wish to fit a lowess regression in R
using the loess function with smoothing parameter α = 0.5.
[Figure: workhrs versus lotsize with the loess (α = 0.5) curve, the SLR line, and 95% confidence bands]
Figure 3.5: Lowess smoothed plot.
http://www.stat.ufl.edu/~ athienit/STA4210/Examples/loess.R
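A minimal sketch of such a plot, using the toluca objects from the earlier examples and a pointwise 95% confidence band around the linear fit (the script at the URL above may differ in details, e.g. using a simultaneous band):

```r
xg  <- data.frame(lotsize = seq(min(toluca$lotsize), max(toluca$lotsize), length.out = 100))
ci  <- predict(toluca.reg, newdata = xg, interval = "confidence")  # band around the linear fit
lo  <- loess(workhrs ~ lotsize, data = toluca, span = 0.5)         # smoothing parameter 0.5
smo <- predict(lo, newdata = xg)

plot(workhrs ~ lotsize, data = toluca, pch = 16)
abline(toluca.reg)                          # simple linear regression fit
lines(xg$lotsize, ci[, "lwr"], lty = 2)     # pointwise 95% confidence band
lines(xg$lotsize, ci[, "upr"], lty = 2)
lines(xg$lotsize, smo, col = "red")         # loess curve; fit is adequate if it stays inside
```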
Chapter 4

Simultaneous Inference and Other Topics

The main concept here is that if a 95% C.I. is created for β0 and another 95%
C.I. for β1 we cannot say that we are 95% confident that these two confidence
intervals are simultaneously both correct.

4.1 Controlling the Error Rate
Let αI denote the individual comparison Type I error rate. Thus, P (Type I error) =
αI on each of the g tests.
Now assume we wish to combine all the individual tests into an overall/combined/simultaneous test
H0 = H01 ∩ H02 ∩ · · · ∩ H0g
H0 is rejected if any of the null hypotheses H0i is rejected.
The experimentwise error rate αE , is the probability of falsely rejecting
at least one of the g null hypotheses. If each of the g tests is done with αI ,
then assuming each test is independent and denoting the probability of not
falsely rejecting H0i by Ei
   αE = 1 − P(∩_{i=1}^g Ei)
      = 1 − Π_{i=1}^g P(Ei)      (by independence)
      = 1 − (1 − αI)^g
For example, if αI = 0.05 and 10 comparisons are made then αE = 0.401
which is very large.
However, if we do not know if the tests are independent, we use the
Bonferroni inequality

   P(∩_{i=1}^g Ei) ≥ Σ_{i=1}^g P(Ei) − g + 1,

which implies

   αE = 1 − P(∩_{i=1}^g Ei) ≤ g − Σ_{i=1}^g P(Ei) = Σ_{i=1}^g [1 − P(Ei)] = Σ_{i=1}^g αI = g αI
Hence, αE ≤ g αI. So what we will do is choose an α to serve as an upper
bound for αE. That is, we won't know the true value of αE but we will know
it is bounded above by α, i.e. αE ≤ α. For example, if we set α = 0.05 then
αE ≤ 0.05, or the simultaneous C.I. formed from g individual C.I.'s will have a
confidence of at least 95% (if not more). Set

   αI = α/g.

For example, if we have 5 multiple comparisons and wish the overall
error rate to be 0.05, or simultaneous confidence of at least 95%, then each one
(of the five) C.I.'s must be done at the

   100(1 − 0.05/5)% = 99%

confidence level.
For additional details the reader can read the multiple comparisons problem
and the familywise error rate.
4.1.1 Simultaneous estimation of mean responses

• Bonferroni: Can be used for g simultaneous C.I.s, each done at the
  100(1 − α/g)% level. If g is large then these intervals will be "too" wide
  for practical conclusions.

     ŷ ∓ t_{1−α/(2g), n−2} s_Ŷ

• Working-Hotelling: A confidence band is created for the entire regression
  line that can be used for any number of confidence intervals for means
  simultaneously.

     ŷ ∓ √(2 F_{1−α; 2, n−2}) s_Ŷ

4.1.2 Simultaneous predictions

• Bonferroni: Can be used for g simultaneous P.I.s, each done at the
  100(1 − α/g)% level. If g is large then these intervals will be "too" wide
  for practical conclusions.

     ŷ ∓ t_{1−α/(2g), n−2} s_pred

• Scheffé: Widely used method. Like the Bonferroni, the width increases
  as g increases.

     ŷ ∓ √(g F_{1−α; g, n−2}) s_pred
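A quick sketch comparing the multipliers for, say, g = 3 simultaneous statements at 95% with n = 25 (sample size chosen purely for illustration):

```r
n <- 25; g <- 3; alpha <- 0.05
bonf <- qt(1 - alpha / (2 * g), df = n - 2)   # Bonferroni multiplier for g intervals
wh   <- sqrt(2 * qf(1 - alpha, 2, n - 2))     # Working-Hotelling (any number of means)
sch  <- sqrt(g * qf(1 - alpha, g, n - 2))     # Scheffe multiplier for g predictions
c(Bonferroni = bonf, WorkingHotelling = wh, Scheffe = sch)
```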
4.2 Regression Through the Origin

When theoretical reasoning (in the context at hand) suggests that the regression
line must pass through the origin (x = 0, y = 0), then the regression line
must try to meet this criterion. This is done by restricting the intercept to
0, i.e. β0 = 0, yielding the model

   Yi = β1 xi + ǫi
However, with this model there are some issues:

• V(Y | x = 0) = 0. The variance of the response at the origin is set to 0,
  which is not consistent with the "usual" regression model.

• Σ_{i=1}^n ei does not necessarily equal 0.

• SSE can potentially be larger than SST, affecting the analysis of variance
  and the R² interpretation.

The reader is referred to p.164 of the textbook for more details.
To estimate β1 via least squares we need to minimize

   Σ_{i=1}^n (yi − β1 xi)².

Taking the derivative with respect to β1 and equating to 0, we have

   −2 Σ_{i=1}^n [ xi (yi − β1 xi) ] = 0
   ⇒ Σ_{i=1}^n xi yi = b1 Σ_{i=1}^n xi²
   ⇒ b1 = Σ_{i=1}^n xi yi / Σ_{i=1}^n xi² = Σ_{i=1}^n ( xi / Σ xi² ) yi.

The only other difference is that the degrees of freedom for error are now n − 1,
hence slightly changing the MSE estimate. As a result,

   • s²_{b1} = MSE / Σ_{i=1}^n xi²

   • s²_Ŷ = MSE x² / Σ_{i=1}^n xi²

   • s²_pred = MSE ( 1 + x² / Σ_{i=1}^n xi² )

which are used in the C.I. for β1, the C.I. for the mean response, and the P.I.
Example 4.1. A plumbing company operates 12 warehouses. A regression
is fit with work units performed (x) and total variable cost (y). A regression
through the origin yielded
[Figure: scatterplot of labor versus work with the fitted line through the origin]
> plumbing=read.table(...
> with(plumbing,plot(labor~work,pch=16))
> plumb.reg=lm(labor~0+work,data=plumbing)
> summary(plumb.reg)
Coefficients:
     Estimate Std. Error t value Pr(>|t|)
work  4.68527    0.03421     137   <2e-16 ***
Residual standard error: 14.95 on 11 degrees of freedom
Multiple R-squared: 0.9994,Adjusted R-squared: 0.9994
F-statistic: 1.876e+04 on 1 and 11 DF,
p-value: < 2.2e-16
> abline(plumb.reg)
Now we can create a PI for when work is equal to 100. R can do this too and
it uses the right standard error. However, if you ask it to print the se.fit
it only provides the se.fit for the CI, not the PI
> syhat=sqrt(223.42*(100^2/sum(plumbing$work^2)));syhat
[1] 3.420475
> spred=sqrt(223.42*(1+100^2/sum(plumbing$work^2)));spred
[1] 15.33361
> newdata=data.frame(work=100)
> predict.lm(plumb.reg,newdata)+c(1,-1)*qt(0.025,11)*spred
[1] 434.7784 502.2765
> predict.lm(plumb.reg,newdata,se.fit=TRUE,interval="prediction")
$fit
       fit      lwr      upr
1 468.5274 434.7781 502.2767
$se.fit
[1] 3.420502
$df
[1] 11
http://www.stat.ufl.edu/~ athienit/STA4210/Examples/plumbing_origin.R
4.3 Measurement Errors
Firstly, let’s take a look at what we mean when a variable/effect is fixed or
random and why there is still confusion concerning the use of these.
http://andrewgelman.com/2005/01/25/why_i_dont_use/
4.3.1 Measurement error in the dependent variable
There is no problem as long as there is no bias, i.e. consistently recording
lower or higher values. The extra error term is absorbed into the existing
error term ǫ for the response Y .
4.3.2 Measurement error in the independent variable
Assume there is no bias in measurement error.
• Not a problem when the observed (recorded) value is fixed and actual
value is random. For example, when the oven dial is set to 400◦F the
actual temperature inside is not exactly 400◦ F.
• When the observed (recorded) value is random it causes a problem by
biasing β1 downward. Let Xi denote the true (unobserved) value, and
Xi⋆ , the observed (recorded) value. Then, the measurement error is
δi = Xi⋆ − Xi
The true model can be expressed as
Yi = β0 + β1 Xi + ǫi
= β0 + β1 (Xi⋆ − δi ) + ǫi
= β0 + β1 Xi⋆ + (ǫi − β1 δi )
and we assume that δi is

   – unbiased, i.e. E(δi) = 0,

   – uncorrelated with the random error, implying that E(ǫi δi) = E(ǫi) E(δi) = 0.

Hence,

   Cov(Xi⋆, ǫi − β1 δi) = E{ [Xi⋆ − E(Xi⋆)][(ǫi − β1 δi) − E(ǫi − β1 δi)] }
                        = E{ [Xi⋆ − Xi][ǫi − β1 δi] }
                        = E{ δi (ǫi − β1 δi) }
                        = E{ δi ǫi } − β1 E(δi²)
                        = −β1 V(δi).

Therefore, the recorded value Xi⋆ is not independent of the error term
(ǫi − β1 δi) and

   E(Yi | Xi⋆) = β0⋆ + β1⋆ Xi⋆,

where

   β1⋆ = β1 σ²_X / (σ²_X + σ²_δ) < β1.

4.4 Inverse Prediction
The goal is to predict a new predictor value based on an observed new value
of the response. Once we have a model, it is easy to show (by rearranging terms
in the prediction equation) that the prediction is

   x̂new = (ynew − b0)/b1.
It has been shown (in higher level statistics courses) that if

   t²_{1−α/2, n−2} MSE / [ b1² Σ_{i=1}^n (xi − x̄)² ] < 0.1   (approximately),

then a 100(1 − α)% P.I. for xnew is

   x̂new ∓ t_{1−α/2, n−2} √{ (MSE/b1²) [ 1 + 1/n + (x̂new − x̄)² / Σ_{i=1}^n (xi − x̄)² ] }.
Remark 4.1. Bonferroni or Scheffé adjustments should be made for multiple
simultaneous predictions.
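A sketch of the computation, assuming the toluca fit from the earlier examples and a hypothetical observed new response of ynew = 300 work hours:

```r
b <- coef(toluca.reg); MSE <- summary(toluca.reg)$sigma^2
x <- toluca$lotsize; n <- length(x)
ynew <- 300                                   # hypothetical new observed response
xhat <- (ynew - b[1]) / b[2]                  # point prediction for x

cond <- qt(0.975, n - 2)^2 * MSE / (b[2]^2 * sum((x - mean(x))^2))
cond                                          # should be (approximately) < 0.1
s <- sqrt((MSE / b[2]^2) * (1 + 1/n + (xhat - mean(x))^2 / sum((x - mean(x))^2)))
xhat + c(-1, 1) * qt(0.975, n - 2) * s        # approximate 95% P.I. for x_new
```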
4.5 Choice of Predictor Levels

Recall that in most standard errors the term Σ_{i=1}^n (xi − x̄)² was present
somewhere in a denominator. For example,

   V(B1) = σ² / Σ_{i=1}^n (xi − x̄)².
So, in order to decrease the standard error we need to maximize this term,
which in essence is a measure of spread of the predictor, by
(i) Increase sample size, n.
(ii) Increase the spacing of the predictor.
Depending on the goal of research, when planning a controlled experiments,
and selecting predictor levels, choose:
• 2 levels if only interested in whether there is an effect and its direction,
• 3 levels if goal is describing relation and any possible curvature,
• 4 or more levels for further description of response curve and any potential non-linearity such as an asymptote.
Chapter 5

Matrix Approach to Simple Linear Regression
We will cover the basics necessary to provide us with better understanding of
regression which will be especially useful for multiple regression. The reader
is also encouraged to review further topics and material at
• http://stattrek.com/tutorials/matrix-algebra-tutorial.aspx
• https://www.youtube.com/watch?v=xyAuNHPsq-g&list=PLFD0EB975BA0CC1E0
Definition 5.1. A matrix is a rectangular array of numbers or symbolic
elements.
In many applications, the rows will represent individual cases and columns
will represent attributes or characteristics. The dimensions of a matrix are its
numbers of rows and columns, often denoted m × n, and it has the form

           ( a1,1   a1,2   · · ·   a1,n
   Am,n =    a2,1   a2,2   · · ·   a2,n
              ...    ...    ...     ...
             am,1   am,2   · · ·   am,n )

5.1 Special Types of Matrices
• Square matrix: The number of rows is the same as the number of
  columns. For example,

     A2,2 = ( a1,1  a1,2
              a2,1  a2,2 )

• Vector: A column vector is a matrix with only one column, and a row
  vector is a matrix with only one row. For example, c = (c1, c2, . . . , cn)^T.

• Transpose: A matrix formed by interchanging rows and columns. For
  example,

     G = ( 6  15  22            ( 6    8
           8  13  25 )  ⇒ G^T =   15  13
                                  22  25 )

• Matrix equality: Two matrices of the same dimension are equal when
  each element that is in the same position in each matrix is equal.

• Symmetric matrix: A square matrix whose transpose is equal to itself,
  i.e. A^T = A, or element-wise ai,j = aj,i. For example,

     A = (  6  19  −8
           19  14   3    ⇒ A^T = A.
           −8   3   1 )

• Diagonal matrix: Square matrix with all off-diagonal elements equal to
  0. For example,

     A3,3 = ( a1  0   0
              0   a2  0   = diag(a1, a2, a3)
              0   0   a3 )

• Identity matrix: A diagonal matrix with all the diagonal elements equal
  to 1, i.e. Im = diag(1, 1, . . . , 1). For example, I3 = diag(1, 1, 1).
  We will see later that Im Am,n = Am,n, and that Am,n In = Am,n.

• Scalar matrix: A diagonal matrix with all the diagonal elements equal
  to the same scalar k, that is kIm. For example, kI3 = diag(k, k, k).

• 1-vector and matrix: The 1-vector is simply a column vector whose
  elements are all 1. Similarly for the matrix, denoted by J. For example,
  J3 is the 3 × 3 matrix with every element equal to 1.
5.2 Basic Matrix Operations
To perform basic matrix operations in R, please visit
http://www.statmethods.net/advstats/matrix.html.
5.2.1 Addition and subtraction

Addition and subtraction are done elementwise for matrices of the same dimension:

   Am,n + Bm,n = ( a1,1 + b1,1   a1,2 + b1,2   · · ·   a1,n + b1,n
                   a2,1 + b2,1   a2,2 + b2,2   · · ·   a2,n + b2,n
                      ...           ...         ...       ...
                   am,1 + bm,1   am,2 + bm,2   · · ·   am,n + bm,n )

Similarly for subtraction.

In regression, let

   Y = (Y1, . . . , Yn)^T,   E(Y) = (E(Y1), . . . , E(Yn))^T,   ǫ = (ǫ1, . . . , ǫn)^T.

The model can be expressed as

   Y = E(Y) + ǫ
5.2.2 Multiplication

We begin with multiplication of a matrix A by a scalar k. Each element of
A is multiplied by k, i.e. kAm,n has (i, j)th element k ai,j.

Multiplication of a matrix by a matrix is only defined if the inner dimensions
are equal, that is, the column dimension of the first matrix equals the row
dimension of the second matrix. That is, Am,n Bp,q is only defined if
n = p. The resulting matrix Am,n Bn,q is of dimension m × q with (i, j)th
element

   [ab]i,j = Σ_{k=1}^n ai,k bk,j,   i = 1, . . . , m,  j = 1, . . . , q.
Example 5.1. Let

   A3,2 = ( 2   5              B2,2 = ( 3  −1
            3  −1                       2   4 )
            0   7 )

then

   AB = ( 2(3) + 5(2)       2(−1) + 5(4)         ( 16  18
          3(3) + (−1)(2)    3(−1) + (−1)(4)   =     7  −7
          0(3) + 7(2)       0(−1) + 7(4) )         14  28 )

Remark 5.1. When AB is defined, the matrix can be expressed as a linear
combination of the

• columns of A

• rows of B

Take example 5.1: the two columns of AB are

   3 (2, 3, 0)^T + 2 (5, −1, 7)^T   and   (−1)(2, 3, 0)^T + 4 (5, −1, 7)^T,

and the three rows of AB are

   2 (3, −1) + 5 (2, 4),   3 (3, −1) + (−1)(2, 4),   0 (3, −1) + 7 (2, 4).
In R:

> A=matrix(c(2,3,0,5,-1,7),3,2); B=matrix(c(3,2,-1,4),2,2)
> A%*%B
     [,1] [,2]
[1,]   16   18
[2,]    7   -7
[3,]   14   28
Remark 5.2. Matrix multiplication is only defined when the inner dimensions
match and as such in example 5.1, AB is defined but BA is not. Even in
cases where both AB and BA are defined, it is not necessarily true that
AB = BA. Take for example

   A = ( 1  2          B = ( 5  6
         3  4 )              7  8 )
Systems of linear equations can also be written in matrix form. For
example, let x1 and x2 be unknown such that

   a1,1 x1 + a1,2 x2 = y1
   a2,1 x1 + a2,2 x2 = y2

This can be expressed as

   ( a1,1  a1,2 ) ( x1 )   ( y1 )
   ( a2,1  a2,2 ) ( x2 ) = ( y2 ),      i.e.   Ax = y.                        (5.1)

Also, sums of squares can be expressed as a vector multiplication,

   Σ_{i=1}^n xi² = x^T x,   where x = (x1, . . . , xn)^T.
Some useful multiplications that we will be using in regression are presented
in the following list:

List 1.

   • Xβ = ( 1  x1 ) ( β0 )   ( β0 + β1 x1
            ...      ( β1 ) =     ...
            1  xn )            β0 + β1 xn )

   • y^T y = Σ_{i=1}^n yi²

   • X^T X = (      n          Σ_{i=1}^n xi
               Σ_{i=1}^n xi    Σ_{i=1}^n xi² )

   • X^T y = ( Σ_{i=1}^n yi
               Σ_{i=1}^n xi yi )
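These pieces are exactly what is needed to compute the least-squares estimates by matrix operations; a minimal sketch with a tiny made-up data set (it anticipates the result b = (X^T X)⁻¹X^T y derived in Section 5.7):

```r
x <- c(1, 2, 3, 4); y <- c(2.2, 3.9, 6.1, 8.3)   # made-up data
X <- cbind(1, x)                                 # design matrix with a column of ones
t(X) %*% X                                       # compare with the form in the list above
t(X) %*% y
b <- solve(t(X) %*% X) %*% t(X) %*% y            # least-squares estimates
b
coef(lm(y ~ x))                                  # same numbers from lm()
```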
5.3 Linear Dependence and Rank

Definition 5.2. Let A be an m × n matrix that is made up of n column
vectors ai, i = 1, . . . , n, each of dimension m, i.e. A = [a1 · · · an]. When n
scalars k1, . . . , kn, not all zero, can be found such that

   Σ_{i=1}^n ki ai = 0,

then the n columns are said to be linearly dependent. If the equality holds
only for k1 = · · · = kn = 0, then the columns are said to be linearly independent.
The definition also holds for rows.
Example 5.2. Consider the matrix

   A = ( 1   0.5   3
         2   7     3
         4   8     9 )

Notice that if we let scalars k1 = 2, k2 = 1, k3 = −1, then

   2 (1, 0.5, 3) + 1 (2, 7, 3) − 1 (4, 8, 9) = (0, 0, 0),

so the rows of A are linearly dependent.
Example 5.3. Consider the simple identity matrix

   I3 = ( 1  0  0
          0  1  0
          0  0  1 )

Notice that the only way to achieve the 0 vector is with scalars k1 = k2 = k3 = 0, i.e.

   0 (1, 0, 0)^T + 0 (0, 1, 0)^T + 0 (0, 0, 1)^T = 0.
Without going into too much detail we present the following definition.
Definition 5.3. The rank of a matrix is the number of linearly independent
columns or rows of the matrix. Hence, rank(Am,n ) ≤ min(m, n). If equality
holds, then the matrix is said to be of full rank.
There are many ways to determine the rank of a matrix, such as counting the
number of non-zero eigenvalues, but the simplest way is to express the matrix
in reduced row echelon form and count the number of non-zero rows.
However, software can calculate it for us (by finding the number of non-zero
eigenvalues).
Example 5.4. Let

   A = ( 0  1  2                ( 1  2  1
         1  2  1    ⇒  Arref =    0  1  2
         2  7  8 )                0  0  0 )
Hence, the rank(A) = 2. Row 3 is a linear combination of Rows 1 and 2.
Specifically, Row 3 = 3*( Row 1 ) + 2*( Row 2 ). Therefore, 3*( Row 1 )
+ 2*( Row 2 ) - ( Row 3)= (Row of zeroes). Hence, matrix A has only two
independent row vectors.
> A=matrix(c(0,1,2,1,2,7,2,1,8),3,3);A
     [,1] [,2] [,3]
[1,]    0    1    2
[2,]    1    2    1
[3,]    2    7    8
> qr(A)$rank
[1] 2

and let

> B=matrix(c(1,2,3,0,1,2,2,0,1),3,3);B
     [,1] [,2] [,3]
[1,]    1    0    2
[2,]    2    1    0
[3,]    3    2    1
> qr(B)$rank
[1] 3
> qr(B)$rank==min(dim(B)) #check if full rank
[1] TRUE
Remark 5.3. Other functions also exist in R that calculate the rank. The
qr() function utilizes the QR decomposition.

Remark 5.4. If we are simply interested in whether a square matrix A is full
rank or not, recall from linear algebra that a matrix is full rank (a.k.a.
nonsingular) if and only if it has a determinant that is not equal to zero, i.e.
|A| ≠ 0. Hence, if A is not of full rank (singular) it has a determinant equal
to 0, i.e. |A| = 0. For example, continuing example 5.4,
> A=matrix(c(0,1,2,1,2,7,2,1,8),3,3);A
     [,1] [,2] [,3]
[1,]    0    1    2
[2,]    1    2    1
[3,]    2    7    8
> det(A)
[1] 0
> qr(A)$rank==min(dim(A))
[1] FALSE
> B=matrix(c(1,2,3,0,1,2,2,0,1),3,3);B
     [,1] [,2] [,3]
[1,]    1    0    2
[2,]    2    1    0
[3,]    3    2    1
> det(B)
[1] 3
> qr(B)$rank==min(dim(B))
[1] TRUE
5.4 Matrix Inverse

Let An,n be a square matrix of full rank, i.e. rank(A) = n. Then A has a
(unique) inverse A⁻¹ such that

   A⁻¹A = AA⁻¹ = In

Computing the inverse of a matrix can be done manually, which requires
finding the reduced row echelon form, but we will utilize software once again.
Example 5.5. Continuing from example 5.4, only B was nonsingular and
hence has an inverse
> solve(B)
           [,1]       [,2]       [,3]
[1,]  0.3333333  1.3333333 -0.6666667
[2,] -0.6666667 -1.6666667  1.3333333
[3,]  0.3333333 -0.6666667  0.3333333
> round(solve(B)%*%B,3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
Example 5.6. In regression,

   (X^T X)⁻¹ = ( 1/n + x̄²/Σ(xi − x̄)²     −x̄/Σ(xi − x̄)²
                 −x̄/Σ(xi − x̄)²            1/Σ(xi − x̄)²  )

Recall from equation (5.1) that a system of equations (with unknown x)
can be expressed in matrix form as Ax = y. Then, if A is nonsingular,

   A⁻¹A x = A⁻¹ y   ⇒   x = A⁻¹ y.
Example 5.7. Assume we have a system of 2 equations,

   12 x1 + 6 x2 = 48
   10 x1 − 2 x2 = 12,

that can be expressed as

   ( 12   6 ) ( x1 )   ( 48 )
   ( 10  −2 ) ( x2 ) = ( 12 )

We can easily check that the 2 × 2 matrix of coefficients is nonsingular and
has inverse

   A⁻¹ = (1/84) (  2    6
                  10  −12 )

so that

   x = A⁻¹ y = (1/84) (  2    6 ) ( 48 )   ( 2 )
                       ( 10 −12 ) ( 12 ) = ( 4 )
5.5 Useful Matrix Results
All rules assume that the matrices are conformable to operations.
• Addition:
– A+B =B+A
– (A + B) + C = A + (B + C)
• Multiplication:
– (AB)C = A(BC)
– C(A + B) = CA + CB
– k(A + B) = kA + kB for scalar k
• Transpose:
– (AT )T = A
– (A + B)T = AT + B T
– (AB)T = B T AT
– (ABC)T = C T B T AT
• Inverse:
– (A−1 )−1 = A
– (AB)−1 = B −1 A−1 (If A and B are non-singular)
– (ABC)−1 = C −1 B −1 A−1 (If A, B and C are non-singular)
– (AT )−1 = (A−1 )T
5.6 Random Vectors and Matrices

Let Y be a random column vector of dimension n, i.e. Y = (Y1, Y2, . . . , Yn)^T.
The expectation of this (multi-dimensional) random variable is

   µ = E(Y) = ( E(Y1), E(Y2), . . . , E(Yn) )^T
and the variance-covariance is an n × n matrix defined as

   V(Y) = E{ [Y − E(Y)][Y − E(Y)]^T }

        = E ( [Y1 − E(Y1)]²               [Y1 − E(Y1)][Y2 − E(Y2)]   · · ·   [Y1 − E(Y1)][Yn − E(Yn)]
              [Y2 − E(Y2)][Y1 − E(Y1)]    [Y2 − E(Y2)]²              · · ·   [Y2 − E(Y2)][Yn − E(Yn)]
                 ...                         ...                      ...        ...
              [Yn − E(Yn)][Y1 − E(Y1)]    [Yn − E(Yn)][Y2 − E(Y2)]   · · ·   [Yn − E(Yn)]²             )

        = ( σ1²    σ1,2   · · ·   σ1,n
            σ2,1   σ2²    · · ·   σ2,n
             ...    ...    ...     ...
            σn,1   σn,2   · · ·   σn²  )

        = Σ   (symmetric).

An alternate form is

   Σ = E(Y Y^T) − µµ^T.
More information can be found at:
https://en.wikipedia.org/wiki/Covariance_matrix
Example 5.8. In the regression model, assuming dimension n, the only
random term is ǫ (which in turn makes Y random) and we assume

   E(ǫ) = 0   and   V(ǫ) = diag(σ², σ², . . . , σ²) = σ² In.

Hence, for the model

   Y = Xβ + ǫ                                                        (5.2)

• E(Y) = E(Xβ) + E(ǫ) = Xβ

• V(Y) = V(Xβ + ǫ) = σ² In
5.6.1 Mean and variance of linear functions of random vectors

Let Am,n be a matrix of scalars and Y n,1 a random vector. Then,

   Wm,1 = AY = ( a1,1 Y1 + a1,2 Y2 + · · · + a1,n Yn
                 a2,1 Y1 + a2,2 Y2 + · · · + a2,n Yn
                    ...
                 am,1 Y1 + am,2 Y2 + · · · + am,n Yn )


E(a1,1 Y1 + a1,2 Y2 + · · · + a1,n Yn )


..

E(W ) = 
.


E(am,1 Y1 + am,2 Y2 + · · · + am,n Yn )


a1,1 E(Y1 ) + a1,2 E(Y2 ) + · · · + a1,n E(Yn )


..

=
.


am,1 E(Y1 ) + am,2 E(Y2 ) + · · · + am,n E(Yn )



E(Y1 )
a1,1 a1,2 · · · a1,n



 a2,1 a2,2 · · · a2,n   E(Y2 ) 



= .
..
..   .. 
..

.
.
.
.  . 
 .
am,1 am,2 · · · am,n
E(Yn )
= AE(Y )
and variance covariance matrix
V (W ) = E [AY − AE(Y )][AY − AE(Y )]T
= E A[Y − E(Y )][Y − E(Y )]T AT
= AE [Y − E(Y )][Y − E(Y )]T AT
= AV (Y )AT
93
5.6.2 Multivariate normal distribution

Let Y n,1 be a random vector with mean µ and variance-covariance Σ, i.e. N(µ, Σ).
Then, if Y is multivariate normal it has p.d.f.

   f(Y) = (2π)^{−n/2} |Σ|^{−1/2} exp{ −(1/2) (Y − µ)^T Σ⁻¹ (Y − µ) },

and each element Yi ∼ N(µi, σi²), i = 1, . . . , n.

Remark 5.5.

• If Am,n is a full rank matrix of scalars, then AY ∼ N(Aµ, AΣA^T).

• (True for any distribution) Two linear functions AU and BU are independent
  if and only if AΣB^T = 0. In particular, this means that Ui and Uj are
  independent if and only if the (i, j)th entry of Σ equals 0.

• Y^T AY ∼ χ²_r(λ) if and only if AΣ is idempotent with rank(AΣ) = r and
  λ = (1/2) µ^T Aµ.

• The quadratic forms Y^T AY and Y^T BY are independent if and only if
  AΣB = 0 (BΣA = 0). As a consequence, the Sums of Squares Error and
  Model (as well as its components) in linear models are independent.
5.7 Estimation and Inference in Regression

Assuming multivariate normal random errors in equation (5.2),

   Y ∼ N(Xβ, σ² In).

5.7.1 Estimating parameters by least squares

For simple linear regression, recall from Section 1.2.1 that to estimate the
parameters we had to solve a system of linear equations by minimizing

   Σ_{i=1}^n (yi − (β0 + β1 xi))² = (y − Xβ)^T (y − Xβ).

The resulting simultaneous equations after taking partial derivatives w.r.t.
β0, β1 and equating to zero are:

   n b0 + b1 Σ xi = Σ yi
   b0 Σ xi + b1 Σ xi² = Σ xi yi

which, using the results of List 1, can be expressed and solved in matrix form:

   X^T X b = X^T y   ⇒   b = (X^T X)⁻¹ X^T y
Remark 5.6. To solve this system we assumed that X T X was nonsingular.
This is nearly always the case for simple linear regression. However, for
multiple regression we will need the following proposition to guarantee that
the unique inverse exists.
Proposition 5.1. Let Xn,p , where n ≥ p. If rank(X) = p, then X T X is
nonsingular, i.e. rank(X T X) = p.
5.7.2 Fitted values and residuals

Fitted response values are

   ŷ = Xb = X(X^T X)⁻¹X^T y = H y,

where H = X(X^T X)⁻¹X^T (of dimension n × n) is called the projection matrix
(that is, if you premultiply a vector by H the result is the projection of that
vector onto the column space of X). Therefore, H is

• idempotent, i.e. HH = H

• symmetric, i.e. H^T = H

The estimated residuals are

   e = y − ŷ = y − Hy = (In − H)y,

where it is easy to check that In − H is also idempotent. As a result,

• E(Ŷ) = E(HY) = H E(Y) = HXβ = X(X^T X)⁻¹X^T Xβ = Xβ

• V(Ŷ) = H σ² In H^T = σ² H, with MSE = σ̂²

• E(e) = E[(In − H)Y] = (In − H)E(Y) = (In − H)Xβ = Xβ − Xβ = 0

• V(e) = (In − H) σ² In (In − H)^T = σ² (In − H), with MSE = σ̂²
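A sketch of these matrix computations on a small made-up data set (any (x, y) vectors would do):

```r
x <- c(1, 2, 3, 4, 5); y <- c(1.8, 4.1, 5.9, 8.2, 9.9)   # made-up data
X <- cbind(1, x)
H <- X %*% solve(t(X) %*% X) %*% t(X)     # hat/projection matrix
all.equal(H %*% H, H)                     # idempotent
yhat <- H %*% y                           # fitted values
e    <- (diag(5) - H) %*% y               # residuals
cbind(yhat, fitted(lm(y ~ x)))            # agrees with lm()
```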
5.7.3 Analysis of variance

Recall that

   SST = Σ_{i=1}^n (yi − ȳ)² = Σ_{i=1}^n yi² − (Σ_{i=1}^n yi)²/n.

Now note that

   y^T y = Σ_{i=1}^n yi²   and   (1/n) y^T J y = (Σ_{i=1}^n yi)²/n.

Therefore,

   SST = y^T y − (1/n) y^T J y = y^T (In − n⁻¹J) y.

Also,

   SSE = e^T e = (y − Xb)^T (y − Xb)
       = y^T y − y^T Xb − b^T X^T y + b^T X^T Xb
       = y^T y − b^T X^T y
       = y^T (In − H) y        since b^T X^T y = y^T Hy.

Finally,

   SSR = SST − SSE = · · · = y^T (H − n⁻¹J) y.

Remark 5.7. Note that SST, SSR and SSE are all of quadratic form, i.e.
y^T Ay for symmetric matrices A.
5.7.4 Inference

Since b = (X^T X)⁻¹X^T y, it is a linear function of the response. The corresponding
random vector can be expressed as B = AY with A = (X^T X)⁻¹X^T. Hence,

• E(B) = A E(Y) = AXβ = β

• V(B) = A V(Y) A^T = σ²(X^T X)⁻¹

and thus,

   B ∼ N(β, σ²(X^T X)⁻¹).

We can also express the C.I. and P.I. of Section 2.2 in matrix form.

• Estimated mean response at xobs:

     ŷ = b0 + b1 xobs = x_obs^T b,   with s_Ŷ = √[ MSE · x_obs^T (X^T X)⁻¹ x_obs ],

  where x_obs = (1, xobs)^T.

• Predicted response at xnew: the point estimate is the same, but

     s_pred = √[ MSE ( 1 + x_new^T (X^T X)⁻¹ x_new ) ].
Chapter 6

Multiple Regression I

This chapter incorporates large sections from Chapter 8 of the textbook.

6.1 Model

The multiple regression model is an extension of the simple regression model
whereby instead of only one predictor, there are multiple predictors to better
aid in the estimation and prediction of the response. The goal is to determine
the effects (if any) of each predictor, controlling for the others.

Let p − 1 denote the number of predictors and (yi, x1,i, x2,i, . . . , xp−1,i)
denote the p dimensional data points for i = 1, . . . , n. The statistical model is

   Yi = β0 + β1 x1,i + · · · + βp−1 xp−1,i + ǫi   ⇔   Yi = Σ_{k=0}^{p−1} βk xk,i + ǫi,
   with x0,i ≡ 1,

for i = 1, . . . , n, where ǫi i.i.d. ∼ N(0, σ²).
Multiple regression models can also include polynomial terms (powers of
predictors). For example, one can define x2,i := x21,i . The model is still
linear as it is linear in the coefficients (β’s). Polynomial terms are useful for
accounting for potential curvature/nonlinearity in the relationship between
predictors and the response. Also, a polynomial term such as x4,i = x1,i x3,i , is
also coined as the interaction term of x1 with x3 . Such terms are of particular
usefulness when an interaction exists between two predictors, i.e. when the
level/magnitude of one predictor has a relationship to the level/magnitude
of the other. For example, one may wish to fit a model with predictor terms,
although there are only 2 unique predictors:
   Yi = β0 + β1 x1,i + β2 x1,i² + β3 x2,i + β4 x1,i x2,i + β5 x1,i² x2,i + ǫi
In p dimensions, we no longer use the term regression line, but rather a
response/regression surface. For example, with p = 3, i.e. 2 predictors and a
response, the fitted model is a surface (a plane, or a curved surface if polynomial
or interaction terms are included) rather than a line.
The interpretation of the slope coefficients now requires an additional
statement. A 1-unit increase in predictor xk will cause the response, y, to
change by amount βk , assuming all other predictors are held constant. In
a model with interaction terms special care needs to be taken. Take for
example
E(Y |x1 , x2 ) = β0 + β1 x1 + β2 x2 + β3 x1 x2
where a 1-unit increase in x2 , i.e. x2 + 1, leads to
E(Y |x1 , x2 + 1) = E(Y |x1 , x2 ) + β2 + β3 x1
The effect of increasing x2 depends on the level of x1 .
6.2 Special Types of Variables
• Distinct numeric predictors. The traditional form for variables used
thus far.
• Polynomial terms. Used to allow for “curves” in the regression/response surface, as discussed earlier.
Example 6.1. In an experiment using flyash % as a factor in a concrete
compression strength test (PSI) for 28 day cured concrete, fitting a simple
linear regression yielded the following:

[Figure: scatterplot of strength versus flyash with the fitted line]
Figure 6.1: First order model
Clearly a linear model in the predictor is not adequate. Maybe a second
order polynomial model might be more adequate,
> flyash2=dat$flyash^2
> reg.2poly=lm(strength~flyash+flyash2,data=dat)
> summary(reg.2poly)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  4486.3611   174.7531  25.673 8.25e-14 ***
flyash         63.0052    12.3725   5.092 0.000132 ***
flyash2        -0.8765     0.1966  -4.458 0.000460 ***
---
Residual standard error: 312.1 on 15 degrees of freedom
Multiple R-squared: 0.6485,  Adjusted R-squared: 0.6016
F-statistic: 13.84 on 2 and 15 DF, p-value: 0.0003933
[Figure: scatterplot of strength versus flyash with the fitted quadratic curve]
Figure 6.2: Second order model
It still seems that there is some room for improvement, hence
> flyash3=dat$flyash^3
> reg.3poly=lm(strength~flyash+flyash2+flyash3,data=dat)
> summary(reg.3poly)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.618e+03  1.091e+02  42.338 3.53e-16 ***
flyash      -2.223e+01  1.812e+01  -1.227 0.240110
flyash2      3.078e+00  7.741e-01   3.976 0.001380 **
flyash3     -4.393e-02  8.498e-03  -5.170 0.000142 ***
---
Residual standard error: 189.4 on 14 degrees of freedom
Multiple R-squared: 0.8792,Adjusted R-squared: 0.8533
F-statistic: 33.95 on 3 and 14 DF, p-value: 1.118e-06
Before we continue, it is important to note that there are (mathematical) limitations to how many predictors can be added to a model.
As a guideline we usually have one predictor per 10 observations.
For example, a dataset with sample size 60 should have at most 6 predictors. The X matrix is n × p dimension so as p ↑ while n remains
constant, we run the risk of X not being full column rank. So in this
example we should only keep 2 predictors at most since we have 18 ≈ 20
observations. From the last output we see that the third and second
order polynomial terms are significant (flyash3 and flyash2) but flyash1
is not significant, given the other two are already incorporated in the
model.
> reg.3polym1=update(reg.3poly,.~.-flyash)
> summary(reg.3polym1)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.549e+03  9.504e+01  47.866  < 2e-16 ***
flyash2      2.166e+00  2.201e-01   9.840 6.18e-08 ***
flyash3     -3.445e-02  3.581e-03  -9.618 8.32e-08 ***

Residual standard error: 192.6 on 15 degrees of freedom
Multiple R-squared: 0.8662,  Adjusted R-squared: 0.8483
F-statistic: 48.54 on 2 and 15 DF, p-value: 2.814e-07

[Figure: strength versus flyash with the 1st, 2nd and 3rd order fitted curves]
Figure 6.3: Third order model
http://www.stat.ufl.edu/~ athienit/STA4210/Examples/poly.R
• Interaction terms. Used when the levels of one predictor influence another. We will see this in example 6.2.
• Transformed variables. Transformed response such as log(Y ) or Y −1
(as seen with Power transformations) to achieve linearity (or to satisfy
other assumptions).
• Categorical predictors. A categorical predictor is a variable with groups
or classification. The basic case with a variable with only two groups
will be illustrated by the following example:
Example 6.2. A study is conducted to determine the effects of company size and the presence or absence of a safety program on the number of hours lost due to work-related accidents. A total of 40 companies
are selected for the study. The variables are as follows:
   y  = lost work hours
   x1 = number of employees
   x2 = 1 if a safety program is used, 0 if no safety program is used

The proposed model,

   Yi = β0 + β1 x1,i + β2 x2,i + ǫi,

implies that

   Yi = (β0 + β2) + β1 x1,i + ǫi   if x2 = 1
   Yi = β0 + β1 x1,i + ǫi          if x2 = 0
When a safety program is used, i.e. x2 = 1, the intercept is β0 + β2 ,
but the slope (for x1 ) remains the same in both cases. A scatterplot of
the data and the associated regression line, differentiated by whether
x2 = 1 or 0, is presented.
[Figure: scatterplot of y versus x1 with parallel fitted lines for x2 = 0 and x2 = 1]
Although the overall fit of the model seems adequate, we see that the
regression line for x2 = 1 (red) does not fit the data well, a fact that
can also be seen by plotting the residuals in the assumption checking
procedure. The model is too restrictive by forcing parallel lines. Adding
an interaction term makes the model less restrictive.
Yi = β0 + β1 x1,i + β2 x2,i + β3 (x1 x2 )i + ǫi
which implies

   Yi = (β0 + β2) + (β1 + β3) x1,i + ǫi   if x2 = 1
   Yi = β0 + β1 x1,i + ǫi                 if x2 = 0

Now, the slope for x1 is allowed to differ for x2 = 1 and x2 = 0.
y = - 1.8 + 0.0197 x1 + 10.7 x2 - 0.0110 x1x2

Predictor       Coef   SE Coef      T      P
Constant       -1.84     10.13  -0.18  0.857
x1          0.019749  0.001546  12.78  0.000
x2             10.73     14.05   0.76  0.450
x1x2       -0.010957  0.002174  -5.04  0.000

S = 17.7488   R-Sq = 89.2%   R-Sq(adj) = 88.3%

Analysis of Variance
Source          DF      SS     MS      F      P
Regression       3   93470  31157  98.90  0.000
Residual Error  36   11341    315
Total           39  104811

Figure 6.4 also shows the better fit.
[Figure: scatterplot of y versus x1 with separate fitted lines for x2 = 0 and x2 = 1]
Figure 6.4: Scatterplot and fitted regression lines.
Remark 6.1. Since the interaction term x1 x2 is deemed significant, then
for model parsimony, all lower order terms of the interaction, i.e. x1
and x2 should be kept in the model, irrespective of their statistical
significance. If x1 x2 is significant then intuitively x1 and x2 are of
importance (maybe not in the statistical sense).
Now let's try to perform inference on the slope coefficient for x1.
From the previous equation we saw that the slope takes on two values
depending on the value of x2.

– For x2 = 0, it is just β1 and inference is straightforward...right?

– For x2 = 1, it is β1 + β3. We can estimate this with b1 + b3 but
  the variance is not known to us. From equation (3) we have that

     V(B1 + B3) = V(B1) + V(B3) + 2 Cov(B1, B3).

  The sample variances and covariances can be found from the
  V̂(B) = MSE (X^T X)⁻¹ covariance matrix, or obtained in R using
  the vcov function. Then, create a 100(1 − α)% CI for β1 + β3:

     b1 + b3 ∓ t_{1−α/2, n−p} √( s²_{b1} + s²_{b3} + 2 s_{b1,b3} )
Remark 6.2. This concept can easily be extended to linear combinations
of more that two coefficients.
http://www.stat.ufl.edu/~ athienit/IntroStat/safe_reg.R
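The script above is not reproduced here, but a minimal sketch of the computation, assuming a fitted object safety.reg (hypothetical name) for the model with the x1 x2 interaction and numeric predictors named x1 and x2:

```r
# safety.reg <- lm(y ~ x1 + x2 + x1:x2, data = safety)   # hypothetical fit
b <- coef(safety.reg); V <- vcov(safety.reg)             # V = MSE (X'X)^{-1}
est <- b["x1"] + b["x1:x2"]                              # slope of x1 when x2 = 1
se  <- sqrt(V["x1", "x1"] + V["x1:x2", "x1:x2"] + 2 * V["x1", "x1:x2"])
est + c(-1, 1) * qt(0.975, df.residual(safety.reg)) * se # 95% CI for beta1 + beta3
```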
In the previous example the qualitative predictor only had two levels,
the use or the lack of use of a safety program. To fully state all
levels only one dummy/indicator predictor was necessary. In general,
if a qualitative predictor has k levels, then k − 1 dummy/indicator
predictor variables are necessary. For example, a qualitative predictor
for a traffic light has three levels:

– red,
– yellow,
– green.

Therefore, only two binary predictors are necessary to fully model this
scenario:

   xred = 1 if red, 0 otherwise        xyellow = 1 if yellow, 0 otherwise

Breaking it down by case, the X matrix has the following form:

   Color    intercept   xred   xyellow
   Red          1         1       0
   Yellow       1         0       1
   Green        1         0       0
This restriction is usually expressed as β_base group = 0, where green is
the base group in this situation, and the model is

   Yi = β0 + β1 xred_i + β2 xyellow_i + ǫi

and hence the mean line, piecewise, is

   E(Y) = β0 + β1   if red
          β0 + β2   if yellow
          β0        if green

Notice that if we created xgreen the X matrix would no longer be full
column rank.
Remark 6.3. However, other restrictions do exist to make X full column
rank too.
– The sum-to-zero restriction

      β1 + β2 + β3 = 0  ⇒  β3 = −β1 − β2,

  i.e. the coefficients that correspond to the levels of the qualitative
  predictor (not all β's) sum to 0. So, green can be written as a
  linear combination of red and yellow. The model is
Yi = β0 + β1 xredi + β2 xyellowi + β3 xgreeni + ǫi
and hence the mean line, piecewise is
    E(Y) = β0 + β1           if red
           β0 + β2           if yellow
           β0 − β1 − β2      if green
for this case the X matrix has the form
Color    intercept   xred   xyellow
Red      1           1      0
Yellow   1           0      1
Green    1           -1     -1
– The model with no intercept/through the origin
Yi = β1 xredi + β2 xyellowi + β3 xgreeni + ǫi
and hence the mean line, piecewise is

    E(Y) = β1    if red
           β2    if yellow
           β3    if green
for this case the X matrix has the form
Color    xred   xyellow   xgreen
Red      1      0         0
Yellow   0      1         0
Green    0      0         1
So now we have seen three alternative ways, but we will be using the
base group approach as is done in R. The model through the origin has
issues as discussed in an earlier section and the sum to zero implies
that some parameters have to be expressed as linear combinations of
others.
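A brief sketch of how the three parameterizations can be requested in R (again with the hypothetical light factor and a hypothetical response y in a data frame dat; the first is R's default):

# Base-group (treatment) coding -- R's default:
lm(y ~ light, data = dat)
# Sum-to-zero coding:
lm(y ~ light, data = dat, contrasts = list(light = contr.sum))
# No intercept / cell-means coding (one coefficient per level):
lm(y ~ 0 + light, data = dat)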
Remark 6.4. The color variable has three categories; one may argue
that color (in some contexts) is an ordinal qualitative predictor and
therefore scores can be assigned, making it quantitative. In terms of
frequency (or wavelength) there is also an ordering:

Color    Frequency (THz)   Score
Red      400-484           442
Yellow   508-526           517
Green    526-606           566
Instead of creating 2 dummy/indicator variables we can create one
quantitative variable using the midpoint of the frequency band.
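A small R sketch of this scoring idea (hypothetical data; the scores are the band midpoints from the table above):

# Replace the factor by a numeric score (midpoint of the frequency band, in THz).
scores <- c(red = 442, yellow = 517, green = 566)
color  <- factor(c("red", "green", "yellow", "red"))   # hypothetical observations
x_freq <- scores[as.character(color)]                  # one quantitative predictor
x_freq
#   red  green yellow    red
#   442    566    517    442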
Example 6.3. Three different drugs are considered, drug A, B and C. Each
is administered at 4 dosage levels and the response is measured
        Product
Dose    A     B     C
0.2     2.0   1.8   1.3
0.4     4.3   4.1   2.0
0.8     6.5   4.9   2.8
1.6     8.9   5.7   3.4
Let d = dosage level and let

    pB = 1 if drug B, 0 otherwise        pC = 1 if drug C, 0 otherwise
The model (that includes the interaction term) is
Yi = β0 + β1 di + β2 pB + β3 pC + β4 (dpB )i + β5 (dpC )i + ǫi
and

    E(Y) = β0 + β1 di                    if drug A
           β0 + β2 + (β1 + β4) di        if drug B
           β0 + β3 + (β1 + β5) di        if drug C
[Figure: Response versus Dose for Products A, B and C.]
With a simple visual inspection we see that the model fit is not adequate.
A log transformation on dosage seems to help
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        7.3072     0.2103  34.748 3.79e-08 ***
logDose            3.3038     0.2186  15.111 5.30e-06 ***
ProductB          -2.1548     0.2974  -7.245 0.000351 ***
ProductC          -4.3486     0.2974 -14.622 6.42e-06 ***
logDose:ProductB  -1.5004     0.3092  -4.853 0.002844 **
logDose:ProductC  -2.2795     0.3092  -7.372 0.000319 ***
---

Residual standard error: 0.3389 on 6 degrees of freedom
Multiple R-squared: 0.9877, Adjusted R-squared: 0.9774
F-statistic: 96.3 on 5 and 6 DF,  p-value: 1.207e-05
[Figure: Response versus logDose for Products A, B and C.]
The coefficients for the interactions are both significant and negative, so
the slope for logDose is:
drug A: 3.3038
drug B: 3.3038 − 1.5004 = 1.8034
drug C: 3.3038 − 2.2795 = 1.0243
We can test whether the slope for B is different than that for A, by testing
β4 = 0, and for C versus A, by testing β5 = 0 (since A is the base group). A
question that may arise is if the slope for logDose is the same for drug B as
it is for drug C. That is H0 : β4 = β5 . We will see in the next chapter how
to actually perform this test. In the meantime we can create a 95% CI for
β4 − β5 .
> vmat=vcov(modelfull);round(vmat,3)
                 (Intercept) logDose ProductB ProductC logDose:ProductB logDose:ProductC
(Intercept)            0.044   0.027   -0.044   -0.044           -0.027           -0.027
logDose                0.027   0.048   -0.027   -0.027           -0.048           -0.048
ProductB              -0.044  -0.027    0.088    0.044            0.054            0.027
ProductC              -0.044  -0.027    0.044    0.088            0.027            0.054
logDose:ProductB      -0.027  -0.048    0.054    0.027            0.096            0.048
logDose:ProductC      -0.027  -0.048    0.027    0.054            0.048            0.096
> d=diff(coefficients(modelfull)[6:5]);names(d)=NULL;d
[1] 0.7791
> d+c(1,-1)*qt(0.025,6)*sqrt(vmat[5,5]+vmat[6,6]-2*vmat[5,6])
[1] 0.02247186 1.53563879
and note that 0 is not in the interval, and conclude that β4 > β5 , the slope
of logDose under drug B is larger than that for C.
Remark 6.5. Since we are in fact making multiple comparisons, A vs B, A vs
C and B vs C, we should probably adjust using Bonferroni’s or some other
multiple comparison adjustment.
There is however a simpler way. If we make drug C the base group,
instead of A, the (different) model would be
Yi = β0 + β1 di + β2 pA + β3 pB + β4 (dpA )i + β5 (dpB )i + ǫi
so the model under
drug A: Yi = β0 + β2 + (β1 + β4 )di + ǫi
drug B: Yi = β0 + β3 + (β1 + β5 )di + ǫi
drug C: Yi = β0 + β1 di + ǫi
so comparing the slope for logDose between drug B and C, simply involves
performing inference on β5 .
> ds_base_c=transform(ds,Product=relevel(Product,"C"))
> model_bc=lm(Response~logDose*Product,data=ds_base_c)
> summary(model_bc)
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        2.9586     0.2103  14.069 8.05e-06 ***
logDose            1.0243     0.2186   4.685 0.003378 **
ProductA           4.3486     0.2974  14.622 6.42e-06 ***
ProductB           2.1938     0.2974   7.377 0.000318 ***
logDose:ProductA   2.2795     0.3092   7.372 0.000319 ***
logDose:ProductB   0.7791     0.3092   2.520 0.045312 *
---

Residual standard error: 0.3389 on 6 degrees of freedom
Multiple R-squared: 0.9877, Adjusted R-squared: 0.9774
F-statistic: 96.3 on 5 and 6 DF,  p-value: 1.207e-05
which corresponds to the term “logDose:ProductB” in the output.
http://www.stat.ufl.edu/~athienit/STA4210/Examples/drug.R
6.3 Matrix Form
This section is merely an extension of section 5.7. The model is of the same form, just with different dimensions for some terms:

    Yn×1 = Xn×p βp×1 + ǫn×1
Estimates, fitted values, residuals, standard errors and sums of squares
are of the same form as in section 5.7. The differences/generalizations are:
• The degrees of freedom are

        df
  SSR   p − 1
  SSE   n − p
  SST   n − 1

  This is because we now have to estimate p parameters for our “mean”,
  that is, our response surface.
• The expected sums of squares are:
  – E(MSE) = σ2
  – E(MSR) = σ2 + [ Σk=1..p−1 βk2 SSkk + Σk′≠k βk βk′ SSkk′ ] / (p − 1),
    where SSkk′ = Σi=1..n (xik − x̄k)(xik′ − x̄k′).
It can be shown that,
E(MSR) ≥ E(MSE)
with equality holding only if β1 = · · · = βp−1 = 0. Therefore, to test
H0 : β1 = · · · = βp−1 = 0 vs Ha : not all β’s equal zero
we use the test statistic

    T.S. = MSR / MSE ∼ Fp−1,n−p under H0          (6.1)

and reject the null when p-value = P(Fp−1,n−p ≥ T.S.) < α.
• Intuitively, we note that SSR will always increase (or, equivalently, SSE
  decreases) as we include more predictors in the model. This is
  because the fitted values (from a more complicated model) will better
  fit the observed values of the response. However, any increase in SSR,
  no matter how minuscule, will cause R2 to increase. The question is:
  “Is the gain in SSR worth the added model complexity?”
  This has led to the introduction of the adjusted R2, defined as

      R2adj := R2 − (1 − R2) (p − 1)/(n − p) = 1 − MSE / [SST/(n − 1)]          (6.2)

  where the subtracted term (1 − R2)(p − 1)/(n − p) acts as a penalizing function.
  As p − 1 increases, R2 increases, but the second term, which is subtracted
  from R2, also increases. Hence, the second term can be thought of as a
  penalizing factor.
Example 6.4. A linear regression model of 50 observations with 3 predictors
may yield R2(1) = 0.677, and an addition of 2 “unimportant”
predictors yields a slight increase to R2(2) = 0.679. This increase does
not seem to be worth the added model complexity. Notice that

    R2(1)adj = 0.677 − (1 − 0.677)(3/46) = 0.6559
    R2(2)adj = 0.679 − (1 − 0.679)(5/44) = 0.6425

so R2adj has decreased from model (1) to model (2).
• Inferences on the individual β's follow from sections 2.1 and 5.7. The
  only difference is that the degrees of freedom for the t-distribution are n − p
  (instead of n − 2). For example, to test H0 : βk = βk0,

      T.S. = (bk − βk0) / sbk ∼ tn−p under H0          (6.3)
An individual test on βk tests the significance of predictor k, assuming
all other predictors j, j ≠ k, are included in the model. This
can lead to different conclusions depending on what other predictors
are included in the model. We shall explore this in more detail in the
next chapter.
Consider the following theoretical toy example. Someone wishes to
measure the area of a square (the response) using as predictors two
potential variables, the length and the height of the square. Due to
measurement error, replicate measurements are taken.
– A simple linear regression is fitted with length as the only predictor, x = length. For the test H0 : β1 = 0, do you think that we
would reject H0 , i.e. is length a significant predictor of area?
– Now assume that a multiple regression model is fitted with both
predictors, x1 = length and x2 = height. Now, for the test H0 :
β1 = 0, do you think that we would reject H0 , i.e. is length a
significant predictor of area given that height is already included
in the model?
This scenario is defined as confounding. In the toy example, “height” is
a confounding variable, i.e. an extraneous variable in a statistical model
that correlates with both the response variable and another predictor
variable.
• Confidence intervals on the mean response and prediction intervals are
  performed as in section 5.7, with the exception that
  – the degrees of freedom for the t-distribution are now n − p
  – xobs (or xnew) is

        xobs = (1, x1,obs , . . . , xp−1,obs )T
– The matrix X is an n × p matrix with columns being the predictors, i.e. X = [1 x1 · · · xp−1 ]
– and for g simultaneous intervals
∗ the Bonferroni critical value is t1−α/(2g),n−p , that is, the degrees
of freedom change
∗ the Working-Hotelling critical value is

      W = √( p F1−α,p,n−p )
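For example, a minimal R sketch of computing these two critical values (for hypothetical values of n, p, g and α):

n <- 45; p <- 3; g <- 4; alpha <- 0.05          # hypothetical values
t_bonf <- qt(1 - alpha/(2*g), df = n - p)       # Bonferroni critical value
W      <- sqrt(p * qf(1 - alpha, p, n - p))     # Working-Hotelling critical value
c(Bonferroni = t_bonf, WorkingHotelling = W)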
Example 6.5. In a biological experiment, researchers wanted to model the
biomass of an organism with respect to salinity (SAL), acidity (pH), potassium (K), sodium (Na) and zinc (Zn), with a sample size of 45. The full
model yielded the following results:
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 171.06949 1481.15956   0.115  0.90864
salinity     -9.11037   28.82709  -0.316  0.75366
pH          311.58775  105.41592   2.956  0.00527
K            -0.08950    0.41797  -0.214  0.83155
Na           -0.01336    0.01911  -0.699  0.48877
Zn           -4.47097   18.05892  -0.248  0.80576

Residual standard error: 477.8 on 39 degrees of freedom
Multiple R-squared: 0.4867, Adjusted R-squared: 0.4209
F-statistic: 7.395 on 5 and 39 DF,  p-value: 5.866e-05
Analysis of Variance Table

Response: biomass
          Df  Sum Sq Mean Sq F value    Pr(>F)
salinity   1  121832  121832  0.5338    0.4694
pH         1 7681463 7681463 33.6539 9.782e-07 ***
K          1  464316  464316  2.0343    0.1617
Na         1  157958  157958  0.6920    0.4105
Zn         1   13990   13990  0.0613    0.8058
Residuals 39 8901715  228249
Notice that the ANOVA table has broken down SSR with 5 df into 5 components. We will discuss the sequential sum of squares breakdown in the next
chapter. For now if we sum the SS for each of the predictors we will get
SSR= 8439559
Analysis of Variance

Source           DF   SS         MS         F       P
Regression        5   8439559    1687912    7.395   0.000
Residual Error   39   8901715    228249.1
Total            44   17341274
Assuming all the model assumptions are met, we first take a look at the
overall fit of the model.
H0 : β1 = · · · = β5 = 0 vs Ha : at least one of them ≠ 0
The test statistic value is T.S. = 7.395 with an associated p-value of approximately 0 (found using an F5,39 distribution). Hence, at least one predictor
appears to be significant. In addition, the coefficient of determination, R2 , is
48.67%, indicating that a large proportion of the variability in the response
can be accounted for by the regression model.
Looking at the individual tests, pH is significant given all the other predictors with a p-value of 0.00527, but salinity, K, Na and Zn have large p-values
(from the individual tests). Table 6.1 provides the pairwise correlations of
the quantitative predictor variables.
          biomass  salinity      pH       K      Na      Zn
biomass         .    -0.084   0.669  -0.150  -0.219  -0.503
salinity        .         .  -0.051  -0.021   0.162  -0.421
pH              .         .       .   0.019  -0.038  -0.722
K               .         .       .       .   0.792   0.074
Na              .         .       .       .       .   0.117
Zn              .         .       .       .       .       .

Table 6.1: Pearson correlation and associated p-value
Notice that pH and Zn are highly negatively correlated, so it seems reasonable to attempt to remove Zn as its p-value is 0.80576 (and pH’s p-value
is small). Also, there is a strong positive correlation between K and Na and
since both their p-values are large at 0.83155 and 0.48877 respectively, we
should attempt to remove K (but not both). Although we will see later how
to perform simultaneous inference it is more advisable to test one predictor
at a time. In effect we will perform backwards elimination. That is, start
with a complete model and see which predictors we can remove, one at a
time.
1. Remove K, which has the highest individual test p-value.

   Coefficients:
                Estimate Std. Error t value Pr(>|t|)
   (Intercept)  72.02975 1390.21648   0.052  0.95894
   salinity     -7.22888   27.12606  -0.266  0.79123
   pH          314.31346  103.38903   3.040  0.00416 **
   Na           -0.01667    0.01106  -1.507  0.13972
   Zn           -3.73299   17.51434  -0.213  0.83230

   Residual standard error: 472 on 40 degrees of freedom
   Multiple R-squared: 0.4861, Adjusted R-squared: 0.4347
   F-statistic: 9.458 on 4 and 40 DF, p-value: 1.771e-05
   We note that R2adj has actually gone up. That is, even though
   SSR is smaller for this model (than for the one with K also in it), the penalizing function now doesn't penalize as much. So, K was not necessary.
   Also note how the p-value for Na has dropped from 0.4105 to 0.13972. That is
   mainly due to the correlation between K and Na.
2. Remove Zn, which has the highest individual test p-value.

   Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
   (Intercept) -188.93696  650.73603  -0.290    0.773
   salinity      -3.18957   19.18052  -0.166    0.869
   pH           332.67478   56.49655   5.888 6.24e-07 ***
   Na            -0.01743    0.01036  -1.682    0.100

   Residual standard error: 466.5 on 41 degrees of freedom
   Multiple R-squared: 0.4855, Adjusted R-squared: 0.4478
   F-statistic: 12.9 on 3 and 41 DF, p-value: 4.5e-06

   R2adj is still increasing.
3. Remove salinity.

   Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
   (Intercept) -282.86356  319.38767  -0.886   0.3809
   pH           333.10556   55.78001   5.972 4.36e-07 ***
   Na            -0.01770    0.01011  -1.752   0.0871 .

   Residual standard error: 461.1 on 42 degrees of freedom
   Multiple R-squared: 0.4851, Adjusted R-squared: 0.4606
   F-statistic: 19.79 on 2 and 42 DF, p-value: 8.82e-07

   R2adj is still increasing.
4. Now the question is whether we should remove Na, as its p-value is
“small-ish”.
   Coefficients:
               Estimate Std. Error t value Pr(>|t|)
   (Intercept)  -593.63     271.89  -2.183   0.0345 *
   pH            336.79      57.07   5.902 5.08e-07 ***

   Residual standard error: 472 on 43 degrees of freedom
   Multiple R-squared: 0.4475, Adjusted R-squared: 0.4347
   F-statistic: 34.83 on 1 and 43 DF, p-value: 5.078e-07

   But now R2adj has decreased, so it is beneficial to keep Na (with respect
   to the R2adj criterion).
We can also create CI and/or PI using this model, and with the use of
software, we do not actually have to compute any of the matrices.
> newdata=data.frame(pH=4.15,Na=10000)
> predict(modu3, newdata, interval="prediction",level=0.95)
       fit       lwr      upr
1 922.4975 -29.45348 1874.448
http://www.stat.ufl.edu/~athienit/STA4210/Examples/linthurst.R
Chapter 7
Multiple Regression II
For a given dataset, the total sum of squares (SST) remains the same, no
matter what predictors are included (when no missing values exist among
variables) as the formula does not involve any x’s . As we include more
predictors, the regression sum of squares (SSR) does not decrease (think of
it as increasing), and the error sum of squares (SSE) does not increase.
7.1 Extra Sums of Squares
7.1.1 Definition and decompositions
• When a model contains just x1 , we denote: SSR(x1 ), SSE(x1 )
• Model Containing x1 , x2 : SSR(x1 , x2 ), SSE(x1 , x2 )
• Predictive contribution of x2 above that of x1 :
SSR(x2 |x1 ) = SSE(x1 ) − SSE(x1 , x2 ) = SSR(x1 , x2 ) − SSR(x1 )
This can be extended to any number of predictors. Let's take a look at some
formulas for models with 3 predictors
SST = SSR(x1 ) + SSE(x1 )
= SSR(x1 , x2 ) + SSE(x1 , x2 )
= SSR(x1 , x2 , x3 ) + SSE(x1 , x2 , x3 )
and
SSR(x1 |x2 ) = SSR(x1 , x2 ) − SSR(x2 )
= SSE(x2 ) − SSE(x1 , x2 )
SSR(x2 |x1 ) = SSR(x1 , x2 ) − SSR(x1 )
= SSE(x1 ) − SSE(x1 , x2 )
SSR(x3 |x2 , x1 ) = SSR(x1 , x2 , x3 ) − SSR(x1 , x2 )
= SSE(x1 , x2 ) − SSE(x1 , x2 , x3 )
SSR(x2 , x3 |x1 ) = SSR(x1 , x2 , x3 ) − SSR(x1 )
= SSE(x1 ) − SSE(x1 , x2 , x3 )
Similarly you can find other terms such as SSR(x2 |x1 , x3 ), SSR(x2 , x1 |x3 ) and
so forth. Using some of this notation we find that
SSR(x1 , x2 , x3 ) = SSR(x1 ) + SSR(x2 |x1 ) + SSR(x3 |x1 , x2 )
= SSR(x2 ) + SSR(x1 |x2 ) + SSR(x3 |x1 , x2 )
= SSR(x1 ) + SSR(x2 , x3 |x1 )
For multiple regression when we request the ANOVA table in R, we obtain a
table where SSR is decomposed by sequential sums of squares.
Source        SS                 df      MS
Regression    SSR(x1, x2, x3)    3       MSR(x1, x2, x3)
  x1          SSR(x1)            1       MSR(x1)
  x2|x1       SSR(x2|x1)         1       MSR(x2|x1)
  x3|x1,x2    SSR(x3|x1, x2)     1       MSR(x3|x1, x2)
Error         SSE(x1, x2, x3)    n − 4   MSE(x1, x2, x3)
Total         SST                n − 1
The sequential regression sums of squares differ depending on the order in which
the variables are entered.
Example 7.1. Let us take a look at example 7.1 from the textbook.
> dat=read.table("http://www.stat.ufl.edu/~rrandles/sta4210/Rclassnotes/
+ data/textdatasets/KutnerData/Chapter%20%207%20Data%20Sets/CH07TA01.txt",
+ col.names=c("X1","X2","X3","Y"))
> reg123=lm(Y~X1+X2+X3,data=dat)
> summary(reg123)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  117.085     99.782   1.173    0.258
X1             4.334      3.016   1.437    0.170
X2            -2.857      2.582  -1.106    0.285
X3            -2.186      1.595  -1.370    0.190

Residual standard error: 2.48 on 16 degrees of freedom
Multiple R-squared: 0.8014, Adjusted R-squared: 0.7641
F-statistic: 21.52 on 3 and 16 DF, p-value: 7.343e-06
From the F-test we see that at least one predictor is significant. However,
the individual tests indicate that the predictors are not significant. We will
investigate this later but this is because we are testing an individual predictor
given all the other predictors. It will be helpful to view the sequential sum
of squares
Listing 7.1: Order 123 model
> anova(reg123)
Analysis of Variance Table
Response: Y
          Df Sum Sq Mean Sq F value    Pr(>F)
X1         1 352.27  352.27 57.2768 1.131e-06 ***
X2         1  33.17   33.17  5.3931   0.03373 *
X3         1  11.55   11.55  1.8773   0.18956
Residuals 16  98.40    6.15
Note that
• SSR(x1 ) = 352.27, so x1 contributes a lot
• SSR(x2|x1) = 33.17, so x2 contributes some above and beyond what x1 contributes
• SSR(x3|x1, x2) = 11.55, so x3 does not seem to contribute much above and
  beyond x1 and x2
If we switch the order in which the variables are entered
Listing 7.2: Order 213 model
> reg213=lm(Y~X2+X1+X3,data=dat)
> anova(reg213)
Analysis of Variance Table
Response: Y
          Df Sum Sq Mean Sq F value    Pr(>F)
X2         1 381.97  381.97 62.1052 6.735e-07 ***
X1         1   3.47    3.47  0.5647    0.4633
X3         1  11.55   11.55  1.8773    0.1896
Residuals 16  98.40    6.15
We note that x2 seems to be significant on its own, but that x1 does not
contribute anything above and beyond x2 . Next we also try having x3 first.
Listing 7.3: Order 321 model
> reg321=lm(Y~X3+X2+X1,data=dat)
> anova(reg321)
Analysis of Variance Table
Response: Y
          Df Sum Sq Mean Sq F value    Pr(>F)
X3         1  10.05   10.05  1.6343    0.2193
X2         1 374.23  374.23 60.8471 7.684e-07 ***
X1         1  12.70   12.70  2.0657    0.1699
Residuals 16  98.40    6.15
We note that x3 even on its own does not appear to be significant. We shall
talk about the tests we see here in the next section.
http://www.stat.ufl.edu/~athienit/STA4210/Examples/bodyfat.R
7.1.2 Inference with extra sums of squares
Let p − 1 denote the total number of predictors in a model. Then, we can
simultaneously test for the significance of k(≤ p) predictors. For example,
let p − 1 = 3 and the full model is
Yi = β0 + β1 x1,i + β2 x2,i + β3 x3,i + ǫi
Now, assume we wish to test whether we can simultaneously remove the first
and the third predictors, i.e. x1 and x3. Consequently, we wish to test the
hypotheses
hypotheses
H0 : β1 = β3 = 0 (given x2) vs Ha : at least one of them ≠ 0
In effect we wish to compare the full model to the reduced model
Yi = β0 + β2 x2,i + ǫi
Remark 7.1. A full model does not necessarily imply a model with all the
predictors. It simply means a model that has more predictors than the
reduced model, i.e. a “fuller” model.
The SSE of the reduced model will be larger than the SSE of the full
model, as it only has two of the predictors of the full model and can never
fit the data better. The general test statistic is based on comparing the
difference in SSE of the reduced model to the full model.
    T.S. = [ (SSEred − SSEfull) / (dfEred − dfEfull) ] / [ SSEfull / dfEfull ]  ∼ Fν1,ν2 under H0          (7.1)

where

• ν1 = dfEred − dfEfull
• ν2 = dfEfull

and the p-value for this test is always the area to the right under the F-distribution,
i.e. P(Fν1,ν2 ≥ T.S.).
In our example we have that
• SSEred − SSEfull = SSE(x2) − SSE(x1, x2, x3) = SSR(x1, x3|x2)
• dfEred − dfEfull = (n − 2) − (n − 4) = 2

and hence equation (7.1) becomes

    T.S. = [SSR(x1, x3|x2)/2] / [SSE(x1, x2, x3)/(n − 4)] = MSR(x1, x3|x2) / MSE(x1, x2, x3) ∼ F2,n−4 under H0
Remark 7.2. Note that ν1 = dfEred − dfEfull always equals the number of
independent restrictions imposed on the β's by the null hypothesis in
a simultaneous test. In the previous example H0 : β1 = β3 = 0 meant 2
degrees of freedom but H0 : β1 = β3 is only 1 degree of freedom. We shall
see examples in the section “Other Linear Tests”.
Example 7.2. From example 7.1, assume we wish to test
H0 : β1 = β3 = 0 (given x2 )
We need to fit the reduced model and obtain the information necessary for
equation (7.1).
> reg2=update(reg123,.~.-X1-X3)
> anova(reg2,reg123)
Analysis of Variance Table

Model 1: Y ~ X2
Model 2: Y ~ X1 + X2 + X3
  Res.Df     RSS Df Sum of Sq     F Pr(>F)
1     18 113.424
2     16  98.405  2    15.019 1.221  0.321
With a large p-value we fail to reject the null hypothesis, and drop x1 and
x3 . Remember that we actually recommend not performing simultaneous tests but one variable at a time.
Special cases
• The output we saw in example 7.1, listing 7.1 (and the other listings), also provided us with some default F-tests
> anova(reg123)
Response: Y
          Df Sum Sq Mean Sq F value    Pr(>F)
X1         1 352.27  352.27 57.2768 1.131e-06 ***
X2         1  33.17   33.17  5.3931   0.03373 *
X3         1  11.55   11.55  1.8773   0.18956
Residuals 16  98.40    6.15
– The first T.S. = 57.2768 tests whether x1 is significant without any
  other predictors, with an F-test with 1 and 16 degrees of freedom:

      T.S. = SSR(x1) / MSE(x1, x2, x3) = MSR(x1) / MSE(x1, x2, x3)

– The second T.S. = 5.3931 tests whether x2 is significant above and
  beyond x1, with an F-test with 1 and 16 degrees of freedom:

      T.S. = SSR(x2|x1) / MSE(x1, x2, x3) = MSR(x2|x1) / MSE(x1, x2, x3)

– The third T.S. = 1.8773 tests whether x3 is significant above and
  beyond (x1, x2), with an F-test with 1 and 16 degrees of freedom:

      T.S. = SSR(x3|x1, x2) / MSE(x1, x2, x3) = MSR(x3|x1, x2) / MSE(x1, x2, x3)
• One coefficient. Assume we wish to test H0 : β3 = 0. We can either
  perform a t-test according to bullet 6.3 and equation (6.3),

      (b3 − 0) / sb3 ∼ tn−4 under H0

  Equivalently, we can still use equation (7.1) and note that
  – SSEred − SSEfull = SSE(x1, x2) − SSE(x1, x2, x3) = SSR(x3|x1, x2)
  – dfEred − dfEfull = 1
  yielding

      T.S. = [SSR(x3|x1, x2)/1] / [SSE(x1, x2, x3)/(n − 4)] = MSR(x3|x1, x2) / MSE(x1, x2, x3) ∼ F1,n−4 under H0

  with p-value = P(F1,n−4 ≥ T.S.).
Back to example 7.1 we have the t-tests and can see the equivalent
F-tests that have the same p-value.
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  117.085     99.782   1.173    0.258
X1             4.334      3.016   1.437    0.170
X2            -2.857      2.582  -1.106    0.285
X3            -2.186      1.595  -1.370    0.190
> library(car)
> SS2=Anova(reg123,type=2);SS2 #notice same p-values
Anova Table (Type II tests)

Response: Y
          Sum Sq Df F value Pr(>F)
X1        12.705  1  2.0657 0.1699
X2         7.529  1  1.2242 0.2849
X3        11.546  1  1.8773 0.1896
Residuals 98.405 16
• All coefficients (except intercept). Assume we wish to test
H0 : β1 = · · · = β3 = 0 vs Ha : not all β’s equal zero
We proceed in exactly the same way as bullet 6.3 and equation (6.1).
This is because the model under the null (reduced model) is
    Yi = β0 + ǫi  ⇔  Yi = µ + ǫi,

and thus SSEred = SST and dfEred = n − 1. Therefore,

    T.S. = [(SST − SSE)/((n − 1) − (n − 4))] / [SSE/(n − 4)] = (SSR/3) / [SSE/(n − 4)] = MSR(x1, x2, x3) / MSE(x1, x2, x3) ∼ F3,n−4 under H0
In example 7.1, we can see this F-test in the summary.
> summary(reg123)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  117.085     99.782   1.173    0.258
X1             4.334      3.016   1.437    0.170
X2            -2.857      2.582  -1.106    0.285
X3            -2.186      1.595  -1.370    0.190

Residual standard error: 2.48 on 16 degrees of freedom
Multiple R-squared: 0.8014, Adjusted R-squared: 0.7641
F-statistic: 21.52 on 3 and 16 DF, p-value: 7.343e-06
7.2 Other Linear Tests
There are circumstances where we do not necessarily wish to test whether
a coefficient equals 0, or whether a group of coefficients all equal zero. For
example, consider the (full) model
Yi = β0 + β1 x1,i + β2 x2,i + β3 x3,i + ǫi
and we wish to test
• H0 : β1 = β2 = β3 . Under this null the reduced model is
    Yi = β0 + β1 x1,i + β1 x2,i + β1 x3,i + ǫi = β0 + β1 (x1,i + x2,i + x3,i) + ǫi = β0 + β1 zi + ǫi,
  where zi = x1,i + x2,i + x3,i.
The resulting F-test from equation (7.1) would have an F2,n−4 distribution.
• H0 : β3 = β1 + β2 . Under this null the reduced model is
    Yi = β0 + β1 x1,i + β2 x2,i + (β1 + β2) x3,i + ǫi = β0 + β1 (x1,i + x3,i) + β2 (x2,i + x3,i) + ǫi = β0 + β1 z1,i + β2 z2,i + ǫi,
  where z1,i = x1,i + x3,i and z2,i = x2,i + x3,i.
The resulting F-test from equation (7.1) would have an F1,n−4 distribution.
• H0 : β0 = 10, β3 = 1. Under this null the reduced model is
    Yi = 10 + β1 x1,i + β2 x2,i + x3,i + ǫi   ⇒   Yi − 10 − x3,i = β1 x1,i + β2 x2,i + ǫi,
  i.e. Yi⋆ = β1 x1,i + β2 x2,i + ǫi with Yi⋆ = Yi − 10 − x3,i,
which is regression through the origin. The resulting F-test from equation (7.1) would have an F2,n−4 distribution.
Example 7.3. To re-examine example 7.1 so far: with the sequential sums
of squares we noted that x2 was significant above and beyond x1, with a p-value
of 0.03373, but with the individual t-tests (and equivalent F-tests) that
it was not significant above and beyond (x1, x3), with a p-value of 0.285. We
also concluded in the simultaneous test that H0 : β1 = β3 = 0 holds. That
means that we need either only x2 or only the combo (x1, x3). Let's test

    H0 : β2 = (β1 + β3)/2

Under this null the model is

    Yi = β0 + β1 x1,i + [(β1 + β3)/2] x2,i + β3 x3,i + ǫi
       = β0 + β1 (x1,i + 0.5 x2,i) + β3 (x3,i + 0.5 x2,i) + ǫi
       = β0 + β1 z1,i + β3 z2,i + ǫi,

where z1,i = x1,i + 0.5 x2,i and z2,i = x3,i + 0.5 x2,i.
> dat[,"Z1"]=dat[,"X1"]+1/2*dat[,"X2"]
> dat[,"Z2"]=dat[,"X3"]+1/2*dat[,"X2"]
> reg2eq13=lm(Y~Z1+Z2,data=dat)
> anova(reg2eq13,reg123)
Analysis of Variance Table
Model 1: Y ~ Z1 + Z2
Model 2: Y ~ X1 + X2 + X3
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     17 107.150
2     16  98.405  1     8.745 1.4219 0.2505
We fail to reject the null, but that is no surprise to us at this point. It seems
all we need is just x1 and x3, so let's try it.
> reg13=update(reg123,.~.-X2)
> summary(reg13)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.7916     4.4883   1.513   0.1486
X1            1.0006     0.1282   7.803 5.12e-07 ***
X3           -0.4314     0.1766  -2.443   0.0258 *
---

Residual standard error: 2.496 on 17 degrees of freedom
Multiple R-squared: 0.7862, Adjusted R-squared: 0.761
F-statistic: 31.25 on 2 and 17 DF, p-value: 2.022e-06
No further action seems necessary at the moment, as each variable appears significant
given the other.
Remark 7.3. In R, if the null hypothesis requires a transformation of the
response, such as in the last bullet using Yi⋆, you will have to perform the
F-test manually, because the anova function will give you a warning that you
are using two different datasets, since the response variable in the two models is
technically different.
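A minimal sketch of such a manual F-test for the last bullet above, H0 : β0 = 10, β3 = 1 (assuming a hypothetical data frame dat with columns y, x1, x2, x3):

# Full model
full <- lm(y ~ x1 + x2 + x3, data = dat)
# Reduced model implied by H0: beta0 = 10, beta3 = 1, fitted to the
# transformed response y* = y - 10 - x3 (regression through the origin).
red <- lm(I(y - 10 - x3) ~ 0 + x1 + x2, data = dat)

sse_f <- sum(resid(full)^2);  df_f <- df.residual(full)
sse_r <- sum(resid(red)^2);   df_r <- df.residual(red)
Fstat <- ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)   # equation (7.1)
pval  <- pf(Fstat, df_r - df_f, df_f, lower.tail = FALSE)
c(F = Fstat, p.value = pval)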
7.3 Coefficient of Partial Determination
The coefficient of partial determination, similar to the coefficient of determination R2, is the proportion of variation in the response explained by a set
of predictors above and beyond another set of predictors.
Consider a model with 3 predictors, i.e. p − 1 = 3. The proportion of
variation in the response that is explained by x1, given that x2 and x3 are
already in the model, is
    R2y,x1|x2,x3 = [SSE(x2, x3) − SSE(x1, x2, x3)] / SSE(x2, x3)
                = [SSR(x1, x2, x3) − SSR(x2, x3)] / SSE(x2, x3)
                = SSR(x1|x2, x3) / SSE(x2, x3)
The coefficient of partial correlation is then defined as

    ry,x1|x2,x3 = sgn(b1) √( R2y,x1|x2,x3 )

Similarly for R2y,x2|x1,x3 and R2y,x3|x1,x2. We can also express the proportion of
variation in the response that is explained by x2 and x3, given x1, as
    R2y,x2,x3|x1 = [SSE(x1) − SSE(x1, x2, x3)] / SSE(x1)
                = [SSR(x1, x2, x3) − SSR(x1)] / SSE(x1)
                = SSR(x2, x3|x1) / SSE(x1)

Similarly for R2y,x1,x3|x2 and R2y,x1,x2|x3.
Example 7.4. Sticking with example 7.1, we find that R2y,x2|x1,x3 is
> ### Coefficient of partial determination R^2_{Y x2|x1 x3}=SSR(x2|x1 x3)/SSE(x1 x3)
> SST=(dim(dat)[1]-1)*var(dat$Y)
> SS2["X2","Sum Sq"]/anova(lm(Y~X1+X3,data=dat))["Residuals","Sum Sq"]
[1] 0.07107507
This implies that x2 has a tiny effect in reducing the variance in the response
above and beyond (x1 , x3 ). This agrees with the t-test for H0 : β2 = 0 given
(x1 , x3 ) that we saw earlier.
Also note the value of R2y,x1,x3|x2:
> ### Coefficient of partial determination R^2_{Y x1 x3|x2} = SSR(x1 x3|x2)/SSE(x2)
> SSEx2=anova(lm(Y~X2,data=dat))["Residuals","Sum Sq"]
> (SSEx2-anova(reg123)["Residuals","Sum Sq"])/SSEx2
[1] 0.1324132
indicating that (x1 , x3 ) have something to contribute above and beyond x2 .
This all seems to agree with our tests leading us to the final model with just
x1 and x3 .
7.4 Standardized Regression Model
Standardized regression simply means that all variables are standardized
which helps in
• removing round-off errors in computing (X T X)−1
• making for an easier comparison of the magnitude of the effects of predictors
  measured on different measurement scales. A coefficient βk⋆ from this
  model can be interpreted as: a 1 standard deviation increase in
  predictor k causes a change of βk⋆ standard deviations in the
  response (holding all others constant).
• (to be discussed later) reducing the standard error of coefficients due
to multicollinearity
The transformation used is known as the correlation transformation
    yi⋆ = (1/√(n − 1)) (yi − ȳ)/sy,        x⋆k,i = (1/√(n − 1)) (xk,i − x̄k)/sxk,   k = 1, . . . , p − 1
The model is
    Yi⋆ = β1⋆ x⋆1,i + · · · + β⋆p−1 x⋆p−1,i + ǫ⋆i
We can always revert back to the unstandardized coefficients
• βk = (sy / sxk) βk⋆,   k = 1, . . . , p − 1
• β0 = ȳ − β1 x̄1 − · · · − βp−1 x̄p−1
Under this model,

    y⋆ = (y1⋆, . . . , yn⋆)T,        X⋆ = [ x⋆1 · · · x⋆p−1 ]
which results in

    X⋆T X⋆ = [ 1         r1,2      · · ·   r1,p−1
               r2,1      1         · · ·   r2,p−1
               ...       ...       ...     ...
               rp−1,1    rp−1,2    · · ·   rp−1,p−1 ]  =: rxx,

    X⋆T y⋆ = ( ry,1 , . . . , ry,p−1 )T =: ryx
because

• Σi=1..n (x⋆k,i)2 = · · · = 1
• Σi=1..n (x⋆k,i)(x⋆k′,i) = · · · = rxk,xk′
• Σi=1..n (yi⋆)(x⋆k,i) = · · · = ry,xk
Therefore,

    X⋆T X⋆ b⋆ = X⋆T y⋆  ⇒  b⋆ = (X⋆T X⋆)−1 X⋆T y⋆  ⇒  b⋆ = rxx−1 ryx
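A minimal sketch verifying this identity, using the bodyfat data frame dat from example 7.1 (columns X1, X2, X3, Y):

rxx <- cor(dat[, c("X1", "X2", "X3")])            # correlations among the predictors
ryx <- cor(dat[, c("X1", "X2", "X3")], dat$Y)     # correlations with the response
bstar <- solve(rxx, ryx)                          # b* = rxx^{-1} ryx
# Revert to the unstandardized slopes: b_k = (s_y / s_xk) * b*_k
b <- (sd(dat$Y) / apply(dat[, c("X1", "X2", "X3")], 2, sd)) * as.vector(bstar)
b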
Example 7.5. So far we have concluded that for the bodyfat dataset in example 7.1 we only need x1 and x3 in the model. However, it seems that
these two variables are still somewhat correlated, with a sample correlation
of rx1,x3 = 0.46.
> round(cor(dat[,1:4]),2)
     X1   X2   X3    Y
X1 1.00 0.92 0.46 0.84
X2 0.92 1.00 0.08 0.88
X3 0.46 0.08 1.00 0.14
Y  0.84 0.88 0.14 1.00
We have mentioned that correlated variables may increase the standard errors of our coefficients, making it more necessary to implement standardized
regression.
A useful tool is the Variance Inflation Factor (VIF). The square root of
the variance inflation factor tells you how much larger the standard error
is, compared with what it would be if that variable were uncorrelated with
the other predictor variables in the model. If the variance inflation factor of
If the variance inflation factor of a predictor variable were 5.27 (√5.27 = 2.3), this means that the standard
error for the coefficient of that predictor variable is 2.3 times as large as it
would be if that predictor variable were uncorrelated with the other predictor
variables.
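A minimal sketch of what the VIF measures: a predictor's VIF is 1/(1 − Rk2), where Rk2 comes from regressing that predictor on all the other predictors. Using the bodyfat predictors X1, X2, X3 from example 7.1:

# VIF of X1: 1/(1 - R^2) from regressing X1 on the other predictors.
r2_x1  <- summary(lm(X1 ~ X2 + X3, data = dat))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
sqrt(vif_x1)   # factor by which the SE of b1 is inflated; compare with sqrt(vif(reg123)) later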
Example 7.6. Continuing with our example, we see that the inflation is actually not much:
> library(car)
> sqrt(vif(reg13))
      X1       X3
1.124775 1.124775
Performing standardized regression yields
> cor.trans=function(y){
+ n=length(y)
+ 1/sqrt(n-1)*(y-mean(y))/sd(y)
+ }
> dat_trans=as.data.frame(apply(dat[,1:4],2,cor.trans))
> reg13_trans=lm(Y~0+X1+X3,data=dat_trans)
> summary(reg13_trans)
Coefficients:
   Estimate Std. Error t value Pr(>|t|)
X1   0.9843     0.1226   8.029 2.33e-07 ***
X3  -0.3082     0.1226  -2.514   0.0217 *
compared to the standard errors of 0.1282 and 0.1766 respectively.
7.5 Multicollinearity
Consider the following theoretical toy example. Someone wishes to measure
the area of a square (the response) using as predictors two potential variables,
the length and the height of the square. Due to measurement error, replicate
measurements are taken.
• A simple linear regression is fitted with length as the only predictor,
x = length. For the test H0 : β1 = 0, do you think that we would reject
H0 , i.e. is length a significant predictor of area?
• Now assume that a multiple regression model is fitted with both predictors, x1 = length and x2 = height. Now, for the test H0 : β1 = 0, do
you think that we would reject H0 , i.e. is length a significant predictor
of area given that height is already included in the model?
This scenario is defined as confounding/collinearity. In the toy example,
“height” is a confounding variable, i.e. an extraneous variable in a statistical
model that correlates with both the response variable and another predictor
variable.
Example 7.7. In an experiment of 22 observations, a response y and two
predictors x1 and x2 were observed. Two simple linear regression models
were fitted:
(1)  y = 6.33 + 1.29 x1

Predictor   Coef     SE Coef    T      P
Constant    6.335    2.174      2.91   0.009
x1          1.2915   0.1392     9.28   0.000

S = 2.95954   R-Sq = 81.1%   R-Sq(adj) = 80.2%
(2)  y = 54.0 - 0.919 x2

Predictor   Coef      SE Coef   T       P
Constant    53.964    8.774     6.15    0.000
x2          -0.9192   0.2821    -3.26   0.004

S = 5.50892   R-Sq = 34.7%   R-Sq(adj) = 31.4%
Each predictor in their respective model is significant due to the small pvalues for their corresponding coefficients. The simple linear regression model
(1) is able to explain more of the variability in the response than model (2)
with R2 = 81.1%. Logically one would then assume that a multiple regression
model with both predictors would be the best model. The output of this
model is given below:
(3)  y = 12.8 + 1.20 x1 - 0.168 x2

Predictor   Coef      SE Coef   T       P
Constant    12.844    7.514     1.71    0.104
x1          1.2029    0.1707    7.05    0.000
x2          -0.1682   0.1858    -0.91   0.377

S = 2.97297   R-Sq = 81.9%   R-Sq(adj) = 80.0%
We notice that the individual test for β1 still classifies x1 as significant
given x2, but x2 is no longer significant given x1. Also, we notice that the
coefficient of determination, R2, has increased only by 0.8%, and in fact R2adj
has decreased from 80.2% in (1) to 80.0% in (3). This is because x1 is acting
as a confounding variable on x2. The relationship of x2 with the response
y is mainly accounted for by the relationship of x1 on y. The correlation
coefficient is

    rx1,x2 = −0.573

which indicates a moderate negative relationship.
However, since x1 is a better predictor, the multiple regression model is
still able to determine that x1 is significant given x2, but not vice versa.
When two variables are highly correlated, the estimates of their regression coefficients become unstable and their standard errors become larger
(leading to smaller test statistics and wider C.I.'s). We can see this using
the VIF.
Example 7.8. We have already seen another example, example 7.1. Recall that x1 and
x2 are highly correlated.
> round(cor(dat[,1:4]),2)
     X1   X2   X3    Y
X1 1.00 0.92 0.46 0.84
X2 0.92 1.00 0.08 0.88
X3 0.46 0.08 1.00 0.14
Y  0.84 0.88 0.14 1.00
In listing 7.2 we noticed that x1 is not significant given x2, with a p-value of
0.4633, due to the fact that SSR(x1|x2) = 3.47; but in listing 7.1, testing x2
given x1 yielded a p-value of 0.03373, due to SSR(x2|x1) = 33.17, indicating it
was somewhat significant.
Using VIF we see that the standard errors are greatly inflated for the model
with all three predictors:
> sqrt(vif(reg123))
      X1       X2       X3
26.62410 23.75591 10.22771
http://www.stat.ufl.edu/~athienit/STA4210/Examples/bodyfat.R
Chapter 9
Model Selection and Validation
Note that Chapter 8 was merged back with Chapter 6.
9.1 Data Collection Strategies
• Controlled Experiments: Subjects (Experimental Units) assigned to
X-levels by experimenter
– Purely Controlled Experiments: Researcher only uses predictors
that were assigned to units
– Controlled Experiments with Covariates: Researcher has information (additional predictors) associated with units
• Observational Studies: Subjects (Units) have X-levels associated with
them (not assigned by researcher)
– Confirmatory Studies: New (primary) predictor(s) believed to be
associated with Y , controlling for (control) predictor(s), known to
be associated with Y
– Exploratory Studies: Set of potential predictors believed that
some or all are associated with Y
9.2 Reduction of Explanatory Variables
• Controlled Experiments
– Purely Controlled Experiments: Rarely any need or desire to reduce number of explanatory variables
– Controlled Experiments with Covariates: Remove any covariates
that do not reduce the error variance
• Observational Studies
– Confirmatory Studies: Must keep in all control variables to compare with previous research, should keep all primary variables as
well
– Exploratory Studies: Often have many potential predictors (and
polynomials and interactions). Want to fit parsimonious model
that explains much of the variation in Y , while keeping model as
basic as possible. Caution: do not make decisions based on single
variable t-tests, make use of Complete/Reduced models for testing
multiple predictors
9.3 Model Selection Criteria
With p − 1 predictors there are 2^(p−1) potential models (each variable can be
in or out of the model), not including interaction terms etc.
• So far we have seen the adjusted R2 as in equation (6.2) where the goal
is to maximize the value
• Mallow's Cp criterion, where the goal is to find the smallest p so that
  Cp ≤ p:

      Cp = SSEp / MSE(X1, . . . , Xp−1) − (n − 2p)

  Note that in the first term the numerator is model specific, while the
  denominator is always the same (that of the full model).
• Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), where the goal is to choose the model with the minimum
value
AIC = n log(SSE/n) + 2p,
BIC = n log(SSE/n) + p log(n)
• PRESS criterion, where once again we aim to minimize the value

      PRESS = Σi=1..n (yi − ŷi(i))2

  where ŷi(i) is the fitted value for the ith case when it was not used in
  fitting the model (leave-one-out). From this we have the
  – Ordinary Cross Validation (OCV)

        OCV = (1/n) Σi=1..n (yi − ŷi(i))2 = (1/n) Σi=1..n [ (yi − ŷi) / (1 − hii) ]2

    due to the Leaving-One-Out Lemma, where hii is the ith diagonal
    element of H = X(X T X)−1 X T.
  – Generalized Cross Validation (GCV), where hii is replaced by the
    average of the diagonal elements of H, leading to a weighted version

        GCV = [ (1/n) Σi=1..n (yi − ŷi)2 ] / (1 − trace(H)/n)2
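A minimal sketch of these leave-one-out quantities for any fitted lm object (using a hypothetical fitted model fit):

# PRESS, OCV and GCV from a fitted lm object, using the hat (leverage) values.
press_ocv_gcv <- function(fit) {
  e <- resid(fit)
  h <- hatvalues(fit)                # diagonal of H = X (X'X)^{-1} X'
  n <- length(e)
  press <- sum((e / (1 - h))^2)      # squared leave-one-out prediction errors
  ocv   <- press / n
  gcv   <- mean(e^2) / (1 - mean(h))^2
  c(PRESS = press, OCV = ocv, GCV = gcv)
}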
Example 9.1. A cruise ship company wishes to model the crew size needed for a ship using predictors such as: age, tonnage, passengers, length,
cabins and passenger density (passdens). Without concerning ourselves with
potential interactions we will look at simple additive models.
> cruise <- read.fwf("http://www.stat.ufl.edu/~winner/data/cruise_ship.dat",
+ width=c(20,20,rep(8,7)), col.names=c("ship", "cline", "age", "tonnage",
+ "passengers", "length", "cabins", "passdens", "crew"))
> fit0=lm(crew~age+tonnage+passengers+length+cabins+passdens,data=cruise)
> summary(fit0)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.5213400  1.0570350  -0.493  0.62258
age         -0.0125449  0.0141975  -0.884  0.37832
tonnage      0.0132410  0.0118928   1.113  0.26732
passengers  -0.1497640  0.0475886  -3.147  0.00199 **
length       0.4034785  0.1144548   3.525  0.00056 ***
cabins       0.8016337  0.0892227   8.985 9.84e-16 ***
passdens    -0.0006577  0.0158098  -0.042  0.96687
---
Residual standard error: 0.9819 on 151 degrees of freedom
Multiple R-squared: 0.9245, Adjusted R-squared: 0.9215
F-statistic: 308 on 6 and 151 DF, p-value: < 2.2e-16

> AIC(fit0)
[1] 451.4394
We will consider this to be the full model at the moment and will implement
some of the model selection criteria using the regsubsets function.
> library(leaps)
> allcruise <- regsubsets(crew~age+tonnage+passengers+length+cabins+passdens,
+                         nbest=4, data=cruise)
> aprout <- summary(allcruise)
> with(aprout,round(cbind(which,rsq,adjr2,cp,bic),3))
## Prints "readable" results
  (Intercept) age tonnage passengers length cabins passdens   rsq adjr2      cp      bic
1           1   0       0          0      0      1        0 0.904 0.903  37.772 -360.238
1           1   0       1          0      0      0        0 0.860 0.859 125.086 -300.954
1           1   0       0          1      0      0        0 0.838 0.837 170.523 -277.122
1           1   0       0          0      1      0        0 0.803 0.801 240.675 -246.201
2           1   0       0          0      1      1        0 0.916 0.915  15.952 -376.131
2           1   0       0          0      0      1        1 0.912 0.911  24.261 -368.502
2           1   0       1          0      0      1        0 0.911 0.909  26.792 -366.249
2           1   0       0          1      0      1        0 0.908 0.907  32.443 -361.332
3           1   0       0          1      1      1        0 0.922 0.921   5.857 -382.878
3           1   0       0          0      1      1        1 0.919 0.918  11.341 -377.413
3           1   0       1          1      0      1        0 0.918 0.916  14.023 -374.808
3           1   1       0          0      1      1        0 0.917 0.915  15.909 -373.002
4           1   0       1          1      1      1        0 0.924 0.922   3.847 -381.933
4           1   1       0          1      1      1        0 0.923 0.921   5.084 -380.652
4           1   0       0          1      1      1        1 0.923 0.921   5.197 -380.534
4           1   0       1          0      1      1        1 0.919 0.917  13.056 -372.631
5           1   1       1          1      1      1        0 0.924 0.922   5.002 -377.752
5           1   0       1          1      1      1        1 0.924 0.922   5.781 -376.939
5           1   1       0          1      1      1        1 0.924 0.921   6.240 -376.462
5           1   1       1          0      1      1        1 0.920 0.917  14.904 -367.717
6           1   1       1          1      1      1        1 0.924 0.921   7.000 -372.692
A good model choice might be the model (the 13th row) with 4 predictors: tonnage,
passengers, length, and cabins, whose R2adj = 0.922, Cp = 3.847, and
BIC = −381.933. Also, we note that this model's AIC is lower than that of
the full model.
> fit3=update(fit0,.~.-age-passdens)
> AIC(fit3)
[1] 448.3229
We can also calculate the PRESS, OCV and GCV statistics that we would
compare to other potential models (but we haven’t here).
> library(qpcR)
> PRESS(fit3)$stat
[1] 154.8479
> library(dbstats)
> dblm(formula(fit3),data=cruise)$ocv
[1] 0.9673963
> dblm(formula(fit3),data=cruise)$gcv
[1] 0.9752566
http://www.stat.ufl.edu/~athienit/STA4210/Examples/selection_validation.R
9.4 Regression Model Building
As discussed, it is possible to have a large set of predictor variables (including
interactions). The goal is to fit a "parsimonious" model that explains as
much variation in the response as possible with a relatively small set of
predictors.
There are 3 automated procedures
• Backward Elimination (Top down approach)
• Forward Selection (Bottom up approach)
• Stepwise Regression (Combines Forward/Backward)
We will explore these procedures using two different elimination/selection
criteria: one that uses t-tests and p-values, and another that uses the AIC
value.
9.4.1 Backward elimination
1. Select a significance level to stay in the model (e.g. αs = 0.20, generally
.05 is too low, causing too many variables to be removed).
2. Fit the full model with all possible predictors.
3. Consider the predictor with lowest t-statistic (highest p-value).
• If p-value > αs , remove the predictor and fit model without this
variable (must re-fit model here because partial regression coefficients change).
• If p-value ≤ αs , stop and keep current model.
4. Continue until all predictors have p-values ≤ αs .
9.4.2 Forward selection
1. Choose a significance level to enter the model (e.g. αe = 0.20, generally
.05 is too low, causing too few variables to be entered).
2. Fit all simple regression models.
3. Consider the predictor with the highest t-statistic (lowest p-value).
• If p-value ≤ αe , keep this variable and fit all two variable models
that include this predictor.
• If p-value > αe , stop and keep previous model.
4. Continue until no new predictors have p-values ≤ αe
9.4.3 Stepwise regression
1. Select αs and αe , (αe < αs ).
2. Start like Forward Selection (bottom up process) where new variables
must have p-value ≤ αe to enter.
3. Re-test all “old variables” that have already been entered, must have
p-value ≤ αs to stay in model.
4. Continue until no new variables can be entered and no old variables
need to be removed.
Remark 9.1. Although we created a function in R that follows the steps of
backward, forward and stepwise, there is also an already developed function
stepAIC that can perform all three procedures by adding/removing variables
depending on whether the AIC is reduced.
Example 9.2. Continuing from example 9.1, we perform backward elimination with αs = 0.20.
> source("http://www.stat.ufl.edu/~athienit/stepT.R")
> stepT(fit0,alpha.rem=0.2,direction="backward")
crew ~ age + tonnage + passengers + length + cabins + passdens
----------------------------------------------
Step 1 -> Removing:- passdens
            Estimate Pr(>|t|)
(Intercept)   -0.556    0.394
age           -0.012    0.358
tonnage        0.013    0.150
passengers    -0.149    0.000
length         0.404    0.001
cabins         0.802    0.000

crew ~ age + tonnage + passengers + length + cabins
----------------------------------------------
Step 2 -> Removing:- age
            Estimate Pr(>|t|)
(Intercept)   -0.819    0.164
tonnage        0.016    0.046
passengers    -0.150    0.000
length         0.398    0.001
cabins         0.791    0.000
Final model:
crew ~ tonnage + passengers + length + cabins
We can also perform forward selection and stepwise regression by running
stepT(fit0,alpha.enter=0.2,direction="forward")
stepT(fit0,alpha.rem=0.2,alpha.enter=0.15,direction="both")
We can also use the built in function stepAIC
> library(MASS)
> fit1 <- lm(crew ~ age + tonnage + passengers + length + cabins + passdens)
> fit2 <- lm(crew ~ 1)
> stepAIC(fit1,direction="backward")
Start: AIC=1.05
crew ~ age + tonnage + passengers + length + cabins + passdens

             Df Sum of Sq    RSS    AIC
- passdens    1     0.002 145.57 -0.943
- age         1     0.753 146.32 -0.130
- tonnage     1     1.195 146.77  0.347
<none>                    145.57  1.055
- passengers  1     9.548 155.12  9.092
- length      1    11.980 157.55 11.551
- cabins      1    77.821 223.39 66.721

Step: AIC=-0.94
crew ~ age + tonnage + passengers + length + cabins

             Df Sum of Sq    RSS    AIC
- age         1     0.815 146.39 -2.062
<none>                    145.57 -0.943
- tonnage     1     2.007 147.58 -0.780
- length      1    12.069 157.64  9.641
- passengers  1    14.027 159.60 11.591
- cabins      1    79.556 225.13 65.944

Step: AIC=-2.06
crew ~ tonnage + passengers + length + cabins

             Df Sum of Sq    RSS    AIC
<none>                    146.39 -2.062
- tonnage     1     3.866 150.25  0.056
- length      1    11.739 158.13  8.126
- passengers  1    14.275 160.66 10.640
- cabins      1    78.861 225.25 64.028
Call:
lm(formula = crew ~ tonnage + passengers + length + cabins)
and can also perform forward and stepwise regression by running
stepAIC(fit2,direction="forward",scope=list(upper=fit1,lower=fit2))
stepAIC(fit2,direction="both",scope=list(upper=fit1,lower=fit2))
http://www.stat.ufl.edu/~athienit/STA4210/Examples/selection_validation.R
9.5 Model Validation
When we have a lot of data, we would like to see how well a model fit on
one set of data (training sample) compares to one fit on a new set of data
(validation sample), and how the training model fits the new data.
• We want the data sets to be similar with respect to the levels of the
predictors (so that the validation sample is not an extrapolation of the
training sample). Should calculate some summary statistics such as
means, standard deviations, etc.
• The training set should have at least 6-10 times as many observations
  as potential predictors.
• Models should give “similar” model fits based on SSE, PRESS, Mallow’s Cp , MSE and regression coefficients. Should obtain multiple models using multiple “adequate” training samples.
The Mean Square Prediction Error (MSPE) when the training model is applied
to the validation sample is

    MSPE = Σi=1..nV (yiV − ŷiT)2 / nV

where nV is the validation sample size, yiV represents a data point from the
validation sample and ŷiT represents a fitted value using the predictor settings
corresponding to yiV but the coefficients from the training sample, i.e.

    ŷiT = bT0 + bT1 xV1,i + · · · + bTp−1 xVp−1,i
If the MSPE is fairly close to the MSET of the regression model that was fitted
to the training data set, then it indicates that the selected regression model
is not seriously biased and gives an appropriate indication of the predictive
ability of the model. At this point you should now go ahead and fit the model
on the full data set. It is only a problem when MSPE ≫ MSET.
Example 9.3. Continuing from example 9.1, we perform cross-validation
with a hold-out sample. Randomly sample 100 ships, fit model, obtain predictions for the remaining 58 ships by applying their predictor levels to the
regression coefficients from the fitted model.
> cruise.cv.samp <- sample(1:length(cruise$crew),100,replace=FALSE)
> cruise.cv.in <- cruise[cruise.cv.samp,]
> cruise.cv.out <- cruise[-cruise.cv.samp,]
> ### Check if training sample (and validation) is similar to the whole dataset
> summary(cruise[,4:7])
    tonnage          passengers        length           cabins
 Min.   :  2.329   Min.   : 0.66   Min.   : 2.790   Min.   : 0.330
 1st Qu.: 46.013   1st Qu.:12.54   1st Qu.: 7.100   1st Qu.: 6.133
 Median : 71.899   Median :19.50   Median : 8.555   Median : 9.570
 Mean   : 71.285   Mean   :18.46   Mean   : 8.131   Mean   : 8.830
 3rd Qu.: 90.772   3rd Qu.:24.84   3rd Qu.: 9.510   3rd Qu.:10.885
 Max.   :220.000   Max.   :54.00   Max.   :11.820   Max.   :27.000
> summary(cruise.cv.in[,4:7])
    tonnage          passengers        length           cabins
 Min.   :  3.341   Min.   : 0.66   Min.   : 2.790   Min.   : 0.330
 1st Qu.: 46.947   1st Qu.:12.65   1st Qu.: 7.168   1st Qu.: 6.327
 Median : 73.941   Median :19.87   Median : 8.610   Median : 9.750
 Mean   : 73.581   Mean   :19.24   Mean   : 8.219   Mean   : 9.177
 3rd Qu.: 91.157   3rd Qu.:26.00   3rd Qu.: 9.605   3rd Qu.:11.473
 Max.   :220.000   Max.   :54.00   Max.   :11.820   Max.   :27.000
> summary(cruise.cv.out[,4:7])
    tonnage          passengers        length           cabins
 Min.   :  2.329   Min.   : 0.94   Min.   : 2.960   Min.   : 0.450
 1st Qu.: 40.013   1st Qu.:10.62   1st Qu.: 6.370   1st Qu.: 5.335
 Median : 70.367   Median :18.09   Median : 8.260   Median : 8.745
 Mean   : 67.325   Mean   :17.11   Mean   : 7.978   Mean   : 8.232
 3rd Qu.: 87.875   3rd Qu.:21.39   3rd Qu.: 9.510   3rd Qu.:10.430
 Max.   :160.000   Max.   :37.82   Max.   :11.320   Max.   :18.170
> fit.cv.in <- lm(crew ~tonnage + passengers + length + cabins,
+ data=cruise.cv.in)
> summary(fit.cv.in)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.10180    0.77347  -1.424 0.157581
tonnage      0.00479    0.01177   0.407 0.685054
passengers  -0.19192    0.05445  -3.525 0.000654 ***
length       0.45647    0.14573   3.132 0.002306 **
cabins       0.95060    0.14510   6.551 2.92e-09 ***
---

Residual standard error: 1.059 on 95 degrees of freedom
Multiple R-squared: 0.9203, Adjusted R-squared: 0.9169
F-statistic: 274.2 on 4 and 95 DF, p-value: < 2.2e-16
Then we obtain predicted values and prediction errors for the validation
sample. The model is based on the same 4 predictors that we chose before
(columns 4-7 of the cruise data), from which we compute the MSPE.
> pred.cv.out <- predict(fit.cv.in,cruise.cv.out[,4:7])
> delta.cv.out <- cruise$crew[-cruise.cv.samp]-pred.cv.out
> (mspe <- sum((delta.cv.out)^2)/length(cruise$crew[-cruise.cv.samp]))
[1] 0.7578447
We note that the MSPE of 0.7578 is fairly close to the MSE of 1.059^2 = 1.121
(at least it is not much greater).
http://www.stat.ufl.edu/~athienit/STA4210/Examples/selection_validation.R
Chapter 10
Diagnostics
See slides and examples at http://www.stat.ufl.edu/~athienit/STA4210
The notes here are incomplete and under construction
The goal of this chapter is to use refined diagnostics for checking the
adequacy of the regression model, including detecting an improper functional
form for a predictor, outliers, influential observations and multicollinearity.
10.1 Outlying Y observations
Model errors (unobserved) are defined as

    ǫi = Yi − Σj=0..p−1 βj xi,j,   xi,0 = 1,        ǫ ∼ N(0, σ2 In)

The observed residuals are

    ei = yi − Σj=0..p−1 bj xi,j,        e ∼ N(0, σ2 (In − H))

where H = X(X T X)−1 X T is the projection matrix. So the elements of the
variance-covariance matrix σ2(In − H) are:

    σ{ei, ej} = σ2 (1 − hii)   if i = j
    σ{ei, ej} = −hij σ2        if i ≠ j

Using σ̂2 = MSE we then have
• Semi-studentized residual

      ei⋆ = ei / √MSE

• Studentized residual, which uses the estimated standard deviation of ei:

      ri = ei / √( MSE (1 − hii) )
• Studentized Deleted residual. When calculating a residual ei = yi − ŷi ,
the ith observation (yi , xi,1 , . . . , xi,p−1 ) was used in the creation of the
model (as were all the other points), and then the model was used to
estimate the response for the ith observation. That is, each observa-
tion played a role in the creation of the model, which was then used to
estimate the response of said observation. Not very objective.
The solution is to delete/remove the ith observation, fit a model without
that observation in the data, and use the model to predict the response
of that observation by plugging in the predictor setting xi,1, . . . , xi,p−1.
This sounds very computationally intensive in that you have to fit as
many models as there are points. Luckily, it has been found that this
can be done without refitting. It can be shown that

    SSE = (n − p) MSE = (n − p − 1) MSE(i) + ei2 / (1 − hii)

    ⇒  ti = ei √[ (n − p − 1) / ( SSE (1 − hii) − ei2 ) ]
where MSE(i) is the MSE of the model with the ith observation deleted,
and ti is the “objective” residual. Then we can determine if a residual
is an outlier if it is more than 2 to 3 standard deviations from 0. We
can also use a Bonferroni adjustment and determine if an observation
is an outlier if it is greater than t1−α/(2n),n−p−1 but that will usually be
too large when n is large.
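In R these deleted residuals are available directly; a minimal sketch (for a hypothetical fitted model fit) that also reproduces the no-refit formula above:

t_del <- rstudent(fit)                     # studentized deleted residuals
# Equivalent manual computation from the formula above:
e   <- resid(fit); h <- hatvalues(fit)
n   <- length(e);  p <- length(coef(fit))
sse <- sum(e^2)
t_manual <- e * sqrt((n - p - 1) / (sse * (1 - h) - e^2))
# Flag observations more than, say, 3 in absolute value, or use the
# Bonferroni cutoff qt(1 - 0.05/(2*n), n - p - 1).
which(abs(t_del) > 3)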
10.2 Outlying X-Cases
Recall that H = X(X T X)−1 X T is the projection matrix with the (i, j) element being

    hij = xiT (X T X)−1 xj,    where  xi = (1, xi,1 , . . . , xi,p−1 )T
Note that

• hii ∈ [0, 1]
• Σi=1..n hii = trace(H) = trace(X (X T X)−1 X T) = trace(X T X (X T X)−1) = trace(Ip) = p
Cases with X-levels close to the “center” of the sampled X-levels will have
small leverages, i.e. hii . Cases with “extreme” levels have large leverages, and
have the potential to “pull” the regression equation toward their observed
Y -values. We can see this by
    ŷ = Hy  ⇒  ŷi = Σj=1..n hij yj = Σj=1..i−1 hij yj + hii yi + Σj=i+1..n hij yj
Leverage values are considered large if > 2p/n (2 times larger than the mean).
Leverage values for potential new observations are

    hnew,new = xnewT (X T X)−1 xnew

and are considered extrapolations if their leverage values are larger than
those in the original dataset.
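A minimal sketch of flagging high-leverage cases and checking a new point for extrapolation (hypothetical fitted model fit and hypothetical new predictor vector x_new):

h <- hatvalues(fit)                         # leverages h_ii
p <- length(coef(fit)); n <- length(h)
which(h > 2 * p / n)                        # cases with leverage > 2p/n

# Leverage of a hypothetical new observation x_new = c(1, x1_new, ..., x_{p-1},new):
X <- model.matrix(fit)
h_new <- t(x_new) %*% solve(crossprod(X)) %*% x_new
h_new > max(h)                              # TRUE suggests extrapolation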
10.3 Influential Cases
10.3.1 Fitted values
10.3.2 Regression coefficients
10.4 Multicollinearity
See examples 7.6, 7.7 and 7.8
Chapter 11
Remedial Measures
See slides and examples at http://www.stat.ufl.edu/~athienit/STA4210
Chapter 12
Autocorrelation in Time Series
See slides and examples at http://www.stat.ufl.edu/~athienit/STA4210
END