Regression Analysis
Demetris Athienitis
Department of Statistics,
University of Florida
Contents

0 Review
  0.1 Random Variables and Probability Distributions
    0.1.1 Expected value and variance
    0.1.2 Covariance
    0.1.3 Mean and variance of linear combinations
  0.2 Central Limit Theorem
  0.3 Inference for Population Mean
    0.3.1 Confidence intervals
    0.3.2 Hypothesis tests
  0.4 Inference for Two Population Means
    0.4.1 Independent samples
    0.4.2 Paired data

1 Simple Linear Regression
  1.1 Model
  1.2 Parameter Estimation
    1.2.1 Regression function
    1.2.2 Variance

2 Inferences in Regression
  2.1 Inferences concerning β0 and β1
  2.2 Inferences involving E(Y) and Ŷpred
    2.2.1 Confidence interval on the mean response
    2.2.2 Prediction interval
    2.2.3 Confidence Band for Regression Line
  2.3 Analysis of Variance Approach
    2.3.1 F-test for β1
    2.3.2 Goodness of fit
  2.4 Normal Correlation Models

3 Diagnostics and Remedial Measures
  3.1 Diagnostics for Predictor Variable
  3.2 Checking Assumptions
    3.2.1 Graphical methods
    3.2.2 Significance tests
  3.3 Remedial Measures
    3.3.1 Box-Cox (Power) transformation
    3.3.2 Lowess (smoothed) plots

4 Simultaneous Inference and Other Topics
  4.1 Controlling the Error Rate
    4.1.1 Simultaneous estimation of mean responses
    4.1.2 Simultaneous predictions
  4.2 Regression Through the Origin
  4.3 Measurement Errors
    4.3.1 Measurement error in the dependent variable
    4.3.2 Measurement error in the independent variable
  4.4 Inverse Prediction
  4.5 Choice of Predictor Levels

5 Matrix Approach to Simple Linear Regression
  5.1 Special Types of Matrices
  5.2 Basic Matrix Operations
    5.2.1 Addition and subtraction
    5.2.2 Multiplication
  5.3 Linear Dependence and Rank
  5.4 Matrix Inverse
  5.5 Useful Matrix Results
  5.6 Random Vectors and Matrices
    5.6.1 Mean and variance of linear functions of random vectors
    5.6.2 Multivariate normal distribution
  5.7 Estimation and Inference in Regression
    5.7.1 Estimating parameters by least squares
    5.7.2 Fitted values and residuals
    5.7.3 Analysis of variance
    5.7.4 Inference

6 Multiple Regression I
  6.1 Model
  6.2 Special Types of Variables
  6.3 Matrix Form

7 Multiple Regression II
  7.1 Extra Sums of Squares
    7.1.1 Definition and decompositions
    7.1.2 Inference with extra sums of squares
  7.2 Other Linear Tests
  7.3 Coefficient of Partial Determination
  7.4 Standardized Regression Model
  7.5 Multicollinearity

9 Model Selection and Validation
  9.1 Data Collection Strategies
  9.2 Reduction of Explanatory Variables
  9.3 Model Selection Criteria
  9.4 Regression Model Building
    9.4.1 Backward elimination
    9.4.2 Forward selection
    9.4.3 Stepwise regression
  9.5 Model Validation

10 Diagnostics
  10.1 Outlying Y observations
  10.2 Outlying X-Cases
  10.3 Influential Cases
    10.3.1 Fitted values
    10.3.2 Regression coefficients
  10.4 Multicollinearity

11 Remedial Measures

12 Autocorrelation in Time Series
Chapter 0
Review
In regression the emphasis is on finding links/associations between two or
more variables. For two variables a scatterplot can help in visualizing the
association.
Example 0.1. A small study with 7 subjects on the pharmacodynamics of LSD,
examining how LSD tissue concentration affects the subjects' math scores,
yielded the following data.
Score  78.93  58.20  67.47  37.47  45.65  32.92  29.97
Conc.   1.17   2.97   3.26   4.69   5.83   6.00   6.41

Table 1: Math score with LSD tissue concentration
Figure 1: Scatterplot of Math score vs. LSD tissue concentration
http://www.stat.ufl.edu/~athienit/STA4210/scatterplot.R
Before we begin we will need to grasp some basic concepts.
0.1 Random Variables and Probability Distributions
Definition 0.1. A random variable is a function that assigns a numerical
value to each outcome of an experiment. It is a measurable function from a
probability space into a measurable space known as the state space.
It is an outcome characteristic that is unknown prior to the experiment.
For example, an experiment may consist of tossing two dice. One potential random variable could be the sum of the outcome of the two dice, i.e.
X= sum of two dice. Then, X is a random variable. Another experiment
could consist of applying different amounts of a chemical agent and a potential random variable could consist of measuring the amount of final product
created in grams.
Quantitative random variables can either be discrete, in which case they have
a countable set of possible values, or continuous, in which case the set of
possible values is uncountably infinite.
Notation: For a discrete random variable (r.v.) X, the probability distribution is the probability of a certain outcome occurring, denoted as
P (X = x) = pX (x).
This is also called the probability mass function (p.m.f.).
Notation: For a continuous random variable (r.v.) X, the probability density function (p.d.f.), denoted by fX (x), models the relative frequency of X.
Since there are infinitely many outcomes within an interval, the probability
evaluated at a singularity is always zero, e.g. P (X = x) = 0, ∀x, X being a
continuous r.v.
Conditions for a function to be a:
• p.m.f.: 0 ≤ p(x) ≤ 1 and Σ∀x p(x) = 1
• p.d.f.: f(x) ≥ 0 and ∫_{−∞}^{∞} f(x) dx = 1
Example 0.2. (Discrete) Suppose a storage tray contains 10 circuit boards,
of which 6 are type A and 4 are type B, but they both appear similar. An
inspector selects 2 boards for inspection. He is interested in X = number of
type A boards. What is the probability distribution of X?
The sample space of X is {0, 1, 2}. We can calculate the following:
p(2) = P (A on first)P (A on second|A on first)
= (6/10)(5/9) = 0.3333
p(1) = P (A on first)P (B on second|A on first)
+ P (B on first)P (A on second|B on first)
= (6/10)(4/9) + (4/10)(6/9) = 0.5333
p(0) = P (B on first)P (B on second|B on first)
= (4/10)(3/9) = 0.1334
Consequently,
X = x   0       1       2       Total
p(x)    0.1334  0.5333  0.3333  1.0

Table 2: Probability Distribution of X
Example 0.3. (Continuous) The lifetime of a certain battery (in hundreds of hours) has a distribution that can be approximated by f(x) = 0.5e^(−0.5x), x > 0.

Figure 2: Probability density function of battery lifetime.
Normal
The normal distribution (Gaussian distribution) is by far the most important
distribution in statistics. The normal distribution is identified by a location
parameter µ and a scale parameter σ 2 (> 0). A normal r.v. X is denoted as
X ∼ N(µ, σ 2 ) with p.d.f.
f(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)),   −∞ < x < ∞
Figure 3: Density function of N(0, 1).
It is symmetric, unimodal, bell shaped with E(X) = µ and V (X) = σ 2 .
Notation: A normal random variable with mean 0 and variance 1 is called a
standard normal r.v. It is usually denoted by Z ∼ N(0, 1). The c.d.f. of a
standard normal is given at the end of the textbook, is available online, but
most importantly has a built in function in software. Note that probabilities,
which can be expressed in terms of c.d.f, can be conveniently obtained.
Example 0.4. Find P (−2.34 < Z < −1). From the relevant remark,
P (−2.34 < Z < −1) = P (Z < −1) − P (Z < −2.34)
= 0.1587 − 0.0096
= 0.1491
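In R, for instance, the same probability is obtained from the built-in c.d.f. (a one-line sketch):
pnorm(-1) - pnorm(-2.34)   # 0.1491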
Notation: You may recall that ∫ f(t) dt arises as the limit of Σ f(ti)∆i. Hence,
for the following definitions and expressions we will only be using the integral
notation for continuous variables; wherever you see "∫" simply replace it with
"Σ" for the discrete case.
0.1.1 Expected value and variance
The expected value of a r.v. is thought of as the long term average for that
variable. Similarly, the variance is thought of as the long term average of the
squared deviations of the r.v. from its expected value.
Definition 0.2. The expected value (or mean) of a r.v. X is
µX := E(X) = ∫_{−∞}^{∞} x f(x) dx   (= Σ∀x x p(x) in the discrete case).
In actuality, this definition is a special case of a much broader statement.
Definition 0.3. The expected value (or mean) of a function h(·) of a r.v. X is
E(h(X)) = ∫_{−∞}^{∞} h(x) f(x) dx.
Due to this last definition, if the function h performs a simple linear
transformation, such as h(t) = at + b, for constants a and b, then
E(aX + b) = ∫ (ax + b) f(x) dx = a ∫ x f(x) dx + b ∫ f(x) dx = aE(X) + b.
Example 0.5. Referring back to Example 0.2, the expected value of the
number of type A boards (X) is
E(X) = Σ∀x x p(x) = 0(0.1334) + 1(0.5333) + 2(0.3333) = 1.1999.
We can also calculate the expected value of (i) 5X + 3 and (ii) 3X².
(i) 5(1.1999) + 3 = 8.9995.
(ii) 3(0²)(0.1334) + 3(1²)(0.5333) + 3(2²)(0.3333) = 5.5995
Definition 0.4. The variance of a r.v. X is
σ²X := V(X) = E[(X − µX)²]
  = ∫ (x − µX)² f(x) dx
  = ∫ (x² − 2xµX + µ²X) f(x) dx
  = ∫ x² f(x) dx − 2µX ∫ x f(x) dx + µ²X ∫ f(x) dx
  = E(X²) − 2E²(X) + E²(X)
  = E(X²) − E²(X)
Example 0.6. This refers to Example 0.2. We know that E(X) = 1.1999
and E(X²) = 0²(0.1334) + 1²(0.5333) + 2²(0.3333) = 1.8665. Thus,
V(X) = E(X²) − E²(X) = 1.8665 − 1.1999² = 0.42674
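A minimal R sketch of these finite-sum calculations (Examples 0.5 and 0.6), using the p.m.f. from Table 2:
x <- 0:2
p <- c(0.1334, 0.5333, 0.3333)
sum(x * p)                    # E(X)      = 1.1999
sum(x^2 * p)                  # E(X^2)    = 1.8665
sum(x^2 * p) - sum(x * p)^2   # V(X)      = 0.42674
sum((5 * x + 3) * p)          # E(5X + 3) = 8.9995
sum(3 * x^2 * p)              # E(3X^2)   = 5.5995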
Example 0.7. This refers to Example 0.3. If we were to do this by hand we
would need to do integration by parts (multiple times). However, we can use
software such as Wolfram Alpha.
1. Find E(X), so in Wolfram Alpha simply input:
integrate x*0.5*e^(-0.5*x) dx from 0 to infinity
So E(X) = 2.
2. Find E(X 2 ), so input:
integrate x^2*0.5*e^(-0.5*x) dx from 0 to infinity
So, E(X 2 ) = 8.
3. V(X) = E(X²) − E²(X) = 8 − 2² = 4.
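Equivalently, the integrals can be evaluated numerically in base R (a sketch; integrate() returns a numerical approximation):
f <- function(x) 0.5 * exp(-0.5 * x)                     # p.d.f. from Example 0.3
EX  <- integrate(function(x) x   * f(x), 0, Inf)$value   # E(X)   = 2
EX2 <- integrate(function(x) x^2 * f(x), 0, Inf)$value   # E(X^2) = 8
EX2 - EX^2                                               # V(X)   = 4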
Definition 0.5. The variance of a function h of a r.v. X is
V(h(X)) = ∫ [h(x) − E(h(X))]² f(x) dx = E(h²(X)) − E²(h(X)).
Notice that if h stands for a linear transformation function then,
V(aX + b) = E[(aX + b − E(aX + b))²] = a² E[(X − E(X))²] = a² V(X).
If Z is standard normal then it has mean 0 and variance 1. Now if we
take a linear transformation of Z, say X = aZ + b, then
E(X) = E(aZ + b) = aE(Z) + b = b
and
V (X) = V (aZ + b) = a2 V (Z) = a2 .
This fact together with the following proposition allows us to express any
normal r.v. as a linear transformation of the standard normal r.v. Z by
setting a = σ and b = µ.
Proposition 0.1. The r.v. X that is expressed as the linear transformation
σZ + µ is also a normal r.v., with E(X) = µ and V(X) = σ².
Linear transformations are completely reversible, so given a normal r.v.
X with mean µ and variance σ 2 we can revert back to a standard normal by
Z = (X − µ)/σ.
As a consequence any probability statements made about an arbitrary normal
r.v. can be reverted to statements about a standard normal r.v.
Example 0.8. Let X ∼ N(15, 7). Find P (13.4 < X < 19.0).
We begin by noting
P(13.4 < X < 19.0) = P((13.4 − 15)/√7 < (X − 15)/√7 < (19.0 − 15)/√7)
  = P(−0.6047 < Z < 1.5119)
  = P(Z < 1.5119) − P(Z < −0.6047)
  = 0.6620312
If one is using a computer there is no need to revert back and forth from a
standard normal, but it is always useful to standardize concepts. You could
find the answer by using
pnorm(1.5119)-pnorm(-0.6047)
or
pnorm(19,15,sqrt(7))-pnorm(13.4,15,sqrt(7))
Example 0.9. The height of males in inches is assumed to be normally distributed with mean of 69.1 and standard deviation 2.6. Let X ∼ N(69.1, 2.62 ).
Find the 90th percentile for the height of males.
Figure 4: N(69.1, 2.6²) distribution
First we find the 90th percentile of the standard normal which is qnorm(0.9)=
1.281552. Then we transform to
2.6(1.281552) + 69.1 = 72.43204.
Or, just input into R: qnorm(0.9,69.1,2.6).
0.1.2 Covariance
The population covariance is a measure of the strength of the linear
relationship between two variables.
Definition 0.6. Let X and Y be two r.vs. The population covariance of X
and Y is
Cov(X, Y ) = E [(X − E(X)) (Y − E(Y ))]
= E(XY ) − E(X)E(Y )
Remark 0.1. If X and Y are independent, then
E(XY) = ∫∫ xy f(x, y) dx dy = ∫∫ xy fX(x) fY(y) dx dy = (∫ x fX(x) dx)(∫ y fY(y) dy) = E(X)E(Y),
and consequently Cov(X, Y) = 0. This is because under independence
f(x, y) = fX(x)fY(y). However, the converse is not true. Think of X and Y
constrained to lie on a circle, so that X² + Y² = 1. Obviously, X and Y are
dependent but they have no linear relationship. Hence, Cov(X, Y) = 0.
The covariance is not unitless, so a measure called the population correlation,
ρXY = Cov(X, Y) / (√V(X) √V(Y)),
is used to describe the strength of the linear relationship; it is
• unitless
• ranges from −1 to 1
A negative relationship implies a negative covariance and consequently a
negative correlation.
Moving from the population parameters to the sample statistics, the sample
covariance and sample correlation are computed as
σ̂XY := Ĉov(X, Y) = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) = (1/(n − 1)) [(Σ_{i=1}^{n} xi yi) − n x̄ ȳ].
Therefore,
rXY := ρ̂XY = [(Σ_{i=1}^{n} xi yi) − n x̄ ȳ] / [(n − 1) sX sY].   (1)
Example 0.10. Let’s assume that we want to look at the relationship between two variables, height (in inches) and self esteem for 20 individuals.
Height  68  71  62  75  58  60  67  68  71  69  68  67  63  62  60  63  65  67  63  61
Esteem 4.1 4.6 3.8 4.4 3.2 3.1 3.8 4.1 4.3 3.7 3.5 3.2 3.7 3.3 3.4 4.0 4.1 3.8 3.4 3.6

Table 3: Height to self esteem data

Hence,
rXY = [4937.6 − 20(65.4)(3.755)] / [19(4.406)(0.426)] = 0.731,
so there is a moderate to strong positive linear relationship.
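A minimal R sketch of the same computation, typing in the pairs from Table 3:
height <- c(68, 71, 62, 75, 58, 60, 67, 68, 71, 69,
            68, 67, 63, 62, 60, 63, 65, 67, 63, 61)
esteem <- c(4.1, 4.6, 3.8, 4.4, 3.2, 3.1, 3.8, 4.1, 4.3, 3.7,
            3.5, 3.2, 3.7, 3.3, 3.4, 4.0, 4.1, 3.8, 3.4, 3.6)
cov(height, esteem)   # sample covariance
cor(height, esteem)   # r = 0.73, a moderate to strong positive linear relationship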
0.1.3 Mean and variance of linear combinations
Let X and Y be two r.vs. Then for (aX + b) + (cY + d), for constants a, b, c and d,
E(aX + b + cY + d) = aE(X) + cE(Y) + b + d
V(aX + b + cY + d) = Cov(aX, aX) + Cov(cY, cY) + Cov(aX, cY) + Cov(cY, aX)
  = a²V(X) + c²V(Y) + 2ac Cov(X, Y)
Example 0.11. Let X be a r.v. with E(X) = 3 and V (X) = 2, and Y be
another r.v. independent of X with E(Y ) = −5 and V (Y ) = 6. Then,
E(X − 2Y ) = E(X) − 2E(Y ) = 3 − 2(−5) = 13
and
V (X − 2Y ) = V (X) + 4V (Y ) = 2 + 4(6) = 26
Now we extend these two concepts to more than two r.vs. Let X1, . . . , Xn
be a sequence of r.vs and a1, . . . , an a sequence of constants. Then the r.v.
Σ_{i=1}^{n} ai Xi has mean and variance
E(Σ_{i=1}^{n} ai Xi) = Σ_{i=1}^{n} ai E(Xi)
and
V(Σ_{i=1}^{n} ai Xi) = Σ_{i=1}^{n} Σ_{j=1}^{n} ai aj Cov(Xi, Xj)   (2)
  = Σ_{i=1}^{n} ai² V(Xi) + 2 ΣΣ_{i<j} ai aj Cov(Xi, Xj)   (3)
Example 0.12. Assume a random sample, i.e. independent identically
distributed (i.i.d.) r.vs X1, . . . , Xn, is to be obtained, and of interest will
be the specific linear combination corresponding to the sample mean
X̄ = (1/n) Σ_{i=1}^{n} Xi. Since the r.vs are i.i.d., let E(Xi) = µ and V(Xi) = σ²
∀i = 1, . . . , n. Then,
E((1/n) Σ_{i=1}^{n} Xi) = (1/n) Σ_{i=1}^{n} E(Xi) = (1/n) nµ = µ
and, by independence,
V((1/n) Σ_{i=1}^{n} Xi) = (1/n²) Σ_{i=1}^{n} V(Xi) = (1/n²) nσ² = σ²/n.
Remark 0.2. As the sample size increases, the variance of the sample mean
decreases with limn→∞ V (X̄) = 0.
A very useful theorem (whose proof is beyond the scope of this class) is the
following.
Proposition 0.2. A linear combination of (independent) normal random
variables is a normal random variable.
0.2 Central Limit Theorem
The Central Limit Theorem (C.L.T.) is a powerful statement concerning
the mean of a random sample. There are three versions, the classical, the
Lyapunov and the Lindeberg, but in effect they all make the same statement:
the asymptotic distribution of the sample mean X̄ is normal, irrespective
of the distribution of the individual r.vs X1, . . . , Xn.
Proposition 0.3. (Central Limit Theorem)
Let X1, . . . , Xn be a random sample, i.e. i.i.d., with E(Xi) = µ < ∞ and
V(Xi) = σ² < ∞. Then, for X̄ = (1/n) Σ_{i=1}^{n} Xi,
(X̄ − µ)/(σ/√n) →d N(0, 1)   as n → ∞.
Although the central limit theorem is an asymptotic statement, i.e. as the
sample size goes to infinity, we can in practice implement it for sufficiently
large sample sizes n > 30 as the distribution of X̄ will be approximately
normal with mean and variance derived from Example 0.12.
X̄ approx.∼ N(µ, σ²/n)
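A small simulation sketch of this approximation (not from the notes): sample means of n = 30 draws from a clearly non-normal (exponential) distribution already look roughly normal with the stated mean and variance.
set.seed(1)
n <- 30
xbar <- replicate(5000, mean(rexp(n, rate = 0.5)))   # 5000 simulated sample means
c(mean(xbar), var(xbar))   # close to mu = 2 and sigma^2/n = 4/30
hist(xbar, breaks = 40)    # roughly bell shaped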
0.3 Inference for Population Mean
When a population parameter is estimated by a sample statistic such as
µ̂ = x̄, the sample statistic is a point estimate of the parameter. Due to
sampling variability the point estimate will vary from sample to sample.
The fact that the sample estimate is not 100% accurate has to be taken into
account.
0.3.1 Confidence intervals
An alternative or complementary approach is to report an interval of plausible
values based on the point estimate sample statistic and its standard deviation
(a.k.a. standard error). A confidence interval (C.I.) is calculated by first
selecting the confidence level, the degree of reliability of the interval. A
100(1 − α)% C.I. means that the method by which the interval is calculated
will contain the true population parameter 100(1 − α)% of the time. That
is, if a sample is replicated multiple times, the proportion of times that the
C.I. will not contain the population parameter is α.
For example, assume that we know the (in practice unknown) population
parameter µ is 0 and from multiple samples, multiple C.Is are created.
Figure 5: Multiple confidence intervals from different samples
Known population variance
Let X1 , . . . , Xn be i.i.d. from some distribution with finite unknown mean µ
and known variance σ 2 . The methodology will require that X̄ ∼ N(µ, σ 2 /n).
This can occur in the following ways:
• X1 , . . . , Xn be i.i.d. from a normal distribution, so that by Proposition
0.2, X̄ ∼ N(µ, σ 2 /n)
• n > 30 and the C.L.T. is invoked.
Let zc stand for the value of Z ∼ N(0, 1) such that P (Z ≤ zc ) = c.
Due to the symmetry of the normal distribution, z1−α/2 = |zα/2| and
zα/2 = −z1−α/2.
Note: Some books may define zc such that P(Z > zc) = c, i.e. c referring to
the area to the right.
Hence, the proportion of C.I.s containing the population parameter is
1 − α = P(−z1−α/2 < (X̄ − µ)/(σ/√n) < z1−α/2)
      = P(X̄ − z1−α/2 σ/√n < µ < X̄ + z1−α/2 σ/√n),   (4)
and the probability that (in the long run) the random C.I.,
X̄ ∓ z1−α/2 σ/√n,
contains the true value of µ is 1 − α. When a C.I. is constructed from a
single sample we can no longer talk about a probability as there is no long
run temporal concept but we can say that we are 100(1 − α)% confident that
the methodology by which the interval was contrived will contain the true
population parameter.
Example 0.13. A forester wishes to estimate the average number of count
trees per acre on a plantation. The variance is assumed to be known as 12.1.
A random sample of n = 50 one acre plots yields a sample mean of 27.3.
A 95% C.I. for the true mean is then
27.3 ∓ z1−0.025 √(12.1/50) = 27.3 ∓ 1.96 √(12.1/50) → (26.33581, 28.26419)
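In R, a sketch of the same interval from the summary statistics:
27.3 + c(-1, 1) * qnorm(0.975) * sqrt(12.1 / 50)   # approximately (26.34, 28.26)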
Unknown population variance
In practice the population variance is unknown, that is σ is unknown. A
large sample size implies that the sample variance s2 is a good estimate for σ 2
and you will find that many simply replace it in the C.I. calculation. However,
there is a technically “correct” procedure for when variance is unknown.
Note that s2 is calculated from data, so just like x̄, there is a corresponding random variable S 2 to denote the theoretical properties of the sample
variance. In higher level statistics the distribution of S 2 is found, as once
again, it is a statistic that depends on the random variables X1 , . . . , Xn . It
is shown that
(X̄ − µ)/(S/√n) ∼ tn−1   (5)
where tn−1 stands for Student’s-t distribution with parameter degrees of freedom ν = n−1. A Student’s-t distribution is “similar” to the standard normal
except that it places more “weight” to extreme values as seen in Figure 6.
Figure 6: Standard normal and t4 probability density functions
It is important to note that Student’s-t is not just “similar” to the standard normal but asymptotically (as n → ∞) is the standard normal. One
just needs to view the t-table to see that under infinite degrees of freedom the
values in the table are exactly the same as the ones found for the standard
normal. Intuitively then, using Student’s-t when σ 2 is unknown makes sense
as it adds more probability to extreme values due to the uncertainty placed
by estimating σ 2 .
The 100(1 − α)% C.I. for µ is then
x̄ ∓ t1−α/2,n−1 s/√n.   (6)
Example 0.14. In a packaging plant, the sample mean and standard deviation for the fill weight of 100 boxes are x̄ = 12.05 and s = 0.1. The 95% C.I.
for the mean fill weight of the boxes is
12.05 ∓ t1−0.025,99 (0.1/√100) = 12.05 ∓ 1.984(0.01) → (12.03016, 12.06984),   (7)
Remark 0.3. If we wanted a 90% C.I. we would simply replace t1−0.05/2,99
with t1−0.10/2,99 = 1.660, which leads to a C.I. of (12.0334, 12.0666), a
narrower interval. Thus, as α ↑, 100(1 − α)% ↓, which implies a narrower
interval.
Example 0.15. Suppose that a sample of 36 resistors is taken with x̄ = 10
and s2 = 0.7. A 95% C.I. for µ is
10 ∓ t1−0.025,35 √(0.7/36) = 10 ∓ 2.03 √(0.7/36) → (9.71693, 10.28307)
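A sketch of the t-based intervals of Examples 0.14 and 0.15 from their summary statistics:
12.05 + c(-1, 1) * qt(0.975, df = 99) * 0.1 / sqrt(100)   # Example 0.14
10    + c(-1, 1) * qt(0.975, df = 35) * sqrt(0.7 / 36)    # Example 0.15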
Remark 0.4. So far we have only discussed two-sided confidence intervals,
as in equation (4). However, one-sided confidence intervals might be more
appropriate in certain circumstances. For example, when one is interested
in the minimum breaking strength, or the maximum current in a circuit. In
these instances we are not interested in an upper and lower limit but only in
a lower or only in an upper limit. Then we simply replace zα/2 or tα/2,n−1 by
zα or tα,n−1, e.g. a 100(1 − α)% C.I. for µ is
(x̄ − t1−α,n−1 s/√n, ∞)   or   (−∞, x̄ + t1−α,n−1 s/√n).

0.3.2 Hypothesis tests
A statistical hypothesis is a claim about a population characteristic (and on
occasion more than one). An example of a hypothesis is the claim that the
population is some value, e.g. µ = 0.75.
Definition 0.7. The null hypothesis, denoted by H0 , is the hypothesis that
is initially assumed to be true.
The alternative hypothesis, denoted by Ha or H1 , is the complementary
assertion to H0 and is usually the hypothesis, the new statement that we
wish to test.
A test procedure is created under the assumption of H0 and then it is
determined how likely that assumption is compared to its complement Ha .
The decision will be based on
• Test statistic, a function of the sampled data.
• Rejection region/criteria, the set of all test statistic values for which
H0 will be rejected.
The basis for choosing a particular rejection region lies in an understanding
of the errors that can be made.
Definition 0.8. A type I error consists of rejecting H0 when it is actually
true.
A type II error consists of failing to reject H0 when in actuality H0 is
false.
The type I error is generally considered to be the most serious one, and
due to limitations, we can only control for one, so the rejection region is
chosen based upon the maximum P (type I error) = α that a researcher is
willing to accept.
Known population variance
We motivate the test procedure by an example whereby the drying time
of a certain type of paint, under fixed environmental conditions, is known
to be normally distributed with mean 75 min. and standard deviation 9
min. Chemists have added a new additive that is believed to decrease drying
time and have obtained a sample of 35 drying times and wish to test their
assertion. Hence,
H0 : µ ≥ 75 (or µ = 75)
Ha : µ < 75
Since we wish to control for the type I error, we set P (type I error) = α.
The default value of α is usually taken to be 5%.
An obvious candidate for a test statistic, that is an unbiased estimator
of the population mean, is X̄ which is normally distributed. If the data
were not known to be normally distributed the normality of X̄ can also be
confirmed by the C.L.T. Thus, under the null assumption H0
X̄ ∼H0 N(75, 9²/35),
or equivalently
(X̄ − 75)/(9/√35) ∼H0 N(0, 1).
The test statistic will be
T.S. = (x̄ − 75)/(9/√35),
and assuming that x̄ = 70.8 from the 35 samples, then, T.S. = −2.76. This
implies that 70.8 is 2.76 standard deviations below 75. Although this appears
to be far, we need to use the p-value to reach a formal conclusion.
Definition 0.9. The p-value of a hypothesis test is the probability of observing the specific value of the test statistic, T.S., or a more extreme value,
under the null hypothesis. The direction of the extreme values is indicated
by the alternative hypothesis.
Therefore, in this example values more extreme than -2.76 are
{x|x ≤ −2.76},
as indicated by the alternative, Ha : µ < 75. Thus,
p-value = P (Z ≤ −2.76) = 0.0029.
The criterion for rejecting the null is p-value < α. Here the null hypothesis is
rejected in favor of the alternative hypothesis, as the probability of observing
the test statistic value of −2.76 or more extreme (as indicated by Ha) is smaller
than the probability of the type I error we are willing to undertake.
Figure 7: Rejection region and p-value.
If we can assume that X̄ is normally distributed and σ² is known then, to test
(i) H0 : µ ≤ µ0 vs Ha : µ > µ0
(ii) H0 : µ ≥ µ0 vs Ha : µ < µ0
(iii) H0 : µ = µ0 vs Ha : µ ≠ µ0
at the α significance level, compute the test statistic
T.S. = (x̄ − µ0)/(σ/√n).   (8)
Reject the null if the p-value < α, i.e.
(i) P (Z ≥ T.S.) < α (area to the right of T.S. < α)
(ii) P (Z ≤ T.S.) < α (area to the left of T.S. < α)
(iii) P (|Z| ≥ |T.S.|) < α (area to the right of |T.S.| plus area to the left of
−|T.S.| < α)
Example 0.16. A scale is to be calibrated by weighing a 1000g weight 60
times. From the sample we obtain x̄ = 1000.6 and s = 2. Test whether the
scale is calibrated correctly.
H0 : µ = 1000 vs Ha : µ ≠ 1000
T.S. = (1000.6 − 1000)/(2/√60) = 2.32379
Hence, the p-value is 0.02013675, and we reject the null hypothesis and conclude that the true mean is not 1000.
Figure 8: p-value.
Since 1000.6 is 2.32379 standard deviations greater than 1000, we can
conclude that not only is the true mean not 1000, but it is greater than 1000.
Example 0.17. A company representative claims that the number of calls
arriving at their center is no more than 15/week. To investigate the claim, 36
random weeks were selected from the company’s records with a sample mean
of 17 and sample standard deviation of 3. Do the sample data contradict
this statement?
First we begin by stating the hypotheses
H0 : µ ≤ 15 vs Ha : µ > 15
The test statistic is
T.S. = (17 − 15)/(3/√36) = 4
The conclusion is that there is significant evidence to reject H0, as the
p-value (the area to the right of 4 under the standard normal) is very close to 0.
Unknown population variance
If σ is unknown, which is usually the case, we replace it by its sample
estimate s. Consequently,
(X̄ − µ0)/(S/√n) ∼H0 tn−1,
and for an observed value X̄ = x̄, the test statistic becomes
T.S. = (x̄ − µ0)/(s/√n).
At the α significance level, for the same hypothesis tests as before, we reject
H0 if
(i) p-value= P (tn−1 ≥ T.S.) < α
(ii) p-value= P (tn−1 ≤ T.S.) < α
(iii) p-value= P (|tn−1 | ≥ |T.S.|) < α
Example 0.18. In an ergonomic study, 5 subjects were chosen to study the
maximum acceptable weight of lift (MAWL) for a frequency of 4 lifts/min. Assuming the
MAWL values are normally distributed, do the following data suggest that
the population mean of MAWL exceeds 25?
25.8, 36.6, 26.3, 21.8, 27.2
H0 : µ ≤ 25 vs Ha : µ > 25
T.S. = (27.54 − 25)/(5.47/√5) = 1.03832
The p-value is the area to the right of 1.03832 under the t4 distribution,
which is 0.1788813. Hence, we fail to reject the null hypothesis. In R input:
t.test(c(25.8, 36.6, 26.3, 21.8, 27.2),mu=25,alternative="greater")
Remark 0.5. The values contained within a two-sided 100(1 − α)% C.I. are
precisely those values that, when used in the null hypothesis, result in the
p-value of a two-sided hypothesis test being greater than α.
For the one-sided case, an interval that only uses the
• upper limit, contains precisely those values for which the p-value of
a one-sided hypothesis test, with alternative less than, will be greater
than α.
• lower limit, contains precisely those values for which the p-value of a
one-sided hypothesis test, with alternative greater than, will be greater
than α.
Example 0.19. The lifetime of a single-cell organism is believed to be on
average 257 hours. A small preliminary study was conducted to test whether
the average lifetime was different when the organism was placed in a certain
medium. The measurements are assumed to be normally distributed and
turned out to be 253, 261, 258, 255, and 256. The hypothesis test is
H0 : µ = 257 vs. Ha : µ ≠ 257
With x̄ = 256.6 and s = 3.05, the test statistic value is
T.S. = (256.6 − 257)/(3.05/√5) = −0.293.
The p-value is P(t4 < −0.293) + P(t4 > 0.293) = 0.7839. Hence, since the
p-value is large (> 0.05) we fail to reject H0 and conclude that the population
mean is not statistically different from 257.
Instead of a hypothesis test, if a two-sided 95% C.I. was constructed,
256.6 ∓ t1−0.025,4 (3.05/√5) = 256.6 ∓ 2.776(3.05/√5) → (252.81, 260.39),
it is clear that the null hypothesis value of µ = 257 is a plausible value and
consequently H0 is plausible, so it is not rejected.
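Since the five measurements are given, the test and the interval can also be reproduced directly in R (a sketch):
x <- c(253, 261, 258, 255, 256)
t.test(x, mu = 257)   # two-sided t-test: p-value = 0.78 and 95% CI (252.8, 260.4), approximately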
0.4 Inference for Two Population Means
0.4.1 Independent samples
There are instances when a C.I. for the difference between two means is of
interest when one wishes to compare the sample mean from one population
to the sample mean of another.
Known population variances
Let X1, . . . , XnX and Y1, . . . , YnY represent two independent random samples
with means µX, µY and variances σ²X, σ²Y respectively. Once again the
methodology will require X̄ and Ȳ to be normally distributed. This can occur by:
• X1, . . . , XnX being i.i.d. from a normal distribution, so that by Proposition
0.2, X̄ ∼ N(µX, σ²X/nX)
• nX > 40 and the C.L.T. is invoked.
Similarly for Ȳ. Note that if the C.L.T. is to be invoked we require a more
conservative criterion of nX > 40, nY > 40, as we are using the theorem (and
hence an approximation) twice.
To compare two population means µX and µY we find it easier to work
with a new parameter, the difference µK := µX − µY. Then K := X̄ − Ȳ is a
normal random variable (by Proposition 0.2) with
E(K) = E(X̄ − Ȳ) = µX − µY,
and
V(K) = V(X̄ − Ȳ) = σ²X/nX + σ²Y/nY.
Therefore,
K := X̄ − Ȳ ∼ N(µX − µY, σ²X/nX + σ²Y/nY),
and hence a 100(1 − α)% C.I. for the difference µK = µX − µY is
x̄ − ȳ ∓ z1−α/2 √(σ²X/nX + σ²Y/nY).
Example 0.20. In an experiment, 50 observations of soil NO3 concentration
(mg/L) were taken at each of two (independent) locations X and Y . We have
that x̄ = 88.5, σX = 49.4, ȳ = 110.6 and σY = 51.5. Construct a 95% C.I.
for the difference in means and interpret.
88.5 − 110.6 ∓ 1.96 √(49.4²/50 + 51.5²/50) → (−41.880683, −2.319317)
Note that 0 is not in the interval as a plausible value. This implies that
µX − µY < 0 is plausible. In fact µX is less than µY by at least 2.32 units
and at most 41.88.
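A sketch of the calculation in R:
(88.5 - 110.6) + c(-1, 1) * qnorm(0.975) * sqrt(49.4^2 / 50 + 51.5^2 / 50)   # approx. (-41.88, -2.32)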
Unknown population variances
As in equation (5),
(X̄ − Ȳ − (µX − µY)) / √(s²X/nX + s²Y/nY) ∼ tν
where
ν = (s²X/nX + s²Y/nY)² / [ (s²X/nX)²/(nX − 1) + (s²Y/nY)²/(nY − 1) ].   (9)
Hence the 100(1 − α)% C.I. for µX − µY is
x̄ − ȳ ∓ t1−α/2,ν √(s²X/nX + s²Y/nY).
Example 0.21. Two methods are considered standard practice for surface
hardening. For Method A there were 15 specimens with a mean of 400.9
(N/mm2 ) and standard deviation 10.6. For Method B there were also 15
specimens with a mean of 367.2 and standard deviation 6.1. Assuming the
samples are independent and from a normal distribution the 98% C.I. for
µA − µB is
400.9 − 367.2 ∓ t1−0.01,ν √(10.6²/15 + 6.1²/15)
where
ν = (10.6²/15 + 6.1²/15)² / [ (10.6²/15)²/14 + (6.1²/15)²/14 ] = 22.36,
and hence t1−0.01,22.36 = 2.5052, giving a 98% C.I. for the difference µA − µB
of (25.7892, 41.6108).
Notice that 0 is not in the interval so we can conclude that the two means
are different. In fact the interval is purely positive so we can conclude that
µA is at least 25.7892 N/mm2 larger than µB and at most 41.6108 N/mm2 .
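A sketch of the Welch (Satterthwaite) calculation from the summary statistics:
mA <- 400.9; sA <- 10.6; nA <- 15
mB <- 367.2; sB <-  6.1; nB <- 15
se <- sqrt(sA^2 / nA + sB^2 / nB)
nu <- se^4 / ((sA^2 / nA)^2 / (nA - 1) + (sB^2 / nB)^2 / (nB - 1))   # 22.36
(mA - mB) + c(-1, 1) * qt(0.99, df = nu) * se                        # approx. (25.79, 41.61)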
0.4.2 Paired data
There are instances when two samples are not independent, when a relationship exists between the two. For example, before treatment and after
treatment measurements made on the same experimental subject are dependent on each other through the experimental subject. This is a common event
in clinical studies where the effectiveness of a treatment, that may be quantified by the difference in the before and after measurements, is dependent
upon the individual undergoing the treatment. Then, the data is said to be
paired.
Consider the data in the form of the pairs (X1 , Y1), (X2 , Y2 ), . . . , (Xn , Yn ).
We note that the pairs, i.e. two dimensional vectors, are independent as the
experimental subjects are assumed to be independent with marginal expectations E(Xi ) = µX and E(Yi ) = µY for all i = 1, . . . , n. By defining,
D1 = X1 − Y1,  D2 = X2 − Y2,  . . . ,  Dn = Xn − Yn,
a two sample problem has been reduced to a one sample problem. Inference
for µX − µY is equivalent to one sample inference on µD, as was done in
Section 0.3. This holds since,
µD := E(D̄) = E((1/n) Σ_{i=1}^{n} Di) = E((1/n) Σ_{i=1}^{n} (Xi − Yi)) = E(X̄ − Ȳ) = µX − µY.
In addition we note that the variance of D̄ does incorporate the covariance
between the two samples and does have to be calculated separately as
σ²D := V(D̄) = V((1/n) Σ_{i=1}^{n} Di) = (1/n²) Σ_{i=1}^{n} V(Di) = (σ²X + σ²Y − 2σXY)/n.
Example 0.22. A new and an old type of rubber compound can be used in
tires. A researcher is interested in the compound/type that does not wear
easily. Ten cars were chosen at random to go around a track a predetermined
number of times. Each car did this twice, once for each tire type, and the
depth of the tread was then measured.
Car    1     2     3     4     5     6     7     8     9    10
New  4.35  5.00  4.21  5.03  5.71  4.61  4.70  6.03  3.80  4.70
Old  4.19  4.62  4.04  4.72  5.52  4.26  4.27  6.24  3.46  4.50
D    0.16  0.38  0.17  0.31  0.19  0.35  0.43 -0.21  0.34  0.20

With d̄ = 0.232 and sD = 0.183, and assuming that the data are normally
distributed, a 95% C.I. for µnew − µold = µD is
0.232 ∓ t1−0.025,9 (0.183/√10) = 0.232 ∓ 2.262(0.183/√10) → (0.101, 0.363),
and we note that the interval is strictly greater than 0, implying that the
difference is positive, i.e. that µnew > µold. In fact we can conclude that
µnew is larger than µold by at least 0.101 units and at most 0.363 units.
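Using the data in the table, a paired analysis sketch in R:
new <- c(4.35, 5.00, 4.21, 5.03, 5.71, 4.61, 4.70, 6.03, 3.80, 4.70)
old <- c(4.19, 4.62, 4.04, 4.72, 5.52, 4.26, 4.27, 6.24, 3.46, 4.50)
t.test(new, old, paired = TRUE)   # 95% CI approx. (0.101, 0.363)
t.test(new - old)                 # equivalent one-sample analysis of the differences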
Chapter 1
Simple Linear Regression
In this chapter we hypothesize a linear relationship between the two variables,
estimate and draw inference about the model parameters.
1.1 Model
The simplest deterministic mathematical relationship between two variables x and y is a linear relationship
y = β0 + β1 x,
where the coefficients
• β0 represents the y-axis intercept, the value of y when x = 0,
• β1 represents the slope, interpreted as the amount of change in the
value of y for a 1 unit increase in x.
To this model we add variability by introducing the random variables
ǫi ∼ N(0, σ²), i.i.d. for each observation i = 1, . . . , n. Hence, the statistical
model by which we wish to model one random variable using known values
of some predictor variable becomes
Yi = β0 + β1 xi + ǫi,   i = 1, . . . , n,   (1.1)
where β0 + β1 xi is the systematic component, Yi represents the r.v.
corresponding to the response, i.e. the variable we wish to model, and xi
stands for the observed value of the predictor.
Therefore we have that
Yi ∼ind. N(β0 + β1 xi, σ²).   (1.2)
Notice that the Y's are no longer identically distributed since their mean
depends on the value of xi.

Figure 1.1: Regression model.
Remark 1.1. An alternate form with centered predictor is
Yi = β0 + β1(xi − x̄) + β1 x̄ + ǫi = (β0 + β1 x̄) + β1(xi − x̄) + ǫi = β0⋆ + β1(xi − x̄) + ǫi,
where β0⋆ := β0 + β1 x̄.
In order to fit a regression line one needs to find estimates for the coefficients
β0 and β1 that determine the mean line
ŷi = β̂0 + β̂1 xi.
1.2 Parameter Estimation
1.2.1 Regression function
The goal is to have this line as "close" to the data points as possible. The
concept is to minimize the error from the actual data points to the predicted
points (in the direction of Y, i.e. vertically):
min Σ_{i=1}^{n} (Yi − E(Yi))²  →  min Σ_{i=1}^{n} (Yi − (β0 + β1 xi))².
Hence, the goal is to find the values of β0 and β1 that minimize the sum of
the squared distances between the points and their expected value under the
model. This is done by the following steps:
1. Take the partial derivatives with respect to β0 and β1.
2. Equate the two resulting equations to 0.
3. Solve the simultaneous equations for β0 and β1.
4. (Optional) Take second partial derivatives to show that the solution is in
fact a minimum, not a maximum.
Therefore,
b1 := β̂1 = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)²
  = [(Σ_{i=1}^{n} xi yi) − n x̄ ȳ] / [(Σ_{i=1}^{n} xi²) − n x̄²]
  = Σ_{i=1}^{n} ki yi,   where ki := (xi − x̄)/Σ_{j=1}^{n} (xj − x̄)²,
and
b0 := β̂0 = ȳ − b1 x̄ = Σ_{i=1}^{n} li yi,   where li := 1/n − x̄(xi − x̄)/Σ_{j=1}^{n} (xj − x̄)².   (1.3)
Hence both b1 and b0 are linear estimators, as they are linear combinations
of the responses.
Remark 1.2. Do not extrapolate the model to values of the predictor x that were
not in the data, as it is not clear how the model may behave for other values.
Also, do not fit a linear regression to data that do not appear to be linear.
Definition 1.1. The ith residual is defined to be the difference between the
observed and fitted value of the response for point i.
ei = yi − ŷi
Notable Properties:
• Σ ei = 0
• Σ xi ei = 0
• Σ ŷi ei = 0
• Σ yi = Σ ŷi
• The regression line always goes through (x̄, ȳ)
1.2.2 Variance
The variance term in the model is
σ² = V(ǫ) = E(ǫ²).
Hence, to estimate it, the "sample mean" of the squared residuals e²i seems a
reasonable estimate:
s² = MSE = σ̂² = Σ_{i=1}^{n} (yi − ŷi)²/(n − 2) = Σ_{i=1}^{n} e²i/(n − 2) = SSE/(n − 2),
where MSE stands for Mean Squared Error and SSE for Sum of Squares
Error. Note that in the denominator we have n − 2, as we lose 2 degrees of
freedom since we had to estimate two parameters, β0 and β1, when estimating
our center, ŷi.
Remark 1.3. Estimation of model parameters can also be done via maximum
likelihood, which yields exactly the same estimates of the parameters of the
systematic component, β0 and β1, but the estimate of σ² is slightly biased:
σ̂² = Σ_{i=1}^{n} (yi − ŷi)²/n,  so  MSE = (n/(n − 2)) σ̂².
Example 1.1. Let x be the number of copiers serviced and Y be the time
spent (in minutes) by the technician for a known manufacturer.
Obs          1   2  · · ·  44  45
Time (y)    20  60  · · ·  61  77
Copiers (x)  2   4  · · ·   4   5

Table 1.1: Quantity of copiers and service time
The complete dataset can be found at
http://www.stat.ufl.edu/~athienit/STA4210/Examples/copiers.csv
Figure 1.2: Scatterplot of Time vs Copiers.
The scatterplot shows that there is a strong positive relationship between
the two variables. Below is the R output.
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.5802     2.8039  -0.207    0.837
Copiers      15.0352     0.4831  31.123   <2e-16 ***
---
Residual standard error: 8.914 on 43 degrees of freedom
Multiple R-squared: 0.9575, Adjusted R-squared: 0.9565
F-statistic: 968.7 on 1 and 43 DF,  p-value: < 2.2e-16
http://www.stat.ufl.edu/~athienit/STA4210/Examples/copier.R
The estimated equation is
ŷ = −0.5802 + 15.0352x
We note that the slope b1 = 15.0352 implies that for each unit increase in
copier quantity, the service time increases by 15.0352 minutes (for quantity
values between 1 and 10).
If we wish to estimate the time needed for a service call for 5 copiers that
would be
−0.5802 + 15.0352(5) = 74.5958 minutes
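For completeness, a sketch of the R commands behind this output (the CSV column names Time and Copiers are assumed; the fitted object reg is reused in Chapter 2):
copier <- read.csv("http://www.stat.ufl.edu/~athienit/STA4210/Examples/copiers.csv")
reg <- lm(Time ~ Copiers, data = copier)          # simple linear regression fit
summary(reg)                                      # reproduces the coefficient table above
predict(reg, newdata = data.frame(Copiers = 5))   # estimated mean time for 5 copiers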
Example 1.2. Data on lot size (x) and work hours (y) was obtained from
25 recent runs of a manufacturing process. (See example on page 19 of
textbook). A simple linear regression model was fit in R yielding
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   62.366     26.177   2.382   0.0259 *
lotsize        3.570      0.347  10.290 4.45e-10 ***

Residual standard error: 48.82 on 23 degrees of freedom
Multiple R-squared: 0.8215, Adjusted R-squared: 0.8138
F-statistic: 105.9 on 1 and 23 DF, p-value: 4.449e-10
Figure 1.3: Scatterplot of Work Hours vs Lot Size.
We can obtain the residuals, but note that from their magnitude in hours it
may not be easy to determine whether a value is large or small in the context
of the problem. Later we shall discuss standardized residuals.
> round(resid(toluca.reg),1)
    1     2     3     4     5     6     7     8     9    10    11    12    13
 51.0 -48.5 -19.9  -7.7  48.7 -52.6  55.2   4.0 -66.4 -83.9 -45.2 -60.3   5.3
   14    15    16    17    18    19    20    21    22    23    24    25
-20.8 -20.1   0.6  42.5  27.1  -6.7 -34.1 103.5  84.3  38.8  -6.0  10.7
> round(rstandard(toluca.reg),1)
   1    2    3    4    5    6    7    8    9   10   11   12   13
 1.1 -1.1 -0.4 -0.2  1.0 -1.1  1.2  0.1 -1.4 -1.8 -1.0 -1.3  0.1
  14   15   16   17   18   19   20   21   22   23   24   25
-0.5 -0.4  0.0  0.9  0.6 -0.1 -0.7  2.3  1.8  0.8 -0.1  0.2
Note that the first residual implies that the actual observed value of work
hours was 51 hours greater than the model estimates. However, this difference
is only 1.1 standard deviations.
http://www.stat.ufl.edu/~athienit/STA4210/Examples/toluca.R
Chapter 2
Inferences in Regression
2.1 Inferences concerning β0 and β1

The coefficients b0 and b1 of equation (1.3) are linear combinations of the
responses. Therefore, they have corresponding r.vs B0 and B1 and, since the
Y's are independent normal r.vs (see (1.1)), by Proposition 0.2 they are
themselves normal r.vs. Re-expressing the r.v. B1,
B1 = Σ_{i=1}^{n} (xi − x̄)(Yi − Ȳ) / Σ_{i=1}^{n} (xi − x̄)² = · · · = Σ_{i=1}^{n} ki Yi,
where ki = (xi − x̄)/Σ_{j=1}^{n} (xj − x̄)².
Some notable properties are:
• Σ ki = 0
• Σ ki² = 1/Σ (xi − x̄)²
• Σ ki xi = 1
This implies
E(B1) = Σ_{i=1}^{n} ki E(Yi) = Σ_{i=1}^{n} ki (β0 + β1 xi) = β0 Σ_{i=1}^{n} ki + β1 Σ_{i=1}^{n} ki xi = β1
and
V(B1) = Σ_{i=1}^{n} ki² V(Yi) = σ² Σ_{i=1}^{n} ki² = σ² / Σ_{j=1}^{n} (xj − x̄)².
Thus,
B1 ∼ N(β1, σ² / Σ_{i=1}^{n} (xi − x̄)²).
Remark 2.1. The larger the spread in the values of the predictor, the larger
the Σ_{i=1}^{n} (xi − x̄)² value will be, and hence the smaller the variances of B0
and B1. Also, since the (xi − x̄)² are nonnegative terms, when we have more
data points, i.e. larger n, we are summing more nonnegative terms and
Σ_{i=1}^{n} (xi − x̄)² gets larger.
Remark 2.2. The intercept term is not of much practical importance as it
is the value of the response when the predictor value is 0 and is included to
provide us with a “nice” model whether significant or not. Hence, inference
is omitted. It can be shown, in similar fashion, that
B0 ∼ N(β0, [1/n + x̄²/Σ_{i=1}^{n} (xi − x̄)²] σ²).
Remark 2.3. The r.vs B0 and B1 are not independent and their covariance is not 0:
Cov(B0, B1) = Cov(Σ li Yi, Σ ki Yi) = Σ li ki V(Yi),
since
Cov(li Yi, kj Yj) = li ki V(Yi) if i = j, and 0 if i ≠ j.
In practice, σ² is not known and is replaced by its estimate, MSE. This is a
scenario that we are all too familiar with; similar to equation (5), we use a
Student's t distribution instead of the normal,
(B1 − β1) / (s/√(Σ (xi − x̄)²)) ∼ tn−2.
This is because (not proven in this class)
(n − 2)s²/σ² = SSE/σ² ∼ χ²n−2   (2.1)
is independent of B1, and the ratio of a standard normal to the square root of
an independent chi-square divided by its degrees of freedom has a t-distribution.
Important to note is the fact that the degrees of freedom are n − 2, as 2 were
lost due to the estimation of β0 and β1 in the mean.
Therefore, a 100(1 − α)% C.I. for β1 is
β̂1 ∓ t1−α/2,n−2 sb1,
where sb1 = s/√(Σ_{i=1}^{n} (xi − x̄)²). Similarly, for a null hypothesis value
H0 : β1 = β10, the test statistic is
T.S. = (β̂1 − β10)/sb1 ∼H0 tn−2,
and p-values and conclusions are made in the standard way, see Section 0.3.
We have not yet learned to perform inference on all parameters in the model
Yi ∼ind. N(β0 + β1 xi, σ²). We can perform inference on the parameters
associated with the mean, i.e. β1 (and β0), but not yet σ². From (2.1) we have that
1 − α = P(χ²(α/2,n−2) < SSE/σ² < χ²(1−α/2,n−2)) = P(SSE/χ²(1−α/2,n−2) < σ² < SSE/χ²(α/2,n−2)),
and hence the 100(1 − α)% C.I. for σ² is
(SSE/χ²(1−α/2,n−2), SSE/χ²(α/2,n−2)).   (2.2)
Example 2.1. Back to the copier example 1.1, a 95% C.I. for
• β1 is 15.0352 ∓ t1−0.025,43 (0.4831) = 15.0352 ∓ 2.016692(0.4831) → (14.061010, 16.009486).
• σ² (here using the Toluca fit of Example 1.2, for which SSE = 23(48.82²) with 23 degrees of freedom) is
(23(48.82²)/38.076, 23(48.82²)/11.689).
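A sketch of these intervals in R, assuming the fitted objects reg (Example 1.1) and toluca.reg (Example 1.2) are available:
confint(reg, "Copiers", level = 0.95)                        # C.I. for beta_1
SSE <- sum(resid(toluca.reg)^2)                              # error sum of squares
SSE / qchisq(c(0.975, 0.025), df = toluca.reg$df.residual)   # C.I. for sigma^2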
2.2 Inferences involving E(Y) and Ŷpred
2.2.1 Confidence interval on the mean response
The mean is no longer a constant but is in fact a “mean line”.
µY |X=xobs := E(Y |X = xobs ) = β0 + β1 xobs
Hence, we can create an interval for the mean at a specific value of the
predictor xobs . We simply need to find a statistic to estimate the mean and
find its distribution. The sample statistic used is
ŷ = b0 + b1 xobs
and the corresponding r.v. is
Ŷ = B0 + B1 xobs = Σ_{i=1}^{n} [1/n + (xobs − x̄)(xi − x̄)/Σ_{j=1}^{n} (xj − x̄)²] Yi.   (2.3)
Note that this can be expressed as a linear combination of the independent
normal r.vs Yi whose distribution is known to be normal (equation (1.2)).
Therefore, Ŷ is also a normal r.v. with mean
E(Ŷ) = E(B0) + E(B1)xobs = β0 + β1 xobs
and variance
V(Ŷ) = V(B0 + B1 xobs)
  = V[Ȳ + B1(xobs − x̄)]   since B0 = Ȳ − B1 x̄
  = V(Ȳ) + (xobs − x̄)² V(B1) + 2(xobs − x̄) Cov(Ȳ, B1)
  = σ²/n + (xobs − x̄)² σ²/Σ_{i=1}^{n} (xi − x̄)²,
since Cov(Ȳ, B1) = (1/n)σ² Σ ki = 0. Hence,
Ŷ ∼ N(β0 + β1 xobs, [1/n + (xobs − x̄)²/Σ_{j=1}^{n} (xj − x̄)²] σ²).
Thus, a 100(1 − α)% C.I. for the mean response µY|X=xobs is
ŷ ∓ t1−α/2,n−2 sŶ,   where sŶ = s √(1/n + (xobs − x̄)²/Σ_{j=1}^{n} (xj − x̄)²).
Example 2.2. Refer back to Example 1.1. Assume we are interested in a
95% C.I. for the mean time value when the quantity of copiers is 5.
74.59608 ∓ t1−0.025,43 (1.329831) = 74.59608 ∓ 2.016692(1.329831) → (71.91422, 77.27794)
In R,
> newdata=data.frame(Copiers=5)
> predict.lm(reg,se.fit=TRUE,newdata,interval="confidence",level=0.95)
$fit
fit
lwr
upr
1 74.59608 71.91422 77.27794
$se.fit
[1] 1.329831
$df
[1] 43
2.2.2 Prediction interval
Once a regression model is fitted, after obtaining data (x1 , y1 ), . . . , (xn , yn ),
it may be of interest to predict a future value of the response. From equation
(1.1), we have some idea where this new prediction value will lie, somewhere
around the mean response
β0 + β1 x new
However, according to the model, equation (1.1), we do not expect new
predictions to fall exactly on the mean response, but close to them. Hence,
42
the r.v. corresponding to the statistic we plan to use is the same as equation
(2.3) with the addition of the error term ǫ ∼ N(0, σ 2 )
Ŷpred = B0 + B1 xnew + ǫ.
Therefore,
Ŷpred ∼ N(β0 + β1 xnew, [1 + 1/n + (xnew − x̄)²/Σ_{j=1}^{n} (xj − x̄)²] σ²),
and a 100(1 − α)% prediction interval (P.I.), for a value of the predictor that
is unobserved, i.e. not in the data, is
ŷpred ∓ t1−α/2,n−2 spred,   where spred = s √(1 + 1/n + (xnew − x̄)²/Σ_{j=1}^{n} (xj − x̄)²).
Example 2.3. Refer back to Example 1.1. Let us estimate the future service
time value when copier quantity is 7 and create an interval around it. The
predicted value is
−0.5802 + 15.0352(7) = 104.6666 minutes
and a 95% P.I. around the predicted value is
104.6666 ∓ t1−0.025,43 (9.058051) = 104.6666 ∓ 2.016692(9.058051) → (86.399, 122.9339)
In R
> newdata=data.frame(Copiers=7)
> predict.lm(reg,se.fit=TRUE,newdata,interval="prediction",level=0.95)
$fit
fit
lwr
upr
1 104.6666 86.39922 122.9339
$se.fit
[1] 1.6119
$df
[1] 43
Note that the se.fit provided is the value for the C.I., not the P.I. However,
in the calculation of the P.I. the correct standard error term is used.
http://www.stat.ufl.edu/~athienit
Example 2.4. Also see confidence and prediction intervals for Example 1.2:
http://www.stat.ufl.edu/~athienit/STA4210/Examples/toluca.R
2.2.3 Confidence Band for Regression Line
If we wish to create a simultaneous estimate of the population mean for all
predictor values x, that is, a 100(1 − α)% simultaneous C.I. for β0 + β1 x, we use
ŷ ∓ W sŶ,
known as the Working-Hotelling confidence band, where
W = √(2 F1−α;2,n−2).
Example 2.5. Continuing from example 1.2 (Toluca) we can not only evaluate the band at specific points but at all points and plot it with the script
found in
http://www.stat.ufl.edu/~athienit/STA4210/Examples/toluca.R
CI=predict(toluca.reg,se.fit=TRUE)
W=sqrt(2*qf(0.95,length(toluca.reg$coefficients),toluca.reg$df.residual))
Band=cbind( CI$fit - W * CI$se.fit, CI$fit + W * CI$se.fit )
points(sort(toluca$lotsize), sort(Band[,1]), type="l", lty=2)
points(sort(toluca$lotsize), sort(Band[,2]), type="l", lty=2)
legend("topleft",legend=c("Mean Line","95% CB"),col=c(1,1),
+ lty=c(1,2),bg="gray90")
Figure 2.1: Working-Hotelling 95% confidence band.
2.3 Analysis of Variance Approach
Next we introduce some notation that will be useful in conducting inference
of the model. In order to determine whether a regression model is adequate
we must compare it to the most naive model which uses the sample mean
Ȳ as its prediction, i.e. Ŷ = Ȳ . This model does not take into account any
predictors as the prediction is the same for all values of x. Then, the total
distance of a point yi to the sample mean ȳ can be broken down into two
components, one measuring the error of the model for that point, and one
measuring the “improvement” distance accounted by the regression model.
(yi − ȳ) = (yi − ŷi) + (ŷi − ȳ),   i.e.   Total = Error + Regression.
Looking back at Figure 1.1 and singling out a point we have that,
Figure 2.2: Sum of Squares breakdown.
Summing over all observations we have that
Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} (yi − ŷi)² + Σ_{i=1}^{n} (ŷi − ȳ)²,   i.e.   SST = SSE + SSR,   (2.4)
since the cross-product term
Σ_{i=1}^{n} (yi − ŷi)(ŷi − ȳ) = Σ ei(ŷi − ȳ) = Σ ei ŷi − ȳ Σ ei = 0 − 0 = 0.
Remark 2.4. A useful result is
SSR = Σ (ŷi − ȳ)² = Σ (b0 + b1 xi − ȳ)² = Σ (ȳ − b1 x̄ + b1 xi − ȳ)² = b1² Σ (xi − x̄)² = b1² (n − 1)s²x.
Each sum of squares term has an associated degrees of freedom value:
SSR has 1, SSE has n − 2, and SST has n − 1 (with 1 + (n − 2) = n − 1).
We can summarize this information in an ANOVA table:

Source   df      MS            E(MS)
Reg      1       SSR/1         σ² + β1² Σ (xi − x̄)²
Error    n − 2   SSE/(n − 2)   σ²
Total    n − 1

Table 2.1: ANOVA table
Note that
SSE/σ² ∼ χ²n−2 ⇒ E(SSE/σ²) = n − 2 ⇒ E(SSE/(n − 2)) = σ²,
and that
MSR = SSR = b1² Σ (xi − x̄)² ⇒ E(MSR) = Σ (xi − x̄)² E(B1²) = Σ (xi − x̄)² [V(B1) + E²(B1)] = σ² + β1² Σ (xi − x̄)².
2.3.1 F-test for β1
In Section 2.1 we saw a t-test for testing the significance of β1, but now we
introduce a different test that will be especially useful later in testing multiple
β's simultaneously. In Table 2.1 we notice that
E(MSR)/E(MSE) = 1 if β1 = 0,   and   E(MSR)/E(MSE) > 1 if β1 ≠ 0.
By Cochran's theorem it has been shown that under H0 : β1 = 0,
• SSR/σ² ∼ χ²1 and SSE/σ² ∼ χ²n−2, and the two are independent,
• (χ²1/1) / (χ²n−2/(n − 2)) ∼ F1,n−2.
Hence, we have that
T.S. = [(SSR/σ²)/1] / [(SSE/σ²)/(n − 2)] = MSR/MSE ∼H0 F1,n−2.
The null is rejected if the p-value P(F1,n−2 > T.S.) < α, the area to the right
being less than α.
Figure 2.3: F1,n−2 distribution and p-value.
Remark 2.5. The F-test and t-test for H0 : β1 = 0 vs. Ha : β1 ≠ 0 are
equivalent since
MSR/MSE = b1² Σ (xi − x̄)²/MSE = b1²/[MSE/Σ (xi − x̄)²] = b1²/s²b1 = (b1/sb1)².
Example 2.6. Continuing from Example 1.2, note that t² = 10.290² = 105.9 = F, with the same p-value.
2.3.2 Goodness of fit
A goodness of fit statistic is a quantity that measures how well a model
explains a given set of data. For regression, we will use the coefficient of
determination
R² = SSR/SST = 1 − SSE/SST,
which is the proportion of variability in the response (about its naive mean ȳ)
that is explained by the regression model, and R² ∈ [0, 1].
Remark 2.6. For simple linear regression with (only) one predictor, the
coefficient of determination is the square of the correlation coefficient, with
the sign matching that of the slope, i.e.
r = +√R² if b1 > 0,   r = −√R² if b1 < 0,   r = 0 if b1 = 0.
Example 2.7. In the output of Example 1.2 we have R² = 0.8215, implying
that 82.15% of the (naive) variability in the work hours can now be explained
by the regression model that incorporates lot size as the only predictor.
2.4 Normal Correlation Models
Normal correlation models are useful when instead of a random normal response and a fixed predictor, there are two random normal variables and one
will be used to model the other.
Let (Y1, Y2) have a bivariate normal distribution with p.d.f.
f(y1, y2) = [1/(2πσ1σ2√(1 − ρ²12))] exp{ −[((y1 − µ1)/σ1)² − 2ρ12((y1 − µ1)/σ1)((y2 − µ2)/σ2) + ((y2 − µ2)/σ2)²] / (2(1 − ρ²12)) },
where ρ12 is the correlation coefficient σ12 /(σ1 σ2 ). It can be shown that
marginally Y1 ∼ N(µ1 , σ12 ) and Y2 ∼ N(µ2 , σ22 ). Hence, the conditional
density of (Y1 |Y2 = y2 ), and similarly of (Y2|Y1 = y1 ), can be found as
f(y1|y2) = f(y1, y2)/f(y2) = [1/(√(2π) σ1|2)] exp{ −½ [(y1 − α1|2 − β1|2 y2)/σ1|2]² },
where α1|2 = µ1 − µ2 ρ12(σ1/σ2), β1|2 = ρ12(σ1/σ2), and σ²1|2 = σ²1(1 − ρ²12).
Thus,
Y1|Y2 = y2 ∼ N(α1|2 + β1|2 y2, σ²1|2)
and we can “model” or make educated guesses as to the values of variable
Y1 given Y2 (where Y2 is random).
To determine if Y2 is an adequate “predictor” for Y1 , all we need to do is
test H0 : ρ12 = 0, since under the null, (Y1 |Y2 ) ≡ Y1 . The sample estimate is
the same as in equation (1). The test statistic is
T.S. = r12 √(n − 2) / √(1 − r²12) ∼H0 tn−2,
with p-values for two- and one-sided tests found in the usual way. However,
working with confidence intervals is more practical and even easier if we apply
Fisher's transformation to the sample correlation,
z′ = ½ log[(1 + r12)/(1 − r12)].
If the sample size is large, i.e. n ≥ 25, then
z′ approx.∼ N(ζ, 1/(n − 3)),   where ζ = ½ log[(1 + ρ12)/(1 − ρ12)],
and a 100(1 − α)% C.I. for ζ is
z′ ∓ z1−α/2 √(1/(n − 3)) → (L, U),
and hence a 100(1 − α)% C.I. for ρ12 (after back-transforming ζ) is
((e^(2L) − 1)/(e^(2L) + 1), (e^(2U) − 1)/(e^(2U) + 1)).
Non-normal data: When the data are not normal then we must implement a nonparametric procedure such as Spearman Rank Correlation coefficient.
1. Rank (y11 , . . . , yn1 ) from 1 to n and label as (R11 , . . . , Rn1 ).
2. Rank (y12 , . . . , yn2 ) from 1 to n and label as (R12 , . . . , Rn2 ).
3. Compute

   rs = Σ_{i=1}^n (Ri1 − R̄1)(Ri2 − R̄2) / √[ Σ_{i=1}^n (Ri1 − R̄1)² Σ_{i=1}^n (Ri2 − R̄2)² ]
To test the null hypothesis of no association between Y1 and Y2 use the test
statistic

   T.S. = rs √(n − 2) / √(1 − rs²) ∼ t_{n−2} under H0.

Reject if the p-value < α.
Example 2.8. Consider the Muscle mass problem 1.27 and let Y1 =muscle
mass, Y2 =age and we wish to model (Y1 |Y2)
> muscle=read.table("http://www.stat.ufl.edu/~rrandles/sta4210/
+ Rclassnotes/data/textdatasets/KutnerData/
+ Chapter%20%201%20Data%20Sets/CH01PR27.txt",col.names=c("Y1","Y2"))
> attach(muscle)
> n=length(Y1)
> r=cor(Y1,Y2);r
[1] -0.866064
> b1=r*sd(Y1)/sd(Y2);b1
[1] -1.189996
> b0=mean(Y1)-mean(Y2)*b1;b0
[1] 156.3466
> s2=var(Y1)*(1-r^2);s2
[1] 65.6686
Hence the estimated model is

   Y1 | Y2 = y2 ∼ N(156.35 − 1.19 y2, 65.67),

with r12 = −0.866.
To test H0 : ρ12 = 0
> TS=(r*sqrt(n-2))/sqrt(1-r^2)
> 2*pt(-abs(TS),n-2) #2 sided pvalue
[1] 4.123987e-19
we reject the null due to the extremely small p-value. We can also create a
95% C.I. for ρ12
> zp=0.5*log((1+r)/(1-r))
> LU=zp+c(1,-1)*qnorm(0.025)*1/sqrt(n-3)
> (exp(2*LU)-1)/(exp(2*LU)+1)
[1] -0.9180874 -0.7847085
and conclude that there is a significant negative relationship.
Obviously, before performing any of these procedures we need to be able to
assume that both variables are normal, which we will see how to assess later. If we cannot
assume normality then we need to use Spearman's correlation:
> rs=cor(Y1,Y2,method="spearman");rs # default method is pearson
[1] -0.8657217
> TSs=(rs*sqrt(n-2))/sqrt(1-rs^2)
> 2*pt(-abs(TSs),n-2) #2 sided pvalue
[1] 4.418881e-19
and reach the same conclusion.
http://www.stat.ufl.edu/~ athienit/STA4210/Examples/corr_model.R
Chapter 3

Diagnostics and Remedial Measures

3.1 Diagnostics for Predictor Variable
The goal is to identify any outlying values that could affect the appropriateness of the linear model. More information about influential cases will be
covered in Chapter 10. The two main issues are:
• Outliers.
• The levels of the predictor are associated with the run order when the
experiment is run sequentially.
To check these we use
• Histogram and/or Boxplot
• Sequence Plot
Example 3.1. Continuing from example 1.2 we see that there do not appear
to be any outliers
[Figure: histogram and box plot of lot size]
and no pattern/dependency between the values of the predictor and the run order.
[Figure: sequence plot of lot size versus run order]
3.2 Checking Assumptions

Recall that for the simple linear regression model

   Yi = β0 + β1 xi + ǫi,   i = 1, . . . , n,

we assume that ǫi i.i.d. ∼ N(0, σ²) for i = 1, . . . , n. However, once a model is
fit, before any inference or conclusions are made based upon a fitted model,
the assumptions of the model need to be checked.
These are:
1. Normality
2. Homogeneity of variance
3. Model fit/Linearity
4. Independence
with components of model fit being checked simultaneously within the first
three. The assumptions are checked using the residuals ei := yi − ŷi for
i = 1, . . . , n, or the standardized residuals, which are the residuals standardized
so that their standard deviation should be 1.
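As a minimal sketch (the fitted object below uses the built-in cars data purely for illustration), both kinds of residuals are readily available in R:

```r
fit <- lm(dist ~ speed, data = cars)  # built-in cars data, purely for illustration
e  <- resid(fit)                      # ordinary residuals  e_i = y_i - yhat_i
re <- rstandard(fit)                  # internally standardized residuals
c(sd(e), sd(re))                      # the standardized ones have spread close to 1
```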
3.2.1 Graphical methods
Normality
The simplest way to check for normality is with two graphical procedures:
• Histogram
• P-P or Q-Q plot
A probability plot is a graphical technique for comparing two data sets,
either two sets of empirical observations, or one empirical set against a theoretical set.
Definition 3.1. The empirical distribution function, or empirical c.d.f., is
the cumulative distribution function associated with the empirical measure
of the sample. This c.d.f. is a step function that jumps up by 1/n at each of
the n data points,

   F̂n(x) = (number of elements ≤ x)/n = (1/n) Σ_{i=1}^n I{xi ≤ x}.
Example 3.2. Consider the sample: 1, 5, 7, 8. The empirical c.d.f. is

   F̂4(x) = 0      if x < 1
           0.25   if 1 ≤ x < 5
           0.50   if 5 ≤ x < 7
           0.75   if 7 ≤ x < 8
           1      if x ≥ 8
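In R the empirical c.d.f. can be obtained directly with ecdf(); a quick sketch checking it against the values above:

```r
x  <- c(1, 5, 7, 8)
Fn <- ecdf(x)          # step function that jumps by 1/4 at each data point
Fn(c(0, 1, 6, 7.5, 9)) # 0.00 0.25 0.50 0.75 1.00
plot(Fn, main = "Empirical c.d.f.")
```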
Figure 3.1: Empirical c.d.f.
The normal probability plot is a graphical technique for normality testing
by assessing whether or not a data set is approximately normally distributed.
The data are plotted against a theoretical normal distribution in such a way
that the points should form an approximate straight line. Departures from
this straight line indicate departures from normality.
There are two types of plots commonly used to compare the empirical c.d.f.
to the normal theoretical one (G(·)):

• the P-P plot, which plots (F̂n(x), G(x)) (with scales changed to look linear),

• the Q-Q plot, which plots the quantile functions (F̂n⁻¹(x), G⁻¹(x)).
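A minimal normal Q-Q plot in R (the vector x below is simulated right-skewed data, purely to mimic the shape of the example that follows):

```r
set.seed(1)
x <- rexp(37)          # simulated right-skewed data
qqnorm(x)              # sample quantiles against theoretical normal quantiles
qqline(x)              # reference line through the first and third quartiles
```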
Example 3.3. An experiment of lead concentrations (mg/kg dry weight)
from 37 stations, yielded 37 observations. Of interest is to determine if the
data are normally distributed (of more practical use if sample sizes are small,
e.g. < 30).
[Figure: normal Q-Q plot and smoothed histogram (normal versus data) of the lead concentrations]
Note that the data appears to be skewed right, with a lighter tail on the
left and a heavier tail on the right (as compared to the normal).
http://www.stat.ufl.edu/~ athienit/IntroStat/QQ.R
With the vertical axis being the theoretical quantiles and the horizontal
axis being the sample quantiles, the interpretation of P-P plots and Q-Q plots
is equivalent. Compared to the straight line that corresponds to the distribution
you wish to compare your data against, here is a quick guideline for the tails:

                Left tail   Right tail
   Above line   Heavier     Lighter
   Below line   Lighter     Heavier
A histogram of the residuals is plotted and we try to determine if the
histogram is symmetric and bell shaped like a normal distribution is. In
addition, to check the model fit, we assume the observed response values
yi are centered around the regression line ŷ. Hence, the histogram of the
residuals should be centered at 0.
Example 3.4. Referring to Example 1.1, we obtain the following

[Figure: normal Q-Q plot and histogram of the standardized residuals]
Homogeneity of variance/Fit of model
Recall that the regression model assumes that the errors ǫi have constant
variance σ 2 . In order to check this assumption a plot of the residuals (ei )
versus the fitted values (ŷi ) is used. If the variance is constant, one expects
to see a constant spread/distance of the residuals to the 0 line across all the
ŷi values of the horizontal axis. Referring to Example 1.1, we see that this
assumption does not appear to be violated.
[Figure: standardized residuals versus fitted values ŷ]
Figure 3.2: Residual versus fitted values plot.
In addition, the same plot can be used to check the fit of the model.
If the model is a good fit, one expects to see the residuals evenly spread
on either side of the 0 line. For example, if we observe residuals that are
more heavily sided above the 0 line for some interval of ŷi, then this is an
indication that the regression line is not "moving" through the center of the
data points for that section. By construction, the regression line does "move"
through the center of the data overall, i.e. for the whole big picture. So if it
is underestimating (or overestimating) for some portion then it will overestimate
(or underestimate) for some other. This is an indication that there is
some curvature and that perhaps some polynomial terms should be added.
(To be discussed in the next chapter.)
Independence
To check for independence a time series plot of the residuals/standardized
residuals is used, i.e. a plot of the value of the residual versus the value of
its position in the data set. For example, the first data point (x1 , y1 ) will
yield the residual e1 = y1 − ŷ1 . Hence, the order of e1 is 1, and so forth.
Independence is graphically checked if there is no discernible pattern in the
plot. That is, one cannot predict the next ordered residual by knowing
a few previous ordered residuals. Referring to Example 1.1, we obtain the
following plot where there does not appear to be any discernible pattern.
[Figure: standardized residuals versus order]
Figure 3.3: Time series plot of residuals.
Remark 3.1. Note that when creating this plot, the order in which the data
were obtained must be the same as the order in which they appear in the
datasheet. For example, assume that each person in a group is asked a
question one at a time. Then possibly the second person might be influenced
by the first person's response, and so forth. If the data were then sorted,
e.g. alphabetically, this order may be lost.
It is also important to note that this graph is heavily influenced by the
validity of the model fit. Here is an example we will actually be addressing
later in example 3.13
[Figure: histogram of std res, normal Q-Q plot, homogeneity/fit plot and independence plot for the memory-recall data of Example 3.13]
http://www.stat.ufl.edu/~ athienit/STA4210/Examples/copier.R
3.2.2 Significance tests

Independence

• Runs Test (presumes data are in time order)

  – Write out the sequence of +/− signs of the residuals.

  – Count n1 = number of +ve residuals, n2 = number of −ve residuals.

  – Count u = number of "runs" of +ve and −ve residuals. So what is a run?
    For example, if we have the following 9 residuals:

       −   +++   −−   +   −−
       1    2     3   4    5

    then we have in fact u = 5 runs with n1 = 4 and n2 = 5.
The null hypothesis is that the data are independent (random placement). We will
use the exact sampling distribution of u to determine the p-value. The p.m.f.
of the corresponding r.v. U is

   p(u) = 2 C(n1−1, k−1) C(n2−1, k−1) / C(n1+n2, n1)                                u = 2k, k ∈ N (u is even)

   p(u) = [ C(n1−1, k−1) C(n2−1, k) + C(n1−1, k) C(n2−1, k−1) ] / C(n1+n2, n1)      u = 2k + 1, k ∈ N (u is odd)

where C(a, b) denotes the binomial coefficient "a choose b". Then, the p-value is
defined as P(U ≤ u). Luckily, there is no need to do this by hand.
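Still, purely as an illustrative sketch, the exact lower-tail p-value can be computed directly from the p.m.f. above; the numbers below match the 9-residual illustration:

```r
# exact p.m.f. of the number of runs U, as given above
runs.pmf <- function(u, n1, n2) {
  if (u %% 2 == 0) {                         # u = 2k (even)
    k <- u / 2
    2 * choose(n1 - 1, k - 1) * choose(n2 - 1, k - 1) / choose(n1 + n2, n1)
  } else {                                   # u = 2k + 1 (odd)
    k <- (u - 1) / 2
    (choose(n1 - 1, k - 1) * choose(n2 - 1, k) +
     choose(n1 - 1, k) * choose(n2 - 1, k - 1)) / choose(n1 + n2, n1)
  }
}
# exact p-value P(U <= 5) for n1 = 4 positives and n2 = 5 negatives
sum(sapply(2:5, runs.pmf, n1 = 4, n2 = 5))
```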
Example 3.5. Continuing from example 1.2, we run the ”Runs” test
on the standardized residuals in R
> library(lawstat) #may need to install package
> runs.test(re,plot.it=TRUE)
Runs Test - Two sided
data: re
Standardized Runs Statistic = -1.015, p-value = 0.3101
and note that we fail to reject the null due to the large p-value. We have 11
runs out of a maximum of 25. There is also another runs.test in the randtests
package (which actually provides the value of u).

[Figure: runs of the standardized residuals (labelled A/B) plotted against order]
Remark 3.2. It is notable that for n1 + n2 > 20,

   U approx. ∼ N( µu, σu² ),   with
   µu = 2 n1 n2/(n1 + n2) + 1   and
   σu² = 2 n1 n2 (2 n1 n2 − n1 − n2) / [ (n1 + n2)² (n1 + n2 − 1) ],

and a test statistic can be used,

   T.S. = (u − µu + 0.5)/σu,

to calculate the p-value P(Z ≤ T.S.).
• Durbin-Watson Test. For this test we assume that the error term in
  equation (1.1) is of the form

     ǫi = ρ ǫi−1 + ui,   ui i.i.d. ∼ N(0, σ²),  |ρ| < 1.

  That is, the error term at a certain time period i is correlated with the
  error term at time i − 1.
  The null hypothesis is H0 : ρ = 0, i.e. uncorrelated. The test statistic is

     T.S. = Σ_{i=2}^n (ei − ei−1)² / Σ_{i=1}^n ei²,

  where the denominator is SSE. Once the sampling distribution of the test
  statistic is determined, p-values can be obtained. However, the density
  function of this statistic is not easy to work with, so we leave the heavy
  lifting to software.
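The statistic itself, though, is simple to compute from the residuals; a minimal sketch assuming the fitted object toluca.reg from the example that follows:

```r
e  <- resid(toluca.reg)              # residuals in time (run) order
DW <- sum(diff(e)^2) / sum(e^2)      # sum of (e_i - e_{i-1})^2 divided by SSE
DW                                   # should agree with durbinWatsonTest()/dwtest()
```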
Example 3.6. Continuing from example 1.2,
> library(car)
> durbinWatsonTest(toluca.reg)
 lag Autocorrelation D-W Statistic p-value
   1       0.2593193       1.43179   0.166
Alternative hypothesis: rho != 0
> library(lmtest)
> dwtest(toluca.reg,alternative="two.sided")
Durbin-Watson test
data: toluca.reg
DW = 1.4318, p-value = 0.1616
alternative hypothesis: true autocorrelation is not 0
The p-value is large, i.e. greater than 5%, and hence we fail to reject
the null, and conclude independence.
Remark 3.3. The book suggests that in business and economics the
correlation tends to be positive and hence a one sided test should be
performed. However, this decision is context specific and left to the
researcher.
Normality test
As expected there are many tests for normality. For a current list visit
https://en.wikipedia.org/wiki/Normality_test. For now, we will discuss the Shapiro-Wilk Test.
The null hypothesis is that normality holds (for the data entered; here we
will use the standardized residuals).
Example 3.7. Continuing from example 1.2,
> shapiro.test(re)
Shapiro-Wilk normality test
data: re
W = 0.97917, p-value = 0.8683
and hence we fail to reject the assumption of normality.
Homogeneity of variance
• If the response can be split into t distinct groups, i.e. the predictor(s)
are categorical, then use the Brown-Forsythe/Levene Test. This test is
used to test whether multiple populations have the same variance.
The null hypothesis is that

   H0 : V(ǫi) = σ²  ∀i,

or equivalently

   H0 : σ1² = · · · = σt².
Remark 3.4. If the data cannot be split into distinct groups, this can be
done artificially by separating the responses based on their predictor
values or fitted values. For example we can create two groups, data
with “small” fitted values and data with “large” fitted values. Much in
the same way we create bins for a histogram.
The test statistic is tedious to calculate and left for software. However,

   T.S. ∼ F_{t−1,n−t} under H0,

where t is the number of groups and n is the grand total number of
observations. The p-value is P(F_{t−1,n−t} ≥ T.S.). Reject the null if the p-value < α.
Example 3.8. Continuing with example 1.2, assume we wish to split
the data into two groups depending on whether the lot size is greater
than 75 or not.
> ind=I(toluca$lotsize>75)
> temp=cbind(toluca$lotsize,re,ind);temp
            re ind
1   80  1.07281843   1
2   30 -1.06174371   0
3   50 -0.41228961   0
4   90 -0.15886988   1
5   70  1.01932471   0
6   60 -1.10742255   0
7  120  1.25374204   1
8   80  0.08237676   1
9  100 -1.45603607   1
10  50 -1.86517337   0
11  40 -0.96611354   0
12  70 -1.27729740   0
13  90  0.10987623   1
14  20 -0.45782448   0
15 110 -0.43096533   1
16 100  0.01286011   1
17  30  0.92610180   0
18  50  0.56452124   0
19  90 -0.13817529   1
20 110 -0.73719055   1
21  30  2.50810852   0
22  90  1.87651913   1
23  40  0.82578984   0
24  80 -0.12266660   1
25  70  0.21940878   0
> leveneTest(temp[,2],ind) # fcn in car library
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  1  1.6553  0.211
      23
Warning message:
In leveneTest.default(temp[, 2], ind) : ind coerced to factor.

With a p-value greater than 0.05 we fail to reject the null.
• Breusch-Pagan/Cook-Weisberg Test. It tests whether the estimated
  variance of the residuals from a regression depends on the values of the
  independent/predictor variables, in which case heteroskedasticity is present:

     σ² = E(ǫ²) = γ0 + γ1 x1 + · · · + γp xp.

  The null hypothesis is that the variance does not depend on the
  independent/predictor variables. Although we will usually have software
  calculate the test statistic, the process is fairly simple.

  1. Obtain SSE = Σ_{i=1}^n ei² from the original equation.

  2. Fit a regression with ei² as the response using the same predictor(s),
     and obtain SSR⋆.

  3. Compute

        T.S. = (SSR⋆/2) / (SSE/n)² ∼ χ²_p under H0,

     where p is the number of predictors in the model. The null is
     rejected if the p-value P(χ²_p ≥ T.S.) < α.
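A sketch of the hand computation, assuming the toluca data frame and fitted object toluca.reg used in the surrounding examples:

```r
e2   <- resid(toluca.reg)^2                    # squared residuals from the original fit
SSE  <- sum(e2)                                # step 1
aux  <- lm(e2 ~ lotsize, data = toluca)        # step 2: regress e^2 on the same predictor
SSRs <- sum((fitted(aux) - mean(e2))^2)        # SSR* from the auxiliary regression
n <- nrow(toluca); p <- 1
TS <- (SSRs / 2) / (SSE / n)^2                 # step 3
pchisq(TS, df = p, lower.tail = FALSE)         # p-value
```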
Example 3.9. Continuing with example 1.2,
> ncvTest(toluca.reg) # fcn in car library
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.8209192
Df = 1
p = 0.3649116
Hence, we fail to reject the null since the p-value > α.
Linearity of regression

We will perform an F-test for Lack-of-Fit if there are t distinct levels of the
predictor(s). It is not a valid test if the number of distinct levels is large, i.e.
t ≈ n.

   H0 : E(Yi) = β0 + β1 xi   vs   Ha : E(Yi) ≠ β0 + β1 xi

1. For each distinct level compute ŷj and ȳj, j = 1, . . . , t.

2. Compute SSLF = Σ_{j=1}^t Σ_{i=1}^{nj} (ȳj − ŷj)² = Σ_{j=1}^t nj (ȳj − ŷj)²,
   with degrees of freedom t − 2.

3. Compute SSPE = Σ_{j=1}^t Σ_{i=1}^{nj} (yij − ȳj)², with degrees of freedom n − t.

4. Compute

      T.S. = [ SSLF/(t − 2) ] / [ SSPE/(n − t) ] ∼ F_{t−2,n−t} under H0.

The null is rejected if the p-value P(F_{t−2,n−t} ≥ T.S.) < α.
In R there is a work around where you do not have to compute these SS
explicitly, as illustrated in the following example.
Example 3.10. Continuing with example 1.2, we note that there are 11
distinct levels of lot size in the 25 observations.
> length(unique(toluca$lotsize));length(toluca$lotsize)
[1] 11
[1] 25
> Reduced=toluca.reg # fit reduced model
> Full=lm(workhrs~0+as.factor(lotsize),data=toluca) # fit full model
> anova(Reduced, Full) # get lack-of-fit test
Analysis of Variance Table

Model 1: workhrs ~ lotsize
Model 2: workhrs ~ 0 + as.factor(lotsize)
  Res.Df   RSS Df Sum of Sq      F Pr(>F)
1     23 54825
2     14 37581  9     17245 0.7138 0.6893
The p-value is greater than 0.05 so we fail to reject the null and conclude
that the model is an adequate (linear) fit.
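Equivalently, the sums of squares can be computed explicitly; a sketch using the same toluca objects:

```r
ybar <- tapply(toluca$workhrs, toluca$lotsize, mean)       # group means per distinct level
yhat <- tapply(fitted(toluca.reg), toluca$lotsize, mean)   # fitted value at each level
nj   <- table(toluca$lotsize)
SSLF <- sum(nj * (ybar - yhat)^2)                          # lack-of-fit SS, df = t - 2
SSPE <- sum((toluca$workhrs - ybar[as.character(toluca$lotsize)])^2)  # pure error SS, df = n - t
t.lev <- length(nj); n <- nrow(toluca)
TS <- (SSLF / (t.lev - 2)) / (SSPE / (n - t.lev))
pf(TS, t.lev - 2, n - t.lev, lower.tail = FALSE)           # matches the anova() output above
```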
3.3 Remedial Measures
• Nonlinear Relation: Add polynomials, fit exponential regression function,
or transform x and/or y (more emphasis on x).
• Non-Constant Variance: Weighted Least Squares, transform y and/or
x, or fit Generalized Linear Model.
• Non-Independence of Errors: Transform y or use Generalized Least
Squares, or fit Generalized Linear Model with correlated errors.
• Non-Normality of Errors: Box-Cox transformation, or fit Generalized
Linear Model.
• Omitted Predictors: Include important predictors in a multiple regression model.
• Outlying Observations: Robust Estimation or Nonparametric regression.
3.3.1 Box-Cox (Power) transformation
In the event that the model assumptions appear to be violated to a significant degree, then a linear regression model on the available data is not valid.
However, have no fear, your friendly statistician is here. The data can be
transformed, in an attempt to fit a valid regression model to the new transformed data set. Both the response and the predictor can be transformed
but there is usually more emphasis on the response.
Remark 3.5. However, when we apply such a transformation, call it g(·), we
are in fact fitting the mean line
E(g(Y )) = β0 + β1 x1 + . . .
As a result we cannot back-transform, i.e. apply the inverse transformation,
to make inference on E(Y), since

   g⁻¹[E(g(Y))] ≠ E(Y).
A common transformation mechanism is the Box-Cox transformation
(also known as Power transformation). This transformation mechanism when
applied to the response variable will attempt to remedy the “worst” of the
assumptions violated, i.e. to reach a compromise. A word of caution, is
that in an attempt to remedy the worst it may worsen the validity of one
of the other assumptions. The mechanism works by trying to identify the
(minimum or maximum depending on software) value of a parameter λ that
will be used as the power to which the responses will be transformed. The
transformation is
 λ

 yi − 1
if λ 6= 0
λ−1
(λ)
yi = λGy

G log(y ) if λ = 0
y
i
Q
where Gy = ( ni=1 yi )1/n denotes the geometric mean of the responses. Note
that a value of λ = 1 effectively implies no transformation is necessary. There
are many software packages that can calculate an estimate for λ, and if the
sample size is large enough even create a C.I. around the value. Referring to
Example 1.1, we see that λ̂ = 1.11.
Figure 3.4: Box-Cox plot (profile log-likelihood versus λ with a 95% interval).
However, one could argue that the value is close to 1 and that a transformation may not necessarily improve the overall validity of the assumptions,
so no transformation is necessary. In addition, we know that linear regression is somewhat robust to deviations from the assumptions, and it is more
practical to work with the untransformed data that are in the original units
of measurements. For example, if the data is in miles and a transformation
is used on the response, inference will be on log(miles).
Example 3.11. Continuing from example 1.1, we use the following R script:
http://www.stat.ufl.edu/~ athienit/STA4210/Examples/boxcox.R
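The script is not reproduced here, but a minimal sketch of what such a Box-Cox analysis typically looks like in R, shown on the toluca fit from the earlier examples purely for illustration (note that MASS's boxcox() omits the geometric-mean scaling used above; the maximizing λ is unaffected):

```r
library(MASS)
bc <- boxcox(toluca.reg, lambda = seq(-2, 2, 0.1))  # profile log-likelihood over a grid of lambda
bc$x[which.max(bc$y)]                               # the lambda that maximizes it
```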
Example 3.12. http://www.stat.ufl.edu/~ athienit/STA4210/Examples/diagnostic&BoxCox
If the model fit assumption is the major culprit violated, a transformation of the predictor(s) will often resolve the issue without
having to transform the response and consequently changing its scale.
Example 3.13. In an experiment, 13 subjects were asked to memorize a list of
disconnected items and then asked to recall them at various times up to a week later.
• Response = proportion of items recalled correctly.
• Predictor = time, in minutes, since initially memorized the list.
   Time   1     5     15    30    60    120   240   480   720   1440  2880  5760  10080
   Prop   0.84  0.71  0.61  0.56  0.54  0.47  0.45  0.38  0.36  0.26  0.20  0.16  0.08
[Figure: histogram of std res, normal Q-Q plot, homogeneity/fit plot and independence plot for the model with time as predictor, together with the scatterplot of prop versus time]

bcPower Transformation to Normality
         Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
dat$time    0.0617   0.1087          -0.1514           0.2748

Likelihood ratio tests about transformation parameters
                            LRT df         pval
LR test, lambda = (0)  0.327992  1 5.668439e-01
LR test, lambda = (1) 46.029370  1 1.164935e-11
It seems that a decent choice for λ is 0, i.e. a log transformation for time.
[Figure: diagnostic plots after the log transformation of time (histogram of std res, normal Q-Q plot, homogeneity/fit and independence), together with the scatterplot of prop versus l.time]
http://www.stat.ufl.edu/~ athienit/STA4210/Examples/diagnostic&Linearity.R
Remark 3.6. When creating these graphs and checking for "patterns", try to
keep the axis for the standardized residuals ranging from −3 to 3, that is,
3 standard deviations below 0 to 3 standard deviations above 0. Software has
a tendency to "zoom" in.
Is glass smooth? If you are viewing by eye then yes. If you are viewing
via an electron microscope then no.
In R just add plot(....., ylim=c(-3,3))
3.3.2 Lowess (smoothed) plots
• Nonparametric method of obtaining a smooth plot of the regression
relation between y and x.
• Fits regression in small neighborhoods around points along the regression line on the horizontal axis.
• Weights observations closer to the specific point higher than more distant points.
• Re-weights after fitting, putting lower weights on larger residuals (in
absolute value).
• Obtains fitted value for each point after “final” regression is fit.
• The lowess curve is plotted along with the linear fit and confidence bands;
the linear fit is good if the lowess curve lies within the bands.
Example 3.14. For Example 1.2, assume we wish to fit a lowess regression in R
using the loess function with smoothing parameter α = 0.5.
[Figure: workhrs versus lotsize with the loess (α = 0.5) curve, the SLR line, and 95% confidence bands]
Figure 3.5: Lowess smoothed plot.
http://www.stat.ufl.edu/~ athienit/STA4210/Examples/loess.R
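A minimal sketch of such a plot, using the toluca objects from the earlier examples and a pointwise 95% confidence band around the linear fit (the script at the URL above may differ in details, e.g. using a simultaneous band):

```r
xg  <- data.frame(lotsize = seq(min(toluca$lotsize), max(toluca$lotsize), length.out = 100))
ci  <- predict(toluca.reg, newdata = xg, interval = "confidence")  # band around the linear fit
lo  <- loess(workhrs ~ lotsize, data = toluca, span = 0.5)         # smoothing parameter 0.5
smo <- predict(lo, newdata = xg)

plot(workhrs ~ lotsize, data = toluca, pch = 16)
abline(toluca.reg)                          # simple linear regression fit
lines(xg$lotsize, ci[, "lwr"], lty = 2)     # pointwise 95% confidence band
lines(xg$lotsize, ci[, "upr"], lty = 2)
lines(xg$lotsize, smo, col = "red")         # loess curve; fit is adequate if it stays inside
```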
Chapter 4

Simultaneous Inference and Other Topics

The main concept here is that if a 95% C.I. is created for β0 and another 95%
C.I. for β1 we cannot say that we are 95% confident that these two confidence
intervals are simultaneously both correct.

4.1 Controlling the Error Rate
Let αI denote the individual comparison Type I error rate. Thus, P (Type I error) =
αI on each of the g tests.
Now assume we wish to combine all the individual tests into an overall/combined/simultaneous test
H0 = H01 ∩ H02 ∩ · · · ∩ H0g
H0 is rejected if any of the null hypotheses H0i is rejected.
The experimentwise error rate αE , is the probability of falsely rejecting
at least one of the g null hypotheses. If each of the g tests is done with αI ,
then assuming each test is independent and denoting the probability of not
falsely rejecting H0i by Ei
   αE = 1 − P(∩_{i=1}^g Ei)
      = 1 − Π_{i=1}^g P(Ei)      (by independence)
      = 1 − (1 − αI)^g
For example, if αI = 0.05 and 10 comparisons are made then αE = 0.401
which is very large.
However, if we do not know if the tests are independent, we use the
Bonferroni inequality

   P(∩_{i=1}^g Ei) ≥ Σ_{i=1}^g P(Ei) − g + 1,

which implies

   αE = 1 − P(∩_{i=1}^g Ei) ≤ g − Σ_{i=1}^g P(Ei) = Σ_{i=1}^g [1 − P(Ei)] = Σ_{i=1}^g αI = g αI
Hence, αE ≤ g αI. So what we will do is choose an α to serve as an upper
bound for αE. That is, we won't know the true value of αE but we will know
it is bounded above by α, i.e. αE ≤ α. For example, if we set α = 0.05 then
αE ≤ 0.05, or the simultaneous C.I. formed from g individual C.I.'s will have a
confidence of at least 95% (if not more). Set

   αI = α/g.

For example, if we have 5 multiple comparisons and wish the overall
error rate to be 0.05, or simultaneous confidence of at least 95%, then each one
(of the five) C.I.'s must be done at the

   100(1 − 0.05/5)% = 99%

confidence level.
For additional details the reader can read the multiple comparisons problem
and the familywise error rate.
4.1.1 Simultaneous estimation of mean responses

• Bonferroni: Can be used for g simultaneous C.I.s, each done at the
  100(1 − α/g)% level. If g is large then these intervals will be "too" wide
  for practical conclusions.

     ŷ ∓ t_{1−α/(2g), n−2} s_Ŷ

• Working-Hotelling: A confidence band is created for the entire regression
  line that can be used for any number of confidence intervals for means
  simultaneously.

     ŷ ∓ √(2 F_{1−α; 2, n−2}) s_Ŷ

4.1.2 Simultaneous predictions

• Bonferroni: Can be used for g simultaneous P.I.s, each done at the
  100(1 − α/g)% level. If g is large then these intervals will be "too" wide
  for practical conclusions.

     ŷ ∓ t_{1−α/(2g), n−2} s_pred

• Scheffé: Widely used method. Like the Bonferroni, the width increases
  as g increases.

     ŷ ∓ √(g F_{1−α; g, n−2}) s_pred
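A quick sketch comparing the multipliers for, say, g = 3 simultaneous statements at 95% with n = 25 (sample size chosen purely for illustration):

```r
n <- 25; g <- 3; alpha <- 0.05
bonf <- qt(1 - alpha / (2 * g), df = n - 2)   # Bonferroni multiplier for g intervals
wh   <- sqrt(2 * qf(1 - alpha, 2, n - 2))     # Working-Hotelling (any number of means)
sch  <- sqrt(g * qf(1 - alpha, g, n - 2))     # Scheffe multiplier for g predictions
c(Bonferroni = bonf, WorkingHotelling = wh, Scheffe = sch)
```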
4.2 Regression Through the Origin

When theoretical reasoning (in the context at hand) suggests that the regression
line must pass through the origin (x = 0, y = 0), then the regression line
must try to meet this criterion. This is done by restricting the intercept to
0, i.e. β0 = 0, yielding the model

   Yi = β1 xi + ǫi
However, with this model there are some issues:

• V(Y | x = 0) = 0. The variance of the response at the origin is set to 0,
  which is not consistent with the "usual" regression model.

• Σ_{i=1}^n ei does not necessarily equal 0.

• SSE can potentially be larger than SST, affecting the analysis of variance
  and the R² interpretation.

The reader is referred to p.164 of the textbook for more details.
To estimate β1 via least squares we need to minimize

   Σ_{i=1}^n (yi − β1 xi)².

Taking the derivative with respect to β1 and equating to 0, we have

   −2 Σ_{i=1}^n [ xi (yi − β1 xi) ] = 0
   ⇒ Σ_{i=1}^n xi yi = b1 Σ_{i=1}^n xi²
   ⇒ b1 = Σ_{i=1}^n xi yi / Σ_{i=1}^n xi² = Σ_{i=1}^n ( xi / Σ xi² ) yi.

The only other difference is that the degrees of freedom for error are now n − 1,
hence slightly changing the MSE estimate. As a result,

   • s²_{b1} = MSE / Σ_{i=1}^n xi²

   • s²_Ŷ = MSE x² / Σ_{i=1}^n xi²

   • s²_pred = MSE ( 1 + x² / Σ_{i=1}^n xi² )

which are used in the C.I. for β1, the C.I. for the mean response, and the P.I.
Example 4.1. A plumbing company operates 12 warehouses. A regression
is fit with work units performed (x) and total variable cost (y). A regression
through the origin yielded
[Figure: scatterplot of labor versus work with the fitted line through the origin]
> plumbing=read.table(...
> with(plumbing,plot(labor~work,pch=16))
> plumb.reg=lm(labor~0+work,data=plumbing)
> summary(plumb.reg)
Coefficients:
     Estimate Std. Error t value Pr(>|t|)
work  4.68527    0.03421     137   <2e-16 ***
Residual standard error: 14.95 on 11 degrees of freedom
Multiple R-squared: 0.9994,Adjusted R-squared: 0.9994
F-statistic: 1.876e+04 on 1 and 11 DF,
p-value: < 2.2e-16
> abline(plumb.reg)
Now we can create a PI for when work is equal to 100. R can do this too and
it uses the right standard error. However, if you ask it to print the se.fit
it only provides the se.fit for the CI, not the PI
> syhat=sqrt(223.42*(100^2/sum(plumbing$work^2)));syhat
[1] 3.420475
> spred=sqrt(223.42*(1+100^2/sum(plumbing$work^2)));spred
[1] 15.33361
> newdata=data.frame(work=100)
> predict.lm(plumb.reg,newdata)+c(1,-1)*qt(0.025,11)*spred
[1] 434.7784 502.2765
> predict.lm(plumb.reg,newdata,se.fit=TRUE,interval="prediction")
$fit
       fit      lwr      upr
1 468.5274 434.7781 502.2767
$se.fit
[1] 3.420502
$df
[1] 11
http://www.stat.ufl.edu/~ athienit/STA4210/Examples/plumbing_origin.R
4.3 Measurement Errors
Firstly, let’s take a look at what we mean when a variable/effect is fixed or
random and why there is still confusion concerning the use of these.
http://andrewgelman.com/2005/01/25/why_i_dont_use/
4.3.1 Measurement error in the dependent variable
There is no problem as long as there is no bias, i.e. consistently recording
lower or higher values. The extra error term is absorbed into the existing
error term ǫ for the response Y .
4.3.2 Measurement error in the independent variable
Assume there is no bias in measurement error.
• Not a problem when the observed (recorded) value is fixed and actual
value is random. For example, when the oven dial is set to 400◦F the
actual temperature inside is not exactly 400◦ F.
• When the observed (recorded) value is random it causes a problem by
biasing β1 downward. Let Xi denote the true (unobserved) value, and
Xi⋆ , the observed (recorded) value. Then, the measurement error is
δi = Xi⋆ − Xi
The true model can be expressed as
Yi = β0 + β1 Xi + ǫi
= β0 + β1 (Xi⋆ − δi ) + ǫi
= β0 + β1 Xi⋆ + (ǫi − β1 δi )
and we assume that δi is

   – unbiased, i.e. E(δi) = 0,

   – uncorrelated with the random error, implying that E(ǫi δi) = E(ǫi) E(δi) = 0.

Hence,

   Cov(Xi⋆, ǫi − β1 δi) = E{ [Xi⋆ − E(Xi⋆)][(ǫi − β1 δi) − E(ǫi − β1 δi)] }
                        = E{ [Xi⋆ − Xi][ǫi − β1 δi] }
                        = E{ δi (ǫi − β1 δi) }
                        = E{ δi ǫi } − β1 E(δi²)
                        = −β1 V(δi).

Therefore, the recorded value Xi⋆ is not independent of the error term
(ǫi − β1 δi) and

   E(Yi | Xi⋆) = β0⋆ + β1⋆ Xi⋆,

where

   β1⋆ = β1 σ²_X / (σ²_X + σ²_δ) < β1.

4.4 Inverse Prediction
The goal is to predict a new predictor value based on an observed new value
of the response. Once we have a model, it is easy to show (by rearranging terms
in the prediction equation) that the prediction is

   x̂new = (ynew − b0)/b1.
It has been shown (in higher level statistics courses) that if

   t²_{1−α/2, n−2} MSE / [ b1² Σ_{i=1}^n (xi − x̄)² ] < 0.1   (approximately),

then a 100(1 − α)% P.I. for xnew is

   x̂new ∓ t_{1−α/2, n−2} √{ (MSE/b1²) [ 1 + 1/n + (x̂new − x̄)² / Σ_{i=1}^n (xi − x̄)² ] }.
Remark 4.1. Bonferroni or Scheffé adjustments should be made for multiple
simultaneous predictions.
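A sketch of the computation, assuming the toluca fit from the earlier examples and a hypothetical observed new response of ynew = 300 work hours:

```r
b <- coef(toluca.reg); MSE <- summary(toluca.reg)$sigma^2
x <- toluca$lotsize; n <- length(x)
ynew <- 300                                   # hypothetical new observed response
xhat <- (ynew - b[1]) / b[2]                  # point prediction for x

cond <- qt(0.975, n - 2)^2 * MSE / (b[2]^2 * sum((x - mean(x))^2))
cond                                          # should be (approximately) < 0.1
s <- sqrt((MSE / b[2]^2) * (1 + 1/n + (xhat - mean(x))^2 / sum((x - mean(x))^2)))
xhat + c(-1, 1) * qt(0.975, n - 2) * s        # approximate 95% P.I. for x_new
```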
4.5 Choice of Predictor Levels

Recall that in most standard errors the term Σ_{i=1}^n (xi − x̄)² was present
somewhere in a denominator. For example,

   V(B1) = σ² / Σ_{i=1}^n (xi − x̄)².
So, in order to decrease the standard error we need to maximize this term,
which in essence is a measure of spread of the predictor, by
(i) Increase sample size, n.
(ii) Increase the spacing of the predictor.
Depending on the goal of research, when planning a controlled experiments,
and selecting predictor levels, choose:
• 2 levels if only interested in whether there is an effect and its direction,
• 3 levels if goal is describing relation and any possible curvature,
• 4 or more levels for further description of response curve and any potential non-linearity such as an asymptote.
Chapter 5

Matrix Approach to Simple Linear Regression
We will cover the basics necessary to provide us with better understanding of
regression which will be especially useful for multiple regression. The reader
is also encouraged to review further topics and material at
• http://stattrek.com/tutorials/matrix-algebra-tutorial.aspx
• https://www.youtube.com/watch?v=xyAuNHPsq-g&list=PLFD0EB975BA0CC1E0
Definition 5.1. A matrix is a rectangular array of numbers or symbolic
elements.
In many applications, the rows will represent individual cases and columns
will represent attributes or characteristics. The dimensions of a matrix are its
numbers of rows and columns, often denoted m × n, and it has the form

           ( a1,1   a1,2   · · ·   a1,n
   Am,n =    a2,1   a2,2   · · ·   a2,n
              ...    ...    ...     ...
             am,1   am,2   · · ·   am,n )

5.1 Special Types of Matrices
• Square matrix: The number of rows is the same as the number of
  columns. For example,

     A2,2 = ( a1,1  a1,2
              a2,1  a2,2 )

• Vector: A column vector is a matrix with only one column, and a row
  vector is a matrix with only one row. For example, c = (c1, c2, . . . , cn)^T.

• Transpose: A matrix formed by interchanging rows and columns. For
  example,

     G = ( 6  15  22            ( 6    8
           8  13  25 )  ⇒ G^T =   15  13
                                  22  25 )

• Matrix equality: Two matrices of the same dimension are equal when
  each element that is in the same position in each matrix is equal.

• Symmetric matrix: A square matrix whose transpose is equal to itself,
  i.e. A^T = A, or element-wise ai,j = aj,i. For example,

     A = (  6  19  −8
           19  14   3    ⇒ A^T = A.
           −8   3   1 )

• Diagonal matrix: Square matrix with all off-diagonal elements equal to
  0. For example,

     A3,3 = ( a1  0   0
              0   a2  0   = diag(a1, a2, a3)
              0   0   a3 )

• Identity matrix: A diagonal matrix with all the diagonal elements equal
  to 1, i.e. Im = diag(1, 1, . . . , 1). For example, I3 = diag(1, 1, 1).
  We will see later that Im Am,n = Am,n, and that Am,n In = Am,n.

• Scalar matrix: A diagonal matrix with all the diagonal elements equal
  to the same scalar k, that is kIm. For example, kI3 = diag(k, k, k).

• 1-vector and matrix: The 1-vector is simply a column vector whose
  elements are all 1. Similarly for the matrix, denoted by J. For example,
  J3 is the 3 × 3 matrix with every element equal to 1.
5.2 Basic Matrix Operations
To perform basic matrix operations in R, please visit
http://www.statmethods.net/advstats/matrix.html.
5.2.1 Addition and subtraction

Addition and subtraction are done elementwise for matrices of the same dimension:

   Am,n + Bm,n = ( a1,1 + b1,1   a1,2 + b1,2   · · ·   a1,n + b1,n
                   a2,1 + b2,1   a2,2 + b2,2   · · ·   a2,n + b2,n
                      ...           ...         ...       ...
                   am,1 + bm,1   am,2 + bm,2   · · ·   am,n + bm,n )

Similarly for subtraction.

In regression, let

   Y = (Y1, . . . , Yn)^T,   E(Y) = (E(Y1), . . . , E(Yn))^T,   ǫ = (ǫ1, . . . , ǫn)^T.

The model can be expressed as

   Y = E(Y) + ǫ
5.2.2 Multiplication

We begin with multiplication of a matrix A by a scalar k. Each element of
A is multiplied by k, i.e. kAm,n has (i, j)th element k ai,j.

Multiplication of a matrix by a matrix is only defined if the inner dimensions
are equal, that is, the column dimension of the first matrix equals the row
dimension of the second matrix. That is, Am,n Bp,q is only defined if
n = p. The resulting matrix Am,n Bn,q is of dimension m × q with (i, j)th
element

   [ab]i,j = Σ_{k=1}^n ai,k bk,j,   i = 1, . . . , m,  j = 1, . . . , q.
Example 5.1. Let

   A3,2 = ( 2   5              B2,2 = ( 3  −1
            3  −1                       2   4 )
            0   7 )

then

   AB = ( 2(3) + 5(2)       2(−1) + 5(4)         ( 16  18
          3(3) + (−1)(2)    3(−1) + (−1)(4)   =     7  −7
          0(3) + 7(2)       0(−1) + 7(4) )         14  28 )

Remark 5.1. When AB is defined, the matrix can be expressed as a linear
combination of the

• columns of A

• rows of B

Take example 5.1: the two columns of AB are

   3 (2, 3, 0)^T + 2 (5, −1, 7)^T   and   (−1)(2, 3, 0)^T + 4 (5, −1, 7)^T,

and the three rows of AB are

   2 (3, −1) + 5 (2, 4),   3 (3, −1) + (−1)(2, 4),   0 (3, −1) + 7 (2, 4).
In R:

> A=matrix(c(2,3,0,5,-1,7),3,2); B=matrix(c(3,2,-1,4),2,2)
> A%*%B
     [,1] [,2]
[1,]   16   18
[2,]    7   -7
[3,]   14   28
Remark 5.2. Matrix multiplication is only defined when the inner dimensions
match and as such in example 5.1, AB is defined but BA is not. Even in
cases where both AB and BA are defined, it is not necessarily true that
AB = BA. Take for example

   A = ( 1  2          B = ( 5  6
         3  4 )              7  8 )
Systems of linear equations can also be written in matrix form. For
example, let x1 and x2 be unknown such that

   a1,1 x1 + a1,2 x2 = y1
   a2,1 x1 + a2,2 x2 = y2

This can be expressed as

   ( a1,1  a1,2 ) ( x1 )   ( y1 )
   ( a2,1  a2,2 ) ( x2 ) = ( y2 ),      i.e.   Ax = y.                        (5.1)

Also, sums of squares can be expressed as a vector multiplication,

   Σ_{i=1}^n xi² = x^T x,   where x = (x1, . . . , xn)^T.
Some useful multiplications that we will be using in regression are presented
in the following list:

List 1.

   • Xβ = ( 1  x1 ) ( β0 )   ( β0 + β1 x1
            ...      ( β1 ) =     ...
            1  xn )            β0 + β1 xn )

   • y^T y = Σ_{i=1}^n yi²

   • X^T X = (      n          Σ_{i=1}^n xi
               Σ_{i=1}^n xi    Σ_{i=1}^n xi² )

   • X^T y = ( Σ_{i=1}^n yi
               Σ_{i=1}^n xi yi )
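These pieces are exactly what is needed to compute the least-squares estimates by matrix operations; a minimal sketch with a tiny made-up data set (it anticipates the result b = (X^T X)⁻¹X^T y derived in Section 5.7):

```r
x <- c(1, 2, 3, 4); y <- c(2.2, 3.9, 6.1, 8.3)   # made-up data
X <- cbind(1, x)                                 # design matrix with a column of ones
t(X) %*% X                                       # compare with the form in the list above
t(X) %*% y
b <- solve(t(X) %*% X) %*% t(X) %*% y            # least-squares estimates
b
coef(lm(y ~ x))                                  # same numbers from lm()
```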
5.3 Linear Dependence and Rank

Definition 5.2. Let A be an m × n matrix that is made up of n column
vectors ai, i = 1, . . . , n, each of dimension m, i.e. A = [a1 · · · an]. When n
scalars k1, . . . , kn, not all zero, can be found such that

   Σ_{i=1}^n ki ai = 0,

then the n columns are said to be linearly dependent. If the equality holds
only for k1 = · · · = kn = 0, then the columns are said to be linearly independent.
The definition also holds for rows.
Example 5.2. Consider the matrix

   A = ( 1   0.5   3
         2   7     3
         4   8     9 )

Notice that if we let scalars k1 = 2, k2 = 1, k3 = −1, then

   2 (1, 0.5, 3) + 1 (2, 7, 3) − 1 (4, 8, 9) = (0, 0, 0),

so the rows of A are linearly dependent.
Example 5.3. Consider the simple identity matrix

   I3 = ( 1  0  0
          0  1  0
          0  0  1 )

Notice that the only way to achieve the 0 vector is with scalars k1 = k2 = k3 = 0, i.e.

   0 (1, 0, 0)^T + 0 (0, 1, 0)^T + 0 (0, 0, 1)^T = 0.
Without going into too much detail we present the following definition.
Definition 5.3. The rank of a matrix is the number of linearly independent
columns or rows of the matrix. Hence, rank(Am,n ) ≤ min(m, n). If equality
holds, then the matrix is said to be of full rank.
There are many ways to determine the rank of a matrix, such as counting the
number of non-zero eigenvalues, but the simplest way is to express the matrix
in reduced row echelon form and count the number of non-zero rows.
However, software can calculate it for us (by finding the number of non-zero
eigenvalues).
Example 5.4. Let

   A = ( 0  1  2                ( 1  2  1
         1  2  1    ⇒  Arref =    0  1  2
         2  7  8 )                0  0  0 )
Hence, the rank(A) = 2. Row 3 is a linear combination of Rows 1 and 2.
Specifically, Row 3 = 3*( Row 1 ) + 2*( Row 2 ). Therefore, 3*( Row 1 )
+ 2*( Row 2 ) - ( Row 3)= (Row of zeroes). Hence, matrix A has only two
independent row vectors.
> A=matrix(c(0,1,2,1,2,7,2,1,8),3,3);A
     [,1] [,2] [,3]
[1,]    0    1    2
[2,]    1    2    1
[3,]    2    7    8
> qr(A)$rank
[1] 2

and let

> B=matrix(c(1,2,3,0,1,2,2,0,1),3,3);B
     [,1] [,2] [,3]
[1,]    1    0    2
[2,]    2    1    0
[3,]    3    2    1
> qr(B)$rank
[1] 3
> qr(B)$rank==min(dim(B)) #check if full rank
[1] TRUE
Remark 5.3. Other functions also exist in R that calculate the rank. The
qr() function utilizes the QR decomposition.

Remark 5.4. If we are simply interested in whether a square matrix A is full
rank or not, recall from linear algebra that a matrix is full rank (a.k.a.
nonsingular) if and only if it has a determinant that is not equal to zero, i.e.
|A| ≠ 0. Hence, if A is not of full rank (singular) it has a determinant equal
to 0, i.e. |A| = 0. For example, continuing example 5.4,
> A=matrix(c(0,1,2,1,2,7,2,1,8),3,3);A
     [,1] [,2] [,3]
[1,]    0    1    2
[2,]    1    2    1
[3,]    2    7    8
> det(A)
[1] 0
> qr(A)$rank==min(dim(A))
[1] FALSE
> B=matrix(c(1,2,3,0,1,2,2,0,1),3,3);B
     [,1] [,2] [,3]
[1,]    1    0    2
[2,]    2    1    0
[3,]    3    2    1
> det(B)
[1] 3
> qr(B)$rank==min(dim(B))
[1] TRUE
5.4 Matrix Inverse

Let An,n be a square matrix of full rank, i.e. rank(A) = n. Then A has a
(unique) inverse A⁻¹ such that

   A⁻¹A = AA⁻¹ = In

Computing the inverse of a matrix can be done manually, which requires
finding the reduced row echelon form, but we will utilize software once again.
Example 5.5. Continuing from example 5.4, only B was nonsingular and
hence has an inverse
> solve(B)
           [,1]       [,2]       [,3]
[1,]  0.3333333  1.3333333 -0.6666667
[2,] -0.6666667 -1.6666667  1.3333333
[3,]  0.3333333 -0.6666667  0.3333333
> round(solve(B)%*%B,3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
Example 5.6. In regression,

   (X^T X)⁻¹ = ( 1/n + x̄²/Σ(xi − x̄)²     −x̄/Σ(xi − x̄)²
                 −x̄/Σ(xi − x̄)²            1/Σ(xi − x̄)²  )

Recall from equation (5.1) that a system of equations (with unknown x)
can be expressed in matrix form as Ax = y. Then, if A is nonsingular,

   A⁻¹A x = A⁻¹ y   ⇒   x = A⁻¹ y.
Example 5.7. Assume we have a system of 2 equations,

   12 x1 + 6 x2 = 48
   10 x1 − 2 x2 = 12,

that can be expressed as

   ( 12   6 ) ( x1 )   ( 48 )
   ( 10  −2 ) ( x2 ) = ( 12 )

We can easily check that the 2 × 2 matrix of coefficients is nonsingular and
has inverse

   A⁻¹ = (1/84) (  2    6
                  10  −12 )

so that

   x = A⁻¹ y = (1/84) (  2    6 ) ( 48 )   ( 2 )
                       ( 10 −12 ) ( 12 ) = ( 4 )
5.5 Useful Matrix Results
All rules assume that the matrices are conformable to operations.
• Addition:
– A+B =B+A
– (A + B) + C = A + (B + C)
• Multiplication:
– (AB)C = A(BC)
– C(A + B) = CA + CB
– k(A + B) = kA + kB for scalar k
• Transpose:
– (AT )T = A
– (A + B)T = AT + B T
– (AB)T = B T AT
– (ABC)T = C T B T AT
• Inverse:
– (A−1 )−1 = A
– (AB)−1 = B −1 A−1 (If A and B are non-singular)
– (ABC)−1 = C −1 B −1 A−1 (If A, B and C are non-singular)
– (AT )−1 = (A−1 )T
5.6 Random Vectors and Matrices

Let Y be a random column vector of dimension n, i.e. Y = (Y1, Y2, . . . , Yn)^T.
The expectation of this (multi-dimensional) random variable is

   µ = E(Y) = ( E(Y1), E(Y2), . . . , E(Yn) )^T
and the variance-covariance is an n × n matrix defined as

   V(Y) = E{ [Y − E(Y)][Y − E(Y)]^T }

        = E ( [Y1 − E(Y1)]²               [Y1 − E(Y1)][Y2 − E(Y2)]   · · ·   [Y1 − E(Y1)][Yn − E(Yn)]
              [Y2 − E(Y2)][Y1 − E(Y1)]    [Y2 − E(Y2)]²              · · ·   [Y2 − E(Y2)][Yn − E(Yn)]
                 ...                         ...                      ...        ...
              [Yn − E(Yn)][Y1 − E(Y1)]    [Yn − E(Yn)][Y2 − E(Y2)]   · · ·   [Yn − E(Yn)]²             )

        = ( σ1²    σ1,2   · · ·   σ1,n
            σ2,1   σ2²    · · ·   σ2,n
             ...    ...    ...     ...
            σn,1   σn,2   · · ·   σn²  )

        = Σ   (symmetric).

An alternate form is

   Σ = E(Y Y^T) − µµ^T.
More information can be found at:
https://en.wikipedia.org/wiki/Covariance_matrix
Example 5.8. In the regression model, assuming dimension n, the only
random term is ǫ (which in turn makes Y random) and we assume

   E(ǫ) = 0   and   V(ǫ) = diag(σ², σ², . . . , σ²) = σ² In.

Hence, for the model

   Y = Xβ + ǫ                                                        (5.2)

• E(Y) = E(Xβ) + E(ǫ) = Xβ

• V(Y) = V(Xβ + ǫ) = σ² In
5.6.1 Mean and variance of linear functions of random vectors

Let Am,n be a matrix of scalars and Y n,1 a random vector. Then,

   Wm,1 = AY = ( a1,1 Y1 + a1,2 Y2 + · · · + a1,n Yn
                 a2,1 Y1 + a2,2 Y2 + · · · + a2,n Yn
                    ...
                 am,1 Y1 + am,2 Y2 + · · · + am,n Yn )


E(a1,1 Y1 + a1,2 Y2 + · · · + a1,n Yn )


..

E(W ) = 
.


E(am,1 Y1 + am,2 Y2 + · · · + am,n Yn )


a1,1 E(Y1 ) + a1,2 E(Y2 ) + · · · + a1,n E(Yn )


..

=
.


am,1 E(Y1 ) + am,2 E(Y2 ) + · · · + am,n E(Yn )



E(Y1 )
a1,1 a1,2 · · · a1,n



 a2,1 a2,2 · · · a2,n   E(Y2 ) 



= .
..
..   .. 
..

.
.
.
.  . 
 .
am,1 am,2 · · · am,n
E(Yn )
= AE(Y )
and variance covariance matrix
V (W ) = E [AY − AE(Y )][AY − AE(Y )]T
= E A[Y − E(Y )][Y − E(Y )]T AT
= AE [Y − E(Y )][Y − E(Y )]T AT
= AV (Y )AT
93
5.6.2 Multivariate normal distribution

Let Y n,1 be a random vector with mean µ and variance-covariance Σ, i.e. N(µ, Σ).
Then, if Y is multivariate normal it has p.d.f.

   f(Y) = (2π)^{−n/2} |Σ|^{−1/2} exp{ −(1/2) (Y − µ)^T Σ⁻¹ (Y − µ) },

and each element Yi ∼ N(µi, σi²), i = 1, . . . , n.

Remark 5.5.

• If Am,n is a full rank matrix of scalars, then AY ∼ N(Aµ, AΣA^T).

• (True for any distribution) Two linear functions AU and BU are independent
  if and only if AΣB^T = 0. In particular, this means that Ui and Uj are
  independent if and only if the (i, j)th entry of Σ equals 0.

• Y^T AY ∼ χ²_r(λ) if and only if AΣ is idempotent with rank(AΣ) = r and
  λ = (1/2) µ^T Aµ.

• The quadratic forms Y^T AY and Y^T BY are independent if and only if
  AΣB = 0 (BΣA = 0). As a consequence, the Sums of Squares Error and
  Model (as well as its components) in linear models are independent.
5.7 Estimation and Inference in Regression

Assuming multivariate normal random errors in equation (5.2),

   Y ∼ N(Xβ, σ² In).

5.7.1 Estimating parameters by least squares

For simple linear regression, recall from Section 1.2.1 that to estimate the
parameters we had to solve a system of linear equations by minimizing

   Σ_{i=1}^n (yi − (β0 + β1 xi))² = (y − Xβ)^T (y − Xβ).

The resulting simultaneous equations after taking partial derivatives w.r.t.
β0, β1 and equating to zero are:

   n b0 + b1 Σ xi = Σ yi
   b0 Σ xi + b1 Σ xi² = Σ xi yi

which, using the results of List 1, can be expressed and solved in matrix form:

   X^T X b = X^T y   ⇒   b = (X^T X)⁻¹ X^T y
Remark 5.6. To solve this system we assumed that X T X was nonsingular.
This is nearly always the case for simple linear regression. However, for
multiple regression we will need the following proposition to guarantee that
the unique inverse exists.
Proposition 5.1. Let Xn,p , where n ≥ p. If rank(X) = p, then X T X is
nonsingular, i.e. rank(X T X) = p.
5.7.2 Fitted values and residuals

Fitted response values are

   ŷ = Xb = X(X^T X)⁻¹X^T y = H y,

where H = X(X^T X)⁻¹X^T (of dimension n × n) is called the projection matrix
(that is, if you premultiply a vector by H the result is the projection of that
vector onto the column space of X). Therefore, H is

• idempotent, i.e. HH = H

• symmetric, i.e. H^T = H

The estimated residuals are

   e = y − ŷ = y − Hy = (In − H)y,

where it is easy to check that In − H is also idempotent. As a result,

• E(Ŷ) = E(HY) = H E(Y) = HXβ = X(X^T X)⁻¹X^T Xβ = Xβ

• V(Ŷ) = H σ² In H^T = σ² H, with MSE = σ̂²

• E(e) = E[(In − H)Y] = (In − H)E(Y) = (In − H)Xβ = Xβ − Xβ = 0

• V(e) = (In − H) σ² In (In − H)^T = σ² (In − H), with MSE = σ̂²
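A sketch of these matrix computations on a small made-up data set (any (x, y) vectors would do):

```r
x <- c(1, 2, 3, 4, 5); y <- c(1.8, 4.1, 5.9, 8.2, 9.9)   # made-up data
X <- cbind(1, x)
H <- X %*% solve(t(X) %*% X) %*% t(X)     # hat/projection matrix
all.equal(H %*% H, H)                     # idempotent
yhat <- H %*% y                           # fitted values
e    <- (diag(5) - H) %*% y               # residuals
cbind(yhat, fitted(lm(y ~ x)))            # agrees with lm()
```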
5.7.3 Analysis of variance

Recall that

   SST = Σ_{i=1}^n (yi − ȳ)² = Σ_{i=1}^n yi² − (Σ_{i=1}^n yi)²/n.

Now note that

   y^T y = Σ_{i=1}^n yi²   and   (1/n) y^T J y = (Σ_{i=1}^n yi)²/n.

Therefore,

   SST = y^T y − (1/n) y^T J y = y^T (In − n⁻¹J) y.

Also,

   SSE = e^T e = (y − Xb)^T (y − Xb)
       = y^T y − y^T Xb − b^T X^T y + b^T X^T Xb
       = y^T y − b^T X^T y
       = y^T (In − H) y        since b^T X^T y = y^T Hy.

Finally,

   SSR = SST − SSE = · · · = y^T (H − n⁻¹J) y.

Remark 5.7. Note that SST, SSR and SSE are all of quadratic form, i.e.
y^T Ay for symmetric matrices A.
5.7.4 Inference

Since b = (X^T X)⁻¹X^T y, it is a linear function of the response. The corresponding
random vector can be expressed as B = AY with A = (X^T X)⁻¹X^T. Hence,

• E(B) = A E(Y) = AXβ = β

• V(B) = A V(Y) A^T = σ²(X^T X)⁻¹

and thus,

   B ∼ N(β, σ²(X^T X)⁻¹).

We can also express the C.I. and P.I. of Section 2.2 in matrix form.

• Estimated mean response at xobs:

     ŷ = b0 + b1 xobs = x_obs^T b,   with s_Ŷ = √[ MSE · x_obs^T (X^T X)⁻¹ x_obs ],

  where x_obs = (1, xobs)^T.

• Predicted response at xnew: the point estimate is the same, but

     s_pred = √[ MSE ( 1 + x_new^T (X^T X)⁻¹ x_new ) ].
Chapter 6

Multiple Regression I

This chapter incorporates large sections from Chapter 8 of the textbook.

6.1 Model

The multiple regression model is an extension of the simple regression model
whereby instead of only one predictor, there are multiple predictors to better
aid in the estimation and prediction of the response. The goal is to determine
the effects (if any) of each predictor, controlling for the others.

Let p − 1 denote the number of predictors and (yi, x1,i, x2,i, . . . , xp−1,i)
denote the p dimensional data points for i = 1, . . . , n. The statistical model is

   Yi = β0 + β1 x1,i + · · · + βp−1 xp−1,i + ǫi   ⇔   Yi = Σ_{k=0}^{p−1} βk xk,i + ǫi,
   with x0,i ≡ 1,

for i = 1, . . . , n, where ǫi i.i.d. ∼ N(0, σ²).
Multiple regression models can also include polynomial terms (powers of
predictors). For example, one can define x2,i := x21,i . The model is still
linear as it is linear in the coefficients (β’s). Polynomial terms are useful for
accounting for potential curvature/nonlinearity in the relationship between
predictors and the response. Also, a polynomial term such as x4,i = x1,i x3,i , is
also coined as the interaction term of x1 with x3 . Such terms are of particular
usefulness when an interaction exists between two predictors, i.e. when the
level/magnitude of one predictor has a relationship to the level/magnitude
of the other. For example, one may wish to fit a model with predictor terms,
although there are only 2 unique predictors:
   Yi = β0 + β1 x1,i + β2 x1,i² + β3 x2,i + β4 x1,i x2,i + β5 x1,i² x2,i + ǫi
In p dimensions, we no longer use the term regression line, but rather a
response/regression surface. For example, with p = 3, i.e. 2 predictors and a
response, the fitted model is a surface (a plane, or a curved surface if polynomial
or interaction terms are included) rather than a line.
The interpretation of the slope coefficients now requires an additional
statement. A 1-unit increase in predictor xk will cause the response, y, to
change by amount βk , assuming all other predictors are held constant. In
a model with interaction terms special care needs to be taken. Take for
example
E(Y |x1 , x2 ) = β0 + β1 x1 + β2 x2 + β3 x1 x2
where a 1-unit increase in x2 , i.e. x2 + 1, leads to
E(Y |x1 , x2 + 1) = E(Y |x1 , x2 ) + β2 + β3 x1
The effect of increasing x2 depends on the level of x1 .
6.2 Special Types of Variables
• Distinct numeric predictors. The traditional form for variables used
thus far.
• Polynomial terms. Used to allow for “curves” in the regression/response surface, as discussed earlier.
Example 6.1. In an experiment using flyash % as a factor in a concrete
compression strength test (PSI) for 28 day cured concrete, fitting a simple
linear regression yielded the following:

[Figure: scatterplot of strength versus flyash with the fitted line]
Figure 6.1: First order model
Clearly a linear model in the predictor is not adequate. Maybe a second
order polynomial model might be more adequate,
> flyash2=dat$flyash^2
> reg.2poly=lm(strength~flyash+flyash2,data=dat)
> summary(reg.2poly)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  4486.3611   174.7531  25.673 8.25e-14 ***
flyash         63.0052    12.3725   5.092 0.000132 ***
flyash2        -0.8765     0.1966  -4.458 0.000460 ***
---
Residual standard error: 312.1 on 15 degrees of freedom
Multiple R-squared: 0.6485,  Adjusted R-squared: 0.6016
F-statistic: 13.84 on 2 and 15 DF, p-value: 0.0003933
[Figure: scatterplot of strength versus flyash with the fitted quadratic curve]
Figure 6.2: Second order model
It still seems that there is some room for improvement, hence
> flyash3=dat$flyash^3
> reg.3poly=lm(strength~flyash+flyash2+flyash3,data=dat)
> summary(reg.3poly)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.618e+03  1.091e+02  42.338 3.53e-16 ***
flyash      -2.223e+01  1.812e+01  -1.227 0.240110
flyash2      3.078e+00  7.741e-01   3.976 0.001380 **
flyash3     -4.393e-02  8.498e-03  -5.170 0.000142 ***
---
Residual standard error: 189.4 on 14 degrees of freedom
Multiple R-squared: 0.8792,Adjusted R-squared: 0.8533
F-statistic: 33.95 on 3 and 14 DF, p-value: 1.118e-06
Before we continue, it is important to note that there are (mathematical) limitations to how many predictors can be added to a model.
As a guideline we usually have one predictor per 10 observations.
For example, a dataset with sample size 60 should have at most 6 predictors. The X matrix is n × p dimension so as p ↑ while n remains
constant, we run the risk of X not being full column rank. So in this
example we should only keep 2 predictors at most since we have 18 ≈ 20
observations. From the last output we see that the third and second
order polynomial terms are significant (flyash3 and flyash2) but flyash1
is not significant, given the other two are already incorporated in the
model.
> reg.3polym1=update(reg.3poly,.~.-flyash)
> summary(reg.3polym1)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.549e+03  9.504e+01  47.866  < 2e-16 ***
flyash2      2.166e+00  2.201e-01   9.840 6.18e-08 ***
flyash3     -3.445e-02  3.581e-03  -9.618 8.32e-08 ***

Residual standard error: 192.6 on 15 degrees of freedom
Multiple R-squared: 0.8662,  Adjusted R-squared: 0.8483
F-statistic: 48.54 on 2 and 15 DF, p-value: 2.814e-07

[Figure: strength versus flyash with the 1st, 2nd and 3rd order fitted curves]
Figure 6.3: Third order model
http://www.stat.ufl.edu/~ athienit/STA4210/Examples/poly.R
• Interaction terms. Used when the levels of one predictor influence another. We will see this in example 6.2.
• Transformed variables. Transformed response such as log(Y ) or Y −1
(as seen with Power transformations) to achieve linearity (or to satisfy
other assumptions).
• Categorical predictors. A categorical predictor is a variable with groups
or classification. The basic case with a variable with only two groups
will be illustrated by the following example:
Example 6.2. A study is conducted to determine the effects of company size and the presence or absence of a safety program on the number of hours lost due to work-related accidents. A total of 40 companies
are selected for the study. The variables are as follows:
   y  = lost work hours
   x1 = number of employees
   x2 = 1 if a safety program is used, 0 if no safety program is used

The proposed model,

   Yi = β0 + β1 x1,i + β2 x2,i + ǫi,

implies that

   Yi = (β0 + β2) + β1 x1,i + ǫi   if x2 = 1
   Yi = β0 + β1 x1,i + ǫi          if x2 = 0
When a safety program is used, i.e. x2 = 1, the intercept is β0 + β2 ,
but the slope (for x1 ) remains the same in both cases. A scatterplot of
the data and the associated regression line, differentiated by whether
x2 = 1 or 0, is presented.
[Figure: scatterplot of y versus x1 with parallel fitted lines for x2 = 0 and x2 = 1]
Although the overall fit of the model seems adequate, we see that the
regression line for x2 = 1 (red) does not fit the data well, a fact that
can also be seen by plotting the residuals in the assumption checking
procedure. The model is too restrictive by forcing parallel lines. Adding
an interaction term makes the model less restrictive.
Yi = β0 + β1 x1,i + β2 x2,i + β3 (x1 x2 )i + ǫi
which implies

   Yi = (β0 + β2) + (β1 + β3) x1,i + ǫi   if x2 = 1
   Yi = β0 + β1 x1,i + ǫi                 if x2 = 0

Now, the slope for x1 is allowed to differ for x2 = 1 and x2 = 0.
y = - 1.8 + 0.0197 x1 + 10.7 x2 - 0.0110 x1x2

Predictor       Coef   SE Coef      T      P
Constant       -1.84     10.13  -0.18  0.857
x1          0.019749  0.001546  12.78  0.000
x2             10.73     14.05   0.76  0.450
x1x2       -0.010957  0.002174  -5.04  0.000

S = 17.7488   R-Sq = 89.2%   R-Sq(adj) = 88.3%

Analysis of Variance
Source          DF      SS     MS      F      P
Regression       3   93470  31157  98.90  0.000
Residual Error  36   11341    315
Total           39  104811

Figure 6.4 also shows the better fit.
[Figure: scatterplot of y versus x1 with separate fitted lines for x2 = 0 and x2 = 1]
Figure 6.4: Scatterplot and fitted regression lines.
Remark 6.1. Since the interaction term x1 x2 is deemed significant, then
for model parsimony, all lower order terms of the interaction, i.e. x1
and x2 should be kept in the model, irrespective of their statistical
significance. If x1 x2 is significant then intuitively x1 and x2 are of
importance (maybe not in the statistical sense).
Now let's try to perform inference on the slope coefficient for x1.
From the previous equation we saw that the slope takes on two values
depending on the value of x2.

– For x2 = 0, it is just β1 and inference is straightforward...right?

– For x2 = 1, it is β1 + β3. We can estimate this with b1 + b3 but
  the variance is not known to us. From equation (3) we have that

     V(B1 + B3) = V(B1) + V(B3) + 2 Cov(B1, B3).

  The sample variances and covariances can be found from the
  V̂(B) = MSE (X^T X)⁻¹ covariance matrix, or obtained in R using
  the vcov function. Then, create a 100(1 − α)% CI for β1 + β3:

     b1 + b3 ∓ t_{1−α/2, n−p} √( s²_{b1} + s²_{b3} + 2 s_{b1,b3} )
Remark 6.2. This concept can easily be extended to linear combinations
of more that two coefficients.
http://www.stat.ufl.edu/~ athienit/IntroStat/safe_reg.R
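The script above is not reproduced here, but a minimal sketch of the computation, assuming a fitted object safety.reg (hypothetical name) for the model with the x1 x2 interaction and numeric predictors named x1 and x2:

```r
# safety.reg <- lm(y ~ x1 + x2 + x1:x2, data = safety)   # hypothetical fit
b <- coef(safety.reg); V <- vcov(safety.reg)             # V = MSE (X'X)^{-1}
est <- b["x1"] + b["x1:x2"]                              # slope of x1 when x2 = 1
se  <- sqrt(V["x1", "x1"] + V["x1:x2", "x1:x2"] + 2 * V["x1", "x1:x2"])
est + c(-1, 1) * qt(0.975, df.residual(safety.reg)) * se # 95% CI for beta1 + beta3
```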
In the previous example the qualitative predictor only had two levels,
the use or the lack of use of a safety program. To fully state all
levels only one dummy/indicator predictor was necessary. In general,
if a qualitative predictor has k levels, then k − 1 dummy/indicator
predictor variables are necessary. For example, a qualitative predictor
for a traffic light has three levels:

– red,
– yellow,
– green.

Therefore, only two binary predictors are necessary to fully model this
scenario:

   xred = 1 if red, 0 otherwise        xyellow = 1 if yellow, 0 otherwise

Breaking it down by case, the X matrix has the following form:

   Color    intercept   xred   xyellow
   Red          1         1       0
   Yellow       1         0       1
   Green        1         0       0
This restriction is usually expressed as β_base group = 0, where green is
the base group in this situation, and the model is

   Yi = β0 + β1 xred_i + β2 xyellow_i + ǫi

and hence the mean line, piecewise, is

   E(Y) = β0 + β1   if red
          β0 + β2   if yellow
          β0        if green

Notice that if we created xgreen the X matrix would no longer be full
column rank.
Remark 6.3. However, other restrictions do exist to make X full column
rank too.
– The sum-to-zero restriction

      β1 + β2 + β3 = 0  ⇒  β3 = −β1 − β2,

  i.e. the coefficients that correspond to the levels of the qualitative
  predictor (not all β's) sum to 0. So, green can be written as a
  linear combination of red and yellow. The model is
Yi = β0 + β1 xredi + β2 xyellowi + β3 xgreeni + ǫi
and hence the mean line, piecewise is
    E(Y) = β0 + β1           if red
           β0 + β2           if yellow
           β0 − β1 − β2      if green
for this case the X matrix has the form
Color    intercept   xred   xyellow
Red      1           1      0
Yellow   1           0      1
Green    1           -1     -1
– The model with no intercept/through the origin
Yi = β1 xredi + β2 xyellowi + β3 xgreeni + ǫi
and hence the mean line, piecewise is

    E(Y) = β1    if red
           β2    if yellow
           β3    if green
for this case the X matrix has the form
Color    xred   xyellow   xgreen
Red      1      0         0
Yellow   0      1         0
Green    0      0         1
So now we have seen three alternative ways, but we will be using the
base group approach as is done in R. The model through the origin has
issues as discussed in an earlier section and the sum to zero implies
that some parameters have to be expressed as linear combinations of
others.
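A brief sketch of how the three parameterizations can be requested in R (again with the hypothetical light factor and a hypothetical response y in a data frame dat; the first is R's default):

# Base-group (treatment) coding -- R's default:
lm(y ~ light, data = dat)
# Sum-to-zero coding:
lm(y ~ light, data = dat, contrasts = list(light = contr.sum))
# No intercept / cell-means coding (one coefficient per level):
lm(y ~ 0 + light, data = dat)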
Remark 6.4. The color variable has three categories; one may argue
that color (in some contexts) is an ordinal qualitative predictor and
therefore scores can be assigned, making it quantitative. In terms of
frequency (or wavelength) there is also an ordering:

Color    Frequency (THz)   Score
Red      400-484           442
Yellow   508-526           517
Green    526-606           566
Instead of creating 2 dummy/indicator variables we can create one
quantitative variable using the midpoint of the frequency band.
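A small R sketch of this scoring idea (hypothetical data; the scores are the band midpoints from the table above):

# Replace the factor by a numeric score (midpoint of the frequency band, in THz).
scores <- c(red = 442, yellow = 517, green = 566)
color  <- factor(c("red", "green", "yellow", "red"))   # hypothetical observations
x_freq <- scores[as.character(color)]                  # one quantitative predictor
x_freq
#   red  green yellow    red
#   442    566    517    442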
Example 6.3. Three different drugs are considered, drug A, B and C. Each
is administered at 4 dosage levels and the response is measured
        Product
Dose    A     B     C
0.2     2.0   1.8   1.3
0.4     4.3   4.1   2.0
0.8     6.5   4.9   2.8
1.6     8.9   5.7   3.4
Let d = dosage level and let

    pB = 1 if drug B, 0 otherwise        pC = 1 if drug C, 0 otherwise
The model (that includes the interaction term) is
Yi = β0 + β1 di + β2 pB + β3 pC + β4 (dpB )i + β5 (dpC )i + ǫi
and

    E(Y) = β0 + β1 di                    if drug A
           β0 + β2 + (β1 + β4) di        if drug B
           β0 + β3 + (β1 + β5) di        if drug C
[Figure: Response versus Dose for Products A, B and C.]
With a simple visual inspection we see that the model fit is not adequate.
A log transformation on dosage seems to help
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        7.3072     0.2103  34.748 3.79e-08 ***
logDose            3.3038     0.2186  15.111 5.30e-06 ***
ProductB          -2.1548     0.2974  -7.245 0.000351 ***
ProductC          -4.3486     0.2974 -14.622 6.42e-06 ***
logDose:ProductB  -1.5004     0.3092  -4.853 0.002844 **
logDose:ProductC  -2.2795     0.3092  -7.372 0.000319 ***
---

Residual standard error: 0.3389 on 6 degrees of freedom
Multiple R-squared: 0.9877, Adjusted R-squared: 0.9774
F-statistic: 96.3 on 5 and 6 DF,  p-value: 1.207e-05
[Figure: Response versus logDose for Products A, B and C.]
The coefficients for the interactions are both significant and negative, so
the slope for logDose is:
drug A: 3.3038
drug B: 3.3038 − 1.5004 = 1.8034
drug C: 3.3038 − 2.2795 = 1.0243
We can test whether the slope for B is different than that for A, by testing
β4 = 0, and for C versus A, by testing β5 = 0 (since A is the base group). A
question that may arise is if the slope for logDose is the same for drug B as
it is for drug C. That is H0 : β4 = β5 . We will see in the next chapter how
to actually perform this test. In the meantime we can create a 95% CI for
β4 − β5 .
> vmat=vcov(modelfull);round(vmat,3)
                 (Intercept) logDose ProductB ProductC logDose:ProductB logDose:ProductC
(Intercept)            0.044   0.027   -0.044   -0.044           -0.027           -0.027
logDose                0.027   0.048   -0.027   -0.027           -0.048           -0.048
ProductB              -0.044  -0.027    0.088    0.044            0.054            0.027
ProductC              -0.044  -0.027    0.044    0.088            0.027            0.054
logDose:ProductB      -0.027  -0.048    0.054    0.027            0.096            0.048
logDose:ProductC      -0.027  -0.048    0.027    0.054            0.048            0.096
> d=diff(coefficients(modelfull)[6:5]);names(d)=NULL;d
[1] 0.7791
> d+c(1,-1)*qt(0.025,6)*sqrt(vmat[5,5]+vmat[6,6]-2*vmat[5,6])
[1] 0.02247186 1.53563879
and note that 0 is not in the interval, and conclude that β4 > β5 , the slope
of logDose under drug B is larger than that for C.
Remark 6.5. Since we are in fact making multiple comparisons, A vs B, A vs
C and B vs C, we should probably adjust using Bonferroni’s or some other
multiple comparison adjustment.
There is however a simpler way. If we make drug C the base group,
instead of A, the (different) model would be
Yi = β0 + β1 di + β2 pA + β3 pB + β4 (dpA )i + β5 (dpB )i + ǫi
so the model under
drug A: Yi = β0 + β2 + (β1 + β4 )di + ǫi
drug B: Yi = β0 + β3 + (β1 + β5 )di + ǫi
drug C: Yi = β0 + β1 di + ǫi
so comparing the slope for logDose between drug B and C, simply involves
performing inference on β5 .
> ds_base_c=transform(ds,Product=relevel(Product,"C"))
> model_bc=lm(Response~logDose*Product,data=ds_base_c)
> summary(model_bc)
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        2.9586     0.2103  14.069 8.05e-06 ***
logDose            1.0243     0.2186   4.685 0.003378 **
ProductA           4.3486     0.2974  14.622 6.42e-06 ***
ProductB           2.1938     0.2974   7.377 0.000318 ***
logDose:ProductA   2.2795     0.3092   7.372 0.000319 ***
logDose:ProductB   0.7791     0.3092   2.520 0.045312 *
---

Residual standard error: 0.3389 on 6 degrees of freedom
Multiple R-squared: 0.9877, Adjusted R-squared: 0.9774
F-statistic: 96.3 on 5 and 6 DF,  p-value: 1.207e-05
which corresponds to the term “logDose:ProductB” in the output.
http://www.stat.ufl.edu/~athienit/STA4210/Examples/drug.R
6.3 Matrix Form
This section is merely an extension of section 5.7. The model is of the same form, just with different dimensions for some terms:

    Yn×1 = Xn×p βp×1 + ǫn×1
Estimates, fitted values, residuals, standard errors and sums of squares
are of the same form as in section 5.7. The differences/generalizations are:
• The degrees of freedom are

        df
  SSR   p − 1
  SSE   n − p
  SST   n − 1

  This is because we now have to estimate p parameters for our “mean”,
  that is, our response surface.
• The expected sums of squares are:
  – E(MSE) = σ2
  – E(MSR) = σ2 + [ Σk=1..p−1 βk2 SSkk + Σk′≠k βk βk′ SSkk′ ] / (p − 1),
    where SSkk′ = Σi=1..n (xik − x̄k)(xik′ − x̄k′).
It can be shown that,
E(MSR) ≥ E(MSE)
with equality holding only if β1 = · · · = βp−1 = 0. Therefore, to test
H0 : β1 = · · · = βp−1 = 0 vs Ha : not all β’s equal zero
we use the test statistic

    T.S. = MSR / MSE ∼ Fp−1,n−p under H0          (6.1)

and reject the null when p-value = P(Fp−1,n−p ≥ T.S.) < α.
• Intuitively, we note that SSR will always increase (or, equivalently, SSE
  decreases) as we include more predictors in the model. This is
  because the fitted values (from a more complicated model) will better
  fit the observed values of the response. However, any increase in SSR,
  no matter how minuscule, will cause R2 to increase. The question is:
  “Is the gain in SSR worth the added model complexity?”
  This has led to the introduction of the adjusted R2, defined as

      R2adj := R2 − (1 − R2) (p − 1)/(n − p) = 1 − MSE / [SST/(n − 1)]          (6.2)

  where the subtracted term (1 − R2)(p − 1)/(n − p) acts as a penalizing function.
  As p − 1 increases, R2 increases, but the second term, which is subtracted
  from R2, also increases. Hence, the second term can be thought of as a
  penalizing factor.
Example 6.4. A linear regression model of 50 observations with 3 predictors
may yield R2(1) = 0.677, and an addition of 2 “unimportant”
predictors yields a slight increase to R2(2) = 0.679. This increase does
not seem to be worth the added model complexity. Notice that

    R2(1)adj = 0.677 − (1 − 0.677)(3/46) = 0.6559
    R2(2)adj = 0.679 − (1 − 0.679)(5/44) = 0.6425

so R2adj has decreased from model (1) to model (2).
• Inferences on the individual β's follow from sections 2.1 and 5.7. The
  only difference is that the degrees of freedom for the t-distribution are n − p
  (instead of n − 2). For example, to test H0 : βk = βk0,

      T.S. = (bk − βk0) / sbk ∼ tn−p under H0          (6.3)
An individual test on βk tests the significance of predictor k, assuming
all other predictors j, j ≠ k, are included in the model. This
can lead to different conclusions depending on what other predictors
are included in the model. We shall explore this in more detail in the
next chapter.
Consider the following theoretical toy example. Someone wishes to
measure the area of a square (the response) using as predictors two
potential variables, the length and the height of the square. Due to
measurement error, replicate measurements are taken.
– A simple linear regression is fitted with length as the only predictor, x = length. For the test H0 : β1 = 0, do you think that we
would reject H0 , i.e. is length a significant predictor of area?
– Now assume that a multiple regression model is fitted with both
predictors, x1 = length and x2 = height. Now, for the test H0 :
β1 = 0, do you think that we would reject H0 , i.e. is length a
significant predictor of area given that height is already included
in the model?
This scenario is defined as confounding. In the toy example, “height” is
a confounding variable, i.e. an extraneous variable in a statistical model
that correlates with both the response variable and another predictor
variable.
• Confidence intervals on the mean response and prediction intervals are
  performed as in section 5.7, with the exception that
  – the degrees of freedom for the t-distribution are now n − p
  – xobs (or xnew) is

        xobs = (1, x1,obs , . . . , xp−1,obs )T
– The matrix X is an n × p matrix with columns being the predictors, i.e. X = [1 x1 · · · xp−1 ]
– and for g simultaneous intervals
∗ the Bonferroni critical value is t1−α/(2g),n−p , that is, the degrees
of freedom change
∗ the Working-Hotelling critical value is

      W = √( p F1−α,p,n−p )
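For example, a minimal R sketch of computing these two critical values (for hypothetical values of n, p, g and α):

n <- 45; p <- 3; g <- 4; alpha <- 0.05          # hypothetical values
t_bonf <- qt(1 - alpha/(2*g), df = n - p)       # Bonferroni critical value
W      <- sqrt(p * qf(1 - alpha, p, n - p))     # Working-Hotelling critical value
c(Bonferroni = t_bonf, WorkingHotelling = W)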
Example 6.5. In a biological experiment, researchers wanted to model the
biomass of an organism with respect to salinity (SAL), acidity (pH), potassium (K), sodium (Na) and zinc (Zn), with a sample size of 45. The full
model yielded the following results:
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 171.06949 1481.15956   0.115  0.90864
salinity     -9.11037   28.82709  -0.316  0.75366
pH          311.58775  105.41592   2.956  0.00527
K            -0.08950    0.41797  -0.214  0.83155
Na           -0.01336    0.01911  -0.699  0.48877
Zn           -4.47097   18.05892  -0.248  0.80576

Residual standard error: 477.8 on 39 degrees of freedom
Multiple R-squared: 0.4867, Adjusted R-squared: 0.4209
F-statistic: 7.395 on 5 and 39 DF,  p-value: 5.866e-05
Analysis of Variance Table

Response: biomass
          Df  Sum Sq Mean Sq F value    Pr(>F)
salinity   1  121832  121832  0.5338    0.4694
pH         1 7681463 7681463 33.6539 9.782e-07 ***
K          1  464316  464316  2.0343    0.1617
Na         1  157958  157958  0.6920    0.4105
Zn         1   13990   13990  0.0613    0.8058
Residuals 39 8901715  228249
Notice that the ANOVA table has broken down SSR with 5 df into 5 components. We will discuss the sequential sum of squares breakdown in the next
chapter. For now if we sum the SS for each of the predictors we will get
SSR= 8439559
Analysis of Variance

Source           DF   SS         MS         F       P
Regression        5   8439559    1687912    7.395   0.000
Residual Error   39   8901715    228249.1
Total            44   17341274
Assuming all the model assumptions are met, we first take a look at the
overall fit of the model.
H0 : β1 = · · · = β5 = 0 vs Ha : at least one of them ≠ 0
The test statistic value is T.S. = 7.395 with an associated p-value of approximately 0 (found using an F5,39 distribution). Hence, at least one predictor
appears to be significant. In addition, the coefficient of determination, R2 , is
48.67%, indicating that a large proportion of the variability in the response
can be accounted for by the regression model.
Looking at the individual tests, pH is significant given all the other predictors with a p-value of 0.00527, but salinity, K, Na and Zn have large p-values
(from the individual tests). Table 6.1 provides the pairwise correlations of
the quantitative predictor variables.
          biomass  salinity      pH       K      Na      Zn
biomass         .    -0.084   0.669  -0.150  -0.219  -0.503
salinity        .         .  -0.051  -0.021   0.162  -0.421
pH              .         .       .   0.019  -0.038  -0.722
K               .         .       .       .   0.792   0.074
Na              .         .       .       .       .   0.117
Zn              .         .       .       .       .       .

Table 6.1: Pearson correlation and associated p-value
Notice that pH and Zn are highly negatively correlated, so it seems reasonable to attempt to remove Zn as its p-value is 0.80576 (and pH’s p-value
is small). Also, there is a strong positive correlation between K and Na and
since both their p-values are large at 0.83155 and 0.48877 respectively, we
should attempt to remove K (but not both). Although we will see later how
to perform simultaneous inference it is more advisable to test one predictor
at a time. In effect we will perform backwards elimination. That is, start
with a complete model and see which predictors we can remove, one at a
time.
1. Remove K, which has the highest individual test p-value.

   Coefficients:
                Estimate Std. Error t value Pr(>|t|)
   (Intercept)  72.02975 1390.21648   0.052  0.95894
   salinity     -7.22888   27.12606  -0.266  0.79123
   pH          314.31346  103.38903   3.040  0.00416 **
   Na           -0.01667    0.01106  -1.507  0.13972
   Zn           -3.73299   17.51434  -0.213  0.83230

   Residual standard error: 472 on 40 degrees of freedom
   Multiple R-squared: 0.4861, Adjusted R-squared: 0.4347
   F-statistic: 9.458 on 4 and 40 DF, p-value: 1.771e-05
   We note that R2adj has actually gone up. That is, even though
   SSR is smaller for this model (than for the one with K also in it), the penalizing function now doesn't penalize as much. So, K was not necessary.
   Also note how the p-value for Na has dropped from 0.4105 to 0.13972. That is
   mainly due to the correlation between K and Na.
2. Remove Zn, which has the highest individual test p-value.

   Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
   (Intercept) -188.93696  650.73603  -0.290    0.773
   salinity      -3.18957   19.18052  -0.166    0.869
   pH           332.67478   56.49655   5.888 6.24e-07 ***
   Na            -0.01743    0.01036  -1.682    0.100

   Residual standard error: 466.5 on 41 degrees of freedom
   Multiple R-squared: 0.4855, Adjusted R-squared: 0.4478
   F-statistic: 12.9 on 3 and 41 DF, p-value: 4.5e-06

   R2adj is still increasing.
3. Remove salinity.

   Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
   (Intercept) -282.86356  319.38767  -0.886   0.3809
   pH           333.10556   55.78001   5.972 4.36e-07 ***
   Na            -0.01770    0.01011  -1.752   0.0871 .

   Residual standard error: 461.1 on 42 degrees of freedom
   Multiple R-squared: 0.4851, Adjusted R-squared: 0.4606
   F-statistic: 19.79 on 2 and 42 DF, p-value: 8.82e-07

   R2adj is still increasing.
4. Now the question is whether we should remove Na, as its p-value is
“small-ish”.
   Coefficients:
               Estimate Std. Error t value Pr(>|t|)
   (Intercept)  -593.63     271.89  -2.183   0.0345 *
   pH            336.79      57.07   5.902 5.08e-07 ***

   Residual standard error: 472 on 43 degrees of freedom
   Multiple R-squared: 0.4475, Adjusted R-squared: 0.4347
   F-statistic: 34.83 on 1 and 43 DF, p-value: 5.078e-07

   But now R2adj has decreased, so it is beneficial to keep Na (with respect
   to the R2adj criterion).
We can also create CI and/or PI using this model, and with the use of
software, we do not actually have to compute any of the matrices.
> newdata=data.frame(pH=4.15,Na=10000)
> predict(modu3, newdata, interval="prediction",level=0.95)
       fit       lwr      upr
1 922.4975 -29.45348 1874.448
http://www.stat.ufl.edu/~athienit/STA4210/Examples/linthurst.R
Chapter 7
Multiple Regression II
For a given dataset, the total sum of squares (SST) remains the same, no
matter what predictors are included (when no missing values exist among
variables) as the formula does not involve any x’s . As we include more
predictors, the regression sum of squares (SSR) does not decrease (think of
it as increasing), and the error sum of squares (SSE) does not increase.
7.1 Extra Sums of Squares
7.1.1 Definition and decompositions
• When a model contains just x1 , we denote: SSR(x1 ), SSE(x1 )
• Model Containing x1 , x2 : SSR(x1 , x2 ), SSE(x1 , x2 )
• Predictive contribution of x2 above that of x1 :
SSR(x2 |x1 ) = SSE(x1 ) − SSE(x1 , x2 ) = SSR(x1 , x2 ) − SSR(x1 )
This can be extended to any number of predictors. Let's take a look at some
formulas for models with 3 predictors
SST = SSR(x1 ) + SSE(x1 )
= SSR(x1 , x2 ) + SSE(x1 , x2 )
= SSR(x1 , x2 , x3 ) + SSE(x1 , x2 , x3 )
and
SSR(x1 |x2 ) = SSR(x1 , x2 ) − SSR(x2 )
= SSE(x2 ) − SSE(x1 , x2 )
SSR(x2 |x1 ) = SSR(x1 , x2 ) − SSR(x1 )
= SSE(x1 ) − SSE(x1 , x2 )
SSR(x3 |x2 , x1 ) = SSR(x1 , x2 , x3 ) − SSR(x1 , x2 )
= SSE(x1 , x2 ) − SSE(x1 , x2 , x3 )
SSR(x2 , x3 |x1 ) = SSR(x1 , x2 , x3 ) − SSR(x1 )
= SSE(x1 ) − SSE(x1 , x2 , x3 )
Similarly you can find other terms such as SSR(x2 |x1 , x3 ), SSR(x2 , x1 |x3 ) and
so forth. Using some of this notation we find that
SSR(x1 , x2 , x3 ) = SSR(x1 ) + SSR(x2 |x1 ) + SSR(x3 |x1 , x2 )
= SSR(x2 ) + SSR(x1 |x2 ) + SSR(x3 |x1 , x2 )
= SSR(x1 ) + SSR(x2 , x3 |x1 )
For multiple regression when we request the ANOVA table in R, we obtain a
table where SSR is decomposed by sequential sums of squares.
Source        SS                 df      MS
Regression    SSR(x1, x2, x3)    3       MSR(x1, x2, x3)
  x1          SSR(x1)            1       MSR(x1)
  x2|x1       SSR(x2|x1)         1       MSR(x2|x1)
  x3|x1,x2    SSR(x3|x1, x2)     1       MSR(x3|x1, x2)
Error         SSE(x1, x2, x3)    n − 4   MSE(x1, x2, x3)
Total         SST                n − 1
The sequential regression sums of squares differ depending on the order in which
the variables are entered.
Example 7.1. Let us take a look at example 7.1 from the textbook.
> dat=read.table("http://www.stat.ufl.edu/~rrandles/sta4210/Rclassnotes/
+ data/textdatasets/KutnerData/Chapter%20%207%20Data%20Sets/CH07TA01.txt",
+ col.names=c("X1","X2","X3","Y"))
> reg123=lm(Y~X1+X2+X3,data=dat)
> summary(reg123)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  117.085     99.782   1.173    0.258
X1             4.334      3.016   1.437    0.170
X2            -2.857      2.582  -1.106    0.285
X3            -2.186      1.595  -1.370    0.190

Residual standard error: 2.48 on 16 degrees of freedom
Multiple R-squared: 0.8014, Adjusted R-squared: 0.7641
F-statistic: 21.52 on 3 and 16 DF, p-value: 7.343e-06
From the F-test we see that at least one predictor is significant. However,
the individual tests indicate that the predictors are not significant. We will
investigate this later but this is because we are testing an individual predictor
given all the other predictors. It will be helpful to view the sequential sum
of squares
Listing 7.1: Order 123 model
> anova(reg123)
Analysis of Variance Table
Response: Y
          Df Sum Sq Mean Sq F value    Pr(>F)
X1         1 352.27  352.27 57.2768 1.131e-06 ***
X2         1  33.17   33.17  5.3931   0.03373 *
X3         1  11.55   11.55  1.8773   0.18956
Residuals 16  98.40    6.15
Note that
• SSR(x1 ) = 352.27, so x1 contributes a lot
• SSR(x2|x1) = 33.17, so x2 contributes some above and beyond what x1 contributes
• SSR(x3|x1, x2) = 11.55, so x3 does not seem to contribute much above and
  beyond x1 and x2
If we switch the order in which the variables are entered
Listing 7.2: Order 213 model
> reg213=lm(Y~X2+X1+X3,data=dat)
> anova(reg213)
Analysis of Variance Table
Response: Y
          Df Sum Sq Mean Sq F value    Pr(>F)
X2         1 381.97  381.97 62.1052 6.735e-07 ***
X1         1   3.47    3.47  0.5647    0.4633
X3         1  11.55   11.55  1.8773    0.1896
Residuals 16  98.40    6.15
We note that x2 seems to be significant on its own, but that x1 does not
contribute anything above and beyond x2 . Next we also try having x3 first.
Listing 7.3: Order 321 model
> reg321=lm(Y~X3+X2+X1,data=dat)
> anova(reg321)
Analysis of Variance Table
Response: Y
          Df Sum Sq Mean Sq F value    Pr(>F)
X3         1  10.05   10.05  1.6343    0.2193
X2         1 374.23  374.23 60.8471 7.684e-07 ***
X1         1  12.70   12.70  2.0657    0.1699
Residuals 16  98.40    6.15
We note that x3 even on its own does not appear to be significant. We shall
talk about the tests we see here in the next section.
http://www.stat.ufl.edu/~athienit/STA4210/Examples/bodyfat.R
7.1.2 Inference with extra sums of squares
Let p − 1 denote the total number of predictors in a model. Then, we can
simultaneously test for the significance of k(≤ p) predictors. For example,
let p − 1 = 3 and the full model is
Yi = β0 + β1 x1,i + β2 x2,i + β3 x3,i + ǫi
Now, assume we wish to test whether we can simultaneously remove the first
and the third predictors, i.e. x1 and x3. Consequently, we wish to test the
hypotheses
hypotheses
H0 : β1 = β3 = 0 (given x2) vs Ha : at least one of them ≠ 0
In effect we wish to compare the full model to the reduced model
Yi = β0 + β2 x2,i + ǫi
Remark 7.1. A full model does not necessarily imply a model with all the
predictors. It simply means a model that has more predictors than the
reduced model, i.e. a “fuller” model.
The SSE of the reduced model will be larger than the SSE of the full
model, as it only has two of the predictors of the full model and can never
fit the data better. The general test statistic is based on comparing the
difference in SSE of the reduced model to the full model.
    T.S. = [ (SSEred − SSEfull) / (dfEred − dfEfull) ] / [ SSEfull / dfEfull ]  ∼ Fν1,ν2 under H0          (7.1)

where

• ν1 = dfEred − dfEfull
• ν2 = dfEfull

and the p-value for this test is always the area to the right under the F-distribution,
i.e. P(Fν1,ν2 ≥ T.S.).
In our example we have that
• SSEred − SSEfull = SSE(x2) − SSE(x1, x2, x3) = SSR(x1, x3|x2)
• dfEred − dfEfull = (n − 2) − (n − 4) = 2

and hence equation (7.1) becomes

    T.S. = [SSR(x1, x3|x2)/2] / [SSE(x1, x2, x3)/(n − 4)] = MSR(x1, x3|x2) / MSE(x1, x2, x3) ∼ F2,n−4 under H0
Remark 7.2. Note that ν1 = dfEred − dfEfull always equals the number of
independent restrictions imposed on the β's by the null hypothesis in
a simultaneous test. In the previous example H0 : β1 = β3 = 0 meant 2
degrees of freedom but H0 : β1 = β3 is only 1 degree of freedom. We shall
see examples in the section “Other Linear Tests”.
Example 7.2. From example 7.1, assume we wish to test
H0 : β1 = β3 = 0 (given x2 )
We need to fit the reduced model and obtain the information necessary for
equation (7.1).
> reg2=update(reg123,.~.-X1-X3)
> anova(reg2,reg123)
Analysis of Variance Table

Model 1: Y ~ X2
Model 2: Y ~ X1 + X2 + X3
  Res.Df     RSS Df Sum of Sq     F Pr(>F)
1     18 113.424
2     16  98.405  2    15.019 1.221  0.321
With a large p-value we fail to reject the null hypothesis, and drop x1 and
x3 . Remember that we actually recommend not performing simultaneous tests but one variable at a time.
Special cases
• The output we saw in example 7.1, listing 7.1 (and the other listings), also provided us with some default F-tests
> anova(reg123)
Response: Y
          Df Sum Sq Mean Sq F value    Pr(>F)
X1         1 352.27  352.27 57.2768 1.131e-06 ***
X2         1  33.17   33.17  5.3931   0.03373 *
X3         1  11.55   11.55  1.8773   0.18956
Residuals 16  98.40    6.15
– The first T.S. = 57.2768 tests whether x1 is significant without any
  other predictors, with an F-test with 1 and 16 degrees of freedom:

      T.S. = SSR(x1) / MSE(x1, x2, x3) = MSR(x1) / MSE(x1, x2, x3)

– The second T.S. = 5.3931 tests whether x2 is significant above and
  beyond x1, with an F-test with 1 and 16 degrees of freedom:

      T.S. = SSR(x2|x1) / MSE(x1, x2, x3) = MSR(x2|x1) / MSE(x1, x2, x3)

– The third T.S. = 1.8773 tests whether x3 is significant above and
  beyond (x1, x2), with an F-test with 1 and 16 degrees of freedom:

      T.S. = SSR(x3|x1, x2) / MSE(x1, x2, x3) = MSR(x3|x1, x2) / MSE(x1, x2, x3)
• One coefficient. Assume we wish to test H0 : β3 = 0. We can either
  perform a t-test according to bullet 6.3 and equation (6.3),

      (b3 − 0) / sb3 ∼ tn−4 under H0

  Equivalently, we can still use equation (7.1) and note that
  – SSEred − SSEfull = SSE(x1, x2) − SSE(x1, x2, x3) = SSR(x3|x1, x2)
  – dfEred − dfEfull = 1
  yielding

      T.S. = [SSR(x3|x1, x2)/1] / [SSE(x1, x2, x3)/(n − 4)] = MSR(x3|x1, x2) / MSE(x1, x2, x3) ∼ F1,n−4 under H0

  with p-value = P(F1,n−4 ≥ T.S.).
Back to example 7.1 we have the t-tests and can see the equivalent
F-tests that have the same p-value.
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  117.085     99.782   1.173    0.258
X1             4.334      3.016   1.437    0.170
X2            -2.857      2.582  -1.106    0.285
X3            -2.186      1.595  -1.370    0.190
> library(car)
> SS2=Anova(reg123,type=2);SS2 #notice same p-values
Anova Table (Type II tests)

Response: Y
          Sum Sq Df F value Pr(>F)
X1        12.705  1  2.0657 0.1699
X2         7.529  1  1.2242 0.2849
X3        11.546  1  1.8773 0.1896
Residuals 98.405 16
• All coefficients (except intercept). Assume we wish to test
H0 : β1 = · · · = β3 = 0 vs Ha : not all β’s equal zero
We proceed in exactly the same way as bullet 6.3 and equation (6.1).
This is because the model under the null (reduced model) is
    Yi = β0 + ǫi  ⇔  Yi = µ + ǫi,

and thus SSEred = SST and dfEred = n − 1. Therefore,

    T.S. = [(SST − SSE)/((n − 1) − (n − 4))] / [SSE/(n − 4)] = (SSR/3) / [SSE/(n − 4)] = MSR(x1, x2, x3) / MSE(x1, x2, x3) ∼ F3,n−4 under H0
In example 7.1, we can see this F-test in the summary.
> summary(reg123)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  117.085     99.782   1.173    0.258
X1             4.334      3.016   1.437    0.170
X2            -2.857      2.582  -1.106    0.285
X3            -2.186      1.595  -1.370    0.190

Residual standard error: 2.48 on 16 degrees of freedom
Multiple R-squared: 0.8014, Adjusted R-squared: 0.7641
F-statistic: 21.52 on 3 and 16 DF, p-value: 7.343e-06
7.2 Other Linear Tests
There are circumstances where we do not necessarily wish to test whether
a coefficient equals 0, or whether a group of coefficients all equal zero. For
example, consider the (full) model
Yi = β0 + β1 x1,i + β2 x2,i + β3 x3,i + ǫi
and we wish to test
• H0 : β1 = β2 = β3 . Under this null the reduced model is
    Yi = β0 + β1 x1,i + β1 x2,i + β1 x3,i + ǫi = β0 + β1 (x1,i + x2,i + x3,i) + ǫi = β0 + β1 zi + ǫi,
  where zi = x1,i + x2,i + x3,i.
The resulting F-test from equation (7.1) would have an F2,n−4 distribution.
• H0 : β3 = β1 + β2 . Under this null the reduced model is
    Yi = β0 + β1 x1,i + β2 x2,i + (β1 + β2) x3,i + ǫi = β0 + β1 (x1,i + x3,i) + β2 (x2,i + x3,i) + ǫi = β0 + β1 z1,i + β2 z2,i + ǫi,
  where z1,i = x1,i + x3,i and z2,i = x2,i + x3,i.
The resulting F-test from equation (7.1) would have an F1,n−4 distribution.
• H0 : β0 = 10, β3 = 1. Under this null the reduced model is
    Yi = 10 + β1 x1,i + β2 x2,i + x3,i + ǫi   ⇒   Yi − 10 − x3,i = β1 x1,i + β2 x2,i + ǫi,
  i.e. Yi⋆ = β1 x1,i + β2 x2,i + ǫi with Yi⋆ = Yi − 10 − x3,i,
which is regression through the origin. The resulting F-test from equation (7.1) would have an F2,n−4 distribution.
Example 7.3. To re-examine example 7.1 so far: with the sequential sums
of squares we noted that x2 was significant above and beyond x1, with a p-value
of 0.03373, but with the individual t-tests (and equivalent F-tests) that
it was not significant above and beyond (x1, x3), with a p-value of 0.285. We
also concluded in the simultaneous test that H0 : β1 = β3 = 0 holds. That
means that we need either only x2 or only the combo (x1, x3). Let's test

    H0 : β2 = (β1 + β3)/2

Under this null the model is

    Yi = β0 + β1 x1,i + [(β1 + β3)/2] x2,i + β3 x3,i + ǫi
       = β0 + β1 (x1,i + 0.5 x2,i) + β3 (x3,i + 0.5 x2,i) + ǫi
       = β0 + β1 z1,i + β3 z2,i + ǫi,

where z1,i = x1,i + 0.5 x2,i and z2,i = x3,i + 0.5 x2,i.
> dat[,"Z1"]=dat[,"X1"]+1/2*dat[,"X2"]
> dat[,"Z2"]=dat[,"X3"]+1/2*dat[,"X2"]
> reg2eq13=lm(Y~Z1+Z2,data=dat)
> anova(reg2eq13,reg123)
Analysis of Variance Table
Model 1: Y ~ Z1 + Z2
Model 2: Y ~ X1 + X2 + X3
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     17 107.150
2     16  98.405  1     8.745 1.4219 0.2505
We fail to reject the null, but that is no surprise to us at this point. It seems
all we need is just x1 and x3, so let's try it.
> reg13=update(reg123,.~.-X2)
> summary(reg13)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.7916     4.4883   1.513   0.1486
X1            1.0006     0.1282   7.803 5.12e-07 ***
X3           -0.4314     0.1766  -2.443   0.0258 *
---

Residual standard error: 2.496 on 17 degrees of freedom
Multiple R-squared: 0.7862, Adjusted R-squared: 0.761
F-statistic: 31.25 on 2 and 17 DF, p-value: 2.022e-06
No further action seems necessary at the moment, as each variable appears significant
given the other.
Remark 7.3. In R, if the null hypothesis requires a transformation of the
response, such as in the last bullet using Yi⋆, you will have to perform the
F-test manually, because the anova function will give you a warning that you
are using two different datasets, since the response variable in the two models is
technically different.
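A minimal sketch of such a manual F-test for the last bullet above, H0 : β0 = 10, β3 = 1 (assuming a hypothetical data frame dat with columns y, x1, x2, x3):

# Full model
full <- lm(y ~ x1 + x2 + x3, data = dat)
# Reduced model implied by H0: beta0 = 10, beta3 = 1, fitted to the
# transformed response y* = y - 10 - x3 (regression through the origin).
red <- lm(I(y - 10 - x3) ~ 0 + x1 + x2, data = dat)

sse_f <- sum(resid(full)^2);  df_f <- df.residual(full)
sse_r <- sum(resid(red)^2);   df_r <- df.residual(red)
Fstat <- ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)   # equation (7.1)
pval  <- pf(Fstat, df_r - df_f, df_f, lower.tail = FALSE)
c(F = Fstat, p.value = pval)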
7.3 Coefficient of Partial Determination
The coefficient of partial determination, similar to the coefficient of determination R2, is the proportion of variation in the response explained by a set
of predictors above and beyond another set of predictors.
Consider a model with 3 predictors, i.e. p − 1 = 3. The proportion of
variation in the response that is explained by x1, given that x2 and x3 are
already in the model, is
    R2y,x1|x2,x3 = [SSE(x2, x3) − SSE(x1, x2, x3)] / SSE(x2, x3)
                = [SSR(x1, x2, x3) − SSR(x2, x3)] / SSE(x2, x3)
                = SSR(x1|x2, x3) / SSE(x2, x3)
The coefficient of partial correlation is then defined as

    ry,x1|x2,x3 = sgn(b1) √( R2y,x1|x2,x3 )

Similarly for R2y,x2|x1,x3 and R2y,x3|x1,x2. We can also express the proportion of
variation in the response that is explained by x2 and x3, given x1, as
    R2y,x2,x3|x1 = [SSE(x1) − SSE(x1, x2, x3)] / SSE(x1)
                = [SSR(x1, x2, x3) − SSR(x1)] / SSE(x1)
                = SSR(x2, x3|x1) / SSE(x1)

Similarly for R2y,x1,x3|x2 and R2y,x1,x2|x3.
Example 7.4. Sticking with example 7.1, we find that R2y,x2|x1,x3 is
> ### Coefficient of partial determination R^2_{Y x2|x1 x3}=SSR(x2|x1 x3)/SSE(x1 x3)
> SST=(dim(dat)[1]-1)*var(dat$Y)
> SS2["X2","Sum Sq"]/anova(lm(Y~X1+X3,data=dat))["Residuals","Sum Sq"]
[1] 0.07107507
This implies that x2 has a tiny effect in reducing the variance in the response
above and beyond (x1 , x3 ). This agrees with the t-test for H0 : β2 = 0 given
(x1 , x3 ) that we saw earlier.
Also note the value of R2y,x1,x3|x2:
> ### Coefficient of partial determination R^2_{Y x1 x3|x2} = SSR(x1 x3|x2)/SSE(x2)
> SSEx2=anova(lm(Y~X2,data=dat))["Residuals","Sum Sq"]
> (SSEx2-anova(reg123)["Residuals","Sum Sq"])/SSEx2
[1] 0.1324132
indicating that (x1 , x3 ) have something to contribute above and beyond x2 .
This all seems to agree with our tests leading us to the final model with just
x1 and x3 .
7.4 Standardized Regression Model
Standardized regression simply means that all variables are standardized
which helps in
• removing round-off errors in computing (X T X)−1
• making for an easier comparison of the magnitude of the effects of predictors
  measured on different measurement scales. A coefficient βk⋆ from this
  model can be interpreted as: a 1 standard deviation increase in
  predictor k causes a change of βk⋆ standard deviations in the
  response (holding all others constant).
• (to be discussed later) reducing the standard error of coefficients due
to multicollinearity
The transformation used is known as the correlation transformation
    yi⋆ = (1/√(n − 1)) (yi − ȳ)/sy,        x⋆k,i = (1/√(n − 1)) (xk,i − x̄k)/sxk,   k = 1, . . . , p − 1
The model is
    Yi⋆ = β1⋆ x⋆1,i + · · · + β⋆p−1 x⋆p−1,i + ǫ⋆i
We can always revert back to the unstandardized coefficients
• βk = (sy / sxk) βk⋆,   k = 1, . . . , p − 1
• β0 = ȳ − β1 x̄1 − · · · − βp−1 x̄p−1
Under this model,

    y⋆ = (y1⋆, . . . , yn⋆)T,        X⋆ = [ x⋆1 · · · x⋆p−1 ]
which results in

    X⋆T X⋆ = [ 1         r1,2      · · ·   r1,p−1
               r2,1      1         · · ·   r2,p−1
               ...       ...       ...     ...
               rp−1,1    rp−1,2    · · ·   rp−1,p−1 ]  =: rxx,

    X⋆T y⋆ = ( ry,1 , . . . , ry,p−1 )T =: ryx
because

• Σi=1..n (x⋆k,i)2 = · · · = 1
• Σi=1..n (x⋆k,i)(x⋆k′,i) = · · · = rxk,xk′
• Σi=1..n (yi⋆)(x⋆k,i) = · · · = ry,xk
Therefore,

    X⋆T X⋆ b⋆ = X⋆T y⋆  ⇒  b⋆ = (X⋆T X⋆)−1 X⋆T y⋆  ⇒  b⋆ = rxx−1 ryx
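A minimal sketch verifying this identity, using the bodyfat data frame dat from example 7.1 (columns X1, X2, X3, Y):

rxx <- cor(dat[, c("X1", "X2", "X3")])            # correlations among the predictors
ryx <- cor(dat[, c("X1", "X2", "X3")], dat$Y)     # correlations with the response
bstar <- solve(rxx, ryx)                          # b* = rxx^{-1} ryx
# Revert to the unstandardized slopes: b_k = (s_y / s_xk) * b*_k
b <- (sd(dat$Y) / apply(dat[, c("X1", "X2", "X3")], 2, sd)) * as.vector(bstar)
b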
Example 7.5. So far we have concluded that for the bodyfat dataset in example 7.1 we only need x1 and x3 in the model. However, it seems that
these two variables are still somewhat correlated, with a sample correlation
of rx1,x3 = 0.46.
> round(cor(dat[,1:4]),2)
     X1   X2   X3    Y
X1 1.00 0.92 0.46 0.84
X2 0.92 1.00 0.08 0.88
X3 0.46 0.08 1.00 0.14
Y  0.84 0.88 0.14 1.00
We have mentioned that correlated variables may increase the standard errors of our coefficients, making it more necessary to implement standardized
regression.
A useful tool is the Variance Inflation Factor (VIF). The square root of
the variance inflation factor tells you how much larger the standard error
is, compared with what it would be if that variable were uncorrelated with
the other predictor variables in the model. If the variance inflation factor of
If the variance inflation factor of a predictor variable were 5.27 (√5.27 = 2.3), this means that the standard
error for the coefficient of that predictor variable is 2.3 times as large as it
would be if that predictor variable were uncorrelated with the other predictor
variables.
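A minimal sketch of what the VIF measures: a predictor's VIF is 1/(1 − Rk2), where Rk2 comes from regressing that predictor on all the other predictors. Using the bodyfat predictors X1, X2, X3 from example 7.1:

# VIF of X1: 1/(1 - R^2) from regressing X1 on the other predictors.
r2_x1  <- summary(lm(X1 ~ X2 + X3, data = dat))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
sqrt(vif_x1)   # factor by which the SE of b1 is inflated; compare with sqrt(vif(reg123)) later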
Example 7.6. Continuing with our example, we see that the inflation is actually not much:
> library(car)
> sqrt(vif(reg13))
      X1       X3
1.124775 1.124775
Performing standardized regression yields
> cor.trans=function(y){
+ n=length(y)
+ 1/sqrt(n-1)*(y-mean(y))/sd(y)
+ }
> dat_trans=as.data.frame(apply(dat[,1:4],2,cor.trans))
> reg13_trans=lm(Y~0+X1+X3,data=dat_trans)
> summary(reg13_trans)
Coefficients:
   Estimate Std. Error t value Pr(>|t|)
X1   0.9843     0.1226   8.029 2.33e-07 ***
X3  -0.3082     0.1226  -2.514   0.0217 *
compared to the standard errors of 0.1282 and 0.1766 respectively.
7.5 Multicollinearity
Consider the following theoretical toy example. Someone wishes to measure
the area of a square (the response) using as predictors two potential variables,
the length and the height of the square. Due to measurement error, replicate
measurements are taken.
• A simple linear regression is fitted with length as the only predictor,
x = length. For the test H0 : β1 = 0, do you think that we would reject
H0 , i.e. is length a significant predictor of area?
• Now assume that a multiple regression model is fitted with both predictors, x1 = length and x2 = height. Now, for the test H0 : β1 = 0, do
you think that we would reject H0 , i.e. is length a significant predictor
of area given that height is already included in the model?
This scenario is defined as confounding/collinearity. In the toy example,
“height” is a confounding variable, i.e. an extraneous variable in a statistical
model that correlates with both the response variable and another predictor
variable.
Example 7.7. In an experiment of 22 observations, a response y and two
predictors x1 and x2 were observed. Two simple linear regression models
were fitted:
(1)  y = 6.33 + 1.29 x1

Predictor   Coef     SE Coef    T      P
Constant    6.335    2.174      2.91   0.009
x1          1.2915   0.1392     9.28   0.000

S = 2.95954   R-Sq = 81.1%   R-Sq(adj) = 80.2%
(2)  y = 54.0 - 0.919 x2

Predictor   Coef      SE Coef   T       P
Constant    53.964    8.774     6.15    0.000
x2          -0.9192   0.2821    -3.26   0.004

S = 5.50892   R-Sq = 34.7%   R-Sq(adj) = 31.4%
Each predictor in their respective model is significant due to the small pvalues for their corresponding coefficients. The simple linear regression model
(1) is able to explain more of the variability in the response than model (2)
with R2 = 81.1%. Logically one would then assume that a multiple regression
model with both predictors would be the best model. The output of this
model is given below:
(3)  y = 12.8 + 1.20 x1 - 0.168 x2

Predictor   Coef      SE Coef   T       P
Constant    12.844    7.514     1.71    0.104
x1          1.2029    0.1707    7.05    0.000
x2          -0.1682   0.1858    -0.91   0.377

S = 2.97297   R-Sq = 81.9%   R-Sq(adj) = 80.0%
We notice that the individual test for β1 still classifies x1 as significant
given x2, but x2 is no longer significant given x1. Also, we notice that the
coefficient of determination, R2, has increased only by 0.8%, and in fact R2adj
has decreased from 80.2% in (1) to 80.0% in (3). This is because x1 is acting
as a confounding variable on x2. The relationship of x2 with the response
y is mainly accounted for by the relationship of x1 on y. The correlation
coefficient is

    rx1,x2 = −0.573

which indicates a moderate negative relationship.
However, since x1 is a better predictor, the multiple regression model is
still able to determine that x1 is significant given x2, but not vice versa.
When two variables are highly correlated, the estimates of their regression coefficients become unstable and their standard errors become larger
(leading to smaller test statistics and wider C.I.'s). We can see this using
the VIF.
Example 7.8. We have already seen another example, example 7.1. Recall that x1 and
x2 are highly correlated.
> round(cor(dat[,1:4]),2)
     X1   X2   X3    Y
X1 1.00 0.92 0.46 0.84
X2 0.92 1.00 0.08 0.88
X3 0.46 0.08 1.00 0.14
Y  0.84 0.88 0.14 1.00
In listing 7.2 we noticed that x1 is not significant given x2, with a p-value of
0.4633, due to the fact that SSR(x1|x2) = 3.47; but in listing 7.1, testing x2
given x1 yielded a p-value of 0.03373, due to SSR(x2|x1) = 33.17, indicating it
was somewhat significant.
Using VIF we see that the standard errors are greatly inflated for the model
with all three predictors:
> sqrt(vif(reg123))
      X1       X2       X3
26.62410 23.75591 10.22771
http://www.stat.ufl.edu/~athienit/STA4210/Examples/bodyfat.R
Chapter 9
Model Selection and Validation
Note that Chapter 8 was merged back with Chapter 6.
9.1 Data Collection Strategies
• Controlled Experiments: Subjects (Experimental Units) assigned to
X-levels by experimenter
– Purely Controlled Experiments: Researcher only uses predictors
that were assigned to units
– Controlled Experiments with Covariates: Researcher has information (additional predictors) associated with units
• Observational Studies: Subjects (Units) have X-levels associated with
them (not assigned by researcher)
– Confirmatory Studies: New (primary) predictor(s) believed to be
associated with Y , controlling for (control) predictor(s), known to
be associated with Y
– Exploratory Studies: Set of potential predictors believed that
some or all are associated with Y
9.2 Reduction of Explanatory Variables
• Controlled Experiments
– Purely Controlled Experiments: Rarely any need or desire to reduce number of explanatory variables
– Controlled Experiments with Covariates: Remove any covariates
that do not reduce the error variance
• Observational Studies
– Confirmatory Studies: Must keep in all control variables to compare with previous research, should keep all primary variables as
well
– Exploratory Studies: Often have many potential predictors (and
polynomials and interactions). Want to fit parsimonious model
that explains much of the variation in Y , while keeping model as
basic as possible. Caution: do not make decisions based on single
variable t-tests, make use of Complete/Reduced models for testing
multiple predictors
9.3 Model Selection Criteria
With p − 1 predictors there are 2^(p−1) potential models (each variable can be
in or out of the model), not including interaction terms etc.
• So far we have seen the adjusted R2 as in equation (6.2) where the goal
is to maximize the value
• Mallow's Cp criterion, where the goal is to find the smallest p so that
  Cp ≤ p:

      Cp = SSEp / MSE(X1, . . . , Xp−1) − (n − 2p)

  Note that in the first term the numerator is model specific, while the
  denominator is always the same (that of the full model).
• Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), where the goal is to choose the model with the minimum
value
AIC = n log(SSE/n) + 2p,
BIC = n log(SSE/n) + p log(n)
• PRESS criterion, where once again we aim to minimize the value

      PRESS = Σi=1..n (yi − ŷi(i))2

  where ŷi(i) is the fitted value for the ith case when it was not used in
  fitting the model (leave-one-out). From this we have the
  – Ordinary Cross Validation (OCV)

        OCV = (1/n) Σi=1..n (yi − ŷi(i))2 = (1/n) Σi=1..n [ (yi − ŷi) / (1 − hii) ]2

    due to the Leaving-One-Out Lemma, where hii is the ith diagonal
    element of H = X(X T X)−1 X T.
  – Generalized Cross Validation (GCV), where hii is replaced by the
    average of the diagonal elements of H, leading to a weighted version

        GCV = [ (1/n) Σi=1..n (yi − ŷi)2 ] / (1 − trace(H)/n)2
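A minimal sketch of these leave-one-out quantities for any fitted lm object (using a hypothetical fitted model fit):

# PRESS, OCV and GCV from a fitted lm object, using the hat (leverage) values.
press_ocv_gcv <- function(fit) {
  e <- resid(fit)
  h <- hatvalues(fit)                # diagonal of H = X (X'X)^{-1} X'
  n <- length(e)
  press <- sum((e / (1 - h))^2)      # squared leave-one-out prediction errors
  ocv   <- press / n
  gcv   <- mean(e^2) / (1 - mean(h))^2
  c(PRESS = press, OCV = ocv, GCV = gcv)
}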
Example 9.1. A cruise ship company wishes to model the crew size needed for a ship using predictors such as: age, tonnage, passengers, length,
cabins and passenger density (passdens). Without concerning ourselves with
potential interactions we will look at simple additive models.
> cruise <- read.fwf("http://www.stat.ufl.edu/~winner/data/cruise_ship.dat",
+ width=c(20,20,rep(8,7)), col.names=c("ship", "cline", "age", "tonnage",
+ "passengers", "length", "cabins", "passdens", "crew"))
> fit0=lm(crew~age+tonnage+passengers+length+cabins+passdens,data=cruise)
> summary(fit0)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.5213400  1.0570350  -0.493  0.62258
age         -0.0125449  0.0141975  -0.884  0.37832
tonnage      0.0132410  0.0118928   1.113  0.26732
passengers  -0.1497640  0.0475886  -3.147  0.00199 **
length       0.4034785  0.1144548   3.525  0.00056 ***
cabins       0.8016337  0.0892227   8.985 9.84e-16 ***
passdens    -0.0006577  0.0158098  -0.042  0.96687
---
Residual standard error: 0.9819 on 151 degrees of freedom
Multiple R-squared: 0.9245, Adjusted R-squared: 0.9215
F-statistic: 308 on 6 and 151 DF, p-value: < 2.2e-16

> AIC(fit0)
[1] 451.4394
We will consider this to be the full model at the moment and will implement
some of the model selection criteria using the regsubsets function.
> library(leaps)
> allcruise <- regsubsets(crew~age+tonnage+passengers+length+cabins+passdens,
+                         nbest=4, data=cruise)
> aprout <- summary(allcruise)
> with(aprout,round(cbind(which,rsq,adjr2,cp,bic),3))
## Prints "readable" results
  (Intercept) age tonnage passengers length cabins passdens   rsq adjr2      cp      bic
1           1   0       0          0      0      1        0 0.904 0.903  37.772 -360.238
1           1   0       1          0      0      0        0 0.860 0.859 125.086 -300.954
1           1   0       0          1      0      0        0 0.838 0.837 170.523 -277.122
1           1   0       0          0      1      0        0 0.803 0.801 240.675 -246.201
2           1   0       0          0      1      1        0 0.916 0.915  15.952 -376.131
2           1   0       0          0      0      1        1 0.912 0.911  24.261 -368.502
2           1   0       1          0      0      1        0 0.911 0.909  26.792 -366.249
2           1   0       0          1      0      1        0 0.908 0.907  32.443 -361.332
3           1   0       0          1      1      1        0 0.922 0.921   5.857 -382.878
3           1   0       0          0      1      1        1 0.919 0.918  11.341 -377.413
3           1   0       1          1      0      1        0 0.918 0.916  14.023 -374.808
3           1   1       0          0      1      1        0 0.917 0.915  15.909 -373.002
4           1   0       1          1      1      1        0 0.924 0.922   3.847 -381.933
4           1   1       0          1      1      1        0 0.923 0.921   5.084 -380.652
4           1   0       0          1      1      1        1 0.923 0.921   5.197 -380.534
4           1   0       1          0      1      1        1 0.919 0.917  13.056 -372.631
5           1   1       1          1      1      1        0 0.924 0.922   5.002 -377.752
5           1   0       1          1      1      1        1 0.924 0.922   5.781 -376.939
5           1   1       0          1      1      1        1 0.924 0.921   6.240 -376.462
5           1   1       1          0      1      1        1 0.920 0.917  14.904 -367.717
6           1   1       1          1      1      1        1 0.924 0.921   7.000 -372.692
A good model choice might be the model (the 13th row) with 4 predictors: tonnage,
passengers, length, and cabins, whose R2adj = 0.922, Cp = 3.847, and
BIC = −381.933. Also, we note that this model's AIC is lower than that of
the full model.
> fit3=update(fit0,.~.-age-passdens)
> AIC(fit3)
[1] 448.3229
We can also calculate the PRESS, OCV and GCV statistics that we would
compare to other potential models (but we haven’t here).
> library(qpcR)
> PRESS(fit3)$stat
[1] 154.8479
> library(dbstats)
> dblm(formula(fit3),data=cruise)$ocv
[1] 0.9673963
> dblm(formula(fit3),data=cruise)$gcv
[1] 0.9752566
http://www.stat.ufl.edu/~athienit/STA4210/Examples/selection_validation.R
9.4 Regression Model Building
As discussed, it is possible to have a large set of predictor variables (including
interactions). The goal is to fit a "parsimonious" model that explains as
much variation in the response as possible with a relatively small set of
predictors.
There are 3 automated procedures
• Backward Elimination (Top down approach)
• Forward Selection (Bottom up approach)
• Stepwise Regression (Combines Forward/Backward)
We will explore these procedures using two different elimination/selection
criteria: one that uses t-tests and p-values, and another that uses the AIC
value.
9.4.1 Backward elimination
1. Select a significance level to stay in the model (e.g. αs = 0.20, generally
.05 is too low, causing too many variables to be removed).
2. Fit the full model with all possible predictors.
3. Consider the predictor with lowest t-statistic (highest p-value).
• If p-value > αs , remove the predictor and fit model without this
variable (must re-fit model here because partial regression coefficients change).
• If p-value ≤ αs , stop and keep current model.
4. Continue until all predictors have p-values ≤ αs .
9.4.2 Forward selection
1. Choose a significance level to enter the model (e.g. αe = 0.20, generally
.05 is too low, causing too few variables to be entered).
2. Fit all simple regression models.
3. Consider the predictor with the highest t-statistic (lowest p-value).
• If p-value ≤ αe , keep this variable and fit all two variable models
that include this predictor.
• If p-value > αe , stop and keep previous model.
4. Continue until no new predictors have p-values ≤ αe
9.4.3 Stepwise regression
1. Select αs and αe , (αe < αs ).
2. Start like Forward Selection (bottom up process) where new variables
must have p-value ≤ αe to enter.
3. Re-test all “old variables” that have already been entered, must have
p-value ≤ αs to stay in model.
4. Continue until no new variables can be entered and no old variables
need to be removed.
Remark 9.1. Although we created a function in R that follows the steps of
backward, forward and stepwise, there is also an already developed function
stepAIC that can perform all three procedures by adding/removing variables
depending on whether the AIC is reduced.
Example 9.2. Continuing from example 9.1, we perform backward elimination with αs = 0.20.
> source("http://www.stat.ufl.edu/~athienit/stepT.R")
> stepT(fit0,alpha.rem=0.2,direction="backward")
crew ~ age + tonnage + passengers + length + cabins + passdens
----------------------------------------------
Step 1 -> Removing:- passdens
            Estimate Pr(>|t|)
(Intercept)   -0.556    0.394
age           -0.012    0.358
tonnage        0.013    0.150
passengers    -0.149    0.000
length         0.404    0.001
cabins         0.802    0.000

crew ~ age + tonnage + passengers + length + cabins
----------------------------------------------
Step 2 -> Removing:- age
            Estimate Pr(>|t|)
(Intercept)   -0.819    0.164
tonnage        0.016    0.046
passengers    -0.150    0.000
length         0.398    0.001
cabins         0.791    0.000
Final model:
crew ~ tonnage + passengers + length + cabins
We can also perform forward selection and stepwise regression by running
stepT(fit0,alpha.enter=0.2,direction="forward")
stepT(fit0,alpha.rem=0.2,alpha.enter=0.15,direction="both")
We can also use the built in function stepAIC
> library(MASS)
> fit1 <- lm(crew ~ age + tonnage + passengers + length + cabins + passdens)
> fit2 <- lm(crew ~ 1)
> stepAIC(fit1,direction="backward")
Start: AIC=1.05
crew ~ age + tonnage + passengers + length + cabins + passdens

             Df Sum of Sq    RSS    AIC
- passdens    1     0.002 145.57 -0.943
- age         1     0.753 146.32 -0.130
- tonnage     1     1.195 146.77  0.347
<none>                    145.57  1.055
- passengers  1     9.548 155.12  9.092
- length      1    11.980 157.55 11.551
- cabins      1    77.821 223.39 66.721

Step: AIC=-0.94
crew ~ age + tonnage + passengers + length + cabins

             Df Sum of Sq    RSS    AIC
- age         1     0.815 146.39 -2.062
<none>                    145.57 -0.943
- tonnage     1     2.007 147.58 -0.780
- length      1    12.069 157.64  9.641
- passengers  1    14.027 159.60 11.591
- cabins      1    79.556 225.13 65.944

Step: AIC=-2.06
crew ~ tonnage + passengers + length + cabins

             Df Sum of Sq    RSS    AIC
<none>                    146.39 -2.062
- tonnage     1     3.866 150.25  0.056
- length      1    11.739 158.13  8.126
- passengers  1    14.275 160.66 10.640
- cabins      1    78.861 225.25 64.028
Call:
lm(formula = crew ~ tonnage + passengers + length + cabins)
and can also perform forward and stepwise regression by running
stepAIC(fit2,direction="forward",scope=list(upper=fit1,lower=fit2))
stepAIC(fit2,direction="both",scope=list(upper=fit1,lower=fit2))
http://www.stat.ufl.edu/~athienit/STA4210/Examples/selection_validation.R
9.5 Model Validation
When we have a lot of data, we would like to see how well a model fit on
one set of data (training sample) compares to one fit on a new set of data
(validation sample), and how the training model fits the new data.
• We want the data sets to be similar with respect to the levels of the
predictors (so that the validation sample is not an extrapolation of the
training sample). Should calculate some summary statistics such as
means, standard deviations, etc.
• The training set should have at least 6-10 times as many observations
  as potential predictors.
• Models should give “similar” model fits based on SSE, PRESS, Mallow’s Cp , MSE and regression coefficients. Should obtain multiple models using multiple “adequate” training samples.
The Mean Square Prediction Error (MSPE) when the training model is applied
to the validation sample is

    MSPE = Σi=1..nV (yiV − ŷiT)2 / nV

where nV is the validation sample size, yiV represents a data point from the
validation sample and ŷiT represents a fitted value using the predictor settings
corresponding to yiV but the coefficients from the training sample, i.e.

    ŷiT = bT0 + bT1 xV1,i + · · · + bTp−1 xVp−1,i
If the MSPE is fairly close to the MSET of the regression model that was fitted
to the training data set, then it indicates that the selected regression model
is not seriously biased and gives an appropriate indication of the predictive
ability of the model. At this point you should now go ahead and fit the model
on the full data set. It is only a problem when MSPE ≫ MSET.
Example 9.3. Continuing from example 9.1, we perform cross-validation
with a hold-out sample. Randomly sample 100 ships, fit model, obtain predictions for the remaining 58 ships by applying their predictor levels to the
regression coefficients from the fitted model.
> cruise.cv.samp <- sample(1:length(cruise$crew),100,replace=FALSE)
> cruise.cv.in <- cruise[cruise.cv.samp,]
> cruise.cv.out <- cruise[-cruise.cv.samp,]
> ### Check if training sample (and validation) is similar to the whole dataset
> summary(cruise[,4:7])
    tonnage          passengers        length           cabins
 Min.   :  2.329   Min.   : 0.66   Min.   : 2.790   Min.   : 0.330
 1st Qu.: 46.013   1st Qu.:12.54   1st Qu.: 7.100   1st Qu.: 6.133
 Median : 71.899   Median :19.50   Median : 8.555   Median : 9.570
 Mean   : 71.285   Mean   :18.46   Mean   : 8.131   Mean   : 8.830
 3rd Qu.: 90.772   3rd Qu.:24.84   3rd Qu.: 9.510   3rd Qu.:10.885
 Max.   :220.000   Max.   :54.00   Max.   :11.820   Max.   :27.000
> summary(cruise.cv.in[,4:7])
    tonnage          passengers        length           cabins
 Min.   :  3.341   Min.   : 0.66   Min.   : 2.790   Min.   : 0.330
 1st Qu.: 46.947   1st Qu.:12.65   1st Qu.: 7.168   1st Qu.: 6.327
 Median : 73.941   Median :19.87   Median : 8.610   Median : 9.750
 Mean   : 73.581   Mean   :19.24   Mean   : 8.219   Mean   : 9.177
 3rd Qu.: 91.157   3rd Qu.:26.00   3rd Qu.: 9.605   3rd Qu.:11.473
 Max.   :220.000   Max.   :54.00   Max.   :11.820   Max.   :27.000
> summary(cruise.cv.out[,4:7])
    tonnage          passengers        length           cabins
 Min.   :  2.329   Min.   : 0.94   Min.   : 2.960   Min.   : 0.450
 1st Qu.: 40.013   1st Qu.:10.62   1st Qu.: 6.370   1st Qu.: 5.335
 Median : 70.367   Median :18.09   Median : 8.260   Median : 8.745
 Mean   : 67.325   Mean   :17.11   Mean   : 7.978   Mean   : 8.232
 3rd Qu.: 87.875   3rd Qu.:21.39   3rd Qu.: 9.510   3rd Qu.:10.430
 Max.   :160.000   Max.   :37.82   Max.   :11.320   Max.   :18.170
> fit.cv.in <- lm(crew ~tonnage + passengers + length + cabins,
+ data=cruise.cv.in)
> summary(fit.cv.in)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.10180    0.77347  -1.424 0.157581
tonnage      0.00479    0.01177   0.407 0.685054
passengers  -0.19192    0.05445  -3.525 0.000654 ***
length       0.45647    0.14573   3.132 0.002306 **
cabins       0.95060    0.14510   6.551 2.92e-09 ***
---

Residual standard error: 1.059 on 95 degrees of freedom
Multiple R-squared: 0.9203, Adjusted R-squared: 0.9169
F-statistic: 274.2 on 4 and 95 DF, p-value: < 2.2e-16
Then we obtain predicted values and prediction errors for the validation
sample. The model is based on the same 4 predictors that we chose before
(columns 4-7 of the cruise data), from which we compute the MSPE.
> pred.cv.out <- predict(fit.cv.in,cruise.cv.out[,4:7])
> delta.cv.out <- cruise$crew[-cruise.cv.samp]-pred.cv.out
> (mspe <- sum((delta.cv.out)^2)/length(cruise$crew[-cruise.cv.samp]))
[1] 0.7578447
We note that the MSPE of 0.7578 is fairly close to the MSE of 1.059^2 = 1.121
(at least it is not much greater).
http://www.stat.ufl.edu/~athienit/STA4210/Examples/selection_validation.R
Chapter 10
Diagnostics
See slides and examples at http://www.stat.ufl.edu/~athienit/STA4210
The notes here are incomplete and under construction
The goal of this chapter is to use refined diagnostics for checking the
adequacy of the regression model, including detecting an improper functional
form for a predictor, outliers, influential observations and multicollinearity.
10.1 Outlying Y observations
Model errors (unobserved) are defined as

    ǫi = Yi − Σj=0..p−1 βj xi,j,   xi,0 = 1,        ǫ ∼ N(0, σ2 In)

The observed residuals are

    ei = yi − Σj=0..p−1 bj xi,j,        e ∼ N(0, σ2 (In − H))

where H = X(X T X)−1 X T is the projection matrix. So the elements of the
variance-covariance matrix σ2(In − H) are:

    σ{ei, ej} = σ2 (1 − hii)   if i = j
    σ{ei, ej} = −hij σ2        if i ≠ j

Using σ̂2 = MSE we then have
• Semi-studentized residual

      ei⋆ = ei / √MSE

• Studentized residual, which uses the estimated standard deviation of ei:

      ri = ei / √( MSE (1 − hii) )
• Studentized Deleted residual. When calculating a residual ei = yi − ŷi ,
the ith observation (yi , xi,1 , . . . , xi,p−1 ) was used in the creation of the
model (as were all the other points), and then the model was used to
estimate the response for the ith observation. That is, each observa-
tion played a role in the creation of the model, which was then used to
estimate the response of said observation. Not very objective.
The solution is to delete/remove the ith observation, fit a model without
that observation in the data, and use the model to predict the response
of that observation by plugging in the predictor setting xi,1, . . . , xi,p−1.
This sounds very computationally intensive in that you have to fit as
many models as there are points. Luckily, it has been found that this
can be done without refitting. It can be shown that

    SSE = (n − p) MSE = (n − p − 1) MSE(i) + ei2 / (1 − hii)

    ⇒  ti = ei √[ (n − p − 1) / ( SSE (1 − hii) − ei2 ) ]
where MSE(i) is the MSE of the model with the ith observation deleted,
and ti is the “objective” residual. Then we can determine if a residual
is an outlier if it is more than 2 to 3 standard deviations from 0. We
can also use a Bonferroni adjustment and determine if an observation
is an outlier if it is greater than t1−α/(2n),n−p−1 but that will usually be
too large when n is large.
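In R these deleted residuals are available directly; a minimal sketch (for a hypothetical fitted model fit) that also reproduces the no-refit formula above:

t_del <- rstudent(fit)                     # studentized deleted residuals
# Equivalent manual computation from the formula above:
e   <- resid(fit); h <- hatvalues(fit)
n   <- length(e);  p <- length(coef(fit))
sse <- sum(e^2)
t_manual <- e * sqrt((n - p - 1) / (sse * (1 - h) - e^2))
# Flag observations more than, say, 3 in absolute value, or use the
# Bonferroni cutoff qt(1 - 0.05/(2*n), n - p - 1).
which(abs(t_del) > 3)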
10.2 Outlying X-Cases
Recall that H = X(X T X)−1 X T is the projection matrix with the (i, j) element being

    hij = xiT (X T X)−1 xj,    where  xi = (1, xi,1 , . . . , xi,p−1 )T
Note that

• hii ∈ [0, 1]
• Σi=1..n hii = trace(H) = trace(X (X T X)−1 X T) = trace(X T X (X T X)−1) = trace(Ip) = p
Cases with X-levels close to the “center” of the sampled X-levels will have
small leverages, i.e. hii . Cases with “extreme” levels have large leverages, and
have the potential to “pull” the regression equation toward their observed
Y -values. We can see this by
    ŷ = Hy  ⇒  ŷi = Σj=1..n hij yj = Σj=1..i−1 hij yj + hii yi + Σj=i+1..n hij yj
Leverage values are considered large if > 2p/n (2 times larger than the mean).
Leverage values for potential new observations are

    hnew,new = xnewT (X T X)−1 xnew

and are considered extrapolations if their leverage values are larger than
those in the original dataset.
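A minimal sketch of flagging high-leverage cases and checking a new point for extrapolation (hypothetical fitted model fit and hypothetical new predictor vector x_new):

h <- hatvalues(fit)                         # leverages h_ii
p <- length(coef(fit)); n <- length(h)
which(h > 2 * p / n)                        # cases with leverage > 2p/n

# Leverage of a hypothetical new observation x_new = c(1, x1_new, ..., x_{p-1},new):
X <- model.matrix(fit)
h_new <- t(x_new) %*% solve(crossprod(X)) %*% x_new
h_new > max(h)                              # TRUE suggests extrapolation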
10.3 Influential Cases
10.3.1 Fitted values
10.3.2 Regression coefficients
10.4 Multicollinearity
See examples 7.6, 7.7 and 7.8
Chapter 11
Remedial Measures
See slides and examples at http://www.stat.ufl.edu/~athienit/STA4210
Chapter 12
Autocorrelation in Time Series
See slides and examples at http://www.stat.ufl.edu/~athienit/STA4210
END