IV. Modeling a Mean: Simple Linear Regression
Stat 5100 – Linear Regression and Time Series
Dr. Corcoran, Spring 2013
We have talked about inference for a single mean, for comparing two means, and for comparing several means.

What if the mean of one variable depends on the value of another continuous-type variable? In the case of a linear trend (i.e., the value of one outcome tends to increase or decrease linearly with an increase in another variable), we can fit a model that describes that trend. The tool most often used for this kind of analysis is called a linear regression model.
Sampling Pairs of Points
In this setting, we have n pairs of points (X1,Y1), (X2,Y2),…,(Xn,Yn). In other
words, we are measuring two variables for each subject, sampling from a
population as demonstrated in the schematic below.
In this case we are often interested in how the variables correlate or covary. In other words, considering the X variable as the explanatory or independent variable, and the Y variable as the outcome or dependent variable, the question is: How does the average E(Y) depend on the value of X?
Population – E(X), σX²; E(Y), σY²
↓ sampling
Sample – (X1,Y1), (X2,Y2), ..., (Xn,Yn), with means X̄, Ȳ and variances sX², sY²
Exploratory Analysis
As indicated on the previous slide, an important first step in exploring the
distributions of paired observations is to use summary statistics and univariate
charts (e.g., histograms, boxplots) to understand their marginal or individual
distributions. Since we’re interested in how the variables relate to each other, a
key subsequent step is to construct a two-dimensional scatterplot, treating Y as a
function of X. Here, each point in the plot corresponds to one observation in the sample, and is determined by the corresponding (X,Y) coordinate of that observation.
[Schematic: a sample point (Xi, Yi) plotted at horizontal coordinate Xi and vertical coordinate Yi.]
Patterns of Association
A plot helps us quickly identify relationships between variables that can inform how we model the relationship. For example, what do you observe from the following plot of the % of physically active adults in each U.S. state versus average annual temperature? (Source: http://www.economist.com/node/21016233.)

[Scatterplot: % of physically active adults by state versus average annual temperature.]
Linear versus Nonlinear Associations
For now, as we focus on simple regression models, we will look at examples
where the relationship between the two variables of interest is linear. However, in applied research settings one might observe a wide variety of patterns.
What is the relationship between X and Y in the scatterplot to the right?

[Scatterplot showing a nonlinear relationship between X and Y.]

Later, when we discuss multivariable models we will see how such nonlinear relationships can be accommodated using a regression approach.
Functional versus Statistical Relationships
It's also important to distinguish between a purely functional association between
two variables and what we might term a statistical association. A functional
relationship is one that is deterministic – i.e., a given value of X yields the exact
same value for Y whenever an experiment is repeated. An example of this is the
distance Y travelled by an object in free fall over time T, which is given by
Y  12 gT
g 2,
where g is the acceleration due to gravity at or near sea level.
On the other hand, a statistical relationship is one for which a given value of X
yields different values of Y for repetitions of the same experiment. That is, Y is
random, and its distribution may depend upon X. In the linear regression setting, we assume that E(Y) depends linearly upon X.
Example IV.A
Engineers were interested in the relationship between the distribution of salt on roadways and the salt concentration in adjacent waterways. They gathered data at 20 locations, measuring the roadway area at each site along with the salt concentration in the nearby river.
The data are shown on the following slide (they are also posted on the course website in the file "salt.txt"). We would like to know whether or not greater roadway area is associated with higher average salt concentration.

Why is it natural to designate the explanatory variable and response variable in this way?
Example IV.A (cont’d)
Obs   Salt Concentration   Roadway Area      Obs   Salt Concentration   Roadway Area
 1         3.8                 0.19           11        15.6                0.78
 2         5.9                 0.15           12        20.8                0.81
 3        14.1                 0.57           13        14.6                0.78
 4        10.4                 0.40           14        16.6                0.69
 5        14.6                 0.70           15        25.6                1.30
 6        14.5                 0.67           16        20.9                1.05
 7        15.1                 0.63           17        29.9                1.52
 8        11.9                 0.47           18        19.6                1.06
 9        15.5                 0.75           19        31.3                1.74
10         9.3                 0.60           20        32.7                1.62
Example IV.A (cont’d)
The scatterplot for these data is given below:
[Scatterplot: SALT CONCENTRATION (0 to 35) on the vertical axis versus ROADWAY AREA (0 to 2.0) on the horizontal axis.]
Example IV.A (cont’d)
What sort of relationship do you observe between the area of a given road and the amount of salt found in nearby waterways?
Based on what we observe, we would like to answer a few
questions. These might include:
• What is the observed average increase in salt concentration
for incrementally larger roadway area?
• Is this average increase statistically significant? That is to
say, is the observed correlation real, or can it be attributed
to chance?
Example IV.A (cont’d)
One way of answering these sorts of questions is to fit a model. That is, since we observe a somewhat linear association (average salt concentration appears to increase linearly with increased road area), we fit such a line to the data.

Knowing that a line is determined by a slope and an intercept, the question is how do we select the "best" line? A statistical solution to this problem is the so-called least-squares fit, or linear regression of salt concentration on road area.
The following slide shows this regression line overlaid on the
scatterplot of salt versus area.
Example IV.A (cont’d)
[Scatterplot of SALT CONCENTRATION versus ROADWAY AREA with the least-squares regression line overlaid.]
The Model
As noted earlier, in general we sample pairs of points (X1,Y1), (X2,Y2), ..., (Xn,Yn), where X is referred to as the explanatory variable and Y is the response variable. Note that this does not imply that X necessarily causes Y, although that is possible. X and Y may simply be associated, without any causative effect. We typically want to explain changes in the average of Y due to a difference in X.

If X and Y appear to be linearly associated, where Y on average increases or decreases linearly with an increase in X, then we may posit the linear model

Yi = β0 + β1Xi + εi,

where the intercept β0 and slope β1 determine the line, and εi models the variability around the line.
The Error Term
The scatterplot in Example IV.A is a typical representation of variables that are linearly associated: the points form a sort of "cloud". That is to say, the points do not lie on a straight line, indicating that even though Y tends to increase or decrease linearly with X, a given value of X will not necessarily result in exactly the same value of Y. That is, the relationship is not deterministic.

The error term εi in the model accounts for this variability around the line β0 + β1Xi. Note that β0 + β1Xi is fixed, not random. We further typically assume that εi ~ N(0, σ²). In other words, given the value Xi, we have E(Yi) = β0 + β1Xi and Var(Yi) = σ². Therefore, Yi ~ N(β0 + β1Xi, σ²).
The Model Parameters
What do the terms in the linear model mean?
• The intercept β0 represents the average of Y for X = 0. Although the intercept is mathematically necessary in order to specify the form of the line in the model, it seldom has practical meaning.

• The slope β1 is generally the focus of inference: it represents the change in the average of Y for every one-unit increase in X. Since we are interested in how Y changes with X, a nonzero slope indicates that Y and X are linearly associated.

• The variance term σ² represents the variability of the data around the line.
The Experiment
As always, our objective is to infer something about the underlying model parameters by sampling from the population and then analyzing the data. Having posited the regression model, we can think of the sampling in this way:

Population – Yi = β0 + β1Xi + εi, where εi ~ N(0, σ²)
↓ sampling
Sample – (X1,Y1), (X2,Y2), ..., (Xn,Yn); estimate β0, β1, and σ² from the data.
Example IV.B
Suppose that the purity of a chemical solution Y is related to the amount of catalyst X through a linear regression model with β0 = 123.0, β1 = –2.16, and with an error standard deviation of σ = 4.1.

What is the expected value of the purity when the catalyst level is 20?

How much does the average purity change when the catalyst amount is increased by 10?

What is the probability that the purity is less than 60 when the catalyst level is 25?
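These are straightforward normal-distribution calculations; a minimal Python sketch (not from the original slides, and assuming the reconstructed slope β1 = –2.16) might look like:

```python
from scipy.stats import norm

b0, b1, sigma = 123.0, -2.16, 4.1

# Expected purity at catalyst level X = 20: E(Y) = b0 + b1*X
print(b0 + b1 * 20)                                 # 79.8

# Change in average purity for a 10-unit increase in catalyst: b1 * 10
print(b1 * 10)                                      # -21.6

# P(Y < 60 | X = 25), where Y ~ N(b0 + b1*25, sigma^2)
print(norm.cdf(60, loc=b0 + b1 * 25, scale=sigma))  # about 0.014
```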
In practical research settings, we do not know the actual parameter values. As indicated on the schematic two slides previous, we sample from the population with the posited regression model and then estimate the regression parameters from the data.
Example IV.C
Write out a linear model for the experiment described in Example
IV.A. Clearly interpret each of the parameters of the model.
How do we estimate the model parameters?
That is to say, what is the "best" line, based on the data? In statistical applications, we choose the line that achieves the minimum squared distance between itself and the collective observed data points.

Note that the distance between a given value Yi and its associated point on the line is given by Yi – (β0 + β1Xi). We call this the residual. It turns out that we compute estimates of the slope, intercept, and variance that minimize the sum of the squared residuals. The resulting estimates of the slope and intercept are given by
b1 = ( Σ XiYi – nX̄Ȳ ) / ( Σ Xi² – nX̄² ),

b0 = Ȳ – b1X̄,

with sums over i = 1, ..., n.
Derivation of Parameter Estimates
One way of thinking about how b0 and b1 are derived is to consider
direct minimization of the sum of squared residuals:
Q = Σ εi² = Σ [Yi – (β0 + β1Xi)]²  (sums over i = 1, ..., n).
We sometimes refer to Q as the objective function. How can we
minimize this function with respect to β0 and β1? (Some discussion
about this is contained in Section 1.6 of the text, although the
technical details are not that important.)
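For readers who want the missing step, a standard calculus sketch (not reproduced on the slides): setting the partial derivatives of Q to zero gives the so-called normal equations,

```latex
\frac{\partial Q}{\partial \beta_0}
  = -2\sum_{i=1}^{n}\bigl[Y_i - (\beta_0 + \beta_1 X_i)\bigr] = 0,
\qquad
\frac{\partial Q}{\partial \beta_1}
  = -2\sum_{i=1}^{n}X_i\bigl[Y_i - (\beta_0 + \beta_1 X_i)\bigr] = 0.
```

Solving these two equations simultaneously for β0 and β1 yields b0 and b1 as given on the previous slide.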
Example IV.D
Using the data given in Example IV.A, fit the model specified in
Example IV.C. The necessary summary statistics are given below:
X  0.824
2
X
i i  17.2502
Y  17.135
n  20
XY
i
i i
 346.793
What does the estimated intercept represent in the model fit?
Interpret the estimated slope – what does it say about the observed
relationship between road area and average salt concentration?
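One way to work the example (a sketch using only the summary statistics above):

```python
n, xbar, ybar = 20, 0.824, 17.135
sum_x2, sum_xy = 17.2502, 346.793

b1 = (sum_xy - n * xbar * ybar) / (sum_x2 - n * xbar ** 2)
b0 = ybar - b1 * xbar
print(b0, b1)   # about 2.67 and 17.56: average salt concentration rises
                # roughly 17.6 units per unit increase in roadway area
```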
Interpreting the Model Fit
Note that once we have obtained our estimates of the intercept and
slope, the fitted value for Yi is given by

Ŷi = b0 + b1Xi.
There are two ways of viewing such a fitted value:
• The fitted value is our predicted Yi for the given Xi.
• The fitted value is our estimate of the average Yi for the given Xi.
Example IV.E
Based on the model fit in Example IV.D, what is the predicted salt
concentration when the adjacent road area is 0.75?
What is our estimated average salt concentration level when the
adjacent road area is 0.75?
What is the predicted salt concentration when the road area is 2.0?
Why should we be cautious about this last prediction?
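A quick sketch of these predictions, using the estimates b0 ≈ 2.67 and b1 ≈ 17.56 computed earlier:

```python
b0, b1 = 2.67, 17.56   # estimates from Example IV.D

yhat = lambda x: b0 + b1 * x
print(yhat(0.75))  # ~15.8: both the predicted value and the estimated mean at X = 0.75
print(yhat(2.0))   # ~37.8: X = 2.0 lies beyond the observed data, so this is extrapolation
```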
Estimating σ²

The last of the three parameters that we need to estimate is the model variance, which represents the variability of the Yi's around the regression line.
Note that the estimated residuals based upon the model fit are given
by
ei  Yi  Yˆi  Yi  (b0  b1 X i ), i  1,..., n.
Therefore, our estimate of the model variance σ² is the observed "average" squared residual, also called the mean square error (MSE):

s² = (1/(n–2)) Σ ei² = (1/(n–2)) Σ (Yi – Ŷi)² = SSE/(n–2).
Example IV.F
The table below shows both the observed salt concentration and the predicted salt concentration (using the fitted line) for the observations in Example IV.A.

Obs   Observed   Predicted      Obs   Observed   Predicted
 1      3.8        6.01         11     15.6       16.36
 2      5.9        5.31         12     20.8       16.89
 3     14.1       12.68         13     14.6       16.36
 4     10.4        9.70         14     16.6       14.78
 5     14.6       14.96         15     25.6       25.49
 6     14.5       14.43         16     20.9       21.10
 7     15.1       13.73         17     29.9       29.35
 8     11.9       10.92         18     19.6       21.28
 9     15.5       15.84         19     31.3       33.21
10      9.3       13.20         20     32.7       31.10
Example IV.F (cont’d)
Based on the fitted values, we can see that the residual for the first observation is –2.21, for the second observation the residual is 0.59, and so forth. The average squared residual therefore is given by

s² = (1/(20–2)) Σ ei² = (1/18)[(–2.21)² + (0.59)² + ⋯ + (1.60)²] ≈ 3.206.
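Continuing the earlier Python sketch (X, Y, b0, b1 as computed there), the same number can be reproduced directly:

```python
residuals = [y - (b0 + b1 * x) for x, y in zip(X, Y)]
sse = sum(e ** 2 for e in residuals)
mse = sse / (len(X) - 2)   # divide by n - 2, not n
print(mse)                 # about 3.21
```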
Inference for the Slope β1
Remember, the fundamental question in a linear regression analysis
is whether the dependent and independent variables are linearly
associated. As always, there are two aspects to the analysis:
• Do we observe a slope that is different from zero? That is, does
the average of the outcome variable depend on the value of the
explanatory variable?
• Do the data provide evidence that the slope is significantly different from zero? That is, can we infer from our data that the relationship we observe holds for the underlying population?

To address the second issue, we need to know something about the distribution of our estimated slope.
Distribution of the Estimated Slope
Not surprisingly, it turns out that the estimated slope b1 is approximately normally distributed (provided that the sample is random and – in most cases – that the sample size is sufficiently large). The mean of the distribution of b1 is β1. The estimated standard error is given by

s.e.(b1) = s{b1} = s / [ Σ (Xi – X̄)² ]^(1/2),

where s² is the model MSE (or estimate of the model variance σ²).

Since we need to rely on the estimated standard error (i.e., σ is unknown), we use the t(n–2) distribution to obtain a confidence interval and hypothesis test for β1.
Confidence Interval and Hypothesis Test for β1
A (1 – α)100% confidence interval for β1 is therefore given by

b1 ± t(1 – α/2; n–2)·s{b1}.

We also would like to test the null hypothesis H0: β1 = 0 versus the alternative hypothesis HA: β1 ≠ 0. A test statistic for assessing the evidence against H0 is given by

t = (b1 – 0) / s{b1}.
Under H0, this test statistic approximately follows the t(n–2)
distribution. The p-value is therefore given by 2P{t(n–2) ≥ |t|}.
Note that we can conceivably test against any specific value of β1,
although 0 is generally the value of interest.
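Continuing the same sketch, the standard error, interval, and test for the salt data can be computed as (a rough check, not the slides' own output):

```python
from scipy.stats import t as t_dist

sxx = sum((x - xbar) ** 2 for x in X)   # sum of squared X deviations
se_b1 = (mse / sxx) ** 0.5              # s{b1}
tcrit = t_dist.ppf(0.975, df=n - 2)     # for a 95% interval
print(b1 - tcrit * se_b1, b1 + tcrit * se_b1)   # roughly (15.6, 19.5)

tstat = b1 / se_b1                      # test of H0: beta1 = 0
pval = 2 * t_dist.sf(abs(tstat), df=n - 2)
print(tstat, pval)                      # t ~ 18.8, p far below 0.001
```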
Example IV.G
Give a 95% confidence interval for the slope parameter in the
model of Example IV.C, based upon the observed data given in
Example IV.A. Interpret this confidence interval.
State the null and alternative hypotheses for testing a linear
association between road area and average salt concentration.
Explain these hypotheses.
Carry out a test of the null hypothesis of no linear association.
What is the p-value for this test? Is there evidence of a relationship
between road area and average salt concentration?
Measuring the Strength of Association
Note that the slope is one measure of the linear association between
two continuous variables – it tells you how much the average of the
outcome variable changes with respect to a one-unit increase in the
explanatory variable. However, the estimated slope tells you
nothing about the variability of the points about the line.
Correlation is a measure of the strength of association between two variables that reflects the degree of variability around the fitted line. It's another popular summary statistic for illustrating the degree to which variables are linearly associated.
The slope itself does not always reflect the strength of
association…
For example, note that in the two plots below we observe two data sets with approximately the same estimated slope. However, the association in the first case looks much stronger, as the cloud of points more tightly clusters about the regression line.
The Correlation Coefficient
The correlation ρ is another population parameter that we can estimate from the data. We typically use r to denote our estimate of ρ. The so-called correlation coefficient r has several important features:

• r has a range of –1 to 1. It is an index, and has no units.
• The closer r is to 1, the stronger the positive linear association (r = 1 indicates perfect positive correlation).
• The closer r is to –1, the stronger the negative linear association (r = –1 indicates perfect negative correlation).
• An r close to zero indicates weak linear association. If r = 0, this means no linear association.
• r measures linear association only. Two variables can be highly correlated in a nonlinear way, nevertheless yielding r close to 0.
Example IV.H
Plots illustrating various values of r:

[Figure: a panel of scatterplots illustrating various values of r.]
Computing r
Our estimated correlation coefficient for two variables X and Y is given by

rXY = Σ (Xi – X̄)(Yi – Ȳ) / [ Σ (Xi – X̄)² · Σ (Yi – Ȳ)² ]^(1/2)
    = ( Σ XiYi – nX̄Ȳ ) / [ ( Σ Xi² – nX̄² )( Σ Yi² – nȲ² ) ]^(1/2).
Example IV.I
Given the five summary statistics in Example IV.D, and that

Σ Yi² = 7060.03,

what is the correlation coefficient between salt concentration and roadway area?
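A sketch of the plug-in calculation for this example (not from the original slides):

```python
n, xbar, ybar = 20, 0.824, 17.135
sum_x2, sum_xy, sum_y2 = 17.2502, 346.793, 7060.03

num = sum_xy - n * xbar * ybar
den = ((sum_x2 - n * xbar ** 2) * (sum_y2 - n * ybar ** 2)) ** 0.5
print(num / den)   # r ~ 0.98, a very strong positive linear association
```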
Inference for r
To carry out a test of H0: ρ = 0 versus the alternative hypothesis
HA: ρ ≠ 0, we can use this statistic:

t = r(n – 2)^(1/2) / (1 – r²)^(1/2),

which approximately follows a t(n–2) distribution. In fact, it turns out that this statistic is algebraically equivalent to the t statistic for testing that the regression slope is equal to zero.

The p-value for this test is given by 2P{t(n–2) ≥ |t|}.
Example IV.J
Carry out a test of the null hypothesis that the salt concentration and road area are not correlated, versus the alternative hypothesis that they are correlated.

What is the p-value of this test?

Interpret this result in words.
Inference for Means and Predictions
In addition to inferences about the slope, we may also want to construct tests and confidence intervals for the regression line itself.

We will talk about inference for:
(1) The average of Y given a corresponding value of X, and
(2) A predicted value Y given a corresponding value of X.
Inference for a Mean
Suppose we want to estimate the average of Y for a given value of X, denoted by Xh. Our estimated average is

Ŷh = b0 + b1Xh.

This estimate has a standard error given by

s{Ŷh} = s · [ 1/n + (Xh – X̄)² / Σ (Xi – X̄)² ]^(1/2),

where s² is the regression MSE (our estimate of the error variance σ²).
Inference for a Mean, continued
If E{Yh} represents the actual mean of Y at the value Xh, then the statistic

( Ŷh – E{Yh} ) / s{Ŷh}

follows a t(n–2) distribution. A (1 – α)100% confidence interval for the mean of Y is therefore given by

Ŷh ± t(1 – α/2; n–2)·s{Ŷh}.
Example IV.K
For the roadway data, compute and interpret a 95% confidence
interval for the average salt concentration when the corresponding
roadway area is 1.0 m².
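A sketch of this interval, continuing the earlier Python computations (X, xbar, n, b0, b1, mse as defined there):

```python
from scipy.stats import t as t_dist

xh = 1.0
yhat_h = b0 + b1 * xh
sxx = sum((x - xbar) ** 2 for x in X)
se_mean = (mse * (1 / n + (xh - xbar) ** 2 / sxx)) ** 0.5
tcrit = t_dist.ppf(0.975, df=n - 2)
print(yhat_h - tcrit * se_mean, yhat_h + tcrit * se_mean)  # roughly (19.3, 21.1)
```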
Inference for a Prediction
As opposed to estimating a mean, suppose instead that we want to make a prediction for a single additional observation. Again, as with the mean, our estimated predicted value for a given Xh is computed as

Ŷh = b0 + b1Xh.

However, in this case the estimated prediction has a standard error given by

s{pred} = s · [ 1 + 1/n + (Xh – X̄)² / Σ (Xi – X̄)² ]^(1/2).
Note the difference between this standard error and the one given for
an estimated mean. The extra variability arises since here we are
estimating a value for a single observation as opposed to an average
over many observations.
Inference for a Prediction, continued
If Yh(new) represents a randomly sampled value of Y for a corresponding Xh, then the statistic

( Yh(new) – Ŷh ) / s{pred}

follows a t(n–2) distribution. A (1 – α)100% confidence interval for a predicted Yh(new) is therefore given by

Ŷh ± t(1 – α/2; n–2)·s{pred}.
Example IV.L
For the roadway data, compute and interpret a 95% confidence
interval for the predicted salt concentration when the corresponding
roadway area is 1.0 m².
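The analogous sketch for the prediction interval; note the extra 1 inside the square root relative to the interval for the mean:

```python
from scipy.stats import t as t_dist

xh = 1.0
yhat_h = b0 + b1 * xh
sxx = sum((x - xbar) ** 2 for x in X)
se_pred = (mse * (1 + 1 / n + (xh - xbar) ** 2 / sxx)) ** 0.5
tcrit = t_dist.ppf(0.975, df=n - 2)
print(yhat_h - tcrit * se_pred, yhat_h + tcrit * se_pred)  # roughly (16.4, 24.1)
```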
Confidence and Prediction Bands
Researchers often find it useful to construct a confidence interval for the regression line over the entire range of X-values. We can accomplish this by computing the confidence intervals presented on the previous slides, either for the means or the predictions (depending on the investigative focus).

In general, this is accomplished using computer software.
Example IV.M
The plot on the following slide illustrates confidence and
prediction bands for the roadway data.
Note the relative widths of the intervals delimited by both sets
of bounds. How do you explain the wider intervals for the
prediction bands?
[Plot: the fitted regression line for the roadway data with confidence bands and wider prediction bands.]
Analysis of Variance (ANOVA) for Regression
Important information about a regression analysis is generally
displayed in an ANOVA table.
The underlying principle is that the variation of the Y (or outcome)
variable arises from two sources:
Total Variation in Y = Variation due to Regression
+ Unexplained (Residual) Variation
Sources of Variation
In more mathematical terms, this relationship can be expressed as:

Σ (Yi – Ȳ)² = Σ (Ŷi – Ȳ)² + Σ (Yi – Ŷi)²   (sums over i = 1, ..., n),

where:

Σ (Yi – Ȳ)² = SSTO,
Σ (Ŷi – Ȳ)² = SSR,
Σ (Yi – Ŷi)² = SSE.
ANOVA F Test for Regression Coefficients
It turns out that the ANOVA approach provides us with a useful way of testing coefficients and comparing models in a variety of settings (particularly for multiple regression with several variables).

For the simple linear regression model, the ANOVA F statistic for testing H0: β1 = 0 versus HA: β1 ≠ 0 is given by

F = MSR / MSE,

where MSR is the mean squared error due to regression, or MSR = SSR/df(Regression); and MSE is the mean squared error s², or MSE = SSE/df(Error).

There are generally n–1 df associated with SSTO. As we've discussed previously, in the simple model there are n–2 df for SSE, leaving 1 df for SSR.
ANOVA F Test for β1 in the Simple Model
For the simple regression model, relatively large values of F provide
evidence against the null H0: β1 = 0, and values of F close to 1.0
indicate little or no evidence against the null.
The p-value for this F test is determined by computing the upper-tail
probability for the observed statistic with respect to the F(1,n–2)
distribution.
Note that in the simple case, it turns out that the ANOVA F test and the t test for the slope (discussed earlier) are identical. That is,

F = MSR/MSE = t² = [ (b1 – 0) / s{b1} ]².
ANOVA Table
All of this is summarized in a table, typically in this familiar form:

Source        Degrees of Freedom    Sum of Squares    Mean Squares    F-statistic    p-value
Regression            1                  SSR               MSR          MSR/MSE         p
Error                n–2                 SSE               MSE
Total                n–1                 SSTO
Example IV.N

The ANOVA table for the roadway data is partially completed below. Can you fill in the missing information?

Source        Degrees of Freedom    Sum of Squares    Mean Squares    F-statistic    p-value
Regression                              1130.15
Error                 18
Total                                   1187.87
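A sketch of filling in the blanks (SSR, SSTO, and the error df come from the table above; the rest is arithmetic):

```python
from scipy.stats import f as f_dist

ssr, ssto, df_error = 1130.15, 1187.87, 18
sse = ssto - ssr                     # 57.72
msr, mse = ssr / 1, sse / df_error   # df(Regression) = 1
F = msr / mse                        # about 352
print(sse, msr, mse, F, f_dist.sf(F, 1, df_error))  # p-value is essentially 0
```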
Example IV.N (cont’d)
Based on the results of the ANOVA procedure, what are your
conclusions regarding the association between roadway area and
salt concentration?
Checking Model Assumptions
What are some of the underlying assumptions we have discussed with respect to the simple regression model?
1.
2.
3.
4.
5.
6.
Residuals and Standardized Residuals
Examining the observed residuals can provide key diagnostic information about whether model assumptions are violated. Recall that the residual ei for the ith subject is given by

ei = Yi – Ŷi, i = 1, ..., n.
Since the actual variance of the residuals is σ², the estimated variance is given by the MSE. It turns out that computing the actual standard deviation of the residuals is a little more complex than simply taking (MSE)^(1/2), but this estimate is not too far off. We therefore define what is referred to as the semistandardized or semistudentized residual as

ei* = ei / (MSE)^(1/2).
Exploration of Residuals
A regression analysis is generally accompanied by an examination of the residuals or standardized residuals, to assess
• the linearity of the relationship between X and Y,
• the normality of the residuals,
• the constancy of the residual variance across the range of X,
• the independence of the residuals,
• effects of potential outliers (both with respect to X and Y), and
• whether any additional explanatory factors may have been omitted.
Diagnostics for Linearity
A scatterplot is one of the best ways to assess the nature of the X-Y relationship, but a plot of the residuals (versus either the predictor variable X or the fitted values) can also reveal patterns that could indicate nonlinearity. Note the nonlinear pattern in the plots below:

[Plots: a scatterplot and the corresponding residual plot, each exhibiting a curved, nonlinear pattern.]
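A minimal plotting sketch for such a diagnostic (reusing X, Y, b0, b1 from the earlier salt-data computations):

```python
import matplotlib.pyplot as plt

fitted = [b0 + b1 * x for x in X]
resid = [y - f for y, f in zip(Y, fitted)]

plt.scatter(fitted, resid)      # look for curvature or funnel shapes
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```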
Example IV.O
Is there anything in the residual plot (below) for the roadway data to indicate
nonlinearity?
Evaluating Non-constant Variance
Residual plots can also be very useful in assessing whether the variance remains constant across the range of X. The plots below illustrate a classic pattern where this assumption is not met:

In examining the residual plot in Example IV.O, is there any evidence of non-constant variance for the roadway data?
Evaluating Dependence Between Residuals
This can sometimes be tricky, but dependence most often manifests itself with respect to the sequence, or temporal ordering, of the measurements.

Where an investigator knows the order in which observations were sampled, he or she ought to plot the residuals versus sampling sequence to ensure there is no systematic correlation between contiguous observations.

Note that information about sampling order may not always be available.
Example IV.P
Consider the data plotted below. How do the plots look in terms of
linearity and variance?
Example IV.P (continued)
The plot below shows the relationship between the residuals and the order of measurements for the data plotted on the previous slide. What do you observe?
Outliers
In addition to initial univariate exploratory analyses, residual plots can be useful for identifying outliers.

Note that outliers with respect to the distribution of Y or X in a regression setting can potentially influence the model fit in dramatically different ways. In some cases, outliers may not have any appreciable effect on the analysis.

Simply identifying outliers is no reason to throw them out – such observations must be examined individually to (hopefully) explain why they have relatively extreme values. An outlier may exist because of miscoding, incorrect sampling, or even just sheer randomness.
Example IV.Q
Note the outlier below with respect to the distribution of the Y variable. What effect
(if any) does this observation have on the model fit?
[Scatterplot with fitted line; the outlying Y value is labeled OUTLIER.]
Example IV.Q (continued)
A plot below of the residuals for the data on the previous slide clearly identifies the
outlier. Interestingly, the observation appears to be exerting very little influence on
the model fit.
[Residual plot with the outlying observation labeled OUTLIER.]
Example IV.R
In the plot below, the outlier is extreme in particular with respect to the distribution
of the X variable. These kinds of outliers can be particularly problematic in terms of
their influence on model fit.
[Scatterplot with an observation extreme in X, labeled OUTLIER.]
Normality of Residuals
A conventional univariate analysis (i.e., with summary statistics, boxplots, etc.) can be useful in examining the distribution of residuals.

The so-called normal probability plot (also known as a normal-quantile or Q-Q plot) is also useful for assessing the normality of residuals.

A Q-Q plot for a given sample is constructed by plotting the empirical standardized quantiles for the data against the quantiles that would be expected given the data arise from a normal distribution.
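One common way to construct such a plot in practice (a sketch applying scipy's probplot to the residuals computed earlier):

```python
import matplotlib.pyplot as plt
from scipy import stats

# Points falling near the reference line suggest approximate normality
stats.probplot(resid, dist="norm", plot=plt)
plt.show()
```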
Example IV.S
A Q-Q plot for the residuals from the roadway data model is shown on the following slide.

Note that if the plotted data are at least approximately normally distributed, then the points should roughly follow a straight line.

What is your interpretation of this plot?
Example IV.S (continued)

[Normal Q-Q plot of the residuals from the roadway data model.]
Example IV.T
The following two slides illustrate examples of Q-Q plots for non-normal data.
What is the nature of the deviation from normality in each case?
Example IV.T (continued)

[Q-Q plot for the first example of non-normal data.]
Example IV.T (continued)

[Q-Q plot for the second example of non-normal data.]
Variable Transformations
Problems with nonlinearity, non-constant variance, or non-normality can frequently be fixed with a simple transformation.

Logarithmic and power transformations are the most widely applied. The following example illustrates the utility of this approach.
Example IV.U
The data for this example come from a study of water use and household income in Concord, NH, during the summer of 1981 (the dataset is posted as "concord.txt" on the course website).

The following three slides contain a scatterplot with fitted regression line along with two residual plots.

What potential problems, if any, do you observe with respect to model assumptions?
Example IV.U (continued)
Yˆ  1201.1  47.5 X
Example IV.U (continued)

[First residual plot for the untransformed fit.]
Example IV.U (continued)

[Second residual plot for the untransformed fit.]
Example IV.U (continued)
In this case, because of the positive skew of the water use distribution, as well as the increasing variance of the residuals, it would be useful to explore a log transformation or a transformation using a power < 1.
The following six slides illustrate alternative fits for these data, first
using a log transformation, and second with a transformation using a
power of 0.3 for water use.
What are your conclusions? How do you interpret the fitted
coefficients in each case?
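A sketch of how the two alternative fits could be computed; the file layout and column order for "concord.txt" are assumptions here, not taken from the slides:

```python
import numpy as np

# Assumed layout: income in the first column, water use in the second
income, water = np.loadtxt("concord.txt", unpack=True)

def ls_fit(x, y):
    """Least-squares intercept and slope."""
    b1 = ((x * y).mean() - x.mean() * y.mean()) / ((x ** 2).mean() - x.mean() ** 2)
    return y.mean() - b1 * x.mean(), b1

print(ls_fit(income, np.log(water)))   # compare to: log(Y-hat) = 7.016 + 0.022 X
print(ls_fit(income, water ** 0.3))    # compare to: Y-hat^0.30 = 8.316 + 0.063 X
```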
Example IV.U (continued)
log(Ŷ) = 7.016 + 0.022X

[Scatterplot of log water use versus income with this fitted line.]
Example IV.U (continued)

[First residual plot for the log-transformed fit.]
Example IV.U (continued)

[Second residual plot for the log-transformed fit.]
Example IV.U (continued)
Ŷ^0.30 = 8.316 + 0.063X

[Scatterplot of (water use)^0.30 versus income with this fitted line.]
Example IV.U (continued)

[First residual plot for the power-transformed fit.]
Example IV.U (continued)

[Second residual plot for the power-transformed fit.]
Additional Notes on Diagnostics
• Assessing the normality of residuals can be a bit tricky under certain circumstances. For example, residuals may actually be normally distributed, but plots (such as boxplots or Q-Q plots) can appear non-normal because of (i) randomness (especially with a small sample size), or (ii) the exclusion of one or more additional key variables. It is usually a good idea to check other assumptions first – such as linearity and non-constant variance – before checking normality.

• Even where the outcome variable isn't exactly normally distributed, substantive conclusions based on a regression model fit may still be fundamentally correct given a relatively large sample size. This is in some sense due to the fact that we are estimating an average, meaning that the Central Limit Theorem applies to the distribution of the fitted mean.

• We have not illustrated this here with an example, but to check the possibility that other variables are additionally associated with Y, we generally begin simply by constructing additional scatterplots.