Chapter 14
Multiple Regression Analysis
What is the general purpose of regression?
• To model the relationship between the dependent variable y (response) and one or more independent variables x (predictors or explanatory variables).
Multiple regression can be used to fit models to data with two or more independent variables.
For example, some variation in the price of a house in a large city can be attributed to the size of the house, but other variables also contribute to the price, such as age, lot size, number of bedrooms and bathrooms, etc.
Consider a school district in which teachers with no prior teaching experience and no college credits beyond a bachelor's degree start at an annual salary of $38,000. Suppose that for each year of teaching experience up to 20 years, the teacher receives an additional $800 and that each unit of postgraduate credit up to 75 credits results in an additional $60 per year.
How can this scenario be modeled using multiple regression?
In a simple regression model, x1 and x2 represent two observations of a single variable. In multiple regression, x1 and x2 represent two independent variables!
Let:
y = salary of a teacher with at most 20 years experience and at most 75 postgraduate units
x1 = number of years of experience
x2 = number of postgraduate units
The equation to determine salary is
$$y = 38{,}000 + 800x_1 + 60x_2$$
Since y is determined entirely by x1 and x2, this is a deterministic model.
What if y is not entirely determined by the two (or more) independent variables?
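The deterministic salary rule above can be sketched in a few lines of Python. This is only an illustration: the function name teacher_salary and the range checks are assumptions added here, not part of the slides.

```python
def teacher_salary(years_experience, postgrad_units):
    """Deterministic model from the slide: y = 38,000 + 800*x1 + 60*x2.

    The model is only stated for at most 20 years of experience and
    at most 75 postgraduate units, so values outside that range are rejected.
    """
    if not 0 <= years_experience <= 20:
        raise ValueError("years_experience must be between 0 and 20")
    if not 0 <= postgrad_units <= 75:
        raise ValueError("postgrad_units must be between 0 and 75")
    return 38_000 + 800 * years_experience + 60 * postgrad_units


# A new teacher, and a teacher with 10 years and 30 units.
print(teacher_salary(0, 0))    # 38000
print(teacher_salary(10, 30))  # 47800
```

Because there is no random deviation term, the same inputs always give the same salary, which is what makes the model deterministic.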
General Additive Multiple Regression Model
A general additive multiple regression model, which relates a dependent variable y to k predictor variables x1, x2, . . ., xk, is given by the model equation
$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + e$$
The β's are the population regression coefficients. Each βi can be interpreted as the mean change in y when the predictor xi increases by 1 unit and the values of all other predictors remain fixed.
The random deviation e is assumed to be normally distributed with mean value 0 and standard deviation σ for any particular values x1, …, xk. Remember, this is the amount that a point randomly deviates from the regression model.
This implies that for fixed x1, x2, …, xk values, y has a normal distribution with standard deviation σ and
$$(\text{mean } y \text{ value for fixed } x_1, x_2, \ldots, x_k \text{ values}) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$
This is called the population regression function.
Data collected in a survey of approximately 1000 second-year college students suggest that GPA at the end of the second year is related to the student's level of interaction with the faculty and staff and to the student's commitment to his or her major.
Let:
y = GPA at end of sophomore year
x1 = level of faculty and staff interaction (measured on a scale of 1 to 5)
x2 = level of commitment to major (measured on a scale of 1 to 5)
One possible population model might be:
$$y = 1.4 + .33x_1 + .16x_2 + e \quad \text{with } \sigma = 0.15$$
For sophomores whose level of interaction with the faculty and staff is rated at 4.2 and whose level of commitment to major is rated at 2.1,
(mean value of GPA) = 1.4 + .33(4.2) + .16(2.1) = 3.12
It is likely that a y value will be within 2σ (.30) of this mean value (3.12 ± .30). This interval is from 2.82 to 3.42.
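The GPA calculation can be checked with a short Python sketch. The coefficients and σ come from the slide; the names mean_gpa and likely_range are illustrative assumptions.

```python
# Population model from the slide: y = 1.4 + .33*x1 + .16*x2 + e, sigma = 0.15
ALPHA, BETA1, BETA2, SIGMA = 1.4, 0.33, 0.16, 0.15


def mean_gpa(x1, x2):
    """Population regression function: mean GPA for fixed x1 and x2."""
    return ALPHA + BETA1 * x1 + BETA2 * x2


def likely_range(x1, x2, n_sd=2):
    """Interval of mean +/- n_sd standard deviations of the random deviation e."""
    m = mean_gpa(x1, x2)
    return m - n_sd * SIGMA, m + n_sd * SIGMA


mean = mean_gpa(4.2, 2.1)
low, high = likely_range(4.2, 2.1)
print(round(mean, 2), round(low, 2), round(high, 2))  # 3.12 2.82 3.42
```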
Polynomial Regression
Suppose a scatterplot has the following appearance:
[Scatterplot of y versus x showing a curved pattern]
Would a line be a good fit for these data? Explain.
It looks like a parabola (quadratic function) would provide a good fit for the data.
Polynomial Regression
The kth degree polynomial regression model is
$$y = \alpha + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + e$$
Note that we include the random deviation e since this is a probabilistic model. Without e, the right-hand side is the population regression function (mean y value for fixed values of the predictors).
This is a special case of the general multiple regression model with x1 = x, x2 = x², x3 = x³, . . ., xk = x^k.
Note also that βi cannot be interpreted since all the values are functions of a single variable.
The most important special case (other than the simple regression model when k = 1) is the quadratic regression model
$$y = \alpha + \beta_1 x + \beta_2 x^2 + e$$
Many researchers have examined factors that are believed to contribute to the risk of heart attacks. One study found that hip-to-waist ratio was a better predictor of heart attacks than body-mass index. A plot of data from this study of a measure of heart-attack risk (y) versus hip-to-waist ratio (x) exhibited a curved relationship.
A model consistent with summary values given in the paper is
$$y = 1.023 + 0.024x + 0.060x^2 + e$$
Suppose the hip-to-waist ratio is 1.3. What are the possible values of the heart-attack risk measure (if σ = 0.25)?
mean y value = 1.023 + .024(1.3) + .060(1.3)² = 1.16
It is likely that the heart-attack risk measure for a person with a hip-to-waist ratio of 1.3 is within 2σ (.50) of this mean value, that is, between .66 and 1.66.
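As a rough illustration, a polynomial regression function can be evaluated by treating x, x², …, x^k as separate predictors. The helper name polynomial_mean is an assumption made for this sketch; the coefficients and σ are the ones quoted above.

```python
def polynomial_mean(x, alpha, betas):
    """Mean y value for a kth degree polynomial model.

    betas[0] multiplies x, betas[1] multiplies x**2, and so on,
    matching x1 = x, x2 = x^2, ..., xk = x^k in the general model.
    """
    return alpha + sum(b * x ** (i + 1) for i, b in enumerate(betas))


SIGMA = 0.25  # standard deviation of the random deviation, from the slide

mean_risk = polynomial_mean(1.3, alpha=1.023, betas=[0.024, 0.060])
low, high = mean_risk - 2 * SIGMA, mean_risk + 2 * SIGMA
print(round(mean_risk, 2), round(low, 2), round(high, 2))  # 1.16 0.66 1.66
```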
Suppose that an industrial chemist is interested in the relationship between product yield (y) from a certain chemical reaction and two independent variables, x1 = reaction temperature and x2 = pressure at which the reaction is carried out.
The chemist initially suggests that for temperatures between 80 and 110 in combination with pressure values ranging from 50 to 70, the relationship can be modeled by
$$y = 1200 + 15x_1 - 35x_2 + e$$
Consider the mean y value for three different particular temperature values:
x1 = 90: mean y value = 1200 + 15(90) - 35x2 = 2550 - 35x2
x1 = 95: mean y value = 1200 + 15(95) - 35x2 = 2625 - 35x2
x1 = 100: mean y value = 1200 + 15(100) - 35x2 = 2700 - 35x2
Let's look at a plot of these three lines.
[Plot: mean y value versus x2 for x1 = 90, 95, and 100]
Notice each is a straight line with a slope of -35.
Because chemical theory suggests that the decline in average yield when pressure x2 increases should be more rapid for a high temperature than for a low temperature, the chemist now has reason to doubt the appropriateness of the proposed model.
Chemical Reaction Continued . . .
A better model would include a third predictor variable x1x2. This third predictor variable is an interaction term.
One such model is
$$y = -4500 + 75x_1 + 60x_2 - x_1 x_2 + e$$
Consider the mean y value for three different particular temperature values:
x1 = 90: mean y value = -4500 + 75(90) + 60x2 - 90x2 = 2250 - 30x2
x1 = 95: mean y value = -4500 + 75(95) + 60x2 - 95x2 = 2625 - 35x2
x1 = 100: mean y value = -4500 + 75(100) + 60x2 - 100x2 = 3000 - 40x2
[Plot: mean y value versus x2 for x1 = 90, 95, and 100]
Notice that these all have different slopes, as seen in the plot of these lines.
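The effect of the interaction term can be seen by computing, for a fixed temperature, the intercept and slope of the mean yield as a function of pressure. The sketch below uses the interaction model quoted above; the function names are illustrative.

```python
def mean_yield(x1, x2):
    """Mean yield under the interaction model: -4500 + 75*x1 + 60*x2 - x1*x2."""
    return -4500 + 75 * x1 + 60 * x2 - x1 * x2


def line_in_pressure(x1):
    """For fixed temperature x1, return (intercept, slope) of mean y versus x2.

    Collecting the x2 terms gives (-4500 + 75*x1) + (60 - x1)*x2,
    so the slope depends on the temperature.
    """
    return -4500 + 75 * x1, 60 - x1


for temperature in (90, 95, 100):
    print(temperature, line_in_pressure(temperature))
# 90 (2250, -30)
# 95 (2625, -35)
# 100 (3000, -40)
```

The slope becomes more negative as temperature rises, which matches the chemist's expectation that yield declines faster with pressure at high temperature.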
Interaction Between Variables
If the change in the mean y (slope) associated with a 1-unit increase in one independent variable depends on the value of a second independent variable, there is interaction between these two variables.
When the variables are denoted by x1 and x2, such interaction can be modeled by including x1x2, the product of the variables that interact, as a predictor variable.
The general equation for a multiple regression model based on two independent variables x1 and x2 that also includes an interaction predictor is
$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + e$$
More than one interaction predictor can be included in the model when more than two independent variables are available.
In quadratic regression, the full quadratic, or complete second-order, model is:
$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + e$$
Qualitative Predictor Variables
Qualitative or categorical variables can also be incorporated into a multiple regression model through the use of an indicator variable or dummy variable.
An indicator variable will use the values of 0 and 1 to indicate the different categories.
Example: gender of students
x1 = 0 if male
x1 = 1 if female
If a qualitative variable has three or more categories, then multiple indicator variables are needed.
Example: Location of houses in a Californian beach resort
x1 = 1 if ocean view and beachfront, 0 otherwise
x2 = 1 if ocean view and not beachfront, 0 otherwise
In general, incorporating a categorical variable with c possible categories into a regression model requires the use of c - 1 indicator variables.
One of the factors that has an effect on the price of a house is location. We might want to incorporate location, as well as numerical predictors, such as size and age, into a multiple regression model for predicting house price. California beach community houses can be classified by location into three categories: ocean view and beachfront, ocean view but not beachfront, and no ocean view.
Let:
x1 = 1 if ocean view and beachfront, 0 otherwise
x2 = 1 if ocean view but not beachfront, 0 otherwise
x3 = house size
x4 = house age
What category for location would be represented by x1 = 0 and x2 = 0?
We could then consider a multiple regression model of the form
$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + e$$
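The indicator coding for the three location categories can be sketched as follows. The function name encode_location and the exact category strings are assumptions for illustration; the coding itself follows the definitions of x1 and x2 above.

```python
# Three location categories need c - 1 = 2 indicator variables.
CATEGORIES = (
    "ocean view and beachfront",
    "ocean view but not beachfront",
    "no ocean view",
)


def encode_location(location):
    """Return the indicator pair (x1, x2) for a house location."""
    if location not in CATEGORIES:
        raise ValueError(f"unknown location: {location!r}")
    x1 = 1 if location == CATEGORIES[0] else 0
    x2 = 1 if location == CATEGORIES[1] else 0
    return x1, x2


for category in CATEGORIES:
    print(category, encode_location(category))
# ocean view and beachfront (1, 0)
# ocean view but not beachfront (0, 1)
# no ocean view (0, 0)
```

The category with no indicator of its own (no ocean view) is the one coded as x1 = 0 and x2 = 0.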
One way colleges measure success is by graduation rates. The Education Trust publishes graduation rates along with other college characteristics.
Let's consider the following variables:
y = 6-year graduation rate
x1 = median SAT score of students accepted to the college
x2 = student-related expense per full time student (in dollars)
x3 = 1 if college has only female students or only male students, 0 if college has both male and female students
Note that two of these predictors are numerical variables and one is categorical.
One possible model that would be considered to describe the relationship between y and these three predictors is
$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + e$$
As in simple regression, we will need to estimate the regression coefficients α, β1, β2, and β3 by calculating a, b1, b2, and b3.
In simple regression, an observation is an (x, y) pair. In multiple regression, an observation would consist of the k independent variables and the dependent variable, so it would have k + 1 terms. In this example, any observation would be (x1, x2, x3, y).
Least-Squares Estimates
According to the principle of least squares, the fit of a particular estimated regression function a + b1x1 + . . . + bkxk to the observed data is measured by the sum of the squared deviations between the observed y values and the y values predicted by the estimated regression function:
$$\sum \left[ y - (a + b_1 x_1 + \cdots + b_k x_k) \right]^2$$
The least-squares estimates for a given data set are obtained by solving a system of k + 1 equations in the k + 1 unknowns a, b1, . . ., bk (called the normal equations). This is difficult to do by hand, but all the commonly used statistical software packages have been programmed to solve for these.
The least-squares estimates of α, β1, . . ., βk are those values of a, b1, . . ., bk that make this sum of squared deviations as small as possible.
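To make the normal equations concrete, here is a minimal pure-Python sketch that builds and solves them by Gaussian elimination. It is an illustration only, not how Minitab or any particular package does it; the function name fit_least_squares is hypothetical, and the data are made up from the deterministic teacher-salary rule so that the expected answer is known.

```python
def fit_least_squares(rows, ys):
    """Return [a, b1, ..., bk] minimizing the sum of squared deviations.

    rows is a list of predictor tuples (x1, ..., xk); ys the observed y values.
    """
    X = [[1.0, *r] for r in rows]          # leading 1 for the intercept a
    p = len(X[0])                          # p = k + 1 unknowns
    # Normal equations: (X'X) b = X'y, stored as an augmented matrix.
    M = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * y for r, y in zip(X, ys))] for i in range(p)]
    # Forward elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, p):
            factor = M[r][col] / M[col][col]
            for j in range(col, p + 1):
                M[r][j] -= factor * M[col][j]
    # Back substitution.
    coef = [0.0] * p
    for i in reversed(range(p)):
        known = sum(M[i][j] * coef[j] for j in range(i + 1, p))
        coef[i] = (M[i][p] - known) / M[i][i]
    return coef


# Made-up observations (years of experience, postgraduate units).
rows = [(0, 0), (5, 10), (10, 0), (3, 40), (20, 75), (12, 30)]
ys = [38_000 + 800 * x1 + 60 * x2 for x1, x2 in rows]
coefficients = fit_least_squares(rows, ys)
print([round(c, 4) for c in coefficients])  # approximately [38000, 800, 60]
```

Because the made-up y values contain no random deviation, the fit recovers the rule exactly (up to floating-point rounding); with real data the estimates would only approximate the population coefficients.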
Graduation Rates Continued . . .
Minitab output from a regression command requesting that the model y = α + β1x1 + β2x2 + β3x3 + e be fit to the small college data (found on pages 815-816 of the textbook) is given below:
The regression equation is y = -0.391 + 0.000760 x1 + 0.000007 x2 + 0.125 x3
Predictor   Coef        SE Coef     T       P
Constant    -0.3906     0.1976      -1.98   0.064
x1          0.0007602   0.0002300   3.30    0.004
x2          0.0000069   0.0000045   1.55    0.139
x3          0.12495     0.05943     2.10    0.050
S = 0.0844346   R-Sq = 86.1%   R-Sq(adj) = 83.8%
Analysis of Variance
Source           DF   SS        MS        F       P
Regression       3    0.79486   0.26495   37.16   0.000
Residual Error   18   0.12833   0.00713
Total            21   0.92318
The values in the Coef column are the estimates for the regression coefficients.
The x1 coefficient (0.0007602) is interpreted as the average change in the 6-year graduation rate for a 1 unit increase in median SAT score for enrolling students while the type of institution and the expenditures remain fixed.
What are the interpretations of the coefficients of the predictor variables x2 and x3?
Graduation Rates Continued . . .
In the Minitab output for the model y = α + β1x1 + β2x2 + β3x3 + e fit to the small college data (found on pages 815-816 of the textbook), three summary values appear below the table of coefficients:
S = 0.0844346: This is s_e, the estimated standard deviation of the random deviation e.
R-Sq = 86.1%: This is the coefficient of multiple determination. It is the proportion of the variation in 6-year graduation rates that can be explained by the multiple regression model.
R-Sq(adj) = 83.8%: This value is the adjusted R².
Is the model useful?
We use s_e, R², and the adjusted R² to determine how useful the multiple regression model is.
• The estimate for the random deviation variance σ² is given by
$$s_e^2 = \frac{SSResid}{n - (k + 1)}$$
Recall that SSResid is the sum of the squared residuals. Residuals are the differences between the observed y values and the predicted y values.
The df = n - (k + 1) because (k + 1) df are lost in estimating the k + 1 coefficients α, β1, . . ., βk.
• The coefficient of multiple determination is
$$R^2 = 1 - \frac{SSResid}{SSTo}$$
Recall that SSTo is the sum of the squared deviations of the observed y values from the mean of y. It is a measure of the total variability in the y values.
Is the model useful? Continued …
• The adjusted R² is computed using
$$\text{adjusted } R^2 = 1 - \left[ \frac{n - 1}{n - (k + 1)} \right] \frac{SSResid}{SSTo}$$
The adjusted R² takes into account the number of predictor variables. This is important because, given that you use a large number of predictors, you can account for most of the variability in y, even if no real relationship exists.
Because the value in the square brackets exceeds 1, the value of adjusted R² is always smaller than R².
On rare occasions, the adjusted R² may be negative.
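The three usefulness measures can be reproduced from the sums of squares in the graduation-rate Minitab output (SSResid = 0.12833, SSTo = 0.92318, k = 3). The sketch below assumes n = 22, inferred from the Total DF of 21; the function name fit_summary is illustrative.

```python
import math


def fit_summary(ss_resid, ss_total, n, k):
    """Return (s_e, R^2, adjusted R^2) for a model with k predictors."""
    df_resid = n - (k + 1)                 # df lost to the k + 1 coefficients
    s_e = math.sqrt(ss_resid / df_resid)   # estimated sd of the random deviation
    r_sq = 1 - ss_resid / ss_total
    adj_r_sq = 1 - ((n - 1) / df_resid) * (ss_resid / ss_total)
    return s_e, r_sq, adj_r_sq


# n = 22 is an assumption inferred from Total DF = 21 in the output.
s_e, r_sq, adj_r_sq = fit_summary(0.12833, 0.92318, n=22, k=3)
print(round(s_e, 4), round(r_sq, 3), round(adj_r_sq, 3))  # 0.0844 0.861 0.838
```

These agree with S = 0.0844346, R-Sq = 86.1%, and R-Sq(adj) = 83.8% in the output.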
Graduation Rates Continued . . .
2 is
The
value
of
s
is
small
and
the
value
of
R
Minitab output from ae regression command requesting
large.
This
variation
that the
model
y = means
 + 1x1 that
+ 2x2most
+ 3x3of
+ ethe
be fit
to the
thisfor
model
useful?
is accounted
by the
modelofand
small college
dataIs
(found
on
pages
815-816
thethe
look
at these
textbook)
is given Let’s
below:
observations
have
little
deviation from the
three
values
again.
predicted
Also,
the
valuesx2of
R2 x3
The regression
equation is y
y =values.
-0.391 + 0.000760
x1
+ 0.000007
+ 0.125
2
and the
close, which
Predictor
Coef adjusted
SE Coef R are T
P
we haven’t-1.98
used too
many
Constant suggests
-0.3906 that0.1976
0.064
x1
0.0007602
0.0002300 in our3.30
predictors
model. 0.004
x2
0.0000069 0.0000045
x3
0.12495
S = 0.0844346
0.05943
1.55
0.139
2.10
0.050
R-Sq = 86.1%
R-Sq(adj) = 83.8%
Analysis of Variance
Source
DF
SS
MS
F
P
3
0.79486
0.26495
37.16
0.000
Residual Error
18
0.12833
0.00713
Total
21
0.92318
Regression
F Distributions
• The model utility test for multiple regression
is based on a probability distribution called
the F distribution.
• Like the t and χ² distributions, the F distributions are based on df. However, each F distribution depends on two df values: df1 for the numerator of the test statistic and df2 for the denominator of the test statistic.
• Each different combination of df1 and df2
produces a different F distribution.
F Distributions Continued . . .
• Here are some graphs of different F curves
[Graph: F curve for df1 = 3 and df2 = 18; F curve for df1 = 18 and df2 = 3]
All F tests in this textbook are upper-tailed.
The P-value is the area under the associated F curve to the right of the calculated F value. Most statistical software packages and graphing calculators will compute this P-value.
F Test for Model Utility
Null Hypothesis: H0: β1 = β2 = … = βk = 0
(There is no useful linear relationship between y and ANY of the predictors.)
Alternative Hypothesis: Ha: At least one of β1, …, βk is not 0
(There is a useful linear relationship between y and at least one of the predictors.)
Test Statistic:
$$F = \frac{SSRegr / k}{SSResid / (n - (k + 1))}$$
where SSRegr = SSTo - SSResid
Assumptions: For any combination of predictor variable values, the distribution of e is normal with mean 0 and constant variance σ².
Graduation Rates Continued . . .
The model y = α + β1x1 + β2x2 + β3x3 + e was fitted to the small college data (found on pages 815-816 of the textbook).
H0: β1 = β2 = β3 = 0
Ha: at least one of the three β's is not 0
Assumptions: A normal probability plot of the standardized residuals is quite straight, indicating that the assumption of normality of the random deviation distribution is reasonable.
Graduation Rates Continued . . .
H0: β1 = β2 = β3 = 0
Ha: at least one of the three β's is not 0
Test Statistic:
$$F = \frac{0.79486 / 3}{0.12833 / 18} = \frac{0.26495}{0.00713} = 37.16$$
df1 = 3, df2 = 18, α = .05, P-value ≈ 0
Since P-value < α, we reject H0. There is evidence to confirm the usefulness of the multiple regression model.
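The F statistic for the graduation-rate model can be recomputed from the sums of squares. This is a small sketch under the same assumption as before (n = 22, inferred from Total DF = 21); the function name model_utility_f is illustrative, and the P-value is not computed here because that requires the F distribution itself.

```python
def model_utility_f(ss_regr, ss_resid, n, k):
    """Return (MSRegr, MSResid, F) for the model utility test."""
    ms_regr = ss_regr / k                    # numerator, df1 = k
    ms_resid = ss_resid / (n - (k + 1))      # denominator, df2 = n - (k + 1)
    return ms_regr, ms_resid, ms_regr / ms_resid


ms_regr, ms_resid, f_stat = model_utility_f(0.79486, 0.12833, n=22, k=3)
print(round(ms_regr, 5), round(ms_resid, 5), round(f_stat, 2))
# 0.26495 0.00713 37.16
```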
Graduation Rates Continued . . .
Notice the sums of squares are given in the Analysis of Variance table of the Minitab output for the small college data (found on pages 815-816 of the textbook):
Analysis of Variance
Source           DF   SS        MS        F       P
Regression       3    0.79486   0.26495   37.16   0.000
Residual Error   18   0.12833   0.00713
Total            21   0.92318
Dividing the SSRegr by its df produces the numerator of the F test statistic.
Similarly, dividing the SSResid by its df produces the denominator of the F test statistic.
Dividing these two MS terms produces the F test statistic.
What factors contribute to the price of energy bars?
Minitab output for data (found on page 825 of the textbook) based on the following variables is shown below.
y = price
x1 = calorie content
x2 = protein content
x3 = fat content
The regression equation is
Price = 0.252 + 0.00125 Calories + 0.0485 Protein + 0.0444 Fat
Predictor   Coef       SE Coef    T      P
Constant    0.2511     0.3524     0.71   0.487
Calories    0.001254   0.001724   0.73   0.478
Protein     0.04849    0.01353    3.58   0.003
Fat         0.04445    0.03648    1.22   0.242
S = 0.2789   R-Sq = 74.7%   R-Sq(adj) = 69.6%
Analysis of Variance
Source           DF   SS       MS       F       P
Regression       3    3.4453   1.1484   14.76   0.000
Residual Error   15   1.1670   0.0778
Total            18   4.6122
According to the F test for model utility, the fitted multiple regression model is useful in predicting the price of the energy bars.
However, looking at the t tests for each predictor, it appears that only the variable on protein content is useful. Let's redo our model to include only the protein predictor variable.
What factors contribute to the price of energy bars?
Minitab output for data (found on page 825 of the textbook) based on the following variables is shown below.
y = price
x2 = protein content
The regression equation is
Price = 0.607 + 0.0623 Protein
Predictor   Coef       SE Coef    T      P
Constant    0.6072     0.1419     4.28   0.001
Protein     0.062256   0.009618   6.47   0.000
S = 0.279843   R-Sq = 71.7%   R-Sq(adj) = 69.4%
Analysis of Variance
Source           DF   SS       MS       F       P
Regression       1    3.2809   3.2809   41.90   0.000
Residual Error   17   1.3313   0.0763
Total            18   4.6122
According to the F test for model utility, the fitted simple regression model is also useful in predicting the price of the energy bars.
Since the model with just one predictor has almost the same adjusted R² (69.4%) as the multiple regression model (69.6%), it accounts for almost as much of the variation in y values, and it is preferable to use the simpler model.