Chapter 13:
Multiple Regression
Section 13.1: How Can We Use Several
Variables to Predict a Response?
Learning Objectives
1. Regression Models
2. The Number of Explanatory Variables
3. Plotting Relationships
4. Interpretation of Multiple Regression Coefficients
5. Summarizing the Effect While Controlling for a
Variable
6. Slopes in Multiple Regression and in Bivariate
Regression
7. Importance of Multiple Regression
Learning Objective 1:
Regression Models
 The model that contains only two
variables, x and y, is called a
bivariate model
µy = α + βx
Learning Objective 1:
Regression Models
 Suppose there are two predictors,
denoted by x1 and x2
 This is called a multiple regression
model
µy = α + β1x1 + β2x2
Learning Objective 1:
Multiple Regression Model
 The multiple regression model relates
the mean µy of a quantitative response
variable y to a set of explanatory
variables x1, x2,…
Learning Objective 1:
Multiple Regression Model
 Example: For three explanatory
variables, the multiple regression
equation is:
µy = α + β1x1 + β2x2 + β3x3
Learning Objective 1:
Multiple Regression Model
 Example: The sample prediction
equation with three explanatory
variables is:
ŷ = a + b1x1 + b2x2 + b3x3
Learning Objective 1:
Example: Predicting Selling Price Using House
and Lot Size
 The data set “house selling prices”
contains observations on 100 home
sales in Florida in November 2003
 A multiple regression analysis was done
with selling price as the response
variable and with house size and lot size
as the explanatory variables
Learning Objective 1:
Example: Predicting Selling Price Using House
and Lot Size
 Output from the analysis:
[Regression output table]
Learning Objective 1:
Example: Predicting Selling Price Using House
and Lot Size
 Prediction Equation:
ŷ = -10,536 + 53.8x1 + 2.84x2
where y = selling price, x1 = house size, and x2 = lot size
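As a concrete illustration, here is a minimal Python sketch of how a prediction equation like this can be estimated by least squares. The numbers are hypothetical stand-ins, since the actual 100-home data set is not reproduced in these slides.

```python
import numpy as np

# Hypothetical stand-ins for the Florida home-sales data
house_size = np.array([1240.0, 2000.0, 1620.0, 3050.0, 1180.0])   # sq ft
lot_size = np.array([18000.0, 30000.0, 9500.0, 42000.0, 8000.0])  # sq ft
price = np.array([145000.0, 210000.0, 130000.0, 310000.0, 101000.0])

# Design matrix with an intercept column, so y-hat = a + b1*x1 + b2*x2
X = np.column_stack([np.ones_like(house_size), house_size, lot_size])
a, b1, b2 = np.linalg.lstsq(X, price, rcond=None)[0]
print(f"y-hat = {a:.1f} + {b1:.3f} x1 + {b2:.3f} x2")
```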
Learning Objective 1:
Example: Predicting Selling Price Using House
and Lot Size
 One house listed in the data set had house size
= 1240 square feet, lot size = 18,000 square feet
and selling price = $145,000
 Find its predicted selling price:
ŷ = -10,536 + 53.8(1240) + 2.84(18,000) = 107,276
Learning Objective 1:
Example: Predicting Selling Price Using House
and Lot Size
 Find its residual:
y - ŷ = 145,000 - 107,276 = 37,724
 The residual tells us that the actual selling price
was $37,724 higher than predicted
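Both numbers are simple arithmetic, as this sketch shows. Note that the rounded coefficients give 107,296 rather than 107,276; the slides' values of 107,276 and 37,724 evidently come from less-rounded coefficients in the full output.

```python
# Coefficients as printed in the output (rounded)
a, b1, b2 = -10_536, 53.8, 2.84

y_hat = a + b1 * 1240 + b2 * 18_000   # 107,296 with these rounded values;
                                      # the slides report 107,276
residual = 145_000 - y_hat            # actual selling price minus predicted
print(y_hat, residual)
```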
Learning Objective 2:
The Number of Explanatory Variables
 You should not use many explanatory
variables in a multiple regression model
unless you have lots of data
 A rough guideline is that the sample size
n should be at least 10 times the number
of explanatory variables
Learning Objective 3:
Plotting Relationships
 Always look at the data before doing a
multiple regression
 Most software has the option of
constructing scatterplots on a single
graph for each pair of variables
 This is called a scatterplot matrix
Learning Objective 3:
Plotting Relationships
[Scatterplot matrix for the house selling price data]
Learning Objective 4:
Interpretation of Multiple Regression
Coefficients
 The simplest way to interpret a multiple
regression equation looks at it in two
dimensions as a function of a single
explanatory variable
 We can look at it this way by fixing
values for the other explanatory
variable(s)
Learning Objective 4:
Interpretation of Multiple Regression
Coefficients
Example using the housing data:
 Suppose we fix x1 = house size at 2000 square
feet
 The prediction equation becomes:
ŷ = -10,536 + 53.8(2000) + 2.84x2 = 97,022 + 2.84x2
Learning Objective 4:
Interpretation of Multiple Regression
Coefficients
 Since the slope coefficient of x2 is 2.84,
the predicted selling price increases by
$2.84 for every square foot increase in
lot size when the house size is 2000
square feet
 For a 1000 square-foot increase in lot
size, the predicted selling price
increases by 1000(2.84) = $2840 when
the house size is 2000 square feet
Learning Objective 4:
Interpretation of Multiple Regression
Coefficients
Example using the housing data:
 Suppose we fix x2 = lot size at 30,000 square
feet
 The prediction equation becomes:
ŷ = -10,536 + 53.8x1 + 2.84(30,000) = 74,676 + 53.8x1
Learning Objective 4:
Interpretation of Multiple Regression
Coefficients
 Since the slope coefficient of x1 is 53.8,
for houses with a lot size of 30,000
square feet, the predicted selling price
increases by $53.80 for every square foot
increase in house size
Learning Objective 4:
Interpretation of Multiple Regression
Coefficients
 In summary, an increase of a square foot in house
size has a larger impact on the selling price
($53.80) than an increase of a square foot in lot
size ($2.84)
 We can compare slopes for these explanatory
variables because their units of measurement are
the same (square feet)
 Slopes cannot be compared when the units differ
Learning Objective 5:
Summarizing the Effect While Controlling for a
Variable
 The multiple regression model assumes
that the slope for a particular explanatory
variable is identical for all fixed values of
the other explanatory variables
Learning Objective 5:
Summarizing the Effect While Controlling for a
Variable
 For example, the coefficient of x1 in the prediction
equation:
ŷ = -10,536 + 53.8x1 + 2.84x2
is 53.8, regardless of whether we plug in x2 = 10,000 or x2 = 30,000 or x2 = 50,000
Learning Objective 5:
Summarizing the Effect While Controlling for a
Variable
[Figure: parallel prediction lines relating selling price to house size at several fixed lot sizes]
Learning Objective 6:
Slopes in Multiple Regression and in Bivariate
Regression
 In multiple regression, a slope describes
the effect of an explanatory variable
while controlling effects of the other
explanatory variables in the model
Learning Objective 6:
Slopes in Multiple Regression and in Bivariate
Regression
 Bivariate regression has only a single
explanatory variable
 A slope in bivariate regression describes
the effect of that variable while ignoring
all other possible explanatory variables
Learning Objective 7:
Importance of Multiple Regression
 One of the main uses of multiple
regression is to identify potential lurking
variables and control for them by
including them as explanatory variables
in the model
Chapter 13:
Multiple Regression
Section 13.2 Extending the Correlation and
R-Squared for Multiple Regression
Learning Objectives
1. Multiple Correlation
2. R-squared
3. Properties of R2
Learning Objective 1:
Multiple Correlation
 To summarize how well a multiple regression model predicts y, we analyze how well the observed y values correlate with the predicted ŷ values
 The multiple correlation is the correlation between the observed y values and the predicted ŷ values
 It is denoted by R
Learning Objective 1:
Multiple Correlation
 For each subject, the regression equation
provides a predicted value
 Each subject has an observed y-value and a
predicted y-value
Learning Objective 1:
Multiple Correlation
 The correlation computed between all
pairs of observed y-values and predicted
y-values is the multiple correlation, R
 The larger the multiple correlation, the
better are the predictions of y by the set of
explanatory variables
Learning Objective 1:
Multiple Correlation
 The R-value always falls between 0 and 1
 In this way, the multiple correlation ‘R’
differs from the bivariate correlation ‘r’
between y and a single variable x, which
falls between -1 and +1
Learning Objective 2:
R-squared
 For predicting y, the square of R describes the relative improvement from using the prediction equation instead of using the sample mean, ȳ
Learning Objective 2:
R-squared
 The error in using the prediction equation to
predict y is summarized by the residual sum of
squares:
Σ(y - ŷ)²
Learning Objective 2:
R-squared
 The error in using ȳ to predict y is summarized by the total sum of squares: Σ(y - ȳ)²
Learning Objective 2:
R-squared
 The proportional reduction in error is:
R² = [Σ(y - ȳ)² - Σ(y - ŷ)²] / Σ(y - ȳ)²
Learning Objective 2:
R-squared
 The better the predictions are using the
regression equation, the larger R2 is
 For multiple regression, R2 is the square of
the multiple correlation, R
Learning Objective 2:
Example: How Well Can We Predict House
Selling Prices?
 For the 100 observations on y = selling
price, x1 = house size, and x2 = lot size, a
table, called the ANOVA (analysis of
variance) table was created
 The table displays the sums of squares in
the SS column
Learning Objective 2:
Example: How Well Can We Predict House
Selling Prices?
 The R2 value can be created from the sums of
squares in the table
R² = [Σ(y - ȳ)² - Σ(y - ŷ)²] / Σ(y - ȳ)² = (314,433 - 90,756)/314,433 = 0.711
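A two-line check of this proportional reduction in error, using the sums of squares quoted above:

```python
total_ss = 314_433     # total SS, sum of (y - ybar)^2, from the ANOVA table
residual_ss = 90_756   # residual SS, sum of (y - yhat)^2

r_squared = (total_ss - residual_ss) / total_ss
print(round(r_squared, 3))          # 0.711
print(round(r_squared ** 0.5, 2))   # 0.84, the multiple correlation R
```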
Learning Objective 2:
Example: How Well Can We Predict House
Selling Prices?
 Using house size and lot size together to predict selling price reduces the prediction error by 71%, relative to using ȳ alone to predict selling price
Learning Objective 2:
Example: How Well Can We Predict House
Selling Prices?
 Find and interpret the multiple correlation
R = √R² = √0.711 = 0.84
 There is a strong association between the
observed and the predicted selling prices
 House size and lot size are very helpful in
predicting selling prices
Learning Objective 2:
Example: How Well Can We Predict House
Selling Prices?
 If we used a bivariate regression model to
predict selling price with house size as the
predictor, the r2 value would be 0.58
 If we used a bivariate regression model to
predict selling price with lot size as the
predictor, the r2 value would be 0.51
Learning Objective 2:
Example: How Well Can We Predict House
Selling Prices?
 The multiple regression model has R² = 0.71, so it provides better predictions than either bivariate model
Learning Objective 2:
Example: How Well Can We Predict House
Selling Prices?
 The single predictor in the data set that is most strongly associated with y is the house's real estate tax assessment (r² = 0.679)
 When we add house size as a second predictor, R² goes up from 0.679 to 0.730
 As other predictors are added, R² continues to go up, but not by much
Learning Objective 2:
Example: How Well Can We Predict House
Selling Prices?
 R2 does not increase much after a few
predictors are in the model
 When there are many explanatory variables but
the correlations among them are strong, once
you have included a few of them in the model,
R2 usually doesn’t increase much more when
you add additional ones
Learning Objective 2:
Example: How Well Can We Predict House
Selling Prices?
 This does not mean that the additional
variables are uncorrelated with the response
variable
 It merely means that they don’t add much new
power for predicting y, given the values of the
predictors already in the model
Learning Objective 3:
Properties of R2
 The previous example showed that R2 for
the multiple regression model was larger
than r2 for a bivariate model using only
one of the explanatory variables
 A key property of R² is that it cannot decrease when predictors are added to a model
Learning Objective 3:
Properties of R2
 R2 falls between 0 and 1
 The larger the value, the better the explanatory
variables collectively predict y
 R² = 1 only when all residuals are 0, that is, when all regression predictions are perfect
 R² = 0 when the correlation between y and each explanatory variable equals 0
Learning Objective 3:
Properties of R2
 R2 gets larger, or at worst stays the same,
whenever an explanatory variable is added
to the multiple regression model
 The value of R2 does not depend on the
units of measurement
Chapter 13:
Multiple Regression
Section 13.3: How Can We Use Multiple
Regression to Make Inferences?
Learning Objectives
1. Inferences about the Population
2. Inferences about Individual Regression Parameters
3. Significance Test about a Multiple Regression Parameter
4. Confidence Interval for a Multiple Regression Parameter
5. Estimating Variability Around the Regression Equation
6. Do the Explanatory Variables Collectively Have an Effect?
7. Summary of F Test That All Beta Parameters = 0
Learning Objective 1:
Inferences about the Population
 Assumptions required when using a
multiple regression model to make
inferences about the population:
 The regression equation truly holds for the population means
 This implies that there is a straight-line relationship between the mean of y and each explanatory variable, with the same slope at each value of the other predictors
Learning Objective 1:
Inferences about the Population
 Assumptions required when using a
multiple regression model to make
inferences about the population:
 The data were gathered using randomization
 The response variable y has a normal distribution at each combination of values of the explanatory variables, with the same standard deviation
Learning Objective 2:
Inferences about Individual Regression
Parameters
 Consider a particular parameter, β1
 If β1= 0, the mean of y is identical for all values of
x1, at fixed values of the other explanatory
variables
 So, H0: β1= 0 states that y and x1 are statistically
independent, controlling for the other variables
 This means that once the other explanatory
variables are in the model, it doesn’t help to have
x1 in the model
Learning Objective 3:
Significance Test about a Multiple Regression
Parameter
1. Assumptions:
 Each explanatory variable has a straight-line relation with µy, with the same slope for all combinations of values of other predictors in the model
 Data gathered with randomization
 Normal distribution for y with same standard deviation at each combination of values of other predictors in the model
Learning Objective 3:
Significance Test about a Multiple Regression
Parameter
2. Hypotheses:
 H0: β1 = 0
 Ha: β1 ≠ 0
When H0 is true, y is independent of x1, controlling for the other predictors
Learning Objective 3:
Significance Test about a Multiple Regression
Parameter
3. Test statistic: t = (b1 - 0)/se
Learning Objective 3:
Significance Test about a Multiple Regression
Parameter
4. P-value: Two-tail probability from the t-distribution of values larger than the observed t test statistic (in absolute value)
The t-distribution has df = n - number of parameters in the regression equation
Learning Objective 3:
Significance Test about a Multiple Regression
Parameter
5. Conclusion: Interpret P-value in context;
compare to significance level if decision
needed
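These five steps are easy to carry out directly; here is a small sketch with scipy (the helper function is illustrative, not from the slides). With b = -0.960, se = 0.648, n = 64, and four parameters, the values from the athlete example that follows, it gives t ≈ -1.48 and a P-value of about 0.14.

```python
from scipy import stats

def coef_t_test(b, se, n, n_params):
    """Two-sided t test of H0: beta = 0 for one regression coefficient."""
    t = (b - 0) / se
    df = n - n_params               # df = n - number of parameters
    p = 2 * stats.t.sf(abs(t), df)  # two-tail probability
    return t, df, p

print(coef_t_test(-0.960, 0.648, 64, 4))   # roughly (-1.48, 60, 0.14)
```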
Learning Objective 3:
Example: What Helps Predict a Female Athlete’s
Weight?
 The “College Athletes” data set comes
from a study of 64 University of Georgia
female athletes
 The study measured several physical
characteristics, including total body
weight in pounds (TBW), height in inches
(HGT), the percent of body fat (%BF) and
age
Learning Objective 3:
Example: What Helps Predict a Female Athlete’s
Weight?
 The results of fitting a multiple regression
model for predicting weight using the other
variables:
[Regression output table]
Learning Objective 3:
Example: What Helps Predict a Female Athlete’s
Weight?
 Interpret the effect of age on weight in the
multiple regression equation:
Let ŷ = predicted weight, x1 = height, x2 = % body fat, and x3 = age
Then ŷ = -97.7 + 3.43x1 + 1.36x2 - 0.96x3
Learning Objective 3:
Example: What Helps Predict a Female Athlete’s
Weight?
 The slope coefficient of age is -0.96
 For athletes having fixed values for x1 and x2, the predicted weight decreases by 0.96 pounds for a 1-year increase in age; note that the ages in this sample vary only between 17 and 23
Learning Objective 3:
Example: What Helps Predict a Female Athlete’s
Weight?
 Run a hypothesis test to determine
whether age helps to predict weight, if you
already know height and percent body fat
Learning Objective 3:
Example: What Helps Predict a Female Athlete’s
Weight?
1. Assumptions Met?
 The 64 female athletes were a convenience sample, not a random sample
 Caution should be taken when making inferences about all female college athletes
Learning Objective 3:
Example: What Helps Predict a Female Athlete’s
Weight?
2. Hypotheses:
 H0: β3 = 0
 Ha: β3 ≠ 0
3. Test statistic:
t = (b3 - 0)/se = -0.960/0.648 = -1.48
Learning Objective 3:
Example: What Helps Predict a Female Athlete’s
Weight?
4. P-value: This value is reported in the
output as 0.14
5. Conclusion:
• The P-value of 0.14 does not give much evidence against the null hypothesis that β3 = 0
• Age does not significantly predict weight if we already know height and % body fat
Learning Objective 4:
Confidence Interval for a Multiple Regression
Parameter
 A 95% confidence interval for a β slope
parameter in multiple regression equals:
Estimated slope ± t.025(se)
 The t-score has df = (n - number of parameters in the model)
 Assumptions are the same as for the t test
Learning Objective 4:
Confidence Interval for a Multiple Regression
Parameter
 Construct and interpret a 95% CI for β3, the effect
of age while controlling for height and % body fat
b3 ± t.025(se) = -0.96 ± 2.00(0.648) = -0.96 ± 1.30 = (-2.3, 0.3)
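A sketch of the same calculation with scipy, which also shows where the t-score of 2.00 comes from (df = 64 - 4 = 60):

```python
from scipy import stats

b3, se = -0.960, 0.648
df = 64 - 4                          # n minus number of parameters
t_crit = stats.t.ppf(0.975, df)      # about 2.00 for a 95% interval
ci = (b3 - t_crit * se, b3 + t_crit * se)
print(ci)                            # about (-2.3, 0.3)
```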
Learning Objective 4:
Confidence Interval for a Multiple Regression
Parameter
 At fixed values of x1 and x2, we infer that the population mean of weight changes very little (and maybe not at all) for a 1-year increase in age
 The confidence interval contains 0
 Age may have no effect on weight, once we control for height and % body fat
Learning Objective 5:
Estimating Variability Around the Regression
Equation
 A standard deviation parameter, σ, describes variability
of the observations around the regression equation
 Its sample estimate is:
s = √(Residual SS / df) = √( Σ(y - ŷ)² / (n - number of parameters in the regression equation) )
Learning Objective 5:
Example: Estimating Variability of Female
Athletes’ Weight
 Anova Table for the “college athletes” data set:
[ANOVA table]
Learning Objective 5:
Example: Estimating Variability of Female
Athletes’ Weight
 For female athletes at particular values of height,
% of body fat, and age, estimate the standard
deviation of their weights
 Begin by finding the Mean Square Error:
s² = residual SS / df = 6131.0/60 = 102.2
 Notice that this value (102.2) appears in the MS column in the ANOVA table
Learning Objective 5:
Example: Estimating Variability of Female
Athletes’ Weight
 The standard deviation is s = √102.2 = 10.1
 This value is also displayed in the ANOVA table
 For athletes with certain fixed values of height, % body fat, and age, the weights vary with a standard deviation of about 10 pounds
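A quick check of both quantities:

```python
import math

residual_ss, df = 6131.0, 60   # from the ANOVA table; df = 64 - 4
mse = residual_ss / df         # mean square error, about 102.2
s = math.sqrt(mse)             # residual standard deviation, about 10.1
print(mse, s)
```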
Learning Objective 5:
Example: Estimating Variability of Female
Athletes’ Weight
 If the conditional distributions of weight
are approximately bell-shaped, about 95%
of the weight values fall within about 2s =
20 pounds of the true regression line
Learning Objective 5:
Do the Explanatory Variables Collectively Have
an Effect?
 Example: With 3 predictors in a model, we
can check this by testing:
H0: β1 = β2 = β3 = 0
Ha: At least one β parameter ≠ 0
Learning Objective 5:
Do the Explanatory Variables Collectively Have
an Effect?
 The test statistic for H0 is denoted by F
F = (Mean square for regression) / (Mean square error)
Learning Objective 5:
Do the Explanatory Variables Collectively Have
an Effect?
 When H0 is true, the expected value of the
F test statistic is approximately 1
 When H0 is false, F tends to be larger than
1
 The larger the F test statistic, the stronger
the evidence against H0
Learning Objective 6:
Summary of F Test That All Beta Parameters = 0
1. Assumptions: Multiple regression
equation holds, data gathered randomly,
normal distribution for y with same
standard deviation at each combination
of predictors
Learning Objective 6:
Summary of F Test That All Beta Parameters = 0
2. Hypotheses:
H0: β1 = β2 = ⋯ = 0
Ha: At least one β parameter ≠ 0
3. Test statistic:
F = (Mean square for regression) / (Mean square error)
Learning Objective 6:
Summary of F-Test That All Beta Parameters = 0
4. P-value: Right-tail probability above the observed F-test statistic value from the F distribution with:
 df1 = number of explanatory variables
 df2 = n - (number of parameters in the regression equation)
Learning Objective 6:
Summary of F-Test That All Beta Parameters = 0
5. Conclusion: The smaller the P-value, the stronger the evidence that at least one explanatory variable has an effect on y
 If a decision is needed, reject H0 if P-value ≤ significance level, such as 0.05
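A sketch of steps 3 and 4 with scipy (the helper function is illustrative, not from the slides); df1 and df2 follow the definitions above.

```python
from scipy import stats

def overall_f_pvalue(f_stat, n_predictors, n, n_params):
    """Right-tail P-value for the F test that all beta parameters are 0."""
    df1 = n_predictors
    df2 = n - n_params
    return stats.f.sf(f_stat, df1, df2)

# For the athlete example on the next slides: F = 40.48, 3 predictors, n = 64
print(overall_f_pvalue(40.48, 3, 64, 4))   # essentially 0
```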
Learning Objective 6:
Example: The F-Test for Predictors of Athletes’
Weight
 For the 64 female college athletes, the
regression model for predicting y = weight
using x1 = height, x2 = % body fat and x3 =
age is summarized in the ANOVA table on
the next slide
Learning Objective 6:
Example: The F-Test for Predictors of Athletes’
Weight
[ANOVA table for the weight regression model]
Learning Objective 6:
Example: The F-Test for Predictors of Athletes’
Weight
 Use the output in the ANOVA table to test the
hypothesis:
H0: β1 = β2 = β3 = 0
Ha: At least one β parameter ≠ 0
Learning Objective 6:
Example: The F-Test for Predictors of Athletes’
Weight
 The observed F statistic is 40.48
 The corresponding P-value is 0.000
 We can reject H0 at the 0.05 significance
level
 We conclude that at least one predictor
has an effect on weight
Learning Objective 6:
Example: The F-Test for Predictors of Athletes’
Weight
 The F-test tells us that at least one explanatory
variable has an effect
 If the explanatory variables are chosen sensibly,
at least one should have some predictive power
 The F-test result tells us whether there is
sufficient evidence to make it worthwhile to
consider the individual effects, using t-tests
Learning Objective 6:
Example: The F-Test for Predictors of Athletes’
Weight
 The individual t-tests identify which of the
variables are significant (controlling for the
other variables)
Learning Objective 6:
Example: The F-Test for Predictors of Athletes’
Weight
 If a variable turns out not to be significant,
it can be removed from the model
 In this example, ‘age’ can be removed
from the model
Chapter 13:
Multiple Regression
Section 13.4: Checking a Regression Model
Using Residual Plots
Learning Objectives
1. Assumptions for Inference with a Multiple
Regression Model
2. Checking Shape and Detecting Unusual
Observations
3. Plotting Residuals against Each Explanatory
Variable
Learning Objective 1:
Assumptions for Inference with a Multiple
Regression Model
• The regression equation approximates well the true relationship between the predictors and the mean of y
• The data were gathered randomly
• y has a normal distribution with the same standard deviation at each combination of predictors
Learning Objective 2:
Checking Shape and Detecting Unusual
Observations
 To test Assumption 3 (the conditional distribution of y is normal at any fixed values of the explanatory variables):
 Construct a histogram of the standardized residuals
 The histogram should be approximately bell-shaped
 Nearly all the standardized residuals should fall between -3 and +3; any residual outside these limits is a potential outlier
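A sketch of this check with matplotlib; the residuals are simulated placeholders, since in practice you would take the standardized residuals from your software's regression output.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated placeholders for a model's standardized residuals
std_resid = np.random.default_rng(0).standard_normal(100)

plt.hist(std_resid, bins=15, edgecolor="black")
plt.axvline(-3, color="red")   # residuals outside +/-3
plt.axvline(3, color="red")    # are potential outliers
plt.xlabel("Standardized residual")
plt.show()
```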
Learning Objective 2:
Example: Residuals for House Selling Price
 For the house selling price data, a
MINITAB histogram of the standardized
residuals for the multiple regression
model predicting selling price by the
house size and the lot size was created
and is displayed on the following slide
Learning Objective 2:
Example: Residuals for House Selling Price
[Histogram of the standardized residuals]
Learning Objective 2:
Example: Residuals for House Selling Price
 The residuals are roughly bell-shaped about 0
 They fall between about -3 and +3
 No severe nonnormality is indicated
Learning Objective 3:
Plotting Residuals against Each Explanatory
Variable
 Plots of residuals against each explanatory
variable help us check for potential problems with
the regression model
 Ideally, the residuals should fluctuate randomly
about 0
 There should be no obvious change in trend or change in variation as the values of the explanatory variable increase
Learning Objective 3:
Plotting Residuals against Each Explanatory
Variable
[Residual plots against each explanatory variable]
Chapter 13:
Multiple Regression
Section 13.5: How Can Regression Include
Categorical Predictors?
Learning Objectives
1. Indicator Variables
2. Is There Interaction?
Learning Objective 1:
Indicator Variables
 Regression models can specify categories
of a categorical explanatory variable using
artificial variables, called indicator
variables
 The indicator variable for a particular category is binary
 It equals 1 if the observation falls into that category and it equals 0 otherwise
Learning Objective 1:
Indicator Variables
 In the house selling prices data set, the
city region in which a house is located is a
categorical variable
 The indicator variable x for region is
 x = 1 if house is in NW (northwest region)
 x = 0 if house is not in NW
Learning Objective 1:
Indicator Variables
 The coefficient β of the indicator variable
x is the difference between the mean
selling prices for homes in the NW and
for homes not in the NW:
µy = α + β(1) = α + β, if house is in NW (so x = 1)
µy = α + β(0) = α, if house is not in NW (so x = 0)
µy for NW - µy for other regions = (α + β) - α = β
Learning Objective 1:
Example: Including Region in Regression for
House Selling Price
 Output from the regression model for selling price
of home using house size and region
[MINITAB regression output]
Learning Objective 1:
Example: Including Region in Regression for
House Selling Price
 Find and plot the lines showing how
predicted selling price varies as a
function of house size, for homes in the
NW and for homes not in the NW
Learning Objective 1:
Example: Including Region in Regression for
House Selling Price
 The regression equation from the MINITAB
output is:
ŷ = -15,258 + 78.0x1 + 30,569x2
Learning Objective 1:
Example: Including Region in Regression for
House Selling Price
 For homes not in the NW, x2 = 0
 The prediction equation then simplifies to:
ŷ = -15,258 + 78.0x1 + 30,569(0) = -15,258 + 78.0x1
Learning Objective 1:
Example: Including Region in Regression for
House Selling Price
 For homes in the NW, x2 = 1
 The prediction equation then simplifies to:
ŷ = -15,258 + 78.0x1 + 30,569(1) = 15,311 + 78.0x1
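Both simplified equations come from one function of house size and the region indicator, as this quick sketch of the slides' coefficients shows:

```python
def predicted_price(house_size, in_nw):
    """Prediction equation with an indicator x2 for the NW region."""
    x2 = 1 if in_nw else 0
    return -15_258 + 78.0 * house_size + 30_569 * x2

print(predicted_price(2000, in_nw=False))   # 140,742
print(predicted_price(2000, in_nw=True))    # 171,311: $30,569 higher
```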
Learning Objective 1:
Example: Including Region in Regression for
House Selling Price
[Graph: parallel prediction lines for homes in the NW and homes not in the NW]
Learning Objective 1:
Example: Including Region in Regression for
House Selling Price
 Both lines have the same slope, 78
 For homes in the NW and for homes not in the NW, the predicted selling price increases by $78 for each square-foot increase in house size
 The figure portrays a separate line for each category of region (NW, not NW)
Learning Objective 1:
Example: Including Region in Regression for
House Selling Price
 The coefficient of the indicator variable is 30,569
 For any fixed value of house size, we predict that the selling price is $30,569 higher for homes in the NW
Learning Objective 1:
Example: Including Region in Regression for
House Selling Price
 The line for homes in the NW is above the line for
homes not in the NW
 The predicted selling price is higher for homes in
the NW
 The P-value of 0.000 for the test for the coefficient
of the indicator variable suggests that this
difference is statistically significant
Learning Objective 2:
Is there Interaction?
 For two explanatory variables, interaction
exists between them in their effects on the
response variable when the slope of the
relationship between µy and one of them
changes as the value of the other changes
Learning Objective 2:
Example: Interaction in effects on House Selling
Price
 Suppose the actual population relationship between house size and mean selling price is:
µy = -15,000 + 100x1 for homes in the NW
µy = -12,000 + 25x1 for homes in other regions
 Then the slope for the effect of x1 differs for the two regions - there is interaction between house size and region in their effects on selling price
Learning Objective 2:
Example: Interaction in effects on House Selling
Price
[Graph: prediction lines with different slopes for the two regions]
Learning Objective 2:
Example: Interaction in effects on House Selling
Price
 To allow for interaction with two explanatory
variables, one quantitative and one
categorical, you can fit a separate regression
line with a different slope between the two
quantitative variables for each category of the
categorical variable.
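One common way to get different slopes within a single model is to add a product ("interaction") term. The sketch below reproduces the two hypothetical lines above: the product term's coefficient (75) is the difference between the two slopes (100 - 25), and the indicator's coefficient (-3,000) is the difference between the two intercepts.

```python
def mean_price(house_size, in_nw):
    """Interaction model: mu_y = -12,000 + 25*x1 - 3,000*x2 + 75*(x1*x2),
    where x2 = 1 for homes in the NW."""
    x1, x2 = house_size, (1 if in_nw else 0)
    return -12_000 + 25 * x1 - 3_000 * x2 + 75 * x1 * x2

print(mean_price(2000, in_nw=False))  # -12,000 +  25*2000 =  38,000
print(mean_price(2000, in_nw=True))   # -15,000 + 100*2000 = 185,000
```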
Chapter 13:
Multiple Regression
Section 13.6: How Can We Model A
Categorical Response?
Learning Objectives
1. Modeling a Categorical Response Variable
2. Examples of Logistic Regression
3. The Logistic Regression Model
Learning Objective 1:
Modeling a Categorical Response Variable
 When y is categorical, a different
regression model applies, called logistic
regression
Learning Objective 2:
Examples of Logistic Regression
 A voter’s choice in an election (Democrat or
Republican), with explanatory variables: annual
income, political ideology, religious affiliation,
and race
 Whether a credit card holder pays their bill on
time (yes or no), with explanatory variables:
family income and the number of months in the
past year that the customer paid the bill on time
Learning Objective 3:
The Logistic Regression Model
 Denote the possible outcomes for y as 0 and 1
 Use the generic terms failure (for outcome = 0) and
success (for outcome =1)
 The population mean of the scores equals the population proportion of '1' outcomes (successes)
 That is, µy = p
 The proportion, p, also represents the probability that a randomly selected subject has a success outcome
Learning Objective 3:
The Logistic Regression Model
 The straight-line model is usually
inadequate
 A more realistic model has a curved S-shape instead of a straight-line trend
Learning Objective 3:
The Logistic Regression Model
 A regression equation for an S-shaped
curve for the probability of success p is:
p = e^(α + βx) / (1 + e^(α + βx))
Learning Objective 3:
Example: Annual Income and Having a Travel
Credit Card
 An Italian study with 100 randomly selected
Italian adults considered factors that are
associated with whether a person possesses at
least one travel credit card
 The table on the next page shows results for
the first 15 people on this response variable
and on the person’s annual income (in
thousands of euros)
Learning Objective 3:
Example: Annual Income and Having a Travel
Credit Card
[Table: annual income and travel credit card status for the first 15 people]
Learning Objective 3:
Example: Annual Income and Having a Travel
Credit Card
 Let x = annual income and let y = whether the
person possesses a travel credit card (1 = yes,
0 = no)
Learning Objective 3:
Example: Annual Income and Having a Travel
Credit Card
 Substituting the α and β estimates into the
logistic regression model formula yields:
( 3.52 0.105 x )
e
pˆ 
1 e
( 3.52 0.105 x )
129
Learning Objective 3:
Example: Annual Income and Having a Travel
Credit Card
 Find the estimated probability of
possessing a travel credit card at the
lowest and highest annual income levels
in the sample, which were x = 12 and x =
65
Learning Objective 3:
Example: Annual Income and Having a Travel
Credit Card
 For x = 12 thousand euros, the estimated probability of
possessing a travel credit card is:
3.52 0.105 ( 12 )
2.26
e
e
pˆ 

1 e
1 e
0.104

 0.09
1.104
3.521.05 ( 12 )
 2.26
131
Learning Objective 3:
Example: Annual Income and Having a Travel
Credit Card
 For x = 65 thousand euros, the estimated probability of
possessing a travel credit card is:
3.52 0.105 ( 65 )
3.305
e
e
pˆ 

1 e
1 e
27.2485

 0.97
28.2485
3.521.05 ( 65 )
3.305
132
Learning Objective 3:
Example: Annual Income and Having a Travel
Credit Card
 Annual income has a strong positive
effect on having a credit card
 The estimated probability of having a
travel credit card changes from 0.09 to
0.97 as annual income changes over its
range
Learning Objective 3:
Example: Estimating Proportion of Students
Who’ve Used Marijuana
 A three-variable contingency table from a
survey of senior high-school students is
shown on the next slide
 The students were asked whether they
had ever used: alcohol, cigarettes or
marijuana
Learning Objective 3:
Example: Estimating Proportion of Students
Who’ve Used Marijuana
[Three-way contingency table: alcohol, cigarette, and marijuana use]
Learning Objective 3:
Example: Estimating Proportion of Students
Who’ve Used Marijuana
 Let y indicate marijuana use, coded (1 = yes, 0 = no)
 Let x1 be an indicator variable for alcohol use (1 = yes, 0 = no)
 Let x2 be an indicator variable for cigarette use (1 = yes, 0 = no)
Learning Objective 3:
Example: Estimating Proportion of Students
Who’ve Used Marijuana
[Logistic regression output for predicting marijuana use]
Learning Objective 3:
Example: Estimating Proportion of Students
Who’ve Used Marijuana
 The logistic regression prediction equation is:
5.31 2.99 x1  2.85 x2
e
pˆ 
1 e
5.31 2.99 x1  2.85 x2
138
Learning Objective 3:
Example: Estimating Proportion of Students
Who’ve Used Marijuana
 For those who have not used alcohol or
cigarettes, x1= x2 = 0 and:
5.31 2.99 ( 0 )  2.85 ( 0 )
e
pˆ 
1 e
5.31 2.99 ( 0 )  2.85 ( 0 )
 0.005
139
Learning Objective 3:
Example: Estimating Proportion of Students
Who’ve Used Marijuana
 For those who have used alcohol and
cigarettes, x1= x2 = 1 and:
5.31 2.99 ( 1 )  2.85 ( 1 )
e
pˆ 
1 e
5.31 2.99 ( 1 )  2.85 ( 1 )
 0.628
140
Learning Objective 3:
Example: Estimating Proportion of Students
Who’ve Used Marijuana
 The probability that students have tried
marijuana seems to depend greatly on
whether they’ve used alcohol and
cigarettes