Chapter 13
Multiple Regression
Introduction
• In this chapter we extend simple linear
regression where we had one explanatory
variable, and allow for any number of
explanatory variables.
• We expect to build a model that fits the data
better than the simple linear regression model.
Introduction
• We shall use computer printout to
– Assess the model
• How well it fits the data
• Is it useful?
• Are any required conditions violated?
– Employ the model
• Interpreting the coefficients
• Predictions using the prediction equation
• Estimating the expected value of the dependent variable
The Multiple Regression Model
Idea: Examine the linear relationship between one response
variable (y) and two or more explanatory variables (xi)
Population model:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$
where β0 is the Y-intercept, β1, …, βk are the population slopes, and ε is the random error.
Estimated multiple regression model:
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$
where ŷ is the estimated (or predicted) value of y, b0 is the estimated intercept, and b1, …, bk are the estimated slope coefficients.
Simple Linear Regression
$$\hat{y} = b_0 + b_1 x$$
[Figure: scatter plot of y against x with the fitted line (slope = b1, intercept = b0). For a given xi, the random error εi is the vertical gap between the observed value of y and the predicted value of y.]
Multiple Regression, 2 explanatory variables
[Figure: three-dimensional scatter plot of Y against X1 and X2, with the least squares plane (instead of a line) fitted through the points.]
• The scatter of points around the plane is random error.
Multiple Regression Model
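Two variable model
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$$
[Figure: the fitted plane over the x1 and x2 axes, showing a sample observation yi at (x1i, x2i) and its fitted value ŷi on the plane.]
The residual for a sample observation is e = (yi – ŷi).
The best fit equation, ŷ, is found by minimizing the
sum of squared errors, Σe².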
Estimating the Coefficients and
Assessing the Model
• The procedure used to perform regression analysis:
– Obtain the model coefficients and statistics using statistical
software.
– Diagnose violations of required conditions. Try to remedy
problems when identified.
– Assess the model fit using statistics obtained from the
sample.
– If the model assessment indicates good fit to the data, use it
to interpret the coefficients and generate predictions.
Estimating the Coefficients and
Assessing the Model, Example
• Predicting final exam scores in BUS/ST 350
– We would like to predict final exam scores in 350.
– Use information generated during the semester.
– Predictors of the final exam score:
• Exam 1
• Exam 2
• Exam 3
• Homework total
Estimating the Coefficients and
Assessing the Model, Example
• Data were collected from 203 randomly selected students from previous
semesters
• The following model is proposed
$$\text{final exam} = b_0 + b_1\,\text{exam1} + b_2\,\text{exam2} + b_3\,\text{exam3} + b_4\,\text{hwtot}$$
exam 1   exam2   exam3   hwtot   finalexm
80       60      80      159     72
80       70      75      359     76
95       70      90      330     84
90       100     100     359     92
70       60      80      272     64
90       70      70      344     84
90       85      90      351     88
85       35      90      200     76
85       55      70      251     60
40       80      95      293     64
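As a rough illustration of how software obtains the coefficients, the sketch below fits the proposed model by least squares to the ten sample rows shown above. It assumes NumPy is available and uses `numpy.linalg.lstsq`; the helper `sse_for` is a hypothetical name introduced here. Because only ten of the 203 observations appear in the deck, the fitted values should not be expected to match the Excel printout.

```python
import numpy as np

# The ten sample rows shown on the slide: exam1, exam2, exam3, hwtot, finalexm.
rows = np.array([
    [80,  60,  80, 159, 72],
    [80,  70,  75, 359, 76],
    [95,  70,  90, 330, 84],
    [90, 100, 100, 359, 92],
    [70,  60,  80, 272, 64],
    [90,  70,  70, 344, 84],
    [90,  85,  90, 351, 88],
    [85,  35,  90, 200, 76],
    [85,  55,  70, 251, 60],
    [40,  80,  95, 293, 64],
], dtype=float)

# Design matrix: a column of ones for the intercept, then the four predictors.
X = np.column_stack([np.ones(len(rows)), rows[:, :4]])
y = rows[:, 4]

# Least squares picks b to minimize sum((y - X @ b) ** 2).
b, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)


def sse_for(coeffs):
    """Sum of squared errors on the ten rows for any coefficient vector."""
    e = y - X @ np.asarray(coeffs, dtype=float)
    return float(e @ e)


sse = sse_for(b)
print("b0..b4 (ten rows only):", np.round(b, 4))
print("SSE on these ten rows:", round(sse, 2))
```

Any other coefficient vector, including the one from the full-sample printout, gives an SSE on these ten rows at least as large as the fitted one, which is what "least squares" means.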
Regression Analysis, Excel Output

Regression Statistics
Multiple R          0.618439
R Square            0.38246679
Adjusted R Square   0.36999137
Standard Error      11.5122313
Observations        203

ANOVA
            df    SS            MS      F      Significance F
Regression  4     16252.40443   4063    30.66  7.32692E-20
Residual    198   26241.23104   132.5
Total       202   42493.63547

            Coefficients  Standard Error  t Stat  P-value  Lower 95%      Upper 95%
Intercept   0.04978935    8.17368799      0.006   0.995    -16.06886586   16.16844455
exam 1      0.10021107    0.075633398     1.325   0.187    -0.048939306   0.249361453
exam2       0.15413733    0.072271404     2.133   0.034    0.011616858    0.296657794
exam3       0.29600913    0.066724619     4.436   2E-05    0.16442702     0.427591244
hwtot       0.10771069    0.022685084     4.748   4E-06    0.062975308    0.152446072

This is the sample regression equation
(sometimes called the prediction equation):
Final exam score =
0.0498 + 0.1002exam1 + 0.1541exam2 + 0.2960exam3 + 0.1077hwtot
Interpreting the Coefficients
• b0 = 0.0498. This is the intercept, the value of y when all
the variables take the value zero. Since the data ranges
of the independent variables do not cover the value
zero, do not interpret the intercept.
• b1 = 0.1002. In this model, for each additional point on
exam 1, the final exam score increases on average by
0.1002 (assuming the other variables are held
constant).
Interpreting the Coefficients
• b2 = 0.1541. In this model, for each additional point on exam 2,
the final exam score increases on average by 0.1541 (assuming
the other variables are held constant).
• b3 = 0.2960. For each additional point on exam 3, the final
exam score increases on average by 0.2960 (assuming the
other variables are held constant).
• b4 = 0.1077. For each additional point on the homework, the
final exam score increases on average by 0.1077 (assuming
the other variables are held constant).
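To make "held constant" concrete, the sketch below compares two predictions that differ only in the exam 3 score. It uses the rounded coefficients from the printout; the student scores and the 10-point change are illustrative assumptions, and `predict` is a hypothetical helper.

```python
# Rounded coefficients from the Excel printout.
b = {"intercept": 0.0498, "exam1": 0.1002, "exam2": 0.1541,
     "exam3": 0.2960, "hwtot": 0.1077}


def predict(exam1, exam2, exam3, hwtot):
    """Prediction equation: intercept plus each coefficient times its variable."""
    return (b["intercept"] + b["exam1"] * exam1 + b["exam2"] * exam2
            + b["exam3"] * exam3 + b["hwtot"] * hwtot)


base = predict(75, 79, 85, 310)
bumped = predict(75, 79, 95, 310)  # exam 3 is 10 points higher, nothing else changes
change = bumped - base

# The change equals 10 * b3, whatever the other scores are.
print(round(change, 4), round(10 * b["exam3"], 4))
```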
Final Exam Scores, Predictions
• Predict the average final exam score of a student with
the following exam scores and homework score:
– Exam 1 score 75,
– Exam 2 score 79,
– Exam 3 score 85,
– Homework score 310
Final exam score =
0.0498 + 0.1002(75) + 0.1541(79) + 0.2960(85) + 0.1077(310) = 78.2857
– Use the TREND function in Excel
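The arithmetic above can be checked with a short sketch. It evaluates the prediction equation twice: once with the rounded coefficients used on the slide and once with the full-precision coefficients from the printout, to show how much the rounding matters. The function name `predict` is a hypothetical helper.

```python
# Coefficients in the order: intercept, exam1, exam2, exam3, hwtot.
rounded = [0.0498, 0.1002, 0.1541, 0.2960, 0.1077]
full = [0.04978935, 0.10021107, 0.15413733, 0.29600913, 0.10771069]

# The leading 1 multiplies the intercept.
student = [1, 75, 79, 85, 310]


def predict(coeffs, x):
    """Dot product of the coefficients with the predictor values."""
    return sum(c * v for c, v in zip(coeffs, x))


pred_rounded = predict(rounded, student)
pred_full = predict(full, student)

print(f"rounded coefficients: {pred_rounded:.4f}")  # 78.2857, as on the slide
print(f"full precision:       {pred_full:.4f}")
```

The two results differ only in the second decimal place, so the rounded equation is adequate for this prediction.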
Model Assessment
• The model is assessed using three tools:
– The standard error of the residuals
– The coefficient of determination
– The F-test of the analysis of variance
• The standard error of the residuals is also used in
building the other two tools.
Standard Error of Residuals
• The standard deviation of the residuals is estimated
by the Standard Error of the Residuals:
$$s_e = \sqrt{\frac{SSE}{n - k - 1}}$$
• The magnitude of se is judged by comparing it to ȳ.
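A minimal sketch of this calculation for the final exam example, using the residual sum of squares (26241.23104), n = 203, k = 4, and the sample mean of y (78.84) that this deck reports for the example. The ratio of se to ȳ is just one informal way to make the comparison; it is an illustration, not a formal test.

```python
import math

sse = 26241.23104   # residual sum of squares from the ANOVA table
n, k = 203, 4       # observations and number of explanatory variables
y_bar = 78.84       # sample mean of the final exam score reported in the deck

mse = sse / (n - k - 1)   # mean squared error, SSE / 198
s_e = math.sqrt(mse)      # standard error of the residuals
ratio = s_e / y_bar       # size of s_e relative to the mean of y

print(f"MSE = {mse:.4f}")
print(f"s_e = {s_e:.4f}")
print(f"s_e / y_bar = {ratio:.3f}")
```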
Regression Analysis, Excel Output
Regression Statistics
Multiple R          0.618439
R Square            0.38246679
Adjusted R Square   0.36999137
Standard Error      11.5122313    (standard error of the residuals; sqrt(MSE))
Observations        203

ANOVA
            df    SS            MS      F      Significance F
Regression  4     16252.40443   4063    30.66  7.32692E-20
Residual    198   26241.23104   132.5
Total       202   42493.63547

Residual SS = 26241.23104 is the sum of squares of residuals, SSE.
Residual MS = 132.5 is (standard error of the residuals)²: MSE = SSE/198.

            Coefficients  Standard Error  t Stat  P-value  Lower 95%      Upper 95%
Intercept   0.04978935    8.17368799      0.006   0.995    -16.06886586   16.16844455
exam 1      0.10021107    0.075633398     1.325   0.187    -0.048939306   0.249361453
exam2       0.15413733    0.072271404     2.133   0.034    0.011616858    0.296657794
exam3       0.29600913    0.066724619     4.436   2E-05    0.16442702     0.427591244
hwtot       0.10771069    0.022685084     4.748   4E-06    0.062975308    0.152446072
Standard Error of Residuals
• From the printout, se = 11.5122….
• Calculating the mean value of y, we have ȳ = 78.84.
• It seems se is not particularly small.
• Question: Can we conclude the model does not fit the data well?
Coefficient of Determination R²
(like r² in simple linear regression)
• The proportion of the variation in y that is explained by
differences in the explanatory variables x1, x2, …, xk
• R² = 1 – (SSE/SSTotal)
• From the printout, R² = 0.382466…
• 38.25% of the variation in final exam score is explained by
differences in the exam1, exam2, exam3, and hwtot
explanatory variables. 61.75% remains unexplained.
• When adjusted for degrees of freedom,
Adjusted R² = 36.99%
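Both figures can be reproduced from the ANOVA sums of squares, as in the sketch below. The R² line follows the formula on the slide. The adjusted R² line uses the standard degrees-of-freedom adjustment, 1 – (1 – R²)(n – 1)/(n – k – 1), which the slide does not spell out and is included here as an assumption about how the printout value was obtained.

```python
sse = 26241.23104        # residual sum of squares (SSE)
ss_total = 42493.63547   # total sum of squares (SSTotal)
n, k = 203, 4            # observations and number of explanatory variables

# Coefficient of determination, as defined on the slide.
r2 = 1 - sse / ss_total

# Standard adjustment for degrees of freedom (assumed, not shown on the slide).
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"R^2          = {r2:.6f}")
print(f"Adjusted R^2 = {adj_r2:.6f}")
print(f"Unexplained  = {1 - r2:.2%}")
```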
Testing the Validity of the Model
• We pose the question:
Is there at least one explanatory variable linearly related
to the response variable?
• To answer the question we test the hypothesis
H0: β1 = β2 = … = βk = 0
H1: At least one βi is not equal to zero.
• If at least one βi is not equal to zero, the model has
some validity.
Testing the Validity of the Final Exam
Scores Regression Model
• The hypotheses are tested by what is called an
F test, shown in the Excel output below. The test statistic is F = MSR/MSE.
ANOVA
                       df    SS          MS      F      Significance F
Regression (k =)       4     16252.404   4063    30.66  7.32692E-20
Residual (n–k–1 =)     198   26241.231   132.5
Total (n–1 =)          202   42493.635
SSR = 16252.404 and MSR = SSR/k = 4063.
SSE = 26241.231 and MSE = SSE/(n–k–1) = 132.5.
Significance F is the P-value of the test.
Testing the Validity of the Final Exam
Scores Regression Model
[Variation in y] = SSR + SSE.
Large F results from a large SSR. Then, much of the
variation in y is explained by the regression model; the
model is useful, and thus, the null hypothesis H0 should
be rejected. Reject H0 when P-value < 0.05
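The F statistic for the final exam example can be rebuilt from the sums of squares, as sketched below. The P-value is copied from the Excel printout (Significance F) rather than computed, since computing it needs the F distribution, which the Python standard library does not provide.

```python
ssr = 16252.40443   # regression sum of squares (SSR)
sse = 26241.23104   # residual sum of squares (SSE)
n, k = 203, 4       # observations and number of explanatory variables

msr = ssr / k               # mean square for regression
mse = sse / (n - k - 1)     # mean square for error
f_stat = msr / mse          # F = MSR / MSE

significance_f = 7.32692e-20   # P-value copied from the printout, not computed here
reject_h0 = significance_f < 0.05

print(f"SSR + SSE = {ssr + sse:.5f}")   # equals the total variation in y
print(f"MSR = {msr:.1f}, MSE = {mse:.1f}, F = {f_stat:.2f}")
print("Reject H0:", reject_h0)
```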
Testing the Validity of the Final Exam
Scores Regression Model
ANOVA
            df    SS          MS      F      Significance F
Regression  4     16252.404   4063    30.66  7.32692E-20
Residual    198   26241.231   132.5
Total       202   42493.635
The P-value (Significance F) < 0.05.
Reject the null hypothesis.
Conclusion: There is sufficient evidence to reject
the null hypothesis in favor of the alternative hypothesis.
At least one of the βi is not equal to zero. Thus, at least
one explanatory variable is linearly related to y.
This linear regression model is valid.
Testing the Coefficients
• The hypothesis for each βi is
H0: βi = 0
H1: βi ≠ 0
• Test statistic
$$t = \frac{b_i - 0}{s_{b_i}}, \qquad \text{d.f.} = n - k - 1$$
• Excel printout
            Coefficients  Standard Error  t Stat  P-value
Intercept   0.04978935    8.17368799      0.006   0.995145915
exam 1      0.10021107    0.075633398     1.325   0.186712117
exam2       0.15413733    0.072271404     2.133   0.034176157
exam3       0.29600913    0.066724619     4.436   1.51714E-05
hwtot       0.10771069    0.022685084     4.748   3.93288E-06
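The t Stat column can be reproduced by dividing each coefficient by its standard error, as in the sketch below. The P-values are copied from the printout rather than computed, and the 0.05 cutoff is the significance level used elsewhere in this deck; the dictionary names are illustrative.

```python
# name: (coefficient, standard error, P-value), all copied from the Excel printout.
table = {
    "Intercept": (0.04978935, 8.17368799, 0.995145915),
    "exam 1":    (0.10021107, 0.075633398, 0.186712117),
    "exam2":     (0.15413733, 0.072271404, 0.034176157),
    "exam3":     (0.29600913, 0.066724619, 1.51714e-05),
    "hwtot":     (0.10771069, 0.022685084, 3.93288e-06),
}
alpha = 0.05

t_stats = {}
significant = set()
for name, (b, se_b, p_value) in table.items():
    t = (b - 0) / se_b          # test statistic under H0: beta_i = 0
    t_stats[name] = t
    if p_value < alpha:
        significant.add(name)
    print(f"{name:<10} t = {t:6.3f}   p = {p_value:.3g}")

print("Significant at 0.05:", sorted(significant))
```

On these figures exam2, exam3, and hwtot have P-values below 0.05, while exam 1 and the intercept do not.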