Statistics for Business and
Economics
Module 2: Regression and time series analysis
Spring 2010
Lecture 5: Data Analysis with Multiple Regression and Inference
Priyantha Wijayatunga, Department of Statistics, Umeå University
[email protected]
These materials are adapted from copyrighted lecture slides (© 2009 W.H.
Freeman and Company) from the homepage of the book:
The Practice of Business Statistics: Using Data for Decisions, Second Edition,
by Moore, McCabe, Duckworth, and Alwan.
Data Analysis with Multiple Linear Regression and Inference
Reference to the book: Sections 11.1 and 11.2

Data for multiple regression

Estimating multiple regression coefficients

Regression residuals

Regression standard error

Multiple linear regression model

Inference about the regression coefficients

ANOVA table for multiple regression

Squared multiple correlation R2

Inference for a collection of regression coefficients
Data for multiple linear regression

Up to this point we have considered, in detail, the linear regression
model in one explanatory variable x.
ŷ = b0 + b1x

Usually more complex linear models are needed in practical
situations.

There are many problems in which knowledge of more than one
explanatory variable is necessary in order to obtain a better
understanding and better prediction of a particular response.

In general, we have data on n cases and p explanatory variables.
Multiple linear regression:
Two or more independent variables and
one dependent variable
[Diagram: several candidate explanatory variables X1, …, X8 with unknown relationships (?) to the response Y]
Estimating multiple regression coefficients

We can write the most general type of linear model in independent
variables x1, x2, …, xp in the form
ŷ = b0 + b1x1 + b2x2 + … + bpxp
This model is called a ‘first-order model’ with p explanatory
variables.
Some explanatory variables can be functions of some of the others,
e.g. X2 = X1², X5 = X3·X4, …
The method of least squares chooses the b’s that make the sum of
squares of the residuals as small as possible.
E.g.: Y = investment return, X1 = share price, X2 = interest, X3 = inflation.
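As a rough sketch (not from the slides) of how the least-squares estimates are computed, the problem can be solved numerically in Python. The numbers below are made-up illustrative values for the investment-return example above.

import numpy as np

# Hypothetical illustrative data: investment return (y), share price (x1),
# interest (x2), inflation (x3).
x1 = np.array([10.2, 11.5, 9.8, 12.1, 10.9, 11.8])
x2 = np.array([3.1, 2.9, 3.4, 2.7, 3.0, 2.8])
x3 = np.array([1.9, 2.1, 2.4, 1.8, 2.0, 2.2])
y  = np.array([5.4, 6.1, 4.9, 6.8, 5.9, 6.3])

X = np.column_stack([np.ones_like(x1), x1, x2, x3])   # design matrix with intercept column
b, *rest = np.linalg.lstsq(X, y, rcond=None)          # minimizes the sum of squared residuals
print(b)                                              # b0, b1, b2, b3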
Motivation
More ”realistic” models
Possibly better predictions
Separate the effects of different variables
The simple linear regression model
allows for one independent variable, “x”
y = b0 + b1x + e
The multiple linear regression model
allows for more than one independent variable.
Y = b0 + b1x1 + b2x2 + e
Note how the straight line becomes a plane.
[Plot: the fitted regression plane for two explanatory variables x1 and x2]
Required conditions for the error variable
The random error e of the model is normally distributed.
Its mean is equal to zero and its standard deviation is a constant (σ) for all values of x.
The errors are independent.
Note that these are the same requirements as in the case of simple
linear regression.
Regression residuals and regression standard error

The residuals for multiple linear regression are ei = yi – ŷi

Always examine the distribution of the residuals

The variability of the responses about the multiple regression
equation is measured by the regression standard error s, where s is
the square root of
s² = Σ(yi – ŷi)² / (n – p – 1) = Σei² / (n – p – 1)
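A minimal Python sketch of this calculation, assuming the fitted values ŷ have already been computed (for example with the least-squares code earlier):

import numpy as np

def regression_standard_error(y, y_hat, p):
    """s = sqrt( sum((y_i - yhat_i)^2) / (n - p - 1) ), with p explanatory variables."""
    e = np.asarray(y) - np.asarray(y_hat)   # residuals e_i = y_i - yhat_i
    n = len(e)
    return np.sqrt(np.sum(e ** 2) / (n - p - 1))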
Multiple linear regression model
For “p” explanatory variables, we can express the population
mean response (μy) as a linear equation:
μy = β0 + β1x1 + … + βpxp
The statistical model for the n sample data points (i = 1, 2, …, n) is then:
Data = fit + residual
yi = (β0 + β1x1i + … + βpxpi) + εi
where the εi are independent and normally distributed N(0, σ).
Therefore multiple linear regression assumes equal variance σ² of y.
The parameters of the model are β0, β1, …, βp.
We select a random sample of n individuals for which p + 1 variables
were measured (x1, …, xp, y). The least-squares regression method
minimizes the sum of squared deviations ei (= yi – ŷi) to express y as
a linear function of the p explanatory variables:
ŷi = b0 + b1x1i + … + bpxpi
■ As with simple linear regression, the constant b0 is the y intercept.
■ The regression coefficients (b1,…, bp) reflect the unique association of
each independent variable with the y variable. They are analogous to the
slope in simple regression.
ŷ is an unbiased estimate of the mean response μy, and b0, …, bp are
unbiased estimates of the population parameters β0, …, βp.
Now
Estimate β0, β1, β2, …, βk with the estimates b0, b1, b2, …, bk using
the least squares method.
Use SPSS for estimation and get the least squares regression equation
ŷ = b0 + b1x1 + b2x2 + … + bkxk
In practice, before trying to interpret regression coefficients, we
check how well the model fits.
Estimating Coefficients and Assessing the Model
The procedure used to perform regression analysis:
– Obtain the model coefficients and statistics using statistical software.
– Diagnose violations of the required conditions; try to remedy problems when identified.
– Assess the model fit using statistics obtained from the sample.
– If the model assessment indicates a good fit to the data, use it to interpret the coefficients and generate predictions.
Estimating Coefficients and Assessing the Model
Example: Where to locate a new motor inn?
La Quinta Motor Inns is planning an expansion.
Management wishes to predict which sites are likely to be profitable.
Several areas where predictors of profitability can be identified are:
Competition
Market awareness
Demand generators
Demographics
Physical quality
Profitability: Margin
Competition: Rooms = number of hotel/motel rooms within 3 miles of the site
Market awareness: Nearest = distance to the nearest La Quinta inn
Customers: Office space, College enrollment
Community: Income = median household income
Physical: Disttwn = distance to downtown
Data were collected from 100 randomly selected inns that belong
to La Quinta, and the following model was suggested:
MARGIN = β0 + β1ROOMS + β2NEAREST + β3OFFICE_S + β4ENROLLME + β5INCOME + β6DISTANCE
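The slides fit this model in SPSS; as a hedged alternative, the same kind of fit could be obtained in Python with statsmodels. The file name below is hypothetical, and the column names are assumed to follow the SPSS output shown later (where ROOMS appears as NUMBER).

import pandas as pd
import statsmodels.formula.api as smf

# Assumed CSV holding the 100 inns with columns matching the SPSS variable names.
data = pd.read_csv("la_quinta.csv")
fit = smf.ols("MARGIN ~ NUMBER + NEAREST + OFFICE_S + ENROLLME + INCOME + DISTANCE",
              data=data).fit()
print(fit.summary())   # coefficient table, t values, p-values, R-squared, ANOVA F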
Data (excerpt):

Margin   Number   Nearest   Office Space   Enrollment   Income   Distance
55.5     3203     4.2       549            8            37       2.7
33.8     2810     2.8       496            17.5         35       14.4
49.0     2890     2.4       254            20           35       2.6
31.9     3422     3.3       434            15.5         38       12.1
57.4     2687     0.9       678            15.5         42       6.9
49.0     3759     2.9       635            19           33       10.8
Regression Analysis, SPSS Output
MARGIN = 37.149 - 0.008 NUMBER + 1.591 NEAREST + 0.020 OFFICE_S
+ 0.196 ENROLLME + 0.421 INCOME - 0.004 DISTANCE
Coefficients (a)

Model        B        Std. Error   Beta     t        Sig.    Lower Bound   Upper Bound
(Constant)   37.149   7.044                 5.274    .000    23.162        51.136
NUMBER       -.008    .001         -.447    -6.112   .000    -.010         -.005
NEAREST      1.591    .651         .182     2.442    .016    .297          2.884
OFFICE_S     .020     .003         .418     5.690    .000    .013          .026
ENROLLME     .196     .134         .107     1.461    .147    -.070         .463
INCOME       .421     .141         .220     2.994    .004    .142          .701
DISTANCE     -.004    .156         -.002    -.028    .978    -.314         .305

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient;
Lower Bound and Upper Bound give the 95% confidence interval for B.
a. Dependent Variable: MARGIN
Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .719a   .517       .486                5.55895

Change Statistics: R Square Change = .517, F Change = 16.588, df1 = 6, df2 = 93, Sig. F Change = .000

a. Predictors: (Constant), DISTANCE, INCOME, NUMBER, ENROLLME, OFFICE_S, NEAREST
The regression standard error (standard error of estimate) s should be small
in order to have a good regression model: here s = 5.55895.
The magnitude of s is judged by comparing it with the mean of the dependent
variable (mean of MARGIN = 45.74).
The coefficient of determination R2 is interpreted as in simple linear regression.
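As an illustration of using the fitted equation, the predicted margin for a candidate site is obtained by plugging its characteristics into the estimated coefficients. The site values below are hypothetical, chosen only to show the arithmetic.

# Coefficients from the SPSS output above.
b = {"const": 37.149, "NUMBER": -0.008, "NEAREST": 1.591, "OFFICE_S": 0.020,
     "ENROLLME": 0.196, "INCOME": 0.421, "DISTANCE": -0.004}

# Hypothetical candidate site.
site = {"NUMBER": 3000, "NEAREST": 3.0, "OFFICE_S": 500,
        "ENROLLME": 10, "INCOME": 40, "DISTANCE": 5.0}

margin_hat = b["const"] + sum(b[k] * site[k] for k in site)
print(round(margin_hat, 2))   # predicted MARGIN for this site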
Confidence interval for βj
Estimating the regression parameters β0, …, βj, …, βp is a case of one-sample
inference with unknown population variance.
We rely on the t distribution, with n – p – 1 degrees of freedom.
A level C confidence interval for βj is:
bj ± t* SEbj
SEbj is the standard error of bj; we rely on software to obtain SEbj.
t* is the critical value of the t(n – p – 1) distribution with area C between –t*
and +t*.
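A minimal Python sketch of this interval, using scipy for the t critical value and checked against the INCOME row of the SPSS output above (n = 100, p = 6):

from scipy import stats

def beta_ci(bj, se_bj, n, p, level=0.95):
    """Level-C confidence interval b_j +/- t* SE(b_j), with t* from t(n - p - 1)."""
    t_star = stats.t.ppf((1 + level) / 2, df=n - p - 1)
    return bj - t_star * se_bj, bj + t_star * se_bj

print(beta_ci(0.421, 0.141, n=100, p=6))   # INCOME: close to the reported (.142, .701)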
Significance test for βj
To test the hypothesis H0: βj = 0 versus a one- or two-sided alternative,
we calculate the t statistic
t = bj / SEbj,
which has the t(n – p – 1) distribution, and use it to find the p-value of the test.
Note: software typically provides two-sided p-values.
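A corresponding sketch of the test, again using the INCOME coefficient from the SPSS output above as a check:

from scipy import stats

def beta_t_test(bj, se_bj, n, p):
    """t = b_j / SE(b_j); two-sided p-value from t(n - p - 1)."""
    t = bj / se_bj
    return t, 2 * stats.t.sf(abs(t), df=n - p - 1)

print(beta_t_test(0.421, 0.141, n=100, p=6))   # roughly t = 2.99, p = 0.004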
F-Test and Many t-Tests
Performing many t-tests increases the probability of a Type I error
(there are k separate t-tests).
The F-test combines all of them into one single test and does not
increase the probability of a Type I error.
The F-test is also not distorted by multicollinearity among the explanatory
variables, which can make individual t-tests misleading.
ANOVA F-test for multiple regression
For a multiple linear relationship, the ANOVA tests the hypotheses
H0: β1 = β2 = … = βp = 0 versus Ha: H0 not true
by computing the F statistic: F = MSM / MSE.
When H0 is true, F follows the F(p, n − p − 1) distribution.
The p-value is P(F > f).
A significant p-value doesn’t mean that all p explanatory variables
have a significant influence on y—only that at least one does.
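As a small numerical check in Python, using the values reported in the SPSS Model Summary above for the La Quinta model (F = 16.588 with df1 = 6 and df2 = 93):

from scipy import stats

# P(F > f) for the overall F statistic of the La Quinta model.
f, df1, df2 = 16.588, 6, 93
print(stats.f.sf(f, df1, df2))   # essentially zero, matching Sig. F Change = .000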
ANOVA table for multiple regression
Source               Sum of squares SS    df          Mean square MS   F         P-value
Model (Regression)   Σ(ŷi − ȳ)²           p           SSM/DFM          MSM/MSE   Tail area above F
Error (Residual)     Σ(yi − ŷi)²          n − p − 1   SSE/DFE
Total                Σ(yi − ȳ)²           n − 1

SST = SSM + SSE
DFT = DFM + DFE
The regression standard error s for the n sample data points is calculated
from the residuals ei = yi – ŷi:

s² = Σei² / (n – p – 1) = Σ(yi – ŷi)² / (n – p – 1) = SSE / DFE = MSE

s² is an unbiased estimate of the regression variance σ².
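As a quick check of s² = MSE, using the residual sum of squares reported in the second La Quinta regression output later in these slides (SSE = 2825.6 on DFE = 93):

import math

s = math.sqrt(2825.6 / 93)   # s = sqrt(SSE / DFE) = sqrt(MSE)
print(round(s, 2))           # about 5.51, the standard error reported in that output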
Squared multiple correlation R2
Just as with simple linear regression, R2, the squared multiple
correlation is the proportion of the variation in the response variable
y that is explained by the model.
In the particular case of multiple linear regression, the model is all p
explanatory variables taken together.
R² = Σ(ŷi − ȳ)² / Σ(yi − ȳ)² = SSModel / SSTotal
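A minimal Python sketch of this quantity as a function of the observed and fitted responses:

import numpy as np

def r_squared(y, y_hat):
    """R^2 = SS_Model / SS_Total = sum((yhat_i - ybar)^2) / sum((y_i - ybar)^2)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    ybar = y.mean()
    return np.sum((y_hat - ybar) ** 2) / np.sum((y - ybar) ** 2)

Equivalently, from the sums of squares in the second La Quinta output later in these slides, SSModel / SSTotal = 3123.8 / 5949.5 ≈ 0.525, the reported R Square.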
Inference for a collection of regression coefficients

The null hypothesis that a collection of q explanatory variables all
have coefficients equal to zero is tested by an F statistic with q and
n - p - 1 degrees of freedom.

This statistic can be computed from the squared multiple correlations
for the model with all of the explanatory variables included (R1²) and
for the model with the q variables deleted (R2²):
F = ((n − p − 1)(R1² − R2²)) / (q (1 − R1²))
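A sketch of this test in Python; the example call uses the rough R2 values quoted later in these slides for the GPA data (full model R2 ≈ 0.21 with p = 5 predictors versus the SAT-only model R2 ≈ 0.06, testing the q = 3 high school grade variables, n = 224), so the printed values are only approximate.

from scipy import stats

def collection_f_test(r2_full, r2_reduced, n, p, q):
    """F = ((n - p - 1)(R1^2 - R2^2)) / (q (1 - R1^2)), compared with F(q, n - p - 1)."""
    f = ((n - p - 1) * (r2_full - r2_reduced)) / (q * (1 - r2_full))
    return f, stats.f.sf(f, q, n - p - 1)

print(collection_f_test(0.21, 0.06, n=224, p=5, q=3))   # rough values: F large, p-value tiny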
Example: La Quinta...
Regression Statistics
Multiple R          0.7246
R Square            0.5251
Adjusted R Square   0.4944
Standard Error      5.51
Observations        100

ANOVA
             df    SS        MS      F       Significance F
Regression   6     3123.8    520.6   17.14   0.0000
Residual     93    2825.6    30.4
Total        99    5949.5

               Coefficients   Standard Error   t Stat   P-value
Intercept      38.14          6.99             5.45     0.0000
Number         -0.0076        0.0013           -6.07    0.0000
Nearest        1.65           0.63             2.60     0.0108
Office Space   0.020          0.0034           5.80     0.0000
Enrollment     0.21           0.13             1.59     0.1159
Income         0.41           0.14             2.96     0.0039
Distance       -0.23          0.18             -1.26    0.2107
We have data on 224 first-year computer science majors at a large
university in a given year. The data for each student includes:
* Cumulative GPA after 2 semesters at the university (y, response variable)
* SAT math score (SATM, x1, explanatory variable)
* SAT verbal score (SATV, x2, explanatory variable)
* Average high school grade in math (HSM, x3, explanatory variable)
* Average high school grade in science (HSS, x4, explanatory variable)
* Average high school grade in English (HSE, x5, explanatory variable)
Summary statistics for these data were obtained with the software SAS.
The first step in multiple linear regression is to study all pair-wise
relationships between the p + 1 variables. SAS output for all pair-wise
correlation analyses gives the value of r and the two-sided p-value of H0: ρ = 0 for each pair.
Scatterplots for all 15 pair-wise relationships are also necessary to understand the data.
For simplicity, let’s first run a multiple linear regression using only
the three high school grade averages:
[SAS output: overall P-value very significant; R2 fairly small (20%); HSM significant; HSS, HSE not]
The ANOVA F-test for the multiple linear regression using only HSM, HSS, and HSE is
very significant ⇒ at least one of the regression coefficients is significantly
different from zero.
But R2 is fairly small (0.205) ⇒ only about 20% of the variation in cumulative
GPA can be explained by these high school scores.
(Remember, a small p-value does not imply a large effect.)
The tests of hypotheses for each β within the multiple linear regression
reach significance for HSM only.
We found a significant correlation between HSS and GPA when the two were
analyzed by themselves, so why is bHSS not significant in the multiple
regression equation?
Because HSS and HSM are also significantly correlated.
When all three high school averages are used together in the multiple regression
analysis, only HSM contributes significantly to our ability to predict GPA.
We now drop the least significant variable from the previous model: HSS.
[SAS output: overall P-value very significant; R2 small (20%); HSM significant; HSE not]
The conclusions are about the same, but notice that the actual regression
coefficients have changed:
predicted GPA = 0.590 + 0.169 HSM + 0.045 HSE + 0.034 HSS
predicted GPA = 0.624 + 0.183 HSM + 0.061 HSE
Let’s run a multiple linear regression with the two SAT scores only.
[SAS output: overall P-value very significant; R2 very small (6%); SATM significant; SATV not]
The ANOVA test for βSATM and βSATV is very significant ⇒ at least one is not zero.
R2 is really small (0.06) ⇒ only 6% of the variation in GPA is explained by these tests.
When taken together, only SATM is a significant predictor of GPA (P = 0.0007).
We finally run a multiple regression model with all the variables together.
[SAS output: overall P-value very significant; R2 fairly small (21%); HSM significant]
The overall test is significant, but only the average high school math score (HSM)
makes a significant contribution in this model to predict the cumulative GPA.
This conclusion applies to computer science majors at this large university.
[Corresponding regression output from Excel and Minitab]