Handout 6: Regression Analysis

Purpose of Regression Analysis
• Regression analysis is used primarily to model causality and provide prediction
  – Predicts the value of a dependent (response) variable based on the value of at least one independent (explanatory) variable
  – Explains the effect of the independent variables on the dependent variable
Types of Regression Models
Positive Linear Relationship
Negative Linear Relationship
Relationship NOT Linear
No Relationship
Simple Linear Regression Model
• Relationship between variables is described by a linear function
• The change of one variable causes the change in the other variable
• A dependency of one variable on the other
Population Linear Regression
The population regression line is a straight line that describes the dependence of the average value (conditional mean) of one variable on the other:

Yi = β0 + β1 Xi + εi

where
  β0 = population Y intercept
  β1 = population slope coefficient
  Yi = dependent (response) variable
  Xi = independent (explanatory) variable
  εi = random error
  μY|X = β0 + β1 Xi = population regression line (conditional mean)
Population Linear Regression
(continued)
[Figure: observed Y values plotted against X. Each observed value Yi = β0 + β1 Xi + εi deviates from the conditional mean μY|X = β0 + β1 Xi by the random error εi.]
Sample Linear Regression
The sample regression line provides an estimate of the population regression line as well as a predicted value of Y:

Yi = b0 + b1 Xi + ei

where
  b0 = sample Y intercept
  b1 = sample slope coefficient
  ei = residual

Sample regression line (fitted regression line, predicted value):

Ŷi = b0 + b1 Xi
Sample Linear Regression
(continued)
• b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals:

  Σi=1..n (Yi − Ŷi)² = Σi=1..n ei²

• b0 provides an estimate of β0
• b1 provides an estimate of β1
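To make the computation concrete, here is a minimal Python sketch of the least-squares estimates, using the closed-form formulas b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² and b0 = Ȳ − b1 X̄; the data and function name are illustrative, not part of the handout:

```python
import numpy as np

def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b0 = y_bar - b1 * x_bar
    return b0, b1

# small illustrative sample
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
print(least_squares(x, y))
```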
Sample Linear Regression
(continued)
Yi  b0  b1 X i  ei
Y
ei
Yi      X i   i
b1
i

YX      X i

b0
Observed Value
Y i  b0  b1 X i
X
Interpretation of the
Slope and the Intercept
•   E Y | X  0 is the average value of Y when
the value of X is zero.
E Y | X 
• 1 
measures the change in the
X
average value of Y as a result of a one-unit
change in X.
Interpretation of the
Slope and the Intercept
(continued)
• b  Eˆ Y | X  0  is the estimated average
value of Y when the value of X is zero.
Eˆ Y | X 
• b1 
is the estimated change in
X
the average value of Y as a result of a one-unit
change in X.
Simple Linear Regression: Example
You want to examine the linear dependency of the annual sales of produce stores on their size in square footage. Sample data for seven stores were obtained. Find the equation of the straight line that fits the data best.

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
Scatter Diagram: Example
[Figure: scatter plot of Annual Sales ($000) versus Square Feet for the seven stores.]
Equation for the Sample Regression Line:
Example
Yˆi  b0  b1 X i
 1636.415  1.487 X i
From Excel Printout:
C o e ffi c i e n ts
I n te r c e p t
1 6 3 6 .4 1 4 7 2 6
X V a ria b le 1 1 .4 8 6 6 3 3 6 5 7
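If you prefer to check the Excel coefficients outside Excel, a short least-squares fit in Python (numpy assumed available) reproduces them from the seven-store data above:

```python
import numpy as np

square_feet = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313])
sales = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760])

# degree-1 polyfit performs the same least-squares fit as Excel's regression
b1, b0 = np.polyfit(square_feet, sales, 1)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")  # 1636.415 and 1.487
```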
Excel Output

Regression Statistics
Multiple R          0.970557
R Square            0.941981
Adjusted R Square   0.930378
Standard Error      611.7515
Observations        7

ANOVA
             df   SS         MS         F          Significance F
Regression   1    30380456   30380456   81.17909   0.000281
Residual     5    1871200    374239.9
Total        6    32251656

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept     1636.415      451.4953        3.624433  0.015149  475.8109   2797.019
X Variable 1  1.486634      0.164999        9.009944  0.000281  1.06249    1.910777
Graph of the Sample Regression Line: Example

[Figure: scatter plot of Annual Sales ($000) versus Square Feet with the fitted regression line overlaid.]
Interpretation of Results: Example
Yˆi  1636.415  1.487 X i
The slope of 1.487 means that for each increase of
one unit in X, we predict the average of Y to
increase by an estimated 1.487 units.
The model estimates that for each increase of one
square foot in the size of the store, the expected
annual sales are predicted to increase by $1487.
How Good Is the Regression?
• r²
• Confidence intervals
• Residual plots
• Analysis of variance
• Hypothesis (t) tests
Measure of Variation:
The Sum of Squares
SST = SSR + SSE
(Total sample variability = Explained variability + Unexplained variability)
Measure of Variation:
The Sum of Squares
(continued)
• SST = total sum of squares
  – Measures the variation of the Yi values around their mean Ȳ
• SSR = regression sum of squares
  – Explained variation attributable to the relationship between X and Y
• SSE = error sum of squares
  – Variation attributable to factors other than the relationship between X and Y
Measure of Variation:
The Sum of Squares
(continued)
SST = Σ(Yi − Ȳ)²
SSR = Σ(Ŷi − Ȳ)²
SSE = Σ(Yi − Ŷi)²

[Figure: at a given Xi, the deviation of Yi from Ȳ splits into the explained part (Ŷi − Ȳ) and the unexplained part (Yi − Ŷi).]
The Coefficient of Determination
r² = SSR / SST = Regression Sum of Squares / Total Sum of Squares

• Measures the proportion of variation in Y that is explained by the independent variable X in the regression model
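The sums of squares and r² can be checked directly from their definitions; a sketch in Python continuing with the produce-store data:

```python
import numpy as np

square_feet = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313])
sales = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760])

b1, b0 = np.polyfit(square_feet, sales, 1)
fitted = b0 + b1 * square_feet

sse = np.sum((sales - fitted) ** 2)         # unexplained variation
ssr = np.sum((fitted - sales.mean()) ** 2)  # explained variation
sst = np.sum((sales - sales.mean()) ** 2)   # total variation (= SSR + SSE)

print(f"r^2 = {ssr / sst:.6f}")  # should match Excel's R Square of 0.941981
```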
Coefficients of Determination (r²) and Correlation (r)

[Figure: four scatter plots with fitted lines Ŷi = b0 + b1 Xi, illustrating r² = 1, r = +1; r² = 1, r = −1; r² = .8, r = +0.9; and r² = 0, r = 0.]
Linear Regression Assumptions
1. Linearity
2. Normality
   – Y values are normally distributed for each X
   – Probability distribution of error is normal
3. Homoscedasticity (constant variance)
4. Independence of errors
Residual Analysis
• Purposes
  – Examine linearity
  – Evaluate violations of assumptions
• Graphical analysis of residuals
  – Plot residuals vs. Xi, Yi, and time
Residual Analysis for Linearity

[Figure: paired Y-vs-X and residual-vs-X plots. A curved residual pattern indicates the relationship is not linear; a patternless band around zero indicates a linear fit.]
Residual Analysis for Homoscedasticity

[Figure: paired Y-vs-X and standardized-residual-vs-X plots. A spread that fans out as X increases indicates heteroscedasticity; a uniform band indicates homoscedasticity.]
Variation of Errors around
the Regression Line
• Y values are normally distributed around the regression line.
• For each X value, the “spread” or variance around the regression line is the same.

[Figure: normal error distributions f(e) of equal spread centered on the sample regression line at X1 and X2.]
Residual Analysis: Excel Output for Produce Stores Example

Observation   Predicted Y   Residuals
1             4202.344417   -521.3444173
2             3928.803824   -533.8038245
3             5822.775103    830.2248971
4             9894.664688   -351.6646882
5             3557.14541    -239.1454103
6             4918.90184     644.0981603
7             3588.364717    171.6352829

[Figure: residual plot against Square Feet (0 to 6,000); the residuals show no obvious pattern.]
Residual Analysis for Independence

Graphical approach: the residual is plotted against time to detect any autocorrelation.

[Figure: residuals vs. time. A cyclical pattern indicates the errors are not independent; no particular pattern indicates independence.]
Inference about the Slope:
t Test
• t test for a population slope
  – Is there a linear dependency of Y on X?
• Null and alternative hypotheses
  – H0: β1 = 0 (no linear dependency)
  – H1: β1 ≠ 0 (linear dependency)
• Test statistic

  t = (b1 − β1) / Sb1,  where Sb1 = SYX / √(Σi=1..n (Xi − X̄)²)

  d.f. = n − 2
Example: Produce Store

Data for seven stores:

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760

Estimated regression equation:

Ŷi = 1636.415 + 1.487 Xi

The slope of this model is 1.487. Is square footage of the store affecting its annual sales?
Inferences about the Slope: t Test Example

H0: β1 = 0
H1: β1 ≠ 0
α = .05
df = 7 − 2 = 5
Critical values: t = ±2.5706 (reject H0 in either .025 tail)

Test statistic, from the Excel printout: t = b1 / Sb1 = 9.0099

              Coefficients   Standard Error   t Stat   P-value
Intercept     1636.4147      451.4953         3.6244   0.01515
Footage       1.4866         0.1650           9.0099   0.00028

Decision: Reject H0.
Conclusion: There is evidence that square footage affects annual sales.
The Multiple Regression Model
Relationship between 1 dependent and 2 or more independent variables is a linear function:

Population model:
  Yi = β0 + β1 X1i + β2 X2i + … + βk Xki + εi
  (β0 = population Y-intercept; β1, …, βk = population slopes; εi = random error)

Sample model:
  Yi = b0 + b1 X1i + b2 X2i + … + bk Xki + ei
  (Yi = dependent (response) variable; X1i, …, Xki = independent (explanatory) variables; ei = residual)
Population Multiple Regression Model
Bivariate model:
  Yi = β0 + β1 X1i + β2 X2i + εi  (observed Y)
  μY|X = β0 + β1 X1i + β2 X2i

[Figure: response plane over the (X1, X2) plane; each observed Yi lies a random error εi above or below the plane at (X1i, X2i).]
Sample Multiple
Regression Model
Bivariate model:
  Yi = b0 + b1 X1i + b2 X2i + ei  (observed Y)
  Ŷi = b0 + b1 X1i + b2 X2i  (sample regression plane)

[Figure: sample response plane; each observed Yi lies a residual ei above or below the plane at (X1i, X2i).]
Simple and Multiple
Regression Compared
• Coefficients in a simple regression pick up the impact of that variable plus the impacts of other variables that are correlated with it and the dependent variable.
• Coefficients in a multiple regression net out the impacts of other variables in the equation.
Simple and Multiple
Regression Compared: Example

• Two simple regressions:
  – Oil = β0 + β1 Temp
  – Oil = β0 + β1 Insulation
• Multiple regression:
  – Oil = β0 + β1 Temp + β2 Insulation
Multiple Linear
Regression Equation
Too complicated by hand! Ouch!
Interpretation of Estimated Coefficients
• Slope (bi)
  – The average value of Y is estimated to change by bi for each 1-unit increase in Xi, holding all other variables constant (ceteris paribus)
  – Example: if b1 = −2, then fuel oil usage (Y) is expected to decrease by an estimated 2 gallons for each 1-degree increase in temperature (X1), given the inches of insulation (X2)
• Y-intercept (b0)
  – The estimated average value of Y when all Xi = 0
Multiple Regression Model: Example

Develop a model for estimating heating oil used for a single-family home in the month of January based on average temperature and amount of insulation in inches.

Oil (Gal)   Temp (°F)   Insulation (in)
275.30      40          3
363.80      27          3
164.30      40          10
40.80       73          6
94.30       64          6
230.90      34          6
366.70      9           6
300.60      8           10
237.80      23          10
121.40      63          3
31.40       65          10
203.50      41          6
441.10      21          3
323.00      38          3
52.50       58          10
Sample Multiple Regression Equation: Example

Ŷi = b0 + b1 X1i + b2 X2i + … + bk Xki

Excel output:

              Coefficients
Intercept     562.1510092
X Variable 1  -5.436580588
X Variable 2  -20.01232067

Ŷi = 562.151 − 5.437 X1i − 20.012 X2i

For each degree increase in temperature, the estimated average amount of heating oil used is decreased by 5.437 gallons, holding insulation constant. For each one-inch increase in insulation, the estimated average use of heating oil is decreased by 20.012 gallons, holding temperature constant.
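The Excel coefficients can be reproduced with a multiple-regression sketch in Python (statsmodels assumed available), using the 15 observations listed earlier:

```python
import numpy as np
import statsmodels.api as sm

oil = np.array([275.3, 363.8, 164.3, 40.8, 94.3, 230.9, 366.7,
                300.6, 237.8, 121.4, 31.4, 203.5, 441.1, 323.0, 52.5])
temp = np.array([40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58])
insulation = np.array([3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10])

X = sm.add_constant(np.column_stack([temp, insulation]))  # intercept, X1, X2
model = sm.OLS(oil, X).fit()
print(model.params)  # approximately [562.151, -5.437, -20.012]
```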
Confidence Interval Estimate
for the Slope
Provide the 95% confidence interval for the population slope β1 (the effect of temperature on oil consumption):

b1 ± t(n−p−1) Sb1

              Coefficients   Lower 95%      Upper 95%
Intercept     562.151009     516.1930837    608.108935
X Variable 1  -5.4365806     -6.169132673   -4.7040285
X Variable 2  -20.012321     -25.11620102   -14.90844

−6.169 ≤ β1 ≤ −4.704

The estimated average consumption of oil is reduced by between 4.704 and 6.169 gallons for each 1°F increase in temperature.
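Continuing the statsmodels sketch above, the same intervals can be read off the fitted model rather than the Excel printout:

```python
# 95% confidence intervals, one row per coefficient (reuses `model` from above)
print(model.conf_int(alpha=0.05))
# the temperature row is approximately [-6.169, -4.704]
```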
Coefficient of
Multiple Determination
• Proportion of total variation in Y explained by all X variables taken together:

  r²Y.12…k = SSR / SST = Explained Variation / Total Variation

• Never decreases when a new X variable is added to the model
  – Disadvantage when comparing models
Adjusted Coefficient
of Multiple Determination
• Proportion of variation in Y explained by all X variables, adjusted for the number of X variables used:

  r²adj = 1 − (1 − r²Y.12…k)(n − 1) / (n − k − 1)

• Penalizes excessive use of independent variables
• Smaller than r²Y.12…k
• Useful in comparing among models
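A quick numeric check of this formula with the heating-oil values (n = 15, k = 2, and r² = 0.965610371 from the Excel output below):

```python
n, k, r2 = 15, 2, 0.965610371
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(r2_adj, 9))  # 0.959878766, matching Excel's Adjusted R Square
```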
Coefficient of Multiple Determination: Excel Output

Regression Statistics
Multiple R          0.982654757
R Square            0.965610371
Adjusted R Square   0.959878766
Standard Error      26.01378323
Observations        15

r²Y.12 = SSR / SST

Adjusted r²
  – reflects the number of explanatory variables and sample size
  – is smaller than r²
Interpretation of Coefficient of Multiple
Determination
• r²Y.12 = SSR / SST = .9656
  – 96.56% of the total variation in heating oil can be explained by differences in temperature and amount of insulation
• r²adj = .9599
  – 95.99% of the total fluctuation in heating oil can be explained by differences in temperature and amount of insulation, after adjusting for the number of explanatory variables and sample size
Using The Model to Make Predictions
Predict the amount of heating oil used for a home if the average temperature is 30°F and the insulation is six inches:

Ŷi = 562.151 − 5.437 X1i − 20.012 X2i
   = 562.151 − 5.437(30) − 20.012(6)
   = 278.969

The predicted heating oil used is 278.97 gallons.
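The arithmetic is easy to verify; a two-line check in Python using the rounded coefficients:

```python
b0, b1, b2 = 562.151, -5.437, -20.012   # rounded fitted coefficients
print(b0 + b1 * 30 + b2 * 6)            # temp = 30, insulation = 6 -> 278.969
```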
Testing for Overall Significance
• Shows whether there is a linear relationship between all of the X variables taken together and Y
• Use the F test statistic
• Hypotheses:
  – H0: β1 = β2 = … = βk = 0 (no linear relationship)
  – H1: at least one βi ≠ 0 (at least one independent variable affects Y)
• The null hypothesis is a very strong statement
• Almost always reject the null hypothesis
Test for Significance:
Individual Variables
• Shows whether there is a linear relationship between the variable Xi and Y
• Use the t test statistic
• Hypotheses:
  – H0: βi = 0 (no linear relationship)
  – H1: βi ≠ 0 (linear relationship between Xi and Y)
Residual Plots
• Residuals vs. Ŷ
  – May need to transform the Y variable
• Residuals vs. X1
  – May need to transform the X1 variable
• Residuals vs. X2
  – May need to transform the X2 variable
• Residuals vs. time
  – May have autocorrelation
Residual Plots: Example

[Figure: temperature residual plot (maybe some nonlinear relationship) and insulation residual plot (no discernible pattern).]
The Quadratic Regression Model
• Relationship between one response variable and two or more explanatory variables is a quadratic polynomial function
• Useful when the scatter diagram indicates a non-linear relationship
• Quadratic model:
  Yi = β0 + β1 X1i + β2 X1i² + εi
• The second explanatory variable is the square of the first variable
Quadratic Regression Model
(continued)
Quadratic models may be considered when the scatter diagram takes on one of the following shapes:

[Figure: four curves in the (X1, Y) plane, opening upward where β2 > 0 and downward where β2 < 0; β2 is the coefficient of the quadratic term.]
Dummy Variable Models
• Categorical explanatory variable (dummy variable) with two or more levels:
  – Yes or no, on or off, male or female
  – Coded as 0 or 1
• Only intercepts are different
  – Assumes equal slopes across categories
• The number of dummy variables needed is (number of levels − 1)
• Regression model has the same form:
  Yi = β0 + β1 X1i + β2 X2i + … + βk Xki + εi
Dummy-Variable Models
(with 2 Levels)
Given: Ŷi = b0 + b1 X1i + b2 X2i

Y = assessed value of house
X1 = square footage of house
X2 = desirability of neighborhood (0 if undesirable, 1 if desirable)

Desirable (X2 = 1):   Ŷi = b0 + b1 X1i + b2(1) = (b0 + b2) + b1 X1i
Undesirable (X2 = 0): Ŷi = b0 + b1 X1i + b2(0) = b0 + b1 X1i

Same slopes in both cases.
Dummy-Variable Models
(with 2 Levels)
(continued)
[Figure: assessed value vs. square footage; two parallel lines with the same slope b1 and different intercepts, b0 (undesirable) and b0 + b2 (desirable).]
Interpretation of the Dummy Variable
Coefficient (with 2 Levels)
Example:
Ŷi = b0 + b1 X1i + b2 X2i = 20 + 5 X1i + 6 X2i

Y: annual salary of college graduate in thousand $
X1: GPA
X2: 0 if female, 1 if male

On average, male college graduates are making an estimated six thousand dollars more than female college graduates with the same GPA.
Dummy-Variable Models
(with 3 Levels)
Given:
Y = assessed value of the house ($1000)
X1 = square footage of the house
Style of the house = split-level, ranch, condo (3 levels; need 2 dummy variables)

X2 = 1 if split-level, 0 if not
X3 = 1 if ranch, 0 if not

Ŷi = b0 + b1 X1 + b2 X2 + b3 X3
Interpretation of the Dummy Variable
Coefficients (with 3 Levels)
Given the estimated model:

Ŷi = 20.43 + 0.045 X1i + 18.84 X2i + 23.53 X3i

For split-level (X2 = 1): Ŷi = 20.43 + 0.045 X1i + 18.84
For ranch (X3 = 1):       Ŷi = 20.43 + 0.045 X1i + 23.53
For condo:                Ŷi = 20.43 + 0.045 X1i

With the same footage, a split-level will have an estimated average assessed value of 18.84 thousand dollars more than a condo. With the same footage, a ranch will have an estimated average assessed value of 23.53 thousand dollars more than a condo.
Dummy Variables
• Predict Weekly Sales in a Grocery Store
• Possible independent variables:
– Price
– Grocery Chain
• Data Set:
– Grocery.xls
• Interaction Effect?
Interaction
Regression Model
• Hypothesizes interaction between pairs of X variables
  – Response to one X variable varies at different levels of another X variable
• Contains two-way cross-product terms:
  Yi = β0 + β1 X1i + β2 X2i + β3 X1i X2i + εi
• Can be combined with other models
  – e.g., the dummy-variable model
Effect of Interaction
• Given: Yi = β0 + β1 X1i + β2 X2i + β3 X1i X2i + εi
• Without the interaction term, the effect of X1 on Y is measured by β1
• With the interaction term, the effect of X1 on Y is measured by β1 + β3 X2
• The effect changes as X2 increases
Interaction Example
Y = 1 + 2 X1 + 3 X2 + 4 X1 X2

X2 = 1: Y = 1 + 2 X1 + 3(1) + 4 X1(1) = 4 + 6 X1
X2 = 0: Y = 1 + 2 X1 + 3(0) + 4 X1(0) = 1 + 2 X1

[Figure: the two lines plotted over X1 from 0 to 1.5, with Y from 0 to 12.]

The effect (slope) of X1 on Y does depend on the X2 value.
Interaction Regression Model Worksheet
Case, i   Yi   X1i   X2i   X1i X2i
1         1    1     3     3
2         4    8     5     40
3         1    3     2     6
4         3    5     6     30
:         :    :     :     :

Multiply X1 by X2 to get X1X2, then run the regression with Y, X1, X2, and X1X2, as in the sketch below.
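A sketch of that procedure in Python; the first four cases come from the worksheet, and the remaining rows are hypothetical padding so that n exceeds the four parameters being estimated (statsmodels assumed available):

```python
import numpy as np
import statsmodels.api as sm

y  = np.array([1, 4, 1, 3, 2, 5, 2, 4], dtype=float)
x1 = np.array([1, 8, 3, 5, 2, 7, 4, 6], dtype=float)
x2 = np.array([3, 5, 2, 6, 4, 3, 5, 2], dtype=float)

x1x2 = x1 * x2  # multiply X1 by X2 to get the cross-product column
X = sm.add_constant(np.column_stack([x1, x2, x1x2]))
fit = sm.OLS(y, X).fit()
print(fit.params)  # b0, b1, b2, b3 (b3 estimates the interaction effect)
```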
Evaluating Presence
of Interaction
• Hypothesize interaction between pairs of independent variables
• Contains 2-way product terms:
  Yi = β0 + β1 X1i + β2 X2i + β3 X1i X2i + εi
Using Transformations
• Requires data transformation
• Either or both independent and dependent variables may be transformed
• Can be based on theory, logic, or scatter diagrams
Inherently Linear Models
• Non-linear models that can be expressed in linear form
  – Can be estimated by least squares in linear form
• Require data transformation
Transformed Multiplicative Model (Log-Log)

Original:    Yi = β0 X1i^β1 X2i^β2 εi
Transformed: ln(Yi) = ln(β0) + β1 ln(X1i) + β2 ln(X2i) + ln(εi)

[Figure: shapes of Y vs. X1 for different ranges of β1 (e.g., β1 > 1, β1 = 1, 0 < β1 < 1, β1 < 0); similarly for X2.]
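A sketch of fitting the transformed model in Python; the data are hypothetical and must be positive for the logs to exist:

```python
import numpy as np
import statsmodels.api as sm

# hypothetical positive-valued data for a multiplicative model
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 4.2, 9.8, 11.5, 21.0, 24.3])

# fit ln(Y) = ln(b0) + b1 ln(X1) + b2 ln(X2) by ordinary least squares
X = sm.add_constant(np.column_stack([np.log(x1), np.log(x2)]))
fit = sm.OLS(np.log(y), X).fit()
ln_b0, b1, b2 = fit.params
print(np.exp(ln_b0), b1, b2)  # back-transform the intercept estimate
```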
Square Root Transformation
Yi   0  1 X1i   2 X 2i   i
Y
1 > 0
Similarly for X2
1 < 0
X1
Transforms one of above model to one that appears
linear. Often used to overcome heteroscedasticity.
Linear-Logarithmic Transformation
Yi   0  1 ln( X1i )   2 ln( X 2i )   i
Y
1 >
0
Similarly for X2
1 <
0
X1
Transformed from an original multiplicative model
Exponential Transformation
(Log-Linear)
Original model: Yi = e^(β0 + β1 X1i + β2 X2i) εi

Transformed into: ln(Yi) = β0 + β1 X1i + β2 X2i + ln(εi)

[Figure: Y vs. X1 curves for β1 > 0 and β1 < 0.]
Model Building / Model Selection
• Find “the best” set of explanatory variables among all the ones given.
• “Best subset” regression (only linear models)
  – Requires a lot of computation (2^N regressions)
• “Stepwise regression”
• “Common sense” methodology
  – Run regression with all variables
  – Throw out variables that are not statistically significant
  – “Adjust” the model by including some excluded variables, one at a time
• Tradeoff: parsimony vs. fit
Regression Limitations
• R² measures the association between independent and dependent variables
  Association ≠ Causation!
• Be careful about doing predictions that involve extrapolation
• Inclusion / exclusion of independent variables is subject to a type I / type II error
Multi-collinearity
• What?
  – When one independent variable is highly correlated (“collinear”) with one or more other independent variables
  – Examples:
    • square feet and square meters as independent variables to predict house price (1 sq ft is roughly 0.09 sq m)
    • “total rooms” and bedrooms plus bathrooms for a house
• How to detect? (see the sketch after this list)
  – Run a regression with the “not-so-independent” independent variable (in the examples above: square feet and total rooms) as a function of all other remaining independent variables, e.g.:
    X1 = β0 + β2 X2 + … + βk Xk
  – If the R² of this regression is > 0.8, then one suspects multicollinearity to be present
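A sketch of this detection rule in Python, with hypothetical data in which square meters is, by construction, almost an exact multiple of square feet (statsmodels assumed available):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
sqft = rng.uniform(800, 3000, size=50)            # X1: square feet
sqm = sqft * 0.0929 + rng.normal(0, 1, size=50)   # X2: square meters, collinear
age = rng.uniform(1, 40, size=50)                 # X3: unrelated variable

# regress the suspect variable on all the others and inspect R^2
aux = sm.OLS(sqft, sm.add_constant(np.column_stack([sqm, age]))).fit()
print(aux.rsquared)  # close to 1 here, well above the 0.8 rule of thumb
```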
Multi-collinearity
(continued)
• What effect?
  – Coefficient estimates are unreliable
  – The model can still be used for predicting values of Y
  – If possible, delete the “not-so-independent” independent variable
• When to check?
  – When one suspects that two variables measure the same thing, or when the two variables are highly correlated
  – When one suspects that one independent variable is a (linear) function of the other independent variables