Transcript
```Chapter 9
Regression Analysis
duplicated, or posted to a publicly accessible website, in whole or in part.
Icebreaker: The best tips to study for an
exam
• The class will be broken into pairs or groups of students.
• What is the best way to do well on an exam?
• Taking and reviewing notes
• Flashcards
• Class attendance
• Etc.
Chapter Objectives (1 of 2)
By the end of this chapter, you should be able to:
09.01 Explain the purpose of regression analysis.
09.02 Distinguish between functional and statistical relationships.
09.03 Create and interpret scatter plots.
09.04 Explain the method of least squares for estimating model parameters.
09.05 Create regression models using Excel’s Data Analysis tool.
09.06 Create predictions using regression models.
09.07 Create prediction intervals.
09.08 Explain and use the R² and adjusted-R² statistics.
Chapter Objectives (2 of 2)
09.09 Explain and use forward stepwise regression.
09.10 Explain the meaning and dangers of extrapolation.
09.11 Explain the concept of overfitting.
09.12 Conduct statistical tests for population parameters.
09.13 Describe techniques for and the importance of variable selection in multiple
regression.
09.14 Define multicollinearity and its implications in regression analysis.
09.15 Describe the use of polynomial functions in linear regression models.
09.16 Explain and use the following functions: AVERAGE( ), TREND( ), SQRT( ),
SUM( ), TINV( ), VAR.P( ), VAR.S( ).
Introduction
Introduction to Regression Analysis
• Regression Analysis is used to estimate a function f( ) that describes the
relationship between a continuous dependent variable and one or more
independent variables.
$$Y = f(X_1, X_2, \ldots, X_k) + \varepsilon$$
Note:
• f( ) describes systematic variation in the relationship.
• ε represents the unsystematic variation (or random error) in the relationship.
An Example
• Consider the relationship between advertising (X1) and sales (Y) for a
company.
• There probably is a relationship...
...as advertising increases, sales should increase.
• But how would we measure and quantify this relationship?
See file Fig9-1.xlsm
A Scatter Plot of the Data
The Nature of a Statistical Relationship
A Simple Linear Regression Model
• The scatter plot shows a linear relation between advertising and sales.
• So the following regression model is suggested by the data,
$$Y_i = \beta_0 + \beta_1 X_{1i} + \varepsilon_i$$
• This refers to the true relationship between the entire population of advertising
and sales values.
• The estimated regression function (based on our sample) will be represented
as,
$$\hat{Y}_i = b_0 + b_1 X_{1i}$$
• Ŷi is the estimated (or fitted) value of Y at a given level of X.
Defining “Best Fit”
• Numerical values must be assigned to b0 and b1
• The method of “least squares” selects the values that minimize:
$$ESS = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \left(Y_i - (b_0 + b_1 X_{1i})\right)^2$$
• If ESS=0 our estimated function fits the data perfectly.
• We could solve this problem using Solver...
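• For reference, with a single independent variable the values that minimize ESS also have a standard closed form. This is a supplementary sketch, not taken from the slides; the sample means $\bar{X}_1$ and $\bar{Y}$ are notation introduced here:
$$b_1 = \frac{\sum_{i=1}^{n} (X_{1i} - \bar{X}_1)(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_{1i} - \bar{X}_1)^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}_1$$
• A numerical search such as Solver should arrive at approximately these same values.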
Using Solver…
Fig9-4.xlsm
The Estimated Regression Function
• The estimated regression function is:
$$\hat{Y}_i = 36.342 + 5.550 X_{1i}$$
Using the Regression Tool
• Excel also has a built-in tool for performing regression that:
• is easier to use
See file Fig9-1.xlsm
The TREND() Function
TREND(Y-range, X-range, X-value for prediction)
where:
Y-range is the spreadsheet range containing the dependent Y variable,
X-range is the spreadsheet range containing the independent X variable(s),
X-value for prediction is a cell (or cells) containing the values for the
independent X variable(s) for which we want an estimated value of Y.
Note: The TREND( ) function is dynamically updated whenever any inputs to the
function change. However, it does not provide the statistical information provided
by the regression tool. It is best to use these two approaches to regression
in conjunction with one another.
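A minimal usage sketch, assuming (hypothetically) that the sales values are in C4:C13 and the advertising values are in B4:B13. These ranges are illustrative only and are not taken from Fig9-1.xlsm:
    =TREND(C4:C13, B4:B13, 65)
This returns the estimated value of Y when X1 = 65, which should agree with b0 + b1*65 computed from the regression tool's coefficients for the same data.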
Evaluating the “Fit”
The R² Statistic
• The R² statistic indicates how well an estimated regression function fits the
data.
• 0 ≤ R² ≤ 1
• It measures the proportion of the total variation in Y around its mean that is
accounted for by the estimated regression equation.
• To understand this better, consider the following graph...
Error Decomposition
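Partition of the Total Sum of Squares
$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$$
or,
$$TSS = ESS + RSS$$
(TSS: total sum of squares, ESS: error sum of squares, RSS: regression sum of squares)
$$R^2 = \frac{RSS}{TSS} = 1 - \frac{ESS}{TSS}$$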
Making Predictions (1 of 2)
• Suppose we want to estimate the average level of sales expected if \$65,000
is spent on advertising.
$$\hat{Y}_i = 36.342 + 5.550 X_{1i}$$
• Estimated Sales = 36.342 + 5.550 * 65
= 397.092
• So when \$65,000 is spent on advertising, we expect the average sales level to
be \$397,092.
The Standard Error
• The standard error measures the scatter in the actual data around the estimated
regression line.
$$S_e = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - k - 1}}$$
• where k = the number of independent variables
• For our example, Se = 20.421
• This is helpful in making predictions...
An Approximate Prediction Interval
• An approximate 95% prediction interval for a new value of Y when X1=X1h is
given by
$$\hat{Y}_h \pm 2 S_e$$
where
$$\hat{Y}_h = b_0 + b_1 X_{1h}$$
• Example: If \$65,000 is spent on advertising:
• 95% lower prediction interval = 397.092 − 2*20.421 = 356.250
• 95% upper prediction interval = 397.092 + 2*20.421 = 437.934
• If we spend \$65,000 on advertising we are approximately 95% confident actual
sales will be between \$356,250 and \$437,934.
An Exact Prediction Interval
• A (1−α)% prediction interval for a new value of Y when X1=X1h is given by
$$\hat{Y}_h \pm t_{(1-\alpha/2,\; n-2)} S_p$$
where:
$$\hat{Y}_h = b_0 + b_1 X_{1h}$$
$$S_p = S_e \sqrt{1 + \frac{1}{n} + \frac{(X_{1h} - \bar{X})^2}{\sum_{i=1}^{n} (X_{1i} - \bar{X})^2}}$$
Example
• If \$65,000 is spent on advertising:
95% lower prediction interval = 397.092 − 2.306*21.489 = 347.556
95% upper prediction interval = 397.092 + 2.306*21.489 = 446.666
• If we spend \$65,000 on advertising we are 95% confident actual sales will be
between \$347,556 and \$446,666.
• This interval is only about \$20,000 wider than the approximate one calculated
earlier but was much more difficult to create.
• The greater accuracy is not always worth the trouble.
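• The multiplier 2.306 used in this example is the two-tailed 5% critical value of the t distribution with 8 degrees of freedom. That corresponds to n − 2 = 8, so n = 10 observations; the sample size is inferred here from the multiplier and is not stated on the slide. In Excel the value can be obtained with:
    =TINV(0.05, 8)
TINV( ) takes the two-tailed probability α directly, so α is not divided by 2 before it is passed in.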
Comparison of Prediction Interval
Techniques
Confidence Intervals for the Mean
• A (1−α)% confidence interval for the true mean value of Y when X1=X1h is given
by
$$\hat{Y}_h \pm t_{(1-\alpha/2,\; n-2)} S_a$$
where:
$$\hat{Y}_h = b_0 + b_1 X_{1h}$$
$$S_a = S_e \sqrt{\frac{1}{n} + \frac{(X_{1h} - \bar{X})^2}{\sum_{i=1}^{n} (X_{1i} - \bar{X})^2}}$$
• Predictions made using an estimated regression function may have little or no
validity for values of the independent variables that are substantially different
from those represented in the sample.
Discussion Activity (1 of 2)
You would like to create a regression model to help you invest in the stock
market. There is a tremendous amount of historical data on markets, stocks, and
other related areas. What are some of the variables you think would be most
important? Do you think you could create a model that would give you confidence
in predictions? Why or why not? Is it possible to create a general model that could
be applied across different sectors? What are the potential dangers?
Multiple Regression Analysis
• Most regression problems involve more than one independent variable.
• If each independent variable varies in a linear manner with Y, the estimated
regression function in this case is:
$$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki}$$
• The optimal values for the bi can again be found by minimizing the ESS.
• The resulting function fits a hyperplane to our sample data.
Example Regression Surface for Two
Independent Variables
Multiple Regression Example: Real Estate
Appraisal
• A real estate appraiser wants to develop a model to help predict the fair market
values of residential properties.
• Three independent variables will be used to estimate the selling price of a
house:
• total square footage
• number of bedrooms
• size of the garage
• See data in file Fig9-17.xlsm
Selecting the Model
• We want to identify the simplest model that adequately accounts for the
systematic variation in the Y variable.
• Arbitrarily using all the independent variables may result in overfitting.
• A sample reflects characteristics:
• representative of the population
• specific to the sample
• We want to avoid fitting sample specific characteristics -- or overfitting the data.
Model with One Independent Variable
• With simplicity in mind, suppose we fit three simple linear regression functions:
$$\hat{Y}_i = b_0 + b_1 X_{1i}$$
$$\hat{Y}_i = b_0 + b_2 X_{2i}$$
$$\hat{Y}_i = b_0 + b_3 X_{3i}$$
• Key regression results are:
Variables in the Model   R²      Adjusted-R²   Se       Parameter Estimates
X1                       0.870   0.855         10.299   b0 = 109.503, b1 = 56.394
X2                       0.759   0.731         14.030   b0 = 178.290, b2 = 28.382
X3                       0.793   0.770         12.982   b0 = 116.250, b3 = 27.607
• The model using X1 accounts for 87% of the variation in Y, leaving 13% unaccounted for.
Important Software Note
• When using more than one independent variable, all variables for the X-range
must be in one contiguous block of cells (that is, in adjacent columns).
Model with Two Independent Variables
• Now suppose we fit the following models with two independent variables:
$$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i}$$
$$\hat{Y}_i = b_0 + b_1 X_{1i} + b_3 X_{3i}$$
• Key regression results are:
Variables in the Model   R²      Adjusted-R²   Se       Parameter Estimates
X1                       0.870   0.855         10.299   b0 = 109.503, b1 = 56.394
X1 & X2                  0.939   0.924         7.471    b0 = 127.684, b1 = 38.58, b2 = 12.875
X1 & X3                  0.877   0.847         10.609   b0 = 108.311, b1 = 44.31, b3 = 6.743
• The model using X1 and X2 accounts for 93.9% of the variation in Y, leaving 6.1%
unaccounted for.
• As independent variables are added to a model, the R² statistic can only increase.
• The Adjusted-R² statistic can increase or decrease.
$$R_a^2 = 1 - \left(\frac{ESS}{TSS}\right)\left(\frac{n-1}{n-k-1}\right)$$
• The R² statistic can be artificially inflated by adding any independent variable
to the model.
• Comparing Adjusted-R² values helps show whether an additional
independent variable really helps.
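• A worked check of the Adjusted-R² formula, assuming n = 11 houses in the sample. The sample size is not stated on these slides; 11 is an inferred value that reproduces the tabulated results. Since ESS/TSS = 1 − R², for the model with X1 and X2 (k = 2, R² = 0.939):
$$R_a^2 = 1 - (1 - 0.939)\left(\frac{11 - 1}{11 - 2 - 1}\right) = 1 - 0.061 \times 1.25 \approx 0.924$$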
A Comment on Multicollinearity
• It should not be surprising that adding X3 (# of bedrooms) to the model with X1
(total square footage) did not significantly improve the model.
• Both variables represent the same (or very similar) things -- the size of the
house.
• These variables are highly correlated (or collinear).
• Multicollinearity should be avoided.
Model with Three Independent Variables
• Now suppose we fit the following model with three independent variables:
$$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i}$$
• Key regression results are:
Variables in the Model   R²      Adjusted-R²   Se       Parameter Estimates
X1                       0.870   0.855         10.299   b0 = 109.503, b1 = 56.394
X1 & X2                  0.939   0.924         7.471    b0 = 127.684, b1 = 38.58, b2 = 12.875
X1, X2 & X3              0.943   0.918         7.762    b0 = 126.440, b1 = 30.803, b2 = 12.567, b3 = 4.576
• The model using X1 and X2 appears to be best:
• Lowest Se (most precise prediction intervals)
Making Predictions (2 of 2)
• Let’s estimate the average selling price of a house with 2,100 square feet and a
2-car garage:
$$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i}$$
$$\hat{Y}_i = 127.684 + 38.576 X_{1i} + 12.875 X_{2i}$$
• The estimated average selling price is \$234,444
• A 95% prediction interval for the actual selling price is approximately:
$$\hat{Y}_h \pm 2 S_e$$
• 95% lower prediction interval = 234.444 − 2*7.471 = \$219,502
• 95% upper prediction interval = 234.444 + 2*7.471 = \$249,386
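• The point estimate above can be reproduced as follows, assuming X1 is measured in thousands of square feet (2.1) and X2 is the number of cars the garage holds (2). The slides do not state these units explicitly, and b0 = 127.684 is taken from the earlier table of results:
$$\hat{Y} = 127.684 + 38.576(2.1) + 12.875(2) = 127.684 + 81.010 + 25.750 = 234.444$$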
The Adjusted R² Statistic and the Standard
Error
• Note that the model with the highest Adjusted-R² will also have the lowest
standard error Se (and vice-versa).
$$R_a^2 = 1 - \left(\frac{ESS}{TSS}\right)\left(\frac{n-1}{n-k-1}\right) = 1 - \frac{S_e^2}{S_Y^2}$$
where $S_Y^2$ is the sample variance of Y.
• Thus, the model with the highest Adjusted-R² has the smallest (most precise)
prediction intervals too.
Binary Independent Variables
• Other types of non-quantitative factors could be included in the analysis as
independent variables by using binary variables.
• Example: The presence (or absence) of a swimming pool,
$$X_{pi} = \begin{cases} 1, & \text{if house } i \text{ has a pool} \\ 0, & \text{otherwise} \end{cases}$$
• Example: Whether the roof is in good, average or poor condition,
$$X_{ri} = \begin{cases} 1, & \text{if the roof of house } i \text{ is in good condition} \\ 0, & \text{otherwise} \end{cases}$$
$$X_{r+1,i} = \begin{cases} 1, & \text{if the roof of house } i \text{ is in average condition} \\ 0, & \text{otherwise} \end{cases}$$
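• A minimal spreadsheet sketch of this coding, assuming (hypothetically) that the roof condition for a house is stored as text in cell E2. The cell reference and the text labels are illustrative only:
    =IF(E2="Good", 1, 0)
    =IF(E2="Average", 1, 0)
A house whose roof is in poor condition gets 0 in both columns, so three categories need only two binary variables.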
Polynomial Regression
• Sometimes the relationship between a dependent and independent variable is not
linear.
• This graph suggests a quadratic relationship between square footage (X) and selling
price (Y).
The Regression Model
• An appropriate regression function in this case might be,
$$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{1i}^2$$
or equivalently,
$$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i}$$
where,
$$X_{2i} = X_{1i}^2$$
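• A minimal sketch of building the squared term in Excel, assuming (hypothetically) that square footage is in column B starting at B2. The layout is illustrative and is not taken from Fig9-25.xlsm:
    =B2^2
Filling this formula down an adjacent column, such as column C, keeps X1 and X2 in one contiguous block, so both columns can be passed together as the X-range to the regression tool.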
Implementing the Model
Fig9-25.xlsm
Function
Fitting a Third Order Polynomial Model
• We could also fit a third order polynomial model,
$$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{1i}^2 + b_3 X_{1i}^3$$
or equivalently,
$$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i}$$
where,
$$X_{2i} = X_{1i}^2$$
$$X_{3i} = X_{1i}^3$$
Graph of Estimated Third Order Polynomial
Regression Function
Overfitting
• When fitting polynomial models, care must be taken to avoid overfitting.
• The adjusted-R² statistic can be used for this purpose here also.
Discussion Activity (2 of 2)
• In the last few elections there appear to have been some terrible predictions
on who would win which states. In light of this chapter and the power and use
(along with weaknesses and abuse) of regression analysis, why do you think
these poor predictions have occurred?
Self-Assessment
1. What does R² mean? How is it different from adjusted-R²?
2. When data are overfit in a sample, what does that mean, and what are the
implications?
3. What is multicollinearity, and how would you address that in your regression
model?
Summary (1 of 2)
Now that the lesson has ended, you should have learned how to:
• Explain the purpose of regression analysis.
• Distinguish between functional and statistical relationships.
• Create and interpret scatter plots.
• Explain the method of least squares for estimating model parameters.
• Create regression models using Excel’s Data Analysis tool.
• Create predictions using regression models.
• Create prediction intervals.
• Explain and use the R² and adjusted-R² statistics.
Summary (2 of 2)
• Explain and use forward stepwise regression.
• Explain the meaning and dangers of extrapolation.
• Explain the concept of overfitting.
• Conduct statistical tests for population parameters.
• Describe techniques for and the importance of variable selection in multiple
regression.
• Define multicollinearity and its implications in regression analysis.
• Describe the use of polynomial functions in linear regression models.
• Explain and use the following functions: AVERAGE( ), TREND( ), SQRT( ),
SUM( ), TINV( ), VAR.P( ), VAR.S( ).