Download Basic Business Statistics (9th Edition)

Document related concepts

Data assimilation wikipedia , lookup

Instrumental variables estimation wikipedia , lookup

Choice modelling wikipedia , lookup

Time series wikipedia , lookup

Regression analysis wikipedia , lookup

Linear regression wikipedia , lookup

Coefficient of determination wikipedia , lookup

Transcript
Basic Business Statistics
(9th Edition)
Chapter 13
Simple Linear Regression
© 2004 Prentice-Hall, Inc.
Chap 13-1
Chapter Topics







Types of Regression Models
Determining the Simple Linear Regression
Equation
Measures of Variation
Assumptions of Regression and Correlation
Residual Analysis
Measuring Autocorrelation
Inferences about the Slope
© 2004 Prentice-Hall, Inc.
Chap 13-2
Chapter Topics



(continued)
Correlation - Measuring the Strength of the
Association
Estimation of Mean Values and Prediction of
Individual Values
Pitfalls in Regression and Ethical Issues
© 2004 Prentice-Hall, Inc.
Chap 13-3
Purpose of Regression Analysis

Regression Analysis is Used Primarily to Model
Causality and Provide Prediction


Predict the values of a dependent (response)
variable based on values of at least one
independent (explanatory) variable
Explain the effect of the independent variables on
the dependent variable
© 2004 Prentice-Hall, Inc.
Chap 13-4
Types of Regression Models
Positive Linear Relationship
Negative Linear Relationship
© 2004 Prentice-Hall, Inc.
Relationship NOT Linear
No Relationship
Chap 13-5
Simple Linear Regression Model



Relationship between Variables is Described
by a Linear Function
The Change of One Variable Causes the Other
Variable to Change
A Dependency of One Variable on the Other
© 2004 Prentice-Hall, Inc.
Chap 13-6
Simple Linear Regression Model
(continued)
Population regression line is a straight line that
describes the dependence of the average value
(conditional mean) of one variable on the other
Population
Slope
Coefficient
Population
Y Intercept
Dependent
(Response)
Variable
© 2004 Prentice-Hall, Inc.
Random
Error
Yi      X i   i
Population
Regression
Y |X
Line
(Conditional Mean)

Independent
(Explanatory)
Variable
Chap 13-7
Simple Linear Regression Model
(continued)
Y
(Observed Value of Y) = Yi
     X i   i
 i = Random Error

Y | X      X i

(Conditional Mean)
X
Observed Value of Y
© 2004 Prentice-Hall, Inc.
Chap 13-8
Linear Regression Equation
Sample regression line provides an estimate of
the population regression line as well as a
predicted value of Y
Sample
Y Intercept
Yi  b0  b1 X i  ei
Ŷ  b0  b1 X
© 2004 Prentice-Hall, Inc.
Sample
Slope
Coefficient
Residual
Simple Regression Equation
(Fitted Regression Line, Predicted Value)
Chap 13-9
Linear Regression Equation
(continued)

b0 and b1 are obtained by finding the values
of b0 and b that minimize the sum of the
1
squared residuals

n
i 1


Yi  Yˆi

2
n
  ei2
i 1
b0 provides an estimate of  
b1 provides an estimate of 
© 2004 Prentice-Hall, Inc.
Chap 13-10
Linear Regression Equation
(continued)
Yi  b0  b1 X i  ei
Y
ei
Yi      X i   i
b1
i

Y | X      X i

b0
Yˆi  b0  b1 X i
X
Observed Value
© 2004 Prentice-Hall, Inc.
Chap 13-11
Interpretation of the Slope
and Intercept

   Y |X 0 is the average value of Y when
the value of X is zero

1 
change in Y |X
change in X
measures the change in
the average value of Y as a result of a oneunit change in X
© 2004 Prentice-Hall, Inc.
Chap 13-12
Interpretation of the Slope
and Intercept
(continued)

b  Yˆ  X  0 is the estimated average
value of Y when the value of X is zero

change in Yˆ
b1 
is the estimated change
change in X
in the average value of Y as a result of a oneunit change in X
© 2004 Prentice-Hall, Inc.
Chap 13-13
Simple Linear Regression:
Example
You wish to examine
the linear dependency
of the annual sales of
produce stores on their
sizes in square footage.
Sample data for 7
stores were obtained.
Find the equation of
the straight line that
fits the data best.
© 2004 Prentice-Hall, Inc.
Store
Square
Feet
Annual
Sales
($1000)
1
2
3
4
5
6
7
1,726
1,542
2,816
5,555
1,292
2,208
1,313
3,681
3,395
6,653
9,543
3,318
5,563
3,760
Chap 13-14
Scatter Diagram: Example
Annua l Sa le s ($000)
12000
10000
8000
6000
4000
2000
0
0
1000
2000
3000
4000
5000
6000
S q u a re F e e t
Excel Output
© 2004 Prentice-Hall, Inc.
Chap 13-15
Simple Linear Regression
Equation: Example
Yˆi  b0  b1 X i
 1636.415  1.487 X i
From Excel Printout:
C o e ffi c i e n ts
I n te r c e p t
1 6 3 6 .4 1 4 7 2 6
X V a ria b le 1 1 .4 8 6 6 3 3 6 5 7
© 2004 Prentice-Hall, Inc.
Chap 13-16
Graph of the Simple Linear
Regression Equation: Example
Annua l Sa le s ($000)
12000
10000
8000
6000
4000
2000
0
0
1000
2000
3000
4000
5000
6000
S q u a re F e e t
© 2004 Prentice-Hall, Inc.
Chap 13-17
Interpretation of Results:
Example
Yˆi  1636.415  1.487 X i
The slope of 1.487 means that for each increase of
one unit in X, we predict the average of Y to
increase by an estimated 1.487 units.
The equation estimates that for each increase of 1
square foot in the size of the store, the expected
annual sales are predicted to increase by $1487.
© 2004 Prentice-Hall, Inc.
Chap 13-18
Simple Linear Regression
in PHStat


In Excel, use PHStat | Regression | Simple
Linear Regression …
Excel Spreadsheet of Regression Sales on
Footage
© 2004 Prentice-Hall, Inc.
Chap 13-19
Measures of Variation:
The Sum of Squares
SST
=
Total
=
Sample
Variability
© 2004 Prentice-Hall, Inc.
SSR
Explained
Variability
+
SSE
+
Unexplained
Variability
Chap 13-20
Measures of Variation:
The Sum of Squares
(continued)

SST = Total Sum of Squares


SSR = Regression Sum of Squares


Measures the variation of the Yi values around
their mean, Y
Explained variation attributable to the relationship
between X and Y
SSE = Error Sum of Squares

Variation attributable to factors other than the
relationship between X and Y
© 2004 Prentice-Hall, Inc.
Chap 13-21
Measures of Variation:
The Sum of Squares
(continued)

SSE =(Yi - Yi )2
Y
_
SST = (Yi - Y)2
 _
SSR = (Yi - Y)2
Xi
© 2004 Prentice-Hall, Inc.
_
Y
X
Chap 13-22
Venn Diagrams and Explanatory
Power of Regression
Variations in
store Sizes not
used in
explaining
variation in
Sales
Sizes
© 2004 Prentice-Hall, Inc.
Sales
Variations in Sales
explained by the
error term or
unexplained by
Sizes  SSE 
Variations in Sales
explained by Sizes
or variations in Sizes
used in explaining
variation in Sales
 SSR 
Chap 13-23
The ANOVA Table in Excel
ANOVA
df
Regression
k
SS
MS
SSR
MSR
=SSR/k
Error
n-k-1 SSE
Total
n-1
© 2004 Prentice-Hall, Inc.
F
Significance
F
MSR/MSE
P-value of
the F Test
MSE
=SSE/(n-k-1)
SST
Chap 13-24
Measures of Variation
The Sum of Squares: Example
Excel Output for Produce Stores
Degrees of freedom
ANOVA
df
SS
MS
F
Regression
1
30380456.12 30380456.1 81.1790902
Error
5
1871199.595 374239.919
Total
6
32251655.71
Regression (explained) df
Error (unexplained) df
Total df
© 2004 Prentice-Hall, Inc.
SSE
SSR
Significance F
0.000281201
SST
Chap 13-25
The Coefficient of Determination


SSR Regression Sum of Squares
r 

SST
Total Sum of Squares
2
Measures the proportion of variation in Y that
is explained by the independent variable X in
the regression model
© 2004 Prentice-Hall, Inc.
Chap 13-26
Venn Diagrams and
Explanatory Power of Regression
r 
2
Sales
Sizes
© 2004 Prentice-Hall, Inc.
SSR

SSR  SSE
Chap 13-27
Coefficients of Determination (r 2)
and Correlation (r)
Y r2 = 1, r = +1
Y r2 = 1, r = -1
^=b +b X
Y
i
^=b +b X
Y
i
0
1 i
0
X
Y r2 = .81,r = +0.9
X
© 2004 Prentice-Hall, Inc.
X
Y
^=b +b X
Y
i
0
1 i
1 i
r2 = 0, r = 0
^=b +b X
Y
i
0
1 i
X
Chap 13-28
Standard Error of Estimate

n


SYX
SSE


n2
i 1
Y  Yˆi

2
n2
Measures the standard deviation (variation) of
the Y values around the regression equation
© 2004 Prentice-Hall, Inc.
Chap 13-29
Measures of Variation:
Produce Store Example
Excel Output for Produce Stores
R e g r e ssi o n S ta ti sti c s
M u lt ip le R
R S q u a re
0 .9 4 1 9 8 1 2 9
A d ju s t e d R S q u a re
0 .9 3 0 3 7 7 5 4
S t a n d a rd E rro r
6 1 1 .7 5 1 5 1 7
O b s e r va t i o n s
r2 = .94
0 .9 7 0 5 5 7 2
n
7
Syx
94% of the variation in annual sales can be
explained by the variability in the size of the
store as measured by square footage.
© 2004 Prentice-Hall, Inc.
Chap 13-30
Linear Regression Assumptions

Normality




Y values are normally distributed for each X
Probability distribution of error is normal
Homoscedasticity (Constant Variance)
Independence of Errors
© 2004 Prentice-Hall, Inc.
Chap 13-31
Consequences of Violation
of the Assumptions

Violation of the Assumptions


Non-normality (error not normally distributed)
Heteroscedasticity (variance not constant)


Autocorrelation (errors are not independent)


Usually happens in time-series data
Consequences of Any Violation of the
Assumptions



Usually happens in cross-sectional data
Predictions and estimations obtained from the sample
regression line will not be accurate
Hypothesis testing results will not be reliable
It is Important to Verify the Assumptions
© 2004 Prentice-Hall, Inc.
Chap 13-32
Variation of Errors Around
the Regression Line
f(e)
• Y values are normally distributed
around the regression line.
• For each X value, the “spread” or
variance around the regression line is
the same.
Y
X2
X1
X
© 2004 Prentice-Hall, Inc.
Sample Regression Line
Chap 13-33
Residual Analysis

Purposes



Examine linearity
Evaluate violations of assumptions
Graphical Analysis of Residuals

Plot residuals vs. X and time
© 2004 Prentice-Hall, Inc.
Chap 13-34
Residual Analysis for Linearity
Y
Y
0
e
X
0
X
X 0
Not Linear
© 2004 Prentice-Hall, Inc.
0
e
X

Linear
Chap 13-35
Residual Analysis for
Homoscedasticity
Y
Y
0
SR
X
0
X
SR
0
X
Heteroscedasticity
© 2004 Prentice-Hall, Inc.
0
X
Homoscedasticity
Chap 13-36
Residual Analysis: Excel Output
for Produce Stores Example
Observation
1
2
3
4
5
6
7
Excel Output
Predicted Y
4202.344417
3928.803824
5822.775103
9894.664688
3557.14541
4918.90184
3588.364717
Residuals
-521.3444173
-533.8038245
830.2248971
-351.6646882
-239.1454103
644.0981603
171.6352829
Residual Plot
0
1000
© 2004 Prentice-Hall, Inc.
2000
3000
4000
Square Feet
5000
6000
Chap 13-37
Residual Analysis for
Independence

The Durbin-Watson Statistic


Used when data are collected over time to detect
autocorrelation (residuals in one time period are
related to residuals in another period)
Measures violation of independence assumption
n
D
2
(
e

e
)
 i i1
i 2
n
e
i 1
© 2004 Prentice-Hall, Inc.
2
i
Should be close to 2.
If not, examine the model
for autocorrelation.
Chap 13-38
Durbin-Watson Statistic
in PHStat

PHStat | Regression | Simple Linear
Regression …

Check the box for Durbin-Watson Statistic
© 2004 Prentice-Hall, Inc.
Chap 13-39
Obtaining the Critical Values of
Durbin-Watson Statistic
Table 13.4 Finding Critical Values of Durbin-Watson Statistic
 5
k=1
k=2
n
dL
dU
dL
dU
15
1.08
1.36
.95
1.54
16
1.10
1.37
.98
1.54
© 2004 Prentice-Hall, Inc.
Chap 13-40
Using the Durbin-Watson
Statistic
H0 :
H1
No autocorrelation (error terms are independent)
: There is autocorrelation (error terms are not
independent)
Reject H0
(positive
autocorrelation)
0
© 2004 Prentice-Hall, Inc.
dL
Inconclusive
Do not reject H0
(no autocorrelation)
dU
2
4-dU
Reject H0
(negative
autocorrelation)
4-dL
4
Chap 13-41
Residual Analysis for
Independence
Graphical Approach
Not Independent
e

Independent
e
0
Time 0
Cyclical Pattern
Time
No Particular Pattern
Residual is Plotted Against Time to Detect Any Autocorrelation
© 2004 Prentice-Hall, Inc.
Chap 13-42
Inference about the Slope:
t Test

t Test for a Population Slope


Null and Alternative Hypotheses



Is there a linear relationship between Y and X ?
H0: 1 = 0
H1: 1  0
(no linear relationship)
(linear relationship)
Test Statistic


b1  1
t
where Sb1 
Sb1
d. f .  n  2
© 2004 Prentice-Hall, Inc.
SYX
n
(X
i 1
i
 X)
2
Chap 13-43
Example: Produce Store
Data for 7 Stores:
Store
Square
Feet
Annual
Sales
($000)
1
2
3
4
5
6
7
1,726
1,542
2,816
5,555
1,292
2,208
1,313
3,681
3,395
6,653
9,543
3,318
5,563
3,760
© 2004 Prentice-Hall, Inc.
Estimated Regression
Equation:
Yˆi  1636.415  1.487X i
The slope of this
model is 1.487.
Are square footage
and annual sales
linearly related?
Chap 13-44
Inferences about the Slope:
t Test Example
Test Statistic:
H0: 1 = 0
From Excel Printout b
S
t
b1
H1: 1  0
1
Coefficients Standard Error t Stat P-value
  .05
Intercept
1636.4147
451.4953 3.6244 0.01515
df  7 - 2 = 5
Footage
1.4866
0.1650 9.0099 0.00028
Critical Value(s):
Reject
.025
Reject
.025
-2.5706 0 2.5706
© 2004 Prentice-Hall, Inc.
Decision:
Reject H0.
t
p-value
Conclusion:
There is evidence that
square footage is linearly
related to annual sales.
Chap 13-45
Inferences about the Slope:
Confidence Interval Example
Confidence Interval Estimate of the Slope:
b1  tn  2 Sb1
Excel Printout for Produce Stores
Intercept
Footage
Lower 95% Upper 95%
475.810926 2797.01853
1.06249037 1.91077694
At 95% level of confidence, the confidence interval
for the slope is (1.062, 1.911). Does not include 0.
Conclusion: There is a significant linear relationship
between annual sales and the size of the store.
© 2004 Prentice-Hall, Inc.
Chap 13-46
Inferences about the Slope:
F Test

F Test for a Population Slope


Null and Alternative Hypotheses



Is there a linear relationship between Y and X ?
H0: 1 = 0
H1: 1  0
(no linear relationship)
(linear relationship)
Test Statistic

© 2004 Prentice-Hall, Inc.

SSR
1
F 
SSE
 n  2
Numerator d.f.=1, denominator d.f.=n-2
Chap 13-47
Relationship between a t Test
and an F Test

Null and Alternative Hypotheses





H0: 1 = 0
H1: 1  0
t 
n2
2
(no linear relationship)
(linear relationship)
 F1,n 2
The p –value of a t Test and the p –value of
an F Test are Exactly the Same
The Rejection Region of an F Test is Always
in the Upper Tail
© 2004 Prentice-Hall, Inc.
Chap 13-48
Inferences about the Slope:
F Test Example
H0: 1 = 0
H1: 1  0
  .05
numerator
df = 1
denominator
df  7 - 2 = 5
Test Statistic:
From Excel Printout
ANOVA
df
Regression
Residual
Total
1
5
6
Reject
 = .05
0
© 2004 Prentice-Hall, Inc.
6.61
F1, n  2
SS
MS
F Significance F
30380456.12 30380456.12 81.179
0.000281
1871199.595 374239.919
p-value
32251655.71
Decision: Reject H0.
Conclusion:
There is evidence that
square footage is linearly
related to annual sales.
Chap 13-49
Purpose of Correlation Analysis

Correlation Analysis is Used to Measure
Strength of Association (Linear Relationship)
Between 2 Numerical Variables


Only strength of the relationship is concerned
No causal effect is implied
© 2004 Prentice-Hall, Inc.
Chap 13-50
Purpose of Correlation Analysis
(continued)

Population Correlation Coefficient  (Rho) is
Used to Measure the Strength between the
Variables
© 2004 Prentice-Hall, Inc.
Chap 13-51
Purpose of Correlation Analysis
(continued)

Sample Correlation Coefficient r is an
Estimate of  and is Used to Measure the
Strength of the Linear Relationship in the
Sample Observations
n
r
 X
i 1
n
 X
i 1
© 2004 Prentice-Hall, Inc.
i
i
 X Yi  Y 
X
2
n
 Y  Y 
i 1
2
i
Chap 13-52
Sample Observations from
Various r Values
Y
Y
Y
X
r = -1
X
r = -.6
Y
© 2004 Prentice-Hall, Inc.
X
r=0
Y
r = .6
X
r=1
X
Chap 13-53
Features of  and r





Unit Free
Range between -1 and 1
The Closer to -1, the Stronger the Negative
Linear Relationship
The Closer to 1, the Stronger the Positive
Linear Relationship
The Closer to 0, the Weaker the Linear
Relationship
© 2004 Prentice-Hall, Inc.
Chap 13-54
t Test for Correlation

Hypotheses



H0:  = 0 (no correlation)
H1:   0 (correlation)
Test Statistic
t

r
where
 r
n2
2
n
r  r2 
© 2004 Prentice-Hall, Inc.
 X
i 1
n
 X
i 1
i
i
 X Yi  Y 
X
2
n
 Y  Y 
i 1
2
i
Chap 13-55
Example: Produce Stores
From Excel Printout
Is there any
evidence of linear
relationship between
annual sales of a
store and its square
footage at .05 level
of significance?
© 2004 Prentice-Hall, Inc.
r
R e g re ssi o n S ta ti sti c s
M u lt ip le R
R S q u a re
0 .9 7 0 5 5 7 2
0 .9 4 1 9 8 1 2 9
A d ju s t e d R S q u a re 0 . 9 3 0 3 7 7 5 4
S t a n d a rd E rro r
6 1 1 .7 5 1 5 1 7
O b s e rva t io n s
7
H0:  = 0 (no association)
H1:   0 (association)
  .05
df  7 - 2 = 5
Chap 13-56
Example: Produce Stores
Solution
r
.9706
t

 9.0099
2
1  .9420
 r
5
n2
Critical Value(s):
Reject
.025
Reject
.025
-2.5706 0 2.5706
© 2004 Prentice-Hall, Inc.
Decision:
Reject H0.
Conclusion:
There is evidence of a
linear relationship at 5%
level of significance.
The value of the t statistic is
exactly the same as the t
statistic value for test on the
slope coefficient.
Chap 13-57
Estimation of Mean Values
Confidence Interval Estimate for
Y | X  X
:
i
The Mean of Y Given a Particular Xi
Standard error
of the estimate
Size of interval varies according
to distance away from mean, X
Yˆi  tn 2 SYX
t value from table
with df=n-2
© 2004 Prentice-Hall, Inc.
(Xi  X )
1
 n
n
2
(Xi  X )
2
i 1
Chap 13-58
Prediction of Individual Values
Prediction Interval for Individual Response
Yi at a Particular Xi
Addition of 1 increases width of interval
from that for the mean of Y
Yˆi  tn2 SYX
1 (Xi  X )
1  n
n
2
(Xi  X )
2
i 1
© 2004 Prentice-Hall, Inc.
Chap 13-59
Interval Estimates for Different
Values of X
Y
Confidence
Interval for the
Mean of Y
Prediction Interval
for an Individual Yi
X
© 2004 Prentice-Hall, Inc.
X
a given X
Chap 13-60
Example: Produce Stores
Data for 7 Stores:
Store
Square
Feet
Annual
Sales
($000)
1
2
3
4
5
6
7
1,726
1,542
2,816
5,555
1,292
2,208
1,313
3,681
3,395
6,653
9,543
3,318
5,563
3,760
© 2004 Prentice-Hall, Inc.
Consider a store
with 2000 square
feet.
Regression Model Obtained:

Yi = 1636.415 +1.487Xi
Chap 13-61
Estimation of Mean Values:
Example
Confidence Interval Estimate for
Y | X  X
i
Find the 95% confidence interval for the average annual
sales for stores of 2,000 square feet.

Predicted Sales Yi = 1636.415 +1.487Xi = 4610.45 (in $000)
X = 2350.29
SYX = 611.75
Yˆi  tn 2 SYX
tn-2 = t5 = 2.5706
( X i  X )2
1
 n
 4610.45  612.66
n
2
(Xi  X )
i 1
© 2004 Prentice-Hall, Inc.
3997.02  Y |X  X i  5222.34
Chap 13-62
Prediction Interval for Y :
Example
Prediction Interval for Individual YX  X i
Find the 95% prediction interval for annual sales of one
particular store of 2,000 square feet.

Predicted Sales Yi = 1636.415 +1.487Xi = 4610.45 (in $000)
X = 2350.29
SYX = 611.75
Yˆi  tn 2 SYX
tn-2 = t5 = 2.5706
1 ( X i  X )2
1  n
 4610.45  1687.68
n
2
(
X

X
)
 i
i 1
© 2004 Prentice-Hall, Inc.
2922.00  YX  X i  6297.37
Chap 13-63
Estimation of Mean Values and
Prediction of Individual Values in PHStat

In Excel, use PHStat | Regression | Simple
Linear Regression …


Check the “Confidence and Prediction Interval for
X=” box
Excel Spreadsheet of Regression Sales on
Footage
© 2004 Prentice-Hall, Inc.
Chap 13-64
Pitfalls of Regression Analysis




Lacking an Awareness of the Assumptions
Underlining Least-Squares Regression
Not Knowing How to Evaluate the Assumptions
Not Knowing What the Alternatives to LeastSquares Regression are if a Particular
Assumption is Violated
Using a Regression Model Without Knowledge
of the Subject Matter
© 2004 Prentice-Hall, Inc.
Chap 13-65
Strategy for Avoiding the Pitfalls
of Regression



Start with a scatter plot to observe possible
relationship between X on Y
Perform residual analysis to check the
assumptions
Use a histogram, stem-and-leaf display, boxand-whisker plot, or normal probability plot of
the residuals to uncover possible nonnormality
© 2004 Prentice-Hall, Inc.
Chap 13-66
Strategy for Avoiding the Pitfalls
of Regression
(continued)


If there is violation of any assumption, use
alternative methods (e.g., least absolute
deviation regression or least median of squares
regression) to least-squares regression or
alternative least-squares models (e.g.,
curvilinear or multiple regression)
If there is no evidence of assumption violation,
then test for the significance of the regression
coefficients and construct confidence intervals
and prediction intervals
© 2004 Prentice-Hall, Inc.
Chap 13-67
Chapter Summary






Introduced Types of Regression Models
Discussed Determining the Simple Linear
Regression Equation
Described Measures of Variation
Addressed Assumptions of Regression and
Correlation
Discussed Residual Analysis
Addressed Measuring Autocorrelation
© 2004 Prentice-Hall, Inc.
Chap 13-68
Chapter Summary




(continued)
Described Inference about the Slope
Discussed Correlation - Measuring the
Strength of the Association
Addressed Estimation of Mean Values and
Prediction of Individual Values
Discussed Pitfalls in Regression and Ethical
Issues
© 2004 Prentice-Hall, Inc.
Chap 13-69