Introduction to Linear Regression

Scatter Plots and Correlation

- A scatter plot (or scatter diagram) is used to show the relationship between two variables
- Correlation analysis is used to measure the strength of the association (linear relationship) between two variables
  - Concerned only with the strength of the relationship
  - No causal effect is implied

Scatter Plot Examples

[Figure: scatter plots illustrating linear relationships and curvilinear relationships]

Scatter Plot Examples (continued)

[Figure: scatter plots illustrating strong relationships and weak relationships]

Scatter Plot Examples (continued)

[Figure: scatter plots illustrating no relationship]

Correlation Coefficient

- The population correlation coefficient ρ (rho) measures the strength of the association between the variables
- The sample correlation coefficient r is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations

Features of ρ and r

- Unit free
- Range between -1 and 1
- The closer to -1, the stronger the negative linear relationship
- The closer to 1, the stronger the positive linear relationship
- The closer to 0, the weaker the linear relationship

Examples of Approximate r Values

[Figure: scatter plots illustrating r = -1, r = -.6, r = 0, r = +.3, and r = +1]

Calculating the Correlation Coefficient

Sample correlation coefficient:

r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}}

or the algebraic equivalent:

r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - (\sum x)^2\right]\left[n \sum y^2 - (\sum y)^2\right]}}

where:
r = Sample correlation coefficient
n = Sample size
x = Value of the independent variable
y = Value of the dependent variable

Calculation Example

Tree Height (y) | Trunk Diameter (x) |  xy |   y² |  x²
             35 |                  8 | 280 | 1225 |  64
             49 |                  9 | 441 | 2401 |  81
             27 |                  7 | 189 |  729 |  49
             33 |                  6 | 198 | 1089 |  36
             60 |                 13 | 780 | 3600 | 169
             21 |                  7 | 147 |  441 |  49
             45 |                 11 | 495 | 2025 | 121
             51 |                 12 | 612 | 2601 | 144

Totals: Σy = 321, Σx = 73, Σxy = 3142, Σy² = 14111, Σx² = 713

Calculation Example (continued)

r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - (\sum x)^2\right]\left[n \sum y^2 - (\sum y)^2\right]}}
  = \frac{8(3142) - (73)(321)}{\sqrt{\left[8(713) - (73)^2\right]\left[8(14111) - (321)^2\right]}}
  = 0.886

[Figure: scatter plot of Tree Height (y) versus Trunk Diameter (x) showing an upward trend]

r = 0.886 → relatively strong positive linear association between x and y

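As a quick check on the hand calculation, here is a minimal NumPy sketch (NumPy availability assumed) that evaluates the algebraic form of r on the tree data above:

```python
# Verify the tree example by evaluating the algebraic form of r.
import numpy as np

x = np.array([8, 9, 7, 6, 13, 7, 11, 12])       # trunk diameter
y = np.array([35, 49, 27, 33, 60, 21, 45, 51])  # tree height
n = len(x)

r = (n * (x * y).sum() - x.sum() * y.sum()) / np.sqrt(
    (n * (x**2).sum() - x.sum()**2) * (n * (y**2).sum() - y.sum()**2)
)
print(round(r, 3))  # 0.886
```

The built-in np.corrcoef(x, y)[0, 1] returns the same value.
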
Significance Test for Correlation

- Hypotheses:
  H0: ρ = 0 (no correlation)
  HA: ρ ≠ 0 (correlation exists)
- Test statistic (with n - 2 degrees of freedom):

t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}

Example: Tree Data

Is there evidence of a linear relationship between tree height and trunk diameter at the .05 level of significance?

H0: ρ = 0 (no correlation)
H1: ρ ≠ 0 (correlation exists)

α = .05, df = 8 - 2 = 6

t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}} = \frac{.886}{\sqrt{\frac{1 - .886^2}{8 - 2}}} = 4.68

Example: Test Solution

t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}} = \frac{.886}{\sqrt{\frac{1 - .886^2}{8 - 2}}} = 4.68

d.f. = 8 - 2 = 6, α/2 = .025, so the critical values are ±tα/2 = ±2.4469

[Figure: t distribution with rejection regions beyond -2.4469 and +2.4469; the test statistic 4.68 falls in the upper rejection region]

Decision: Reject H0

Conclusion: There is evidence of a linear relationship at the 5% level of significance

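The same test can be run in a few lines; this sketch assumes SciPy is installed and reproduces the 2.4469 cutoff used above:

```python
# Correlation t test for the tree data: t = r / sqrt((1 - r^2) / (n - 2)).
import numpy as np
from scipy import stats

r, n, alpha = 0.886, 8, 0.05
t = r / np.sqrt((1 - r**2) / (n - 2))          # ~4.68
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # ~2.4469
p = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"t = {t:.2f}, critical value = {t_crit:.4f}, p = {p:.4f}")
# t exceeds the critical value, so H0 is rejected at the .05 level
```
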
Introduction to Regression Analysis

- Regression analysis is used to:
  - Predict the value of a dependent variable based on the value of at least one independent variable
  - Explain the impact of changes in an independent variable on the dependent variable
- Dependent variable: the variable we wish to explain
- Independent variable: the variable used to explain the dependent variable

Simple Linear Regression Model

- Only one independent variable, x
- Relationship between x and y is described by a linear function
- Changes in y are assumed to be caused by changes in x

Types of Regression Models

[Figure: example plots for a positive linear relationship, a negative linear relationship, a relationship that is NOT linear, and no relationship]

Population Linear Regression

The population regression model:

y = \beta_0 + \beta_1 x + \varepsilon

where:
y = Dependent variable
β0 = Population y intercept
β1 = Population slope coefficient
x = Independent variable
ε = Random error term, or residual

β0 + β1x is the linear component; ε is the random error component.

Linear Regression Assumptions

- Error values (ε) are statistically independent
- Error values are normally distributed for any given value of x
- The probability distribution of the errors has constant variance
- The underlying relationship between the x variable and the y variable is linear

Population Linear Regression (continued)

[Figure: the line y = β0 + β1x with intercept β0 and slope β1; for a given xi, the observed value of y differs from the predicted value of y by the random error εi for that x value]

Estimated Regression Model

The sample regression line provides an estimate of the population regression line:

\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i

where:
ŷi = Estimated (or predicted) y value
β̂0 = Estimate of the regression intercept
β̂1 = Estimate of the regression slope
x = Independent variable

The individual random error terms ei have a mean of zero.

Least Squares Criterion
̂ 0
and ̂1 are obtained by finding the values of
̂ 0and ̂1 that minimize the sum of the squared
residuals

2
ˆ
 e   (y y)
2
2
ˆ
ˆ
  (y  (0  1x))
Chap 2-22
The Least Squares Equation

The formulas for β̂1 and β̂0 are:

\hat{\beta}_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}

algebraic equivalent:

\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}

and

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

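To make the Sxy/Sxx form concrete, here is a brief sketch (NumPy assumed) applying it to the tree data from the correlation example; the coefficients it prints are computed, not quoted from the slides:

```python
# Least squares slope and intercept via Sxy and Sxx.
import numpy as np

x = np.array([8, 9, 7, 6, 13, 7, 11, 12])       # trunk diameter
y = np.array([35, 49, 27, 33, 60, 21, 45, 51])  # tree height
n = len(x)

s_xy = (x * y).sum() - x.sum() * y.sum() / n  # Sxy
s_xx = (x**2).sum() - x.sum()**2 / n          # Sxx
b1 = s_xy / s_xx                              # slope estimate
b0 = y.mean() - b1 * x.mean()                 # intercept estimate
print(f"y-hat = {b0:.2f} + {b1:.3f} x")
```
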
Interpretation of the Slope and the Intercept

- β̂0 is the estimated average value of y when the value of x is zero
- β̂1 is the estimated change in the average value of y as a result of a one-unit change in x

Finding the Least Squares Equation

- The coefficients β̂0 and β̂1 will usually be found using computer software, such as Minitab
- Other regression measures will also be computed as part of computer-based regression analysis

Simple Linear Regression Example

- A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
- A random sample of 10 houses is selected
  - Dependent variable (y) = house price in $1000s
  - Independent variable (x) = square feet

Sample Data for House Price Model

House Price in $1000s (y) | Square Feet (x)
                      245 |            1400
                      312 |            1600
                      279 |            1700
                      308 |            1875
                      199 |            1100
                      219 |            1550
                      405 |            2350
                      324 |            2450
                      319 |            1425
                      255 |            1700

Regression Using MINITAB

Stat / Regression / Regression

Regression Analysis: House price versus Square Feet

The regression equation is
House price = 98.2 + 0.110 Square Feet

Predictor     Coef    SE Coef   T     P
Constant      98.25   58.03     1.69  0.129
Square Feet   0.110   0.03297   3.33  0.010

S = 41.3303   R-Sq = 58.1%   R-Sq(adj) = 52.8%

MINITAB OUTPUT

Analysis of Variance

Source           DF   SS      MS      F       P
Regression        1   18935   18935   11.08   0.010
Residual Error    8   13666   1708
Total             9   32600

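The coefficients in this output can be reproduced from the raw data; a minimal sketch follows (NumPy assumed; np.polyfit is one of several ways to get the least squares line):

```python
# Reproduce the MINITAB fit for the house price model.
import numpy as np

sqft  = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

b1, b0 = np.polyfit(sqft, price, deg=1)  # slope, intercept
print(f"House price = {b0:.2f} + {b1:.4f} Square Feet")
# ~98.25 + 0.1098, matching the rounded MINITAB equation 98.2 + 0.110
```
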
Graphical Presentation

Stat / Regression / Fitted Line Plot

[Figure: fitted line plot of house price versus square feet]

Interpretation of the Intercept, β̂0

House price = 98.25 + 0.110 Square Feet

- β̂0 is the estimated average value of y when the value of x is zero (if x = 0 is in the range of observed x values)
- Here, no houses had 0 square feet, so β̂0 = 98.25 just indicates that, for houses within the range of sizes observed, $98,250 is the portion of the house price not explained by square feet

Interpretation of the Slope Coefficient, β̂1

House price = 98.2 + 0.110 Square Feet

- β̂1 measures the estimated change in the average value of y as a result of a one-unit change in x
- Here, β̂1 = .110 tells us that the average value of a house increases by .110($1000) = $110, on average, for each additional one square foot of size

Least Squares Regression Properties

- The sum of the residuals from the least squares regression line is 0: \sum (y - \hat{y}) = 0
- The sum of the squared residuals, \sum (y - \hat{y})^2, is a minimum
- The simple regression line always passes through the mean of the y variable and the mean of the x variable
- The least squares coefficients are unbiased estimates of β0 and β1

Explained and Unexplained Variation

Total variation is made up of two parts:

SST = SSR + SSRes

SST (Total Sum of Squares): SST = \sum (y - \bar{y})^2
SSR (Sum of Squares Regression): SSR = \sum (\hat{y} - \bar{y})^2
SSRes (Sum of Squares Error, Residual): SSRes = \sum (y - \hat{y})^2

where:
ȳ = Average value of the dependent variable
y = Observed values of the dependent variable
ŷ = Estimated value of y for the given x value

Explained and Unexplained Variation (continued)

- SST = total sum of squares
  - Measures the variation of the yi values around their mean ȳ
- SSRes = error (residual) sum of squares
  - Variation attributable to factors other than the relationship between x and y
- SSR = regression sum of squares
  - Explained variation attributable to the relationship between x and y

Explained and Unexplained Variation (continued)

[Figure: a point (xi, yi) against the fitted line, decomposing its deviation from ȳ into SSRes = Σ(yi - ŷi)², SSR = Σ(ŷi - ȳ)², and SST = Σ(yi - ȳ)²]

Coefficient of Determination, R²

- The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
- The coefficient of determination is also called R-squared and is denoted as R²

R^2 = \frac{SSR}{SST}, \quad \text{where } 0 \le R^2 \le 1

Coefficient of Determination, R² (continued)

Coefficient of determination:

R^2 = \frac{SSR}{SST} = \frac{\text{sum of squares explained by regression}}{\text{total sum of squares}}

Note: In the single independent variable case, the coefficient of determination is

R^2 = r^2

where:
R² = Coefficient of determination
r = Simple correlation coefficient

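A short sketch (NumPy assumed) checking the SST = SSR + SSRes decomposition and the R² = r² identity on the house data:

```python
# Verify SST = SSR + SSRes and R^2 = r^2 for the house price model.
import numpy as np

sqft  = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

b1, b0 = np.polyfit(sqft, price, deg=1)
y_hat = b0 + b1 * sqft

sst    = ((price - price.mean())**2).sum()  # total: ~32600
ssr    = ((y_hat - price.mean())**2).sum()  # explained: ~18935
ss_res = ((price - y_hat)**2).sum()         # unexplained: ~13666

r = np.corrcoef(sqft, price)[0, 1]
print(f"SSR + SSRes = {ssr + ss_res:.0f} = SST = {sst:.0f}")
print(f"R-sq = {ssr / sst:.3f} = r^2 = {r**2:.3f}")  # both ~0.581
```
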
Examples of Approximate R² Values

R² = 1

[Figure: two panels where all points fall exactly on a line, one with positive slope and one with negative slope]

Perfect linear relationship between x and y: 100% of the variation in y is explained by variation in x

Examples of Approximate R² Values (continued)

0 < R² < 1

[Figure: scatter plots with a loose linear trend]

Weaker linear relationship between x and y: some but not all of the variation in y is explained by variation in x

Examples of Approximate R² Values (continued)

R² = 0

[Figure: scatter plot with no trend]

No linear relationship between x and y: the value of y does not depend on x (none of the variation in y is explained by variation in x)

Standard Error Estimation

House price = 98.2 + 0.110 Square Feet

Predictor     Coef    SE Coef   T     P
Constant      98.25   58.03     1.69  0.129
Square Feet   0.110   0.03297   3.33  0.010

S = 41.3303   R-Sq = 58.1%   R-Sq(adj) = 52.8%

Source           DF   SS      MS      F       P
Regression        1   18935   18935   11.08   0.010
Residual Error    8   13666   1708
Total             9   32600

Standard Error of Estimate

The standard deviation of the variation of observations around the regression line is estimated by

s = \sqrt{MS_{Res}} = \sqrt{\frac{SS_{Res}}{n - k - 1}}

where:
SSRes = Sum of squares error (residual)
n = Sample size
k = Number of independent variables in the model

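Plugging in the ANOVA numbers above (SSRes = 13666, n = 10, k = 1) gives the S reported by MINITAB; a minimal sketch:

```python
# Standard error of estimate from the ANOVA table (values from the slides).
import math

ss_res, n, k = 13666, 10, 1
ms_res = ss_res / (n - k - 1)  # 1708.25, shown rounded as MS = 1708
s = math.sqrt(ms_res)          # ~41.33, matching S = 41.3303 up to rounding
print(f"MSRes = {ms_res:.2f}, s = {s:.2f}")
```
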
The Standard Deviation of the Regression Slope

The standard error of the regression slope coefficient (β̂1) is estimated by

Se(\hat{\beta}_1) = \sqrt{\frac{MS_{Res}}{\sum (x - \bar{x})^2}} = \sqrt{\frac{MS_{Res}}{\sum x^2 - \frac{(\sum x)^2}{n}}}

where:
Se(β̂1) = Estimate of the standard error of the least squares slope
MS_{Res} = \frac{SS_{Res}}{n - 2}, whose square root is the sample standard error of the estimate

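A minimal sketch (NumPy assumed) of this standard error for the house data; the printed value should match MINITAB's SE Coef:

```python
# Standard error of the slope: Se(b1) = sqrt(MSRes / Sxx).
import numpy as np

sqft  = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])
n = len(sqft)

b1, b0 = np.polyfit(sqft, price, deg=1)
ms_res = ((price - (b0 + b1 * sqft))**2).sum() / (n - 2)
s_xx = ((sqft - sqft.mean())**2).sum()
se_b1 = np.sqrt(ms_res / s_xx)
print(round(se_b1, 5))  # ~0.03297
```
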
Comparing Standard Errors

[Figure: four panels. Top row: variation of observed y values around the regression line, small MSRes versus large MSRes. Bottom row: variation in the slope of regression lines from different possible samples, small Se(β̂1) versus large Se(β̂1)]

Inference about the Slope: t Test

- t test for a population slope: Is there a linear relationship between x and y?
- Null and alternative hypotheses:
  H0: β1 = 0 (no linear relationship)
  H1: β1 ≠ 0 (linear relationship does exist)
- Test statistic (d.f. = n - 2):

t = \frac{\hat{\beta}_1 - \beta_1}{Se(\hat{\beta}_1)}

where:
β̂1 = Sample regression slope coefficient
β1 = Hypothesized slope
Se(β̂1) = Estimator of the standard error of the slope

Inference about the Slope: t Test (continued)

(house price and square feet data as shown earlier)

Estimated Regression Equation:
House price = 98.25 + 0.110 Square Feet

The slope of this model is 0.110

Does square footage of the house affect its sales price?

Inferences about the Slope: t Test Example

H0: β1 = 0
HA: β1 ≠ 0

From MINITAB output:

Predictor     Coef    SE Coef   T     P
Constant      98.25   58.03     1.69  0.129
Square Feet   0.110   0.03297   3.33  0.010

Test Statistic: t = 3.33

d.f. = 10 - 2 = 8, α/2 = .025, so the critical values are ±tα/2 = ±2.3060

[Figure: t distribution with rejection regions beyond -2.3060 and +2.3060; the test statistic 3.329 falls in the upper rejection region]

Decision: Reject H0

Conclusion: There is sufficient evidence that square footage affects house price

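The t statistic and p-value above can be recomputed directly; a sketch assuming NumPy and SciPy:

```python
# t test for the slope of the house price model (H0: beta1 = 0).
import numpy as np
from scipy import stats

sqft  = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])
n = len(sqft)

b1, b0 = np.polyfit(sqft, price, deg=1)
ms_res = ((price - (b0 + b1 * sqft))**2).sum() / (n - 2)
se_b1 = np.sqrt(ms_res / ((sqft - sqft.mean())**2).sum())

t = (b1 - 0) / se_b1                  # ~3.33
p = 2 * stats.t.sf(abs(t), df=n - 2)  # ~0.010, matching MINITAB's P
print(f"t = {t:.2f}, p = {p:.3f}")
```
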
Regression Analysis for Description

Confidence Interval Estimate of the Slope:

\hat{\beta}_1 \pm t_{\alpha/2} \, Se(\hat{\beta}_1), \qquad d.f. = n - 2

At the 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858).

This 95% confidence interval does not include 0.

Conclusion: There is a significant relationship between house price and square feet at the .05 level of significance

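The (0.0337, 0.1858) interval follows from the summary numbers; a sketch assuming SciPy, using the unrounded slope 0.10977 rather than the displayed 0.110:

```python
# 95% confidence interval for the slope: b1 +/- t(alpha/2) * Se(b1).
from scipy import stats

b1, se_b1, n = 0.10977, 0.03297, 10    # unrounded Coef and SE Coef
t_crit = stats.t.ppf(0.975, df=n - 2)  # ~2.3060
lo, hi = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(f"({lo:.4f}, {hi:.4f})")         # ~(0.0337, 0.1858)
```
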
Confidence Interval for the Average y, Given x

Confidence interval estimate for the mean of y given a particular x0:

\hat{y} \pm t_{\alpha/2} \sqrt{MS_{Res}\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x - \bar{x})^2}\right]}

The size of the interval varies according to the distance of x0 from the mean, x̄.

Confidence Interval for an Individual y, Given x

Confidence interval estimate for an individual value of y given a particular x0:

\hat{y} \pm t_{\alpha/2} \sqrt{MS_{Res}\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x - \bar{x})^2}\right]}

The extra term (the leading 1 inside the brackets) adds to the interval width to reflect the added uncertainty for an individual case.

Interval Estimates for Different Values of x

[Figure: the fitted line ŷ = β̂0 + β̂1x with two bands: a confidence interval for the mean of y given x0, and a wider prediction interval for an individual y given x0; both widen as x0 moves away from x̄]

Example: House Prices

(house price and square feet data as shown earlier)

Estimated Regression Equation:
House price = 98.25 + 0.110 Square Feet

Predict the price for a house with 2000 square feet

Example: House Prices (continued)

Predict the price for a house with 2000 square feet:

house price = 98.25 + 0.110(sq. ft.) = 98.25 + 0.110(2000) = 318.25

The predicted price for a house with 2000 square feet is 318.25 ($1,000s) = $318,250

Estimation of Mean Values: Example

Confidence Interval Estimate for E(y)|x0:
Find the 95% confidence interval for the average price of 2,000 square-foot houses.

Predicted price ŷ = 318.25 ($1,000s)

\hat{y} \pm t_{\alpha/2} \sqrt{MS_{Res}\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x - \bar{x})^2}\right]} = 318.25 \pm 37.12

The confidence interval endpoints are 281.13 and 355.37, or from $281,130 to $355,370.

Estimation of Individual Values: Example

Prediction Interval Estimate for y|x0:
Find the 95% prediction interval for an individual house with 2,000 square feet.

Predicted price ŷ = 318.25 ($1,000s)

\hat{y} \pm t_{\alpha/2} \sqrt{MS_{Res}\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x - \bar{x})^2}\right]} = 318.25 \pm 102.28

The prediction interval endpoints are 215.97 and 420.53, or from $215,970 to $420,530.

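Both intervals, and the MINITAB output on the next slide, can be reproduced by hand; this sketch (NumPy and SciPy assumed) uses the unrounded fit near 317.8 where the slides round to 318.25:

```python
# Mean-response CI and prediction interval at x0 = 2000 sq. ft.
import numpy as np
from scipy import stats

sqft  = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])
n, x0 = len(sqft), 2000

b1, b0 = np.polyfit(sqft, price, deg=1)
y_hat = b0 + b1 * x0                       # ~317.8 (318.25 with rounded coefs)
ms_res = ((price - (b0 + b1 * sqft))**2).sum() / (n - 2)
s_xx = ((sqft - sqft.mean())**2).sum()
t_crit = stats.t.ppf(0.975, df=n - 2)

half_ci = t_crit * np.sqrt(ms_res * (1/n + (x0 - sqft.mean())**2 / s_xx))
half_pi = t_crit * np.sqrt(ms_res * (1 + 1/n + (x0 - sqft.mean())**2 / s_xx))
print(f"CI: {y_hat - half_ci:.1f} to {y_hat + half_ci:.1f}")  # ~(280.7, 354.9)
print(f"PI: {y_hat - half_pi:.1f} to {y_hat + half_pi:.1f}")  # ~(215.5, 420.1)
```
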
Finding Confidence and Prediction Intervals in MINITAB

Stat / Regression / Regression - Options

Predicted Values for New Observations

New Obs   Fit     SE Fit   95% CI            95% PI
1         317.8   16.1     (280.7, 354.9)    (215.5, 420.1)

Values of Predictors for New Observations

New Obs   Square Feet
1         2000

Finding Confidence and Prediction Intervals in MINITAB (continued)

Stat / Regression / Fitted Line Plot

[Figure: fitted line plot with confidence and prediction bands]