Chapter 19
Linear Patterns
Copyright © 2011 Pearson Education, Inc.
19.1 Fitting a Line to Data
What is the relationship between the price
and weight of diamonds?


Use regression analysis to find an equation that
summarizes the linear association between price
and weight
The intercept and slope of the line estimate the
fixed and variable costs in pricing diamonds
Consider Two Questions about Diamonds:

What’s the average price of diamonds that weigh
0.4 carat?

How much more do diamonds that weigh 0.5
carat cost?
Equation of a Line

Using a sample of diamonds of various weights,
regression analysis produces an equation that
relates weight to price.

Let y denote the response variable (price) and let
x denote the explanatory variable (weight).
Scatterplot of Price vs. Weight
Linear association is evident (r = 0.66).
Equation of a Line

Identify the line fit to the data by an intercept b0
and a slope b1 .

The equation of the line is
$\hat{y} = b_0 + b_1 x$
Estimated Price = b0 + b1 × Weight.
Least Squares

Residuals: vertical deviations from the data points
to the line ($e = y - \hat{y}$).

The best fitting line collectively makes the
squares of residuals as small as possible
(the choice of b0 and b1 minimizes the sum of the
squared residuals).
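
To make the least squares criterion concrete, the sketch below computes the sum of squared residuals for two candidate lines. The five data points are the advertising and sales figures from the least squares example later in this transcript. The helper name `sum_squared_residuals` and the second candidate line (intercept 0, slope 0.6) are illustrative choices, not part of the original slides.

```python
# Advertising ($) and sales (units) from the least squares example in this transcript.
xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]


def sum_squared_residuals(b0, b1, xs, ys):
    """Sum of squared vertical deviations e = y - (b0 + b1 * x)."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


sse_fitted = sum_squared_residuals(-0.1, 0.7, xs, ys)  # least squares line
sse_other = sum_squared_residuals(0.0, 0.6, xs, ys)    # arbitrary alternative line

print(round(sse_fitted, 2), round(sse_other, 2))
```

The fitted line gives the smaller sum. By the least squares criterion, no other choice of intercept and slope can do better on these data.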
Residuals
19.2 Interpreting the Fitted Line
Diamond Example

The least squares regression equation for relating
diamond prices to weight is
Estimated Price = 43 + 2,670 × Weight

The average price of a diamond that weighs 0.4
carat is
Estimated Price = 43 + 2,670(0.4) = $1,111

A diamond that weighs 0.5 carat costs $267 more,
on average.
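
The two answers above follow directly from the fitted equation. A minimal sketch of the arithmetic is given here. The function name `estimated_price` is an illustrative assumption; the coefficients 43 and 2,670 are the ones reported on the slide.

```python
def estimated_price(weight_carats):
    """Fitted line from the slides: Estimated Price = 43 + 2670 * Weight."""
    return 43 + 2670 * weight_carats


price_04 = estimated_price(0.4)                  # average price at 0.4 carat
extra_for_05 = estimated_price(0.5) - price_04   # added cost of 0.1 carat more

print(f"${price_04:,.0f}", f"${extra_for_05:,.0f}")
```

The difference depends only on the slope: 0.1 carat times $2,670 per carat.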
Interpreting the Intercept

The intercept is the portion of y that is present for
all values of x (i.e., fixed cost, $43, per diamond).

The intercept estimates the average response
when x = 0 (where the line crosses the y axis).
Unless the range of x values includes zero, b0 will
be an extrapolation.
Interpreting the Slope

The slope estimates the marginal cost used to
find the variable cost (i.e., marginal cost is $2,670
per carat).

While tempting, it is not correct to describe the
slope as the change in y caused by changing x.
Another empirical problem
Empirical problem: Class size and educational output.
• Policy question: What is the effect of reducing class size by one student per class? By 8 students per class?
• What is the right output (performance) measure?
  • parent satisfaction
  • student personal development
  • future adult welfare
  • future adult earnings
  • performance on standardized tests
What do data say about class sizes and test
scores?
The California Test Score Data Set
All K-6 and K-8 California school districts (n = 420)
Variables:
• 5th grade test scores (Stanford-9 achievement test, combined math and reading), district average.
• Student-teacher ratio (STR) = number of students in the district divided by number of full-time equivalent teachers.
An initial look at the California test score data:
Question:
Do districts with smaller classes (lower STR) have higher test
scores? And by how much?
The class size/test score policy question:
• What is the effect of reducing STR by one student/teacher on test scores?
• Object of policy interest: $\Delta \text{Test score} / \Delta STR$.
• This is the slope of the line relating test score and STR.
This suggests that we want to draw a line through the Test Score vs. STR scatterplot, but how?
Linear Regression: Some Notation and Terminology
The population regression line is
$TestScore = \beta_0 + \beta_1 \, STR$
β0 and β1 are "population" parameters.
We would like to know the population value of β1.
We don't know β1, so we must estimate it using data.
The Population Linear Regression Model: general notation
$Y_i = \beta_0 + \beta_1 X_i + u_i, \quad i = 1, \dots, n$
X is the independent variable or regressor.
Y is the dependent variable.
β0 = intercept.
β1 = slope.
ui = the regression error.
The regression error consists of omitted factors, or possibly measurement error in the measurement of Y. In general, these omitted factors are other factors that influence Y, other than the variable X.
Application to the California Test Score-Class Size data
Estimated slope = $\hat{\beta}_1 = -2.28$
Estimated intercept = $\hat{\beta}_0 = 698.9$
Estimated regression line:
$\widehat{TestScore} = 698.9 - 2.28 \times STR$
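
As a worked reading of the estimated slope, the predicted difference in average test scores associated with a change in STR is the slope times that change. The calculation below applies this to the two class-size reductions posed in the policy question. It describes the fitted line only and does not by itself establish a causal effect.

```latex
\begin{align*}
\Delta\widehat{TestScore} &= \hat{\beta}_1 \times \Delta STR \\
\Delta STR = -1: \quad \Delta\widehat{TestScore} &= (-2.28)(-1) = 2.28 \\
\Delta STR = -8: \quad \Delta\widehat{TestScore} &= (-2.28)(-8) = 18.24
\end{align*}
```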
4M Example 19.1:
ESTIMATING CONSUMPTION
Motivation
A utility company that sells natural gas in the
Philadelphia area needs to estimate how
much is used in homes in which their
meters cannot be read.
Method
Use regression analysis to find the equation
that relates y (amount of gas consumed
measured in CCF) to x (the average
number of degrees below 65° during the
billing period). The utility company has 4
years of data (n = 48 months) for one home.
Mechanics
Linear association is evident.
The fitted least squares regression line is
Estimated Gas = 26.7 + 5.7 × (Degrees Below 65)
Message
During the summer, the home uses about
26.7 CCF of gas during the billing period.
As the weather gets colder, the estimated
average amount of gas consumed rises by
5.7 CCF for each additional degree below
65°.
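
For a home whose meter cannot be read, the fitted line turns a billing period's temperature into an estimated consumption. The sketch below applies it to three billing periods. The period names and their degrees-below-65 values are hypothetical inputs chosen for illustration. Only the coefficients 26.7 and 5.7 come from the slides.

```python
def estimated_gas_ccf(degrees_below_65):
    """Fitted line from the slides: Estimated Gas = 26.7 + 5.7 * (Degrees Below 65)."""
    return 26.7 + 5.7 * degrees_below_65


# Hypothetical billing periods (average degrees below 65); not from the slides.
periods = {"summer": 0, "mild": 10, "cold": 30}
estimates = {name: estimated_gas_ccf(d) for name, d in periods.items()}

for name, ccf in estimates.items():
    print(f"{name}: {ccf:.1f} CCF")
```

The summer estimate equals the intercept, which matches the slide's reading of 26.7 CCF as baseline use when no heating is needed.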
Scattergram
1. Plot of all (xi, yi) pairs
2. Suggests how well model will fit
[Figure: scattergram of y against x, both axes running from 0 to 60]
Thinking Challenge
• How would you draw a line through the points?
• How do you determine which line ‘fits best’?
[Figure: scattergram of y against x, both axes running from 0 to 60]
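Basic Concepts of Regression Analysis
• Regression analysis uses paired data to study the relationship between two or more variables.
• Taking two variables as an example, paired data means that the observed data are $(x_i, y_i)$, $i = 1, \dots, n$. If economic theory tells us that x and y are related in some way, we can use y = f(x) to characterize this relationship.
• For example, "personal income" is influenced by "education level", or the "inflation rate" is influenced by the "money supply".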
Population Linear Regression Model
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
$y_i$ = observed value, $\varepsilon_i$ = random error
$E(y) = \beta_0 + \beta_1 x$
[Figure: the population regression line and the error term]
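Population Regression Line
• Simply put, if we have the population data, the population regression line, like the correlation coefficient, can be viewed as a descriptive statistic that summarizes the population data.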
Population & Sample Regression Models
Population (unknown relationship): $y = \beta_0 + \beta_1 x + \varepsilon$
Random sample: $y = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\varepsilon}$
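Sample Linear Regression Model
$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{\varepsilon}_i$
$\hat{\varepsilon}_i$ = random error
$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
[Figure: sample regression line with an observed value and an unsampled observation]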
Least Squares
• 'Best fit' means the differences between actual y values and predicted y values are a minimum
• But positive differences offset negative ones, so the differences are squared:
$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \hat{\varepsilon}_i^2$
• Least Squares minimizes the Sum of the Squared Differences (SSE)
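Best Predictor
• The best predictor f(x) minimizes the following sum of squared errors: $\sum_{i=1}^{n} (y_i - f(x_i))^2$
• The method of solving for the best predictor f(x) by minimizing the sum of squared errors is called the method of least squares.
Best Constant Predictor
• If we have no information about x, what is the best prediction of y? That is, f(xi) = c.
• We call this predictor the best constant predictor.
Its first-order condition is $-2\sum_{i=1}^{n} (y_i - c) = 0$.
Therefore, $c = \frac{1}{n}\sum_{i=1}^{n} y_i = \mu_y$.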
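
A quick numerical check of the best constant predictor: the sketch below evaluates the sum of squared errors over a grid of candidate constants c and confirms that the mean does at least as well as any of them. The data are the four crop yields from the fertilizer example later in this transcript. The function name `sum_squared_errors` and the grid of candidates are illustrative assumptions.

```python
ys = [3.0, 5.5, 6.5, 9.0]  # crop yields (lb.) from the fertilizer example


def sum_squared_errors(c, ys):
    """Sum of squared prediction errors when every y is predicted by the constant c."""
    return sum((y - c) ** 2 for y in ys)


mean_y = sum(ys) / len(ys)
candidates = [k / 10 for k in range(0, 121)]  # 0.0, 0.1, ..., 12.0
best_c = min(candidates, key=lambda c: sum_squared_errors(c, ys))

print(mean_y, best_c)
```

A grid search is only a sanity check. The first-order condition is what establishes that the mean is the minimizer.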
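Derivation of the OLS Estimators
$\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i) = 0 \quad (1)$
$\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i) X_i = 0 \quad (2)$
$\hat{\beta}_0$ and $\hat{\beta}_1$ are the values of b0 and b1 that solve the above two normal equations.
Dividing each term of equations (1) and (2) by n, we have
$\bar{Y} - \hat{\beta}_0 - \hat{\beta}_1 \bar{X} = 0 \quad (3)$
$\frac{1}{n}\sum_{i=1}^{n} X_i Y_i - \hat{\beta}_0 \bar{X} - \hat{\beta}_1 \frac{1}{n}\sum_{i=1}^{n} X_i^2 = 0 \quad (4)$
From (3), $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$. Substituting in (4) and collecting terms, we have
$\hat{\beta}_1 = \dfrac{\frac{1}{n}\sum_{i=1}^{n} X_i Y_i - \bar{X}\bar{Y}}{\frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2}$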
Definition: error (residual)
$e_i = y_i - \hat{y}_i$
Two Normal Equations
From equation (1), we have
$\sum e_i = 0$
From equation (2), we have
$\sum e_i X_i = \sum e_i (X_i - \bar{X}) = \sum e_i x_i = 0$
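
The two conditions on the residuals can be checked numerically. The sketch below uses the advertising and sales data and the fitted line $\hat{y} = -0.1 + 0.7x$ from the least squares example later in this transcript. The variable names are illustrative, and sums are compared to zero with a small tolerance because of floating-point rounding.

```python
xs = [1, 2, 3, 4, 5]   # advertising
ys = [1, 1, 2, 2, 4]   # sales
b0, b1 = -0.1, 0.7     # least squares estimates reported for these data

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
x_bar = sum(xs) / len(xs)

sum_e = sum(residuals)                                            # from equation (1)
sum_e_x = sum(e * x for e, x in zip(residuals, xs))               # from equation (2)
sum_e_dev = sum(e * (x - x_bar) for e, x in zip(residuals, xs))   # deviation form

print(abs(sum_e) < 1e-9, abs(sum_e_x) < 1e-9, abs(sum_e_dev) < 1e-9)
```

The third sum equals the second because the residuals already sum to zero, so subtracting the mean of X changes nothing.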
Least Squares Example
You’re a marketing analyst for Hasbro
Toys. You gather the following data:
Ad $    Sales (Units)
1       1
2       1
3       2
4       2
5       4
Find the least squares line relating
sales and advertising.
Scattergram
[Figure: Sales vs. Advertising]
Parameter Estimation Solution Table
        xi      yi      xi²     yi²     xiyi
        1       1       1       1       1
        2       1       4       1       2
        3       2       9       4       6
        4       2       16      4       8
        5       4       25      16      20
Sum     15      10      55      26      37
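Parameter Estimation Solution
$\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} x_i y_i - \dfrac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} = \dfrac{37 - \dfrac{(15)(10)}{5}}{55 - \dfrac{(15)^2}{5}} = .70$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 2 - (.70)(3) = -.10$
$\hat{y} = -.1 + .7x$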
Regression Line Fitted to the Data
[Figure: Sales vs. Advertising with the fitted line $\hat{y} = -.1 + .7x$]
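
The hand calculation for the advertising data can be reproduced from the same column sums used in the solution table. The sketch below is one way to do it in plain Python. The function name `fit_line` is an illustrative assumption, not something from the slides.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope from the column sums of the solution table."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


# Advertising ($) and sales (units) from the Hasbro example.
b0, b1 = fit_line([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])
print(round(b0, 2), round(b1, 2))
```

The rounding in the print statement only hides floating-point noise. The underlying values agree with the slide's intercept of -.10 and slope of .70.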
Least Squares
Thinking Challenge
You’re an economist for the county
cooperative. You gather the following data:
Fertilizer (lb.)    Yield (lb.)
4                   3.0
6                   5.5
10                  6.5
12                  9.0
Find the least squares line relating
crop yield and fertilizer.
Scattergram
[Figure: Crop Yield (lb.) vs. Fertilizer (lb.)]
Parameter Estimation Solution Table*
        xi      yi      xi²     yi²     xiyi
        4       3.0     16      9.00    12
        6       5.5     36      30.25   33
        10      6.5     100     42.25   65
        12      9.0     144     81.00   108
Sum     32      24.0    296     162.50  218
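Parameter Estimation Solution*
$\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} x_i y_i - \dfrac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} = \dfrac{218 - \dfrac{(32)(24)}{4}}{296 - \dfrac{(32)^2}{4}} = .65$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 6 - (.65)(8) = .80$
$\hat{y} = .8 + .65x$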
Regression Line Fitted to the Data*
[Figure: Crop Yield (lb.) vs. Fertilizer (lb.) with the fitted line $\hat{y} = .8 + .65x$]
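
The slope for the fertilizer data can also be written in deviation-from-mean form, which is algebraically the same as the column-sum formula. The sketch below computes it that way as a cross-check. The variable names `s_xy` and `s_xx` are illustrative assumptions.

```python
xs = [4, 6, 10, 12]        # fertilizer (lb.)
ys = [3.0, 5.5, 6.5, 9.0]  # yield (lb.)

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Deviation form: slope = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar) ** 2)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
print(round(slope, 2), round(intercept, 2))
```

Here `s_xy` and `s_xx` match the numerator (26) and denominator (40) of the column-sum calculation, so both routes give the same line.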
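Goodness of fit
• If we set β0 = μy and β1 = 0, the regression line becomes y = μy. In other words, the best constant predictor is a special case of the best linear predictor.
• We can therefore ask: after adding the information in x, how much does our ability to predict y improve? This is the goodness of fit of the regression line.
• Simply put, the goodness-of-fit measure compares how much explanatory power for y the best linear predictor adds relative to the best constant predictor.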
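Goodness of fit
• $y_i - \mu_y$ is the prediction error when y is predicted with the best constant predictor.
• The prediction error of the best linear predictor is $y_i - \beta_0 - \beta_1 x_i$.
• We can decompose the total variation in y into the variation the regression line cannot explain and the variation it can explain:
Total variation: $TV = \sum_i (y_i - \mu_y)^2$
Explained variation: $EV = \sum_i (\beta_0 + \beta_1 x_i - \mu_y)^2$
Unexplained variation: $UV = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2$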
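Goodness of fit
• Total variation is the sum of explained variation and unexplained variation:
TV = EV + UV
• We can therefore measure the goodness of fit of the regression line by the share of explained variation in total variation: $R^2 = EV / TV$
• The larger $R^2$ is, the larger the share of total variation that the regression line explains, that is, the better the fit of the regression line.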
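
The decomposition can be illustrated with numbers. The sketch below computes total, explained, and unexplained variation for the advertising and sales data from the least squares example earlier in this transcript, then takes the ratio of explained to total variation (written `r_squared` here). As assumptions for illustration, it uses the sample mean in place of μy and the fitted sample line $\hat{y} = -0.1 + 0.7x$ in place of the best linear predictor.

```python
xs = [1, 2, 3, 4, 5]   # advertising
ys = [1, 1, 2, 2, 4]   # sales
b0, b1 = -0.1, 0.7     # fitted line reported for these data

y_bar = sum(ys) / len(ys)            # sample stand-in for mu_y (assumption)
fitted = [b0 + b1 * x for x in xs]

tv = sum((y - y_bar) ** 2 for y in ys)               # total variation
ev = sum((f - y_bar) ** 2 for f in fitted)           # explained variation
uv = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # unexplained variation

r_squared = ev / tv
print(round(tv, 2), round(ev, 2), round(uv, 2), round(r_squared, 3))
```

Because TV = EV + UV holds for a least squares line with an intercept, the same ratio can be computed as 1 - UV/TV.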
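Goodness of fit
• However, we can also interpret the goodness-of-fit measure from another angle.
• That is, $R^2$ can be used to measure "how much does our ability to predict y improve after adding the information in x?"
• The goodness-of-fit measure also compares how much explanatory power for y the best linear predictor adds relative to the best constant predictor.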
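Goodness of fit
• Let TV = the prediction error of the best constant predictor, and UV = the prediction error of the best linear predictor.
• Since TV = EV + UV, $R^2 = EV/TV = 1 - UV/TV$.
• Therefore, the larger $R^2$ is, the smaller $UV/TV$ is. That is, the prediction error of the best linear predictor is smaller relative to that of the best constant predictor, so the best linear predictor adds more explanatory power relative to the best constant predictor.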