ECONOMETRICS
Lecturer
Dr. Veronika Alhanaqtah
Topic 1. Linear regression
1. Simple linear regression
2. The linear correlation coefficient
3. Modeling linear relationships with randomness present
4. The least squares regression line
5. Statistical inferences about \(\beta_2\)
6. The coefficient of determination
7. Estimation and prediction
8. A complete example (Homework)
1. Simple linear regression
Learning objective: to learn what it means for two variables to
exhibit a relationship that is close to linear but which contains an
element of randomness.
\[y = m x + b\]
y is the dependent variable
x is the independent variable
b is the y-intercept
m is the slope of the regression line
Example 1: deterministic relationship
\[y = \frac{9}{5} x + 32\]
x    -40   -15     0    20    50
y    -40     5    32    68   122
The relationship between x and y is called a simple linear relationship
because the plotted points all lie on a single straight line.
"simple": y depends on only one other variable.
"multiple": y depends on two or more variables.
9/5 is the slope of the line and measures its steepness.
32 is the y-intercept of the line.
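Example 2: presence of randomness
Relationship: the height x of a man aged 25 and his weight y. If we
randomly select several 25-year-old men and measure the height
and weight of each one, we might obtain a collection of (x, y) pairs
something like this:
[Graph: scatter plot of the height and weight pairs]
X    151  163  146  180  157  170  164  175  171  178  160  188
Y     68   72   69   72   70   73   70   73   71   74   72   75
Note: in this table, and in the calculation of Example 3, the X row holds the
weights and the Y row holds the heights. The correlation coefficient is
symmetric in the two variables, so the value of r is unaffected.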
2. Linear correlation coefficient
Learning objective: to learn what the linear correlation
coefficient is, how to compute it, and what it tells us
about the relationship between two variables x and y.
Properties of correlation coefficient:
(1) The value of r lies between -1 and 1, inclusive.
(2) The sign of r indicates the direction of the linear
relationship between x and y:
• if r < 0 then y tends to decrease as x is increased;
• if r > 0 then y tends to increase as x is increased.
(3) The size of |r| indicates the strength of the linear
relationship between x and y:
• if |r| is near 1 (that is, if r is near either 1 or -1) then
the linear relationship between x and y is strong;
• if |r| is near 0 (that is, if r is near 0 and of either sign)
then the linear relationship between x and y is weak.
Correlation coefficient: calculation
\[r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}}\]
where
\[SS_{xx} = \sum x^2 - \frac{1}{n} \left( \sum x \right)^2\]
\[SS_{xy} = \sum xy - \frac{1}{n} \left( \sum x \right) \left( \sum y \right)\]
\[SS_{yy} = \sum y^2 - \frac{1}{n} \left( \sum y \right)^2\]
Example 3: calculation of correlation coefficient
Compute the linear correlation coefficient for the height and
weight pairs plotted in Graph (in Example 2).
   x     y      x^2     y^2       xy
 151    68    22801    4624    10268
 163    72    26569    5184    11736
 146    69    21316    4761    10074
 180    72    32400    5184    12960
 157    70    24649    4900    10990
 170    73    28900    5329    12410
 164    70    26896    4900    11480
 175    73    30625    5329    12775
 171    71    29241    5041    12141
 178    74    31684    5476    13172
 160    72    25600    5184    11520
 188    75    35344    5625    14100
Sum:
2003   859   336025   61537   143626
\[SS_{xx} = \sum x^2 - \frac{1}{n} \left( \sum x \right)^2 = 336025 - \frac{1}{12} (2003)^2 = 1690.917\]
\[SS_{xy} = \sum xy - \frac{1}{n} \left( \sum x \right) \left( \sum y \right) = 143626 - \frac{1}{12} (2003)(859) = 244.583\]
\[SS_{yy} = \sum y^2 - \frac{1}{n} \left( \sum y \right)^2 = 61537 - \frac{1}{12} (859)^2 = 46.917\]
\[r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}} = \frac{244.583}{\sqrt{(1690.917)(46.917)}} = 0.868\]
The number r = 0.868 quantifies what is visually apparent from Graph:
weight tends to increase linearly with height (r is positive) and although the
relationship is not perfect, it is reasonably strong (r is near 1).
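The calculation in Example 3 can be checked with a few lines of Python. This is only a sketch using the shortcut formulas for the sums of squares; the function name `sums_of_squares` is chosen here for illustration and is not part of the lecture.

```python
import math

# The twelve (x, y) pairs exactly as tabulated in Example 3.
x = [151, 163, 146, 180, 157, 170, 164, 175, 171, 178, 160, 188]
y = [68, 72, 69, 72, 70, 73, 70, 73, 71, 74, 72, 75]


def sums_of_squares(x, y):
    """Return (SSxx, SSxy, SSyy) computed with the shortcut formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_xx = sum(v * v for v in x) - sum_x ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    ss_yy = sum(v * v for v in y) - sum_y ** 2 / n
    return ss_xx, ss_xy, ss_yy


ss_xx, ss_xy, ss_yy = sums_of_squares(x, y)
r = ss_xy / math.sqrt(ss_xx * ss_yy)
print(round(r, 3))  # 0.868
```

Swapping the roles of `x` and `y` leaves `r` unchanged, because the formula is symmetric in the two variables.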
3. Modeling linear relationships with randomness present
Learning objective: to learn the framework in which the
statistical analysis of the linear relationship between two
variables x and y will be done.
Situation: the value of x can be used to draw conclusions about
the value of y, such as predicting the resale value y of a
residential house based on its size x.
The relationship between x and y is not deterministic.
The assumptions of simple linear regression give a
mathematical description of the relationship between x and y.
Modeling linear relationships with randomness present: Assumptions
(1) The relationship between x and the mean of the y-values in the sub-population determined by x is linear.
This means that there exist numbers \(\beta_1\) and \(\beta_2\) such that:
\[E(y) = \beta_1 + \beta_2 x_i\]
*This line gives the mean of the variable y over the sub-population determined by x (population regression line).
(2) For each value of x the y-values scatter about the mean E(y)
according to a normal distribution centered at E(y) and with a
standard deviation \(\sigma\) that is the same for every value of x.
This is the same as saying that there exists a normally distributed random variable \(\varepsilon\) with
mean 0 and an unknown standard deviation \(\sigma\) so that the relationship between x and y in the
whole population is:
\[y_i = \beta_1 + \beta_2 x_i + \varepsilon\]
where \(\beta_1\) and \(\beta_2\) are fixed but unknown parameters and \(\varepsilon\) is a normally distributed random variable
(non-observable).
(3) Random deviations associated with different observations are
independent.
The simple linear model concept
The symbol \(N(\mu, \sigma^2)\) denotes a normal distribution with
mean \(\mu\) and variance \(\sigma^2\), hence standard deviation \(\sigma\).
\[y_i = \beta_1 + \beta_2 x_i + \varepsilon\]
• Deterministic part, \(\beta_1 + \beta_2 x_i\), describes the trend in y as x increases.
There is nothing random in this part.
• Random part. \(\varepsilon\) (epsilon) is a random variable, called the
error term or the noise. This part explains why the actual
observed values of y are not exactly on but fluctuate
near a line. Information about this term is important
since only when one knows how much noise there is in
the data can one know how trustworthy the detected
trend is.
• The slope parameter \(\beta_2\) represents the expected change in y
brought about by a unit increase in x. The standard deviation
\(\sigma\) represents the magnitude of the noise in the data.
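To make the split between the deterministic part and the random part concrete, the sketch below simulates observations from the simple linear model. The parameter values (intercept 32, slope -2, noise level 1.5) and the function name `simulate` are hypothetical, chosen only for illustration; they do not come from the lecture.

```python
import random


def simulate(beta1, beta2, sigma, xs, seed=0):
    """Draw one y for each x from y = beta1 + beta2 * x + eps, eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [beta1 + beta2 * x + rng.gauss(0.0, sigma) for x in xs]


xs = [float(v) for v in range(1, 11)]

# Hypothetical parameter values, for illustration only.
beta1, beta2 = 32.0, -2.0

noiseless = simulate(beta1, beta2, 0.0, xs)  # sigma = 0: points lie exactly on the line
noisy = simulate(beta1, beta2, 1.5, xs)      # sigma > 0: points fluctuate near the line

for x, y_line, y_obs in zip(xs, noiseless, noisy):
    print(f"x={x:4.1f}  deterministic part={y_line:6.2f}  observed y={y_obs:6.2f}")
```

Raising `sigma` makes the observed values scatter more widely around the same line, which is why the size of the noise determines how trustworthy a detected trend is.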
4. Least squares regression line
Learning objective: to learn how to measure how well a straight line
fits a collection of data; how to construct the least squares
regression line; how to use the least squares regression line to
estimate the response variable y in terms of the predictor variable
x.
The next step in the analysis is to find the straight line that best fits
the data.
Example 4.
x    2    2    6    8   10
y    0    1    2    3    3
• How well the straight line \(\hat{y} = \frac{1}{2} x - 1\) fits the data set.
• The line \(\hat{y} = \frac{1}{2} x - 1\) was selected as one that seems to fit the data
reasonably well.
error at data point (x, y) = (true y) - (predicted y) = \(y - \hat{y}\)
Example 4. Computation of the error
  x     y     ŷ = x/2 - 1     y - ŷ     (y - ŷ)^2
  2     0          0             0           0
  2     1          0             1           1
  6     2          2             0           0
  8     3          3             0           0
 10     3          4            -1           1
Sum                              0           2
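The error table of Example 4 can be reproduced programmatically. The following sketch computes the errors and their sum of squares for any candidate line; the helper name `sum_squared_errors` is an illustrative choice, not notation from the lecture.

```python
# Five-point data set from Example 4.
x = [2, 2, 6, 8, 10]
y = [0, 1, 2, 3, 3]


def sum_squared_errors(x, y, slope, intercept):
    """Return the list of errors y - y_hat and the sum of their squares."""
    errors = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    return errors, sum(e * e for e in errors)


# Candidate line y_hat = (1/2) x - 1.
errors, sse = sum_squared_errors(x, y, slope=0.5, intercept=-1.0)
print(errors)  # [0.0, 1.0, 0.0, 0.0, -1.0]
print(sse)     # 2.0
```

Calling the same function with a different slope and intercept gives a direct way to compare how well two candidate lines fit the data.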
Least squares regression line
• \(\sum (y - \hat{y})^2\) is the sum of squared errors; it measures the goodness of fit.
• The least squares regression line \(\hat{y} = \hat{\beta}_2 x + \hat{\beta}_1\) best fits the data in the
sense of minimizing the sum of the squared errors.
• \(\hat{\beta}_2\) is the slope and \(\hat{\beta}_1\) is the y-intercept:
\[\hat{\beta}_2 = \frac{SS_{xy}}{SS_{xx}}, \qquad \hat{\beta}_1 = \bar{y} - \hat{\beta}_2 \bar{x}\]
where
\[SS_{xx} = \sum x^2 - \frac{1}{n} \left( \sum x \right)^2\]
\[SS_{xy} = \sum xy - \frac{1}{n} \left( \sum x \right) \left( \sum y \right)\]
\(\bar{x}\) is the mean of all the x-values
\(\bar{y}\) is the mean of all the y-values
n is the number of pairs in the data set.
Example 5.
Find the least squares regression line for the five-point data set.
  x     y     x^2     xy
  2     0       4      0
  2     1       4      2
  6     2      36     12
  8     3      64     24
 10     3     100     30
Sum:
 28     9     208     68
\[SS_{xx} = \sum x^2 - \frac{1}{n} \left( \sum x \right)^2 = 208 - \frac{1}{5} (28)^2 = 51.2\]
\[SS_{xy} = \sum xy - \frac{1}{n} \left( \sum x \right) \left( \sum y \right) = 68 - \frac{1}{5} (28)(9) = 17.6\]
\[\bar{x} = \frac{\sum x}{n} = \frac{28}{5} = 5.6, \qquad \bar{y} = \frac{\sum y}{n} = \frac{9}{5} = 1.8\]
\[\hat{\beta}_2 = \frac{SS_{xy}}{SS_{xx}} = \frac{17.6}{51.2} = 0.34375\]
\[\hat{\beta}_1 = \bar{y} - \hat{\beta}_2 \bar{x} = 1.8 - (0.34375)(5.6) = -0.125\]
\[\hat{y} = \hat{\beta}_2 x + \hat{\beta}_1 = 0.34375x - 0.125\]
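As a cross-check of Example 5, the slope and intercept can be computed directly from the data. This is a minimal sketch of the formulas above; the function name `least_squares_line` is an illustrative assumption.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the least squares regression line."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_xx = sum(v * v for v in x) - sum_x ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    slope = ss_xy / ss_xx                      # SSxy / SSxx
    intercept = sum_y / n - slope * sum_x / n  # y_bar - slope * x_bar
    return slope, intercept


# Five-point data set from Example 5.
slope, intercept = least_squares_line([2, 2, 6, 8, 10], [0, 1, 2, 3, 3])
print(round(slope, 5), round(intercept, 3))  # 0.34375 -0.125
```

Note that the intercept is negative: 1.8 - (0.34375)(5.6) = 1.8 - 1.925 = -0.125.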
Table. The errors in fitting data with the least squares regression line
  x     y     ŷ = 0.34375x - 0.125     y - ŷ       (y - ŷ)^2
  2     0           0.5625            -0.5625     0.31640625
  2     1           0.5625             0.4375     0.19140625
  6     2           1.9375             0.0625     0.00390625
  8     3           2.6250             0.3750     0.14062500
 10     3           3.3125            -0.3125     0.09765625
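The squared errors in this table add up to 0.75. The sketch below recomputes that sum directly from the errors and, as a cross-check, from the identity \(SSE = SS_{yy} - \hat{\beta}_2 SS_{xy}\), which holds for the least squares line. Variable names are illustrative.

```python
# Five-point data set and the fitted line from Example 5.
x = [2, 2, 6, 8, 10]
y = [0, 1, 2, 3, 3]
slope, intercept = 0.34375, -0.125

# Direct route: error at each point, squared and summed.
residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
sse_direct = sum(e * e for e in residuals)

# Shortcut route: SSE = SSyy - slope * SSxy (valid for the least squares line only).
n = len(x)
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
ss_yy = sum(v * v for v in y) - sum(y) ** 2 / n
sse_shortcut = ss_yy - slope * ss_xy

print(residuals)
print(sse_direct, round(sse_shortcut, 6))
```

The sum 0.75 is smaller than the value 2 obtained for the line \(\hat{y} = \frac{1}{2} x - 1\) in Example 4, as expected for the best-fitting line.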
Sum of squared errors
• To measure the goodness of fit of a line to a set of data, we
compute the predicted y-value \(\hat{y}\) at every point in the data set,
compute each error, square it, and then add up all the squares.
• In the case of the least squares regression line, the line that best fits
the data, the sum of the squared errors can be computed directly
from the data using the formula:
\[SSE = SS_{yy} - \hat{\beta}_2 SS_{xy}\]
Example 6.
x (age, years)             2     3     3     3     4     4     5     5     5     6
y (value, $ thousands)  28.7  24.8  26.0  30.5  23.8  24.6  23.8  20.4  21.6  22.1
The table above shows the age in years and the retail
value in thousands of dollars of a random sample of
ten automobiles of the same make and model.
(a) Construct the scatter diagram.
(b) Compute the linear correlation coefficient r.
Interpret its value in the context of the problem.
(c) Compute the least squares regression line. Plot it
on the scatter diagram.
(d) Interpret the meaning of the slope of the least
squares regression line in the context of the
problem.
(e) Suppose a four-year-old automobile of this make
and model is selected at random. Use the
regression equation to predict its retail value.
Example 6 (a). Scatter diagram for age and value of used automobiles
[Graph: scatter diagram of value y against age x]
Example 6 (b). Correlation coefficient
Compute the linear correlation coefficient r. Interpret its value in the context
of the problem.
\[\sum x = 40, \quad \sum y = 246.3, \quad \sum x^2 = 174, \quad \sum y^2 = 6154.15, \quad \sum xy = 956.5\]
\[SS_{xx} = \sum x^2 - \frac{1}{n} \left( \sum x \right)^2 = 174 - \frac{1}{10} (40)^2 = 14\]
\[SS_{xy} = \sum xy - \frac{1}{n} \left( \sum x \right) \left( \sum y \right) = 956.5 - \frac{1}{10} (40)(246.3) = -28.7\]
\[SS_{yy} = \sum y^2 - \frac{1}{n} \left( \sum y \right)^2 = 6154.15 - \frac{1}{10} (246.3)^2 = 87.781\]
\[r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}} = \frac{-28.7}{\sqrt{(14)(87.781)}} = -0.819\]
Since r is negative and |r| is fairly close to 1, the value of the automobile
tends to decrease as its age increases, and the linear relationship is
reasonably strong.
Example 6 (c).
Compute the least squares regression line. Plot it on the scatter
diagram.
\[\bar{x} = \frac{\sum x}{n} = \frac{40}{10} = 4 \quad \text{and} \quad \bar{y} = \frac{\sum y}{n} = \frac{246.3}{10} = 24.63\]
\[\hat{\beta}_2 = \frac{SS_{xy}}{SS_{xx}} = \frac{-28.7}{14} = -2.05\]
\[\hat{\beta}_1 = \bar{y} - \hat{\beta}_2 \bar{x} = 24.63 - (-2.05)(4) = 32.83\]
\[\hat{y} = \hat{\beta}_2 x + \hat{\beta}_1 = -2.05x + 32.83\]
Example 6 (d).
Interpret the meaning of the slope of the least squares
regression line in the context of the problem.
The slope -2.05 means that for each unit increase in x
(additional year of age) the average value of this vehicle
decreases by about 2.05 units (about $2,050).
Example 6 (e). Prediction
Suppose a four-year-old automobile of this make and
model is selected at random. Use the regression equation
to predict its retail value.
Since we know nothing about the automobile other
than its age, we assume that it is of about average value
and use the average value of all four-year-old vehicles as
our estimate.
\[\hat{y} = -2.05(4) + 32.83 = 24.63,\]
which corresponds to $24,630.
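Parts (b), (c) and (e) of Example 6 can be verified in one pass over the data. The sketch below is an illustration of the formulas used above, with variable names chosen here rather than taken from the lecture.

```python
import math

# Age (years) and retail value ($ thousands) of the ten automobiles in Example 6.
age = [2, 3, 3, 3, 4, 4, 5, 5, 5, 6]
value = [28.7, 24.8, 26.0, 30.5, 23.8, 24.6, 23.8, 20.4, 21.6, 22.1]

n = len(age)
sum_x, sum_y = sum(age), sum(value)
ss_xx = sum(a * a for a in age) - sum_x ** 2 / n
ss_xy = sum(a * v for a, v in zip(age, value)) - sum_x * sum_y / n
ss_yy = sum(v * v for v in value) - sum_y ** 2 / n

r = ss_xy / math.sqrt(ss_xx * ss_yy)       # part (b)
slope = ss_xy / ss_xx                      # part (c)
intercept = sum_y / n - slope * sum_x / n
predicted_at_4 = slope * 4 + intercept     # part (e)

print(round(r, 3), round(slope, 2), round(intercept, 2), round(predicted_at_4, 2))
```

Because x = 4 happens to equal the mean age, the prediction coincides with the mean value 24.63; the least squares line always passes through the point of means.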
5. Statistical inferences about \(\beta_2\)
Learning objective: to learn how to construct a confidence interval
for \(\beta_2\) (the slope of the population regression line); how to test
hypotheses regarding \(\beta_2\).
• The parameter \(\beta_2\) (the slope of the population regression line) gives
the true rate of change in the mean E(y) in response to a unit increase
in the predictor variable x. For every unit increase in x the mean of the
response variable y changes by \(\beta_2\) units, increasing if \(\beta_2 > 0\) and
decreasing if \(\beta_2 < 0\).
• The slope \(\hat{\beta}_2\) of the least squares regression line is a point estimate of
\(\beta_2\).
• A \(100(1-\alpha)\%\) confidence interval for \(\beta_2\) is given by the following
formula:
\[\hat{\beta}_2 \pm t_{\alpha/2} \cdot \frac{s_\varepsilon}{\sqrt{SS_{xx}}}\]
where \(s_\varepsilon = \sqrt{\frac{SSE}{n-2}}\) and the number of degrees of freedom is df = n - 2.
• The statistic \(s_\varepsilon\) is called the sample standard deviation of errors. It
estimates the standard deviation \(\sigma\) of the errors in the population of y-values for each fixed value of x.
Example 7.
Construct the 95% confidence interval for the slope \(\beta_2\) of the
population regression line based on the five-point sample data set.
\[s_\varepsilon = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{0.75}{3}} = 0.5\]
Confidence level 95% means \(\alpha = 1 - 0.95 = 0.05\) so \(\alpha/2 = 0.025\).
From the row labeled df = 3 in Appendix "Critical Values of t" we
obtain \(t_{0.025} = 3.182\).
\[\hat{\beta}_2 \pm t_{\alpha/2} \cdot \frac{s_\varepsilon}{\sqrt{SS_{xx}}} = 0.34375 \pm 3.182 \cdot \frac{0.5}{\sqrt{51.2}} = 0.34375 \pm 0.2223\]
We are 95% confident that the slope \(\beta_2\) of the population
regression line is between 0.1215 and 0.5661.
Example 8.
Using the sample data from Example 6 construct a 90% confidence
interval for the slope \(\beta_2\) of the population regression line relating
age and value of the automobiles. Interpret the result in the
context of the problem.
\[s_\varepsilon = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{28.946}{8}} = 1.902169814\]
Confidence level 90% means \(\alpha = 1 - 0.90 = 0.10\) so \(\alpha/2 = 0.05\). From the
row labeled df = 8 in Appendix "Critical values of t" we obtain \(t_{0.05} = 1.860\).
\[\hat{\beta}_2 \pm t_{\alpha/2} \cdot \frac{s_\varepsilon}{\sqrt{SS_{xx}}} = -2.05 \pm 1.860 \cdot \frac{1.902169814}{\sqrt{14}} = -2.05 \pm 0.95\]
• We are 90% confident that the slope \(\beta_2\) of the population regression line is
between -3.00 and -1.10.
• In the context of the problem this means that for vehicles of this make and
model between two and six years old we are 90% confident that for each
additional year of age the average value of such a vehicle decreases by
between $1,100 and $3,000.
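Both confidence intervals (Examples 7 and 8) follow the same recipe, which can be sketched as a small function. The function name `slope_confidence_interval` is an illustrative assumption, and the critical values of t are passed in by hand from the t table, as in the examples, rather than computed.

```python
import math


def slope_confidence_interval(x, y, t_crit):
    """Confidence interval for the slope, given a critical value t_crit from a t table."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_xx = sum(v * v for v in x) - sum_x ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    ss_yy = sum(v * v for v in y) - sum_y ** 2 / n
    slope = ss_xy / ss_xx
    sse = ss_yy - slope * ss_xy       # sum of squared errors
    s_eps = math.sqrt(sse / (n - 2))  # sample standard deviation of errors
    margin = t_crit * s_eps / math.sqrt(ss_xx)
    return slope - margin, slope + margin


# Example 7: five-point data set, 95% level, df = 3, t = 3.182.
low7, high7 = slope_confidence_interval([2, 2, 6, 8, 10], [0, 1, 2, 3, 3], 3.182)

# Example 8: automobile data, 90% level, df = 8, t = 1.860.
age = [2, 3, 3, 3, 4, 4, 5, 5, 5, 6]
value = [28.7, 24.8, 26.0, 30.5, 23.8, 24.6, 23.8, 20.4, 21.6, 22.1]
low8, high8 = slope_confidence_interval(age, value, 1.860)

print(round(low7, 3), round(high7, 3))
print(round(low8, 2), round(high8, 2))
```

Keeping `t_crit` as an argument avoids any dependency on a statistics library; a different confidence level only requires a different table entry.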
Testing hypotheses about \(\beta_2\)
• \(H_0: \beta_2 = 0\), which corresponds to the situation in which x is not
useful for predicting y.
• Test: \(H_0: \beta_2 = 0\) vs. \(H_a: \beta_2 \neq 0\)
• Standardized test statistic for hypothesis tests concerning the
slope \(\beta_2\) of the population regression line:
\[t_{obs} = \frac{\hat{\beta}_2 - B_0}{s_\varepsilon / \sqrt{SS_{xx}}}\]
where \(B_0\) is a number determined from the statement of the problem.
• The test statistic has Student's t-distribution with df = n - 2 degrees
of freedom.
Example 9. Critical value approach
Test, at the 2% level of significance,
whether the variable x is useful for
predicting y based on the information in
the five-point data set.
• \(H_0: \beta_2 = 0\) vs. \(H_a: \beta_2 \neq 0\), \(\alpha = 0.02\)
• Test statistic:
\[t_{obs} = \frac{\hat{\beta}_2 - B_0}{s_\varepsilon / \sqrt{SS_{xx}}} = \frac{0.34375 - 0}{0.5 / \sqrt{51.2}} = 4.919\]
• t-distribution with n - 2 = 5 - 2 = 3
degrees of freedom;
\(t_{cr} = t_{\alpha/2} = t_{0.01} = 4.541\)
• Compare t-observed and t-critical.
• The test statistic falls in the rejection region
(\(t_{obs} > t_{cr}\)). The decision is to reject \(H_0\), that is, to conclude
that x is useful for predicting y.
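The test in Example 9 can be sketched in code as follows. The function name `slope_t_statistic` is an illustrative assumption, and the critical value 4.541 is the table value quoted in the example rather than something the code computes.

```python
import math


def slope_t_statistic(x, y, b0=0.0):
    """Standardized test statistic for H0: slope = b0 in simple linear regression."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_xx = sum(v * v for v in x) - sum_x ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    ss_yy = sum(v * v for v in y) - sum_y ** 2 / n
    slope = ss_xy / ss_xx
    s_eps = math.sqrt((ss_yy - slope * ss_xy) / (n - 2))
    return (slope - b0) / (s_eps / math.sqrt(ss_xx))


t_obs = slope_t_statistic([2, 2, 6, 8, 10], [0, 1, 2, 3, 3])
t_cr = 4.541  # t_{0.01} for df = 3, taken from the t table as in Example 9

# Two-sided test: reject H0 when |t_obs| exceeds the critical value.
reject_h0 = abs(t_obs) > t_cr
print(round(t_obs, 3), reject_h0)
```

Using the absolute value of the statistic makes the same comparison valid when the estimated slope is negative.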
6. The coefficient of determination
Learning objective: to learn what the coefficient of
determination is, how to compute it, and what it tells us
about the relationship between two variables x and y.
• The coefficient of determination measures the
proportion of the variability in y that is accounted for by
the linear relationship between x and y.
\[R^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = \frac{SS_{xy}^2}{SS_{xx} SS_{yy}} = \hat{\beta}_2 \frac{SS_{xy}}{SS_{yy}}\]
\[R^2 = r^2\]
Example 10.
The value of used vehicles of the make and model discussed in
Example 6 varies widely. The most expensive automobile in the
sample has value $30,500, which is nearly half again as
much as the least expensive one, which is worth $20,400. Find
the proportion of the variability in value that is accounted for
by the linear relationship between age and value.
• The proportion of the variability in value y that is accounted for by the
linear relationship between it and age x is given by the coefficient of
determination, \(R^2\).
\[r = -0.819, \quad R^2 = (-0.819)^2 = 0.671\]
• About 67% of the variability in the value of this vehicle can be
explained by its age.
Example 11.
Use each of the three formulas for the coefficient of determination to
compute its value for the example of ages and values of vehicles.
In Example 6 we computed the exact values:
\(SS_{xx} = 14\); \(SS_{xy} = -28.7\); \(SS_{yy} = 87.781\); \(\hat{\beta}_2 = -2.05\).
Also we know SSE = 28.946.
\[R^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = \frac{87.781 - 28.946}{87.781} = 0.6702475479\]
\[R^2 = \frac{SS_{xy}^2}{SS_{xx} SS_{yy}} = \frac{(-28.7)^2}{(14)(87.781)} = 0.6702475479\]
\[R^2 = \hat{\beta}_2 \frac{SS_{xy}}{SS_{yy}} = -2.05 \cdot \frac{-28.7}{87.781} = 0.6702475479\]
• The coefficient of determination \(R^2\) can always be computed by squaring
the correlation coefficient r if it is known.
• What should be avoided is trying to compute r by taking the square root
of \(R^2\), if it is already known, since it is easy to make a sign error this way.
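The agreement of the three formulas in Example 11 can be confirmed numerically from the raw data of Example 6. This is an illustrative sketch; the variable names are chosen here for readability.

```python
# Age (years) and retail value ($ thousands) from Example 6.
age = [2, 3, 3, 3, 4, 4, 5, 5, 5, 6]
value = [28.7, 24.8, 26.0, 30.5, 23.8, 24.6, 23.8, 20.4, 21.6, 22.1]

n = len(age)
ss_xx = sum(a * a for a in age) - sum(age) ** 2 / n
ss_xy = sum(a * v for a, v in zip(age, value)) - sum(age) * sum(value) / n
ss_yy = sum(v * v for v in value) - sum(value) ** 2 / n
slope = ss_xy / ss_xx
sse = ss_yy - slope * ss_xy

r2_from_sse = (ss_yy - sse) / ss_yy       # (SSyy - SSE) / SSyy
r2_from_ss = ss_xy ** 2 / (ss_xx * ss_yy)  # SSxy^2 / (SSxx * SSyy)
r2_from_slope = slope * ss_xy / ss_yy      # slope * SSxy / SSyy

print(round(r2_from_sse, 6), round(r2_from_ss, 6), round(r2_from_slope, 6))
```

Note that the third formula uses the slope estimate, not the intercept; with the slope -2.05 and \(SS_{xy} = -28.7\) the two negative signs cancel, so \(R^2\) is positive.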
7. Estimation and prediction
Learning objective: to learn the distinction between estimation and
prediction; the distinction between a confidence interval and a
prediction interval; to learn how to implement formulas for
computing confidence intervals and prediction intervals.
Example 12.
Consider the following pairs of problems in the context of the least
squares regression line, the automobile age and value example
(Example 6).
• 1a. Estimate the average value of all four-year-old automobiles of this
make and model.
• 1b. Construct a 95% confidence interval for the average value of all
four-year-old automobiles of this make and model.
• 2a. Ahmad intends to buy a four-year-old automobile of this make and
model next week. Predict the value of the first such automobile that
he encounters.
• 2b. Construct a 95% confidence interval for the value of the first such
automobile that he encounters.
Problems 1a and 2a have the same answer:
\[\hat{y} = -2.05(4) + 32.83 = 24.63\]
Comments on problems 1b and 2b:
• In the first case we seek to construct a confidence interval in the
same sense that we have done before.
• In the second case the interval constructed has a different name,
prediction interval. In this case we are trying to "predict" where
a random variable will take its value.
Confidence interval and prediction interval
\(100(1-\alpha)\%\) confidence interval
for the mean value of y at \(x = x_p\):
\[\hat{y}_p \pm t_{\alpha/2} \, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}\]
\(100(1-\alpha)\%\) prediction interval for an
individual new value of y at \(x = x_p\):
\[\hat{y}_p \pm t_{\alpha/2} \, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}\]
where
a. \(x_p\) is a particular value of x that lies in the range of x-values in the sample data set used to construct the least
squares regression line;
b. \(\hat{y}_p\) is the numerical value obtained when the least squares
regression equation is evaluated at \(x = x_p\);
c. the number of degrees of freedom for \(t_{\alpha/2}\) is df = n - 2.
Example 13.
Using the sample data on age and value of used automobiles of a
specific make and model (Example 6), construct a 95% confidence
interval for the average value of all three-and-one-half-year-old
automobiles of this make and model.
Here \(x_p = 3.5\), \(\hat{y}_p = -2.05(3.5) + 32.83 = 25.655\), \(s_\varepsilon = 1.902169814\), and \(t_{0.025} = 2.306\) for df = 8.
\[\hat{y}_p \pm t_{\alpha/2} \, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} = 25.655 \pm (2.306)(1.902169814) \sqrt{\frac{1}{10} + \frac{(3.5 - 4)^2}{14}} = 25.655 \pm 1.506\]
We are 95% confident that the average value of all three-and-one-half-year-old
vehicles of this make and model is between $24,149 and $27,161.
Example 14.
Using the sample data on age and value of used automobiles of a
specific make and model (Example 6), construct a 95% prediction
interval for the predicted value of a randomly selected three-and-one-half-year-old
automobile of this make and model.
Here \(x_p = 3.5\), \(\hat{y}_p = 25.655\), \(s_\varepsilon = 1.902169814\), and \(t_{0.025} = 2.306\) for df = 8.
\[\hat{y}_p \pm t_{\alpha/2} \, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} = 25.655 \pm (2.306)(1.902169814) \sqrt{1 + \frac{1}{10} + \frac{(3.5 - 4)^2}{14}} = 25.655 \pm 4.638\]
We are 95% confident that the value of a randomly selected three-and-one-half-year-old
vehicle of this make and model is between $21,017 and $30,293.
Note what an enormous difference the presence of the extra number 1
under the square root sign made. The prediction interval is about three
times as wide as the confidence interval at the same level of
confidence.
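The two intervals of Examples 13 and 14 differ only by the extra 1 under the square root, which a short function makes explicit. The function name `mean_and_prediction_intervals` is an illustrative assumption, and the critical value 2.306 (the standard t-table entry for \(t_{0.025}\) with df = 8) is hard-coded rather than computed.

```python
import math


def mean_and_prediction_intervals(x, y, x_p, t_crit):
    """Return (confidence interval for the mean of y, prediction interval for a new y) at x_p."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xx = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ss_yy = sum(v * v for v in y) - sum(y) ** 2 / n
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    s_eps = math.sqrt((ss_yy - slope * ss_xy) / (n - 2))
    y_p = slope * x_p + intercept
    core = 1 / n + (x_p - x_bar) ** 2 / ss_xx
    ci_margin = t_crit * s_eps * math.sqrt(core)      # mean of y at x_p
    pi_margin = t_crit * s_eps * math.sqrt(1 + core)  # individual new y at x_p
    return (y_p - ci_margin, y_p + ci_margin), (y_p - pi_margin, y_p + pi_margin)


age = [2, 3, 3, 3, 4, 4, 5, 5, 5, 6]
value = [28.7, 24.8, 26.0, 30.5, 23.8, 24.6, 23.8, 20.4, 21.6, 22.1]

# t_{0.025} with df = 8, taken from a standard t table.
ci, pi = mean_and_prediction_intervals(age, value, x_p=3.5, t_crit=2.306)
print([round(v, 3) for v in ci])
print([round(v, 3) for v in pi])
```

Dividing the width of the prediction interval by the width of the confidence interval gives a ratio of roughly 3 for these data.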
Homework
• Visit instructor's web-page on Econometrics:
www.alveronika.wordpress.com
• Optional but useful: Practice with applet (correlation and
regression): http://mih5.github.io/statapps/