Chapter 10 - Simple Linear Regression
Want to predict the assessed value of a house in Ames.
Select a random sample of n houses and estimate the population mean of assessed value (using methods in Ch. 7) and use it for prediction.
A better method uses other information about the houses used by the assessor (e.g., sq. ft. of floor space, age of the house, location, etc.).
If we have a data set that has values for the variables assessed value (y), floor space (\(x_1\)), age (\(x_2\)), and location (\(x_3\)), we can develop a relationship between y and the x's that will allow us to predict what the assessed value of a house will be, given the set of observed values of the other variables for a particular house.
Chapter 10 covers the simplest situation: that of relating two variables, y and x.
Suppose we want to model monthly sales revenue (y) of an appliance store as a function of monthly advertising expenditure (x).
Does an exact relationship exist between these two variables?
Suppose an exact relationship exists in which the monthly sales revenue is exactly 15 times the monthly advertising expenditure, i.e.,
\[ y = 15x \]
This is called a deterministic model. However, such a relationship is not realistic, because other factors that affect sales revenue are not measured, and we have to allow for them in the model.
To allow for the unexplained variation in monthly sales due to unincluded variables or random phenomena, we introduce the following model:
\[ y = 15x + \text{random error} \]
This is called a probabilistic model, where we always assume that the mean of the random error component is 0.
The simplest probabilistic model is the straight-line regression model:
\[ y = \beta_0 + \beta_1 x + \varepsilon \]
where
y = dependent or response variable
x = independent or predictor variable
\(E(y) = \beta_0 + \beta_1 x\) = deterministic component
\(\varepsilon\) = random error component
\(\beta_0\) = intercept of the true line
\(\beta_1\) = slope of the true line.
Once a straight-line model has been hypothesized, sample data must be collected on the variables x and y.
Then we use the sample data to estimate the unknown parameters in the model: the intercept \(\beta_0\) and the slope \(\beta_1\).
It is helpful to obtain a scatterplot of y vs. x, to determine if our hypothesis is plausible.
You may eyeball a straight line through the points, and obtain the values of the intercept and the slope of that line.
The eye-balled line is
\[ \hat{y} = -1 + x \]
So for this line \(\beta_0 = -1\) and \(\beta_1 = 1\). But this line may not be the "best" line for predicting y values.
To see this, first calculate the SSE for the eye-balled line: the sum of squares of errors (SSE) for the eye-balled line is 2.0. However, we can find another line for which the SSE is a minimum. This line is called the least squares line or the regression line.
Obtain the line that minimizes the sum of squared deviations (SSE) of the values predicted by the model for y (denoted by \(\hat{y}\)) from the actual (observed) y's.
This line is the "best" in the sense that it minimizes the SSE.
We would like to estimate values for \(\beta_0\) and \(\beta_1\) which minimize the SSE. These are called the least squares estimates.
The least squares estimates of the unknown intercept and slope parameters are denoted by \(\hat\beta_0\) and \(\hat\beta_1\).
The fitted line is then denoted by
\[ \hat{y} = \hat\beta_0 + \hat\beta_1 x \]
and is called the least squares line.
From this equation, we can calculate values for \(\hat{y}\) corresponding to the values of x. These are called the predicted values (or fitted values).
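The n data points for a straight-line model are denoted by \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\) or simply by \((x_i, y_i)\) for each \(i = 1, \ldots, n\).
Thus the predicted values are given by
\[ \hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i \quad \text{for each } i = 1, \ldots, n \]
The sum of squares of the deviations is then
\[ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left[ y_i - \left( \hat\beta_0 + \hat\beta_1 x_i \right) \right]^2 \]
The least squares estimates \(\hat\beta_0\) and \(\hat\beta_1\) are
\[ \hat\beta_1 = \frac{SS_{xy}}{SS_{xx}}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} \]
where
\[ SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} \]
\[ SS_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n} \]
From the above calculations, we have
\[ SS_{xy} = 37 - \frac{(15)(10)}{5} = 37 - 30 = 7 \]
\[ SS_{xx} = 55 - \frac{(15)^2}{5} = 55 - 45 = 10 \]
\[ \hat\beta_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{7}{10} = .7 \]
\[ \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = \frac{10}{5} - (.7)\left(\frac{15}{5}\right) = 2 - 2.1 = -.1 \]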
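The arithmetic above can be re-traced in a few lines. This is a minimal sketch that uses only the summary sums quoted in the calculation (n = 5, Σx = 15, Σy = 10, Σxy = 37, Σx² = 55); the variable names are illustrative and not part of the original notes.

```python
# Least squares estimates from the summary sums quoted in the text.
n = 5
sum_x, sum_y = 15.0, 10.0      # sum of x_i and sum of y_i
sum_xy, sum_x2 = 37.0, 55.0    # sum of x_i*y_i and sum of x_i^2

ss_xy = sum_xy - sum_x * sum_y / n   # 37 - 30 = 7
ss_xx = sum_x2 - sum_x ** 2 / n      # 55 - 45 = 10

b1 = ss_xy / ss_xx                   # estimated slope
b0 = sum_y / n - b1 * sum_x / n      # estimated intercept: y-bar - b1 * x-bar

print(f"SSxy = {ss_xy:.1f}, SSxx = {ss_xx:.1f}")
print(f"slope = {b1:.2f}, intercept = {b0:.2f}")
```

Working from the sums rather than from raw data mirrors the shortcut formulas for SSxy and SSxx given above.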
First, note that the SSE = 1.10 calculated from the least squares line is less than the SSE = 2.0 of the eye-balled line.

Exercise 10.19
Bivariate Fit of Retal Index By Salary

Retal Index   Salary ($)
301           62000
550           36500
755           21600
327           24000
500           30100
377           35000
290           47500
452           54000
535           19800
455           44000
615           46600
700           15100
650           70000
630           21000
360           16900

[Figure: scatterplot of Retal Index (200 to 800) against Salary (10000 to 70000) with the linear fit]

Linear Fit
Retal Index = 569.58007 - 0.0019237 Salary

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   569.58007   93.99729    6.06      <.0001
Salary      -0.001924   0.002356    -0.82     0.4289

Summary of Fit
RSquare                      0.048783
RSquare Adj                  -0.02439
Root Mean Square Error       151.5919
Mean of Response             499.8
Observations (or Sum Wgts)   15

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      1    15320.93         15320.9       0.6667    0.4289
Error      13   298741.47        22980.1
C. Total   14   314062.40

Exercise 10.18
Bivariate Fit of Gasoline(cents/gal.) By Crude Oil($/bbl.)

Year   Gasoline (cents/gal.)   Crude Oil ($/bbl.)
1975   57                      10.38
1976   59                      10.89
1977   62                      11.96
1978   63                      12.46
1979   86                      17.72
1980   119                     28.07
1981   131                     35.24
1982   122                     31.87
1983   116                     28.99
1984   113                     28.63
1985   112                     26.75
1986   86                      14.55
1987   90                      17.90
1988   90                      14.67
1989   100                     17.97
1990   115                     22.23
1991   72                      16.54
1992   71                      15.99
1993   75                      14.24
1994   67                      13.21
1995   63                      14.63
1996   72                      18.56

[Figure: scatterplot of Gasoline(cents/gal.) (50 to 140) against Crude Oil($/bbl.) (10 to 40) with the linear fit]

Linear Fit
Gasoline(cents/gal.) = 30.134836 + 3.0181453 Crude Oil($/bbl.)

Parameter Estimates
Term                Estimate    Std Error   t Ratio   Prob>|t|   Lower 95%   Upper 95%
Intercept           30.134836   5.454029    5.53      <.0001     18.757931   41.511741
Crude Oil($/bbl.)   3.0181453   0.265423    11.37     <.0001     2.464482    3.5718085

Summary of Fit
RSquare                      0.866043
RSquare Adj                  0.859345
Root Mean Square Error       8.956909
Mean of Response             88.22727
Observations (or Sum Wgts)   22

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio    Prob > F
Model      1    10373.339        10373.3       129.3011   <.0001
Error      20   1604.524         80.2
C. Total   21   11977.864

[Figure: Plot of Residuals vs. Predicted, Residuals Gasoline(cents/gal.) (-15 to 20) against Predicted Gasoline(cents/gal.) (60 to 140)]
[Figure: Plot of Residuals vs. Year, Residuals Gasoline(cents/gal.) (-15 to 20) against Year (1970 to 2000)]
It is important to interpret the slope and intercept of the least squares line, \(\hat\beta_1\) and \(\hat\beta_0\), relative to the problem and the data used for the estimation.
The slope \(\hat\beta_1 = .7\) implies that for every unit increase in the value of x the expected value (or mean value) of y is predicted to increase by .7 units.
In this example, for every $100 increase in advertising the mean sales revenue is predicted to increase by
.7 × $1000 = $700
for the range of values of advertising expenditure in the data, i.e., from $100 to $500.
The intercept of the least squares line, \(\hat\beta_0 = -.1\), seems to say that if the advertising expenditure, x, were equal to $0, the expected (or mean) sales revenue would be -.1 × $1000 = -$100. However, since an advertising expenditure of $0 is not in the range from $100 to $500 used for estimating the least squares line, this interpretation is not valid.
The moral: interpretation of the model parameter estimates must be made only within the range of values of the predictor x used in the computation of the least squares line.
Model Assumptions
Model: \(y = \beta_0 + \beta_1 x + \varepsilon\)
We stated that \(E(y) = \beta_0 + \beta_1 x\) is the deterministic component and that \(\varepsilon\) is the random component of the model.
We may call \(\beta_0 + \beta_1 x\) the mean of y at a specified x.
The deterministic component is the equation of a straight line.
Making statistical inferences from the fitted model requires us to specify the probability distribution of the random error \(\varepsilon\).
Assumption 1: The mean of the distribution of \(\varepsilon\) is 0.
Assumption 2: The variance of the distribution of \(\varepsilon\) is \(\sigma^2\) and is constant for all values of x.
Assumption 3: \(\varepsilon\) has the \(N(0, \sigma^2)\) distribution.
Assumption 4: The random errors from fitting the model to n pairs of data \((x_i, y_i)\), \(i = 1, \ldots, n\), i.e., \(\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n\), are a random sample from the \(N(0, \sigma^2)\) distribution.
Note: An implication of these assumptions is that y has a normal distribution with mean \(\beta_0 + \beta_1 x\) and variance \(\sigma^2\).
These assumptions allow us to construct confidence intervals for the least squares estimators and develop hypothesis tests for examining the usefulness of the least squares lines.
We have already given the formulas for calculating the least squares estimates \(\hat\beta_0\) and \(\hat\beta_1\) of the intercept and slope parameters.
What is a good estimate of \(\sigma^2\) or \(\sigma\)?
The best estimate \(s^2\) of \(\sigma^2\) can be obtained from the results of fitting the straight-line model:
\[ s^2 = \frac{\text{Sum of Squares for Error}}{\text{Degrees of freedom for Error}} = \frac{SSE}{n-2} \]
where
\[ SSE = \sum (y_i - \hat{y}_i)^2 = SS_{yy} - \hat\beta_1 SS_{xy} \]
\[ SS_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n} \]
It follows that the estimate s of \(\sigma\) is
\[ s = \sqrt{s^2} = \sqrt{\frac{SSE}{n-2}} \]
For the advertising expenditure-sales revenue example, the least squares line was:
\[ \hat{y} = -.1 + .7x \]
Recall that n = 5 and SSE = 1.10 from that example. Thus we have:
\[ s^2 = \frac{SSE}{n-2} = \frac{1.10}{3} = .367 \]
as the estimate of \(\sigma^2\), and \(s = \sqrt{.367} = .61\) is the standard error of the regression model.
Interpretation of s
s measures the spread of the y-values around the least squares line at any value of x. Therefore we can expect most y values to lie within 2s of \(\hat{y}\).
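As a quick check of the estimate of \(\sigma\), the following sketch recomputes \(s^2\) and \(s\) from the values quoted above (n = 5, SSE = 1.10). The variable names are illustrative assumptions, not part of the original notes.

```python
import math

n = 5        # number of data points in the advertising-sales example
sse = 1.10   # sum of squares for error from the least squares line

s2 = sse / (n - 2)   # estimate of sigma^2, with n - 2 degrees of freedom
s = math.sqrt(s2)    # estimate of sigma (standard error of the regression)

print(f"s^2 = {s2:.3f}, s = {s:.2f}")
print(f"most y values are expected within +/- {2 * s:.2f} of the fitted line")
```

The divisor is n - 2 rather than n because two parameters, the intercept and the slope, were estimated from the data.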
Inferences about the slope \(\beta_1\)
Again consider the model
\[ y = \beta_0 + \beta_1 x + \varepsilon \]
Recall that the mean of y for a given x is
\[ E(y) = \beta_0 + \beta_1 x \]
By looking at this we can see how the straight-line model makes the mean of y, and therefore the prediction of y, depend on x.
If \(\beta_1 = 0\) in the above model then x will have no effect on the prediction of y using the above model.
Therefore the test of the null hypothesis that x contributes no information to the prediction, against the alternative that the above model is useful for predicting y, is equivalent to testing
\[ H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0 \]
If the data support \(H_a\), the alternative hypothesis, then we will conclude that x contributes information for the prediction of y through the above straight-line model.

Sampling Distribution of \(\hat\beta_1\)
If our assumptions about the regression model hold, then the sampling distribution of \(\hat\beta_1\) is normal with mean \(\beta_1\) and standard deviation \(\sigma_{\hat\beta_1}\), where
\[ \sigma_{\hat\beta_1} = \frac{\sigma}{\sqrt{SS_{xx}}} \]
Since \(\sigma\) is unknown, we estimate it by s. Therefore an estimate of \(\sigma_{\hat\beta_1}\) is given by \(s_{\hat\beta_1}\), where
\[ s_{\hat\beta_1} = \frac{s}{\sqrt{SS_{xx}}} \]
\(s_{\hat\beta_1}\) is called the estimated standard error of the least squares slope \(\hat\beta_1\).

A t-test for \(\beta_1\)
The t-statistic for testing \(H_0: \beta_1 = 0\) vs. \(H_a: \beta_1 \neq 0\) is:
\[ t = \frac{\hat\beta_1 - 0}{s_{\hat\beta_1}} \]
where \(s_{\hat\beta_1} = s / \sqrt{SS_{xx}}\).
Rejection region:
\[ |t| > t_{\alpha/2,\, n-2} \]
where \(t_{\alpha/2,\, n-2}\) is the critical value from the t-table based on n - 2 degrees of freedom.
Example:
In the advertising-sales example, for testing
\[ H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0 \]
the t-statistic is computed to be
\[ t = \frac{\hat\beta_1}{s_{\hat\beta_1}} = \frac{\hat\beta_1}{s / \sqrt{SS_{xx}}} = \frac{.7}{.61 / \sqrt{10}} = 3.7 \]
and \(t_{\alpha/2,\, n-2} = t_{.025,\, 3} = 3.182\) for \(\alpha = .05\).
Thus the rejection region is
\[ |t| > 3.182 \]
and since the calculated t-value falls in the rejection region, we reject \(H_0\) and conclude that the slope \(\beta_1\) is not zero.
This implies that the variable x (advertising expenditure) does contribute to the prediction of y (sales revenue) using the straight-line model.
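The t-test for the advertising-sales example can be sketched as follows. The critical value 3.182 is taken from the text (a t-table lookup with 3 degrees of freedom), not computed; the variable names are illustrative assumptions.

```python
import math

# Quantities quoted in the advertising-sales example.
n = 5
b1 = 0.7          # least squares slope
sse = 1.10        # sum of squares for error
ss_xx = 10.0      # SSxx
t_crit = 3.182    # t_{.025, 3} from the t-table, as quoted in the text

s = math.sqrt(sse / (n - 2))       # estimate of sigma
se_b1 = s / math.sqrt(ss_xx)       # estimated standard error of the slope
t_stat = (b1 - 0) / se_b1          # test statistic for H0: beta1 = 0

reject_h0 = abs(t_stat) > t_crit   # two-sided rejection rule
print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```

Using the unrounded s gives t of about 3.66, which rounds to the 3.7 reported above.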
When using computer software for the analysis of regression data, we can reach the same conclusion by using the observed significance level, or p-value, computed by the program. In the JMP regression output this value is given under the column headed Prob>|t|, in the table giving the parameter estimates.
Comparing \(|t|\) to \(t_{\alpha/2}\) for the two-sided test is equivalent to comparing \(\alpha\) to the computed p-value.
Example:
Exercise 10.18
Bivariate Fit of Gasoline(cents/gal.) By Crude Oil($/bbl.)
Linear Fit
Gasoline(cents/gal.) = 30.134836 + 3.0181453 Crude Oil($/bbl.)

Parameter Estimates
Term                Estimate    Std Error   t Ratio   Prob>|t|   Lower 95%   Upper 95%
Intercept           30.134836   5.454029    5.53      <.0001     18.757931   41.511741
Crude Oil($/bbl.)   3.0181453   0.265423    11.37     <.0001     2.464482    3.5718085

Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Ratio    Prob > F
Model    1    10373.339        10373.3       129.3011   <.0001
Error    20   1604.524         80.2

Here the p-value is given as < .0001. Thus it is smaller than the significance level \(\alpha = .05\). Therefore we reject the null hypothesis \(H_0: \beta_1 = 0\) and conclude that the slope \(\beta_1\) is not zero.
It is obvious from the plot below that the slope is not zero.
[Figure: Bivariate Fit of Gasoline(cents/gal.) By Crude Oil($/bbl.), scatterplot of Gasoline(cents/gal.) (50 to 140) against Crude Oil($/bbl.) (10 to 40) with the fitted line]

Look at the JMP output for the Whistle Blower example (Exercise 10.19) again:
Bivariate Fit of Retal Index By Salary

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   569.58007   93.99729    6.06      <.0001
Salary      -0.001924   0.002356    -0.82     0.4289

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      1    15320.93         15320.9       0.6667    0.4289
Error      13   298741.47        22980.1
C. Total   14   314062.40

Here the p-value is .429, which is larger than the significance level \(\alpha = .05\). Thus we fail to reject the null hypothesis \(H_0: \beta_1 = 0\) and conclude that there is not sufficient evidence that salary contributes to predicting the retaliation index using a straight-line model.

A 100(1 - \(\alpha\))% confidence interval for \(\beta_1\)
\[ \hat\beta_1 \pm t_{\alpha/2,\, n-2}\; s_{\hat\beta_1} \]
where \(s_{\hat\beta_1} = s / \sqrt{SS_{xx}}\) and \(t_{\alpha/2,\, n-2}\) is the critical value from the t-table based on n - 2 degrees of freedom.
Example:
For the advertising expenditure-sales revenue example
\[ \hat\beta_1 \pm t_{.025,\, 3}\; s_{\hat\beta_1} = .7 \pm 3.182 \left( \frac{.61}{\sqrt{10}} \right) = .7 \pm .61 \]
Thus the interval estimate for the slope parameter \(\beta_1\) is (.09, 1.31).
Interpretation: We are 95% confident that the true mean increase in monthly sales revenue per additional $100 of advertising expenditure is between $90 and $1,310.
Also, since zero is not included in this interval, we can use this interval to conclude that \(\beta_1\) is not zero.
This interval is rather wide.
The reason is that the sample size is too small to be able to estimate \(\beta_1\) with more accuracy. As we have already seen, one way to increase the accuracy of an estimate is to increase the sample size.
The Coefficient of Correlation
Consider n observations of a pair of variables (x, y) measured on observational (or experimental) units.
Definition
The Pearson product moment coefficient of correlation, r, is a measure of the strength of the linear relationship between two variables x and y. It is computed from a sample of n measurements on x and y as follows:
\[ r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} \]
Some properties of r
- r is scaleless (or unitless).
- r takes a value between -1 and +1.
- r = 0 implies that a linear relationship does not exist between x and y.
- The closer r comes to ±1, the stronger the linear relationship between x and y.
- Positive r implies a positive relationship, negative r implies a negative relationship.
- r = ±1 implies that an exact linear relationship exists between x and y.
Since \(\hat\beta_1 = SS_{xy} / SS_{xx}\) (the slope of the least squares line) has the same numerator as r, and both denominators are positive,
r = 0 when \(\hat\beta_1 = 0\)
r > 0 when \(\hat\beta_1 > 0\)
r < 0 when \(\hat\beta_1 < 0\)
Example
For the advertising-sales example \(SS_{xy} = 7\), \(SS_{xx} = 10\), and \(SS_{yy} = 6\), giving
\[ r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} = \frac{7}{\sqrt{10 \cdot 6}} = .904 \]
which indicates a strong positive linear relationship between advertising and sales, implying that sales revenue increases as advertising expenditure increases (for these 5 months).
Population correlation coefficient
The sample correlation coefficient r is a sample statistic that is an estimate of the corresponding population correlation coefficient \(\rho\).
\(\rho\) is a parameter of the bivariate population distribution of (x, y). So we can make statistical inferences about \(\rho\) using r and its sampling distribution if we wish. These would involve confidence intervals and hypothesis tests about \(\rho\).
The information that r provides about the least squares line is identical to that provided by the slope \(\hat\beta_1\). So in the case of the straight-line model we will make inferences about the model using the sampling distribution of \(\hat\beta_1\) (instead of r). In fact, we have already done so.
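The correlation computed in the example, and the sign relationship between r and the least squares slope, can be checked with a short sketch. It uses only the three sums of squares quoted above; the variable names are illustrative assumptions.

```python
import math

# Sums of squares quoted for the advertising-sales example.
ss_xy, ss_xx, ss_yy = 7.0, 10.0, 6.0

r = ss_xy / math.sqrt(ss_xx * ss_yy)   # Pearson correlation coefficient
b1 = ss_xy / ss_xx                     # least squares slope

# r and b1 share the numerator SSxy and have positive denominators,
# so they always carry the same sign.
same_sign = (r > 0) == (b1 > 0)
print(f"r = {r:.3f}, slope = {b1:.2f}, same sign: {same_sign}")
```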
The Coefficient of Determination
This measures the contribution of x in predicting y.
If we assume \(y \sim N(\mu, \sigma^2)\), the variability in y is measured by
\[ SS_{yy} = \sum (y_i - \bar{y})^2 \]
This is called the total sample variation.
If the straight-line model is correct (i.e., if x contributes to the prediction of y) then \(y \sim N(\beta_0 + \beta_1 x, \sigma^2)\) and the variability in y is measured by
\[ SSE = \sum (y_i - \hat{y}_i)^2 \]
If \(\hat\beta_1 = 0\), then \(SSE = SS_{yy}\).
If \(\hat\beta_1 \neq 0\), then \(SSE < SS_{yy}\).
Thus \(SS_{yy} - SSE\) is the reduction in the variability of y attributable to x. The larger it is, that is, the smaller SSE is, the larger the contribution of x.
Usually this reduction in the variability of y is expressed as a proportion of the total sample variation:
\[ \frac{SS_{yy} - SSE}{SS_{yy}} \]
This is the proportion of the total sample variability explained by the fitted regression model.
It can be shown that in the simple linear regression (straight-line) model, this proportion is the same as \(r^2\), where r = coefficient of correlation. That is,
\[ r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}} \]
Since r is in the range \(-1 \le r \le 1\), \(r^2\) is in the range \(0 \le r^2 \le 1\).
Interpretation of \(r^2\), the coefficient of determination
If \(r^2 = .60\), it means that we are doing 60% better by using \(\hat{y}\) to predict the mean of y, than just using the sample mean \(\bar{y}\) to predict the mean of y.
Example:
In the advertising-sales example,
\[ SS_{yy} = 6.0, \qquad SSE = 1.10 \]
Thus, the coefficient of determination is:
\[ r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = \frac{6.0 - 1.1}{6.0} = .82 \]
We could have calculated this by just squaring the correlation coefficient r = .904 we obtained earlier:
\[ r^2 = (.904)^2 = .82 \]
Thus 82% of the sample variation in sales revenue (y) is explained by using advertising expenditure (x) in a straight-line model to predict y. Thus this is a "fairly good" model for predicting y.
In the JMP output of Exercise 10.19 the value for the coefficient of determination is reported as a percentage:
R-Sq = 4.9%
meaning \(r^2 = .049\). Thus less than 5% of the sample variation in retaliation index (y) can be explained by using salary (x) in a straight-line model to predict y. Thus this is not an adequate model for predicting y based on x.
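The claim that the explained proportion equals \(r^2\) follows in a few steps from two identities stated earlier in these notes, \(SSE = SS_{yy} - \hat\beta_1 SS_{xy}\) and \(\hat\beta_1 = SS_{xy}/SS_{xx}\). A short derivation, offered as a sketch:

```latex
\begin{align*}
\frac{SS_{yy} - SSE}{SS_{yy}}
  &= \frac{SS_{yy} - \left( SS_{yy} - \hat\beta_1 SS_{xy} \right)}{SS_{yy}}
   = \frac{\hat\beta_1 \, SS_{xy}}{SS_{yy}} \\
  &= \frac{SS_{xy}}{SS_{xx}} \cdot \frac{SS_{xy}}{SS_{yy}}
   = \frac{SS_{xy}^2}{SS_{xx} \, SS_{yy}}
   = \left( \frac{SS_{xy}}{\sqrt{SS_{xx} \, SS_{yy}}} \right)^{2}
   = r^2 .
\end{align*}
```

For the advertising-sales example this gives \(7^2 / (10 \cdot 6) = 49/60 \approx .82\), in agreement with the value computed from SSE.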
Construction of an Analysis of Variance Table
An Analysis of Variance (ANOVA) Table is a way of organizing computed information about a fitted model.
We can partition the total sum of squares \(\sum (y_i - \bar{y})^2\) as follows:
\[ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \]
\[ SSTot = SSE + SSR \]
Total SS = Error SS + Regression SS
SSTot: measures "the total amount of variation of the \(y_i\)'s about \(\bar{y}\)"
SSE: measures "the total amount of variation of the \(y_i\)'s about the \(\hat{y}_i\)'s, i.e., the residual variation"
SSR: measures "the total amount of variation of the \(\hat{y}_i\)'s about \(\bar{y}\), i.e., the variation of the fitted regression line"
Properties of SSTot, SSR, and SSE
1. For a given data set, SSTot is always constant.
2. If SSE increases, SSR decreases, and vice versa.
3. The best model minimizes SSE and maximizes SSR.
The ANOVA table for the model \(y = \beta_0 + \beta_1 x + \varepsilon\) is:

Source       df    SS      MS
Regression   1     SSR     MSR
Error        n-2   SSE     MSE
Total        n-1   SSTot

Example:
For the Advertising Expenditure - Sales Revenue Example we have
\[ SSTot = SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 26 - \frac{(10)^2}{5} = 26 - 20 = 6 \]
SSE = 1.10 (from previous calculations)
The results of these computations can be summarized in an ANOVA table:

Source       df   SS     MS
Regression   1    4.90   4.9
Error        3    1.10   0.36667
Total        4    6.00

Example: A car dealer is interested in modeling the relationship between the number of cars sold by the firm each week (y) and the average number of salespeople who work on the showroom floor per day during the week (x).

i       y_i   x_i   y_i^2   x_i^2   x_i y_i
1       20    6     400     36      120
2       18    6     324     36      108
3       10    4     100     16      40
4       6     2     36      4       12
5       11    3     121     9       33
Total   65    21    981     101     313

1. Fit the simple linear regression model by least squares:
\[ SS_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 101 - \frac{(21)^2}{5} = 12.8 \]
\[ SS_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 313 - \frac{(21)(65)}{5} = 40.0 \]
\[ \hat\beta_1 = SS_{xy} / SS_{xx} = 40.0 / 12.8 = 3.125 \]
\[ \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = \frac{65}{5} - (3.125)\left(\frac{21}{5}\right) = -0.125 \]
Fitted regression line: \(\hat{y} = -0.125 + 3.125x\)
2. Construct the ANOVA Table.
\[ SSTot = SSE + SSR \]
\[ SSTot = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = SS_{yy} = 981 - \frac{(65)^2}{5} = 136.00 \]
\[ SSR = \frac{(SS_{xy})^2}{SS_{xx}} = \frac{(40)^2}{12.8} = 125.00 \]
\[ SSE = SSTot - SSR = 136 - 125 = 11.00 \]

Source       df   SS       MS
Regression   1    125.00   125.00
Error        3    11.00    3.66667
Total        4    136.00
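The car dealer ANOVA table can be rebuilt directly from the five data pairs in the table above. This is a minimal sketch in plain Python; it computes SSE and SSR from their definitions (rather than the shortcut formulas) so that the partition SSTot = SSE + SSR is checked rather than assumed. The variable names are illustrative.

```python
# Car dealer data from the table above: x = salespeople, y = cars sold.
x = [6, 6, 4, 2, 3]
y = [20, 18, 10, 6, 11]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = ss_xy / ss_xx            # least squares slope
b0 = y_bar - b1 * x_bar       # least squares intercept
fitted = [b0 + b1 * xi for xi in x]

ss_tot = sum((yi - y_bar) ** 2 for yi in y)               # total variation
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))    # residual variation
ssr = sum((fi - y_bar) ** 2 for fi in fitted)             # regression variation

msr = ssr / 1                 # regression mean square (1 df)
mse = sse / (n - 2)           # error mean square (n - 2 df)

print(f"{'Source':<12}{'df':>4}{'SS':>10}{'MS':>12}")
print(f"{'Regression':<12}{1:>4}{ssr:>10.2f}{msr:>12.5f}")
print(f"{'Error':<12}{n - 2:>4}{sse:>10.2f}{mse:>12.5f}")
print(f"{'Total':<12}{n - 1:>4}{ss_tot:>10.2f}")
```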
Using Fitted Model for Estimation and Prediction
Two types of inferences from the fitted model:
- Estimating the mean value \(E(y) = \beta_0 + \beta_1 x\) for a specific value of x.
- Predicting a new y value for a given value of x.
Example:
Advertising Expenditure - Sales Revenue Example:
- Estimate the mean sales revenue for months for which the advertising expenditure was $400 (i.e., x = 4 in the problem).
- If we decide to spend $400 on advertising next month, what does the model predict to be the sales revenue?
The statistical inferences made are different:
- In the first case we want to estimate the mean of the population of values of y at a given value of x.
- In the second case we want to predict a single value y at a specified x value.
Example:
In the Advertising-Sales example, the least squares prediction equation was
\[ \hat{y} = -.1 + .7x \]
We can use this equation for doing both of the above inferences.
First, note that \(E(y) = \beta_0 + \beta_1 x\) is the mean value of y at a given value of x.
Since \(\hat\beta_0 + \hat\beta_1 x\) is an estimate of \(\beta_0 + \beta_1 x\), an estimate of this mean value E(y) is \(\hat{y} = \hat\beta_0 + \hat\beta_1 x\).
For example, the estimated mean sales revenue for all months when x = 4 (i.e., advertising expenditure = $400) is given by
\[ \hat{y} = -.1 + .7(4) = 2.7 \]
i.e., $2700.
On the other hand, \(\hat{y} = \hat\beta_0 + \hat\beta_1 x\) is also the predicted value of y at a given value of x.
Thus if we plan to spend $400 on advertising next month, we can predict sales revenue to be $2700.
Obviously, there is a difference between the two cases.
The difference lies in the accuracy of the estimate \(\hat{y}\) and the predictor \(\hat{y}\). This is reflected in the interval estimates given below, which are constructed using the sampling distributions of these two statistics.
A 100(1-\(\alpha\))% Confidence Interval for the Mean Value of y at x
\[ \hat{y} \pm t_{\alpha/2}\, (\text{Estimated standard error of } \hat{y}) \]
or
\[ \hat{y} \pm t_{\alpha/2}\; s \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{SS_{xx}}} \]
where \(t_{\alpha/2}\) is based on (n - 2) degrees of freedom.
A 100(1-\(\alpha\))% Prediction Interval for an Individual New Value of y at x
\[ \hat{y} \pm t_{\alpha/2}\, (\text{Estimated standard error of prediction}) \]
or
\[ \hat{y} \pm t_{\alpha/2}\; s \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{SS_{xx}}} \]
where \(t_{\alpha/2}\) is based on (n - 2) degrees of freedom.
Example:
Advertising Expenditure - Sales Revenue Example:
Find a 95% confidence interval for the mean monthly sales when the appliance store spends $400 on advertising.
For a $400 advertising expenditure, x = 4 and the confidence interval for the mean value of y is:
\[ \hat{y} \pm t_{\alpha/2}\; s \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{SS_{xx}}} = \hat{y} \pm t_{.025,\, 3}\; s \sqrt{\frac{1}{5} + \frac{(4 - \bar{x})^2}{SS_{xx}}} \]
Recall that
\[ \hat{y} = 2.7, \quad s = .61, \quad \bar{x} = 3, \quad \text{and } SS_{xx} = 10 \]
and from Table VI, \(t_{.025,\, 3} = 3.182\). Thus, we have
\[ 2.7 \pm (3.182)(.61) \sqrt{\frac{1}{5} + \frac{(4 - 3)^2}{10}} = 2.7 \pm 1.1 = (1.6, 3.8) \]
Therefore, we are 95% confident that when the store spends $400 a month on advertising, the mean sales revenue is between $1,600 and $3,800.
Example:
Advertising Expenditure - Sales Revenue Example:
Predict the monthly sales for next month, if $400 is to be spent on advertising. Use a 95% prediction interval.
To predict the sales for a particular month for which x = 4, we calculate the 95% prediction interval as
\[ \hat{y} \pm t_{\alpha/2}\; s \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{SS_{xx}}} = 2.7 \pm (3.182)(.61) \sqrt{1 + \frac{1}{5} + \frac{(4 - 3)^2}{10}} = 2.7 \pm 2.2 = (.5, 4.9) \]
Therefore, we predict with 95% confidence that the sales revenue next month (a month in which we spend $400 in advertising) will fall in the interval from $500 to $4,900.
It is important to note that this interval is wider than the interval on the mean monthly sales for $400 of advertising expenditure, the reason being that the standard deviation of the predictor \(\hat{y}\) is larger than the standard deviation of the estimate \(\hat{y}\). (Note the additional term of 1 under the square root in the above expression.)
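The two interval formulas differ only by the extra 1 under the square root, which a small helper makes explicit. This is a sketch under stated assumptions: the function name `interval_half_widths` is hypothetical, and the critical value is passed in from the t-table (3.182, as quoted in the text) rather than computed.

```python
import math

def interval_half_widths(x0, n, x_bar, ss_xx, s, t_crit):
    """Return (confidence half-width, prediction half-width) at x = x0."""
    core = 1 / n + (x0 - x_bar) ** 2 / ss_xx
    ci_half = t_crit * s * math.sqrt(core)       # for the mean of y at x0
    pi_half = t_crit * s * math.sqrt(1 + core)   # for a single new y at x0
    return ci_half, pi_half

# Advertising-sales example: values quoted in the text.
y_hat = -0.1 + 0.7 * 4    # fitted value at x = 4
ci_half, pi_half = interval_half_widths(
    x0=4, n=5, x_bar=3, ss_xx=10, s=0.61, t_crit=3.182
)

print(f"95% CI for mean: {y_hat:.1f} +/- {ci_half:.2f}")
print(f"95% PI for new y: {y_hat:.1f} +/- {pi_half:.2f}")
```

Both half-widths grow as x0 moves away from the mean of x, so the intervals are narrowest at the center of the data.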
Example 10.61:
Many variables influence the sales of existing single-family homes. One of these is the interest rate charged for mortgage loans. Shown in the table are the total number of existing single-family homes sold annually (in 1000's) and the average annual conventional mortgage interest rate (as a %) from 1982-1991.
Identify the predictor and response.
Predictor x: Interest Rate
Response y: Homes Sold

i       y       x       y^2        x^2       xy
1       1990    14.8    3960100    219.04    29452.0
2       2719    12.3    7292961    151.29    33443.7
..      ..      ..      ..         ..        ..
10      3220    9.2     10368400   84.64     29624.0
Total   31253   106.8   99841655   1172.74   325855.5

Fit the LS regression line.
\[ SS_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 1172.74 - \frac{(106.8)^2}{10} = 32.116 \]
\[ SS_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = 99841655 - \frac{(31253)^2}{10} = 2166654.1 \]
\[ SS_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 325855.5 - \frac{(106.8)(31253)}{10} = -7926.54 \]
\[ \hat\beta_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{-7926.54}{32.116} = -246.81 \]
\[ \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = 3125.3 - (-246.81)(10.68) = 5761.23 \]
\[ \hat{y} = 5761.23 - 246.81x \]
Construct the ANOVA Table.
SST = SSyy = 2166654.1
\[ SSR = \frac{(SS_{xy})^2}{SS_{xx}} = 1956346.9 \]
SSE = SST - SSR = 210307.2

Source       df   SS          MS
Regression   1    1956346.9   1956346.9
Error        8    210307.2    26288.4
Total        9    2166654.1

Do the data provide sufficient evidence to indicate a non-zero slope? Use a 95% confidence interval to answer this question.
\[ \hat\beta_1 \pm t_{0.025,\, 8}\, \frac{s}{\sqrt{SS_{xx}}} = -246.81 \pm (2.306) \left( \frac{\sqrt{26288.4}}{\sqrt{32.116}} \right) = -246.81 \pm 65.98 = (-312.79, -180.83) \]
Since 0 is not in the interval, we can say that the data provide sufficient evidence to conclude that the slope is not zero.
Compute and interpret the coefficient of determination.
\[ r^2 = \frac{SSR}{SST} = \frac{1956346.9}{2166654.1} = 0.9029 \]
Interpretation: The fitted line explains 90.29% of the variation in the response.
Compute and interpret the Pearson correlation coefficient.
\[ r = -\sqrt{r^2} = -\sqrt{0.9029} = -0.9502 \]
(we take the negative root because it is the sign of \(\hat\beta_1\)).
Interpretation: This is a very strong negative linear relationship between the interest rate and the number of homes sold.
Compute a 90% confidence interval for the true mean number of homes sold if the interest rate is 10%.
Need a 90% CI for E(y) at x = 10.0
\[ \hat{y} \pm t_{0.05,\, 8}\; s \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{SS_{xx}}} = 3293.13 \pm (1.86)(\sqrt{26288.4}) \sqrt{\frac{1}{10} + \frac{(10 - 10.68)^2}{32.116}} = 3293.13 \pm 102.00 = (3191.13, 3395.13) \]
Interpretation: We are 90% confident that the true mean number of homes sold when the interest rate is 10% is between 3191.13 and 3395.13 thousand homes.
Construct a 90% prediction interval for the number of homes sold during a year in which the interest rate is 10%.
Need a 90% prediction interval for y at x = 10.0
\[ \hat{y} \pm t_{0.05,\, 8}\; s \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{SS_{xx}}} = 3293.13 \pm (1.86)(\sqrt{26288.4}) \sqrt{1 + \frac{1}{10} + \frac{(10 - 10.68)^2}{32.116}} = 3293.13 \pm 318.36 = (2974.77, 3611.49) \]
Interpretation: We are 90% confident that the number of homes sold during a year when the interest rate is 10% is between 2974.77 and 3611.49 thousand homes.

Exercise 10.61a
Bivariate Fit of Homes Sold(1000’s) By Interest_Rate(%)

Year   Homes Sold(1000’s)   Interest Rate(%)
1982   1990                 15.82
1983   2719                 13.44
1984   2868                 13.81
1985   3214                 12.29
1986   3565                 10.09
1987   3526                 10.17
1988   3594                 10.22
1990   3211                 10.08
1991   3220                 9.2
1992   3520                 8.43
1993   3802                 7.36
1994   3946                 8.59
1995   3812                 8.05
1996   4087                 8.03
1997   4215                 7.76

[Figure: Prediction and Confidence Intervals, Homes Sold(1000’s) (1500 to 4500) against Interest_Rate(%) (6 to 16) with the linear fit]

Linear Fit
Homes Sold(1000’s) = 5566.1297 - 210.34571 Interest_Rate(%)

Parameter Estimates
Term               Estimate    Std Error   t Ratio   Prob>|t|
Intercept          5566.1297   253.9956    21.91     <.0001
Interest_Rate(%)   -210.3457   24.19405    -8.69     <.0001

Summary of Fit
RSquare                      0.843728
RSquare Adj                  0.832566
Root Mean Square Error       228.9951
Mean of Response             3414.688
Observations (or Sum Wgts)   16

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      1    3963718.6        3963719       75.5876   <.0001
Error      14   734142.8         52439
C. Total   15   4697861.4

[Figure: Residuals by Year, Residuals (-300 to 300) against Year (1980 to 1995)]
[Figure: Residuals by Predicted, Residuals (-500 to 400) against Predicted (2000 to 4000)]
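The hand calculation in Example 10.61 (the ten-year table with column totals, not the 16-observation JMP data set) can be re-traced from the totals alone. This is a sketch under stated assumptions: the critical value 2.306 is the t-table value quoted in the example, and the variable names are illustrative.

```python
import math

# Column totals from the Example 10.61 table (n = 10 years).
n = 10
sum_x, sum_y = 106.8, 31253.0
sum_x2, sum_y2, sum_xy = 1172.74, 99841655.0, 325855.5

ss_xx = sum_x2 - sum_x ** 2 / n
ss_yy = sum_y2 - sum_y ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n

b1 = ss_xy / ss_xx                    # slope
b0 = sum_y / n - b1 * sum_x / n       # intercept

ssr = ss_xy ** 2 / ss_xx              # regression sum of squares
sse = ss_yy - ssr                     # error sum of squares
s = math.sqrt(sse / (n - 2))          # estimate of sigma

r2 = ssr / ss_yy                      # coefficient of determination
r = math.copysign(math.sqrt(r2), b1)  # r takes the sign of the slope

t_crit = 2.306                        # t_{0.025, 8}, quoted in the example
half = t_crit * s / math.sqrt(ss_xx)  # half-width of the 95% CI for the slope

print(f"y-hat = {b0:.2f} + ({b1:.2f}) x")
print(f"SSR = {ssr:.1f}, SSE = {sse:.1f}, r^2 = {r2:.4f}, r = {r:.4f}")
print(f"95% CI for slope: ({b1 - half:.2f}, {b1 + half:.2f})")
```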
Residual Analysis
The aim is to check if the assumptions about the model are satisfied for a particular set of data.
We also examine what we can do if we detect departures from the assumptions.
Recall that the model was of the form
\[ y = E(y) + \varepsilon \]
where \(E(y) = \beta_0 + \beta_1 x\), for a straight-line model, is the deterministic component and \(\varepsilon\) is the random error component.
The basic assumption can be summarized as:
\(\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n\) is a random sample from a Normal population with mean 0 and constant standard deviation \(\sigma\).
Because the assumptions involve the random error component \(\varepsilon\), the best way to study its properties is by first estimating the random error.
From the model it follows that the actual random error is:
\[ \varepsilon = y - E(y) = y - (\beta_0 + \beta_1 x) \]
The estimated random error, \(\hat\varepsilon\), is:
\[ \hat\varepsilon = y - (\hat\beta_0 + \hat\beta_1 x) = y - \hat{y} = \text{residual} \]
Thus, the estimated random error for an observation \(y_i\) is the corresponding residual \(y_i - \hat{y}_i\). Earlier, we learned that \(\sum (y_i - \hat{y}_i) = 0\). Also \(s = \sqrt{SSE/(n-2)}\) is an estimate of \(\sigma\), where \(SSE = \sum (y_i - \hat{y}_i)^2\).
Thus we would expect about 95% of the residuals to fall within 2 standard deviations, i.e., 2s, of 0 and virtually all of them to lie within 3 standard deviations of 0.
We use a variety of plots of the residuals to check whether these assumptions about the random errors are satisfied.
1. Histogram of the residuals
Check if the shape of the distribution is mound-shaped.
2. Scatterplots of residuals in time order or against the x variable.
a.) Draw a line parallel to the x-axis through the value residual = 0.
b.) Check visually whether the residuals appear to be evenly centered around this line.
c.) Draw lines parallel to the x-axis through the values residual = ±2s.
d.) Check visually whether many residuals fall outside these lines. Check if any fall outside ±3s.
3. Scatterplot of residuals against the x variable or against the predicted value, \(\hat{y}\).
a.) Draw a residual = 0 horizontal line as before.
b.) Check visually whether the residuals appear to be evenly spread around this line, as you go from low to high values on the x-axis.
If there is a clearly recognizable pattern such as those shown below, then it indicates either
- a dependence of the error variance \(\sigma^2\) on the predictor, x, or
- inadequacy of the deterministic part of the model, e.g., the straight-line model is not sufficient to explain the variability in the response y.
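The numerical side of these checks can be sketched on the car dealer data from the ANOVA example earlier in these notes, using its fitted line \(\hat{y} = -0.125 + 3.125x\). This is an illustration in plain Python, not a substitute for the plots; with only five points the 2s rule is a rough guide at best, and the variable names are assumptions.

```python
import math

# Car dealer data and fitted line from the ANOVA example.
x = [6, 6, 4, 2, 3]
y = [20, 18, 10, 6, 11]
b0, b1 = -0.125, 3.125
n = len(x)

# Residual = observed y minus fitted y.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

sse = sum(e ** 2 for e in residuals)
s = math.sqrt(sse / (n - 2))          # estimate of sigma

# Least squares residuals sum to zero; flag any beyond 2s or 3s.
outside_2s = [e for e in residuals if abs(e) > 2 * s]
outside_3s = [e for e in residuals if abs(e) > 3 * s]

print("residuals:", residuals)
print(f"sum of residuals = {sum(residuals):.3f}, s = {s:.3f}")
print(f"outside 2s: {len(outside_2s)}, outside 3s: {len(outside_3s)}")
```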
[Figure: example residual plots. Panels show Residuals against x, Residuals against Predicted values, and a histogram (Frequency) of Residuals]
Interpretation of the plots: On any of these plots you should not put much effort into finding a pattern that is simply not there. Unless a pattern is very obvious, conclude that the plot does not indicate a deviation from the assumptions checked or that the plot is inconclusive.