Download Chapter 13 slides - Germantown School District

Chapter 13
Simple Linear Regression & Correlation: Inferential Methods
© 2008 Brooks/Cole, a division of Thomson Learning, Inc.
Deterministic Models
Consider the two variables x and y. A
deterministic relationship is one in which
the value of y (the dependent variable) is
described by some formula or mathematical
notation such as y = f(x), y = 3 + 2x or
y = 5e^(-2x), where x is the independent variable.
Probabilistic Models
A description of the relation between two
variables x and y that are not deterministically
related can be given by specifying a
probabilistic model.
The general form of an additive probabilistic
model allows y to be larger or smaller than
f(x) by a random amount, e.
The model equation is of the form
Y = deterministic function of x + random deviation
= f(x) + e
Probabilistic Models
[Figure: deviations from the deterministic part of a probabilistic model; one deviation is labeled e = -1.5.]
Simple Linear Regression Model
The simple linear regression model
assumes that there is a line with vertical or
y intercept α and slope β, called the true or
population regression line.
When a value of the independent variable x
is fixed and an observation on the
dependent variable y is made,
y = α + βx + e
Without the random deviation e, all observed (x, y) points
would fall exactly on the population regression
line. The inclusion of e in the model equation allows points
to deviate from the line by random amounts.
Simple Linear Regression Model
[Figure: population regression line with slope β and vertical intercept α, showing the observations made at x = x1 and x = x2 and their deviations from the line.]
Basic Assumptions of the Simple
Linear Regression Model
1. The distribution of e at any particular x
value has mean value 0 (µe = 0).
2. The standard deviation of e (which
describes the spread of its distribution) is
the same for any particular value of x. This
standard deviation is denoted by σ.
3. The distribution of e at any particular x
value is normal.
4. The random deviations e1, e2, …, en
associated with different observations are
independent of one another.
More About the Simple Linear
Regression Model
For any fixed x value, y itself has a normal
distribution, with
$$(\text{mean } y \text{ value for fixed } x) = (\text{height of the population regression line above } x) = \alpha + \beta x$$
and
$$(\text{standard deviation of } y \text{ for fixed } x) = \sigma.$$
Interpretation of Terms
1. The slope β of the population regression
line is the mean (average) change in y
associated with a 1-unit increase in x.
2. The vertical intercept α is the height of
the population line when x = 0.
3. The size of σ determines the extent to
which the (x, y) observations deviate from
the population line.
[Figure: two scatterplots about the population line, one with small σ and one with large σ.]
Illustration of Assumptions
Estimates for the Regression Line
The point estimates of β, the slope, and α,
the y intercept of the population regression
line, are the slope and y intercept,
respectively, of the least squares line.
That is,
$$b = \text{point estimate of } \beta = \frac{S_{xy}}{S_{xx}}$$
$$a = \text{point estimate of } \alpha = \bar{y} - b\bar{x}$$
where
$$S_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} \quad\text{and}\quad S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}.$$
Interpretation of y = a + bx
Let x* denote a specific value of the
predictor variable x. The quantity a + bx* has two
interpretations:
1. a + bx* is a point estimate of the
mean y value when x = x*.
2. a + bx* is a point prediction of an
individual y value to be observed
when x = x*.
Example
The following data was collected in a
study of age and fatness in humans.
Age    % Fat
23      9.5
23     27.9
27      7.8
27     17.8
39     31.4
41     25.9
45     27.4
49     25.2
50     31.1
53     34.7
53     42.0
54     29.1
56     32.5
57     30.3
58     33.0
58     33.8
60     41.1
61     34.5
One of the questions was, “What is the
relationship between age and fatness?”
* Mazess, R.B., Peppler, W.W., and Gibbons, M. (1984) Total body composition by dual-photon (153Gd) absorptiometry. American Journal of Clinical Nutrition, 40, 834-839
Example
$$n = 18,\quad \sum x = 834,\quad \sum y = 515,\quad \sum x^2 = 41612,\quad \sum xy = 25489.2$$

Age (x)   % Fat (y)     x^2        xy
  23         9.5         529      218.5
  23        27.9         529      641.7
  27         7.8         729      210.6
  27        17.8         729      480.6
  39        31.4        1521     1224.6
  41        25.9        1681     1061.9
  45        27.4        2025     1233.0
  49        25.2        2401     1234.8
  50        31.1        2500     1555.0
  53        34.7        2809     1839.1
  53        42.0        2809     2226.0
  54        29.1        2916     1571.4
  56        32.5        3136     1820.0
  57        30.3        3249     1727.1
  58        33.0        3364     1914.0
  58        33.8        3364     1960.4
  60        41.1        3600     2466.0
  61        34.5        3721     2104.5
Total 834  515.0       41612    25489.2
Example
$$n = 18,\quad \sum x = 834,\quad \sum x^2 = 41612,\quad \sum y = 515,\quad \sum xy = 25489.2$$
$$S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 41612 - \frac{834^2}{18} = 2970$$
$$S_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} = 25489.2 - \frac{(834)(515)}{18} = 1627.53$$
Example
$$b = \frac{S_{xy}}{S_{xx}} = \frac{1627.53}{2970} = 0.54799$$
$$a = \bar{y} - b\bar{x} = \frac{515}{18} - 0.54799\left(\frac{834}{18}\right) = 3.2209$$
$$\hat{y} = 3.22 + 0.548x$$
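The hand calculation of S_xx, S_xy, b and a can be checked with a few lines of Python. This is only a minimal sketch using the computational formulas from the slides; the function name `least_squares` and the variable names are illustrative and not part of the original material.

```python
# Age (x) and % Fat (y) pairs copied from the example table.
ages = [23, 23, 27, 27, 39, 41, 45, 49, 50, 53, 53, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 27.9, 7.8, 17.8, 31.4, 25.9, 27.4, 25.2, 31.1,
       34.7, 42.0, 29.1, 32.5, 30.3, 33.0, 33.8, 41.1, 34.5]


def least_squares(x, y):
    """Return (a, b, s_xx, s_xy) for the least squares line y-hat = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    # Computational formulas: Sxx = sum(x^2) - (sum x)^2 / n, and similarly Sxy.
    s_xx = sum(v * v for v in x) - sum_x ** 2 / n
    s_xy = sum(u * v for u, v in zip(x, y)) - sum_x * sum_y / n
    b = s_xy / s_xx                  # slope estimate
    a = sum_y / n - b * sum_x / n    # intercept estimate: y-bar - b * x-bar
    return a, b, s_xx, s_xy


a, b, s_xx, s_xy = least_squares(ages, fat)
print(f"Sxx = {s_xx:.0f}, Sxy = {s_xy:.2f}")
print(f"b = {b:.5f}, a = {a:.4f}")
```

The printed values should agree with the slide's figures (S_xx = 2970, S_xy = 1627.53, b = 0.54799, a = 3.2209) up to rounding.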
Example
ŷ = 3.22 + 0.548x
A point estimate for the %Fat of a
human who is 45 years old is
a + bx = 3.22 + 0.548(45) = 27.9%
Substituting x = 45 into the equation gives both
a predicted %Fat for an individual 45-year-old human and
an estimated average %Fat for all 45-year-old
humans:
a + bx = 3.22 + 0.548(45) = 27.9%
The two interpretations are quite different.
Example
Regression Plot
% Fat y = 3.22086 + 0.547991 Age (x)
S = 5.75361    R-Sq = 62.7 %    R-Sq(adj) = 60.4 %
A plot of the data points along with the least squares
regression line, created with Minitab, is given here.
[Figure: Minitab regression plot of % Fat (y) against Age (x) with the least squares line.]
Terminology
The predicted or fitted values result from
substituting each sample x value into the
equation for the least squares line. This gives
$$
\begin{aligned}
\hat{y}_1 &= a + bx_1 && \text{(1st predicted value)} \\
\hat{y}_2 &= a + bx_2 && \text{(2nd predicted value)} \\
&\;\;\vdots \\
\hat{y}_n &= a + bx_n && \text{(nth predicted value)}
\end{aligned}
$$
The residuals for the least squares line are the
values
$$y_1 - \hat{y}_1,\; y_2 - \hat{y}_2,\; \ldots,\; y_n - \hat{y}_n.$$
Definition formulae
The total sum of squares, denoted by SSTo,
is defined as
$$\text{SSTo} = (y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2 = \sum (y - \bar{y})^2$$
The residual sum of squares, denoted by
SSResid, is defined as
$$\text{SSResid} = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \cdots + (y_n - \hat{y}_n)^2 = \sum (y - \hat{y})^2$$
Calculation Formulae Recalled
SSTo and SSResid are generally found as
part of the standard output from most
statistical packages or can be obtained using
the following computational formulas:
$$\text{SSTo} = \sum (y - \bar{y})^2 = \sum y^2 - \frac{\left(\sum y\right)^2}{n}$$
$$\text{SSResid} = \sum (y - \hat{y})^2 = \sum y^2 - a\sum y - b\sum xy$$
Coefficient of Determination
The coefficient of determination,
denoted by r², gives the proportion of
variation in y that can be attributed to an
approximate linear relationship between x
and y.
The coefficient of determination, r², can be
computed as
$$r^2 = 1 - \frac{\text{SSResid}}{\text{SSTo}}$$
Estimated Standard Deviation, se
The statistic for estimating the variance σ²
is
$$s_e^2 = \frac{\text{SSResid}}{n - 2}$$
where
$$\text{SSResid} = \sum (y - \hat{y})^2 = \sum y^2 - a\sum y - b\sum xy$$
The subscript e in s_e² is a reminder that we are
estimating the variance of the "errors" or residuals.
Estimated Standard Deviation, se
The estimate of σ is the estimated
standard deviation
$$s_e = \sqrt{s_e^2}$$
The number of degrees of freedom associated
with estimating σ² or σ in simple linear regression
is n - 2.
Example continued
$$\text{SSResid} = 529.66$$
$$s_e^2 = \frac{\text{SSResid}}{n - 2} = \frac{529.66}{18 - 2} = 33.104$$
$$s_e = \sqrt{s_e^2} = \sqrt{33.104} = 5.754$$

Age (x)   % Fat (y)     y^2      Predicted value ŷ   Residual y - ŷ   (y - ŷ)^2
  23         9.5         90.3         15.82              -6.32           40.00
  23        27.9        778.4         15.82              12.08          145.81
  27         7.8         60.8         18.02             -10.22          104.38
  27        17.8        316.8         18.02              -0.22            0.05
  39        31.4        986.0         24.59               6.81           46.34
  41        25.9        670.8         25.69               0.21            0.04
  45        27.4        750.8         27.88              -0.48            0.23
  49        25.2        635.0         30.07              -4.87           23.74
  50        31.1        967.2         30.62               0.48            0.23
  53        34.7       1204.1         32.26               2.44            5.93
  53        42.0       1764.0         32.26               9.74           94.78
  54        29.1        846.8         32.81              -3.71           13.78
  56        32.5       1056.3         33.91              -1.41            1.98
  57        30.3        918.1         34.46              -4.16           17.27
  58        33.0       1089.0         35.00              -2.00            4.02
  58        33.8       1142.4         35.00              -1.20            1.45
  60        41.1       1689.2         36.10               5.00           25.00
  61        34.5       1190.3         36.65              -2.15            4.62
Total 834  515.0      16156.3                                           529.66
Example continued
$$n = 18,\quad \sum y = 515.0,\quad \sum y^2 = 16156.3,\quad \sum xy = 25489.2,\quad a = 3.2209,\quad b = 0.54799$$
$$\text{SSTo} = \sum (y - \bar{y})^2 = \sum y^2 - \frac{\left(\sum y\right)^2}{n} = 16156.3 - \frac{(515.0)^2}{18} = 1421.5$$
$$r^2 = 1 - \frac{\text{SSResid}}{\text{SSTo}} = 1 - \frac{529.66}{1421.5} = 1 - 0.373 = 0.627$$
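As a cross-check on the sums of squares, the definition formulas for SSTo and SSResid can be applied directly to the age and %Fat data. The sketch below is illustrative only (the variable names are not from the slides) and uses the deviation form of S_xx and S_xy, which is algebraically equivalent to the computational formulas.

```python
import math

ages = [23, 23, 27, 27, 39, 41, 45, 49, 50, 53, 53, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 27.9, 7.8, 17.8, 31.4, 25.9, 27.4, 25.2, 31.1,
       34.7, 42.0, 29.1, 32.5, 30.3, 33.0, 33.8, 41.1, 34.5]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(fat) / n

# Least squares slope and intercept.
s_xx = sum((x - x_bar) ** 2 for x in ages)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, fat))
b = s_xy / s_xx
a = y_bar - b * x_bar

# Definition formulas for the two sums of squares.
ss_to = sum((y - y_bar) ** 2 for y in fat)
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(ages, fat))

r_squared = 1 - ss_resid / ss_to        # coefficient of determination
s_e = math.sqrt(ss_resid / (n - 2))     # estimated standard deviation, df = n - 2

print(f"SSTo = {ss_to:.2f}, SSResid = {ss_resid:.2f}")
print(f"r^2 = {r_squared:.3f}, s_e = {s_e:.3f}")
```

The results should match the slide values (SSTo about 1421.5, SSResid about 529.66, r² about 0.627, s_e about 5.754) up to rounding.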
Example continued
With r2 = 0.627 or 62.7%, we can say that
62.7% of the observed variation in %Fat
can be attributed to the probabilistic linear
relationship with human age.
The magnitude of a typical sample
deviation from the least squares line is
about 5.75 percentage points, which is reasonably large
compared to the y values themselves.
This suggests that the model is
useful only for providing rough
“ballpark” estimates of %Fat for humans
based on age.
Properties of the Sampling
Distribution of b
When the four basic assumptions of the
simple linear regression model are satisfied,
the following conditions are met:
1. The mean value of b is β. Specifically,
μ_b = β and hence b is an unbiased
statistic for estimating β.
2. The standard deviation of the statistic b is
$$\sigma_b = \frac{\sigma}{\sqrt{S_{xx}}}$$
3. The statistic b has a normal distribution (a
consequence of the error e being normally
distributed).
Estimated Standard Deviation of b
The estimated standard deviation of the
statistic b is
$$s_b = \frac{s_e}{\sqrt{S_{xx}}}$$
When the four basic assumptions of the
simple linear regression model are satisfied,
the probability distribution of the
standardized variable
$$t = \frac{b - \beta}{s_b}$$
is the t distribution with df = n - 2.
Confidence interval for β
When the four basic assumptions of the
simple linear regression model are
satisfied, a confidence interval for β,
the slope of the population regression
line, has the form
$$b \pm (t \text{ critical value})\, s_b$$
where the t critical value is based on
df = n - 2.
Example continued
Recall
$$n = 18,\quad \sum x = 834,\quad \sum x^2 = 41612,\quad \sum y = 515,\quad \sum xy = 25489.2,\quad \sum y^2 = 16156.3$$
$$b = 0.54799,\quad a = 3.2209,\quad s_e = 5.754$$
$$s_b = \frac{s_e}{\sqrt{S_{xx}}} = \frac{5.754}{\sqrt{2970}} = 0.1056$$
A 95% confidence interval estimate for β is
$$b \pm t\, s_b = 0.5480 \pm (2.12)(0.1056) = 0.5480 \pm 0.2238$$
Example continued
A 95% confidence interval estimate for β is
$$b \pm t\, s_b = 0.5480 \pm 2.12(0.1056) = 0.5480 \pm 0.2238,$$
that is, (0.324, 0.772).
Based on the sample data, we are 95% confident that the
true mean increase in %Fat associated with one additional year of
age is between 0.324 and 0.772 percentage points.
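The interval arithmetic above is easy to reproduce. The following sketch takes b, s_e, S_xx and the t critical value (2.12 for df = 16) directly from the slides; the helper name `slope_confidence_interval` is hypothetical and chosen only for illustration.

```python
import math


def slope_confidence_interval(b, s_e, s_xx, t_crit):
    """Return s_b and the interval b +/- t_crit * s_b."""
    s_b = s_e / math.sqrt(s_xx)      # estimated standard deviation of b
    margin = t_crit * s_b
    return s_b, (b - margin, b + margin)


# Values taken from the example: df = n - 2 = 16, 95% t critical value 2.12.
s_b, (lower, upper) = slope_confidence_interval(
    b=0.5480, s_e=5.754, s_xx=2970, t_crit=2.12
)
print(f"s_b = {s_b:.4f}")
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

The t critical value is passed in as a parameter rather than computed, so the sketch stays within the standard library; a statistics package would normally supply it.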
Example continued
Minitab output looks like

Regression Analysis: % Fat y versus Age (x)

The regression equation is
% Fat y = 3.22 + 0.548 Age (x)

Predictor      Coef    SE Coef      T       P
Constant      3.221      5.076   0.63   0.535
Age (x)      0.5480     0.1056   5.19   0.000

S = 5.754    R-Sq = 62.7%    R-Sq(adj) = 60.4%

Analysis of Variance
Source            DF        SS       MS       F       P
Regression         1    891.87   891.87   26.94   0.000
Residual Error    16    529.66    33.10
Total             17   1421.54

Annotations on the output: the Constant coefficient (3.221) is the estimated y intercept a, and the Age (x) coefficient (0.5480) is the estimated slope b of the regression line. S = 5.754 is s_e. The Residual Error row gives the residual df = n - 2 = 16, SSResid = 529.66 and s_e² = 33.10. The Total row gives SSTo = 1421.54.
Hypothesis Tests Concerning β
Null hypothesis: H0: β = hypothesized value
Test statistic:
$$t = \frac{b - \text{hypothesized value}}{s_b}$$
The test is based on df = n - 2
Hypothesis Tests Concerning β
Alternate hypothesis and finding the P-value:
1. Ha: β > hypothesized value
P-value = Area under the t curve with
n - 2 degrees of freedom to the
right of the calculated t
2. Ha: β < hypothesized value
P-value = Area under the t curve with
n - 2 degrees of freedom to the left
of the calculated t
Hypothesis Tests Concerning β
3. Ha: β ≠ hypothesized value
a) If t is positive, P-value = 2 (Area
under the t curve with n - 2 degrees
of freedom to the right of the
calculated t)
b) If t is negative, P-value = 2 (Area
under the t curve with n - 2 degrees
of freedom to the left of the
calculated t)
Hypothesis Tests Concerning β
Assumptions:
1. The distribution of e at any particular x
value has mean value 0 (µe = 0)
2. The standard deviation of e is σ, which
does not depend on x
3. The distribution of e at any particular x
value is normal
4. The random deviations e1, e2, … , en
associated with different observations are
independent of one another
Hypothesis Tests Concerning β
Quite often the test is performed with the
hypotheses
H0: β = 0 vs. Ha: β ≠ 0
This particular form of the test is called the
model utility test for simple linear
regression.
The null hypothesis specifies that there is no useful
linear relationship between x and y, whereas the
alternative hypothesis specifies that there is a useful
linear relationship between x and y.
The test statistic simplifies to
$$t = \frac{b}{s_b}$$
and is called the t ratio.
Example
Consider the following data on percentage
unemployment and suicide rates.
City             Percentage Unemployed   Suicide Rate
New York                 3.0                  72
Los Angeles              4.7                 224
Chicago                  3.0                  82
Philadelphia             3.2                  92
Detroit                  3.8                 104
Boston                   2.5                  71
San Francisco            4.8                 235
Washington               2.7                  81
Pittsburgh               4.4                  86
St. Louis                3.1                 102
Cleveland                3.5                 104
* Smith, D. (1977) Patterns in Human Geography, Canada: Douglas David and Charles Ltd., 158.
Example
The plot of the data points produced by
Minitab follows
Example
City             Percentage Unemployed (x)   Suicide Rate (y)     x^2        xy        y^2
New York                 3.0                       72             9.00     216.0      5184
Los Angeles              4.7                      224            22.09    1052.8     50176
Chicago                  3.0                       82             9.00     246.0      6724
Philadelphia             3.2                       92            10.24     294.4      8464
Detroit                  3.8                      104            14.44     395.2     10816
Boston                   2.5                       71             6.25     177.5      5041
San Francisco            4.8                      235            23.04    1128.0     55225
Washington               2.7                       81             7.29     218.7      6561
Pittsburgh               4.4                       86            19.36     378.4      7396
St. Louis                3.1                      102             9.61     316.2     10404
Cleveland                3.5                      104            12.25     364.0     10816
Total                   38.7                     1253           142.57    4787.2    176807
Example
Some basic summary statistics
$$n = 11,\quad \sum x = 38.7,\quad \sum x^2 = 142.57,\quad \sum y = 1253,\quad \sum y^2 = 176807,\quad \sum xy = 4787.2$$
$$S_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} = 4787.2 - \frac{(38.7)(1253)}{11} = 378.92$$
$$S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 142.57 - \frac{38.7^2}{11} = 6.4164$$
Example
Continuing with the calculations
$$b = \frac{S_{xy}}{S_{xx}} = \frac{378.92}{6.4164} = 59.06$$
$$a = \bar{y} - b\bar{x} = \frac{1253}{11} - 59.06\left(\frac{38.7}{11}\right) = -93.86$$
$$\hat{y} = -93.86 + 59.06x$$
Example
Continuing with the calculations
$$\text{SSResid} = \sum (y - \hat{y})^2 = \sum y^2 - a\sum y - b\sum xy = 176807 - (-93.857)(1253) - 59.055(4787.2) = 11701.9$$
$$\text{SSTo} = S_{yy} = \sum (y - \bar{y})^2 = \sum y^2 - \frac{\left(\sum y\right)^2}{n} = 176807 - \frac{1253^2}{11} = 34078.9$$
Example
$$s_e = \sqrt{\frac{\text{SSResid}}{n - 2}} = \sqrt{\frac{11701.9}{9}} = 36.06$$
$$r^2 = 1 - \frac{\text{SSResid}}{\text{SSTo}} = 1 - \frac{11701.9}{34078.9} = 1 - 0.343 = 0.657$$
Example - Model Utility Test
1. β = the true average change in suicide
rate associated with an increase in the
unemployment rate of 1 percentage
point
2. H0: β = 0
3. Ha: β ≠ 0
4. A significance level α has not been preselected. We shall
interpret the observed level of
significance (P-value)
5. Test statistic:
$$t = \frac{b - \text{hypothesized value}}{s_b} = \frac{b - 0}{s_b} = \frac{b}{s_b}$$
Example - Model Utility Test
6. Assumptions: The following plot (Minitab) of
the data shows a linear pattern and the
variability of points does not appear to be
changing with x. Assuming that the distribution
of errors (residuals) at any given x value is
approximately normal, the assumptions of the
simple linear regression model are
appropriate.
Example - Model Utility Test
7. Calculation:
$$s_b = \frac{s_e}{\sqrt{S_{xx}}} = \frac{36.06}{\sqrt{6.4164}} = 14.24$$
$$t = \frac{b}{s_b} = \frac{59.06}{14.24} = 4.15$$
8. P-value: The table of tail areas for t
distributions only has t values ≤ 4, so we can
see that the corresponding tail area is < 0.002.
Since this is a two-tailed test, the P-value < 0.004.
(Actual calculation gives a P-value = 0.002)
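The t ratio for the model utility test can be recomputed from the raw unemployment and suicide-rate data. This is a minimal sketch rather than the authors' procedure; the function name `model_utility_t` is made up for illustration, and the P-value is left to a t table or statistical package.

```python
import math

# Percentage unemployed (x) and suicide rate (y) for the 11 cities in the example.
unemployed = [3.0, 4.7, 3.0, 3.2, 3.8, 2.5, 4.8, 2.7, 4.4, 3.1, 3.5]
suicide = [72, 224, 82, 92, 104, 71, 235, 81, 86, 102, 104]


def model_utility_t(x, y):
    """Return (b, s_b, t) for testing H0: beta = 0 in simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(ss_resid / (n - 2))   # df = n - 2
    s_b = s_e / math.sqrt(s_xx)           # estimated standard deviation of b
    return b, s_b, b / s_b


b, s_b, t = model_utility_t(unemployed, suicide)
print(f"b = {b:.2f}, s_b = {s_b:.2f}, t = {t:.2f}")
```

The output should agree with the hand calculation (s_b about 14.24, t about 4.15), with small differences in b due to rounding in the intermediate steps.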
Example - Model Utility Test
9. Conclusion:
Even though no specific significance
level was chosen for the test, with the
P-value being so small (< 0.004) one
would generally reject the null
hypothesis that β = 0 and conclude that
there is a useful linear relationship
between the % unemployed and the
suicide rate.
Example - Minitab Output
Regression Analysis: Suicide Rate (y) versus Percentage Unemployed (x)

The regression equation is
Suicide Rate (y) = - 93.9 + 59.1 Percentage Unemployed (x)

Predictor      Coef   SE Coef       T       P
Constant     -93.86     51.25   -1.83   0.100
Percenta      59.05     14.24    4.15   0.002

S = 36.06    R-Sq = 65.7%    R-Sq(adj) = 61.8%

Annotations on the output: T = 4.15 in the Percenta row is the T value for the model utility test (H0: β = 0 vs. Ha: β ≠ 0), and P = 0.002 in the same row is its P-value.
Example – Reality Check!
Although the model utility test indicates that the model
is useful, we should be a bit hesitant to use the model
principally as an estimation tool.
Notice that s = 36.06, whereas the actual range of
suicide rates is 235 - 71 = 164. This means the typical
error in estimating the suicide rate would be
approximately 22% of the range. With 9 of the
11 data points having suicide rates at or below 104,
this would constitute a very large amount of error in
the estimation.
The statistics are very clear: we have established a
strong positive linear relationship between percentage
unemployed and the suicide rate. It would just not be
particularly meaningful or useful to provide actual
numerical estimates for suicide rates.
Residual Analysis
The simple linear regression model equation
is y = α + βx + e, where e represents the
random deviation of an observed y value
from the population regression line α + βx.
Key assumptions about e
1. At any particular x value, the distribution
of e is a normal distribution
2. At any particular x value, the standard
deviation of e is σ, which is constant
over all values of x.
Residual Analysis
To check on these assumptions, one would
examine the deviations e1, e2, …, en.
Generally, the deviations are not known, so
we check on the assumptions by looking at
the residuals which are the deviations from
the estimated line, a + bx.
The residuals are given by
$$
\begin{aligned}
y_1 - \hat{y}_1 &= y_1 - (a + bx_1) \\
y_2 - \hat{y}_2 &= y_2 - (a + bx_2) \\
&\;\;\vdots \\
y_n - \hat{y}_n &= y_n - (a + bx_n)
\end{aligned}
$$
Standardized Residuals
Recall: A quantity is standardized by
subtracting its mean value and then dividing
by its true (or estimated) standard deviation.
For the residuals, the true mean is zero (0)
if the assumptions are true.
The estimated standard deviation of a residual
depends on the x value. The estimated standard
deviation of the ith residual, y_i - ŷ_i, is given by
$$s_{y_i - \hat{y}_i} = s_e \sqrt{1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{S_{xx}}}$$
Standardized Residuals
As you can see from the formula for the
estimated standard deviation the calculation
of the standardized residuals is a bit of a
calculational nightmare.
Fortunately, most statistical software
packages are set up to perform these
calculations and do so quite proficiently.
Standardized Residuals - Example
Consider the data on percentage unemployment
and suicide rates
City             Percentage Unemployed   Suicide Rate      ŷ      Residual y - ŷ   Standardized Residual
New York                 3.0                  72          83.31       -11.31              -0.34
Los Angeles              4.7                 224         183.70        40.30               1.34
Chicago                  3.0                  82          83.31        -1.31              -0.04
Philadelphia             3.2                  92          95.12        -3.12              -0.09
Detroit                  3.8                 104         130.55       -26.55              -0.78
Boston                   2.5                  71          53.78        17.22               0.55
San Francisco            4.8                 235         189.61        45.39               1.56
Washington               2.7                  81          65.59        15.41               0.48
Pittsburgh               4.4                  86         165.99       -79.98              -2.50
St. Louis                3.1                 102          89.21        12.79               0.38
Cleveland                3.5                 104         112.84        -8.84              -0.26
Notice that the standardized residual for Pittsburgh
is -2.50, somewhat large for a data set of this size.
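The standardized residuals in the table can be reproduced by dividing each residual by its estimated standard deviation, s_e·sqrt(1 - 1/n - (x_i - x̄)²/S_xx). The sketch below is one possible implementation under that formula; the function and variable names are illustrative, not taken from the slides or from Minitab.

```python
import math

cities = ["New York", "Los Angeles", "Chicago", "Philadelphia", "Detroit", "Boston",
          "San Francisco", "Washington", "Pittsburgh", "St. Louis", "Cleveland"]
unemployed = [3.0, 4.7, 3.0, 3.2, 3.8, 2.5, 4.8, 2.7, 4.4, 3.1, 3.5]
suicide = [72, 224, 82, 92, 104, 71, 235, 81, 86, 102, 104]


def standardized_residuals(x, y):
    """Return the standardized residuals of the least squares line fitted to (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s_e = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    result = []
    for xi, r in zip(x, residuals):
        # Estimated standard deviation of this residual; it depends on x_i.
        sd = s_e * math.sqrt(1 - 1 / n - (xi - x_bar) ** 2 / s_xx)
        result.append(r / sd)
    return result


std_resid = standardized_residuals(unemployed, suicide)
for city, z in zip(cities, std_resid):
    print(f"{city:14s} {z:6.2f}")
```

Pittsburgh (index 8) should come out near -2.50, matching the value flagged in the table.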
Example
[Figure: scatterplot of the data with the Pittsburgh point marked.]
Pittsburgh: this point has an unusually large residual.
Normal Plots
Notice that both of the normal plots look similar. If
a software package is available to do the
calculation and plots, it is preferable to look at the
normal plot of the standardized residuals.
[Figure: two normal probability plots (response is Suicide), Normal Score versus Residual and Normal Score versus Standardized Residual.]
In both cases, the points look reasonably linear
with the possible exception of Pittsburgh, so the
assumption that the errors are normally distributed
seems to be supported by the sample data.
More Comments
The fact that Pittsburgh has a large
standardized residual makes it worthwhile
to look at that city carefully to make sure the
figures were reported correctly. One might
also look to see if there are some reasons
that Pittsburgh should be looked at
separately because some other
characteristic distinguishes it from all of the
other cities.
Pittsburgh does have a large effect on the
model.
Visual Interpretation of
Standardized Residuals
[Figure: Standardized Residuals Versus x (response is y).]
This plot is an example of a satisfactory plot that
indicates that the model assumptions are reasonable.
Visual Interpretation of
Standardized Residuals
[Figure: Standardized Residuals Versus x (response is y).]
This plot suggests that a curvilinear regression model
is needed.
Visual Interpretation of
Standardized Residuals
[Figure: Standardized Residuals Versus x (response is y).]
This plot suggests a non-constant variance. The
assumptions of the model are not correct.
Visual Interpretation of
Standardized Residuals
[Figure: Standardized Residuals Versus x (response is y).]
This plot shows a data point with a large standardized
residual.
Visual Interpretation of
Standardized Residuals
[Figure: Standardized Residuals Versus x (response is y).]
This plot shows a potentially influential observation.
Example - % Unemployment vs. Suicide Rate
[Figure: residual plot for the % unemployment vs. suicide rate model, with the three annotations below.]
Annotation 1: Generally decreasing pattern to these points.
Annotation 2: These two points are quite influential since they are far away from the others in terms of the % unemployed.
Annotation 3: Unusually large residual, clearly an influential point.
This plot of the residuals (errors) indicates some
possible problems with this linear model. You can see
a pattern to the points.
Properties of the Sampling Distribution
of a + bx for a Fixed x Value
Let x* denote a particular value of the
independent variable x. When the four basic
assumptions of the simple linear regression
model are satisfied, the sampling
distribution of the statistic a + bx* has the
following properties:
1. The mean value of a + bx* is α + βx*,
so a + bx* is an unbiased statistic for
estimating the average y value when
x = x*
Properties of the Sampling Distribution
of a + bx for a Fixed x Value
2. The standard deviation of the statistic
a + bx*, denoted by σ_{a+bx*}, is given by
$$\sigma_{a+bx^*} = \sigma \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$
3. The distribution of the statistic a + bx* is
normal.
Additional Information about the Sampling
Distribution of a + bx for a Fixed x Value
The estimated standard deviation of
the statistic a + bx*, denoted by
s_{a+bx*}, is given by
$$s_{a+bx^*} = s_e \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$
When the four basic assumptions of the
simple linear regression model are satisfied,
the probability distribution of the standardized
variable
$$t = \frac{a + bx^* - (\alpha + \beta x^*)}{s_{a+bx^*}}$$
is the t distribution with df = n - 2.
Confidence Interval for a Mean y Value
When the four basic assumptions of the
simple linear regression model are met, a
confidence interval for α + βx*, the
average y value when x has the value x*, is
$$a + bx^* \pm (t \text{ critical value})\, s_{a+bx^*}$$
where the t critical value is based on
df = n - 2.
Many authors give the following equivalent form
for the confidence interval.
$$a + bx^* \pm (t \text{ critical value})\, s_e \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$
Prediction Interval for a Single y Value
When the four basic assumptions of the simple
linear regression model are met, a prediction
interval for y*, a single y observation made
when x has the value x*, has the form
$$a + bx^* \pm (t \text{ critical value}) \sqrt{s_e^2 + s_{a+bx^*}^2}$$
where the t critical value is based on df = n - 2.
Many authors give the following equivalent form
for the prediction interval.
$$a + bx^* \pm (t \text{ critical value})\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$
Example - Mean Annual Temperature vs. Mortality
Data was collected in certain regions of
Great Britain, Norway and Sweden to study
the relationship between the mean annual
temperature and the mortality rate for a
specific type of breast cancer in women.
Mean Annual Temperature (°F)   Mortality Index
        51.3                       102.5
        49.9                       104.5
        50.0                       100.4
        49.2                        95.9
        48.5                        87.0
        47.8                        95.0
        47.3                        88.6
        45.1                        89.2
        46.3                        78.9
        42.1                        84.6
        44.2                        81.7
        43.5                        72.2
        42.3                        65.1
        40.2                        68.1
        31.8                        67.3
        34.0                        52.5
* Lea, A.J. (1965) New Observations on distribution of neoplasms of female breast in
certain European countries. British Medical Journal, 1, 488-490
Example - Mean Annual Temperature vs. Mortality
Regression Analysis: Mortality index versus Mean annual temperature

The regression equation is
Mortality index = - 21.8 + 2.36 Mean annual temperature

Predictor      Coef   SE Coef       T       P
Constant     -21.79     15.67   -1.39   0.186
Mean ann     2.3577    0.3489    6.76   0.000

S = 7.545    R-Sq = 76.5%    R-Sq(adj) = 74.9%

Analysis of Variance
Source            DF       SS       MS       F       P
Regression         1   2599.5   2599.5   45.67   0.000
Residual Error    14    796.9     56.9
Total             15   3396.4

Unusual Observations
Obs   Mean ann   Mortalit     Fit   SE Fit   Residual   St Resid
 15       31.8      67.30   53.18     4.85      14.12     2.44RX

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
Example - Mean Annual Temperature vs. Mortality
Regression Plot
Mortality in = -21.7947 + 2.35769 Mean annual
S = 7.54466    R-Sq = 76.5 %    R-Sq(adj) = 74.9 %
[Figure: Minitab regression plot of Mortality index against Mean annual temperature with the least squares line; one point at the low-temperature end is marked.]
The marked point has a large standardized residual and is
influential because of the low Mean Annual Temperature.
Example - Mean Annual Temperature vs. Mortality
Predicted Values for New Observations
New Obs     Fit   SE Fit        95.0% CI             95.0% PI
   1      53.18     4.85   ( 42.79,  63.57)   ( 33.95,  72.41) X
   2      60.72     3.84   ( 52.48,  68.96)   ( 42.57,  78.88)
   3      72.51     2.48   ( 67.20,  77.82)   ( 55.48,  89.54)
   4      83.34     1.89   ( 79.30,  87.39)   ( 66.66, 100.02)
   5      96.09     2.67   ( 90.37, 101.81)   ( 78.93, 113.25)
   6      99.16     3.01   ( 92.71, 105.60)   ( 81.74, 116.57)
X denotes a row with X values away from the center

Values of Predictors for New Observations
New Obs   Mean ann
   1        31.8
   2        35.0
   3        40.0
   4        44.6
   5        50.0
   6        51.3
These are the x* values for which the
fits, standard errors of the fits,
95% confidence intervals for mean y
values and 95% prediction intervals for single y
values are given above.
Example - Mean Annual Temperature vs. Mortality
Regression Plot
Mortality in = -21.7947 + 2.35769 Mean annual
S = 7.54466    R-Sq = 76.5 %    R-Sq(adj) = 74.9 %
[Figure: Minitab regression plot of Mortality index against Mean annual temperature showing the regression line with 95% CI and 95% PI bands.]
95% confidence interval for Mean y value at x = 40: (67.20, 77.82)
95% prediction interval for single y value at x = 45: (67.62, 100.98)
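The confidence and prediction intervals reported by Minitab can be reproduced from the raw temperature and mortality data with the two interval formulas. The sketch below computes both intervals at x* = 40; the function name `mean_and_prediction_intervals` is hypothetical, and the t critical value 2.145 (95%, df = 14) is an assumed table value supplied by hand rather than computed.

```python
import math

# Mean annual temperature (x) and mortality index (y) from the example.
temperature = [51.3, 49.9, 50.0, 49.2, 48.5, 47.8, 47.3, 45.1,
               46.3, 42.1, 44.2, 43.5, 42.3, 40.2, 31.8, 34.0]
mortality = [102.5, 104.5, 100.4, 95.9, 87.0, 95.0, 88.6, 89.2,
             78.9, 84.6, 81.7, 72.2, 65.1, 68.1, 67.3, 52.5]


def mean_and_prediction_intervals(x, y, x_star, t_crit):
    """Return (fit, se_fit, ci, pi) at x_star for the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(ss_resid / (n - 2))
    fit = a + b * x_star
    # Estimated standard deviation of a + b*x_star.
    se_fit = s_e * math.sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)
    # A single new observation adds its own variance s_e^2.
    se_pred = math.sqrt(s_e ** 2 + se_fit ** 2)
    ci = (fit - t_crit * se_fit, fit + t_crit * se_fit)
    pi = (fit - t_crit * se_pred, fit + t_crit * se_pred)
    return fit, se_fit, ci, pi


# t critical value for 95% confidence with df = 16 - 2 = 14 (assumed table value).
fit, se_fit, ci, pi = mean_and_prediction_intervals(temperature, mortality, 40.0, 2.145)
print(f"fit = {fit:.2f}, SE fit = {se_fit:.2f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% PI: ({pi[0]:.2f}, {pi[1]:.2f})")
```

The prediction interval is always wider than the confidence interval at the same x*, because it accounts for the variability of an individual observation as well as the uncertainty in the estimated line.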
A Test for Independence in a
Bivariate Normal Population
Null hypothesis: H0: ρ = 0
Test statistic:
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
The t critical value is based on df = n - 2
Assumption: r is the correlation coefficient for a
random sample from a bivariate normal
population.
A Test for Independence in a
Bivariate Normal Population
Alternate hypothesis: Ha: ρ > 0 (Positive
dependence): P-value is the area under the
appropriate t curve to the right of the computed t.
Alternate hypothesis: Ha: ρ < 0 (Negative
dependence): P-value is the area under the
appropriate t curve to the left of the computed t.
Alternate hypothesis: Ha: ρ ≠ 0 (Dependence):
P-value is
i. twice the area under the appropriate t curve to the left of
the computed t value if t < 0 and
ii. twice the area under the appropriate t curve to the right of
the computed t value if t > 0
Example
Recall the data from
the study of %Fat vs.
Age for humans.
There are 18 data
points and a quick
calculation of the
Pearson correlation
coefficient gives
r = 0.79209.
We will test to see if
there is a dependence
at the 0.05
significance level.
(The 18 Age (x) and % Fat (y) pairs, with their x² and xy columns, are the same as in the earlier %Fat vs. Age example.)
Example
1. ρ = the correlation between % fat and
age in the population from which the
sample was selected
2. H0: ρ = 0
3. Ha: ρ ≠ 0
4. α = 0.05
5. Test statistic:
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}, \quad df = n - 2$$
Example
6. Looking at the two normal plots, we can see that
it is not reasonable to assume that either the
distribution of age or the distribution of % fat
is normal. (Notice that the data points deviate
quite substantially from a linear pattern.)
Since neither is normal, we shall not continue
with the test.
Another Example
Height vs. Joint Length
The professor in an elementary statistics
class wanted to explain correlation so he
needed some bivariate data. He asked his
class (presumably a random or
representative sample of late adolescent
humans) to measure the length of the
metacarpal bone on the index finger of the
right hand (in cm) and height (in inches). The
data are provided on the next slide.
Example - Height vs. Joint Length
Joint length (cm)   Height
      3.5            64
      3.4            68.5
      3.4            69
      2.7            64
      3.5            68
      3.5            73
      4.2            72
      4.0            75
      3.0            70
      3.4            68.5
      2.9            65
      3.5            67
      3.5            70
      2.8            65
      4.0            75
      3.8            70
      3.3            66
There are 17 data points and a quick
calculation of the Pearson correlation
coefficient gives r = 0.74908.
We will test to see if the true population
correlation coefficient is positive at the 0.05
level of significance.
Example - Height vs. Joint Length
1. ρ = the true correlation between height
and right index finger metacarpal joint length in
the population from which the sample
was selected
2. H0: ρ = 0
3. Ha: ρ > 0
4. α = 0.05
5. Test statistic:
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}, \quad df = n - 2$$
Example - Height vs. Joint Length
6. Looking at the two normal plots, we can see it is
reasonable to assume that the distribution of height and
the distribution of joint length are both normal. (Notice that the
data points follow a reasonably linear pattern.) This
appears to confirm the assumption that the sample is
from a bivariate normal distribution. We will assume
that the class was a random sample of young adults.
Example - Height vs. Joint Length
7. Calculation:
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.74908}{\sqrt{\dfrac{1 - (0.74908)^2}{17 - 2}}} = 4.379$$
8. P-value: Looking on the table of tail areas for t
curves under 15 degrees of freedom, 4.379 is off
the bottom of the table, so P-value < 0.001. Minitab
reports the P-value to be 0.001.
9. Conclusion: The P-value is smaller than α = 0.05, so
we can reject H0. We can conclude that the true
population correlation coefficient is greater than 0.
That is, the metacarpal bone tends to be longer for taller people.
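The test statistic for the correlation test depends only on r and n, so the calculation in step 7 can be verified in a couple of lines. This is a minimal sketch; the function name `correlation_t_statistic` is illustrative, and r = 0.74908 and n = 17 are the values quoted in the example.

```python
import math


def correlation_t_statistic(r, n):
    """t = r / sqrt((1 - r^2) / (n - 2)), based on df = n - 2."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


# Height vs. joint length example: r = 0.74908 with n = 17 observations.
t = correlation_t_statistic(r=0.74908, n=17)
print(f"t = {t:.3f} with df = {17 - 2}")
```

The P-value would then be read from a t table or statistical package with 15 degrees of freedom, as in step 8.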