Regression Line
I. Line of Least Squares
Consider the following data set and its scatterplot:

X: 1  1  3  3  4  4  5  5  7  7
Y: 1  2  2  3  3  4  3  4  4  4

[Scatterplot of the data with the SD-line drawn.]

The summary statistics are X̄ = 4, Ȳ = 3, σ_X = 2, σ_Y = 1.
(1) Find the equation of the SD-line.
Y − 3 = (1/2)(X − 4)  ⇒  Y = 3 + 0.5X − 2  ⇒  Y = 0.5X + 1
(2) Find the points on the SD-line for X = 1, 3, 4, 5, 7.
X = 1: Y = 0.5(1) + 1 = 1.5  →  (1, 1.5)
X = 3: Y = 0.5(3) + 1 = 2.5  →  (3, 2.5)
X = 4: Y = 0.5(4) + 1 = 3  →  (4, 3)
X = 5: Y = 0.5(5) + 1 = 3.5  →  (5, 3.5)
X = 7: Y = 0.5(7) + 1 = 4.5  →  (7, 4.5)
(3) For each value of X, find the deviations of the Y values of the points in the data
set from the corresponding Y value on the SD-line; square these deviations and
find the sum.
X  1 : (1  1.5) 2  (.5) 2  .25
(2  1.5) 2  (.5) 2  .25
X  3 : (2  2.5) 2  (.5) 2  .25
(3  2.5) 2  (.5) 2  .25
1
X  4 : (3  3) 2  0 2  0
(4  3) 2  12  1
X  5 : (3  3.5) 2  (.5) 2  .25
(4  3.5) 2  (.5) 2  .25
X  7 : (4  4.5) 2  (.5) 2  .25
(4  4.5) 2  (.5) 2  .25
Sum = 3
Go through the same calculations using the line Y = 0.425X + 1.3.
X = 1: Y = 0.425(1) + 1.3 = 1.725  →  (1 − 1.725)² = (−0.725)² = 0.525625;  (2 − 1.725)² = (0.275)² = 0.075625
X = 3: Y = 0.425(3) + 1.3 = 2.575  →  (2 − 2.575)² = (−0.575)² = 0.330625;  (3 − 2.575)² = (0.425)² = 0.180625
X = 4: Y = 0.425(4) + 1.3 = 3  →  (3 − 3)² = 0² = 0;  (4 − 3)² = 1² = 1
X = 5: Y = 0.425(5) + 1.3 = 3.425  →  (3 − 3.425)² = (−0.425)² = 0.180625;  (4 − 3.425)² = (0.575)² = 0.330625
X = 7: Y = 0.425(7) + 1.3 = 4.275  →  (4 − 4.275)² = (−0.275)² = 0.075625;  (4 − 4.275)² = (−0.275)² = 0.075625
Sum = 2.775
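The two sums can be recomputed with a short script; a minimal sketch in Python (names are illustrative):

```python
# Data set from the example.
xs = [1, 1, 3, 3, 4, 4, 5, 5, 7, 7]
ys = [1, 2, 2, 3, 3, 4, 3, 4, 4, 4]

def sum_squared_deviations(slope, intercept):
    """Sum of squared vertical deviations of the data from the line Y = slope*X + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

sd_line_sum = sum_squared_deviations(0.5, 1.0)      # SD-line: Y = 0.5X + 1
reg_line_sum = sum_squared_deviations(0.425, 1.3)   # regression line: Y = 0.425X + 1.3
```

The regression line's sum comes out smaller than the SD-line's, which is the whole point of "least squares."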
Notice that the sum for the second line is smaller than that for the SD-line. In fact, this second line gives a smaller sum than any other line, and in particular than any other line passing through the point of averages. This line of least squares is the regression line. Recall that the average of a list of numbers is the point of least squares for that list. Thus, the regression line is to a scatterplot as the average is to a list of numbers.
Note: The equation of the regression line above was calculated from the formula
Y − Ȳ = (r·σ_Y/σ_X)(X − X̄).
The regression line does pass through the point of averages. However, since the slope is r·σ_Y/σ_X, a change of 1 σ_X in X is associated with a change of only r·σ_Y in Y. The regression line is not as steep as the SD-line. This is called the regression effect.
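The slope and intercept of the regression line can be recovered from the data and this formula; a minimal sketch in Python, using population SDs as in the handout:

```python
from math import sqrt

xs = [1, 1, 3, 3, 4, 4, 5, 5, 7, 7]
ys = [1, 2, 2, 3, 3, 4, 3, 4, 4, 4]
n = len(xs)

x_bar = sum(xs) / n                                   # mean of X = 4
y_bar = sum(ys) / n                                   # mean of Y = 3
sd_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / n)    # population SD of X = 2
sd_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / n)    # population SD of Y = 1
r = sum((x - x_bar) * (y - y_bar)
        for x, y in zip(xs, ys)) / (n * sd_x * sd_y)  # correlation = 0.85

slope = r * sd_y / sd_x              # 0.85 * 1/2 = 0.425
intercept = y_bar - slope * x_bar    # line passes through the point of averages
```

This reproduces Y = 0.425X + 1.3 and shows where the 0.425 comes from: the SD-line slope 1/2 shrunk by the factor r = 0.85.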
II. “Smoothed Version” of the Graph of Averages
Consider the same data set used previously. To plot the graph of averages, first consider the data points with X-coordinate 1, i.e. (1, 1) and (1, 2). Average the Y-values: (1 + 2)/2 = 1.5. The point (1, 1.5) will be a point of the graph of averages. Similarly, for X = 3 you would get the point (3, 2.5); for X = 4, (4, 3.5); for X = 5, (5, 3.5); and for X = 7, (7, 4).
[Plot: the regression line Y = 0.425X + 1.3 drawn through the graph-of-averages points (1, 1.5), (3, 2.5), (4, 3.5), (5, 3.5), (7, 4).]
Notice how nicely the regression line “fits” the points on the graph of averages. The
regression line is a “smoothed version” of the graph of averages.
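The graph of averages can be computed directly; a minimal sketch in Python:

```python
from collections import defaultdict

xs = [1, 1, 3, 3, 4, 4, 5, 5, 7, 7]
ys = [1, 2, 2, 3, 3, 4, 3, 4, 4, 4]

# Group the Y-values by their X-value and average each group.
by_x = defaultdict(list)
for x, y in zip(xs, ys):
    by_x[x].append(y)

graph_of_averages = {x: sum(v) / len(v) for x, v in sorted(by_x.items())}

# The regression line Y = 0.425X + 1.3 evaluated at the same X-values,
# for comparison with the points of the graph of averages.
on_regression_line = {x: 0.425 * x + 1.3 for x in graph_of_averages}
```

Comparing the two dictionaries shows how closely the regression line tracks the averaged points.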
III. Line for Predicting One Variable from Another Variable
Because the regression line is the line of least squares, it is the line used for predicting
one variable from another variable. Suppose that we are given the following statistics
about the heights and weights of college males:
H  69"
W  170lb
 H  3"
 W  20lb
r  0.6
Suppose that we want to predict the height of a college male who weighs 190 lb. This male is 1 standard deviation above the mean in weight. We would not expect him to be as much as 1 standard deviation above the mean in height, as would be the case if there were perfect positive correlation and we used the SD-line. Instead, we should expect him to be 0.6 standard deviation, or (0.6)(3") = 1.8", above the mean. Thus, the predicted height would be 69" + 1.8" = 70.8". This calculation can be simplified by using the equation of the regression line, i.e.
H = 69 + (0.6)(3)/20 · (W − 170), or H = 0.09W + 53.7.
When W = 190, H = 0.09(190) + 53.7 = 70.8". Now, suppose that we wanted to predict the weight of a college male who is 72" tall. We might be tempted to use the previous formula, i.e. 72 = 0.09W + 53.7, which gives W = 203.3 lb. This cannot be correct, because we would expect the weight to be 0.6 standard deviation, or (0.6)(20 lb) = 12 lb, above the mean, i.e. W = 170 + 12 = 182 lb. The answer to this apparent dilemma is that there are two regression lines: one for predicting height from weight and a different one for predicting weight from height. This other regression line would be
W = 170 + (0.6)(20)/3 · (H − 69), or W = 4H − 106.
Thus, when H = 72, W = 4(72) − 106 = 182 lb. The first regression line, H = 0.09W + 53.7, is the line that minimizes the squared deviations in the heights; and the second regression line, W = 4H − 106, is the line that minimizes the squared deviations in the weights. These two cases can be summarized with the following formula:
Equation of the Regression Line for Predicting D from I:
D − D̄ = (r·σ_D/σ_I)(I − Ī)
IV. Predicting Percentiles
The regression method can be used to predict percentiles. Using the same population of college males with the heights and weights above, and assuming that the heights and weights are normally distributed, what is the predicted percentile for height for a college male who is at the 80%-tile in weight?
Solution 1:
(1) Find the z-score for this 80%-tile weight.
From the normal table, z = 0.85: about 60% of the area lies between z = −0.85 and z = 0.85, with 20% in each tail, so 80% of the area lies below z = 0.85. [Sketch: normal curve marked at z = 0 and z = 0.85.]
(2) Find the 80%-tile weight.
0.85 = (W − 170)/20  ⇒  W − 170 = 17  ⇒  W = 187 lb
(3) Find the predicted height for a college male who weighs 187 lb.
H  0.09W  53.7  0.09(187)  53.7  70.53"
(4) Find the z-score for a height of 70.53" .
z = (70.53 − 69)/3 = 0.51
(5) Find the percentile for a z-score of 0.51.
The area below z = −0.51 is about 30.5%, and the area between z = −0.51 and z = 0.51 is about 39%, so the predicted percentile is 30.5% + 39% = 69.5%-tile. [Sketch: normal curve marked at z = −0.51, z = 0, and z = 0.51.]
Solution 2:
(1) Find the z-score for this 80%-tile weight.
As in solution 1, the z for 80% is 0.85.
(2) Consider the following form of the regression line that predicts H from W:
H − H̄ = (r·σ_H/σ_W)(W − W̄)  ⇒  (H − H̄)/σ_H = r·(W − W̄)/σ_W  ⇒  z_H = r·z_W
Thus, the predicted z-score for the height is z_H = 0.6(0.85) = 0.51.
(3) Find the percentile for a z-score of 0.51.
As in solution 1, the predicted percentile is 69.5%-tile.
Therefore, in predicting percentiles, use the formula z_D = r·z_I.
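The percentile method can be sketched with Python's statistics.NormalDist; note that it uses exact normal quantiles (z ≈ 0.84 for the 80%-tile) rather than the table value 0.85, so the result differs slightly from 69.5%:

```python
from statistics import NormalDist

def predict_percentile(p_i, r):
    """Predict the percentile of D from the percentile of I via z_D = r * z_I."""
    z_i = NormalDist().inv_cdf(p_i)    # z-score of the given percentile
    z_d = r * z_i                      # regression method: z_D = r * z_I
    return NormalDist().cdf(z_d)       # convert back to a percentile

p = predict_percentile(0.80, 0.6)      # about 0.69, close to the 69.5%-tile above
```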
Practice Sheet – Regression Line
SAT scores: S̄ = 1120, σ_S = 160
GPAs: Ḡ = 2.5, σ_G = 0.5
r = 0.8
Assume that SAT scores and GPAs are normally distributed.
(1) Give the equation of the regression line for predicting GPA from SAT score.
(2) Use this equation to predict GPAs from the following SAT scores:
(i) 1120
(ii) 1340
(iii) 1000
(iv) 700
(3) Predict GPA percentiles from the following SAT percentiles:
(i) 50%-tile
(ii) 85%-tile
(iii) 99%-tile
(iv) 20%-tile
(4) Give the equation of the regression line for predicting SAT scores from GPAs.
(5) Use this equation to predict SAT scores from the following GPAs:
(i) 2.5
(ii) 3.2
(iii) 1.3
(iv) 4.0
(6) Predict SAT score percentiles from the following GPA percentiles:
(i) 50%-tile
(ii) 90%-tile
(iii) 30%-tile
(iv) 5%-tile
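The practice answers can be checked numerically; a minimal sketch in Python using the given statistics (the percentile helper uses exact normal quantiles, so those answers may differ slightly from the table-based key):

```python
from statistics import NormalDist

S_BAR, SD_S = 1120, 160   # SAT mean and SD
G_BAR, SD_G = 2.5, 0.5    # GPA mean and SD
R = 0.8                   # correlation

def gpa_from_sat(s):
    """Regression line for predicting GPA from SAT: G = 0.0025*S - 0.3."""
    return G_BAR + R * SD_G / SD_S * (s - S_BAR)

def sat_from_gpa(g):
    """Regression line for predicting SAT from GPA: S = 256*G + 480."""
    return S_BAR + R * SD_S / SD_G * (g - G_BAR)

def predicted_percentile(p, r=R):
    """Percentile prediction via z_D = r * z_I."""
    return NormalDist().cdf(r * NormalDist().inv_cdf(p))
```

For example, gpa_from_sat(1340) gives 3.05 and sat_from_gpa(3.2) gives about 1299, matching answers (2)(ii) and (5)(ii) after rounding.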
Solution Key for Regression Line
(1) G = 0.0025S − 0.3
(2) (i) 2.5  (ii) 3.05  (iii) 2.2  (iv) 1.45
(3) (i) 50%-tile  (ii) 80%-tile  (iii) 97%-tile  (iv) 25.5%-tile
(4) S = 256G + 480
(5) (i) 1120  (ii) 1300  (iii) 810  (iv) 1500
(6) (i) 50%-tile  (ii) 84.5%-tile  (iii) 34%-tile  (iv) 9.7%-tile