Chapter 10 - Regression
PART III : CORRELATION & REGRESSION
Dr. Joseph Brennan
Math 148, BU
What is Regression?
If a scatter diagram shows a linear relationship, we would like to
summarize the overall pattern with a line on the scatter diagram.
A drawn line may be used to predict the values of y (dependent
variable) from the values of x (independent variable).
Applications
Trend Estimation: Predicting trends in business analytics.
Epidemiology: Relating tobacco smoking to mortality and morbidity.
Finance: Analyzing the systematic risk of investments.
Economics: The predominant empirical tool in economics.
No straight line passes through all the points. To the naked eye, many
lines appear potentially optimal.
Many Optimal Lines
Which line best represents the linear trend of the data?
A Brief Review: Lines
A line is characterized by having a constant slope:
m = slope = rise/run
Points on a line can be generated by the slope-intercept formula:
y = mx + b
where m is slope and b is the y-intercept. The y-intercept is the y value
when x is 0. The line y = −2x + 4:
Regression Line
We need a formal way to draw an optimal line that will go as close to the
points as possible!
The least squares regression line is the UNIQUE line fitted by the least
squares method and which passes as close to the data as possible in the
vertical direction. (More soon!)
Regression Line
Assume that y and x are the dependent and independent variables of a
study. Denote:
ŷ to be the predicted (by regression) value of y for a given x,
r to be the correlation coefficient between x and y ,
ȳ and sy the average and standard deviation for the dependent
(response) variable y ,
x̄ and sx the average and standard deviation for the independent
(explanatory) variable x.
The optimal least squares regression line of y on x, derived
mathematically, is defined as:
ŷ = mx + b
with slope and intercept:
m = r · (sy/sx)        b = ȳ − mx̄
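The slope and intercept formulas above can be sketched in code. A minimal Python illustration; the function name and the summary statistics in the example are made up for illustration:

```python
def regression_line(r, mean_x, sd_x, mean_y, sd_y):
    """Least-squares line of y on x from summary statistics.

    Returns (m, b) for the line y-hat = m*x + b, where
    m = r * sy / sx and b = y-bar - m * x-bar.
    """
    m = r * sd_y / sd_x
    b = mean_y - m * mean_x
    return m, b

# Example with made-up summary statistics:
m, b = regression_line(r=0.5, mean_x=10, sd_x=2, mean_y=20, sd_y=4)
print(m, b)  # 1.0 10.0
```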
Equations of the Regression Line: Z-Score Independent
The optimal least squares regression line of y on x, derived
mathematically, is defined as:
ŷ = mx + b
with slope and intercept:
m = r · (sy/sx)        b = ȳ − mx̄.
If we substitute the formula for slope and intercept into the equation of
the regression line and work some algebra, we will get the following form
of the regression line equation:
ŷ = ȳ + r · sy · (x − x̄)/sx = ȳ + r · sy · zx ,    (1)

where zx is the z-score for x.
Equations of the Regression Line: Z-Score Dependent
Recall that the z-score is found via the equations:
zx = (x − x̄)/sx        zy = (y − ȳ)/sy

With some algebraic manipulation:

ŷ = ȳ + r · sy · zx   ⇒   ẑy = (ŷ − ȳ)/sy = r · zx
Interpretation:
The correlation coefficient helps predict the z-score for y using only the
z-score for x.
This equation is considered the regression equation for the
standardized data.
The data pairs (xi , yi ) can be transformed to (zxi , zyi ) by a linear
transformation.
The set of standardized z-scores has a mean of 0 and standard
deviation of 1. They also have a correlation coefficient of r .
Exercise 1, page 213 of the text.
Find the regression equations for predicting final score from midterm score,
based on the following information:
average midterm score = 70,  SD of midterm scores = 10
average final score = 55,  SD of final scores = 20
r = 0.6
Solution: First, let’s rewrite all the given information in our notation:
The dependent (response) variable, y , is the final score.
The independent variable, x, is the midterm score.
The x average x̄ = 70 and standard deviation sx = 10.
The y average ȳ = 55 and standard deviation sy = 20.
Exercise 1, page 213 of the text.
Compute the slope of the regression line:
m = r · (sy/sx) = 0.6 · (20/10) = 1.2
Then find the intercept:
b = ȳ − mx̄ = 55 − 1.2 · 70 = −29
The equation of the least squares regression line is:
ŷ = 1.2x − 29
The equation of the predicted y z-score:
ẑy = 0.6 zx
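The Exercise 1 arithmetic is easy to verify with the summary-statistics formulas. A small Python sketch (variable names are illustrative):

```python
# Summary statistics from Exercise 1.
r, x_bar, s_x, y_bar, s_y = 0.6, 70, 10, 55, 20

m = r * s_y / s_x       # slope: 0.6 * 20 / 10 = 1.2
b = y_bar - m * x_bar   # intercept: 55 - 1.2 * 70 = -29
print(round(m, 2), round(b, 2))  # 1.2 -29.0

# Regression prediction for a midterm score of 78:
print(round(m * 78 + b, 1))  # 64.6
```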
Interpreting the Regression Line
1. By saying we regress y on x, we mean that we want to predict y
from x.
2. The best use of the regression line is to estimate the AVERAGE
value of y for a given value of x.
Using the regression line formula of Exercise 1:
ŷ = 1.2x − 29
if we assume x = 78, then we find
ŷ = 1.2 · 78 − 29 = 64.6
We interpret this as follows: the average final exam score for students
who score 78 points on the midterm is 64.6.
Example (HANES5: Weight and Height)
The scatter diagram of the weight and height measurements for 471 men
in the HANES 5 survey is shown below:
The solid line in the above figure is the regression line. The three crosses on the
scatter diagram estimate average heights of men for x equal to 64, 73, and 76
inches.
Example (HANES5: Weight and Height)
The graph of averages for the 471 men aged 18-24 in the HANES5
sample. The regression line smooths the graph:
In general, if the relationship between x and y is linear and there are no extreme
outliers, then the average points follow the regression line very closely.
Using Regression Line for Individual Predictions
Although the best use of the regression line is to predict the average
outcomes, it may also be used to predict individual outcomes, but the
prediction error may be quite large. More to come on prediction error . . .
From Exercise 1: We will use the regression line to predict the final
score for a student with the midterm score of 50 points. The
regression line is
ŷ = 1.2x − 29
and the prediction is
ŷ = 1.2 · 50 − 29 = 31.
This prediction may be quite off the true value.
The Role of r in Regression
The correlation coefficient r measures the amount of scattering of points
about the regression line.
Case 1 (extreme): A correlation of r = −1 or r = 1 corresponds to a
perfect linear relationship. The scatter diagram is then a perfect line,
and that line coincides with the regression line.
Case 2 (extreme): A correlation r = 0 corresponds to a chaotic
scattering of points, which means there is no linear relationship
between x and y.
In this case the slope of the regression line is m = r · (sy/sx) = 0 and
the y-intercept is b = ȳ − mx̄ = ȳ. Therefore, when r = 0 the regression
line is horizontal. Illustrated on the next slide.
Case 3: The closer r is to -1 or 1, the closer the points are to the
regression line, the greater the success of regression in explaining
individual responses y for given values of x.
Horizontal Regression Line
There is no clear pattern for the points to drift up or down. As a result, the
correlation coefficient is close to zero and the regression line is horizontal.
Interpretation of the Slope
The slope of the regression line shows how the average value of y
changes when x increases by 1 unit.
The units of measurement for the slope are (units of y)/(units of x).
The expression for the slope m = r · (sy/sx) implies that when x changes
by one standard deviation sx, the prediction ŷ changes by r standard
deviations sy.
Because −1 ≤ r ≤ 1, the change in ŷ (in standard units) is less than or
equal to the change in x (in standard units). As the correlation between
x and y weakens, the prediction ŷ changes more slowly in response to
changes in x. This effect is sometimes called attenuation.
Interpretation of the Slope
The slope of the regression line is proportional to the correlation
coefficient r, but not equal to it unless sy = sx.
Interpretation of the Intercept
The intercept of the regression line is the predicted value for y when
x equals 0.
Quite often the intercept does not have any physical interpretation!
In Exercise 1 the intercept of the line is -29. Does this mean that the
average final score for students who got 0 on the midterm will be -29?!
NO WAY!!! First of all, 0 on the midterm is not truly a score; it just
means that a student missed the test. Everyone who showed up is
expected to earn some points on the test. A reasonable range of x values
for predicting y would be, say, from 25 to 100 points. Going beyond
this range, or extrapolating, is risky: we can obtain a nonsensical
prediction!
The Graph of Averages
The Graph of Averages is constructed from the scatter diagram. For a
given x-value, the y -value on the graph of averages is the average of
y -values associated to x on the scatter diagram.
Example: Assume that three points are plotted on the scatter diagram
with x value 7:
(7, 2) (7, 5) (7, 8)
The graph of averages will have the point (7, 5) as 5 is the mean of
{2, 5, 8}.
The regression line for the graph of averages coincides with the
regression line of the original scatter diagram.
You do not lose information by pre-smoothing data by constructing
the graph of averages.
If the graph of averages does not follow a straight line:
There may be extreme outliers.
The relationship may be non-linear.
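The construction described above is easy to sketch in code. The points below are made up and include the slide's example at x = 7:

```python
from collections import defaultdict
from statistics import mean

# Build the graph of averages from a scatter diagram's points.
# Made-up points, including the slide's example (7, 2), (7, 5), (7, 8).
points = [(7, 2), (7, 5), (7, 8), (4, 1), (4, 3)]

by_x = defaultdict(list)
for x, y in points:
    by_x[x].append(y)

# For each x-value, the graph of averages records the mean of the
# associated y-values.
graph_of_averages = {x: mean(ys) for x, ys in sorted(by_x.items())}
print(graph_of_averages[7])  # 5
```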
Example: Non-linear Association
Year        1999    2000    2001    2002    2003
AIDS Cases  41,356  41,267  40,833  41,289  43,171
The following parabola seems a nice fit!
Figure : Nonlinear relationship.
For those curious, the equation (parabola) used above is
ŷ = 345.1428571x² − 1705.657143x + 42903.6.
SOURCE: US Dept. of Health and Human Services, Center for Disease Control and Prevention, HIV/AIDS Surveillance, 2003.
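The parabola's coefficients can be reproduced by a least-squares quadratic fit. A pure-Python sketch, assuming the years 1999-2003 are coded as x = 1, ..., 5 (an assumption, since the slide does not state the coding, but it reproduces the coefficients above):

```python
# Least-squares parabola y = a*x^2 + b*x + c for the AIDS-case data.
xs = [1, 2, 3, 4, 5]                             # 1999..2003 coded 1..5
ys = [41356, 41267, 40833, 41289, 43171]

def power_sum(p):
    return sum(x ** p for x in xs)

# 3x3 normal equations for the quadratic fit.
A = [[power_sum(4), power_sum(3), power_sum(2)],
     [power_sum(3), power_sum(2), power_sum(1)],
     [power_sum(2), power_sum(1), len(xs)]]
rhs = [sum(y * x ** 2 for x, y in zip(xs, ys)),
       sum(y * x for x, y in zip(xs, ys)),
       sum(ys)]

# Gaussian elimination (no pivoting needed for this small system).
for i in range(3):
    for j in range(i + 1, 3):
        f = A[j][i] / A[i][i]
        A[j] = [aj - f * ai for aj, ai in zip(A[j], A[i])]
        rhs[j] -= f * rhs[i]

# Back substitution.
coef = [0.0, 0.0, 0.0]
for i in (2, 1, 0):
    coef[i] = (rhs[i] - sum(A[i][k] * coef[k] for k in range(i + 1, 3))) / A[i][i]

a, b, c = coef
print(round(a, 4), round(b, 4), round(c, 1))  # 345.1429 -1705.6571 42903.6
```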
Regression Effect
The Regression Effect describes the tendency of individuals with extreme
values to score closer to the mean upon retesting.
The regression effect is witnessed when an experiment is repeated and
an individual's progress is tracked.
On average the top group will score lower on a second experiment, and
on average the bottom group will score higher on a second experiment.
Example: One would expect a student who received a 95 on a midterm
with a class average of 72 and a standard deviation of 8 points to score
significantly lower on the final. Outliers tend towards the mean!
Regression Effect
Example (Page 167):
How would you predict a student’s rank in a mathematics class?
Without any additional knowledge we would be safe to assume that a
student earns the mean or median among all grades; the expected
(central) values.
Correlation enters into our predictions as additional information. As
physics and mathematics can be considered similar subjects, one can
assume that a student’s success in physics would correlate to a student’s
success in mathematics. Therefore, we are able to confidently predict the
rank of a mathematics student by their rank in a physics class.
On the other hand, pottery and mathematics are not similar subjects and
one wouldn’t expect a correlation in student success. Therefore, we are
unable to confidently predict the rank of a mathematics student by their
rank in a pottery class.
Example: (Father-son heights)
In this data the average height
of fathers is x̄ = 68 inches and
the average height of sons is
ȳ = 69 inches.
One of the vertical strips in the
figure corresponds to the fathers
who are 72 inches tall. The
average height of their sons is
71 inches.
The other vertical strip
corresponds to fathers who are
64 inches tall. The average
height of their sons is 67 inches.
What’s wrong with this?
In the following figure, we see a near-perfect positive linear fit for
Android's market share plotted against time.
How could such a neat fit go wrong?
Following the green line, Android would have 120% of the market share
by 2014!
Regression Fallacy
The Regression Effect assures us that it is natural for extremes to
become average.
The Regression Fallacy is a fallacy by which individuals conjecture a
cause for an extreme to become average.
For example: being ill-prepared for an extreme event, surviving it,
preparing for another occurrence, and then conjecturing that those
preparations are what prevented the extreme event from repeating.
Example 2, p. 166 of the textbook
A university has made a statistical analysis of the relationship between
Math SAT scores (ranging from 200 to 800) and first year GPAs (ranging
from 0 to 4.0), for students who complete the first year.
average SAT score = 550,  SD of SAT scores = 80
average first-year GPA = 2.6,  SD of GPAs = 0.6
r = 0.4
The scatter diagram is football-shaped. Suppose one student's percentile
rank on the SAT is 90% among the first-year students. Predict his
percentile rank on first-year GPA.
We will make the following assumptions:
The distribution of SAT scores is approximately normal with the mean
x̄ = 550 and standard deviation sx = 80.
The distribution of GPA values is approximately normal with the
mean ȳ = 2.6 and sy = 0.6.
Example 2, p. 166 of the textbook, continued
Solution: Let's find the z-score for a student placed in the 90th
percentile. From the normal table, zx ≈ 1.3. We will use the regression
method to predict the student's GPA value from his SAT score. From the
equation of the regression line:

ẑy = r · zx = 0.4 · 1.3 = 0.52 ≈ 0.5,

which is the student's predicted standard score. From the normal table we
find that the point with z-score equal to 0.50 is at approximately the
69th percentile.
The first-year GPA’s percentile rank of a student
who is the 90th percentile on the SAT distribution
is predicted (by regression) to be 69%.
90th percentile of SAT distribution but only 69th percentile of GPA
distribution. WHY? It’s due to the regression effect.
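The same calculation can be done without the normal table, using the standard library's NormalDist. The slide rounds the 90th-percentile z-score to 1.3; carrying full precision gives essentially the same answer:

```python
from statistics import NormalDist

# Regression-method percentile prediction, assuming both distributions
# are approximately normal.
r = 0.4
z_x = NormalDist().inv_cdf(0.90)   # z-score of the 90th percentile
z_y_hat = r * z_x                  # predicted standardized GPA
percentile = NormalDist().cdf(z_y_hat)
print(round(z_x, 2), round(z_y_hat, 2), round(percentile, 2))  # 1.28 0.51 0.7
```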
Notes on Linear Regression
The least squares regression line always passes through the center of the
data; the point (x̄, ȳ ) consisting of averages.
By swapping dependent and independent status for variables, a second
regression line can be found.
We have seen that the correlation coefficient r is symmetric: if we switch
the axes, we will get the same correlation.
This is not true for regression. When switching the roles of x and y , you
get a different regression equation as there is not necessarily an equality
between x̄ and ȳ or between sx and sy .
The equations for y regressed on x and for x regressed on y are generally
different!
Twin Regression Lines
Figure : Data on the lean body mass and metabolic rate. The lines are the
least-squares regression lines of the rate on the mass (solid/red) and of the mass
on the rate (dashed/black).
Boston Marathon versus Temperature
The average finish time in minutes and the temperature during the race in
Fahrenheit for the Boston Marathon are listed below:
Year   Avg. Finish Time (minutes)   Temperature (F)
2000   221                          49
2001   226                          54
2002   221                          55
2003   235                          65
2004   253                          85
2005   237                          68
2006   230                          54
2007   234                          49
2008   231                          53
2009   229                          50
2010   230                          53
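Both regression lines on the next slides can be computed directly from this table. A Python sketch; with full precision the slopes come out near 1.06 and 0.69, and the slides' rounded 1.05 and 0.68 follow from using the rounded summary statistics:

```python
from statistics import mean

# Boston Marathon data from the table above.
time = [221, 226, 221, 235, 253, 237, 230, 234, 231, 229, 230]
temp = [49, 54, 55, 65, 85, 68, 54, 49, 53, 50, 53]

def fit(xs, ys):
    """Least-squares line of y on x: return (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_bar - m * x_bar

m1, b1 = fit(time, temp)   # temperature regressed on finish time
m2, b2 = fit(temp, time)   # finish time regressed on temperature
print(round(m1, 2), round(b1))  # 1.06 -188
print(round(m2, 2), round(b2))  # 0.69 192
```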
Boston Marathon versus Temperature
                     Finish Time   Temperature
Mean                 231.5         57.7
Standard Deviation   8.4           10.4
We have a properly scaled scatter plot:
Boston Marathon versus Temperature
We have two possible regression lines:

ŷ = 1.05x − 185        x̂ = 0.68y + 192
Football-Shaped Clustering
The data in a scatter plot for variables with a linear correlation clusters in
a football (elliptical) shape estimated by three lines:
The solid line is the regression line for y on x.
The dashed line is the SD (standard deviation) line.
The dotted line is the regression line for x on y.
The SD Line
The SD line passes through the point of averages (x̄,ȳ ).
In fact, the SD line, the regression line of y on x, and the regression line
of x on y all pass through the point of averages.
The slope of the SD line is the ratio of the standard deviations: sy/sx.
Compare this slope to the regression line slope of r · (sy/sx) and recall
that −1 ≤ r ≤ 1. This implies that the SD line has a slope of larger
absolute value than both regression lines.
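The comparison of the two slopes can be sketched directly, using made-up summary statistics:

```python
# SD-line slope versus regression slope, with made-up summary statistics.
r, s_x, s_y = 0.5, 2.0, 4.0

sd_line_slope = s_y / s_x         # sy / sx = 2.0
regression_slope = r * s_y / s_x  # r * sy / sx = 1.0
print(sd_line_slope, regression_slope)  # 2.0 1.0

# Since |r| <= 1, the regression slope never exceeds the SD-line slope
# in absolute value.
assert abs(regression_slope) <= abs(sd_line_slope)
```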