MATH 143 Activities for Mon., Nov. 21: Least Squares Regression Intro
Consider the data (from 1995)1 concerning distances and airfares for flights originating in
Baltimore, MD given in the table, with the given scatterplot.
Destination          Dist.   Fare
Atlanta                576    178
Boston                 370    138
Chicago                612     94
Dallas/Fort Worth     1216    278
Detroit                409    158
Denver                1502    258
Miami                  946    198
New Orleans            998    188
New York               189     98
Orlando                787    179
Pittsburgh             210    138
St. Louis              737     98

[Scatterplot: Air fare ($), axis marked from 50 to 300, versus Distance (mi), axis marked
from 500 to 1500, one point per destination]
The relevant data is found at
http://www.calvin.edu/~scofield/data/tab/rc/airfare.dat
A natural goal is to try to use the distance of a destination to predict the airfare for
flying there, and the simplest model for this prediction is to assume that a straight line
summarizes the relationship between distance and airfare.
1. Place a straightedge over the scatterplot above so that the edge forms a line which
roughly summarizes the relationship between distance and airfare. Draw this line
on the scatterplot.
2. Roughly what airfare does your line predict for a destination which is 500 miles
away?
3. Roughly what airfare does your line predict for a destination which is 1500 miles
away?
The equation of a line can be represented as y = a + bx, where y denotes the variable
being predicted (i.e., the response variable, plotted along the vertical axis), x denotes the
variable used for the prediction (the explanatory variable, plotted along the horizontal
axis), a is the value of the y-intercept of the line, and b is the value of the slope of the line.
In this case x represents distance and y airfare.
1 Taken from Workshop Statistics: Discovery with Data and Minitab, by Allan J. Rossman and
Beth L. Chance, Springer-Verlag, 1998, p. 118.
4. Use your answers to 2 and 3 above to find the slope of your line, remembering that
       slope b = rise / run = (change in y) / (change in x).
5. Use your answers to 4 and 2 above to determine the intercept of your line. (Note:
the vertical axis on the scatterplot does not extend all the way down to zero.)
6. Put your answers to 4 and 5 together to produce the equation of your line. It is good
form to replace the generic symbols x and y in the equation with actual variable
names, here distance and airfare.
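The arithmetic in Questions 4 through 6 can be sketched in R. The two readings used below
($140 at 500 miles and $260 at 1500 miles) are made-up stand-ins for whatever your own line
predicts, and the names b.guess and a.guess are chosen here for illustration only.

```r
# Hypothetical readings from a hand-drawn line (assumed values; replace with your own)
x1 = 500;  y1 = 140   # fare your line predicts at 500 miles
x2 = 1500; y2 = 260   # fare your line predicts at 1500 miles

# Slope = rise / run
b.guess = (y2 - y1) / (x2 - x1)

# Intercept: solve y1 = a + b * x1 for a
a.guess = y1 - b.guess * x1

# Equation of the eyeballed line, written with variable names
cat("airfare =", a.guess, "+", b.guess, "* distance\n")
```

With these particular made-up readings the slope is 0.12 and the intercept is 80; your own
line will most likely give somewhat different numbers.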
Not surprisingly, we would prefer a better way of choosing the line describing the relationship over simply drawing one that "seems about right." Since there are infinitely
many lines to select from, we need some criterion for choosing the "best" one. The most
commonly used criterion goes by the name least squares, and designates the best choice
as the line that minimizes the sum of squares of the residuals.
Click on "File" near the top left of the RStudio window and select "Open File". In the "File
name:" box, type
/home/scofield/sumSquaresResidTest.R
A panel should open up with lines of programming code. You need not worry about
what these lines say, but you should click on "Run All" at the top right of this panel. Once
you have "run all", you may close this panel by clicking the "×" appearing right after
sumSquaresResidTest.R at the top left of the panel.
Much as a package adds functions to RStudio’s capabilities, the R program you just
executed has added a new function sum.sq.resid(). We demonstrate the use of this
command on the marriage data from last week. Recall that we were using the line y = x
to predict (rather poorly, as it turns out) a wife’s age from her husband’s. The line y = x
is one with slope 1 and intercept 0. Try typing
> mar = read.table("http://www.calvin.edu/~scofield/data/tab/rc/marriage.dat",
+                  header=T, sep="\t")
> sum.sq.resid(mar$husband, mar$wife, 1, 0)
The sum of squares of the residuals is 617
Along with the output I’ve printed—which declares "the sum of squares of the residuals is
617"—you should see a plot of the husband-wife age data, the fitted line y = x, and vertical
red line segments extending from the sampled data to the fitted line. Each of these red
line segments represents a residual. The sum of their squared lengths is 617.
7. Find the sum of squares of residuals for the airfare data when fitted values come from
the line you gave in answer to Question 6.
8. Modify your guess of slope and intercept in such a way that you improve your fitted
line three times.
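fitted slope     fitted intercept     Σ(residuals)^2
____________     ________________     ______________
____________     ________________     ______________
____________     ________________     ______________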
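For readers without access to the course script, the computation behind Questions 7 and 8
can be sketched as follows. The function ssr() is a hypothetical stand-in written for this
example (it is not the author's sum.sq.resid() and draws no plot), the data frame is typed
in from the table on page 1 rather than read from the course URL, and the three guessed
lines are made up.

```r
# Airfare data typed in from the table on page 1 (column names as used later in the handout)
air = data.frame(
  distance = c(576, 370, 612, 1216, 409, 1502, 946, 998, 189, 787, 210, 737),
  air.fare = c(178, 138, 94, 278, 158, 258, 198, 188, 98, 179, 138, 98)
)

# Hypothetical stand-in for sum.sq.resid(): returns the number only, no plot.
# Argument order (x, y, slope, intercept) is assumed to mirror the call shown earlier.
ssr = function(x, y, slope, intercept) {
  fitted = intercept + slope * x   # y-values the line predicts
  sum((y - fitted)^2)              # residual = actual - fitted, squared and summed
}

# Three made-up guesses for the line
guesses = data.frame(slope = c(0.12, 0.10, 0.115), intercept = c(80, 100, 85))
guesses$ssr = mapply(function(b, a) ssr(air$distance, air$air.fare, b, a),
                     guesses$slope, guesses$intercept)
guesses   # a smaller ssr means a better-fitting line
```

Comparing the ssr column from row to row is the same check you perform by hand when you
refine your guess.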
While the sum.sq.resid() function gives you an objective way to tell if your guesses to
the best line are improving, formulas exist that allow you to jump directly to the correct
slope and intercept. These formulas are
       b = r (s_y / s_x),
and
       a = ȳ − b x̄.
9. Use the formulas given above to find the equation of the best fit line (the least squares
regression line) to the airfare data. Record these intermediate values: r, s_x, s_y, x̄, ȳ.
Answer:
> sx = sd(air$distance); sx
[1] 402.6858
> sy = sd(air$air.fare); sy
[1] 59.45427
> r = cor(air$distance, air$air.fare); r
[1] 0.7949855
> xbar = mean(air$distance); xbar
[1] 712.6667
> ybar = mean(air$air.fare); ybar
[1] 166.9167
> b = r * sy / sx; b
[1] 0.1173751
> a = ybar - b * xbar; a
[1] 83.26735
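Thus, b = (0.795)(59.45)/402.69 ≈ 0.117, and a = 166.92 − (0.11738)(712.67) ≈ 83.27. (The
intercept is computed with the less heavily rounded slope; using 0.117 would give about 83.54.)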
10. Use the lm() command in RStudio to verify that you have calculated a and b correctly.
Answer:
> summary( lm( air.fare ~ distance, data=air ) )

Call:
lm(formula = air.fare ~ distance, data = air)

Residuals:
    Min      1Q  Median      3Q     Max
-71.773  -8.690   3.527  26.826  52.005

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 83.26735   22.94928   3.628  0.00463 **
distance     0.11738    0.02832   4.144  0.00200 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 37.83 on 10 degrees of freedom
Multiple R-squared: 0.632,      Adjusted R-squared: 0.5952
F-statistic: 17.17 on 1 and 10 DF,  p-value: 0.001999
See the correct numbers under "Estimate" next to the coefficients "(Intercept)" and "distance".
11. In numbers 2 and 3 you estimated airfares for destinations 500 and 1500 miles away
using a not-so-well-fitted line. Now that you have the least squares regression line,
estimate these airfares again.
Answer: For 500 miles (some students may use 300 miles instead of 500, as that is what
this sheet originally said), the predicted air fare is
83.27 + (0.117)(500) = $141.77.
For 1500 miles, the predicted air fare is
83.27 + (0.117)(1500) = $258.77.
12. Return to the scatterplot on page 1 and add the least squares regression line. [I
suggest you first plot data points corresponding to your answers to number 11.]
Compare the new line to the one you "eyeballed" before.
Answers: These will vary. Students may want to reflect on why they drew the line
differently than the one found via regression.
13. What airfare would the regression line predict for a flight to San Francisco, which is
2842 miles from Baltimore? Would you take this prediction as seriously as one, say,
for a destination 900 miles from Baltimore? Why or why not?
Answers: The predicted airfare to San Francisco is 83.27 + (0.117)(2842) = $415.78. This
is an example of extrapolation (predicting at explanatory values outside the range seen
in the data) and should be viewed much more cautiously than the predicted value at a
distance of 900 miles.
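The predictions in Questions 11 and 13 can also be obtained with predict(), as in the
sketch below. The data frame is typed in from the table on page 1 (an assumption standing
in for the course data file), and the "extrapolated" column is a helper added here for
illustration. Because R keeps all decimal places of a and b, its predictions differ
slightly from the hand-rounded ones above.

```r
# Airfare data typed in from the table on page 1
air = data.frame(
  distance = c(576, 370, 612, 1216, 409, 1502, 946, 998, 189, 787, 210, 737),
  air.fare = c(178, 138, 94, 278, 158, 258, 198, 188, 98, 179, 138, 98)
)
fit = lm(air.fare ~ distance, data = air)

# Distances asked about in Questions 11 and 13
new = data.frame(distance = c(500, 1500, 900, 2842))
new$predicted = predict(fit, newdata = new)

# Flag distances that fall outside the range of distances in the data
rng = range(air$distance)
new$extrapolated = new$distance < rng[1] | new$distance > rng[2]
new
```

Only the San Francisco distance is flagged, since the observed distances run from 189 to
1502 miles.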
14. Fill in the predicted (from the best fit line) airfares for destinations 900, 901, 902 and
903 miles from Baltimore.
Distance              900       901       902       903
Predicted airfare     $188.93   $189.04   $189.16   $189.28
What pattern do you notice? By how many dollars is each prediction higher than the
preceding one? Give a brief interpretation of the slope coefficient b for our regression
line.
Answers: The predicted airfares are given in the table above, but be tolerant of nearby
answers, as they will change with a different roundoff for a and b. As the distance
goes up 1 mile, the predicted airfare goes up between $0.11 and $0.12, matching the value
of the estimated slope b. This slope tells you, on average, how much the mean airfare
goes up for each additional mile.
15. By how much does the regression line predict airfare to rise for each additional 50
miles of travel?
Answers:
It predicts airfare to rise (50)(0.117) = $5.85.
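The pattern in Questions 14 and 15 can be checked in R with a short sketch like the one
below, which again assumes the data frame is typed in from the table on page 1. With the
unrounded slope the 50-mile increase comes out near $5.87 rather than the hand-rounded
$5.85.

```r
# Airfare data typed in from the table on page 1
air = data.frame(
  distance = c(576, 370, 612, 1216, 409, 1502, 946, 998, 189, 787, 210, 737),
  air.fare = c(178, 138, 94, 278, 158, 258, 198, 188, 98, 179, 138, 98)
)
fit = lm(air.fare ~ distance, data = air)
b = coef(fit)["distance"]   # estimated slope

# Predictions at 900, 901, 902 and 903 miles
pred = predict(fit, newdata = data.frame(distance = 900:903))

diff(pred)   # each successive difference equals the slope b
50 * b       # predicted rise in airfare per additional 50 miles
```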
In statistical modeling, one usually thinks of each data point as being comprised of two
parts: the part explained by the model (called the fitted or predicted value), and the
"leftover" part (called the residual). The latter is either the result of chance variation or
of other variables not included in the model. In the context of least squares regression,
the fitted value for an observation is simply the y-value that the regression line would
predict for the corresponding x-value of that observation. The corresponding residual is
the difference between what is actually observed at that x and the fitted value (i.e., residual
= actual - fitted). So, the residual appears as a vertical distance from the observed y to the
regression line.
16. Looking back at the airfare data, you see that Atlanta is 576 miles from Baltimore.
Find the predicted value for this observation.
Answer:
The predicted airfare is 83.27 + (0.117)(576) = $150.66.
17. The actual airfare from Baltimore to Atlanta is $178. Find the residual for Atlanta.
Answer:
Atlanta’s residual is
$178 − $150.66 = $27.34.
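As a cross-check on Questions 16 and 17, fitted() and residuals() return these quantities
for every observation of a fitted model. The sketch below assumes the data frame is typed
in from the table on page 1, with a destination column added for lookup. R uses the
unrounded coefficients, so its values for Atlanta (roughly 150.88 and 27.12) differ a
little from the hand-rounded ones above.

```r
# Airfare data typed in from the table on page 1
air = data.frame(
  destination = c("Atlanta", "Boston", "Chicago", "Dallas/Fort Worth", "Detroit",
                  "Denver", "Miami", "New Orleans", "New York", "Orlando",
                  "Pittsburgh", "St. Louis"),
  distance = c(576, 370, 612, 1216, 409, 1502, 946, 998, 189, 787, 210, 737),
  air.fare = c(178, 138, 94, 278, 158, 258, 198, 188, 98, 179, 138, 98)
)
fit = lm(air.fare ~ distance, data = air)

atl = which(air$destination == "Atlanta")
fitted(fit)[atl]      # predicted airfare for Atlanta
residuals(fit)[atl]   # actual - fitted for Atlanta

# Every observation splits the same way: actual = fitted + residual
all.equal(unname(fitted(fit) + residuals(fit)), air$air.fare)
```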
18. Fill in the missing values in the table. Which city has the largest (in absolute
value) residual? What were its distance and airfare? By how much did the regression
line err in predicting its airfare? Was it an overestimate or underestimate? In general,
what can be said about those predicted values which are overestimated? How do you
identify these when looking at the scatterplot with regression line overlaid? [The
reading told you how to attain such a plot. Adapt the command to obtain one here.]

Destination          Dist.   Fare   Fitted   Residual
Atlanta                576    178   150.66      27.34
Boston                 370    138   126.70      11.3
Chicago                612     94   154.87     -61.10
Dallas/Fort Worth     1216    278   226.00      52.00
Detroit                409    158   131.27      26.73
Denver                1502    258   259.57      -1.56
Miami                  946    198   194.30       3.70
New Orleans            998    188   200.41     -12.41
New York               189     98   105.45      -7.45
Orlando                787    179   175.64       3.36
Pittsburgh             210    138   107.92      30.08
St. Louis              737     98   169.77     -71.77
Answers: The city with the largest absolute residual is St. Louis. Its distance and
airfare are 737 miles and $98, respectively. The regression line predicted $169.77, so it
overestimated this airfare by $71.77. Those predicted values which are overestimates
produce negative residuals, and correspond to points that lie below the regression line.
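One way to produce the table and plot for Question 18 is sketched below. It assumes the
data frame is typed in from the table on page 1, and it uses base R's plot() and abline(),
which may not be the command given in the reading. Fitted values computed with unrounded
coefficients can differ in the last digits from hand-computed entries.

```r
# Airfare data typed in from the table on page 1
air = data.frame(
  destination = c("Atlanta", "Boston", "Chicago", "Dallas/Fort Worth", "Detroit",
                  "Denver", "Miami", "New Orleans", "New York", "Orlando",
                  "Pittsburgh", "St. Louis"),
  distance = c(576, 370, 612, 1216, 409, 1502, 946, 998, 189, 787, 210, 737),
  air.fare = c(178, 138, 94, 278, 158, 258, 198, 188, 98, 179, 138, 98)
)
fit = lm(air.fare ~ distance, data = air)

# Table of fitted values and residuals, rounded to cents
tab = data.frame(air,
                 fitted   = round(fitted(fit), 2),
                 residual = round(residuals(fit), 2))
tab

# Row with the largest residual in absolute value
tab[which.max(abs(tab$residual)), ]

# Overestimated airfares have negative residuals
as.character(tab$destination[tab$residual < 0])

# Scatterplot with the regression line overlaid; overestimates lie below the line
plot(air.fare ~ distance, data = air, xlab = "Distance (mi)", ylab = "Air fare ($)")
abline(fit)
```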
The standard deviation of the residuals (the column of numbers on the far right above) is a
numerical measure of how much of the variability in the data (airfares) is left unexplained
by the model.
19. Find the ratio of this column’s standard deviation to the standard deviation of the
airfares themselves. Then square the value.
Answer: From the output below, this squared ratio is approximately 0.368.
> airfareLM = lm(air.fare ~ distance, data=air)
> ratio = sd(airfareLM$residuals) / sd(air$air.fare); ratio
[1] 0.6066284
> ratio^2
[1] 0.3679981
20. Add to your answer in 19 the square of the correlation. What is the result?
Answer: From the output below, their sum is 1.
> ratio^2 + cor(air$air.fare, air$distance)^2
[1] 1
This is not a coincidence.
21. What proportion of the variability in airfares is "explained" by the regression line
with distance?
Answer:
> cor(air$air.fare, air$distance)^2
[1] 0.6320019
So the portion of variability explained by the model is approximately 63%.
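The connection among Questions 19 through 21 can be seen by computing the same proportion
three ways, as in the sketch below. It assumes the data frame is typed in from the table
on page 1, and the names r2.cor, r2.lm and r2.resid are chosen here for illustration.

```r
# Airfare data typed in from the table on page 1
air = data.frame(
  distance = c(576, 370, 612, 1216, 409, 1502, 946, 998, 189, 787, 210, 737),
  air.fare = c(178, 138, 94, 278, 158, 258, 198, 188, 98, 179, 138, 98)
)
fit = lm(air.fare ~ distance, data = air)

r2.cor   = cor(air$distance, air$air.fare)^2               # squared correlation
r2.lm    = summary(fit)$r.squared                          # "Multiple R-squared" from lm
r2.resid = 1 - var(residuals(fit)) / var(air$air.fare)     # 1 - unexplained share

c(r2.cor, r2.lm, r2.resid)   # all three agree, about 0.632
```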