MATH 143 Activities for Mon., Nov. 21: Least Squares Regression Intro

Consider the data (from 1995) [1] concerning distances and airfares for flights originating in Baltimore, MD, given in the table below, with the accompanying scatterplot.

    Destination          Dist. (mi)   Fare ($)
    Atlanta                  576        178
    Boston                   370        138
    Chicago                  612         94
    Dallas/Fort Worth       1216        278
    Detroit                  409        158
    Denver                  1502        258
    Miami                    946        198
    New Orleans              998        188
    New York                 189         98
    Orlando                  787        179
    Pittsburgh               210        138
    St. Louis                737         98

[Scatterplot: Air fare ($), vertical axis marked from 50 to 300, versus Distance (mi), horizontal axis marked from 500 to 1500]

The relevant data is found at
http://www.calvin.edu/~scofield/data/tab/rc/airfare.dat

A natural goal is to try to use the distance of a destination to predict the airfare for flying there, and the simplest model for this prediction is to assume that a straight line summarizes the relationship between distance and airfare.

1. Place a straightedge over the scatterplot above so that the edge forms a line which roughly summarizes the relationship between distance and airfare. Draw this line on the scatterplot.

2. Roughly what airfare does your line predict for a destination which is 500 miles away?

3. Roughly what airfare does your line predict for a destination which is 1500 miles away?

The equation of a line can be represented as y = a + bx, where y denotes the variable being predicted (i.e., the response variable, plotted along the vertical axis), x denotes the variable used for the prediction (the explanatory variable, plotted along the horizontal axis), a is the value of the y-intercept of the line, and b is the value of the slope of the line. In this case x represents distance and y airfare.

[1] Taken from Workshop Statistics, Discovery with Data and Minitab, by Allan J. Rossman and Beth L. Chance, Springer-Verlag, 1998, p. 118.

4. Use your answers to 2 and 3 above to find the slope of your line, remembering that

\[ \text{slope } b = \frac{\text{rise}}{\text{run}} = \frac{\text{change in } y}{\text{change in } x}. \]

5.
Use your answers to 4 and 2 above to determine the intercept of your line. (Note: the vertical axis on the scatterplot does not extend all the way down to zero.)

6. Put your answers to 4 and 5 together to produce the equation of your line. It is good form to replace the generic symbols x and y in the equation with actual variable names, here distance and airfare.

Not surprisingly, we would prefer a better way of choosing the line describing the relationship than simply drawing one that "seems about right." Since there are infinitely many lines to select from, we need some criterion for choosing the "best" one. The most commonly used criterion goes by the name least squares, and designates the best choice as the line that minimizes the sum of squares of the residuals.

Click on "File" near the top left of the RStudio window and select "Open File". In the "File name:" box, type

    /home/scofield/sumSquaresResidTest.R

A panel should open up with lines of programming code. You need not worry about what these lines say, but you should click on "Run All" at the top right of this panel. Once you have "run all", you may close this panel by clicking the "×" appearing right after sumSquaresResidTest.R at the top left of the panel.

Much as a package adds functions to RStudio’s capabilities, the R program you just executed has added a new function sum.sq.resid(). We demonstrate the use of this function on the marriage data from last week. Recall that we were using the line y = x to predict (rather poorly, as it turns out) a wife’s age from her husband’s. The line y = x is one with slope 1 and intercept 0.
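For readers curious about what a function such as sum.sq.resid() computes, the following is a minimal sketch in R. It is not the code in sumSquaresResidTest.R (which also draws a plot); the name ssr_sketch and its argument names are assumptions made for illustration. The distances and fares are typed in from the airfare table, and the two trial lines are arbitrary guesses rather than recommended answers.

```r
# Hypothetical stand-in for sum.sq.resid(): same argument order
# (x, y, slope, intercept), but it only returns the number and draws no plot.
ssr_sketch <- function(x, y, slope, intercept) {
  fitted <- intercept + slope * x   # y-values the trial line predicts
  resid  <- y - fitted              # residual = actual - fitted
  sum(resid^2)                      # the quantity least squares minimizes
}

# Airfare data copied from the table (Atlanta through St. Louis).
distance <- c(576, 370, 612, 1216, 409, 1502, 946, 998, 189, 787, 210, 737)
fare     <- c(178, 138, 94, 278, 158, 258, 198, 188, 98, 179, 138, 98)

# Two arbitrary trial lines; the smaller result indicates the better fit.
ssr_sketch(distance, fare, slope = 0.10, intercept = 100)
ssr_sketch(distance, fare, slope = 0.12, intercept = 80)
```

Comparing the two printed values is the same kind of comparison the worksheet asks for when guesses at the slope and intercept are refined.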
Try typing

    > mar = read.table("http://www.calvin.edu/~scofield/data/tab/rc/marriage.dat",
    +                  header=T, sep="\t")
    > sum.sq.resid(mar$husband, mar$wife, 1, 0)
    The sum of squares of the residuals is 617

Along with the printed output, which declares "The sum of squares of the residuals is 617," you should see a plot of the husband-wife age data, the fitted line y = x, and vertical red line segments extending from the sampled data to the fitted line. Each of these red line segments represents a residual. The sum of their squared lengths is 617.

7. Find the sum of squares of residuals for the airfare data when fitted values come from the line you gave in answer to Question 6.

8. Modify your guesses of slope and intercept so that your fitted line improves three times, recording each attempt in the table.

    fitted slope    fitted intercept    sum of (residuals)^2
    ____________    ________________    ____________________
    ____________    ________________    ____________________
    ____________    ________________    ____________________

While the sum.sq.resid() function gives you an objective way to tell if your guesses at the best line are improving, formulas exist that allow you to jump directly to the correct slope and intercept. These formulas are

\[ b = r \, \frac{s_y}{s_x} \quad \text{and} \quad a = \bar{y} - b\,\bar{x}. \]

9. Use the formulas given above to find the equation of the best fit line (the least squares regression line) to the airfare data. Record these intermediate values: r, s_x, s_y, x̄, ȳ.

Answer:

    > sx = sd(air$distance); sx
    [1] 402.6858
    > sy = sd(air$air.fare); sy
    [1] 59.45427
    > r = cor(air$distance, air$air.fare); r
    [1] 0.7949855
    > xbar = mean(air$distance); xbar
    [1] 712.6667
    > ybar = mean(air$air.fare); ybar
    [1] 166.9167
    > b = r * sy / sx; b
    [1] 0.1173751
    > a = ybar - b * xbar; a
    [1] 83.26735

Thus, b = (0.795)(59.45)/402.69 ≈ 0.117 and, carrying the less rounded slope 0.11738, a = 166.92 − (0.11738)(712.67) ≈ 83.27.

10. Use the lm() command in RStudio to verify that you have calculated a and b correctly.
Answer:

    > summary( lm( air.fare ~ distance, data=air ) )

    Call:
    lm(formula = air.fare ~ distance, data = air)

    Residuals:
        Min      1Q  Median      3Q     Max
    -71.773  -8.690   3.527  26.826  52.005

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 83.26735   22.94928   3.628  0.00463 **
    distance     0.11738    0.02832   4.144  0.00200 **
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 37.83 on 10 degrees of freedom
    Multiple R-squared:  0.632,  Adjusted R-squared:  0.5952
    F-statistic: 17.17 on 1 and 10 DF,  p-value: 0.001999

See the correct numbers under "Estimate" next to the coefficients "(Intercept)" and "distance".

11. In numbers 2 and 3 you estimated airfares for destinations 500 and 1500 miles away using a not-so-well-fitted line. Now that you have the least squares regression line, estimate these airfares again.

Answer: For 500 miles (some students may use 300 miles instead of 500, as that is what this sheet originally said), the predicted airfare is 83.27 + (0.117)(500) = $141.77. For 1500 miles, the same calculation gives 83.27 + (0.117)(1500) = $258.77.

12. Return to the scatterplot on page 1 and add the least squares regression line. [I suggest you first plot data points corresponding to your answers to number 11.] Compare the new line to the one you "eyeballed" before.

Answers: These will vary. Students may want to reflect on why they drew the line differently than the one found via regression.

13. What airfare would the regression line predict for a flight to San Francisco, which is 2842 miles from Baltimore? Would you take this prediction as seriously as one, say, for a destination 900 miles from Baltimore? Why or why not?

Answers: The predicted airfare to San Francisco is 83.27 + (0.117)(2842) = $415.78. This is an example of extrapolation (predicting at explanatory values outside the range seen in the data) and should be viewed much more cautiously than the predicted value at a distance of 900 miles.

14.
Fill in the predicted (from the best fit line) airfares for destinations 900, 901, 902 and 903 miles from Baltimore.

    Distance (mi)    Predicted airfare
    900              $188.93
    901              $189.04
    902              $189.16
    903              $189.28

What pattern do you notice? By how many dollars is each prediction higher than the preceding one? Give a brief interpretation of the slope coefficient b for our regression line.

Answers: The predicted airfares are given in the table above, but be tolerant of nearby answers, as they will change with different rounding of a and b. As the distance goes up 1 mile, the predicted airfare goes up between $0.11 and $0.12, matching the value of the estimated slope b. The slope tells you how much the predicted (mean) airfare rises for each additional mile.

15. By how much does the regression line predict airfare to rise for each additional 50 miles of travel?

Answers: It predicts airfare to rise (50)(0.117) = $5.85.

In statistical modeling, one usually thinks of each data point as being composed of two parts: the part explained by the model (called the fitted or predicted value), and the "leftover" part (called the residual). The latter is either the result of chance variation or of other variables not included in the model. In the context of least squares regression, the fitted value for an observation is simply the y-value that the regression line would predict for the corresponding x-value of that observation. The corresponding residual is the difference between what is actually observed at that x and the fitted value (i.e., residual = actual − fitted). So, the residual appears as a vertical distance from the observed y to the regression line.

16. Looking back at the airfare data, you see that Atlanta is 576 miles from Baltimore. Find the predicted value for this observation.

Answer: The predicted airfare is 83.27 + (0.117)(576) = $150.66.

17. The actual airfare from Baltimore to Atlanta is $178. Find the residual for Atlanta.
Answer: Atlanta’s residual is $178 − $150.66 = $27.34.

18. Fill in the missing values in the table. Which city has the largest (in absolute value) residual? What were its distance and airfare? By how much did the regression line err in predicting its airfare? Was it an overestimate or underestimate? In general, what can be said about those predicted values which are overestimated? How do you identify these when looking at the scatterplot with regression line overlaid? [The reading told you how to produce such a plot. Adapt the command to obtain one here.]

    Destination          Dist.   Fare   Fitted   Residual
    Atlanta                576    178   150.66      27.34
    Boston                 370    138   126.70      11.30
    Chicago                612     94   155.10     -61.10
    Dallas/Fort Worth     1216    278   226.00      52.00
    Detroit                409    158   131.27      26.73
    Denver                1502    258   259.57      -1.56
    Miami                  946    198   194.30       3.70
    New Orleans            998    188   200.41     -12.41
    New York               189     98   105.45      -7.45
    Orlando                787    179   175.64       3.36
    Pittsburgh             210    138   107.92      30.08
    St. Louis              737     98   169.77     -71.77

Answers: The city with the largest absolute residual is St. Louis. Its distance and airfare are 737 miles and $98, respectively. The regression line predicted $169.77, overestimating this airfare by $71.77. Those predicted values which are overestimates produce negative residuals, and correspond to points that lie below the regression line.

The standard deviation of the residuals (the column of numbers on the far right above) is a numerical measure of how much of the variability in the data (airfares) is left unexplained by the model.

19. Find the ratio of this column’s standard deviation to the standard deviation of the airfares themselves. Then square the value.

Answer: From the output below, this squared ratio is approximately 0.368.

    > airfareLM = lm(air.fare ~ distance, data=air)
    > ratio = sd(airfareLM$residuals) / sd(air$air.fare); ratio
    [1] 0.6066284
    > ratio^2
    [1] 0.3679981

20. Add to your answer in 19 the square of the correlation. What is the result?

Answer: From the output below, their sum is 1.
    > ratio^2 + cor(air$air.fare, air$distance)^2
    [1] 1

This is not a coincidence.

21. What proportion of the variability in airfares is "explained" by the regression line with distance?

Answer:

    > cor(air$air.fare, air$distance)^2
    [1] 0.6320019

So the proportion of variability explained by the model is approximately 63%.
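As a recap, the calculations in Questions 9, 10, 19, 20 and 21 can be reproduced without downloading airfare.dat by typing in the twelve rows of the airfare table. The sketch below uses only base R; the object names (air_sketch, fit, and so on) are chosen here for illustration and do not come from the worksheet.

```r
# Rebuild the airfare data from the table (column names follow the worksheet).
air_sketch <- data.frame(
  distance = c(576, 370, 612, 1216, 409, 1502, 946, 998, 189, 787, 210, 737),
  air.fare = c(178, 138, 94, 278, 158, 258, 198, 188, 98, 179, 138, 98)
)

# Slope and intercept from the formulas b = r * sy / sx and a = ybar - b * xbar.
r <- cor(air_sketch$distance, air_sketch$air.fare)
b <- r * sd(air_sketch$air.fare) / sd(air_sketch$distance)
a <- mean(air_sketch$air.fare) - b * mean(air_sketch$distance)

# The same line from lm(); coef() returns the intercept first, then the slope.
fit <- lm(air.fare ~ distance, data = air_sketch)
all.equal(unname(coef(fit)), c(a, b))

# Unexplained share of variability plus r^2 should equal 1.
ratio <- sd(residuals(fit)) / sd(air_sketch$air.fare)
all.equal(ratio^2 + r^2, 1)
```

Both all.equal() calls should report TRUE, which checks the hand formulas against lm() and confirms the identity from Question 20 up to floating-point rounding.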