Name _________________

AP Stats Least Squares Regression

The data below are airfares to various cities from Baltimore, MD, followed by their descriptive statistics.

Airfares ($): 178, 138, 94, 278, 158, 258, 198, 188, 98, 179, 138, 98

    N           12
    Mean        166.92
    Std. Dev.   59.5
    Min         94
    Q1          108
    Median      168
    Q3          195.5
    Max         278

1. If someone asks how much they can expect to pay for airfare from Baltimore, what prediction would you give? Explain.

2. Suggest another variable (an explanatory variable) that might be useful for predicting the airfare to a certain destination (the response variable).

The following table reports the distance (in miles) as well as the airfare for 12 destinations. A scatterplot is given as well.

        Destinat...     Distance   Airfare   newfare
     1  Atlanta              576       178       158
     2  Boston               370       138       118
     3  Chicago              612        94        96
     4  Dallas/Fo...        1216       278       300
     5  Detroit              409       158       162
     6  Denver              1502       258       308
     7  Miami                946       198       138
     8  New Orle...          998       188       118
     9  New York             189        98       108
    10  Orlando              787       179       118
    11  Pittsburgh           210       138       198
    12  St. Louis            737        98       178

[Scatter plot: Airfare (vertical axis, 80 to 280) vs. Distance (horizontal axis, 0 to 1600)]

3. Based on this scatterplot, does it seem that knowing the distance to a destination would be useful for predicting the airfare? Explain.

4. Place a ruler (or straightedge) over the scatterplot so that it forms a straight line that roughly summarizes the relationship between distance and airfare. Then draw this line on the scatterplot.

5. Roughly what airfare does your line predict for a destination that is x1 = 300 miles away?
   y1 = ______

6. Roughly what airfare does your line predict for a destination that is x2 = 1500 miles away?
   y2 = ______

The equation of a line can be represented as y = a + bx, where y denotes the response variable being predicted, x denotes the explanatory variable being used for prediction, a is the value of the y-intercept, and b is the value of the slope of the line.

7. Use your answers to 5 & 6 to find the slope (change in y over change in x) of your line.
   Slope = ______

8. Use your answers to 5 & 7 to determine the y-intercept of your line using the following equation: a = y1 - bx1.
   y-intercept = ______

9. Put your answers to 7 & 8 together to produce the equation of your line.

Naturally, we would like to have a better way of choosing the line to approximate a relationship than simply drawing one that seems about right. Since there are infinitely many lines that one could draw, however, we need some criterion to select which line is the "best" at describing the relationship. The most commonly used criterion is least squares, which says to choose the line that minimizes the sum of squared vertical distances from the points to the line.

We write the equation of the least squares line (also known as the regression line) as y = a + bx. The most convenient expression for calculating these coefficients relates them to the means and standard deviations of the two variables and the correlation coefficient between them:

    $b = r \frac{s_y}{s_x}$   and   $a = \bar{y} - b\,\bar{x}$

(where $\bar{x}$ and $\bar{y}$ represent the means of the variables, $s_x$ and $s_y$ are the standard deviations, and $r$ is the correlation between them).

10. Enter the airfare data into your calculator (L1 and L2).

11. Use the CORR program to compute the mean and standard deviation of distance and airfare, and the value of the correlation between the two. Record the results below:

                      Mean       Std. Dev.    Correlation
    Airfare (y)       ______     ______
                                              ______
    Distance (x)      ______     ______

12. Write the equation of the least squares line for predicting airfare from distance.

We will now use the calculator to produce the regression line. In order for your calculator to display the correlation coefficient, you will need to turn the diagnostic display mode to ON. To do this, access the calculator's CATALOG menu (2nd 0) and scroll down until you find "DiagnosticOn." Press ENTER twice.

13. To use your calculator to find the equation of the least squares line, use the STAT→CALC menu and select option LinReg(a+bx).

14. To find the least squares line for predicting airfare from distance, complete the command by entering the two appropriate lists (L1 and L2) so that you have the following:

    LinReg(a+bx) L1,L2

15. Press ENTER and write the equation below. Compare this equation with the equation you found in #12.

One of the primary uses of regression is prediction: one can use the regression line to predict the value of the y-variable for a given value of the x-variable simply by plugging that value of x into the equation of the regression line.

16. What airfare does the least squares line predict for a destination that is 300 miles away?

17. What airfare does the least squares line predict for a destination that is 1500 miles away?

18. Draw the least squares line on the scatterplot below by plotting the two points that you found in 16 & 17 and connecting them with a straight line.

[Scatter plot: Airfare (vertical axis, 80 to 280) vs. Distance (horizontal axis, 0 to 1600)]

19. To use your calculator to create a scatterplot of airfare vs. distance and then graph the least squares line:
    - Select LinReg(a+bx) from the STAT→CALC menu as before.
    - Enter L1,L2 as before, but now type another comma. Press the VARS button and then use the right arrow to see the Y-VARS menu. Press ENTER to select Function and then ENTER again to select Y1. You should have the following screen:

          LinReg(a+bx) L1,L2,Y1

    - Press ENTER. If you now press the blue Y= button at the top of your calculator, you will see the regression equation stored as the Y1 function.
    - Press GRAPH.

20. Just from looking at the regression line that you have drawn on the scatterplot, guess what value the regression line would predict for the airfare to a destination 900 miles away.

21. Use the equation of the regression line to predict the airfare to a destination 900 miles away, and compare this prediction to your guess in #20.

22. What airfare would the regression line predict for a flight to San Francisco, which is 2842 miles from Baltimore? Would you consider this prediction as reliable as the one for 900 miles? Explain.

The actual airfare to San Francisco at that time was $198. That the regression line's prediction is not very reasonable illustrates the danger of extrapolation, i.e., of trying to predict y for values of x beyond those contained in the data.

23. Use the equation of the regression line to predict the airfare if the distance is 900 miles. Record the prediction in the table below, and repeat for distances of 901, 902, and 903 miles.

    Distance    Predicted airfare
    900         ______
    901         ______
    902         ______
    903         ______

24. Do you notice a pattern in these predictions? By how many dollars is each prediction higher than the preceding one? Does this number look familiar? Explain.

25. By how much does the regression line predict airfare to rise for each additional 100 miles that a destination is farther away?

26. If you look back at the original listing of distances and airfares, you find that Atlanta is 576 miles from Baltimore. What airfare would the regression line have predicted for Atlanta?

27. The actual airfare to Atlanta at that time was $178. By how much does the actual price exceed the prediction?

The residual is the difference between the actual y-value and the predicted value (from the regression line), so the residual measures the vertical distance from the observed y-value to the regression line.

28. Record Atlanta's fitted value and residual in the table below. Then calculate Boston's residual and Chicago's fitted value using the definition of residual (i.e., without using the equation of the regression line), showing your calculations.

        Destinat...     Distance   Airfare   fittedvalue   residual   meanfare   deviation
     1  Atlanta              576       178        ______     ______     166.92       11.08
     2  Boston               370       138         126.7     ______     166.92      -28.92
     3  Chicago              612        94        ______      -61.1     166.92      -72.92
     4  Dallas/Fo...        1216       278           226         52     166.92      ______
     5  Detroit              409       158        131.27      26.73     166.92       -8.92
     6  Denver              1502       258        259.57      -1.56     166.92       91.08
     7  Miami                946       198         194.3        3.7     166.92       31.08
     8  New Orle...          998       188        200.41     -12.41     166.92       21.08
     9  New York             189        98        105.45      -7.45     166.92      -68.92
    10  Orlando              787       179        175.64       3.36     166.92       12.08
    11  Pittsburgh           210       138        107.92      30.08     166.92      -28.92
    12  St. Louis            737        98        169.77     -71.77     166.92      -68.92

29. Which city has the largest (in absolute value) residual? What were its distance and airfare? By how much did the regression line err in predicting its airfare; was it an underestimate or an overestimate? Circle this observation on the scatterplot containing the regression line.

30. For observations with positive residual values, was their actual airfare greater or less than the predicted airfare?

31. For observations with negative residual values, do their points on the scatterplot fall above or below the regression line?

32. The mean of these airfares is $166.92 with a standard deviation of $59.45. In the absence of information about distance, we could use the overall mean airfare as the prediction for each city's airfare. In this situation, the deviations from the mean would be the errors in those predictions. The last column of the table above reports these deviations. Determine the deviation from the mean airfare for Dallas and record it in the table.

33. For how many cities does the overall mean airfare result in a closer prediction to the actual airfare than the regression line does? In other words, how many cities have a deviation from the mean that is smaller (in absolute value) than their residual from the regression line? Which cities are these?

34. Do most cities have a smaller residual or a smaller deviation from the mean? Does this suggest that predictions from the regression line are generally better than the airfare mean? Explain.

This worksheet was adapted from "Workshop Statistics", 2nd edition, by Alan Rossman, Beth Chance, J. Barr Von Oehsen.
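
Optional check: the hand and calculator work in questions 11 through 28 can be cross-checked with a short script. The Python sketch below is not part of the original worksheet. It assumes the twelve distance and airfare pairs from the table (with St. Louis read as 737 miles and $98), and the function names (`correlation`, `least_squares_line`, `predict`) are made up for illustration. It applies the formulas b = r(sy/sx) and a = (mean of y) - b(mean of x), then computes residuals and deviations from the mean as defined above.

```python
from statistics import mean, stdev

# Data from the worksheet table (St. Louis read as 737 miles, $98).
cities = ["Atlanta", "Boston", "Chicago", "Dallas/Fo...", "Detroit", "Denver",
          "Miami", "New Orle...", "New York", "Orlando", "Pittsburgh", "St. Louis"]
distance = [576, 370, 612, 1216, 409, 1502, 946, 998, 189, 787, 210, 737]
airfare = [178, 138, 94, 278, 158, 258, 198, 188, 98, 179, 138, 98]


def correlation(xs, ys):
    """Correlation r: average product of z-scores, dividing by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)


def least_squares_line(xs, ys):
    """Return (a, b, r) for the line y = a + b*x using b = r*sy/sx, a = ybar - b*xbar."""
    r = correlation(xs, ys)
    b = r * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b, r


def predict(a, b, x):
    """Predicted airfare for a destination x miles away."""
    return a + b * x


a, b, r = least_squares_line(distance, airfare)
print(f"r = {r:.3f}, line: airfare = {a:.2f} + {b:.4f} * distance")

# Predictions asked for in the worksheet; 2842 miles lies far outside the
# data (largest distance 1502), so that one is an extrapolation.
for miles in (300, 900, 1500, 2842):
    print(f"{miles:>5} miles -> predicted airfare {predict(a, b, miles):.2f}")

# Residual = actual - fitted; deviation = actual - overall mean airfare.
fitted = [predict(a, b, x) for x in distance]
residuals = [y - f for y, f in zip(airfare, fitted)]
deviations = [y - mean(airfare) for y in airfare]

for city, f, res, dev in zip(cities, fitted, residuals, deviations):
    print(f"{city:<13} fitted {f:7.2f}  residual {res:7.2f}  deviation {dev:7.2f}")
```

The fitted values and residuals printed for Detroit through St. Louis should agree with the table in question 28 up to rounding, which is a useful sanity check before filling in the blank cells.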