Example: Old Faithful
Given the sample data in Table 10-1, find the regression equation.
Question: Is there a correlation between the duration of an eruption and the time interval after the eruption?

Duration (sec)         240  120  178  234  235  269  255  220
Interval After (min)    92   65   72   94   83   94  101   87
Slide 1

Solution
Using the same procedure as in the previous example, we find that b1 = 0.234 and b0 = 34.8. Hence, the estimated regression equation is:
ŷ = 34.8 + 0.234x
Slide 2

Example: Old Faithful - cont
Given the sample data in Table 10-1, find the regression equation.
Slide 3

Example: Old Faithful
Given the sample data in Table 10-1, we found that the regression equation is ŷ = 34.8 + 0.234x. Assuming that the current eruption has a duration of x = 180 sec, find the best predicted value of y, the time interval after this eruption.
Slide 4

Part 2: Beyond the Basics of Regression
Predictions
In predicting a value of y based on some given value of x ...
1. If there is not a linear correlation, the best predicted y-value is ȳ, the mean of the sample y-values.
2. If there is a linear correlation, the best predicted y-value is found by substituting the x-value into the regression equation.
Slide 5

Guidelines for Using the Regression Equation
1. If there is no linear correlation, don’t use the regression equation to make predictions.
2. When using the regression equation for predictions, stay within the scope of the available sample data (no extrapolating!).
3. A regression equation based on old data is not necessarily valid now.
4. Don’t make predictions about a population that is different from the population from which the sample data were drawn.
Slide 6

CwK p. 553 #7 and 8!
Slide 7

Definitions
Marginal Change: The marginal change is the amount that a variable changes when the other variable changes by exactly one unit.
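The Old Faithful fit and the marginal-change idea can be checked numerically. The following is a minimal Python sketch, assuming the eight (duration, interval) pairs listed for Table 10-1 earlier; the helper names `least_squares` and `predict` are illustrative and do not come from the slides.

```python
# Old Faithful sample data as listed for Table 10-1.
duration = [240, 120, 178, 234, 235, 269, 255, 220]  # x, seconds
interval = [92, 65, 72, 94, 83, 94, 101, 87]         # y, minutes


def least_squares(xs, ys):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1


b0, b1 = least_squares(duration, interval)


def predict(x):
    """Predicted interval (minutes) for an eruption lasting x seconds."""
    return b0 + b1 * x


print(round(b0, 1), round(b1, 3))   # 34.8 0.234
print(round(predict(180), 1))       # best predicted interval for x = 180 sec

# Marginal change: raising x by one unit changes y-hat by exactly the slope.
print(predict(181) - predict(180), b1)
```

With these data the rounded coefficients agree with the b0 = 34.8 and b1 = 0.234 quoted in the solution, and the prediction for a 180-second eruption comes out at about 76.9 minutes. Because x = 180 lies between the smallest and largest sampled durations, this prediction is not an extrapolation.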
Example: The regression line ŷ = 34.8 + 0.234x has a slope of 0.234.
Interpretation: If we increase x (duration time) by 1 second, the predicted time interval after the eruption increases by 0.234 minutes.

Outlier: An outlier is a point lying far away from the other data points.
Influential Point: An influential point strongly affects the graph of the regression line.
Slide 8

Definition
Residual: The residual for a sample of paired (x, y) data is the difference (y - ŷ) between an observed sample y-value and the value of ŷ, which is the value of y that is predicted by using the regression equation.
residual = observed y - predicted y = y - ŷ
Slide 9

Example
• Find the regression line for the following table:
• Find ŷ!
• Find residuals
• Graph residuals

x    1    2    4    5
y    4   24    8   32
Slide 10

Definitions
Least-Squares Property: A straight line has the least-squares property if the sum of the squares of the residuals is the smallest sum possible.
Residual Plot: A scatterplot of the (x, y) values after each of the y-coordinate values has been replaced by the residual value y - ŷ. That is, a residual plot is a graph of the points (x, y - ŷ).
Slide 11

Residual Plot Analysis
If a residual plot does not reveal any pattern, the regression equation is a good representation of the association between the two variables.
If a residual plot reveals some systematic pattern, the regression equation is not a good representation of the association between the two variables.
Slide 12

Residual Plots
Good model for the data; points are close to the LSRL, with no distinct pattern.
Slide 13

Residual Plots
Association is NOT linear; a distinct pattern means the linear model is not a good model in this case.
Slide 14

Residual Plots
Shows a pattern of increasing variation; this violates the requirement that, for different values of x, the distributions of y-values have the same variance.
Slide 15

The SAT essay: longer is better? (An observational study)
Following the debut of the new SAT writing test in March 2005, Dr.
Les Perelman from M.I.T. stirred controversy by reporting, “It appeared to me that regardless of what a student wrote, the longer the essay, the higher the score.” He went on to say, “I have never found a quantifiable predictor in 25 years of grading that was anywhere as strong as this one. If you just graded them based on length without ever reading them, you’d be right over 90 percent of the time.” The table below shows the data set that Dr. Perelman used to draw his conclusions.
1) Identify the explanatory/response variables.
2) Draw a scatter plot and the LSRL.
3) Find the vital statistics.
4) Interpret r and r-squared.
5) Find the marginal change (slope interpretation).
6) Use the regression line to estimate the score of a paper that is 390 words long.
7) Graph the residual plot. Do you feel confident that you could come up with a good estimate for a score using the regression line? Explain.

Words  460  422  402  365  357  278  236  201  168  156  133  114  108  100  403
Score    6    6    5    5    6    5    4    4    4    3    2    2    1    1    5

Words  401  388  320  258  236  189  128   67  697  387  355  337  325  272  150
Score    6    6    5    4    4    3    2    1    6    6    5    5    4    4    2

Words  135   73
Score    3    1
Slide 16

P. 553 Do #7, #8, then answer each of the following questions for #16 and #17:
a) Is there a linear correlation? Use your calculator commands to find the p-value, then use the critical values from Table A-6 to support your answer. Is your answer the same for each one?
b) Graph the points (don’t forget axis labels).
c) Find the vital statistics (r, r-squared, a, b, ŷ; don’t forget to define x and y).
d) Tell me what r and r-squared mean in the context of the problem (r: form, direction, strength) (r-squared: how much of the variation in y can be explained by the linear relationship with x).
e) Find the residuals.
f) Draw the residual plot. Is the regression line a good model for the data? Why?
Slide 17
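Questions 3 to 7 of the SAT essay exercise (vital statistics, slope, the 390-word estimate, and residuals) can be worked numerically. The following is a minimal sketch in plain Python, assuming the 32 (words, score) pairs are read off the table in the order listed; all variable names are illustrative, and the residuals are printed as text, not plotted.

```python
from math import sqrt

# Essay length in words (x) and essay score (y), read from the SAT table.
words = [460, 422, 402, 365, 357, 278, 236, 201, 168, 156, 133, 114, 108,
         100, 403, 401, 388, 320, 258, 236, 189, 128, 67, 697, 387, 355,
         337, 325, 272, 150, 135, 73]
score = [6, 6, 5, 5, 6, 5, 4, 4, 4, 3, 2, 2, 1,
         1, 5, 6, 6, 5, 4, 4, 3, 2, 1, 6, 6, 5,
         5, 4, 4, 2, 3, 1]

n = len(words)
x_bar = sum(words) / n
y_bar = sum(score) / n

# Sums of squared deviations and cross-deviations.
sxx = sum((x - x_bar) ** 2 for x in words)
syy = sum((y - y_bar) ** 2 for y in score)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(words, score))

b1 = sxy / sxx               # slope: predicted score change per extra word
b0 = y_bar - b1 * x_bar      # intercept
r = sxy / sqrt(sxx * syy)    # linear correlation coefficient
r_squared = r ** 2           # share of variation in score explained by the line

predicted_390 = b0 + b1 * 390

# Residual = observed y minus predicted y, one per essay.
residuals = [y - (b0 + b1 * x) for x, y in zip(words, score)]

print(f"y-hat = {b0:.3f} + {b1:.5f}x, r = {r:.3f}, r^2 = {r_squared:.3f}")
print(f"predicted score for 390 words: {predicted_390:.2f}")

# Text version of the residual plot: the points (x, y - y-hat), sorted by x.
for x, e in sorted(zip(words, residuals)):
    print(f"{x:4d}  {e:+.2f}")
```

On these data r comes out at roughly 0.89 and the predicted score for a 390-word essay at a little above 5. The 697-word essay lies far from the other x-values, so it is worth checking as a possible influential point before trusting the line, and any curvature in the printed residuals would suggest that a straight line is not the best model.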