Fat Versus Protein: An Example
The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu.
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley

The Linear Model
Correlation says "There seems to be a linear association between these two variables," but it doesn't tell us what that association is. We can say more about the linear relationship between two quantitative variables with a model. A model simplifies reality to help us understand underlying patterns and relationships.

The Linear Model (cont.)
The linear model is just an equation of a straight line through the data. The points in the scatterplot don't all line up, but a straight line can summarize the general pattern. The linear model can help us understand how the values are associated.

Residuals
The model won't be perfect, regardless of the line we draw. Some points will be above the line and some will be below. The estimate made from a model is the predicted value (denoted ŷ).

Residuals (cont.)
The difference between the observed value and its associated predicted value is called the residual. To find a residual, we always subtract the predicted value from the observed one:

    residual = observed − predicted = y − ŷ

Residuals (cont.)
A negative residual means the predicted value is too big (an overestimate). A positive residual means the predicted value is too small (an underestimate).
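The residual arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not part of the slides: the line ŷ = 0.5x + 4 and the three points are the small example data set used in the worked LSRL exercise in these notes.

```python
# Minimal sketch: residual = observed - predicted = y - y-hat,
# using the assumed line y-hat = 0.5x + 4 and three example points.

def predict(x, a=4.0, b=0.5):
    """Predicted value y-hat = a + b*x for the assumed line."""
    return a + b * x

points = [(0, 0), (3, 10), (6, 2)]  # (x, observed y)

residuals = [y - predict(x) for x, y in points]
print(residuals)  # negative residuals are overestimates, positive are underestimates
```

Note the signs: the point (3, 10) lies above the line (positive residual, an underestimate), while the other two lie below it (negative residuals, overestimates).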
"Best Fit" Means Least Squares
Some residuals are positive, others are negative, and, on average, they cancel each other out. So we can't assess how well the line fits by adding up all the residuals. Similar to what we did with deviations, we square the residuals and add the squares. The smaller the sum, the better the fit. The line of best fit is the line for which the sum of the squared residuals is smallest.

Least Squares Regression Line (LSRL)
The LSRL is the line that gives the best fit to the data set: the line that minimizes the sum of the squares of the deviations from the line.

Example: for the points (0, 0), (3, 10), and (6, 2), consider the line ŷ = 0.5x + 4. The residuals y − ŷ are:
    at (0, 0):  ŷ = 0.5(0) + 4 = 4,   residual 0 − 4 = −4
    at (3, 10): ŷ = 0.5(3) + 4 = 5.5, residual 10 − 5.5 = 4.5
    at (6, 2):  ŷ = 0.5(6) + 4 = 7,   residual 2 − 7 = −5
Sum of the squares = 61.25. What is the sum of the deviations from the line? Will it always be zero?

Use a calculator to find the line of best fit: ŷ = (1/3)x + 3. Find y − ŷ for each point: the residuals are now −3, 6, and −3, and the sum of the squares = 54. The line that minimizes the sum of the squares of the deviations from the line is the LSRL.

You may see the equation written in the form ŷ = a + bx. ŷ (y-hat) means the predicted y; be sure to put the hat on the y.

The Least Squares Line
The slope and intercept of the LSRL can be computed from summary statistics: b = r(s_y / s_x) and a = ȳ − b·x̄.

Example: the following statistics are found for the variables posted speed limit and the average number of accidents:
    x̄ = 40, s_x = 11.6, ȳ = 18, s_y = 8.4, r = 0.9981
Find the LSRL and predict the number of accidents for a posted speed limit of 50 mph.
    ŷ = 0.723x − 10.92, so ŷ = 0.723(50) − 10.92 = 25.23 accidents
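The summary-statistics formulas above can be sketched directly. This is a minimal illustration using the speed-limit numbers from the example; the function name `lsrl_from_summary` is just a label for this sketch.

```python
# Minimal sketch of the LSRL from summary statistics:
# slope b = r * (s_y / s_x), intercept a = ybar - b * xbar.

def lsrl_from_summary(xbar, sx, ybar, sy, r):
    b = r * (sy / sx)       # slope
    a = ybar - b * xbar     # intercept
    return a, b

a, b = lsrl_from_summary(xbar=40, sx=11.6, ybar=18, sy=8.4, r=0.9981)
print(round(b, 3), round(a, 2))   # slope ~0.723, intercept ~ -10.91

y50 = a + b * 50
print(round(y50, 2))              # ~25.23 predicted accidents at 50 mph
```

Carrying full precision gives an intercept of about −10.91; the slides' −10.92 comes from rounding the slope to 0.723 before computing the intercept. Either way the prediction at 50 mph rounds to 25.23.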
Fat Versus Protein: An Example
The regression line for the Burger King data fits the data well. The equation is ŷ = 6.8 + 0.97(protein). Using the equation for our line, we can predict the fat content for a BK Broiler chicken sandwich with 30 g of protein as 6.8 + 0.97(30) = 35.9 grams of fat. Note that the actual fat content is about 25 g.

Interpretations
Slope is a rate of change: it tells how a one-unit increase in the explanatory variable changes the predicted response. The y-intercept is the predicted response when the explanatory variable is zero. Sometimes it doesn't have a meaningful interpretation; that doesn't mean we shouldn't know how to interpret it, just that sometimes it doesn't make sense in context, or falls outside the window of reasonable predictions.

Example: the ages (in months) and heights (in inches) of seven children are given.
    Age (x):    16  24  42  60  75  102  120
    Height (y): 24  30  35  40  48  56   60
Find the LSRL. Interpret the slope and y-intercept in the context of the problem.
    Slope: for an increase in age of one month, there is an approximate increase of 0.34 inches in the heights of children.
    Y-intercept: the predicted height of a newborn is about 20 inches.

Using the same data, predict the height of a child who is 4.5 years old. Predict the height of someone who is 20 years old.

Extrapolation: Reaching Beyond the Data
Linear models give a predicted value for each case in the data.
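The age/height exercise can be worked end to end in a short sketch. This fits the LSRL to the seven data points from the example above, then makes both requested predictions; the 20-year-old case shows numerically why extrapolation is dangerous.

```python
# Minimal sketch: fit the LSRL to the age/height data, then predict at
# 54 months (4.5 years, inside the data range) and 240 months (20 years,
# far outside it).

ages    = [16, 24, 42, 60, 75, 102, 120]   # months
heights = [24, 30, 35, 40, 48, 56, 60]     # inches

n = len(ages)
xbar = sum(ages) / n
ybar = sum(heights) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(ages, heights))
sxx = sum((x - xbar) ** 2 for x in ages)

b = sxy / sxx           # slope: ~0.34 inches per month of age
a = ybar - b * xbar     # intercept: ~20.4 inches predicted for a newborn

print(round(b, 2), round(a, 1))
print(round(a + b * 54, 1))    # 4.5 years: a plausible height
print(round(a + b * 240, 1))   # 20 years: over 100 inches -- an absurd extrapolation
```

The 4.5-year prediction (about 39 inches) is reasonable because 54 months lies inside the observed ages; the 20-year prediction exceeds 8 feet, since children do not keep growing linearly.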
We cannot assume that a linear relationship in the data exists beyond the range of the data. Once we venture into new x territory, such a prediction is called an extrapolation.

Extrapolation
The LSRL should not be used to predict y for values of x outside the data set. It is unknown whether the pattern observed in the scatterplot continues outside this range.

Residuals Revisited
A residual is the vertical deviation between an observation and the LSRL: residual = error = observed − expected = y − ŷ. The sum of the residuals is always zero.

Residual Plot
A residual plot is a scatterplot of the (x, residual) pairs. (Residuals can also be graphed against other statistics besides x.) Its purpose is to tell whether a linear association exists between the x and y variables: if no pattern exists among the points in the residual plot, then the association is linear.

[Figure: two residual plots against x — one patternless (linear association) and one with a curved pattern (not linear).]

Example: one measure of the success of knee surgery is post-surgical range of motion for the knee joint following a knee dislocation. The ages of twelve patients and their ranges of motion are given.
    Age:             35   24   40   31   28   25   26   16   14   20   21   30
    Range of motion: 154  142  137  133  122  126  135  135  108  120  127  122
Is there a linear relationship between age and range of motion? Sketch a residual plot. Since there is no pattern in the residual plot, there is a linear relationship between age and range of motion.

Using the same data, plot the residuals against the ŷ values. How does this residual plot compare to the previous one?
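The residuals for the knee-surgery data can be computed directly, which also verifies the claim that residuals from the LSRL always sum to zero (up to floating-point rounding).

```python
# Minimal sketch: fit the LSRL to the knee-surgery data, compute the
# residuals y - y-hat, and check that they sum to (essentially) zero.

age = [35, 24, 40, 31, 28, 25, 26, 16, 14, 20, 21, 30]
rom = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 122]

n = len(age)
xbar, ybar = sum(age) / n, sum(rom) / n
b = (sum((x - xbar) * (y - ybar) for x, y in zip(age, rom))
     / sum((x - xbar) ** 2 for x in age))
a = ybar - b * xbar

residuals = [y - (a + b * x) for x, y in zip(age, rom)]
print(round(b, 4), round(a, 2))    # ~0.8710 and ~107.58
print(round(sum(residuals), 10))   # sum of residuals is numerically zero
```

The slope and intercept here match the Coef column of the computer regression output for this data set, and plotting `residuals` against either `age` or the fitted values would give the two residual plots the exercise asks for.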
[Figure: residual plots for the knee-surgery data, one against age (x) and one against ŷ.]

Residual plots are the same whether plotted against x or against ŷ.

Assumptions and Conditions
Quantitative Variables Condition: regression can only be done on two quantitative variables, so make sure to check this condition.
Straight Enough Condition: the linear model assumes that the relationship between the variables is linear. A scatterplot will let you check that the assumption is reasonable.

Assumptions and Conditions (cont.)
It's a good idea to check linearity again after computing the regression, when we can examine the residuals. You should also check for outliers, which could change the regression. If the data seem to clump or cluster in the scatterplot, that could be a sign of trouble worth looking into further.

Interpreting Computer Regression Output
A number of statistical software packages produce similar least-squares regression output. Be sure you can locate the slope b, the y-intercept a, and the values of s and r². NEVER use adjusted r²! Be sure to convert r² to a decimal before taking the square root.

Example: computer-generated regression analysis of the knee surgery data:
    Predictor    Coef      Stdev     T       P
    Constant     107.58    11.12     9.67    0.000
    Age          0.8710    0.4146    2.10    0.062
    s = 10.42    R-sq = 30.6%    R-sq(adj) = 23.7%
What is the equation of the LSRL? Find the slope and y-intercept. What are the correlation coefficient and the coefficient of determination?
    ŷ = 107.58 + 0.8710x
    r² = 0.306, and r = √0.306 = 0.5532

Outliers, Leverage, and Influence
Outlying points can strongly influence a regression.
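Recovering r from the printed output can be sketched in a couple of lines. This illustration uses the values from the output above; the key steps are converting R-sq from a percentage to a decimal before taking the square root, and giving r the same sign as the slope.

```python
import math

# Minimal sketch: read slope, intercept, and R-sq off regression output,
# then recover the correlation coefficient r.

coef_constant = 107.58   # y-intercept a, from the Coef column
coef_age = 0.8710        # slope b, from the Coef column
r_sq = 30.6 / 100        # R-sq = 30.6%, converted to a decimal first

# r is the square root of r-squared, carrying the slope's sign
r = math.copysign(math.sqrt(r_sq), coef_age)
print(round(r, 4))   # ~0.5532
```

Had the slope been negative, `copysign` would have made r negative as well; r² alone cannot tell you the direction of the association.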
Even a single point far from the body of the data can dominate the analysis. Any point that stands away from the others can be called an outlier and deserves your special attention.

Outlier: in a regression setting, an outlier is a data point with a large residual.

Influential point: a point that influences where the LSRL is located. If removed, it will significantly change the slope of the LSRL.

Outliers, Leverage, and Influence (cont.)
The following scatterplot shows that something was awry in Palm Beach County, Florida, during the 2000 presidential election…

Outliers, Leverage, and Influence (cont.)
The red line shows the effects that one unusual point can have on a regression.

Outliers, Leverage, and Influence (cont.)
The linear model doesn't fit points with large residuals very well. Because they seem to be different from the other cases, it is important to pay special attention to points with large residuals.

Lurking Variables and Causation
No matter how strong the association, no matter how straight the line, there is no way to conclude from a regression alone that one variable causes the other. There's always the possibility that some third variable is driving both of the variables you have observed. With observational data, as opposed to data from a designed experiment, there is no way to be sure that a lurking variable is not the cause of any apparent association.

Lurking Variables and Causation (cont.)
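The "remove it and see if the slope changes" test for an influential point can be demonstrated with made-up data. This sketch (the data set is hypothetical, not from the slides) fits the LSRL with and without one extreme point.

```python
# Minimal sketch: an influential point. Fit the LSRL to a clean linear
# data set, then again after adding one far-away, low point, and
# compare the slopes.

def lsrl(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    return ybar - b * xbar, b   # (intercept, slope)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]          # perfectly linear: slope exactly 2

_, slope_clean = lsrl(xs, ys)
_, slope_with_outlier = lsrl(xs + [20], ys + [0])   # add one extreme point

print(slope_clean, round(slope_with_outlier, 2))    # slope even changes sign
```

One point at (20, 0), far from the others in x and far below the pattern, drags the slope from 2 to a negative value: removing it would significantly change the LSRL, which is exactly the definition of an influential point.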
The following scatterplot shows that the average life expectancy for a country is related to the number of doctors per person in that country.

Lurking Variables and Causation (cont.)
This new scatterplot shows that the average life expectancy for a country is related to the number of televisions per person in that country.

Lurking Variables and Causation (cont.)
Since televisions are cheaper than doctors, send TVs to countries with low life expectancies in order to extend lifetimes. Right? How about considering a lurking variable? That makes more sense: countries with higher standards of living have both longer life expectancies and more doctors (and TVs!). If higher living standards cause changes in these other variables, improving living standards might be expected to prolong lives and increase the numbers of doctors and TVs.