MATH 2560 C F03 Elementary Statistics I
LECTURE 9: Least-Squares Regression Line and Equation

1 Outline

⇒ least-squares regression line (LSRL);
⇒ equation of the LSRL;
⇒ interpreting the LSRL;
⇒ correlation and regression.

2 Least-Squares Regression Line

⇒ Our first aim: we need a way to draw a regression line that does not depend on our guess as to where the line should go. We want the one line that is as close as possible to the data points.
⇒ Our second aim: we want a regression line that makes the prediction errors as small as possible:

error = observed value − predicted value → minimize!

Figure 2.13 illustrates the idea.

⇒ The most common way to make "as small as possible" precise is the LEAST-SQUARES idea.

Least-Squares Regression Line
The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.

Below, the least-squares idea is expressed as a mathematical problem.

Least-Squares Idea as a Mathematical Problem
1. There are n observations on two variables x and y: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n);
2. The line y = a + bx through the scatterplot of these observations predicts the value of y corresponding to x_i as ŷ_i = a + b x_i;
3. The predicted response ŷ_i will usually not be exactly the same as the actually observed response y_i;
4. The prediction error for the point x_i is: error = observed y_i − predicted ŷ_i;
5. The method of least squares chooses the line that makes the sum of the squares of these errors as small as possible;
6. Mathematical problem: find the values of the intercept a and the slope b that minimize the following expression:

$$\sum (\text{error})^2 = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - a - b x_i)^2 \to \min.$$

Equation for the LSRL

Equation of the Least-Squares Regression Line
1. Suppose we have data on an explanatory variable x and a response variable y for n individuals;
2. The means and standard deviations of the sample data are x̄ and s_x for x and ȳ and s_y for y, and the correlation between x and y is r;
3. The equation of the least-squares regression line of y on x is ŷ = a + bx, with slope

$$b = r \frac{s_y}{s_x}$$

and intercept

$$a = \bar{y} - b\bar{x}.$$

Example 2.13. Mean height of Kalama children (Table 2.7). We calculate the means and standard deviations of x and y, the correlation r, the slope b and the intercept a, and give the equation of the least-squares line in this case:
1. Mean and standard deviation for x (age): x̄ = 23.5 months, s_x = 3.606 months;
2. Mean and standard deviation for y (height): ȳ = 79.85 cm, s_y = 2.302 cm;
3. Correlation: r = 0.9944;
4. Slope: b = r s_y/s_x = 0.9944 × 2.302/3.606 = 0.6348 cm per month;
5. Intercept: a = ȳ − b x̄ = 79.85 − (0.6348)(23.5) = 64.932 cm;
6. The equation of the least-squares line is: ŷ = 64.932 + 0.6348x.

3 Interpreting the regression line

Interpreting the Least-Squares Regression Line
1. Slope b = r s_y/s_x: along the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y. (Measured in standard deviation units, the change in the predicted response ŷ has the same size as the change in x when r = 1 or r = −1. Otherwise, when −1 < r < 1, the change in ŷ is smaller than the change in x.)
2. The least-squares regression line always passes through the point (x̄, ȳ).

Figure 2.14 displays the basic regression output for the Kalama data from a graphing calculator and two statistical software packages.

4 Correlation and Regression

⇒ Least-squares regression looks at the distances of the data points from the line only in the y direction.

Example 2.14. Expanding the Universe (Figure 2.15). Figure 2.15 is a scatterplot of data that played a central role in the discovery that the universe is expanding. Here r = 0.7842; hence the relationship between the distances from Earth of 24 spiral galaxies and the speeds at which they are moving away from us is positive and linear.
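The arithmetic of Example 2.13 can be checked with a few lines of Python. This is only a sketch: the function name `lsrl_from_summary` is our own, and the inputs are the rounded summary statistics quoted in the example, so the last digits depend on that rounding.

```python
# Rounded summary statistics quoted in Example 2.13
# (x = age in months, y = height in cm).
x_bar, s_x = 23.5, 3.606
y_bar, s_y = 79.85, 2.302
r = 0.9944


def lsrl_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Return (intercept a, slope b) of the least-squares line of y on x.

    Uses the formulas b = r * s_y / s_x and a = y_bar - b * x_bar.
    (Hypothetical helper, named here for illustration only.)
    """
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b


a, b = lsrl_from_summary(x_bar, s_x, y_bar, s_y, r)
print(f"slope b     = {b:.4f} cm per month")
print(f"intercept a = {a:.3f} cm")

# The fitted line passes through the point (x_bar, y_bar):
print("passes through (x_bar, y_bar):", abs((a + b * x_bar) - y_bar) < 1e-9)
```

To the precision used in the example, the computed slope and intercept agree with 0.6348 and 64.932. The last check holds by construction, since a is defined as ȳ − b x̄.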
Important Remark: Although there is only one correlation between velocity and distance, the regression of velocity on distance and the regression of distance on velocity give different lines.

⇒ There is a close connection between correlation and regression.

Connection between Correlation and Regression
⇒ the slope of the least-squares line involves r;
⇒ the square of the correlation, r², is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.

Relationship between r and r²
⇒ When you report a regression, give r² as a measure of how successfully the regression explains the response. All the software outputs in Figure 2.14 include r².
⇒ The use of r² to describe the success of regression in explaining the response y is very common: it rests on the fact that there are two sources of variation in the responses y in a regression setting.

Example: Kalama children. One reason the Kalama heights vary is that height changes with age. A second reason is that the heights do not lie exactly on the line, but are scattered above and below it.

⇒ We use r² to measure the variation along the line as a fraction of the total variation in the response variable.

For a pictorial grasp of what r² tells us, look at Figure 2.16. Both scatterplots resemble the Kalama data, but with many more observations. The least-squares regression line is the same as the one we computed from the Kalama data. In Figure 2.16(a), r = 0.994 and r² = 0.989. In Figure 2.16(b), r = 0.921 and r² = 0.849. There is more scatter about the fitted line, and here r² is less than in Figure 2.16(a).

5 More Specific Interpretation of r²

⇒ The squared correlation gives us the variance of the predicted responses as a fraction of the variance of the actual responses:

$$r^2 = \frac{\text{variance of predicted values } \hat{y}}{\text{variance of observed values } y}.$$

This fact is always true.

Final Important Remark: The connections with correlation are special properties of least-squares regression. They are not true for other methods of fitting a line to data.

6 Summary

1. A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.
2. The most common method of fitting a line to a scatterplot is least squares. The least-squares regression line is the straight line ŷ = a + bx that minimizes the sum of the squares of the vertical distances of the observed y-values from the line.
3. A regression line is used to predict the value of y for any value of x by substituting this x into the equation of the line. Extrapolation beyond the range of x values spanned by the data is risky.
4. The slope b of a regression line ŷ = a + bx is the rate at which the predicted response ŷ changes along the line as the explanatory variable x changes. Specifically, b is the change in ŷ when x increases by 1.
5. The intercept a of a regression line ŷ = a + bx is the predicted response ŷ when the explanatory variable x = 0. This prediction is of no statistical use unless x can actually take values near 0. The least-squares regression line of y on x is the line with slope b = r s_y/s_x and intercept a = ȳ − b x̄. This line always passes through the point (x̄, ȳ).
6. Remarks. Correlation and regression are closely connected. The correlation r is the slope of the least-squares regression line when we measure both x and y in standardized units. The square of the correlation, r², is the fraction of the variance of one variable that is explained by least-squares regression on the other variable.
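Several of the facts summarized above can be checked numerically. The sketch below uses a small made-up data set (it is not the Kalama or galaxy data), and the helper names `correlation`, `lsrl` and `sse` are our own. It checks that the fitted line has a smaller sum of squared errors than nearby lines, that r² equals the ratio of variances, that the slope in standardized units is r, and that regressing x on y gives a different line.

```python
from statistics import mean, stdev, variance

# Hypothetical data, invented for illustration (not from the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.1, 3.2, 6.0, 4.9, 7.8]


def correlation(u, v):
    """Sample correlation r: sum of products of standardized values / (n - 1)."""
    mu, mv, su, sv = mean(u), mean(v), stdev(u), stdev(v)
    return sum((ui - mu) / su * (vi - mv) / sv for ui, vi in zip(u, v)) / (len(u) - 1)


def lsrl(u, v):
    """Least-squares line of v on u: returns (intercept a, slope b)."""
    b = correlation(u, v) * stdev(v) / stdev(u)
    a = mean(v) - b * mean(u)
    return a, b


def sse(a, b, u, v):
    """Sum of squared vertical distances of the points from the line a + b*u."""
    return sum((vi - a - b * ui) ** 2 for ui, vi in zip(u, v))


r = correlation(x, y)
a, b = lsrl(x, y)                      # regression of y on x
y_hat = [a + b * xi for xi in x]       # predicted responses

# 1. Least squares: nudging the intercept or slope increases the error sum.
best = sse(a, b, x, y)
print("SSE at fitted line:", best)
print("SSE with shifted intercept:", sse(a + 0.1, b, x, y))
print("SSE with changed slope:", sse(a, b + 0.05, x, y))

# 2. r^2 = variance of predicted values / variance of observed values.
r2_from_variances = variance(y_hat) / variance(y)
print("r^2:", r ** 2, " variance ratio:", r2_from_variances)

# 3. In standardized units the slope of the least-squares line is r.
zx = [(xi - mean(x)) / stdev(x) for xi in x]
zy = [(yi - mean(y)) / stdev(y) for yi in y]
_, b_std = lsrl(zx, zy)
print("standardized slope:", b_std, " r:", r)

# 4. Regression of x on y: x_hat = c + d*y. Drawn in the same (x, y) plane,
#    this line has slope 1/d, which differs from b unless r^2 = 1.
c, d = lsrl(y, x)
print("slope of y on x:", b, " slope of x on y in the (x, y) plane:", 1 / d)
```

The two slopes in the last line are related by b · d = r², which is one way to see why the two regression lines coincide only when the points lie exactly on a line.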