Chapter 3: Regression and Correlation

Simple linear regression (SLR): a method for finding the "best-fitting line" through a set of n points (x, y). SLR minimizes the sum of squared vertical distances from the points to the fitted line, which is why the procedure is also called least squares regression.

Correlation coefficient (r): a number that measures the strength and direction of the linear association between X and Y (both quantitative variables).
- −1 ≤ r ≤ +1, always.
- r = +1 when there is a perfectly linear increasing relationship between X and Y.
- r = −1 when there is a perfectly linear decreasing relationship between X and Y.
- r has no units; correlation is a unit-less quantity.

R² = (r)² is called the coefficient of determination. R² measures the percent of variability in the response (Y) explained by the changes in X [or by the regression on X].
- What does R² = 0.81 (= 81%) mean?
- How do you find r when you are given R²? For example, what is r if R² = 0.81 = 81%?

Example: Suppose your friend claims that she can guess a person's age correctly (well, almost). To see if this claim is justifiable, you select a random sample of 10 people, ask your friend to guess their ages, and then ask each person his or her true age. The following are observed:

ID           1   2   3   4   5   6   7   8   9  10
Guessed age 18  52  65  90  28  58  13  66  44  35
True age    20  45  70  85  25  50  15  60  40  35

The very first step in regression analysis is to identify the independent (explanatory) and the dependent (response) variables. Since the true age determines your friend's guesses (and your friend's guess has no effect on a person's true age), we have
X = independent variable = true age,
Y = dependent (response) variable = guessed age.

The next step is to draw a scatter diagram of the data and interpret what you see (to get some idea about the relation between the two variables).

[Scatterplot of Guessed Age vs. True Age]

1. What do you see?

2. Verify the following summary statistics using your calculator:
x̄ = 44.5, sX = 22.42, ȳ = 46.9, sY = 24.02.

3. Compute the slope and intercept of the least squares regression line, given that r = 0.9844:
Slope = b = r·(sY/sX) = 0.9844 × 24.02/22.42 = 1.054651561
Intercept = a = ȳ − b·x̄ = 46.9 − 1.054651561 × 44.5 = −0.0319944692

Hence the prediction equation is ŷ = −0.03 + 1.05x. Are these results consistent with what you observed in the scatter plot?

4. Interpret the numerical results:
Correlation: r = 0.9844, so there is a strong, increasing, linear relationship between the true and guessed ages.
Slope = 1.05: for every one-year increase in the true age, the guessed age increases on average by 1.05 years.
Intercept = −0.03: DO NOT INTERPRET the intercept in this case, because (a) zero is not within the range of observed values of the independent variable (X), and (b) zero and −0.03 are not meaningful in this context.

5. Compute R² (the coefficient of determination) and interpret it:
R² = (r)² = (0.9844)² = 0.969 = 96.9%
Interpretation: 96.9% of the variation in guessed ages (Y) is explained by the true age (X); equivalently, 96.9% of the variability in guessed ages is explained by the linear regression on true ages.
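As a check on the hand calculations above, here is a minimal Python sketch that recomputes r, the slope, the intercept, and R² from the age data in the example. (Using numpy is my assumption about available tooling, not part of the course materials.)

```python
import numpy as np

# Data from the age-guessing example (X = true age, Y = guessed age)
x = np.array([20, 45, 70, 85, 25, 50, 15, 60, 40, 35], dtype=float)
y = np.array([18, 52, 65, 90, 28, 58, 13, 66, 44, 35], dtype=float)

# Summary statistics (sample standard deviations use ddof=1)
x_bar, y_bar = x.mean(), y.mean()
s_x, s_y = x.std(ddof=1), y.std(ddof=1)

# Correlation and least squares estimates: b = r * sY/sX, a = y_bar - b * x_bar
r = np.corrcoef(x, y)[0, 1]
b = r * s_y / s_x
a = y_bar - b * x_bar

print(f"x_bar={x_bar}, s_x={s_x:.2f}, y_bar={y_bar}, s_y={s_y:.2f}")
print(f"r={r:.4f}, slope b={b:.4f}, intercept a={a:.4f}, R^2={r**2:.3f}")
# Expected (from the notes): r ~ 0.9844, b ~ 1.0547, a ~ -0.032, R^2 ~ 0.969
```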
6. Plot the estimated regression line on the scatter diagram.

For this we choose two values of X (as far apart as is meaningful) and predict Y at those two values using the prediction equation ŷ = −0.03 + 1.05x:
For x = 15: ŷ = −0.03 + 1.05 × 15 = 15.72.
For x = 90: ŷ = −0.03 + 1.05 × 90 = 94.47.
This gives two points, (15, 15.72) and (90, 94.47). Mark them on the scatter diagram and join them with a ruler.

[Scatterplot of Guessed Age vs. True Age with the fitted line drawn in]

Chapter 11: Inferences for SLR

In Chapter 3, a linear relation between two quantitative variables (denoted by X and Y) was described by ŷ = a + bX. In this equation (called the prediction equation):
- Y is the response (dependent) variable,
- X is the explanatory variable,
- ŷ is the predicted value of Y,
- a = α̂ = estimate of the intercept (α), and
- b = β̂ = estimate of the slope (β).

Hence ŷ = a + bX is an estimate of the simple linear regression model. Also, ŷ estimates the true (but unknown) regression line µY = α + βX, so we also write ŷ = µ̂Y.

Regression model: a mathematical (or theoretical) equation that shows the linear relation between the explanatory variable and the response. The simple linear regression model we will use is
Y = α + βX + ε,
where ε is called the error term.

Let's look at the error terms graphically:
Total error = y − ȳ = (y − ŷ) + (ŷ − ȳ) = random error + regression error.
In this decomposition, the total error is divided into two parts: the random error, y − ŷ, and the regression error, ŷ − ȳ, i.e., the error due to using the regression model instead of the sample mean, ȳ = 46.9.

[Scatterplot of Guessed Age vs. True Age illustrating the error decomposition]

The unbiased estimators of the parameters (α and β) of the regression line (those given in Chapter 3) are found by the method of least squares estimation (LSE):

$$\hat{\beta} = b = r\,\frac{S_Y}{S_X} \qquad \text{and} \qquad \hat{\alpha} = a = \bar{Y} - b\bar{X}.$$

Hence β, the (true and unknown) slope of the regression line, is estimated by b = r·(SY/SX), and α, the (true and unknown) y-intercept (or simply intercept) of the regression line, is estimated by a = Ȳ − bX̄.

What do the slope and intercept of the regression line tell us?
- Slope is the average amount of change in Y for one unit of increase in X. Note: slope ≠ rise/run. Why?
- Intercept is the value of Y when X = 0.

Important note: we DO NOT use the above interpretation when (a) X = 0 is not meaningful, or (b) zero is not within the range of, or near, the observed values of X.

Assumptions of simple linear regression:
1. A random sample of n pairs of observations, (X1, Y1), (X2, Y2), …, (Xn, Yn).
2. The population of Y's has a normal distribution with mean µY = α + βX, which changes with the value of X, and standard deviation σ, which is the same at every value of the independent variable X. The relation between X and Y may also be formulated as Y = α + βX + ε.
3. As a result of the above assumptions, the error terms ε are iid (independently and identically distributed) random variables that have a normal distribution with mean zero and standard deviation σ, i.e., ε ~ N(0, σ).
4. These mean that both Y and ε are random variables [we may choose any value for X, hence it is treated as a non-random variable (even when it is random)].
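To make the model and its assumptions concrete, here is a small simulation sketch. The parameter values are my own illustrative choices, not from the notes; the point is that data generated from Y = α + βX + ε, with ε ~ N(0, σ), are recovered well by the least squares estimators defined above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true parameters (illustrative only; not from the notes)
alpha, beta, sigma, n = 2.0, 1.5, 3.0, 200

x = rng.uniform(10, 90, size=n)          # explanatory variable
eps = rng.normal(0.0, sigma, size=n)     # error term, eps ~ N(0, sigma)
y = alpha + beta * x + eps               # SLR model: Y = alpha + beta*X + eps

# Least squares estimates, exactly as in the notes
r = np.corrcoef(x, y)[0, 1]
b = r * y.std(ddof=1) / x.std(ddof=1)    # b = r * S_Y / S_X
a = y.mean() - b * x.mean()              # a = Y_bar - b * X_bar

print(f"b = {b:.3f} (true beta = {beta}), a = {a:.3f} (true alpha = {alpha})")
```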
Are these assumptions satisfied in the age-guessing example (Example 1)?

Assumption 1 (random sample) is satisfied. To check assumptions 2 and 3, we look at the residuals, where
Residual = observed value of Y − predicted value of Y = y − ŷ.
If the residuals do not contain any extreme values, we say assumptions 2 and 3 are justifiable, since we have no reason to suspect otherwise (more later). So let's calculate the residuals using the prediction equation ŷ = −0.03 + 1.05x found in Chapter 3, and then plot them.

True age (x)   Observed y   Predicted ŷ   Residual y − ŷ
     20            18          21.06          −3.06
     45            52          47.43           4.57
     70            65          73.79          −8.79
     85            90          89.61           0.39
     25            28          26.33           1.67
     50            58          52.70           5.30
     15            13          15.79          −2.79
     60            66          63.25           2.75
     40            44          42.15           1.85
     35            35          36.88          −1.88

[Histogram of the residuals (response is Guessed Age)]

Do you think the assumption of normality is satisfied? Why or why not?

Inferences about the parameters of SLR

The parameters of the regression model are α, β, and σ. These parameters are estimated by a, b, and S, respectively. Chapter 11 deals with inferences about the true simple linear regression (SLR) model, i.e., a regression model with one explanatory variable (X). [In Chapter 12 we will see how to make inferences about the parameters of a multiple regression model, i.e., a regression model with several (k ≥ 2) explanatory variables, X1, X2, …, Xk.]

When making inferences about the parameters of the regression model, we will determine
- whether X is a "good predictor" of Y,
- whether the regression line is useful for making predictions about Y,
- whether the slope is different from zero.

In this chapter we will also see how to find
- a prediction interval for an individual response Y at X = x, and
- a confidence interval for the mean of Y (the mean response, µY) at X = x.
We carry out these analyses using ANOVA.

ANOVA for SLR

Is X a good predictor of Y? This is equivalent to asking: is the slope of the line significantly different from zero? [If not, we might as well use ȳ as a predictor.] We can answer these questions using an ANOVA table:

ANOVA for SLR
Source               df       SS      MS                   F
Regression (Model)   1        SSReg   MSReg = SSReg/1      F = MSReg/MSE
Residuals (Error)    n − 2    SSE     MSE = SSE/(n − 2)
Total                n − 1    SST

Total SS = Model SS + Error SS, i.e., SST = SSReg + SSE:

$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2,$$

and the degrees of freedom split the same way: (n − 1) = 1 + (n − 2).
- The df for regression is 1 because there is only one independent variable.
- The df for residuals is n − 2 because we estimate two parameters (α and β).

Assumptions for ANOVA:
- random sample,
- normal distribution (of ε and hence of Y),
- constant variance (of ε and Y).

The hypothesis of interest is H0: β = 0 vs. Ha: β ≠ 0.
Test statistic: F = MSReg/MSE.
To find the p-value, first find the tabulated F-value from the F-tables with df1 = 1 and df2 = n − 2; then compare that value with the F in the ANOVA table.

The following is the output obtained from Minitab:

Regression Analysis: Guessed Age vs. True Age

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       1  5030.0  5030.0  250.08  0.000
Residual Error   8   160.9    20.1
Total            9  5190.9
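To see where the numbers in this ANOVA table come from, here is a minimal Python sketch (numpy and scipy assumed available) that rebuilds SSReg, SSE, SST, the F statistic, and its p-value from the age data:

```python
import numpy as np
from scipy import stats

x = np.array([20, 45, 70, 85, 25, 50, 15, 60, 40, 35], dtype=float)
y = np.array([18, 52, 65, 90, 28, 58, 13, 66, 44, 35], dtype=float)
n = len(x)

# Fitted line (same least squares estimates as before)
r = np.corrcoef(x, y)[0, 1]
b = r * y.std(ddof=1) / x.std(ddof=1)
a = y.mean() - b * x.mean()
y_hat = a + b * x

# ANOVA decomposition: SST = SSReg + SSE
sst = np.sum((y - y.mean()) ** 2)         # total SS,  df = n - 1
ss_reg = np.sum((y_hat - y.mean()) ** 2)  # model SS,  df = 1
sse = np.sum((y - y_hat) ** 2)            # error SS,  df = n - 2

ms_reg = ss_reg / 1
mse = sse / (n - 2)
f_stat = ms_reg / mse
p_value = stats.f.sf(f_stat, 1, n - 2)    # upper-tail F probability

print(f"SSReg={ss_reg:.1f}, SSE={sse:.1f}, SST={sst:.1f}")
print(f"F={f_stat:.2f}, p={p_value:.6f}")
# Expected (Minitab): SSReg = 5030.0, SSE = 160.9, SST = 5190.9, F = 250.08
```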
To test H0: β = 0 vs. Ha: β ≠ 0, the test statistic is F = 250.08 from the ANOVA table. The F-value is extremely large!!! What does it mean? The p-value = 0.000. What does it mean? Decision? Conclusion?

Decision: Reject H0, since the p-value (< 0.0005) is smaller than any reasonable level of significance.
Conclusion: The observed data indicate that the slope is significantly different from zero.

Using the t-test

We may also use a t-test for testing the above hypotheses, as explained in Chapter 8. For this we use the first block of the Minitab output:

The regression equation is
Guessed Age = −0.03 + 1.05 True Age

Predictor   Coef      SE Coef   T       P
Constant    −0.030    3.289     −0.01   0.993
True Age    1.05462   0.06669   15.81   0.000

S = 4.48483   R-Sq = 96.9%   R-Sq(adj) = 96.5%

In this case the parameter is β, with estimate b = 1.05462 and SE(estimate) = 0.06669.

Significance test: H0: β = 0 vs. Ha: β ≠ 0.

$$T = \frac{\text{Estimator} - \text{value of the parameter in } H_0}{SE(\text{Estimator})}$$

Calculated value of the test statistic:

$$T_{cal} = \frac{1.05462 - 0}{0.06669} = 15.81$$

To find the p-value, go back and look at Ha. We have a two-sided alternative, hence
p-value = 2·P(T ≥ |Tcal|) = 2·P(T ≥ 15.81) ≈ 0.
This p-value gives us the same decision and conclusion as the one we got from the ANOVA table. In general, to find the p-value we would look up Tcal in the t-table on the line with df = n − 2. [This is the df for error in the ANOVA table.]

Compare the Tcal above with the Fcal in the ANOVA table. We have the following general relation between Tcal and Fcal in SLR:
T²cal = Fcal, or equivalently, Tcal = √Fcal.
So the p-value for the t-test is the same as the p-value for the F-test. Hence in SLR, the two significance tests for the slope give the same results, whether we use the F-test (from the ANOVA table) or the t-test.

Observe that the above conclusion does not tell us in what way β differs from zero. We could use the t-test for testing one-sided alternatives about β; however, these should be decided before looking at the data.

Confidence interval for the slope

Remember the general formula for a confidence interval:
CI = estimate ± ME = estimate ± t*·SE(estimate).
This is used in finding a CI for β, where the estimate is b and SE(estimate) is given in the Minitab output. All we need to do is find t* from the t-tables with df = n − 2 = dferror in the ANOVA table.

For the above example we had the following results from Minitab:

Predictor   Coef      SE Coef   T       P
Constant    −0.030    3.289     −0.01   0.993
True Age    1.05462   0.06669   15.81   0.000

That is, b = slope = 1.05462 and SE(slope) = 0.06669. Also, since dferror = 8 in the ANOVA table, we read the t-value for a 95% CI from the t-distribution table on the row with df = 8 as t* = 2.306, which gives
ME = t*·SE(estimate) = 2.306 × 0.06669 = 0.153787.
Hence a 95% CI for β is
CI = 1.05462 ± 0.15379 = (0.90083, 1.20841) ≈ (0.9, 1.2).

As in Chapters 7–9, we can use the CI to make a decision for the significance test: when zero is not in the CI, we reject H0 and conclude that the observed data give strong evidence that the slope of the regression line is different from zero. Actually, we can say more: since the CI for β in this example is (0.9, 1.2), both ends of the CI are positive, so we can conclude with 95% confidence that the slope of the true regression line is some number between 0.9 and 1.2. Alternatively, we interpret the CI as follows: we are 95% confident that, on average, as the true age increases by one year, the guessed age increases by somewhere between 0.9 and 1.2 years.
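The t-test and the slope CI can be reproduced in a few lines. This sketch (scipy assumed) takes b, SE(b), and df directly from the Minitab output above:

```python
from scipy import stats

b, se_b, df = 1.05462, 0.06669, 8   # slope, SE(slope), df = n - 2 (from Minitab)

# Two-sided t-test of H0: beta = 0
t_cal = (b - 0) / se_b
p_value = 2 * stats.t.sf(abs(t_cal), df)

# 95% CI for the slope: b +/- t* SE(b)
t_star = stats.t.ppf(0.975, df)      # ~ 2.306 for df = 8
me = t_star * se_b
ci = (b - me, b + me)

print(f"T = {t_cal:.2f}, p = {p_value:.2e}")
print(f"t* = {t_star:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
# Expected: T = 15.81, 95% CI ~ (0.901, 1.208); note T^2 ~ 250 = F
```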
Confidence interval for the mean response and prediction interval

General formula for CIs: CI = estimator ± t*·SE(estimator).

Additional notation:
µY|x* = α + βx* = mean response for the population of ALL Y's that correspond to X = x* = the point on the true regression line at X = x*.
µ̂Y|x* = ŷ|x* = a + bx* = estimator of the mean response at X = x*.

The SE of the estimator of the mean response is

$$SE(\hat{y}\mid x^*) = S\sqrt{\frac{1}{n} + \frac{(x^* - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}},$$

hence the CI for the mean response is

$$CI(\text{mean response}) = \hat{y} \pm t^*\, S\sqrt{\frac{1}{n} + \frac{(x^* - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}.$$

The same ŷ|x* = a + bx* is also the predicted value for one new response at X = x*, with

$$SE(\text{one new response}) = S\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}},$$

hence the prediction interval (PI) for one new response is

$$PI(\text{one new response}) = \hat{y} \pm t^*\, S\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}.$$

Compare the formulas for the CI and the PI to see the difference between them: the PI has an extra "1" under the square root. In both formulas:
- S = standard deviation of the points around the regression line = √MSE, with df = dferror;
- x* = the particular value of X for which we are making a prediction.

Both the CI and the PI are centered at ŷ|x* = a + bx*, the prediction at X = x*.
- The PI for a new response is always wider than the CI for the mean response at the same value of X = x*. (Why?)
- The SEs, and hence the intervals, are narrower when x* is close to X̄ (the mean of the sample of X's) and wider when x* is far from X̄. (Why?)

[Fitted-line plot for the age-prediction problem: Guessed Age = −0.030 + 1.055 True Age, with 95% CI and 95% PI bands; S = 4.48483, R-Sq = 96.9%, R-Sq(adj) = 96.5%]

Age prediction example (continued):

a) Suppose you want to know, with 95% confidence, the range of your friend's guesses for a 65-year-old person. Here we have one value of X, x* = 65, so you want a 95% prediction interval at this value. Using the prediction equation we found, the predicted value of Y at X = 65 is
ŷ = −0.03 + 1.05x* = −0.03 + 1.05 × 65 = 68.22.
Calculations for the SEs are long and tedious; however, we can use any statistical software to get what we want easily. For example, Minitab gives the prediction interval at X = 65 as PI = (57.22, 79.82). Observe that the center of this interval is 68.52: this is the predicted value ŷ that Minitab calculated at X = 65. [It is slightly different from our hand value because Minitab carries more digits after the decimal point in its calculations.]

b) You want to know, with 95% confidence, the average of your friend's guesses for all people aged 65. Since we are now looking for the mean of all guessed ages at X = 65, this is a problem of a CI for the mean response. Minitab gives CI = (63.98, 73.06). Observe that both the CI and the PI are centered at the same point, ŷ = 68.52.

Finally, observe the difference in the lengths of the two intervals from Minitab:
95% CI at X = 65 is (63.98, 73.06); length = 73.06 − 63.98 = 9.08.
95% PI at X = 65 is (57.22, 79.82); length = 79.82 − 57.22 = 22.6.
As mentioned before, the PI is ALWAYS wider than the CI at the same level of confidence and the same value of X.
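The "long and tedious" SE calculations are easy to script. This sketch (numpy/scipy assumed) applies the two formulas above at x* = 65 and should reproduce Minitab's intervals up to rounding:

```python
import numpy as np
from scipy import stats

x = np.array([20, 45, 70, 85, 25, 50, 15, 60, 40, 35], dtype=float)
y = np.array([18, 52, 65, 90, 28, 58, 13, 66, 44, 35], dtype=float)
n = len(x)

# Least squares fit and S = sqrt(MSE)
r = np.corrcoef(x, y)[0, 1]
b = r * y.std(ddof=1) / x.std(ddof=1)
a = y.mean() - b * x.mean()
s = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))

x_star = 65.0
y_hat = a + b * x_star
sxx = np.sum((x - x.mean()) ** 2)
t_star = stats.t.ppf(0.975, n - 2)

# CI for the mean response and PI for one new response at x*
se_mean = s * np.sqrt(1 / n + (x_star - x.mean()) ** 2 / sxx)
se_new = s * np.sqrt(1 + 1 / n + (x_star - x.mean()) ** 2 / sxx)
ci = (y_hat - t_star * se_mean, y_hat + t_star * se_mean)
pi = (y_hat - t_star * se_new, y_hat + t_star * se_new)

print(f"y_hat = {y_hat:.2f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})   # Minitab: (63.98, 73.06)")
print(f"95% PI = ({pi[0]:.2f}, {pi[1]:.2f})   # Minitab: (57.22, 79.82)")
```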
More on R²:

We have seen that R² = (r)². It can also be defined and calculated from the relation

$$R^2 = \frac{SS_{Reg}}{SST} = \frac{\text{variation in } Y \text{ explained by the regression}}{\text{total variation in } Y}.$$

This leads to an alternative interpretation of R²: R² is the proportion of the variability in Y that is explained by the regression on X, or equivalently, R² is the proportional reduction in prediction error, i.e., the percentage reduction in prediction error we achieve when the prediction equation is used instead of ȳ (the sample mean of Y) as the predicted value of Y.

Example: In the ANOVA table for the analysis of guessed ages we had the following output:

S = 4.48483   R-Sq = 96.9%   R-Sq(adj) = 96.5%

Source          DF      SS      MS       F      P
Regression       1  5030.0  5030.0  250.08  0.000
Residual Error   8   160.9    20.1
Total            9  5190.9

Then R² = SSReg/SST = 5030.0/5190.9 = 0.969 = 96.9%.
This is the same result we had from Minitab, as it should be. We may now interpret it as follows: the regression model yields a predicted value for Y that has 96.9% less error than we would have if we used the sample mean of the Y's as the predicted value.

More on residuals:

Residual = vertical distance from an observed point to the predicted value at the same X = observed y − predicted y = y − ŷ, where ŷ = −0.03 + 1.05x.

True age (x)   Observed y   Predicted ŷ   Residual y − ŷ
     20            18          21.06          −3.06
     45            52          47.43           4.57
     70            65          73.79          −8.79
     85            90          89.61           0.39
     25            28          26.33           1.67
     50            58          52.70           5.30
     15            13          15.79          −2.79
     60            66          63.25           2.75
     40            44          42.15           1.85
     35            35          36.88          −1.88

Hence, for someone whose actual age is 35, the predicted age is 36.88. This means the prediction was 1.88 years higher than the true age.

- Positive residuals: observations above the regression line.
- Negative residuals: observations below the regression line.
- Sum of residuals = 0, ALWAYS.

We (or computers) can make residual plots to see if there are any problems with the assumptions. The computer finds "standardized residuals" (a z-score for each observation). Any point with a z-score bigger than 3 in absolute value, i.e., |z| > 3, is called an outlier.

More on correlation:

If the distance (in absolute value) between a given value of X, say x*, and X̄ is k standard deviations, i.e., |x* − X̄| = k·SX, then the distance (in absolute value) between the predicted value ŷ at x* and Ȳ is r·k standard deviations, i.e., |ŷ − Ȳ| = r·k·SY.

Example: Suppose Y = heights of children and X = heights of their fathers, and the correlation between the two variables is r = 0.5. Then:
- If a father's height is 2 standard deviations above the mean height of all fathers, the predicted height of his child is 0.5 × 2 = 1 standard deviation above the mean height of all children.
- If a father's height is 1.5 standard deviations below the mean height of all fathers, his child's predicted height is 0.5 × 1.5 = 0.75 standard deviations below the mean height of all children.

Some more on correlation:
Correlation is very much affected by outliers and influential points. Outliers weaken the correlation. Influential points (points far from the rest of the observations in the x-direction that do not follow the trend) may change the sign and value of the slope.
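The "r·k standard deviations" fact is just the regression line rewritten in standardized units. A quick sketch (numpy assumed) verifies it on the age data:

```python
import numpy as np

x = np.array([20, 45, 70, 85, 25, 50, 15, 60, 40, 35], dtype=float)
y = np.array([18, 52, 65, 90, 28, 58, 13, 66, 44, 35], dtype=float)

r = np.corrcoef(x, y)[0, 1]
b = r * y.std(ddof=1) / x.std(ddof=1)
a = y.mean() - b * x.mean()

# Pick an x* that is k standard deviations above the mean of X
k = 2.0
x_star = x.mean() + k * x.std(ddof=1)
y_hat = a + b * x_star

# The prediction should sit r*k standard deviations above the mean of Y
k_y = (y_hat - y.mean()) / y.std(ddof=1)
print(f"k = {k}, r*k = {r * k:.4f}, standardized prediction = {k_y:.4f}")
```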
Residual plots

Residuals are the estimators of the error term (ε) in the regression model. Thus, the assumption of normality of ε can be checked by looking at the residuals:
- A histogram of the residuals that is (almost) bell-shaped (symmetric) supports the assumption of normality.
- A histogram or dot plot that shows outliers is indicative of a violation of the normality assumption.
- A normal probability plot or normal quantile plot can also be used to check the normality assumption: points in a normal PP or QP falling around a straight line support the assumption.

Plot of the residuals against the explanatory variable (X)

Such plots magnify any problems with the assumptions:
- If the residuals are randomly scattered around the line residuals = 0, this is good. It means nothing else is left over after using X to predict Y.
- If the residual plot shows a curved pattern, this indicates that a curvilinear fit (quadratic?) will give better results.
- If the residual plot is funnel-shaped, the assumption of constant variance is violated.
- If the residual plot shows an outlier, this may mean a violation of normality and/or constant variance, or point to an influential observation.

11.5 Exponential regression

This is a nonlinear regression model of the form
µY = α·β^X.
The model is called "exponential" because the independent variable X appears as the exponent of the coefficient β. Observe that when we take the logarithm of the model we obtain
log(µY) = log(α) + (log β)·X,
so the logarithm of the mean of Y is a linear function of X with coefficients log(α) and log(β). (A code sketch of this log-linear fitting trick appears after the summary below.)

Note that when X = 0, β^X = β^0 = 1. Thus α gives us the mean of Y at X = 0, since µY = α·β^0 = α(1) = α. The parameter β represents the multiplicative effect of X on Y (as opposed to the additive effect in the simple linear regression we have seen so far). So if, for example, β = 1.5, increasing X by one unit will increase the mean of Y by 50% of its previous value, i.e., we multiply the value of Y at the previous value of X by 1.5 to obtain the current value.

Summary of SLR

Model: y = α + βx + ε.

Assumptions:
a) Random sample
b) Normal distribution
c) Constant variance
d) ε ~ N(0, σ)

Parameters and estimators:
- Intercept α, estimated by a = Ȳ − bX̄
- Slope β, estimated by b = r·(SY/SX)
- Standard deviation σ, estimated by S = √MSE

Interpretation of:
- slope
- intercept
- R²
- r

Testing whether the model is good:
- ANOVA
- the t-test for the slope
- CI for the slope
- PI and CI for a response
- residual plots and their interpretations
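As promised in the exponential-regression section above, here is a hedged sketch of fitting the model µY = α·β^X by regressing log(y) on x. The data are synthetic and the parameter values are my own illustrative choices (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical exponential model: mu_Y = alpha * beta**x (illustrative values)
alpha, beta = 4.0, 1.5
x = np.linspace(0, 10, 50)
y = alpha * beta ** x * rng.lognormal(0.0, 0.1, size=x.size)  # multiplicative noise

# Take logs: log(y) = log(alpha) + log(beta) * x, then fit an ordinary SLR
log_y = np.log(y)
slope, intercept = np.polyfit(x, log_y, 1)

# Back-transform the linear coefficients to the original scale
alpha_hat, beta_hat = np.exp(intercept), np.exp(slope)
print(f"alpha_hat = {alpha_hat:.2f} (true {alpha}), "
      f"beta_hat = {beta_hat:.3f} (true {beta})")
# Interpretation check: each unit increase in x multiplies the mean of y by beta_hat
```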