Chapter 10 - Simple Linear Regression

Suppose we want to predict the assessed value of a house in Ames. One approach is to select a random sample of n houses, estimate the population mean of assessed value (using methods in Ch. 7), and use that estimate for prediction. A better method uses other information about the houses that is used by the assessor (e.g., sq. ft. of floor space, age of the house, location, etc.). If we have a data set that has values for the variables assessed value ($y$), floor space ($x_1$), age ($x_2$), and location ($x_3$), we can develop a relationship between $y$ and the $x$'s that will allow us to predict the assessed value of a particular house, given its observed values of the other variables. Chapter 10 covers the simplest situation: that of relating two variables, $y$ and $x$.

Suppose we want to model monthly sales revenue ($y$) of an appliance store as a function of monthly advertising expenditure ($x$). Does an exact relationship exist between these two variables? If an exact relationship existed, for example one in which monthly sales revenue is exactly 15 times the monthly advertising expenditure, we would have

  $y = 15x$

This is called a deterministic model. However, such a relationship is not possible because there are other factors that affect sales revenue that are not measured; we have to allow for them in the model. To allow for the unexplained variation in monthly sales due to unincluded variables or random phenomena, we introduce the following model:

  $y = 15x + \text{random error}$

This is called a probabilistic model, where we always assume that the mean of the random error component is 0.

The simplest probabilistic model is the straight-line regression model:

  $y = \beta_0 + \beta_1 x + \varepsilon$

where
  $y$ = dependent or response variable
  $x$ = independent or predictor variable
  $E(y) = \beta_0 + \beta_1 x$ = deterministic component
  $\varepsilon$ = random error component
  $\beta_0$ = intercept of the true line
  $\beta_1$ = slope of the true line.

Once a straight-line model has been hypothesized, sample data must be collected on the variables $x$ and $y$. Then we use the sample data to estimate the unknown parameters in the model: the intercept $\beta_0$ and the slope $\beta_1$. It is helpful to obtain a scatterplot of $y$ vs. $x$ to determine whether our hypothesis is plausible.

You may eyeball a straight line through the points and obtain the values of the intercept and the slope of that line. The eye-balled line is

  $y = -1 + x$

So for this line $\beta_0 = -1$ and $\beta_1 = 1$. But this line may not be the "best" line for predicting $y$ values. To see this, first calculate the SSE for the eye-balled line: the sum of squares of errors (SSE) for the eye-balled line is 2.0. However, we can find another line for which the SSE is a minimum. This line is called the least squares line or the regression line.

The least squares line is obtained by minimizing the sum of squared deviations (SSE) between the values predicted by the model for $y$ (denoted by $\hat{y}$) and the actual (observed) $y$'s. This line is the "best" in the sense that it minimizes the SSE. We would like to find the values of $\beta_0$ and $\beta_1$ that minimize the SSE. These are called the least squares estimates. The least squares estimates of the unknown intercept and slope parameters are denoted by $\hat{\beta}_0$ and $\hat{\beta}_1$. The fitted line is then denoted by

  $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

and is called the least squares line. From this equation, we can calculate values of $\hat{y}$ corresponding to the values of $x$. These are called the predicted values (or fitted values).

The n data points for a straight-line model are denoted by $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, or simply by $(x_i, y_i)$ for each $i = 1, \ldots, n$.
Thus the predicted values are given by

  $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$  for each $i = 1, \ldots, n$

The sum of squares of the deviations is then

  $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left[ y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right]^2$

The least squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are

  $\hat{\beta}_1 = \dfrac{SS_{xy}}{SS_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

where

  $SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n}$
  $SS_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \dfrac{(\sum x_i)^2}{n}$

For the advertising expenditure-sales revenue data ($n = 5$, $\sum x_i = 15$, $\sum y_i = 10$, $\sum x_i y_i = 37$, $\sum x_i^2 = 55$), we have

  $SS_{xy} = 37 - \dfrac{(15)(10)}{5} = 37 - 30 = 7$
  $SS_{xx} = 55 - \dfrac{(15)^2}{5} = 55 - 45 = 10$
  $\hat{\beta}_1 = \dfrac{SS_{xy}}{SS_{xx}} = \dfrac{7}{10} = .7$
  $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \dfrac{10}{5} - (.7)\left(\dfrac{15}{5}\right) = 2 - 2.1 = -.1$

First, note that the SSE = 1.10 calculated from the least squares line is less than the SSE = 2.0 of the eye-balled line.

JMP output. Layout: Exercise10_19
Bivariate Fit of Retal Index By Salary

  Retal Index   Salary ($)
      301         62000
      550         36500
      755         21600
      327         24000
      500         30100
      377         35000
      290         47500
      452         54000
      535         19800
      455         44000
      615         46600
      700         15100
      650         70000
      630         21000
      360         16900

[Figure: scatterplot of Retal Index vs. Salary with the fitted line]

Linear Fit
  Retal Index = 569.58007 - 0.0019237 Salary

Summary of Fit
  RSquare                        0.048783
  RSquare Adj                   -0.02439
  Root Mean Square Error       151.5919
  Mean of Response             499.8
  Observations (or Sum Wgts)    15

Analysis of Variance
  Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
  Model       1        15320.93       15320.9     0.6667     0.4289
  Error      13       298741.47       22980.1
  C. Total   14       314062.40

Parameter Estimates
  Term        Estimate     Std Error   t Ratio   Prob>|t|
  Intercept   569.58007    93.99729      6.06     <.0001
  Salary      -0.001924    0.002356     -0.82     0.4289

JMP output. Layout: Exercise10_18
Bivariate Fit of Gasoline(cents/gal.) By Crude Oil($/bbl.)

  Year   Gasoline (cents/gal.)   Crude Oil ($/bbl.)
  1975            57                  10.38
  1976            59                  10.89
  1977            62                  11.96
  1978            63                  12.46
  1979            86                  17.72
  1980           119                  28.07
  1981           131                  35.24
  1982           122                  31.87
  1983           116                  28.99
  1984           113                  28.63
  1985           112                  26.75
  1986            86                  14.55
  1987            90                  17.90
  1988            90                  14.67
  1989           100                  17.97
  1990           115                  22.23
  1991            72                  16.54
  1992            71                  15.99
  1993            75                  14.24
  1994            67                  13.21
  1995            63                  14.63
  1996            72                  18.56

[Figure: scatterplot of Gasoline(cents/gal.) vs. Crude Oil($/bbl.) with the fitted line]

Linear Fit
  Gasoline(cents/gal.) = 30.134836 + 3.0181453 Crude Oil($/bbl.)

Parameter Estimates
  Term                 Estimate    Std Error   t Ratio   Prob>|t|   Lower 95%   Upper 95%
  Intercept            30.134836   5.454029      5.53     <.0001    18.757931   41.511741
  Crude Oil($/bbl.)    3.0181453   0.265423     11.37     <.0001     2.464482   3.5718085

Analysis of Variance
  Source     DF   Sum of Squares   Mean Square   F Ratio    Prob > F
  Model       1      10373.339       10373.3     129.3011    <.0001
  Error      20       1604.524          80.2
  C. Total   21      11977.864

Summary of Fit
  RSquare                       0.866043
  RSquare Adj                   0.859345
  Root Mean Square Error        8.956909
  Mean of Response             88.22727
  Observations (or Sum Wgts)   22

[Figures: Plot of Residuals vs. Year; Plot of Residuals vs. Predicted Gasoline(cents/gal.)]

Returning to the advertising example, it is important to interpret the slope and intercept of the least squares line, $\hat{\beta}_1$ and $\hat{\beta}_0$, relative to the problem and the data used for the estimation. The slope $\hat{\beta}_1 = .7$ implies that for every unit increase in the value of $x$, the expected value (or mean value) of $y$ is predicted to increase by .7 units. In this example, for every $100 increase in advertising the mean sales revenue is predicted to increase by .7 × $1000 = $700, over the range of values of advertising expenditure in the data, i.e., from $100 to $500.

The intercept of the least squares line, $\hat{\beta}_0 = -.1$, seems to say that if the advertising expenditure $x$ were equal to $0, the expected (or mean) sales revenue would be -.1 × $1000 = -$100. However, since an advertising expenditure of $0 is not in the range from $100 to $500 used for estimating the least squares line, this interpretation is not valid.

The moral: interpretation of the model parameter estimates must be made only within the range of values of the predictor $x$ used in the computation of the least squares line.
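The least squares formulas and the two SSE values quoted for the advertising example can be checked with a short sketch. The five (x, y) pairs below are an assumption: the notes quote only the sums, and these values were chosen because they reproduce those sums. The function names are illustrative.

```python
# Sketch: least squares estimates from the SS formulas.
# ASSUMPTION: these five pairs are illustrative values chosen to match the
# sums quoted in the notes (sum x = 15, sum y = 10, sum xy = 37, sum x^2 = 55).
x = [1, 2, 3, 4, 5]   # advertising expenditure, in $100s
y = [1, 1, 2, 2, 4]   # sales revenue, in $1000s


def least_squares(x, y):
    """Return (b0, b1) using b1 = SSxy / SSxx and b0 = ybar - b1 * xbar."""
    n = len(x)
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ss_xx = sum(a * a for a in x) - sum(x) ** 2 / n
    b1 = ss_xy / ss_xx
    b0 = sum(y) / n - b1 * sum(x) / n
    return b0, b1


def sse(x, y, b0, b1):
    """Sum of squared deviations of y from the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


b0, b1 = least_squares(x, y)
sse_ls = sse(x, y, b0, b1)        # least squares line
sse_eye = sse(x, y, -1.0, 1.0)    # eye-balled line y = -1 + x

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
print(f"SSE (least squares) = {sse_ls:.2f}, SSE (eye-balled) = {sse_eye:.2f}")
```

Under these assumed data the sketch gives an intercept of -.1, a slope of .7, and SSE values of 1.10 and 2.0, in agreement with the figures in the notes.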
We stated that $E(y) = \beta_0 + \beta_1 x$ is the deterministic component and that $\varepsilon$ is the random component of the model. We may call $\beta_0 + \beta_1 x$ the mean of $y$ at a specified $x$. The deterministic component is the equation of a straight line. Making statistical inferences from the fitted model requires us to specify the probability distribution of the random error $\varepsilon$.

Model Assumptions

Model: $y = \beta_0 + \beta_1 x + \varepsilon$

Assumption 1: The mean of the distribution of $\varepsilon$ is 0.
Assumption 2: The variance of the distribution of $\varepsilon$ is $\sigma^2$ and is constant for all values of $x$.
Assumption 3: $\varepsilon$ has the $N(0, \sigma^2)$ distribution.
Assumption 4: The random errors from fitting the model to n pairs of data $(x_i, y_i)$, $i = 1, \ldots, n$, i.e., $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$, are a random sample from the $N(0, \sigma^2)$ distribution.

Note: An implication of these assumptions is that $y$ has a normal distribution with mean $\beta_0 + \beta_1 x$ and variance $\sigma^2$.

These assumptions allow us to construct confidence intervals for the least squares estimators and develop hypothesis tests for examining the usefulness of the least squares line.

We have already given the formulas for calculating the least squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ of the intercept and slope parameters. What is a good estimate of $\sigma^2$ or $\sigma$? The best estimate $s^2$ of $\sigma^2$ can be obtained from the results of fitting the straight-line model:

  $s^2 = \dfrac{\text{Sum of Squares for Error}}{\text{Degrees of freedom for Error}} = \dfrac{SSE}{n - 2}$

where

  $SSE = \sum (y_i - \hat{y}_i)^2 = SS_{yy} - \hat{\beta}_1 SS_{xy}$
  $SS_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \dfrac{(\sum y_i)^2}{n}$

It follows that the estimate $s$ of $\sigma$ is

  $s = \sqrt{s^2} = \sqrt{\dfrac{SSE}{n - 2}}$

For the advertising expenditure-sales revenue example, the least squares line was

  $\hat{y} = -.1 + .7x$

Recall that $n = 5$ and SSE = 1.10 from that example. Thus we have

  $s^2 = \dfrac{SSE}{n - 2} = \dfrac{1.10}{3} = .367$

as the estimate of $\sigma^2$, and $s = \sqrt{.367} = .61$ is the standard error of the regression model.

Interpretation of s. $s$ measures the spread of the y-values around the least squares line at any value of $x$. Therefore we can expect most y values to lie within $2s$ of $\hat{y}$.
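The estimate of the error variance can be sketched in a few lines. The function name is illustrative; the inputs SSE = 1.10 and n = 5 are the values from the advertising example.

```python
import math


def regression_std_error(sse, n):
    """Return (s2, s), where s2 = SSE / (n - 2) estimates sigma^2 and s = sqrt(s2)."""
    s2 = sse / (n - 2)   # n - 2 degrees of freedom: two parameters were estimated
    return s2, math.sqrt(s2)


# Advertising example: SSE = 1.10, n = 5
s2, s = regression_std_error(1.10, 5)
print(f"s^2 = {s2:.3f}, s = {s:.2f}")
```

The divisor is n - 2 rather than n because the intercept and the slope are both estimated from the same data.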
Inferences about the slope $\beta_1$

Again consider the model

  $y = \beta_0 + \beta_1 x + \varepsilon$

Recall that the mean of $y$ for a given $x$ is

  $E(y) = \beta_0 + \beta_1 x$

From this we can see how the straight-line model makes the mean of $y$, and therefore the prediction of $y$, depend on $x$. If $\beta_1 = 0$ in the above model, then $x$ has no effect on the prediction of $y$. Therefore the test of the null hypothesis that $x$ contributes no information to the prediction, against the alternative that the above model is useful for predicting $y$, is equivalent to testing

  $H_0: \beta_1 = 0$  vs.  $H_a: \beta_1 \neq 0$

If the data support $H_a$, the alternative hypothesis, then we conclude that $x$ contributes information for the prediction of $y$ through the above straight-line model.

Sampling Distribution of $\hat{\beta}_1$

If our assumptions about the regression model hold, then the sampling distribution of $\hat{\beta}_1$ is normal with mean $\beta_1$ and standard deviation $\sigma_{\hat{\beta}_1}$, where

  $\sigma_{\hat{\beta}_1} = \dfrac{\sigma}{\sqrt{SS_{xx}}}$

Since $\sigma$ is unknown, we estimate it by $s$. Therefore an estimate of $\sigma_{\hat{\beta}_1}$ is given by $s_{\hat{\beta}_1}$, where

  $s_{\hat{\beta}_1} = \dfrac{s}{\sqrt{SS_{xx}}}$

$s_{\hat{\beta}_1}$ is called the estimated standard error of the least squares slope $\hat{\beta}_1$.

A t-test for $\beta_1$

The t-statistic for testing $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$ is

  $t = \dfrac{\hat{\beta}_1 - 0}{s_{\hat{\beta}_1}}$  where  $s_{\hat{\beta}_1} = s / \sqrt{SS_{xx}}$

Rejection region: $|t| > t_{\alpha/2, n-2}$, where $t_{\alpha/2, n-2}$ is the critical value from the t-table based on $n - 2$ degrees of freedom.

Example: In the advertising-sales example, for testing $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$, the t-statistic is computed to be

  $t = \dfrac{\hat{\beta}_1}{s_{\hat{\beta}_1}} = \dfrac{\hat{\beta}_1}{s / \sqrt{SS_{xx}}} = \dfrac{.7}{.61 / \sqrt{10}} = 3.7$

and $t_{\alpha/2, n-2} = t_{.025, 3} = 3.182$ for $\alpha = .05$. Thus the rejection region is $|t| > 3.182$, and since the calculated t-value falls in the rejection region, we reject $H_0$ and conclude that the slope $\beta_1$ is not zero. This implies that the variable $x$ (advertising expenditure) does contribute to the prediction of $y$ (sales revenue) using the straight-line model.
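The t-test for the slope can be sketched as follows. The function name is illustrative, and the critical value 3.182 is taken from the notes (a t-table lookup), not computed here.

```python
import math


def slope_t_statistic(b1, s, ss_xx):
    """Return (t, se_b1) for H0: beta1 = 0, where se_b1 = s / sqrt(SSxx)."""
    se_b1 = s / math.sqrt(ss_xx)
    return b1 / se_b1, se_b1


# Advertising example: b1 = .7, SSE = 1.10, n = 5, SSxx = 10
s = math.sqrt(1.10 / (5 - 2))
t, se_b1 = slope_t_statistic(0.7, s, 10.0)

T_CRIT = 3.182                 # t_{.025, 3}, as quoted from the t-table
reject_h0 = abs(t) > T_CRIT    # two-sided test at alpha = .05

print(f"se(b1) = {se_b1:.4f}, t = {t:.2f}, reject H0: {reject_h0}")
```

Using the unrounded s, the statistic comes out at about 3.66, which the notes round to 3.7; either way it exceeds 3.182.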
When using computer software for the analysis of regression data, we can reach the same conclusion by using the observed significance level, or p-value, computed by the program. In the JMP regression output this value is given under the column headed Prob>|t|, in the table giving the parameter estimates. Comparing $|t|$ to $t_{\alpha/2}$ for the two-sided test is equivalent to comparing $\alpha$ to the computed p-value.

Example: Exercise 10.18
Bivariate Fit of Gasoline(cents/gal.) By Crude Oil($/bbl.)

Linear Fit
  Gasoline(cents/gal.) = 30.134836 + 3.0181453 Crude Oil($/bbl.)

Parameter Estimates
  Term                 Estimate    Std Error   t Ratio   Prob>|t|   Lower 95%   Upper 95%
  Intercept            30.134836   5.454029      5.53     <.0001    18.757931   41.511741
  Crude Oil($/bbl.)    3.0181453   0.265423     11.37     <.0001     2.464482   3.5718085

Analysis of Variance
  Source   DF   Sum of Squares   Mean Square   F Ratio
  Model     1      10373.339       10373.3     129.3011
  Error    20       1604.524          80.2

Here the p-value is given as < .0001. Thus it is smaller than the significance level $\alpha = .05$. Therefore we reject the null hypothesis $H_0: \beta_1 = 0$ and conclude that the slope $\beta_1$ is not zero. It is obvious from the plot below that the slope is not zero.

[Figure: Bivariate Fit of Gasoline(cents/gal.) By Crude Oil($/bbl.), scatterplot with fitted line]

Look at the JMP output for the Whistle Blower example (Exercise 10.19) again:

Bivariate Fit of Retal Index By Salary

Parameter Estimates
  Term        Estimate     Std Error   t Ratio   Prob>|t|
  Intercept   569.58007    93.99729      6.06     <.0001
  Salary      -0.001924    0.002356     -0.82     0.4289

Analysis of Variance
  Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
  Model       1        15320.93       15320.9     0.6667     0.4289
  Error      13       298741.47       22980.1
  C. Total   14       314062.40

Here the p-value is .429, which is larger than the significance level $\alpha = .05$. Thus we fail to reject the null hypothesis $H_0: \beta_1 = 0$ and conclude that there is insufficient evidence that salary contributes to predicting the retaliation index using a straight-line model.

A $100(1 - \alpha)\%$ confidence interval for $\beta_1$

  $\hat{\beta}_1 \pm t_{\alpha/2, n-2} \, s_{\hat{\beta}_1}$

where $s_{\hat{\beta}_1} = s / \sqrt{SS_{xx}}$ and $t_{\alpha/2, n-2}$ is the critical value from the t-table based on $n - 2$ degrees of freedom.

Example: For the advertising expenditure-sales revenue example,

  $\hat{\beta}_1 \pm t_{.025, 3} \, s_{\hat{\beta}_1} = .7 \pm 3.182 \left( \dfrac{.61}{\sqrt{10}} \right) = .7 \pm .61$

Thus the interval estimate for the slope parameter $\beta_1$ is (.09, 1.31).

Interpretation: We are 95% confident that the true mean increase in monthly sales revenue per additional $100 of advertising expenditure is between $90 and $1,310. Also, since zero is not included in this interval, we can use the interval to conclude that $\beta_1$ is not zero.

This interval is rather wide. The reason is that the sample size is too small to estimate $\beta_1$ with more accuracy. As we have already seen, one way to increase the accuracy of an estimate is to increase the sample size.

The Coefficient of Correlation

Consider n observations of a pair of variables $(x, y)$ measured on observational (or experimental) units.

Definition: The Pearson product moment coefficient of correlation, $r$, is a measure of the strength of the linear relationship between two variables $x$ and $y$. It is computed from a sample of n measurements on $x$ and $y$ as follows:

  $r = \dfrac{SS_{xy}}{\sqrt{SS_{xx} \, SS_{yy}}}$

Some properties of r
  - $r$ is scaleless (or unitless).
  - $r$ takes a value between -1 and +1.
  - $r = 0$ implies that a linear relationship does not exist between $x$ and $y$.
  - The closer $r$ comes to $\pm 1$, the stronger the linear relationship between $x$ and $y$.
  - Positive $r$ implies a positive relationship; negative $r$ implies a negative relationship.
  - $r = \pm 1$ implies that an exact linear relationship exists between $x$ and $y$.
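The definition of r and a few of its properties can be illustrated with a short sketch. The data are an assumption (five pairs chosen to reproduce the sums quoted for the advertising example), and the function name is illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation r = SSxy / sqrt(SSxx * SSyy)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    ss_xx = sum((a - x_bar) ** 2 for a in x)
    ss_yy = sum((b - y_bar) ** 2 for b in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


# ASSUMPTION: illustrative data matching the sums quoted for the
# advertising example (x in $100s, y in $1000s).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

r = pearson_r(x, y)

# r is unitless: re-expressing x and y in dollars leaves it unchanged.
r_dollars = pearson_r([100 * v for v in x], [1000 * v for v in y])

# An exact linear relationship gives r = +1 (or -1 for a negative slope).
r_exact = pearson_r(x, [2 * v + 1 for v in x])

print(f"r = {r:.3f}, r in dollars = {r_dollars:.3f}, exact line r = {r_exact:.3f}")
```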
Since $\hat{\beta}_1 = SS_{xy} / SS_{xx}$ (the slope of the least squares line) has the same numerator as $r$, and both denominators are positive,

  $r = 0$ when $\hat{\beta}_1 = 0$
  $r > 0$ when $\hat{\beta}_1 > 0$
  $r < 0$ when $\hat{\beta}_1 < 0$

Example: For the advertising-sales example, $SS_{xy} = 7$, $SS_{xx} = 10$, and $SS_{yy} = 6$, giving

  $r = \dfrac{SS_{xy}}{\sqrt{SS_{xx} \, SS_{yy}}} = \dfrac{7}{\sqrt{(10)(6)}} = .904$

which indicates a strong positive linear relationship between advertising and sales, implying that sales revenue increases as advertising expenditure increases (for these 5 months).

Population correlation coefficient

The sample correlation coefficient $r$ is a sample statistic that estimates the corresponding population correlation coefficient $\rho$. $\rho$ is a parameter of the bivariate population distribution of $(x, y)$. So we can make statistical inferences about $\rho$ using $r$ and its sampling distribution if we wish. These would involve confidence intervals and hypothesis tests about $\rho$.

The information that $r$ provides about the least squares line is identical to that provided by the slope $\hat{\beta}_1$. So in the case of the straight-line model we make inferences about the model using the sampling distribution of $\hat{\beta}_1$ (instead of $r$). In fact, we have already done so.

The Coefficient of Determination

This measures the contribution of $x$ in predicting $y$. If we assume $y \sim N(\mu, \sigma^2)$, the variability in $y$ is measured by

  $SS_{yy} = \sum (y_i - \bar{y})^2$

This is called the total sample variation. If the straight-line model is correct (i.e., if $x$ contributes to the prediction of $y$), then $y \sim N(\beta_0 + \beta_1 x, \sigma^2)$ and the variability in $y$ is measured by

  $SSE = \sum (y_i - \hat{y}_i)^2$

Because $SSE = SS_{yy} - \hat{\beta}_1 SS_{xy}$:
  If $\hat{\beta}_1 = 0$, then $SSE = SS_{yy}$.
  If $\hat{\beta}_1 \neq 0$, then $SSE < SS_{yy}$.

Thus $SS_{yy} - SSE$ is the reduction in the variability of $y$ attributable to $x$. The larger it is, that is, the smaller SSE is, the larger the contribution of $x$. Usually this reduction in the variability of $y$ is expressed as a proportion of the total sample variation:

  $\dfrac{SS_{yy} - SSE}{SS_{yy}}$

This is the proportion of the total sample variability explained by the fitted regression model.
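The proportion of explained variation, and its agreement with the square of r, can be checked numerically. This is a minimal sketch using the sums of squares quoted for the advertising example; the function name is illustrative.

```python
import math


def proportion_explained(ss_yy, sse):
    """(SSyy - SSE) / SSyy: share of total sample variation explained by the fit."""
    return (ss_yy - sse) / ss_yy


# Advertising example: SSxy = 7, SSxx = 10, SSyy = 6, SSE = 1.10
ss_xy, ss_xx, ss_yy, sse = 7.0, 10.0, 6.0, 1.10

prop = proportion_explained(ss_yy, sse)
r = ss_xy / math.sqrt(ss_xx * ss_yy)

print(f"(SSyy - SSE)/SSyy = {prop:.3f}, r^2 = {r * r:.3f}")
```

Both quantities come out at about .82, which is the sense in which the proportion can be reported as the square of the correlation coefficient in the straight-line model.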
It can be shown that in the simple linear regression (straight-line) model, this proportion is the same as $r^2$, where $r$ is the coefficient of correlation. That is,

  $r^2 = \dfrac{SS_{yy} - SSE}{SS_{yy}} = 1 - \dfrac{SSE}{SS_{yy}}$

Since $r$ is in the range $-1 \le r \le 1$, $r^2$ is in the range $0 \le r^2 \le 1$.

Interpretation of $r^2$, the coefficient of determination

If $r^2 = .60$, it means that the sum of squared prediction errors is 60% smaller when we use $\hat{y}$ to predict $y$ than when we use just the sample mean $\bar{y}$.

Example: In the advertising-sales example, $SS_{yy} = 6.0$ and $SSE = 1.10$. Thus the coefficient of determination is

  $r^2 = \dfrac{SS_{yy} - SSE}{SS_{yy}} = \dfrac{6.0 - 1.1}{6.0} = .82$

We could have calculated this by just squaring the correlation coefficient $r = .904$ we obtained earlier:

  $r^2 = (.904)^2 = .82$

Thus 82% of the sample variation in sales revenue ($y$) is explained by using advertising expenditure ($x$) in a straight-line model to predict $y$. Thus this is a "fairly good" model for predicting $y$.

In the JMP output of Exercise 10.15 the value of the coefficient of determination is reported as a percentage, R-Sq = 4.9%, meaning $r^2 = .049$. Thus less than 5% of the sample variation in retaliation index ($y$) can be explained by using salary ($x$) in a straight-line model to predict $y$. Thus this is not an adequate model for predicting $y$ based on $x$.

Construction of an Analysis of Variance Table

An Analysis of Variance (ANOVA) table is a way of organizing computed information about a fitted model. We can partition the total sum of squares $\sum (y_i - \bar{y})^2$ as follows:

  $\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$

  SSTot = SSE + SSR
  Total SS = Error SS + Regression SS

SSTot: measures "the total amount of variation of the $y_i$'s about $\bar{y}$"
SSE: measures "the total amount of variation of the $y_i$'s about the $\hat{y}_i$'s, i.e., the residual variation"
SSR: measures "the total amount of variation of the $\hat{y}_i$'s about $\bar{y}$, i.e., the variation of the fitted regression line"

Properties of SSTot, SSR, and SSE
1. For a given data set, SSTot is always constant.
2. If SSE increases, SSR decreases, and vice versa.
3. The best model minimizes SSE and maximizes SSR.

The ANOVA table for the model $y = \beta_0 + \beta_1 x + \varepsilon$ is:

  Source       df     SS      MS
  Regression   1      SSR     MSR
  Error        n-2    SSE     MSE
  Total        n-1    SSTot

where MSR = SSR/1 and MSE = SSE/(n-2).

Example: For the Advertising Expenditure - Sales Revenue example we have

  $SSTot = SS_{yy} = \sum y_i^2 - \dfrac{(\sum y_i)^2}{n} = 26 - \dfrac{(10)^2}{5} = 26 - 20 = 6$
  SSE = 1.10 (from previous calculations)
  SSR = SSTot - SSE = 6 - 1.10 = 4.90

The results of these computations can be summarized in an ANOVA table:

  Source       df    SS      MS
  Regression   1     4.90    4.9
  Error        3     1.10    0.36667
  Total        4     6.00

Example: A car dealer is interested in modeling the relationship between the number of cars sold by the firm each week ($y$) and the average number of salespeople who work on the showroom floor per day during the week ($x$).

  i       y_i    x_i    y_i^2   x_i^2   x_i y_i
  1       20     6      400     36      120
  2       18     6      324     36      108
  3       10     4      100     16       40
  4        6     2       36      4       12
  5       11     3      121      9       33
  Total   65     21     981     101     313

1. Fit the simple linear regression model by least squares:

  $SS_{xx} = \sum x_i^2 - (\sum x_i)^2 / n = 101 - (21)^2 / 5 = 12.8$
  $SS_{xy} = \sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n} = 313 - \dfrac{(21)(65)}{5} = 40.0$
  $\hat{\beta}_1 = SS_{xy} / SS_{xx} = 40.0 / 12.8 = 3.125$
  $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \dfrac{65}{5} - (3.125)\left(\dfrac{21}{5}\right) = -0.125$

  Fitted regression line: $\hat{y} = -0.125 + 3.125x$

2. Construct the ANOVA table, using SSTot = SSE + SSR:

  $SSTot = SS_{yy} = \sum y_i^2 - (\sum y_i)^2 / n = 981 - (65)^2 / 5 = 136.00$
  $SSR = (SS_{xy})^2 / SS_{xx} = (40)^2 / 12.8 = 125.00$
  SSE = SSTot - SSR = 136 - 125 = 11.00

  Source       df    SS        MS
  Regression   1     125.00    125.00
  Error        3     11.00     3.66667
  Total        4     136.00

Using Fitted Model for Estimation and Prediction

Two types of inferences can be made from the fitted model:
  - Estimating the mean value $E(y) = \beta_0 + \beta_1 x$ for a specific value of $x$.
  - Predicting a new $y$ value for a given value of $x$.

Example: Advertising Expenditure - Sales Revenue Example:
  - Estimate the mean sales revenue for months in which the advertising expenditure was $400 (i.e., $x = 4$ in the problem).
  - If we decide to spend $400 on advertising next month, what does the model predict the sales revenue to be?
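Stepping back to the ANOVA construction earlier in this section, the car dealer computations can be sketched in code. The data are the five (x, y) pairs from that example; the function name and the dictionary layout are illustrative choices, not part of the notes.

```python
def anova_table(x, y):
    """Return {source: (df, SS, MS)} for the straight-line model fitted to (x, y)."""
    n = len(x)
    ss_xx = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ss_tot = sum(v * v for v in y) - sum(y) ** 2 / n   # SSTot = SSyy
    ssr = ss_xy ** 2 / ss_xx                            # regression sum of squares
    sse = ss_tot - ssr                                  # error sum of squares
    return {
        "Regression": (1, ssr, ssr / 1),
        "Error": (n - 2, sse, sse / (n - 2)),
        "Total": (n - 1, ss_tot, None),
    }


# Car dealer example: x = average salespeople per day, y = cars sold per week
x = [6, 6, 4, 2, 3]
y = [20, 18, 10, 6, 11]

table = anova_table(x, y)
for source, (df, ss, ms) in table.items():
    ms_text = "" if ms is None else f"{ms:.5f}"
    print(f"{source:<11}{df:>3}{ss:>10.2f}  {ms_text}")
```

The error mean square in the table is the same quantity as the estimate s² of the error variance, SSE/(n - 2).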
The statistical inferences made are different:
  - In the first case we want to estimate the mean of the population of values of $y$ at a given value of $x$.
  - In the second case we want to predict a single value of $y$ at a specified $x$ value.

Example: In the Advertising-Sales example, the least squares prediction equation was

  $\hat{y} = -.1 + .7x$

We can use this equation for both of the above inferences. First, note that $E(y) = \beta_0 + \beta_1 x$ is the mean value of $y$ at a given value of $x$. Since $\hat{\beta}_0 + \hat{\beta}_1 x$ is an estimate of $\beta_0 + \beta_1 x$, an estimate of this mean value $E(y)$ is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$. For example, the estimated mean sales revenue for all months when $x = 4$ (i.e., advertising expenditure = $400) is given by

  $\hat{y} = -.1 + .7(4) = 2.7$

i.e., $2700.

On the other hand, $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ is also the predicted value of $y$ at a given value of $x$. Thus if we plan to spend $400 on advertising next month, we can predict sales revenue to be $2700.

Obviously, there is a difference between the two cases. The difference lies in the accuracy of the estimate $\hat{y}$ and the predictor $\hat{y}$. This is reflected in the interval estimates given below, which are constructed using the sampling distributions of these two statistics.

A $100(1 - \alpha)\%$ Confidence Interval for the Mean Value of y at x

  $\hat{y} \pm t_{\alpha/2}$ (Estimated standard error of $\hat{y}$)

or

  $\hat{y} \pm t_{\alpha/2} \, s \sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{SS_{xx}}}$

where $t_{\alpha/2}$ is based on $(n - 2)$ degrees of freedom.

A $100(1 - \alpha)\%$ Prediction Interval for an Individual New Value of y at x

  $\hat{y} \pm t_{\alpha/2}$ (Estimated standard error of prediction)

or

  $\hat{y} \pm t_{\alpha/2} \, s \sqrt{1 + \dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{SS_{xx}}}$

where $t_{\alpha/2}$ is based on $(n - 2)$ degrees of freedom.

Example: Advertising Expenditure - Sales Revenue Example:
Find a 95% confidence interval for the mean monthly sales when the appliance store spends $400 on advertising.
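As a numerical cross-check on the two interval formulas, here is a minimal sketch applied to the $400 case just posed. The function names are illustrative; the inputs (s = .61, n = 5, mean of x = 3, SSxx = 10) and the critical value 3.182 are the figures quoted in these notes for the advertising example.

```python
import math


def mean_ci_halfwidth(t_crit, s, n, x0, x_bar, ss_xx):
    """Half-width of the confidence interval for the mean of y at x = x0."""
    return t_crit * s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / ss_xx)


def prediction_halfwidth(t_crit, s, n, x0, x_bar, ss_xx):
    """Half-width of the prediction interval for a single new y at x = x0."""
    return t_crit * s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / ss_xx)


T_CRIT = 3.182              # t_{.025, 3}, as quoted from the t-table
y_hat = -0.1 + 0.7 * 4      # fitted value at x = 4, i.e. 2.7

ci = mean_ci_halfwidth(T_CRIT, 0.61, 5, 4, 3, 10.0)
pi = prediction_halfwidth(T_CRIT, 0.61, 5, 4, 3, 10.0)

print(f"mean of y at x=4:   {y_hat:.1f} +/- {ci:.2f}")
print(f"new y at x=4:       {y_hat:.1f} +/- {pi:.2f}")
```

The only difference between the two functions is the extra 1 under the square root, which is why the prediction half-width is always the larger of the two.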
For a $400 advertising expenditure, $x = 4$ and the confidence interval for the mean value of $y$ is:

  $\hat{y} \pm t_{\alpha/2} \, s \sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{SS_{xx}}} = \hat{y} \pm t_{.025, 3} \, s \sqrt{\dfrac{1}{5} + \dfrac{(4 - \bar{x})^2}{SS_{xx}}}$

Recall that $\hat{y} = 2.7$, $s = .61$, $\bar{x} = 3$, and $SS_{xx} = 10$, and from Table VI, $t_{.025, 3} = 3.182$. Thus, we have

  $2.7 \pm (3.182)(.61) \sqrt{\dfrac{1}{5} + \dfrac{(4 - 3)^2}{10}} = 2.7 \pm 1.1 = (1.6, 3.8)$

Therefore, we are 95% confident that when the store spends $400 a month on advertising, the mean sales revenue is between $1,600 and $3,800.

Example: Advertising Expenditure - Sales Revenue Example:
Predict the monthly sales for next month, if $400 is to be spent on advertising. Use a 95% prediction interval.

To predict the sales for a particular month for which $x = 4$, we calculate the 95% prediction interval as

  $\hat{y} \pm t_{\alpha/2} \, s \sqrt{1 + \dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{SS_{xx}}} = 2.7 \pm (3.182)(.61) \sqrt{1 + \dfrac{1}{5} + \dfrac{(4 - 3)^2}{10}} = 2.7 \pm 2.2 = (.5, 4.9)$

Therefore, we predict with 95% confidence that the sales revenue next month (a month in which we spend $400 on advertising) will fall in the interval from $500 to $4,900.

It is important to note that this interval is wider than the interval for the mean monthly sales at $400 of advertising expenditure. The reason is that the standard error of prediction for a single new value of $y$ is larger than the standard error of $\hat{y}$ as an estimate of the mean. (Note the additional term of 1 under the square root in the above expression.)

Example 10.61: Many variables influence the sales of existing single-family homes. One of these is the interest rate charged for mortgage loans. Shown in the table are the total number of existing single-family homes sold annually (in 1000's) and the average annual conventional mortgage interest rate (as a %) from 1982-1991.

Identify the predictor and response.
  Predictor x: Interest Rate
  Response y: Homes Sold

  i       y        x       y^2         x^2       xy
  1       1990     14.8    3960100     219.04    29452.0
  2       2719     12.3    7392961     151.29    33443.7
  ...     ...      ...     ...         ...       ...
  10      3220     9.2     10368400    84.64     29624.0
  Total   31253    106.8   99841655    1172.74   325855.5

Fit the least squares regression line.
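As a numerical cross-check on this step, the least squares line can be obtained directly from the column totals of the table. This is a sketch only; the function name is illustrative, and the inputs are the totals shown above for Example 10.61.

```python
def fit_from_totals(n, sum_x, sum_y, sum_xx, sum_xy):
    """Least squares fit from column totals. Returns (SSxx, SSxy, b0, b1)."""
    ss_xx = sum_xx - sum_x ** 2 / n
    ss_xy = sum_xy - sum_x * sum_y / n
    b1 = ss_xy / ss_xx
    b0 = sum_y / n - b1 * sum_x / n
    return ss_xx, ss_xy, b0, b1


# Totals from the Example 10.61 table (n = 10 years)
ss_xx, ss_xy, b0, b1 = fit_from_totals(
    n=10, sum_x=106.8, sum_y=31253, sum_xx=1172.74, sum_xy=325855.5
)

print(f"SSxx = {ss_xx:.3f}, SSxy = {ss_xy:.2f}")
print(f"fitted line: y_hat = {b0:.2f} + ({b1:.2f}) x")
```

Working from totals avoids needing the rows that are elided in the table, since the formulas for SSxx and SSxy depend on the data only through these sums.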
  $SS_{xx} = \sum x_i^2 - \dfrac{(\sum x_i)^2}{n} = 1172.74 - \dfrac{(106.8)^2}{10} = 32.116$
  $SS_{yy} = \sum y_i^2 - \dfrac{(\sum y_i)^2}{n} = 99841655 - \dfrac{(31253)^2}{10} = 2166654.1$
  $SS_{xy} = \sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n} = 325855.5 - \dfrac{(106.8)(31253)}{10} = -7926.54$
  $\hat{\beta}_1 = \dfrac{SS_{xy}}{SS_{xx}} = \dfrac{-7926.54}{32.116} = -246.81$
  $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 3125.3 - (-246.81)(10.68) = 5761.23$

  $\hat{y} = 5761.23 - 246.81x$

Construct the ANOVA Table.

  $SST = SS_{yy} = 2166654.1$
  $SSR = \dfrac{(SS_{xy})^2}{SS_{xx}} = 1956346.9$
  $SSE = SST - SSR = 210307.2$

  Source       df    SS           MS
  Regression   1     1956346.9    1956346.9
  Error        8     210307.2     26288.4
  Total        9     2166654.1

Do the data provide sufficient evidence to indicate a non-zero slope? Use a 95% confidence interval to answer this question.

  $\hat{\beta}_1 \pm t_{.025, 8} \dfrac{s}{\sqrt{SS_{xx}}} = -246.81 \pm (2.306) \left( \dfrac{\sqrt{26288.4}}{\sqrt{32.116}} \right) = -246.81 \pm 65.98 = (-312.79, -180.83)$

Since 0 is not in the interval, we can say that the data provide sufficient evidence to conclude that the slope is not zero.

Compute and interpret the coefficient of determination.

  $r^2 = \dfrac{SSR}{SST} = \dfrac{1956346.9}{2166654.1} = 0.9029$

Interpretation: The fitted line explains 90.29% of the variation in the response.

Compute and interpret the Pearson correlation coefficient.

  $r = -\sqrt{r^2} = -\sqrt{0.9029} = -0.9502$

(we take the negative root because it is the sign of $\hat{\beta}_1$).

Interpretation: This is a very strong negative linear relationship between the interest rate and the number of homes sold.

Compute a 90% confidence interval for the true mean number of homes sold if the interest rate is 10%.

We need a 90% CI for $E(y)$ at $x = 10.0$, where $\hat{y} = 5761.23 - 246.81(10) = 3293.13$:

  $\hat{y} \pm t_{.05, 8} \, s \sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{SS_{xx}}} = 3293.13 \pm (1.86)(\sqrt{26288.4}) \sqrt{\dfrac{1}{10} + \dfrac{(10 - 10.68)^2}{32.116}} = 3293.13 \pm 102.00 = (3191.13, 3395.13)$

Interpretation: We are 90% confident that the true mean number of homes sold (in 1000's) when the interest rate is 10% is between 3191.13 and 3395.13.
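The slope interval and the correlation steps of this example can be cross-checked with a short sketch. The sums of squares and the critical value 2.306 are the figures quoted above; the variable names are illustrative.

```python
import math

# Quantities from Example 10.61 as quoted in the notes
n = 10
ss_xx = 32.116
b1 = -246.81
sst = 2166654.1
ssr = 1956346.9
T_CRIT_95 = 2.306            # t_{.025, 8}, as quoted from the t-table

mse = (sst - ssr) / (n - 2)  # s^2 = SSE / (n - 2)
se_b1 = math.sqrt(mse / ss_xx)

half_width = T_CRIT_95 * se_b1
slope_ci = (b1 - half_width, b1 + half_width)

r2 = ssr / sst
r = math.copysign(math.sqrt(r2), b1)   # r takes the sign of the fitted slope

print(f"s^2 = {mse:.1f}, se(b1) = {se_b1:.2f}")
print(f"95% CI for slope: ({slope_ci[0]:.2f}, {slope_ci[1]:.2f})")
print(f"r^2 = {r2:.4f}, r = {r:.4f}")
```

Small differences in the last digit relative to the hand calculation can arise from rounding the half-width before forming the interval.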
Construct a 90% prediction interval for the number of homes sold during a year in which the interest rate is 10%.

We need a 90% prediction interval for $y$ at $x = 10.0$:

  $\hat{y} \pm t_{.05, 8} \, s \sqrt{1 + \dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{SS_{xx}}} = 3293.13 \pm (1.86)(\sqrt{26288.4}) \sqrt{1 + \dfrac{1}{10} + \dfrac{(10 - 10.68)^2}{32.116}} = 3293.13 \pm 318.36 = (2974.77, 3611.49)$

Interpretation: We are 90% confident that the number of homes sold (in 1000's) during a year when the interest rate is 10% is between 2974.77 and 3611.49.

JMP output. Layout: Exercise10_61a
Bivariate Fit of Homes Sold(1000's) By Interest_Rate(%)

  Year   Homes Sold (1000's)   Interest Rate (%)
  1982         1990                 15.82
  1983         2719                 13.44
  1984         2868                 13.81
  1985         3214                 12.29
  1986         3565                 10.09
  1987         3526                 10.17
  1988         3594                 10.22
  1990         3211                 10.08
  1991         3220                  9.2
  1992         3520                  8.43
  1993         3802                  7.36
  1994         3946                  8.59
  1995         3812                  8.05
  1996         4087                  8.03
  1997         4215                  7.76

[Figure: Prediction and Confidence Intervals, Homes Sold(1000's) vs. Interest_Rate(%)]

Linear Fit
  Homes Sold(1000's) = 5566.1297 - 210.34571 Interest_Rate(%)

Parameter Estimates
  Term               Estimate     Std Error   t Ratio   Prob>|t|
  Intercept          5566.1297    253.9956     21.91     <.0001
  Interest_Rate(%)   -210.3457    24.19405     -8.69     <.0001

Summary of Fit
  RSquare                         0.843728
  RSquare Adj                     0.832566
  Root Mean Square Error        228.9951
  Mean of Response             3414.688
  Observations (or Sum Wgts)     16

Analysis of Variance
  Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
  Model       1      3963718.6       3963719     75.5876    <.0001
  Error      14       734142.8         52439
  C. Total   15      4697861.4

[Figures: Residuals by Year; Residuals by Predicted]

Residual Analysis

The aim is to check whether the assumptions about the model are satisfied for a particular set of data, and to examine what we can do if we detect departures from the assumptions.

Recall that the model was of the form

  $y = E(y) + \varepsilon$

where $E(y) = \beta_0 + \beta_1 x$ for a straight-line model is the deterministic component and $\varepsilon$ is the random error component.
The basic assumption can be summarized as: $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ is a random sample from a Normal population with mean 0 and constant standard deviation $\sigma$. Because the assumptions involve the random error component $\varepsilon$, the best way to study their properties is by first estimating the random error.

From the model it follows that the actual random error is:

  $\varepsilon = y - E(y) = y - (\beta_0 + \beta_1 x)$

The estimated random error, $\hat{\varepsilon}$, is:

  $\hat{\varepsilon} = y - (\hat{\beta}_0 + \hat{\beta}_1 x) = y - \hat{y}$ = residual

Thus, the estimated random error for an observation $y_i$ is the corresponding residual $y_i - \hat{y}_i$.

Earlier, we learned that $\sum (y_i - \hat{y}_i) = 0$. Also, $s = \sqrt{SSE / (n - 2)}$ is an estimate of $\sigma$, where $SSE = \sum (y_i - \hat{y}_i)^2$. Thus we would expect about 95% of the residuals to fall within 2 standard deviations, i.e., within $2s$ of 0, and virtually all of them to lie within 3 standard deviations of 0.

We use a variety of plots of the residuals to check whether these assumptions about the random errors are satisfied.

1. Histogram of the residuals
   Check whether the shape of the distribution is mound-shaped.

2. Scatterplots of residuals in time order or against the x variable.
   a.) Draw a line parallel to the x-axis through the value residual = 0.
   b.) Check visually whether the residuals appear to be evenly centered around this line.
   c.) Draw lines parallel to the x-axis through the values residual = $\pm 2s$.
   d.) Check visually whether many residuals fall outside these lines. Check whether any fall outside $\pm 3s$.

3. Scatterplot of residuals against the x variable or against the predicted value, $\hat{y}$.
   a.) Draw a residual = 0 horizontal line as before.
   b.) Check visually whether the residuals appear to be evenly spread around this line, as you go from low to high values on the x-axis.

If there is a clearly recognizable pattern such as those shown below, then this indicates either a dependence of the error variance $\sigma^2$ on the predictor $x$, or inadequacy of the deterministic part of the model, e.g., the straight-line model is not sufficient to explain the variability in the response $y$.
[Figure: example residual plots. Panels: Residuals vs. x; Residuals vs. Predicted; Frequency histogram of Residuals]

Interpretation of the plots: On any of these plots you should not put much effort into finding a pattern that is simply not there. Unless a pattern is very obvious, conclude that the plot does not indicate a deviation from the assumptions checked, or that the plot is inconclusive.
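The numerical side of these residual checks (residuals summing to zero, and the share of residuals within 2s and 3s of zero) can be sketched as follows. The data are an assumption: five pairs chosen to reproduce the sums quoted for the advertising example, together with the fitted line from that example. The function name is illustrative, and the plots themselves are not produced here.

```python
import math


def residual_summary(x, y, b0, b1):
    """Return (residuals, s, share within 2s of 0, share within 3s of 0)."""
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    n = len(residuals)
    s = math.sqrt(sum(e * e for e in residuals) / (n - 2))
    within_2s = sum(abs(e) <= 2 * s for e in residuals) / n
    within_3s = sum(abs(e) <= 3 * s for e in residuals) / n
    return residuals, s, within_2s, within_3s


# ASSUMPTION: illustrative data matching the sums quoted for the
# advertising example; fitted line y_hat = -.1 + .7x from that example.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

residuals, s, within_2s, within_3s = residual_summary(x, y, -0.1, 0.7)

print("residuals:", [round(e, 2) for e in residuals])
print(f"sum of residuals = {sum(residuals):.1e}, s = {s:.3f}")
print(f"share within 2s: {within_2s:.2f}, within 3s: {within_3s:.2f}")
```

With only five observations these shares say very little; the check becomes informative with larger samples, where roughly 95% of residuals are expected within 2s of zero if the assumptions hold.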