Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
13- 1 Chapter Thirteen McGraw-Hill/Irwin © 2005 The McGraw-Hill Companies, Inc., All Rights Reserved. Chapter Thirteen 13- 2 Linear Regression and Correlation GOALS When you have completed this chapter, you will be able to: ONE Draw a scatter diagram. TWO Understand and interpret the terms dependent variable and independent variable. THREE Calculate and interpret the coefficient of correlation, the coefficient of determination, and the standard error of estimate. FOUR Conduct a test of hypothesis to determine if the population coefficient of correlation is different from zero. Goals Chapter Thirteen 13- 3 continued Linear Regression and Correlation GOALS When you have completed this chapter, you will be able to: FIVE Calculate the least squares regression line and interpret the slope and intercept values. SIX Construct and interpret a confidence interval and prediction interval for the dependent variable. SEVEN Set up and interpret an ANOVA table. Goals 13- 4 Correlation Analysis is a group of statistical techniques to measure the association between two variables. Advertising Minutes and $ Sales The Dependent Variable is the variable being predicted or estimated. 30 Sales ($thousands) A Scatter Diagram is a chart that portrays the relationship between two variables. 25 20 15 10 5 0 70 90 110 130 150 170 190 Advertising Minutes The Independent Variable provides the basis for estimation. It is the predictor variable. Correlation Analysis 13- 5 The Coefficient of Correlation (r) is a measure of the strength of the relationship between two variables. Also called Pearson’s r and It requires interval or ratioPearson’s product moment scaled data. correlation coefficient. Pearson's r It can range from -1.00 to 1.00. Values of -1.00 or 1.00 indicate perfect and strong -1 correlation. 0 1 Negative values indicate an Values close to 0.0 indicate inverse relationship and weak correlation. positive values indicate a The Coefficient of Correlation, r direct relationship. 13- 6 Y 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5 X 6 7 8 9 10 Perfect Negative Correlation 13- 7 Y 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5 X 6 7 8 9 10 Perfect Positive Correlation 13- 8 Y 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5 X 6 7 8 9 10 Zero Correlation 13- 9 Y 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5 X 6 7 8 9 10 Strong Positive Correlation 13- 10 We calculate the coefficient of correlation from the following formula. r= S(X – X)(Y – Y) (n-1)sxsy Formula for r 13- 11 The coefficient of determination (r2) is the proportion of the total variation in the dependent variable (Y) that is explained or accounted for by the variation in the independent variable (X). It is the square of the coefficient of correlation. It ranges from 0 to 1. It does not give any information on the direction of the relationship between the variables. Coefficient of Determination Dan Ireland, the student body president at Toledo State University, is concerned about the cost to students of textbooks. He believes there is a relationship between the number of pages in the text and the selling price of the book. To provide insight into the problem he selects a sample of eight textbooks currently on sale in the bookstore. Draw a scatter diagram. Compute the correlation coefficient. 13- 12 Example 1 13- 13 Book Page Price($) Introduction to History 500 84 Basic Algebra 700 75 Introduction to Psychology 800 99 Introduction to Sociology 600 72 Business Management 400 69 Introduction to Biology 500 81 Fundamentals of Jazz 600 63 Principles of Nursing 800 93 Example 1 continued 13- 14 Scatter Diagram of Number of Pages and Selling Price of Text 100 90 Price ($) 80 70 60 400 500 600 700 800 Page Example 1 continued 13- 15 Page E X C E L Price Mean 625 Mean 79.50 Standard Error 49 Standard Error 4.32 Median 600 Median 78 Mode 600 Mode #N/A Standard Deviation 139 Standard Deviation 12.21 Sample Variance 19,286 Sample Variance 149.14 Kurtosis -0.55 Kurtosis -0.77 Skewness -0.16 Skewness 0.40 Range 400 Range 36 Minimum 400 Minimum 63 Maximum 800 Maximum 99 Sum 5,000 Sum 636 Count 8 Count 8 Example 1 continued 13- 16 S(X – X)(Y – Y) (a) Page 500 700 800 600 400 600 600 800 (b) Price 84 75 99 72 69 81 63 93 (c) (d) Page - Mean(Page) Price-Mean(Price) -125 4.5 75 -4.5 175 19.5 -25 -7.5 -225 -10.5 -25 1.5 -25 -16.5 175 13.5 (c)*(d) . (562.5) (337.5) 3,412.5 187.5 2,362.5 (37.5) 412.5 2,362.5 7,800.0 Example 1 continued 13- 17 r= S(X – X)(Y – Y) (n-1)sxsy 7800 = 7(138.87)(12.21) = .657 r2 = .6572 = .432 The correlation between the number of pages and the selling price of the book is 0.657. This indicates a moderate association between the variable. Example 1 continued 13- 18 Did a computed r come from a population of paired observations with zero correlation? Ho: r = 0 (The correlation in the population is zero.) H1: r ≠ 0 (The correlation in the population is different from zero. t test for the coefficient of correlation t = r n- 2 With n-2 d.f. 1- r2 T-test of significance of r 13- 19 Computed r = .657. Test the hypothesis that there is no correlation in the population. Use a .02 significance level. Step 1 H0: the correlation in the population is zero. H1:The correlation in the population is not zero. Step 3 The statistic to use follows the t distribution. Step 2 Significance level is .02. Step 4 H0 is rejected if t>3.143 or if t<-3.143 or if p < .02. There are 6 degrees of freedom, found by n – 2 = 8 – 2 = 6. 13- 20 Step 5 Find the value of the test statistic. t = r n- 22 = .657 8 – 2 = 2.135 1- r 1 - .6572 p(t > 2.135) = .077 H0 is not rejected. We cannot reject the hypothesis that there is no correlation in the population. The amount of association could be due to chance. Example 1 continued 13- 21 In Regression Analysis we use the independent variable (X) to estimate the dependent variable (Y). The relationship between the variables is linear. Both variables must be at least interval scale. The least squares criterion is used to determine the equation. That is the term S(Y – Y’)2 is minimized. Regression Analysis The regression equation is Y’= a + bX 13- 22 where Y’ is the average predicted value of Y for any X. a is the Y-intercept. It is the estimated Y value when X=0 b is the slope of the line, or the average change in Y’ for each change of one unit in X The least squares principle is used to obtain a and b. Regression Analysis 13- 23 The least squares principle is used to obtain a and b. The equations to determine a and b are: sy b=r sx a = Y – bX Regression Analysis 13- 24 Develop a regression equation for the information given in example 1 that can be used to estimate the selling price based on the number of pages. sy b=r sx 12.21 = (.657) 138.87 = .0578 a = Y – bX = 79.5 - .0578*625 = 43.39 Example 1 revisited 13- 25 The regression equation is: Y’ = 43.39 + .0578X The slope of the line is .0578. Each addition page costs about a nickel. The sign of the b value and the sign of r will always be the same. The equation crosses the Y-axis at $43.39. A book with no pages would cost $43.39. Example 1 revisited 13- 26 We can use the regression equation to estimate values of Y. The estimated selling price of an 800 page book is $89.61, found by Price = $43.39 + .0578(Number of Pages) = $43.39 + .0578(800) = $89.61 Example 1 revisited 13- 27 The Standard Error of Estimate measures the scatter, or dispersion, of the observed values around the line of regression The formula that is used to compute the standard error: sy x = (Y-Y')2 n-2 The Standard Error of Estimate 13- 28 Find the standard error of estimate for the problem involving the number of pages in a book and the selling price. Deviation Estimated Actual price price Deviation Squared 2 (Y) (Y') (Y-Y') (Y-Y') 84 72.28 11.72 137.41 75 83.83 -8.83 78.03 99 89.61 9.39 88.15 72 78.06 -6.06 36.67 69 66.50 2.50 6.25 81 78.06 2.94 8.67 63 78.06 -15.06 226.67 93 89.61 3.39 11.48 0.00 593.33 sy x = (Y-Y')2 n-2 = 593.33 8-2 = 9.944 Example 1 revisited 13- 29 Assumptions Underlying Linear Regression For each value of X, there is a group of Y values, and these Y values are normally distributed. The Y values are statistically independent. This means that in the selection of a sample, the Y values chosen for a particular X value do not depend on the Y values for any other X values. The means of these normal distributions of Y values all lie on the straight line of regression. The standard deviations of these normal distributions are the same. 13- 30 The confidence interval for the mean value of Y for a given value of X is given by: 1 + (X-X)2 Y' + t(sy x) n (X-X)2 Y’is the predicted value for any selected X value X is an selected value of X X is the mean of the Xs n is the number of observations Sy.x is the standard error of the estimate t is the value of t at n-2 degrees of freedom Confidence Interval 13- 31 For our earlier price estimate of $89.61, the confidence interval, assuming a desired 95% confidence, is calculated as follows. Page - Mean(Page) (Page - Mean(Page))2 -125 15625 75 5625 175 30625 -25 625 -225 50625 -25 625 -25 625 175 30625 135000 Example 1 Y’ the predicted value, is $89.61 13- 32 X is 800 pages X is 625, the mean of the pages n is 8, the number of observations Sy.x is 9.944, the standard error of the estimate t is 2.447 at 8-2 degrees of freedom and 95% confidence 89.61 + 9.944(2.447) 1 (800 - 625)2 + 8 135,000 =89.61 + 14.433 1 + (X-X)2 Y' + t(sy x) n (X-X)2 Example 1 revisited 13- 33 The prediction interval for an individual value of Y for a given value of X Y' + t(sy x) 1 + (X-X)2 1+ n (X-X)2 1 (800 - 625)2 89.61 + 9.944(2.447) 1 + 8 + 135,000 =89.61 + 28.292 Prediction Interval 13- 34 Summarizing The Results The estimated selling price for a book with 800 pages is $89.61. The standard error of estimate is $9.94. The 95 percent confidence interval for all books with 800 pages is $89.61 + $14.43. This means the limits are between $75.18 and $104.04. The 95 percent prediction interval for a particular book with 800 pages is $89.61+ $28.29. The means the limits are between $61.32 and $117.90. These results appear in the following Minitab and Excel outputs. Example 1 revisited 13- 35 Regression Analysis The regression equation is Price = 43.4 + 0.0578 No of Pages Predictor Constant No of Pages S = 9.944 Coef 43.39 0.05778 R-Sq = 43.2% Analysis of Variance Source DF SS Regression 1 450.67 Error 6 593.33 Total 7 1044.00 StDev 17.28 0.02706 T 2.51 2.13 P 0.046 0.077 R-Sq(adj) = 33.7% MS 450.67 98.89 F 4.56 P 0.077 Fit StDev Fit 95.0% CI 95.0% PI 89.61 5.90 ( 75.17, 104.05) ( 61.31, 117.91) Example 1 revisited M I N I T A B 13- 36 Regression Statistics Multiple R 0.657 R Square 0.432 Adjusted R Square 0.337 Standard Error 9.944 Observations 8 ANOVA df SS MS F Significance F Regression 1 450.67 450.67 4.5573034 0.0767 Residual 6 593.33 98.89 Total 7 1044 Intercept Page Coefficients 43.3889 0.0578 Standard Error 17.277 0.027 t Stat 2.511 2.135 E X C E L P-value 0.0458193 0.0767009 Example 1 revisited