Chapter 24 ~ Linear Correlation & Regression Analysis
[Figure: scatter diagram of Waist Size versus Weight]

Chapter Goals
• More detailed look at linear correlation and regression analysis
• Develop a hypothesis test to determine the strength of a linear relationship
• Consider the line of best fit. Use this to make confidence interval estimations.

Linear Correlation Analysis
The coefficient of linear correlation, r, is a measure of the strength of a linear relationship
Consider another measure of dependence: covariance
Recall: bivariate data - ordered pairs of numerical values

Derivation of the Covariance
Goal: a measure of the linear relationship between two variables
Consider the following set of bivariate data:
{(8, 22), (5, 28), (8, 18), (4, 16), (13, 27), (15, 23), (17, 17), (12, 13)}
$\bar{x} = 10.25$, $\bar{y} = 20.50$
Consider a graph of the data:
1. The point $(\bar{x}, \bar{y})$ is the centroid of the data
2. A vertical and horizontal line through the centroid divides the graph into four sections

Graph of the Data with Centroid
[Figure: scatter diagram of the data (y versus x) with a vertical and a horizontal line through the centroid (10.25, 20.5); the distances $(x - \bar{x})$ and $(y - \bar{y})$ are marked]

Notes
1. Each point (x, y) lies a certain distance from each of the two lines
2. $(x - \bar{x})$: the horizontal distance from (x, y) to the vertical line passing through the centroid
3. $(y - \bar{y})$: the vertical distance from (x, y) to the horizontal line passing through the centroid
4. The distances may be positive, negative, or zero
5. Consider the product: $(x - \bar{x})(y - \bar{y})$
   a. If the graph has lots of points to the upper right and lower left of the centroid (positive linear relationship), most products will be positive
   b.
      If the graph has lots of points to the upper left and lower right of the centroid (negative linear relationship), most products will be negative

Covariance of x and y
The covariance of x and y is defined as the sum of the products of the distances of all values x and y from the centroid, divided by n - 1:
$\mathrm{covar}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
Note: $\sum (x - \bar{x}) = 0$ and $\sum (y - \bar{y}) = 0$ always!

Calculations for Finding Covar (x, y)
Points      $x - \bar{x}$   $y - \bar{y}$   $(x - \bar{x})(y - \bar{y})$
(8, 22)        -2.25           1.5             -3.375
(5, 28)        -5.25           7.5            -39.375
(8, 18)        -2.25          -2.5              5.625
(4, 16)        -6.25          -4.5             28.125
(13, 27)        2.75           6.5             17.875
(15, 23)        4.75           2.5             11.875
(17, 17)        6.75          -3.5            -23.625
(12, 13)        1.75          -7.5            -13.125
Total           0.00           0.0            -16.000
$\mathrm{covar}(x, y) = \frac{-16}{7} = -2.2857$

Data & Covariance
Positive covariance:
[Figure: scatter diagram (y versus x) of data with positive covariance; the centroid $(\bar{x}, \bar{y})$ is marked]

Data & Covariance
Negative covariance:
[Figure: scatter diagram (y versus x) of data with negative covariance; the centroid $(\bar{x}, \bar{y})$ is marked]

Data & Covariance
Covariance near 0:
[Figure: scatter diagram (y versus x) of data with covariance near 0; the centroid $(\bar{x}, \bar{y})$ is marked]

Problems
1. The covariance does not have a standardized unit of measure
2. Suppose we multiply each data point in the example in this section by 15. The covariance of the new data set is -514.29
3. The amount of the dependency between x and y seems stronger, but the relationship is really the same
4. We must find a way to eliminate the effect of the spread of the data when we measure the strength of a linear relationship

Solution
1. Standardize x and y: $x' = \frac{x - \bar{x}}{s_x}$ and $y' = \frac{y - \bar{y}}{s_y}$
2. Compute the covariance of x' and y'
3. This covariance is not affected by the spread of the data
4. This is exactly what is accomplished by the coefficient of linear correlation:
   $r = \mathrm{covar}(x', y') = \frac{\mathrm{covar}(x, y)}{s_x s_y}$

Notes
1. The coefficient of linear correlation standardizes the measure of dependency and allows us to compare the relative strengths of dependency of different sets of data
2.
   Also commonly called Pearson’s product moment, r

Calculation of r (for the data presented in this section):
$s_x = 4.71$ and $s_y = 5.37$
$r = \frac{\mathrm{covar}(x, y)}{s_x s_y} = \frac{-2.2857}{(4.71)(5.37)} = -0.0904$

Alternative (Computational) Formula for r
$r = \frac{\mathrm{covar}(x, y)}{s_x s_y} = \frac{\sum (x - \bar{x})(y - \bar{y}) / (n - 1)}{s_x s_y} = \frac{SS(xy)}{\sqrt{SS(x)\,SS(y)}}$
1. This formula avoids the separate calculations of the means, standard deviations, and the deviations from the means
2. This formula is easier and more accurate: it minimizes round-off error

Inferences About the Linear Correlation Coefficient
• Use the calculated value of the coefficient of linear correlation, r*, to make an inference about the population correlation coefficient, $\rho$
• Consider a confidence interval for $\rho$ and a hypothesis test concerning $\rho$

Assumptions
Assumptions for inferences about the linear correlation coefficient: The set of (x, y) ordered pairs forms a random sample and the y-values at each x have a normal distribution. Inferences use the t-distribution with n - 2 degrees of freedom.
Caution: The inferences about the linear correlation coefficient are about the pattern of behavior of the two variables involved and the usefulness of one variable in predicting the other. Significance of the linear correlation coefficient does not mean there is a direct cause-and-effect relationship.

Confidence Interval Procedure
1. A confidence interval may be used to estimate the value of the population correlation coefficient, $\rho$
2. Use a table showing confidence belts
3. Table 10, Appendix B: confidence belts for 95% confidence intervals
4. Table 10 utilizes n, the sample size

Example
Example: A random sample of 25 ordered pairs of data has a calculated value of r = 0.45. Find a 95% confidence interval for $\rho$, the population linear correlation coefficient.
Solution:
1. Population Parameter of Concern
   The linear correlation coefficient for the population, $\rho$
2. The Confidence Interval Criteria
   a.
      Assumptions: The ordered pairs form a random sample, and for each x, the y-values have a mounded distribution
   b. Test statistic: The calculated value of r
   c. Confidence level: $1 - \alpha = 0.95$
3. Sample Evidence
   n = 25 and r = 0.45

Solution Continued
4. The Confidence Interval
   The confidence interval is read from Table 10, Appendix B:
   • Find r = 0.45 at the bottom of Table 10
   • Visualize a vertical line through that point
   • Find the two points where the belts marked for the correct sample size cross the vertical line
   • Draw a horizontal line through each point to the vertical scale on the left and read the confidence interval
   The values are 0.12 and 0.68
5. The Results
   0.12 to 0.68 is the 95% confidence interval for $\rho$

Table 10
[Figure: confidence belts from Table 10. The numbers on the curves are sample sizes. Vertical axis: scale of $\rho$ (population correlation coefficient); horizontal axis: scale of r (sample correlation). The readings 0.12 and 0.68 at r = 0.45 are marked.]

Hypothesis Testing
1. Null hypothesis: the two variables are linearly unrelated, $\rho = 0$
2. Alternative hypothesis: one- or two-tailed, usually $\rho \neq 0$
3. Test statistic: calculated value of r
4. Probability bounds or critical values for r: Table 11, Appendix B
5. Number of degrees of freedom for the r-statistic: n - 2

Example
Example: In a study of 32 randomly selected ordered pairs, r = 0.421. Is there any evidence to suggest the linear correlation coefficient is different from 0 at the 0.05 level of significance?
Solution:
1. The Set-up
   a. Population parameter of concern: The linear correlation coefficient for the population, $\rho$
   b. The null and the alternative hypothesis:
      Ho: $\rho = 0$
      Ha: $\rho \neq 0$

Solution Continued
2. The Hypothesis Test Criteria
   a. Assumptions: The ordered pairs form a random sample and we will assume that the y-values at each x have a mounded distribution
   b. Test statistic: r* (calculated value of r) with df = 32 - 2 = 30
   c. Level of significance: $\alpha = 0.05$
3. The Sample Evidence
   n = 32 and r* = r = 0.421

Solution Continued
4. The Probability Distribution (p-Value Approach)
   a.
      The p-value: Use Table 11: 0.01 < P < 0.02
   b. The p-value is smaller than the level of significance, $\alpha$
~ or ~
4. The Probability Distribution (Classical Approach)
   a. Critical Value: The critical value is found at the intersection of the df = 30 row and the two-tailed 0.05 column of Table 11: 0.349
   b. r* is in the critical region
5. The Results
   a. Decision: Reject Ho
   b. Conclusion: At the 0.05 level of significance, there is evidence to suggest x and y are correlated

Linear Regression Analysis
• Line of best fit results from an analysis of two (or more) related variables
• Try to predict the value of the dependent, or output, variable
• The variable we control is the independent, or input, variable

Method of Least Squares
The line of best fit: $\hat{y} = b_0 + b_1 x$
The slope: $b_1 = \frac{SS(xy)}{SS(x)}$
The y-intercept: $b_0 = \frac{1}{n}\left(\sum y - b_1 \sum x\right)$
Notes:
1. A scatter diagram may suggest curvilinear regression
2. If two or more input variables are used: multiple regression

Linear Model
The Linear Model: $y = \beta_0 + \beta_1 x + \varepsilon$
This equation represents the linear relationship between the two variables in a population
$\beta_0$: The y-intercept, estimated by $b_0$
$\beta_1$: The slope, estimated by $b_1$
$\varepsilon$: Experimental error, estimated by $e = y - \hat{y}$
The random variable e is called the residual
e is the difference between the observed value of y and the predicted value of y at a given x
The sum of the residuals is exactly zero
Mean value of experimental error is zero: $\mu_\varepsilon = 0$
Variance of experimental error: $\sigma_\varepsilon^2$

Estimating the Variance of the Experimental Error
Assumption: The distribution of y’s is approximately normal and the variances of the distributions of y at all values of x are the same (the standard deviation of the distribution of y about $\hat{y}$ is the same for all values of x)
Consider the sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
1. The variance of y involves an additional complication: there is a different mean for y at each value of x
2.
   Each “mean” is actually the predicted value, $\hat{y}$
3. Variance of the error e estimated by: $s_e^2 = \frac{\sum (y - \hat{y})^2}{n - 2}$
   Degrees of freedom: n - 2

Alternative (Computational) Formula for Variance of Experimental Error
Rewriting $s_e^2$:
$s_e^2 = \frac{\sum (y - \hat{y})^2}{n - 2} = \frac{\sum (y - b_0 - b_1 x)^2}{n - 2} = \frac{\sum y^2 - b_0 \sum y - b_1 \sum xy}{n - 2} = \frac{SSE}{n - 2}$
SSE = sum of squares for error

Example
Example: A recent study was conducted to determine the relation between advertising expenditures and sales of statistics texts (for the first year in print). The data are given below (in thousands). Find the line of best fit and the variance of y about the line of best fit.
Adv. Costs (x)   Sales (y)      Adv. Costs (x)   Sales (y)
     40             289              60             470
     55             423              52             408
     35             250              39             320
     50             400              47             415
     43             335              38             389

Solution
$SS(x) = \sum x^2 - \frac{(\sum x)^2}{n} = 21677 - \frac{(459)^2}{10} = 608.9$
$SS(xy) = \sum xy - \frac{\sum x \sum y}{n} = 174163 - \frac{(459)(3699)}{10} = 4378.9$
$b_1 = \frac{SS(xy)}{SS(x)} = \frac{4378.9}{608.9} = 7.1915$
$b_0 = \frac{\sum y - b_1 \sum x}{n} = \frac{3699 - (7.1915)(459)}{10} = 39.8105$

Solution Continued
• The equation for the line of best fit: $\hat{y} = 39.81 + 7.19x$
• The variance of y about the regression line:
  $s_e^2 = \frac{\sum y^2 - b_0 \sum y - b_1 \sum xy}{n - 2} = \frac{(1410485) - (39.81)(3699) - (7.1915)(174163)}{8} = \frac{10734.5955}{8} = 1341.8244$
Note: Extra decimal places are often needed for this type of calculation

Illustration
• Scatter diagram, regression line, and random errors as line segments:
[Figure: scatter diagram of Sales versus Advertising Costs with the regression line; the random errors are drawn as vertical line segments]

Minitab Output
Regression Analysis

The regression equation is
C2 = 39.8 + 7.19 C1

Predictor      Coef    StDev       T       P
Constant      39.81    69.11    0.58   0.580
C1            7.191    1.484    4.84   0.001

S = 36.63     R-Sq = 74.6%     R-Sq(adj) = 71.4%

Analysis of Variance
Source            DF      SS      MS       F       P
Regression         1   31491   31491   23.47   0.001
Residual Error     8   10734    1342
Total              9   42225

Inferences Concerning the Slope of the Regression Line
• Confidence Interval for $\beta_1$: $1 - \alpha$ confidence interval estimate for the population slope of the line of best fit
• Hypothesis Test for $\beta_1$: Tests the null hypothesis, $\beta_1 = 0$, the slope of the line of best fit is equal to 0, that is, the line
  is of no use in predicting y for a given value of x

Sampling Distribution of the Slope $b_1$
Assume: Random samples of size n are repeatedly taken from a bivariate population
1. $b_1$ has a sampling distribution that is approximately normal
2. The mean of $b_1$ is $\beta_1$
3. The variance of $b_1$ is: $\sigma_{b_1}^2 = \frac{\sigma_\varepsilon^2}{\sum (x - \bar{x})^2}$, provided there is no lack of fit

Standard Error of Regression
Estimator for $\sigma_{b_1}^2$:
$s_{b_1}^2 = \frac{s_e^2}{\sum (x - \bar{x})^2} = \frac{s_e^2}{\sum x^2 - \frac{(\sum x)^2}{n}} = \frac{s_e^2}{SS(x)}$
The standard error of regression (slope) is $\sigma_{b_1}$ and is estimated by $s_{b_1}$
Example (continued): For the advertising costs and sales data:
$s_{b_1}^2 = \frac{s_e^2}{SS(x)} = \frac{1341.8244}{608.9} = 2.2037$

Inferences About Slope Continued
Assumptions for inferences about the slope parameter $\beta_1$: The set of (x, y) ordered pairs forms a random sample and the y-values at each x have a normal distribution. Since the population standard deviation is unknown and replaced with the sample standard deviation, the t-distribution will be used with n - 2 degrees of freedom.
Confidence Interval Procedure: The $1 - \alpha$ confidence interval for $\beta_1$ is given by
$b_1 \pm t(n - 2, \alpha/2) \cdot s_{b_1}$

Example
Example: Find the 95% confidence interval for the population slope $\beta_1$ for the advertising costs and sales example
Solution:
1. Population Parameter of Interest
   The slope, $\beta_1$, for the line of best fit for the population
2. The Confidence Interval Criteria
   a. Assumptions: The ordered pairs form a random sample and we will assume the y-values (sales) at each x (advertising costs) have a mounded distribution
   b. Test statistic: t with df = 10 - 2 = 8
   c. Confidence level: $1 - \alpha = 0.95$
3. Sample Evidence
   Sample information: $n = 10$, $b_1 = 7.1915$, $s_{b_1}^2 = 2.2037$

Solution Continued
4. The Confidence Interval
   a. Confidence coefficients: $t(df, \alpha/2) = t(8, 0.025) = 2.31$
   b. Interval: $b_1 \pm t(n - 2, \alpha/2) \cdot s_{b_1}$
      $7.1915 \pm (2.31)\sqrt{2.2037}$
      $7.1915 \pm (2.31)(1.4845)$
      $7.1915 \pm 3.4292$
      (3.762, 10.621)
5. The Results
   The slope of the line of best fit of the population from which the sample was drawn is between 3.762 and 10.621 with 95% confidence

Hypothesis-Testing Procedure
1. Null hypothesis is always Ho: $\beta_1 = 0$
2. Use the Student's t distribution with df = n - 2
3. The test statistic: $t^* = \frac{b_1 - \beta_1}{s_{b_1}}$

Example
Example: In the previous example, is the slope for the line of best fit significant enough to show that advertising cost is useful in predicting the first year sales? Use $\alpha = 0.05$
Solution:
1. The Set-up
   a. Population parameter of concern: The parameter of concern is $\beta_1$, the slope of the line of best fit for the population
   b. The null and alternative hypothesis:
      Ho: $\beta_1 = 0$ (x is of no use in predicting y)
      Ha: $\beta_1 > 0$ (we expect sales to increase as costs increase)

Solution Continued
2. The Hypothesis Test Criteria
   a. Assumptions: The ordered pairs form a random sample and we will assume the y-values (sales) at each x (advertising costs) have a mounded distribution
   b. Test statistic: t* with df = n - 2 = 8
   c. Level of significance: $\alpha = 0.05$
3. The Sample Evidence
   a. Sample information: $n = 10$, $b_1 = 7.1915$, $s_{b_1}^2 = 2.2037$
   b. Calculate the value of the test statistic:
      $t^* = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{7.1915 - 0.0}{\sqrt{2.2037}} = 4.8444$

Solution Continued
4. The Probability Distribution (p-Value Approach)
   a. The p-value: P = P(t* > 4.8444, with df = 8) < 0.001
   b. The p-value is smaller than the level of significance, $\alpha$
~ or ~
4. The Probability Distribution (Classical Approach)
   a. Critical value: t(8, 0.05) = 1.86
   b. t* is in the critical region
5. The Results
   a. Decision: Reject Ho
   b. Conclusion: At the 0.05 level of significance, there is evidence to suggest the slope of the line of best fit is greater than zero. The evidence indicates there is a linear relationship and that advertising cost (x) is useful in predicting the first year sales (y).
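
The least-squares fit and the slope test for the advertising example can be retraced numerically. The following is a minimal Python sketch that is not part of the original slides: the names `fit_line`, `error_variance`, and `T_CRIT` are illustrative assumptions, and the critical value 1.86 is simply the table value t(8, 0.05) quoted in the example, not something the code computes. The sketch obtains SSE directly from the residuals instead of the computational formula, so its results should agree with the hand calculation only up to rounding.

```python
import math

# Advertising costs (x) and first-year sales (y), in thousands, from the example.
x = [40, 55, 35, 50, 43, 60, 52, 39, 47, 38]
y = [289, 423, 250, 400, 335, 470, 408, 320, 415, 389]


def fit_line(x, y):
    """Return (b0, b1, ss_x) for the least-squares line y-hat = b0 + b1*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_x = sum(v * v for v in x) - sum_x ** 2 / n                    # SS(x)
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n     # SS(xy)
    b1 = ss_xy / ss_x
    b0 = (sum_y - b1 * sum_x) / n
    return b0, b1, ss_x


def error_variance(x, y, b0, b1):
    """Return s_e^2 = SSE / (n - 2), with SSE summed from the residuals."""
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return sse / (len(x) - 2)


b0, b1, ss_x = fit_line(x, y)
se2 = error_variance(x, y, b0, b1)
sb1 = math.sqrt(se2 / ss_x)      # estimated standard error of the slope
t_star = (b1 - 0.0) / sb1        # test statistic under Ho: beta_1 = 0

T_CRIT = 1.86                    # t(8, 0.05), table value quoted in the example

print(f"y-hat = {b0:.2f} + {b1:.4f} x")
print(f"s_e^2 = {se2:.2f}, s_b1 = {sb1:.4f}, t* = {t_star:.2f}")
print("reject Ho" if t_star > T_CRIT else "fail to reject Ho")
```

Keeping full precision in $b_0$ and $b_1$ is the reason the computed $s_e^2$ differs slightly from the hand-rounded 1341.8244, which is the point of the note about extra decimal places.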

Confidence Interval Estimates for Regression
• Use the line of best fit to make predictions
• Predict the population mean y-value at a given x
• Predict the individual y-value selected at random that will occur at a given value of x
• The best point estimate, or prediction, for both is $\hat{y}$

Notation & Background
Notation:
1. Mean of the population y-values at a given value of x: $\mu_{y|x_0}$
2. The individual y-value selected at random for a given value of x: $y_{x_0}$
Background:
1. Recall: the development of confidence intervals for the population mean $\mu$ when the variance was known and when the variance was estimated
2. The confidence interval for $\mu_{y|x_0}$ and the prediction interval for $y_{x_0}$ are constructed in a similar fashion
3. $\hat{y}$ replaces $\bar{x}$ as the point estimate
4. The sampling distribution of $\hat{y}$ is normal

Background Continued
5. The standard deviation in both cases is computed by multiplying the square root of the variance of the error by an appropriate correction factor
6. The line of best fit passes through the centroid: $(\bar{x}, \bar{y})$
Consider a confidence interval for the slope $\beta_1$
If we draw lines with slopes equal to the extremes of that confidence interval through the centroid, the value for y fluctuates considerably for different values of x (See the Figure on the next slide.)
It is reasonable to expect a wider confidence interval as we consider values of x further from $\bar{x}$
We need a correction factor to adjust for the distance between $x_0$ and $\bar{x}$
This factor must also adjust for the variation of the y-values about $\hat{y}$

Confidence Interval for Slope
[Figure: scatter diagram of Sales versus Advertising Costs with two lines drawn through the centroid $(\bar{x}, \bar{y})$, one with slope equal to each extreme of the confidence interval for the slope]

Confidence Interval
Confidence interval for the mean value of y at a given value of x, $\mu_{y|x_0}$:
$\hat{y} \pm t(n - 2, \alpha/2) \cdot s_e \cdot \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x - \bar{x})^2}}$
$\hat{y} \pm t(n - 2, \alpha/2) \cdot s_e \cdot \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS(x)}}$
The factor $s_e \sqrt{\cdot}$ is the standard error of $\hat{y}$
Notes:
1.
   The numerator of the second term under the radical sign is the square of the distance of $x_0$ from $\bar{x}$
2. The denominator is closely related to the variance of x and has a standardizing effect on this term

Example
Example: It is believed that the amount of nitrogen fertilizer used per acre has a direct effect on the amount of wheat produced. The data below show the amount of nitrogen fertilizer used per test plot and the amount of wheat harvested per test plot.
a. Find the line of best fit
b. Construct a 95% confidence interval for the mean amount of wheat harvested for 45 pounds of fertilizer
Pounds of Fertilizer (x)     30   36   41   49   53   55   60   65
100 Pounds of Wheat (y)      14    9   18   16   23   17   28   33

Pounds of Fertilizer (x)     74   76   81   88   93   94  101  109
100 Pounds of Wheat (y)      20   24   29   35   34   39   28   33

Solution
• Using Minitab, the line of best fit: $\hat{y} = 4.42 + 0.298x$
Confidence Interval:
1. Population Parameter of Interest
   The mean amount of wheat produced for 45 pounds of fertilizer, $\mu_{y|x=45}$
2. The Confidence Interval Criteria
   a. Assumptions: The ordered pairs form a random sample and the y-values at each x have a mounded distribution
   b. Test statistic: t with df = 16 - 2 = 14
   c. Confidence level: $1 - \alpha = 0.95$
3. Sample Information:
   $s_e^2 = 25.97$, $s_e = \sqrt{25.97} = 5.096$
   $\hat{y}$ at $x = 45$: $\hat{y} = 4.42 + 0.298(45) = 17.83$

Solution Continued
4.
   The Confidence Interval:
   $\hat{y} \pm t(n - 2, \alpha/2) \cdot s_e \cdot \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS(x)}}$
   $17.83 \pm (2.14)(5.096)\sqrt{\frac{1}{16} + \frac{(45 - 69.06)^2}{8746.94}}$
   $17.83 \pm (2.14)(5.096)\sqrt{0.0625 + 0.0662}$
   $17.83 \pm (2.14)(5.096)(0.3587)$
   $17.83 \pm 3.91$
   13.92 to 21.74, 95% confidence interval for $\mu_{y|x=45}$

Confidence Belts for $\mu_{y|x_0}$
• Confidence interval: green vertical line
• Confidence interval belt: upper and lower boundaries of all 95% confidence intervals
[Figure: Wheat versus Fertilizer, showing the line of best fit with the upper and lower boundaries for $\mu_{y|x_0}$]

Prediction Interval
Prediction interval of the value of a single randomly selected y:
$\hat{y} \pm t(n - 2, \alpha/2) \cdot s_e \cdot \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS(x)}}$
Example: Find the 95% prediction interval for the amount of wheat harvested for 45 pounds of fertilizer
Solution:
1. Population Parameter of Interest
   $y_{x=45}$, the amount of wheat harvested for 45 pounds of fertilizer

Solution Continued
2. The Confidence Interval Criteria
   a. Assumptions: The ordered pairs form a random sample and the y-values at each x have a mounded distribution
   b. Test statistic: t with df = 16 - 2 = 14
   c. Confidence level: $1 - \alpha = 0.95$
3. Sample Information
   $s_e^2 = 25.97$, $s_e = \sqrt{25.97} = 5.096$
   $\hat{y}$ at $x = 45$: $\hat{y} = 4.42 + 0.298(45) = 17.83$

Solution Continued
4. The Prediction Interval
   $\hat{y} \pm t(n - 2, \alpha/2) \cdot s_e \cdot \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS(x)}}$
   $17.83 \pm (2.14)(5.096)\sqrt{1 + \frac{1}{16} + \frac{(45 - 69.06)^2}{8746.94}}$
   $17.83 \pm (2.14)(5.096)\sqrt{1 + 0.0625 + 0.0662}$
   $17.83 \pm (2.14)(5.096)\sqrt{1.1287}$
   $17.83 \pm (2.14)(5.096)(1.0624)$
   $17.83 \pm 11.5859$
   6.24 to 29.42, 95% prediction interval for $y_{x=45}$

Prediction Belts for $y_{x_0}$
[Figure: Wheat versus Fertilizer, showing the line of best fit with the upper and lower boundaries of the 95% prediction interval on individual y-values at any x; $x_0 = 45$ is marked]

Precautions
1. The regression equation is meaningful only in the domain of the x variable studied. Estimation outside this domain is risky; it assumes the relationship between x and y is the same outside the domain of the sample data.
2.
   The results of one sample should not be used to make inferences about a population other than the one from which the sample was drawn
3. Correlation (or association) does not imply causation. A significant regression does not imply x causes y to change. Most common problem: missing, or third, variable effect.

13.6 ~ Understanding the Relationship Between Correlation & Regression
• We have considered correlation and regression analysis
• When do we use these techniques?
• Is there any duplication of work?

Remarks
1. The primary use of the linear correlation coefficient is in answering the question “Are these two variables related?”
2. The linear correlation coefficient may be used to indicate the usefulness of x as a predictor of y (if the linear model is appropriate). The test concerning the slope of the regression line (Ho: $\beta_1 = 0$) tests the same basic concept
3. Lack-of-fit test: Is the linear model appropriate? Consider the scatter diagram

Conclusions
1. Linear correlation and regression measure different characteristics. It is possible to have a strong linear correlation and have the wrong model.
2. Regression analysis should be used to answer questions about the relationship between two variables:
   a. What is the relationship?
   b. How are the two variables related?
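
The confidence interval for the mean and the prediction interval for an individual value in the fertilizer example differ only by the extra 1 under the radical. The following minimal Python sketch, which is not part of the original slides, makes that explicit. The function name `interval_at` and its `individual` flag are illustrative assumptions, and the critical value 2.14 is the table value t(14, 0.025) used in the example, not something the code derives.

```python
import math

# Fertilizer (pounds) and wheat (hundreds of pounds) per test plot, from the example.
fertilizer = [30, 36, 41, 49, 53, 55, 60, 65, 74, 76, 81, 88, 93, 94, 101, 109]
wheat = [14, 9, 18, 16, 23, 17, 28, 33, 20, 24, 29, 35, 34, 39, 28, 33]

T_CRIT = 2.14  # t(14, 0.025), table value used in the example


def interval_at(x, y, x0, t_crit, individual=False):
    """Interval for y at x0 from a least-squares line.

    individual=False: confidence interval for the mean of y at x0.
    individual=True:  prediction interval for a single y at x0.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((v - x_bar) ** 2 for v in x)
    ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    # s_e: square root of SSE / (n - 2), with SSE summed from the residuals.
    s_e = math.sqrt(sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y)) / (n - 2))
    y_hat = b0 + b1 * x0
    extra = 1.0 if individual else 0.0  # the additional 1 for a single y-value
    margin = t_crit * s_e * math.sqrt(extra + 1 / n + (x0 - x_bar) ** 2 / ss_x)
    return y_hat - margin, y_hat + margin


mean_lo, mean_hi = interval_at(fertilizer, wheat, 45, T_CRIT)
pred_lo, pred_hi = interval_at(fertilizer, wheat, 45, T_CRIT, individual=True)

print(f"95% interval for the mean at x0 = 45:  {mean_lo:.2f} to {mean_hi:.2f}")
print(f"95% interval for a single y at x0 = 45: {pred_lo:.2f} to {pred_hi:.2f}")
```

Both intervals should land close to the hand-computed ones in the example. Small differences in the last digit are expected because the code does not round $b_0$, $b_1$, or $\bar{x}$ along the way.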