Regression

Idea behind regression: we have a scatter of points, and we want to find the line that best fits that scatter. For example, we might want to know the relationship between exam score and hours studied, wheat yield and fertilizer usage, job performance and job training, or sales revenue and advertising expenditure.

Imagine that there is a true relationship behind the variables in which we are interested. That relationship is known perhaps to some supreme being. However, we are mere mortals, and the best we can do is to estimate that relationship based on a sample of observations. Perhaps the supreme being feels that the world would be too boring if a particular number of hours studied were always associated with the same exam score, a particular amount of job training always led to the same job performance, etc. So the supreme being tosses in a random error. The equation of the true relationship is then:

  Y_i = α + β X_i + ε_i

The subscript i indicates which observation (which point) we are considering. X_i is the value of the independent variable for observation i, Y_i is the value of the dependent variable, α is the true intercept, β is the true slope, and ε_i is the random error.

Our estimated equation is:

  Y_i = a + b X_i + e_i

a is our estimated intercept, b is our estimated slope, and e_i is the estimation error.

Let's look at our regression line and one particular observation. The estimated equation of the line gives the predicted value of the dependent variable,

  Ŷ_i = a + b X_i

and the estimation error

  e_i = Y_i − Ŷ_i

is the gap between the observed value Y_i and the predicted value Ŷ_i of the dependent variable, at the observed value X_i of the independent variable.

Fitting a scatter of points with a line by eye is too subjective. We need a more rigorous method. We will consider three possible criteria.

Criterion 1: minimize the sum of the vertical errors,

  Σ_{i=1}^{n} e_i = Σ_{i=1}^{n} (Y_i − Ŷ_i)
Problem: the best fit by this criterion may not be very good. For points below the estimated regression line, the error e_i is negative, so positive and negative errors cancel each other out. The points could be far from the line, yet the sum of vertical errors could still be small.

Criterion 2: minimize the sum of the absolute values of the vertical errors,

  Σ_{i=1}^{n} |e_i| = Σ_{i=1}^{n} |Y_i − Ŷ_i|

This avoids our previous problem of positive and negative errors canceling each other out. However, the absolute value function is not differentiable, so using calculus to minimize will not work.

Criterion 3: minimize the sum of the squares of the vertical errors,

  Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (Y_i − Ŷ_i)²

This also avoids the problem of positive and negative errors canceling each other out. In addition, the square function is differentiable, so using calculus to minimize will work. Minimizing the sum of the squared errors is the criterion that we will be using. The technique is called least squares, or ordinary least squares (OLS).

Using calculus, it can be shown that the values of a and b that give the line with the best fit can be calculated as:

  slope:     b = [ Σ X_i Y_i − (1/n)(Σ X_i)(Σ Y_i) ] / [ Σ X_i² − (1/n)(Σ X_i)² ]
  intercept: a = Ȳ − b X̄

Sometimes we omit the subscripts, since they are understood, and it's less cumbersome without them. Then the equations are:

  slope:     b = [ ΣXY − (1/n)(ΣX)(ΣY) ] / [ ΣX² − (1/n)(ΣX)² ]
  intercept: a = Ȳ − b X̄

Another equivalent formula for b that is sometimes used is:

  slope:     b = [ ΣXY − n X̄ Ȳ ] / [ ΣX² − n X̄² ]

You may use either formula for b in this class.

Example: determine the least squares regression line for Y = wheat yield and X = fertilizer, using the following data.
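The least-squares formulas above can be sketched directly in Python. This is a minimal illustration using only the standard library; the function name ols_fit is mine, not from the text:

```python
def ols_fit(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x.

    Uses the computational formulas from the text:
      b = (Sxy - Sx*Sy/n) / (Sxx - Sx**2/n),   a = ybar - b*xbar
    """
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (sxy - sx * sy / n) / (sxx - sx ** 2 / n)
    a = sy / n - b * sx / n
    return a, b
```

Applied to the wheat data that follows, this returns a ≈ 36.4 and b ≈ 0.059, matching the worked example.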
We need the sums of the X's, the Y's, the XY's, and the X²'s, plus the means of X and Y.

    X     Y      XY         X²
   100    40    4,000       10,000
   200    50   10,000       40,000
   300    50   15,000       90,000
   400    70   28,000      160,000
   500    65   32,500      250,000
   600    65   39,000      360,000
   700    80   56,000      490,000
  2800   420  184,500    1,400,000

  X̄ = 2800/7 = 400,   Ȳ = 420/7 = 60

Next, we calculate the estimated slope b:

  b = [ΣXY − (1/n)(ΣX)(ΣY)] / [ΣX² − (1/n)(ΣX)²]
    = [184,500 − (1/7)(2800)(420)] / [1,400,000 − (1/7)(2800)²]
    = (184,500 − 168,000) / (1,400,000 − 1,120,000)
    = 16,500 / 280,000 ≈ 0.059

Then we calculate the estimated intercept a:

  a = Ȳ − b X̄ = 60 − (0.059)(400) = 36.4

So our estimated regression line is

  Ŷ = 36.4 + 0.059 X

Given certain assumptions, the OLS estimators can be shown to have certain desirable properties. The assumptions are:
• The Y values are independent of each other.
• The conditional distributions of Y given X are normal.
• The conditional standard deviations of Y given X are equal for all values of X.

Gauss-Markov Theorem: if the previous assumptions hold, then the OLS estimators a, b, and Ŷ of α, β, and μ_Y·X are best, linear, unbiased estimators (BLUE).

Linear means that the estimators are linear functions of the observed Y values. (There are no Y²'s or square roots of Y, etc.) Unbiased means that the expected values of the estimators are equal to the parameters you are trying to estimate.
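The column sums and the slope/intercept arithmetic above can be checked with a short script. A sketch; note that before rounding, b = 16,500/280,000 ≈ 0.0589 and a ≈ 36.43:

```python
X = [100, 200, 300, 400, 500, 600, 700]   # fertilizer
Y = [40, 50, 50, 70, 65, 65, 80]          # wheat yield
n = len(X)
sum_x, sum_y = sum(X), sum(Y)                       # 2800, 420
sum_xy = sum(x * y for x, y in zip(X, Y))           # 184,500
sum_x2 = sum(x * x for x in X)                      # 1,400,000
b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)  # 16,500/280,000
a = sum_y / n - b * sum_x / n                       # ybar - b*xbar
print(round(b, 4), round(a, 1))  # 0.0589 36.4
```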
Best means that the estimator has the lowest variance of any linear unbiased estimator of the parameter.

Let's look at our wheat example using our graph, at the fertilizer amount X_i = 700. The average of all Y values is Ȳ = 60. The observed value of Y corresponding to X = 700 is Y = 80. The predicted value of Y corresponding to X = 700 is

  Ŷ = 36.4 + (0.059)(700) = 77.7

The difference between the predicted value of Y and the average value, Ŷ_i − Ȳ, is called the explained deviation. The difference between the observed value of Y and the predicted value, Y_i − Ŷ_i, is the unexplained deviation. The difference between the observed value of Y and the average value, Y_i − Ȳ, is the total deviation.

If we sum the squares of those deviations, we get:

  SST = sum of squares total      = Σ(Y_i − Ȳ)²   (from the total deviations)
  SSR = sum of squares regression = Σ(Ŷ_i − Ȳ)²   (from the explained deviations)
  SSE = sum of squares error      = Σ(Y_i − Ŷ_i)² (from the unexplained deviations)

It can be shown that SST = SSR + SSE.

The sums of squares are often reported in a regression ANOVA table:

  Source of variation   Sum of squares          Degrees of freedom   Mean square
  Regression            SSR = Σ(Ŷ_i − Ȳ)²       1                    MSR = SSR/1
  Error                 SSE = Σ(Y_i − Ŷ_i)²     n − 2                MSE = SSE/(n−2)
  Total                 SST = Σ(Y_i − Ȳ)²       n − 1                MST = SST/(n−1)

There are two measures of how well our regression line fits our data. The first measure is the standard error of the estimate, or the standard error of the regression, s_e or SER. The s_e or SER tells you the typical error of fit, or how far the observed value of Y is from the predicted value of Y. The second measure of "goodness of fit" is the coefficient of determination, or R². The R² tells you the proportion of the total variation in the dependent variable that is explained by the regression on the independent variable (or variables).
Standard error of the estimate, or standard error of the regression:

  s_e = SER = √( SSE / (n−2) ) = √( Σ(Y_i − Ŷ_i)² / (n−2) ) = √( Σe_i² / (n−2) )
      = √( [ΣY² − aΣY − bΣXY] / (n−2) )

There is a 2 in the denominator because we estimated 2 parameters, the intercept a and the slope b. Later, we'll have more parameters and this will change.

Coefficient of determination, or R²:

  R² = SSR/SST = explained variation / total variation = Σ(Ŷ_i − Ȳ)² / Σ(Y_i − Ȳ)²
     = 1 − SSE/SST = [aΣY + bΣXY − nȲ²] / [ΣY² − nȲ²]

  0 ≤ R² ≤ 1

If the line fits the scatter of points perfectly, the points are all on the regression line and R² = 1. If the line doesn't fit at all and the scatter is just a jumble of points, then R² = 0.

Let's return to our data and calculate s_e (SER) and R². First, let's add a column for Y²:

    X     Y      XY         X²        Y²
   100    40    4,000       10,000    1,600
   200    50   10,000       40,000    2,500
   300    50   15,000       90,000    2,500
   400    70   28,000      160,000    4,900
   500    65   32,500      250,000    4,225
   600    65   39,000      360,000    4,225
   700    80   56,000      490,000    6,400
  2800   420  184,500    1,400,000   26,350

  X̄ = 400,   Ȳ = 60

Remember that a = 36.4 and b = 0.059. Then

  s_e = SER = √( [ΣY² − aΣY − bΣXY] / (n−2) )
            = √( [26,350 − 36.4(420) − 0.059(184,500)] / (7−2) )
            ≈ 5.94

Again, a = 36.4 and b = 0.059.
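The SER shortcut formula can be verified numerically. This sketch plugs in the text's rounded estimates a = 36.4 and b = 0.059, which is why it reproduces 5.94; with unrounded a and b the value is about 5.96:

```python
import math

a, b, n = 36.4, 0.059, 7          # rounded estimates from the text
sum_y, sum_xy, sum_y2 = 420, 184_500, 26_350
sse = sum_y2 - a * sum_y - b * sum_xy   # 26,350 - 15,288 - 10,885.5 = 176.5
ser = math.sqrt(sse / (n - 2))          # sqrt(176.5 / 5) ≈ 5.94
```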
  R² = [aΣY + bΣXY − nȲ²] / [ΣY² − nȲ²]
     = [36.4(420) + 0.059(184,500) − 7(60)²] / [26,350 − 7(60)²]
     = 973.5 / 1150 ≈ 0.846

So about 85% of the variation in wheat yield is explained by the regression on fertilizer.

SSR, SSE, and SST for the wheat example: above we found that R² = SSR/SST = 973.5/1150 = 0.846, so SSR = 973.5 and SST = 1150. The sum of squares error, SSE, is the difference

  SSE = SST − SSR = 1150 − 973.5 = 176.5

What is the square root of R²? It is the sample correlation coefficient, usually denoted by lowercase r:

  if b > 0, then r = +√R²;   if b < 0, then r = −√R²

If you don't already have R² calculated, the sample correlation coefficient r can also be calculated from this formula:

  r = [ΣXY − (1/n)(ΣX)(ΣY)] / √( [ΣX² − (1/n)(ΣX)²] · [ΣY² − (1/n)(ΣY)²] )

For example, in our wheat problem:

  r = [184,500 − (1/7)(2800)(420)] / √( [1,400,000 − (1/7)(2800)²] · [26,350 − (1/7)(420)²] )
    ≈ 0.92

Also, r = √R² = √0.846 ≈ 0.92.

The sample correlation coefficient r is often used to estimate the population correlation coefficient ρ (rho):

  ρ = Cov(X, Y) / (σ_X σ_Y),   where Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]

is the covariance of X and Y, and σ_X and σ_Y are the standard deviations of X and Y respectively. The correlation coefficient (and the covariance) tell how the variables move with each other, and −1 ≤ ρ ≤ 1:

  ρ = 1:  there is a perfect positive linear relation.
  ρ = −1: there is a perfect negative linear relation.
  ρ = 0:  there is no linear relation.
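R², the sums of squares, and r can be checked the same way. Again this uses the rounded a and b so the figures match the text:

```python
import math

a, b, n, ybar = 36.4, 0.059, 7, 60
sum_y, sum_xy, sum_y2 = 420, 184_500, 26_350
ssr = a * sum_y + b * sum_xy - n * ybar ** 2   # 973.5
sst = sum_y2 - n * ybar ** 2                   # 1,150
sse = sst - ssr                                # 176.5
r2 = ssr / sst                                 # ≈ 0.846
r = math.sqrt(r2)                              # ≈ 0.92, positive because b > 0
```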
Correlation coefficient graphs

(Graphs: scatters of points illustrating ρ = 1, ρ = −1, ρ = 0.5, ρ = 0.8, and ρ = 0.)

R² adjusted, or corrected, for degrees of freedom:

  R²_c = 1 − [ Σ(Y_i − Ŷ_i)² / (n−2) ] / [ Σ(Y_i − Ȳ)² / (n−1) ]
or
  R²_c = 1 − (1 − R²)(n−1)/(n−2)

It is possible to compare specifications that would otherwise not be comparable by using the adjusted R². The "2" is because we are estimating 2 parameters, α and β. This will change when we are estimating more parameters.

Adjusted R² for the wheat example:

  R²_c = 1 − (1 − R²)(n−1)/(n−2) = 1 − (1 − 0.846)(7−1)/(7−2) ≈ 0.815

Test on the correlation coefficient: H0: ρ = 0 versus H1: ρ ≠ 0, with statistic

  t_{n−2} = r / √( (1 − r²)/(n−2) )

Test at the 5% level H0: ρ = 0 versus H1: ρ ≠ 0 for the wheat example. Recall that r = 0.92 and n = 7.

  t_5 = 0.92 / √( (1 − 0.92²)/(7−2) ) ≈ 5.25

From our t table, we see that for 5 dof and a 2-tailed critical region, our cut-off points are −2.571 and 2.571 (area 0.025 in each tail). Since our t value of 5.25 is in the critical region, we reject H0 and accept H1 that the population correlation is not zero.

If our regression line slope estimate b is close to zero, that would indicate that the true slope β might be zero. To test whether β equals zero, we need to know the distribution of b. If ε is normally distributed with mean 0 and standard deviation σ, then b is normally distributed with mean β and standard deviation (standard error)

  σ_b = σ / √( ΣX² − (1/n)(ΣX)² )

Then

  Z = (b − β) / σ_b

is a standard normal variable. Since we usually don't know σ, we estimate it using SER = s_e, and use a t_{n−2} instead of the Z. So for our test statistic, we have

  t_{n−2} = (b − β) / s_b,   where s_b = SER / √( ΣX² − (1/n)(ΣX)² )

For the wheat example, test at the 5% level H0: β = 0 vs. H1: β ≠ 0. Recall: b = 0.059, n = 7, SER = 5.94, ΣX = 2800, ΣX² = 1,400,000. From our t table, we see that for 5 dof and a 2-tailed critical region, our cut-off points are −2.571 and 2.571.
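Both test statistics can be computed in a few lines. A sketch: the text's 5.27 for the slope test comes from rounding s_b to 0.0112, so the unrounded value here is about 5.26:

```python
import math

n, r = 7, 0.92
t_rho = r / math.sqrt((1 - r ** 2) / (n - 2))      # ≈ 5.25

b, ser = 0.059, 5.94
sum_x, sum_x2 = 2800, 1_400_000
s_b = ser / math.sqrt(sum_x2 - sum_x ** 2 / n)     # ≈ 0.0112
t_beta = (b - 0) / s_b                             # ≈ 5.26

t_crit = 2.571   # two-tailed 5% cut-off for 5 dof, from the t table
```

Both statistics exceed 2.571, so both null hypotheses are rejected at the 5% level.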
  t_5 = (0.059 − 0) / [ 5.94 / √(1,400,000 − (1/7)(2800)²) ] = 0.059 / 0.0112 ≈ 5.27

Since our t value of 5.27 is in the critical region, we reject H0 and accept H1 that the slope is not zero.

Notice that the value of the statistic we calculated when testing H0: ρ = 0 vs. H1: ρ ≠ 0 was 5.25, which is very close to the value of 5.27 that we found for the statistic when testing H0: β = 0 vs. H1: β ≠ 0. This is not a coincidence. When dealing with a regression with a single X variable on the right side of the equation, testing whether there is a linear correlation between the 2 variables (ρ = 0) and testing whether the slope is zero (β = 0) are equivalent. Our values differ only because of rounding error.

We can do an ANOVA test based on the amount of variation in the dependent variable Y that is explained by the regression. This is referred to as testing the significance of the regression.

  H0: there is no linear relationship between X and Y (this is the same thing as β equals zero).
  H1: there is a linear relationship between X and Y (this is the same thing as β is not zero).

The statistic is

  F_{1, n−2} = MSR/MSE = (SSR/1) / (SSE/(n−2))

Example: test the significance of the regression in the wheat problem at the 5% level. Recall SSR = 973.5 and SSE = 176.5.

  F_{1,5} = (973.5/1) / (176.5/5) ≈ 27.58

The F table shows that for 1 and 5 degrees of freedom, the 5% critical value is 6.61. Since our F has a value of 27.58, which lies in the critical region, we reject H0 (no linear relation) and accept H1: there is a linear relation between wheat yield and fertilizer.

For a regression with just one independent variable X on the right side of the equation, testing the significance of the regression is equivalent to testing whether the slope is zero. Therefore, you might expect there to be a relationship between the statistics used for these tests, and there is one.
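The F test above, sketched in Python (the critical value 6.61 is taken from the F table, as in the text):

```python
ssr, sse, n = 973.5, 176.5, 7
msr = ssr / 1                  # mean square regression
mse = sse / (n - 2)            # mean square error = 35.3
f_stat = msr / mse             # ≈ 27.58
f_crit = 6.61                  # 5% critical value for F(1, 5), from the table
reject_h0 = f_stat > f_crit    # True: the regression is significant
```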
The F-statistic for this test is the square of the t-statistic for the test on β. In our wheat example, the t-statistic for the test on β was 5.27 and the critical value or cut-off point was 2.571. For the F-test, the statistic was 27.58 ≈ (5.27)² and the critical value or cut-off point was 6.61 ≈ (2.571)². (The numbers don't match exactly because of rounding error.)

We can also calculate confidence intervals for the slope β:

  b − t_{n−2} s_b ≤ β ≤ b + t_{n−2} s_b

Calculate a 95% confidence interval for the slope for the wheat example. Recall that b = 0.059, n = 7, and s_b = 0.0112. We also found the critical values for a 2-tailed t with 5 dof are 2.571 and −2.571.

  0.059 − 2.571(0.0112) ≤ β ≤ 0.059 + 2.571(0.0112)
  0.059 − 0.028 ≤ β ≤ 0.059 + 0.028
  0.031 ≤ β ≤ 0.087

Our 95% confidence interval 0.031 ≤ β ≤ 0.087 means that we are 95% sure that the true slope of the relationship is between 0.031 and 0.087. Since zero is not in this interval, the results also imply that for a 5% test level, we would reject H0: β = 0 and accept H1: β ≠ 0.

Sometimes we want to calculate forecasting intervals for predicted Y values. For example, perhaps we're working for an agricultural agency. A farmer calls to ask us for an estimate of the wheat yield that might be expected based on a particular fertilizer usage level on the farmer's wheat field. We might reply that we are 95% certain that the yield would be between 60 and 80 bushels per acre. A representative from a cereal company might ask for an estimate of the average wheat yield that might be expected based on that same fertilizer usage level on many wheat fields. To that question, we might reply that we are 95% certain that the average yield would be between 65 and 75 bushels per acre. Our intervals would both be centered around the same number (70 in this example), but we can give a more precise prediction for an average of many fields than we can for an individual field.
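The confidence-interval arithmetic can be checked as follows. The endpoints differ from the text's 0.031 and 0.087 only in the third decimal, because the text rounds the half-width to 0.028:

```python
b, s_b, t_crit = 0.059, 0.0112, 2.571
half_width = t_crit * s_b                  # ≈ 0.0288
lo, hi = b - half_width, b + half_width    # ≈ (0.030, 0.088)
contains_zero = lo <= 0 <= hi              # False, so reject H0: beta = 0 at 5%
```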
The width of our forecasting intervals also depends on how far the specified value of the independent variable is from our data. Recall that the fertilizer values in our wheat problem had a mean of 400 and were all between 100 and 700. If someone asks about applying 2000 units of fertilizer to a field, we would probably feel less comfortable with our prediction than we would if the person asked about applying 500 units of fertilizer. The closer the value of X is to the mean value of our sample, the more comfortable we are with our numbers, and the narrower the interval required for a particular confidence level.

Forecasting intervals for the individual case and for the mean of many cases

(Graph: the regression line with two pairs of curved bands around it; the outer pair is the forecasting interval for the individual case, the inner pair the forecasting interval for the mean of many cases.) Notice that the intervals for the mean of many cases are narrower than those for the individual case. Also, all the intervals are narrowest near the sample mean of the independent variable.

For the given level of X requested by our callers, the interval for the individual case (60 to 80) is wider than the interval for the mean of many cases (65 to 75), with both centered on 70.

Formulae for forecasting intervals. For both of the following intervals, Ŷ_g = a + b X_g.

Forecasting interval for the individual case:

  Ŷ_g − t_{n−2} s_ind ≤ Y_g ≤ Ŷ_g + t_{n−2} s_ind

  where s_ind = SER · √( 1 + 1/n + (X_g − X̄)² / (ΣX² − (1/n)(ΣX)²) )

Forecasting interval for the mean of many cases:

  Ŷ_g − t_{n−2} s_mean ≤ μ_Y·Xg ≤ Ŷ_g + t_{n−2} s_mean

  where s_mean = SER · √( 1/n + (X_g − X̄)² / (ΣX² − (1/n)(ΣX)²) )

Example: if 550 pounds of fertilizer are applied in our wheat example, find the 95% forecasting interval for the mean wheat yield if we fertilized many fields. Recall: a = 36.4, b = 0.059, n = 7, t_{5,.05} = 2.571, SER = 5.94, ΣX = 2800, ΣX² = 1,400,000.
  Ŷ_g = a + b X_g = 36.4 + 0.059(550) = 68.8

  s_mean = SER · √( 1/n + (X_g − X̄)² / (ΣX² − (1/n)(ΣX)²) )
         = 5.94 · √( 1/7 + (550 − 400)² / (1,400,000 − (1/7)(2800)²) )
         ≈ 2.81

  Ŷ_g − t_{n−2} s_mean ≤ μ_Y·Xg ≤ Ŷ_g + t_{n−2} s_mean
  68.8 − 2.571(2.81) ≤ μ_Y·Xg ≤ 68.8 + 2.571(2.81)
  61.6 ≤ μ_Y·Xg ≤ 76.0

Example: if 550 pounds of fertilizer are applied in our wheat example, find the 95% forecasting interval for the wheat yield if we fertilized one field. Recall: a = 36.4, b = 0.059, n = 7, t_{5,.05} = 2.571, SER = 5.94, ΣX = 2800, ΣX² = 1,400,000.

  Ŷ_g = a + b X_g = 36.4 + 0.059(550) = 68.8

  s_ind = SER · √( 1 + 1/n + (X_g − X̄)² / (ΣX² − (1/n)(ΣX)²) )
        = 5.94 · √( 1 + 1/7 + (550 − 400)² / (1,400,000 − (1/7)(2800)²) )
        ≈ 6.56

  Ŷ_g − t_{n−2} s_ind ≤ Y_g ≤ Ŷ_g + t_{n−2} s_ind
  68.8 − 2.571(6.56) ≤ Y_g ≤ 68.8 + 2.571(6.56)
  51.9 ≤ Y_g ≤ 85.7

Notice that, as we stated previously, the interval for the mean of many cases is narrower than the interval for the individual case:

  51.9 ≤ Y_g ≤ 85.7        (individual case)
  61.6 ≤ μ_Y·Xg ≤ 76.0     (mean of many cases)
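Both forecasting intervals for X_g = 550 can be reproduced with this sketch. It uses the text's rounded inputs, so the endpoints match the text's to about a tenth (the text also rounds Ŷ_g to 68.8, while here it is 68.85):

```python
import math

a, b, n, ser, t_crit = 36.4, 0.059, 7, 5.94, 2.571
sum_x, sum_x2, x_g = 2800, 1_400_000, 550
xbar = sum_x / n                                   # 400
sxx = sum_x2 - sum_x ** 2 / n                      # 280,000
y_hat = a + b * x_g                                # 68.85

# Standard errors for the two kinds of forecasts
s_mean = ser * math.sqrt(1 / n + (x_g - xbar) ** 2 / sxx)      # ≈ 2.81
s_ind = ser * math.sqrt(1 + 1 / n + (x_g - xbar) ** 2 / sxx)   # ≈ 6.56

mean_interval = (y_hat - t_crit * s_mean, y_hat + t_crit * s_mean)  # ≈ (61.6, 76.1)
ind_interval = (y_hat - t_crit * s_ind, y_hat + t_crit * s_ind)     # ≈ (52.0, 85.7)
```

The individual-case interval contains the many-cases interval, as the text notes.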