Statistics 11/Economics 40 Lecture 20 (Chapter 10.1)

1. Least-Squares Regression Recap

In Chapter 2 a line was used to summarize the relationship between a variable designated as the explanatory or independent variable (called x) and a response/outcome/dependent variable (called y). The slope was identified as b and the intercept as a. The equation for the line which summarizes the relationship is:

    y = a + bx

where b = r*(Sy/Sx) and a = (average of the y variable) - (slope * average of the x variable).

The idea here is that there is a line which "best" fits the data, and its criterion for "fit" is that it minimizes the squared deviations in the y-direction from the line. This line is used because it has good mathematical properties.

2. The Population Regression Line (10.1)

The interest here is to examine how y changes as x changes for all possible x and y out there. In Chapter 2 there was no discussion of "sample" versus "population," but in Chapter 10 there is. The notation changes a little bit:

    µy = β0 + β1x

This is the population regression line, a perfect representation of the relationship between x and y. β0 and β1 are parameters -- they are fixed and unchanging. The concern here is how the mean value of y for a given x changes as x changes. The actual y values will vary around µy, and this variability is measured by σy (and this is also thought to be unchanging, that is, the same for every value of x). See Figure 10.2. This is "the truth" as it applies to a relationship between an x variable and a y variable. We do not observe this line in practice; the lines we have been looking at come from samples. In other words, in practice, this is what we are "modeling" or "regressing" in a single sample:

    yi = β0 + β1xi + εi    (where the yi are individual observed values)

or, to think of it another way,

    ŷ = b0 + b1x    (ŷ is a predicted y; the observed value equals the predicted value plus the residual)

And the εi are the error terms.
They should be (sometimes they are not, and this is a bad thing) independently distributed with mean 0 and standard deviation σy. b0 and b1 are estimates (from samples) of β0 and β1 (the population parameters). The estimation/calculation of b0 and b1 is as in Chapter 2.

3. The Residuals, Again

The residuals as calculated from the sample regression (observed y minus predicted y) are represented by ei, and they are an estimate of εi; they are just the difference between the actual y value in the sample and the predicted y value. The residuals should sum to zero. The importance of residuals in this context is that they are used to estimate σy -- the variability of the y variable around each level of x. The estimate s² has a slightly different formula than you might expect:

    s² = Σ (yi − ŷi)² / (n − 2)

(see the bottom of page 666; this is called the residual mean square error). The square root of s² is s, and that is used as our estimate of σy. Since you are taking both x-bar and y-bar into account when calculating ŷ, the divisor is n − 2. Compare this with your calculation of the standard deviation of y or of just x from Chapter 1, where the divisor is n − 1. Don't worry, you don't need to know how to calculate this, but you should know how to interpret a Stata output (like Lab 6) on regression if you see one.

4. Interpreting the Stata Output for Regression Analysis

A term you need to know: standard error, s/√n (see page 504). When you do not have sigma and you must use s (the sample standard deviation) to estimate it, Moore and McCabe call the estimated standard deviation of a statistic (such as a mean) the standard error. In the regression framework, one usually does not have sigma and so one is almost always working with s. And if one also has a small sample (under size 15), the assumptions of the Central Limit Theorem (i.e.
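The residual calculation above can be sketched with the same kind of made-up data (the numbers are illustrative, not from the lecture's example; the fitted line ŷ = 0.5 + 1.6x is the least-squares line for this toy data):

```python
# Made-up data and its fitted least-squares line (slope 1.6, intercept 0.5)
x = [1, 2, 3, 4]
y = [2, 4, 5, 7]
b0, b1 = 0.5, 1.6

# Residuals: observed y minus predicted y-hat
y_hat = [b0 + b1 * xi for xi in x]
e = [yi - yh for yi, yh in zip(y, y_hat)]
print(sum(e))  # residuals from a least-squares fit sum to (essentially) zero

# Residual mean square: s^2 = sum(e_i^2) / (n - 2); s estimates sigma_y
n = len(y)
s2 = sum(ei ** 2 for ei in e) / (n - 2)
s = s2 ** 0.5
print(s2, s)
```

Note the divisor n − 2, matching the formula above, rather than the n − 1 used for an ordinary sample standard deviation.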
sigma known, population normal, sample size large; Chapter 5.2) are violated, so the standard normal can't be used; instead something called the t-distribution is used. So you can no longer use a Z-test; you must use a t-test (Chapter 7.1; note that as the sample size approaches infinity, z and t are distributed in the same manner). The t-test looks a lot like a z-test for the distribution of a sample mean:

    t = (x̄ − µ) / (s/√n)

Vocabulary words of interest (from the Stata output): R-squared, _cons, Coefficients, Standard Error, t, P>|t|, 95% Confidence Interval.

Let's look again at the relationship between advertising expenditures and name recognition:

    . regress recognit spending

          Source |       SS       df       MS          Number of obs =      21
    -------------+------------------------------      F(  1,    19) =   13.98
           Model |  7723.27815     1  7723.27815      Prob > F      =  0.0014
        Residual |  10494.1085    19  552.321499      R-squared     =  0.4240
    -------------+------------------------------      Adj R-squared =  0.3936
           Total |  18217.3866    20  910.869332      Root MSE      =  23.502

    --------------------------------------------------------------------------
        recognit |    Coef.   Std. Err.     t    P>|t|   [95% Conf. Interval]
    -------------+------------------------------------------------------------
        spending |  .3631741   .0971203   3.739  0.001     .159899   .5664492
           _cons |  22.16269   7.089478   3.126  0.006    7.324244   37.00114
    --------------------------------------------------------------------------

Remember, our goal is to predict the number of people who recognize the brand (recognit) from the amount of money spent advertising the brand (spending). The estimates b0 and b1 of the parameters β0 and β1 are given in the column labeled Coef. The intercept is labeled _cons and its value is 22.16269. The interpretation is that with no advertising spending the predicted recognition is 22.16269 million people.
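The t column in the output is nothing mysterious: it is just each coefficient divided by its standard error. A quick check with the numbers from the spending row of the output above:

```python
# Values copied from the spending row of the Stata output
coef = 0.3631741
std_err = 0.0971203

# t statistic reported by Stata: Coef. / Std. Err.
t = coef / std_err
print(round(t, 3))  # matches the 3.739 shown in the output
```

The same arithmetic applied to the _cons row (22.16269 / 7.089478) reproduces its t of 3.126.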
β1 is the slope of the line (the rate of change in recognition for each unit change in spending) and its value is .3631741, so the least-squares line for the regression of recognition on spending is:

    ŷ = 22.16269 + .3631741(xi)

5. Confidence Intervals and Significance Tests in Regression Analysis

You can also produce confidence intervals and significance tests for the coefficients in a linear regression model. The confidence intervals are very similar to the ones you have seen in Chapter 6, but they use a t instead of a Z as the multiplier:

    Estimate ± t*(Standard Error of the Estimate)

where t* is the critical value from the t distribution (Table D). Generally, statistical software provides you with the upper and lower bounds. For the x variable spending, the 95% confidence interval is (.159899, .5664492). As before, the interpretation is that you are 95% confident that the true parameter β1 is covered by the interval given above. You could also give a 95% confidence interval around the intercept: (7.324244, 37.00114).

To test the hypothesis β1 = 0, look at the value of t given by the output (3.739 for spending); it's just the coefficient divided by its standard error, .3631741/.0971203, and a p-value is given to you. Since the p-value given is .001, your interpretation would be that the coefficient is highly statistically significant (P < .05, or even P < .01). Note that this tests a two-sided alternative.

Your textbook delves into some other estimates generated by statistical packages, but what has been detailed above is sufficient.
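The 95% interval for spending can be reproduced by hand from Estimate ± t*(SE). Here t* ≈ 2.093, the Table D critical value for n − 2 = 19 degrees of freedom, is a hand-copied constant (the Python standard library has no t-distribution), so the result is only as exact as that table value:

```python
# Values copied from the spending row of the Stata output
coef = 0.3631741
std_err = 0.0971203

# Upper .025 critical value of t with 19 df, copied from Table D
t_star = 2.093

margin = t_star * std_err
lower = coef - margin
upper = coef + margin
print(lower, upper)  # close to Stata's (.159899, .5664492)
```

The small discrepancy in the last decimal places comes from rounding t* to three digits; Stata uses the exact critical value.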