Chapter 8 Lecture Slides
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Chapter 8: Inference in Linear Models

Introduction
• We discussed bivariate data in Chapter 2.
• In this chapter, we learn to compute confidence intervals and to perform hypothesis tests on the slope and intercept of the true regression line.
• So far we have considered a single predictor, but sometimes one independent variable is not enough. In these cases, we have several independent variables that are related to a dependent variable.
• If the relationship between the independent and dependent variables is linear, the technique of multiple regression can be used to include all of the independent variables in the model.

Section 8.1: Inferences Using the Least-Squares Coefficients
• When two variables have a linear relationship, the scatterplot tends to be clustered around a line known as the least-squares line.
• We think of the slope and intercept of the least-squares line as estimates of the slope and intercept of the true regression line.

Vocabulary
• The linear model is $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$.
• The dependent variable is $y_i$.
• The independent variable is $x_i$.
• The regression coefficients are $\beta_0$ and $\beta_1$.
• The error is $\varepsilon_i$.
• The line $y = \beta_0 + \beta_1 x$ is the true regression line.
• The quantities $\hat\beta_0$ and $\hat\beta_1$ are called the least-squares coefficients and can be computed in the way we discussed in Chapter 2.

Assumptions for Errors in Linear Models
In the simplest situation, the following assumptions are satisfied:
1. The errors $\varepsilon_1, \ldots, \varepsilon_n$ are random and independent. In particular, the magnitude of any error $\varepsilon_i$ does not influence the value of the next error $\varepsilon_{i+1}$.
2. The errors $\varepsilon_1, \ldots, \varepsilon_n$ all have mean 0.
3. The errors $\varepsilon_1, \ldots, \varepsilon_n$ all have the same variance, which we denote by $\sigma^2$.
4. The errors $\varepsilon_1, \ldots, \varepsilon_n$ are normally distributed.

Distribution
In the linear model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, under assumptions 1 through 4, the observations $y_1, \ldots, y_n$ are independent random variables that follow the normal distribution. The mean and variance of $y_i$ are given by
$$\mu_{y_i} = \beta_0 + \beta_1 x_i, \qquad \sigma_{y_i}^2 = \sigma^2.$$
The slope $\beta_1$ represents the change in the mean of y associated with an increase of one unit in the value of x.

More Distributions
Under assumptions 1–4:
• The quantities $\hat\beta_0$ and $\hat\beta_1$ are normally distributed random variables.
• The means of $\hat\beta_0$ and $\hat\beta_1$ are the true values $\beta_0$ and $\beta_1$, respectively.
• The standard deviations of $\hat\beta_0$ and $\hat\beta_1$ are estimated with
$$s_{\hat\beta_0} = s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}} \qquad \text{and} \qquad s_{\hat\beta_1} = \frac{s}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}},$$
where
$$s = \sqrt{\frac{(1 - r^2)\sum_{i=1}^n (y_i - \bar{y})^2}{n - 2}}$$
is an estimate of the error standard deviation $\sigma$.

Example 1
For the Hooke's law data, compute $s$, $s_{\hat\beta_1}$, and $s_{\hat\beta_0}$.

Notes
1. Since a measure of the spread of the x's appears in the denominator of both of the uncertainties just defined, the more spread out the x's are, the smaller the uncertainties in $\hat\beta_0$ and $\hat\beta_1$.
2. Use caution: if the range of x values extends beyond the range where the linear model holds, the results will not be valid.
3. The quantities $(\hat\beta_0 - \beta_0)/s_{\hat\beta_0}$ and $(\hat\beta_1 - \beta_1)/s_{\hat\beta_1}$ have Student's t distribution with n − 2 degrees of freedom.

Confidence Intervals
• Level 100(1 − α)% confidence intervals for $\beta_0$ and $\beta_1$ are given by
$$\hat\beta_0 \pm t_{n-2,\alpha/2}\, s_{\hat\beta_0} \qquad \text{and} \qquad \hat\beta_1 \pm t_{n-2,\alpha/2}\, s_{\hat\beta_1}.$$
• A level 100(1 − α)% confidence interval for the mean response $\beta_0 + \beta_1 x$ is given by
$$\hat\beta_0 + \hat\beta_1 x \pm t_{n-2,\alpha/2}\, s_{\hat{y}}, \qquad \text{where} \quad s_{\hat{y}} = s\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}.$$

Prediction Interval
• A level 100(1 − α)% prediction interval for a future observation at x is given by
$$\hat\beta_0 + \hat\beta_1 x \pm t_{n-2,\alpha/2}\, s_{\text{pred}}, \qquad \text{where} \quad s_{\text{pred}} = s\sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}.$$
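These formulas translate directly into code. The following Python sketch (not part of the original slides; numpy and scipy are assumed available, and the names `x`, `y`, `x_new` are illustrative) computes the least-squares coefficients, their standard errors, and the confidence and prediction intervals defined above.

```python
import numpy as np
from scipy import stats

def simple_regression_intervals(x, y, x_new, alpha=0.05):
    """Least-squares fit plus CI/PI at x_new, following the slide formulas."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    Sxx = np.sum((x - xbar) ** 2)
    b1 = np.sum((x - xbar) * (y - ybar)) / Sxx       # slope estimate beta1-hat
    b0 = ybar - b1 * xbar                            # intercept estimate beta0-hat
    resid = y - (b0 + b1 * x)
    # SSE/(n-2) equals the slide's (1 - r^2) * Sum(y_i - ybar)^2 / (n - 2)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))        # error SD estimate
    s_b1 = s / np.sqrt(Sxx)
    s_b0 = s * np.sqrt(1 / n + xbar ** 2 / Sxx)
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)
    yhat = b0 + b1 * x_new
    s_yhat = s * np.sqrt(1 / n + (x_new - xbar) ** 2 / Sxx)      # for mean response CI
    s_pred = s * np.sqrt(1 + 1 / n + (x_new - xbar) ** 2 / Sxx)  # for prediction interval
    return {
        "b0": (b0, b0 - t * s_b0, b0 + t * s_b0),
        "b1": (b1, b1 - t * s_b1, b1 + t * s_b1),
        "mean_CI": (yhat - t * s_yhat, yhat + t * s_yhat),
        "PI": (yhat - t * s_pred, yhat + t * s_pred),
    }
```

Run on the Hooke's law data with x_new = 1.2, a sketch like this should reproduce the intervals shown in the computer output discussed next.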
Example 1 (cont.)
Find a 95% CI for the spring constant in the Hooke's law data.

Example 1 (cont.)
In the Hooke's law data, find a 99% CI for the unloaded length of the spring.

Example 1 (cont.)
For the data in Example 1, compute a confidence interval for the slope of the regression line.

Example 1 (cont.)
The manufacturer of the spring in the Hooke's law data claims that the spring constant β1 is at least 0.215 in./lb. We have estimated the spring constant to be 0.2046 in./lb. Can we conclude that the manufacturer's claim is false?

Example 1 (cont.)
Using the Hooke's law data, compute a 95% CI for the length of a spring under a load of 1.4 lb.

Computer Output
Regression Analysis: Length versus Weight

The regression equation is
Length = 5.00 + 0.205 Weight   (1)

Predictor  Coef(2)   SE Coef(3)  T(4)    P(5)
Constant   4.99971   0.02477     201.81  0.000
Weight     0.20462   0.01115     18.36   0.000

S = 0.05749 (6)   R-Sq = 94.9% (7)   R-Sq(adj) = 94.6%

Analysis of Variance (8)
Source          DF  SS      MS      F       P
Regression      1   1.1138  1.1138  337.02  0.000
Residual Error  18  0.0595  0.0033
Total           19  1.1733

Unusual Observations (9)
Obs  Weight  Length  Fit     SE Fit  Residual  St Resid
12   2.20    5.5700  5.4499  0.0133  0.1201    2.15R
R denotes an observation with a large standardized residual

Predicted Values for New Observations (10)
New Obs  Fit     SE Fit  95.0% CI          95.0% PI
1        5.2453  0.0150  (5.2137, 5.2769)  (5.1204, 5.3701)

Values of Predictors for New Observations (11)
New Obs  Weight
1        1.20

Interpreting Computer Output
1. This is the equation of the least-squares line.
2. Coef: the coefficients $\hat\beta_0 = 4.99971$ and $\hat\beta_1 = 0.20462$.
3. SE Coef: the standard deviations of the estimates of β0 and β1.
4. T: the values of the Student's t statistics for testing the hypotheses β0 = 0 and β1 = 0. The t statistic is equal to the coefficient divided by its standard deviation.
5. P: the P-values for the tests of the hypotheses β0 = 0 and β1 = 0. The more important P-value is that for β1. If this P-value is not small enough to reject the hypothesis that β1 = 0, the linear model is not useful for predicting y from x.

More Computer Output Interpretation
6. S: the estimate of σ, the error standard deviation.
7. R-Sq: this is r², the square of the correlation coefficient r, also called the coefficient of determination.
8. Analysis of Variance: this table is not so important in simple linear regression; we will discuss it when we discuss multiple linear regression.
9. Unusual Observations: Minitab tries to alert you to data points that may violate assumptions 1–4.
10. Predicted Values for New Observations: these are confidence intervals and prediction intervals for values of x specified by the user.
11. Values of Predictors for New Observations: this is simply a list of the x values for which confidence and prediction intervals have been calculated.

Inferences on the Population Correlation
• When we have a random sample from a population of ordered pairs, the correlation coefficient r is often called the sample correlation.
• There is also a true population correlation, ρ.
• If the population of ordered pairs has a certain distribution known as a bivariate normal distribution, then the sample correlation can be used to construct CIs and perform hypothesis tests on the population correlation.

Testing
• The null hypotheses of interest are of the form ρ = 0, ρ ≤ 0, and ρ ≥ 0.
• The method of testing these hypotheses is based on the test statistic
$$U = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}},$$
which has a Student's t distribution with n − 2 degrees of freedom.
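As an illustration, U is easy to compute from a sample. The sketch below (my addition, assuming numpy and scipy; not from the slides) returns r, U, and a two-sided P-value for H0: ρ = 0.

```python
import numpy as np
from scipy import stats

def correlation_test(x, y):
    """Test H0: rho = 0 using U = r*sqrt(n-2)/sqrt(1-r^2) ~ t_{n-2}."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]                  # sample correlation
    U = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)   # test statistic from the slide
    p_two_sided = 2 * stats.t.sf(abs(U), df=n - 2)
    return r, U, p_two_sided
```

For the one-sided hypotheses ρ ≤ 0 or ρ ≥ 0, the appropriate single tail of the t distribution would be used instead of doubling.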
Section 8.2: Checking Assumptions
• We stated some assumptions for the errors. Here we want to see whether any of those assumptions are violated.
• The single best diagnostic for least-squares regression is a plot of residuals versus the fitted values, sometimes called a residual plot.

More on the Residual Plot
• When the linear model is valid and assumptions 1–4 are satisfied, the plot will show no substantial pattern. There should be no curve to the plot, and the vertical spread of the points should not vary too much over the horizontal range of the data.
• A good-looking residual plot does not by itself prove that the linear model is appropriate. However, a residual plot with a serious defect does clearly indicate that the linear model is inappropriate.

Residual Plots
[Figure: four example residual plots. Upper left: no noticeable pattern. Upper right: heteroscedastic. Lower left: trend. Lower right: outlier.]

Residuals versus Fitted Values
If the plot of residuals versus fitted values
• shows no substantial trend or curve, and
• is homoscedastic, that is, the vertical spread does not vary too much along the horizontal length of the plot, except perhaps near the edges,
then it is likely, but not certain, that the assumptions of the linear model hold. However, if the residual plot does show a substantial trend or curve, or is heteroscedastic, it is certain that the assumptions of the linear model do not hold.

Transformations
• If we fit the linear model $y = \beta_0 + \beta_1 x + \varepsilon$ and find that the residual plot exhibits a trend or pattern, we can sometimes fix the problem by raising x, y, or both to a power.
• It may be the case that a model of the form $y^a = \beta_0 + \beta_1 x^b + \varepsilon$ fits the data well.
• Replacing a variable with a function of itself is called transforming the variable.

Don't Forget
• Once the transformation has been made, you must inspect the residual plot again to see whether the new model is a good fit.
• It is fine to proceed through transformations by trial and error.
• It is important to remember that power transformations don't always work.

Caution
• When there are only a few points in a residual plot, it can be hard to determine whether the assumptions of the linear model are met.
• When one is faced with a sparse residual plot that is hard to interpret, a reasonable thing to do is to fit a linear model but to consider the results tentative, with the understanding that the appropriateness of the model has not been established.

Independence of Observations
• If the plot of residuals versus fitted values looks good, further diagnostics may be used to check the fit of the linear model.
• One such diagnostic is a plot of the residuals versus the order in which the observations were made.
• If there are trends in this plot, then x and y may be varying with time. This means that the errors are not independent. When this feature is severe, linear regression should not be used, and the methods of time series analysis should be used instead.

Normality Assumption
• To check that the errors are normally distributed, a normal probability plot of the residuals can be made (see the sketch below).
• If the plot follows a rough straight line, we can conclude that the residuals are approximately normally distributed.
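The three diagnostics just described can be produced in a few lines. A minimal sketch, assuming matplotlib and scipy are available and that `fitted` and `resid` are arrays from a previous fit (this is illustrative, not part of the slides):

```python
import matplotlib.pyplot as plt
from scipy import stats

def diagnostic_plots(fitted, resid):
    """Residuals vs. fitted values, residuals vs. observation order,
    and a normal probability plot of the residuals."""
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    axes[0].scatter(fitted, resid)
    axes[0].axhline(0, color="gray")
    axes[0].set(xlabel="Fitted value", ylabel="Residual", title="Residual plot")
    axes[1].plot(resid, marker="o")    # array index stands in for time order
    axes[1].set(xlabel="Observation order", ylabel="Residual", title="Time order plot")
    stats.probplot(resid, dist="norm", plot=axes[2])  # normal probability plot
    axes[2].set_title("Normal probability plot")
    plt.tight_layout()
    plt.show()
```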
Comments
• Physical laws are applicable to all future observations.
• An empirical model is valid only for the data to which it is fit. It may or may not be useful in predicting outcomes for subsequent observations.
• Determining whether to apply an empirical model to a future observation requires scientific judgment rather than statistical analysis.

Section 8.3: Multiple Regression
• The methods of simple linear regression apply when we wish to fit a linear model relating the value of a dependent variable y to the value of a single independent variable x.
• There are many situations in which a single independent variable is not enough.
• In situations like this, there are several independent variables, x1, x2, …, xp, that are related to a dependent variable y.

p Independent Variables
• Assume that we have a sample of n items and that on each item we have measured a dependent variable y and p independent variables x1, x2, …, xp.
• The ith sampled item gives rise to the ordered set (yi, x1i, …, xpi).
• We can then fit the multiple regression model
$$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \varepsilon_i.$$

Various Multiple Linear Regression Models
• Polynomial regression model (the independent variables are all powers of a single variable):
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_p x_i^p + \varepsilon_i$$
• Quadratic model (polynomial regression model of degree 2, with powers of several variables):
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{1i}x_{2i} + \beta_4 x_{1i}^2 + \beta_5 x_{2i}^2 + \varepsilon_i$$
• A variable that is the product of two other variables is called an interaction.
• These models are considered linear models, even though they contain nonlinear terms in the independent variables. The reason is that they are linear in the coefficients βi.

Estimating the Coefficients
• In any multiple regression model, the estimates $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p$ are computed by least squares, just as in simple linear regression. The equation
$$\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p$$
is called the least-squares equation or fitted regression equation.
• Now define $\hat{y}_i$ to be the y coordinate of the least-squares equation corresponding to the x values (x1i, …, xpi).
• The residuals are the quantities $e_i = y_i - \hat{y}_i$, which are the differences between the observed y values and the y values given by the equation.
• We want to compute $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p$ so as to minimize the sum of the squared residuals. This is complicated, and we rely on computers to calculate them (a code sketch follows at the end of this section).

Sums of Squares
• Much of the analysis in multiple regression is based on three fundamental quantities: the regression sum of squares (SSR), the error sum of squares (SSE), and the total sum of squares (SST).
• We defined these quantities in Chapter 7, and the definitions hold here as well.
• The analysis of variance identity is SST = SSR + SSE.

Assumptions for the Error Terms
Recall the assumptions for errors in linear models. In the simplest situation, the following assumptions are satisfied (notice that these are the same as for simple linear regression):
1. The errors $\varepsilon_1, \ldots, \varepsilon_n$ are random and independent. In particular, the magnitude of any error $\varepsilon_i$ does not influence the value of the next error $\varepsilon_{i+1}$.
2. The errors $\varepsilon_1, \ldots, \varepsilon_n$ all have mean 0.
3. The errors $\varepsilon_1, \ldots, \varepsilon_n$ all have the same variance, which we denote by $\sigma^2$.
4. The errors $\varepsilon_1, \ldots, \varepsilon_n$ are normally distributed.

Mean and Variance of yi
The multiple linear regression model is $y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \varepsilon_i$. Under assumptions 1 through 4, the observations y1, …, yn are independent random variables that follow the normal distribution. The mean and variance of yi are given by
$$\mu_{y_i} = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}, \qquad \sigma_{y_i}^2 = \sigma^2.$$
Each coefficient βi represents the change in the mean of y associated with an increase of one unit in the value of xi, when the other x variables are held constant.
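Although in practice we rely on statistical software, the least-squares coefficients and the sums of squares can be computed in a few lines. A minimal numpy sketch (my addition; `X` is an assumed (n, p) array of predictors and `y` the response):

```python
import numpy as np

def fit_multiple_regression(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bp*xp.
    A column of ones is added to X so that b0 is estimated too."""
    A = np.column_stack([np.ones(len(y)), X])      # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimizes sum of squared residuals
    yhat = A @ beta
    SSE = np.sum((y - yhat) ** 2)                  # error sum of squares
    SST = np.sum((y - np.mean(y)) ** 2)            # total sum of squares
    SSR = SST - SSE                                # ANOVA identity: SST = SSR + SSE
    return beta, SSR, SSE, SST
```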
Statistics
• The three statistics most often used in multiple regression are the estimated error variance s², the coefficient of determination R², and the F statistic.
• We have to adjust the estimated error variance since we are estimating p + 1 coefficients:
$$s^2 = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n - p - 1} = \frac{\text{SSE}}{n - p - 1}.$$
• The estimated variance of each least-squares coefficient is a complicated calculation, and we find these on a computer.
• In simple linear regression, the coefficient of determination r² measures the goodness of fit of the linear model. The goodness-of-fit statistic in multiple regression, denoted R², is also called the coefficient of determination. The value of R² is calculated in the same way as r² in simple linear regression: R² = SSR/SST.

Distribution of $\hat\beta_i$
• When assumptions 1 through 4 are satisfied, the quantity
$$\frac{\hat\beta_i - \beta_i}{s_{\hat\beta_i}}$$
has a Student's t distribution with n − p − 1 degrees of freedom.
• The number of degrees of freedom is equal to the denominator used to compute the estimated error variance.
• This statistic is used to compute confidence intervals and to perform hypothesis tests, as we did in simple linear regression.

Tests of Hypothesis
• In simple linear regression, a test of the null hypothesis β1 = 0 is almost always made. If this hypothesis is not rejected, then the linear model may not be useful.
• The analogous test in multiple linear regression is of H0: β1 = β2 = … = βp = 0. This is a very strong hypothesis. It says that none of the independent variables has any linear relationship with the dependent variable.
• The test statistic for this hypothesis is
$$F = \frac{\text{SSR}/p}{\text{SSE}/(n - p - 1)}.$$
• This is an F statistic, and its null distribution is F with (p, n − p − 1) degrees of freedom. Note that the denominator of the F statistic is s².
• Slightly different versions of the F statistic can be used to test milder null hypotheses.
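A small sketch computing s², R², and this F statistic from the sums of squares (my addition, assuming scipy). Applied to the ANOVA table in the goodput output shown next (SSR = 2240.49, SSE = 164.46, n = 25, p = 5), it reproduces F ≈ 51.77, s² ≈ 8.66, and R² ≈ 93.2%.

```python
from scipy import stats

def overall_f_test(SSR, SSE, n, p):
    """F test of H0: beta1 = ... = betap = 0 in a model with p predictors."""
    s2 = SSE / (n - p - 1)          # estimated error variance (mean square error)
    R2 = SSR / (SSR + SSE)          # coefficient of determination, SSR/SST
    F = (SSR / p) / s2              # null distribution: F with (p, n-p-1) df
    p_value = stats.f.sf(F, p, n - p - 1)
    return s2, R2, F, p_value
```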
Output
The regression equation is
Goodput = 96.0 - 1.82 Speed + 0.565 Pause + 0.0247 Speed*Pause + 0.0140 Speed^2 - 0.0118 Pause^2

Predictor  Coef       StDev     T      P
Constant   96.024     3.946     24.34  0.000
Speed      -1.8245    0.2376    -7.68  0.000
Pause      0.5652     0.2256    2.51   0.022
Speed*Pa   0.024731   0.003249  7.61   0.000
Speed^2    0.014020   0.004745  2.95   0.008
Pause^2    -0.011793  0.003516  -3.35  0.003

S = 2.942   R-Sq = 93.2%   R-Sq(adj) = 91.4%

Analysis of Variance
Source          DF  SS       MS      F      P
Regression      5   2240.49  448.10  51.77  0.000
Residual Error  19  164.46   8.66
Total           24  2404.95

Predicted Values for New Observations
New Obs  Fit     SE Fit  95% CI            95% PI
1        74.272  1.175   (71.812, 76.732)  (67.641, 80.903)

Values of Predictors for New Observations
New Obs  Speed  Pause  Speed*Pause  Speed^2  Pause^2
1        25.0   15.0   375          625      225

Interpreting Output
Much of the output is analogous to that of simple linear regression.
1. The fitted regression equation is presented near the top of the output.
2. Below that are the coefficient estimates and their estimated standard deviations.
3. Next to each standard deviation is the Student's t statistic for testing the null hypothesis that the true value of the coefficient is equal to 0.
4. The P-values for the tests are given in the next column.

Analysis of Variance Table
5. The DF column gives the degrees of freedom. The degrees of freedom for regression is equal to the number of independent variables in the model. The degrees of freedom for "Residual Error" is the number of observations minus the number of parameters estimated. The total degrees of freedom is the sum of the degrees of freedom for regression and for error.
6. The next column, SS, gives the sums of squares: the first is the regression sum of squares SSR, the second is the error sum of squares SSE, and the third is the total sum of squares SST = SSR + SSE.

More on the ANOVA Table
7. The MS column gives the mean squares, which are the sums of squares divided by their respective degrees of freedom. Note that the mean square error is equal to the variance estimate s².
8. The column labeled F presents the mean square for regression divided by the mean square for error.
9. This is the F statistic we discussed earlier, used for testing the null hypothesis that none of the independent variables is related to the dependent variable.

Using the Output
• From the output, we can use the fitted regression equation to predict future observations.
• It is also possible to calculate the residual for an observed value of y.
• Constructing a confidence interval for the coefficient of an independent variable is also possible from the output.

Example 2
Use the multiple regression model to predict the goodput for a network with speed 12 m/s and pause time 25 s. For the goodput data, find the residual for the point Speed = 20, Pause = 30. Find a 95% confidence interval for the coefficient of Speed in the multiple regression model. Test the null hypothesis that the coefficient of Pause is less than or equal to 0.3.

Checking Assumptions
• It is important in multiple linear regression to test the validity of the assumptions for errors in the linear model.
• Check plots of residuals versus fitted values, normal probability plots of residuals, and plots of residuals versus the order in which the observations were made.
• It is also a good idea to make plots of residuals versus each of the independent variables. If the residual plots indicate a violation of assumptions, transformations can be tried.

Section 8.4: Model Selection
• There are many situations in which a large number of independent variables have been measured, and we need to decide which of them to include in the model.
• This is the problem of model selection, and it is not an easy one.
• Good model selection rests on a basic principle known as Occam's razor: "The best scientific model is the simplest model that explains the observed data."
• In terms of linear models, Occam's razor implies the principle of parsimony: "A model should contain the smallest number of variables necessary to fit the data."

Some Exceptions
1. A linear model should always contain an intercept, unless physical theory dictates otherwise.
2. If a power x^n of a variable is included in the model, all lower powers x, x², …, x^(n−1) should be included as well, unless physical theory dictates otherwise.
3. If a product xy of two variables is included in a model, then the variables x and y should be included separately as well, unless physical theory dictates otherwise.

Notes
• Models that include only the variables needed to fit the data are called parsimonious models.
• Adding a variable to a model can substantially change the coefficients of the variables already in the model.

Can a Variable Be Dropped?
• It often happens that one has formed a model containing a large number of independent variables and wishes to determine whether a given subset of them may be dropped from the model without significantly reducing the accuracy of the model.
• Assume that we know that the model
$$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + \beta_{k+1} x_{k+1,i} + \cdots + \beta_p x_{pi} + \varepsilon_i$$
is correct. We will call this the full model.

Null Hypothesis
• We wish to test the null hypothesis H0: βk+1 = … = βp = 0.
• If H0 is true, the model will remain correct if we drop the variables xk+1, …, xp, so we can replace the full model with the following reduced model:
$$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + \varepsilon_i.$$

Test Statistic
• To develop a test statistic for H0, we begin by computing the error sums of squares for both the full and reduced models. We call these SS_full and SS_reduced, respectively.
• The number of degrees of freedom for SS_full is n − p − 1, and for SS_reduced it is n − k − 1.
• The test statistic is
$$f = \frac{(\text{SS}_{\text{reduced}} - \text{SS}_{\text{full}})/(p - k)}{\text{SS}_{\text{full}}/(n - p - 1)}.$$
• If H0 is true, then f tends to be close to 1. If H0 is false, then f tends to be larger.
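This test transcribes directly into code. A sketch (my addition, assuming scipy; `SS_reduced` and `SS_full` would come from fitting the two models, for instance with the least-squares sketch given earlier):

```python
from scipy import stats

def reduced_model_f_test(SS_reduced, SS_full, n, p, k):
    """Test H0: beta_{k+1} = ... = beta_p = 0 by comparing the reduced model
    (k predictors) with the full model (p predictors)."""
    f = ((SS_reduced - SS_full) / (p - k)) / (SS_full / (n - p - 1))
    p_value = stats.f.sf(f, p - k, n - p - 1)   # null distribution: F_{p-k, n-p-1}
    return f, p_value
```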
Comments
• This method is very useful for developing parsimonious models by removing unnecessary variables. However, the conditions under which it is formally correct are rarely met.
• More often, a large model is fit, some of the variables are seen to have fairly large P-values, and the F test is used to decide whether to drop them from the model.
• It is often the case that there is no one "correct" model; several models may fit equally well.

Best Subsets Regression
• Assume that there are p independent variables, x1, x2, …, xp, available to be put in the model.
• Suppose we wish to find a good model that contains exactly four independent variables.
• We can simply fit every possible model containing four of the variables and rank them in order of their goodness of fit, as measured by the coefficient of determination R².
• The subset of four variables that yields the largest value of R² is the "best" subset of size four.
• One can repeat the process for subsets of other sizes, finding the best subsets of size 1, 2, …, p.
• These best subsets can be examined to see which provide a good fit while being parsimonious.

Stepwise Regression
• This is the most widely used model selection technique.
• Its main advantage over best subsets regression is that it is less computationally intensive, so it can be used in situations where there are a very large number of candidate independent variables and too many possible subsets for every one of them to be examined.
• The user chooses two threshold P-values, α_in and α_out, with α_in < α_out.
• The stepwise regression procedure begins with a step called a forward selection step, in which the independent variable with the smallest P-value is selected, provided that P < α_in.
• This variable is entered into the model, creating a model with a single independent variable.

More on Stepwise Regression
• In the next step, the remaining variables are examined one at a time as candidates for the second variable in the model. The one with the smallest P-value is added to the model, again provided that P < α_in.
• Now, it is possible that adding the second variable to the model increased the P-value of the first variable. In the next step, called a backward elimination step, the first variable is dropped from the model if its P-value has grown to exceed the value α_out.
• The algorithm continues by alternating forward selection steps with backward elimination steps.
• The algorithm terminates when no variables meet the criteria for being added to or dropped from the model. (A code sketch of this loop follows.)
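Below is a simplified sketch of this algorithm (my addition, assuming numpy and statsmodels; production implementations handle ties and possible cycling more carefully). It alternates one forward step and one backward step until neither changes the model.

```python
import numpy as np
import statsmodels.api as sm

def stepwise_select(X, y, alpha_in=0.05, alpha_out=0.10):
    """Simplified stepwise regression on coefficient P-values, as described
    above. X is an (n, p) array; returns indices of the selected columns."""
    selected = []
    while True:
        changed = False
        # Forward step: add the candidate with the smallest P-value < alpha_in.
        best_p, best_j = 1.0, None
        for j in (j for j in range(X.shape[1]) if j not in selected):
            fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
            pj = fit.pvalues[-1]              # P-value of the newly added variable
            if pj < best_p:
                best_p, best_j = pj, j
        if best_j is not None and best_p < alpha_in:
            selected.append(best_j)
            changed = True
        # Backward step: drop a variable whose P-value has grown past alpha_out.
        if selected:
            fit = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
            pvals = np.asarray(fit.pvalues)[1:]   # skip the intercept
            worst = int(np.argmax(pvals))
            if pvals[worst] > alpha_out:
                selected.pop(worst)
                changed = True
        if not changed:
            return selected
```

The thresholds alpha_in and alpha_out here are illustrative defaults; the slides only require α_in < α_out.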
Notes on Model Selection
• When there is little or no physical theory to rely on, many different models will fit the data about equally well.
• The methods for choosing a model involve statistics whose values depend on the data. Therefore, if the experiment is repeated, these statistics will come out differently, and different models may appear to be "best."
• Some or all of the independent variables in a selected model may not really be related to the dependent variable. Whenever possible, experiments should be repeated to test these apparent relationships.
• Model selection is an art, not a science.

Summary
• Uncertainties in the least-squares coefficients
• Confidence intervals and hypothesis tests for least-squares coefficients
• Checking assumptions
• Residuals
• Multiple regression models
• Estimating the coefficients
• Checking assumptions in multiple regression
• Confounding and collinearity
• Model selection