Week 11, Lecture 3: The regression assumptions
Business Statistics - QBM117
Statistical inference for regression
Objectives

- To define the linear model, which defines the population of interest.
- To explain the required conditions of the error variable.
Regression diagnostics
In the previous two lectures, we concentrated on summarising sample bivariate data:

- we learnt how to estimate the strength of the relationship between the variables using the correlation coefficient;
- we learnt how to estimate the relationship between the variables using the least squares regression line; and
- we learnt how to estimate the accuracy of the line for prediction, using the standard error of estimate and the coefficient of determination.
We now need to perform statistical inference about the population from which these samples have been taken, in order to better understand it.
What is the appropriate population for a simple linear
regression problem?
The linear model

$y = \beta_0 + \beta_1 x + \varepsilon$

where
y = the observed value in the population,
$\beta_0 + \beta_1 x$ = the straight-line population relationship, and
$\varepsilon$ = the error variable.

The least squares regression line therefore estimates the population relationship described by the linear model:
$\hat{\beta}_0$ estimates $\beta_0$, and
$\hat{\beta}_1$ estimates $\beta_1$.

The linear model is the basic assumption required for statistical inference in regression and correlation.
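The lecture works in Excel; the following is a minimal Python sketch of the same idea, with all parameter values invented for illustration. It draws a sample from the linear model and shows the least squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ landing close to $\beta_0$ and $\beta_1$:

```python
import numpy as np

rng = np.random.default_rng(117)

# Population parameters -- illustrative values, not from the lecture
beta0, beta1, sigma = 2.0, 0.5, 1.0

# Draw a sample from the linear model y = beta0 + beta1*x + epsilon
x = rng.uniform(0, 10, size=100)
epsilon = rng.normal(0, sigma, size=100)   # the error variable
y = beta0 + beta1 * x + epsilon

# Least squares estimates of the population parameters
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

print(f"beta0_hat = {b0_hat:.3f}  (true beta0 = {beta0})")
print(f"beta1_hat = {b1_hat:.3f}  (true beta1 = {beta1})")
```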
Required conditions of the error variable
$\hat{\beta}_0$ and $\hat{\beta}_1$ will only provide good estimates of $\beta_0$ and $\beta_1$ if certain assumptions about the error variable $\varepsilon$ are valid.
Similarly, the statistical tests we perform in hypothesis testing will only be valid if these conditions are satisfied.
So what are these conditions?
1. The probability distribution of $\varepsilon$ is normal.
2. The mean of the distribution is zero, i.e. $E(\varepsilon) = 0$.
3. The variance of $\varepsilon$, $\sigma_\varepsilon^2$, is constant, no matter what the value of x.
4. The errors associated with any two y values are independent. As a result, the value of the error variable at one point does not affect the value of the error variable at another point.
Requirements 1, 2, and 3 can be interpreted in another way: for each value of x, y is a normally distributed random variable whose mean is
$E(y) = \beta_0 + \beta_1 x$
and whose standard deviation is $\sigma_\varepsilon$.
Since the mean depends on x, the expected value is often expressed as
$E(y \mid x) = \beta_0 + \beta_1 x$
The standard deviation, however, is not influenced by x, because it is constant for all values of x.
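A small Python sketch of this interpretation (parameter values again invented for illustration): at each fixed x, repeated draws of y centre on $\beta_0 + \beta_1 x$ while the spread stays at the same $\sigma_\varepsilon$:

```python
import numpy as np

rng = np.random.default_rng(11)
beta0, beta1, sigma = 2.0, 0.5, 1.0   # illustrative population values

# At each fixed x, y is normal with mean beta0 + beta1*x and the SAME sigma
for x in (1.0, 5.0, 9.0):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=100_000)
    print(f"x = {x}: mean(y) = {y.mean():.3f} "
          f"(E(y|x) = {beta0 + beta1 * x:.3f}), sd(y) = {y.std():.3f}")
```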
[Figure: Assumptions of the Simple Linear Regression Model. Identical normal distributions of errors, all centred on the regression line $E(y) = \beta_0 + \beta_1 x$.]
Regression diagnostics

- Most departures from the required conditions can be diagnosed by examining the residuals.
- Excel allows us to calculate these residuals and apply various graphical techniques to them.
- Analysis of the residuals allows us to determine whether the variance of the error variable is constant and whether the errors are independent.
- Excel can also generate standardised residuals. The residuals are standardised in the usual way, by subtracting the mean (0 in this case) and dividing by the standard deviation (or, in this case, its estimate $s_\varepsilon$), as sketched below.
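A Python sketch of this calculation (Excel's built-in standardised residuals are computed somewhat differently, but this divide-by-$s_\varepsilon$ form matches the description above):

```python
import numpy as np

def standardised_residuals(x, y):
    """Least squares residuals divided by the standard error of
    estimate s_e = sqrt(SSE / (n - 2)); their mean is already 0."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    residuals = y - (b0 + b1 * x)
    s_e = np.sqrt(np.sum(residuals ** 2) / (n - 2))   # standard error of estimate
    return residuals / s_e
```

Standardised residuals beyond about plus or minus 2 then flag observations with unusually large residuals.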
Non-normality
We can check for normality by drawing a histogram of
the residuals to see if it appears that the error variable is
normally distributed.
The tests in regression analysis are robust, so as long as the histogram at least resembles an approximate bell shape, or is not extremely non-normal, it is safe to assume that the normality requirement has been met.
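A sketch of this check in Python, using simulated data in place of Excel output (the data and parameter values are invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 100)   # simulated illustrative data

b1, b0 = np.polyfit(x, y, 1)                  # slope, intercept
residuals = y - (b0 + b1 * x)

plt.hist(residuals, bins=15, edgecolor="black")
plt.xlabel("Residual")
plt.title("Histogram of residuals: look for a rough bell shape")
plt.show()
```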
Expectation of zero
The use of the method of least squares to find the line of best fit ensures that this will always be the case: the residuals always sum, and hence average, to zero.
We can, however, observe from the histogram of the residuals that they are approximately symmetric about a value close to zero.
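A quick numeric check of this fact, continuing the simulated-data sketches above:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 50)   # simulated illustrative data

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# Least squares forces the residuals to average to zero
print(residuals.mean())   # ~0, up to floating-point rounding
```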
Heteroscedasticity
The variance of the error variable, $\sigma_\varepsilon^2$, is required to be constant.
When this requirement is violated, the condition is called heteroscedasticity.
Homoscedasticity refers to the condition when the requirement is satisfied.
One method of diagnosing heteroscedasticity is to plot the residuals against the x values or the predicted values of y, and look for any change in the spread of the residuals, as sketched below.
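A Python sketch of this diagnostic plot, with a deliberately heteroscedastic simulated data set (all values invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 200)
# Simulated violation: the error spread grows with x
y = 2.0 + 0.5 * x + rng.normal(0, 0.2 + 0.3 * x)

b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x
residuals = y - fitted

plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="grey")
plt.xlabel("Predicted value of y")
plt.ylabel("Residual")
plt.title("A fanning spread indicates heteroscedasticity")
plt.show()
```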
[Figure: Residual analysis and checking for model inadequacies. Four residual plots against x, $\hat{y}$, or time:
- Homoscedasticity: the residuals appear completely random, with no indication of model inadequacy.
- Heteroscedasticity: the variance of the residuals changes as x changes.
- Residuals plotted against time exhibit a linear trend.
- A curved pattern in the residuals, resulting from an underlying nonlinear relationship.]
Non-independence of the error term

- This requirement states that the values of the error variable must be independent.
- If the data are time series data, the errors are often correlated.
- Error terms which are correlated over time are said to be autocorrelated or serially correlated.
- We can often detect autocorrelation if we plot the residuals against the time period, as sketched below.
- If a pattern emerges, it is likely that the independence requirement is violated.
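A Python sketch of this plot, with simulated serially correlated errors (the carry-over factor 0.8 and all other values are invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
n = 100
t = np.arange(n)                       # time period

# Simulated autocorrelated errors: each carries over 0.8 of the last
eps = np.zeros(n)
for i in range(1, n):
    eps[i] = 0.8 * eps[i - 1] + rng.normal(0, 1.0)
y = 2.0 + 0.5 * t + eps

b1, b0 = np.polyfit(t, y, 1)
residuals = y - (b0 + b1 * t)

plt.plot(t, residuals, marker="o")
plt.axhline(0, color="grey")
plt.xlabel("Time period")
plt.ylabel("Residual")
plt.title("Runs of same-sign residuals suggest autocorrelation")
plt.show()
```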
Reading for next lecture
Read Chapter 18 Sections 18.5 and 18.7
(Chapter 11 Sections 11.5 and 11.7 abridged)