Week 11, Lecture 3: The regression assumptions
Business Statistics - QBM117
Statistical inference for regression
Objectives

- To define the linear model, which defines the population of interest.
- To explain the required conditions of the error variable.
Regression diagnostics
In the previous two lectures, we concentrated on summarising sample bivariate data:

- we learnt how to estimate the strength of the relationship between the variables using the correlation coefficient;
- we learnt how to estimate the relationship between the variables using the least squares regression line; and
- we learnt how to estimate the accuracy of the line for prediction, using the standard error of estimate and the coefficient of determination.
We now need to perform statistical inference about the population from which these samples have been taken, in order to better understand it.
What is the appropriate population for a simple linear
regression problem?
The linear model

$y = \beta_0 + \beta_1 x + \varepsilon$

where
y = the observed value in the population,
$\beta_0 + \beta_1 x$ = the straight-line population relationship, and
$\varepsilon$ = the error variable.

The least squares regression line therefore estimates the population relationship described by the linear model:
$\hat{\beta}_0$ estimates $\beta_0$, and
$\hat{\beta}_1$ estimates $\beta_1$.

The linear model is the basic assumption required for statistical inference in regression and correlation.
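The lecture works in Excel; the following is a minimal Python sketch of the same idea, with all parameter values invented for illustration. It draws a sample from the linear model and shows the least squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ landing close to $\beta_0$ and $\beta_1$:

```python
import numpy as np

rng = np.random.default_rng(117)

# Population parameters -- illustrative values, not from the lecture
beta0, beta1, sigma = 2.0, 0.5, 1.0

# Draw a sample from the linear model y = beta0 + beta1*x + epsilon
x = rng.uniform(0, 10, size=100)
epsilon = rng.normal(0, sigma, size=100)   # the error variable
y = beta0 + beta1 * x + epsilon

# Least squares estimates of the population parameters
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

print(f"beta0_hat = {b0_hat:.3f}  (true beta0 = {beta0})")
print(f"beta1_hat = {b1_hat:.3f}  (true beta1 = {beta1})")
```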
Required conditions of the error variable
$\hat{\beta}_0$ and $\hat{\beta}_1$ will only provide good estimates of $\beta_0$ and $\beta_1$ if certain assumptions about the error variable $\varepsilon$ are valid.
Similarly, the statistical tests we perform in hypothesis testing will only be valid if these conditions are satisfied.
So what are these conditions?
1. The probability distribution of $\varepsilon$ is normal.
2. The mean of the distribution is zero, i.e. $E(\varepsilon) = 0$.
3. The variance of $\varepsilon$, $\sigma_\varepsilon^2$, is constant, no matter what the value of x.
4. The errors associated with any two y values are independent. As a result, the value of the error variable at one point does not affect the value of the error variable at another point.
Requirements 1, 2, and 3 can be interpreted in another way: for each value of x, y is a normally distributed random variable whose mean is
$E(y) = \beta_0 + \beta_1 x$
and whose standard deviation is $\sigma_\varepsilon$.
Since the mean depends on x, the expected value is often expressed as
$E(y \mid x) = \beta_0 + \beta_1 x$
The standard deviation, however, is not influenced by x, because it is constant for all values of x.
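A small Python sketch of this interpretation (parameter values again invented for illustration): at each fixed x, repeated draws of y centre on $\beta_0 + \beta_1 x$ while the spread stays at the same $\sigma_\varepsilon$:

```python
import numpy as np

rng = np.random.default_rng(11)
beta0, beta1, sigma = 2.0, 0.5, 1.0   # illustrative population values

# At each fixed x, y is normal with mean beta0 + beta1*x and the SAME sigma
for x in (1.0, 5.0, 9.0):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=100_000)
    print(f"x = {x}: mean(y) = {y.mean():.3f} "
          f"(E(y|x) = {beta0 + beta1 * x:.3f}), sd(y) = {y.std():.3f}")
```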
[Figure: Assumptions of the Simple Linear Regression Model. Identical normal distributions of errors, all centred on the regression line $E(y) = \beta_0 + \beta_1 x$.]
Regression diagnostics

- Most departures from the required conditions can be diagnosed by examining the residuals.
- Excel allows us to calculate these residuals and apply various graphical techniques to them.
- Analysis of the residuals allows us to determine whether the variance of the error variable is constant and whether the errors are independent.
- Excel can also generate standardised residuals. The residuals are standardised in the usual way, by subtracting the mean (0 in this case) and dividing by the standard deviation (or, in this case, its estimate $s_\varepsilon$), as sketched below.
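A Python sketch of this calculation (Excel's built-in standardised residuals are computed somewhat differently, but this divide-by-$s_\varepsilon$ form matches the description above):

```python
import numpy as np

def standardised_residuals(x, y):
    """Least squares residuals divided by the standard error of
    estimate s_e = sqrt(SSE / (n - 2)); their mean is already 0."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    residuals = y - (b0 + b1 * x)
    s_e = np.sqrt(np.sum(residuals ** 2) / (n - 2))   # standard error of estimate
    return residuals / s_e
```

Standardised residuals beyond about plus or minus 2 then flag observations with unusually large residuals.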
Non-normality
We can check for normality by drawing a histogram of
the residuals to see if it appears that the error variable is
normally distributed.
The tests in regression analysis are robust, so as long as the histogram at least resembles an approximate bell shape, or is not extremely non-normal, it is safe to assume that the normality requirement has been met.
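A sketch of this check in Python, using simulated data in place of Excel output (the data and parameter values are invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 100)   # simulated illustrative data

b1, b0 = np.polyfit(x, y, 1)                  # slope, intercept
residuals = y - (b0 + b1 * x)

plt.hist(residuals, bins=15, edgecolor="black")
plt.xlabel("Residual")
plt.title("Histogram of residuals: look for a rough bell shape")
plt.show()
```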
Expectation of zero
The use of the method of least squares to find the line of best fit ensures that this will always be the case: the residuals always sum, and hence average, to zero.
We can, however, observe from the histogram of the residuals that they are approximately symmetric about a value close to zero.
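A quick numeric check of this fact, continuing the simulated-data sketches above:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 50)   # simulated illustrative data

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# Least squares forces the residuals to average to zero
print(residuals.mean())   # ~0, up to floating-point rounding
```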
Heteroscedasticity
The variance of the error variable, $\sigma_\varepsilon^2$, is required to be constant.
When this requirement is violated, the condition is called heteroscedasticity.
Homoscedasticity refers to the condition when the requirement is satisfied.
One method of diagnosing heteroscedasticity is to plot the residuals against the x values or the predicted values of y, and look for any change in the spread of the residuals, as sketched below.
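A Python sketch of this diagnostic plot, with a deliberately heteroscedastic simulated data set (all values invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 200)
# Simulated violation: the error spread grows with x
y = 2.0 + 0.5 * x + rng.normal(0, 0.2 + 0.3 * x)

b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x
residuals = y - fitted

plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="grey")
plt.xlabel("Predicted value of y")
plt.ylabel("Residual")
plt.title("A fanning spread indicates heteroscedasticity")
plt.show()
```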
[Figure: Residual analysis and checking for model inadequacies. Four residual plots against x, $\hat{y}$, or time:
- Homoscedasticity: the residuals appear completely random, with no indication of model inadequacy.
- Heteroscedasticity: the variance of the residuals changes as x changes.
- Residuals plotted against time exhibit a linear trend.
- A curved pattern in the residuals, resulting from an underlying nonlinear relationship.]
Non-independence of the error term

- This requirement states that the values of the error variable must be independent.
- If the data are time series data, the errors are often correlated.
- Error terms which are correlated over time are said to be autocorrelated or serially correlated.
- We can often detect autocorrelation if we plot the residuals against the time period, as sketched below.
- If a pattern emerges, it is likely that the independence requirement is violated.
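A Python sketch of this plot, with simulated serially correlated errors (the carry-over factor 0.8 and all other values are invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
n = 100
t = np.arange(n)                       # time period

# Simulated autocorrelated errors: each carries over 0.8 of the last
eps = np.zeros(n)
for i in range(1, n):
    eps[i] = 0.8 * eps[i - 1] + rng.normal(0, 1.0)
y = 2.0 + 0.5 * t + eps

b1, b0 = np.polyfit(t, y, 1)
residuals = y - (b0 + b1 * t)

plt.plot(t, residuals, marker="o")
plt.axhline(0, color="grey")
plt.xlabel("Time period")
plt.ylabel("Residual")
plt.title("Runs of same-sign residuals suggest autocorrelation")
plt.show()
```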
Reading for next lecture
Read Chapter 18 Sections 18.5 and 18.7
(Chapter 11 Sections 11.5 and 11.7 abridged)