BUSINESS STATISTICS (PART-36)
UNIT-III
REGRESSION AND CORRELATION (PART 5)
1. INTRODUCTION
Hello, dear students. In my last lecture on regression analysis, we
learned how to develop a linear regression model from bivariate
sample data. Once the model is developed, it is ready for the
purpose of prediction. Now, the mathematical model, i.e. the
regression model which we develop, is based on a sound
mathematical principle, the method of least squares; therefore
this model is appropriate to use. But since it is a
non-deterministic model, built on the basis of sample data, it is
better to check the validity of the model. Therefore, in this
lecture we shall discuss the method of checking the
appropriateness or adequacy of the fitted regression model.
It is of primary importance to be aware that a regression
analysis is not completed by fitting a model by least squares and
providing estimates of the dependent variable y. That tells only
half of the story. Because we fit a linear regression model on
the basis of sample observations, we can never be sure that a
particular model is correct unless we check the appropriateness
or the validity of the model.
2. EXAMINING THE RESIDUALS
Now, how do we examine the model? In order to check the linear
regression model, let us consider the residual

eᵢ = yᵢ − ŷᵢ

Here yᵢ is the observed value, ŷᵢ denotes the corresponding value
predicted by the fitted linear regression model, and eᵢ denotes
the error or residual.
Such errors exist for all n observed values, i.e. i = 1, . . ., n.
In order to understand how to check the validity of the model, let
us consider an example. This example concerns the reduction of
automobile pollution (nitrogen oxides) achieved by adding a
chemical additive to petrol or diesel.
The amount of additive and the reduction in nitrogen oxides for
10 cars are given here in this table. The first row shows the
amount of additive, which is the independent variable x, taking
the values 1, 1, 2, 3, 4, 4, 5, 6, 6, 7. These figures show the
amount of additive mixed into the petrol (gasoline). The
resulting reduction in nitrogen oxides is denoted by the variable
y, which takes the values 2.1, 2.5, 3.1 and so on. That is, when
the amount of additive x is 1 unit, the reduction in nitrogen
oxides is 2.1 units; when the amount of additive is 2, the
reduction is 3.1; and so on. For the last car, where the amount
of additive was 7 units, the reduction in nitrogen oxides was 4.8
units. Now, when we fit a linear regression model to these data
by the method of least squares we get this equation:
ŷ  2.0  0.387 x
This model provides estimated values for given x values. That
is,

ŷᵢ = 2.00 + 0.387xᵢ,  where i = 1, 2, ..., n
If we put in a particular value of x, we get an estimated value
of y, which we denote by ŷᵢ. So, using this linear regression
equation, for given values of x we obtain predicted values. That
is, when we put x = 1 into the equation we get 2.387 as the
predicted value of y; for x = 2 we get the predicted value 2.774;
and so on. For x = 7 the predicted value is 4.709.
Now we have two sets of y values. One set is given in the second
column: the observed values, which we denote by yᵢ. The other set
consists of the predicted values, which we denote by ŷᵢ. The last
column gives the difference between the observed and predicted
values; this is the error term due to the model we are using.
These values are −0.287, 0.113, and so on, up to 0.078 and 0.091.
So these are the values of the residuals, or errors. Now the sum
of squares due to error (SSE) is obtained by squaring all the
residuals eᵢ and summing them:

SSE = (−0.287)² + (0.113)² + ... + (0.091)² = 0.7376
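The fit and residual calculation above can be sketched in Python. The transcript lists only a few of the observed y values, so the full y vector below is a reconstruction chosen to match the reported summary figures (Sx² = 40.9, Sy² = 6.85, Sxy = 15.81, SSE ≈ 0.7376); treat it as an illustrative assumption rather than the original table.

```python
# Least-squares fit and residuals for the additive / NOx-reduction example.
# x: amount of additive; y: reduction in nitrogen oxides.
# y values beyond those quoted in the transcript are reconstructed (assumption).
x = [1, 1, 2, 3, 4, 4, 5, 6, 6, 7]
y = [2.1, 2.5, 3.1, 3.0, 3.8, 3.2, 4.3, 3.9, 4.4, 4.8]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Corrected sums of squares and cross-products.
Sxx = sum(xi**2 for xi in x) - n * x_bar**2                      # 40.9
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar   # 15.81

# Least-squares coefficients.
b1 = Sxy / Sxx            # slope, about 0.387
b0 = y_bar - b1 * x_bar   # intercept, about 2.00

# Predicted values, residuals, and the error sum of squares.
y_hat = [b0 + b1 * xi for xi in x]
residuals = [yi - yhi for yi, yhi in zip(y, y_hat)]
SSE = sum(e**2 for e in residuals)

print(round(b1, 3), round(b0, 2))   # 0.387 2.0
print(round(SSE, 4))                # 0.7376
```

The individual residuals here differ from the transcript's in the third decimal place because the transcript rounds the coefficients to 2.00 and 0.387 before predicting; the SSE agrees with the reported 0.7376.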
And for this given set of data we have:

Sx² = Σx² − n·x̄² = 40.9
Sy² = Σy² − n·ȳ² = 6.85
Sxy = Σxy − n·x̄·ȳ = 15.81

and b₁ (the regression coefficient) = 0.387, as our linear
regression line is given by

ŷ = 2.00 + 0.387x
3. THE COEFFICIENT OF DETERMINATION
Now, in order to check the validity of the model, we consider the
coefficient of determination. What is the coefficient of
determination? It is an index for checking the fitted linear
regression model, and it is given by

r² = b₁²Sx² / Sy²

That is, r² is equal to

(SS due to linear regression) / (Total SS of y)

Now, this value of r², for the given data, comes out to be 0.89.
It means that 89% of the variability in y is explained by the
linear regression, and the linear model seems satisfactory in
this respect.
Note: r is nothing but the Karl Pearson correlation coefficient.
So to check the model we have to calculate r², i.e. the square of
the correlation coefficient.
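As a quick numeric check of this index, r² can be computed both from the fitted slope and as the squared Pearson correlation coefficient. The y values below are reconstructed to match the transcript's reported sums (Sx² = 40.9, Sy² = 6.85, Sxy = 15.81), so they are an assumption:

```python
# Compute r^2 two ways: via b1^2 * Sx^2 / Sy^2, and as the
# squared Pearson correlation coefficient. Both should agree.
x = [1, 1, 2, 3, 4, 4, 5, 6, 6, 7]
y = [2.1, 2.5, 3.1, 3.0, 3.8, 3.2, 4.3, 3.9, 4.4, 4.8]  # reconstructed (assumption)
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

Sxx = sum(xi * xi for xi in x) - n * x_bar**2
Syy = sum(yi * yi for yi in y) - n * y_bar**2
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar

b1 = Sxy / Sxx
r2_from_fit = b1**2 * Sxx / Syy              # b1^2 Sx^2 / Sy^2
r2_from_corr = (Sxy / (Sxx * Syy)**0.5)**2   # squared Pearson r

print(round(r2_from_fit, 2))   # 0.89
```

Both formulas give the same number, which is the point of the derivation in the next section.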
When the value of r² is small, we can only conclude that a
straight-line relation does not give a good fit to the data. Such
a case may arise for either of the following reasons:
(a) There is little relation between the variables, in the sense
that the scatter diagram fails to exhibit any pattern, as
illustrated in this figure. That is, there is no relation between
the x and y variables.
(b) There is a prominent relation, but it is non-linear in
nature; that is, the scatter is banded around a curve rather than
a line. The part of Sy² that is explained by straight-line
regression is small because the model is inappropriate. Some
other relationship may improve the fit substantially.
In this graph we show a curvilinear relationship between the y
and x data. A linear regression model should not be fitted to
such data.
4. THEORETICAL ASPECT OF r²
We have taken r² as an index to check the validity of the linear
regression model. But what is the theoretical basis of taking r²
as such an index? Let us look at it here. The observed value yᵢ
can be written as:

yᵢ = (b₀ + b₁xᵢ) + (yᵢ − b₀ − b₁xᵢ)

using the linear regression model. We can see that on the LHS we
have yᵢ, and on the RHS it is also yᵢ, but written in this
particular form. So we have:

yᵢ = observed value
(b₀ + b₁xᵢ) = the part explained by the linear relation, since ŷᵢ = b₀ + b₁xᵢ
(yᵢ − b₀ − b₁xᵢ) = the residual, or deviation from the linear relation
As an overall measure of the discrepancy, or the variation from
linearity, we can consider the sum of squares of the residuals,
as we have defined it earlier. That is,

SSE = Σ(yᵢ − b₀ − b₁xᵢ)² = Sy² − b₁²Sx²

And the total variability of the y values is reflected in the
total sum of squares, which can be expressed as

Σ(yᵢ − ȳ)²

Now this total sum of squares can be expressed in terms of
b₁²Sx² and SSE by the equation

Sy² = b₁²Sx² + SSE
As we know:

b₁²Sx² = SS explained by the linear relation
SSE = residual SS, or sum of squares due to error

Now, look at this equation: if b₁²Sx² is large, then SSE will be
small, because

SSE = Sy² − b₁²Sx²
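This decomposition of the total sum of squares into an explained part and a residual part can be verified numerically. As in the earlier sketches, the y values below are reconstructed to match the transcript's reported sums, so they are an assumption:

```python
# Verify the sum-of-squares decomposition Sy^2 = b1^2 * Sx^2 + SSE.
x = [1, 1, 2, 3, 4, 4, 5, 6, 6, 7]
y = [2.1, 2.5, 3.1, 3.0, 3.8, 3.2, 4.3, 3.9, 4.4, 4.8]  # reconstructed (assumption)
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

Sxx = sum(xi * xi for xi in x) - n * x_bar**2                    # Sx^2
Syy = sum(yi * yi for yi in y) - n * y_bar**2                    # Sy^2 (total SS of y)
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar

b1 = Sxy / Sxx
b0 = y_bar - b1 * x_bar
SSE = sum((yi - b0 - b1 * xi)**2 for xi, yi in zip(x, y))

explained = b1**2 * Sxx
# The total SS splits exactly into explained SS plus residual SS.
print(abs(Syy - (explained + SSE)) < 1e-9)   # True
```

Note that the identity holds exactly only when b₀ and b₁ are the least-squares estimates; that is what makes SSE = Sy² − b₁²Sx² valid.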
Now, the value of Sy² is not in our control, because Sy² is the
sum of squares of the observed values; it is what we get through
the experiment, so we cannot control it. But we can control the x
variable, the independent variable: we choose as the independent
variable one that has a high correlation coefficient with the y
variable. So we pick the x variable in such a way that the factor
b₁²Sx² is large enough to give a minimum value of SSE. This is
the basis of taking r² as an indicator to test the validity of
the linear regression model. Now let us look at this part: what
is r² in terms of Sx² and Sy²?
As an index of how well the straight line fits, it is then
reasonable to consider the proportion

r² = b₁²Sx² / Sy² = (SS due to linear regression) / (Total SS of y)
Here, r² represents the proportion of the y variability explained
by the linear relation with x. And what is b₁? We know that

b₁ = Sxy / Sx²

That is the regression coefficient of y on x, so r² can also be
written as

r² = Sxy² / (Sx² Sy²)

Here Sxy, we know, is the covariance part, and so the value

r = Sxy / (Sx Sy)

is the sample correlation coefficient, which we denote by r.
5. SUMMARY
In today's lecture we have discussed a very important aspect of
regression analysis: how to check the appropriateness or validity
of the linear regression model. We know that our process of
developing a model is as follows. Once the sample data are given
to us, first, in order to know what type of relationship exists
between the y and x variables, we plot a scatter diagram. From
the scatter diagram we get an idea of whether the relation
between y and x is linear, curvilinear, or exponential. Once we
do that, we develop an appropriate model by the method of least
squares. But once we develop the model, it is proper for us to
check its validity, and that is what we have learned in this
lecture today.
That is, how to test the validity of the linear regression model,
which we know is a non-deterministic or statistical model,
different from a deterministic or mathematical model. We have
learned today that we can check the validity of the model by
calculating the square of the sample correlation coefficient,
which tells us whether the model is valid or not. Once we are
satisfied that the model is valid, we can use it for the purpose
of prediction.
Thank You!