MULTIPLE REGRESSION-I
Regression analysis examines the relation between a single dependent variable Y and one
or more independent variables $X_1, \ldots, X_p$.
SIMPLE LINEAR REGRESSION MODELS
First Order Model with One Predictor Variable
$$Y_i = \beta_0 + \beta_1 X_{i1} + \varepsilon_i, \qquad i = 1, 2, \ldots, n$$
$\beta_0$ is the intercept of the line, and $\beta_1$ is the slope of
the line. A one-unit increase in X gives a $\beta_1$-unit
increase in the mean of Y.
$\varepsilon_i$ is called the statistical random error for the ith
observation $Y_i$. It accounts for the fact that the
statistical model does not give an exact fit to the
data.
$\varepsilon_i$ cannot be observed. We assume:
o $E(\varepsilon_i) = 0$
o $\mathrm{Var}(\varepsilon_i) = \sigma^2$ for all $i = 1, \ldots, n$
o $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j$
Response/Regression function and an example:
$$E(Y_i) = \beta_0 + \beta_1 X_{i1}$$
$$E(Y_i) = 9.5 + 2.1\,X_{i1}$$
The response $Y_i$, given the level of X in the ith
trial, $X_i$, comes from a probability distribution
whose mean is $9.5 + 2.1\,X_i$. Therefore, the
response/regression function relates the
means of the probability distributions of Y (for
given X) to the level of X.
$\beta_0$, $\beta_1$, and $\varepsilon$ are unknown parameters and have to
be estimated from the data. We use $b_0$, $b_1$, and $e$ for the
estimates of $\beta_0$, $\beta_1$, and $\varepsilon$.
Fitted Values and an Example:
$$\hat{Y}_i = b_0 + b_1 X_{i1}$$
is called the fitted value.
$$\hat{Y}_i = 9.6 + 1.9\,X_{i1}$$
Residual and an Example:
$$e_i = Y_i - \hat{Y}_i$$
is called the residual.
$$e_i = 108 - (9.6 + 1.9 \times 45) = 12.9$$
$$\varepsilon_i = 108 - (9.5 + 2.1 \times 45) = 108 - 104 = 4$$
o The residuals are observable, and can be used to check
assumptions on the statistical random errors $\varepsilon_i$.
o Points above the line have positive residuals, and points below the
line have negative residuals.
o A line that fits the data well has small residuals.
o We want the residuals to be small in magnitude, because large
negative residuals are as bad as large positive residuals. Therefore,
we cannot simply require $\sum_i e_i \approx 0$.
o In fact, $\sum_i e_i = 0$ (derivation on board; see also the sketch after this list).
o Two immediate solutions:
- Require $\sum_i |e_i|$ to be small
- Require $\sum_i e_i^2$ to be small
o We consider the second option because working with squares is
mathematically easier than working with absolute values (for
example, it is easier to take derivatives).
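A minimal sketch of the board derivation that $\sum_i e_i = 0$ (it relies on normal equation (1), which is derived just below):
$$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} \bigl(Y_i - b_0 - b_1 X_i\bigr) = 0 \quad \text{by normal equation (1)},$$
so for the least squares line the residuals always cancel in sum, which is why requiring $\sum_i e_i \approx 0$ cannot measure the quality of the fit.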
Find the pair $(b_0, b_1)$ that minimizes $\sum_i e_i^2$ (least
squares estimates).
o We call SSE $= \sum_i e_i^2$ the Residual Sum of
Squares.
o We want to find the pair $(b_0, b_1)$ that
minimizes
$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \bigl(Y_i - b_0 - b_1 X_i\bigr)^2$$
o We set the partial derivatives of SSE with
respect to $b_0$, $b_1$ equal to zero:
$$\frac{\partial SSE}{\partial b_0} = \sum_i (-1)(2)\bigl(Y_i - b_0 - b_1 X_i\bigr) = 0
\;\Longrightarrow\; \text{Normal equation (1): } \sum_i \bigl(Y_i - b_0 - b_1 X_i\bigr) = 0$$
$$\frac{\partial SSE}{\partial b_1} = \sum_i (-X_i)(2)\bigl(Y_i - b_0 - b_1 X_i\bigr) = 0
\;\Longrightarrow\; \text{Normal equation (2): } \sum_i X_i \bigl(Y_i - b_0 - b_1 X_i\bigr) = 0$$
Then the solution is (derivation on board; a sketch follows):
$$b_0 = \bar{Y} - b_1 \bar{X}, \qquad
b_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}$$
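A minimal sketch of the board derivation, starting from the two normal equations: equation (1) gives
$$\sum_i Y_i - n b_0 - b_1 \sum_i X_i = 0 \;\Longrightarrow\; b_0 = \bar{Y} - b_1 \bar{X};$$
substituting this into equation (2) and using $\sum_i \bar{X}(Y_i - \bar{Y}) = 0$ and $\sum_i \bar{X}(X_i - \bar{X}) = 0$,
$$\sum_i X_i \bigl(Y_i - \bar{Y} - b_1 (X_i - \bar{X})\bigr) = 0
\;\Longrightarrow\;
b_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}.$$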
Properties of the residuals
o $\sum_i e_i = 0$, since the regression line goes through
the point $(\bar{X}, \bar{Y})$.
o $\sum_i X_i e_i = 0$ and $\sum_i \hat{Y}_i e_i = 0$: the residuals are
uncorrelated with the independent variables
$X_i$ and with the fitted values $\hat{Y}_i$ (prove it on
board).
Why $\sum_i e_i = 0$, $\sum_i X_i e_i = 0$ and $\sum_i \hat{Y}_i e_i = 0$?
In fact, from normal equations (1) and (2) we
can immediately tell $\sum_i e_i = 0$ and $\sum_i X_i e_i = 0$. Since
$$\sum_i \hat{Y}_i e_i = \sum_i (b_0 + b_1 X_i)\, e_i = b_0 \sum_i e_i + b_1 \sum_i X_i e_i,$$
we can easily see that $\sum_i \hat{Y}_i e_i = 0$.
o Least squares estimates are uniquely defined
as long as the values of the independent
variable are not all identical. If all $X_i$ are identical, the
denominator $\sum_i (X_i - \bar{X})^2 = 0$ and $b_1$ is not defined (draw figure).
Point Estimation of the Error Term Variance $\sigma^2$
o Single population: unbiased sample
variance estimator of the population
variance $\sigma^2$
$$Y_i = \mu + \varepsilon_i, \quad i = 1, \ldots, n; \qquad
E(\varepsilon_i) = 0,\ \mathrm{Var}(\varepsilon_i) = \sigma^2,\ \mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$$
$$s^2 = \frac{\sum_i (Y_i - \bar{Y})^2}{n - 1}, \qquad E(s^2) = \sigma^2$$
o Regression model: unbiased sample variance
estimator of the population variance
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, \ldots, n; \qquad
E(\varepsilon_i) = 0,\ \sigma^2(\varepsilon_i) = \sigma^2,\ \mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$$
$$MSE = \frac{SSE}{n - 2}, \qquad E(MSE) = \sigma^2$$
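As a quick numerical check, using the Toluca example reported in the SAS output later in these notes (SSE = 54825, n = 25):
$$MSE = \frac{54825}{25 - 2} \approx 2383.72, \qquad \sqrt{MSE} \approx 48.82,$$
which match the Error mean square and the Root MSE printed by SAS.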
Inference in Regression Analysis:
Estimator of $\beta_1$: $\quad b_1 = \dfrac{SS_{XY}}{SS_{XX}}$, where $SS_{XY} = \sum_i (X_i - \bar{X})(Y_i - \bar{Y})$ and $SS_{XX} = \sum_i (X_i - \bar{X})^2$
Mean of $b_1$: $\quad E(b_1) = \beta_1$
Variance of $b_1$: $\quad \sigma^2(b_1) = \dfrac{\sigma^2}{SS_{XX}}$
Estimated variance: $\quad s^2(b_1) = \dfrac{MSE}{SS_{XX}}$
t-score and $1-\alpha$ CI of $\beta_1$:
$$t^* = \frac{b_1 - \beta_1}{s(b_1)} \sim t(n-2), \qquad
1-\alpha \text{ CI: } \; b_1 \pm t(1 - \alpha/2;\, n-2)\, s(b_1)$$
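For example, with the Toluca fit in the SAS output below ($b_1 = 3.5702$, $s(b_1) = 0.34697$, $n = 25$) and $\alpha = 0.05$, so that $t(0.975;\,23) \approx 2.069$, the 95% confidence interval for $\beta_1$ is approximately
$$3.5702 \pm 2.069 \times 0.34697 \;\approx\; (2.85,\ 4.29).$$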
Estimating the Mean Value at $X_h$:
$$E(Y_h) = \beta_0 + \beta_1 X_h$$
Estimator: $\quad \hat{Y}_h = b_0 + b_1 X_h$
Mean: $\quad E(\hat{Y}_h) = E(Y_h)$
Variance: $\quad \sigma^2(\hat{Y}_h) = \sigma^2 \left[ \dfrac{1}{n} + \dfrac{(X_h - \bar{X})^2}{SS_{XX}} \right]$
Estimated variance: $\quad s^2(\hat{Y}_h) = MSE \left[ \dfrac{1}{n} + \dfrac{(X_h - \bar{X})^2}{SS_{XX}} \right]$
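In SAS, one way to obtain interval estimates of the mean response is the CLM option of the MODEL statement in PROC REG, which prints confidence limits for $E(Y_h)$ at each observed X value; a minimal sketch for the Toluca data set created in the SAS program below:
proc reg data=Toluca;
   /* CLM prints, for every observation, the fitted value together with
      confidence limits for the mean response; ALPHA= sets the 1-alpha level */
   model workhrs=lotsize / clm alpha=0.05;
run;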
Analysis of Variance Approach to Regression
Analysis
$$SSTOT = SSR + SSE$$
$$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$$
$$\sum_i (Y_i - \bar{Y})^2 = \sum_i (\hat{Y}_i - \bar{Y})^2 + \sum_i (Y_i - \hat{Y}_i)^2$$
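The cross product term drops out when the decomposition $Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$ is squared and summed, because of the residual properties $\sum_i e_i = 0$ and $\sum_i \hat{Y}_i e_i = 0$ shown earlier:
$$\sum_i (\hat{Y}_i - \bar{Y})(Y_i - \hat{Y}_i) = \sum_i (\hat{Y}_i - \bar{Y})\, e_i = \sum_i \hat{Y}_i e_i - \bar{Y} \sum_i e_i = 0.$$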
Basic Table

Source of Variation | SS                                        | df  | MS                | E{MS}
Regression          | $SSR = \sum_i (\hat{Y}_i - \bar{Y})^2$    | 1   | $MSR = SSR/1$     | $\sigma^2 + \beta_1^2 \sum_i (X_i - \bar{X})^2$
Error               | $SSE = \sum_i (Y_i - \hat{Y}_i)^2$        | n-2 | $MSE = SSE/(n-2)$ | $\sigma^2$
Total               | $SSTOT = \sum_i (Y_i - \bar{Y})^2$        | n-1 |                   |
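Filled in with the Toluca numbers from the SAS output below (n = 25): SSR = 252378 with 1 df, SSE = 54825 with 23 df, SSTOT = 307203 with 24 df, and indeed
$$252378 + 54825 = 307203, \qquad MSE = \frac{54825}{23} \approx 2383.72.$$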
o Test statistic:
$$F^* = \frac{MSR}{MSE} \sim F(1,\, n-2) \text{ under } H_0$$
o F distribution:
Numerator degrees of freedom $df_R = 1$
Denominator degrees of freedom $df_E = n - 2$
o Tail probability: the critical value $F(1-\alpha;\, 1,\, n-2)$ satisfies
$P\{F(1, n-2) > F(1-\alpha;\, 1,\, n-2)\} = \alpha$
o Hypotheses:
$$H_0:\ \beta_1 = 0 \qquad H_1:\ \beta_1 \neq 0$$
o Decision rule, for level of significance $\alpha$:
If $F^* \le F(1-\alpha;\, 1,\, n-2)$, conclude $H_0$.
If $F^* > F(1-\alpha;\, 1,\, n-2)$, conclude $H_1$.
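For example, with the Toluca output below ($n = 25$, $F^* = 105.88$) and $\alpha = 0.05$, the critical value is $F(0.95;\,1,\,23) \approx 4.28$, so
$$F^* = 105.88 > 4.28 \;\Longrightarrow\; \text{conclude } H_1:\ \beta_1 \neq 0,$$
in agreement with the reported Pr > F of <.0001.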
Fitting a regression in SAS
data Toluca;
   infile 'C:\stat231B06\ch01ta01.txt';
   input lotsize workhrs;
run;
proc reg data=Toluca;
   /* least squares estimation of the regression coefficients */
   model workhrs=lotsize;
   output out=results p=yhat r=residual;  /* yhat holds the fitted values,
                                             residual holds the residual values */
run;
proc print data=results;
   var yhat residual;  /* print the fitted values and residual values */
run;
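The notes point out above that the residuals can be used to check the assumptions on the $\varepsilon_i$; a minimal sketch of one such check, a residual-versus-fitted plot built from the results data set created above (PROC SGPLOT is assumed to be available):
proc sgplot data=results;
   /* under the model assumptions the residuals should scatter evenly
      around zero with roughly constant spread across the fitted values */
   scatter x=yhat y=residual;
   refline 0 / axis=y;
run;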
SAS Output

The REG Procedure
Model: MODEL1
Dependent Variable: workhrs

Number of Observations Read    25
Number of Observations Used    25

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1            252378         252378     105.88    <.0001
Error              23             54825     2383.71562
Corrected Total    24            307203

Root MSE           48.82331    R-Square    0.8215
Dependent Mean    312.28000    Adj R-Sq    0.8138
Coeff Var          15.63447

Parameter Estimates
Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1              62.36586          26.17743       2.38      0.0259
lotsize       1               3.57020           0.34697      10.29      <.0001

Obs      yhat     residual
  1    347.982      51.018
  2    169.472     -48.472
  3    240.876     -19.876
  4    383.684      -7.684
  5    312.280      48.720
  6    276.578     -52.578
  7    490.790      55.210
  8    347.982       4.018
  9    419.386     -66.386
 10    240.876     -83.876
 11    205.174     -45.174
 12    312.280     -60.280
 13    383.684       5.316
 14    133.770     -20.770
 15    455.088     -20.088
 16    419.386       0.614
 17    169.472      42.528
 18    240.876      27.124
 19    383.684      -6.684
 20    455.088     -34.088
 21    169.472     103.528
 22    383.684      84.316
 23    205.174      38.826
 24    347.982      -5.982
 25    312.280      10.720
Extension to multiple regression model
Degrees of freedom: the number of independent components that are needed to calculate the
respective sum of squares.
(1) SSTOT $= \sum_i (y_i - \bar{y})^2$ is the sum of n squared components. However, since $\sum_i (y_i - \bar{y}) = 0$, only n-1 components are needed for its calculation; the remaining one can always be
calculated from $y_n - \bar{y} = -\sum_{i=1}^{n-1} (y_i - \bar{y})$. Hence SSTOT has n-1 degrees of freedom.
(2) SSE $= \sum_i e_i^2$ is the sum of n squared components. However, there are p restrictions
among the residuals, coming from the p normal equations
$\mathbf{X}'\mathbf{Y} = \mathbf{X}'\mathbf{X}\mathbf{b}$, i.e. $\mathbf{X}'(\mathbf{Y} - \mathbf{X}\mathbf{b}) = \mathbf{X}'\mathbf{e} = \mathbf{0}_{p \times 1}$. Hence SSE has n-p degrees of freedom.
(3) SSR $= \sum_i (\hat{y}_i - \bar{y})^2$ is the sum of n squared components, but all fitted values $\hat{Y}_i$ are
calculated from the same estimated regression line, which involves only p estimated
coefficients; one further degree of freedom is lost because of the restriction
$\sum_i (\hat{y}_i - \bar{y}) = \sum_i (\hat{y}_i - y_i + y_i - \bar{y}) = \sum_i (\hat{y}_i - y_i) + \sum_i (y_i - \bar{y}) = 0$. Hence SSR has p-1
degrees of freedom.
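Putting the three counts together gives the degrees-of-freedom decomposition that accompanies SSTOT = SSR + SSE in the multiple regression ANOVA table:
$$\underbrace{n - 1}_{SSTOT} \;=\; \underbrace{p - 1}_{SSR} \;+\; \underbrace{n - p}_{SSE},$$
which reduces to $(n - 1) = 1 + (n - 2)$ in the simple linear regression case $p = 2$ treated above.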