Department of Mathematics
Faculty of Science and Engineering
City University of Hong Kong
MA 3518: Applied Statistics
Chapter 4: Linear Regression Analysis – Part I
Regression analysis is mainly concerned with studying the relationship
between a dependent variable and several independent variables. It
estimates quantitatively the functional relationship between the
variables based on observations. Linear regression models assume
a linear relationship between the dependent variable and the
independent variables. They are easy to implement and provide a first
approximation to the underlying relationship between the
variables. This chapter introduces various concepts and modern
statistical computing techniques for fitting linear regression models
in SAS. Topics covered in this chapter are listed as follows:
Section 4.1: Regression Analysis
Section 4.2: Simple Linear Regression Models
Section 4.3: Multiple Linear Regression Models
Section 4.4: Fitting Regression Models by SAS
Explain Variation by Identifying Key Factors!
Section 4.1: Regression Analysis
1. Regression analysis:
 Investigate the quantitative functional relationship between
a dependent variable (or response variable) and several
independent variables (or explanatory variables)
 Use the functional relationship to explain the variability of
the response variable by the explanatory variables in the
presence of random error
2. Linear regression models:
 Assume a linear relationship between the response variable
and the explanatory variables
 Simple linear regression models: One explanatory variable
 Multiple linear regression models: More than one
explanatory variable
3. Real-life Examples:
 Studying the impact of various factors on the yield of a
chemical process
Fit a multiple regression model with:
Response variable: The yield of a chemical process
Explanatory variables: Temperature, the amount of catalyst
used, the relative proportions of
ingredients
 Investigate impact of various factors on the response of a
patient to a treatment
Fit a multiple regression model with:
Response variable: The response of a patient to a treatment
Explanatory variables: Gender, Age and the stage of the
disease
4. Prediction based on the fitted regression model from the data:
 Use the fitted regression model to predict the yield of a new
chemical process
 Use the fitted regression model from the data to predict the
response of a new patient to the treatment
Section 4.2: Simple Linear Regression Models
1. The model:
 Describe the relationship between the dependent variable Y
and the independent variable X
 Assume the following straight line relationship between X
and Y (i.e. Regression line):
Y = a + b X + e,
where a and b are the unknown regression parameters to be
estimated from the data. a represents the intercept and b
represents the slope of the regression line.
2. Systematic and random components:
 Systematic component (a + b X):
Represent the part of the variability of the response variable
Y that can be explained by the explanatory variable X or the
regression model
 Random component (e):
Represent the part of the variability of the response variable
that cannot be explained by the regression model
3. Data form for simple linear regression model:
 Consider a random sample of n pairs of observations (X1, Y1),
(X2, Y2), …, (Xn, Yn)
Write the simple linear regression model in the data form as
follows:
Yi = a + b Xi + ei, i = 1, 2, …, n
where ei represents the random error corresponding to
the i-th observation
 If (X1, Y1), (X2, Y2), …, (Xn, Yn) are n pairs of observations over
time, the simple linear regression model is a time-series
regression and can be represented by the time index t as
follows:
Yt = a + b Xt + et, t = 1, 2,…, n
4. Basic assumptions:
 The response variable Y is continuous
 Xi’s are observed without measurement error (i.e. Xi’s are
non-random)
 ei’s are i.i.d. ~ N(0, σ²), i = 1, 2, …, n, where σ² is the
unknown population variance for the random error
 e and X are independent
5. Estimation of Regression Models:
 Determine a regression line that is best-fit to the data by
finding the optimal values for a and b (Optimal?)
 Method of Linear Least Squares Estimation (LSE): Estimate
a and b by minimizing the sum of squared error
SSE = Σ_{i=1}^{n} (yi – a – b xi)² = Σ_{i=1}^{n} ei²
Let ae and be denote the linear least squares estimators for a
and b, respectively
 Least squares estimates ae and be:
be = Sxy/Sxx and ae = ȳ – be x̄,
where Sxy = Σ_{i=1}^{n} (xi – x̄)(yi – ȳ) and Sxx = Σ_{i=1}^{n} (xi – x̄)²
 Fitted regression line:
ye = ae + be x
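For a quick illustration with a small hypothetical dataset (not from the
course data), take the n = 3 pairs (1, 2), (2, 4), (3, 5). Then x̄ = 2, ȳ = 11/3,
Sxy = (1 – 2)(2 – 11/3) + (2 – 2)(4 – 11/3) + (3 – 2)(5 – 11/3) = 5/3 + 0 + 4/3 = 3,
Sxx = (1 – 2)² + (2 – 2)² + (3 – 2)² = 2,
so be = Sxy/Sxx = 1.5 and ae = ȳ – be x̄ = 11/3 – 1.5 × 2 = 2/3 ≈ 0.667.
The fitted regression line is ye = 0.667 + 1.5 x.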
6. Partition of sum of squares:
 Notations and terminologies
Let SST = Σ_{i=1}^{n} (yi – ȳ)² = Total sum of squares
SSE = Σ_{i=1}^{n} (yi – yie)² = Sum of squares error (Residual
sum of squares)
SSR = Σ_{i=1}^{n} (yie – ȳ)² = Sum of squares regression (Model
sum of squares)
where yie = ae + be xi denotes the i-th fitted value
 Partition of sum of squares:
SST = SSE + SSR
 Partition of degrees of freedom:
Consider the simple linear regression model with the
number of unknown parameters p = 2
Degree of freedom of SST = d.f.(SST) = n –1
Degree of freedom of SSE = d.f.(SSE) = n – 2
Degree of freedom of SSR = d.f.(SSR) = p – 1 = 1
d.f.(SST) = d.f.(SSE) + d.f.(SSR)
7. Estimation of the error variance σ²
 An unbiased estimate s² of σ² is given by:
s² = SSE/(n – p) = SSE/(n – 2) = [1/(n – 2)] Σ_{i=1}^{n} (yi – ae – be xi)² = MSE
8. Hypothesis testing for regression models
 Test for the significance of the relationship between X and Y
Main idea: Construct the following ANOVA table by
partitioning the sum of squares

Source    d.f.         S.S.    M.S.                    F
Model     p – 1 = 1    SSR     MSR = SSR/(p – 1)       F = MSR/MSE
Error     n – 2        SSE     MSE = SSE/(n – 2)
Total     n – 1        SST

Test H0: b = 0 (i.e. The relationship between X and Y is not
significant)
H1: b ≠ 0 (i.e. There is a significant relationship
between X and Y)
at the significance level α
Test statistic under H0:
F ~ Fp–1, n–2 = F1, n–2
Intuitively, we reject H0 when F is sufficiently large
Decision: Reject H0 if F > F1, n–2(α) or the p-value
= P(F > Fobs) < α
9. Assessment of goodness-of-fit of a regression model
 A regression model can provide a good fit to the data if the
total variation of the response variable can be well explained
by the explanatory variables
 Coefficient of Determination R2
R2 = SSR/SST
R2 is defined as the ratio of the variation explained by the
regression line (i.e. SSR) to the total variation (i.e. SST)
 Some remarks on R2
(a) R2 takes values in [0, 1]
(b) R2 = 1 => SSR = SST => Perfect fit
(c) Large R2 => Large proportion of the total variation can be
explained by the regression model
=> Good fit
(d) Small R2 => Cannot provide a good fit
(e) R2 = [corr (X, Y)]2,
where corr (X, Y) is the correlation coefficient between X
and Y
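Continuing the small hypothetical example above: the fitted values are 2.167,
3.667 and 5.167, so SSE = (–0.167)² + (0.333)² + (–0.167)² ≈ 0.167,
SST = (2 – 11/3)² + (4 – 11/3)² + (5 – 11/3)² ≈ 4.667 and SSR = SST – SSE ≈ 4.5.
Hence s² = MSE = SSE/(n – 2) ≈ 0.167, R² = SSR/SST ≈ 0.964 and
F = MSR/MSE ≈ 4.5/0.167 = 27 on (1, n – 2) = (1, 1) degrees of freedom, so most
of the variation of Y is explained by the regression line.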
10. Estimation and Prediction by regression models
 Estimation:
(a) Objective: Estimate the mean of the response variable Y
when X is fixed at x (i.e. E(Y | X = x))
(b) Example: Estimate the average heights for women in HK
with ages between 25 and 30 when their
weights are fixed at 103lbs
(c) Point Estimator:
Note that E(Y | X = x) = a + b x
Hence, it is natural to adopt ae + be x to estimate
E(Y | X = x)
 Prediction:
(a) Objective: Predict a new value for the response variable Y
when X is fixed at x (i.e. Y | X = x)
(b) Example: Predict the height of a woman in HK with ages
between 25 and 30 when her weight is fixed at
103lbs
(c) Point Predictor:
Ye | (X = x) = ae + be x
 Remarks:
(a) The point estimator and the point predictor are the same
(b) The standard error of the point estimator is less than that
of the point predictor
Section 4.3: Multiple Linear Regression Models
1. Motivation:
 Explain the variation of a response variable by more than
one explanatory variable
2. The Model:
 Investigate the relationship between the dependent variable
Y and the independent variables X1, X2,…,Xp
 Describe how the dependent variable Y is related to the
independent variables X1, X2,…,Xp by the following linear
equation:
Y = a + b1X1 + b2X2 + b3X3 + …..+ bpXp + e,
where a is the intercept; b1, b2, ... and bp are the coefficients
of regression; e is random error
 In practical situations, the sample size n > the number of
regression coefficients p + 1
 Consider a random sample of n observations represented by
the following random vectors
(Yi, Xi1, Xi2, …,Xip), i = 1, 2, …, n
Write the multiple linear regression model in the following
data form:
Yi = a + b1 Xi1 + b2 Xi2 +…. + bp Xip + ei, i = 1, 2, …, n
 If the vectors of observations are collected over time, we
have the following multiple time-series regression model:
Yt = a + b1 Xt1 + b2 Xt2 +…. + bp Xtp + et, t = 1, 2,…, n,
where t represents the time index
 Basic assumptions:
(a) The response variable Y is continuous
(b) X1, X2,…,Xp are non-random
(c) The random errors ei are i.i.d. ~ N(0, σ²), where σ² is the
unknown common population variance
(d) e, X1, X2,…,Xp are independent
 Data Structure: Matrix Form
[ y1 ]   [ 1  x11  ..  x1p ]   [ a  ]   [ e1 ]
[ y2 ]   [ 1  x21  ..  x2p ]   [ b1 ]   [ e2 ]
[ :  ] = [ :   :        :  ] × [ :  ] + [ :  ]
[ :  ]   [ :   :        :  ]   [ :  ]   [ :  ]
[ yn ]   [ 1  xn1  ..  xnp ]   [ bp ]   [ en ]

Let Y denote the n × 1 column vector for the observations of
the response variable Y; b denote the vector of the unknown
parameters and e denote the n × 1 column vector for the
random errors
Write X for the n × (p + 1) design matrix as follows:

    [ 1  x11  ..  x1p ]
    [ 1  x21  ..  x2p ]
X = [ :   :        :  ]
    [ :   :        :  ]
    [ 1  xn1  ..  xnp ]
Then, the matrix form for the data structure of the multiple
linear regression can be written as the following matrix
equation:
Y=Xb+e
 When p = 1, the multiple linear regression model is reduced
to a simple linear regression model with two unknown
parameters.
3. Estimation of unknown parameters by linear least squares
method (LSE)
 Main idea: (Same as the case of simple linear regression)
Estimate the unknown parameters b by minimizing the sum
of squared error or deviation
 Matrix form:
min b (Y – X b)T(Y – X b)
where (Y – X b)T(Y – X b) is the matrix representation of
the sum of the squared error or deviation
 The solution: (The least squares estimators be)
be = arg [min b (Y – X b)T(Y – X b)]
= (XT X)-1 XT Y
 The best-fit regression equation
(a) In matrix form,
Ye = X be
(b) In scalar form,
Yie = ae + b1e Xi1 + b2e Xi2 +…. + bpe Xip , i = 1, 2, …, n
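As a computational sketch (not part of the original notes), the least squares
formula be = (XT X)-1 XT Y can be evaluated directly in SAS/IML, assuming the
SAS/IML module is available and that a hypothetical dataset mydata contains a
response y and regressors x1, x2, x3:
PROC IML;
USE mydata;                           /* hypothetical input dataset */
READ ALL VAR {x1 x2 x3} INTO Xv;      /* regressor columns */
READ ALL VAR {y} INTO Y;              /* response column */
n = NROW(Xv);
X = J(n, 1, 1) || Xv;                 /* n x (p+1) design matrix: column of ones plus regressors */
be = INV(X` * X) * X` * Y;            /* least squares estimates (X'X)^(-1) X'Y */
Ye = X * be;                          /* fitted values */
PRINT be;
QUIT;
In practice the same estimates are produced by PROC REG (Section 4.4); this
sketch only links the matrix formula to that output.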
4. Estimate the common population variance σ²:
 Let yi and yie denote the i-th observed value and the i-th
predicted value for the response variable Y
 Define the prediction error or the residual εi by
εi = yi – yie
Use the i-th residual εi to estimate the i-th random error ei
The unbiased estimator s² for σ² is given by:
s² = [1/(n – p – 1)] Σ_{i=1}^{n} εi² = [1/(n – p – 1)] Σ_{i=1}^{n} (yi – yie)² = MSE
5. Hypothesis testing for regression models:
 F-test for full model (i.e. Assessing the goodness-of-fit of a
regression line)
Test the following hypotheses at the significance level α:
H0: b1 = b2 = …..= bp = 0 (i.e. None of the explanatory
variables affect the response variable Y)
H1: bk ≠ 0, for some k = 1, 2, …, p (i.e. Some of the
explanatory variables affect the response variable Y)
Under H0: Y = a + e
Under H1: Y = a + b1X1 + b2X2 + b3X3 + …..+ bpXp + e,
for some non-zero bk
Main idea: Construct the following ANOVA table by
partitioning the sum of squares

Source    d.f.          S.S.    M.S.                      F
Model     p             SSR     MSR = SSR/p               F = MSR/MSE
Error     n – p – 1     SSE     MSE = SSE/(n – p – 1)
Total     n – 1         SST

Test statistic under H0:
F ~ Fp, n–p–1
Intuitively, we reject H0 when F is sufficiently large
Decision: Reject H0 if F > Fp, n–p–1(α) or the
p-value = P(F > Fobs) < α
 Coefficient of determination R2
(a) An alternative to the F-test of the full model for assessing
the goodness-of-fit of a regression model
(b) Definition
R² = SSR/SST = 1 – SSE/SST = [Corr(Yi, Yie)]²
where Corr(Yi, Yie) is the correlation between the observed
values Yi and the predicted values Yie
(c) Adjusted R²: To adjust R² in order to incorporate the
effect of the sample size n when n is small
relative to the number of parameters p + 1
R²adj = 1 – [(n – 1)/(n – p – 1)] (1 – R²)
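For example, in the multiple regression example of Section 4.4 (n = 23
observations, p = 4 explanatory variables, R² = 0.9507):
R²adj = 1 – (22/18) × (1 – 0.9507) = 1 – 1.2222 × 0.0493 ≈ 0.9397,
which agrees with the Adj R-Sq value reported by PROC REG there.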
 T-Test for an individual regression parameter:
(a) Rationale: To test whether the intercept or a particular
explanatory variable can be removed from
the regression model
(b) Test the following hypotheses at significance level α:
H0: bk = 0 v.s. H1: bk ≠ 0 for k = 1, 2, …, p
Under H0: (Reduced-form model)
Y = a + b1 X1 + b2 X2 + …+ bk-1 Xk-1 + bk+1 Xk+1+ …+ bp Xp + e
Under H1: (Full model)
Y = a + b1 X1 + b2 X2 + …+ bp Xp + e
(c) Test the significance of the intercept term at significance
level α as follows:
H0: a = 0 v.s. H1: a ≠ 0
Under H0: (Reduced-form model)
Y = b1 X1 + b2 X2 + …+ bp Xp + e
Under H1: (Full model)
Y = a + b1 X1 + b2 X2 + …+ bp Xp + e
(d) The procedures for the T-test:

Effect      Estimate    S.E.          t-ratio           p-value
Intercept   ae          √Var(ae)      ae/√Var(ae)       P(|T| > |ae/√Var(ae)|)
X1          b1e         √Var(b1e)     b1e/√Var(b1e)     P(|T| > |b1e/√Var(b1e)|)
:           :           :             :                 :
Xp          bpe         √Var(bpe)     bpe/√Var(bpe)     P(|T| > |bpe/√Var(bpe)|)
For example:
Test H0: b1 = 0 v.s. H1: b1 ≠ 0 at significance level α
Decision: Reject H0 if the p-value = P(|T| > |b1e/√Var(b1e)|) < α
Conclusion:
If H0 is rejected, we conclude that the effect X1 is
significant in explaining the variation of the response
variable Y
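As a numerical illustration taken from the simple regression output in
Section 4.4 of this chapter: the estimated coefficient of DJI is 0.13938 with
standard error 0.00606, so its t-ratio is 0.13938/0.00606 = 23.00 and the
corresponding p-value is below 0.0001; the effect of DJI is therefore highly
significant.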
 F-test for a subset of regression parameters simultaneously
The T-test mentioned above can only be used to test
hypotheses involving a single parameter at a time
Let j denote a positive integer less than p (i.e. j < p)
Test H0: bj+1 = bj+2 = …= bp = 0 (i.e. Only the first j explanatory
variables affect the response variable Y)
H1: H0 is not true
at significance level α
Under H0: (Reduced Model)
Y = a + b1 X1 + b2 X2 + ….+ bj Xj + e
Under H1: (Full Model)
Y = a + b1 X1 + b2 X2 + ….+ bp Xp + e
Write SSRR = Sum of squared regression for the reduced
model
SSER = Sum of squared error for the reduced model
SSRF = Sum of squared regression for the full model
SSEF = Sum of squared error for the full model
Test statistic (i.e. F-statistic for the test) under H0
F = [(SSRF – SSRR) / (p – j)] / [SSEF / (n – p – 1)] ~ Fp – j, n – p –1
Note that
(a) (SSRF – SSRR) represents the additional variation
explained by Xj+1, Xj+2, …, Xp beyond the first j
explanatory variables
(b) SSEF is the sum of squared errors for the full model
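In SAS, this F-test can be carried out with the TEST statement of PROC REG.
The following is a minimal sketch only; mydata, y and x1–x4 are hypothetical
names, and a concrete application of the TEST statement appears in Section 4.4:
PROC REG DATA = mydata;
MODEL y = x1 x2 x3 x4;               /* full model with p = 4 explanatory variables */
TEST x3 = 0, x4 = 0;                 /* F-test of H0: b3 = b4 = 0 against the full model */
RUN;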
6. Estimation and Prediction by regression models
 Estimation:
(a) Objective: Estimate the mean of the response variable Y
when X1 = x1, X2 = x2, …, Xp = xp, namely
E(Y | X1 = x1, X2 = x2, …, Xp = xp)
(b) Point Estimation:
Let x0 = (1, x1, x2, …, xp); b0 = (a, b1, b2, …, bp);
b0e = (ae, b1e, b2e, …, bpe)
Write
E(Y | X1 = x1, X2 = x2, …, Xp = xp) = x0T b0
Hence, it is natural to use x0T b0e to estimate x0T b0
(c) Interval Estimation:
The sampling distribution of the point estimator x0T b0e is
N(x0T b0, σ² [x0T (XT X)-1 x0])
A 100 × (1 – α)% confidence interval for the unknown
mean x0T b0 is:
x0T b0e ± tα/2(n – p – 1) √MSE [x0T (XT X)-1 x0]^(1/2)
 Prediction:
(a) Objective: Predict a new value for the response variable Y
when X1 = x1, X2 = x2, …, Xp = xp, namely
Y | X1 = x1, X2 = x2, …, Xp = xp
(b) Point Prediction:
Let x0 = (1, x1, x2, …, xp); b0 = (a, b1, b2, …, bp);
b0e = (ae, b1e, b2e, …, bpe)
Write
Y | (X1 = x1, X2 = x2, …, Xp = xp) = x0T b0 + e0
where e0 represents the random error corresponding to the
new observation Y and e0 ~ N(0, σ²)
Hence, it is natural to use x0T b0e to predict x0T b0 + e0
(c) Interval Prediction:
The prediction error Y – x0T b0e is distributed as
N(0, σ² [1 + x0T (XT X)-1 x0])
Then, the 100 × (1 – α)% prediction interval for Y | (X1 = x1,
X2 = x2, …, Xp = xp) is given as follows:
x0T b0e ± tα/2(n – p – 1) √MSE [1 + x0T (XT X)-1 x0]^(1/2)
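Since the factor [1 + x0T (XT X)-1 x0] always exceeds [x0T (XT X)-1 x0], the
prediction interval at a given x0 is always wider than the corresponding
confidence interval for the mean; this is the multiple regression analogue of
the remark in Section 4.2 that the standard error of the point predictor
exceeds that of the point estimator.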
Section 4.4: Fitting Regression Models by SAS
1. The PROC REG Command:
 Fit a simple or multiple linear regression model
 The statement
PROC REG DATA = name of dataset <options>;
MODEL response variable = independent variables / <options>;
PLOT response variable*independent variables < = symbols> / <options >;
RUN;
 Options for PROC REG:
(a) simple: Displays descriptive statistics, such as the sum,
mean, variance and standard deviation, for each
variable in the MODEL statement
(b) noprint: Suppresses the printed output
(c) p: Displays the observed value, the predicted value and
the residual for each observation in the dataset
(d) clm: Displays the 95% confidence interval for the mean
of each observation
(e) cli: Displays the 95% prediction interval for an individual
observation of the response variable
 Example:
Consider the following daily closing values of the Dow
Jones Index and the S&P500 index from 25 July 2003
to 7 August 2003 (Data Source: Yahoo Finance)
Data Indices;
Input SP500 DJI;
CARDS;
974.12 9126.45
967.08 9061.74
965.46 9036.32
982.82 9186.04
980.15 9153.97
990.31 9233.80
987.49 9200.05
989.28 9204.46
996.52 9266.51
998.68 9284.57
;
RUN;
Suppose that the S&P500 (Y) is linearly related to the
Dow Jones Index (X)
Use the following SAS procedure to fit a simple linear
regression line with response variable Y and explanatory
variable X
PROC REG DATA = Indices;
MODEL SP500 = DJI / P CLM CLI;
RUN;
The SAS output is displayed as follows:

The SAS System       11:42 Thursday, October 2, 2003   1

The REG Procedure
Model: MODEL1
Dependent Variable: SP500

                       Analysis of Variance

                              Sum of        Mean
Source            DF         Squares      Square    F Value    Pr > F
Model              1      1171.71185  1171.71185     529.05    <.0001
Error              8        17.71804     2.21475
Corrected Total    9      1189.42989

Root MSE            1.48821    R-Square    0.9851
Dependent Mean    983.19100    Adj R-Sq    0.9832
Coeff Var           0.15136

                       Parameter Estimates

                 Parameter     Standard
Variable   DF     Estimate        Error    t Value    Pr > |t|
Intercept   1   -295.69684     55.60328      -5.32      0.0007
DJI         1      0.13938      0.00606      23.00      <.0001
The SAS System       11:42 Thursday, October 2, 2003   2

The REG Procedure
Model: MODEL1
Dependent Variable: SP500

                              Output Statistics

       Dep Var   Predicted   Std Error
Obs      SP500       Value   Mean Predict       95% CL Mean          95% CL Predict      Residual
  1   974.1200    976.3695      0.5563     975.0867   977.6522    972.7058   980.0332     -2.2495
  2   967.0800    967.3501      0.8341     965.4265   969.2736    963.4159   971.2842     -0.2701
  3   965.4600    963.8070      0.9652     961.5811   966.0328    959.7165   967.8974      1.6530
  4   982.8200    984.6753      0.4750     983.5799   985.7707    981.0729   988.2777     -1.8553
  5   980.1500    980.2053      0.4882     979.0795   981.3310    976.5936   983.8170     -0.0553
  6   990.3100    991.3322      0.5889     989.9743   992.6901    987.6415   995.0229     -1.0222
  7   987.4900    986.6280      0.4938     985.4894   987.7667    983.0123   990.2438      0.8620
  8   989.2800    987.2427      0.5025     986.0839   988.4015    983.6205   990.8649      2.0373
  9   996.5200    995.8914      0.7255     994.2184   997.5644    992.0735   999.7093      0.6286
 10   998.6800    998.4086      0.8119     996.5364       1000    994.4993       1002      0.2714

Sum of Residuals                       0
Sum of Squared Residuals        17.71804
Predicted Residual SS (PRESS)   27.92906
Interpretation:
(a) From the result in the ANOVA table, the p-value of the F-test
for the full model is less than 0.0001
Suppose the significance level for the F-test is 5%. Based
on the p-value, we reject the null hypothesis H0 and
conclude that the full model is better than the reduced
model; that is, the explanatory variable DJI is significant
in the explanation of the variation of the response variable
SP500
(b) The result in part (a) is further confirmed by the value of
the coefficient of determination R2 which is equal to
0.9851
Note that R2 is very close to 1. This means that the
proportion of the total variation of Y explained by the
regression line is large; that is, the regression line can
provide a good fit to the data
We can also take into account the effect of sample size by
looking at the adjusted R2. It is especially important when
we consider a multiple linear regression model
(c) The t-tests for both the intercept and the slope of the
regression line reveal that the intercept and the regressor X
are significant (i.e. Both p-values are much less than 5%)
From the estimates of the parameters, we obtain the
following fitted or estimated regression line
Ye = - 295.69684 + 0.13938 X
(d) Based on the fitted or estimated regression line, the point
estimate for the mean (or the point predicted value for the
observation) of the response variable Y corresponding to
the first observation for the explanatory variable X is
976.3695. The standard error for the point estimator is
given by 0.5563
We can also compare the point predicted value for the
observation of Y with the actual observation of Y
corresponding to the first observation by considering the
corresponding value of the residual:
974.1200 – 976.3695 = - 2.2495
The 95% confidence interval for the mean of the response
variable Y is given by:
(975.0867, 977.6522)
It seems that the confidence interval for the mean is quite
precise
The 95% prediction interval for the first observation of the
response variable Y is given by:
(972.7058, 980.0332)
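As a small extension (not in the original notes), a prediction for a new value
of the explanatory variable can be obtained by appending an observation with a
missing response before running PROC REG; the DJI value 9300 below is a
hypothetical illustration only:
DATA Indices2;
SET Indices END = last;
OUTPUT;
IF last THEN DO;
DJI = 9300; SP500 = .;                /* hypothetical new DJI value; response unknown */
OUTPUT;
END;
RUN;
PROC REG DATA = Indices2;
MODEL SP500 = DJI / P CLI;            /* the appended row receives a predicted value and a 95% prediction interval */
RUN;
Observations with a missing response are excluded from the fitting but still
receive predicted values and prediction intervals in the output.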
2. Other functions of PROC REG statement
 Create a new SAS dataset containing information generated
by the PROC REG procedure using the OUTPUT statement
within the PROC REG step
PROC REG DATA = name of dataset <options>;
MODEL response variable = explanatory variables / <options>;
OUTPUT OUT = new SAS datasetname KEYWORD = options;
RUN;
 KEYWORD = options:
(a) p = varname
Display the predicted values in the output dataset
Assign the name to the predicted values by varname
(b) r = varname
Display the residuals in the output dataset
Assign the name to the residuals by varname
(c) student = varname
Display the studentized residuals in the output dataset
(d) L95M = varname
Display the lower bound for the 95% confidence interval
for the mean in the output dataset
(e) U95M = varname
Display the upper bound for the 95% confidence interval
for the mean in the output dataset
(f) L95 = varname
Display the lower bound for the 95% prediction interval
in the output dataset
(g) U95 = varname
Display the upper bound for the 95% prediction interval
in the output dataset
 Example:
Consider again the dataset containing the daily closing
values of the Dow Jones Index and the S&P500 index from
25 July 2003 to 7 August 2003
PROC REG DATA = Indices;
MODEL SP500 = DJI / P CLM CLI;
PLOT SP500*DJI;
OUTPUT OUT = Indices1 p = Predict r = Residuals;
RUN;
The graph of the plot of SP500 against DJI is stored in
WORK.GSEG.REG (described below)

[Scatter plot of SP500 (vertical axis, about 965 to 1000) against DJI
(horizontal axis, about 9025 to 9300) with the fitted line
SP500 = -295.7 + 0.1394 DJI; annotations: N = 10, Rsq = 0.9851,
AdjRsq = 0.9832, RMSE = 1.4882]
Use the following PROC PRINT to print the newly created
SAS dataset, namely Indices1, in the SAS output window
PROC PRINT DATA = Indices1;
3. More examples on the PROC REG statement
 Example: (Model checking)
Consider again the dataset containing the daily closing
values of the Dow Jones Index and the S&P500 index from
25 July 2003 to 7 August 2003
Use the following SAS procedure to generate the plots for
checking the assumptions of the regression model (i.e. Work
on the new SAS dataset “Indices1”)
PROC REG DATA = Indices;
MODEL SP500 = DJI;
OUTPUT OUT = Indices1 p = Predict student = Residual;
PROC PLOT DATA = Indices1;
PLOT SP500*Predict;
TITLE 'Model Checking: Observed vs Predicted Values';
PLOT Residual*Predict;
TITLE 'Model Checking: Residuals vs Predicted Values';
RUN;
(Read the SAS output for the plots!)
Observations and interpretations:
(1) The plot of SP500 against its predicted values reveals
that the relationship between SP500 and its predicted
values can be well-approximated by a straight line. This
indicates that the regression model can fit the data well
(2) There is no particular pattern in the plot of the residuals
against the predicted values of SP500. This reveals that
the residuals behave randomly and are roughly
independent of the predicted values of SP500. Hence, the
assumptions for the regression model are justified
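As a further check (a minimal sketch, not part of the original procedure), the
normality assumption for the random errors can be examined by applying PROC
UNIVARIATE to the studentized residuals stored in Indices1:
PROC UNIVARIATE DATA = Indices1 NORMAL;        /* NORMAL requests formal normality tests */
VAR Residual;
HISTOGRAM Residual / NORMAL;                   /* histogram with a fitted normal curve */
QQPLOT Residual / NORMAL(MU=EST SIGMA=EST);    /* normal quantile-quantile plot */
RUN;
If the normality assumption is reasonable, the histogram should be roughly
bell-shaped and the points in the Q-Q plot should lie close to a straight line.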
 Example: (Fit a multiple linear regression model: F-test for
the full model and t-test for individual effects)
Consider the following dataset containing the daily open,
high, low and close values and the trading volume of the
S&P500 index from 2 Sep 2003 to 2 Oct 2003
Data SP500;
Input Date $ Open High Low Close Volume;
CARDS;
2-Oct-03 1017.25 1021.90 1013.38 1020.24 1091209984
1-Oct-03 997.15 1018.22 997.15 1018.22 1329970048
30-Sep-03 1004.72 1004.72 990.34 995.97 1360259968
29-Sep-03 998.12 1006.91 995.31 1006.58 1128700000
26-Sep-03 1003.31 1003.32 996.03 996.85 1237640000
25-Sep-03 1010.24 1015.97 1003.26 1003.27 1276470000
24-Sep-03 1029.09 1029.83 1008.93 1009.38 1378250000
23-Sep-03 1023.26 1030.06 1021.50 1029.03 1124940000
22-Sep-03 1036.30 1036.30 1018.27 1022.82 1082870000
19-Sep-03 1039.64 1039.64 1031.85 1036.30 1328210000
18-Sep-03 1025.80 1040.18 1025.66 1039.58 1257790000
17-Sep-03 1028.91 1031.37 1024.23 1025.97 1135540000
16-Sep-03 1015.07 1029.68 1015.07 1029.32 1161780000
15-Sep-03 1018.68 1019.80 1013.59 1014.81 943448000
12-Sep-03 1014.54 1019.68 1007.70 1018.63 1092610000
11-Sep-03 1011.34 1020.84 1011.34 1016.42 1151640000
10-Sep-03 1021.27 1021.28 1009.73 1010.92 1313300000
9-Sep-03 1030.51 1030.51 1021.13 1023.16 1226980000
8-Sep-03 1021.84 1032.42 1021.84 1031.64 1171310000
5-Sep-03 1027.02 1029.24 1018.20 1021.39 1292100000
4-Sep-03 1025.97 1029.15 1022.17 1027.97 1259030000
3-Sep-03 1023.37 1029.36 1022.39 1026.27 1547380000
2-Sep-03 1009.14 1022.63 1005.65 1021.99 1279880000
;
RUN;
PROC PRINT DATA = SP500;
RUN;
The dataset is displayed in the SAS output window as
follows:
The SAS System       12:01 Friday, October 3, 2003   1

Obs    Date          Open       High        Low      Close        Volume
  1    2-Oct-03   1017.25    1021.90    1013.38    1020.24    1091209984
  2    1-Oct-03    997.15    1018.22     997.15    1018.22    1329970048
  3    30-Sep-0   1004.72    1004.72     990.34     995.97    1360259968
  4    29-Sep-0    998.12    1006.91     995.31    1006.58    1128700000
  5    26-Sep-0   1003.31    1003.32     996.03     996.85    1237640000
  6    25-Sep-0   1010.24    1015.97    1003.26    1003.27    1276470000
  7    24-Sep-0   1029.09    1029.83    1008.93    1009.38    1378250000
  8    23-Sep-0   1023.26    1030.06    1021.50    1029.03    1124940000
  9    22-Sep-0   1036.30    1036.30    1018.27    1022.82    1082870000
 10    19-Sep-0   1039.64    1039.64    1031.85    1036.30    1328210000
 11    18-Sep-0   1025.80    1040.18    1025.66    1039.58    1257790000
 12    17-Sep-0   1028.91    1031.37    1024.23    1025.97    1135540000
 13    16-Sep-0   1015.07    1029.68    1015.07    1029.32    1161780000
 14    15-Sep-0   1018.68    1019.80    1013.59    1014.81     943448000
 15    12-Sep-0   1014.54    1019.68    1007.70    1018.63    1092610000
 16    11-Sep-0   1011.34    1020.84    1011.34    1016.42    1151640000
 17    10-Sep-0   1021.27    1021.28    1009.73    1010.92    1313300000
 18    9-Sep-03   1030.51    1030.51    1021.13    1023.16    1226980000
 19    8-Sep-03   1021.84    1032.42    1021.84    1031.64    1171310000
 20    5-Sep-03   1027.02    1029.24    1018.20    1021.39    1292100000
 21    4-Sep-03   1025.97    1029.15    1022.17    1027.97    1259030000
 22    3-Sep-03   1023.37    1029.36    1022.39    1026.27    1547380000
 23    2-Sep-03   1009.14    1022.63    1005.65    1021.99    1279880000
Suppose we are interested in investigating the effects of the
daily open, high, low values and the trading volumes on the
daily close values of the S&P500 index
We fit a multiple linear regression model with Close as the
response variable and Open, High, Low and Volume as
explanatory variables using the following SAS procedures
Close = a + b1 Open + b2 High + b3 Low + b4 Volume + e
PROC REG DATA = SP500;
MODEL Close = Open High Low Volume;
RUN;
The SAS output is given as follows:

The SAS System       12:01 Friday, October 3, 2003   3

The REG Procedure
Model: MODEL1
Dependent Variable: Close

                       Analysis of Variance

                              Sum of        Mean
Source            DF         Squares      Square    F Value    Pr > F
Model              4      2777.01075   694.25269      86.76    <.0001
Error             18       144.03613     8.00201
Corrected Total   22      2921.04689

Root MSE             2.82878    R-Square    0.9507
Dependent Mean    1019.42304    Adj R-Sq    0.9397
Coeff Var            0.27749

                       Parameter Estimates

                  Parameter       Standard
Variable    DF     Estimate          Error    t Value    Pr > |t|
Intercept    1     10.09909       61.17871       0.17      0.8707
Open         1     -0.79023        0.11239      -7.03      <.0001
High         1      0.98725        0.15883       6.22      <.0001
Low          1      0.79761        0.15471       5.16      <.0001
Volume       1  -3.95259E-9    4.840167E-9      -0.82      0.4248
Observations and interpretations:
(1) Since the p-value of the F-test is less than 0.0001 (< the
significance level 0.05), we reject the null hypothesis H0
of the F-test for the full model at the 5% significance level
and conclude that the full model is superior to the
reduced model in explaining the variation of the
response variable Y. This conclusion is further
confirmed by the value of R² (0.9507), which is close to
one
(2) The results of the t-test for the individual effects and the
corresponding conclusions are shown as follows:
Intercept: Since the p-value is 0.8707 (> 0.05), we do not
reject the null hypothesis H0 at 5% significance
level and conclude that the intercept is not
significant in the explanation of the variation
for the response variable Y in the presence of
other variables
Open: Since the p-value is less than 0.0001 (< 0.05), we
reject H0 at 5% significance level and conclude
that the effect ‘Open’ is significant in the
explanation of the variation for the response
variable Y in the presence of other variables
For ‘High’ and ‘Low’, we reach the same conclusion as
for ‘Open’
For ‘Volume’, we reach the same conclusion as for
‘Intercept’
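For instance, the t value reported for Open can be recovered from the
parameter estimates table above: -0.79023/0.11239 ≈ -7.03, in agreement with
the SAS output.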
 Example: (F-test for a subset of individual effects)
Consider again the dataset containing the daily open, high,
low and close values and the trading volume of the S&P500
index from 2 Sep 2003 to 2 Oct 2003
Test H0: b1 = b2 = b3 = 0 (Reduced Model)
H1: At least one of b1, b2, b3 not equal to zero
(Full Model)
at 5% significance level
Use the following SAS procedure to perform the F-test for
the above hypotheses
PROC REG DATA = SP500;
MODEL Close = Open High Low Volume;
TEST Open = 0, High = 0, Low = 0;
RUN;
The SAS output is given as follows:

The SAS System       17:08 Saturday, October 4, 2003   1

The REG Procedure
Model: MODEL1
Dependent Variable: Close

                       Analysis of Variance

                              Sum of        Mean
Source            DF         Squares      Square    F Value    Pr > F
Model              4      2777.01075   694.25269      86.76    <.0001
Error             18       144.03613     8.00201
Corrected Total   22      2921.04689

Root MSE             2.82878    R-Square    0.9507
Dependent Mean    1019.42304    Adj R-Sq    0.9397
Coeff Var            0.27749

                       Parameter Estimates

                  Parameter       Standard
Variable    DF     Estimate          Error    t Value    Pr > |t|
Intercept    1     10.09909       61.17871       0.17      0.8707
Open         1     -0.79023        0.11239      -7.03      <.0001
High         1      0.98725        0.15883       6.22      <.0001
Low          1      0.79761        0.15471       5.16      <.0001
Volume       1  -3.95259E-9    4.840167E-9      -0.82      0.4248

The SAS System       17:08 Saturday, October 4, 2003   2

The REG Procedure
Model: MODEL1

          Test 1 Results for Dependent Variable Close

                              Mean
Source           DF         Square    F Value    Pr > F
Numerator         3      921.67424     115.18    <.0001
Denominator      18        8.00201
Observations and interpretations:
(1) The F statistic reported by the TEST statement is the
numerator mean square divided by the MSE of the full
model, i.e. 921.67424/8.00201 ≈ 115.18 on (3, 18) degrees
of freedom. Since the corresponding p-value is less than
0.0001, we reject H0 at the 5% significance level and
conclude that the full model is superior to the reduced
model in explaining the variation of the response variable
‘Close’
(2) The result is further confirmed by the F-test for the full
model and the value of R2 for the full model
(3) The fitted regression equation for the full model is
given by:
Close = 10.09909 – 0.79023 Open + 0.98725 High + 0.79761 Low – 3.95259 × 10^(-9) Volume
~ End of Chapter 4~