Department of Mathematics
Faculty of Science and Engineering
City University of Hong Kong
MA 3518: Applied Statistics

Chapter 4: Linear Regression Analysis – Part I

Regression analysis is mainly concerned with studying the relationship between a dependent variable and several independent variables. It estimates quantitatively the functional relationship between the variables based on observations. Linear regression models assume a linear relationship between the dependent variable and the independent variables. They are easy to implement and provide a first approximation to the underlying relationship between the variables. This chapter introduces various concepts and modern statistical computing techniques for fitting linear regressions with SAS. Topics included in this chapter are listed as follows:

Section 4.1: Regression Analysis
Section 4.2: Simple Linear Regression Models
Section 4.3: Multiple Linear Regression Models
Section 4.4: Fitting Regression Models by SAS

Explain Variation by Identifying Key Factors!

Section 4.1: Regression Analysis

1. Regression analysis:
- Concerns the investigation of the quantitative functional relationship between a dependent variable (or response variable) and several independent variables (or explanatory variables)
- Uses the functional relationship to explain the variability of the response variable by the explanatory variables in the presence of random error

2. Linear regression models:
- Assume a linear relationship between the response variable and the explanatory variables
- Simple linear regression models: one explanatory variable
- Multiple linear regression models: more than one explanatory variable

3. Real-life examples:
- Studying the impact of various factors on the yield of a chemical process
  Fit a multiple regression model with:
  Response variable: the yield of a chemical process
  Explanatory variables: temperature, the amount of catalyst used, the relative proportions of ingredients
- Investigating the impact of various factors on the response of a patient to a treatment
  Fit a multiple regression model with:
  Response variable: the response of a patient to a treatment
  Explanatory variables: gender, age and the stage of the disease

4. Prediction based on the regression model fitted from the data:
- Use the fitted regression model to predict the yield of a new chemical process
- Use the fitted regression model to predict the response of a new patient to the treatment

Section 4.2: Simple Linear Regression Models

1. The model:
- Describes the relationship between the dependent variable Y and the independent variable X
- Assumes the following straight-line relationship between X and Y (i.e. the regression line):

  Y = a + b X + e,

  where a and b are the unknown regression parameters to be estimated from the data; a represents the intercept and b the slope of the regression line

2. Systematic and random components:
- Systematic component (a + b X): represents the part of the variability of the response variable Y that can be explained by the explanatory variable X, i.e. by the regression model
- Random component (e): represents the part of the variability of the response variable that cannot be explained by the regression model
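To make the two components concrete, here is a minimal simulation sketch (not part of the original notes; the parameter values a = 2, b = 0.5 and sigma = 1, the dataset name SimDemo and the seed are all illustrative assumptions):

DATA SimDemo;
   a = 2; b = 0.5;               /* assumed true intercept and slope */
   DO i = 1 TO 50;
      x = i / 10;                /* fixed, non-random X values */
      e = RANNOR(20031002);      /* random component: e ~ N(0, 1) */
      y = a + b*x + e;           /* systematic component a + b*x plus error */
      OUTPUT;
   END;
   KEEP x y;
RUN;

Fitting PROC REG DATA = SimDemo; MODEL y = x; RUN; should recover estimates close to the assumed a and b; the scatter of the points around the fitted line is the random component.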
3. Data form for the simple linear regression model:
- Consider a random sample of n pairs of observations (X1, Y1), (X2, Y2), …, (Xn, Yn)
- Write the simple linear regression model in data form as follows:

  Yi = a + b Xi + ei, i = 1, 2, …, n,

  where ei represents the random error corresponding to the i-th observation
- If (X1, Y1), (X2, Y2), …, (Xn, Yn) are n pairs of observations over time, the simple linear regression model is a time-series regression and can be written with the time index t as follows:

  Yt = a + b Xt + et, t = 1, 2, …, n

4. Basic assumptions:
- The response variable Y is continuous
- The Xi's are observed without measurement error (i.e. the Xi's are non-random)
- The ei's are i.i.d. ~ N(0, σ²), i = 1, 2, …, n, where σ² is the unknown population variance of the random error
- e and X are independent

5. Estimation of regression models:
- Determine the regression line that best fits the data by finding the optimal values of a and b (optimal in what sense?)
- Method of linear least squares estimation (LSE): estimate a and b by minimizing the sum of squared errors (a SAS sketch computing these quantities step by step follows item 8 below)

  SSE = Σ_{i=1}^n (yi – a – b xi)² = Σ_{i=1}^n ei²

- Let ae and be denote the linear least squares estimators of a and b, respectively
- Least squares estimates ae and be:

  be = Sxy / Sxx and ae = ȳ – be x̄,

  where Sxy = Σ_{i=1}^n (xi – x̄)(yi – ȳ) and Sxx = Σ_{i=1}^n (xi – x̄)²
- Fitted regression line: ye = ae + be x

6. Partition of the sum of squares:
- Notation and terminology:

  SST = Σ_{i=1}^n (yi – ȳ)² = total sum of squares
  SSE = Σ_{i=1}^n (yi – yie)² = sum of squares error (residual sum of squares)
  SSR = Σ_{i=1}^n (yie – ȳ)² = sum of squares regression (model sum of squares)

  where yie = ae + be xi is the i-th fitted value
- Partition of the sum of squares: SST = SSE + SSR
- Partition of the degrees of freedom, for the simple linear regression model with number of unknown parameters p = 2:

  d.f.(SST) = n – 1
  d.f.(SSE) = n – 2
  d.f.(SSR) = p – 1 = 1
  d.f.(SST) = d.f.(SSE) + d.f.(SSR)

7. Estimation of the error variance σ²:
- An unbiased estimate s² of σ² is given by:

  s² = SSE / (n – p) = SSE / (n – 2) = (1/(n – 2)) Σ_{i=1}^n (yi – ae – be xi)² = MSE

8. Hypothesis testing for regression models:
- Test for the significance of the relationship between X and Y
- Main idea: construct the following ANOVA table by partitioning the sum of squares

  Source   d.f.        S.S.   M.S.                  F
  Model    p – 1 = 1   SSR    MSR = SSR / 1         F = MSR / MSE
  Error    n – 2       SSE    MSE = SSE / (n – 2)
  Total    n – 1       SST

- Test H0: b = 0 (i.e. the relationship between X and Y is not significant) against
  H1: b ≠ 0 (i.e. there is a significant relationship between X and Y)
  at significance level α
- Test statistic under H0: F ~ F_{2–1, n–2} = F_{1, n–2}
- Intuitively, we reject H0 when F is sufficiently large
- Decision: reject H0 if F > F_{1, n–2}(α) or if the p-value = P(F > Fobs) < α
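The following sketch (not from the original notes; it assumes the simulated SimDemo data with variables x and y from the earlier sketch) computes the quantities of items 5 to 8 directly, so they can be checked against PROC REG:

PROC MEANS DATA = SimDemo NOPRINT;
   VAR x y;
   OUTPUT OUT = Means MEAN(x y) = xbar ybar;
RUN;

DATA _NULL_;
   IF _N_ = 1 THEN SET Means;          /* brings in xbar and ybar */
   SET SimDemo END = last NOBS = n;
   Sxx + (x - xbar)**2;                /* running sum for Sxx */
   Sxy + (x - xbar)*(y - ybar);        /* running sum for Sxy */
   SST + (y - ybar)**2;                /* total sum of squares */
   IF last THEN DO;
      be  = Sxy / Sxx;                 /* slope: be = Sxy/Sxx */
      ae  = ybar - be*xbar;            /* intercept: ae = ybar - be*xbar */
      SSR = be * Sxy;                  /* regression SS (= Sxy**2/Sxx) */
      SSE = SST - SSR;                 /* partition: SST = SSE + SSR */
      F   = SSR / (SSE / (n - 2));     /* F = MSR/MSE with 1 and n-2 d.f. */
      PUT ae= be= SSR= SSE= F=;
   END;
RUN;

The printed ae, be and F should match the Parameter Estimates and ANOVA tables produced by PROC REG on the same data.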
9. Assessment of the goodness-of-fit of a regression model:
- A regression model provides a good fit to the data if the total variation of the response variable is well explained by the explanatory variables
- Coefficient of determination R²:

  R² = SSR / SST

  R² is defined as the ratio of the variation explained by the regression line (i.e. SSR) to the total variation (i.e. SST)
- Some remarks on R²:
  (a) R² takes values in [0, 1]
  (b) R² = 1 => SSR = SST => perfect fit
  (c) Large R² => a large proportion of the total variation can be explained by the regression model => good fit
  (d) Small R² => the model cannot provide a good fit
  (e) R² = [corr(X, Y)]², where corr(X, Y) is the correlation coefficient between X and Y

10. Estimation and prediction by regression models:
- Estimation:
  (a) Objective: estimate the mean of the response variable Y when X is fixed at x (i.e. E(Y | X = x))
  (b) Example: estimate the average height of women in HK aged between 25 and 30 whose weight is fixed at 103 lbs
  (c) Point estimator: note that E(Y | X = x) = a + b x; hence it is natural to adopt ae + be x to estimate E(Y | X = x)
- Prediction:
  (a) Objective: predict a new value of the response variable Y when X is fixed at x (i.e. Y | X = x)
  (b) Example: predict the height of a woman in HK aged between 25 and 30 whose weight is fixed at 103 lbs
  (c) Point predictor: Ye | (X = x) = ae + be x
- Remarks:
  (a) The point estimator and the point predictor are the same
  (b) The standard error of the point estimator is less than that of the point predictor

Section 4.3: Multiple Linear Regression Models

1. Motivation:
- Explain the variation of a response variable by more than one explanatory variable

2. The model:
- Investigates the relationship between the dependent variable Y and the independent variables X1, X2, …, Xp
- Describes how the dependent variable Y is related to the independent variables X1, X2, …, Xp by the following linear equation:

  Y = a + b1 X1 + b2 X2 + b3 X3 + … + bp Xp + e,

  where a is the intercept; b1, b2, … and bp are the regression coefficients; e is the random error
- In practical situations, the sample size n > the number of regression coefficients p + 1
- Consider a random sample of n observations represented by the following random vectors: (Yi, Xi1, Xi2, …, Xip), i = 1, 2, …, n
- Write the multiple linear regression model in the following data form:

  Yi = a + b1 Xi1 + b2 Xi2 + … + bp Xip + ei, i = 1, 2, …, n

- If the vectors of observations are collected over time, we have the following multiple time-series regression model:

  Yt = a + b1 Xt1 + b2 Xt2 + … + bp Xtp + et, t = 1, 2, …, n,

  where t represents the time index

3. Basic assumptions:
(a) The response variable Y is continuous
(b) X1, X2, …, Xp are non-random
(c) The random errors ei are i.i.d. ~ N(0, σ²), where σ² is the unknown common population variance
(d) e, X1, X2, …, Xp are independent

4. Data structure: matrix form
- Stacking the n observations gives the matrix equation

  | Y1 |   | 1  X11 … X1p |   | a  |   | e1 |
  | Y2 | = | 1  X21 … X2p |   | b1 | + | e2 |
  | :  |   | :   :      : |   | :  |   | :  |
  | Yn |   | 1  Xn1 … Xnp |   | bp |   | en |

- Let Y denote the n × 1 column vector of observations of the response variable, b the (p+1) × 1 vector of unknown parameters (a, b1, …, bp), and e the n × 1 column vector of random errors
- Write X for the n × (p+1) design matrix, whose first column consists of 1's and whose remaining columns contain the observed values of X1, …, Xp
- Then the matrix form of the data structure of the multiple linear regression model can be written as the following matrix equation:

  Y = X b + e

- When p = 1, the multiple linear regression model reduces to the simple linear regression model with two unknown parameters
5. Estimation of the unknown parameters by the linear least squares method (LSE):
- Main idea (same as in the simple linear regression case): estimate the unknown parameters b by minimizing the sum of squared errors
- Matrix form:

  min_b (Yᵀ– Xᵀbᵀ is minimized as) min_b (Y – X b)ᵀ(Y – X b),

  where (Y – X b)ᵀ(Y – X b) is the matrix representation of the sum of squared errors
- The solution (the least squares estimator be):

  be = arg min_b (Y – X b)ᵀ(Y – X b) = (Xᵀ X)⁻¹ Xᵀ Y

- The best-fit regression equation:
  (a) In matrix form: Ye = X be
  (b) In scalar form: Yie = ae + b1e Xi1 + b2e Xi2 + … + bpe Xip, i = 1, 2, …, n

6. Estimation of the common population variance σ²:
- Let yi and yie denote the i-th observed value and the i-th predicted value of the response variable Y
- Define the prediction error, or residual, εi by εi = yi – yie
- Use the i-th residual εi to estimate the i-th random error ei
- The unbiased estimator s² of σ² is given by:

  s² = (1/(n – p – 1)) Σ_{i=1}^n εi² = (1/(n – p – 1)) Σ_{i=1}^n (yi – yie)² = MSE
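A small PROC IML sketch of items 5 and 6 (illustrative, not from the original notes; it reuses the hypothetical SimDemo data from Section 4.2, where p = 1, but the same code works for any number of appended columns):

PROC IML;
   USE SimDemo;
   READ ALL VAR {x} INTO xv;         /* explanatory variable(s) */
   READ ALL VAR {y} INTO Y;          /* response vector */
   n   = NROW(Y);
   X   = J(n, 1, 1) || xv;           /* n x (p+1) design matrix: 1's, then xv */
   be  = INV(X` * X) * X` * Y;       /* least squares estimator (X'X)^{-1}X'Y */
   res = Y - X * be;                 /* residuals */
   MSE = SSQ(res) / (n - NCOL(X));   /* unbiased estimate of sigma^2 */
   PRINT be MSE;
QUIT;

Since NCOL(X) = p + 1, the divisor n - NCOL(X) is exactly the n – p – 1 of item 6.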
7. Hypothesis testing for regression models:
- F-test for the full model (i.e. assessing the goodness-of-fit of the regression as a whole)
- Test the following hypotheses at significance level α:
  H0: b1 = b2 = … = bp = 0 (i.e. none of the explanatory variables affect the response variable Y)
  H1: bk ≠ 0 for some k = 1, 2, …, p (i.e. some of the explanatory variables affect the response variable Y)
- Under H0: Y = a + e
- Under H1: Y = a + b1 X1 + b2 X2 + b3 X3 + … + bp Xp + e, for some non-zero bk
- Main idea: construct the following ANOVA table by partitioning the sum of squares

  Source   d.f.        S.S.   M.S.                      F
  Model    p           SSR    MSR = SSR / p             F = MSR / MSE
  Error    n – p – 1   SSE    MSE = SSE / (n – p – 1)
  Total    n – 1       SST

- Test statistic under H0: F ~ F_{p, n–p–1}
- Intuitively, we reject H0 when F is sufficiently large
- Decision: reject H0 if F > F_{p, n–p–1}(α) or if the p-value = P(F > Fobs) < α

- Coefficient of determination R²:
  (a) An alternative to the F-test of the full model for assessing the goodness-of-fit of a regression model
  (b) Definition:

    R² = SSR / SST = 1 – SSE / SST = [Corr(Yi, Yie)]²,

    where Corr(Yi, Yie) is the correlation between the observed values Yi and the predicted values Yie
  (c) Adjusted R²: adjusts R² to incorporate the effect of the sample size n when n is small relative to the number of parameters p + 1:

    R²adj = 1 – ((n – 1)/(n – p – 1)) (1 – R²)

- T-test for an individual regression parameter:
  (a) Rationale: to test whether the intercept or a particular explanatory variable can be removed from the regression model
  (b) Test the following hypotheses at significance level α:
    H0: bk = 0 v.s. H1: bk ≠ 0, for k = 1, 2, …, p
    Under H0 (reduced-form model): Y = a + b1 X1 + … + bk–1 Xk–1 + bk+1 Xk+1 + … + bp Xp + e
    Under H1 (full model): Y = a + b1 X1 + b2 X2 + … + bp Xp + e
  (c) Test the significance of the intercept term at significance level α as follows:
    H0: a = 0 v.s. H1: a ≠ 0
    Under H0 (reduced-form model): Y = b1 X1 + b2 X2 + … + bp Xp + e
    Under H1 (full model): Y = a + b1 X1 + b2 X2 + … + bp Xp + e
  (d) The procedure for the T-test:

    Effect      Estimate   S.E.         t-ratio           p-value
    Intercept   ae         √Var(ae)     ae/√Var(ae)       P(|T| > |ae/√Var(ae)|)
    X1          b1e        √Var(b1e)    b1e/√Var(b1e)     P(|T| > |b1e/√Var(b1e)|)
    :           :          :            :                 :
    Xp          bpe        √Var(bpe)    bpe/√Var(bpe)     P(|T| > |bpe/√Var(bpe)|)

  For example, test H0: b1 = 0 v.s. H1: b1 ≠ 0 at significance level α
  Decision: reject H0 if the p-value = P(|T| > |b1e/√Var(b1e)|) < α
  Conclusion: if H0 is rejected, we conclude that the effect X1 is significant in explaining the variation of the response variable Y

- F-test for a subset of regression parameters simultaneously:
  - The T-test mentioned above can only be used to test hypotheses involving one parameter
  - Let j denote a positive integer less than p (i.e. j < p)
  - Test H0: bj+1 = bj+2 = … = bp = 0 (i.e. only the first j explanatory variables affect the response variable Y) against H1: H0 is not true, at significance level α
  - Under H0 (reduced model): Y = a + b1 X1 + b2 X2 + … + bj Xj + e
  - Under H1 (full model): Y = a + b1 X1 + b2 X2 + … + bp Xp + e
  - Write
    SSRR = sum of squares regression for the reduced model
    SSER = sum of squares error for the reduced model
    SSRF = sum of squares regression for the full model
    SSEF = sum of squares error for the full model
  - Test statistic (the F-statistic for the test) under H0:

    F = [(SSRF – SSRR) / (p – j)] / [SSEF / (n – p – 1)] ~ F_{p–j, n–p–1}

  - Note that:
    (a) (SSRF – SSRR) represents the part of the total variation explained by Xj+1, Xj+2, …, Xp over and above X1, …, Xj
    (b) SSEF is the sum of squared errors for the full model

8. Estimation and prediction by regression models:
- Estimation:
  (a) Objective: estimate the mean of the response variable Y when X1 = x1, X2 = x2, …, Xp = xp, namely E(Y | X1 = x1, X2 = x2, …, Xp = xp)
  (b) Point estimation: let x0 = (1, x1, x2, …, xp); b0 = (a, b1, b2, …, bp); b0e = (ae, b1e, b2e, …, bpe)
    Write E(Y | X1 = x1, X2 = x2, …, Xp = xp) = x0ᵀ b0
    Hence, it is natural to use x0ᵀ b0e to estimate x0ᵀ b0
  (c) Interval estimation: the sampling distribution of the point estimator x0ᵀ b0e is N(x0ᵀ b0, σ² [x0ᵀ (Xᵀ X)⁻¹ x0])
    A 100(1 – α)% confidence interval for the unknown mean x0ᵀ b0:

    x0ᵀ b0e ± t_{α/2}(n – p – 1) (MSE [x0ᵀ (Xᵀ X)⁻¹ x0])^{1/2}

- Prediction:
  (a) Objective: predict a new value of the response variable Y when X1 = x1, X2 = x2, …, Xp = xp, namely Y | X1 = x1, X2 = x2, …, Xp = xp
  (b) Point prediction: with x0, b0 and b0e as above, write Y | (X1 = x1, X2 = x2, …, Xp = xp) = x0ᵀ b0 + e0, where e0 represents the random error corresponding to the new observation Y and e0 ~ N(0, σ²)
    Hence, it is natural to use x0ᵀ b0e to predict x0ᵀ b0 + e0
  (c) Interval prediction: the prediction error Y – x0ᵀ b0e is distributed as N(0, σ² [1 + x0ᵀ (Xᵀ X)⁻¹ x0])
    Then the 100(1 – α)% prediction interval for Y | (X1 = x1, X2 = x2, …, Xp = xp) is given by:

    x0ᵀ b0e ± t_{α/2}(n – p – 1) (MSE [1 + x0ᵀ (Xᵀ X)⁻¹ x0])^{1/2}
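The two interval formulas can be evaluated directly, extending the PROC IML sketch above (again illustrative, not from the original notes; the SimDemo data, the query point x = 3 and the 95% level are all assumptions). This mirrors what PROC REG reports through the CLM and CLI options in the next section:

PROC IML;
   USE SimDemo;
   READ ALL VAR {x} INTO xv;  READ ALL VAR {y} INTO Y;
   n   = NROW(Y);
   X   = J(n, 1, 1) || xv;
   b0e = INV(X` * X) * X` * Y;
   MSE = SSQ(Y - X * b0e) / (n - NCOL(X));
   x0  = {1 3};                               /* query point: intercept term, x = 3 */
   h   = x0 * INV(X` * X) * x0`;              /* x0'(X'X)^{-1}x0 */
   t   = TINV(0.975, n - NCOL(X));            /* t_{alpha/2}(n - p - 1), alpha = 0.05 */
   CLM = (x0 * b0e) + t # SQRT(MSE # h)       # {-1 1};  /* CI for the mean */
   CLI = (x0 * b0e) + t # SQRT(MSE # (1 + h)) # {-1 1};  /* prediction interval */
   PRINT CLM CLI;
QUIT;

Note how the only difference between the two intervals is the extra "1 +" inside the square root, which accounts for the variance of the new random error e0 and makes the prediction interval wider.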
Section 4.4: Fitting Regression Models by SAS

1. The PROC REG command:
- Fits a simple or multiple linear regression model
- The statement:

  PROC REG DATA = name of dataset <options>;
     MODEL response variable = independent variables / <options>;
     PLOT response variable*independent variable <= symbol> / <options>;
  RUN;

- Options:
  (a) simple: displays descriptive statistics (for instance, sum, mean, variance and standard deviation) for each variable in the MODEL statement
  (b) noprint: suppresses the printed output
  (c) p: displays the observed value, the predicted value and the residual for each observation in the dataset
  (d) clm: displays the 95% confidence interval for the mean response at each observation
  (e) cli: displays the 95% prediction interval for each individual observation

Example: consider the following daily closing values of the Dow Jones Index and the S&P 500 index from 25 July 2003 to 7 August 2003 (data source: Yahoo Finance)

Data Indices;
Input SP500 DJI;
CARDS;
974.12 9126.45
967.08 9061.74
965.46 9036.32
982.82 9186.04
980.15 9153.97
990.31 9233.80
987.49 9200.05
989.28 9204.46
996.52 9266.51
998.68 9284.57
;
RUN;

Suppose that the S&P 500 (Y) depends linearly on the Dow Jones Index (X). Use the following SAS procedure to fit a simple linear regression line with response variable Y and explanatory variable X:

PROC REG DATA = Indices;
MODEL SP500 = DJI / P CLM CLI;
RUN;

The SAS output is displayed as follows:

The REG Procedure
Model: MODEL1
Dependent Variable: SP500

                        Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1       1171.71185    1171.71185    529.05   <.0001
Error              8         17.71804       2.21475
Corrected Total    9       1189.42989

Root MSE           1.48821     R-Square    0.9851
Dependent Mean   983.19100     Adj R-Sq    0.9832
Coeff Var          0.15136

                        Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1           -295.69684         55.60328     -5.32     0.0007
DJI          1              0.13938          0.00606     23.00     <.0001

                        Output Statistics

       Dep Var   Predicted   Std Error
Obs      SP500       Value   Mean Predict       95% CL Mean          95% CL Predict      Residual
  1   974.1200    976.3695         0.5563   975.0867  977.6522    972.7058  980.0332      -2.2495
  2   967.0800    967.3501         0.8341   965.4265  969.2736    963.4159  971.2842      -0.2701
  3   965.4600    963.8070         0.9652   961.5811  966.0328    959.7165  967.8974       1.6530
  4   982.8200    984.6753         0.4750   983.5799  985.7707    981.0729  988.2777      -1.8553
  5   980.1500    980.2053         0.4882   979.0795  981.3310    976.5936  983.8170      -0.0553
  6   990.3100    991.3322         0.5889   989.9743  992.6901    987.6415  995.0229      -1.0222
  7   987.4900    986.6280         0.4938   985.4894  987.7667    983.0123  990.2438       0.8620
  8   989.2800    987.2427         0.5025   986.0839  988.4015    983.6205  990.8649       2.0373
  9   996.5200    995.8914         0.7255   994.2184  997.5644    992.0735  999.7093       0.6286
 10   998.6800    998.4086         0.8119   996.5364      1000    994.4993      1002       0.2714

Sum of Residuals                         0
Sum of Squared Residuals          17.71804
Predicted Residual SS (PRESS)     27.92906
Interpretation:
(a) From the ANOVA table, the p-value of the F-test for the full model is less than 0.0001. Suppose the significance level for the F-test is 5%. Based on the p-value, we reject the null hypothesis H0 and conclude that the full model is better than the reduced model; that is, the explanatory variable DJI is significant in explaining the variation of the response variable SP500
(b) The result in part (a) is further confirmed by the value of the coefficient of determination R², which is equal to 0.9851. Note that R² is very close to 1. This means that the proportion of the total variation of Y explained by the regression line is large; that is, the regression line provides a good fit to the data. We can also take the effect of the sample size into account by looking at the adjusted R²; this is especially important when we consider a multiple linear regression model
(c) The t-tests for both the intercept and the slope of the regression line reveal that the intercept and the regressor X are significant (i.e. both p-values are much less than 5%). From the estimates of the parameters, we obtain the following fitted (or estimated) regression line:

  Ye = -295.69684 + 0.13938 X

(d) Based on the fitted regression line, the point estimate for the mean of the response variable Y (equivalently, the point predicted value of the observation) corresponding to the first observation of the explanatory variable X is 976.3695, with standard error 0.5563. We can compare the predicted value of Y with the actual observation of Y for the first observation via the corresponding residual: 974.1200 – 976.3695 = -2.2495. The 95% confidence interval for the mean of the response variable Y is (975.0867, 977.6522); the confidence interval for the mean appears quite precise. The 95% prediction interval for the first observation of the response variable Y is (972.7058, 980.0332)
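A common follow-up (a sketch, not part of the original notes; the new DJI value 9300 and the dataset name Indices2 are illustrative): to obtain the point prediction and the CLM/CLI intervals at a new X value, append an observation with a missing response. PROC REG excludes it from the fit but still scores it in the output statistics:

DATA Indices2;
   SET Indices END = last;
   OUTPUT;
   IF last THEN DO;
      DJI = 9300;      /* hypothetical new value of the explanatory variable */
      SP500 = .;       /* response unknown: excluded from fitting, still scored */
      OUTPUT;
   END;
RUN;

PROC REG DATA = Indices2;
   MODEL SP500 = DJI / P CLM CLI;
RUN;

The extra row of the Output Statistics then gives the point prediction, the confidence interval for the mean, and the prediction interval at DJI = 9300.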
2. Other functions of the PROC REG statement
- Create a new SAS dataset containing information generated by the PROC REG procedure using the OUTPUT statement within the PROC REG step:

  PROC REG DATA = name of dataset <options>;
     MODEL response variable = explanatory variables / <options>;
     OUTPUT OUT = new SAS dataset name KEYWORD = varname;
  RUN;

- KEYWORD = varname options:
  (a) p = varname: stores the predicted values in the output dataset under the name varname
  (b) r = varname: stores the residuals in the output dataset under the name varname
  (c) student = varname: stores the standardized (studentized) residuals in the output dataset
  (d) L95M = varname: stores the lower bound of the 95% confidence interval for the mean in the output dataset
  (e) U95M = varname: stores the upper bound of the 95% confidence interval for the mean in the output dataset
  (f) L95 = varname: stores the lower bound of the 95% prediction interval in the output dataset
  (g) U95 = varname: stores the upper bound of the 95% prediction interval in the output dataset

Example: consider again the dataset containing the daily closing values of the Dow Jones Index and the S&P 500 index from 25 July 2003 to 7 August 2003

PROC REG DATA = Indices;
MODEL SP500 = DJI / P CLM CLI;
PLOT SP500*DJI;
OUTPUT OUT = Indices1 p = Predict r = Residuals;
RUN;

The plot of SP500 against DJI is stored in WORK.GSEG.REG. [Figure: scatter plot of SP500 against DJI with the fitted line SP500 = -295.7 + 0.1394 DJI; annotations: N = 10, Rsq = 0.9851, AdjRsq = 0.9832, RMSE = 1.4882]

Use PROC PRINT to print the newly created SAS dataset, namely Indices1, in the SAS output window:

PROC PRINT DATA = Indices1;
RUN;

3. More examples of the PROC REG statement

Example (model checking): consider again the dataset containing the daily closing values of the Dow Jones Index and the S&P 500 index from 25 July 2003 to 7 August 2003. Use the following SAS procedure to generate the plots for checking the assumptions of the regression model (i.e. work with the new SAS dataset Indices1):

PROC REG DATA = Indices;
MODEL SP500 = DJI;
OUTPUT OUT = Indices1 p = Predict student = Residual;
PROC PLOT DATA = Indices1;
PLOT SP500*Predict;
TITLE 'Model Checking: Observed vs Predicted Values';
PLOT Residual*Predict;
TITLE 'Model Checking: Residuals vs Predicted Values';
RUN;

(Read the SAS output for the plots!)

Observations and interpretations:
(1) The plot of SP500 against its predicted values reveals that the relationship between SP500 and its predicted values is well approximated by a straight line. This indicates that the regression model fits the data well
(2) There is no particular pattern in the plot of the residuals against the predicted values of SP500. This suggests that the residuals are random and independent of the observed and predicted values of SP500. Hence, the assumptions of the regression model appear justified
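One further check that fits naturally here (a sketch, not from the original notes): the normality assumption on the random errors can be examined by applying PROC UNIVARIATE to the studentized residuals stored in Indices1 by the OUTPUT step above:

PROC UNIVARIATE DATA = Indices1 NORMAL;
   VAR Residual;                  /* studentized residuals from the OUTPUT step */
   HISTOGRAM Residual / NORMAL;   /* histogram with a fitted normal curve */
RUN;

The NORMAL option prints formal normality tests (e.g. Shapiro-Wilk); with only 10 observations such tests have little power, so the residual plots above remain the main diagnostic.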
Example (fitting a multiple linear regression model: F-test for the full model and t-tests for individual effects): consider the following dataset containing the daily open, high, low and close values and the trading volume of the S&P 500 index from 2 Sep 2003 to 2 Oct 2003

Data SP500;
Input Date $ Open High Low Close Volume;
CARDS;
2-Oct-03 1017.25 1021.90 1013.38 1020.24 1091209984
1-Oct-03 997.15 1018.22 997.15 1018.22 1329970048
30-Sep-03 1004.72 1004.72 990.34 995.97 1360259968
29-Sep-03 998.12 1006.91 995.31 1006.58 1128700000
26-Sep-03 1003.31 1003.32 996.03 996.85 1237640000
25-Sep-03 1010.24 1015.97 1003.26 1003.27 1276470000
24-Sep-03 1029.09 1029.83 1008.93 1009.38 1378250000
23-Sep-03 1023.26 1030.06 1021.50 1029.03 1124940000
22-Sep-03 1036.30 1036.30 1018.27 1022.82 1082870000
19-Sep-03 1039.64 1039.64 1031.85 1036.30 1328210000
18-Sep-03 1025.80 1040.18 1025.66 1039.58 1257790000
17-Sep-03 1028.91 1031.37 1024.23 1025.97 1135540000
16-Sep-03 1015.07 1029.68 1015.07 1029.32 1161780000
15-Sep-03 1018.68 1019.80 1013.59 1014.81 943448000
12-Sep-03 1014.54 1019.68 1007.70 1018.63 1092610000
11-Sep-03 1011.34 1020.84 1011.34 1016.42 1151640000
10-Sep-03 1021.27 1021.28 1009.73 1010.92 1313300000
9-Sep-03 1030.51 1030.51 1021.13 1023.16 1226980000
8-Sep-03 1021.84 1032.42 1021.84 1031.64 1171310000
5-Sep-03 1027.02 1029.24 1018.20 1021.39 1292100000
4-Sep-03 1025.97 1029.15 1022.17 1027.97 1259030000
3-Sep-03 1023.37 1029.36 1022.39 1026.27 1547380000
2-Sep-03 1009.14 1022.63 1005.65 1021.99 1279880000
;
RUN;

PROC PRINT DATA = SP500;
RUN;

The SAS output window then lists the 23 observations with columns Obs, Date, Open, High, Low, Close and Volume, matching the input data above. (The Date values are truncated to eight characters, e.g. 30-Sep-0, because Date was read with the default character length of list input.)

Suppose we are interested in investigating the effects of the daily open, high and low values and the trading volume on the daily close values of the S&P 500 index. We fit a multiple linear regression model with Close as the response variable and Open, High, Low and Volume as explanatory variables:

  Close = a + b1 Open + b2 High + b3 Low + b4 Volume + e

using the following SAS procedure:

PROC REG DATA = SP500;
MODEL Close = Open High Low Volume;
RUN;
The SAS output is given as follows:

The REG Procedure
Model: MODEL1
Dependent Variable: Close

                        Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              4       2777.01075     694.25269     86.76   <.0001
Error             18        144.03613       8.00201
Corrected Total   22       2921.04689

Root MSE           2.82878     R-Square    0.9507
Dependent Mean  1019.42304     Adj R-Sq    0.9397
Coeff Var          0.27749

                        Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             10.09909         61.17871      0.17     0.8707
Open         1             -0.79023          0.11239     -7.03     <.0001
High         1              0.98725          0.15883      6.22     <.0001
Low          1              0.79761          0.15471      5.16     <.0001
Volume       1          -3.95259E-9      4.840167E-9     -0.82     0.4248

Observations and interpretations:
(1) Since the p-value of the F-test is less than 0.0001 (less than the significance level 0.05), we reject the null hypothesis H0 of the F-test for the full model at the 5% significance level and conclude that the full model is superior to the reduced model in explaining the variation of the response variable Y. This conclusion is further confirmed by the value of R² (0.9507), which is close to one
(2) The results of the t-tests for the individual effects and the corresponding conclusions are as follows:
    Intercept: since the p-value is 0.8707 (> 0.05), we do not reject the null hypothesis H0 at the 5% significance level and conclude that the intercept is not significant in explaining the variation of the response variable Y in the presence of the other variables
    Open: since the p-value is less than 0.0001 (< 0.05), we reject H0 at the 5% significance level and conclude that the effect Open is significant in explaining the variation of the response variable Y in the presence of the other variables
    For High and Low, the conclusion is the same as for Open
    For Volume, the conclusion is the same as for Intercept

Example (F-test for a subset of individual effects): consider again the dataset containing the daily open, high, low and close values and the trading volume of the S&P 500 index from 2 Sep 2003 to 2 Oct 2003. Test

  H0: b1 = b2 = b3 = 0 (reduced model)
  H1: at least one of b1, b2, b3 is not equal to zero (full model)

at the 5% significance level. Use the following SAS procedure to perform the F-test for the above hypotheses:

PROC REG DATA = SP500;
MODEL Close = Open High Low Volume;
TEST Open = 0, High = 0, Low = 0;
RUN;

The SAS output is given as follows (the Analysis of Variance and Parameter Estimates tables are identical to those shown above and are not repeated here):

The REG Procedure
Model: MODEL1

        Test 1 Results for Dependent Variable Close

Source         DF   Mean Square   F Value   Pr > F
Numerator       3     921.67424    115.18   <.0001
Denominator    18       8.00201
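A quick cross-check (a sketch, not part of the original notes): since SST is the same for both models, SSRF – SSRR = SSER – SSEF, so the Test 1 F-statistic can also be reproduced from two ordinary fits. Fitting the reduced model under H0,

PROC REG DATA = SP500;
   MODEL Close = Volume;   /* reduced model under H0: b1 = b2 = b3 = 0 */
RUN;

gives SSER in its ANOVA table; combining it with SSEF = 144.03613 and MSEF = 8.00201 from the full model above, F = [(SSER – SSEF) / 3] / MSEF should reproduce the value 115.18 up to rounding.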
Observations and interpretations:
(1) Since the p-value of the F-test for the subset of individual effects is less than 0.0001, we reject H0 at the 5% significance level and conclude that the full model is superior to the reduced model in explaining the variation of the response variable Close
(2) The result is further confirmed by the F-test for the full model and the value of R² for the full model
(3) The fitted regression equation for the full model is given by:

  Close = 10.09909 – 0.79023 Open + 0.98725 High + 0.79761 Low – 3.95259×10⁻⁹ Volume

~ End of Chapter 4 ~