Correlation and Regression
Suppose we have two random variables X and Y that have a joint bivariate normal distribution with correlation coefficient \rho. A joint normal distribution has p.d.f.

f(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left( \frac{x-\mu_X}{\sigma_X} \right)^2 - 2\rho \left( \frac{x-\mu_X}{\sigma_X} \right)\left( \frac{y-\mu_Y}{\sigma_Y} \right) + \left( \frac{y-\mu_Y}{\sigma_Y} \right)^2 \right] \right\} .

A graph of a joint normal density is shown below for \rho = -0.9:

[Figure: surface plot of a joint normal density with \rho = -0.9]
If we have selected a random sample of size n from this population, we may estimate \rho with the sample correlation coefficient:

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}} .
The sample correlation coefficient may also be found by taking the square root of the coefficient of determination and attaching the sign indicated by the scatterplot of Y v. X: a positive sign if the relationship is increasing, or a negative sign if the relationship is decreasing.
Example: In the stainless steel stress fracture example, we found that R^2 = 0.632518266, and the scatterplot showed a decreasing relationship between tensile stress and time to fracture. Hence, the sample correlation coefficient would be r = -\sqrt{0.632518266} = -0.7953, indicating a fairly strong negative linear relationship between applied tensile stress and time to fracture.
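The computations in these notes are done in Excel; purely as an illustration, the same arithmetic can be sketched in Python with numpy. The x and y values below are made up for the sketch (the stainless steel data are not reproduced here); only the final line uses the R^2 actually reported above.

import numpy as np

# Made-up stress (x) and time-to-fracture (y) values, for illustration only
x = np.array([30.0, 32.5, 35.0, 37.5, 40.0, 42.5])
y = np.array([63.0, 58.0, 55.0, 46.0, 38.0, 29.0])

# Sample correlation coefficient computed directly from the definition
r_direct = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

# Equivalent built-in computation
r_builtin = np.corrcoef(x, y)[0, 1]

# Recovering r from a reported R^2 when the scatterplot shows a decreasing trend,
# as in the stainless steel example
r_from_R2 = -np.sqrt(0.632518266)   # about -0.7953

print(r_direct, r_builtin, r_from_R2)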
Multiple Regression
Sometimes we have several possible predictor variables, with no
single variable giving good prediction by itself. In such a situation,
we may use a multiple regression model:
Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + \varepsilon_i = \beta_0 + \sum_{j=1}^{k} \beta_j X_{ij} + \varepsilon_i .
Here X_{ij} is the value of the jth predictor variable for the ith member of the sample.
We assume that the relationship between Y and any one of the
predictor variables is linear. We also assume that
 1 ,  2 , ,  n ~ Normal 0,  2 .
i .i .d .
The first step in the data analysis would be to do scatterplots of Y v.
each X, to check the assumption of linearity of the relationships.
A cautionary note is in order here. Some of the variables in the pool
of predictors may be only weakly related to Y. We don’t necessarily
discard these variables from the pool of predictors, however, since
we are looking for a model in which a collection of predictors together gives good prediction.
Estimation of the parameters is accomplished using the Least
Squares method. The quantity to be minimized is the sum of
squared errors:
Q = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( Y_i - \beta_0 - \sum_{j=1}^{k} \beta_j X_{ij} \right)^2 .
We take the partial derivative of Q with respect to each of the
parameters and set the result equal to zero, obtaining a set of k + 1
equations in k + 1 unknowns, the normal equations:
\frac{\partial Q}{\partial \beta_0} = -2 \sum_{i=1}^{n} \left( Y_i - \beta_0 - \sum_{j=1}^{k} \beta_j X_{ij} \right) \stackrel{set}{=} 0 ; and

\frac{\partial Q}{\partial \beta_l} = -2 \sum_{i=1}^{n} X_{il} \left( Y_i - \beta_0 - \sum_{j=1}^{k} \beta_j X_{ij} \right) \stackrel{set}{=} 0 , for l = 1, 2, …, k.
The solutions to the normal equations are the least squares
estimators of the parameters.
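The notes carry out this computation with Excel's LINEST function in the example that follows. As a rough sketch only, the same normal equations can be written in matrix form as (X'X)b = X'Y and solved with a few lines of Python; the numbers used here are simply the first five observations of the adsorption data given below.

import numpy as np

# First five observations of the adsorption example below, just to have numbers
y  = np.array([4.0, 18.0, 14.0, 18.0, 26.0])
x1 = np.array([61.0, 175.0, 111.0, 124.0, 130.0])
x2 = np.array([13.0, 21.0, 24.0, 23.0, 64.0])

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])

# The normal equations in matrix form are (X'X) b = X'Y;
# solving them gives the least squares estimates of beta_0, beta_1, beta_2
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)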
Example: Soil and sediment adsorption, the extent to which
chemicals collect in a condensed form on the surface, is an
important characteristic influencing the effectiveness of pesticides
and various agricultural chemicals. The paper “Adsorption of
phosphate, arsenate, methanearsonate, and cacodylate by lake and
stream sediments: comparisons with soils” (Journal of
Environmental Quality, 1984, pp. 499-504), gave the accompanying
data on Y = phosphate adsorption index, X1 = amount of extractable
iron, and X2 = amount of extractable aluminum. (from Probability
and Statistics for Engineering and the Sciences, by Jay L. Devore)
Observation    X1     X2     Y
1               61     13     4
2              175     21    18
3              111     24    14
4              124     23    18
5              130     64    26
6              173     38    26
7              169     33    21
8              169     61    30
9              160     39    28
10             244     71    36
11             257    112    65
12             333     88    62
13             199     54    40
The paper proposed the model:
Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i .
We want to try to fit this model to the data, to see whether the two
predictor variables give good prediction of the Adsorption Index.
First, we want to check whether the relationship between the
Adsorption Index and each of the predictors seems to be linear.
From the scatterplots below, it appears that the relationship between
Adsorption Index and each of the predictors is a fairly strong
positive linear relationship.
[Scatterplot of Adsorption Index v. Amount of Extractable Fe]

[Scatterplot of Adsorption Index v. Amount of Extractable Al]
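The charts themselves did not survive in this transcript. As an optional sketch (the original notes use Excel charts, not Python), the two scatterplots could be redrawn with matplotlib from the data in the table above:

import numpy as np
import matplotlib.pyplot as plt

# Adsorption data from the table above
x1 = np.array([61, 175, 111, 124, 130, 173, 169, 169, 160, 244, 257, 333, 199], dtype=float)
x2 = np.array([13, 21, 24, 23, 64, 38, 33, 61, 39, 71, 112, 88, 54], dtype=float)
y  = np.array([4, 18, 14, 18, 26, 26, 21, 30, 28, 36, 65, 62, 40], dtype=float)

# One scatterplot of the response against each predictor, to check linearity
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(x1, y)
axes[0].set_xlabel("Amount of Extractable Fe")
axes[0].set_ylabel("Adsorption Index")
axes[1].scatter(x2, y)
axes[1].set_xlabel("Amount of Extractable Al")
axes[1].set_ylabel("Adsorption Index")
plt.tight_layout()
plt.show()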
To fit a multiple regression model using Excel, we must use the
LINEST function, which is an array function. We enter the data,
being sure to put the predictor variables in adjacent columns. Then
we choose an empty cell, and highlight an array containing 5 rows
and k + 1 columns. We then enter
=LINEST(A1:A13, B1:C13, TRUE, TRUE),
followed by Ctrl-Shift-Enter.
The first entry in parentheses is the column listing the values of Y.
The second entry is the rectangular array of predictor variables. The
third entry is an indicator that we want Excel to estimate the
intercept, instead of simply assuming that it is zero. The fourth
entry is an indicator that we want not only the parameter estimates and their standard errors, but also additional regression statistics such as the coefficient of determination and the sums of squares. The
output is shown below:
The first row of the table gives the parameter estimates; note that LINEST reports the slope estimates in reverse order of the predictor columns, so the coefficient of X2 appears first, then the coefficient of X1, and finally the intercept. The second row gives the corresponding standard errors. The third row gives the coefficient of determination and the standard error of the estimate (the square root of the MSE). The fourth row gives the value of the F statistic and the error degrees of freedom. The fifth row gives the regression sum of squares, followed by the error sum of squares.
   0.349       0.112733    -7.35066
   0.071306    0.029691     3.484668
   0.948467    4.379375     #N/A
  92.02558    10            #N/A
3529.903     191.7892       #N/A
The fitted regression equation is
\hat{Y} = -7.35066 + 0.112733 X_1 + 0.349 X_2 ,
and 94.8467% of the variability in the Adsorption Index is explained by the linear relationship between the Adsorption Index and the two predictors.
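As a cross-check of the LINEST results (an illustration added here, not part of the original analysis), the same least squares fit can be reproduced in Python from the data table above:

import numpy as np

# Adsorption data from the table above
x1 = np.array([61, 175, 111, 124, 130, 173, 169, 169, 160, 244, 257, 333, 199], dtype=float)
x2 = np.array([13, 21, 24, 23, 64, 38, 33, 61, 39, 71, 112, 88, 54], dtype=float)
y  = np.array([4, 18, 14, 18, 26, 26, 21, 30, 28, 36, 65, 62, 40], dtype=float)

# Least squares fit of Y = beta0 + beta1*X1 + beta2*X2 + error
X = np.column_stack([np.ones_like(x1), x1, x2])
beta_hat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values, error sum of squares, and coefficient of determination
y_hat = X @ beta_hat
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
r_squared = 1 - sse / sst

print(beta_hat)    # intercept, X1 (Fe) coefficient, X2 (Al) coefficient
print(r_squared)   # should agree with the LINEST value of about 0.9485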
We can construct the ANOVA table from the above information:
Source         SS           d.f.    MS           F
Regression     3529.903       2     1764.9515    92.02559
Residual        191.7892     10       19.17892
Total          3721.6922     12
Hence, if we want to test for a linear relationship between the Adsorption Index and the predictor variables, we proceed as follows:
Step 1: H_0: \beta_1 = \beta_2 = 0   v.   H_a: not both 0.
Step 2: n = 13, \alpha = 0.05.
Step 3: The test statistic is F = MSR / MSE, which under the null hypothesis has an F(2, 10) distribution.
Step 4: The critical value is F(0.95, 2, 10) = 4.10. If the calculated value of the test statistic is greater than 4.10, we will reject the null hypothesis.
Step 5: From the ANOVA table, we have F = 92.02559.
Step 6: We reject the null hypothesis at the 0.05 level of significance. We have sufficient evidence to conclude that at least one of the slope coefficients is not 0.
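As an added illustration (not part of the original notes), the F statistic, the critical value, and a p-value for this test can be computed with scipy:

from scipy import stats

# ANOVA quantities from the table above
ssr, sse = 3529.903, 191.7892
df_reg, df_err = 2, 10

msr, mse = ssr / df_reg, sse / df_err
f_stat = msr / mse                              # about 92.03

# Critical value and p-value for the F(2, 10) reference distribution
f_crit = stats.f.ppf(0.95, df_reg, df_err)      # about 4.10
p_value = stats.f.sf(f_stat, df_reg, df_err)

print(f_stat, f_crit, p_value)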
If we have a soil type for which the amount of extractable iron is 150 and the amount of extractable aluminum is 50, then the predicted Adsorption Index is \hat{Y} = -7.35066 + 0.112733(150) + 0.349(50) \approx 27.01.
Sometimes we have several predictors, and one or more of them is
only weakly related to the response variable. After including some
of the stronger predictors in the model, we want to know whether it
would make sense to include any of the weaker predictors as well.
Anytime we include another predictor in the model, we will increase the value of SSR and decrease the value of SSE, while the total SST = SSR + SSE remains the same. Since R^2 = SSR / SST, adding another predictor will always increase the explained variation in Y by some amount.
We want to know whether the increase in R2 due to adding a
relatively weak predictor is sufficiently large to offset the decrease
in the error degrees of freedom. To do this, we will look at the
adjusted coefficient of multiple determination.
Defn: The adjusted coefficient of multiple determination for a multiple regression model with k predictor variables is

R^2_{adj} = 1 - \frac{SSE / (n - p)}{SST / (n - 1)} = 1 - \left( \frac{n - 1}{n - p} \right) \frac{SSE}{SST} ,

where p = k + 1 is the number of estimated parameters (the intercept and the k slopes) after the additional predictors are added.
If the decrease in SSE from adding another predictor is not sufficient
to offset the loss of an error degree of freedom, then the adjusted
coefficient of multiple determination may actually decrease, and we
would decide not to add the additional predictor.
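As a small added illustration, the adjusted coefficient of multiple determination can be computed with a short Python helper; here p is counted as the number of estimated parameters (intercept plus slopes), the convention used in the definition above. Applying it to the adsorption model fitted earlier gives about 0.938.

def adjusted_r_squared(sse, sst, n, p):
    # n is the sample size and p is the number of estimated parameters
    # (intercept plus slopes), so the error degrees of freedom are n - p
    return 1 - (sse / (n - p)) / (sst / (n - 1))

# Adsorption model fitted earlier: n = 13, k = 2 predictors, so p = 3
print(adjusted_r_squared(sse=191.7892, sst=3721.6922, n=13, p=3))   # about 0.938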
Example: Data were collected on three variables in an observational
study at a semiconductor manufacturing plant. The finished
semiconductor is wire-bonded to a frame. The three variables
measured, for a sample of 25 units, are the Pull Strength (the amount
of force necessary to break the bond), the Wire Length, and the Die
Height. The data are shown in the table on page 288.
If we include only Wire Length in the model, we find that
R2 = 0.963954368.
If we include both variables in the model, the table of results from LINEST is shown below. We see that R^2 = 0.979905735, an increase of 0.015951367. Is this amount of increase worth the loss of an error degree of freedom? With only Wire Length in the model the adjusted coefficient of multiple determination is about 0.962; recomputing it with both predictors (SSE = 122.694469 on 22 error degrees of freedom, from the output below) gives about 0.978. In this case the adjusted value increases, so the reduction in SSE does offset the lost error degree of freedom and Die Height appears worth keeping in the model; had the adjusted value decreased instead, we would have left Die Height out.
   0.011815307    2.72668868     2.820283576
   0.002827293    0.09878057     1.032732335
   0.979905735    2.36157179     #N/A
 536.4198808     22              #N/A
5983.250235     122.694469       #N/A