Part IV: Advanced Regression Models

Chapter 17: Polynomial Regression

In this chapter, we provide models to account for curvature in a data set. This curvature may be an overall trend of the underlying population, or it may be a certain structure in a specified region of the predictor space. We will explore two common methods in this chapter.

17.1 Polynomial Regression

In our earlier discussions on multiple linear regression, we outlined ways to check the assumption of linearity by looking for curvature in various plots.

• For instance, we look at the plot of residuals versus the fitted values.
• We also look at a scatterplot of the response versus each predictor.

Sometimes a plot of the response versus a predictor may also show some curvature in that relationship. Such plots may suggest there is a nonlinear relationship. If we believe there is a nonlinear relationship between the response and predictor(s), then one way to account for it is through a polynomial regression model:

Y = β_0 + β_1 X + β_2 X^2 + . . . + β_h X^h + ε,   (17.1)

where h is called the degree of the polynomial. For lower degrees, the relationship has a specific name (i.e., h = 2 is called quadratic, h = 3 is called cubic, h = 4 is called quartic, and so on). As a bit of semantics, it was noted at the beginning of the previous course that nonlinear regression (which we discuss later) refers to nonlinearity in the coefficients, and the coefficients are linear in polynomial regression. Thus, polynomial regression is still considered linear regression!

In order to estimate equation (17.1), we would only need the response variable (Y) and the predictor variable (X). However, polynomial regression models may have other predictor variables in them as well, which could lead to interaction terms. So as you can see, equation (17.1) is a relatively simple model, but you can imagine how the model can grow depending on your situation! For the most part, we implement the same analysis procedures as in multiple linear regression.

To see how this fits into the multiple linear regression framework, let us consider a very simple data set of size n = 50 that I generated (see Table 17.1). The data were generated from the quadratic model

y_i = 5 + 12x_i − 3x_i^2 + ε_i,   (17.2)

where the ε_i's are assumed to be normally distributed with mean 0 and variance 2. A scatterplot of the data along with the fitted simple linear regression line is given in Figure 17.1(a). As you can see, a linear regression line is not a reasonable fit to the data. Residual plots of this linear regression analysis are also provided in Figure 17.1. Notice in the residuals versus fitted values plot how there is obvious curvature; it does not show the uniform randomness we have seen before. The histogram appears heavily left-skewed and does not show the ideal bell shape for normality. Furthermore, the NPP deviates from a straight line and curves down at the extreme percentiles. These plots alone suggest that something is wrong with the model being used and especially indicate the need for a higher-order model.

The matrices for the second-degree polynomial model are:

Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{50} \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_{50} & x_{50}^2 \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_{50} \end{pmatrix},

where the entries in Y and X consist of the raw data. So as you can see, we are in a setting where the analysis techniques used in multiple linear regression (e.g., OLS) are applicable here.
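To make the connection to ordinary least squares concrete, the following R sketch simulates data from a model of the form of equation (17.2) and fits both a straight-line and a quadratic model with lm(). This is not the code that produced Table 17.1; the seed and the predictor range below are assumptions for illustration only.

##########
## Simulate data from a quadratic model and compare a linear and a
## quadratic least squares fit (illustrative values, not the original data).
set.seed(501)                      # assumed seed, not from the notes
n <- 50
x <- runif(n, 5, 15)
y <- 5 + 12 * x - 3 * x^2 + rnorm(n, mean = 0, sd = sqrt(2))

lin.fit  <- lm(y ~ x)              # straight-line fit
quad.fit <- lm(y ~ x + I(x^2))     # second-degree polynomial fit

summary(lin.fit)
summary(quad.fit)

## Residual diagnostics analogous to Figure 17.1
plot(fitted(lin.fit), resid(lin.fit), xlab = "Fitted y", ylab = "Residuals")
hist(resid(lin.fit), main = "Histogram of Residuals")
qqnorm(rstudent(lin.fit)); qqline(rstudent(lin.fit))
##########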
[Figure 17.1 (plots not reproduced): (a) Scatterplot of the quadratic data with the OLS line. (b) Residual plot for the OLS fit. (c) Histogram of the residuals. (d) NPP for the Studentized residuals.]

  i   x_i     y_i       i   x_i     y_i       i   x_i     y_i
  1   6.6    -45.4     21   8.4   -106.5     41   8.0    -95.8
  2  10.1   -176.6     22   7.2    -63.0     42   8.9   -126.2
  3   8.9   -127.1     23  13.2   -362.2     43  10.1   -179.5
  4   6.0    -31.1     24   7.1    -61.0     44  11.5   -252.6
  5  13.3   -366.6     25  10.4   -194.0     45  12.9   -338.5
  6   6.9    -53.3     26  10.8   -216.4     46   8.1    -97.3
  7   9.0   -131.1     27  11.9   -278.1     47  14.9   -480.5
  8  12.6   -320.9     28   9.7   -162.7     48  13.7   -393.6
  9  10.6   -204.8     29   5.4    -21.3     49   7.8    -87.6
 10  10.3   -189.2     30  12.1   -284.8     50   8.5   -105.4
 11  14.1   -421.2     31  12.1   -287.5
 12   8.6   -113.1     32  12.1   -290.8
 13  14.9   -482.3     33   9.2   -137.4
 14   6.5    -42.9     34   6.7    -47.7
 15   9.3   -144.8     35  12.1   -292.3
 16   5.2    -14.2     36  13.2   -356.4
 17  10.7   -211.3     37  11.0   -228.5
 18   7.5    -75.4     38  13.1   -354.4
 19  14.9   -482.7     39   9.2   -137.2
 20  12.2   -295.6     40  13.2   -361.6

Table 17.1: The simulated 2-degree polynomial data set with n = 50 values.

Some general guidelines to keep in mind when estimating a polynomial regression model are:

• The fitted model is more reliable when it is built on a larger sample size n.
• Do not extrapolate beyond the limits of your observed values.
• Consider how large the values of the predictor(s) will be when incorporating higher-degree terms, as this may cause numerical overflow.
• Do not go strictly by low p-values to incorporate a higher-degree term; rather, use these to support your model only if the plot looks reasonable. This is a situation where you need to weigh "practical significance" against "statistical significance".
• In general, you should obey the hierarchy principle, which says that if your model includes X^h and X^h is shown to be a statistically significant predictor of Y, then your model should also include each X^j for all j < h, whether or not the coefficients for these lower-order terms are significant.

17.2 Response Surface Regression

A response surface model (RSM) is a method for determining a surface predictive model based on one or more variables. In the context of RSMs, the variables are often called factors, so to keep consistent with the corresponding methodology, we will use that term in this section. RSM methods are usually discussed in a Design of Experiments course, but there is a relevant regression component. Specifically, response surface regression is the fitting of a polynomial regression with a certain structure on the predictors.

Many industrial experiments are conducted to discover which values of given factor variables optimize a response. If each factor is measured at three or more values, then a quadratic response surface can be estimated by ordinary least squares regression. The predicted optimal value can be found from the estimated surface if the surface is shaped like a hill or valley.
If the estimated surface is more complicated, or if the optimum is far from the region of the experiment, then the shape of the surface can be analyzed to indicate the directions in which future experiments should be performed.

In polynomial regression, the predictors are often continuous with a large number of different values. In response surface regression, the factors (of which there are k) typically represent a quantitative measure whose factor levels (of which there are p) are equally spaced and established at the design stage of the experiment. This is what we call a p^k factorial design, because the analysis will involve all of the p^k different treatment combinations. Our goal is to find a polynomial approximation that works well in a specified region of the predictor space.

As an example, we may be performing an experiment with k = 2 factors where one of the factors is a certain chemical concentration in a mixture. The factor levels for the chemical concentration are 10%, 20%, and 30% (so p = 3). The factors are then coded in the following way:

X*_{i,j} = (X_{i,j} − [max_i(X_{i,j}) + min_i(X_{i,j})]/2) / ([max_i(X_{i,j}) − min_i(X_{i,j})]/2),

where i = 1, . . . , n indexes the sample and j = 1, . . . , k indexes the factor. For our example (assuming we label the chemical concentration factor as "1") we would have

X*_{i,1} = (10 − [30 + 10]/2) / ([30 − 10]/2) = −1, if X_{i,1} = 10%;
X*_{i,1} = (20 − [30 + 10]/2) / ([30 − 10]/2) = 0,  if X_{i,1} = 20%;
X*_{i,1} = (30 − [30 + 10]/2) / ([30 − 10]/2) = +1, if X_{i,1} = 30%.

(A small code sketch of this coding is given after the list below.) Some aspects which differentiate a response surface regression model from the general context of a polynomial regression model include:

• In a response surface regression model, the number of levels p is usually 2 or 3 and is usually the same for each factor. More complex models can be developed outside of these constraints, but such a discussion is better dealt with in a Design of Experiments course.
• The factors are treated as categorical variables. Therefore, the X matrix will have a noticeable pattern based on the way the experiment was designed. Furthermore, the X matrix is often called the design matrix in response surface regression.
• The number of factor levels must be at least as large as the number of factors (p ≥ k).
• If examining a response surface with interaction terms, then the model must obey the hierarchy principle (this is not required of general polynomial models, although it is usually recommended).
• The number of factor levels must be greater than the order of the model (i.e., p > h).
• The number of observations (n) must be greater than the number of terms in the model (including all higher-order terms and interactions). It is desirable to have a larger n; a rule of thumb is to have at least 5 observations per term in the model.
• Typically, response surface regression models only have two-way interactions, while polynomial regression models can (in theory) have k-way interactions.
• The response surface regression models we outlined are for a factorial design.
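The ±1 coding described above is simple to carry out directly. The following R sketch codes a vector of raw concentration levels; the data values and function name are hypothetical, used only to illustrate the formula:

##########
## Code a factor to the -1 / 0 / +1 scale used in response surface regression.
code.factor <- function(x) {
  mid  <- (max(x) + min(x)) / 2        # center of the observed range
  half <- (max(x) - min(x)) / 2        # half-width of the observed range
  (x - mid) / half
}

conc <- c(10, 20, 30, 10, 30, 20)      # hypothetical raw % concentrations
code.factor(conc)                      # returns -1, 0, +1, -1, +1, 0
##########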
Figure 17.2 shows how a factorial design can be diagramed as a square using factorial points. More elaborate designs can be constructed, such as a central composite design, which also takes into consideration axial (or star) points (also illustrated in Figure 17.2). Figure 17.2 pertains to a design with two factors, while Figure 17.3 pertains to a design with three factors.

[Figure 17.2 (diagrams not reproduced): (a) The points of a square portion of a design with factor levels coded at ±1; this is how a 2^2 factorial design is coded. (b) Illustration of the axial (or star) points of a design at (+a, 0), (−a, 0), (0, −a), and (0, +a). (c) The combination of the previous two diagrams with the design center at (0, 0); this is how a composite design is coded.]

[Figure 17.3 (diagrams not reproduced): (a) The points of a cube portion of a design with factor levels coded at the corners of the cube; this is how a 2^3 factorial design is coded. (b) Illustration of the axial (or star) points of this design. (c) The combination of the previous two diagrams with the design center at the origin; this is how a composite design is coded.]

We mentioned that response surface regression follows the hierarchy principle. However, some texts and software do report ANOVA tables which do not quite follow the hierarchy principle. While fundamentally there is nothing wrong with these tables, it really boils down to a matter of terminology: if the hierarchy principle is not in place, then technically you are just performing a polynomial regression.

Table 17.2 gives a list of all possible terms when assuming an hth-order response surface model with k factors. For any interaction that appears in the model (e.g., X_i^{h1} X_j^{h2} with h2 ≤ h1), the hierarchy principle says that at least the main factor effects for powers 1, . . . , h1 must appear in the model, that all h1-order interactions with the factor powers 1, . . . , h2 must appear in the model, and that all interactions of order less than h1 must appear in the model. Luckily, response surface regression models (and polynomial models, for that matter) rarely go beyond h = 3.

Effect                  Relevant Terms
Main Factor             X_i, X_i^2, X_i^3, . . . , X_i^h for all i
Linear Interaction      X_i X_j for all i < j
Quadratic Interaction   X_i^2 X_j for i ≠ j, and X_i^2 X_j^2 for all i < j
Cubic Interaction       X_i^3 X_j, X_i^3 X_j^2 for i ≠ j, and X_i^3 X_j^3 for all i < j
  ...                     ...
hth-order Interaction   X_i^h X_j, X_i^h X_j^2, X_i^h X_j^3, . . . , X_i^h X_j^{h-1} for i ≠ j, and X_i^h X_j^h for all i < j

Table 17.2: A table showing all of the terms that could be included in a response surface regression model. In the above, the indices for the factors are given by i = 1, . . . , k and j = 1, . . . , k.

For the next step, an ANOVA table is usually constructed to assess the significance of the model. Since the factor levels are all essentially treated as categorical variables, the designed experiment will usually result in replicates for certain factor level combinations. This is unlike multiple regression, where the predictors are usually assumed to be continuous and no predictor level combinations are assumed to be replicated. Thus, a formal lack of fit test is also usually incorporated. Furthermore, the SSR is broken down into the components making up the full model, so you can formally test the contribution of those components to the fit of your model.

An example of a response surface regression ANOVA is given in Table 17.3. Since it is not possible to compactly show a generic ANOVA table nor to compactly express the formulas, this example is for a quadratic model with linear interaction terms. The formulas are similar to their respective quantities defined earlier. For this example, assume that there are k factors, n observations, m unique factor level combinations, and q total regression parameters needed for the full model.
In Table 17.3, the following partial sums of squares are used to compose the SSR value:

• The sum of squares due to the linear component is SSLIN = SSR(X_1, X_2, . . . , X_k).
• The sum of squares due to the quadratic component is SSQUAD = SSR(X_1^2, X_2^2, . . . , X_k^2 | X_1, X_2, . . . , X_k).
• The sum of squares due to the linear interaction component is SSINT = SSR(X_1X_2, . . . , X_1X_k, X_2X_k, . . . , X_{k-1}X_k | X_1, X_2, . . . , X_k, X_1^2, X_2^2, . . . , X_k^2).

Source          df            SS        MS        F
Regression      q - 1         SSR       MSR       MSR/MSE
  Linear        k             SSLIN     MSLIN     MSLIN/MSE
  Quadratic     k             SSQUAD    MSQUAD    MSQUAD/MSE
  Interaction   q - 2k - 1    SSINT     MSINT     MSINT/MSE
Error           n - q         SSE       MSE
  Lack of Fit   m - q         SSLOF     MSLOF     MSLOF/MSPE
  Pure Error    n - m         SSPE      MSPE
Total           n - 1         SSTO

Table 17.3: ANOVA table for a response surface regression model with linear, quadratic, and linear interaction terms.

Other analysis techniques are commonly employed in response surface regression. For example, canonical analysis (a multivariate analysis tool) uses the eigenvalues and eigenvectors of the matrix of second-order parameters to characterize the shape of the response surface (e.g., whether the surface is flat or has some noticeable shape like a hill or a valley). There is also ridge analysis, which computes the estimated ridge of optimum response for increasing radii from the center of the original design. Since the context of these techniques is better suited for a Design of Experiments course, we will not develop their details here.

17.3 Examples

Example 1: Yield Data Set

This data set of size n = 15 contains measurements of yield from an experiment done at five different temperature levels. The variables are y = yield and x = temperature in degrees Fahrenheit. Table 17.4 gives the data used for this analysis.

  i   Temperature   Yield
  1        50        3.3
  2        50        2.8
  3        50        2.9
  4        70        2.3
  5        70        2.6
  6        70        2.1
  7        80        2.5
  8        80        2.9
  9        80        2.4
 10        90        3.0
 11        90        3.1
 12        90        2.8
 13       100        3.3
 14       100        3.5
 15       100        3.0

Table 17.4: The yield measurements data set pertaining to n = 15 observations.

Figure 17.4 gives a scatterplot of the raw data and then another scatterplot with lines pertaining to a linear fit and a quadratic fit overlaid. Obviously the trend of this data is better suited to a quadratic fit. Here we have the linear fit results:

##########
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.306306   0.469075   4.917 0.000282 ***
temp        0.006757   0.005873   1.151 0.270641
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3913 on 13 degrees of freedom
Multiple R-Squared: 0.09242,    Adjusted R-squared: 0.0226
F-statistic: 1.324 on 1 and 13 DF,  p-value: 0.2706
##########
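For reference, output like the above (and the quadratic fit reported next) can be produced along the following lines in R; the data frame and variable names below are assumptions for illustration, not taken from the original analysis code:

##########
## Yield example: straight-line fit versus quadratic fit.
yield.df <- data.frame(
  temp  = rep(c(50, 70, 80, 90, 100), each = 3),
  yield = c(3.3, 2.8, 2.9, 2.3, 2.6, 2.1, 2.5, 2.9, 2.4,
            3.0, 3.1, 2.8, 3.3, 3.5, 3.0)
)

lin.fit  <- lm(yield ~ temp, data = yield.df)
quad.fit <- lm(yield ~ temp + I(temp^2), data = yield.df)

summary(lin.fit)
summary(quad.fit)
anova(quad.fit)    # sequential sums of squares for the quadratic model
##########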
[Figure 17.4 (plots not reproduced): The yield data set with (a) a linear fit and (b) a quadratic fit.]

Here we have the quadratic fit results:

##########
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.9604811  1.2589183   6.323 3.81e-05 ***
temp        -0.1537113  0.0349408  -4.399 0.000867 ***
temp2        0.0010756  0.0002329   4.618 0.000592 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2444 on 12 degrees of freedom
Multiple R-Squared: 0.6732,     Adjusted R-squared: 0.6187
F-statistic: 12.36 on 2 and 12 DF,  p-value: 0.001218
##########

We see that both temperature and temperature squared are significant predictors in the quadratic model (with p-values of 0.0009 and 0.0006, respectively) and that the fit is much better than the linear fit. From this output, we see the estimated regression equation is

ŷ_i = 7.96048 − 0.15371x_i + 0.00108x_i^2.

Furthermore, the ANOVA table below shows that the model we fit is statistically significant at the 0.05 significance level with a p-value of 0.0012. Thus, our model should include a quadratic term.

##########
Analysis of Variance Table

Response: yield
            Df  Sum Sq Mean Sq F value   Pr(>F)
Regression   2 1.47656 0.73828   12.36 0.001218 **
Residuals   12 0.71677 0.05973
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##########

Example 2: Odor Data Set

An experiment is designed to relate three variables (temperature, ratio, and height) to a measure of odor in a chemical process. Each variable has three levels, but the design was not constructed as a full factorial design (i.e., it is not a 3^3 design). Nonetheless, we can still analyze the data using a response surface regression routine. The data were already coded and can be found in Table 17.5.

 Odor   Temperature   Ratio   Height
   66       -1          -1       0
   58       -1           0      -1
   65        0          -1      -1
  -31        0           0       0
   39        1          -1       0
   17        1           0      -1
    7        0           1      -1
  -35        0           0       0
   43       -1           1       0
   -5       -1           0       1
   43        0          -1       1
  -26        0           0       0
   49        1           1       0
  -40        1           0       1
  -22        0           1       1

Table 17.5: The odor data set measurements with the factor levels already coded.

First we will fit a response surface regression model consisting of all of the first-order and second-order terms. The summary of this fit is given below:

##########
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -30.667     10.840  -2.829   0.0222 *
temp         -12.125      6.638  -1.827   0.1052
ratio        -17.000      6.638  -2.561   0.0336 *
height       -21.375      6.638  -3.220   0.0122 *
temp2         32.083      9.771   3.284   0.0111 *
ratio2        47.833      9.771   4.896   0.0012 **
height2        6.083      9.771   0.623   0.5509
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.77 on 8 degrees of freedom
Multiple R-Squared: 0.8683,     Adjusted R-squared: 0.7695
F-statistic: 8.789 on 6 and 8 DF,  p-value: 0.003616
##########

As you can see, the square of height is the least statistically significant term, so we will drop it and rerun the analysis. The summary of this new fit is given below:

##########
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -26.923      8.707  -3.092 0.012884 *
temp         -12.125      6.408  -1.892 0.091024 .
ratio        -17.000      6.408  -2.653 0.026350 *
height       -21.375      6.408  -3.336 0.008720 **
temp2         31.615      9.404   3.362 0.008366 **
ratio2        47.365      9.404   5.036 0.000703 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.12 on 9 degrees of freedom
Multiple R-Squared: 0.8619,     Adjusted R-squared: 0.7852
F-statistic: 11.23 on 5 and 9 DF,  p-value: 0.001169
##########

By omitting the square of height, the temperature main effect has become only marginally significant. Note, however, that the square of temperature is still statistically significant. Since we are building a response surface regression model, we must obey the hierarchy principle, and therefore temperature will be retained in the model.
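The two fits above can be reproduced with lm() on the coded factors; the following is a sketch in which the data frame and column names are assumptions for illustration:

##########
## Odor example: second-order response surface fits on the coded factors.
odor.df <- data.frame(
  odor   = c(66, 58, 65, -31, 39, 17, 7, -35, 43, -5, 43, -26, 49, -40, -22),
  temp   = c(-1, -1, 0, 0, 1, 1, 0, 0, -1, -1, 0, 0, 1, 1, 0),
  ratio  = c(-1, 0, -1, 0, -1, 0, 1, 0, 1, 0, -1, 0, 1, 0, 1),
  height = c(0, -1, -1, 0, 0, -1, -1, 0, 0, 1, 1, 0, 0, 1, 1)
)

## Full second-order model (no interaction terms were fit in these notes).
full.fit <- lm(odor ~ temp + ratio + height +
                 I(temp^2) + I(ratio^2) + I(height^2), data = odor.df)

## Reduced model after dropping the square of height.
red.fit <- lm(odor ~ temp + ratio + height +
                I(temp^2) + I(ratio^2), data = odor.df)

summary(full.fit)
summary(red.fit)
##########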
Finally, contour and surface plots can also be generated for the response surface regression model. Figure 17.5 gives the contour plots (with odor as the contours) for each of the three levels of height (Figure 17.6 gives color versions of the plots). Notice how the contours increase as we move out to the corner points of the design space (so it is as if we are looking down into a cone). The surface plots of Figure 17.7 all look similar (with the exception of the temperature scale), but notice the curvature present in these plots.

[Figure 17.5 (plots not reproduced): The contour plots of ratio versus temperature with odor as a response for (a) height = −1, (b) height = 0, and (c) height = +1.]

[Figure 17.6 (plots not reproduced): Color versions of the contour plots of ratio versus temperature with odor as a response for (a) height = −1, (b) height = 0, and (c) height = +1.]

[Figure 17.7 (plots not reproduced): The surface plots of ratio versus temperature with odor as a response for (a) height = −1, (b) height = 0, and (c) height = +1.]

Chapter 18: Biased Regression Methods and Regression Shrinkage

Recall that earlier we dealt with multicollinearity (i.e., a near-linear relationship among some of the predictors) by centering the variables in order to reduce the variance inflation factors (which reduces the linear dependency). When multicollinearity occurs, the ordinary least squares estimates are still unbiased, but their variances are very large. However, we can add a degree of bias to the estimation process, thus reducing the variance (and the standard errors). This is known as the "bias-variance tradeoff" due to the functional relationship between the two quantities.

We proceed to discuss some popular methods for producing biased regression estimates when faced with a high degree of multicollinearity. The assumptions made for these methods are mostly the same as in the multiple linear regression model; namely, we assume linearity, constant variance, and independence. Any apparent violation of these assumptions must be dealt with first. However, these methods do not yield statistical intervals, due to uncertainty in the distributional assumption, so normality of the data is not assumed.

One additional note is that the procedures in this chapter are often referred to as "shrinkage methods". They are called shrinkage methods because, as we will see, the regression estimates we obtain cover a smaller range than those from ordinary least squares.

18.1 Ridge Regression

Perhaps the most popular (albeit controversial) and widely studied biased regression technique to deal with multicollinearity is ridge regression.
Before we get into the computational side of ridge regression, let us recall from the previous course how to perform a correlation transformation (and the corresponding notation), which is done by standardizing the variables. The standardized X matrix is given by

X^* = \frac{1}{\sqrt{n-1}} \begin{pmatrix}
\frac{X_{1,1}-\bar{X}_1}{s_{X_1}} & \frac{X_{1,2}-\bar{X}_2}{s_{X_2}} & \cdots & \frac{X_{1,p-1}-\bar{X}_{p-1}}{s_{X_{p-1}}} \\
\frac{X_{2,1}-\bar{X}_1}{s_{X_1}} & \frac{X_{2,2}-\bar{X}_2}{s_{X_2}} & \cdots & \frac{X_{2,p-1}-\bar{X}_{p-1}}{s_{X_{p-1}}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{X_{n,1}-\bar{X}_1}{s_{X_1}} & \frac{X_{n,2}-\bar{X}_2}{s_{X_2}} & \cdots & \frac{X_{n,p-1}-\bar{X}_{p-1}}{s_{X_{p-1}}}
\end{pmatrix},

which is an n × (p − 1) matrix, and the standardized Y vector is given by

Y^* = \frac{1}{\sqrt{n-1}} \begin{pmatrix} \frac{Y_1-\bar{Y}}{s_Y} \\ \frac{Y_2-\bar{Y}}{s_Y} \\ \vdots \\ \frac{Y_n-\bar{Y}}{s_Y} \end{pmatrix},

which is still an n-dimensional vector. Here,

s_{X_j} = sqrt( Σ_{i=1}^{n} (X_{i,j} − X̄_j)^2 / (n − 1) )

for j = 1, 2, . . . , (p − 1), and

s_Y = sqrt( Σ_{i=1}^{n} (Y_i − Ȳ)^2 / (n − 1) ).

Remember that we have removed the column of 1's in forming X*, effectively reducing the column dimension of the original X matrix by 1. Because of this, we can no longer estimate an intercept term (b_0), which may be an important part of the analysis.

When using the standardized variables, the regression model of interest becomes

Y* = X*β* + ε*,

where β* is now a (p − 1)-dimensional vector of standardized regression coefficients and ε* is an n-dimensional vector of errors pertaining to this standardized model. Thus, the ordinary least squares estimates are

β̂* = (X*^T X*)^{-1} X*^T Y* = r_XX^{-1} r_XY,

where r_XX is the (p − 1) × (p − 1) correlation matrix of the predictors and r_XY is the (p − 1)-dimensional vector of correlation coefficients between the predictors and the response. Thus β̂* is a function of correlations, and hence we have performed a correlation transformation. Notice further that

E[β̂*] = β*  and  V[β̂*] = σ^2 r_XX^{-1} = r_XX^{-1}.

For the variance-covariance matrix, σ^2 = 1 because we have standardized all of the variables.

Ridge regression adds a small value k (called a biasing constant) to the diagonal elements of the correlation matrix. (Recall that a correlation matrix has 1's down the diagonal, so these can loosely be thought of as a "ridge".) Mathematically, we have

β̃ = (r_XX + k I_{(p−1)×(p−1)})^{-1} r_XY,

where 0 < k < 1, though k is usually less than 0.3. The amount of bias in this estimator is given by

E[β̃ − β*] = [(r_XX + k I_{(p−1)×(p−1)})^{-1} r_XX − I_{(p−1)×(p−1)}] β*,

and the variance-covariance matrix is given by

V[β̃] = (r_XX + k I_{(p−1)×(p−1)})^{-1} r_XX (r_XX + k I_{(p−1)×(p−1)})^{-1}.

Remember that β̃ is calculated on the standardized variables (these are sometimes called the "standardized" ridge regression estimates). We can transform back to the original scale (these are sometimes called the ridge regression estimates) by

β̃_j† = (s_Y / s_{X_j}) β̃_j,
β̃_0† = ȳ − Σ_{j=1}^{p−1} β̃_j† x̄_j,

where j = 1, 2, . . . , p − 1.

How do we choose k? Many methods exist, but there is no agreement on which to use, mainly due to asymptotic instability in the estimates. A few methods are commonly used, both analytical and graphical. The first analytical method is called the fixed point method and uses the estimates provided by fitting the correlation transformation via ordinary least squares. This method suggests using

k = (p − 1) MSE* / (β̂*^T β̂*),

where MSE* is the mean square error obtained from the respective fit. Another method is the Hoerl-Kennard iterative method, which calculates

k^(t) = (p − 1) MSE* / (β̃_{k^(t−1)}^T β̃_{k^(t−1)}),

where t = 1, 2, . . .. Here, β̃_{k^(t−1)} denotes the ridge regression estimates obtained when the biasing constant is k^(t−1). This process is repeated until the difference between two successive estimates of k is negligible. The starting value for this method, k^(0), is chosen to be the value of k calculated using the fixed point method.
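A direct implementation of the standardized (correlation-form) ridge estimator takes only a few lines. The sketch below is a minimal illustration, assuming X is a numeric predictor matrix and y the response vector; in practice a routine such as MASS::lm.ridge would typically be used instead:

##########
## Ridge regression on the correlation scale: (r_XX + k I)^{-1} r_XY.
ridge.std <- function(X, y, k) {
  Xs  <- scale(X) / sqrt(nrow(X) - 1)     # correlation-transformed predictors
  ys  <- scale(y) / sqrt(length(y) - 1)   # correlation-transformed response
  rXX <- crossprod(Xs)                    # correlation matrix of the predictors
  rXY <- crossprod(Xs, ys)                # correlations with the response
  solve(rXX + k * diag(ncol(Xs)), rXY)    # standardized ridge estimates
}

## Example call with an arbitrarily chosen biasing constant:
## ridge.std(as.matrix(mtcars[, c("disp", "hp", "wt")]), mtcars$mpg, k = 0.05)
##########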
Perhaps the most common method, however, is graphical. The ridge trace is a plot of the estimated ridge regression coefficients versus k. The value of k is picked where the regression coefficients appear to have stabilized; the smallest such value of k is chosen, as it introduces the smallest amount of bias.

There are criticisms regarding ridge regression. One major criticism is that ordinary inference procedures are not available, since the exact distributional properties of the ridge estimator are not known. Another criticism is the subjective choice of k: while we mentioned a few methods here, there are numerous methods in the literature, each with their own limitations. On the flip side of these arguments lie some potential benefits of ridge regression. For example, it can accomplish what it sets out to do, which is to reduce multicollinearity. Also, ridge regression can occasionally provide an estimate of the mean response which is good for new values that lie outside the range of our observations (i.e., extrapolation); the mean response found by ordinary least squares is known to be poor for extrapolation.

18.2 Principal Components Regression

The method of principal components regression transforms the predictor variables to their principal components. Principal components of X*^T X* are extracted using the singular value decomposition (SVD), which says there exist matrices U (of dimension n × (p − 1)) and P (of dimension (p − 1) × (p − 1)) with orthonormal columns (i.e., U^T U = P^T P = I_{(p−1)×(p−1)}) such that

X* = U D P^T.

P is called the (factor) loadings matrix, while the (principal component) scores matrix is defined as Z = UD, such that Z^T Z = Λ. Here, Λ is a (p − 1) × (p − 1) diagonal matrix consisting of the nonzero eigenvalues of X*^T X* on the diagonal (for simplicity, we assume that the eigenvalues are in decreasing order down the diagonal: λ_1 ≥ λ_2 ≥ . . . ≥ λ_{p−1} > 0). Notice that Z = X*P, which implies that each column of Z is a linear combination of the columns of X*. The goal of principal components is to keep only those linear combinations which explain a larger amount of the variation (as determined using the eigenvalues described below).

Next, we regress Y* on Z. The model is

Y* = Zβ + ε*,

which has the least squares solution

β̂_Z = (Z^T Z)^{-1} Z^T Y*.

Severe multicollinearity is identified by very small eigenvalues. Multicollinearity is corrected by omitting those components which have small eigenvalues. Since the ith entry of β̂_Z corresponds to the ith component, we simply set to 0 those entries of β̂_Z which have correspondingly small eigenvalues. For example, suppose you have 10 predictors (and hence 10 principal components). You find that the last three eigenvalues are relatively small and decide to omit these three components. Therefore, you set the last three entries of β̂_Z equal to 0. With this value of β̂_Z, we can transform back to get the coefficients on the X* scale by

β̂_PC = P β̂_Z.

This is a solution to Y* = X*β* + ε*.
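A bare-bones version of this procedure in R, using the SVD directly, might look as follows. This is only a sketch for illustration (the number of trailing components to drop is a choice the analyst makes, as discussed next), not a full principal components regression routine:

##########
## Principal components regression via the SVD of the standardized predictors.
pcr.coef <- function(X, y, n.drop = 0) {
  Xs <- scale(X) / sqrt(nrow(X) - 1)             # correlation-transformed X*
  ys <- scale(y) / sqrt(length(y) - 1)           # correlation-transformed Y*
  sv <- svd(Xs)                                  # Xs = U D P'
  Z  <- sv$u %*% diag(sv$d)                      # principal component scores
  b.Z <- solve(crossprod(Z), crossprod(Z, ys))   # regress Y* on Z
  if (n.drop > 0)                                # zero out trailing components
    b.Z[(length(b.Z) - n.drop + 1):length(b.Z)] <- 0
  sv$v %*% b.Z                                   # back to the X* scale: P b_Z
}
##########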
Notice that we have not reduced the dimension of β̂_Z from the original calculation; we have only set certain entries equal to 0. Furthermore, as in ridge regression, we can transform back to the original scale by

β̂†_{PC,j} = (s_Y / s_{X_j}) β̂_{PC,j},
β̂†_{PC,0} = ȳ − Σ_{j=1}^{p−1} β̂†_{PC,j} x̄_j,

where j = 1, 2, . . . , p − 1.

How do you choose the number of eigenvalues to omit? This can be accomplished by looking at the cumulative percentage of variation explained by the components. For the jth component, this percentage is

( Σ_{i=1}^{j} λ_i ) / (λ_1 + λ_2 + . . . + λ_{p−1}) × 100,

where j = 1, 2, . . . , p − 1 (remember, the eigenvalues are in decreasing order). A common rule of thumb is that once you reach a component at which roughly 80%-90% of the variation has been explained, you can omit the remaining components.

18.3 Partial Least Squares

We next look at a procedure that is very similar to principal components regression. Here, we attempt to construct the Z matrix from the last section in a different manner, such that we are still interested in models of the form

Y* = Zβ + ε*.

Notice that in principal components regression the construction of the linear combinations in Z does not rely at all on the response Y*; yet we use the estimate β̂_Z (from regressing Y* on Z) to help us build our final estimate. The method of partial least squares allows us to choose the linear combinations in Z such that they predict Y* as well as possible.

We proceed to describe a common way to estimate with partial least squares. First, define S = X*^T Y*, so that SS^T = X*^T Y* Y*^T X*. We construct the score vectors (i.e., the columns of Z) as

z_i = X* r_i,

for i = 1, . . . , p − 1. The challenge becomes finding the r_i vectors. r_1 is just the first eigenvector of SS^T. For i = 2, . . . , p − 1, r_i maximizes

r_{i−1}^T SS^T r_i,

subject to the constraint

r_{i−1}^T X*^T X* r_i = z_{i−1}^T z_i = 0.

Next, we regress Y* on Z, which has the least squares solution

β̂_Z = (Z^T Z)^{-1} Z^T Y*.

As in principal components regression, we can transform back to get the coefficients on the X* scale by

β̂_PLS = R β̂_Z,

which is a solution to Y* = X*β* + ε*. In the above, R is the matrix whose ith column is r_i. Furthermore, as in both ridge regression and principal components regression, we can transform back to the original scale by

β̂†_{PLS,j} = (s_Y / s_{X_j}) β̂_{PLS,j},
β̂†_{PLS,0} = ȳ − Σ_{j=1}^{p−1} β̂†_{PLS,j} x̄_j,

where j = 1, 2, . . . , p − 1.

The method described above is sometimes referred to as the SIMPLS method. Another commonly used method is nonlinear iterative partial least squares (NIPALS). NIPALS is more commonly used when you have a vector of responses. While we do not discuss the differences between these algorithms any further, we do discuss later the setting where we have a vector of responses.

18.4 Inverse Regression

In simple linear regression, we introduced calibration intervals, which are a type of statistical interval for a predictor value given a value of the response. An inverse regression technique is essentially what is performed to find the calibration intervals (i.e., regress the predictor on the response), but calibration intervals do not extend easily to the multiple regression setting. However, we can still extend the notion of inverse regression when dealing with p − 1 predictors.
Let X_i be a p-dimensional vector (with first entry equal to 1 for an intercept, so that we actually have p − 1 predictors) such that

X = \begin{pmatrix} X_1^T \\ \vdots \\ X_n^T \end{pmatrix}.

However, assume that p is actually quite large with respect to n. Inverse regression can be used as a tool for dimension reduction (i.e., reducing p), which reveals to us the most important aspects (or directions) of the data.¹ The tool commonly used is called Sliced Inverse Regression (or SIR). SIR uses the inverse regression curve E(X|Y = y), which falls into a reduced-dimension space under certain conditions. SIR uses this curve to perform a weighted principal components analysis, from which one can determine an effective subset of the predictors.

¹ When we have p large with respect to n, we use the terminology dimension reduction. However, when we are more concerned about which predictors are significant or which functional form is appropriate for our regression model (and the size of p is not too much of an issue), then we use the model selection terminology.

The reason for reducing the dimension of the predictors is the curse of dimensionality: drawing inferences from the same number of data points in a higher-dimensional space becomes difficult due to the sparsity of the data in the volume of the higher-dimensional space compared to the volume of the lower-dimensional space.²

² As an example, consider 100 points on the unit interval [0, 1], then imagine 100 points on the unit square [0, 1] × [0, 1], then imagine 100 points on the unit cube [0, 1] × [0, 1] × [0, 1], and so on. As the dimension increases, the sparsity of the data makes it more difficult to make any relevant inferences about the data.

When working with the classic linear regression model

Y = Xβ + ε

or a more general regression model

Y = f(Xβ) + ε

for some real-valued function f, we know that the distribution of Y|X depends on X only through the p-dimensional variable β = (β_0, β_1, . . . , β_{p−1})^T. Dimension reduction claims that the distribution of Y|X depends on X only through the k-dimensional variable β* = (β*_1, . . . , β*_k)^T such that k < p. This new vector β* is called the effective dimension reduction direction (or EDR-direction).

The inverse regression curve is computed by looking at E(X|Y = y), which is a curve in R^p consisting of p one-dimensional regressions (as opposed to one p-dimensional surface in standard regression). The center of the inverse regression curve is located at E(E(X|Y = y)) = E(X). Therefore, the centered inverse regression curve is

m(y) = E(X|Y = y) − E(X),

which is a p-dimensional curve in R^p. Next, the "slice" part of SIR comes from estimating m(y) by dividing the range of Y into H non-overlapping intervals (or slices), which are then used to compute the sample means, m̂_h, of each slice. These sample means are a crude estimate of m(y).

With the basics of the inverse regression model in place, we can introduce an algorithm often used to estimate the EDR-direction vector for SIR:

1. Let Σ_X be the variance-covariance matrix of X. Using the standardized X matrix (i.e., the matrix defined earlier in this chapter as X*), we can rewrite the classic regression model as Y* = X*η + ε*, or the more general regression model as Y* = f(X*η) + ε*, where η = Σ_X^{1/2} β.
2. Divide the range of y_1, . . . , y_n into H non-overlapping slices (indexed by h = 1, . . . , H). Let n_h be the number of observations within each slice and I_h{·} be the indicator function for that slice, such that

n_h = Σ_{i=1}^{n} I_h{y_i}.

3. Compute

m̂_h = n_h^{-1} Σ_{i=1}^{n} x*_i I_h{y_i},

which are the means of the H slices.

4. Calculate the estimate of Cov(m(y)) by

V̂ = n^{-1} Σ_{h=1}^{H} n_h m̂_h m̂_h^T.

5. Identify the k largest eigenvalues λ̂_i and eigenvectors r_i of V̂. Construct the score vectors z_i = X* r_i as in partial least squares, which are the columns of Z. Then

η̂ = (Z^T Z)^{-1} Z^T Y*

is the standardized EDR-direction vector.

6. Transform the standardized EDR-direction vector back to the original scale by

β̂* = Σ̂_X^{-1/2} η̂.

18.5 Regression Shrinkage and Connections with Variable Selection

Suppose we now wish to find the least squares estimate for the model Y = Xβ + ε, but subject to a set of equality constraints Aβ = a. It can be shown (by using Lagrange multipliers) that

β̂_CLS = β̂_OLS − (X^T X)^{-1} A^T [A (X^T X)^{-1} A^T]^{-1} [A β̂_OLS − a],

which is called the constrained least squares estimator. This is helpful when you wish to restrict β from being estimated in various regions of R^p. You can also have more complicated constraints (e.g., inequality constraints, quadratic constraints, etc.), in which case more sophisticated optimization techniques need to be utilized. The constraints are imposed to restrict the range of β, and so any corresponding estimate can be thought of as a shrinkage estimate, as it covers a smaller range than the ordinary least squares estimate.

Ridge regression is one method that provides shrinkage estimators, although they are biased. Often we hope to shrink our estimates toward 0 by imposing certain constraints, but this may not always be possible. A common regression shrinkage procedure is the least absolute shrinkage and selection operator, or LASSO. LASSO is also concerned with finding the least squares estimate of Y = Xβ + ε, but subject to the inequality constraint Σ_{j=1}^{p} |β_j| ≤ t, which is called an L1-penalty since we are looking at an L1-norm.³ Here, t ≥ 0 is a tuning parameter which the user sets to control the amount of shrinkage. If we let β̂ be the ordinary least squares estimate and let t_0 = Σ_{j=1}^{p} |β̂_j|, then values of t < t_0 will cause shrinkage of the solution towards 0, and some coefficients may become exactly 0. Because of this, LASSO also accomplishes model (or subset) selection, since we can omit from the model those predictors whose coefficients become exactly 0.

³ Ordinary least squares minimizes Σ_{i=1}^{n} e_i^2, which involves an L2-norm; the corresponding constraint used in ridge regression is therefore an L2-penalty.

It is important to restate the purposes of LASSO. Not only does it shrink the regression estimates, but it also provides a way to accomplish subset selection. Furthermore, ridge regression also serves this dual purpose, although we introduced ridge regression as a way to deal with multicollinearity and not as a first-line effort for shrinkage. The corresponding constraint for ridge regression is Σ_{j=1}^{p} β_j^2 ≤ t, which is an L2-penalty. Many competitors to LASSO are available in the literature (such as regularized least absolute deviation and Dantzig selectors), but LASSO is one of the more commonly used methods.
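In practice the LASSO is fit with a dedicated solver rather than by direct constrained optimization. One possibility is the glmnet package (not part of these notes); a sketch, assuming a numeric predictor matrix x and response vector y are already defined:

##########
## LASSO fit over a path of penalty values, with cross-validated selection.
library(glmnet)

# alpha = 1 gives the LASSO (alpha = 0 would give ridge regression).
lasso.path <- glmnet(x, y, alpha = 1)
plot(lasso.path, xvar = "lambda")      # coefficient paths versus log(lambda)

cv.fit <- cv.glmnet(x, y, alpha = 1)   # choose the penalty by cross-validation
coef(cv.fit, s = "lambda.min")         # coefficients; some are exactly zero
##########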
It should be noted that there are numerous efficient algorithms available for estimating with these procedures, but due to the level of detail necessary, we will not explore those techniques here.

18.6 Examples

Example 1: GNP Data

This data set of size n = 16 contains macroeconomic data taken between the years 1947 and 1962. The economic indicators recorded were the GNP implicit price deflator (IPD), the GNP, the number of people unemployed, the number of people in the armed forces, the population, and the number of people employed. We wish to see if the GNP IPD can be modeled as a function of the other variables. The data set is given in Table 18.1.

Year   GNP IPD      GNP     Unemployed   Armed Forces   Population   Employed
1947     83.0     234.289      235.6         159.0        107.608      60.323
1948     88.5     259.426      232.5         145.6        108.632      61.122
1949     88.2     258.054      368.2         161.6        109.773      60.171
1950     89.5     284.599      335.1         165.0        110.929      61.187
1951     96.2     328.975      209.9         309.9        112.075      63.221
1952     98.1     346.999      193.2         359.4        113.270      63.639
1953     99.0     365.385      187.0         354.7        115.094      64.989
1954    100.0     363.112      357.8         335.0        116.219      63.761
1955    101.2     397.469      290.4         304.8        117.388      66.019
1956    104.6     419.180      282.2         285.7        118.734      67.857
1957    108.4     442.769      293.6         279.8        120.445      68.169
1958    110.8     444.546      468.1         263.7        121.950      66.513
1959    112.6     482.704      381.3         255.2        123.366      68.655
1960    114.2     502.601      393.1         251.4        125.368      69.564
1961    115.7     518.173      480.6         257.2        127.852      69.331
1962    116.9     554.894      400.7         282.7        130.081      70.551

Table 18.1: The macroeconomic data set for the years 1947 to 1962.

First we run a multiple linear regression procedure to obtain the following output:

##########
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  2946.85636 5647.97658   0.522   0.6144
GNP             0.26353    0.10815   2.437   0.0376 *
Unemployed      0.03648    0.03024   1.206   0.2585
Armed.Forces    0.01116    0.01545   0.722   0.4885
Population     -1.73703    0.67382  -2.578   0.0298 *
Year           -1.41880    2.94460  -0.482   0.6414
Employed        0.23129    1.30394   0.177   0.8631
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.195 on 9 degrees of freedom
Multiple R-Squared: 0.9926,     Adjusted R-squared: 0.9877
F-statistic: 202.5 on 6 and 9 DF,  p-value: 4.426e-09
##########

As you can see, not many predictors appear statistically significant at the 0.05 significance level, yet we have a very high R^2 (over 99%). By looking at the variance inflation factors, multicollinearity is obviously an issue:

##########
         GNP    Unemployed  Armed.Forces    Population          Year      Employed
  1214.57215      83.95865      12.15639     230.91221    2065.73394     220.41968
##########

In performing a ridge regression, we first obtain a trace plot of possible ridge coefficients (Figure 18.1). As you can see, the estimates of the regression coefficients shrink drastically until about k = 0.02. When using the Hoerl-Kennard method, a value of about k = 0.0068 is obtained. Other methods will certainly yield different estimates, which illustrates some of the criticism surrounding ridge regression.

[Figure 18.1 (plot not reproduced): Ridge regression trace plot, with the biasing constant on the x-axis and the standardized coefficients on the y-axis.]
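The trace plot and the choice of biasing constant can be obtained with standard software; one possible sketch uses MASS::lm.ridge. These data are the classical Longley macroeconomic series, available in R as the built-in data set longley (where GNP.deflator corresponds to the GNP IPD); the lambda grid below is an assumption:

##########
## Ridge trace for the GNP IPD regression using MASS::lm.ridge.
library(MASS)

ridge.fit <- lm.ridge(GNP.deflator ~ GNP + Unemployed + Armed.Forces +
                        Population + Year + Employed,
                      data = longley, lambda = seq(0, 0.1, by = 0.001))

plot(ridge.fit)     # trace plot of the standardized coefficients versus lambda
select(ridge.fit)   # prints HKB, L-W, and GCV choices of the biasing constant
##########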
The resulting estimates from this ridge regression analysis are

##########
         GNP    Unemployed  Armed.Forces    Population          Year      Employed
  25.3615288     3.3009416     0.7520553   -11.6992718    -6.5403380     0.7864825
##########

The estimates have obviously shrunk closer to 0 compared to the original estimates.

Example 2: Acetylene Data

This data set of size n = 16 contains observations of the percentage of conversion of n-heptane to acetylene and three predictor variables. The response variable is y = conversion of n-heptane to acetylene (%), x_1 = reactor temperature (degrees Celsius), x_2 = ratio of H2 to n-heptane (mole ratio), and x_3 = contact time (in seconds). The data set is given in Table 18.2.

  i     Y      X1      X2       X3
  1   49.0   1300     7.5    0.0120
  2   50.2   1300     9.0    0.0120
  3   50.5   1300    11.0    0.0115
  4   48.5   1300    13.5    0.0130
  5   47.5   1300    17.0    0.0135
  6   44.5   1300    23.0    0.0120
  7   28.0   1200     5.3    0.0400
  8   31.5   1200     7.5    0.0380
  9   34.5   1200    11.0    0.0320
 10   35.0   1200    13.5    0.0260
 11   38.0   1200    17.0    0.0340
 12   38.5   1200    23.0    0.0410
 13   15.0   1100     5.3    0.0840
 14   17.0   1100     7.5    0.0980
 15   20.5   1100    11.0    0.0920
 16   29.5   1100    17.0    0.0860

Table 18.2: The acetylene data set, where Y = conversion of n-heptane to acetylene (%), X1 = reactor temperature (degrees Celsius), X2 = ratio of H2 to n-heptane (mole ratio), and X3 = contact time (in seconds).

First we run a multiple linear regression procedure to obtain the following output:

##########
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -121.26962   55.43571  -2.188   0.0492 *
reactor.temp    0.12685    0.04218   3.007   0.0109 *
H2.ratio        0.34816    0.17702   1.967   0.0728 .
cont.time     -19.02170  107.92824  -0.176   0.8630
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.767 on 12 degrees of freedom
Multiple R-Squared: 0.9198,     Adjusted R-squared: 0.8998
F-statistic: 45.88 on 3 and 12 DF,  p-value: 7.522e-07
##########

As you can see, reactor temperature appears statistically significant at the 0.05 significance level, the H2 ratio is marginally significant, and contact time is clearly not significant. We also have a fairly high R^2 (around 90%). However, by looking at the pairwise scatterplots for the predictors in Figure 18.2, there appears to be a distinctive linear relationship between contact time and reactor temperature. This is further verified by looking at the variance inflation factors:

##########
reactor.temp     H2.ratio    cont.time
   12.225045     1.061838    12.324964
##########

We will proceed with a principal components regression analysis. First we perform the SVD of X* in order to get the Z matrix. Then, regressing Y* on Z yields

##########
Coefficients:
       Z1        Z2        Z3
 -0.66277   0.03952  -0.57268
##########

The above is simply β̂_Z. From the SVD of X*, the factor loadings matrix is found to be

P = \begin{pmatrix} -0.6742704 & 0.2183362 & 0.70547061 \\ -0.2956893 & -0.9551955 & 0.01301144 \\ 0.6767033 & -0.1998269 & 0.70861973 \end{pmatrix}.

So transforming back yields

β̂_PC = P β̂_Z = \begin{pmatrix} 0.05150079 \\ 0.15076806 \\ -0.86220916 \end{pmatrix}.

[Figure 18.2 (plots not reproduced): Pairwise scatterplots for the predictors from the acetylene data set, with LOESS curves overlaid. Does there appear to be any possible linear relationship between pairs of predictors?]
Chapter 19: Piecewise and Nonparametric Methods

This chapter focuses on regression models whose functional form starts to deviate from what we have discussed thus far. The first topic is a model in which different regressions are fit depending on which region of the predictor space we are in. The second topic is nonparametric models, which, as the name suggests, are free of distributional assumptions and consequently do not have regression coefficients readily available for estimation. Such models are best fit using a smoother, which is a tool for summarizing the trend of the response as a function of one or more predictors. The resulting estimate of the trend is less variable than the response itself.

19.1 Piecewise Linear Regression

A model that proposes a different linear relationship for different intervals (or regions) of the predictor is called a piecewise linear regression model. The predictor values at which the slope changes are called knots, which we will discuss throughout this chapter. Such models are helpful when you expect the linear trend of your data to change once you hit some threshold. Usually the knot values are predetermined due to previous studies or standards that are in place. There are methods for estimating the knot values (sometimes called changepoints in the context of piecewise linear regression), but we will not explore such methods.

For simplicity, we construct the piecewise linear regression model for the case of simple linear regression and also briefly discuss how this can be extended to the multiple regression setting. First, let us establish what the simple linear regression model with one knot value (k_1) looks like:

Y = β_0 + β_1 X_1 + β_2 (X_1 − k_1) I{X_1 > k_1} + ε,

where I{·} is the indicator function such that I{X_1 > k_1} = 1 if X_1 > k_1 and 0 otherwise. So, when X_1 ≤ k_1, the simple linear regression line is E(Y) = β_0 + β_1 X_1, and when X_1 > k_1, the simple linear regression line is E(Y) = (β_0 − β_2 k_1) + (β_1 + β_2) X_1. Such a regression model is fitted in the upper left-hand corner of Figure 19.1.

For more than one knot value, we can extend the above regression model by incorporating additional indicator terms. Suppose we have c knot values (i.e., k_1, k_2, . . . , k_c) and n observations. Then the piecewise linear regression model is written as

y_i = β_0 + β_1 x_{i,1} + β_2 (x_{i,1} − k_1) I{x_{i,1} > k_1} + . . . + β_{c+1} (x_{i,1} − k_c) I{x_{i,1} > k_c} + ε_i.

As you can see, this can be written more compactly as

y = Xβ + ε,

where β is a (c + 2)-dimensional vector and

X = \begin{pmatrix}
1 & x_{1,1} & (x_{1,1}-k_1) I\{x_{1,1} > k_1\} & \cdots & (x_{1,1}-k_c) I\{x_{1,1} > k_c\} \\
1 & x_{2,1} & (x_{2,1}-k_1) I\{x_{2,1} > k_1\} & \cdots & (x_{2,1}-k_c) I\{x_{2,1} > k_c\} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n,1} & (x_{n,1}-k_1) I\{x_{n,1} > k_1\} & \cdots & (x_{n,1}-k_c) I\{x_{n,1} > k_c\}
\end{pmatrix}.

Furthermore, you can see how, with more than one predictor, the X matrix can be constructed to have columns that are functions of the other predictors.
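Fitting such a model requires nothing beyond ordinary least squares once the knot terms are built as predictors. A sketch in R with a single, pre-specified knot follows; the knot value and variable names are illustrative assumptions:

##########
## Continuous piecewise linear fit with one known knot k1.
k1 <- 0.5                                    # pre-specified knot (assumed value)
pw.fit <- lm(y ~ x + I(pmax(x - k1, 0)))     # (x - k1) * I(x > k1) term

## For a discontinuity at the knot, add the plain indicator as well:
pw.jump <- lm(y ~ x + I(pmax(x - k1, 0)) + I(x > k1))
##########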
[Figure 19.1 (plots not reproduced): Plots illustrating continuous and discontinuous piecewise linear regressions with 1 and 2 knots.]

Sometimes you may also have a discontinuity that needs to be reflected at the knots (see the right-hand-side plots of Figure 19.1). This is easily handled in the piecewise linear model we constructed above by adding one more term to the model: for each k_j where there is a discontinuity, you add the corresponding indicator variable I{X_1 > k_j} as a regressor. Thus, the X matrix has the column vector

\begin{pmatrix} I\{x_{1,1} > k_j\} \\ I\{x_{2,1} > k_j\} \\ \vdots \\ I\{x_{n,1} > k_j\} \end{pmatrix}

appended to it for each k_j where there is a discontinuity. Extending discontinuities to the case of more than one predictor is analogous.

19.2 Local Regression Methods

Nonparametric regression attempts to find a functional relationship between y_i and x_i (with only one predictor here):

y_i = m(x_i) + ε_i,

where m(·) is the regression function to estimate and E(ε_i) = 0. It is not necessary to assume constant variance; in fact, one typically assumes Var(ε_i) = σ^2(x_i), where σ^2(·) is a continuous, bounded function.

Local regression is a method commonly used to model this nonparametric regression relationship. Specifically, local regression makes no global assumptions about the function m(·). Global assumptions are made in standard linear regression, where we assume that the regression curve we estimate (which is characterized by the regression coefficient vector β) properly models all of our data. Local regression, in contrast, assumes that m(·) can be well-approximated locally by a member of a simple class of parametric functions (e.g., a constant, a straight line, a quadratic curve, etc.). What drives local regression is Taylor's theorem from calculus, which says that any continuous function (which we assume m(·) is) can be approximated with a polynomial. In this section, we discuss some of the common local regression methods for estimating regressions nonparametrically.

19.2.1 Kernel Regression

One way of estimating m(·) is to use density estimation, which approximates the probability density function f(·) of a random variable X. Assuming we have n independent observations x_1, . . . , x_n from the random variable X, the kernel density estimator f̂_h(x) for estimating the density at x (i.e., f(x)) is defined as

f̂_h(x) = (1/(nh)) Σ_{i=1}^{n} K((x_i − x)/h).

Here, K(·) is called the kernel function and h is called the bandwidth. K(·) is a function often resembling a probability density function, but with no parameters (some common kernel functions are provided in Table 19.1). h controls the width of the window around x over which the density estimation is performed. Thus, a kernel density estimator is essentially a weighting scheme (dictated by the choice of kernel) which takes into consideration the proximity of each data point to x for a given bandwidth h: more weight is given to points near x and less weight to points farther from x.

Kernel       K(u)
Triangle     (1 − |u|) I(|u| ≤ 1)
Beta         [(1 − u^2)^g / Beta(0.5, g + 1)] I(|u| ≤ 1)
Gaussian     (1/√(2π)) exp(−u^2/2)
Cosinus      (1/2)(1 + cos(πu)) I(|u| ≤ 1)
Optcosinus   (π/4) cos(πu/2) I(|u| ≤ 1)

Table 19.1: A table of common kernel functions. In the above, I(|u| ≤ 1) is the indicator function yielding 1 if |u| ≤ 1 and 0 otherwise. For the beta kernel, the value g ≥ 0 is specified by the user and is a shape parameter. Common values of g are 0, 1, 2, and 3, which give the uniform, Epanechnikov, biweight, and triweight kernels, respectively.

With the formalities established, one can perform a kernel regression of y_i on x_i to estimate m(·) with the Nadaraya-Watson estimator:

m̂_h(x) = Σ_{i=1}^{n} K((x_i − x)/h) y_i / Σ_{i=1}^{n} K((x_i − x)/h),

where m has been subscripted to note its dependency on the bandwidth. As you can see, this kernel regression estimator is just a weighted average of the observed responses.
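The Nadaraya-Watson estimator is straightforward to compute directly; the following R sketch evaluates it on a grid of points using a Gaussian kernel (base R's ksmooth() provides a similar built-in facility). The simulated data and bandwidth are illustrative assumptions:

##########
## Nadaraya-Watson kernel regression estimate at a set of evaluation points.
nw.estimate <- function(x0, x, y, h, K = dnorm) {
  sapply(x0, function(x.pt) {
    w <- K((x - x.pt) / h)      # kernel weights for each observation
    sum(w * y) / sum(w)         # locally weighted average of the responses
  })
}

## Example with simulated data and an arbitrarily chosen bandwidth:
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
grid <- seq(0, 1, length.out = 200)
m.hat <- nw.estimate(grid, x, y, h = 0.1)
plot(x, y); lines(grid, m.hat)
##########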
It is also possible to construct approximate confidence intervals and confidence bands using the Nadaraya-Watson estimator, although under some restrictive assumptions. An approximate 100 × (1 − α)% confidence interval is given by

m̂_h(x) ± z_{1−α/2} sqrt( σ̂_h^2(x) ||K||_2^2 / (n h f̂_h(x)) ),

where h = c n^{−1/5} for some constant c > 0, z_{1−α/2} is the (1 − α/2)-quantile of the standard normal distribution, ||K||_2^2 = ∫ K^2(u) du is the squared L2-norm of the kernel, and

σ̂_h^2(x) = Σ_{i=1}^{n} K((x_i − x)/h) {y_i − m̂_h(x)}^2 / Σ_{i=1}^{n} K((x_i − x)/h).

Next, let h = n^{−δ} for δ ∈ (1/5, 1/2). Then, under certain regularity conditions, an approximate 100 × (1 − α)% confidence band is given by

m̂_h(x) ± z_{n,α} sqrt( σ̂_h^2(x) ||K||_2^2 / (n h f̂_h(x)) ),

where

z_{n,α} = [ −log{ −(1/2) log(1 − α) } / (2δ log(n))^{1/2} ] + d_n

and

d_n = (2δ log(n))^{1/2} + (2δ log(n))^{−1/2} log( ||K'||_2^2 / (2π ||K||_2^2) )^{1/2}.

Some final notes about kernel regression include:

• The choice of kernel and bandwidth are still major issues in research. Some general guidelines and procedures have been developed, but they are beyond the scope of this course.
• What we developed in this section is only for the case of one predictor. If you have multiple predictors (i.e., x_{i,1}, . . . , x_{i,p}), then one needs a multivariate kernel density estimator at a point x = (x_1, . . . , x_p)^T, which is defined as

f̂_h(x) = (1/n) Σ_{i=1}^{n} [1 / (∏_{j=1}^{p} h_j)] K( (x_{i,1} − x_1)/h_1, . . . , (x_{i,p} − x_p)/h_p ).

Multivariate kernels require more advanced methods and are difficult to use, since data sets with more predictors often suffer from the curse of dimensionality.

19.2.2 Local Polynomial Regression and LOESS

Local polynomial modeling is similar to kernel regression estimation, but the fitted values are now produced by a locally weighted regression rather than by a locally weighted average. The theoretical basis for this approach is a Taylor series expansion around a value x_i:

m(x_i) ≈ m(x) + m'(x)(x_i − x) + m''(x)(x_i − x)^2 / 2 + . . . + m^(q)(x)(x_i − x)^q / q!,

for x in a neighborhood of x_i.
It is then parameterized in a way such that: m(xi ) ≈ β0 (x)+β1 (x)(xi −x)+β2 (x)(xi −x)2 +. . .+βq (x)(xi −x)q , STAT 501 |xi −x| ≤ h, D. S. Young CHAPTER 19. PIECEWISE AND NONPARAMETRIC METHODS 275 so that β0 (x) = m(x) β1 (x) = m0 (x) β2 (x) = m00 (x)/2 .. . βq (x) = m(q) (x)/q!. Note that the β parameters are considered functions of x, hence the “local” aspect of this methodology. Local polynomial fitting minimizes 2 q n X X xi − x j yi − βj (x)(xi − x) K h i=1 j=0 with respect to the βj (x) terms. Then, letting β0 (x) 1 (x1 − x) . . . (x1 − x)q β1 (x) 1 (x2 − x) . . . (x2 − x)q , β(x) = X = .. .. . . . .. .. .. . . 1 (xn − x) . . . (x1 − x)q βq (x) and W = diag K x1h−x , . . . , K xnh−x , the local least squares estimate can be written as: β̂(x) = arg min(Y − Xβ(x))T W(Y − Xβ(x)) β(x) = (X W−1 X)−1 XT W−1 Y. T Thus we can estimate the ν th derivative of m(x) by m̂(ν) (x) = ν!β̂ ν (x). Finally, for any x, we can perform inference on the βj (x) (or the m(ν) (x)) terms in a manner similar to weighted least squares. The method of LOESS (which stands for Locally Estimated Scatterplot Smoother)1 is commonly used for local polynomial fitting. However, LOESS 1 There is also another version of LOESS called LOWESS, which stands for Locally WEighted Scatterplot Smoother. The main difference is the weighting that is introduced during the smoothing process. D. S. Young STAT 501 276 CHAPTER 19. PIECEWISE AND NONPARAMETRIC METHODS is not a simple mathematical model, but rather an algorithm that, when given a value of X, computes an appropriate value of Y. The algorithm was designed so that the LOESS curve travels through the middle of the data and gives points closest to each X value the greatest weight in the smoothing process, thus limiting the influence of outliers. Suppose we have a set of observations ((x1 , y1 ), . . . , (xn , yn ). LOESS follows a basic algorithm as follows: 1. Select a set of values partitioning [x(1) , x(n) ]. Let x0 be an individual value in this set. 2. For each observation, calculate the distance di = |xi − x0 |. Let q be the number of observations in the neighborhood of x0 . The neighborhood is formally defined as the q smallest values of di where q = dγne. γ is the proportion of points to be selected and is called the span (usually chosen to be about 0.40) and d·e means to take the next largest integer if the calculated value is not already an integer. 3. Perform a weighted regression of the yi ’s on the xi ’s using only the points in the neighborhood. The weights are given by |xi − x0 | wi (x0 ) = T , dq where T (·) is the tricube weight function given by (1 − |u|3 )3 , if |u| < 1; T (u) = 0, if |u| ≥ 1, and dq is the the largest distance in the neighborhood of observations close to x0 . The weighted regression for x0 is defined by the estimated regression coefficients β̂ LOESS = arg min β n X wi (x0 )[yi − (β0 + β1 (xi − x0 ) + β2 (xi − x0 )2 i=1 + . . . + βh (xi − x0 )h )]2 . For LOESS, usually h = 2 is sufficient. STAT 501 D. S. Young CHAPTER 19. PIECEWISE AND NONPARAMETRIC METHODS 277 4. Calculate the fitted values as ŷi,LOESS (x0 ) = β̂0,LOESS + β̂1,LOESS (x0 ) + [β̂2,LOESS (x0 )]2 + . . . + [β̂h,LOESS (x0 )]h . 5. Iterate the above procedure for another value of x0 . Since outliers can have a large impact on least squares estimates, a robust weighted regression procedure may also be used to lessen the influence of outliers on the LOESS curve. This is done by replacing Step 3 in the algorithm above with a new set of weights. 
These weights are calculated by taking the q LOESS residuals ri∗ = yi − ŷi,LOESS (xi ) and calculating new weights given by wi∗ = wi∗ B |ri∗ | . 6M For wi , the value wi∗0 is the previous weight for this observation (where the first time you calculate this weight can be done by the original LOESS procedure we outlined), M is the median of the q absolute values of the residuals, and B(·) is the bisquare weight function given by (1 − |u|2 )2 , if |u| < 1; B(u) = 0, if |u| ≥ 1. This robust procedure can be iterated up to 5 times for a given x0 . Some other notes about local regression methods include: • Various forms of local regression exist in the literature. The main thing to note is that these are approximation methods with much of the theory being driven by Taylor’s theorem from Calculus. • Kernel regression is actually a special case of local regression. • As with kernel regression, there is also an extension of local regression regarding multiple predictors. It requires use of a multivariate version of Taylor’s theorem around the p-dimensional point x0 . The model can D. S. Young STAT 501 278 CHAPTER 19. PIECEWISE AND NONPARAMETRIC METHODS include all main effects, pairwise combination, and k-wise combinations of the predictors up to the order of h. Weights can then be defined, such as kxi − x0 k wi (x0 ) = T , γ where again γ is the span. The values of xi can also be scaled so that the smoothness occurs the same way in all directions. However, note that this estimation is often difficult due to the curse of dimensionality. 19.2.3 Projection Pursuit Regression Besides procedures like LOESS, there is also an exploratory method called projection pursuit regression (or PPR) which attempts to reveal possible nonlinear and interesting structures in yi = m(xi ) + i by looking at univariate regressions instead of complicated multiple regressions to avoid the curse of dimensionality. A pure nonparametric approach can lead to strong oversmoothing and since the sparseness of the space requires to include a lot of space and observations to do a local averaging for a reliable estimate. To estimate the response function m(·) from the data, the following PPR algorithm is typically used: (0) 1. Set ri = yi . 2. For j = 1, . . ., maximize Pn i=1 2 R(j) =1− (j−1) ri 2 m̂(j) (α̂T (j) xi ) − 2 Pn (j−1) i=1 ri by varying over the orthogonal parameters α̂ ∈ Rp (i.e., kα̂k = 1) and a univariate regression function m̂(j) (·). 3. Compute new residuals (j) (j−1) ri = ri STAT 501 − m̂(j) (α̂T (j) xi ). D. S. Young CHAPTER 19. PIECEWISE AND NONPARAMETRIC METHODS 279 2 2 4. Repeat steps 2 and 3 until R(j) becomes small. A small R(j) implies T that m̂(j) (α̂(j) xi ) is approximately the zero function and we will not find any other useful direction. The advantages of using PPR for estimation is that we are using univariate regressions which are quick and easy to estimate. Also, PPR is able to approximate a fairly rich class of functions as well as ignore variables providing little to no information about m(·). Some disadvantages of using PPR include having to examine a p-dimensional parameter space to estimate α̂(j) and interpretation of a single term may be difficult. 19.3 Smoothing Splines A smoothing spline is a piecewise linear function where the polynomial pieces fit together at knots. Smoothing splines are continuous on the whole interval the function is defined on, including at the knots. 
Mathematically, a smoothing spline minimizes n X i=1 2 Z (yi − η(xi )) + ω b [η 00 (t)]2 dt a among all twice continuously differentiable functions η(·) where ω > 0 is a smoothing parameter and a ≤ x(1) ≤ . . . ≤ x(n) ≤ b (where, recall, that x(1) and x(n) are the minimum and maximum x values, respectively). The knots are usually chosen as the unique values of the predictors in the data set, but may also be a subset of them. In the function above, the first term measures the closeness to the data while the second term penalizes curvature in the function. In fact, it can be shown that there exists an explicit, unique minimizer, and that minimizer is a cubic spline with knots at each of the unique values of the xi . The smoothing parameter does just what it’s name suggests, it smooths the curve. Typically, 0 < ω ≤ 1, but this need not be the case. When ω > 1, then ω/(1 + ω) is said to be the tuning parameter. Regardless, when the smoothing parameter is near 1, a smoother curve is produced. Smaller values of the smoothing parameter (values near 0) often produce rougher curves as the curve is interpolating nearer to the observed data points (i.e., the curves are essentially being drawn right to the location of the data points). D. S. Young STAT 501 280 CHAPTER 19. PIECEWISE AND NONPARAMETRIC METHODS Notice that the cubic smoothing spline introduced above is only capable of handling one predictor. Suppose now that we have p predictors X1 , . . . , Xp .2 We wish to consider the model yi = φ(xi,1 , . . . , xi,p ) + i = φ(xi ) + i , fori = 1, . . . , n where φ(·) belongs to the space of functions whose partial derivatives of order m exist and are in L2 (Ωp ) such that Ωp is the domain of the p-dimensional random variable X.3 In general, m and p must satisfy the constraint 2m − p > 0. For a fixed ω (i.e., the smoothing parameter), we estimate φ by minimizing n 1X [yi − φ(xi )]2 + ωJm (φ), n i=1 which results in what is called a thin-plate smoothing spline. While there are several ways to define Jm (φ), a common way to define it for a thin-plate smoothing spline is by 2 Z +∞ Z +∞ X ∂φ m! dx1 · · · dxp , Jm (φ) = ··· tp t1 t 1 ! · · · tp ! ∂x1 · · · ∂xp −∞ −∞ Γ P where Γ is the set of all permutations of (t1 , . . . , tp ) such that pj=1 tj = m. Numerous algorithms exist for estimation and have demonstrated fairly stable numerical results. However, one must gently balance fitting the data closely with avoiding characterizing the fit with excess variation. Fairly general procedures also exist for constructing confidence intervals and estimating the smoothing parameter. The following subsection briefly describes the least squares method usually driving these algorithms. 19.3.1 Penalized Least Squares Penalized least squares estimates are a way to balance fitting the data closely while avoiding overfitting due to excess variation. A penalized least 2 We will forego assuming an intercept for simplicity in this discussion. Basically all this is saying is that the first m derivatives of φ(·) exist when evaluated at our values of xi,1 , . . . , xi,p for all i. 3 STAT 501 D. S. Young CHAPTER 19. PIECEWISE AND NONPARAMETRIC METHODS 281 squares fit is a surface which minimizes a penalized least squares function over the class of all such surfaces meeting certain regularity conditions. Let us assume we are in the case defined earlier for a thin-plate smoothing spline, but now there is also a parametric component. 
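Before writing down that model, here is a minimal illustration of the one-predictor cubic smoothing spline described above, using base R's smooth.spline(). The data are simulated purely for illustration; smooth.spline()'s spar argument is a rescaled version of the smoothing parameter, so, as with ω above, values near 1 give smoother fits and values near 0 follow the data closely.

##########
set.seed(3)
x <- seq(0, 1, length.out = 60)
y <- sin(2 * pi * x) + rnorm(60, sd = 0.25)

# spar is smooth.spline()'s rescaled smoothing parameter
fit.rough  <- smooth.spline(x, y, spar = 0.2)   # small penalty: wiggly fit
fit.smooth <- smooth.spline(x, y, spar = 0.9)   # large penalty: smoother fit

plot(x, y)
lines(predict(fit.rough,  x), lty = 2)
lines(predict(fit.smooth, x), lwd = 2)
##########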
The model we will consider is yi = φ(zi ) + xT i β + i , where zi is a q-dimensional vector of covariates while xi is a p-dimensional vector of covariates whose relationship with yi is characterized through β. So notice that we have a parametric component to this function and a nonparametric component. Such a model is said to be semiparametric in nature and such models are discussed in the last chapter. The ordinary least squares estimate for our model estimates φ(zi ) and β by minimizing n 1X 2 (yi − φ(zi ) − xT i β) . n i=1 However, the functional space of φ(zi ) is so large that a function can always be found which interpolates the points perfectly, but this will simply reflect all random variation in the data. Penalized least squares attempts to fit the data well, but provide a degree of smoothness to the fit. Penalized least squares minimizes n 1X 2 (yi − φ(zi ) − xT i β) + ωJm (φ), n i=1 where Jm (φ) is the penalty on the roughness of the function φ(·). Again, the squared term of this function measures the goodness-of-fit, while the second term measures the smoothness associated with φ(·). A larger ω penalizes rougher fits, while a smaller ω emphasizes the goodness-of-fit. A final estimate of φ for the penalized least squares method can be written as n X θ + φ̂(zi ) = α + zT δk Bk (xi ), i k=1 where the Bk ’s are basis functions dependent on the location of the xi ’s and α, θ, and δ are coefficients to be estimated. For a fixed ω, (α, θ, δ) can be estimated. The smoothing parameter ω can be chosen by minimizing the generalized crossvalidation (or GCV) function. Write ŷ = A(ω)y, D. S. Young STAT 501 282 CHAPTER 19. PIECEWISE AND NONPARAMETRIC METHODS where A(ω) is called the smoothing matrix. Then the GCV is defined as V (ω) = k(In×n − A(ω))yk2 /n [tr(In×n − A(ω))/n]2 and ω̂ = arg minω V (ω). 19.4 Nonparametric Resampling Techniques for β̂ In this section, we discuss two commonly used resampling techniques, which are used for estimating characteristics of the sampling distribution of β̂. While we discuss these techniques for the regression parameter β, it should be noted that they can be generalized and applied to any parameter of interest. They can also be used for constructing nonparametric “confidence” intervals for the parameter(s) of interest. 19.4.1 The Bootstrap Bootstrapping is a method where you resample from your data (often with replacement) in order to approximate the distribution of the data at hand. While conceptually bootstrapping procedures are very appealing (and they have been shown to possess certain asymptotic properties), they are computationally intensive. In the non parametric regression routines we presented, standard regression assumptions were not made. In these nonstandard situations, bootstrapping provides a viable alternative for providing standard errors and confidence intervals for the regression coefficients and predicted values. When in the regression setting, there are two types of bootstrapping methods that may be employed. Before we differentiate these methods, we first discuss bootstrapping in a little more detail. In bootstrapping, you assume that your sample is actually the population of interest. You draw B samples (B is usually well over 1000) of size n from your original sample with replacement. With replacement means that each observation you draw for your sample is always selected from the entire set of values in your original sample. For each bootstrap sample, the regression results are computed and stored. 
For example, if B = 5000 and we are trying to estimate the sampling distribution of the STAT 501 D. S. Young CHAPTER 19. PIECEWISE AND NONPARAMETRIC METHODS 283 regression coefficients for a simple linear regression, then the bootstrap will ∗ ∗ ∗ ∗ ∗ ∗ ) as your sample. , β1,5000 ), . . . , (β0,5000 , β1,2 ), (β0,2 , β1,1 yield (β0,1 Now suppose that you want the standard errors and confidence intervals for the regression coefficients. The standard deviation of the B estimates provided by the bootstrapping scheme is the bootstrap estimate of the standard error for the respective regression coefficient. Furthermore, a bootstrap confidence interval is found by sorting the B estimates of a regression coefficient and selecting the appropriate percentiles from the sorted list. For example, a 95% bootstrap confidence interval would be given by the 2.5th and 97.5th percentiles from the sorted list. Other statistics may be computed in a similar manner. One assumption which bootstrapping relies heavily on is that your sample approximates the population fairly well. Thus, bootstrapping does not usually work well for small samples as they are likely not representative of the underlying population. Bootstrapping methods should be relegated to medium sample sizes or larger (what constitutes a medium sample size is somewhat subjective). Now we can turn our attention to the two bootstrapping techniques available in the regression setting. Assume for both methods that our sample consists of the pairs (x1 , y1 ), . . . , (xn , yn ). Extending either method to the case of multiple regression is analogous. We can first bootstrap the observations. In this setting, the bootstrap samples are selected from the original pairs of data. So the pairing of a response with its measured predictor is maintained. This method is appropriate for data in which both the predictor and response were selected at random (i.e., the predictor levels were not predetermined). We can also bootstrap the residuals. The bootstrap samples in this setting are selected from what are called the Davison-Hinkley modified residuals, given by n ei 1X e p j e∗i = p − , 1 − hi,i n j=1 1 − hj,j where the ei ’s are the original regression residuals. We do not simply use the ei ’s because these lead to biased results. In each bootstrap sample, the randomly sampled modified residuals are added to the original fitted values forming new values of y. Thus, the original structure of the predictors will remain the same while only the response will be changed. This method is appropriate for designed experiments where the levels of the predictor are D. S. Young STAT 501 284 CHAPTER 19. PIECEWISE AND NONPARAMETRIC METHODS predetermined. Also, since the residuals are sampled and added back at random, we must assume the variance of the residuals is constant. If not, this method should not be used. Finally, a 100 × (1 − α)% bootstrap confidence interval for the regression coefficient βi is given by ∗ ∗ α (βi,b ×nc , βi,d(1− α )×ne ), 2 2 which is then used to calculate a 100×(1−α)% bootstrap confidence interval for E(Y |X = xh ), which is given by ∗ ∗ ∗ ∗ α (β0,b ×nc + β1,b α ×nc xh , β0,d(1 α )×ne + β1,d(1− α )×ne xh ). 2 19.4.2 2 2 2 The Jackknife Jackknifing, which is similar to bootstrapping, is used in statistical inference to estimate the bias and standard error (variance) of a statistic when a random sample of observations is used to calculate it. 
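Before describing the jackknife in detail, here is a minimal sketch of the first method above (bootstrapping the observations) for a simple linear regression. The data frame dat is simulated purely for illustration, and the boot package provides a more general implementation of the same scheme.

##########
set.seed(4)
x   <- runif(50)
dat <- data.frame(x = x, y = 2 + 3 * x + rnorm(50, sd = 0.5))

B <- 5000
boot.coefs <- matrix(NA, nrow = B, ncol = 2)
for (b in 1:B) {
  idx <- sample(nrow(dat), replace = TRUE)          # resample (x, y) pairs
  boot.coefs[b, ] <- coef(lm(y ~ x, data = dat[idx, ]))
}

# Bootstrap standard errors and 95% percentile confidence intervals
apply(boot.coefs, 2, sd)
apply(boot.coefs, 2, quantile, probs = c(0.025, 0.975))
##########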
The basic idea behind the jackknife variance estimator lies in systematically recomputing the estimator of interest by leaving out one or more observations at a time from the original sample. From this new set of replicates of the statistic, an estimate for the bias and variance of the statistic can be calculated, which can then be used to calculate jackknife confidence intervals. Below we outline the steps for jackknifing in the simple linear regression setting for simplicity, but the multiple regression setting is analogous: 1. Draw a sample of size n (x1 , y1 ), . . . , (xn , yn ) and divide the sample into s independent groups, each of size d. 2. Omit the first set of d observations from the sample and estimate β0 and β1 from the (n − d) remaining observations (call these estimates (J ) (J ) β̂0 1 and β̂1 1 , respectively). The remaining set of (n − d) observations is called the delete-d jackknife sample. 3. Omit each of the remaining sets of 2, . . . , s groups in turn and esti(J ) (J ) mate the respective regression coefficients. These are β̂0 2 , . . . , β̂0 s (J ) (J ) and β̂1 2 , . . . , β̂1 s . Note that this results in s = nd delete-d jackknife samples. STAT 501 D. S. Young CHAPTER 19. PIECEWISE AND NONPARAMETRIC METHODS 285 (J) (J) 4. Obtain the (joint) probability distribution F (β0 , β1 ) of delete-d jackknife estimates. This may be done empirically or through use of investigating an appropriate distribution. 5. Calculate the jackknife regression coefficient estimate, which is the (J) (J) mean of the F (β0 , β1 ) distribution, as: (J) β̂j = s X (Jk ) β̂j , k=1 for j = 0, 1. Thus, the delete-d jackknife (simple) linear regression equation is (J) (J) ŷi = β̂0 + β̂1 xi + ei . The jackknife bias for each regression coefficient is d J (β̂j ) = (n − 1)(β̂ (J) − βˆj ), bias j where βˆj is the estimate obtained when using the full sample of size n. The jackknife variance for each regression is var c J (β̂j ) = (n − 1) (J) ˆ 2 (β̂j − βj ) , n which implies that the jackknife standard error is q s.e. c J (β̂j ) = var c J (β̂j ). Finally, if normality is appropriate, then a 100 × (1 − α)% jackknife con(J) fidence interval for the regression coefficient βj is given by βˆj ± t∗n−2;1−α/2 × s.e. c J (β̂j ). Otherwise, we can construct a fully nonparametric jackknife confidence interval in a similar manner as the bootstrap version. Namely, (J) (J) β̂j,b α ×nc , β̂j,d(1− α )×ne , 2 D. S. Young 2 STAT 501 286 CHAPTER 19. PIECEWISE AND NONPARAMETRIC METHODS which can then be used to calculate a 100 × (1 − α)% bootstrap confidence interval for E(Y |X = xh ), which is given by (J) (J) (J) (J) β̂0,b α ×nc + β̂1,b α ×nc xh , β̂0,d(1 α )×ne + β̂1,d(1− α )×ne xh . 2 2 2 2 While for moderately sized data the jackknife requires less computation, there are some drawbacks to using the jackknife. Since the jackknife is using fewer samples, it is only using limited information about β̂. In fact, the jackknife can be viewed as an approximation to the bootstrap (it is a linear approximation to the bootstrap in that the two are roughly equal for linear estimators). Moreover, the jackknife can perform quite poorly if the estimator of interest is not sufficiently “smooth” (intuitively, smooth can be thought of as small changes to the data result in small changes to the calculated statistic), which can especially occur when your sample is too small. 
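Analogously, a minimal delete-1 jackknife (d = 1, so s = n groups) for the same hypothetical data frame dat used in the bootstrap sketch above might look as follows. The bias and standard error lines use the standard delete-1 jackknife formulas, with the jackknife estimate taken as the average of the leave-one-out fits.

##########
n <- nrow(dat)
jack.coefs <- matrix(NA, nrow = n, ncol = 2)
for (k in 1:n) {
  # Leave out the k-th observation and refit the regression
  jack.coefs[k, ] <- coef(lm(y ~ x, data = dat[-k, ]))
}

full.coefs <- coef(lm(y ~ x, data = dat))
jack.mean  <- colMeans(jack.coefs)

# Delete-1 jackknife bias and standard error for each coefficient
jack.bias <- (n - 1) * (jack.mean - full.coefs)
jack.se   <- sqrt((n - 1) / n * colSums(sweep(jack.coefs, 2, jack.mean)^2))
##########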
19.5 Examples Example 1: Packaging Data Set This data set of size n = 15 contains measurements of yield from a packaging plant where the manager wants to model the unit cost (y) of shipping lots of a fragile product as a linear function of lot size (x). Table 19.2 gives the data used for this analysis. Because of the economics of scale, the manager believes that the cost per unit will decrease at a fast rate for lot sizes of more than 1000. Based on the description of this data, we wish to fit a (continuous) piecewise regression with one knot value at k1 = 1000. Figure 19.2 gives a scatterplot of the raw data with a vertical line at the lot size of 1000. This appears to be a good fit. We can also obtain summary statistics regarding the fit ########## Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.0240268 0.1766955 22.774 3.05e-11 *** lot.size -0.0020897 0.0002052 -10.183 2.94e-07 *** lot.size.I -0.0013937 0.0003644 -3.825 0.00242 ** --STAT 501 D. S. Young CHAPTER 19. PIECEWISE AND NONPARAMETRIC METHODS 287 i Unit Cost Lot Size 1 1.29 1150 2 2.20 840 3 2.26 900 4 2.38 800 5 1.77 1070 6 1.25 1220 7 1.87 980 8 0.71 1300 9 2.90 520 10 2.63 670 11 0.55 1420 12 2.31 850 13 1.90 1000 14 2.15 910 15 1.20 1230 Table 19.2: The packaging data set pertaining to n = 15 observations. Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 Residual standard error: 0.09501 on 12 degrees of freedom Multiple R-Squared: 0.9838, Adjusted R-squared: 0.9811 F-statistic: 363.4 on 2 and 12 DF, p-value: 1.835e-11 ########## As can be seen, the two predictors are statistically significant for this piecewise linear regression model. Example 2: Quality Measurements Dataset (continued ) Recall that we fit a quadratic polynomial to the quality measurement data set. Let us also fit nonparametric regression curve to this data and calculate bootstrap confidence intervals for the slope parameters. Figure 19.3(a) shows two LOESS curves with two different spans. Here are some general things to think about when fitting data with LOESS: 1. Which fit appears to be better to you? D. S. Young STAT 501 288 CHAPTER 19. PIECEWISE AND NONPARAMETRIC METHODS ● 2.5 ● ● ● 2.0 ● ● ● Cost ●● 1.5 ● ● ● 1.0 ● ● 0.5 ● 600 800 1000 1200 1400 Lot Size Figure 19.2: A scatterplot of the packaging data set with a piecewise linear regression fitted to the data. 2. How do you think more data would affect the smoothness of the fits? 3. If we drive the span to 0, what type of regression line would you expect to see? 4. If we drive the span to 1, what type of regression line would you expect to see? Figure 19.3(b) shows two kernel regression curves with two different bandwidths. A Gaussian kernel is used. Some things to think about when fitting the data (as with the LOESS fit) are: 1. Which fit appears to be better to you? 2. How do you think more data would affect the smoothness of the fits? 3. What type of regression line would you expect to see as we change the bandwidth? 4. How does the choice of kernel affect the fit? STAT 501 D. S. Young CHAPTER 19. PIECEWISE AND NONPARAMETRIC METHODS 289 When performing local fitting (as with kernel regression), the last two points above are issues where there are still no clear solutions. 
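For reference, LOESS and kernel regression fits like those shown in Figure 19.3 can be produced with a few lines of R. The sketch below assumes a hypothetical data frame quality with columns score1 and score2; the spans and bandwidths are the ones quoted with the figure, but the code is illustrative rather than a reproduction of the exact fits.

##########
# Two LOESS fits with different spans (local quadratic, as discussed above)
lo.1 <- loess(score2 ~ score1, data = quality, span = 0.4, degree = 2)
lo.2 <- loess(score2 ~ score1, data = quality, span = 0.9, degree = 2)

# Two Gaussian kernel regression fits with different bandwidths
ks.1 <- ksmooth(quality$score1, quality$score2, kernel = "normal", bandwidth = 5)
ks.2 <- ksmooth(quality$score1, quality$score2, kernel = "normal", bandwidth = 15)

plot(score2 ~ score1, data = quality)
ord <- order(quality$score1)
lines(quality$score1[ord], predict(lo.1)[ord], lty = 1)
lines(quality$score1[ord], predict(lo.2)[ord], lty = 2)
lines(ks.1, col = "gray40")
lines(ks.2, col = "gray40", lty = 2)
##########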
[Figure 19.3 appears here: two panels plotting Score 2 against Score 1 for the quality data; panel (a) overlays LOESS fits with spans 0.4 and 0.9, and panel (b) overlays kernel regression fits with bandwidths 5 and 15.]

Figure 19.3: (a) A scatterplot of the quality data set and two LOESS fits with different spans. (b) A scatterplot of the quality data set and two kernel regression fits with different bandwidths.

Next, let us return to the orthogonal regression fit of this data. Recall that the slope term for the orthogonal regression fit was 1.4835. Using a nonparametric bootstrap (with B = 5000 bootstrap samples), we can obtain the following bootstrap confidence intervals for the orthogonal slope parameter:

• 90% bootstrap confidence interval: (0.9677, 2.9408).
• 95% bootstrap confidence interval: (0.8796, 3.6184).
• 99% bootstrap confidence interval: (0.6473, 6.5323).

Remember that if you were to perform another bootstrap with B = 5000, then the estimated intervals given above would be slightly different due to the randomness of the resampling process!

Chapter 20
Regression Models with Censored Data

Suppose we wish to estimate the parameters of a distribution where only a portion of the data is known. When the remainder of the data has a measurement that exceeds (or falls below) some threshold and only that threshold value is recorded for that observation, then the data are said to be censored. When an observation exceeds (or falls below) some threshold and is omitted from the data set entirely, then the data are said to be truncated. This chapter deals primarily with the analysis of censored data by first introducing the area of reliability (survival) analysis and then presenting some of the basic tools and models from this area as a segue into a regression setting. We also devote a section to discussing truncated regression models.

20.1 Overview of Reliability and Survival Analysis

It is helpful to formally define the area of analysis which is heavily concerned with estimating models with censored data. Survival analysis concerns the analysis of data from biological events associated with the study of animals and humans. Reliability analysis concerns the analysis of data from events associated with the study of engineering applications. We will utilize terminology from both areas for the sake of completeness.

Survival (reliability) analysis studies the distribution of lifetimes (failure times). The study will consist of the elapsed time between an initiating event and a terminal event. For example:

• Study the survival time of individuals in a cancer study. The initiating time could be the diagnosis of cancer or the start of treatment. The terminal event could be death or cure of the disease.
• Study the lifetime of various machine motors. The initiating time could be the date the machine was first brought online. The terminal event could be complete machine failure or the first time it must be brought off-line for maintenance.

The data are a combination of complete and censored values, which means a terminal event has occurred or not occurred, respectively. Formally, let Y be the observed time from the study, T denote the actual event time (sometimes referred to as the latent variable), and t denote some known threshold value at which the values are censored.
Observations in a study can be censored in the following manners: • Right censoring: This occurs when an observation has dropped out, been removed from a study, or did not reach a terminal event prior to termination of the study. In other words, Y ≤ T such that T, T < t; Y = t, T ≥ t. • Left censoring: This occurs when an observation reaches a terminal event before the first time point in the study. In other words, Y ≥ T such that T, T > t; Y = t, T ≤ t. • Interval censoring: This occurs when a study has discrete time points and an observation reaches a terminal event between two of the time points. In other words, for discrete time increments 0 = t1 < t2 < . . . < tr < ∞, we have Y1 < T < Y2 such that for j = 1, . . . , r − 1, tj , tj < T < tj+1 ; Y1 = 0, otherwise and Y2 = D. S. Young tj+1 , tj < T < tj+1 ; ∞, otherwise. STAT 501 292 CHAPTER 20. REGRESSION MODELS WITH CENSORED DATA • Double censoring: This is when all of the above censored observations can occur in a study. Moreover, there are two criteria that define the type of censoring in a study. If the experimenter controls the type of censoring, then we have non-random censoring, of which there are two types: • Type I or time-truncated censoring, which occurs if an observation is still alive (in operation) when a test is terminated after a predetermined length of time. • Type II or failure-truncated censoring, which occurs if an observation is still alive (in operation) when a test is terminated after a pre-determined number of failures is reached. Suppose T has probability density function f (t) with cumulative distribution function F (t). Since we are interested in survival times (lifetimes), the support of T is (0, +∞). There are 3 functions usually of interest in a survival (reliability) analysis: • The survival function S(t) (or reliability function R(t)) is given by: Z +∞ S(t) = R(t) = f (x)dx = 1 − F (t). t This is the probability that an individual survives (or something is reliable) beyond time t and is usually the first quantity studied. • The hazard rate h(t) (or conditional failure rate) is the probability that an observation at time t will experience a terminal event in the next instant. It is given by: h(t) = f (t) f (t) = . S(t) R(t) The empirical hazard (conditional failure) rate function is useful in identifying which probability distribution to use if it is not already specified. • The cumulative hazard function H(t) (or cumulative conditional failure function) is given by: Z t H(t) = h(x)dx. 0 STAT 501 D. S. Young CHAPTER 20. REGRESSION MODELS WITH CENSORED DATA 293 These are only the basics when it comes to survival (reliability) analysis. However, they provide enough of a foundation for our interests. We are interested in when a set of predictors (or covariates) are also measured with the observed time. 20.2 Censored Regression Model Censored regression models (also called the Tobit model) simply attempts to model the unknown variable T (which is assumed left-censored) as a linear combination of the covariates X1 , . . . , Xp−1 . For a sample of size n, we have Ti = β0 + β1 xi,1 + . . . + βp−1 xi,p−1 + i , where i ∼iid N (0, σ 2 ). Based on this model, it can be shown that for the observed variable Y , that E[Yi |Yi > t] = XT i β + σλ(αi ), where αi = (t − XT i β)/σ and λ(αi ) = φ(αi ) 1 − Φ(αi ) such that φ(·) and Φ(·) are the probability density function and cumulative distribution function of a standard normal random variable (i.e., N (0, 1)), respectively. 
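A short numerical sketch of the quantity λ(αi) and the conditional mean above may be helpful; all values below are hypothetical and chosen only to show the calculation.

##########
# lambda(alpha) = phi(alpha) / (1 - Phi(alpha)) for a standard normal
lambda <- function(alpha) dnorm(alpha) / (1 - pnorm(alpha))

# Hypothetical values for a single observation
xb    <- 2.5    # X_i' beta
sigma <- 1.2    # error standard deviation
t0    <- 1.0    # left-censoring threshold

alpha <- (t0 - xb) / sigma
xb + sigma * lambda(alpha)   # E[Y_i | Y_i > t], which exceeds xb
##########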
Moreover, the quantity λ(αi ) is called the inverse Mills ratio, which reappears later in our discussion about the truncated regression model. If we let i1 be the index of all of the uncensored values and i2 be the index of all of the left-censored values, then we can define a log-likelihood function for the estimation of the regression parameters (see Appendix C for further details on likelihood functions): X 2 `(β, σ) = −(1/2) [log(2π) + log(σ 2 ) + (yi − XT i β)/σ ] i1 + X log(1 − Φ(XT i β/σ)). i2 Optimization of the above equation yields estimates for β and σ. Now it should be noted that this is a very special case of a broader class of survival (reliability) regression models. However, it is commonly used so that is why it is usually treated separately than the broader class of regression models that are discussed in the next section. D. S. Young STAT 501 294 CHAPTER 20. REGRESSION MODELS WITH CENSORED DATA 20.3 Survival (Reliability) Regression Suppose we have n observations where we measure p − 1 covariates with the observed time. In our examples from the introduction, some covariates you may also measure include: • Gender, age, weight, and previous ailments of the cancer patients. • Manufacturer, metal used for the drive mechanisms, and running temperature of the machines. Let X∗ be the matrix of covariates as in the standard multiple regression model, but without the first column consisting of 1’s (so it is an n × (p − 1) matrix). Then we model T ∗ = β0 + X∗T β ∗ + , where β ∗ is a (p − 1)−dimensional vector, T ∗ = ln T , and has a certain distribution (which we will discuss shortly). Then T = exp (T ∗ ) = eβ0 +X ∗T β∗ e = eβ0 +X ∗T β∗ T ], where T ] = e() . So, the covariate acts multiplicatively on the survival time T. The distribution of will allow us to determine the distribution of T ∗ . Each possible probability distribution has a different h(t). Furthermore, in a survival regression setting, was assume the hazard rate at time t for an individual has the form: h(t|X∗ ) = h0 (t)k(X∗T β ∗ ) = h0 (t)eX ∗T β∗ . In the above, h0 (t) is called the baseline hazard and is the value of the hazard function when X∗ = 0 or when β ∗ = 0. Note in the expression for T ∗ that we separated out the intercept term β0 as it becomes part of the baseline hazard. Also, k(·) in the equation for h(t|X∗ ) is a specified link function, which for our purposes will be e(·) . Next we discuss some of the possible (and common) distributions assumed for . We do not write out the density formulas here, but they can be found in most statistical texts. The parameters for your distribution help control three STAT 501 D. S. Young CHAPTER 20. REGRESSION MODELS WITH CENSORED DATA 295 primary aspects of the density curve: location, scale, and shape. You will want to consider the properties your data appear to exhibit (or historically have exhibited) when determining which of the following to use: • The normal distribution with location parameter (mean) µ and scale parameter (variance) σ 2 . As we have seen, this is one of the more commonly used distributions in statistics, but is infrequently used for lifetime distribution as it allows negative values while lifetimes are always positive. One possibility is to consider a truncated normal or a log transformation (which we discuss next). • The lognormal distribution with location parameter δ, scale parameter µ, and shape parameter σ 2 . 
δ gives the minimum value of the random variable T , and the scale and shape parameters of the lognormal distribution are the location and scale parameters of the normal distribution, respectively. Note that if T has a lognormal distribution, then ln(T ) has a normal distribution. • The Weibull distribution with location parameter δ, scale parameter β, and shape parameter α. The Weibull distribution is probably most commonly used for time to failure data since it is fairly flexible to work with. δ again gives the minimum value of the random variable T and is often set to 0 so that the support of T is positive. Assuming δ = 0 provides the more commonly used two-parameter Weibull distribution. • The Gumbel distribution (or extreme-value distribution) with location parameter µ and scale parameter σ 2 is sometimes used, but more often it is presented due to its relationship to the Weibull distribution. If T has a Weibull distribution, then ln(T ) has a Gumbel distribution. • The exponential distribution with location parameter δ and scale parameter σ (or sometimes called rate 1/σ). δ again gives the minimum value of the random variable T and is often set to 0 so that the support of T is positive. Setting δ = 0 results in what is usually referred to as the exponential distribution. The exponential distribution is a model for lifetimes with a constant failure rate. If T has an exponential distribution with δ = 0, then ln(T ) has a standard Gumbel distribution (i.e., the scale of the Gumbel distribution is 1). D. S. Young STAT 501 296 CHAPTER 20. REGRESSION MODELS WITH CENSORED DATA • The logistic distribution with location parameter µ and scale parameter σ. This distribution is very similar to the normal distribution, but is used in cases where there are “heavier tails” (i.e., higher probability of the data occurring out in the tails of the distribution). • The log-logistic distribution with location parameter δ, scale parameter λ, and shape parameter α. δ again gives the minimum value of the random variable T and is often set to 0 so that the support of T is positive. Setting δ = 0 results in what is usually referred to as the loglogistic distribution. If T has a log-logistic distribution with δ = 0, then ln(T ) has a logistic distribution with location parameter µ = −1/ ln(λ) and scale parameter σ = 1/α. • The gamma distribution with location parameter δ, scale parameter β, and shape parameter α. The gamma distribution is a competitor to the Weibull distribution, but is more mathematically complicated and thus avoided where the Weibull appears to provide a good fit. The gamma distribution also arises because the sum of independent exponential random variables is gamma distributed. δ again gives the minimum value of the random variable T and is often set to 0 so that the support of T is positive. Setting δ = 0 results in what is usually referred to as the gamma distribution. • The beta distribution has two shape parameter, α and β, as well as two location parameters, A and B, which denote the minimum and maximum of the data. If the beta distribution is used for lifetime data, then it appears when fitting data which are assumed to have an absolute minimum and absolute maximum. Thus, A and B are almost always assumed known. Note that the above is not an exhaustive list, but provides some of the more commonly used distributions in statistical texts and software. Also, there is an abuse of notation in that duplication of certain characters (e.g., µ, σ, etc.) 
does not imply a mathematical relationship between all of the distributions where that character appears. Estimation of the parameters can be accomplished in two primary ways. One way is to construct a probability of the chosen distribution with your data and then apply least squares regression to this plot. Another, perhaps more appropriate, approach is to use maximum likelihood estimation as it STAT 501 D. S. Young CHAPTER 20. REGRESSION MODELS WITH CENSORED DATA 297 can be shown to be optimum in most situations and provides estimates of standard errors, and thus confidence limits. Maximum likelihood estimation is commonly accomplished by using a Newton-Raphson algorithm. 20.4 Cox Proportional Hazards Regression Recall from the last section that we set T ∗ = ln(T ) where the hazard function ∗Tβ ∗ is h(t|X∗ ) = h0 (t)eX . The Cox formation of this relationship gives: ln(h(t)) = ln(h0 (t)) + X∗T β ∗ , which yields the following form of the linear regression model: h(t) = X∗T β ∗ . ln h0 (t) Exponentiating both sides yields a ratio of the actual hazard rate and baseline hazard rate, which is called the relative risk: h(t) ∗T ∗ = eX β h0 (t) p−1 Y = eβi xi . i=1 Thus, the regression coefficients have the interpretation as the relative risk when the value of a covariate is increased by 1 unit. The estimates of the regression coefficients are interpreted as follows: • A positive coefficient means there is an increase in the risk, which decreases the expected survival (failure) time. • A negative coefficient means there is a decrease in the risk, which increases the expected survival (failure) time. • The ratio of the estimated risk functions for two different sets of covariates (i.e., two groups) can be used to examine the likelihood of Group 1’s survival (failure) time to Group 2’s survival (failure) time. D. S. Young STAT 501 298 CHAPTER 20. REGRESSION MODELS WITH CENSORED DATA Remember, for this model the intercept term has been absorbed by the baseline hazard. The model we developed above is the Cox Proportional Hazards regression model and does not include t on the right-hand side. Thus, the relative risk is constant for all values of t. Estimation for this regression model is usually done by maximum likelihood and Newton-Raphson is usually the algorithm used. Usually, the baseline hazard is found non parametrically, so the estimation procedure for the entire model is said to be semi parametric. Additionally, if there are failure time ties in the data, then the likelihood gets more complex and an approximation to the likelihood is usually used (such as the Breslow Approximation or the Efron Approximation). 20.5 Diagnostic Procedures Depending on the survival regression model being used, the diagnostic measures presented here may have a slightly different formulation. We do present somewhat of a general form for these measures, but the emphasis is on the purpose of each measure. It should also be noted that one can perform formal hypothesis testing and construct statistical intervals based on various estimates. Cox-Snell Residuals In the previous regression models we studied, residuals were defined as a difference between observed and fitted values. For survival regression, in order to check the overall fit of a model, the Cox-Snell residual for the ith observation in a data set is used and defined as: rCi = Ĥ0 (ti )eX ∗T β̂ ∗ . ∗ In the above, β̂ is the maximum likelihood estimate of the regression coefficient vector. 
Ĥ0 (ti ) is a maximum likelihood estimate of the baseline cumulative hazard function H0 (ti ), defined as: Z t H0 (t) = h0 (x)dx. 0 Notice that rCi > 0 for all i. The way we check for a goodness-of-fit with the Cox-Snell residuals is to estimate the cumulative hazard rate of the residuals STAT 501 D. S. Young CHAPTER 20. REGRESSION MODELS WITH CENSORED DATA 299 (call this ĤrC (trCi )) from whatever distribution you are assuming, and then plot ĤrC (trCi ) versus rCi . A good fit would be suggested if they form roughly a straight line (like we looked for in probability plots). Martingale Residuals Define a censoring indicator for the ith observation as 0, if observation i is censored; δi = 1, if observation i is uncensored. In order to identify the best functional form for a covariate given the assumed functional form of the remaining covariates, we use the Martingale residual for the ith observation, which is defined as: M̂i = δi − rCi . The M̂i values fall between the interval (−∞, 1] and are always negative for censored values. The M̂i values are plotted against the xj,i , where j represents the index of the covariate for which we are trying to identify the best functional form. Plotting a smooth-fitted curve over this data set will indicate what sort of function (if any) should be applied to xj,i . Note that the martingale residuals are not symmetrically distributed about 0, but asymptotically they have mean 0. Deviance Residuals Outlier detection in a survival regression model can be done using the deviance residual for the ith observation: q Di = sgn(M̂i ) −2(`i (θ̂) − `Si (θi )). For Di , `i (θ̂) is the ith log likelihood evaluated at θ̂, which is the maximum likelihood estimate of the model’s parameter vector θ. `Si (θi ) is the log likelihood of the saturated model evaluated at the maximum likelihood θ. A saturated model is one where n parameters (i.e., θ1 , . . . , θn ) fit the n observations perfectly. The Di values should behave like a standard normal sample. A normal probability plot of the Di values and a plot of Di versus the fitted ln(t)i values, will help to determine if any values are fairly far from the bulk of D. S. Young STAT 501 300 CHAPTER 20. REGRESSION MODELS WITH CENSORED DATA the data. It should be noted that this only applies to cases where light to moderate censoring occur. Partial Deviance Finally, we can also consider hierarchical (nested) models. We start by defining the model deviance: n X ∆= Di2 . i=1 Suppose we are interested in seeing if adding additional covariates to our model significantly improves the fit from our original model. Suppose we calculate the model deviances under each model. Denote these model deviances as ∆R and ∆F for the reduced model (our original model) and the full model (our model with all covariates included), respectively. Then, a measure of the fit can be done using the partial deviance: Λ = ∆R − ∆F = −2(`(θ̂ R ) − `(θ̂ F )) `(θ̂ R ) , = −2 log `(θ̂ F ) where `(θ̂ R ) and `(θ̂ F ) are the log likelihood functions evaluated at the maximum likelihood estimates of the reduced and full models, respectively. Luckily, this is a likelihood ratio statistic and has the corresponding asymptotic χ2 distribution. A large value of Λ (large with respect to the corresponding χ2 distribution) indicates the additional covariates improve the overall fit of the model. A small value of Λ means they add nothing significant to the model and you can keep the original set of covariates. 
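In R, the survival package's coxph() function fits the Cox proportional hazards model of Section 20.4, and the residuals discussed above can be extracted with residuals(). The sketch below simulates a small, hypothetical right-censored data set purely for illustration; it is not meant to reproduce any example from the text.

##########
library(survival)

# Simulated, hypothetical right-censored data (status = 1 means event observed)
set.seed(6)
n      <- 100
age    <- runif(n, 40, 70)
treat  <- rbinom(n, 1, 0.5)
true.t <- rexp(n, rate = exp(0.03 * age - 0.5 * treat) / 100)  # event times
cens.t <- rexp(n, rate = 1 / 80)                               # censoring times
surv.dat <- data.frame(age, treat,
                       time   = pmin(true.t, cens.t),
                       status = as.numeric(true.t <= cens.t))

fit <- coxph(Surv(time, status) ~ age + treat, data = surv.dat)
summary(fit)   # the exp(coef) column gives the estimated relative risks

# Residual-based diagnostics discussed above
res.mart <- residuals(fit, type = "martingale")
res.dev  <- residuals(fit, type = "deviance")
plot(surv.dat$age, res.mart)   # assess the functional form of a covariate
qqnorm(res.dev)                # deviance residuals should look roughly normal

# Likelihood ratio (partial deviance) comparison of nested models
fit.red <- coxph(Surv(time, status) ~ age, data = surv.dat)
anova(fit.red, fit)
##########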
Notice that this procedure is similar to the extra sum of squares procedure developed in the previous course. 20.6 Truncated Regression Models Truncated regression models are used in cases where observations with values for the response variable that are below and/or above certain thresholds are systematically excluded from the sample. Therefore, entire observations are missing so that neither the dependent nor independent variables STAT 501 D. S. Young CHAPTER 20. REGRESSION MODELS WITH CENSORED DATA 301 are known. For example, suppose we had wages and years of schooling for a sample of employees. Some persons for this study are excluded from the sample because their earned wages fall below the minimum wage. So the data would be missing for these individuals. Truncated regression models are often confused with the censored regression models that we introduced earlier. In censored regression models, only the value of the dependent variable is clustered at a lower and/or upper threshold value, while values of the independent variable(s) are still known. In truncated regression models, entire observations are systematically omitted from the sample based on the lower and/or upper threshold values. Regardless, if we know that the data has been truncated, we can adjust our estimation technique to account for the bias introduced by omitting values from the sample. This will allow for more accurate inferences about the entire population. However, if we are solely interested in the population that does not fall outside the threshold value(s), then we can rely on standard techniques that we have already introduced, namely ordinary least squares. Let us formulate the general framework for truncated distributions. Suppose that X is a random variable with a probability density function fX and associated cumulative distribution function FX (the discrete setting is defined analogously). Consider the two-sided truncation a < X < b. Then the truncated distribution is given by fX (x|a < X < b) = where gX (x) = gX (x) , FX (b) − FX (a) fX (x), a < x < b; 0, otherwise. Similarly, one-sided truncated distributions can be defined by assuming a or b are set at the respective, natural bound of the support for the distribution of X (i.e., FX (a) = 0 or FX (b) = 1, respectively). So a bottom-truncated (or left-truncated) distribution is given by fX (x|a < X) = gX (x) , 1 − FX (a) while a top-truncated (or right-truncated) distribution is given by fX (x|X < b) = D. S. Young gX (x) . FX (b) STAT 501 302 CHAPTER 20. REGRESSION MODELS WITH CENSORED DATA gX (x) is then defined accordingly for whichever distribution with which you are working. Consider the canonical multiple linear regression model Yi = XT i β + i , where i ∼iid N (0, σ 2 ). If no truncation (or censoring) is assumed with the data, then normal distribution theory yields 2 Yi |Xi ∼ N (XT i β, σ ). When truncating the response, the distribution, and consequently the mean and variance of the truncated distribution, must be adjusted accordingly. Consider the three possible truncation settings of a < Y < b (two-sided truncation), a < Yi (bottom-truncation), and Yi < b (top-truncation). Let T T αi = (a − XT i β)/σ, γi = (b − Xi β)/σ, and ψi = (yi − Xi β)/σ, such that yi is the realization of the random variable Yi . Moreover, recall that λ(z) is the inverse Mills ratio applied to the value of z and let δ(z) = λ(z)[(Φ(z))−1 − 1]. 
Then using established results for the truncated normal distribution, the three different truncated probability density functions are fY |X (yi |Θ, XT i , β, σ) = 1 φ(ψi ) σ 1−Φ(αi ) 1 φ(ψi ) σ Φ(γi )−Φ(αi ) 1 φ(ψi ) σ Φ(γi ) Θ = {a < Yi } and a < yi ; , , Θ = {a < Yi < b} and a < yi < b; , Θ = {Yi < b} and yi < b, while the respective truncated cumulative distribution functions are FY |X (yi |Θ, XT i , β, σ) = STAT 501 φ(ψi )−φ(αi ) , 1−Φ(αi ) Θ = {a < Yi } and a < yi ; φ(ψi )−φ(αi ) , Φ(γi )−Φ(αi ) Θ = {a < Yi < b} and a < yi < b; φ(ψi ) , Φ(γi ) Θ = {Yi < b} and yi < b. D. S. Young CHAPTER 20. REGRESSION MODELS WITH CENSORED DATA 303 Furthermore, the means of the three different truncated distributions are T Xi β + σλ(αi ), Θ = {a < Yi }; φ(αi )−φ(γi ) T T E[Yi |Θ, Xi ] = Xi β + σ Φ(γi )−Φ(αi ) , Θ = {a < Yi < b}; T Xi β − σδ(γi ), Θ = {Yi < b}, while the corresponding variances are 2 σ {1 − λ(αi )[λ(αi ) − αi ]}, Θ = {a < Yi }; 2 T αi φ(αi )−γi φ(γi ) φ(αi )−φ(γi ) 2 Var[Yi |Θ, Xi ] = , Θ = {a < Yi < b}; σ 1 + Φ(γi )−Φ(αi ) − Φ(γi )−Φ(αi ) 2 σ {1 − δ(γi )[δ(γi ) + γi ]}, Θ = {Yi < b}. Using the distributions defined above, the likelihood function can be found and maximum likelihood procedures can be employed. Note that the likelihood functions will not have a closed-form solution and thus numerical techniques must be employed to find the estimates of β and σ. It is also important to underscore the type of estimation method used in a truncated regression setting. The maximum likelihood estimation method that we just described will be used when you are interested in a regression equation that characterizes the entire population, including the observations that were truncated. If you are interested in characterizing just the subpopulation of observations that were not truncated, then ordinary least squares can be used. In the context of the example provided at the beginning of this section, if we regressed wages on years of schooling and were only interested in the employees who made above the minimum wage, then ordinary least squares can be used for estimation. However, if we were interested in all of the employees, including those who happened to be excluded due to not meeting the minimum wage threshold, then maximum likelihood can be employed. 20.7 Examples Example 1: Motor Dataset This data set of size n = 16 contains observations from a temperatureD. S. Young STAT 501 304 CHAPTER 20. REGRESSION MODELS WITH CENSORED DATA accelerated life test for electric motors. The motorettes were tested at four different temperature levels and when testing terminated, the failure times were recorded. The data can be found in Table 20.1. Hours 8064 1764 2772 3444 3542 3780 4860 5196 5448 408 1344 1440 1680 408 504 528 Censor Count Temperature 0 10 150 1 1 170 1 1 170 1 1 170 1 1 170 1 1 170 1 1 170 1 1 170 0 3 170 1 2 190 1 2 190 1 1 190 0 5 190 1 2 220 1 3 220 0 5 220 Table 20.1: The motor data set measurements with censoring occurring if a 0 appears in the Censor column. This data set is actually a very common data set analyzed in survival analysis texts. We will proceed to fit it with a Weibull survival regression model. The results from this analysis are ########## Value Std. Error z p (Intercept) 17.0671 0.93588 18.24 2.65e-74 count 0.3180 0.15812 2.01 4.43e-02 temp -0.0536 0.00591 -9.07 1.22e-19 Log(scale) -1.2646 0.24485 -5.17 2.40e-07 Scale= 0.282 STAT 501 D. S. Young CHAPTER 20. 
REGRESSION MODELS WITH CENSORED DATA 305 Weibull distribution Loglik(model)= -95.6 Loglik(intercept only)= -110.5 Chisq= 29.92 on 2 degrees of freedom, p= 3.2e-07 Number of Newton-Raphson Iterations: 9 n= 16 ########## As we can see, the two covariates are statistically significant. Furthermore, the scale which is estimated (at 0.282) is the scale pertaining to the distribution being fit for this model (i.e., Weibull). It too is found to be statistically significant. While we appear to have a decent fitting model, we will turn to looking at the deviance residuals. Figure 20.1(a) gives a plot of the deviance residuals versus ln(time). As you can see, there does appear to be one value with a deviance residual of almost -3. This value may be cause for concern. Also, Figure 20.1(b) gives the NPP plot for these residuals and they do appear to fit along a straight line, with the trend being somewhat impacted by that residual in question. One could attempt to remove this point and rerun the analysis, but the overall fit seems to be good and there are no indications in the study that this was an incorrectly recorded point, so we will leave it in the analysis. Example 2: Logical Reasoning Dataset Suppose that an educational researcher administered a (hypothetical) test meant to relate one’s logical reasoning with their mathematical skills. n = 100 participants were chosen for this study and the (simulated) data is provided in Table 20.2. The test consists of a logical reasoning section (where the participants received a score between 0 and 10) and a mathematical problem solving section (where the participants receive a score between 0 and 80). The scores from the mathematics section (y) were regressed on the scores from the logical reasoning section (x). The researcher was interested in only those individuals who received a score of 50 or better on the mathematics section as they would be used for the next portion of the study, so the data was truncated at y = 50. Figure 20.2 shows the data with different regression fits depending on the assumptions that are made. The solid black circles are all of the participants with a score of 50 or better on the mathematics section, while the open red circles indicate those values that are either truncated (as in Figure 20.2(a)) D. S. Young STAT 501 306 CHAPTER 20. REGRESSION MODELS WITH CENSORED DATA Deviance Residuals Normal Q−Q Plot 1 ● 1 ● ● ● ● ● ● ● ● ● −1 ● ● 6.0 6.5 7.0 7.5 Log Hours (a) 8.0 8.5 9.0 0 ● ● ● ● ● ● ● ● −2 ● ● ● −3 −3 ● ● ● −1 0 ● ● −2 Deviance Residuals ● ● Sample Quantiles ● ● −2 −1 0 1 2 Theoretical Quantiles (b) Figure 20.1: (a) Plot of the deviance residuals. (b) NPP plot for the deviance residuals. or censored (as in Figure 20.2(b)). The dark blue line on the left is the truncated regression line that is estimated using ordinary least squares. So the interpretation of this line will only apply tho those data that were not truncated, which is what the researcher is interested in. The estimated model is given below: ########## Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 50.8473 1.1847 42.921 < 2e-16 *** x 1.6884 0.1871 9.025 7.84e-14 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 4.439 on 80 degrees of freedom Multiple R-squared: 0.5045, Adjusted R-squared: 0.4983 F-statistic: 81.46 on 1 and 80 DF, p-value: 7.835e-14 ########## Notice that the estimated regression line never drops below the level of truncation (i.e., y = 50) within the domain of the x variable. STAT 501 D. S. 
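The fit above is simply ordinary least squares applied to the non-truncated portion of the sample. Assuming the data of Table 20.2 are stored in a hypothetical data frame logic with columns x and y, the call might look like the following.

##########
# Keep only the participants with a math score of at least 50
logic.trunc   <- subset(logic, y >= 50)
fit.trunc.ols <- lm(y ~ x, data = logic.trunc)
summary(fit.trunc.ols)

plot(y ~ x, data = logic.trunc)
abline(fit.trunc.ols)
##########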
Young CHAPTER 20. REGRESSION MODELS WITH CENSORED DATA 307 Suppose now that the researcher is interested in all of the data and, say, there is some problem with recovering those participants that were in the truncated portion of the sample. Then the truncated regression line can be estimated via the method of maximum likelihood estimation, which is the light blue line in Figure 20.2(a). This line can (and will likely) go beyond the level of truncation since the estimation method is accounting for the truncation. The estimated model is given below: ########## Coefficients : Estimate Std. Error t-value Pr(>|t|) (Intercept) 47.69938 1.89772 25.1351 < 2.2e-16 *** x 2.09223 0.26979 7.7551 8.882e-15 *** sigma 4.81855 0.46211 10.4273 < 2.2e-16 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Log-Likelihood: -231.21 on 3 Df ########## This will be helpful for the researcher to say something about the broader population of individuals tested other than those who only received a math score of 50 or higher. Moreover, notice that both methods yield highly significant slope and intercept terms for this data, as would be expected by observing the strong linear trend in this data. Figure 20.2(b) shows the estimate obtained when using a survival regression fit when assuming normal errors. Suppose the data had inadvertently been censored at y = 50. So all of the red open circles now correspond to a solid red circle in Figure 20.2(b). Since the data is now treated as leftcensored, we are actually fitting a Tobit regression model. The Tobit fit is given by the green line and the results are given below: ########## Value Std. Error z p (Intercept) 52.21 1.0110 51.6 0.0e+00 x 1.50 0.1648 9.1 8.9e-20 Log(scale) 1.46 0.0766 19.0 7.8e-81 Scale= 4.3 D. S. Young STAT 501 308 CHAPTER 20. REGRESSION MODELS WITH CENSORED DATA Gaussian distribution Loglik(model)= -241.6 Loglik(intercept only)= -267.1 Chisq= 51.03 on 1 degrees of freedom, p= 9.1e-13 Number of Newton-Raphson Iterations: 5 n= 100 ########## Moreover, the dashed red line in both figures is the ordinary least squares fit (assuming all of the data values are known and used in the estimation) and is simply provided for comparative purposes. The estimates for this fit are given below: ########## Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 47.2611 0.9484 49.83 <2e-16 *** x 2.1611 0.1639 13.19 <2e-16 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 4.778 on 98 degrees of freedom Multiple R-squared: 0.6396, Adjusted R-squared: 0.6359 F-statistic: 173.9 on 1 and 98 DF, p-value: < 2.2e-16 ########## As you can see, the structure of your data and underlying assumptions can change your estimate - namely because you are attempting to estimate different models. The regression lines in Figure 20.2 are a good example of how different assumptions can alter the final estimates that you report. STAT 501 D. S. Young CHAPTER 20. 
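The other two fits in Example 2 rely on contributed packages. The sketch below reuses the hypothetical logic and logic.trunc data frames from above; the truncreg package fits the truncated regression by maximum likelihood, and AER's tobit() (a wrapper around survreg()) produces the left-censored Tobit fit. The calls are illustrative, and argument defaults should be checked against each package's documentation.

##########
library(truncreg)
library(AER)

# Maximum likelihood fit of the left-truncated regression (truncation at y = 50),
# using only the observations that survived the truncation
fit.trunc.mle <- truncreg(y ~ x, data = logic.trunc,
                          point = 50, direction = "left")
summary(fit.trunc.mle)

# Tobit fit: scores below 50 are treated as left-censored at 50
logic.cens <- transform(logic, y.cens = pmax(y, 50))
fit.tobit  <- tobit(y.cens ~ x, left = 50, data = logic.cens)
summary(fit.tobit)

# Ordinary least squares on the full data, for reference
summary(lm(y ~ x, data = logic))
##########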
REGRESSION MODELS WITH CENSORED DATA 309 Survival Regression Fit 80 Truncated MLE Fit Truncated OLS Fit Regular OLS Fit ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ●● ● ● ● ● ● ● ● 50 ● ● ● ● ●● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● 70 ● ●● ● ● ● ● ● ● ● ●●●●●●● ●●● ●●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● 40 ● 40 ● ●● ● ● ● 0 ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● 50 y 60 ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● y 70 ● ● ● Survival Fit Regular OLS Fit ● ● 60 80 Truncated Regression Fits 2 4 6 x (a) 8 10 ● 0 2 4 6 8 10 x (b) Figure 20.2: (a) A plot of the logical reasoning data. The red circles have been truncated as they fall below 50. The maximum likelihood fit for the truncated regression (solid light blue line) and the ordinary least squares fit for the truncated data set (solid dark blue line) are shown. The ordinary least squares line (which includes the truncated values for the estimation) is shown for reference. (b) The logical reasoning data with a Tobit regression fit provided (solid green line). The data has been censored at 50 (i.e., the solid red dots are included in the data). Again, the ordinary least squares line has been provided for reference. D. S. Young STAT 501 310 CHAPTER 20. REGRESSION MODELS WITH CENSORED DATA x 0.00 0.10 0.20 0.30 0.40 0.51 0.61 0.71 0.81 0.91 1.01 1.11 1.21 1.31 1.41 1.52 1.62 1.72 1.82 1.92 2.02 2.12 2.22 2.32 2.42 y 46.00 56.38 45.59 53.66 40.05 46.62 44.56 47.20 57.06 49.18 51.06 51.75 46.73 42.04 48.83 51.81 57.35 49.91 49.82 61.53 47.40 54.78 48.94 55.13 43.57 x 2.53 2.63 2.73 2.83 2.93 3.03 3.13 3.23 3.33 3.43 3.54 3.64 3.74 3.84 3.94 4.04 4.14 4.24 4.34 4.44 4.55 4.65 4.75 4.85 4.95 y 49.95 51.58 59.50 50.84 55.65 51.55 49.16 58.59 51.90 62.95 57.74 54.37 58.21 55.44 58.62 53.63 43.46 57.42 60.64 50.99 50.42 54.68 54.40 60.21 58.70 x 5.05 5.15 5.25 5.35 5.45 5.56 5.66 5.76 5.86 5.96 6.06 6.16 6.26 6.36 6.46 6.57 6.67 6.77 6.87 6.97 7.07 7.17 7.27 7.37 7.47 y 64.31 68.22 58.39 58.55 60.40 57.10 58.64 58.93 61.30 60.75 58.67 60.67 59.46 65.49 60.96 57.36 59.83 57.40 62.96 67.02 65.93 63.55 61.99 64.48 62.61 x 7.58 7.68 7.78 7.88 7.98 8.08 8.18 8.28 8.38 8.48 8.59 8.69 8.79 8.89 8.99 9.09 9.19 9.29 9.39 9.49 9.60 9.70 9.80 9.90 10.00 y 50.91 65.51 61.32 71.37 76.97 56.72 67.90 65.30 61.62 68.68 69.43 64.82 63.81 59.27 62.23 64.78 64.88 72.30 65.18 78.35 64.62 76.85 68.57 61.29 71.46 Table 20.2: The test scores from n = 100 participants for a logical reasoning section (x) and a mathematics section (y). STAT 501 D. S. Young Chapter 21 Nonlinear Regression All of the models we have discussed thus far have been linear in the parameters (i.e., linear in the beta terms). For example, polynomial regression was used to model curvature in our data by using higher-ordered values of the predictors. However, the final regression model was just a linear combination of higher-ordered predictors. Now we are interested in studying the nonlinear regression model: Y = f (X, β) + , where X is a vector of p predictors, β is a vector of k parameters, f (·) is some known regression function, and is an error term whose distribution mayor may not be normal. Notice that we no longer necessarily have the dimension of the parameter vector simply one greater than the number of predictors. Some examples of nonlinear regression models are: eβ0 +β1 xi + i 1 + eβ0 +β1 xi β0 + β1 xi yi = + i 1 + β2 eβ3 xi yi = β0 + (0.4 − β0 )e−β1 (xi −5) + i . 
However, there are some nonlinear models which are actually called intrinsically linear because they can be made linear in the parameters by a simple transformation. For example, the model

Y = β0 X / (β1 + X)

can be rewritten as

1/Y = 1/β0 + (β1/β0)(1/X) = θ0 + θ1 (1/X),

which is linear in the transformed parameters θ0 = 1/β0 and θ1 = β1/β0. In such cases, transforming a model to its linear form often provides better inference procedures and confidence intervals, but one must be cognizant of the effects that the transformation has on the distribution of the errors.

We will discuss some of the basics of fitting and inference with nonlinear regression models. There is a great deal of theory, practice, and computing associated with nonlinear regression, and we will only scratch the surface of this topic. We will then turn to a few specific regression models and discuss generalized linear models.

21.1 Nonlinear Least Squares

We initially consider the setting

Yi = f(Xi, β) + εi,

where the εi are iid normal with mean 0 and constant variance σ². For this setting, we can rely on some of the least squares theory we have developed over the course. For other, nonnormal error terms, different techniques need to be employed. First, let

Q = Σ_{i=1}^{n} (yi − f(Xi, β))².

In order to find

β̂ = argmin_β Q,

we first find each of the partial derivatives of Q with respect to βj:

∂Q/∂βj = −2 Σ_{i=1}^{n} [yi − f(Xi, β)] ∂f(Xi, β)/∂βj.

Then, we set each of the above partial derivatives equal to 0 and replace the parameters βk by β̂k. This yields the estimating equations

Σ_{i=1}^{n} yi [∂f(Xi, β)/∂βk]|_{β=β̂} − Σ_{i=1}^{n} f(Xi, β̂) [∂f(Xi, β)/∂βk]|_{β=β̂} = 0,

for k = 0, 1, . . . , p − 1. The solutions to these equations are nonlinear in the parameter estimates β̂k and are often difficult to obtain, even in the simplest cases. Hence, iterative numerical methods are often employed. Even more difficulty arises in that multiple solutions may be possible!

21.1.1 A Few Algorithms

We will discuss a few incarnations of methods used in nonlinear least squares estimation. It should be noted that this is NOT an exhaustive list of algorithms, but rather an introduction to some of the more commonly implemented algorithms. First let us introduce some notation used in these algorithms:

• Since these numerical algorithms are iterative, let β̂(t) be the estimated value of β at iteration t. When t = 0, this symbolizes a user-specified starting value for the algorithm.

• Let ε = (ε1, . . . , εn)T, where εi = yi − f(Xi, β), be the n-dimensional vector of the error terms, and let e again be the residual vector.

• Let ∇Q(β) = ∂Q(β)/∂β = (∂‖ε‖²/∂β1, . . . , ∂‖ε‖²/∂βk)T be the gradient of the sum of squared errors, where Q(β) = ‖ε‖² = Σ_{i=1}^{n} εi² is the sum of squared errors.

• Let J(β) be the n × k Jacobian matrix whose (i, j)th entry is ∂εi/∂βj.

• Let H(β) = ∂²Q(β)/∂βT∂β be the k × k Hessian matrix whose (i, j)th entry is ∂²‖ε‖²/∂βi∂βj (i.e., the matrix of second-order and mixed partial derivatives).
• In the following algorithms, we will use the notation established above (t) (t) (t) for ∇Q(β̂ ) = ∇Q(β)|β=β̂(t) , H(β̂ ) = H(β)|β = βˆ(t) , and J(β̂ ) = J(β)|β=β̂(t) The classical method based on the gradient approach is Newton’s method, (0) which starts at β̂ and iteratively calculates β̂ (t+1) = β̂ (t) (t) (t) − [H(β̂ )]−1 ∇Q(β̂ ) until a convergence criterion is achieved. The difficulty in this approach is that inversion of the Hessian matrix can be computation ally difficult. In particular, the Hessian is not always positive definite unless the algorithm is initialized with a good starting value, which may be difficult to find. A modification to Newton’s method is the Gauss-Newton algorithm, which, unlike Newton’s method, can only be used to minimize a sum of squares function. The advantage with using the Gauss-Newton algorithm is that it no longer requires calculation of the Hessian matrix, but rather approximates it using the Jacobian. The gradient and approximation to the Hessian matrix can be written as ∇Q(β) = 2J(β)T and H(β) ≈ 2J(β)T J(β). STAT 501 D. S. Young CHAPTER 21. NONLINEAR REGRESSION 315 Thus, the iterative approximation based on the Gauss-Newton method yields β̂ (t+1) = β̂ = β̂ (t) (t) (t) − δ(β̂ ) (t) (t) (t) − [J(β̂ )T J(β̂ )]−1 J(β̂ )T e, (t) (t) where we have defined δ(β̂ ) to be everything that is subtracted from β̂ . Convergence is not always guaranteed with the Gauss-Newton algorithm. Since the steps for this method may be too large (thus leading to divergence), one can incorporate a partial step by using β̂ (t+1) = β̂ (t) (t) − αδ(β̂ ) such that 0 < α < 1. However, if α is close to 0, an alternative method is the Levenberg-Marquardt method, which calculates (t) (t) (t) (t) δ(β̂ ) = (J(β̂ )T J(β̂ ) + λD)−1 J(β̂ )T e, where D is a positive diagonal matrix (often taken as the identity matrix) and λ is the so-called Marquardt parameter. The above is optimized for λ which limits the length of the step taken at each iteration and improves an ill-conditioned Hessian matrix. For these algorithms, you will want to try the easiest one to calculate for a given nonlinear problem. Ideally, you would like to be able to use the algorithms in the order they were presented. Newton’s method will give you an accurate estimate if the Hessian is not ill-conditioned. The GaussNewton will give you a good approximation to the solution Newton’s method should have arrived at, but convergence is not always guaranteed. Finally, the Levenberg-Marquardt method can take care of computational difficulties arising with the other methods, but searching for λ can be tedious. 21.2 Exponential Regression One simple nonlinear model is the exponential regression model yi = β0 + β1 eβ2 xi,1 +...+βp+1 xi,1 + i , where the i are iid normal with mean 0 and constant variance σ 2 . Notice that if β0 = 0, then the above is intrinsically linear by taking the natural logarithm of both sides. D. S. Young STAT 501 316 CHAPTER 21. NONLINEAR REGRESSION Exponential regression is probably one of the simplest nonlinear regression models. An example where an exponential regression is often utilized is when relating the concentration of a substance (the response) to elapsed time (the predictor). 21.3 Logistic Regression Logistic regression models a relationship between predictor variables and a categorical response variable. 
For example, you could use logistic regression to model the relationship between various measurements of a manufactured specimen (such as dimensions and chemical composition) to predict if a crack greater than 10 mils will occur (a binary variable: either yes or no). Logistic regression helps us estimate a probability of falling into a certain level of the categorical response given a set of predictors. You can choose from three types of logistic regression, depending on the nature of your categorical response variable: Binary Logistic Regression: Used when the response is binary (i.e., it has two possible outcomes). The cracking example given above would utilize binary logistic regression. Other examples of binary responses could include passing or failing a test, responding yes or no on a survey, and having high or low blood pressure. Nominal Logistic Regression: Used when there are three or more categories with no natural ordering to the levels. Examples of nominal responses could include departments at a business (e.g., marketing, sales, HR), type of search engine used (e.g., Google, Yahoo!, MSN), and color (black, red, blue, orange). Ordinal Logistic Regression: Used when there are three or more categories with a natural ordering to the levels, but the ranking of the levels do not necessarily mean the intervals between them are equal. Examples of ordinal responses could be how you rate the effectiveness of a college course on a scale of 1-5, levels of flavors for hot wings, and medical condition (e.g., good, stable, serious, critical). The problems with logistic regression include nonnormal error terms, nonconstant error variance, and constraints on the response function (i.e., the STAT 501 D. S. Young CHAPTER 21. NONLINEAR REGRESSION 317 response is bounded between 0 and 1). We will investigate ways of dealing with these in the logistic regression setting. 21.3.1 Binary Logistic Regression The multiple binary logistic regression model is the following: eβ0 +β1 X1 +...+βp−1 Xp−1 1 + eβ0 +β1 X1 +...+βp−1 Xp−1 T eX β = T , 1 + eX β π= (21.1) where here π denotes a probability and not the irrational number. • π is the probability that an observation is in a specified category of the binary Y variable. • Notice that the model describes the probability of an event happening as a function of X variables. For instance, it might provide estimates of the probability that an older person has heart disease. • With the logistic model, estimates of π from equations (21.1) will always be between 0 and 1. The reasons are: – The numerator eβ0 +β1 X1 +...+βp−1 Xp−1 must be positive, because it is a power of a positive value (e). – The denominator of the model is (1+numerator), so the answer will always be less than 1. • With one X variable, the theoretical model for π has an elongated “S” shape (or sigmoidal shape) with asymptotes at 0 and 1, although in sample estimates we may not see this “S” shape if the range of the X variable is limited. There are algebraically equivalent ways to write the logistic regression model in equation (21.1): • First is D. S. Young π = eβ0 +β1 X1 +...+βp−1 Xp−1 , 1−π (21.2) STAT 501 318 CHAPTER 21. NONLINEAR REGRESSION which is an equation that describes the odds of being in the current category of interest. By definition, the odds for an event is P/(1 − P ) such that P is the probability of the event. For example, if you are at the racetrack and there is a 80% chance that a certain horse will win the race, then his odds are .80/(1-.80)=4, or 4:1. 
• Second is π log 1−π = β0 + β1 X1 + . . . + βp−1 Xp−1 , (21.3) which states that the logarithm of the odds is a linear function of the X variables (and is often called the log odds). In order to discuss goodness-of-fit measures and residual diagnostics for binary logistic regression, it is necessary to at least define the likelihood (see Appendix C for a further discussion). For a sample of size n, the likelihood for a binary logistic regression is given by: L(β; y, X) = n Y πiyi (1 − πi )1−yi i=1 y i 1−yi T n Y 1 eXi β . = T T 1 + eXi β 1 + eXi β i=1 This yields the log likelihood: `(β) = n X T T [yi eXi β − log(1 + eXi β )]. i=1 Maximizing the likelihood (or log likelihood) has no closed-form solution, so a technique like iteratively reweighted least squares is used to find an estimate of the regression coefficients, β̂. Once this value of β̂ has been obtained, we may proceed to define some various goodness-of-fit measures and calculated residuals. For the residuals we present, they serve the same purpose as in linear regression. When plotted versus the response, they will help identify suspect data points. It should also be noted that the following is by no way an exhaustive list of diagnostic procedures, but rather some of the more common methods which are used. STAT 501 D. S. Young CHAPTER 21. NONLINEAR REGRESSION 319 Odds Ratio The odds ratio (which we will write as θ) determines the relationship between a predictor and response and is available only when the logit link is used. The odds ratio can be any nonnegative number. An odds ratio of 1 serves as the baseline for comparison and indicates there is no association between the response and predictor. If the odds ratio is greater than 1, then the odds of success are higher for the reference level of the factor (or for higher levels of a continuous predictor). If the odds ratio is less than 1, then the odds of success are less for the reference level of the factor (or for higher levels of a continuous predictor). Values farther from 1 represent stronger degrees of association. For binary logistic regression, the odds of success are: π T = eX β . 1−π This exponential relationship provides an interpretation for β. The odds increase multiplicatively by eβj for every one-unit increase in Xj . More formally, the odds ratio between two sets of predictors (say X(1) and X(2) ) is given by (π/(1 − π))|X=X(1) θ= . (π/(1 − π))|X=X(2) Wald Test The Wald test is the test of significance for regression coefficients in logistic regression (recall that we use t-tests in linear regression). For maximum likelihood estimates, the ratio Z= β̂i s.e.(β̂i ) can be used to test H0 : βi = 0. The standard normal curve is used to determine the p-value of the test. Furthermore, confidence intervals can be constructed as β̂i ± z1−α/2 s.e.(β̂i ). Raw Residual The raw residual is the difference between the actual response and the estimated probability from the model. The formula for the raw residual is ri = yi − π̂i . D. S. Young STAT 501 320 CHAPTER 21. NONLINEAR REGRESSION Pearson Residual The Pearson residual corrects for the unequal variance in the raw residuals by dividing by the standard deviation. The formula for the Pearson residuals is ri . pi = p π̂i (1 − π̂i ) Deviance Residuals Deviance residuals are also popular because the sum of squares of these residuals is the deviance statistic. The formula for the deviance residual is s yi 1 − yi di = ± 2 yi log + (1 − yi ) log . 
π̂i 1 − π̂i Hat Values The hat matrix serves a similar purpose as in the case of linear regression to measure the influence of each observation on the overall fit of the model but the interpretation is not as clear due to its more complicated form. The hat values are given by T hi,i = π̂i (1 − π̂i )xT i (X WX)xi , where W is an n × n diagonal matrix with the values of π̂i (1 − π̂i ) for i = 1, . . . , n on the diagonal. As before, a hat value is large if hi,i > 2p/n. Studentized Residuals We can also report Studentized versions of some of the earlier residuals. The Studentized Pearson residuals are given by spi = p pi 1 − hi,i and the Studentized deviance residuals are given by sdi = p di . 1 − hi,i C and C̄ C and C̄ are extensions of Cook’s distance for logistic regression. C̄ measures STAT 501 D. S. Young CHAPTER 21. NONLINEAR REGRESSION 321 the overall change in fitted log its due to deleting the ith observation for all points excluding the one deleted while C includes the deleted point. They are defined by: p2i hi,i Ci = (1 − hi,i )2 and C̄i = p2i hi,i . (1 − hi,i ) Goodness-of-Fit Tests Overall performance of the fitted model can be measured by two different chi-square tests. There is the Pearson chi-square statistic P = n X p2i i=1 and the deviance statistic G= n X d2i . i=1 Both of these statistics are approximately chi-square distributed with n − p degrees of freedom. When a test is rejected, there is a statistically significant lack of fit. Otherwise, there is no evidence of lack of fit. These goodness-of-fit tests are analogous to the F -test in the analysis of variance table for ordinary regression. The null hypothesis is H0 : β1 = β2 = . . . = βk−1 = 0. A significant p-value means that at least one of the X variables is a predictor of the probabilities of interest. In general, one can also use the likelihood ratio test for testing the null hypothesis that any subset of the β’s is equal to 0. Suppose we test that r < p of the β’s are equal to 0. Then the likelihood ratio test statistic is given by: (0) Λ∗ = −2(`(β̂ ) − `(β̂)), (0) where `(β̂ ) is the log likelihood of the model specified by the null hypothesis evaluated at the maximum likelihood estimate of that reduced model. This test statistic has a χ2 distribution with p − r degrees of freedom. D. S. Young STAT 501 322 CHAPTER 21. NONLINEAR REGRESSION One additional test is Brown’s test, which has a test statistic to judge the fit of the logistic model to the data. The formula for the general alternative with two degrees of freedom is: T = sT C −1 s, where sT = (s1 , s2 ) and C is the covariance matrix of s. The formulas for s1 and s2 are: s1 = n X (yi − π̂i )(1 + log(π̂i ) ) 1 − π̂i (yi − π̂i )(1 + log(1 − π̂i ) ). π̂i i=1 s2 = n X i=1 The formula for the symmetric alternative with 1 degree of freedom is: (s1 + s2 )2 . Var(s1 + s2 ) To interpret the test, if the p-value is less than your accepted significance level, then reject the null hypothesis that the model fits the data adequately. DFDEV and DFCHI DFDEV and DFCHI are statistics that measure the change in deviance and in Pearson’s chi-square, respectively, that occurs when an observation is deleted from the data set. Large values of these statistics indicate observations that have not been fitted well. The formulas for these statistics are DFDEVi = d2i + C̄i and DFCHILi = C̄i . hi,i RA2 The calculation of R2 used in linear regression does not extend directly to logistic regression. 
The version of R2 used in logistic regression is defined as R2 = STAT 501 `(β̂) − `(βˆ0 ) , `(βˆ0 ) − `S (β) D. S. Young CHAPTER 21. NONLINEAR REGRESSION 323 where `(βˆ0 ) is the log likelihood of the model when only the intercept is included and `S (β) is the log likelihood of the saturated model (i.e., where a model is fit perfectly to the data). This R2 does go from 0 to 1 with 1 being a perfect fit. 21.3.2 Nominal Logistic Regression In binomial logistic regression, we only had two possible outcomes. For nominal logistic regression, we will consider the possibility of having k possible outcomes. When k > 2, such responses are known as polytomous.1 The multiple nominal logistic regression model (sometimes called the multinomial logistic regression model) is given by the following: T eX βj 1+Pkj=2 eXT βj , j = 2, . . . , k; (21.4) πj = 1 Pk XT β , j = 1, j 1+ j=2 e where again πj denotes a probability and not the irrational number. Notice that k − 1 of the groups have their own set of β values. Furthermore, since P k j=1 πj = 1, we set the β values for group 1 to be 0 (this is what we call the reference group). Notice that when k = 2, we are back to binary logistic regression. πj is the probability that an observation is in one of k categories. The likelihood for the nominal logistic regression model is given by: L(β; y, X) = n Y k Y y πi,ji,j (1 − πi,j )1−yi,j , i=1 j=1 where the subscript (i, j) means the ith observation belongs to the j th group. This yields the log likelihood: `(β) = n X k X yi,j πi,j . i=1 j=1 Maximizing the likelihood (or log likelihood) has no closed-form solution, so a technique like iteratively reweighted least squares is used to find an estimate of the regression coefficients, β̂. 1 The word polychotomous is sometimes used, but note that this is not actually a word! D. S. Young STAT 501 324 CHAPTER 21. NONLINEAR REGRESSION An odds ratio (θ) of 1 serves as the baseline for comparison. If θ = 1, then there is no association between the response and predictor. If θ > 1, then the odds of success are higher for the reference level of the factor (or for higher levels of a continuous predictor). If θ < 1, then the odds of success are less for the reference level of the factor (or for higher levels of a continuous predictor). Values farther from 1 represent stronger degrees of association. For nominal logistic regression, the odds of success (at two different levels of the predictors, say X(1) and X(2) ) are: θ= (πj /π1 )|X=X(1) . (πj /π1 )|X=X(2) Many of the procedures discussed in binary logistic regression can be extended to nominal logistic regression with the appropriate modifications. 21.3.3 Ordinal Logistic Regression For ordinal logistic regression, we again consider k possible outcomes as in nominal logistic regression, except that the order matters. The multiple ordinal logistic regression model is the following: ∗ k X πj = j=1 eβ0,k∗ +X T β 1 + eβ0,k∗ +X T β , (21.5) such that k ∗ ≤ k, π1 ≤ π2 , ≤ . . . , ≤ πk , and again πj denotes a probability. Notice that this model is a cumulative sum of probabilities which involves just changing the intercept of the linear regression portion (so β is now (p − 1)dimensional and X is n × (p − 1) such that Pfirst column of this matrix is not a column of 1’s). Also, it still holds that kj=1 πj = 1. πj is still the probability that an observation is in one of k categories, but we are constrained by the model written in equation (21.5). 
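In practice, nominal and ordinal logistic regression models of this form can be fit with the nnet and MASS packages in R. The sketch below is only illustrative; the data frame dat, the factor response grp, and the predictors x1 and x2 are hypothetical.

##########
library(nnet)   # multinom() fits the nominal (multinomial) logit model
library(MASS)   # polr() fits the ordinal (cumulative/proportional odds) model

# Nominal response: the first factor level serves as the reference group.
nom.fit <- multinom(grp ~ x1 + x2, data = dat)
summary(nom.fit)

# Ordinal response: the levels must be ordered before fitting.
dat$grp <- ordered(dat$grp)
ord.fit <- polr(grp ~ x1 + x2, data = dat, method = "logistic")
summary(ord.fit)

# Estimated odds ratios for the ordinal fit
exp(coef(ord.fit))
##########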
The likelihood for the ordinal logistic regression model is given by: L(β; y, X) = k n Y Y y πi,ji,j (1 − πi,j )1−yi,j , i=1 j=1 where the subscript (i, j) means the ith observation belongs to the j th group. STAT 501 D. S. Young CHAPTER 21. NONLINEAR REGRESSION 325 This yields the log likelihood: `(β) = n X k X yi,j πi,j . i=1 j=1 Notice that this is identical to the nominal logistic regression likelihood. Thus, maximization again has no closed-form solution, so we defer to a procedure like iteratively reweighted least squares. For ordinal logistic regression, a proportional odds model is used to determine the odds ratio. Again, an odds ratio (θ) of 1 serves as the baseline for comparison between the two predictor levels, say X(1) and X(2) . Only one parameter and one odds ratio is calculated for each predictor. Suppose we are interested in calculating the odds of X(1) to X(2) . If θ = 1, then there is no association between the response and these two predictors. If θ > 1, then the odds of success are higher for the predictor X(1) . If θ < 1, then the odds of success are less for the predictor X(1) . Values farther from 1 represent stronger degrees of association. For ordinal logistic regression, the odds ratio utilizes cumulative probabilities and their complements and is given by: Pk∗ j=1 πj )|X=X(1) j=1 πj |X=X(1) /(1 − . Pk∗ Pk∗ j=1 πj )|X=X(2) j=1 πj |X=X(2) /(1 Pk∗ θ= 21.4 Poisson Regression The Poisson distribution for a random variable X has the following probability mass function for a given value X = x: P(X = x|λ) = e−λ λx , x! for x = 0, 1, 2, . . .. Notice that the Poisson distribution is characterized by the single parameter λ, which is the mean rate of occurrence for the even being measured. For the Poisson distribution, it is assumed that large counts (with respect to the value of λ) are rare. Poisson regression is similar to logistic regression in that the dependent variable (Y ) is a categorical response. Specifically, Y is an observed count that follows the Poisson distribution, but the rate λ is now determined by D. S. Young STAT 501 326 CHAPTER 21. NONLINEAR REGRESSION a set of p predictors X = (X1 , . . . , Xp )T . The expression relating these quantities is λ = exp{XT β}. Thus, the fundamental Poisson regression model for observation i is given by T yi e− exp{Xi β} exp{XT i β} . P(Yi = yi |Xi , β) = yi ! That is, for a given set of predictors, the categorical outcome follows a Poisson distribution with rate exp{XT β}. In order to discuss goodness-of-fit measures and residual diagnostics for Poisson regression, it is necessary to at least define the likelihood. For a sample of size n, the likelihood for a Poisson regression is given by: L(β; y, X) = T n Y e− exp{Xi β} exp{XT β}yi i yi ! i=1 . This yields the log likelihood: `(β) = n X yi XT i β− n X exp{XT i β} − log(yi !). i=1 i=1 i=1 n X Maximizing the likelihood (or log likelihood) has no closed-form solution, so a technique like iteratively reweighted least squares is used to find an estimate of the regression coefficients, β̂. Once this value of β̂ has been obtained, we may proceed to define some various goodness-of-fit measures and calculated residuals. For the residuals we present, they serve the same purpose as in linear regression. When plotted versus the response, they will help identify suspect data points. Goodness-of-Fit Overall performance of the fitted model can be measured by two different chi-square tests. 
There is the Pearson statistic P = n X (yi − exp{XT β̂})2 i T exp{Xi β̂} i=1 and the deviance statistic n X G= yi log i=1 STAT 501 yi exp{XT i β̂} − . (yi exp{XT i β̂}) D. S. Young CHAPTER 21. NONLINEAR REGRESSION 327 Both of these statistics are approximately chi-square distributed with n − p degrees of freedom. When a test is rejected, there is a statistically significant lack of fit. Otherwise, there is no evidence of lack of fit. Overdispersion means that the actual covariance matrix for the observed data exceeds that for the specified model for Y |X. For a Poisson distribution, the mean and the variance are equal. In practice, the data almost never reflects this fact. So we have overdispersion in the Poisson regression model since the variance is oftentimes greater than the mean. In addition to testing goodness-of-fit, the Pearson statistic can also be used as a test of overdispersion. Note that overdispersion can also be measured in the logistic regression models that were discussed earlier. Deviance Recall the measure of deviance introduced in the study of survival regressions and logistic regression. The measure of deviance for the Poisson regression setting is given by D(y, β̂) = 2`S (β) − `(β̂), where `S (β) is the log likelihood of the saturated model (i.e., where a model is fit perfectly to the data). This measure of deviance (which differs from the deviance statistic defined earlier) is a generalization of the sum of squares from linear regression. The deviance also has an approximate chi-square distribution. Pseudo R2 The value of R2 used in linear regression also does not extend to Poisson regression. One commonly used measure is the pseudo R2 , defined as R2 = `(β̂) − `(βˆ0 ) , `S (β) − `(βˆ0 ) where `(βˆ0 ) is the log likelihood of the model when only the intercept is included. The pseudo R2 goes from 0 to 1 with 1 being a perfect fit. Raw Residual The raw residual is the difference between the actual response and the estimated value from the model. Remember that the variance is equal to the mean for a Poisson random variable. Therefore, we expect that the variances D. S. Young STAT 501 328 CHAPTER 21. NONLINEAR REGRESSION of the residuals are unequal. This can lead to difficulties in the interpretation of the raw residuals, yet it is still used. The formula for the raw residual is ri = yi − exp{XT i β}. Pearson Residual The Pearson residual corrects for the unequal variance in the raw residuals by dividing by the standard deviation. The formula for the Pearson residuals is ri , pi = q φ̂ exp{XT β} i where n 2 1 X (yi − exp{XT i β̂}) . φ̂ = n − p i=1 exp{XT i β̂} φ̂ is a dispersion parameter to help control overdispersion. Deviance Residuals Deviance residuals are also popular because the sum of squares of these residuals is the deviance statistic. The formula for the deviance residual is s yi T T di = sgn(yi − exp{Xi β̂}) 2 yi log − (yi − exp{Xi β̂}) . exp{XT i β̂} Hat Values The hat matrix serves the same purpose as in the case of linear regression to measure the influence of each observation on the overall fit of the model. The hat values, hi,i , are the diagonal entries of the Hat matrix H = W1/2 X(XT WX)−1 XT W1/2 , where W is an n × n diagonal matrix with the values of exp{XT i β̂} on the diagonal. As before, a hat value is large if hi,i > 2p/n. Studentized Residuals Finally, we can also report Studentized versions of some of the earlier residuals. The Studentized Pearson residuals are given by pi spi = p 1 − hi,i STAT 501 D. S. Young CHAPTER 21. 
NONLINEAR REGRESSION 329 and the Studentized deviance residuals are given by di . 1 − hi,i sdi = p 21.5 Generalized Linear Models All of the regression models we have considered (both linear and nonlinear) actually belong to a family of models called generalized linear models. Generalized linear models provides a generalization of ordinary least squares regression that relates the random term (the response Y ) to the systematic term (the linear predictor XT β) via a link function (denoted by g(·)). Specifically, we have the relation E(Y ) = µ = g −1 (XT β), so g(µ) = XT β. Some common link functions are: • The identity link: g(µ) = µ = XT β, which is used in traditional linear regression. • The logit link: µ g(µ) = log 1−µ = XT β T eX β ⇒µ= T , 1 + eX β which is used in logistic regression. • The log link: g(µ) = log(µ) = XT β ⇒ µ = eX T β , which is used in Poisson regression. D. S. Young STAT 501 330 CHAPTER 21. NONLINEAR REGRESSION • The probit link: g(µ) = Φ−1 (µ) = XT β ⇒ µ = Φ(XT β), where Φ(·) is the cumulative distribution function of the standard normal distribution. This link function is also sometimes called the normit link. This also can be used in logistic regression. • The complementary log-log link: g(µ) = log(− log(1 − µ)) = XT β ⇒ µ = 1 − exp{−eX T β }, which can also be used in logistic regression. This link function is also sometimes called the gompit link. • The power link: g(µ) = µλ = XT β ⇒ µ = (XT β)1/λ , where λ 6= 0. This is used in other regressions which we do not explore (such as gamma regression and inverse Gaussian regression). Also, the variance is typically a function of the mean and is often written as Var(Y ) = V (µ) = V (g −1 (XT β)). The random variable Y is assumed to belong to an exponential family distribution where the density can be expressed in the form yθ − b(θ) q(y; θ, φ) = exp + c(y, φ) , a(φ) where a(·), b(·), and c(·) are specified functions, θ is a parameter related to the mean of the distribution, and φ is called the dispersion parameter. Many probability distributions belong to the exponential family. For example, the normal distribution is used for traditional linear regression, the binomial distribution is used for logistic regression, and the Poisson distribution is used for Poisson regression. Other exponential family distributions STAT 501 D. S. Young CHAPTER 21. NONLINEAR REGRESSION 331 lead to gamma regression, inverse Gaussian (normal) regression, and negative binomial regression, just to name a few. The unknown parameters, β, are typically estimated with maximum likelihood techniques (in particular, using iteratively reweighted least squares), Bayesian methods (which we will touch on in the advanced topics section), or quasi-likelihood methods. The quasi-likelihood is a function which possesses similar properties to the log-likelihood function and is most often used with count or binary data. Specifically, for a realization y of the random variable Y , it is defined as Z µ y−t dt, Q(µ; y) = 2 y σ V (t) where σ 2 is a scale parameter. There are also tests using likelihood ratio statistics for model development to determine if any predictors may be dropped from the model. 
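To make the connection to software concrete, these link functions correspond directly to the family and link arguments of R's glm() function. The data frame dat and the variables y, x1, and x2 below are hypothetical; the family and link names themselves are standard.

##########
# Logistic regression: binomial family with the logit link
glm(y ~ x1 + x2, family = binomial(link = "logit"), data = dat)

# Probit and complementary log-log links for a binary response
glm(y ~ x1 + x2, family = binomial(link = "probit"), data = dat)
glm(y ~ x1 + x2, family = binomial(link = "cloglog"), data = dat)

# Poisson regression: Poisson family with the log link
glm(y ~ x1 + x2, family = poisson(link = "log"), data = dat)

# Traditional linear regression as a GLM: Gaussian family, identity link
glm(y ~ x1 + x2, family = gaussian(link = "identity"), data = dat)
##########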
21.6 Examples Example 1: Nonlinear Regression Example A simple model for population growth towards an asymptote is the logistic model β1 + i , yi = 1 + eβ2 +β3 xi where yi is the population size at time xi , β1 is the asymptote towards which the population grows, β2 reflects the size of the population at time x = 0 (relative to its asymptotic size), and β3 controls the growth rate of the population. We fit this model to Census population data for the United States (in millions) ranging from 1790 through 1990 (see Table 21.1). The data are graphed in Figure 21.1(a) and the line represents the fit of the logistic population growth model. To fit the logistic model to the U. S. Census data, we need starting values for the parameters. It is often important in nonlinear least squares estimation to choose reasonable starting values, which generally requires some insight into the structure of the model. We know that β1 represents asymptotic population. The data in Figure 21.1(a) show that in 1990 the U. S. population stood at about 250 million and did not appear to be close to an asymptote; D. S. Young STAT 501 332 CHAPTER 21. NONLINEAR REGRESSION year population 1790 3.929 1800 5.308 1810 7.240 1820 9.638 1830 12.866 1840 17.069 1850 23.192 1860 31.443 1870 39.818 1880 50.156 1890 62.948 1900 75.995 1910 91.972 1920 105.711 1930 122.775 1940 131.669 1950 150.697 1960 179.323 1970 203.302 1980 226.542 1990 248.710 Table 21.1: The U.S. Census data. so as not to extrapolate too far beyond the data, let us set the starting value of β1 to 350. It is convenient to scale time so that x1 = 0 in 1790, and so that the unit of time is 10 years. Then substituting β1 = 350 and x = 0 into the model, using the value y1 = 3.929 from the data, and assuming that the error is 0, we have 3.929 = STAT 501 350 . 1 + eβ2 +β3 (0) D. S. Young CHAPTER 21. NONLINEAR REGRESSION 333 Solving for β2 gives us a plausible start value for this parameter: 350 −1 3.929 350 β2 = log − 1 ≈ 4.5. 3.929 eβ2 = Finally, returning to the data, at time x = 1 (i.e., at the second Census performed in 1800), the population was y2 = 5.308. Using this value, along with the previously determined start values for β1 and β2 , and again setting the error to 0, we have 5.308 = 350 . 1 + e4.5+β3(1) Solving for β3 we get 350 −1 5.308 350 − 1 − 4.5 ≈ −0.3. β3 = log 5.308 e4.5+β3 = So now we have starting values for the nonlinear least squares algorithm that we use. Below is the output from running a Gauss-Newton algorithm for optimization. As you can see, the starting values resulted in convergence with values not too far from our guess. ########## Formula: population ~ beta1/(1 + exp(beta2 + beta3 * time)) Parameters: Estimate Std. Error t value Pr(>|t|) beta1 389.16551 30.81197 12.63 2.20e-10 *** beta2 3.99035 0.07032 56.74 < 2e-16 *** beta3 -0.22662 0.01086 -20.87 4.60e-14 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 4.45 on 18 degrees of freedom D. S. Young STAT 501 334 CHAPTER 21. NONLINEAR REGRESSION Residuals 250 Census Data ● ● 200 5 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −5 ● ● 50 ● ● 0 Residuals 150 ● ● ● 100 Population ● ● ● ● 0 ● ● ● ● 1800 ● ● ● ● ● 1850 1900 Year (a) 1950 1800 1850 1900 1950 Year (b) Figure 21.1: (a) Plot of the Census data with the logistic functional fit. (b) Plot of the residuals versus the year. Number of iterations to convergence: 6 Achieved convergence tolerance: 1.492e-06 ########## Figure 21.1(b) is a plot of the residuals versus the year. 
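A fit like the one above can be produced with R's nls() function, whose default algorithm is Gauss-Newton. The sketch below assumes the Census data of Table 21.1 are stored in a (hypothetical) data frame named census with columns year and population; the starting values are the ones derived above.

##########
# Scale time so that 1790 corresponds to 0 and one unit is a decade.
census$time <- (census$year - 1790) / 10

# Logistic population growth model fit by Gauss-Newton (the nls() default),
# using the starting values beta1 = 350, beta2 = 4.5, beta3 = -0.3.
pop.nls <- nls(population ~ beta1 / (1 + exp(beta2 + beta3 * time)),
               data = census,
               start = list(beta1 = 350, beta2 = 4.5, beta3 = -0.3))
summary(pop.nls)

# Residuals versus year, as in Figure 21.1(b)
plot(census$year, resid(pop.nls), xlab = "Year", ylab = "Residuals")
##########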
As you can see, the logistic functional form that we chose did catch the gross characteristics of this data, but some of the nuances appear to not be as well characterized. Since there are indications of some cyclical behavior, a model incorporating correlated errors or, perhaps, trigonometric functions could be investigated. Example 2: Binary Logistic Regression Example We will first perform a binary logistic regression analysis. The data set we will use is data published on n = 27 leukemia patients. The data (found in Table 21.2) has a response variable of whether leukemia remission occurred (REMISS), which is given by a 1. The independent variables are cellularity of the marrow clot section (CELL), smear differential percentage of blasts (SMEAR), percentage of absolute marrow leukemia cell infiltrate (INFIL), percentage labeling index of the bone marrow leukemia cells (LI), absolute number of blasts in the peripheral blood (BLAST), and the highest temperature prior to start of treatment (TEMP). STAT 501 D. S. Young CHAPTER 21. NONLINEAR REGRESSION 335 The following gives the estimated logistic regression equation and associated significance tests. The reference group of remission is 1 for this data. ########## Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 64.25808 74.96480 0.857 0.391 cell 30.83006 52.13520 0.591 0.554 smear 24.68632 61.52601 0.401 0.688 infil -24.97447 65.28088 -0.383 0.702 li 4.36045 2.65798 1.641 0.101 blast -0.01153 2.26634 -0.005 0.996 temp -100.17340 77.75289 -1.288 0.198 (Dispersion parameter for binomial family taken to be 1) Null deviance: 34.372 Residual deviance: 21.594 AIC: 35.594 on 26 on 20 degrees of freedom degrees of freedom Number of Fisher Scoring iterations: 8 ########## As you can see, the index of the bone marrow leukemia cells appears to be closest to a significant predictor of remission occurring. After looking at various subsets of the data, it is found that a significant model is one which only includes the labeling index as a predictor. ########## Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.777 1.379 -2.740 0.00615 ** li 2.897 1.187 2.441 0.01464 * --Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 34.372 D. S. Young on 26 degrees of freedom STAT 501 336 CHAPTER 21. NONLINEAR REGRESSION Deviance Residuals Pearson Residuals ● ● ● ● ● ● 0.5 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 ● 0.0 ● Pearson Residuals ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −0.5 ● −0.5 Deviance Residuals 0.5 ● ● ● ● ● 0 5 ● 10 15 20 25 0 Observation 5 10 15 20 25 Observation (a) (b) Figure 21.2: (a) Plot of the deviance residuals. (b) Plot of the Pearson residuals. Residual deviance: 26.073 AIC: 30.073 on 25 degrees of freedom Number of Fisher Scoring iterations: 4 Odds Ratio: 18.125 95% Confidence Interval: 1.770 185.562 ########## Notice that the odds ratio for LI is 18.12. It is calculated as e2.897 . The 95% confidence interval is calculated as e2.897±z0.975 ∗1.187 , where z0.975 = 1.960 is the 97.5th percentile from the standard normal distribution. The interpretation of the odds ratio is that for every increase of 1 unit in LI, the estimated odds of leukemia reoccurring are multiplied by 18.12. However, since the LI appears to fall between 0 and 2, it may make more sense to say that for every .1 unit increase in L1, the estimated odds of remission are multiplied by e2.897×0.1 = 1.337. So, assume that we have CELL=1.0 and TEMP=0.97. 
Then • At LI=0.8, the estimated odds of leukemia reoccurring is exp{−3.777+ 2.897 ∗ 0.8} = 0.232. STAT 501 D. S. Young CHAPTER 21. NONLINEAR REGRESSION 337 10 Simulated Poisson Data ● ● 8 ● ● Y 6 ● ● ● ● ● ● ● ● 4 ● ● 2 ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● 5 10 15 20 X Figure 21.3: Scatterplot of the simulated Poisson data set. • At LI=0.9, the estimated odds of leukemia reoccurring is exp{−3.777+ 2.897 ∗ 0.9} = 0.310. 0.232 • The odds ratio is θ = 0.310 , which is the ratio of the odds of death when LI=0.8 compared to the odds when L1=0.9. Notice that 0.232×1.337 = ˆ 0.310, which demonstrates the multiplicative effect by e−β2 on the odds ratio. Figure 21.2 also gives plots of the deviance residuals and the Pearson residuals. These plots seem to be okay. Example 3: Poisson Regression Example Table 21.3 consists of a simulated data set of size n = 30 such that the response (Y ) follows a Poisson distribution with rate λ = exp{0.50 + 0.07X}. A plot of the response versus the predictor is given in Figure 21.3. The following gives the analysis of the Poisson regression data: ########## Coefficients: Estimate Std. Error t value Pr(>|t|) D. S. Young STAT 501 338 CHAPTER 21. NONLINEAR REGRESSION Deviance Residuals Pearson Residuals 4 ● 4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 2 ● ● ● ● ● ● ● ● 0 Pearson Residuals 2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● 3 4 ● ● 5 ● ● ● ● 2 ● −2 −2 ● ● 1 ● ● ● ● 0 Deviance Residuals ● ● ● ● 6 7 1 2 3 4 Fitted Values Fitted Values (a) (b) 5 6 7 Figure 21.4: (a) Plot of the deviance residuals. (b) Plot of the Pearson residuals. (Intercept) 0.007217 0.989060 0.007 0.994 x 0.306982 0.066799 4.596 8.37e-05 *** --Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 (Dispersion parameter for gaussian family taken to be 3.977365) Null deviance: 195.37 Residual deviance: 111.37 AIC: 130.49 on 29 on 28 degrees of freedom degrees of freedom Number of Fisher Scoring iterations: 2 ########## As you can see, the predictor is highly significant. Finally, Figure 21.4 also provides plots of the deviance residuals and Pearson residuals versus the fitted values. These plots appear to be good for a Poisson fit. Further diagnostic plots can also be produced and model selection techniques can be employed when faced with multiple predictors. STAT 501 D. S. Young CHAPTER 21. NONLINEAR REGRESSION REMISS 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 1 1 0 CELL SMEAR 0.80 0.83 0.90 0.36 0.80 0.88 1.00 0.87 0.90 0.75 1.00 0.65 0.95 0.97 0.95 0.87 1.00 0.45 0.95 0.36 0.85 0.39 0.70 0.76 0.80 0.46 0.20 0.39 1.00 0.90 1.00 0.84 0.65 0.42 1.00 0.75 0.50 0.44 1.00 0.63 1.00 0.33 0.90 0.93 1.00 0.58 0.95 0.32 1.00 0.60 1.00 0.69 1.00 0.73 INFIL 0.66 0.32 0.70 0.87 0.68 0.65 0.92 0.83 0.45 0.34 0.33 0.53 0.37 0.08 0.90 0.84 0.27 0.75 0.22 0.63 0.33 0.84 0.58 0.30 0.60 0.69 0.73 LI BLAST 1.90 1.10 1.40 0.74 0.80 0.18 0.70 1.05 1.30 0.52 0.60 0.52 1.00 1.23 1.90 1.35 0.80 0.32 0.50 0.00 0.70 0.28 1.20 0.15 0.40 0.38 0.80 0.11 1.10 1.04 1.90 2.06 0.50 0.11 1.00 1.32 0.60 0.11 1.10 1.07 0.40 0.18 0.60 1.59 1.00 0.53 1.60 0.89 1.70 0.96 0.90 0.40 0.70 0.40 339 TEMP 1.00 0.99 0.98 0.99 0.98 0.98 0.99 1.02 1.00 1.04 0.99 0.98 1.01 0.99 0.99 1.02 1.01 1.00 0.99 0.99 1.01 1.02 1.00 0.99 0.99 0.99 0.99 Table 21.2: The leukemia data set. Descriptions of the variables are given in the text. D. S. Young STAT 501 340 CHAPTER 21. 
NONLINEAR REGRESSION i 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 xi yi i xi yi 2 0 16 16 7 15 6 17 13 6 19 4 18 6 2 14 1 19 16 5 16 5 20 19 5 15 2 21 24 6 9 2 22 9 2 17 10 23 12 5 10 3 24 7 1 23 10 25 9 3 14 2 26 7 3 14 6 27 15 3 9 5 28 21 4 5 2 29 20 6 17 2 30 20 9 Table 21.3: Simulated data for the Poisson regression example. STAT 501 D. S. Young Chapter 22 Multivariate Multiple Regression Up until now, we have only been concerned with univariate responses (i.e., the case where the response Y is simply a single value for each observation). However, sometimes you may have multiple responses measured for each observation, whether it be different characteristics or perhaps measurements taken over time. When our regression setting must accommodate multiple responses for a single observation, the technique is called multivariate regression. 22.1 The Model A multivariate multiple regression model is a multivariate linear model that describes how a vector of responses (or y-variables) relates to a set of predictors (or x-variables). For example, you may have a newly machined component which is divided into four sections (or sites). Various experimental predictors may be the temperature and amount of stress induced on the component. The responses may be the average length of the cracks that develop at each of the four sites. The general structure of a multivariate multiple regression model is as follows: • A set of p − 1 predictors, or independent variables, are measured for 341 342 CHAPTER 22. MULTIVARIATE MULTIPLE REGRESSION each of the i = 1, . . . , n observations: Xi,1 .. Xi = . . Xi,p−1 • A set of m responses, or dependent variables, are measured for each of the i = 1, . . . , n observations: Yi,1 Yi = ... . Yi,m • Each of the j = 1, . . . , m responses has it’s own regression model: Yi,j = β0,j + β1,j Xi,1 + β2,j Xi,2 + . . . + βp−1,j Xi,p−1 + i,j . Vectorizing the above model for a single observation yields: Yi = (1 XT i )B + i , where B= β1 β2 . . . βm = β0,1 β1,1 .. . β0,2 β1,2 .. . ... ... .. . β0,m β1,m .. . βp−1,1 βp−1,2 . . . βp−1,m and i,1 i = ... . i,m Notice that i is the vector of errors for the ith observation. STAT 501 D. S. Young CHAPTER 22. MULTIVARIATE MULTIPLE REGRESSION 343 • Finally, we may explicitly write down the multivariate multiple regression model: T Y1 .. Yn×m = . YT n XT 1 1 .. . . = . . 1 XT n β0,1 β1,1 .. . β0,2 β1,2 .. . ... ... ... β0,m β1,m .. . + βp−1,1 βp−1,2 . . . βp−1,m T 1 .. . T n = Xn×p Bp×m + εn×m . Or more compactly, without the dimensional subscripts, we will write: Y = XB + ε. 22.2 Estimation and Statistical Regions Least Squares Extending least squares theory from the multiple regression setting to the multivariate multiple regression setting is fairly intuitive. The biggest hurdle is dealing with the matrix calculations (which statistical packages perform for you anyhow). We can also formulate similar assumptions for the multivariate model. Let 1,j (j) = ... , n,j which is the vector of errors for the j th trial of all n observations. We assume that E((j) ) = 0 and Cov((i) , (k) ) = σi,k Im×m for each i, k = 1, . . . , n. Notice that the j th trial of the n observations have variance-covariance matrix Σ = {σi,k }, but observations from different entries of the vector are uncorrelated. The least squares estimate for B is simply given by: B̂ = (XT X)−1 XT Y. D. S. Young STAT 501 344 CHAPTER 22. MULTIVARIATE MULTIPLE REGRESSION Using B̂, we can calculate the predicted values as: Ŷ = XB̂ and the residuals as: ε̂ = Y − Ŷ. 
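These quantities are straightforward to obtain in R, since lm() accepts a matrix response and returns one column of estimated coefficients per response variable. The sketch below is illustrative; the responses Y1 and Y2 and the predictors x1 and x2 in the data frame dat are hypothetical.

##########
# Multivariate multiple regression: a matrix response via cbind().
mv.fit <- lm(cbind(Y1, Y2) ~ x1 + x2, data = dat)

coef(mv.fit)     # the p x m matrix of estimates, B-hat
fitted(mv.fit)   # Y-hat = X B-hat
resid(mv.fit)    # the n x m residual matrix
##########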
Furthermore, an estimate of Σ (which is the maximum likelihood estimate of Σ) is given by: 1 Σ̂ = ε̂T ε̂. n Hypothesis Testing Suppose we are interested in testing the hypothesis that our multivariate responses do not depend on the predictors Xi,q+1 , . . . , Xi,p−1 . We can partition B to consist of two matrices: one with the regression coefficients of the predictors we assume will remain in the model and one with the regression coefficients we wish to test. Similarly, we can partition X in a similar manner. Formally, the test is H0 : β (2) = 0, where B= β (1) β (2) and X= X1 X2 . Here X2 is an n × (p − q − 1) matrix of predictors corresponding to the null hypothesis and X1 is an n × (q) matrix of predictors we assume will remain in the model. Furthermore, β (2) and β (1) are (p − q − 1) × m and q × m matrices, respectively, for these predictor matrices. Under the null hypothesis, we can calculate −1 T β̂ (1) = (XT 1 X1 ) X1 Y and Σ̂1 = (Y − X1 β̂ (1) )T (Y − X1 β̂ (1) )/n. STAT 501 D. S. Young CHAPTER 22. MULTIVARIATE MULTIPLE REGRESSION 345 These values (which are maximum likelihood estimates under the null hypothesis) can be used to calculate one of four commonly used multivariate test statistics: Wilks’ Lambda = |nΣ̂| |n(Σ̂1 − Σ̂)| Pillai’s Trace = tr[(Σ̂1 − Σ̂)Σ̂−1 1 ] Hotelling-Lawley Trace = tr[(Σ̂1 − Σ̂)Σ̂−1 ] λ1 . Roy’s Greatest Root = 1 + λ1 In the above, λ1 is the largest nonzero eigenvalue of (Σ̂1 − Σ̂)Σ̂−1 . Also, the value |Σ| is the determinant of the variance-covariance matrix Σ and is called the generalized variance which assigns a single numerical value to express the overall variation of this multivariate problem. All of the above test statistics have approximate F −distributions with degrees of freedom which are more complicated to calculate than what we have seen. Most statistical packages will report at least one of the above if not all four. For large sample sizes, the associated p-values will likely be similar, but various situations (such as many large eigenvalues of (Σ̂1 Σ̂)Σ̂−1 or a relatively small sample size) will lead to a discrepancy between the results. In this case, it is usually accepted to report the Wilks’ lambda value as this is the likelihood ratio test. Confidence Regions One problem is to predict the mean responses corresponding to fixed values T xh of the predictors. Using various distributional results concerning B̂ xh and Σ̂, it can be shown that the 100 × (1 − α)% simultaneous confidence intervals for E(Yi |X = xh ) = xT h β̂ i are s m(n − p − 2) T Fm,n−p−1−m;1−α xh β̂ i ± n−p−1−m s n T T −1 × xh (X X) xh σ̂i,i , n−p−2 for i = 1, . . . , m. Here, β̂ i is the ith column of B̂ and σ̂i,i is the ith diagonal element of Σ̂. Also, notice that the simultaneous confidence intervals are D. S. Young STAT 501 346 CHAPTER 22. MULTIVARIATE MULTIPLE REGRESSION constructed for each of the m entries of the response vector, thus why they are considered “simultaneous”. Furthermore, the collection of these simultaneous T intervals yields what we call a 100 × (1 − α)% confidence region for B̂ xh . Prediction Regions Another problem is to predict new responses Yh = BT xh + εh . Again, skipping over a discussion on various distributional assumptions, it can be shown that the 100 × (1 − α)% simultaneous prediction intervals for the individual responses Yh,i are s xT h β̂ i ± m(n − p − 2) Fm,n−p−1−m;1−α n−p−1−m s × (1 + T −1 xT h (X X) xh ) n σ̂i,i , n−p−2 for i = 1, . . . , m. The quantities here are the same as those in the simultaneous confidence intervals. 
Furthermore, the collection of these simultaneous prediction intervals are called a 100 × (1 − α)% prediction region for yh . MANOVA The multivariate analysis of variance (MANOVA) table is similar to it’s univariate counterpart. The sum of squares values in a MANOVA are no longer scalar quantities, but rather matrices. Hence, the entries in the MANOVA table are called sum of squares and cross-products (SSCPs). These quantities are described in a little more detail below: • The cross-products for total is SSCPTO = Pn sum of squares and T i=1 (Yi − Ȳ)(Yi − Ȳ) , which is the sum of squared deviations from the overall mean vector of the Yi ’s. SSCPTO is a measure of the overall variation in the Y vectors. The corresponding total degrees of freedom are n − 1. • P The sum of squares and cross-products for the errors is SSCPE = n T i=1 (Yi − Ŷi )(Yi − Ŷi ) , which is the sum of squared observed errors (residuals) for the observed data vectors. SSE is a measure of the variation in Y that is not explained by the multivariate regression. The corresponding error degrees of freedom are n − p. STAT 501 D. S. Young CHAPTER 22. MULTIVARIATE MULTIPLE REGRESSION 347 • The sum of squares and cross-products due to the regression is SSCPR = SSCPTO − SSCPE, and it is a measure of the total variation in Y that can be explained by the regression with the predictors. The corresponding model degrees of freedom are p − 1. Formally, a MANOVA table is given in Table 22.1. Source Regression Error Total df p−1 n−p n−1 SSCP Pn (Ŷ − Ȳ)(Ŷi − Ȳ)T Pni=1 i (Yi − Ŷi )(Yi − Ŷi )T Pi=1 n T i=1 (Yi − Ȳ)(Yi − Ȳ) Table 22.1: MANOVA table for the multivariate multiple linear regression model. Notice in the MANOVA table that we do not define any mean square values or an F -statistic. Rather, a test of the significance of the multivariate multiple regression model is carried out using a Wilks’ lambda quantity similar to Pn T (Y − Ŷ )(Y − Ŷ ) i i i i i=1 , Λ∗ = P n (Yi − Ȳ)(Yi − Ȳ)T i=1 which will follow a χ2 distribution. However, depending on the number of variables and the number of trials, modified versions of this test statistic must be used, which will affect the degrees of freedom for the corresponding χ2 distribution. 22.3 Reduced Rank Regression Reduced rank regression is a way of constraining the multivariate linear regression model so that the rank of the regression coefficient matrix has less than full rank. The objective in reduced rank regression is to minimize the sum of squared residual subject to a reduced rank condition. Without the rank condition, the estimation problem is an ordinary least squares problem. Reduced-rank regression is important in that it contains as special cases D. S. Young STAT 501 348 CHAPTER 22. MULTIVARIATE MULTIPLE REGRESSION the classical statistical techniques of principal component analysis, canonical variate and correlation analysis, linear discriminant analysis, exploratory factor analysis, multiple correspondence analysis, and other linear methods of analyzing multivariate data. It is also heavily utilized in neural network modeling and econometrics. Recall that the multivariate regression model is Y = XB + ε, where Y is an n × m matrix, X is an n × p matrix, and B is a p × m matrix of regression parameters. A reduced rank regression occurs when we have the rank constraint rank(B) = t ≤ min(p, m), with equality yielding the traditional least squares setting. When the rank condition above holds, then there exists two non-unique full rank matrices Ap×t and Ct×m , such that B = AC. 
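Before developing the reduced rank formulation further, note that the multivariate test statistics described above are available in R through anova() applied to a multivariate lm() fit. A sketch, reusing the hypothetical mv.fit object from earlier:

##########
# Sequential MANOVA table with Wilks' lambda (Pillai, Hotelling-Lawley,
# and Roy's greatest root are also available via the test argument).
anova(mv.fit, test = "Wilks")

# Testing H0 that the coefficients of x2 are zero for all responses,
# by comparing the full fit with a reduced fit.
mv.reduced <- update(mv.fit, . ~ . - x2)
anova(mv.fit, mv.reduced, test = "Wilks")
##########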
Moreover, there may be an additional set of predictors, say W, such that W is a n × q matrix. Letting D denote a q × m matrix of regression parameters, we can then write the reduced rank regression model as follows: Y = XAC + WD + ε. In order to get estimates for the reduced rank regression model, first note that E((j) ) = 0 and Var(0(j) ) = Im×m ⊗ Σ. For simplicity in the following, let Z0 = Y, Z1 = X, and Z2 = W. Next, we define the moment matrices −1 Mi,j = ZT i Zj /m for i, j = 0, 1, 2 and Si,j = Mi,j − Mi,2 M2,2 M2,i , i, j = 0, 1. Then, the parameters estimates for the reduced rank regression model are as follows: Â = (ν̂1 , . . . , ν̂t )Φ T T Ĉ = S0,1 Â(Â S1,1 Â)−1 T T −1 −1 D̂ = M0,2 M2,2 , − Ĉ Â M1,2 M2,2 where (ν̂1 , . . . , ν̂t ) are the eigenvectors corresponding to the t largest eigen−1 values λ̂1 , . . . λ̂t of |λS1,1 − S1,0 S0,0 S0,1 | = 0 and where Φ is an arbitrary t × t matrix with full rank. STAT 501 D. S. Young CHAPTER 22. MULTIVARIATE MULTIPLE REGRESSION 22.4 349 Example Example: Amitriptyline Data This example analyzes conjectured side effects of amitriptyline - a drug some physicians prescribe as an antidepressant. Data were gathered on n = 17 patients admitted to a hospital with an overdose of amitriptyline. The two response variables are Y1 = total TCAD plasma level and Y2 = amount of amitriptyline present in TCAD plasma level. The five predictors measured are X1 = gender (0 for male and 1 for female), X2 = amount of the drug taken at the time of overdose, X3 = PR wave measurement, X4 = diastolic blood pressure, and X5 = QRS wave measurement. Table 22.2 gives the data set and we wish to fit a multivariate multiple linear regression model. Y1 3389 1101 1131 596 896 1767 807 1111 645 628 1360 652 860 500 781 1070 1754 Y2 X1 3149 1 653 1 810 0 448 1 844 1 1450 1 493 1 941 0 547 1 392 1 1283 1 458 1 722 1 384 0 501 0 405 0 1520 1 X2 7500 1975 3600 675 750 2500 350 1500 375 1050 3000 450 1750 2000 4500 1500 3000 X3 X4 220 0 200 0 205 60 160 60 185 70 180 60 154 80 200 70 137 60 167 60 180 60 160 64 135 90 160 60 180 0 170 90 180 0 X5 140 100 111 120 83 80 98 93 105 74 80 60 79 80 100 120 129 Table 22.2: The amitriptyline data set. First we obtain the regression estimates for each response: ########## Coefficients: D. S. Young STAT 501 350 CHAPTER 22. MULTIVARIATE MULTIPLE REGRESSION (Intercept) X1 X2 X3 X4 X5 ########## Y1 -2879.4782 675.6508 0.2849 10.2721 7.2512 7.5982 Y2 -2728.7085 763.0298 0.3064 8.8962 7.2056 4.9871 Then we can obtain individual ANOVA tables for each response and see that the multiple regression model for each response is statistically significant. ########## Response Y1 : Df Sum Sq Mean Sq F value Pr(>F) Regression 5 6835932 1367186 17.286 6.983e-05 *** Residuals 11 870008 79092 --Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 Response Y2 : Df Sum Sq Mean Sq F value Pr(>F) Regression 5 6669669 1333934 15.598 0.0001132 *** Residuals 11 940709 85519 --Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 ########## The following also gives the SSCP matrices for this fit: ########## $SSCPR Y1 Y2 Y1 6835932 6709091 Y2 6709091 6669669 $SSCPE Y1 Y2 Y1 870008.3 765676.5 STAT 501 D. S. Young CHAPTER 22. MULTIVARIATE MULTIPLE REGRESSION 351 Y2 765676.5 940708.9 $SSCPTO Y1 Y2 Y1 7705940 7474767 Y2 7474767 7610378 ########## We can also see which predictors are statistically significant for each response: ########## Response Y1 : Coefficients: Estimate Std. 
Error t value Pr(>|t|) (Intercept) -2.879e+03 8.933e+02 -3.224 0.008108 ** X1 6.757e+02 1.621e+02 4.169 0.001565 ** X2 2.849e-01 6.091e-02 4.677 0.000675 *** X3 1.027e+01 4.255e+00 2.414 0.034358 * X4 7.251e+00 3.225e+00 2.248 0.046026 * X5 7.598e+00 3.849e+00 1.974 0.074006 . --Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 Residual standard error: 281.2 on 11 degrees of freedom Multiple R-Squared: 0.8871, Adjusted R-squared: 0.8358 F-statistic: 17.29 on 5 and 11 DF, p-value: 6.983e-05 Response Y2 : Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -2.729e+03 9.288e+02 -2.938 0.013502 * X1 7.630e+02 1.685e+02 4.528 0.000861 *** X2 3.064e-01 6.334e-02 4.837 0.000521 *** X3 8.896e+00 4.424e+00 2.011 0.069515 . X4 7.206e+00 3.354e+00 2.149 0.054782 . X5 4.987e+00 4.002e+00 1.246 0.238622 D. S. Young STAT 501 352 CHAPTER 22. MULTIVARIATE MULTIPLE REGRESSION 1.5 Response = Y1 Response = Y2 ● ● ● 1.0 2 ● ● ● ● ● ● −0.5 ● ● ● 1 ● ● ● ● ● ● ● ● ● ● ● −1 −1.0 ● ● ● 0 Studentized Residuals 0.5 0.0 ● −1.5 Studentized Residuals ● ● ● ● −2.0 ● ● 500 1000 1500 ● 2000 2500 3000 500 1000 1500 2000 Fitted Values Fitted Values (a) (b) 2500 3000 Figure 22.1: Plots of the Studentized residuals versus fitted values for the response (a) total TCAD plasma level and the response (b) amount of amitriptyline present in TCAD plasma level. --Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 Residual standard error: 292.4 on 11 degrees of freedom Multiple R-Squared: 0.8764, Adjusted R-squared: 0.8202 F-statistic: 15.6 on 5 and 11 DF, p-value: 0.0001132 ########## We can proceed to drop certain predictors from the model in an attempt to improve the fit as well as view residual plots to assess the regression assumptions. Figure 22.1 gives the Studentized residual plots for each of the responses. Notice that the plots have a fairly random pattern, but there is one value high with respect to the fitted values. We could formally test (i.e., with a Levene’s test) to see if this affects the constant variance assumption and also to study pairwise scatterplots for any potential multicollinearity in this model. STAT 501 D. S. Young Chapter 23 Data Mining The field of Statistics is constantly being presented with larger and more complex data sets than ever before. The challenge for the Statistician is to be able to make sense of all of this data, extract important patterns, and find meaningful trends. We refer to the general tools and the approaches for dealing with these challenges in massive data sets as data mining.1 Data mining problems typically involve an outcome measurement which we wish to predict based on a set of feature measurements. The set of these observed measurements is called the training data. From these training data, we attempt to build a learner, which is a model used to predict the outcome for new subjects. These learning problems are (roughly) categorized as either supervised or unsupervised. A supervised learning problem is one where the goal is to predict the value of an outcome measure based on a number of input measures, such as classification with labeled samples from the training data. An unsupervised learning problem is one where there is no outcome measure and the goal is to describe the associations and patterns among a set of input measures, which involves clustering unlabeled training data by partitioning a set of features into a number of statistical classes. 
The regression problems that are the focus of this text are (generally) supervised learning problems. Data mining is an extensive field in and of itself. In fact, many of the methods utilized in this field are regression-based. For example, smoothing splines, shrinkage methods, and multivariate regression methods are all often found in data mining. The purpose of this chapter will not be to revisit these 1 Data mining is also referred to as statistical learning or machine learning. 353 354 CHAPTER 23. DATA MINING methods, but rather to add to our toolbox additional regression methods, which are methods that happen to be utilized more in data mining problems. 23.1 Some Notes on Variable and Model Selection When faced with high-dimensional data, it is often desired to perform some variable selection procedure. Methods discussed earlier, such as best subsets, forward selection, and backwards elimination can be used; however, these can be very computationally expensive to implement. Shrinkage methods like LASSO can be implemented, but these too can be expensive. Another alternative used in variable selection and commonly discussed in the context of data mining is least angle regression or LARS. LARS is a stagewise procedure that uses a simple mathematical formula to accelerate the computations relative to other variable selection procedure we have discussed. Only p steps are required for the full set of solutions, where pis the number of predictors. The LARS procedure starts with all coefficients equal to zero, and then finds the predictor most correlated with the response, say Xj1 . We take the largest step possible in the direction of this predictor until some other predictor, say Xj2 , has as much correlation with the current residual. LARS then proceeds in a direction equiangular between the two predictors until a third variable, say Xj3 earns its way into the “most correlated” set. LARS then proceeds equiangularly between Xj1 , Xj2 , and Xj3 (along the “least angle direction”) until a fourth variable enters. This continues until all p predictors have entered the model and then the analyst studies these p models to determine which yields an appropriate level of parsimony. A related methodology to LARS is forward stagewise regression. Forward stagewise regression starts by taking the residuals between the response values and their mean (i.e., all of the regression slopes are set to 0). Call this vector r. Then, find the predictor most correlated with r, say Xj1 . Update the regression coefficient βj1 by setting βj1 = βj∗1 + δji , where δji = × corr(r, Xj1 ) for some > 0 and βj∗1 is the old value of βj1 . Finally, update r by setting it equal to r∗ − δj1 Xj1 , such that r∗ is the old value of r. Repeat this process until no predictor has a correlation with r. LARS and forward stagewise regression are very computationally efficient. STAT 501 D. S. Young CHAPTER 23. DATA MINING 355 In fact, a slight modification to the LARS algorithm can calculate all possible LASSO estimates for a given problem. Moreover, a different modification to LARS efficiently implements forward stagewise regression. In fact, the acronym for LARS includes an “S” at the end to reflect its connection to LASSO and forward stagewise regression. Earlier in the text we also introduced the bootstrap as a way to get bootstrap confidence intervals for the regression parameters. However, the notion of the bootstrap can also be extended to fitting a regression model. 
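Before turning to the bootstrap, the following sketch (in R, on simulated data, with an assumed step size ε and stopping tolerance that are ours rather than part of the algorithm) implements the forward stagewise procedure just described. It is a minimal illustration rather than an efficient implementation; the LARS-based versions mentioned above are available in R packages such as lars.

##########
set.seed(1)
n <- 200; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))     # standardized predictors
beta.true <- c(3, -2, 0, 0, 1.5)
y <- drop(X %*% beta.true) + rnorm(n)

eps <- 0.05                  # step size, so that delta = eps * corr(r, Xj)
beta <- rep(0, p)            # all slopes start at zero
r <- y - mean(y)             # residuals from the mean-only fit

for (step in 1:20000) {
  cors <- drop(cor(r, X))                # correlation of r with each predictor
  j <- which.max(abs(cors))              # most correlated predictor
  if (abs(cors[j]) < 0.01) break         # stop when nothing is correlated with r
  delta <- eps * cors[j]
  beta[j] <- beta[j] + delta             # small update to the chosen coefficient
  r <- r - delta * X[, j]                # update the residuals
}

round(beta, 2)                    # stagewise estimates
round(coef(lm(y ~ X))[-1], 2)     # ordinary least squares slopes, for comparison
##########

As the step size shrinks, the stagewise estimates move toward the least squares fit, which is why the procedure is viewed as a slow, cautious version of forward selection.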
Suppose that we have p − 1 feature measurements and one outcome variable. Let Z = {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )} be our training data that we wish to fit a model to such that we obtain the prediction fˆ(x) at each input x. Bootstrap aggregation or bagging averages this prediction over a collection of bootstrap samples, thus reducing its variance. For each bootstrap sample Z∗b , b = 1, 2, . . . , B, we fit our model, which yields the prediction fˆb∗ (x). The bagging estimate is then defined by B 1 X ˆ∗ f (x). fˆbag(x) = B b=1 b Denote the empirical distribution function by P̂, which puts equal probability 1/n on each of the data points (xi , yi ). The “true” bagging estimate is defined by EP̂ fˆ∗ (x), where Z∗ = {(x∗1 , y1∗ ), (x∗2 , y2∗ ), . . . , (x∗n , yn∗ )} and each (x∗i , yi∗ ) ∼ P̂ . Note that the bagging estimate given above is a Monte Carlo estimate of the “true” bagging estimate, which it approaches as B → ∞. The bagging approach can be used in other model selection approaches throughout Statistics and data mining. 23.2 Classification and Support Vector Regression Classification is the problem of identifying the subpopulation to which new observations belong, where the labels of the subpopulation are unknown, on the basis of a training set of data containing observations whose subpopulation is known. The classification problem is often contrasted with clustering, where the problem is to analyze a data set and determine how (or if) the data set can be divided into groups. In data mining, classification D. S. Young STAT 501 356 CHAPTER 23. DATA MINING is a supervised learning problem while clustering is an unsupervised learning problem. In this chapter, we will focus on a special classification technique which has many regression applications in data mining. Support Vector Machines (or SVMs) perform classification by constructing an N -dimensional hyperplane that optimally separates the data into two categories. SVM models are closely related to neural networks, which we discuss later in this chapter. The predictor variables are called attributes, and a transformed attribute that is used to define the hyperplane is the feature. The task of choosing the most suitable representation is known as feature selection. A set of features that describes one case (i.e., a row of predictor values) is called a vector. So the goal of SVM modeling is to find the optimal hyperplane that separates clusters of vectors in such a way that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other size of the plane. The vectors near the hyperplane are the support vectors. Suppose we wish to perform classification with the data shown in Figure 23.1(a) and our data has a categorical target variable with two categories. Also assume that the attributes have continuous values. Figure 23.1(b) provides a snapshot of how we perform SVM modeling. The SVM analysis attempts to find a 1-dimensional hyperplane (i.e., a line) that separates the cases based on their target categories. There are an infinite number of possible lines and we show only one in Figure 23.1(b). The question is which line is optimal and how do we define that line. The dashed lines drawn parallel to the separating line mark the distance between the dividing line and the closest vectors to the line. The distance between the dashed lines is called the margin. The vectors (i.e., points) that constrain the width of the margin are the support vectors. 
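As a small illustration of these ideas, the sketch below (assuming the e1071 package is installed; the simulated data and variable names are ours) fits a linear support vector classifier to two well-separated groups and extracts the support vectors that constrain the margin.

##########
library(e1071)

set.seed(1)
n <- 40
x <- rbind(matrix(rnorm(n, mean = -1), n / 2, 2),
           matrix(rnorm(n, mean =  2), n / 2, 2))
dat <- data.frame(x1 = x[, 1], x2 = x[, 2],
                  y = factor(rep(c("A", "B"), each = n / 2)))

# Linear support vector classifier
fit <- svm(y ~ x1 + x2, data = dat, kernel = "linear", cost = 1, scale = FALSE)

fit$index                         # rows of dat that are the support vectors
table(predict(fit, dat), dat$y)   # training classification table
plot(fit, dat)                    # regions, separating line, and support vectors
##########

The rows returned in fit$index play the role of the support vectors marked along the margin in Figure 23.1(b).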
An SVM analysis finds the line (or, in general, hyperplane) that is oriented so that the margin between the support vectors is maximized. Unfortunately, the data we deal with is not generally as simple as that in Figure 23.2. The challenge will be to develop an SVM model that accommodates such characteristics as: 1. more than two attributes; 2. separation of the points with nonlinear curves; 3. handling of cases where the clusters cannot be completely separated; and STAT 501 D. S. Young CHAPTER 23. DATA MINING 357 ● ● ● ● ● ● ● ●● ●● ● ● −2 ● 2 x (a) 4 ● ●● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ●● ●●● ● ● ● ●● ● ● ● ● ● 0 ● ● ● ● ● ● ● ●● ● ● ● ● ●● ●●● ● ● ● ●● ● ● ● ● ● ● ● ● ● y ● ● ● ●● −2 ● ● −4 y 0 −2 −4 4 ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 2 2 ● ● ● 0 4 ● −2 ● ● 0 2 4 x (b) Figure 23.1: (a) A plot of the data where classification is to be performed. (b) The data where a support vector machine has been used. The points near the parallel dashed lines are the support vectors. The regions between the parallel dashed lines is called the margin, which is the region we want to optimize. 4. handling classification with more than two categories. The setting with nonlinear curves and where clusters cannot be completely separated is illustrated in Figure 23.2. Without loss of generality, our discussion will mainly be focused to the one attribute and one feature setting. Moreover, we will be utilizing support vectors in order to build a regression relationship that fits our data adequately. A little more terminology is necessary before we move into the regression discussion. A loss function represents the loss in utility associated with an estimate being “wrong” (i.e., different from either a desired or a true value) as a function of a measure of the degree of “wrongness” (generally the difference between the estimated value and the true or desired value). When discussing SVM modeling in the regression setting, the loss function will need to incorporate a distance measure as well. As a quick illustration of some common loss functions look at Figure 23.3. Figure 23.3(a) is a quadratic loss function, which is what we use in classical ordinary least squares. Figure 23.3(b) is a Laplacian loss function, which D. S. Young STAT 501 358 CHAPTER 23. DATA MINING SVM classification plot 4 1.0 3 ● 2 0.5 ● ● ● ● ● ●● 1 0 ● ● ● ● −1 ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● 0.0 ● ● ● −0.5 ● ● ● −1.0 ● ● −2 ● ● ● −2 −1 0 1 2 3 Figure 23.2: A plot of data where a support vector machine has been used for classification. The data was generated where we know that the circles belong to group 1 and the triangles belong to group 2. The white contours show where the margin is; however, there are clearly some values that have been misclassified since the two clusters are not well-separated. The points that are solid were used as the training data. is less sensitive to outliers than the quadratic loss function. Figure 23.3(c) is Huber’s loss function, which is a robust loss function that has optimal properties when the underlying distribution of the data is unknown. Finally, Figure 23.3(d) is called the -insensitive loss function, which enables a sparse set of support vectors to be obtained. In Support Vector Regressions (or SVRs), the input is first mapped onto an N -dimensional feature space using some fixed (nonlinear) mapping, and then a linear model is constructed in this feature space. 
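Before writing this down formally, here is a brief sketch (again assuming the e1071 package, with simulated data of our own) of support vector regression under an ε-insensitive loss. As in the motorcycle example later in this chapter, decreasing ε retains more support vectors and produces a more complex fit.

##########
library(e1071)

set.seed(1)
x <- seq(0, 10, length.out = 150)
y <- sin(x) + rnorm(150, sd = 0.3)
dat <- data.frame(x = x, y = y)

# epsilon-insensitive support vector regression for several epsilon values
eps.values <- c(0.01, 0.10, 0.70)
fits <- lapply(eps.values, function(e)
  svm(y ~ x, data = dat, type = "eps-regression", epsilon = e))

# Smaller epsilon keeps more support vectors and gives a more complex fit
sapply(fits, function(f) length(f$index))

plot(x, y, col = "gray")
for (f in fits) lines(x, predict(f, dat))
##########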
Using mathematical notation, the linear model (in the feature space) is given by f (x, ω) = N X ωj gj (x) + b, j=1 where gj (·), j = 1, . . . , N denotes a set of nonlinear transformations, and b is a bias term. If the data is assumed to be of zero mean (as it usually is), STAT 501 D. S. Young CHAPTER 23. DATA MINING 359 (a) (b) (c) (d) Figure 23.3: Plots of the (a) quadratic loss, (b) Laplace loss, (c) Huber’s loss, and (d) -insensitive loss functions. D. S. Young STAT 501 360 CHAPTER 23. DATA MINING then the bias term is dropped. Note that b is not considered stochastic in this model and is not akin to the error terms we have studied in previous models. The optimal regression function is given by the minimum of the functional n X 1 Φ(ω, ξ) = kωk2 + C (ξ − + ξ + ), 2 i=1 where C is a pre-specified constant, ξ − and ξ + are slack variables representing upper and lower constraints (respectively) on the output of the system. In other words, we have the following constraints: yi − f (xi , ω) ≤ + ξi+ f (xi , ω) − yi ≤ + ξi− ξ − , ξ + ≥ 0, i = 1, . . . , n, where yi is defined through the loss function we are using. The four loss functions we show in Figure 23.3 are as follows: Quadratic Loss: L2 = (f (x) − y) = (f (x) − y)2 Laplace Loss: L1 = (f (x) − y) = |f (x) − y| Huber’s Loss2 : LH = 1 (f (x) 2 − y)2 , δ|f (x) − y| − δ2 , 2 for |f (x) − y| < δ; otherwise. -Insensitive Loss: L = 0, for |f (x) − y| < ; |f (x) − y| − , otherwise. Depending on which loss function is chosen, then an appropriate optimization problem can be specified, which can involve kernel methods. Moreover, specification of the kernel type as well as values like C, , and δ all control the complexity of the model in different ways. There are many subtleties 2 The quantity δ is a specified threshold constant. STAT 501 D. S. Young CHAPTER 23. DATA MINING 361 depending on which loss function is used and the investigator should become familiar with the loss function being employed. Regardless, the optimization approach will require the use of numerical methods. It is also desirable to strike a balance between complexity and the error that is present with the fitted model. Test error (also known as generalization error) is the expected prediction error over an independent test sample and is given by Err = E[L(Y, fˆ(X))], where X and Y are drawn randomly from their joint distribution. This expectation is an average of everything that is random in this set-up, including the randomness in the training sample that produced the estimate fˆ(·). Training error is the average loss over the training sample and is given by n 1X err = L(yi , fˆ(xi )). n i=1 We would like to know the test error of our estimated model fˆ(·). As the model increases in complexity, it is able to capture more complicated underlying structures in the data, which thus decreases bias. But then the estimation error increases, which thus increases variance. This is known as the bias-variance tradeoff. In between there is an optimal model complexity that gives minimum test error. 23.3 Boosting and Regression Transfer Transfer learning is the notion that it is easier to learn a new concept (such as how to play racquetball) if you are already familiar with a similar concept (such as knowing how to play tennis). 
In the context of supervised learning, inductive transfer learning is often framed as the problem of learning a concept of interest, called the target concept, given data from multiple sources: a typically small amount of target data that reflects the target concept, and a larger amount of source data that reflects one or more different, but possibly related, source concepts. While most algorithms addressing this notion are in classification settings, some of the common algorithms can be extended to the regression setting to help us build our models. The approach we discuss is called boosting or D. S. Young STAT 501 362 CHAPTER 23. DATA MINING boosted regression. Boosted regression is highly flexible in that it allows the researcher to specify the feature measurements without specifying their functional relationship to the outcome measurement. Because of this flexibility, a boosted model will tend to fit better than a linear model and therefore inferences made based on the boosted model may have more credibility. Our goal is to learn a model of a concept ctarget mapping feature vectors from the feature space containing X to the response space Y . We are given a set of training instances Ttarget = {(xi , yi )}, with xi ∈ X and yi ∈ Y for i = 1, . . . , n that reflect ctarget . In addition, we are given data sets 1 B , . . . , Tsource source reflecting B different, but possibly related, concepts Tsource also mapping X to Y . In order to learn the most accurate possible model of ctarget , we must decide how to use both the target and source data sets. If Ttarget is sufficiently large, we can likely learn a good model using only this data. However, if Ttarget is small and one or more of the source concepts is similar to ctarget , then we may be able to use the source data to improve our model. Regression transfer algorithms fit into two basic categories: those that make use of models trained on the source data, and those that use the source data directly as training data. The two algorithms presented here fit into each of these categories and are inspired by two boosting-based algorithms for classification transfer: ExpBoost and AdaBoost. The regression analogues that we present are called ExpBoost.R2 and AdaBoost.R2. Boosting is an ensemble method in which a sequence of models (or hypotheses) h1 , . . . , hm , each mapping from X to Y , are iteratively fit to some transformation of a data set using a base learner. The outputs of these models are then combined into a final hypothesis, which we denote as h∞ . We can now formalize the two regression transfer algorithms. AdaBoost.R2 Input the labeled target data set T of size n, the maximum number of iterations B, and a base learning algorithm called Learner. Unless otherwise specified, set the initial weight vector w1 such that wi1 = 1/n for i = 1, . . . , n. For t = 1, . . . , B: 1. Call Learner with the training set T and the distribution wt , and get a hypothesis ht : X → R. 2. Calculate the adjusted error eti for each instance. Let Dt = maxi |yi − STAT 501 D. S. Young CHAPTER 23. DATA MINING 363 ht (xj )|, so that eti = |yi − ht (xi )|/Dt . 3. Calculate the adjusted error of ht , which is t = then stop and set B = t − 1. Pn t t i=1 ei wi . If t ≥ 0.5, 4. Let γt = t /(1 − t ). 1−eti 5. Update the weight vector as wit+1 = wit γt normalizing constant. /Zt , such that Zt is a Output the hypothesis h∞ , which is the median of ht (x) for t = 1, . . . , B, using ln(1/γt ) as the weight for hypothesis ht . 
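A compact sketch of these steps is given below (in R, using a weighted simple linear regression as the base learner Learner and the linear loss; the simulated data and the choice of base learner are illustrative assumptions on our part, not part of the algorithm).

##########
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)
B <- 10

w <- rep(1 / n, n)        # initial instance weights w^1
models <- list()
hyp.wt <- numeric(0)      # ln(1/gamma_t), the hypothesis weights

for (b in 1:B) {
  fit <- lm(y ~ x, weights = w)     # base learner trained under the distribution w
  abs.err <- abs(y - fitted(fit))
  D <- max(abs.err)
  e <- abs.err / D                  # adjusted errors (linear loss), in [0, 1]
  eps <- sum(e * w)                 # adjusted error of h_t
  if (eps >= 0.5) break             # stop and discard the current hypothesis
  gamma <- eps / (1 - eps)
  models[[length(models) + 1]] <- fit
  hyp.wt <- c(hyp.wt, log(1 / gamma))
  w <- w * gamma^(1 - e)
  w <- w / sum(w)                   # division by the normalizing constant Z_t
}

# Final hypothesis: weighted median of the individual predictions
ada.predict <- function(x0) {
  preds <- sapply(models, function(m) predict(m, data.frame(x = x0)))
  apply(as.matrix(preds), 1, function(p) {
    o <- order(p)
    p[o][which(cumsum(hyp.wt[o]) >= 0.5 * sum(hyp.wt))[1]]
  })
}
head(cbind(truth = 2 + 0.5 * x, boosted = ada.predict(x)))
##########

With a richer base learner, such as a regression tree, the same loop applies unchanged.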
The method used in AdaBoost.R2 is to express each error in relation to the largest error D = maxi |ei | in such a way that each adjusted error e0i is in the range [0, 1]. In particular, one of three possible loss functions is used: e0i = ei /D (linear), e0i = e2i /D2 (quadratic), or e0i = 1 − exp(−ei /D) (exponential). The degree to which instance xi is reweighted in iteration t thus depends on how large the error of ht is on xi relative to the error on the worst instance. ExpBoost.R2 Input the labeled target data set T of size n, the maximum number of iterations B, and a base learning algorithm called Learner. Unless otherwise specified, set the initial weight vector w1 such that wi1 = 1/n for i = 1, . . . , n. Moreover, each source data set gets assigned to one expert from the set of experts H B = {h1 , . . . , hB }. For t = 1, . . . , B: 1. Call Learner with the training set T and the distribution wt , and get a hypothesis ht : X → R. 2. Calculate the adjusted error eti for each instance. Let Dt = maxi |yi − ht (xj )|, so that eti = |yi − ht (xi )|/Dt . P 3. Calculate the adjusted error of ht , which is t = ni=1 eti wit . If t ≥ 0.5, then stop and set B = t − 1. D. S. Young STAT 501 364 CHAPTER 23. DATA MINING 4. Calculate the weighted errors of each expert in H B on the current weighting scheme. If any expert in H B has a lower weighted error than ht , then replace ht with this “best” expert. 5. Let γt = t /(1 − t ). 1−eti 6. Update the weight vector as wit+1 = wit γt normalizing constant. /Zt , such that Zt is a Output the hypothesis h∞ , which is the median of ht (x) for t = 1, . . . , B, using ln(1/γt ) as the weight for hypothesis ht . As can be seen, ExpBoost.R2 is similar to the AdaBoost.R2 algorithm, but with a few minor differences. 23.4 CART and MARS Classification and regression trees (CART) is a nonparametric treebased method which partitions the predictor space into a set of rectangles and then fits a simple model (like a constant) in each one. While they seem conceptually simple, they are actually quite powerful. Suppose we have one response (yi ) and p predictors (xi,1 , . . . , xi,p ) for i = 1, . . . , n. First we partition the predictor space into M regions (say, R1 , . . . , RM ) and model the response as a constant cm in each region: f (x) = M X cm I(x ∈ Rm ). m=1 Then, minimizing the sum of squares yields ĉm = n X (yi − f (xi ))2 i=1 P n yi I(xi ∈ Rm ) . = Pi=1 n i=1 I(xi ∈ Rm ) We proceed to grow the tree by finding the best binary partition in terms of the ĉm values. This is generally computationally infeasible which leads to STAT 501 D. S. Young CHAPTER 23. DATA MINING 365 use of a greedy search algorithm. Typically, the tree is grown until a small node size (such as 5 nodes) is reached and then a method for pruning the tree is implemented. Multivariate adaptive regression splines (MARS) is another nonparametric method that can be viewed as a modification of CART and is wellsuited for high-dimensional problems. MARS uses expansions in piecewise linear basis functions of the form (x−t)+ and (t−x)+ such that the “+” subscript simply means we take the positive part (e.g., (x−t)+ = (x−t)I(x > t)). These two functions together are called a reflected pair. In MARS, each function is piecewise linear with a knot at t. The idea is to form a reflected pair for each predictor Xj with knots at each observed value xi,j of that predictors. Therefore, the collection of basis functions for j = 1, . . . , p is C = {(Xj − t)+ , (t − Xj )+ }t∈{x1,j ,...,xn,j } . 
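Before describing how MARS combines these basis functions, the sketch below (assuming the rpart package is installed, with simulated data of our own) fits a regression tree of the kind just described; the fitted constant ĉm in each terminal node is the node mean, and the complexity table is what one would consult when pruning.

##########
library(rpart)

set.seed(1)
n <- 200
x1 <- runif(n); x2 <- runif(n)
y <- ifelse(x1 < 0.5, 2, 8) + 3 * (x2 > 0.7) + rnorm(n, sd = 0.5)
dat <- data.frame(x1 = x1, x2 = x2, y = y)

# Grow a regression tree; minsplit and cp control how far the tree is grown
tree <- rpart(y ~ x1 + x2, data = dat, method = "anova",
              control = rpart.control(minsplit = 10, cp = 0.01))

print(tree)       # the fitted constant in each terminal node is the node mean
printcp(tree)     # complexity table used to choose a pruning point
pruned <- prune(tree, cp = 0.02)
##########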
MARS proceeds like a forward stepwise regression model selection procedure, but instead of selecting the predictors to use, we use functions from the set C and their products. Thus, the model has the form f (X) = β0 + M X βm hm (X), m=1 where each hm (X) is a function in C or a product of two or more such functions. You can also think of MARS as “selecting” a weighted sum of basis functions from the set of (a large number of) basis functions that span all values of each predictor (i.e., that set would consist of one basis function and knot value t for each distinct value of each predictor variable). The MARS algorithm then searches over the space of all inputs and predictor values (knot locations t) as well as interactions between variables. During this search, an increasingly larger number of basis functions are added to the model (selected from the set of possible basis functions) to maximize an overall least squares goodness-of-fit criterion. As a result of these operations, MARS automatically determines the most important independent variables as well as the most significant interactions among them. D. S. Young STAT 501 366 23.5 CHAPTER 23. DATA MINING Neural Networks With the exponential growth in available data and advancement in computing power, researchers in statistics, artificial intelligence, and data mining have been faced with the challenge to develop simple, flexible, powerful procedures for modeling large data sets. One such model is the neural network approach, which attempts to model the response as a nonlinear function of various linear combinations of the predictors. Neural networks were first used as models for the human brain. The most commonly used neural network model is the single-hiddenlayer, feedforward neural network (sometimes called the single-layer perceptron.) In this neural network model, the ith response yi is modeled as a nonlinear function fY of m derived predictor values, Si,0 , Si,1 , . . . , Si,m−1 : yi = fY (β0 Si,0 + β1 Si,1 + . . . + βm−1 Si,m−1 ) + i = fY (ST i β) + i , where β= β0 β1 .. . Si,0 Si,1 .. . and Si = . Si,m−1 βm−1 Si,0 equals 1 and for j = 1, . . . , m − 1, the j th derived predictor value for the ith observation, Si,j , is a nonlinear function fj of a linear combination of the original predictors: Si,j = fj (XT i θ j ), where θj = θj,0 θj,1 .. . θj,p−1 and X = i Xi,0 Xi,1 .. . Xi,p−1 and Xi,0 = 1. The functions fY , f1 , . . . , fm−1 are called activation functions. Finally, we can combine all of the above to form the neural network STAT 501 D. S. Young CHAPTER 23. DATA MINING 367 model as: yi = fY (ST i β) + i = fY (β0 + m−1 X βj fj (XT i θ j )) + i . j=1 There are various numerical optimization algorithms for fitting neural networks (e.g., quasi-Newton methods and conjugate-gradient algorithms). One important thing to note is that parameter estimation in neural networks often utilizes penalized least squares to control the level of overfitting. The penalized least squares criterion is given by: Q= n X [yi − fY (β0 + i=1 m−1 X 2 βj fj (XT i θ j ))] + pλ (β, θ 1 , . . . , θ m−1 ), j=1 where the penalty term is given by: pλ (β, θ 1 , . . . , θ m−1 ) = λ m−1 X i=0 βi2 + p−1 m−1 XX 2 θi,j . i=1 j=1 Finally, there is also a modeling technique which is similar to the treebased methods discussed earlier. The hierarchical mixture-of-experts model (HME model) is a parametric tree-based method which recursively splits the function of interest at each node. 
However, the splits are done probabilistically and the probabilities are functions of the predictors. The model is written as f (yi ) = k1 X j1 =1 λj1 (xi , τ ) k2 X λj2 (xi , τ j1 ) j2 =1 ··· kr X λjr (xi , τ j1 , j2 , . . . , jr−1 )g(yi ; xi , θ j1 ,j2 ,...,jr−1 ), jr =1 which has a tree structure with r levels (i.e., r levels where probabilistic splits occur). The λ(·) functions provide the probabilities for the splitting and, in addition to being dependent on the predictors, they also have their own set of parameters (the different τ values) requiring estimation (these mixing D. S. Young STAT 501 368 CHAPTER 23. DATA MINING proportions are modeled using logistic regressions). Finally, θ is simply the parameter vectors for the regression modeled at each terminal node of the tree constructed using the HME structure. The HME model is similar to CART, however, unlike CART it does not provide a “hard” split at each node (i.e., either a node splits or it does not). The HME model incorporates these predictor-dependent mixing proportions which provide a “soft” probabilistic split at each node. The HME model can also be thought of as being in the middle of continuum where at one end we have CART (which provides hard splits) and at the other end is mixtures of regressions (which is closely related to the HME model, but the mixing proportions which provide the soft probabilistic splits are no longer predictor-dependent). We will discuss mixtures of regressions at the end of this chapter. 23.6 Examples Example 1: Simulated Neural Network Data In this very simple toy example, we have provided two features (i.e., input neurons X1 and X2 ) and one response measurement (i.e., output neuron Y ) which are given in Table 23.1. The model fit is one where X1 and X2 do not interact on Y . A single hidden-layer and a double hidden-layer neural net model are each fit to this data to highlight the difference in the fits. Below is the output (for each neural net) which shows the results from training the model. A total of 5 training samples were used and a threshold value of 0.01 was used as a stopping criterion. The stopping criterion pertains to the partial derivatives of the error function and once we fall beneath that threshold value, then the algorithm stops for that training sample. X1 0 1 0 1 X2 0 0 1 1 Y 0 1 1 0 Table 23.1: The simulated neural network data. ########## 5 repetitions were calculated. STAT 501 D. S. Young CHAPTER 23. DATA MINING 369 Error Reached Threshold Steps 3 0.3480998376 0.007288519715 41 1 0.5000706639 0.009004839727 14 2 0.5000949028 0.009028409036 0.5001216674 0.008221843135 35 5 0.5007429970 0.007923316336 10 26 4 5 repetitions were calculated. Error Reached Threshold Steps 4 0.0002241701811 0.009294280160 61 2 0.0004741186530 0.008171862296 193 5 0.2516368073472 0.006640846189 88 3 0.3556429122848 0.007036160421 46 1 0.5015928330534 0.009549108455 25 ########## 1 1 1 1 1 431 2.8 X1 X1 6.15091 Y 4 38 .76 −5 (a) 6.07832 52 1 X2 3.3107 00 Error: 0.3481 Steps: 41 502 3 . −3 96 Y 3 87 X2 41 2.17 .1 40 −1.62898 −1 .40 307 1864 3 −6 37 0.67 90 −0.4 −3 . 25 .1 −1 Error: 0.000224 Steps: 61 (b) Figure 23.4: (a) The fitted single hidden-layer neural net model to the toy data. (b) The fitted double hidden-layer neural net model to the toy data. In the above output, the first group of 5 repetitions pertain to the single hidden-layer neural net. For those 5 repetitions, the third training sample yielded the smallest error (about 0.3481). The second group of 5 repetitions D. S. 
Young STAT 501 370 CHAPTER 23. DATA MINING ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ●● ● ● ● ● ●● ●●●● ●● ●●●●●●●● ●● ●● ● ● ●●● ● ● −100 −100 ● ● ● ●● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ●● ● ● ● 20 ● ● ● ● ●● ● ● ● 10 ● ● ● ● ● ● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● −50 ●● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ●● ●●●● ●● ●●●●●●●● ●● ●● ● ● ●●● −50 accel 0 ● 0 ● ● 50 ● ● ● ●● accel 50 ● 30 times (a) 40 50 ● ● ● ● ● ● ● ●● ●● ● ● ● ε = 0.01 ε = 0.10 ε = 0.70 ● 10 20 30 40 50 times (b) Figure 23.5: (a) Data from a simulated motorcycle accident where the time until impact (in milliseconds) is plotted versus the recorded head acceleration (in g). (b) The data with different values of used for the support vector regression obtained with an −insensitive loss function. Note how the smaller the , the more features you pick up in the fit, but the complexity of the model also increases. pertain to the double hidden-layer neural net. For those 5 repetitions, the fourth training sample yielded the smallest error (about 0.0002). The increase in complexity of the neural net has yielded a smaller training error. The fitted neural net models are depicted in Figure 23.4. Example 2: Motorcycle Accident Data This data set is from a simulated accident involving different motorcycles. The time in milliseconds until impact and the g-force measurement of acceleration are recorded. The data are provided in Table 23.2 and plotted in Figure 23.5(a). Given the obvious nonlinear trend that is present with this data, we will attempt to fit a support vector regression to this data. A support vector regression using an -insensitive loss function is fitted to this data. ∈ {0.01, 0.10, 0.70} are fitted to this data and are shown in Figure 23.5(b). As decreases, different characteristics of the data are emphasized, but the level of complexity of the model is increased. As noted earlier, we STAT 501 D. S. Young CHAPTER 23. DATA MINING 371 want to try and strike a good balance regarding the model complexity. For the training error, we get values of 0.177, 0.168, and 0.250 for the three levels of . Since our objective is to minimize the training error, the value of (which has a training error of 0.168) is chosen. This corresponds to the green line in Figure 23.5(b). D. S. Young STAT 501 372 Obs. Times 1 2.4 2 2.6 3 3.2 4 3.6 5 4.0 6 6.2 7 6.6 8 6.8 9 7.8 10 8.2 11 8.8 12 9.6 13 10.0 14 10.2 15 10.6 16 11.0 17 11.4 18 13.2 19 13.6 20 13.8 21 14.6 22 14.8 23 15.4 24 15.6 25 15.8 26 16.0 27 16.2 28 16.4 29 16.6 30 16.8 31 17.6 32 17.8 CHAPTER 23. DATA MINING Accel. 0.0 -1.3 -2.7 0.0 -2.7 -2.7 -2.7 -1.3 -2.7 -2.7 -1.3 -2.7 -2.7 -5.4 -2.7 -5.4 0.0 -2.7 -2.7 0.0 -13.3 -2.7 -22.8 -40.2 -21.5 -42.9 -21.5 -5.4 -59.0 -71.0 -37.5 -99.1 Obs. Times 33 18.6 34 19.2 35 19.4 36 19.6 37 20.2 38 20.4 39 21.2 40 21.4 41 21.8 42 22.0 43 23.2 44 23.4 45 24.0 46 24.2 47 24.6 48 25.0 49 25.4 50 25.6 51 26.0 52 26.2 53 26.4 54 27.0 55 27.2 56 27.6 57 28.2 58 28.4 59 28.6 60 29.4 61 30.2 62 31.0 63 31.2 Accel. -112.5 -123.1 -85.6 -127.2 -123.1 -117.9 -134.0 -101.9 -108.4 -123.1 -123.1 -128.5 -112.5 -95.1 -53.5 -64.4 -72.3 -26.8 -5.4 -107.1 -65.6 -16.0 -45.6 4.0 12.0 -21.5 46.9 -17.4 36.2 75.0 8.1 Obs. Times 64 32.0 65 32.8 66 33.4 67 33.8 68 34.4 69 34.8 70 35.2 71 35.4 72 35.6 73 36.2 74 38.0 75 39.2 76 39.4 77 40.0 78 40.4 79 41.6 80 42.4 81 42.8 82 43.0 83 44.0 84 44.4 85 45.0 86 46.6 87 47.8 88 48.8 89 50.6 90 52.0 91 53.2 92 55.0 93 55.4 94 57.6 Accel. 
54.9 46.9 16.0 45.6 1.3 75.0 -16.0 69.6 34.8 -37.5 46.9 5.4 -1.3 -21.5 -13.3 30.8 29.4 0.0 14.7 -1.3 0.0 10.7 10.7 -26.8 -13.3 0.0 10.7 -14.7 -2.7 -2.7 10.7 Table 23.2: The motorcycle data. STAT 501 D. S. Young Chapter 24 Advanced Topics This chapter presents topics where theory beyond the scope of this course needs to be developed with the applicability. The topics are not arranged in any particular order, but rather are just a sample of some of the more advanced regression procedures that are available. Not all computer software has the capabilities to perform analysis on the models presented here. 24.1 Semiparametric Regression Semiparametric regression is concerned with flexible modeling of nonlinear functional relationships in regression analysis by building a model consisting of both parametric and nonparametric components. We have already visited a semiparametric model with the Cox proportional hazards model. In this model, there is the baseline hazard, which is nonparametric, and then the hazards ratio, which is parametric. Suppose we have n = 200 observations where y is the response, x1 is a predictor taking on only values of 1, 2, 3 or 4, and x2 , x3 and x4 are predictors taking on values between 0 and 1. A semiparametric regression model of interest for this setting is yi = β0 + β1 z2,i + β2 z3,i + β3 z4,i + m(x2,i , x3,i , x4,i ), where zj,i = I{x1,i = j}. In otherwords, we are using the leave-one-out method for the levels of x1 . 373 374 CHAPTER 24. ADVANCED TOPICS The results of fitting a semiparametric regression model are given in Figure 24.1. There are noticeable functional forms for x2 and x3 , however, x4 appears to almost be 0. In fact, this is exactly how the data was generated. The data were generated according to: 6 3 10 yi = −5.15487 + e2xi,1 + 0.2x11 2,i (10(1 − x2,i )) + 10(10x2,i ) (1 − x2,i ) + ei , where the ei were generated according to a normal distribution with mean 0 and variance 4. Notice how the x4,i term was not used in the data generation, which is reflected in the plot and in the significance of the smoothing term from the output below: ########## Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.9660 0.2939 13.494 < 2e-16 *** x12 1.8851 0.4176 4.514 1.12e-05 *** x13 3.8264 0.4192 9.128 < 2e-16 *** x14 6.1100 0.4181 14.615 < 2e-16 *** --Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 Approximate significance of smooth terms: edf Est.rank F p-value s(x2) 1.729 4.000 25.301 <2e-16 *** s(x3) 7.069 9.000 45.839 <2e-16 *** s(x4) 1.000 1.000 0.057 0.811 --Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 R-sq.(adj) = 0.78 GCV score = 4.5786 ########## Deviance explained = 79.4% Scale est. = 4.2628 n = 200 There are actually many general forms of semiparametric regression models. We will list a few of them. In the following outline, X = (X1 , . . . , Xp )T pertains to the predictors and may be partitioned such that X = (UT , VT )T where U = (U1 , . . . , Ur )T , V = (V1 , . . . , Vs )T , and r + s = p. Also, m(·) is a nonparametric function and g(·) is a link function as established in the discussion on generalized linear models. STAT 501 D. S. Young 5 10 375 −5 0 s(x3,7.07) 5 0 −5 s(x2,1.73) 10 CHAPTER 24. ADVANCED TOPICS 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 1.0 6 4 0 2 Partial for x1 5 0 −5 s(x4,1) 0.8 x3 10 x2 0.6 0.0 0.2 0.4 0.6 x4 0.8 1.0 1 2 3 4 x1 Figure 24.1: Semiparametric regression fits of the generated data. D. S. Young STAT 501 376 CHAPTER 24. 
ADVANCED TOPICS Additive Models: In the model E(Y |X) = β0 + p X mj (Xj ), j=1 we have a fixed intercept term and wish to estimate p nonparametric functions - one for each of the predictors. Partial Linear Models: In the model E(Y |U, V) = UT β + m(V), we have the sum of a purely parametric part and a purely nonparametric part, which involves parametric estimation routines and nonparametric estimation routines, respectively. This is the type of model used in the generation of the example given above. Generalized Additive Models: In the model E(Y |X) = g(β0 + p X mj (Xj )), j=1 we have the same setting as in an additive model, but a link function relates the sum of functions to the response variable. This is the model fitted to the example above. Generalized Partial Linear Models: In the model E(Y |U, V) = g(UT β + m(V)), we have the same setting as in a partial linear model, but a link function relates the sum of parametric and nonparametric components to the response. Generalized Partial Linear Partial Additive Models: In the model T E(Y |U, V) = g(U β + s X mj (Vj )), j=1 we have the sum of a parametric component and the sum of s individual nonparametric functions, but there is also a link function that relates this sum to the response. STAT 501 D. S. Young CHAPTER 24. ADVANCED TOPICS 377 Another method (which is often a semiparametric regression model due to its exploratory nature) is the projection pursuit regression method discussed earlier. “Projection Pursuit” stands for a class of exploratory projection techniques. This class contains statistical methods designed for analyzing high-dimensional data using low-dimensional projections. The aim of projection pursuit regression is to reveal possible nonlinear relationship between a response and very large number of predictors with the ultimate goal of finding interesting structures hidden within the high-dimensional data. To conclude this section, let us outline the general context of the three classes of regression models: Parametric Models: These models are fully determined up to a parameter vector. If the underlying assumptions are correct, then the fitted model can easily be interpreted and estimated accurately. If the assumptions are violated, then fitted parametric estimates may provide inconsistencies and misleading interpretations. Nonparametric Models: These models provide flexible models and avoid the restrictive parametric form. However, they may be difficult to interpret and yield inaccurate estimates for a large number of regressors. Semiparametric Models: These models combine parametric and nonparametric components. They allow easy interpretation of the parametric component while providing the flexibility of the nonparametric component. 24.2 Random Effects Regression and Multilevel Regression The next model we consider is not unlike growth curve models. Suppose we have responses measured on each subject repeatedly. However, we no longer assume that the same number of responses are measured for each subject (such data is often called longitudinal data or trajectory data). In addition, the regression parameters are now subject-specific parameters. The regression parameters are considered random effects and are assumed to follow their own distribution. Earlier, we only discussed the sampling distribution of the regression parameters and the regression parameters were assumed fixed (i.e., they were assumed to be fixed effects). D. S. Young STAT 501 378 CHAPTER 24. 
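Before turning to an example, the following sketch (in R, assuming the nlme package is installed, with a simulated longitudinal data set and illustrative variable names of our own) fits a random effects regression with subject-specific intercepts and slopes; a linear trajectory is used here for brevity, whereas the infant example below uses a quadratic.

##########
library(nlme)

set.seed(1)
N <- 20
dat <- do.call(rbind, lapply(1:N, function(i) {
  ni <- sample(5:10, 1)                   # varying number of measurements
  time <- sort(runif(ni, 0, 20))
  b0 <- rnorm(1, 10, 3)                   # subject-specific intercept
  b1 <- rnorm(1, 2, 0.5)                  # subject-specific slope
  data.frame(id = factor(i), time = time,
             y = b0 + b1 * time + rnorm(ni, sd = 2))
}))

# Random intercept and slope for each subject around the population mean curve
fit <- lme(y ~ time, random = ~ time | id, data = dat)

summary(fit)
fixef(fit)    # estimate of the mean vector of the beta_i
ranef(fit)    # estimated subject-specific deviations from that mean
##########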
Figure 24.2: Scatterplot of the infant data with a trajectory (in this case, a quadratic response curve) fitted to each infant.

As an example, consider a sample of 40 infants used in the study of a habituation task. Suppose the infants were broken into four groups and studied by four different psychologists. A similar habituation task is given to the four groups, but the number of times it is performed in each group differs. Furthermore, it is suspected that each infant will have its own trajectory when a response curve is constructed. All of the data are presented in Figure 24.2 with a quadratic response curve fitted to each infant. When broken down further, notice in Figure 24.3 how each group has a set of infants with a different number of responses. Furthermore, notice how a different trajectory was fit to each infant. Each of these trajectories has its own set of regression parameter estimates.

Figure 24.3: Plots for each group of infants where each group has a different number of measurements.

Let us formulate the linear model for this setting. Suppose we have i = 1, . . . , N subjects and each subject has a response vector yi which consists of ni measurements (notice that n is subscripted by i to signify the varying number of measurements nested within each subject - if all subjects have the same number of measurements, then ni ≡ n). The random effects regression model is given by: yi = Xi βi + εi,
ADVANCED TOPICS where the Xi are known ni × p design matrices, β i are regression parameters for subject i, and the i are ni × 1 vectors of random within-subject residuals distributed independently as Nni (0, σ 2 Ini ×ni ). Furthermore, the β i are assumed to be multivariate normally distributed with mean vector µβ and variance-covariance matrix Σβ . Given these assumptions, it can be shown that the yi are marginally distributed as independent normals with mean 2 Xi µβ and variance-covariance matrix Xi Σβ XT i + σ Ini ×ni . Another regression model, not unrelated to random effects regression models, involves imposing another model structure on the regression coefficients. These are called multilevel (hierarchical) regression models. For example, suppose the random effects regression model above is a simple linear case (i.e., β = (β0 β1 )T ). We may assume that the regression coefficients in the random effects regression model from above to have the following structure: β i = α0 + α1 ui + δi . In this regression relationship, we would also have observed the ui ’s and we assume that δi ∼iid N (0, τ 2 ) for all i. Then, we estimate α0 and α1 directly from the data. Note the hierarchical structure of this model, hence the name. Estimation of the random effects regression model and the multilevel regression model requires more sophisticated methods. Some common estimation methods include use of empirical or hierarchical Bayes estimates, iteratively reweighted maximum marginal likelihood methods, and EM algorithms. Various statistical intervals can also be constructed for these models. 24.3 Functional Linear Regression Analysis Functional data consists of observations which can be treated as functions rather than numeric vectors. One example is fluorescence curves used in photosynthesis research where the curve reflects the biological processes which occur during the plant’s initial exposure to sunlight. Longitudinal data can be considered a type of functional data such as taking repeated measurements over time on the same subject (e.g., blood pressure or cholesterol readings). Functional regression models are of the form yi (t) = β(t)φ(xi ) + i (t), where yi (t), β(t), and i (t) represent the functional response, average curve, and the error process, respectively. φ(xi ) is a multiplicative effect modifying STAT 501 D. S. Young CHAPTER 24. ADVANCED TOPICS 381 the average curve according to the predictors. So each of the i = 1, . . . , n trajectories (or functions) are observed at points t1 , . . . , tk in time, where k is large. In otherwords, we are trying to fit a regression surface for a collection of functions (i.e., we actually observe trajectories and not individual data points). Functional regression models do sound similar to random effects regression models, but differ in a few ways. In random effects regression models, we assume that each observation’s set of regression coefficients are random variables from some distribution. However, functional regression models do not make distributions on the regression coefficients and are treated as separate functions which are characterized by the densely sampled set of points over t1 , . . . , tk . Also, random effects regression models easily accommodate trajectories of varying dimensions, whereas this is not reflected in a functional regression model. Estimation of β(t) is beyond the scope of this discussion as it requires knowledge of Fourier series and more advance multivariate techniques. 
Furthermore, estimation of β(t) is intrinsically an infinite-dimensional problem. However, estimates found in the literature have been shown to possess desirable properties of an estimator. While there are also hypothesis tests available concerning these models, difficulties still exist with using these models for prediction. 24.4 Mediation Regression Consider the following research questions found in psychology: • Will changing social norms about science improve children’s achievement in scientific disciplines? • Can changes in cognitive attributions reduce depression? • Does trauma affect brain stem activation in a way that inhibits memory? Such questions suggest a chain of relations where a predictor variable affects another variable, which then affects the response variable. A mediation regression model attempts to identify and explicate the mechanism that underlies an observed relationship between an independent variable and a D. S. Young STAT 501 382 CHAPTER 24. ADVANCED TOPICS dependent variable, via the inclusion of a third explanatory variable called a mediator variable. Instead of modeling a direct, causal relationship between the independent and dependent variables, a mediation model hypothesizes that the independent variable causes the mediator variable which, in turn, causes the dependent variable. Mediation models are generally utilizes in the area of psychometrics, while other scientific disciplines (including statistics) have criticized the methodology. One such criticism is that sometimes the roles of the mediator variable and the dependent variable can be switched and yield a model which explains the data equally well, thus causing identifiability issues. The model we present simply has one independent variable, one dependent variable, and one mediator variable in the model. Models including more of any of these variables are possible to construct. The following three regression models are used in our discussion: 1. Y = β0 + β1 X + 2. Y = α0 + α1 X + α2 M + δ 3. M = θ0 + θ1 X + γ. The first model is the simple linear regression model we are familiar with. The is the relationship between X and Y that we typically wish to study (in causal analysis, this is written as X → Y ). The second and third models show how we incorporate the mediator variable into this framework so that X causes the mediator M and M causes Y (i.e., X → M → Y ). So α1 is the coefficient relating X to Y adjusted for M , α2 is the coefficient relating M to Y adjusted for X, θ1 is the coefficient relating X to M , and , δ, and γ are error terms for the three relationships. Figure 24.4 gives a diagram showing these relationships, sans the error terms. The mediated effect in the above models can be calculated in two ways - either as θ̂1 α̂2 or β̂1 − β̂2 . There are various methods for estimating these coefficients, including those based on ordinary least squares and maximum likelihood theory. To test for significance, the chosen quantity (i.e., either θ̂1 α̂2 or β̂1 − β̂2 ) is divided by the standard error and then the ratio is compared to a standard normal distribution. Thus, confidence intervals for the mediation effect are readily available by using the 100×(1−α/2)th -percentile of the standard normal distribution. Finally, the strength and form of mediated effects may depend on, yet, another variable. These variables, which affect the hypothesized relationship STAT 501 D. S. Young CHAPTER 24. 
ADVANCED TOPICS 383 Mediator Variable M α2 θ1 Independent Variable X α1 Dependent Variable Y Figure 24.4: Diagram showing the basic flow of a mediation regression model. amongst the variables already in our model, are called moderator variables and are often tested as an interaction effect. A significantly different from 0 XM interaction in the second equation above suggests that the α2 coefficient differs across different levels of X. These different coefficient levels may reflect mediation as a manipulation, thus altering the relationship between M and Y . The moderator variables may be either a manipulated factor in an experimental setting (e.g., dosage of medication) or a naturally occurring variable (e.g., gender). By examining moderator effects, one can investigate whether the experiment differentially affects subgroups of individuals. Three primary models involving moderator variables are typically studied: Moderated mediation: The simplest of the three, this model has a variable which mediates the effects of an independent variable on a dependent variable, and the mediated effect depends on the level of another variable (i.e., the moderator). Thus, the mediational mechanism differs for subgroups of the study. This model is more complex from an interpretative viewpoint when the moderator is continuous. Basically, you have either X → M and/or M → Y dependent on levels of another variable (call it Z). D. S. Young STAT 501 384 CHAPTER 24. ADVANCED TOPICS Mediated moderation: This occurs when a mediator is intermediate in the causal sequence from an interaction effect to a dependent variable. The purpose of this model is to determine the mediating variables that explain the interaction effect. Mediated baseline by treatment moderation: This model is a special case of the mediated moderation model. The basic interpretation of the mediated effect in this model is that the mediated effect depends on the baseline level of the mediator. This scenario is common in prevention and treatment research, where the effects of an intervention are often stronger for participants who are at higher risk on the mediating variable at the time they enter the program. 24.5 Meta-Regression Models In statistics, a meta-analysis combines the results of several studies that address a set of related research hypotheses. In its simplest form, this is normally by identification of a common measure of effect size, which is a descriptive statistic that quantifies the estimated magnitude of a relationship between variables without making any inherent assumption about if such a relationship in the sample reflects a true relationship for the population. In a meta-analysis, a weighted average might be used as the output. The weighting might be related to sample sizes within the individual studies. Typically, there are other differences between the studies that need to be allowed for, but the general aim of a meta-analysis is to more powerfully estimate the true “effect size” as opposed to a smaller “effect size” derived in a single study under a given set of assumptions and conditions. Meta-regressions are similar in essence to classic regressions, in which a response variable is predicted according to the values of one or more predictor variables. In meta-regression, the response variable is the effect estimate (for example, a mean difference, a risk difference, a log odds ratio or a log risk ratio). The predictor variables are characteristics of studies that might influence the size of intervention effect. 
These are often called potential effect modifiers or covariates. Meta-regressions usually differ from simple regressions in two ways. First, larger studies have more influence on the relationship than smaller studies, since studies are weighted by the precision of their respective effect estimate. Second, it is wise to allow for the residSTAT 501 D. S. Young CHAPTER 24. ADVANCED TOPICS 385 ual heterogeneity among intervention effects not modeled by the predictor variables. This gives rise to the random-effects meta-regression, which we discuss later. The regression coefficient obtained from a meta-regression analysis will describe how the response variable (the intervention effect) changes with a unit increase in the predictor variable (the potential effect modifier). The statistical significance of the regression coefficient is a test of whether there is a linear relationship between intervention effect and the predictor variable. If the intervention effect is a ratio measure, the log-transformed value of the intervention effect should always be used in the regression model, and the exponential of the regression coefficient will give an estimate of the relative change in intervention effect with a unit increase in the predictor variable. Generally, three types of meta-regression models are commonplace in the literature: Simple meta-regression: This model can be specified as: yi = β0 + β1 xi,1 + β2 xi,2 + . . . + βp−1 xi,p−1 + , where yi is the effect size in study i and β0 (i.e., the “intercept”) is the estimated overall effect size. The variables xi,j , for j = 1, . . . , (p − 1), specify different characteristics of the study and specifies the between study variation. Note that this model does not allow specification of within study variation. Fixed-effect meta-regression: This model assumes that the true effect size δθ is distributed as N (θ, σθ2 ), where σθ2 is the within study variance of the effect size. A fixed effect1 meta-regression model thus allows for within study variability, but no between study variability because all studies have the identical expected fixed effect size δθ ; i.e., = 0.This model can be specified as: yi = β0 + β1 xi,1 + β2 xi,2 + . . . + βp−1 xi,p−1 + ηi , where ση2i is the variance of the effect size in study i. Fixed effect metaregressions ignore between study variation. As a result, parameter estimates are biased if between study variation can not be ignored. Furthermore, generalizations to the population are not possible. 1 Note that for the “fixed-effect” model, no plural is used as only ONE true effect across all studies is assumed. D. S. Young STAT 501 386 CHAPTER 24. ADVANCED TOPICS Random effects meta-regression: This model rests on the assumption that θ in N (θ, σθ ) is a random variable following a hyper-distribution N (µθ , ςθ ). The model can be specified as: yi = β0 + β1 xi,1 + β2 xi,2 + . . . + βp−1 xi,p−1 + η + i , were σ2i is the variance of the effect size in study i. Between study variance ση2 is estimated using common estimation procedures for random effects models (such as restricted maximum likelihood (REML) estimators). 24.6 Bayesian Regression Bayesian inference is concerned with updating our model (which is based on previous beliefs) as a result of receiving incoming data. Bayesian inference is based on Bayes’ Theorem, which says for two events, A and B, P(A|B) = P(B|A)P(A) . 
24.6 Bayesian Regression

Bayesian inference is concerned with updating our model (which is based on previous beliefs) as a result of receiving incoming data. Bayesian inference is based on Bayes' Theorem, which says that for two events A and B,
P(A|B) = P(B|A)P(A) / P(B).
We update our model by treating the parameter(s) of interest as random variables and defining a distribution for the parameters based on previous beliefs (this distribution is called the prior distribution). This is multiplied by the likelihood function of our model and then divided by the marginal density function (the joint density function with the parameters integrated out). The result is called the posterior distribution. Luckily, the marginal density function is just a normalizing constant and does not usually have to be calculated in practice.

For multiple linear regression, the ordinary least squares estimate β̂ = (X^T X)^{−1} X^T y is constructed from the frequentist's view (along with the maximum likelihood estimate σ̂^2 of σ^2), in that we assume there are enough measurements of the predictors to say something meaningful about the response. In the Bayesian view, we assume we have only a small sample of the possible measurements, and we seek to correct our estimate by "borrowing" information from a larger set of similar observations.

The (conditional) likelihood is given as
ℓ(y|X, β, σ^2) = (2πσ^2)^{−n/2} exp{−(1/(2σ^2)) ||y − Xβ||^2}.
We seek a conjugate prior (a prior which yields a joint density of the same functional form as the likelihood). Since the likelihood is quadratic in β, we re-write the likelihood so it is normal in (β − β̂). Write
||y − Xβ||^2 = ||y − Xβ̂||^2 + (β − β̂)^T (X^T X)(β − β̂).
Now rewrite the likelihood as
ℓ(y|X, β, σ^2) ∝ (σ^2)^{−v/2} exp{−vs^2/(2σ^2)} (σ^2)^{−(n−v)/2} exp{−(1/(2σ^2)) ||X(β − β̂)||^2},
where vs^2 = ||y − Xβ̂||^2 and v = n − p, with p the number of parameters to estimate. This suggests a form for the priors:
π(β, σ^2) = π(σ^2) π(β|σ^2).
The prior distributions are characterized by hyperparameters, which are parameter values (often data-dependent) that the researcher specifies. The prior for σ^2, π(σ^2), is an inverse gamma distribution with shape hyperparameter α and scale hyperparameter γ. The prior for β, π(β|σ^2), is a multivariate normal distribution with location and dispersion hyperparameters β̄ and Σ. This yields the joint posterior distribution
f(β, σ^2|y, X) ∝ ℓ(y|X, β, σ^2) π(β|σ^2) π(σ^2)
             ∝ σ^{−(n+α)} exp{−(1/(2σ^2)) (s̃ + (β − β̃)^T (Σ^{−1} + X^T X)(β − β̃))},
where
β̃ = (Σ^{−1} + X^T X)^{−1} (Σ^{−1} β̄ + X^T X β̂),
s̃ = 2γ + σ̂^2(n − p) + (β̄ − β̃)^T Σ^{−1} β̄ + (β̂ − β̃)^T X^T X β̂.
Finally, it can be shown that the distribution of β|X, y is a multivariate t distribution with n + α − p − 1 degrees of freedom such that
E(β|X, y) = β̃   and   Cov(β|X, y) = s̃ (Σ^{−1} + X^T X)^{−1} / (n + α − p − 3).
Furthermore, the distribution of σ^2|X, y is an inverse gamma distribution with shape parameter n + α − p and scale parameter 0.5 σ̂^2 (n + α − p).

One can also construct statistical intervals based on draws simulated from a Bayesian posterior distribution. A 100 × (1 − α)% credible interval is constructed by taking the middle 100 × (1 − α)% of the values simulated from the parameter's posterior distribution. The interpretation of these intervals is that there is a 100 × (1 − α)% chance that the true population parameter lies in the credible interval that is constructed (which is how many people initially try to interpret confidence intervals).
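The posterior quantities above are simple matrix computations once the hyperparameters are chosen. A minimal sketch in base R that mirrors the formulas for β̃, s̃, and Cov(β|X, y) (the function and argument names are illustrative, not from the notes):

##########
# Hedged sketch: posterior mean and covariance of beta under the conjugate
# prior described above. X is the n x p design matrix (including an intercept
# column); beta.bar, Sigma, alpha, gamma are researcher-chosen hyperparameters.
bayes.lm <- function(y, X, beta.bar, Sigma, alpha, gamma) {
  n <- nrow(X); p <- ncol(X)
  XtX        <- crossprod(X)                       # X'X
  beta.hat   <- solve(XtX, crossprod(X, y))        # OLS estimate
  sigma2.hat <- sum((y - X %*% beta.hat)^2) / (n - p)
  Sigma.inv  <- solve(Sigma)
  A          <- Sigma.inv + XtX
  beta.tilde <- solve(A, Sigma.inv %*% beta.bar + XtX %*% beta.hat)
  s.tilde    <- 2 * gamma + sigma2.hat * (n - p) +
                t(beta.bar - beta.tilde) %*% Sigma.inv %*% beta.bar +
                t(beta.hat - beta.tilde) %*% XtX %*% beta.hat
  list(post.mean = beta.tilde,
       post.cov  = drop(s.tilde) * solve(A) / (n + alpha - p - 3))
}
##########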
24.7 Quantile Regression

The τth quantile of a random variable X is the value x such that P(X ≤ x) = τ. For example, if τ = 1/2, then the corresponding value of x is the median. This concept of quantiles can also be extended to the regression setting. In quantile regression, we have a data set of size n with response y and predictors x = (x_1, . . . , x_{p−1})^T, and we seek the minimizer of the criterion
β̂_τ = argmin_β Σ_{i=1}^{n} ρ_τ(y_i − µ(x_i^T β)),
where µ(·) is some parametric function and ρ_τ(·) is called the check function, defined as
ρ_τ(x) = τx − xI{x < 0} = x(τ − I{x < 0}).
For linear regression, µ(x_i^T β) = x_i^T β. We actually encountered quantile regression earlier: least absolute deviations regression is just the case of quantile regression with τ = 1/2.

Figure 24.5 gives a plot relating food expenditure to a family's monthly household income. Overlaid on the plot is a dashed red line which gives the ordinary least squares fit. The solid blue line is the least absolute deviations fit (i.e., τ = 0.50). The gray lines (from bottom to top) are the quantile regression fits for τ = 0.05, 0.10, 0.25, 0.75, 0.90, and 0.95, respectively.

[Figure 24.5: Various quantile regression fits for the food expenditure data set (food expenditure versus household income, with the mean (LSE) fit and the median (LAE) fit marked).]

Essentially, this says that the households with the highest food expenditures are described by larger regression coefficients (such as those of the τ = 0.95 regression quantile), while those with the lowest food expenditures are described by smaller regression coefficients (such as those of the τ = 0.05 regression quantile). The estimates for each of these quantile regressions are as follows:

##########
Coefficients:
              tau= 0.05   tau= 0.10   tau= 0.25
(Intercept) 124.8800408 110.1415742  95.4835396
x             0.3433611   0.4017658   0.4741032
              tau= 0.75   tau= 0.90   tau= 0.95
(Intercept)  62.3965855  67.3508721  64.1039632
x             0.6440141   0.6862995   0.7090685

Degrees of freedom: 235 total; 233 residual
##########

Estimation for quantile regression can be done through linear programming or other optimization procedures. Furthermore, statistical intervals can also be computed.
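Fits like those in Figure 24.5 can be produced with the quantreg package. A hedged sketch using the engel food expenditure data shipped with that package (a comparable example, though not necessarily the exact data plotted above):

##########
# Hedged sketch: several regression quantiles fit with quantreg. The 'engel'
# data (household income and food expenditure) ship with the package.
library(quantreg)
data(engel)

taus <- c(0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95)
fits <- rq(foodexp ~ income, tau = taus, data = engel)
coef(fits)   # one column of (intercept, slope) estimates per quantile level tau
##########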
24.8 Monotone Regression

Suppose we have a set of data (x_1, y_1), . . . , (x_n, y_n). For ease of notation, let us assume there is already an ordering on the predictor variable; specifically, we assume that x_1 ≤ . . . ≤ x_n. Monotonic regression is a technique where we attempt to find a weighted least squares fit of the responses y_1, . . . , y_n to a set of scalars a_1, . . . , a_n with corresponding weights w_1, . . . , w_n, subject to monotonicity constraints giving a simple or partial ordering of the fitted values. In other words, the fitted values are supposed to increase (or decrease) as the predictor increases, and the regression line we fit is piecewise constant (resembling a step function). The weighted least squares problem for monotonic regression is given by the following quadratic program:
argmin_a Σ_{i=1}^{n} w_i (y_i − a_i)^2,
subject to one of two possible constraints. If the direction of the trend is to be monotonically increasing, then the procedure is called isotonic regression and the constraint is a_i ≥ a_j for all i > j. If the direction of the trend is to be monotonically decreasing, then the procedure is called antitonic regression and the constraint is a_i ≤ a_j for all i > j. More generally, one can also perform monotonic regression under an L_p criterion for p > 0:
argmin_a Σ_{i=1}^{n} w_i |y_i − a_i|^p,
with the appropriate constraints imposed for isotonic or antitonic regression.

Monotonic regression does have its place in statistical inference. For example, astronomy data sets may contain gamma-ray burst flux measurements taken over time. On the log scale, one can identify an area of "flaring," which is a region where the flux measurements increase. Such a region could be fit using an isotonic regression.

An example of an isotonic regression fitted to a made-up data set is given in Figure 24.6. The top plot gives the actual isotonic regression fit; the horizontal segments represent the values of the scalars minimizing the weighted least squares problem given above. The bottom plot shows the cumulative sums of the responses plotted against the predictors. The piecewise line plotted there is called the convex minorant. Each predictor value at which the convex minorant touches the cumulative sum is also a predictor value at which the fitted level changes in the isotonic regression plot.

[Figure 24.6: An example of an isotonic regression fit. The top panel ("Isotonic Regression") shows the fit; the bottom panel ("Cumulative Data and Convex Minorant") shows the cumulative sums with the convex minorant.]
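Base R provides isoreg() for the (equal-weight) isotonic case; an antitonic fit can be obtained by applying it to the negated responses. A minimal sketch on made-up data (not the data behind Figure 24.6):

##########
# Hedged sketch: isotonic regression with base R's isoreg() (equal weights).
# The data are made up and are not those plotted in Figure 24.6.
set.seed(1)
x <- 1:10
y <- x / 2 + rnorm(10, sd = 1)

fit <- isoreg(x, y)
fit$yf       # the fitted scalars a_1 <= ... <= a_n (piecewise constant)
plot(fit)    # step-like fit in the spirit of the top panel of Figure 24.6
##########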
24.9 Spatial Regression

Suppose an econometrician is trying to quantify the price of a house. In doing so, he will surely need to incorporate neighborhood effects (e.g., how much is the house across the street valued at, as well as the one next door?). The house prices in an adjacent neighborhood may also have an impact on the price, but house prices in an adjacent county likely will not. The framework for such modeling is likely to incorporate some sort of spatial effect, as houses nearest the home of interest are likely to have a greater impact on its price, while homes further away will have a smaller or negligible impact.

Spatial regression deals with the specification, estimation, and diagnostic analysis of regression models which incorporate spatial effects. Two broad classes of spatial effects are often distinguished: spatial heterogeneity and spatial dependency. We will provide a brief overview of both types of effects, but it should be noted that we will only skim the surface of what is a very rich area.

A spatial regression model reflecting spatial heterogeneity is written locally as
Y = Xβ(g) + ε,
where g indicates that the regression coefficients are to be estimated locally at the coordinates specified by g, and ε is an error term with mean 0 and variance σ^2. This model is called geographically weighted regression, or GWR. The estimate of β(g) is found using a weighting scheme such that
β̂(g) = (X^T W(g) X)^{−1} X^T W(g) Y.
The weights in the geographic weighting matrix W(g) are chosen such that observations near the point in space where the parameter estimates are desired have more influence on the result than observations further away. This model is essentially a local regression model like the one discussed in the section on LOESS. While the choice of a geographic (or spatial) weighting matrix is a blend of art and science, one commonly used weight is the Gaussian weight function, where the diagonal entries of the n × n matrix W(g) are
w_i(g) = exp{−(d_i/h)^2},
where d_i is the Euclidean distance between observation i and location g, and h is the bandwidth. The resulting parameter estimates or standard errors for the spatial heterogeneity model may be mapped in order to examine local variations in the parameter estimates. Hypothesis tests are also possible for this model.

Spatial regression models also accommodate spatial dependency in two major ways: through a spatial lag dependency (where the spatial correlation occurs in the dependent variable) or a spatial error dependency (where the spatial correlation occurs through the error term). A spatial lag model is a spatial regression model which models the response as a function not only of the predictors, but also of values of the response observed at other (likely neighboring) locations:
y_i = f(y_{j(i)}; θ) + x_i^T β + ε_i,
where j(i) is an index over all of the neighboring locations j of i such that i ≠ j. The function f can be very general, but typically it is simplified by using a spatially weighted matrix (as introduced earlier). Assuming a spatially weighted matrix W(g) with row-standardized spatial weights (i.e., Σ_{j=1}^{n} w_{i,j} = 1), we obtain a mixed regressive, spatial autoregressive model:
y_i = ρ Σ_{j=1}^{n} w_{i,j} y_j + x_i^T β + ε_i,
where ρ is the spatial autoregressive coefficient. In matrix notation, we have
Y = ρW(g)Y + Xβ + ε.
The proper solution of the equation for all observations requires (after some matrix algebra)
Y = (I_{n×n} − ρW)^{−1} Xβ + (I_{n×n} − ρW)^{−1} ε
to be solved simultaneously for β and ρ. The inclusion of a spatial lag is similar to a time series model, although with a fundamental difference. Unlike time dependency, spatial dependency is multidirectional, implying feedback effects and simultaneity. More precisely, if i and j are neighboring locations, then y_j enters on the right-hand side of the equation for y_i, but y_i also enters on the right-hand side of the equation for y_j.

In a spatial error model, the spatial autocorrelation does not enter as an additional variable in the model, but rather enters only through its effect on the covariance structure of the random disturbance term. In other words, Var(ε) = Σ such that the off-diagonals of Σ are not 0. One common way to model the error structure is through direct representation, which is similar to the weighting scheme used in GWR. In this setting, the off-diagonals of Σ are given by
σ_{i,j} = σ^2 g(d_{i,j}, φ),
where again d_{i,j} is the Euclidean distance between locations i and j, and φ is a vector of parameters which may include a bandwidth parameter. Another way to model the error structure is through a spatial process, such as specifying the error terms to have a spatial autoregressive structure as in the spatial lag model above:
ε = λW(g)ε + u,
where u is a vector of random error terms. A small simulation sketch of this error structure is given below.
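##########
# Hedged sketch: simulating a spatial error model with autoregressive errors
# e = lambda * W e + u, i.e., e = (I - lambda W)^{-1} u, on a toy set of
# locations along a line. All numbers are made up and are not from the notes.
set.seed(1)
n <- 25
coords <- cbind(1:n, 0)                    # locations along a line
d <- as.matrix(dist(coords))               # pairwise Euclidean distances
W <- (d > 0 & d <= 1) * 1                  # neighbors: locations within distance 1
W <- W / rowSums(W)                        # row-standardized spatial weights

lambda <- 0.6
u <- rnorm(n, sd = 0.5)                    # independent disturbances
e <- solve(diag(n) - lambda * W, u)        # spatially correlated errors

X <- cbind(1, runif(n))
beta <- c(2, 1)
y <- X %*% beta + e                        # responses with spatial error dependency
##########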
Other spatial processes exist, such as a conditional autoregressive process and a spatial moving average process, both of which resemble analogous time series processes.

Estimation of these spatial regression models can be accomplished through various techniques, but the techniques differ depending on whether you have a spatial lag dependency or a spatial error dependency. Such estimation methods include maximum likelihood estimation, the use of instrumental variables, and semiparametric methods. There are also tests for the spatial autocorrelation coefficient, of which the most notable uses Moran's I statistic. Moran's I statistic is calculated as
I = (e^T W(g) e / S_0) / (e^T e / n),
where e is a vector of ordinary least squares residuals, W(g) is a geographic weighting matrix, and S_0 = Σ_{i=1}^{n} Σ_{j=1}^{n} w_{i,j} is a normalizing factor. Moran's I test can then be based on a normal approximation using a standardized I statistic, with
E(I) = tr(MW)/(n − p)
and
Var(I) = [tr(MWMW^T) + tr(MWMW) + (tr(MW))^2] / [(n − p)(n − p + 2)],
where M = I_{n×n} − X(X^T X)^{−1} X^T.

As an example, let us consider 1978 house prices in Boston, to which we will try to fit a spatial regression model with spatial error dependency. There are 20 variables measured at 506 locations. Certain transformations of the predictors have already been performed, in line with the investigator's claims, and only 13 of the predictors are of interest. First, a test of the spatial autocorrelation coefficient is performed:

##########
Global Moran's I for regression residuals

data:
model: lm(formula = log(MEDV) ~ CRIM + ZN + INDUS + CHAS + I(NOX^2) +
    I(RM^2) + AGE + log(DIS) + log(RAD) + TAX + PTRATIO + B +
    log(LSTAT), data = boston.c)
weights: boston.listw

Moran I statistic standard deviate = 14.5085, p-value < 2.2e-16
alternative hypothesis: two.sided
sample estimates:
Observed Moran's I        Expectation           Variance
      0.4364296993      -0.0168870829       0.0009762383
##########

As can be seen, the p-value is very small, and so the spatial autocorrelation coefficient is significant. Next, we attempt to fit a spatial regression model with spatial error dependency, including those variables that the investigator specified:

##########
Call: errorsarlm(formula = log(MEDV) ~ CRIM + ZN + INDUS + CHAS + I(NOX^2) +
    I(RM^2) + AGE + log(DIS) + log(RAD) + TAX + PTRATIO + B +
    log(LSTAT), data = boston.c, listw = boston.listw)

Residuals:
       Min         1Q     Median         3Q        Max
-0.6476342 -0.0676007  0.0011091  0.0776939  0.6491629

Type: error
Coefficients: (asymptotic standard errors)
               Estimate  Std. Error   z value  Pr(>|z|)
(Intercept)  3.85706025  0.16083867   23.9809 < 2.2e-16
CRIM        -0.00545832  0.00097262   -5.6120 2.000e-08
ZN           0.00049195  0.00051835    0.9491 0.3425907
INDUS        0.00019244  0.00282240    0.0682 0.9456389
CHAS1       -0.03303428  0.02836929   -1.1644 0.2442466
I(NOX^2)    -0.23369337  0.16219194   -1.4408 0.1496286
I(RM^2)      0.00800078  0.00106472    7.5145 5.707e-14
AGE         -0.00090974  0.00050116   -1.8153 0.0694827
log(DIS)    -0.10889420  0.04783714   -2.2764 0.0228249
log(RAD)     0.07025730  0.02108181    3.3326 0.0008604
TAX         -0.00049870  0.00012072   -4.1311 3.611e-05
PTRATIO     -0.01907770  0.00564160   -3.3816 0.0007206
B            0.00057442  0.00011101    5.1744 2.286e-07
log(LSTAT)  -0.27212781  0.02323159  -11.7137 < 2.2e-16

Lambda: 0.70175 LR test value: 211.88 p-value: < 2.22e-16
Asymptotic standard error: 0.032698
    z-value: 21.461 p-value: < 2.22e-16
Wald statistic: 460.59 p-value: < 2.22e-16

Log likelihood: 255.8946 for error model
ML residual variance (sigma squared): 0.018098, (sigma: 0.13453)
Number of observations: 506
Number of parameters estimated: 16
AIC: -479.79, (AIC for lm: -269.91)
##########

As can be seen, there are some predictors that do not appear to be significant. Model selection procedures can be employed, or other transformations can be tried, in order to improve the fit of this model.

24.10 Circular Regression

A circular random variable is one which takes values on the circumference of a circle (i.e., the angle is in the range (0, 2π) radians or (0°, 360°)). A circular-circular regression is used to determine the relationship between a circular predictor variable X and a circular response variable Y. Circular data occur when there is periodicity to the phenomenon at hand or where there are naturally angular measurements. An example could be determining the relationship between wind direction measurements taken on an aircraft (the response) and wind direction measurements taken by radar (the predictor). A related model is one where only the response is a circular variable while the predictor is linear; this is called a circular-linear regression. Both types of circular regression models can be written as
y_i = β_0 + β_1 x_i + ε_i (mod 2π).
The expression ε_i (mod 2π) is read as "ε_i modulus 2π" and denotes the remainder after dividing ε_i by 2π. (For example, 11 (mod 7) = 4 because 11 divided by 7 leaves a remainder of 4.) In this model, ε_i is a circular random error assumed to follow a von Mises distribution with circular mean 0 and concentration parameter κ. The von Mises distribution is the circular analogue of the univariate normal distribution, but has a more "complex" form. The von Mises distribution with circular mean µ and concentration parameter κ is defined on the range x ∈ [0, 2π), with probability density function
f(x) = e^{κ cos(x − µ)} / (2π I_0(κ))
and cumulative distribution function
F(x) = (1 / (2π I_0(κ))) [ x I_0(κ) + 2 Σ_{j=1}^{∞} I_j(κ) sin(j(x − µ)) / j ].
In the above, I_p(·) is called a modified Bessel function of the first kind of order p. This Bessel function is the contour integral
I_p(z) = (1/(2πi)) ∮ e^{(z/2)(t − 1/t)} t^{−(p+1)} dt,
where the contour encloses the origin and is traversed in a counterclockwise direction in the complex plane, with i = √(−1).

Maximum likelihood estimates can be obtained for the circular regression models (with minor differences in the details when dealing with a circular predictor or a linear predictor). Needless to say, such formulas do not lend themselves well to closed-form solutions. Thus we turn to numerical methods, which go beyond the scope of this course.
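The density above is straightforward to evaluate in R, since the modified Bessel function I_0(κ) is available as besselI(). A minimal sketch (not from the notes):

##########
# Hedged sketch: the von Mises density written directly from the formula above,
# using base R's besselI(kappa, nu = 0) for the modified Bessel function I_0(kappa).
dvonmises <- function(x, mu, kappa) {
  exp(kappa * cos(x - mu)) / (2 * pi * besselI(kappa, nu = 0))
}

theta <- seq(0, 2 * pi, length.out = 200)
plot(theta, dvonmises(theta, mu = 0, kappa = 1.9), type = "l",
     xlab = "angle (radians)", ylab = "density")
##########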
As an example, suppose we have a data set of size n = 100 where Y is a circular response and X is a continuous predictor (so a circular-linear regression model will be built). The error terms are assumed to follow a von Mises distribution with circular mean 0 and concentration parameter κ (for this generated data, κ = 1.9). The error terms used in the generation of these data are plotted on a circular histogram in Figure 24.7(a). Estimates for the circular-linear regression fit are given below:

##########
Circular-Linear Regression

Coefficients:
     Estimate Std. Error t value Pr(>|t|)
[1,]   6.7875     1.1271   6.022 8.61e-10 ***
[2,]   0.9618     0.2223   4.326 7.58e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: 55.89

Summary: (mu in radians)
mu: 0.4535 (0.08698)
kappa: 1.954 (0.2421)

p-values are approximated using normal distribution
##########

[Figure 24.7: (a) Plot of the von Mises error terms used in the generation of the sample data. (b) Plot of the continuous predictor (X) versus the circular response (Y) along with the circular-linear regression fit.]

Notice that the maximum likelihood estimates of µ and κ are 0.4535 and 1.954, respectively. Both estimates are close to the values used for the generation of the error terms. Furthermore, the values in parentheses next to these estimates are their standard errors, both of which are relatively small. A rough way of looking at the data and the estimated circular-linear regression equation is given in Figure 24.7(b). Such a display is difficult to construct since we are trying to look at a circular response versus a continuous predictor; packages specific to circular regression modeling provide better graphical alternatives.
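One such package is circular, which provides a function lm.circular() with a type = "c-l" option for circular-linear fits. A hedged sketch follows; the exact argument list (in particular the required initial values) is an assumption here and should be checked against the package documentation. The vectors y and x stand for the response and predictor from the example above.

##########
# Hedged sketch: a circular-linear regression fit with the 'circular' package.
# The argument names below (especially 'init') are assumptions; consult
# ?lm.circular for the exact interface before relying on this call.
library(circular)

y.circ <- circular(y)                                  # declare the response as circular (radians)
fit <- lm.circular(y = y.circ, x = x, init = 0, type = "c-l")
fit                                                    # prints output similar to that shown above
##########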
24.11 Mixtures of Regressions

Consider a large data set consisting of the heights of males and females. When looking at the distribution of these data, the values for the males will (on average) be higher than those for the females, and a histogram of the data would clearly show two distinct bumps or modes. Knowing the gender label of each subject would allow one to account for that subgroup in the analysis being used. However, what happens if the gender label of each subject were lost? In other words, we do not know which observation belongs to which gender. The setting where data appear to come from multiple subgroups, but there is no label providing such identification, is the focus of the area called mixture modeling.

[Figure 24.8: (a) Plot of spark-ignition engine fuel data with equivalence ratio as the response and the measure of nitrogen oxide emissions. (b) Plot of the same data with EM algorithm estimates from a 2-component mixture of regressions fit.]

There are many issues one should be cognizant of when building a mixture model. In particular, maximum likelihood estimation can be quite complex since the likelihood does not yield closed-form solutions and there are identifiability issues (however, the use of a Newton-Raphson or EM algorithm usually provides a good solution). One alternative is to use a Bayesian approach with Markov chain Monte Carlo (MCMC) methods, but this too has its own set of complexities. While we do not explore these issues, we do see how a mixture model can occur in the regression setting.

A mixture of linear regressions model can be used when it appears that there is more than one regression line that could fit the data due to some underlying characteristic (i.e., a latent variable). Suppose we have n observations, each belonging to one of k groups. If we knew to which group an observation belonged (i.e., its label), then we could write down explicitly the linear regression model given that observation i belongs to group j:
y_i = X_i^T β_j + ε_{ij},
such that ε_{ij} is normally distributed with mean 0 and variance σ_j^2. Notice how the regression coefficients and variance terms differ across groups. Now assume instead that the labels are unobserved. In this case, we can only assign a probability that observation i came from group j. Specifically, the density function for the mixture of linear regressions model is
f(y_i) = Σ_{j=1}^{k} λ_j (2πσ_j^2)^{−1/2} exp{−(1/(2σ_j^2)) (y_i − X_i^T β_j)^2},
such that Σ_{j=1}^{k} λ_j = 1. Estimation is done by using the likelihood (or rather the log-likelihood) function based on the above density. For maximum likelihood, one typically uses an EM algorithm.

As an example, consider the data set which gives the equivalence ratios and peak nitrogen oxide emissions in a study using pure ethanol as a spark-ignition engine fuel. A plot of the equivalence ratios versus the measure of nitrogen oxide is given in Figure 24.8(a). Suppose one wanted to predict the equivalence ratio from the amount of nitrogen oxide emissions. As you can see, there appear to be groups of data for which separate regressions appear appropriate (one with a positive trend and one with a negative trend). Figure 24.8(b) gives the same plot, but with estimates from an EM algorithm overlaid. The EM algorithm estimates for these data are β_1 = (0.565, 0.085)^T, β_2 = (1.247, −0.083)^T, σ_1^2 = 0.00188, and σ_2^2 = 0.00058.

It should be noted that mixtures of regressions appear in many areas. For example, in economics the model is called switching regimes; in the social sciences it is called latent class regression. As we saw earlier, the neural networking terminology calls this model (without the hierarchical structure) the mixture-of-experts problem.
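Fits like the one shown in Figure 24.8(b) can be obtained with the mixtools package, which implements an EM algorithm for mixtures of linear regressions. A hedged sketch (the data frame dat, with columns NO for the nitrogen oxide measure and equiv for the equivalence ratio, is a stand-in for the data described above):

##########
# Hedged sketch: 2-component mixture of linear regressions fit by EM using the
# mixtools package. 'dat' is a stand-in data frame with columns 'NO' and 'equiv';
# it is not literally the data set used in the notes.
library(mixtools)

fit <- regmixEM(y = dat$equiv, x = dat$NO, k = 2)
fit$beta      # one column of (intercept, slope) estimates per component
fit$sigma     # component error standard deviations
fit$lambda    # estimated mixing proportions
##########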