WEIGHTED RIDGE ESTIMATION IN A COLLECTIVE BARGAINING CONTEXT

L. Marsh, University of Notre Dame
T. Ghilarducci, University of Notre Dame

Abstract

The purpose of this paper is both substantive and methodological. The substantive purpose is to better understand the 1981 UMWA coal strike in terms of the factors that influenced the strike vote. The methodological purpose is to demonstrate the use of ridge regression and principal components regression (PROC RIDGEREG) in evaluating the stability, and, therefore, the reliability of ordinary least squares regression estimates.

Introduction

Previous collective bargaining literature in economics sheds little light on the 1981 coal strike because most of that literature focuses on the union-management conflict and ignores internal rank-and-file dissent. The 1981 strike occurred after UMWA President Sam Church had reached an agreement with the BCOA (Bituminous Coal Operators Association) that the Union leadership officially endorsed. However, that contract was rejected by a two-to-one majority of UMWA members. Thus much of the conflict was between the Union's leadership and its rank-and-file miners. This dissatisfaction surfaced again the following year when Sam Church lost his reelection bid to Richard Trumka, a lawyer and former coal miner, by another two-to-one majority in favor of Trumka. Thus, internal union politics may have played a role in the earlier contract ratification vote.

The regression analysis model used in PROC REG, PROC MATRIX and PROC RIDGEREG to explain the contract ratification vote is derived from microeconomic theory. Each miner's behavior is explained in terms of his or her utility function. The level of a miner's satisfaction is represented as a function of such abstract concepts as expected compensation, good working conditions, leisure, group acceptance and self-determination. These general concepts are later replaced with specific variables that serve as proxies for some of these ideas. In any case, to be operational, the utility function must be expressed as some sort of utility index. In this case a utility index was chosen such that the maximization of this index subject to a binary strike vote choice constraint leads directly to a logistic regression model for the individual miner.

The problem is that individual miners are not about to reveal how they voted. The lowest level at which votes are collected and counted is the union local level. There are about 800 locals in the UMWA. Some of these are retirement locals, anthracite locals or other locals not participating in the contract ratification vote and, therefore, not used in this analysis. Since the regression analysis is based on the characteristics of the particular coal mine(s) represented by each local, trucking locals and other locals not associated with a particular coal mine also had to be dropped. This left about 650 UMWA locals to be used in the regression analysis.

For each union local we have the number of yes votes in favor of contract ratification and the number of no votes. We are interested in determining what factors influence the true population probability of a miner voting yes in a particular local. Since we only have information that relates to all miners in that local as a group, we must make the operational assumption that the above probability is the same for each miner in that particular local. In other words, we assume that any personal differences between individual miners will average or aggregate out within each local. Thus, these differences are left as part of the unexplained variation in the error term. Since we don't know the true population probability of a miner voting yes in a particular local, we must estimate it as the observed proportion of yes votes for that local. Substituting this into the logistic model and log-linearizing it, we obtain a log-linear regression model with the dependent variable expressed as the logarithm of the yes-no odds ratio.

One problem brought about by the aggregation process from individual miner to union local described above is the introduction of heteroskedasticity into the error term. This heteroskedasticity is caused by two factors: differences in the number of miners in each local and differences in the probability of voting yes from local to local. Each miner's vote has a Bernoulli distribution since it represents a single binary outcome. Within each local the total yes votes have a binomial distribution. If X is the number of yes votes at a given local, then var X = npq where n is the number of miners voting, p is the probability of voting yes and q is the probability of voting no. Since n, p and q differ from local to local, the variances must also differ from local to local and represent a problem of heteroskedasticity in the error term. Consequently, it is necessary to correct for heteroskedasticity in the error term by reweighting the observations before using PROC REG or PROC RIDGEREG, or by using generalized least squares in PROC MATRIX.

The dependent variable in the contract ratification regression, LOGYESNO, represents the logarithm of the odds ratio of yes to no votes for contract ratification. The explanatory variables are selected as proxies for the concepts used in the original utility function specification. These concepts are expected compensation, good working conditions, leisure, group acceptance and self-determination.

The expected compensation would ordinarily include wage rate and fringe benefits adjusted for inflation as well as some measure of length of expected employment. However, this union locals data set represents cross-sectional units facing exactly the same contract with a single schedule of wages and benefits. Thus, there are no differences in the wages and benefits offered to the different locals.
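The construction of the dependent variable and the reweighting for heteroskedasticity described above can be sketched in a few lines. The paper works in SAS and does not list its exact weights, so the following Python sketch rests on assumptions: the vote counts are hypothetical, the function name `log_odds_and_weight` is illustrative, and the weight n*p*q is the usual inverse of the delta-method variance of the log odds, which may differ from the weights the authors actually used.

```python
import math


def log_odds_and_weight(yes, no):
    """Return (log of the yes/no odds, approximate WLS weight) for one local."""
    n = yes + no
    p = yes / n              # observed proportion of yes votes
    q = 1.0 - p              # observed proportion of no votes
    log_odds = math.log(yes / no)
    # Assumption: delta-method approximation var(log(p/q)) ~ 1 / (n*p*q),
    # so the inverse-variance weight for the local is n*p*q.
    weight = n * p * q
    return log_odds, weight


# Hypothetical (yes, no) vote counts for three locals; not data from the paper.
locals_votes = [(30, 70), (120, 180), (45, 15)]
results = [log_odds_and_weight(yes, no) for yes, no in locals_votes]

for (yes, no), (log_odds, weight) in zip(locals_votes, results):
    print(f"yes={yes:4d} no={no:4d} log_odds={log_odds:+.4f} weight={weight:.2f}")
```

Locals with no yes votes or no no votes would make the log odds undefined, so a sketch like this would need an explicit rule for such cases before being applied to real data.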
Thus, the only real differences between the locals in real expected compensation lie in the rate of change in the consumer price index, IRATE, representing the rate of inflation in the nearest Standard Metropolitan Statistical Area (SMSA), and in differences in the unemployment rates, UNEMPLOY, in the counties represented by the various locals. One would expect the IRATE variable to have a negative impact on LOGYESNO as long as the substitution effect of price increases is not overpowered by the short-run income effect of having to pay the mortgage, car payments and other fixed (at least in the short run) expenses. In other words, the IRATE variable may have a positive effect on LOGYESNO if rising prices cause real income (and real savings) to fall to such an extent that workers cannot afford a lengthy strike, especially if that strike is not likely to result in a significant increase in the wage rate.

The UNEMPLOY variable may generate a "threat effect" where high unemployment suggests little likelihood of finding alternative employment or even temporary employment during a strike. Thus, this tendency to hang on to the job you've got during periods of high unemployment can be expected to generate a positive coefficient for the UNEMPLOY variable. However, this conclusion may be dependent upon relatively competitive labor market conditions. An imperfectly competitive market could result in special conditions approximating the bilateral monopoly case. In the bilateral monopoly situation alternative jobs are not readily available but neither are alternative workers. If the Union can keep the company's mines shut down, then the employer may be forced to offer a significantly higher wage rate.
Thus if the firm is employing just enough labor to equate its marginal factor cost with the value of its marginal product of labor but is reading the wage rate off of the labor supply curve, the firm may be able to afford a substantial increase in the wage rate without any reduction in its usage of labor and still adequately cover its labor costs. This means that high unemployment may not only fail to deter a strike but might actually indicate a situation where the potential benefit of striking is greater. Consequently, such a bilateral monopoly situation could be expected to produce a negative sign for the UNEMPLOY variable in explaining contract ratification.

Good working conditions in each mine covered by a given local are represented directly by a proxy productivity variable, TONHOUR, which is tons of coal per miner hour, and inversely by INJHOUR, which is number of injuries per miner hour. Higher productivity represented by larger values of the TONHOUR variable may reflect superior geological conditions, better equipment and/or better employee relations. Mines with low morale are unlikely to be very productive or very safe. Given the complexity of modern coal mines it is unlikely that productivity can be increased by simply pushing the miners harder. Miners can be somewhat independent and resistant to attempts at involuntary speedup in any case. Thus TONHOUR can probably be expected to have a positive coefficient. However, this could be negated if miners viewed high productivity as an indication of greater profits and, therefore, a greater ability to pay higher wages without being forced to lay off workers.

The INJHOUR variable can be expected to reflect poorer working conditions in general, possibly poorer employee-management relations and certainly lower employee morale. It is difficult to find anything positive about high injury rates. Consequently, the INJHOUR variable can be expected to have a negative coefficient indicating that high injury rates may be associated with greater worker dissatisfaction and, therefore, less willingness to ratify the contract.

HOURSLAB serves as a measure of leisure foregone and, conversely, as a proxy for expected compensation as long as the wage rate distribution is roughly the same in each of the mines represented by UMWA locals. This leads to some ambiguity concerning the sign of the HOURSLAB variable since it represents the classic tradeoff between leisure and income.

The socio-political variables are TOTVOTE and TRUMKA. The TOTVOTE variable is the total number of miners voting at each local. This variable represents both the size of the unionized workforce at the local mine and the intensity of the desire to vote. Mines with a smaller workforce may work more closely with management and be less alienated than workers at larger mines. Also, angry, militant miners are more likely to vote than passive, complacent ones. Both of these factors would tend to suggest a negative relationship between the size of the TOTVOTE variable and LOGYESNO. The TRUMKA variable is the percentage of miners who voted for Richard Trumka in each local in the 1982 UMWA presidential elections. To the extent that votes against the 1981 contract reflected dissatisfaction with the leadership of President Sam Church, the percentage for Trumka in 1982 should be inversely related to the 1981 contract ratification vote.

Analysis of Ordinary Regression Results

Table 1 provides the weighted least squares estimated coefficients, t-statistics and corresponding probabilities for the set of explanatory variables discussed above. Note that all variables have been transformed to a mean of zero and variance of one such that the X'X matrix became the correlation matrix for this set of explanatory variables. In Table 1 only the IRATE and TOTVOTE coefficients are clearly significant while the TRUMKA coefficient is significant at the 5% level but not at the 1% level. The TONHOUR, INJHOUR, HOURSLAB and UNEMPLOY coefficients are not statistically different from zero and, therefore, are either unimportant in explaining the contract ratification vote or their importance is disguised by a high level of multicollinearity that unduly inflates the estimated coefficient variances.

Table 1.

Dependent Variable: LOGYESNO
R2 = .7042    Adjusted R2 = .6943    F = 71.3

variable    coefficient    t-stat    prob-value
TONHOUR        .03157        .82       .4136
INJHOUR        .01790        .50       .6529
HOURSLAB       .06963        .64       .5206
UNEMPLOY      -.02357       -.37       .7124
IRATE          .89686       6.32       .0001
TOTVOTE       -.56401      -7.99       .0001
TRUMKA        -.15732      -2.28       .0230

One standard procedure for checking for the possibility of multicollinearity is to drop one explanatory variable at a time from the original regression model and see if the R2, adjusted R2 and F-statistic change much. Table 2 displays these statistics corresponding to the deletion of the variable specified.

Table 2.

Variable Deleted    R2        Adjusted R2    F-Value
TONHOUR             .7006     .6911          73.700*
INJHOUR             .7041*    .6947*         74.950*
HOURSLAB            .6876     .6777          69.341
UNEMPLOY            .6957     .6861          72.023*
IRATE               .6455     .6342          57.354
TOTVOTE             .6109     .5985          49.449
TRUMKA              .6796     .6694          66.811

The INJHOUR variable is so weak that its deletion from the original model does not result in much of a reduction in R2 and actually causes the adjusted R2 and the F-value to increase. Deletion of the TONHOUR and UNEMPLOY variables also brings about an increase in the F-statistic but causes a fall in the R2 and adjusted R2 values.
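The reason a deleted variable can lower R2 yet raise the adjusted R2 and the F-value follows from the degrees-of-freedom terms in the standard formulas. The Python sketch below illustrates this; the sample size `N_OBS` is hypothetical (the paper does not state the exact number of observations used), the reduced-model R2 is an illustrative value just below the full-model R2, and the formulas are the textbook ones for a model with an intercept, which may not match the weighted SAS output exactly.

```python
def adjusted_r2(r2, n_obs, n_vars):
    """Textbook adjusted R2 for a model with an intercept and n_vars regressors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_vars - 1)


def f_statistic(r2, n_obs, n_vars):
    """Overall F-statistic computed from R2 for the same kind of model."""
    return (r2 / n_vars) / ((1.0 - r2) / (n_obs - n_vars - 1))


N_OBS = 217  # hypothetical sample size, chosen only for illustration

# Full model: seven regressors. Reduced model: one very weak regressor deleted,
# so R2 falls only slightly (illustrative value).
r2_full, p_full = 0.7042, 7
r2_reduced, p_reduced = 0.7041, 6

adj_full = adjusted_r2(r2_full, N_OBS, p_full)
adj_reduced = adjusted_r2(r2_reduced, N_OBS, p_reduced)
f_full = f_statistic(r2_full, N_OBS, p_full)
f_reduced = f_statistic(r2_reduced, N_OBS, p_reduced)

print(f"full:    R2={r2_full:.4f} adj={adj_full:.4f} F={f_full:.2f}")
print(f"reduced: R2={r2_reduced:.4f} adj={adj_reduced:.4f} F={f_reduced:.2f}")
```

Because the reduced model spends one fewer degree of freedom, a tiny loss in R2 is more than offset in both statistics, which is the pattern reported for the deletion of a very weak variable.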
The INJHOUR variable is so weak that it is probably a case of an inappropriate or irrelevant variable, while the TONHOUR and UNEMPLOY variables are much more likely to simply be victims of high multicollinearity.

A second approach to checking for a multicollinearity problem is to regress each of the explanatory variables on all of the remaining explanatory variables. If the R2 from any of these regressions is greater than the R2 of the original model, then high multicollinearity may well explain the low t-statistic associated with that variable's estimated coefficient in the original model. Table 3 gives the results of these regressions of each explanatory variable on all of the others.

Table 3.

Dependent Variable    R2        Adjusted R2    F-Value
TONHOUR               .1675     .1411             6.338
INJHOUR               .3319     .3107            15.651
HOURSLAB              .8707*    .8666*          212.110*
UNEMPLOY              .9093*    .9064*          315.838*
IRATE                 .9757*    .9749*         1265.106*
TOTVOTE               .9408*    .9389*          500.547*
TRUMKA                .2748     .2518            11.935

The regressions using the dependent variables HOURSLAB, UNEMPLOY, IRATE and TOTVOTE all have R2, adjusted R2 and F-values that are larger than the corresponding statistics for the original regression model. However, since IRATE and TOTVOTE were quite clearly significant in the original model, they are not of concern here since they are clearly not victims of the multicollinearity problem. This leaves HOURSLAB and UNEMPLOY as likely candidates for special attention in considering the multicollinearity problem.

Analysis of Ridge and PC Regressions

Ridge regression and principal components regression are applied in this section using PROC RIDGEREG and PROC MATRIX to evaluate the stability and reliability of the estimated regression coefficients. Ridge regression was developed to help deal with the high variances associated with multicollinear data and to find a technique that offered at least the possibility of lower population mean squared errors than ordinary least squares regression. Ridge regression augments the X'X matrix of explanatory variable values by adding some value, k, to each of its diagonal elements before inverting X'X to obtain coefficient estimates. It may be expressed as:

$(X'X + K)^{-1} X'Y$

where X is the matrix of explanatory variable values, K is a diagonal matrix with the ridge biasing parameter down the diagonal and Y is the vector of dependent variable values. Hoerl and Kennard (1970) have shown that if the X and Y data are appropriately standardized to mean of zero and variance of one such that the X'X matrix becomes the correlation matrix, then there exists a value for k between zero and one that will generate mean squared errors for the estimated regression coefficients that are smaller than those of ordinary least squares. Unfortunately, the exact location of these k values for any given set of sample data is unknown, so the superiority of ridge regression over ordinary least squares cannot be guaranteed in applied work. However, it is useful to trace the values of the estimated ridge regression coefficients as k goes from zero to one to observe the possible values and particularly to watch for any changes in sign. Thus ridge regression may be used effectively in this limited way to check on the stability of the regression coefficient estimates.

Figure 1 presents the SAS/GRAPH plot of the variance inflation factors (VIF's) plotted against values of k from zero to one. The VIF's are the diagonal elements of the inverse of the correlation matrix of the explanatory variables. They can be expressed as $VIF_j = 1 / (1 - R_j^2)$ where $R_j^2$ is the squared multiple correlation of the jth explanatory variable regressed against all of the other explanatory variables. Figure 1 shows that TONHOUR and INJHOUR maintain only a very weak relationship with the other explanatory variables and, as already noted, do not seem to be victims of multicollinearity. However, IRATE, HOURSLAB, TOTVOTE, TRUMKA and UNEMPLOY do show evidence of strong multiple correlations and substantial reductions in those correlations for k values in the lower part of the zero to one range.

[Figure 1. Variance inflation factors (VIF) as k increases from 0 to 1. Horizontal axis: ridge k value.]

Figure 2 plots the standardized ridge coefficients against values of k from zero to one. The coefficient of the IRATE variable drops dramatically from above before leveling off, while that of the TOTVOTE variable increases from below in a similar manner. The HOURSLAB coefficient rises abruptly and then settles back down. The INJHOUR and TONHOUR coefficients basically stay positive and close to zero. The most interesting behavior is that of the UNEMPLOY coefficient, which is initially negative, suggesting that the "threat effect" of high unemployment does not increase the likelihood of voting in favor of contract ratification. However, very shortly after pulling away from the ordinary least squares estimate along the vertical axis, the UNEMPLOY coefficient switches from negative to positive and stays positive throughout the rest of the range from zero to one. This suggests that the high degree of multicollinearity associated with the UNEMPLOY variable, as demonstrated above, initially disguises the true effect of unemployment on contract ratification. Thus ridge regression brings out the role of the "threat effect" of unemployment on contract ratification and suggests that the ordinary least squares estimate was misleading at best. Finally, note that the TRUMKA coefficient also switches signs from negative to positive but not until much later. Since most research seems to indicate that the optimal k values for ridge regression tend to lie fairly close to zero, changes occurring for somewhat larger values of k may be misleading. Thus, the original negative sign for the TRUMKA variable is probably correct.

[Figure 2. Ridge standardized coefficients as k increases from 0 to 1. Horizontal axis: ridge k value.]

Many authors have suggested formulas for estimating the optimal k parameter required for a final set of ridge coefficient estimates. Typical methods to get k include:

Hoerl, Kennard and Baldwin (1975):
    KHKB = (NVAR*S2) #/ (ALPHA'*ALPHA);
a variation of Lindley and Smith (1972):
    KLS = (SSEBHAT #/ (NOBS+2)) #/ ((ALPHA'*ALPHA) #/ (NVAR+2));
Lawless and Wang (1976):
    KLW = (NVAR*S2) #/ (ALPHA'*DIAG(XXVAL)*ALPHA);
Generalized Ridge Regression:
    KGEN = S2 * INV(DIAG(ALPHA*ALPHA'));

Using the contract ratification regression data the corresponding k values are: KHKB = .007, KLS = .008, KLW = .040 and KGEN = (.068, .230, 1.884, .046, .010, .003, .002).

As a final check, principal components regression is applied to these data. The eigenvalues of the standardized X'X matrix are 4.60, 1.01, .67, .28, .22, .18 and .04. Figure 3 plots the standardized coefficient estimates as the principal components are dropped. The UNEMPLOY coefficient changes from negative to positive with the elimination of just one dimension. This demonstrates the immediate power of principal components regression as it eliminates the weakest or most marginal dimension in going from seven to six components. In this case the rule of dropping all components that have corresponding eigenvalues less than one would seem to be overkill since it would drop five of the dimensions and leave only two dimensions. In many cases it may only be necessary to delete one dimension to reduce variances sufficiently to deal adequately with a multicollinearity problem.
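The ridge trace and the Hoerl, Kennard and Baldwin choice of k discussed in this section can be illustrated with a small example. The Python sketch below is not the paper's PROC RIDGEREG or PROC MATRIX code: the two-regressor correlation values, the sample size `N_OBS` and the helper names `solve` and `ridge` are all hypothetical, and the residual variance assumes the data are scaled so that X'X is the correlation matrix and Y'Y equals one.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ridge(corr_xx, corr_xy, k):
    """Standardized ridge coefficients (X'X + kI)^{-1} X'Y in correlation form."""
    p = len(corr_xy)
    augmented = [[corr_xx[i][j] + (k if i == j else 0.0) for j in range(p)]
                 for i in range(p)]
    return solve(augmented, corr_xy)


# Hypothetical correlations for two highly collinear regressors (not paper data).
CORR_XX = [[1.0, 0.95], [0.95, 1.0]]
CORR_XY = [0.80, 0.75]

# Ridge trace: k = 0 reproduces ordinary least squares.
trace = {k: ridge(CORR_XX, CORR_XY, k) for k in (0.0, 0.01, 0.02, 0.05, 0.1, 0.5, 1.0)}
for k, beta in trace.items():
    print(f"k={k:4.2f}  b1={beta[0]:+.4f}  b2={beta[1]:+.4f}")

# Hoerl-Kennard-Baldwin style k = p * s2 / (b'b), using the OLS coefficients.
N_OBS = 100  # hypothetical sample size
ols = trace[0.0]
p = len(ols)
r2 = sum(b * r for b, r in zip(ols, CORR_XY))   # R2 of the standardized OLS fit
s2 = (1.0 - r2) / (N_OBS - p - 1)               # residual variance estimate
k_hkb = p * s2 / sum(b * b for b in ols)
print(f"k_hkb={k_hkb:.4f}")
```

In this made-up example the second coefficient is negative under ordinary least squares and turns positive for a small k, which mirrors the kind of early sign switch described for UNEMPLOY; it does not reproduce the paper's estimates.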
Conclusion

Using PROC REG, PROC RIDGEREG, PROC STANDARD, PROC MATRIX and SAS/GRAPH, this paper demonstrated the use of ridge regression analysis and principal components regression to check the stability and reliability of ordinary least squares regression estimates. In analyzing the 1981 contract ratification vote leading to the 72-day strike by the United Mine Workers, we found that several variables exhibited high multicollinearity. The UNEMPLOY variable in particular was found to switch signs very quickly when either ridge regression or principal components regression was used. Thus these methods may be useful techniques for better understanding and interpreting ordinary least squares regression results.

[Figure 3. Standardized coefficients as principal components are dropped. Horizontal axis: number of omitted components.]

Bibliography

Hemmerle, William J., "An Explicit Solution for Generalized Ridge Regression", Technometrics, vol. 17, no. 3, August 1975, pages 309-314.

Hoerl, A.E. and R.W. Kennard, "Ridge Regression: Applications to Nonorthogonal Problems", Technometrics, vol. 12, no. 1, February 1970, pages 69-82.

Hoerl, A.E. and R.W. Kennard, "Ridge Regression: Biased Estimation for Nonorthogonal Problems", Technometrics, vol. 12, no. 1, February 1970, pages 55-67.

Hoerl, A.E. and R.W. Kennard, "Ridge Regression: Iterative Estimation of the Biasing Parameter", Communications in Statistics - Theory and Methods, vol. A5, no. 1, 1976, pages 77-88.

Hoerl, A.E., R.W. Kennard and K.F. Baldwin, "Ridge Regression: Some Simulations", Communications in Statistics - Theory and Methods, vol. A4, no. 1, 1975, pages 105-123.

Lawless, J.F. and P. Wang, "A Simulation Study of Ridge and Other Regression Estimators", Communications in Statistics - Theory and Methods, vol. 5, 1976, pages 307-323.

Lindley, D.V. and A.F.M. Smith, "Bayes Estimates for the Linear Model", Journal of the Royal Statistical Society, Series B, vol. 34, 1972, pages 1-41.

SAS/GRAPH is the registered trademark of SAS Institute Inc., Cary, NC, USA.