WEIGHTED RIDGE ESTIMATION IN A COLLECTIVE BARGAINING CONTEXT
L. Marsh, University of Notre Dame
T. Ghilarducci, University of Notre Dame
Abstract

The purpose of this paper is both substantive and methodological. The substantive purpose is to better understand the 1981 UMWA coal strike in terms of the factors that influenced the strike vote. The methodological purpose is to demonstrate the use of ridge regression and principal components regression (PROC RIDGEREG) in evaluating the stability and, therefore, the reliability of ordinary least squares regression estimates.

Introduction

Previous collective bargaining literature in economics sheds little light on the 1981 coal strike because most of that literature focuses on the union-management conflict and ignores internal rank-and-file dissent. The 1981 strike occurred after UMWA President Sam Church had reached an agreement with the BCOA (Bituminous Coal Operators Association) that the Union leadership officially endorsed. However, that contract was rejected by a two-to-one majority of UMWA members. Thus much of the conflict was between the Union's leadership and its rank-and-file miners. This dissatisfaction surfaced again the following year when Sam Church lost his reelection bid to Richard Trumka, a lawyer and former coal miner, by another two-to-one majority in favor of Trumka. Thus, internal union politics may have played a role in the earlier contract ratification vote.

The regression analysis model used in PROC REG, PROC MATRIX and PROC RIDGEREG to explain the contract ratification vote is derived from microeconomic theory. Each miner's behavior is explained in terms of his or her utility function. The level of a miner's satisfaction is represented as a function of such abstract concepts as expected compensation, good working conditions, leisure, group acceptance and self-determination. These general concepts are later replaced with specific variables that serve as proxies for some of these ideas. In any case, to be operational the utility function must be expressed as some sort of utility index. In this case a utility index was chosen such that the maximization of this index subject to a binary strike vote choice constraint leads directly to a logistic regression model for the individual miner.

The problem is that individual miners are not about to reveal how they voted. The lowest level at which votes are collected and counted is at the union local level. There are about 800 locals in the UMWA. Some of these are retirement locals, anthracite locals or other locals not participating in the contract ratification vote and, therefore, not used in this analysis. Since regression analysis is used based on the characteristics of the particular coal mine(s) represented by each local, trucking locals and other locals not associated with a particular coal mine also had to be dropped. This left about 650 UMWA locals to be used in the regression analysis.

For each union local we have the number of yes votes in favor of contract ratification and the number of no votes. We are interested in determining what factors influence the true population probability of a miner voting yes in a particular local. Since we only have information that relates to all miners in that local as a group, we must make the operational assumption that the above probability is the same for each miner in that particular local. In other words, we assume that any personal differences between individual miners will average or aggregate out within each local. Thus, these differences are left as part of the unexplained variation in the error term.

Since we don't know the true population probability of a miner voting yes in a particular local, we must estimate it as the observed proportion of yes votes for that local. Substituting this into the logistic model and log-linearizing it, we obtain a log-linear regression model with the dependent variable expressed as the logarithm of the yes-no odds ratio.

One problem brought about by the aggregation process from individual miner to union local described above is the introduction of heteroskedasticity into the error term. This heteroskedasticity is caused by two factors: differences in the number of miners in each local and differences in the probability of voting yes from local to local. Each miner's vote has a Bernoulli distribution since it represents a single binary outcome. Within each local the total yes votes have the binomial distribution. If X is the number of yes votes at a given local, then var X = npq where n is the number of miners voting, p is the probability of voting yes and q is the probability of voting no. Since n, p and q differ from local to local, variances then must differ from local to local and represent a problem of heteroskedasticity in the error term. Consequently, it is necessary to correct for heteroskedasticity in the error term by reweighting the observations before using PROC REG or PROC RIDGEREG or using generalized least squares in PROC MATRIX.
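
The dependent variable and the reweighting step can be illustrated with a minimal Python sketch. It assumes the common delta-method approximation in which the variance of the observed log odds is roughly 1/(npq), so each local receives weight npq; the paper does not state its exact weights, and the function name and vote counts below are hypothetical. Both vote counts must be positive for the logarithm to exist.

```python
import math


def logit_and_weight(yes, no):
    """Return (log odds of yes to no, approximate WLS weight) for one local.

    Assumption: the weight n*p*q is the reciprocal of the delta-method
    variance of the observed log odds, 1 / (n*p*q).
    """
    n = yes + no              # number of miners voting in the local
    p = yes / n               # observed proportion voting yes
    q = 1.0 - p               # observed proportion voting no
    log_odds = math.log(yes / no)
    weight = n * p * q
    return log_odds, weight


# Hypothetical vote counts, not taken from the paper's data set.
votes = {"local_A": (40, 80), "local_B": (300, 600), "local_C": (90, 30)}
for name, (yes, no) in votes.items():
    log_odds, weight = logit_and_weight(yes, no)
    print(f"{name}: LOGYESNO = {log_odds:+.4f}, weight = {weight:.2f}")
```

Locals A and B have the same yes-no odds, but B's larger membership gives it a larger weight, which is the sense in which both n and p drive the heteroskedasticity.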

The dependent variable in the contract ratification regression, LOGYESNO, represents the logarithm of the odds ratio of yes to no votes for contract ratification. The explanatory variables are selected as proxies for the concepts used in the original utility function specification. These concepts are expected compensation, good working conditions, leisure, group acceptance and self-determination.

The expected compensation would ordinarily include wage rate and fringe benefits adjusted for inflation as well as some measure of length of expected employment. However, this union locals data set represents cross-sectional units facing exactly the same contract with a single schedule of wages and benefits, so there are no differences in the wages and benefits offered to the different locals. Consequently, the only real differences between the locals in real expected compensation lie in the rate of change in the consumer price index, IRATE, representing the rate of inflation in the nearest Standard Metropolitan Statistical Area (SMSA), and in the unemployment rates, UNEMPLOY, in the counties represented by the various locals.

One would expect the IRATE variable to have a negative impact on LOGYESNO as long as the substitution effect of price increases is not overpowered by the short-run income effect of having to pay the mortgage, car payments and other fixed (at least in the short run) expenses. In other words, the IRATE variable may have a positive effect on LOGYESNO if rising prices cause real income (and real savings) to fall to such an extent that workers cannot afford a lengthy strike, especially if that strike is not likely to result in a significant increase in the wage rate.

The UNEMPLOY variable may generate a "threat effect" where high unemployment suggests little likelihood of finding alternative employment or even temporary employment during a strike. Thus, this tendency to hang on to the job you've got during periods of high unemployment can be expected to generate a positive coefficient for the UNEMPLOY variable. However, this conclusion may be dependent upon relatively competitive labor market conditions. An imperfectly competitive market could result in special conditions approximating the bilateral monopoly case. In the bilateral monopoly situation alternative jobs are not readily available but neither are alternative workers. If the Union can keep the company's mines shut down, then the employer may be forced to offer a significantly higher wage rate. Thus if the firm is employing just enough labor to equate its marginal factor cost with the value of its marginal product of labor but is reading the wage rate off of the labor supply curve, the firm may be able to afford a substantial increase in the wage rate without any reduction in its usage of labor and still adequately cover its labor costs. This means that high unemployment may not only fail to deter a strike but might actually indicate a situation where the potential benefit of striking is greater. Consequently, such a bilateral monopoly situation could be expected to produce a negative sign for the UNEMPLOY variable in explaining contract ratification.

Good working conditions in each mine covered by a given local are represented directly by a proxy productivity variable, TONHOUR, which is tons of coal per miner hour, and inversely by INJHOUR, which is number of injuries per miner hour. Higher productivity represented by larger values of the TONHOUR variable may reflect superior geological conditions, better equipment and/or better employee relations. Mines with low morale are unlikely to be very productive or very safe. Given the complexity of modern coal mines it's unlikely that productivity can be increased by simply pushing the miners harder. Miners can be somewhat independent and resistant to attempts at involuntary speedup in any case. Thus TONHOUR can probably be expected to have a positive coefficient. However, this could be negated if miners viewed high productivity as an indication of greater profits and, therefore, a greater ability to pay higher wages without being forced to lay off workers.

The INJHOUR variable can be expected to reflect poorer working conditions in general, possibly poorer employee-management relations and certainly lower employee morale. It is difficult to find anything positive about high injury rates. Consequently, the INJHOUR variable can be expected to have a negative coefficient indicating that high injury rates may be associated with greater worker dissatisfaction and, therefore, less willingness to ratify the contract.

HOURSLAB serves as a measure of leisure foregone and, conversely, as a proxy for expected compensation as long as the wage rate distribution is roughly the same in each of the mines represented by UMWA locals. This leads to some ambiguity concerning the sign of the HOURSLAB variable since it represents the classic tradeoff between leisure and income.

The socio-political variables are TOTVOTE and TRUMKA. The TOTVOTE variable is the total number of miners voting at each local. This variable represents both the size of the unionized workforce at the local mine and the intensity of the desire to vote. Mines with a smaller workforce may work more closely with management and be less alienated than workers at larger mines. Also, angry, militant miners are more likely to vote than passive, complacent ones. Both of these factors would tend to suggest a negative relationship between the size of the TOTVOTE variable and LOGYESNO. The TRUMKA variable is the percentage of miners who voted for Richard Trumka in each local in the 1982 UMWA presidential elections. To the extent that votes against the 1981 contract reflected dissatisfaction with the leadership of President Sam Church, the percentage for Trumka in 1982 should be inversely related to the 1981 contract ratification vote.

Analysis of Ordinary Regression Results

Table 1 provides the weighted least squares estimated coefficients, t-statistics and corresponding probabilities for the set of explanatory variables discussed above. Note that all variables have been transformed to a mean of zero and variance of one such that the X'X matrix became the correlation matrix for this set of explanatory variables.

In Table 1 only the IRATE and TOTVOTE coefficients are clearly significant while the TRUMKA coefficient is significant at the 5% level but not at the 1% level. The TONHOUR, INJHOUR, HOURSLAB and UNEMPLOY coefficients are not statistically different from zero and, therefore, are either unimportant in explaining the contract ratification vote or their importance is disguised by a high level of multicollinearity that unduly inflates the estimated coefficient variances.

Table 1.
Dependent Variable: LOGYESNO
R2 = .7042    Adjusted R2 = .6943    F = 71.3

variable     coefficient    t-stat    prob-value
TONHOUR         .03157        .82       .4136
INJHOUR         .01790        .50       .6529
HOURSLAB        .06963        .64       .5206
UNEMPLOY       -.02357       -.37       .7124
IRATE           .89686       6.32       .0001
TOTVOTE        -.56401      -7.99       .0001
TRUMKA         -.15732      -2.28       .0230

One standard procedure for checking for the possibility of multicollinearity is to drop out one explanatory variable at a time from the original regression model and see if the R2, adjusted R2, and F-statistic change much. Table 2 displays these statistics corresponding to the deletion of the variable specified.

Table 2.

Variable
Deleted:       R2      Adjusted R2    F-Value
TONHOUR      .7006       .6911        73.700*
INJHOUR      .7041*      .6947*       74.950*
HOURSLAB     .6876       .6777        69.341
UNEMPLOY     .6957       .6861        72.023*
IRATE        .6455       .6342        57.354
TOTVOTE      .6109       .5985        49.449
TRUMKA       .6796       .6694        66.811

The INJHOUR variable is so weak that its deletion from the original model does not result in much of a reduction in R2 and actually causes the adjusted R2 and the F-value to increase. Deletion of the TONHOUR and UNEMPLOY variables also brings about an increase in the F statistic but causes a fall in the R2 and adjusted R2 values. The INJHOUR variable is so weak that it is probably a case of an inappropriate or irrelevant variable while the TONHOUR and UNEMPLOY variables are much more likely to simply be victims of high multicollinearity.
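
The pattern described for Table 2, where deleting a weak variable lowers R2 slightly yet raises the adjusted R2 and the F-value, follows from the usual degrees-of-freedom adjustments. The Python sketch below assumes the textbook formulas for a regression with an intercept and uses a placeholder sample size (N_OBS = 200), which is not the paper's effective sample size, so the printed values are illustrative and need not match the table.

```python
def adjusted_r2(r2, n_obs, n_vars):
    """Textbook adjusted R2 for a model with an intercept and n_vars regressors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_vars - 1)


def f_statistic(r2, n_obs, n_vars):
    """Overall F-statistic implied by R2 for the same model."""
    return (r2 / n_vars) / ((1.0 - r2) / (n_obs - n_vars - 1))


N_OBS = 200  # hypothetical sample size, chosen only for illustration

# (label, R2 as reported in the paper, number of explanatory variables)
models = [
    ("full model", 0.7042, 7),
    ("INJHOUR deleted", 0.7041, 6),
    ("IRATE deleted", 0.6455, 6),
]

for label, r2, n_vars in models:
    adj = adjusted_r2(r2, N_OBS, n_vars)
    f_val = f_statistic(r2, N_OBS, n_vars)
    print(f"{label:16s} R2={r2:.4f}  adjusted R2={adj:.4f}  F={f_val:.2f}")
```

Under these assumptions, dropping a nearly irrelevant variable frees a degree of freedom at almost no cost in fit, so the adjusted R2 and F rise; dropping an important variable such as IRATE lowers all three statistics.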

A second approach to checking for a multicollinearity problem is to regress each of the explanatory variables on all of the remaining explanatory variables. If the R2 from any of these regressions is greater than the R2 of the original model, then high multicollinearity may well explain the low t-statistic associated with that variable's estimated coefficient in the original model. Table 3 gives the results of these regressions of each explanatory variable on all of the others.

Table 3.

Dependent
Variable:      R2      Adjusted R2     F-Value
TONHOUR      .1675       .1411           6.338
INJHOUR      .3319       .3107          15.651
HOURSLAB     .8707*      .8666*        212.110*
UNEMPLOY     .9093*      .9064*        315.838*
IRATE        .9757*      .9749*       1265.106*
TOTVOTE      .9408*      .9389*        500.547*
TRUMKA       .2748       .2518          11.935

The regressions using the dependent variables HOURSLAB, UNEMPLOY, IRATE and TOTVOTE all have R2, adjusted R2 and F-values that are larger than the corresponding statistics for the original regression model. However, since IRATE and TOTVOTE were quite clearly significant in the original model, they are not of concern here since they are clearly not victims of the multicollinearity problem. This leaves HOURSLAB and UNEMPLOY as likely candidates for special attention in considering the multicollinearity problem.

Analysis of Ridge and PC Regressions

Ridge regression and principal components regression are applied in this section using PROC RIDGEREG and PROC MATRIX to evaluate the stability and reliability of the estimated regression coefficients.

Ridge regression was developed to help deal with the high variances associated with multicollinear data and to find a technique that offered at least the possibility of lower population mean squared errors than ordinary least squares regression. Ridge regression augments the X'X matrix of explanatory variable values by adding some value, k, to each of its diagonal elements before inverting X'X to obtain coefficient estimates. It may be expressed as:

$\hat{\beta} = (X'X + K)^{-1} X'Y$

where X is the matrix of explanatory variable values, K is a diagonal matrix with the ridge biasing parameter down the diagonal and Y is the vector of dependent variable values.

Hoerl and Kennard (1970) have shown that if the X and Y data are appropriately standardized to mean of zero and variance of one such that the X'X matrix becomes the correlation matrix, then there exists a value for k between zero and one that will generate mean squared errors for the estimated regression coefficients that are smaller than those of ordinary least squares. Unfortunately, the exact location of these k values for any given set of sample data is unknown so the superiority of ridge regression over ordinary least squares cannot be guaranteed in applied work. However, it is useful to trace the values of the estimated ridge regression coefficients as k goes from zero to one to observe the possible values and particularly to watch for any changes in sign. Thus ridge regression may be used effectively in this limited way to check on the stability of the regression coefficient estimates.

Figure 1 presents the SAS/GRAPH plot of the variance inflation factors (VIF's) plotted against values of k from zero to one. The VIF's are the diagonal elements of the inverse of the correlation matrix of the explanatory variables. They can be expressed as $VIF_j = 1 / (1 - R_j^2)$ where $R_j^2$ is the squared multiple correlation of the jth explanatory variable regressed against all of the other explanatory variables. Figure 1 shows that TONHOUR and INJHOUR maintain only a very weak relationship with the other explanatory variables and as already noted do not seem to be victims of multicollinearity. However, IRATE, HOURSLAB, TOTVOTE, TRUMKA and UNEMPLOY do show evidence of strong multiple correlations and substantial reductions in those correlations for k values in the lower part of the zero to one range.

Figure 1. VARIANCE INFLATION FACTORS (VIF) AS K INCREASES FROM 0 TO 1 (plot omitted; horizontal axis: RIDGE K VALUE)

Figure 2 plots the standardized ridge coefficients against values of k from zero to one. The coefficient of the IRATE variable drops dramatically from above before leveling off while that of the TOTVOTE variable increases from below in a similar manner. The HOURSLAB coefficient rises abruptly and then settles back down. The INJHOUR and TONHOUR coefficients basically stay positive and close to zero.

Figure 2. RIDGE STANDARDIZED COEFFICIENTS AS K INCREASES FROM 0 TO 1 (plot omitted; horizontal axis: RIDGE K VALUE)

The most interesting behavior is that of the UNEMPLOY coefficient which is initially negative, suggesting that the "threat effect" of high unemployment does not increase the likelihood of voting in favor of contract ratification. However, very shortly after pulling away from the ordinary least squares estimate along the vertical axis, the UNEMPLOY coefficient switches from negative to positive and stays positive throughout the rest of the range from zero to one. This suggests that the high degree of multicollinearity associated with the UNEMPLOY variable, as demonstrated above, initially disguises the true effect of unemployment on contract ratification. Thus ridge regression brings out the role of the "threat effect" of unemployment on contract ratification and suggests that the ordinary least squares estimate was misleading at best.

Finally, note that the TRUMKA coefficient also switches signs from negative to positive but not until much later. Since most research seems to indicate that the optimal k values for ridge regression tend to lie fairly close to zero, changes occurring for somewhat larger values of k may be misleading. Thus, the original negative sign for the TRUMKA variable is probably correct.

Many authors have suggested formulas for estimating the optimal k parameter required for a final set of ridge coefficient estimates. Typical methods to get k include

Hoerl, Kennard and Baldwin (1975):
KHKB = (NVAR*S2) #/ (ALPHA'*ALPHA);

a variation of Lindley and Smith (1972):
KLS = (SSEBHAT #/ (NOBS+2)) #/ ((ALPHA'*ALPHA) #/ (NVAR+2));

Lawless and Wang (1976):
KLW = (NVAR*S2) #/ (ALPHA'*DIAG(XXVAL)*ALPHA);

Generalized Ridge Regression:
KGEN = S2 # INV(DIAG(ALPHA*ALPHA'));

Using the contract ratification regression data the corresponding k values are: KHKB = .007, KLS = .008, KLW = .040 and KGEN = (.068, .230, 1.884, .046, .010, .003, .002).
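
The ridge trace and the variance inflation factors discussed in this section can be illustrated with a minimal Python sketch for the simplest possible case of two standardized predictors, where X'X is the 2x2 correlation matrix and the inverse of (X'X + kI) has a closed form. The correlations used here are hypothetical and chosen only to mimic strong collinearity; the ridge VIF is taken, as an assumption, to be the diagonal of (R + kI)^-1 R (R + kI)^-1, which reduces to 1/(1 - r^2) at k = 0.

```python
def ridge_two_predictors(r, c1, c2, k):
    """Ridge coefficients for two standardized predictors.

    r  : correlation between the two predictors
    c1 : correlation of predictor 1 with the dependent variable
    c2 : correlation of predictor 2 with the dependent variable
    k  : ridge biasing parameter added to the diagonal of X'X
    """
    a = 1.0 + k
    det = a * a - r * r              # determinant of (X'X + kI)
    b1 = (a * c1 - r * c2) / det
    b2 = (a * c2 - r * c1) / det
    return b1, b2


def ridge_vif(r, k):
    """Diagonal element of (R + kI)^-1 R (R + kI)^-1 in the 2x2 case."""
    a = 1.0 + k
    det = a * a - r * r
    return (a * a - 2.0 * a * r * r + r * r) / (det * det)


# Hypothetical correlations, not estimated from the paper's data.
R12, C1, C2 = 0.95, 0.60, 0.65

for k in (0.0, 0.01, 0.05, 0.1, 0.5, 1.0):
    b1, b2 = ridge_two_predictors(R12, C1, C2, k)
    print(f"k={k:4.2f}  b1={b1:+.4f}  b2={b2:+.4f}  VIF={ridge_vif(R12, k):7.3f}")
```

With these made-up inputs the first coefficient is negative at k = 0 and turns positive for a small k while the VIF falls sharply, the same qualitative behavior the text reports for UNEMPLOY.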

As a final check, principal components regression is applied to these data. The eigenvalues of the standardized X'X matrix are 4.60, 1.01, .67, .28, .22, .18 and .04.

Figure 3 plots the standardized coefficient estimates as the principal components are dropped. The UNEMPLOY coefficient changes from negative to positive with the elimination of just one dimension. This demonstrates the immediate power of principal component regression as it eliminates the weakest or most marginal dimension in going from seven to six components. In this case the rule of dropping all components that have corresponding eigenvalues less than one would seem to be overkill since it would drop five of the dimensions and leave only two dimensions. In many cases it may only be necessary to delete one dimension to reduce variances sufficiently to deal adequately with a multicollinearity problem.

Conclusion

Using PROC REG, PROC RIDGEREG, PROC STANDARD, PROC MATRIX and SAS/GRAPH this paper demonstrated the use of ridge regression analysis and principal components regression to check the stability and reliability of ordinary least squares regression estimates. In analyzing the 1981 contract ratification vote leading to the 72-day strike by the United Mine Workers, we found that several variables exhibited high multicollinearity. The UNEMPLOY variable in particular was found to switch signs very quickly when either ridge regression or principal components regression was used. Thus these methods may be useful techniques for better understanding and interpreting ordinary least squares regression results.

Figure 3. STANDARDIZED COEFFICIENT AS PRINCIPAL COMPONENTS ARE DROPPED (plot omitted; horizontal axis: NUMBER OF OMITTED COMPONENTS)

Bibliography

Hemmerle, William J., "An Explicit Solution for Generalized Ridge Regression", Technometrics, vol. 17, no. 3, August 1975, pages 309-314.

Hoerl, A.E. and R.W. Kennard, "Ridge Regression: Applications to Nonorthogonal Problems", Technometrics, vol. 12, no. 1, February 1970, pages 69-82.

Hoerl, A.E. and R.W. Kennard, "Ridge Regression: Biased Estimation for Nonorthogonal Problems", Technometrics, vol. 12, no. 1, February 1970, pages 55-67.

Hoerl, A.E. and R.W. Kennard, "Ridge Regression: Iterative Estimation of the Biasing Parameter", Communications in Statistics - Theory and Methods, vol. A5, no. 1, 1976, pages 77-88.

Hoerl, A.E., R.W. Kennard and K.F. Baldwin, "Ridge Regression: Some Simulations", Communications in Statistics - Theory and Methods, vol. A4, no. 1, 1975, pages 105-123.

Lawless, J.F. and P. Wang, "A Simulation Study of Ridge and Other Regression Estimators", Communications in Statistics - Theory and Methods, vol. 5, 1976, pages 307-323.

Lindley, D.V. and A.F.M. Smith, "Bayes Estimates for the Linear Model", Journal of the Royal Statistical Society, Series B, vol. 34, 1972, pages 1-41.

SAS/GRAPH is the registered trademark of SAS Institute Inc., Cary, NC, USA.