Psychological Bulletin
1977, Vol. 84, No. 1, 158-172
Quality of Group Judgment
Hillel J. Einhorn
Graduate School of Business, University of Chicago

Robin M. Hogarth
Institut Europeen d'Administration des Affaires, Fontainebleau, France

Eric Klempner
Graduate School of Business, University of Chicago
The quality of group judgment is examined in situations in which groups have
to express an opinion in quantitative form. To provide a yardstick for evaluating the quality of group performance (which is itself defined as the absolute
value of the discrepancy between the judgment and the true value), four baseline models are considered. These models provide a standard for evaluating
how well groups perform. The four models are: (a) randomly picking a single
individual; (b) weighting the judgments of the individual group members
equally (the group mean); (c) weighting the "best" group member (i.e., the
one closest to the true value) totally where the best is known, a priori, with
certainty; (d) weighting the best member totally where there is a given probability of misidentifying the best and getting the second, third, etc., best member. These four models are examined under varying conditions of group size
and "bias." Bias is denned as the degree to which the expectation of the population of individual judgments does not equal the true value (i.e., there is
systematic bias in individual judgments). A method is then developed to evaluate the accuracy of group judgment in terms of the four models. The method
uses a Bayesian approach by estimating the probability that the accuracy of
actual group judgment could have come from distributions generated by the
four models. Implications for the study of group processes and improving group
judgment are discussed.
Consider a group of size N that has to arrive
at some quantitative judgment, for example, a
sales forecast, a prediction of next year's gross
national product, the number of bushels of
wheat expected in the next quarter, and the
like. Given the prevalence of such predictive
activity in the real world, it is clearly important to consider how well groups can and
do perform such tasks, as well as to consider
strategies that may be used to improve performance. In this paper we address the issue
of defining the quality of group judgment and
assess the effects and limitations on judgmental
quality of different strategies for combining
opinions under a variety of circumstances.
First, we define quality of performance in
terms of how close the group judgment is to
the true (actual) value being predicted once it
is known. We then consider the differential
expected quality of performance of different
baseline models, that is, how well would groups
perform if they formed their judgments according to a number of different assumptions.
However, it is shown that in many circumstances the baseline performances expected of
the different models are quite similar. We
therefore present, and illustrate, a statistical
procedure for considering which baselines are
appropriate for evaluating the quality of
group judgment in empirical studies. The
conceptualization and procedures presented
here do, we believe, have considerable potential for illuminating the often seemingly contradictory results in the literature on the accuracy of group judgment, as well as for setting up standards for comparing quality of group judgment both within and between different populations of groups.

This research was supported by a grant from the Spencer Foundation. We would like to thank Sarah Lichtenstein for her insightful comments on an earlier draft of this paper and Ken Friend for making his data available to us. Requests for reprints should be sent to Hillel J. Einhorn, Graduate School of Business, University of Chicago, Chicago, Illinois 60637.
The earliest standard used in comparing
group judgment was the individual; that is,
given a group judgment and N individual
judgments, the accuracy of the group judgment
was compared with the various individual
judgments. One could then determine if the
group was performing at the level of its best,
second best, etc., member (cf. Taylor, 1954;
Steiner & Rajaratnam, 1961). The results of
such studies have been summarized by Lorge,
Fox, Davitz, and Brenner (1958): "At best
group judgment equals the best individual
judgment but usually is somewhat inferior to
the best judgment" (p. 348). Although groups
do not seem to perform at the level of their
best member (which is, after all, defined after
the true value is known), the question remains
as to how well groups can identify and weight
their better members before the true value
becomes known.
A second and related line of research, using
judgments made in simple laboratory tasks
(such as estimating the number of beans in a
jar), has dealt with staticized groups. Staticized
refers simply to an average of a number of
individual judgments (or even one person's
judgment given many times). Those averages
have been compared to individual judgments
in terms of accuracy (Gordon, 1923; Stroop,
1932; Zajonc, 1962). Results have shown that
the average judgment is more accurate than
most individual judgments (there have been
exceptions, see Klugman, 1947). However,
comparisons have rarely been made between
staticized groups and actual groups because
the emphasis of this line of research has been
on groups versus individuals.
A third line of research, developed outside
the field of psychology, deals with the potential
advantages that can result from the pooling of
individual judgments by a systematic statistical procedure. The method used has been
called the "Delphi" technique (Dalkey, 1969b;
Dalkey & Helmer, 1963). The general idea is
to try to produce a consensus of opinion
through statistical feedback (usually the
median of the individual judgments). Furthermore, the group does not meet in a face-to-face
format, since it is contended that social interaction causes biases that adversely affect group
performance. Although more experimental evidence is needed on this point (Dalkey, 1969a;
Sackman, 1974), the Delphi technique explicitly recognizes the possibility that actual
groups may be performing below some statistical standard.
A Baseline Approach
In conceptualizing how well groups perform
specific tasks, Steiner (1966, 1972) has identified three critical factors: (a) the type of task
with which the group is faced, (b) the resources
at the group's disposal (i.e., the expertise of
the different group members), and (c) the
process used by the group. In the kinds of
judgmental tasks considered here, we conceptualize the group judgment as a weighting
and combining of the judgments of the individual members. Thus, one crucial issue is
the process used by the group to allocate
weights to the opinions of the different members and the extent to which various strategies
for weighting opinions have important effects
on the quality of judgment in different
circumstances.
Steiner (1972) listed four reasons why groups
may not, in fact, weight their individual
members appropriately:
(a) failure of status differences to parallel the quality
of the contributions offered by participating
members;
(b) the low level of confidence proficient members
sometimes have in their own ability to perform
the task;
(c) the social pressures that an incompetent majority
may exert on a competent minority;
(d) the fact that the quality of individual contributions
is often very difficult to evaluate. (pp. 38-39)
Whether groups do misweight in actuality,
and with what frequency, is an empirical
matter. However, before one can conclude
that a group is misweighting, there must be
some standard against which to compare its
performance. The approach taken here is to
develop baseline models founded on assumptions made about group processes. These
models are not meant to describe what groups
actually do; they simply say that if groups
were to do such and such, then a certain level
of performance would result. Although the four
models considered differ greatly in what they
assume about group process, it is instructive
to compare them under a wide variety of
circumstances.
The first model consists in assuming that
the group picks one member at random and
uses that judgment as the group judgment.
Intuitively we would expect such a model to
yield a low level of performance because it
assumes that the group lacks any ability to
identify and weight its better members appropriately. However, it is possible that actual
group judgment may be no better than this
strategy. The random model is discussed at
greater length when considering our second
model.
The second baseline model involves weighting each individual judgment equally, that is,
by 1/N. This model is equivalent to using the
average of the individual judgments as a comparison for actual group judgment. It must, of
course, be remembered that we are not interested in whether the group average is a
good representation of the group judgment but
rather whether the group average is as accurate
as the group judgment. This is a crucial distinction that must be kept in mind in discussing
all of our models.
There are several reasons for considering the
equal weight model: (a) The equal weight
model can be thought of as representing individual members' weights before discussion
takes place; that is, before information concerning perceived expertise is obtained, the
group treats each member equally. Therefore,
equal weighting provides an interesting baseline with which to compare groups' abilities to
allocate weights on a differential basis. (b)
Recent research (cf. Dawes & Corrigan, 1974;
Einhorn & Hogarth, 1975) has shown that an
equal weight model can outpredict differential
weight models under a wide variety of circumstances. Part of the reason for this is that equal
weights cannot reverse the relative weighting
of the variables. For example, it is better to
weight all group members equally than to
assign high weights to those with poor judgment. Therefore, if groups reverse the relative
weights (due to nonvalid social cues that
influence perceived expertise), an equal weight
model can be expected to perform better. If
groups actually perform worse than would be
expected on the basis of an equal weight model,
it may suggest that group discussion is dysfunctional with respect to the assignment of
weights. (c) The mean of a random variable
has certain desirable statistical qualities. For
example, consider that each individual judgment contains the true value of the phenomenon to be predicted plus a random error
component. If this is the case, the expectation
of the individual scores will be the true score.
Furthermore, the expectation of the means of
groups of size N drawn from the distribution
of individual scores will be equal to the true
score and the variance of this distribution will
be less than the distribution of individual
scores. Therefore, using the group mean will
result in a "tighter" distribution around the
true value—a situation that is most beneficial
and of great practical importance.
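To make the variance argument concrete, here is a minimal simulation sketch (ours, not part of the original paper; the true value of 100, the spread of 15, and the group size of 5 are arbitrary illustrative choices): when each judgment is the true value plus random error, the group mean strays less from the truth, on average, than a randomly chosen member.

```python
import random
import statistics

# Illustrative simulation only: judgments = true value + random error.
# Compare the random-member model with the equal-weight (group mean) model.
random.seed(0)

TRUE_VALUE = 100.0   # hypothetical quantity being judged (assumed)
SIGMA = 15.0         # spread of individual judgments (assumed)
N = 5                # group size (assumed)
TRIALS = 10_000

errors_random, errors_mean = [], []
for _ in range(TRIALS):
    judgments = [random.gauss(TRUE_VALUE, SIGMA) for _ in range(N)]
    errors_random.append(abs(random.choice(judgments) - TRUE_VALUE))  # random model
    errors_mean.append(abs(statistics.mean(judgments) - TRUE_VALUE))  # mean model

print("mean |error|, random member:", round(statistics.mean(errors_random), 2))
print("mean |error|, group mean   :", round(statistics.mean(errors_mean), 2))
# With no systematic bias, the group-mean error shrinks roughly by 1/sqrt(N).
```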
The merits of the preceding argument depend on the assumption that each individual
judgment can be divided into a true value plus
a random error component. However, when
dealing with human judgment in complex tasks
(such as predicting sales, judging guilt or
innocence, etc.), we feel that systematic biases
may enter into judgment in ways different
from laboratory tasks. The former situations
differ from the latter in at least two respects:
(a) The definition of the stimulus is more
ambiguous and subject to diverse influences.
This means that the information on which
judgments are based may differ among individuals. Furthermore, in such conditions of
stimulus ambiguity, there is much research
that indicates that individual judgments are
biased by social pressures (Deutsch & Gerard,
1955). (b) Because a large and diverse set of
information has to be processed, it is quite
likely that erroneous assumptions, biases, and
other constant errors will be made. Recent psychological work (e.g., Slovic, 1972a; Tversky
& Kahneman, 1974) has shown that the
human's limited information processing ability
leads him to make systematic errors in judgment. Moreover, these biases seem to be widespread and applicable to "experts" as well as
to novices (Kidd, 1970; Slovic, 1972b). Given
the questionable assumption of random error
in individual judgments, we examine each of
our baseline models under varying amounts of
bias (this is defined formally in the next
section).
The third model we consider is the following: Assume that through group discussion, the group is able to identify its best member with certainty (i.e., the group can determine which member's judgment will be closest to the true value). In such a situation, a sensible strategy would clearly be to give all the weight to the "best" judgment and none to the remaining N − 1 members. Although it is possible for the actual group judgment, or the group mean, to be closer to the true value on any trial, this is unlikely to be the case on average. Therefore, we compare the random and mean models with the best model.

Our final model takes cognizance of the fact that groups will find it extremely difficult to identify their best member with certainty. That is, what happens if the group can be mistaken as to the identity of the best member? In other words, how well will the group perform if it only has a certain probability of picking the best person? We denote this model as the "proportional" model and compare it to the three models discussed above. We now turn to a formal development of the models.

The Models

We begin by considering a population distribution of individual judgments. Let x_j be the judgment of the jth person and assume that judgments are normally distributed with E(x_j) = μ and var(x_j) = σ². The true value to be predicted is denoted by x_t. The distribution is shown in Figure 1, in which we have drawn x_t so that it does not coincide with the mean of the individual judgments. We now define two measures of bias. The first is simply the distance between x_t and μ, that is,

b = x_t − μ.  (1)

We call the second the standardized bias because it is the distance between x_t and μ measured in terms of the population standard deviation, that is,

B = (x_t − μ)/σ.  (2)

[Figure 1. Distribution of individual judgments, x_j ~ N(μ, σ), and of group means, x̄_N ~ N(μ, σ/√N).]

Now consider that we sample N individuals from the x_j distribution and calculate their mean (x̄_N). The result can be considered as a drawing from the sampling distribution of the mean with mean of μ and standard deviation of σ/√N. This is shown in the lower half of Figure 1. The important point to note is that x_t is further out in the tail of the distribution of means than in the original distribution. Moreover, as group size increases, the variance of the distribution of means decreases; therefore, x_t will be even further out in the tail area. The implication here is that the probability of being close to x_t would then decrease.[1] It is clear that the standardized bias (B), as well as group size (N), will affect the quality of performance of both the mean and random models (the latter being, of course, equivalent to sampling a single observation from the x_j distribution). We now turn to a more complete consideration of these models under varying amounts of B and N.

[1] However, the probability of being very far from x_t also decreases.

We first need to define the quality of any particular judgment. We do this by defining quality as being synonymous with accuracy.[2] If we wish to be as close to x_t as possible and feel that it makes no difference whether we are above or below x_t, then an appropriate measure of accuracy is given by

d = |x̄_N − x_t|.  (3)

Note that when N = 1, Equation 3 expresses the accuracy of any individual judgment.[3] In order not to confuse these meanings of d, we denote d_1 as being the case when N = 1 and d for all other values of N. Clearly, the smaller d is, the greater the accuracy. The absolute value operator is used because we assume that being above or below x_t incurs the same cost (symmetric loss function).

[2] We realize that certain writers (cf. Maier, 1967) have defined the effectiveness of group problem solving as a function of both the quality of the solution and its acceptability by the group members. We do not deal with the acceptability issue in this paper.

[3] A more general form for Equation 3 would be d = α|x̄_N − x_t| for α > 0. However, because we are only interested in the relative differences between the models, we may conveniently work with the special case of α = 1 without loss of generality.
In order to evaluate the effects of group size
and standardized bias on d, we examine the
expected value and variance of d under varying
combinations of B and N. Therefore, we look
at d "on average" as well as its dispersion. In
order to calculate E(d | B, N) and var(d | B, N),
we must examine the distribution of d. By way
of illustration, we assume N = 1. Consider
Figure 1 again. To obtain the distribution of
d, assume that we can "fold" the Xj distribution
at xt so that the area previously lying to the
left of xt now lies to its right. This procedure
yields the distribution that results when xt is
subtracted from Xj and the absolute value is
taken (e.g., when x_t = μ, we get "half" of a normal distribution). Note that it does not matter whether x_t is below or above the mean
because the same d distribution will result.
Figure 2 shows the distribution of d when xt is
at the value shown in Figure 1 . The upper part
of Figure 2, (a) , shows the effect of folding over
the Xj distribution from left to right at xt. The
shaded area refers to the tail area that was to
the right of xt. This tail will begin at d = 0
(where Xj = xt). The lower part of Figure
2, (b), shows the distribution of d when the
tail area is added to the distribution truncated
at xt.
[Figure 2. (a) "Folded" distribution of x_j at x_t. (b) Distribution of d for a given x_t.]

Before deriving E(d) and var(d), however, we deal with a standardized distribution of individual judgments, which will simplify our discussion. However, in order to distinguish between results on the basis of standardized and original units, we use primes to denote that the variables have been standardized in terms of the population of individual scores. Therefore,

x̄'_N = (x̄_N − μ)/σ,

x'_t = (x_t − μ)/σ = B,

and

d' = |x̄'_N − x'_t|.

It is important to note that d = σd'. This means that any results using d' can be converted to original units by multiplying by the population standard deviation.

We first wish to find the unconditional expectation of d'. This is given by

E(d') = ∫ |x̄'_N − x'_t| f(x̄'_N) dx̄'_N.  (4)

It is shown in Appendix A that this is

E(d') = B[2F(√N B) − 1] + (2/√N) f_N(√N B),  (5)

where F = cumulative normal distribution and f_N = ordinate of normal distribution. We can also determine the variance of d', namely,

var(d') = E(|x̄'_N − B|²) − [E(d')]².  (6)

It is shown in Appendix A that this is

var(d') = (1/N) + B² − [E(d')]².  (7)

In order to examine the effects of B and N on Equations 5 and 7, we have calculated E(d' | B, N) and var(d' | B, N) for the following values of B and N: B = 0, .5, 1, 1.5, 2, 2.5,
and 3; N = 1, 2, 3, 4, 5, 6, 8, 10, 12, and 16.[4] These results are shown in Table 1.

Table 1
E(d') and var(d') for Varying Levels of B and N

B      N = 1         N = 2         N = 3         N = 4         N = 5
0      .798 (.364)   .564 (.181)   .461 (.121)   .399 (.091)   .357 (.073)
.5     .896 (.448)   .700 (.260)   .623 (.194)   .583 (.160)   .559 (.138)
1.0    1.167 (.607)  1.050 (.397)  1.020 (.294)  1.008 (.233)  1.004 (.192)
1.5    1.559 (.821)  1.509 (.475)  1.502 (.328)  1.500 (.249)  1.500 (.200)
2.0    2.017 (.931)  2.001 (.496)  2.000 (.333)  2.000 (.250)  2.000 (.200)
2.5    2.504 (.980)  2.500 (.500)  2.500 (.333)  2.500 (.250)  2.500 (.200)
3.0    3.001 (.996)  3.000 (.500)  3.000 (.333)  3.000 (.250)  3.000 (.200)

B      N = 6         N = 8         N = 10        N = 12        N = 16
0      .326 (.061)   .282 (.045)   .252 (.036)   .230 (.030)   .199 (.023)
.5     .544 (.121)   .525 (.099)   .515 (.085)   .510 (.073)   .504 (.058)
1.0    1.002 (.163)  1.000 (.124)  1.000 (.100)  1.000 (.084)  1.000 (.063)
1.5    1.500 (.166)  1.500 (.125)  1.500 (.100)  1.500 (.084)  1.500 (.063)
2.0    2.000 (.166)  2.000 (.125)  2.000 (.100)  2.000 (.084)  2.000 (.063)
2.5    2.500 (.166)  2.500 (.125)  2.500 (.100)  2.500 (.084)  2.500 (.063)
3.0    3.000 (.166)  3.000 (.125)  3.000 (.100)  3.000 (.084)  3.000 (.063)

Note. N = 1 is the random model. The numbers in parentheses represent var(d').
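Equations 5 and 7 are easy to evaluate directly. The following sketch (ours, using only the standard normal density and distribution function) reproduces representative entries of Table 1, for example E(d') = .798 and var(d') = .364 at B = 0, N = 1, and 1.008 and .233 at B = 1.0, N = 4.

```python
from math import erf, exp, pi, sqrt

def F(z):   # cumulative standard normal distribution
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def f_N(z): # ordinate (density) of the standard normal
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def expected_d(B, N):   # Equation 5
    return B * (2.0 * F(sqrt(N) * B) - 1.0) + (2.0 / sqrt(N)) * f_N(sqrt(N) * B)

def var_d(B, N):        # Equation 7
    return 1.0 / N + B * B - expected_d(B, N) ** 2

for B, N in [(0.0, 1), (1.0, 4), (0.5, 8)]:
    print(B, N, round(expected_d(B, N), 3), round(var_d(B, N), 3))
# Matches Table 1: (0, 1) -> .798, .364; (1, 4) -> 1.008, .233; (.5, 8) -> .525, .099
```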
There are four main results in Table 1: (a) as B (i.e., x'_t) increases, for any given N, E(d') and var(d') increase. As would be expected, the greater the standardized bias in the population, the poorer the mean model does; (b) as N increases, E(d') decreases, but when B ≥ 1.0, the decrease is small. Note that although E(d') does not decrease much under these conditions, var(d') does; (c) when N = 1 (the random model), both E(d'_1) and var(d'_1) are higher than for any other group size. Therefore, the random model will do worse than the mean model on average. However, when B ≥ 1.0, the expected value of d' is similar for these two models;[5] (d) as N increases, E(d') approaches B.
[4] Although we only use positive values of B, it is the case that negative values of B yield identical results. Therefore, the absolute value of B is the important determinant of E(d').

[5] Under the loss function "a miss is as good as a mile," the random model actually has a lower E(d') than the mean model. This occurs because the probability of being at x_t for the mean model is smaller if x'_t ≠ 0. The fact that the mean model has a lower probability of being further away from x_t is immaterial under this loss function.

We now turn to the models for the best and proportional strategies. Consider the d'_1 distribution for a given B. Furthermore, let us randomly sample N individuals from this distribution and order their d'_1 scores (from lowest to highest). To make use of this ordering, we can use the following result from order statistics: If groups of size N are randomly assembled from the d'_1 distribution and the members are ordered according to their d'_1 scores, on average, the members will divide the population distribution into N + 1 equal parts (cf. Hogg & Craig, 1965; Steiner & Rajaratnam, 1961). This means, for example, that four-person groups will, on average, have members that fall at the 20th, 40th, 60th, and 80th fractiles of the d'_1 distribution. Therefore, on average, the best member of a four-person group will perform better than 80% of the population (i.e., will be at the 20th fractile of the d'_1 distribution).

Let us denote the ith best score in a group of size N as d'_{i,N}. Therefore, d'_{1,4} would be the best score in a group of size four. We wish to determine the expectation of d'_{i,N} for various combinations of B and N. We use an approximation here to calculate this expected value. The sampling distribution of the ith best person is asymptotically normal with mean at the fractile corresponding to the ith best (Cramér, 1951). For example, the mean of the best person in a group of size four will fall at approximately the 20th fractile of the d'_1 distribution (note that because smaller d'_1 values are more desirable, we use the 20th fractile rather than the 80th). Therefore, if we can find the d'_1 score that corresponds to the appropriate fractile, we can find E(d'_{1,N}).

Consider Figure 3, which shows the parent distribution of x'_j. A distance of d' (corresponding to x'_1 and x'_2) is shown around B. When the distribution is folded at B, the resulting d'_1 distribution will have a distance of d' from the origin. Therefore, to determine the value of d' that corresponds to any given fractile of the d'_1 distribution, one needs the area that d' cuts off. This can be found using the normal distribution by noting that

d'_% = F(B + d') − F(B − d'),  (8)

where d'_% = fractile of the d'_1 distribution and F = cumulative normal distribution. An iterative computer program has been written that yields appropriate values of d' for any given fractile of the d'_1 distribution. Conversely, for any given d' value, one can obtain the fractile of the d'_1 distribution.

[Figure 3. Distribution of x'_j and folded distribution of d'_1 around B, showing the d' distance.]
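The authors' iterative program is not listed in the paper; a minimal sketch of one way such an inversion of Equation 8 might be carried out (simple bisection on d', which is valid because the fractile increases in d') is given below. The function names are ours.

```python
from math import erf, sqrt

def F(z):  # cumulative standard normal distribution
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def fractile_of(d_prime, B):
    """Equation 8: area of the d'_1 distribution cut off by d'."""
    return F(B + d_prime) - F(B - d_prime)

def d_for_fractile(p, B, tol=1e-8):
    """Invert Equation 8 by bisection: find d' whose fractile equals p."""
    lo, hi = 0.0, 10.0                      # fractile_of is increasing in d'
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fractile_of(mid, B) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: the fractile at which the best member of a four-person group
# falls, on average, is 1/(N + 1) = .20.
print(round(d_for_fractile(0.20, B=0.0), 3))   # about .253 when B = 0
```

This plug-in of the fractile is the asymptotic approximation described in the text.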
Before presenting the results of E(d'_{1,N}) for various values of B and N, we consider our fourth model, the proportional model. In order to formally deal with this model, it is necessary to introduce a new variable, p_{i,N}. This is the probability of identifying the ith best person in a group of size N as the best. Therefore, p_{1,4} would denote the probability of correctly identifying the best person in a group of size four as the best. It seems likely that p_{1,N} will be affected by the size of the group (i.e., it would seem to be easier to correctly identify the best person in a group of size 4 than in a group of size 16). In order to incorporate this into our model, we assume here that p_{i,N} is inversely proportional to each member's rank in the group. For example, in a four-person group, one can consider that there are ten weights to be allocated (4 + 3 + 2 + 1). The best person will receive four, the second best three, and so on. Subsequently, the weights must be divided by their sum in order to normalize them. The probabilities allocated under such a scheme are given by

p_{i,N} = 2(N + 1 − i)/[(N + 1)N].  (9)

For example, in a four-person group, the probability of correctly identifying the best person is .4, whereas the probability of identifying the second best as the best is .3, third best as best is .2, and worst as best is .1.[6] Of course, the scheme presented is arbitrary. We do not know how well groups actually do identify their "better" members. However, we feel that results from such a model provide a useful benchmark to contrast with the best model, which appears unrealistic.

[6] This model should not be confused with a model in which each x_j is given a weight and the weighted x_j's are combined into a group judgment. The proportional model says that only one judgment is to be used as the group judgment.

The expected level of performance using the proportional model is

E_p(d') = Σ_{i=1}^{N} p_{i,N} E(d'_{i,N}).  (10)

Table 2 shows the values for E(d'_{1,N}) and E_p(d') for various levels of B and N.

Table 2
E(d'_{1,N}) and E_p(d') for Values of B and N

B      N = 1          N = 2          N = 3          N = 4          N = 5
0      .798 (.798)    .467 (.687)    .335 (.632)    .262 (.599)    .216 (.577)
.5     .895 (.895)    .527 (.772)    .379 (.711)    .297 (.674)    .244 (.650)
1.0    1.166 (1.166)  .719 (1.017)   .527 (.942)    .418 (.898)    .347 (.868)
1.5    1.557 (1.557)  1.044 (1.386)  .806 (1.301)   .662 (1.250)   .564 (1.215)
2.0    2.016 (2.016)  1.469 (1.834)  1.202 (1.742)  1.033 (1.688)  .914 (1.651)
2.5    2.503 (2.503)  1.944 (2.317)  1.666 (2.224)  1.487 (2.168)  1.357 (2.130)
3.0    3.000 (3.000)  2.438 (2.813)  2.157 (2.719)  1.976 (2.663)  1.843 (2.626)

B      N = 6          N = 8          N = 10         N = 12         N = 16
0      .183 (.562)    .141 (.541)    .115 (.527)    .097 (.518)    .074 (.506)
.5     .208 (.632)    .160 (.609)    .131 (.594)    .110 (.584)    .084 (.571)
1.0    .296 (.846)    .230 (.818)    .188 (.800)    .159 (.788)    .122 (.771)
1.5    .492 (1.191)   .393 (1.158)   .328 (1.138)   .281 (1.123)   .219 (1.105)
2.0    .823 (1.625)   .691 (1.590)   .599 (1.568)   .501 (1.553)   .433 (1.533)
2.5    1.257 (2.104)  1.108 (2.068)  1.000 (2.046)  .917 (2.030)   .794 (2.010)
3.0    1.740 (2.599)  1.586 (2.563)  1.474 (2.540)  1.386 (2.525)  1.254 (2.504)

Note. The numbers in parentheses represent E_p(d').

[Figure 4. E(d') for the models at B = 0.]
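The mechanics of Equations 9 and 10 are easily sketched in code (ours; the E(d'_{i,N}) inputs in the last line are made-up numbers for illustration only, not values from Table 2): the rank-based probabilities for a four-person group come out as .4, .3, .2, and .1, as in the example above, and E_p(d') is the correspondingly weighted sum.

```python
def p_rank(i, N):
    """Equation 9: probability of taking the ith best member as 'best'."""
    return 2.0 * (N + 1 - i) / ((N + 1) * N)

def expected_proportional(e_d_by_rank):
    """Equation 10: E_p(d') as a p_{i,N}-weighted sum of E(d'_{i,N})."""
    N = len(e_d_by_rank)
    return sum(p_rank(i, N) * e for i, e in enumerate(e_d_by_rank, start=1))

print([round(p_rank(i, 4), 2) for i in range(1, 5)])   # [0.4, 0.3, 0.2, 0.1]

# Hypothetical E(d'_{i,N}) values for a four-person group (illustration only):
print(round(expected_proportional([0.26, 0.55, 0.90, 1.40]), 3))
```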
In order to compare the four models, we have
plotted E(d') for each model as a function of
both standardized bias and group size. These
results can be seen in Figures 4, 5, 6, and 7.
Consider Figure 4, in which there is no
standardized bias. The most important result
is that the mean strategy is quite close to the
best model. Note further that the proportional
model is clearly inferior to the mean, whereas
the random model is poorest. Furthermore,
when the group size is greater than three, the
E(d') values for the models decrease very
slowly. This indicates that increasing group
size after three does not reduce E(d') greatly.
As bias increases, in Figures 5, 6, and 7, the
best model begins to improve relative to the
others. However, the closeness of the mean
and proportional models is particularly interesting. It is not until the standardized bias
is around .7 that the proportional model begins
to perform better than the mean. Again, the
effect of group size is small except for the best
strategy.
Using the Models
The theoretical results shown in Figures 4
through 7 indicate that depending upon B and
N, the baseline performance of the various
models as represented by E(d') may be quite
close. Furthermore, because there is dispersion
around the expected levels of baseline performance, in empirical situations it would often be
quite difficult to determine the level of performance (i.e., baseline of a particular group
or groups). This, in turn, would of course lead
to difficulty in judging the quality of group
performance.
[Figure 5. E(d') for the models at B = .5.]

For example, consider the situation in which we have observed a number of group judgments (x_g) and can measure the accuracy of such judgments by

d_g = |x_g − x_t|.
At what level of performance are these groups
performing? We consider that a reasonable
manner in which to answer this question is to
assess the probability of each of the baseline
models given the observed data and any other
information we deem relevant. These questions
can be answered by treating the problem within
the framework of Bayesian statistical inference.
Specifically, we need to determine the posterior probability favoring each model, k, given the data, that is, p(model k | d_g). This probability can be obtained through Bayes' theorem as

p(model k | d_g) = p(d_g | model k) p(model k) / Σ_k p(d_g | model k) p(model k),  (11)

where the term p(d_g | model k) is the likelihood of observing d_g given the kth model and p(model k) is the prior probability that the kth model is correct.[7]
If the investigator has prior knowledge concerning the probability of the different models
(based on, for example, theoretical or empirical
considerations), then he may assign different
prior probabilities to the different models. On
the other hand, he may wish to proceed as if
he had no prior knowledge and assign equal
prior probability to each of the models (i.e.,
.25). For illustrative purposes, we do this
below. However, we note that it is a restrictive
prior distribution in that it assumes that only
four models are possible. A way around this
difficulty is to use the posterior odds form of
reporting results and to consider the odds of
one model versus another, or all the others. For
example, for two models i and j, the posterior
odds favoring model i are given by
p(model i | d_g) / p(model j | d_g) = [p(d_g | model i) / p(d_g | model j)] × [p(model i) / p(model j)],  (12)
which breaks down into the likelihood and
prior odds ratios. In this form, the investigator
need only consider the relative prior probability of one model against another, or the
others.
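Computationally, Equation 11 is a normalization and Equation 12 a ratio. Here is a minimal, model-agnostic sketch (ours; the likelihood values are placeholders, not computed from the baseline models).

```python
def posterior_probs(likelihoods, priors):
    """Equation 11: posterior probability of each model given d_g."""
    joint = {k: likelihoods[k] * priors[k] for k in likelihoods}
    total = sum(joint.values())
    return {k: v / total for k, v in joint.items()}

def posterior_odds(likelihoods, priors, i, j):
    """Equation 12: posterior odds of model i versus model j."""
    return (likelihoods[i] / likelihoods[j]) * (priors[i] / priors[j])

# Placeholder likelihood values for the four baseline models:
like = {"best": 1.3, "mean": 1.0, "proportional": 0.9, "random": 0.7}
prior = {k: 0.25 for k in like}          # equal prior probability of .25
print(posterior_probs(like, prior))
print(round(posterior_odds(like, prior, "best", "random"), 2))
```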
We now develop the likelihood functions for the four models. For the random model, consider Figure 3 again. The values x'_1 and x'_2 are equidistant from B. Therefore, when the x'_j distribution is folded over, the density of d'_1 will be the sum of the densities of x'_1 and x'_2 in the nonfolded distribution, that is,

f(d'_1) = f_N(x'_1) + f_N(x'_2).  (13)

However, note that

x'_1 = B + d'  and  x'_2 = B − d'.  (14)

Therefore,

f(d'_1) = f_N(B + d') + f_N(B − d').  (15)
[7] For those not familiar with the Bayesian approach, the interpretation of the terms in Equation 11 is as follows: p(model k | d_g) = probability that the kth model could have generated results as good as the given d_g; p(d_g | model k) = probability of getting a d_g performance level given that d_g was generated by the kth model; p(model k) = probability that the kth model generates all d_g values.

[Figure 6. E(d') for the models at B = 1.0.]
Similarly, for the mean model the density function of d' is given by

f(d') = √N {f_N[√N(B + d')] + f_N[√N(B − d')]}.  (16)

The density function for d'_{i,N} when i = 1 can be found in Hogg and Craig (1965, p. 173). In our notation, this is

f(d'_{1,N}) = N(1 − d'_%)^(N−1) f(d'_1),  (17)

where d'_% = fractile that d' cuts off in the d'_1 distribution. The density function for the proportional strategy is more complicated and is derived in Appendix B. It is

f_p(d') = [2 f(d'_1)/(N + 1)] [N − (N − 1) d'_%].  (18)

Equations 15 through 18 provide the conditional probability of any d' value given the particular model. These can then be substituted into Equation 11 to yield the posterior probability of each model given the data.

To illustrate the above procedures, consider the data from an experiment performed at the University of Chicago. Twenty groups of size three were formed randomly using MBA (master of business administration) students. The subjects were asked to estimate the metropolitan population (as of the 1970 census) of several cities. Here we only consider Washington, D.C. The subjects first estimated the population individually and then met in groups to come to a consensus answer. Therefore, we have 60 individual judgments (x_j), 20 group judgments (x_g), and the true value (x_t = .75 million). In our example, we consider one group answer of .55 million and ask what is the probability that an answer as good as .55 could have come from each of the baseline models.
[Figure 7. E(d') for the models at B = 3.0.]
Table 3
Posterior Probabilities for Models × Groups

Group         Best    Mean    Proportional    Random
1             .406*   .221    .220            .152
2             .249    .287*   .245            .219
3             .406*   .221    .220            .152
4             .328*   .257    .232            .182
5             .354*   .245    .228            .172
6             .431*   .209    .216            .144
7             .354*   .245    .228            .172
8             .406*   .221    .220            .152
9             .381*   .233    .224            .162
10            .301*   .268    .237            .194
11            .085    .271    .282            .362*
12            .015    .115    .325            .545*
13            .354*   .245    .228            .172
14            .301*   .268    .237            .194
15            .406*   .221    .220            .152
16            .301*   .268    .237            .194
17            .200    .299*   .253            .249
18            .200    .299*   .253            .249
19            .406*   .221    .220            .152
20            .301*   .268    .237            .194
All groups    .796*   .114    .088            .002

Note. The asterisk indicates the highest probability in each row.
Because the results depend on knowing B, it must be estimated. This involves estimating both μ and σ, since it is known that x_t = .75. Using the total sample of 60 individual judgments, we can estimate μ and σ by the sample mean (X̄) and unbiased sample standard deviation (SD). For our data, X̄ = 1.02 and SD = .638. Therefore, our best estimate of B is .42. We now convert the group consensus to d'_g because our results are all in terms of a standardized distribution:

d'_g = |x_g − x_t| / SD.

For our data, d'_g = .31. Because we know N to be 3 and B to be .42, we can substitute d'_g into Equations 15-18 to obtain the likelihoods. When this is done, the results can be put into Equation 11 to obtain the posterior probability of each model given the group d'_g value. For our example, we have done this assuming that the prior probability of each model is .25 (see above discussion); thus, the posterior probabilities are .328, .257, .232, and .182 for the best, mean, proportional, and random models, respectively. Therefore, for a group consensus of .55 million, the probability that a result this good could come from the best model is highest, although there is substantial probability that this result could have come from the other models. In Table 3 we present the posterior probabilities for the 20 groups individually. We also present the posterior probability over the 20 groups; that is, because the groups interacted independently of the others, we can assume independence and multiply the individual likelihoods to obtain the joint likelihood over all groups:

L_k = Π_{g=1}^{M} p(d'_g | model k),

where L_k = likelihood of the kth model over all groups and g = 1, 2, ..., M. These values can then be substituted into Equation 11 to obtain the overall probability of the kth model given the data.[8]

[8] A computer program has been written (in BASIC) that will print out the posterior probabilities for each group and the posterior probability over all groups. The input needed for the program is simply x_g, X̄, x_t, and SD. A listing of the program is available from the authors.
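A short sketch (ours, not the authors' BASIC program) that plugs B = .42, N = 3, and d'_g = |.55 − .75|/.638 into Equations 15-18 and normalizes under equal priors gives posterior probabilities that agree closely with the .328, .257, .232, and .182 reported above.

```python
from math import erf, exp, pi, sqrt

def F(z):   # cumulative standard normal distribution
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def f_N(z): # standard normal density
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def likelihoods(d, B, N):
    f_random = f_N(B + d) + f_N(B - d)                                    # Equation 15
    f_mean = sqrt(N) * (f_N(sqrt(N) * (B + d)) + f_N(sqrt(N) * (B - d)))  # Equation 16
    frac = F(B + d) - F(B - d)                                            # Equation 8 (d'_%)
    f_best = N * (1.0 - frac) ** (N - 1) * f_random                       # Equation 17
    f_prop = (2.0 * f_random / (N + 1)) * (N - (N - 1) * frac)            # Equation 18
    return {"best": f_best, "mean": f_mean, "proportional": f_prop, "random": f_random}

x_g, x_t, SD = 0.55, 0.75, 0.638
d_g = abs(x_g - x_t) / SD                 # about .31
like = likelihoods(d_g, B=0.42, N=3)
total = sum(like.values())                # equal priors of .25 cancel out
print({k: round(v / total, 3) for k, v in like.items()})
```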
Examination of Table 3 reveals that the
posterior probability for the best model is
highest for 15 of the 20 groups. This result is
perhaps surprising in that the best model could
be considered a kind of upper limit on group
performance. In order to check this result, we
looked at the raw data and did indeed find
that for nine groups the group consensus was
at least as good as the best person in the group
(for three groups the consensus was better than
the best person). However, note that for two
groups, Numbers 11 and 12, the model that
seems to best describe the quality of the
judgment is the random model. Overall, the
posterior probability of the best model is
considerably higher than the other models (the
posterior odds of the best to the mean are
almost 7:1, best to proportional 9:1, best to
random 398:1). Although we realize that predicting the population of cities is not a task
from which one can generalize, the data do
illustrate how the theory and method can be
used to analyze actual group data in terms of
the four baseline models.
Discussion
We discuss our results in terms of four
general areas.
1. We began this paper by discussing the
research on groups versus individuals and
staticized groups. Our theoretical analysis
offers insight into why the experimental literature in these two areas has led to conflicting
results. Because previous researchers did not
explicitly consider the effects of standardized
bias and group size, exceptions to "general
rules" were always found. Our results indicate
that B and N are crucial determinants in
considering whether individuals perform better
than actual groups or staticized groups. Therefore, at the very least, our models make it
clear that these issues will not be settled experimentally. What is amenable to experimental study are the variables that affect B,
whether they are task and/or individual
factors. We know very little about this,
although work dealing with the biases of
judges in probabilistic situations is potentially
relevant (Tversky & Kahneman, 1974). Furthermore, it is important to know the empirical
distribution of B over varying tasks because
this has great practical importance. For example, if B is large, use of the proportional
strategy, where the group decides to follow the
opinion of one member, is to be preferred to
the group mean.
2. Given dg and estimates of /* and <r, we
can determine the posterior probability of each
baseline model given the data. However, our
results are in terms of a particular population
of individual judgments. Just how this population is defined is of crucial importance. Consider two populations, one of experts and the
other of nonexperts. This is shown in Figure 8.
We have drawn Figure 8 so that the mean of
the expert distribution is at the true score and
the mean of the nonexpert distribution is far
from the true score (one criterion for expertise
may be that B is small, cf. Einhorn, 1974).
Also, we have drawn the distribution of expert
judgment to have a smaller variance. Although
we do not know the actual distributions of
experts and nonexperts, it should be clear that
our results are relative to the particular distributions in the population. We consider this
an advantage because it allows for comparisons
to be made across populations—for example, a
group of experts performing at the level of the
random model may be better than a group of
nonexperts performing at the level of the best
model. Such comparisons are certainly legitimate and can easily be investigated by our approach.

[Figure 8. Distribution of expert and nonexpert judgments.]

The relativity of our results is also useful from a psychological point of view. For
example, Steiner (1966, 1972) has stated that
actual group performance equals potential performance minus process losses. Because different groups will have different "potentials"
(under given circumstances), it seems useful to
define quality in terms of the upper limit of
performance. However, if other populations
are available, cross-comparisons can be made.
3. Given the same population, one could
examine variables that might affect quality
of performance (as defined by the baseline
models). For example, consider that we wish
to compare the performance of Delphi and
face-to-face groups. This could be done experimentally, yet the question would remain
as to how well the better group did (in an
absolute sense). It might be that Delphi groups
perform better than face-to-face groups, but
the level of performance might be what one
would expect by using the mean model. In
this case, the superiority of one method over
another would not be so impressive. Therefore,
when comparisons are made between competing methods, the baseline models can be used
to assess the "winners."
4. Finally, the models we proposed have
several interesting implications for future
research. Although we have only dealt with
the quality of group judgment, one could use
these models (appropriately modified) as representing how group judgment is made
(independently of how accurate it is). Second,
if groups do try to weight their better members,
it is an interesting empirical question as to the
relationship of p_{1,N} and group size. Third, our
results have potential use in a normative sense.
For example, consider that in a particular
group, one member gives a judgment that is
very discrepant from the judgments of the
other members. The tendency of the majority
may be to ignore the discrepant opinion
(weight it zero). However, if the majority
opinions were too high relative to xt, the inclusion of the discrepant opinion (if it was
below xt) might improve the accuracy of the
group judgment. In fact, there is the real
possibility that equally weighting N "wrong"
judgments could lead to the correct answer.
Whether groups have any ability to make use
of the statistical properties of their judgments
is an interesting and important question that
awaits further research.[9] If groups are not able
to make use of this information, mechanically
combining individual judgments might be
called for in order to improve judgmental
accuracy.
Our hope is that the above theoretical and
methodological results will help to stimulate
more research in the area of group judgment.
Although psychologists have traditionally been
mainly interested in how groups behave, we
feel that more concern with the quality of
judgment should lead to both theoretical and
methodological insights that will bear on both
the descriptive and normative aspects of
judgment.
[9] It is interesting to speculate whether groups are
able to apply negative weights to individual judgments.
It seems to us that this would be very difficult for a
group to do. The possibility that groups only apply
zero or positive weights makes the use of equal weighting strategies even more effective (see Einhorn &
Hogarth, 1975).
References

Cramér, H. Mathematical methods of statistics. Princeton, N.J.: Princeton University Press, 1951.

Dalkey, N. C. The Delphi method: An experimental study of group opinion (RM-5888-PR). Santa Monica, Calif.: The Rand Corporation, 1969. (a)

Dalkey, N. An experimental study of group opinion: The Delphi method. Futures, 1969, 1, 408-426. (b)

Dalkey, N., & Helmer, O. An experimental application of the Delphi method to the use of experts. Management Science, 1963, 9, 458-467.

Dawes, R. M., & Corrigan, B. Linear models in decision making. Psychological Bulletin, 1974, 81, 95-106.

Deutsch, M., & Gerard, H. B. A study of normative and informational social influences upon individual judgment. Journal of Abnormal and Social Psychology, 1955, 51, 629-636.

Einhorn, H. J. Expert judgment: Some necessary conditions and an example. Journal of Applied Psychology, 1974, 59, 562-571.

Einhorn, H. J., & Hogarth, R. M. Unit weighting schemes for decision making. Organizational Behavior and Human Performance, 1975, 13, 171-192.

Gordon, K. A study of esthetic judgments. Journal of Experimental Psychology, 1923, 6, 36-43.

Hogg, R. V., & Craig, A. T. Introduction to mathematical statistics (2nd ed.). New York: Macmillan, 1965.

Kidd, J. B. The utilization of subjective probabilities in production planning. Acta Psychologica, 1970, 34, 338-347.

Klugman, S. F. Group and individual judgments for anticipated events. Journal of Social Psychology, 1947, 26, 21-28.

Lorge, I., Fox, D., Davitz, J., & Brenner, M. A survey of studies contrasting the quality of group performance and individual performance, 1920-1957. Psychological Bulletin, 1958, 55, 337-372.

Maier, N. R. F. Assets and liabilities in group problem solving: The need for an integrative function. Psychological Review, 1967, 74, 239-249.

Sackman, H. Delphi assessment: Expert opinion, forecasting, and group process (R-1283-PR). Santa Monica, Calif.: The Rand Corporation, 1974.

Schlaifer, R. Probability and statistics for business decisions. New York: McGraw-Hill, 1959.

Slovic, P. From Shakespeare to Simon: Speculation—and some evidence—about man's ability to process information. Oregon Research Institute Bulletin, 1972, 12(12), 1-29. (a)

Slovic, P. Psychological study of human judgment: Implications for investment decision-making. Journal of Finance, 1972, 27, 779-799. (b)

Steiner, I. D. Models for inferring relationships between group size and potential group productivity. Behavioral Science, 1966, 11, 273-283.

Steiner, I. D. Group process and productivity. New York: Academic Press, 1972.

Steiner, I. D., & Rajaratnam, N. A model for the comparison of individual and group performance scores. Behavioral Science, 1961, 6, 142-147.

Stroop, J. B. Is the judgment of the group better than that of the average member of the group? Journal of Experimental Psychology, 1932, 15, 550-560.

Taylor, D. W. Problem solving by groups. Proceedings of the 14th International Congress of Psychology, 1954. Amsterdam: North-Holland Publishing, 1954.

Tversky, A., & Kahneman, D. Judgment under uncertainty: Heuristics and biases. Science, 1974, 185, 1124-1131.

Zajonc, R. B. A note on group judgments and group size. Human Relations, 1962, 15, 177-180.
Appendix A
We wish to derive E(d') and var(d').

E(d') = ∫ |x̄'_N − B| f(x̄'_N) dx̄'_N.  (A1)

This can be divided into two parts without the absolute operator (viz., when x̄'_N < B and x̄'_N > B):

E(d') = ∫_{−∞}^{B} (B − x̄'_N) f(x̄'_N) dx̄'_N + ∫_{B}^{+∞} (x̄'_N − B) f(x̄'_N) dx̄'_N  (A2)

= B ∫_{−∞}^{B} f(x̄'_N) dx̄'_N − ∫_{−∞}^{B} x̄'_N f(x̄'_N) dx̄'_N + ∫_{B}^{+∞} x̄'_N f(x̄'_N) dx̄'_N − B ∫_{B}^{+∞} f(x̄'_N) dx̄'_N.  (A3)

Because x̄'_N is normally distributed with μ = 0 and σ = 1/√N, we consider the variable u = √N x̄'_N, with du = √N dx̄'_N. Since u is x̄'_N multiplied by a constant (√N), it will be distributed normally with μ = 0 and σ = 1 (i.e., √N/√N). Therefore, the distribution of u is unit normal. Multiplying the end points of the integrals by √N, Equation A3 becomes, by variable transformation,

E(d') = B ∫_{−∞}^{√N B} f(u) du − (1/√N) ∫_{−∞}^{√N B} u f(u) du + (1/√N) ∫_{√N B}^{+∞} u f(u) du − B ∫_{√N B}^{+∞} f(u) du.  (A4)

Terms a and d are expressed in terms of cumulative normal distributions, that is, a = B F(√N B) and d = B[1 − F(√N B)]. Terms b and c are the partial expectations of a normal distribution. For a unit normal distribution, they are (see Schlaifer, 1959, p. 300)

∫_{−∞}^{g} u f(u) du = −f_N(g)  and  ∫_{g}^{+∞} u f(u) du = f_N(g).

Combining all terms yields

E(d') = B[2F(√N B) − 1] + (2/√N) f_N(√N B).  (A5)

The variance of d' is given by

var(d') = E(d'²) − [E(d')]².  (A6)

Because x̄'_N − B has variance 1/N,

E(d'²) = E[(x̄'_N − B)²] = (1/N) + B².  (A7)

Therefore,

var(d') = (1/N) + B² − [E(d')]².  (A8)
Appendix B
We wish to derive the probability density function for the proportional model, f_p(d').

f_p(d') = Σ_{i=1}^{N} p_{i,N} f(d'_{i,N}),  (B1)

where

p_{i,N} = 2(N + 1 − i)/[(N + 1)N]

and

f(d'_{i,N}) = {N!/[(i − 1)!(N − i)!]} (d'_%)^(i−1) (1 − d'_%)^(N−i) f(d'_1)

(see Hogg & Craig, 1965, p. 173). Therefore,

f_p(d') = Σ_{i=1}^{N} [2(N + 1 − i)/((N + 1)N)] {N!/[(i − 1)!(N − i)!]} (d'_%)^(i−1) (1 − d'_%)^(N−i) f(d'_1).  (B2)

Let j = i − 1 and M = N − 1. Then Equation B2 becomes

f_p(d') = [2 f(d'_1)/(N + 1)] Σ_{j=0}^{M} (N − j) {M!/[j!(M − j)!]} (d'_%)^j (1 − d'_%)^(M−j).  (B3)

The binomial distribution of j successes in M trials, with probability of success d'_%, is given by

b(j | M, d'_%) = {M!/[j!(M − j)!]} (d'_%)^j (1 − d'_%)^(M−j).  (B4)

Substituting Equation B4 into Equation B3 yields

f_p(d') = [2 f(d'_1)/(N + 1)] Σ_{j=0}^{M} (N − j) b(j | M, d'_%)  (B5)

= [2 f(d'_1)/(N + 1)] [N Σ_{j=0}^{M} b(j | M, d'_%) − Σ_{j=0}^{M} j b(j | M, d'_%)].  (B6)

However,

Σ_{j=0}^{M} b(j | M, d'_%) = 1  and  Σ_{j=0}^{M} j b(j | M, d'_%) = M d'_%.  (B7)

Therefore,

f_p(d') = [2 f(d'_1)/(N + 1)] [N − (N − 1) d'_%].  (B8)
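As a numerical check of the algebra (ours, not part of the original appendix), the closed form of Equation B8 can be compared with the direct sum of Equation B1 at an arbitrary point; the two expressions agree to machine precision.

```python
from math import comb, erf, exp, pi, sqrt

def F(z):   # cumulative standard normal distribution
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def f_N(z): # standard normal density
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def fp_direct(d, B, N):
    """Equation B1: sum of p_{i,N} times the order-statistic densities."""
    f1 = f_N(B + d) + f_N(B - d)          # f(d'_1), Equation 15
    frac = F(B + d) - F(B - d)            # d'_%, Equation 8
    total = 0.0
    for i in range(1, N + 1):
        p_iN = 2.0 * (N + 1 - i) / ((N + 1) * N)                   # Equation 9
        order_density = N * comb(N - 1, i - 1) * frac ** (i - 1) * (1 - frac) ** (N - i) * f1
        total += p_iN * order_density
    return total

def fp_closed(d, B, N):
    """Equation B8: closed-form density for the proportional model."""
    f1 = f_N(B + d) + f_N(B - d)
    frac = F(B + d) - F(B - d)
    return (2.0 * f1 / (N + 1)) * (N - (N - 1) * frac)

print(round(fp_direct(0.31, 0.42, 3), 6), round(fp_closed(0.31, 0.42, 3), 6))  # identical
```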
Received October 29, 1975