Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Psychological Bulletin 1977, Vol. 84, No. 1, 158-172 Quality of Group Judgment Hillel J. Einhorn Robin M. Hogarth Graduate School of Business University of Chicago Institut Europeen d'Administration des Affaires Fontainebleau, France Eric Klempner Graduate School of Business University of Chicago The quality of group judgment is examined in situations in which groups have to express an opinion in quantitative form. To provide a yardstick for evaluating the quality of group performance (which is itself defined as the absolute value of the discrepancy between the judgment and the true value), four baseline models are considered. These models provide a standard for evaluating how well groups perform. The four models are: (a) randomly picking a single individual; (b) weighting the judgments of the individual group members equally (the group mean); (c) weighting the "best" group member (i.e., the one closest to the true value) totally where the best is known, a priori, with certainty; (d) weighting the best member totally where there is a given probability of misidentifying the best and getting the second, third, etc., best member. These four models are examined under varying conditions of group size and "bias." Bias is denned as the degree to which the expectation of the population of individual judgments does not equal the true value (i.e., there is systematic bias in individual judgments). A method is then developed to evaluate the accuracy of group judgment in terms of the four models. The method uses a Bayesian approach by estimating the probability that the accuracy of actual group judgment could have come from distributions generated by the four models. Implications for the study of group processes and improving group judgment are discussed. Consider a group of size N that has to arrive at some quantitative judgment, for example, a sales forecast, a prediction of next year's gross national product, the number of bushels of wheat expected in the next quarter, and the like. Given the prevalence of such predictive activity in the real world, it is clearly important to consider how well groups can and do perform such tasks, as well as to consider strategies that may be used to improve performance. In this paper we address the issue of defining the quality of group judgment and assess the effects and limitations on judgmental quality of different strategies for combining opinions under a variety of circumstances. First, we define quality of performance in terms of how close the group judgment is to the true (actual) value being predicted once it is known. We then consider the differential expected quality of performance of different baseline models, that is, how well would groups perform if they formed their judgments according to a number of different assumptions. However, it is shown that in many circumstances the baseline performances expected of the different models are quite similar. We therefore present, and illustrate, a statistical procedure for considering which baselines are appropriate for evaluating the quality of group judgment in empirical studies. The conceptualization and procedures presented here do, we believe, have considerable potential This research was supported by a grant from the for illuminating the often seemingly contraSpencer Foundation. We would like to thank Sarah Lichtenstein for her dictory results in the literature on the accuracy insightful comments on an earlier draft of this paper of group judgment, as well as for setting up and Ken Friend for making his data available to us. standards for comparing quality of group Requests for reprints should be sent to Hillel J. judgment both within and between different Einhorn, Graduate School of Business, University of Chicago, Chicago, Illinois 60637. populations of groups. 158 QUALITY OF GROUP JUDGMENT The earliest standard used in comparing group judgment was the individual; that is, given a group judgment and N individual judgments, the accuracy of the group judgment was compared with the various individual judgments. One could then determine if the group was performing at the level of its best, second best, etc., member (cf. Taylor, 1954; Steiner & Rajaratnam, 1961). The results of such studies have been summarized by Lorge, Fox, Davitz, and Brenner (1958): "At best group judgment equals the best individual judgment but usually is somewhat inferior to the best judgment" (p. 348). Although groups do not seem to perform at the level of their best member (which is, after all, denned after the true value is known), the question remains as to how well groups can identify and weight their better members before the true value becomes known. A second and related line of research, using judgments made in simple laboratory tasks (such as estimating the number of beans in a jar), has dealt with staticized groups. Staticized refers simply to an average of a number of individual judgments (or even one person's judgment given many times). Those averages have been compared to individual judgments in terms of accuracy (Gordon, 1923; Stroop, 1932; Zajonc, 1962). Results have shown that the average judgment is more accurate than most individual judgments (there have been exceptions, see Klugman, 1947). However, comparisons have rarely been made between staticized groups and actual groups because the emphasis of this line of research has been on groups versus individuals. A third line of research, developed outside the field of psychology, deals with the potential advantages that can result from the pooling of individual judgments by a systematic statistical procedure. The method used has been called the "Delphi" technique (Dalkey, 1969b; Dalkey & Helmer, 1963). The general idea is to try to produce a consensus of opinion through statistical feedback (usually the median of the individual judgments). Furthermore, the group does not meet in a face-to-face format, since it is contended that social interaction causes biases that adversely affect group performance. Although more experimental evidence is needed on this point (Dalkey, 1969a; 159 Sackman, 1974), the Delphi technique explicitly recognizes the possibility that actual groups may be performing below some statistical standard. A Baseline Approach In conceptualizing how well groups perform specific tasks, Steiner (1966, 1972) has identified three critical factors: (a) the type of task with which the group is faced, (b) the resources at the group's disposal (i.e., the expertise of the different group members), and (c) the process used by the group. In the kinds of judgmental tasks considered here, we conceptualize the group judgment as a weighting and combining of the judgments of the individual members. Thus, one crucial issue is the process used by the group to allocate weights to the opinions of the different members and the extent to which various strategies for weighting opinions have important effects on the quality of judgment in different circumstances. Steiner (1972) listed four reasons why groups may not, in fact, weight their individual members appropriately: (a) failure of status differences to parallel the quality of the contributions offered by participating members; (b) the low level of confidence proficient members sometimes have in their own ability to perform the task; (c) the social pressures that an incompetent majority may exert on a competent minority; (d) the fact that the quality of individual contributions is often very difficult to evaluate, (pp. 38-39) Whether groups do misweight in actuality, and with what frequency, is an empirical matter. However, before one can conclude that a group is misweighting, there must be some standard against which to compare its performance. The approach taken here is to develop baseline models founded on assumptions made about group processes. These models are not meant to describe what groups actually do, they simply say that if groups were to do such and such, then a certain level of performance would result. Although the four models considered differ greatly in what they assume about group process, it is instructive to compare them under a wide variety of circumstances. 160 H. J. EINHORN, R. M. HOGARTH, AND E. KLEMPNER The first model consists in assuming that the group picks one member at random and uses that judgment as the group judgment. Intuitively we would expect such a model to yield a low level of performance because it assumes that the group lacks any ability to identify and weight its better members appropriately. However, it is possible that actual group judgment may be no better than this strategy. The random model is discussed at greater length when considering our second model. The second baseline model involves weighting each individual judgment equally, that is, by \/N. This model is equivalent to using the average of the individual judgments as a comparison for actual group judgment. It must, of course, be remembered that we are not interested in whether the group average is a good representation of the group judgment but rather whether the group average is as accurate as the group judgment. This is a crucial distinction that must be kept in mind in discussing all of our models. There are several reasons for considering the equal weight model: (a) The equal weight model can be thought of as representing individual members' weights before discussion takes place; that is, before information concerning perceived expertise is obtained, the group treats each member equally. Therefore, equal weighting provides an interesting baseline with which to compare groups' abilities to allocate weights on a differential basis, (b) Recent research (cf. Dawes & Corrigan, 1974; Einhorn & Hogarth, 1975) has shown that an equal weight model can outpredict differential weight models under a wide variety of circumstances. Part of the reason for this is that equal weights cannot reverse the relative weighting of the variables. For example, it is better to weight all group members equally than to assign high weights to those with poor judgment. Therefore, if groups reverse the relative weights (due to nonvalid social cues that influence perceived expertise), an equal weight model can be expected to perform better. If groups actually perform worse than would be expected on the basis of an equal weight model, it may suggest that group discussion is dysfunctional with respect to the assignment of weights, (c) The mean of a random variable has certain desirable statistical qualities. For example, consider that each individual judgment contains the true value of the phenomenon to be predicted plus a random error component. If this is the case, the expectation of the individual scores will be the true score. Furthermore, the expectation of the means of groups of size N drawn from the distribution of individual scores will be equal to the true score and the variance of this distribution will be less than the distribution of individual scores. Therefore, using the group mean will result in a "tighter" distribution around the true value—a situation that is most beneficial and of great practical importance. The merits of the preceding argument depend on the assumption that each individual judgment can be divided into a true value plus a random error component. However, when dealing with human judgment in complex tasks (such as predicting sales, judging guilt or innocence, etc.), we feel that systematic biases may enter into judgment in ways different from laboratory tasks. The former situations differ from the latter in at least two respects: (a) The definition of the stimulus is more ambiguous and subject to diverse influences. This means that the information on which judgments are based may differ among individuals. Furthermore, in such conditions of stimulus ambiguity, there is much research that indicates that individual judgments are biased by social pressures (Deutsch & Gerard, 1955). (b) Because a large and diverse set of information has to be processed, it is quite likely that erroneous assumptions, biases, and other constant errors will be made. Recent psychological work (e.g., Slovic, 1972a; Tversky & Kahneman, 1974) has shown that the human's limited information processing ability leads him to make systematic errors in judgment. Moreover, these biases seem to be widespread and applicable to "experts" as well as to novices (Kidd, 1970; Slovic, 1972b). Given the questionable assumption of random error in individual judgments, we examine each of our baseline models under varying amounts of bias (this is defined formally in the next section). The third model we consider is the following: Assume that through group discussion, the group is able to identify its best member with QUALITY OF GROUP JUDGMENT 161 distance between xt and M, that is, b = (xt—n). (1) We call the second the standardized bias because it is the distance between xt and n measured in terms of the population standard deviation, that is, x.-N(M,o) x -N(u,oAN) N /. Distribution of individual judgments and group means. certainty (i.e., the group can determine which member's judgment will be closest to the true value). In such a situation, a sensible strategy would clearly be to give all the weight to the "best" judgment and none to the remaining N — 1 members. Although it is possible for the actual group judgment, or the group mean, to be closer to the true value on any trial, this is unlikely to be the case on average. Therefore, we compare the random and mean models with the best model. Our final model takes cognizance of the fact that groups will find it extremely difficult to identify their best member with certainty. That is, what happens if the group can be mistaken as to the identity of the best member? In other words, how well will the group perform if it only has a certain probability of picking the best person? We denote this model as the "proportional" model and compare it to the three models discussed above. We now turn to a formal development of the models. B = (*, - M)A . Now consider that we sample N individuals from the Xj distribution and calculate their mean (Xn/). The result can be considered as a drawing from the sampling distribution of the mean with mean of /x and standard deviation of a/-\N. This is shown in the lower half of Figure 1. The important point to note is that xt is further out in the tail of the distribution of means than in the original distribution. Moreover, as group size increases, the variance of the distribution of means decreases; therefore, xt will be even further out in the tail area. The implication here is that the probability of being close to xt would then decrease.1 It is clear that the standardized bias (B), as well as group size (N), will affect the quality of performance of both the mean and random models (the latter being, of course, equivalent to sampling a single observation from the x,distribution). We now turn to a more complete consideration of these models under varying amounts of B and N. We first need to define the quality of any particular judgment. We do this by defining quality as being synonymous with accuracy.2 If we wish to be as close to xt as possible and feel that it makes no difference whether we are above or below xt, then an appropriate measure of accuracy is given by d = \XK-xt\ . The Models We begin by considering a population distribution of individual judgments. Let Xj be the judgment of the j'th person and assume that judgments are normally distributed with E(XJ) = n and varfe) = a-2. The true value to be predicted is denoted by xt. The distribution is shown in Figure 1, in which we have drawn xt so that it does not coincide with the mean of the individual judgments. We now define two measures of bias. The first is simply the (2) (3) Note that when N = 1, Equation 3 expresses the accuracy of any individual judgment.3 In 1 However, the probability of being very far from xt also decreases. 2 We realize that certain writers (cf. Maier, 1967) have denned the effectiveness of group problem solving as a function of both the quality of the solution and its acceptability by the group members. We do not deal with the acceptability issue in this paper. 3 A more general form for Equation 3 would be d = ct\Xn — xt for a > 0. However, because we are 162 H. J. EINHORN, R. M. HOGARTH, AND E. KLEMPNER order not to confuse these meanings of d, we denote di as being the case when N = 1 and d for all other values of N. Clearly, the smaller d is, the greater the accuracy. The absolute value operator is used because we assume that being above or below xt incurs the same cost (symmetric loss function). In order to evaluate the effects of group size and standardized bias on d, we examine the expected value and variance of d under varying combinations of B and N. Therefore, we look at d "on average" as well as its dispersion. In order to calculate E(d\ B, N) and var(d\ B, N), we must examine the distribution of d. By way of illustration, we assume N = 1. Consider Figure 1 again. To obtain the distribution of d, assume that we can "fold" the Xj distribution at xt so that the area previously lying to the left of xt now lies to its right. This procedure yields the distribution that results when xt is subtracted from Xj and the absolute value is taken (e.g., when xt = n, we get "half" of a normal distribution). Note that it does not matter whether x, is below or above the mean because the same d distribution will result. Figure 2 shows the distribution of d when xt is at the value shown in Figure 1 . The upper part of Figure 2, (a) , shows the effect of folding over the Xj distribution from left to right at xt. The shaded area refers to the tail area that was to the right of xt. This tail will begin at d = 0 (where Xj = xt). The lower part of Figure 2, (b), shows the distribution of d when the tail area is added to the distribution truncated at xt. Before deriving E(d) and var(rf), however, we deal with a standardized distribution of individual judgments, which will simplify our discussion. However, in order to distinguish between results on the basis of standardized and original units, we use primes to denote that the variables have been standardized in terms of the population of individual scores. Therefore, f(d) oo f(d) 03 Figure 2. (a) "Folded" distribution of Xj at x>. (b) Distribution of d for given xt. and '= X'N-x't\ It is important to note that d = ad'. This means that any results using d' can be converted to original units by multiplying by the population standard deviation. We first wish to find the unconditional expectation of d'. This is given by E(d') = f+X | X'N - x't | f(X'N)dx . (4) It is shown in Appendix A that this is E(d') = (5) where F = cumulative normal distribution and fN = ordinate of normal distribution. We can also determine the variance of d', namely, var(rf') = E(\X'N - B\? - \_E(d')J. (6) It is shown in Appendix A that this is X'N = (XN x't = (*, - it) B, only interested in the relative differences between the models, we may conveniently work with the special case of a = 1 without loss of generality. var(d') = (1/AO + W- lE(d')J. (7) In order to examine the effects of B and N on Equations 5 and 7, we have calculated E(d' B, N) and va.r(d' B, N) for the following values of B and N: B = 0, .5, 1, 1.5, 2, 2.5, QUALITY OF GROUP JUDGMENT 163 Table 1 E(d') and var(&') for Varying Levels of B and N N B 0 .5 1.0 l.S 2.0 2.5 3.0 .798 (.364) .896 (.448) 1.167 (.607) 1.559 (.821) 2.017 (.931) 2.504 (.980) 3.001 (.996) .564 (.181) .700 (.260) 1.050 (.397) 1.509 (.475) 2.001 (.496) 2.500 (.500) 3.000 (.500) .461 (.121) .623 (.194) 1.020 (.294) 1.502 (.328) 2.000 (.333) 2.500 (.333) 3.000 (.333) .399 (.091) .583 (.160) 1.008 (.233) 1.500 (.249) 2.000 (.250) 2.500 (.250) 3.000 (.250) .357 (.073) .559 (.138) 1.004 (.192) 1.500 (.200) 2.000 (.200) 2.500 (.200) 3.000 (.200) .326 (.061) .544 (.121) 1.002 (.163) 1.500 (.166) 2.000 (.166) 2.500 (.166) 3.000 (.166) .282 (.045) .525 (.099) 1.000 (.124) 1.500 (.125) 2.000 (.125) 2.500 (.125) 3.000 (.125) 10 12 16 .252 (.036) .515 (.085) 1.000 (.010) 1.500 (.010) 2.000 (.010) 2.500 (.010) 3.000 (.010) .230 (.030) .510 (.073) 1.000 (.084) 1.500 (.084) 2.000 (.084) 2.500 (.084) 3.000 (.084) .199 (.023) .504 (.058) 1.000 (.063) 1.500 (.063) 2.000 (.063) 2.500 (.063) 3.000 (.063) Note. N — 1 is the random model. The numbers in parentheses represent var(d'). and 3; N = 1, 2, 3, 4, 5, 6, 8, 10, 12, and 16.4 These results are shown in Table 1. There are four main results in Table 1: (a) as B (i.e., x't) increases, for any given N, E(d') and var(d') increase. As would be expected, the greater the standardized bias in the population, the poorer the mean model does; (b) as N increases, E(d') decreases, but when B ^ 1.0, the decrease is small. Note that although E(d') does not decrease much under these conditions, var(d') does; (c) when N — 1 (the random model), both E(d'i) and var(rf'0 are higher than any other group size. Therefore, the random model will do worse than the mean model on average. However, when B ^ 1.0, the expected value of d' is similar for these two models;6 (d) as N increases, E(d') approaches B. We now turn to the models for the best and proportional strategies. Consider the d\ distribution for a given B. Furthermore, let us randomly sample N individuals from this distribution and order their d\ scores (from lowest to highest). To make use of this ordering, we can use the following result from order statistics: If groups of size N are randomly assembled from the d'\ distribution and the members are ordered according to their d'i scores, on average, the members will divide the population distribution into N + 1 equal parts (cf. Hogg & Craig, 1965; Steiner & Rajaratnam, 1961). This means, for example, that four-person groups will, on average, have members that fall at the 20th, 40th, 60th, and 80th fractiles of the d'\ distribution. Therefore, on average, the best member of a fourperson group will perform better than 80% of the population (i.e., will be at the 20th fractile of the d\ distribution). Let us denote the ith best score in a group of size N as </',-,#. Therefore, d'\,± would be the best score in a group of size four. We wish to determine the expectation of d'i,n for various combinations of B and N. We use an approximation here to calculate this expected value. The sampling distribution of the ith best person is asymptotically normal with mean at the fractile corresponding to the ith best (Crame'r, 1951). For example, the mean of the best person in a group of size four will fall at approximately the 20th fractile of the d\ distribution (note that because smaller d\ values are more desirable, we use the 20th fractile 4 Although we only use positive values of B, it is the case that negative values of B yield identical results. Therefore, the absolute value of B is the important determinant of E(d'). 'Under the loss function, "a miss is as good as a mile," the random model actually has a lower E(d') than the mean model. This occurs because the probability of being at x, for the mean model is smaller if xt 7* 0. The fact that the mean model has a lower probability of being further away from xt is immaterial under this loss function. 164 H. J. EINHORN, R. M. HOGARTH, AND E. KLEMPNER d\ distribution will have a distance of d' from the origin. Therefore, to determine the value of d' that corresponds to any given fractile of the d'i distribution, one needs the area that d' cuts off. This can be found using the normal distribution by noting that d'% = F(B + d') - F(B - d') , Figure 3. Distribution of x'j and folded distribution of d'i around B showing d' distance. rather than the 80th). Therefore, if we can find the d\ score that corresponds to the appropriate fractile, we can find E(d'i^). Consider Figure 3, which shows the parent distribution of x'j. A distance of d' (corresponding to x\ and x'2) is shown around B. When the distribution is folded at B, the resulting (8) where d'% = fractile of the d\ distribution and F = cumulative normal distribution. An iterative computer program has been written that yields appropriate values of d1 for any given fractile of the d\ distribution. Conversely, for any given d' value, one can obtain the fractile of the d'i distribution. Before presenting the results of E(d'\1x) for various values of B and N, we consider our fourth model, the proportional model. In order to formally deal with this model, it is necessary to introduce a new variable, pi.tf- This is the probability of identifying the ith best person in a group of size N as the best. Therefore, pi,4 would denote the probability of correctly identifying the best person in a group of size four as the best. It seems likely that pi,ti will be affected by the size of the group (i.e., it would seem to be easier to correctly identify the best person in a group of size 4 than in a group of size 16). In order to incorporate this into our model, we assume here that p{,x is inversely proportional to each member's rank Table 2 E(d'i, N ) and E p (d') for Values of B and N N 0 .5 1.0 1.5 2.0 2.5 3.0 .798 (.798) .895 (.895) 1.166 (1.166) 1.557 (1.557) 2.016 (2.016) 2.503 (2.503) 3.000 (3.000) .467 (.687) .527 (.772) .719 (1.017) 1.044 (1.386) 1.469 (1.834) 1.944 (2.317) 2.438 (2.813) .335 (.632) .379 (.711) .527 (.942) .806 (1.301) 1.202 (1.742) 1.666 (2.224) 2.157 (2.719) .262 (.599) .297 (.674) .418 (.898) .662 (1,250) 1.033 (1.688) 1.487 (2.168) 1,976 (2.663) .216 (.577) .244 (.650) .347 (.868) .564 (1.215) .914 (1.651) 1.357 (2.130) 1.843 (2.626) Note. The numbers in parentheses represent Ef(d'). .183 (.562) .208 (.632) .296 (.846) .492 (1.191) .823 (1.625) 1.257 (2.104) 1.740 (2.599) .141 (.541) .160 (.609) .230 (.818) .393 (1.158) .691 (1.590) 1.108 (2.068) 1.586 (2.563) 10 12 16 .115 (.527) .131 (.594) .188 (.800) .328 (1.138) .599 (1.568) 1.000 (2.046) 1.474 (2.540) .097 (.518) .110 (.584) .159 (.788) .281 (1.123) .501 (1.553) .917 (2.030) 1.386 (2.525) .074 (.506) .084 (.571) .122 (.771) .219 (1.105) .433 (1.533) .794 (2.010) 1.254 (2.504) QUALITY OF GROUP JUDGMENT 165 B=0 1.0 E(d') Random .75 Proportional .50 .25 0 1 2 3 4 5 6 8 10 12 SAMPLE SIZE (N) Figure 4. E(d') for the models at B = 0 in the group. For example, in a four-person group, one can consider that there are ten weights to be allocated ( 4 + 3 + 2 + 1 ) . The best person will receive four, the second best three, and so on. Subsequently, the weights must be divided by their sum in order to normalize them. The probabilities allocated under such a scheme are given by pi.N=2(N+l-i)/(N+l)N. (9) For example, in a four-person group, the probability of correctly identifying the best person is .4, whereas the probability of identifying the second best as the best is .3, third best as best is .2, and worst as best is .I.6 Of course, the scheme presented is arbitrary. We do not know how well groups actually do identify their "better" members. However, we feel that results from such a model provide a useful benchmark to contrast with the best model, which appears unrealistic. The expected level of performance using the proportional model is EP(d') = L pitNE(d'i,lf) . t-i (10) Table 2 shows the values for £(^'I,AT) and Ep(d') for various levels of B and N. In order to compare the four models, we have plotted E(d') for each model as a function of both standardized bias and group size. These results can be seen in Figures 4, 5, 6, and 7. Consider Figure 4, in which there is no standardized bias. The most important result is that the mean strategy is quite close to the best model. Note further that the proportional model is clearly inferior to the mean, whereas the random model is poorest. Furthermore, when the group size is greater than three, the E(d') values for the models decrease very slowly. This indicates that increasing group size after three does not reduce E(d') greatly. As bias increases, in Figures 5, 6, and 7, the best model begins to improve relative to the others. However, the closeness of the mean and proportional models is particularly interesting. It is not until the standardized bias is around .7 that the proportional model begins to perform better than the mean. Again, the effect of group size is small except for the best strategy. Using the Models The theoretical results shown in Figures 4 through 7 indicate that depending upon B and N, the baseline performance of the various models as represented by E(d') may be quite close. Furthermore, because there is dispersion around the expected levels of baseline performance, in empirical situations it would often be quite difficult to determine the level of performance (i.e., baseline of a particular group or groups). This, in turn, would of course lead to difficulty in judging the quality of group performance. For example, consider the situation in which we have observed a number of group judgments (xg) and can measure the accuracy '' This model should not be confused with a model in which each Xj is given a weight and the weighted Xj& are combined into a group judgment. The proportional model says that only one judgment is to be used as the group judgment. 166 H. J. EINHORN, R. M. HOGARTH, AND E. KLEMPNER B=.5 1.0 E(d') Random .75 Porportional .50 .25 2 3 4 5 6 10 8 16 12 SAMPLE SIZE <N) Figure 5. E(d') for the models at B = .5. of such judgments by da = \xt — xt\ . At what level of performance are these groups performing? We consider that a reasonable manner in which to answer this question is to assess the probability of each of the baseline models given the observed data and any other information we deem relevant. These questions can be answered by treating the problem within the framework of Bayesian statistical inference. Specifically, we need to determine the posterior probability favoring each model, k, given the data, that is, p (model k \ d g ) . This probability can be obtained through Bayes' Theorem as /•(model k \ d a ) />(<£„ | model k)p (model k) £ p(dg\model k)p(model k) ' k where the term p (dg \ model k} is the likelihood of observing dg given the kih model and /•(model k) is the prior probability that the kth model is correct.7 If the investigator has prior knowledge concerning the probability of the different models (based on, for example, theoretical or empirical considerations), then he may assign different prior probabilities to the different models. On the other hand, he may wish to proceed as if he had no prior knowledge and assign equal prior probability to each of the models (i.e., .25). For illustrative purposes, we do this below. However, we note that it is a restrictive prior distribution in that it assumes that only four models are possible. A way around this difficulty is to use the posterior odds form of reporting results and to consider the odds of one model versus another, or all the others. For example, for two models i and j, the posterior odds favoring model i are given by p (model i \ dg) /•(model j \ d g ) p (dg | model i) p (model i} p(dg\ model j) p (model j) (12) which breaks down into the likelihood and prior odds ratios. In this form, the investigator need only consider the relative prior probability of one model against another, or the others. We now develop the likelihood functions for the four models. For the random model, consider Figure 3 again. The values x\ and x'% are equidistant from B. Therefore, when the x'j distribution is folded over, the density of d'i will be the sum of the densities of x'\ and x\ in the nonfolded distribution, that is, f(d\) = M*'x) + M*'2) . (13) However, note that x\ = B + d' (14) and x\ = B - d'. Therefore f(d'i) = MB + d') + fN(B - d') . (15) 7 For those not familiar with the Bayesian approach, the interpretation of the terms in Equation 11 is as follows: p (model k\da) = probability that the kth model could have generated results as good as the given da; />(<*„ | model k) = probability of getting a d, performance level given that d, was generated by the £th model; p (model k) = probability that the &th model generates all dg values. QUALITY OF GROUP JUDGMENT Random 1.25 167 Mean 1.0 E(d") .75 .50 Best .25 0 2 3 5 6 8 10 12 16 SAMPLE SIZE (N) 1.0. Figure 6. E(d') for the models at B Similarly, for the mean model the density function of d' is given by stituted into Equation 11 to yield the posterior probability of each model given the data. To illustrate the above procedures, consider f(d') = *jN(fN^N(B+d/)l the data from an experiment performed at the + fN^N(B - d')]} • (16) University of Chicago. Twenty groups of size three were formed randomly using MBA The density function for d'i,N when i — 1 can (master of business administration) students. be found in Hogg and Craig (1965, p. 173). In The subjects were asked to estimate the our notation, this is metropolitan population (as of the 1970 census) = N(\ (17) of several cities. Here we only consider Washington, D. C. The subjects first estimated the where d'% = fractile that d' cuts off in the d!\ population individually and then met in groups distribution. The density function for the to come to a consensus answer. Therefore, we proportional strategy is more complicated and have 60 individual judgments (x,), 20 group is derived in Appendix B. It is judgments (xa), and the true value (xt = .75 million). In our example, we consider one group [N-(N- 1K%] - (is) answer of .55 million and ask what is the probability that an answer as good as .55 could have Equations 15 through 18 provide the condi- come from each of the baseline models. Because the results depend on knowing B, tional probability of any d' value given the particular model. These can then be sub- it must be estimated. This involves estimating B-3.0 Random and Mean 3.0 2.75 2.50 2.25 EW) 2.0 1.75 1.50 Beat 1.25 1 2 3 4 5 6 7 8 10 12 SAMPLE SIZE (N) •Figure 7. E(d'~) for the models at B =• 3.0. 16 168 H. J. EINHORN, R. M. HOGARTH, AND E. KLEMPNER Table 3 Posterior Probabilities for Models X Groups Model Group 1 2 3 4 S 6 7 8 9 10 11 12 13 14 IS 16 17 18 19 20 All groups Best Mean .406* .249 .406* .328* .354* .431* .354* .406* .381* .301* .085 .015 .354* .301* .406* .301* .200 .200 .406* .301* .796* .221 .287* .221 .257 .245 .209 .245 .221 .233 .268 .271 .115 .245 .268 .221 .268 .299* .299* .221 .268 .114 Proportional Random .220 .245 .220 .232 .228 .216 .228 .220 .224 .237 .282 .325 .228 .237 .220 .237 .253 .253 .220 .237 .088 .152 .219 .152 .182 .172 .144 .172 .152 .162 .194 .362* .545* .172 .194 .152 .194 .249 .249 .152 .194 .002 Note. The asterisk indicates the highest probability in each row. both n and a, since it is known that xt = .75. Using the total sample of 60 individual judgments, we can estimate M and a by the sample mean (X) and unbiased sample standard deviation (SD). For our data, X — 1.02 and SD = .638. Therefore, our best estimate of B is .42. We now convert the group consensus to d'g because our results are all in terms of a standardized distribution: d',= xg-x,\/SD. For our data, d'g = .31. Because we know N to be 3 and B to be .42, we can substitute d'a into Equations 15-18 to obtain the likelihoods. When this is done, the results can be put into Equation 11 to obtain the posterior probability of each model given the group d'0 value. For our example, we have done this assuming that the prior probability of each model is .25 (see above discussion); thus, the posterior probabilities are .328, .257, .232, and .182 for the best, mean, proportional, and random models, respectively. Therefore, for a group consensus of .55 million, the probability that a result this good could come from the best model is highest, although there is substantial probability that this result could have come from the other models. In Table 3 we present the posterior probabilities for the 20 groups individually. We also present the posterior probability over the 20 groups, that is, because the groups interacted independently of the others, we can assume independence and multiply the individual likelihoods to obtain the joint likelihood over all groups : Lk= .! model*), where Lj, — likelihood of &th model over all groups and g = 1,2 ..... M. These values can then be substituted into Equation 1 1 to obtain the overall probability of the £th model given the data.8 Examination of Table 3 reveals that the posterior probability for the best model is highest for 15 of the 20 groups. This result is perhaps surprising in that the best model could be considered a kind of upper limit on group performance. In order to check this result, we looked at the raw data and did indeed find that for nine groups the group consensus was at least as good as the best person in the group (for three groups the consensus was better than the best person). However, note that for two groups, Numbers 11 and 12, the model that seems to best describe the quality of the judgment is the random model. Overall, the posterior probability of the best model is considerably higher than the other models (the posterior odds of the best to the mean are almost 7:1, best to proportional 9:1, best to random 398:1). Although we realize that predicting the population of cities is not a task from which one can generalize, the data do illustrate how the theory and method can be used to analyze actual group data in terms of the four baseline models. Discussion We discuss our results in terms of four general areas. 1. We began this paper by discussing the research on groups versus individuals and 8 A computer program has been written (in BASIC) that will print out the posterior probabilities for each group and the posterior probability over all groups. The input needed for the program is simply xa, X,, xt, and SD. A listing of the program is available from the authors. QUALITY OF GROUP JUDGMENT staticized groups. Our theoretical analysis offers insight into why the experimental literature in these two areas has led to conflicting results. Because previous researchers did not explicitly consider the effects of standardized bias and group size, exceptions to "general rules" were always found. Our results indicate that B and N are crucial determinants in considering whether individuals perform better than actual groups or staticized groups. Therefore, at the very least, our models make it clear that these issues will not be settled experimentally. What is amenable to experimental study are the variables that affect B, whether they are task and/or individual factors. We know very little about this, although work dealing with the biases of judges in probabilistic situations is potentially relevant (Tversky & Kahneman, 1974). Furthermore, it is important to know the empirical distribution of B over varying tasks because this has great practical importance. For example, if B is large, use of the proportional strategy, where the group decides to follow the opinion of one member, is to be preferred to the group mean. 2. Given dg and estimates of /* and <r, we can determine the posterior probability of each baseline model given the data. However, our results are in terms of a particular population of individual judgments. Just how this population is defined is of crucial importance. Consider two populations, one of experts and the other of nonexperts. This is shown in Figure 8. We have drawn Figure 8 so that the mean of the expert distribution is at the true score and the mean of the nonexpert distribution is far from the true score (one criterion for expertise may be that B is small, cf. Einhorn, 1974). Also, we have drawn the distribution of expert judgment to have a smaller variance. Although we do not know the actual distributions of experts and nonexperts, it should be clear that our results are relative to the particular distributions in the population. We consider this an advantage because it allows for comparisons to be made across populations—for example, a group of experts performing at the level of the random model may be better than a group of nonexperts performing at the level of the best model. Such comparisons are certainly legitimate and can easily be investigated by our approach. The relativity of our results is also 169 Experts Nonexperts Figure 8. Distribution of expert and nonexpert judgments. useful from a psychological point of view. For example, Steiner (1966, 1972) has stated that actual group performance equals potential performance minus process losses. Because different groups will have different "potentials" (under given circumstances), it seems useful to define quality in terms of the upper limit of performance. However, if other populations are available, cross-comparisons can be made. 3. Given the same population, one could examine variables that might affect quality of performance (as defined by the baseline models). For example, consider that we wish to compare the performance of Delphi and face-to-face groups. This could be done experimentally, yet the question would remain as to how well the better group did (in an absolute sense). It might be that Delphi groups perform better than face-to-face groups, but the level of performance might be what one would expect by using the mean model. In this case, the superiority of one method over another would not be so impressive. Therefore, when comparisons are made between competing methods, the baseline models can be used to assess the "winners." 4. Finally, the models we proposed have several interesting implications for future research. Although we have only dealt with the quality of group judgment, one could use these models (appropriately modified) as representing how group judgment is made (independently of how accurate it is). Second, if groups do try to weight their better members, 170 H. J. EINHORN, R. M. HOGARTH, AND E. KLEMPNER it is an interesting empirical question as to the relationship of pi,x and group size. Third, our results have potential use in a normative sense. For example, consider that in a particular group, one member gives a judgment that is very discrepant from the judgments of the other members. The tendency of the majority may be to ignore the discrepant opinion (weight it zero). However, if the majority opinions were too high relative to xt, the inclusion of the discrepant opinion (if it was below xt) might improve the accuracy of the group judgment. In fact, there is the real possibility that equally weighting N "wrong" judgments could lead to the correct answer. Whether groups have any ability to make use of the statistical properties of their judgments is an interesting and important question that awaits further research.9 If groups are not able to make use of this information, mechanically combining individual judgments might be called for in order to improve judgmental accuracy. Our hope is that the above theoretical and methodological results will help to stimulate more research in the area of group judgment. Although psychologists have traditionally been mainly interested in how groups behave, we feel that more concern with the quality of judgment should lead to both theoretical and methodological insights that will bear on both the descriptive and normative aspects of judgment. 9 It is interesting to speculate whether groups are able to apply negative weights to individual judgments. It seems to us that this would be very difficult for a group to do. The possibility that groups only apply zero or positive weights makes the use of equal weighting strategies even more effective (see Einhorn & Hogarth, 1975). References Cramer, H. Mathematical methods of statistics. Princeton, N. J.: Princeton University Press, 1951. Dalkey, N. C. The Delphi method: An experimental study of group opinion (RM-5888-PR). Santa Monica, Calif.: The Rand Corporation, 1969. (a) Ualkey, N. An experimental study of group opinion: The Delphi method. Futures, 1969, 1, 408-426. (b) Dalkey, N., & Helmer, O. An experimental application of the Delphi method to the use of experts. Management Sciences, 1963, 9, 458^67. Dawes, R. M., & Corrigan, B. Linear models in decision making. Psychological Bulletin, 1974, 81, 95-106. Deutsch, M., & Gerard, H. B. A study of normative and informational social influences upon individual judgment. Journal of Abnormal and Social Psychology, 1955, 51, 629-636. Einhorn, H. J. Expert judgment: Some necessary conditions and an example. Journal of Applied Psychology, 1974, 59, 562-571. Einhorn, H. J., & Hogarth, R. M. Unit weighting schemes for decision making. Organizational Behavior and Human Performance, 1975, 13, 171-192. Gordon, K. A study of esthetic judgments. Journal of Experimental Psychology, 1923, 6, 36-43. Hogg, R. V., & Craig, A. T. Introduction to mathematical statistics (2nd ed.). New York: Macmillan 1965. Kidd, J. B. The utilization of subjective probabilities in production planning. Acta Psychologica, 1970, 34, 338-347. Klguman, S. F. Group and individual judgments for anticipated events. Journal of Social Psychology, 1947,2(5,21-28. Lorge, L, Fox, D., Davitz, J., & Brenner, M. A survey of studies contrasting the quality of group performance and individual performance, 1920-1957. Psychological Bulletin, 1958, 55, 337-372. Maier, N. R. F. Assets and liabilities in group problem solving: The need for an integrative function. Psychological Review, 1967, 74, 239-249. Sackman, H. Delphi assessment: Expert opinion, forecasting, and group process (R-1283-PR). Santa Monica, Calif.: The Rand Corporation, 1974. Schlaifer, R. Probability and statistics for business decisions. New York: McGraw-Hill, 1959. Slovic, P. From Shakespeare to Simon: Speculation—• and some evidence—about man's ability to process information. Oregon Research Institute Bulletin, 1972, 12 (12), 1-29. (a) Slovic, P. Psychological study of human judgment: Implications for investment decision-making. Journal of Finance, 1972, 27, 779-799. (b) Steiner, I. D. Models for inferring relationships between group size and potential group productivity. Behavioral Science, 1966, 11, 273-283. Steiner, I. D. Group process and productivity. New York: Academic Press, 1972. Steiner, I. D., & Rajaratnam, N. A model for the comparison of individual and group performance scores. Behavioral Science, 1961, 6, 142-147. Stroop, J. B. Is the judgment of the group better than that of the average member of the group? Journal of Experimental Psychology, 1932, 15, 550-560. Taylor, D. W. Problem solving by groups. Proceedings of the 14th International Congress of Psychology, 1954. Amsterdam: North-Holland Publishing, 1954. Tversky, A., & Kahneman, D. Judgment under uncertainty: Heuristics and biases. Science, 1974, 185, 1124-1131. Zajonc, R. B. A note on group judgments and group size. Human Relations, 1962, 15, 177-180. 171 QUALITY OF GROUP JUDGMENT Appendix A We wish to derive E(d') and var(d')- \X'N- B\f(X'N)dX'N. / (A4) + «> (Al) r -^ / f(u)du . JVNB This can be divided into two parts without the absolute operator (viz., when X'K < B and X'N > B). E(d') = r (B - Terms a and d are expressed in terms of cumulative normal distributions, that is, X'N)f(X'N)dX'N J-« and d = (A2, Terms b and c are the partial expectations of a normal distribution. For a unit normal distribution, they are (see Schlaifer, 1959, p. 300) g / -00 f(X'lf')dX'N - IB X'Nf(X'N)dX'N and -B (A3) . . . _ Because X tf is normally distributed with H = 0, (7 = 1/Vtf, we consider the variable u, Combining all terms yields _ E(d") = B[_2F(^NB) - 1] where « = ^NX'N and du = -\lNdX' y.^Since « is XV multiplied by a constant (V^V), it will be distributed normally with n = 0 and <r = 1 (i.e., VA/VJV). Therefore, the distribution of u is unit normal. Multiplying the end points of the integrals by ^N, Equation (A3) becomes, by variable transformation, E(d') = B + The variance of rf/ is __ /— V]V ^ ' 8iven bV var(d') = E(d')* - [E(<2')]2 (A6) . f(u)du N (A7) Therefore, var(<0 = ^ - (A8) (Appendix B on next page) 172 H. J. EINHORN, R. M. HOGARTH, AND E. KLEMPNER Appendix B We wish to derive the probability density function for the proportional model, fp(d'). Substituting Equation B4 into Equation B3 yields 2 ? -I- 2 (Bi) fp(dr) = j=oE TL/(\ 2M M—M_l, +2 N M /P(«O = E»-i pi,N!(d'i,N) , where X/60'|MX % )/(rf'i) (BS) and d' =_ - E JMJ Itf,<*' ^_ X + E / » W I ^ , r f ' % ) ] . (B6) >'-» (see Hogg & Craig, 1965, p. 173). Therefore, However, Ijt \ » 1/4 if \N—j//j/ \ I X \A%) (1 — "•%) ](."' l) • J 3—0 /DO\ \"A) and Let j = i — 1 and M = N — 1. Then Equation B2 becomes fd') = E \2(M~ j+ 1] (M)l M, . .. , _ , £, JJ6" I M' d %> ~ Md%. Therefore, . X (W(l - *)-</<*>. (B3) The binomial distribution of j successes in M trials, with probability of success, d'%, is given by X (rf'%) J '(l — d'%)M->. (B7) (B4) ~ f/ji \ fp(d') — [TV — (TV — l)rf'%] . (B8) TV + 1 Received October 29, 197S