Variance makes the difference
Patrick Simon and David Kitz Krämer
1 Introduction
In this letter, we would like to stress the importance of variances in statistics,
which – in the opinion of the authors – is sometimes overlooked. The following
simple example demonstrates that the averages of probability distributions alone
are not always sufficient to give a definite answer to a statistics-related problem
(frankly, they almost never are).
Consider the following situation. Two different groups, 100 persons in total
and 50 persons in each group, participate in an assessment test. It is known that
members of one group score, on average, 100 points in this test. Members of
the other group, on the other hand, achieve on average a higher result of 110
points. The test results of all persons are inspected together, and the persons
with the 10 best results are selected regardless of which group they belong to
(the best 10 percent). How many persons of the selected sample belong to group
one and how many to group two? One might expect that more persons of group
two belong to this sample – after all, they are on average better than the members
of group one, right?
As we will see, with only the minimum of information given above this question
cannot be answered. Essential for a quantitative answer is the knowledge of
the full probability distributions, p1(x) and p2(x), of the individual groups' test
results, x, or, at least, the widths (standard deviations) of the distributions, σ1
and σ2. That is, if we resort to a frequency distribution of test results that
depends only on the average and the width. Such a distribution is the Gaussian:1
    p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \bar{x})^2}{2\sigma^2} \right) .   (1)
x̄ is the mean of the distribution.
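As a quick numerical check, Eq. (1) can be evaluated directly. The following Python sketch is our own addition; the illustrative values x̄ = 100 and σ = 25 are simply those used for group one later in this letter:

```python
import math

def gaussian_pdf(x, mean, sigma):
    # Gaussian density of Eq. (1): peaks at x = mean with value
    # 1/sqrt(2*pi*sigma^2) and falls off symmetrically.
    return math.exp(-(x - mean) ** 2 / (2.0 * sigma ** 2)) \
        / math.sqrt(2.0 * math.pi * sigma ** 2)

peak = gaussian_pdf(100.0, 100.0, 25.0)       # density at the mean
one_sigma = gaussian_pdf(125.0, 100.0, 25.0)  # density one sigma above the mean
```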
2 The general solution
In this section, we give a quite general solution to the class of problems to
which the one of the foregoing section belongs. We will return to the concrete
example in the final section.
Say we have N different groups, each of which has its own probability
distribution (PD) of, say, test results x. We call these PDs pi(x), where i is a
group index ranging between 1 and N. Now, we randomly put together n1
persons from group i = 1, n2 persons from group i = 2, etc., into a new mixed
group of

    n_\text{total} = \sum_{i=1}^{N} n_i   (2)
persons.

1 This can at most be an approximation, because test results are usually positive,
whereas a Gaussian distribution is symmetric about its mean and stretches infinitely in both
directions.
To decide whether a person of group i belongs to the pbest percent best of
the mixed group we need to know the test result limit xbest which divides the
best from the rest. For that purpose, we have to work out the PD of the mixed
group, ptotal (x), because
    p_\text{best} = \int_{x_\text{best}}^{\infty} \mathrm{d}x\, p_\text{total}(x) = 1 - C_\text{total}(x_\text{best}) .   (3)
This means that the (normalised) area under the total test result distribution
from xbest up to the largest result (infinity) has to equal the fraction pbest.
This area contains the top results of the persons we seek to select. By
    C_\text{total}(x) \equiv \int_0^x \mathrm{d}x'\, p_\text{total}(x')   (4)
we denote the so-called cumulative distribution of the frequency distribution of
all test results. With this definition we can, formally, write down the result that
divides best from rest:
    x_\text{best} = C_\text{total}^{-1}(1 - p_\text{best}) .   (5)
This simply means that we have to set the limit xbest such that exactly the
fraction pbest of the total distribution of results lies beyond that limit.
The function C_\text{total}^{-1}(p) is the inverse cumulative distribution; it tells you the
range of test results, starting from zero, within which the fraction p of all results
can be found. For example, C_\text{total}^{-1}(0.5) defines the median of the total
distribution because it returns the test result below which half of the results are located.
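Since any cumulative distribution is monotonic, its inverse can always be found numerically even when no closed form exists. The following is a minimal sketch of our own (Python; the Gaussian CDF via math.erf and the parameters x̄ = 100, σ = 25 are illustrative assumptions, not prescribed by the text):

```python
import math

def gaussian_cdf(x, mean, sigma):
    # Cumulative distribution of a Gaussian, cf. Eq. (4) with the lower
    # integration limit extended to -infinity (see footnote 1).
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

def inverse_cdf(p, cdf, lo=-1e6, hi=1e6, tol=1e-9):
    # Bisection: shrink [lo, hi] until it brackets the x with cdf(x) = p.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# C^-1(0.5) is the median; for a Gaussian it coincides with the mean.
median = inverse_cdf(0.5, lambda x: gaussian_cdf(x, 100.0, 25.0))
```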
But what is now the total distribution? If the persons in the new mixed
group perform exactly the same way as in their individual groups2 (they are
statistically independent), then the PD of the mixed group is just the weighted
average of all individual PDs, namely
    p_\text{total}(x) = \sum_{i=1}^{N} \frac{n_i}{n_\text{total}}\, p_i(x) = \frac{1}{n_\text{total}} \sum_{i=1}^{N} n_i\, p_i(x) .   (6)
This is because a) the probability that one particular person in the mixed group
belongs to group i is ni/ntotal, and b) if they belong to group i, they have a test
result x with a likelihood of pi(x).
The cumulative distribution – the probability to obtain a result between zero
and x – is therefore:
    C_\text{total}(x) = \frac{1}{n_\text{total}} \sum_{i=1}^{N} n_i\, C_i(x) ,   (7)
where Ci (x) are the cumulative distributions of the individual groups,
    C_i(x) = \int_0^x \mathrm{d}x'\, p_i(x') .   (8)
2 For example, it does not matter whether we assemble all people in one room while they
are doing their test.
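Equations (6)–(8) translate directly into code. A sketch of our own (Python; the two Gaussian groups of 50 persons each are an illustrative assumption anticipating the example of the final section):

```python
import math

def gaussian_cdf(x, mean, sigma):
    # Individual cumulative distribution C_i(x), cf. Eq. (8).
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

def mixture_cdf(x, groups):
    # Eq. (7): weighted sum of the group CDFs, normalised by n_total.
    # `groups` is a list of (n_i, C_i) pairs.
    n_total = sum(n for n, _ in groups)
    return sum(n * cdf(x) for n, cdf in groups) / n_total

# Illustrative mixed group: two Gaussian groups of 50 persons each.
groups = [
    (50, lambda x: gaussian_cdf(x, 100.0, 25.0)),
    (50, lambda x: gaussian_cdf(x, 110.0, 20.0)),
]
```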
Unfortunately, under many circumstances it is quite hard or even impossible
to find the inverse cumulative distribution of Eq. (7) analytically. In these
cases we have to find an (approximate) numerical solution for the concrete
problem at hand. This is what has been done for the example in the final
section.
The actual question has not been answered yet. How many persons of group
i do we expect, on average (!), inside the top sample? This simply boils down
to the probability of finding a person a) belonging to group i and b) having a
test result beyond xbest. Following the reasoning of the previous paragraphs,
this number is
    n_i \left[ 1 - C_i(x_\text{best}) \right] .   (9)
This number in relation to the total number of persons in the top sample is
    \frac{n_i}{n_\text{total}\, p_\text{best}} \left[ 1 - C_i(x_\text{best}) \right] ,   (10)
where the expression in the denominator corresponds to the total size of the top
sample which we fixed initially.
We can see from the final results, Eqs. (9) and (10), that the answer to the
initially raised question does not simply depend on the averages of the PDs but
on the full shapes of these distributions, encoded in Ci(x).
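Once xbest is known, Eqs. (9) and (10) are each a one-liner. A sketch of our own (Python; the two group parameters and the threshold xbest = 130 are purely illustrative choices):

```python
import math

def gaussian_cdf(x, mean, sigma):
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

def expected_counts(x_best, groups):
    # Eq. (9): expected number of members of group i scoring beyond x_best.
    return [n * (1.0 - cdf(x_best)) for n, cdf in groups]

groups = [
    (50, lambda x: gaussian_cdf(x, 100.0, 25.0)),
    (50, lambda x: gaussian_cdf(x, 110.0, 20.0)),
]
counts = expected_counts(130.0, groups)  # illustrative threshold x_best = 130
# Eq. (10): fractions relative to the top-sample size n_total * p_best,
# which by construction equals sum(counts).
fractions = [c / sum(counts) for c in counts]
```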
3 Graphical solution
The result, exact but often not obtainable in closed form, can be pictured
graphically. We did this for one example with three different groups in Fig. 1.
To find the solution you have to proceed in three basic steps:
1. Draw the cumulative curve, ni[1 − Ci(x)], of each group in the same diagram. In words, this curve tells you how many individuals of a
particular group have (on average) scores better than x. In the figure, the
groups have in total 40 (one), 30 (two) and 30 (three) members. As a
consistency check: for the lowest possible x (here x = 0) the curve has to be
identical to the total number of members of the group, and the curve has to
decline (or stay constant) for increasingly larger x.

2. Now compute the sum of all individual curves. This yields the curve of the
mixed group, which does not discriminate between the individual group
members. This is expressed by Eq. (7). Again, at x = 0, or the lowest possible test result, you must find the size of the mixed group (here:
100).

3. Finally, decide how large the top sample should be (here: 9). Employing
the curve of the mixed group you can work out the lower result limit xbest
that has to be achieved by a member of the top sample (the intersection point
of #persons = 9 with the mixed curve). This corresponds to Eq. (5). At the
intersection points of x = xbest with the curves of the individual groups you
can read off the number of persons of each group contributing to the top
sample. This is analogous to Eq. (9).
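The three steps above can also be carried out numerically. The following Python sketch of our own mirrors the setup of Fig. 1 (group sizes 40/30/30 and the Gaussian parameters from the caption); the bisection root finding is our own choice, and a continuous calculation will generally not reproduce exactly the rounded integer counts quoted in the figure:

```python
import math

def gaussian_cdf(x, mean, sigma):
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

# Step 1: the groups of Fig. 1 as (n_i, C_i) pairs.
groups = [
    (40, lambda x: gaussian_cdf(x, 100.0, 25.0)),  # group one
    (30, lambda x: gaussian_cdf(x, 110.0, 20.0)),  # group two
    (30, lambda x: gaussian_cdf(x, 60.0, 40.0)),   # group three
]

# Step 2: expected number of persons in the mixed group scoring above x.
def persons_above(x):
    return sum(n * (1.0 - cdf(x)) for n, cdf in groups)

# Step 3: bisect for x_best such that 9 persons score above it,
# then read off the per-group contributions, Eq. (9).
target, lo, hi = 9.0, 0.0, 500.0
while hi - lo > 1e-9:
    mid = 0.5 * (lo + hi)
    if persons_above(mid) > target:
        lo = mid   # still too many persons above: move the limit up
    else:
        hi = mid
x_best = 0.5 * (lo + hi)
counts = [n * (1.0 - cdf(x_best)) for n, cdf in groups]
```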
Figure 1: Figure visualising how Eq. (9) works using one particular example
with three different groups containing 40, 30 and 30 persons, respectively. The
PDs are Gaussians with (x̄1 = 100, σ1 = 25), (x̄2 = 110, σ2 = 20) and (x̄3 =
60, σ3 = 40) (upper left panel for the cumulative distribution of the group
specific results; number of persons better than a given x). The mixed group is
thus made up of 100 persons (upper right panel for the cumulative distribution
of the mixed group; just the sums of the individuals). The nine best persons
have on average a score better than xbest = 133 (see lower left panel). The top
sample contains on average five persons of group one, three from group two and
one belonging to group three. See text for details.
Figure 2: PDs of two groups (solid and dashed lines) and the total PD (yellow
area in background) obtained by combining the two groups (50:50). The 10%
best results are located in the blue shaded area beyond about xbest ≈ 150; this
region is shown in more detail in the small inset panel in the upper right.
4 Example
Going back to the beginning, we can now give an answer for a particular example.
According to the discussion in the last section, a definitive answer can only be
given if the PDs of the results of both groups are known. We assume here that
group one has an average of x̄1 = 100 and a standard deviation of σ1 = 25,
while group two has x̄2 = 110 and σ2 = 20, hence a larger average but a somewhat
smaller spread. Both groups are modelled by a Gaussian distribution, Eq. (1).
We combine these two groups into a mixed group with a fraction of 50% of
group one persons and 50% group two persons (100 persons in total). The
individual PDs and the total PD can be seen in Fig. 2. If we now focus on
the 10% best (10 persons) in the total distribution we can see that the ratio
between group one and group two in the top sample is pretty balanced (50:50;
5 from group one, 5 from group two) even though there is, apparently, only a
slight difference in the variances of the distributions and even though group two
is clearly better than group one – on average. If we become even more restrictive
and choose the top sample to be the 1% best (one person: the single best person),
then it is even more likely to be a member of group one than of group two (65%
against 36%; in 65 out of 100 such tests the single selected candidate is somebody
from group one).
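For this two-group example, the expected composition of the top sample, Eqs. (9) and (10), can be computed directly. This sketch is our own (Python, continuous Gaussian model with math.erf); note that it yields expected counts at a fixed score threshold, whereas the 65%-versus-36% figures quoted above refer to the probability that the single best person belongs to each group, a related but different quantity:

```python
import math

def gaussian_cdf(x, mean, sigma):
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

groups = [(50, 100.0, 25.0), (50, 110.0, 20.0)]  # (n_i, mean, sigma)

def persons_above(x):
    # Expected number of persons in the mixed group scoring above x.
    return sum(n * (1.0 - gaussian_cdf(x, m, s)) for n, m, s in groups)

def x_best_for(top_size):
    # Bisect for the score limit x_best of a top sample of `top_size` persons.
    lo, hi = 0.0, 500.0
    while hi - lo > 1e-9:
        mid = 0.5 * (lo + hi)
        if persons_above(mid) > top_size:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Score limits for the top 10% (10 persons) and top 1% (1 person).
xb10 = x_best_for(10.0)
xb1 = x_best_for(1.0)
counts1 = [n * (1.0 - gaussian_cdf(xb1, m, s)) for n, m, s in groups]
```

Beyond the crossover score where both Gaussians have equal tails, the wider group one distribution dominates, which is why its contribution grows as the top sample shrinks.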
So, why is that? Is there some sort of discrimination at work? Should the
fraction of group two members, who on average score better in the test, not be
larger? In this example, the answer is no – and no conspiracy is happening.
The key is that group one has a larger spread in possible results (variance) than
group two. A larger variance not only means that we have a larger probability
to obtain worse results but also a better chance to get excellent results. This
is why we find more group one than group two persons in the extreme tail of
the mixed sample. Therefore a larger variance in a sample can make up for an
apparent inferiority due to a lower average.
This shows that, when using statistical arguments, it can be fatal to reduce
the properties of a sample to just the “average”.